**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Importing the data](#toc1_2_)    
- [Missing values and How to handle them](#toc2_)    
  - [_Handling missing values_](#toc2_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas Series objects @ https://pandas.pydata.org/pandas-docs/stable/reference/series.html**

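**`Note:`** Many Python built-in functions and operators work on pandas Series objects, e.g. **len, type, dir, sum, sorted, max, min**. Be aware that **in** tests membership against the index labels, not the values. Aggregations such as **mean** and **product** are not Python built-ins; they are available as Series methods (`.mean()`, `.prod()`).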

Also, **chaining functions/methods** works in pandas much as it does in Python: most Series methods return a new object, so calls can be strung together.
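As a minimal sketch of the two notes above, the toy Series `s` below (hypothetical, not taken from the vehicles dataset) shows a few built-ins applied to a Series, the behaviour of `in`, and a short method chain.

```python
import pandas as pd

# Toy Series with string index labels (illustrative data only)
s = pd.Series([3, 1, 2], index=["a", "b", "c"])

n = len(s)            # number of elements
total = sum(s)        # built-in sum iterates over the values
largest = max(s)      # built-in max over the values
ordered = sorted(s)   # returns a plain Python list

# `in` checks the index labels, not the values
has_label = "a" in s             # True: "a" is an index label
has_value = 3 in s               # False: 3 is a value, not a label
value_present = 3 in s.values    # True: check the underlying array instead

# Method chaining: each call returns a new Series (or a scalar at the end)
chained = s.sort_values().head(2).sum()  # two smallest values added together
```

Checking `3 in s.values` (or `s.isin([3]).any()`) is the usual way to test for a value rather than a label.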

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

One of the datasets used for the examples in this notebook is `Data/vehicles.csv.zip`.

In [2]:
# read the vehicles.csv dataset
df = pd.read_csv("Data/vehicles.csv.zip")

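Columns of a DataFrame can be accessed in several ways. One of them is **dot notation** (`df.column_name`), which works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.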

In [3]:
# the city08 and highway08 columns of the vehicles dataset give the
# miles per gallon for city and highway driving respectively.
city_mpg = df.city08
highway_mpg = df.highway08

In [4]:
# The make column in the vehicles dataset holds the manufacturer name (strings) and is stored with the object dtype.
manufac = df.make

**Note:** The first thing we should do after loading a dataset is check the datatype of each column and cast columns to more suitable datatypes where possible. This saves memory and can speed up code execution.
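To illustrate the note above, here is a small sketch on a made-up DataFrame (`toy` and its contents are assumptions, not the vehicles data). It compares memory usage before and after casting an integer column to a narrower type and a repetitive string column to `category`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: small integers and a handful of repeated strings
toy = pd.DataFrame({
    "mpg": np.array([19, 23, 31, 27] * 250, dtype="int64"),
    "make": ["Ford", "Tesla", "BMW", "Ford"] * 250,
})

# deep=True also counts the memory held by the Python string objects
before = toy.memory_usage(deep=True).sum()

# All mpg values fit in the int8 range (-128..127)
toy["mpg"] = toy["mpg"].astype("int8")
# Few distinct values: category stores each string once plus small integer codes
toy["make"] = toy["make"].astype("category")

after = toy.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

Whether a narrower type is safe depends on the actual value range, so it is worth checking `.min()` and `.max()` before casting.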

------------------------------------

## <a id='toc2_'></a>[Missing values and How to handle them](#toc0_)

------------------------------

In [5]:
# The cylinders column has missing values
cylinders = df.cylinders

> **Counting the total number of missing values**

The `Series.isna()` method returns a boolean Series that is True wherever a value is missing. Because the **sum()** method treats **True as 1 and False as 0**, summing that boolean Series counts the missing values.

In [6]:
# the Series.isna() method detects missing values
cylinders.isna().sum()

206
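The same True-as-1 behaviour can be seen on a tiny, made-up Series (the values below are illustrative only). Taking the mean of the mask instead of the sum gives the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries out of four
toy = pd.Series([4.0, np.nan, 6.0, np.nan])

mask = toy.isna()            # boolean Series: False, True, False, True
n_missing = mask.sum()       # True counts as 1, so this is the number of NaNs
frac_missing = mask.mean()   # share of entries that are missing
```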

> **Let's see for which manufacturers the cylinders value is missing**

In [7]:
# Boolean mask for missing values
mask_missing = cylinders.isna()

In [8]:
# value_counts() is used to count unique values
manufac[mask_missing].value_counts().head(5)

make
Tesla     74
smart     16
Ford      15
Nissan    14
BMW       10
Name: count, dtype: int64

Here we can see that Tesla accounts for the largest share of the cars with missing cylinders values (74 of 206). Since Teslas are electric cars, this makes sense. **Note:** An alternative is to use the `loc` indexer with the same boolean mask instead of indexing the Series directly.
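A minimal sketch of the `loc` alternative, using a small stand-in DataFrame (`toy` is hypothetical; with the real data you would use `df` in its place): the boolean mask selects the rows and the column label selects `make` in one step.

```python
import numpy as np
import pandas as pd

# Stand-in for the vehicles DataFrame (illustrative values only)
toy = pd.DataFrame({
    "make": ["Tesla", "Ford", "Tesla", "BMW"],
    "cylinders": [np.nan, 4.0, np.nan, 6.0],
})

# Rows where cylinders is missing, restricted to the make column
missing_makes = toy.loc[toy["cylinders"].isna(), "make"]

# Count how often each manufacturer appears among those rows
counts = missing_makes.value_counts()
```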

Now let's discuss how to handle these missing values.

### <a id='toc2_1_'></a>[_Handling missing values_](#toc0_)

- The `.fillna()` method allows you to specify a replacement value for any missing data

In [9]:
cylinders.fillna(0).value_counts()  # Doesn't change in place

cylinders
4.0     15938
6.0     14284
8.0      8801
5.0       771
12.0      626
3.0       279
0.0       206
10.0      170
2.0        59
16.0       10
Name: count, dtype: int64

**Note that** in this case it was reasonable to replace the NaN values with 0, but this is not always the case. In other scenarios, filling with the **.mean(), .median()** or **.mode()** of the column may come in handy.
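As a sketch of those alternatives, the made-up Series below is filled with its own mean, median, and mode. Note that `.mode()` returns a Series (there can be ties), so the first entry is taken here; that choice is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s = pd.Series([4.0, 4.0, 6.0, np.nan, 8.0])

filled_mean = s.fillna(s.mean())            # NaN -> 5.5, the mean of 4, 4, 6, 8
filled_median = s.fillna(s.median())        # NaN -> 5.0, the median of 4, 4, 6, 8
filled_mode = s.fillna(s.mode().iloc[0])    # NaN -> 4.0, the most frequent value
```

All three calls return new Series; `s` itself still contains the NaN.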

- The `.dropna()` method drops the entries (i.e., rows) with missing values

In [10]:
# the difference of 206 matches the number of NaN rows that were dropped
len(cylinders) - len(cylinders.dropna())

206
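One detail worth knowing, sketched here on a made-up Series: `.dropna()` keeps the original index labels of the surviving rows, so the index has gaps afterwards. If consecutive labels are wanted, `reset_index(drop=True)` is one way to renumber.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with missing values at positions 1 and 3
s = pd.Series([4.0, np.nan, 6.0, np.nan, 8.0])

dropped = s.dropna()                          # keeps labels 0, 2, 4
kept_labels = list(dropped.index)

renumbered = dropped.reset_index(drop=True)   # labels become 0, 1, 2
```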

- The `.interpolate()` method replaces NaN with a value interpolated from the values around the missing one (linear interpolation by default).

This is a very useful way to fill in missing values. For example, it comes in handy when the data is ordered (as time series data often is) and has holes in it. However, you have to make sure that the data follows a trend that interpolation can capture; otherwise, it may lead to disastrous results.

In [11]:
# say we have a series that captures the [somewhat] upward trend in temperature as the season moves from winter into summer
temp = pd.Series([19, 20, 20, None, 22, 24, 23, 24])
temp.interpolate()  # doesn't change in place

0    19.0
1    20.0
2    20.0
3    21.0
4    22.0
5    24.0
6    23.0
7    24.0
dtype: float64
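The default linear method treats the values as equally spaced and ignores the index. For unevenly spaced time series, `method="time"` takes the gaps between timestamps into account. The sketch below uses made-up dates and readings to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the gap after the missing value is three days long
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
readings = pd.Series([10.0, np.nan, 50.0], index=dates)

# Default: treats points as equally spaced, so the NaN becomes the midpoint
linear = readings.interpolate()

# Time-aware: Jan 2 is one quarter of the way from Jan 1 to Jan 5
by_time = readings.interpolate(method="time")

print(linear.iloc[1], by_time.iloc[1])
```

Which of the two is appropriate depends on whether the spacing of the observations carries meaning for the quantity being measured.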