#### Dealing with missing values in Python

**What is a missing value?**

* Missing values occur when no data value is stored for a variable (feature) in an observation.
* They can be represented as **"?"**, **"N/A"**, 0, or just a blank cell.
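
Because missing values can hide behind placeholders such as "?", it helps to tell pandas about them at load time. The sketch below uses a small made-up CSV (the rows are hypothetical, not taken from automobile1.csv) and the `na_values` parameter of `pd.read_csv`; pandas already treats "N/A" and blank cells as missing by default.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; "?" and "N/A" stand in for missing entries.
raw = io.StringIO(
    "make,normalized-losses,price\n"
    "audi,164,13950\n"
    "alfa-romero,?,16500\n"
    "volvo,95,N/A\n"
)

# na_values lists extra strings that pandas should read as NaN.
sample = pd.read_csv(raw, na_values=["?"])

# isna() marks missing cells; sum() counts them per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

A value of 0 is not covered here: whether 0 means "missing" depends on the column, so it should only be converted after checking with the data source.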

#### How to deal with missing data?

**Check** with the data collection source

**Drop** the missing values
* drop the variable
* drop the data entry

**Replace** the missing values
* replace it with an average (of similar data points)
* replace it by frequency (the most common value)
* replace it based on other functions

**Leave** it as missing data
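
Two of the replacement strategies above can be sketched as follows. This is only an illustration on a made-up dataframe: the rows are hypothetical, and treating cars of the same make as "similar data points" is an assumption, not a rule from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing category and one missing number.
toy = pd.DataFrame({
    "make": ["audi", "audi", "volvo", "volvo"],
    "num-of-doors": ["four", np.nan, "four", "two"],
    "horsepower": [102.0, np.nan, 114.0, 160.0],
})

# Replace by frequency: fill with the most common category.
most_common = toy["num-of-doors"].mode()[0]
toy["num-of-doors"] = toy["num-of-doors"].fillna(most_common)

# Replace with an average of similar data points:
# here "similar" is assumed to mean "same make".
group_mean = toy.groupby("make")["horsepower"].transform("mean")
toy["horsepower"] = toy["horsepower"].fillna(group_mean)

print(toy)
```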

##### How to drop missing values in Python

**Use** dataframe.dropna()
* axis = **0** (drops the entire row)
* axis = **1** (drops the entire column)
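
To see the difference between the two `axis` values, here is a minimal sketch on a hypothetical three-row dataframe (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the first row and the "make" column are complete.
toy = pd.DataFrame({
    "make": ["audi", "volvo", "bmw"],
    "normalized-losses": [164.0, np.nan, np.nan],
    "price": [13950.0, np.nan, 16430.0],
})

# axis=0 drops every row that contains at least one NaN.
rows_kept = toy.dropna(axis=0)

# axis=1 drops every column that contains at least one NaN.
cols_kept = toy.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

Neither call modifies `toy`; each returns a new dataframe.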

**Example**


In [2]:
import pandas as pd

In [8]:
df = pd.read_csv('automobile1.csv')
df

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845
197,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045
198,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485
199,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470


In [9]:
df.iloc[:, -1]

0      13495
1      16500
2      16500
3      13950
4      17450
       ...  
196    16845
197    19045
198    21485
199    22470
200    22625
Name: price, Length: 201, dtype: int64

In [10]:
df['price']

0      13495
1      16500
2      16500
3      13950
4      17450
       ...  
196    16845
197    19045
198    21485
199    22470
200    22625
Name: price, Length: 201, dtype: int64

**Use dataframe.dropna()**:

In [None]:
# inplace=True writes the result back into df instead of returning a new dataframe
df.dropna(subset=['price'], axis=0, inplace=True)
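
The `subset` argument limits the check to the listed columns, so rows with gaps elsewhere survive. The sketch below shows this on hypothetical data and uses reassignment instead of `inplace=True`, which is an alternative style rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing price, two missing normalized-losses.
toy = pd.DataFrame({
    "make": ["audi", "volvo", "bmw"],
    "normalized-losses": [np.nan, 95.0, np.nan],
    "price": [13950.0, np.nan, 16430.0],
})

# Only rows with a missing price are dropped; other NaN values are kept.
cleaned = toy.dropna(subset=["price"], axis=0)

# Dropping rows leaves gaps in the index, so renumber it from 0.
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```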

#### How to replace missing values in Python

**Use** dataframe.replace(missing_value, new_value):

In [None]:
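import numpy as np
# 'column_name' is a placeholder for the column you want to fix
mean = df['column_name'].mean()  # first calculate the mean (NaN values are skipped)
df['column_name'] = df['column_name'].replace(np.nan, mean)  # replace() returns a new Series, so assign it back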
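
As a self-contained check of the mean replacement, the sketch below runs it on a hypothetical column and compares it with `fillna`, which is the more common pandas idiom for the same task.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"normalized-losses": [164.0, np.nan, 95.0, np.nan]})

# mean() ignores NaN, so this is (164 + 95) / 2.
mean = toy["normalized-losses"].mean()

# Both calls return a new Series with NaN swapped for the mean.
replaced = toy["normalized-losses"].replace(np.nan, mean)
filled = toy["normalized-losses"].fillna(mean)

print(replaced)
```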