# Day 12 of 100 days of Data Science

A crowdsourced Data Science learning program by Mr. Sharan

---

## Pandas: Missing values and Handling them 

__First we will locate the missing values, including values that don't belong in their column.
Then we will handle them in one of two ways:__
- Remove them
- Fill them in

In [1]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv")
data.head(10)

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3,1,1000
1,100002000.0,197.0,LEXINGTON,N,3,1.5,--
2,100003000.0,,LEXINGTON,N,,1,850
3,100004000.0,201.0,BERKELEY,12,1,,700
4,,203.0,BERKELEY,Y,3,2,1600
5,100006000.0,207.0,BERKELEY,Y,,1,800
6,100007000.0,,WASHINGTON,,2,HURLEY,950
7,100008000.0,213.0,TREMONT,Y,1,1,
8,100009000.0,215.0,TREMONT,Y,na,2,1800


__Counting the number of null values in the dataset__

In [2]:
data.isna().sum()

PID             1
ST_NUM          2
ST_NAME         0
OWN_OCCUPIED    1
NUM_BEDROOMS    2
NUM_BATH        1
SQ_FT           1
dtype: int64

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PID           8 non-null      float64
 1   ST_NUM        7 non-null      float64
 2   ST_NAME       9 non-null      object 
 3   OWN_OCCUPIED  8 non-null      object 
 4   NUM_BEDROOMS  7 non-null      object 
 5   NUM_BATH      8 non-null      object 
 6   SQ_FT         8 non-null      object 
dtypes: float64(2), object(5)
memory usage: 632.0+ bytes
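
The counting step can also be tried without downloading anything. The sketch below uses a hypothetical three-column frame (the name `toy` and its values are made up for illustration, not taken from the property data) and counts missing entries per column, per row, and overall.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
    "c": [10, 20, 30],
})

per_column = toy.isna().sum()          # missing count in each column
per_row = toy.isna().sum(axis=1)       # missing count in each row
total = int(toy.isna().sum().sum())    # missing count in the whole frame
share = toy.isna().mean()              # fraction of missing values per column

print(per_column)
print(total)
```

The `share` series is often easier to read than raw counts when columns have different lengths of valid data.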


__But some of the missing values are not counted, because they are written as markers such as 'na' or '--' that pandas does not treat as NaN by default. So we pass a list of extra missing-value markers to read_csv, which converts them to NaN while loading.__

In [4]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
data = pd.read_csv("https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv",
                   na_values = missing_values)
data.head(10)

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,
2,100003000.0,,LEXINGTON,N,,1,850.0
3,100004000.0,201.0,BERKELEY,12,1.0,,700.0
4,,203.0,BERKELEY,Y,3.0,2,1600.0
5,100006000.0,207.0,BERKELEY,Y,,1,800.0
6,100007000.0,,WASHINGTON,,2.0,HURLEY,950.0
7,100008000.0,213.0,TREMONT,Y,1.0,1,
8,100009000.0,215.0,TREMONT,Y,,2,1800.0
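
To see the effect of `na_values` in isolation, here is a minimal sketch that reads a small inline CSV instead of the remote file. The CSV text and the names `csv_text`, `default` and `custom` are assumptions made for this example only.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for the remote file.
csv_text = """PID,NUM_BEDROOMS,SQ_FT
1,3,1000
2,na,--
3,n/a,850
"""

# Default parsing: "n/a" is already recognised, "na" and "--" are not.
default = pd.read_csv(io.StringIO(csv_text))

# The extra markers are added on top of the default ones.
custom = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

print(default.isna().sum())
print(custom.isna().sum())
```

Once the markers are recognised, the affected columns can be parsed as numbers instead of strings, which is why NUM_BEDROOMS and SQ_FT show values like 3.0 and 1000.0 in the output above.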


In [5]:
data['OWN_OCCUPIED'].isna()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
Name: OWN_OCCUPIED, dtype: bool

__Even though we have replaced the listed markers with NaN, some invalid values still remain. For example, in the OWN_OCCUPIED column the row at index 3 holds the number 12, which should also be replaced with NaN.__

In [6]:
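# Detect entries that parse as integers and replace them with NaN
import numpy as np

for idx, value in data['OWN_OCCUPIED'].items():
    try:
        int(value)
    except ValueError:
        # Not an integer (e.g. 'Y', 'N' or an existing NaN): leave it alone
        continue
    data.loc[idx, 'OWN_OCCUPIED'] = np.nan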

In [7]:
data['OWN_OCCUPIED'].isna()

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
Name: OWN_OCCUPIED, dtype: bool
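
The same clean-up can be written without an explicit loop. The following is a sketch of one possible alternative, not the method used above: it relies on `pd.to_numeric` and `Series.mask`, and the `owner` series is a hypothetical stand-in for the OWN_OCCUPIED column. Note that, unlike `int()`, this version also flags decimal strings such as "1.5".

```python
import numpy as np
import pandas as pd

# Hypothetical column mimicking OWN_OCCUPIED: Y/N flags, one stray number, one gap.
owner = pd.Series(["Y", "N", "12", np.nan, "Y"], dtype=object, name="OWN_OCCUPIED")

# Entries that parse as numbers stay non-NaN here; everything else becomes NaN.
parsed = pd.to_numeric(owner, errors="coerce")
is_number = parsed.notna()

# Blank out the numeric-looking entries and keep the rest unchanged.
cleaned = owner.mask(is_number)
print(cleaned.tolist())
```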

__Counting the NULL values in each column__

In [8]:
data.isna().sum()

PID             1
ST_NUM          2
ST_NAME         0
OWN_OCCUPIED    2
NUM_BEDROOMS    3
NUM_BATH        1
SQ_FT           2
dtype: int64

__The same goes for NUM_BATH, where one value is a string ('HURLEY') and the rest are numbers. Because the column was read as object, we convert it to a numeric dtype with pd.to_numeric. Passing errors='coerce' turns values that cannot be parsed into NaN:__

In [9]:
data['NUM_BATH'] = pd.to_numeric(data['NUM_BATH'],  errors='coerce')

In [10]:
data.isna().sum()

PID             1
ST_NUM          2
ST_NAME         0
OWN_OCCUPIED    2
NUM_BEDROOMS    3
NUM_BATH        2
SQ_FT           2
dtype: int64
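
As a small illustration of what `errors='coerce'` changes, the sketch below converts a hypothetical `baths` series (made-up values shaped like NUM_BATH) twice: once with coercion and once with the default behaviour, which raises on the first value it cannot parse.

```python
import pandas as pd

# Hypothetical column mimicking NUM_BATH as read from the CSV (object dtype).
baths = pd.Series(["1", "1.5", "HURLEY", "2"], dtype=object)

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric = pd.to_numeric(baths, errors="coerce")
print(numeric.dtype)  # a float dtype, since NaN and 1.5 both need floats

# The default, errors="raise", stops at the first bad value.
try:
    pd.to_numeric(baths)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Coercion is convenient, but it silently discards the original text, so it is worth checking the NaN count before and after, as the notebook does.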

__Now that we have found the missing values, it's time to handle them. We can either replace them or delete them.__

__Replacing the NULL values can be done in the following ways:__

- Replace by location
- Replace all with one value
- Replace using the median
- Replace using the previous value (pad)

__Location Based Replace__

In [11]:
# Location based replacement
data.loc[2,'ST_NUM'] = 125

__Replacing all the missing values with one number.__

In [12]:
# Replace missing values with a number
# (assigning the result back avoids the chained inplace call on a column)
data['ST_NUM'] = data['ST_NUM'].fillna(125)

__Replacing the missing numbers with the median of the column__

In [13]:
# Replace using median
median = data['NUM_BEDROOMS'].median()
data['NUM_BEDROOMS'] = data['NUM_BEDROOMS'].fillna(median)
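
The constant fill and the median fill can also be combined in one call by passing a dictionary to `DataFrame.fillna`. This is a sketch of an alternative to the per-column cells above. The `homes` frame is hypothetical, with column names borrowed from the notebook and made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names echo the notebook, values are made up.
homes = pd.DataFrame({
    "ST_NUM": [104.0, np.nan, 201.0],
    "NUM_BEDROOMS": [3.0, np.nan, 1.0],
})

# Map each column to its own fill value: a constant and a median.
fill_values = {
    "ST_NUM": 125,
    "NUM_BEDROOMS": homes["NUM_BEDROOMS"].median(),
}
filled = homes.fillna(fill_values)  # returns a new frame; homes is unchanged
print(filled)
```

Keeping the original frame untouched makes it easy to compare the result of different fill strategies side by side.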

__Replacing the missing values with the previous valid value in the column (the one above), using the pad (forward fill) method__

In [14]:
# ffill() is the current spelling of fillna(method="pad")
data['OWN_OCCUPIED'] = data['OWN_OCCUPIED'].ffill()

In [15]:
data.head(10)

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1.0,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,
2,100003000.0,125.0,LEXINGTON,N,2.5,1.0,850.0
3,100004000.0,201.0,BERKELEY,N,1.0,,700.0
4,,203.0,BERKELEY,Y,3.0,2.0,1600.0
5,100006000.0,207.0,BERKELEY,Y,2.5,1.0,800.0
6,100007000.0,125.0,WASHINGTON,Y,2.0,,950.0
7,100008000.0,213.0,TREMONT,Y,1.0,1.0,
8,100009000.0,215.0,TREMONT,Y,2.5,2.0,1800.0
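
Forward fill and backward fill differ only in direction, which matters at the edges of a column. The sketch below shows both on a hypothetical `flags` series (made-up values) with gaps at the start, in the middle, and at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical flags with a gap at the start, in the middle and at the end.
flags = pd.Series([np.nan, "Y", np.nan, "N", np.nan], dtype=object)

forward = flags.ffill()    # copy the previous valid value downward
backward = flags.bfill()   # copy the next valid value upward

print(forward.tolist())
print(backward.tolist())
```

A leading gap survives a forward fill and a trailing gap survives a backward fill, because there is no neighbouring value to copy from in that direction.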


In [16]:
median = data['PID'].median()
data['PID'] = data['PID'].fillna(median)
data.head(10)

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1.0,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,
2,100003000.0,125.0,LEXINGTON,N,2.5,1.0,850.0
3,100004000.0,201.0,BERKELEY,N,1.0,,700.0
4,100005000.0,203.0,BERKELEY,Y,3.0,2.0,1600.0
5,100006000.0,207.0,BERKELEY,Y,2.5,1.0,800.0
6,100007000.0,125.0,WASHINGTON,Y,2.0,,950.0
7,100008000.0,213.0,TREMONT,Y,1.0,1.0,
8,100009000.0,215.0,TREMONT,Y,2.5,2.0,1800.0


In [17]:
median = data['SQ_FT'].median()
data['SQ_FT'] = data['SQ_FT'].fillna(median)
data.head(10)

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1.0,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,950.0
2,100003000.0,125.0,LEXINGTON,N,2.5,1.0,850.0
3,100004000.0,201.0,BERKELEY,N,1.0,,700.0
4,100005000.0,203.0,BERKELEY,Y,3.0,2.0,1600.0
5,100006000.0,207.0,BERKELEY,Y,2.5,1.0,800.0
6,100007000.0,125.0,WASHINGTON,Y,2.0,,950.0
7,100008000.0,213.0,TREMONT,Y,1.0,1.0,950.0
8,100009000.0,215.0,TREMONT,Y,2.5,2.0,1800.0


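__Instead of replacing, we can also drop the rows or columns that contain NULL values. Passing axis=1 drops every column that still has at least one NULL value (here NUM_BATH). dropna returns a new DataFrame, so data itself is unchanged unless the result is assigned back.__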

In [18]:
data.dropna(axis=1, how='any')

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,950.0
2,100003000.0,125.0,LEXINGTON,N,2.5,850.0
3,100004000.0,201.0,BERKELEY,N,1.0,700.0
4,100005000.0,203.0,BERKELEY,Y,3.0,1600.0
5,100006000.0,207.0,BERKELEY,Y,2.5,800.0
6,100007000.0,125.0,WASHINGTON,Y,2.0,950.0
7,100008000.0,213.0,TREMONT,Y,1.0,950.0
8,100009000.0,215.0,TREMONT,Y,2.5,1800.0


__Note:__ axis=1 refers to columns and axis=0 refers to rows. The available fill methods are pad/ffill (forward fill) and bfill/backfill (backward fill).
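
To make the axis choice concrete, here is a minimal sketch on a hypothetical `frame` (made-up values) that drops rows, drops columns, and uses the `how` and `subset` parameters of `dropna` to narrow what gets removed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in columns "b" and "c".
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
})

drop_rows = frame.dropna(axis=0, how="any")        # keep rows with no NaN
drop_cols = frame.dropna(axis=1, how="any")        # keep columns with no NaN
drop_empty_rows = frame.dropna(axis=0, how="all")  # drop only fully empty rows
need_c = frame.dropna(subset=["c"])                # only check column "c"

print(drop_rows.index.tolist(), drop_cols.columns.tolist())
```

Dropping with how="any" can discard a lot of data on a small table like this one, so `subset` is a gentler option when only a few columns really matter.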

## Thank you!