Different Methods of Dealing with Missing Values:

- Delete the column with missing data
- Delete the row with missing data
- Fill the missing values (imputation)
    - Numerical data: use the mean, median, etc.
    - Categorical data: use the most frequent value (most_frequent), etc.
        - We can also introduce a new category for the NULL values
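
The options listed above can be sketched on a small frame before applying them to the real dataset. The `toy` DataFrame and its column names are made up for illustration; only the pandas calls (`dropna`, `fillna`, `median`, `mode`) are real API.

```python
import pandas as pd

# Hypothetical toy data: one complete column, one numeric and one
# categorical column with gaps.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rating": [4.0, None, 5.0, 3.0],
    "city": ["NYC", "NYC", None, "LA"],
})

dropped_cols = toy.dropna(axis=1)  # delete every column that has a gap
dropped_rows = toy.dropna(axis=0)  # delete every row that has a gap

# Imputation: median for the numeric column, most frequent value for the
# categorical one.
imputed = toy.copy()
imputed["rating"] = imputed["rating"].fillna(imputed["rating"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(dropped_cols.shape, dropped_rows.shape)
print(imputed)
```

Deleting is the simplest option but discards data; imputation keeps every row at the cost of inserting values that were never observed.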

#### Removing the column that has missing data


In [2]:

import pandas as pd
df = pd.read_csv('./AB_NYC_2019.csv')
df1 = df.copy()  # Work on an independent copy so the original DataFrame stays untouched
# axis=1 drops a column (axis=0 would drop rows); inplace=True modifies df1 directly instead of returning a new DataFrame
df1.drop("last_review", axis=1, inplace=True)
df1.isna().sum()  # Count the missing values left in each column


id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
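
Before dropping a column, it can help to check how much of it is actually missing. The snippet below is a minimal sketch on a made-up `toy` frame (not the Airbnb data). It also shows the non-inplace form `drop(columns=...)`, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy frame standing in for the listings data.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "last_review": ["2019-06-23", None, None, "2019-07-01"],
})

missing_share = toy.isna().mean()            # fraction of missing values per column
trimmed = toy.drop(columns=["last_review"])  # returns a new frame; toy is unchanged

print(missing_share)
print(list(trimmed.columns), list(toy.columns))
```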

#### Removing the rows that have missing data


In [3]:

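df2 = df.copy()
df2.isna().sum()          # Missing values per column before dropping (not displayed: a cell only shows its last expression)
df2.dropna(inplace=True)  # Drop every row that contains at least one missing value
df2.isna().sum()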


id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64
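
Dropping rows removes every row that has a gap in any column, which can discard far more data than expected. As a rough sketch on a made-up `toy` frame, the `subset` parameter of `dropna` limits the check to the columns that matter:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Cabin"],
    "reviews_per_month": [0.7, 2.1, None, None],
})

all_complete = toy.dropna()               # keep only rows with no gaps at all
named_only = toy.dropna(subset=["name"])  # only require "name" to be present

print(len(toy), len(all_complete), len(named_only))
```

Comparing the row counts before and after shows how much data each choice costs.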

#### Fill the missing values using IMPUTATION

In [5]:
df3 = df.copy()
df3.head()
mean_value = df3["reviews_per_month"].mean()  # Mean of the column; NaN values are skipped
# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas and may not update df3
df3["reviews_per_month"] = df3["reviews_per_month"].fillna(mean_value)
df3.isna().sum()


id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                     0
calculated_host_listings_count        0
availability_365                      0
dtype: int64
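
The mean is sensitive to outliers, so the median is sometimes the safer fill value for skewed columns. The sketch below uses invented numbers (not taken from the dataset) to show how far apart the two can be:

```python
import pandas as pd

# Hypothetical skewed column: one large outlier and one missing value.
s = pd.Series([1.0, 2.0, 3.0, 100.0, None])

mean_filled = s.fillna(s.mean())      # the gap becomes 26.5
median_filled = s.fillna(s.median())  # the gap becomes 2.5

print(mean_filled.tolist())
print(median_filled.tolist())
```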

#### Handling the Categorical Data

#### Method 1 - Using the most frequent value

In [7]:
df3.nunique()

id                                48895
name                              47905
host_id                           37457
host_name                         11452
neighbourhood_group                   5
neighbourhood                       221
latitude                          19048
longitude                         14718
room_type                             3
price                               674
minimum_nights                      109
number_of_reviews                   394
last_review                        1764
reviews_per_month                   938
calculated_host_listings_count       47
availability_365                    366
dtype: int64

In [17]:
modValue = df3['last_review'].value_counts()  # Count of each distinct value as a pandas Series, sorted from most to least frequent
print(type(modValue))
modValueMostOccured = modValue.index[0]  # The index label of the first entry is the most frequent value
print(modValueMostOccured)
# Fill the NULL values of 'last_review' with the most frequent value (the scalar, not the whole value_counts Series)
df3['last_review'] = df3['last_review'].fillna(modValueMostOccured)
# The steps above can be combined:
# df3['last_review'] = df3['last_review'].fillna(df3['last_review'].value_counts().index[0])
df3.isna().sum()


<class 'pandas.core.series.Series'>
2019-06-23


id                                 0
name                              16
host_id                            0
host_name                         21
neighbourhood_group                0
neighbourhood                      0
latitude                           0
longitude                          0
room_type                          0
price                              0
minimum_nights                     0
number_of_reviews                  0
last_review                        0
reviews_per_month                  0
calculated_host_listings_count     0
availability_365                   0
dtype: int64
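
An alternative way to find the most frequent value is `Series.mode()`, which ignores missing values by default and returns all tied modes in sorted order. A minimal sketch on a made-up series of dates:

```python
import pandas as pd

# Hypothetical review dates with one gap.
s = pd.Series(["2019-06-23", "2019-07-01", "2019-06-23", None])

most_frequent = s.mode()[0]      # first (here: only) mode of the column
filled = s.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```

Note that filling every gap with a single value inflates that value's share of the column, which matters when a large part of the column is missing.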

#### Method 2 - By making a new category

In [19]:
df4 = df.copy()
df4.isna().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [29]:
df4["last_review"] = df4["last_review"].fillna("Not_Reviewed")  # Mark missing review dates with a new category
count = (df4["last_review"] == "Not_Reviewed").sum()  # Count the rows that received the placeholder
print(count)
df4.sample(10)

10052


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12499 | 9612938 | Studio apt in Historic House Museum | 49736436 | George | Manhattan | Washington Heights | 40.83549 | -73.93968 | Entire home/apt | 120 | 3 | 10 | 2019-07-01 | 0.7 | 2 | 244 |
| 27395 | 21608855 | The Cozy Bushwick Modern (Room B) | 142625186 | J.R. | Brooklyn | Bushwick | 40.69104 | -73.91685 | Private room | 75 | 1 | 44 | 2019-06-19 | 2.16 | 2 | 0 |
| 20981 | 16617217 | LUXURY Flatiron Highrise 1 BR/1 BA | 107883536 | Tommy | Manhattan | Midtown | 40.74473 | -73.99005 | Entire home/apt | 159 | 2 | 0 | Not_Reviewed | 2.09 | 1 | 0 |
| 21920 | 17625824 | Private Room in Brooklyn Brownstone | 79882333 | Hallie | Brooklyn | Bedford-Stuyvesant | 40.68119 | -73.93722 | Private room | 39 | 2 | 2 | 2017-03-29 | 0.07 | 1 | 0 |
| 42316 | 32841707 | Fresh, Spacious 2-Bed with Garage Parking | 247064022 | Sunny | Queens | Flushing | 40.74551 | -73.82662 | Entire home/apt | 189 | 3 | 10 | 2019-07-01 | 2.68 | 1 | 51 |
| 18524 | 14594441 | Excellent Studio Space | 49562545 | Ella | Brooklyn | Williamsburg | 40.70623 | -73.94951 | Entire home/apt | 99 | 1 | 187 | 2019-06-23 | 5.43 | 2 | 39 |
| 8998 | 6910072 | Sunny Room in Bushwick & central AC | 36211685 | Vienna | Queens | Ridgewood | 40.70588 | -73.91464 | Private room | 85 | 1 | 1 | 2015-08-17 | 0.02 | 1 | 0 |
| 47282 | 35673061 | Luxury Central Park Apartment | 268352337 | Logan | Manhattan | Hell's Kitchen | 40.76374 | -73.98844 | Entire home/apt | 450 | 1 | 1 | 2019-06-27 | 1.0 | 2 | 360 |
| 35085 | 27817851 | A Nice Room For Unique people | 152246149 | Catherine | Bronx | Throgs Neck | 40.82952 | -73.82818 | Private room | 140 | 1 | 31 | 2019-06-09 | 3.01 | 5 | 365 |
| 34937 | 27700366 | Huge Apartment Midtown Manhattan Empire State ... | 89332421 | Mohit | Manhattan | Kips Bay | 40.74462 | -73.97838 | Entire home/apt | 187 | 3 | 4 | 2019-03-25 | 0.53 | 1 | 0 |


#### Method 3 - Advanced Imputation

In [31]:
df5 = df.copy()

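# Using interpolate: each gap is filled from the neighbouring rows (linear interpolation by default)

# Assign the result back instead of calling interpolate(inplace=True) on a selected column
df5['reviews_per_month'] = df5['reviews_per_month'].interpolate()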
df5.isna().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                     0
calculated_host_listings_count        0
availability_365                      0
dtype: int64
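
To see what `interpolate` does with its default settings, here is a small sketch on an invented series. Values between two known points are filled linearly, treating rows as equally spaced. A gap at the very start has no earlier neighbour and stays missing.

```python
import pandas as pd

# Hypothetical series with a leading gap and a two-value gap in the middle.
s = pd.Series([None, 1.0, None, None, 4.0])

result = s.interpolate()  # default method is linear

print(result.tolist())
```

Because the fill depends on neighbouring rows, interpolation is most meaningful when the row order carries information (for example, a time series). For rows in arbitrary order the filled values are essentially averages of unrelated neighbours.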