#### In data-related research, it is important to handle missing data, either by deleting it or by imputation (replacing the missing values with estimates).

## Different Methods of Dealing With Missing Data

1. Deleting the column with missing data

2. Deleting the row with missing data

3. Filling the missing values (imputation)

   (i) Numerical data: use the mean

   (ii) Categorical data: use the mode, or assign the NaN values their own category

4. Advanced imputation

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("AB_NYC_2019.csv")

In [3]:
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,19-10-2018,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,21-05-2019,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,05-07-2019,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,19-11-2018,0.1,1,0


In [4]:
data.shape

(48906, 16)

In [7]:
data.isna().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
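
Raw counts like these are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not part of the Airbnb file):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "price": [100, 150, 80, 90],
    "reviews_per_month": [0.2, np.nan, np.nan, 1.5],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True,
# so multiplying by 100 gives the percentage of missing values per column.
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

On the real data, the same expression applied to `data` would show what percentage of each column is missing.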

## Removing a Column


In [8]:
df1 = data.drop("last_review", axis=1)

## The result has one fewer column

In [9]:
df1.shape

(48906, 15)

In [10]:
df1.isna().sum()


id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
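
Rather than naming the column by hand, the choice can be driven by how much of each column is missing. This is only a sketch: the `toy` frame and the 0.5 cutoff are hypothetical and would need to be chosen for the real data.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with one mostly-empty column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "last_review": [None, "19-10-2018", None, None],
    "reviews_per_month": [np.nan, 0.21, 1.0, 2.0],
})

# Hypothetical cutoff: drop columns where more than half the values are missing.
threshold = 0.5
missing_share = toy.isna().mean()
sparse_cols = missing_share[missing_share > threshold].index

# drop() returns a new DataFrame; toy itself is left unchanged.
reduced = toy.drop(columns=sparse_cols)
print(list(reduced.columns))
```

Here `last_review` is 75% missing and is dropped, while `reviews_per_month` (25% missing) is kept.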

## Removing Rows with Missing Values

In [11]:
df2 = data.dropna()

In [12]:
df2.shape

(38832, 16)

## Insights:

The dataset originally had 48,906 rows.
After removing rows with missing values, we now have 38,832 rows, so 10,074 rows (about 20.6%) were dropped.

## Why?

Removing rows works only if the percentage of missing values is small.
If too many rows are removed, it may lead to loss of important data.
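
One way to lose fewer rows is to drop only those that are missing a value in the columns that matter for the analysis. A minimal sketch with made-up data; which columns count as required is an assumption made here for illustration:

```python
import pandas as pd

# Toy data (made up): gaps in both 'name' and 'last_review'.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "last_review": [None, "19-10-2018", None, "05-07-2019"],
    "price": [100, 150, 80, 90],
})

# Drop every row that has a missing value in any column.
all_complete = toy.dropna()

# Drop only rows where 'name' is missing; gaps in other columns are tolerated.
name_required = toy.dropna(subset=["name"])

print(len(all_complete), len(name_required))
```

With this toy frame, the plain `dropna()` keeps one row, while restricting the check to `name` keeps three.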

## Filling Missing Values (Imputation)
Instead of deleting data, we can replace missing values using different strategies.

## Handling Numerical Data (Using Mean)
For numerical columns, mean imputation is commonly used: each missing value is replaced by the mean of the non-missing values in that column.

In [16]:
df3 = pd.read_csv("AB_NYC_2019.csv")


In [19]:
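mean_value = df3['reviews_per_month'].mean()
# fillna returns a new Series, so assign it back to keep the change in df3
df3['reviews_per_month'] = df3['reviews_per_month'].fillna(mean_value)
df3['reviews_per_month']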

0        0.210000
1        0.380000
2        1.373151
3        4.640000
4        0.100000
           ...   
48901    1.500000
48902    1.340000
48903    0.910000
48904    0.220000
48905    1.200000
Name: reviews_per_month, Length: 48906, dtype: float64

In [20]:
df3.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,19-10-2018,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,21-05-2019,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,1.373151,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,05-07-2019,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,19-11-2018,0.1,1,0
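
A point that is easy to miss with mean imputation in pandas is that `fillna` returns a new Series and leaves the original untouched unless the result is assigned back. A small sketch on made-up values (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"reviews_per_month": [1.0, np.nan, 3.0]})

# mean() skips NaN by default, so this is (1.0 + 3.0) / 2 = 2.0.
mean_value = toy["reviews_per_month"].mean()

# fillna returns a new Series; toy is not modified by this line.
filled = toy["reviews_per_month"].fillna(mean_value)
still_missing = toy["reviews_per_month"].isna().sum()

# Assigning the result back is what makes the change stick.
toy["reviews_per_month"] = filled
now_missing = toy["reviews_per_month"].isna().sum()

print(still_missing, now_missing)
```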


## Handling Categorical Data (Using Mode)
For categorical data, we use the most frequently occurring value (mode).

In [23]:
df3["last_review"] = df3["last_review"].fillna(df3["last_review"].value_counts().index[0])
df3["last_review"]

0        19-10-2018
1        21-05-2019
2        23-06-2019
3        05-07-2019
4        19-11-2018
            ...    
48901    23-06-2019
48902    24-06-2019
48903    05-07-2019
48904    31-10-2018
48905    29-06-2019
Name: last_review, Length: 48906, dtype: object

Now, 'last_review' has no missing values: each gap was filled with the most frequent review date (23-06-2019 in the output above).
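
The most frequent value can also be obtained with `Series.mode()`, which ignores missing values by default. A minimal sketch on a made-up Series (the dates are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.Series(["23-06-2019", None, "19-10-2018", "23-06-2019", None])

# mode() returns a Series because ties are possible; take the first entry.
most_common = toy.mode()[0]

# Replace every missing entry with the most frequent value.
filled = toy.fillna(most_common)
print(filled.tolist())
```

If several values tie for most frequent, `mode()` returns all of them in sorted order, so taking `[0]` is an arbitrary but deterministic choice.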

## Assigning a New Category for Categorical Data
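Instead of using the mode, we can introduce a new category such as "Not Reviewed". Because 'last_review' was already filled with the mode in the previous step, the output below contains no "Not Reviewed" entries; run this on a freshly loaded DataFrame to see the new category.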

In [27]:
df3["last_review"].fillna("Not Reviewed")

0        19-10-2018
1        21-05-2019
2        23-06-2019
3        05-07-2019
4        19-11-2018
            ...    
48901    23-06-2019
48902    24-06-2019
48903    05-07-2019
48904    31-10-2018
48905    29-06-2019
Name: last_review, Length: 48906, dtype: object

In [28]:
df3.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,19-10-2018,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,21-05-2019,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,23-06-2019,1.373151,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,05-07-2019,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,19-11-2018,0.1,1,0
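
To see the new category take effect, the fill has to be applied to a column that still contains missing values. A small sketch on a made-up Series (the values are assumptions for illustration):

```python
import pandas as pd

toy = pd.Series(["19-10-2018", None, "05-07-2019", None], name="last_review")

# Missing entries become their own explicit category instead of a guessed date.
labelled = toy.fillna("Not Reviewed")

# Count how many entries received the new label.
n_not_reviewed = (labelled == "Not Reviewed").sum()
print(labelled.tolist(), n_not_reviewed)
```

Unlike mode imputation, this keeps the information that the value was missing in the first place.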


## Advanced Imputation
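More advanced methods, such as interpolation, estimate each missing value from its neighbours. By default, pandas `interpolate()` fills gaps linearly between the surrounding values, which suits ordered data such as time series. Here 'reviews_per_month' was already filled with the mean, so the output below is unchanged.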

In [30]:
df3["reviews_per_month"].interpolate()

0        0.210000
1        0.380000
2        1.373151
3        4.640000
4        0.100000
           ...   
48901    1.500000
48902    1.340000
48903    0.910000
48904    0.220000
48905    1.200000
Name: reviews_per_month, Length: 48906, dtype: float64
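
The effect of interpolation is easier to see on a short Series that still has gaps. A minimal sketch with made-up values, using the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Linear interpolation spaces the missing values evenly between the
# nearest known values on either side of each gap.
interp = s.interpolate()
print(interp.tolist())
```

The single gap between 1.0 and 3.0 becomes 2.0, and the two-value gap between 3.0 and 6.0 becomes 4.0 and 5.0.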