Finding Missing Values

https://medium.com/analytics-vidhya/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

Handling Missing Values in a Data Frame


https://medium.com/analytics-vidhya/python-handling-missing-values-in-a-data-frame-4156dac4399

In [6]:
import pandas as pd
import numpy as np

## Seattle Airbnb Open Data

https://www.kaggle.com/datasets/airbnb/seattle?resource=download&select=reviews.csv

#### Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.

#### Content
The following Airbnb activity is included in this Seattle dataset:

Listings, including full descriptions and average review score

Reviews, including unique id for each reviewer and detailed comments

Calendar, including listing id and the price and availability for that day

#### Inspiration
Can you describe the vibe of each Seattle neighborhood using listing descriptions?
What are the busiest times of the year to visit Seattle? By how much do prices spike?
Is there a general upward trend of both new Airbnb listings and total Airbnb visitors to Seattle?

http://insideairbnb.com/get-the-data/

# Loading Manchester AirBnB data

In [7]:
df_neighbour = pd.read_csv("neighbourhoods.csv")
df_listing = pd.read_csv("listings.csv")
df_reviews = pd.read_csv("reviews.csv")

# Check data information

In [9]:
print(df_listing.dtypes.value_counts())
print(df_listing.info())

int64      8
object     6
float64    4
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3584 entries, 0 to 3583
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              3584 non-null   int64  
 1   name                            3584 non-null   object 
 2   host_id                         3584 non-null   int64  
 3   host_name                       3584 non-null   object 
 4   neighbourhood_group             3584 non-null   object 
 5   neighbourhood                   3584 non-null   object 
 6   latitude                        3584 non-null   float64
 7   longitude                       3584 non-null   float64
 8   room_type                       3584 non-null   object 
 9   price                           3584 non-null   int64  
 10  minimum_nights                  3584 non-null   int64  
 11  number_of_reviews               3584 non-nu

# Separating columns into numeric and categorical

In [8]:
num_vars = df_listing.columns[df_listing.dtypes != 'object']
cat_vars = df_listing.columns[df_listing.dtypes == 'object']
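
The same split can also be expressed with `select_dtypes`. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real listings file, so it only illustrates the call and says nothing about the actual data.

```python
import pandas as pd

# Toy frame standing in for df_listing (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [43.0, 75.0, 50.0],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})

# select_dtypes picks columns by dtype family instead of comparing against 'object'.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

print(list(num_cols))
print(list(cat_cols))
```

One difference worth keeping in mind: `include="number"` leaves out boolean and datetime columns, whereas `dtypes != 'object'` would treat them as numeric.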

# Checking the fraction of nulls in each column

In [9]:
df_listing[num_vars].isnull().sum().sort_values(ascending=False)/len(df_listing)

license                           1.000000
reviews_per_month                 0.173885
id                                0.000000
host_id                           0.000000
latitude                          0.000000
longitude                         0.000000
price                             0.000000
minimum_nights                    0.000000
number_of_reviews                 0.000000
calculated_host_listings_count    0.000000
availability_365                  0.000000
number_of_reviews_ltm             0.000000
dtype: float64

In [21]:
df_listing[cat_vars].isnull().sum().sort_values(ascending=False)/len(df_listing)

name                   0.0
host_name              0.0
neighbourhood_group    0.0
neighbourhood          0.0
room_type              0.0
last_review            0.0
dtype: float64
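
Dividing the null count by the number of rows is the same as taking the mean of the boolean null mask. A minimal sketch on a made-up frame (the column names mirror the listings table, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame: one fully-null column, one partly-null column, one complete column.
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],
    "reviews_per_month": [0.9, np.nan, 2.39, 0.49],
    "price": [43, 75, 50, 34],
})

# isnull() gives True/False per cell; the column mean of that mask is the null fraction.
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```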

# Dropping the column that is 100% null

In [22]:
df_listing = df_listing.drop(columns=['license'])
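
If the column names are not known in advance, `dropna(axis=1, how="all")` removes every column that contains only nulls. This is a sketch on invented data, and note that it would drop any all-null column, not just `license`.

```python
import numpy as np
import pandas as pd

# Toy frame with one column that is entirely null.
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan],
    "reviews_per_month": [0.9, np.nan, 2.39],
    "price": [43, 75, 50],
})

# how="all" drops a column only when every value in it is null.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```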

# Replacing NaNs in reviews_per_month with 0, as NaN just means no reviews yet

In [23]:
df_listing['reviews_per_month'] = df_listing['reviews_per_month'].fillna(0)

In [24]:
df_listing

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm
0,157612,New attic space/single & Dble room,757016,Margaret,Salford,Salford District,53.501140,-2.264290,Entire home/apt,43,2,119,2022-10-31,0.90,1,249,19
1,283495,En-suite room in detached house,1476718,Alison,Rochdale,Rochdale District,53.562590,-2.219450,Private room,75,100,10,2018-08-05,0.11,1,365,0
2,299194,Cosy Garden Chalet for all seasons,1542010,Minh,Stockport,Stockport District,53.376000,-2.044620,Entire home/apt,50,2,312,2022-11-14,2.39,1,270,19
3,310742,Nice room 10 minutes walk from town,1603652,Francisca,Manchester,Ancoats and Clayton,53.482510,-2.228020,Private room,34,180,65,2022-05-02,0.49,1,302,1
4,332580,**ELEGANT STAY** CENTRAL MANCHESTER,1694961,Manchester,Manchester,City Centre,53.478590,-2.231950,Private room,49,2,329,2022-12-21,2.63,3,0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4590,789262602588233720,Tren istasyonuna yakın Hotel,460316840,Baran,Manchester,City Centre,53.478170,-2.233160,Private room,200,1,0,1970-01-01,0.00,15,87,0
4591,789270829032698628,Lüks Hotel,460316840,Baran,Manchester,City Centre,53.477425,-2.244982,Private room,200,1,0,1970-01-01,0.00,15,92,0
4592,789657340423683461,Fully Furnished 2 Bedroom Flat,492680656,Omar Mohammed,Manchester,Gorton North,53.460915,-2.167947,Entire home/apt,100,7,0,1970-01-01,0.00,1,363,0
4593,789889418261117861,Excellent Cosy Double Bed Room,492726868,Farina,Stockport,Stockport District,53.403998,-2.156110,Private room,35,1,0,1970-01-01,0.00,1,365,0


# Replacing NaNs in the last_review date column with the Unix epoch (1970-01-01), so that missing values are still marked without breaking the column

In [25]:
df_listing['last_review'] = df_listing['last_review'].fillna('1970-01-01')
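
An alternative to a sentinel date, shown here only as a sketch on invented values, is to parse the column as datetimes and let pandas keep missing entries as `NaT`. The `has_review` flag is a hypothetical helper column, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with one missing review date.
toy = pd.DataFrame({"last_review": ["2022-10-31", np.nan, "2018-08-05"]})

# to_datetime converts the strings and turns NaN into NaT, so the column stays a datetime column.
toy["last_review"] = pd.to_datetime(toy["last_review"])

# Hypothetical indicator column recording whether a review date exists.
toy["has_review"] = toy["last_review"].notna()
print(toy)
```

With `NaT`, date arithmetic and min/max skip the missing rows instead of treating 1970-01-01 as a real review date.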

In [26]:
df_listing

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm
0,157612,New attic space/single & Dble room,757016,Margaret,Salford,Salford District,53.501140,-2.264290,Entire home/apt,43,2,119,2022-10-31,0.90,1,249,19
1,283495,En-suite room in detached house,1476718,Alison,Rochdale,Rochdale District,53.562590,-2.219450,Private room,75,100,10,2018-08-05,0.11,1,365,0
2,299194,Cosy Garden Chalet for all seasons,1542010,Minh,Stockport,Stockport District,53.376000,-2.044620,Entire home/apt,50,2,312,2022-11-14,2.39,1,270,19
3,310742,Nice room 10 minutes walk from town,1603652,Francisca,Manchester,Ancoats and Clayton,53.482510,-2.228020,Private room,34,180,65,2022-05-02,0.49,1,302,1
4,332580,**ELEGANT STAY** CENTRAL MANCHESTER,1694961,Manchester,Manchester,City Centre,53.478590,-2.231950,Private room,49,2,329,2022-12-21,2.63,3,0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4590,789262602588233720,Tren istasyonuna yakın Hotel,460316840,Baran,Manchester,City Centre,53.478170,-2.233160,Private room,200,1,0,1970-01-01,0.00,15,87,0
4591,789270829032698628,Lüks Hotel,460316840,Baran,Manchester,City Centre,53.477425,-2.244982,Private room,200,1,0,1970-01-01,0.00,15,92,0
4592,789657340423683461,Fully Furnished 2 Bedroom Flat,492680656,Omar Mohammed,Manchester,Gorton North,53.460915,-2.167947,Entire home/apt,100,7,0,1970-01-01,0.00,1,363,0
4593,789889418261117861,Excellent Cosy Double Bed Room,492726868,Farina,Stockport,Stockport District,53.403998,-2.156110,Private room,35,1,0,1970-01-01,0.00,1,365,0


# Now to do the same for the other tables

In [25]:
print(df_reviews.dtypes.value_counts())
print(df_reviews.info())

int64     1
object    1
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114145 entries, 0 to 114144
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   listing_id  114145 non-null  int64 
 1   date        114145 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.7+ MB
None


# The reviews table has no null values

In [27]:
df_reviews.isnull().sum().sort_values(ascending=False)/len(df_reviews)

listing_id    0.0
date          0.0
dtype: float64

# The neighbourhoods table has no null values either

In [28]:
print(df_neighbour.dtypes.value_counts())
print(df_neighbour.info())

object    2
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   neighbourhood_group  41 non-null     object
 1   neighbourhood        41 non-null     object
dtypes: object(2)
memory usage: 784.0+ bytes
None


In [29]:
df_neighbour

Unnamed: 0,neighbourhood_group,neighbourhood
0,Bolton,Bolton District
1,Bury,Bury District
2,Manchester,Ancoats and Clayton
3,Manchester,Ardwick
4,Manchester,Baguley
5,Manchester,Bradford
6,Manchester,Brooklands
7,Manchester,Burnage
8,Manchester,Charlestown
9,Manchester,Cheetham


In [30]:
df_listing

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm
0,157612,New attic space/single & Dble room,757016,Margaret,Salford,Salford District,53.50114,-2.26429,Entire home/apt,36,2,103,2022-03-18,0.84,1,275,7
1,283495,En-suite room in detached house,1476718,Alison,Rochdale,Rochdale District,53.56259,-2.21945,Private room,75,3,10,2018-08-05,0.12,1,277,0
2,299194,Cosy Garden Chalet for all seasons,1542010,Minh,Stockport,Stockport District,53.37600,-2.04462,Entire home/apt,50,2,296,2022-02-14,2.44,1,360,26
3,310742,Nice room 10 minutes walk from town,1603652,Francisca,Manchester,Ancoats and Clayton,53.48251,-2.22802,Private room,34,180,64,2018-01-12,0.52,1,175,0
4,332580,**ELEGANT STAY** CENTRAL MANCHESTER,1694961,Manchester,Manchester,City Centre,53.47859,-2.23195,Private room,49,2,324,2022-01-16,2.80,5,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3579,43323251,The Annexe - Private home in forest location.,119767658,Emily,Wigan,Wigan District,53.45970,-2.55510,Entire home/apt,195,2,84,2022-03-09,4.11,1,50,53
3580,50386579,Woodbank is an ideal haven for an active escape.,85860535,Monica,Rochdale,Rochdale District,53.67811,-2.08412,Entire home/apt,165,2,10,2021-11-03,1.14,2,69,10
3581,48601153,Unique Cottage with fabulous views,85860535,Monica,Rochdale,Rochdale District,53.68086,-2.08369,Entire home/apt,120,2,11,2022-01-14,1.36,2,86,11
3582,24438146,6 Star Lux Large House. Stunning views! Hot tub !,279141591,David,Rochdale,Rochdale District,53.67838,-2.08533,Entire home/apt,639,2,118,2022-02-27,2.53,1,249,42


# Which neighbourhood groups have the most Airbnb listings

In [27]:
df_listing.groupby(['neighbourhood_group']).size().sort_values(ascending=False)

neighbourhood_group
Manchester    2314
Salford        829
Trafford       399
Stockport      242
Oldham         189
Bolton         159
Bury           146
Wigan          126
Tameside       119
Rochdale        72
dtype: int64

# Average price in each neighbourhood group

In [29]:
df_listing.groupby(['neighbourhood_group']).agg({'price':'mean'}).sort_values(by='price',ascending=False)

Unnamed: 0_level_0,price
neighbourhood_group,Unnamed: 1_level_1
Bolton,396.842767
Manchester,164.687122
Salford,144.314837
Stockport,128.524793
Bury,124.958904
Trafford,121.090226
Oldham,119.62963
Rochdale,114.5
Wigan,106.126984
Tameside,96.680672


# Removing some columns which are not needed for analysis

In [31]:
df_listing_subset = df_listing.drop(columns=['id','host_id','name','latitude','longitude'])

In [32]:
df_listing_subset

Unnamed: 0,host_name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm
0,Margaret,Salford,Salford District,Entire home/apt,43,2,119,2022-10-31,0.90,1,249,19
1,Alison,Rochdale,Rochdale District,Private room,75,100,10,2018-08-05,0.11,1,365,0
2,Minh,Stockport,Stockport District,Entire home/apt,50,2,312,2022-11-14,2.39,1,270,19
3,Francisca,Manchester,Ancoats and Clayton,Private room,34,180,65,2022-05-02,0.49,1,302,1
4,Manchester,Manchester,City Centre,Private room,49,2,329,2022-12-21,2.63,3,0,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4590,Baran,Manchester,City Centre,Private room,200,1,0,1970-01-01,0.00,15,87,0
4591,Baran,Manchester,City Centre,Private room,200,1,0,1970-01-01,0.00,15,92,0
4592,Omar Mohammed,Manchester,Gorton North,Entire home/apt,100,7,0,1970-01-01,0.00,1,363,0
4593,Farina,Stockport,Stockport District,Private room,35,1,0,1970-01-01,0.00,1,365,0


# Average, minimum and maximum price for each room type

In [33]:
df_listing_subset.groupby(['room_type']).agg({'price':['mean','min','max']})

Unnamed: 0_level_0,price,price,price
Unnamed: 0_level_1,mean,min,max
room_type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Entire home/apt,185.116732,16,47271
Hotel room,93.833333,41,206
Private room,110.845397,11,5068
Shared room,45.965517,16,286
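
The maximum prices above are far from the means, so a single extreme listing can pull an average up considerably. As a sketch on invented numbers, named aggregation can report the median next to the mean. The output column names (`mean_price` and so on) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Toy prices: one extreme value in the "Entire home/apt" group.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
    "price": [30, 50, 100, 120, 5000],
})

# Named aggregation: each keyword becomes a flat output column.
summary = toy.groupby("room_type").agg(
    mean_price=("price", "mean"),
    median_price=("price", "median"),
    min_price=("price", "min"),
    max_price=("price", "max"),
)
print(summary)
```

In the toy data the mean for "Entire home/apt" is dominated by the 5000 entry while the median stays at 120.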


# More detailed breakdown of average, minimum and maximum price by neighbourhood and room type

In [34]:
group = df_listing_subset.groupby(['neighbourhood','room_type']).agg({'price':['mean','min','max']})
group

Unnamed: 0_level_0,Unnamed: 1_level_0,price,price,price
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max
neighbourhood,room_type,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Ancoats and Clayton,Entire home/apt,209.711409,48,3009
Ancoats and Clayton,Hotel room,206.000000,206,206
Ancoats and Clayton,Private room,217.541667,27,1570
Ardwick,Entire home/apt,216.627451,50,686
Ardwick,Private room,418.807692,25,1763
...,...,...,...,...
Withington,Entire home/apt,204.241379,54,700
Withington,Private room,61.000000,34,101
Withington,Shared room,49.000000,49,49
Woodhouse Park,Entire home/apt,115.444444,90,175


In [46]:
df_calendar = pd.read_csv("calendar.csv")
df_calendar.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,157612,2022-12-27,f,$42.00,$42.00,2,365
1,157612,2022-12-28,f,$42.00,$42.00,2,365
2,157612,2022-12-29,f,$42.00,$42.00,2,365
3,157612,2022-12-30,f,$45.00,$45.00,2,365
4,157612,2022-12-31,f,$45.00,$45.00,2,365
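
In the calendar table the price is stored as text such as `$42.00` and availability as `t`/`f`, so both need converting before any arithmetic. A sketch on made-up rows, assuming prices only ever contain a dollar sign and thousands commas (the `$1,200.00` row and the `price_num` and `is_available` column names are illustrative assumptions):

```python
import pandas as pd

# Toy rows in the same text format as calendar.csv.
toy = pd.DataFrame({
    "price": ["$42.00", "$45.00", "$1,200.00"],
    "available": ["f", "t", "f"],
})

# Strip "$" and "," then cast to float.
toy["price_num"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Map the t/f flags to booleans.
toy["is_available"] = toy["available"].map({"t": True, "f": False})
print(toy)
```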


In [47]:
df_listing_full = pd.read_csv("listings_full.csv")
df_listing_full.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,157612,https://www.airbnb.com/rooms/157612,20221227164051,2022-12-27,city scrape,New attic space/single & Dble room,"The loft space is a small but cosy, private an...",There is a public park within easy walking dis...,https://a0.muscache.com/pictures/18150828/c288...,757016,...,4.94,4.66,4.91,,f,1,1,0,0,0.9
1,6172921,https://www.airbnb.com/rooms/6172921,20221227164051,2022-12-27,previous scrape,"Beautiful big rm convenient Uni, City and Chri...",Bedroom with adjacent bathroom in a family hom...,Didsbury and West Didsbury are attractive and ...,https://a0.muscache.com/pictures/1f8319f0-0858...,32026858,...,4.98,4.9,4.93,,f,1,0,1,0,3.1
2,283495,https://www.airbnb.com/rooms/283495,20221227164051,2022-12-27,city scrape,En-suite room in detached house,<b>The space</b><br />Double bedroom with King...,The suburbaness of it all but 2 minutes from t...,https://a0.muscache.com/pictures/78775473/2d8f...,1476718,...,5.0,4.8,5.0,,f,1,0,1,0,0.11
3,6191116,https://www.airbnb.com/rooms/6191116,20221227164051,2022-12-27,city scrape,Pet Friendly Studio for 1 or 2 in Sale. Manche...,Our Executive Studios are perfect for 1 or 2 p...,We love our neighbourhood because it's quiet &...,https://a0.muscache.com/pictures/0333751a-db5a...,17136758,...,4.88,4.91,4.64,,f,10,10,0,0,0.77
4,6330967,https://www.airbnb.com/rooms/6330967,20221227164051,2022-12-27,city scrape,"Room with a view, very handy for MAN Airport",Large double bedroom with dressing room in 3 b...,Well served by shops and restaurants within ea...,https://a0.muscache.com/pictures/3fc9c337-f95d...,32954691,...,4.33,4.67,4.33,,f,3,1,2,0,0.03


In [48]:
df_listing_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4595 entries, 0 to 4594
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            4595 non-null   int64  
 1   listing_url                                   4595 non-null   object 
 2   scrape_id                                     4595 non-null   int64  
 3   last_scraped                                  4595 non-null   object 
 4   source                                        4595 non-null   object 
 5   name                                          4595 non-null   object 
 6   description                                   4508 non-null   object 
 7   neighborhood_overview                         2519 non-null   object 
 8   picture_url                                   4595 non-null   object 
 9   host_id                                       4595 non-null   i

In [57]:
num_vars = df_listing_full.columns[df_listing_full.dtypes != 'object']
cat_vars = df_listing_full.columns[df_listing_full.dtypes == 'object']

In [66]:
df_listing_full[num_vars].isnull().sum().sort_values(ascending=False)/len(df_listing_full)

id                                              0.0
availability_60                                 0.0
availability_365                                0.0
number_of_reviews                               0.0
number_of_reviews_ltm                           0.0
number_of_reviews_l30d                          0.0
review_scores_rating                            0.0
review_scores_accuracy                          0.0
review_scores_cleanliness                       0.0
review_scores_checkin                           0.0
review_scores_communication                     0.0
review_scores_location                          0.0
review_scores_value                             0.0
calculated_host_listings_count                  0.0
calculated_host_listings_count_entire_homes     0.0
calculated_host_listings_count_private_rooms    0.0
calculated_host_listings_count_shared_rooms     0.0
availability_90                                 0.0
availability_30                                 0.0
scrape_id   

In [68]:
df_listing_full[cat_vars].isnull().sum().sort_values(ascending=False)/len(df_listing_full)

last_review                     0.173885
first_review                    0.173885
host_response_time              0.103591
host_response_rate              0.103591
host_acceptance_rate            0.061806
description                     0.018934
bathrooms_text                  0.002176
listing_url                     0.000000
neighbourhood_group_cleansed    0.000000
property_type                   0.000000
room_type                       0.000000
amenities                       0.000000
host_identity_verified          0.000000
price                           0.000000
has_availability                0.000000
calendar_last_scraped           0.000000
neighbourhood_cleansed          0.000000
host_verifications              0.000000
host_has_profile_pic            0.000000
last_scraped                    0.000000
host_picture_url                0.000000
host_thumbnail_url              0.000000
host_is_superhost               0.000000
host_since                      0.000000
host_name       

In [55]:
df_listing_full = df_listing_full.drop(columns=['calendar_updated','license','bathrooms',
                              'host_neighbourhood','host_about','neighborhood_overview',
                              'neighbourhood','host_location'])

In [56]:
df_listing_full.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,picture_url,host_id,host_url,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,157612,https://www.airbnb.com/rooms/157612,20221227164051,2022-12-27,city scrape,New attic space/single & Dble room,"The loft space is a small but cosy, private an...",https://a0.muscache.com/pictures/18150828/c288...,757016,https://www.airbnb.com/users/show/757016,...,4.98,4.94,4.66,4.91,f,1,1,0,0,0.9
1,6172921,https://www.airbnb.com/rooms/6172921,20221227164051,2022-12-27,previous scrape,"Beautiful big rm convenient Uni, City and Chri...",Bedroom with adjacent bathroom in a family hom...,https://a0.muscache.com/pictures/1f8319f0-0858...,32026858,https://www.airbnb.com/users/show/32026858,...,4.97,4.98,4.9,4.93,f,1,0,1,0,3.1
2,283495,https://www.airbnb.com/rooms/283495,20221227164051,2022-12-27,city scrape,En-suite room in detached house,<b>The space</b><br />Double bedroom with King...,https://a0.muscache.com/pictures/78775473/2d8f...,1476718,https://www.airbnb.com/users/show/1476718,...,5.0,5.0,4.8,5.0,f,1,0,1,0,0.11
3,6191116,https://www.airbnb.com/rooms/6191116,20221227164051,2022-12-27,city scrape,Pet Friendly Studio for 1 or 2 in Sale. Manche...,Our Executive Studios are perfect for 1 or 2 p...,https://a0.muscache.com/pictures/0333751a-db5a...,17136758,https://www.airbnb.com/users/show/17136758,...,4.93,4.88,4.91,4.64,f,10,10,0,0,0.77
4,6330967,https://www.airbnb.com/rooms/6330967,20221227164051,2022-12-27,city scrape,"Room with a view, very handy for MAN Airport",Large double bedroom with dressing room in 3 b...,https://a0.muscache.com/pictures/3fc9c337-f95d...,32954691,https://www.airbnb.com/users/show/32954691,...,5.0,4.33,4.67,4.33,f,3,1,2,0,0.03


In [64]:
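fill_mean = lambda col: col.fillna(col.mean())
# mode() returns a Series, so take its first entry; an all-null column has no mode and is left unchanged
fill_mode = lambda col: col.fillna(col.mode().iloc[0]) if col.notna().any() else col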

In [65]:
df_listing_full[num_vars] = df_listing_full[num_vars].apply(fill_mean)
df_listing_full[cat_vars] = df_listing_full[cat_vars].apply(fill_mode)
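
`Series.mode()` returns a Series rather than a scalar, so passing it straight to `fillna` aligns on the index and fills only the rows whose index labels happen to appear in the mode result. The sketch below shows the difference on invented values. The column names are taken from the listings table, but the data is made up.

```python
import numpy as np
import pandas as pd

# Toy columns with one missing value each (at index 1).
scores = pd.Series([4.0, np.nan, 5.0], name="review_scores_rating")
times = pd.Series(["within an hour", np.nan, "within an hour"],
                  name="host_response_time")

# Mean imputation for a numeric column.
scores_filled = scores.fillna(scores.mean())

# Passing the mode Series directly: it has index 0 only, so index 1 stays NaN.
naive = times.fillna(times.mode())

# Taking the first mode as a scalar fills every missing row.
fixed = times.fillna(times.mode().iloc[0])

print(scores_filled.tolist())
print(naive.isna().sum(), fixed.isna().sum())
```

If a column has several equally frequent values, `mode()` returns them sorted and `.iloc[0]` picks the first. Whether that tie-break is acceptable depends on the analysis.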