# Data cleaning
This notebook cleans the listings data and creates features that may influence the price, in preparation for later data analysis and modeling.
It is done in the following steps:
1. Drop useless features.<br><br>
2. Fill missing data.<br><br>
3. Create new features.<br><br>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Gathering the data

In [2]:
listings = pd.read_csv('Seattle_Airbnb_Open_Data/listings.csv')
print('The shape of listings data:', listings.shape)
listings.head()

The shape of listings data: (3818, 92)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


## Cleaning the data

### Step 1: Drop useless features

In [3]:
del_features = ['listing_url', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_url', 'host_thumbnail_url', 'host_picture_url',
                'id', 'name', 'summary', 'description', 'latitude', 'longitude', 'space', 'neighborhood_overview', 'notes', 'transit', 'host_id', 
                'host_name', 'host_location', 'host_about', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 
                'smart_location', 'calendar_updated', 'first_review', 'last_review', 'street', 'monthly_price', 'weekly_price', 
                'calculated_host_listings_count']
data = listings.drop(del_features, axis=1)
print(data.shape)

(3818, 58)


* Look at the fraction of missing values in each feature, and drop the features in which more than 90% of the values are missing.

In [4]:
data.isnull().mean().sort_values()[::-1][:10]

license                      1.000000
square_feet                  0.974594
security_deposit             0.511262
cleaning_fee                 0.269775
host_acceptance_rate         0.202462
review_scores_accuracy       0.172342
review_scores_checkin        0.172342
review_scores_value          0.171818
review_scores_location       0.171556
review_scores_cleanliness    0.171032
dtype: float64

In [5]:
data = data.drop(['license', 'square_feet'], axis=1)
print(data.shape)

(3818, 56)
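
The two columns dropped above were read off the printed table. As a minimal sketch, the same selection can be derived from the 90% threshold directly. The toy frame and the names `threshold` and `mostly_missing` are illustrative assumptions, not part of the listings data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (all values are made up).
n = 20
toy = pd.DataFrame({
    'license': [np.nan] * n,                         # 100% missing
    'square_feet': [np.nan] * (n - 1) + [850.0],     # 95% missing
    'cleaning_fee': [np.nan] * 5 + [25.0] * (n - 5), # 25% missing
    'price': [100.0] * n,                            # complete
})

threshold = 0.9  # drop columns with more than 90% missing values

# isnull().mean() gives the fraction of missing values per column.
missing_frac = toy.isnull().mean()
mostly_missing = missing_frac[missing_frac > threshold].index.tolist()

reduced = toy.drop(columns=mostly_missing)
print(mostly_missing)
print(reduced.shape)
```

Deriving the list this way keeps the drop in sync with the threshold if the data is re-scraped.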


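* Drop features that take the same value for every listing (the two values of `state`, 'WA' and 'wa', differ only in case, so it counts as constant too).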

In [6]:
for col in data.columns:
    if len(data[col].unique()) <= 2:
        print(col, data[col].unique())

scrape_id [20160104002432]
last_scraped ['2016-01-04']
experiences_offered ['none']
state ['WA' 'wa']
market ['Seattle']
country_code ['US']
country ['United States']
is_location_exact ['t' 'f']
has_availability ['t']
calendar_last_scraped ['2016-01-04']
requires_license ['f']
jurisdiction_names ['WASHINGTON']
instant_bookable ['f' 't']
require_guest_profile_picture ['f' 't']
require_guest_phone_verification ['f' 't']


In [7]:
data = data.drop(['scrape_id', 'last_scraped', 'experiences_offered', 'state', 'market', 'country_code', 'country', 'has_availability', 'calendar_last_scraped', 'requires_license', 'jurisdiction_names'], axis=1)
print(data.shape)

(3818, 45)
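
The drop list above was assembled by hand from the printed values. A sketch of deriving it instead: count the distinct values per column after lower-casing strings, so that 'WA' and 'wa' collapse into one value. The toy frame and the helper `n_distinct` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with two effectively constant columns and two informative ones.
toy = pd.DataFrame({
    'state': ['WA', 'wa', 'WA'],
    'market': ['Seattle', 'Seattle', 'Seattle'],
    'instant_bookable': ['f', 't', 'f'],
    'accommodates': [4, 4, 2],
})

def n_distinct(col):
    """Count distinct non-null values, ignoring case for string columns."""
    if pd.api.types.is_string_dtype(col):
        col = col.str.lower()
    return col.nunique()

constant_cols = [c for c in toy.columns if n_distinct(toy[c]) <= 1]
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```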


In [8]:
data.dtypes

host_since                           object
host_response_time                   object
host_response_rate                   object
host_acceptance_rate                 object
host_is_superhost                    object
host_neighbourhood                   object
host_listings_count                 float64
host_total_listings_count           float64
host_verifications                   object
host_has_profile_pic                 object
host_identity_verified               object
zipcode                              object
is_location_exact                    object
property_type                        object
room_type                            object
accommodates                          int64
bathrooms                           float64
bedrooms                            float64
beds                                float64
bed_type                             object
amenities                            object
price                                object
security_deposit                

In [9]:
data.isnull().sum().sort_values()[::-1]

security_deposit                    1952
cleaning_fee                        1030
host_acceptance_rate                 773
review_scores_accuracy               658
review_scores_checkin                658
review_scores_value                  656
review_scores_location               655
review_scores_cleanliness            653
review_scores_communication          651
review_scores_rating                 647
reviews_per_month                    627
host_response_time                   523
host_response_rate                   523
host_neighbourhood                   300
bathrooms                             16
zipcode                                7
bedrooms                               6
host_is_superhost                      2
host_listings_count                    2
host_total_listings_count              2
host_has_profile_pic                   2
host_identity_verified                 2
host_since                             2
beds                                   1
property_type   

### Step 2: Clean the data feature by feature

* **host_since**

In [10]:
data['host_since'] = pd.to_datetime(data['host_since'])

In [11]:
data['host_since_year'] = data['host_since'].dt.year.fillna(0).astype('int')  #Using 0 to represent unknown data

data['host_since_month'] = data['host_since'].dt.month.fillna(0).astype('int')  #Using 0 to represent unknown data

data['host_since_day'] = data['host_since'].dt.day.fillna(0).astype('int')  #Using 0 to represent unknown data
data.head()

Unnamed: 0,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,...,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,reviews_per_month,host_since_year,host_since_month,host_since_day
0,2011-08-11,within a few hours,96%,100%,f,Queen Anne,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",t,...,9.0,10.0,f,moderate,f,f,4.07,2011,8,11
1,2013-02-21,within an hour,98%,100%,t,Queen Anne,6.0,6.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,...,10.0,10.0,f,strict,t,t,1.48,2013,2,21
2,2014-06-12,within a few hours,67%,100%,f,Queen Anne,2.0,2.0,"['email', 'phone', 'google', 'reviews', 'jumio']",t,...,10.0,10.0,f,strict,f,f,1.15,2014,6,12
3,2013-11-06,,,,f,Queen Anne,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,...,,,f,flexible,f,f,,2013,11,6
4,2011-11-29,within an hour,100%,,f,Queen Anne,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,...,9.0,9.0,f,strict,f,f,0.89,2011,11,29


In [12]:
data = data.drop('host_since', axis=1)
print(data.shape)

(3818, 47)
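
The year, month and day columns follow the same pattern, so they can also be built in a loop. This is a sketch on a made-up three-row frame; the loop variable `part` is an illustrative name.

```python
import pandas as pd

# Toy column with one missing date.
toy = pd.DataFrame({'host_since': ['2011-08-11', None, '2014-06-12']})
host_since = pd.to_datetime(toy['host_since'])

for part in ('year', 'month', 'day'):
    # host_since.dt.year / .month / .day are NaN where the date is missing.
    values = getattr(host_since.dt, part)
    # 0 stands for an unknown date part, as in the cell above.
    toy['host_since_' + part] = values.fillna(0).astype('int')

toy = toy.drop(columns='host_since')
print(toy)
```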


* **host_response_time**

In [13]:
data['host_response_time'].unique()

array(['within a few hours', 'within an hour', nan, 'within a day',
       'a few days or more'], dtype=object)

In [14]:
def response_time(x):
    """
    Transform the host_response_time feature into numeric data.
    
    INPUT:
        x - a string that describes the response time of the host.
        
    OUTPUT:
        res - int code for the response time; 96 is used for missing or unrecognized values.
    """
    
    if x == 'within an hour':
        res = 1
    elif x == 'within a few hours':
        res = 2
    elif x == 'within a day':
        res = 24
    elif x == 'a few days or more':
        res = 48
    else:
        res = 96
    return res

In [15]:
data['host_response_time'] = data['host_response_time'].apply(response_time).astype('int')
data.head()

Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,...,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,reviews_per_month,host_since_year,host_since_month,host_since_day
0,2,96%,100%,f,Queen Anne,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",t,t,...,9.0,10.0,f,moderate,f,f,4.07,2011,8,11
1,1,98%,100%,t,Queen Anne,6.0,6.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,t,...,10.0,10.0,f,strict,t,t,1.48,2013,2,21
2,2,67%,100%,f,Queen Anne,2.0,2.0,"['email', 'phone', 'google', 'reviews', 'jumio']",t,t,...,10.0,10.0,f,strict,f,f,1.15,2014,6,12
3,96,,,f,Queen Anne,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,...,,,f,flexible,f,f,,2013,11,6
4,1,100%,,f,Queen Anne,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,t,...,9.0,9.0,f,strict,f,f,0.89,2011,11,29
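
The `if`/`elif` chain in `response_time` is a lookup table, so a dictionary passed to `Series.map` gives the same encoding. This is a sketch on a toy series; the names `RESPONSE_CODES` and `UNKNOWN_CODE` are assumptions.

```python
import numpy as np
import pandas as pd

# Same codes as response_time() above.
RESPONSE_CODES = {
    'within an hour': 1,
    'within a few hours': 2,
    'within a day': 24,
    'a few days or more': 48,
}
UNKNOWN_CODE = 96  # used for missing or unrecognized values

toy = pd.Series(['within a few hours', 'within an hour', np.nan, 'a few days or more'])

# Values that are not keys of the dict (including NaN) map to NaN,
# which is then replaced by the "unknown" code.
encoded = toy.map(RESPONSE_CODES).fillna(UNKNOWN_CODE).astype('int')
print(encoded.tolist())
```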


* **host_response_rate** and **host_acceptance_rate**

In [16]:
def rate_trans(x):
    """
    Transform a rate feature into numeric data.
    
    INPUT:
        x - a string that describes a rate of the host, e.g. '96%'.
        
    OUTPUT:
        x - a float, the rate as a fraction between 0 and 1 (e.g. 0.96); non-string input is returned unchanged.
    """
    
    if type(x) == str:
        x = x.replace('%', '')
        x = float(x) / 100.0
    return x

In [17]:
#Where the rate is missing, the response time of the host is often very large, so we treat the rate as 0%
data['host_response_rate'] = data['host_response_rate'].fillna('0%').map(rate_trans)

#The missing acceptance rates are treated as 0% in the same way
data['host_acceptance_rate'] = data['host_acceptance_rate'].fillna('0%').map(rate_trans)

data.head()

Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,...,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,reviews_per_month,host_since_year,host_since_month,host_since_day
0,2,0.96,1.0,f,Queen Anne,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",t,t,...,9.0,10.0,f,moderate,f,f,4.07,2011,8,11
1,1,0.98,1.0,t,Queen Anne,6.0,6.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,t,...,10.0,10.0,f,strict,t,t,1.48,2013,2,21
2,2,0.67,1.0,f,Queen Anne,2.0,2.0,"['email', 'phone', 'google', 'reviews', 'jumio']",t,t,...,10.0,10.0,f,strict,f,f,1.15,2014,6,12
3,96,0.0,0.0,f,Queen Anne,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,...,,,f,flexible,f,f,,2013,11,6
4,1,1.0,0.0,f,Queen Anne,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,t,...,9.0,9.0,f,strict,f,f,0.89,2011,11,29
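
The percentage strings can also be converted with the vectorized string methods instead of a per-element function. This is a sketch on a toy series. The extra `has_rate` indicator is a hypothetical addition, not something the notebook creates; it would let a model tell an unknown rate apart from a true 0%.

```python
import numpy as np
import pandas as pd

toy = pd.Series(['96%', '100%', np.nan, '67%'])

# Strip the trailing '%' and rescale to a fraction; NaN stays NaN.
rate = toy.str.rstrip('%').astype('float') / 100.0

# Same convention as above: a missing rate is treated as 0.
rate_filled = rate.fillna(0.0)

# Hypothetical indicator: 1 where a rate was actually reported.
has_rate = toy.notna().astype('int')

print(rate_filled.tolist())
print(has_rate.tolist())
```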


* **host_is_superhost** , **host_has_profile_pic** , **host_identity_verified** , **is_location_exact** , **instant_bookable** , **require_guest_profile_picture** , **require_guest_phone_verification**

In [18]:
def tf_trans(x):
    """
    Transform a feature with only 't' and 'f' values into numeric data.
    
    INPUT:
        x - a string.
        
    OUTPUT:
        res - int data, 1 represents 't', 0 represents 'f', and -1 represents any other (e.g. missing) value.
    """
    
    if x == 'f':
        res = 0
    elif x == 't':
        res = 1
    else:
        res = -1
    return res

In [19]:
tf_features = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'is_location_exact',
               'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification']
for col in tf_features:
    data[col] = data[col].map(tf_trans)

data.head()

Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,...,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,reviews_per_month,host_since_year,host_since_month,host_since_day
0,2,0.96,1.0,0,Queen Anne,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",1,1,...,9.0,10.0,0,moderate,0,0,4.07,2011,8,11
1,1,0.98,1.0,1,Queen Anne,6.0,6.0,"['email', 'phone', 'facebook', 'linkedin', 're...",1,1,...,10.0,10.0,0,strict,1,1,1.48,2013,2,21
2,2,0.67,1.0,0,Queen Anne,2.0,2.0,"['email', 'phone', 'google', 'reviews', 'jumio']",1,1,...,10.0,10.0,0,strict,0,0,1.15,2014,6,12
3,96,0.0,0.0,0,Queen Anne,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",1,1,...,,,0,flexible,0,0,,2013,11,6
4,1,1.0,0.0,0,Queen Anne,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",1,1,...,9.0,9.0,0,strict,0,0,0.89,2011,11,29
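
The 't'/'f' encoding is again a lookup, so a dictionary with a fallback reproduces `tf_trans`. This is a sketch on a toy frame; `TF_CODES` and `flag_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

TF_CODES = {'f': 0, 't': 1}

toy = pd.DataFrame({
    'host_is_superhost': ['f', 't', np.nan],
    'instant_bookable': ['t', 'f', 'f'],
})

flag_cols = ['host_is_superhost', 'instant_bookable']
for col in flag_cols:
    # Anything other than 't' or 'f' (including NaN) becomes -1.
    toy[col] = toy[col].map(TF_CODES).fillna(-1).astype('int')

print(toy)
```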


* **host_verifications**

In [20]:
verifs = []
all_verifs = []
verif_cnt = []
for v in data['host_verifications'].values:
    v = v.replace('[', '')
    v = v.replace(']', '')
    v = v.replace("'", '')
    v = v.replace(" ", '')
    v = v.split(',')
    if '' in v:
        v.remove('')
    if 'None' in v:
        v.remove('None')
    verif_cnt.append(len(v))
    all_verifs.append(v)
    verifs.extend(v)
verifs = list(set(verifs))
verifs

['phone',
 'google',
 'manual_offline',
 'linkedin',
 'jumio',
 'kba',
 'email',
 'sent_id',
 'reviews',
 'amex',
 'weibo',
 'manual_online',
 'facebook',
 'photographer']

In [21]:
data['verifications_count'] = verif_cnt

for verif in verifs:
    data[verif+'_verification'] = 0
for ind, v in enumerate(all_verifs):
    for verif in v:
        data.loc[ind, verif+'_verification'] = 1
data.head()

Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,...,jumio_verification,kba_verification,email_verification,sent_id_verification,reviews_verification,amex_verification,weibo_verification,manual_online_verification,facebook_verification,photographer_verification
0,2,0.96,1.0,0,Queen Anne,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",1,1,...,0,1,1,0,1,0,0,0,0,0
1,1,0.98,1.0,1,Queen Anne,6.0,6.0,"['email', 'phone', 'facebook', 'linkedin', 're...",1,1,...,1,0,1,0,1,0,0,0,1,0
2,2,0.67,1.0,0,Queen Anne,2.0,2.0,"['email', 'phone', 'google', 'reviews', 'jumio']",1,1,...,1,0,1,0,1,0,0,0,0,0
3,96,0.0,0.0,0,Queen Anne,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",1,1,...,1,0,1,0,1,0,0,0,1,0
4,1,1.0,0.0,0,Queen Anne,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",1,1,...,0,1,1,0,1,0,0,0,1,0


In [22]:
data = data.drop('host_verifications', axis=1)
print(data.shape)

(3818, 61)
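
The `host_verifications` strings look like Python list literals, so they can be parsed with `ast.literal_eval` rather than by stripping characters, and the indicator columns can be built with `Series.str.get_dummies`. This is a sketch under that assumption about the string format; the toy rows and the helper `parse_verifications` are made up for illustration.

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    'host_verifications': [
        "['email', 'phone', 'reviews', 'kba']",
        "['email', 'phone', 'facebook']",
        "[]",
        "None",
    ]
})

def parse_verifications(text):
    """Turn "['email', 'phone']" into ['email', 'phone']; "[]" and "None" become []."""
    parsed = ast.literal_eval(text)
    return list(parsed) if parsed else []

parsed = toy['host_verifications'].map(parse_verifications)
toy['verifications_count'] = parsed.map(len)

# Join each list with '|' and let get_dummies build one 0/1 column per method.
dummies = parsed.map('|'.join).str.get_dummies(sep='|').add_suffix('_verification')
toy = toy.drop(columns='host_verifications').join(dummies)
print(toy)
```

Unlike the row-by-row `data.loc[ind, ...]` assignment, this does not rely on the frame having the default 0..n-1 index.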


* **zipcode**

In [23]:
def zipcode_trans(x):
    """
    Transform zipcode feature into numeric data, and just use its last two digits to represent it.
    
    INPUT:
        x - a zipcode string.
        
    OUTPUT:
        x - int data.
    """
    
    if type(x) == str:
        return int(x[-2:])
data['zipcode'] = data['zipcode'].map(lambda x: zipcode_trans(x))

In [24]:
data['zipcode'] = data['zipcode'].fillna(0).astype('int')  #Using 0 to represent unknown zipcode
data['zipcode'].isnull().sum()

0
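
The last-two-digits encoding can also be written with the vectorized `.str` slice. This is a sketch on a toy series with made-up zipcodes.

```python
import numpy as np
import pandas as pd

toy = pd.Series(['98119', '98109', np.nan])

# .str[-2:] keeps the last two characters and leaves NaN untouched;
# to_numeric turns '19' into 19 and '09' into 9.
last_two = pd.to_numeric(toy.str[-2:], errors='coerce')

# 0 stands for an unknown zipcode, as in the cell above.
zipcode = last_two.fillna(0).astype('int')
print(zipcode.tolist())
```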

* **bathrooms** , **bedrooms** , **beds**

In [26]:
#These features should not be 0, so the missing data probably comes from human error; here we fill it with the mode
data['bathrooms'] = data['bathrooms'].fillna(data['bathrooms'].mode()[0])
data['bedrooms'] = data['bedrooms'].fillna(data['bedrooms'].mode()[0])
data['beds'] = data['beds'].fillna(data['beds'].mode()[0])
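
As a small illustration of what the mode fill does, here is a sketch on a toy frame with made-up values. `Series.mode()` returns all most frequent values in sorted order, so `[0]` picks the smallest one when there is a tie.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bathrooms': [1.0, 1.0, np.nan, 2.0],
    'beds': [1.0, np.nan, 2.0, 1.0],
})

for col in ['bathrooms', 'beds']:
    most_common = toy[col].mode()[0]  # NaN is ignored when computing the mode
    toy[col] = toy[col].fillna(most_common)

print(toy)
```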

* **amenities**

In [27]:
all_amenity = ['TV', 'internet', 'air_conditioning', 'kitchen', 'free_parking', 'heating', 'washer', 'dryer', 'elevator', 'pets_allowed', 'dog', 'cat', 'hot_tub']
for amenity in all_amenity:
    data[amenity] = 0

data['amenity_count'] = 0

amts = ['TV', 'Internet', 'Air Conditioning', 'Kitchen', 'Free Parking', 'Heating', 'Washer', 'Dryer', 'Elevator', 'Pets Allowed', 'Dog', 'Cat', 'Hot Tub']
for ind, v in enumerate(data['amenities'].values):
    v = v.replace('{', '')
    v = v.replace('}', '')
    v = v.replace('"', '')
    for i, amt in enumerate(amts):
        if amt in v:
            data.loc[ind, all_amenity[i]] = 1
    v = v.replace('Cable TV', 'TV')
    v = v.replace('Wireless Internet', 'Internet')
    v = v.split(',')
    v = list(set(v))
    data.loc[ind, 'amenity_count'] = len(v)

In [28]:
data.head()

Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,zipcode,...,free_parking,heating,washer,dryer,elevator,pets_allowed,dog,cat,hot_tub,amenity_count
0,2,0.96,1.0,0,Queen Anne,3.0,3.0,1,1,19,...,0,1,1,1,0,0,0,0,0,8
1,1,0.98,1.0,1,Queen Anne,6.0,6.0,1,1,19,...,1,1,1,1,0,0,0,0,0,15
2,2,0.67,1.0,0,Queen Anne,2.0,2.0,1,1,19,...,1,1,1,1,0,1,1,1,1,19
3,96,0.0,0.0,0,Queen Anne,1.0,1.0,1,1,19,...,0,1,1,1,0,0,0,0,0,13
4,1,1.0,0.0,0,Queen Anne,2.0,2.0,1,1,19,...,0,1,0,0,0,0,0,0,0,11


In [29]:
data = data.drop('amenities', axis=1)
print(data.shape)

(3818, 74)
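
The amenity count above merges 'Cable TV' into 'TV' and 'Wireless Internet' into 'Internet' before counting distinct names. A sketch of the same idea with an explicit parsing helper follows; `parse_amenities` and `ALIASES` are illustrative names and the two toy strings are made up. One difference from the cell above is that an empty `{}` string yields a count of 0 here, whereas splitting an empty string gives one empty item and therefore a count of 1.

```python
import pandas as pd

# Names that are counted as the same amenity, as in the cell above.
ALIASES = {'Cable TV': 'TV', 'Wireless Internet': 'Internet'}

def parse_amenities(text):
    """Split a '{TV,"Cable TV",Kitchen}'-style string into a set of amenity names."""
    inner = text.strip('{}').replace('"', '')
    return {ALIASES.get(name, name) for name in inner.split(',') if name}

toy = pd.Series(['{TV,"Cable TV",Internet,"Wireless Internet",Kitchen}', '{}'])

parsed = toy.map(parse_amenities)
amenity_count = parsed.map(len)

# Exact-name membership test for one amenity.
has_kitchen = parsed.map(lambda names: int('Kitchen' in names))

print(amenity_count.tolist())
print(has_kitchen.tolist())
```

Note that the indicator columns above use substring matching (`amt in v`), so for example 'Dryer' would also match a name that merely contains it; the exact-name test here is stricter.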


* **price** , **security_deposit** , **cleaning_fee** , **extra_people**

In [30]:
#Transform the type of price into float
def price_trans(x):
    """
    Transform price feature into numeric data.
    
    INPUT:
        x - a string that describes the price.
        
    OUTPUT:
        x - a float, value of the price.
    """
    
    if type(x) == str:
        x = x.replace('$', '')
        x = x.replace(',', '')
        x = float(x)
    return x

In [31]:
data['price'] = data['price'].map(lambda x: price_trans(x))

data['extra_people'] = data['extra_people'].map(lambda x: price_trans(x))

#Missing data might simply mean that there is no security deposit or cleaning fee, so here we fill it with 0
data['security_deposit'] = data['security_deposit'].map(lambda x: price_trans(x))
data['security_deposit'] = data['security_deposit'].fillna(0)

data['cleaning_fee'] = data['cleaning_fee'].map(lambda x: price_trans(x))
data['cleaning_fee'] = data['cleaning_fee'].fillna(0)
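
The price strings can also be cleaned with a single vectorized regular-expression replace. This is a sketch on a toy frame with made-up amounts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'price': ['$85.00', '$1,000.00', '$150.00'],
    'cleaning_fee': ['$40.00', np.nan, '$25.00'],
})

for col in ['price', 'cleaning_fee']:
    # Remove '$' and thousands separators, then convert; NaN stays NaN.
    toy[col] = toy[col].str.replace(r'[$,]', '', regex=True).astype('float')

# A missing fee is taken to mean "no fee", as in the cell above.
toy['cleaning_fee'] = toy['cleaning_fee'].fillna(0)
print(toy)
```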

* **host_listings_count** , **host_total_listings_count** , **host_neighbourhood**

In [32]:
data[data['host_listings_count'] != data['host_total_listings_count']][['host_listings_count', 'host_total_listings_count']]

Unnamed: 0,host_listings_count,host_total_listings_count
1297,,
1419,,


Apart from two rows in which both values are missing, **host_listings_count** and **host_total_listings_count** are identical, so we drop one of them.
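
The two rows show up in the comparison only because `NaN != NaN` evaluates to True. A sketch of a comparison that treats two missing values as equal, on a toy frame with made-up counts:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'host_listings_count': [3.0, 6.0, np.nan],
    'host_total_listings_count': [3.0, 6.0, np.nan],
})
a = toy['host_listings_count']
b = toy['host_total_listings_count']

# Element-wise != flags the row where both values are NaN.
naive_mismatch = a != b

# Treat "both missing" as equal.
same = (a == b) | (a.isna() & b.isna())

print(naive_mismatch.tolist())
print(same.all())
print(a.equals(b))  # Series.equals also treats NaNs in the same position as equal
```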

In [33]:
data = data.drop('host_total_listings_count', axis=1)
#Each record in the listings data implies that the host has at least 1 listing, so here we fill missing data with 1
data['host_listings_count'] = data['host_listings_count'].fillna(1)

In [34]:
#Using -1 to represent unknown data
data['host_neighbourhood'] = data['host_neighbourhood'].fillna(-1)

* **property_type**

In [35]:
#A listing must have a property type, so the missing data probably comes from human error; here we fill it with the mode
data['property_type'] = data['property_type'].fillna(data['property_type'].mode()[0])

* **review_scores_checkin** , **review_scores_accuracy** , **review_scores_value** , **review_scores_location** , **review_scores_cleanliness** , **review_scores_communication** , **review_scores_rating** , **reviews_per_month**

In [36]:
#Where the number of reviews is 0, the review-related features can only be 0, so here we set them to 0
review_features = ['review_scores_checkin', 'review_scores_accuracy', 'review_scores_value', 'review_scores_location',
                   'review_scores_cleanliness', 'review_scores_communication', 'review_scores_rating', 'reviews_per_month']
data.loc[data['number_of_reviews']==0, review_features] = 0

In [37]:
#Where the number of reviews is not 0, missing review-related data might come from human error, so here we fill it with the mean value
score_features = ['review_scores_checkin', 'review_scores_accuracy', 'review_scores_value', 'review_scores_location',
                  'review_scores_cleanliness', 'review_scores_communication', 'review_scores_rating']
for col in score_features:
    data[col] = data[col].fillna(data[col].mean())
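
The two-stage fill (0 for listings without reviews, then the column mean for the rest) can be traced on a toy frame. The numbers are made up. As in the cells above, the mean is computed after the zeros are written, so the zeros pull it down.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'number_of_reviews': [0, 12, 3, 5],
    'review_scores_rating': [np.nan, 90.0, np.nan, 100.0],
    'review_scores_value': [np.nan, 9.0, 10.0, np.nan],
})
review_cols = ['review_scores_rating', 'review_scores_value']

# Stage 1: listings without reviews get 0 in every review column.
no_reviews = toy['number_of_reviews'] == 0
toy.loc[no_reviews, review_cols] = 0

# Stage 2: remaining gaps get the column mean (zeros from stage 1 included).
toy[review_cols] = toy[review_cols].fillna(toy[review_cols].mean())
print(toy)
```

Here the rating gap is filled with (0 + 90 + 100) / 3 and the value gap with (0 + 9 + 10) / 3.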

In [38]:
data.isnull().any().sum()

0

In [39]:
data['review_scores_rating'] = data['review_scores_rating']/100.0

### Step 3: Create new features

In [40]:
#Estimated number of guests = number of reviews / review scores rating (rescaled to 0-1 above)
data['guests_count'] = data['number_of_reviews'] / data['review_scores_rating']

In [41]:
#Where review_scores_rating is 0, the number of reviews is also 0 and 0/0 gives NaN; the number of guests can only be 0 there, so here we fill it with 0
data['guests_count'] = data['guests_count'].fillna(0)

In [42]:
data['guests_count'] = data['guests_count'].round()
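
The three cells above divide first and then repair the NaN produced by 0/0. A sketch that performs the division only where the rating is positive gives the same result without creating NaN in the first place. The toy numbers are made up, and the mask name `rated` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'number_of_reviews': [0, 12, 9],
    'review_scores_rating': [0.0, 0.8, 0.9],  # already rescaled to 0-1
})

rated = toy['review_scores_rating'] > 0

# Default to 0 guests, then overwrite the rows that have a rating.
toy['guests_count'] = 0.0
toy.loc[rated, 'guests_count'] = (
    toy.loc[rated, 'number_of_reviews'] / toy.loc[rated, 'review_scores_rating']
).round()

print(toy['guests_count'].tolist())
```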

In [43]:
data.shape

(3818, 74)

In [44]:
data.isnull().any().sum()

0

Save the cleaned data

In [517]:
data.to_csv('Seattle_Airbnb_Open_Data/cleaned_listings.csv', index=False)