<a id="1"></a>
# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Data cleaning</p>

<a id="1"></a>
## <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Boston Dataset</p>

**- Importing libraries :**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**- Importing data :**

In [2]:
boston = pd.read_csv("boston.csv")
boston.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,5506,https://www.airbnb.com/rooms/5506,20191204162830,2019-12-04,**$79 Special ** Private! Minutes to center!,"Private guest room with private bath, You do n...",**THE BEST Value in BOSTON!!*** PRIVATE GUEST ...,"Private guest room with private bath, You do n...",none,"Peacful, Architecturally interesting, historic...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.81
1,6695,https://www.airbnb.com/rooms/6695,20191204162830,2019-12-04,$99 Special!! Home Away! Condo,"Comfortable, Fully Equipped private apartment...",** WELCOME *** FULL PRIVATE APARTMENT In a His...,"Comfortable, Fully Equipped private apartment...",none,"Peaceful, Architecturally interesting, histori...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.91
2,8789,https://www.airbnb.com/rooms/8789,20191204162830,2019-12-04,Curved Glass Studio/1bd facing Park,"Bright, 1 bed with curved glass windows facing...",Fully Furnished studio with enclosed bedroom. ...,"Bright, 1 bed with curved glass windows facing...",none,Beacon Hill is a historic neighborhood filled ...,...,f,f,strict_14_with_grace_period,f,f,10,10,0,0,0.37
3,10730,https://www.airbnb.com/rooms/10730,20191204162830,2019-12-04,Bright 1bed facing Golden Dome,"Bright, spacious unit, new galley kitchen, new...",Bright one bed facing the golden dome of the S...,"Bright, spacious unit, new galley kitchen, new...",none,Beacon Hill is located downtown and is conveni...,...,f,f,strict_14_with_grace_period,f,f,10,10,0,0,0.24
4,10811,https://www.airbnb.com/rooms/10811,20191204162830,2019-12-04,"Back Bay Apt Studio-3 blocks to Pru center & ""T""",Stunning Back Bay furnished studio apartment. ...,"Back Bay Studio Apt - Private bath, A/C, Cabl...",Stunning Back Bay furnished studio apartment. ...,none,A one-square mile neighborhood that is arguabl...,...,f,f,strict_14_with_grace_period,f,f,7,7,0,0,0.19


**- Information about our dataset :**

In [3]:
boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3507 entries, 0 to 3506
Columns: 106 entries, id to reviews_per_month
dtypes: float64(23), int64(21), object(62)
memory usage: 2.8+ MB


In [4]:
print(f"- number of rows : {boston.shape[0]}\n- number of columns : {boston.shape[1]}")

- number of rows : 3507
- number of columns : 106


<a id="1"></a>
## <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Dropping columns that are of no use</p>


### Deletion according to the number of NaN values :


- Columns in which at least 70% of the values are missing will be removed.

In [5]:
columns_to_drop = []
for column in boston.columns:
    null_count = boston[column].isna().sum()
    total_rows = boston.shape[0]
    if (null_count / total_rows) >= 0.7:
        columns_to_drop.append(column)
        print(f"{column}: {null_count}")

thumbnail_url: 3507
medium_url: 3507
xl_picture_url: 3507
host_acceptance_rate: 3507
neighbourhood_group_cleansed: 3507
square_feet: 3383
weekly_price: 3215
monthly_price: 3211


In [6]:
boston.drop(columns=columns_to_drop, axis=1, inplace=True)
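
The same threshold rule can be written without an explicit loop. The sketch below is only an illustration on a small made-up frame (`toy`, `missing_ratio` and `sparse_columns` are hypothetical names, not part of the notebook); it relies on the fact that `isna().mean()` returns the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "price": [100.0, 120.0, 90.0, 80.0],
    "square_feet": [np.nan, np.nan, np.nan, 500.0],      # 75% missing
    "thumbnail_url": [np.nan, np.nan, np.nan, np.nan],   # 100% missing
})

# isna().mean() gives the fraction of missing values in each column.
missing_ratio = toy.isna().mean()

# Keep the names of the columns at or above the 70% threshold.
sparse_columns = missing_ratio[missing_ratio >= 0.7].index.tolist()

cleaned = toy.drop(columns=sparse_columns)
print(sparse_columns)
print(cleaned.columns.tolist())
```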

### Checking for duplicated records :

In [7]:
nbr_duplicate = boston.duplicated().sum()
print("- The number of duplicates in our dataset is: ", nbr_duplicate)

- The number of duplicates in our dataset is:  0


### Removing listings located outside Boston :

In [8]:
cities_to_remove = ['Brookline', 'Somerville', 
                    '88 Auckland Street, Apt 1, Dorchester, MA 0212', 
                    'Everett', 'Cambridge']

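# .copy() makes the filtered result an independent DataFrame,
# so the in-place operations that follow do not act on a slice
boston = boston[~boston['city'].isin(cities_to_remove)].copy()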
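
One point worth checking with this kind of filter is how rows with a missing `city` behave. The sketch below uses a made-up frame (`toy` and `filtered` are hypothetical names) to show that `~isin(...)` keeps rows whose city is NaN, and that `value_counts(dropna=False)` is one way to spot stray city values before deciding what to remove.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "city": ["Boston", "Brookline", "Boston", np.nan, "Cambridge"],
    "price": [100, 150, 90, 80, 200],
})

# Counting distinct values, NaN included, helps spot cities to exclude.
print(toy["city"].value_counts(dropna=False))

cities_to_remove = ["Brookline", "Cambridge"]

# isin() is False for NaN, so the negated mask keeps rows with a missing city.
filtered = toy[~toy["city"].isin(cities_to_remove)].copy()
print(filtered)
```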

### Dropping identifiers and scrape metadata :

In [9]:
boston.drop(columns=['id', 'scrape_id', 'last_scraped', 'host_id'], axis=1, inplace=True)

<a id="1"></a>
## <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Data normalization</p>


### Map 't' to 1 and 'f' to 0 in these columns (missing values become 0) :

In [10]:
columns_to_transform = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'is_location_exact', 
                        'has_availability', 'requires_license', 'instant_bookable', 'is_business_travel_ready', 
                        'require_guest_profile_picture', 'require_guest_phone_verification']

# Missing flags (NaN or the literal string 'nan') are treated as False (0)
mapping = {'t': 1, 'f': 0, 'nan': 0, np.nan: 0}

boston[columns_to_transform] = boston[columns_to_transform].replace(mapping)
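
As an illustration of the same idea, the sketch below applies the mapping column by column with `Series.map` on a made-up frame (`toy` and `flag_columns` are hypothetical names). It is an alternative formulation under the same assumption that a missing flag counts as 0, and it ends with an explicit integer dtype.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", np.nan],
    "instant_bookable": ["f", "t", "t"],
})
flag_columns = ["host_is_superhost", "instant_bookable"]

for column in flag_columns:
    # map() leaves NaN for anything not in the dictionary; fillna(0) then
    # treats a missing flag as False, and astype(int) fixes the dtype.
    toy[column] = toy[column].map({"t": 1, "f": 0}).fillna(0).astype(int)

print(toy)
```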

### Convert columns 'price', 'security_deposit', 'cleaning_fee' and 'extra_people' to float :

In [11]:
money_columns = ['price', 'security_deposit', 'cleaning_fee', 'extra_people']

for column in money_columns:
    # Remove non-numeric characters ('$' and ',') from the values in the column
    # (raw string, so the pattern contains no invalid escape sequence)
    boston[column] = boston[column].str.replace(r'[$,]', '', regex=True)
    # Convert the cleaned strings to float
    boston[column] = boston[column].astype(float)
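
If some values might not be clean currency strings, a slightly more defensive variant is possible. The helper below, `money_to_float`, is a hypothetical name introduced only for this sketch; it uses `pd.to_numeric(..., errors="coerce")` so that anything that is still not numeric after stripping becomes NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

def money_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a string column and return floats (hypothetical helper)."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns any value that is still not numeric into NaN.
    return pd.to_numeric(stripped, errors="coerce")

# Made-up values, including a missing entry and a non-numeric string.
toy_prices = pd.Series(["$1,250.00", "$99.00", np.nan, "n/a"])
converted = money_to_float(toy_prices)
print(converted.tolist())
```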

### Convert percentages to decimal values :

In [12]:
boston['host_response_rate'] = boston['host_response_rate'].str.rstrip('%').astype(float) / 100
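
A quick check on made-up values (`toy_rates` and `as_fraction` are hypothetical names) shows what this conversion does: the percent sign is stripped, the number is divided by 100, and missing rates remain NaN, to be handled by the missing-value step later in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up response rates, including one missing value.
toy_rates = pd.Series(["100%", "85%", np.nan])

# Strip the trailing '%' and rescale to the 0-1 range; NaN stays NaN.
as_fraction = toy_rates.str.rstrip("%").astype(float) / 100
print(as_fraction.tolist())
```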

### Drop all remaining columns of type object :

In [13]:
# Collect the names of every column whose dtype is still 'object'
object_column = boston.select_dtypes(include='object').columns.tolist()
object_column

['listing_url',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'picture_url',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_verifications',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'property_type',
 'room_type',
 'bed_type',
 'amenities',
 'calendar_updated',
 'calendar_last_scraped',
 'first_review',
 'last_review',
 'license',
 'jurisdiction_names',
 'cancellation_policy']

In [14]:
boston.drop(columns=object_column, axis=1, inplace=True)

<a id="1"></a>
## <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Filling missing values</p>


- Fill with the column mean :

In [15]:
specified_columns = ['host_response_rate', 'host_listings_count', 'host_total_listings_count', 
                       'security_deposit', 'cleaning_fee', 'review_scores_rating', 'reviews_per_month']

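for column in specified_columns:
    # Assign the result back: calling fillna(inplace=True) on boston[column]
    # may act on a temporary copy and leave the DataFrame unchanged
    boston[column] = boston[column].fillna(boston[column].mean())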
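
The mean fill can also be expressed without a loop. The sketch below runs on a made-up frame (`toy` and `mean_columns` are hypothetical names) and relies on `DataFrame.fillna` accepting a Series of per-column fill values, which is what `DataFrame.mean()` returns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "cleaning_fee": [50.0, np.nan, 70.0],
    "reviews_per_month": [1.0, 3.0, np.nan],
    "beds": [1.0, np.nan, 2.0],  # not in the list, left untouched on purpose
})
mean_columns = ["cleaning_fee", "reviews_per_month"]

# mean() returns one value per column; fillna() matches them by column name.
toy[mean_columns] = toy[mean_columns].fillna(toy[mean_columns].mean())
print(toy)
```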

- Fill with the most frequent value (the mode) :

In [16]:
specified_columns = ['bathrooms', 'bedrooms', 'beds', 'review_scores_accuracy', 'review_scores_cleanliness', 
                       'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 
                       'review_scores_value']

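for column in specified_columns:
    # Most frequent non-missing value in the column
    mode = boston[column].value_counts().idxmax()
    # Assign the result back instead of using fillna(inplace=True) on a selection
    boston[column] = boston[column].fillna(mode)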
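
For comparison, pandas also offers `Series.mode()`. The sketch below uses it on a made-up frame (`toy`, `mode_columns` and `most_frequent` are hypothetical names). When several values are tied, `mode()` returns them sorted, so `[0]` picks the smallest one, which may differ from the choice made by `value_counts().idxmax()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "bedrooms": [1.0, 1.0, 2.0, np.nan],
    "beds": [2.0, np.nan, 2.0, 3.0],
})
mode_columns = ["bedrooms", "beds"]

for column in mode_columns:
    # mode() ignores NaN and returns the most frequent value(s), sorted.
    most_frequent = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```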

<a id="1"></a>
## <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Correlation matrix</p>


In [17]:
corr_matrix = boston.corr()

corr_matrix

Unnamed: 0,host_response_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,is_location_exact,accommodates,...,requires_license,instant_bookable,is_business_travel_ready,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
host_response_rate,1.0,0.103008,0.055959,0.055959,0.010282,0.046288,-0.002317,0.070329,-0.024749,0.026283,...,,0.095212,,0.049884,0.069456,0.069396,0.070388,-0.002903,0.006611,0.098791
host_is_superhost,0.103008,1.0,-0.283337,-0.283337,0.046953,0.069634,-0.092251,-0.032018,0.1238,0.022527,...,,-0.053442,,-0.081065,-0.097324,-0.310082,-0.291458,-0.127322,-0.050904,0.339905
host_listings_count,0.055959,-0.283337,1.0,1.0,0.008935,-0.05424,0.196777,0.151282,-0.169796,0.007041,...,,0.228961,,0.477319,0.451494,0.91626,0.945774,-0.167659,-0.041816,-0.20086
host_total_listings_count,0.055959,-0.283337,1.0,1.0,0.008935,-0.05424,0.196777,0.151282,-0.169796,0.007041,...,,0.228961,,0.477319,0.451494,0.91626,0.945774,-0.167659,-0.041816,-0.20086
host_has_profile_pic,0.010282,0.046953,0.008935,0.008935,1.0,0.060522,0.074074,0.000596,-0.003539,0.009598,...,,0.006476,,0.016951,0.02107,0.043392,0.041462,0.011799,0.007067,-0.020523
host_identity_verified,0.046288,0.069634,-0.05424,-0.05424,0.060522,1.0,-0.037788,0.014323,-0.01873,-0.019756,...,,-0.145863,,0.269035,0.33902,-0.137096,-0.11819,-0.113325,-0.069648,-0.029271
latitude,-0.002317,-0.092251,0.196777,0.196777,0.074074,-0.037788,1.0,0.33809,0.022102,0.031623,...,,0.116771,,0.085873,0.09334,0.21452,0.231044,-0.101842,-0.043952,0.076726
longitude,0.070329,-0.032018,0.151282,0.151282,0.000596,0.014323,0.33809,1.0,-0.106379,0.025184,...,,0.091484,,0.128799,0.093615,0.166782,0.156368,0.072892,-0.066538,0.105566
is_location_exact,-0.024749,0.1238,-0.169796,-0.169796,-0.003539,-0.01873,0.022102,-0.106379,1.0,0.031033,...,,-0.149208,,-0.262108,-0.166887,-0.144127,-0.124985,-0.114644,-0.124887,0.009147
accommodates,0.026283,0.022527,0.007041,0.007041,0.009598,-0.019756,0.031623,0.025184,0.031033,1.0,...,,0.040308,,-0.016158,0.031074,-0.016684,0.011979,-0.1807,0.034757,0.044917


- Pairs of columns whose absolute correlation is at least 0.85 :

In [18]:
seuil_corr = 0.85

paires_corr = []


for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) >= seuil_corr:
            paires_corr.append((corr_matrix.columns[i], corr_matrix.columns[j]))

for paire in paires_corr:
    print("{} and {} : {}.".format(paire[0], paire[1], corr_matrix.loc[paire[0], paire[1]]))

host_listings_count and host_total_listings_count : 1.0.
host_listings_count and calculated_host_listings_count : 0.9162604827297202.
host_listings_count and calculated_host_listings_count_entire_homes : 0.9457739162427965.
host_total_listings_count and calculated_host_listings_count : 0.9162604827297202.
host_total_listings_count and calculated_host_listings_count_entire_homes : 0.9457739162427965.
minimum_nights and minimum_minimum_nights : 0.968394816501031.
minimum_nights and maximum_minimum_nights : 0.9384516227672093.
minimum_nights and minimum_nights_avg_ntm : 0.9613891236042608.
maximum_nights and minimum_maximum_nights : 0.8665986925565248.
maximum_nights and maximum_maximum_nights : 0.8739745495614973.
maximum_nights and maximum_nights_avg_ntm : 0.873902936322002.
minimum_minimum_nights and maximum_minimum_nights : 0.9092609976424505.
minimum_minimum_nights and minimum_nights_avg_ntm : 0.9470701024055681.
maximum_minimum_nights and minimum_nights_avg_ntm : 0.985602956963614.
minimum_maximu

In [19]:
boston.drop(columns= [
    'host_total_listings_count',
    'calculated_host_listings_count',
    'calculated_host_listings_count_entire_homes',
    'minimum_minimum_nights',
    'maximum_minimum_nights',
    'minimum_nights_avg_ntm',
    'maximum_maximum_nights',
    'maximum_nights_avg_ntm',
    'maximum_nights',
    'availability_30',
    'availability_90'
], axis=1, inplace=True)
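
The list above is chosen by hand from the printed pairs. As a point of comparison only, one common way to derive drop candidates automatically is to look at the strict upper triangle of the absolute correlation matrix and flag every column that correlates strongly with a column appearing before it. The sketch below runs on a made-up frame (`toy`, `threshold`, `upper` and `redundant` are hypothetical names), and such a rule would not necessarily reproduce the hand-picked list.

```python
import numpy as np
import pandas as pd

# Toy numeric frame: "b" is a scaled copy of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
threshold = 0.85

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is a drop candidate if it correlates strongly with an earlier column.
redundant = [column for column in upper.columns if (upper[column] >= threshold).any()]

reduced = toy.drop(columns=redundant)
print(redundant)
print(reduced.columns.tolist())
```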

In [20]:
print(f"- number of rows : {boston.shape[0]}\n- number of columns : {boston.shape[1]}")

- number of rows : 3493
- number of columns : 39


## END