# Paris Airbnb in-depth analysis

In this notebook, we'll use the **listings.csv** file to explore and analyze most of the Airbnb data available for the city of Paris.

## Imports

In [8]:
#Data manipulation
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

#Data visualisation
import matplotlib.pyplot as plt
import plotly as py
py.offline.init_notebook_mode(connected=True)
import plotly.express as px
import cufflinks as cf

## Data loading

To simplify the analysis, I've selected a first subset of columns that look interesting to investigate.

In [61]:
col_list = ['id','host_id','host_since','host_response_time','host_response_rate','host_acceptance_rate',
            'host_is_superhost','host_listings_count','host_verifications','host_has_profile_pic',
            'host_identity_verified','neighbourhood_cleansed','zipcode','latitude','longitude','is_location_exact',
           'property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities',
            'price','weekly_price','monthly_price','security_deposit','cleaning_fee','minimum_nights',
           'maximum_nights','availability_365','availability_30','reviews_per_month','review_scores_rating']

In [62]:
listings = pd.read_csv('data/listings.csv', usecols = col_list)


Columns (43) have mixed types.Specify dtype option on import or set low_memory=False.
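The warning above appears because pandas infers column types chunk by chunk and found conflicting types in one column. One common way to avoid it is to declare the dtype of the offending column explicitly. The sketch below uses a tiny in-memory CSV rather than the real file, and it assumes, purely for illustration, that the mixed-type column is `zipcode`; the warning only reports the position 43, so this should be checked against the actual file.

```python
import io

import pandas as pd

# Toy stand-in for data/listings.csv; the real file is not read here.
sample_csv = io.StringIO(
    "id,zipcode,price\n"
    "1,75011,$60.00\n"
    "2,75018 Paris,$85.00\n"
    "3,,$40.00\n"
)

# Assumption: zipcode is the mixed-type column. Reading it as a string
# keeps values such as '75018 Paris' intact and removes the type guessing.
sample = pd.read_csv(sample_csv, usecols=["id", "zipcode"], dtype={"zipcode": str})

print(sample.dtypes)
```

Passing `low_memory=False`, as the warning suggests, is the other option; it lets pandas infer a single type from the whole column at the cost of more memory.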



Let's get a first overview of the dataset using **ProfileReport** from the **pandas_profiling** library, which extends what pandas' **df.info()** method provides.

In [63]:
profile = ProfileReport(listings, minimal = True)
profile.to_notebook_iframe()

As the report shows, the dataset contains many **nan** values, so we'll need to clean the data before proceeding with the analysis.

## Data cleaning

First, let's see which **nan** values we can replace with a known value such as $0$ or $Null$, and which we can compute from other columns.

In [64]:
listings.isna().sum()

id                            0
host_id                       0
host_since                   69
host_response_time        26311
host_response_rate        26311
host_acceptance_rate      65493
host_is_superhost            69
host_listings_count          69
host_verifications            0
host_has_profile_pic         69
host_identity_verified       69
neighbourhood_cleansed        0
zipcode                     623
latitude                      0
longitude                     0
is_location_exact             0
property_type                 0
room_type                     0
accommodates                  0
bathrooms                    63
bedrooms                    109
beds                        441
bed_type                      0
amenities                     0
price                         0
weekly_price              55702
monthly_price             59295
security_deposit          19614
cleaning_fee              16771
minimum_nights                0
maximum_nights                0
availabi
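Raw counts are easier to compare once they are expressed as a share of the rows. The sketch below shows one way to do that on a small made-up frame; the values are invented for illustration and are not taken from the listings data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, standing in for a few listings columns.
toy = pd.DataFrame({
    "host_since": ["2015-01-01", None, "2018-06-30", "2012-03-12"],
    "weekly_price": [np.nan, np.nan, np.nan, 350.0],
    "price": [60.0, 85.0, 40.0, 55.0],
})

# isna() gives a boolean frame; its column-wise mean is the missing fraction.
missing_share = toy.isna().mean().mul(100).sort_values(ascending=False)

print(missing_share.round(1))
```

Sorting in descending order puts the columns that need the most attention at the top.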

For the columns listed below, we may have to drop the rows containing **nan** values:
- host_since
- host_is_superhost
- host_listings_count
- host_has_profile_pic
- host_identity_verified
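Dropping those rows could look like the following minimal sketch. It runs on a made-up three-row frame with only some of the host columns, so the names `toy`, `host_cols` and `cleaned` are illustrative rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second listing has no host information at all.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "host_since": ["2015-01-01", np.nan, "2018-06-30"],
    "host_is_superhost": ["f", np.nan, "t"],
    "host_listings_count": [1.0, np.nan, 3.0],
})
host_cols = ["host_since", "host_is_superhost", "host_listings_count"]

# Drop a row as soon as any of the host columns is missing.
cleaned = toy.dropna(subset=host_cols).reset_index(drop=True)

print(len(toy), "->", len(cleaned))
```

Restricting `dropna` to `subset=host_cols` keeps rows that are only missing values in other columns.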

For the columns listed below, we can replace the **nan** values with a fixed value:
- security_deposit
- cleaning_fee
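For these two fee columns, a natural candidate is $0$. The sketch below rests on two assumptions that should be checked against the real data: that a missing fee means the host charges none, and that the fees are stored as currency strings such as `$1,000.00`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the assumed '$1,000.00' string format.
toy = pd.DataFrame({
    "security_deposit": ["$200.00", np.nan, "$1,000.00"],
    "cleaning_fee": [np.nan, "$30.00", "$45.00"],
})
fee_cols = ["security_deposit", "cleaning_fee"]

for col in fee_cols:
    toy[col] = (
        toy[col]
        .str.replace(r"[$,]", "", regex=True)  # strip '$' and thousands separators
        .astype(float)                         # missing values stay nan here
        .fillna(0.0)                           # assumption: no value means no fee
    )

print(toy)
```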

For the columns below, we may be able to infer the missing value from other columns:
- zipcode
- bathrooms
- bedrooms
- beds
- weekly_price
- monthly_price
- review_scores_rating
- reviews_per_month
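One possible way to infer such values is to fill each gap with the median of comparable listings. The sketch below fills `bedrooms` with the median per `room_type` on made-up data; the grouping column is an assumption chosen for illustration, not necessarily the rule this notebook ends up using.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing bedroom counts.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room"],
    "bedrooms": [1.0, np.nan, 3.0, 1.0, np.nan],
})

# transform("median") returns one value per row, aligned with the original
# index, so it can be passed to fillna directly.
group_median = toy.groupby("room_type")["bedrooms"].transform("median")
toy["bedrooms"] = toy["bedrooms"].fillna(group_median)

print(toy)
```

The same pattern would apply to `bathrooms` or `beds`, possibly with a different grouping column.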

Now let's get to it, beginning with the **nan** values we can replace with a value based on the data itself.

In [72]:
drop_col = ['host_since','host_is_superhost','host_listings_count',
            'host_has_profile_pic','host_identity_verified']

In [78]:
listings[drop_col].isna().idxmax()

host_since                924
host_is_superhost         924
host_listings_count       924
host_has_profile_pic      924
host_identity_verified    924
dtype: int64

In [65]:
listings[['host_has_profile_pic','host_is_superhost']] = listings[['host_has_profile_pic','host_is_superhost']].replace({'t':1,'f':0})

In [66]:
listings['host_has_profile_pic'].isna().sum(),listings['host_is_superhost'].isna().sum()

(69, 69)

In [70]:
(
    listings[listings['host_has_profile_pic'].isna()]['host_is_superhost'].shape,
    listings[listings['host_has_profile_pic'].isna()]['host_is_superhost'].isna().sum()
)

((69,), 69)

Here we can see that the **nan** values in these two columns coincide exactly: all 69 rows missing **host_has_profile_pic** are also missing **host_is_superhost**.
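The same check can be extended to several host columns at once by comparing the number of rows where any of them is missing with the number of rows where all of them are missing. The sketch below uses a made-up frame; `toy`, `rows_any` and `rows_all` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 have no host information.
toy = pd.DataFrame({
    "host_since": ["2015-01-01", np.nan, "2018-06-30", np.nan],
    "host_is_superhost": [0.0, np.nan, 1.0, np.nan],
    "host_has_profile_pic": [1.0, np.nan, 1.0, np.nan],
})
host_cols = list(toy.columns)

missing = toy[host_cols].isna()
rows_any = int(missing.any(axis=1).sum())  # at least one host column missing
rows_all = int(missing.all(axis=1).sum())  # every host column missing

print(rows_any, rows_all)
```

If the two counts are equal, the missing values always occur together in the same rows, which suggests those listings simply lack host information.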