# Exploratory Data Analysis

This notebook uses the clean dataset without dummy variables.

- Define assumptions
- Check the distribution of data and find out how the features relate to each other
- Identify outliers

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline
sns.set()

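# Use the full option name; the short 'max_columns' pattern is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 71)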

In [None]:
df = pd.read_csv('../data/airbnb_paris_clean.csv')
df.head(10)

In [None]:
df.describe()

**Insights from describe:**

- Superhosts are a minority: fewer than 25% of hosts are superhosts.
- Potential outliers in *host_total_listings_count*: the standard deviation is huge and the maximum is 1270 listings (2% of all listings belong to the same host).
- Surprisingly, many hosts do not have a verified account: fewer than 50% of hosts are verified.
- Most listing locations are exact, so this column may not be very informative and could be dropped.
- On average a listing accommodates 3 people, and more than 50% accommodate only 2. *accommodates* has outliers, with one listing that can accept 19 people.
- Most listings have one bathroom and one bedroom, which is consistent with the average of *accommodates*. There are also outliers with 50 bathrooms and bedrooms, which looks unrealistic and may indicate wrong data or a fake listing.
- Some listings have a price of 0€ per night, which should be impossible. There are also outliers at the high end: the standard deviation is huge and the maximum price is 10000€ per night.
- Most listings accept guests at no extra charge. Outliers are also present in the two guest-related columns.
- The minimum and maximum nights columns do not seem reliable: some listings show a minimum of 9999 nights, and the average maximum is 856 nights. Since France limits the number of nights a property can be rented, these values may be wrong.
- On average listings are available 80 days per year, which seems consistent with local legislation (120 days maximum for a primary residence and 365 days for private rooms).
- Most listings have few reviews (fewer than 20), and there are outliers here as well. Can we detect listings with fake comments?
- Most listings are not instant bookable: more than 50% do not allow the feature.
- *is_business_travel_ready* seems useless because no listing has the feature, so we can drop this column.
- Most listings do not require any guest verification (picture or phone), so these two columns should not have a large impact.
- *calculated_host_listings_count_private_rooms* and *calculated_host_listings_count_shared_rooms* show that more than 75% of hosts have no listings in private rooms or shared rooms, which may be informative.
- Finally, hosts have been registered on Airbnb for 1728 days on average (4.73 years), and this variable seems to be normally distributed.


**TL;DR**
- Only a minority of hosts are superhosts or have a verified account. On average, hosts have been registered on Airbnb for more than 4 years.
- Most listings accommodate 2-3 people, have 1 bathroom and 1 bedroom, and accept 1 guest at no extra charge.
- We may drop the following columns: *is_business_travel_ready, is_location_exact, minimum_nights, maximum_nights, require_guest_profile_picture, require_guest_phone_verification*
- The dataset contains outliers, which must be dropped during pre-processing.
- Only *time_since_host* seems to be normally distributed.
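
The claim about normality can be checked numerically as well as visually. Below is a minimal sketch, assuming a synthetic stand-in frame (the values, including the spread of `time_since_host`, are made up for illustration and are not taken from the dataset). It uses `DataFrame.skew()`, which returns a value near 0 for a roughly symmetric column and a clearly positive value for a right-skewed one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two numeric columns; all values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "time_since_host": rng.normal(loc=1728, scale=600, size=1000),  # roughly symmetric
    "price": rng.lognormal(mean=4.5, sigma=1.0, size=1000),         # long right tail
})

# Sample skewness per column: close to 0 for symmetric data,
# clearly positive when the distribution has a long right tail.
skewness = toy.skew()
print(skewness.round(2))
```

On the real data the same call would be `df.skew(numeric_only=True)`; skewness alone does not prove normality, but a large value is enough to rule it out.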

In [None]:
df1 = df.copy()

In [None]:
df1 = df1.drop(columns=['is_business_travel_ready', 'is_location_exact', 'minimum_nights', 
                  'maximum_nights', 'require_guest_profile_picture', 'require_guest_phone_verification'])

In [None]:
df1.head(10)

In [None]:
# Checking the relationships between numerical variables (binary 0/1 variables excluded)

sns.pairplot(df1[['host_total_listings_count','accommodates','bathrooms','bedrooms',
                  'price','guests_included','extra_people','availability_365',
                  'number_of_reviews','time_since_host']], kind="scatter", diag_kind='kde')
plt.show()
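
A pairplot with ten variables is hard to read, so a correlation table can complement it. The sketch below is only an illustration on a synthetic frame (the data and the linear link between `accommodates` and `price` are assumptions made for the example). It ranks features by their Spearman correlation with `price`, which is less sensitive to extreme values than the default Pearson correlation.

```python
import numpy as np
import pandas as pd

# Synthetic data: price is built to depend on accommodates, availability is independent noise.
rng = np.random.default_rng(42)
accommodates = rng.integers(1, 7, size=500)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "price": 30 * accommodates + rng.normal(0, 20, size=500),
    "availability_365": rng.integers(0, 366, size=500),
})

# Rank-based correlation of every feature with price, strongest first.
price_corr = (
    toy.corr(method="spearman")["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(price_corr)
```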

________________________________________
### Distribution plots

In [None]:
numeric_df = df1[['host_total_listings_count','accommodates','bathrooms','bedrooms',
                  'price','guests_included','extra_people','availability_365',
                  'number_of_reviews','time_since_host']]

fig,axs=plt.subplots(2,5,figsize=(17,8))

for i in range(numeric_df.shape[1]):
    ax = axs[i//5,i%5]
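    # distplot is deprecated since seaborn 0.11; histplot with kde=True is the replacement
    sns.histplot(numeric_df.iloc[:,i], kde=True, stat="density", ax=ax)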
plt.show()

In [None]:
non_numerical_df = df1[df1.columns[df1.dtypes==object]]

fig, axs = plt.subplots(2,4, figsize=(17,10))

for i in range(non_numerical_df.shape[1]):
    ax=axs[i//4,i%4]
    ax.hist(non_numerical_df.iloc[:,i])
    ax.set_title(non_numerical_df.columns[i])
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

fig.delaxes(axs[1,3])   
fig.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(14,6))
plt.hist(non_numerical_df.neighbourhood_cleansed, bins=non_numerical_df.neighbourhood_cleansed.nunique())
plt.title('Neighborhoods frequency')
plt.xticks(rotation=60)
plt.show()

In [None]:
non_numerical_df.neighbourhood_cleansed.nunique()

______________________________
### Outliers

In [None]:
df1.groupby('neighbourhood_cleansed')['price'].describe().T

In [None]:
df2 = df1.copy()

In [None]:
plt.figure(figsize=(17,6))
sns.boxplot(x=df2[(df2.price>0)&(df2.price<350)].price)
plt.show()

In [None]:
# dropping outliers based on price

df2 = df2[(df2.price>0)&(df2.price<350)]
df2.shape
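
When rows are dropped with fixed thresholds, it helps to record how much data each filter removes. The helper below is a hypothetical sketch (`filter_with_report` is not part of the notebook or of pandas), shown on a toy frame with made-up prices and the same bounds as the cell above.

```python
import pandas as pd

def filter_with_report(frame, mask, label):
    """Keep the rows where mask is True and print how many rows were dropped."""
    kept = frame[mask]
    dropped = len(frame) - len(kept)
    print(f"{label}: dropped {dropped} of {len(frame)} rows ({dropped / len(frame):.1%})")
    return kept

# Toy prices, made up for illustration; the bounds mirror the price filter above.
toy = pd.DataFrame({"price": [0, 45, 80, 120, 349, 350, 10000]})
filtered = filter_with_report(toy, (toy.price > 0) & (toy.price < 350), "price")
```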

In [None]:
lst_order = df2.groupby('neighbourhood_cleansed').price.agg('mean').sort_values(ascending=False).index

plt.figure(figsize=(17,6))
sns.boxplot(x=df2.neighbourhood_cleansed, y=df2.price, order=lst_order)
plt.xticks(rotation=60)
plt.title('Most expensive neighborhoods',fontsize=14)
plt.savefig('../img/most_expensive_neighborhoods.png')
plt.show()

In [None]:
# Top 5 most expensive neighbourhoods

df2.groupby('neighbourhood_cleansed').price.agg('mean').sort_values(ascending=False).head(5)

In [None]:
# Checking outliers for bathrooms

plt.figure(figsize=(17,6))
sns.boxplot(x=df2.bathrooms)
plt.show()

In [None]:
# Dropping all listings with more than 3 bathrooms

bath_outliers = df2[df2.bathrooms>3].index
df2 = df2.drop(index = bath_outliers)
df2.shape

In [None]:
# Checking outliers for bedrooms

plt.figure(figsize=(17,6))
sns.boxplot(x=df2[df2.bedrooms<6].bedrooms)
plt.show()

In [None]:
# Dropping all listings with more than 4 bedrooms

bedrooms_outliers = df2[df2.bedrooms>4].index
df2 = df2.drop(index = bedrooms_outliers)
df2.shape

In [None]:
# Checking outliers for availability_365

plt.figure(figsize=(17,6))
sns.boxplot(x=df2.availability_365)
plt.show()

# ==> availability_365 = 0 would mean that the listing is always booked, which can happen in real life

In [None]:
df2[df2.availability_365>300]

In [None]:
# Checking outliers for accommodates

plt.figure(figsize=(17,6))
sns.boxplot(x=df2.accommodates)
plt.show()

In [None]:
df2[df2.accommodates>10]

# Listings that accommodate more than 10 people seem to have realistic data
# with regard to number of bathrooms and price

In [None]:
# Checking outliers for host_total_listings_count

# This column doesn't seem relevant because the majority of hosts have only 1 listing.
# Also, hosts with 0 listings mean there is something wrong with the data.

plt.figure(figsize=(17,6))
sns.boxplot(x=df2[df2.host_total_listings_count<20].host_total_listings_count)
plt.show()

In [None]:
df2[df2.host_total_listings_count==0]

In [None]:
# Dropping this column because it seems pointless to keep

df2 = df2.drop('host_total_listings_count',axis=1)
df2.shape

In [None]:
numeric_df = df2[['accommodates','bathrooms','bedrooms',
                  'price','guests_included','extra_people','availability_365',
                  'number_of_reviews','time_since_host']]

fig,axs=plt.subplots(2,5,figsize=(17,8))

for i in range(numeric_df.shape[1]):
    ax = axs[i//5,i%5]
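    # distplot is deprecated since seaborn 0.11; histplot with kde=True is the replacement
    sns.histplot(numeric_df.iloc[:,i], kde=True, stat="density", ax=ax)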
    
fig.delaxes(axs[1,4])
plt.show()

____________________________________
### Saving new dataframe

In [None]:
# Saving csv without dummies

df2.to_csv('../data/airbnb_paris_clean_wo_dummies.csv',index=False)

In [None]:
# Getting Dummies for feature engineering
df3 = df2.copy()
df3 = pd.get_dummies(df3,drop_first=True)
df3.shape
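
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame (the `room_type` column and its values are assumed for illustration and may not match the dataset). For each categorical column, pandas creates one indicator per category and then drops the first one in sorted order, which becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column; values are illustrative.
toy = pd.DataFrame({
    "price": [80, 45, 30],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})

# Numeric columns pass through unchanged; the categorical column becomes
# k-1 indicator columns, with the first category dropped.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A row with all indicators at zero therefore represents the dropped category, which avoids perfectly collinear columns in linear models.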

In [None]:
# Saving the csv with dummies

df3.to_csv('../data/airbnb_paris_clean_dummies.csv',index=False)

### Possible improvements 

- Compare each feature with the price (regplot)
- Other analysis
- More precise outlier cleaning by using z-score or IQR
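
As a starting point for the last item, the sketch below shows one common way to build IQR and z-score filters. The function names, the default `k=1.5` and `threshold=3.0`, and the toy prices are assumptions for illustration, not values tuned on this dataset.

```python
import pandas as pd

def iqr_mask(series, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

def zscore_mask(series, threshold=3.0):
    """True for values whose absolute z-score is at most the threshold."""
    z = (series - series.mean()) / series.std()
    return z.abs() <= threshold

# Toy prices with one extreme value, made up for illustration.
toy = pd.DataFrame({"price": [40, 55, 60, 70, 75, 80, 90, 95, 110, 10000]})

kept_iqr = toy[iqr_mask(toy["price"])]
kept_z = toy[zscore_mask(toy["price"])]
print(len(kept_iqr), len(kept_z))
```

On this tiny sample the IQR rule removes the 10000 value while the z-score rule keeps it, because the extreme value inflates the mean and standard deviation it is compared against. This is one reason the IQR rule is often preferred for heavy-tailed columns such as price.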