# Exploratory Data Analysis

This analysis uses the clean dataset, without dummy variables.

- Check the distribution of data and find out how the features relate to each other
- Identify outliers
- Define assumptions

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline
sns.set()

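# Use the fully qualified option name; the short form can be ambiguous in recent pandas versions
pd.set_option('display.max_columns', 71)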

In [None]:
df = pd.read_csv('../data/airbnb_paris_clean.csv')
df.head()

In [None]:
df.describe()

**Insights from describe:**

- Superhosts are a minority: fewer than 25% of hosts are superhosts.
- *host_total_listings_count* has potential outliers: the standard deviation is huge and the maximum is 1270 listings (about 2% of listings belong to the same host).
- Surprisingly, many hosts do not have a verified account: fewer than 50% of hosts are verified.
- Most listings have an exact location, so this column may not be significant and could be dropped.
- On average a listing accommodates 3 people, and more than 50% accommodate only 2. *accommodates* has outliers, with a listing that can accept 19 people.
- Most listings have one bathroom and one bedroom, which is consistent with the average number of people accommodated. There are also outliers with 50 bathrooms and bedrooms, which looks unrealistic and may indicate wrong data or a fake listing.
- Some listings have a price of 0€ per night, which should be impossible. There are also outliers, given the huge standard deviation and the maximum price of 10000€ per night.
- Most listings accept guests for free. Outliers are also visible in these two columns.
- The minimum and maximum nights columns do not seem reliable: one listing requires a minimum of 9999 nights, and the average maximum is 856 nights. Since France limits the number of nights a listing can be rented, this could be false data.
- On average, listings are available 80 days per year, which seems consistent with local legislation (120 days maximum for a primary residence and 365 days for private rooms).
- Most listings have few reviews (fewer than 20), and there are outliers here as well. Can we detect listings with fake comments?
- Most listings are not instant bookable: more than 50% do not allow the feature.
- *is_business_travel_ready* seems useless because no listing has the feature, so we can drop this column.
- Most listings do not require any guest verification (picture or phone), so these two columns should not have a large impact.
- *calculated_host_listings_count_private_rooms* and *calculated_host_listings_count_shared_rooms* show that more than 75% of hosts have no listings in private rooms or shared rooms, which may tell us something.
- Finally, hosts have been registered on Airbnb for 1728 days on average (4.73 years), and this variable seems to be normally distributed.
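
Several of the insights above point to outliers based on a large standard deviation or an extreme maximum. One common way to make that check explicit is the interquartile range (IQR) rule. The sketch below is only an illustration on made-up values: the helper `iqr_outlier_mask`, the factor `k=1.5`, and the toy prices are assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values standing in for a column such as `price` (made up for illustration)
toy = pd.DataFrame({"price": [0, 45, 60, 75, 80, 90, 110, 10000]})
toy["is_outlier"] = iqr_outlier_mask(toy["price"])
print(toy)
```

On this toy sample the rule flags the extreme maximum but not the price of 0, so impossible values would still need their own explicit check.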


**TL;DR**
- Superhosts and verified hosts are a minority. On average, hosts have been registered on Airbnb for more than 4 years.
- Most listings can host 2-3 people, have 1 bathroom and 1 bedroom, and accept 1 guest for free.
- We may drop the following columns: *is_business_travel_ready, is_location_exact, minimum_nights, maximum_nights, require_guest_profile_picture, require_guest_phone_verification*
- There are outliers in the dataset, and they must be dropped during pre-processing.
- Only *time_since_host* seems to be normally distributed.
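
The drop candidates listed above were picked by reading the `describe()` output. As a rough cross-check, one could measure how dominant the most frequent value is in each column. The sketch below runs on a toy frame: the helper `dominant_share`, the column `some_flag`, the values, and the 0.75 cut-off are all hypothetical.

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value in each column."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy data; values are made up and `some_flag` is a hypothetical column
toy = pd.DataFrame({
    "is_business_travel_ready": [0, 0, 0, 0, 0],
    "is_location_exact": [1, 1, 1, 1, 0],
    "some_flag": [0, 1, 0, 1, 0],
})

shares = dominant_share(toy)
threshold = 0.75  # hypothetical cut-off for "almost constant"
candidates = shares[shares > threshold].index.tolist()
print(shares)
print(candidates)
```

A column dominated by a single value carries little information, but whether to drop it remains a judgment call rather than something this threshold decides.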

In [None]:
df1 = df.copy()

In [None]:
df1 = df1.drop(columns=['is_business_travel_ready', 'is_location_exact', 'minimum_nights', 
                  'maximum_nights', 'require_guest_profile_picture', 'require_guest_phone_verification'])

In [None]:
df1.head()

In [None]:
sns.pairplot(df1, kind="scatter")
plt.show()

In [None]:
df1.groupby('neighbourhood_cleansed')['price'].describe().T

________________________________________
### Distribution plots

In [None]:
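# Select the categorical (object-dtype) columns
non_numerical_df = df1.select_dtypes(include='object')

fig, axs = plt.subplots(2, 4, figsize=(17, 6))

# One histogram per categorical column, limited to the 8 available axes
for ax, col in zip(axs.ravel(), non_numerical_df.columns):
    ax.hist(non_numerical_df[col].dropna())
    ax.set_title(col)

plt.tight_layout()
plt.show()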

______________________________
### Outliers

In [None]:
plt.figure(figsize=(17,6))
# df2 is only created in the next cell, so filter df1 here
sns.boxplot(x=df1[(df1.price > 0) & (df1.price < 350)].price)
plt.show()

In [None]:
df2 = df1.copy()

In [None]:
# dropping outliers based on price

df2 = df2[(df2.price>0)&(df2.price<350)]
df2.shape
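
Before committing to a fixed price window, it can help to see how much data the filter discards. The sketch below applies the same `0 < price < 350` condition to made-up prices and reports the share of rows removed; the toy values are an assumption for illustration only.

```python
import pandas as pd

# Toy prices (made up); the real notebook applies the same mask to df2
toy = pd.DataFrame({"price": [0, 40, 55, 70, 90, 120, 349, 350, 900, 10000]})

# Same condition as the filter above: strictly between 0 and 350
mask = (toy["price"] > 0) & (toy["price"] < 350)
kept = toy[mask]

removed_share = 1 - len(kept) / len(toy)
print(f"kept {len(kept)} of {len(toy)} rows ({removed_share:.0%} removed)")
```

Both bounds are strict, so a listing priced at exactly 350 is dropped along with the free and very expensive ones.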

In [None]:
lst_order = df2.groupby('neighbourhood_cleansed').price.agg('mean').sort_values(ascending=False).index

plt.figure(figsize=(17,6))
sns.boxplot(x=df2.neighbourhood_cleansed, y=df2.price, order=lst_order)
plt.xticks(rotation=60)
plt.show()

In [None]:
# Top 5 most expensive neighbourhoods (by mean price)

df2.groupby('neighbourhood_cleansed').price.agg('mean').sort_values(ascending=False).head(5)

In [None]:
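# Checking the distribution of the numerical features
fig, axs = plt.subplots(5, 4, figsize=(17, 6))

#y = df2.drop('price',axis=1)

# Iterate over columns (not rows) and draw on each Axes directly;
# plt.hist does not accept an `ax` argument
numeric_cols = df1.select_dtypes(include='number').columns
for ax, col in zip(axs.ravel(), numeric_cols):
    ax.hist(df1[col].dropna())
    ax.set_title(col)

plt.tight_layout()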