# **Airbnb Case Study 1 Notebook**

## Objectives

* Answer my client's first requirement. 
 * The client wants to know which three cities have the highest average daily rental price for an entire home/apartment.

## Inputs

* Use the processed data collected from Kaggle: `outputs/datasets/collection/AirbnbEuropeanCities.csv`. 

## Outputs

* Generate code that answers my client's first requirement.  

---

# Change working directory

Change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [None]:
import os
current_dir = os.getcwd()
current_dir

Set the parent of the current directory as the new current directory.
* `os.path.dirname()` gets the parent directory;
* `os.chdir()` sets the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

---

# Loading Data 

Load the data for the next steps of the analysis.

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/EuropeanCitiesAirbnb.csv")
df.head(10)

---

# Data Exploration 

Before starting the analysis to meet my client's need, I will check the variable types and their distributions to become more familiar with the dataset.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

# Processing and Analysing Data

### Processing Data

After analysing the results of the data exploration, the data will be processed to remove any errors and inaccuracies, so the analysis can start.

1. Handling Skewness:

    + Apply a logarithm transformation to `daily_price` to reduce its skewness and make it easier to analyse.

In [None]:
import numpy as np
df['log_daily_price'] = np.log1p(df['daily_price'])
df.head()
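To illustrate what the transformation does, here is a minimal sketch on a hypothetical price series (the values are illustrative and not taken from the dataset). It compares skewness before and after `np.log1p` and shows that `np.expm1` maps the values back to the original price scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (illustrative values, not from the dataset)
prices = pd.Series([80, 95, 110, 120, 150, 180, 220, 300, 900, 5000], dtype=float)

log_prices = np.log1p(prices)    # log(1 + x), also defined when x = 0
restored = np.expm1(log_prices)  # inverse transform, back to the price scale

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Averages computed on the log scale are not directly comparable to averages of raw prices, so any value reported to the client should be converted back or computed on `daily_price` itself.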

2. Filtering the Data:

    + Filter the rows on the `room_type` column, keeping only `Entire home/apt`, since my client is only interested in entire homes or apartments.

In [None]:
df_filtered = df[df['room_type'] == 'Entire home/apt']
df_filtered.head()

### Analysing Data

1. Calculating the Average Price:

    + Visualise the distribution of rental prices to identify potential outliers before making decisions;
    + Calculate the average daily prices;
    + Sort the prices in descending order to find the three best average daily rental prices for weekdays and weekends. In the `weekends` column, **1 = True** (weekend) and **0 = False** (weekday).

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=df_filtered, x='daily_price')
plt.title('Boxplot of Daily Prices')
plt.show()
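As a numeric complement to the box plot, the sketch below applies the 1.5 × IQR rule, which is the default whisker rule (`whis=1.5`) in seaborn's `boxplot`. The price values are hypothetical, and this rule is only one possible way to flag outliers, not necessarily the criterion used later in this notebook.

```python
import pandas as pd

# Hypothetical prices (illustrative values, not from the dataset)
prices = pd.Series([80, 95, 110, 120, 150, 180, 220, 300, 900, 5000], dtype=float)

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this value are drawn as fliers

outliers = prices[prices > upper_fence]
print(f"upper fence: {upper_fence:.2f}")
print(outliers.to_list())
```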

In [None]:
average_price = df_filtered.groupby(['city', 'weekends'])['daily_price'].mean().reset_index()
# Pivot the table to have weekends as columns for more clarity
avg_price_pivot = average_price.pivot(index='city', columns='weekends', values='daily_price')
# Have average prices sorted in descending order
best_prices = avg_price_pivot.sort_values(by=[0,1], ascending=False)
print(best_prices)
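Sorting by `[0, 1]` orders the table by the weekday average first and uses the weekend average only to break ties. A minimal sketch of ranking each column independently, on a hypothetical mini-dataset (city names and prices are illustrative), could look like this:

```python
import pandas as pd

# Hypothetical mini-dataset (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "weekends": [0, 1, 0, 1, 0, 1, 0, 1],
    "daily_price": [500.0, 520.0, 300.0, 310.0, 450.0, 440.0, 200.0, 210.0],
})

# Mean price per city, with one column per value of `weekends`
pivot = toy.pivot_table(index="city", columns="weekends",
                        values="daily_price", aggfunc="mean")

top3_weekdays = pivot[0].nlargest(3)  # ranked on weekday averages only
top3_weekends = pivot[1].nlargest(3)  # ranked on weekend averages only
print(top3_weekdays.index.to_list())
print(top3_weekends.index.to_list())
```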


> **After analysing the box plot, I identified some very high daily prices that could mislead my conclusions. I therefore decided to run a second calculation of the average price with values above `5000` filtered out, so that I can compare both results and conclude whether those high values are distorting my conclusions.**

In [None]:
# Filter the DataFrame to exclude daily prices over 5000
filtered_df_avg = df_filtered[df_filtered['daily_price'] <= 5000]
average_price2 = filtered_df_avg.groupby(['city', 'weekends'])['daily_price'].mean().reset_index()
avg_price_pivot2 = average_price2.pivot(index='city', columns='weekends', values='daily_price')
best_prices2 = avg_price_pivot2.sort_values(by=[0, 1], ascending=False)
print(best_prices2)
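The comparison between the two results can also be made explicit in code. The sketch below uses a hypothetical helper, `top_cities`, and illustrative data; for brevity it pools weekdays and weekends, unlike the pivot tables above.

```python
import pandas as pd

def top_cities(frame, n=3):
    """Return the n cities with the highest mean daily_price (hypothetical helper)."""
    return frame.groupby("city")["daily_price"].mean().nlargest(n).index.to_list()

# Hypothetical data with one extreme price in city "A" (illustrative values)
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "daily_price": [500.0, 9000.0, 300.0, 310.0, 450.0, 440.0, 200.0, 210.0],
})

top_all = top_cities(toy)                                  # all rows
top_capped = top_cities(toy[toy["daily_price"] <= 5000])   # prices above 5000 removed
same_cities = set(top_all) == set(top_capped)
print(top_all, top_capped, same_cities)
```

If `same_cities` is `True`, the extreme values change the averages but not which cities make the top three.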

2. Correlation Analysis

    + **Pearson correlation** will be analysed to get a sense of the linear relationships between price and the features of interest.

In [None]:
# Define the list of cities of interest
cities_of_interest = ['Amsterdam', 'Barcelona', 'London']
# Filter data to include only the relevant cities and columns of interest
relevant_columns = ['daily_price', 'bedrooms', 'city_center_dist_km', 'metro_dist_km', 'weekends', 'city']
df_subset = df_filtered[df_filtered['city'].isin(cities_of_interest)][relevant_columns]

# Calculate Pearson correlation (numeric_only=True skips the text column `city`)
pearson_correlation = df_subset.corr(method='pearson', numeric_only=True)
price_pearson_corr = pearson_correlation['daily_price'].sort_values(key=abs, ascending=False)[1:]
price_pearson_corr

3. Spearman Correlation:

    + **Spearman correlation** will be analysed to capture monotonic relationships, including non-linear ones, and because it is less sensitive to outliers.

In [None]:
# Calculate Spearman correlation (numeric_only=True skips the text column `city`)
spearman_correlation = df_subset.corr(method='spearman', numeric_only=True)
price_spearman_corr = spearman_correlation['daily_price'].sort_values(key=abs, ascending=False)[1:]
price_spearman_corr
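To show why both coefficients are worth checking, here is a minimal sketch on hypothetical data in which `y` grows exponentially with `x`. The relationship is perfectly monotonic but not linear, so Spearman reaches 1 while Pearson stays below it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y increases with x, but not linearly
toy = pd.DataFrame({"x": np.arange(1, 21, dtype=float)})
toy["y"] = np.exp(toy["x"] / 2)

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"pearson: {pearson:.3f}, spearman: {spearman:.3f}")
```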

For my analysis, I will consider the three strongest correlations from **Pearson** and **Spearman**.

In [None]:
best_corr = 3
set(price_pearson_corr[:best_corr].index.to_list() + price_spearman_corr[:best_corr].index.to_list())

In [None]:
vars_to_analysis = ['bedrooms', 'city_center_dist_km', 'metro_dist_km']
vars_to_analysis

---

# Performing Exploratory Data Analysis (EDA) on `vars_to_analysis`

1. Data Overview:

    + Confirm that there are no hidden `nulls` or `NaNs` outside the displayed counts;
    + Describe the data to get descriptive statistics. 

In [None]:
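# Assumption: `df_analysis` holds the selected variables plus the target, taken from `df_subset`
df_analysis = df_subset[vars_to_analysis + ['daily_price']]
print(df_analysis.isnull().sum())

In [None]:
df_analysis.describe()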

2. Data Visualization:

    + Charts were created to better interpret the analysis.

* Histograms were used to better understand the distribution of the numerical variables.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
variables_to_plot = vars_to_analysis + ['daily_price']
df_subset[variables_to_plot].hist(bins=30, figsize=(15, 10), layout=(2, 3))
plt.tight_layout()
plt.show()

* Scatter plots were used to explore the relationship between `vars_to_analysis` and `daily_price`.

In [None]:
# Loop through each variable in vars_to_analysis
for var in vars_to_analysis:
    plt.figure(figsize=(10, 6))
    plt.scatter(df_subset['daily_price'], df_subset[var], color='blue')
    plt.title(f'Scatter Plot of {var.capitalize()} vs. Daily Price')
    plt.xlabel('Daily Price')
    plt.ylabel(var.capitalize())
    plt.grid()
    plt.show()

* A bar plot was used to visualize the average prices per city, split by weekdays and weekends.

In [None]:
best_prices.plot(kind='bar', figsize=(10, 6), color=['grey', 'coral'])

# Adding labels and title
plt.title('Average Daily Price by City and Weekend')
plt.xlabel('City')
plt.ylabel('Average Daily Price')
plt.xticks(rotation=45)
plt.legend(title='Weekends', labels=['No', 'Yes'])
plt.grid(axis='y')
plt.show()

* Bar plots were used to visualize the average value of `vars_to_analysis` by `city`.

In [None]:
# Loop through each variable in vars_to_analysis
colors = ['darkblue', 'grey', 'coral']
for var in vars_to_analysis:
    # Calculate average for each city for the current variable
    average_price = df_subset.groupby('city')[var].mean().reset_index()
    # Sort the averages in descending order
    average_price = average_price.sort_values(by=var, ascending=False)
    # Create a bar plot
    plt.figure(figsize=(10, 6))
    plt.bar(average_price['city'], average_price[var], color=colors[:len(average_price)])
    # Adding labels and title specific to the variable
    plt.title(f'{var.capitalize()} Average by City')
    plt.xlabel('City')
    plt.ylabel(f'Average {var.capitalize()}')
    plt.xticks(rotation=45)
    plt.grid(axis='y')
    plt.show()

---

# Conclusion

After running code to check the variable types and their distributions, calculating the average `daily_price`, and analysing the visualisations to better interpret the dataset, I can conclude at this step of my analysis that:

  + The data exploration returned three alerts. Two of them concern the `bedrooms` and `weekends` variables; both were analysed and will not mislead the analysis. The third concerns the `daily_price` variable, which is highly skewed (γ1 = 21.41995499). A logarithm transformation was applied to `daily_price` to help normalize the data;

  + Although the dataset contains some very high `daily_price` values, they do not change my conclusion about the three best average prices;

  + After checking the `pearson` and `spearman` correlations between `daily_price` and the variables available in the dataset, I conclude that the variables most correlated with the price of Airbnb properties are `bedrooms`, `city_center_dist_km` and `metro_dist_km`, respectively;
  
  + Creating visualisations of the dataset helped me better understand the distribution of the numerical variables and interpret the results of my analysis.