#### Imports

In [None]:
import pandas as pd
import numpy as np
import warnings
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
import math
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Suppress warnings
warnings.filterwarnings("ignore")


#### Business Questions to Answer

Questions I want to answer:
 - What are the most popular Airbnb destinations?
 - What factors influence Airbnb rental costs?
 - What contributes to good Airbnb ratings?
 - Does the cancellation policy affect reviews?

#### Data Understanding:

In [None]:
listings = pd.read_csv("seattle/listings.csv")
print("\nShape of dataset: ",listings.shape,"\n")
listings.head(3)

In [None]:
listings.describe()

In [None]:
"""
The output of this cell shows that a number of columns and rows contain
NaN/null values and therefore need to be dropped or otherwise handled.
"""
listings.info()

#### Data Cleaning

The dataset has far more features than the analysis needs, so some columns (features) and rows have to be dropped. Each drop needs a clear justification.
The criteria used to decide what to drop are listed below:<br/>
<ol>
<li>Columns with a single unique value: these columns repeat the same value in every row, so they add no information to the dataset.</li><br/>
<li>Columns in which about 70% or more of the values are NaN, None or empty strings: that is too large a share of a column to impute sensibly, so keeping it would not make sense.</li><br/>
</ol>

Following the process above reduces the number of features from 92 to 80.
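The two criteria above can also be wrapped in a small helper and checked on a toy frame before running them on the real data. This is only a sketch: the function name `drop_uninformative_columns`, the parameter `max_missing_share` and the toy columns are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_uninformative_columns(df, max_missing_share=0.7):
    """Drop mostly-missing columns, then single-valued columns (hypothetical helper)."""
    # Treat empty strings and the literal "none" as missing values
    cleaned = df.replace(["", "none"], np.nan)

    # Share of missing values per column
    missing_share = cleaned.isna().mean()
    mostly_missing = missing_share[missing_share >= max_missing_share].index.tolist()
    cleaned = cleaned.drop(columns=mostly_missing)

    # Columns that hold one distinct non-null value
    n_unique = cleaned.nunique()
    constant = n_unique[n_unique == 1].index.tolist()
    cleaned = cleaned.drop(columns=constant)

    return cleaned, mostly_missing + constant

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, "A1"],  # 75% missing
    "country": ["US", "US", "US", "US"],        # one repeated value
    "price": [85.0, 60.0, 120.0, 75.0],
    "notes": ["", "none", "", "none"],          # entirely missing after replacement
})

cleaned_toy, dropped_toy = drop_uninformative_columns(toy)
print(dropped_toy)
print(cleaned_toy.columns.tolist())
```

Returning the list of dropped names alongside the cleaned frame makes it easy to record the reason for each drop.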

In [None]:
# Calculate the threshold count for dropping columns
threshold_col = len(listings) * 0.7

# Replace empty and "none" values with NaN
listings_replaced = listings.replace(['', 'none'], np.nan)

# Count the number of NaN values in each column
na_counts = listings_replaced.isna().sum()

# Get the column indices where the count exceeds or equals the threshold
columns_to_drop_1 = na_counts[na_counts >= threshold_col].index

# Drop the columns
listings = listings_replaced.drop(columns=columns_to_drop_1)

# Get the unique value counts for each column
value_counts = listings.nunique()

# Get the column names where all values are the same
columns_to_drop_2 = value_counts[value_counts == 1].index

# Drop the non-unique columns of this dataset
listings = listings.drop(columns=columns_to_drop_2)
columns_to_drop = columns_to_drop_1.to_list() + columns_to_drop_2.to_list()
print("Dropped columns are :", columns_to_drop)
print("listings shape: ", listings.shape)
listings.head(3)

The code below prints all the columns of the dataset.
Go through them manually to identify which features are relevant to the analysis and which are not.
This way we can drop more features and work with a more meaningful dataset.
<br/>

This means checking each column's data type and whether its values are relevant.
Most of the ID columns are not relevant to the analysis. Columns such as summary, description and space, which mostly hold free text or URLs that cannot be grouped into categories, should also be dropped: their values are unique for almost every row, so they offer nothing to study or analyse here.
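As a complement to the manual review, a rough heuristic can shortlist columns whose values are almost all distinct (URLs, free text, IDs stored as strings). The sketch below is an assumption-laden aid, not the procedure used in this notebook: the helper name `flag_high_cardinality_text`, the 0.9 cutoff and the toy columns are all hypothetical, and the shortlist should still be checked by eye.

```python
import pandas as pd

def flag_high_cardinality_text(df, min_unique_ratio=0.9):
    """Return non-numeric columns where nearly every non-null value is distinct."""
    flagged = []
    for col in df.columns:
        # Skip numeric (and boolean) columns; only text-like columns are of interest
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        non_null = df[col].dropna()
        if len(non_null) == 0:
            continue
        if non_null.nunique() / len(non_null) >= min_unique_ratio:
            flagged.append(col)
    return flagged

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "listing_url": ["https://example.com/1", "https://example.com/2",
                    "https://example.com/3", "https://example.com/4"],
    "summary": ["Cozy loft", "Sunny room", "Quiet studio", "Garden flat"],
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Private room"],
    "price": [85.0, 60.0, 120.0, 75.0],
})

candidates = flag_high_cardinality_text(toy)
print(candidates)
```

On the real data the call would be `flag_high_cardinality_text(listings)`. Numeric ID columns are skipped by this check and still need to be identified by name.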

In [None]:
# Remove non-numeric characters from the 'price' column
listings['price'] = listings['price'].apply(lambda x: re.sub(r'[^\d.]+', '', str(x)))

# Print the column names
print(listings.columns)

### Feature Selection & Visualizations:

In [None]:
# Select the columns of interest, including 'price'

columns_of_interest = [
    'host_is_superhost',
    'neighbourhood_cleansed',
    'accommodates',
    'bedrooms',
    'bathrooms',
    'is_location_exact',
    'review_scores_rating',
    'property_type',
    'room_type',
    'beds',
    'bed_type',
    'number_of_reviews',
    'instant_bookable',
    'review_scores_accuracy',
    'review_scores_cleanliness',
    'review_scores_communication',
    'review_scores_value',
    'cancellation_policy',
    'availability_365',
    'price'
]

len(columns_of_interest)

#### Heatmap showing correlation between selected non-categorical features

In [None]:
# Subset the data to include only the columns of interest
listings_relevant = listings.loc[:, columns_of_interest]

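# Remove non-numeric characters from the 'price' column and convert it to float
listings_relevant['price'] = listings_relevant['price'].apply(lambda x: re.sub(r'[^\d.]+', '', str(x))).astype(float)

# Calculate the correlation matrix over the numeric columns only
# (on pandas >= 2.0, non-numeric columns raise an error unless numeric_only=True is set)
correlation_matrix = listings_relevant.corr(numeric_only=True)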

# Visualize the correlations using a heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(correlation_matrix, cmap='Blues', annot=True, fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

Of the 20 features, only 12 appear in the heatmap above because the remaining ones are categorical and need to be treated differently.

### Dealing with categorical variables & improperly represented Boolean Variables in the listings dataset:

This is still part of the data cleaning process, since the dataset is not yet in the format needed for analysis. A few things remain:<br/>
<ol>
<li>Some boolean values are represented as the strings 't' or 'f'. These should be replaced with the numeric values 0 and 1.</li>
<li>Identify the categorical variables and treat them appropriately.</li>
</ol>
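The two steps above can be tried on a toy frame first. The sketch below shows one possible variant that keeps missing flags as `<NA>` through pandas' nullable `Int64` dtype instead of dropping those rows. The names `toy`, `flag_cols` and `categorical_cols` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", np.nan, "t"],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "beds": [1, 2, 1, 3],
})

# Step 1: find columns whose non-null values are all 't' or 'f'
# (a column that is entirely NaN would also pass this check)
flag_cols = [c for c in toy.columns if toy[c].dropna().isin(["t", "f"]).all()]

# Map 't'/'f' to 1/0; the nullable Int64 dtype keeps missing entries as <NA>
for c in flag_cols:
    toy[c] = toy[c].map({"t": 1, "f": 0}).astype("Int64")

# Step 2: the remaining non-numeric columns are the categorical ones
categorical_cols = [
    c for c in toy.columns
    if c not in flag_cols and not pd.api.types.is_numeric_dtype(toy[c])
]

print(flag_cols)
print(categorical_cols)
print(toy["host_is_superhost"].tolist())
```

Whether to keep or drop rows with missing flags depends on how many rows are affected. Dropping them is simpler when only a handful are missing.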


In [None]:
# Identify columns containing only 'f', 't' or NaN values
boolean_cols = [col for col in listings_relevant if listings_relevant[col].isin(['f','t', np.nan]).all()]

# Convert 'f' to False and 't' to True in the identified columns
listings_relevant[boolean_cols] = listings_relevant[boolean_cols].replace({'f': False, 't': True})

# Drop rows with NaN in any boolean column, then convert the True/False values to 0/1
listings_relevant = listings_relevant.dropna(subset=boolean_cols).astype({b: int for b in boolean_cols})

In [None]:
# Select the variables to plot by how frequently each value occurs
variables_to_plot = ['host_is_superhost', 'is_location_exact', 'property_type',
                     'room_type', 'bed_type', 'instant_bookable', 
                     'cancellation_policy','neighbourhood_cleansed']

# Plot the relative frequency of each value in every selected column
for column in variables_to_plot:
    status_city = listings_relevant[column].value_counts()
    (status_city/listings_relevant.shape[0]).plot(kind="bar");
    plt.xlabel(column)
    plt.title(f'Value count of {column}')
    plt.show()

In [None]:
# Strip any non-numeric characters from 'price'
# (this leaves the column as strings; it is cast back to float in a later cell)
listings_relevant['price'] = listings_relevant['price'].apply(lambda x: re.sub(r'[^\d.]+', '', str(x)))

# Select the variables whose relative value counts are plotted in a subplot grid
variables_to_plot = ['host_is_superhost', 'is_location_exact', 'property_type',
                     'room_type', 'bed_type', 'instant_bookable',
                     'cancellation_policy', 'neighbourhood_cleansed']

# Calculate the number of rows and columns for the subplots
n_plots = len(variables_to_plot)
n_rows = math.ceil(n_plots / 2)
n_cols = 2
# Create the subplots
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(15, 30))

# Iterate over the variables and create the subplots
for i, column in enumerate(variables_to_plot):
    row = i // n_cols
    col = i % n_cols

    # Plot the value counts
    status_city = listings_relevant[column].value_counts()
    (status_city / listings_relevant.shape[0]).plot(kind="bar", ax=axes[row, col])
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].set_title(f'Value count of {column}')

# Adjust the spacing between subplots
plt.subplots_adjust(hspace=0.5, wspace=0.3)  # Increase the spacing between subplots

# Show the plots
plt.show()

In [None]:
# Convert the 'price' column to float
listings_relevant['price'] = listings_relevant['price'].astype(float)

# Define the columns to group by
group_by_columns = ['host_is_superhost', 'is_location_exact', 'property_type',
                    'room_type', 'bed_type', 'instant_bookable', 'cancellation_policy','neighbourhood_cleansed']

for col in group_by_columns:
    # Group the data by the specified columns and calculate the average pricing
    grouped_data = listings_relevant.groupby(col)['price'].mean().reset_index()

    # Sort the data in ascending order based on the 'price' column
    sorted_data = grouped_data.sort_values('price')

    # Plot the grouped data
    plt.figure(figsize=(10, 6))
    sns.barplot(data=sorted_data, x=col, y='price', palette='viridis')
    plt.xlabel(f'{col}')
    plt.ylabel('Average Pricing')
    plt.title(f'Average Pricing by {col}')

    # Rotate the x-axis tick labels so long category names stay readable
    plt.xticks(rotation=90)

    plt.show()



In [None]:
# Calculate the frequency of each neighborhood
neighborhood_counts = listings_relevant['neighbourhood_cleansed'].value_counts()

# Plot the frequency as a pie chart
plt.figure(figsize=(8, 8))  # Adjust the figure size if needed
plt.pie(neighborhood_counts, labels=neighborhood_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Frequency of Neighborhoods')

# Show the pie chart
plt.show()

In [None]:
# Calculate the value counts of each neighborhood
neighborhood_counts = listings_relevant['neighbourhood_cleansed'].value_counts()

# Calculate the threshold for the top neighborhoods
threshold = int(len(neighborhood_counts) * 0.1)

# Get the top neighborhoods based on value counts
top_neighborhoods = neighborhood_counts.head(threshold)

# Filter the data for the top neighborhoods
filtered_data = listings_relevant[listings_relevant['neighbourhood_cleansed'].isin(top_neighborhoods.index)]

# Set the color palette for grouping
palette = sns.color_palette('viridis', len(filtered_data['neighbourhood_cleansed'].unique()))

# Create the bar chart of listing counts for the top neighbourhoods
plt.figure(figsize=(10, 6))
top_neighborhoods.plot(kind="bar")
# sns.stripplot(data=filtered_data, x='neighbourhood_cleansed', y='price', hue='neighbourhood_cleansed', palette=palette, dodge=True)
plt.xlabel('Neighbourhood Cleansed')
plt.ylabel('Number of listings')
plt.title('Bar chart: Listing counts for the top neighbourhoods')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
print("\n".join([f"{column} : {listings_relevant[column].nunique()}" for column in listings_relevant.select_dtypes(exclude=['int']).columns]))

In [None]:
listings_relevant = listings_relevant.dropna(axis=0)
# Filter columns that start with 'review_'
review_columns = listings_relevant.filter(regex=r'^review_')

# Create a new DataFrame with the review columns
reviews = listings_relevant[review_columns.columns]

# Create a new DataFrame without the review columns
listings_relevant = listings_relevant.drop(columns=review_columns.columns)

listings.info()

In [None]:
# The review columns were already moved into `reviews` in the previous cell.
# Filtering again here would find no 'review_' columns and overwrite `reviews`
# with an empty DataFrame, so only refresh `listings` from the cleaned frame.
listings = listings_relevant.copy()

The remaining columns in the calendar file do not have NaN values. However, the calendar CSV file and the reviews CSV file both have a column named "date". You may want to rename both, as they do not represent the same kind of date.
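A minimal sketch of that renaming is shown below. The two frames are tiny stand-ins built by hand, since the calendar and reviews files are not loaded in this part of the notebook. The frame names, the `listing_id` key and the new column names `calendar_date` and `review_date` are assumptions for illustration.

```python
import pandas as pd

# Hand-made stand-ins for the calendar and reviews CSV files (illustrative only)
calendar_df = pd.DataFrame({
    "listing_id": [1, 1],
    "date": ["2016-01-04", "2016-01-05"],
})
reviews_df = pd.DataFrame({
    "listing_id": [1],
    "date": ["2015-07-19"],
})

# Give each date column a name that says what kind of date it holds
calendar_df = calendar_df.rename(columns={"date": "calendar_date"})
reviews_df = reviews_df.rename(columns={"date": "review_date"})

# After renaming, a merge no longer produces ambiguous date_x / date_y columns
merged = calendar_df.merge(reviews_df, on="listing_id", how="left")
print(merged.columns.tolist())
```

Renaming before any merge keeps the meaning of each date explicit in every downstream cell.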

In [None]:
reviews.isna().sum()

In [None]:
listings_relevant.isna().sum()

In [None]:
# Create a new DataFrame for encoded data
encoded_data = listings_relevant.copy()

# Identify categorical variables based on data type
categorical_vars = listings_relevant.select_dtypes(include=['object']).columns.tolist()

# Perform one-hot encoding for categorical variables
encoded_data = pd.get_dummies(encoded_data, columns=categorical_vars, drop_first=True)

# Print the resulting encoded data
encoded_data

In [None]:
encoded_data.isna().sum().sum()

In [None]:
# Function to perform descriptive statistics and data visualization
def explore_data(dataset):
    """
    Perform descriptive statistics and data visualization for the Airbnb dataset.
    
    Args:
        dataset (pandas.DataFrame): The input dataset to explore.
    
    Returns:
        None
    """
    
    # Perform descriptive statistics
    summary_stats = dataset.describe()
    
    # Print the summary statistics
    print("Summary Statistics:")
    print(summary_stats)
    
    # Perform data visualization
    
    # Histogram of prices
    plt.figure(figsize=(10, 6))
    dataset['price'].hist(bins=30, color='blue', alpha=0.7)
    plt.xlabel('Price')
    plt.ylabel('Frequency')
    plt.title('Distribution of Prices')
    plt.show()
    
    # Scatter plot of price vs. number_of_reviews
    plt.figure(figsize=(10, 6))
    plt.scatter(dataset['number_of_reviews'], dataset['price'], color='green', alpha=0.5)
    plt.xlabel('Number of Reviews')
    plt.ylabel('Price')
    plt.title('Price vs. Number of Reviews')
    plt.show()

# Call the explore_data function with the loaded dataset
explore_data(listings)