# Exploratory Data Analysis 

## About the Data

The **Airbnb Listing Data 2023** dataset contains information about Airbnb listings in over 190 countries: the location of each listing, its price, the number of bedrooms and bathrooms, the amenities offered, and its reviews. The data was scraped from the Airbnb website using a web scraping tool.  

The dataset is a valuable resource for researchers and businesses interested in the short-term rental market. It can be used to analyze market trends, identify the most popular destinations for short-term rentals, and compare prices across locations.  

It is also useful for travelers looking for a place to stay, who can use it to find affordable and convenient accommodations in different locations.  

The dataset is available for download on Kaggle and is licensed under the Creative Commons Attribution 4.0 International License.

> https://www.kaggle.com/datasets/joyshil0599/airbnb-listing-data-for-data-science?select=airnb.csv

## 1. Project Understanding




**The Essence of Exploratory Data Analysis:**  

EDA is an iterative process that aims to discover patterns, detect anomalies, and extract insights from data. It involves examining the underlying structure, relationships, and distributions within a dataset before diving into complex modeling or hypothesis testing. By adopting a systematic approach to EDA, you can develop a comprehensive understanding of your data, leading to more effective problem-solving and actionable outcomes.
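
As one small illustration of "detecting anomalies", the sketch below applies the interquartile range (IQR) rule to a handful of made-up prices. The values and the `prices` name are hypothetical and are not taken from the Airbnb dataset; the 1.5 × IQR threshold is a common convention, not a requirement.

```python
import pandas as pd

# Hypothetical toy prices; not taken from the Airbnb dataset
prices = pd.Series([80, 95, 100, 110, 120, 125, 130, 2500], name="price")

# IQR rule: flag values far outside the middle 50% of the data
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values outside the [lower, upper] fences
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

A flagged value is only a candidate for review: it may be a data-entry error or a genuinely expensive listing.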

## 2. Data Cleaning and Preprocessing

### Data Collection

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# reads the data from the file
df = pd.read_csv('airnb.csv', encoding="ISO-8859-1")

### Data Description

In [None]:
# show the top 5 rows
df.head()

In [None]:
# Get the size of the data set
print("Data set size:", df.shape)

# Get the number of observations and the names and data types of the columns
print(df.info())

In [None]:
# show descriptive statistics
df.describe()

In [None]:
# show descriptive statistics transposed
df.describe().T

In [None]:
# show column names
df.columns

In [None]:
# select columns that are objects
df.select_dtypes(include=['object']).columns

In [None]:
# select columns that are integers
df.select_dtypes(include=['int']).columns

##### Missing Values  

In [None]:
# count and percentage of missing values per column
def missing_values_table(df):
    """Return the columns that have missing values, sorted by percentage missing."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table.columns = ['Missing Values', '% of Total Values']
    # keep only the columns that actually have missing values
    mis_val_table = (
        mis_val_table[mis_val_table['Missing Values'] != 0]
        .sort_values('% of Total Values', ascending=False)
        .round(1)
    )
    print(f"Your selected dataframe has {df.shape[1]} columns.\n"
          f"There are {mis_val_table.shape[0]} columns that have missing values.")
    return mis_val_table

missing_values_table(df)

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(df.isnull(),cbar=False,yticklabels=False)

##### Zero Values

In [None]:
# percentage of 0 values in each column
(df == 0).sum() / len(df) * 100

##### Duplicates

In [None]:
# Review how many duplicate rows are in the dataframe
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows:", duplicate_rows_df.shape[0])
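
Once duplicates have been counted, a typical next step is to remove them. The sketch below shows one way to do that on a toy frame; the `toy` DataFrame and its column names are hypothetical and only stand in for the real listings.

```python
import pandas as pd

# Hypothetical toy listings; column names are illustrative only
toy = pd.DataFrame({
    "title": ["Loft", "Cabin", "Loft", "Studio"],
    "price": [120, 90, 120, 75],
})

# Rows that are identical to an earlier row count as duplicates
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_dupes)
print(deduped)
```

Passing `subset=[...]` to `duplicated` or `drop_duplicates` restricts the comparison to chosen columns, which may be preferable when only some fields identify a listing.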

#### NaN Values

In [None]:
# count number of NaNs
df.isna().sum()

In [None]:
# Display rows that have at least one missing value
df.loc[df.isnull().any(axis=1)]

#### Unique Values

In [None]:
# Displays total number of unique values in each column.
df.nunique()

##### Unique Values for Category Data

In [None]:
# unique values for 'Number of bed'
df['Number of bed'].unique()

In [None]:
import plotly.express as px
# a histogram counts how many listings fall under each 'Number of bed' value
fig = px.histogram(df, x='Number of bed')
fig.show()
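
For a categorical column, a frequency table is often a useful companion to the chart. The sketch below uses `value_counts` on a made-up series; the example values are hypothetical and may not match the format actually stored in `Number of bed`.

```python
import pandas as pd

# Hypothetical values; the real 'Number of bed' column may be formatted differently
beds = pd.Series(
    ["1 bed", "2 beds", "1 bed", None, "3 beds", "1 bed"],
    name="Number of bed",
)

# dropna=False keeps missing entries as their own category
counts = beds.value_counts(dropna=False)

# normalize=True returns proportions instead of raw counts
shares = beds.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```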

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Step 1: Define the variables for the pair plot
variables = df.columns  # Adjust this based on the variables you want to include in the pair plot

# Step 2: Create the figure and subplots
fig = make_subplots(rows=len(variables), cols=len(variables), shared_xaxes=False, shared_yaxes=False)

# Step 3: Populate the subplots with scatter plots
for i, var_i in enumerate(variables):
    for j, var_j in enumerate(variables):
        if i == j:
            # Diagonal plot: histogram or other plot for individual variables
            fig.add_trace(go.Histogram(x=df[var_i], name=var_i, nbinsx=30), row=i + 1, col=j + 1)
        else:
            # Scatter plot for variable pair
            fig.add_trace(go.Scatter(x=df[var_i], y=df[var_j], mode='markers', name=var_i + ' vs ' + var_j), row=i + 1, col=j + 1)

# Step 4: Update subplot layout
fig.update_layout(
    title='Pair Plot',
    height=800,
    width=800,
)

# Step 5: Show the plot
fig.show()


In [None]:
# Step 1: Calculate the correlation matrix (numeric columns only; text columns cannot be correlated)
corr_matrix = df.corr(numeric_only=True)

# Step 2: Convert correlation matrix to Plotly DataFrame format
corr_df = pd.DataFrame(corr_matrix.stack(), columns=['correlation']).reset_index()
corr_df.columns = ['variable_1', 'variable_2', 'correlation']

# Step 3: Create Plotly heatmap
fig = go.Figure(data=go.Heatmap(
    x=corr_df['variable_1'],
    y=corr_df['variable_2'],
    z=corr_df['correlation'],
    colorscale='RdBu',
))

# Add labels and title to the plot
fig.update_layout(
    title='Correlation Matrix',
    xaxis=dict(title='Variable 1'),
    yaxis=dict(title='Variable 2'),
)

# Display the plot
fig.show()


#### Data Cleaning

In [None]:
# Replace NaN values in 'airline_sentiment_gold' with 'neutral'
# NOTE: the column names in these cells look carried over from an airline-tweets
# template; replace them with names from df.columns before running.
df['airline_sentiment_gold'] = df['airline_sentiment_gold'].fillna('neutral')

In [None]:
# Replace NaN values in 'negativereason' with 'no comment'
df['negativereason'] = df['negativereason'].fillna('no comment')

In [None]:
# Replace NaN values in 'negativereason_gold' with 'no comment'
df['negativereason_gold'] = df['negativereason_gold'].fillna('no comment')

In [None]:
# Replace NaN values in 'negativereason_confidence' with 0
df['negativereason_confidence'] = df['negativereason_confidence'].fillna(0)

In [None]:
# re-checking the count number of NaNs
df.isna().sum()
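
The fill pattern used in the cells above can be sketched on a toy frame. The `toy` DataFrame, its column names, and the fill values are hypothetical; the median is just one possible choice for a numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "amenities": ["wifi", None, "pool"],
    "rating": [4.5, np.nan, 3.9],
})

# Assign the result back rather than calling fillna(inplace=True) on a
# column selection, which newer pandas versions warn about
toy["amenities"] = toy["amenities"].fillna("unknown")
toy["rating"] = toy["rating"].fillna(toy["rating"].median())

# Re-check: every column should now report zero missing values
print(toy.isna().sum())
```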

Location is not needed for this project, so I will drop the columns that are not required.

##### Dropping Columns

In [None]:
# Drop columns from the DataFrame ('?' is a placeholder for the column names to remove)
columns_to_drop = ['?']
df = df.drop(columns=columns_to_drop)

In [None]:
# calling top 5 rows to review changes
df.head(5)

#### Changing Data Types

In [None]:
# changing data type ('?' is a placeholder; set it to the name of the date column)
date_col = '?'
df[date_col] = pd.to_datetime(df[date_col])

In [None]:
# review datatype changes
df.info()
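
Scraped date strings are not always well formed, so a conversion may fail on some rows. The sketch below, on a hypothetical `toy` frame with a made-up `date` column, uses `errors="coerce"` so that unparseable entries become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data; the last entry is deliberately not a valid date
toy = pd.DataFrame({"date": ["2023-05-01", "2023-06-15", "not a date"]})

# errors="coerce" turns values that cannot be parsed into NaT
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

print(toy.dtypes)
print("unparseable rows:", toy["date"].isna().sum())
```

Counting the `NaT` values afterwards shows how many rows were affected, which helps decide whether to fix or drop them.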

#### Feature Creation

In [None]:
# Split the datetime column into date and time columns
# (both parts come from the same datetime column, here 'date')
df['listing_date'] = df['date'].dt.date
df['listing_time'] = df['date'].dt.time
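
The same `.dt` accessor can derive other calendar features. The sketch below works on a hypothetical `toy` frame with a made-up `created` column; the feature names are illustrative and not part of the Airbnb dataset.

```python
import pandas as pd

# Hypothetical toy timestamps; 'created' is an illustrative column name
toy = pd.DataFrame({
    "created": pd.to_datetime(["2023-05-01 14:30:00", "2023-12-24 08:05:00"]),
})

# Derive several features from the single datetime column
toy["created_date"] = toy["created"].dt.date
toy["created_time"] = toy["created"].dt.time
toy["created_month"] = toy["created"].dt.month
toy["created_weekday"] = toy["created"].dt.day_name()

print(toy)
```

Month and weekday columns like these can be convenient for grouping when looking for seasonal or weekly patterns.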