<a href="https://colab.research.google.com/github/Gareth11-max/Airbnb/blob/main/Airbnb_EDA_project_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airbnb Bookings analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Write the summary here within 500-600 words.

# **Problem Statement**


**Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalised way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognised around the world. Analysing the data from the millions of listings on Airbnb is crucial for the company. These listings generate a large amount of data that can be analysed and used for security, business decisions, understanding the behaviour and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. Explore and analyse the data to discover key insights.**

#### **Define Your Business Objective?**

For an Airbnb bookings exploratory data analysis (EDA), the business objective is to understand key metrics, trends, and patterns that can inform decision-making.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/Airbnb.csv')
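
The guidelines above ask for exception handling. As a minimal sketch, the loading step could be wrapped in a helper that fails with a clear message when the file is missing or empty. The name `load_listings` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV and fail with a clear message if it is unusable.

    `load_listings` is a hypothetical helper, not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {path}")
    try:
        data = pd.read_csv(path)
    except pd.errors.EmptyDataError as exc:
        # Raised by pandas when the file contains no columns at all
        raise ValueError(f"Dataset at {path} is empty") from exc
    if data.empty:
        raise ValueError(f"Dataset at {path} has a header but no rows")
    return data
```

In Colab, after mounting Drive, this could be called as `df = load_listings('/content/drive/My Drive/Airbnb.csv')`.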


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
row,columns=df.shape
print(f"The Number of columns in the dataset is {columns} and rows is {row}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count=df.duplicated().sum()
duplicate_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count = df.isnull().sum()
missing_values_count


In [None]:
# Plot a heatmap of missing values
plt.figure(figsize=(10, 6))  # Set the size of the figure
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', annot=False, fmt='d')
plt.title("Missing Values Heatmap")
plt.show()
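
The heatmap shows where values are missing but not how many. A small sketch, using a toy DataFrame in place of the real data, of how the counts could be turned into percentages per column. The helper name `missing_summary` is an assumption for illustration.

```python
import pandas as pd


def missing_summary(data):
    """Return count and percentage of missing values for columns that have any."""
    counts = data.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(data) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy frame standing in for the real listings data
toy = pd.DataFrame({
    "price": [100, 150, 80, 200],
    "last_review": ["2019-05-01", None, None, "2019-06-10"],
    "reviews_per_month": [0.5, None, None, None],
})
print(missing_summary(toy))
```

On the notebook's data the same call would be `missing_summary(df)`.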

### What did you know about your dataset?



*   Rows: 48895
*   Columns: 16
*   Missing values are in the columns last_review and reviews_per_month.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Columns in the dataset:")
print(df.columns)

In [None]:
# Dataset Describe
print("Descriptive Statistics:")
print(df.describe())

### Variables Description

*   id: Unique identifier for each Airbnb listing.
*   name: Name of the Airbnb listing, typically descriptive (e.g., "Cozy 1-bedroom apartment").
*   price: The price for one night’s stay at the listing (numeric, e.g., $100).
*   neighbourhood_group, neighbourhood, latitude, longitude: The geographical location of the listing (e.g., the neighbourhood group "Manhattan").
*   room_type: The type of room available for rent (categorical, e.g., "Entire home/apt", "Private room").
*   reviews_per_month: The average number of reviews the listing receives each month (numeric, may have missing values).
*   availability_365: The number of days the listing is available to book in the year (numeric, ranges from 0 to 365).



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Get the count of unique values for each column
unique_counts = df.nunique()

# Display the result
print("Unique values count for each column:")
print(unique_counts)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Data wrangling on the DataFrame loaded in the "Dataset Loading" cell
# (the imports and the Drive mount are not repeated here)

# Remove exact duplicate rows
rows_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"Duplicate rows removed: {rows_before - len(df)}")

# Fill missing values in numerical columns with the column median
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill missing values in categorical (object) columns with the column mode
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    mode_values = df[col].mode()
    if df[col].isnull().any() and not mode_values.empty:
        df[col] = df[col].fillna(mode_values.iloc[0])

# Confirm that no missing values remain
print("Missing values after cleaning:")
print(df.isnull().sum())


### What all manipulations have you done and insights you found?

**Data Manipulations**:

*   **Loading the Dataset**: The dataset is loaded from a CSV file into a Pandas DataFrame (df).

**Initial Exploration**:

*   **Data Structure**: Checked the shape of the dataset (number of rows and columns).
*   **Column Names and Data Types**: Used .info() to get an overview of each column’s data type (numeric, categorical) and checked for missing values.

**Data Cleaning**:

*   **Missing Values**: Identified and filled missing values:
    *   Numerical columns: Filled with the median value (common practice when data is skewed).
    *   Categorical columns: Filled with the mode (most frequent value).
*   **Duplicate Rows**: Checked for duplicate rows and removed any duplicates using .drop_duplicates().

**Exploratory Data Analysis (EDA)**:

*   **Descriptive Statistics**: Used .describe() to get a summary of the numerical columns, including mean, median, standard deviation, and percentiles.
*   **Unique Values**: Displayed the number of unique values for each variable to identify the diversity of values (especially for categorical columns like room_type).

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], kde=True, bins=50)
plt.title("Price Distribution")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram shows how often each price range occurs, which makes it well suited to displaying the price distribution.


##### 2. What is/are the insight(s) found from the chart?

Lower-priced properties are far more frequent, and there are very few properties priced above $700.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it can create a positive business impact, because Airbnb can now see the price range in which most of its properties lie.
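
The observation about the long right tail can be backed with numbers. A minimal sketch on toy prices, with the $700 threshold taken from the insight above; `price_tail_summary` is a hypothetical helper name.

```python
import pandas as pd


def price_tail_summary(prices, threshold=700):
    """Summarise how much of a price column lies above `threshold`."""
    prices = pd.Series(prices).dropna()
    above = (prices > threshold).sum()
    return {
        "median": float(prices.median()),
        "p95": float(prices.quantile(0.95)),
        "count_above": int(above),
        "share_above_pct": round(float(above) / len(prices) * 100, 2),
    }


# Toy prices standing in for df['price']
toy_prices = [50, 80, 100, 120, 150, 200, 250, 300, 900, 5000]
print(price_tail_summary(toy_prices))
```

On the notebook's data the call would be `price_tail_summary(df['price'])`.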

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='price', data=df)
plt.title("Price vs Room Type")
plt.xlabel("Room Type")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

We can easily get insights and compare the prices for different room types on box plot.

##### 2. What is/are the insight(s) found from the chart?

We can see from the graph that the entire home/apt room type has the highest prices, whereas the shared room type has the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can see the price distribution of various rooms.
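
A box plot gives a visual comparison; a grouped summary makes the same comparison in numbers. The sketch below uses a toy DataFrame, and `price_by_group` is an illustrative name rather than part of the notebook.

```python
import pandas as pd


def price_by_group(data, group_col, value_col="price"):
    """Median, mean and count of `value_col` for each level of `group_col`."""
    return (
        data.groupby(group_col)[value_col]
        .agg(["median", "mean", "count"])
        .sort_values("median", ascending=False)
    )


# Toy frame standing in for the real listings data
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room"],
    "price": [200, 300, 80, 100, 40],
})
print(price_by_group(toy, "room_type"))
```

The same helper could be applied to neighbourhood_group, for example `price_by_group(df, 'neighbourhood_group')`.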

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
sns.boxplot(x='neighbourhood_group', y='price', data=df)
plt.title("Price vs Neighbourhood Group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

We can easily get insights and compare the prices for different neighbourhoods on box plot.

##### 2. What is/are the insight(s) found from the chart?

Neighbourhood groups such as Manhattan and Brooklyn have the highest prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can see how prices are distributed across the various neighbourhood groups.


#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 8))
sns.scatterplot(x='longitude', y='latitude', hue='price', palette='coolwarm', size='price', sizes=(20, 200), data=df)
plt.title("Price vs Latitude and Longitude")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot to see how price is distributed across different geographical coordinates

##### 2. What is/are the insight(s) found from the chart?

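Most properties fall in the lowest price band of the colour scale, far below $2000, which agrees with Chart 1, where very few listings are priced above $700.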

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can easily see the price points across the city and where most property prices lie.
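
When a few very expensive listings stretch the colour and size scales, most points look identical. One possible remedy, sketched here under the assumption that capping at an upper quantile is acceptable for display purposes, is to clip the price before mapping it to hue. `clip_at_quantile` is a hypothetical helper.

```python
import pandas as pd


def clip_at_quantile(values, upper_q=0.99):
    """Cap values at the `upper_q` quantile so a few extremes do not dominate."""
    values = pd.Series(values, dtype="float64")
    cap = values.quantile(upper_q)
    return values.clip(upper=cap)


# Toy prices with one extreme value
toy_prices = pd.Series([50, 60, 70, 80, 10000])
print(clip_at_quantile(toy_prices, upper_q=0.75).tolist())
```

In the notebook this could be used as `df.assign(price_capped=clip_at_quantile(df['price']))`, with `hue='price_capped'` in the scatter plot. The capped column is for plotting only; the original price column stays unchanged.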

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='availability_365', y='price', data=df)
plt.title("Availability vs Price")
plt.xlabel("Availability (365 days)")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot accurately shows the relationship between availability_365 and price.

##### 2. What is/are the insight(s) found from the chart?

Listings at both extremes of availability, close to 0 days and close to 365 days, include the highest prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can see from this chart that the number of available days appears to be related to the price.
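
A scatter plot suggests a relationship but does not measure it. As a hedged sketch, a Spearman rank correlation (computed here as the Pearson correlation of the ranks, which needs only pandas) gives a single number for how monotonic the relationship is. The helper name `rank_correlation` and the toy data are assumptions for illustration.

```python
import pandas as pd


def rank_correlation(data, x_col, y_col):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    pair = data[[x_col, y_col]].dropna()
    ranks = pair.rank()  # ties receive their average rank
    return float(ranks[x_col].corr(ranks[y_col]))


# Toy frame in which price rises steadily with availability
toy = pd.DataFrame({
    "availability_365": [0, 30, 90, 180, 365],
    "price": [50, 60, 80, 120, 400],
})
print(rank_correlation(toy, "availability_365", "price"))
```

The same call could be repeated for the other price scatter plots, for example `rank_correlation(df, 'minimum_nights', 'price')`. A value near 0 would indicate that the pattern seen in the plot is driven by a few extreme points rather than a general trend.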

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='reviews_per_month', y='price', data=df)
plt.title("Reviews per Month vs Price")
plt.xlabel("Reviews per Month")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot accurately shows the relationship between reviews per month and price.

##### 2. What is/are the insight(s) found from the chart?

Most properties get fewer than 10 reviews per month, and the highest-priced properties have the fewest reviews per month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The highest prices have the lowest reviews per month. This can be because of low bookings of the highest price properties.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='minimum_nights', y='price', data=df)
plt.title("Minimum Nights vs Price")
plt.xlabel("Minimum Nights")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot accurately shows the relationship between minimum nights and price.

##### 2. What is/are the insight(s) found from the chart?

The highest-priced listings have a low number of minimum nights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The highest-priced listings have a low number of minimum nights.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='number_of_reviews', y='price', data=df)
plt.title("Number of Reviews vs Price")
plt.xlabel("Number of Reviews")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot to explore the relationship between number_of_reviews and price.



##### 2. What is/are the insight(s) found from the chart?

The highest-priced listings have the lowest number of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The highest-priced listings have the lowest number of reviews. This may be because the highest-priced properties receive few bookings.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='number_of_reviews', data=df)
plt.title("Number of Reviews vs Room Type")
plt.xlabel("Room Type")
plt.ylabel("Number of Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot to show the distribution of number_of_reviews across different room_type.

##### 2. What is/are the insight(s) found from the chart?

Private rooms have the most reviews and shared rooms have the fewest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Private rooms have the most reviews and shared rooms have the fewest. Campaigns can be run to increase shared room reviews.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='calculated_host_listings_count', y='price', data=df)
plt.title("Price vs Host Listings Count")
plt.xlabel("Host Listings Count")
plt.ylabel("Price")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot showing how calculated_host_listings_count influences price.

##### 2. What is/are the insight(s) found from the chart?

When the host listing count is low, the prices are highest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This comparison can tell us the relationship between host listings count and price.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
top_hosts = df['host_name'].value_counts().nlargest(10).index
filtered_df = df[df['host_name'].isin(top_hosts)]

plt.figure(figsize=(12, 6))
sns.barplot(x='host_name', y='price', data=filtered_df)
plt.title("Price vs Host Name")
plt.xlabel("Host Name")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot to show how the price varies by host_name (top hosts).

##### 2. What is/are the insight(s) found from the chart?

The chart shows the average listing price for each of the ten most frequent host names.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Among the top hosts, Blueground has the highest price, so the commission from this host will be the highest per listing.
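
Counting by host_name can merge different hosts who happen to share a first name. A sketch of the same ranking grouped by a host identifier instead; the column name `host_id` is an assumption about the dataset, and `top_hosts_summary` is a hypothetical helper.

```python
import pandas as pd


def top_hosts_summary(data, n=10):
    """Listing count and mean price for the `n` hosts with the most listings.

    Groups by `host_id` (assumed column) so that different hosts who share a
    name are not merged into one bar.
    """
    return (
        data.groupby("host_id")
        .agg(host_name=("host_name", "first"),
             listings=("price", "count"),
             mean_price=("price", "mean"))
        .sort_values("listings", ascending=False)
        .head(n)
    )


# Toy frame: two different hosts are both called "Alex"
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "host_name": ["Alex", "Alex", "Alex", "Sam", "Sam", "Sam"],
    "price": [100, 200, 50, 300, 300, 600],
})
print(top_hosts_summary(toy, n=2))
```

In the toy data, grouping by name would report three listings for "Alex", while grouping by `host_id` keeps the two hosts apart.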

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(12, 6))
sns.boxplot(x='neighbourhood', y='reviews_per_month', data=df)
plt.title("Reviews per Month vs Neighbourhood")
plt.xlabel("Neighbourhood")
plt.ylabel("Reviews per Month")
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot to see how reviews_per_month varies across different neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how reviews_per_month varies across different neighbourhoods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here
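
With one box per neighbourhood, the x-axis becomes crowded and hard to read. A possible approach, mirroring the top-10 filter used for hosts in Chart 11, is to keep only the most frequent neighbourhoods before plotting. `keep_top_categories` is an illustrative name and the default of 15 is an arbitrary choice.

```python
import pandas as pd


def keep_top_categories(data, col, n=15):
    """Return only the rows whose `col` value is among the `n` most frequent."""
    top = data[col].value_counts().nlargest(n).index
    return data[data[col].isin(top)]


# Toy frame with three neighbourhoods of different sizes
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B", "C"],
    "reviews_per_month": [1.0, 2.0, 0.5, 3.0, 1.5, 0.2],
})
print(keep_top_categories(toy, "neighbourhood", n=2))
```

The filtered frame could then be passed to the plot, for example `sns.boxplot(x='neighbourhood', y='reviews_per_month', data=keep_top_categories(df, 'neighbourhood'))`.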

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='room_type', data=df)
plt.title("Room Type Distribution")
plt.xlabel("Room Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot showing the count of each room_type in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The counts of private rooms and entire homes are each more than 20000, while the shared room count is less than 5000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is a huge disparity between the counts of private rooms and entire homes and the count of shared rooms.
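
The disparity is easier to communicate as a share of all listings. A minimal sketch on toy data; `category_share` is a hypothetical helper.

```python
import pandas as pd


def category_share(data, col):
    """Count and percentage share of each category in `col`."""
    counts = data[col].value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share})


# Toy frame: 5 entire homes, 4 private rooms, 1 shared room
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"],
})
print(category_share(toy, "room_type"))
```

On the notebook's data the call would be `category_share(df, 'room_type')`.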

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select the numerical columns
numerical_df = df.select_dtypes(include=['float64', 'int64'])

# Compute correlation matrix
correlation_matrix = numerical_df.corr()

# Set up the matplotlib figure
plt.figure(figsize=(12, 8))

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)

# Add title and labels
plt.title("Correlation Heatmap", fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

To find the correlation between different numerical variables.

##### 2. What is/are the insight(s) found from the chart?

 This visualization highlights relationships between numerical variables in your dataset.
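
To turn the heatmap into specific findings, the strongest pairs can be listed directly from the correlation matrix. The sketch below uses toy data and a hypothetical helper name, `strongest_correlations`.

```python
import numpy as np
import pandas as pd


def strongest_correlations(data, top_n=5):
    """List the variable pairs with the largest absolute Pearson correlation."""
    corr = data.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once (no self-pairs)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)


# Toy frame: b is exactly twice a, c moves roughly opposite to a
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
print(strongest_correlations(toy, top_n=3))
```

On the notebook's data, `strongest_correlations(df)` would list the pairs worth discussing in the insight above.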

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code


# Select numerical columns for pairplot
numerical_columns = ['price', 'minimum_nights', 'number_of_reviews',
                     'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

# Create the pair plot
sns.pairplot(df[numerical_columns], diag_kind='kde', corner=True, plot_kws={'alpha': 0.5})

# Add a title to the plot
plt.suptitle("Pair Plot of Numerical Variables", y=1.02, fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is an excellent way to visualize the relationships between multiple numerical variables in your dataset. It displays scatterplots for every pair of variables and, with diag_kind='kde' as used here, density curves for individual variables on the diagonal.

##### 2. What is/are the insight(s) found from the chart?

Scatterplots for every pair of variables and density (KDE) curves for individual variables on the diagonal.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve Airbnb's business objectives, I suggest the following actionable strategies tailored to the client based on the insights gained from data analysis and the platform's operational goals:

1. **Optimize Pricing Strategies**

    *   **Dynamic Pricing Models:** Encourage hosts to use dynamic pricing tools that adjust rates based on demand, seasonality, and competitor pricing in the area.
    *   **Highlight Competitive Listings:** Provide insights to hosts on pricing trends in their neighborhood to ensure competitive positioning.
    *   **Offer Discounts:** Suggest promotional discounts for new listings or underperforming properties to attract bookings.

2. **Enhance Host Support and Engagement**

    *   **Training and Resources:** Offer webinars, tutorials, or guides to help hosts improve their listings (e.g., better descriptions, professional photos).
    *   **Feedback Integration:** Analyze guest reviews to identify areas where hosts can improve (e.g., cleanliness, communication).
    *   **Loyalty Programs:** Introduce incentives for top-performing hosts, such as reduced fees or increased visibility for highly-rated properties.

3. **Improve Guest Experience**

    *   **Personalized Recommendations:** Use machine learning to suggest accommodations and experiences tailored to individual preferences (e.g., based on past bookings or search behavior).
    *   **Flexible Options:** Highlight listings with flexible cancellation policies or “long-term stay” discounts to cater to evolving traveler needs.
    *   **Trust and Safety:** Invest in more robust verification processes, guest support, and emergency assistance to enhance trust.

# **Conclusion**

The analysis of Airbnb’s dataset provided valuable insights into its operations and potential areas of improvement. By focusing on the core aspects of the business—hosts, guests, pricing strategies, and market trends—we can outline actionable recommendations to achieve Airbnb’s business objectives effectively.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***