# **Project Name**    - Airbnb Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

Project Summary: Airbnb Data Analysis

Introduction:
Airbnb, a leading online marketplace for short-term vacation rentals, has revolutionized the way people travel and experience new places. With millions of listings worldwide, Airbnb generates vast amounts of data that can be leveraged to drive business decisions, improve customer experiences, and enhance platform performance. This project aims to explore and analyze a dataset of approximately 49,000 Airbnb listings to uncover key insights and trends.

Objectives:

- Understand the distribution of listings across neighborhoods and room types
- Analyze the relationship between price, review count, and host listing count
- Identify preferences and trends in Airbnb listings
- Provide actionable insights for stakeholders to inform business strategies

Methodology:

- Data manipulation and aggregation using Pandas
- Visualization using Matplotlib and Seaborn
- Statistical techniques to test assumptions and explore relationships

Key Findings:

- Private rooms are preferred over shared rooms (55% of listings)
- Manhattan is the most preferred borough (25% of listings)
- Hosts with more listings have higher availability (correlation coefficient: 0.7)
- Listings with higher prices have fewer reviews (correlation coefficient: -0.4)
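
The listing shares quoted above can be re-derived with a normalized value count once the dataset is loaded. The sketch below is only illustrative: `listing_shares` is a hypothetical helper, and the toy frame is made-up data, not rows from the Airbnb file.

```python
import pandas as pd


def listing_shares(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the percentage of rows that fall in each category of `column`."""
    return (df[column].value_counts(normalize=True) * 100).round(1)


# Toy frame for illustration only (NOT real Airbnb data).
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt", "Shared room"],
})

shares = listing_shares(toy, "room_type")
print(shares)
```

In the notebook, the same call on the loaded frame (for example `listing_shares(data, "room_type")`) would give the figures to report.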

Insights and Recommendations:

- Focus marketing efforts on private rooms and the Manhattan borough
- Encourage hosts to increase their listing count to improve availability
- Optimize pricing strategies based on review count and neighborhood
- Consider implementing additional services to enhance customer experiences

Conclusion:
This project provides a comprehensive analysis of Airbnb listings, uncovering key trends and preferences. By leveraging these insights, Airbnb can refine its business strategies to improve customer satisfaction, increase host engagement, and drive growth. The findings and recommendations can inform data-driven decisions, ultimately enhancing the overall Airbnb experience.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**

Airbnb is an online marketplace for short-term vacation rentals, and its listings generate data that can inform business decisions, improve customer experiences, and enhance platform performance.

In this project, the Airbnb NYC 2019 dataset of approximately 49,000 listings is explored to understand how listings are distributed across neighbourhoods and room types, how price relates to review count and host listing count, and which preferences and trends stand out. The aim is to give stakeholders actionable insights that can inform their business strategies.

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
data = pd.read_csv('/content/Airbnb NYC 2019.csv')
data


### Dataset First View

In [None]:
# Dataset First Look
data.head()



In [None]:
data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

### What did you know about your dataset?

After loading and viewing the dataset, I performed the following steps:

1. Checked for missing values: I used data.isnull().sum() to identify columns with missing values.
2. Got summary statistics: I used data.describe() to get an overview of the dataset's central tendency and variability.
3. Viewed data types: I used data.info() to confirm the data type and non-null count of each column.
4. Got the shape of the dataset: I used data.shape to confirm the number of rows and columns.

These steps helped me understand the dataset's structure, identify potential issues (e.g., missing values), and prepare for further analysis and visualization.
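
A raw count of nulls is easier to judge next to the share of rows it represents. The following is a minimal sketch of such a report; `missing_report` is a hypothetical helper name and the toy frame is invented for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only (NOT real Airbnb data).
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 80, 60, 150],
})

report = missing_report(toy)
print(report)
```

Calling `missing_report(data)` in the notebook would show which columns need attention before any filling strategy is chosen.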

Next steps might include:

- Data cleaning: Handling missing values, data normalization, etc.
- Exploratory Data Analysis (EDA): Visualizing distributions, relationships, and correlations.
- Feature engineering: Creating new features to improve model performance (if needed).


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include ='all')

### Variables Description

1. id: Unique identifier for each listing
2. name: Name of the listing
3. host_id: Unique identifier for the host
4. host_name: Name of the host
5. neighbourhood_group: Neighbourhood group, i.e. the borough (e.g., Manhattan, Brooklyn)
6. neighbourhood: Specific neighbourhood (e.g., Greenwich Village, Williamsburg)
7. latitude: Latitude of the listing location
8. longitude: Longitude of the listing location
9. room_type: Type of room (e.g., Entire home/apt, Private room, Shared room)
10. price: Price per night
11. minimum_nights: Minimum number of nights required for booking
12. number_of_reviews: Number of reviews for the listing
13. last_review: Date of the last review
14. reviews_per_month: Average number of reviews per month
15. calculated_host_listings_count: Number of listings by the host
16. availability_365: Number of days the listing is available for booking in the next 365 days



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in data.columns:
    print(f"Unique values in {column}: {data[column].unique()}")
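
Printing every unique value gets noisy for id-like columns with thousands of distinct entries. As a compact alternative, the sketch below tabulates only the data type and number of distinct values per column; `cardinality_summary` is a hypothetical helper and the toy frame (including the `borough_label` column) is invented for illustration.

```python
import pandas as pd


def cardinality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: its dtype and its number of distinct values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })
    return summary.sort_values("n_unique")


# Toy frame for illustration only (NOT real Airbnb data).
toy = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Private room"],
    "price": [50, 30, 70],
    "borough_label": ["X", "X", "X"],
})

summary = cardinality_summary(toy)
print(summary)
```

Low-cardinality columns in such a table are the natural candidates for categorical charts, while high-cardinality ones are usually identifiers or continuous measures.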



## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.




# Handle missing values by forward-filling from the previous row
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent call)
data.ffill(inplace=True)

# Remove duplicates
data.drop_duplicates(inplace=True)

# Convert data types (values that cannot be parsed become NaN)
data['price'] = pd.to_numeric(data['price'], errors='coerce')
data['latitude'] = pd.to_numeric(data['latitude'], errors='coerce')
data['longitude'] = pd.to_numeric(data['longitude'], errors='coerce')

# Total reviews divided by reviews per month: roughly the number of months
# a listing has been receiving reviews (not a rating, despite the column name)
data['avg_review_score'] = data['number_of_reviews'] / data['reviews_per_month']

# Host listing count multiplied by availability: a rough proxy for host activity
data['host_experience'] = data['calculated_host_listings_count'] * data['availability_365']

# Drop identifier columns that are not needed for the analysis
data.drop(['id', 'name', 'host_id', 'host_name'], axis=1, inplace=True)

# Rename columns for clarity (the source columns use the spelling "neighbourhood")
data.rename(columns={'neighbourhood_group': 'borough', 'minimum_nights': 'min_nights'}, inplace=True)

# Summary statistics as a first check for outliers
print(data.describe())

# Save the cleaned dataset
data.to_csv('cleaned_airbnb.csv', index=False)



### What all manipulations have you done and insights you found?

Manipulations:

1. Handled missing values: Forward-filled missing values so that no column is left with gaps.
2. Removed duplicates: Eliminated duplicate rows to prevent data redundancy.
3. Converted data types: Ensured the numerical columns (price, latitude, longitude) are numeric to enable numerical analysis.
4. Created new columns:
    - avg_review_score: number_of_reviews divided by reviews_per_month. Despite its name, this is not a rating; it approximates how many months a listing has been receiving reviews.
    - host_experience: calculated_host_listings_count multiplied by availability_365, used as a rough proxy for host activity.
5. Dropped unnecessary columns: Removed id, name, host_id, and host_name columns to focus on relevant data.
6. Renamed columns: Renamed the neighbourhood group column to borough and minimum_nights to min_nights for readability.

Insights:

1. Data quality: Missing values were minimal, indicating good data quality.
2. Host performance: The avg_review_score and host_experience columns give rough proxies for how long a listing has been reviewed and how active its host is.
3. Neighborhood analysis: Borough and neighborhood columns enable analysis of listing distribution and pricing across areas.
4. Room type and pricing: Room type and price columns allow for analysis of pricing strategies and room type popularity.
5. Host expertise: High host_experience values point to hosts with many listings and high availability.
6. Data distribution: Summary statistics (e.g., mean, std, min, max) provide insights into data distribution and potential outliers.
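
Summary statistics only hint at outliers. One common way to flag them explicitly is the interquartile-range (IQR) rule, sketched below. This is a swapped-in technique, not something the wrangling code above performs; `iqr_bounds` is a hypothetical helper, the multiplier 1.5 is the conventional default, and the prices are made-up values.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Lower and upper fences of the IQR rule: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices for illustration only (NOT real Airbnb data).
toy_prices = pd.Series([40, 55, 60, 75, 80, 95, 100, 5000])

low, high = iqr_bounds(toy_prices)
outliers = toy_prices[(toy_prices < low) | (toy_prices > high)]
print(f"fences: ({low}, {high}); flagged: {outliers.tolist()}")
```

Applied to `data['price']`, the flagged rows could be inspected before deciding whether to cap, remove, or keep them.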



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set the theme
sns.set_theme(style="whitegrid")

# Create the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x="price", y="room_type",data=data, hue="room_type")

# Add title and labels
plt.title("Price vs. Room Type")
plt.xlabel("Price ($)")
plt.ylabel("Room Type")

# Show the plot
plt.show()



##### 1. Why did you pick the specific chart?

I picked a scatter plot for Chart 1: Price vs. Room Type for several reasons:

1. Relationship exploration: Plotting each listing's price against its room type shows how a numerical variable (price) varies across a categorical one (room_type).
2. Categorical vs. numerical: room_type is categorical and price is numerical, so the points form one horizontal strip per room type.
3. Distribution insight: The strips show how the points are distributed, helping us understand how price varies across the room_type categories.
4. Outlier detection: Scatter plots make it easy to identify outliers or unusual patterns in the data.
5. Room for customization: Scatter plots can be customized with colors, markers, and other visual elements to enhance the story told by the data.



##### 2. What is/are the insight(s) found from the chart?

From Chart 1: Price vs. Room Type (Scatter Plot), we can gain the following insights:

1. Price differences between room types: room_type is categorical, so the chart compares price spreads rather than showing a correlation. Entire home/apt listings tend to be priced higher than Private room listings.
2. Price variation: The scatter plot shows a significant variation in price within each room_type category, indicating that other factors (e.g., location, amenities) also influence price.
3. Room type clustering: Within each room type, prices cluster in a characteristic range, suggesting that prices are more similar within a category than across categories.
4. Outliers: We can identify potential outliers, such as unusually high-priced Private Rooms or low-priced Entire Homes, which may warrant further investigation.
5. Price ranges: We can observe the general price ranges for each room_type category, helping us understand the market dynamics.
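
The price ranges read off the chart can be backed by numbers with a grouped summary. The sketch below is one way to do that; `price_by_group` is a hypothetical helper and the toy frame is invented for illustration.

```python
import pandas as pd


def price_by_group(df: pd.DataFrame, group_col: str, value_col: str = "price") -> pd.DataFrame:
    """Count, median, mean, and max of `value_col` for each category of `group_col`."""
    return df.groupby(group_col)[value_col].agg(["count", "median", "mean", "max"])


# Toy frame for illustration only (NOT real Airbnb data).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Shared room"],
    "price": [150, 250, 60, 80, 100, 30],
})

room_summary = price_by_group(toy, "room_type")
print(room_summary)
```

The median is included alongside the mean because a few very expensive listings can pull the mean well above what a typical listing costs.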


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from Chart 1 can lead to both positive and negative business impacts, depending on how they are applied:

Positive business impact:

1. Optimized pricing: By understanding the relationship between price and room type, Airbnb can optimize pricing strategies to increase revenue and competitiveness.
2. Targeted marketing: Identifying price ranges for each room type can help target marketing efforts to specific customer segments, increasing conversion rates and bookings.
3. Improved customer satisfaction: By recognizing the variation in prices within each room type, Airbnb can provide more accurate price expectations to customers, leading to increased satisfaction and loyalty.

Negative business impact:

1. Overpricing: If Airbnb raises prices too high for certain room types, it may lead to decreased demand and negative growth.
2. Underutilization: If Airbnb focuses too much on high-priced room types, it may neglect other categories, leading to underutilization and missed revenue opportunities.
3. Uncompetitive pricing: If Airbnb fails to adjust prices according to market dynamics, it may become uncompetitive, leading to decreased market share and negative growth.

To mitigate these risks, it's essential to consider additional factors, such as market trends, customer feedback, and competitor analysis, when applying these insights.



#### Chart - 2

In [None]:
# Chart - 2 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate average price by neighbourhood
avg_price_by_neighbourhood = data.groupby('neighbourhood')['price'].mean().reset_index()

# Create the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='neighbourhood', y='price', data=avg_price_by_neighbourhood)

# Add title and labels
plt.title('Average Price by Neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Average Price ($)')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I picked a bar chart for Chart 2: Average Price by Neighborhood for several reasons:

1. Categorical data: Neighborhoods are categorical data, and bar charts are well-suited for comparing categories.
2. Comparison: Bar charts allow for easy comparison of average prices across different neighborhoods.
3. Visualization: Bar charts provide a clear and concise visualization of the data, making it easy to identify trends and patterns.
4. Neighborhood ranking: Bar charts enable ranking neighborhoods by average price, helping identify the most expensive or affordable areas.
5. Simple and intuitive: Bar charts are easy to understand, even for those without extensive data analysis experience.



##### 2. What is/are the insight(s) found from the chart?


**Insights from Chart 2: Average Price by Neighborhood**

1. **Price variation across neighborhoods**: The chart shows significant variation in average prices across different neighborhoods.
2. **Most expensive neighborhoods**: Neighborhoods like [insert top 2-3 neighborhoods] have the highest average prices, indicating high demand or limited supply.
3. **Most affordable neighborhoods**: Neighborhoods like [insert bottom 2-3 neighborhoods] have the lowest average prices, making them more attractive to budget-conscious customers.
4. **Neighborhood ranking**: The chart provides a clear ranking of neighborhoods by average price, helping identify areas with similar price ranges.
5. **Potential for price optimization**: The variation in prices across neighborhoods suggests opportunities for price optimization and targeted marketing strategies.
6. **Neighborhood characteristics**: The chart may indicate relationships between neighborhood characteristics (e.g., location, amenities) and average prices.
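
The most and least expensive neighbourhoods can be listed directly instead of being read off a crowded bar chart. The sketch below is a minimal way to do so. `rank_neighbourhoods` and its `min_listings` filter are assumptions added for illustration (the filter guards against neighbourhoods whose average rests on a single listing), and the toy labels A to D are not real neighbourhoods.

```python
import pandas as pd


def rank_neighbourhoods(df: pd.DataFrame, group_col: str = "neighbourhood",
                        n: int = 3, min_listings: int = 1):
    """Return (top n, bottom n) groups by mean price, with listing counts."""
    stats = df.groupby(group_col)["price"].agg(mean_price="mean", listings="count")
    stats = stats[stats["listings"] >= min_listings]
    stats = stats.sort_values("mean_price", ascending=False)
    return stats.head(n), stats.tail(n)


# Toy frame for illustration only (NOT real Airbnb data).
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C", "D"],
    "price": [300, 200, 100, 120, 50, 400],
})

top, bottom = rank_neighbourhoods(toy, n=2)
print(top)
print(bottom)
```

Raising `min_listings` trades coverage for stability: fewer neighbourhoods are ranked, but each average is based on more listings.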


These insights can inform strategies for:

- Pricing optimization
- Targeted marketing
- Neighborhood development
- Customer segmentation
- Competitive analysis



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

1. Optimized pricing: Understanding neighborhood price variations can lead to optimized pricing strategies, increasing revenue and competitiveness.
2. Targeted marketing: Identifying most expensive and affordable neighborhoods can inform targeted marketing efforts, attracting customers with tailored messaging.
3. Strategic development: Insights into neighborhood characteristics and prices can guide development strategies, focusing on high-demand areas.
4. Improved customer satisfaction: By understanding price expectations in different neighborhoods, businesses can manage customer expectations and improve satisfaction.

Negative growth:

1. Overpricing: If businesses raise prices too high in already expensive neighborhoods, it may lead to decreased demand and negative growth.
2. Underinvestment: Focusing too much on affordable neighborhoods might lead to underinvestment in other areas, potentially missing revenue opportunities.
3. Misallocated resources: If insights are misinterpreted, resources might be misallocated, leading to ineffective marketing strategies or development projects.



#### Chart - 3

In [None]:
# Chart - 3 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

# Create the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data['price'], kde=True, bins=50)

# Add title and labels
plt.title('Price Distribution')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I picked a histogram because it illustrates the distribution of a continuous variable, here price.

##### 2. What is/are the insight(s) found from the chart?

Insights from Chart 3: Price Distribution (Histogram)

1. Price range: The chart shows the range of prices, from [insert minimum price] to [insert maximum price].
2. Peak prices: The histogram reveals peak prices, indicating the most common price points, around [insert peak price(s)].
3. Price skewness: The distribution is [insert skewness, e.g., left-skewed, right-skewed, symmetrical], indicating [insert implication, e.g., most prices are below the average].
4. Outliers: The chart highlights potential outliers, prices far away from the majority, around [insert outlier price(s)].
5. Pricing tiers: The histogram suggests [insert number] pricing tiers, around [insert tier price ranges].
6. Average price: The chart provides a visual estimate of the average price, around [insert average price].
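
The bracketed values above can be computed rather than estimated by eye. The sketch below gathers them in one place; `describe_distribution` is a hypothetical helper, the bin count is an arbitrary choice, and the prices are made-up values.

```python
import pandas as pd


def describe_distribution(values: pd.Series, bins: int = 50) -> dict:
    """Range, centre, skewness, and the most populated bin of a numeric series."""
    bin_counts = pd.cut(values, bins=bins).value_counts()
    return {
        "min": values.min(),
        "max": values.max(),
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),              # > 0 means a longer right tail
        "modal_bin": bin_counts.idxmax(),   # interval holding the most values
    }


# Toy prices for illustration only (NOT real Airbnb data).
toy_prices = pd.Series([50, 60, 60, 70, 80, 100, 150, 1000])

stats = describe_distribution(toy_prices, bins=5)
for key, value in stats.items():
    print(f"{key}: {value}")
```

A positive skew together with a mean above the median indicates a right-skewed distribution, which is worth checking on `data['price']` before interpreting the histogram.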


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from data analysis can indeed help create a positive business impact, such as:

- Identifying opportunities to increase revenue
- Optimizing operations and reducing costs
- Enhancing customer satisfaction and loyalty
- Informing strategic decisions with data-driven evidence

However, some insights might lead to negative growth if:

- The data is misinterpreted or misused, leading to poor decision-making
- The insights highlight a decline in market demand or a failing product, requiring difficult decisions like discontinuation
- The analysis reveals a significant gap in customer satisfaction, requiring substantial investments to rectify
- The insights lead to a focus on short-term gains, compromising long-term sustainability and growth


#### Chart - 4

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='price', y='number_of_reviews', data=data)

# Add title and labels
plt.title('Price vs. Number of Reviews')
plt.xlabel('Price ($)')
plt.ylabel('Number of Reviews')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?



I picked a scatter plot for Chart 4: Price vs. Number of Reviews, based on the following factors:

1. Data type: Both price and number_of_reviews are numerical, which suits a scatter plot.
2. Data distribution: The plot shows how the points are spread out and whether they are skewed toward particular values.
3. Relationships: The goal is to show the relationship between the two variables.
4. Trends: Any upward or downward pattern in review count as price rises becomes visible.
5. Comparisons: No categories or groups are being compared here, so a bar chart is not needed.
6. Storytelling: The message to convey is whether higher-priced listings receive fewer reviews.




##### 2. What is/are the insight(s) found from the chart?

1. Bar Chart:
    - Which category has the highest/lowest value?
    - How do different groups compare?
2. Line Chart:
    - Is there an upward/downward trend over time?
    - Are there any seasonal patterns or anomalies?
3. Scatter Plot:
    - Is there a strong/weak correlation between variables?
    - Are there any outliers or clusters?
4. Histogram:
    - What is the distribution of the data (e.g., skewed, normal)?
    - Are there any gaps or peaks in the data?
5. Heatmap:
    - Which variables have the strongest/weakest correlations?
    - Are there any clusters or patterns in the data?
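
For the scatter plot question of how strong a correlation is, a coefficient gives a more precise answer than the picture alone. The sketch below computes Pearson's coefficient and, as an assumption added here, a rank-based (Spearman) coefficient obtained by correlating the ranks, which is less sensitive to extreme prices. `correlations` is a hypothetical helper and the toy frame is invented for illustration.

```python
import pandas as pd


def correlations(df: pd.DataFrame, x: str, y: str) -> dict:
    """Pearson correlation, plus Spearman computed as Pearson on the ranks."""
    pearson = df[x].corr(df[y])
    spearman = df[x].rank().corr(df[y].rank())
    return {"pearson": pearson, "spearman": spearman}


# Toy frame for illustration only (NOT real Airbnb data).
toy = pd.DataFrame({
    "price": [30, 50, 70, 120, 200],
    "number_of_reviews": [60, 45, 30, 12, 5],
})

result = correlations(toy, "price", "number_of_reviews")
print(result)
```

Running `correlations(data, "price", "number_of_reviews")` in the notebook would give the figure to quote for Chart 4.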


#### Chart - 5 - Pair Plot

In [None]:
# Pair Plot visualization code


#### Chart - 5

In [None]:


import matplotlib.pyplot as plt
import seaborn as sns

# Calculate average price by neighborhood
avg_price_by_neighbourhood = data.groupby('neighbourhood')['price'].mean().reset_index()

# Sort and select top 10 neighborhoods
top_10_neighbourhoods = avg_price_by_neighbourhood.sort_values('price', ascending=False).head(10)

# Create the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='neighbourhood', y='price', data=top_10_neighbourhoods)

# Add title and labels
plt.title('Top 10 Neighbourhoods by Average Price')
plt.xlabel('Neighbourhood')
plt.ylabel('Average Price ($)')

# Rotate x-axis labels for better readability
plt.xticks(rotation=90)

# Show the plot
plt.show()



#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.


Based on the insights from the charts, I suggest the client:

1. Optimize pricing strategy: Adjust prices in high-demand neighborhoods to maximize revenue.
2. Targeted marketing: Focus marketing efforts on affordable neighborhoods to attract price-sensitive customers.
3. Development strategies: Invest in high-growth neighborhoods with increasing prices.
4. Competitive analysis: Monitor competitors' prices and adjust strategies accordingly.
5. Customer segmentation: Tailor services to meet specific needs of customers in different neighborhoods.

By implementing these strategies, the client can:

- Increase revenue
- Improve market competitiveness
- Enhance customer satisfaction
- Inform data-driven business decisions



# **Conclusion**

- Choosing the right chart type matters: scatter plots, bar charts, and a histogram each answered a different question about the Airbnb listings.
- Chart selection was guided by data type, distribution, relationships, trends, comparisons, and storytelling goals.
- Insights from the charts can drive positive business impact, but misinterpretation and negative findings remain potential pitfalls.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***