<a href="https://colab.research.google.com/github/Hajer45/9antra/blob/main/NYC_Airbnb_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYC Airbnb Data Analysis Project**

**Welcome to the NYC Airbnb Data Analysis Project! 🏙️✨**

Get ready to dive into the **New York City Airbnb dataset**! In this project, we’ll be digging into real-world data to uncover trends, spot patterns, and gain valuable insights. You'll get hands-on experience with:

- Data cleaning 🧹
- Exploring relationships 🔍
- Answering key questions ❓
- Visualizing your findings 📊

This is a fully interactive project designed to sharpen your data analysis skills and put into practice what you’ve learned in your Python for Data Analytics course! 🚀

Ready to level up your skills?
It’s time to roll up your sleeves and apply all that Python knowledge to a real dataset! Let’s make data analytics fun and insightful! 🌟



## **Important Steps Before You Begin**



#### **Create your own copy!** 📄

Since you won't be able to edit this notebook directly, make sure to create a copy in your Google Drive to modify, run, and save your own work.

Just go to **File** > **Save a copy in Drive**  before you begin.

#### **What to Expect in This Notebook:**

We’re diving into real-world data analyst tasks—cleaning, analyzing, and visualizing data. Think of it as a glimpse into the day-to-day life of a data analyst! 💼

I’ve added comments in each section to explain the code and guide you through the process. But before you jump to the final solution and run code, let’s make it more fun:

- **Give it a try first!** ✏️ Write your own code and see what you come up with.
- **Challenge yourself!** 💪 Think through each problem like a data pro before checking the answers.

You’ll learn a lot more by figuring things out on your own, and trust me, it feels awesome to crack it yourself!

Once you've tried, the solution will help you compare your approach and learn new tricks.





## **Dataset Overview**  📊
We are working with **the New York City Airbnb** dataset, which includes details about Airbnb listings across New York. The dataset contains valuable information such as neighborhood, price, room type, and availability of the listings.

To better understand the dataset, I highly recommend checking the full description and metadata on [Kaggle](https://www.kaggle.com/datasets/vrindakallu/new-york-dataset). There you will find more details about the data, including the meaning of each column and how the dataset was collected.


Let's import the necessary libraries and load the dataset!

In [None]:
import kagglehub
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

path = kagglehub.dataset_download("vrindakallu/new-york-dataset")
df = pd.read_csv(f'{path}/new_york_listings_2024.csv')
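
If `pd.read_csv` raises a `FileNotFoundError`, the file name inside the downloaded folder may differ from the one hard-coded above. The following is a minimal sketch, not part of the original notebook, that lists the CSV files in a folder. `find_csv_files` is a hypothetical helper, and a temporary folder stands in for `path` so the snippet runs on its own.

```python
import os
import tempfile

def find_csv_files(folder):
    """Return the sorted names of all .csv files directly inside `folder`."""
    return sorted(name for name in os.listdir(folder) if name.lower().endswith(".csv"))

# A temporary folder stands in for the `path` returned by kagglehub (assumption).
with tempfile.TemporaryDirectory() as demo_path:
    for name in ("new_york_listings_2024.csv", "notes.txt"):
        open(os.path.join(demo_path, name), "w").close()
    csv_files = find_csv_files(demo_path)

print(csv_files)
```

In the notebook itself, you could call `find_csv_files(path)` and pass the matching name to `pd.read_csv`.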


## **Step 1: Exploring the DataFrame  🔍**


Once the dataset is loaded, you can explore its structure, columns, and some sample records.

💡 **Hands-On Tip:**
Make sure to explore the columns and rows to familiarize yourself with the data you're working with. Understanding the structure is essential for any data analysis task!


#### **Solution**

In [None]:
# Display the first 5 rows of the DataFrame
df.head()


In [None]:
# Get a summary of the DataFrame (columns, data types, and missing values)
df.info()

In [None]:
# Check basic statistics for numerical columns
df.describe()
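
Beyond `head()`, `info()`, and `describe()`, a few other quick checks can help you get a feel for a table. The sketch below uses a small made-up DataFrame (`toy`) in place of `df`, so the values shown are purely illustrative.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "price": [80, 220, 95, 40],
})

n_rows, n_cols = toy.shape                      # (rows, columns)
room_counts = toy["room_type"].value_counts()   # number of listings per category
unique_per_column = toy.nunique()               # distinct values in each column

print(n_rows, n_cols)
print(room_counts)
print(unique_per_column)
```

The same three calls work unchanged on the real `df`.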

## **Step 2: Data Cleaning**  🧹



In any data analysis project, it's essential to make sure the data is clean before diving into analysis. Typically, this involves handling missing values, checking for duplicates, and ensuring that the data types are correct. The checks in this step show that our dataset is already well-prepared for analysis.

**Key Data Cleaning Steps to Consider:**

1. **Missing Values**: Usually, we would handle missing values by either filling or removing them.
2. **Duplicate Rows**: It's important to check for duplicates to avoid redundant data that can skew analysis.
3. **Data Types**: Another important step is to ensure that columns have the correct data types. We inspected the data types and found everything to be in order. If needed, we could have adjusted the types of some columns (e.g., converting text to numbers or parsing date strings).
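
To make these three steps concrete, here is a minimal sketch of what the fixes could look like when a table does need cleaning. The `toy` DataFrame is made up for illustration, and the steps run in a slightly different order (types first) because the median fill needs numeric values.

```python
import pandas as pd

# Toy table with one missing price, one duplicated row, and prices stored as text.
toy = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "price": ["100", "250", "250", None],
})

# Data types: convert text to numbers; unparseable or missing entries become NaN.
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

# Duplicate rows: keep only the first occurrence of each identical row.
toy = toy.drop_duplicates()

# Missing values: either drop the affected rows or fill them, here with the median.
dropped = toy.dropna(subset=["price"])
filled = toy.fillna({"price": toy["price"].median()})

print(dropped)
print(filled)
```

Whether to drop or fill depends on the question being asked; filling with the median is only one of several reasonable choices.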


#### **Solution**

In [None]:
# Checking for missing values
missing_data = df.isnull().sum()
print("Missing values:\n", missing_data)


In [None]:
# Checking for duplicate rows
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")


In [None]:
# Example of data type conversion: parse last_review as dates (unparseable values become NaT)
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
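
The `errors='coerce'` argument tells pandas to replace anything it cannot parse with `NaT` (the datetime equivalent of a missing value) instead of raising an error. A small sketch with made-up strings shows the effect:

```python
import pandas as pd

# Made-up date strings; the last one is not a valid date.
raw = pd.Series(["2024-01-05", "2023-11-30", "not a date"])

parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```

Counting the `NaT` values after conversion is a quick way to see whether a date column contained unexpected entries.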


After performing these checks, we found that the dataset is already clean and ready for analysis. This means no further data cleaning steps are necessary for now, so we can proceed directly to Exploratory Data Analysis (EDA). 🎉


## **Step 3: Exploratory Data Analysis (EDA)** 📈


Now that the dataset is cleaned, we can proceed with the exploratory data analysis (EDA). This step involves understanding the main trends, distributions, and relationships within the dataset.

### **3.1. Basic Statistical Overview**

Let's start by getting basic statistics for the numerical columns.

#### **Solution**

In [None]:
# Describing only the relevant columns
relevant_columns = ['price', 'minimum_nights', 'number_of_reviews',
                    'reviews_per_month', 'calculated_host_listings_count',
                    'availability_365', 'bedrooms', 'beds', 'baths', 'rating']

df[relevant_columns].describe()



Now let's dive into the results of the `describe()` function and the metrics we have in front of us, to gain a better understanding of the characteristics of this dataset:

 **Price**:
- The average price of an Airbnb listing in New York City is around **\$188** per night.
- However, there’s a **wide variation** in prices, with some listings costing as low as **\$10** and some reaching extreme outliers of **\$100,000** (likely luxury or unique properties).
- The high standard deviation (**\$1022.80**) indicates that while most listings are in the lower range, a few **very high-priced listings** skew the data.
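
To see why a few extreme prices matter so much, it can help to compare the mean with the median. The sketch below uses ten made-up prices (not values from the dataset): nine ordinary ones and a single extreme outlier.

```python
import pandas as pd

# Made-up nightly prices: nine ordinary listings and one extreme outlier.
prices = pd.Series([90, 110, 120, 130, 150, 160, 180, 200, 250, 100_000])

mean_price = prices.mean()      # pulled far upward by the outlier
median_price = prices.median()  # barely affected by the outlier
std_price = prices.std()

print(mean_price, median_price, std_price)
```

On the real data, `df['price'].median()` is a useful companion to the mean reported by `describe()` (it appears there as the `50%` row).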

**Minimum Nights**:
- On average, hosts require a minimum stay of around **28 nights**, which is relatively high. This suggests that some listings are targeting **longer-term stays**.
- While some listings allow **1-night stays**, others require a minimum stay of **1250 nights**, indicating long-term rental options for certain properties.

**Number of Reviews**:
- The average listing has received around **43 reviews**. However, this varies significantly, with some listings having only **1 review**, while others have up to **1865 reviews**, reflecting the popularity and frequency of bookings.


 **Reviews Per Month**:
- On average, listings receive **1.26 reviews per month**, which suggests consistent booking activity.
- Some listings receive as many as **75 reviews per month**, indicating extremely high turnover and popularity.


**Host Listings**:
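- The average listing belongs to a host who manages about **18.8 listings**. Because this figure is averaged over listings rather than over hosts, it is pulled up by large hosts, but it still suggests that many listings on the platform are run by **professional property managers**.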
- Some hosts manage only **1 listing**, typical of individual hosts, while the most active hosts manage up to **713 listings**, likely indicating commercial hosts.
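
One subtlety: `calculated_host_listings_count` is repeated on every listing of the same host, so a mean taken over listings counts large hosts many times. The sketch below, on made-up data, contrasts that with a mean taken over hosts. It assumes a `host_id` column identifies each host.

```python
import pandas as pd

# Toy data: host 1 has three listings, hosts 2 and 3 have one each.
toy = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 3],
    "calculated_host_listings_count": [3, 3, 3, 1, 1],
})

# Mean over listings: the large host is counted three times.
per_listing_mean = toy["calculated_host_listings_count"].mean()

# Mean over hosts: keep one row per host first.
per_host_mean = (
    toy.drop_duplicates("host_id")["calculated_host_listings_count"].mean()
)

print(per_listing_mean, per_host_mean)
```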

**Availability (365 Days)**:
- Listings are available for an average of **206 days per year**, meaning many listings are booked or unavailable for a portion of the year.
- Some listings are available for **365 days** a year, while others are fully booked (or unavailable) for the entire year.

**Beds**:
- The average listing offers **1.72 beds**, indicating that many properties are small, likely **1-bedroom apartments or studios**. The maximum number of beds in a listing is **42**, suggesting that there are some large, group-oriented properties.


### **3.2. Visualizing Key Variables**


Now let's walk through some focused visualizations based on the most relevant columns.

#### **Solution**

In [None]:
# Price distribution
plt.figure(figsize=(10,6))
sns.histplot(df['price'], bins=50, kde=True, color='teal')
plt.title('Price Distribution of Airbnb Listings')
plt.xlabel('Price (in USD)')
plt.ylabel('Frequency')
plt.show()


In this plot, we attempted to visualize the full distribution of Airbnb listing prices. However, we can see that the vast majority of prices are clustered near \$0, while the x-axis extends all the way up to \$100,000. This is because there are some extremely high-priced listings (outliers) that skew the distribution and compress the visual representation of most listings. As a result, the bulk of listings, which are priced much lower, are squeezed into a very narrow section of the plot, making it difficult to see meaningful trends.

To solve this, we will apply a price cap and only consider listings with prices up to \$1000 to get a clearer view of the price distribution for the majority of listings.



In [None]:
# Apply a cap to prices to remove extreme outliers
filtered_df = df[df['price'] <= 1000]

plt.figure(figsize=(10,6))
sns.histplot(filtered_df['price'], bins=50, kde=True, color='teal')
plt.title('Price Distribution of Airbnb Listings (Price <= $1000)')
plt.xlabel('Price (in USD)')
plt.ylabel('Frequency')
plt.show()


In this second plot, we restricted the dataset to only include listings priced at \$1000 or below, which represents the vast majority of listings. As a result, we can now see a more detailed distribution:

- Most listings fall within the \$50 to \$300 range, with a peak around \$150 per night.
- The distribution is right-skewed, meaning that while there are many affordable listings, a small number of higher-priced listings are still present.
- This plot gives a much clearer view of the typical prices for Airbnb listings in New York City, without the distortion caused by a few extreme outliers.
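
The \$1000 cap above was chosen by eye. One alternative, shown here only as a sketch, is to derive the cutoff from a percentile of the data. The prices are made up, and the 90th percentile is an arbitrary choice for this tiny example; on a large dataset a higher percentile such as the 99th may be more appropriate.

```python
import pandas as pd

# Made-up prices with one extreme value.
prices = pd.Series([60, 80, 100, 120, 150, 180, 220, 300, 450, 100_000])

cap = prices.quantile(0.90)        # 90th percentile of the toy data
trimmed = prices[prices <= cap]    # keep prices at or below the cutoff

print(cap, len(trimmed))
```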


## **Step 4: Interactive Visualization and Analytical Questions 🎯**




Now that we've completed our exploratory data analysis (EDA), it's time to dive deeper into specific visualizations and key analytical questions. You'll answer key questions about the Airbnb data, step by step, with explanations and hands-on tasks.




### **4.1. Understanding the Impact of Outliers Before Deeper Analysis**


Before we dive into deeper analysis, remember that during EDA, we encountered an issue with outliers. Outliers are extreme values (in this case, extremely high prices) that can skew visualizations and make it difficult to see the true patterns where most data points fall.

In our case, the presence of a few very expensive listings (outliers) distorted our price distribution visualizations, making it hard to analyze the more typical price range for most Airbnb listings.


**Why Explore Price Distribution?**

💡 To solve this, we need to take a step back and carefully explore how prices are distributed across all listings. This will help us:

1. Understand the range of prices in the dataset.
2. Identify the majority of listings that fall into reasonable price ranges.
3. Find a way to group prices into meaningful categories that will help us visualize the data more clearly.

To do this, we’ll group prices into bins that get wider as prices rise. This means we will have price ranges like **0-200, 200-400, 400-600**, followed by **600-1,000, 1,000-10,000**, and so on. This approach helps handle the wide variation in prices by capturing listings in **narrow bins for low prices** (where most listings are) and **wider bins** for high prices (where only a few listings exist).

👉 Let’s start by visualizing the distribution of prices using these progressively wider bins. This will help you see where most Airbnb listings fall and identify whether there are significant numbers of listings with extremely high prices.

#### **Solution**

In [None]:
# Define price bins (bins get wider as prices rise)
bins = [0, 200, 400, 600, 1000, 10000, 200000]

# Create labels for the bins
bin_labels = ['0-200', '200-400', '400-600', '600-1k', '1k-10k', '10k-200k']

# Assign each listing to a bin
df['price_bin'] = pd.cut(df['price'], bins=bins, labels=bin_labels, include_lowest=True)

# Plot the distribution of listings in each price bin
plt.figure(figsize=(10,6))
sns.countplot(x='price_bin', data=df, color='teal')
plt.title('Distribution of Airbnb Listings by Price Range')
plt.xlabel('Price Range (in USD)')
plt.ylabel('Number of Listings')
plt.show()
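
The bins above were picked by hand. If you want bins that grow by a constant factor, the edges can be generated instead of typed out. The following sketch uses made-up prices and hypothetical labels, with each edge ten times the previous one.

```python
import pandas as pd

# Made-up prices spanning several orders of magnitude.
prices = pd.Series([10, 45, 120, 180, 950, 12_000, 100_000])

# Edges 10, 100, 1,000, 10,000, 100,000: each is 10 times the previous one.
edges = [10 * 10**k for k in range(5)]
labels = ["10-100", "100-1k", "1k-10k", "10k-100k"]

# Intervals are closed on the right; include_lowest=True also keeps the value 10.
price_bin = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)
counts = price_bin.value_counts(sort=False)

print(counts)
```

Note that `pd.cut` assigns `NaN` to any value outside the outermost edges, so the first and last edges should cover the full price range.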






We can see that the majority of Airbnb listings fall within the **\$0-200** price range, with a significant portion of listings also priced between **\$200-400**. There are fewer listings in the **\$400-600** range, and the number of listings decreases dramatically beyond **\$600**.

Only a small number of listings are priced above **\$1,000**, and extremely high-priced listings (e.g., above **\$10,000**) represent a very small fraction of the data. These outliers can distort visualizations and make it difficult to focus on the majority of listings, which are priced at more typical levels.

To improve clarity in our analysis moving forward, we will address these outliers by applying a price cap of \$10,000 to our data, allowing us to focus on the majority of Airbnb listings.


In [None]:
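# Keep listings priced at $10,000 or below; .copy() avoids a SettingWithCopyWarning when adding columns later
filtered_df = df[df['price'] <= 10000].copy()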
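
Before relying on a capped dataset, it is worth checking how many rows the cap actually removes. This sketch uses a made-up `toy` table; the same arithmetic applies to `df` and `filtered_df`.

```python
import pandas as pd

# Made-up prices: two of the five exceed the $10,000 cap.
toy = pd.DataFrame({"price": [90, 150, 400, 12_000, 100_000]})

capped = toy[toy["price"] <= 10_000].copy()  # .copy() gives an independent DataFrame
removed = len(toy) - len(capped)
share_removed = removed / len(toy)

print(removed, share_removed)
```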

### **4.2. Visualizing Price Distribution Across Room Types**



💡 **Objective:** Understand how prices vary between different room types. Do certain room types tend to be more expensive than others?

👉 **Task:** Let's now visualize the price distribution across room types. This will give us an overview of how prices differ for room types like "Entire home/apt," "Private room," and "Shared room."


#### **Solution**

In [None]:
# Define price bins
bins = [0, 100, 200, 400, 800, 1000, 10000]
bin_labels = ['0-100', '100-200', '200-400', '400-800', '800-1000', '1000-10000']

# Add a new column for binned prices
filtered_df['price_bin'] = pd.cut(filtered_df['price'], bins=bins, labels=bin_labels, include_lowest=True)

# Create a binned bar plot for price distribution by room type
plt.figure(figsize=(12,8))
sns.countplot(x='price_bin', hue='room_type', data=filtered_df, palette='coolwarm')

# Add a title and axis labels
plt.title('Price Distribution by Room Type (Binned Price Ranges)', fontsize=16)
plt.xlabel('Price Range (in USD)', fontsize=14)
plt.ylabel('Number of Listings', fontsize=14)

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.legend(title='Room Type', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)

plt.show()


The majority of Airbnb listings are concentrated in the lower price ranges, particularly for **private rooms** and **entire homes/apartments**. This suggests that most listings on the platform target more affordable stays. As prices increase, the number of listings decreases significantly, with only a few high-end listings surpassing **\$1,000**.
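
If you want the numbers behind a grouped count plot like this one, a cross-tabulation gives the same counts as a table. The sketch below uses a made-up `toy` table and a shortened set of bins, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for filtered_df; the values are made up.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room"],
    "price": [70, 120, 180, 450, 40],
})

bins = [0, 100, 200, 400, 800]
labels = ["0-100", "100-200", "200-400", "400-800"]
toy["price_bin"] = pd.cut(toy["price"], bins=bins, labels=labels, include_lowest=True)

# Rows: price range, columns: room type, cells: number of listings.
counts = pd.crosstab(toy["price_bin"], toy["room_type"])

# Median price per room type as a one-number summary.
median_price = toy.groupby("room_type")["price"].median()

print(counts)
print(median_price)
```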

### **4.3. Relationship Between Price and Number of Reviews**



💡 **Objective:** Do more expensive Airbnb listings receive more reviews? Let’s explore the relationship between the price of a listing and the number of reviews.

👉 **Task:** Create a scatter plot to visualize the relationship between price and number of reviews, still using the price cap of \$10,000 to avoid the influence of extreme outliers.

#### **Solution**

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='price', y='number_of_reviews', data=filtered_df)
plt.title('Relationship Between Price and Number of Reviews (Price <= $10,000)')
plt.xlabel('Price (in USD)')
plt.ylabel('Number of Reviews')
plt.show()



The scatter plot shows that:

- **Lower-priced listings** (below \$1,000) tend to have more reviews, with many reaching over 500 reviews. This suggests that affordable listings are booked more often, leading to more reviews.

- **Higher-priced listings** (above \$2,000) generally have fewer reviews, indicating that luxury or high-cost properties are booked less frequently, possibly catering to a more selective audience.

In other words, there is an **inverse relationship** between **price** and the number of **reviews**: cheaper listings tend to get more reviews, while expensive listings tend to have fewer. Keep in mind that there are also far more low-priced listings, so part of this pattern simply reflects where most of the data lies.
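
A scatter plot gives a visual impression; a rank correlation puts a number on it. The sketch below computes a Spearman-style correlation as the Pearson correlation of the ranks. The `toy` data is made up and deliberately constructed to be perfectly decreasing, so real data will give a weaker value.

```python
import pandas as pd

# Made-up data in which reviews fall steadily as price rises.
toy = pd.DataFrame({
    "price": [50, 80, 120, 300, 900, 2500],
    "number_of_reviews": [400, 350, 200, 90, 20, 5],
})

# Spearman correlation = Pearson correlation computed on the ranks.
rank_corr = toy["price"].rank().corr(toy["number_of_reviews"].rank())

print(rank_corr)
```

A value near -1 indicates a strong decreasing relationship, near 0 indicates none, and near +1 a strong increasing one.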



### **4.4. Average Price of Airbnb Listings by Neighborhood Group**



💡 **Objective:** Find out which neighborhood groups have the highest average listing prices.

👉 **Task:** Use a bar plot to compare the average prices of Airbnb listings across different neighborhood groups. We'll keep the price cap applied to focus on the majority of listings.

#### **Solution**

In [None]:
# Bar plot of average price per neighbourhood group (using the price-capped data)
average_price_neighbourhood = filtered_df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
average_price_neighbourhood.plot(kind='bar', color='teal')
plt.title('Average Price of Listings by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Average Price (in USD)')
plt.show()


From the bar plot, we can see how Airbnb prices vary across NYC’s neighborhood groups:

- **Manhattan** tops the chart with average prices over \$200 per night. This makes sense since it’s a prime location with high demand.

- **Brooklyn** comes in next, with listings averaging around \$150, a popular and slightly more affordable option compared to Manhattan.

- **Queens**, **Staten Island**, and the **Bronx** offer the most budget-friendly stays, all averaging under \$100 per night, making them great choices for travelers looking for affordability.
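
Because averages are sensitive to expensive listings, it can be informative to look at the median and the number of listings alongside the mean. This is a sketch on made-up data; the group names are real boroughs, but the prices are invented.

```python
import pandas as pd

# Made-up prices; one Manhattan listing is far more expensive than the rest.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "price": [150, 250, 5000, 100, 140],
})

summary = (
    toy.groupby("neighbourhood_group")["price"]
       .agg(["mean", "median", "count"])
       .sort_values("median", ascending=False)
)

print(summary)
```

In this toy example the Manhattan mean is far above its median, which is the kind of gap that signals a few high prices driving the average.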



## **Step 5: More Hands-On Exercises** 🖐️




**Hands-On Practice: Your Turn! 🎉**

Here are some more questions you can try answering with your newly learned skills:

1. **Which room types have the highest average number of reviews?**

   👉 Hint: Group the data by `room_type` and calculate the average number of reviews.

2. **Which host has the most listings?**

   👉 Hint: Group the data by `host_name` (or by `host_id`, since different hosts can share a name) and count the number of listings for each host.

3. **Analyze the relationship between the number of bedrooms and price**

   👉 Hint: Create a scatter plot with `bedrooms` on the x-axis and `price` on the y-axis to see how the number of bedrooms impacts price.

## **Conclusion**  🎉

 🎉Congratulations! 🎉 You have completed the NYC Airbnb Data Analysis Project.

This project provided hands-on experience with data cleaning, analysis, and visualization techniques using a real-world dataset. We explored key questions that offered insights into trends in Airbnb listings across New York City.

**Continue exploring the dataset to uncover even more interesting patterns! 🔍🕵️‍♂️**