Exploring the Airbnb Dataset in Python

Step 1: Importing Necessary Libraries and Loading the Airbnb Dataset
This script imports the data manipulation and visualization libraries, loads the Airbnb dataset from a CSV file, and prints the first five rows. Inspecting these rows shows the structure and content of the data before any further analysis or visualization.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "C:\\Users\\LENOVO\\Downloads\\Airbnb_Open_Data.csv"
df = pd.read_csv(file_path)
print(df.head())

Step 2: Check the column names in the Dataset

In [None]:
df.columns

Step 3: Check for Missing Values

In [None]:
print(df.isnull().sum())
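
Raw counts are easier to judge relative to the size of the table. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`, not the real dataset), that reports the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data (values are made up for illustration)
toy = pd.DataFrame({
    "NAME": ["Cozy loft", None, "Sunny room", "Garden flat"],
    "price": ["$120", "$85", None, None],
    "reviews per month": [1.2, np.nan, 0.4, 2.0],
})

# Share of missing values per column, as a percentage, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `df` once the real file is loaded.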

Step 4: Handle Missing Values
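This code converts the 'last review' column to datetime (unparseable values become NaT), fills missing 'reviews per month' values with 0 and missing 'last review' values with the earliest date in the column, and drops records that lack a 'NAME' or 'host name'. Note that the earliest-date fill is a placeholder, so those rows will cluster at the start of any time-based plot.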

In [8]:
# Convert 'last review' to datetime and handle errors
df['last review'] = pd.to_datetime(df['last review'], errors='coerce')

# Fill missing values
df.fillna({'reviews per month': 0, 'last review': df['last review'].min()}, inplace=True)

# Drop records with missing 'NAME' or 'host name'
df.dropna(subset=['NAME', 'host name'], inplace=True)
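
To see what each of these calls does, here is a small sketch on made-up rows. The DataFrame `toy` and its values are hypothetical, and the sketch uses assignment instead of `inplace=True`, which is a stylistic choice rather than the notebook's approach:

```python
import pandas as pd

# Toy rows (made up): one unparseable date, one missing rate, one missing name
toy = pd.DataFrame({
    "NAME": ["Cozy loft", "Sunny room", None],
    "host name": ["Ana", "Ben", "Cy"],
    "last review": ["2021-10-19", "not a date", "2019-06-01"],
    "reviews per month": [0.5, None, 1.0],
})

# errors='coerce' turns unparseable dates into NaT instead of raising
toy["last review"] = pd.to_datetime(toy["last review"], errors="coerce")
n_unparsed = toy["last review"].isna().sum()

# Fill missing values per column, then drop rows without a name or host name
toy = toy.fillna({
    "reviews per month": 0,
    "last review": toy["last review"].min(),
})
toy = toy.dropna(subset=["NAME", "host name"])
print(n_unparsed, len(toy))
```

Counting the NaT values right after the conversion shows how many dates were lost to coercion before they are filled.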

Step 5: Correct Data Types
Convert the 'price' and 'service fee' columns from currency strings (for example "$1,200") to floats.

In [None]:
# Remove dollar signs and convert to float
# Raw strings (r'...') avoid Python's invalid-escape warning for '\$'
df['price'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)
df['service fee'] = df['service fee'].replace(r'[\$,]', '', regex=True).astype(float)
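
`astype(float)` raises an error if any non-numeric string remains after the symbols are stripped. As an alternative sketch (not the notebook's method), `pd.to_numeric` with `errors='coerce'` turns such values into NaN instead. The Series `raw` below is made up for illustration:

```python
import pandas as pd

# Made-up currency strings, including a missing value and a stray label
raw = pd.Series(["$1,200", "$85", None, "n/a"])

# Strip dollar signs and thousands separators, then convert leniently
cleaned = raw.str.replace(r"[\$,]", "", regex=True)
prices = pd.to_numeric(cleaned, errors="coerce")
print(prices)
```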

Step 6: Remove Duplicates
Check for and remove any duplicate records.

In [10]:
df.drop_duplicates(inplace=True)
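
It can help to count the duplicates before dropping them, so the effect of this step is visible. A minimal sketch on a made-up table (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Toy table with one fully duplicated row (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [100.0, 80.0, 80.0, 120.0],
})

# duplicated() flags every repeat of an earlier, identical row
n_dupes = toy.duplicated().sum()

# Drop the repeats and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```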

Step 7: Confirm Data Cleaning
Verify that the data cleaning steps were successful.

In [None]:
# df.info() prints its report directly and returns None, so no print() is needed
df.info()

Step 8: Descriptive Statistics
The df.describe() function in pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. This function is useful for understanding the basic statistical properties of the data.

In [None]:
print(df.describe())
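
By default, `describe()` on a DataFrame with mixed types summarizes only the numeric columns. The sketch below, on made-up data, shows how `include='all'` also adds count, unique, top, and freq for text columns:

```python
import pandas as pd

# Toy data (made up): one numeric column and one text column
toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0, 200.0],
    "room type": ["Private room", "Entire home/apt",
                  "Private room", "Shared room"],
})

# Default: numeric columns only
numeric_summary = toy.describe()

# include='all': text columns are summarized as well
full_summary = toy.describe(include="all")
print(full_summary)
```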

Step 9: Visualization
Distribution of Prices
Plot the distribution of listing prices.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=50, kde=True)
plt.title('Distribution of Listing Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()
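
A histogram shows outliers visually; a numeric check can complement it. The sketch below applies the common 1.5 × IQR rule to a made-up price Series. The rule and the threshold are an illustrative choice, not something the notebook itself uses:

```python
import pandas as pd

# Made-up prices with one obvious outlier
prices = pd.Series([50, 60, 70, 80, 90, 100, 1000], dtype=float)

# Interquartile range and the conventional upper fence
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are flagged as high-priced outliers
outliers = prices[prices > upper_fence]
print(upper_fence, outliers.tolist())
```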

Room Type Analysis
Analyze the distribution of different room types.

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='room type', data=df , color='hotpink')
plt.title('Room Type Distribution')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()
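
The bar heights can also be read as percentages. A minimal sketch, using a made-up Series in place of `df['room type']`:

```python
import pandas as pd

# Made-up room types standing in for df['room type']
rooms = pd.Series(
    ["Entire home/apt", "Entire home/apt", "Entire home/apt",
     "Private room", "Private room"],
    name="room type",
)

# normalize=True returns proportions instead of raw counts
shares = rooms.value_counts(normalize=True).mul(100).round(1)
print(shares)
```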

Neighborhood Analysis
Examine how listings are distributed across different neighborhoods.

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(y='neighbourhood group', data=df,color="lightgreen" , order=df['neighbourhood group'].value_counts().index)
plt.title('Number of Listings by Neighborhood Group')
plt.xlabel('Count')
plt.ylabel('Neighborhood Group')
plt.show()

Price vs. Room Type
Visualize the relationship between price and room type.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='room type', y='price', hue='room type', data=df, palette='Set2')
plt.title('Price vs. Room Type')
plt.xlabel('Room Type')
plt.ylabel('Price ($)')
# The x-axis tick labels already name each room type, so no separate legend is added
plt.show()
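
The numbers behind a box plot can be tabulated with a group-by. The sketch below uses made-up rows; the column names match the dataset, but the values do not come from it:

```python
import pandas as pd

# Made-up listings (illustrative values only)
toy = pd.DataFrame({
    "room type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [200.0, 300.0, 90.0, 110.0, 40.0],
})

# Per-room-type count, median, and range of price
summary = toy.groupby("room type")["price"].agg(["count", "median", "min", "max"])
print(summary)
```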

Reviews Over Time
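Plot review activity over time. Because 'last review' stores only each listing's most recent review date, the line shows how many listings were last reviewed in each month, which is a proxy for review activity rather than a count of all reviews.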

In [None]:
# Make sure 'last review' is datetime (a no-op if Step 4 has already run)
df['last review'] = pd.to_datetime(df['last review'])
# Count listings by the month of their most recent review
reviews_over_time = df.groupby(df['last review'].dt.to_period('M')).size()

plt.figure(figsize=(12, 6))
reviews_over_time.plot(kind='line', color='red')
plt.title('Listings by Month of Last Review')
plt.xlabel('Date')
plt.ylabel('Number of Listings')
plt.show()
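
Grouping by year and month gives a timeline, while grouping by calendar month alone pools all years together, which is one way to look for seasonality. A minimal sketch on made-up dates (the Series `dates` is hypothetical and stands in for `df['last review']`):

```python
import pandas as pd

# Made-up review dates, including one missing value
dates = pd.Series(pd.to_datetime(
    ["2019-06-03", "2019-06-20", "2019-07-01", "2020-06-15", None]
))
valid = dates.dropna()

# Timeline view: one bucket per year-month
by_period = valid.dt.to_period("M").value_counts().sort_index()

# Seasonal view: one bucket per calendar month (1 to 12), pooled across years
by_calendar_month = valid.dt.month.value_counts().sort_index()
print(by_period)
print(by_calendar_month)
```

Dropping missing dates here, rather than filling them with a placeholder date, keeps artificial spikes out of the counts.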

Key Insights From Exploratory Data Analysis of the Airbnb Dataset
The key insights from the exploratory data analysis are summarized below:

1. Pricing Distribution:
Most Airbnb listings are priced within a moderate range.
There are a few high-priced outliers, indicating some premium listings with significantly higher prices.

2. Room Type Distribution:
The majority of listings are either entire homes/apartments or private rooms.
Shared rooms and hotel rooms constitute a very small portion of the listings.

3. Geographical Distribution:
Listings are predominantly concentrated in popular areas like Brooklyn and Manhattan.
Other boroughs such as Queens, Bronx, and Staten Island have fewer listings.

4. Price Comparison by Room Type:
Entire homes/apartments generally cost more than private rooms.
Shared rooms tend to have the lowest prices among the room types.

5. Seasonal Trends in Reviews:
There are observable seasonal trends in the number of reviews.
Certain months experience higher review activity.


Conclusion
Exploratory data analysis (EDA) helps us understand the main trends and patterns in the Airbnb dataset. We found that most listings are reasonably priced, with popular areas having the highest concentration of listings. Entire homes/apartments are typically more expensive than private rooms. Additionally, review patterns show seasonal variations. These insights can guide both hosts and guests in making informed decisions.

