# NYC Airbnb Analysis

## Project Overview

This project analyzes New York City Airbnb listings to uncover trends, factors influencing pricing, and potential areas for investment. The analysis includes data cleaning, exploratory data analysis (EDA), geospatial analysis, clustering, predictive modeling, and sentiment analysis of reviews.

## Dataset

The dataset used in this project is the New York City Airbnb Open Data, available on [Kaggle](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data).

## Setup and Installation

To run this notebook, you need to have Python installed along with the necessary packages. You can install the required packages using the following command:

```bash
pip install pandas numpy matplotlib seaborn folium scikit-learn textblob
```


```python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from textblob import TextBlob

# Load the dataset
df = pd.read_csv('AB_NYC_2019.csv')

# Display the first few rows of the dataframe
df.head()
```

## Data Cleaning and Preparation

In this section, we will handle missing values and convert data types where necessary. We will also perform feature engineering to extract additional insights.

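```python
# Drop rows missing a listing name or host name
df = df.dropna(subset=['name', 'host_name']).copy()

# Listings with no reviews have no 'last_review' or 'reviews_per_month';
# keep them and record zero reviews per month
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Convert 'last_review' to datetime (missing dates become NaT)
df['last_review'] = pd.to_datetime(df['last_review'])

# Feature engineering: reviews per listing managed by the host.
# The Kaggle file names this column 'calculated_host_listings_count'.
df['host_activity_level'] = df['number_of_reviews'] / df['calculated_host_listings_count']

# Display the cleaned dataframe
df.head()
```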
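
Before choosing between dropping and filling, it can help to see how many values each column is missing. The sketch below runs that check on a small made-up frame (`toy` and its values are illustrative, not rows from the dataset); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the listings data
toy = pd.DataFrame({
    'name': ['Cozy loft', None, 'Sunny room'],
    'last_review': ['2019-05-01', None, None],
    'reviews_per_month': [1.2, np.nan, np.nan],
})

# Count missing values per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Share of rows missing each column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```

Columns that are missing together (here `last_review` and `reviews_per_month`) usually point to one underlying cause, such as a listing that has never been reviewed.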

## Exploratory Data Analysis (EDA)

Here, we explore the distribution of listings across different neighborhoods and room types. We also identify pricing trends and analyze correlations between various factors and pricing.

```python
# Distribution of listings across neighborhoods
plt.figure(figsize=(12, 6))
sns.countplot(y='neighbourhood_group', data=df, order=df['neighbourhood_group'].value_counts().index)
plt.title('Distribution of Listings across Neighborhood Groups')
plt.show()

# Pricing trends across neighborhoods
plt.figure(figsize=(12, 6))
sns.boxplot(x='neighbourhood_group', y='price', data=df)
plt.ylim(0, 500)  # Limit y-axis to focus on majority
plt.title('Pricing Trends across Neighborhood Groups')
plt.show()

# Correlation analysis
correlation_matrix = df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
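
The plots above group listings by neighborhood only. A minimal sketch of how room type could be brought into the same comparison is shown here on made-up data; the column names follow the dataset, while `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy listings; column names follow the dataset, values are made up
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Queens'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Entire home/apt', 'Private room', 'Private room'],
    'price': [200, 300, 120, 150, 70, 60],
})

# Number of listings per neighbourhood group and room type
counts = pd.crosstab(toy['neighbourhood_group'], toy['room_type'])

# Median price for each combination; the median is less sensitive
# to a few extreme prices than the mean
median_price = toy.pivot_table(index='neighbourhood_group', columns='room_type',
                               values='price', aggfunc='median')
print(counts)
print(median_price)
```

Combinations with no listings show up as `0` in the count table and as `NaN` in the median table.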

## Geospatial Analysis

In this section, we create a geographical map to visualize the distribution of Airbnb listings across New York City.

```python
# Create a map centered around New York City
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=10)

# Add listings to the map
for lat, lon in zip(df['latitude'], df['longitude']):
    folium.CircleMarker(location=[lat, lon],
                        radius=3,
                        color='blue',
                        fill=True,
                        fill_color='blue',
                        fill_opacity=0.6).add_to(nyc_map)

# Save the map to an HTML file
nyc_map.save('nyc_airbnb_listings_map.html')
nyc_map
```
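
Adding one marker per listing can make the saved HTML file large and slow to open. One possible workaround, sketched here on made-up coordinates, is to plot a sample drawn evenly from each neighbourhood group; the 50% fraction and the `toy` frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings with random coordinates
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 60 + ['Brooklyn'] * 40,
    'latitude': rng.uniform(40.5, 40.9, 100),
    'longitude': rng.uniform(-74.2, -73.7, 100),
})

# Draw the same fraction from every neighbourhood group
# so that small groups stay represented on the map
sampled = toy.groupby('neighbourhood_group').sample(frac=0.5, random_state=0)

# [latitude, longitude] pairs, the shape a marker loop would iterate over
points = sampled[['latitude', 'longitude']].to_numpy().tolist()
print(len(points), 'of', len(toy), 'listings kept')
```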

## Clustering Analysis

Here, we perform clustering to segment listings into distinct groups based on their geographical location and price.

```python
# Prepare data for clustering
X = df[['latitude', 'longitude', 'price']]

# Perform KMeans clustering (n_init set explicitly so results do not
# depend on the scikit-learn version's default)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
df['cluster'] = kmeans.labels_

# Plot clusters by longitude and latitude
plt.figure(figsize=(12, 6))
sns.scatterplot(x='longitude', y='latitude', hue='cluster', data=df, palette='tab10')
plt.title('Clustered Listings')
plt.show()
```
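
KMeans measures Euclidean distance, so a feature with a large numeric range (price, in dollars) can outweigh features with a small range (latitude and longitude, which vary by fractions of a degree). A common remedy is to standardize the features first. The sketch below illustrates this on synthetic data; `toy`, the two-cluster setup, and all values are assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy listings: two areas with different price levels
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'latitude': np.concatenate([rng.normal(40.75, 0.01, 50), rng.normal(40.65, 0.01, 50)]),
    'longitude': np.concatenate([rng.normal(-73.98, 0.01, 50), rng.normal(-73.95, 0.01, 50)]),
    'price': np.concatenate([rng.normal(250, 30, 50), rng.normal(90, 20, 50)]),
})

# Standardize so each feature has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(toy[['latitude', 'longitude', 'price']])

# Cluster on the scaled features
kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
toy['cluster'] = kmeans_scaled.labels_

# Per-cluster averages in the original units
print(toy.groupby('cluster')[['latitude', 'longitude', 'price']].mean())
```

Whether scaling is appropriate depends on the goal: without it, the clusters mostly reflect price bands; with it, location and price contribute on a comparable footing.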

## Predictive Modeling

In this section, we use a Random Forest Regressor to predict the price of Airbnb listings based on various features.

```python
# Prepare data for modeling
features = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
X = df[features]
y = df['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')
```
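
A mean absolute error is easier to interpret next to a naive baseline. The sketch below compares a random forest with a model that always predicts the median training price, using scikit-learn's `DummyRegressor`. The data is synthetic and the names `X_toy`, `y_toy`, `baseline`, and `forest` are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: price depends on the first feature plus noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(500, 2))
y_toy = 100 + 200 * X_toy[:, 0] + rng.normal(0, 10, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Baseline: always predict the median training price
baseline = DummyRegressor(strategy='median').fit(X_tr, y_tr)
baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))

# Same split, same metric, for the random forest
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
forest_mae = mean_absolute_error(y_te, forest.predict(X_te))

print(f'Baseline MAE: {baseline_mae:.2f}')
print(f'Random forest MAE: {forest_mae:.2f}')
```

If the model's error is not clearly below the baseline's, the chosen features likely carry little information about price.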

## Sentiment Analysis on Reviews

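Here, we score the sentiment of guest comments and compare it with review ratings to gauge guest satisfaction. Note that `AB_NYC_2019.csv` contains neither review text nor rating scores, so this cell assumes that `comments` and `review_scores_rating` columns have been merged into `df` from a separate reviews dataset.

```python
# Assumes 'comments' and 'review_scores_rating' were merged in from a separate reviews dataset.
# Score each comment's polarity, from -1 (negative) to 1 (positive)
df['review_sentiment'] = df['comments'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

# Relationship between sentiment and review scores
plt.figure(figsize=(12, 6))
sns.scatterplot(x='review_scores_rating', y='review_sentiment', data=df)
plt.title('Sentiment Analysis of Reviews')
plt.show()
```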
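
A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r on a few made-up rows. The `toy` frame and its values are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical ratings and sentiment scores, for illustration only
toy = pd.DataFrame({
    'review_scores_rating': [60, 70, 80, 90, 100],
    'review_sentiment': [-0.2, 0.1, 0.3, 0.4, 0.8],
})

# Pearson correlation between rating and sentiment polarity
pearson_r = toy['review_scores_rating'].corr(toy['review_sentiment'])
print(f'Pearson r: {pearson_r:.2f}')
```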

## Conclusion and Insights

### Key Findings

- **Neighborhood Insights**: High-demand areas like Manhattan and Brooklyn command higher prices, whereas Staten Island offers more affordable options.
- **Pricing Factors**: Factors such as the number of reviews, availability, and minimum stay requirements influence listing prices.
- **Guest Sentiment**: Positive guest sentiment correlates with higher review scores, indicating the importance of maintaining good guest experiences.
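
One way to check which features a fitted random forest relies on is its `feature_importances_` attribute. The sketch below uses synthetic data in which only availability drives price, so the ranking is known in advance. The data and the names `X_toy`, `y_toy`, and `forest` are assumptions, and importances describe what the model uses, not causal influence on price.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic listings; only availability affects price here
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'availability_365': rng.integers(0, 366, 400),
    'minimum_nights': rng.integers(1, 31, 400),
})
y_toy = 80 + 0.5 * X_toy['availability_365'] + rng.normal(0, 5, 400)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances sum to 1; higher means the trees split on it more
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```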

### Recommendations

- **Hosts**: Focus on maintaining positive guest experiences and consider listing in high-demand areas for better returns.
- **Guests**: Explore listings in diverse neighborhoods to find options that suit different budgets and preferences.