# **Project: Enhanced Airbnb Listing Analysis for New Hosts in New Brunswick**<br><br>
A Comprehensive Guide to Pricing, Occupancy, and Review Optimization

**Problem Statement** <br>
What strategies can new Airbnb hosts in New Brunswick use to attract bookings, optimize pricing, and earn positive guest reviews in their first year of operation?

**Motivation** <br>
New hosts often struggle with pricing, visibility, and guest satisfaction. Understanding successful listing strategies can help them gain a competitive edge.

This analysis aims to identify key success factors using:

*   Probability-based learning (Logistic Regression, Decision Trees)
*   Information-based learning (Feature Importance, Sentiment Analysis)
*   Interactive EDA (Price vs. Occupancy Trends, Seasonal Analysis)

**Objectives** <br>
Identify key success factors for new Airbnb hosts.

**Approach**

1.   Data Collection
2.   Data Preprocessing
3.   Exploratory Data Analysis (EDA)
4.   Feature Engineering
5.   Modeling
6.   Evaluation
7.   Conclusion

**Task-1: Data Collection**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
from wordcloud import WordCloud
from collections import Counter
import ast
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_selection import mutual_info_classif
import ipywidgets as widgets
from IPython.display import display

In [None]:
# Load dataset
url = 'https://raw.githubusercontent.com/Naresh-Babu-Nangineedi/datasets/refs/heads/main/listings.csv'
df = pd.read_csv(url)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

**Task-2: Data Preprocessing**
<br>
This section prepares the data by handling missing values, calculating occupancy rate, and converting the last review date for time-series analysis.

In [None]:
# Select relevant columns for analysis
df_filtered = df[['price', 'number_of_reviews', 'availability_365', 'latitude', 'longitude', 'last_review']].copy()

# Convert price column to numeric and remove any special characters
df_filtered['price'] = df_filtered['price'].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Approximate occupancy rate as the share of the next 365 days the listing is not available
df_filtered['occupancy_rate'] = 1 - (df_filtered['availability_365'] / 365)

# Handle missing values
df_filtered.dropna(subset=['price', 'number_of_reviews'], inplace=True)

# Convert 'last_review' column to datetime for time-series analysis
df_filtered['last_review'] = pd.to_datetime(df_filtered['last_review'], errors='coerce')

# Display first few rows after preprocessing
df_filtered.head()
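To make the cleaning steps above concrete, here is a minimal sketch on a few made-up rows (the values are illustrative and not taken from the dataset). It uses an equivalent single regex, `[$,]`, to strip currency symbols and thousands separators, then computes the same occupancy proxy. Note that the proxy cannot tell booked nights apart from nights the host has simply blocked, so it is best read as an upper bound on true occupancy.

```python
import pandas as pd

# Toy rows with made-up values that mimic the raw column formats
toy = pd.DataFrame({
    "price": ["$1,250.00", "$80.00", None],
    "availability_365": [0, 365, 73],
})

# Strip "$" and "," and convert to float; missing prices become NaN
toy["price"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Occupancy proxy: share of the next 365 days that are not bookable
toy["occupancy_rate"] = 1 - toy["availability_365"] / 365

print(toy)
```

Rows with a missing price survive this step as `NaN`, which is why the notebook drops them afterwards with `dropna`.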


**Task-3: Exploratory Data Analysis (EDA)**

**1. Price Distribution**

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(df_filtered['price'], bins=50, kde=True, color='skyblue')
plt.title('Price Distribution of Airbnb Listings in New Brunswick')
plt.xlabel('Price (CAD)')
plt.ylabel('Frequency')
plt.show()

**2. Heatmap of the Data**

In [None]:
# Compute correlation matrix over numeric columns only ('last_review' is a datetime)
corr_matrix = df_filtered.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Airbnb Data')
plt.show()

**3. Price vs. Occupancy Rate**

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='price', y='occupancy_rate', data=df_filtered, alpha=0.6, color='blue')
plt.title('Price vs. Occupancy Rate')
plt.xlabel('Price (CAD)')
plt.ylabel('Occupancy Rate (fraction of year)')
plt.grid(True)
plt.show()

**4. Impact of Reviews on Occupancy**

In [None]:
review_impact = df_filtered.groupby('number_of_reviews')['occupancy_rate'].mean().reset_index()
plt.figure(figsize=(10,6))
sns.lineplot(x='number_of_reviews', y='occupancy_rate', data=review_impact, marker='o', color='green')
plt.title('Number of Reviews vs. Occupancy Rate')
plt.xlabel('Number of Reviews')
plt.ylabel('Average Occupancy Rate (fraction of year)')
plt.grid(True)
plt.show()

**Task-4: Feature Engineering**

In [None]:
# Define target: High occupancy (availability < 30 days)
df_filtered['high_occupancy'] = df_filtered['availability_365'] < 30

# Train-test split
X = df_filtered[['price', 'number_of_reviews']]
y = df_filtered['high_occupancy']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
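Because the target is defined by a fairly strict cut-off (fewer than 30 available days), the two classes may be unevenly sized. The sketch below, which uses synthetic stand-in data rather than the real listings, shows one way to check the class balance and to keep it identical across the train and test sets with the `stratify` argument of `train_test_split`. The names `demo`, `X_demo`, and `y_demo` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for df_filtered (all values are made up)
demo = pd.DataFrame({
    "price": rng.uniform(40, 300, size=200),
    "number_of_reviews": rng.integers(0, 150, size=200),
    "availability_365": rng.integers(0, 366, size=200),
})
demo["high_occupancy"] = demo["availability_365"] < 30

X_demo = demo[["price", "number_of_reviews"]]
y_demo = demo["high_occupancy"]

# Share of each class; a lopsided split makes raw accuracy hard to interpret
print(y_demo.value_counts(normalize=True))

# stratify=y_demo keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("Positive share (train):", y_tr.mean())
print("Positive share (test): ", y_te.mean())
```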

**Task-5: Modeling**

**5.1 Probability-Based Learning (Logistic Regression for Occupancy Prediction)**

Logistic Regression is a probability-based learning technique used for classification problems. In this notebook, it estimates the probability that a listing has high occupancy (fewer than 30 available days in the coming year) from the two features in X: price and number of reviews. Location and seasonal trends are explored in the EDA and evaluation sections but are not model inputs.
<br>
**Model Training Process** <br>
**Feature Selection:** The dataset is split into input features (X_train) and the target variable (y_train), where y_train indicates whether a listing has high occupancy.

**Model Initialization:** A logistic regression model is instantiated using LogisticRegression().

**Model Training:** The fit() method trains the model by learning the relationship between the input features and occupancy status.

The model outputs a probability score, which is turned into a class label by a threshold (by default, a probability above 0.5 is classified as high occupancy). Logistic regression is a simple yet effective approach for binary classification tasks like occupancy prediction.
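The link between the probability score and the predicted class can be illustrated with a small sketch on synthetic data (a single made-up feature, not the Airbnb listings). It shows that `predict()` agrees with thresholding the second column of `predict_proba()` at 0.5, and that a stricter threshold flags fewer listings as positive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic single feature: larger x makes class 1 more likely (made-up data)
x = rng.normal(size=(300, 1))
y = (x[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf = LogisticRegression().fit(x, y)

# Column 1 of predict_proba is the estimated P(class == 1)
proba = clf.predict_proba(x)[:, 1]

# For a binary problem, predict() matches thresholding that probability at 0.5
manual = (proba > 0.5).astype(int)
print("Matches predict():", (manual == clf.predict(x)).all())

# A stricter cut-off labels fewer rows as positive
strict = (proba > 0.8).astype(int)
print("Positives at 0.5:", manual.sum(), "| at 0.8:", strict.sum())
```

Choosing a threshold other than 0.5 can be useful when one kind of error matters more than the other, for example when a host would rather miss a few high-occupancy predictions than act on false ones.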

In [None]:
# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Predict probabilities
y_pred_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

print("Model trained successfully! Predictions available.")

In [None]:
# Print model accuracy and classification report
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
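Accuracy on its own can be misleading when one class is much more common than the other. As a hedged illustration on made-up labels (roughly 10% positives, not the real data), the sketch below shows that a baseline which always predicts the majority class reaches high accuracy while its balanced accuracy stays at 0.5. Comparing a trained model against such a baseline is one way to judge whether it has learned anything.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(1)

# Made-up imbalanced labels: roughly 10% positives; features are pure noise
X_fake = rng.normal(size=(500, 2))
y_fake = (rng.random(500) < 0.10).astype(int)

# Always predicts the most common class and ignores the features
baseline = DummyClassifier(strategy="most_frequent").fit(X_fake, y_fake)
pred = baseline.predict(X_fake)

acc = accuracy_score(y_fake, pred)
bal_acc = balanced_accuracy_score(y_fake, pred)
print("Accuracy:", acc)
print("Balanced accuracy:", bal_acc)
```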

**5.2 Information-Based Learning (Decision Tree for Feature Importance)**

Decision Tree classifiers use an information-based learning approach: at each node they split the data on the feature and threshold that most reduce class impurity (for example, by maximizing information gain). In this notebook the tree is trained on price and number of reviews, and its feature importances indicate which of the two contributes more to predicting high occupancy.

**Model Training Process** <br>
**Feature Importance:** The model ranks features based on their contribution to occupancy prediction.

**Decision Boundaries:** The tree splits data at different levels, creating a structured way to classify listings.

**Interpretability:** Decision Trees are easy to interpret, making them useful for understanding which factors influence occupancy.
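To show what "information gain" means for a single split, here is a minimal sketch that computes entropy and information gain by hand on a made-up six-listing example. The helper names `entropy` and `information_gain` are illustrative. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion='entropy'` makes it split on information gain instead.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Made-up example: the three cheapest listings are the high-occupancy ones
price = np.array([60, 75, 90, 110, 150, 200])
high_occupancy = np.array([1, 1, 1, 0, 0, 0])

gain_100 = information_gain(price, high_occupancy, threshold=100)  # separates the classes perfectly
gain_70 = information_gain(price, high_occupancy, threshold=70)    # isolates only one listing
print(gain_100, gain_70)
```

The tree-building algorithm evaluates candidate thresholds like these for every feature and keeps the split with the largest gain, which is why a threshold of 100 would be preferred over 70 in this toy case.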

In [None]:
# Train decision tree model
tree_model = DecisionTreeClassifier(max_depth=5)
tree_model.fit(X_train, y_train)

In [None]:
y_pred_tree = tree_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree) * 100:.2f}%")

In [None]:
# Feature importance
importance = tree_model.feature_importances_
feature_names = X.columns
feature_importance = pd.DataFrame({'Feature': feature_names, 'Importance': importance}).sort_values(by='Importance', ascending=False)


In [None]:
# Visualize the decision tree
from sklearn.tree import plot_tree

plt.figure(figsize=(20,10))
plot_tree(tree_model, feature_names=feature_names, class_names=['Low Occupancy', 'High Occupancy'], filled=True, rounded=True)
plt.title("Decision Tree Visualization")
plt.show()

In [None]:
# Plot feature importance
plt.figure(figsize=(8,5))
sns.barplot(x=feature_importance['Feature'], y=feature_importance['Importance'], palette="coolwarm")
plt.title("Feature Importance for Occupancy Prediction")
plt.show()

# Print feature importance
print(feature_importance)

**Task-6: Evaluation**

**6.1 Interactive Price vs. Occupancy Analysis**

In [None]:
def plot_price_vs_occupancy(price_range=(50, 150)):
    filtered_df = df_filtered[(df_filtered['price'] >= price_range[0]) & (df_filtered['price'] <= price_range[1])]

    plt.figure(figsize=(8,5))
    plt.scatter(filtered_df['price'], filtered_df['occupancy_rate'], alpha=0.5, color='blue')
    plt.xlabel('Price (CAD)')
    plt.ylabel('Occupancy Rate')
    plt.title('Price vs Occupancy for New Hosts')
    plt.grid(True)
    plt.show()

# Interactive slider for price range
price_slider = widgets.IntRangeSlider(
    value=[50, 150],
    min=20,
    max=300,
    step=5,
    description='Price Range:',
    continuous_update=False
)

widgets.interactive(plot_price_vs_occupancy, price_range=price_slider)

**6.2 Sentiment Analysis on Guest Reviews**

In [None]:
def sentiment_analysis(text):
    # Treat missing or non-string input as neutral
    if pd.isna(text) or not isinstance(text, str):
        return "Neutral"

    sentiment_score = TextBlob(text).sentiment.polarity
    if sentiment_score > 0:
        return "Positive"
    elif sentiment_score < 0:
        return "Negative"
    else:
        return "Neutral"

In [None]:
# NOTE: 'last_review' holds review dates, not review text, so TextBlob is expected to
# score every row as "Neutral" here. Meaningful sentiment requires the text of guest
# reviews, which this listings file does not include.
# Work on a string copy so 'last_review' itself stays a datetime column.
df_filtered['sentiment'] = df_filtered['last_review'].astype(str).apply(sentiment_analysis)

# Sentiment distribution
sentiment_counts = df_filtered['sentiment'].value_counts()

# Pie chart of sentiment distribution (colors keyed by label, not by count order)
color_map = {'Positive': 'green', 'Neutral': 'gray', 'Negative': 'red'}
plt.figure(figsize=(6,6))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', colors=[color_map[label] for label in sentiment_counts.index], startangle=140)
plt.title("Sentiment Analysis of Guest Reviews")
plt.show()

**6.3 Time Series Analysis (Seasonal Trends)**

In [None]:
# Extract month from the last review
df_filtered['last_review'] = pd.to_datetime(df_filtered['last_review'], errors='coerce')
df_filtered['month'] = df_filtered['last_review'].dt.month

# Group by month and calculate averages
monthly_trends = df_filtered.groupby('month').agg({'price': 'mean', 'occupancy_rate': 'mean'}).reset_index()

# Plot
plt.figure(figsize=(8,5))
sns.lineplot(x=monthly_trends['month'], y=monthly_trends['price'], marker='o', label="Avg Price", color="blue")
sns.lineplot(x=monthly_trends['month'], y=monthly_trends['occupancy_rate']*100, marker='o', label="Occupancy Rate (%)", color="red")
plt.xticks(range(1,13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.xlabel("Month")
plt.ylabel("Value")
plt.title("Seasonal Trends in Price & Occupancy")
plt.legend()
plt.grid(True)
plt.show()
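Monthly averages built from `last_review` can rest on very few listings in some months, and rows with no review date drop out of the grouping entirely. The sketch below, on four made-up rows, shows the same month extraction with a per-month row count added through named aggregation, so that thinly populated months are easy to spot. The column names `avg_price`, `avg_occupancy`, and `n_listings` are illustrative.

```python
import pandas as pd

# Four made-up listings; the last one has no review date
demo = pd.DataFrame({
    "last_review": pd.to_datetime(["2024-07-04", "2024-07-19", "2024-01-10", None]),
    "price": [150.0, 130.0, 90.0, 100.0],
    "occupancy_rate": [0.9, 0.7, 0.4, 0.5],
})
demo["month"] = demo["last_review"].dt.month  # NaN where the date is missing

# groupby drops the NaN month, so the undated listing is excluded
monthly = (
    demo.groupby("month")
        .agg(avg_price=("price", "mean"),
             avg_occupancy=("occupancy_rate", "mean"),
             n_listings=("price", "size"))
        .reset_index()
)
print(monthly)
```

Keep in mind that the month of the most recent review is only a rough proxy for when a listing is busy, so any seasonal pattern read from it should be treated as indicative.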

**Task-7: Conclusion**

In [None]:
# Conclusion: Display important features from Decision Tree
print("Key Features Affecting Occupancy:")
print(feature_importance)

# Key conclusions and actionable insights:
# - Price and number of reviews are significant factors affecting occupancy.
# - Positive guest reviews can significantly boost occupancy rates.
# - Seasonal price adjustments can help maximize occupancy rates during peak months.

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x=feature_importance['Feature'], y=feature_importance['Importance'], palette="coolwarm")
plt.title("Feature Importance for Occupancy Prediction")
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()

**Key Features:**

*   **Price:** Hosts with competitive pricing see higher occupancy, especially during off-peak months.
*   **Number of Reviews:** Properties with higher review counts and better ratings experience higher occupancy.
*   **Seasonal Trends:** Hosts should adjust prices based on seasonality to maximize occupancy and revenue.


**Actionable Insights:**

**Optimize Pricing:** Hosts should experiment with dynamic pricing and analyze peak periods to adjust prices accordingly.

**Focus on Reviews:** Providing excellent guest experiences can lead to positive reviews and repeat bookings, increasing occupancy.