# Airbnb Hotel Booking Analysis
Author: Tushar Sanap

This submission notebook loads the cleaned Airbnb data, then runs EDA, visualizations, and a simple price prediction model (linear regression on numeric features). Three PNG images have been generated in `/mnt/data/` for inclusion in a PPT.

Files produced:
- `/mnt/data/airbnb_cleaned_for_Tushar.csv` (cleaned data)
- `plot_price_distribution.png`
- `plot_avg_price_by_neighbourhood_group.png`
- `plot_roomtype_vs_price.png`

Run the cells to reproduce the analysis and regenerate the plots.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

pd.set_option('display.max_columns', 50)
print('Imports ready')

In [None]:
df = pd.read_csv('/mnt/data/airbnb_cleaned_for_Tushar.csv')
print('Loaded cleaned data with shape:', df.shape)
df.head()
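
Before the summary statistics, it can help to confirm that the cleaned file has the expected dtypes and no remaining gaps. The sketch below is one possible check, not part of the original analysis: `quality_report` is a hypothetical helper, and `demo` is a made-up frame standing in for `df`.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isna().sum(),
        'n_unique': frame.nunique(),  # NaN is not counted as a distinct value
    })

# Made-up rows for illustration only; these are not values from the dataset.
demo = pd.DataFrame({
    'price': [100.0, 250.0, np.nan, 80.0],
    'room_type': ['Private room', 'Entire home/apt', 'Private room', None],
})
report = quality_report(demo)
print(report)
```

In the notebook itself, `quality_report(df)` would give the same table for the loaded data.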

In [None]:
print('Numeric summary:')
print(df.describe().T)

print('\nTop neighbourhoods by count:')
if 'neighbourhood' in df.columns:
    print(df['neighbourhood'].value_counts().head(10))
else:
    print('No neighbourhood column')

In [None]:
# Price distribution
plt.figure(figsize=(8,5))
plt.hist(df['price'].dropna(), bins=50)
plt.xlabel('Price')
plt.ylabel('Count')
plt.title('Price Distribution (USD)')
# Save before show(): some backends clear the figure once it is displayed
plt.savefig('/mnt/data/plot_price_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Avg price by group (falls back to 'neighbourhood' if the group column is missing)
group_col = 'neighbourhood_group' if 'neighbourhood_group' in df.columns else 'neighbourhood'
avg_price = df.groupby(group_col)['price'].mean().sort_values(ascending=False).head(15)
plt.figure(figsize=(10,5))
plt.bar(avg_price.index.astype(str), avg_price.values)
plt.xlabel(group_col.replace('_',' ').title())
plt.ylabel('Average Price')
plt.title('Average Price by ' + group_col.replace('_',' ').title())
plt.xticks(rotation=45, ha='right')
plt.savefig('/mnt/data/plot_avg_price_by_neighbourhood_group.png', dpi=150, bbox_inches='tight')
plt.show()

# Room type vs price
if 'room_type' in df.columns:
    rt = df.groupby('room_type')['price'].mean().sort_values()
    plt.figure(figsize=(8,5))
    plt.bar(rt.index.astype(str), rt.values)
    plt.xlabel('Room Type')
    plt.ylabel('Average Price')
    plt.title('Average Price by Room Type')
    plt.xticks(rotation=30, ha='right')
    plt.savefig('/mnt/data/plot_roomtype_vs_price.png', dpi=150, bbox_inches='tight')
    plt.show()
else:
    plt.figure(figsize=(8,5))
    plt.scatter(df['number_of_reviews'], df['price'], s=10)
    plt.xlabel('Number of Reviews')
    plt.ylabel('Price')
    plt.title('Price vs Number of Reviews')
    plt.show()

In [None]:
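# Simple regression model (numeric features only)
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
if 'price' in num_cols:
    X = df[num_cols].drop(columns=['price'])
    # drop columns with too many unique values (like ids);
    # note this also removes continuous features with 200 or more distinct values
    X = X.loc[:, X.nunique() < 200]
    # LinearRegression rejects NaN, so keep only rows complete in features and price
    data = pd.concat([X, df['price']], axis=1).dropna()
    X, y = data.drop(columns=['price']), data['price']
    # Hold out 20% of rows for evaluation; fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('MSE:', mean_squared_error(y_test, y_pred))
    print('R2:', r2_score(y_test, y_pred))
else:
    print('No numeric price column for modeling')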
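
A single 80/20 split gives one MSE and one R2, and both can shift with the random seed. As a rough sketch of a steadier estimate, the cell below runs 5-fold cross-validation and reports RMSE, which is in the same units as the target. `X_demo` and `y_demo` are synthetic stand-ins (an assumption for illustration), so the numbers say nothing about the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 3 features, linear signal plus unit noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 50 + X_demo @ np.array([10.0, -5.0, 2.0]) + rng.normal(scale=1.0, size=200)

# scikit-learn scorers are "higher is better", so RMSE comes back negated.
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=5, scoring='neg_root_mean_squared_error',
)
rmse_per_fold = -scores
print('RMSE per fold:', rmse_per_fold.round(2))
print('Mean RMSE:', round(rmse_per_fold.mean(), 2))
```

With the notebook's own data, `X` and `y` from the regression cell would replace `X_demo` and `y_demo`.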

## Conclusions & Submission Notes

- Cleaned data saved as `airbnb_cleaned_for_Tushar.csv`.
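- Three plots (price distribution, average price by neighbourhood group, average price by room type) saved as PNGs in `/mnt/data/` (use these in your PPT).
- The simple linear regression is a baseline; consider tree-based models, which can capture non-linear effects that a linear model misses.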
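
To illustrate the last point, here is a minimal sketch comparing the linear baseline with a random forest on the same held-out split. The data is synthetic and deliberately non-linear (an assumption made for illustration), so the gap it shows is not a result for the Airbnb dataset; whether a forest helps there has to be checked on the real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic target with a quadratic term and a step, which a straight line cannot fit.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + 3 * (X_demo[:, 1] > 0) + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'linear': LinearRegression(),
    'forest': RandomForestRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, est in models.items():
    est.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, est.predict(X_te))  # R2 on the held-out 20%
print(results)
```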

Good luck with your internship submission, Tushar!