# Question 5: Present a model to predict property prices using the data from the 'listings.csv' file. It should relate price to the other factors.

# Dataset: Seattle Airbnb Open Data 
link: https://www.kaggle.com/datasets/airbnb/seattle/data 

Context:
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.


Content:
The following Airbnb activity is included in this Seattle dataset:

"calendar.csv" - Calendar, including listing id and the price and availability for that day     
"listings.csv" -   Listings, including full descriptions and average review score      
"reviews.csv" - Reviews, including unique id for each reviewer and detailed comments     


License:    
CC0: Public Domain

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
%matplotlib inline



# Inferences from the file: "listings.csv"

In [10]:
list_df = pd.read_csv('listings.csv')
list_df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3813,8101950,https://www.airbnb.com/rooms/8101950,20160104002432,2016-01-04,3BR Mountain View House in Seattle,Our 3BR/2BA house boasts incredible views of t...,"Our 3BR/2BA house bright, stylish, and wheelch...",Our 3BR/2BA house boasts incredible views of t...,none,We're located near lots of family fun. Woodlan...,...,8.0,f,,WASHINGTON,f,strict,f,f,8,0.30
3814,8902327,https://www.airbnb.com/rooms/8902327,20160104002432,2016-01-04,Portage Bay View!-One Bedroom Apt,800 square foot 1 bedroom basement apartment w...,This space has a great view of Portage Bay wit...,800 square foot 1 bedroom basement apartment w...,none,The neighborhood is a quiet oasis that is clos...,...,10.0,f,,WASHINGTON,f,moderate,f,f,1,2.00
3815,10267360,https://www.airbnb.com/rooms/10267360,20160104002432,2016-01-04,Private apartment view of Lake WA,"Very comfortable lower unit. Quiet, charming m...",,"Very comfortable lower unit. Quiet, charming m...",none,,...,,f,,WASHINGTON,f,moderate,f,f,1,
3816,9604740,https://www.airbnb.com/rooms/9604740,20160104002432,2016-01-04,Amazing View with Modern Comfort!,Cozy studio condo in the heart on Madison Park...,Fully furnished unit to accommodate most needs...,Cozy studio condo in the heart on Madison Park...,none,Madison Park offers a peaceful slow pace upsca...,...,,f,,WASHINGTON,f,moderate,f,f,1,


In [11]:
# Preprocess 'host_acceptance_rate': strip '%', turn any 'N/A' entries into NaN
# (removing '/' leaves 'NA', which is then replaced by 'NaN'), and convert to float.
list_df['host_acceptance_rate'] = list_df['host_acceptance_rate'].str.replace('%', '').str.replace('/', '').str.replace('NA', 'NaN').astype(float)

# Preprocess 'host_response_rate' in the same way.
list_df['host_response_rate'] = list_df['host_response_rate'].str.replace('%', '').str.replace('/', '').str.replace('NA', 'NaN').astype(float)

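# Preprocess 'cleaning_fee' by removing '$' and ',' characters and convert it to a float.
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor.
list_df['cleaning_fee'] = list_df['cleaning_fee'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)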

# Create a dictionary to map 'property_type' values to enumeration values,
# then apply the mapping to the 'property_type' column.
# The lookup is case-sensitive; any type missing from the dictionary becomes NaN
# (later filled with the column mean).
Property_type_to_enum = {'House': 1, 'Apartment': 2, 'Townhouse': 3, 'Boat': 4, 'Condominium': 5, 'Cabin': 6, 'Loft': 7, 'other': 99}
list_df['property_type'] = list_df['property_type'].apply(lambda x: Property_type_to_enum.get(x, None))


# Create a dictionary to map unique 'cancellation_policy' values to enumeration values,
# then apply the mapping to the 'cancellation_policy' column.
Cancellation_type_to_enum = {'strict': 1, 'moderate': 2, 'flexible': 3}
list_df['cancellation_policy'] = list_df['cancellation_policy'].apply(lambda x: Cancellation_type_to_enum.get(x, None))


# Create a dictionary to map unique 'instant_bookable' values to enumeration values,
# then apply the mapping to the 'instant_bookable' column.
instant_book_to_enum = {'t': 1, 'f': 0}
list_df['instant_bookable'] = list_df['instant_bookable'].apply(lambda x: instant_book_to_enum.get(x, None))


# Create a dictionary to map 'bed_type' values to enumeration values,
# then apply the mapping to the 'bed_type' column.
# Bed types not listed here become NaN (later filled with the column mean).
bed_type_to_enum = {'Real Bed': 1, 'Futon': 2, 'Pull-out Sofa': 3}
list_df['bed_type'] = list_df['bed_type'].apply(lambda x: bed_type_to_enum.get(x, None))


# Create a dictionary to map unique 'room_type' values to enumeration values,
# then apply the mapping to the 'room_type' column.
room_type_to_enum = {'Private room': 1, 'Entire home/apt': 2,'Shared room': 3,}
list_df['room_type'] = list_df['room_type'].apply(lambda x: room_type_to_enum.get(x, None))


# Preprocess 'zipcode' by removing '\n' and ',' characters, and convert it to a float.
list_df['zipcode'] = list_df['zipcode'].str.replace('\n', '').str.replace(',', '').astype(float)

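# Preprocess 'security_deposit' by removing '$' and ',' characters, and convert it to a float.
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor.
list_df['security_deposit'] = list_df['security_deposit'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)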



# Enumerate the 'neighbourhood' column in place with integer codes.
# pd.factorize returns -1 for missing values, so after adding 1 they become 0.
list_df['neighbourhood'] = pd.factorize(list_df['neighbourhood'])[0] + 1

# Enumerate the 'neighbourhood_cleansed' column in the same way.
list_df['neighbourhood_cleansed'] = pd.factorize(list_df['neighbourhood_cleansed'])[0] + 1

# Enumerate the 'neighbourhood_group_cleansed' column in the same way.
list_df['neighbourhood_group_cleansed'] = pd.factorize(list_df['neighbourhood_group_cleansed'])[0] + 1



list_df[['host_response_rate','host_acceptance_rate','cleaning_fee','property_type','zipcode','cancellation_policy',
        'instant_bookable','bed_type','room_type','zipcode',
        'security_deposit','neighbourhood','neighbourhood_group_cleansed']]

Unnamed: 0,host_response_rate,host_acceptance_rate,cleaning_fee,property_type,zipcode,cancellation_policy,instant_bookable,bed_type,room_type,zipcode.1,security_deposit,neighbourhood,neighbourhood_group_cleansed
0,96.0,100.0,,2.0,98119.0,2,0,1.0,2,98119.0,,1,1
1,98.0,100.0,40.0,2.0,98119.0,1,0,1.0,2,98119.0,100.0,1,1
2,67.0,100.0,300.0,1.0,98119.0,1,0,1.0,2,98119.0,1000.0,1,1
3,,,,2.0,98119.0,3,0,1.0,2,98119.0,,1,1
4,100.0,,125.0,1.0,98119.0,1,0,1.0,2,98119.0,700.0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3813,99.0,100.0,230.0,1.0,98107.0,1,0,1.0,2,98107.0,,4,3
3814,100.0,100.0,50.0,2.0,98102.0,2,0,1.0,2,98102.0,500.0,23,16
3815,,,35.0,1.0,98178.0,2,0,1.0,2,98178.0,250.0,0,12
3816,100.0,,45.0,5.0,98112.0,2,0,1.0,2,98112.0,300.0,0,16
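
The chained `str.replace` calls in the cell above could be collapsed into one helper. The following is a minimal sketch rather than the notebook's own code: `to_number` is a hypothetical helper name, and the two toy Series only stand in for `listings.csv` columns. It strips `$`, `,`, and `%` in one pass and relies on `pd.to_numeric(..., errors="coerce")` to turn anything unparseable, such as `N/A`, into `NaN`.

```python
import pandas as pd

def to_number(col: pd.Series) -> pd.Series:
    """Strip '$', ',' and '%', then convert to float; unparseable values become NaN."""
    # Inside a character class, '$' is a literal dollar sign, not an anchor.
    cleaned = col.str.replace(r"[$,%]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Toy values for illustration only (not taken from the dataset).
fees = pd.Series(["$1,000.00", "$40.00", None])
rates = pd.Series(["96%", "N/A", "100%"])

print(to_number(fees).tolist())
print(to_number(rates).tolist())
```

With a helper like this, each currency or percentage column needs a single call, and malformed entries cannot raise a conversion error in `astype(float)`.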


In [12]:
# Create the feature matrix X from the columns assumed to affect the response vector.
X = list_df[['zipcode', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
             'bed_type', 'square_feet', 'cleaning_fee', 'guests_included', 'minimum_nights',
             'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
             'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
             'review_scores_location', 'review_scores_value', 'security_deposit',
             'latitude', 'longitude', 'neighbourhood', 'neighbourhood_cleansed',
             'neighbourhood_group_cleansed']]

X.shape

(3818, 25)
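
Several of these features (`property_type`, `room_type`, `bed_type`, the neighbourhood columns) are nominal categories encoded as integers, so a linear model treats the arbitrary code order and spacing as meaningful. One common alternative is one-hot encoding. The sketch below assumes a toy DataFrame with illustrative values; it is not the encoding used in this notebook.

```python
import pandas as pd

# Toy stand-in for a few rows of listings.csv (values are illustrative only).
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
    "accommodates": [2, 4, 1, 6],
})

# One indicator column per category; drop_first removes one level so the
# indicators are not perfectly collinear with the model's intercept.
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
```

Each remaining indicator then gets its own coefficient, read as the price difference relative to the dropped category.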

In [13]:
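# Create the response vector y.
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor.
list_df['price'] = list_df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
y = list_df['price']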

#Fill the mean of the column for any missing values in X.
fill_mean = lambda col: col.fillna(col.mean())
X = X.apply(fill_mean,axis=0)

#Create train, test datasets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

#Instantiate LinearRegression model
lm_model=LinearRegression()

#Fit the model with training dataset
lm_model.fit(X_train,y_train)

#Predict the response for the training data and test data
y_test_pred = lm_model.predict(X_test)
y_train_pred=lm_model.predict(X_train)

#Score the R2 value for both the training and test data
test_score_r2 = r2_score(y_test,y_test_pred)
train_score_r2 = r2_score(y_train,y_train_pred)


#Score the mean_squared_error for both the training and test data
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

train_score_r2, test_score_r2, mse_train, mse_test

(0.548809603292781, 0.5626520215369663, 3591.931287084777, 3745.2508814211506)
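
Because MSE is expressed in squared dollars, its square root (RMSE) is easier to read against actual prices. The short calculation below uses only the two MSE values printed above; the variable names `rmse_train` and `rmse_test` are introduced here for illustration.

```python
import math

# MSE values copied from the output above (units: dollars squared).
mse_train = 3591.931287084777
mse_test = 3745.2508814211506

# RMSE is in the same units as price (dollars).
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"train RMSE: {rmse_train:.2f}, test RMSE: {rmse_test:.2f}")
```

Both values come out at roughly 60, so predictions are off by about $60 in the root-mean-square sense.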

# Inference 5:

Based on the provided values:

train_score is approximately 0.5488, meaning the model explains about 55% of the variance in price on the training data. A value of 1.0 would indicate a perfect fit, so this is a moderate fit: the selected features capture a little over half of the price variation and leave the rest unexplained.

test_score is approximately 0.5627, essentially the same as the training score (slightly higher, in fact). Because the model does no worse on unseen data than on the data it was fitted to, there is no sign of overfitting.

mse_train is approximately 3591.93. This is the Mean Squared Error on the training data: the average squared difference between predicted and actual prices, in squared dollars. Lower is better, but the raw value is hard to judge on its own. Its square root (RMSE) is about 59.93, so predictions on the training data are off by roughly $60 in the root-mean-square sense.

mse_test is approximately 3745.25 (RMSE about 61.20). This is the Mean Squared Error on the test data, and it measures how well the model generalizes to unseen data. Test error is usually somewhat higher than training error, and here the gap is small, which agrees with the R2 scores: performance is consistent between the training and test datasets.

Overall, the model generalizes consistently and does not overfit, but it explains only about 55% of the variance in price, with a root-mean-square error of about $60. Further analysis, such as feature selection, a different encoding of the categorical columns, or a more flexible model, may be needed to improve its performance.
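
One possible direction for that further analysis is sketched below under stated assumptions: the data is synthetic (`X_demo` and `y_demo` are made-up stand-ins, not the Airbnb listings), and the pipeline is a suggestion rather than the method used in this notebook. It wraps mean imputation and scaling in a scikit-learn pipeline so that both are fitted on the training folds only, scores the model with 5-fold cross-validation, and then reads the standardized coefficients as a rough indication of how each factor relates to price.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the listings features; not the Airbnb data.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n).astype(float),
    "bedrooms": rng.integers(0, 5, size=n).astype(float),
    "noise": rng.normal(size=n),
})
y_demo = 30 * X_demo["accommodates"] + 20 * X_demo["bedrooms"] + rng.normal(scale=10, size=n)

# Knock out some values to mimic missing data.
X_demo.loc[rng.choice(n, size=20, replace=False), "bedrooms"] = np.nan

# Imputation and scaling are fitted inside each training fold only.
model = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print("R2 per fold:", np.round(cv_r2, 3))

# Standardized coefficients give a rough ranking of feature influence.
model.fit(X_demo, y_demo)
coefs = pd.Series(model.named_steps["linearregression"].coef_, index=X_demo.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

Applied to the real feature matrix, the per-fold scores would show whether the single 70/30 split above is representative, and the coefficient ranking would make the relationship between price and each factor explicit.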