Google Drive Link: https://drive.google.com/file/d/1TL6NHkO4Xz11Cez8EJiYfDn6vNiOm92L/view?usp=sharing

## Project Overview

Airbnb is an online marketplace for short-term home and apartment rentals. It allows you to rent out your home for a week while you are away, or to rent out an empty house or bedroom so that guests can have a peaceful, quality vacation.
The challenge Airbnb hosts face is determining the optimal nightly rental price. In many areas, renters are presented with a good selection of listings and can filter by criteria such as price, number of bedrooms, room type, and more.

#### The target population of interest

This project helps Airbnb hosts present suitable listings on the website according to the season and improve the chance of their place getting booked at a more appropriate price.

#### Aim of the Project
The aim of this project is to answer the questions below:
1. Predict rental prices and identify the influential features of Airbnb listings.
2. Perform sentiment analysis on Airbnb reviews, which can influence customer decisions.
3. Identify the top amenities that are essential to Airbnb and that customers expect.
4. Predict future price trends for various seasons.

#### Innovativeness of the project

We aim not only to answer the above questions, but also to answer the following questions through visualization:
<p>1. What is the season pattern of Airbnb in Chicago?<br>
2. What kinds of Airbnb homes are popular?<br>
3. What factors of the rental price have the most impact?</p>

#### Is the Data Already Available?

Airbnb does not release any data on the listings in its marketplace, but a separate group named Inside Airbnb scrapes and compiles publicly available information about Airbnb listings in many cities from the Airbnb website. For this project, we use their data set for the city of Chicago, Illinois, scraped on September 11, 2021. It contains information on all Chicago Airbnb listings that were live on the site on that date (over 7400).

Dataset Link: http://insideairbnb.com/chicago

## Any changes
<p>The scope of the project remains the same, i.e. to predict the rental prices and answer questions such as the seasonal pattern of Airbnb in Chicago, the kinds of homes that are popular, and the most influential features for the rental price.</p>

In [None]:
import pandas as pd

In [None]:
airbnb_listings = pd.read_csv('listings.csv')

In [None]:
airbnb_listings.head(5)

In [None]:
airbnb_listings.info()

In [None]:
airbnb_listings.shape

In [None]:
airbnb_reviews = pd.read_csv("reviews.csv")

In [None]:
airbnb_reviews.head(5)

In [None]:
airbnb_reviews.info()

In [None]:
airbnb_reviews.shape

In [None]:
airbnb_calendar_date = pd.read_csv("calendar.csv")

In [None]:
airbnb_calendar = pd.read_csv("calendar.csv")

In [None]:
airbnb_calendar.head(5)

In [None]:
airbnb_calendar.info()

In [None]:
airbnb_calendar.shape

In [None]:
null = airbnb_reviews.isna().sum()
null.sort_values(ascending=False)

In [None]:
airbnb_listings.columns

## Data Cleaning and Pre-Processing

In [None]:
airbnb_listings.isna().sum().sort_values(ascending=False)

In [None]:
def dropping_column(data, col_name): 
    new_data = data.drop(col_name, axis=1)
    print('Dropping {}...'.format(col_name))
    return new_data

STEP 1: Dropping Useless Features
There are several kinds of useless data:

->not informative: id, url, name...

->informative, but hard to deal with: text, latitude, longitude...

->values are identical: city, state...

->values are redundant: listing count, availability...

->not intrinsic features: price, review...


In [None]:
airbnb_listings_clean = dropping_column(airbnb_listings, 'host_verifications')
for col_name in airbnb_listings_clean.columns:
    # Drop identifier, URL and name columns; any() ensures each column is dropped only once
    if any(key in col_name for key in ('id', 'url', 'name')):
        airbnb_listings_clean = dropping_column(airbnb_listings_clean, col_name)

In [None]:
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_listings_count')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_total_listings_count')

Dropping text columns that are informative but hard to deal with and aren't necessary

In [None]:
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'description')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'neighborhood_overview')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_about')

We only need 'neighbourhood_cleansed' as the home-location feature, so the other location-related columns can be removed

In [None]:
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'neighbourhood')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'latitude')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'longitude')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_location')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_neighbourhood') 
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'license')
airbnb_listings_clean = dropping_column(airbnb_listings_clean,'last_scraped')

Checking for duplicate rows and columns if any and removing them

In [None]:
airbnb_listings_clean = airbnb_listings_clean.loc[:,~airbnb_listings_clean.T.duplicated(keep='first')]
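# Assign the result back; drop_duplicates() returns a new frame and does not modify in place
airbnb_listings_clean = airbnb_listings_clean.drop_duplicates()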

Only keeping the annual availability and removing the other availability columns

In [None]:
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'availability_30')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'availability_60')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'availability_90')

Dropping unnecessary columns that contain at most one distinct value

In [None]:
for column_name in airbnb_listings_clean.columns:
    if len(airbnb_listings_clean[column_name].value_counts()) <= 1:
        airbnb_listings_clean = dropping_column(airbnb_listings_clean, column_name)

Dropping all the review columns except 'review_scores_rating'

In [None]:
for col_name in airbnb_listings_clean.columns:
    if 'review' in col_name and col_name!='review_scores_rating':
        airbnb_listings_clean = dropping_column(airbnb_listings_clean, col_name)

Dropping some columns which are not required

In [None]:
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'bathrooms_text')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_is_superhost')
airbnb_listings_clean = dropping_column(airbnb_listings_clean, 'host_response_time')

In [None]:
airbnb_listings_clean.hist(figsize=(20,20));

### Analysing Missing Data

In [None]:
import matplotlib.pyplot as plt
# Fraction of missing values per column, keeping only the columns that have any
null_values_percentage = airbnb_listings_clean.isnull().sum().sort_values(ascending=False) / len(airbnb_listings_clean)
null_values_percentage = null_values_percentage[null_values_percentage != 0]
x = range(len(null_values_percentage))
y = null_values_percentage

plt.figure(figsize=(10,20))
ax = plt.subplot()

plt.gca().invert_yaxis()
# One tick and one label per plotted bar
ax.set_yticks(x)
ax.set_yticklabels(null_values_percentage.index)

plt.barh(x, y, color='y')
plt.show()

In [None]:
# Checking the missing data in rows
airbnb_listings_clean['Missing_num'] = airbnb_listings_clean.isnull().sum(axis=1)
n_rows = len(airbnb_listings_clean)
print('{:.2f}% rows have no missing data.'.format((airbnb_listings_clean['Missing_num'] == 0).sum() / n_rows * 100))
# Cumulative share of rows with at most k missing values
for k in range(1, 11):
    share = (airbnb_listings_clean['Missing_num'] <= k).sum() / n_rows * 100
    print('{:.2f}% rows have at most {} missing values.'.format(share, k))

In [None]:
#Distribution of missing data in rows
plt.figure(figsize=(16,8))
plt.xticks(range(50))
airbnb_listings_clean['Missing_num'].plot.hist(color='b', alpha=0.5, bins=50)
plt.show()

We count the missing values in each row, set a threshold of 4, and drop the rows that exceed it.

In [None]:
# dropping rows which have more than 4 missing values
missing_threshold = 4
airbnb_listings_clean = airbnb_listings_clean[airbnb_listings_clean['Missing_num']<=missing_threshold].drop('Missing_num', axis = 1)
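
As a rough illustration of the row filter above, the sketch below applies the same rule to a small made-up frame. The frame `toy` and its column names are hypothetical; only the `isnull().sum(axis=1)` pattern and the threshold of 4 come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 rows, 6 columns, with different numbers of NaNs per row.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, np.nan, 5.0],
    'c': [3.0, np.nan, np.nan],
    'd': [4.0, np.nan, 1.0],
    'e': [5.0, np.nan, 2.0],
    'f': [6.0, 7.0, 3.0],
})

missing_threshold = 4

# Count missing values per row, then keep rows at or below the threshold.
missing_per_row = toy.isnull().sum(axis=1)
toy_kept = toy[missing_per_row <= missing_threshold]

print(missing_per_row.tolist())
print(toy_kept.index.tolist())
```

The middle row has five missing values and is removed, while the row with two missing values survives and is left for imputation later.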

In [None]:
airbnb_listings_clean.shape

## Checking for NAN values

In [None]:
airbnb_listings_clean.isnull().sum().sum()

In [None]:
airbnb_listings_clean.isnull().sum()

In [None]:
airbnb_listings_clean.info()

In [None]:
import numpy as np

Converting the data from string to float type

In [None]:
# np.float was removed in recent NumPy releases; the built-in float behaves the same here
airbnb_listings_clean['price'] = airbnb_listings_clean['price'].map(lambda price: float(price[1:].replace(',', '')), na_action='ignore')
airbnb_listings_clean['host_response_rate'] = airbnb_listings_clean['host_response_rate'].map(lambda rate: float(rate[:-1]) / 100, na_action='ignore')
airbnb_listings_clean['host_acceptance_rate'] = airbnb_listings_clean['host_acceptance_rate'].map(lambda rate: float(rate[:-1]) / 100, na_action='ignore')

In [None]:
# Impute the remaining gaps with the column median (plain assignment avoids chained in-place fills)
for col in ['host_response_rate', 'bedrooms', 'beds', 'host_acceptance_rate', 'review_scores_rating']:
    airbnb_listings_clean[col] = airbnb_listings_clean[col].fillna(airbnb_listings_clean[col].median())

In [None]:
airbnb_listings_dataset = airbnb_listings_clean

In [None]:
airbnb_listings_clean.isnull().sum()

In [None]:
airbnb_listings_clean.dtypes

## Handling Outliers using Box-Plot

#### Accommodates column

In [None]:
import seaborn as sns
sns.boxplot( x = 'accommodates', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

As the box plot shows, there are outliers. We handle them by capping: values above the upper limit (Q3 + 1.5 × IQR) are replaced with the upper limit, and values below the lower limit (Q1 − 1.5 × IQR) are replaced with the lower limit.

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['accommodates'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['accommodates'] = np.where(airbnb_listings_clean['accommodates']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['accommodates']< lower_limit, lower_limit,
                          airbnb_listings_clean['accommodates']))
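
The same capping step is repeated for every numeric column below. As a minimal sketch, it could be wrapped in a helper like the following. The name `cap_outliers_iqr` and the toy series are hypothetical and not part of the notebook, and the helper assumes the column has no missing values, since `np.percentile` returns NaN otherwise.

```python
import numpy as np
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those limits."""
    q1 = np.percentile(series, 25)
    q3 = np.percentile(series, 75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values with one obvious outlier (50).
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])
capped = cap_outliers_iqr(toy)

print(capped.tolist())
```

Here Q1 is 2.0 and Q3 is 3.5, so the upper limit is 5.75 and only the value 50 is changed. `Series.clip` gives the same result as the nested `np.where` calls used in the notebook.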

In [None]:
sns.boxplot( x = 'accommodates', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### Bedrooms

In [None]:
sns.boxplot( x = 'bedrooms', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['bedrooms'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['bedrooms'] = np.where(airbnb_listings_clean['bedrooms']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['bedrooms']< lower_limit, lower_limit,
                          airbnb_listings_clean['bedrooms']))

In [None]:
sns.boxplot( x = 'bedrooms', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### Beds

In [None]:
sns.boxplot( x = 'beds', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['beds'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['beds'] = np.where(airbnb_listings_clean['beds']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['beds']< lower_limit, lower_limit,
                          airbnb_listings_clean['beds']))

In [None]:
sns.boxplot( x = 'beds', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### Price

In [None]:
sns.boxplot( x = 'price', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['price'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['price'] = np.where(airbnb_listings_clean['price']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['price']< lower_limit, lower_limit,
                          airbnb_listings_clean['price']))

In [None]:
sns.boxplot( x = 'price', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### calculated_host_listings_count

In [None]:
sns.boxplot( x = 'calculated_host_listings_count', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['calculated_host_listings_count'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['calculated_host_listings_count'] = np.where(airbnb_listings_clean['calculated_host_listings_count']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['calculated_host_listings_count']< lower_limit, lower_limit,
                          airbnb_listings_clean['calculated_host_listings_count']))

In [None]:
sns.boxplot( x = 'calculated_host_listings_count', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### calculated_host_listings_count_entire_homes

In [None]:
sns.boxplot( x = 'calculated_host_listings_count_entire_homes', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['calculated_host_listings_count_entire_homes'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['calculated_host_listings_count_entire_homes'] = np.where(airbnb_listings_clean['calculated_host_listings_count_entire_homes']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['calculated_host_listings_count_entire_homes']< lower_limit, lower_limit,
                          airbnb_listings_clean['calculated_host_listings_count_entire_homes']))

In [None]:
sns.boxplot( x = 'calculated_host_listings_count_entire_homes', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### calculated_host_listings_count_private_rooms

In [None]:
sns.boxplot( x = 'calculated_host_listings_count_private_rooms', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['calculated_host_listings_count_private_rooms'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['calculated_host_listings_count_private_rooms'] = np.where(airbnb_listings_clean['calculated_host_listings_count_private_rooms']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['calculated_host_listings_count_private_rooms']< lower_limit, lower_limit,
                          airbnb_listings_clean['calculated_host_listings_count_private_rooms']))

In [None]:
sns.boxplot( x = 'calculated_host_listings_count_private_rooms', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### calculated_host_listings_count_shared_rooms

In [None]:
sns.boxplot( x = 'calculated_host_listings_count_shared_rooms', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['calculated_host_listings_count_shared_rooms'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['calculated_host_listings_count_shared_rooms'] = np.where(airbnb_listings_clean['calculated_host_listings_count_shared_rooms']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['calculated_host_listings_count_shared_rooms']< lower_limit, lower_limit,
                          airbnb_listings_clean['calculated_host_listings_count_shared_rooms']))

In [None]:
sns.boxplot( x = 'calculated_host_listings_count_shared_rooms', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### host_response_rate

In [None]:
sns.boxplot( x = 'host_response_rate', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['host_response_rate'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['host_response_rate'] = np.where(airbnb_listings_clean['host_response_rate']> upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['host_response_rate']< lower_limit, lower_limit,
                          airbnb_listings_clean['host_response_rate']))

In [None]:
sns.boxplot( x = 'host_response_rate', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

#### host_acceptance_rate

In [None]:
sns.boxplot( x = 'host_acceptance_rate', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
outliers = []

dat = sorted(airbnb_listings_clean['host_acceptance_rate'])
q1 = np.percentile(dat, 25)
q3 = np.percentile(dat, 75)
IQR = q3-q1
lower_limit = q1-(1.5*IQR)
upper_limit = q3+(1.5*IQR)

airbnb_listings_clean['host_acceptance_rate'] = np.where(airbnb_listings_clean['host_acceptance_rate']> 
                                                         upper_limit, upper_limit,
                        np.where(airbnb_listings_clean['host_acceptance_rate']< lower_limit, lower_limit,
                          airbnb_listings_clean['host_acceptance_rate']))

In [None]:
sns.boxplot( x = 'host_acceptance_rate', data =airbnb_listings_clean)
sns.set(rc={"figure.figsize":(10, 8)})

In [None]:
for col in airbnb_reviews:
    print(col)

In [None]:
airbnb_reviews.dtypes

In [None]:
for col in airbnb_calendar:
    print(col)

In [None]:
airbnb_calendar.dtypes

In [None]:
airbnb_reviews.isna().sum().sort_values(ascending=False)

The airbnb_reviews dataset does not have any null or NA values, so there is no need to clean it.

In [None]:
airbnb_calendar.isna().sum().sort_values(ascending=False)

The airbnb_calendar dataset does not have any null or NA values, so there is no need to clean it.

## Feature Encoding
Feature encoding is the process of turning categorical data in a dataset into numerical data. We use one-hot encoding of features.

Performing one-hot encoding on categorical data
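
To make the idea concrete, here is a minimal sketch of one-hot encoding on a made-up frame. The frame `toy` and its values are illustrative assumptions; `pd.get_dummies` is the same pandas call the notebook relies on.

```python
import pandas as pd

# Hypothetical listings with one categorical and one numeric column.
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt'],
    'beds': [2, 1, 3],
})

# Each distinct category becomes its own indicator column.
dummies = pd.get_dummies(toy['room_type'], prefix='room_type')
encoded = pd.concat([toy.drop('room_type', axis=1), dummies], axis=1)

print(encoded.columns.tolist())
```

The categorical column is replaced by one indicator column per category, which is why this approach is only practical when the number of distinct values is small.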

In [None]:
def oneHot(feat, data):
    # Return a new frame in which column `feat` is replaced by its one-hot columns
    print('Encoding {} as one-hot..'.format(feat))
    cur_dummies = pd.get_dummies(data[feat], prefix=feat)
    data = data.drop(feat, axis=1)
    data = pd.concat([data, cur_dummies], axis=1)
    return data

Checking which columns are float.
For categorical variables with few categories, we encode them as one-hot.
If a categorical variable has many distinct values, we should be more careful.

In [None]:
features = []
conts = []
for col_name in airbnb_listings_clean.columns:
    if pd.api.types.is_float_dtype(airbnb_listings_clean[col_name]):
        print('{} is a continuous variable'.format(col_name))
        conts.append(col_name)
    elif len(airbnb_listings_clean[col_name].value_counts()) <= 5:
        # Assign back so the one-hot columns are kept in the working frame
        airbnb_listings_clean = oneHot(col_name, airbnb_listings_clean)
    else:
        features.append(col_name)
# Snapshot under the name used by later cells
data_clean = airbnb_listings_clean.copy()

In [None]:
#Printing continuous variables
print(conts)

In [None]:
#Variables we should look further more into 
print(features)

In [None]:
airbnb_listings_clean['host_since'].head()

In [None]:
airbnb_listings_clean['host_since'] = airbnb_listings_clean['host_since'].map(lambda date: 2022- int(date[-4:]), na_action='ignore')

In [None]:
airbnb_listings_clean['host_since'].value_counts()

In [None]:
airbnb_listings_clean = oneHot('host_since', airbnb_listings_clean)

In [None]:
airbnb_listings_clean['neighbourhood_cleansed'].value_counts()

In [None]:
airbnb_listings_clean = oneHot('neighbourhood_cleansed', airbnb_listings_clean)

In [None]:
airbnb_listings_clean['property_type'].value_counts()

In [None]:
airbnb_listings_clean = oneHot('property_type', airbnb_listings_clean)

In [None]:
airbnb_listings_clean['accommodates'].value_counts()

In [None]:
airbnb_listings_clean = oneHot('accommodates', airbnb_listings_clean)

In [None]:
airbnb_listings_clean['amenities'][0]

In [None]:
amenities_list = list(airbnb_listings_clean['amenities'])

In [None]:
amenities_list_string = " ".join(amenities_list)

In [None]:
amenities_list_string = amenities_list_string.replace('{', '')
amenities_list_string = amenities_list_string.replace('}', ',')
amenities_list_string = amenities_list_string.replace('[', '')
amenities_list_string = amenities_list_string.replace(']', ',')
amenities_list_string = amenities_list_string.replace('"', '')
amenities_list_string = amenities_list_string.replace("'", "")
amenities_set = [x.strip() for x in amenities_list_string.split(',')]

In [None]:
mydict = {}
for word in amenities_set:
    if word in mydict:
        mydict[word] += 1
    else:
        mydict[word] = 1       

In [None]:
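threshold = 500
# Amenities that appear in more than `threshold` listings; skip the empty string left over from splitting
A = [k for (k, v) in mydict.items() if v > threshold and k]

for a in A:
    # Substring match against the raw amenities text; `a` is bound as a default argument for clarity
    airbnb_listings_clean[a] = airbnb_listings_clean['amenities'].apply(lambda text, a=a: 1 if a in text else 0)
airbnb_listings_clean = airbnb_listings_clean.drop(['amenities'],axis=1)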
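
The amenity counting above works on raw strings. As an alternative sketch, assuming each `amenities` value is a JSON array string (an assumption about the file format, not something the notebook states), the counts and indicator columns could be built with `json` and `collections.Counter`. The sample strings and the threshold of 1 are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical amenity strings, one per listing.
raw_amenities = [
    '["Wifi", "Kitchen", "Heating"]',
    '["Wifi", "Heating"]',
    '["Wifi", "Free parking on premises"]',
]

# Count in how many listings each amenity appears.
counts = Counter()
for raw in raw_amenities:
    counts.update(json.loads(raw))

# Keep amenities that appear in more than `threshold` listings.
threshold = 1
frequent = sorted(name for name, n in counts.items() if n > threshold)

# One 0/1 indicator per frequent amenity, using exact membership rather than substring search.
indicators = [[int(name in json.loads(raw)) for name in frequent] for raw in raw_amenities]

print(frequent)
print(indicators)
```

Exact membership avoids accidental matches where one amenity name is contained in another, at the cost of depending on the strings being valid JSON.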

In [None]:
airbnb_listings_clean.describe()

In [None]:
airbnb_listings_clean.dtypes

Re-encoding features

## Fill Missing Values

A few attributes still have missing values. Since all the columns with missing values are continuous, we can impute them with the mean or the median. We used the median because it is more robust to outliers.
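
A small made-up example shows why the median is the safer choice here. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme value.
toy = pd.DataFrame({'beds': [1.0, 2.0, np.nan, 2.0, 40.0]})

median_beds = toy['beds'].median()
mean_beds = toy['beds'].mean()

# Fill the gap with the median rather than the mean.
toy['beds'] = toy['beds'].fillna(median_beds)

print(median_beds, mean_beds)
```

The single extreme value pulls the mean up to 11.25, while the median stays at 2.0, which is a much more plausible value for a missing entry.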

In [None]:
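import seaborn as sns
plt.figure(figsize=(20,20))
# Correlations are only defined for numeric columns, so select those first
sns.heatmap(data_clean.select_dtypes(include='number').corr())
plt.show()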

# Exploratory Data Analysis

In [None]:
airbnb_calendar['date'] = pd.to_datetime(airbnb_calendar['date'])
# regex=False so that '$' is treated as a literal character, not as a regex anchor
airbnb_calendar['price'] = airbnb_calendar['price'].str.replace(',', '', regex=False)
airbnb_calendar['price'] = airbnb_calendar['price'].str.replace('$', '', regex=False)
airbnb_calendar['price'] = airbnb_calendar['price'].astype(float)
# Keep only the month name
airbnb_calendar['date'] = airbnb_calendar['date'].dt.strftime('%B')

In [None]:
import plotly.express as px

In [None]:
airbnb_calendar.head()

In [None]:
airbnb_calendar.rename(columns = {'date':'month'}, inplace = True)

In [None]:
airbnb_calendar.head()

In [None]:
temp=px.histogram(airbnb_calendar,x="month", y="price")
temp.show()

From the above bar plot, we can say that the sum of listing prices is highest in July and lowest in February.

In [None]:
airbnb_calendar.head()

In [None]:
airbnb_calendar_date['date'] = pd.to_datetime(airbnb_calendar_date['date'])
# regex=False so that '$' is treated as a literal character, not as a regex anchor
airbnb_calendar_date['price'] = airbnb_calendar_date['price'].str.replace(',', '', regex=False)
airbnb_calendar_date['price'] = airbnb_calendar_date['price'].str.replace('$', '', regex=False)
airbnb_calendar_date['price'] = airbnb_calendar_date['price'].astype(float)
# Keep only the day of the month
airbnb_calendar_date['date'] = airbnb_calendar_date['date'].dt.strftime('%d')

In [None]:
airbnb_calendar_date.head()

In [None]:
temp=px.histogram(airbnb_calendar_date,x="date", y="price")
temp.show()

The above bar plot shows that the sum of listing prices is lowest on the 31st day of the month, which is expected because only seven months have a 31st day.

In [None]:
airbnb_calendar = pd.read_csv("calendar.csv")

In [None]:
for col in airbnb_calendar.columns:
    print(col)

In [None]:
df_airbnb_calendar = pd.DataFrame(airbnb_calendar)

In [None]:
df_airbnb_calendar.head()

In [None]:
df_airbnb_calendar['available'].value_counts()

In [None]:
df_airbnb_calendar['available'] = df_airbnb_calendar['available'].map(lambda available: 1 if available == 't' else 0)

In [None]:
df_airbnb_calendar['available'].value_counts()

In [None]:
ocup = df_airbnb_calendar[['date', 'available']].groupby('date').mean()
ocup['occupancy'] = 1 - ocup['available']

In [None]:
ocup.head()

In [None]:
ocup.reset_index(inplace = True)

In [None]:
for col in ocup.columns:
    print(col)

In [None]:
ocup.head()

In [None]:
import datetime
ocup['month'] = pd.DatetimeIndex(ocup['date']).month
ocup.head()

In [None]:
ocup

Separating Airbnb homes into two parts by their occupancy rate -
1. Popular Airbnb Homes
2. Unpopular Airbnb Homes

In [None]:
ocup = df_airbnb_calendar[['listing_id', 'available']].groupby('listing_id').mean()
ocup['occupancy'] = 1 - ocup['available']
ocup.drop(['available'], axis = 1, inplace=True)
ocup['id'] = ocup.index

In [None]:
ocup.head()

In [None]:
ocup.reset_index(inplace = True)

In [None]:
ocup.head()

In [None]:
px.histogram(data_frame = ocup, x = 'occupancy')
# plt.figure(figsize=(8,6))
# sns.distplot(ocup['occupancy'], color = 'blue', kde = False)
# plt.show

Drawing the comparison distributions about host response time, property type, neighborhood, and cancellation policy

In [None]:
threshold = 0.70
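
To illustrate how the occupancy rate and the 0.70 threshold separate popular from unpopular homes, here is a minimal sketch on a made-up calendar. The frame `toy_calendar` and the listing ids are hypothetical; the `'t'`/`'f'` encoding mirrors the `available` column used above.

```python
import pandas as pd

# Hypothetical calendar: four days for each of two listings.
toy_calendar = pd.DataFrame({
    'listing_id': [101, 101, 101, 101, 202, 202, 202, 202],
    'available':  ['t', 'f', 'f', 'f', 't', 't', 't', 'f'],
})

# Convert to 1 (available) / 0 (not available), then average per listing.
toy_calendar['available'] = (toy_calendar['available'] == 't').astype(int)
toy_occupancy = 1 - toy_calendar.groupby('listing_id')['available'].mean()

# Listings at or above the threshold count as popular.
toy_popular = toy_occupancy[toy_occupancy >= 0.70]

print(toy_occupancy.to_dict())
print(toy_popular.index.tolist())
```

Note that a day marked as unavailable is treated as booked, so this measure can overstate occupancy for listings that the host has simply blocked.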

In [None]:
data = pd.merge(ocup, airbnb_listings, how='inner', left_on = 'id', right_on = 'id')
data.head()

In [None]:
def comparePlot(feat):
    plt.figure(figsize=(8,6))
    popular = data[data['occupancy'] >= threshold][feat]
    unpopular = data[data['occupancy'] < threshold][feat]

    # if not float, draw grouped bars of relative frequencies
    if not pd.api.types.is_float_dtype(data[feat]):
        # rename() keeps the column name stable across pandas versions
        fre_popular = popular.value_counts(normalize=True).rename(feat).to_frame()
        fre_popular['popularity'] = 'Popular'
        fre_popular['index'] = fre_popular.index
        fre_unpopular = unpopular.value_counts(normalize=True).rename(feat).to_frame()
        fre_unpopular['popularity'] = 'Unpopular'
        fre_unpopular['index'] = fre_unpopular.index

        plot_data = pd.concat([fre_popular, fre_unpopular], ignore_index=True)
        sns.barplot(x='index', y=feat, hue='popularity', data=plot_data, palette='GnBu')
        plt.xticks(rotation='vertical')
        plt.legend(loc=1)

    # if float, draw kde line
    else:
        sns.kdeplot(popular, color='m')
        sns.kdeplot(unpopular, color='c')
        plt.legend(['Popular', 'Unpopular'], loc=1)

    plt.xlabel(feat)
    plt.ylabel('Frequency')

    plt.show()
        

In [None]:
data['price'] = data['price'].map(lambda price: float(price[1:].replace(',', '')), na_action='ignore')
comparePlot('price')

In [None]:
comparePlot('property_type')

In [None]:
comparePlot('host_response_time')

From the above plot we can infer that hosts who respond sooner have a better chance of renting out their homes.

In [None]:
comparePlot('instant_bookable')

From the above plot we can infer that listings that can be booked instantly tend to be more popular.

Defining a timeplot function for EDA to find the seasonal pattern in Chicago

In [None]:
def timeplot(data, feat, title):
    """
    draw a smooth line for the time series of feature
    """
    
    plt.figure(figsize=(20,8))
    
    x = [datetime.strptime(date, '%Y-%m-%d') for date in data.index]
    y = data[feat]
    
    # smooth y for visualization
    y_smooth = gaussian_filter1d(y, sigma=5)
    
    # set x tick by month
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
    
    plt.title(title)
    plt.plot(x, y_smooth, 'c-')
    plt.show()
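
The smoothing step inside `timeplot` can be tried on its own. The sketch below builds a made-up daily series (a slow trend plus a day-to-day zigzag) and applies `gaussian_filter1d` with the same `sigma=5`. The series and the roughness measure are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical daily series: a slow seasonal trend plus an alternating day-to-day wiggle.
days = np.arange(60)
trend = 0.5 + 0.3 * np.sin(days / 60 * 2 * np.pi)
raw = trend + 0.1 * np.where(days % 2 == 0, 1.0, -1.0)

# Same smoothing call as in timeplot.
smoothed = gaussian_filter1d(raw, sigma=5)

# Mean absolute day-to-day change as a simple roughness measure.
raw_roughness = np.abs(np.diff(raw)).mean()
smooth_roughness = np.abs(np.diff(smoothed)).mean()

print(raw_roughness > smooth_roughness)
```

A larger `sigma` removes more of the short-term fluctuation but also flattens genuine peaks, so the value of 5 is a visual trade-off rather than a tuned parameter.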

In [None]:
airbnb_calendar['available'].value_counts()

In [None]:
airbnb_calendar = pd.read_csv("calendar.csv")

In [None]:
# converting 'available' into binary
airbnb_calendar['available'] = airbnb_calendar['available'].map(lambda available: 1 if available == 't' else 0)

In [None]:
airbnb_calendar['available'].value_counts()

In [None]:
airbnb_calendar['date'] = pd.to_datetime(airbnb_calendar['date']).astype(str)

In [None]:
ocp = airbnb_calendar[['date', 'available']].groupby('date').mean()
ocp['occupancy'] = 1 - ocp['available']

In [None]:
ocp.head(50)

### Seasonal Booking

In [None]:
from datetime import datetime
from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is a deprecated namespace
import matplotlib.dates as mdates
timeplot(ocp, 'occupancy', 'Occupancy Rate by Date')

This indicates that the occupancy rate peaked in September and then declined rapidly, to 30% in November and 10% in December. After that, it starts to rise gradually from April. This suggests that the busiest seasons in Chicago are the spring and summer. Chicago experiences its best weather from late June to early October, and there is indeed a peak during that time.

#### Median 

In [None]:
airbnb_calendar

In [None]:
airbnb_calendar['price'] = airbnb_calendar['price'].map(lambda price: float(price[1:].replace(',', '')), na_action='ignore')

In [None]:
price_median = airbnb_calendar[['date', 'price']].groupby('date').median()

### Seasonal Pricing

In [None]:
timeplot(price_median, 'price', 'Median Price by Date')

The rental rates have increased since November and are at their highest in June and July. The weather in Chicago is typically pleasant at that time. It makes sense that rental rates would be higher then.

#### Mean

In [None]:
price_mean = airbnb_calendar[['date', 'price']].groupby('date').mean()

In [None]:
timeplot(price_mean, 'price', 'Mean Price by Date')

Although it has a slightly different appearance, the same pattern is still visible.

## Machine Learning Models

In [None]:
from sklearn import preprocessing as p
scaler = p.StandardScaler()
columns = airbnb_listings_clean.columns
airbnb_listings_clean = scaler.fit_transform(airbnb_listings_clean)
airbnb_listings_clean = pd.DataFrame(airbnb_listings_clean)
airbnb_listings_clean.columns = columns
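
For intuition, the standardization performed by `StandardScaler` can be reproduced by hand on a made-up frame. This is only a sketch: the frame `toy` is hypothetical, and `ddof=0` is used because `StandardScaler` divides by the population standard deviation, whereas pandas defaults to the sample standard deviation.

```python
import pandas as pd

# Hypothetical numeric features.
toy = pd.DataFrame({'price': [50.0, 100.0, 150.0], 'beds': [1.0, 2.0, 3.0]})

# z-score per column: subtract the column mean, divide by the population standard deviation.
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(4))
```

Doing the arithmetic on the DataFrame directly keeps the column names, which is why the notebook has to reattach `columns` after `fit_transform` returns a plain array.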

In [None]:
airbnb_listings_clean.dtypes

In [None]:
x_col = dropping_column(airbnb_listings_clean, 'price')
y_col = airbnb_listings_clean['price']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_col, y_col, test_size=0.2)

## Problem - 1 Understanding influential features for determining price and Predicting Rental Prices of AirBnb in Chicago

## XGBoost

In [None]:
from sklearn.model_selection import KFold
kf=KFold(n_splits = 10)
scoring=['r2','neg_mean_squared_error']

In [None]:
import sys
import xgboost as xgb
from xgboost import plot_importance

In [None]:
import time
xgb_reg_start = time.time()

xgb_reg=xgb.XGBRegressor(learning_rate = 0.1, alpha = 10,  max_depth = 3, n_estimators = 1000)
xgb_reg.fit(X_train, y_train)
training_preds_xgb_reg=xgb_reg.predict(X_train)
val_preds_xgb_reg=xgb_reg.predict(X_test)

xgb_reg_end=time.time()

In [None]:
ft_weights_xgb_reg=pd.DataFrame(xgb_reg.feature_importances_, columns=['weight'], index=X_train.columns)
ft_weights_xgb_reg.sort_values('weight', ascending=False, inplace=True)
ft_weights_xgb_reg.head(10)

In [None]:
plt.figure(figsize=(20,90))
plt.barh(ft_weights_xgb_reg.index, ft_weights_xgb_reg.weight, align='center') 
plt.title("Feature importances in the XGBoost model", fontsize=14)
plt.xlabel("Feature importance")
plt.margins(y=0.01)
plt.show()

In [None]:
from sklearn.model_selection import cross_val_score,cross_validate
results_kfold_XGB=cross_validate(xgb_reg,x_col,y_col,cv=kf,scoring=scoring)

In [None]:
print("XGBoost MSE:",-round(results_kfold_XGB['test_neg_mean_squared_error'].mean(),4))

In [None]:
print("XGBoost R^2:",round(results_kfold_XGB['test_r2'].mean(),4))

In [None]:
y_test_array =np.array(list(y_test))
val_preds_xgb_reg_array = np.array(val_preds_xgb_reg)
xgb_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': val_preds_xgb_reg_array.flatten()})
xgb_df.head(10)

#### The XGBoost model gave the highest weights to the following attributes:
1. bedrooms
2. calculated_host_listings_count_private_rooms
3. beds
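
As a rough illustration of how such a ranking is produced, the sketch below fits a tree ensemble on synthetic data and sorts its `feature_importances_`. It uses scikit-learn's `RandomForestRegressor` as a stand-in so that it runs without XGBoost. The data, the `noise` column and the coefficients are assumptions made for this example, not values from the Chicago listings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
# Synthetic listings: column names mirror the notebook, values are made up.
toy_X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "beds": rng.integers(1, 6, size=n),
    "noise": rng.normal(size=n),
})
# By construction, the price depends mostly on the number of bedrooms.
toy_y = 50 * toy_X["bedrooms"] + 5 * toy_X["beds"] + rng.normal(scale=5, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(toy_X, toy_y)
# Importances sum to 1; larger values mean the feature was more useful for splits.
importances = pd.Series(model.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Because the synthetic price is driven by `bedrooms`, that column should come out on top. On the real data the ranking reflects whatever the fitted model found useful and does not by itself show a causal effect on price.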

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

In [None]:
rfr=RandomForestRegressor(n_estimators=1500, random_state=42)

In [None]:
rfr.fit(X_train,y_train)

In [None]:
rfr_train = rfr.predict(X_train)
testPredictin_RF = rfr.predict(X_test)

In [None]:
importancesRfr = rfr.feature_importances_
features_imp1 = pd.DataFrame(importancesRfr, columns=['Weight'], index=X_train.columns)
features_imp1.sort_values(['Weight'], inplace= True, ascending=False)
features_imp1 = features_imp1.head(10)
features_imp1.round(6)

In [None]:
features_imp = pd.DataFrame({'importance':rfr.feature_importances_})  
features_imp['feature'] = X_train.columns
features_imp.sort_values(by='importance',ascending=False,inplace=True)

In [None]:
features_imp.sort_values(by='importance',inplace=True)
features_imp=features_imp.set_index('feature',drop=True)
features_imp.plot.barh(figsize=(20,90))
plt.xlabel('Feature Importance Score')
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print("\nTraining MSE:", round(mean_squared_error(y_train, rfr_train),4))
print("Validation MSE:", round(mean_squared_error(y_test, testPredictin_RF),4))
print("\nTraining r2:", round(r2_score(y_train, rfr_train),4))
print("Validation r2:", round(r2_score(y_test, testPredictin_RF),4))

In [None]:
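# Cross-validate the random forest itself instead of reusing the XGBoost results
results_kfold_RF = cross_validate(rfr, x_col, y_col, cv=kf, scoring=scoring)
print("RandomForest MSE:",
      -round(results_kfold_RF['test_neg_mean_squared_error'].mean(),4))
print("RandomForest R^2:",
      round(results_kfold_RF['test_r2'].mean(),4))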

In [None]:
y_test_array=np.array(list(y_test))
testPredictin_RF_array = np.array(testPredictin_RF)  # random forest predictions, not XGBoost's
rfr_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': testPredictin_RF_array.flatten()})
rfr_df.head(10)

#### The Random Forest model gave the highest weights to the following attributes:
1. bedrooms
2. beds
3. calculated_host_listings_count_private_rooms

## CatBoost

In [None]:
from catboost import Pool,CatBoostRegressor
CatB=CatBoostRegressor(iterations=2000,
                       depth=3, learning_rate=0.1,loss_function='RMSE')
CatB.fit(X_train,y_train,plot=True);


trainPrediction_CatB=CatB.predict(X_train)
testPrediction_CatB=CatB.predict(X_test)

In [None]:
airbnb_listings_clean

In [None]:
feature_impCatB = CatB.feature_importances_
feat_imp1 = pd.DataFrame(feature_impCatB, columns=['Weight'], index=X_train.columns)
feat_imp1.sort_values(['Weight'],inplace= True,ascending=False)
feat_imp1.head(10)

In [None]:
feature_impCatB=pd.DataFrame({'imp': CatB.feature_importances_, 'col': X_train.columns})
feature_impCatB=feature_impCatB.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_impCatB.plot(kind='barh', x='col', y='imp', figsize=(10, 10), legend=None)
plt.title('CatBoost - Feature Importance')
plt.ylabel('Features')
plt.xlabel('Importance');
plt.xlim(0.001, 20.0)

In [None]:
results_kfold_catBoost = cross_validate(CatB,x_col,y_col,cv=kf,scoring=scoring)

In [None]:
print("CatBoost MSE:" , -round(results_kfold_catBoost['test_neg_mean_squared_error'].mean(),4))
print("CatBoost R^2:" , round(results_kfold_catBoost['test_r2'].mean(),4))

In [None]:
y_test_array =np.array(list(y_test))
testPrediction_CatB_array = np.array(testPrediction_CatB)
cat_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': testPrediction_CatB_array.flatten()})
cat_df.head(10)

#### The CatBoost model gave the highest weights to the following attributes:
1. bedrooms
2. beds
3. calculated_host_listings_count_entire_homes

## Spatial Hedonic Price Model (HPM)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

def train(model):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_train, y_train_pred)
    r2 = r2_score(y_train, y_train_pred)
    print('For training data, mean squared error: {:.4f}, R2: {:.4f}'.format(mse, r2))

    mse = mean_squared_error(y_test, y_test_pred)
    r2 = r2_score(y_test, y_test_pred)
    print('For test data, mean squared error: {:.4f}, R2: {:.4f}'.format(mse, r2))

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import time
from sklearn.linear_model import LinearRegression
hpm_regression_start = time.time()

hpm_regression = LinearRegression()  
hpm_regression.fit(X_train, y_train) #training the algorithm


training_preds_hpm_regression = hpm_regression.predict(X_train)
val_preds_hpm_regression = hpm_regression.predict(X_test)

hpm_regression_end = time.time()

print(f"Time taken to run: {round((hpm_regression_end - hpm_regression_start)/60,1)} minutes")

print("\nTraining MSE:", round(mean_squared_error(y_train, training_preds_hpm_regression),4))
print("Validation MSE:", round(mean_squared_error(y_test, val_preds_hpm_regression),4))
print("\nTraining r2:", round(r2_score(y_train, training_preds_hpm_regression),4))
print("Validation r2:", round(r2_score(y_test, val_preds_hpm_regression),4))

In [None]:
y_test_array = np.array(list(y_test))
val_preds_hpm_reg_array = np.array(val_preds_hpm_regression)
hpm_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': val_preds_hpm_reg_array.flatten()})
hpm_df.head(10)

Using the Spatial Hedonic Price Model (HPM) to predict Airbnb rental prices in Chicago, we obtained a validation error of 9.9 (mean squared error, as computed by mean_squared_error), and the predicted values are further from the actual values than with the other models.
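
Since the notebook reports both MSE and RMSE figures, the following minimal sketch shows how the two relate. The arrays are made-up numbers chosen so the arithmetic is easy to follow; they are not model outputs from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # square root puts it back in the target's units
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```

The squared errors here are 1, 0, 4 and 0, so the MSE is 1.25 and the RMSE is its square root. The two numbers should not be compared directly across models.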

## Linear Regression

In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
rr1 = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.5))
rr1.fit(X_train, y_train)
y_train_pred = rr1.predict(X_train)
y_test_pred = rr1.predict(X_test)
train(rr1)

In [None]:
rr2 = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.1))
train(rr2)

In [None]:
rr3 = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.05))
train(rr3)

In [None]:
from sklearn.linear_model import Lasso
lasso1 = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha=0.1))
train(lasso1)

In [None]:
lasso2 = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha=0.01))
train(lasso2)

In [None]:
lasso3 = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha=0.001))
train(lasso3)

In [None]:
from sklearn import linear_model
lr = linear_model.LinearRegression()

for i in range (-2, 3):
    alpha = 10**i
    rm = linear_model.Ridge(alpha=alpha)
    ridge_model = rm.fit(X_train, y_train)
    preds_ridge = ridge_model.predict(X_test)

    plt.scatter(preds_ridge, y_test, alpha=.75, color='r')
    print("r^2 is:",round(ridge_model.score(X_test, y_test), 4))
    print("RMSE is: ",round(np.sqrt(mean_squared_error(y_test, preds_ridge)),4))
    plt.xlabel('Predicted Price')
    plt.ylabel('Actual Price')
    plt.title('Ridge Regularization with alpha = {}'.format(alpha))
    plt.show()

In [None]:
rr1 = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.5))
train(rr1)
y_test_pred = rr1.predict(X_test)  # predictions from the model fitted just above
y_test_array = np.array(list(y_test))
y_test_pred_array = np.array(y_test_pred)
ridge_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': y_test_pred_array.flatten()})
ridge_df.head(10)

Using linear regression (with Ridge and Lasso regularization) to predict Airbnb rental prices in Chicago, we obtained a test mean squared error of 0.40, and the predicted values were closer to the actual values than with the other models.
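
Note that the scaling cell at the start of this section standardises every column of airbnb_listings_clean, including price, so the errors reported here are in standard-deviation units rather than dollars. The sketch below shows one way to convert such an error back to dollars. The price values, the `price_scaler` name and the example error of 0.5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical nightly prices in dollars (not taken from the dataset).
prices = np.array([[80.0], [100.0], [120.0], [200.0], [300.0]])
price_scaler = StandardScaler().fit(prices)

rmse_standardized = 0.5  # example RMSE measured on the scaled target
# StandardScaler divides by the standard deviation, so multiplying undoes it.
rmse_dollars = rmse_standardized * price_scaler.scale_[0]
print(f"RMSE of {rmse_standardized} in scaled units is about {rmse_dollars:.2f} dollars")
```

This conversion applies to RMSE (or MAE). An MSE would have to be multiplied by the squared standard deviation instead.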

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor  
regressor = DecisionTreeRegressor()  
regressor.fit(X_train, y_train) 
y_pred = regressor.predict(X_test)  
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df.index+=1
df

In [None]:
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
fig = plt.figure()
plt.scatter(y_pred, y_test, alpha=.7,
            color='b') #alpha helps to show overlapping data

plt.xlabel('Predicted Price')
plt.ylabel('Actual Price')
plt.title('Actual Vs Predicted')
plt.show()

We obtained a Root Mean Square Error of 0.75 when using a Decision Tree to predict Airbnb rental prices in Chicago.

## Predicting Sentiment of Airbnb Review Comments

In [None]:
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import spacy
import nltk

In [None]:
nltk.download('stopwords')

In [None]:
reviews_df=pd.read_csv('reviews.csv')
reviews_df.head()

In [None]:
reviews_df=reviews_df.drop('date', axis=1)

In [None]:
reviews_df.isnull().sum()

In [None]:
reviews_df['rating'].value_counts()

#### Checking the rating values

In [None]:
positive=[4,5]
neutral=[2,3]
negative=[0,1]
def sentiment(rating):
    if rating in positive:
        return 2
    elif rating in negative:
        return 0
    else:
        return 1
reviews_df['Sentiment']=reviews_df['rating'].apply(sentiment)
reviews_df.head()

#### Now analysing positive, neutral and negative reviews

In [None]:
fig = go.Figure([go.Bar(x=reviews_df.Sentiment.value_counts().index, 
                        y=reviews_df.Sentiment.value_counts().tolist())])
fig.update_layout(
    xaxis_title="Sentiment", yaxis_title="Values")
fig.show()

In [None]:
from nltk.corpus import stopwords
stopwords_list=set(stopwords.words("english"))
punctuations="""!()-![]{};:,+'"\,<>./?@#$%^&*_~Â"""

def reviewParse(review):
    splitReview = review.split() 
    parsedReview = " ".join([word.translate(str.maketrans('', '', punctuations)) + 
                             " " for word in splitReview])
    return parsedReview 
  
def clean_review(review):
    clean_words = []
    splitReview = review.split()
    for w in splitReview:
        w = w.lower()  # lowercase first so capitalised stopwords are also removed
        if w.isalpha() and w not in stopwords_list:
            clean_words.append(w)
    clean_review=" ".join(clean_words)
    return clean_review


reviews_df["comments"]=reviews_df["comments"].apply(reviewParse).apply(clean_review)
reviews_df.head()

In [None]:
docs=list(reviews_df['comments'])[:10000]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features = 20000) 
tfidf_vectorizer_vectors= tfidf_vectorizer.fit_transform(docs)

In [None]:
X =tfidf_vectorizer_vectors.toarray()
Y =reviews_df['Sentiment'][:10000]

#### Dividing the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV 
from sklearn.metrics import mean_absolute_error, accuracy_score, confusion_matrix, classification_report,roc_auc_score,roc_curve,auc
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 5, stratify = Y)

In [None]:
fig = go.Figure([go.Bar(x=Y.value_counts().index, y=Y.value_counts().tolist())])
fig.update_layout(
    xaxis_title="Sentiment",yaxis_title="Values")
fig.show()

### Naive bayes classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_train = gnb.predict(X_train)
y_pred_test = gnb.predict(X_test)
print("Training Accuracy score: "+
      str(round(accuracy_score(y_train,gnb.predict(X_train)),4)))
print("Testing Accuracy score: "+
      str(round(accuracy_score(y_test,gnb.predict(X_test)),4)))

In [None]:
cm = confusion_matrix(y_test, y_pred_test)
# Rows of sklearn's confusion matrix are actual classes; columns are predicted classes
cm_matrix = pd.DataFrame(data=cm, columns=['Predicted Negative', 'Predicted Neutral', 'Predicted Positive'], 
                        index=['Actual Negative', 'Actual Neutral', 'Actual Positive'])
sns.heatmap(cm_matrix, 
            annot=True, fmt='d', cmap='YlGnBu')
plt.show()

In [None]:
y_test_array = np.array(list(y_test))
y_test_pred_array = np.array(y_pred_test)
naive_df = pd.DataFrame({'Actual': y_test_array.flatten(), 
                         'Predicted': y_test_pred_array.flatten()})
naive_df.head(10)

Using the Naive Bayes classifier, we predicted the sentiment of comments in the Airbnb reviews dataset and plotted a confusion matrix. We also obtained a table of actual and predicted values, along with a mean squared error of 0.30.
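
When reading the heatmaps in this section, it helps to remember how scikit-learn lays out a confusion matrix: entry (i, j) counts samples whose true class is i and whose predicted class is j. The small sketch below uses made-up labels with the notebook's encoding (0 = negative, 1 = neutral, 2 = positive).

```python
from sklearn.metrics import confusion_matrix

# Tiny made-up example; 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 1]

# Passing labels fixes the row/column order even if a class is missing.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

Here the first row describes the two truly negative reviews: one was predicted negative and one positive. Passing `labels` explicitly also guards against a shape mismatch when one class never appears in a small test split.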

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=9).fit(X_train, y_train)
y_pred_train=lr.predict(X_train)
y_pred_test=lr.predict(X_test)
print("Training Accuracy score: "+
      str(round(accuracy_score(y_train,lr.predict(X_train)),4)))
print("Testing Accuracy score: "+
      str(round(accuracy_score(y_test,lr.predict(X_test)),4)))

In [None]:
cm=confusion_matrix(y_test, y_pred_test)
cm_matrix=pd.DataFrame(data=cm, columns=['Predicted Negative', 
                                           'Predicted Neutral', 'Predicted Positive'], 
                        index=['Actual Negative', 'Actual Neutral', 'Actual Positive'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()

In [None]:
y_test_array=np.array(list(y_test))
y_test_pred_array = np.array(y_pred_test)
logistic_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': y_test_pred_array.flatten()})
logistic_df.head(10)

Using Logistic Regression, we predicted the sentiment of comments in the Airbnb reviews dataset and generated a confusion matrix. We also obtained a table of actual and predicted values, along with a mean squared error of 0.305.

### Ensemble Classifier

In [None]:
from sklearn.ensemble import VotingClassifier


classifiers = [('Naive Bayes', gnb),
               ('Logistic Regression', lr)
              ]
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
print("Accuracy score for Training: "+str(round(accuracy_score(
    y_train,vc.predict(X_train)),4)))
print("Accuracy score for Testing: "+str(round(accuracy_score(
    y_test,vc.predict(X_test)),4)))

In [None]:
cm = confusion_matrix(y_test, y_pred)  # use the ensemble's predictions, not the logistic regression's
cm_matrix = pd.DataFrame(data=cm, columns=['Predicted Negative', 'Predicted Neutral', 'Predicted Positive'], 
                        index=['Actual Negative', 'Actual Neutral', 'Actual Positive'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()

In [None]:
y_test_array = np.array(list(y_test))
y_pred_array = np.array(y_pred)
ensemble_df = pd.DataFrame({'Actual': y_test_array.flatten(), 'Predicted': y_pred_array.flatten()})
ensemble_df.head(10)

Using the Ensemble Classifier, we predicted the sentiment of comments in the Airbnb reviews dataset and generated a confusion matrix. We also obtained a table of actual and predicted values, along with a mean squared error of 0.64.

## Amenities that are most influential in Airbnb Chicago

In [None]:
amen=airbnb_listings_dataset['amenities'].unique()
amen

In [None]:
import ast


def Lis(x):
    arr = ast.literal_eval(x)
    return arr
Lis("['Hello']")
airbnb_listings_dataset['amen_as_list'] = airbnb_listings_dataset['amenities'].apply(Lis)

In [None]:
airbnb_listings_dataset['amen_as_list']

In [None]:
amenities={}
for i in airbnb_listings_dataset['amen_as_list'].index:
    for j in range(len(airbnb_listings_dataset['amen_as_list'][i])):
        if(airbnb_listings_dataset['amen_as_list'][i][j] not in amenities):
            amenities[(airbnb_listings_dataset['amen_as_list'][i][j])] = 1
        else:
            amenities[(airbnb_listings_dataset['amen_as_list'][i][j])] += 1
amenities

In [None]:
new_amen=sorted(amenities.items(),key=lambda x: x[1])

In [None]:
new_amen[-25:]

In [None]:
newDict=[]
for i in range(15,25):
    newDict.append(new_amen[-25:][i][0])
newDict

In [None]:
dnew=pd.DataFrame()
dnew['scores']=airbnb_listings_clean['review_scores_rating']
dnew

In [None]:
for i in newDict:
    dnew[i]=0
dnew

In [None]:
for i in airbnb_listings_dataset['amen_as_list'].index:
    for j in range(len(airbnb_listings_dataset['amen_as_list'][i])):
        if airbnb_listings_dataset['amen_as_list'][i][j] in newDict:
            dnew.loc[i, airbnb_listings_dataset['amen_as_list'][i][j]] = 1
dnew

### Extra Tree Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import preprocessing
from sklearn import utils


etf=ExtraTreesClassifier()
X=dnew.drop(labels = ['scores'], axis = 1)
y=dnew['scores']
lab = preprocessing.LabelEncoder()
y_transformed = lab.fit_transform(y)
etf.fit(X, y_transformed)

In [None]:
feature_importance=etf.feature_importances_
feature_importance

In [None]:
for i in range(10):
    feature_importance[i]=feature_importance[i]*100

In [None]:
newDict[4]='Long term stays'

In [None]:
fig=plt.figure(figsize=(17,5))
plt.ylim(6,12)
plt.bar(newDict, feature_importance)
plt.xlabel('Amenities')
plt.ylabel('Contributing factor to a good score')
plt.title('Amenities')
plt.show()

We used an Extra Trees classifier to estimate which amenities are most influential for Airbnb listings in Chicago. The bar plot shows that the most influential amenities are Iron, Hangers and Carbon monoxide alarm.
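
The amenity counts that feed this analysis can also be gathered with `collections.Counter`, which avoids the manual dictionary bookkeeping. This is a minimal sketch on three hypothetical amenity strings written in the list-like format that `ast.literal_eval` parses; the strings are not rows from the dataset.

```python
import ast
from collections import Counter

# Hypothetical amenity strings in the list-like format of the `amenities` column.
raw_amenities = [
    '["Wifi", "Iron", "Hangers"]',
    '["Wifi", "Kitchen"]',
    '["Wifi", "Iron"]',
]

amenity_counts = Counter()
for raw in raw_amenities:
    # Parse the string into a Python list and add its items to the tally.
    amenity_counts.update(ast.literal_eval(raw))

# most_common(k) returns the k most frequent amenities with their counts.
print(amenity_counts.most_common(2))
```

On the real column, `amenity_counts.most_common(10)` would give the ten most frequent amenities directly, without sorting and slicing a dictionary by hand.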

## Predicting future price trends for various seasons

In [None]:
!pip install prophet

In [None]:
import datetime 
from prophet import Prophet
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode()
from sklearn.metrics import mean_absolute_error

In [None]:
airbnb_calendar['price']=airbnb_calendar['price'].apply(lambda x: str(x).replace( '$', ''))
airbnb_calendar['price']=pd.to_numeric(airbnb_calendar['price'], errors = 'coerce')
df_calendar = airbnb_calendar.groupby('date')[["price"]].sum()
df_calendar['mean']=airbnb_calendar.groupby('date')[["price"]].mean()
df_calendar.columns=['Total', 'Avg']
df_calendar.head(10)

In [None]:
df_calendar2 = airbnb_calendar.set_index("date")
df_calendar2.index = pd.to_datetime(df_calendar2.index)
df_calendar2 = df_calendar2[['price']].resample('M').mean()
df_calendar2.head()

In [None]:
from plotly.offline import iplot,plot,init_notebook_mode,download_plotlyjs
import plotly.graph_objs as go
init_notebook_mode(connected=True)
import plotly.offline as offline
trace3 = go.Scatter(
    x = df_calendar2.index[:-1],
    y = df_calendar2.price[:-1])
layout3 = go.Layout(
    title = "Average Prices by Month",
    xaxis = dict(title = 'Month'),
    yaxis = dict(title = 'Price ($)'))
data3 = [trace3]
figure3=go.Figure(data = data3,layout=layout3)
offline.iplot(figure3)

The average monthly price appears to rise by 20 dollars between January 2023 and July 2023, when it peaks at 235 dollars.

In [None]:
df_calendar_copy = df_calendar.copy()
df_calendar_copy['date'] = df_calendar_copy.index
df_calendar_copy = df_calendar_copy[['date', 'Avg']]
df_calendar_copy.columns = ['ds', 'y']
df_calendar_copy.head()

In [None]:
df_calendar_copy['y_origin'] = df_calendar_copy['y']
df_calendar_copy['y'] = np.log(df_calendar_copy['y'])
df_calendar_copy['ds'] =  pd.to_datetime(df_calendar_copy['ds'])
df_calendar_copy.head()

In [None]:
mean = df_calendar_copy['y'].mean()
stdev = df_calendar_copy['y'].std()
q1 = df_calendar_copy['y'].quantile(0.25)
q3 = df_calendar_copy['y'].quantile(0.75)
iqr = q3 - q1
high = mean + stdev
low = mean - stdev

In [None]:
df_filtered = df_calendar_copy[(df_calendar_copy['y'] > high) | (df_calendar_copy['y'] < low)]
df_filtered_changepoints = df_filtered

filtered_iqr = df_calendar_copy[(df_calendar_copy['y'] < q1 - (1.5 * iqr)) | (df_calendar_copy['y'] > q3 + (1.5 * iqr)) ]

In [None]:
prophet = Prophet(
                  interval_width = 0.95,
                  weekly_seasonality = True,
                  yearly_seasonality = True,
                  changepoint_prior_scale = 0.095)

prophet.fit(df_calendar_copy)
future = prophet.make_future_dataframe(periods = 60, freq = 'd')
future['cap'] = 5.05
forecast = prophet.predict(future)

In [None]:
future

In [None]:
py.iplot([
    go.Scatter(x=df_calendar_copy['ds'],y=df_calendar_copy['y'],name='y'),
    go.Scatter(x=forecast['ds'],y=forecast['yhat'],name='yhat'),
    go.Scatter(x=forecast['ds'],y=forecast['yhat_upper'],fill='tonexty',mode='none',name='upper'),
    go.Scatter(x=forecast['ds'],y=forecast['yhat_lower'],fill='tonexty',mode='none',name='lower')])

The forecast object is a DataFrame that includes a column yhat with the prediction, as well as columns for the components and the uncertainty intervals. The predict method assigns each row in future a forecasted value, which it names yhat. Because y was log-transformed before fitting, these values are on the log scale.
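
Since the model was fitted on `np.log` of the average price, the forecast columns can be mapped back to dollars with `np.exp`. The sketch below does this on a small stand-in frame so that it runs without Prophet. The frame `toy_forecast`, its values and the `_price` column suffix are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for a Prophet forecast fitted on log prices (values are made up).
toy_forecast = pd.DataFrame({
    "ds": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "yhat": np.log([200.0, 210.0]),
    "yhat_lower": np.log([180.0, 190.0]),
    "yhat_upper": np.log([220.0, 230.0]),
})

# Undo the log transform so the forecast reads in dollars.
for col in ["yhat", "yhat_lower", "yhat_upper"]:
    toy_forecast[col + "_price"] = np.exp(toy_forecast[col])

print(toy_forecast[["ds", "yhat_price", "yhat_lower_price", "yhat_upper_price"]])
```

The same loop could be applied to the real `forecast` frame. The mean absolute errors printed later in this section are likewise computed on the log scale.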

In [None]:
df_comparison=pd.DataFrame()
df_comparison=pd.merge(df_calendar_copy,forecast,left_on='ds',right_on='ds')
df_comparison.head()

In [None]:
print("Mean_absolute_error yhat\t: {}\nMean_absolute_error trend\t: {}\nMean_absolute_error yhat_lower: {}\nMean_absolute_error yhat_upper: {}".format(
    mean_absolute_error(df_comparison['y'].values,df_comparison['yhat']),
    mean_absolute_error(df_comparison['y'].values,df_comparison['trend']),
    mean_absolute_error(df_comparison['y'].values,df_comparison['yhat_lower']),
    mean_absolute_error(df_comparison['y'].values,df_comparison['yhat_upper'])))

In [None]:
prophet.plot_components(forecast)

The time series analysis used to forecast seasonal price changes gave the following results:
1. The upward-sloping line in the top panel of the figure suggests that the trend will keep increasing (from 2022-11 to 2023-11).
2. For the day of the week, the plot shows prices rising towards the weekend, starting on Friday, and declining on Sunday.
3. In the yearly plot, the forecast stays roughly constant during the summer and drops as winter approaches.
