In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [7]:
cal_df_cleaned = pd.read_csv("cal_df_cleaned.csv")
cal_df_cleaned.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,ColumnDiff
0,736534,12/29/2022,f,1250.0,1250.0,1,1125,0.0
1,736534,12/30/2022,f,1250.0,1250.0,2,1125,0.0
2,736534,12/31/2022,f,1250.0,1250.0,2,1125,0.0
3,736534,1/1/2023,f,1250.0,1250.0,1,1125,0.0
4,736534,1/2/2023,f,1250.0,1250.0,1,1125,0.0


Recommendation Approach

For this project, we aim to build a recommender system that will benefit the stakeholders in the following ways:
1. Increasing Customer Engagement: Recommender systems can be used to increase customer engagement by providing personalized recommendations based on their preferences, behavior, and historical data. By improving the quality of recommendations, businesses can keep their customers engaged and increase their retention rate.

2. Improving Customer Experience: A good recommender system can help improve the customer experience by suggesting relevant products or services that meet their needs and expectations. By delivering personalized recommendations, businesses can enhance their customers' satisfaction, leading to increased loyalty and repeat business.

3. Boosting Sales: Recommender systems can also help boost sales by increasing cross-selling and up-selling opportunities. By suggesting complementary products or services, businesses can encourage customers to make additional purchases, increasing their revenue and profit margins.

4. Reducing Churn Rate: Recommender systems can help reduce the churn rate by providing targeted recommendations that keep customers interested in the business. By recommending products or services that align with the customers' interests and preferences, businesses can decrease the likelihood of losing customers to competitors.

5. Improving Operational Efficiency: Recommender systems can help improve operational efficiency by automating the recommendation process and reducing the need for manual intervention. By using machine learning algorithms, businesses can streamline their operations and save time and resources, leading to increased productivity and profitability.

# Feature Engineering

User Preferences: Collect user preferences on location, type of accommodation, price range, and other relevant criteria that are important to the user.

User Behavior: Analyze user behavior within the Airbnb platform, such as the types of properties they have booked, the frequency of bookings, and the reviews they have left for properties.

Property Characteristics: Consider the characteristics of the properties, such as the location, property type, size, amenities, and other relevant features that may influence the user's decision.

Reviews and Ratings: Analyze the reviews and ratings left by previous guests for each property to determine their overall satisfaction and identify any common issues or concerns.

External Data: Incorporate external data sources, such as weather forecasts, events happening in the area, and local transportation options, to provide additional context for the user.

Similarity Matching: Use algorithms that find properties that are similar to the ones the user has shown interest in, based on factors like location, property type, price range, and other relevant criteria.

Personalization: Tailor recommendations to the user's unique preferences and behavior, by using machine learning algorithms that can learn and adapt over time.

Diversity and Serendipity: Provide recommendations that are diverse and offer unexpected options, to help the user discover new and exciting properties they may not have considered before.

# Content-based filtering Recommender System using Count Vectorizer

Content-based filtering recommends items similar to those the user has already shown interest in. In this project, we can recommend listings similar to the one the user has already viewed. To do this, we can use the listing's textual attributes, such as name, description, neighborhood_overview, and amenities, to calculate the similarity between listings; numeric attributes such as price and review_scores_rating can be used alongside them.


The count vectorizer is a common technique used to convert text data into numerical features that can be used for recommendation. It involves representing each document or piece of text as a vector of word counts, where each dimension corresponds to a unique word in the vocabulary, and the value in each dimension is the count of how many times that word appears in the document. 
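To make the word-count representation concrete, here is a minimal sketch in plain Python of what a count vectorizer computes. The two listing texts are made up for illustration and the helper `count_vectors` is hypothetical; scikit-learn's `CountVectorizer` additionally applies its own tokenization rules, optional stop-word removal, and sparse storage.

```python
from collections import Counter

def count_vectors(docs):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every unique word, sorted so that the column order is stable
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary word; the value is how often it occurs
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Made-up listing texts, for illustration only
docs = ["sea view apartment", "apartment near the sea with sea breeze"]
vocabulary, vectors = count_vectors(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a row with as many entries as there are words in the vocabulary, which is why the resulting matrix is mostly zeros for real listing texts.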

For this, we will combine several relevant text features into a single text field per listing, count the words in it, and use cosine similarity on the counts to find similar listings.
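Cosine similarity compares the direction of two count vectors rather than their length. The sketch below computes it by hand for made-up vectors; the function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for texts with no counted words
    return dot / (norm_u * norm_v)

# Made-up count vectors over a shared four-word vocabulary
listing_a = [1, 0, 2, 1]
listing_b = [2, 0, 4, 2]   # same word proportions as listing_a, but twice as long
listing_c = [0, 3, 0, 0]   # shares no words with listing_a

print(cosine(listing_a, listing_b))   # approximately 1.0
print(cosine(listing_a, listing_c))   # 0.0
```

Because the measure ignores vector length, a long description and a short one that use the same words in the same proportions are treated as nearly identical.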

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the data
df = listing_df_cleaned


# Select the relevant features
features = ['id', 'name', 'description', 'neighborhood_overview', 'price', 'review_scores_rating', 
            'property_type', 'amenities']

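# Create a new DataFrame with only the relevant features.
# .copy() avoids pandas' SettingWithCopyWarning when the 'text' column is added below;
# reset_index makes the row labels match the row positions of the similarity matrix.
df_relevant = df[features].copy().reset_index(drop=True)

# Preprocess the text data by combining the name, description, and neighborhood overview
# columns into a single text column. Missing values become empty strings so that one
# missing field does not turn the whole combined text into NaN.
text_columns = ['name', 'description', 'neighborhood_overview']
df_relevant['text'] = df_relevant[text_columns].fillna('').astype(str).agg(' '.join, axis=1)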

# Create a CountVectorizer object to convert the text data into a matrix of word counts.
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer to the text data
X = vectorizer.fit_transform(df_relevant['text'])

# Calculate the cosine similarity between all listings based on the matrix of word counts
cosine_similarities = cosine_similarity(X)

# Define a function to get the most similar listings
#This function takes the listing ID as an argument and 
# returns a DataFrame containing the top n most similar listings based on cosine similarity.
def get_similar_listings(listing_id, n=10):
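    # Get the row position (not the index label) for the given listing ID,
    # because cosine_similarities and .iloc below are both positional
    idx = np.flatnonzero((df_relevant['id'] == listing_id).to_numpy())[0]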
    
    # Get the cosine similarities for the given row
    sim_scores = list(enumerate(cosine_similarities[idx]))
    
    # Sort the list of similarities by score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the indices of the top n similar listings
    sim_indices = [i[0] for i in sim_scores[1:n+1]]
    
    # Return the top n similar listings
    return df_relevant.iloc[sim_indices]

In [None]:
# To use the recommender system, call get_similar_listings with a listing ID as the argument.
# For example, to get the top 10 listings most similar to the listing with ID 3191:
get_similar_listings(3191, n=10)



The call above returns the ten listings whose combined name, description, and neighborhood overview are most similar to those of listing 3191. Showing these listings to guests who viewed listing 3191, on the Airbnb platform or by email, keeps the recommendations focused on properties close to what they have already shown interest in.

# Collaborative filtering based Recommender System using SVD

Collaborative filtering is a type of recommender system that uses the past behavior and preferences of users to make recommendations. The underlying idea is that users who have similar tastes and preferences in the past are likely to have similar tastes and preferences in the future.

Collaborative filtering works by creating a user-item matrix that represents the ratings or preferences of users for various items. This matrix is then used to predict the ratings or preferences of users for items that they have not yet interacted with. This is done by finding users who have similar ratings or preferences as the target user, and then recommending items that those similar users have liked or rated highly.
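As an illustration of the user-item matrix described above, the following sketch builds one with NumPy from a handful of made-up (user, item, rating) triples. The identifiers and ratings are invented for the example and are not taken from the Airbnb data.

```python
import numpy as np

# Made-up (user, item, rating) triples, for illustration only
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 3),
    ("u2", "i1", 4), ("u2", "i3", 2),
    ("u3", "i2", 4), ("u3", "i3", 5),
]

users = sorted({user for user, _, _ in ratings})
items = sorted({item for _, item, _ in ratings})
user_pos = {user: k for k, user in enumerate(users)}
item_pos = {item: k for k, item in enumerate(items)}

# Rows are users, columns are items; NaN marks "not rated yet"
matrix = np.full((len(users), len(items)), np.nan)
for user, item, rating in ratings:
    matrix[user_pos[user], item_pos[item]] = rating

print(matrix)
```

The NaN cells are exactly the values a collaborative filtering model tries to predict.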

There are two main types of collaborative filtering: user-based and item-based. User-based collaborative filtering recommends items to a target user based on the ratings and preferences of users who are similar to the target user. Item-based collaborative filtering recommends items to a target user based on the similarity of the items they have rated or interacted with in the past.
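The difference between the two variants comes down to which axis of the user-item matrix is compared. The sketch below uses a made-up rating matrix and a hypothetical helper `cosine_rows`; treating unrated cells as 0 is a simplification for this example, whereas Surprise's similarity measures only use the items (or users) that both sides have in common.

```python
import numpy as np

def cosine_rows(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1.0, norms)   # avoid dividing by zero
    return unit @ unit.T

# Made-up ratings: rows are users, columns are items, 0 means "not rated"
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 4.0, 5.0],
])

user_sim = cosine_rows(R)      # user-based: compare rows (users)
item_sim = cosine_rows(R.T)    # item-based: compare columns (items)

print(np.round(user_sim, 3))
print(np.round(item_sim, 3))
```

User-based filtering would look up the largest off-diagonal entries in a row of `user_sim`; item-based filtering does the same in `item_sim`.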

Collaborative filtering has some advantages over other types of recommender systems, such as content-based systems. It can recommend items that are outside the user's usual preferences but still likely to be enjoyed, based on the preferences of similar users. It also needs no descriptive information about the items themselves, only interaction data, which helps with niche products whose content is hard to describe. However, collaborative filtering also has limitations: the cold-start problem, where it is difficult to make recommendations for new users who have not yet rated any items (or for new items that nobody has rated), and the sparsity problem, where many users and items have very few ratings, making it difficult to find similar users or items.
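The sparsity problem can be quantified as the fraction of user-item pairs that have no rating. The helper `sparsity` and the sizes below are made up for illustration.

```python
def sparsity(n_users, n_items, n_ratings):
    """Fraction of cells in the user-item matrix that have no rating."""
    return 1.0 - n_ratings / (n_users * n_items)

# Made-up sizes, for illustration only: 5,000 ratings in a 1,000 x 500 matrix
print(sparsity(n_users=1000, n_items=500, n_ratings=5000))
```

With these numbers about 99% of the matrix is empty, which is typical of the situation in which finding users with overlapping ratings becomes difficult.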

# User-Based Collaborative Filtering 

In [None]:
listing_df_cleaned['review_scores_rating'] = listing_df_cleaned['review_scores_rating'].astype('int')

In [None]:
from surprise import Reader, Dataset

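# Load the ratings in the column order Surprise expects: user, item, rating.
# host_id plays the role of the user and the listing id the role of the item.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(listing_df_cleaned[['host_id', 'id', 'review_scores_rating']], reader)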

train_data = data.build_full_trainset()


In [None]:
#Train the user-based collaborative filtering model:
from surprise import KNNWithMeans

sim_options = {'name': 'cosine', 'user_based': True}
algo = KNNWithMeans(sim_options=sim_options)
algo.fit(train_data)


In [None]:
#Make Predictions for a user
# Get the top-n recommendations for a user
def ubcf_recommendations(user_id, top_n=10):
    items = df['id'].unique()
    # Listings already associated with this user; a set makes the membership test fast
    user_items = set(df[df['host_id'] == user_id]['id'])
    other_items = [item for item in items if item not in user_items]
    
    predictions = []
    for item_id in other_items:
        predictions.append((item_id, algo.predict(user_id, item_id).est))
    
    recommendations = sorted(predictions, key=lambda x: x[1], reverse=True)[:top_n]
    recommended_items = [item[0] for item in recommendations]
    return recommended_items


In [None]:
# Let's make predictions now:
user_id = 3754
recommended_items = ubcf_recommendations(user_id, top_n=5)

print(f"Top 5 recommended items for user {user_id}:")
for item_id in recommended_items:
    print(item_id)

In [None]:
# Evaluate the performance of collaborative filtering (SVD model) on a held-out test set

In [None]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# Define the reader and load the data into the Surprise dataset format
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(listing_df_cleaned[['host_id', 'id', 'review_scores_rating']], reader)

# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Train the algorithm on the training set
algo = SVD()
algo.fit(trainset)

# Make predictions on the test set
predictions = algo.test(testset)

# Compute and print the RMSE and MAE scores for the predictions
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)
print('Root Mean Squared Error (RMSE):', rmse)
print('Mean Absolute Error (MAE):', mae)


After running this code, you should see the RMSE and MAE scores for the predictions. RMSE is the square root of the mean squared difference between the predicted and actual ratings in the test set, and MAE is the mean absolute difference. The lower both scores are, the better the collaborative filtering algorithm performs. Note that this evaluation uses the SVD model rather than the KNN-based user-based model trained earlier, so the scores describe SVD; the trained model can then be used to make personalized recommendations for users based on their past ratings and behavior.
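To show how the two error measures are computed, here is a small worked example in plain Python. The rating values are made up, and the functions `rmse` and `mae` are written out by hand for illustration rather than taken from Surprise.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: large errors are penalized more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: every unit of error counts the same."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Made-up ratings, for illustration only
actual = [5, 4, 3, 2]
predicted = [4, 4, 5, 2]

print(rmse(actual, predicted))   # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
print(mae(actual, predicted))    # (1 + 0 + 2 + 0) / 4 = 0.75
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few predictions are far off.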

# Time Series Predictive Model

In [None]:
Time Series Analysis using SARIMA, Prophet, and Seasonal-Trend decomposition using LOESS (STL)

SARIMA: The Seasonal ARIMA (SARIMA) model is commonly used to analyze time series data with seasonal patterns. It extends ARIMA with seasonal autoregressive, differencing, and moving-average terms, so it can capture both the trend and the seasonality in the data.
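The differencing part of SARIMA can be illustrated without fitting a model. The sketch below builds a made-up series with a linear trend and a pattern that repeats every 12 steps, then applies one seasonal difference and one ordinary difference, which is what the d=1 and D=1 settings used later correspond to. The series and the period are assumptions for the illustration.

```python
import numpy as np

season = 12  # assumed seasonal period for this illustration
t = np.arange(4 * season)

# Made-up series: linear trend plus a pattern that repeats every `season` steps
pattern = np.tile(np.arange(season) % 3, 4).astype(float)
series = 0.5 * t + pattern

# Seasonal difference (D=1): subtract the value one season earlier.
# The repeating pattern cancels and the trend becomes a constant.
seasonal_diff = series[season:] - series[:-season]

# First difference (d=1): subtract the previous value, removing the constant
stationary = np.diff(seasonal_diff)

print(seasonal_diff[:5])
print(stationary[:5])
```

After both differences nothing systematic is left in this noise-free toy series; on real data the remainder is what the autoregressive and moving-average terms model.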

In [None]:

# Work on a copy so that cal_df_cleaned keeps its 'date' column for the cells below
df = cal_df_cleaned.copy()

# Set the date column as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)


In [None]:
cal_df_cleaned

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX


calendar_df = cal_df_cleaned.copy()
# Preprocess the data
calendar_df['date'] = pd.to_datetime(calendar_df['date'])
calendar_df = calendar_df.dropna(subset=['price'])

# Group the data by date and calculate the average price
avg_price_df = calendar_df.groupby('date')['price'].mean()

# Plot the time series
plt.plot(avg_price_df)
plt.title('Average price over time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

# Decompose the time series to observe its trend and seasonality
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(avg_price_df)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(avg_price_df, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
plt.show()  # finish this figure so the forecast plot below starts a new one

# Fit a SARIMAX model to the time series
model = SARIMAX(avg_price_df, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
results = model.fit()

# Print the model summary
print(results.summary())

# Forecast the next 12 steps (12 days, since the series has one value per day)
forecast = results.forecast(steps=12)
print(forecast)

# Plot the actual and predicted values
plt.plot(avg_price_df, label='Actual')
plt.plot(forecast, label='Predicted')
plt.title('Actual vs Predicted Average Price')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend(loc='best')
plt.show()

* > Dep. Variable: The dependent variable in the model, which in this case is "price".
* > No. Observations: The number of observations used in the model.
* > Model: The specifications of the SARIMA model used in the analysis, which is (1, 1, 1)x(0, 1, 1, 12).
* > Log Likelihood: The log likelihood of the estimated model parameters.
* > AIC, BIC, and HQIC: The Akaike Information Criterion, Bayesian Information Criterion, and Hannan-Quinn Information Criterion, respectively. These are used to compare different models and select the best one based on the lowest value.
* > Covariance Type: The method used to estimate the covariance matrix, which in this case is "opg".
* > Coefficients: The estimated coefficients for the SARIMA model. In this case, there are three coefficients: ar.L1, ma.L1, and ma.S.L12. These represent the autoregressive, moving average, and seasonal moving average parameters, respectively.
* > sigma2: The estimated variance of the error term in the model.
* > Ljung-Box (L1) (Q): A test for autocorrelation in the residuals at a lag of 1. A p-value greater than 0.05 indicates that there is no evidence of autocorrelation.
* > Jarque-Bera (JB): A test for normality of the residuals. A p-value less than 0.05 indicates that the residuals are not normally distributed.
* > Heteroskedasticity (H): A test for heteroscedasticity in the residuals. A p-value less than 0.05 indicates that the residuals are heteroscedastic.
* > Warnings: Any warnings or notes about the model estimation process.
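The information criteria in the summary are simple functions of the log likelihood. The sketch below uses the standard definitions of AIC and BIC; the input numbers are made up and are not taken from the model fitted above, and how many parameters a library counts (for example whether sigma2 is included) is an implementation detail to check against its documentation.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Made-up values, for illustration only
llf, k, n = -1200.0, 4, 365

print(aic(llf, k))      # 2408.0
print(bic(llf, k, n))
```

Both criteria reward a higher likelihood and penalize extra parameters; BIC penalizes them more strongly once there are more than about eight observations, because ln(n) then exceeds 2.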

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Load Airbnb calendar data for Cape Town
df = cal_df_cleaned.copy()

# The raw calendar has one row per listing per date, so average the price
# per day to obtain a single time series
df['date'] = pd.to_datetime(df['date'])
daily_price = df.groupby('date')['price'].mean()

# Create a SARIMA model with seasonal_order=(1,1,1,12)
model = SARIMAX(daily_price, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))

# Fit the model to the data
results = model.fit()

# Forecast the next 12 steps (12 days, since the series has one value per day)
forecast = results.forecast(steps=12)

# Plot the forecast
forecast.plot();


Prophet model: Prophet is a time series forecasting model developed by Facebook. It is particularly well suited to data with one or more seasonal patterns and a trend that changes over time. The model takes an additive approach, summing trend, seasonality, and holiday components.
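The additive structure can be written as y(t) = g(t) + s(t) + h(t) + error, with g the trend, s the seasonality, and h the holiday effects. The sketch below assembles a made-up daily price series from such components with NumPy; it does not use Prophet itself, and all numbers are invented for the illustration.

```python
import numpy as np

days = np.arange(28)

trend = 1000.0 + 2.0 * days                    # g(t): slow linear growth
weekly = np.where(days % 7 >= 5, 150.0, 0.0)   # s(t): surcharge on the last two days of each week
holidays = np.zeros_like(trend)
holidays[10] = 300.0                           # h(t): a one-off event on day 10

# Additive model without the noise term
price = trend + weekly + holidays

print(price[:14])
```

Fitting Prophet goes in the opposite direction: given only `price`, it estimates the separate components, which is what `plot_components` displays.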

In [None]:
cal_df_cleaned.copy().columns

In [None]:
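import pandas as pd
from prophet import Prophet  # the package was called fbprophet before version 1.0

# Load data into a pandas DataFrame
df = cal_df_cleaned.copy()

# Average the price per day so that there is one observation per date
df['date'] = pd.to_datetime(df['date'])
df = df.groupby('date', as_index=False)['price'].mean()

# Rename columns to "ds" and "y", the names Prophet expects
df = df.rename(columns={'date': 'ds', 'price': 'y'})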


# Instantiate Prophet model and fit to data
model = Prophet()
model.fit(df)
# Generate future dates for forecasting
future_dates = model.make_future_dataframe(periods=365)

# Make predictions for future dates
forecast = model.predict(future_dates)

# Plot forecast
model.plot(forecast, xlabel='Date', ylabel='Price')

# Plot components of forecast (trend, seasonality, holidays)
model.plot_components(forecast)

> * Seasonality: In a time series, seasonality refers to patterns that repeat at fixed intervals of time, such as weekly, monthly, or yearly. In a Prophet model, each seasonality is represented by a smooth curve that shows how the time series varies within that period. These curves are shown in the seasonality panels (for example "weekly" and "yearly") produced by model.plot_components(). You can use them to identify the days of the week or periods of the year when prices tend to be high or low.

> * Trends: In a time series, trends refer to long-term patterns that are not related to seasonality. Trends can be upward or downward, and may be linear or nonlinear. In a Prophet model, trends are represented by a smooth curve that shows how the time series is changing over time. This curve is shown in the "trend" plot that is produced by model.plot_components(). You can use this plot to see whether prices are generally increasing or decreasing over time, and to identify any sudden changes in the trend.

> * Patterns: In a time series, patterns here refer to any other features that are not explained by seasonality or trend, such as sudden spikes or drops in the data. A Prophet model does not represent these explicitly; they remain in the residuals, that is, the differences between the observed values and the fitted values (yhat). model.plot() does not draw a residual plot, but it overlays the observed points on the forecast and its uncertainty interval, so points that fall far from the forecast line indicate such patterns.

Seasonal-Trend decomposition using LOESS (STL): STL is a decomposition method that can be used to identify seasonality in a time series. It uses LOESS (locally weighted regression) smoothing to decompose the time series into three components: trend, seasonal, and residual. The seasonal component can be used to identify seasonal patterns in the data.
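The idea of splitting a series into trend and seasonal parts can be shown with a simplified stand-in. The sketch below uses a centered moving average instead of LOESS, so it is a classical-style decomposition and not STL itself; the series, the period of 4, and all variable names are made up for the illustration.

```python
import numpy as np

period = 4
t = np.arange(6 * period, dtype=float)
seasonal_true = np.tile([3.0, -1.0, -3.0, 1.0], 6)   # sums to zero over one period
series = 10.0 + 0.5 * t + seasonal_true

# Trend: centered moving average spanning one full period
# (half weights at both ends because the period is even)
kernel = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.convolve(series, kernel, mode="valid")     # period/2 points are lost at each end
offset = period // 2
detrended = series[offset:offset + len(trend)] - trend

# Seasonal: average detrended value at each position within the period
positions = (np.arange(len(detrended)) + offset) % period
seasonal = np.array([detrended[positions == p].mean() for p in range(period)])

print(trend[:4])
print(seasonal)
```

On this noise-free series the moving average recovers the linear trend and the seasonal averages recover the repeating pattern exactly; STL replaces both smoothing steps with LOESS, which lets the seasonal shape drift over time and makes the result more robust to outliers.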

In [None]:
# Import necessary libraries
from statsmodels.tsa.seasonal import STL

# Load Airbnb calendar data for Cape Town
df = cal_df_cleaned.copy()

# Average the price per day and use the date as the index
df['date'] = pd.to_datetime(df['date'])
df = df.groupby('date')[['price']].mean()

# Create an STL decomposition model
model = STL(df['price'], period=12)

# Fit the model to the data
results = model.fit()

# Plot the seasonal component of the data
results.seasonal.plot()


# Evaluation of Time Series Models

In [None]:
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.forecasting.stl import STLForecast

# Load Airbnb calendar data for Cape Town
df = cal_df_cleaned.copy()

# Average the price per day and use the date as the index
df['date'] = pd.to_datetime(df['date'])
df = df.groupby('date')[['price']].mean()

# Split the data into train and test sets (the last 12 days are held out)
train = df.iloc[:-12]
test = df.iloc[-12:]

# Create a SARIMA model with seasonal_order=(1,1,1,12)
sarima_model = SARIMAX(train['price'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarima_results = sarima_model.fit()

# Create a Prophet model
prophet_model = Prophet()
prophet_model.fit(train.reset_index().rename(columns={'date':'ds', 'price':'y'}))

# STL only decomposes the series, so its seasonal component is not a forecast.
# STLForecast (a substitution for the raw seasonal values) forecasts the
# deseasonalized series with ARIMA and adds the seasonal component back.
stl_model = STLForecast(train['price'], ARIMA, model_kwargs={'order': (1, 1, 1)}, period=12)
stl_results = stl_model.fit()

# Make predictions using each model
sarima_preds = sarima_results.predict(start=len(train), end=len(train)+11, dynamic=False)
prophet_preds = prophet_model.make_future_dataframe(periods=12, freq='D')
prophet_preds = prophet_model.predict(prophet_preds)[-12:]['yhat']
stl_preds = stl_results.forecast(12)

# Calculate RMSE for each model
sarima_rmse = np.sqrt(mean_squared_error(test['price'], sarima_preds))
prophet_rmse = np.sqrt(mean_squared_error(test['price'], prophet_preds))
stl_rmse = np.sqrt(mean_squared_error(test['price'], stl_preds))

# Print RMSE for each model
print('SARIMA RMSE: ', sarima_rmse)
print('Prophet RMSE: ', prophet_rmse)
print('STL RMSE: ', stl_rmse)

# Choose the model with the lowest RMSE
if sarima_rmse <= prophet_rmse and sarima_rmse <= stl_rmse:
    print('SARIMA is the best model.')
elif prophet_rmse <= sarima_rmse and prophet_rmse <= stl_rmse:
    print('Prophet is the best model.')
else:
    print('STL is the best model.')


# HYPERPARAMETER TUNING FOR SARIMA