# Regression Project

**Using a dataset from rentfaster.ca, I will analyze rental prices across Canada in 2024.** <br>
**The goal of this analysis is to compare several regression models and identify the one that predicts rental price most accurately.**

## Project Outline

1. The project will begin with an EDA portion to better understand the data, identify trends, and determine which features are useful for regression models.
2. Data will be cleaned and prepared for visualization.
3. Once data has been organized and cleaned, visualization will be included to support my analysis of aforementioned trends. Key variables will be determined for feature engineering.
4. Proceed with feature engineering to formulate three (3) regression models.
5. Compare different regression models & distinguish which will best predict prices.
6. Use final model to make new predictions.

## 1. EDA

In [6]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
# Loading dataset
rental_df = pd.read_csv('/Users/AlexandreRioux/Desktop/M2P07-Regression_Project-main/canada_rent.csv')
rental_df

FileNotFoundError: [Errno 2] No such file or directory: '/Users/AlexandreRioux/Desktop/M2P07-Regression_Project-main/canada_rent.csv'

*The dataset is quite extensive with many columns describing the rentals.*

In [None]:
# Exploring
print(rental_df.info())
print(rental_df.describe())
print(rental_df.sample(5))

**The dataset info shows that several columns have null values, most notably the 'smoking' and 'sq_feet' columns.** <br>
**At first glance, the 'latitude' and 'longitude' columns do not seem necessary for price prediction because a 'city' column exists. Although prices vary between neighbourhoods within one city, city-level data would be more relevant.**

In [None]:
# Let's observe missing values
print(rental_df.isnull().sum())
print('----------------')
# Let's observe data types
print(rental_df.dtypes)
print('----------------')
# Let's observe the number of unique entries in each column
print(rental_df.nunique())

From this first exploration, it might be prudent to keep latitude and longitude since they are numerical values that can easily be used in regression models. <br>
Here are the changes I am moving forward with:<br>
**Columns to remove:** <br>
*link, address and rentfaster_id*<br>
**Columns to keep as numerical data:** <br>
*latitude, longitude, beds, bath, sq_feet*<br>
**Columns to convert into numerical data:** <br>
*lease_term, type, furnishing, availability_date, smoking, cats, dogs*<br>

The **price** will be the target variable used for prediction.<br>
The square feet column will be averaged out according to beds, baths and price.
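
A minimal sketch of what group-wise averaging of square footage could look like, on made-up toy data. It groups only by beds and baths; leaving price out is an assumption made here, since price is the target and would need to be binned first. The names `toy_df`, `group_mean` and `sq_feet_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the rental listings (values are made up).
toy_df = pd.DataFrame({
    'beds':    [1, 1, 1, 2, 2, 2, 3],
    'baths':   [1, 1, 1, 1, 1, 1, 2],
    'sq_feet': [600.0, 700.0, np.nan, 900.0, np.nan, 1100.0, np.nan],
})

# Mean square footage of listings sharing the same beds/baths combination.
group_mean = toy_df.groupby(['beds', 'baths'])['sq_feet'].transform('mean')

# Fill from the group mean first, then fall back to the overall mean
# for groups that have no observed square footage at all.
toy_df['sq_feet_filled'] = (
    toy_df['sq_feet'].fillna(group_mean).fillna(toy_df['sq_feet'].mean())
)
print(toy_df)
```

Compared with a single overall mean, this keeps a one-bedroom unit from being assigned the footage of a typical three-bedroom house.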

## 2. Cleaning and Preparing Data

In [None]:
# Cleaning Data
clean_rental_df = rental_df.drop(columns=['rentfaster_id','address', 'link'])
clean_rental_df

In [None]:
# Using mode operations because mean & median would give values that are not realistic to homes.

# Finding mode in 'beds' column
mode_beds = clean_rental_df['beds'].mode()[0]
mode_beds # Mode is 2 for 'beds'
# Filling missing values in 'beds' column with the mode.
clean_rental_df['beds'] = clean_rental_df['beds'].fillna(mode_beds)
# Finding mode in 'baths' column
mode_baths = clean_rental_df['baths'].mode()[0]
mode_baths # Mode is 1 for 'baths'
# Filling missing values in 'baths' column with the mode.
clean_rental_df['baths'] = clean_rental_df['baths'].fillna(mode_baths)
# Finding mode in the 'lease_term' column
mode_lease_term = clean_rental_df['lease_term'].mode()[0]
mode_lease_term # Mode of 'lease_term' is 'Long Term'. Usually means 12 months.
# Converting Long Term values as 12 months.
clean_rental_df['lease_term'] = clean_rental_df['lease_term'].replace('Long Term', 12).fillna(12)
# Renaming 'lease_term' column for clarification
clean_rental_df = clean_rental_df.rename(columns={'lease_term': 'lease_term(months)'})

In [None]:
# Changing Negotiable values to 12 (months).
clean_rental_df['lease_term(months)'] = clean_rental_df['lease_term(months)'].replace('Negotiable', 12)
# Changing Short Term values to 6 (months).
# (No fillna needed here: missing values were already filled with 12 in the previous cell.)
clean_rental_df['lease_term(months)'] = clean_rental_df['lease_term(months)'].replace('Short Term', 6)
# Changing 'months' values to 12 (months).
clean_rental_df['lease_term(months)'] = clean_rental_df['lease_term(months)'].replace('months', 12)

In [None]:
# Dropping rows with missing values in 'availability_date'
clean_rental_df.dropna(subset=['availability_date'], inplace=True)

In [None]:
# Finding mode and filling missing values in the 'smoking' column
mode_smoking = clean_rental_df['smoking'].mode()[0]
mode_smoking # Mode is 'Non-Smoking'
clean_rental_df['smoking'] = clean_rental_df['smoking'].fillna(mode_smoking)
# Finding mode and filling missing values in the 'cats' and 'dogs' columns
mode_cats = clean_rental_df['cats'].mode()[0]
mode_cats # Mode is 'True' which means they are allowed
mode_dogs = clean_rental_df['dogs'].mode()[0]
mode_dogs # Mode is 'True' which means they are allowed
clean_rental_df['cats'] = clean_rental_df['cats'].fillna(mode_cats)
clean_rental_df['dogs'] = clean_rental_df['dogs'].fillna(mode_dogs)

In [None]:
# Cleaning up missing values in the 'sq_feet' column
clean_rental_df.isna().sum()
# Keep all numerical values, make the rest NaN values.
def clean_sq_feet(value): # Returns the value as a float if it is purely numeric, otherwise NaN
    value_str = str(value)
    if all(s.isdigit() or s.isspace() or s == '.' for s in value_str):
        try:
            return float(value_str.replace(" ", ""))
        except ValueError:
            return np.nan
    return np.nan
clean_rental_df['sq_feet'] = clean_rental_df['sq_feet'].apply(clean_sq_feet)
# sq_feet column now only contains numerical values -> float.
# Replace NaN values with the mean of numerical values.
mean_sq_feet = clean_rental_df['sq_feet'].mean()
mean_sq_feet # Mean is 885.64 square feet
# Filling values
clean_rental_df['sq_feet'] = clean_rental_df['sq_feet'].fillna(mean_sq_feet)
print(clean_rental_df.isna().sum()) # No more null values.
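
As a hedged alternative to a hand-written cleaning function, `pd.to_numeric` with `errors='coerce'` can do a similar job in one line. The sketch below runs on a made-up series (`raw` and `cleaned` are hypothetical names). Note that `to_numeric` is slightly more permissive than `clean_sq_feet`: it would also accept signed or scientific-notation strings.

```python
import pandas as pd

# Made-up examples of the kinds of entries a free-text sq_feet column can hold.
raw = pd.Series(['850', '1 200', '700.5', 'N/A', '900 sqft', None])

# Remove whitespace inside numbers, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.replace(r'\s+', '', regex=True), errors='coerce')
print(cleaned)
```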

## 3. Visualization

#### Let's visualize the collected data to identify trends.

In [None]:
clean_rental_df

In [None]:
# Let's visualize the price vs. type of home
plt.figure(figsize=(8, 5))
sns.barplot(data=clean_rental_df, x='type', y='price', hue='type', palette='hls')
plt.xticks(rotation=90)
plt.ylabel('Price ($)', fontsize=12)
plt.xlabel('Type of Rental Unit', fontsize=12)
plt.title('Type of Home vs. Price($)', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Visualization of Price vs. Province
plt.figure(figsize=(8, 5))
sns.barplot(data=clean_rental_df, x='province', y='price', hue='province', palette='viridis')
plt.xticks(rotation=45)
plt.ylabel('Price ($)', fontsize =12)
plt.xlabel('Province', fontsize=12)
plt.title('Province vs. Price($)', fontsize=16)
plt.tight_layout()
plt.show()
# Ontario, British Columbia and Quebec contain the highest prices for rental units.
# Seeing as the 3 largest cities in Canada (Toronto, Montreal, Vancouver) are in those provinces, this is expected.

In [None]:
# Visualization of Price vs. Smoking
plt.figure(figsize=(8, 5))
sns.barplot(data=clean_rental_df, x='smoking', y='price', hue='smoking', palette='hls')
plt.xticks(rotation=90)
plt.ylabel('Price ($)', fontsize =12)
plt.xlabel('Smoking Units', fontsize=12)
plt.title('Smoking Units vs. Price($)', fontsize=16)
plt.tight_layout()
plt.show()
# As expected, non-smoking units are more expensive.

In [None]:
# Visualization of Price vs Furnished
plt.figure(figsize=(8, 5))
sns.barplot(data=clean_rental_df, x='furnishing', y='price', hue='furnishing')
plt.xticks(rotation=45)
plt.ylabel('Price ($)', fontsize =12)
plt.xlabel('Furnishing', fontsize=12)
plt.title('Furnished Rental Units vs. Price($)', fontsize=16)
plt.tight_layout()
plt.show()
# As expected, furnished units are typically more expensive.

In [None]:
# Visualization of Square Footage vs. Price
plt.figure(figsize=(16, 6))
sns.scatterplot(data=clean_rental_df, x='sq_feet', y='price', color='red')
plt.title('Rental Unit Square Footage vs. Price ($)', fontsize=16)
plt.ylabel('Price ($)')
plt.xlabel('Square Footage')
plt.show()
# There are only a few outliers in this dataset. They may influence feature engineering later on.
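
To put a number on "a few outliers", one common rule of thumb is the 1.5 × IQR fence. The sketch below applies it to a made-up price series. The threshold and the names `toy_prices` and `is_outlier` are assumptions for illustration, not something taken from the dataset.

```python
import pandas as pd

# Made-up monthly rents with one extreme listing.
toy_prices = pd.Series([1200, 1500, 1600, 1800, 2000, 2200, 2500, 15000], name='price')

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = toy_prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag listings outside the fences instead of dropping them right away.
is_outlier = (toy_prices < lower) | (toy_prices > upper)
print(toy_prices[is_outlier])
```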

## 4. Feature Engineering and Regression Modelling

In [None]:
# Encoding columns for modelling
from sklearn.preprocessing import OneHotEncoder
# Checking which columns are still categorical (object dtype)
clean_rental_df.dtypes
# Creating an instance of OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=False)
# Transforming objects into floats for regression modelling
encoded_cols = encoder.fit_transform(clean_rental_df[['city', 'province', 'type',
                                                      'beds', 'baths', 'furnishing', 'availability_date', 'smoking', 'cats', 'dogs']])
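# Reusing the index of clean_rental_df so the join below aligns row by row
# (rows were dropped earlier, so the index is no longer 0..n-1).
encoded_cols_clean_rental_df = pd.DataFrame(encoded_cols, columns = encoder.get_feature_names_out([
    'city', 'province', 'type', 'beds', 'baths', 'furnishing', 'availability_date', 'smoking', 'cats', 'dogs']),
    index = clean_rental_df.index)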
encoded_rental_df = clean_rental_df.drop(columns=['city', 'province', 'type', 'beds', 'baths', 'furnishing', 'availability_date', 
                                                  'smoking', 'cats', 'dogs']).join(encoded_cols_clean_rental_df)
encoded_rental_df['lease_term(months)'].unique() # lease_term(months) still has objects
encoded_rental_df['lease_term(months)'] = encoded_rental_df['lease_term(months)'].replace({'12 months': 12, '6 months': 6})
encoded_rental_df = encoded_rental_df.astype({'lease_term(months)': 'float64'})
encoded_rental_df.dtypes # All features are floats.
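
When encoder output is joined back onto a frame whose rows were dropped earlier, index alignment matters: `DataFrame.join` matches on index labels, not on row position. The toy sketch below (all names and values are made up) shows how a gap in the index produces NaNs and how reusing the original index avoids them.

```python
import pandas as pd

# A frame whose index has a gap, as happens after dropna().
left = pd.DataFrame({'price': [1000, 1500, 2000]}, index=[0, 2, 3])

# Encoder-style output wrapped in a DataFrame gets a fresh 0..n-1 index.
encoded = pd.DataFrame({'type_House': [1.0, 0.0, 1.0]})

# Joining on mismatched labels silently introduces NaN and shifts values.
misaligned = left.join(encoded)

# Giving the encoded frame the same index restores row-by-row alignment.
aligned = left.join(encoded.set_axis(left.index))

print(misaligned)
print(aligned)
```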

In [None]:
# Preparing data for modelling using 'Price' as the target variable.
# Importing train_test_split
from sklearn.model_selection import train_test_split
# Searching for features with the highest correlation to target variable
corr_encoded_rental_df = encoded_rental_df.corr()
# Obtaining columns with the highest correlation to 'price'
price_corr_df = corr_encoded_rental_df['price']
# Display top 10 columns
top_10_corr = price_corr_df.drop('price').sort_values(ascending=False).head(10)
print(top_10_corr)

In [None]:
# Re-encoding with pd.get_dummies, which keeps the original row index and so avoids NaNs from a misaligned join
encoded_rental_df = pd.get_dummies(clean_rental_df, drop_first=False, dummy_na=False)

In [None]:
# Selecting features among the top correlations.
# Omitting baths_7.5 & beds_8
X = encoded_rental_df[['sq_feet', 'city_Toronto', 'province_Ontario', 'type_House', 'longitude', 'baths_2', 'city_Vancouver', 'baths_3.5']]
y = encoded_rental_df['price']
print('These are the features of the dataframe')
print(X)
print('This is the target variable')
print(y)

In [None]:
# Making sure there are no more NaN values
# (X and y were built in the previous cell, so re-run it if this dropna removes any rows)
print(encoded_rental_df.isna().sum())
encoded_rental_df.dropna(inplace=True)

### 4.1 Regression Modelling

### *First Model*

In [None]:
# Linear Regression Model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)
print('This is the X_train array')
print(X_train)
print('This is the X_test array')
print(X_test)
print('This is the y_train array')
print(y_train)
print('This is the y_test array')
print(y_test)

In [None]:
# Importing StandardScaler
# StandardScaler puts all features on a comparable scale (zero mean, unit variance).
# Note: it does not by itself reduce the influence of the far-reaching outliers; RobustScaler would be an option for that.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit transform on the training data
X_train = scaler.fit_transform(X_train)
# Transform on the testing data
X_test = scaler.transform(X_test)

print('This is the scaled X_train array')
print(X_train)
print('This is the scaled X_test array')
print(X_test)

# Importing model
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
X_train.shape # Output is (21895, 8)
y_train.shape # Output is (21895,) since the target is one-dimensional
# Initializing training model
lr_model.fit(X_train, y_train)

In [None]:
# Importing to calculate validity of model
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.metrics import r2_score
import numpy as np
y_pred = lr_model.predict(X_test)
# Computing MAE, MSE, RMSE and R²
mae_lr = mean_absolute_error(y_test, y_pred)
mse_lr = mean_squared_error(y_test, y_pred)
rmse_lr = np.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred)

print('This is the Mean Absolute Error')
print(mae_lr)
print('This is the Mean Squared Error')
print(mse_lr)
print('This is the RMSE')
print(rmse_lr)
print('This is the R2')
print(r2_lr)

### *Second Model*

In [None]:
# Polynomial Regression Model
from sklearn.preprocessing import PolynomialFeatures
poly_converter = PolynomialFeatures(degree=2, include_bias=False)
# Fitting converter
poly_features = poly_converter.fit_transform(X)
poly_features

In [None]:
# Splitting the polynomial features into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.15, random_state=1)

In [None]:
# Training the model
from sklearn.linear_model import LinearRegression
# Creating model
poly_model = LinearRegression()
# Fitting the model on the polynomial training data
poly_model.fit(X_train, y_train)

In [None]:
# Evaluating model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score
y_pred = poly_model.predict(X_test)
# Each metric is computed before it is printed
poly_mae = mean_absolute_error(y_test, y_pred)
poly_mse = mean_squared_error(y_test, y_pred)
poly_rmse = np.sqrt(poly_mse)
poly_r2 = r2_score(y_test, y_pred)
print('This is the Mean Absolute Error:', poly_mae)
print('This is the Mean Squared Error:', poly_mse)
print('This is the RMSE:', poly_rmse)
print('This is the R2:', poly_r2)

### *Third Model*

In [None]:
# Using ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
# Initializing PolynomialFeatures
poly_transformer = PolynomialFeatures(degree=2, include_bias=False)
# Fitting converter to features
X_poly_features = poly_transformer.fit_transform(X)

In [None]:
# ElasticNetCV - Train_Test_Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_poly_features, y, test_size=0.15, random_state=1)

In [None]:
# Scaling data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test= scaler.transform(X_test)

In [None]:
# Importing ElasticNetCV
from sklearn.linear_model import ElasticNetCV
# Creating ElasticNetCV model
elastic_model = ElasticNetCV(l1_ratio=[.1, .2, .5,.7, .90, .95, 1], eps=0.01, n_alphas=100, max_iter=15000)
elastic_model.fit(scaled_X_train, y_train)
elastic_model

In [None]:
# Searching for best Alpha
print(elastic_model.alpha_)

In [None]:
# Evaluating model
y_pred = elastic_model.predict(scaled_X_test)

elastic_mae = mean_absolute_error(y_test, y_pred)
elastic_mse = mean_squared_error(y_test, y_pred)
elastic_rmse = np.sqrt(elastic_mse)
elastic_r2 = r2_score(y_test, y_pred)

print('This is the Elastic Mean Absolute Error:', elastic_mae)
print('This is the Elastic Mean Squared Error:', elastic_mse)
print('This is the Elastic RMSE:', elastic_rmse)
print('This is the Elastic R2:', elastic_r2)

## 5. Comparing the Models

In [None]:
metrics_comparison = {
    'Models': ['Linear Regression (LR)', 'Polynomial Regression (PR)', 'Poly. Reg. with ElasticNetCV'],
    'MAE': [mae_lr, poly_mae, elastic_mae],
    'MSE': [mse_lr, poly_mse, elastic_mse],
    'RMSE': [rmse_lr, poly_rmse, elastic_rmse],
    'R²': [r2_lr, poly_r2, elastic_r2]}
metrics_comparison_df = pd.DataFrame(metrics_comparison)
metrics_comparison_df

#### *Because Polynomial Regression (PR) has the least error and the highest R² score, I will be choosing it as my final prediction model.*
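
A single train/test split can favour one model by chance, so a cross-validated comparison is a useful sanity check on a choice like this. The sketch below uses synthetic data with a deliberately quadratic relationship; the data and the names `candidates` and `cv_rmse` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data where the target depends on the square of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 2))
y_demo = 1000 + 200 * X_demo[:, 0] ** 2 + 50 * X_demo[:, 1] + rng.normal(scale=20, size=300)

candidates = {
    'linear': make_pipeline(StandardScaler(), LinearRegression()),
    'poly2': make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           StandardScaler(), LinearRegression()),
}

# 5-fold cross-validated RMSE for each candidate (the scorer returns negative values).
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()

print(cv_rmse)
```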

## 6. Final Model Predictions

In [None]:
# Defining the features of the final model (again)
X = encoded_rental_df[['sq_feet', 'city_Toronto', 'province_Ontario', 'type_House', 'longitude', 'baths_2', 'city_Vancouver', 'baths_3.5']]
y = encoded_rental_df['price']

# Creating polynomial features by conversion
poly_converter = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_converter.fit_transform(X)  # Transforming the entire dataset
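
From here, a natural next step would be to fit the final model on the transformed data and pass new listings through the same converter before predicting. The sketch below shows that flow on synthetic data with only two of the feature names. The data, the new listing and the `demo_*` names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the selected features and the price target.
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    'sq_feet': rng.uniform(400, 2000, size=100),
    'city_Toronto': rng.integers(0, 2, size=100),
})
demo_y = 800 + 1.2 * demo_X['sq_feet'] + 500 * demo_X['city_Toronto']

# Fit the converter and the model on the full (synthetic) dataset.
demo_converter = PolynomialFeatures(degree=2, include_bias=False)
demo_X_poly = demo_converter.fit_transform(demo_X)
final_demo_model = LinearRegression().fit(demo_X_poly, demo_y)

# A new listing must go through transform(), not fit_transform(),
# so it is expanded exactly like the training data.
new_listing = pd.DataFrame({'sq_feet': [900.0], 'city_Toronto': [1]})
predicted_price = final_demo_model.predict(demo_converter.transform(new_listing))[0]
print('Predicted price for the new listing:', predicted_price)
```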