<a href="https://colab.research.google.com/github/Cloudy34/AI_Projects/blob/main/2_Regression_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Illustration of some classical machine learning regression algorithms

We will use the following algorithms:

* Linear Regression
* k-Nearest Neighbors Regression
* Random Forest Regression
* Support Vector Regression

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

from sklearn.metrics import mean_squared_error, r2_score

Disclaimer: this notebook is not meant to be a comprehensive guide to the algorithms. It is meant to be a quick illustration of the algorithms, and to give you a feel for how they work. For more information, see the documentation of the algorithms.

Data: I'm not sure of the exact license of the data, so I cannot share it directly here. You can download it from [here](https://www.kaggle.com/datasets/iamasteriix/rental-apartments-in-kenya).
Always check the license of the data before using or sharing it. As interesting as AI is, it is important to respect the rights of other people's work, and the law 😀.

In [None]:
# Import Dataset
df = pd.read_csv('rent_apts.csv')

'''
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
This is quite useful for data analysis and manipulation. We won't be doing any analysis in this demo.
'''

df.head()

Unnamed: 0,Agency,Neighborhood,Price,link,sq_mtrs,Bedrooms,Bathrooms
0,Buy Rent Shelters,"General Mathenge, Westlands","KSh 155,000",/listings/4-bedroom-apartment-for-rent-general...,4.0,4.0,4.0
1,Kenya Classic Homes,"Kilimani, Dagoretti North","KSh 100,000",/listings/3-bedroom-apartment-for-rent-kiliman...,300.0,3.0,4.0
2,Absolute Estate Agents,"Hatheru Rd,, Lavington, Dagoretti North","KSh 75,000",/listings/3-bedroom-apartment-for-rent-lavingt...,3.0,3.0,5.0
3,A1 Properties Limited,"Kilimani, Dagoretti North","KSh 135,000",/listings/3-bedroom-apartment-for-rent-kiliman...,227.0,3.0,4.0
4,Pmc Estates Limited,"Imara Daima, Embakasi","KSh 50,000",/listings/3-bedroom-apartment-for-rent-imara-d...,3.0,3.0,
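
To make the DataFrame description above concrete, here is a minimal sketch on a small, made-up table. The `toy` frame and its values are assumptions for illustration only and are not taken from `rent_apts.csv`.

```python
import pandas as pd

# Toy data with made-up values (hypothetical, not from rent_apts.csv)
toy = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],    # text column
        "Bedrooms": [2, 3, 4],              # integer column
        "sq_mtrs": [80.0, 120.5, 200.0],    # float column
    }
)

# Heterogeneous: each column keeps its own dtype
print(toy.dtypes)

# Size-mutable: columns can be added after creation
toy["Bathrooms"] = [1, 2, 3]

# Labeled axes: select a value by row label and column name
print(toy.loc[1, "sq_mtrs"])
print(toy.shape)
```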


###  Some data prep:

In [None]:
# Drop the columns we don't need
del df['link']

In [None]:
# drop the rows with missing values
df.dropna(inplace=True)

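# Alternatively, you can fill missing values with the mean of the column. There are multiple techniques for handling missing data, and the right one depends on the type of data, the problem, the algorithm you are using, etc.
# numeric_only=True restricts the mean to numeric columns; text columns such as 'Price' have no mean.
# df.fillna(df.mean(numeric_only=True), inplace=True)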
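
The two options in the cell above (dropping rows versus filling with the column mean) can be compared on a small example. This is a sketch on made-up numbers, assuming a frame with numeric columns only; `toy`, `dropped`, and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries
toy = pd.DataFrame(
    {
        "sq_mtrs": [50.0, np.nan, 70.0, 90.0],
        "Bathrooms": [1.0, 2.0, np.nan, 3.0],
    }
)

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(toy), len(dropped), len(filled))
print(filled)
```

Dropping rows shrinks the dataset, while mean imputation keeps every row at the cost of inventing values; which trade-off is acceptable depends on the problem.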

In [None]:
# Encode the categorical data, i.e. the data that is not numerical. Numerical data is at times referred to as continuous data.
# Here we will encode the 'Neighborhood' and 'Agency' columns.

# First, count the unique values in each of the two columns
print(len(df['Neighborhood'].unique()))
print(len(df['Agency'].unique()))

598
180


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Encode the 'Neighborhood' & 'Agency' columns
df['Neighborhood'] = le.fit_transform(df['Neighborhood'])
df['Agency'] = le.fit_transform(df['Agency'])
df.head()

Unnamed: 0,Agency,Neighborhood,Price,sq_mtrs,Bedrooms,Bathrooms
0,18,145,"KSh 155,000",4.0,4.0,4.0
1,91,205,"KSh 100,000",300.0,3.0,4.0
2,3,159,"KSh 75,000",3.0,3.0,5.0
3,1,205,"KSh 135,000",227.0,3.0,4.0
6,107,375,"KSh 100,000",14.0,2.0,3.0
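
Note that calling `fit_transform` twice on the same `le` object overwrites the first mapping, so the neighborhood codes can no longer be decoded from it afterwards. A minimal sketch of one alternative, assuming you want to keep both mappings, is to use one encoder per column. The `toy` frame, the agency names, and the `encoders` dictionary below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Neighborhood": ["Kilimani", "Westlands", "Kilimani", "Lavington"],
        "Agency": ["Agency B", "Agency A", "Agency B", "Agency A"],
    }
)

# Keep a separate fitted encoder for each column
encoders = {}
for col in ["Neighborhood", "Agency"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)

# Codes are assigned in sorted order of the original labels
print(encoders["Neighborhood"].classes_)

# The stored encoder can map codes back to the original labels
print(encoders["Neighborhood"].inverse_transform([0, 1, 2]))
```

Keep in mind that these integer codes are arbitrary: a model such as linear regression or KNN will treat code 205 as "larger" than code 145, even though neighborhoods have no such order.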


In [None]:
# The price column looks numeric, but it is actually stored as strings such as 'KSh 155,000'. We need to clean the strings and convert them to integers.

# Remove the 'KSh' and ',' from the price column
df['Price'] = df['Price'].str.replace('KSh', '')            # Replace every 'KSh' with nothing, i.e. ''
df['Price'] = df['Price'].str.replace(',', '')              # Replace every ',' with nothing, i.e. ''


# Convert the price column from string to integer
df['Price'] = df['Price'].astype(int)

df.head()


Unnamed: 0,Agency,Neighborhood,Price,sq_mtrs,Bedrooms,Bathrooms
0,18,145,155000,4.0,4.0,4.0
1,91,205,100000,300.0,3.0,4.0
2,3,159,75000,3.0,3.0,5.0
3,1,205,135000,227.0,3.0,4.0
6,107,375,100000,14.0,2.0,3.0
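
The same cleaning can also be done in a single pass with a regular expression. This is a sketch of an alternative, not the approach used in the cell above; it assumes every price contains only a currency prefix, separators, and digits, and the `prices` series reuses three values from the table output for illustration.

```python
import pandas as pd

# Sample strings in the same format as the 'Price' column
prices = pd.Series(["KSh 155,000", "KSh 100,000", "KSh 75,000"])

# Remove every character that is not a digit, then convert to integers
cleaned = prices.str.replace(r"[^0-9]", "", regex=True).astype(int)

print(cleaned.tolist())
print(cleaned.dtype)
```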


In [None]:
# Specify the features and the target
X = df.drop('Price', axis=1)
y = df['Price']

In [None]:
# Now we split the data into training and testing sets then move to training the model and making predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Arguments
'''
X: The features
y: The target
test_size: The proportion of the dataset to include in the test split
random_state:   Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
                We pass 42 so that we can reproduce the results if we were to do this again / share our work with others.
'''

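
To see what `test_size` and `random_state` do in practice, here is a minimal sketch on a tiny made-up array. `X_demo` and `y_demo` are hypothetical stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 2 features each; row i is [2*i, 2*i + 1]
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same random_state
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# test_size=0.2 puts 2 of the 10 rows in the test set
print(X_tr.shape, X_te.shape)

# The same random_state selects the same rows both times
print(np.array_equal(y_te, y_te2))

# Features and targets stay aligned after shuffling
print(X_te[:, 0] // 2, y_te)
```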

In [None]:
# Instantiate a Linear Regression model

lr = LinearRegression()

# Fit the model to the training data
lr.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lr.predict(X_test)

# Evaluate the model
print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))

# Arguments
'''
y_true: The true values
y_pred: The predicted values
'''

# We'll discuss the metrics further in the class session. Classification and regression metrics are different.

Mean Squared Error:  1266157446.3613594


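
The mean squared error is expressed in squared units of the target (here, KSh squared), which makes it hard to read directly. As a sketch of how to interpret it, the example below computes the MSE, its square root (RMSE, in the same units as the target), and R² on made-up numbers, and checks them against the textbook formulas. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted rents
y_true = np.array([50_000.0, 75_000.0, 100_000.0, 135_000.0])
y_hat = np.array([60_000.0, 70_000.0, 110_000.0, 120_000.0])

mse = mean_squared_error(y_true, y_hat)   # average squared error
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_hat)              # 1.0 is a perfect fit

# The same quantities from their definitions
mse_manual = np.mean((y_true - y_hat) ** 2)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

An RMSE can be compared directly with typical rents, and an R² close to 0 means the model does little better than always predicting the mean.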

In [None]:
# The mean squared error is large: its square root (the RMSE) is roughly KSh 35,600, which is sizeable next to the rents shown earlier (KSh 50,000 to 155,000).
# The model is not performing well, so let's try a different algorithm: a KNN regressor.

knn = KNeighborsRegressor(n_neighbors=5)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Evaluate the model
print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))

Mean Squared Error:  1602097330.8228204
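
The list at the top also names Random Forest Regression and Support Vector Regression, which are imported but not fitted in the cells above. The sketch below shows the same fit / predict / evaluate pattern for both, on synthetic data from `make_regression` rather than the rental data, so the numbers say nothing about this dataset. The hyperparameters (`n_estimators=100`, `C=100.0`) and the scaling step in front of `SVR` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (not the rental data)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler here
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)              # fit on the training split
    pred = model.predict(X_te)         # predict on the held-out split
    results[name] = (mean_squared_error(y_te, pred), r2_score(y_te, pred))
    print(name, results[name])
```

To try this on the rental data, the synthetic arrays would be replaced with `X_train`, `X_test`, `y_train`, and `y_test` from the split above.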


------

__Things still don't seem to be working out: the KNN error is even higher than the linear regression error. What do you think we can do to improve the model?__
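
One idea worth testing, offered as a sketch rather than a guaranteed fix: KNN relies on distances, so a feature with a large numeric range can drown out the others. The example below uses synthetic data (not the rental data) in which one small-scale feature drives the target and one large-scale feature is pure noise, and compares KNN with and without standardization. All names and values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
informative = rng.uniform(0, 1, n)       # small-scale feature that drives the target
large_noise = rng.uniform(0, 1000, n)    # large-scale feature unrelated to the target
X_syn = np.column_stack([informative, large_noise])
y_syn = 100 * informative + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# KNN on raw features: distances are dominated by the large-scale column
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

# KNN after standardizing each feature to zero mean and unit variance
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_tr, y_tr)

mse_raw = mean_squared_error(y_te, knn_raw.predict(X_te))
mse_scaled = mean_squared_error(y_te, knn_scaled.predict(X_te))
print("raw:", mse_raw, "scaled:", mse_scaled)
```

Whether scaling helps on the rental data is an open question to check empirically; other directions include inspecting outliers in `sq_mtrs` and trying a different encoding for the categorical columns.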