# Baseline

This is the baseline of the project.

A Random Forest model is implemented. The notebook loads the dataset, removes outliers, applies a StandardScaler transformation to the numerical fields, and splits the data 70-30 into train and test sets. In the run recorded below, the model reaches an R2 of about 0.80 on the test set and a cross-validated R2 of about 0.73.

# Scope

The expected scope of this project is to implement the techniques and good practices needed to deploy the full functionality of this code through a REST API.

# Situation

This is an exercise taken from Kaggle. The objective is to predict the median value of owner-occupied homes in thousands of dollars [k\$] (**MEDV**, the dependent variable), given a series of independent variables covering structural, neighborhood, accessibility, and air pollution data for Boston in the 1970s.

To learn more about the dataset, see the Kaggle [link](https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data).

# Notebook

The code in this notebook was taken from several notebooks developed by other users on the Kaggle platform.

* [MAGANTI IT](https://www.kaggle.com/code/magantiit/linearregression)
* [SADIK AL JARIF](https://www.kaggle.com/code/sadikaljarif/boston-housing-price-prediction) 
* [MARCIN RUTECKI](https://www.kaggle.com/code/marcinrutecki/regression-models-evaluation-metrics)
* [UNMOVED](https://www.kaggle.com/code/unmoved/regress-boston-house-prices)

# Setup
Section to place all imports required to execute code.

In [157]:
#Imports
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import cross_val_score

from sklearn import metrics
from collections import Counter

import opendatasets as od
import os
from pathlib import Path
import shutil

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Constants
Section to place all CONSTANTS needed to execute code.

In [158]:
DATASETS_DIR = './datasets/'
KAGGLE_URL = "https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data" 
KAGGLE_LOCAL_DIR = KAGGLE_URL.split('/')[-1]
DATA_RETRIEVED = 'data.csv'

COLUMN_NAMES = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT','MEDV']
FEATURES = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
TARGET = 'MEDV'

SEED_SPLIT = 42

SELECTED_FEATURES = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Functions

Section to write down all functions to implement.

In [163]:
def retrieve_data():

    #Downloads dataset from kaggle with pre-defined structure (folder)
    od.download(KAGGLE_URL, force=True)

    #Finds the most recently downloaded file
    paths = sorted(Path(KAGGLE_LOCAL_DIR).iterdir(), key=os.path.getmtime)
    path_new_file = str(paths[-1])
    name_new_file = paths[-1].name   #File name, independent of the OS path separator

    #Checks that the download produced a file
    if os.path.isfile(path_new_file):
        print("Dataset downloaded: " + path_new_file)
    else:
        print("Something went wrong, dataset not downloaded!")

    #Moves the file to the datasets folder instead of the downloaded folder
    os.makedirs(DATASETS_DIR, exist_ok=True)   #Creates the target folder if it is missing
    if os.path.isfile(DATASETS_DIR + name_new_file):   #Removes a previous copy of the downloaded file
        os.remove(DATASETS_DIR + name_new_file)
    if os.path.isfile(DATASETS_DIR + DATA_RETRIEVED):       # or any old file named DATA_RETRIEVED
        os.remove(DATASETS_DIR + DATA_RETRIEVED)
    os.rename(path_new_file, DATASETS_DIR + DATA_RETRIEVED)
    print("And stored in: " + DATASETS_DIR + DATA_RETRIEVED)
    shutil.rmtree(KAGGLE_LOCAL_DIR)


#---------------
#Function call
retrieve_data()
#---------------

Downloading the-boston-houseprice-data.zip to .\the-boston-houseprice-data


100%|██████████| 12.3k/12.3k [00:00<00:00, 458kB/s]


Dataset downloaded: the-boston-houseprice-data\boston.csv
And stored in: ./datasets/data.csv
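
Extracting the file name from the downloaded path depends on the path separator of the operating system. The following is a minimal sketch, using the path printed in the output above, of how splitting on a backslash compares with `pathlib`'s `.name` attribute. The variable names are illustrative and not part of the original notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same downloaded file, written with POSIX and Windows separators
posix_path = PurePosixPath("the-boston-houseprice-data/boston.csv")
windows_path = PureWindowsPath("the-boston-houseprice-data\\boston.csv")

# Splitting on a backslash only isolates the file name for Windows-style paths
naive_posix = str(posix_path).split("\\")[-1]
naive_windows = str(windows_path).split("\\")[-1]

# .name returns the final path component for either separator style
print(naive_posix)        # the-boston-houseprice-data/boston.csv
print(naive_windows)      # boston.csv
print(posix_path.name)    # boston.csv
print(windows_path.name)  # boston.csv
```

On Linux or macOS the backslash split returns the whole relative path, so the later existence check on the file name would look in the wrong place.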





In [None]:
#Load dataset from the location where retrieve_data() stored it
raw_df = pd.read_csv(DATASETS_DIR + DATA_RETRIEVED, delimiter = ",")
print(raw_df.shape)
raw_df.head()

(506, 14)


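|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |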


In [None]:
#Check for nulls
raw_df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [None]:
def IQR_method (df,n,features):
    """
    Takes a dataframe and returns an index list corresponding to the observations 
    containing more than n outliers according to the Tukey IQR method.
    """
    outlier_list = []
    
    for column in features:
                
        # 1st quartile (25%)
        Q1 = np.percentile(df[column], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[column],75)
        
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determining a list of indices of outliers
        outlier_list_column = df[(df[column] < Q1 - outlier_step) | (df[column] > Q3 + outlier_step )].index
        
        # appending the list of outliers 
        outlier_list.extend(outlier_list_column)
        
    # selecting observations containing more than n outliers
    outlier_list = Counter(outlier_list)        
    multiple_outliers = list( k for k, v in outlier_list.items() if v > n )
    
    # Report the number of observations that will be dropped
    print('Total number of deleted outliers:', len(multiple_outliers))
    
    return multiple_outliers

In [None]:
Outliers_IQR = IQR_method(raw_df,1,COLUMN_NAMES)
# dropping outliers
df_clean = raw_df.drop(Outliers_IQR, axis = 0).reset_index(drop=True)

Total number of deleted outliers: 87


In [None]:
print(raw_df.shape)
print(df_clean.shape)

(506, 14)
(419, 14)
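
The Tukey IQR rule used for the outlier removal above can be illustrated on a small scale. The following is a minimal sketch in plain Python, not the notebook's implementation: `tukey_fences`, the two columns `a` and `b`, and their values are hypothetical. It flags values outside `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` per column, then keeps only the rows flagged in more than `n` columns.

```python
from collections import Counter
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    step = k * (q3 - q1)
    return q1 - step, q3 + step

# Hypothetical columns, not taken from the dataset
columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, -50, 90],
}

# Count, per row index, in how many columns the row is an outlier
flags = Counter()
for values in columns.values():
    lower, upper = tukey_fences(values)
    flags.update(i for i, v in enumerate(values) if v < lower or v > upper)

n = 1  # a row is dropped only if it is an outlier in more than n columns
rows_to_drop = [i for i, count in flags.items() if count > n]

print(tukey_fences(columns["a"]))  # (-3.5, 14.5)
print(dict(flags))                 # {9: 2, 8: 1}
print(rows_to_drop)                # [9]
```

Row 8 is an outlier in one column only, so with `n = 1` it is kept, while row 9 is flagged in both columns and dropped.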


In [None]:
# Define the independent variables (X) and the dependent variable (y)
X = df_clean[SELECTED_FEATURES]
y = df_clean[TARGET]

In [None]:
# Data partition: train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = SEED_SPLIT)
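
To make the 70-30 partition concrete, here is a minimal sketch of a seeded split in plain Python. It is an illustration only: `split_indices` is a hypothetical helper, it does not reproduce the exact shuffling of `train_test_split`, and it assumes the test size is rounded up. The 419 rows come from the cleaned dataframe shape shown earlier.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and cut them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)  # assumption: the test size is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=419, test_size=0.3, seed=42)

print(len(train_idx), len(test_idx))  # 293 126
```

Fixing the seed makes the partition reproducible, which is the role `SEED_SPLIT` plays in the cell above.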

In [None]:
# Function for scaling: fit the scaler on the training data only and reuse it
# for the test data, so test-set statistics do not leak into the transformation
def Standard_Scaler (df_train, df_test):
    scaler = StandardScaler().fit(df_train)
    
    return scaler.transform(df_train), scaler.transform(df_test)

In [None]:
X_train2, X_test2 = Standard_Scaler (X_train, X_test)
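
Standardization rescales each feature to zero mean and unit variance. The following is a minimal sketch in plain Python for a single feature, assuming the mean and the population standard deviation are estimated on the training values only and then reused for the test values. `fit_scaler`, `apply_scaler`, and the numbers are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Estimate the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize values with statistics estimated elsewhere."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0]  # hypothetical training values
test = [4.0, 8.0]        # hypothetical test values

mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)

print([round(v, 3) for v in train_scaled])  # [-1.225, 0.0, 1.225]
print([round(v, 3) for v in test_scaled])   # [0.0, 2.449]
```

Reusing the training statistics keeps both sets on the same scale; fitting a second scaler on the test set would map the test value 4.0 to a different number than the training value 4.0.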

In [None]:
# Creating and training model
RandomForest_reg = RandomForestRegressor(n_estimators = 10, random_state = 0)

In [None]:
# Note: the model is fitted on the unscaled X_train (X_train2 and X_test2 are not used here)
RandomForest_reg.fit(X_train, y_train)
# Model making a prediction on test data
y_pred = RandomForest_reg.predict(X_test)

In [None]:
def Reg_Models_Evaluation_Metrics (model,X_train,y_train,X_test,y_test,y_pred,verbose=False):
    # 10-fold cross-validation on the training set
    cv_score = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)
    
    # R-squared on the test set
    R2 = model.score(X_test, y_test)
    # Number of observations is the shape along axis 0
    n = X_test.shape[0]
    # Number of features (predictors, p) is the shape along axis 1
    p = X_test.shape[1]
    # Adjusted R-squared formula
    adjusted_r2 = 1-(1-R2)*(n-1)/(n-p-1)
    RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    CV_R2 = cv_score.mean()

    # Optional printed report, placed before the return so it can run
    if verbose:
        print('R2:', round(R2,4))
        print('Adjusted R2:', round(adjusted_r2, 4) )
        print("Cross Validated R2: ", round(CV_R2,4) )
        print('RMSE:', round(RMSE,4))

    return [[R2, adjusted_r2, CV_R2, RMSE]]
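
The cross-validated score in the function above is the mean of one score per fold. As a rough sketch of what `cv = 10` implies, the snippet below computes the fold sizes for 10 consecutive folds and averages a list of fold scores. It assumes 293 training rows (419 cleaned rows with 30% held out), and the fold scores are hypothetical numbers, not results from this notebook.

```python
def kfold_sizes(n_samples, n_splits):
    """Sizes of consecutive folds; the first folds absorb the remainder."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = kfold_sizes(n_samples=293, n_splits=10)
print(sizes)  # [30, 30, 30, 29, 29, 29, 29, 29, 29, 29]

# Hypothetical per-fold R2 scores, only to show the averaging step
fold_scores = [0.70, 0.75, 0.80, 0.65, 0.72, 0.78, 0.69, 0.74, 0.71, 0.76]
cv_r2 = sum(fold_scores) / len(fold_scores)
print(round(cv_r2, 2))  # 0.73
```

Each fold holds only about 29 or 30 rows, so individual fold scores can vary noticeably and the mean is the more stable figure.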

In [None]:
ndf = Reg_Models_Evaluation_Metrics(RandomForest_reg,X_train,y_train,X_test,y_test,y_pred)

rf_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rf_score.insert(0, 'Model', 'Random Forest')
rf_score

|   | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Random Forest | 0.801535 | 0.782385 | 0.725269 | 2.787419 |
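
The adjusted R2 reported above can be checked by hand from the plain R2. A small worked example, assuming the test set holds 126 rows (30% of the 419 cleaned rows, rounded up) and using the 11 entries of `SELECTED_FEATURES` as predictors:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.801535  # R2 Score from the table above
n = 126        # assumed number of test observations
p = 11         # number of selected features

print(round(adjusted_r2(r2, n, p), 4))  # 0.7824
```

The penalty term `(n - 1) / (n - p - 1)` grows with the number of predictors, which is why the adjusted value is slightly below the plain R2.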


In [None]:
new_data_pred = pd.DataFrame([[0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33]], columns = FEATURES)
new_data_pred = new_data_pred[SELECTED_FEATURES].copy()
RandomForest_reg.predict(new_data_pred)

array([35.75])

In [None]:
new_data_pred = pd.DataFrame([[0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03]], columns = FEATURES)
new_data_pred = new_data_pred[SELECTED_FEATURES].copy()
RandomForest_reg.predict(new_data_pred)

array([36.06])

In [None]:
new_data = X_test.head()
new_data = new_data[SELECTED_FEATURES].copy()
print(RandomForest_reg.predict(new_data))
print(y_test.head())

[25.68 20.35 34.38 15.83 13.11]
203    27.5
278    19.8
172    37.9
368    14.1
352    14.5
Name: MEDV, dtype: float64
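
For illustration, the five predictions above can be scored by hand against their true values. This is a minimal sketch in plain Python of the RMSE and R2 definitions; `rmse` and `r2_score` are written out here rather than imported. With only five rows the figures are not comparable to the full test-set metrics.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Values copied from the output above
y_true = [27.5, 19.8, 37.9, 14.1, 14.5]
y_pred = [25.68, 20.35, 34.38, 15.83, 13.11]

print(round(rmse(y_true, y_pred), 3))      # 2.046
print(round(r2_score(y_true, y_pred), 3))  # 0.948
```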
