# Dimensionality Reduction - Regression

Blog Link : https://medium.com/@muzammila784/dimensionality-reduction-does-approach-matter-58d5cd0915c3

#### Background

In machine learning problems, many features contribute to the final prediction. Some are strongly correlated with the target variable, while others are only weakly related or redundant. Having too many features increases computation time and can reduce accuracy, because some models do not perform well on high-dimensional data. This is where dimensionality reduction is used: it reduces the number of features while aiming to preserve model accuracy and cut computation time, which can make it easier for models to learn patterns than on the original high-dimensional dataset.

Each feature corresponds to one dimension of the data. To simplify the problem and make learning easier, dimensionality reduction techniques map high-dimensional features into a lower-dimensional space. These techniques are long established, and they remain widely used alongside newer approaches.

#### Dimensionality Reduction Techniques Used

In this project, we are using 15 different datasets and evaluating the performance of different machine learning models after doing dimensionality reduction with different approaches. The approaches we are using in this project are:


1- Principal Component Analysis (PCA):

PCA finds the orthogonal directions (principal components) along which the data varies the most and projects the data onto the first few of them, discarding the rest. Its objective is to retain as much of the variance of the original data as possible while reducing the number of dimensions.
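
As a minimal sketch of this idea, the snippet below projects a small synthetic matrix onto its first two principal components and reports how much variance they retain. The data and the names `X_demo`, `pca_demo` and `X_reduced` are assumptions made for illustration; they are not part of the project datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): 200 rows, 5 features that are
# driven by 2 hidden factors plus a small amount of noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Standardize first so that no feature dominates because of its scale
X_std = StandardScaler().fit_transform(X_demo)

# Keep only the first two principal components
pca_demo = PCA(n_components=2).fit(X_std)
X_reduced = pca_demo.transform(X_std)

print(X_reduced.shape)
# Fraction of the total variance retained by the two components
print(pca_demo.explained_variance_ratio_.sum())
```

Because the five features here are built from two hidden factors, two components should retain most of the variance; real datasets usually need more.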


2- t-SNE:

t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique. It computes pairwise similarities between instances in the high-dimensional space and in the low-dimensional space, and then minimizes the mismatch between the two sets of similarities using a cost function.
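
The two similarities and the cost function can be written out as follows. The notation is an assumption introduced here for illustration: $x_i$ are the original points, $y_i$ their low-dimensional counterparts, $N$ the number of points, and $\sigma_i$ a per-point bandwidth that the algorithm derives from the perplexity setting.

$$
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$

$$
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

The cost $C$ is minimized with respect to the low-dimensional points $y_i$, so the embedding is learned for a fixed set of rows rather than as a reusable mapping.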


3- Singular Value Decomposition (SVD):

The main distinction between this method and PCA is that the data matrix itself is factorized, without centering it first, whereas PCA works with the covariance matrix of the centered data. Because it skips centering, truncated SVD can be applied to sparse matrices efficiently, while PCA is the usual choice for dense data.
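
The relationship between the two methods can be checked numerically. The sketch below uses synthetic data (an assumption for illustration) and shows that truncated SVD applied to mean-centered data gives the same projection as PCA, up to the sign of each component, while truncated SVD on the raw data does not.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(1)

# Synthetic dense data with a non-zero mean (illustrative assumption)
X_demo = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6)) + 3.0

# PCA centers the data internally before factorizing
Z_pca = PCA(n_components=2, svd_solver="full").fit_transform(X_demo)

# TruncatedSVD factorizes whatever matrix it is given, without centering
svd_raw = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
Z_svd_raw = svd_raw.fit_transform(X_demo)

# Centering by hand first makes the two projections coincide
X_centered = X_demo - X_demo.mean(axis=0)
svd_centered = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
Z_svd_centered = svd_centered.fit_transform(X_centered)

# Components are only defined up to sign, so compare absolute values
same_when_centered = np.allclose(np.abs(Z_pca), np.abs(Z_svd_centered), atol=1e-6)
same_when_raw = np.allclose(np.abs(Z_pca), np.abs(Z_svd_raw), atol=1e-6)
print(same_when_centered, same_when_raw)
```

One consequence is that placing a `StandardScaler` in front of `TruncatedSVD` centers the data, so the result is expected to behave much like PCA with the same number of components.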

#### Datasets

We are using eight classification and seven regression datasets to check the performance before and after reducing features.

Regression

1- Russian Housing Prices Dataset

2- Product Sales Dataset

3- Superstore Sales Dataset

4- UCI Energy Dataset

5- Air Quality Dataset

6- Car Price Prediction Dataset

7- Bike Sharing Dataset

#### Pre-Processing

Our pre-processing pipeline consists of the following steps:

1- Check the data types of the columns and convert categorical columns into numerical columns using mapping or encoding techniques.

2- Drop columns that are redundant or unnecessary.

3- Check for null values and either fill them (with the mean or the previous value) or drop them.

4- Scale the data using standard scaling (zero mean, unit variance). Scaling matters before dimensionality reduction because PCA and SVD are sensitive to the scale of each feature.

5- Split the dataset into train and test sets. The baseline models use a 70–30% split, while the dimensionality reduction cells below use an 80–20% split.
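
The steps above can be sketched on a toy frame. Everything in this example is hypothetical (the column names, the values, and the variable names), and it makes one deliberate change compared with the notebook: the scaler is fitted on the training split only, which is a common way to keep test-set statistics out of the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical columns, for illustration only
toy = pd.DataFrame({
    "area": [30.0, 45.0, np.nan, 80.0, 52.0, 61.0, 38.0, 95.0, 70.0, 41.0],
    "near_water": ["yes", "no", "no", "yes", "no", "no", "yes", "no", "yes", "no"],
    "row_id": range(10),
    "price": [3.1, 4.0, 3.7, 8.2, 4.9, 5.5, 3.6, 9.0, 6.8, 3.9],
})

# Step 1: encode the categorical column as numbers
toy["near_water"] = toy["near_water"].map({"yes": 1, "no": 0})

# Step 2: drop a column that carries no predictive information
toy = toy.drop(columns=["row_id"])

# Step 3: fill missing values with the column mean
toy["area"] = toy["area"].fillna(toy["area"].mean())

X_toy = toy.drop(columns=["price"])
y_toy = toy["price"]

# Step 5 (done before scaling in this sketch): 70-30 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Step 4: standard scaling, with statistics learned from the training rows
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
```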

#### Machine Learning and Dimensionality Reduction Pipeline

The second pipeline applies machine learning models and dimensionality reduction techniques to each dataset. It contains the following steps:

1- Fit a set of regression models on the full-featured dataset, in the manner of Lazy Predict, and record the results.

2- Apply PCA, fit the same models on the reduced data, and record the results.

3- Apply SVD and fit the same models on the SVD-reduced data.

4- Repeat the process with t-SNE.
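
Since every step runs the same fit-and-score loop, it can be wrapped in one function. The sketch below is one possible shape for such a helper; `evaluate_models` is a hypothetical name, the data comes from `make_regression`, and only two models are used to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each (name, model) pair and return one row of metrics per model."""
    n, p = X_test.shape
    rows = []
    for name, model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        rows.append({
            "Model": name,
            "R-Squared": r2,
            "Adjusted R-Squared": 1 - (1 - r2) * (n - 1) / (n - p - 1),
            "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        })
    return pd.DataFrame(rows)


# Synthetic regression problem (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = [("LinearRegression", LinearRegression()), ("Ridge", Ridge())]

# Step 1: all features
baseline = evaluate_models(models, X_tr, X_te, y_tr, y_te)

# Step 2: the same models on PCA-reduced features
reducer = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=5))])
reducer.fit(X_tr)
reduced = evaluate_models(
    models, reducer.transform(X_tr), reducer.transform(X_te), y_tr, y_te
)

comparison = pd.concat([baseline, reduced], axis=1, keys=["Without DR", "PCA"])
print(comparison)
```

Using one helper also keeps the train/test split identical across the baseline and every reduction method, which makes the columns of the final table directly comparable.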

## Importing Libraries

In [100]:
# Importing Important Libraries
import time
import warnings
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from numpy import mean
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.manifold import TSNE
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## Dataset 1 : Russian Housing Dataset

In [183]:
data = pd.read_csv("E:/Russian_Housing_Dataset1.csv")
print(data.shape)
data.head()

(99999, 272)


Unnamed: 0,full_sq,life_sq,floor,product_type,sub_area,area_m,raion_popul,green_zone_part,indust_part,children_preschool,...,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000,price_doc
0,43.0,27.0,4.0,Investment,Bibirevo,6407578.1,155572.0,0.189727,7e-05,9576.0,...,9.0,4.0,0.0,13.0,22.0,1.0,0.0,52.0,4.0,5850000.0
1,34.0,19.0,3.0,Investment,Nagatinskij Zaton,9589336.912,115352.0,0.372602,0.049637,6880.0,...,15.0,3.0,0.0,15.0,29.0,1.0,10.0,66.0,14.0,6000000.0
2,43.0,29.0,2.0,Investment,Tekstil'shhiki,4808269.831,101708.0,0.11256,0.118537,5879.0,...,10.0,3.0,0.0,11.0,27.0,0.0,4.0,67.0,10.0,5700000.0
3,77.0,77.0,4.0,Investment,Basmannoe,8398460.622,108171.0,0.015234,0.037316,5706.0,...,319.0,108.0,17.0,135.0,236.0,2.0,91.0,195.0,14.0,16331452.0
4,67.0,46.0,14.0,Investment,Nizhegorodskoe,7506452.02,43795.0,0.00767,0.486246,2418.0,...,62.0,14.0,1.0,53.0,78.0,1.0,20.0,113.0,17.0,9100000.0


In [4]:
data.describe()

Unnamed: 0,full_sq,life_sq,floor,area_m,raion_popul,green_zone_part,indust_part,children_preschool,preschool_education_centers_raion,children_school,...,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000,price_doc
count,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0
mean,450.974601,596.342883,11.916919,25700840.0,103666.321411,0.238107,0.14501,6436.288113,5.059566,6650.794929,...,59.600206,21.534283,3.860331,26.9006,48.747126,0.547042,16.655833,70.699255,7.933014,14980950.0
std,1115.307106,1535.024315,14.761103,41710870.0,51981.445433,0.200461,0.138731,3816.536558,2.740557,3876.750082,...,93.928443,37.211003,7.478604,37.518277,60.046671,0.607492,26.389935,48.485232,5.048946,21813440.0
min,0.0,0.0,0.0,2081628.0,2546.0,0.001879,0.0,175.0,0.0,168.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100000.0
25%,40.0,23.082199,4.0,7128794.0,76284.0,0.067927,0.035145,4215.0,4.0,4439.0,...,7.0,1.386062,0.0,6.0,13.082211,0.0,1.0,39.056711,4.0,5469147.0
50%,52.516018,32.529567,7.0,9629358.0,101982.0,0.183969,0.101872,5797.0,5.0,5992.0,...,13.9794,3.0,0.0,10.0,24.0,0.230253,4.08928,61.443864,7.353733,7179015.0
75%,75.988433,53.0,12.0,16751120.0,130314.90535,0.372602,0.238617,7751.0,6.330909,7976.0,...,61.485717,17.0,2.80195,23.0,45.624406,1.0,13.805097,93.413009,12.0,11600280.0
max,5326.0,7478.0,76.99863,206071800.0,247469.0,0.852923,0.521867,19223.0,13.0,19083.0,...,377.0,147.0,30.0,151.0,250.0,2.0,106.0,218.0,20.999423,111111100.0


In [5]:
data.dtypes

full_sq               float64
life_sq               float64
floor                 float64
product_type           object
sub_area               object
                       ...   
mosque_count_5000     float64
leisure_count_5000    float64
sport_count_5000      float64
market_count_5000     float64
price_doc             float64
Length: 272, dtype: object

In [6]:
cols = data.columns
print(cols)

Index(['full_sq', 'life_sq', 'floor', 'product_type', 'sub_area', 'area_m',
       'raion_popul', 'green_zone_part', 'indust_part', 'children_preschool',
       ...
       'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000',
       'cafe_count_5000_price_high', 'big_church_count_5000',
       'church_count_5000', 'mosque_count_5000', 'leisure_count_5000',
       'sport_count_5000', 'market_count_5000', 'price_doc'],
      dtype='object', length=272)


In [78]:
num_cols = data.select_dtypes(include='number').columns  # public API for selecting numeric columns
print(num_cols)

Index(['full_sq', 'life_sq', 'floor', 'area_m', 'raion_popul',
       'green_zone_part', 'indust_part', 'children_preschool',
       'preschool_education_centers_raion', 'children_school',
       ...
       'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000',
       'cafe_count_5000_price_high', 'big_church_count_5000',
       'church_count_5000', 'mosque_count_5000', 'leisure_count_5000',
       'sport_count_5000', 'market_count_5000', 'price_doc'],
      dtype='object', length=257)


In [79]:
categorical_colums = list(set(cols) - set(num_cols))
categorical_colums

['water_1line',
 'railroad_terminal_raion',
 'railroad_1line',
 'sub_area',
 'ecology',
 'big_market_raion',
 'nuclear_reactor_raion',
 'incineration_raion',
 'culture_objects_top_25',
 'product_type',
 'detention_facility_raion',
 'thermal_power_plant_raion',
 'big_road1_1line',
 'radiation_raion',
 'oil_chemistry_raion']

In [184]:
def pre_process(data):
    # binary yes/no columns are encoded as 1/0
    yes_no_cols = [
        'radiation_raion', 'big_market_raion', 'oil_chemistry_raion',
        'culture_objects_top_25', 'detention_facility_raion', 'railroad_1line',
        'incineration_raion', 'nuclear_reactor_raion', 'thermal_power_plant_raion',
        'big_road1_1line', 'railroad_terminal_raion', 'water_1line',
    ]
    for col in yes_no_cols:
        # each column is mapped exactly once: mapping an already encoded
        # column a second time would turn every value into NaN
        data[col] = data[col].map({'yes': 1, 'no': 0})

    # ordinal encoding of the ecology rating
    data['ecology'] = data['ecology'].map({'poor': 0, 'satisfactory': 1, 'good': 2, 'excellent': 3})

    # binary encoding of the product type
    data['product_type'] = data['product_type'].map({'Investment': 1, 'OwnerOccupier': 0})

    # sub_area holds district names; it is dropped rather than encoded
    #data.drop('row ID',axis=1,inplace=True)
    data.drop('sub_area', axis=1, inplace=True)

    return data

In [185]:
data = pre_process(data)
data.shape

(99999, 271)

In [186]:
#Checking Null Values.
data.isnull().sum()

full_sq               0
life_sq               0
floor                 0
product_type          0
area_m                0
                     ..
mosque_count_5000     0
leisure_count_5000    0
sport_count_5000      0
market_count_5000     0
price_doc             0
Length: 271, dtype: int64

In [187]:
data['price_doc'] = data['price_doc'].fillna((data['price_doc'].mean()))

In [188]:
data = data.fillna(1)

In [189]:
cols = data.columns
print(cols)

Index(['full_sq', 'life_sq', 'floor', 'product_type', 'area_m', 'raion_popul',
       'green_zone_part', 'indust_part', 'children_preschool',
       'preschool_education_centers_raion',
       ...
       'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000',
       'cafe_count_5000_price_high', 'big_church_count_5000',
       'church_count_5000', 'mosque_count_5000', 'leisure_count_5000',
       'sport_count_5000', 'market_count_5000', 'price_doc'],
      dtype='object', length=271)


In [13]:
num_cols = data.select_dtypes(include='number').columns  # public API for selecting numeric columns
print(num_cols)

Index(['full_sq', 'life_sq', 'floor', 'product_type', 'area_m', 'raion_popul',
       'green_zone_part', 'indust_part', 'children_preschool',
       'preschool_education_centers_raion',
       ...
       'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000',
       'cafe_count_5000_price_high', 'big_church_count_5000',
       'church_count_5000', 'mosque_count_5000', 'leisure_count_5000',
       'sport_count_5000', 'market_count_5000', 'price_doc'],
      dtype='object', length=271)


In [14]:

categorical_colums = list(set(cols) - set(num_cols))
categorical_colums

[]

In [190]:
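# draw 10,000 rows without replacement so the same row cannot end up in both train and test
data = data.sample(n=10000, replace=False, random_state=42)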

In [191]:
X = data.loc[:, data.columns != 'price_doc']
y = data[['price_doc']]   #Target Feature

In [192]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [193]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [194]:
X_scaled.head()

Unnamed: 0,full_sq,life_sq,floor,product_type,area_m,raion_popul,green_zone_part,indust_part,children_preschool,preschool_education_centers_raion,...,cafe_count_5000_price_1500,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000
0,-0.380893,-0.381511,-0.539986,0.500156,-0.179843,-0.396384,-1.03888,2.026553,-0.499235,-0.029045,...,-0.501316,-0.510214,-0.505011,-0.517227,-0.47424,-0.372663,-0.91382,-0.373111,-0.076072,1.40233
1,-0.376954,-0.380316,-0.664775,0.500156,-0.367729,0.294363,0.483345,-0.963506,-0.344244,0.33642,...,-0.556062,-0.569111,-0.508713,-0.517227,-0.515001,-0.476563,-0.91382,-0.223378,-0.314592,0.992249
2,-0.347155,-0.372145,-0.225747,0.500156,-0.201631,0.292382,-0.525101,1.169734,0.242787,-1.125439,...,-0.452848,-0.417912,-0.427962,-0.490329,-0.583706,-0.522309,-0.261681,-0.479392,-0.458859,-0.63268
3,-0.373528,-0.379779,-0.46267,0.500156,-0.448111,-0.312041,-0.562584,2.237254,-0.520667,-0.39451,...,-0.479502,-0.567956,-0.480825,-0.517227,-0.409799,-0.588234,-0.91382,-0.638828,-0.639092,-0.991764
4,-0.368042,-0.372994,-0.668496,0.500156,-0.451521,0.147364,0.553802,-1.05209,0.154445,-0.029045,...,-0.461505,-0.380425,-0.190187,-0.517227,-0.488572,-0.489383,0.715291,-0.44903,-0.109371,-1.191272


In [195]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)

In [196]:
import xgboost as xgb
from xgboost import XGBRegressor

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model
rows = []
n, p = X_test.shape  # number of test rows and number of features

# fit each model on the training split and evaluate it on the test split
for name, model in models:
    model.fit(X_train, y_train.values.ravel())  # 1-D target avoids shape warnings
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results1 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared          RMSE
0           XGBRegressor   0.646748            0.611799  1.360233e+07
1  DecisionTreeRegressor   0.406403            0.347674  1.763264e+07
2  RandomForestRegressor   0.665236            0.632116  1.324161e+07
3       LinearRegression   0.533491            0.487336  1.563152e+07
4                  Ridge   0.533612            0.487468  1.562951e+07
5                  Lasso   0.532822            0.486601  1.564273e+07
6    KNeighborsRegressor   0.566105            0.523177  1.507522e+07
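
The Adjusted R-Squared column follows the formula used in the loop above, where $n$ is the number of test rows and $p$ is the number of features. As a worked check for the XGBRegressor row, assuming $n = 3000$ (30% of the 10,000 sampled rows) and $p = 270$ (271 columns minus the target):

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.646748)\cdot\frac{2999}{2729}
\approx 0.6118
$$

With 270 features the correction factor $\frac{n-1}{n-p-1}$ is about 1.10, which is why the adjusted value sits noticeably below the raw R-Squared here. After reduction to a handful of components the factor is close to 1 and the two columns nearly coincide.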


### Dimensionality Reduction Algorithms

#### Principal Component Analysis

In [197]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=13))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model
rows = []
n, p = X_test_pca.shape  # number of test rows and number of components

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train.values.ravel())  # 1-D target avoids shape warnings
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results2 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared          RMSE
0           XGBRegressor   0.606527            0.603951  1.323460e+07
1  DecisionTreeRegressor   0.256273            0.251405  1.819534e+07
2  RandomForestRegressor   0.634642            0.632250  1.275301e+07
3       LinearRegression   0.596057            0.593412  1.340953e+07
4                  Ridge   0.596057            0.593413  1.340953e+07
5                  Lasso   0.596057            0.593412  1.340953e+07
6    KNeighborsRegressor   0.558689            0.555801  1.401605e+07
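
The number of components is fixed by hand in the cell above. One alternative, sketched here on synthetic data, is to let scikit-learn choose the smallest number of components that reaches a target share of explained variance. The 95% threshold and the names `pca_95`, `kept` and `cumulative` are assumptions made for this example, not values taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data (illustrative assumption): 30 features driven by 4 factors
latent = rng.normal(size=(500, 4))
X_demo = latent @ rng.normal(size=(4, 30)) + 0.1 * rng.normal(size=(500, 30))
X_std = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA to keep just enough components to explain
# at least that fraction of the variance (requires svd_solver="full")
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_std)

kept = pca_95.n_components_
cumulative = np.cumsum(pca_95.explained_variance_ratio_)
print(kept, cumulative[-1])
```

Plotting `cumulative` against the component index is a common way to see where additional components stop adding much variance.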


#### TSNE

In [198]:
# define the pipeline steps for t-SNE
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=1))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TSNE has no transform() method, so the pipeline cannot be fitted on the
# training data and then applied to the test data. Each fit_transform call
# below learns a separate embedding, which means the train and test
# coordinates do not share a common space.
X_train_pca = pipeline_pca.fit_transform(X_train)
X_test_pca = pipeline_pca.fit_transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model
rows = []
n, p = X_test_pca.shape  # number of test rows and number of components

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train.values.ravel())  # 1-D target avoids shape warnings
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results3 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared          RMSE
0           XGBRegressor  -0.742819           -0.743691  2.785348e+07
1  DecisionTreeRegressor  -1.022268           -1.023280  3.000355e+07
2  RandomForestRegressor  -0.861165           -0.862097  2.878364e+07
3       LinearRegression   0.012109            0.011614  2.097046e+07
4                  Ridge   0.012109            0.011614  2.097046e+07
5                  Lasso   0.012109            0.011614  2.097046e+07
6    KNeighborsRegressor  -0.756226           -0.757105  2.796041e+07
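
The negative R-Squared values are consistent with the way the embedding is produced: scikit-learn's `TSNE` offers `fit_transform` but no `transform`, so the train and test rows in the cell above are embedded independently and their coordinates are not comparable. One possible workaround, sketched here on synthetic data, is to embed all rows together and split afterwards. This is an assumption about how the comparison could be set up, not the notebook's method, and it is transductive: the test features influence the embedding, so it is not a strict hold-out evaluation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=120, n_features=10, noise=1.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale with statistics from the training rows, then stack train and test
scaler = StandardScaler().fit(X_tr)
X_all = np.vstack([scaler.transform(X_tr), scaler.transform(X_te)])

# Embed every row in one call so that all coordinates share a single space
embedding = TSNE(n_components=1, perplexity=20, random_state=0).fit_transform(X_all)

# Split the joint embedding back into its train and test parts
X_tr_tsne = embedding[: len(X_tr)]
X_te_tsne = embedding[len(X_tr):]

print(X_tr_tsne.shape, X_te_tsne.shape)
print(hasattr(TSNE(), "transform"))
```

Even with a shared space, t-SNE is designed mainly for visualization, so a one-dimensional embedding may still discard much of the signal a regressor needs.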


#### Singular Value Decomposition

In [199]:
# define the pipeline steps for SVD
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model
rows = []
n, p = X_test_pca.shape  # number of test rows and number of components

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train.values.ravel())  # 1-D target avoids shape warnings
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results4 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared          RMSE
0           XGBRegressor   0.569733            0.569302  1.383956e+07
1  DecisionTreeRegressor   0.237207            0.236443  1.842709e+07
2  RandomForestRegressor   0.561672            0.561233  1.396860e+07
3       LinearRegression   0.583117            0.582700  1.362261e+07
4                  Ridge   0.583117            0.582700  1.362261e+07
5                  Lasso   0.583117            0.582700  1.362261e+07
6    KNeighborsRegressor   0.534368            0.533902  1.439708e+07


In [200]:
#Compiling Model Results:
print("Compiling Model Results")
# place the four result tables side by side under one top-level key per method
models_results = pd.concat(
    [results1, results2, results3, results4],
    axis=1,
    keys=['Without DR', 'PCA', 'TSNE', 'SVD'],
)
models_results

Compiling Model Results


Unnamed: 0_level_0,Without DR,Without DR,Without DR,Without DR,PCA,PCA,PCA,PCA,TSNE,TSNE,TSNE,TSNE,SVD,SVD,SVD,SVD
Unnamed: 0_level_1,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE
0,XGBRegressor,0.646748,0.611799,13602330.0,XGBRegressor,0.606527,0.603951,13234600.0,XGBRegressor,-0.742819,-0.743691,27853480.0,XGBRegressor,0.569733,0.569302,13839560.0
1,DecisionTreeRegressor,0.406403,0.347674,17632640.0,DecisionTreeRegressor,0.256273,0.251405,18195340.0,DecisionTreeRegressor,-1.022268,-1.02328,30003550.0,DecisionTreeRegressor,0.237207,0.236443,18427090.0
2,RandomForestRegressor,0.665236,0.632116,13241610.0,RandomForestRegressor,0.634642,0.63225,12753010.0,RandomForestRegressor,-0.861165,-0.862097,28783640.0,RandomForestRegressor,0.561672,0.561233,13968600.0
3,LinearRegression,0.533491,0.487336,15631520.0,LinearRegression,0.596057,0.593412,13409530.0,LinearRegression,0.012109,0.011614,20970460.0,LinearRegression,0.583117,0.5827,13622610.0
4,Ridge,0.533612,0.487468,15629510.0,Ridge,0.596057,0.593413,13409530.0,Ridge,0.012109,0.011614,20970460.0,Ridge,0.583117,0.5827,13622610.0
5,Lasso,0.532822,0.486601,15642730.0,Lasso,0.596057,0.593412,13409530.0,Lasso,0.012109,0.011614,20970460.0,Lasso,0.583117,0.5827,13622610.0
6,KNeighborsRegressor,0.566105,0.523177,15075220.0,KNeighborsRegressor,0.558689,0.555801,14016050.0,KNeighborsRegressor,-0.756226,-0.757105,27960410.0,KNeighborsRegressor,0.534368,0.533902,14397080.0
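
A table built with `pd.concat(..., keys=...)` has two-level column labels, which makes it straightforward to pull out a single metric across all methods. The sketch below uses two tiny hypothetical result frames (the numbers are made up for illustration and are not project results) to show the selection pattern.

```python
import pandas as pd

# Hypothetical result frames, standing in for the per-method results
res_a = pd.DataFrame({"Model": ["Ridge", "Lasso"], "RMSE": [2.6, 2.7]})
res_b = pd.DataFrame({"Model": ["Ridge", "Lasso"], "RMSE": [2.5, 2.8]})

combined = pd.concat([res_a, res_b], axis=1, keys=["Without DR", "PCA"])

# Select the RMSE column of every method (second level of the column labels)
rmse = combined.xs("RMSE", axis=1, level=1)
rmse.index = combined[("Without DR", "Model")]

# For each model, the method with the lowest RMSE
best_method = rmse.idxmin(axis=1)
print(rmse)
print(best_method)
```

The same pattern applies to `models_results` by swapping `"RMSE"` for any of the other metric names.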


## Dataset 2 : Product Sales Data

In [172]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/DOWNLOADS/Datasets/sales_data.csv")
data.shape  

(811, 107)

In [59]:
data.dtypes

Product_Code      object
W0                 int64
W1                 int64
W2                 int64
W3                 int64
                  ...   
Normalized 47    float64
Normalized 48    float64
Normalized 49    float64
Normalized 50    float64
Normalized 51    float64
Length: 107, dtype: object

In [62]:
data.isnull().sum()

Product_Code     0
W0               0
W1               0
W2               0
W3               0
                ..
Normalized 47    0
Normalized 48    0
Normalized 49    0
Normalized 50    0
Normalized 51    0
Length: 107, dtype: int64

In [173]:
data.drop('Product_Code',axis=1,inplace=True)

In [174]:
X = data.loc[:, data.columns != 'W0']
y = data[['W0']]

In [175]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [176]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [177]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)

In [178]:
import xgboost as xgb
from xgboost import XGBRegressor

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model
rows = []
n, p = X_test.shape  # number of test rows and number of features

# fit each model on the training split and evaluate it on the test split
for name, model in models:
    model.fit(X_train, y_train.values.ravel())  # 1-D target avoids shape warnings
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results1 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.982600            0.969361  1.554895
1  DecisionTreeRegressor   0.956966            0.924224  2.445303
2  RandomForestRegressor   0.979460            0.963832  1.689373
3       LinearRegression   0.948691            0.909651  2.670104
4                  Ridge   0.950692            0.913174  2.617518
5                  Lasso   0.949293            0.910712  2.654379
6    KNeighborsRegressor   0.920743            0.860438  3.318552


### Dimensionality Reduction Algorithms

#### Principal Component Analysis

In [179]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model
rows = []
n, p = X_test_pca.shape  # number of test rows and number of components

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train.values.ravel())  # 1-D target avoids shape warnings
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results2 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.914437            0.912822  3.519706
1  DecisionTreeRegressor   0.900000            0.898113  3.805066
2  RandomForestRegressor   0.933935            0.932689  3.092768
3       LinearRegression   0.951311            0.950392  2.655078
4                  Ridge   0.951307            0.950388  2.655192
5                  Lasso   0.947270            0.946275  2.763071
6    KNeighborsRegressor   0.948753            0.947786  2.723945


#### Singular Value Decomposition

In [180]:
# define the pipeline steps for SVD
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results3 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results3 = pd.concat([results3, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.924629            0.923687  3.303426
1  DecisionTreeRegressor   0.879746            0.878243  4.172654
2  RandomForestRegressor   0.928194            0.927296  3.224353
3       LinearRegression   0.950713            0.950097  2.671328
4                  Ridge   0.950709            0.950093  2.671449
5                  Lasso   0.947270            0.946611  2.763071
6    KNeighborsRegressor   0.929193            0.928308  3.201840


#### TSNE

In [181]:
# define the pipeline steps for tsne
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=1))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

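# embed the training data; TSNE has no transform() method, so a pipeline fitted
# on the training data cannot be applied to the test data
X_train_pca = pipeline_pca.fit_transform(X_train)
# NOTE: this fits a second, independent embedding on the test data, so the test
# coordinates are not in the space the models are trained on
X_test_pca = pipeline_pca.fit_transform(X_test)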

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results4 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results4 = pd.concat([results4, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared       RMSE
0           XGBRegressor  -0.730056           -0.740802  15.826773
1  DecisionTreeRegressor  -0.732838           -0.743601  15.839494
2  RandomForestRegressor  -0.735063           -0.745840  15.849658
3       LinearRegression   0.222643            0.217814  10.608957
4                  Ridge   0.222642            0.217814  10.608961
5                  Lasso   0.222004            0.217172  10.613313
6    KNeighborsRegressor  -0.752435           -0.763319  15.928806
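
scikit-learn's `TSNE` exposes `fit` and `fit_transform` but no `transform`, so the training and test sets above are embedded separately and land in unrelated coordinate systems. This may account for much of the negative R-squared in the table. One workaround, sketched here on synthetic data as an assumption rather than the notebook's method, is to embed all rows in a single call and split afterwards. It keeps the coordinates consistent but lets the test rows influence the embedding, so it suits exploration more than a strict held-out evaluation.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (120 rows, 5 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

# TSNE cannot project unseen rows: there is no transform() method.
has_transform = hasattr(TSNE(), "transform")
print("TSNE has transform():", has_transform)

# Embed every row in one call so all points share one coordinate system ...
X_std = StandardScaler().fit_transform(X)
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_std)

# ... then split the embedded rows together with the targets so they stay aligned.
emb_train, emb_test, y_train, y_test = train_test_split(
    embedding, y, test_size=0.2, random_state=42
)
print(emb_train.shape, emb_test.shape)
```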


In [182]:
#Compiling Model Results:
print("Compiling Model Results")
models_results = pd.concat([results1, results2, results3, results4],
                           axis=1, keys=['Without DR', 'PCA', 'SVD', 'TSNE'])
models_results

Compiling Model Results


| | Model | Without DR: R-Squared | Without DR: Adjusted R-Squared | Without DR: RMSE | PCA: R-Squared | PCA: Adjusted R-Squared | PCA: RMSE | SVD: R-Squared | SVD: Adjusted R-Squared | SVD: RMSE | TSNE: R-Squared | TSNE: Adjusted R-Squared | TSNE: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XGBRegressor | 0.9826 | 0.969361 | 1.554895 | 0.914437 | 0.912822 | 3.519706 | 0.924629 | 0.923687 | 3.303426 | -0.730056 | -0.740802 | 15.826773 |
| 1 | DecisionTreeRegressor | 0.956966 | 0.924224 | 2.445303 | 0.9 | 0.898113 | 3.805066 | 0.879746 | 0.878243 | 4.172654 | -0.732838 | -0.743601 | 15.839494 |
| 2 | RandomForestRegressor | 0.97946 | 0.963832 | 1.689373 | 0.933935 | 0.932689 | 3.092768 | 0.928194 | 0.927296 | 3.224353 | -0.735063 | -0.74584 | 15.849658 |
| 3 | LinearRegression | 0.948691 | 0.909651 | 2.670104 | 0.951311 | 0.950392 | 2.655078 | 0.950713 | 0.950097 | 2.671328 | 0.222643 | 0.217814 | 10.608957 |
| 4 | Ridge | 0.950692 | 0.913174 | 2.617518 | 0.951307 | 0.950388 | 2.655192 | 0.950709 | 0.950093 | 2.671449 | 0.222642 | 0.217814 | 10.608961 |
| 5 | Lasso | 0.949293 | 0.910712 | 2.654379 | 0.94727 | 0.946275 | 2.763071 | 0.94727 | 0.946611 | 2.763071 | 0.222004 | 0.217172 | 10.613313 |
| 6 | KNeighborsRegressor | 0.920743 | 0.860438 | 3.318552 | 0.948753 | 0.947786 | 2.723945 | 0.929193 | 0.928308 | 3.20184 | -0.752435 | -0.763319 | 15.928806 |
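
`pd.concat(..., axis=1, keys=[...])` places each result frame under its own top-level column label, which gives the compiled table two-level columns. A minimal sketch with made-up numbers (the frames `a` and `b` are hypothetical) shows how one metric can then be pulled out across methods:

```python
import pandas as pd

# Two tiny result frames with made-up numbers, indexed by model name.
a = pd.DataFrame({"R-Squared": [0.95, 0.91], "RMSE": [2.6, 3.5]},
                 index=["LinearRegression", "XGBRegressor"])
b = pd.DataFrame({"R-Squared": [0.94, 0.92], "RMSE": [2.7, 3.3]},
                 index=["LinearRegression", "XGBRegressor"])

# The keys become the outer level of a two-level column index.
combined = pd.concat([a, b], axis=1, keys=["Without DR", "PCA"])

# Cross-section on the inner level: one RMSE column per method.
rmse_by_method = combined.xs("RMSE", axis=1, level=1)
print(rmse_by_method)
```

Selecting by the inner level this way makes it easier to compare a single metric side by side than scanning the full sixteen-column table.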


## Dataset 3 : Superstore Sales Prediction

In [149]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/Superstore1.csv")
data.shape  

(9994, 21)

In [4]:
data.head()

| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 |
| 2 | 3 | CA-2016-138688 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.62 | 2 | 0.0 | 6.8714 |
| 3 | 4 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.031 |
| 4 | 5 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.368 | 2 | 0.2 | 2.5164 |


In [5]:
data['Category'].unique()

array(['Furniture', 'Office Supplies', 'Technology'], dtype=object)

In [150]:
data['Category'] = data['Category'].map({'Furniture': 1, 'Office Supplies': 2, 'Technology' :3})
data['Category'] = pd.to_numeric(data['Category'])

In [7]:
data['Segment'].unique()

array(['Consumer', 'Corporate', 'Home Office'], dtype=object)

In [151]:
data['Segment'] = data['Segment'].map({'Consumer': 1, 'Corporate': 2, 'Home Office' :3})
data['Segment'] = pd.to_numeric(data['Segment'])

In [9]:
data['Ship Mode'].unique()

array(['Second Class', 'Standard Class', 'First Class', 'Same Day'],
      dtype=object)

In [152]:
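# NOTE: 'Second Class' and 'Standard Class' are both mapped to 2, so the two
# modes become indistinguishable; the outputs below were produced with this mapping
data['Ship Mode'] = data['Ship Mode'].map({'Second Class': 2, 'Standard Class': 2, 'First Class' :3, 'Same Day' :4})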
data['Ship Mode'] = pd.to_numeric(data['Ship Mode'])

In [11]:
data['Sub-Category'].unique()

array(['Bookcases', 'Chairs', 'Labels', 'Tables', 'Storage',
       'Furnishings', 'Art', 'Phones', 'Binders', 'Appliances', 'Paper',
       'Accessories', 'Envelopes', 'Fasteners', 'Supplies', 'Machines',
       'Copiers'], dtype=object)

In [158]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data['Sub-Category'] = le.fit_transform(data['Sub-Category'].values)

In [159]:
data['Sub-Category'] = le.fit_transform(data['Sub-Category'])

In [155]:
data['Region'] = data['Region'].map({'South': 1, 'West': 2, 'Central' :3, 'East':4})
data['Region'] = pd.to_numeric(data['Region'])

In [156]:
le = preprocessing.LabelEncoder()
data['State'] = le.fit_transform(data['State'].values)

In [157]:
data['Sub-Category'] = le.fit_transform(data['Sub-Category'])
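
The integer codes produced by `map` and `LabelEncoder` above give nominal categories such as `Region` an arbitrary order and spacing, which scaling, PCA, and KNN then treat as real magnitudes. One common alternative, sketched here as an assumption and not what the notebook does, is one-hot encoding with `pd.get_dummies`. The small frame is made up for illustration.

```python
import pandas as pd

# Made-up rows using category values that appear in the Superstore data.
df = pd.DataFrame({
    "Region": ["South", "West", "Central", "East", "West"],
    "Quantity": [2, 3, 2, 5, 1],
})

# One indicator column per category; no order is implied between regions.
encoded = pd.get_dummies(df, columns=["Region"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix: a column such as `Sub-Category`, with 17 distinct values, would expand into 17 indicator columns.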

In [160]:
# drop identifier, date, and free-text columns that are not used as features
data.drop(columns=['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID',
                   'Customer Name', 'City', 'Product Name', 'Country', 'Product ID'],
          inplace=True)

In [161]:
data.dtypes

Ship Mode         int64
Segment           int64
State             int32
Postal Code       int64
Region            int64
Category          int64
Sub-Category      int64
Sales           float64
Quantity          int64
Discount        float64
Profit          float64
dtype: object

In [162]:
X = data.loc[:, data.columns != 'Sales']
y = data[['Sales']]

In [163]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [164]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [165]:
X_scaled.head()

| | Ship Mode | Segment | State | Postal Code | Region | Category | Sub-Category | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.477546 | -0.864161 | -0.473638 | -0.398302 | -1.546848 | -1.544978 | -0.710815 | -0.804303 | -0.756643 | 0.056593 |
| 1 | -0.477546 | -0.864161 | -0.473638 | -0.398302 | -1.546848 | -1.544978 | -0.512842 | -0.354865 | -0.756643 | 0.815054 |
| 2 | -0.477546 | 0.44717 | -1.24764 | 1.086817 | -0.603811 | 0.043552 | 0.477027 | -0.804303 | -0.756643 | -0.093002 |
| 3 | -0.477546 | -0.864161 | -0.925139 | -0.682407 | -1.546848 | -1.544978 | 1.664869 | 0.544012 | 1.423149 | -1.757484 |
| 4 | -0.477546 | -0.864161 | -0.925139 | -0.682407 | -1.546848 | 0.043552 | 1.268921 | -0.804303 | 0.212153 | -0.111593 |


In [166]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
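
Above, `StandardScaler` is fitted on all rows before the split, so the test rows contribute to the means and variances used for scaling. A minimal sketch of the leakage-free ordering, on synthetic data (all names and numbers here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y = X[:, 0] + rng.normal(size=200)

# Split first, so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

# Learn the scaling parameters from the training rows only ...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ... and reuse them unchanged on the test rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # ~0 by construction
print(X_test_scaled.mean(axis=0).round(3))   # near 0, but not forced to be 0
```

The `Pipeline` objects in the dimensionality-reduction cells already follow this pattern, since they are fitted on `X_train` only.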

In [167]:
import xgboost as xgb
from xgboost import XGBRegressor

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results1 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results1 = pd.concat([results1, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared        RMSE
0           XGBRegressor   0.470170            0.468397  418.198018
1  DecisionTreeRegressor   0.331401            0.329164  469.782072
2  RandomForestRegressor   0.719258            0.718318  304.416074
3       LinearRegression   0.448620            0.446775  426.617847
4                  Ridge   0.448588            0.446742  426.630426
5                  Lasso   0.448066            0.446219  426.832230
6    KNeighborsRegressor   0.697382            0.696369  316.053991


### Dimensionality Reduction Algorithms

#### Principal Component Analysis

In [168]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results2 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results2 = pd.concat([results2, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared        RMSE
0           XGBRegressor   0.361512            0.360551  614.129129
1  DecisionTreeRegressor   0.277908            0.276822  653.099778
2  RandomForestRegressor   0.393546            0.392634  598.524555
3       LinearRegression  -0.070324           -0.071934  795.135023
4                  Ridge  -0.070307           -0.071917  795.128659
5                  Lasso  -0.069506           -0.071114  794.830916
6    KNeighborsRegressor   0.411157            0.410272  589.770260
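
`n_components=3` above is fixed by hand. A hedged sketch of one way to check how much variance a given number of components retains, using `explained_variance_ratio_` on synthetic data (the data and the 95% threshold are assumptions for illustration, not values from the notebook):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))
X_std = StandardScaler().fit_transform(X)

# Fit PCA with all components and inspect the cumulative explained variance.
pca_full = PCA(svd_solver="full").fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))

# A float n_components asks PCA for the fewest components exceeding that share.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_std)
print("components kept for 95% variance:", pca_95.n_components_)
```

Retained variance describes the features only; it does not guarantee that the kept components are the ones that predict the target, so the regression metrics still need to be checked.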


#### Singular Value Decomposition

In [169]:
# define the pipeline steps for SVD
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results3 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results3 = pd.concat([results3, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared        RMSE
0           XGBRegressor   0.287626            0.286912  648.690129
1  DecisionTreeRegressor   0.235530            0.234764  671.991022
2  RandomForestRegressor   0.312665            0.311976  637.187934
3       LinearRegression  -0.059615           -0.060677  791.147068
4                  Ridge  -0.059601           -0.060663  791.141895
5                  Lasso  -0.059059           -0.060120  790.939526
6    KNeighborsRegressor   0.293375            0.292667  646.067284
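
Because `StandardScaler` centres every column, `TruncatedSVD` applied after it should recover the same directions as `PCA`, up to the sign of each component. That is consistent with the PCA and SVD results in this notebook tracking each other, with the remaining differences plausibly coming from the different `n_components` (3 versus 2). A sketch of the equivalence on synthetic data; the solver choices (`arpack`, `full`) are picked here for reproducibility and are not taken from the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features (200 rows, 6 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X_std = StandardScaler().fit_transform(X)  # column means are now ~0

# Same number of components for both methods.
pca_scores = PCA(n_components=2, svd_solver="full").fit_transform(X_std)
svd_scores = TruncatedSVD(
    n_components=2, algorithm="arpack", random_state=0
).fit_transform(X_std)

# Component signs are arbitrary, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(pca_scores), np.abs(svd_scores), atol=1e-6)
print("identical up to sign:", same_up_to_sign)
```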


#### T-SNE

In [170]:
# define the pipeline steps for t-SNE
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=1))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

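# embed the training data; TSNE has no transform() method, so a pipeline fitted
# on the training data cannot be applied to the test data
X_train_pca = pipeline_pca.fit_transform(X_train)
# NOTE: this fits a second, independent embedding on the test data, so the test
# coordinates are not in the space the models are trained on
X_test_pca = pipeline_pca.fit_transform(X_test)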

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results4 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results4 = pd.concat([results4, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared         RMSE
0           XGBRegressor  -0.387057           -0.387751   905.170276
1  DecisionTreeRegressor  -0.861852           -0.862784  1048.710718
2  RandomForestRegressor  -0.564169           -0.564953   961.225030
3       LinearRegression  -0.003902           -0.004405   770.067522
4                  Ridge  -0.003902           -0.004405   770.067522
5                  Lasso  -0.003900           -0.004403   770.066734
6    KNeighborsRegressor  -0.621758           -0.622570   978.759922


In [171]:
#Compiling Model Results:
print("Compiling Model Results")
models_results = pd.concat([results1, results2, results3, results4],
                           axis=1, keys=['Without DR', 'PCA', 'SVD', 'TSNE'])
models_results

Compiling Model Results


| | Model | Without DR: R-Squared | Without DR: Adjusted R-Squared | Without DR: RMSE | PCA: R-Squared | PCA: Adjusted R-Squared | PCA: RMSE | SVD: R-Squared | SVD: Adjusted R-Squared | SVD: RMSE | TSNE: R-Squared | TSNE: Adjusted R-Squared | TSNE: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XGBRegressor | 0.47017 | 0.468397 | 418.198018 | 0.361512 | 0.360551 | 614.129129 | 0.287626 | 0.286912 | 648.690129 | -0.387057 | -0.387751 | 905.170276 |
| 1 | DecisionTreeRegressor | 0.331401 | 0.329164 | 469.782072 | 0.277908 | 0.276822 | 653.099778 | 0.23553 | 0.234764 | 671.991022 | -0.861852 | -0.862784 | 1048.710718 |
| 2 | RandomForestRegressor | 0.719258 | 0.718318 | 304.416074 | 0.393546 | 0.392634 | 598.524555 | 0.312665 | 0.311976 | 637.187934 | -0.564169 | -0.564953 | 961.22503 |
| 3 | LinearRegression | 0.44862 | 0.446775 | 426.617847 | -0.070324 | -0.071934 | 795.135023 | -0.059615 | -0.060677 | 791.147068 | -0.003902 | -0.004405 | 770.067522 |
| 4 | Ridge | 0.448588 | 0.446742 | 426.630426 | -0.070307 | -0.071917 | 795.128659 | -0.059601 | -0.060663 | 791.141895 | -0.003902 | -0.004405 | 770.067522 |
| 5 | Lasso | 0.448066 | 0.446219 | 426.83223 | -0.069506 | -0.071114 | 794.830916 | -0.059059 | -0.06012 | 790.939526 | -0.0039 | -0.004403 | 770.066734 |
| 6 | KNeighborsRegressor | 0.697382 | 0.696369 | 316.053991 | 0.411157 | 0.410272 | 589.77026 | 0.293375 | 0.292667 | 646.067284 | -0.621758 | -0.62257 | 978.759922 |


## Dataset 4 : UCI Energy Efficiency Dataset

In [138]:
# Importing Dataset using Pandas.
data = pd.read_excel(r"E:/DOWNLOADS/EnergyEfficiencyUCI.xlsx")
data.shape  

(768, 10)

In [25]:
data.head()

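| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Y1 | Y2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 15.55 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 15.55 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 15.55 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 15.55 | 21.33 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 2 | 0.0 | 0 | 20.84 | 28.28 |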


In [26]:
data.dtypes

X1    float64
X2    float64
X3    float64
X4    float64
X5    float64
X6      int64
X7    float64
X8      int64
Y1    float64
Y2    float64
dtype: object

In [27]:
data.isnull().sum()

X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
X7    0
X8    0
Y1    0
Y2    0
dtype: int64

In [139]:
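# features: every column except the target Y2 (Y1, the dataset's other
# response variable, therefore stays in X)
X = data.loc[:, data.columns != 'Y2']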
y = data[['Y2']]

In [140]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [141]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [142]:
X_scaled.head()

| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Y1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | -1.341641 | -1.760447 | -1.814575 | -0.670115 |
| 1 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | -0.447214 | -1.760447 | -1.814575 | -0.670115 |
| 2 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | 0.447214 | -1.760447 | -1.814575 | -0.670115 |
| 3 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | 1.341641 | -1.760447 | -1.814575 | -0.670115 |
| 4 | 1.284979 | -1.229239 | 0.0 | -1.198678 | 1.0 | -1.341641 | -1.760447 | -1.814575 | -0.145503 |


In [143]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)

In [144]:
import xgboost as xgb
from xgboost import XGBRegressor

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results1 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results1 = pd.concat([results1, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.977693            0.976784  1.420021
1  DecisionTreeRegressor   0.961935            0.960385  1.854964
2  RandomForestRegressor   0.972845            0.971739  1.566752
3       LinearRegression   0.952816            0.950895  2.065240
4                  Ridge   0.952666            0.950738  2.068520
5                  Lasso   0.934338            0.931664  2.436298
6    KNeighborsRegressor   0.958464            0.956772  1.937706


### Dimensionality Reduction Algorithms

#### PCA

In [145]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results2 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results2 = pd.concat([results2, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.972020            0.971460  1.610146
1  DecisionTreeRegressor   0.963721            0.962995  1.833443
2  RandomForestRegressor   0.967306            0.966653  1.740487
3       LinearRegression   0.890483            0.888293  3.185520
4                  Ridge   0.890482            0.888292  3.185531
5                  Lasso   0.878349            0.875916  3.357361
6    KNeighborsRegressor   0.946726            0.945661  2.221761


#### SVD

In [146]:
# define the pipeline steps for svd
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results3 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results3 = pd.concat([results3, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.961356            0.960844  1.892259
1  DecisionTreeRegressor   0.923474            0.922460  2.662838
2  RandomForestRegressor   0.960353            0.959828  1.916658
3       LinearRegression   0.890499            0.889048  3.185291
4                  Ridge   0.890498            0.889048  3.185303
5                  Lasso   0.878349            0.876737  3.357361
6    KNeighborsRegressor   0.941836            0.941066  2.321490


#### TSNE

In [147]:
# define the pipeline steps for tsne
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

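# embed the training data; TSNE has no transform() method, so a pipeline fitted
# on the training data cannot be applied to the test data
X_train_pca = pipeline_pca.fit_transform(X_train)
# NOTE: this fits a second, independent embedding on the test data, so the test
# coordinates are not in the space the models are trained on
X_test_pca = pipeline_pca.fit_transform(X_test)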

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results4 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results4 = pd.concat([results4, pd.DataFrame([{'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse}])], ignore_index=True)

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared       RMSE
0           XGBRegressor  -0.134659           -0.149687  10.253504
1  DecisionTreeRegressor  -0.836049           -0.860367  13.043128
2  RandomForestRegressor  -0.178928           -0.194543  10.451612
3       LinearRegression   0.291007            0.281616   8.105144
4                  Ridge   0.291001            0.281610   8.105179
5                  Lasso   0.292209            0.282834   8.098269
6    KNeighborsRegressor  -1.288749           -1.319063  14.562593


In [148]:
#Compiling Model Results:
print("Compiling Model Results")
models_results = pd.concat([results1, results2, results3, results4],
                           axis=1, keys=['Without DR', 'PCA', 'SVD', 'TSNE'])
models_results

Compiling Model Results


| | Model | Without DR: R-Squared | Without DR: Adjusted R-Squared | Without DR: RMSE | PCA: R-Squared | PCA: Adjusted R-Squared | PCA: RMSE | SVD: R-Squared | SVD: Adjusted R-Squared | SVD: RMSE | TSNE: R-Squared | TSNE: Adjusted R-Squared | TSNE: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XGBRegressor | 0.977693 | 0.976784 | 1.420021 | 0.97202 | 0.97146 | 1.610146 | 0.961356 | 0.960844 | 1.892259 | -0.134659 | -0.149687 | 10.253504 |
| 1 | DecisionTreeRegressor | 0.961935 | 0.960385 | 1.854964 | 0.963721 | 0.962995 | 1.833443 | 0.923474 | 0.92246 | 2.662838 | -0.836049 | -0.860367 | 13.043128 |
| 2 | RandomForestRegressor | 0.972845 | 0.971739 | 1.566752 | 0.967306 | 0.966653 | 1.740487 | 0.960353 | 0.959828 | 1.916658 | -0.178928 | -0.194543 | 10.451612 |
| 3 | LinearRegression | 0.952816 | 0.950895 | 2.06524 | 0.890483 | 0.888293 | 3.18552 | 0.890499 | 0.889048 | 3.185291 | 0.291007 | 0.281616 | 8.105144 |
| 4 | Ridge | 0.952666 | 0.950738 | 2.06852 | 0.890482 | 0.888292 | 3.185531 | 0.890498 | 0.889048 | 3.185303 | 0.291001 | 0.28161 | 8.105179 |
| 5 | Lasso | 0.934338 | 0.931664 | 2.436298 | 0.878349 | 0.875916 | 3.357361 | 0.878349 | 0.876737 | 3.357361 | 0.292209 | 0.282834 | 8.098269 |
| 6 | KNeighborsRegressor | 0.958464 | 0.956772 | 1.937706 | 0.946726 | 0.945661 | 2.221761 | 0.941836 | 0.941066 | 2.32149 | -1.288749 | -1.319063 | 14.562593 |
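
Every dataset above repeats the same fit-and-score loop, and the baseline uses a different split (`test_size=0.3, random_state=2`) from the reduced runs (`test_size=0.2, random_state=42`), so the 'Without DR' column is not scored on the same test rows as the others. A sketch of a reusable helper that evaluates each configuration on one shared split follows. The names `evaluate_models` and `make_models` and the synthetic data are hypothetical, and rows are collected in a list instead of being appended to a DataFrame one at a time.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each (name, model) pair and return one row of metrics per model."""
    n, p = X_test.shape
    rows = []
    for name, model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        rows.append({
            "Model": name,
            "R-Squared": r2,
            "Adjusted R-Squared": 1 - (1 - r2) * (n - 1) / (n - p - 1),
            "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        })
    return pd.DataFrame(rows)


def make_models():
    """Fresh, unfitted models for each run (a shortened, illustrative list)."""
    return [("LinearRegression", LinearRegression()), ("Ridge", Ridge())]


# Synthetic regression data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=300)

# One split shared by the baseline and the reduced run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: scaling only, fitted on the training rows.
scaler = StandardScaler().fit(X_train)
baseline = evaluate_models(
    make_models(), scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
)

# Reduced run: scaling followed by PCA, fitted on the same training rows.
reducer = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=3))])
reducer.fit(X_train)
reduced = evaluate_models(
    make_models(), reducer.transform(X_train), reducer.transform(X_test), y_train, y_test
)

print(pd.concat([baseline, reduced], axis=1, keys=["Without DR", "PCA"]))
```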


## Dataset 5 : Air Quality Dataset

In [102]:
# Importing Dataset using Pandas.
data = pd.read_excel(r"E:/DOWNLOADS/AirQualityUCI.xlsx")
data.shape  

(9357, 15)

In [103]:
data.head()

| | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-03-10 | 18:00:00 | 2.6 | 1360.0 | 150 | 11.881723 | 1045.5 | 166.0 | 1056.25 | 113.0 | 1692.0 | 1267.5 | 13.6 | 48.875001 | 0.757754 |
| 1 | 2004-03-10 | 19:00:00 | 2.0 | 1292.25 | 112 | 9.397165 | 954.75 | 103.0 | 1173.75 | 92.0 | 1558.75 | 972.25 | 13.3 | 47.7 | 0.725487 |
| 2 | 2004-03-10 | 20:00:00 | 2.2 | 1402.0 | 88 | 8.997817 | 939.25 | 131.0 | 1140.0 | 114.0 | 1554.5 | 1074.0 | 11.9 | 53.975 | 0.750239 |
| 3 | 2004-03-10 | 21:00:00 | 2.2 | 1375.5 | 80 | 9.228796 | 948.25 | 172.0 | 1092.0 | 122.0 | 1583.75 | 1203.25 | 11.0 | 60.0 | 0.786713 |
| 4 | 2004-03-10 | 22:00:00 | 1.6 | 1272.25 | 51 | 6.518224 | 835.5 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.15 | 59.575001 | 0.788794 |


In [104]:
data.drop('Date',axis=1,inplace=True)
data.drop('Time',axis=1,inplace=True)

In [42]:
data.isnull().sum()

CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
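
The zero counts above can be misleading. The UCI description of the Air Quality dataset states that missing readings are tagged with the value -200 rather than left empty, so `isnull()` cannot see them. A minimal sketch of how the sentinel could be converted and counted follows; the small frame `sample` holds invented values for illustration and is not taken from the file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame using two of the dataset's column names;
# the values are invented for illustration only.
sample = pd.DataFrame({
    "CO(GT)": [2.6, -200.0, 2.2, 1.6],
    "NMHC(GT)": [150.0, 112.0, -200.0, -200.0],
})

# Before conversion, isnull() reports no missing values at all.
missing_before = sample.isnull().sum()

# Replace the -200 sentinel with NaN so pandas treats it as missing.
cleaned = sample.replace(-200, np.nan)
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```

Whether the affected rows are then dropped or imputed is a separate modelling decision that this sketch does not make.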

In [105]:
X = data.loc[:, data.columns != 'AH']
y = data[['AH']]

In [106]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [107]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [108]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
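
In the baseline cells the scaler is fitted on the full dataset before the split, so statistics from the test rows influence the scaling of the training rows. The PCA and SVD cells avoid this by fitting their pipeline on `X_train` only. A minimal sketch of the same pattern for the baseline, on synthetic data (the names `X_demo`, `y_demo` and `scaler_demo` are placeholders, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])            # synthetic target

# Split first, then learn the scaling parameters from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuses the training mean and std
```

Note also that the baseline uses a 70/30 split with `random_state=2`, while the reduction cells use an 80/20 split with `random_state=42`, so the columns of the compiled table are not measured on the same test rows.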

In [110]:
# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results1 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results1.loc[len(results1)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   1.000000            1.000000  0.022481
1  DecisionTreeRegressor   0.999999            0.999999  0.032677
2  RandomForestRegressor   1.000000            1.000000  0.016539
3       LinearRegression   0.999381            0.999378  1.007912
4                  Ridge   0.999375            0.999373  1.012335
5                  Lasso   0.994199            0.994174  3.084628
6    KNeighborsRegressor   0.999992            0.999992  0.114974


In [111]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results2 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results2.loc[len(results2)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared      RMSE
0           XGBRegressor   0.999965            0.999964  0.230330
1  DecisionTreeRegressor   0.999941            0.999941  0.297697
2  RandomForestRegressor   0.999968            0.999968  0.218837
3       LinearRegression   0.968536            0.968486  6.857637
4                  Ridge   0.968536            0.968486  6.857676
5                  Lasso   0.967295            0.967242  6.991649
6    KNeighborsRegressor   0.999967            0.999967  0.221890
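
The value `n_components=3` is fixed by hand in the cell above. One common way to choose it from the data is to look at the cumulative explained variance ratio. The sketch below does this on synthetic data; the 95% threshold and the names `X_demo`, `pca_full` and `n_keep` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))            # two hidden factors
mixing = rng.normal(size=(2, 6))
# Six correlated synthetic features plus a little noise.
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

X_std = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_std)  # keep every component to inspect the spectrum

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_keep)
```

Applied to the real training data, the same check would show how much variance three components retain for each dataset.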


In [112]:
# define the pipeline steps for svd
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results3 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results3.loc[len(results3)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared       RMSE
0           XGBRegressor   0.999945            0.999945   0.286562
1  DecisionTreeRegressor   0.999903            0.999902   0.381627
2  RandomForestRegressor   0.999942            0.999942   0.294001
3       LinearRegression   0.891021            0.890904  12.762682
4                  Ridge   0.891021            0.890904  12.762679
5                  Lasso   0.890578            0.890461  12.788597
6    KNeighborsRegressor   0.999943            0.999943   0.291537
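
Because `StandardScaler` centers every column, `TruncatedSVD` applied to the scaled data should project onto the same directions as PCA with the same number of components. The gap between the PCA and SVD results therefore most likely reflects three components versus two, not a different algorithm. A minimal check on synthetic data, using the exact `arpack` solver instead of the default randomized one (an assumption made here so the comparison is not affected by approximation error):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hidden factors of unequal strength mixed into five synthetic features.
latent = rng.normal(size=(300, 2)) * np.array([3.0, 1.0])
X_demo = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(300, 5))

X_std = StandardScaler().fit_transform(X_demo)  # columns now have zero mean

Z_pca = PCA(n_components=2, svd_solver="full").fit_transform(X_std)
Z_svd = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(X_std)

# Components are only defined up to sign, so compare absolute values.
same_projection = np.allclose(np.abs(Z_pca), np.abs(Z_svd), atol=1e-6)
print(same_projection)
```

A like-for-like comparison in the notebook would use the same `n_components` for both methods.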


In [113]:
# define the pipeline steps for tsne
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

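# TSNE has no transform() method, so it cannot project unseen rows into an
# existing embedding. Each fit_transform call below learns a separate embedding,
# so the train and test coordinates do not share a space; this likely explains
# the near-zero and negative R-squared values below.
X_train_pca = pipeline_pca.fit_transform(X_train)
X_test_pca = pipeline_pca.fit_transform(X_test)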

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results4 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results4.loc[len(results4)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared       RMSE
0           XGBRegressor  -0.032975           -0.034080  39.292997
1  DecisionTreeRegressor  -0.032897           -0.034002  39.291518
2  RandomForestRegressor  -0.032798           -0.033904  39.289646
3       LinearRegression   0.010105            0.009045  38.464931
4                  Ridge   0.010105            0.009045  38.464931
5                  Lasso   0.009884            0.008824  38.469225
6    KNeighborsRegressor  -0.032734           -0.033839  39.288411
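
The scores above are close to what a constant prediction would give. scikit-learn's `TSNE` offers `fit` and `fit_transform` but no `transform`, so the test rows are embedded independently of the training rows and the models are evaluated on coordinates they never learned from. One possible workaround, sketched here on synthetic data, is to embed all rows once and split afterwards. This is a transductive setup, not an out-of-sample projection, and the names `X_demo`, `embedding`, `Z_train` and `Z_test` are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 5))  # synthetic features

# TSNE exposes fit and fit_transform only; there is no transform for new rows.
has_transform = hasattr(TSNE(), "transform")

# Embed all rows once, then split the embedding by row index.
# The embedding sees the test features (never the target), so scores obtained
# this way should be read as optimistic.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_demo)
idx_train, idx_test = train_test_split(
    np.arange(len(X_demo)), test_size=0.2, random_state=42
)
Z_train, Z_test = embedding[idx_train], embedding[idx_test]
print(has_transform, Z_train.shape, Z_test.shape)
```

t-SNE is mainly a visualization technique, so even with a shared embedding it may not be a fair competitor to PCA or SVD as a preprocessing step.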


In [117]:
#Compiling Model Results:
print("Compiling Model Results")
models_results = pd.concat([results1,
                                results2,
                                results3,
                                results4,
                                ], axis = 1, keys =['Without DR', 'PCA', 'SVD', 'TSNE'])
models_results

Compiling Model Results


Unnamed: 0_level_0,Without DR,Without DR,Without DR,Without DR,PCA,PCA,PCA,PCA,SVD,SVD,SVD,SVD,TSNE,TSNE,TSNE,TSNE
Unnamed: 0_level_1,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE
0,XGBRegressor,1.0,1.0,0.022481,XGBRegressor,0.999965,0.999964,0.23033,XGBRegressor,0.999945,0.999945,0.286562,XGBRegressor,-0.032975,-0.03408,39.292997
1,DecisionTreeRegressor,0.999999,0.999999,0.032677,DecisionTreeRegressor,0.999941,0.999941,0.297697,DecisionTreeRegressor,0.999903,0.999902,0.381627,DecisionTreeRegressor,-0.032897,-0.034002,39.291518
2,RandomForestRegressor,1.0,1.0,0.016539,RandomForestRegressor,0.999968,0.999968,0.218837,RandomForestRegressor,0.999942,0.999942,0.294001,RandomForestRegressor,-0.032798,-0.033904,39.289646
3,LinearRegression,0.999381,0.999378,1.007912,LinearRegression,0.968536,0.968486,6.857637,LinearRegression,0.891021,0.890904,12.762682,LinearRegression,0.010105,0.009045,38.464931
4,Ridge,0.999375,0.999373,1.012335,Ridge,0.968536,0.968486,6.857676,Ridge,0.891021,0.890904,12.762679,Ridge,0.010105,0.009045,38.464931
5,Lasso,0.994199,0.994174,3.084628,Lasso,0.967295,0.967242,6.991649,Lasso,0.890578,0.890461,12.788597,Lasso,0.009884,0.008824,38.469225
6,KNeighborsRegressor,0.999992,0.999992,0.114974,KNeighborsRegressor,0.999967,0.999967,0.22189,KNeighborsRegressor,0.999943,0.999943,0.291537,KNeighborsRegressor,-0.032734,-0.033839,39.288411


## Dataset 6 : Car Price Prediction

In [118]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/DOWNLOADS/CarPrice_Assignment.csv")
data.shape  

(205, 26)

#### Data Preprocessing

In [53]:
data.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [54]:
data['CarName'].unique()

array(['alfa-romero giulia', 'alfa-romero stelvio',
       'alfa-romero Quadrifoglio', 'audi 100 ls', 'audi 100ls',
       'audi fox', 'audi 5000', 'audi 4000', 'audi 5000s (diesel)',
       'bmw 320i', 'bmw x1', 'bmw x3', 'bmw z4', 'bmw x4', 'bmw x5',
       'chevrolet impala', 'chevrolet monte carlo', 'chevrolet vega 2300',
       'dodge rampage', 'dodge challenger se', 'dodge d200',
       'dodge monaco (sw)', 'dodge colt hardtop', 'dodge colt (sw)',
       'dodge coronet custom', 'dodge dart custom',
       'dodge coronet custom (sw)', 'honda civic', 'honda civic cvcc',
       'honda accord cvcc', 'honda accord lx', 'honda civic 1500 gl',
       'honda accord', 'honda civic 1300', 'honda prelude',
       'honda civic (auto)', 'isuzu MU-X', 'isuzu D-Max ',
       'isuzu D-Max V-Cross', 'jaguar xj', 'jaguar xf', 'jaguar xk',
       'maxda rx3', 'maxda glc deluxe', 'mazda rx2 coupe', 'mazda rx-4',
       'mazda glc deluxe', 'mazda 626', 'mazda glc', 'mazda rx-7 gs',
       'mazda glc 

In [55]:
data['fueltype'].unique()

array(['gas', 'diesel'], dtype=object)

In [119]:
data['fueltype'] = data['fueltype'].map({'gas': 1, 'diesel': 0})
data['fueltype'] = pd.to_numeric(data['fueltype'])

In [57]:
data['aspiration'].unique()

array(['std', 'turbo'], dtype=object)

In [120]:
data['aspiration'] = data['aspiration'].map({'std': 1, 'turbo': 0})
data['aspiration'] = pd.to_numeric(data['aspiration'])

In [59]:
data['doornumber'].unique()

array(['two', 'four'], dtype=object)

In [121]:
data['doornumber'] = data['doornumber'].map({'two': 2, 'four': 4})
data['doornumber'] = pd.to_numeric(data['doornumber'])

In [61]:
data['carbody'].unique()

array(['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop'],
      dtype=object)

In [122]:
data['carbody'] = data['carbody'].map({'convertible': 1, 'hatchback': 2, 'sedan':3, 'wagon' :4,  'hardtop':5})
data['carbody'] = pd.to_numeric(data['carbody'])
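
Mapping body styles to the integers 1 to 5 imposes an order and a distance between categories (for example, "hardtop" ends up four units away from "convertible") that linear models and nearest-neighbour distances treat as meaningful. One-hot encoding is a common alternative. The sketch below uses a small hypothetical frame `cars` built from the category names listed above.

```python
import pandas as pd

# Hypothetical sample containing each body style once.
cars = pd.DataFrame(
    {"carbody": ["convertible", "hatchback", "sedan", "wagon", "hardtop"]}
)

# One indicator column per category, so no ordering is implied.
encoded = pd.get_dummies(cars, columns=["carbody"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which matters here because the dataset has only 205 rows.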

In [82]:
data['drivewheel'].unique()

array(['rwd', 'fwd', '4wd'], dtype=object)

In [123]:
data['drivewheel'] = data['drivewheel'].map({'rwd': 1, 'fwd': 2, '4wd':3})
data['drivewheel'] = pd.to_numeric(data['drivewheel'])

In [85]:
data['cylindernumber'].unique()

array(['four', 'six', 'five', 'three', 'twelve', 'two', 'eight'],
      dtype=object)

In [124]:
data['cylindernumber'] = data['cylindernumber'].map({'four': 4, 'five': 5, 'six':6, 'two': 2, 'three': 3, 'eight':8, 'twelve' : 12})
data['cylindernumber'] = pd.to_numeric(data['cylindernumber'])

In [125]:
data.drop('enginelocation',axis=1,inplace=True)
data.drop('fueltype',axis=1,inplace=True)
data.drop('CarName',axis=1,inplace=True)
data.drop('enginetype',axis=1,inplace=True)


In [126]:
data.drop('fuelsystem',axis=1,inplace=True)

In [127]:
data.dtypes

car_ID                int64
symboling             int64
aspiration            int64
doornumber            int64
carbody               int64
drivewheel            int64
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
cylindernumber        int64
enginesize            int64
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
price               float64
dtype: object

In [128]:
X = data.loc[:, data.columns != 'price']
y = data[['price']]

In [129]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [130]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [131]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)

In [133]:
# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results1 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results1.loc[len(results1)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared         RMSE
0           XGBRegressor   0.857720            0.788315  2581.474784
1  DecisionTreeRegressor   0.838797            0.760162  2747.780097
2  RandomForestRegressor   0.873692            0.812078  2432.267484
3       LinearRegression   0.635303            0.457402  4132.965268
4                  Ridge   0.735442            0.606389  3520.109729
5                  Lasso   0.640601            0.465285  4102.836025
6    KNeighborsRegressor   0.752441            0.631681  3405.139256
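
The gap between R-squared and adjusted R-squared is much larger here than for the other datasets because the test set is small relative to the number of features. The loop computes adjusted R-squared as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of test rows and p the number of features. The helper below (the name `adjusted_r2` is introduced for illustration) reproduces the XGBRegressor row.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared for n_samples rows and n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# XGBRegressor row above: 205 rows with test_size=0.3 leaves 62 test rows,
# and X has 20 feature columns (every column in data.dtypes except price).
value = adjusted_r2(0.857720, n_samples=62, n_features=20)
print(round(value, 6))  # 0.788315
```

With only 62 test rows and 20 features, the correction factor (n - 1)/(n - p - 1) is 61/41, roughly 1.49, which is why the adjusted score drops so sharply.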


### Dimensionality Reduction Algorithm

#### PCA

In [134]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results2 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results2.loc[len(results2)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared         RMSE
0           XGBRegressor   0.849044            0.836804  3452.110758
1  DecisionTreeRegressor   0.652842            0.624694  5235.076256
2  RandomForestRegressor   0.808483            0.792954  3888.332421
3       LinearRegression   0.735250            0.713784  4571.696952
4                  Ridge   0.735280            0.713816  4571.441798
5                  Lasso   0.735238            0.713771  4571.800750
6    KNeighborsRegressor   0.766070            0.747102  4297.370915


#### SVD

In [135]:
# define the pipeline steps for svd
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results3 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results3.loc[len(results3)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared         RMSE
0           XGBRegressor   0.782748            0.771314  4141.344097
1  DecisionTreeRegressor   0.610608            0.590113  5544.384174
2  RandomForestRegressor   0.834675            0.825974  3612.675691
3       LinearRegression   0.727258            0.712903  4640.185947
4                  Ridge   0.727320            0.712969  4639.657225
5                  Lasso   0.727268            0.712913  4640.106531
6    KNeighborsRegressor   0.724511            0.710011  4663.500820


#### TSNE

In [136]:
# define the pipeline steps for tsne
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

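# TSNE has no transform() method, so it cannot project unseen rows into an
# existing embedding. Each fit_transform call below learns a separate embedding,
# so the train and test coordinates do not share a space; this likely explains
# the strongly negative R-squared values below.
X_train_pca = pipeline_pca.fit_transform(X_train)
X_test_pca = pipeline_pca.fit_transform(X_test)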

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results4 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results4.loc[len(results4)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared          RMSE
0           XGBRegressor  -2.139087           -2.304302  15742.049156
1  DecisionTreeRegressor  -4.933112           -5.245381  21642.173662
2  RandomForestRegressor  -3.515244           -3.752889  18879.923891
3       LinearRegression -42.654534          -44.952141  58704.892957
4                  Ridge -42.641588          -44.938514  58696.187646
5                  Lasso -42.650950          -44.948368  58702.482860
6    KNeighborsRegressor  -2.039585           -2.199564  15490.548756


In [137]:
#Compiling Model Results:
print("Compiling Model Results")
models_results = pd.concat([results1,
                                results2,
                                results3,
                                results4,
                                ], axis = 1, keys =['Without DR', 'PCA', 'SVD', 'TSNE'])
models_results

Compiling Model Results


Unnamed: 0_level_0,Without DR,Without DR,Without DR,Without DR,PCA,PCA,PCA,PCA,SVD,SVD,SVD,SVD,TSNE,TSNE,TSNE,TSNE
Unnamed: 0_level_1,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE
0,XGBRegressor,0.85772,0.788315,2581.474784,XGBRegressor,0.849044,0.836804,3452.110758,XGBRegressor,0.782748,0.771314,4141.344097,XGBRegressor,-2.139087,-2.304302,15742.049156
1,DecisionTreeRegressor,0.838797,0.760162,2747.780097,DecisionTreeRegressor,0.652842,0.624694,5235.076256,DecisionTreeRegressor,0.610608,0.590113,5544.384174,DecisionTreeRegressor,-4.933112,-5.245381,21642.173662
2,RandomForestRegressor,0.873692,0.812078,2432.267484,RandomForestRegressor,0.808483,0.792954,3888.332421,RandomForestRegressor,0.834675,0.825974,3612.675691,RandomForestRegressor,-3.515244,-3.752889,18879.923891
3,LinearRegression,0.635303,0.457402,4132.965268,LinearRegression,0.73525,0.713784,4571.696952,LinearRegression,0.727258,0.712903,4640.185947,LinearRegression,-42.654534,-44.952141,58704.892957
4,Ridge,0.735442,0.606389,3520.109729,Ridge,0.73528,0.713816,4571.441798,Ridge,0.72732,0.712969,4639.657225,Ridge,-42.641588,-44.938514,58696.187646
5,Lasso,0.640601,0.465285,4102.836025,Lasso,0.735238,0.713771,4571.80075,Lasso,0.727268,0.712913,4640.106531,Lasso,-42.65095,-44.948368,58702.48286
6,KNeighborsRegressor,0.752441,0.631681,3405.139256,KNeighborsRegressor,0.76607,0.747102,4297.370915,KNeighborsRegressor,0.724511,0.710011,4663.50082,KNeighborsRegressor,-2.039585,-2.199564,15490.548756


## Dataset 7 : Bike Sharing

In [201]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/DOWNLOADS/bike_sharing.csv")
data.shape  

(17379, 15)

In [202]:
data.head()

Unnamed: 0,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.190098,3,13,16
1,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.190098,8,32,40
2,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.190098,5,27,32
3,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.190098,3,10,13
4,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.190098,0,1,1


In [203]:
data.dtypes

season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

In [204]:
X = data.loc[:, data.columns != 'cnt']
y = data[['cnt']]

In [205]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [206]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [207]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=2)

In [208]:
# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results1 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results1.loc[len(results1)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results1)

                   Model  R-Squared  Adjusted R-Squared          RMSE
0           XGBRegressor   0.999618            0.999617  3.585424e+00
1  DecisionTreeRegressor   0.999038            0.999035  5.690725e+00
2  RandomForestRegressor   0.999708            0.999708  3.133295e+00
3       LinearRegression   1.000000            1.000000  1.435636e-13
4                  Ridge   1.000000            1.000000  1.586650e-02
5                  Lasso   0.999960            0.999960  1.161440e+00
6    KNeighborsRegressor   0.959756            0.959647  3.680462e+01
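
A linear model with an RMSE near 1e-13 usually signals target leakage, not a genuinely perfect fit. In every row shown by `data.head()` above, `cnt` equals `casual + registered`, and both components are kept in `X`. The sketch below checks that relationship on those five rows and builds a feature set without the leaking columns; the frame `bikes` and the name `X_no_leak` are introduced for illustration.

```python
import pandas as pd

# First five rows of selected columns, copied from data.head() above.
bikes = pd.DataFrame({
    "hr":         [0, 1, 2, 3, 4],
    "temp":       [0.24, 0.22, 0.22, 0.24, 0.24],
    "casual":     [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt":        [16, 40, 32, 13, 1],
})

# If this holds for every row, the target is a plain sum of two features.
is_leaky = bool((bikes["casual"] + bikes["registered"]).eq(bikes["cnt"]).all())

# Keep only features that would be known before the rentals are counted.
X_no_leak = bikes.drop(columns=["casual", "registered", "cnt"])
print(is_leaky, X_no_leak.columns.tolist())
```

It also puts the drop after dimensionality reduction in context: the first few components no longer preserve the exact sum that made the baseline trivial.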


### Dimensionality Reduction Methods

#### PCA

In [209]:
# define the pipeline steps for PCA
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results2 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results2.loc[len(results2)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results2)

                   Model  R-Squared  Adjusted R-Squared        RMSE
0           XGBRegressor   0.733973            0.733743   91.781590
1  DecisionTreeRegressor   0.489787            0.489346  127.106666
2  RandomForestRegressor   0.731414            0.731182   92.221913
3       LinearRegression   0.653539            0.653240  104.741716
4                  Ridge   0.653539            0.653240  104.741729
5                  Lasso   0.653438            0.653139  104.756995
6    KNeighborsRegressor   0.716691            0.716446   94.715825


#### SVD

In [210]:
# define the pipeline steps for svd
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('SVD', TruncatedSVD(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the pipeline to the training data
pipeline_pca.fit(X_train)

# transform the data using the pipeline
X_train_pca = pipeline_pca.transform(X_train)
X_test_pca = pipeline_pca.transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# create a dataframe to store the results
results3 = pd.DataFrame(columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_pca.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    results3.loc[len(results3)] = [name, r2, adj_r2, rmse]  # DataFrame.append was removed in pandas 2.0

# display the results
print(results3)

                   Model  R-Squared  Adjusted R-Squared        RMSE
0           XGBRegressor   0.665553            0.665360  102.909705
1  DecisionTreeRegressor   0.348408            0.348033  143.641622
2  RandomForestRegressor   0.630471            0.630258  108.172479
3       LinearRegression   0.646488            0.646285  105.802173
4                  Ridge   0.646488            0.646284  105.802191
5                  Lasso   0.646400            0.646196  105.815360
6    KNeighborsRegressor   0.617089            0.616869  110.113672
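
The Adjusted R-Squared column comes from the formula used inside the loop, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of test rows and p is the number of components kept. As a minimal sketch, the helper below isolates that calculation. The name `adjusted_r2` and the sample numbers are illustrative assumptions and do not come from the notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared for n_samples observations and n_features predictors."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative numbers only: R-squared of 0.5 on 11 rows with 2 components
example = adjusted_r2(0.5, 11, 2)
print(example)  # 0.375
```

With only two components and a large test set, the penalty term (n - 1)/(n - p - 1) is very close to 1, which is why the two columns in the tables differ only in the fourth decimal place.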


### TSNE

In [211]:
# define the pipeline steps for t-SNE: standardize, then embed in 2 dimensions
pipeline_tsne = Pipeline([
    ('scaler', StandardScaler()),
    ('TSNE', TSNE(n_components=2))
])

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TSNE has no transform() method, so each split is embedded with its own
# fit_transform call (which also refits the scaler on that split). The train and
# test embeddings therefore do not share a coordinate system, which likely
# explains the poor scores below.
X_train_tsne = pipeline_tsne.fit_transform(X_train)
X_test_tsne = pipeline_tsne.fit_transform(X_test)

# create a list of tuples containing the name of the model and the model itself
models = [
    ('XGBRegressor', xgb.XGBRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('KNeighborsRegressor', KNeighborsRegressor())
]

# collect one row of metrics per model (DataFrame.append was removed in pandas 2.0)
rows = []

# fit each model to the transformed data and evaluate its performance
for name, model in models:
    model.fit(X_train_tsne, y_train)
    y_pred = model.predict(X_test_tsne)
    r2 = r2_score(y_test, y_pred)
    # adjusted R-squared with n = number of test rows and p = number of components
    adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-X_test_tsne.shape[1]-1)
    rmse = sqrt(mean_squared_error(y_test, y_pred))
    rows.append({'Model': name, 'R-Squared': r2, 'Adjusted R-Squared': adj_r2, 'RMSE': rmse})

# create a dataframe to store the results
results4 = pd.DataFrame(rows, columns=['Model', 'R-Squared', 'Adjusted R-Squared', 'RMSE'])

# display the results
print(results4)

                   Model  R-Squared  Adjusted R-Squared        RMSE
0           XGBRegressor  -0.105919           -0.106556  187.134697
1  DecisionTreeRegressor  -0.339054           -0.339825  205.916737
2  RandomForestRegressor  -0.230133           -0.230842  197.364335
3       LinearRegression   0.075255            0.074722  171.121128
4                  Ridge   0.075255            0.074722  171.121128
5                  Lasso   0.075231            0.074699  171.123294
6    KNeighborsRegressor  -0.377388           -0.378181  208.843365
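
The negative R-squared values are worth a closer look. scikit-learn's `TSNE` offers `fit_transform` but no `transform`, so a model trained on one embedding cannot be given new points in the same coordinates. One possible workaround, sketched below on synthetic data, is to embed the training and test features together and split the result afterwards. This is a transductive shortcut rather than the notebook's method: the test features (though not the targets) influence the embedding, and all names and sizes here are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train_demo = rng.normal(size=(40, 5))   # synthetic stand-in for X_train
X_test_demo = rng.normal(size=(10, 5))    # synthetic stand-in for X_test

# TSNE cannot project unseen rows: the class defines no transform() method
has_transform = hasattr(TSNE, "transform")

# scale with statistics from the training split, then embed both splits jointly
scaler = StandardScaler().fit(X_train_demo)
X_all = np.vstack([scaler.transform(X_train_demo), scaler.transform(X_test_demo)])
embedding = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X_all)

# split the joint embedding back into train and test parts
train_emb = embedding[:len(X_train_demo)]
test_emb = embedding[len(X_train_demo):]
print(has_transform, train_emb.shape, test_emb.shape)
```

Because both parts come from a single `fit_transform` call, they live in the same 2-D space, so a regressor fitted on `train_emb` can at least be evaluated meaningfully on `test_emb`.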


In [212]:
# Compiling model results. The keys must follow the order of the frames:
# results3 holds the SVD scores and results4 holds the TSNE scores.
print("Compiling Model Results")
models_results = pd.concat([results1,
                                results2,
                                results3,
                                results4,
                                ], axis = 1, keys =['Without DR', 'PCA', 'SVD', 'TSNE'])
models_results

Compiling Model Results


,Without DR,Without DR,Without DR,Without DR,PCA,PCA,PCA,PCA,SVD,SVD,SVD,SVD,TSNE,TSNE,TSNE,TSNE
,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE,Model,R-Squared,Adjusted R-Squared,RMSE
0,XGBRegressor,0.999618,0.999617,3.585424,XGBRegressor,0.733973,0.733743,91.78159,XGBRegressor,0.665553,0.66536,102.909705,XGBRegressor,-0.105919,-0.106556,187.134697
1,DecisionTreeRegressor,0.999038,0.999035,5.690725,DecisionTreeRegressor,0.489787,0.489346,127.106666,DecisionTreeRegressor,0.348408,0.348033,143.641622,DecisionTreeRegressor,-0.339054,-0.339825,205.916737
2,RandomForestRegressor,0.999708,0.999708,3.133295,RandomForestRegressor,0.731414,0.731182,92.221913,RandomForestRegressor,0.630471,0.630258,108.172479,RandomForestRegressor,-0.230133,-0.230842,197.364335
3,LinearRegression,1.0,1.0,1.435636e-13,LinearRegression,0.653539,0.65324,104.741716,LinearRegression,0.646488,0.646285,105.802173,LinearRegression,0.075255,0.074722,171.121128
4,Ridge,1.0,1.0,0.0158665,Ridge,0.653539,0.65324,104.741729,Ridge,0.646488,0.646284,105.802191,Ridge,0.075255,0.074722,171.121128
5,Lasso,0.99996,0.99996,1.16144,Lasso,0.653438,0.653139,104.756995,Lasso,0.6464,0.646196,105.81536,Lasso,0.075231,0.074699,171.123294
6,KNeighborsRegressor,0.959756,0.959647,36.80462,KNeighborsRegressor,0.716691,0.716446,94.715825,KNeighborsRegressor,0.617089,0.616869,110.113672,KNeighborsRegressor,-0.377388,-0.378181,208.843365
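
`pd.concat` pairs the `keys` list with the frames purely by position, so a label list written in a different order from the frames silently mislabels the columns. One way to make the pairing explicit, sketched here with the two XGBRegressor R-squared values from the SVD and TSNE cells, is to pass a dict that maps each label to its frame. The frame names `svd_demo` and `tsne_demo` are hypothetical stand-ins for `results3` and `results4`.

```python
import pandas as pd

# one-row stand-ins using the XGBRegressor scores printed in the SVD and TSNE cells
svd_demo = pd.DataFrame({"Model": ["XGBRegressor"], "R-Squared": [0.665553]})
tsne_demo = pd.DataFrame({"Model": ["XGBRegressor"], "R-Squared": [-0.105919]})

# a dict ties every label to its own frame, so the two cannot drift apart
named_results = {"SVD": svd_demo, "TSNE": tsne_demo}
combined = pd.concat(named_results, axis=1)

svd_score = combined[("SVD", "R-Squared")].iloc[0]
tsne_score = combined[("TSNE", "R-Squared")].iloc[0]
print(svd_score, tsne_score)
```

The resulting columns form the same two-level header as the table above, with the technique name on the first level and the metric on the second.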


## Conclusion

For Regression

We implemented three different techniques on the regression datasets: PCA, TSNE, and SVD. PCA performed best, most likely because it is designed to capture as much of the dataset's variance as possible while leaving the actual variation in the data undisturbed. The other two techniques are not built around that goal in the same way: TSNE prioritizes local neighborhood structure, and TruncatedSVD does not center the data on its own.
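
To make "capturing variance" concrete, the sketch below computes principal components by hand with NumPy on synthetic data: center the features, take a singular value decomposition, and read the share of variance carried by each component. The data, the variable names, and the choice of two components are illustrative assumptions and are not taken from the project datasets.

```python
import numpy as np

rng = np.random.RandomState(0)
# synthetic data: 200 rows, 3 features with very different spreads
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])

# PCA works on centered data; the rows of `components` are the principal axes
X_centered = X_demo - X_demo.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

# variance along each axis, and its share of the total variance
explained_variance = singular_values ** 2 / (len(X_demo) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# keep the two directions that carry the most variance
X_reduced = X_centered @ components[:2].T
print(explained_ratio)
```

The ratios come out in descending order, so truncating to the first few components discards the directions that contribute least to the overall variation.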

In some cases PCA's performance fell short of expectations, but most of the time it performed better than the others, so it is our winning technique for regression.

Additionally, we believe PCA copes with outliers better than the other two techniques here, which matters because regression has a continuous output and is more likely to contain noise than classification datasets.