## Problem Statement
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across products in different categories. They have shared a purchase summary for various customers for selected high-volume products from the last month. The data set also contains customer demographics (age, gender, marital status, city type, years of stay in the current city), product details (product ID and product category), and the total purchase amount from the last month.

They now want to build a model that predicts the purchase amount of a customer for a given product, which will help them create personalized offers for customers across different products.

# Loading the Required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os

In [2]:
pd.options.display.max_columns=100
pd.options.display.max_rows=100

In [3]:
os.getcwd()

'C:\\Users\\Admin\\3D Objects\\imarticus PGDA\\Machine learning\\NEW DATASET\\Class Hackathon\\Black Sales'

## Loading the Data

In [4]:
df_train=pd.read_csv('train.csv')
df_train.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [5]:
## Check the train data shape
df_train.shape


(550068, 12)

* The train DataFrame contains 550068 rows and 12 columns

In [6]:
## Import the test data set
df_test=pd.read_csv('test.csv')
df_test.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0,
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0,
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0,12.0


In [7]:
## Check the test data shape
df_test.shape

(233599, 11)

* The test DataFrame contains 233599 rows and 11 columns; it has no Purchase column

In [8]:
print('Train Data Set:',df_train.shape)
print('Test Data Set:',df_test.shape)


Train Data Set: (550068, 12)
Test Data Set: (233599, 11)


## Combine the Train and Test Data

In [9]:
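# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows and keeps each frame's original index
df=pd.concat([df_train,df_test])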

In [10]:
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370.0
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422.0
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057.0
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969.0
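Combining the two files makes it easy to clean and encode them in one pass, but the rows have to be separated again before modelling. The sketch below is one possible way to keep track of the origin of each row, using the `keys` argument of `pd.concat`. The toy frames and the names `train`, `test`, `combined`, `train_back` and `test_back` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values).
train = pd.DataFrame({"User_ID": [1, 2], "Purchase": [8370, 15200]})
test = pd.DataFrame({"User_ID": [3]})

# keys= adds an outer index level that records which file each row came from.
combined = pd.concat([train, test], keys=["train", "test"])

# After cleaning, the two parts can be recovered by that outer label.
train_back = combined.loc["train"]
test_back = combined.loc["test"].drop(columns="Purchase")

print(combined)
```

The notebook itself separates the rows later by checking where `Purchase` is null, which works equally well as long as the training target has no missing values.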


# Read and understand the data

In [11]:
df.shape

(783667, 12)

In [12]:
df.dtypes

User_ID                         int64
Product_ID                     object
Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years     object
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                      float64
dtype: object

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 783667 entries, 0 to 233598
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     783667 non-null  int64  
 1   Product_ID                  783667 non-null  object 
 2   Gender                      783667 non-null  object 
 3   Age                         783667 non-null  object 
 4   Occupation                  783667 non-null  int64  
 5   City_Category               783667 non-null  object 
 6   Stay_In_Current_City_Years  783667 non-null  object 
 7   Marital_Status              783667 non-null  int64  
 8   Product_Category_1          783667 non-null  int64  
 9   Product_Category_2          537685 non-null  float64
 10  Product_Category_3          237858 non-null  float64
 11  Purchase                    550068 non-null  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 77.7+ MB


In [14]:
df.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            245982
Product_Category_3            545809
Purchase                      233599
dtype: int64

* Product_Category_2 and Product_Category_3 have missing values. Purchase also has nulls, but only in the test rows, where the target is unknown.
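Raw null counts are easier to judge as a share of all rows. From the counts above, roughly 31% of Product_Category_2 and 70% of Product_Category_3 are missing (245982 and 545809 out of 783667). A minimal sketch of that calculation on a toy frame follows; the frame `toy` and the name `missing_pct` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the product category columns.
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
})

# isnull() gives booleans; their mean is the fraction of missing rows.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```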

In [15]:
## Check the statistical summary
df.describe()

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
count,783667.0,783667.0,783667.0,783667.0,537685.0,237858.0,550068.0
mean,1003029.0,8.0793,0.409777,5.366196,9.844506,12.668605,9263.968713
std,1727.267,6.522206,0.491793,3.87816,5.089093,4.12551,5023.065394
min,1000001.0,0.0,0.0,1.0,2.0,3.0,12.0
25%,1001519.0,2.0,0.0,1.0,5.0,9.0,5823.0
50%,1003075.0,7.0,0.0,5.0,9.0,14.0,8047.0
75%,1004478.0,14.0,1.0,8.0,15.0,16.0,12054.0
max,1006040.0,20.0,1.0,20.0,18.0,18.0,23961.0


1. **Occupation** and **Marital_Status** have a minimum value of zero.
2. The `count` of Product_Category_2 and Product_Category_3 is lower than the total row count, which confirms the missing values found above.

In [16]:
df.describe(include='O')

Unnamed: 0,Product_ID,Gender,Age,City_Category,Stay_In_Current_City_Years
count,783667,783667,783667,783667,783667
unique,3677,2,7,3,5
top,P00265242,M,26-35,B,1
freq,2709,590031,313015,329739,276425


* The categorical summary shows, for each variable, the number of non-null values, the number of unique values, the most frequent value, and its frequency

In [17]:
#categorical variable
df_cat=df.select_dtypes(include='O')
df_cat.columns

Index(['Product_ID', 'Gender', 'Age', 'City_Category',
       'Stay_In_Current_City_Years'],
      dtype='object')

In [18]:
## Print the frequency of each categorical variable
for col in df_cat:
    print('The frequency of categorical variable:',col)
    print(df[col].value_counts())
    print(" ")

The frequency of categorical variable: Product_ID
P00265242    2709
P00025442    2310
P00110742    2292
P00112142    2279
P00046742    2084
             ... 
P00185942       1
P00104342       1
P00074742       1
P00081342       1
P00253842       1
Name: Product_ID, Length: 3677, dtype: int64
 
The frequency of categorical variable: Gender
M    590031
F    193636
Name: Gender, dtype: int64
 
The frequency of categorical variable: Age
26-35    313015
36-45    156724
18-25    141953
46-50     65278
51-55     54784
55+       30579
0-17      21334
Name: Age, dtype: int64
 
The frequency of categorical variable: City_Category
B    329739
C    243684
A    210244
Name: City_Category, dtype: int64
 
The frequency of categorical variable: Stay_In_Current_City_Years
1     276425
2     145427
3     135428
4+    120671
0     105716
Name: Stay_In_Current_City_Years, dtype: int64
 


# Missing Value Treatment

In [19]:
df.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            245982
Product_Category_3            545809
Purchase                      233599
dtype: int64

#### Product_Category_2, Product_Category_3 and Purchase have missing values

In [20]:
# User_ID is an identifier rather than a predictive feature; it is kept for now because the imputation below groups by it
#df.drop(['User_ID'], axis=1 ,inplace=True)

In [21]:
# Product_ID is also an identifier; it is excluded from the features when the model inputs are built
#df.drop(['Product_ID'], axis=1 ,inplace=True)

In [22]:
df.head(10)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370.0
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422.0
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057.0
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969.0
5,1000003,P00193542,M,26-35,15,A,3,0,1,2.0,,15227.0
6,1000004,P00184942,M,46-50,7,B,2,1,1,8.0,17.0,19215.0
7,1000004,P00346142,M,46-50,7,B,2,1,1,15.0,,15854.0
8,1000004,P0097242,M,46-50,7,B,2,1,1,16.0,,15686.0
9,1000005,P00274942,M,26-35,20,A,1,1,8,,,7871.0


In [23]:
df['Product_Category_2'].unique()

array([nan,  6., 14.,  2.,  8., 15., 16., 11.,  5.,  3.,  4., 12.,  9.,
       10., 17., 13.,  7., 18.])

#### Product_Category_2 is a discrete variable with many repeated values, so missing entries are filled with the mode (here, the most frequent value for each user)

In [24]:
from scipy.stats import mode
p2_mode=df.pivot_table(values='Product_Category_2',index=['User_ID'],aggfunc=lambda x:mode(x).mode[0])
p2_mode

User_ID,Product_Category_2
1000001,4.0
1000002,8.0
1000003,2.0
1000004,2.0
1000005,8.0
...,...
1006036,8.0
1006037,8.0
1006038,14.0
1006039,12.0


In [25]:
p2bool=df.Product_Category_2.isnull()
df.loc[p2bool,'Product_Category_2']=df.loc[p2bool,'User_ID'].apply(lambda x:p2_mode.loc[x])

In [26]:
p3_mode=df.pivot_table(values='Product_Category_3',index=['User_ID'],aggfunc=lambda x:mode(x).mode[0])
p3_mode

User_ID,Product_Category_3
1000001,12.0
1000002,14.0
1000003,5.0
1000004,14.0
1000005,16.0
...,...
1006036,15.0
1006037,16.0
1006038,17.0
1006039,12.0


In [27]:
p3bool=df.Product_Category_3.isnull()
df.loc[p3bool,'Product_Category_3']=df.loc[p3bool,'User_ID'].apply(lambda x:p3_mode.loc[x])

KeyError: 1000492
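The KeyError most likely arises because some users have no recorded Product_Category_3 at all, so they never appear in the per-user mode table. One possible way to avoid the lookup failure is to compute the per-user mode with `groupby(...).transform` and fall back to the overall mode for users without any value. The sketch below runs on a toy frame; `first_mode`, `per_user` and `global_mode` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 2, 3],
    "Product_Category_3": [14.0, 14.0, np.nan, np.nan, np.nan, 5.0],
})

def first_mode(values: pd.Series) -> float:
    """Most frequent non-null value, or NaN when the group has none."""
    modes = values.mode()  # Series.mode ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# Per-user mode, broadcast back to every row of that user.
per_user = toy.groupby("User_ID")["Product_Category_3"].transform(first_mode)

# Overall mode, used only where a user has no value at all (user 2 here).
global_mode = toy["Product_Category_3"].mode().iloc[0]

toy["Product_Category_3"] = (
    toy["Product_Category_3"].fillna(per_user).fillna(global_mode)
)
print(toy)
```

This combines the per-user step and the global fallback in one expression, so no row-by-row lookup is needed.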

In [29]:
df['Product_Category_3']=df['Product_Category_3'].fillna(df['Product_Category_3'].mode()[0])

In [30]:
df.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2                 0
Product_Category_3                 0
Purchase                      233599
dtype: int64

* The remaining nulls are in Purchase, and they belong to the test rows, so they need no treatment

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 783667 entries, 0 to 233598
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     783667 non-null  int64  
 1   Product_ID                  783667 non-null  object 
 2   Gender                      783667 non-null  object 
 3   Age                         783667 non-null  object 
 4   Occupation                  783667 non-null  int64  
 5   City_Category               783667 non-null  object 
 6   Stay_In_Current_City_Years  783667 non-null  object 
 7   Marital_Status              783667 non-null  int64  
 8   Product_Category_1          783667 non-null  int64  
 9   Product_Category_2          783667 non-null  float64
 10  Product_Category_3          783667 non-null  float64
 11  Purchase                    550068 non-null  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 77.7+ MB


## No missing values remain in the feature columns

##  EDA and Visualization

In [None]:
### Visualization of age vs. purchase
sns.barplot(x='Age',y='Purchase',hue='Gender',data=df)

#### In every age group, male customers have a higher average purchase than female customers
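Because `sns.barplot` draws the mean of `Purchase` for each group by default, the comparison read off the plot can be checked numerically with a grouped mean. The sketch below uses a small made-up frame; the values and the name `mean_purchase` are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical purchases; real values come from the combined DataFrame.
toy = pd.DataFrame({
    "Age": ["0-17", "0-17", "26-35", "26-35"],
    "Gender": ["F", "M", "F", "M"],
    "Purchase": [8000.0, 9000.0, 8500.0, 9500.0],
})

# One row per age group, one column per gender, cells hold the mean purchase.
mean_purchase = (
    toy.groupby(["Age", "Gender"])["Purchase"].mean().unstack("Gender")
)
print(mean_purchase)
```

Note that a bar chart of means says nothing about total spend per group; a `sum` instead of `mean` would answer that question.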

In [None]:
### Visualization of occupation vs. purchase
plt.figure(figsize=(10,5))
sns.barplot(x='Occupation',y='Purchase',hue='Gender',data=df)

In [None]:
## Visualization of city category vs. purchase
plt.figure(figsize=(8,5))
#sns.countplot(x='City_Category',data=df)
sns.barplot(x='City_Category',y='Purchase',hue='Gender',data=df)
plt.show()

#### The highest average purchase comes from male customers in city category C

In [None]:
### Visualization of product category vs. purchase
sns.barplot(x='Product_Category_1',y='Purchase',hue='Gender',data=df)

In [None]:
sns.barplot(x='Product_Category_2',y='Purchase',hue='Gender',data=df)

In [None]:
sns.barplot(x='Product_Category_3',y='Purchase',hue='Gender',data=df)

In [None]:
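# Correlation is only defined for numeric columns, so select those first
df.select_dtypes(include='number').corr()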

In [None]:
plt.figure(figsize=(8,5))
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True)
plt.show()

## Convert the categorical variables to numerical variables

In [32]:
## Encode the categorical variable "Gender" as numeric (F=0, M=1)
df['Gender']=df['Gender'].map({'F':0,'M':1})
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,0,0-17,10,A,2,0,3,4.0,16.0,8370.0
1,1000001,P00248942,0,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,1000001,P00087842,0,0-17,10,A,2,0,12,4.0,16.0,1422.0
3,1000001,P00085442,0,0-17,10,A,2,0,12,14.0,16.0,1057.0
4,1000002,P00285442,1,55+,16,C,4+,0,8,8.0,16.0,7969.0


In [35]:
df['Age'].unique()

array(['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25'],
      dtype=object)

In [36]:
## Encode the categorical variable "Age" as ordered integers (youngest group = 1)
df['Age']=df['Age'].map({'0-17':1,'18-25':2,'26-35':3,'36-45':4,'46-50':5,'51-55':6,'55+':7})

In [37]:
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,0,1,10,A,2,0,3,4.0,16.0,8370.0
1,1000001,P00248942,0,1,10,A,2,0,1,6.0,14.0,15200.0
2,1000001,P00087842,0,1,10,A,2,0,12,4.0,16.0,1422.0
3,1000001,P00085442,0,1,10,A,2,0,12,14.0,16.0,1057.0
4,1000002,P00285442,1,7,16,C,4+,0,8,8.0,16.0,7969.0
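A hand-written mapping silently produces NaN for any label it does not list, so it can help to build the mapping from an ordered list and check the result. The following is a minimal sketch under that assumption; `age_order`, `age_codes`, `ages` and `encoded` are illustrative names, and the sample labels are taken from the unique values shown above.

```python
import pandas as pd

# Age bands from youngest to oldest; the position defines the code.
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
age_codes = {label: rank for rank, label in enumerate(age_order, start=1)}

ages = pd.Series(["0-17", "55+", "26-35"])
encoded = ages.map(age_codes)

# Any label missing from age_order would show up as NaN here.
if encoded.isna().any():
    raise ValueError("Unmapped age label found")

print(encoded.tolist())
```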


In [38]:
# Second technique (alternative): label encoding
# from sklearn import preprocessing

# LabelEncoder assigns an integer code to each distinct label
# label_encoder = preprocessing.LabelEncoder()

# Encode the labels in column 'Age'
# df['Age'] = label_encoder.fit_transform(df['Age'])
 
df['City_Category'].unique()

array(['A', 'C', 'B'], dtype=object)

In [None]:
## Encode the categorical variable "City_Category" as dummy columns (A is the dropped baseline)
df_city=pd.get_dummies(df['City_Category'],drop_first=True)
df_city.head()

In [None]:
df=pd.concat([df,df_city],axis=1)
df.head()

In [None]:
df.drop('City_Category',axis=1,inplace=True)
df.head()
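The three steps above (build the dummies, concatenate, drop the source column) can also be done in one call by passing the whole frame to `pd.get_dummies` with `columns=`. This is a sketch of that alternative on a toy frame, not the notebook's own approach. Note one difference: the new columns are named `City_Category_B` and `City_Category_C` rather than `B` and `C`, so later cells would need those longer names.

```python
import pandas as pd

toy = pd.DataFrame({
    "City_Category": ["A", "C", "B", "A"],
    "Occupation": [10, 16, 15, 7],
})

# Encodes only the listed column, drops it, and appends the dummy columns.
# dtype=int yields 0/1 integers directly, so no later astype(int) is needed.
encoded = pd.get_dummies(
    toy, columns=["City_Category"], drop_first=True, dtype=int
)
print(encoded)
```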

In [None]:
df.isnull().sum()

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df['Stay_In_Current_City_Years'].unique()

In [None]:
# regex=False treats '+' as a literal character instead of a regex quantifier
df['Stay_In_Current_City_Years']=df['Stay_In_Current_City_Years'].str.replace('+','',regex=False)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
## Convert the column from object to integer
df['Stay_In_Current_City_Years']=df['Stay_In_Current_City_Years'].astype(int)

In [None]:
df.info()

In [None]:
df['B']=df['B'].astype(int)

In [None]:
df['C']=df['C'].astype(int)

In [None]:
df.info()

## All columns used as features are now numeric, with corrected data types

# The Data Is Clean; Build the Model

In [None]:
df1=df.copy(deep=True)

In [None]:
df_test=df1[df1['Purchase'].isnull()]
df_test.head()

In [None]:
df_train=df1[~df1['Purchase'].isnull()]
df_train.head()

In [None]:
X=df_train.drop(['Purchase','User_ID','Product_ID'],axis=1)
X.head()

In [None]:
y=df_train['Purchase']
y.head()

In [None]:
print('X_shape:',X.shape)
print('y_shape:',y.shape)

## Scale the data

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()

In [None]:
X_sc=sc.fit_transform(X)
X_sc

In [None]:
## Convert the scaled array back into a DataFrame, keeping the feature names
X_sc=pd.DataFrame(X_sc,columns=X.columns)
X_sc.head()

#### Standard scaling produces negative values; to map every feature into the [0, 1] range instead, we use MinMaxScaler

## MinMaxScaler

In [None]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler
mms=MinMaxScaler()

In [None]:
X_mms=mms.fit_transform(X)
X_mms

In [None]:
X_mms=pd.DataFrame(X_mms,columns=X.columns)
X_mms.head()
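Two points about the scaling step may be worth illustrating. First, negative values from `StandardScaler` are not a problem for linear regression, so the choice between the two scalers is mostly one of convention here. Second, fitting the scaler on all rows before splitting lets information from the hold-out rows leak into the training features. The sketch below fits the scaler on the training split only; the random data and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
features = rng.integers(0, 21, size=(200, 3)).astype(float)  # toy features
target = rng.normal(9000, 5000, size=200)                    # toy target

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=100
)

# Learn min/max from the training rows only, then reuse them on the test rows.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # may fall slightly outside [0, 1]

# For comparison: standardization centres each column at 0, hence negatives.
standardized = StandardScaler().fit_transform(X_tr)

print(X_tr_scaled.min(), X_tr_scaled.max(), standardized.min())
```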

In [None]:
## Train the model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
lr= LinearRegression()
from sklearn.metrics import mean_squared_error,r2_score
from math import sqrt

In [None]:
X_mms_train,X_mms_test, y_train, y_test = train_test_split(X_mms, y, test_size=0.25,random_state =100)

In [None]:
print('X_train:',X_mms_train.shape)
print('X_test:',X_mms_test.shape)
print('y_train:',y_train.shape)
print('y_test:',y_test.shape)

In [None]:
## Linear Regression

In [None]:
lr_model=lr.fit(X_mms_train,y_train)
lr_model

In [None]:
tr_pred=lr.predict(X_mms_train)
ts_pred=lr.predict(X_mms_test)

In [None]:
Mean_Squared_Error_Train = mean_squared_error(y_train,tr_pred)
Root_Mean_Squared_Error_Train = np.sqrt(Mean_Squared_Error_Train)
Mean_Squared_Error_Test = mean_squared_error(y_test, ts_pred)
Root_Mean_Squared_Error_Test = np.sqrt(Mean_Squared_Error_Test)
R_Squared_Test=r2_score(y_test,ts_pred)
R_Squared_Train=r2_score(y_train,tr_pred)


print('Mean_Squared_Error_Train:',Mean_Squared_Error_Train)
print('Mean_Squared_Error_Test:',Mean_Squared_Error_Test)
print('Root_Mean_Squared_Error_Train:',Root_Mean_Squared_Error_Train)
print('Root_Mean_Squared_Error_Test:',Root_Mean_Squared_Error_Test)
print('R_Squared_Train :',R_Squared_Train*100)
print('R_Squared_Test:',R_Squared_Test*100)
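The same block of metric calculations is repeated for every model in this notebook. One way to cut the repetition is a small helper that returns the metrics as a dictionary. The function name `regression_report` and the tiny example arrays are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MSE, RMSE and R-squared for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }

# Errors are -1, 0, +1, so MSE = 2/3; total variation is 8, so R2 = 1 - 2/8.
report = regression_report([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
print(report)
```

Calling it once for the training predictions and once for the test predictions gives the same six numbers as the cell above.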

In [None]:
cols = ['Model', 'R-squared', 'RMSE']
result_tabulation = pd.DataFrame(columns = cols)
# Store the row under its own name so the LinearRegression class is not overwritten
lr_result = pd.Series({'Model': "Linear Regression",
                 'R-squared' : R_Squared_Test,  'RMSE' :Root_Mean_Squared_Error_Test})
# DataFrame.append was removed in pandas 2.0; add the row by label instead
result_tabulation.loc[len(result_tabulation)] = lr_result
result_tabulation

In [None]:
pd.DataFrame(list(zip(y_train,tr_pred)),columns=['Actual','Predicted']).head()

# VIF

In [None]:
from statsmodels.stats.outliers_influence import  variance_inflation_factor as VIF

In [None]:
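# Compute VIF on the numeric training features only; X (the model inputs) is left untouched
X_vif=df_train.drop(['Purchase','User_ID','Product_ID'],axis=1)
df_vif=pd.DataFrame()
df_vif['feature']=X_vif.columns
df_vif['VIF']=[VIF(X_vif.values.astype(float),i) for i in range(len(X_vif.columns))]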

In [None]:
df_vif
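To make the VIF numbers easier to interpret: the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing feature j on all the other features. A value near 1 means the feature is nearly independent of the rest, and values above roughly 5 to 10 are commonly read as a sign of multicollinearity. The sketch below computes this directly with NumPy, including an intercept in each regression. The helper `vif_table` and the toy data are assumptions for illustration, not the statsmodels implementation.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.DataFrame:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    values = features.to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(features.columns):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        # Intercept column, so R^2 is measured around the column mean.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        rows.append({"feature": name, "VIF": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# a_plus_noise is almost a copy of a, so both should get a large VIF.
toy = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})
vif = vif_table(toy)
print(vif)
```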

In [None]:
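# Variance is only defined for numeric columns, so select those first
df.select_dtypes(include='number').var()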

In [None]:
# Numeric columns with variance below 0.5
lst=[i for i in df.select_dtypes(include='number').columns if df[i].var() < 0.5]
lst

## By Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
rid=Ridge()

In [None]:
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state =100)

In [None]:
params={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20]}

In [None]:
reg_regressor=GridSearchCV(rid,params,scoring='r2',cv=5)

In [None]:
reg_regressor.fit(X_train,y_train)

In [None]:
print(reg_regressor.best_params_)
print(reg_regressor.best_score_)
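Two refinements to the grid search may be useful. Ridge penalizes coefficient size, so it is sensitive to feature scale, and wrapping the scaler and the model in a `Pipeline` ensures the scaler is refit inside each cross-validation fold. The selected alpha can then be read from `best_params_` instead of being typed in by hand. The following is a minimal sketch on synthetic data; the data, the step names `scale` and `ridge`, and the shortened alpha grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
target = features @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

pipeline = Pipeline([("scale", MinMaxScaler()), ("ridge", Ridge())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"ridge__alpha": [1e-3, 1e-2, 1, 5, 10, 20]}

search = GridSearchCV(pipeline, param_grid, scoring="r2", cv=5)
search.fit(features, target)

best_alpha = search.best_params_["ridge__alpha"]
best_model = search.best_estimator_  # already refit on all the data passed to fit
print(best_alpha, search.best_score_)
```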

In [None]:
rid=Ridge(alpha=20,  solver='auto',  random_state=100)
rid.fit(X_train,y_train)

In [None]:
ts_pred=rid.predict(X_test)
tr_pred=rid.predict(X_train)

In [None]:
Mean_Squared_Error_Train = mean_squared_error(y_train,tr_pred)
Root_Mean_Squared_Error_Train = np.sqrt(Mean_Squared_Error_Train)
Mean_Squared_Error_Test = mean_squared_error(y_test, ts_pred)
Root_Mean_Squared_Error_Test = np.sqrt(Mean_Squared_Error_Test)
R_Squared_Test=r2_score(y_test,ts_pred)
R_Squared_Train=r2_score(y_train,tr_pred)


print('Mean_Squared_Error_Train:',Mean_Squared_Error_Train)
print('Mean_Squared_Error_Test:',Mean_Squared_Error_Test)
print('Root_Mean_Squared_Error_Train:',Root_Mean_Squared_Error_Train)
print('Root_Mean_Squared_Error_Test:',Root_Mean_Squared_Error_Test)
print('R_Squared_Train :',R_Squared_Train)
print('R_Squared_Test:',R_Squared_Test)

In [None]:
cols = ['Model', 'R-squared', 'RMSE']
ridge_result = pd.Series({'Model': "Ridge Regression",
                 'R-squared' : R_Squared_Test,  'RMSE' :Root_Mean_Squared_Error_Test})
# DataFrame.append was removed in pandas 2.0; add the row by label instead
result_tabulation.loc[len(result_tabulation)] = ridge_result
result_tabulation

## By Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
RFR=RandomForestRegressor()

In [None]:
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state =100)

In [None]:
RFE=RandomForestRegressor(n_estimators=100,n_jobs=-1,random_state=1)

In [None]:
RFE.fit(X_train,y_train)

In [None]:
tr_pred=RFE.predict(X_train)
ts_pred=RFE.predict(X_test)

In [None]:
Mean_Squared_Error_Train = mean_squared_error(y_train,tr_pred)
Root_Mean_Squared_Error_Train = np.sqrt(Mean_Squared_Error_Train)
Mean_Squared_Error_Test = mean_squared_error(y_test, ts_pred)
Root_Mean_Squared_Error_Test = np.sqrt(Mean_Squared_Error_Test)
R_Squared_Test=r2_score(y_test,ts_pred)
R_Squared_Train=r2_score(y_train,tr_pred)


print('Mean_Squared_Error_Train:',Mean_Squared_Error_Train)
print('Mean_Squared_Error_Test:',Mean_Squared_Error_Test)
print('Root_Mean_Squared_Error_Train:',Root_Mean_Squared_Error_Train)
print('Root_Mean_Squared_Error_Test:',Root_Mean_Squared_Error_Test)
print('R_Squared_Train :',R_Squared_Train)
print('R_Squared_Test:',R_Squared_Test)

In [None]:
cols = ['Model', 'R-squared', 'RMSE']
# Store the row under its own name so the RandomForestRegressor class is not overwritten
rf_result = pd.Series({'Model': "RandomForestRegressor",
                 'R-squared' : R_Squared_Test,  'RMSE' :Root_Mean_Squared_Error_Test})
# DataFrame.append was removed in pandas 2.0; add the row by label instead
result_tabulation.loc[len(result_tabulation)] = rf_result
result_tabulation

## Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
# Keep the scores under their own name so the cross_val_score function is not overwritten
cv_scores=cross_val_score(lr,X_train,y_train,cv=5,scoring='r2')
print(cv_scores)

In [None]:
np.mean(cv_scores)
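When more than one metric is of interest, `cross_validate` can score R-squared and RMSE in a single pass over the folds. The sketch below shows the idea on synthetic data with shuffled folds; the data and the names `folds`, `scores`, `mean_r2` and `mean_rmse` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
target = features @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)

# Shuffling guards against any ordering in the rows (e.g. sorted by user).
folds = KFold(n_splits=5, shuffle=True, random_state=100)

scores = cross_validate(
    LinearRegression(), features, target, cv=folds,
    scoring=("r2", "neg_root_mean_squared_error"),
)

mean_r2 = scores["test_r2"].mean()
# scikit-learn reports errors as negative scores, so flip the sign back.
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()
print(mean_r2, mean_rmse)
```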

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from catboost import CatBoostRegressor  # third-party package: pip install catboost

def models(x_train_sc, y_train):
    """Fit several regressors and print their RMSE and R2 on the training data."""
    RMSE=[]
    R2score=[]
    model = []

    model.append(LinearRegression())
    model.append(RandomForestRegressor(random_state=42,oob_score=True))
    model.append(DecisionTreeRegressor())
    # SVR scales poorly with the number of samples and can be very slow on data of this size
    model.append(SVR())
    model.append(GradientBoostingRegressor())
    model.append(CatBoostRegressor())
    for i in model:
        i.fit(x_train_sc, y_train)
        pred = i.predict(x_train_sc)

        RMSE.append(np.sqrt(mean_squared_error(y_train, pred)))
        R2score.append(r2_score(y_train, pred))

        print(f'Model: {i}\nRMSE: {RMSE[-1]}\nR2_score: {R2score[-1]}\n\n')

In [None]:
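from sklearn.ensemble import GradientBoostingRegressor

# Assumes x_train_sc / x_test_sc (scaled feature splits) and y_train_log (log-transformed
# training target) already exist; this notebook does not show where they are created.
gbr=GradientBoostingRegressor(random_state=42)

# Fit the model
gbr.fit(x_train_sc,y_train_log)

# Predict with the fitted gradient boosting model
train_pred=gbr.predict(x_train_sc)
test_pred=gbr.predict(x_test_sc)


# Regression results
print('RMSE of training data is:',np.sqrt(mean_squared_error(y_train_log,train_pred)))
print('R2_score of training data is:',r2_score(y_train_log,train_pred))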


In [None]:
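from sklearn.linear_model import LinearRegression

# Assumes x_train_sc / x_test_sc (scaled feature splits) and y_train_log (log-transformed
# training target) already exist; this notebook does not show where they are created.
lr =LinearRegression()

# Fit the model on the scaled training features
lr.fit(x_train_sc,y_train_log)

# Predict with the fitted model, using the same scaled features it was trained on
train_pred=lr.predict(x_train_sc)
test_pred=lr.predict(x_test_sc)


# Results
print('RMSE of training data is:',np.sqrt(mean_squared_error(y_train_log,train_pred)))
print('R2_score of training data is:',r2_score(y_train_log,train_pred))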
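The last two cells train on `y_train_log`, which is presumably a log transform of the purchase amount, although the notebook does not show how it is built. Metrics computed on the log scale are not comparable with the RMSE values reported earlier in the original units. One way to keep everything in the original units is `TransformedTargetRegressor`, which applies the transform before fitting and inverts it when predicting. The sketch below assumes a `log1p` transform and uses synthetic data; all names in it are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(200, 2))
# Positive, right-skewed toy target, loosely mimicking a purchase amount.
target = np.exp(1.0 + 2.0 * features[:, 0] + rng.normal(scale=0.05, size=200))

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to the target before fitting
    inverse_func=np.expm1,  # applied to predictions, back to original units
)
model.fit(features, target)

pred = model.predict(features)  # already in the original units
rmse = np.sqrt(mean_squared_error(target, pred))
print(rmse)
```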