**Define the business problem that you are trying to solve. <br>What is the business context and what are the constraints you might face.**

## BigMart Sales Prediction

**Build a predictive model and find out the sales of each product at a particular store.**

## About Dataset

**The data scientists at BigMart have collected 2013 sales data for numerous products across many stores in different cities. Also, certain attributes of each product and store have been defined.<br>
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.**

## Data Fields

- **Item_Identifier**: Unique Number assigned to each Item.
- **Item_Weight**: Item Weight in g.
- **Item_Fat_Content**: Item Fat Content.
- **Item_Visibility**: Placement value of each item: 0 - Far & Behind 1 - Near & Front.
- **Item_Type**: Type of item utility.
- **Item_MRP**: Price of the Item.
- **Outlet_Identifier**: Unique Outlet Name.
- **Outlet_Establishment_Year**: Year of Outlet Establishment.
- **Outlet_Size**: Size of the Outlet.
- **Outlet_Location_Type**: Tier of Outlet Location.
- **Outlet_Type**: Type of Outlet.
- **Item_Outlet_Sales**: Target variable; sales of the item at the particular outlet.

## Data Preprocessing

**Cleaning, transforming, and preparing the data for analysis. <br>This step includes handling missing values, dealing with outliers, and ensuring data quality.**

Importing Libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.impute import KNNImputer
from colorama import Fore, Style
import pandas_profiling
from pandas_profiling import ProfileReport
import scipy.stats as stats
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score,mean_squared_error, mean_absolute_error
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import xgboost
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve

## Descriptive Data

In [3]:
df= pd.read_csv('bigmart.csv')
pd.options.display.float_format = "{:.2f}".format
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.02,Dairy,249.81,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.14
1,DRC01,5.92,Regular,0.02,Soft Drinks,48.27,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.42
2,FDN15,17.5,Low Fat,0.02,Meat,141.62,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.09,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.86,OUT013,1987,High,Tier 3,Supermarket Type1,994.71


In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.Item_Identifier.value_counts()

In [None]:
df.Item_Identifier.nunique()

In [None]:
df[df.Item_Identifier=='FDW13']

In [None]:
df.describe().T.style.background_gradient(cmap='GnBu').\
                    bar(subset=["std"], color='#BB0000').bar(subset=["mean",], color='green')

In [None]:
df.astype('object').describe().T.style.background_gradient(cmap='GnBu').\
                                    bar(subset=["count"], color='#BB0000').bar(subset=["unique",], color='green')

### Handling Duplicated data 

In [None]:
df_nodub=df.drop_duplicates() 
print(df.shape,df_nodub.shape,'\n Number of duplicate data : ',df.shape[0]-df_nodub.shape[0]) 

**There is no duplicate data.**

## Handling Missing Values

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.show()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().mean() * 100

In [None]:
from colorama import Fore, Style
def pcnt_miss_values(df):
    """Print the percentage of missing values for each column that has any."""
    print('\nPercentage of missing values for the columns that contain them\n')
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        if n_missing > 0:
            null_pcnt = round((n_missing / df.shape[0]) * 100, 2)
            print(Fore.RED + Style.BRIGHT
                  + '----> The percentage of null values for column {0} is {1} %\n'.format(col, null_pcnt)
                  + Style.RESET_ALL)

In [None]:
pcnt_miss_values(df.drop(['Item_Outlet_Sales'],axis=1))

### Using mean 

df.loc[df.Item_Weight.isnull(),'Item_Weight'] = df.Item_Weight.mean()

df.isnull().sum()

### Impute Outlet_Size with mode

df.loc[df.Outlet_Size.isnull() , 'Outlet_Size']  =df.Outlet_Size.mode()[0]

df.Outlet_Size.value_counts()

df.isnull().sum()

### Impute Item_Weight using the average per Item_Identifier
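The per-item average mentioned in the heading above has no code in this notebook, so here is a minimal sketch of one way it could be done. The toy values and the `Item_Weight_filled` column name are made up for illustration; the global-mean fallback for items with no recorded weight is an assumption, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the BigMart data (values are made up).
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, 5.92, 5.92, np.nan],
})

# Mean weight of each item, broadcast back to every row of that item.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill from the item's own mean first, then fall back to the global mean
# for items that have no recorded weight at all (NCD19 here).
toy["Item_Weight_filled"] = (
    toy["Item_Weight"].fillna(item_mean).fillna(toy["Item_Weight"].mean())
)
print(toy)
```

Because the same item usually has the same weight in every outlet, this tends to recover the true value whenever the item appears elsewhere with a weight.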

#### Using KNN Imputer 

In [None]:
# Separate the numerical features first
import numpy as np
num_cols = df.select_dtypes(include=np.number).columns
num_cols

In [None]:
from sklearn.impute import KNNImputer
knn  = KNNImputer(n_neighbors=5)
# Keep the original index so the concat below aligns rows correctly
df_filled = pd.DataFrame(knn.fit_transform(df[num_cols]), columns=num_cols, index=df.index)
df_filled = pd.concat([df.drop(columns=num_cols), df_filled], axis=1)
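To make the behaviour of the KNN imputer more concrete, here is a small sketch on made-up numbers (the toy frame and its index are assumptions for illustration only). With two neighbours, the missing weight is replaced by the average weight of the two rows that are closest on the observed column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with a non-default index, to show why the index is passed back in.
toy = pd.DataFrame(
    {
        "Item_Weight": [9.0, np.nan, 17.0, 19.0],
        "Item_MRP": [250.0, 48.0, 141.0, 182.0],
    },
    index=[10, 11, 12, 13],
)

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Row 11 is closest (by Item_MRP) to rows 12 and 13, so it gets their mean weight.
print(filled)
```

Note that distances are computed on the raw columns, so a feature with a large scale such as Item_MRP dominates unless the data is scaled first.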



In [None]:
# df_filled=df.copy()
# df_filled.loc[df_filled['Item_Weight'].isnull(),'Item_Weight']=df_filled.Item_Weight.mean()

In [None]:
df_filled[df_filled.Item_Identifier=='FDW13']

In [None]:
df_filled.isna().sum()

In [None]:
df=df_filled.copy()

## Standardise Values

In [None]:
df.loc[df['Item_Fat_Content'].isin(['LF', 'low fat']), 'Item_Fat_Content']='Low Fat'
df.loc[df['Item_Fat_Content']=='reg','Item_Fat_Content']='Regular'

#### We decided to categorize these types into 7 major categories:

- Proteins and Main Dishes:
    - Meat
    - Seafood
- Carbohydrates and Staples:
    - Breads
    - Starchy Foods
- Dairy and Alternatives:
    - Dairy
- Fruits and Vegetables:
    - Fruits and Vegetables
- Processed and Convenience Foods:
    - Snack Foods
    - Frozen Foods
    - Canned
    - Baking Goods
    - Breakfast
- Beverages:
    - Soft Drinks
    - Hard Drinks
- Household and Others:
    - Household
    - Health and Hygiene
    - Others



In [None]:
Beverages=['Soft Drinks','Hard Drinks']
Household_and_Others=['Household','Health and Hygiene','Others']
Proteins = ['Seafood','Meat']
Processed = ['Snack Foods','Frozen Foods','Canned','Baking Goods','Breakfast']
Carbohydrates = ['Starchy Foods','Breads']

# df.loc[(~ df['Item_Type'].isin(Beverages)) & (~ df['Item_Type'].isin(Household_and_Others)),'Item_Type'] = 'Food'

df.loc[df['Item_Type'].isin(Beverages),'Item_Type'] = 'Beverages'
df.loc[df['Item_Type'].isin(Household_and_Others),'Item_Type'] = 'Household_and_Others'
df.loc[df['Item_Type'].isin(Proteins),'Item_Type'] = 'Proteins'
df.loc[df['Item_Type'].isin(Processed),'Item_Type'] = 'Processed'
df.loc[df['Item_Type'].isin(Carbohydrates),'Item_Type'] = 'Carbohydrates'
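The same regrouping can also be written as one dictionary lookup. This is only a sketch: `ITEM_TYPE_GROUPS` is a hypothetical name, and the toy series is made up. Types that are not in the mapping (Dairy, Fruits and Vegetables) are left unchanged, as in the cell above.

```python
import pandas as pd

# Hypothetical mapping; the groups mirror the lists used in the cell above.
ITEM_TYPE_GROUPS = {
    "Soft Drinks": "Beverages",
    "Hard Drinks": "Beverages",
    "Household": "Household_and_Others",
    "Health and Hygiene": "Household_and_Others",
    "Others": "Household_and_Others",
    "Seafood": "Proteins",
    "Meat": "Proteins",
    "Snack Foods": "Processed",
    "Frozen Foods": "Processed",
    "Canned": "Processed",
    "Baking Goods": "Processed",
    "Breakfast": "Processed",
    "Starchy Foods": "Carbohydrates",
    "Breads": "Carbohydrates",
}

toy = pd.Series(
    ["Soft Drinks", "Dairy", "Meat", "Canned", "Breads", "Others"],
    name="Item_Type",
)

# replace() only touches values that appear as keys in the mapping.
grouped = toy.replace(ITEM_TYPE_GROUPS)
print(grouped.tolist())
```

Keeping the mapping in one place makes it easier to review and to reuse on new data.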

In [None]:
df.Outlet_Establishment_Year=df.Outlet_Establishment_Year.astype('object')

###### **Outlet_Size will be imputed with a random forest classifier. RandomForestClassifier is designed for classification tasks and works with numeric features, so the other features have to be encoded first.**

## Handling Outliers

In [None]:
import matplotlib.pyplot as plt
def num_boxplot(df) :
    for col in df.columns:
        if df[col].dtype != 'object':
            plt.figure(figsize=(15,8))
            sns.boxplot(x=col,data=df)
            plt.title('Box plot for '+col+'\n')
            print(Fore.RED , Style.BRIGHT,'\033[1m ',100*'=')
            plt.show()
            
num_boxplot(df.drop('Item_Outlet_Sales',axis=1))

###### As the boxplots show, only Item_Visibility seems to have outliers. Next I check this with the IQR rule and look for extreme outliers.

In [None]:
def iqr(df,col) :
    if df[col].dtype != 'object':
        q1=df[col].quantile(.25)
        q3=df[col].quantile(.75)
        iqr = q3-q1 
        lower_b = q1 - 1.5 * iqr
        ex_lower_b = q1-3 * iqr
        upper_b = q3+1.5 * iqr
        ex_upper_b = q3 + 3 * iqr
        return ex_lower_b , ex_upper_b , upper_b , lower_b
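As a worked example of the four bounds returned by `iqr`, here is the same arithmetic on a small made-up series (the values are assumptions for illustration). The inner fences sit 1.5 × IQR beyond the quartiles and the outer (extreme) fences sit 3 × IQR beyond them.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="x")

q1 = values.quantile(0.25)   # 3.25
q3 = values.quantile(0.75)   # 7.75
spread = q3 - q1             # 4.5

lower_b = q1 - 1.5 * spread      # inner fence, low side
upper_b = q3 + 1.5 * spread      # inner fence, high side
ex_lower_b = q1 - 3 * spread     # outer fence, low side
ex_upper_b = q3 + 3 * spread     # outer fence, high side

# Values beyond the outer fences are treated as extreme outliers.
extreme = values[(values > ex_upper_b) | (values < ex_lower_b)]
print(lower_b, upper_b, ex_lower_b, ex_upper_b)
print(extreme.tolist())
```

Here only the value 100 lies beyond the outer fence of 21.25, so it is the single extreme outlier.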

In [None]:
def find_number_of_ext_outliers(df,label):
    """Report the columns with extreme outliers (beyond 3 * IQR) and return their names."""
    list_col=[]
    df = df.drop(label , axis=1 )
    for col in df.columns :
        if df[col].dtype != 'object':
            ex_lower_b , ex_upper_b,  upper_b , lower_b = iqr(df,col)
            num_ex_outliers = df[(df[col]>ex_upper_b) | (df[col]<ex_lower_b)].shape[0]
            if num_ex_outliers > 0 :
                print('\n \033[1m Number of Extreme Outliers for {0}: ( values > {1} or values <{2})\n'.format(col,ex_upper_b,ex_lower_b))
                print(Fore.RED ,'\033[1m', Style.BRIGHT,(num_ex_outliers))
                print(Style.RESET_ALL)
                list_col.append(col)

    if len(list_col)==0 :
        print('There are no extreme outliers in the dataset')
    # Always return a list (possibly empty) so that drop_outliers() can take the result directly
    return list_col


In [None]:
def drop_outliers (df, colname) :
    if len(colname) >0 :
        for col in colname:
            ex_lower_b , ex_upper_b  , upper_b , lower_b= iqr(df,col)
            df = df[(df[col]<ex_upper_b) & (df[col]>ex_lower_b)]

    return df
    

In [None]:
list_col_outliers = find_number_of_ext_outliers(df,'Item_Outlet_Sales')


##### drop outliers
The number of extreme values is very small, so we decided to remove those rows.

In [None]:
df_result= drop_outliers(df,list_col_outliers)
df.shape[0] - df_result.shape[0]  # number of rows removed

#### Check whether any extreme outliers remain

In [None]:
list_col_outliers = find_number_of_ext_outliers(df_result,'Item_Outlet_Sales')

In [None]:
df_result1= drop_outliers(df_result,list_col_outliers)
df_result.shape[0] - df_result1.shape[0]  # number of rows removed

In [None]:
list_col_outliers = find_number_of_ext_outliers(df_result1,'Item_Outlet_Sales')

### Potential Outliers : 

In [None]:
def find_number_of_pot_outliers(df,label):
    """Report the columns with potential outliers (beyond 1.5 * IQR) and return their names."""
    list_col=[]
    df = df.drop(label , axis=1 )
    for col in df.columns :
        if df[col].dtype != 'object':
            ex_lower_b , ex_upper_b,upper_b,lower_b = iqr(df,col)
            num_pot_outliers = df[(df[col]>upper_b) | (df[col]<lower_b)].shape[0]
            if num_pot_outliers > 0 :
                # Report the 1.5 * IQR bounds that were actually used for the count
                print('\n \033[1m Number of Potential Outliers for {0}: ( values > {1} or values <{2})\n'.format(col,upper_b,lower_b))
                print(Fore.RED ,'\033[1m', Style.BRIGHT,(num_pot_outliers))
                print(Style.RESET_ALL)
                list_col.append(col)

    if len(list_col)==0 :
        print('There are no potential outliers in the dataset')
    # Always return a list (possibly empty) so that drop_pot_outliers() can take the result directly
    return list_col


In [None]:
def drop_pot_outliers (df, colname) :
    if len(colname) >0 :
        for col in colname:
            ex_lower_b , ex_upper_b,upper_b,lower_b = iqr(df,col)
            df = df[(df[col]<upper_b) & (df[col]>lower_b)]

    return df

In [None]:
list_col_outliers = find_number_of_pot_outliers(df_result1,'Item_Outlet_Sales')


In [None]:
df_result2= drop_pot_outliers(df_result1,list_col_outliers)
df_result1.shape[0] - df_result2.shape[0]

In [None]:
list_col_outliers = find_number_of_pot_outliers(df_result2,'Item_Outlet_Sales')


In [None]:
df_result1= drop_pot_outliers(df_result2,list_col_outliers)
df_result2.shape[0] - df_result1.shape[0]

In [None]:
list_col_outliers = find_number_of_pot_outliers(df_result1,'Item_Outlet_Sales')

In [None]:
df_result2= drop_pot_outliers(df_result1,list_col_outliers)
df_result1.shape[0] - df_result2.shape[0]

In [None]:
list_col_outliers = find_number_of_pot_outliers(df_result2,'Item_Outlet_Sales')

In [None]:
df_result2.info()

In [None]:
df=df_result2.copy()

In [None]:
print(df.dtypes)


# Exploratory Data Analysis (EDA)

##### Getting familiar with all features and doing some replacement, extraction, discretizing, etc.

**Exploring the data visually and statistically to gain a better understanding of its characteristics and relationships.**

## Univariate Analysis

In [None]:
numeric_columns = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']
colors = ['purple', 'g', 'b', 'c'] 

plt.figure(figsize=(8, 6))
sns.set_style("white")

for i in range(len(numeric_columns)):
    plt.subplot(2, 2, i+1)
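    # histplot replaces the deprecated distplot; kde=True keeps the density curve
    sns.histplot(df[numeric_columns[i]], bins=20, kde=True, stat='density', color=colors[i])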
    plt.xlabel(numeric_columns[i])
    plt.ylabel('Density')

plt.tight_layout()
plt.show()

**The distribution of Item_Weight is basically flat, apart from a surge around a weight of 13g.<br>
<br>
The shapes of Item_Visibility and Item_Outlet_Sales are basically the same. <br>
This hints that the placement of an item may be related to its sales, although similar marginal shapes alone do not prove it. <br>
The working hypothesis is that the further away the placement, the lower the sales.<br>
<br>
Item_MRP fluctuates greatly, showing that the price difference between items is large.**

In [None]:
categorical_columns = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Establishment_Year',
                       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
colors = ['CornflowerBlue', 'MediumSeaGreen', 'Tomato', 'Plum', 'RoyalBlue', 'LightGreen', 'Coral' ]

plt.figure(figsize=(12, 12))

for i in range(len(categorical_columns)):
    plt.subplot(4, 2, i+1)
    sns.barplot(x=df[categorical_columns[i]].value_counts(normalize=True).index,
                y=df[categorical_columns[i]].value_counts(normalize=True) * 100, color=colors[i])
    plt.xlabel(categorical_columns[i])
    plt.ylabel('Percentage')

plt.tight_layout()
plt.show()

In [None]:
categorical_columns_pp = ['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Type']

plt.figure(figsize=(8, 8))

for i in range(len(categorical_columns_pp)):
    value_counts = df[categorical_columns_pp[i]].value_counts()
    explode = [0.1 if idx == value_counts.idxmax() else 0 for idx in value_counts.index]
    
    plt.subplot(2, 2, i+1)
    plt.pie(df[categorical_columns_pp[i]].value_counts().values, explode=explode,\
            labels=df[categorical_columns_pp[i]].value_counts().index, \
            autopct='%1.1f%%', shadow=True, startangle=140)
    plt.xlabel(categorical_columns_pp[i])
    plt.ylabel('Percentage')

plt.tight_layout()
plt.show()

In [None]:
for i in range(len(categorical_columns)):
    value_counts_percentage = df[categorical_columns[i]].value_counts(normalize=True) * 100
    print(f'Percentage distribution for {categorical_columns[i]}:')
    print(value_counts_percentage)
    print('_' * 50)

**For Item_Fat_Content:<br>
The proportion of low fat products is significantly higher than that of regular fat products.<br>
<br>
For Item_Type:<br>
The proportion of Fruits and Vegetables is close to that of Snack Foods, and they are the two categories with the highest proportions. After that, Household and Frozen Foods have the highest proportions.<br>
<br>
For Outlet_Identifier:<br>
The proportions are generally the same, only the proportions of #10 and #19 are slightly lower.<br>
<br>
For Outlet_Establishment_Year:<br>
The proportions are roughly the same every year, but the proportion in 1985 is slightly higher and the proportion in 1998 is slightly lower.<br>
<br>
For Outlet_Size:<br>
Among Outlet_Size, medium has the highest proportion.<br>
<br>
For Outlet_Location_Type:<br>
The higher the Tier, the higher the proportion.<br>
<br>
For Outlet_Type:<br>
The proportion of Supermarket Type 1 is 65%, far exceeding the sum of the other three types of outlets.<br>
<br>
For Category:<br>
Food accounts for 72% of all items and is the main product**

##### Item_Identifier

In [None]:
df.Item_Identifier.value_counts()

In [None]:
df.Item_Identifier.nunique()

In [None]:
df.Outlet_Identifier.value_counts()

#### Let's check the correlation with the other features in the bivariate analysis and then decide what to do with it

### Item_Fat_Content

In [None]:
df.dtypes['Item_Fat_Content']

In [None]:
df.Item_Fat_Content.value_counts(normalize=True) * 100


### Item_Visibility

In [None]:
df.dtypes['Item_Visibility']

In [None]:
df.Item_Visibility.describe()

In [None]:
sns.displot(data=df , x='Item_Visibility' , kde=True)

### Item_Type                  

In [None]:
df.dtypes['Item_Type']

In [None]:
df.Item_Type.value_counts()

In [None]:
round(df.Item_Type.value_counts(normalize=True) * 100,2)

### Item_MRP

In [None]:
df.dtypes['Item_MRP']

In [None]:
sns.displot(data=df ,x='Item_MRP',kde=True)

In [None]:
from scipy import stats

stat, p = stats.shapiro(df.Item_MRP)

if p<0.05 :
    print('Data does not look normally distributed (reject H0) - p-value is : ',p)
else :
    print('Data looks normally distributed (cannot reject H0) - p-value is : ',p)

### Outlet_Identifier

In [None]:
df.dtypes['Outlet_Identifier']

In [None]:
df.Outlet_Identifier.value_counts()

### Outlet_Establishment_Year

In [None]:
df.dtypes['Outlet_Establishment_Year']

In [None]:
df.Outlet_Establishment_Year.value_counts()


### Outlet_Size

In [None]:
df.dtypes['Outlet_Size']

In [None]:
df.Outlet_Size.value_counts()

### Outlet_Location_Type

In [None]:
df.dtypes['Outlet_Location_Type']

In [None]:
df.Outlet_Location_Type.value_counts()

### Outlet_Type

In [None]:
df.dtypes['Outlet_Type']

In [None]:
df.Outlet_Type.value_counts()

## Bivariate Analysis

### Numeric - Numeric Analysis

#### Item_Weight

In [None]:
plt.figure(figsize=(8, 4))
sns.scatterplot(df, x='Item_Weight', y='Item_Outlet_Sales',hue="Item_Outlet_Sales", palette='RdYlBu', s = 60)
plt.show()

**Sales in each range of item weight are roughly the same.<br>
The relationship between item weight and item sales cannot be seen from this plot.**

#### Item_MRP

In [None]:
plt.figure(figsize=(8, 4))
sns.scatterplot(df, x='Item_MRP', y='Item_Outlet_Sales',hue="Item_Outlet_Sales", palette='RdYlBu', s = 60)
plt.show()

**As the item MRP increases, item sales also increase significantly.**

#### Item_Visibility

In [None]:
plt.figure(figsize=(8, 4))
sns.scatterplot(df, x='Item_Visibility', y='Item_Outlet_Sales',hue="Item_Outlet_Sales", palette='RdYlBu', s = 60)
plt.show()

**Most items have low visibility values (they are placed far away), so most sales records, including the highest ones, come from that range rather than from the front area.**

### Numeric - Categorical Analysis

#### Item_Fat_Content

In [None]:
plt.figure(figsize = (4,4))
sns.barplot(df, x='Item_Fat_Content', y='Item_Outlet_Sales', ci=None, palette='RdYlBu')
plt.show()

In [None]:
plt.figure(figsize=(8,4))
sns.swarmplot(df,x='Item_Fat_Content',y='Item_Outlet_Sales',hue='Item_Outlet_Sales', palette='RdYlBu')
plt.show()

**There is not much difference in the item sales between low fat and regular fat.**

#### Item_Type

In [None]:
plt.figure(figsize = (12,4))
sns.barplot(df, x='Item_Type', y='Item_Outlet_Sales', ci=None, palette='RdYlBu')
plt.xticks(rotation=30)
plt.show()

In [None]:
plt.figure(figsize=(12,4))
sns.swarmplot(df,x='Item_Type',y='Item_Outlet_Sales',hue='Item_Outlet_Sales', palette='RdYlBu')
plt.xticks(rotation=30)
plt.show()

**There is not much difference in sales between item types.<br>
The number of breakfast and seafood items is small, but the sales are still at the overall average level.**

In [None]:
plt.figure(figsize = (12,4))
sns.barplot(df, x='Item_Type', y='Item_MRP', ci=None, palette='RdYlBu')
plt.xticks(rotation=30)
plt.show()

**The difference in MRP between different item types is very small and basically the same.**

#### Outlet_Establishment_Year

In [None]:
plt.figure(figsize = (6,4))
sns.barplot(df, x='Outlet_Establishment_Year', y='Item_Outlet_Sales', ci=None, palette='RdYlBu')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12,4))
sns.swarmplot(df,x='Outlet_Establishment_Year',y='Item_Outlet_Sales',hue='Item_Outlet_Sales', palette='RdYlBu')
plt.show()

**Sales were roughly the same in each year, with only 1998 having unusually low sales.<br>
Referring to the frequency in 1998, the number of outlets established in 1998 was very small, resulting in very low sales.**

#### Outlet_Size

In [None]:
plt.figure(figsize = (4,4))
sns.barplot(df, x='Outlet_Size', y='Item_Outlet_Sales', ci=None, palette='RdYlBu')
plt.show()

In [None]:
plt.figure(figsize=(8,4))
sns.swarmplot(df,x='Outlet_Size',y='Item_Outlet_Sales',hue='Item_Outlet_Sales', palette='RdYlBu')
plt.show()

**Mid-sized stores have the highest sales.**

#### Outlet_Location_Type

In [None]:
plt.figure(figsize = (4,4))
sns.barplot(df, x='Outlet_Location_Type', y='Item_Outlet_Sales', ci=None, palette='RdYlBu')
plt.show()

In [None]:
plt.figure(figsize=(10,4))
sns.swarmplot(df,x='Outlet_Location_Type',y='Item_Outlet_Sales',hue='Item_Outlet_Sales', palette='RdYlBu')
plt.show()

**Tier 2 location type has the highest sales, with little difference from the second place.**

#### Outlet_Type

In [None]:
plt.figure(figsize = (8,4))
sns.barplot(df, x='Outlet_Type', y='Item_Outlet_Sales', ci=None, palette='RdYlBu')
plt.show()

In [None]:
plt.figure(figsize=(8,4))
sns.swarmplot(df,x='Outlet_Type',y='Item_Outlet_Sales',hue='Item_Outlet_Sales', palette='RdYlBu')
plt.show()

**The proportion of supermarket Type1 is very high, but the item mean sales is not very high.<br>
The one with the highest sales is supermarket Type3.<br>
Maybe supermarket Type3 is a high-end imported supermarket, and supermarket Type1 is an affordable ordinary supermarket.**

## Correlation

In [None]:
df.select_dtypes(exclude='object').corr()


### Correlation between categorical features

In [None]:
cat_features=df.select_dtypes(include='object').columns
cat_features

In [None]:
import researchpy as rp
def cramer_v(df , list_cat) :
    """Pairwise chi-square p-value and Cramer's V for the given categorical columns."""
    rows = []
    for i in range(len(list_cat)) :
        for j in range(i+1,len(list_cat)) :
            coli = df[list_cat[i]]
            colj = df[list_cat[j]]
            crosstab , test_result=rp.crosstab(coli,colj,test='chi-square')
            # Row 1 of the results table holds the p-value, row 2 the effect size;
            # the values sit in the second column, so positional access is used
            pvalue = test_result.iloc[1, 1]
            cramer = test_result.iloc[2, 1]
            rows.append({'Category1': list_cat[i], 'Category2': list_cat[j], 'pvalue': pvalue, 'Cramer': cramer})
    # Build the frame once instead of concatenating onto an empty DataFrame in the loop
    return pd.DataFrame(rows, columns=['Category1', 'Category2', 'pvalue', 'Cramer'])

result = cramer_v(df ,list(cat_features))
result.sort_values(by='Cramer' , ascending=False)
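For reference, Cramér's V is $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$, where $n$ is the number of observations and $r$, $c$ are the numbers of rows and columns of the contingency table. The sketch below computes it directly with SciPy instead of researchpy. The `cramers_v` helper and the toy series are assumptions for illustration, and no continuity correction is applied, so the numbers may differ slightly from other implementations on 2×2 tables.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical series (no continuity correction)."""
    table = pd.crosstab(a, b).to_numpy()
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    if k == 0:
        return float("nan")  # undefined when one variable has a single level
    return float(np.sqrt(chi2 / (n * k)))


# Perfectly associated toy columns give V = 1, independent ones give V = 0.
size = pd.Series(["Small", "Small", "High", "High"])
tier_same = pd.Series(["Tier 1", "Tier 1", "Tier 3", "Tier 3"])
tier_mixed = pd.Series(["Tier 1", "Tier 3", "Tier 1", "Tier 3"])

print(cramers_v(size, tier_same))
print(cramers_v(size, tier_mixed))
```

A value close to 1 means one column almost determines the other, which is why identifier-like columns score so high in the table above.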

### As we can see, Item_Identifier and Outlet_Identifier are very strongly related to the other features, so I drop both of them

In [None]:
df.drop(columns=['Item_Identifier','Outlet_Identifier'] , axis=1,inplace=True)

In [None]:
cat_features=df.select_dtypes(include='object').columns
result = cramer_v(df ,list(cat_features))
result.sort_values(by='Cramer' , ascending=False)

In [None]:
df.drop(columns='Outlet_Establishment_Year' , axis=1,inplace=True)


In [None]:
cat_features=df.select_dtypes(include='object').columns
result = cramer_v(df ,list(cat_features))
result.sort_values(by='Cramer' , ascending=False)

In [None]:
df.info()

## Hypothesis Testing

### Normality Test

**Since the sample size is large enough (≥30), I can rely on the Central Limit Theorem.**

### Homogeneity of Variances

In [None]:
import scipy.stats as stats
features = ['Item_Weight', 'Item_Visibility', 'Item_MRP']
levene_results = []

# Item_Outlet_Sales is continuous, so filtering on == 0 / == 1 selects no rows.
# Assumption: compare low-sales and high-sales rows, split at the median instead.
high_sales = df['Item_Outlet_Sales'] > df['Item_Outlet_Sales'].median()

for feature in features:
    group1 = df.loc[~high_sales, feature]
    group2 = df.loc[high_sales, feature]
    
    stat, p_value = stats.levene(group1, group2)
    levene_results.append((feature, stat, p_value))

levene_df = pd.DataFrame(levene_results, columns=['Feature', 'Statistic', 'P-value'])
print(levene_df)

**For every feature whose p-value is less than 0.05, we reject the null hypothesis of equal variances and conclude that the two groups differ significantly in their variances.**

In [None]:
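features = ['Item_Weight', 'Item_Visibility', 'Item_MRP']
ttest_results = []

# Item_Outlet_Sales is continuous, so filtering on == 0 / == 1 selects no rows.
# Assumption: compare low-sales and high-sales rows, split at the median instead.
high_sales = df['Item_Outlet_Sales'] > df['Item_Outlet_Sales'].median()

for feature in features:
    group1 = df.loc[~high_sales, feature]
    group2 = df.loc[high_sales, feature]
    
    # Welch's t-test: equal_var=False does not assume equal variances
    t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=False)
    ttest_results.append((feature, t_statistic, p_value))
    
ttest_df = pd.DataFrame(ttest_results, columns=['Feature', 'T-Statistic', 'P-value'])
print(ttest_df)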

**For every feature whose p-value is less than 0.05, we reject the null hypothesis of equal means and conclude that the feature differs significantly between the two groups.**

### Encoding 

###### Before encoding all features, I first encode everything except Outlet_Size, use a random forest classifier to impute its missing values, and then encode Outlet_Size

In [None]:
df_encoded=df.drop(columns='Outlet_Size',axis=1)
df_encoded=pd.get_dummies(df_encoded,drop_first=True)
# df=df_test.copy()
df_encoded.head()

### Concat df_encoded with uncoded outlet_size

In [None]:
df_encoded2 = pd.concat([df_encoded,df.Outlet_Size],axis=1)
df_encoded2


In [None]:
df_encoded2.isna().sum()

### Impute Outlet_Size using RandomForestClassifier

In [None]:

from sklearn.ensemble import RandomForestClassifier

# 'Outlet_Size' is the categorical column to impute


# Separate data into complete and incomplete
complete_data = df_encoded2.dropna(subset=['Outlet_Size'])
incomplete_data = df_encoded2[df_encoded2['Outlet_Size'].isna()]

# Features and target
X = complete_data.drop(['Outlet_Size'], axis=1)
y = complete_data['Outlet_Size']

# Train RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X, y)

# Predict and fill in the missing values (skipped when nothing is missing,
# because predict() cannot be called on an empty frame)
if not incomplete_data.empty:
    predicted_values = classifier.predict(incomplete_data.drop(['Outlet_Size'], axis=1))
    df_encoded2.loc[df_encoded2['Outlet_Size'].isna(), 'Outlet_Size'] = predicted_values

In [None]:
df_encoded2.isnull().sum()

In [None]:
df=df_encoded2.copy()

In [None]:
df.Outlet_Size.value_counts()

In [None]:
df.head()

### Now encode outlet_size :

In [None]:
df_encoded=df.copy()
df_encoded=pd.get_dummies(df_encoded,drop_first=True)
# df=df_test.copy()
df_encoded.head()

In [None]:
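# Cast only the boolean dummy columns to 0/1; casting the whole frame to int
# would truncate Item_Weight, Item_MRP and Item_Visibility (which is below 1)
bool_cols = df_encoded.select_dtypes(include='bool').columns
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)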
df_encoded.head()

In [None]:
df=df_encoded.copy()

## Handling Outlier: LOF

In [None]:
from sklearn.neighbors import LocalOutlierFactor
def outlier_detection(df):
    x = df.drop("Item_Outlet_Sales", axis=1)
    y = df.Item_Outlet_Sales
    lof = LocalOutlierFactor(n_neighbors = 10, metric= "euclidean") #or manhattan

    res = lof.fit_predict(x)
    print(res)
  
    x_o = x[res != -1]
    y_o = y[res != -1]
    df = pd.concat([x_o, y_o], axis=1)
    return df

df_cleaned= outlier_detection(df)
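To see what `fit_predict` returns, here is a minimal sketch on synthetic data (the points are generated, not taken from BigMart). LocalOutlierFactor labels inliers as 1 and outliers as -1, which is why the function above keeps rows where the result is not -1.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 50 points around the origin plus one point far away from the cluster.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X = np.vstack([cluster, [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=10, metric="euclidean")
labels = lof.fit_predict(X)      # 1 = inlier, -1 = outlier

X_clean = X[labels != -1]        # same filtering as in outlier_detection
print("label of the far point:", labels[-1])
print("rows kept:", X_clean.shape[0], "of", X.shape[0])
```

Because LOF is distance based, it is sensitive to feature scale; here it runs on unscaled features, so columns with large values have the most influence.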

In [None]:
print('number of removed observations : ' ,df.shape[0]-df_cleaned.shape[0])

In [None]:
df=df_cleaned.copy()

## Define X and Y 

In [None]:
X=df.drop('Item_Outlet_Sales',axis=1)
y=df.Item_Outlet_Sales
X.shape , y.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3 , shuffle=True,random_state=1234)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

## Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train_Scaled = pd.DataFrame(sc.fit_transform(X_train), columns=X_train.columns)
X_train_Scaled.head()

In [None]:
X_test_Scaled = pd.DataFrame(sc.transform(X_test),columns=X_test.columns)
X_test_Scaled.head()


In [None]:
X_train=X_train_Scaled
X_test=X_test_Scaled


In [None]:
X_train17=X_train.copy()
X_test17 = X_test.copy()

In [None]:
X_train.info()


In [None]:
from sklearn.linear_model import LinearRegression
linear_regressor = LinearRegression()
linear_regressor.fit(X_train,y_train)


In [None]:
# Cross validation scores for the model. 
from sklearn.model_selection import cross_val_score
crossvalidation = cross_val_score(linear_regressor, X_train, y_train, cv=5, n_jobs=-1)
crossvalidation


In [None]:
# predictions on the train and test sets.
y_pred = linear_regressor.predict(X_train)
y_pred_test = linear_regressor.predict(X_test)


In [None]:
# evaluation metrics on the train and test sets
from sklearn.metrics import mean_squared_error, r2_score

r2_val = r2_score(y_train,y_pred)
# Adjusted R2 penalises R2 for the number of features (columns of X)
r2_adj_val=1 - (((len(X_train.index) - 1) / (len(X_train.index) - len(X_train.columns) - 1)) * (1 - r2_val))
# RMSE as the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn releases
rmse_error = np.sqrt(mean_squared_error(y_train, y_pred))
print ("R2 score for the model is :",r2_val )
print("Adjusted_R2 for the model is :",r2_adj_val)
print ("RMSE error for the model is :",rmse_error )

y_pred_test = linear_regressor.predict(X_test)
r2_val_test = r2_score(y_test,y_pred_test)
r2_adj_val_test=1 - (((len(X_test.index) - 1) / (len(X_test.index) - len(X_test.columns) - 1)) * (1 - r2_val_test))

rmse_error_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
print ("R2 score (test) for the model is :",r2_val_test )
print("Adjusted_R2(test) for the model is :",r2_adj_val_test)
print ("RMSE error (test) for the model is :",rmse_error_test )
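The adjusted R² used above follows $R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ features. A small sketch with made-up numbers shows the arithmetic; `adjusted_r2` and `rmse` are hypothetical helper names, not functions from scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 for n = len(y_true) observations and n_features predictors."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


def rmse(y_true, y_pred):
    """Root mean squared error, computed as sqrt(MSE)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 6.0]

# R2 = 1 - 1/10 = 0.9, so adjusted R2 with one feature is 1 - 0.1 * 4/3
print(r2_score(y_true, y_pred))
print(adjusted_r2(y_true, y_pred, n_features=1))
print(rmse(y_true, y_pred))
```

The penalty grows with the number of features, so adjusted R² is the more useful number when comparing models with different feature counts.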


In [None]:
# create an entry to store the data
def store_results(name, y_pred, y_train, X_train, y_pred_test, y_test, X_test, model, folds, norm, Alpha):

    """Create an entry to add to the results dataframe.
    name: name of the model
    y_pred, y_pred_test: kept for compatibility; predictions are recomputed after fitting
    y_train, y_test: true values of y
    X_train, X_test: features
    model: model to be fit
    folds: number of folds in cv (the entry below stores the first 5 fold scores)
    norm: L1 or L2
    Alpha: value of the regularization parameter"""
    model.fit(X_train,y_train)
    y_pred = model.predict(X_train) # predictions on the train set.
    r2_val = r2_score(y_train,y_pred)
    r2_adj_val=1 - (((len(X_train.index) - 1) / (len(X_train.index) - len(X_train.columns) - 1)) * (1 - r2_val))
    # RMSE as sqrt(MSE); the `squared` argument is deprecated in recent scikit-learn releases
    rmse_error = np.sqrt(mean_squared_error(y_train, y_pred))
    y_pred_test = model.predict(X_test) # predictions on the test set.
    r2_val_test = r2_score(y_test,y_pred_test)
    r2_adj_val_test=1 - (((len(X_test.index) - 1) / (len(X_test.index) - len(X_test.columns) - 1)) * (1 - r2_val_test))
    rmse_error_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    crossvalidation = cross_val_score(model, X_train, y_train, cv=folds, n_jobs=-1)
    
    entry = {'Model': [name],
          'Regularization' : [norm],
          'Alpha_value' : [Alpha],
         'R2Score': [r2_val],
         'Adjusted_R2Score': [r2_adj_val],
         'RMSE': [rmse_error],
         'R2Score_Test': [r2_val_test],
         'Adjusted_R2Score_test': [r2_adj_val_test],
         'RMSE_Test': [rmse_error_test],
          'CrossVal_Mean(r2)': [crossvalidation.mean()],           
          'CrossVal1(r2)': [crossvalidation[0]],
          'CrossVal2(r2)': [crossvalidation[1]],
          'CrossVal3(r2)': [crossvalidation[2]],
          'CrossVal4(r2)': [crossvalidation[3]],
          'CrossVal5(r2)': [crossvalidation[4]],
          }


    result = pd.DataFrame(entry)
    return result
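
The adjusted R-squared expression is repeated inline throughout this notebook. A small helper makes the formula easier to read and to check; this is a sketch, and the name `adjusted_r2` is our own rather than something defined elsewhere in the notebook. With `n` samples and `p` features it computes `1 - (1 - R2) * (n - 1) / (n - p - 1)`.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up numbers for illustration: R^2 = 0.50 with 101 rows and 10 features.
example = adjusted_r2(r2=0.50, n_samples=101, n_features=10)
print(example)
```

Inside `store_results`, the call would be `adjusted_r2(r2_val, len(X_train.index), len(X_train.columns))`.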


In [None]:
import numpy as np

In [None]:
model= LinearRegression()
temp = store_results("With 17 Features", y_pred, y_train,X_train, y_pred_test, y_test,X_test, linear_regressor, 5, np.nan, np.nan)
temp


In [None]:
outcomes=temp.copy()

In [None]:
X_train.shape

### Recursive Feature Elimination (RFE)

In [None]:
# Import RFE 
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# List for holding values 
i_list = []
r2_list = []
r2_adj_list=[]
rmse_list = []
cross_val_list = []


In [None]:
model = LinearRegression()
for i in range(3, 18, 1):                             # Performing RFE with 1 step jumps
    rfe = RFE(model , n_features_to_select = i )        # running RFE with i variable output.
    rfe.fit(X_train, y_train)                           # Fit the RFE model 
    col = X_train.columns[rfe.support_]                 # identify the columns selected by the RFE model
    linear_regressor = LinearRegression()             
    linear_regressor.fit(X_train[col],y_train)          # Train a new linear model with RFE selected columns 
    crossvalidation = cross_val_score(linear_regressor, X_train[col], y_train, cv=5, n_jobs=-1).mean() # Calculate the mean cross-validation score for the new model
    # Predictions on the train set. 
    y_pred = linear_regressor.predict(X_train[col])
    r2_val = r2_score(y_train,y_pred)                    # find R-squared,adjusted R-squared  and RMSE for the new model
    r2_adj_val=1 - (((len(X_train[col].index) - 1) / (len(X_train[col].index) - len(X_train[col].columns) - 1)) * (1 - r2_score(y_train,y_pred)))
    rmse_error = mean_squared_error (y_train, y_pred, squared = False)
    # maintain a list for performance to analyse in future
    i_list.append(i)
    r2_list.append(r2_val)
    r2_adj_list.append(r2_adj_val)
    rmse_list.append(rmse_error)
    cross_val_list.append(crossvalidation)
    # print the outputs 
    print (i)
    print ("R2 score for the model is :",r2_val )
    print("Adjusted_R2 :",r2_adj_val)
    print ("RMSE error for the model is :",rmse_error )
    print ("Mean Cross Validation Score (r2) :",crossvalidation )
    print ("="*70)


In [None]:
# Variation of r2 with number of features 
import seaborn as sns
sns.lineplot( x=i_list, y=r2_list)


In [None]:
# Variation of Adjusted r2 with number of features 
sns.lineplot( x=i_list, y=r2_adj_list)


In [None]:
# Variation of RMSE with increasing number of features 
sns.lineplot( x=i_list, y=rmse_list)


In [None]:
model = LinearRegression()
i = 4                                              # Number of features for RFE to keep
rfe = RFE(model , n_features_to_select = i )        # running RFE with i variable output.
rfe.fit(X_train, y_train)                           # Fit the RFE model 
col = X_train.columns[rfe.support_]                 # identify the columns selected by the RFE model
linear_regressor = LinearRegression()             
linear_regressor.fit(X_train[col],y_train)          # Train a new linear model with RFE selected columns 
crossvalidation = cross_val_score(linear_regressor, X_train[col], y_train, cv=5, n_jobs=-1).mean() # Calculate the mean cross-validation score for the new model
# predictions on the train set. 
y_pred = linear_regressor.predict(X_train[col])
r2_val = r2_score(y_train,y_pred)                    # find R-squared,adjusted R-squared  and RMSE for the new model
r2_adj_val=1 - (((len(X_train[col].index) - 1) / (len(X_train[col].index) - len(X_train[col].columns) - 1)) * (1 - r2_score(y_train,y_pred)))
rmse_error = mean_squared_error (y_train, y_pred, squared = False)
# maintain a list for performance to analyse in future
i_list.append(i)
r2_list.append(r2_val)
r2_adj_list.append(r2_adj_val)
rmse_list.append(rmse_error)
cross_val_list.append(crossvalidation)
# print the outputs 
print (i)
print ("R2 score for the model is :",r2_val )
print("Adjusted_R2 :",r2_adj_val)
print ("RMSE error for the model is :",rmse_error )
print ("Mean Cross Validation Score (r2) :",crossvalidation )


In [None]:
rfe_cols = col
col


In [None]:
# columns which are dropped
set(X_train.columns)-set(col)


#### As observed, removing these features has a negligible impact on the model score, changing it by only a few hundredths. Consequently, we retain all the columns for now, since we are not persuaded that removing them would be beneficial.
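
The manual loop over feature counts can also be automated with cross-validated RFE. The sketch below uses scikit-learn's `RFECV` on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration and stand in for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for this sketch): only the first two columns carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# RFECV removes one feature per step and keeps the subset with the best mean CV score.
selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2", min_features_to_select=2)
selector.fit(X_demo, y_demo)

print("Number of features kept:", selector.n_features_)
print("Selection mask:", selector.support_)
```

When the cross-validated score is almost flat across subset sizes, as observed here, the chosen subset is not very meaningful, which is consistent with the decision to keep all columns.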



     
# Before dropping, let's analyze VIF

In [None]:
# dropping the rest of the columns from both train and test sets. 
# X_train = X_train[rfe_cols]


In [None]:
corr = X_train.corr()
corr.style.background_gradient(cmap='coolwarm')


### Feature selection with VIF

In [None]:

# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_string())


### We are going to keep features with VIF<5
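
The rule "keep features with VIF < 5" can be applied iteratively: drop the column with the largest VIF, recompute, and repeat. The sketch below does this with NumPy only, regressing each column on the others plus an intercept. As far as we know, statsmodels' `variance_inflation_factor` uses the design matrix as given without adding a constant, so its numbers can differ from this sketch. The function names and the synthetic data are our own assumptions.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    values = X.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(X.columns):
        target = values[:, j]
        design = np.column_stack([np.ones(len(values)), np.delete(values, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - float(resid @ resid) / float(((target - target.mean()) ** 2).sum())
        out[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out).sort_values(ascending=False)

def drop_high_vif(X: pd.DataFrame, limit: float = 5.0) -> pd.DataFrame:
    """Drop the column with the largest VIF until every VIF is below `limit`."""
    X = X.copy()
    while X.shape[1] > 1:
        vif = vif_table(X)
        if vif.iloc[0] < limit:
            break
        X = X.drop(columns=vif.index[0])
    return X

# Synthetic example (made up): the third column is almost the sum of the first two.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_b": a + b + rng.normal(scale=0.01, size=200)})
kept = drop_high_vif(demo, limit=5.0)
print(kept.columns.tolist())
print(vif_table(kept))
```

Dropping one column at a time matters because removing a single collinear feature usually lowers the VIF of the remaining ones.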

In [None]:
X_train = X_train.drop (["Outlet_Size_Small"], axis =1)
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_string())

In [None]:
# def vif_calc (x_train):
#   '''
#   x_train = Training feature set
#   '''
#   vif = pd.DataFrame()
#   vif['Features'] = x_train.columns
#   vif['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
#   vif['VIF'] = round(vif['VIF'], 2)
#   vif = vif.sort_values(by = "VIF", ascending = False)
#   #print(vif.to_string())
#   topval = vif.head(1)
#   return topval

In [None]:
# X_train1 = X_train.copy()

In [None]:
# vif_calc(X_train1)

In [None]:
X_train.shape , X_test.shape

In [None]:
X_test = X_test[X_train.columns]


In [None]:
X_train.shape , X_test.shape

In [None]:
corr = X_train.corr()
corr.style.background_gradient(cmap='coolwarm')


In [None]:
def bad_correlation(df1, threshold=0.6):
    pairs = pd.DataFrame(columns=['feature1','feature2','value'])
    cm = df1.corr() # correlation matrix
    np.fill_diagonal(cm.values, 0) # set diagonal to 0 
    corr = [(cm.index[x], cm.columns[y], cm.iloc[x,y]) for x, y in zip(*np.where(abs(np.tril(cm)) > threshold))] # (feature1, feature2, value) tuples from the lower triangle
    for couple in corr:
        feature1, feature2, value = couple
        #print(f'{feature1} and {feature2} are strongly correlated (threshold = {threshold}) (value = {value})')
        entry = {'feature1': [feature1], 'feature2' : [feature2], 'value' : [value]}
        temp = pd.DataFrame(entry)
        pairs = pd.concat([pairs, temp], ignore_index=True) 
    return pairs


In [None]:
bad_correlation(X_train)

In [None]:
X_train.drop(columns='Outlet_Type_Supermarket Type1', inplace=True)




In [None]:
X_test=X_test[X_train.columns]


In [None]:
X_train.shape , X_test.shape

In [None]:
bad_correlation(X_train)

#### After assessing the Variance Inflation Factor (VIF) and the intercorrelation among the features, we retained a subset of 15 features. We then created separate training and test sets with these 15 features so that models can be fit on this refined dataset in the following steps.

In [None]:
X_train15 = X_train.copy()
X_test15 = X_test.copy()

### No pair of features has a strong correlation (absolute value above the 0.6 threshold)
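
The pairwise check can also be written without an explicit loop by masking the upper triangle of the correlation matrix. This is a sketch of an alternative, not the notebook's `bad_correlation` function; the name `correlated_pairs` and the synthetic frame are our own assumptions.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr()
    # Keep each pair once: the strict upper triangle excludes the diagonal and mirrored entries.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]   # NaN entries, if any, compare as False and are dropped
    return pairs.rename_axis(["feature1", "feature2"]).reset_index(name="value")

# Synthetic example (made up): "b" is nearly a multiple of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
strong_pairs = correlated_pairs(demo)
print(strong_pairs)
```

An empty result from `correlated_pairs(X_train)` would correspond to the observation that no pair exceeds the 0.6 threshold.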

## Let's go back and check with RFE

In [None]:
X_train.shape

In [None]:
# List for holding values 
i_list = []
r2_list = []
r2_adj_list=[]
rmse_list = []
cross_val_list = []

model = LinearRegression()
for i in range(3, 16, 1):                             # Performing RFE with 1 step jumps
    rfe = RFE(model , n_features_to_select = i )        # running RFE with i variable output.
    rfe.fit(X_train, y_train)                           # Fit the RFE model 
    col = X_train.columns[rfe.support_]                 # identify the columns selected by the RFE model
    linear_regressor = LinearRegression()             
    linear_regressor.fit(X_train[col],y_train)          # Train a new linear model with RFE selected columns 
    crossvalidation = cross_val_score(linear_regressor, X_train[col], y_train, cv=5, n_jobs=-1).mean() # Calculate the mean cross-validation score for the new model
    # Predictions on the train set. 
    y_pred = linear_regressor.predict(X_train[col])
    r2_val = r2_score(y_train,y_pred)                    # find R-squared,adjusted R-squared  and RMSE for the new model
    r2_adj_val=1 - (((len(X_train[col].index) - 1) / (len(X_train[col].index) - len(X_train[col].columns) - 1)) * (1 - r2_score(y_train,y_pred)))
    rmse_error = mean_squared_error (y_train, y_pred, squared = False)
    # maintain a list for performance to analyse in future
    i_list.append(i)
    r2_list.append(r2_val)
    r2_adj_list.append(r2_adj_val)
    rmse_list.append(rmse_error)
    cross_val_list.append(crossvalidation)
    # print the outputs 
    print (i)
    print ("R2 score for the model is :",r2_val )
    print("Adjusted_R2 :",r2_adj_val)
    print ("RMSE error for the model is :",rmse_error )
    print ("Mean Cross Validation Score (r2) :",crossvalidation )
    print ("="*70)


In [None]:
# Variation of r2 with number of features 
import seaborn as sns
sns.lineplot( x=i_list, y=r2_list)


#### After applying Recursive Feature Elimination (RFE), we did not observe a significant increase in the model score. Consequently, we proceed with all the features in our analysis.

### Model 2 : Linear regression with selected features

In [None]:
#Let's fit a linear regression model!
linear_regressor = LinearRegression()
linear_regressor.fit(X_train,y_train)

# Cross validation scores for the model. 
crossvalidation = cross_val_score(linear_regressor, X_train, y_train, cv=5, n_jobs=-1)
crossvalidation

# predictions on the train and test sets. 
y_pred = linear_regressor.predict(X_train)
y_pred_test = linear_regressor.predict(X_test)


In [None]:
# evaluation metrics value 
r2_val = r2_score(y_train,y_pred)
r2_adj_val=1 - (((len(X_train.index) - 1) / (len(X_train.index) - len(X_train.columns) - 1)) * (1 - r2_score(y_train,y_pred)))
rmse_error = mean_squared_error (y_train, y_pred, squared = False)
print ("R2 score for the model is :",r2_val )
print("Adjusted_R2 for the model is :",r2_adj_val)
print ("RMSE error for the model is :",rmse_error )

y_pred_test = linear_regressor.predict(X_test)
r2_val_test = r2_score(y_test,y_pred_test)
r2_adj_val_test=1 - (((len(X_test.index) - 1) / (len(X_test.index) - len(X_test.columns) - 1)) * (1 - r2_score(y_test,y_pred_test)))

rmse_error_test = mean_squared_error (y_test, y_pred_test, squared = False)
print ("R2 score (test) for the model is :",r2_val_test )
print("Adjusted_R2(test) for the model is :",r2_adj_val_test)
print ("RMSE error (test) for the model is :",rmse_error_test )


In [None]:
X_train.shape

In [None]:
model= LinearRegression()
temp = store_results("With 15 Features (VIF and intercorrelation)", y_pred, y_train,X_train, y_pred_test, y_test,X_test, linear_regressor, 5, np.nan, np.nan)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


### Backward Elimination:

In [None]:
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

In [None]:
# Linear regression model using OLS
import statsmodels.api as sm
X1 = sm.add_constant(X_train)
ols = sm.OLS(y_train,X1)
lr = ols.fit()

print(lr.summary())


In [None]:
X_train.shape , y_train.shape

In [None]:
type(X_train)

In [None]:
type(y_train)

In [None]:
X_train.isnull().sum()

In [None]:
X_train

In [None]:
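# backward feature elimination: repeatedly drop the feature with the largest p-value above 0.05
# ('const' is excluded from the candidates so that the intercept is always kept)
pvalues = lr.pvalues.drop('const')
while pvalues.max() > 0.05:
    worst = pvalues.idxmax()
    print(f"Adjusted R-squared is {lr.rsquared_adj}")
    print(f"{worst} with p-value = {pvalues.max()} was dropped\n")
    X1.drop(worst, axis=1, inplace=True)
    ols = sm.OLS(y_train, X1)
    lr = ols.fit()
    pvalues = lr.pvalues.drop('const')
print(lr.summary())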


In [None]:
X1.drop('const',axis=1,inplace=True)

In [None]:
X_train = X1.copy()

X_test = X_test[X_train.columns]


In [None]:
len(X_train.columns)

In [None]:
model= LinearRegression()

In [None]:
temp = store_results("After Backward Elimination(7Features)", y_pred, y_train,X_train, y_pred_test, y_test,X_test, model, 5, np.nan, np.nan)
temp 

outcomes = pd.concat([outcomes, temp], ignore_index=True)
outcomes


In [None]:
#Displaying the Intercept
print(model.intercept_)


In [None]:
#Coefficient
coeff_df = pd.DataFrame(model.coef_.T, X_test.columns, columns=['Coefficient'])
print('coeff=',coeff_df)


In [None]:
y_pred = model.predict(X_test)

In [None]:
plt.scatter(y_test, y_pred, edgecolor='black')
plt.xlabel("y_test")
plt.ylabel("y_pred")


In [None]:
# Absolute errors on the test set (the mean of this summary is the MAE)
(abs(y_test-y_pred)).describe()


### Linear model with ridge (L2) regularization

In [None]:
X_train17.shape , X_test17.shape

In [None]:
X_train=X_train17.copy()
X_test=X_test17.copy()

In [None]:
from sklearn.linear_model import Ridge
from sklearn import linear_model
ridge_model_15 = linear_model.Ridge(alpha = 0.01, random_state=42)
ridge_model = linear_model.Ridge(alpha = 0.01, random_state=42)
# Linear least squares with L2 regularization.
#
# Minimizes the objective function:
#
# ||y - Xw||^2_2 + alpha * ||w||^2_2
ridge_model.fit (X_train, y_train)
ridge_model_15.fit (X_train15, y_train)

# predictions on the train set. 
y_pred= ridge_model.predict(X_train)
y_pred_test = ridge_model.predict(X_test)

y_pred_15 = ridge_model_15.predict(X_train15)
y_pred_test_15 = ridge_model_15.predict(X_test15)

temp = store_results("Ridge-1 with 17 features", y_pred, y_train,X_train, y_pred_test, y_test,X_test, ridge_model, 5, "L2", 0.01)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

temp = store_results("Ridge-1 with 15 features", y_pred_15, y_train,X_train15, y_pred_test_15, y_test,X_test15, ridge_model_15, 5, "L2", 0.01)
outcomes = pd.concat([outcomes, temp], ignore_index=True)
outcomes

In [None]:
ridge_model.coef_

In [None]:
outcomes2=outcomes.copy()
for i in [0.0001, 0.0005, 0.001, 0.005, 0.05, 0.1, 0.5, 1, 5]:
  ridge_model = linear_model.Ridge(alpha = i, random_state=42)
  ridge_model.fit (X_train, y_train)

  # predictions on the train set. 
  y_pred = ridge_model.predict(X_train)
  y_pred_test = ridge_model.predict(X_test)
  temp = store_results("Ridge with 17 features ", y_pred, y_train,X_train, y_pred_test, y_test,X_test, ridge_model, 5, "L2", i)
  outcomes2 = pd.concat([outcomes2, temp], ignore_index=True)

outcomes2


In [None]:
outcomes2=outcomes.copy()
for i in [0.0001, 0.0005, 0.001, 0.005, 0.05, 0.1, 0.5, 1, 5]:
  ridge_model = linear_model.Ridge(alpha = i, random_state=42)
  ridge_model.fit (X_train15, y_train)

  # predictions on the train set. 
  y_pred = ridge_model.predict(X_train15)
  y_pred_test = ridge_model.predict(X_test15)
  temp = store_results("Ridge 15 features ", y_pred, y_train,X_train15, y_pred_test, y_test,X_test15, ridge_model, 5, "L2", i)
  outcomes2 = pd.concat([outcomes2, temp], ignore_index=True)

outcomes2

#### As observed, varying the alpha parameter over a wide range did not produce any discernible increase in the model scores, which remained nearly constant across the tested alpha values.
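
One possible reason for the flat scores is that these alpha values are tiny compared with the scale of `X^T X` once there are many rows. The sketch below solves the ridge objective `||y - Xw||^2_2 + alpha * ||w||^2_2` in closed form on synthetic data and prints the coefficient norm for several alphas. The helper `ridge_closed_form` and the data are our own assumptions, and this is not how scikit-learn necessarily solves the problem internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha * ||w||^2 on centred data (intercept not penalised)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - X_mean @ w
    return w, intercept

# Synthetic data (made up for illustration): 500 rows, 4 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=500)

alphas = [0.0001, 0.01, 1.0, 5.0, 1000.0]
norms = [float(np.linalg.norm(ridge_closed_form(X_demo, y_demo, a)[0])) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:<8} ||w|| = {n:.4f}")
```

In this synthetic setting the coefficients barely move until alpha becomes comparable to the number of rows, which would explain why a grid from 0.0001 to 5 looks flat; whether the same holds for the BigMart features depends on their scale.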

### Optimized Ridge 

In [None]:
from sklearn.linear_model import Ridge, RidgeCV

ridge_cv = RidgeCV(alphas = list(np.arange(1, 20, 0.2)))
ridge_cv_15 = RidgeCV(alphas = list(np.arange(1, 20, 0.2)))
ridge_cv.fit(X_train, y_train)
ridge_cv_15.fit(X_train15, y_train)
# what alpha value did the algorithm choose
alpha = ridge_cv.alpha_
alpha_15 = ridge_cv_15.alpha_
alpha , alpha_15


In [None]:
ridge_opti = Ridge(alpha = alpha)
ridge_opti.fit(X_train, y_train)

ridge_opti_15 = Ridge(alpha = alpha_15)
ridge_opti_15.fit(X_train15, y_train)


In [None]:
X_train.shape , X_test.shape

In [None]:
y_pred = ridge_opti.predict(X_train)
y_pred_test = ridge_opti.predict(X_test)



In [None]:

y_pred_15 = ridge_opti_15.predict(X_train15)
y_pred_test_15 = ridge_opti_15.predict(X_test15)

In [None]:
temp = store_results("Ridge-optimised with 17 features ", y_pred, y_train,X_train, y_pred_test, y_test,X_test, ridge_opti, 5, "L2", alpha)
outcomes = pd.concat([outcomes, temp], ignore_index=True)
outcomes

In [None]:
temp = store_results("Ridge-optimised with 15 features ", y_pred_15, y_train,X_train15, y_pred_test_15, y_test,X_test15, ridge_opti_15, 5, "L2", alpha_15)
outcomes = pd.concat([outcomes, temp], ignore_index=True)
outcomes

In [None]:
ridge_opti.coef_  , ridge_opti_15.coef_

### Linear regression with Lasso (L1) Regularization


In [None]:
from sklearn.linear_model import Lasso, LassoCV

In [None]:
lasso_model = linear_model.Lasso(alpha = 0.01, random_state=42)
lasso_model.fit (X_train, y_train)


In [None]:
lasso_model15 = linear_model.Lasso(alpha = 0.01, random_state=42)
lasso_model15.fit (X_train15, y_train)

In [None]:
# predictions on the train set. 
y_pred = lasso_model.predict(X_train)
y_pred_test = lasso_model.predict(X_test)

y_pred15 = lasso_model15.predict(X_train15)
y_pred_test_15 = lasso_model15.predict(X_test15)

temp = store_results("Lasso with 17 features ", y_pred, y_train,X_train, y_pred_test, y_test,X_test, lasso_model, 5, "L1", 0.01)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

temp = store_results("Lasso with 15 features ", y_pred15, y_train,X_train15, y_pred_test_15, y_test,X_test15, lasso_model15, 5, "L1", 0.01)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


In [None]:
lasso_model.coef_ , lasso_model15.coef_

In [None]:
outcomes2=outcomes.copy()
for i in [0.0001, 0.0005, 0.001, 0.005, 0.05, 0.1, 0.5, 1, 5, 10, 15, 20]:
  lasso_model = linear_model.Lasso(alpha = i, random_state=42)
  lasso_model.fit (X_train, y_train)

  # predictions on the train and test sets. 
  y_pred = lasso_model.predict(X_train)
  y_pred_test = lasso_model.predict(X_test)

  # Count the number of zeros in the coeff list. 
  zeros = list(lasso_model.coef_).count(0)
  print (i)
  print ("The number of zero coeff in the model are :", zeros)
  print ("=======================================================")

  temp = store_results("Lasso", y_pred, y_train,X_train, y_pred_test, y_test,X_test, lasso_model, 5, "L1", i)
#   outcomes2 = outcomes2.append (temp)
  outcomes2 = pd.concat([outcomes2, temp], ignore_index=True)

outcomes2


##### Scanning this discrete set of alpha values does not significantly change the outcome.
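
To see how alpha drives coefficients to exactly zero, which is what the loop above counts, here is a minimal coordinate-descent sketch of the Lasso objective `(1 / (2 n)) * ||y - Xw||^2_2 + alpha * ||w||_1`. It runs on synthetic data; the function names and the data are our own assumptions, and it is meant as an illustration rather than a replacement for `sklearn.linear_model.Lasso`.

```python
import numpy as np

def soft_threshold(value, limit):
    """Shrink `value` towards zero by `limit`, returning 0 inside [-limit, limit]."""
    return np.sign(value) * max(abs(value) - limit, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / (2 n)) * ||y - Xw||^2 + alpha * ||w||_1 on centred data."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # assumes no constant columns
    for _ in range(n_iter):
        for j in range(p):
            partial_residual = y - X @ w + X[:, j] * w[j]   # residual ignoring feature j
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Synthetic data (made up): only the first two of five features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in [0.01, 0.5, 5.0]:
    w = lasso_coordinate_descent(X_demo, y_demo, alpha)
    zero_counts[alpha] = int(np.sum(w == 0))
    print(alpha, "zero coefficients:", zero_counts[alpha])
```

If the zero count in the notebook stays the same across the tested alphas, the scores will also barely change, because the set of active features is the same.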

### Optimized lasso 

In [None]:
from sklearn.linear_model import Lasso, LassoCV

lasso_cv = LassoCV(alphas = None, cv = 5, max_iter = 100000)
lasso_cv.fit(X_train, y_train)

lasso_cv15 = LassoCV(alphas = None, cv = 5, max_iter = 100000)
lasso_cv15.fit(X_train15, y_train)


In [None]:
# what alpha value did the algorithm choose
alpha = lasso_cv.alpha_
alpha15 = lasso_cv15.alpha_
alpha,alpha15


In [None]:
lasso_opti = Lasso(alpha = alpha)
lasso_opti.fit(X_train, y_train)

In [None]:
lasso_opti15 = Lasso(alpha = alpha15)
lasso_opti15.fit(X_train15, y_train)

In [None]:
y_pred = lasso_opti.predict(X_train)
y_pred_test = lasso_opti.predict(X_test)
temp = store_results("Lasso-optimised with 17 features ", y_pred, y_train,X_train, y_pred_test, y_test,X_test, lasso_opti, 5, "L1", alpha)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)


y_pred15 = lasso_opti15.predict(X_train15)
y_pred_test15 = lasso_opti15.predict(X_test15)
temp = store_results("Lasso-optimised with 15 features ", y_pred15, y_train,X_train15, y_pred_test15, y_test,X_test15, lasso_opti15, 5, "L1", alpha15)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)



In [None]:
outcomes

In [None]:
lasso_opti.coef_ ,lasso_opti15.coef_

#### Model 5: Random Forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X_train, y_train)

feature_importances = model_rf.feature_importances_
features = X_train.columns
df = pd.DataFrame({'features': features, 'importance': feature_importances})
df.sort_values(by='importance', ascending = False)


In [None]:
df = df[df.importance > 0.005]
rf_cols = []
for col in list(X_train.columns):
  if col in list(df.features):
    rf_cols.append(col)


In [None]:
model_rf = RandomForestRegressor( random_state=42)
model_rf.fit(X_train[rf_cols], y_train)
len(rf_cols)


In [None]:
predict_rf = model_rf.predict(X_train[rf_cols])

In [None]:
predict_rf_test = model_rf.predict(X_test[rf_cols])

In [None]:
temp = store_results("RF with 17 features", predict_rf, y_train,X_train[rf_cols], predict_rf_test, y_test,X_test[rf_cols], model_rf, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


#### RF for 15 features 

In [None]:
model_rf15 = RandomForestRegressor(random_state=42)
model_rf15.fit(X_train15, y_train)

feature_importances15 = model_rf15.feature_importances_
features15 = X_train15.columns
df15 = pd.DataFrame({'features': features15, 'importance': feature_importances15})
df15.sort_values(by='importance', ascending = False)


In [None]:
df15 = df15[df15.importance > 0.005]
rf_cols15 = []
for col in list(X_train15.columns):
  if col in list(df15.features):
    rf_cols15.append(col)


In [None]:
model_rf15 = RandomForestRegressor( random_state=42)
model_rf15.fit(X_train15[rf_cols15], y_train)
len(rf_cols15)


In [None]:
predict_rf15 = model_rf15.predict(X_train15[rf_cols15])

In [None]:
predict_rf_test15 = model_rf15.predict(X_test15[rf_cols15])

In [None]:
temp = store_results("RF with 15 features", predict_rf15, y_train,X_train15[rf_cols15], predict_rf_test15, y_test,X_test15[rf_cols15], model_rf15, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


#### Random Forest regression excels primarily in R2 score and adjusted R2 score on the training set, for both the 15-feature and the 17-feature data sets. However, the other evaluation metrics for both the training and test datasets indicate comparatively lower performance.
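
A high training score combined with a weaker test score is the usual signature of an overfit forest. The sketch below measures that gap on synthetic data for an unconstrained forest and for a depth-limited one; the data and the helper `train_test_gap` are made up for illustration and do not reproduce the BigMart results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic noisy data (an assumption for this sketch).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 6))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=1.0, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

def train_test_gap(model):
    """Fit the model and return (train R2, test R2, train R2 - test R2)."""
    model.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    return train_r2, test_r2, train_r2 - test_r2

deep = train_test_gap(RandomForestRegressor(n_estimators=50, random_state=42))
shallow = train_test_gap(RandomForestRegressor(n_estimators=50, max_depth=3, random_state=42))
print("unconstrained forest (train, test, gap):", deep)
print("max_depth=3 forest   (train, test, gap):", shallow)
```

Limiting `max_depth` or raising `min_samples_split`, as the tuning cells do, typically narrows the gap at the cost of a lower training score.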

### Tuning Random forest regressor

In [None]:
model_rf.get_params()

In [None]:
from sklearn.model_selection import GridSearchCV
estimator = RandomForestRegressor()
param_grid = { 
        "n_estimators"      : [5,8,10,12,15],
        "min_samples_split" : [5,8,10,15,20],
        "max_depth"         : [3,4,5]
        }

grid = GridSearchCV(estimator, param_grid, n_jobs=-1, cv=5)

grid.fit(X_train[rf_cols], y_train)

print(grid.best_score_)
print(grid.best_params_)
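
The following cells appear to retype the printed parameters by hand into a new estimator. Since the grid here has no fixed `random_state`, the printed values can change between runs, so reading them from the fitted search is less error-prone. A minimal sketch on synthetic data, where `X_demo`, `y_demo` and the small grid are our own assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train[rf_cols] and y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),   # fixed seed so repeated runs agree
    param_grid={"n_estimators": [5, 10], "max_depth": [3, 5]},
    cv=5,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already refit on all the data passed to fit.
tuned_model = search.best_estimator_
print(search.best_params_)
print(tuned_model.get_params()["max_depth"], tuned_model.get_params()["n_estimators"])
```

In the notebook this would amount to passing `grid.best_estimator_` to `store_results` instead of constructing the estimator with literal values.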


In [None]:
estimator = RandomForestRegressor(n_estimators = 8, min_samples_split = 5, max_depth= 5, n_jobs=-1)
estimator.fit(X_train[rf_cols],y_train)
y_predict_train = estimator.predict(X_train[rf_cols])
y_predict_test = estimator.predict(X_test[rf_cols])
temp = store_results("Tuned RF with 17 features ", y_predict_train, y_train,X_train[rf_cols], y_predict_test, y_test,X_test[rf_cols], estimator, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


### Tuning Random forest regressor for training set with 15 features

In [None]:
model_rf15.get_params()

In [None]:
estimator15 = RandomForestRegressor()
param_grid15 = { 
        "n_estimators"      : [5,8,10,12,15],
        "min_samples_split" : [5,8,10,15,20],
        "max_depth"         : [3,4,5]
        }

grid15 = GridSearchCV(estimator15, param_grid15, n_jobs=-1, cv=5)

grid15.fit(X_train15[rf_cols15], y_train)

print(grid15.best_score_)
print(grid15.best_params_)


In [None]:
estimator15 = RandomForestRegressor(n_estimators = 15, min_samples_split = 10, max_depth= 5, n_jobs=-1)
estimator15.fit(X_train15[rf_cols15],y_train)
y_predict_train15 = estimator15.predict(X_train15[rf_cols15])
y_predict_test15 = estimator15.predict(X_test15[rf_cols15])
temp = store_results("Tuned RF with 15 features ", y_predict_train15, y_train,X_train15[rf_cols15], y_predict_test15, y_test,X_test15[rf_cols15], estimator15, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


### RF optimised

In [None]:
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

param_grid = {'max_depth': [3, 5, 10],'min_samples_split': [2, 5, 10]}
base_estimator = RandomForestRegressor(random_state=0)
sh = HalvingGridSearchCV(base_estimator, param_grid, cv=5,factor=2, resource='n_estimators',max_resources=30).fit(X_train[rf_cols], y_train)
sh.best_estimator_




In [None]:
RF_opti = RandomForestRegressor(n_estimators = 24, min_samples_split = 10, max_depth= 10,random_state=0, n_jobs=-1)
RF_opti.fit(X_train[rf_cols],y_train)
y_predict_train = RF_opti.predict(X_train[rf_cols])
y_predict_test = RF_opti.predict(X_test[rf_cols])
temp = store_results("RF_optimised with 17 features ", y_predict_train, y_train,X_train[rf_cols], y_predict_test, y_test,X_test[rf_cols], RF_opti, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

In [None]:
outcomes

#### RF optimised for 15 features

In [None]:
param_grid15 = {'max_depth': [3, 5, 10],'min_samples_split': [2, 5, 10]}
base_estimator15 = RandomForestRegressor(random_state=0)
sh15 = HalvingGridSearchCV(base_estimator15, param_grid15, cv=5,factor=2, resource='n_estimators',max_resources=30).fit(X_train15[rf_cols15], y_train)
sh15.best_estimator_


In [None]:

RF_opti15 = RandomForestRegressor(n_estimators = 24, min_samples_split = 10, max_depth= 10,random_state=0, n_jobs=-1)
RF_opti15.fit(X_train15[rf_cols15],y_train)
y_predict_train15 = RF_opti15.predict(X_train15[rf_cols15])
y_predict_test15 = RF_opti15.predict(X_test15[rf_cols15])
temp = store_results("RF_optimised with 15 features ", y_predict_train15, y_train,X_train15[rf_cols15], y_predict_test15, y_test,X_test15[rf_cols15], RF_opti15, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

In [None]:
outcomes

### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostRegressor
Ada= AdaBoostRegressor(random_state=0)

Ada.fit(X_train, y_train)

# Make predictions on the training set
y_predict_train = Ada.predict(X_train)

# Make predictions on the test set (assuming X_test is your test data)
y_predict_test = Ada.predict(X_test)


temp = store_results("AdaBoost with 17 features ", y_predict_train, y_train,X_train, y_predict_test, y_test,X_test, Ada, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)
outcomes


In [None]:

Ada15= AdaBoostRegressor(random_state=0)

Ada15.fit(X_train15, y_train)

# Make predictions on the training set
y_predict_train15 = Ada15.predict(X_train15)

# Make predictions on the test set (X_test15 is the 15-feature test data)
y_predict_test15 = Ada15.predict(X_test15)


temp = store_results("AdaBoost with 15 features ", y_predict_train15, y_train,X_train15, y_predict_test15, y_test,X_test15, Ada15, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)
outcomes


### Tuning Adaboost

In [None]:
from sklearn.tree import DecisionTreeRegressor
param_grid={'n_estimators':range(10,110,10)}
clf=GridSearchCV(AdaBoostRegressor(DecisionTreeRegressor(max_depth=4)),param_grid)
clf.fit(X_train,y_train)
clf.best_params_
ADA_Tuned=clf.best_estimator_
temp = store_results("Tuned AdaBoost with 17 features ", y_predict_train, y_train,X_train, y_predict_test, y_test,X_test, ADA_Tuned, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)



In [None]:
outcomes

In [None]:
param_grid15={'n_estimators':range(10,110,10)}
clf15=GridSearchCV(AdaBoostRegressor(DecisionTreeRegressor(max_depth=4)),param_grid15)
clf15.fit(X_train15,y_train)


ADA_Tuned15=clf15.best_estimator_
temp = store_results("Tuned AdaBoost with 15 features ", y_predict_train15, y_train,X_train15, y_predict_test15, y_test,X_test15, ADA_Tuned15, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes

In [None]:
clf.best_estimator_ , clf15.best_estimator_

### XGBoost regressor


In [None]:

import xgboost as xgb
# Initialize and train the XGBoost regressor
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)


xg_reg.fit(X_train, y_train)

y_predict_train = xg_reg.predict(X_train)

y_predict_test = xg_reg.predict(X_test)

temp = store_results("XGBoost with 17 features ", y_predict_train, y_train,X_train, y_predict_test, y_test,X_test, xg_reg, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


In [None]:
# Define the model
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror')

# Define the hyperparameter grid
param_grid = {
    'colsample_bytree': [0.3, 0.5, 0.7],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [4, 5, 6],
    'alpha': [1, 5, 10],
    'n_estimators': [10, 50, 100]
}
# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=xg_reg, param_grid=param_grid, 
                           scoring='neg_mean_squared_error', cv=5, verbose=1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

# Extract the best estimator
XGBoost_Tuned = grid_search.best_estimator_

# Make predictions on the training set
y_predict_train = XGBoost_Tuned.predict(X_train)

# Make predictions on the test set
y_predict_test = XGBoost_Tuned.predict(X_test)


temp = store_results("Tuned XGBoost with 17 features", y_predict_train, y_train,X_train, y_predict_test, y_test,X_test, XGBoost_Tuned, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


In [None]:
xg_reg15 = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)

xg_reg15.fit(X_train15, y_train)

y_predict_train15 = xg_reg15.predict(X_train15)

y_predict_test15 = xg_reg15.predict(X_test15)

temp = store_results("XGBoost with 15 features ", y_predict_train15, y_train,X_train15, y_predict_test15, y_test,X_test15, xg_reg15, 5, np.nan, np.nan)
# outcomes = outcomes.append (temp)
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes

In [None]:
# Define the model
xg_reg15 = xgb.XGBRegressor(objective ='reg:squarederror')

# Define the hyperparameter grid
param_grid = {
    'colsample_bytree': [0.3, 0.5, 0.7],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [4, 5, 6],
    'alpha': [1, 5, 10],
    'n_estimators': [10, 50, 100]
}
# Grid search with 5-fold cross-validation
grid_search15 = GridSearchCV(estimator=xg_reg15, param_grid=param_grid, 
                           scoring='neg_mean_squared_error', cv=5, verbose=1)

grid_search15.fit(X_train15, y_train)

best_params15 = grid_search15.best_params_

# Extract the best estimator
XGBoost_Tuned15 = grid_search15.best_estimator_

# Make predictions on the training set
y_predict_train15 = XGBoost_Tuned15.predict(X_train15)

# Make predictions on the test set
y_predict_test15 = XGBoost_Tuned15.predict(X_test15)

temp = store_results("Tuned XGBoost with 15 features", y_predict_train15, y_train,X_train15, y_predict_test15, y_test,X_test15, XGBoost_Tuned15, 5, np.nan, np.nan)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to add the new row
outcomes = pd.concat([outcomes, temp], ignore_index=True)

outcomes


### Plotting results:


In [None]:
# Plotting
plt.figure(figsize=(20, 12))


### Plotting all models:
- Model performance comparison (Train and Test R-squared)

In [None]:
models = outcomes['Model']
train_r2 = outcomes['R2Score']
test_r2 = outcomes['R2Score_Test']

bar_height = 0.35

y = np.arange(len(models))

fig, ax = plt.subplots(figsize=(10, 6))

plt.barh(y - bar_height/2, train_r2, bar_height, label='Train R-squared', color='skyblue')

plt.barh(y + bar_height/2, test_r2, bar_height, label='Test R-squared', color='lightcoral')

plt.yticks(y, models)
plt.ylabel('Model')
plt.xlabel('R-squared')
plt.title('Model performance comparison (Train and Test R-squared)')
plt.legend()

plt.tight_layout()
plt.show()


In [None]:
models = outcomes['Model']
train_adr2 = outcomes['Adjusted_R2Score']
test_adr2 = outcomes['Adjusted_R2Score_test']

bar_height = 0.35

y = np.arange(len(models))

fig, ax = plt.subplots(figsize=(10, 6))

plt.barh(y - bar_height/2, train_adr2, bar_height, label='Train Adjusted_R2Score', color='skyblue')

plt.barh(y + bar_height/2, test_adr2, bar_height, label='Test Adjusted_R2Score', color='lightcoral')

plt.yticks(y, models)
plt.ylabel('Model')
plt.xlabel('Adjusted_R2Score')
plt.title('Model performance comparison (Train and Test Adjusted_R2Score)')
plt.legend()

plt.tight_layout()
plt.show()
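
The adjusted R-squared plotted above penalizes R-squared for the number of predictors, which matters when comparing the 17-, 15-, and 7-feature models. Below is a minimal sketch of the standard formula, assuming `n_samples` observations and `n_features` predictors. The function name `adjusted_r2` and the sample size of 1000 are illustrative assumptions; they are not taken from `store_results` or from the dataset.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R-squared, hypothetical sample size of 1000 rows:
# the model with more features receives the larger penalty.
print(adjusted_r2(0.5577, 1000, 17))
print(adjusted_r2(0.5577, 1000, 7))
```

With many rows and few features the penalty is small, so adjusted R-squared stays close to plain R-squared.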


In [None]:
models = outcomes['Model']
train_rmse = outcomes['RMSE']
test_rmse = outcomes['RMSE_Test']

bar_height = 0.35

y = np.arange(len(models))

fig, ax = plt.subplots(figsize=(10, 6))

plt.barh(y - bar_height/2, train_rmse, bar_height, label='Train RMSE', color='skyblue')

plt.barh(y + bar_height/2, test_rmse, bar_height, label='Test RMSE', color='lightcoral')

plt.yticks(y, models)
plt.ylabel('Model Name')
plt.xlabel('RMSE')
plt.title('Model performance comparison (Train and Test RMSE)')
plt.legend()

plt.tight_layout()
plt.show()


### Plotting all models with 17 features based on R2Score:

In [None]:

outcomes_sorted = outcomes[outcomes['Model'].str.lower().str.contains('17 features')].sort_values(by='R2Score', ascending=True)

models = outcomes_sorted['Model']
train_r2 = outcomes_sorted['R2Score']
test_r2 = outcomes_sorted['R2Score_Test']

bar_height = 0.35

y = np.arange(len(models))

fig, ax = plt.subplots(figsize=(10, 6))

plt.barh(y - bar_height/2, train_r2, bar_height, label='Train R-squared', color='skyblue')

plt.barh(y + bar_height/2, test_r2, bar_height, label='Test R-squared', color='lightcoral')

plt.yticks(y, models)
plt.ylabel('Model')
plt.xlabel('R-squared')
plt.title('Model performance comparison (Train and Test R-squared) - 17 Features')
plt.legend()

plt.tight_layout()
plt.show()


### Plotting all models with 15 features based on R2Score:

In [None]:
outcomes_sorted = outcomes[outcomes['Model'].str.lower().str.contains('15 features')].sort_values(by='R2Score', ascending=True)

models = outcomes_sorted['Model']
train_r2 = outcomes_sorted['R2Score']
test_r2 = outcomes_sorted['R2Score_Test']

bar_height = 0.35

y = np.arange(len(models))

fig, ax = plt.subplots(figsize=(10, 6))

plt.barh(y - bar_height/2, train_r2, bar_height, label='Train R-squared', color='skyblue')

plt.barh(y + bar_height/2, test_r2, bar_height, label='Test R-squared', color='lightcoral')

plt.yticks(y, models)
plt.ylabel('Model')
plt.xlabel('R-squared')
plt.title('Model performance comparison (Train and Test R-squared) - 15 Features')
plt.legend()

plt.tight_layout()
plt.show()


In [None]:

outcomes_sorted = outcomes.sort_values(by='R2Score', ascending=True)

plt.barh(data=outcomes_sorted,width='R2Score' ,y= 'Model', color='skyblue')
plt.xlabel('R-squared')
plt.title('Model performance comparison (R-squared)')
plt.tight_layout()
plt.show()

# <span style="color: blue;">Model Evaluation Summary</span>

## <span style="color: green;">Model 0: With 17 Features</span>
- Train R-squared (R2Score): 0.5556
- Test R-squared (R2Score_Test): 0.5577
- Comments: The model has moderate predictive power on the training set and generalizes well to the test set, as indicated by the nearly identical test R-squared.

## <span style="color: green;">Model 1: With 15 Features (VIF and Intercorrelation)</span>
- Train R-squared (R2Score): 0.4512
- Test R-squared (R2Score_Test): 0.4562
- Comments: The model's performance is lower compared to the one with 17 features, and there might be room for improvement.

## <span style="color: green;">Model 2: After Backward Elimination (7 Features)</span>
- Train R-squared (R2Score): 0.4503
- Test R-squared (R2Score_Test): 0.4583
- Comments: Similar to Model 1, but with fewer features. There is a risk of underfitting due to the simplicity of the model.

## <span style="color: green;">Model 3: Ridge-1 with 17 Features</span>
- Train R-squared (R2Score): 0.5556
- Test R-squared (R2Score_Test): 0.5577
- Comments: Ridge regularization doesn't seem to significantly impact the model's performance.

## <span style="color: green;">Model 4: Ridge-1 with 15 Features</span>
- Train R-squared (R2Score): 0.4512
- Test R-squared (R2Score_Test): 0.4562
- Comments: Similar to Model 1, Ridge regularization doesn't seem to improve performance.

## <span style="color: green;">Model 5: Ridge-optimized with 17 Features</span>
- Train R-squared (R2Score): 0.5556
- Test R-squared (R2Score_Test): 0.5576
- Comments: The optimized Ridge model performs similarly to the basic Ridge model.

## <span style="color: green;">Model 6: Ridge-optimized with 15 Features</span>
- Train R-squared (R2Score): 0.4512
- Test R-squared (R2Score_Test): 0.4562
- Comments: Similar to Model 4, the optimized Ridge model doesn't significantly improve performance.

## <span style="color: green;">Model 7: Lasso with 17 Features</span>
- Train R-squared (R2Score): 0.5556
- Test R-squared (R2Score_Test): 0.5577
- Comments: Lasso regularization doesn't seem to impact the model's performance.

## <span style="color: green;">Model 8: Lasso with 15 Features</span>
- Train R-squared (R2Score): 0.4512
- Test R-squared (R2Score_Test): 0.4562
- Comments: Similar to Model 1, Lasso regularization doesn't significantly improve performance.

## <span style="color: green;">Model 9: Lasso-optimized with 17 Features</span>
- Train R-squared (R2Score): 0.5553
- Test R-squared (R2Score_Test): 0.5578
- Comments: The optimized Lasso model performs similarly to the basic Lasso model.

## <span style="color: green;">Model 10: Lasso-optimized with 15 Features</span>
- Train R-squared (R2Score): 0.4512
- Test R-squared (R2Score_Test): 0.4562
- Comments: Similar to Model 8, the optimized Lasso model doesn't significantly improve performance.

## <span style="color: green;">Model 11: Random Forest (RF) with 17 Features</span>
- Train R-squared (R2Score): 0.9147
- Test R-squared (R2Score_Test): 0.5216
- Comments: There's a significant difference between training and test R-squared, suggesting potential overfitting on the training set.

## <span style="color: green;">Model 12: RF with 15 Features</span>
- Train R-squared (R2Score): 0.9057
- Test R-squared (R2Score_Test): 0.4710
- Comments: Similar to Model 11, overfitting is observed.

## <span style="color: green;">Model 13: Tuned RF with 17 Features</span>
- Train R-squared (R2Score): 0.4589
- Test R-squared (R2Score_Test): 0.4624
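- Comments: Tuning removes the large train/test gap seen in Model 11 (train and test R-squared are now within 0.004 of each other), but test R-squared falls from 0.5216 to 0.4624. The tuned model no longer overfits, yet it is less accurate on the test set.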

## <span style="color: green;">Model 14: Tuned RF with 17 Features</span>
- Train R-squared (R2Score): 0.4807
- Test R-squared (R2Score_Test): 0.4762
- Comments: Train and test R-squared differ by less than 0.005, so there is no sign of overfitting. However, test R-squared remains below that of the untuned RF (0.5216).

## <span style="color: green;">Model 15: Tuned RF with 15 Features</span>
- Train R-squared (R2Score): 0.4762
- Test R-squared (R2Score_Test): 0.4702
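- Comments: Similar to Model 14: the train/test gap is small (0.006), so there is no sign of overfitting. Test R-squared is essentially the same as for the untuned RF with 15 features (0.4710).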

## <span style="color: green;">Model 16: RF-optimized with 17 Features</span>
- Train R-squared (R2Score): 0.6758
- Test R-squared (R2Score_Test): 0.5118
- Comments: There's a significant difference between training and test R-squared, suggesting potential overfitting on the training set.

## <span style="color: green;">Model 17: RF-optimized with 17 Features</span>
- Train R-squared (R2Score): 0.6537
- Test R-squared (R2Score_Test): 0.5148
- Comments: Similar to Model 16, overfitting is observed.

## <span style="color: green;">Model 18: RF-optimized with 15 Features</span>
- Train R-squared (R2Score): 0.6537
- Test R-squared (R2Score_Test): 0.5148
- Comments: Similar to Models 16 and 17, overfitting is observed.

## <span style="color: green;">Model 19: AdaBoost with 17 Features</span>
- Train R-squared (R2Score): 0.4568
- Test R-squared (R2Score_Test): 0.4792
- Comments: Test R-squared is slightly higher than train R-squared, so there is no sign of overfitting. However, test performance is below that of the untuned and optimized RF models (Models 11 and 16-18).

## <span style="color: green;">Model 20: AdaBoost with 17 Features</span>
- Train R-squared (R2Score): 0.4568
- Test R-squared (R2Score_Test): 0.4792
- Comments: Similar to Model 19.

## <span style="color: green;">Model 21: AdaBoost with 15 Features</span>
- Train R-squared (R2Score): 0.4104
- Test R-squared (R2Score_Test): 0.4233
- Comments: There is potential underfitting as the performance is lower compared to Models 19 and 20.

## <span style="color: green;">Model 22: Tuned AdaBoost with 17 Features</span>
- Train R-squared (R2Score): 0.5592
- Test R-squared (R2Score_Test): 0.5615
- Comments: The model performs well on both training and test sets, indicating a good balance.

## <span style="color: green;">Model 23: Tuned AdaBoost with 17 Features</span>
- Train R-squared (R2Score): 0.5119
- Test R-squared (R2Score_Test): 0.5187
- Comments: There's a slight difference between training and test R-squared, but the model's performance is reasonable.

## <span style="color: green;">Model 24: Tuned AdaBoost with 15 Features</span>
- Train R-squared (R2Score): 0.4544
- Test R-squared (R2Score_Test): 0.4573
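- Comments: Train and test R-squared are nearly identical, so there is no sign of overfitting. Performance is close to that of the 15-feature model without tuning (Model 1).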

## <span style="color: green;">Model 25: Tuned XGBoost with 17 Features</span>
- Train R-squared (R2Score): 0.6072
- Test R-squared (R2Score_Test): 0.5872
- Comments: The model performs well on both training and test sets, indicating a good balance.

## <span style="color: green;">Model 26: XGBoost with 17 Features</span>
- Train R-squared (R2Score): 0.0573
- Test R-squared (R2Score_Test): 0.0336
- Comments: Both train and test R-squared are close to zero, so the model explains almost none of the variance. This indicates underfitting rather than overfitting.

## <span style="color: green;">Model 27: XGBoost with 17 Features</span>
- Train R-squared (R2Score): 0.0573
- Test R-squared (R2Score_Test): 0.0336
- Comments: Similar to Model 26.

## <span style="color: green;">Model 28: XGBoost with 15 Features</span>
- Train R-squared (R2Score): -0.0854
- Test R-squared (R2Score_Test): -0.0994
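- Comments: Both values are negative, meaning the model does worse than predicting the mean sales value. This indicates severe underfitting rather than overfitting. The configuration uses only 10 boosting rounds (`n_estimators = 10`) at a learning rate of 0.1, which is likely too few.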
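
A negative R-squared, as reported for Model 28, means the predictions have a larger squared error than simply predicting the mean of the target. The sketch below illustrates this with the usual definition, R-squared = 1 - SS_res / SS_tot, in plain Python. The numbers are made up for illustration (they are not BigMart data), and `r2_score_manual` is an illustrative name.

```python
def r2_score_manual(y_true, y_pred):
    """R-squared as 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0, 400.0]   # mean is 250
mean_baseline = [250.0] * 4             # always predict the mean
shrunk = [150.0] * 4                    # constant prediction well below the mean

print(r2_score_manual(y_true, mean_baseline))  # 0.0
print(r2_score_manual(y_true, shrunk))         # negative: worse than the mean baseline
```

Predictions that are systematically too low, as can happen when boosting stops after very few rounds, produce exactly this pattern on both the training and the test set.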

## <span style="color: green;">Model 29: Tuned XGBoost with 15 Features</span>
- Train R-squared (R2Score): 0.6198
- Test R-squared (R2Score_Test): 0.5315
- Comments: Test R-squared is reasonable, but the train/test gap (about 0.09) is larger than for Model 25, suggesting mild overfitting.

# <span style="color: blue;">Summary</span>

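- Tuned AdaBoost (Models 22, 23) and Tuned XGBoost with 17 features (Model 25) strike a good balance between training and test performance. Tuned XGBoost with 15 features (Model 29) shows a larger train/test gap (about 0.09).
- The untuned and optimized Random Forest models (11, 12, 16-18) overfit. The tuned Random Forest models (13-15) show almost no train/test gap, but their test R-squared is no better than that of the untuned models.
- The untuned XGBoost models (26-28) underfit: their R-squared is near zero or negative on both the training and test sets.
- Ridge and Lasso regularization do not significantly change the results.
- Feature selection (Models 0-2) lowers performance: going from 17 features to 15 or 7 reduces test R-squared from about 0.56 to about 0.46.
- Further hyperparameter tuning and feature engineering could improve performance and mitigate overfitting or underfitting. Cross-validation results should also be checked to confirm that the models are robust.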
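
The overfitting and underfitting judgments in this summary come down to comparing train and test R-squared. A rough sketch of that rule of thumb follows, using scores copied from the evaluation above. The thresholds (`max_gap=0.05`, `min_r2=0.2`) and the function name `diagnose_fit` are assumptions chosen for illustration, not values used elsewhere in this analysis.

```python
def diagnose_fit(train_r2, test_r2, max_gap=0.05, min_r2=0.2):
    """Label a model from its train/test R-squared using simple thresholds."""
    # Poor on both sets: the model has not learned the signal
    if train_r2 < min_r2 and test_r2 < min_r2:
        return "underfit"
    # Much better on train than on test: the model memorizes training data
    if train_r2 - test_r2 > max_gap:
        return "overfit"
    return "balanced"

results = {
    "RF with 17 features (Model 11)": (0.9147, 0.5216),
    "Tuned RF with 17 features (Model 13)": (0.4589, 0.4624),
    "Tuned XGBoost with 17 features (Model 25)": (0.6072, 0.5872),
    "XGBoost with 17 features (Model 26)": (0.0573, 0.0336),
}

for name, (train_r2, test_r2) in results.items():
    print(f"{name}: {diagnose_fit(train_r2, test_r2)}")
```

A "balanced" label only says that train and test scores agree; it does not by itself mean the model is accurate, so the test R-squared still has to be compared across models.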





# Model Selection Considerations


## Models with Good Generalization Performance:

### Tuned XGBoost with 17 features (Model 25):
- <span style="color: green;">**High Train R-squared:** 0.6072</span>
- <span style="color: green;">**Test R-squared:** 0.5872</span>
- <span style="color: green;">**Comments:** This model shows good performance on both the training and test sets, indicating a good balance between fitting the training data and generalizing to new, unseen data.</span>

### Tuned AdaBoost with 17 features (Model 22):
- <span style="color: green;">**Train R-squared:** 0.5592</span>
- <span style="color: green;">**Test R-squared:** 0.5615</span>
- <span style="color: green;">**Comments:** Like the XGBoost model, this AdaBoost model exhibits a good balance between training and test performance.</span>

## Models with Potential Overfitting:

### Random Forest (RF) with 17 features (Model 11):
- <span style="color: green;">**High Train R-squared:** 0.9147</span>
- <span style="color: green;">**Lower Test R-squared:** 0.5216</span>
- <span style="color: green;">**Comments:** There's a significant difference between training and test R-squared, suggesting overfitting on the training set.</span>

## Models with Underfitting:

### XGBoost with 17 features (Models 26 and 27):
- <span style="color: green;">**Train R-squared:** 0.0573</span>
- <span style="color: green;">**Test R-squared:** 0.0336</span>
- <span style="color: green;">**Comments:** Both train and test R-squared are close to zero, so these untuned XGBoost models explain almost none of the variance. This is underfitting rather than overfitting.</span>

## Consideration for Model Selection:

Tuned XGBoost with 17 features (Model 25) seems to be a strong candidate as it exhibits good performance on both the training and test sets, suggesting it might generalize well to new data.

It's crucial to consider the application's context and the importance of interpretability. More complex models might provide better predictive performance, but simpler models are often easier to interpret.

Cross-validation results and other performance metrics (e.g., RMSE) should also be considered for a comprehensive evaluation.

Further hyperparameter tuning or ensemble methods could potentially improve model performance.

In conclusion, Tuned XGBoost with 17 features (Model 25) is the strongest candidate: it has the highest test R-squared of the models compared (0.5872) and a train/test gap of only 0.02. The final decision should still take into account the specific requirements and goals of the analysis.
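
The selection reasoning in this section can be sketched as a filter-then-rank step: discard models whose train/test gap is too large, then pick the one with the highest test R-squared. The scores below are copied from the evaluation summary. The 0.05 gap tolerance and the name `pick_best` are illustrative assumptions.

```python
candidates = [
    {"model": "Tuned XGBoost with 17 features", "train_r2": 0.6072, "test_r2": 0.5872},
    {"model": "Tuned AdaBoost with 17 features", "train_r2": 0.5592, "test_r2": 0.5615},
    {"model": "Random Forest (RF) with 17 features", "train_r2": 0.9147, "test_r2": 0.5216},
    {"model": "Tuned XGBoost with 15 features", "train_r2": 0.6198, "test_r2": 0.5315},
]

def pick_best(rows, max_gap=0.05):
    """Return the row with the highest test R-squared among rows with a small gap."""
    eligible = [r for r in rows if r["train_r2"] - r["test_r2"] <= max_gap]
    if not eligible:
        return None  # no model meets the gap tolerance
    return max(eligible, key=lambda r: r["test_r2"])

best = pick_best(candidates)
print(best["model"])  # Tuned XGBoost with 17 features
```

The same function can rank on RMSE or cross-validated scores instead by changing the key, which fits the point above that R-squared alone should not decide the choice.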

# <span style="color:red">Overfitting Check: Random Forest and Untuned XGBoost Models</span>

## <span style="color:blue">Random Forest (RF) with 17 features (Model 11)</span>
- Train R-squared: 0.9147
- Test R-squared: 0.5216
- **Comments:** There's a significant difference between training and test R-squared, suggesting potential overfitting on the training set.

## <span style="color:blue">Random Forest (RF) with 15 features (Model 12)</span>
- Train R-squared: 0.9057
- Test R-squared: 0.4710
- **Comments:** Similar to Model 11, overfitting is observed.

## <span style="color:blue">Tuned RF with 17 features (Models 13 and 14)</span>
### <span style="color:green">Model 13</span>
- Train R-squared: 0.4589
- Test R-squared: 0.4624
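- **Comments:** Tuning removes the large train/test gap seen in Model 11, but test R-squared falls from 0.5216 to 0.4624. This model is not overfitted, but it is less accurate on the test set.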

### <span style="color:green">Model 14</span>
- Train R-squared: 0.4807
- Test R-squared: 0.4762
- **Comments:** Train and test R-squared differ by less than 0.005, so there is no sign of overfitting. However, test R-squared remains below that of the untuned RF (0.5216).

## <span style="color:blue">Tuned RF with 15 features (Model 15)</span>
- Train R-squared: 0.4762
- Test R-squared: 0.4702
- **Comments:** Similar to Model 14: the train/test gap is small (0.006), so there is no sign of overfitting.

## <span style="color:blue">RF-optimized with 17 features (Models 16 and 17)</span>
### <span style="color:purple">Model 16</span>
- Train R-squared: 0.6758
- Test R-squared: 0.5118
- **Comments:** There's a significant difference between training and test R-squared, suggesting potential overfitting on the training set.

### <span style="color:purple">Model 17</span>
- Train R-squared: 0.6537
- Test R-squared: 0.5148
- **Comments:** Similar to Model 16, overfitting is observed.

## <span style="color:blue">RF-optimized with 15 features (Model 18)</span>
- Train R-squared: 0.6537
- Test R-squared: 0.5148
- **Comments:** Similar to Models 16 and 17, overfitting is observed.

## <span style="color:blue">XGBoost with 17 features (Models 26 and 27)</span>
### <span style="color:orange">Model 26</span>
- Train R-squared: 0.0573
- Test R-squared: 0.0336
- **Comments:** Both train and test R-squared are close to zero, so the model explains almost none of the variance. This indicates underfitting rather than overfitting.

### <span style="color:orange">Model 27</span>
- Train R-squared: 0.0573
- Test R-squared: 0.0336
- **Comments:** Similar to Model 26.

## <span style="color:blue">XGBoost with 15 features (Model 28)</span>
- Train R-squared: -0.0854
- Test R-squared: -0.0994
- **Comments:** Both values are negative, meaning the model does worse than predicting the mean sales value. This indicates severe underfitting rather than overfitting.
