Name: Aditi Mohan<br/>
Roll No: C067<br/>
Batch: C1<br/>
Div: C

# Problem Statement
### Predict Thermostability of a protein sequence (enzyme)

> **Enzymes** are proteins that act as catalysts in the chemical reactions of living organisms.

> **Thermostability** is the ability of a substance to resist irreversible change in its chemical or physical structure at high temperatures.

Understanding and accurately predicting protein stability is a fundamental problem in biotechnology. Its applications include enzyme engineering for addressing the world’s challenges in sustainability, carbon neutrality and more. Improvements to enzyme stability could lower costs and increase the speed at which scientists can iterate on concepts.

# Imports

In [None]:
! pip install biopandas -q

In [None]:
! pip install python-Levenshtein


In [None]:
! pip install kaggle


In [None]:
! pip install biopython


In [None]:
# loading data
from google.colab import files

# data handling libraries
import numpy as np
import pandas as pd

# visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns 

# biotech packages
from biopandas.pdb import PandasPdb
import Levenshtein
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.PDB import PDBParser
from Bio.PDB.SASA import ShrakeRupley

# Load Data

In [None]:
files.upload()

In [None]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle competitions download -c novozymes-enzyme-stability-prediction

In [None]:
! mkdir -p datasets

In [None]:
! unzip novozymes-enzyme-stability-prediction.zip -d datasets

## Creating the DataFrame

In [None]:
df = pd.read_csv('/content/datasets/train.csv')

In [None]:
df.head()

# EDA

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.isnull().sum()

In [None]:
df['pH'].hist(bins=50)

In [None]:
# histogram ignoring the outliers
vals = df[df['pH'] < 30]
vals['pH'].hist()

In [None]:
plt.scatter(df['pH'], df['tm'])
plt.xlabel("pH")
plt.ylabel("tm")
plt.show()

In [None]:
df['tm'].hist()

#### pH values have some outliers
#### Both pH and tm have an even distribution

# Handling Missing Values

Drop the `data_source` column: it does not affect the prediction.

In [None]:
df.drop(["data_source"], inplace=True, axis=1)

Drop rows with missing pH values: pH affects the prediction, so these rows cannot be used.

In [None]:
df.dropna(axis=0, inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df.head()

# Feature Engineering

Factors that affect thermal stabilization of proteins and enzymes -

<br>

## 1. Hydrophobicity

> **The overall amount of hydrophobicity present in a particular protein is responsible for its thermostability.**

Hydrophobic interactions are important for protein folding. They keep a protein stable and biologically active by allowing it to reduce its surface area and, with it, its undesirable interactions with water.<br><br>

<br>

## 2. Protein, Amino acid charge & Fariás-Bonato ratio

> **The Fariás-Bonato ratio estimates thermostability from the amino acid composition of a protein sequence. It compares the number of charged residues (Glu+Lys) with the number of polar residues (Gln+His): the ratio (Glu+Lys)/(Gln+His) tends to be higher in thermostable proteins and is used to identify them.**

The ratio of charged to uncharged amino acids helps determine the stability of an enzyme.
The overall protein charge at pH 7 reflects the balance of hydrogen donor and acceptor amino acids in the chain, which helps determine the stability of a protein sequence.<br><br>
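
The two ratios described above can be written out explicitly. This is a sketch in which $n_X$ denotes the number of residues of type $X$ in a sequence of length $N$ (notation introduced here for illustration). The charged residues are taken to be D, E, H, K and R, the set used by the feature-engineering code in this notebook.

$$
\text{FB ratio} = \frac{n_E + n_K}{n_Q + n_H}, \qquad
\text{charge ratio} = \frac{n_D + n_E + n_H + n_K + n_R}{N - (n_D + n_E + n_H + n_K + n_R)}
$$

For example, a sequence with 12 Glu, 8 Lys, 3 Gln and 2 His has an FB ratio of $(12 + 8)/(3 + 2) = 4$.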

<br>

## 3. Protein Flexibility

> **Flexibility indices show that overall flexibility is reduced when thermostability is increased.**

Protein molecules require both flexibility and rigidity to function, but the higher the temperature optimum and stability the more rigid is the structure needed to compensate for increased thermal fluctuations.

<br>

## 4. Ionic interactions
*   **i. Isoelectric point**<br>
The isoelectric point (pI) of a protein is defined as the pH at which the net charge of the protein molecule is zero.
> Consequently, proteins are expected to be least soluble & most stable near their isoelectric points

*   **ii. Molecular weight**

*   **iii. Aromaticity**<br>
Additional aromatic clusters near the active site should help retain the conformation of the active-site residues required to bind the substrate at high temperatures, thus contributing to the high thermophilicity of thermostable proteins.
> The aromaticity value of a protein according to Lobry & Gautier is simply the relative frequency of Phe+Trp+Tyr.
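
The isoelectric point and aromaticity definitions above can be sketched in plain Python. This is an illustrative approximation, not the routine Biopython uses: the pKa values are assumed textbook figures, and the helper names `net_charge`, `isoelectric_point` and `aromaticity` are hypothetical. The pI is found by bisection, because the net charge falls steadily as pH rises.

```python
# Approximate side-chain and terminal pKa values (assumed textbook figures,
# not the table Biopython uses internally).
POSITIVE_PK = {"K": 10.5, "R": 12.5, "H": 6.0, "Nterm": 9.0}
NEGATIVE_PK = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "Cterm": 2.0}


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 0.0
    # Basic groups contribute +1 when protonated.
    for group, pk in POSITIVE_PK.items():
        n = 1 if group == "Nterm" else seq.count(group)
        charge += n / (1.0 + 10 ** (ph - pk))
    # Acidic groups contribute -1 when deprotonated.
    for group, pk in NEGATIVE_PK.items():
        n = 1 if group == "Cterm" else seq.count(group)
        charge -= n / (1.0 + 10 ** (pk - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Bisection search for the pH at which the net charge crosses zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: the pI lies at a higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


def aromaticity(seq):
    """Relative frequency of Phe + Trp + Tyr in the sequence."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


print(aromaticity("FWYA"))  # 0.75
```

A lysine-rich peptide such as `"KKKK"` gives a basic pI under these assumptions and an aspartate-rich one such as `"DDDD"` an acidic pI. The exact values depend on the pKa table chosen.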

<br>

## 5. Protein surface area 
> **It has been suggested that increased polar surface area contributes to the greater stability of the thermophilic proteins**

We need special .pdb files to calculate the surface area of a protein 

<br>

## 6. Amino acid percentages & SSF
> **The percentage of each individual amino acid can be used to analyse and predict the behaviour of a protein sequence**

Further, SSF (Secondary Structure Fraction) can give a more comprehensive insight

> SSF gives the percentage for 3 groups of AA
*   Amino acids in helix: V, I, Y, F, W, L.
*   Amino acids in turn: N, P, G, S.
*   Amino acids in sheet: E, M, A, L.
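
The three fractions can be sketched directly from the groups listed above. This minimal example assumes exactly that grouping (the library's own implementation may differ), and the function name is chosen here for illustration.

```python
# Residue groups as listed above; note that L counts towards both helix and sheet.
HELIX = set("VIYFWL")
TURN = set("NPGS")
SHEET = set("EMAL")


def secondary_structure_fraction(seq):
    """Return the (helix, turn, sheet) fractions of residues in each group."""
    n = len(seq)
    helix = sum(aa in HELIX for aa in seq) / n
    turn = sum(aa in TURN for aa in seq) / n
    sheet = sum(aa in SHEET for aa in seq) / n
    return helix, turn, sheet


# V and L are helix formers, N and G turn formers, and L, E, A sheet formers.
print(secondary_structure_fraction("VLNGEAKK"))  # (0.25, 0.25, 0.375)
```

Because the groups overlap and do not cover every residue, the three fractions need not sum to 1.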

<br>

## 7. Instability & GRAVY index

**Instability** of a protein sequence is calculated here using the method of Guruprasad et al. (1990, Protein Engineering, 4, 155-161), which tests a protein for stability.

> Any value above 40 means the protein is unstable.

**GRAVY Index** indicates the hydrophobicity of the proteins, calculated by adding the hydropathy value for each residue and dividing by the length of the sequence

>  Proteins with a GRAVY score above 0 are more likely to be hydrophobic
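
As a worked sketch of the GRAVY definition above, the snippet below averages Kyte-Doolittle hydropathy values over a sequence. The two four-residue sequences are toy inputs chosen for illustration, and the function assumes only the 20 standard amino acids occur.

```python
# Kyte-Doolittle hydropathy value for each standard amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Sum the hydropathy of every residue and divide by the sequence length."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


print(round(gravy("AILV"), 3))  # hydrophobic residues only: positive score
print(round(gravy("DEKR"), 3))  # charged residues only: negative score
```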

## Hydrophobicity 

The overall amount of hydrophobicity affects the thermostability of the enzyme.
The higher the hydrophobicity, the higher the thermostability<br><br>
Different scales to measure hydrophobicity:<br>
1. Kyte-Doolittle
2. Hopp-Woods
3. Cornette
4. Eisenberg	
5. Rose
6. Janin
7. Engelman GES<br><br>

The hydrophobicity value of each amino acid on these scales can be found here - [Hydrophobicity Chart](https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Hydrophobicity_scales.html)

In [None]:
hydrophobicity_factors_url = 'https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Hydrophobicity_scales.html'
hydrophobicity_factors_df = pd.read_html(
    hydrophobicity_factors_url,
    header=0,
    skiprows=0
)[0]

hydrophobicity_factors_headers = ['aa', 'Amino Acid', 'Kyte-Doolittle', 'Hopp-Woods', 'Cornette', 'Eisenberg', 'Rose', 'Janin', 'Engelman GES']
hydrophobicity_factors_df.columns = hydrophobicity_factors_headers
hydrophobicity_factors_df.set_index('aa', drop=True, inplace=True)

hydrophobicity_factors_df

In [None]:
def get_hydrophobicity_of_seq(seq, scale):
  hy = 0
  for each in seq:
    hy += hydrophobicity_factors_df[scale][each]
  
  return hy/len(seq)

In [None]:
for each in [ 'Kyte-Doolittle' , 'Hopp-Woods', 'Cornette', 'Eisenberg', 'Rose', 'Janin', 'Engelman GES']:
  df[each] = df['protein_sequence'].apply(lambda x: get_hydrophobicity_of_seq(x, each))

In [None]:
df.head()

In [None]:
df.to_csv('/content/Hydrophobicity.csv')

In [None]:
df = pd.read_csv('/content/Hydrophobicity.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)

## Protein, Amino Acid Charge & Fariás-Bonato ratio

In [None]:
# using a ProteinAnalysis object to get insights into the protein sequence
df['protein_analysis_obj'] = df['protein_sequence'].apply(lambda x:ProteinAnalysis(x))

In [None]:
# Getting overall charge of sequence at pH=7
df['protein_charge_at_ph7'] = df['protein_analysis_obj'].apply(lambda x: round(x.charge_at_pH(pH=7),2))

In [None]:
charged_aa = ['D', 'E', 'H', 'K', 'R']

In [None]:
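def get_charge_ratio(seq):
  # ratio of charged residues (D, E, H, K, R) to uncharged residues
  c = sum(1 for aa in seq if aa in charged_aa)
  uncharged = len(seq) - c

  # guard against division by zero when every residue is charged
  return c/uncharged if uncharged != 0 else 0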

In [None]:
df['charge_ratio'] = df['protein_sequence'].apply(lambda x: get_charge_ratio(x))

In [None]:
def get_FB_ratio(seq):
  e_k = 0
  q_h = 0
  for aa in seq:
    if aa == 'E' or aa == 'K':
      e_k += 1
    elif aa == 'Q' or aa == 'H':
      q_h += 1
  
  return e_k/q_h if q_h != 0 else 0

In [None]:
df['Fariás-Bonato ratio'] = df['protein_sequence'].apply(lambda x: get_FB_ratio(x))

## Protein Flexibility

In [None]:
def get_amino_flexibility(seq_flex):
  flex = sum(seq_flex)
    
  return flex

In [None]:
df['flex'] = df['protein_analysis_obj'].apply(lambda x: x.flexibility())

In [None]:
df['flex'] = df['flex'].apply(lambda x: get_amino_flexibility(x))

In [None]:
df.head()

## Ionic Interaction

In [None]:
df['protein_isoelectric_point'] = df['protein_analysis_obj'].apply(lambda x: x.isoelectric_point())

In [None]:
df['protein_molecular_weight'] = df['protein_analysis_obj'].apply(lambda x:x.molecular_weight())

In [None]:
df['protein_aromaticity'] = df['protein_analysis_obj'].apply(lambda x:x.aromaticity())

## Protein Surface Area

Unable to calculate Protien Surface as Sequence pdb is not available in the dataset

## Amino Acid Percentage

In [None]:
def get_aa_percent(df):
  # Collect one row per sequence and build the DataFrame once;
  # DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
  rows = []
  for i in df.index:
    aa_per = dict(df['protein_analysis_obj'][i].get_amino_acids_percent())
    aa_per['seq_id'] = df['seq_id'][i]
    rows.append(aa_per)

  return pd.DataFrame(rows)

In [None]:
aa_df = get_aa_percent(df)
aa_df.head()

In [None]:
df = pd.merge(df, aa_df, how='inner', on='seq_id')
df.head()

### Secondary structure fraction - Helix, Turn, Sheet

In [None]:
# see if required
df['protein_analysis_obj'][0].secondary_structure_fraction()

In [None]:
def get_ssf_percent(df):
  # Collect one row per sequence and build the DataFrame once;
  # DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
  rows = []
  for i in df.index:
    ssf = {}
    ssf_tup = df['protein_analysis_obj'][i].secondary_structure_fraction()
    ssf['seq_id'] = df['seq_id'][i]
    ssf['Helix'] = ssf_tup[0]
    ssf['Turn'] = ssf_tup[1]
    ssf['Sheet'] = ssf_tup[2]
    rows.append(ssf)

  return pd.DataFrame(rows)

In [None]:
ssf_df = get_ssf_percent(df)
ssf_df.head()

In [None]:
df = pd.merge(df, ssf_df, how='inner', on='seq_id')
df.head()

## Instability & GRAVY Index

In [None]:
df['instability'] = df['protein_analysis_obj'].apply(lambda x: x.instability_index())

In [None]:
df['protein_gravy_val'] = df['protein_analysis_obj'].apply(lambda x: x.gravy())

In [None]:
df.head()

In [None]:
df.to_csv('final_dataset.csv')

# Load Feature Engineered Dataset

In [None]:
df = pd.read_csv('/content/act_final_dataset.csv')

In [None]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

# Correlations

In [None]:
len(df.columns)

In [None]:
df.columns

In [None]:
meta_columns = [
       'tm', 'pH', # pH
       'protein_charge_at_ph7', 'charge_ratio', 'Fariás-Bonato ratio', # charge ratios
       'flex', # flexibility
       'protein_isoelectric_point', 'protein_molecular_weight', 'protein_aromaticity', # ionic properties
       'instability', # instability
       'protein_gravy_val', 'Kyte-Doolittle', 'Hopp-Woods', 'Cornette', # hydrophobicity
       'Eisenberg', 'Rose', 'Janin', 'Engelman GES', # hydrophobicity
       'Helix', 'Turn', 'Sheet' # composition percentages
       ]

meta_corr = df[meta_columns].corr()
meta_corr.style.background_gradient(cmap='coolwarm', axis=None)

In [None]:
aa_corr = df[['tm', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
       'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']].corr()
aa_corr.style.background_gradient(cmap='coolwarm', axis=None)

### The plots above illustrate high multicollinearity between features
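
One way to turn a correlation heatmap into a concrete list is to enumerate the feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below does this in plain Python on made-up toy columns. The helper names and the 0.9 threshold are assumptions for illustration, not values taken from this analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.9):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Toy data, made up for illustration (not taken from the dataset).
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(correlated_pairs(toy))  # [('f1', 'f2', 1.0)]
```

With pandas, the same coefficients come from `df[cols].corr()`, as used in the cells above.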

# Reducing Multi Collinearity

In [None]:
# work on a copy so the column drops below do not modify df
df_p = df.copy()

## Ionic Measures

In [None]:
i_corr_cols = ['protein_isoelectric_point', 'protein_molecular_weight', 'protein_charge_at_ph7']

i_corr = df_p[['tm']+i_corr_cols].corr()
i_corr.style.background_gradient(cmap='coolwarm', axis=None)

In [None]:
_, ax = plt.subplots(figsize=(20, 4))

sns.heatmap(df_p[['seq_id']+i_corr_cols].set_index('seq_id', drop=True).T, cmap='coolwarm')
plt.show()

Since **protein_charge_at_ph7** is highly correlated with both **protein_molecular_weight** and **protein_isoelectric_point**, and captures information similar to **protein_isoelectric_point**,
<br><br>
we can drop **protein_charge_at_ph7**

In [None]:
df_p.drop(['protein_charge_at_ph7'], inplace=True, axis=1)

## Protein Composition Measures

*   Amino acids in helix: V, I, Y, F, W, L.
*   Amino acids in turn: N, P, G, S.
*   Amino acids in sheet: E, M, A, L.

In [None]:
pch_corr_cols = ['Helix', 'V', 'I', 'Y', 'F', 'W', 'L']

pch_corr = df_p[['tm']+pch_corr_cols].corr()
pch_corr.style.background_gradient(cmap='coolwarm', axis=None)

In [None]:
pct_corr_cols = ['Turn', 'N', 'P', 'G', 'S']

pct_corr = df_p[['tm']+pct_corr_cols].corr()
pct_corr.style.background_gradient(cmap='coolwarm', axis=None)

In [None]:
pcs_corr_cols = ['Sheet', 'E', 'M', 'A', 'L']

pcs_corr = df_p[['tm']+pcs_corr_cols].corr()
pcs_corr.style.background_gradient(cmap='coolwarm', axis=None)

Secondary Structure Fraction - **Helix, Turn, Sheet** gives similar insights to the individual **AA percentages** with lower dimensionality
<br><br>
Secondary Structure Fraction - **Helix, Turn, Sheet** also has a higher correlation with the target variable **tm** than any individual **AA percentage**
<br><br>
We tested the performance of the models before and after dropping the **AA percentage columns**.
Performance drops when the **AA percentage columns** are removed.
Therefore, we trade off multicollinearity for performance here: we **do not drop the AA percentage columns and also retain the SSF columns**

In [None]:
# df_p.drop(['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'], inplace=True, axis=1)

## Hydrophobicity

In [None]:
pchy_corr_cols = ['protein_gravy_val', 'Kyte-Doolittle', 'Hopp-Woods', 'Cornette', 'Eisenberg', 'Rose', 'Janin', 'Engelman GES',]

pchy_corr = df_p[['tm']+pchy_corr_cols].corr()
pchy_corr.style.background_gradient(cmap='coolwarm', axis=None)

Features **protein_gravy_val** and **Kyte-Doolittle** are perfectly correlated (GRAVY is itself the mean Kyte-Doolittle hydropathy of the sequence), therefore we can drop either
<br><br>
Dropping **Kyte-Doolittle**

In [None]:
df_p.drop(['Kyte-Doolittle'], axis=1, inplace=True)

Drop - **Engelman GES & Hopp-Woods** index, as they have very little correlation with the target feature **tm**

In [None]:
df_p.drop(['Engelman GES', 'Hopp-Woods'], axis=1, inplace=True)

In [None]:
pchy_corr_cols = ['Cornette', 'Eisenberg', 'Rose', 'Janin']

pchy_corr = df_p[['tm']+pchy_corr_cols].corr()
pchy_corr.style.background_gradient(cmap='coolwarm', axis=None)

Creating a feature: the aggregate (mean) of the **Eisenberg, Rose & Janin** indexes
<br><br>
Excluding **Cornette**, as it has a higher correlation with the target feature **tm** and comparatively less correlation with the other indexes

In [None]:
df_p['erj_avg'] = df_p[['Eisenberg', 'Rose', 'Janin']].mean(axis=1)

In [None]:
# correlate only the two columns of interest; df_p still holds non-numeric columns
df_p['erj_avg'].corr(df_p['tm'])

In [None]:
pchy_corr_cols = ['Cornette', 'erj_avg']

pchy_corr = df_p[['tm']+pchy_corr_cols].corr()
pchy_corr.style.background_gradient(cmap='coolwarm', axis=None)

Drop the corresponding individual features - **Eisenberg, Rose & Janin**

In [None]:
df_p.drop(['Eisenberg', 'Rose', 'Janin'], inplace=True, axis=1)

## Final Overall Correlation

In [None]:
df_p.columns

In [None]:
meta_corr_cols = [
       'tm', 'pH', # pH
       'charge_ratio', 'Fariás-Bonato ratio', # charge ratios
       'flex', # flexibility
       'protein_isoelectric_point', 'protein_molecular_weight', 'protein_aromaticity', # ionic properties
       'instability', # instability
       'protein_gravy_val', 'Cornette', 'erj_avg', # hydrophobicity
       'Helix', 'Turn', 'Sheet' # composition percentages,
]

meta_corr = df_p[meta_corr_cols].corr()
meta_corr.style.background_gradient(cmap='coolwarm', axis=None)

In [None]:
df_p.to_csv('dataset_wo_mc.csv')

# Cleaning

In [None]:
df_c = df_p

In [None]:
# remove unnecessary columns
df_c.drop(['protein_sequence', 'protein_analysis_obj'], inplace=True, axis=1)

## Outlier Handling

In [None]:
#check outliers in pH
fig = plt.figure(figsize=[15,4])
plt.subplot(1,2,1)
sns.boxplot(x='pH', data=df_c)


plt.subplot(1,2,2)
sns.histplot(x='pH', data=df_c)
plt.show()

In [None]:
perc_95=np.percentile(df_c['pH'], 95)

In [None]:
# build the mask from df_c itself so it always matches the frame being filtered
too_large = df_c["pH"] > perc_95

drop_indexes = df_c[too_large].index.values

df_c.drop(index=drop_indexes, inplace = True)

In [None]:
# cleaned pH
fig = plt.figure(figsize=[15,4])
plt.subplot(1,2,1)
sns.boxplot(x='pH', data=df_c)


plt.subplot(1,2,2)
sns.histplot(x='pH', data=df_c)
plt.show()

In [None]:
plt.scatter(df_c['pH'], df_c['tm'])

## Final Data Distribution

In [None]:
fig = plt.figure(figsize=(20,20))
ax = fig.gca()
df_c.hist(ax=ax)
plt.show()

In [None]:
plt.scatter(df_c['pH'], df_c['tm'], c=df_c['protein_isoelectric_point'])

In [None]:
plt.scatter(df_c['pH'], df_c['tm'], c=df_c['instability'])

In [None]:
plt.scatter(df_c['pH'], df_c['tm'], c=df_c['protein_aromaticity'])

In [None]:
df_c.to_csv('Cleaned_dataset.csv')

In [None]:
df_c = pd.read_csv('/content/Cleaned_Dataset_fin.csv')
df_c.drop(['Unnamed: 0'], axis=1, inplace=True)

# Train & Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_cols = [x for x in df_c.columns if x != 'tm']
y_col = 'tm'  # a single column name, so y is a 1-D Series as the regressors expect

In [None]:
X = df_c[X_cols]
y = df_c[y_col]

In [None]:
# fix the seed so the split (and the scores below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling and Standardisation

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
X.head()

In [None]:
X.columns

Some features range between 0 and 1, while others range from 0 to hundreds or thousands.<br>
Some features even have negative values.
<br><br>
Therefore we need to perform scaling on the features.
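
Standardisation rescales each feature to zero mean and unit variance using statistics learned from the training split only. The sketch below shows the arithmetic for a single feature in plain Python. The helper names and the toy numbers are assumptions for illustration. It uses the population standard deviation, which is what `StandardScaler` uses by default.

```python
from math import sqrt


def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Toy molecular weights (made-up numbers for illustration only).
train_weights = [10000.0, 20000.0, 30000.0, 40000.0]
test_weights = [25000.0, 50000.0]

# The statistics come from the training split only ...
mean, std = fit_standardizer(train_weights)
train_scaled = standardize(train_weights, mean, std)
# ... and are reused unchanged for the test split.
test_scaled = standardize(test_weights, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test split keeps information about the test data out of the fitted model.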

In [None]:
# Initialise the Scaler
scaler = StandardScaler()
 
# Fit the scaler on the training split only, so no test-set statistics leak into training
scaler.fit(X_train)

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
X_train[0]

In [None]:
X_test[0]

In [None]:
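# X is still unscaled; wrap the scaled training array in a DataFrame to plot the scaled values
X_train_scaled_df = pd.DataFrame(X_train, columns=X.columns)
aa_cols = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
       'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

fig = plt.figure(figsize=(20,20))
ax = fig.gca()
X_train_scaled_df[[x for x in X.columns if x not in aa_cols]].hist(ax=ax)
fig.suptitle("Scaled Features Distribution")
plt.show()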

Features **flex** (Protein Flexibility) and **protein_molecular_weight** (Molecular Weight) are slightly skewed

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression()

In [None]:
reg.fit(X_train,y_train)

## Evaluation

In [None]:
from sklearn.metrics import r2_score

In [None]:
lr_trainp = reg.predict(X_train)
lr_pred = reg.predict(X_test)

In [None]:
train_acc_lr = r2_score(y_train, lr_trainp)
test_acc_lr = r2_score(y_test, lr_pred)
overfit_lr = train_acc_lr - test_acc_lr

In [None]:
# r2_score is the coefficient of determination, not a classification accuracy
print("Train R2 (%): ", train_acc_lr*100)
print("Test R2 (%): ", test_acc_lr*100)
print("Overfitting (train - test R2, %): ", overfit_lr*100)
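
The scores in the evaluation cells come from `r2_score`, the coefficient of determination. As a minimal sketch of what that number means, the hypothetical helper `r2` below reimplements the formula $R^2 = 1 - SS_{res}/SS_{tot}$ in plain Python on made-up melting temperatures.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy melting temperatures (made up for illustration).
tm_true = [40.0, 50.0, 60.0, 70.0]

print(r2(tm_true, tm_true))                    # perfect predictions: 1.0
print(r2(tm_true, [55.0] * 4))                 # always predicting the mean: 0.0
print(r2(tm_true, [70.0, 60.0, 50.0, 40.0]))   # worse than the mean: -3.0
```

A score of 0.25 therefore means the model explains about a quarter of the variance in `tm`. Unlike a classification accuracy, the score can be negative.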

# Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

In [None]:
model = Lasso(alpha=0.01)

In [None]:
model.fit(X_train, y_train)

## Evaluation

In [None]:
lsr_trainp = model.predict(X_train)
lsr_pred = model.predict(X_test)

In [None]:
train_acc_lsr = r2_score(y_train, lsr_trainp)
test_acc_lsr = r2_score(y_test, lsr_pred)
overfit_lsr = train_acc_lsr - test_acc_lsr

In [None]:
print("Train R2 (%): ", train_acc_lsr*100)
print("Test R2 (%): ", test_acc_lsr*100)
print("Overfitting (train - test R2, %): ", overfit_lsr*100)

# LGBM Regressor

In [None]:
from lightgbm import LGBMRegressor

In [None]:
model_lgb = LGBMRegressor(n_estimators = 250, 
                              learning_rate = 0.01,
                              num_leaves = 31,
                              max_depth = 5, 
                              reg_alpha = 1, 
                              reg_lambda = 5, 
                              subsample = 0.75,
                              colsample_bytree = 0.55)

In [None]:
model_lgb.fit(X_train, y_train)

## Evaluation

In [None]:
lgb_trainp = model_lgb.predict(X_train)
lgb_pred = model_lgb.predict(X_test)

In [None]:
train_acc_lgb = r2_score(y_train, lgb_trainp)
test_acc_lgb = r2_score(y_test, lgb_pred)
overfit_lgb = train_acc_lgb - test_acc_lgb

In [None]:
print("Train R2 (%): ", train_acc_lgb*100)
print("Test R2 (%): ", test_acc_lgb*100)
print("Overfitting (train - test R2, %): ", overfit_lgb*100)

# Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
regr = RandomForestRegressor(n_jobs=2,bootstrap=True)

In [None]:
regr.fit(X_train,y_train)

## Evaluation

In [None]:
rf_trainp = regr.predict(X_train)
rf_pred = regr.predict(X_test)

In [None]:
train_acc_rf = r2_score(y_train, rf_trainp)
test_acc_rf = r2_score(y_test, rf_pred)
overfit_rf = train_acc_rf - test_acc_rf

In [None]:
print("Train R2 (%): ", train_acc_rf*100)
print("Test R2 (%): ", test_acc_rf*100)
print("Overfitting (train - test R2, %): ", overfit_rf*100)

#### We can notice some Overfitting when using the Random Forest Regressor

# Conclusion
1. We added features to the dataset using domain knowledge and the biotech packages available in Python.<br><br>
2. We tackled high multicollinearity by performing feature selection and combining similar features.<br><br>
3. We cleaned the dataset by dropping redundant or irrelevant columns and removing outliers.<br><br>
4. We used StandardScaler to scale the features, since their ranges vary widely.<br><br>
5. We implemented Linear Regression to establish a performance baseline. It gave an R² score of approximately 25%.<br><br>
6. We implemented Lasso Regression to combat the multicollinearity in the dataset and improve the performance of our model. It also gave an R² score of approximately 25%.<br><br>
7. We implemented Light Gradient Boosting Machine (LGBM) Regression, a tree-based learning algorithm, to further improve the performance of our model. It gave an R² score of approximately 50%.<br><br>
8. We implemented Random Forest Regression, another tree-based learning algorithm. This model performed best, with an R² score of approximately 56%, but it shows noticeable overfitting.