# **Simple Prediction Model**
> This notebook predicts logS (the log of aqueous solubility, from the Delaney dataset) with two supervised learning algorithms: Linear Regression and Random Forest. Both models are implemented with the scikit-learn (`sklearn`) library.



In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv")

In [None]:
df.columns

Index(['MolLogP', 'MolWt', 'NumRotatableBonds', 'AromaticProportion', 'logS'], dtype='object')

In [None]:
y = df["logS"]
y

0      -2.180
1      -2.000
2      -1.740
3      -1.480
4      -3.040
        ...  
1139    1.144
1140   -4.925
1141   -3.893
1142   -3.790
1143   -2.581
Name: logS, Length: 1144, dtype: float64

In [None]:
x = df.drop("logS", axis=1)
x

| | MolLogP | MolWt | NumRotatableBonds | AromaticProportion |
|---|---|---|---|---|
| 0 | 2.59540 | 167.850 | 0.0 | 0.000000 |
| 1 | 2.37650 | 133.405 | 0.0 | 0.000000 |
| 2 | 2.59380 | 167.850 | 1.0 | 0.000000 |
| 3 | 2.02890 | 133.405 | 1.0 | 0.000000 |
| 4 | 2.91890 | 187.375 | 1.0 | 0.000000 |
| ... | ... | ... | ... | ... |
| 1139 | 1.98820 | 287.343 | 8.0 | 0.000000 |
| 1140 | 3.42130 | 286.114 | 2.0 | 0.333333 |
| 1141 | 3.60960 | 308.333 | 4.0 | 0.695652 |
| 1142 | 2.56214 | 354.815 | 3.0 | 0.521739 |


# **Splitting the Data**

> In any machine learning project, splitting the data into training and testing sets is a crucial step. Evaluating on held-out data shows how well the model generalizes to unseen samples and makes overfitting visible. Here we use the `train_test_split` function from the `sklearn.model_selection` module, holding out 20% of the rows (`test_size=0.2`) with a fixed `random_state` so the split is reproducible.



In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=100)
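
As a rough check on what `test_size=0.2` means for this dataset, the sketch below reproduces only the size arithmetic of the split in plain Python. It assumes the test-set size is rounded up and the remaining rows go to training, which is consistent with the 915 training rows shown later in this notebook. The helper `split_indices` and its shuffle are hypothetical: they illustrate the idea of a seeded hold-out split and will not select the same rows as scikit-learn.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=100):
    """Illustrative hold-out split over row indices (not scikit-learn's implementation)."""
    n_test = math.ceil(test_size * n_samples)   # test fraction, rounded up (assumption)
    n_train = n_samples - n_test                # every remaining row is used for training
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded shuffle, so the split is repeatable
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(1144)       # 1144 rows, as in the logS column above
print(len(train_idx), len(test_idx))            # prints: 915 229
```

No row index appears in both lists, which is the property that keeps the test score an honest estimate of performance on unseen data.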

In [None]:
x_train

| | MolLogP | MolWt | NumRotatableBonds | AromaticProportion |
|---|---|---|---|---|
| 107 | 3.14280 | 112.216 | 5.0 | 0.000000 |
| 378 | -2.07850 | 142.070 | 0.0 | 0.000000 |
| 529 | -0.47730 | 168.152 | 0.0 | 0.000000 |
| 546 | -0.86740 | 154.125 | 0.0 | 0.000000 |
| 320 | 1.62150 | 100.161 | 2.0 | 0.000000 |
| ... | ... | ... | ... | ... |
| 802 | 3.00254 | 250.301 | 1.0 | 0.842105 |
| 53 | 2.13860 | 82.146 | 3.0 | 0.000000 |
| 350 | 5.76304 | 256.348 | 0.0 | 0.900000 |
| 79 | 3.89960 | 186.339 | 10.0 | 0.000000 |


**Training a linear regression model on the training set**


In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train,y_train)

In [None]:
y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

In [None]:
y_train

107   -4.440
378   -1.250
529   -1.655
546   -1.886
320   -0.740
       ...  
802   -2.925
53    -2.680
350   -7.020
79    -4.800
792   -3.240
Name: logS, Length: 915, dtype: float64

In [None]:
y_train_pred

In [None]:
y_test_pred

**Evaluating the model with `mean_squared_error` and `r2_score` on both the training set and the test set**

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

lr_train_mse = mean_squared_error(y_train, y_train_pred)
lr_train_score = r2_score(y_train, y_train_pred)

lr_test_mse = mean_squared_error(y_test, y_test_pred)
lr_test_score = r2_score(y_test, y_test_pred)

In [None]:
print("lr_train_mse: ",lr_train_mse)
print("lr_train_score: ",lr_train_score)
print("lr_test_mse: ",lr_test_mse)
print("lr_test_score: ",lr_test_score)

lr_train_mse:  1.0075362951093687
lr_train_score:  0.7645051774663391
lr_test_mse:  1.0206953660861033
lr_test_score:  0.7891616188563282
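
To make the two metrics concrete, here is a minimal plain-Python sketch of the standard formulas: MSE is the mean of the squared residuals, and R² is `1 - SS_res / SS_tot`. The function names `mse` and `r2` and the toy values are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total variation around the mean
    return 1 - ss_res / ss_tot


# Toy values chosen for illustration only; they are not taken from the dataset
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mse(y_true, y_pred))             # prints: 0.375
print(round(r2(y_true, y_pred), 4))    # prints: 0.9486
```

A lower MSE and an R² closer to 1 both indicate a better fit. MSE is expressed in squared logS units, while R² is unitless.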


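**Training a random forest model on the same split**

In [None]:
from sklearn.ensemble import RandomForestRegressor

# max_depth=2 keeps every tree very shallow; random_state makes the forest reproducible
rf = RandomForestRegressor(max_depth=2, random_state=100)
rf.fit(x_train, y_train)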

In [None]:
y_rf_train_pred = rf.predict(x_train)
y_rf_test_pred = rf.predict(x_test)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

rf_train_mse = mean_squared_error(y_train, y_rf_train_pred)
rf_train_score = r2_score(y_train, y_rf_train_pred)

rf_test_mse = mean_squared_error(y_test, y_rf_test_pred)
rf_test_score = r2_score(y_test, y_rf_test_pred)

In [None]:
print("rf_train_mse: ",rf_train_mse)
print("rf_train_score: ",rf_train_score)
print("rf_test_mse: ",rf_test_mse)
print("rf_test_score: ",rf_test_score)

rf_train_mse:  1.028227802112806
rf_train_score:  0.7596688824431413
rf_test_mse:  1.407688264904896
rf_test_score:  0.7092230211002489
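
One way to read the two sets of results side by side is sketched below. The numbers are copied from the printed outputs above; the dictionary layout and the "gap" measure (test MSE minus train MSE) are illustrative choices, and the comparison only describes this single split with these hyperparameters.

```python
import math

# Metrics copied from the printed outputs above
results = {
    "Linear regression": {"train_mse": 1.0075362951093687, "test_mse": 1.0206953660861033},
    "Random forest": {"train_mse": 1.028227802112806, "test_mse": 1.407688264904896},
}

gaps = {}
for name, m in results.items():
    test_rmse = math.sqrt(m["test_mse"])            # RMSE is in the same units as logS
    gaps[name] = m["test_mse"] - m["train_mse"]     # a larger gap suggests weaker generalization
    print(f"{name:18s} test RMSE = {test_rmse:.3f}  MSE gap (test - train) = {gaps[name]:.3f}")

best = min(results, key=lambda name: results[name]["test_mse"])
print("Lower test MSE:", best)
```

On this split the linear model has both the lower test error and the smaller train/test gap, while the shallow random forest (`max_depth=2`) loses more accuracy on the held-out rows.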


In [5]:
from google.colab import drive
drive.mount('/content/Colab_Notebooks')

Mounted at /content/Colab_Notebooks


In [6]:
!cd /content/Colab_Notebooks/MyDrive/Colab_Notebooks/DADV.ipynb

/bin/bash: line 1: cd: /content/Colab_Notebooks/MyDrive/Colab_Notebooks/DADV.ipynb: Not a directory
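
The error above arises because the path points at the notebook file rather than at its folder. Separately, in Colab a `!cd` command runs in a temporary subshell, so even a valid target would not change the notebook's working directory; the `%cd` magic does persist. The sketch below shows the same distinction in plain Python with `os.chdir`, using a temporary folder as a stand-in for the Drive path (the stand-in folder is an assumption made so the example runs anywhere).

```python
import os
import tempfile
from pathlib import Path

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the Drive folder and the notebook inside it
    folder = Path(tmp)
    notebook = folder / "DADV.ipynb"
    notebook.write_text("{}")

    try:
        os.chdir(notebook)              # a file, so this fails like the shell command did
        changed_to_file = True
    except NotADirectoryError:
        changed_to_file = False

    os.chdir(notebook.parent)           # the containing folder is a valid target
    in_folder = Path.cwd().samefile(folder)
    os.chdir(start_dir)                 # restore the original directory before cleanup

print(changed_to_file, in_folder)       # prints: False True
```

Because the change of directory never took effect in the notebook, the `git init` that follows creates its repository in `/content`, as its output shows.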


In [7]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/


In [8]:
!git add .

error: open("Colab_Notebooks/MyDrive/Documents/Diploma Results.gdoc"): Operation not supported
error: unable to index file 'Colab_Notebooks/MyDrive/Documents/Diploma Results.gdoc'
fatal: adding files failed
