In [None]:
# apply Jupyter notebook style
from IPython.display import HTML

from custom.styles import style_string

HTML(style_string)


<div style="text-align:center;">
  <img src="custom/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Introduction to Data Fitting with scikit-learn

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* How do I fit a model using scikit-learn?

* How can I use scikit-learn to assess my model?

* How can I use other models from the scikit-learn library?

Objectives:

* Fit a linear model using scikit-learn.
    
* Use `train_test_split` to split the data.

* Try other models from scikit-learn.

</div>

## Data Loading and Visualization

Before fitting our models, we will first load our data using pandas and visualize the linear relationships using seaborn.
For a review of pandas and seaborn, see notebook `03_python_data_science`.

In [None]:
import pandas as pd # for dataframes
from rdkit.Chem import PandasTools # for loading dataframes from SDF files

import seaborn as sns # for graphs

PandasTools.RenderImagesInAllDataFrames(True)

df = PandasTools.LoadSDF("data/amino_acids-data.sdf", strictParsing=False, includeFingerprints=True, smilesName="SMILES")

In [None]:
df.head(3) # preview the first three rows.

Unfortunately, `LoadSDF` doesn't always load data types correctly, as can be seen using `df.info()`. 
We would expect `MolWt`, `TPSA`, and `NumHeavyAtoms` to be integers or floats, but they are listed as "object".
We can use the `apply` method to apply the pandas function `to_numeric` to every column. 
Adding the argument `errors="ignore"` makes it leave columns that can't be turned into numbers unchanged.

In [None]:
df.info()

In [None]:
df = df.apply(pd.to_numeric, errors="ignore") # convert columns to numbers.

In [None]:
df.info() # Now our columns are numbers.
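
Note that `errors="ignore"` has been deprecated since pandas 2.2, so the cell above may emit a warning on recent versions. One possible alternative, sketched below, is to catch the conversion error explicitly. The helper name `to_numeric_if_possible` and the small `example` dataframe are made up for illustration and are not part of the lesson data.

```python
import pandas as pd


def to_numeric_if_possible(column):
    """Return the column converted to numbers, or unchanged if conversion fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column


# Hypothetical stand-in for the SDF dataframe: numbers stored as strings.
example = pd.DataFrame(
    {
        "Name": ["glycine", "alanine"],
        "MolWt": ["75.07", "89.09"],
        "NumHeavyAtoms": ["5", "6"],
    }
)

# Apply the helper to every column; text columns are left as they are.
converted = example.apply(to_numeric_if_possible)
print(converted.dtypes)
```

Under this assumption, the same helper could be applied to the lesson dataframe with `df = df.apply(to_numeric_if_possible)`.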

### Visualizing Correlation

One way to visualize the relationship between different variables is to use a pandas correlation matrix.
This is available on a dataframe using [df.corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html). 
This will return a dataframe that shows the correlation between each variable and every other variable.

In [None]:
df.corr(numeric_only=True)

The (Pearson) correlation ranges from -1 (perfectly anti-correlated) through 0 (not linearly correlated) to 1 (perfectly correlated). The correlation of a column with itself is 1.
We can combine this with a heatmap function in seaborn to see this visually.

In [None]:
sns.heatmap(df.corr(numeric_only=True), cmap="Blues", annot=True)
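
To see how the sign of a correlation behaves, here is a minimal sketch on a toy dataframe. The column names `x`, `rises_with_x`, and `falls_with_x` are invented for this illustration and are unrelated to the amino acid data.

```python
import pandas as pd

# Toy data: one column that increases with x and one that decreases with x.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["rises_with_x"] = 2 * toy["x"] + 1
toy["falls_with_x"] = 10 - 3 * toy["x"]

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

A column that rises linearly with `x` has a correlation of 1 with it, while a column that falls linearly with `x` has a correlation of -1.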

## Fitting using scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/index.html) is a popular machine learning library in Python that provides a wide range of algorithms for supervised and unsupervised learning, including classification, regression, clustering, and dimensionality reduction. It is built on top of NumPy, SciPy, and matplotlib, and provides a simple and efficient API for working with data in Python. 

For our simple example, we will fit a linear relationship between the number of heavy atoms and the molecular weight. As a reminder, the equation for a linear relationship is $y = mx + b$.
Molecular weight and number of heavy atoms will obviously be correlated, but this pair makes a good demonstration of how to use the scikit-learn library.
All model fits in scikit-learn use [a similar API](https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects), so learning how to do a linear fit will directly translate to other models.



In [None]:
from sklearn.linear_model import LinearRegression # import the linear regression model

from sklearn.metrics import r2_score, mean_squared_error

In [None]:
# Get our x and y data
X = df[["NumHeavyAtoms"]]
Y = df[["MolWt"]]

linear_model = LinearRegression()
linear_model.fit(X,Y)

That's it! That is how we fit a scikit-learn model to our data. 
Now, our variable `linear_model` is a trained scikit-learn model.
We can use it to predict new values or evaluate our model.

First, we will look at the score for the model on our training data.
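For a regression model, `score` returns the coefficient of determination, $R^2$. The best possible value is 1, and a value closer to 1 indicates a better fit; a model that fits worse than always predicting the mean gives a negative score.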

In [None]:
linear_model.score(X, Y)

Our model score is `0.96`. This indicates that our model is a good fit.
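
Beyond the score, the fitted parameters themselves ($m$ and $b$ in $y = mx + b$) are stored on the trained model as the attributes `coef_` and `intercept_`. The sketch below uses synthetic data that follows $y = 3x + 2$ exactly, so the recovered values are easy to check. The names `x`, `y`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 3x + 2 exactly.
x = np.arange(10, dtype=float).reshape(-1, 1)  # 2D: (n_samples, n_features)
y = 3.0 * x.ravel() + 2.0                      # 1D target

demo_model = LinearRegression().fit(x, y)

slope = demo_model.coef_[0]        # m, one entry per feature
intercept = demo_model.intercept_  # b
print("slope:", slope, "intercept:", intercept)
```

In the notebook, `Y` is a one-column dataframe rather than a 1D array, so `linear_model.coef_` has shape `(1, 1)` and `linear_model.intercept_` has shape `(1,)`.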

Our trained model, `linear_model`, can now make predictions with its `predict` method. 
If we pass in values for our independent variable (`NumHeavyAtoms`), the model returns the values it predicts for the dependent variable (`MolWt`).
To visualize how well our model does, we will compare our model's predicted values to our
observed values.

In [None]:
# use the model to make predictions for every row in X
model_values = linear_model.predict(X)

# Save our predicted values in a new column in our dataframe.
df["PredictedWt"] = model_values

In [None]:
df.head(3)

We will now use seaborn to visualize our actual and predicted values.
This plot shows the `MolWt` as the x value, with the predicted `MolWt` as the Y value.
The `MolWt` vs itself is shown as a reference line. For a perfect model,
all values would fall along the line.

In [None]:
sns.scatterplot(x="MolWt", y="PredictedWt", data=df)
sns.lineplot(x="MolWt", y="MolWt", data=df, linestyle='--')

In [None]:
import math

# Evaluate the model's performance
mse = mean_squared_error(Y, model_values)
rmse = math.sqrt(mse)
r2 = r2_score(Y, model_values)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R2 Score:", r2)
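
For reference, the three metrics printed above are defined as follows. The notation is chosen here rather than taken from the notebook: $y_i$ are the observed values, $\hat{y}_i$ the predicted values, $\bar{y}$ the mean of the observed values, and $n$ the number of samples.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

The RMSE has the same units as $y$, which makes it easier to interpret than the MSE. Because $R^2$ compares the model with always predicting $\bar{y}$, it is negative for a model that does worse than that baseline.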

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>

<p>Perform a linear fit with MolWt as the y value and TPSA as the x value. Repeat the plotting and scoring analysis.</p>
</div>


<div class="exercise admonition">
<p class="admonition-title">Exercise - Multilinear Regression</p>

<p> To perform multilinear regression, you only need to pass two or more columns of data for X. Then, you perform the fit the same way. </p>
    
```python
X = df[["NumHeavyAtoms", "TPSA"]]  # two feature columns; the target (e.g. MolWt) must not be among them
```

Now, you will be fitting an equation of the form $y = m_{1}x_{1} + m_{2}x_{2} +b$.  
    
</div>
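
As a rough illustration of what a two-feature fit returns, the sketch below fits synthetic data generated from $y = 2x_{1} - 1.5x_{2} + 4$. The column names `x1` and `x2` and the variable names are invented for the example and do not come from the amino acid dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features drawn from a seeded random generator for reproducibility.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    {"x1": rng.uniform(0, 10, 50), "x2": rng.uniform(0, 10, 50)}
)

# Target built from a known equation: y = 2*x1 - 1.5*x2 + 4
target = 2.0 * features["x1"] - 1.5 * features["x2"] + 4.0

multi_model = LinearRegression().fit(features, target)

m1, m2 = multi_model.coef_   # one coefficient per feature column
b = multi_model.intercept_   # shared intercept
print("m1:", m1, "m2:", m2, "b:", b)
```

The fit returns one coefficient per input column, in the same order as the columns of `X`.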

## Model Validation - Train Test Split

When training a model for machine learning, it is a best practice to evaluate the model's performance on data that was not part of the training set.
One way to achieve this is to use a method called "train test split".

Train-test split is a widely used technique in machine learning and data science for evaluating the performance of a model. It involves dividing the available data into two distinct sets: a training set and a testing set. This partitioning lets us check that the model generalizes well to new, unseen data and lets us detect overfitting.

The primary purposes of using train-test split are:

* Model validation: The testing set serves as a proxy for new, unseen data. By comparing the model's predictions with the actual outcomes in the testing set, we can gauge its predictive accuracy and robustness.

* Detecting overfitting: Overfitting occurs when a model learns to perform exceptionally well on the training data but fails to generalize to new data. This is usually due to the model capturing the noise or random fluctuations in the training data rather than the underlying patterns. A train-test split does not prevent overfitting by itself, but it exposes it: an overfit model scores much better on the training set than on the testing set.

To perform a train-test split, the data is typically divided into approximately 70-80% for training and 20-30% for testing. This ratio can be adjusted depending on the size and characteristics of the dataset. The split should be done randomly to ensure that both sets are representative of the overall data distribution.

scikit-learn has tools that can split your data for you. We will now use `train_test_split` and repeat our analysis.

In [None]:
# Train test split

from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
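
Called with no extra arguments, `train_test_split` holds out 25% of the rows for testing and shuffles differently on every run. The sketch below shows the `test_size` and `random_state` parameters, which control the split ratio and make the shuffle reproducible. The arrays `X_demo` and `y_demo` are synthetic placeholders, not the lesson data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic samples with a single feature.
X_demo = np.arange(20).reshape(-1, 1)
y_demo = 2 * X_demo.ravel()

# Hold out 25% of the rows for testing; fix the seed so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)  # (15, 1) (5, 1)
```

Fixing `random_state` is mainly useful in a lesson or a report, where the same rows should land in the test set each time the notebook is rerun.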

Now we perform a fit using the training data only.

In [None]:
ttt_model = LinearRegression()
ttt_model.fit(X_train, Y_train)

After performing our fit with the training data, we use the "test" data to evaluate the model.

In [None]:
y_pred = ttt_model.predict(X_test)

In [None]:
df_test = pd.DataFrame()  # observed and predicted values for the test set
df_test["MolWt"] = Y_test
df_test["Predicted MolWt"] = y_pred

In [None]:
sns.scatterplot(x="MolWt", y="Predicted MolWt", data=df_test)
sns.lineplot(x="MolWt", y="MolWt", data=df, linestyle='--')  # reference line for a perfect model

In [None]:
import math

# Evaluate the model's performance
mse = mean_squared_error(Y_test, y_pred)
rmse = math.sqrt(mse)
r2 = r2_score(Y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R2 Score:", r2)

In [None]:
# Compare it to model performance on the training data.

y_pred = ttt_model.predict(X_train)


# Evaluate the model's performance
mse = mean_squared_error(Y_train, y_pred)
rmse = math.sqrt(mse)
r2 = r2_score(Y_train, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R2 Score:", r2)
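
Since the same metrics are computed for both the training and the test data, the comparison can be wrapped in a small helper. This is only a sketch: the function name `summarize` and the synthetic data are assumptions made for illustration.

```python
import math

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def summarize(model, X, y):
    """Return the RMSE and R2 of a fitted model on the given data."""
    predictions = model.predict(X)
    return {
        "rmse": math.sqrt(mean_squared_error(y, predictions)),
        "r2": r2_score(y, predictions),
    }


# Synthetic noisy linear data (hypothetical, not the amino acid set).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(80, 1))
y_demo = 5.0 * X_demo.ravel() + rng.normal(0, 1.0, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
fitted = LinearRegression().fit(X_tr, y_tr)

train_metrics = summarize(fitted, X_tr, y_tr)
test_metrics = summarize(fitted, X_te, y_te)
print("train:", train_metrics)
print("test: ", test_metrics)
```

Similar scores on both sets suggest the model generalizes; a training score that is much better than the test score is a sign of overfitting.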



## The scikit-learn Model API

All scikit-learn models use the same API, or interface. This means that to switch from a linear model to a more sophisticated model like a [random forest model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), you only need to change the import and the line that creates the model.

For example, recall our code to fit a linear model and use it for prediction:

```python
from sklearn.linear_model import LinearRegression 

model = LinearRegression()
model.fit(X,Y)
predictions = model.predict(X)
```

To do the same thing with a random forest regressor, the code would be:


```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, Y)
predictions = model.predict(X)

```
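
Because the interface is shared, several models can be fit and scored in one loop. The sketch below does this on synthetic data. The names `X_demo`, `y_demo`, `models`, and `scores` are illustrative, and the `n_estimators` and `random_state` settings are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic noisy linear data (not the lesson data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo.ravel() + rng.normal(0, 0.5, size=100)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Every model exposes the same fit / score (and predict) methods.
scores = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    scores[name] = model.score(X_demo, y_demo)

print(scores)
```

Note that `RandomForestRegressor` expects a 1D target. Passing a one-column dataframe such as `df[["MolWt"]]` works but raises a warning; `df["MolWt"]` avoids it.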


<div class="exercise admonition">
<p class="admonition-title">Exercises - Challenge</p>

<p> 1. Repeat this analysis for your vitamins data set.</p>
<p> 2. <strong>Bonus</strong> - Use the solvation dataset from the pandas lesson and create a model for solubility. You can use the descriptors in the file, or you can use RDKit to create molecules from the SMILES strings.</p>
</div>