<div style="text-align:center;">
  <img src="custom/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Linear Fitting using SciKit Learn

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>

<ul>
<li>How do you fit a linear model using SciKit Learn?</li>
<li>How can you evaluate the performance of a linear model?</li>
<li>What is a train-test split?</li>
</ul>

<strong>Objectives:</strong>

<ul>
<li>Learn how to fit a linear model using SciKit Learn.</li>
<li>Learn how to do a train-test split using SciKit Learn.</li>
</ul>

</div>

In this lesson, we will fit a linear model using SciKit Learn. 
Our data set will be the one we built yesterday.


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("data/amino_acids_processed.csv")
df.head()

In [None]:
X = df[["num_heavy"]]
Y = df[["molecular_weight"]]

## Linear Fitting with SciKit Learn
Now that we have prepared our X and Y variables, let's see how we would do a fit using SciKit Learn.

When fitting with SciKit Learn, you typically start by importing the type of model you want to use. In our case, we use a `LinearRegression` model, which performs ordinary least squares fitting. The workflow is: import the model, create a model object, give data to the model and tell it to perform a fit, then use the fitted model to make predictions.

Once you have imported the model, you can read more about it either on the SciKit Learn website or by using the built-in Python `help` function, for example `help(LinearRegression)`.

Before we do the fit, we first create the model. When we create it, we can specify settings such as whether we want the linear model to have an intercept. It will have one by default, but if you wanted to do an ordinary least squares fit without an intercept, you would pass `fit_intercept=False` when you create the model.
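
As a minimal sketch of the intercept setting, the example below fits the same data with and without an intercept. The arrays `x_demo` and `y_demo` are made-up illustration data (not the amino acid data set), chosen so that the true relationship is exactly `y = 3x + 2`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical illustration data following y = 3x + 2 exactly
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 3.0 * x_demo.ravel() + 2.0

# Default behavior: the model fits both a slope and an intercept
with_intercept = LinearRegression().fit(x_demo, y_demo)

# fit_intercept=False forces the fitted line through the origin
no_intercept = LinearRegression(fit_intercept=False).fit(x_demo, y_demo)

print("With intercept:   ", with_intercept.coef_, with_intercept.intercept_)
print("Without intercept:", no_intercept.coef_, no_intercept.intercept_)
```

Without an intercept, the slope has to absorb the offset in the data, so it no longer matches the slope used to generate the points.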

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear regression model
model = LinearRegression()
model.fit(X, Y)

After we create the model, we give it data and call the fit function. Then, the model will contain information about coefficients and an intercept.

In [None]:
print(f"The coefficients are {model.coef_} and the intercept is {model.intercept_}.")

Remember that each coefficient above corresponds to one of the features. Here there is only one feature, so the coefficient (12.55) tells us that each additional heavy atom increases the molecular weight by about 12.55 (what element does that roughly correspond to?).
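
One way to check this interpretation is to predict for two inputs that differ by one heavy atom and compare the difference to the slope. The sketch below uses made-up numbers (`heavy_demo` and `weight_demo` are hypothetical, not values from the amino acid data set).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical illustration data: heavy atom counts and weights
heavy_demo = np.array([[5], [6], [7], [9], [10]])
weight_demo = np.array([75.0, 89.0, 105.0, 131.0, 146.0])

demo_model = LinearRegression().fit(heavy_demo, weight_demo)
slope = demo_model.coef_[0]

# Predict for 8 and 9 heavy atoms; the inputs differ by exactly one atom
pred_8, pred_9 = demo_model.predict(np.array([[8], [9]]))
difference = pred_9 - pred_8

print(f"Slope: {slope:.3f}, change per added heavy atom: {difference:.3f}")
```

For a linear model the two numbers agree, which is what "the weight increases by the coefficient for each heavy atom" means.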

## Using the linear regression to make predictions

One way we might evaluate our fit is to compare the values predicted by the model to the actual values.

In [None]:
from sklearn.metrics import r2_score

# Predict molecular weights using the linear model
predicted_weight = model.predict(X)

# Print the model coefficients
print(f"Model coefficient (slope): {model.coef_[0][0]}")
print(f"Model intercept: {model.intercept_[0]}")

# r2_score expects the true values first, then the predictions
r2_score(Y, predicted_weight)
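
The R² value reported by `r2_score` can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up values (`y_true` and `y_hat` are hypothetical) and also shows that the argument order matters: `r2_score` takes the true values first and the predictions second.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_hat)   # correct order: (true, predicted)
r2_swapped = r2_score(y_hat, y_true)   # swapped order gives a different number

print(r2_manual, r2_sklearn, r2_swapped)
```

The total sum of squares is computed around the mean of the first argument, which is why swapping the arguments changes the result.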

## Visualizing Results

In machine learning contexts, we may sometimes not be able to visualize the model in 2 dimensions as we can with simpler models. In these cases, it's common to create a scatter plot of predicted vs. actual (true) values to evaluate the model's performance.

In this type of plot, one would expect a perfect model to produce points that lie along a line with a slope of 1, which corresponds to the line $y = x$. Points that fall below this diagonal line indicate that the model is under-predicting (i.e., the predicted values are less than the actual values), while points that fall above the line indicate that the model is over-predicting (i.e., the predicted values are greater than the actual values).
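
The same reading of the plot can be expressed numerically through the sign of `predicted - actual`. This is a small sketch with made-up values (`actual_demo` and `predicted_demo` are hypothetical).

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
actual_demo = np.array([100.0, 120.0, 150.0])
predicted_demo = np.array([95.0, 120.0, 158.0])

# Positive difference: point lies above y = x (over-prediction)
# Negative difference: point lies below y = x (under-prediction)
difference = predicted_demo - actual_demo
labels = np.where(difference > 0, "over",
                  np.where(difference < 0, "under", "exact"))

for a, p, label in zip(actual_demo, predicted_demo, labels):
    print(f"actual={a}, predicted={p}: {label}")
```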

In [None]:
# Plot the results
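# Flatten the single-column DataFrame to a 1-D array for plotting
y_values = Y.values.ravel()
plt.scatter(y_values, predicted_weight)

# Draw the ideal y = x line across the range of actual values
plt.plot([y_values.min(), y_values.max()], [y_values.min(), y_values.max()], 'r--', label='Ideal line (y=x)')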
plt.xlabel('Actual Molecular Weight')
plt.ylabel('Predicted Molecular Weight')
plt.title('Predicted vs. Actual Molecular Weight')
plt.legend()
plt.grid(True)
plt.axis('equal')

## Model Validation - Train Test Split

When training a model, it is best practice to evaluate the model's performance on data that was not part of the training set.
One way to achieve this is to use a method called "train-test split".

Train-test split is a widely used technique in the field of machine learning and data science to evaluate the performance of a model. It involves dividing the available data into two sets: a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate the model's performance.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
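
By default, `train_test_split` shuffles the data and holds out 25% of it for testing, so the split changes on every run. The sketch below shows the `test_size` and `random_state` parameters, which control the size of the test set and make the split reproducible. The arrays `X_demo` and `y_demo` are made-up illustration data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical illustration data: 20 samples with one feature
X_demo = np.arange(20).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

# Hold out 25% of the samples; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(np.array_equal(X_te, X_te2))
```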

Now we fit a new model using the training data only:

In [None]:
ttt_model = LinearRegression()
ttt_model.fit(X_train, Y_train)

After performing our fit with the training data, we use the "test" data to evaluate the model.

In [None]:
y_pred = ttt_model.predict(X_test)

In [None]:
# Print the model coefficients
print(f"Model coefficient (slope): {ttt_model.coef_[0][0]}")
print(f"Model intercept: {ttt_model.intercept_[0]}")

# r2_score expects the true values first, then the predictions
r2_score(Y_test, y_pred)
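
It is often useful to compare the score on the training data with the score on the test data; a much lower test score suggests the model does not generalize well. Regression models in SciKit Learn also provide a `score` method that returns R² directly. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the noise level are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic illustration data: a noisy line
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(40, 1))
y_demo = 12.0 * X_demo.ravel() + 5.0 + rng.normal(0, 1.0, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo_model = LinearRegression().fit(X_tr, y_tr)

# score() returns R^2 for the given data
train_r2 = demo_model.score(X_tr, y_tr)
test_r2 = demo_model.score(X_te, y_te)

# Equivalent to calling r2_score with (true, predicted)
test_r2_check = r2_score(y_te, demo_model.predict(X_te))

print(f"Train R^2: {train_r2:.4f}, Test R^2: {test_r2:.4f}")
```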

In [None]:
# Plot the results
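# Flatten the single-column DataFrame to a 1-D array for plotting
y_values = Y.values.ravel()
# Predict for all points (training and test) with the model fit on training data
y_predict_all = ttt_model.predict(X)

plt.scatter(y_values, y_predict_all)

# Draw the ideal y = x line across the range of actual values
plt.plot([y_values.min(), y_values.max()], [y_values.min(), y_values.max()], 'r--', label='Ideal line (y=x)')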
plt.xlabel('Actual Molecular Weight')
plt.ylabel('Predicted Molecular Weight')
plt.title('Predicted vs. Actual Molecular Weight')
plt.legend()
plt.grid(True)
plt.axis('equal')

## The SciKit Learn Model API

All SciKit Learn models use the same API, or interface. This means that to switch from a linear model to a more sophisticated model like a [random forest model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), one need only change the import and the model creation.

For example, recall our code to fit a linear model and use it for prediction:

```python
from sklearn.linear_model import LinearRegression 

model = LinearRegression()
model.fit(X,Y)
predictions = model.predict(X)
```

To do the same thing with a random forest regressor, the code would be:


```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
# Flatten the single-column target to 1-D to avoid a DataConversionWarning
model.fit(X, Y.values.ravel())
predictions = model.predict(X)
```
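
Because the interface is shared, the same loop can fit and evaluate several models. The following is a minimal sketch using made-up data (`X_demo` and `y_demo` are hypothetical); the `n_estimators` and `random_state` settings are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical illustration data following an exact line
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 12.0 * X_demo.ravel() + 3.0

models = {
    "linear": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, estimator in models.items():
    # The same three calls work for every estimator
    estimator.fit(X_demo, y_demo)
    predictions = estimator.predict(X_demo)
    scores[name] = estimator.score(X_demo, y_demo)
    print(f"{name}: R^2 on the training data = {scores[name]:.4f}")
```

Note that these scores are computed on the data used for fitting, so they say nothing about how either model performs on unseen data.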

<div class="alert alert-block alert-warning"> 
<h3>Challenge</h3>

Using the periodic table dataset from the previous lesson, fit ionization energy vs. electronegativity with a linear model.

</div>

In [None]:
# provided 
periodic_df = pd.read_csv("data/PubChemElements_all.csv")
elements_to_fit = periodic_df.dropna(subset=['Electronegativity', 'IonizationEnergy'])