# Linear Fitting using SciKit Learn

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>

<ul>
<li>How do you fit a linear model using SciKit Learn?</li>
<li>How can you evaluate the performance of a linear model?</li>
<li>What is a train-test-split?</li>
</ul>

<strong>Objectives:</strong>

<ul>
<li>Learn how to fit a linear model using SciKit Learn.</li>
<li>Learn how to do a train-test-split using SciKit Learn.</li>
</ul>

</div>

In this lesson, we will fit a linear model using SciKit Learn. 
Our data set will be the one we built yesterday.


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("data/amino_acids_processed.csv")
df.head()

In [None]:
X = df[["num_heavy"]]
Y = df[["molecular_weight"]]

## Linear Fitting with SciKit Learn
Now that we have prepared our X and Y variables, let's see how to do a fit using SciKit Learn.

When fitting with SciKit Learn, you typically start by importing the type of model you want to use. In our case, that is the `LinearRegression` model, which performs ordinary least squares fitting. After importing it, you create a model object, give it data, and tell it to perform a fit. The fitted model can then be used to make predictions.

Once you have imported the model, you can read more about it either on the SciKit Learn website or by using the built-in Python `help` function, for example `help(LinearRegression)`.

Before we do the fit, we first create the model. When creating it, we can specify settings such as whether the linear model should have an intercept. It has one by default, but if you wanted to do an ordinary least squares fit without an intercept, you would pass `fit_intercept=False` when you create the model.
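
As a small sketch of the intercept setting described above, the block below fits the same data with and without an intercept. The arrays `x_demo` and `y_demo` are synthetic values made up for illustration (an exact line, y = 3x + 5), not the amino acid data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): points on y = 3x + 5
x_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 3.0 * x_demo.ravel() + 5.0

# Default behaviour: fit_intercept=True, so slope and intercept are both fitted
with_intercept = LinearRegression()
with_intercept.fit(x_demo, y_demo)

# fit_intercept=False forces the fitted line through the origin
no_intercept = LinearRegression(fit_intercept=False)
no_intercept.fit(x_demo, y_demo)

print("With intercept:   ", with_intercept.coef_, with_intercept.intercept_)
print("Without intercept:", no_intercept.coef_, no_intercept.intercept_)
```

With the intercept removed, the slope has to absorb the offset, so it comes out larger than 3 for this data.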

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear regression model
model = LinearRegression()
model.fit(X, Y)

After we create the model, we give it data and call its `fit` method. The fitted model then stores the coefficients in `coef_` and the intercept in `intercept_`.

In [None]:
print(f"The coefficients are {model.coef_} and the intercept is {model.intercept_}.")

Remember that each coefficient above corresponds to one of the features. Here there is only one feature, so there is one coefficient. The coefficient (12.55) tells us that each additional heavy atom increases the molecular weight by about 12.55 (what element does that roughly correspond to?).
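
To make the role of the coefficient and intercept concrete, here is a minimal sketch showing that a prediction from a one-feature linear model is just `coef * x + intercept`. The numbers in `x_demo` and `y_demo` are invented for illustration and the names are hypothetical; they are not taken from the amino acid data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([14.0, 27.0, 38.0, 51.0])

demo_model = LinearRegression().fit(x_demo, y_demo)

# Predict for a new input in two ways
x_new = np.array([[5.0]])
# y_demo is 1D here, so coef_ is 1D and intercept_ is a scalar
by_hand = demo_model.coef_[0] * x_new[0, 0] + demo_model.intercept_
by_predict = demo_model.predict(x_new)[0]

print(by_hand, by_predict)  # the two values agree up to rounding
```

Note that in this lesson `Y` is a one-column DataFrame, so `coef_` is two-dimensional and `intercept_` is an array, which is why the cells in this lesson index them as `coef_[0][0]` and `intercept_[0]`.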

## Using the linear regression to make predictions

One way we might evaluate our fit is to compare the values predicted by the model to the actual values.

In [None]:
from sklearn.metrics import r2_score

# Predict molecular weights using the linear model
predicted_weight = model.predict(X)

# Print the model coefficients
print(f"Model coefficient (slope): {model.coef_[0][0]}")
print(f"Model intercept: {model.intercept_[0]}")

# r2_score expects the true values first, then the predictions
r2_score(Y, predicted_weight)
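
As a rough illustration of what `r2_score` computes, the sketch below reproduces it by hand with NumPy using $R^2 = 1 - SS_{res}/SS_{tot}$. The arrays `y_true` and `y_hat` are made-up numbers, not values from the amino acid data. Because $SS_{tot}$ is built from the true values, the argument order matters: `r2_score` takes the true values first and the predictions second.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0, 3.5])

# Residual sum of squares: how far predictions are from the true values
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: spread of the true values around their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_correct = r2_score(y_true, y_hat)   # true values first: about 0.9
r2_swapped = r2_score(y_hat, y_true)   # arguments swapped: about 0.8

print(r2_manual, r2_correct, r2_swapped)
```

The swapped call runs without error, so this mistake is easy to miss.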

We can use `matplotlib` to plot our predicted vs actual values.

In [None]:
# Plot the results
plt.scatter(X, Y, color='blue', label='Actual')
plt.plot(X, predicted_weight, color='red', linewidth=2, label='Predicted')
plt.xlabel('Number of Heavy Atoms')
plt.ylabel('Molecular Weight')
plt.legend()

## Model Validation - Train Test Split

When training a model, it is a best practice to evaluate the model's performance on data that was not part of the training set.
One way to achieve this is to use a method called "train test split".

A train-test split divides the available data into two sets: a training set and a testing set. The training set is used to fit the model, while the testing set is used to evaluate how well the model performs on data it has not seen.
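
The sketch below shows the mechanics of a split on a small synthetic array (`x_demo` and `y_demo` are invented for illustration). It uses two optional parameters of `train_test_split`: `test_size`, the fraction of rows held out for testing, and `random_state`, a seed that makes the random split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 20 samples with one feature
x_demo = np.arange(20).reshape(-1, 1)
y_demo = 2 * x_demo.ravel()

# Hold out 25% of the rows for testing; the seed makes the split reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)

print(len(x_tr), len(x_te))  # 15 training rows, 5 testing rows
```

Without `random_state`, each run produces a different split, so the fitted coefficients and the score can change slightly from run to run.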

In [None]:
from sklearn.model_selection import train_test_split

# By default, the rows are shuffled and 25% of them are held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

Now we fit a new model using the training data only:

In [None]:
ttt_model = LinearRegression()
ttt_model.fit(X_train, Y_train)

After performing our fit with the training data, we use the "test" data to evaluate the model.

In [None]:
y_pred = ttt_model.predict(X_test)

In [None]:
# Print the model coefficients
print(f"Model coefficient (slope): {ttt_model.coef_[0][0]}")
print(f"Model intercept: {ttt_model.intercept_[0]}")

# r2_score expects the true values first, then the predictions
r2_score(Y_test, y_pred)

In [None]:
# Plot the results
plt.scatter(X_test, Y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Number of Heavy Atoms')
plt.ylabel('Molecular Weight')
plt.legend()

<div class="alert alert-block alert-warning"> 
<h3>Challenge</h3>

Using the periodic table dataset from the previous lesson, fit ionization energy as a function of electronegativity.

</div>

In [None]:
# provided 
periodic_df = pd.read_csv("data/PubChemElements_all.csv")
elements_to_fit = periodic_df.dropna(subset=['Electronegativity', 'IonizationEnergy'])

In [None]:
# answer below
elements_to_fit.head()

In [None]:
X = elements_to_fit[["Electronegativity"]]
Y = elements_to_fit[["IonizationEnergy"]]

In [None]:
plt.scatter(X, Y)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
ttt_model = LinearRegression()
ttt_model.fit(X_train, Y_train)
y_pred = ttt_model.predict(X_test)
# Print the model coefficients
print(f"Model coefficient (slope): {ttt_model.coef_[0][0]}")
print(f"Model intercept: {ttt_model.intercept_[0]}")

# r2_score expects the true values first, then the predictions
print(r2_score(Y_test, y_pred))

# Plot the results
plt.scatter(X_test, Y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Electronegativity')
plt.ylabel('Ionization Energy')
plt.legend()