## Implement SLR to __Predict Baseball Team Wins from Run Differential__


Can you predict a baseball team's win total if you know their run differential?

The Major League Baseball (MLB) regular season generally consists of __162 games__ for each of the 30 teams in the __American League (AL)__ and __National League (NL)__. A __run (R)__ is worth $1$ point in baseball, and __run differential (RD)__ is the difference between the number of runs a team scores and the number of runs it gives up.

__Steps:__
1. Exploring the dataset and preprocessing
2. Modeling and fitting
3. Prediction and Model Performance

In [1]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background (for readability of visualizations)
sns.set_style("whitegrid")

## Let's start by reading our data and some EDA

In [None]:
# Use pandas to import the data; it is stored in the baseball_run_diff.csv file



# Look at 5 randomly sampled rows


In [None]:
## use .info to get a grasp of your dataset

In [2]:
## Handle missing information

In [None]:
## make a copy of the original dataframe so we don't mess with it while working

### Train-Test Split

To check our predictive model's accuracy, we split our data into a training subset ($\{(X_{\text{train}}, y_{\text{train}})\}$) and a testing subset ($\{(X_{\text{test}}, y_{\text{test}})\}$). One reason for this is that performance measured on the training data may overestimate a model's effectiveness. The test data serves as a sanity check to see whether the model's performance on unseen data aligns with expectations.
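As a rough illustration of the split described above, here is a minimal sketch on synthetic data. The dataframe `toy` and its values are made up for demonstration (they are not real MLB results), and the column names `RD` and `W` are assumed to match the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the baseball data (hypothetical values, not real MLB results)
rng = np.random.default_rng(0)
rd = rng.integers(-300, 301, size=120)
toy = pd.DataFrame({"RD": rd,
                    "W": np.round(81 + 0.1 * rd + rng.normal(0, 4, size=120))})

# Work on a copy so the original dataframe is left untouched
toy_copy = toy.copy()

# Set aside a random 25% of the rows for testing;
# random_state makes the split reproducible
toy_test = toy_copy.sample(frac=0.25, random_state=440)

# Everything that was not sampled becomes the training set
toy_train = toy_copy.drop(toy_test.index)

print(len(toy_train), len(toy_test))  # prints: 90 30
```

Dropping the sampled index from the copy guarantees that no row appears in both subsets.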


In [None]:
## We use sample() to make a random sample
## frac: to set aside 25% for testing
## random_state=440 allows you to reproduce the same train test split each time you run the code.

### Check for potential relationship between variables by plotting your data

Can we use SLR?
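Alongside the scatter plot, one possible numerical check is the Pearson correlation between the two variables. The sketch below uses synthetic data (the arrays `rd_demo` and `w_demo` are hypothetical stand-ins for the `RD` and `W` columns), so the number it prints says nothing about the real dataset.

```python
import numpy as np

# Synthetic stand-ins for RD and W (hypothetical values, not real MLB results)
rng = np.random.default_rng(1)
rd_demo = rng.uniform(-300, 300, size=100)
w_demo = 81 + 0.1 * rd_demo + rng.normal(0, 4, size=100)

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship,
# values near 0 suggest little or no linear relationship
r = np.corrcoef(rd_demo, w_demo)[0, 1]
print("Pearson correlation between RD and W:", round(r, 3))
```

A correlation close to $\pm 1$ supports trying SLR, but it does not replace looking at the plot, since a curved pattern can still produce a fairly high correlation.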

In [3]:
## plt.scatter plots RD on the x and W on the y

### Model Building

You can use the formula given in lecture (previous notebook) or build a model using `sklearn`, an open-source Python machine learning library.

__Steps:__

1. import the model from sklearn,
2. make a model object,
3. fit the object,
4. predict
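The four steps above can be sketched end to end on synthetic data. The closed-form estimates at the end are the standard least-squares formulas, which are assumed (not confirmed) to match the formula from lecture; all variable names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # step 1: import the model

# Synthetic data (hypothetical values, not real MLB results)
rng = np.random.default_rng(2)
x_demo = rng.uniform(-300, 300, size=100)
y_demo = 81 + 0.1 * x_demo + rng.normal(0, 4, size=100)

# sklearn expects a 2D feature array of shape (n_samples, n_features)
X_demo = x_demo.reshape(-1, 1)

slr_demo = LinearRegression()       # step 2: make a model object
slr_demo.fit(X_demo, y_demo)        # step 3: fit the object
y_hat = slr_demo.predict(X_demo)    # step 4: predict

# Standard closed-form least-squares estimates for comparison
x_bar, y_bar = x_demo.mean(), y_demo.mean()
slope = np.sum((x_demo - x_bar) * (y_demo - y_bar)) / np.sum((x_demo - x_bar) ** 2)
intercept = y_bar - slope * x_bar

print("sklearn slope:", slr_demo.coef_[0], "closed-form slope:", slope)
print("sklearn intercept:", slr_demo.intercept_, "closed-form intercept:", intercept)
```

The two approaches should agree up to floating-point error, which is a quick way to check a hand-written implementation.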

In [5]:
# First we import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression

In [6]:
## Now we make a LinearRegression object
slr = LinearRegression(copy_X = True)

The parameter `copy_X` is set to `True` (its default value). This makes a copy of the original input $X$ to ensure that it is not modified during the fitting process.

To learn more about the `LinearRegression` object read the documentation here: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html</a>.

In [7]:
## Now we fit the model. Replace X and y in the following
#slr.fit(X, y)

In [None]:
## Print the coefficients
beta_1_hat = slr.coef_[0]
beta_0_hat = slr.intercept_

print("beta_1_hat is", beta_1_hat)
print("beta_0_hat is", beta_0_hat)

Now you can produce a scatter plot of the data points along with the fitted SLR line ($\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$). Replace X and y to do so:

In [None]:
# Define padding for extending x-axis range
padding = 20

# Generate 1000 evenly spaced x values for plotting, extending the range of RD
x = np.linspace(X.min() - padding, X.max() + padding, 1000)

# Create the scatter plot along with the SLR line
plt.figure(figsize=(10, 10))  # Set figure size
plt.scatter(X, y, label="Observations")  # Plot the observations
plt.plot(x, beta_0_hat + beta_1_hat * x, 'k', label="SLR Line", linewidth=3)  # Plot the SLR line
plt.xlabel("Run Differential", fontsize=16)  # Set x-axis label
plt.ylabel("Wins", fontsize=16)  # Set y-axis label
plt.legend(fontsize=14)  # Add legend
plt.show()  # Display the plot


### Model Performance:

We measure the model's performance by looking at its predictions. The basic idea is to split the dataset into train and test subsets, compute the regression line on the train dataset, and use it to predict the values on the test dataset. A testing MSE much larger than the training MSE indicates high variance (__overfitting__), while a large MSE on both subsets suggests high bias (__underfitting__).

__Steps:__

1. Find the MSE on the training dataset

2. Find the MSE on the testing dataset

3. Compare them
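A minimal sketch of these three steps on synthetic data follows. The arrays and the index-based split are hypothetical and only stand in for the train and test subsets built earlier; the manual calculation shows that MSE is the average of the squared differences between actual and predicted values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical values, not real MLB results)
rng = np.random.default_rng(3)
x_all = rng.uniform(-300, 300, size=120)
y_all = 81 + 0.1 * x_all + rng.normal(0, 4, size=120)

# Random 75/25 split using a shuffled index
idx = rng.permutation(120)
train_idx, test_idx = idx[:90], idx[90:]
X_tr, y_tr = x_all[train_idx].reshape(-1, 1), y_all[train_idx]
X_te, y_te = x_all[test_idx].reshape(-1, 1), y_all[test_idx]

# Fit on the training data only
model = LinearRegression().fit(X_tr, y_tr)

# Step 1: MSE on the training dataset (seen data)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))

# Step 2: MSE on the testing dataset (unseen data)
test_mse = mean_squared_error(y_te, model.predict(X_te))

# The same training MSE computed by hand: mean of squared errors
manual_train_mse = np.mean((y_tr - model.predict(X_tr)) ** 2)

# Step 3: compare them
print("Training MSE:", train_mse)
print("Testing MSE:", test_mse)
```

Because MSE is in squared units of the target (wins squared), taking its square root gives a typical error in wins, which is often easier to interpret.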

In [8]:
## Here we calculate the MSE on the training data
## (how well the model performs on seen data)
from sklearn.metrics import mean_squared_error

In [10]:
## Store the y values from the train dataset in a variable and call it: y_train

In [11]:
# compute MSE on training dataset

## Store the predictions in a variable and call it y_train_pred
## alternatively, you can use:
#y_train_pred = slr.predict(X_train)
#Computing MSE
#mse = mean_squared_error(y_train, y_train_pred)
#print("The training MSE is", mse)

In [12]:
# compute MSE on testing dataset


1. How do you interpret your MSE?

2. Is it acceptable?

3. Let's look at the error again. The predictions tell you the pattern that the model has captured, and the residuals tell you what the model has missed. Plot the residuals. Did you successfully remove the strong linear pattern?

4. How do you interpret the residuals?
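One way to approach the residual plot is sketched below on synthetic data. The arrays `x_res` and `y_res` are hypothetical stand-ins for the training data, and residuals are taken here as actual minus predicted values.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical values, not real MLB results)
rng = np.random.default_rng(4)
x_res = rng.uniform(-300, 300, size=100)
y_res = 81 + 0.1 * x_res + rng.normal(0, 4, size=100)

# Fit SLR and compute residuals = actual - predicted
fit = LinearRegression().fit(x_res.reshape(-1, 1), y_res)
residuals = y_res - fit.predict(x_res.reshape(-1, 1))

# Residuals against the feature; the horizontal line marks zero error
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x_res, residuals, label="Residuals")
ax.axhline(0, color="k", linewidth=2)
ax.set_xlabel("Run Differential", fontsize=16)
ax.set_ylabel("Residual (actual - predicted wins)", fontsize=16)
ax.legend(fontsize=14)
plt.show()

# For a least-squares fit with an intercept, the training residuals
# average to zero and are uncorrelated with the feature
print("Mean residual:", residuals.mean())
print("Correlation of residuals with RD:", np.corrcoef(x_res, residuals)[0, 1])
```

If the linear pattern has been removed, the points should scatter around zero with no visible trend or curve. A funnel shape or a curve in this plot would suggest that SLR is missing some structure.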