<a href="https://colab.research.google.com/github/GabrielleRab/SRMPmachine/blob/main/Linear_regression_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Linear regression with a dataset of your choice** 

Recall that we use linear regression when our research question calls for analyzing a predictive relationship between at least one independent variable and a dependent variable. Put another way: are you trying to demonstrate that one characteristic of your dataset can be predicted by another? If the answer is yes, you might want to use linear regression!
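
For reference, the simplest version of this idea, with one independent variable $x$ and one dependent variable $y$, is usually written as the equation below. The symbols $\beta_0$ (intercept), $\beta_1$ (slope), and $\varepsilon$ (the leftover error) are standard textbook notation rather than names used elsewhere in this notebook.

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$

Fitting the model means choosing the intercept and slope so that the line sits as close as possible to the data points.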

### **Step 1:** Identify your question


This has already been done for you! Look at the research question for your dataset and consider whether or not it's a good fit for the linear regression approach.


### **Step 2:** Select your data

Let's import our data. First we need to load in the necessary Python libraries. Run the code below:

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import the numpy, pandas, and scipy packages
import numpy as np
import pandas as pd
from scipy import stats

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns

Next, we create a dataframe called `df` from a pre-cleaned version of your data.

**Important:** *Only run the cell for your chosen dataset. Ignore the other two cells.*

In [13]:
# Run this cell ONLY if you are using the stellar rotation dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/Stellar_rotation_clean.csv")

In [None]:
# Run this cell ONLY if you are using the dragonfly wing dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/wing_measurements_clean.csv")

In [None]:
# Run this cell ONLY if you are using the North Carolina crime dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/crime_data_clean.csv")

Let's take a look at the first 5 rows of the dataset. Make sure this is the dataset you meant to import! If it's wrong, just go back and run the correct cell above. That will overwrite the dataframe.

In [None]:
df.head()

Let's see how big the dataset is:

In [None]:
len(df)
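
Beyond the row count, it can help to check the number of columns, their data types, and whether any values are missing. The sketch below uses a small made-up dataframe called `toy_df` with hypothetical column names (`period`, `mass`); in the notebook you would run the same calls on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are made up
toy_df = pd.DataFrame({
    "period": [1.2, 3.4, 5.6, 7.8],
    "mass": [0.9, 1.1, 1.0, 1.3],
})

n_rows, n_cols = toy_df.shape               # (number of rows, number of columns)
missing_per_column = toy_df.isna().sum()    # count of missing values in each column

print(n_rows, n_cols)
print(toy_df.dtypes)          # linear regression needs numerical columns
print(missing_per_column)
```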

### **Step 3:** Choose your method

Review your dataset and your research question one more time to make sure that you're ready to run linear regression. There should be two numerical variables that you believe might have a linear predictive relationship. This approach is not good for categorizing data or finding non-linear patterns.

### **Step 4:** Prepare your data

This step has been taken care of for you! All three datasets have had rows with missing data and problematic outliers removed.

Create some boxplots to confirm:

**Simply replace the xxxx in the quotation marks with the name of the column you would like to use for the box plot. Make sure you don't have any typos in the column name!**

If you want to look at more than one boxplot, just change the column name and re-run the cell.

In [None]:
# Boxplot for your data. Replace xxxx with the name of your column
sns.boxplot(x=df["xxxx"])
plt.show()
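
When reading a boxplot, points drawn individually beyond the whiskers are the ones flagged as possible outliers. By default the whiskers reach at most 1.5 times the interquartile range (IQR) past the box. The helper below, `iqr_outliers`, is a hypothetical function written for this illustration; it applies that rule to a made-up list of numbers so you can see which values would be flagged.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the k * IQR whisker range."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])   # edges of the box
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr  # furthest the whiskers may reach
    return values[(values < lower) | (values > upper)]

# Made-up measurements with one value far from the rest
toy_values = [10, 11, 12, 12, 13, 14, 15, 40]
print(iqr_outliers(toy_values))
```

If a cleaned column still shows a few flagged points, that is not necessarily a problem; the rule is only a rough guide.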

Let's look at the correlations between all variables in your dataset. This will help you choose which variables to compare for linear regression:

In [None]:
# Let's see the correlation between different variables.
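# numeric_only=True skips any non-numerical columns, which would otherwise raise an error in recent pandas versions
sns.heatmap(df.corr(numeric_only=True), cmap="YlGnBu", annot=True)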
plt.show()
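
If the heatmap is crowded, one option is to list the correlations with a single column, strongest first. This is only a sketch on made-up data: `toy_df`, its column names, and `target` are hypothetical, and in the notebook you would use `df` and one of your own columns.

```python
import pandas as pd

# Made-up numbers; the column names are hypothetical
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],   # rises steadily with "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as "a" rises
})

target = "a"
corr_with_target = (
    toy_df.corr(numeric_only=True)[target]
    .drop(target)                                         # a column always correlates 1.0 with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)  # strongest relationship first, ignoring sign
)
print(corr_with_target)
```

Values near 1 or -1 suggest a strong linear relationship, while values near 0 suggest that a straight line will not describe the pair well.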

Now it's time to choose your dependent and independent variables for linear regression analysis. 

Consider: Which variable do you expect to predict another variable? This is the independent variable. Which variable will be predicted? This is the dependent variable.

**Replace the words "independent" and "dependent" with the correct column names:**

In [38]:
# Replace the words "independent" and "dependent" with the correct column names:
ind_var = "independent"
dep_var = "dependent"

X = df[ind_var]
y = df[dep_var]
xlabel = ind_var
ylabel = dep_var

### **Step 5:** Train the model

We now need to split our variables into training and testing sets. We'll do this by importing `train_test_split` from the `sklearn.model_selection` module. A common practice is to keep 70% of the data in the training set and the remaining 30% in the test set.

In [39]:
# import train test split function
from sklearn.model_selection import train_test_split

# split the data into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
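
To see what a split like this does conceptually, here is a hand-rolled version using only NumPy. It is a sketch for illustration, not how scikit-learn implements `train_test_split`, and it will not pick the same rows as the cell above even with the same seed. The function name `simple_split` and the toy arrays are hypothetical.

```python
import numpy as np

def simple_split(X, y, train_fraction=0.7, seed=100):
    """Shuffle the row positions, then cut them into training and testing parts."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)          # fixed seed makes the shuffle repeatable
    order = rng.permutation(len(X))            # random ordering of row positions
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 points on the line y = 2x + 1
X_toy = np.arange(10)
y_toy = 2 * X_toy + 1

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(len(X_tr), len(X_te))
```

The key point is that each X value stays paired with its y value, and that no row ends up in both sets.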

First, import the `statsmodels.api` module, which you'll use to perform the linear regression.

In [40]:
import statsmodels.api as sm

By default, `sm.OLS` fits a line that passes through the origin. To get an intercept, you need to add a constant column yourself with the `add_constant` function of `statsmodels`. Once you've added the constant to your `X_train` dataset, you can fit a regression line using the `OLS` (Ordinary Least Squares) class of `statsmodels`, as shown below:

In [41]:
# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()
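
To see why the constant matters, the sketch below fits made-up data twice with plain NumPy least squares (`np.linalg.lstsq`) instead of `statsmodels`: once with x alone, and once with a column of ones placed in front of x, which is the layout `add_constant` produces. The data and variable names are hypothetical.

```python
import numpy as np

# Made-up data lying exactly on the line y = 3 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Without a constant column the only parameter is a slope,
# so the fitted line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# With a column of ones, the first parameter acts as the intercept
design = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(design, y, rcond=None)[0]

print(slope_only[0])       # slope of the best line through the origin
print(intercept, slope)    # recovers the true intercept and slope
```

Without the constant, the fit compensates for the missing intercept by distorting the slope, which is why the notebook adds the constant before calling `OLS`.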

In [None]:
# Print the parameters, i.e. the intercept and the slope of the regression line fitted
lr.params

In [None]:
# The summary lists the fitted coefficients along with statistics such as R-squared and p-values
print(lr.summary())

**How does your R-squared look?**

What counts as a good R-squared depends on the field. In some scientific studies, a regression model may need an R-squared above 0.95 to be considered reliable. In other domains, an R-squared of just 0.3 may be sufficient if there is extreme variability in the dataset.
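
If you are curious where the number comes from, R-squared compares the model's squared errors with the squared errors you would get by always predicting the mean. The function `r_squared_by_hand` below is a hypothetical helper written for illustration, applied to made-up values.

```python
import numpy as np

def r_squared_by_hand(y_true, y_pred):
    """R-squared = 1 - (squared error of the model) / (squared error of predicting the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left over after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared_by_hand(y_true, y_true))                  # perfect predictions
print(r_squared_by_hand(y_true, [2.5, 2.5, 2.5, 2.5]))    # always predicting the mean
```

Perfect predictions give 1, and always predicting the mean gives 0. On data the model has not seen, the value can even be negative if the predictions are worse than the mean.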


Let's visualize how well the model fits the data:

In [None]:
plt.scatter(X_train, y_train)
plt.title("Training data")
plt.xlabel(xlabel)
plt.ylabel(ylabel)
# .iloc selects by position: 0 is the intercept (const), 1 is the slope
plt.plot(X_train, lr.params.iloc[1]*X_train + lr.params.iloc[0], 'r')
plt.show()

### **Step 6:** Test the model

Now that we've fitted a regression line on the training dataset, it's time to make some predictions on the test data:

In [45]:
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)

# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)
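
For a model with one independent variable, a prediction is just the fitted line evaluated at each new x value. The sketch below shows this with hypothetical numbers standing in for the fitted intercept and slope; it uses NumPy only and does not call `statsmodels`.

```python
import numpy as np

# Hypothetical fitted parameters, standing in for the values in lr.params
intercept, slope = 1.5, 0.8

# Made-up x values to predict for
x_new = np.array([0.0, 10.0, 20.0])

# Same layout that add_constant produces: a column of ones, then x
design = np.column_stack([np.ones_like(x_new), x_new])
y_hat_matrix = design @ np.array([intercept, slope])

# Equivalent straight-line formula
y_hat_line = intercept + slope * x_new

print(y_hat_matrix)
print(y_hat_line)
```

Both calculations give the same predictions, which is why the test data needs the same constant column as the training data.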

Let's check the R-squared on the test set. The closer R-squared is to 1, the better the model's predictions match the test data:

In [None]:
from sklearn.metrics import r2_score
r_squared = r2_score(y_test, y_pred)
r_squared

In [None]:
print(f"{r_squared * 100:.1f}% of the variance in {dep_var} in the test set is explained by {ind_var}")

Let's visualize the fit on the test set:

In [None]:
plt.scatter(X_test, y_test)
# .iloc selects by position: 0 is the intercept (const), 1 is the slope
plt.plot(X_test, lr.params.iloc[1]*X_test + lr.params.iloc[0], 'r')
plt.title("Test data")
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.show()

### **Step 7:** Evaluate the model

Did this model help you answer your research question?

What are some forms of bias that you need to be aware of in this analysis?

What questions do you still have?