# Module 7 Exercises - Linear Regression

### Exercise 1:

Using the pandas library, load the gradedata.csv file from the datasets folder as a dataframe. Narrow your data (make the dataframe smaller) by keeping only the columns that you think can help predict student grades. Use any method that you've learned so far to decide which columns to keep.

In [39]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

location = "datasets/gradedata.csv"
df = pd.read_csv(location)
df.head()

| | fname | lname | gender | age | exercise | hours | grade | address |
|---|---|---|---|---|---|---|---|---|
| 0 | Marcia | Pugh | female | 17 | 3 | 10 | 82.4 | 9253 Richardson Road, Matawan, NJ 07747 |
| 1 | Kadeem | Morrison | male | 18 | 4 | 4 | 78.2 | 33 Spring Dr., Taunton, MA 02780 |
| 2 | Nash | Powell | male | 18 | 5 | 9 | 79.3 | 41 Hill Avenue, Mentor, OH 44060 |
| 3 | Noelani | Wagner | female | 14 | 2 | 7 | 83.2 | 8839 Marshall St., Miami, FL 33125 |
| 4 | Noelani | Cherry | female | 18 | 4 | 15 | 87.4 | 8304 Charles Rd., Lewis Center, OH 43035 |


In [40]:
df2 = df.drop(["fname", "lname", "address"], axis = 1)
df2.head()

| | gender | age | exercise | hours | grade |
|---|---|---|---|---|---|
| 0 | female | 17 | 3 | 10 | 82.4 |
| 1 | male | 18 | 4 | 4 | 78.2 |
| 2 | male | 18 | 5 | 9 | 79.3 |
| 3 | female | 14 | 2 | 7 | 83.2 |
| 4 | female | 18 | 4 | 15 | 87.4 |
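
One possible way to support the column choice is to check how strongly each numeric column correlates with `grade`. The sketch below is only an illustration: it uses the five rows printed by `df.head()` in a hypothetical frame named `toy`, so the numbers say nothing reliable about the full dataset. On the real data the same call would be applied to the loaded dataframe.

```python
import pandas as pd

# Toy frame built from the five rows shown by df.head(); not the full dataset
toy = pd.DataFrame({
    "gender":   ["female", "male", "male", "female", "female"],
    "age":      [17, 18, 18, 14, 18],
    "exercise": [3, 4, 5, 2, 4],
    "hours":    [10, 4, 9, 7, 15],
    "grade":    [82.4, 78.2, 79.3, 83.2, 87.4],
})

# Keep numeric columns only, then take each column's Pearson correlation with the target
numeric = toy.select_dtypes(include="number")
corr_with_grade = numeric.corr()["grade"].drop("grade")

# Columns with a correlation near zero are candidates for dropping
print(corr_with_grade.sort_values(ascending=False))
```

Correlation only captures linear, one-column-at-a-time relationships, so it is a first filter rather than a final answer.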


### Exercise 2:

Using the dataframe from the exercise above, clean and prepare your data. This means eliminating any null (missing) values, either by dropping or by filling them, and converting any non-numeric columns to numerical values that a model can interpret. For example, if a column holds string values, convert them to integers that best represent their order.

In [42]:
# dtype=int keeps the dummy column as 0/1; newer pandas versions default to True/False
df3 = pd.get_dummies(data=df2, columns=['gender'], drop_first=True, dtype=int)
df3.head()

| | age | exercise | hours | grade | gender_male |
|---|---|---|---|---|---|
| 0 | 17 | 3 | 10 | 82.4 | 0 |
| 1 | 18 | 4 | 4 | 78.2 | 1 |
| 2 | 18 | 5 | 9 | 79.3 | 1 |
| 3 | 14 | 2 | 7 | 83.2 | 0 |
| 4 | 18 | 4 | 15 | 87.4 | 0 |


In [43]:
df3.rename(columns={'gender_male': 'male'}, inplace=True)
df3.head()

| | age | exercise | hours | grade | male |
|---|---|---|---|---|---|
| 0 | 17 | 3 | 10 | 82.4 | 0 |
| 1 | 18 | 4 | 4 | 78.2 | 1 |
| 2 | 18 | 5 | 9 | 79.3 | 1 |
| 3 | 14 | 2 | 7 | 83.2 | 0 |
| 4 | 18 | 4 | 15 | 87.4 | 0 |


In [44]:
df3.describe()

| | age | exercise | hours | grade | male |
|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 16.5785 | 3.0005 | 10.9885 | 82.55605 | 0.5 |
| std | 1.696254 | 1.423205 | 4.063942 | 9.747593 | 0.500125 |
| min | 14.0 | 0.0 | 0.0 | 32.0 | 0.0 |
| 25% | 15.0 | 2.0 | 8.0 | 75.575 | 0.0 |
| 50% | 17.0 | 3.0 | 11.0 | 82.7 | 0.5 |
| 75% | 18.0 | 4.0 | 14.0 | 89.7 | 1.0 |
| max | 19.0 | 5.0 | 20.0 | 100.0 | 1.0 |


In [45]:
df3.isnull().sum()

age         0
exercise    0
hours       0
grade       0
male        0
dtype: int64
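
The null check above found nothing to fix, so neither dropping nor filling was needed here. For reference, the two options named in the exercise could look like the sketch below. The frame `toy` and its gaps are hypothetical, and filling with the column mean is just one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the real gradedata.csv had no missing values
toy = pd.DataFrame({
    "hours": [10.0, np.nan, 9.0, 7.0],
    "grade": [82.4, 78.2, np.nan, 83.2],
})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill each gap with the mean of its own column
filled = toy.fillna(toy.mean())

print(toy.isnull().sum())           # missing values per column before cleaning
print(dropped.shape)                # rows that survive dropping
print(filled.isnull().sum().sum())  # total missing values after filling
```

Dropping is simplest but discards data; filling keeps every row at the cost of inventing plausible values.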

### Exercise 3:

Using the cleaned dataframe from the exercise above, split the data into training and test datasets with the sklearn library. Make the test size 30%.

In [29]:
X = df3.drop("grade", axis=1)
Y = df3["grade"]

In [30]:
X.head()

| | age | exercise | hours | male |
|---|---|---|---|---|
| 0 | 17 | 3 | 10 | 0 |
| 1 | 18 | 4 | 4 | 1 |
| 2 | 18 | 5 | 9 | 1 |
| 3 | 14 | 2 | 7 | 0 |
| 4 | 18 | 4 | 15 | 0 |


In [31]:
Y.head()

0    82.4
1    78.2
2    79.3
3    83.2
4    87.4
Name: grade, dtype: float64

In [32]:
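# Import the function explicitly; a bare "import sklearn" is not guaranteed to load the model_selection submodule
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=10)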

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1400, 4)
(600, 4)
(1400,)
(600,)
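
The `random_state` argument is what makes the split repeatable. As a small sketch on hypothetical arrays (ten rows instead of 2000), two calls with the same seed return identical partitions, and `test_size=0.3` sends 3 of the 10 rows to the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same seed twice, so both calls shuffle the rows the same way
first = train_test_split(X_toy, y_toy, test_size=0.3, random_state=10)
second = train_test_split(X_toy, y_toy, test_size=0.3, random_state=10)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)

# Every returned array matches between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

Without a fixed `random_state`, each run would shuffle differently and the MSE values later in the notebook would change from run to run.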


### Exercise 4:

Using the training data from the previous exercise, fit a linear regression model to the data (build the model).

In [34]:
lm = LinearRegression()

lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Exercise 5:

What is the intercept (y-intercept) of the linear regression model?

In [35]:
lm.intercept_

57.56939978204882
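
The intercept is the predicted grade when every feature equals zero; the per-feature slopes live in `coef_`. To make that concrete, here is a minimal sketch on hypothetical data generated from a known rule, `grade = 50 + 2*hours + 1*exercise`, so the fitted values can be compared with the true ones. The names `X_demo`, `y_demo`, and `lm_demo` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: grade = 50 + 2*hours + 1*exercise
X_demo = pd.DataFrame({"hours": [0, 1, 2, 3, 4], "exercise": [1, 0, 2, 1, 3]})
y_demo = 50 + 2 * X_demo["hours"] + 1 * X_demo["exercise"]

lm_demo = LinearRegression().fit(X_demo, y_demo)

# intercept_ recovers the constant term of the generating rule
print(round(float(lm_demo.intercept_), 6))

# coef_ holds one slope per feature, in the same order as the columns
slopes = dict(zip(X_demo.columns, np.round(lm_demo.coef_, 6)))
print(slopes)
```

On real, noisy data the recovered values are estimates rather than exact constants, which is why the intercept above is not a round number.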

### Exercise 6:

Use the predict function on the training data and the test data.

In [37]:
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)

print(pred_train[:3])
print(pred_test[:3])

[75.43520487 70.7069889  76.32075166]
[86.16503067 87.48589372 67.71795141]
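
For a linear regression, `predict` amounts to the intercept plus the dot product of the features with the coefficients. The sketch below checks that on hypothetical random data; the names `X_rand`, `y_rand`, and `lm_rand` are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical random data: 20 samples, 3 features
rng = np.random.default_rng(0)
X_rand = rng.normal(size=(20, 3))
y_rand = rng.normal(size=20)

lm_rand = LinearRegression().fit(X_rand, y_rand)

# Manual prediction: intercept + sum over features of (value * slope)
manual = lm_rand.intercept_ + X_rand @ lm_rand.coef_

# Should agree with the library's predict up to floating-point error
matches = np.allclose(manual, lm_rand.predict(X_rand))
print(matches)
```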


### Exercise 7:

Calculate the MSE (mean squared error) of the training and test predictions. How "good" was the linear regression model at predicting the test data compared to the training data?

In [38]:
# The model was fit once, on the training data; here it is only scored on each split
print('Training MSE (scored on X_train, y_train):', np.mean((y_train - lm.predict(X_train)) ** 2))

print('Test MSE (scored on X_test, y_test):', np.mean((y_test - lm.predict(X_test)) ** 2))

Training MSE (scored on X_train, y_train): 32.162703612683714
Test MSE (scored on X_test, y_test): 31.245631088363965


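Here the test MSE (about 31.2) is slightly lower than the training MSE (about 32.2), so the model predicts unseen data about as well as the data it was fit on, and there is no sign of overfitting.

Some rules of thumb for reading the two numbers:

- The training MSE is usually lower than the test MSE, because the model was fit to the training data.
- If the training MSE is much lower than the test MSE, this could be a sign of overfitting.
- If the training MSE is much higher than the test MSE, the model or the split deserves a closer look.

The test score is the more important one to look at, because it estimates how the model performs on data it has not seen.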
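
As a cross-check on the hand-written formula, the same quantity is available as `sklearn.metrics.mean_squared_error`. The sketch below compares the two on three hypothetical values, where the errors are 1, 0 and -2, so the mean of the squared errors is (1 + 0 + 4) / 3.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Manual MSE: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# Library version of the same calculation
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Because MSE is in squared units of the target, taking its square root gives an error in grade points, which is often easier to interpret.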