# Linear Regression Exercise `Python`

This exercise is a little different: instead of guiding you question by question, we will let you construct a linear model in your choice of `R` or `Python`.

The task is to predict `height` from the `'/dsa/data/all_datasets/stature-hand-foot/stature-hand-foot.csv'` dataset. You can use any variable or combination of variables as predictors.

You will not be graded on the performance of the model itself, but please approach this as an actual prediction problem. Split the data into training and testing sets, train your model on the training set, and assess its performance on the testing set. Be sure to generate predictions from your testing inputs.

The purpose of this assignment is to demonstrate your ability to use regression to develop a machine learning model. Feel free to include anything that demonstrates your understanding of model development and refinement, including data exploration and a written description of your reasoning.

As always, feel free to ask questions along the way if you get stuck at any point. We are more than happy to help!

To add execution cells, click in this cell.
Then, in the notebook menu: `Insert > Insert Cell Below`

In [3]:
# Import the libraries and dataset and look at the first few rows
import pandas as pd
import numpy as np
from sklearn import linear_model # necessary package for linear regression

df = pd.read_csv('/dsa/data/all_datasets/stature-hand-foot/stature-hand-foot.csv')

df.head()

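|   | gender | height | hand length | foot length |
|---|--------|--------|-------------|-------------|
| 0 | 1 | 1760.2 | 208.6 | 269.6 |
| 1 | 1 | 1730.1 | 207.6 | 251.3 |
| 2 | 1 | 1659.6 | 173.2 | 193.6 |
| 3 | 1 | 1751.3 | 258.0 | 223.8 |
| 4 | 1 | 1780.6 | 212.3 | 282.1 |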


In [4]:
# Split the data into training (80%) and testing (20%) sets
train = df.sample(frac=8/10, random_state = 1)
test = df.drop(train.index)
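
An equivalent split can also be made with scikit-learn's `train_test_split`. The sketch below is only an illustration: it rebuilds the five rows shown by `df.head()` as a stand-in for the full dataset, and `demo_df`, `demo_train`, and `demo_test` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: the five rows printed by df.head(), not the full dataset
demo_df = pd.DataFrame({
    "gender": [1, 1, 1, 1, 1],
    "height": [1760.2, 1730.1, 1659.6, 1751.3, 1780.6],
    "hand length": [208.6, 207.6, 173.2, 258.0, 212.3],
    " foot length": [269.6, 251.3, 193.6, 223.8, 282.1],
})

# Hold out 20% of the rows for testing; random_state makes the split repeatable
demo_train, demo_test = train_test_split(demo_df, test_size=0.2, random_state=1)

print(len(demo_train), len(demo_test))
```

Both approaches give disjoint training and testing sets. The `sample`/`drop` version used in this notebook needs only pandas, while `train_test_split` can also split `X` and `y` arrays in a single call.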

In [7]:
# Print the column names: ' foot length' has a leading space, and omitting it caused errors
print(df.columns)

Index(['gender', 'height', 'hand length', ' foot length'], dtype='object')
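
One way to avoid the leading-space problem altogether is to strip whitespace from the column names right after loading. This is a minimal sketch, not something the notebook does: `demo_df` is a hypothetical empty frame with the same column names as above.

```python
import pandas as pd

# Hypothetical empty frame carrying the same column names as the dataset
demo_df = pd.DataFrame(columns=["gender", "height", "hand length", " foot length"])

# Remove leading and trailing whitespace from every column name
demo_df.columns = demo_df.columns.str.strip()

print(list(demo_df.columns))
```

If the real `df` were cleaned this way, the later cells would need to reference `'foot length'` without the leading space.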


In [8]:
# Make arrays for x and y variables for both training and testing sets
train_X = np.asarray(train[['gender', 'hand length', ' foot length']])
train_y = np.asarray(train.height)
test_X = np.asarray(test[['gender', 'hand length', ' foot length']])
test_y = np.asarray(test.height)

In [9]:
# Create linear regression object
regr = linear_model.LinearRegression()

In [10]:
# Train the model using the training sets
regr.fit(train_X, train_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [11]:
# Display the intercept
regr.intercept_

675.4199499784518

In [12]:
# Display the coefficients for each predictor
z = zip(['gender', 'hand length', ' foot length'],regr.coef_)
list(z)

[('gender', -40.36670587082764),
 ('hand length', 2.717991037060058),
 (' foot length', 2.0921446369297643)]
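
To see how the intercept and coefficients combine into a prediction, the sketch below applies the fitted equation by hand to the first row shown by `df.head()` (gender 1, hand length 208.6, foot length 269.6). The variable names are illustrative only, and the numbers are copied from the outputs above.

```python
import numpy as np

# Values copied from the regr.intercept_ and regr.coef_ outputs above
intercept = 675.4199499784518
coef = np.array([-40.36670587082764, 2.717991037060058, 2.0921446369297643])

# Predictors in the same order as the coefficients: gender, hand length, foot length
x = np.array([1, 208.6, 269.6])

# height = intercept + sum(coefficient_i * predictor_i)
predicted_height = intercept + np.dot(coef, x)

print(round(predicted_height, 1))
```

The result comes out at roughly 1766, compared with the recorded height of 1760.2 for that row. This is the same calculation `regr.predict` performs for each row.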

In [15]:
# Display R-squared for the training data
# score() returns the coefficient of determination (R-squared); 1.0 is a perfect prediction
print('Training R-Squared: {}'.format(regr.score(train_X, train_y)))

Training R-Squared: 0.8714169877788327


In [16]:
# Display R-Squared for testing data
print('Test R-Squared: {}'.format(regr.score(test_X, test_y)))

Test R-Squared: 0.9119185401572119


In [17]:
# Make predictions for the test set by passing test_X to the model
regr.predict(test_X)

array([1725.06413066, 1706.08111449, 1792.77445332, 1818.58468958,
       1799.11167488, 1698.80152935, 1762.80138834, 1732.24719503,
       1741.55634017, 1757.82495929, 1804.52619641, 1750.64908288,
       1687.14990101, 1738.89733979, 1750.42020486, 1748.57839879,
       1737.43283855, 1735.11181589, 1583.55940636, 1764.60477098,
       1572.99677143, 1609.18188869, 1622.58588695, 1580.94512406,
       1602.79994905, 1670.77488285, 1623.90375837, 1601.00106406,
       1596.67014497, 1600.24825139, 1644.59093575])
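
R-squared is unitless, so it can help to also report the error in the same units as `height`. The sketch below computes root mean squared error (RMSE) and mean absolute error (MAE) with NumPy. `regression_errors` is a hypothetical helper, and the example values are made up for illustration rather than taken from the dataset.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (RMSE, MAE) for two equal-length sequences of values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    return rmse, mae

# Made-up values for illustration only, not taken from the dataset
example_true = [1700.0, 1750.0, 1800.0]
example_pred = [1710.0, 1745.0, 1790.0]

rmse, mae = regression_errors(example_true, example_pred)
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}")
```

In this notebook, the equivalent call would be `regression_errors(test_y, regr.predict(test_X))`.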

# Save your notebook, then `File > Close and Halt`