# STAT 207 Group Lab Assignment 14 - [10 total points]

## Regularization Models & Selecting a Tuning Parameter

<hr>

## <u>Purpose</u>:
You should work in groups of 2-3 on this report (not working in groups without permission will result in a point deduction). The purpose of this group lab assignment is to fit a regularized model and assess its performance on new data.
<hr>

## <u>Assignment Instructions</u>:

### Contribution Report
These contribution reports should be included in all group lab assignments. In the contribution report below, you should list the following:
1. The netID for the lab submission to be graded.  (Some groups have each member create their own version of the document, but only one needs to be submitted for grading.  Other groups have only one member compose and submit the lab.)
2. Names and netIDs of each team member.
3. Contributions of each team member to the report.

### Group Roles

Suggested and specified roles are provided below: 

#### Groups of 2

* **Driver**: This student will type the report.  While typing the report, you may be the one who is selecting the functions to apply to the data.
* **Navigator**: This student will guide the process of answering the question.  Specific ways to help may include: outlining the general steps needed to solve a question (providing the overview), locating examples within the course notes, and reviewing each line of code as it is typed.

#### Groups of 3

* **Driver**: This student will type the report.  They may also be the one to select the functions to apply to the data.
* **Navigator**: This student will guide the process of answering the question.  They may select the general approach to answering the question and/or a few steps to be completed along the way. 
* **Communicator**: This student will review the report (as it is typed) to ensure that it is clear and concise.  This student may also locate relevant examples within the course notes that may help complete the assignment.

<hr>

### Imports

In [87]:
#Run this
import pandas as pd                    # imports pandas and calls the imported version 'pd'
import matplotlib.pyplot as plt        # imports the package and calls it 'plt'
import seaborn as sns                  # imports the seaborn package with the imported name 'sns'
sns.set()  
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

## Case Study: Avoiding Underwater Weighing

We will look at data collected on 252 males in 1985.  In particular, we will consider a measure of the percent of body fat in these males, as measured by Siri's equation (`siri`).  This particular measure is obtained through an underwater weighing technique and is quite extensive and resource-intensive.  We would like to find alternative, easier-to-capture measurements that provide roughly the same information that the `siri` variable currently contains.  We have other body measures for these males available, including:

- **age**: Age (yrs)
- **weight**: Weight (lbs)
- **height**: Height (inches)
- **adipos**: BMI index
- **neck**: Neck circumference (cm)
- **chest**: Chest circumference (cm)
- **abdom**: Abdomen circumference (cm)
- **hip**: Hip circumference (cm)
- **thigh**: Thigh circumference (cm)
- **knee**: Knee circumference (cm)
- **ankle**: Ankle circumference (cm)
- **biceps**: Extended biceps circumference (cm)
- **forearm**: Forearm circumference (cm)
- **wrist**: Wrist circumference (cm)

**We will use all variables in the data as predictors except for `siri`, `brozek`, `density`, and `free` (fat-free weight)**, since these four variables are challenging measurements to obtain.

The code cell below will read in the data for you.  Be sure to run the cell. 

In [88]:
df = pd.read_csv('fat.csv')
df.head()

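| | brozek | siri | density | age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.6 | 12.3 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1 | 6.9 | 6.1 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 2 | 24.6 | 25.3 | 1.0414 | 22 | 154.0 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 3 | 10.9 | 10.4 | 1.0751 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 4 | 27.8 | 28.7 | 1.034 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |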


### 1. [1 point] Select a Regularization Technique 

We work for a company that is measuring the proportion of body fat for customers in order to design suits that can fit well for a customer that is ordering the suit over the internet.  We know that asking these customers to perform many body measurements will be burdensome for the customers, so we'd like to streamline the process by only asking the customer to provide a few measurements.  We can then use our models to select an appropriately cut and tailored suit for the customer.

**a)** For this situation, what would our primary purpose of fitting the model be?  In other words, are we concerned about making predictions or understanding structures?

Making predictions

**b)** We would like to use a regularization model for our fitted model.  What regularization technique would you suggest should be used based on the optimal design for the company and customer?

LASSO, because it can shrink some slopes to exactly 0, which helps us identify which predictors (measurements) to keep.
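
To illustrate why LASSO suits a goal of asking for fewer measurements, the sketch below fits a LASSO model and a ridge model to synthetic data (not the body fat data) in which only two of six predictors affect the response. The data, the `alpha` values, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 6 predictors, but only the first two drive the response
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit both penalized models (alpha values chosen only for illustration)
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count slopes that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("LASSO slopes:", np.round(lasso.coef_, 3))
print("Ridge slopes:", np.round(ridge.coef_, 3))
print("Zero slopes (LASSO, ridge):", n_zero_lasso, n_zero_ridge)
```

In this sketch the LASSO fit drops the irrelevant predictors entirely, while ridge only shrinks them toward 0 without removing them.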

### 2. [2.5 points] Prepare the Data

**a)** Split your data into training and test data.  Use a random state (you can choose your random state), and set aside 20% of your data for the test data.

In [89]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1000)
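
As a quick check on what `train_test_split` does with `test_size=0.2`, the sketch below splits 252 placeholder row labels (standing in for the 252 rows described above; the `rows` array is hypothetical, not the real data) and shows that reusing a `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(252)  # placeholder row labels, not the real data

# Two splits with the same random_state
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=1000)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=1000)

print(len(train_a), len(test_a))        # sizes of the training and test sets
print(np.array_equal(test_a, test_b))   # whether the two test sets match
```

Groups that choose different random states will generally end up with different training and test rows.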

**b)** We know that regularization models require some preparation of the data before the model can be fit.  Scale your $X$ explanatory variables for this model.

*Note:* You should perform this scaling in two stages for your training and test data.

In [90]:
scaler_full = StandardScaler()
X_train = df_train.drop(['siri', 'brozek', 'density', 'free'], axis = 1)
scaled_expl_vars = scaler_full.fit_transform(X_train)
X_train = pd.DataFrame(scaled_expl_vars, columns=X_train.columns)
X_train.head()

| | age | weight | height | adipos | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15028 | -0.364116 | -0.818472 | 0.245927 | -0.528692 | 0.063994 | -0.176778 | 0.01933 | 0.261051 | -1.051463 | -0.686878 | 1.094159 | 0.752308 | -0.66384 |
| 1 | 1.79363 | -0.471613 | -0.621529 | -0.051526 | 0.189599 | -0.327225 | 0.369589 | -0.371906 | -0.839482 | -0.142101 | 0.339609 | -0.901656 | -0.684631 | 0.822274 |
| 2 | -0.710522 | -0.851987 | -0.621529 | -0.511226 | -0.369072 | -0.979257 | -0.917614 | -0.952014 | -0.858135 | -0.968793 | -0.572824 | -0.57975 | -0.588835 | -0.557689 |
| 3 | -0.945286 | -0.967753 | -0.030701 | -1.05205 | -0.967648 | -1.631288 | -1.093562 | -0.871069 | -0.820829 | -0.968793 | -0.572824 | -0.772894 | -0.924121 | -1.194595 |
| 4 | -0.084484 | -0.926408 | -0.227644 | -0.889803 | -1.087363 | -1.109663 | -0.565717 | -0.398888 | -0.708911 | -1.175467 | -0.629851 | -0.901656 | -0.972019 | -0.876142 |


In [91]:
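X_test = df_test.drop(['siri', 'brozek', 'density', 'free'], axis = 1)
# Note: fit_transform re-estimates the means and standard deviations from the test rows;
# scaler_full.transform(X_test) would instead reuse the values learned from the training data.
scaled_expl_vars = scaler_full.fit_transform(X_test)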
X_test = pd.DataFrame(scaled_expl_vars, columns=X_test.columns)
X_test.head()

| | age | weight | height | adipos | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.173355 | 0.814035 | -1.100493 | 1.730192 | 0.943049 | 1.534593 | 1.24877 | 0.400951 | 0.314067 | 0.989976 | 1.130179 | 0.591506 | 1.293284 | 0.059507 |
| 1 | -0.451773 | -0.468728 | -0.932453 | 0.115502 | -0.415855 | 0.176349 | -0.574831 | -0.982281 | -0.280745 | 0.095993 | 0.707914 | 0.476366 | -2.088486e-15 | 0.621513 |
| 2 | -1.166205 | 1.81837 | -0.260296 | 2.178716 | 1.040114 | 1.546829 | 2.021818 | 1.967503 | 3.096937 | 0.777123 | 0.637537 | 1.359109 | 1.76357 | 0.621513 |
| 3 | -0.719685 | -0.597999 | 0.411861 | -0.871252 | 0.554791 | -1.561225 | -1.169483 | -0.332329 | -0.535664 | 0.393987 | -0.347747 | 0.130944 | 0.2939283 | -0.052895 |
| 4 | 0.530571 | -1.234409 | 0.159802 | -1.409482 | -1.289437 | -1.500043 | -1.149661 | -1.032278 | -1.024259 | -1.138556 | -0.981144 | -1.135601 | -0.8229991 | -0.952106 |
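
One common reading of scaling "in two stages" is to learn the means and standard deviations from the training rows only, then apply those same values to the test rows. The sketch below shows that pattern on a tiny made-up array; `X_train_demo` and `X_test_demo` are hypothetical names and values, not the lab data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up data: three training rows, one test row, two predictors
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

scaler = StandardScaler()
# Stage 1: learn the column means and standard deviations from the training rows
X_train_scaled = scaler.fit_transform(X_train_demo)
# Stage 2: reuse those training means and standard deviations on the test rows
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.mean_)      # column means learned from the training rows
print(X_test_scaled)     # test row expressed on the training scale
```

Calling `fit_transform` on the test rows instead would rescale them using the test set's own means and standard deviations, so the training and test sets would no longer share a common scale.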


**c)** Prepare your y variable for the model.

In [92]:
y_train = df_train['siri']
y_test = df_test['siri']

### 3. [3 points] Fitting a Regularized Model

**a)** We'd like to fit a model using the regularization technique suggested in Question 1.  Start by fitting your model with a $\lambda$ of 2 to your training data.

In [93]:
lasso_model_2 = Lasso(alpha = 2, max_iter = 1000)
lasso_model_2.fit( X_train, y_train)
df_slopes = pd.DataFrame( {'lasso_2' : lasso_model_2.coef_.T}, index = X_train.columns)
df_slopes

| | lasso_2 |
|---|---|
| age | 0.0 |
| weight | 0.0 |
| height | -0.0 |
| adipos | 0.0 |
| neck | 0.0 |
| chest | 0.0 |
| abdom | 4.75103 |
| hip | 0.0 |
| thigh | 0.0 |
| knee | 0.0 |
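
For reference, scikit-learn's documentation gives the objective minimized by `Lasso` in the form below, where the `alpha` argument plays the role of $\lambda$. The symbols $n$ (number of training rows), $p$ (number of predictors), $\beta_0$, and $\beta_j$ are notation chosen here for illustration, not taken from the assignment.

$$\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\right)^2+\lambda\sum_{j=1}^{p}\left|\beta_j\right|$$

A larger $\lambda$ puts more weight on the penalty term, so more slopes tend to be set to exactly 0.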


**b)** How many of your slopes were equal to 0 for this model?  Which variables would you suggest should be retained in the model?

13 of the 14 slopes were equal to 0. Only `abdom` should be retained.
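
The count of zero slopes can also be obtained in code rather than by eye. The sketch below is a minimal illustration using four of the slopes copied from the output above; the `coefs` Series is a hypothetical stand-in for a full column of `df_slopes`.

```python
import pandas as pd

# Four slopes copied from the displayed lasso_2 output (illustrative subset)
coefs = pd.Series([0.0, 0.0, -0.0, 4.75103],
                  index=['age', 'weight', 'height', 'abdom'])

# -0.0 compares equal to 0, so it is counted as a zero slope
n_zero = int((coefs == 0).sum())
retained = coefs[coefs != 0].index.tolist()

print(n_zero)
print(retained)
```

The same two expressions could be applied to a column such as `df_slopes['lasso_2']` to count and list slopes across all predictors.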

**c)** Fit another version of your regularized model to your training data, but this time using a $\lambda$ of 0.5.

In [94]:
lasso_model_05 = Lasso(alpha = 0.5, max_iter = 1000)
lasso_model_05.fit( X_train, y_train)
df_slopes = pd.DataFrame( {'lasso_2' : lasso_model_2.coef_.T, 'lasso_05' : lasso_model_05.coef_.T}, index = X_train.columns)
df_slopes

| | lasso_2 | lasso_05 |
|---|---|---|
| age | 0.0 | 0.508591 |
| weight | 0.0 | -0.0 |
| height | -0.0 | -0.58321 |
| adipos | 0.0 | 0.0 |
| neck | 0.0 | -0.0 |
| chest | 0.0 | 0.0 |
| abdom | 4.75103 | 6.417714 |
| hip | 0.0 | -0.0 |
| thigh | 0.0 | 0.0 |
| knee | 0.0 | -0.0 |


**d)** How many of your slopes were equal to 0 for this model?  Which variables would you suggest should be retained from this version of the model?

10 of the 14 slopes were equal to 0. Retain `age`, `height`, `abdom`, and `wrist`.

### 4. [2 points] Comparing Model Results

**a)** Apply each of your two models fit from **3a** and **3c** to your test set.  Calculate the $R^2$ on your test set for each of your models.

In [98]:
lasso_model_2.score(X_test, y_test)

0.2290518909602689

In [96]:
lasso_model_05.score(X_test, y_test)

0.2908190274210648
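
For a regression model such as `Lasso`, `.score` returns $R^2$ computed on whatever data it is given. The sketch below, on synthetic data with hypothetical variable names, checks that value against the usual formula $1 - SS_{res}/SS_{tot}$ computed by hand on held-out rows.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first predictor affects the response
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] + rng.normal(size=120)

# Simple split: first 90 rows for training, last 30 rows held out
X_tr, y_tr = X[:90], y[:90]
X_te, y_te = X[90:], y[90:]

model = Lasso(alpha=0.5).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_from_score = model.score(X_te, y_te)
print(r2_manual, r2_from_score)
```

Because the total sum of squares uses the mean of the held-out responses, a test $R^2$ can be negative when a model predicts worse than that mean.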

**b)** Which $\lambda$ would you suggest based on your training data?

$\lambda = 0.5$, since it gave the higher $R^2$ on the test set (about 0.291 versus 0.229).
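
The comparison of candidate $\lambda$ values can be organized as a loop. The sketch below is one possible way to do this on synthetic data (not the body fat data), using the same two candidates as this lab; the data-generating values and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 predictors, only the first two affect the response
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale using statistics learned from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Fit one LASSO model per candidate and record its R^2 on the held-out rows
results = {}
for alpha in [2, 0.5]:
    model = Lasso(alpha=alpha, max_iter=1000).fit(X_tr_s, y_tr)
    results[alpha] = model.score(X_te_s, y_te)

best_alpha = max(results, key=results.get)
print(results)
print(best_alpha)
```

More candidates can be added to the list, although repeatedly choosing $\lambda$ on the same held-out rows makes the resulting $R^2$ an optimistic estimate of performance on new data.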

**c)** Compare the results from **4a** to another group (or two) in your lab.  Did you pick the same preferred $\lambda$ from our two choices?  Did you achieve the same $R^2$ values on your test data?  Did you have the same training/test data split from **2a**?

We have the same training/test data split and the same preferred $\lambda$, but our values for $R^2$ were different.

### 5. [1.5 points] A Limitation of the Results

Return to your selection of a regularization technique in Question 1 and recall that there are two primary purposes or benefits to using a regularization model.  Assess if there is still a concern in your model that the other regularization model approach could address.  Explain.
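
One way to start on this question, assuming the concern in mind is strongly correlated predictors (an assumption; the prompt does not name it), is to inspect pairwise correlations among the explanatory variables. The sketch below uses synthetic columns `x1`, `x2`, and `x3` (hypothetical names) rather than the lab data, and the 0.9 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(3)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'x1': base,
    'x2': base + rng.normal(scale=0.1, size=200),
    'x3': rng.normal(size=200),
})

# Pairwise correlation matrix of the predictors
corr = demo.corr()

# List each pair of distinct columns whose absolute correlation exceeds the cutoff
cols = list(corr.columns)
high_pairs = [(a, b)
              for i, a in enumerate(cols)
              for b in cols[i + 1:]
              if abs(corr.loc[a, b]) > 0.9]

print(corr.round(3))
print(high_pairs)
```

On the lab data, the analogous starting point would be `X_train.corr()`.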