
#Lesson 11.2.2 Activity

We know that height tends to run in families.  That is, tall parents tend to have children who grow up to be tall adults and vice versa.

But can we really predict how tall a child will grow up to be based only on how tall the parents are?  Does the prediction change if the child is male or female?  And is the same parent-child height relationship true for every family?

In this activity, we'll answer the last two questions by encoding both the `Gender` and `Family` variables so they can be added to our linear regression model.



#Step 1: Download and save the `heights.csv` dataset from the class materials  

* Make a note of where you saved the file on your computer.

#Step 2: Upload the heights.csv dataset by running the following code block 

* When prompted, navigate to and select the `heights.csv` dataset where you saved it on your computer.
* This is a pretty large dataset, so it may take a little while.
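
The upload cell itself is not shown in this notebook. A minimal sketch of what it might look like follows, assuming the notebook runs in Google Colab (the `/content/heights.csv` path used in Step 4 suggests this). The helper name `upload_in_colab` is hypothetical, and the fallback branch is only there so the cell does not fail outside Colab.

```python
def upload_in_colab():
    """Open Colab's file picker; return None when not running in Colab."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print("Not running in Colab; place heights.csv next to the notebook instead.")
        return None
    # In Colab this returns a dict mapping each uploaded file name to its bytes
    return files.upload()


uploaded = upload_in_colab()
```

In Colab, an uploaded file is normally saved in the working directory (`/content`), which is where Step 4 reads it from.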

#Step 3: Import necessary packages

```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error as MSE
```

In [None]:
#Step 3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sns
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error as MSE

# Step 4: Create a Pandas DataFrame from the CSV file
* Name the DataFrame `heights`.
* Print the first five observations of `heights`.  Note the kinds of data it contains.

In [None]:
#Step 4
heights = pd.read_csv('/content/heights.csv')

heights

     Family  Father  Mother Gender  Height  Kids  MidParent
0         1    78.5    67.0      M    73.2     4      75.43
1         1    78.5    67.0      F    69.2     4      75.43
2         1    78.5    67.0      F    69.0     4      75.43
3         1    78.5    67.0      F    69.0     4      75.43
4         2    75.5    66.5      M    73.5     4      73.66
..      ...     ...     ...    ...     ...   ...        ...
893     205    68.5    65.0      M    68.5     8      69.35
894     205    68.5    65.0      M    67.7     8      69.35
895     205    68.5    65.0      F    64.0     8      69.35
896     205    68.5    65.0      F    63.5     8      69.35


#Step 5: Split the data into the target variable and the features of interest.
* We want to predict an adult child's height (`Height`) using the heights of the parents (`MidParent`), the adult child's gender (`Gender`) and the family of origin (`Family`).
* `MidParent` is the average of the father's height and 1.08 times the mother's height, so the mother's height is scaled up slightly before averaging.
* Select the columns `MidParent`, `Gender` and `Family` from the `heights` DataFrame and name the resulting DataFrame `X`.
* Select the column `Height` from the `heights` DataFrame and name it `y`.  Make sure `y` is also a DataFrame and not a Series.
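
The rows displayed in Step 4 are consistent with `MidParent = (Father + 1.08 * Mother) / 2`. That formula is inferred from the displayed values rather than stated in the class materials, so treat it as an assumption. The sketch below checks it against three of those rows; the `sample` and `computed` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three (Father, Mother, MidParent) combinations copied from the Step 4 output
sample = pd.DataFrame({
    "Father": [78.5, 75.5, 68.5],
    "Mother": [67.0, 66.5, 65.0],
    "MidParent": [75.43, 73.66, 69.35],
})

# Assumed formula: scale the mother's height by 1.08, then average with the father's
computed = (sample["Father"] + 1.08 * sample["Mother"]) / 2

# Compare with the stored values, allowing for rounding to two decimals
matches = np.isclose(computed, sample["MidParent"], atol=0.005)
print(matches.all())
```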





In [None]:
#Step 5
X = heights[['MidParent', 'Gender', 'Family']]

y = heights[['Height']]

In [None]:
heights.isna().sum()


Family       0
Father       0
Mother       0
Gender       0
Height       0
Kids         0
MidParent    0
dtype: int64

#Step 6: One-hot encode `Gender` and `Family`
* Run the code below to one-hot encode `Gender` and `Family` and name the results `one_hot`. 

```
one_hot = pd.get_dummies(data=X, columns=['Gender', 'Family'])

```

* `pd.get_dummies` already returns a Pandas DataFrame, so assign `one_hot` to `X` (or assign the result of `pd.get_dummies` to `X` directly).
* Now `X` includes our one-hot encoded values of `Family` and `Gender`.
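
To see what the encoding does, here is a small sketch on a three-row toy DataFrame. The `toy` data is made up for illustration and is not the real `heights` data. Each category of `Gender` and `Family` becomes its own indicator column, while `MidParent` is left untouched.

```python
import pandas as pd

# Hypothetical three-row stand-in for X
toy = pd.DataFrame({
    "MidParent": [75.43, 75.43, 73.66],
    "Gender": ["M", "F", "M"],
    "Family": [1, 1, 2],
})

toy_encoded = pd.get_dummies(data=toy, columns=["Gender", "Family"])

print(toy_encoded.columns.tolist())
# ['MidParent', 'Gender_F', 'Gender_M', 'Family_1', 'Family_2']
```

Because `Family` has one category per family, the real `X` ends up with one `Family_*` column per family in the dataset, which makes it a very wide DataFrame.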





In [None]:
#Step 6
X = pd.get_dummies(data=X, columns=['Gender', 'Family'])

#Step 7: Split the data into a training/validation dataset and a test dataset.
* You did this in the last activity, but it doesn't hurt to practice!
* Use `train_test_split` from `sklearn.model_selection`.
* Name the X training/validation set `X_train_val` and the y training/validation set `y_train_val`.
* Name the X test set `X_test` and the y test set `y_test`.
* Set the `test_size = 0.25` and `random_state = 42`. 






In [None]:
#Step 7
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#Step 8: Split the training/validation dataset into a training set and validation set
* You did this in the last activity, but it doesn't hurt to practice!
* Use `train_test_split` from `sklearn.model_selection` to split `X_train_val` and `y_train_val` into `X_train`, `X_val`, `y_train` and `y_val`.
* Set the `test_size = 0.333` (this will be the size of the validation set) and `random_state = 42`.
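
The two splits in Steps 7 and 8 work together: taking 25% for the test set and then one third of the remaining 75% for validation leaves roughly 50% for training, 25% for validation and 25% for testing. The sketch below illustrates the arithmetic on placeholder arrays with 897 rows (the row count implied by the index shown in Step 4). The `*_demo` and shortened names are hypothetical and chosen so they do not overwrite the activity's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Step 4 output implies
X_demo = np.arange(897).reshape(-1, 1)
y_demo = np.arange(897)

# First split: hold out 25% of all rows as the test set
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Second split: hold out one third of the remaining rows as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.333, random_state=42)

for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(X_demo):.1%})")
```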





In [None]:
#Step 8
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.333, random_state=42)

#Step 9: Fit a linear regression model using sklearn
* You did this in the last activity, but it doesn't hurt to practice!
* Specify the regression model using sklearn.  Name the resulting model `model`.  Use the format 

```
reg = model name

model = reg.fit(X variable, y variable)

```

* Make sure you are building your model using the training data that we just one-hot encoded.
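
As an illustration of the `reg` / `model` pattern, here is a minimal sketch on synthetic data. The `x_demo` and `y_demo` arrays and the slope and intercept used to generate them are invented for this example and say nothing about the real heights data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one predictor with a known linear relationship plus noise
rng = np.random.default_rng(42)
x_demo = rng.uniform(64, 76, size=(50, 1))
y_demo = 0.7 * x_demo[:, 0] + 20 + rng.normal(0, 1, size=50)

reg = LinearRegression()          # specify the model
model = reg.fit(x_demo, y_demo)   # fit it to the data

print(model.coef_, model.intercept_)
print(model.score(x_demo, y_demo))  # R-squared on the data used for fitting
```

Note that `fit` returns the estimator itself, so `model` and `reg` refer to the same fitted object.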







In [None]:
#Step 9
reg = LinearRegression()

model = reg.fit(X_train, y_train)

model.score(X_train, y_train)          

0.8109768892676624

#Step 10: Calculate the predicted y values using your regression model
* We've done this before, but it doesn't hurt to practice!
* Call the predicted values `y_pred`.  Use the format 

```
y_pred = model.predict(X variable)

```

* Make sure you are calculating the predicted values on the training data.






In [None]:
#Step 10
# Note: X holds every row (training, validation and test), not only X_train
y_pred = model.predict(X)

In [None]:
np.sum(y_pred > 100)


12

In [None]:
mask = (y_pred > 100)
mask

In [None]:
print(y_pred[mask])


[3.11623012e+13 7.31590539e+12 7.31590539e+12 7.31590539e+12
 1.01851340e+14 8.24948523e+13 1.10297670e+14 1.10297670e+14
 1.70966516e+13 1.70966516e+13 1.70966516e+13 1.70966516e+13]


In [None]:
print(X[mask])


     MidParent  Gender_F  Gender_M  Family_1  Family_2  Family_3  Family_4  \
32       72.37         1         0         0         0         0         0   
74       72.72         0         1         0         0         0         0   
75       72.72         1         0         0         0         0         0   
76       72.72         1         0         0         0         0         0   
87       71.37         1         0         0         0         0         0   
124      68.94         1         0         0         0         0         0   
298      70.10         0         1         0         0         0         0   
299      70.10         0         1         0         0         0         0   
362      69.02         0         1         0         0         0         0   
363      69.02         0         1         0         0         0         0   
364      69.02         1         0         0         0         0         0   
365      69.02         1         0         0         0         0
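
One possible explanation for predictions on the order of 10^13 is that the flagged rows belong to families with no rows in the training set. In that case their `Family_*` columns are all zero during fitting, and the coefficients for those columns are never constrained by the training data. The output above is consistent with this but does not confirm it. The sketch below shows one way to check for such columns, using a made-up `toy` DataFrame and a hand-picked training slice in place of the real `X` and `X_train`.

```python
import pandas as pd

# Hypothetical stand-in data: three families, two children each
toy = pd.DataFrame({
    "MidParent": [75.43, 75.43, 73.66, 73.66, 69.35, 69.35],
    "Family": [1, 1, 2, 2, 205, 205],
})
toy_X = pd.get_dummies(toy, columns=["Family"])

# Pretend only the first four rows landed in the training set
toy_train = toy_X.iloc[:4]

# Family columns that are all zero in the training rows
family_cols = [c for c in toy_X.columns if c.startswith("Family_")]
unseen = [c for c in family_cols if toy_train[c].sum() == 0]
print(unseen)
# ['Family_205']

# Rows in the full data that belong to one of those unseen families
affected_rows = toy_X.index[toy_X[unseen].any(axis=1)].tolist()
print(affected_rows)
# [4, 5]
```

Running the same check with the real `X` and `X_train`, and comparing the affected rows with the index values printed above, would show whether this explanation holds for the 12 flagged predictions.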

#Step 11: Calculate and interpret the RMSE and R-squared for the model using the training data
* You've done this before, but it doesn't hurt to practice!
* Assign the RMSE to the variable `RMSE_train`.  Use the format 

```
RMSE_train = MSE(y observed, y predicted, squared = False)

```

* Print the value of the RMSE for the model.

* Assign the R-squared to the variable `r2_train`.  Use the format

```
r2_train = r2_score(y observed, y predicted)

```
* Print the value of R-squared for the model.

* Interpret the value of RMSE and R-squared for the model.
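
For interpretation, RMSE is in the same units as `Height` and describes the typical size of a prediction error. R-squared is the share of the variation in `Height` that the model explains, and it can be negative when the predictions are worse than always predicting the mean. The sketch below computes both on a few invented values. It takes the square root of the MSE with NumPy instead of passing `squared = False`, on the assumption that newer scikit-learn releases no longer accept that argument; check the installed version before relying on either form.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error as MSE

# Observed heights copied from the Step 4 output; the predictions are invented
y_obs = np.array([73.2, 69.2, 69.0, 69.0, 73.5])
y_hat = np.array([72.0, 69.0, 69.5, 68.5, 73.0])

rmse_demo = np.sqrt(MSE(y_obs, y_hat))  # square root of the mean squared error
r2_demo = r2_score(y_obs, y_hat)        # 1 minus (residual SS / total SS)

print(rmse_demo)
print(r2_demo)
```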




In [None]:
#Step 11
# Note: y and y_pred cover the full dataset here, not only the training rows
RMSE = MSE(y, y_pred, squared = False)

In [None]:
RMSE

22415244433304.84

In [None]:
# Note: computed on the full dataset (y, y_pred), not only the training rows
r2 = r2_score(y, y_pred)

In [None]:
r2

-3.91829382911446e+25