#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)
- [Problem 10](#Problem-10)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error 
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn import set_config
import warnings

warnings.filterwarnings("ignore")

set_config(display="diagram") 
#setting this will display your pipelines as seen above


### The Data: Ames Housing

This dataset is a popular beginning dataset used in teaching regression.  The task is to use specific features of houses to predict the price of the house.  In addition to this, as discussed in video 8.10 -- this dataset is available for use in an ongoing competition where you can use the `test.csv` to submit your models predictions.  Accordingly, the two data files are identical with the exception of the `test.csv` file not containing the target feature.

The data contains 81 columns of different information on the individual houses and their sale price.  A full description of the data is attached [here](data/data_description.txt).  In this assignment, you will use a small subset of the features to begin modeling with that includes ordinal, categorical, and numeric features. As an optional exercise you are encouraged to continue engineering additional features and attempt to improve the performance of your model including submitting the predictions on kaggle. 

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [12]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
train['CentralAir']

0       Y
1       Y
2       Y
3       Y
4       Y
       ..
1455    Y
1456    Y
1457    Y
1458    Y
1459    Y
Name: CentralAir, Length: 1460, dtype: object

In [5]:
#note the difference in one column from train to test
[i for i in train.columns if i not in test.columns]

['SalePrice']

[Back to top](#Index:) 

### Problem 1

#### Train/Test split

**5 Points**

Despite having a test dataset, you want to create a holdout set to assess your models performance.  To do so, use sklearn's `train_test_split` with arguments:

- `test_size = 0.3`
- `random_state = 22`

Assign your results to `X_train, X_test, y_train, y_test` below with `X` and `y` as given.  `X_train` and `X_test` should be a pandas DataFrame, and `y_train`, `y_test` are to be pandas Series.  


In [6]:
X = train.drop('SalePrice', axis = 1)
y = train['SalePrice']
y_trans = np.log1p(train['SalePrice'].values)

In [7]:
y

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [8]:
### GRADED

X_train, X_test, y_train, y_test = '', '', '', ''

# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

# Answer check
print(X_train.shape)
print(X_test.shape)
print(type(X_train), type(y_train))#should be DataFrame and Series

(1022, 80)
(438, 80)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>


### Codio Activity 8.7: Evaluating Multiple Models

**Estimated Time: 120 Minutes**

**Total points: 100**

This assignment focuses on solving a specific regression problem using basic cross validation with a train/test/validation split.  In addition to using the methods explored, this assignment also aims to familiarize you with further utilities for data transformation including the `OneHotEncoder` and `OrdinalEncoder` along with their use in a `make_column_transformer`.  

The operations of encoding categorical features will be introduced using `sklearn`.  This will allow you to streamline your model building pipelines.  Depending on whether a string type feature is **ordinal** or **categorical** we want to encode differently.  The `OrdinalEncoder` will be used to encode features that do not need to be binarized due to an underlying order, and `OneHotEncoder` for categorical features (as a similar approach to that of the `.get_dummies()` method in pandas).  By the end of the assignment, you will see how to chain multiple feature encoding methods together including the earlier `PolynomialFeatures` for numeric features. 

<center>
    <img src = images/pipes.png width = 50% />
</center>

In [9]:
X_train[['HeatingQC']].value_counts()

HeatingQC
Ex           511
TA           297
Gd           184
Fa            29
Po             1
Name: count, dtype: int64

[Back to top](#Index:) 

### Problem 2

#### Baseline Predictions

**10 Points**

Before buildling a regression model, you should set a baseline to compare your later models to.  One way to do this is to guess the mean of the `SalePrice` column.  For the variables `baseline_train` and `baseline_test`, create arrays of same shape as `y_train` and `y_test` respectively.  These should both contain the mean of the target feature in the train set. Use the mean predictions to determine the `mean_squared_error` for both the train and test sets and assign to `mse_baseline_train` and `mse_baseline_test` below.  

In [10]:
### GRADED

baseline_train = ''
baseline_test = ''
mse_baseline_train = ''
mse_baseline_test = ''

# YOUR CODE HERE
baseline_train = np.ones(shape = y_train.shape)*y_train.mean()
baseline_test = np.ones(shape = y_test.shape)*y_test.mean()
mse_baseline_train = mean_squared_error(baseline_train, y_train)
mse_baseline_test = mean_squared_error(baseline_test, y_test)

# Answer check
print(baseline_train.shape, baseline_test.shape)
print(f'Baseline for training data: {mse_baseline_train}')
print(f'Baseline for testing data: {mse_baseline_test}')

(1022,) (438,)
Baseline for training data: 6277713446.182904
Baseline for testing data: 6374354899.510017


[Back to top](#Index:) 

### Problem 3

#### Examining the Correlations

**5 Points**

What feature has the highest positive correlation with `SalePrice`?  Assign your answer as a string matching the column name exactly to `highest_corr` below.  

In [11]:
### GRADED

highest_corr = ''

# YOUR CODE HERE
highest_corr = train.corr()[['SalePrice']].nlargest(columns = 'SalePrice', n = 2).index[1]

# Answer check
print(highest_corr)

ValueError: could not convert string to float: 'RL'

[Back to top](#Index:) 

### Problem 4

#### Simple Model

**10 Points**


Build a `LinearRegression` model on the training data using only the column `OverallQual`.  Evaluate the mean squared error on both the training and testing data, and assign these to `model_1_train_mse` and `model_1_test_mse` below.    

In [None]:
### GRADED

model_1_train_mse = ''
model_1_test_mse = ''

# YOUR CODE HERE
X1 = X_train[['OverallQual']]
lr = LinearRegression().fit(X1, y_train)
model_1_train_mse = mean_squared_error(y_train, lr.predict(X1))
model_1_test_mse = mean_squared_error(y_test, lr.predict(X_test[['OverallQual']]))

# Answer check
print(f'Train MSE: {model_1_train_mse: .2f}')
print(f'Test MSE: {model_1_test_mse: .2f}')

[Back to top](#Index:) 

### Problem 5

#### Using `OneHotEncoder`

**10 Points**

Similar to the `pd.get_dummies()` method earlier encountered, scikitlearn has a utility for encoding categorical features in the same way.  Below, the `OneHotEncoder` is demonstrated on the `CentralAir` column.  You are to use these results to build a model where the only feature is the `CentralAir` column.  Note the two arguments are used in the `OneHotEncoder`:

- `sparse = False`: returns an array that we can investigate vs with `sparse = True` you are returned a sparse matrix -- a memory saving representation
- `drop = if_binary`: returns a single column for any binary categories.  This avoids reduntant features in our regression model.

Be sure to assign your fit regression model to `model_2`.  Does this model perform better than the baseline model?  

In [None]:
#extract the features
central_air_train = X_train[['CentralAir']]
central_air_test = X_test[['CentralAir']]

In [None]:
#a categorical feature
central_air_train.head()

In [None]:
#Instantiate a OHE object
#sparse = False returns an array so we can view
ohe = OneHotEncoder(sparse = False, drop='if_binary')
print(ohe.fit_transform(central_air_train)[:5])

In [None]:
model_2_train = ohe.fit_transform(central_air_train)
model_2_test = ohe.transform(central_air_test)

In [None]:
### GRADED

model_2 = ''

# YOUR CODE HERE
model_2 = LinearRegression().fit(model_2_train, y_train)

# Answer check
print(model_2.coef_)

[Back to top](#Index:) 

### Problem 6

#### Using `make_column_transformer`

**10 Points**


To build a model using both the `OverallQual` column and the `CentralAir` column, you could use the `OneHotEncoder` to transform `CentralAir`, and then concatenate the results back into a DataFrame or numpy array.  To streamline this process, the `make_column_transformer` can be used to seperate specific columns for certain transformations.  Below, a `make_column_transformer` has been created for you to do just this.  


The arguments are tuples of the form `(transformer, columns)` that specify a transformation to perform on the given column.  Further, the `remainder = passthrough` argument says to just pass the other columns through.  You are returned a numpy array with the `CentralAir` column binarized and concatenated to the `OverallQual` feature.


For an example using the `make_column_transformer` see [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py).


In [None]:
col_transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['CentralAir']), remainder='passthrough')

In [None]:
col_transformer.fit_transform(X_train[['OverallQual', 'CentralAir']])

In [None]:
### GRADED

pipe_1 = ''

# YOUR CODE HERE
pipe_1 = Pipeline([('col_transformer', col_transformer), ('linreg', LinearRegression())])
pipe_1.fit(X_train[['OverallQual', 'CentralAir']], y_train)

# Answer check
print(pipe_1.named_steps)#col_transformer and linreg should be keys
pipe_1

In [None]:
X_train[['HeatingQC']]

Now, you can treat the `col_transformer` as a transformer object and insert it as a step in a `Pipeline`.  Below, create a `Pipeline` with the `col_transformer` as the first step, followed by `LinearRegression` estimator as `pipe_1` below.  Fit and score the pipeline on the columns `['OverallQual', 'CentralAir']`.  Does this model perform better than the baseline? 

- Reminder that steps in a `Pipeline` are tuples with names and objects.  You should name the transformer `col_transformer` and estimator `linreg`.  

[Back to top](#Index:) 

### Problem 7

#### Using `OrdinalEncoder`

**10 Points**

Not all columns warrant binarization as done on the `CentralAir` column.  For example, consider the `HeatingQC` feature -- representing the quality of the heating in the house.  From the data description the unique values are described as:

```
HeatingQC: Heating quality and condition

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
```

These are ordered values, and rather than binarizing them a numeric value representing the scale can be used.  For example, using a scale of 0 - 4 you may associate the categories with an order in a list from least to greatest as:

```
['Po', 'Fa', 'TA', 'Gd', 'Ex']
```

Creating an `OrdinalEncoder` with these categories will transform the `HeatingQC` feature mapping each category as

```
Po: 0
Fa: 1
TA: 2
Gd: 3
Ex: 4
```

This is demonstrated below, and in a similar manner the use of the `make_column_transformer` is shown using the three columns `['OverallQual', 'CentralAir', 'HeatingQC']`, applying the appropriate transformations to each column and passing the remaining numeric feature through.  

In [None]:
oe = OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']])

In [None]:
oe.fit_transform(X_train[['HeatingQC']])

In [None]:
X_train['HeatingQC'].head()

In [None]:
ordinal_ohe_transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                          (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                          remainder='passthrough')

In [None]:
ordinal_ohe_transformer.fit_transform(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])[:5]

In [None]:
X_train[['OverallQual', 'CentralAir', 'HeatingQC']].head()

You are to build a pipeline incorporating this new transformer and with steps named `transformer` and `linreg`.  Fit and evaluate the model on the training and test set using mean squared error.  Is this model better than the baseline? 

Assign your pipeline to `pipe_2` below and mean squared errors to `pipe_2_train_mse` and `pipe_2_test_mse` as floats below. 

In [None]:
### GRADED

pipe_2 = ''
pipe_2_train_mse = ''
pipe_2_test_mse = ''

# YOUR CODE HERE
pipe_2 = Pipeline([('transformer', ordinal_ohe_transformer), ('linreg', LinearRegression())])
pipe_2.fit(X_train[['OverallQual', 'CentralAir', 'HeatingQC']], y_train)
pred_train = pipe_2.predict(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])
pred_test = pipe_2.predict(X_test[['OverallQual', 'CentralAir', 'HeatingQC']])
pipe_2_train_mse = mean_squared_error(y_train, pred_train)
pipe_2_test_mse = mean_squared_error(y_test, pred_test)

# Answer check
print(pipe_2.named_steps)
print(f'Train MSE: {pipe_2_train_mse: .2f}')
print(f'Test MSE: {pipe_2_test_mse: .2f}')
pipe_2

[Back to top](#Index:) 

### Problem 8

#### Including `PolynomialFeatures`

**10 Points**

Finally, the earlier transformation of continuous columns using the `PolynomialFeatures` with `degree = 2` can be implemented alongside the `OneHotEncoder` and `OrdinalEncoder`.  

The `make_column_transformer` is again used, and you are to create a `Pipeline` with steps `transformer` and `linreg`.  

The `Pipeline` is fit on the training data using features `['OverallQual', 'CentralAir', 'HeatingQC']`.  

Your task is to determine the mean squared error on the train and test data and assign these as floats to `quad_train_mse` and `quad_test_mse` below.  

In [None]:
poly_ordinal_ohe = make_column_transformer((OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                           (OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                           (PolynomialFeatures(include_bias = False, degree = 2), ['OverallQual']))
pipe_3 = Pipeline([('transformer', poly_ordinal_ohe), ('linreg', LinearRegression())])
pipe_3.fit(X_train[['OverallQual', 'CentralAir', 'HeatingQC']], y_train)

In [None]:
### GRADED

quad_train_mse = ''
quad_test_mse = ''

# YOUR CODE HERE
quad_train_preds = pipe_3.predict(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])
quad_test_preds = pipe_3.predict(X_test[['OverallQual', 'CentralAir', 'HeatingQC']])
quad_train_mse = mean_squared_error(y_train, quad_train_preds)
quad_test_mse = mean_squared_error(y_test, quad_test_preds)

# Answer check
print(f'Train MSE: {quad_train_mse: .2f}')
print(f'Test MSE: {quad_test_mse: .2f}')

[Back to top](#Index:) 

### Problem 9

#### Including More Features

**20 Points**

Use the following features to build a new `make_column_transformer` and fit 5 different models of degree 1 - 5 using the `degree` argument in your `PolynomialFeatures` transformer.  Keep track of the subsequent train mean squared error and test set mean squared error with the lists `train_mses` and `test_mses` respectively.  

The `poly_ordinal_ohe` object contains the different transformers needed.  Note that rather than passing a list of columns to the `PolynomialFeatures` transformer, the `make_column_selector` function is used to select any numeric feature.  For more information on the `make_column_selector` see [here](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html).



In [None]:
features = ['CentralAir', 'HeatingQC', 'OverallQual', 'GrLivArea', 'KitchenQual', 'FullBath']

In [None]:
X_train[features].head()

In [None]:
poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(), make_column_selector(dtype_include=np.number)),
                                           (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC', 'KitchenQual']),
                                               (OneHotEncoder(drop = 'if_binary', sparse = False), ['CentralAir']))

In [None]:
### GRADED

train_mses = []
test_mses = []
#for degree in 1 - 5
for i in range(1, 6):
    #create pipeline with PolynomialFeatures degree i 
    #ADD APPROPRIATE ARGUMENTS IN POLYNOMIALFEATURES
    poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(), make_column_selector(dtype_include=np.number)),
                                           (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                               (OneHotEncoder(drop = 'if_binary'), ['CentralAir']))
    
    
    #fit on train

    #predict on train and test

    #compute mean squared errors
    
    #append to train_mses and test_mses respectively

# YOUR CODE HERE
train_mses = []
test_mses = []
#for degree in 1 - 5
for i in range(1, 6):
    #create pipeline with PolynomialFeatures degree i
    poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(degree = i), make_column_selector(dtype_include=np.number)),
                                           (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                               (OneHotEncoder(drop = 'if_binary'), ['CentralAir']))
    pipe = Pipeline([('transformer', poly_ordinal_ohe), ('linreg', LinearRegression())])
    
    #fit on train
    pipe.fit(X_train[features], y_train)
    R_square = pipe.score(X_train[features], y_train)
    
    #predict on train and test
    p1 = pipe.predict(X_train[features])
    p2 = pipe.predict(X_test[features])
    
    #create MSEs for train and test sets
    train_mses.append(mean_squared_error(y_train, p1))
    test_mses.append(mean_squared_error(y_test, p2))

# Answer check
print(train_mses)
print(test_mses)
print(f"R-square: {R_square}")
pipe

[Back to top](#Index:) 

### Problem 10

#### Optimal Model Complexity 

**10 Points**

Based on your models mean squared error on the testing data in **Problem 9** above, what was the optimal complexity?  Assign your answer as an integer to `best_complexity` below.  Compute the **MEAN SQUARED ERROR** of this model and assign to `best_mse` as a float. 

In [None]:
### GRADED

best_complexity = ''
best_mse = ''

# YOUR CODE HERE
best_complexity = test_mses.index(min(test_mses)) + 1
best_mse = min(test_mses)

# Answer check
print(f'The best degree polynomial model is:  {best_complexity}')
print(f'The smallest mean squared error on the test data is : {best_mse: .2f}')

In [None]:
plt.hist(y)

In [None]:
plt.hist(y_trans)

### Further Exploration

This activity was meant to introduce you to a more streamlined modeling process using the `sklearn` library.  While your models should be performing better than the baseline, it is likely that with a bit more feature engineering and cross validation you would be able to further improve the performance.  You are encouraged to explore further feature engineering and encoding, particularly with handling missing values.  

Additionally, other transformations on the data may be appropriate.  For example, if you look at the distribution of errors in your model, you will note that they are slightly skewed.  An assumption of a Linear Regression model is that these should be roughly normally distributed.  By building a model on the logarithm of the target column, and evaluating the model on the logarithm of the testing data you will improve towards this assumption.  Note that the actual kaggle exercise is judged on the **ROOT MEAN SQUARED ERROR** of the logarithm of the target feature. 

If interested, scikitlearn also provides a function `TransformedTargetRegressor` that will accomplish this transformation and can easily be added to a pipeline. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html) for more information on this transformer. 