# Assignment - Linear Regression

### Learning Objectives
After completing this assignment, you should be comfortable with:

- Simple feature engineering
- Using sklearn to build simple and more complex linear models
- Building a data pipeline using pandas
- Identifying informative variables through EDA
- Feature engineering with categorical variables

## Setup Notebook

In [None]:
# Import 3rd party libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

# Overview
The [Ames](http://jse.amstat.org/v19n3/decock.pdf) dataset consists of 2930 records taken from the Ames, Iowa Assessor’s Office describing houses sold in Ames from 2006 to 2010. The dataset has 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables, plus 2 additional observation identifiers, for 82 features in total. An explanation of each variable can be found in the included `codebook.txt` file.

# Import Data
Let's import the dataset.

In [None]:
ames_data = pd.read_csv('ames_data.csv')

Now, let's take a look.

In [None]:
ames_data.head()

The next order of business is getting a feel for the variables in our data. The Ames data set contains information that typical homebuyers would want to know. A more detailed description of each variable is included in `codebook.txt`. You should take some time to familiarize yourself with the codebook before moving forward.

# 1. Exploratory Data Analysis
In this section, we will make a series of exploratory visualizations and interpret them.

## Sale Price
We begin by examining our target variable `SalePrice` using three different plot types. 

In [None]:
# Write your code here
 


Now, let's use the Pandas `describe()` method to look at some descriptive statistics of this variable.

In [None]:
ames_data['SalePrice'].describe()

## Question 1a
To check your understanding of the graph and summary statistics above, answer the following True or False questions:

1. The distribution of SalePrice in the training set is left-skewed.
2. The mean of SalePrice is greater than the median.
3. At least 25% of the houses in the training set sold for more than $200,000.00.
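Statements like these can be checked numerically as well as visually. The sketch below uses a handful of made-up prices (not the Ames data) to show how the mean, median, sample skewness, and the share of homes above a threshold can be computed with pandas. All variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy prices (not the Ames data): one expensive home creates a long right tail
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 160_000, 400_000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()                 # > 0 suggests right skew, < 0 suggests left skew
share_above = (prices > 200_000).mean()  # fraction of homes sold above the threshold

print(mean_price, median_price, skewness, share_above)
```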

In [None]:
# Write your code here



## SalePrice vs Gr_Liv_Area
Next, we visualize the association between `SalePrice` and `Gr_Liv_Area`. The `codebook.txt` file tells us that `Gr_Liv_Area` measures "above grade (ground) living area square feet."

This variable represents the square footage of the house excluding anything underground. Some additional research (into real estate conventions) reveals that this value also excludes the garage space.

## Question 1b
Create a cross-plot with `SalePrice` on the y-axis and `Gr_Liv_Area` on the x-axis. Use the Seaborn plotting function `sns.jointplot()`. Your plot should look something like this.

<br>
<img src="https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/Section 4 - 6 Exercise/images/join_plot_sales_gr_liv_area.png" alt="drawing" width="400"/>
<br>

In [None]:
# Write your code here.



There's certainly an association between `SalePrice` and `Gr_Liv_Area`, and perhaps it's linear, but the spread is wider at larger values of both variables. Also, there seems to be at least one suspicious house above 5000 square feet that looks too inexpensive for its size.

## Question 1c
Find the parcel identification number (`PID`) of any house with `Gr_Liv_Area` greater than 5000 sqft. Create a new variable called `potential_outliers` and assign a list of `PID`s to it.
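As a rough illustration of the filtering involved, here is a minimal sketch on a toy DataFrame. The `PID` values and areas are invented. The real data may call for extra checks.

```python
import pandas as pd

# Hypothetical toy data; the PID values are invented for illustration
toy = pd.DataFrame({
    'PID': [101, 102, 103, 104],
    'Gr_Liv_Area': [1500, 5200, 2300, 5700],
})

# The boolean mask selects the large houses; .loc then picks their PID column
potential_outliers = toy.loc[toy['Gr_Liv_Area'] > 5000, 'PID'].tolist()
print(potential_outliers)
```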

In [None]:
# Write your code here.


# Print answer
print("Potential outlier PID's: {}".format(potential_outliers))

We've looked into these two homes in more detail and have determined that they are true outliers in this data set. They were partial sales, priced far below market value. Therefore, we would like to exclude them from our analysis.

## Question 1d
We could simply filter out these outliers using `Pandas` filtering functionality, but when doing machine learning, it's always advantageous to create modular, reusable code. For example, we may want to remove outliers from other features and we may want to operationalize our code as a reusable pipeline.

Create a function `remove_outliers`, which removes outliers from a data set based off a `lower` and `upper` limit (non-inclusive). For example, `remove_outliers(training_data, 'Gr_Liv_Area', upper=5000)` should return a DataFrame with only observations that satisfy `Gr_Liv_Area` less than 5000.
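One possible shape for such a function is sketched below, assuming both limits are exclusive and default to no limit at all. The default values and the toy data are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

def remove_outliers(data, variable, lower=-np.inf, upper=np.inf):
    """Return the rows of `data` where lower < data[variable] < upper (both bounds exclusive)."""
    mask = (data[variable] > lower) & (data[variable] < upper)
    return data.loc[mask]

# Hypothetical toy frame; note that 5000 itself is dropped because the bound is exclusive
toy = pd.DataFrame({'PID': [1, 2, 3, 4], 'Gr_Liv_Area': [1200, 5600, 2400, 5000]})
filtered = remove_outliers(toy, 'Gr_Liv_Area', upper=5000)
print(filtered)
```

Because the function returns a filtered view of the data rather than modifying it in place, the original DataFrame keeps all of its rows.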

In [None]:
# Write your code here.

# View DataFrame
ames_data.head()

We started with 2000 rows in `ames_data`, so after removing the two outliers we expect 1998. A quick check.

In [None]:
ames_data.shape[0]

Makes sense!

# 2. Feature Engineering
In this section we will create a new feature out of existing ones through a simple data transformation.

## Bathrooms
Let's create a new feature that describes the total number of bathrooms. We will use the following formula:

$$ \text{TotalBathrooms}=(\text{BsmtFullBath} + \text{FullBath}) + \dfrac{1}{2}(\text{BsmtHalfBath} + \text{HalfBath})$$

## Question 2a
Write a function `add_total_bathrooms(data)` that returns a copy of `data` with an additional column called `total_bathrooms` computed by the formula above. You should treat missing values as zeros. 
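A minimal sketch of this transformation follows, assuming the bathroom columns are named `Bsmt_Full_Bath`, `Full_Bath`, `Bsmt_Half_Bath`, and `Half_Bath` as in the check cell of this section. The toy values are made up.

```python
import numpy as np
import pandas as pd

def add_total_bathrooms(data):
    """Return a copy of `data` with a `total_bathrooms` column; missing counts are treated as 0."""
    with_bathrooms = data.copy()
    full = with_bathrooms[['Bsmt_Full_Bath', 'Full_Bath']].fillna(0).sum(axis=1)
    half = with_bathrooms[['Bsmt_Half_Bath', 'Half_Bath']].fillna(0).sum(axis=1)
    with_bathrooms['total_bathrooms'] = full + 0.5 * half
    return with_bathrooms

# Hypothetical toy rows, including missing values
toy = pd.DataFrame({
    'Bsmt_Full_Bath': [1, np.nan],
    'Full_Bath': [2, 1],
    'Bsmt_Half_Bath': [0, 1],
    'Half_Bath': [1, np.nan],
})
result = add_total_bathrooms(toy)
print(result['total_bathrooms'].tolist())
```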

In [None]:
# Write your code

# View DataFrame
ames_data_with_bathrooms.head()

Let's check our answer.

In [None]:
# Change row_id to check multiple rows
row_id = 2
print('Bsmt_Full_Bath: {}'.format(ames_data_with_bathrooms.loc[row_id , 'Bsmt_Full_Bath']))
print('Full_Bath: {}'.format(ames_data_with_bathrooms.loc[row_id , 'Full_Bath']))
print('Bsmt_Half_Bath: {}'.format(ames_data_with_bathrooms.loc[row_id , 'Bsmt_Half_Bath']))
print('Half_Bath: {}'.format(ames_data_with_bathrooms.loc[row_id , 'Half_Bath']))
print('total_bathrooms: {}'.format(ames_data_with_bathrooms.loc[row_id , 'total_bathrooms']))

## Question 2b
Create a visualization that clearly shows that `total_bathrooms` is associated with `SalePrice`. Your visualization should avoid overplotting.

In [None]:
# Write your code here.



# 3. Modelling
We've reached the point where we can specify a model. But first, we will load a fresh copy of the data, just in case our code above produced any undesired side-effects. 

Run the cell below to store a fresh copy of the data from `ames_data.csv` in a DataFrame named `full_data`. We will also store the number of rows in `full_data` in the variable `full_data_len`.

In [None]:
# Load a fresh copy of the data and get its length
full_data = pd.read_csv('ames_data.csv')
full_data_len = len(full_data)
full_data.head()

## Question 3a
Now, let's split the data set into a training set, a validation set, and a test set. We will use the training set to fit our model's parameters, and we will use the validation set to estimate how well our model will perform on unseen data drawn from the same distribution. If we used all the data to fit our model, we would not have a way to estimate model performance on unseen data. The test set is used as a final unseen dataset and we shouldn't touch our test set until our model is finalized.

In the cell below, split the data in `full_data` into three DataFrames named `train`, `val`, and `test`. Let `train` contain 70% of the data, let `val` contain 15% of the data, and let `test` contain 15% of the data.

Use the `train_test_split()` function from `sklearn.model_selection` to perform these splits. Use a `random_state=0` as an argument to `train_test_split()`.
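Since `train_test_split()` produces two pieces per call, a three-way split is commonly built from two calls. The sketch below shows one way to do that on a made-up 100-row frame. The intermediate name `holdout` is an assumption, and other orderings of the two calls are equally valid.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for full_data
full = pd.DataFrame({'x': np.arange(100), 'y': np.arange(100) * 2})

# First hold out 30% of the rows, then split that holdout in half (15% / 15% of the original)
train, holdout = train_test_split(full, test_size=0.30, random_state=0)
val, test = train_test_split(holdout, test_size=0.50, random_state=0)

print(len(train), len(val), len(test))
```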

In [1]:
# Write your code




Lock the test set away and do not predict on it until you've selected your final model.

## Reusable Pipeline
Throughout this assignment, you should notice that your data flows through a single processing pipeline several times. From a software engineering perspective, it's best to define functions/methods that can apply the pipeline to any dataset. We will now encapsulate our entire pipeline into a single function `process_data`. We select a handful of features to use from the many that are available.
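To make the idea concrete, here is a hedged sketch of how such a pipeline could be chained with `DataFrame.pipe`. The helper bodies, the `select_columns` helper, and the toy data are assumptions for illustration; your own versions of these functions may differ.

```python
import numpy as np
import pandas as pd

def remove_outliers(data, variable, lower=-np.inf, upper=np.inf):
    """Keep rows where lower < data[variable] < upper."""
    return data.loc[(data[variable] > lower) & (data[variable] < upper)]

def add_total_bathrooms(data):
    """Return a copy with a total_bathrooms column; missing counts are treated as 0."""
    with_bathrooms = data.copy()
    full = with_bathrooms[['Bsmt_Full_Bath', 'Full_Bath']].fillna(0).sum(axis=1)
    half = with_bathrooms[['Bsmt_Half_Bath', 'Half_Bath']].fillna(0).sum(axis=1)
    with_bathrooms['total_bathrooms'] = full + 0.5 * half
    return with_bathrooms

def select_columns(data, *columns):
    """Keep only the named columns, in the given order."""
    return data.loc[:, list(columns)]

def process_data(data):
    """Clean the data, add features, and split it into a feature matrix X and target y."""
    processed = (
        data
        .pipe(remove_outliers, 'Gr_Liv_Area', upper=5000)
        .pipe(add_total_bathrooms)
        .pipe(select_columns, 'SalePrice', 'Gr_Liv_Area', 'Garage_Area', 'total_bathrooms')
    )
    X = processed.drop(columns='SalePrice')
    y = processed['SalePrice']
    return X, y

# Hypothetical toy rows; the second one exceeds the 5000 sqft limit
toy = pd.DataFrame({
    'Gr_Liv_Area': [1500, 5600, 2000],
    'Garage_Area': [400, 800, 500],
    'Bsmt_Full_Bath': [1, 0, np.nan],
    'Full_Bath': [1, 3, 2],
    'Bsmt_Half_Bath': [0, 0, 1],
    'Half_Bath': [1, 0, 0],
    'SalePrice': [200000, 150000, 250000],
})
X, y = process_data(toy)
print(X)
```

Each step takes a DataFrame and returns a DataFrame, which is what lets the same chain run unchanged on training, validation, and test data.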

In [None]:
# Write your code


Now, we can use `process_data` to clean our data, select features, and add our `total_bathrooms` feature all in one step. This function also splits our data into `X`, a matrix of features, and `y`, a vector of sale prices (our training target). 

Run the cell below to feed our training and validation data through the pipeline, generating `X_train`, `y_train`, `X_val`, and `y_val`.

In [None]:
X_train, y_train = process_data(train)
X_val, y_val = process_data(val)

## Fitting Our First Model
We are finally going to fit a model.  The model we will fit can be written as follows:

$$\text{SalePrice} = \theta_0 + \theta_1 \cdot \text{Gr}\_\text{Liv}\_\text{Area} + \theta_2 \cdot \text{Garage}\_\text{Area} + \theta_3 \cdot \text{total_bathrooms}$$

**Note:** Notice that all of our variables are continuous, except for `total_bathrooms`, which takes on discrete ordered values (0, 0.5, 1, 1.5, ...). We'll treat `total_bathrooms` as a continuous quantitative variable in our model for now, but this might not be the best choice. The latter half of this assignment may revisit the issue.

## Question 3b
We will use a [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object as our linear model. In the cell below, create a `LinearRegression` object and name it `linear_model`.

**Hint:** See the `fit_intercept` parameter and make sure it is set appropriately. The intercept of our model corresponds to $\theta_0$ in the equation above.

In [2]:
# Write your code


## Question 3c
Now, remove the commenting and fill in the ellipses `...` below with `X_train`, `y_train`, `X_val`, or `y_val`.

With the ellipses filled in correctly, the code below should fit our linear model to the training data and generate the predicted sale prices for both the training and validation datasets.

Assign your predictions for the training set to `y_fitted` and your predictions for the validation set to `y_predicted`.
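The fit-then-predict pattern can be illustrated on a tiny synthetic problem. In the sketch below the arrays are made up and follow an exact line, so they only stand in for the real `X_train`, `y_train`, and `X_val`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 1 + 2x exactly
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
X_val = np.array([[5.0], [6.0]])

# fit_intercept=True lets the model estimate theta_0 alongside the slope
linear_model = LinearRegression(fit_intercept=True)
linear_model.fit(X_train, y_train)          # parameters are estimated from training data only

y_fitted = linear_model.predict(X_train)    # predictions on the data the model has seen
y_predicted = linear_model.predict(X_val)   # predictions on held-out data

print(linear_model.intercept_, linear_model.coef_, y_predicted)
```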

In [None]:
# Write your code



## Question 3d
Is our linear model any good at predicting house prices? Let's measure the quality of our model by calculating the Root-Mean-Square Error (RMSE) between our predicted house prices and the true prices stored in `SalePrice`.

$$\text{RMSE} = \sqrt{\dfrac{\sum_{\text{houses in dataset}}(\text{actual price of house} - \text{predicted price of house})^2}{\text{number of houses in dataset}}}$$

In the cell below, write a function named `rmse` that calculates the RMSE of a model.

**Hint:** Make sure to vectorize your code. This question can be answered without any `for` statements.
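A vectorized version of the formula might look like the sketch below. The argument names `actual` and `predicted` are assumptions. Since the errors are squared, swapping the two arguments gives the same result.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of prices."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Hypothetical toy prices: the errors are -10, 10, and 0
example_error = rmse([100, 200, 300], [110, 190, 300])
print(example_error)
```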

In [None]:
# Write your code



## Question 3e
Now use your `rmse` function to calculate the training error and validation error in the cell below.

In [None]:
# Write your code here.


# Print answers
print('Training RMSE: ${}'.format(training_error))
print('Validation RMSE: ${}'.format(val_error))

## Question 3f
How much does including `total_bathrooms` as a predictor reduce the RMSE of the model on the validation set? That is, what's the difference between the RMSE of a model that only includes `Gr_Liv_Area` and `Garage_Area` versus one that includes all three predictors (`Gr_Liv_Area`, `Garage_Area`, and `total_bathrooms`)?

In [None]:
# Write your code

# Print results
print('Validation RMSE: ${}'.format(val_error))
print('Validation RMSE (No Bath): ${}'.format(val_error_no_bath))
print('Validation Error Difference: {}'.format(val_error_difference))

## Question 3g
What changes could you make to your linear model to improve its accuracy and lower the validation error? Suggest at least two things you could try in the cell below, and carefully explain how each change could potentially improve your model's accuracy.

1. Add more data and handle missing values, so that the model is fit on more complete information.

2. Tune the algorithm: search for the optimum value of each parameter that strongly influences the outcome.

# 4. Cross Validation
Moving forward, we will now use cross validation to help validate our model instead of explicitly splitting the data into a training and validation set. To do this, we'll need to create a cross-validation function for our `rmse` score.

First, let's split the `full_data` again but this time into only two datasets `train` and `test`. Let `train` contain 70% of the data, let `test` contain 30% of the data.

Again, we will use the `train_test_split()` function from `sklearn.model_selection` to perform these splits and use a `random_state=0` as an argument to `train_test_split()`.

In [None]:
# Write your code 



Next, let's use our `process_data` function to get our features (`X_train`) and target (`y_train`) for the training dataset.

In [None]:
X_train, y_train = process_data(train)

## Question 4
Create a cross-validation function `cross_validate_rmse`, which returns the `rmse` score for each of the 5 splits so that we can average them afterwards.

Hint: `train_index` and `val_index` contain the train and val row indices for `X` and `y`.
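One way such a function could be structured is sketched below, assuming `X` is a DataFrame, `y` is a Series, and the folds come from `sklearn.model_selection.KFold` without shuffling. The `n_splits` argument and the use of `clone` are assumptions. The toy data is synthetic and exactly linear, so every fold's RMSE is essentially zero.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cross_validate_rmse(model, X, y, n_splits=5):
    """Fit a fresh copy of `model` on each training fold and return the per-fold validation RMSEs."""
    kfold = KFold(n_splits=n_splits)
    scores = []
    for train_index, val_index in kfold.split(X):
        fold_model = clone(model)  # unfitted copy, so folds do not share state
        fold_model.fit(X.iloc[train_index], y.iloc[train_index])
        predicted = fold_model.predict(X.iloc[val_index])
        errors = y.iloc[val_index].to_numpy() - predicted
        scores.append(float(np.sqrt(np.mean(errors ** 2))))
    return scores

# Hypothetical toy data with an exact linear relationship
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
y_toy = 3.0 + 2.0 * X_toy['a'] - 1.0 * X_toy['b']

toy_scores = cross_validate_rmse(LinearRegression(fit_intercept=True), X_toy, y_toy)
print(toy_scores)
```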

In [None]:
# Write your code



Now, let's apply our new cross-validation function to our training dataset.

In [None]:
cv_scores = cross_validate_rmse(model=LinearRegression(fit_intercept=True), X=X_train, y=y_train)

# Print cv scores
print('Cross-validation RMSE scores: {}'.format(cv_scores))
print('Cross-validation RMSE scores mean: ${}'.format(np.mean(cv_scores)))
print('Cross-validation RMSE scores std: ${}'.format(np.std(cv_scores)))

Next, let's compare this to our old validation score from earlier.

In [None]:
print('Validation RMSE score: ${}'.format(val_error))
print('Cross-validation RMSE scores mean: ${}'.format(np.mean(cv_scores)))

This example demonstrates the benefit of cross-validation over a single explicit split into a training and validation set. 

We can see that the cross-validation score for the first split is `53256.63312984414`, which is similar to our original validation score. However, the other 4 cross-validation scores are lower (`40632.040286195836, 40982.685175624334, 43732.46059794499, 42570.16643666945`). 

Especially when datasets are small, sampling bias can cause differences in performance depending on how a dataset is split. Cross-validation ensures you're able to monitor this variability and evaluate your models properly.

# 5. More Feature Selection and Engineering 
The linear model that you created failed to produce accurate estimates of the observed housing prices because the model was too simple. The goal of the next few sections is to guide you through the iterative process of specifying, fitting, and analyzing the performance of more complex linear models used to predict prices of houses in Ames, Iowa. 

In this section, we identify two more features of the dataset that will increase our linear regression model's accuracy. Additionally, we will implement one-hot encoding so that we can include binary and categorical variables in our improved model.

We've used a slightly modified data cleaning pipeline from the first half of the assignment to prepare the data. This data is stored in `ames_data_cleaned.csv`. It consists of 1998 observations and 83 features (we added the feature `total_bathrooms` from the first half of the assignment).

First, let's import `ames_data_cleaned.csv`.

In [None]:
ames_data_cleaned = pd.read_csv('ames_data_cleaned.csv')
ames_data_cleaned.head()

Next, let's split the `ames_data_cleaned` into only two datasets `train` and `test`. Let `train` contain 70% of the data, let `test` contain 30% of the data.

Again, we will use the `train_test_split()` function from `sklearn.model_selection` to perform these splits and use a `random_state=0` as an argument to `train_test_split()`.

In [None]:
# Split dataset
train_cleaned, test_cleaned = train_test_split(ames_data_cleaned, test_size = 0.30, random_state = 0)

# Print results
print('Train {}%'.format(train_cleaned.shape[0] / ames_data_cleaned.shape[0] * 100))
print('Test {}%'.format(test_cleaned.shape[0] / ames_data_cleaned.shape[0] * 100))

## Neighborhood vs Sale Price
First, let's take a look at the relationship between neighborhood and sale prices of the houses in our data set.

In [None]:
fig, axs = plt.subplots(nrows=2, figsize=(12, 8))

sns.boxplot(
    x='Neighborhood',
    y='SalePrice',
    data=train_cleaned.sort_values('Neighborhood'),
    ax=axs[0]
)

sns.countplot(
    x='Neighborhood',
    data=train_cleaned.sort_values('Neighborhood'),
    ax=axs[1]
)

# Draw median price
axs[0].axhline(
    y=train_cleaned['SalePrice'].median(), 
    color='red',
    linestyle='dotted'
)

# Label the bars with counts
for patch in axs[1].patches:
    x = patch.get_bbox().get_points()[:, 0]
    y = patch.get_bbox().get_points()[1, 1]
    axs[1].annotate(f'{int(y)}', (x.mean(), y), ha='center', va='bottom')
    
# Format x-axes
axs[1].tick_params(axis='x', labelrotation=90)
axs[0].xaxis.set_visible(False)

# Narrow the gap between the plots
plt.subplots_adjust(hspace=0.01)

## Question 5a
Based on the plot above, what can be said about the relationship between the house sale prices and their neighborhoods?


Some neighborhoods have generally higher sale prices: their boxes lie above the median price line. These are not necessarily the neighborhoods with the most or the fewest sales.


## Question 5b
One way we can deal with the lack of data from some neighborhoods is to create a new feature that bins neighborhoods together.  Let's categorize our neighborhoods in a crude way. Take the top 3 neighborhoods measured by median `SalePrice` and identify them as **"rich neighborhoods"**. We won't mark the other neighborhoods.

Write a function that returns a list of the top `n` priciest neighborhoods as measured by our choice of aggregating function (`np.median`, `np.mean`, etc.).  For example, in the setup above, we would want to call `find_rich_neighborhoods(train_cleaned, 3, np.median)` to find the top 3 neighborhoods measured by median `SalePrice`.
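A group-and-rank approach is one natural fit here. The sketch below assumes the default values `n=3` and `metric=np.median` and uses invented neighborhood names and prices.

```python
import numpy as np
import pandas as pd

def find_rich_neighborhoods(data, n=3, metric=np.median):
    """Return the `n` neighborhoods with the highest aggregated SalePrice, priciest first."""
    summary = data.groupby('Neighborhood')['SalePrice'].agg(metric)
    return summary.sort_values(ascending=False).head(n).index.tolist()

# Hypothetical toy data: medians are A=110, B=310, C=205, D=500
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'SalePrice': [100, 120, 300, 320, 200, 210, 500],
})
top_two = find_rich_neighborhoods(toy, 2, np.median)
print(top_two)
```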

In [None]:
# Write your code

# Print rich neighborhoods
print('The three richest neighborhoods are: {}'.format(rich_neighborhoods))

Check the figure above to make sure you've got the correct answer.

## Question 5c
We now have a list of neighborhoods we've deemed as richer than others.  Let's use that information to make a new variable `in_rich_neighborhood`.  Write a function `add_rich_neighborhood` that adds an indicator variable which takes on the value 1 if the house is part of `rich_neighborhoods` and the value 0 otherwise.

**Hint:** [`pd.Series.astype`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.astype.html) may be useful for converting True/False values to integers.
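Following the hint, membership can be tested with `Series.isin` and the resulting booleans cast to integers. The sketch below is a minimal version on invented data, where the second argument `neighborhoods` is an assumed parameter name.

```python
import pandas as pd

def add_rich_neighborhood(data, neighborhoods):
    """Return a copy of `data` with a 0/1 `in_rich_neighborhood` indicator column."""
    with_indicator = data.copy()
    is_rich = with_indicator['Neighborhood'].isin(neighborhoods)  # boolean Series
    with_indicator['in_rich_neighborhood'] = is_rich.astype(int)  # True/False -> 1/0
    return with_indicator

# Hypothetical toy data
toy = pd.DataFrame({'Neighborhood': ['A', 'B', 'C', 'B']})
toy_rich = add_rich_neighborhood(toy, ['B'])
print(toy_rich)
```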

In [None]:
# Write your code

# View DataFrame
train_cleaned_rich.head()

Let's check to see if our function added the new feature correctly. We should see a value of 1 for this rich neighborhood.

In [None]:
train_cleaned_rich[train_cleaned_rich['Neighborhood'] == 'NoRidge'].head()

## Fireplace Quality
In the following section, we will take a closer look at the `Fireplace_Qu` feature of the dataset and examine how we can incorporate categorical features into our linear model.

## Question 5d
Let's see if our data set has any missing values.  Create a Series object containing the counts of missing values in each of the columns of our data set, sorted from greatest to least.  The Series should be indexed by the variable names.  For example, `missing_counts.loc['Fireplace_Qu']` should return 688.

**Hint:** [`pandas.DataFrame.isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) may help here.
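Chaining `isnull()`, `sum()`, and `sort_values()` yields a Series of this shape. The sketch below uses a small made-up frame, so its counts will not match the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in two columns
toy = pd.DataFrame({
    'Fireplace_Qu': ['Gd', None, None, None],
    'Garage_Area': [500.0, np.nan, 300.0, 400.0],
    'SalePrice': [200000, 150000, 180000, 210000],
})

# isnull() marks missing cells, sum() counts them per column, sort_values() orders the counts
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```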

In [None]:
# Write your code here.

# Display missing_counts
missing_counts

## Question 5e
An `NA` here actually means that the house had no fireplace to rate.  Let's fix this in our data set.  Write a function that replaces the missing values in `Fireplace_Qu` with `'No Fireplace'`.  In addition, it should replace each abbreviated condition with its full word.  For example, `'TA'` should be changed to `'Average'`, `'FA'` should be changed to `'Fair'`, and so on.  Hint: the [`DataFrame.replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) method may be useful here.
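The replacement can be expressed as a fill followed by a dictionary lookup. In the sketch below, the function name `fix_fireplace_qu` is hypothetical, and the abbreviation spellings in the mapping are assumptions that should be confirmed against `codebook.txt`.

```python
import numpy as np
import pandas as pd

# Assumed abbreviation spellings; confirm them against codebook.txt before use
FIREPLACE_LABELS = {
    'Ex': 'Excellent',
    'Gd': 'Good',
    'TA': 'Average',
    'Fa': 'Fair',
    'Po': 'Poor',
}

def fix_fireplace_qu(data):
    """Return a copy of `data` with readable Fireplace_Qu labels and no missing values."""
    fixed = data.copy()
    fixed['Fireplace_Qu'] = (
        fixed['Fireplace_Qu']
        .fillna('No Fireplace')     # a missing value means there is no fireplace to rate
        .replace(FIREPLACE_LABELS)  # expand each abbreviation to its full word
    )
    return fixed

# Hypothetical toy data
toy = pd.DataFrame({'Fireplace_Qu': ['TA', np.nan, 'Gd', 'Fa']})
fixed_toy = fix_fireplace_qu(toy)
print(fixed_toy['Fireplace_Qu'].tolist())
```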

In [None]:
# Write your code

# View DataFrame
training_data_fireplace_qu.head()

Let's check out the unique values in `Fireplace_Qu` now.

In [None]:
training_data_fireplace_qu['Fireplace_Qu'].unique().tolist()

### Using Categorical Variables for Regression
Unfortunately, simply fixing these missing values isn't sufficient for using `Fireplace_Qu` in our model.  Since `Fireplace_Qu` is a categorical variable, we will have to **dummy-encode** the data. Note that dummy-encoding drops the first one-hot-encoded column. For more information on categorical data in pandas, refer to this [link](https://pandas-docs.github.io/pandas-docs-travis/categorical.html).

In [None]:
# Write your code

# View new encoded features
training_data_ohe[[col for col in training_data_ohe.columns 
                   if 'Fireplace_Qu' in col]].head()

Notice that there are five new binary features:
- `'Fireplace_Qu_Good'`
- `'Fireplace_Qu_Average'`
- `'Fireplace_Qu_Fair'`
- `'Fireplace_Qu_Poor'`
- `'Fireplace_Qu_No Fireplace'`

but we initially had 6 categories. Where is `'Fireplace_Qu_Excellent'`?

Imagine a case where:
- `'Fireplace_Qu_Good'=0`
- `'Fireplace_Qu_Average'=0`
- `'Fireplace_Qu_Fair'=0`
- `'Fireplace_Qu_Poor'=0`
- `'Fireplace_Qu_No Fireplace'=0`

This would mean that the fireplace quality is excellent. If we added a sixth feature named `'Fireplace_Qu_Excellent'`, it would be fully determined by the other five, making our fireplace features perfectly correlated and redundant.

Basically:
- `'Fireplace_Qu_Good'=0`
- `'Fireplace_Qu_Average'=0`
- `'Fireplace_Qu_Fair'=0`
- `'Fireplace_Qu_Poor'=0`
- `'Fireplace_Qu_No Fireplace'=0`

would mean the same thing as:
- `'Fireplace_Qu_Excellent'=1`

Therefore, it is desirable to drop the first one-hot-encoded column, which we call dummy-encoding.
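The dropped column can be seen on a small example. The sketch below assumes the categories are ordered with `'Excellent'` first (via `pd.Categorical`) so that `pd.get_dummies(..., drop_first=True)` drops that level. Without an explicit order, pandas would drop the alphabetically first label instead. The data is invented.

```python
import pandas as pd

# Assumed category order, chosen so that 'Excellent' is the dropped reference level
categories = ['Excellent', 'Good', 'Average', 'Fair', 'Poor', 'No Fireplace']

# Hypothetical toy data
toy = pd.DataFrame({
    'SalePrice': [300000, 250000, 150000, 200000],
    'Fireplace_Qu': ['Excellent', 'Good', 'No Fireplace', 'Average'],
})
toy['Fireplace_Qu'] = pd.Categorical(toy['Fireplace_Qu'], categories=categories)

# drop_first=True removes the first category's column, leaving five indicators
encoded = pd.get_dummies(toy, columns=['Fireplace_Qu'], drop_first=True)
dummy_columns = [col for col in encoded.columns if 'Fireplace_Qu' in col]

print(dummy_columns)
print(encoded[dummy_columns].iloc[0].sum())  # the 'Excellent' row is all zeros
```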

# 6. Improving Our Linear Model
In this section, we will create linear models that produce more accurate estimates of the housing prices in Ames than the model created in the first half of this assignment, but at the expense of increased complexity.

The model we will fit can be written as follows:

$$
\text{SalePrice} = 
\theta_0 + 
\theta_1 \cdot \text{Gr}\_\text{Liv}\_\text{Area} + 
\theta_2 \cdot \text{Garage}\_\text{Area} + 
\theta_3 \cdot \text{total_bathrooms} +
\theta_4 \cdot \text{in_rich_neighborhood} +
\theta_5 \cdot \text{(Fireplace_Qu_Good)} +
\theta_6 \cdot \text{(Fireplace_Qu_Average)} +
\theta_7 \cdot \text{(Fireplace_Qu_Fair)} +
\theta_8 \cdot \text{(Fireplace_Qu_Poor)} +
\theta_9 \cdot \text{(Fireplace_Qu_No Fireplace)}
$$

We still have a little bit of work to do prior to estimating our linear regression model's coefficients. Instead of having you go through the process of selecting the pertinent features and creating a [`sklearn.linear_model.LinearRegression()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object for our linear model again, we will provide the necessary code from the first half of this assignment. However, we will now use cross validation to help validate our model instead of explicitly splitting the data into a training and validation set.

First, we will re-import the data.

In [None]:
ames_data_cleaned = pd.read_csv('ames_data_cleaned.csv')
ames_data_cleaned.head()

And split into training and test data.

In [None]:
# Split dataset
train_cleaned, test_cleaned = train_test_split(ames_data_cleaned, test_size=0.30, random_state=0)

# Print results
print('Train {}%'.format(train_cleaned.shape[0] / ames_data_cleaned.shape[0] * 100))
print('Test {}%'.format(test_cleaned.shape[0] / ames_data_cleaned.shape[0] * 100))

Next, we will implement a reusable pipeline that selects the required variables in our data and splits our feature and target variable into a matrix and a vector, respectively.

In [None]:
# Write your code



We then process our training set using our data cleaning pipeline.

In [None]:
# Pre-process the training data
# Our functions make this very easy!
X_train, y_train = process_data(train_cleaned)
X_train.head()

## Question 6a
Use the `cross_validate_rmse` function to calculate the cross validation error in the cell below.

In [None]:
# Write your code here.


# Print cv scores
print('Cross-validation RMSE scores: {}'.format(cv_scores_updated))
print('Cross-validation RMSE scores mean: ${}'.format(np.mean(cv_scores_updated)))
print('Cross-validation RMSE scores std: ${}'.format(np.std(cv_scores_updated)))

Let's compare this to our earlier cross-validation score when only using:
- `'SalePrice'`
- `'Gr_Liv_Area'` 
- `'Garage_Area'`
- `'total_bathrooms'`

In [None]:
print('Cross-validation RMSE scores mean: ${}'.format(np.mean(cv_scores)))

You've done it! By adding two new features, we've improved our model's performance.

Now that we are happy with our model's performance and have settled on a final set of features, we can train the final model on the entire training dataset. First, we initialize a `sklearn.linear_model.LinearRegression()` object as our linear model. We set `fit_intercept=True` so that the linear model estimates an intercept term instead of forcing it to zero.

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression(fit_intercept=True)

It's finally time to fit our updated linear regression model. The cell below estimates the model and then uses it to compute the fitted value of `SalePrice` over the training data.

In [None]:
# Fit the model
linear_model.fit(X_train, y_train)

# Compute the fitted values of SalePrice on the training data
y_fitted = linear_model.predict(X_train)

Let's assess the performance of our new linear regression model using the Root Mean Squared Error function from earlier in this assignment.

In [None]:
training_error = rmse(y_fitted, y_train)
print("Training RMSE: ${}".format(training_error))

## Question 6b
Now that we have trained our final model, we can evaluate it on our test data. Predict the house prices for the test features `X_test` and name the variable `y_predicted`. Then compute the test `RMSE`.

In [None]:
# Pre-process the test data
# Our functions make this very easy!
X_test, y_test = process_data(test_cleaned)
X_test.head()

# Write your code here.


# 7. Open-Response 
The following part is purposefully left nearly open-ended. 

## Question 7
Your goal is to provide a linear regression model that improves the cross-validation root mean square error from the previous section.

- Cross-validation RMSE scores mean: `$39006.42198732712`

To do this, you should add at least one new feature. Please use Markdown cells to explain your thinking when engineering new features.

Let's import the data and split again with new variable names for this section.

In [None]:
ames_data_cleaned_q7 = pd.read_csv('ames_data_cleaned.csv')
ames_data_cleaned_q7.head()

In [None]:
# Split dataset
train_cleaned_q7, test_cleaned_q7 = train_test_split(ames_data_cleaned_q7, test_size = 0.30, random_state = 0)

# Print results
print('Train {}%'.format(train_cleaned_q7.shape[0] / ames_data_cleaned_q7.shape[0] * 100))
print('Test {}%'.format(test_cleaned_q7.shape[0] / ames_data_cleaned_q7.shape[0] * 100))

In [None]:
# Write your code here. (Add as many new cells as you like)
# see below

Engineer Features: 
- Categorical Encoding
- Binning
- Transformations
- Polynomial Features
- Scaling
- Datetime Extraction
- Text Splitting
- Imputation and Outliers
- Dimension Reduction
- Advanced Feature Engineering

There are many ways to tackle this problem; I have chosen a straightforward one. Heating is essential to everyday living and might influence how a house is priced.

I first fixed the `Heating_QC` column by replacing the abbreviations with more readable labels. The `Heating_QC` column now holds readable categorical values.

To use this categorical variable in the regression, I then dummy-encoded it.


In [None]:


# View DataFrame
training_data_heating_qc.head()

Here is your reusable pipeline. You'll want to add at least one new feature here:

```python
data = select_columns(data, 
                      'SalePrice', 
                      'Gr_Liv_Area', 
                      'Garage_Area',
                      'total_bathrooms',
                      'in_rich_neighborhood',
                      'Fireplace_Qu_Good',
                      'Fireplace_Qu_Average',
                      'Fireplace_Qu_Fair',
                      'Fireplace_Qu_Poor',
                      'Fireplace_Qu_No Fireplace')
```

In [None]:
# Write your code



## Final Evaluation
This is where you can compute your cross-validation score. `X_train_q7` and `y_train_q7` should be your new feature and target variables. `X_train_q7` should include at least one new feature.

In [None]:


# Print cv scores
print('Cross-validation RMSE scores: {}'.format(cv_scores_updated))
print('Cross-validation RMSE scores mean: ${}'.format(np.mean(cv_scores_updated)))
print('Cross-validation RMSE scores std: ${}'.format(np.std(cv_scores_updated)))

- Cross-validation RMSE scores mean before adding the feature: `$39006.42198732712`
- Cross-validation RMSE scores mean after adding the feature: `$37773.49539301975`