<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

### Regression and Classification with the Ames Housing Data

---

You have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small, and although investment is substantial, its short-term goals are oriented more towards purchasing and flipping existing houses than towards constructing entirely new ones. That said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data made available on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [None]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from feature_selector import FeatureSelector
from sklearn.linear_model import LinearRegression, RidgeCV

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Estimating the value of homes from fixed characteristics.

---

Your superiors have outlined this year's strategy for the company:
1. Develop an algorithm to reliably estimate the value of residential houses based on *fixed* characteristics.
2. Identify characteristics of houses that the company can cost-effectively change/renovate with their construction team.
3. Evaluate the mean dollar value of different renovations.

The company can then use these results to buy houses that are likely to sell for more than the cost of the purchase plus renovations.

Your first job is to tackle #1. You have a dataset of housing sale data with a large number of features describing different aspects of each house. The data and the full description of its features are in two files:

    housing.csv
    data_description.txt
    
You need to build a reliable estimator for the price of the house given characteristics of the house that cannot be renovated. Some examples include:
- The neighborhood
- Square feet
- Bedrooms, bathrooms
- Basement and garage space

and many more. 

Some examples of things that **ARE renovate-able:**
- Roof and exterior features
- "Quality" metrics, such as kitchen quality
- "Condition" metrics, such as condition of garage
- Heating and electrical components

and generally anything you deem can be modified without having to undergo major construction on the house.
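
Because the fixed/renovate-able split is a judgment call, it can help to check that the two lists are disjoint and that no column is left unassigned. The sketch below does this with plain sets; the column lists are abbreviated and hypothetical, chosen only to illustrate the check, and the real assignment should follow `data_description.txt`.

```python
# Hypothetical, abbreviated column lists used only to illustrate the check.
all_columns = ["Neighborhood", "GrLivArea", "BedroomAbvGr", "GarageArea",
               "RoofMatl", "KitchenQual", "GarageCond", "Heating", "SalePrice"]
fixed = {"Neighborhood", "GrLivArea", "BedroomAbvGr", "GarageArea"}
renovatable = {"RoofMatl", "KitchenQual", "GarageCond", "Heating"}
target_name = "SalePrice"

# Columns placed in both groups would leak renovatable information into the fixed model.
overlap = fixed & renovatable
# Columns in neither group have not been considered yet.
unassigned = set(all_columns) - fixed - renovatable - {target_name}
# Names that do not exist in the data are usually typos.
unknown = (fixed | renovatable) - set(all_columns)

print("overlap:", sorted(overlap))
print("unassigned:", sorted(unassigned))
print("unknown:", sorted(unknown))
```

All three sets should be empty before modelling starts.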

---

**Your goals:**
1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Be sure to remove any houses that are not residential from the dataset.
3. Identify **fixed** features that can predict price.
4. Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
5. Characterize your model. How well does it perform? What are the best estimates of price?

> **Note:** The EDA and feature engineering component to this project is not trivial! Be sure to always think critically and creatively. Justify your actions! Use the data description file!
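
The train-on-pre-2010, test-on-2010 goal can be sketched as below. This is a minimal illustration on synthetic data, not the Ames file: the generated prices, the three neighborhoods, and the choice of `RidgeCV` with a `ColumnTransformer` are all assumptions made for the example. It reports both R² and RMSE, since RMSE is in dollars and is easier to interpret as a typical error on price.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "YrSold": rng.integers(2006, 2011, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n).clip(400),
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], size=n),
})
premium = demo["Neighborhood"].map({"NAmes": 0, "CollgCr": 30000, "OldTown": -20000})
demo["SalePrice"] = 50000 + 90 * demo["GrLivArea"] + premium + rng.normal(0, 15000, size=n)

# Temporal split: fit on sales before 2010, evaluate on 2010 sales.
train = demo[demo["YrSold"] < 2010]
test = demo[demo["YrSold"] == 2010]
features = ["GrLivArea", "Neighborhood"]

# Scale numeric columns and one-hot encode categoricals inside the pipeline,
# so the test set is transformed with statistics learned from the training set only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])
model = make_pipeline(preprocess, RidgeCV(alphas=np.logspace(-1, 3, 50)))
model.fit(train[features], train["SalePrice"])

pred = model.predict(test[features])
r2 = r2_score(test["SalePrice"], pred)
rmse = np.sqrt(mean_squared_error(test["SalePrice"], pred))
print(f"R^2 on 2010 sales: {r2:.3f}, RMSE: ${rmse:,.0f}")
```

Keeping the encoding inside the pipeline avoids having to align one-hot columns between the training and test sets by hand.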

In [None]:
# Load the data
house = pd.read_csv('./housing.csv')

In [None]:
house.head()

In [None]:
house.info()

In [None]:
# Dropping rows that are not residential sales/properties
house = house[~house['MSZoning'].isin(['I', 'A', 'C', 'C (all)'])]
# Dropping features that are mostly null or judged unimportant.
# Reassigning (rather than inplace=True) avoids modifying a filtered slice.
house = house.drop(['LotFrontage', 'Street', 'Alley', 'LotShape', 'LandContour',
                    'LandSlope', 'MasVnrArea', 'GarageYrBlt', 'PoolArea', 'PoolQC',
                    'Fence', 'MiscFeature', 'FireplaceQu'], axis=1)


In [None]:
# Pulling target variable 'SalePrice' from df
target = house.SalePrice

In [None]:
target.head()

In [None]:
house = house.drop('SalePrice',axis=1)

In [None]:
house.head()

In [None]:
# Separating fixed house features and renovatable features into separate dfs

df_fixed = house[['MSSubClass', 'LotArea', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType'
                 , 'HouseStyle', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'MasVnrType', 'Foundation','BsmtExposure'
                 , 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath'
                 , 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageType', 'GarageCars'
                 , 'GarageArea', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']]

df_reno = house[['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond','RoofMatl', 'Exterior1st', 'Exterior2nd', 'BsmtQual'
                 , 'BsmtCond', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'HeatingQC'
                 , 'CentralAir', 'Electrical', 'LowQualFinSF', 'KitchenQual', 'Functional', 'GarageFinish', 'GarageQual'
                 , 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch'
                 , 'ScreenPorch', 'MiscVal']]

In [None]:
# init feature selector
fs = FeatureSelector(data = df_fixed, labels = target)

In [None]:
# One-hot encoding categorical variables for use in regression analysis,
# then running the model to identify features with zero importance
fs.identify_zero_importance(task = 'regression', eval_metric = 'l2',
                            n_iterations = 10, early_stopping = False)

In [None]:
one_hot_features = fs.one_hot_features
base_features = fs.base_features
print('There are %d original features' % len(base_features))
print('There are %d one-hot features' % len(one_hot_features))

In [None]:
zero_importance_features = fs.ops['zero_importance']
zero_importance_features

In [None]:
#plotting most important features
fs.plot_feature_importances(threshold = 0.99, plot_n = 20)

In [None]:
fs.identify_low_importance(cumulative_importance = 0.99)

In [None]:
# The YrSold feature seems to have some influence on the target, so training on houses
# sold before 2010 and testing on houses sold in 2010 (as the brief asks) may generalise poorly.
# I will first build a model using randomly selected cross-validation folds, then try splitting train/test on YrSold.

In [None]:
# Removing features with zero importance
train_no_zero = fs.remove(methods = ['zero_importance'])

In [None]:
train_no_zero

In [None]:
# Dropping features that have been one hot encoded but leaving their OHE columns
train_no_zero = train_no_zero.drop(['Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
                                  'RoofStyle', 'MasVnrType', 'Foundation', 'BsmtExposure','GarageType','SaleType','SaleCondition'],axis=1)

In [None]:
train_no_zero

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, SelectFromModel

In [None]:
# Making pipeline to standardise data before modelling is conducted
pipeline = make_pipeline(StandardScaler(),LinearRegression())
# Fitting linreg
pipeline.fit(train_no_zero,target)

In [None]:
# linreg score with no cv
pipeline.score(train_no_zero,target)

In [None]:
# cv score with linreg model
score = cross_val_score(estimator=pipeline, X=train_no_zero,y=target,cv=5)
score.mean()

In [None]:
predict = cross_val_predict(pipeline,train_no_zero,target,cv=5)
predict

In [None]:
target_t = np.log(target)

In [None]:
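# Plotting cross-validated predictions against actual sale prices.
# The predictions are in dollars (the model was not fitted on the log target),
# so they are compared with target rather than the log-scaled target_t.
plt.scatter(predict, target)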

In [None]:
# Merging target to prepare to split on YrSold
train_no_zero_target = pd.merge(train_no_zero, target, how='left',on=train_no_zero.index).reset_index()

In [None]:
train_no_zero_target = train_no_zero_target.drop(['index','key_0'],axis=1)

In [None]:
train_no_zero_target

In [None]:
# training set on pre 2010 houses
train_pre_2010 = train_no_zero_target[train_no_zero_target['YrSold']<2010]
train_pre_2010['YrSold'].unique()

In [None]:
# test set on 2010 houses
test_2010 = train_no_zero_target[train_no_zero_target['YrSold']>=2010]
test_2010['YrSold'].unique()

In [None]:
target_pre_2010 = train_pre_2010.SalePrice
target_2010 = test_2010.SalePrice

In [None]:
train_pre_2010 = train_pre_2010.drop(['SalePrice'],axis=1)
test_2010 = test_2010.drop(['SalePrice'],axis=1)

In [None]:
train_pre_2010.head()

In [None]:
test_2010.head()

In [None]:
# fitting ridgecv on pre 2010 train data
pipeline_RCV = make_pipeline(StandardScaler(),RidgeCV(alphas=np.logspace(-1,3,200)))
pipeline_RCV.fit(train_pre_2010,target_pre_2010)
pipeline_RCV.score(test_2010,target_2010)

In [None]:
# fitting linreg on pre-2010 train data
pipeline.fit(train_pre_2010,target_pre_2010)

In [None]:
# testing linreg on 2010 test data
pipeline.score(test_2010,target_2010)

In [None]:
# RidgeCV seems like the more robust of the two models

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

---

Now that you have a model that estimates the price of a house based on its static characteristics, we can move forward with parts 2 and 3 of the plan: what are the costs/benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?
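
The two requirements above can be illustrated with a two-stage fit. The sketch below uses synthetic data with assumed dollar effects (8,000 per `KitchenQual` level, 5,000 for `CentralAir`), so it is not a result from the Ames data. Stage one predicts price from a fixed feature using out-of-fold predictions; stage two regresses the residuals on unscaled renovate-able features, so each coefficient reads as dollars per unit change.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic data with known (assumed) dollar effects, for illustration only.
rng = np.random.default_rng(1)
n = 1000
demo = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, size=n),   # fixed feature
    "KitchenQual": rng.integers(1, 6, size=n),    # renovatable, ordinal 1-5
    "CentralAir": rng.integers(0, 2, size=n),     # renovatable, 0/1
})
demo["SalePrice"] = (50_000 + 90 * demo["GrLivArea"] + 8_000 * demo["KitchenQual"]
                     + 5_000 * demo["CentralAir"] + rng.normal(0, 10_000, size=n))

fixed_cols = ["GrLivArea"]
reno_cols = ["KitchenQual", "CentralAir"]

# Stage 1: out-of-fold price predictions from fixed features only.
stage1_pred = cross_val_predict(LinearRegression(), demo[fixed_cols], demo["SalePrice"], cv=5)
residuals = demo["SalePrice"] - stage1_pred

# Stage 2: regress the residuals on unscaled renovatable features.
# Without standardisation, each coefficient is in dollars per unit of the feature.
stage2 = LinearRegression().fit(demo[reno_cols], residuals)
dollar_effects = pd.Series(stage2.coef_, index=reno_cols)

# Share of the remaining variance explained, estimated by cross-validation.
resid_r2 = cross_val_score(LinearRegression(), demo[reno_cols], residuals, cv=5).mean()

print(dollar_effects.round(0))
print(f"Cross-validated R^2 on residuals: {resid_r2:.2f}")
```

If the second stage is fitted on standardised features instead, the coefficients are dollars per standard deviation and need converting back before they are quoted as renovation values.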

---

**Your goals:**
1. Evaluate the effect in dollars of the renovate-able features. 
2. How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money. 
3. Investigate how much of the variance in price remaining is explained by these features.
4. Do you trust your model? Should it be used to evaluate which properties to buy and fix up?
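
One way the two models could feed a buy/no-buy decision is a simple profit calculation: the first model gives a value from fixed features, the second gives a dollar uplift per renovation, and each renovation is only worth doing if its uplift exceeds its cost. The function name `expected_profit` and all dollar figures below are hypothetical, and the sketch ignores model uncertainty, which a real decision should account for.

```python
def expected_profit(fixed_value, purchase_price, renovations):
    """Return (profit, chosen renovation names) for a candidate property.

    fixed_value: price predicted by the fixed-feature model, in dollars.
    renovations: list of (name, dollar uplift, dollar cost) tuples, where the
    uplift would come from the second model's coefficients.
    Only renovations whose uplift exceeds their cost are carried out.
    """
    chosen = [(name, uplift, cost) for name, uplift, cost in renovations if uplift > cost]
    uplift_total = sum(uplift for _, uplift, _ in chosen)
    cost_total = sum(cost for _, _, cost in chosen)
    profit = fixed_value + uplift_total - purchase_price - cost_total
    return profit, [name for name, _, _ in chosen]


# Hypothetical property and renovation figures.
profit, chosen = expected_profit(
    fixed_value=180_000,
    purchase_price=150_000,
    renovations=[("KitchenQual: TA -> Gd", 8_000, 5_000),
                 ("CentralAir: N -> Y", 5_000, 7_000)],
)
print(profit, chosen)
```

A cautious buyer might also require the profit to exceed the first model's typical error before committing to a purchase.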

In [None]:
# Calculating the difference between the actual house price and the predicted house price.
# This gives a rough estimate of the effect of the renovatable features:
# the error in the model can be seen as variation unexplained by the fixed features.
residuals = target - predict
residuals

In [None]:
# running the same feature selector on the variable features with the target being the residuals
fs_2 = FeatureSelector(data= df_reno, labels= residuals)

In [None]:
fs_2.identify_missing(missing_threshold=0.6)

In [None]:
fs_2.identify_single_unique()

In [None]:
fs_2.identify_collinear(correlation_threshold=0.95)

In [None]:
fs_2.identify_zero_importance(task='regression',eval_metric='l2',early_stopping=False)

In [None]:
com_df = fs_2.remove(methods=['zero_importance'])

In [None]:
fs_2.identify_low_importance(cumulative_importance=0.99)

In [None]:
# Renovatable features that are most important in predicting the residual, i.e. the difference
# between the actual price and the fixed-feature model's prediction
fs_2.plot_feature_importances(threshold=0.99, plot_n=20)

In [None]:
(com_df)

In [None]:
com_df = fs_2.remove(methods=['low_importance'])

In [None]:
# com_df = com_df.drop(['MSZoning', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
#             'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'BsmtExposure','GarageType',
#             'ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir',
#             'KitchenQual','Functional','GarageFinish','GarageQual','GarageCond','PavedDrive','Electrical','SaleType','SaleCondition'],axis=1)

# Dropping the original categorical columns, leaving only their one-hot-encoded versions.

com_df = com_df.drop(['RoofMatl', 'Exterior1st', 'Exterior2nd','ExterQual','ExterCond',
                      'BsmtQual','BsmtCond','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','KitchenQual',
                      'Functional','GarageFinish','GarageQual','GarageCond','PavedDrive','Electrical'],axis=1)

In [None]:
com_df.info(verbose=True)

In [None]:
# fitting rcv to new model
pipeline_RCV.fit(com_df,residuals)

In [None]:
# The R^2 of this model on the residuals is poor.
# Why?
pipeline_RCV.score(com_df,residuals)

In [None]:
com_predict = pipeline_RCV.predict(com_df)

In [None]:
rcv = RidgeCV(alphas=np.logspace(-1,3,200))

In [None]:
# plotting predictions compared to actual residuals
plt.scatter(com_predict,residuals)

In [None]:
# training basic model to plot approx $ difference for each unit change in renovatable feature/s

linreg = LinearRegression()
linreg.fit(com_df,residuals)
linreg.score(com_df,residuals)


In [None]:
# plotting $ difference of features
reno_features = pd.DataFrame(linreg.coef_.reshape(-1,1),index=com_df.columns)

In [None]:
reno_features.plot.barh(legend=False,figsize=(5,30))

In [None]:
reno_features.sort_values(by=0,ascending=False).head(50)

In [None]:
# This model could be used to determine which renovatable features typically reduce sale
# price and which increase it. A company could identify houses that have these features, then
# fix the negative features and add or improve the positive ones.

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. What property characteristics predict an "abnormal" sale?

---

The `SaleCondition` feature indicates the circumstances of the house sale. From the data file, we can see that the possibilities are:

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)
       
One of the executives at your company has an "in" with higher-ups at the major regional bank. His friends at the bank have made him a proposal: if he can reliably indicate what features, if any, predict "abnormal" sales (foreclosures, short sales, etc.), then in return the bank will give him first dibs on the pre-auction purchase of those properties (at a dirt-cheap price).

He has tasked you with determining (and adequately validating) which features of a property predict this type of sale. 

---

**Your task:**
1. Determine which features predict the `Abnorml` category in the `SaleCondition` feature.
2. Justify your results.

This is a challenging task that tests your ability to perform classification analysis in the face of severe class imbalance. You may find that simply running a classifier on the full dataset to predict the category is useless: under severe class imbalance, classifiers often just predict the majority class.

It is up to you to determine how you will tackle this problem. I recommend doing some research to find out how others have dealt with the problem in the past. Make sure to justify your solution. Don't worry about it being "the best" solution, but be rigorous.

Be sure to indicate which features are predictive (if any) and whether they are positive or negative predictors of abnormal sales.
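
As one possible approach to the imbalance (an alternative to resampling, not a prescribed method), the sketch below uses class weighting in a logistic regression, stratified cross-validation, and ranking metrics that stay meaningful when one class is rare. The data is synthetic and the feature names `risk_up`, `risk_down`, and `noise` are hypothetical. The sign of each fitted coefficient indicates whether a feature is a positive or negative predictor of the rare class.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data with hypothetical feature names.
rng = np.random.default_rng(2)
n = 2000
X = pd.DataFrame({
    "risk_up": rng.normal(size=n),     # raises the odds of the rare class
    "risk_down": rng.normal(size=n),   # lowers the odds of the rare class
    "noise": rng.normal(size=n),       # unrelated to the outcome
})
logit = -3.5 + 1.5 * X["risk_up"] - 1.0 * X["risk_down"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
print(f"Share of positive class: {y.mean():.3f}")

# class_weight='balanced' reweights errors so the rare class is not ignored.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(class_weight="balanced", max_iter=1000))

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
ap = cross_val_score(clf, X, y, cv=cv, scoring="average_precision")
print(f"ROC AUC: {auc.mean():.3f}, average precision: {ap.mean():.3f}")

# Coefficient signs on standardised features: positive means the feature
# increases the predicted odds of the rare class.
clf.fit(X, y)
coefs = pd.Series(clf[-1].coef_[0], index=X.columns).sort_values()
print(coefs)
```

Accuracy is deliberately not reported here: with a rare positive class, always predicting the majority class already scores highly on accuracy while identifying nothing.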

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegressionCV
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

In [None]:
house.head()

In [None]:
house['SalePrice'] = target

In [None]:
house.head()

In [None]:
house.SaleCondition.value_counts(normalize=True)

In [None]:
house.SaleCondition = house.SaleCondition.apply(lambda x: 0 if x == 'Normal' else (1 if x == 'Abnorml' else 2))

In [None]:
house.head()

In [None]:
house.SaleCondition.unique()

In [None]:
house.shape

In [None]:
house = house[house['SaleCondition']<2]

In [None]:
house.shape

In [None]:
y = house.SaleCondition

In [None]:
y

In [None]:
house.drop(['SaleCondition'],axis=1,inplace=True)

In [None]:
house.head()

In [None]:
house.set_index(['Id'],inplace=True)

In [None]:
house.head()

In [None]:
house.fillna(0,inplace=True)

In [None]:
# stratify=y keeps the share of abnormal sales the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(house, y, test_size=0.10, stratify=y)

In [None]:
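# define oversampling strategy: duplicate minority rows until minority/majority = 0.3
over = RandomOverSampler(sampling_strategy=0.3)
# fit and apply the transform
house_t, y_t = over.fit_resample(X_train, y_train)
# define undersampling strategy: the ratio must be at least the 0.3 reached above,
# otherwise imblearn raises a ValueError; 0.5 is an assumed choice
under = RandomUnderSampler(sampling_strategy=0.5)
# apply undersampling to the oversampled data (not to the original X_train, y_train)
house_t, y_t = under.fit_resample(house_t, y_t)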

In [None]:
house_t.shape

In [None]:
y_t.shape

In [None]:
house_t = pd.DataFrame(house_t, columns=house.columns)

In [None]:
fs_3 = FeatureSelector(data=house_t,labels=y_t)

In [None]:
fs_3.identify_zero_importance(task='classification',eval_metric='auc')

In [None]:
fs_3.plot_feature_importances(threshold=0.99,plot_n=25)

In [None]:
house_t.head()

In [None]:
zero_importance_features = fs_3.ops['zero_importance']
zero_importance_features

In [None]:
house_best_features = fs_3.remove(methods=['zero_importance'])

In [None]:
house_best_features.info(verbose=True)

In [None]:
house_best_features.drop(['MSSubClass', 'MSZoning', 'LotArea', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2',
                          'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
                          'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
                          'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
                          'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
                          'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
                          'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageArea',
                          'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
                          'ScreenPorch', 'MiscVal', 'MoSold', 'YrSold', 'SaleType','SalePrice'],axis=1,inplace=True)

In [None]:
logreg = LogisticRegressionCV(max_iter=1000)
logreg.fit(house_best_features,y_t)



In [None]:
logreg.score(house_best_features,y_t)

In [None]:
score_ab_cv = cross_val_score(logreg,house_best_features,y_t,cv=10)

In [None]:
score_ab_cv

In [None]:
score_ab_cv.mean()

In [None]:
# Feature names noted for reference
['OverallCond', 'BsmtFinSF2', 'LowQualFinSF', 'BsmtHalfBath', 'FullBath',
 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageCars', 'EnclosedPorch',
 '3SsnPorch', 'ScreenPorch', 'MiscVal']

In [None]:
# Encode the test set with the same columns the model was trained on.
# Running a separate FeatureSelector on the test data would select a
# different set of columns from the ones logreg was fitted with.
X_test_best_features = (pd.get_dummies(X_test)
                        .reindex(columns=house_best_features.columns, fill_value=0))

In [None]:
logreg.score(X_test_best_features,y_test)