# Classification Challenge
I will use the Ames housing data to build a classification model that predicts whether the sale of a house in Ames, IA was abnormal.
- For each Id in the test set, I will:

    1. Classify the `Sale Condition` value as abnormal or not.
    2. Transform this feature so that 1=abnormal and 0=not abnormal.

---
## Reading in data

In [222]:
import pandas as pd
import numpy as np
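from sklearn.preprocessing import StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer as Imputer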
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [223]:
df = pd.read_csv('train.csv')

## EDA
- I will use what I learned from the Regression Challenge

### Dealing with NaN values
- Some NaN values are replaced with a string (`'NA'` or `'None'`) so that `get_dummies` treats a missing feature as its own category.
- Other NaN values are replaced with 0.0 to signify that the feature does not exist for that house.

In [224]:
df.fillna(value={'Mas Vnr Type': 'None',
                 'Bsmt Qual': 'NA',
                 'Bsmt Cond': 'NA',
                 'Bsmt Exposure': 'NA',
                 'BsmtFin Type 1': 'NA',
                 'BsmtFin Type 2': 'NA',
                 'Fireplace Qu': 'NA',
                 'Garage Type': 'NA',
                 'Garage Finish': 'NA',
                 'Garage Qual': 'NA',
                 'Garage Cond': 'NA',
                 'Paved Drive': 'NA',
                 'Mas Vnr Area': 0.0,
                 'BsmtFin SF 1': 0.0,
                 'BsmtFin SF 2': 0.0,
                 'Total Bsmt SF': 0.0,
                 'Bsmt Full Bath': 0.0,
                 'Bsmt Half Bath': 0.0,
                 'Garage Cars': 0.0, 
                 'Garage Area': 0.0
                }, axis=0, inplace=True)

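| | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | NaN | 13517 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | Normal | 130500 |
| 1 | 544 | 531379050 | 60 | RL | 43.0 | 11492 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | Normal | 220000 |
| 2 | 153 | 535304180 | 20 | RL | 68.0 | 7922 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Abnorml | 109000 |
| 3 | 318 | 916386060 | 60 | RL | 73.0 | 9802 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 174000 |
| 4 | 255 | 906425045 | 50 | RL | 82.0 | 14235 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | Normal | 138500 |
| 5 | 138 | 535126040 | 20 | RL | 137.0 | 16492 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal | 190000 |
| 6 | 2827 | 908186070 | 180 | RM | 35.0 | 3675 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | New | Partial | 140000 |
| 7 | 145 | 535154050 | 20 | RL | NaN | 12160 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 5 | 2010 | COD | Abnorml | 142000 |
| 8 | 1942 | 535353130 | 20 | RL | NaN | 15783 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | Shed | 400 | 6 | 2007 | WD | Normal | 112500 |
| 9 | 1956 | 535426130 | 60 | RL | 70.0 | 11606 | Pave | NaN | IR1 | HLS | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2007 | WD | Family | 135000 |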
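
As a small illustration of the per-column fill used above, here is a sketch on a made-up two-column frame (the values are invented for the example and are not taken from `train.csv`). It shows that a dict passed to `fillna` applies a different fill value to each column, and how one might check that nothing is left missing afterwards.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one categorical and one numeric column.
toy = pd.DataFrame({
    'Garage Type': ['Attchd', np.nan, 'Detchd'],
    'Garage Area': [475.0, np.nan, 246.0],
})

# A dict passed to fillna maps each column name to its own fill value.
filled = toy.fillna(value={'Garage Type': 'NA', 'Garage Area': 0.0})

# Count the missing values that remain in each column.
remaining = filled.isnull().sum()
print(filled)
print(remaining)
```

Running the same `isnull().sum()` check on `df` would be one way to confirm which columns still contain NaN before modeling.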


---
Lot Frontage (linear feet of street connected to property): missing values are replaced with the column median using an imputer.


In [225]:
imp = Imputer(strategy='median')

In [226]:
# fit_transform returns a 2-D array; ravel() flattens it back to a single column
df['Lot Frontage'] = imp.fit_transform(df[['Lot Frontage']]).ravel()
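
To make the median imputation concrete, here is a minimal sketch on made-up frontage values. It assumes a scikit-learn version (0.20 or later) that provides `SimpleImputer`, which stands in for the older `Imputer` class; the toy numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frontage values with two gaps.
toy = pd.DataFrame({'Lot Frontage': [43.0, np.nan, 68.0, 73.0, np.nan]})

# The median is computed from the observed values only (43, 68, 73).
imp_toy = SimpleImputer(strategy='median')
toy['Lot Frontage'] = imp_toy.fit_transform(toy[['Lot Frontage']]).ravel()

print(imp_toy.statistics_)          # the learned per-column median
print(toy['Lot Frontage'].tolist())
```

The fitted imputer keeps the learned median in `statistics_`, so the same value can later be applied to other data with `transform`.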

---
Dropping Unnecessary Columns

In [227]:
df.drop(['Alley',
         'Id',
         'PID',
         'Bsmt Unf SF',
         'Garage Yr Blt',
         'Pool QC',
         'Fence', 
         'Misc Feature'
        ], axis=1, inplace=True)

---
## Get Dummies
The features in X must be numeric (int or float) because KNN calculates distances between observations.

In [228]:
df = pd.get_dummies(df, prefix=['MasVnrType', 
                                           'BsmtQual', 
                                           'BsmtCond', 
                                           'BsmtExposure',
                                           'BsmtFinType1',
                                           'BsmtFinType2',
                                           'FireplaceQu',
                                           'GarageType',
                                           'GarageFinish',
                                           'GarageQual',
                                           'GarageCond',
                                           'PavedDrive'
                                          ], 
                               columns=['Mas Vnr Type', 
                                        'Bsmt Qual', 
                                        'Bsmt Cond', 
                                        'Bsmt Exposure',
                                        'BsmtFin Type 1',
                                        'BsmtFin Type 2',
                                        'Fireplace Qu',
                                        'Garage Type',
                                        'Garage Finish',
                                        'Garage Qual',
                                        'Garage Cond',
                                        'Paved Drive'
                                       ])
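
For readers less familiar with `get_dummies`, the following sketch shows what the `prefix` and `columns` arguments do on a made-up frame. The column values are invented for illustration.

```python
import pandas as pd

# Toy frame: one numeric column and one categorical column (made-up values).
toy = pd.DataFrame({
    'Overall Qual': [6, 7, 5],
    'Garage Type': ['Attchd', 'NA', 'Detchd'],
})

# Each category in 'Garage Type' becomes its own indicator column,
# named '<prefix>_<category>'; the original column is removed.
dummies = pd.get_dummies(toy, prefix=['GarageType'], columns=['Garage Type'])

print(list(dummies.columns))
```

Numeric columns that are not listed in `columns` pass through unchanged, which is why `Overall Qual` keeps its original name.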

---
Use get_dummies again to convert the remaining object/string columns to numeric dummy columns

In [229]:
list(df.loc[:, df.dtypes == object])
df = pd.get_dummies(df, prefix=['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'SaleType',
],
                   columns=['MS Zoning',
 'Street',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 'Kitchen Qual',
 'Functional',
 'Sale Type'
])

---
## Feature Selection with SelectKBest

---
Define y and map it so that abnormal = 1 and all other values = 0

In [230]:
y = df['Sale Condition']

y.value_counts() 

Normal     1696
Partial     164
Abnorml     132
Family       29
Alloca       19
AdjLand      11
Name: Sale Condition, dtype: int64

In [231]:
y = y.map(lambda x: 1 if x == 'Abnorml' else 0)
y.value_counts()

0    1919
1     132
Name: Sale Condition, dtype: int64

---
Define the X

In [232]:
X = df.drop(['Sale Condition', 'SalePrice'], axis=1)

In [233]:
skb = SelectKBest(k=13)
skb.fit(X, y)
X = X.loc[:,skb.get_support()]
X.columns

Index(['Overall Qual', 'Year Built', 'Year Remod/Add', 'Garage Cars',
       'GarageType_NA', 'GarageCond_TA', 'MSZoning_C (all)',
       'Neighborhood_IDOTRR', 'Foundation_PConc', 'CentralAir_N',
       'CentralAir_Y', 'Functional_Sev', 'SaleType_COD'],
      dtype='object')
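
The selection step can be illustrated on synthetic data. This is a sketch under stated assumptions: the data are randomly generated, and `f_classif` (the ANOVA F-test, which is the default `score_func` of `SelectKBest`) is passed explicitly for clarity. One column is built to depend on the label and two are pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)

# Synthetic binary target: 50 zeros followed by 50 ones.
y_toy = np.array([0] * 50 + [1] * 50)

# 'informative' shifts with the label; the other two columns are noise.
X_toy = pd.DataFrame({
    'informative': y_toy * 3.0 + rng.normal(size=100),
    'noise_a': rng.normal(size=100),
    'noise_b': rng.normal(size=100),
})

# Keep the single column with the highest F-score.
skb_toy = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)
selected = list(X_toy.columns[skb_toy.get_support()])

print(dict(zip(X_toy.columns, skb_toy.scores_.round(2))))
print(selected)
```

In this notebook the selector is fit on all rows before the train/test split. Fitting it on the training split only would be one way to keep held-out rows from influencing which features are chosen.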

---
## Preprocessing

---
train test split

In [234]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3, stratify=y)
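
Because abnormal sales are rare, `stratify=y` matters here. The sketch below uses a made-up target with a 90/10 class balance to show that a stratified split keeps the same proportion of positives in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 90 negatives, 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print('train positives:', y_tr.sum(), 'of', len(y_tr))
print('test positives: ', y_te.sum(), 'of', len(y_te))
```

Without stratification, a random split of such data could leave the test set with very few positives, or none, which would make the test score hard to interpret.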

---
StandardScaler

In [235]:
sc = StandardScaler()
Xsc_train = sc.fit_transform(X_train)
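
The scaler is fit on the training rows only and then reused for any later data. A minimal sketch with made-up numbers shows what that means: the test values are standardized with the training mean and standard deviation, not their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[2.0], [4.0]])

# Learn the mean and standard deviation from the training rows only.
sc_toy = StandardScaler().fit(train_toy)

train_sc = sc_toy.transform(train_toy)
test_sc = sc_toy.transform(test_toy)   # uses the training mean (2.0) and std

print(sc_toy.mean_, sc_toy.scale_)
print(test_sc.ravel())
```

A test value equal to the training mean maps to 0, regardless of what the other test values are.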

---
## Modeling

### KNN
    DATA NEEDS X: int or float, scaled (b/c KNN calculates distances b/w neighbors)
    DATA NEEDS y: int, float, str (anything)
    
    KNN - can change k (number of neighbors), often tuned b/w 2-10 neighbors; non-parametric (no assumed functional form), but distance-based, so it tends to degrade with high dimensionality (lots of features) and sparse matrices
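
To show why scaling matters for a distance-based model, here is a sketch with made-up rows of `(Year Built, Garage Cars)`. On the raw values the year column dominates the Euclidean distance; after standardizing, the nearest neighbor of the first row changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: (Year Built, Garage Cars). Row 0 is the query house.
X_toy = np.array([
    [2000.0, 0.0],   # query
    [2003.0, 3.0],   # close in year, very different garage
    [1990.0, 0.0],   # 10 years older, same garage
    [1920.0, 1.0],
    [1960.0, 2.0],
])

def nearest_other(X):
    """Index of the row closest to row 0 (excluding row 0 itself)."""
    dist = np.linalg.norm(X - X[0], axis=1)
    return int(np.argsort(dist)[1])

raw_nn = nearest_other(X_toy)
scaled_nn = nearest_other(StandardScaler().fit_transform(X_toy))

print('nearest on raw values:   ', raw_nn)
print('nearest on scaled values:', scaled_nn)
```

Which answer is preferable depends on the problem, but scaling at least puts every feature on a comparable footing before distances are computed.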

In [236]:
knn = KNeighborsClassifier()

knn_params = {
    'n_neighbors':np.arange(2,10,1)
}

knn_model = GridSearchCV(knn, param_grid=knn_params)
knn_model.fit(Xsc_train, y_train)
print('Best Score:', knn_model.best_score_)
print('Best Params:', knn_model.best_params_)
print('Test Score:', knn_model.score(sc.transform(X_test), y_test))

Best Score: 0.935888501742
Best Params: {'n_neighbors': 6}
Test Score: 0.935064935065
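
Since only 132 of the 2051 sales are abnormal, it helps to compare these accuracy scores with a majority-class baseline. The sketch below uses the class counts reported earlier and a hypothetical "model" that always predicts 0; it is an illustration of the baseline, not a description of what the fitted KNN actually predicts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from y.value_counts() above.
n_not_abnormal, n_abnormal = 1919, 132

y_true = np.array([0] * n_not_abnormal + [1] * n_abnormal)

# Hypothetical baseline: always predict "not abnormal".
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline)   # share of abnormal sales found

print('baseline accuracy:', round(baseline_accuracy, 4))
print('baseline recall:  ', baseline_recall)
```

The baseline accuracy is about 0.9356, which is essentially the level of the Best Score and Test Score reported above. That suggests accuracy alone may not reveal whether abnormal sales are being detected; recall or a confusion matrix on `y_test` would be a more informative check.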


---
# Test.csv
There are two fewer columns in test than in train: SalePrice and Sale Condition. These are the targets predicted by the regression and classification models, respectively.

In [237]:
test = pd.read_csv('test.csv')

In [238]:
# Without inplace=True, fillna returns the filled DataFrame (with inplace=True it returns None)
test_X = test.fillna(value={'Garage Type': 'NA',
                            'Garage Finish': 'NA',
                            'Garage Qual': 'NA',
                            'Garage Cond': 'NA',
                            'Garage Cars': 0.0
                           })

In [239]:
# Each column is listed once; the duplicated 'MS Zoning' entry has been removed
test_X = pd.get_dummies(test_X, prefix=['MSZoning',
                                        'Exterior1st',
                                        'Foundation',
                                        'HeatingQC',
                                        'CentralAir',
                                        'Functional',
                                        'SaleType',
                                        'Neighborhood',
                                        'GarageType',
                                        'GarageFinish',
                                        'GarageQual',
                                        'GarageCond'
                                       ],
                        columns=['MS Zoning',
                                 'Exterior 1st',
                                 'Foundation',
                                 'Heating QC',
                                 'Central Air',
                                 'Functional',
                                 'Sale Type',
                                 'Neighborhood',
                                 'Garage Type',
                                 'Garage Finish',
                                 'Garage Qual',
                                 'Garage Cond'
                                ])

In [240]:
# Keeping only the features used in my model, in the same column order as in training.
# A dummy column that does not occur in test.csv is added and filled with 0.

model_features = ['Overall Qual', 'Year Built', 'Year Remod/Add', 'Garage Cars',
       'GarageType_NA', 'GarageCond_TA', 'MSZoning_C (all)',
       'Neighborhood_IDOTRR', 'Foundation_PConc', 'CentralAir_N',
       'CentralAir_Y', 'Functional_Sev', 'SaleType_COD']

test_X = test_X.reindex(columns=model_features, fill_value=0)
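
The scaler and the KNN model expect the test matrix to have the same columns, in the same order, as the training matrix. One way to guarantee that is `DataFrame.reindex`, sketched here on a made-up frame with a shortened, hypothetical feature list (`toy_features` is not the real `model_features`).

```python
import pandas as pd

# Shortened, hypothetical feature list in training order.
toy_features = ['Overall Qual', 'CentralAir_N', 'CentralAir_Y', 'Functional_Sev']

# Made-up test frame: different column order, one extra column,
# and two dummy columns that never occur in this data.
toy_test = pd.DataFrame({
    'CentralAir_Y': [1, 1],
    'Lot Area': [8000, 9500],
    'Overall Qual': [5, 7],
})

# reindex selects and orders the columns; missing ones are created and filled with 0.
aligned = toy_test.reindex(columns=toy_features, fill_value=0)

print(aligned)
```

Dummy columns can be absent from a test set whenever a category simply does not appear in it, so filling them with 0 matches what `get_dummies` would have produced for those rows.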

---
## Predicting Sale Condition

---
Scaling test.csv

In [241]:
test_X = sc.transform(test_X)

---
Getting predictions

In [244]:
test_pred_knn = knn_model.predict(test_X)
# One row per test Id, with the predicted class in a 'Sale Condition' column
pred_knn = pd.DataFrame({'Sale Condition': test_pred_knn}, index=test['Id'])

In [245]:
#pred_knn.to_csv('output_submission_knn.csv')