# Lab 2 Classification <br>
Group Members: Thomas Pengilly, Quynh Chao, Anish Patel, Michael Weatherford <br>
Date: 10/11/2020

# Summary:
The dataset used in this analysis is the South American Real Estate Listings dataset, which contains a number of property features and listing prices. This dataset will be used to perform both a price estimation regression and a <font color = 'red'> price classification </font>.  

## Data Preparation
<font color = 'red'>Define and prepare your class variables, use proper variable representations (float, int, one-hot). Use pre-processing methods for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. <br>
Describe the final dataset that is used and include a description of any newly formed variables created.</font><br>

The classification task is to predict whether a property's (log) price is low, average, or high for its property type. Price classes are defined by log price tertiles (3-quantiles) computed separately for each property type, so each property type has its own class thresholds and the three classes are balanced within each property type. Labeling prices this way lets users judge whether a property is over- or underpriced relative to comparable properties on the market, and act on that information.<br>

<font color = 'red'> Dimensionality reduction was explored to reduce a large number of dummy variables to a more manageable size, allowing faster computation times, and hopefully improving model performance by removing redundant and correlated variables. </font> The dataset was scaled and normalized to allow feature importance to be accurately measured by model weights. Several variables were also removed. <br>

## Modeling and Evaluation
<font color = 'red'>Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.<br>

INSERT EXPLANATION<br>

Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.<br>

INSERT EXPLANATION<br>

Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.<br>

INSERT EXPLANATION<br>

Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.<br>

INSERT EXPLANATION<br>

Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.<br>

INSERT EXPLANATION<br>

Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.<br>

INSERT EXPLANATION<br>

## Deployment
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?<br>

INSERT EXPLANATION<br>

## Exceptional Work
You have free rein to provide additional modeling.
One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?<br>

INSERT EXPLANATION<br>

</font><br>

# Data Preparation
The classification variable is a stratified log price of low, average, or high, with threshold prices being defined on a property-type basis. The data will be scaled so that feature importance may be determined using class weights, and dimensionality reduction will be explored (PCA?). In addition to this, several variables that are not thought to be important will be removed from the dataset. <font color = 'red'> Which variables were removed? </font>
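
The code that builds `price_class` is not shown in this notebook, so the following is only a minimal sketch of one way to label log price tertiles within each property type. The column names `property_type`, `log_price`, and `price_class` follow the notebook; the toy data and the 0/1/2 to low/average/high mapping are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings data (made-up values, two property types)
rng = np.random.default_rng(6)
toy = pd.DataFrame({
    'property_type': np.repeat(['Casa', 'Departamento'], 30),
    'log_price': np.concatenate([rng.normal(12, 1, 30), rng.normal(11, 1, 30)]),
})

# Cut log_price into tertiles separately within each property type.
# labels=False returns the bin index (0, 1, 2) for each row.
tertile = (
    toy.groupby('property_type')['log_price']
       .transform(lambda s: pd.qcut(s, q=3, labels=False))
)

# Map the bin index to a readable class label
toy['price_class'] = tertile.map({0: 'low', 1: 'average', 2: 'high'})

# Each property type should contain the three classes in equal numbers
print(toy.groupby(['property_type', 'price_class']).size())
```

Because the quantiles are computed per group, a property labeled "high" is expensive relative to other listings of the same type, not relative to the whole market.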

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from pandas import set_option
set_option('display.max_columns',400)
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
import scipy

In [2]:
# Read in the imputed dataset

# Tom
df = pd.read_csv('C:\\Users\\Tpeng\\OneDrive\\Documents\\SMU\\Term 3\\Machine Learning\\Lab1\\Imputed_Dataset.csv', sep = ',', header = 0)

# Quynh
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Anish
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Michael
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Drop index column
df = df.drop(columns = 'Unnamed: 0')

In [3]:
# Reformat attributes, excluding categoricals, which aren't supported for the dummy variable generation method used.
ordinal_vars = ['rooms', 'bedrooms', 'bathrooms' ]
continuous_vars = ['lat', 'lon', 'surface_total', 'surface_covered', 'price', 'log_price']
string_vars = ['id', 'title', 'description']
time_vars = ['start_date', 'end_date', 'created_on']

# Change data types
df[ordinal_vars] = df[ordinal_vars].astype('uint8')
df[continuous_vars] = df[continuous_vars].astype(np.float64)
df[string_vars] = df[string_vars].astype(str)

# Remove observations missing l3 and price before encoding 
df2 = df.dropna(axis = 0, subset = ['price', 'l3'])

Create a transformed dataset with numeric variables square-root transformed to better meet model assumptions of feature distributions. This reduces the number and magnitude of outliers. In addition to this, both datasets will have the property_type, country, province, and department dummified, and all other attributes will be scaled. Both the transformed and non-transformed datasets will be used to create competing models.

In [4]:
# Create datasets with transformed variables for model selection methods
# Transform rooms, bedrooms, bathrooms, surface_total, and surface_covered using square root
df_transform = df2.copy()
df_transform['sqrt_surface_total'] = df_transform.surface_total.transform(func = 'sqrt')
df_transform['sqrt_surface_covered'] = df_transform.surface_covered.transform(func = 'sqrt')
df_transform['sqrt_bedrooms'] = df_transform.bedrooms.transform(func = 'sqrt')
df_transform['sqrt_bathrooms'] = df_transform.bathrooms.transform(func = 'sqrt')
df_transform['sqrt_rooms'] = df_transform.rooms.transform(func = 'sqrt')

df_transform = df_transform.drop(columns = ['surface_total', 'surface_covered', 'bedrooms', 'bathrooms', 'rooms'])

In [5]:
# Get dummy variables for non-transformed dataset
data = pd.get_dummies(df2, columns = ['l1', 'l2', 'l3', 'property_type'], 
                      prefix = {'l1':'Country', 'l2':'Province', 'l3': 'Department', 'property_type': 'Property_type'}, 
                      sparse = True, drop_first = False)

# Drop reference levels for each dummified feature and unimportant or currently unusable features. 
data = data.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Department_Posadas', 'Property_type_Casa'])
data = data.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])

# Get dummy variables for transformed dataset
trans = pd.get_dummies(df_transform, columns = ['l1', 'l2', 'l3', 'property_type'], 
                          prefix = {'l1':'Country', 'l2':'Province', 'l3': 'Department', 'property_type': 'Property_type'}, 
                          sparse = True, drop_first = False)

# Drop reference levels for each dummified feature (same references as non-transformed data) and unimportant or currently unusable features. 
trans = trans.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Department_Posadas', 'Property_type_Casa'])
trans = trans.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])

## Standardization
Using 3-fold stratified cross validation, each iteration trains on roughly two thirds of the data and tests on the remaining third. The split is made before standardization to prevent test set data from influencing the training data scale: the training data will be standardized, and the same scaling will then be applied to the test data set. Stratifying the folds on price class keeps the class proportions the same in every training and test split.<br>
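
One way to guarantee that scaling and PCA are fit on the training folds only is to wrap the steps in a scikit-learn `Pipeline`. The sketch below is not the notebook's code: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, `pipe`, and `cv_demo`, as well as `shuffle=True` and `n_neighbors=10`, are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the listings features (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=8,
                                     n_classes=3, random_state=6)

# Every step is refit on the training folds only, so the test fold never
# influences the scaling or the principal components
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.99, svd_solver='full')),
    ('knn', KNeighborsClassifier(n_neighbors=10)),
])

# Stratify on the class label; shuffling is an assumption, not the notebook's setting
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=6)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring='accuracy')
print('fold accuracies:', np.round(scores, 3))
```

The same `pipe` object could also be passed to `GridSearchCV`, in which case the scaler and PCA would be refit inside each inner fold as well.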

<font color = 'red'>The final dataset used in this analysis will be the principal components produced by the principal component analysis (this might change). These principal components are derived from our transformed data, which includes log-transformed and standardized price, square-root transformed and standardized rooms, bedrooms, bathrooms, surface_total, surface_covered, and 1000+ country, province, department, and property type dummy variables. A discussion of these components follows the analysis.</font>

## KNN using Transformed data
The following stratified cross-validated model fits a KNN classifier to the principal components of the standardized, transformed dataset.

In [7]:
# Standardize the transformed data before applying PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics as mt
from sklearn.decomposition import PCA
#from sklearn.pipeline import Pipeline

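# Create cross validation, standard scaler, and PCA objects
# random_state only has an effect when shuffle = True (recent scikit-learn versions raise an error otherwise), so it is omitted
cv_obj = StratifiedKFold(n_splits = 3)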
clf = KNeighborsClassifier(n_jobs = -1)
ss = StandardScaler()

# Require the number of components used to have 99% variance explained
pca = PCA(n_components = .99, svd_solver = 'full', random_state = 6)

X = trans.drop(columns = ['price_class']).values
y = trans.price_class.values

In [8]:
# Split dataset and fit PCA to data
iter_num=0

# Iterate over the split data
for train_indices, test_indices in cv_obj.split(X,y): 

    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # Fit the scaler on the training data only, then apply the same scaling to the test data
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    
    # Run the PCA algorithm on the data
    %time Xtrain_pca = pca.fit(X_train).transform(X_train)
    Xtest_pca = pca.transform(X_test)
    
    # train the KNN model on the training data
    %time clf.fit(Xtrain_pca,y_train)
    y_hat = clf.predict(Xtest_pca)

    # Print the accuracy, precision, recall, fscore, and confusion matrix for each iteration
    acc = mt.accuracy_score(y_test,y_hat)
    metrics = mt.precision_recall_fscore_support(y_test, y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
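    # precision_recall_fscore_support returns (precision, recall, fscore, support), in that order
    print("Precision, Recall, Fscore, Support\n", metrics)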
    iter_num+=1

Wall time: 1min 1s
Wall time: 35.3 s
====Iteration 0  ====
accuracy 0.5895031718942294
confusion matrix
 [[28409 12493  8278]
 [15199 33471  2664]
 [17204  4535 24820]]
Precision, Recall, Fscore, Support
 (array([0.46716109, 0.6628052 , 0.69403277]), array([0.57765352, 0.652024  , 0.53308705]), array([0.51656484, 0.6573704 , 0.60300531]), array([49180, 51334, 46559], dtype=int64))
Wall time: 1min 1s
Wall time: 36.2 s
====Iteration 1  ====
accuracy 0.5927232919930374
confusion matrix
 [[26853 12579  9748]
 [14258 33882  3193]
 [15270  4851 26438]]
Precision, Recall, Fscore, Support
 (array([0.47627747, 0.66031338, 0.67137307]), array([0.54601464, 0.66004325, 0.56783866]), array([0.50876744, 0.66017828, 0.61528078]), array([49180, 51333, 46559], dtype=int64))
Wall time: 1min 1s
Wall time: 31.4 s
====Iteration 2  ====
accuracy 0.3939117841042762
confusion matrix
 [[ 7571  3325 38283]
 [ 7267  7598 36468]
 [ 2667  1128 42764]]
Precision, Recall, Fscore, Support
 (array([0.432505  , 0.6304871 , 0.36390248]), array([0

In [8]:
# Now repeat the above process using a grid search algorithm
from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(clf, param_grid = {'n_neighbors': [10, 20, 30]}, n_jobs = -1, verbose = 2, cv = 3)

# Split dataset and fit PCA to data
iter_num=0

# Iterate over the split data
for train_indices, test_indices in cv_obj.split(X,y): 

    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # Fit the scaler on the training data only, then apply the same scaling to the test data
    %time X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    
    # Run the PCA algorithm on the data
    %time Xtrain_pca = pca.fit(X_train).transform(X_train)
    Xtest_pca = pca.transform(X_test)
    
    # train the KNN model on the training data
    %time gs_clf.fit(Xtrain_pca,y_train)
    %time y_hat = gs_clf.predict(Xtest_pca)

    # Print the accuracy, precision, recall, fscore, and confusion matrix for each iteration
    acc = mt.accuracy_score(y_test,y_hat)
    metrics = mt.precision_recall_fscore_support(y_test, y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
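    # precision_recall_fscore_support returns (precision, recall, fscore, support), in that order
    print("Precision, Recall, Fscore, Support\n", metrics)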
    print("best estimator", gs_clf.best_params_)
    print("score", gs_clf.best_score_)
    iter_num+=1

Wall time: 11.6 s
Wall time: 1min 2s
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 25.6min remaining: 12.8min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 34.9min finished


Wall time: 35min 29s
Wall time: 1min 39s
====Iteration 0  ====
accuracy 0.6194134885397048
confusion matrix
 [[29232 11005  8943]
 [14228 34304  2802]
 [15565  3431 27563]]
Precision, Recall, Fscore, Support
 (array([0.49524778, 0.70381617, 0.70120586]), array([0.59438796, 0.66825106, 0.59200155]), array([0.54030775, 0.68557268, 0.64199285]), array([49180, 51334, 46559], dtype=int64))
best estimator {'n_neighbors': 30}
score 0.5326796830113244
Wall time: 10.4 s
Wall time: 1min 2s
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 23.8min remaining: 11.9min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 35.1min finished


Wall time: 35min 44s
Wall time: 1min 32s
====Iteration 1  ====
accuracy 0.6192069190600522
confusion matrix
 [[27487 11358 10335]
 [13242 34558  3533]
 [13565  3971 29023]]
Precision, Recall, Fscore, Support
 (array([0.5062622 , 0.69272556, 0.67666877]), array([0.55890606, 0.67321216, 0.62335961]), array([0.53128322, 0.68282948, 0.64892119]), array([49180, 51333, 46559], dtype=int64))
best estimator {'n_neighbors': 30}
score 0.5545277143167973
Wall time: 11 s
Wall time: 1min 3s
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 14.0min remaining:  7.0min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 21.8min finished


Wall time: 22min 21s
Wall time: 5min 27s
====Iteration 2  ====
accuracy 0.38814586152266595
confusion matrix
 [[ 5079  2996 41104]
 [ 3509  7548 40276]
 [  722  1379 44458]]
Precision, Recall, Fscore, Support
 (array([0.54554243, 0.63306215, 0.35329551]), array([0.10327579, 0.14703992, 0.95487446]), array([0.17367368, 0.2386493 , 0.51576304]), array([49179, 51333, 46559], dtype=int64))
best estimator {'n_neighbors': 30}
score 0.6305359601557056


## Non-transformed data
The same stratified K-fold cross validation and standardization technique is employed for our non-transformed dataset to determine which set of variables is best for training our models.

## Principal Component Analysis
Features in both datasets will be normalized and Principal Component Analysis will be conducted to reduce the dimensionality of our dataset. This will allow us to extract latent information which is thought to be contained within country, province, and department features, while significantly reducing our dataset size and model computation times.

In [10]:
# PCA for transformed dataset
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = trans.drop(columns = ['price_class']).values
y = trans.price_class

pca = PCA(n_components = 10, random_state = 6)
X_pca = pca.fit(X).transform(X)

print('pca: ', pca.components_)

pca:  [[ 1.40750225e-02  7.89707679e-01  6.13213081e-01 ...  1.58628764e-03
  -3.62081818e-04  2.62335159e-05]
 [-2.60939025e-05 -6.13279661e-01  7.89855722e-01 ... -9.05973910e-04
   5.31027070e-05  1.61882733e-05]
 [-9.09305750e-01  1.35624470e-02  9.33048627e-03 ... -6.32004904e-03
  -4.70493501e-04 -2.48591313e-04]
 ...
 [-1.24765155e-01  1.24755862e-03  1.75669786e-03 ... -2.89307372e-02
  -9.71554063e-02 -3.60307814e-03]
 [-9.87764103e-02  8.47169934e-04  2.05762906e-04 ... -1.43167698e-02
  -2.26922524e-02 -2.01196176e-05]
 [-1.42454452e-01  2.26816924e-03  8.55972494e-04 ... -2.93828175e-02
   2.11950809e-02 -1.45528620e-03]]


In [15]:
print('pca variance explained: ', pca.explained_variance_ratio_)
print('Cumulative ', sum(pca.explained_variance_ratio_))
print('first 3 ', sum(pca.explained_variance_ratio_[0:3]))

pca variance explained:  [7.25915920e-01 2.68862366e-01 1.16526088e-03 5.90946039e-04
 3.93789852e-04 2.69261263e-04 2.33105627e-04 2.02883871e-04
 1.62168838e-04 1.45526174e-04]
Cumulative  0.9979412291040144
first 3  0.9959435474386268


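The PCA on the transformed data with 10 components explains 99.8% of the variance in the data. This greatly reduces the size of our dataset (from more than 1,000 features to 10) while retaining nearly all of its variance. Further exploration shows that the first 3 principal components alone explain 99.6% of the variance. Because this cell fits the PCA to `X` without the standardization step used in the cross-validation loop, the leading components are likely dominated by the features with the largest raw variance.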
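
The cumulative figures can be reproduced from the printed ratios, and the same running sum gives the number of components needed to reach a given threshold. This is a small sketch using the values printed above; the helper `components_needed` is a hypothetical name introduced here, not part of scikit-learn.

```python
import numpy as np

# Explained-variance ratios as printed by the PCA cell above
ratios = np.array([7.25915920e-01, 2.68862366e-01, 1.16526088e-03, 5.90946039e-04,
                   3.93789852e-04, 2.69261263e-04, 2.33105627e-04, 2.02883871e-04,
                   1.62168838e-04, 1.45526174e-04])

# Running total of variance explained by the first k components
cumulative = np.cumsum(ratios)

def components_needed(cum, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    if threshold > cum[-1]:
        return None  # not reachable with the components that were kept
    return int(np.searchsorted(cum, threshold)) + 1

print(np.round(cumulative, 4))
print(components_needed(cumulative, 0.99))
```

With these ratios the first two components already exceed 99% of the variance, which is consistent with the first two ratios summing to about 0.995.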