Using the `Boston` data set, fit classifcation models in order to predict
whether a given suburb has a crime rate above or below the median.
Explore logistic regression, LDA, naive Bayes, and KNN models using
various subsets of the predictors. Describe your fndings.
<br>
*Hint: You will have to create the response variable yourself, using the
variables that are contained in the `Boston` data set.*

In [0]:
# import statistical packages
import numpy as np
import pandas as pd

In [0]:
# import data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
# load and preprocess data
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Boston.csv"
Boston = spark.read.option("header", "true").csv(url).toPandas()
Boston.set_index('SlNo', inplace=True)

int_cols = ['chas', 'rad', 'tax']
float_cols = list(set(Boston.columns) - set(int_cols))
Boston[int_cols] = Boston[int_cols].astype(int)
Boston[float_cols] = Boston[float_cols].astype(float)

In [0]:
Boston.head()

**Calculating median crime rate**

In [0]:
crim_median = Boston['crim'].median()

In [0]:
crim_median

**Adding classification data crim1**

In [0]:
crim1 = pd.DataFrame(columns=['crim1'])

In [0]:
Boston = pd.concat([crim1, Boston], axis = 1)

In [0]:
Boston.head()

In [0]:
index = Boston.index

In [0]:
for i in index:
    if Boston.loc[i]['crim'] > crim_median:
        Boston.at[i, 'crim1'] = 1
    else:
        Boston.at[i, 'crim1'] = 0

In [0]:
type(Boston['crim1'])

In [0]:
Boston

In [0]:
Boston.crim1.dtype

*As we can see the data type of Boston.crim1 is not in any recognisable format which will cause problems later on. So, we will first have to convert the data type of Boston into a dummy variable.*

In [0]:
Boston = pd.get_dummies(Boston, columns=['crim1'], drop_first=True)

In [0]:
Boston.head(25)

In [0]:
Boston.crim1_1.dtype

*We see there is a new column Boston.crim1_1 with integral digits. We will use this column for modelling.*

**Important question: What is the true nature of distribution of each independent variable?**

*This is important because an underlying assumption of LDA and QDA is that the marginal distribution of each variable is normal. Non-normality can significantly reduce the predictive performance of LDA and QDA.*

In [0]:
import warnings
warnings.filterwarnings('ignore')
for i in Boston.columns:
    plt.xkcd()
    plt.figure(figsize = (25, 10))
    sns.distplot(Boston[i])

*As we can see, amongst non-categorical data, only rm (and somewhat medv) have a normal distribution. The question then is - how strongly are these and other non-normally distributed predictors correlated with crime rates? We can use the correlation to determine if we could potentially keep these predictors. First, let me perform predictions using ALL predictors.*

**Dividing the dataframe into training and test data**

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X = Boston.drop(columns='crim1_1')
y = Boston['crim1_1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

**Logistic Regression**

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [0]:
glmpred = glmfit.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix, classification_report

In [0]:
print(confusion_matrix(y_test, glmpred))

In [0]:
print(classification_report(y_test, glmpred))

*93% overall (unweighted) precision is great! This means we correctly predicted the crime rates in 93% of cases. Let's explore where the model is inaccurate. In ~7.23% of the cases, it wrongly classifies neighbourhoods with their crime rates. Delving deeper using the classification report, we see that the issue stems from (relatively) low precision in those neighbourhoods where the crim was lower than median. This might be an issue from a practical standpoint since classifying some less-crime affected areas would mean the government would deploy disproportionately large police force at the expense of other areas with higher than median crime rates.*

In [0]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

In [0]:
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

In [0]:
ldapred = lda.predict(X_test)

In [0]:
print(confusion_matrix(y_test, ldapred))

In [0]:
print(classification_report(y_test, ldapred))

*As we can see, using LDA reduces the precision of the model.*

In [0]:
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

In [0]:
qdapred = qda.predict(X_test)

In [0]:
print(confusion_matrix(y_test, qdapred))

In [0]:
print(classification_report(y_test, qdapred))

*QDA improves upon the results of logistic regression and could be considered as the contender for Boston's top model.*

**K-Nearest Neighbours**

*We will need to standardise the predictors since they all measure differently.*

In [0]:
# we will need to import data again
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Boston.csv"
Boston = spark.read.option("header", "true").csv(url).toPandas().astype(float)

crim_median = Boston['crim'].median()
crim1 = pd.DataFrame(columns=['crim1'])
Boston = pd.concat([crim1, Boston], axis = 1)
Boston.drop(columns='SlNo', inplace=True)
index = Boston.index
for i in index:
    if Boston.iloc[i]['crim'] > crim_median:
        Boston.at[i, 'crim1'] = 1
    else:
        Boston.at[i, 'crim1'] = 0
Boston = pd.get_dummies(Boston, columns=['crim1'], drop_first=True)

In [0]:
Boston.head(25)

*Let me check the variances of each predictor.*

In [0]:
pf = pd.DataFrame()
for i in Boston.columns[:-1]:
    pf = pf.append([Boston[i].var()])

pf.columns = ['var']
plt.xkcd()
plt.figure(figsize = (25, 10))
plt.plot(pf['var'].reset_index())

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
scaler = StandardScaler()

In [0]:
scaler.fit(Boston.drop(columns='crim1_1', axis = 1).astype(float))

In [0]:
scaled_features = scaler.transform(Boston.drop(columns='crim1_1', axis = 1).astype(float))

In [0]:
Boston_scaled = pd.DataFrame(scaled_features, columns=Boston.columns[:-1])

In [0]:
Boston_scaled.head()

*Checking the variances of each predictor in the scaled dataframe.*

In [0]:
pf = pd.DataFrame()
for i in Boston_scaled.columns[:-1]:
    pf = pf.append([Boston_scaled[i].var()])


plt.xkcd()
plt.figure(figsize = (25, 10))
plt.plot(pf.reset_index())

*Okay, this looks good!*

*Let me visually examine the error rate for different values of K*

In [0]:
X = Boston_scaled
y = Boston['crim1_1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [0]:
from sklearn.neighbors import KNeighborsClassifier

In [0]:


error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [0]:
plt.xkcd()
plt.figure(figsize=(25,10))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

*So we can see, the error rate is lowest for K = 2 and then keeps increasing thereafter. So, I will perform KNNs for K = 1, 2 and 39 (highest error rate)*

In [0]:
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(X_train,y_train)
knnpred1 = knn1.predict(X_test)

In [0]:
print(confusion_matrix(y_test, knnpred1))

In [0]:
print(classification_report(y_test, knnpred1))

In [0]:
knn2 = KNeighborsClassifier(n_neighbors=2)
knn2.fit(X_train,y_train)
knnpred2 = knn2.predict(X_test)

In [0]:
print(confusion_matrix(y_test, knnpred2))

In [0]:
knn39 = KNeighborsClassifier(n_neighbors=39)
knn39.fit(X_train,y_train)
knnpred39 = knn39.predict(X_test)

In [0]:
print(confusion_matrix(y_test, knnpred39))

In [0]:
print(classification_report(y_test, knnpred39))

*As we can see, we get the best overall precision at K = 2 (92%) and the worst precision at K = 39 (84%). However, the best precision for K-Means is a shade lower than that of QDA. So, QDA is the best classifier, given we use ALL predictors.*

**So, how about using a subset of predictors?**

*First, we check the pairplots and the correlation matrix.*

In [0]:
plt.xkcd()
plt.figure(figsize = (25, 10))
sns.pairplot(Boston)

In [0]:
round(Boston.corr()*100, 2)

*Assuming an arbitrary correlation cutoff of 35% (in absolute value) and using some qualitative judgment (such as disregarding 'nox') the most correlated predictors are 'indus', 'age', 'dis', 'tax', 'black' and 'lstat'. These should give a healthy idea about some prime factors (and reverse factors too!) for crime.*

*Two points of note are in order here:*<br>
*1. The selections of these columns are arbitrary and conditional upon my personal bias. So, the reader is expected to play around more with different subsets and explore for themselves.*<br>
*2. I have not considered multicollinearity amongst different predictors, something that should be done. At the moment, I have just 'eyeballed' the multicollinearity (such as those areas with large 'indus'(non-retail businesses) are likely to have large 'nox' (nitric oxides concentration (parts per 10 million) because of fumes from these businesses.*

In [0]:
Boston1 = Boston.drop(columns=['crim', 'zn', 'chas', 'nox', 'rad', 'medv'])

In [0]:
import warnings
warnings.filterwarnings('ignore')
for i in Boston.columns:
    plt.xkcd()
    plt.figure(figsize = (25, 10))
    sns.distplot(Boston[i])

In [0]:
sns.pairplot(Boston1)

**Splitting the Boston1 dataset into training and test data**

In [0]:
X = Boston1.drop(columns='crim1_1', axis=1)
y = Boston1['crim1_1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

**Logistic Regression**

In [0]:
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [0]:
glmpred = glmfit.predict(X_test)

In [0]:
print(confusion_matrix(y_test, glmpred))

In [0]:
print(classification_report(y_test, glmpred))

**Linear Discriminant Analysis**

In [0]:
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

In [0]:
ldapred = lda.predict(X_test)

In [0]:
print(confusion_matrix(y_test, ldapred))

In [0]:
print(classification_report(y_test, ldapred))

**Quadratic Discriminant Analysis**

In [0]:
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

In [0]:
qdapred = qda.predict(X_test)

In [0]:
print(confusion_matrix(y_test, qdapred))

In [0]:
print(classification_report(y_test, qdapred))

**K-Nearest Neighbours**

In [0]:
pf = pd.DataFrame()
for i in Boston1.columns[:-1]:
    pf = pf.append([Boston1[i].var()])

pf.columns = ['var']
plt.xkcd()
plt.figure(figsize = (25, 10))
plt.plot(pf['var'].reset_index())

In [0]:
scaler = StandardScaler()

In [0]:
scaler.fit(Boston1.drop(columns='crim1_1', axis = 1).astype(float))

In [0]:
scaled_features = scaler.transform(Boston1.drop(columns='crim1_1', axis = 1).astype(float))

In [0]:
Boston1_scaled = pd.DataFrame(scaled_features, columns=Boston1.columns[:-1])

In [0]:
Boston1_scaled.head()

*Checking the variances of each predictor in the Boston1 scaled dataframe.*

In [0]:
pf = pd.DataFrame()
for i in Boston1_scaled.columns[:-1]:
    pf = pf.append([Boston1_scaled[i].var()])


plt.xkcd()
plt.figure(figsize = (25, 10))
plt.plot(pf.reset_index())

*Looks good!*

In [0]:
X = Boston1
y = Boston['crim1_1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [0]:
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [0]:
plt.xkcd()
plt.figure(figsize=(25,10))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

*Let me check the KNN predictions for K = 2 (lowest error rate) and K = 21 (highest error rate)*

In [0]:
knn2 = KNeighborsClassifier(n_neighbors=2)
knn2.fit(X_train,y_train)
knnpred2 = knn2.predict(X_test)

In [0]:
print(confusion_matrix(y_test, knnpred2))

In [0]:
print(classification_report(y_test, knnpred2))

In [0]:
knn21 = KNeighborsClassifier(n_neighbors=21)
knn21.fit(X_train,y_train)
knnpred21 = knn21.predict(X_test)

In [0]:
print(confusion_matrix(y_test, knnpred21))

In [0]:
print(classification_report(y_test, knnpred21))

*As we can see, taking the subset, KNN(K=2) provides the best precision for all models. Likewise, we could conduct further tests and check for different subsets of the data.*