#0 For each fold you obtain a confusion matrix that covers only that fold's test samples. For example, in 10-fold cross validation each fold's confusion matrix accounts for only 1/10th of the data. Therefore, to obtain the total confusion matrix, one must add the confusion matrices from all folds together. I show this in the first question in two ways: iteratively, and by using confusion_matrix("test data", cross_val_predict).
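
As a minimal sketch of why the per-fold matrices can simply be added, the following uses made-up labels and predictions (not the iris results) and a small hypothetical helper, `confusion`, written in plain NumPy. The three index groups stand in for cross validation folds.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

# Made-up labels and predictions for 12 samples and 3 classes (illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 2, 1, 1, 2, 2, 0, 2])

# Split the sample indices into 3 "folds" and build one matrix per fold.
folds = np.array_split(np.arange(len(y_true)), 3)
fold_matrices = [confusion(y_true[idx], y_pred[idx], 3) for idx in folds]

# Adding the fold matrices reproduces the matrix computed on all samples at once.
total = sum(fold_matrices)
full = confusion(y_true, y_pred, 3)
print(total)
print(np.array_equal(total, full))
```

Because every sample lands in exactly one fold, each (true, predicted) pair is counted once in the sum, which is why the two matrices agree.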

=====================================================================================================================

#1 a) The first cell below imports the iris dataset from scikit-learn's built-in datasets and preprocesses it.

In [2]:
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing

iris = datasets.load_iris()
x = iris.data
y = iris.target

le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


The next cell calculates the average accuracy score and the confusion matrix of the SGDClassifier using 10-fold cross validation in an iterative fashion.

In [3]:
from sklearn.model_selection import KFold
import numpy as np
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# 10 shuffled folds; recent scikit-learn releases require keyword arguments here
kf = KFold(n_splits=10, shuffle=True)
lm = linear_model.SGDClassifier()
total_accuracy_score = 0
# running total of the per-fold 3x3 confusion matrices
result_confusion_matrix = np.zeros((3, 3), dtype=int)

for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    lm.fit(x_train, y_train)
    pred = lm.predict(x_test)
    total_accuracy_score += accuracy_score(y_test, pred, normalize=True)
    # labels keeps every fold's matrix 3x3 even if a class is missing from that fold
    result_confusion_matrix = result_confusion_matrix + confusion_matrix(y_test, pred, labels=[0, 1, 2])

avg_accuracy_score = total_accuracy_score / kf.get_n_splits()
print("Average Accuracy Score: ",avg_accuracy_score)
print("Confusion Matrix:\n",result_confusion_matrix)

Average Accuracy Score:  0.7066666666666667
Confusion Matrix:
 [[47  3  0]
 [13 26 11]
 [ 2 15 33]]




The next cell does the same as above, but uses scikit-learn's built-in cross_val_score and cross_val_predict instead of an explicit loop.

In [4]:
from sklearn.model_selection import cross_val_score, cross_val_predict

lm_score = cross_val_score(lm,x,y,cv=10)
lm_pred = cross_val_predict(lm, x, y, cv=10)
lm_conf_matrix = confusion_matrix(y, lm_pred)

avg_accuracy_score = np.mean(lm_score)
print("Average Accuracy Score: ",avg_accuracy_score)
print("Confusion Matrix:\n",lm_conf_matrix)

Average Accuracy Score:  0.7266666666666667
Confusion Matrix:
 [[48  2  0]
 [20 17 13]
 [ 0  3 47]]
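
One likely reason the two cells above give different numbers is that they do not evaluate on the same folds: the loop uses a shuffled KFold, while cv=10 with a classifier makes scikit-learn fall back to an unshuffled StratifiedKFold, and SGDClassifier also shuffles its training data at random. A minimal sketch, assuming fixed random_state values and explicit max_iter/tol settings (my additions, not in the original cells), that passes one splitter to both helpers so the results are comparable:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

x, y = datasets.load_iris(return_X_y=True)

# Assumption: fixed seeds so every run uses the same folds and the same SGD shuffling.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)

scores = cross_val_score(clf, x, y, cv=cv)
pred = cross_val_predict(clf, x, y, cv=cv)
conf = confusion_matrix(y, pred)

# Every sample is predicted exactly once, so the matrix entries sum to 150.
print(conf.sum())
# With equal-sized folds, the mean fold accuracy equals the pooled accuracy on the diagonal.
print(np.mean(scores), np.trace(conf) / conf.sum())
```

Sharing the splitter and the seeds means the average accuracy and the confusion matrix describe the same set of predictions.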


#1 b) In the following cells I will be classifying the iris dataset again but this time with Random Forest Classifier, using the same built-in functions from scikit-learn as above.

In [5]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=2)

rfc_score = cross_val_score(rfc,x,y,cv=10)
rfc_pred = cross_val_predict(rfc, x, y, cv=10)
rfc_conf_matrix = confusion_matrix(y,rfc_pred)

avg_accuracy_score = np.mean(rfc_score)
print("Average Accuracy Score:",avg_accuracy_score)
print("Confusion Matrix:\n",rfc_conf_matrix)

Average Accuracy Score: 0.9466666666666667
Confusion Matrix:
 [[50  0  0]
 [ 0 46  4]
 [ 0  3 47]]


After changing the parameters n_estimators and max_depth, I found that raising n_estimators increases the accuracy but also greatly increases the running time of the program. Any value above 100 probably gives diminishing returns. As for max_depth, raising it up to about 5-6 increased accuracy, but values above 5-6 didn't help and sometimes caused a decrease, which I assume is from overfitting.
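
The trade-off between accuracy and running time can be measured directly. A rough sketch, assuming a small grid of values chosen only for illustration, 5-fold cross validation to keep it quick, and a fixed random_state that is not in the cell above:

```python
import time
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

x, y = datasets.load_iris(return_X_y=True)

results = {}
for n_estimators in (10, 50, 100):      # illustrative values
    for max_depth in (2, 5):            # illustrative values
        rfc = RandomForestClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth, random_state=0)
        start = time.perf_counter()
        score = np.mean(cross_val_score(rfc, x, y, cv=5))
        elapsed = time.perf_counter() - start
        # store mean accuracy and wall-clock seconds for this setting
        results[(n_estimators, max_depth)] = (score, elapsed)
        print(n_estimators, max_depth, round(score, 3), round(elapsed, 2), "s")
```

The timings depend on the machine, so only the relative growth with n_estimators is meaningful.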

#1 c) My findings from completing the above classifiers are that the RandomForestClassifier does a much better (more accurate) job of classifying this dataset. The RandomForestClassifier was between .94 and .97 accuracy for any parameters I entered. Meanwhile, the SGDClassifier was always between .6 and .82 no matter how many times I ran it using either technique. **Just an observation:** After some diving into the SGDClassifier documentation, I realized that leaving the number of iterations at its default is what caused my low accuracy score. By raising 'n_iter' (named 'max_iter' in newer scikit-learn releases) to a more appropriate value of 1000, I received an accuracy score of ~.93.
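
The effect of the iteration budget can be checked with a short comparison. This is only a sketch: it assumes the newer `max_iter` parameter name, sets `tol=None` so that exactly `max_iter` passes are made, and fixes `random_state`, none of which appear in the cells above. The value 5 is an arbitrary "too few" setting picked for contrast.

```python
import warnings
import numpy as np
from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

x, y = datasets.load_iris(return_X_y=True)

# Short runs are expected to stop before converging, so silence that warning here.
warnings.simplefilter("ignore", ConvergenceWarning)

mean_scores = {}
for max_iter in (5, 1000):
    # tol=None disables early stopping, so exactly max_iter passes are made.
    clf = SGDClassifier(max_iter=max_iter, tol=None, random_state=0)
    mean_scores[max_iter] = np.mean(cross_val_score(clf, x, y, cv=10))
    print(max_iter, mean_scores[max_iter])
```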

========================================================================================================================

#2 a) I read up on how to read an arff file on the website suggested in the assignment: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.arff.loadarff.html

I chose to use the bodies of the emails and not the subjects. I don't know which one we were supposed to use, but the bodies made more sense to me.

I also had to use DataFrame.replace() to make my model in the next cell actually accept the dataframe as input.

In [6]:
from scipy.io import arff

#Used my local path because the link was just to a zip file. I assume when testing this, the path must be changed.
bodies = arff.loadarff(r'C:\Users\james\OneDrive\Documents\Current\CSCI3151\WEKA\dbworld_bodies.arff')
df_1 = pd.DataFrame(bodies[0])
df_1 = df_1.replace(b'0',0)
df_1 = df_1.replace(b'1',1)
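
The replace calls are needed because loadarff returns nominal attributes as byte strings such as b'0' and b'1'. As an alternative sketch, using a small hypothetical frame in place of the real dbworld data, the same conversion can be done by decoding each column and casting it to int:

```python
import pandas as pd

# Hypothetical stand-in for the frame built from loadarff: nominal values arrive as bytes.
df = pd.DataFrame({"word_a": [b"0", b"1", b"1"],
                   "word_b": [b"1", b"0", b"0"],
                   "CLASS": [b"1", b"0", b"1"]})

# Decode each bytes column to text and cast it to int in one pass.
df_int = df.apply(lambda col: col.str.decode("utf-8").astype(int))

print(df_int.dtypes)
print(df_int)
```

This version assumes every column holds byte strings that parse as integers, which would not hold for an arff file with free-text or real-valued attributes.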

#2 b) In the following cell I simply train a Multinomial Naive Bayes model and score it with 3-fold cross validation.

In [7]:
from sklearn.naive_bayes import MultinomialNB

X = df_1.drop(columns=['CLASS'])
y = df_1['CLASS']


model_NB = MultinomialNB()
model_score = cross_val_score(model_NB,X,y,cv=3)
# average the Naive Bayes fold scores
avg_accuracy_score = np.mean(model_score)
print(model_score)
print("Average Accuracy Score: ", round(avg_accuracy_score, 4))

[0.95454545 0.86363636 0.75      ]
Average Accuracy Score:  0.8561


#2 c) In the following cell I use a BaggingClassifier. After experimenting with different hyperparameters, I found that raising n_estimators usually increases accuracy, and the same holds for max_samples and max_features. Of the three parameters I tuned, max_features has the largest impact on accuracy.

In [8]:
from sklearn.ensemble import BaggingClassifier

# the base estimator is passed positionally because its keyword name differs between scikit-learn versions
BC = BaggingClassifier(model_NB, n_estimators=10, max_samples=.75, max_features=.85)

np.mean(cross_val_score(BC, X, y, cv=3))

0.8863636363636364

#2 d) After parts (b) and (c) I have concluded (as stated similarly above) that higher n_estimators, higher max_samples, and higher max_features usually produce better results.
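
Instead of changing the three parameters by hand, they could be searched together. A minimal sketch, assuming synthetic word-presence data generated with NumPy (not the dbworld emails) and an illustrative grid of values:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the word-presence features (not the dbworld data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 30))
y = (X[:, :5].sum(axis=1) > 2).astype(int)   # label depends on the first five "words"

# Illustrative grid; the base estimator is passed positionally so the code does not
# depend on whether the keyword is called base_estimator or estimator.
grid = {"n_estimators": [5, 10, 20],
        "max_samples": [0.5, 0.75, 1.0],
        "max_features": [0.5, 0.85, 1.0]}
search = GridSearchCV(BaggingClassifier(MultinomialNB(), random_state=0), grid, cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

With only a few dozen samples the fold scores are noisy, so the best combination found this way should be read as a hint and not as a firm conclusion.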

=======================================================================================================================

#3 a) In this question I had to do a lot of searching online for ways to replace the missing data but eventually my colleague suggested I use an Imputer and that seemed to work for me.

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # current home of mean imputation in scikit-learn

crime_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None)
y = crime_data.iloc[:,127:128]
X = crime_data.iloc[:,0:127].copy()
#preprocessing names of cities (le is the LabelEncoder created in question 1)
X[3] = le.fit_transform(X[3])
#replacing '?' with NaN so the imputer will work, then casting the text columns to float
X = X.replace('?', np.nan).astype(float)

#SimpleImputer converts all NaN values to the mean value of that column
values = X.values
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(values)
X = pd.DataFrame(X)

reg = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(X,y)
reg.fit(x_train, y_train)
score = reg.score(x_test,y_test)
print(score)
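
To make the imputation step concrete, here is a small sketch on a made-up table that uses '?' as the missing-value marker. It uses SimpleImputer from sklearn.impute, which is where mean imputation lives in current scikit-learn releases. The table and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up table using '?' as the missing-value marker, as in the crime data.
raw = pd.DataFrame([["1", "?"],
                    ["3", "4"],
                    ["?", "8"]])

# Turn the marker into NaN and cast the text to numbers before imputing.
numeric = raw.replace("?", np.nan).astype(float)

# Mean imputation: each NaN becomes the mean of the observed values in its column.
filled = SimpleImputer(strategy="mean").fit_transform(numeric)
print(filled)
```

Here the missing entry in the first column becomes 2.0 (the mean of 1 and 3) and the one in the second column becomes 6.0 (the mean of 4 and 8).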

#3 b)

In [None]:
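from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV
from sklearn import metrics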
pipeline = Pipeline([('var', VarianceThreshold()), ('reg',LinearRegression())])

parameters = {}
parameters['var__threshold'] = [0, 0.001, 0.005, 0.01, 0.1, 0.5, 1]

gridsearch = GridSearchCV(pipeline, parameters, scoring = 'r2', cv=3)
gridsearch.fit(x_train, y_train)

print('Best score and parameter combination = \n', gridsearch.best_score_, gridsearch.best_params_)

#3 c) I first trained my LinearRegression model and evaluated it using the coefficient of determination, with no cross validation, just to get a quick baseline. In part (b) I used GridSearchCV to determine the threshold for removing attributes with low variance, with R squared as the scoring parameter. Even though it seems weird to me, the grid search determined that a low variance threshold of 0.01 returns the largest R squared value.
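
To see what the threshold actually does, here is a small sketch on a toy matrix (made up for illustration, not the crime data). VarianceThreshold keeps only the columns whose variance is above the threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: column 0 is constant, column 1 varies slightly, column 2 varies a lot.
X = np.array([[0.0, 0.1, 1.0],
              [0.0, 0.2, 5.0],
              [0.0, 0.1, 9.0],
              [0.0, 0.2, 13.0]])

kept = {}
for threshold in (0, 0.001, 0.01, 1):
    selector = VarianceThreshold(threshold=threshold).fit(X)
    # number of columns whose variance exceeds the threshold
    kept[threshold] = selector.get_support().sum()
    print(threshold, kept[threshold])
```

On this toy matrix the thresholds 0 and 0.001 keep two columns, while 0.01 and 1 keep only the last one. When features are scaled to a small range their variances are small too, so even a threshold as low as 0.01 can remove columns, which may be part of why that value came out on top.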

=========================================================================================================================

#4 a) I first had to install tensorflow (with keras) and since I only had to run that line once I left it there commented out.

I then added some simple Dense() layers to my Sequential model as suggested here: https://medium.com/@vidit0210/practical-deep-neural-network-in-keras-on-pima-diabetes-data-set-776c21424488

In [None]:
import sys
#!conda install --yes --prefix {sys.prefix} tensorflow
from keras.models import Sequential
from keras.layers import Dense

diabetes_data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', header=None)
y = diabetes_data.iloc[:,8:9]
X = diabetes_data.iloc[:,0:8]
x_train, x_test, y_train, y_test = train_test_split(X,y)

model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train,y_train, epochs=10, batch_size=50, validation_data=(x_test, y_test))
model.save('model')

#4 b) I found out how to make matplotlib plots from https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/. I didn't change it much because the plots given there fit my needs perfectly.

In [None]:
import matplotlib.pyplot as plt

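print(history.history.keys())
# older Keras names the metric 'acc', newer releases use 'accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
# summarize history for accuracy
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])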
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

#4 c) The following cell shows how to generate predictions on new data. Since I don't have any new data to input, I just split the same data differently and used that.

In [None]:
from keras.models import load_model
model = load_model('model')
x_train, x_test, y_train, y_test = train_test_split(X,y)
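# the held-out split stands in for new data; predict returns the sigmoid probabilities
predictions = model.predict(x_test)
# threshold at 0.5 to turn the probabilities into 0/1 class labels
predicted_classes = (predictions > 0.5).astype(int)
print(predicted_classes[:10].ravel())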
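
A model with a single sigmoid output returns one probability per row, so predictions on new data still have to be turned into class labels. A minimal sketch with hypothetical probability values and an assumed cut-off of 0.5:

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like Keras predictions.
probabilities = np.array([[0.91], [0.08], [0.50], [0.63], [0.27]])

# Threshold at 0.5 (an assumed cut-off) and flatten to a 1-D array of 0/1 labels.
predicted_classes = (probabilities > 0.5).astype(int).ravel()
print(predicted_classes)
```

A value of exactly 0.5 falls into class 0 here because the comparison is strict. That choice is arbitrary.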

#4 d) For this part I simply experimented with as many different parameters as I could find online, but I only left a few of them here because there were over 50 lines of code.

In [None]:
model_2 = Sequential()

model_2.add(Dense(512, input_dim=8, kernel_initializer='orthogonal', activation='tanh'))
model_2.add(Dense(128, kernel_initializer='orthogonal', activation='tanh'))
model_2.add(Dense(1, kernel_initializer='orthogonal', activation='softmax'))
model_2.compile(loss='hinge', optimizer='SGD', metrics=['accuracy'])
model_2.fit(x_train,y_train, epochs=10, batch_size=50, validation_data=(x_test, y_test))

model_3 = Sequential()

model_3.add(Dense(256, input_dim=8, kernel_initializer='he_normal', activation='linear'))
model_3.add(Dense(64, kernel_initializer='he_normal', activation='exponential'))
model_3.add(Dense(1, kernel_initializer='he_normal', activation='softsign'))
model_3.compile(loss='mean_squared_error', optimizer='adamax', metrics=['accuracy'])
model_3.fit(x_train,y_train, epochs=10, batch_size=50, validation_data=(x_test, y_test))
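
One thing worth checking when experimenting with output activations: softmax normalises across the units of a layer, so on a layer with a single unit (as in model_2) it always returns 1.0, whatever the input. A small NumPy sketch with hypothetical pre-activation values, using hand-written `softmax` and `sigmoid` helpers in place of the Keras ones:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values for a layer with a single output unit.
logits = np.array([[-3.0], [0.0], [2.5]])

print(softmax(logits).ravel())   # always 1.0 when there is only one unit
print(sigmoid(logits).ravel())   # varies with the input
```

This would explain an accuracy that stays flat across epochs for model_2, since every row gets the same prediction.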