# HW2 Question 2

## Import modules

In [1]:
import pandas as pd # for data handling
from sklearn.model_selection import cross_val_score # for cross-validation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix # evaluation metrics
import matplotlib.pyplot as plt # for plotting

# scikit-learn classifiers evaluated (change as desired)
from sklearn.naive_bayes import GaussianNB 
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

## Get data
We shall extract 3 *CSV* files (*hw2.q2.train.csv*, *hw2.q2.test.csv*, and *hw2.q2.new.csv*) from the *zip* file "*hw2.q2.data.zip*".

We shall read data from these *CSV* files into *pandas* dataframes.

### Extract *CSV* files from *zip* file
We shall use **unzip** to extract the CSV files from the zip file.
- “**!**” allows us to run command line commands from a code cell in a notebook.


In [2]:
! unzip '/content/drive/MyDrive/Colab Notebooks/courses/sklearn_classifiers/homework/hw2.q2.data.zip'

Archive:  /content/drive/MyDrive/Colab Notebooks/courses/sklearn_classifiers/homework/hw2.q2.data.zip
  inflating: hw2.q2.new.csv          
  inflating: hw2.q2.test.csv         
  inflating: hw2.q2.train.csv        
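
Outside a notebook, where the "!" shell escape is unavailable, the same extraction can be sketched with Python's standard *zipfile* module. The helper name `extract_csv_files` and the throwaway archive below are illustrative assumptions, not part of the homework files.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_csv_files(zip_path, dest_dir):
    """Extract every .csv member of zip_path into dest_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [name for name in archive.namelist() if name.endswith('.csv')]
        archive.extractall(dest_dir, members=csv_names)
    return sorted(csv_names)

# Demonstration on a throwaway archive (a stand-in for hw2.q2.data.zip)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / 'demo.zip'
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('demo.train.csv', 'y,x1\n0,1.5\n')
        archive.writestr('demo.test.csv', 'y,x1\n1,2.5\n')
        archive.writestr('readme.txt', 'not a CSV file')
    extracted = extract_csv_files(demo_zip, Path(tmp) / 'out')
    extracted_exists = all((Path(tmp) / 'out' / name).exists() for name in extracted)

print(extracted)
```

For the homework data, the call would take the path to "*hw2.q2.data.zip*" and a destination folder of your choice.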


### Read data into *pandas* dataframes
We shall use the *pandas* **read_csv** function to read data from the CSV files "*hw2.q2.train.csv*", "*hw2.q2.test.csv*", and "*hw2.q2.new.csv*" into the *pandas* dataframes **train**, **test**, and **new**, respectively.

In [3]:
# Read data from CSV files into pandas dataframes
train = pd.read_csv('hw2.q2.train.csv') # training data
test = pd.read_csv('hw2.q2.test.csv') # test data
new = pd.read_csv('hw2.q2.new.csv') # unlabeled data
# Show number of rows and columns in each dataframe
print('train contains %d rows and %d columns' %train.shape)
print('test contains %d rows and %d columns' %test.shape)
print('new contains %d rows and %d columns' %new.shape)
print('First 3 rows in train:') 
train.head(3) # display first 3 training samples 

train contains 8000 rows and 11 columns
test contains 2000 rows and 11 columns
new contains 30 rows and 11 columns
First 3 rows in train:


Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,2.0,-0.613,-1.465,-1.105,0.436,-1.658,-1.357,-2.375,-0.997,-0.653,1.186
1,1.0,-1.154,0.473,3.159,-4.77,0.402,-0.16,-1.925,-0.105,-2.304,0.032
2,2.0,0.147,-0.814,-0.792,-1.403,2.124,-2.263,-2.133,-2.461,-0.781,0.932


In [4]:
print('Last 2 rows in new:') 
new.tail(2) # display last 2 unlabeled samples

Last 2 rows in new:


Unnamed: 0,ID,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
28,ID_029,3.877,-0.758,0.528,0.495,3.054,-1.305,-2.277,-2.976,-3.785,1.93
29,ID_030,-0.243,3.535,2.304,0.03,1.027,-0.605,-0.782,-2.649,-0.293,2.907


### Specify inputs and outputs
- **features**: List of the 10 input feature names.
- **X_train**: $8000 \times 10$ array containing input values for training samples.
- **y_train**: Array containing labels for the 8000 training samples.
- **X_test**: $2000 \times 10$ array containing input values for test samples.
- **y_test**: Array containing labels for the 2000 test samples.
- **X_new**: $30 \times 10$ array containing input values for unlabeled samples.
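
As a small illustration of the split described above, the sketch below separates a label column from the feature columns by name rather than by position. The dataframe `toy` and its values are made up for the example; only the column layout (label **y** first, then features) mirrors **train**.

```python
import pandas as pd

# Tiny made-up dataframe with the same layout as train: label 'y' first, then features
toy = pd.DataFrame({
    'y':  [2.0, 1.0, 2.0, 0.0],
    'x1': [-0.613, -1.154, 0.147, 0.500],
    'x2': [-1.465, 0.473, -0.814, 1.200],
    'x3': [-1.105, 3.159, -0.792, -0.300],
})

label = 'y'
toy_features = [col for col in toy.columns if col != label]  # every column except the label
X_toy, y_toy = toy[toy_features], toy[label]

print('features:', toy_features)
print(f'X_toy: {X_toy.shape}, y_toy: {y_toy.shape}')
```

Selecting by name keeps working if the column order changes, whereas positional slicing assumes the label is always the first column.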






In [5]:
features = list(train)[1:] # all but the first column header are feature names
print("features:", features)
X_train, X_test, X_new = train[features], test[features], new[features]
y_train, y_test = train.y, test.y
print('Shapes:')
print(f'X_train: {X_train.shape}, X_test: {X_test.shape}, X_new: {X_new.shape}')
print(f'y_train: {y_train.shape}, y_test: {y_test.shape}')

features: ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
Shapes:
X_train: (8000, 10), X_test: (2000, 10), X_new: (30, 10)
y_train: (8000,), y_test: (2000,)


## Evaluate models using *k*-fold cross-validation
We shall use **4**-fold cross-validation, so that in each fold 6000 of the 8000 training samples are used for training and the remaining 2000 samples are used for validation. For each model with its chosen hyperparameters, the mean cross-validation accuracy over the 4 folds will be computed using the command:
- **score = cross_val_score(model, X_train, y_train, cv=4).mean()**
> - model: classifier object with specified hyperparameters
> - X_train, y_train: Inputs and output labels for training
> - cv: number of folds in cross-validation
> - mean(): computes mean accuracy from the *cv* runs 

You can look up the documentation for each classifier, change hyperparameter values, and observe the results. We shall also observe the time it takes to train and evaluate each model 4 times in this process.
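
To make the procedure concrete, here is a minimal sketch on synthetic data (generated with *make_classification* as a stand-in for the homework files, which is an assumption of this example). It prints the training and validation sizes of each fold, the accuracy of each fold before averaging, and the elapsed time measured with *time.perf_counter* as a plain-Python counterpart to **%%time**. For a classifier and an integer *cv*, **cross_val_score** uses stratified folds, so **StratifiedKFold** is used here to show the fold sizes.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the homework data: 800 samples, 10 features, 4 classes
X_demo, y_demo = make_classification(n_samples=800, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)

# Each of the 4 folds holds out about a quarter of the samples for validation
folds = StratifiedKFold(n_splits=4)
fold_sizes = [(len(train_idx), len(val_idx))
              for train_idx, val_idx in folds.split(X_demo, y_demo)]
print('(train, validation) sizes per fold:', fold_sizes)

start = time.perf_counter()
fold_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=4)  # one accuracy per fold
elapsed = time.perf_counter() - start

print('Accuracy per fold:', fold_scores.round(4))
print(f'Mean = {fold_scores.mean():0.4f}, std = {fold_scores.std():0.4f}, time = {elapsed:0.3f} s')
```

Looking at the spread of the per-fold accuracies, not only their mean, helps judge whether a small difference between two models is meaningful.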


### GaussianNB

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [6]:
%%time
model = GaussianNB() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.8490
CPU times: user 29.6 ms, sys: 2.02 ms, total: 31.6 ms
Wall time: 39.9 ms


### DecisionTreeClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [9]:
%%time
model = DecisionTreeClassifier() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.8424
CPU times: user 388 ms, sys: 0 ns, total: 388 ms
Wall time: 385 ms


### RandomForestClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [10]:
%%time
model = RandomForestClassifier() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9404
CPU times: user 6.93 s, sys: 37.2 ms, total: 6.97 s
Wall time: 6.98 s


### ExtraTreesClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [11]:
%%time
model = ExtraTreesClassifier() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9519
CPU times: user 2.5 s, sys: 47 ms, total: 2.55 s
Wall time: 2.54 s


### KNeighborsClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [12]:
%%time
model = KNeighborsClassifier() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9640
CPU times: user 728 ms, sys: 709 µs, total: 729 ms
Wall time: 740 ms


### LogisticRegression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [13]:
%%time
model = LogisticRegression(max_iter=1000) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.8812
CPU times: user 491 ms, sys: 2.73 ms, total: 494 ms
Wall time: 512 ms


### SVC

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [14]:
%%time
model = SVC() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9669
CPU times: user 2 s, sys: 4 ms, total: 2.01 s
Wall time: 2.01 s


### MLPClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [16]:
%%time
model = MLPClassifier(max_iter=1000) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9589
CPU times: user 1min 8s, sys: 523 ms, total: 1min 9s
Wall time: 1min 8s


## Select a good model
Since the Support Vector Classifier has the highest cross-validation accuracy among the models above, we shall search for good hyperparameter values for an SVC model using cross-validation. In this example I shall vary the regularization parameter C.

In [17]:
for c in [0.1, 1, 5, 10]: # candidate values for the regularization parameter C
    model = SVC(C=c)
    score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
    print(f'Mean cross-validation accuracy with C = {c:0.1f} = {score:0.4f}')

Mean cross-validation accuracy with C = 0.1 = 0.9499
Mean cross-validation accuracy with C = 1.0 = 0.9669
Mean cross-validation accuracy with C = 5.0 = 0.9701
Mean cross-validation accuracy with C = 10.0 = 0.9698
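
The same kind of search can also be written with scikit-learn's **GridSearchCV**, which runs the cross-validation for every candidate value and keeps track of the best one. The sketch below is an alternative to the loop above, not the method used in this notebook, and it runs on synthetic data from *make_classification* (an assumption of the example), so its scores will differ from those shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the homework data: 400 samples, 10 features, 4 classes
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)

param_grid = {'C': [0.1, 1, 5, 10]}             # same candidate values as the loop above
search = GridSearchCV(SVC(), param_grid, cv=4)  # 4-fold cross-validation per candidate
search.fit(X_demo, y_demo)

mean_scores = search.cv_results_['mean_test_score']
for params, mean_score in zip(search.cv_results_['params'], mean_scores):
    print(f"Mean cross-validation accuracy with C = {params['C']} = {mean_score:0.4f}")

print('Best C:', search.best_params_['C'])
best_model = search.best_estimator_  # refit on all of X_demo, y_demo with the best C
```

With the default `refit=True`, **best_estimator_** is already trained on the full data passed to **fit**, so it can be used for prediction directly.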


In [18]:
chosen_model = SVC(C=0.5) # note: the search above scored C in [0.1, 1, 5, 10]; C = 5 had the highest mean accuracy
chosen_model

SVC(C=0.5, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

## Train and test selected model

In [19]:
%%time
chosen_model.fit(X_train, y_train) # train selected model on ALL training examples
predicted = chosen_model.predict(X_test) # predicted classes for test examples
acc = accuracy_score(y_test, predicted) # accuracy on test samples
print(f'Accuracy on test samples = {acc:0.4f}') # show test accuracy
print("Classification report on test samples:") # for precision, recall, F1-score
print(classification_report(y_test, predicted, digits=4)) # rounded to 4 decimal places

Accuracy on test samples = 0.9650
Classification report on test samples:
              precision    recall  f1-score   support

         0.0     0.9569    0.9760    0.9663       500
         1.0     0.9477    0.9458    0.9467       498
         2.0     0.9796    0.9757    0.9776       493
         3.0     0.9761    0.9627    0.9693       509

    accuracy                         0.9650      2000
   macro avg     0.9651    0.9650    0.9650      2000
weighted avg     0.9651    0.9650    0.9650      2000

CPU times: user 838 ms, sys: 7.92 ms, total: 846 ms
Wall time: 844 ms


In [22]:
cm = pd.DataFrame(confusion_matrix(y_test, predicted))
cm.to_csv('cm.hw2.q2.csv')
cm

Unnamed: 0,0,1,2,3
0,488,6,1,5
1,14,471,7,6
2,4,7,481,1
3,4,13,2,490
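
The figures in the classification report can be recovered from this confusion matrix, in which rows correspond to true classes and columns to predicted classes (the convention used by scikit-learn's **confusion_matrix**). The sketch below copies the matrix values from the output above and recomputes recall, precision, and accuracy with NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm_values = np.array([[488,   6,   1,   5],
                      [ 14, 471,   7,   6],
                      [  4,   7, 481,   1],
                      [  4,  13,   2, 490]])

true_pos = np.diag(cm_values)                # correctly classified samples per class
recall = true_pos / cm_values.sum(axis=1)    # divide by row sums (true class counts)
precision = true_pos / cm_values.sum(axis=0) # divide by column sums (predicted class counts)
accuracy = true_pos.sum() / cm_values.sum()  # correct predictions over all samples

print('recall:   ', recall.round(4))
print('precision:', precision.round(4))
print(f'accuracy:  {accuracy:0.4f}')
```

For example, class 0 has 500 true samples of which 488 are predicted correctly, giving a recall of 488/500 = 0.9760, and 510 samples are predicted as class 0, giving a precision of 488/510 ≈ 0.9569.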


## Predict class for unlabeled samples
We shall use our trained model to predict the output class for the unlabeled samples.

In [23]:
predicted_new = chosen_model.predict(X_new) # predicted classes for unlabeled samples
hw2q2_prediction = pd.DataFrame() # dataframe with predicted classes
hw2q2_prediction['ID'] = new.ID # identifiers for unlabeled samples
hw2q2_prediction['y'] = predicted_new # predicted classes for unlabeled samples
hw2q2_prediction.to_csv('hw2.q2.prediction.csv', index=False) # save as CSV file
hw2q2_prediction # display results

Unnamed: 0,ID,y
0,ID_001,0.0
1,ID_002,0.0
2,ID_003,0.0
3,ID_004,0.0
4,ID_005,0.0
5,ID_006,0.0
6,ID_007,0.0
7,ID_008,0.0
8,ID_009,0.0
9,ID_010,0.0
