* Each record contains 9 numerical attributes.
* The 10th column is the target column.
* There are two classes to predict; we will call them positive and negative.

# Import necessary packages
For this exercise we need
 * pandas - data loading and manipulation
 * train_test_split - splitting the data into training and test sets
 * pyplot from matplotlib - data visualization
 * LogisticRegression - classifier
 * KNeighborsClassifier - classifier
 * RandomForestClassifier - classifier
 * DummyClassifier - baseline classifier

In [None]:
#Setup dataframe and visualizations
import math
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt


#Figure set up for dark theme:
plt.style.use(['dark_background'])
sns.set()

#suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score, classification_report

# Load and prepare the dataset
* Load the data (the shuttle.tst file) into a dataframe named df_train_data
* Create a binary classification problem by relabeling the classes: class 4 becomes 1 (positive) and every other class becomes 0 (negative).
* Create a dataframe of 9 features named X by dropping column 9.
* Create a dataframe of labels named y by selecting only column 9.
* Split the data into a training set and a test set.


In [None]:
#Remove Following if Using SQL Alchemy
url ='http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/shuttle/shuttle.tst'
df_train_data = pd.read_csv(url, header=None, sep=' ')

In [None]:
#Get a variable of interest: collapse every class other than 4 into class 0
df_train_data.loc[df_train_data[9] != 4, 9] = 0
df_train_data[9].head()

0    4
1    4
2    0
3    4
4    0
Name: 9, dtype: int64

In [None]:
#Create a binary yes/no column. Class 4 becomes 1, our "on" (positive) class. Everything else is already 0, our "off" (negative) class.
df_train_data.loc[df_train_data[9] == 4, 9] = 1

#columns the dataset should not contain.
drop_cols = [9]
#variable of interest
var = 9

X = df_train_data.drop(columns=drop_cols)
y = df_train_data[var]

In [None]:
#Size of the test set.
SIZE = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=SIZE)
print('There are {:d} training samples and {:d} test samples'.format(X_train.shape[0], X_test.shape[0]))

There are 11600 training samples and 2900 test samples
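
Because the positive class is much rarer than the negative class, a plain random split can shift the class ratio a little between the two sets. The following is only a sketch of a stratified, seeded split. It runs on synthetic stand-in data, and the names `X_demo`, `y_demo`, `Xd_train` and so on are made up for the illustration; the 15% positive rate and the seed are assumptions too.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the shuttle data: 9 numeric features, roughly 15% positives.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 9)))
y_demo = pd.Series((rng.random(1000) < 0.15).astype(int))

# stratify=y_demo keeps the class ratio nearly identical in both splits;
# random_state makes the split repeatable from run to run.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print('Positive rate overall:  {:.3f}'.format(y_demo.mean()))
print('Positive rate in train: {:.3f}'.format(yd_train.mean()))
print('Positive rate in test:  {:.3f}'.format(yd_test.mean()))
```

Applied to the real data, the same two keyword arguments could be passed to the `train_test_split` call above, at the cost of changing the scores printed in later cells.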


# Create the model
* Instantiate a Logistic Regression classifier with a lbfgs solver.
* Fit the classifier to the data.

In [None]:
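#Note: scikit-learn 1.2 and later spell "no regularization" as penalty=None. The string 'none' matches the older version used here.
lr = LogisticRegression(solver='lbfgs', penalty='none', max_iter=1000)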
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

# Calculate Accuracy
Calculate and print the accuracy of the model on the test data.


In [None]:
#Increasing the number of iterations did not resolve the issue. 
lr = LogisticRegression(solver='lbfgs', penalty='none', max_iter=10000, random_state=2)
lr.fit(X_train, y_train)

test_score = lr.score(X_test, y_test)
train_score = lr.score(X_train, y_train)

print('Score on training data: ', train_score)
print('Score on test data: ', test_score)


Score on training data:  0.8248275862068966
Score on test data:  0.8289655172413793


# Dummy Classifier
Use the dummy classifier to calculate the accuracy of a baseline that always predicts the most frequent class.

Compare this result to the result of the logistic regression classifier above. What does this result tell you?

In [None]:
dummy = DummyClassifier(strategy = 'most_frequent')
dummy.fit(X_train, y_train)
print("Dummy: {}, Training: {}, Test {}".format(dummy.score(X_test, y_test), train_score, test_score))

#It is possible to insert code that will tell me this. I just need a good idea of my threshold.

Dummy: 0.8548275862068966, Training: 0.8248275862068966, Test 0.8289655172413793


It would appear that my model is doing no better than the baseline: its test accuracy (0.829) is slightly below that of always predicting the most frequent class (0.855).
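
To see where the dummy score comes from: with `strategy='most_frequent'`, the accuracy is simply the share of the majority class. Here is a small sketch on made-up labels (85 negatives, 15 positives); `X_toy`, `y_toy`, `baseline_acc` and `majority_share` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 85 negatives and 15 positives.
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the feature values

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# Share of the most common class in the labels.
majority_share = np.bincount(y_toy).max() / len(y_toy)

print(baseline_acc, majority_share)  # both are 0.85
```

So on an imbalanced problem, accuracy alone can look high even when the model never finds a single positive.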

# Confusion Matrix
Print the confusion matrix.

In [None]:
predictions = lr.predict(X_test)

confusion = confusion_matrix(y_test, predictions, labels=[1,0])
print(confusion)

[[  26  395]
 [ 101 2378]]
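
The headline metrics can be read straight off this matrix. This sketch retypes the numbers printed above by hand; since `labels=[1,0]` was used, the first row and column are the positive class.

```python
import numpy as np

# Confusion matrix printed above, with labels=[1, 0]:
# rows are actual (positive, negative), columns are predicted (positive, negative).
cm = np.array([[26, 395],
               [101, 2378]])
tp, fn = cm[0]
fp, tn = cm[1]

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are real
recall = tp / (tp + fn)           # share of real positives that were found

print('Accuracy:  {:.3f}'.format(accuracy))   # 0.829
print('Precision: {:.3f}'.format(precision))  # 0.205
print('Recall:    {:.3f}'.format(recall))     # 0.062
```

The accuracy matches the test score printed earlier, while the recall shows that the model finds only 26 of the 421 actual positives.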


Plot a nicer confusion matrix (optional)
Use a plotting helper, such as the plot_cm function defined in the next cell, to plot a nicer-looking confusion matrix.

Desired behavior: correct predictions (true positives and true negatives) are a different color than the errors. (26 and 2378 should both be hot pink.)

In [None]:
#!pip install pygal
import pygal
from IPython.display import display, HTML
from pygal.style import NeonStyle

base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

def galplot(chart):
    rendered_chart = chart.render(is_unicode=True)
    plot_html = base_html.format(rendered_chart=rendered_chart)
    display(HTML(plot_html))

def plot_cm(y_true, y_pred):
    labels = np.unique(y_true)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    dot_chart = pygal.Dot(x_label_rotation=30, interpolate='cubic', style=NeonStyle)
    dot_chart.title = 'Confusion Matrix'
    dot_chart.x_labels = labels
    # Note: the axis titles use the notebook-level y_test and predictions (numeric labels),
    # not the y_true and y_pred arguments.
    dot_chart.x_title = "Predicted - Precision " + str(precision_score(y_test, predictions)) # Include precision percentage (False Positives) on the legend next to each definition.
    dot_chart.y_title = "Actual - Recall " + str(recall_score(y_test, predictions)) # Include recall percentage (False Negatives) on the legend next to each definition.
    for i in range(len(labels)):
        dot_chart.add(labels[i], cm[i,:])
    galplot(dot_chart)

plot_cm(y_test.astype(str), predictions.astype(str))

# Calculate Metrics
Print the F1, F beta, precision, recall and accuracy scores.

In [None]:
def CalculateMetrics(y_test, predictions):
    #beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
    #beta is passed by keyword because newer scikit-learn versions require it.
    return {'accuracy': accuracy_score(y_test, predictions),
            'precision': precision_score(y_test, predictions),
            'recall': recall_score(y_test, predictions),
            'f1': f1_score(y_test, predictions),
            'fbeta_precision': fbeta_score(y_test, predictions, beta=0.5),
            'fbeta_recall': fbeta_score(y_test, predictions, beta=2)}

CalculateMetrics(y_test, predictions)
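
For reference, F-beta is a weighted harmonic mean of precision and recall: (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below checks that formula against `fbeta_score` on a tiny made-up example; `f_beta`, `y_true_toy` and `y_pred_toy` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false positive, 2 false negatives.
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 0, 1, 0, 0, 0, 1])

p = precision_score(y_true_toy, y_pred_toy)  # 2 / 3
r = recall_score(y_true_toy, y_pred_toy)     # 2 / 4

def f_beta(p, r, beta):
    """Weighted harmonic mean of precision p and recall r."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (0.5, 1, 2):
    by_hand = f_beta(p, r, beta)
    by_sklearn = fbeta_score(y_true_toy, y_pred_toy, beta=beta)
    print('beta={}: by hand {:.4f}, sklearn {:.4f}'.format(beta, by_hand, by_sklearn))
```

With beta = 1 the formula reduces to the usual F1 score.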

# Print a classification report

In [None]:
#target_names follow the sorted label order: 0 (negative) first, then 1 (positive).
report = classification_report(y_test, predictions, target_names=['Negative', 'Positive'])
print(report)

# Plot ROC Curve and AUC
Calculate AUC and plot the ROC curve.

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score
probs = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
print('Area under the ROC curve: {:.3f}'.format(auc))

fig = plt.figure(figsize = (6, 6))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.fill(fpr, tpr, 'grey', alpha=0.3)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve for Logistic Regression Model')
plt.show()
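
As a sanity check on what the AUC number means, here is a tiny sketch with four made-up scores. The area obtained by integrating the ROC points with `sklearn.metrics.auc` should agree with `roc_auc_score`; the names ending in `_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four made-up examples: two negatives and two positives with predicted scores.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point per threshold.
fpr_toy, tpr_toy, thr_toy = roc_curve(y_true_toy, scores_toy)

area_trapezoid = auc(fpr_toy, tpr_toy)                # trapezoidal rule over the curve
area_direct = roc_auc_score(y_true_toy, scores_toy)   # same quantity computed directly

print(area_trapezoid, area_direct)  # both are 0.75
```

The value 0.75 is also the fraction of (positive, negative) pairs in which the positive example gets the higher score: 3 of the 4 pairs here.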

# Plot Precision-Recall Curve
Plot the precision-recall curve for the model above.

Find the best value for C in the Logistic Regression Classifier for avoiding overfitting. Plot the training and testing accuracy over a range of C values from 0.05 to 1.5.

In [None]:
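#Use predicted probabilities rather than hard 0/1 predictions so the curve covers every threshold.
pres, rec, thresholds = precision_recall_curve(y_test, lr.predict_proba(X_test)[:, 1])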
fig = plt.figure(figsize = (6, 6))
plt.plot(rec, pres)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

In [None]:
c_vals = np.arange(0.05, 1.5, 0.05)
test_accuracy = []
train_accuracy = []

for c in c_vals:
  lr = LogisticRegression(solver='lbfgs', penalty='l2', C=c, max_iter=10000, random_state=2)
  lr.fit(X_train, y_train)
  test_accuracy.append(lr.score(X_test, y_test))
  train_accuracy.append(lr.score(X_train, y_train))

fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(1, 1, 1)
ax1.plot(c_vals, test_accuracy, '-g', label='Test Accuracy')
ax1.plot(c_vals, train_accuracy, '-b', label='Train Accuracy')
ax1.set(xlabel='C', ylabel='Accuracy')
ax1.set_title('Effect of C on Accuracy')
ax1.legend()
plt.show()

The gap between training and test accuracy seems about the same no matter what value of C I choose. I'll go with 0.2.
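
An alternative to reading C off the plot is to let cross-validation on the training data pick it, which keeps the test set out of the decision. This is only a sketch on synthetic data from `make_classification`; `X_demo`, `y_demo` and `search` are hypothetical names, and the class weights are an assumption meant to imitate the imbalance here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 9 features, roughly 85% negatives and 15% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5,
    weights=[0.85, 0.15], random_state=0
)

# Same range of C as in the loop above: 30 values from 0.05 to 1.5.
param_grid = {'C': np.linspace(0.05, 1.5, 30)}

# 5-fold cross-validation scores every candidate C on held-out folds.
search = GridSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=10000),
    param_grid, cv=5, scoring='accuracy'
)
search.fit(X_demo, y_demo)

print('Best C: {:.2f}'.format(search.best_params_['C']))
print('Mean CV accuracy at best C: {:.3f}'.format(search.best_score_))
```

If the scores are as flat as in the plot above, the "best" C will be decided by noise, so any value in the range is probably fine.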

In [None]:
c = 0.2
#Increasing the number of iterations did not resolve the issue. 
lr = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, C=c, random_state=2)
lr.fit(X_train, y_train)

test_score = lr.score(X_test, y_test)
train_score = lr.score(X_train, y_train)

print('Score on training data: ', train_score)
print('Score on test data: ', test_score)

# Cross Validation
Perform 5-fold cross validation for a Logistic Regression Classifier. Print the 5 accuracy scores and the mean validation score.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score

clf = LogisticRegression(solver='lbfgs', max_iter=1000)
cv_scores = cross_val_score(clf, X_train, y_train, cv = 5)

print('Accuracy scores for the 5 folds: ', cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(cv_scores)))
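
Since accuracy hides how the rare class is doing, cross-validation can also report recall alongside it. This sketch uses `cross_validate` with stratified folds on synthetic data; the data, the names `X_demo`, `y_demo`, `folds` and `cv_results`, and the choice of recall as the second metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data with an imbalance similar to this notebook's.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5,
    weights=[0.85, 0.15], random_state=0
)

# Stratified folds keep the positive rate roughly equal in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = cross_validate(
    LogisticRegression(solver='lbfgs', max_iter=1000),
    X_demo, y_demo, cv=folds, scoring=['accuracy', 'recall']
)

print('Accuracy per fold:', cv_results['test_accuracy'])
print('Recall per fold:  ', cv_results['test_recall'])
```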

# Is this really linear?
Our linear classifier is not giving us better accuracy than the dummy classifier. What if the data are not linearly separable? Instantiate and train a KNN model with k = 7. How does the accuracy of the KNN model compare to the Logistic Regression from above? What does that tell you about the data?

In [None]:

knn = KNeighborsClassifier(n_neighbors=7)

# Then fit the model
knn.fit(X_train, y_train)

# How well did we do?
knn_7_score = knn.score(X_test, y_test)

print('Accuracy of KNN (k = 7): {:.3f}'.format(knn_7_score))
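
One thing to keep in mind when comparing: KNN relies on distances, so features on very different scales can dominate the result. I have not checked whether that matters for the shuttle data. The sketch below only illustrates the idea on synthetic data where one feature is deliberately inflated; `raw_knn`, `scaled_knn` and the other names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature blown up to a much larger scale.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5, random_state=0
)
X_demo[:, 0] *= 1000

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# KNN on the raw features versus KNN after standardizing each feature.
raw_knn = KNeighborsClassifier(n_neighbors=7).fit(Xd_train, yd_train)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=7)
).fit(Xd_train, yd_train)

print('KNN without scaling: {:.3f}'.format(raw_knn.score(Xd_test, yd_test)))
print('KNN with scaling:    {:.3f}'.format(scaled_knn.score(Xd_test, yd_test)))
```

Putting the scaler inside the pipeline means it is fit on the training split only, so nothing leaks from the test set.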

# Random Forest
Instantiate and fit a RandomForestClassifier and calculate the accuracy of that model.

In [None]:
 
rf = RandomForestClassifier(n_estimators = 22, random_state = 40)

rf.fit(X_train,y_train)

rf_score = rf.score(X_test, y_test)

print('Accuracy of Random Forest: {:.3f}'.format(rf_score))