Import necessary modules

In [1]:
import pandas as pd
import numpy as np
from imblearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

df=pd.read_csv(r'..\homework_13\diabetes.csv')
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


### Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

SMOTE-ENN combines SMOTE (Synthetic Minority Oversampling Technique) with ENN (Edited Nearest Neighbours). SMOTE oversamples the minority class first: it picks a minority observation and one of its nearest minority-class neighbors, multiplies the difference between the two by a random number between 0 and 1, and adds the result to the original observation, so the new datapoint lies on the line segment between them. ENN then cleans the resampled data by comparing each observation to its k nearest neighbors (default k=3). If the observation's class differs from the majority class of those neighbors, the observation is removed from the dataset. The oversampling step balances the classes, and the cleaning step removes noisy or overlapping points.
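
To make the two steps concrete, here is a minimal pure-Python sketch of the idea on a tiny, hypothetical 2-D dataset. It is not imbalanced-learn's implementation: the helper names (`nearest`, `smote_sample`, `enn_filter`) and the toy points are assumptions made for illustration, and ENN is written with the majority-vote rule described above.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def smote_sample(minority_points, k, rng):
    """Create one synthetic point between a minority point and a neighbor."""
    i = rng.randrange(len(minority_points))
    j = rng.choice(nearest(i, minority_points, k))
    gap = rng.random()  # random number in [0, 1)
    base, neighbor = minority_points[i], minority_points[j]
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))


def enn_filter(points, labels, k=3):
    """Keep observations whose class matches the majority of their k neighbors."""
    keep = []
    for i, label in enumerate(labels):
        votes = Counter(labels[j] for j in nearest(i, points, k))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: class 0 is the majority, and (5.4, 5.4) is a
# class-0 point sitting inside the class-1 cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5.4, 5.4), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
n_original = len(points)
minority_points = [p for p, lab in zip(points, labels) if lab == 1]

rng = random.Random(0)
# Step 1 (SMOTE): add synthetic minority points until the classes are balanced.
while labels.count(1) < labels.count(0):
    points.append(smote_sample(minority_points, k=2, rng=rng))
    labels.append(1)
synthetic = points[n_original:]

# Step 2 (ENN): drop observations that disagree with their neighborhood.
clean_points, clean_labels = enn_filter(points, labels, k=3)
print(Counter(clean_labels))
```

In this toy example the only observation ENN removes is the class-0 point inside the class-1 cluster, which is the kind of overlapping point the cleaning step is meant to catch.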

In [2]:
#check to see how balanced the outcomes are
df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
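
As a quick sanity check on these counts, the short sketch below works out the minority share and the accuracy a model would reach by always predicting the majority class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
minority_share = counts[1] / total          # fraction of rows with Outcome == 1
imbalance_ratio = counts[0] / counts[1]     # majority rows per minority row
majority_baseline = counts[0] / total       # accuracy of always predicting 0

print(f"rows: {total}")                                  # rows: 768
print(f"minority share: {minority_share:.1%}")           # minority share: 34.9%
print(f"imbalance ratio: {imbalance_ratio:.2f}:1")       # imbalance ratio: 1.87:1
print(f"majority baseline: {majority_baseline:.1%}")     # majority baseline: 65.1%
```

The baseline gives a reference point for the cross-validated accuracies reported below.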

In [3]:
#identify the X and y values
y=df['Outcome'].values 
X=df.drop('Outcome',axis=1)

In [4]:
model=AdaBoostClassifier()
resample=SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='all'))
pipeline=Pipeline(steps=[('r', resample), ('m', model)])
cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scoring=['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)

print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

Mean Accuracy: 0.7331
Mean Precision: 0.7279
Mean Recall: 0.7465


In [5]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
resample = SMOTEENN()
pipeline = Pipeline(steps=[('r', resample), ('m', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scoring=['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)

print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

Mean Accuracy: 0.7062
Mean Precision: 0.7012
Mean Recall: 0.7171


### Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

Precision and recall went up using AdaBoostClassifier with SMOTEENN (macro-averaged 0.7279 and 0.7465) compared with the DecisionTreeClassifier with SMOTEENN (0.7012 and 0.7171) and with the DecisionTreeClassifier without SMOTEENN. Because these scores are macro-averaged over both classes, they reflect the minority class, those with diabetes, only in part. Since we determined earlier that recall should be a priority because of the potential health outcomes for those with undiagnosed diabetes, this is an important improvement.
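
Since the scores above are macro-averaged, it may help to see how a macro average relates to the recall of the minority class alone. The sketch below uses made-up labels (not results from this notebook) and a hypothetical helper, `recall_per_class`, to show the difference.

```python
def recall_per_class(y_true, y_pred):
    """Recall for each class: correct predictions / actual members of the class."""
    recalls = {}
    for cls in sorted(set(y_true)):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls[cls] = sum(p == cls for p in preds_for_cls) / len(preds_for_cls)
    return recalls


# Hypothetical labels: six actual negatives followed by four actual positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

per_class = recall_per_class(y_true, y_pred)
macro_recall = sum(per_class.values()) / len(per_class)

print(per_class)       # recall of class 0 is 5/6, recall of class 1 is 2/4
print(macro_recall)    # unweighted mean of the two per-class recalls
```

Here the macro recall is about 0.67 even though only half of the positives are found. In scikit-learn, passing `'recall'` instead of `'recall_macro'` in the `scoring` list reports recall for the positive class in a binary problem, which would address the minority-class question more directly.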

### What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Outlier detection is the task of identifying abnormal or unusual observations. Scikit-learn provides several algorithms for it: robust covariance, one-class SVM, isolation forest, and local outlier factor (LOF). While the scikit-learn documentation covers all four methods, it seems to favor isolation forest and LOF for their efficiency. LOF compares the local density around each observation with the density around its neighbors, so it does well when the observations form clusters of some density, and it is useful for moderately high-dimensional datasets. Detecting outliers is useful because abnormal datapoints can skew the model toward unimportant information.
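
To illustrate the density idea behind LOF, here is a simplified pure-Python sketch on a hypothetical set of 2-D points. It follows the usual LOF definition (k-distance, reachability distance, local reachability density) but is only a teaching sketch that assumes no duplicate points; in practice one would likely use `sklearn.neighbors.LocalOutlierFactor`. The function name `lof_scores` and the data are assumptions.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, using its k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each point (the point itself is excluded).
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbor.
    k_dist = [dist[i][neighbors[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        k / sum(max(k_dist[j], dist[i][j]) for j in neighbors[i])
        for i in range(n)
    ]
    # LOF: average density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a tight cluster plus one point far away from it.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=2)
most_unusual = max(range(len(points)), key=lambda i: scores[i])

for point, score in zip(points, scores):
    print(point, round(score, 2))
print("most unusual point:", points[most_unusual])
```

Scores close to 1 mean a point is about as dense as its neighbors, while a score well above 1 marks a point in a much sparser region, which here is the isolated point at (10, 10).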

### Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file into Python.
### You can use this code, but otherwise follow the standard practices we have already used many times:
### `from sklearn.svm import SVC`
### `classifier = SVC(kernel='linear')`

In [6]:
aus_df = pd.read_csv('australian.dat',delimiter=' ',header=None)
aus_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1,22.08,11.460,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.000,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.750,1,4,4,1.250,0,0,0,1,2,280,1,0
3,0,21.67,11.500,1,5,3,0.000,1,1,11,1,2,0,1,1
4,1,20.17,8.170,2,6,4,1.960,1,1,14,0,2,60,159,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,31.57,10.500,2,14,4,6.500,1,0,0,0,2,0,1,1
686,1,20.67,0.415,2,8,4,0.125,0,0,0,0,2,0,45,0
687,0,18.83,9.540,2,6,4,0.085,1,0,0,0,2,100,1,1
688,0,27.42,14.500,2,14,8,3.085,1,1,1,0,2,120,12,1


In [7]:
aus_df.columns=['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15']
aus_df

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
0,1,22.08,11.460,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.000,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.750,1,4,4,1.250,0,0,0,1,2,280,1,0
3,0,21.67,11.500,1,5,3,0.000,1,1,11,1,2,0,1,1
4,1,20.17,8.170,2,6,4,1.960,1,1,14,0,2,60,159,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,31.57,10.500,2,14,4,6.500,1,0,0,0,2,0,1,1
686,1,20.67,0.415,2,8,4,0.125,0,0,0,0,2,0,45,0
687,0,18.83,9.540,2,6,4,0.085,1,0,0,0,2,100,1,1
688,0,27.42,14.500,2,14,8,3.085,1,1,1,0,2,120,12,1


In [19]:
X= aus_df.drop('A15', axis=1)
y=aus_df['A15']

In [20]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42) # 70% training and 30% test

In [21]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [22]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8599033816425121


### How did the SVM model perform? Use a classification report. 

In [23]:
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used here
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.91      0.89       126
           1       0.85      0.78      0.81        81

    accuracy                           0.86       207
   macro avg       0.86      0.85      0.85       207
weighted avg       0.86      0.86      0.86       207
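
To show how the numbers in this report fit together, the sketch below recomputes them from a confusion matrix. The four counts are back-derived from the report (support of 126 and 81, recall of 0.91 and 0.78, accuracy of 0.8599); the notebook did not print a confusion matrix, so treat them as an inference rather than recorded output.

```python
# Counts inferred from the classification report above (not printed by the notebook).
tn, fp = 115, 11   # actual class 0: predicted 0, predicted 1
fn, tp = 18, 63    # actual class 1: predicted 0, predicted 1


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


accuracy = (tn + tp) / (tn + fp + fn + tp)

# Class 1 (approved): precision and recall in terms of true positives.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = f1(precision_1, recall_1)

# Class 0: the same formulas with class 0 treated as the "positive" class.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = f1(precision_0, recall_0)

print(f"accuracy: {accuracy:.4f}")                               # accuracy: 0.8599
print(f"class 0: {precision_0:.2f} {recall_0:.2f} {f1_0:.2f}")   # class 0: 0.86 0.91 0.89
print(f"class 1: {precision_1:.2f} {recall_1:.2f} {f1_1:.2f}")   # class 1: 0.85 0.78 0.81
```

Read this way, the model misses 18 of the 81 class-1 cases in the test set, which is why recall for class 1 (0.78) is the weakest number in the report.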



### What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

While I am interested in machine learning and in how good models are built, I am most excited about the problem solving that comes from analysis. I believe data analytics is a good fit for the skills I am learning in this class, as well as for my previous experience in the business world, where I worked primarily with executive-level clients. Translating complex knowledge gained from data effectively for people who are not necessarily data people is an area where I feel confident. I have really enjoyed the work we have done in SQL, and specifically in Python with data cleaning and visualization. Data analysis seems like it would be interesting and engaging work.