1. Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [2]:
#combination of random oversampling and undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [3]:
diabetes_df = pd.read_csv('../week_13/diabetes.csv')
diabetes_df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [4]:
#looking for null values in the df
diabetes_df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:
diabetes_df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [6]:
#Defining X and Y variables
X=diabetes_df.drop('Outcome',axis=1) #Features
Y=diabetes_df['Outcome'].values #Target


#SMOTE (Synthetic Minority Oversampling Technique) creates synthetic minority-class samples; ENN (Edited Nearest Neighbours) then removes noisy samples. SMOTE-ENN applies the two steps in sequence.
#sampling_strategy for EditedNearestNeighbours is set to 'all' because the purpose of ENN here is to delete observations from both classes whose class differs from the majority class of their K nearest neighbours.
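The two stages described above can be sketched in plain Python. This is only an illustration on made-up 2-D points, not imblearn's implementation: the helper names (`nearest`, `smote`, `enn`), the toy data, and `k=3` are all assumptions, and the cleaning step uses the majority-vote rule described in the comment above.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        t = rng.random()  # position along the segment between the two points
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def enn(points, labels, k):
    """Keep only points whose label matches the majority label of their
    k nearest neighbours (applied to both classes)."""
    keep = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest(i, points, k))
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


rng = random.Random(0)
# Hypothetical toy data: 12 majority points (class 0), 4 minority points (class 1).
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(12)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(4)]

# Step 1 (oversample): add synthetic minority points until the classes are equal.
minority += smote(minority, len(majority) - len(minority), k=3, rng=rng)
points = majority + minority
labels = [0] * len(majority) + [1] * len(minority)
after_smote = Counter(labels)

# Step 2 (clean): remove points that disagree with their neighbourhood.
points, labels = enn(points, labels, k=3)
after_enn = Counter(labels)

print("after SMOTE:", dict(after_smote))
print("after ENN:  ", dict(after_enn))
```

After step 1 the classes are the same size; step 2 can only shrink them, so the final counts may be slightly unequal.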

In [7]:
##Using SMOTE-ENN to balance the data
#Define model
model=AdaBoostClassifier()
#Define SMOTE-ENN
resample=SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='all'))
#Define pipeline
pipeline=Pipeline(steps=[('r', resample), ('m', model)])
#Define evaluation procedure (here we use Repeated Stratified K-Fold CV)
cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
#Evaluate model
scoring=['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X, Y, scoring=scoring, cv=cv, n_jobs=-1)

# summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

Mean Accuracy: 0.7314
Mean Precision: 0.7278
Mean Recall: 0.7461


2. Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

Oversampling results (from week_15/Inclass_practice.ipynb):
Precision - 0.76
Recall - 0.74
Accuracy - 0.74
Undersampling results:
Precision - 0.76
Recall - 0.74
precision = True Positives / (True Positives + False Positives)
With SMOTEENN (precision 0.73, recall 0.75, accuracy 0.73), precision dropped by about 0.03 and accuracy by about 0.01 compared with oversampling alone.
Recall stayed roughly the same.
Overall, applying the sampling techniques individually or in combination made little difference on this dataset. SMOTE-ENN is the more advanced technique, but its scores here are no better than those of the simpler approaches.

3. What is outlier detection? Why is it useful? What methods can you use for outlier detection?

An outlier is an observation that deviates from the overall pattern of a dataset or takes an extreme value relative to the other observations, i.e., an abnormal data point. Outlier detection is the process of finding these points, typically while preparing datasets for ML models.
It is useful because outliers can distort summary statistics and model fits. Generally we either remove them or analyse them to see why we got those readings. Finding abnormalities can also help in anticipating unusual events in future data.
Methods to detect outliers are:
    1. Z-score / extreme value analysis / standard deviation: for normally distributed data, about 68% of data points lie within one standard deviation of the mean, 95% within two, and 99.7% within three. Data points more than 3 standard deviations from the mean are considered outliers.
    2. Boxplots: data points that fall above or below the whiskers are considered outliers.
    3. DBSCAN clustering (density-based spatial clustering of applications with noise): K-means and hierarchical clustering can also be used to detect outliers. DBSCAN takes two parameters, eps (the maximum distance between two samples for them to be considered neighbours) and min_samples (the minimum number of samples within eps for a point to count as a core point), and labels each point as one of:
        a. Core points: points with at least min_samples neighbours within eps.
        b. Border points: points in the same cluster as a core point but with fewer than min_samples neighbours of their own, so they sit at the edge of the cluster.
        c. Noise points: data points that do not belong to any cluster. These are treated as outliers.
    4. Isolation Forest: relies on outliers being few and far from the rest of the observations, so they can be isolated with fewer random splits.
    5. Robust Random Cut Forest
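The first two methods in the list can be sketched with the standard library alone. The data below are made up for illustration, and the 1.5 * IQR fence is the conventional whisker length for boxplots rather than something stated in the answer above.

```python
import statistics

# Toy data (made up): twenty ordinary readings and one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [120]

# Z-score rule: flag points more than 3 standard deviations from the mean.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_outliers = [v for v in values if abs(v - mean) / std > 3]

# Boxplot (IQR) rule: flag points more than 1.5 * IQR beyond the quartiles,
# which is where boxplot whiskers conventionally end.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

Both rules flag only the value 120 here. On small samples the z-score rule can miss outliers, because a single extreme value also inflates the standard deviation it is measured against.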
    


4. Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 
You can use this code, but otherwise you follow standard practices we have already used many times: 
from sklearn.svm import SVC
classifier = SVC(kernel='linear')

In [8]:
import numpy as np
import pandas as pd

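# Note: with names=True, genfromtxt reads the first line after skip_header as
# field names, so this call skips the first row, turns the second row into
# column names, and (via skip_footer=1) drops the last row as well.
data_file = np.genfromtxt('/Users/trimpu/Downloads/australian.dat',
                     skip_header=1,
                     skip_footer=1,
                     names=True,
                     dtype=None,
                     delimiter=' ')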
#data_file.shape
df = pd.DataFrame(data_file)
df.head(3)

Unnamed: 0,0,2267,7,2,8,4,0165,0_1,0_2,0_3,0_4,2_1,160,1,0_5
0,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
1,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1
2,1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1
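The column names in the output above (0, 2267, 7, ...) look like a data row, which suggests the file has no header line. A minimal sketch of one alternative, assuming the file is space-delimited with 15 columns and no header: read it with pandas `read_csv` using `header=None` and supply the A1..A15 names directly, so no rows are lost. The three rows below are copied from the output above and stand in for the real file.

```python
import io

import pandas as pd

# Stand-in for australian.dat: three rows taken from the output shown above.
sample = io.StringIO(
    "0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n"
    "0 21.67 11.5 1 5 3 0.0 1 1 11 1 2 0 1 1\n"
    "1 20.17 8.17 2 6 4 1.96 1 1 14 0 2 60 159 1\n"
)

# Column names A1..A15; A15 is the class label (last column).
columns = [f"A{i}" for i in range(1, 16)]

# header=None keeps the first line as data instead of treating it as names.
df = pd.read_csv(sample, sep=" ", header=None, names=columns)
print(df.shape)
print(df.head())
```

For the real data, the `io.StringIO` object would be replaced by the path to the `.dat` file.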


In [9]:
#renaming the columns
df.columns = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14','A15']
df.head(2)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
0,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
1,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1


In [10]:
#checking for null_values. Preprocessing
df.isnull().sum()

A1     0
A2     0
A3     0
A4     0
A5     0
A6     0
A7     0
A8     0
A9     0
A10    0
A11    0
A12    0
A13    0
A14    0
A15    0
dtype: int64

In [11]:
#defining X and y values
X = df.drop('A15', axis=1)
y = df['A15']

In [12]:
#splitting the dataset into training and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)


In [13]:
#fit the linear SVM on the training data, then report accuracy on the training and test sets
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
print(svclassifier.score(X_train, y_train))
print(svclassifier.score(X_test, y_test))


0.8797814207650273
0.8260869565217391
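SVMs are sensitive to feature scale, and the columns here range from 0/1 flags to values in the hundreds. A hedged sketch of one common variant, not what was run above: standardise the features inside a pipeline and fix `random_state` so the split is reproducible. The synthetic data from `make_classification`, the rescaled first column, and the seed values are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data: 200 rows, 14 features.
X, y = make_classification(n_samples=200, n_features=14, random_state=0)
X[:, 0] *= 1000  # mimic one feature on a much larger scale than the rest

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# The scaler is fitted on the training data only, then applied to the test data.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score, test_score)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.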


In [14]:
y_pred = svclassifier.predict(X_test)
y_pred

array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1])

In [15]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[57 13]
 [11 57]]
              precision    recall  f1-score   support

           0       0.84      0.81      0.83        70
           1       0.81      0.84      0.83        68

    accuracy                           0.83       138
   macro avg       0.83      0.83      0.83       138
weighted avg       0.83      0.83      0.83       138




5. How did the SVM model perform? Use a classification report

In [16]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.81      0.83        70
           1       0.81      0.84      0.83        68

    accuracy                           0.83       138
   macro avg       0.83      0.83      0.83       138
weighted avg       0.83      0.83      0.83       138



[[57 13]
 [11 57]]

[[TN  FP]
 [FN  TP]]
(rows = actual class, columns = predicted class, with class 1 as the positive class)


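* Updated comments - The model reaches about 0.83 accuracy on the test set. Precision and recall are nearly equal for both classes (0.81 to 0.84), and the supports (70 vs 68) show that the test set is well balanced. I believe real-world datasets rarely show such a close match between precision and recall.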
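As a check on the report, the class 1 figures can be recomputed from the confusion matrix. This sketch assumes scikit-learn's layout, where rows are actual labels and columns are predicted labels, and treats class 1 as the positive class.

```python
# Confusion matrix printed above: rows are actual labels, columns are predictions.
cm = [[57, 13],
      [11, 57]]
tn, fp = cm[0]  # actual class 0: correctly rejected, wrongly approved
fn, tp = cm[1]  # actual class 1: wrongly rejected, correctly approved

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

The rounded values agree with the class 1 row and the accuracy line of the classification report.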


6. What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

* Interested in playing with data, exploring datasets, visualizing data and making predictions that deliver business value. 
* Any data-related technical job where I can use the skills I learned at LaunchCode and pick up new ones. It needs to be challenging, because I get bored doing the same routine tasks without learning anything new.
* Eventually I want to design and develop machine learning and deep learning systems, and to keep learning data analysis tools.
