# 1. Look up SMOTE oversampling  https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html 
# a. Describe what it is in your own words in markdown.
# b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.

SMOTE (Synthetic Minority Over-Sampling Technique) is a method to over sample the minority class in an imbalanced dataset so that the minority class can have the same or similar amount of samples as the majority class. SMOTE duplicates samples from existing samples in the minority class in the training dataset. 

A random sample from the minority class is chosen, then the k of the nearest neighbors to the random sample are found. A synthetic or duplicate example is created at a randomly selected point between the two examples in the feature space.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
diabetes_df = pd.read_csv('../week_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
#checking the samples for each outcome: 0 (doesn't have diabetes) or 1 (does have diabetes)
diabetes_df["Outcome"].value_counts()

#the classes are imbalanced

0    500
1    268
Name: Outcome, dtype: int64

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
# dividing data into X and y variables
X = diabetes_df.drop('Outcome', axis=1) #features
y = diabetes_df['Outcome'] #target

# train_test_split
# example of why to use stratify: https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/#:~:text=We%20can%20achieve%20this%20by,the%20provided%20%E2%80%9Cy%E2%80%9D%20array.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=24, stratify=y)

#Standardize
# you can standardize the X's before or after train_test_split
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

### Using SMOTE to handle the class imbalance

In [5]:
from imblearn.over_sampling import SMOTE

#instantiate SMOTE
smote = SMOTE(random_state = 24)
#this is where we apply the resampling technique to our data
#passing in the standardized data into the fit_resample function
# leaving y_train as is because we don't need to standardize what we're trying to predict
X_resampled, y_resampled = smote.fit_resample(X_train_scaler, y_train)



In [6]:
#train our model using resampled data. This done after preprocessing
model = LogisticRegression(random_state=24)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=24)

In [7]:
#calculate accuracy
#the balanced_accuracy score is assuming that the model is balanced
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.7318518518518519

In [8]:
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.76      0.70      0.79      0.73      0.54       150
          1       0.61      0.70      0.76      0.66      0.73      0.53        81

avg / total       0.75      0.74      0.72      0.74      0.73      0.54       231



### In comparison to the simpliest regression approach, SMOTE improved the true positives in recall, so it provides a better understanding of those who have diabetes. In comparison to the random over sampler approach, SMOTE does not perform as well in identifying  the true postives.

# 2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number.

## Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.
Examples:

16 --> 1+6=7

942 --> 9+4+2=15 --> 1+5=6

132189 --> 1+3+2+1+8+9=24 --> 2+4=6 493193 --> 4+9+3+1+9+3=29 --> 2+9=11

In [11]:
#https://www.geeksforgeeks.org/recursive-functions/

def rec_digit_sum(n):
    # create condition to return a single digit
    if len(n)  == 1:
        return int(n)
    #base case for termination condition
    total = 0
    #create for loop to add all the digits in the number
    for i in str(n):
        total += int(i)

    return rec_digit_sum(str(total))
rec_digit_sum(str(16))
rec_digit_sum(str(942))
rec_digit_sum(str(547890))

6