# Group Activity, Week 15

## 1. Look up SMOTE oversampling at:
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

### a. Describe what it is in your own words in markdown.


Synthetic Minority Oversampling Technique is a method of oversampling which creates and adds new data points in the minority group of a classification by choosing a random minority data point and connecting it with a nearest neighbor, then creating the new data point along the connection line. One criticism of SMOTE is that it can sometimes add data points close to outliers and thus skew the dataset.

### b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.

Overall, the SMOTE technique performed a little worse across the board than the RandomOverSampler. In the classification report, all of the values are slightly less than they were for the ROS.

In [3]:
import pandas as pd
import numpy as np
import imblearn
from imblearn.over_sampling import SMOTE

diabetes_df = pd.read_csv("../Homework14/diabetes.csv")
diabetes_df.describe()


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42, stratify=y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)


smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_scaler, y_train)

#train using resampled data
model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)


LogisticRegression(random_state=42)

In [5]:
#calculate accuracy
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.7541975308641975

In [10]:
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.84      0.78      0.73      0.81      0.75      0.57       150
          1       0.64      0.73      0.78      0.68      0.75      0.57        81

avg / total       0.77      0.76      0.75      0.76      0.75      0.57       231



## 2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number.

Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.

Examples:
16 --> 1+6=7
942 --> 9+4+2=15 --> 1+5=6
132189 --> 1+3+2+1+8+9=24 --> 2+4=6 493193 --> 4+9+3+1+9+3=29 --> 2+9=11 --> 1+1=2

In [46]:
def rec_digit_sum(n):
    s = 0

    while n:
        s += n % 10
        n //= 10

    if s > 9:
        return rec_digit_sum(s)
    
    return s

n = int(input("Non-negative Integer: "))
print(rec_digit_sum(n))

Positive Integer: 9
9


In [47]:
n = int(input("Non-negative Integer: "))
print(rec_digit_sum(n))

Positive Integer: 0
0


In [48]:
n = int(input("Non-negative Integer: "))
print(rec_digit_sum(n))

Non-negative Integer: 34567
7
