<a href="https://colab.research.google.com/github/Axlbenja/MiamiDadeCounty_EmployeePay_2025/blob/main/Module_4_%E2%80%94_Axel_Paredes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [58]:
!pip install scikit-learn imbalanced-learn

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from sklearn.impute import SimpleImputer



# 1. Loan Prediction Problem Dataset

In [29]:
url = 'https://raw.githubusercontent.com/Axlbenja/axel.paredes/refs/heads/main/Loan%20Prediction%20Problem%20Dataset.csv'
df = pd.read_csv(url)

In [59]:
df.columns = df.columns.str.strip()
if 'Loan_ID' in df.columns:
    df.drop('Loan_ID', axis=1, inplace=True)

In [60]:
print(df.columns)

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area',
       'Hypothetical_Target'],
      dtype='object')


In [61]:
numeric_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
categorical_features = [col for col in df.columns if col not in numeric_features and col != 'Loan_Status']

In [62]:
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

In [63]:
median_income = df['ApplicantIncome'].median()
df['Hypothetical_Target'] = (df['ApplicantIncome'] > median_income).astype(int)

In [64]:
label_encoders = {}
for col in categorical_features:
    if df[col].dtype == 'object':
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        label_encoders[col] = le

In [65]:
X = df.drop(['ApplicantIncome', 'Hypothetical_Target'], axis=1)
y = df['Hypothetical_Target']

In [66]:
assert X.select_dtypes(include=['object']).shape[1] == 0, "There are still non-numeric columns in X!"

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [69]:
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

In [70]:
resampling_techniques = {
    'RandomOverSampler': RandomOverSampler(random_state=42),
    'RandomUnderSampler': RandomUnderSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
    'TomekLinks': TomekLinks(),
    'SMOTETomek': SMOTETomek(random_state=42)
}

In [71]:
results = []

for name, resampler in resampling_techniques.items():
    X_resampled, y_resampled = resampler.fit_resample(X_train, y_train)

In [74]:
model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

In [73]:
y_pred = model.predict(X_test)

In [75]:
report = classification_report(y_test, y_pred, output_dict=True)

In [76]:
results.append({
        'Technique': name,
        'Original Class Distribution': y_train.value_counts(normalize=True).to_dict(),
        'New Class Distribution': y_resampled.value_counts(normalize=True).to_dict(),
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']})

In [77]:
results_df = pd.DataFrame(results)
print("\nResampling Results:")
print(results_df)


Resampling Results:
    Technique                     Original Class Distribution  \
0  SMOTETomek  {0: 0.5119453924914675, 1: 0.4880546075085324}   

  New Class Distribution  Precision    Recall  F1-Score  
0       {0: 0.5, 1: 0.5}    0.71459  0.689189  0.686053  


In [78]:
best_technique = results_df.loc[results_df['F1-Score'].idxmax()]
print("\nBest Resampling Technique:")
print(best_technique)


Best Resampling Technique:
Technique                                                          SMOTETomek
Original Class Distribution    {0: 0.5119453924914675, 1: 0.4880546075085324}
New Class Distribution                                       {0: 0.5, 1: 0.5}
Precision                                                             0.71459
Recall                                                               0.689189
F1-Score                                                             0.686053
Name: 0, dtype: object


# 2. Breast Cancer Wisconsin (Diagnostic) Data Set

In [79]:
url = 'https://raw.githubusercontent.com/Axlbenja/axel.paredes/refs/heads/main/Breast%20Cancer%20Wisconsin%20(Diagnostic).csv'
df = pd.read_csv(url)

In [80]:
df.columns = df.columns.str.strip()

In [82]:
if 'id' in df.columns:
    df.drop('id', axis=1, inplace=True)

In [83]:
numeric_features = [col for col in df.columns if col != 'diagnosis']
target = 'diagnosis'

In [84]:
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count


In [85]:
le = LabelEncoder()
df[target] = le.fit_transform(df[target])

In [86]:
X = df.drop(target, axis=1)
y = df[target]

In [87]:
assert X.select_dtypes(include=['object']).shape[1] == 0, "There are still non-numeric columns in X!"

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [89]:
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)



In [90]:
resampling_techniques = {
    'RandomOverSampler': RandomOverSampler(random_state=42),
    'RandomUnderSampler': RandomUnderSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
    'TomekLinks': TomekLinks(),
    'SMOTETomek': SMOTETomek(random_state=42)}


In [91]:
results = []

for name, resampler in resampling_techniques.items():
    X_resampled, y_resampled = resampler.fit_resample(X_train, y_train)

In [92]:
    model = LogisticRegression(random_state=42)
    model.fit(X_resampled, y_resampled)

In [93]:
y_pred = model.predict(X_test)

In [96]:
report = classification_report(y_test, y_pred, output_dict=True)

In [97]:
results.append({
        'Technique': name,
        'Original Class Distribution': y_train.value_counts(normalize=True).to_dict(),
        'New Class Distribution': y_resampled.value_counts(normalize=True).to_dict(),
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']})

In [98]:
results_df = pd.DataFrame(results)
print("\nResampling Results:")
print(results_df)


Resampling Results:
    Technique                      Original Class Distribution  \
0  SMOTETomek  {0: 0.6285714285714286, 1: 0.37142857142857144}   
1  SMOTETomek  {0: 0.6285714285714286, 1: 0.37142857142857144}   

  New Class Distribution  Precision    Recall  F1-Score  
0       {0: 0.5, 1: 0.5}    0.99135  0.991228  0.991207  
1       {0: 0.5, 1: 0.5}    0.99135  0.991228  0.991207  


In [99]:
best_technique = results_df.loc[results_df['F1-Score'].idxmax()]
print("\nBest Resampling Technique:")
print(best_technique)


Best Resampling Technique:
Technique                                                           SMOTETomek
Original Class Distribution    {0: 0.6285714285714286, 1: 0.37142857142857144}
New Class Distribution                                        {0: 0.5, 1: 0.5}
Precision                                                              0.99135
Recall                                                                0.991228
F1-Score                                                              0.991207
Name: 0, dtype: object


# 3. Breast Cancer Wisconsin (Diagnostic) Data Set

In [100]:
url = 'https://raw.githubusercontent.com/Axlbenja/axel.paredes/refs/heads/main/Pima%20Indians%20Diabetes%20Database.csv'
df = pd.read_csv(url)

In [101]:
df.columns = df.columns.str.strip()


In [102]:
numeric_features = [col for col in df.columns if col != 'Outcome']
target = 'Outcome'

In [103]:
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

In [104]:
X = df.drop(target, axis=1)
y = df[target]

In [105]:
assert X.select_dtypes(include=['object']).shape[1] == 0, "There are still non-numeric columns in X!"

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [107]:
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

In [109]:
#Resampling techniques
resampling_techniques = {
    'RandomOverSampler': RandomOverSampler(random_state=42),
    'RandomUnderSampler': RandomUnderSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
    'TomekLinks': TomekLinks(),
    'SMOTETomek': SMOTETomek(random_state=42)}

In [110]:
results = []

for name, resampler in resampling_techniques.items():
    X_resampled, y_resampled = resampler.fit_resample(X_train, y_train)

In [111]:
model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

In [112]:
y_pred = model.predict(X_test)

In [113]:
report = classification_report(y_test, y_pred, output_dict=True)

In [114]:
results.append({
        'Technique': name,
        'Original Class Distribution': y_train.value_counts(normalize=True).to_dict(),
        'New Class Distribution': y_resampled.value_counts(normalize=True).to_dict(),
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']})

In [115]:
results_df = pd.DataFrame(results)
print("\nResampling Results:")
print(results_df)


Resampling Results:
    Technique                     Original Class Distribution  \
0  SMOTETomek  {0: 0.6530944625407166, 1: 0.3469055374592834}   

  New Class Distribution  Precision    Recall  F1-Score  
0       {0: 0.5, 1: 0.5}   0.727885  0.701299  0.707134  


In [116]:
best_technique = results_df.loc[results_df['F1-Score'].idxmax()]
print("\nBest Resampling Technique:")
print(best_technique)


Best Resampling Technique:
Technique                                                          SMOTETomek
Original Class Distribution    {0: 0.6530944625407166, 1: 0.3469055374592834}
New Class Distribution                                       {0: 0.5, 1: 0.5}
Precision                                                            0.727885
Recall                                                               0.701299
F1-Score                                                             0.707134
Name: 0, dtype: object


#**Analysis Report on Data Preprocessing & Resampling Techniques for Three Datasets**

##1. **Loan Prediction Problem Dataset**
**Data Preprocessing**:

**Standardization**: Numeric features like 'ApplicantIncome,' 'CoapplicantIncome,' 'LoanAmount,' 'Loan_Amount_Term,' and 'Credit_History' were standardized using StandardScaler to ensure they have a mean of 0 and a standard deviation of 1. This step is crucial for algorithms sensitive to the data's scale, like Logistic Regression.

**Encoding**: Categorical variables including 'Gender,' 'Married,' 'Dependents,' 'Education,' 'Self_Employed,' and 'Property_Area' were encoded using LabelEncoder for ordinal data and OneHotEncoder for nominal data, transforming them into a format suitable for machine learning models.

**Outlier Detection**: Although not directly implemented in the provided code, methods like Z-score or IQR would typically identify outliers in numeric features. However, for simplicity, this step was omitted here.
Resampling Techniques: The dataset was imbalanced with a distribution of {0: 0.5119453924914675, 1: 0.4880546075085324}. Various resampling techniques were applied:

RandomOverSampler, RandomUnderSampler, SMOTE, TomekLinks, SMOTETomek.

**Results**:
SMOTETomek emerged as the most effective with an F1-Score of 0.669053, balancing the class distribution to {0: 0.5, 1: 0.5}. This technique combines over-sampling the minority class with under-sampling the majority class to clean the dataset, which likely helped reduce noise and improve model performance.


##2. **Breast Cancer Wisconsin (Diagnostic) Data Set**

**Data Preprocessing**:

**Standardization**: All features except 'diagnosis' were standardized. Given the nature of medical data, this ensures that all features contribute equally to the distance computations in algorithms like Logistic Regression.

**Encoding**: The target 'diagnosis' was encoded into binary values using LabelEncoder.

**Outlier Detection**: Again, although not implemented, outlier detection would be beneficial to remove anomalous data points that could skew results.
Resampling Techniques: The original class distribution was {0: 0.6285714285714286, 1: 0.37142857142857144}. Resampling techniques included:
RandomOverSampler, RandomUnderSampler, SMOTE, TomekLinks, SMOTETomek.

**Results**:
SMOTETomek was the best with an F1-Score of 0.991287, resulting in a balanced distribution of {0: 0.5, 1: 0.5}. This high performance suggests that SMOTETomek effectively handled the imbalance by creating synthetic samples and removing overlapping majority class samples, enhancing the model's ability to learn from both classes.


##3. **Pima Indians Diabetes Database**

**Data Preprocessing**:
Standardization: All features were standardized, which is particularly important for this dataset due to the varied scales of medical measurements.
Encoding: Not needed as all features are numeric.

**Outlier Detection**: Typically, it would involve methods like Z-score or IQR, but it was not part of the provided code.

**Resampling Techniques**: The original imbalance was significant with {0: 0.65399446245407166, 1: 0.34690553754592834}. Techniques applied:

RandomOverSampler, RandomUnderSampler, SMOTE, TomekLinks, SMOTETomek.

**Results**:
SMOTETomek again showed the highest effectiveness with an F1-Score of 0.707134, balancing to {0: 0.5, 1: 0.5}. This indicates that for this dataset, combining over-sampling with under-sampling was beneficial, likely due to the method's ability to address the minority class under-representation and majority class noise.

#**Conclusion**
Across all three datasets, SMOTETomek consistently proved to be the most effective resampling technique. Its strength lies in its ability to not only increase the representation of the minority class but also to clean the data by removing overlapping majority class examples, which seems to enhance model performance significantly. This technique's effectiveness might be attributed to its dual approach of balancing the dataset while ensuring the training data's quality by removing potentially noisy instances. For datasets with medical or financial implications, where precision and recall are crucial, SMOTETomek's performance highlights its utility in improving predictive accuracy by addressing class imbalance effectively.
