# Instead of undersampling or near-miss sampling followed by logistic regression, we can also oversample with SMOTE and then fit a logistic regression.

In [64]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Assuming df contains your original DataFrame with 'Class' as the target variable

# Step 1: Split into features and target variable
X = df.drop('Class', axis=1)
y = df['Class']

# Step 2: Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Perform oversampling using SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

# Step 4: Split oversampled data into training and test sets
# (the split comes after SMOTE, so the test set is balanced too and includes synthetic samples)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Step 5: Train logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Step 6: Make predictions on the test data
y_pred = log_reg.predict(X_test)

# Step 7: Calculate precision and ROC-AUC score
precision_pf = precision_score(y_test, y_pred)
roc_auc_pf = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])  # Extract probabilities for the positive class

print("Precision on test data :", precision_pf)
print("ROC AUC score on test data :", roc_auc_pf)


Precision on test data : 0.9735495039368613
ROC AUC score on test data : 0.9894235687711506
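
Because the split in Step 4 happens after SMOTE, the precision above is measured on a roughly balanced test set, and precision depends on the class mix it is measured on. The cell below is a minimal sketch of that dependence. The values of `TPR` and `FPR` and the 1% prevalence are hypothetical illustration values, not measurements from this notebook.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed TPR/FPR when `prevalence` of the samples are positive."""
    true_pos = tpr * prevalence            # share of all samples that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all samples that are false positives
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point (assumption, not taken from log_reg above).
TPR, FPR = 0.90, 0.02

balanced = precision_at_prevalence(TPR, FPR, prevalence=0.5)   # balanced test set
skewed = precision_at_prevalence(TPR, FPR, prevalence=0.01)    # assumed 1% positives

print(f"precision at 50% positives: {balanced:.3f}")
print(f"precision at  1% positives: {skewed:.3f}")
```

The same classifier shows a much lower precision once positives are rare, so a precision measured on balanced data describes the balanced setting rather than the original skewed one.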


#### Thus we conclude:
1. Logistic regression is the best fit for our data: it can make use of relationships between weakly correlated variables, and it gives the highest ROC-AUC score, about 0.989.
2. t-SNE is the best method for visualising the data, since it reduces the dimensionality while preserving local cluster structure.
3. Sampling can be done in two ways: before splitting into train/test sets, or after splitting and only on the training set. We chose the former because the dataset is heavily skewed to begin with, so we preferred to balance the classes first; otherwise test scores may look misleadingly high, since most values are 0 anyway. We also tried Near Miss sampling, but it reduced the precision because of how it works (it keeps only the majority-class samples nearest to the minority class), so we kept the following two pipelines and compared their ROC-AUC scores:

First pipeline: undersampling, then logistic regression: 0.97 ROC-AUC

Second pipeline: oversampling with SMOTE, then logistic regression: 0.989 ROC-AUC