# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than to the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well on the training data but poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), which produces a model that never works well, not even in training.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.
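If even loading the full file is too heavy, one option is to draw the sample while streaming the CSV instead of sampling after `pd.read_csv`. The sketch below uses reservoir sampling with only the standard library. The function name `reservoir_sample_rows` and the tiny in-memory CSV are hypothetical stand-ins, not part of the lab.

```python
import csv
import io
import random


def reservoir_sample_rows(lines, n, seed=21):
    """Return the header and a uniform random sample of up to n data rows.

    Rows are read one at a time (reservoir sampling, "Algorithm R"),
    so memory use is bounded by n rather than by the file size.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    sample = []
    for i, row in enumerate(reader):
        if i < n:
            # Fill the reservoir with the first n rows.
            sample.append(row)
        else:
            # Keep the new row with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                sample[j] = row
    return header, sample


# Tiny in-memory stand-in for the real CSV file (hypothetical values).
demo_csv = io.StringIO(
    "step,type,amount,isFraud\n"
    + "".join(f"{i},PAYMENT,{i * 10.0},0\n" for i in range(1000))
)
header, rows = reservoir_sample_rows(demo_csv, n=50)
print(header, len(rows))
```

With the real data, the same function could be called on an open file handle (`open(path, newline="")`) with `n=100000`. pandas users can get a similar effect with the `chunksize` parameter of `pd.read_csv`.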

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [36]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\Ironhack\Exercises\week_7_labs\lab-imbalance\paysim1\PS_20174392719_1491204439457_log.csv')

# Sample the dataset
df_sample = df.sample(n=100000, random_state=21)

df_sample.sample(20)



Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3282704,251,CASH_IN,63209.92,C272646117,5115090.12,5178300.04,C560187261,372993.87,309783.95,0,0
2989319,231,CASH_IN,14326.29,C1440553615,2922696.18,2937022.46,C1102815063,452845.26,438518.97,0,0
2832367,226,CASH_IN,500785.63,C340065287,263503.59,764289.22,C887794689,2934424.39,2433638.76,0,0
4277956,307,PAYMENT,19811.6,C1883909966,0.0,0.0,M933053216,0.0,0.0,0,0
743909,38,PAYMENT,508.63,C263254614,70634.0,70125.37,M1851666639,0.0,0.0,0,0
5073341,355,PAYMENT,1385.59,C1642428487,11451.0,10065.41,M1895995486,0.0,0.0,0,0
1552178,154,CASH_OUT,219376.79,C1176950032,239559.0,20182.21,C541994829,144526.58,363903.36,0,0
93761,10,PAYMENT,2083.58,C401278433,9742.89,7659.32,M564325674,0.0,0.0,0,0
1311291,136,CASH_OUT,267649.29,C1052897940,128903.0,0.0,C173195834,322.0,267971.29,0,0
2849836,227,PAYMENT,43669.84,C1169004079,18848.0,0.0,M471183640,0.0,0.0,0,0


### What is the distribution of the outcome? 

In [37]:
# Generate the proportion of fraudulent transactions
fraud_proportion = df_sample['isFraud'].value_counts(normalize=True) * 100

fraud_proportion


isFraud
0    99.862
1     0.138
Name: proportion, dtype: float64
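These proportions set the bar that any accuracy figure has to clear. As a rough sketch, the arithmetic below shows what a "model" that always predicts the majority class would score on a 100,000-row sample with this class balance. The counts are derived from the percentages above. The actual test split will differ slightly.

```python
# Class counts implied by the proportions above for a 100,000-row sample.
n_total = 100_000
n_fraud = 138               # 0.138 % of 100,000
n_legit = n_total - n_fraud

# A "model" that always predicts the majority class (0 = not fraud).
true_negatives = n_legit    # every legitimate transaction is "correct"
false_negatives = n_fraud   # every fraud case is missed
true_positives = 0

baseline_accuracy = true_negatives / n_total
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"baseline accuracy: {baseline_accuracy:.5f}")
print(f"baseline recall on fraud: {baseline_recall:.1f}")
```

So an accuracy near 99.86% can be reached without detecting a single fraudulent transaction.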

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [38]:
df_sample['step'].nunique()

473

In [39]:
missing_values = df_sample.isnull().sum()

# `step` is already an integer time index with no missing values, so it is kept as a plain numeric feature
# for the logistic regression model and any subsequent models.

missing_values

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
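One alternative to keeping `step` as a raw integer is to split it into a day counter and a cyclic hour-of-day feature. The sketch below assumes each step is one hour of simulated time, which is how the dataset's Kaggle description presents it (worth verifying). Whether step 1 falls at midnight is not known, so `hour_of_day` is only a position within a 24-step cycle. The function name `split_step` is hypothetical.

```python
def split_step(step):
    """Split an hourly step counter into (day, hour_of_day).

    Assumes one step equals one hour; the alignment of hour 0 with
    midnight is an assumption, not something taken from the data.
    """
    day, hour_of_day = divmod(step, 24)
    return day, hour_of_day


for step in (1, 24, 25, 251):
    print(step, split_step(step))
```

On the DataFrame, the equivalent columns could be built with `df_sample['step'] // 24` and `df_sample['step'] % 24`.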

### Run a logistic regression classifier and evaluate its accuracy.

In [40]:
# Encoding the 'type' categorical variable
le = LabelEncoder()
df_sample['type_encoded'] = le.fit_transform(df_sample['type'])

# Preparing features and target variable
X = df_sample.drop(['isFraud', 'type', 'nameOrig', 'nameDest'], axis=1)
y = df_sample['isFraud']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression Classifier
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train, y_train)

# Predictions and accuracy
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy


0.99905
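The cell above encodes `type` with `LabelEncoder`, which maps each category to an arbitrary integer. A linear model then treats those integers as ordered quantities, which may not be meaningful for transaction types. One-hot encoding avoids this by giving each category its own 0/1 column. The pure-Python sketch below only illustrates the idea; the helper `one_hot` is hypothetical, and in practice `pd.get_dummies` or the already imported `OneHotEncoder` would normally be used.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# A few transaction types taken from the sample output shown earlier.
types = ["CASH_IN", "PAYMENT", "CASH_OUT", "PAYMENT"]
categories, encoded = one_hot(types)
print(categories)
for t, row in zip(types, encoded):
    print(t, row)
```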

In [45]:
accuracy2 = log_reg.score(X_train, y_train)
accuracy2

0.998975

### Now pick a model of your choice and evaluate its accuracy.

In [41]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

# Predictions and accuracy
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

accuracy_rf


0.99955
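Both accuracies sit within a fraction of a percent of each other, so accuracy alone says little about how well either model finds the rare fraud cases. Precision and recall on the fraud class are more informative. The sketch below computes them from scratch on hypothetical labels (not the lab's actual predictions) to show how two models with near-identical accuracy can differ sharply. The helper `fraud_metrics` and both prediction lists are made up for illustration.

```python
def fraud_metrics(y_true, y_pred):
    """Return accuracy, precision and recall, treating label 1 (fraud) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical data: 1,000 transactions, 4 of them fraudulent.
y_true = [1] * 4 + [0] * 996
never_flags = [0] * 1000                        # predicts "not fraud" every time
catches_most = [1, 1, 1, 0] + [1] + [0] * 995   # finds 3 of 4, with 1 false alarm

metrics_a = fraud_metrics(y_true, never_flags)
metrics_b = fraud_metrics(y_true, catches_most)
print(metrics_a)
print(metrics_b)
```

In scikit-learn, the same quantities are available through `precision_score`, `recall_score`, `confusion_matrix`, and `classification_report` in `sklearn.metrics`, which could be applied to `y_test` and each model's predictions.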

### Which model worked better and how do you know?

In [42]:
'''
Between the Logistic Regression and Random Forest classifiers, the Random Forest model scored higher on the test set, with an accuracy of 99.955% compared to 99.905% for the Logistic Regression model.

However, accuracy alone is weak evidence here: only 0.138% of the sampled transactions are fraudulent, so a model that never predicts fraud would already score about 99.86%. The small gap in accuracy suggests, but does not prove, that the Random Forest identifies fraudulent transactions better; confirming this would require precision and recall on the fraud class. If the advantage holds, it could be due to the Random Forest's ability to model complex interactions between features.
'''
