# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data is noisy and we are not careful, we can end up fitting our model to the noise rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model scores well on the training data but poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on training and real data alike.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.
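
One way to draw the sample without loading the whole file is to read it in chunks and keep a random fraction of each chunk. The sketch below assumes this approach and runs on a small synthetic CSV; `sample_csv`, the `demo` frame and its columns are hypothetical stand-ins, not part of the lab.

```python
import io

import numpy as np
import pandas as pd


def sample_csv(source, frac, chunksize=50_000, seed=0):
    """Read a CSV in chunks and keep a random fraction of each chunk."""
    parts = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # A different but reproducible seed for every chunk.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Small synthetic stand-in for the real file (hypothetical columns).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "step": np.arange(1000) // 100 + 1,
    "amount": rng.uniform(1, 1e4, size=1000).round(2),
})
buffer = io.StringIO(demo.to_csv(index=False))

sample = sample_csv(buffer, frac=0.1, chunksize=100)
print(len(sample), sample["step"].nunique())
```

For the real file, `frac` would be chosen so that the expected number of sampled rows is about 100000. Unlike reading only the first rows, this keeps transactions from every time step.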

In [1]:
# Your code here
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### What is the distribution of the outcome? 

In [2]:
# Your response here

#nrows reads only the first 100000 rows of the file (not a random sample)
fraud_detection=pd.read_csv(filepath_or_buffer="PS_20174392719_1491204439457_log.csv",
    sep=',',
    nrows=100000)

fraud_detection.head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0
5,1,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0
6,1,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0
7,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0
8,1,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0
9,1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0


In [3]:
fraud_detection.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            100000 non-null  int64  
 1   type            100000 non-null  object 
 2   amount          100000 non-null  float64
 3   nameOrig        100000 non-null  object 
 4   oldbalanceOrg   100000 non-null  float64
 5   newbalanceOrig  100000 non-null  float64
 6   nameDest        100000 non-null  object 
 7   oldbalanceDest  100000 non-null  float64
 8   newbalanceDest  100000 non-null  float64
 9   isFraud         100000 non-null  int64  
 10  isFlaggedFraud  100000 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 8.4+ MB


In [4]:
#Data Types
fraud_detection.columns.to_series().groupby(fraud_detection.dtypes).groups

{dtype('int64'): Index(['step', 'isFraud', 'isFlaggedFraud'], dtype='object'),
 dtype('float64'): Index(['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
        'newbalanceDest'],
       dtype='object'),
 dtype('O'): Index(['type', 'nameOrig', 'nameDest'], dtype='object')}

In [5]:
#Step Features
fraud_detection[fraud_detection['step']!=1].head(3)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2708,2,PAYMENT,1826.81,C70165127,24622.0,22795.19,M2026706491,0.0,0.0,0,0
2709,2,PAYMENT,7709.2,C520830206,52152.0,44442.8,M351216770,0.0,0.0,0,0
2710,2,PAYMENT,3023.16,C1705281026,178.0,0.0,M1967667267,0.0,0.0,0,0


In [6]:
#nameOrig feature
#len(fraud_detection['nameOrig'].unique())
#Drop it

#nameDest feature
#len(fraud_detection['nameDest'].unique())
#Drop it

#type feature
print("There are "+str(len(fraud_detection['type'].unique()))+" types of payment in the db.")
fraud_detection['type'].unique()
#Keep it: it is one-hot encoded further down

There are 5 types of payment in the db.


array(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'],
      dtype=object)

In [7]:
#Count of isFraud
is_Fraud=fraud_detection[fraud_detection['isFraud']==1]
print("number of frauds: "+str(len(is_Fraud)))
not_Fraud=fraud_detection[fraud_detection['isFraud']==0]
print("number of NOT frauds: "+str(len(not_Fraud)))

#DF by type (numeric columns only, so the name columns are not concatenated)
by_type=fraud_detection.groupby(by='type').sum(numeric_only=True)

number of frauds: 116
number of NOT frauds: 99884
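
A per-type breakdown can hint at which features matter. The sketch below uses only the ten rows printed by `head(10)` above, reduced to two columns, so the rates it prints describe those ten rows and nothing more; on the full sample the same `groupby` would be applied to `fraud_detection`.

```python
import pandas as pd

# The ten rows shown by head(10) above, reduced to two columns.
head_rows = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT",
             "PAYMENT", "PAYMENT", "PAYMENT", "PAYMENT", "DEBIT"],
    "isFraud": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Count, number of frauds and fraud rate for every transaction type.
fraud_by_type = head_rows.groupby("type")["isFraud"].agg(
    n_transactions="count", n_frauds="sum", fraud_rate="mean"
)
print(fraud_by_type)
```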


In [9]:
#ratio of frauds in the n=100000 sample:
print("Fraction of fraudulent transactions in sample: "+str(len(fraud_detection[fraud_detection['isFraud']==1])/len(fraud_detection)))

#With roughly 0.1% frauds, the target is heavily imbalanced

Fraction of fraudulent transactions in sample: 0.00116
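
With so few frauds, accuracy on its own says very little. As a worked check using the two counts printed above, a "model" that always predicts "not fraud" already reaches a very high accuracy while catching no fraud at all:

```python
import numpy as np

n_fraud, n_not_fraud = 116, 99884  # counts printed above

y_true = np.concatenate([np.ones(n_fraud, dtype=int),
                         np.zeros(n_not_fraud, dtype=int)])
y_pred = np.zeros_like(y_true)  # always predict "not fraud"

baseline_accuracy = (y_pred == y_true).mean()
frauds_caught = int(((y_pred == 1) & (y_true == 1)).sum())
baseline_recall = frauds_caught / n_fraud

print(f"accuracy={baseline_accuracy:.5f}, recall={baseline_recall:.1f}")
```

Any classifier evaluated on this sample should be compared against this baseline rather than against 50%.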


### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [10]:
# Your code here

cleaned_fraud_detection=fraud_detection.drop(labels=['nameOrig','nameDest', 'isFlaggedFraud'], axis=1)
cleaned_fraud_detection.head(5)

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,170136.0,160296.36,0.0,0.0,0
1,1,PAYMENT,1864.28,21249.0,19384.72,0.0,0.0,0
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,1
4,1,PAYMENT,11668.14,41554.0,29885.86,0.0,0.0,0


In [11]:
cleaned_fraud_detection=pd.get_dummies(cleaned_fraud_detection, columns=['step','type'])
cleaned_fraud_detection.head(5)

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,step_1,step_2,step_3,step_4,...,step_6,step_7,step_8,step_9,step_10,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,9839.64,170136.0,160296.36,0.0,0.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1864.28,21249.0,19384.72,0.0,0.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,181.0,181.0,0.0,0.0,0.0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,181.0,181.0,0.0,21182.0,0.0,1,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,11668.14,41554.0,29885.86,0.0,0.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
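
One-hot encoding `step` treats every step as an unrelated category and creates columns only for the steps present in the sample. A possible alternative, sketched here under the assumption that one step is one hour and that step 1 is the first hour of the first day, is to derive the hour of day and encode it cyclically. The `steps` frame and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

steps = pd.DataFrame({"step": [1, 2, 10, 24, 25]})  # hypothetical values

# Assumption: one step is one hour and step 1 is hour 0 of day 1.
steps["hour_of_day"] = (steps["step"] - 1) % 24
steps["day"] = (steps["step"] - 1) // 24 + 1

# Cyclic encoding, so that hour 23 and hour 0 end up close together.
angle = 2 * np.pi * steps["hour_of_day"] / 24
steps["hour_sin"] = np.sin(angle)
steps["hour_cos"] = np.cos(angle)

print(steps)
```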


### Run a logistic regression classifier and evaluate its accuracy.

In [12]:
# Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

y=cleaned_fraud_detection['isFraud']
x=cleaned_fraud_detection.drop('isFraud', axis=1)

X_train, X_test, y_train, y_test=train_test_split(x, y, train_size=0.8)

In [13]:
logistic_model=LogisticRegression(n_jobs=-1)

logistic_model=logistic_model.fit(X_train, y_train)
logistic_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
logistic_model_score_train=logistic_model.score(X_train,y_train)
print(f"The score of the logistic regression with the training data is: {logistic_model_score_train}.")

#NOTE: this refits the model on the test data, so the score below is not a true held-out score
logistic_model=logistic_model.fit(X_test, y_test)


logistic_model_score_test=logistic_model.score(X_test,y_test)
print(f"The score of the logistic regression with the testing data is: {logistic_model_score_test}.")

The score of the logistic regression with the training data is: 0.999125.
The score of the logistic regression with the testing data is: 0.9992.
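
For a held-out evaluation, the model is fitted on the training split only, and the split is stratified so that the rare class appears in both parts. The sketch below shows that pattern on synthetic data generated with `make_classification` (not the PaySim sample) and reports a confusion matrix, precision and recall instead of accuracy alone. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with about 2% positives (not the PaySim data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.98, 0.02], random_state=0,
)

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0
)

# Fit on the training split only; the test split is used for scoring only.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
precision = precision_score(y_te, y_hat, zero_division=0)
recall = recall_score(y_te, y_hat, zero_division=0)

print(cm)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```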


In [15]:
#resampling to balance the classes (the majority class is downsampled below)

from sklearn.utils import resample

#concatenate our training data back together
X=pd.concat([X_train,y_train],axis=1)

In [16]:
#Separate minority from majority classes
NOT_Fraud=cleaned_fraud_detection[cleaned_fraud_detection.isFraud==0]
Fraud=cleaned_fraud_detection[cleaned_fraud_detection.isFraud==1]

In [17]:
#downsample majority
NOT_fraud_downsampled=resample(NOT_Fraud,
                         replace=False, #sample without replacement
                         n_samples=len(Fraud), #match number in minority class
                         random_state=27) #reproducible results

#checking if they have the same shape:
print(NOT_fraud_downsampled.isFraud.value_counts())
print(Fraud.isFraud.value_counts())

#stack the two classes row-wise (axis=0); axis=1 would join them side by side and fill the gaps with NaN
final_df=pd.concat([Fraud,NOT_fraud_downsampled],axis=0)

0    116
Name: isFraud, dtype: int64
1    116
Name: isFraud, dtype: int64


In [18]:
y=final_df['isFraud']
x=final_df.drop('isFraud', axis=1)


X_train, X_test, y_train, y_test=train_test_split(x, y, train_size=0.8)

In [19]:
#fit on the balanced training data only, then score on both splits
logistic_model=logistic_model.fit(X_train, y_train)

logistic_model_score_train=logistic_model.score(X_train,y_train)
print(f"The score of the logistic regression with the training data is: {logistic_model_score_train}.")

logistic_model_score_test=logistic_model.score(X_test,y_test)
print(f"The score of the logistic regression with the testing data is: {logistic_model_score_test}.")
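
A stricter variant of the downsampling above is to split first and resample only the training part, so that the test set keeps the original class ratio. The sketch below assumes that workflow and uses a small synthetic frame; `demo`, `train_df`, `test_df` and `balanced_train` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in with about 5% positives (not the PaySim data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "amount": rng.uniform(1, 1e4, size=2000),
    "isFraud": (rng.random(2000) < 0.05).astype(int),
})

# Split first, so the test set is never touched by the resampling.
train_df, test_df = train_test_split(
    demo, train_size=0.8, stratify=demo["isFraud"], random_state=0
)

train_fraud = train_df[train_df["isFraud"] == 1]
train_not_fraud = train_df[train_df["isFraud"] == 0]

# Downsample the majority class without replacement to the minority size.
train_not_fraud_down = resample(
    train_not_fraud, replace=False,
    n_samples=len(train_fraud), random_state=27,
)

# Stack the rows (axis=0) to obtain a balanced training set.
balanced_train = pd.concat([train_fraud, train_not_fraud_down], axis=0)
print(balanced_train["isFraud"].value_counts())
```

Scores computed on `test_df` then reflect the imbalance the model would face in practice.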

### Now pick a model of your choice and evaluate its accuracy.

In [21]:
# Your code here

from sklearn.ensemble import RandomForestClassifier

#n_jobs=-1 means using all processors
RandomForestClassifier_model=RandomForestClassifier(n_jobs=-1)

RandomForestClassifier_model=RandomForestClassifier_model.fit(X_train, y_train)

random_forest_score_test=RandomForestClassifier_model.score(X_test,y_test)
print(f"The score of the random forest with the testing data is: {random_forest_score_test}.")

### Which model worked better and how do you know?

In [2]:
# Your response here
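
To decide which model worked better, both need to be scored on the same held-out split with metrics that account for the rare class. The sketch below compares a logistic regression and a random forest by F1 and average precision on synthetic data; the data, the `candidates` dictionary and the choice of metrics are assumptions made for illustration, not results from the PaySim sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with about 5% positives (not the PaySim data).
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=1
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)                         # train split only
    proba = model.predict_proba(X_te)[:, 1]       # score for the positive class
    results[name] = {
        "f1": f1_score(y_te, model.predict(X_te), zero_division=0),
        "average_precision": average_precision_score(y_te, proba),
    }

for name, scores in results.items():
    print(name, scores)
```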
