
In this lab exercise, you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

NOTE: You are not required to carry out any data preprocessing steps to improve prediction performance.

In [184]:
# Import data from your Drive

In [185]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [186]:
import os
os.chdir('/content/drive/MyDrive')

In [187]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt


### Question 1
Import the data from `fraud_data.csv` on your Google Drive. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [188]:
df = pd.read_csv('fraud_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21693 entries, 0 to 21692
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      21693 non-null  float64
 1   V2      21693 non-null  float64
 2   V3      21693 non-null  float64
 3   V4      21693 non-null  float64
 4   V5      21693 non-null  float64
 5   V6      21693 non-null  float64
 6   V7      21693 non-null  float64
 7   V8      21693 non-null  float64
 8   V9      21693 non-null  float64
 9   V10     21693 non-null  float64
 10  V11     21693 non-null  float64
 11  V12     21693 non-null  float64
 12  V13     21693 non-null  float64
 13  V14     21693 non-null  float64
 14  V15     21693 non-null  float64
 15  V16     21693 non-null  float64
 16  V17     21693 non-null  float64
 17  V18     21693 non-null  float64
 18  V19     21693 non-null  float64
 19  V20     21693 non-null  float64
 20  V21     21693 non-null  float64
 21  V22     21693 non-null  float64
 22

In [189]:
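# Separate the features from the target column
X = df.drop('Class', axis=1)
y = df['Class']

def percentage_fraud():
  # Fraud rows divided by all rows labelled 0 or 1
  results = len(y[y == 1]) / (len(y[y == 1]) + len(y[y == 0]))
  return round(results, 5)

percentage_fraud()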

0.01641
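
Because the target holds only 0 and 1, the fraud fraction can also be read off as the mean of the column. A minimal sketch on a made-up label series (the values are illustrative and are not taken from `fraud_data.csv`):

```python
import pandas as pd

# Hypothetical labels: 1 = fraud, 0 = not fraud (illustrative values only).
labels = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="Class")

# For a 0/1 column, the mean is the share of rows equal to 1.
fraud_fraction = labels.mean()

# value_counts(normalize=True) reports the share of every class at once.
class_shares = labels.value_counts(normalize=True)

print(round(fraud_fraction, 5))
print(class_shares.to_dict())
```

The shortcut only holds when the column has no missing values and no labels other than 0 and 1.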

In [190]:
df = pd.read_csv('fraud_data.csv')

Split the raw data into train and test datasets

In [191]:
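# Features are every column except the last; the last column holds the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# test_size is left at its default, which holds out 25% of the rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)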


### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [192]:
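def classified():
  from sklearn.dummy import DummyClassifier
  from sklearn.metrics import recall_score

  # No strategy is passed, so the library default applies (it differs between
  # scikit-learn versions); strategy='most_frequent' is the majority-class rule.
  clf = DummyClassifier()
  clf.fit(X_train, y_train)
  predict = clf.predict(X_test)
  acc_score = clf.score(X_test, y_test)
  # average='micro' pools both classes, so this equals the accuracy of `predict`
  # rather than the recall of the fraud class alone.
  recall = recall_score(y_test, predict, average='micro')
  return round(acc_score, 5), round(recall, 5)
classified()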



(0.96515, 0.97179)
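
The question describes a majority-class baseline, which in scikit-learn corresponds to `DummyClassifier(strategy="most_frequent")`, scored with the recall of the positive class. The sketch below shows that combination on small synthetic arrays; the `*_demo` names and the data are assumptions made for illustration, not the lab's variables.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, illustrative data: 18 negatives and 2 positives in each split.
X_train_demo = np.zeros((20, 1))
y_train_demo = np.array([0] * 18 + [1] * 2)
X_test_demo = np.zeros((20, 1))
y_test_demo = np.array([0] * 18 + [1] * 2)

# strategy="most_frequent" always predicts the majority class seen in fit().
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
pred = baseline.predict(X_test_demo)

# The default average="binary" scores the positive class (label 1) only.
acc = accuracy_score(y_test_demo, pred)
rec = recall_score(y_test_demo, pred)
print(acc, rec)
```

A baseline of this kind never predicts the minority class, so its accuracy is the share of majority-class rows in the test set and its recall for the minority class is zero.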

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train SVC, decision tree, and k-NN classifiers using the default parameters. What are the accuracy, recall, and precision of each classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [193]:
def SVC():
  from sklearn.metrics import recall_score, precision_score
  from sklearn.svm import SVC  # shadows this function's name inside its body

  clf = SVC()
  clf.fit(X_train, y_train)
  predict = clf.predict(X_test)
  acc_score = clf.score(X_test, y_test)
  # With average='micro', recall and precision both reduce to overall accuracy,
  # so the three returned values are identical.
  recall = recall_score(y_test, predict, average='micro')
  precision = precision_score(y_test, predict, average='micro')
  return round(acc_score, 5), round(recall, 5), round(precision, 5)
SVC()

(0.99004, 0.99004, 0.99004)

In [194]:
def Decision_tree():
  from sklearn.metrics import recall_score, precision_score
  from sklearn import tree

  clf = tree.DecisionTreeClassifier()
  clf.fit(X_train, y_train)
  predict = clf.predict(X_test)
  acc_score = clf.score(X_test, y_test)
  # Micro-averaged recall and precision equal overall accuracy here
  recall = recall_score(y_test, predict, average='micro')
  precision = precision_score(y_test, predict, average='micro')
  return round(acc_score, 5), round(recall, 5), round(precision, 5)
Decision_tree()

(0.99115, 0.99115, 0.99115)

In [195]:
def KNN():
  from sklearn.metrics import recall_score, precision_score
  from sklearn.neighbors import KNeighborsClassifier

  # n_neighbors=2 overrides the library default of 5
  clf = KNeighborsClassifier(n_neighbors=2)
  clf.fit(X_train, y_train)
  predict = clf.predict(X_test)
  acc_score = clf.score(X_test, y_test)
  # Micro-averaged recall and precision equal overall accuracy here
  recall = recall_score(y_test, predict, average='micro')
  precision = precision_score(y_test, predict, average='micro')
  return round(acc_score, 5), round(recall, 5), round(precision, 5)
KNN()

(0.99447, 0.99447, 0.99447)
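
Each of the three results above repeats one number three times. That is what micro-averaging produces on a single-label problem: true positives and false negatives are pooled over every class, so each sample is counted once and the ratio collapses to accuracy. The sketch below works through this by hand on made-up labels (illustrative values only) and contrasts it with the recall of the positive class alone, which is what scikit-learn's default `average='binary'` reports.

```python
# Illustrative labels: 1 = positive (rare) class, 0 = negative class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
pairs = list(zip(y_true, y_pred))
labels = sorted(set(y_true))

# Per-class true positives and false negatives.
tp = {c: sum(t == c and p == c for t, p in pairs) for c in labels}
fn = {c: sum(t == c and p != c for t, p in pairs) for c in labels}

accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Micro-averaging sums the counts over all classes before dividing.
micro_recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))

# Recall of the positive class only.
positive_recall = tp[1] / (tp[1] + fn[1])

print(accuracy, micro_recall, positive_recall)  # 0.8 0.8 0.5
```

On an imbalanced problem the two can differ widely, because the majority class dominates the pooled counts.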

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [196]:
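def SVC1():
  from sklearn.metrics import confusion_matrix
  from sklearn.svm import SVC

  clf = SVC(C=1e9, gamma=1e-07)
  clf.fit(X_train, y_train)
  # Predict fraud wherever the decision function exceeds the -220 threshold
  y_pred = clf.decision_function(X_test) > -220
  # Rows are true classes, columns are predicted classes
  matrix = confusion_matrix(y_test, y_pred)
  return matrix
SVC1()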

array([[5320,   24],
       [  14,   66]])
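
The matrix above is enough to work out the fraud-class metrics by hand. scikit-learn places true classes in rows and predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`. A short worked computation using the four reported counts:

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]].
matrix = [[5320, 24],
          [14, 66]]
(tn, fp), (fn, tp) = matrix

# Precision: share of predicted frauds that really are fraud.
precision = tp / (tp + fp)
# Recall: share of actual frauds that were caught.
recall = tp / (tp + fn)
# Accuracy: share of all test rows classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(accuracy, 5), round(recall, 5), round(precision, 5))
# 0.99299 0.825 0.73333
```

Accuracy is above 99% even though 14 of the 80 fraud cases are missed, which is why recall and precision are reported alongside it.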

### Question 5

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [197]:
def logistic_regression():
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import precision_recall_curve, roc_curve

  lr = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)
  # predict() returns hard 0/1 labels, so both curves are built from labels;
  # probability estimates would come from lr.predict_proba(X_test)[:, 1].
  lr_predicted = lr.predict(X_test)
  precision, recall, thresholds = precision_recall_curve(y_test, lr_predicted)
  fpr_lr, tpr_lr, _ = roc_curve(y_test, lr_predicted)

  # Take the curve point whose threshold is nearest 0 and return its
  # (recall, precision) pair; the ROC arrays are not used in the result.
  closest_zero = np.argmin(np.abs(thresholds))
  closest_zero_p = precision[closest_zero]
  closest_zero_r = recall[closest_zero]
  return closest_zero_r, closest_zero_p
logistic_regression()

(1.0, 0.014749262536873156)
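
The question asks for curves built from probability estimates and for two readings taken off those curves. The sketch below shows one way to do that, assuming a synthetic imbalanced dataset from `make_classification` in place of the fraud data; the `*_demo` names, the nearest-point lookup, and the linear interpolation are illustrative choices, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud set (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_pos = model.predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, proba_pos)
fpr, tpr, _ = roc_curve(y_te, proba_pos)

# Recall at the curve point whose precision is nearest 0.75.
recall_at_p75 = recall[np.argmin(np.abs(precision - 0.75))]
# True positive rate, linearly interpolated along the ROC curve at FPR = 0.16.
tpr_at_fpr16 = np.interp(0.16, fpr, tpr)

print(recall_at_p75, tpr_at_fpr16)
```

Plotting `recall` against `precision` and `fpr` against `tpr` with matplotlib gives the two curves to read visually; the numeric lookups are approximations of the same readings.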

### Question 6

Repeat the five tasks above with the `titanic` data.

In [198]:
import os
os.chdir('/content/drive/MyDrive/pandas-lecture/data')

In [199]:
df = pd.read_csv('titanic.csv')
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [200]:
df.dropna(subset=['Embarked'], axis=0, inplace=True)  # drop rows with a missing Embarked value
df['Age'] = df['Age'].interpolate(method='linear', limit_direction='forward')  # fill NaN values in the Age column
df = df.drop(['Cabin', 'Name', 'Ticket'], axis=1)  # drop Name and Ticket, which are not needed, and Cabin, which contains too many NaN values

In [201]:
df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

In [202]:
df

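| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S |
| 887 | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S |
| 888 | 889 | 0 | 3 | female | 22.5 | 1 | 2 | 23.4500 | S |
| 889 | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C |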


In [203]:
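# Preprocess the data: encode the categorical columns as integers
from sklearn.preprocessing import LabelEncoder
# LabelEncoder numbers the sorted category values starting at 0
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])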

In [204]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Sex          889 non-null    int64  
 4   Age          889 non-null    float64
 5   SibSp        889 non-null    int64  
 6   Parch        889 non-null    int64  
 7   Fare         889 non-null    float64
 8   Embarked     889 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 69.5 KB


In [205]:
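# The last column of the cleaned frame is Embarked, so y is the port of
# embarkation and Survived stays among the features in X.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)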
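
A positional split depends on column order. If the goal is to predict survival (an assumption; the exercise does not name the target), selecting the target by name keeps it out of the feature matrix. A small sketch on an illustrative frame whose values are made up:

```python
import pandas as pd

# Tiny illustrative frame; Embarked is the last column, as in the cleaned data.
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Embarked": [2, 0, 2, 2],
})

# Positional slicing takes whatever column happens to be last.
y_by_position = demo.iloc[:, -1]

# Selecting by name pins the target and removes it from the features.
y_by_name = demo["Survived"]
X_by_name = demo.drop(columns=["Survived"])

print(y_by_position.name, y_by_name.name, list(X_by_name.columns))
```

With a binary target such as `Survived`, the positive-class recall and precision are also well defined without any averaging option.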

In [206]:
print(classified())
print(SVC())
print(Decision_tree())
print(KNN())

(0.59193, 0.56951)
(0.73094, 0.73094, 0.73094)
(0.75336, 0.75336, 0.75336)
(0.46188, 0.46188, 0.46188)


