<a href="https://colab.research.google.com/github/ADRIANVM117/proyectos_propios/blob/main/mini_proyectos/credit_card_fraud_advm/credit_card_fraud_advm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Fraud

This dataset consists of credit card transactions in the western United States. It includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.

Not sure where to begin? Scroll to the bottom to find challenges!

## Data Dictionary

| transdatetrans_time | Transaction DateTime                        |
|---------------------|---------------------------------------------|
| merchant            | Merchant Name                               |
| category            | Category of Merchant                        |
| amt                 | Amount of Transaction                       |
| city                | City of Credit Card Holder                  |
| state               | State of Credit Card Holder                 |
| lat                 | Latitude Location of Purchase               |
| long                | Longitude Location of Purchase              |
| city_pop            | Credit Card Holder's City Population        |
| job                 | Job of Credit Card Holder                   |
| dob                 | Date of Birth of Credit Card Holder         |
| trans_num           | Transaction Number                          |
| merch_lat           | Latitude Location of Merchant               |
| merch_long          | Longitude Location of Merchant              |
| is_fraud            | Whether Transaction is Fraud (1) or Not (0) |

[Source](https://www.kaggle.com/kartik2112/fraud-detection?select=fraudTrain.csv) of dataset. The data was partially cleaned and adapted by DataCamp.

## Don't know where to start?

**Challenges are brief tasks designed to help you practice specific skills:**

- 🗺️ **Explore**: What types of purchases are most likely to be instances of fraud? Consider both product category and the amount of the transaction.
- 📊 **Visualize**: Use a geospatial plot to visualize the fraud rates across different states.
- 🔎 **Analyze**: Are older customers significantly more likely to be victims of credit card fraud?

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has have provided you with data on credit card transactions, including whether or not each transaction was fraudulent.

The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.

---

✍️ _If you have an idea for an interesting Scenario or Challenge, or have feedback on our existing ones, let us know! You can submit feedback by pressing the question mark in the top right corner of the screen and selecting "Give Feedback". Include the phrase "Content Feedback" to help us flag it in our system._

# Libraries

In [2]:
# libraries________________________

# data cleaning and pandas manipulation ------------------------------
import pandas as pd 
import numpy as np 
# visualitions  -----------------------------------------
import matplotlib.pyplot as plt 
import seaborn as sns 

# stats ----------------------------
from scipy import stats 

#sklearn -------------------------------------
# model 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

#pipeline and preprocesing 

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

#model_selection -------------------------------------
from sklearn.model_selection import  train_test_split
from sklearn.model_selection import KFold , cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# metrics 
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

## Let´s see what inside on dataframe 

In [4]:
# type of variables 
df = pd.read_csv('/content/credit_card_fraud.csv') 
df.head(2)

Unnamed: 0,trans_date_trans_time,merchant,category,amt,city,state,lat,long,city_pop,job,dob,trans_num,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:44,"Heller, Gutmann and Zieme",grocery_pos,107.23,Orient,WA,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,49.159047,-118.186462,0.0
1,2019-01-01 00:00:51,Lind-Buckridge,entertainment,220.11,Malad City,ID,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,43.150704,-112.154481,0.0


## Pandas Manipulation 

- First. i will create a new column 'age'  with the columns 'dob' (date of birth) and drop the column 'merchant'

In [5]:
fraud = df.drop('merchant', axis = 1)

fraud['birth'] = pd.to_datetime(fraud['dob'])
fraud['year_birth'] = fraud['birth'].dt.year 
fraud = fraud.drop('dob', axis = 1 )
fraud.head(2)

# we get the age of users 
now = 2022

fraud['years'] = now - fraud['year_birth']
fraud = fraud.drop(['job', 'birth', 'year_birth'], axis = 1)

fraud.head()

Unnamed: 0,trans_date_trans_time,category,amt,city,state,lat,long,city_pop,trans_num,merch_lat,merch_long,is_fraud,years
0,2019-01-01 00:00:44,grocery_pos,107.23,Orient,WA,48.8878,-118.2105,149,1f76529f8574734946361c461b024d99,49.159047,-118.186462,0.0,44
1,2019-01-01 00:00:51,entertainment,220.11,Malad City,ID,42.1808,-112.262,4154,a1a22d70485983eac12b5b88dad1cf95,43.150704,-112.154481,0.0,60
2,2019-01-01 00:07:27,grocery_pos,96.29,Grenada,CA,41.6125,-122.5258,589,413636e759663f264aae1819a4d4f231,41.65752,-122.230347,0.0,77
3,2019-01-01 00:09:03,shopping_pos,7.77,High Rolls Mountain Park,NM,32.9396,-105.8189,899,8a6293af5ed278dea14448ded2685fea,32.863258,-106.520205,0.0,55
4,2019-01-01 00:21:32,misc_pos,6.85,Freedom,WY,43.0172,-111.0292,471,f3c43d336e92a44fc2fb67058d5949e3,43.753735,-111.454923,0.0,55


In [None]:
#from shapely.geometry import Point

#f = fraud.iloc[:,3:7]
#f['geometry'] = f.apply(lambda x: Point((x.long, x.lat)), axis = 1 )
#f.head(2)

  arr = construct_1d_object_array_from_listlike(values)


Unnamed: 0,city,state,lat,long,geometry
0,Orient,WA,48.8878,-118.2105,POINT (-118.2105 48.8878)
1,Malad City,ID,42.1808,-112.262,POINT (-112.262 42.1808)


## 📊 Visualize: Use a geospatial plot to visualize the fraud rates across different states.

## Classification model 

In [6]:
fraud_1 = fraud[['category','amt'	,'city','state','city_pop','is_fraud',	'years']]


fraud_1.head()

Unnamed: 0,category,amt,city,state,city_pop,is_fraud,years
0,grocery_pos,107.23,Orient,WA,149,0.0,44
1,entertainment,220.11,Malad City,ID,4154,0.0,60
2,grocery_pos,96.29,Grenada,CA,589,0.0,77
3,shopping_pos,7.77,High Rolls Mountain Park,NM,899,0.0,55
4,misc_pos,6.85,Freedom,WY,471,0.0,55


In [7]:
fraud_dummies = pd.get_dummies(fraud_1,  columns=['category','state'])
fraud_dummies.head()

Unnamed: 0,amt,city,city_pop,is_fraud,years,category_entertainment,category_food_dining,category_gas_transport,category_grocery_net,category_grocery_pos,...,state_CO,state_HI,state_ID,state_MO,state_NE,state_NM,state_OR,state_UT,state_WA,state_WY
0,107.23,Orient,149,0.0,44,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,220.11,Malad City,4154,0.0,60,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,96.29,Grenada,589,0.0,77,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,7.77,High Rolls Mountain Park,899,0.0,55,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,6.85,Freedom,471,0.0,55,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [8]:
fraud_df = fraud_dummies.dropna(subset=['years', 'is_fraud','city_pop',
'amt'])

In [9]:
fraud_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23094 entries, 0 to 23093
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   amt                      23094 non-null  float64
 1   city                     23094 non-null  object 
 2   city_pop                 23094 non-null  int64  
 3   is_fraud                 23093 non-null  float64
 4   years                    23094 non-null  int64  
 5   category_entertainment   23094 non-null  uint8  
 6   category_food_dining     23094 non-null  uint8  
 7   category_gas_transport   23094 non-null  uint8  
 8   category_grocery_net     23094 non-null  uint8  
 9   category_grocery_pos     23094 non-null  uint8  
 10  category_health_fitness  23094 non-null  uint8  
 11  category_home            23094 non-null  uint8  
 12  category_kids_pets       23094 non-null  uint8  
 13  category_misc_net        23094 non-null  uint8  
 14  category_misc_pos     

In [10]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 


In [11]:
X = fraud_dummies.drop(['city','is_fraud'], axis = 1)
X = X.drop(1 ,axis=0)
y = fraud_df['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

In [12]:
vt = DecisionTreeClassifier(max_depth= 3)

vt.fit(X_train, y_train)

y_predict  = vt.predict(X_test)

score = accuracy_score(y_test, y_predict)

print(score)

0.9909064665127021


In [28]:
dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)

# Use dt_entropy to predict test set labels
y_pred= dt_entropy.predict(X_test)

# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)

# Print accuracy_entropy
print(f'Accuracy achieved by using entropy: {accuracy_entropy:.3f}')

# Print accuracy_gini
print(f'Accuracy achieved by using the gini index: {score:.3f}')

Accuracy achieved by using entropy: 0.990
Accuracy achieved by using the gini index: 0.991


## OOB EVALUATION 

In [36]:
# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth = 3, random_state=1)

# Instantiate bc
bc = BaggingClassifier(base_estimator= dt, n_estimators=50, oob_score= True, random_state=1)

# Fit bc to the training set 
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# Evaluate OOB accuracy
acc_oob =  bc.oob_score_

print('OOB SCORE : {:.5f}'.format(acc_oob))

OOB SCORE : 0.99054


In [None]:
# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf= 8, random_state=1)

# Instantiate bc
bc = BaggingClassifier(base_estimator= dt, 
            n_estimators=50,
            oob_score= True,
            random_state=1)

In [29]:
MSE_CV_scores = - cross_val_score(vt  ,X_train, y_train, cv=10, 
                       scoring='neg_mean_squared_error',
                       n_jobs=-1)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 0.10
