# Anomaly Detection in R

# Mary Donovan Martello

## The goal of this project was to use R to design unsupervised binary classification models that predict whether credit card transactions are fraudulent. This file prepares the dataset for feature selection and for the supervised learning models that serve as a comparison with the unsupervised models.

## Part 2:  Undersample Data for Feature Selection and Supervised Learning

In [2]:
# Import required libraries (each loaded once, with startup messages suppressed)
suppressMessages({
  library(dplyr)
  library(caret)
  library(ggplot2)
  library(caTools)
  library(ROSE)
  library(smotefamily)
  library(rpart)
  library(rpart.plot)
  library(psych)
  library(ltm)
  library(corrplot)
  library(e1071)
  library(data.table)
})

In [3]:
# Load the dataset
fulldf <- read.csv("creditcard.csv")

In [4]:
# Convert class to a factor variable
fulldf$Class <- factor(fulldf$Class, levels =  c(0,1))

In [6]:
# Preview the first three rows
fulldf %>% head(3)

Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.07278117,2.5363467,1.3781552,-0.33832077,0.46238778,0.23959855,0.0986979,0.363787,...,-0.01830678,0.2778376,-0.1104739,0.06692807,0.1285394,-0.1891148,0.133558377,-0.02105305,149.62,0
0,1.191857,0.26615071,0.1664801,0.4481541,0.06001765,-0.08236081,-0.07880298,0.08510165,-0.2554251,...,-0.22577525,-0.638672,0.101288,-0.33984648,0.1671704,0.1258945,-0.008983099,0.01472417,2.69,0
1,-1.358354,-1.34016307,1.7732093,0.3797796,-0.50319813,1.80049938,0.79146096,0.24767579,-1.5146543,...,0.24799815,0.7716794,0.9094123,-0.68928096,-0.3276418,-0.1390966,-0.055352794,-0.05975184,378.66,0


In [5]:
dim(fulldf)

### The dataset contains credit card transactions made by European cardholders in September 2013. It has 284,807 records and 31 columns (30 features plus the Class label). The dataset is highly imbalanced: 0.17% of the records are labeled fraud and 99.83% are labeled non-fraud. So that feature selection can better identify the features that are significant in the fraudulent cases, the data is undersampled before feature selection.
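
The imbalance figures quoted above can be checked directly from the label column. The following is a minimal sketch on a small synthetic data frame (`toy_df` is a hypothetical stand-in, not the real `creditcard.csv`); the same two calls should apply to `fulldf$Class`.

```r
# Hypothetical stand-in for the real data: 998 non-fraud rows and 2 fraud rows
toy_df <- data.frame(
  Amount = seq_len(1000),
  Class  = factor(c(rep(0, 998), rep(1, 2)), levels = c(0, 1))
)

# Count the records in each class
class_counts <- table(toy_df$Class)

# Convert the counts to percentages of all records
class_pct <- 100 * prop.table(class_counts)

print(class_counts)
print(round(class_pct, 2))
```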

### Scaling

**Note: The data does not need to be scaled because Time is removed and all but one of the remaining features (Amount) are PCA features.**

### Undersample Data

In [5]:
# Random Under-Sampling (RUS)

# Keep every fraud record and sample the non-fraud records down
# until fraud makes up new_frac_fraud of the undersampled total
n_fraud <- sum(fulldf$Class == 1)  # 492 fraud records in this dataset
new_frac_fraud <- 0.50
new_n_total <- n_fraud / new_frac_fraud

undersampling_result <- ovun.sample(Class ~ .,
                                   data = fulldf,
                                   method = "under",
                                   N = new_n_total,
                                   seed = 123)

undersampled_credit <- undersampling_result$data

table(undersampled_credit$Class)


  0   1 
492 492 
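
To make the effect of the undersampling step concrete, here is a minimal base R sketch of the same idea on synthetic data: keep every minority (fraud) row and draw only as many majority rows as are needed to reach the target total. It is an illustration under stated assumptions, not ROSE's internal implementation; `toy_df` and `toy_undersampled` are hypothetical names.

```r
set.seed(123)  # make the toy sample reproducible

# Hypothetical stand-in: 95 non-fraud rows and 5 fraud rows
toy_df <- data.frame(
  Amount = round(runif(100, 1, 500), 2),
  Class  = factor(c(rep(0, 95), rep(1, 5)), levels = c(0, 1))
)

# Same arithmetic as above: total size at which fraud is 50% of the rows
n_fraud        <- sum(toy_df$Class == 1)
new_frac_fraud <- 0.50
new_n_total    <- n_fraud / new_frac_fraud

# Keep every fraud row; draw the remaining rows from the non-fraud class
fraud_idx     <- which(toy_df$Class == 1)
non_fraud_idx <- which(toy_df$Class == 0)
keep_idx      <- c(fraud_idx, sample(non_fraud_idx, new_n_total - n_fraud))

toy_undersampled <- toy_df[keep_idx, ]
table(toy_undersampled$Class)
```

With the real data the same arithmetic gives 492 / 0.50 = 984 rows in total, which matches the 492/492 split shown in the table output.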

In [6]:
dim(undersampled_credit)

In [8]:
write.csv(undersampled_credit,"C:\\Users\\trave\\1_DSC680_Project3\\smallCreditFraud.csv", row.names = FALSE)
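
Because the exported file is read back in later steps, it may help to confirm that the round trip preserves the data. The sketch below uses a hypothetical four-row data frame and a temporary file rather than the path above; note that `read.csv()` does not restore factor columns, so `Class` is converted back explicitly.

```r
# Hypothetical small data frame standing in for undersampled_credit
toy_undersampled <- data.frame(
  Amount = c(10.5, 20.0, 3.25, 99.9),
  Class  = factor(c(0, 1, 0, 1), levels = c(0, 1))
)

# Write to a temporary file instead of a machine-specific absolute path
out_path <- file.path(tempdir(), "smallCreditFraud_toy.csv")
write.csv(toy_undersampled, out_path, row.names = FALSE)

# Read the file back; Class is no longer a factor, so convert it again
reloaded <- read.csv(out_path)
reloaded$Class <- factor(reloaded$Class, levels = c(0, 1))

str(reloaded)
```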