# Anomaly Detection

In this assignment, you will use anomaly detection techniques to detect credit card fraud using a modified version of a dataset originally from Kaggle. Each row represents one credit card transaction. The features are anonymized, except for the transaction amount.

Your stakeholder reports that they have found that about 0.4% of transactions are fraudulent, and 99.6% are valid.  

Your task is to locate anomalous transactions in this data using KMeans and IsolationForest models.

1. KMeans:
Remember to scale your data.
Fit a KMeans model to create 3 clusters. Please use a random state of 42 for your model.
Use scipy.spatial.distance.cdist to create a matrix of distances between each data point and each cluster center.
Define a list of the indices of the anomalous data using the threshold given by the stakeholder (99.6% valid).
Note that you cannot visualize your clusters directly since this dataset has 29 features. (To visualize, you could apply PCA to reduce the dimensionality to 2 features, but visualization is not required.)
2. Isolation Forest:

Note: If you added any columns to the original dataset in the previous step, be sure to exclude them before fitting your Isolation Forest.
Instantiate and fit an Isolation Forest with the correct contamination value based on the threshold given by the stakeholder (Be careful: 0.4% = 0.004). Please use a random state of 42 for your model.
Define a list of the indices of anomalous data.  Remember that anomalies are marked as -1, and normal data is marked as 1.

3. Compare the list of anomalies from KMeans and Isolation Forest.  

Once you have a list of indices from each model, you can use list comprehension to find the points in common. For example:
# Make a list of anomalies identified by both methods
both = [a for a in iso_anomalies if a in kmeans_anomalies]
Answer the following:
a. How many anomalies did the two approaches agree on?
b. What percentage of the anomalies did the two approaches agree on?
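
The two questions above come down to an intersection and a ratio. The sketch below uses made-up index values, not results from the dataset. Taking the KMeans list as the denominator for the percentage is an assumption; when both models flag 0.4% of the rows, the two lists have the same length and the choice does not matter.

```python
# Hypothetical row indices flagged by each model (not taken from the dataset)
kmeans_anomalies = [3, 17, 42, 58, 91]
iso_anomalies = [17, 23, 42, 77, 91]

# A set intersection finds the shared indices without scanning one list per element
both = sorted(set(kmeans_anomalies) & set(iso_anomalies))

# a. Number of anomalies both approaches flagged
n_both = len(both)

# b. Share of the KMeans anomalies that the Isolation Forest also flagged
pct_both = 100 * n_both / len(kmeans_anomalies)

print(both, n_both, pct_both)
```

The list comprehension shown in the instructions gives the same elements; the set version is simply faster when the lists are long.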

## Import Packages

In [23]:
# import packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

## Load and Clean Data

In [2]:
# import data
df = pd.read_csv('Data/credit_card.csv')

In [3]:
# preview data
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [5]:
# check for missing values
df.isna().sum().sum()

0

In [6]:
# check for duplicate values
df.duplicated().sum()

0

## KMeans

In [7]:
# instantiate scaler
scaler = StandardScaler()

In [8]:
# fit and transform data
scaled_df = scaler.fit_transform(df)
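
Scaling matters here because KMeans relies on Euclidean distance, so a column with a large range (such as Amount) would otherwise dominate the anonymized features. The following is a minimal sketch on synthetic data; the two columns and their distributions are assumptions chosen only to mimic a small-range and a large-range feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one small-range feature and one large-range "amount" feature
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(0, 1, size=1000),      # roughly the scale of the V columns
    rng.exponential(100, size=1000),  # roughly the scale of Amount
])

X_scaled = StandardScaler().fit_transform(X)

# Before scaling the second column has a far larger spread
print(X.std(axis=0))
# After scaling each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```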

In [10]:
# Fit KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, n_init='auto', random_state=42)
kmeans.fit(scaled_df)

In [13]:
# Add the clusters as a column in the dataframe
df['cluster'] = kmeans.labels_
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,cluster
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,2
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,2
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,2
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,2
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,2


In [15]:
# use cdist to find out how far each data point is from each cluster center
from scipy.spatial.distance import cdist
# Calculate distance to each cluster center
distances = cdist(scaled_df, kmeans.cluster_centers_, 'euclidean')

In [19]:
# view shape
distances.shape

(10000, 3)

In [18]:
# view df shape
scaled_df.shape

(10000, 29)
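
The shapes above follow from how cdist works: for n points and k centers it returns an n-by-k matrix whose entry (i, j) is the distance from point i to center j. A small worked example with hand-checkable numbers (the points and centers are invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two points and three hypothetical cluster centers in 2-D
points = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]])

# One row per point, one column per center
demo_distances = cdist(points, centers, "euclidean")

print(demo_distances.shape)
print(demo_distances)
```

The first point sits on the first center (distance 0), is 5 away from the second center (a 3-4-5 triangle), and is 4 away from the third.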

In [21]:
# Saving distances as a dataframe for convenience
cluster_cols = [f"Distance (Cluster {c})" for c in range(len(kmeans.cluster_centers_))]
distance_df = pd.DataFrame(distances, columns=cluster_cols)
distance_df.head()

Unnamed: 0,Distance (Cluster 0),Distance (Cluster 1),Distance (Cluster 2)
0,4.127849,4.695017,2.825513
1,4.287068,3.583366,2.40011
2,7.190748,7.558619,6.579811
3,5.964779,5.715718,4.714446
4,4.132515,4.825469,3.289179


In [24]:
# Get the minimum distance to any cluster for each point
min_distances = np.min(distances, axis=1)
# Display first 5 values
min_distances[:5]

array([2.82551348, 2.40010997, 6.57981134, 4.71444567, 3.28917859])
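
Taking the row-wise minimum is equivalent to measuring each point's distance to its own assigned centroid, because KMeans assigns every point to the nearest center. The sketch below checks this on synthetic blobs; the blob centers and sizes are assumptions, and n_init=10 is used only so the snippet runs on older scikit-learn versions as well.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated blobs standing in for the scaled data
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

d = cdist(X, km.cluster_centers_, "euclidean")
nearest = np.argmin(d, axis=1)  # index of the closest center for each point
min_d = np.min(d, axis=1)       # distance to that closest center

# The closest center should be the cluster the point was assigned to
print((nearest == km.labels_).all())
```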

In [25]:
# Set a threshold based on a percentile
threshold = np.percentile(min_distances, 99.6)
threshold

20.678582375061943

In [26]:
# Identify anomalies where the distance to closest cluster center is above the threshold
filter_anomalies = min_distances > threshold
# how many were found?
filter_anomalies.sum()

40
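
The count of 40 is what the stakeholder's rate predicts: 0.4% of 10,000 rows is 40, and the 99.6th percentile leaves exactly that share above it when the values are distinct. A minimal sketch with synthetic distances (the gamma distribution is an arbitrary assumption, not a model of the real data):

```python
import numpy as np

# Synthetic stand-in for the 10,000 minimum distances
rng = np.random.default_rng(42)
demo_min_distances = rng.gamma(shape=2.0, scale=2.0, size=10_000)

# 99.6% of the values fall at or below this cutoff
demo_threshold = np.percentile(demo_min_distances, 99.6)

# Boolean mask of the points beyond the cutoff, and their row positions
demo_mask = demo_min_distances > demo_threshold
demo_idx = np.flatnonzero(demo_mask)

print(demo_mask.sum(), 0.004 * len(demo_min_distances))
```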

In [34]:
# Getting the row indices of the anomalies
idx_anomalies = df[filter_anomalies].index
idx_anomalies

Int64Index([ 159, 1376, 1619, 2156, 2212, 2439, 2594, 2654, 2756, 2911, 2914,
            2917, 2923, 3443, 5303, 5412, 5413, 5529, 5674, 5704, 5764, 5977,
            6489, 6643, 6672, 7322, 7338, 7470, 7596, 7597, 8124, 8163, 8437,
            8442, 8856, 8939, 8999, 9071, 9304, 9326],
           dtype='int64')
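
The instructions mention PCA as an optional way to view 29-dimensional clusters in two dimensions. The following is a rough sketch of that projection step on random stand-in data; the array here is synthetic, and plotting is left out so the snippet only depends on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of features as the dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 29))
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto its first two principal components
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
# Fraction of total variance kept by the two components
print(pca.explained_variance_ratio_.sum())
```

The two columns of X_2d could then be passed to a scatter plot, colored by cluster label or by the anomaly mask. On real data the retained variance indicates how much the 2-D picture can be trusted.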

## Isolation Forests

In [38]:
# remove added columns from previous steps
df_iso = df.drop(columns=['cluster'])

In [40]:
# verify changes
df_iso.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      10000 non-null  float64
 1   V2      10000 non-null  float64
 2   V3      10000 non-null  float64
 3   V4      10000 non-null  float64
 4   V5      10000 non-null  float64
 5   V6      10000 non-null  float64
 6   V7      10000 non-null  float64
 7   V8      10000 non-null  float64
 8   V9      10000 non-null  float64
 9   V10     10000 non-null  float64
 10  V11     10000 non-null  float64
 11  V12     10000 non-null  float64
 12  V13     10000 non-null  float64
 13  V14     10000 non-null  float64
 14  V15     10000 non-null  float64
 15  V16     10000 non-null  float64
 16  V17     10000 non-null  float64
 17  V18     10000 non-null  float64
 18  V19     10000 non-null  float64
 19  V20     10000 non-null  float64
 20  V21     10000 non-null  float64
 21  V22     10000 non-null  float64
 22 

2. Isolation Forest:

Instantiate and fit an Isolation Forest with the correct contamination value based on the threshold given by the stakeholder (Be careful: 0.4% = 0.004).

Please use a random state of 42 for your model.

Define a list of the indices of anomalous data.

Remember that anomalies are marked as -1, and normal data is marked as 1.
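
Deriving the contamination value from the percentage, rather than typing it by hand, helps avoid an off-by-ten slip. The sketch below runs on synthetic Gaussian data (an assumption, not the credit card data) and shows that a contamination of 0.004 flags roughly 0.4% of the training rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 0.4% expressed as a fraction: 0.004, not 0.0004
fraud_rate_percent = 0.4
contamination = fraud_rate_percent / 100

# Synthetic stand-in for the feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))

iso = IsolationForest(contamination=contamination, random_state=42)
labels = iso.fit_predict(X)  # -1 = anomaly, 1 = normal

# Row positions of the flagged points
demo_iso_anomalies = np.flatnonzero(labels == -1)
print(len(demo_iso_anomalies))  # close to 0.004 * 10_000
```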

In [41]:
# imports
from sklearn.ensemble import IsolationForest # new!
from sklearn import set_config
set_config(transform_output='pandas')

In [42]:
# instantiate model with contamination of 0.4% (0.004), matching the stakeholder's fraud rate
iso_0004 = IsolationForest(contamination=0.004, random_state=42)

In [48]:
# fit model with values to avoid warning
iso_0004.fit(df_iso.values)

In [49]:
# Obtain results from the model
predictions = iso_0004.predict(df_iso.values)
predictions[:100]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [51]:
# change labels for consistency with KMeans model

# Normal points: 1 -> 0
predictions[predictions == 1] = 0
# Anomalies: -1 -> 1
predictions[predictions == -1] = 1
# Preview new labels
predictions[:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
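
A label array and a list of row indices are different things: the array holds one 0/1 value per row, while the index list holds only the positions of the flagged rows. The sketch below, using an invented six-element prediction array, relabels in one step without mutating the original and then extracts the positions.

```python
import numpy as np

# Hypothetical raw output from IsolationForest.predict
raw = np.array([1, 1, -1, 1, -1, 1])

# Map -1 -> 1 (anomaly) and 1 -> 0 (normal), leaving raw untouched
is_anomaly = np.where(raw == -1, 1, 0)

# Row positions of the anomalies, suitable for comparing with another index list
demo_positions = np.flatnonzero(is_anomaly == 1).tolist()

print(is_anomaly, demo_positions)
```

Only the positions are comparable with the KMeans anomaly indices; the 0/1 array itself is not.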

In [52]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,cluster
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,2
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,2
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,2
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,2
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,2


In [56]:
# predictions is a NumPy array, so wrap it in a Series to use value_counts
pd.Series(predictions).value_counts()
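
NumPy arrays have no value_counts method, but np.unique with return_counts=True produces the same tally without leaving NumPy. A small sketch with an invented label array:

```python
import numpy as np

# Hypothetical 0/1 labels (0 = normal, 1 = anomaly)
demo_predictions = np.array([0, 0, 1, 0, 1, 0, 0])

# Unique labels and how often each occurs
values, counts = np.unique(demo_predictions, return_counts=True)
label_counts = dict(zip(values.tolist(), counts.tolist()))

print(label_counts)
```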

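In [54]:
# Row indices flagged by the Isolation Forest (label 1 after the relabeling above)
iso_anomalies = np.where(predictions == 1)[0]
# Make a list of anomalies identified by both methods
both = [a for a in idx_anomalies if a in iso_anomalies]

In [55]:
both

Answer the following:
a. How many anomalies did the two approaches agree on?
b. What percentage of the anomalies did the two approaches agree on?

The earlier run of this comparison returned an empty list, but it tested the KMeans row indices against the array of 0/1 labels rather than against the Isolation Forest row indices, so none of the flagged indices (159 and above) could match. It therefore does not show that the approaches disagree. With the index-based comparison, the answer to (a) is len(both), and the answer to (b) is 100 * len(both) / len(idx_anomalies).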