## Day 13 – Outlier Detection | Credit Card Fraud Detection
- **Topics Covered**
    - What are Outliers
    - Why Outlier Detection is Important
    - Univariate Outlier Detection (IQR, Normal Distribution)
    - Limitations of Univariate Methods
    - Multivariate Outlier Detection
    - Unsupervised Algorithms: Isolation Forest, LOF, KNN
    - Contamination Parameter
    - Case Study: Credit Card Fraud Detection

#### What is an Outlier?
- An outlier is any data point that is significantly different from the remaining data points in a dataset.
- These points may represent:
    - Errors in data collection
    - Rare but important events (e.g., fraud, anomalies)
    - Natural extreme variations

#### Why Detect Outliers?
- Outliers can:
    - Skew statistical measures like mean and variance
    - Distort regression hyperplanes
    - Shift classification boundaries
    - Reduce overall model performance

- Therefore, detecting and handling outliers helps improve model accuracy, robustness, and generalization.

#### Univariate Outlier Detection Methods
- These methods work on a single feature at a time.

- IQR (Interquartile Range) Method
    - Compute Q1 and Q3
    - IQR = Q3 − Q1
    - Outliers are points outside:
        - Lower bound = Q1 − 1.5 × IQR
        - Upper bound = Q3 + 1.5 × IQR
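
The IQR rule above can be sketched with NumPy. This is a minimal illustration: the helper name `iqr_bounds` and the sample values are assumptions, not part of the notebook.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: one value sits far above the rest.
values = np.array([2, 4, 4, 5, 5, 6, 7, 8, 30])
lower, upper = iqr_bounds(values)

# Points outside the bounds are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

The multiplier `k=1.5` matches the bounds listed above; a larger value would flag fewer points.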

- Normal Distribution (Z-score)
    - Assumes data is normally distributed.
    - Outliers lie outside μ ± 3σ,
    - where μ = mean and σ = standard deviation.
- Limitation:
    - These methods apply only to single-variable analysis and fail on high-dimensional data, where a point can look normal on each feature yet be unusual in combination.
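
The 3σ rule from the Z-score method can be sketched as follows. The sample values are hypothetical, and the rule is only meaningful when the feature is roughly normally distributed.

```python
import numpy as np

# Hypothetical sample: twenty values near 50 and one extreme value.
values = np.array([49, 51] * 10 + [100], dtype=float)

mu = values.mean()
sigma = values.std()          # population standard deviation

# Flag points lying outside mu ± 3*sigma.
z = (values - mu) / sigma
is_outlier = np.abs(z) > 3
print(values[is_outlier])
```

Note that the outlier itself inflates both the mean and the standard deviation, which is one reason the rule can miss extreme points in small samples.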

#### Multivariate Outlier Detection
- In real-world datasets, whether a point is an outlier often depends on several features together, not just one.
- So we use multivariate, unsupervised algorithms that consider feature interactions.
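
A small sketch of why feature interactions matter. The data is synthetic and assumed for illustration: every point lies on the line y = x except one, whose x and y values are each unremarkable on their own. Per-feature z-scores do not flag it, while a multivariate method (LOF from scikit-learn here) does.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: 200 points along the line y = x, plus one point
# whose x and y are each ordinary but whose combination is not.
x = np.linspace(0, 10, 200)
X = np.column_stack([x, x])
X = np.vstack([X, [2.0, 8.0]])   # last row is the joint outlier

# Univariate check: per-feature z-scores of the last row stay below 3.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
univariate_flag = bool((z[-1] > 3).any())

# Multivariate check: LOF looks at both features together.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
multivariate_flag = bool(labels[-1] == -1)
print(univariate_flag, multivariate_flag)
```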

#### Outlier Detection Libraries
- PyOD is a popular Python library that provides multiple outlier detection algorithms in one place.
- Some common methods:
    - Isolation Forest
    - Local Outlier Factor (LOF)
    - KNN-based Outlier Detection

#### Isolation Forest – Intuition

- Isolation Forest isolates observations instead of profiling normal points.
- Key idea:
    - Outliers are easier to separate and require fewer splits in a tree.

- Steps:
    - Build many random trees on the dataset
    - Randomly split features and values
    - Compute path length for each point
    - Shorter path ⇒ more likely to be an outlier

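- Anomaly Score:
    - s(x) = 2^(−E(h(x)) / c(n))

- Where:
    - E(h(x)): average path length of x across the trees
    - c(n): normalization constant (average path length of an unsuccessful binary search tree lookup over n points)
    - Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points
    - Lower path length ⇒ higher anomaly score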
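
A minimal sketch of the score calculation, assuming the normalized form from the original Isolation Forest paper, s(x) = 2^(−E(h(x)) / c(n)), with c(n) = 2H(n−1) − 2(n−1)/n and H approximated by ln plus the Euler-Mascheroni constant. The function names and the subsample size are illustrative, and this is not how scikit-learn exposes its scores.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x) = 2 ** (-E(h(x)) / c(n)); values near 1 suggest an anomaly."""
    return 2.0 ** (-avg_path_length / c(n))

n = 256  # hypothetical subsample size
short_path = anomaly_score(2.0, n)      # point isolated after about 2 splits
typical_path = anomaly_score(c(n), n)   # point at the average depth
print(round(short_path, 3), round(typical_path, 3))
```

A point whose average path length equals c(n) scores 0.5, and shorter paths push the score towards 1.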

#### Local Outlier Factor (LOF)
- LOF compares the local density of a point with its neighbors.
- If a point has much lower density than neighbors → outlier
- Works well when outliers are in sparse regions
- In this notebook, LOF is used to detect fraudulent transactions.
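
The density comparison can be seen by inspecting LOF scores on a toy dataset. This is a sketch with synthetic data; the cluster, the distant point, and `n_neighbors=5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical data: a dense cluster of 50 points plus one distant point.
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X = np.vstack([cluster, [8.0, 8.0]])   # last row sits in a sparse region

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)            # -1 = outlier, 1 = inlier

# negative_outlier_factor_ is the negated LOF score: close to -1 for
# points as dense as their neighbors, far below -1 for outliers.
scores = lof.negative_outlier_factor_
print(labels[-1], scores[-1])
```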

#### KNN-Based Outlier Detection
- Uses the distance to the k nearest neighbors as the outlier score.
- Points far away from their neighbors are treated as anomalies.
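
A minimal sketch of the idea, using scikit-learn's `NearestNeighbors` rather than PyOD. The score here is the distance to the k-th nearest neighbor; the data and `k = 5` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 2))          # hypothetical normal points
X = np.vstack([cluster, [8.0, 8.0]])        # last row is far from the rest

k = 5
# Ask for k + 1 neighbors because each point's nearest neighbor is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Outlier score: distance to the k-th nearest neighbor.
knn_score = distances[:, -1]
most_anomalous = int(np.argmax(knn_score))
print(most_anomalous, knn_score[most_anomalous])
```

Turning the score into labels still needs a threshold, for example a percentile of `knn_score`.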

#### Contamination Parameter
- `contamination` represents the expected proportion of outliers in the dataset.
- Example:
    - 0.01 → assume 1% of data are outliers
    - This helps algorithms decide the threshold for labeling anomalies
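
A sketch of how `contamination` controls the number of flagged rows, using scikit-learn's `IsolationForest` on synthetic data. The data shape and `random_state` are assumptions; the exact count can vary slightly around 1% of the rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # hypothetical data, 1000 rows

# contamination=0.01 sets the score threshold so that about 1% of the
# training rows fall below it and are labeled -1.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)

n_flagged = int((labels == -1).sum())
print(n_flagged, "of", len(X), "rows flagged")
```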

#### Handling Outliers
- 🔹 Trimming
    - Remove observations that are outliers.
    - ✔️ Simple
    - ❌ May lead to data loss
- 🔹 Capping
    - Replace outliers with boundary values (e.g., IQR limits).
    - ✔️ Keeps data size
    - ❌ Can change data distribution
- ⚠️ Trade-off:
    - Trimming → loss of information
    - Capping → distortion of distribution
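
Trimming and capping can be compared side by side. This is a sketch on a hypothetical sample, using the IQR limits as the boundary values.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: drop rows outside the bounds (the array gets shorter).
trimmed = values[(values >= lower) & (values <= upper)]

# Capping: clip values to the bounds (same length, extremes replaced).
capped = np.clip(values, lower, upper)
print(len(trimmed), len(capped), capped.max())
```

Trimming loses the row entirely, while capping keeps it but replaces 95 with the upper bound.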

#### Why Use Unsupervised Methods?

- Many real-world problems:
    - Do not have labeled anomalies
    - Are highly imbalanced
- Unsupervised methods:
    - Do not need target labels
    - Learn patterns from data itself
    - Directly flag rare observations

#### Case Study: Credit Card Fraud Detection
- Fraud transactions are:
    - Extremely rare
    - Different from normal spending behavior
    - Critical to detect
- This makes fraud detection a perfect use case for outlier/anomaly detection.

#### In this notebook:
- Dataset is highly imbalanced
- Isolation Forest and LOF are applied as unsupervised models
- Outliers are mapped to fraud cases
- Performance is evaluated using suitable metrics

#### Key Learnings
- Understood what outliers are and why they matter
- Learned univariate vs multivariate detection
- Explored unsupervised algorithms for anomaly detection
- Implemented Isolation Forest and LOF on a real-world imbalanced dataset
- Observed limitations of accuracy in fraud detection

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv(r"C:\Users\sande\Downloads\creditcardfraud\creditcard.csv")
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
data.shape

(284807, 31)

In [5]:
data['Class'].value_counts(normalize = True)

Class
0    0.998273
1    0.001727
Name: proportion, dtype: float64

In [6]:
import numpy as np

In [7]:
# Baseline "model": always predicts class 0 (not fraud)
def predict(X): # X: feature matrix
  return np.zeros(X.shape[0]) # One 0 label per row

In [8]:
X = data.drop('Class', axis = 1)
y = data['Class']

In [9]:
pred = predict(X)
pred

array([0., 0., 0., ..., 0., 0., 0.])

In [10]:
pred.shape

(284807,)

In [11]:
from sklearn.metrics import accuracy_score

In [12]:
accuracy_score(y, pred) # Imbalanced data: the always-0 baseline scores high because it is biased towards the majority class

0.9982725143693799
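
The high accuracy above hides the fact that the baseline catches no fraud at all. A sketch on hypothetical labels (2 fraud cases in 1000 rows, not the real dataset) shows how recall and precision expose this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 2 fraud cases among 1000 transactions.
y_true = np.zeros(1000, dtype=int)
y_true[[100, 500]] = 1

# Baseline that always predicts "not fraud".
y_baseline = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_baseline)
rec = recall_score(y_true, y_baseline)
# Precision is undefined when nothing is predicted positive;
# zero_division=0 reports it as 0 instead of warning.
prec = precision_score(y_true, y_baseline, zero_division=0)
print(acc, rec, prec)
```

For fraud detection, recall and precision on the fraud class are usually more informative than accuracy.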

### Use Outlier Detection (Unsupervised Algorithm) to Predict Fraudulent Transactions

In [13]:
data.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [14]:
data.duplicated().sum()

1081

In [15]:
data = data.drop_duplicates()
data.shape

(283726, 31)

In [16]:
data['Class'].value_counts(normalize = True)

Class
0    0.998333
1    0.001667
Name: proportion, dtype: float64

In [17]:
X = data.drop('Class', axis = 1)
y = data['Class']

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((226980, 30), (56746, 30), (226980,), (56746,))

In [20]:
from sklearn.ensemble import IsolationForest

In [21]:
IF = IsolationForest(n_estimators=200)
IF

In [22]:
IF.fit(X_train)

In [23]:
IF.decision_function(X_train) # Anomaly scores (negative values indicate outliers)

array([0.08344552, 0.09700927, 0.08248977, ..., 0.10165084, 0.13060605,
       0.09147624])
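
In scikit-learn, `decision_function` returns scores where negative values mean outlier, and `predict` labels a row -1 when its score is below 0. The sketch below checks that relationship on synthetic data (the data and `random_state` are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))    # hypothetical data

model = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = model.decision_function(X)   # negative = more anomalous
labels = model.predict(X)             # -1 = outlier, 1 = inlier

# predict() labels a row -1 exactly when its decision score is below 0.
same = np.array_equal(labels == -1, scores < 0)
print(same, scores.min(), scores.max())
```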

In [24]:
pred = IF.predict(X_train)
pred

array([1, 1, 1, ..., 1, 1, 1])

In [25]:
out = np.where(pred < 0)
out

(array([     5,     31,     45, ..., 226951, 226960, 226973], dtype=int64),)

In [26]:
out[0].shape

(8449,)

In [27]:
IF = IsolationForest(n_estimators=200, contamination = 0.001667)
IF

In [28]:
IF.fit(X_train)

In [29]:
pred = IF.predict(X_train)
pred

array([1, 1, 1, ..., 1, 1, 1])

In [32]:
out = np.where(pred < 0)
out[0].shape

(379,)

In [33]:
y_pred = IF.predict(X_test)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [34]:
y_pred = np.where(y_pred < 0, 1, 0)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [35]:
accuracy_score(y_test, y_pred) # y_true first, then predictions

0.9972861523279174

In [36]:
from sklearn.neighbors import LocalOutlierFactor

In [37]:
lof = LocalOutlierFactor(contamination = 0.001667)
lof

In [38]:
lof.fit_predict(X_train)

array([1, 1, 1, ..., 1, 1, 1])
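
With the default settings, `LocalOutlierFactor` only labels the rows it was fitted on through `fit_predict`. To score unseen rows such as a test set, scikit-learn provides `novelty=True`, which enables `predict`. A sketch on synthetic data (the arrays and `n_neighbors=20` are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 2))            # hypothetical training data
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])   # one typical, one distant row

# novelty=True enables predict() on unseen rows; with it, fit_predict
# is not available and predict should not be used on the training rows.
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_fit)
new_labels = lof_novelty.predict(X_new)      # 1 = inlier, -1 = outlier
print(new_labels)
```

As with Isolation Forest, the -1/1 labels can then be mapped to 1/0 with `np.where(new_labels < 0, 1, 0)` before comparing against the fraud labels.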