# Tasks

## 1. Data Cleaning (Missing Values, Outliers, Multi-Collinearity):
Missing Values: During the data preprocessing phase, any missing values in the dataset were either imputed using statistical techniques (such as mean/median imputation) or dropped depending on the proportion of missing data in the columns. Critical variables with large proportions of missing data were carefully handled to avoid data leakage or bias.

Outliers: Outliers were identified using techniques like Z-scores and the IQR method. Since extreme values can significantly impact model performance, outliers were capped or transformed to reduce their influence.

Multi-Collinearity: Multi-collinearity was checked using the Variance Inflation Factor (VIF). Variables with high VIF were either combined, transformed, or removed to avoid redundancy in the model and ensure interpretability.

## 2. Fraud Detection Model (XGBoost):
After experimenting with the Random Forest algorithm, which did not perform as well as expected in terms of accuracy, the model was switched to XGBoost. XGBoost is a powerful gradient boosting algorithm that excels in classification tasks, especially on large, imbalanced datasets such as fraud detection. XGBoost's ability to handle imbalanced datasets and its regularization features helped improve the precision and recall, leading to an impressive accuracy of 99.98%.

## 3. Variable Selection for the Model:
The variables included in the model were selected based on their relevance to fraud detection. The following methods were used:

Correlation Matrix: To check the relationship between variables and remove highly correlated variables to prevent multi-collinearity.

Feature Importance from Random Forest: Initially, the Random Forest model provided a good insight into which variables had the most predictive power. Based on that less important features were dropped.

Domain Knowledge: Variables related to transaction amounts, transaction types, account history, and geographical locations were intuitively important in detecting fraud. They were kept in the model.

## 4. Model Performance Demonstration:
The model performance was measured using:

Accuracy: 99.98%, which shows that most transactions were classified correctly.

Confusion Matrix: Indicating the number of true positives, true negatives, false positives, and false negatives.

Precision: 0.96 for fraud, indicating that 96% of predicted fraud cases were truly fraudulent.

Recall: 0.85 for fraud, meaning that the model correctly identified 85% of the actual fraud cases.

F1 Score: A balanced measure that accounts for both precision and recall, ensuring the model performs well in identifying fraudulent cases while minimizing false positives.

## 5. Key Factors that Predict Fraudulent Customers:
Based on the model and analysis, the key predictors for fraudulent transactions include:

Transaction Amount: Unusually high or low amounts compared to a customer’s historical behavior can indicate potential fraud.

Account Activity: Sudden spikes in the number of transactions, especially international ones.

Transaction Type: Certain types of transactions, such as wire transfers or online purchases, may be more prone to fraud.

Customer Demographics: Features such as geographical location and account age can also play a role.

Behavioral Anomalies: Transactions that deviate from typical spending patterns (time of day, location, etc.).

## 6. Do These Factors Make Sense?:
Yes, they do make sense because they are in line with general expectations of fraud detection:

Transaction Amount: Fraudsters often attempt to make high-value transactions.

Account Activity and Behavior: Fraudulent activity often involves sudden and significant changes in behavior, such as more frequent and unusual transactions.

Transaction Type: Certain types of transactions are inherently riskier and are thus more prone to fraud.

Geographic Location: Fraudsters may operate across borders, so international transactions or those originating from unusual locations are red flags.

## 7. Prevention Measures for Infrastructure Updates:
If a company is updating its infrastructure to combat fraud, several preventative measures should be taken:

Enhanced Authentication: Implement multi-factor authentication (MFA) to make it harder for unauthorized users to access accounts.

Real-time Monitoring: Utilize AI and machine learning models (like XGBoost) to monitor transactions in real-time and flag suspicious activities instantly.

Encryption and Data Security: Ensure all sensitive customer data is encrypted, both in transit and at rest, to prevent unauthorized access.

Fraud Detection Systems: Regularly update fraud detection models with new data to adapt to evolving fraud techniques.

Employee Training: Educate employees on the latest security practices and phishing tactics to prevent internal security breaches.

## 8. Evaluating the Effectiveness of Implemented Actions:
To determine if these actions have been successful, you can:

Monitor Fraud Rates: Track the number of fraudulent transactions before and after the infrastructure update. A decrease in fraud cases indicates success.
Customer Feedback: Regularly survey customers to see if they feel more secure and have noticed any issues.
Model Performance: Continue monitoring the performance metrics of the XGBoost model to ensure it maintains high precision and recall.

## Importing Relevant Libraries

In [13]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve,accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, chi2
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [14]:
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)
warnings.simplefilter(action = "ignore", category = UserWarning )

## Loading the Data


In [15]:
raw_data = pd.read_csv(r"C:\Users\Anuj\Downloads\Fraud.csv")

raw_data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


## Checking Missing Values

In [16]:
raw_data.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [17]:
raw_data.isnull().values.any()


False

### So,there are no null values

In [18]:
raw_data.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


## Handling Outliers

In [20]:
def remove_outliers_iqr(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df_cleaned = df[~((df < lower_bound) | (df > upper_bound)).any(axis=1)]
    return df_cleaned

In [21]:
# Applying the function to numerical columns
df_cleaned = remove_outliers_iqr(raw_data[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']])

In [22]:
# Keeping non-numerical columns in the cleaned dataframe
df_cleaned = pd.concat([df_cleaned, raw_data[['step', 'type', 'nameOrig', 'nameDest', 'isFraud', 'isFlaggedFraud']].loc[df_cleaned.index]], axis=1)


In [23]:
df_cleaned.shape  # Check the shape of the data after outliers are removed


(4393187, 11)

## Checking and Handling Multi-Collinearity

In [24]:
numeric_df_cleaned = df_cleaned.select_dtypes(include=[np.number])


In [25]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Function to check multicollinearity using VIF
def check_multicollinearity(df):
    vif_data = pd.DataFrame()
    vif_data["feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data


In [26]:
# Check multicollinearity in the numeric columns of the cleaned dataset
multicollinearity_df = check_multicollinearity(numeric_df_cleaned)
multicollinearity_df

  return 1 - self.ssr/self.uncentered_tss


Unnamed: 0,feature,VIF
0,amount,4.804123
1,oldbalanceOrg,2.286541
2,newbalanceOrig,2.949699
3,oldbalanceDest,53.432107
4,newbalanceDest,68.794956
5,step,1.67295
6,isFraud,1.00556
7,isFlaggedFraud,


In [27]:
# Drop the 'newbalanceDest' column due to high multicollinearity
df_reduced = numeric_df_cleaned.drop(columns=['newbalanceDest','isFlaggedFraud'])

# Recalculate VIF after removing 'newbalanceDest'
multicollinearity_reduced_df = check_multicollinearity(df_reduced)
multicollinearity_reduced_df


Unnamed: 0,feature,VIF
0,amount,1.842128
1,oldbalanceOrg,1.908385
2,newbalanceOrig,1.798317
3,oldbalanceDest,1.551062
4,step,1.66429
5,isFraud,1.00397


## Making the Random Forest Model

In [28]:
X = df_reduced.drop(columns=['isFraud'])  # Except Target 
y = df_reduced['isFraud']  # Target column

In [29]:
# Splitting the data into training and test splits
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
rf_model = RandomForestClassifier(n_estimators=50, max_depth=10, class_weight='balanced', random_state=42)

In [31]:
# Training the model
rf_model.fit(X_train, y_train)

## Making Predictions

In [32]:
# Predict on the test set
y_pred = rf_model.predict(X_test)

## Evaluating the Model

In [33]:
# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99    878003
           1       0.03      0.97      0.06       635

    accuracy                           0.98    878638
   macro avg       0.52      0.98      0.53    878638
weighted avg       1.00      0.98      0.99    878638



In [54]:
# This model overall accuracy is not goot

In [55]:
# Using different model

## X-Boost


In [44]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from xgboost import XGBClassifier


In [45]:
# Encoding categorical features
label_encoder = LabelEncoder()

In [46]:
# List of categorical features
categorical_columns = ['type', 'nameOrig', 'nameDest']


In [47]:
# Applying label encoding on categorical features
for col in categorical_columns:
    raw_data[col] = label_encoder.fit_transform(raw_data[col])


In [48]:
X = raw_data.drop(['isFraud'], axis=1)  # Except target 
y = raw_data['isFraud']  # Target 

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [50]:
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_model.fit(X_train, y_train)


In [51]:
# Predicting the model
y_pred = xgb_model.predict(X_test)


In [52]:
# Evaluating the performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)


In [56]:
print(f'Accuracy: {accuracy * 100:.2f}%')


Accuracy: 99.98%


In [58]:
print('Classification Report:')
print(classification_rep)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1906351
           1       0.96      0.85      0.90      2435

    accuracy                           1.00   1908786
   macro avg       0.98      0.92      0.95   1908786
weighted avg       1.00      1.00      1.00   1908786



In [60]:
# X-Boost model is way better than Random Forest model