In [1]:
import pandas as pd

# Load the dataset
file_path = "C:/Users/samvi/OneDrive/Desktop/samvidhaa/Fraud.csv"  
df = pd.read_csv(file_path)
print(df.head())


   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

In [3]:
# 2. Data Cleaning
# Handle missing values
df.dropna(inplace=True)

In [4]:
# Handle outliers
# Winsorize the numeric columns: values below the 1st percentile or above the 99th are capped, not dropped
from scipy.stats.mstats import winsorize
numeric_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
df[numeric_cols] = df[numeric_cols].apply(lambda x: winsorize(x, limits=[0.01, 0.01]), axis=0)
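
To make the effect of capping concrete, here is a minimal sketch on a toy column. It uses quantile clipping with pandas as a close stand-in for `winsorize` (the two are not identical: `winsorize` substitutes the nearest retained observation, while `quantile` interpolates). The data and the 10%/90% limits are illustrative assumptions chosen so the effect is visible on ten rows; the notebook itself uses 1%/99%.

```python
import pandas as pd

# Toy column with one extreme value (hypothetical data, not from Fraud.csv).
amounts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

# Cap at the 10th and 90th percentiles; the notebook uses 1% and 99%.
lower = amounts.quantile(0.10)
upper = amounts.quantile(0.90)
capped = amounts.clip(lower=lower, upper=upper)

print(capped.tolist())
print("rows kept:", len(capped), "of", len(amounts))
```

No rows are removed; only the extreme values move toward the bulk of the data.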


In [5]:
# 3. Feature Engineering
# Select relevant features
X = df[['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
y = df['isFraud']
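
The feature set above is purely numeric, so the `type` column is left out. If one wanted to include it, one-hot encoding is a common option. The sketch below is an assumption about how that could look, not part of the notebook's pipeline; it uses the five rows shown by `df.head()`.

```python
import pandas as pd

# Five rows copied from the df.head() output above (type and amount only).
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
})

# One 0/1 indicator column per transaction type.
encoded = pd.get_dummies(sample, columns=["type"], dtype=int)
print(encoded)
```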


In [6]:
# 4. Multicollinearity Assessment
# Check for multicollinearity using correlation matrix
correlation_matrix = X.corr()
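
The cell above computes the matrix but does not display it. As a sketch of how a correlation matrix can be turned into a multicollinearity check, the example below prints the matrix and derives variance inflation factors (VIF) from it, using the identity that each VIF is the corresponding diagonal entry of the inverse correlation matrix. The data is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=n),  # nearly a copy of "a"
    "c": rng.normal(size=n),             # independent of both
})

corr = demo.corr()
print(corr.round(3))

# VIF_j equals the j-th diagonal entry of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=corr.columns)
print(vif.round(1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. In the notebook, the same two steps could be applied to `X`.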

In [7]:
# 5. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
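
With a rare positive class, a plain random split can leave the two partitions with slightly different fraud rates. A possible refinement, sketched here on synthetic labels, is to pass `stratify` to `train_test_split` so both partitions keep the same class ratio. The results reported below were produced without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 positives among 10,000 rows (1% positive rate).
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1
X_demo = np.arange(10_000).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```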


In [8]:
# 6. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [9]:
# 7. Model Training
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

LogisticRegression(max_iter=1000)
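
The model above uses default class weights. On heavily imbalanced data, one option worth comparing is `class_weight="balanced"`, which reweights the loss so the rare class counts as much as the common one. The sketch below illustrates the effect on synthetic one-feature data; the class sizes and distributions are assumptions, and recall is measured on the training data only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n_neg, n_pos = 19_600, 400  # 2% positives
X_demo = np.concatenate([
    rng.normal(0.0, 1.0, size=n_neg),   # negatives centred at 0
    rng.normal(1.5, 1.0, size=n_pos),   # positives centred at 1.5
]).reshape(-1, 1)
y_demo = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

plain = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_demo, y_demo)

recall_plain = recall_score(y_demo, plain.predict(X_demo))
recall_weighted = recall_score(y_demo, weighted.predict(X_demo))
print("recall, default weights:", round(recall_plain, 3))
print("recall, class_weight='balanced':", round(recall_weighted, 3))
```

Higher recall from reweighting usually comes at the cost of lower precision, so both should be checked before adopting it.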

In [10]:
# 8. Model Evaluation
y_pred = model.predict(X_test_scaled)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# AUC-ROC score
y_pred_proba = model.predict_proba(X_test_scaled)[:,1]
print("AUC-ROC Score:", roc_auc_score(y_test, y_pred_proba))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.89      0.42      0.57      1620

    accuracy                           1.00   1272524
   macro avg       0.95      0.71      0.79   1272524
weighted avg       1.00      1.00      1.00   1272524

Confusion Matrix:
[[1270823      81]
 [    941     679]]
Accuracy: 0.9991968717289419
AUC-ROC Score: 0.9444775143723885
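
The fraud-class figures in the report can be re-derived from the confusion matrix, which also shows why accuracy alone is misleading here. The sketch below uses only the four counts printed above (scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`).

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 1270823, 81
fn, tp = 941, 679

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Accuracy of a trivial model that labels every transaction as non-fraud.
baseline_accuracy = (tn + fp) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} baseline={baseline_accuracy:.4f}")
```

The model's accuracy (0.9992) is only marginally above the trivial baseline (0.9987), while recall shows that 941 of the 1,620 fraudulent test transactions go undetected.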


In [11]:
'''

1. Data cleaning, including missing values, outliers, and multicollinearity:

Missing Values: df.dropna(inplace=True) drops every row that contains a missing value.
Outliers: Winsorization with limits=[0.01, 0.01] caps each numeric column at its 1st and 99th percentiles; extreme values are replaced rather than removed.
Multicollinearity: A correlation matrix (correlation_matrix) is computed on the feature variables to assess multicollinearity, although the notebook does not display or act on it.


2. Describe your fraud detection model in detail.

The fraud detection model is logistic regression, a binary classifier.
It models the log-odds of a transaction being fraudulent as a linear combination of the standardized features and converts them to a probability with the logistic function. model.predict labels a transaction as fraudulent when that probability exceeds 0.5.


3. How did you select variables to be included in the model?

The model uses step, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, and newbalanceDest.
These are the numeric columns of the dataset. They were chosen because the transaction amount and the balances before and after a transaction may help distinguish fraudulent transactions from legitimate ones. The remaining columns (type, nameOrig, nameDest, and isFlaggedFraud) are not used.


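4. Demonstrate the performance of the model by using the best set of tools.

The evaluation reports accuracy, precision, recall, F1-score, the confusion matrix, and the AUC-ROC score on the 20% test split.
Accuracy is 0.9992, but only 1,620 of the 1,272,524 test transactions are fraudulent, so accuracy says little on its own. For the fraud class, precision is 0.89 and recall is 0.42 (F1-score 0.57): the model catches 679 fraudulent transactions, misses 941, and raises 81 false alarms. The AUC-ROC score is 0.944.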


5. What are the key factors that predict a fraudulent customer?

The evaluation metrics do not show which features drive the predictions; that would require inspecting the fitted model, for example its coefficients.
Nevertheless, commonly important factors are the transaction amount, the account balances before and after the transaction, the transaction type, and possibly the timing of the transaction (step). Note that the transaction type is not among this model's inputs.


6. Do these factors make sense? If yes, how? If not, why not?

Yes, these factors are plausible.
Compared with legitimate transactions, fraudulent ones may involve unusual amounts, abrupt changes in account balances, and particular transaction types. In the df.head() output, for example, both fraudulent rows are a TRANSFER or CASH_OUT that empties the originating account.
These patterns are consistent with typical fraudulent behavior.


7. What kind of prevention should be adopted while the company updates its infrastructure?

Examples of preventive measures include deploying anomaly detection algorithms, strengthening authentication (such as multi-factor authentication), monitoring transaction patterns in real time, and enforcing strict controls on high-risk transactions (such as large transfers).


8. Assuming these actions have been implemented, how would you determine if they work?

Their effectiveness can be assessed by tracking key metrics before and after the change: the number of fraudulent transactions, the false positive rate, the model's accuracy, recall, and AUC-ROC score, and its ability to detect new forms of fraud.
Regular audits and reviews of the fraud detection system would support its continuous evaluation and improvement.

'''
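
Answer 5 concerns which factors drive the predictions. For a logistic regression trained on standardized features, one way to explore this is to compare coefficient magnitudes. The sketch below shows the idea on synthetic data; the column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000
demo = pd.DataFrame({
    "signal": rng.normal(size=n),  # drives the label
    "noise": rng.normal(size=n),   # unrelated to the label
})
y_demo = (demo["signal"] + 0.5 * rng.normal(size=n) > 1.0).astype(int)

# Standardize first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

coefs = pd.Series(clf.coef_[0], index=demo.columns)
# Order features by absolute coefficient, largest first.
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```

In the notebook, the equivalent would be `pd.Series(model.coef_[0], index=X.columns)`. Coefficients indicate association under this model, not causation, and correlated features can share or trade off weight.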