## Summary:

### Data Analysis Key Findings

*   The dataset contains 1,872,830 transactions with 11 features, including `step`, `type`, `amount`, account balances, and the target variable `isFraud`.
*   There is a severe class imbalance in the target variable `isFraud`, with only 1,882 fraudulent transactions compared to 1,870,947 non-fraudulent ones.
*   One row has missing values, affecting five columns (`nameDest`, `oldbalanceDest`, `newbalanceDest`, `isFraud`, `isFlaggedFraud`).
*   The `isFlaggedFraud` column had only one unique value (0.0) and was not useful for discrimination.
*   The `nameOrig` and `nameDest` columns had very high cardinality.
*   After data preparation (handling missing values, dropping irrelevant columns, one-hot encoding 'type', and scaling numerical features), the data was split into training (1,310,980 rows) and testing (561,849 rows) sets, maintaining the class distribution through stratification.
*   A RandomForestClassifier with balanced class weights was trained to address the class imbalance.
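*   The trained model's overall accuracy rounds to 1.00, but accuracy says little here: with fraud at roughly 0.1% of transactions, a model that never predicts fraud would also score about 1.00.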
*   For the fraudulent class (1.0), the model achieved a precision of 0.99 (correctly predicting fraud 99% of the time it predicted fraud) and a recall of 0.73 (identifying 73% of actual fraudulent transactions).
*   The confusion matrix showed 4 false positives (legitimate transactions flagged as fraud) and 151 false negatives (fraudulent transactions missed by the model) on the test set.

### Insights or Next Steps

*   While the model shows high precision for fraud detection, the recall of 0.73 indicates that a significant number of fraudulent transactions are not being identified. Further investigation into optimizing the model or employing advanced data balancing techniques could improve the detection rate of actual fraud.
*   The high cardinality of `nameOrig` and `nameDest` was handled by dropping these columns. Exploring alternative approaches like feature hashing or embedding could potentially incorporate information from these columns into the model if deemed important.
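
As a rough illustration of the feature-hashing idea in the last bullet, the sketch below maps an account ID to one of a fixed number of buckets, which could then be treated as a categorical feature. The function name `hash_bucket` and the bucket count of 1024 are assumptions made for illustration; neither appears in the notebook.

```python
import hashlib

def hash_bucket(account_id: str, n_buckets: int = 1024) -> int:
    """Map an account ID to a stable bucket index in [0, n_buckets)."""
    # hashlib gives the same result across runs, unlike Python's built-in
    # hash(), which is salted per process for strings.
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Account IDs taken from the dataset preview in this notebook.
for name in ["C1231006815", "M1979787155"]:
    print(name, "->", hash_bucket(name))
```

Different accounts can collide in the same bucket; a larger `n_buckets` reduces collisions at the cost of more distinct feature values.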


# Task
Perform the CRISP-DM process on the `/content/onlinefraud.csv` dataset to predict the 'fraud' variable using a machine learning model, and calculate predictions.

## Business understanding

### Subtask:
Understand the business problem related to online fraud prediction.


**Reasoning**:
Reflect on the typical challenges and goals associated with predicting online fraud, the potential impact of successful fraud detection, potential costs of false positives and negatives, potential types of online fraud, and what a successful outcome of this project would look like from a business perspective.



In [None]:
# 1. Challenges and Goals:
# Challenges: Highly imbalanced datasets (fraud is rare), evolving fraud techniques, need for real-time detection,
# maintaining customer experience while detecting fraud.
# Goals: Minimize financial losses due to fraud, protect customer trust, reduce operational costs associated with manual fraud review.

# Impact:
# Businesses: Reduced financial losses, improved reputation, increased customer confidence.
# Customers: Protection from unauthorized transactions, increased trust in online platforms.

# 2. Costs of Errors:
# False Positives: Annoying legitimate customers, potential loss of business, increased operational costs for manual review.
# False Negatives: Direct financial loss due to fraud, damage to reputation, potential legal issues.

# 3. Types of Online Fraud:
# Credit card fraud, identity theft, account takeover, friendly fraud (chargebacks), synthetic identity fraud.
# Manifestation in data: Suspicious transaction patterns, unusual locations, large transaction amounts,
# multiple failed login attempts, inconsistent user behavior.

# 4. Successful Outcome:
# A successful outcome would involve building a predictive model that effectively identifies fraudulent transactions
# with high recall (true positive rate) while keeping precision high, i.e. raising few false alarms.
# The model should be deployable in real-time and provide actionable insights for fraud prevention.
# Success metrics could include reduction in fraud losses, improved detection rate, and minimal disruption to legitimate transactions.

print("Business problem understanding completed.")

Business problem understanding completed.


## Data understanding

### Subtask:
Load and explore the `/content/onlinefraud.csv` dataset.


**Reasoning**:
Load the dataset from `/content/onlinefraud.csv` into a pandas DataFrame named `df`, display the first 5 rows, print the concise summary, display descriptive statistics, display the count of unique values, check for and display the number of missing values, and examine the distribution of the target variable `isFraud`.



In [None]:
import pandas as pd

# 1. Load the dataset
df = pd.read_csv('/content/onlinefraud.csv')

# 2. Display the first 5 rows
print("First 5 rows of the DataFrame:")
display(df.head())

# 3. Print the concise summary
print("\nConcise summary of the DataFrame:")
df.info()

# 4. Display descriptive statistics of numerical columns
print("\nDescriptive statistics of numerical columns:")
display(df.describe())

# 5. Display the count of unique values for each column
print("\nNumber of unique values for each column:")
display(df.nunique())

# 6. Check for and display the number of missing values
print("\nNumber of missing values in each column:")
display(df.isnull().sum())

# 7. Examine the distribution of the target variable 'fraud'
print("\nDistribution of the target variable 'fraud':")
display(df['isFraud'].value_counts())

First 5 rows of the DataFrame:


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0.0,0.0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1.0,0.0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0.0,0.0



Concise summary of the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1872830 entries, 0 to 1872829
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         float64
 10  isFlaggedFraud  float64
dtypes: float64(7), int64(1), object(3)
memory usage: 157.2+ MB

Descriptive statistics of numerical columns:


Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,1872830.0,1872830.0,1872830.0,1872830.0,1872829.0,1872829.0,1872829.0,1872829.0
mean,78.91335,161498.9,855814.6,877016.1,997067.7,1117271.0,0.001004897,0.0
std,60.43201,267423.2,2923475.0,2959614.0,2306712.0,2402849.0,0.03168419,0.0
min,1.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,13294.83,0.0,0.0,0.0,0.0,0.0,0.0
50%,43.0,79274.34,14975.0,0.0,139395.1,231124.8,0.0,0.0
75%,139.0,217467.9,121377.0,163329.0,949545.0,1156044.0,0.0,0.0
max,164.0,10000000.0,38939420.0,38946230.0,42207400.0,42283780.0,1.0,0.0



Number of unique values for each column:


Unnamed: 0,0
step,164
type,5
amount,1762981
nameOrig,1872021
oldbalanceOrg,660146
newbalanceOrig,812108
nameDest,792900
oldbalanceDest,1082303
newbalanceDest,912451
isFraud,2



Number of missing values in each column:


Unnamed: 0,0
step,0
type,0
amount,0
nameOrig,0
oldbalanceOrg,0
newbalanceOrig,0
nameDest,1
oldbalanceDest,1
newbalanceDest,1
isFraud,1



Distribution of the target variable 'fraud':


isFraud,count
0.0,1870947
1.0,1882
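
To put the imbalance in numbers, the short calculation below derives the fraud rate and the ratio of legitimate to fraudulent transactions from the counts printed above. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
n_legit = 1_870_947
n_fraud = 1_882

fraud_rate = n_fraud / (n_fraud + n_legit)
legit_per_fraud = n_legit / n_fraud

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud case: {legit_per_fraud:.0f}")
```

The fraud rate agrees with the mean of `isFraud` in the descriptive statistics (about 0.001), which is why accuracy alone is a poor metric for this problem.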


## Data preparation

### Subtask:
Clean and prepare the data for machine learning, including handling missing values, encoding categorical features, and feature scaling if necessary.


**Reasoning**:
Handle missing values, drop irrelevant columns, one-hot encode the 'type' column, scale numerical features, define features and target, and split the data into stratified training and testing sets.



In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# 1. Handle the single missing row by dropping it
df.dropna(inplace=True)

# 2. Drop the isFlaggedFraud column as it has only one unique value and is not discriminatory.
df.drop('isFlaggedFraud', axis=1, inplace=True)

# 3. Handle categorical features: 'type' and drop 'nameOrig' and 'nameDest'
df.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)
df = pd.get_dummies(df, columns=['type'], drop_first=True) # One-hot encode 'type'

# 4. Define numerical features for scaling
numerical_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

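# 5. Apply standard scaler to numerical features.
# Note: the scaler is fitted on the full dataset before the split in step 7, so
# test-set statistics leak into the scaling. Fitting on the training rows only avoids this.
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])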

# 6. Define feature matrix X and target vector y
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# 7. Split data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("Data cleaning and preparation completed.")
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)

Data cleaning and preparation completed.
Shape of X_train: (1310980, 10)
Shape of y_train: (1310980,)
Shape of X_test: (561849, 10)
Shape of y_test: (561849,)
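
The cell above fits `StandardScaler` on the full DataFrame before splitting, so the scaling statistics include the test rows. One common way to avoid that is to wrap preprocessing and model in a scikit-learn `Pipeline`, so that `fit` only ever sees the training data. The following is a minimal sketch under stated assumptions: the synthetic `toy` DataFrame and the small `n_estimators` are stand-ins chosen so the example runs quickly, not the notebook's actual data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the real data (an assumption, for illustration only).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "step": rng.integers(1, 165, size=n),
    "type": rng.choice(["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"], size=n),
    "amount": rng.uniform(0, 1e5, size=n),
    "oldbalanceOrg": rng.uniform(0, 1e6, size=n),
    "newbalanceOrig": rng.uniform(0, 1e6, size=n),
    "oldbalanceDest": rng.uniform(0, 1e6, size=n),
    "newbalanceDest": rng.uniform(0, 1e6, size=n),
    "isFraud": np.r_[np.ones(20), np.zeros(n - 20)],
})

numeric = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]
preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), numeric),
        ("onehot", OneHotEncoder(drop="first"), ["type"]),
    ],
    remainder="passthrough",  # leaves 'step' unchanged
)

pipe = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, class_weight="balanced", random_state=42)),
])

X = toy.drop(columns="isFraud")
y = toy["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the scaling statistics from X_train only; X_test is merely transformed.
pipe.fit(X_train, y_train)
fraud_proba = pipe.predict_proba(X_test)[:, 1]
print(fraud_proba[:5])
```

A fitted pipeline also accepts raw, unscaled rows at prediction time, which removes the need to carry a separate `scaler` object around.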


## Modeling

### Subtask:
Select and train a machine learning model to predict the 'fraud' variable.


**Reasoning**:
Import and train a RandomForestClassifier model with balanced class weights to address the class imbalance.



In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the RandomForestClassifier model with balanced class weights
model = RandomForestClassifier(class_weight='balanced', random_state=42)

# Train the model using the training data
model.fit(X_train, y_train)

print("Model training completed.")

Model training completed.
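
For intuition about what `class_weight='balanced'` does, scikit-learn documents the heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula by hand. The training-set class counts are not printed anywhere in the notebook; they are inferred here by subtracting the test-set support shown in the classification report from the full-dataset counts.

```python
# Training-set class counts, inferred as full-dataset counts minus test-set
# support (1,870,947 - 561,284 and 1,882 - 565).
train_counts = {0.0: 1_309_663, 1.0: 1_317}

n_samples = sum(train_counts.values())  # 1,310,980 training rows
n_classes = len(train_counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, w in weights.items():
    print(f"class {label}: weight {w:.4f}")
```

Under this heuristic each fraud example counts roughly a thousand times as much as a legitimate one during training.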


## Evaluation

### Subtask:
Evaluate the performance of the trained model using appropriate metrics.


**Reasoning**:
Import the necessary evaluation metrics and make predictions on the test set.



In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test set
y_pred = model.predict(X_test)

**Reasoning**:
Generate and print the classification report and confusion matrix to evaluate the model's performance.



In [None]:
# Generate and print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate and print the confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00    561284
         1.0       0.99      0.73      0.84       565

    accuracy                           1.00    561849
   macro avg       1.00      0.87      0.92    561849
weighted avg       1.00      1.00      1.00    561849


Confusion Matrix:
[[561280      4]
 [   151    414]]
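
The fraud-class figures in the report can be reproduced directly from the confusion matrix above. The short calculation below does so and also compares accuracy with a trivial baseline that never predicts fraud; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
tn, fp = 561_280, 4
fn, tp = 151, 414

precision = tp / (tp + fp)   # share of fraud alerts that were real fraud
recall = tp / (tp + fn)      # share of real fraud that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Accuracy of a model that labels every transaction as legitimate.
baseline_accuracy = (tn + fp) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.6f} baseline={baseline_accuracy:.6f}")
```

Both accuracy values round to 1.00, so precision and recall for the fraud class are the figures to watch.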


## Deployment

### Subtask:
Discuss how the model could be deployed.


**Reasoning**:
Discuss the deployment considerations for the trained model, covering technical infrastructure, challenges, and maintenance.



In [None]:
# 1. Technical Infrastructure:
print("Technical Infrastructure:")
print("- **APIs:** A RESTful API would be needed to allow other applications (e.g., e-commerce platforms, banking systems) to send transaction data for real-time fraud prediction.")
print("- **Databases:** A low-latency database would be required to store transaction data, model predictions, and potentially feature stores for fast retrieval. Examples include NoSQL databases like Cassandra or managed services like Amazon DynamoDB.")
print("- **Monitoring Systems:** Robust monitoring is crucial to track model performance (e.g., prediction latency, error rates, drift in data distribution), system health (e.g., server load, memory usage), and business metrics (e.g., fraud detection rate, false positive rate). Tools like Prometheus, Grafana, or cloud-specific monitoring services could be used.")
print("- **Deployment Platform:** The model could be deployed on cloud platforms (e.g., AWS SageMaker, Google AI Platform, Azure Machine Learning) or on-premise infrastructure using containerization (e.g., Docker) and orchestration tools (e.g., Kubernetes).")
print("- **Message Queues:** For asynchronous processing or handling bursts of traffic, message queues (e.g., Kafka, RabbitMQ) could be used to decouple the API from the prediction service.")

# 2. Challenges and Strategies for Low Latency and High Availability:
print("\nChallenges and Strategies for Low Latency and High Availability:")
print("- **Low Latency:**")
print("  - **Model Optimization:** Use efficient model architectures and optimize model inference (e.g., using tools like ONNX Runtime, TensorFlow Serving with TensorRT).")
print("  - **Infrastructure Proximity:** Deploy the model geographically close to the users or services making predictions.")
print("  - **Caching:** Cache frequently accessed data or model predictions.")
print("  - **Horizontal Scaling:** Scale the prediction service horizontally to handle increased load.")
print("- **High Availability:**")
print("  - **Redundancy:** Deploy the model and its dependencies across multiple availability zones or regions.")
print("  - **Load Balancing:** Distribute incoming requests across multiple instances of the prediction service.")
print("  - **Automated Failover:** Implement automated failover mechanisms to redirect traffic to healthy instances in case of failures.")
print("  - **Circuit Breaker Pattern:** Implement circuit breakers to prevent cascading failures when downstream services are unavailable.")

# 3. Ongoing Maintenance and Retraining:
print("\nOngoing Maintenance and Retraining:")
print("- **Monitoring Model Performance:** Continuously monitor the model's performance metrics (precision, recall, F1-score) in production to detect degradation.")
print("- **Data Drift Detection:** Monitor for changes in the distribution of incoming data compared to the training data.")
print("- **Retraining Frequency:** Establish a schedule for retraining the model with new data to capture evolving fraud patterns. This could be daily, weekly, or monthly depending on the data volume and the rate of concept drift.")
print("- **Automated Retraining Pipelines:** Implement automated pipelines for data ingestion, preprocessing, model training, validation, and deployment to streamline the retraining process.")
print("- **A/B Testing:** Use A/B testing to compare the performance of new model versions against the current production model before full rollout.")
print("- **Incident Response:** Have a plan in place to address production issues, model failures, or performance degradation.")

print("\nDiscussion on model deployment completed.")

Technical Infrastructure:
- **APIs:** A RESTful API would be needed to allow other applications (e.g., e-commerce platforms, banking systems) to send transaction data for real-time fraud prediction.
- **Databases:** A low-latency database would be required to store transaction data, model predictions, and potentially feature stores for fast retrieval. Examples include NoSQL databases like Cassandra or managed services like Amazon DynamoDB.
- **Monitoring Systems:** Robust monitoring is crucial to track model performance (e.g., prediction latency, error rates, drift in data distribution), system health (e.g., server load, memory usage), and business metrics (e.g., fraud detection rate, false positive rate). Tools like Prometheus, Grafana, or cloud-specific monitoring services could be used.
- **Deployment Platform:** The model could be deployed on cloud platforms (e.g., AWS SageMaker, Google AI Platform, Azure Machine Learning) or on-premise infrastructure using containerization (e.g., 

## Prediction

### Subtask:
Use the trained model to make predictions on new data.


**Reasoning**:
Use the trained model to make predictions on the test data as instructed.



In [None]:
# Predict class labels for the held-out test set and store them in y_pred.
y_pred = model.predict(X_test)

# Show the predictions.
y_pred

array([0., 0., 0., ..., 0., 0., 0.])

In [None]:
# Get the predicted probabilities for each class
y_pred_proba = model.predict_proba(X_test)

# Display the first few predicted probabilities
print("Predicted probabilities (first 10):")
display(y_pred_proba[:10])

Predicted probabilities (first 10):


array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])
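
Predicted probabilities make it possible to trade precision for recall by choosing a decision threshold other than the default. The sketch below is one simple way to do that in plain Python: it scans candidate thresholds and keeps the one with the best precision among those that reach a required recall. The function `pick_threshold` and the toy labels and scores are assumptions for illustration; they do not come from the trained model, and the function assumes at least one positive label.

```python
def pick_threshold(y_true, fraud_scores, min_recall):
    """Return (threshold, precision, recall) with the best precision among
    thresholds whose recall is at least min_recall, or None if none qualifies."""
    n_pos = sum(1 for y in y_true if y == 1)
    best = None
    for t in sorted(set(fraud_scores)):
        flagged = [s >= t for s in fraud_scores]
        tp = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, y_true) if f and y != 1)
        recall = tp / n_pos
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is one of the scores
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best

# Toy labels and scores (made up for illustration, not taken from the model).
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(pick_threshold(y_toy, scores_toy, min_recall=1.0))
```

In the notebook, the scores would be `y_pred_proba[:, 1]` and the labels `y_test`. Choosing the threshold on a separate validation split rather than on the test set keeps the final evaluation honest.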

# Task
Create a form to input transaction details and use the trained model to predict the probability of fraud based on the input.

## Create a form

### Subtask:
Generate a code cell with a form to input the transaction details (step, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, and type).


**Reasoning**:
Create a form with input fields for transaction details and a submit button.



In [None]:
import ipywidgets as widgets
from IPython.display import display

# Create input widgets for numerical features
step_input = widgets.IntText(description='Step:')
amount_input = widgets.FloatText(description='Amount:')
oldbalanceOrg_input = widgets.FloatText(description='Old Balance Originator:')
newbalanceOrig_input = widgets.FloatText(description='New Balance Originator:')
oldbalanceDest_input = widgets.FloatText(description='Old Balance Destination:')
newbalanceDest_input = widgets.FloatText(description='New Balance Destination:')

# Create a dropdown for the 'type' feature
type_options = ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN']
type_dropdown = widgets.Dropdown(
    options=type_options,
    description='Type:',
)

# Create a button to submit the form
submit_button = widgets.Button(description='Predict Fraud Probability')

# Arrange the widgets in a vertical box
form = widgets.VBox([
    step_input,
    amount_input,
    oldbalanceOrg_input,
    newbalanceOrig_input,
    oldbalanceDest_input,
    newbalanceDest_input,
    type_dropdown,
    submit_button
])

# Display the form
display(form)

VBox(children=(IntText(value=0, description='Step:'), FloatText(value=0.0, description='Amount:'), FloatText(v…


Captured and preprocessed input data:


Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,2,36.790013,-0.292738,-0.296327,-0.432246,-0.464978,True,False,False,False



Predicted Probability of Fraud: 0.1100


'\nPredicted Probability of Fraud: 0.1100'

## Process form input

### Subtask:
Capture the form input and convert it into a format suitable for the model.


**Reasoning**:
Define a function to capture the form input, preprocess it to match the training data format, and display the preprocessed data.



In [None]:
def predict_fraud(button):
    """Captures form input, preprocesses it, and makes a fraud probability prediction."""

    # 2. Access the values from the input widgets
    input_data = {
        'step': step_input.value,
        'amount': amount_input.value,
        'oldbalanceOrg': oldbalanceOrg_input.value,
        'newbalanceOrig': newbalanceOrig_input.value,
        'oldbalanceDest': oldbalanceDest_input.value,
        'newbalanceDest': newbalanceDest_input.value,
        'type': type_dropdown.value
    }

    # 3. Create a dictionary and handle one-hot encoding for 'type'
    processed_input = {
        'step': input_data['step'],
        'amount': input_data['amount'],
        'oldbalanceOrg': input_data['oldbalanceOrg'],
        'newbalanceOrig': input_data['newbalanceOrig'],
        'oldbalanceDest': input_data['oldbalanceDest'],
        'newbalanceDest': input_data['newbalanceDest'],
        'type_CASH_OUT': input_data['type'] == 'CASH_OUT',
        'type_DEBIT': input_data['type'] == 'DEBIT',
        'type_PAYMENT': input_data['type'] == 'PAYMENT',
        'type_TRANSFER': input_data['type'] == 'TRANSFER'
    }

    # 4. Convert the dictionary into a pandas DataFrame
    input_df = pd.DataFrame([processed_input])

    # 5. Apply the same scaling transformation to numerical features
    numerical_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

    # Reuse the StandardScaler fitted in the data preparation step; the `scaler`
    # object must still be in the kernel's memory. Do not fit a new scaler here:
    # the input must be scaled with the same mean and variance as the training data.

    input_df[numerical_features] = scaler.transform(input_df[numerical_features])


    # 6. Ensure the column order matches X_train
    input_df = input_df[X_train.columns]


    print("\nCaptured and preprocessed input data:")
    display(input_df)

# Link the button click to the function
submit_button.on_click(predict_fraud)

## Make prediction

### Subtask:
Use the trained model to predict the probability of fraud based on the form input.


**Reasoning**:
Use the trained model to predict the probability of fraud based on the preprocessed input data and extract the fraud probability.



In [None]:
def predict_fraud(button):
    """Captures form input, preprocesses it, and makes a fraud probability prediction."""

    # 2. Access the values from the input widgets
    input_data = {
        'step': step_input.value,
        'amount': amount_input.value,
        'oldbalanceOrg': oldbalanceOrg_input.value,
        'newbalanceOrig': newbalanceOrig_input.value,
        'oldbalanceDest': oldbalanceDest_input.value,
        'newbalanceDest': newbalanceDest_input.value,
        'type': type_dropdown.value
    }

    # 3. Create a dictionary and handle one-hot encoding for 'type'
    processed_input = {
        'step': input_data['step'],
        'amount': input_data['amount'],
        'oldbalanceOrg': input_data['oldbalanceOrg'],
        'newbalanceOrig': input_data['newbalanceOrig'],
        'oldbalanceDest': input_data['oldbalanceDest'],
        'newbalanceDest': input_data['newbalanceDest'],
        'type_CASH_OUT': input_data['type'] == 'CASH_OUT',
        'type_DEBIT': input_data['type'] == 'DEBIT',
        'type_PAYMENT': input_data['type'] == 'PAYMENT',
        'type_TRANSFER': input_data['type'] == 'TRANSFER'
    }

    # 4. Convert the dictionary into a pandas DataFrame
    input_df = pd.DataFrame([processed_input])

    # 5. Apply the same scaling transformation to numerical features
    numerical_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

    # Reuse the StandardScaler fitted in the data preparation step; the `scaler`
    # object must still be in the kernel's memory. Do not fit a new scaler here:
    # the input must be scaled with the same mean and variance as the training data.

    input_df[numerical_features] = scaler.transform(input_df[numerical_features])


    # 6. Ensure the column order matches X_train
    input_df = input_df[X_train.columns]

    # Use the trained model's .predict_proba() method to get the probability of each class
    y_pred_proba = model.predict_proba(input_df)

    # Extract the probability of fraud (class 1)
    fraud_probability = y_pred_proba[0][1]

    # Print the fraud probability
    print(f"\nPredicted Probability of Fraud: {fraud_probability:.4f}")


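# Link the button click to the function.
# Note: on_click adds a handler each time it is called, so the version of
# predict_fraud registered in the previous cell stays attached and also runs on click.
submit_button.on_click(predict_fraud)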

## Display result

### Subtask:
Display the predicted probability of fraud.


**Reasoning**:
Display the predicted probability of fraud using the display function.



In [None]:
from IPython.display import display

def predict_fraud(button):
    """Captures form input, preprocesses it, and makes a fraud probability prediction."""

    # 2. Access the values from the input widgets
    input_data = {
        'step': step_input.value,
        'amount': amount_input.value,
        'oldbalanceOrg': oldbalanceOrg_input.value,
        'newbalanceOrig': newbalanceOrig_input.value,
        'oldbalanceDest': oldbalanceDest_input.value,
        'newbalanceDest': newbalanceDest_input.value,
        'type': type_dropdown.value
    }

    # 3. Create a dictionary and handle one-hot encoding for 'type'
    processed_input = {
        'step': input_data['step'],
        'amount': input_data['amount'],
        'oldbalanceOrg': input_data['oldbalanceOrg'],
        'newbalanceOrig': input_data['newbalanceOrig'],
        'oldbalanceDest': input_data['oldbalanceDest'],
        'newbalanceDest': input_data['newbalanceDest'],
        'type_CASH_OUT': input_data['type'] == 'CASH_OUT',
        'type_DEBIT': input_data['type'] == 'DEBIT',
        'type_PAYMENT': input_data['type'] == 'PAYMENT',
        'type_TRANSFER': input_data['type'] == 'TRANSFER'
    }

    # 4. Convert the dictionary into a pandas DataFrame
    input_df = pd.DataFrame([processed_input])

    # 5. Apply the same scaling transformation to numerical features
    numerical_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

    # Reuse the StandardScaler fitted in the data preparation step; the `scaler`
    # object must still be in the kernel's memory. Do not fit a new scaler here:
    # the input must be scaled with the same mean and variance as the training data.

    input_df[numerical_features] = scaler.transform(input_df[numerical_features])


    # 6. Ensure the column order matches X_train
    input_df = input_df[X_train.columns]

    # Use the trained model's .predict_proba() method to get the probability of each class
    y_pred_proba = model.predict_proba(input_df)

    # Extract the probability of fraud (class 1)
    fraud_probability = y_pred_proba[0][1]

    # Display the predicted probability of fraud using the display function
    display(f"\nPredicted Probability of Fraud: {fraud_probability:.4f}")


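# Link the button click to the function.
# Note: on_click adds a handler each time it is called, so the versions of
# predict_fraud registered in earlier cells stay attached and also run on click.
submit_button.on_click(predict_fraud)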

## Summary:

### Data Analysis Key Findings

*   A form was successfully created using `ipywidgets` to capture transaction details, including numerical features and a categorical type.
*   A function `predict_fraud` was developed to process the form input.
*   The processing involved one-hot encoding the categorical 'type' feature and scaling the numerical features with the scaler fitted during data preparation.
*   The preprocessed input was converted into a pandas DataFrame with columns ordered to match the training data used by the model.
*   The trained model's `.predict_proba()` method was used to predict the probability of fraud for the input transaction.
*   The predicted probability of fraud (class 1) was extracted and displayed to the user.
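
The preprocessing steps listed above can be sketched without any libraries, which makes the order of operations easier to see. This is a minimal illustration, not the notebook's implementation: it assumes z-score scaling (the notebook does not say which scaler was fitted), and `TYPE_CATEGORIES`, `build_feature_row`, and the sample means and standard deviations are hypothetical names and values chosen for the example.

```python
# Hypothetical category list for illustration; the real one comes from the
# one-hot columns present in X_train.
TYPE_CATEGORIES = ["DEBIT", "PAYMENT", "TRANSFER"]


def one_hot_type(tx_type, categories=TYPE_CATEGORIES):
    """Return {'type_<CAT>': 0 or 1} for every known category."""
    return {f"type_{cat}": int(tx_type == cat) for cat in categories}


def standardize(value, mean, std):
    """Z-score scaling: (value - mean) / std. Assumed, not confirmed by the notebook."""
    return (value - mean) / std


def build_feature_row(record, means, stds, column_order):
    """Scale the numeric fields, one-hot encode 'type', and order the columns."""
    row = {name: standardize(record[name], means[name], stds[name]) for name in means}
    row.update(one_hot_type(record["type"]))
    # A KeyError here means the training columns and the encoded row disagree.
    return [row[col] for col in column_order]


# Made-up statistics and a made-up transaction, purely to show the call.
demo_row = build_feature_row(
    record={"amount": 150.0, "oldbalanceOrg": 10.0, "type": "PAYMENT"},
    means={"amount": 100.0, "oldbalanceOrg": 10.0},
    stds={"amount": 50.0, "oldbalanceOrg": 2.0},
    column_order=["amount", "oldbalanceOrg", "type_DEBIT", "type_PAYMENT", "type_TRANSFER"],
)
print(demo_row)
```

Ordering the values by an explicit column list mirrors the `input_df[X_train.columns]` step: the model only sees positions, so a mismatch in order would silently feed it the wrong features.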

### Insights or Next Steps

*   The interactive form allows for real-time testing of the trained fraud detection model with custom transaction data.
*   Further enhancements could include adding input validation to the form and providing a clearer visual indication of the prediction result (e.g., color-coding based on probability).
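
Both suggested enhancements are small enough to sketch. The example below is one possible shape, under stated assumptions: `ALLOWED_TYPES` lists only the transaction types visible in the encoding step and may be incomplete, and the `review_at` and `block_at` thresholds are arbitrary placeholders rather than tuned values.

```python
# Assumption: only the types visible in the notebook's encoding step; extend as needed.
ALLOWED_TYPES = {"DEBIT", "PAYMENT", "TRANSFER"}
NUMERIC_FIELDS = ["amount", "oldbalanceOrg", "newbalanceOrig",
                  "oldbalanceDest", "newbalanceDest"]


def validate_transaction(record):
    """Return a list of problems; an empty list means the input is usable."""
    errors = []
    if record.get("type") not in ALLOWED_TYPES:
        errors.append(f"type must be one of {sorted(ALLOWED_TYPES)}")
    for name in NUMERIC_FIELDS:
        value = record.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} must be a number")
        elif value < 0:
            errors.append(f"{name} must not be negative")
    return errors


def risk_band(probability, review_at=0.3, block_at=0.7):
    """Map a fraud probability to a colour label (placeholder thresholds)."""
    if probability >= block_at:
        return "red"
    if probability >= review_at:
        return "orange"
    return "green"


good = {"type": "PAYMENT", "amount": 120.0, "oldbalanceOrg": 500.0,
        "newbalanceOrig": 380.0, "oldbalanceDest": 0.0, "newbalanceDest": 120.0}
bad = dict(good, type="UNKNOWN", amount=-5)
print(validate_transaction(good), validate_transaction(bad), risk_band(0.82))
```

In the form's click handler, validation would run before preprocessing, and the band could be used to style the displayed probability.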


# Task
Split the file "/content/onlinefraud.csv" into multiple files, each less than 25MB.

## Load a small portion of the data

### Subtask:
Load a small number of rows from the CSV to estimate the average row size.


**Reasoning**:
Read a small number of rows from the CSV file to estimate the average row size.



In [None]:
# Read the first 1000 rows
df_head = pd.read_csv('/content/onlinefraud.csv', nrows=1000)

# Print the shape of the DataFrame
print("Shape of the first 1000 rows:", df_head.shape)

Shape of the first 1000 rows: (1000, 11)


**Reasoning**:
Estimate the average row size using the size of the small DataFrame and the number of rows.



In [None]:
# Calculate the size of the DataFrame in bytes
size_in_bytes = df_head.memory_usage(deep=True).sum()

# Calculate the number of rows
num_rows = df_head.shape[0]

# Estimate the average row size in bytes
average_row_size = size_in_bytes / num_rows

print(f"Estimated average row size: {average_row_size:.2f} bytes")

Estimated average row size: 239.35 bytes


## Estimate rows per chunk

### Subtask:
Calculate approximately how many rows will fit into a 25MB chunk based on the estimated row size.


**Reasoning**:
Calculate the estimated number of rows per chunk based on the target chunk size and the estimated average row size.



In [None]:
# Define the target chunk size in bytes (25MB)
target_chunk_size_bytes = 25 * 1024 * 1024

# Calculate the estimated number of rows per chunk
estimated_rows_per_chunk = target_chunk_size_bytes / average_row_size

# Print the estimated number of rows per chunk
print(f"Estimated number of rows per 25MB chunk: {int(estimated_rows_per_chunk)}")

Estimated number of rows per 25MB chunk: 109525
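
The estimate above is based on the DataFrame's in-memory footprint, which is usually larger than the same rows stored as CSV text. A hedged alternative is to measure the length of the raw lines on disk. The helper below is a sketch using only the standard library; `estimate_rows_per_chunk` is a hypothetical name, and it assumes one record per physical line.

```python
import os
import tempfile


def estimate_rows_per_chunk(csv_path, target_bytes, sample_rows=1000):
    """Estimate how many data rows fit in target_bytes, using on-disk line lengths."""
    with open(csv_path, "rb") as handle:
        header = handle.readline()
        sampled_bytes = 0
        sampled_rows = 0
        for line in handle:
            sampled_bytes += len(line)
            sampled_rows += 1
            if sampled_rows >= sample_rows:
                break
    if sampled_rows == 0:
        raise ValueError("CSV file has no data rows")
    average_row_bytes = sampled_bytes / sampled_rows
    # Reserve room for the header so every chunk could carry its own copy.
    return int((target_bytes - len(header)) // average_row_bytes)


# Usage on a tiny throwaway file; the notebook's file would be '/content/onlinefraud.csv'.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"a,b\n" + b"1,2\n" * 10)  # 4-byte header, ten 4-byte rows
    demo_rows = estimate_rows_per_chunk(demo_path, target_bytes=44)
print(demo_rows)  # prints 10
```

Because the sample comes from the start of the file, later rows may be longer than the sample average, so a safety margin on `target_bytes` is still advisable.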


## Read and split the data

### Subtask:
Read the entire CSV file in chunks and save each chunk to a new CSV file.


**Reasoning**:
Read the entire CSV file in chunks and save each chunk to a new CSV file.



In [None]:
# 1. Define the chunk size using the previously calculated estimated_rows_per_chunk, converting it to an integer.
chunk_size = int(estimated_rows_per_chunk)

# 2. Initialize a counter variable for the chunk number to 0.
chunk_number = 0

# 3. Create a pandas.read_csv iterator for the input file /content/onlinefraud.csv with the specified chunk size.
# 4. Iterate through the chunks using a for loop.
for chunk in pd.read_csv('/content/onlinefraud.csv', chunksize=chunk_size):
    # 5. Inside the loop, construct the output filename for the current chunk using an f-string
    output_filename = f'onlinefraud_chunk_{chunk_number}.csv'

    # 6. Save the current chunk to its own CSV file. Only the first chunk includes the header row;
    #    later chunks are written without it, so supply the column names when reading them back.
    if chunk_number == 0:
        chunk.to_csv(output_filename, index=False, header=True)
    else:
        chunk.to_csv(output_filename, index=False, header=False)

    # 7. Increment the chunk counter.
    chunk_number += 1

# 8. After the loop finishes, print a message indicating that the file splitting is complete and mention the number of chunks created.
print(f"File splitting complete. Created {chunk_number} chunks.")

File splitting complete. Created 59 chunks.
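
Splitting by a fixed row count relies on the row-size estimate being safe. An alternative, shown here only as a sketch, is to split by bytes written so that every output file is guaranteed to stay within the limit. `split_csv_by_size` is a hypothetical helper, not part of the notebook; unlike the cell above it writes the header into every chunk, and it assumes no quoted field contains a newline.

```python
import os
import tempfile


def split_csv_by_size(csv_path, out_dir, max_bytes, prefix="chunk"):
    """Split csv_path into files of at most max_bytes, each starting with the header.

    Assumes one record per physical line. Returns the list of paths written.
    """
    paths = []
    out = None
    written = 0
    try:
        with open(csv_path, "rb") as src:
            header = src.readline()
            for line in src:
                if len(header) + len(line) > max_bytes:
                    raise ValueError("a single row plus the header exceeds max_bytes")
                # Start a new file when this row would push the current one over the limit.
                if out is None or written + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"{prefix}_{len(paths)}.csv")
                    out = open(path, "wb")
                    out.write(header)
                    written = len(header)
                    paths.append(path)
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return paths


# Usage on a throwaway file: 9-byte header, ten 4-byte rows, 21-byte limit.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.csv")
    with open(source, "wb") as handle:
        handle.write(b"id,value\n" + b"".join(b"%d,x\n" % i for i in range(10)))
    demo_paths = split_csv_by_size(source, tmp, max_bytes=21)
    demo_sizes = [os.path.getsize(p) for p in demo_paths]
print(demo_sizes)  # prints [21, 21, 21, 13]
```

Writing the header into each file costs a few bytes per chunk but lets every chunk be loaded on its own.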


## Verify file sizes

### Subtask:
Check the sizes of the generated files to ensure they are all under 25MB.


**Reasoning**:
Check the size of each generated chunk file and report if any exceed the maximum allowed size.



In [None]:
import os

# Define the maximum allowed file size in bytes (25MB)
max_file_size_bytes = 25 * 1024 * 1024

# Track whether any file exceeds the limit
failure_flag = False

# Iterate through the files in the current directory
print("Checking sizes of generated chunk files:")
for filename in os.listdir('.'):
    # Only consider files that match the naming pattern of the split files
    if filename.startswith('onlinefraud_chunk_') and filename.endswith('.csv'):
        file_path = os.path.join('.', filename)
        # Get the file size in bytes
        file_size = os.path.getsize(file_path)

        # Report an error if the file is larger than the maximum allowed size
        if file_size > max_file_size_bytes:
            print(f"Error: File '{filename}' size ({file_size} bytes) exceeds the maximum allowed size ({max_file_size_bytes} bytes).")
            failure_flag = True
        else:
            print(f"File '{filename}': {file_size} bytes (under limit)")

# After checking all files, print a success message if none exceeded the maximum size
if not failure_flag:
    print("\nAll generated chunk files are under the 25MB limit.")

Checking sizes of generated chunk files:
File 'onlinefraud_chunk_10.csv': 8532547 bytes (under limit)
File 'onlinefraud_chunk_31.csv': 8501771 bytes (under limit)
File 'onlinefraud_chunk_56.csv': 8497968 bytes (under limit)
File 'onlinefraud_chunk_39.csv': 8457832 bytes (under limit)
File 'onlinefraud_chunk_53.csv': 8508103 bytes (under limit)
File 'onlinefraud_chunk_16.csv': 8532077 bytes (under limit)
File 'onlinefraud_chunk_5.csv': 8365358 bytes (under limit)
File 'onlinefraud_chunk_37.csv': 8508852 bytes (under limit)
File 'onlinefraud_chunk_8.csv': 8419355 bytes (under limit)
File 'onlinefraud_chunk_41.csv': 8532990 bytes (under limit)
File 'onlinefraud_chunk_18.csv': 8514352 bytes (under limit)
File 'onlinefraud_chunk_9.csv': 8300565 bytes (under limit)
File 'onlinefraud_chunk_15.csv': 8529967 bytes (under limit)
File 'onlinefraud_chunk_0.csv': 8272403 bytes (under limit)
File 'onlinefraud_chunk_40.csv': 8482357 bytes (under limit)
File 'onlinefraud_chunk_43.csv': 8422539 bytes (under limit)

## Summary:

### Data Analysis Key Findings

*   The first 1,000 rows average approximately 239.35 bytes per row in memory, as measured by `memory_usage(deep=True)`.
*   With a target chunk size of 25MB, this estimate gives 109,525 rows per chunk.
*   The `/content/onlinefraud.csv` file was successfully split into 59 smaller CSV files.
*   Each generated chunk file was confirmed to be under the 25MB limit. The files listed are roughly 8.3 to 8.5 million bytes each, well below the target, because the in-memory estimate overstates the size of a row on disk.

### Insights or Next Steps

*   The splitting process successfully created manageable file sizes for further processing or analysis that might be limited by file size constraints.
*   These smaller files are now ready for individual or distributed processing workflows.
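
For downstream processing, two details of the generated files matter: `os.listdir` returns them in arbitrary order (and a plain string sort would place `chunk_10` before `chunk_2`), and only chunk 0 carries the header row. The sketch below handles both with the standard library. `list_chunks` and `iter_rows` are hypothetical helper names, and the demo data is made up.

```python
import csv
import os
import re
import tempfile

CHUNK_PATTERN = re.compile(r"^onlinefraud_chunk_(\d+)\.csv$")


def list_chunks(directory="."):
    """Return chunk file paths sorted by numeric suffix (0, 1, 2, ..., 10)."""
    found = []
    for name in os.listdir(directory):
        match = CHUNK_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(directory, name)))
    return [path for _, path in sorted(found)]


def iter_rows(chunk_paths):
    """Yield each row as a dict, taking column names from the first chunk only."""
    fieldnames = None
    for path in chunk_paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            if fieldnames is None:
                fieldnames = next(reader, None)
                if fieldnames is None:
                    raise ValueError("first chunk is empty, so there is no header")
            for row in reader:
                yield dict(zip(fieldnames, row))


# Usage with throwaway files written out of order on purpose.
with tempfile.TemporaryDirectory() as tmp:
    contents = {0: "id,amount\n1,10.0\n", 2: "3,30.0\n", 10: "4,40.0\n", 1: "2,20.0\n"}
    for index, text in contents.items():
        chunk_path = os.path.join(tmp, f"onlinefraud_chunk_{index}.csv")
        with open(chunk_path, "w", newline="") as handle:
            handle.write(text)
    demo_order = [os.path.basename(p) for p in list_chunks(tmp)]
    demo_records = list(iter_rows(list_chunks(tmp)))
print(demo_order)
```

Counting the rows yielded this way and comparing the total with the source file is a cheap integrity check after splitting.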


## Understanding the Business Problem: Predicting Online Fraud

The rise of online transactions has unfortunately been accompanied by a significant increase in fraudulent activities. For businesses operating online, accurately identifying and preventing fraud is paramount not only to minimize financial losses but also to maintain customer trust and a positive brand reputation.

The core business problem we address in this project is the need for a robust system to predict fraudulent online transactions. This involves analyzing transaction data to distinguish between legitimate and illicit activities.

Key considerations from a business perspective include:

*   **Minimizing Financial Losses:** The most direct impact of fraud is financial loss. An effective prediction model can significantly reduce the amount of money lost to fraudulent transactions.
*   **Protecting Customer Trust:** Fraudulent transactions can erode customer confidence in a platform's security. Preventing fraud helps build and maintain trust.
*   **Operational Efficiency:** Manually reviewing suspicious transactions is time-consuming and costly. An automated system can flag high-risk transactions for review, improving operational efficiency.
*   **Balancing False Positives and False Negatives:**
    *   **False Positives:** Legitimate transactions incorrectly flagged as fraud. These can lead to customer frustration, lost sales, and increased review costs.
    *   **False Negatives:** Fraudulent transactions missed by the system. These result in direct financial losses and can damage reputation.
    Achieving a balance between minimizing these two types of errors is crucial, often prioritizing the reduction of false negatives (catching more fraud) while keeping false positives at an acceptable level.
*   **Evolving Fraud Tactics:** Fraudsters constantly adapt their methods, requiring the fraud detection system to be adaptable and continuously updated.

A successful outcome for this project would be the deployment of a predictive model that effectively identifies a high percentage of actual fraudulent transactions (high recall) while keeping the number of legitimate transactions incorrectly flagged as fraud low (high precision). This would lead to reduced financial losses, improved customer satisfaction, and a more efficient fraud prevention process.
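
Recall and precision both follow from confusion-matrix counts: recall is the share of actual fraud that is caught, and precision is the share of flagged transactions that are truly fraudulent, so few false positives mean high precision. The short example below makes the arithmetic explicit. The false-positive and false-negative counts are the ones reported in the earlier model summary; the true-positive count of 414 is a hypothetical value chosen to be consistent with the reported recall of 0.73, not a figure stated in the notebook.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# fp and fn come from the earlier summary; tp=414 is an illustrative assumption.
precision, recall = precision_recall(tp=414, fp=4, fn=151)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # prints precision=0.99, recall=0.73
```

Read this way, the model rarely raises a false alarm but still misses roughly a quarter of the fraud, which is the side of the trade-off the business case cares about most.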

## Understanding the Business Problem: Predicting Online Fraud

The digital landscape has revolutionized commerce, banking, and social interaction, but this rapid expansion has also created fertile ground for malicious activities, most notably online fraud. For businesses operating in this space – from e-commerce giants and financial institutions to online gaming platforms and social media networks – the threat of fraud is a constant and evolving challenge. Accurately identifying and preventing fraudulent transactions is not merely a technical problem; it's a critical business imperative with far-reaching implications that extend beyond immediate financial losses to impact customer trust, brand reputation, and operational sustainability.

The core business problem we are tackling in this project is the development of a robust and effective system for predicting fraudulent online transactions. This involves delving into vast datasets of transaction records to discern subtle patterns and anomalies that differentiate legitimate activities from illicit ones. The goal is to build a predictive model capable of flagging suspicious transactions in near real-time, enabling businesses to take timely action to prevent fraud.

From a business perspective, several key considerations underscore the importance and complexity of this problem:

**1. The Scale and Cost of Online Fraud:** The financial impact of online fraud is staggering and continues to grow annually. Businesses face direct losses from fraudulent transactions, chargebacks, and associated fees. Beyond these direct costs, there are indirect expenses related to investigating and resolving fraudulent cases, legal costs, and potential regulatory fines if fraud prevention measures are deemed inadequate. For many businesses, particularly smaller ones, a significant fraud event can be devastating, threatening their very existence.

**2. Protecting and Enhancing Customer Trust and Experience:** In the digital age, customer trust is a fragile but invaluable asset. A single fraudulent transaction or a frustrating experience with a false positive (a legitimate transaction incorrectly flagged as fraud) can severely erode customer confidence and lead to churn. Businesses must navigate a delicate balance: implementing stringent security measures to prevent fraud without creating unnecessary friction or inconvenience for legitimate customers. A successful fraud detection system should ideally be invisible to the vast majority of users, intervening only when truly necessary.

**3. Operational Efficiency and Resource Allocation:** Manually reviewing every transaction for potential fraud is an impossible task given the sheer volume of online activity. Fraud detection teams are often overwhelmed, leading to delays in processing legitimate transactions and increasing operational costs. An effective predictive model can significantly improve operational efficiency by prioritizing high-risk transactions for human review, allowing fraud analysts to focus their expertise on the most complex and suspicious cases. This targeted approach reduces the workload and optimizes resource allocation within the fraud prevention department.

**4. The Perilous Balance: False Positives vs. False Negatives:** This is perhaps the most critical trade-off in fraud detection.

*   **False Positives:** Occur when a legitimate transaction is incorrectly identified as fraudulent. The consequences include frustrating customers, potentially losing sales, incurring costs associated with manual review and customer service, and damaging the brand's reputation for security and reliability.
*   **False Negatives:** Occur when a fraudulent transaction is missed by the system. The direct consequence is financial loss to the business or its customers. Repeated false negatives can also indicate vulnerabilities that fraudsters may exploit further, leading to larger and more frequent attacks.

From a risk management perspective, while false positives are undesirable, false negatives often represent a more significant threat to the business's financial health and reputation. Therefore, the objective is often to maximize the detection of actual fraud (high recall) while keeping the rate of false positives at an acceptable, manageable level.
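
One common way to act on this trade-off is to choose the decision threshold that minimizes total cost when a missed fraud is priced higher than a false alarm. The sketch below is illustrative only: `total_cost` and `best_threshold` are hypothetical helpers, the labels and probabilities are made up, and the cost ratio of 20 to 1 is an assumption, not a figure from this project.

```python
def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a given threshold."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if flagged and label == 0:
            cost += cost_fp      # legitimate transaction flagged as fraud
        elif not flagged and label == 1:
            cost += cost_fn      # fraudulent transaction missed
    return cost


def best_threshold(y_true, y_prob, cost_fp, cost_fn, candidates=None):
    """Return the lowest candidate threshold that achieves the minimum total cost."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda t: total_cost(y_true, y_prob, t, cost_fp, cost_fn))


# Made-up validation data: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 1, 1]
y_prob = [0.05, 0.20, 0.60, 0.40, 0.90]

fn_heavy = best_threshold(y_true, y_prob, cost_fp=1, cost_fn=20)
fp_heavy = best_threshold(y_true, y_prob, cost_fp=20, cost_fn=1)
print(fn_heavy, fp_heavy)
```

When missed fraud is the expensive error, the chosen threshold drops so that more transactions are flagged; reversing the costs pushes it up. The threshold should be selected on validation data rather than on the test set used for final reporting.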

**5. The Dynamic Nature of Fraud:** Fraud is not a static problem. Fraudsters are often sophisticated and constantly adapt their tactics and techniques to bypass existing security measures. This phenomenon, known as "concept drift," means that a fraud detection model trained on past data may become less effective over time as new fraud patterns emerge. A successful fraud prevention strategy requires continuous monitoring of the model's performance, regular retraining with fresh data, and the ability to quickly adapt to new threats.
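
Continuous monitoring can start very simply: compute recall over consecutive time windows of labelled transactions and flag windows that fall well below the level seen at evaluation time. The sketch below shows one possible form. `recall_by_window` and `degraded_windows` are hypothetical helpers, the history is made up, and the tolerance of 0.1 is an arbitrary placeholder; the baseline of 0.73 is the recall reported earlier in the notebook.

```python
def recall_by_window(records, window_size):
    """records: list of (y_true, y_pred) pairs in time order.

    Returns one recall value per window, or None for windows with no fraud.
    """
    recalls = []
    for start in range(0, len(records), window_size):
        window = records[start:start + window_size]
        actual_fraud = sum(1 for y_true, _ in window if y_true == 1)
        caught = sum(1 for y_true, y_pred in window if y_true == 1 and y_pred == 1)
        recalls.append(caught / actual_fraud if actual_fraud else None)
    return recalls


def degraded_windows(recalls, baseline, tolerance=0.1):
    """Return indices of windows whose recall is more than tolerance below baseline."""
    return [i for i, value in enumerate(recalls)
            if value is not None and value < baseline - tolerance]


# Made-up history: the first window catches all fraud, the second misses most of it.
history = [(1, 1), (0, 0), (1, 1), (0, 0),
           (1, 0), (1, 1), (0, 0), (1, 0)]
recalls = recall_by_window(history, window_size=4)
flagged = degraded_windows(recalls, baseline=0.73)
print(recalls, flagged)
```

A flagged window is a prompt to investigate and possibly retrain, not proof of drift. In practice, confirmed fraud labels often arrive with a delay, so the most recent windows may be incomplete.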

**6. Stakeholders and Their Perspectives:** A fraud detection project involves multiple stakeholders with varying perspectives and priorities:

*   **Business Executives:** Concerned with overall financial losses, reputation, and regulatory compliance.
*   **Risk Management Teams:** Focused on identifying, assessing, and mitigating fraud risks.
*   **Fraud Prevention Analysts:** Responsible for investigating flagged transactions and developing fraud prevention strategies.
*   **Customers:** Expect secure transactions and a frictionless user experience.
*   **Regulators:** Enforce compliance with anti-money laundering (AML) and Know Your Customer (KYC) regulations.

A successful project must consider the needs and concerns of all these stakeholders.

**What Does a Successful Outcome Look Like?**

From a business perspective, a truly successful outcome for this project would be the implementation of a predictive model that:

*   **Significantly Reduces Financial Losses:** By effectively identifying and preventing a high percentage of fraudulent transactions.
*   **Maintains or Improves Customer Experience:** By minimizing false positives and ensuring legitimate transactions are processed smoothly.
*   **Increases Operational Efficiency:** By allowing fraud analysts to focus on high-risk cases.
*   **Is Adaptable to Evolving Threats:** With a framework for continuous monitoring and retraining.
*   **Provides Actionable Insights:** Beyond just flagging transactions, the model or the analysis should provide insights into emerging fraud patterns.

Ultimately, the goal is to move from a reactive approach to fraud management to a proactive one, using the power of data and machine learning to stay ahead of fraudsters and protect the integrity of online transactions. This project aims to lay the groundwork for such a system by developing and evaluating a predictive model for online fraud detection.