# Step 1: Import libraries

Import the libraries commonly used for data preprocessing, such as pandas, NumPy, and scikit-learn.
                                                                                 
# Step 2: Handle missing values

Check for missing values in the dataset using the isnull() or isna() method.
Decide on a strategy to handle missing values, such as:
1. Dropping rows or columns with missing values (if the dataset is large and missing values are sparse).
2. Filling missing values with a specific value (e.g., mean, median, or mode) using the fillna() method.
3. Imputing missing values using a machine learning algorithm (e.g., K-Nearest Neighbors or decision trees).

Implement the chosen strategy to handle missing values.
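
The three strategies above can be sketched on a small example. This is a minimal illustration, not the notebook's own preprocessing: the `toy` DataFrame and its column names are made up, and `KNNImputer` is assumed here as one possible model-based imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 40.0],
    "age": [25.0, 35.0, np.nan, 45.0],
})

# Check: count missing values per column.
missing_counts = toy.isna().sum()

# Strategy 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Strategy 2: fill each column's gaps with that column's median.
filled = toy.fillna(toy.median())

# Strategy 3: impute each gap from the 2 most similar rows.
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy),
    columns=toy.columns,
)
```

Dropping rows is the simplest choice but discards half of this toy frame, which is why it is usually reserved for large datasets with few gaps.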

    
# Step 3: Data cleaning and normalization

Check for outliers and anomalies in the data using visualization techniques (e.g., histograms, scatter plots) or statistical methods (e.g., Z-score, modified Z-score).
Remove or transform outliers and anomalies as necessary.
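
As a rough sketch of the statistical checks mentioned above, the example below computes both the Z-score and the modified Z-score for a made-up sample. The cutoffs of 3 and 3.5 are common rules of thumb assumed here, not values taken from this notebook.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Z-score: distance from the mean, measured in standard deviations.
z = (values - values.mean()) / values.std()

# Modified Z-score: uses the median and the median absolute deviation (MAD),
# which are far less affected by the outlier itself.
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad

z_outliers = np.abs(z) > 3             # assumed cutoff
mz_outliers = np.abs(modified_z) > 3.5  # assumed cutoff
```

In this small sample the extreme value inflates the standard deviation so much that the plain Z-score stays below 3 and flags nothing, while the modified Z-score flags only the last value.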
Normalize or scale the data so that features with large ranges do not dominate the model. Common techniques include:
1. Standardization (Z-scoring): subtract the mean and divide by the standard deviation for each feature.
2. Min-max scaling: rescale each feature to a common range (e.g., 0 to 1) using that feature's minimum and maximum values.
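
The two scaling techniques can be compared side by side. The sketch below uses a made-up array `X_demo` and scikit-learn's `StandardScaler` and `MinMaxScaler`, and also writes out the equivalent formulas by hand to show what each scaler computes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Standardization: (x - mean) / std, column by column.
standardized = StandardScaler().fit_transform(X_demo)
standardized_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling: (x - min) / (max - min), column by column.
min_maxed = MinMaxScaler().fit_transform(X_demo)
min_maxed_manual = (X_demo - X_demo.min(axis=0)) / (
    X_demo.max(axis=0) - X_demo.min(axis=0)
)
```

After either transform both columns occupy the same range, so neither dominates purely because of its units.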

                                                                                  
                                                                                   
# Step 4: Feature selection and engineering

Identify relevant features and remove irrelevant or redundant ones using techniques such as:
1. Correlation analysis: calculate the correlation between features and the target variable.
2. Mutual information: calculate the mutual information between features and the target variable.
3. Recursive feature elimination: recursively eliminate the least important features until a specified number of features is reached.
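
One possible way to try these three techniques is sketched below on synthetic data from `make_classification`. The feature names, the choice of logistic regression as the RFE estimator, and the target of 3 features are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 3 are informative.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=1, random_state=0
)
features = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
target = pd.Series(y_arr, name="target")

# Correlation analysis: absolute correlation of each feature with the target.
correlations = features.corrwith(target).abs().sort_values(ascending=False)

# Mutual information: also captures non-linear dependence on the target.
mi_scores = pd.Series(
    mutual_info_classif(features, target, random_state=0),
    index=features.columns,
)

# Recursive feature elimination: drop the weakest feature until 3 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(features, target)
selected = list(features.columns[rfe.support_])
```

The three methods need not agree on a ranking, so it can help to compare `correlations`, `mi_scores`, and `selected` before dropping anything.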

# Step 5: Split data into training and testing sets

Split the preprocessed data into training and testing sets using techniques such as:
1. Random sampling: randomly split the data into training and testing sets.
2. Stratified sampling: split the data into training and testing sets while maintaining the same class distribution.
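
Stratified sampling matters most when one class is rare. The sketch below builds a made-up label vector with 5% positives and passes `stratify` to `train_test_split` so that both splits keep that ratio. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 positives out of 1000 samples.
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the 5% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
```

With a purely random split the number of positives in the test set varies from run to run, which makes metrics such as recall noisier.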

                                                                                   
# Step 6: Verify data quality

Verify that the preprocessed data meets the requirements for modeling, such as:
1. No missing values.
2. Features are scaled and normalized.
3. Outliers and anomalies are handled.
4. Relevant features are selected and engineered.
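
Some of these checks can be automated. The helper below, `check_preprocessed`, is a hypothetical function written for this sketch; it only covers missing values and standardization, and it assumes the data was standardized rather than min-max scaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def check_preprocessed(X_scaled, tol=1e-6):
    """Hypothetical helper: report basic quality checks on a standardized array."""
    X_scaled = np.asarray(X_scaled, dtype=float)
    return {
        "no_missing": not np.isnan(X_scaled).any(),
        "zero_mean": bool(np.allclose(X_scaled.mean(axis=0), 0.0, atol=tol)),
        "unit_std": bool(np.allclose(X_scaled.std(axis=0), 1.0, atol=tol)),
    }

# Made-up raw data, checked before and after scaling.
raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
report_raw = check_preprocessed(raw)
report_scaled = check_preprocessed(StandardScaler().fit_transform(raw))
```

The mean and standard deviation checks apply to the training set only. A test set transformed with the training set's statistics will be close to, but not exactly, zero mean and unit variance.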
    

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [11]:
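# Load the dataset from the Excel file.
file_path = r"creditcard(1).xlsx"
df = pd.read_excel(file_path)

# Preview the first five rows.
print(df.head())

# Count missing values per column, then drop any incomplete rows.
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

df.dropna(inplace=True)

# Separate the features from the target column.
X = df.drop(columns=['Class'])
y = df['Class']

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Done")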

   Time        V1        V2        V3        V4        V5        V6        V7  \
0     0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1     0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2     1 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3     1 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4     2 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28 

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC()
}

model_performance = {}

for model_name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    model_performance[model_name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

print(model_performance)

{'Logistic Regression': {'Accuracy': 0.9991222218320986, 'Precision': 0.8636363636363636, 'Recall': 0.5816326530612245, 'F1 Score': 0.6951219512195121}, 'Random Forest': {'Accuracy': 0.9995611109160493, 'Precision': 0.974025974025974, 'Recall': 0.7653061224489796, 'F1 Score': 0.8571428571428571}, 'Decision Tree': {'Accuracy': 0.9991397773954567, 'Precision': 0.7247706422018348, 'Recall': 0.8061224489795918, 'F1 Score': 0.7632850241545894}, 'SVM': {'Accuracy': 0.9993153330290369, 'Precision': 0.9682539682539683, 'Recall': 0.6224489795918368, 'F1 Score': 0.7577639751552796}}


In [14]:
from sklearn.metrics import confusion_matrix
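
# Reuse the logistic regression model fitted in the previous cell.
lr_model = models['Logistic Regression']
y_pred_lr = lr_model.predict(X_test_scaled)

# Rows are true classes, columns are predicted classes.
conf_matrix = confusion_matrix(y_test, y_pred_lr)
print("Confusion Matrix:\n", conf_matrix)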

precision = precision_score(y_test, y_pred_lr)
recall = recall_score(y_test, y_pred_lr)
f1 = f1_score(y_test, y_pred_lr)

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Confusion Matrix:
 [[56855     9]
 [   41    57]]
Precision: 0.8636363636363636
Recall: 0.5816326530612245
F1 Score: 0.6951219512195121
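
The reported metrics can be reproduced by hand from the confusion matrix above, which also shows why accuracy alone is misleading here. The sketch assumes class 1 is the positive (fraud) class and relies on scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Confusion matrix copied from the output above.
conf_matrix = np.array([[56855, 9],
                        [41, 57]])

# With rows = true class and columns = predicted class,
# ravel() yields TN, FP, FN, TP for a binary problem.
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)   # 57 / 66
recall = tp / (tp + fn)      # 57 / 98
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / conf_matrix.sum()

# Accuracy of a trivial model that predicts class 0 for every row.
baseline_accuracy = (tn + fp) / conf_matrix.sum()
```

The trivial baseline already reaches about 99.8% accuracy because only 98 of the 56,962 test rows belong to class 1, so precision, recall, and F1 are the more informative figures for comparing the models.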


# Question 1.5

To ensure a machine learning model for credit card fraud detection remains effective over time, use the following strategies:

TECHNICAL STRATEGIES

1. Regular Model Retraining: Update the model with new data periodically.
2. Adaptive Algorithms: Use algorithms that adapt to changes in data distribution.
3. Model Monitoring: Track performance metrics and set up alerts for significant drops.
4. Data Pipeline Management: Maintain a robust data pipeline with quality checks.
5. Ensemble Methods: Combine predictions from multiple models.
6. Feature Engineering: Update features to capture emerging fraud patterns.
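
As one illustration of the model monitoring strategy, the sketch below compares recall on successive batches of labelled transactions against a baseline and reports the batches that fall too far below it. The function `recall_alerts`, its `max_drop` threshold, and the sample batches are hypothetical, not part of any existing monitoring setup.

```python
from sklearn.metrics import recall_score

def recall_alerts(batches, baseline_recall, max_drop=0.10):
    """Hypothetical monitor: return indices of batches whose recall
    is more than max_drop below baseline_recall."""
    alerts = []
    for i, (y_true, y_pred) in enumerate(batches):
        batch_recall = recall_score(y_true, y_pred, zero_division=0)
        if batch_recall < baseline_recall - max_drop:
            alerts.append(i)
    return alerts

# Made-up batches of (true labels, predicted labels).
batches = [
    ([1, 1, 0, 0, 1], [1, 1, 0, 0, 1]),  # recall 1.0
    ([1, 1, 1, 0, 0], [1, 0, 0, 0, 0]),  # recall 1/3
]
alerts = recall_alerts(batches, baseline_recall=0.8)
```

A drop flagged this way would be a prompt to investigate and possibly retrain, not proof on its own that fraud patterns have shifted.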


OPERATIONAL STRATEGIES

1. Human-in-the-Loop: Use feedback from fraud analysts to refine the model.
2. Continuous Evaluation: Regularly test the model on validation sets or with A/B testing.
3. Collaboration with Fraud Analysts: Work closely with fraud detection teams.
4. Model Governance: Document and manage changes to the model.
5. Scalability and Performance: Ensure the infrastructure can handle increasing data volumes.
6. Stakeholder Communication: Keep stakeholders informed about the model’s performance and updates.