## Machine Learning Models and Predictions

### Fraud Detection Model
- **Model**: Classification (Logistic Regression, Neural Networks)
- **Objective**: Classify whether a transaction is fraudulent (SUSPECTED_FRAUD) based on sales data and other features.
- **Features**: Sales, Product Category, Department, Order Value, Customer Segment, Transaction Type.
- **Target**: SUSPECTED_FRAUD (binary classification).
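
The setup described above can be sketched as follows. This is a minimal illustration on made-up data, not the project's pipeline: the toy rows, the category values, and the 5% fraud rate are all assumptions, and only a subset of the listed features is used. It shows the target being separated before encoding and the split being stratified so that a rare positive class keeps the same share in train and test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real CSV (assumption: values and fraud rate are invented).
rng = np.random.default_rng(42)
n_rows = 1000
toy = pd.DataFrame({
    "Sales": rng.uniform(10, 500, size=n_rows).round(2),
    "Category Name": rng.choice(["Cleats", "Fishing", "Electronics"], size=n_rows),
    "Customer Segment": rng.choice(["Consumer", "Corporate", "Home Office"], size=n_rows),
    "Type": rng.choice(["DEBIT", "TRANSFER", "CASH", "PAYMENT"], size=n_rows),
    "SUSPECTED_FRAUD": (rng.random(n_rows) < 0.05).astype(int),
})

# Separate the target first so it can never end up among the features.
y_fraud = toy["SUSPECTED_FRAUD"]
X_fraud = pd.get_dummies(
    toy.drop(columns=["SUSPECTED_FRAUD"]),
    columns=["Category Name", "Customer Segment", "Type"],
    drop_first=True,
)

# stratify=y_fraud keeps the positive-class share equal across both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_fraud, y_fraud, test_size=0.2, random_state=42, stratify=y_fraud
)

print("Fraud share overall:", round(y_fraud.mean(), 3))
print("Fraud share in train:", round(y_tr.mean(), 3))
print("Fraud share in test:", round(y_te.mean(), 3))
```

With a rare positive class, accuracy alone says little, so precision and recall for class 1 are the more informative numbers to watch.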

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

df = pd.read_csv('/Users/teitelbaumsair/Desktop/Data Bootcamp Repo/DI_Bootcamp/Data Bootcamp Final Project/Data/DataCoSupplyChainDataset_clean.csv')

# Inspect the target variable
print("Unique values in SUSPECTED_FRAUD:", df['SUSPECTED_FRAUD'].unique())


Unique values in SUSPECTED_FRAUD: [0 1]


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 54 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Type                           180519 non-null  object 
 1   Days for shipping (real)       180519 non-null  int64  
 2   Days for shipment (scheduled)  180519 non-null  int64  
 3   Benefit per order              180519 non-null  float64
 4   Sales per customer             180519 non-null  float64
 5   Delivery Status                180519 non-null  object 
 6   Late_delivery_risk             180519 non-null  int64  
 7   Category Id                    180519 non-null  int64  
 8   Category Name                  180519 non-null  object 
 9   Customer City                  180519 non-null  object 
 10  Customer Country               180519 non-null  object 
 11  Customer Fname                 180519 non-null  object 
 12  Customer Id                   

In [17]:

# Drop unnecessary columns
drop_columns = [
    'Customer Fname', 'Customer Lname', 'Customer Street', 'Customer City', 'Customer State',
    'Customer Country', 'Customer Zipcode', 'Customer Id', 'Order City', 'Order Country', 
    'Order State', 'Order Region', 'Order Status', 'Product Image', 'Product Name', 'Order Id', 
    'Order Item Cardprod Id', 'Order Item Id', 'Order Item Quantity', 'Order Item Profit Ratio',
    'Order Item Product Price', 'Order Item Discount', 'Order Item Discount Rate', 'Order Item Total',
    'Order Profit Per Order', 'Product Card Id', 'Product Category Id', 'shipping date (DateOrders)',
    'order date (DateOrders)', 'Product Price', 'Product Status', 'Latitude', 'Longitude', 
    'Year_Month', 'Month_Year', 'Month', 'Year'
]
# `columns=` already implies the column axis, so `axis=1` is redundant
df.drop(columns=drop_columns, inplace=True)

# One-hot encode categorical columns
categorical_cols = ['Type', 'Category Name', 'Market', 'Department Name', 'Shipping Mode','Delivery Status', 'Customer Segment']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

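# Define features (X) and target (y).
# NOTE: the target used here is Late_delivery_risk, not SUSPECTED_FRAUD as the
# section heading states, so SUSPECTED_FRAUD stays in X as an input feature.
# The one-hot encoded Delivery Status columns also stay in X and may encode
# the target directly.
X = df.drop('Late_delivery_risk', axis=1)
y = df['Late_delivery_risk']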

# Train-test split: hold out 20% of rows (train_test_split is imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression Model (LogisticRegression is imported in the first cell)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg_preds = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_reg_preds))
print(classification_report(y_test, log_reg_preds))
print(confusion_matrix(y_test, log_reg_preds))

# Neural Network (MLPClassifier) Model
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=1000, random_state=42)
nn.fit(X_train, y_train)
nn_preds = nn.predict(X_test)
print("Neural Network Accuracy:", accuracy_score(y_test, nn_preds))
print(classification_report(y_test, nn_preds))
print(confusion_matrix(y_test, nn_preds))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Accuracy: 0.9542709949036118
              precision    recall  f1-score   support

           0       0.97      0.92      0.95     16199
           1       0.94      0.98      0.96     19905

    accuracy                           0.95     36104
   macro avg       0.96      0.95      0.95     36104
weighted avg       0.96      0.95      0.95     36104

[[14956  1243]
 [  408 19497]]
Neural Network Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     16199
           1       1.00      1.00      1.00     19905

    accuracy                           1.00     36104
   macro avg       1.00      1.00      1.00     36104
weighted avg       1.00      1.00      1.00     36104

[[16199     0]
 [    0 19905]]


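### Model Evaluation Summary:

All metrics in this summary are computed on the held-out 20% test split (36,104 rows). The code above sets `y = df['Late_delivery_risk']`, so Class 1 denotes an order flagged as a late-delivery risk and Class 0 an order that is not; `SUSPECTED_FRAUD` is only an input feature in this run.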

#### Logistic Regression:
- **Accuracy**: 95.43%
- **Precision**:
  - Class 0 (no late-delivery risk): 0.97
  - Class 1 (late-delivery risk): 0.94
- **Recall**:
  - Class 0 (no late-delivery risk): 0.92
  - Class 1 (late-delivery risk): 0.98
- **F1-Score**:
  - Class 0 (no late-delivery risk): 0.95
  - Class 1 (late-delivery risk): 0.96
- **Confusion Matrix**:
  - True Positives (Class 1 predicted as Class 1): 19,497
  - False Positives (Class 0 predicted as Class 1): 1,243
  - True Negatives (Class 0 predicted as Class 0): 14,956
  - False Negatives (Class 1 predicted as Class 0): 408
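
As a cross-check, the reported precision, recall, F1, and accuracy can be recomputed from the four confusion-matrix counts alone. The sketch below uses only the numbers printed in the output above; the variable names are illustrative.

```python
# Confusion matrix reported for logistic regression, laid out as scikit-learn
# prints it: rows are true classes, columns are predicted classes.
tn, fp = 14956, 1243   # true class 0: predicted 0, predicted 1
fn, tp = 408, 19497    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 is treated as the positive class.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# The same formulas with the roles of the classes swapped give class 0.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy = {accuracy:.4f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The rounded values agree with the classification report, which confirms that the 1,243 errors are class 0 rows predicted as class 1 and the 408 errors are class 1 rows predicted as class 0.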
  
**Interpretation**:
The Logistic Regression model reaches 95.43% accuracy on the held-out test split. Recall for Class 1 is 0.98: the model misses only 408 of the 19,905 actual Class 1 rows. Recall for Class 0 is lower, at 0.92, because 1,243 of the 16,199 actual Class 0 rows are predicted as Class 1, which also pulls Class 1 precision down to 0.94. The F1-scores of 0.95 and 0.96 indicate a good balance between precision and recall for both classes.

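**Warning**: scikit-learn issued a `ConvergenceWarning`: the solver stopped at its iteration limit (`max_iter` defaults to 100 for `LogisticRegression`) before converging. The features were passed in unscaled (`StandardScaler` is imported but never applied), so standardizing them, raising `max_iter`, or both should address this. The reported metrics may shift slightly once the solver converges.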
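
A minimal sketch of the two remedies named in the warning, shown on synthetic data rather than the project's dataset. The `make_classification` inputs, the inflated first feature, and the choice of `max_iter=1000` are assumptions for illustration. Putting the scaler inside a pipeline means it is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: not the DataCo data).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)
X_demo[:, 0] *= 1000.0  # mimic one feature on a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training fold only, then applied to the test fold.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

# n_iter_ reports how many iterations the solver actually used.
n_iter = int(scaled_log_reg.named_steps["logisticregression"].n_iter_[0])
test_accuracy = scaled_log_reg.score(X_te, y_te)
print("Iterations used:", n_iter)
print("Test accuracy:", round(test_accuracy, 4))
```

If `n_iter` stays well below `max_iter`, the solver converged and the warning should no longer appear.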

---

#### Neural Network (MLPClassifier):
- **Accuracy**: 100%
- **Precision**:
  - Class 0 (no late-delivery risk): 1.00
  - Class 1 (late-delivery risk): 1.00
- **Recall**:
  - Class 0 (no late-delivery risk): 1.00
  - Class 1 (late-delivery risk): 1.00
- **F1-Score**:
  - Class 0 (no late-delivery risk): 1.00
  - Class 1 (late-delivery risk): 1.00
- **Confusion Matrix**:
  - True Positives (Class 1 predicted as Class 1): 19,905
  - False Positives (Class 0 predicted as Class 1): 0
  - True Negatives (Class 0 predicted as Class 0): 16,199
  - False Negatives (Class 1 predicted as Class 0): 0

**Interpretation**:
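The Neural Network classifies all 36,104 test rows correctly, giving precision, recall, and F1-score of 1.00 for both classes. A perfect score on a held-out test set is suspicious rather than reassuring. Ordinary overfitting would appear as a gap between training and test performance, whereas a flawless test score more often points to target leakage, meaning a feature that encodes the label. Here the feature matrix still contains the one-hot encoded `Delivery Status` and both `Days for shipping (real)` and `Days for shipment (scheduled)`, any of which may reveal whether an order was late, which is what the `Late_delivery_risk` target used in the code measures. These columns should be removed and the model retrained first; reducing model complexity or increasing regularization is worth trying only if a train/test gap remains afterwards.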
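
One way to examine a perfect score like this is to check whether any single feature determines the target on its own. The sketch below does that with a cross-tabulation on toy data. The relationship between `Delivery Status` and `Late_delivery_risk` is an assumption built into the toy frame, and `target_purity` is a hypothetical helper, not a library function.

```python
import numpy as np
import pandas as pd

# Toy frame in which one column fully determines the target (assumed relationship).
rng = np.random.default_rng(0)
status = rng.choice(["Late delivery", "Advance shipping", "Shipping on time"], size=1000)
toy = pd.DataFrame({
    "Delivery Status": status,
    "Customer Segment": rng.choice(["Consumer", "Corporate", "Home Office"], size=1000),
    "Late_delivery_risk": (status == "Late delivery").astype(int),
})

def target_purity(frame, feature, target):
    """Fraction of rows that a 'majority target per feature level' rule gets right.

    A value of 1.0 means the feature alone reproduces the target exactly,
    which is a strong hint of leakage.
    """
    counts = pd.crosstab(frame[feature], frame[target])
    return counts.max(axis=1).sum() / counts.to_numpy().sum()

purity_status = target_purity(toy, "Delivery Status", "Late_delivery_risk")
purity_segment = target_purity(toy, "Customer Segment", "Late_delivery_risk")
print("Delivery Status purity:", round(purity_status, 3))
print("Customer Segment purity:", round(purity_segment, 3))
```

Running the same check on each column of the real data before training would show whether any of them should be dropped from the feature matrix.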

---

### Conclusion:
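- The **Logistic Regression** model shows strong performance (95.43% test accuracy), but the solver did not converge; it should be refit with scaled features or a higher `max_iter` before its metrics are treated as final.
- The **Neural Network** classifies every test row correctly. Because this score is measured on held-out data, it is more likely a sign of target leakage than of ordinary overfitting, and it should not be taken as evidence that the model generalizes.

Both sets of metrics already come from a held-out split, so the next step is not another test set but an audit of the features for leakage (starting with `Delivery Status` and the two shipping-day columns), followed by cross-validation of both models. The code trains on `Late_delivery_risk`; if the goal is fraud detection as stated at the top of this section, the target must also be switched to `SUSPECTED_FRAUD`.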
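
A hedged sketch of such a robustness check using stratified k-fold cross-validation, which reports training and validation scores side by side so that an overfitting gap becomes visible. The synthetic data, the small network size, and `alpha=1e-2` (the L2 regularization strength) are assumptions chosen to keep the example fast, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: not the DataCo data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so each fold fits its own scaler.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), alpha=1e-2, max_iter=500, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model, X_demo, y_demo, cv=cv, scoring="f1", return_train_score=True
)

train_f1 = scores["train_score"].mean()
valid_f1 = scores["test_score"].mean()
print("Mean train F1:", round(train_f1, 3))
print("Mean validation F1:", round(valid_f1, 3))
print("Gap:", round(train_f1 - valid_f1, 3))
```

A large positive gap suggests overfitting, while near-perfect scores on both sides are a reason to look for leakage instead.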