Final model: a Random Forest classifier trained on SMOTE-oversampled data, chosen because it gave the best balance between precision and recall for the reformulated products.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc, classification_report
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt

In [6]:
df = pd.read_csv("cscpopendata_clean.csv")

chemical_hazard_map = {
    'Formaldehyde': 'Confirmed Carcinogen',
    'Titanium dioxide': 'Suspected Carcinogen',
    'Distillates (coal tar)': 'Confirmed Carcinogen',
    'Fragrance': 'Allergen/Irritant',
    'Linalool': 'Allergen/Irritant',
    'Glycerin': 'Not Hazardous',
    'Ethanol': 'Not Hazardous',
    'Sodium Lauryl Sulfate': 'Allergen/Irritant'
}

# Strip leading/trailing whitespace so names match the keys in chemical_hazard_map
df['ChemicalName_Clean'] = df['ChemicalName_Clean'].str.strip()
# Chemicals missing from the map are labeled 'Not Classified'
df['HazardType'] = df['ChemicalName_Clean'].map(chemical_hazard_map).fillna('Not Classified')
# Target: 1 if the chemical has a removal date (treated as reformulated), else 0
df['Reformulated'] = df['ChemicalDateRemoved'].notna().astype(int)

# Parse the date columns; unparseable values become NaT
df['InitialDateReported'] = pd.to_datetime(df['InitialDateReported'], errors='coerce')
df['MostRecentDateReported'] = pd.to_datetime(df['MostRecentDateReported'], errors='coerce')
df['DiscontinuedDate'] = pd.to_datetime(df['DiscontinuedDate'], errors='coerce')

# Durations in days; the result is NaN when either date is missing
df['ChemicalAge'] = (df['MostRecentDateReported'] - df['InitialDateReported']).dt.days
df['ProductLifespan'] = (df['DiscontinuedDate'] - df['InitialDateReported']).dt.days

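def create_hazard_score(row):
    """Map a row's HazardType to an ordinal score (10, 8, or 5)."""
    hazard = str(row['HazardType']).lower()
    if 'carcinogen' in hazard:
        return 10  # 'Confirmed Carcinogen' and 'Suspected Carcinogen'
    elif 'reproductive' in hazard:
        return 8   # no label in chemical_hazard_map currently matches this branch
    else:
        return 5   # 'Allergen/Irritant', 'Not Hazardous', and 'Not Classified'

df['HazardScore'] = df.apply(create_hazard_score, axis=1)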
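
The scoring rule above can also be written without a row-wise `apply`. The following is a minimal sketch on a toy frame (the rows are illustrative, not taken from the dataset) that uses `np.select` to reproduce the same if/elif order. It also makes visible that every non-carcinogen label currently receives the same score of 5.

```python
import numpy as np
import pandas as pd

# Toy frame covering every label the hazard map can produce (illustrative data only).
toy = pd.DataFrame({
    "HazardType": [
        "Confirmed Carcinogen",
        "Suspected Carcinogen",
        "Allergen/Irritant",
        "Not Hazardous",
        "Not Classified",
    ]
})

hazard = toy["HazardType"].astype(str).str.lower()

# np.select checks the conditions in order, like the if/elif chain,
# and falls back to the default when none of them match.
toy["HazardScore"] = np.select(
    [hazard.str.contains("carcinogen"), hazard.str.contains("reproductive")],
    [10, 8],
    default=5,
)

print(toy)
```

If the score is meant to separate allergens from non-hazardous chemicals, the default branch would need to be split; that would be a change to the features, not something this sketch does.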

In [7]:
X = df[['ChemicalAge', 'ProductLifespan', 'HazardScore']]
y = df['Reformulated']

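# Drop rows with any missing value. ProductLifespan is NaN when DiscontinuedDate
# is missing, so products without a discontinuation date are dropped here.
data_clean = pd.concat([X, y], axis=1).dropna()
X_numeric = data_clean[['ChemicalAge', 'ProductLifespan', 'HazardScore']]
y_numeric = data_clean['Reformulated']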

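# Stratified split keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_numeric, y_numeric, test_size=0.2, random_state=42, stratify=y_numeric
)

# Oversample the minority class on the training split only, so the test set
# keeps the original class distribution
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)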
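
For intuition about what `fit_resample` adds to the training set: SMOTE builds each synthetic minority row by interpolating between a minority sample and one of its nearest minority neighbors. The following is a simplified NumPy sketch of that idea, not imblearn's implementation. The function name `smote_like_samples` and the toy rows are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation between minority samples (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors per row

    base = rng.integers(0, len(X_min), size=n_new)            # random base samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority rows: ChemicalAge, ProductLifespan, HazardScore (made-up values).
X_minority = np.array([[10, 200, 5], [12, 220, 5], [30, 400, 10], [28, 390, 10]])
synthetic = smote_like_samples(X_minority, n_new=6, k=2)
print(synthetic)
```

Because the interpolation treats every column as continuous, synthetic rows can carry a HazardScore between 5 and 10 that no real row has. imblearn's `SMOTENC` is designed for mixed continuous and categorical features and may be worth evaluating if that matters here.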

In [8]:
# Final Model: Random Forest Classifier
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train_smote, y_train_smote)
y_prob = best_model.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (reformulated)
y_pred = best_model.predict(X_test)              # hard class labels

# Calculate Precision-Recall AUC (PR-AUC)
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc_score = auc(recall, precision)

# Display the final classification report
print("Classification Report for Final Model:")
print(classification_report(y_test, y_pred))

print(f"\nFinal PR-AUC Score: {pr_auc_score:.4f}")

Classification Report for Final Model:
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      2500
           1       0.52      0.79      0.63        84

    accuracy                           0.97      2584
   macro avg       0.76      0.88      0.81      2584
weighted avg       0.98      0.97      0.97      2584


Final PR-AUC Score: 0.6893
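
The score above integrates the precision-recall curve with the trapezoidal rule via `auc(recall, precision)`. scikit-learn also provides `average_precision_score`, which summarizes the same curve as a step-wise weighted sum and can give a somewhat different number. A minimal sketch comparing the two on made-up labels and scores (not the model's outputs):

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Trapezoidal area under the PR curve (linear interpolation between points).
trapezoid_pr_auc = auc(recall, precision)

# Average precision: sum of precision values weighted by the increase in recall.
avg_precision = average_precision_score(y_true, y_score)

print(f"Trapezoidal PR-AUC: {trapezoid_pr_auc:.4f}")
print(f"Average precision:  {avg_precision:.4f}")
```

Either summary is reasonable to report, as long as the same one is used when comparing models.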


Key Insight: On the reformulated class, the model trades precision for recall: recall is 0.79 and precision is 0.52.

It finds most of the reformulated products in the test set (about 66 of 84), but roughly half of the products it flags as reformulated are false positives.

Final Score: The PR-AUC of 0.69 is far above the roughly 0.03 expected from a random classifier at this class balance (84 positives in 2,584 test rows), which indicates useful ranking ability on a heavily imbalanced problem.
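
The figures in the classification report can be sanity-checked with a little arithmetic. The sketch below takes the class-1 support, recall, and precision from the report, computes the positive-class prevalence (the PR-AUC a random ranking would be expected to reach), and backs out approximate confusion-matrix counts. The counts are approximate because the report rounds its metrics to two decimals.

```python
# Values read off the classification report (class 1 = reformulated).
support_pos = 84
support_total = 2584
recall_pos = 0.79     # rounded to two decimals in the report
precision_pos = 0.52  # rounded to two decimals in the report

# A random ranking has an expected PR-AUC close to the positive-class prevalence.
baseline_pr_auc = support_pos / support_total

# Approximate counts implied by the rounded metrics.
approx_tp = round(recall_pos * support_pos)        # reformulated products found
approx_flagged = round(approx_tp / precision_pos)  # products predicted as reformulated
approx_fp = approx_flagged - approx_tp             # flagged but not reformulated

print(f"Baseline PR-AUC (prevalence): {baseline_pr_auc:.4f}")
print(f"Approx. true positives:  {approx_tp} of {support_pos}")
print(f"Approx. false positives: {approx_fp} of {approx_flagged} flagged")
```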