<a href="https://colab.research.google.com/github/ShabnaIlmi/Data-Science-Group-Project/blob/main/DSGP_startover.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Load and Inspect the Data**

In [37]:
import pandas as pd

# Load datasets
recipes_path = "/content/recipes_nodup.csv"
chem_path = "/content/chem.csv"

df_recipes = pd.read_csv(recipes_path)
df_chem = pd.read_csv(chem_path)

# Display first few rows of each dataset
print("📌 recipes_nodup.csv:")
print(df_recipes.head(), "\n\n")

print("📌 chem.csv:")
print(df_chem.head(), "\n\n")

# Check dataset shapes
print(f"🔍 recipes_nodup.csv Shape: {df_recipes.shape}")
print(f"🔍 chem.csv Shape: {df_chem.shape}")

# Show column names
print("🛠 recipes_nodup Columns:", df_recipes.columns)
print("🛠 chem Columns:", df_chem.columns)

# Check missing values
print("⚠️ Missing values in recipes_nodup:\n", df_recipes.isnull().sum(), "\n")
print("⚠️ Missing values in chem:\n", df_chem.isnull().sum(), "\n")


📌 recipes_nodup.csv:
   Recipe ID                                    Chemical Names  \
0          1               Ephedrine + Red Phosphorus + Iodine   
1          2             Toluene + Nitric Acid + Sulfuric Acid   
2          3       Hydrogen Peroxide + Acetone + Sulfuric Acid   
3          4  Ephedrine + Potassium Permanganate + Acetic Acid   
4          5             Potassium Nitrate + Charcoal + Sulfur   

                     Formulas   Quantities (g/mL)  \
0           C10H15NO + P + I2     30g + 15g + 10g   
1         C7H8 + HNO3 + H2SO4  50mL + 30mL + 40mL   
2        H2O2 + C3H6O + H2SO4   20mL + 30mL + 5mL   
3  C10H15NO + KMnO4 + CH3COOH    25g + 10g + 50mL   
4                KNO3 + C + S     75g + 15g + 10g   

                         CAS Numbers    Solvent Used  \
0   299-42-3 + 7723-14-0 + 7553-56-2  Acetone, Ether   
1   108-88-3 + 7697-37-2 + 7664-93-9             NaN   
2    7722-84-1 + 67-64-1 + 7664-93-9             NaN   
3     299-42-3 + 7722-64-7 + 64-19-7   

In [38]:
# Check data types of columns
print("🔍 Data types in recipes_nodup.csv:\n", df_recipes.dtypes, "\n")
print("🔍 Data types in chem.csv:\n", df_chem.dtypes, "\n")


🔍 Data types in recipes_nodup.csv:
 Recipe ID                                       int64
Chemical Names                                 object
Formulas                                       object
Quantities (g/mL)                              object
CAS Numbers                                    object
Solvent Used                                   object
Reaction Conditions                            object
Toxicity Level                                 object
Flammability (Yes/No)                          object
Reactivity (Stable/Unstable)                   object
Explosiveness (1-10)                            int64
Health Risk Score (0-100)                       int64
Environmental Hazard (Yes/No)                  object
Dual Use Potential (Yes/No)                    object
Intended Use                                   object
Export Restriction (Yes/No)                    object
Controlled Substance (Yes/No)                  object
Risk Assessment Score (0-100)                 

**Check Missing Values and Remove Duplicates**

In [39]:
print("📌 Duplicate rows in recipes_nodup:", df_recipes.duplicated().sum())
print("📌 Duplicate rows in chem:", df_chem.duplicated().sum())

print("⚠️ Missing values in recipes_nodup:\n", df_recipes.isnull().sum(), "\n")
print("⚠️ Missing values in chem:\n", df_chem.isnull().sum(), "\n")



📌 Duplicate rows in recipes_nodup: 0
📌 Duplicate rows in chem: 0
⚠️ Missing values in recipes_nodup:
 Recipe ID                                       0
Chemical Names                                  0
Formulas                                        0
Quantities (g/mL)                               0
CAS Numbers                                     0
Solvent Used                                   89
Reaction Conditions                             0
Toxicity Level                                  0
Flammability (Yes/No)                           0
Reactivity (Stable/Unstable)                    0
Explosiveness (1-10)                            0
Health Risk Score (0-100)                       0
Environmental Hazard (Yes/No)                   0
Dual Use Potential (Yes/No)                     0
Intended Use                                    0
Export Restriction (Yes/No)                     0
Controlled Substance (Yes/No)                   0
Risk Assessment Score (0-100)                   

In [40]:
# Display unique values for categorical columns
print("Unique values in key columns (recipes_nodup):\n")
for col in df_recipes.columns:
    print(f"{col}: {df_recipes[col].nunique()} unique values")

print("\n🛠 Unique values in key columns (chem):\n")
for col in df_chem.columns:
    print(f"{col}: {df_chem[col].nunique()} unique values")


Unique values in key columns (recipes_nodup):

Recipe ID: 150 unique values
Chemical Names: 149 unique values
Formulas: 140 unique values
Quantities (g/mL): 85 unique values
CAS Numbers: 142 unique values
Solvent Used: 4 unique values
Reaction Conditions: 65 unique values
Toxicity Level: 5 unique values
Flammability (Yes/No): 2 unique values
Reactivity (Stable/Unstable): 3 unique values
Explosiveness (1-10): 10 unique values
Health Risk Score (0-100): 18 unique values
Environmental Hazard (Yes/No): 2 unique values
Dual Use Potential (Yes/No): 2 unique values
Intended Use: 106 unique values
Export Restriction (Yes/No): 2 unique values
Controlled Substance (Yes/No): 2 unique values
Risk Assessment Score (0-100): 18 unique values
Regulatory Body: 7 unique values
Compliance Status (Compliant/Non-compliant): 2 unique values
Risk Category: 4 unique values
Risk Score (0-100): 16 unique values

🛠 Unique values in key columns (chem):

ID: 401 unique values
Chemical name: 393 unique values
molar

**Preprocessing recipes_nodup.csv**

In [41]:
# Fill missing values with "Unknown"
df_recipes["Solvent Used"].fillna("Unknown", inplace=True)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_recipes["Solvent Used"].fillna("Unknown", inplace=True)


In [42]:
import re
import pandas as pd

df_recipes = pd.read_csv(recipes_path)

# Function to combine chemicals with their quantities
def combine_chemicals(row):
    chemicals = row["Chemical Names"].split(" + ")
    quantities = row["Quantities (g/mL)"].split(" + ")

    combined = []
    for chem, qty in zip(chemicals, quantities):
        qty_numeric = re.sub(r"[^\d.]", "", qty)  # Remove non-numeric characters
        combined.append(f"{chem}:{qty_numeric}")  # Format as "Chemical:Quantity"

    return " + ".join(combined)  # Join all pairs into a single text format

# Apply transformation
df_recipes["Combined Recipe"] = df_recipes.apply(combine_chemicals, axis=1)

# Drop old columns
df_recipes.drop(columns=["Chemical Names", "Quantities (g/mL)"], inplace=True)




Encode Categorical Features

In [43]:
from sklearn.preprocessing import LabelEncoder

binary_cols = [
    "Flammability (Yes/No)", "Reactivity (Stable/Unstable)",
    "Environmental Hazard (Yes/No)", "Dual Use Potential (Yes/No)",
    "Export Restriction (Yes/No)", "Controlled Substance (Yes/No)",
    "Compliance Status (Compliant/Non-compliant)"
]

# Convert Yes/No to 0/1
for col in binary_cols:
    df_recipes[col] = df_recipes[col].apply(lambda x: 1 if x in ["Yes", "Compliant", "Stable"] else 0)

# Encode Risk Category
df_recipes["Risk Category Encoded"] = LabelEncoder().fit_transform(df_recipes["Risk Category"])


Normalize Numerical Features

In [44]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Remove "Total Quantity (g/mL)" from num_cols as it is no longer present in df_recipes
num_cols = ["Risk Score (0-100)", "Health Risk Score (0-100)", "Risk Assessment Score (0-100)"]

df_recipes[num_cols] = scaler.fit_transform(df_recipes[num_cols])


# Save processed data
df_recipes.to_csv("/content/processed_recipes.csv", index=False)
print("✅ Successfully converted recipes into structured format!")

✅ Successfully converted recipes into structured format!


**Preprocessing chem.csv**

In [45]:
# Fill missing values in chem.csv
df_chem["CAS number"].fillna("Unknown", inplace=True)
df_chem["UN number"].fillna("Unknown", inplace=True)
df_chem["synonyms"].fillna("Unknown", inplace=True)

# Standardize column names
df_chem.columns = df_chem.columns.str.strip().str.lower().str.replace(" ", "_")

# Save preprocessed chem.csv
df_chem.to_csv("/content/preprocessed_chem.csv", index=False)
print("Preprocessed chem.csv saved!")


Preprocessed chem.csv saved!


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_chem["CAS number"].fillna("Unknown", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_chem["UN number"].fillna("Unknown", inplace=True)
  df_chem["UN number"].fillna("Unknown", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never

***Convert Recipes into Features***

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert "Combined Recipe" into TF-IDF vectors
vectorizer = TfidfVectorizer(max_features=500)  # Keep top 500 most important features
X_tfidf = vectorizer.fit_transform(df_recipes["Combined Recipe"]).toarray()

# Convert TF-IDF output to DataFrame and merge with other numerical features
X_tfidf_df = pd.DataFrame(X_tfidf, columns=vectorizer.get_feature_names_out())

# Drop "Combined Recipe" and merge TF-IDF features
df_final = pd.concat([df_recipes.drop(columns=["Combined Recipe"]), X_tfidf_df], axis=1)

print("✅ TF-IDF transformation complete!")


✅ TF-IDF transformation complete!


In [47]:
from sklearn.preprocessing import LabelEncoder
import joblib

# Define Label Encoder
label_encoder = LabelEncoder()

# Fit on the target variable (Risk Category)
df_recipes["Risk Category Encoded"] = label_encoder.fit_transform(df_recipes["Risk Category"])

# Save the trained LabelEncoder
joblib.dump(label_encoder, "risk_category_encoder.pkl")

print("✅ Label Encoder saved successfully!")


✅ Label Encoder saved successfully!


**Train ML Models**

In [48]:
import joblib

joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
joblib.dump(scaler, "minmax_scaler.pkl")
joblib.dump(label_encoder, "risk_category_encoder.pkl")

print("✅ Preprocessing models saved!")


✅ Preprocessing models saved!


In [49]:
import re
import pandas as pd
import joblib
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the trained vectorizer, scaler, and label encoders
vectorizer = joblib.load("tfidf_vectorizer.pkl")
scaler = joblib.load("minmax_scaler.pkl")
label_encoder = joblib.load("risk_category_encoder.pkl")

def preprocess_new_input(df):
    df = df.copy()

    # Convert "Combined Recipe" into structured format
    if "Combined Recipe" in df.columns:
        def combine_chemicals(row):
            chemicals = row.split(" + ")
            combined = [f"{chem.split(':')[0]}:{re.sub(r'[^0-9.]', '', chem.split(':')[1])}" for chem in chemicals]
            return " + ".join(combined)

        df["Combined Recipe"] = df["Combined Recipe"].apply(combine_chemicals)

    # Convert categorical Yes/No features to 0/1
    binary_cols = [
        "Flammability (Yes/No)", "Reactivity (Stable/Unstable)",
        "Environmental Hazard (Yes/No)", "Dual Use Potential (Yes/No)",
        "Export Restriction (Yes/No)", "Controlled Substance (Yes/No)",
        "Compliance Status (Compliant/Non-compliant)"
    ]
    for col in binary_cols:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: 1 if x in ["Yes", "Compliant", "Stable"] else 0)

    # Scale numerical columns
    num_cols = ["Risk Score (0-100)", "Health Risk Score (0-100)", "Risk Assessment Score (0-100)"]
    if set(num_cols).issubset(df.columns):
        df[num_cols] = scaler.transform(df[num_cols])  # Apply trained scaler

    # Transform text using trained TF-IDF vectorizer
    '''if "Combined Recipe" in df.columns:
        tfidf_features = vectorizer.transform(df["Combined Recipe"]).toarray()
        tfidf_df = pd.DataFrame(tfidf_features, columns=vectorizer.get_feature_names_out())
        df = pd.concat([df.drop(columns=["Combined Recipe"], errors="ignore"), tfidf_df], axis=1)

    return df'''


In [50]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df_final.drop(columns=["Risk Category", "Risk Category Encoded"])  # Features
y = df_final["Risk Category Encoded"]  # Target

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("✅ Data split complete! Ready for training.")


✅ Data split complete! Ready for training.


In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=100)

# Transform text columns
X_train_tfidf = vectorizer.fit_transform(X_train["Formulas"].fillna("")).toarray()
X_test_tfidf = vectorizer.transform(X_test["Formulas"].fillna("")).toarray()

# Convert to DataFrame and merge
X_train_tfidf_df = pd.DataFrame(X_train_tfidf, columns=vectorizer.get_feature_names_out())
X_test_tfidf_df = pd.DataFrame(X_test_tfidf, columns=vectorizer.get_feature_names_out())

X_train = pd.concat([X_train.drop(columns=["Formulas"]), X_train_tfidf_df], axis=1)
X_test = pd.concat([X_test.drop(columns=["Formulas"]), X_test_tfidf_df], axis=1)

print("✅ TF-IDF applied to text columns!")


✅ TF-IDF applied to text columns!


In [52]:
X_train = X_train.drop(columns=["Formulas", "CAS Numbers"], errors="ignore")
X_test = X_test.drop(columns=["Formulas", "CAS Numbers"], errors="ignore")

print("✅ Dropped Formulas & CAS Numbers (Not needed for model training).")


✅ Dropped Formulas & CAS Numbers (Not needed for model training).


In [53]:
# Combine features and target into one DataFrame if not done already
X = df_final.drop(columns=["Risk Category", "Risk Category Encoded"])
y = df_final["Risk Category Encoded"]

print("Before split - X shape:", X.shape, "y shape:", y.shape)

# Perform a stratified split to maintain class balance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("After split - X_train shape:", X_train.shape, "y_train shape:", y_train.shape)


Before split - X shape: (150, 171) y shape: (150,)
After split - X_train shape: (120, 171) y_train shape: (120,)


In [54]:
from sklearn.preprocessing import LabelEncoder

# List of categorical columns
cat_cols = ["Solvent Used", "Reaction Conditions", "Toxicity Level", "Intended Use", "Regulatory Body"]

# Apply Label Encoding correctly
for col in cat_cols:
    encoder = LabelEncoder()

    # Fit on both training & testing data
    encoder.fit(pd.concat([X_train[col], X_test[col]], axis=0).astype(str))

    # Transform both datasets
    X_train[col] = encoder.transform(X_train[col].astype(str))
    X_test[col] = encoder.transform(X_test[col].astype(str))

print("✅ Successfully encoded categorical columns!")



✅ Successfully encoded categorical columns!


In [55]:
print("🔍 Data Types in X_train:\n", X_train.dtypes[X_train.dtypes == "object"])


🔍 Data Types in X_train:
 Formulas       object
CAS Numbers    object
dtype: object


In [56]:
drop_cols = ["Formulas", "CAS Numbers"]  # Add any other text-based columns

X_train = X_train.drop(columns=drop_cols, errors="ignore")
X_test = X_test.drop(columns=drop_cols, errors="ignore")

print("✅ Removed text columns from training data.")


✅ Removed text columns from training data.


In [57]:
X_train = X_train.astype(float)
X_test = X_test.astype(float)

print("✅ Confirmed all features are numeric!")


✅ Confirmed all features are numeric!


In [58]:
from imblearn.over_sampling import SMOTE

# Adjust SMOTE settings
smote = SMOTE(random_state=42, k_neighbors=1)  # Reduce k_neighbors
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Check new class distribution
import numpy as np
print("📊 Class distribution before SMOTE:", dict(zip(*np.unique(y_train, return_counts=True))))
print("📊 Class distribution after SMOTE:", dict(zip(*np.unique(y_train_balanced, return_counts=True))))



📊 Class distribution before SMOTE: {0: 65, 1: 13, 2: 23, 3: 19}
📊 Class distribution after SMOTE: {0: 65, 1: 65, 2: 65, 3: 65}


In [59]:
''' from imblearn.over_sampling import BorderlineSMOTE

borderline_smote = BorderlineSMOTE(random_state=42, k_neighbors=1)
X_train_balanced, y_train_balanced = borderline_smote.fit_resample(X_train, y_train) '''


' from imblearn.over_sampling import BorderlineSMOTE\n\nborderline_smote = BorderlineSMOTE(random_state=42, k_neighbors=1)\nX_train_balanced, y_train_balanced = borderline_smote.fit_resample(X_train, y_train) '

In [60]:
'''from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate performance
print("📊 Classification Report:\n", classification_report(y_test, y_pred))
print("✅ Random Forest Accuracy:", accuracy_score(y_test, y_pred))'''


'from sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import classification_report, accuracy_score\n\n# Train Random Forest model\nrf_model = RandomForestClassifier(n_estimators=100, random_state=42)\nrf_model.fit(X_train, y_train)\n\n# Make predictions\ny_pred = rf_model.predict(X_test)\n\n# Evaluate performance\nprint("📊 Classification Report:\n", classification_report(y_test, y_pred))\nprint("✅ Random Forest Accuracy:", accuracy_score(y_test, y_pred))'

In [61]:
'''best_rf = RandomForestClassifier(
    **grid_search.best_params_,
    class_weight="balanced",  # Handle class imbalance
    random_state=42
)

best_rf.fit(X_train_balanced, y_train_balanced)

# Make predictions
y_pred_best_rf = best_rf.predict(X_test)

# Evaluate performance
print("📊 Balanced Random Forest Classification Report:\n", classification_report(y_test, y_pred_best_rf))
print("✅ Balanced Random Forest Accuracy:", accuracy_score(y_test, y_pred_best_rf))'''


'best_rf = RandomForestClassifier(\n    **grid_search.best_params_,\n    class_weight="balanced",  # Handle class imbalance\n    random_state=42\n)\n\nbest_rf.fit(X_train_balanced, y_train_balanced)\n\n# Make predictions\ny_pred_best_rf = best_rf.predict(X_test)\n\n# Evaluate performance\nprint("📊 Balanced Random Forest Classification Report:\n", classification_report(y_test, y_pred_best_rf))\nprint("✅ Balanced Random Forest Accuracy:", accuracy_score(y_test, y_pred_best_rf))'

In [63]:
#RANDOM FORREST OPTIMIZED
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train_balanced, y_train_balanced)

print("✅ Best Parameters:", grid_search.best_params_)

# Train optimized model
best_rf = RandomForestClassifier(**grid_search.best_params_, random_state=42)
best_rf.fit(X_train_balanced, y_train_balanced)

# Make predictions
y_pred_best_rf = best_rf.predict(X_test)

# Evaluate performance
print("📊 Optimized Random Forest Classification Report:\n", classification_report(y_test, y_pred_best_rf))
print("✅ Optimized Random Forest Accuracy:", accuracy_score(y_test, y_pred_best_rf))


✅ Best Parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}
📊 Optimized Random Forest Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.75      1.00      0.86         3
           2       1.00      1.00      1.00         6
           3       1.00      0.80      0.89         5

    accuracy                           0.97        30
   macro avg       0.94      0.95      0.94        30
weighted avg       0.97      0.97      0.97        30

✅ Optimized Random Forest Accuracy: 0.9666666666666667


In [64]:
from xgboost import XGBClassifier

# Train XGBoost model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', scale_pos_weight=10, random_state=42)
xgb_model.fit(X_train_balanced, y_train_balanced)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate performance
from sklearn.metrics import classification_report, accuracy_score
print("📊 XGBoost Classification Report:\n", classification_report(y_test, y_pred_xgb))
print("✅ XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



📊 XGBoost Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.94      0.97        16
           1       0.75      1.00      0.86         3
           2       1.00      1.00      1.00         6
           3       0.80      0.80      0.80         5

    accuracy                           0.93        30
   macro avg       0.89      0.93      0.91        30
weighted avg       0.94      0.93      0.94        30

✅ XGBoost Accuracy: 0.9333333333333333


In [65]:
import joblib

# Train and save model
rf_model.fit(X_train_balanced, y_train_balanced)
joblib.dump(rf_model, "random_forest_risk_model.pkl")

xgb_model.fit(X_train_balanced, y_train_balanced)
joblib.dump(xgb_model, "xgboost_risk_model.pkl")

print("✅ Models saved successfully!")



✅ Models saved successfully!


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



In [66]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(xgb_model, X_train_balanced, y_train_balanced, cv=5, scoring="accuracy")

print("📊 Cross-Validation Accuracy Scores:", cv_scores)
print("✅ Average Accuracy:", cv_scores.mean())


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



📊 Cross-Validation Accuracy Scores: [0.96153846 0.98076923 0.98076923 1.         0.96153846]
✅ Average Accuracy: 0.9769230769230768


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



In [67]:
# Load trained models
rf_model = joblib.load("random_forest_risk_model.pkl")
xgb_model = joblib.load("xgboost_risk_model.pkl")

# Example new recipe input
new_recipe = pd.DataFrame({
    "Combined Recipe": ["Water:500 + Ethanol:200 + Sulfuric Acid:50"],
    "Flammability (Yes/No)": [1],
    "Reactivity (Stable/Unstable)": [0],
    "Explosiveness (1-10)": [3],
    "Health Risk Score (0-100)": [45],
    "Risk Score (0-100)": [55]
})

# Apply preprocessing
new_recipe_processed = preprocess_new_input(new_recipe)

# Ensure new data matches training features
missing_cols = set(X_train.columns) - set(new_recipe_processed.columns)
for col in missing_cols:
    new_recipe_processed[col] = 0  # Add missing columns as zeros

new_recipe_processed = new_recipe_processed[X_train.columns]  # Reorder columns

# Predict risk category
predicted_risk = xgb_model.predict(new_recipe_processed)
predicted_risk_label = label_encoder.inverse_transform(predicted_risk)

print("🔍 Predicted Risk Category:", predicted_risk_label[0])


AttributeError: 'NoneType' object has no attribute 'columns'

In [71]:
# Example new recipe input (user provides chemicals and quantities)
new_recipe_raw = "Water:500 + Ethanol:200 + Sulfuric Acid:50"

# Apply preprocessing
new_recipe = preprocess_new_recipe(new_recipe_raw)  # This function is fixed in step 2 below


In [72]:
import re
import pandas as pd

# ✅ **Function to preprocess a new user-input recipe**
def preprocess_new_recipe(recipe):
    # Extract chemicals and quantities
    chemicals = []
    quantities = []

    for chem in recipe.split(" + "):
        parts = chem.split(":")
        if len(parts) == 2:
            chem_name = parts[0].strip().title()  # Standardize chemical name
            quantity = re.sub(r"[^\d.]", "", parts[1])  # Extract only numbers
            chemicals.append(chem_name)
            quantities.append(float(quantity) if quantity else 0)

    # Create Combined Recipe String
    combined_recipe = " + ".join([f"{chem}:{qty}" for chem, qty in zip(chemicals, quantities)])

    return pd.DataFrame({
        "Combined Recipe": [combined_recipe]
    })


# Convert "Combined Recipe" into TF-IDF vector
new_recipe_tfidf = vectorizer.transform(new_recipe["Combined Recipe"]).toarray()
new_recipe_tfidf = pd.DataFrame(new_recipe_tfidf, columns=vectorizer.get_feature_names_out())

# Ensure new data matches training features
missing_cols = set(X_train.columns) - set(new_recipe_tfidf.columns)
for col in missing_cols:
    new_recipe_tfidf[col] = 0  # Add missing columns as zeros

new_recipe_processed = new_recipe_tfidf[X_train.columns]  # Reorder columns


# Convert "Combined Recipe" into TF-IDF vector
new_recipe_tfidf = vectorizer.transform(new_recipe["Combined Recipe"]).toarray()
new_recipe_tfidf = pd.DataFrame(new_recipe_tfidf, columns=vectorizer.get_feature_names_out())

# Ensure new data matches training features
missing_cols = set(X_train.columns) - set(new_recipe_tfidf.columns)
for col in missing_cols:
    new_recipe_tfidf[col] = 0  # Add missing columns as zeros

new_recipe_processed = new_recipe_tfidf[X_train.columns]  # Reorder columns


# Predict Safety Properties using the trained model
predicted_properties = rf_model.predict(new_recipe_processed)

# Extract Predictions Correctly
flammability = "Yes" if predicted_properties[0][0] == 1 else "No"
reactivity = "Stable" if predicted_properties[0][1] == 1 else "Unstable"
explosiveness = predicted_properties[0][2]  # Numeric score
health_risk = predicted_properties[0][3]  # Numeric score
risk_score = predicted_properties[0][4]  # Numeric score

# ✅ Display Predicted Safety Properties
print("🔍 Predicted Safety Properties:")
print(f"🔥 Flammability: {flammability}")
print(f"⚠ Reactivity: {reactivity}")
print(f"💥 Explosiveness: {explosiveness}/10")
print(f"🩺 Health Risk Score: {health_risk}/100")
print(f"⚠ Risk Score: {risk_score}/100")


# Predict risk category using XGBoost model
predicted_risk = xgb_model.predict(new_recipe_processed)
predicted_risk_label = label_encoder.inverse_transform(predicted_risk)

print("🚨 Final Predicted Risk Category:", predicted_risk_label[0])


IndexError: invalid index to scalar variable.

In [73]:
print("🔍 Model Prediction Output:", predicted_properties)
print("🔍 Shape of Predictions:", predicted_properties.shape if hasattr(predicted_properties, "shape") else type(predicted_properties))


🔍 Model Prediction Output: [2]
🔍 Shape of Predictions: (1,)


In [None]:
tfidf_features = vectorizer.transform(new_recipe["Combined Recipe"]).toarray()
print("🔍 TF-IDF Feature Vector:", tfidf_features)
