
**Load and Inspect the Data**

In [96]:
import pandas as pd

# Load datasets
recipes_path = "/content/recipes_nodup.csv"
chem_path = "/content/chem.csv"

df_recipes = pd.read_csv(recipes_path)
df_chem = pd.read_csv(chem_path)

# Display first few rows of each dataset
print("📌 recipes_nodup.csv:")
print(df_recipes.head(), "\n\n")

print("📌 chem.csv:")
print(df_chem.head(), "\n\n")

# Check dataset shapes
print(f"🔍 recipes_nodup.csv Shape: {df_recipes.shape}")
print(f"🔍 chem.csv Shape: {df_chem.shape}")

# Show column names
print("🛠 recipes_nodup Columns:", df_recipes.columns)
print("🛠 chem Columns:", df_chem.columns)

# Check missing values
print("⚠️ Missing values in recipes_nodup:\n", df_recipes.isnull().sum(), "\n")
print("⚠️ Missing values in chem:\n", df_chem.isnull().sum(), "\n")


📌 recipes_nodup.csv:
   Recipe ID                                    Chemical Names  \
0          1               Ephedrine + Red Phosphorus + Iodine   
1          2             Toluene + Nitric Acid + Sulfuric Acid   
2          3       Hydrogen Peroxide + Acetone + Sulfuric Acid   
3          4  Ephedrine + Potassium Permanganate + Acetic Acid   
4          5             Potassium Nitrate + Charcoal + Sulfur   

                     Formulas   Quantities (g/mL)  \
0           C10H15NO + P + I2     30g + 15g + 10g   
1         C7H8 + HNO3 + H2SO4  50mL + 30mL + 40mL   
2        H2O2 + C3H6O + H2SO4   20mL + 30mL + 5mL   
3  C10H15NO + KMnO4 + CH3COOH    25g + 10g + 50mL   
4                KNO3 + C + S     75g + 15g + 10g   

                         CAS Numbers    Solvent Used  \
0   299-42-3 + 7723-14-0 + 7553-56-2  Acetone, Ether   
1   108-88-3 + 7697-37-2 + 7664-93-9             NaN   
2    7722-84-1 + 67-64-1 + 7664-93-9             NaN   
3     299-42-3 + 7722-64-7 + 64-19-7   

In [97]:
# Check data types of columns
print("🔍 Data types in recipes_nodup.csv:\n", df_recipes.dtypes, "\n")
print("🔍 Data types in chem.csv:\n", df_chem.dtypes, "\n")


🔍 Data types in recipes_nodup.csv:
 Recipe ID                                       int64
Chemical Names                                 object
Formulas                                       object
Quantities (g/mL)                              object
CAS Numbers                                    object
Solvent Used                                   object
Reaction Conditions                            object
Toxicity Level                                 object
Flammability (Yes/No)                          object
Reactivity (Stable/Unstable)                   object
Explosiveness (1-10)                            int64
Health Risk Score (0-100)                       int64
Environmental Hazard (Yes/No)                  object
Dual Use Potential (Yes/No)                    object
Intended Use                                   object
Export Restriction (Yes/No)                    object
Controlled Substance (Yes/No)                  object
Risk Assessment Score (0-100)                 

**Check Missing Values and Duplicates**

In [98]:
print("📌 Duplicate rows in recipes_nodup:", df_recipes.duplicated().sum())
print("📌 Duplicate rows in chem:", df_chem.duplicated().sum())

print("⚠️ Missing values in recipes_nodup:\n", df_recipes.isnull().sum(), "\n")
print("⚠️ Missing values in chem:\n", df_chem.isnull().sum(), "\n")



📌 Duplicate rows in recipes_nodup: 0
📌 Duplicate rows in chem: 0
⚠️ Missing values in recipes_nodup:
 Recipe ID                                       0
Chemical Names                                  0
Formulas                                        0
Quantities (g/mL)                               0
CAS Numbers                                     0
Solvent Used                                   54
Reaction Conditions                             0
Toxicity Level                                  0
Flammability (Yes/No)                           0
Reactivity (Stable/Unstable)                    0
Explosiveness (1-10)                            0
Health Risk Score (0-100)                       0
Environmental Hazard (Yes/No)                   0
Dual Use Potential (Yes/No)                     0
Intended Use                                    0
Export Restriction (Yes/No)                     0
Controlled Substance (Yes/No)                   0
Risk Assessment Score (0-100)                   

In [99]:
# Count the unique values in every column
print("Unique values in key columns (recipes_nodup):\n")
for col in df_recipes.columns:
    print(f"{col}: {df_recipes[col].nunique()} unique values")

print("\n🛠 Unique values in key columns (chem):\n")
for col in df_chem.columns:
    print(f"{col}: {df_chem[col].nunique()} unique values")


Unique values in key columns (recipes_nodup):

Recipe ID: 76 unique values
Chemical Names: 76 unique values
Formulas: 75 unique values
Quantities (g/mL): 57 unique values
CAS Numbers: 76 unique values
Solvent Used: 3 unique values
Reaction Conditions: 37 unique values
Toxicity Level: 3 unique values
Flammability (Yes/No): 2 unique values
Reactivity (Stable/Unstable): 3 unique values
Explosiveness (1-10): 8 unique values
Health Risk Score (0-100): 14 unique values
Environmental Hazard (Yes/No): 1 unique values
Dual Use Potential (Yes/No): 1 unique values
Intended Use: 46 unique values
Export Restriction (Yes/No): 2 unique values
Controlled Substance (Yes/No): 2 unique values
Risk Assessment Score (0-100): 13 unique values
Regulatory Body: 5 unique values
Compliance Status (Compliant/Non-compliant): 2 unique values
Risk Category: 3 unique values
Risk Score (0-100): 10 unique values

🛠 Unique values in key columns (chem):

ID: 401 unique values
Chemical name: 393 unique values
molarcular 

**Preprocessing recipes_nodup.csv**

In [100]:
# Fill missing solvent entries with "Unknown" (assign the result back instead of chained inplace=True)
df_recipes["Solvent Used"] = df_recipes["Solvent Used"].fillna("Unknown")
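Filling a single column is most reliable when the result is assigned back, or when `fillna` is called on the whole frame with a per-column mapping. The sketch below shows both forms on a toy frame; the values are made up for illustration and are not taken from `recipes_nodup.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing solvent entry
toy = pd.DataFrame({
    "Recipe ID": [1, 2, 3],
    "Solvent Used": ["Water", np.nan, "Ethanol"],
})

# Option 1: assign the filled column back to the frame
filled_a = toy.copy()
filled_a["Solvent Used"] = filled_a["Solvent Used"].fillna("Unknown")

# Option 2: fill through the frame with a {column: value} mapping
filled_b = toy.fillna({"Solvent Used": "Unknown"})

print(filled_a["Solvent Used"].tolist())
print(filled_a.equals(filled_b))
```

Both options return the same frame and leave `toy` untouched, which also keeps the cell safe to re-run.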


In [101]:
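import re
import pandas as pd

# Reload so this cell can be re-run after the column drop below.
# Reloading discards the earlier fill, so re-apply it here.
df_recipes = pd.read_csv(recipes_path)
df_recipes["Solvent Used"] = df_recipes["Solvent Used"].fillna("Unknown")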

# Function to combine chemicals with their quantities
def combine_chemicals(row):
    chemicals = row["Chemical Names"].split(" + ")
    quantities = row["Quantities (g/mL)"].split(" + ")

    combined = []
    for chem, qty in zip(chemicals, quantities):
        qty_numeric = re.sub(r"[^\d.]", "", qty)  # Remove non-numeric characters
        combined.append(f"{chem}:{qty_numeric}")  # Format as "Chemical:Quantity"

    return " + ".join(combined)  # Join all pairs into a single text format

# Apply transformation
df_recipes["Combined Recipe"] = df_recipes.apply(combine_chemicals, axis=1)

# Drop old columns
df_recipes.drop(columns=["Chemical Names", "Quantities (g/mL)"], inplace=True)
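To make the transformation above concrete, here is a small sketch of the same pairing logic applied to a toy row. The row contents are hypothetical, and the length check is an added assumption: `zip` silently truncates when the two lists differ in length, so raising an error may be preferable.

```python
import re
import pandas as pd

def combine_chemicals_checked(row):
    """Pair each chemical with its numeric quantity, failing loudly on a mismatch."""
    chemicals = row["Chemical Names"].split(" + ")
    quantities = row["Quantities (g/mL)"].split(" + ")
    if len(chemicals) != len(quantities):
        raise ValueError(
            f"{len(chemicals)} chemicals but {len(quantities)} quantities in row"
        )
    pairs = []
    for chem, qty in zip(chemicals, quantities):
        qty_numeric = re.sub(r"[^\d.]", "", qty)  # keep digits and the decimal point
        pairs.append(f"{chem}:{qty_numeric}")
    return " + ".join(pairs)

# Toy row (hypothetical values)
toy = pd.DataFrame({
    "Chemical Names": ["Water + Sodium Chloride"],
    "Quantities (g/mL)": ["100mL + 5g"],
})
toy_combined = toy.apply(combine_chemicals_checked, axis=1)
print(toy_combined.iloc[0])  # Water:100 + Sodium Chloride:5
```

Note that stripping the unit means `100mL` and `100g` become indistinguishable in the combined text.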




**Encode Categorical Features**

In [102]:
from sklearn.preprocessing import LabelEncoder

binary_cols = [
    "Flammability (Yes/No)", "Reactivity (Stable/Unstable)",
    "Environmental Hazard (Yes/No)", "Dual Use Potential (Yes/No)",
    "Export Restriction (Yes/No)", "Controlled Substance (Yes/No)",
    "Compliance Status (Compliant/Non-compliant)"
]

# Map the positive label of each column (Yes / Compliant / Stable) to 1 and every other value to 0
for col in binary_cols:
    df_recipes[col] = df_recipes[col].isin(["Yes", "Compliant", "Stable"]).astype(int)

# Encode Risk Category
df_recipes["Risk Category Encoded"] = LabelEncoder().fit_transform(df_recipes["Risk Category"])
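`LabelEncoder` assigns integers in sorted label order, so it can help to print the mapping before interpreting model output. A minimal sketch, assuming three hypothetical category labels (the actual values of `Risk Category` are not shown in this notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to illustrate the mapping
labels = ["Low", "High", "Medium", "High"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes).tolist()
```

Keeping a reference to the fitted encoder (rather than creating it inline) makes `inverse_transform` available when reading predictions later.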


**Normalize Numerical Features**

In [103]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Numeric score columns to rescale into the [0, 1] range
num_cols = ["Risk Score (0-100)", "Health Risk Score (0-100)", "Risk Assessment Score (0-100)"]

df_recipes[num_cols] = scaler.fit_transform(df_recipes[num_cols])


# Save processed data
df_recipes.to_csv("/content/processed_recipes.csv", index=False)
print("✅ Successfully converted recipes into structured format!")

✅ Successfully converted recipes into structured format!
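Min-max scaling maps each value to `(x - min) / (max - min)` using the minimum and maximum seen during `fit`. The sketch below uses made-up scores and, as an assumption about workflow rather than what this notebook does, fits the scaler on a training portion only and reuses it on held-out values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up scores for illustration
train_scores = np.array([[10.0], [20.0], [30.0]])
test_scores = np.array([[25.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_scores)  # min=10, max=30 learned here
test_scaled = scaler.transform(test_scores)        # reuses the training min/max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.75 1.5 ]
```

Held-out values outside the training range land outside [0, 1], which is expected when the scaler is not refit.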


**Preprocessing chem.csv**

In [104]:
# Fill missing values in chem.csv (one call on the frame avoids the chained-assignment warning)
df_chem = df_chem.fillna({"CAS number": "Unknown", "UN number": "Unknown", "synonyms": "Unknown"})

# Standardize column names
df_chem.columns = df_chem.columns.str.strip().str.lower().str.replace(" ", "_")

# Save preprocessed chem.csv
df_chem.to_csv("/content/preprocessed_chem.csv", index=False)
print("Preprocessed chem.csv saved!")


Preprocessed chem.csv saved!


**Convert Recipes into Features**

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text into numerical features using TF-IDF
vectorizer = TfidfVectorizer(max_features=500)  # Keep at most the 500 most frequent terms
X_tfidf = vectorizer.fit_transform(df_recipes["Combined Recipe"]).toarray()

# Convert to DataFrame and merge with the dataset
X_tfidf_df = pd.DataFrame(X_tfidf, columns=vectorizer.get_feature_names_out())
df_final = pd.concat([df_recipes.drop(columns=["Combined Recipe"]), X_tfidf_df], axis=1)

print("✅ TF-IDF transformation complete! Data ready for ML models.")


✅ TF-IDF transformation complete! Data ready for ML models.


**Train ML Models**

In [107]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df_final.drop(columns=["Risk Category", "Risk Category Encoded"])  # Features
y = df_final["Risk Category Encoded"]  # Target

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("✅ Data split complete! Ready for training.")


✅ Data split complete! Ready for training.


In [108]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=100)

# Transform text columns
X_train_tfidf = vectorizer.fit_transform(X_train["Formulas"].fillna("")).toarray()
X_test_tfidf = vectorizer.transform(X_test["Formulas"].fillna("")).toarray()

# Convert to DataFrame, reusing each split's row index so pd.concat aligns row by row
X_train_tfidf_df = pd.DataFrame(X_train_tfidf, columns=vectorizer.get_feature_names_out(), index=X_train.index)
X_test_tfidf_df = pd.DataFrame(X_test_tfidf, columns=vectorizer.get_feature_names_out(), index=X_test.index)

X_train = pd.concat([X_train.drop(columns=["Formulas"]), X_train_tfidf_df], axis=1)
X_test = pd.concat([X_test.drop(columns=["Formulas"]), X_test_tfidf_df], axis=1)

print("✅ TF-IDF applied to text columns!")


✅ TF-IDF applied to text columns!
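Row labels matter when attaching new columns to a split frame: `pd.concat(..., axis=1)` aligns on the index, and a frame returned by `train_test_split` keeps its shuffled original labels. The toy sketch below (hypothetical values and column names) shows the extra NaN rows that appear when the new frame has a default `RangeIndex`, and how passing the matching index avoids them.

```python
import pandas as pd

# A "split" frame whose index is shuffled, as after train_test_split
X_part = pd.DataFrame({"score": [0.1, 0.5, 0.9]}, index=[3, 0, 2])

# New feature values in the same row order as X_part (made-up numbers)
tfidf_values = [[0.0, 1.0], [0.7, 0.3], [1.0, 0.0]]
tfidf_cols = ["term_a", "term_b"]

# Default RangeIndex (0, 1, 2) does not match (3, 0, 2): rows are misaligned
misaligned = pd.concat(
    [X_part, pd.DataFrame(tfidf_values, columns=tfidf_cols)], axis=1
)

# Reusing X_part.index keeps one row per sample
aligned = pd.concat(
    [X_part, pd.DataFrame(tfidf_values, columns=tfidf_cols, index=X_part.index)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```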


In [122]:
X_train = X_train.drop(columns=["Formulas", "CAS Numbers"], errors="ignore")
X_test = X_test.drop(columns=["Formulas", "CAS Numbers"], errors="ignore")

print("✅ Dropped Formulas & CAS Numbers (Not needed for model training).")


✅ Dropped Formulas & CAS Numbers (Not needed for model training).


In [123]:
from sklearn.preprocessing import LabelEncoder

# List of categorical columns
cat_cols = ["Solvent Used", "Reaction Conditions", "Toxicity Level", "Intended Use", "Regulatory Body"]

# Label-encode each categorical column with one encoder shared by train and test
for col in cat_cols:
    encoder = LabelEncoder()

    # Fit on both training & testing data
    encoder.fit(pd.concat([X_train[col], X_test[col]], axis=0).astype(str))

    # Transform both datasets
    X_train[col] = encoder.transform(X_train[col].astype(str))
    X_test[col] = encoder.transform(X_test[col].astype(str))

print("✅ Successfully encoded categorical columns!")



✅ Successfully encoded categorical columns!
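Fitting the encoder on train and test together guarantees that every test value has a code, at the cost of letting the test set influence preprocessing. One alternative, sketched here as an assumption rather than the notebook's method, is scikit-learn's `OrdinalEncoder` fitted on the training rows only, with unseen categories mapped to a sentinel value. The solvent names are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical values
train_cat = pd.DataFrame({"Solvent Used": ["Ethanol", "Water", "Ethanol"]})
test_cat = pd.DataFrame({"Solvent Used": ["Water", "Methanol"]})  # "Methanol" is unseen

ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = ordinal.fit_transform(train_cat)  # learns Ethanol -> 0, Water -> 1
test_codes = ordinal.transform(test_cat)        # unseen category becomes -1

print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

Integer codes impose an arbitrary order on the categories. Tree-based models usually tolerate this, while linear models tend to work better with one-hot encoding.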


In [124]:
# Rebuild features and target from df_final (this replaces the encoded X_train/X_test from the cells above with the unencoded columns)
X = df_final.drop(columns=["Risk Category", "Risk Category Encoded"])
y = df_final["Risk Category Encoded"]

print("Before split - X shape:", X.shape, "y shape:", y.shape)

# Perform a stratified split to maintain class balance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("After split - X_train shape:", X_train.shape, "y_train shape:", y_train.shape)


Before split - X shape: (76, 120) y shape: (76,)
After split - X_train shape: (60, 120) y_train shape: (60,)
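With only 76 rows, `stratify=y` matters because a purely random split could leave a class nearly absent from the test set. The sketch below uses a made-up imbalanced target to show that the class ratio is preserved in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 15 samples of class 0 and 5 samples of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts.tolist())  # 3:1 ratio kept
print("test: ", test_counts.tolist())   # 3:1 ratio kept
```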


In [125]:
print("🔍 Data Types in X_train:\n", X_train.dtypes[X_train.dtypes == "object"])


🔍 Data Types in X_train:
 Formulas               object
CAS Numbers            object
Solvent Used           object
Reaction Conditions    object
Toxicity Level         object
Intended Use           object
Regulatory Body        object
dtype: object


In [None]:
drop_cols = ["Combined Recipe", "Formulas", "CAS Numbers"]  # Add any other text-based columns

X_train = X_train.drop(columns=drop_cols, errors="ignore")
X_test = X_test.drop(columns=drop_cols, errors="ignore")

print("✅ Removed text columns from training data.")


In [120]:
from imblearn.over_sampling import SMOTE

# Oversample the minority classes in the training set only
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Verify shapes after SMOTE
print("✅ After SMOTE - X_train_balanced shape:", X_train_balanced.shape)
print("✅ After SMOTE - y_train_balanced shape:", y_train_balanced.shape)


ValueError: could not convert string to float: 'AgCNO + C3H6O'
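SMOTE interpolates between numeric feature vectors, so any column that still holds strings (here a `Formulas` value) raises this `ValueError`. A small guard that lists the offending columns before resampling can make the cause easier to spot. This is a sketch on a toy frame; `non_numeric_columns` is a hypothetical helper, not part of pandas or imbalanced-learn.

```python
import pandas as pd

def non_numeric_columns(frame):
    """Return the names of columns that are not numeric dtypes."""
    return frame.select_dtypes(exclude="number").columns.tolist()

# Toy training frame (made-up values) with one leftover text column
X_train_toy = pd.DataFrame({
    "Formulas": ["H2O + NaCl", "C2H6O", "H2O + NaCl"],
    "Health Risk Score (0-100)": [0.1, 0.4, 0.1],
})

leftover = non_numeric_columns(X_train_toy)
print("Non-numeric columns:", leftover)

# One option: drop them before resampling (encoding them is the alternative)
X_train_numeric = X_train_toy.drop(columns=leftover)
remaining = non_numeric_columns(X_train_numeric)
```

Running the guard immediately before `fit_resample` also catches the case where an earlier split cell was re-executed and reset `X_train` to unencoded columns.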

In [119]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate performance
print("📊 Classification Report:\n", classification_report(y_test, y_pred))
print("✅ Random Forest Accuracy:", accuracy_score(y_test, y_pred))


ValueError: could not convert string to float: 'AgCNO + C3H6O'
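The same error appears here because `RandomForestClassifier` also requires numeric input. One way to keep encoding and training in step, sketched below under assumptions, is to wrap both in a scikit-learn `Pipeline` so that the encoding is always applied to whatever frame reaches the model. The toy data, the choice of one-hot encoding, and the column selection are illustrative and do not come from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made-up values)
toy_X = pd.DataFrame({
    "Solvent Used": ["Water", "Ethanol", "Water", "Unknown", "Ethanol", "Water"],
    "Health Risk Score (0-100)": [0.10, 0.80, 0.20, 0.90, 0.70, 0.15],
})
toy_y = [0, 1, 0, 1, 1, 0]

# One-hot encode the text column; pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Solvent Used"])],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(toy_X, toy_y)

# A category unseen during fit is encoded as all zeros instead of raising
new_rows = pd.DataFrame({
    "Solvent Used": ["Methanol"],
    "Health Risk Score (0-100)": [0.85],
})
toy_pred = model.predict(new_rows)
print(toy_pred)
```

Because the pipeline owns the preprocessing, re-running a split cell cannot leave the model looking at raw text columns.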