<a href="https://colab.research.google.com/github/ShabnaIlmi/Data-Science-Group-Project/blob/recipe-risk-analyzer/DSGP_startover.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Load and Inspect the Data**

In [10]:
import pandas as pd

# Load datasets
file1 = "/content/recipes_nodup.csv"
file2 = "/content/chem.csv"

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Display first few rows
print("📌 recipes_nodup.csv")
print(df1.head(), "\n\n")

print("📌 chem.csv")
print(df2.head(), "\n\n")

# Check for missing values
print("🔍 Missing values in recipes_nodup:\n", df1.isnull().sum(), "\n")
print("🔍 Missing values in chem:\n", df2.isnull().sum(), "\n")


📌 recipes_nodup.csv
   Recipe ID                                    Chemical Names  \
0          1               Ephedrine + Red Phosphorus + Iodine   
1          2             Toluene + Nitric Acid + Sulfuric Acid   
2          3       Hydrogen Peroxide + Acetone + Sulfuric Acid   
3          4  Ephedrine + Potassium Permanganate + Acetic Acid   
4          5             Potassium Nitrate + Charcoal + Sulfur   

                     Formulas   Quantities (g/mL)  \
0           C10H15NO + P + I2     30g + 15g + 10g   
1         C7H8 + HNO3 + H2SO4  50mL + 30mL + 40mL   
2        H2O2 + C3H6O + H2SO4   20mL + 30mL + 5mL   
3  C10H15NO + KMnO4 + CH3COOH    25g + 10g + 50mL   
4                KNO3 + C + S     75g + 15g + 10g   

                         CAS Numbers    Solvent Used  \
0   299-42-3 + 7723-14-0 + 7553-56-2  Acetone, Ether   
1   108-88-3 + 7697-37-2 + 7664-93-9             NaN   
2    7722-84-1 + 67-64-1 + 7664-93-9             NaN   
3     299-42-3 + 7722-64-7 + 64-19-7    

**Handle Missing Values and Remove Duplicates**

In [11]:
# Drop every row that contains at least one missing value
df1 = df1.dropna()
df2 = df2.dropna()

print(f"📉 recipes_nodup: {df1.shape}, chem: {df2.shape}")  # Sizes after dropna, before de-duplication

# Remove exact duplicate rows
df1 = df1.drop_duplicates()
df2 = df2.drop_duplicates()


📉 recipes_nodup: (22, 22), chem: (383, 8)
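
Calling `dropna()` with no arguments removes a row as soon as a single cell is missing. If the goal is to drop only rows with many gaps, the `thresh` parameter is one option. The sketch below uses a made-up toy frame (not the project data) to show the difference; the threshold of 2 is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): row 0 is complete, row 1 misses one value, row 2 misses two
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": ["x", None, None],
    "c": [10, 20, 30],
})

strict = toy.dropna()           # keeps only rows with no missing values
lenient = toy.dropna(thresh=2)  # keeps rows with at least 2 non-missing values

print("strict:", len(strict), "rows | lenient:", len(lenient), "rows")
```

Which variant is appropriate depends on how much of each column is empty; inspecting `isnull().sum()` first helps decide.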


**Convert Chemical Names to Vectorized Format**

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=500)  # Keep at most the 500 most frequent terms
df1_tfidf = vectorizer.fit_transform(df1["Chemical Names"]).toarray()

# Convert to DataFrame, reusing df1's index so rows stay aligned after dropna()
df1_tfidf = pd.DataFrame(
    df1_tfidf, columns=vectorizer.get_feature_names_out(), index=df1.index
)

# Merge back into the dataset and drop the raw text column
df1 = pd.concat([df1, df1_tfidf], axis=1).drop(columns=["Chemical Names"])
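
A point worth checking when joining feature columns onto a filtered frame: `pd.concat(..., axis=1)` matches rows by index label, not by position. After `dropna()` the index can have gaps, while a freshly built DataFrame starts at 0. The sketch below uses small hypothetical frames (the names `left`, `right_default` and `right_aligned` are illustrative only) to show how the two cases differ.

```python
import pandas as pd

# Hypothetical frame whose index has a gap, as can happen after dropna()
left = pd.DataFrame({"id": [1, 3]}, index=[0, 2])

# A freshly built frame gets a default 0..n-1 index
right_default = pd.DataFrame({"feat": [0.5, 0.7]})
misaligned = pd.concat([left, right_default], axis=1)

# Reusing the left frame's index keeps the rows paired one-to-one
right_aligned = pd.DataFrame({"feat": [0.5, 0.7]}, index=left.index)
aligned = pd.concat([left, right_aligned], axis=1)

print("misaligned:", misaligned.shape, "NaNs:", int(misaligned.isna().sum().sum()))
print("aligned:   ", aligned.shape, "NaNs:", int(aligned.isna().sum().sum()))
```

An alternative with the same effect is to call `reset_index(drop=True)` on the filtered frame before concatenating.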


**Normalize Numerical Columns**

In [14]:
print("🔍 Column Names in df1:", df1.columns)


🔍 Column Names in df1: Index(['Recipe ID', 'Formulas', 'Quantities (g/mL)', 'CAS Numbers',
       'Solvent Used', 'Reaction Conditions', 'Toxicity Level',
       'Flammability (Yes/No)', 'Reactivity (Stable/Unstable)',
       'Explosiveness (1-10)', 'Health Risk Score (0-100)',
       'Environmental Hazard (Yes/No)', 'Dual Use Potential (Yes/No)',
       'Intended Use', 'Export Restriction (Yes/No)',
       'Controlled Substance (Yes/No)', 'Risk Assessment Score (0-100)',
       'Regulatory Body', 'Compliance Status (Compliant/Non-compliant)',
       'Risk Category', 'Risk Score (0-100)', 'acetic', 'acetone', 'acid',
       'ammonia', 'charcoal', 'chlorine', 'chloropicrin', 'cyanide', 'dioxide',
       'dust', 'ephedrine', 'ethanol', 'ethylene', 'gas', 'hydrochloric',
       'hydrogen', 'hydroxide', 'iodine', 'lead', 'nitrate', 'nitrocellulose',
       'oxide', 'permanganate', 'peroxide', 'phosphorus', 'picric',
       'potassium', 'pseudoephedrine', 'red', 'sodium', 'sulfur',
       '

In [17]:
import re

from sklearn.preprocessing import MinMaxScaler

def extract_quantity(value):
    """Sum every number found in a quantity string such as '25g + 10g + 50mL'.

    Units are ignored, so grams and millilitres are added together.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", str(value))  # Integers and decimals
    return sum(map(float, numbers)) if numbers else 0.0

# Convert each quantity string to a single numeric total
df1["Quantities (g/mL)"] = df1["Quantities (g/mL)"].apply(extract_quantity)

# Rescale the totals to the [0, 1] range
scaler = MinMaxScaler()
df1[["Quantities (g/mL)"]] = scaler.fit_transform(df1[["Quantities (g/mL)"]])

print(" Successfully converted and scaled Quantities!")


 Successfully converted and scaled Quantities!
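
The quantity strings mix grams and millilitres, and summing them into one number discards that distinction. If separate totals per unit are wanted, something like the following could be used. This is a sketch only: `split_quantities` and `QUANTITY_PATTERN` are hypothetical names, and it assumes the only units present are `g` and `mL`, as in the rows shown above.

```python
import re

# Matches a number (integer or decimal) followed by a unit, e.g. "30g" or "2.5 mL"
QUANTITY_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mL|g)\b")

def split_quantities(value):
    """Return (total_grams, total_millilitres) for a string like '25g + 10g + 50mL'."""
    grams = 0.0
    millilitres = 0.0
    for amount, unit in QUANTITY_PATTERN.findall(str(value)):
        if unit == "g":
            grams += float(amount)
        else:
            millilitres += float(amount)
    return grams, millilitres

print(split_quantities("25g + 10g + 50mL"))
```

The two totals could then be stored in separate columns and scaled independently.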


**Encode Categorical Variables**

In [20]:
from sklearn.preprocessing import LabelEncoder

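# LabelEncoder assigns one integer per distinct value, so encoding the numeric
# "Risk Score (0-100)" column yields the rank of each unique score, not a risk class
label_encoder = LabelEncoder()
df1["Risk Level Encoded"] = label_encoder.fit_transform(df1["Risk Score (0-100)"])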

print("Done !")


Done !
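
If a small number of ordered risk levels is the aim, one possible approach is to bin the 0-100 score before encoding. The sketch below runs on made-up scores; the cut points (33 and 66) and the level names are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up scores standing in for the "Risk Score (0-100)" column
scores = pd.Series([12, 48, 77, 95], name="Risk Score (0-100)")

# Hypothetical cut points and level names, chosen only for this example
bins = [0, 33, 66, 100]
labels = ["Low", "Medium", "High"]
levels = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

# Ordered categorical codes preserve the Low < Medium < High ordering
codes = levels.cat.codes

print(pd.DataFrame({"score": scores, "level": levels, "code": codes}))
```

Using the categorical codes keeps the levels in rank order, whereas `LabelEncoder` on the string labels would sort them alphabetically.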


**Save the Cleaned Data**

In [22]:
df1.to_csv("/content/processed_data.csv", index=False)
print(" Processed data saved as processed_data.csv")


 Processed data saved as processed_data.csv
