## **Upload & Load the Dataset**

Before starting any data analysis or preprocessing, the first step is to load the dataset into a pandas DataFrame.
This allows us to view, explore, and work with the data easily in Python.

In [57]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
file_path = "nanotox_dataset.csv"
df = pd.read_csv(file_path)

# Display the first few rows to verify that the dataset has loaded correctly
df.head()


Unnamed: 0,NPs,coresize,hydrosize,surfcharge,surfarea,Ec,Expotime,dosage,e,NOxygen,class
0,Al2O3,39.7,267.0,36.3,64.7,-1.51,24,0.001,1.61,3,nonToxic
1,Al2O3,39.7,267.0,36.3,64.7,-1.51,24,0.01,1.61,3,nonToxic
2,Al2O3,39.7,267.0,36.3,64.7,-1.51,24,0.1,1.61,3,nonToxic
3,Al2O3,39.7,267.0,36.3,64.7,-1.51,24,1.0,1.61,3,nonToxic
4,Al2O3,39.7,267.0,36.3,64.7,-1.51,24,5.0,1.61,3,nonToxic


## **Check Columns and Data Types**

After loading the dataset, the next step is to inspect the structure of the DataFrame: column names, data types, and non-null counts.
This shows what kind of data we’re working with and whether any columns have obvious issues.

In [58]:
# Display basic information about the dataset
df.info()

# Print just the column names separately
print("\n Column Names:", df.columns.tolist())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 881 entries, 0 to 880
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   NPs         881 non-null    object 
 1   coresize    881 non-null    float64
 2   hydrosize   881 non-null    float64
 3   surfcharge  881 non-null    float64
 4   surfarea    881 non-null    float64
 5   Ec          881 non-null    float64
 6   Expotime    881 non-null    int64  
 7   dosage      881 non-null    float64
 8   e           881 non-null    float64
 9   NOxygen     881 non-null    int64  
 10  class       881 non-null    object 
dtypes: float64(7), int64(2), object(2)
memory usage: 75.8+ KB

 Column Names: ['NPs', 'coresize', 'hydrosize', 'surfcharge', 'surfarea', 'Ec', 'Expotime', 'dosage', 'e', 'NOxygen', 'class']


## **Get Basic Statistical Summary**

Now that we know the columns and their types, the next step is to examine the statistical properties of the numeric features.
This gives us insights into the range, central tendency, and spread of the data.

In [59]:
# Get basic statistics for numerical columns
df.describe()

Unnamed: 0,coresize,hydrosize,surfcharge,surfarea,Ec,Expotime,dosage,e,NOxygen
count,881.0,881.0,881.0,881.0,881.0,881.0,881.0,881.0,881.0
mean,56.31328,513.781385,1.642111,42.074075,-4.018127,27.459705,39.65127,1.64605,1.30874
std,33.700297,346.601373,25.63578,47.111739,0.509806,19.534667,38.163289,0.089304,0.543581
min,7.5,74.0,-41.6,7.0,-5.17,3.0,1e-05,1.54,1.0
25%,32.0,273.4,-11.7,15.0,-4.16,12.0,10.0,1.65,1.0
50%,45.3,327.0,-9.3,24.1,-3.89,24.0,25.0,1.65,1.0
75%,86.0,687.0,29.4,42.5,-3.89,24.0,50.0,1.65,2.0
max,125.0,1843.0,42.8,210.0,-1.51,72.0,300.0,1.9,3.0


## **Check for Missing Values**

Before moving to preprocessing, it’s important to check whether any columns contain missing or null values.
Missing data can break or bias training, so any gaps would need to be filled or removed before fitting a model.

In [60]:
# Check how many missing values each column has
df.isnull().sum()

Unnamed: 0,0
NPs,0
coresize,0
hydrosize,0
surfcharge,0
surfarea,0
Ec,0
Expotime,0
dosage,0
e,0
NOxygen,0
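
The training data shown above has no gaps, so nothing needs to be done here. For reference, the two options mentioned (removing or filling) could look roughly like the sketch below. The `toy` frame is a hypothetical stand-in, not part of the dataset, and median imputation is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses a few of the dataset's column names
toy = pd.DataFrame({
    "coresize": [39.7, np.nan, 45.3, 86.0],
    "dosage": [0.001, 10.0, np.nan, 50.0],
    "class": ["nonToxic", "Toxic", "Toxic", None],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column median,
# then drop rows that still have no label
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
filled = filled.dropna(subset=["class"])

print("Rows after dropna:", len(dropped))
print("Rows after median fill:", len(filled))
```

Dropping is simpler but discards data; filling keeps more rows at the cost of introducing estimated values.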


## **Check Class Balance in the Target Column**

Since the dataset already has a target column, `class`, the next step is to see how many samples belong to each class (`Toxic` vs `nonToxic`).
This tells us whether we need to balance the dataset during training.

In [61]:
# Count how many samples are Toxic vs Non-toxic
df['class'].value_counts()

Unnamed: 0_level_0,count
class,Unnamed: 1_level_1
Toxic,476
nonToxic,405
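
To make the balance easier to judge, the counts above can be turned into proportions and a majority-to-minority ratio. This is a small sketch that copies the two counts from the output by hand; on the real data, `df['class'].value_counts(normalize=True)` would give the proportions directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"Toxic": 476, "nonToxic": 405})

# Share of each class in the whole dataset
proportions = counts / counts.sum()

# Ratio of the larger class to the smaller one
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```

With roughly 54% `Toxic` and 46% `nonToxic`, the two classes are close to balanced.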


## **Select Only the Useful Features for Training**

The dataset contains several columns that we will not use to train the model (`NPs`, `surfarea`, `Ec`, `Expotime`, `NOxygen`).
We’ll keep five numeric features (`coresize`, `hydrosize`, `surfcharge`, `e`, `dosage`) and the target column `class`.

In [62]:
# Select only the relevant numeric features and target column
selected_columns = [
    'coresize',
    'hydrosize',
    'surfcharge',
    'e',
    'dosage',
    'class'
]

df_model = df[selected_columns].copy()

# Display the first few rows of the cleaned dataset
df_model.head()

Unnamed: 0,coresize,hydrosize,surfcharge,e,dosage,class
0,39.7,267.0,36.3,1.61,0.001,nonToxic
1,39.7,267.0,36.3,1.61,0.01,nonToxic
2,39.7,267.0,36.3,1.61,0.1,nonToxic
3,39.7,267.0,36.3,1.61,1.0,nonToxic
4,39.7,267.0,36.3,1.61,5.0,nonToxic


## **Encode Target Column and Separate Features & Labels**

The `class` column currently has string values: "nonToxic" and "Toxic".
We need to convert them to numeric labels:

0 → nonToxic

1 → Toxic

Then we’ll separate the feature matrix X and the target vector y for training.

In [63]:
# Encode target labels: nonToxic -> 0, Toxic -> 1
df_model['class'] = df_model['class'].map({'nonToxic': 0, 'Toxic': 1})

# Separate features (X) and target (y)
X = df_model.drop(columns=['class'])
y = df_model['class']

# Check shapes
print("Feature shape (X):", X.shape)
print("Target shape (y):", y.shape)

# Display first few rows to verify encoding
df_model.head()


Feature shape (X): (881, 5)
Target shape (y): (881,)


Unnamed: 0,coresize,hydrosize,surfcharge,e,dosage,class
0,39.7,267.0,36.3,1.61,0.001,0
1,39.7,267.0,36.3,1.61,0.01,0
2,39.7,267.0,36.3,1.61,0.1,0
3,39.7,267.0,36.3,1.61,1.0,0
4,39.7,267.0,36.3,1.61,5.0,0
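
`Series.map` with a dictionary silently turns any label that is not a key into `NaN`, and the keys are case-sensitive. A quick check like the sketch below can catch that before training. The `labels` series is hypothetical; its last entry uses a different casing on purpose.

```python
import pandas as pd

# Hypothetical labels; "NonToxic" differs in casing from the mapping key
labels = pd.Series(["nonToxic", "Toxic", "NonToxic"])
mapping = {"nonToxic": 0, "Toxic": 1}

encoded = labels.map(mapping)

# Collect the original label strings that the mapping did not cover
unmapped = labels[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list means every label was encoded; anything else points to a spelling or casing mismatch.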


In [64]:
# Convert the DataFrame and Series to NumPy arrays for scikit-learn
X_final = X.to_numpy()
y_final = y.to_numpy()

# Check the final shapes
print(" Final feature shape:", X_final.shape)
print(" Final target shape:", y_final.shape)


 Final feature shape: (881, 5)
 Final target shape: (881,)


In [65]:
X_final,y_final

(array([[ 3.970e+01,  2.670e+02,  3.630e+01,  1.610e+00,  1.000e-03],
        [ 3.970e+01,  2.670e+02,  3.630e+01,  1.610e+00,  1.000e-02],
        [ 3.970e+01,  2.670e+02,  3.630e+01,  1.610e+00,  1.000e-01],
        ...,
        [ 4.630e+01,  2.390e+02,  4.280e+01,  1.900e+00,  1.000e+02],
        [ 3.560e+01,  2.955e+02, -4.160e+01,  1.650e+00,  1.000e+01],
        [ 4.630e+01,  2.390e+02,  4.280e+01,  1.900e+00,  1.000e+02]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [66]:
from sklearn.ensemble import RandomForestClassifier

# Create Random Forest model
rf = RandomForestClassifier(
    n_estimators=500,      # ntree = 500 as per research paper
    max_features='sqrt',  # common best practice
    random_state=42,
    class_weight='balanced'   # helps reduce class imbalance issues
)

# Train model
rf.fit(X_final, y_final)

print("Random Forest model trained successfully!")


Random Forest model trained successfully!


In [67]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predictions on the training data (this measures fit, not generalization)
train_pred = rf.predict(X_final)

# Accuracy
train_acc = accuracy_score(y_final, train_pred)
print(f" Training Accuracy: {train_acc*100:.2f}%")

# Confusion Matrix
print("\n Confusion Matrix:")
print(confusion_matrix(y_final, train_pred))

# Classification Report
print("\n Classification Report:")
print(classification_report(y_final, train_pred, target_names=['NonToxic', 'Toxic']))


 Training Accuracy: 95.23%

 Confusion Matrix:
[[363  42]
 [  0 476]]

 Classification Report:
              precision    recall  f1-score   support

    NonToxic       1.00      0.90      0.95       405
       Toxic       0.92      1.00      0.96       476

    accuracy                           0.95       881
   macro avg       0.96      0.95      0.95       881
weighted avg       0.96      0.95      0.95       881
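
The scores above are computed on the same rows the model was trained on, so they say little about performance on unseen data. One common way to get a less optimistic estimate is stratified k-fold cross-validation. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `X_final` and `y_final`, and 50 trees instead of 500 to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_final / y_final (5 features, binary target)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Same kind of model as above, with fewer trees for speed (an assumption)
model = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", class_weight="balanced", random_state=42
)

# 5 folds, each keeping the class proportions of the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

On the real data, the same call would take `X_final` and `y_final` in place of the synthetic arrays.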



In [68]:
import joblib

joblib.dump(rf, "RandomForest_NanoToxicity.pkl")
print("Model saved as RandomForest_NanoToxicity.pkl")


Model saved as RandomForest_NanoToxicity.pkl


## **Testing Dataset**

In [69]:
import pandas as pd

test_df = pd.read_csv("test_data.csv")
test_df.head()


Unnamed: 0,Core Size,Hyrdo Size (nm),?-potential in H2O,Electronegativity,Concentration,Expotime,Viability,Toxicity
0,14.7,429.5,22.95,1.61,0.0,24.0,100.0,NonToxic
1,14.7,429.5,22.95,1.61,3.1,24.0,104.08,NonToxic
2,14.7,429.5,22.95,1.61,6.2,24.0,100.03,NonToxic
3,14.7,429.5,22.95,1.61,12.5,24.0,99.96,NonToxic
4,14.7,429.5,22.95,1.61,25.0,24.0,102.78,NonToxic


In [70]:
print(" Missing values in each column:\n", test_df.isnull().sum())

 Missing values in each column:
 Core Size             3
Hyrdo Size (nm)       3
?-potential in H2O    3
Electronegativity     3
Concentration         3
Expotime              3
Viability             3
Toxicity              3
dtype: int64


In [71]:
test_df = test_df.dropna()

In [72]:
print("\n After cleaning, missing values:\n", test_df.isnull().sum())


 After cleaning, missing values:
 Core Size             0
Hyrdo Size (nm)       0
?-potential in H2O    0
Electronegativity     0
Concentration         0
Expotime              0
Viability             0
Toxicity              0
dtype: int64


In [73]:
test_df.columns = test_df.columns.str.strip()

In [74]:
test_df['Toxicity_Label_Num'] = test_df['Toxicity'].map({'NonToxic': 0, 'Toxic': 1})

# Labels not covered by the mapping become NaN; mark them as -1 so the column can be cast to int
test_df['Toxicity_Label_Num'] = test_df['Toxicity_Label_Num'].fillna(-1).astype(int)

In [75]:
test_df.head()

Unnamed: 0,Core Size,Hyrdo Size (nm),?-potential in H2O,Electronegativity,Concentration,Expotime,Viability,Toxicity,Toxicity_Label_Num
0,14.7,429.5,22.95,1.61,0.0,24.0,100.0,NonToxic,0
1,14.7,429.5,22.95,1.61,3.1,24.0,104.08,NonToxic,0
2,14.7,429.5,22.95,1.61,6.2,24.0,100.03,NonToxic,0
3,14.7,429.5,22.95,1.61,12.5,24.0,99.96,NonToxic,0
4,14.7,429.5,22.95,1.61,25.0,24.0,102.78,NonToxic,0


In [76]:
# Print column names and their index position for clarity
for i, col in enumerate(test_df.columns):
    print(f"{i}: {col}")


0: Core Size
1: Hyrdo Size (nm)
2: ?-potential in H2O
3: Electronegativity
4: Concentration
5: Expotime
6: Viability
7: Toxicity
8: Toxicity_Label_Num


In [77]:
# Rename columns to match training features
test_df = test_df.rename(columns={
    'Core Size': 'coresize',
    'Hyrdo Size (nm)': 'hydrosize',
    '?-potential in H2O': 'surfcharge',
    'Electronegativity': 'e',
    'Concentration': 'dosage'
})

# Verify the new column names
print(" Updated column names:")
print(test_df.columns.tolist())

 Updated column names:
['coresize', 'hydrosize', 'surfcharge', 'e', 'dosage', 'Expotime', 'Viability', 'Toxicity', 'Toxicity_Label_Num']


In [78]:
test_df.head()

Unnamed: 0,coresize,hydrosize,surfcharge,e,dosage,Expotime,Viability,Toxicity,Toxicity_Label_Num
0,14.7,429.5,22.95,1.61,0.0,24.0,100.0,NonToxic,0
1,14.7,429.5,22.95,1.61,3.1,24.0,104.08,NonToxic,0
2,14.7,429.5,22.95,1.61,6.2,24.0,100.03,NonToxic,0
3,14.7,429.5,22.95,1.61,12.5,24.0,99.96,NonToxic,0
4,14.7,429.5,22.95,1.61,25.0,24.0,102.78,NonToxic,0


In [79]:
# Select the same 5 features used during training
X_test = test_df[['coresize',
                  'hydrosize',
                  'surfcharge',
                  'e',
                  'dosage']]
X_test.head()


Unnamed: 0,coresize,hydrosize,surfcharge,e,dosage
0,14.7,429.5,22.95,1.61,0.0
1,14.7,429.5,22.95,1.61,3.1
2,14.7,429.5,22.95,1.61,6.2
3,14.7,429.5,22.95,1.61,12.5
4,14.7,429.5,22.95,1.61,25.0


In [80]:
import joblib
rf = joblib.load("RandomForest_NanoToxicity.pkl")
print(" Random Forest model loaded!")


 Random Forest model loaded!


In [81]:
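# The model was fitted on a NumPy array, so pass an array here too (avoids a feature-name warning)
y_pred = rf.predict(X_test.to_numpy())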
print(" Sample Predictions:", y_pred[:10])


 Sample Predictions: [0 0 0 0 0 0 0 0 0 0]




In [82]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
y_test = test_df['Toxicity_Label_Num']

acc = accuracy_score(y_test, y_pred)
print(f" Test Accuracy: {acc*100:.2f}%")

print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\n Classification Report:")
print(classification_report(y_test, y_pred, target_names=['NonToxic', 'Toxic']))


 Test Accuracy: 75.00%

 Confusion Matrix:
[[169  53]
 [  7  11]]

 Classification Report:
              precision    recall  f1-score   support

    NonToxic       0.96      0.76      0.85       222
       Toxic       0.17      0.61      0.27        18

    accuracy                           0.75       240
   macro avg       0.57      0.69      0.56       240
weighted avg       0.90      0.75      0.81       240
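
The per-class numbers in the report can be traced back to the confusion matrix. The sketch below copies the matrix from the output above by hand and recomputes accuracy, precision, and recall for the `Toxic` class, assuming scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix copied from the test output above
# (rows = true class, columns = predicted class; order: NonToxic, Toxic)
cm = np.array([[169, 53],
               [7, 11]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # (11 + 169) / 240
precision_toxic = tp / (tp + fp)    # 11 / (11 + 53)
recall_toxic = tp / (tp + fn)       # 11 / (11 + 7)

print(f"Accuracy:          {accuracy:.2f}")
print(f"Precision (Toxic): {precision_toxic:.2f}")
print(f"Recall (Toxic):    {recall_toxic:.2f}")
```

Of the 64 samples predicted as `Toxic`, 53 are actually `NonToxic`, which is why precision for the `Toxic` class is so low even though overall accuracy is 75%.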

