## Building a Nuclei Classifier Model from Extracted Feature Data

This tutorial walks through building a nuclei classifier with the *Random Forest* algorithm. The approach extracts features from image data using DSA (Digital Slide Archive) and assigns a class to each data instance using K-means clustering. A basic grasp of Python and familiarity with IPython (Jupyter) notebooks are prerequisites for following along.

**Included Resources**
- [Input Image](https://data.kitware.com/api/v1/file/hashsum/sha512/1ff135eb0ff8864a876a19ae3dec579f27f1718726a68643f6a40a244fdfa08e81f63f1413c198b38384cb34e8705bc60a6c69ef2b706cb0419f6ec091b2b621/download)
- [Extracted Features file (Optional)](https://data.kitware.com/api/v1/file/hashsum/sha512/e8c829b60d316ff84d2ffafb5accd605eb8dcd02dec709105ec9127aa2d7969e2feca74f66394b26f0e90375cd0d1cda3d1831023449f66cf50a637906444578/download)

*This tutorial was created by Subin Erattakulangara (Kitware)*

### Step 1 (Extract nuclei features)

Open the Nuclei Feature Extraction panel in DSA and upload the image into the *Input Image* area shown below. Provide the locations where the feature file and the annotation file should be saved, then press Submit to start the process. The CLI generates the feature file required for the classifier. The annotation file is not required for creating the classifier.

![DSA panel.png](https://data.kitware.com/api/v1/file/hashsum/sha512/10f88a5400e7fa46605e3f75530ae8703a429fbbf1185444a14fa40beec251434d19760de90bdaae25b5ece3557b502b59e40fab377b3df5978088b14c3a14e2/download)

### Step 2 (Generate training labels)

Once the CLI generates the feature file, download it. Create a new folder, put the downloaded feature file in it, and run the Python code below from that folder. The code adds class labels to the feature data.

You can also use the `.csv` file provided above to create the classifier.

##### Read the CSV file

In [9]:
# Ensure you import all the necessary libraries.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
import pandas as pd
import pickle

In [13]:
# Read CSV file
# This link provides access to a sample CSV file for download. Feel free to replace it with your own customized CSV file.
url = "https://data.kitware.com/api/v1/file/hashsum/sha512/e8c829b60d316ff84d2ffafb5accd605eb8dcd02dec709105ec9127aa2d7969e2feca74f66394b26f0e90375cd0d1cda3d1831023449f66cf50a637906444578/download"
df = pd.read_csv(url).fillna(0)
df.head()

Unnamed: 0,Feature.Label,Feature.Identifier.Xmin,Feature.Identifier.Ymin,Feature.Identifier.Xmax,Feature.Identifier.Ymax,Feature.Identifier.CentroidX,Feature.Identifier.CentroidY,Feature.Identifier.WeightedCentroidX,Feature.Identifier.WeightedCentroidY,Feature.Orientation.Orientation,...,Feature.Cytoplasm.Haralick.Entropy.Mean,Feature.Cytoplasm.Haralick.Entropy.Range,Feature.Cytoplasm.Haralick.DifferenceVariance.Mean,Feature.Cytoplasm.Haralick.DifferenceVariance.Range,Feature.Cytoplasm.Haralick.DifferenceEntropy.Mean,Feature.Cytoplasm.Haralick.DifferenceEntropy.Range,Feature.Cytoplasm.Haralick.IMC1.Mean,Feature.Cytoplasm.Haralick.IMC1.Range,Feature.Cytoplasm.Haralick.IMC2.Mean,Feature.Cytoplasm.Haralick.IMC2.Range
0,1.0,522.0,0.0,543.0,7.0,532.181818,2.171717,533.656523,2.031401,1.537561,...,4.70649,0.194888,0.008351,0.002288,2.040481,0.322045,-0.057505,0.073408,0.47629,0.252285
1,2.0,907.0,5.0,917.0,22.0,910.981651,13.155963,910.756021,13.072805,0.042388,...,4.97538,0.256907,0.008047,0.001967,2.07238,0.319199,-0.11136,0.086315,0.658314,0.175118
2,3.0,621.0,16.0,631.0,28.0,625.518072,21.421687,626.368708,20.876592,-0.603565,...,4.855255,0.247903,0.007793,0.00202,2.085567,0.344207,-0.067079,0.091177,0.521425,0.290303
3,4.0,651.0,31.0,661.0,50.0,655.672414,40.137931,655.497753,40.189438,-0.144686,...,5.415242,0.192235,0.006401,0.001623,2.382495,0.307914,-0.059416,0.060096,0.520708,0.198088
4,5.0,923.0,42.0,937.0,60.0,929.307692,50.487179,928.986837,50.98015,-0.620023,...,4.628241,0.286323,0.008531,0.00239,2.002746,0.406743,-0.09769,0.105828,0.59982,0.219931


##### Standardize the data and perform K-means clustering

In [14]:
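# Standardize the data.
# Note: df.values passes every column to the scaler, including the index
# column ("Unnamed: 0") and the Feature.Identifier.* position columns.
X = df.values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform KMeans clustering
num_clusters = 5  # Number of clusters you want to create
# n_init is set explicitly to the value this run used by default; newer
# scikit-learn releases change the default and emit a FutureWarning otherwise.
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)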

##### Add generated cluster labels to feature file

In [15]:
# Add cluster labels to the original data
df['Cluster'] = cluster_labels

# Print the count of data points in each cluster
print(df['Cluster'].value_counts())

3    833
1    737
4    607
2    505
0     40
Name: Cluster, dtype: int64
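
The value `num_clusters = 5` is a setting worth checking against your own data. One common heuristic, which is not part of the original workflow, is to compare silhouette scores across several cluster counts. The sketch below runs on synthetic data from `make_blobs` so that it is self-contained; `X_demo`, `scores`, and `best_k` are illustrative names, and the four blob centers are an assumption chosen to make the result easy to read.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table: four well-separated groups.
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X_demo, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=1.0, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit KMeans for several cluster counts and record the silhouette score
# (higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette score:", best_k)
```

On real feature data the scores are rarely this clear-cut, so treat the outcome as a hint to combine with domain knowledge rather than as a decision.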


In this step we added the class labels to the feature data as a new `Cluster` column. These labels are required to train the random forest classifier.
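
The `Cluster` column exists only in the in-memory dataframe at this point. If you want a labeled feature file on disk, a minimal sketch of writing it out is shown here. The small `df_demo` table and the file name `features_with_clusters.csv` are hypothetical placeholders; in the tutorial you would call `to_csv` on `df` instead.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the labeled feature table (hypothetical values).
df_demo = pd.DataFrame({
    "Feature.A": [0.1, 0.4, 0.9],
    "Cluster": [0, 1, 1],
})

# Write to a temporary folder; the file name is an illustrative choice.
out_dir = tempfile.mkdtemp()
labeled_path = os.path.join(out_dir, "features_with_clusters.csv")

# index=False avoids writing an extra unnamed index column to the file.
df_demo.to_csv(labeled_path, index=False)

# Read the file back to confirm that the labels were stored.
reloaded = pd.read_csv(labeled_path)
print(reloaded)
```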

### Step 3 (Train random forest classifier)

##### Split the dataframe into features and target labels

In [16]:
# The last column ('Cluster', added in Step 2) holds the target labels
X = df.iloc[:, :-1]  # Features
y = df.iloc[:, -1]   # Target labels
print(X.shape, y.shape)

# Encode labels as integers 0..n_classes-1 (the cluster labels are already integers, so they stay the same)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

(2722, 134) (2722,)


##### Split the data into training and testing sets

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
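
The cluster counts printed in Step 2 are imbalanced (40 nuclei in the smallest cluster versus 833 in the largest), so a purely random split may leave very few examples of the small class in the test set. One option, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on synthetic labels; `X_demo` and `y_demo` are illustrative, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 40 samples of class 0 and 160 samples of class 1.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 40 + [1] * 160)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

In the tutorial's cell this would amount to passing `stratify=y_encoded`. Doing so changes which rows land in the test set, so the accuracy reported later may differ.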

##### Train a random forest classifier

In [18]:
# Create a RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = classifier.predict(X_test)

# Convert numerical predictions back to categorical labels
y_pred_labels = label_encoder.inverse_transform(y_pred)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.91
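
A single accuracy figure can hide weak performance on small classes. As a hedged sketch, per-class precision and recall and a confusion matrix can be obtained from `sklearn.metrics`. The example uses synthetic three-class data from `make_classification` so that it runs on its own. In the tutorial, `y_test` and `y_pred` would take the place of `y_te` and `y_hat`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the nuclei features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat, labels=[0, 1, 2])
report = classification_report(y_te, y_hat)
print(cm)
print(report)
```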


##### Save the trained model to a pickle file

In [19]:
model_filename = 'breast_cancer_classification_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(classifier, model_file)

print(f"Model saved as {model_filename}")

Model saved as breast_cancer_classification_model.pkl


This trained model can be used for nuclei classification. Upload the model file to Girder and select it as the "Input Model File" in the Classify Nuclei task.
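
Before uploading, it can be useful to confirm locally that the pickle file loads and predicts as expected. The following is a minimal sketch of that round trip on a small synthetic model. The file name `demo_model.pkl` and the data are placeholders, and the sketch makes no assumption about how the Classify Nuclei task itself loads the file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in model on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save the model, then load it back from disk.
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(model_path, "wb") as model_file:
    pickle.dump(clf, model_file)
with open(model_path, "rb") as model_file:
    loaded = pickle.load(model_file)

# The reloaded model should reproduce the original predictions exactly.
same = np.array_equal(clf.predict(X_demo), loaded.predict(X_demo))
print("Predictions identical after reload:", same)
print("Number of input features expected:", loaded.n_features_in_)
```

Pickled scikit-learn models are generally only reliable when loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from trusted sources.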