#### **1. Install and Import Required Libraries**

This cell installs and imports the libraries used throughout the pipeline: pandas and NumPy for data manipulation, and scikit-learn for the classifier, the self-training wrapper, feature selection, scaling, and evaluation metrics.

In [2]:
# Install any missing libraries
!pip install scikit-learn pandas numpy
# Original Work of Adityabaan Tripathy (20251694)
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, classification_report
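
Since `SelfTrainingClassifier` was introduced in scikit-learn 0.24, an older installation would fail at the import above. A minimal sketch of a version check, assuming the version string starts with two numeric components (which holds for released versions):

```python
import sklearn

# SelfTrainingClassifier was added in scikit-learn 0.24, so anything older
# cannot run the semi-supervised step of this notebook.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
assert (major, minor) >= (0, 24), f"scikit-learn {sklearn.__version__} is too old"
print("scikit-learn", sklearn.__version__)
```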



#### **2. Load Data**

Here, we load the three provided CSV files: `train.csv` (gene expression profiles for the training samples), `test.csv` (profiles to be predicted), and `train_labels.csv` (class labels for the labeled subset of the training samples).

In [3]:
# Load the datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
labels = pd.read_csv('train_labels.csv')
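
Before modelling, it can help to confirm that the loaded frames look as expected. The sketch below uses small toy frames as stand-ins for the real files; the column layout (`Id`, `gene_*`, `Class`) is inferred from the later cells, and `describe_training_data` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv and train_labels.csv (values are made up).
toy_train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "gene_0": [0.1, 0.4, np.nan, 0.9],
    "gene_1": [1.2, 0.7, 0.3, 0.5],
    "Class": [0.0, np.nan, 1.0, np.nan],
})
toy_labels = pd.DataFrame({"Id": [1, 3], "Class": [0, 1]})

def describe_training_data(train_df, labels_df):
    """Return basic counts worth checking before modelling (hypothetical helper)."""
    gene_cols = train_df.filter(like="gene_").columns
    return {
        "n_samples": len(train_df),
        "n_genes": len(gene_cols),
        "n_unlabeled": int(train_df["Class"].isna().sum()),
        "n_missing_values": int(train_df[gene_cols].isna().sum().sum()),
        "duplicate_ids": int(train_df["Id"].duplicated().sum()),
        "label_ids_in_train": bool(labels_df["Id"].isin(train_df["Id"]).all()),
    }

summary = describe_training_data(toy_train, toy_labels)
print(summary)
```

Calling the same helper on the real `train` and `labels` frames would show how many samples are unlabeled and whether any gene values are missing.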

#### **3. Prepare Labeled and Unlabeled Data**

This step separates the labeled and unlabeled samples in the training data. Labeled samples are the training rows whose `Id` appears in `train_labels.csv`; unlabeled samples are the rows whose `Class` is missing. The labeled set supports supervised training, while the unlabeled set is used later by the semi-supervised step.

In [4]:
# Merge labels with train to get labeled samples
labeled = train.merge(labels, on='Id', suffixes=('', '_true'))
X_labeled = labeled.filter(like='gene_').values
y_labeled = labeled['Class_true'].values

# Unlabeled samples (Class is NaN)
unlabeled = train[train['Class'].isna()]
X_unlabeled = unlabeled.filter(like='gene_').values

# Prepare test set
X_test = test.filter(like='gene_').values
test_ids = test['Id'].values
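
The split relies on two different criteria (an `Id` match for labeled rows, a missing `Class` for unlabeled rows), so it is worth checking that the two subsets are disjoint and together cover the training set. A minimal sketch on toy data, assuming the same column names as above:

```python
import numpy as np
import pandas as pd

# Toy training frame: rows 11 and 13 have no class and no entry in the labels file.
toy_train = pd.DataFrame({
    "Id": [10, 11, 12, 13, 14],
    "gene_0": [0.2, 1.5, 0.7, 2.1, 0.9],
    "gene_1": [3.0, 0.4, 1.1, 0.8, 2.2],
    "Class": [0.0, np.nan, 1.0, np.nan, 0.0],
})
toy_labels = pd.DataFrame({"Id": [10, 12, 14], "Class": [0, 1, 0]})

# Same split logic as the cell above, applied to the toy frames.
labeled = toy_train.merge(toy_labels, on="Id", suffixes=("", "_true"))
unlabeled = toy_train[toy_train["Class"].isna()]

# The two subsets should not share any Id ...
overlap = set(labeled["Id"]) & set(unlabeled["Id"])
# ... and together they should account for every training row.
covers_all = len(labeled) + len(unlabeled) == len(toy_train)

print("labeled:", len(labeled), "unlabeled:", len(unlabeled))
print("overlap:", overlap, "covers all rows:", covers_all)
```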

#### **4. Data Normalization**

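Gene expression features are standardized to zero mean and unit variance. The scaler is fitted on the labeled samples only and then applied unchanged to the unlabeled and test samples. Standardization matters most for scale-sensitive methods; the random forest used later is largely insensitive to feature scale, so this step mainly keeps the features on a common footing for the preprocessing that follows.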

In [5]:
# Standardize features (fit on the labeled training samples only)
scaler = StandardScaler()
X_labeled = scaler.fit_transform(X_labeled)
X_unlabeled = scaler.transform(X_unlabeled)
X_test = scaler.transform(X_test)
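
The fit-then-transform pattern can be illustrated on synthetic data. The sketch below also shows that `StandardScaler` ignores NaN values when fitting and passes them through when transforming, which is why imputation can happen after scaling. All array names here are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=5.0, scale=2.0, size=(50, 3))   # plays the role of the labeled data
X_new = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # plays the role of unlabeled/test data
X_new[0, 0] = np.nan                                   # one missing entry

# Learn mean and standard deviation from X_fit only, then reuse them.
scaler = StandardScaler().fit(X_fit)
X_fit_scaled = scaler.transform(X_fit)
X_new_scaled = scaler.transform(X_new)

print(X_fit_scaled.mean(axis=0).round(6))   # approximately zero per column
print(X_fit_scaled.std(axis=0).round(6))    # approximately one per column
print(np.isnan(X_new_scaled[0, 0]))         # the missing entry is still missing
```

Only the data passed to `fit` is guaranteed to end up with exactly zero mean and unit variance; the transformed unlabeled and test data will be close to that only if they follow a similar distribution.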

#### **5. Feature Selection**

Missing values are first imputed with the per-gene mean, because mutual information cannot be computed on data containing NaN. We then keep the 500 genes with the highest mutual information with the class label, reducing the feature set from thousands of genes to a manageable subset. With few labeled samples, this reduces the risk of overfitting, speeds up training, and makes the model easier to interpret.

In [7]:
from sklearn.impute import SimpleImputer

# Impute missing values with the per-gene mean learned from the labeled samples
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_labeled_imputed = imputer.fit_transform(X_labeled)
X_unlabeled_imputed = imputer.transform(X_unlabeled)
X_test_imputed = imputer.transform(X_test)

# Select the top 500 informative genes using mutual information
selector = SelectKBest(mutual_info_classif, k=500)
X_labeled_fs = selector.fit_transform(X_labeled_imputed, y_labeled)
X_unlabeled_fs = selector.transform(X_unlabeled_imputed)
X_test_fs = selector.transform(X_test_imputed)
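
To see which features a selector keeps, `get_support` can be queried after fitting. The sketch below runs on synthetic data from `make_classification`, with a smaller `k` than the notebook uses. Fixing `random_state` through `functools.partial` is an addition for reproducibility, since `mutual_info_classif` involves randomness; the original cell does not do this.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the gene matrix: 40 features, 5 of them informative.
X_toy, y_toy = make_classification(
    n_samples=120, n_features=40, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Fix the seed of the mutual information estimator so repeated runs agree.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=10)
X_selected = selector.fit_transform(X_toy, y_toy)

# Column indices (in the original matrix) of the features that were kept.
selected_idx = selector.get_support(indices=True)
print(X_selected.shape)
print(selected_idx)
```

On the real data, indexing the `gene_` column names with `selector.get_support(indices=True)` would give the names of the retained genes.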

#### **6. Semi-Supervised Learning with Self-Training (Pseudo-Labeling)**

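A base classifier is first trained on the labeled data. In each iteration, unlabeled samples whose highest predicted class probability is at least the threshold (0.95 here) receive that prediction as a pseudo-label, and the classifier is refit on the enlarged training set. The loop stops after at most 10 iterations or when no new samples qualify. Self-training can improve generalization when labeled examples are scarce, but the gain depends on how many unlabeled samples clear the threshold.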

In [8]:
# Base classifier: random forest with class weights balanced against class imbalance
base_clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)

# Wrap with SelfTrainingClassifier for semi-supervised learning
self_training_clf = SelfTrainingClassifier(base_clf, threshold=0.95, max_iter=10, verbose=True)

# Prepare combined data: labeled + unlabeled (unlabeled as -1)
X_combined = np.vstack([X_labeled_fs, X_unlabeled_fs])
y_combined = np.concatenate([y_labeled, [-1]*X_unlabeled_fs.shape[0]])

# Fit the model
self_training_clf.fit(X_combined, y_combined)

End of iteration 1, added 1 new labels.
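
After fitting, `SelfTrainingClassifier` exposes `labeled_iter_` and `termination_condition_`, which show how many samples were pseudo-labeled and why the loop stopped. A minimal sketch on synthetic data (the dataset, sizes, and seeds are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X_toy, y_true = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=2, random_state=0
)

# Hide 150 of the 200 labels by marking them as -1 (the unlabeled marker).
rng = np.random.default_rng(0)
hidden = rng.choice(len(y_true), size=150, replace=False)
y_partial = y_true.copy()
y_partial[hidden] = -1

base = RandomForestClassifier(n_estimators=50, random_state=0)
model = SelfTrainingClassifier(base, threshold=0.95, max_iter=10)
model.fit(X_toy, y_partial)

# labeled_iter_: 0 = labeled from the start, k > 0 = pseudo-labeled in
# iteration k, -1 = never labeled.
n_initial = int((model.labeled_iter_ == 0).sum())
n_pseudo = int((model.labeled_iter_ > 0).sum())
n_left = int((model.labeled_iter_ == -1).sum())

print("initially labeled:", n_initial)
print("pseudo-labeled:", n_pseudo)
print("still unlabeled:", n_left)
print("stopped because:", model.termination_condition_)
```

If very few samples are pseudo-labeled, the fitted model is close to the purely supervised baseline; lowering `threshold` admits more pseudo-labels at the cost of more label noise.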


#### **7. Model Evaluation**

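The model's predictions on the labeled training data are evaluated with a classification report and the macro F1-score. Because these are the same samples the model was trained on, the scores measure how well the model fits the training set, not how well it generalizes. Perfect scores are typical for a random forest evaluated on its own training data and should not be read as an estimate of test performance.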

In [9]:
# Evaluate on the labeled training samples (training-set fit, not a held-out estimate)
y_pred_labeled = self_training_clf.predict(X_labeled_fs)
print(classification_report(y_labeled, y_pred_labeled, digits=4))
print('Macro F1-Score:', f1_score(y_labeled, y_pred_labeled, average='macro'))

              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000        26
           1     1.0000    1.0000    1.0000        26
           2     1.0000    1.0000    1.0000        56
           3     1.0000    1.0000    1.0000        15
           4     1.0000    1.0000    1.0000        27

    accuracy                         1.0000       150
   macro avg     1.0000    1.0000    1.0000       150
weighted avg     1.0000    1.0000    1.0000       150

Macro F1-Score: 1.0
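
A less optimistic estimate comes from cross-validation, with every preprocessing step refit inside each fold so that no held-out sample influences scaling, imputation, or feature selection. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic data, a smaller `k`, and a purely supervised pipeline (without the self-training wrapper).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labeled samples.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=60, n_informative=8,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Same order of steps as the notebook; each step is refit on the training
# part of every fold because it lives inside the pipeline.
pipeline = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean"),
    SelectKBest(mutual_info_classif, k=20),
    RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="f1_macro")
print("Macro F1 per fold:", scores.round(4))
print("Mean:", scores.mean().round(4), "Std:", scores.std().round(4))
```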


#### **8. Predict on Test Set and Prepare Submission**

The trained model predicts the cancer type of each test sample, and the predictions are written to `submission.csv` with the columns `Id` and `Class`, as required for leaderboard evaluation.

In [10]:
# Predict test set classes
test_preds = self_training_clf.predict(X_test_fs).astype(int)

# Prepare submission file
submission = pd.DataFrame({'Id': test_ids, 'Class': test_preds})
submission.to_csv('submission.csv', index=False)
print('Submission file created: submission.csv')

Submission file created: submission.csv
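
A quick check of the submission frame before uploading can catch formatting mistakes. The sketch below uses toy values; `find_submission_problems` is a hypothetical helper, and the valid class set 0 to 4 is an assumption based on the classes shown in the classification report above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_ids and test_preds.
toy_test_ids = np.array([101, 102, 103])
toy_preds = np.array([2, 0, 4])
toy_submission = pd.DataFrame({"Id": toy_test_ids, "Class": toy_preds})

def find_submission_problems(df, expected_ids, valid_classes):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if list(df.columns) != ["Id", "Class"]:
        return ["unexpected columns"]
    if len(df) != len(expected_ids) or set(df["Id"]) != set(expected_ids):
        problems.append("Ids do not match the test set")
    if df["Id"].duplicated().any():
        problems.append("duplicate Ids")
    if not df["Class"].isin(valid_classes).all():
        problems.append("class outside the expected label set")
    return problems

valid_classes = [0, 1, 2, 3, 4]  # assumed from the classification report
problems = find_submission_problems(toy_submission, toy_test_ids, valid_classes)
print("problems:", problems)

# A deliberately broken submission, to show what a failure looks like.
bad_submission = toy_submission.assign(Class=[2, 0, 7])
bad_problems = find_submission_problems(bad_submission, toy_test_ids, valid_classes)
print("problems in bad submission:", bad_problems)
```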
