# Multilabel Classification with TF-IDF and Naive Bayes

This notebook demonstrates multilabel text classification, using TF-IDF for feature extraction and Naive Bayes for classification. The dataset is a CSV file of preprocessed Codeforces problem descriptions, each paired with a list of tags.


## Libraries and Dependencies

In [314]:
import pandas as pd
import torch
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_recall_curve, roc_curve, roc_auc_score

## Loading the Data
Read the CSV file containing the data.

In [348]:
df = pd.read_csv('codeforce_processed_cleaned_data_.csv')
df.head()

| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | time_limit | memory_limit | input_file | output_file | description | tags | language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1846/F | 1 second | 256 megabytes | standard | standard | interactive task rudolph a scientist study ali... | ['constructive algorithms', 'implementation', ... | en |
| 1 | 1 | 1847/D | 2 seconds | 256 megabytes | standard | standard | josuke tire peaceful life morioh follow nephew... | ['data structures', 'dsu', 'greedy', 'implemen... | en |
| 2 | 2 | 1846/E2 | 2 seconds | 256 megabytes | standard | standard | hard version problem difference version $$$ n ... | ['binary search', 'brute force', 'data structu... | en |
| 3 | 3 | 1846/E1 | 2 seconds | 256 megabytes | standard | standard | a simple version problem difference version $$... | ['brute force', 'implementation', 'math'] | en |
| 4 | 4 | 1846/C | 1 second | 256 megabytes | standard | standard | rudolf register a program competition follow r... | ['constructive algorithms', 'greedy', 'impleme... | en |


## TF-IDF Vectorization
Use `TfidfVectorizer` to convert the text descriptions into numerical features, keeping only the 1,000 most frequent terms (`max_features=1000`).

In [349]:
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(df['description'])
X.shape

(7950, 1000)

## Preprocessing the Tags
Convert the `tags` column from strings to lists.

In [350]:
import ast

# Each cell holds a stringified Python list, e.g. "['greedy', 'math']".
# str() guards against non-string cells before parsing.
df['tags'] = [ast.literal_eval(str(tag)) for tag in df['tags']]
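
As a small illustration of what `ast.literal_eval` does to a single cell, the sketch below parses one stringified list. The value of `raw_tags` is a hypothetical example modeled on the rows shown earlier.

```python
import ast

# Hypothetical raw cell value, as it would look after pd.read_csv.
raw_tags = "['brute force', 'implementation', 'math']"

# literal_eval parses Python literals only; it does not execute arbitrary code.
parsed_tags = ast.literal_eval(raw_tags)

print(type(parsed_tags).__name__)  # list
print(parsed_tags)
```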

## Binarizing the Tags
Use `MultiLabelBinarizer` to convert the tag lists into a binary indicator matrix with one column per distinct tag.

In [351]:
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['tags'])
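
The sketch below shows the indicator matrix on three made-up tag lists. The names `toy_tags`, `toy_mlb`, and `toy_y` are illustrative assumptions.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists standing in for df['tags'].
toy_tags = [["greedy", "math"], ["dp"], ["dp", "greedy"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_tags)

print(toy_mlb.classes_)  # columns are the sorted distinct tags
print(toy_y)             # one row per sample, 1 where the tag is present

# inverse_transform maps indicator rows back to tuples of tag names.
print(toy_mlb.inverse_transform(toy_y))
```

Column order follows `classes_`, which is why the notebook later passes `mlb.classes_` as `target_names` when printing the report.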

## Train-Test Split
Split the data into training (80%) and test (20%) sets, using a fixed random seed for reproducibility.

In [352]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
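
With 7,950 rows and `test_size=0.2`, about 1,590 samples end up in the test set. The following sketch, on hypothetical toy arrays, shows that `train_test_split` accepts a sparse feature matrix together with a dense label matrix and keeps their rows aligned.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Hypothetical data: row i of toy_X is [2i, 2i+1], and its label is i % 2.
toy_X = csr_matrix(np.arange(20).reshape(10, 2))
toy_y = (np.arange(10) % 2).reshape(10, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows
print(y_tr.shape, y_te.shape)
```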

## Model Training
Train a `MultinomialNB` classifier wrapped in `OneVsRestClassifier`, which fits one independent binary Naive Bayes model per tag.

In [353]:
nb = MultinomialNB()

classifier = OneVsRestClassifier(nb)
classifier.fit(X_train, y_train)
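
A minimal sketch of the one-model-per-label behaviour, using made-up count features and two hypothetical labels. The arrays and the name `toy_clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical non-negative features: 4 samples, 3 terms.
toy_X = np.array([[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]])
# Two labels; the second sample carries both.
toy_y = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])

toy_clf = OneVsRestClassifier(MultinomialNB())
toy_clf.fit(toy_X, toy_y)

print(len(toy_clf.estimators_))            # one fitted MultinomialNB per label
print(toy_clf.predict_proba(toy_X).shape)  # (n_samples, n_labels)
```

Because each label has its own binary model, the probabilities in a row are independent and need not sum to 1.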

## Predictions
Make predictions on the test set.

In [354]:
y_pred = classifier.predict(X_test)
y_pred_prob = classifier.predict_proba(X_test)

Ensure that every description receives at least one predicted tag: when no tag is predicted for a sample, fall back to the most probable one.

In [355]:
for i in range(y_pred.shape[0]):
    if not y_pred[i].any():
        # No tag was predicted: assign the most probable one,
        # reusing the probabilities already computed in y_pred_prob
        y_pred[i, np.argmax(y_pred_prob[i])] = 1
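
The same fallback can be written without a Python loop. This is a sketch on hypothetical stand-ins for `y_pred` and `y_pred_prob`, not a change to the notebook's results.

```python
import numpy as np

# Hypothetical stand-ins for y_pred and y_pred_prob (3 samples, 4 labels).
toy_pred = np.array([[0, 1, 0, 0],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])
toy_prob = np.array([[0.10, 0.70, 0.20, 0.30],
                     [0.20, 0.10, 0.40, 0.30],
                     [0.05, 0.30, 0.10, 0.20]])

# Indices of rows with no predicted label.
empty_rows = np.flatnonzero(toy_pred.sum(axis=1) == 0)

# For each empty row, switch on the label with the highest probability.
toy_pred[empty_rows, toy_prob[empty_rows].argmax(axis=1)] = 1

print(toy_pred)
```

Rows that already had a prediction are left untouched. Only the two empty rows gain a single label each.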

## Evaluation
Evaluate the model with per-tag precision, recall, and F1-score, followed by overall and per-tag accuracy.

In [356]:
print(classification_report(y_test, y_pred, target_names=mlb.classes_))

                           precision    recall  f1-score   support

                 *special       0.00      0.00      0.00        47
                    2-sat       0.00      0.00      0.00         3
            binary search       1.00      0.01      0.01       175
                 bitmasks       1.00      0.02      0.04        91
              brute force       0.00      0.00      0.00       242
chinese remainder theorem       0.00      0.00      0.00         3
            combinatorics       0.67      0.02      0.04       106
  constructive algorithms       0.77      0.08      0.15       289
          data structures       0.67      0.13      0.22       286
          dfs and similar       0.54      0.36      0.43       146
       divide and conquer       0.00      0.00      0.00        33
                       dp       0.38      0.05      0.09       320
                      dsu       0.00      0.00      0.00        68
       expression parsing       0.00      0.00      0.00     

  _warn_prf(average, modifier, msg_start, len(result))
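
The `_warn_prf` line above is the tail of scikit-learn's `UndefinedMetricWarning`, raised when a label has no predicted samples and its precision is therefore undefined. A minimal sketch of making that choice explicit with the `zero_division` parameter, on hypothetical toy labels `a` and `b`:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical data: label "b" occurs in the truth but is never predicted.
toy_true = np.array([[1, 0], [1, 1], [0, 1]])
toy_pred = np.array([[1, 0], [1, 0], [0, 0]])

# zero_division=0 reports undefined precision as 0.0 without a warning.
report = classification_report(
    toy_true,
    toy_pred,
    target_names=["a", "b"],
    zero_division=0,
    output_dict=True,
)

print(report["a"]["precision"], report["a"]["recall"])
print(report["b"]["precision"], report["b"]["recall"])
```

The rows of zeros in the report above have the same cause: those tags are rarely or never predicted on the test set.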


In [357]:
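# For multilabel targets, accuracy_score is subset accuracy:
# a sample counts as correct only if every one of its labels matches.
accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy * 100:.2f}%")

# Per-tag accuracy treats each tag as a separate binary problem.
# It is dominated by true negatives, so rare tags score high
# even when they are never predicted.
class_accuracies = {}
for i, class_name in enumerate(mlb.classes_):
    class_accuracy = accuracy_score(y_test[:, i], y_pred[:, i])
    class_accuracies[class_name] = class_accuracy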

# Print accuracy for each class
print("Accuracy for each class:")
for class_name, class_accuracy in class_accuracies.items():
    print(f"{class_name}: {class_accuracy * 100:.2f}%")

Overall Accuracy: 7.61%
Accuracy for each class:
*special: 97.04%
2-sat: 99.81%
binary search: 89.06%
bitmasks: 94.40%
brute force: 84.72%
chinese remainder theorem: 99.81%
combinatorics: 93.40%
constructive algorithms: 82.89%
data structures: 83.21%
dfs and similar: 91.32%
divide and conquer: 97.92%
dp: 79.25%
dsu: 95.72%
expression parsing: 99.50%
fft: 99.18%
flows: 98.81%
games: 98.18%
geometry: 96.67%
graph matchings: 98.87%
graphs: 89.94%
greedy: 72.64%
hashing: 97.74%
implementation: 68.62%
interactive: 98.36%
math: 72.45%
matrices: 98.81%
meet-in-the-middle: 99.62%
number theory: 92.01%
probabilities: 97.67%
schedules: 99.81%
shortest paths: 96.79%
sortings: 89.50%
string suffix structures: 99.25%
strings: 94.21%
ternary search: 99.37%
trees: 93.90%
two pointers: 94.40%
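
The gap between the 7.61% overall accuracy and the high per-tag accuracies comes from how each number is computed. The sketch below contrasts subset accuracy with two other standard multilabel metrics, Hamming loss and micro-averaged F1, on hypothetical toy arrays. These extra metrics are offered as an illustration and were not part of the original evaluation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical truth and predictions: 3 samples, 4 labels.
toy_true = np.array([[1, 0, 0, 1],
                     [0, 1, 0, 0],
                     [1, 1, 0, 0]])
toy_pred = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 0]])

# Subset accuracy: fraction of rows where every label matches.
subset_acc = accuracy_score(toy_true, toy_pred)

# Hamming loss: fraction of individual label cells that are wrong.
ham = hamming_loss(toy_true, toy_pred)

# Micro F1: pools true/false positives and negatives over all labels.
micro_f1 = f1_score(toy_true, toy_pred, average="micro", zero_division=0)

# Per-label accuracy for the third label, which is never true or predicted.
unused_label_acc = accuracy_score(toy_true[:, 2], toy_pred[:, 2])

print(f"subset accuracy: {subset_acc:.3f}")
print(f"hamming loss:    {ham:.3f}")
print(f"micro F1:        {micro_f1:.3f}")
print(f"accuracy of an unused label: {unused_label_acc:.3f}")
```

Here only one of three rows matches exactly, yet a label that is never present and never predicted scores perfect accuracy, mirroring the pattern seen for rare tags such as `2-sat` above.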
