<a href="https://colab.research.google.com/drive/1a8sYlJ2SO7MIdelmYEqcXgXF5PtA1F6p#scrollTo=Nyv_XJIYmb10" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AkshatSurolia**

We used a pre-trained model from the Hugging Face model hub, 'AkshatSurolia/ICD-10-Code-Prediction'. The model is based on the BERT architecture, whose contextual word representations transfer well to many language tasks. It was originally trained to predict ICD-10 medical codes, a specialized text classification task.

We adapted this model to our own task, classifying texts into difficulty levels. The pre-trained weights gave us a starting point, which we then fine-tuned on our dataset. This is a case of repurposing a pre-trained model beyond its original training objective: the language representations learned for one classification task are reused for another.
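
Fine-tuning for a new task means the output layer needs one logit per target class, so the class labels first have to be mapped to consecutive integer ids. The sketch below shows one way to build that mapping in plain Python. The label values (`A1` to `C2`) and the names `label2id` and `id2label` are assumptions for illustration; the notebook itself only refers to a `difficulty` column. Sorting the unique labels gives the same ordering that scikit-learn's `LabelEncoder` produces.

```python
# Hypothetical difficulty labels; the real values come from the 'difficulty' column.
difficulties = ["A1", "C2", "B1", "A1", "B2", "A2", "C1", "B1"]

# Sorted unique labels define the class ids (0 .. num_labels - 1).
classes = sorted(set(difficulties))
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

# Integer targets used for training, and the size the classifier head needs.
encoded = [label2id[label] for label in difficulties]
num_labels = len(classes)

print(encoded)
print(num_labels)
```

Under these assumptions, `num_labels` is the value a sequence classification head would need as its output size, and `id2label` is what turns predicted ids back into readable labels.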

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder

# Load the labelled training data, the unlabelled test data, and the
# sample submission file, dropping rows with missing values from each.
df_train = pd.read_csv("https://raw.githubusercontent.com/Oglo/Project-DSML/main/Data/training_data.csv").dropna()
df_test = pd.read_csv("https://raw.githubusercontent.com/Oglo/Project-DSML/main/Data/unlabelled_test_data.csv").dropna()
df_final = pd.read_csv("https://raw.githubusercontent.com/Oglo/Project-DSML/main/Data/sample_submission.csv").dropna()

In [None]:
import torch
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

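# Map the difficulty labels to integer class ids (0 .. n_classes - 1).
label_encoder = LabelEncoder()
df_train['encoded_labels'] = label_encoder.fit_transform(df_train['difficulty'])

# Use the tokenizer that ships with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("AkshatSurolia/ICD-10-Code-Prediction")

# Tokenize the sentences; anything longer than 64 tokens is truncated,
# and shorter sequences are padded to the longest one in each set.
train_encodings = tokenizer(df_train['sentence'].tolist(), truncation=True, padding=True, max_length=64)
test_encodings = tokenizer(df_test['sentence'].tolist(), truncation=True, padding=True, max_length=64)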

class DifficultyDataset(Dataset):
    """Wraps tokenizer output (and optional labels) for use with Trainer."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels  # None for the unlabelled test set

    def __getitem__(self, idx):
        # One tensor per tokenizer field (input_ids, attention_mask, ...).
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

train_dataset = DifficultyDataset(train_encodings, df_train['encoded_labels'].tolist())

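# The checkpoint's classification head was trained for ICD-10 codes, so load
# the encoder weights and re-initialise the head with one output per
# difficulty class. Without this, argmax could return ids the label encoder
# does not know.
model = BertForSequenceClassification.from_pretrained(
    "AkshatSurolia/ICD-10-Code-Prediction",
    num_labels=len(label_encoder.classes_),
    ignore_mismatched_sizes=True,
)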

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()

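# The test set has no labels, so only the encodings are passed.
test_dataset = DifficultyDataset(test_encodings)

# Take the highest-scoring class id per sentence and map it back to its label.
predictions = trainer.predict(test_dataset).predictions.argmax(-1)
predicted_labels = label_encoder.inverse_transform(predictions)

# df_final must have the same number of rows as df_test for this assignment to work.
df_final['difficulty'] = predicted_labels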
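
The last step turns the model's raw outputs into difficulty labels: `argmax` picks the highest-scoring class id for each sentence, and the label encoder maps ids back to label strings. Below is a minimal pure-Python sketch of that decoding, with made-up logits and an assumed class list (`A1` to `C2`). The helper `decode_predictions` is hypothetical and not part of the notebook.

```python
from typing import List, Sequence


def decode_predictions(
    logits: Sequence[Sequence[float]], classes: Sequence[str]
) -> List[str]:
    """Return the label with the highest logit for each row."""
    decoded = []
    for row in logits:
        # Every row must carry exactly one score per known class.
        if len(row) != len(classes):
            raise ValueError("each row needs one logit per class")
        # Index of the largest score (first one wins on ties).
        best_id = max(range(len(row)), key=row.__getitem__)
        decoded.append(classes[best_id])
    return decoded


classes = ["A1", "A2", "B1", "B2", "C1", "C2"]  # assumed label set
logits = [
    [2.1, 0.3, -1.0, -0.5, -2.0, -3.1],  # made-up scores
    [-1.2, -0.4, 0.2, 1.9, 0.7, -0.8],
]
labels = decode_predictions(logits, classes)
print(labels)
```

The length check illustrates why the number of model outputs should match the number of encoded classes: an id outside the known range cannot be mapped back to a label.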
