In [4]:
!pip install sentencepiece -q
!pip install transformers datasets pandas scikit-learn -q
!pip install accelerate -U -q
!pip install sacremoses -q

**These lines import necessary libraries including Pandas for data manipulation, PyTorch for deep learning, components from the Transformers library for using FlauBERT, and scikit-learn tools for data preprocessing and dataset splitting.**

In [None]:
import pandas as pd
import torch
from transformers import FlaubertTokenizer, FlaubertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
# Load the training data
training_data_path = '/kaggle/input/aaaaaaaaa/augmented_training_data2(2).csv'
training_data = pd.read_csv(training_data_path)

In this section of the code, we are performing several key preprocessing steps to prepare the dataset for training with the FlauBERT model:

Tokenizer Initialization:

* We initialize the FlauBERT tokenizer, specifically the 'flaubert/flaubert_large_cased' variant. This tokenizer processes French text: it splits each raw sentence into subword tokens and maps every token to an integer ID, which is the numerical input the FlauBERT model works with.

Encoding Difficulty Levels:

* We use scikit-learn's LabelEncoder to transform the difficulty labels in the dataset from textual to numerical form. Since machine learning models work with numbers, encoding categorical labels as integers is an essential step. This process assigns a unique integer to each level of difficulty.

Tokenizing Sentences and Encoding Labels:

* The sentences from the dataset are tokenized using the initialized FlauBERT tokenizer. Each sentence is split into tokens, and the batch is brought to a uniform length: `truncation=True` with `max_length=512` cuts any sentence longer than 512 tokens (the model's maximum input length), and `padding=True` pads shorter sentences up to the longest sentence in the batch.
We also convert the encoded difficulty labels into a list, so that each tokenized sentence is paired with its label. This pairing is what supervised learning requires: each input (tokenized sentence) has a corresponding output label (difficulty level).

Through these steps, we are ensuring that the data is in the correct format and ready for training with the FlauBERT model, setting the stage for effective machine learning on language data.
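
To make the label-encoding step concrete, here is a minimal sketch on made-up labels. The list `difficulties` is hypothetical; in the notebook the values come from the `difficulty` column. It shows that `LabelEncoder` stores the classes in sorted order and maps each label to its index in that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration; the real ones come from the dataset.
difficulties = ["B2", "A1", "C1", "A1", "A2"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(difficulties)

# Classes are kept in sorted order; each label becomes its index in that list.
print(list(encoder.classes_))  # ['A1', 'A2', 'B2', 'C1']
print(encoded.tolist())        # [2, 0, 3, 0, 1]
```

The number of entries in `encoder.classes_` is also what later determines how many output classes the model needs.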

In [None]:
# Initialize the FlauBERT tokenizer
tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_large_cased')

# Encode the difficulty levels
label_encoder = LabelEncoder()
training_data['encoded_labels'] = label_encoder.fit_transform(training_data['difficulty'])

# Tokenize the sentences and encode the labels
train_encodings = tokenizer(training_data['sentence'].tolist(), truncation=True, padding=True, max_length=512)
train_labels = training_data['encoded_labels'].tolist()

In this portion of the code, we are defining a custom dataset class named `FrenchDifficultyDataset`, which is tailored for use with PyTorch, particularly for handling the data we've prepared for our FlauBERT model:

1. **Defining the Dataset Class:**
   - We create a class `FrenchDifficultyDataset` that inherits from `torch.utils.data.Dataset`. This class is specifically designed to handle our tokenized sentences and their corresponding difficulty labels.

2. **Initialization Method (`__init__`):**
   - In the initializer (`__init__`), we take two arguments: `encodings` and `labels`. The `encodings` are the tokenized representations of our sentences, and `labels` are the corresponding difficulty levels that we have encoded earlier.
   - We assign these `encodings` and `labels` to instance variables within the class so that they can be accessed by other methods in the class.

3. **Get Item Method (`__getitem__`):**
   - The `__getitem__` method is defined to facilitate the retrieval of data samples by index. For a given index `idx`, this method returns a dictionary where each key-value pair corresponds to input features and their values, along with the associated label for that data point.
   - We convert each item in the `encodings` and the `labels` to PyTorch tensors. This conversion is essential because PyTorch models expect data in the form of tensors.

4. **Length Method (`__len__`):**
   - The `__len__` method returns the total number of samples in the dataset. This is simply the length of the `labels` list, as each label corresponds to one encoded sentence.

By defining this `FrenchDifficultyDataset` class, we are effectively packaging our preprocessed data (both the input encodings and the output labels) into a format that is compatible with the PyTorch framework, particularly for use in training and evaluation loops. This class will enable us to seamlessly integrate our dataset with PyTorch's data loading and batching utilities.
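
The following sketch exercises the same dataset pattern on toy data, to show what one item and one batch look like. The class name `ToyDifficultyDataset`, the token IDs, and the labels are all made up for illustration; only the structure mirrors the class described above.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDifficultyDataset(Dataset):
    """Illustrative copy of the dataset pattern, fed with made-up values."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One tensor per tokenizer field, plus the label for this sample.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Hypothetical, already padded encodings for three sentences.
toy_encodings = {
    "input_ids": [[5, 8, 9, 0], [5, 7, 0, 0], [5, 6, 4, 3]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]],
}
toy_labels = [0, 2, 1]

dataset = ToyDifficultyDataset(toy_encodings, toy_labels)
first = dataset[0]

# The default collate function stacks the per-item tensors into a batch.
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch = next(iter(loader))
print(first)
print(batch["input_ids"].shape, batch["labels"])
```

Each item is a dictionary of tensors, and the `DataLoader` stacks those dictionaries field by field, which is the input layout the model's forward pass accepts.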

In [None]:
# Prepare the dataset
class FrenchDifficultyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


In this part of the code, we are splitting our prepared dataset into training and validation sets, an essential step in the machine learning workflow:

1. **Creating the Dataset Instance:**
   - First, we instantiate our custom `FrenchDifficultyDataset` class using `train_encodings` and `train_labels`. This dataset includes the tokenized sentences and their corresponding difficulty levels. The `FrenchDifficultyDataset` object encapsulates our data in a format that's compatible with PyTorch, facilitating easier data handling during model training.

2. **Splitting the Dataset:**
   - We then use the `train_test_split` function from scikit-learn to divide our dataset into training and validation sets. Holding out part of the data lets us measure the model's performance on examples it did not see during training.
   - By setting `test_size=0.1`, we allocate 10% of our data for validation and the remaining 90% for training. The validation set is crucial for tuning the model and checking for issues like overfitting, where the model performs well on the training data but poorly on new, unseen data.

By completing this step, we ensure that we have a well-defined training set to fit our model and a separate validation set to evaluate its performance. This division is critical for developing robust machine learning models that generalize well to new data.
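
As a small illustration of the 90/10 split, the sketch below applies `train_test_split` to a toy list of samples. The `samples` list and `random_state=42` are assumptions added here (the seed only makes the split reproducible); the notebook itself does not fix a seed.

```python
from sklearn.model_selection import train_test_split

# Twenty hypothetical samples standing in for the tokenized dataset items.
samples = [{"input_ids": [i], "labels": i % 2} for i in range(20)]

# random_state is an assumption for reproducibility, not part of the notebook.
train_items, val_items = train_test_split(samples, test_size=0.1, random_state=42)

print(len(train_items), len(val_items))  # 18 2
print(type(train_items).__name__)        # list
```

Note that the split returns plain Python lists of items rather than dataset objects, and every sample ends up in exactly one of the two lists.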

In [None]:
# Split the training data
train_dataset, val_dataset = train_test_split(FrenchDifficultyDataset(train_encodings, train_labels), test_size=0.1)

In this line of code, we're setting up the FlauBERT model for our specific sequence classification task:

1. **Loading the FlauBERT Model:**
   - We use the `FlaubertForSequenceClassification.from_pretrained` method to load a pre-trained FlauBERT model. This method is ideal for loading models that have been pre-trained on a large corpus of data and are well-suited for fine-tuning on specific tasks like ours.
   - The model variant we choose is `'flaubert/flaubert_large_cased'`, a large FlauBERT model that preserves the case (uppercase/lowercase) of the input text.

2. **Configuring for Sequence Classification:**
   - We specify `num_labels=len(label_encoder.classes_)` in the model's configuration. This tells the model the number of distinct labels (or classes) it needs to predict. The `num_labels` should match the number of different difficulty levels in our dataset, which we determine by the length of `label_encoder.classes_`.
   - By setting the `num_labels`, we are effectively tailoring the FlauBERT model for our classification task, ensuring it outputs predictions corresponding to our encoded difficulty levels.

Through this process, we configure the FlauBERT model to understand the specific requirements of our task, making it ready for subsequent training with our dataset. This step is crucial in adapting powerful, pre-trained models to specialized tasks with relatively less effort and time compared to training a model from scratch.
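
To show what `num_labels` controls without downloading the pre-trained weights, the sketch below builds a tiny, randomly initialized FlauBERT classifier from a config. The sizes (`vocab_size=100`, `emb_dim=32`, one layer, two heads) and `num_labels = 6` are arbitrary assumptions chosen to keep the example fast; they do not correspond to `flaubert_large_cased`.

```python
import torch
from transformers import FlaubertConfig, FlaubertForSequenceClassification

num_labels = 6  # hypothetical number of difficulty levels

# A deliberately tiny configuration; weights are random, not pre-trained.
config = FlaubertConfig(
    vocab_size=100,
    emb_dim=32,
    n_layers=1,
    n_heads=2,
    num_labels=num_labels,
)
tiny_model = FlaubertForSequenceClassification(config)
tiny_model.eval()

# Two fake sentences of seven token IDs each.
input_ids = torch.randint(low=5, high=100, size=(2, 7))
with torch.no_grad():
    logits = tiny_model(input_ids=input_ids).logits

print(logits.shape)  # one row per sentence, one column per label
```

The classification head produces one logit per label for each input sentence, which is why `num_labels` has to match the number of encoded difficulty levels.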

In [None]:
# Load the FlauBERT model
model = FlaubertForSequenceClassification.from_pretrained('flaubert/flaubert_large_cased', num_labels=len(label_encoder.classes_))

In this section of the code, we're defining the settings and hyperparameters for training our FlauBERT model:

1. **Setting Up Training Arguments:**
   - We use the `TrainingArguments` class from the Hugging Face Transformers library to configure various aspects of the training process. This configuration will be passed to the Trainer object later on.

2. **Configuration Details:**
   - `output_dir='./results'`: We specify a directory where the training outputs (like model checkpoints) will be saved. This is useful for keeping track of training results and for potential model recovery in case of interruptions.
   - `num_train_epochs=3`: This sets the number of training epochs. An epoch is one complete pass through the entire training dataset. We choose to train for three epochs.
   - `per_device_train_batch_size=8`: This determines the batch size for training on each device (like a GPU or CPU). A smaller batch size can help reduce memory usage but might affect training speed and convergence.
   - `warmup_steps=500`: Warmup steps are used to gradually ramp up the learning rate at the beginning of training. This can help in stabilizing the training process and is often beneficial for fine-tuning.
   - `weight_decay=0.01`: This is a regularization parameter that helps prevent the model from overfitting to the training data. It adds a penalty for larger weights in the model.
   - `logging_dir='./logs'`: Specifies where to save logs generated during training. This is helpful for monitoring the training process and debugging.
   - `logging_steps=10`: Determines how often to log training information. In this case, we log every 10 steps.
   - `save_strategy="no"`: We disable checkpoint saving during training altogether, which saves disk space and avoids the time spent writing checkpoints.
   - `save_steps=1e9`: Sets a very high number of steps between checkpoint saves. This value only applies when `save_strategy="steps"`, so with `save_strategy="no"` it is redundant and acts as an extra safeguard at most.

By configuring these training arguments, we tailor the training process to our specific needs and computational constraints. These settings are crucial for efficient and effective training of the model on our dataset.
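
The interaction between `warmup_steps`, the batch size, and the number of epochs is easier to see with a little arithmetic. The sketch below assumes a linear warmup followed by linear decay, the `Trainer` default learning rate of 5e-5, a single device, and a hypothetical training-set size of 4,320 sentences (the real size is not stated here). The helper `linear_schedule_lr` is written for illustration and is not a library function.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples = 4320  # hypothetical training-set size
batch_size = 8       # per_device_train_batch_size, assuming one device
epochs = 3           # num_train_epochs
base_lr = 5e-5       # default learning rate when none is set
warmup_steps = 500

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

for step in (0, 250, 500, 1000, total_steps):
    lr = linear_schedule_lr(step, base_lr, warmup_steps, total_steps)
    print(step, lr)
```

Under these assumptions the learning rate only reaches its peak after 500 of 1,620 steps, so on a small dataset a fixed `warmup_steps=500` can take up a large share of training.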

In [None]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="no",  # Disable checkpoint saving during training
    save_steps=1e9  # Only used with save_strategy="steps"; redundant here
)

In this part of the code, we are initializing and conducting the training process for our FlauBERT model using the Hugging Face `Trainer` API:

1. **Initializing the Trainer:**
   - We create an instance of the `Trainer` class, which is a part of the Hugging Face Transformers library. This class simplifies the training process by abstracting many of the complex steps involved in training deep learning models.
   - We pass the previously configured FlauBERT model to the `Trainer` as the `model` parameter.
   - The `training_args` we defined earlier are also passed to the `Trainer`. These arguments provide the Trainer with our specific training configurations like the number of epochs, batch size, and logging details.
   - `train_dataset` is provided as the dataset for training. This is the dataset that the model will learn from.
   - `eval_dataset` is specified for evaluation. With the `TrainingArguments` defined above, no evaluation strategy is set, so this dataset is only used when evaluation is requested explicitly (for example by calling `trainer.evaluate()`).

2. **Training the Model:**
   - We call the `train()` method on the Trainer instance. This method starts the training process of our model on the specified training dataset.
   - During training, the model learns to classify the difficulty level of French sentences by adjusting its internal parameters based on the error between its predictions and the actual labels.
   - The training loss is logged as per the settings defined in `training_args`. Evaluation on the validation dataset does not run automatically during training unless an evaluation strategy is configured in `training_args`.

By using the `Trainer` API, we streamline the training process, making it more manageable and less error-prone. The API handles many underlying details like batch processing, gradient calculations, and model evaluations, allowing us to focus more on fine-tuning the training configuration and interpreting the results.
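
If a validation metric is wanted, one common approach is to give the `Trainer` a metric function through its `compute_metrics` argument. The sketch below is an assumption about how that could look, not part of the notebook: it defines an accuracy function and checks it on made-up logits and labels.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), the pair the Trainer hands over."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for four sentences and three classes.
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 0.1, 1.5],
    [0.9, 1.1, 0.0],
    [0.1, 0.2, 0.3],
])
toy_labels = np.array([0, 2, 0, 2])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would make `trainer.evaluate()` report this accuracy alongside the loss. Evaluating during training additionally requires an evaluation strategy in `TrainingArguments`; depending on the installed Transformers version, that parameter is named `evaluation_strategy` or `eval_strategy`.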

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

In this part of the code, we save the trained model and load the unlabelled test data:

1. **Saving the Trained Model:**
   - After training the FlauBERT model, we use the `save_pretrained` method to save it. This method ensures that all the model's parameters and configurations are stored correctly.
   - We specify the path `/kaggle/working/flaubertlevrai-finetuned` as the location to save the model. This path is particularly relevant in the Kaggle environment, where `/kaggle/working` is a common directory used for output files. The model can be accessed from this directory for future use, like making predictions or further fine-tuning.
   - The saved model includes the fine-tuned weights that have been adjusted to our specific task of predicting the difficulty level of French sentences.

2. **Loading Unlabelled Test Data:**
   - We load an unlabelled test dataset from the specified CSV file located at `/kaggle/input/trainin/unlabelled_test_data.csv`. This dataset is expected to contain French sentences for which we want to predict the difficulty levels.
   - The data is loaded into a Pandas DataFrame `unlabelled_test_data` using the `read_csv` method. This DataFrame will be used to prepare the data for making predictions with our trained model.

In summary, these steps complete the training side of the workflow. Saving the trained model allows us to reuse it without retraining, and loading the unlabelled test data prepares us for the next step: making predictions. Because the test data has no labels, the model's accuracy cannot be measured on it directly.

In [None]:
# Save the model
model.save_pretrained('/kaggle/working/flaubertlevrai-finetuned')

# Load the unlabelled test data
unlabelled_test_data_path = '/kaggle/input/trainin/unlabelled_test_data.csv'
unlabelled_test_data = pd.read_csv(unlabelled_test_data_path)

In this segment of the code, we're processing the unlabelled test data to make it compatible with our trained FlauBERT model, and we're preparing a PyTorch dataset for the test data:

1. **Preprocessing the Test Data:**
   - We use the previously initialized FlauBERT tokenizer to process the sentences in the unlabelled test dataset. This involves tokenizing the sentences, similar to how we processed our training data.
   - The tokenizer is called with `truncation=True`, `padding=True`, and `max_length=512`: sentences longer than 512 tokens are truncated, and shorter ones are padded to the length of the longest sentence in the batch, so all tokenized outputs have the same length. This uniformity is essential for the model to process the data correctly.

2. **Creating the Test Dataset Class:**
   - We define a custom class `FrenchTestDataset` that inherits from `torch.utils.data.Dataset`. This class is tailored to handle the tokenized test data for PyTorch.
   - The `__init__` method takes the tokenized data (`encodings`) as input and stores it in an instance variable.
   - The `__getitem__` method allows us to retrieve a single tokenized instance from the dataset by index. This method will be used by PyTorch to iterate over the dataset during the prediction phase.
   - The `__len__` method returns the total number of samples in the dataset, which is determined by the length of the `input_ids` in the encodings.

3. **Instantiating the Test Dataset:**
   - We create an instance of the `FrenchTestDataset` class using the `test_encodings`. This instance, `test_dataset`, is a PyTorch-compatible dataset containing our preprocessed test sentences.

By preparing the test dataset in this manner, we ensure that our unlabelled data is in the correct format for making predictions with the trained FlauBERT model. This step is required before the model can be applied to new, unseen data.
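
The way truncation and padding combine can be sketched in plain Python. The helper `pad_to_longest` below is a simplified stand-in written for illustration, not the tokenizer's implementation: it ignores special tokens, and `pad_id=0` is an arbitrary placeholder rather than FlauBERT's actual padding ID.

```python
def pad_to_longest(token_id_lists, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three made-up sequences; max_length=4 keeps the example small.
batch = pad_to_longest([[11, 12, 13], [21], [31, 32, 33, 34, 35, 36]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The longest sequence is cut to `max_length`, and the others are padded only up to the longest remaining length, so the padded width can be smaller than `max_length` when all sentences are short.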

In [None]:
# Preprocess the test data
test_encodings = tokenizer(unlabelled_test_data['sentence'].tolist(), truncation=True, padding=True, max_length=512)

# Prepare the test dataset
class FrenchTestDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

test_dataset = FrenchTestDataset(test_encodings)

In this step of our machine learning workflow, we use our trained FlauBERT model to make predictions on the unlabelled test dataset and then decode these predictions back into human-readable labels:

1. **Running Predictions:**
   - We utilize the `predict` method of our `Trainer` object to perform predictions on the `test_dataset`. This method efficiently processes the dataset through our trained FlauBERT model to generate predictions.
   - Our `test_dataset` is comprised of tokenized sentences from the unlabelled test data, formatted specifically to be compatible with the FlauBERT model.

2. **Decoding the Predictions:**
   - The model’s predictions are initially logits for each class (difficulty level). We use the `argmax` function to select the most probable class for each sentence.
   - We then apply the `inverse_transform` method of the `LabelEncoder` to translate these numerical predictions back into their original categorical labels (like "A1", "B2", etc.). This step is crucial as it converts the model's output into an interpretable form.

By doing this, we are able to not only predict the difficulty levels of new sentences but also interpret these predictions in a meaningful way. This is an essential part of deploying a machine learning model, where we transform its numerical outputs into actionable insights.
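
The decoding step can be reproduced on made-up numbers. In the sketch below, the six class names and the `logits` array are hypothetical; the point is only the `argmax` followed by `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical set of difficulty levels.
encoder = LabelEncoder().fit(["A1", "A2", "B1", "B2", "C1", "C2"])

# Made-up logits: one row per sentence, one column per class.
logits = np.array([
    [0.1, 0.2, 2.5, 0.3, 0.0, -1.0],
    [3.0, 0.5, 0.1, 0.0, -0.2, -0.5],
    [-1.0, 0.0, 0.2, 0.4, 0.9, 1.7],
])

class_ids = logits.argmax(axis=-1)             # index of the largest logit per row
labels = encoder.inverse_transform(class_ids)  # back to the original strings

print(class_ids.tolist())  # [2, 0, 5]
print(labels.tolist())     # ['B1', 'A1', 'C2']
```

This only works if the encoder used for decoding is the same one that was fitted on the training labels, since the integer-to-label mapping is stored in that object.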

In [None]:
# Run prediction
predictions = trainer.predict(test_dataset)
predicted_labels = label_encoder.inverse_transform(predictions.predictions.argmax(-1))

In this final stage of our process, we're focusing on saving the predictions made by our model back to the unlabelled test dataset and then exporting this enriched dataset to a CSV file:

1. **Appending Predictions to Test Data:**
   - We add a new column, `'difficulty'`, to our `unlabelled_test_data` DataFrame. This column is filled with the `predicted_labels` we obtained from our model.
   - By doing this, each sentence in the test dataset is now associated with a predicted difficulty level, effectively combining our original unlabelled data with the insights gained from the model.

2. **Preparing the Data for Export:**
   - We create a new DataFrame `una` by dropping the `'sentence'` column from `unlabelled_test_data`, so that only the remaining columns and the predicted difficulty are exported.
   - Dropping the sentence column helps in cases where we only need the model's output for further analysis or reporting.

3. **Exporting to CSV:**
   - We use the `to_csv` method to save the DataFrame `una` to a CSV file. This file is named `'letraduit.csv'` and is saved to the `/kaggle/working` directory, which is a standard directory for output files on Kaggle.
   - The `index=False` parameter is used to prevent pandas from writing row indices into the CSV file, ensuring that the file contains only the data columns.

Through these steps, we're not only able to generate and understand the model's predictions but also export these results in a structured and accessible format. This is crucial for sharing our findings, conducting further analysis, or integrating them into larger systems or reports.
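
The export step can be tried on a toy DataFrame. The sketch below assumes the test file has an `id` column next to `sentence`; that column, the sentences, and the predicted labels are all made up, and the CSV is written to an in-memory buffer instead of `/kaggle/working`.

```python
import io

import pandas as pd

# Hypothetical test data; the 'id' column is an assumption.
test_df = pd.DataFrame({
    "id": [0, 1, 2],
    "sentence": ["Bonjour.", "Il pleut.", "Quoi qu'il en soit."],
})
predicted = ["A1", "A1", "B2"]  # made-up predictions

test_df["difficulty"] = predicted
submission = test_df.drop("sentence", axis=1)

# Write to a string buffer so the example leaves no file behind.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With `index=False`, the file contains only the remaining columns and the new `difficulty` column, one row per test sentence.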

In [3]:
# Save the predictions to the unlabelled test data
unlabelled_test_data['difficulty'] = predicted_labels
una = unlabelled_test_data.drop('sentence', axis=1)
una.to_csv('/kaggle/working/letraduit.csv', index=False)




Some weights of FlaubertForSequenceClassification were not initialized from the model checkpoint at flaubert/flaubert_large_cased and are newly initialized: ['sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Step | Training Loss |
|------|---------------|
| 10 | 2.5645 |
| 20 | 2.4087 |
| 30 | 2.1824 |
| 40 | 2.1809 |
| 50 | 1.949 |
| 60 | 2.1569 |
| 70 | 1.9757 |
| 80 | 1.9634 |
| 90 | 1.8688 |
| 100 | 1.8445 |
