## Google Colab Authentication
This section authenticates the user in Google Colab so that the notebook can access Google Cloud resources. The step is required before interacting with Google Cloud Storage and other Google Cloud services.

In [1]:
# Authenticate in Google Colab to access Google Cloud resources.
from google.colab import auth
auth.authenticate_user()


## Installing Google Cloud Storage
Here we install the Google Cloud Storage client library, which is used to read and manipulate data stored in Google Cloud Storage. It lets us download the CSV file needed for the sentiment analysis.

In [2]:
# Install the Google Cloud Storage client library to work with data stored on GCP.
!pip install google-cloud-storage




## Loading and Preparing the Data
In this part, we load the data from Google Cloud Storage into a Pandas DataFrame and rename the columns to `text` and `categorie` for the sentiment analysis. This data is used to train and evaluate the BERT model.

In [3]:
# Load the data from Google Cloud Storage into a Pandas DataFrame and rename the columns.
from io import StringIO

import pandas as pd
from google.cloud import storage

# Create a client to interact with Google Cloud Storage
client = storage.Client()

# Set the name of your bucket and the path of the CSV file
bucket_name = 'bucket-tweetssentimentsanalyses'
source_blob_name = 'final_dataset2_80k.csv'

# Access the bucket
bucket = client.get_bucket(bucket_name)

# Access the file
blob = bucket.blob(source_blob_name)

# Download the content and parse it into a pandas DataFrame
# This assumes the CSV file is correctly formatted for pandas
content = blob.download_as_text()
df = pd.read_csv(StringIO(content))

# Rename the two columns to 'text' and 'categorie'
# Note: read_csv treats the first row as a header by default; if the file
# has no header row, pass header=None so the first record is not lost
df.columns = ['text', 'categorie']

# Display the first rows of the DataFrame
print(df.head())


                                               text  categorie
0                   yeah shopping haha shopping xxx          0
1               ikea somewhat disappointment wanted          0
2                    found grandpa put hospice care          0
3  working three weekends row four six weekends fml          0
4     sad say desperate times never know people may          0


## Installing the Dependencies for Sentiment Analysis with BERT
This section installs the libraries needed for sentiment analysis with BERT: `transformers`, `torch`, and scikit-learn (imported as `sklearn`). These libraries provide the tools for training and evaluating machine learning models.
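
Before installing anything, it can help to check which of these modules are already importable in the runtime. The sketch below uses only the standard library; `is_importable` is a hypothetical helper name, and the module list simply mirrors the imports used later in this notebook. Note that scikit-learn is distributed under the name `scikit-learn` but imported as `sklearn`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names used in this notebook (scikit-learn is imported as `sklearn`).
for name in ["transformers", "torch", "sklearn"]:
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

Only the modules reported as missing would then need a `pip install`.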

In [4]:
# Install the libraries needed for sentiment analysis with BERT.
!pip install transformers
!pip install torch
# The PyPI package is named scikit-learn; `pip install sklearn` is a deprecated
# placeholder whose metadata generation fails (sklearn-0.0.post12).
!pip install scikit-learn


## Preparing the Data for BERT
This part of the notebook prepares the text data for training with BERT. It covers tokenization and the construction of the training and validation datasets.
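
To make the tokenization step concrete, the following sketch shows what padding and truncation to a fixed length produce, independently of the `transformers` library. The function name `pad_or_truncate` and the example word-piece IDs are illustrative assumptions; the special-token IDs 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], then truncate or pad to exactly max_length tokens.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    # Reserve two positions for the special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Pad on the right up to max_length.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# Hypothetical word-piece IDs for a short and a long sentence.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside the input IDs during training.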

In [5]:
# Prepare and process the text data for training with BERT.
import torch
# torch.optim.AdamW replaces the deprecated transformers.AdamW
# (note: its default weight_decay is 0.01, not 0.0)
from torch.optim import AdamW
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split

# Replace missing texts with empty strings and make sure every entry is a string
df['text'] = df['text'].fillna("").astype(str)

# Select the text and label columns
texts = df['text'].values
categories = df['categorie'].values

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def encode(docs, doc_labels):
    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids = []
    attention_masks = []

    for doc in docs:
        encoded_dict = tokenizer.encode_plus(
            doc,
            add_special_tokens = True,
            max_length = 64,           # Truncate/pad every sequence to 64 tokens
            padding = 'max_length',    # Replaces the deprecated pad_to_max_length=True
            truncation = True,         # Truncate explicitly to avoid the tokenizer warning
            return_attention_mask = True,
            return_tensors = 'pt',     # PyTorch tensors
        )

        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(doc_labels)

    return input_ids, attention_masks, labels

input_ids, attention_masks, labels = encode(texts, categories)

# Split inputs, masks, and labels in a single call so the three stay aligned
(train_inputs, validation_inputs,
 train_masks, validation_masks,
 train_labels, validation_labels) = train_test_split(
    input_ids, attention_masks, labels, random_state=2018, test_size=0.1)

# Create the DataLoader for our training set
train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=32)

# Create the DataLoader for our validation set
validation_dataset = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_dataset)
validation_dataloader = DataLoader(validation_dataset, sampler=validation_sampler, batch_size=32)


## Training the BERT Model
The following cells configure, train, and evaluate the BERT model. This involves loading the pretrained model, setting the training parameters, running the training in batches, and evaluating the model on the validation set.

In [6]:
# Load the pretrained BERT model for sentiment analysis and move it to the GPU.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels = 2)  # Binary classification
model.cuda()



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

## Training, Evaluation, and Classification Report
In this last part, we set up the optimizer and the learning-rate scheduler, fine-tune the model for six epochs, and then use the trained BERT model to make predictions on the validation set and generate a classification report. This report helps us evaluate the model's performance in terms of precision, recall, and F1 score.

In [7]:
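# Set up the optimizer and the learning-rate scheduler for training.
# weight_decay=0.0 is explicit: the transformers AdamW defaults to 0.0, torch.optim.AdamW to 0.01.
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8, weight_decay = 0.0)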
epochs = 6
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
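
The schedule configured here scales the learning rate linearly: it rises from 0 to the base rate during warmup (skipped here, since `num_warmup_steps = 0`), then decays to 0 at the final step. The sketch below reproduces that multiplier in plain Python. `linear_lr_multiplier` is a hypothetical name, and the step count assumes roughly 72,000 training examples, as implied by the 8,000-example validation split (10% of the data), with a batch size of 32.

```python
def linear_lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


batches_per_epoch = 72_000 // 32          # assumed training-set size / batch size
total_steps = batches_per_epoch * 6       # 6 epochs, as in the cell above
base_lr = 2e-5

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_lr_multiplier(step, 0, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

With no warmup, the learning rate starts at its full value of 2e-5 and is halved by the midpoint of training.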




In [8]:
# Train the BERT model and measure validation accuracy after each epoch.
import numpy as np
from tqdm.notebook import tqdm

# Function for calculating the accuracy of predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Store the average loss after each epoch so we can plot them.
loss_values = []

# For each epoch...
for epoch in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, epochs))
    print('Training...')

    total_loss = 0

    # Set our model to training mode
    model.train()

    # For each batch of training data...
    # Wrap the dataloader (not enumerate) so tqdm knows the total number of batches.
    for step, batch in enumerate(tqdm(train_dataloader)):

        # Unpack this training batch from our dataloader.
        b_input_ids = batch[0].to('cuda')
        b_input_mask = batch[1].to('cuda')
        b_labels = batch[2].to('cuda')

        # Clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        loss = outputs.loss
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Update parameters and take a step using the computed gradient.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)

    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    print("")
    print("Average training loss: {0:.2f}".format(avg_train_loss))

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on our validation set.

    print("")
    print("Running Validation...")

    # Put the model in evaluation mode
    model.eval()

    # Tracking variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:

        # Add batch to GPU
        batch = tuple(t.to('cuda') for t in batch)

        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch

        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():

            # Forward pass, calculate logit predictions.
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        logits = outputs.logits

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))



======== Epoch 1 / 6 ========
Training...

Average training loss: 0.49

Running Validation...
  Accuracy: 0.78

======== Epoch 2 / 6 ========
Training...

Average training loss: 0.40

Running Validation...
  Accuracy: 0.78

======== Epoch 3 / 6 ========
Training...

Average training loss: 0.29

Running Validation...
  Accuracy: 0.78

======== Epoch 4 / 6 ========
Training...

Average training loss: 0.20

Running Validation...
  Accuracy: 0.78

======== Epoch 5 / 6 ========
Training...

Average training loss: 0.13

Running Validation...
  Accuracy: 0.77

======== Epoch 6 / 6 ========
Training...

Average training loss: 0.10

Running Validation...
  Accuracy: 0.77
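
The log above can be summarized programmatically to pick the epoch worth keeping. The sketch below copies the printed loss and accuracy values into lists (rounded to two decimals, as printed) and selects the earliest epoch with the highest validation accuracy. `best_epoch` is a hypothetical helper, not part of the notebook.

```python
train_loss = [0.49, 0.40, 0.29, 0.20, 0.13, 0.10]
val_accuracy = [0.78, 0.78, 0.78, 0.78, 0.77, 0.77]


def best_epoch(accuracies):
    """Return the 1-based index of the earliest epoch with the highest accuracy."""
    best = max(accuracies)
    return accuracies.index(best) + 1


epoch = best_epoch(val_accuracy)
print(f"Best validation accuracy {max(val_accuracy):.2f} first reached at epoch {epoch}")

# Training loss keeps falling while validation accuracy does not improve,
# which is the usual signature of overfitting.
loss_drop = train_loss[0] - train_loss[-1]
acc_change = val_accuracy[-1] - val_accuracy[0]
print(f"Loss change: -{loss_drop:.2f}, accuracy change: {acc_change:+.2f}")
```

Because the printed values are rounded, ties between epochs cannot be resolved from the log alone; the unrounded metrics would be needed for a finer comparison.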


In [11]:
from sklearn.metrics import classification_report

# Put model in evaluation mode to evaluate loss on the validation set
model.eval()

# Tracking variables
predictions , true_labels = [], []

# Predict
for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to('cuda') for t in batch)

    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch

    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    logits = outputs.logits

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

# Flatten the predictions and true values for aggregate classification report
flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = np.concatenate(true_labels, axis=0)

print(classification_report(flat_true_labels, flat_predictions))


              precision    recall  f1-score   support

           0       0.78      0.77      0.78      4067
           1       0.76      0.78      0.77      3933

    accuracy                           0.77      8000
   macro avg       0.77      0.77      0.77      8000
weighted avg       0.77      0.77      0.77      8000



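The model performs evenly across the two classes, with precision, recall, and F1-score all between 0.76 and 0.78, and an overall accuracy of 77% on the 8,000 validation examples. Note that the training loss fell from 0.49 to 0.10 over the six epochs while validation accuracy stayed at 0.77 to 0.78, which suggests that the model starts to overfit after the first epoch and that the additional epochs bring little benefit.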
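
For reference, the per-class metrics in the report follow directly from counts of true positives, false positives, and false negatives. The following sketch computes them from two label lists in plain Python; the function name `class_metrics` and the toy labels are illustrative assumptions, not the notebook's data.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with 8 labels (not taken from the notebook's data).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

for label in (0, 1):
    p, r, f = class_metrics(y_true, y_pred, positive=label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Computing the metrics once per class, as done here, is what produces the two rows of the classification report.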