<a href="https://colab.research.google.com/github/Gilbert-B/Natural-Language-Processing-Sentiment-Analysis-/blob/main/Sentiment_Analysis_BERT_Based_MODEL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Hugging Face

Hugging Face is an open-source ecosystem and platform provider for machine learning technologies. You can install its packages to access pre-built models and either use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge acquired during the initial training). You can then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep learning based, so training them requires substantial GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.

In [1]:
!pip install huggingface_hub transformers datasets gradio pipreqs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingface_hub
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting gradio
  Downloading gradio-3.28.3-py3-none-any.whl (17.3 MB)
Collecting pipreqs
  Downloading pipreqs-0.4.13-py2.py3-none-any.whl (33

In [3]:
# Import libraries
import os
import uuid
import pandas as pd
import numpy as np
from scipy.special import softmax
import gradio as gr

from google.colab import drive
from datasets import load_dataset
from huggingface_hub import notebook_login
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoConfig, 
    AutoModelForSequenceClassification,
    TFAutoModelForSequenceClassification,
    IntervalStrategy,
    TrainingArguments,
    EarlyStoppingCallback,
    pipeline,
    Trainer
) 


In [4]:
# Log in to the Hugging Face Hub
notebook_login()


In [5]:
drive.mount('/content/drive')

Mounted at /content/drive


## Fine-tuning a Hugging Face Text Classification Model

The next cell sets the environment variable "WANDB_DISABLED" to "true", which disables logging to Weights and Biases (W&B). W&B is a third-party tool for tracking and visualizing the training progress of machine learning models. Setting this variable tells the Trainer not to use it. Note that newer Transformers releases deprecate this variable in favor of the report_to training argument (for instance report_to="none").

In [6]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

In [7]:
# Load the dataset and display some values

# Load the CSV file into a DataFrame

url = "https://github.com/Azubi-Africa/Career_Accelerator_P5-NLP/raw/master/zindi_challenge/data/Train.csv"

df = pd.read_csv(url)


Data Quality checks 

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   10001 non-null  object 
 1   safe_text  10001 non-null  object 
 2   label      10000 non-null  float64
 3   agreement  9999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 312.7+ KB


In [9]:
# Count the missing values in each column
df.isnull().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

In [10]:
# Select rows with missing values
df[df.isnull().any(axis=1)]

Unnamed: 0,tweet_id,safe_text,label,agreement
4798,RQMQ0L2A,#lawandorderSVU,,
4799,I cannot believe in this day and age some pare...,1,0.666667,


In [11]:
# Extract complete text from 'safe_text' column
complete_text = df.iloc[4798]['safe_text']
complete_text

'#lawandorderSVU '

In [12]:
# Fill in the missing label and agreement values for row 4798
df.loc[4798, 'label'] = 0
df.loc[4798, 'agreement'] = 0.666667

# Write the extracted text back to the safe_text column with .iloc[]
df.iloc[4798, df.columns.get_loc('safe_text')] = complete_text


In [13]:
# Row 4799 holds shifted values from a malformed CSV record: the tweet text
# sits in 'tweet_id', the label in 'safe_text', and the agreement in 'label'.
row_index = 4799

# Keep the tweet text before 'tweet_id' is overwritten
tweet_text = df.loc[row_index, 'tweet_id']

# Generate a random UUID string to serve as a unique replacement tweet_id
rand_tweet_id = str(uuid.uuid4())

# Assign the corrected values to each column
df.loc[row_index, 'tweet_id'] = rand_tweet_id
df.loc[row_index, 'safe_text'] = tweet_text
df.loc[row_index, 'label'] = 1
df.loc[row_index, 'agreement'] = 0.666667


In [14]:
df[df.duplicated()].sum()

tweet_id     0.0
safe_text    0.0
label        0.0
agreement    0.0
dtype: float64

In [15]:
#distribution of sentiments 
df["label"].value_counts()

 0.0    4909
 1.0    4054
-1.0    1038
Name: label, dtype: int64

# Fine-tuning the BERT model

In [19]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [17]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
1641,CQDD6QLM,"New <user> ""Hey Love"" #MMR #ManyMenRecords #Yo...",0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
336,I4D043ST,<user> esp when mercury free vaccines are avai...,1.0,0.666667
6861,CKX52Y8G,"My Life, Your Entertainment #YOTC #MMR @ Exoti...",0.0,1.0
720,07S3NL2T,Baby Luna is sore from her vaccines :( #poorpuppy,0.0,0.666667


In [18]:
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
5818,Y8PQ0BT7,So nervous... The baby's getting vaccines... (...,1.0,0.666667
7842,C9Z6JBSS,AIDS N : A malaria vaccine in children with HI...,0.0,0.666667
880,0VE4NWWQ,Measles Outbreak Hits Texas Church That Preach...,1.0,0.666667
9072,RHQRUF14,Thank you <user> for mtg with your staff. We l...,1.0,1.0
288,ZWEP2IL4,Health district offers no-cost immunizations f...,1.0,0.666667


In [20]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (8000, 4), eval is (2001, 4)


By saving the subsets as CSV files, you can easily load them into your machine learning framework of choice (e.g., PyTorch, TensorFlow) and preprocess the data as needed for your specific task. Additionally, saving the subsets as separate files allows you to easily swap in new training or evaluation data as needed during the development process.

In [22]:
directory = r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data'

In [23]:
# Save the split subsets
train.to_csv(r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\train_subset.csv', index=False)
eval.to_csv(r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\eval_subset.csv', index=False)

In [25]:
# Load the CSV files into a dataset

dataset = load_dataset('csv',
                        data_files={'train': r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\train_subset.csv',
                        'eval': r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\eval_subset.csv'}, encoding = "ISO-8859-1")


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-75423146d73afc37/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-75423146d73afc37/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



Transformers is a Python library for natural language processing (NLP) developed by Hugging Face. It provides an easy-to-use interface for building and training state-of-the-art deep learning models for a variety of NLP tasks, such as text classification, named entity recognition, question answering, and more.

The transformer architecture is a type of neural network that is particularly well-suited for processing sequential data, such as natural language text. It has largely replaced the recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that were previously used for NLP tasks, and has achieved state-of-the-art performance on a wide range of benchmarks.

The Transformers library provides pre-trained transformer models that can be fine-tuned on a specific NLP task with only a small amount of task-specific data. This allows developers to easily leverage the power of transformer models for their own NLP tasks, even if they do not have access to large amounts of training data or high-performance computing resources.

A tokenizer is a component in natural language processing (NLP) that breaks down text into individual tokens, which are usually words or subwords. Tokenization is an important preprocessing step in many NLP tasks, because it converts raw text data into a format that can be easily processed by machine learning models.

There are different types of tokenizers that can be used, depending on the specific requirements of the task. Some common types include:

Word tokenizers: These tokenize text into individual words based on whitespace or punctuation.

Subword tokenizers: These tokenize text into subwords, which can be useful for handling out-of-vocabulary words or words that are rare in the training data.

Character tokenizers: These tokenize text into individual characters, which can be useful for languages that have complex orthographies or for handling misspellings.
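To make the three families concrete, here is a toy sketch in plain Python. It is not how the Transformers tokenizers are implemented: the function names and the tiny vocabulary are made up for illustration, and the subword splitter only imitates the greedy longest-match idea used by WordPiece-style tokenizers.

```python
def word_tokenize(text):
    """Word tokenizer: split on whitespace."""
    return text.split()


def char_tokenize(text):
    """Character tokenizer: one token per character."""
    return list(text)


def subword_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style sketch).

    Pieces that continue a word carry a '##' prefix. A word that cannot be
    covered by the vocabulary is mapped to '[UNK]'.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this example
toy_vocab = {"vaccin", "##ation", "##es", "##e", "safe"}

word_tokens = word_tokenize("vaccines safe")
char_tokens = char_tokenize("safe")
subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, toy_vocab)]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```

The subword output shows why this family copes well with rare words: "vaccines" is not in the toy vocabulary, yet it can still be represented by the known pieces "vaccin" and "##es".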

AutoTokenizer is used to instantiate a tokenizer. It is a class in the Transformers library that automatically selects the appropriate tokenizer for a given pre-trained model. Rather than guessing, AutoTokenizer reads the configuration files stored with the pre-trained checkpoint to determine which tokenizer class to load. This is useful when working with a variety of pre-trained models, because you get the matching tokenizer without having to select one manually for each model.

In [26]:
# Instantiate the tokenizer for the pre-trained BERT (Bidirectional Encoder
# Representations from Transformers) model, using the bert-base-cased checkpoint
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')



Specifically, AutoTokenizer.from_pretrained() is a method in the Transformers library that allows you to load a pre-trained tokenizer for a specific model architecture and configuration. In this case, the from_pretrained() method is called with the argument 'bert-base-cased', which is the name of a pre-trained BERT model that has been trained on a large corpus of English text.

The bert-base-cased configuration refers to a version of the BERT model that has a cased vocabulary, meaning that it distinguishes between uppercase and lowercase letters. This can be useful in tasks where the case of words is important, such as named entity recognition or sentiment analysis.

By instantiating a tokenizer for the bert-base-cased model using AutoTokenizer.from_pretrained(), you can tokenize text according to the same scheme used during pre-training of the BERT model. This can be useful when fine-tuning the pre-trained model on a specific task, because it ensures that the input data is pre-processed in the same way as the data used to train the original model.

In [27]:
# Define a function to transform the label values
def transform_labels(label):
    # Extract the label value
    label = label['label']
    # Map the label value to an integer value
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2
    # Return a dictionary with a single key-value pair
    return {'labels': num}

# Define a function to tokenize the text data
def tokenize_data(example):
    # Extract the 'safe_text' value from the input example and tokenize it
    return tokenizer(example['safe_text'], padding='max_length')

# Apply the transformation functions to the dataset using the 'map' method:
# first convert the labels, then tokenize the text in batches
dataset_out = dataset.map(transform_labels)

dataset_base = dataset_out.map(tokenize_data, batched=True)

# Define a list of column names to remove from the dataset
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']

# Drop the columns listed in 'remove_columns'. The labels were already
# transformed above, so 'transform_labels' does not need to run a second time
dataset_base = dataset_base.remove_columns(remove_columns)


The columns specified in remove_columns are removed from the dataset because they are not needed for the subsequent analysis or model training.

tweet_id: This column contains unique identifiers for each tweet, which are not relevant for the analysis or modeling.

label: This column contains the original label values, which have already been transformed into numerical values using the transform_labels function.

safe_text: This column contains the tweet text. The tokenizer has already converted it into the encoded inputs the model consumes, so the raw string is no longer needed for training.

agreement: This column indicates the level of agreement among the annotators for each tweet. While this information might be useful for some analyses, it is not necessary for the sentiment analysis task at hand.

By removing these columns, the resulting dataset is more compact and easier to work with, while retaining all the relevant information for the sentiment analysis task.
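The label conversion described above can also be written as a lookup table, which keeps the mapping and its inverse in one place. This is only an alternative sketch: LABEL_TO_ID, ID_TO_NAME and transform_labels_lookup are names introduced here, not part of the notebook, and unlike the if/elif version this one raises a KeyError on an unexpected label instead of silently returning 0.

```python
# Original dataset labels mapped to the class ids the model is trained on
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}

# Class ids mapped back to readable names
ID_TO_NAME = {0: "negative", 1: "neutral", 2: "positive"}


def transform_labels_lookup(example):
    """Return the class id for one example as {'labels': id}."""
    # Labels loaded from the CSV are floats (-1.0, 0.0, 1.0); int() normalizes them
    return {"labels": LABEL_TO_ID[int(example["label"])]}


converted = transform_labels_lookup({"label": -1.0})
print(converted, ID_TO_NAME[converted["labels"]])
```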

In [28]:
dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 8000
    })
    eval: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 2001
    })
})

In [29]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',                          # Directory where the model checkpoints and evaluation results will be stored
    evaluation_strategy=IntervalStrategy.STEPS,      # Interval for evaluating the model during training (every specified number of steps)
    save_strategy=IntervalStrategy.STEPS,            # Interval for saving the model during training (every specified number of steps)
    save_steps=500,                                  # Number of steps between two saves
    load_best_model_at_end=True,                     # Whether to load the best model at the end of training
    num_train_epochs=10,                              # Number of training epochs
    per_device_train_batch_size=2,                   # Batch size per GPU for training
    per_device_eval_batch_size=2,                    # Batch size per GPU for evaluation
    learning_rate=3e-5,                              # Learning rate
    weight_decay=0.01,                               # Weight decay
    warmup_steps=500,                                # Number of warmup steps
    logging_steps=500,                               # Number of steps between two logs
    fp16=True,                                       # Whether to use 16-bit precision
    gradient_accumulation_steps=16,                  # Number of steps to accumulate gradients before performing an optimizer step
    dataloader_num_workers=2,                        # Number of workers to use for loading data
    push_to_hub=True,                                # Whether to push the model checkpoints to the Hugging Face hub
    hub_model_id="GhylB/Sentiment_Analysis_BERT_Based_MODEL",  # Model ID to use when pushing the model to the Hugging Face hub 
)

#use hub_model_id="finetuned-Sentiment-classfication-ROBERTA-model
#use hub_model_id="finetuned-Sentiment-classfication-BERT-model
#use hub_model_id="finetuned-Sentiment-classfication-DISTILBERT-model

# Define the early stopping callback
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,                       # Number of consecutive evaluations with no improvement before stopping training
    early_stopping_threshold=0.01,                   # Minimum improvement in the metric for considering an improvement
)

# Note: assigning the callback to training_args does not register it.
# Trainer only runs callbacks passed through its own `callbacks` argument,
# e.g. Trainer(..., callbacks=[early_stopping])
training_args.callbacks = [early_stopping]


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Explanation:

from transformers import IntervalStrategy, TrainingArguments: Importing the IntervalStrategy and TrainingArguments classes from the transformers library.

training_args = TrainingArguments(: Creating a TrainingArguments object and assigning it to the variable training_args.

output_dir='./results': Specifies the directory where the training results will be saved.

evaluation_strategy=IntervalStrategy.STEPS: Specifies how often the model will be evaluated during training. In this case, the model is evaluated every fixed number of steps. Because eval_steps is not set, it falls back to the value of logging_steps (500 here).

save_strategy=IntervalStrategy.STEPS: Specifies how often the model will be saved during training. In this case, the model will be saved at specific intervals.

save_steps=500: Specifies how often the model will be saved during training, in terms of the number of steps taken. In this case, the model will be saved every 500 steps.

load_best_model_at_end=True: Specifies whether to load the best model at the end of training. If set to True, the best checkpoint found during training is loaded; if set to False, the model keeps the weights from the final training step.

num_train_epochs=10: Specifies the number of epochs for training the model. In this case, the model will be trained for 10 epochs.

per_device_train_batch_size=2: Specifies the batch size per device for training. In this case, each training batch will contain 2 examples per device; combined with gradient_accumulation_steps=16, gradients from 16 such batches are accumulated before each optimizer step.

per_device_eval_batch_size=2: Specifies the batch size for evaluation. In this case, each evaluation batch will contain 2 examples.
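To make the batch-size settings concrete, the arithmetic below estimates the effective batch size and the number of optimizer steps for this configuration. It is a rough sketch, assuming a single GPU (num_devices is a name introduced here for illustration) and the 8,000 training rows reported for the train split; the Trainer computes its own step count internally.

```python
import math

train_rows = 8000                   # size of the train split
per_device_train_batch_size = 2     # from TrainingArguments
gradient_accumulation_steps = 16    # from TrainingArguments
num_train_epochs = 10               # from TrainingArguments
num_devices = 1                     # assumption: training on a single GPU

# Examples that contribute to one optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Optimizer updates per pass over the training data, and in total
steps_per_epoch = math.ceil(train_rows / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

Under these assumptions each update sees 32 examples, an epoch takes 250 updates, and the full run takes 2,500, which lines up with the last step reported in the training log.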

In [30]:

# AutoModelForSequenceClassification is the Transformers class for sequence
# classification tasks, where the input is a sequence of text and the output is
# a label or category assigned to that sequence. Sentiment analysis (positive,
# negative, or neutral) is a common example, which makes this class a suitable
# choice for building a sentiment analysis model with BERT.
#
# It selects the matching model architecture from the checkpoint's
# configuration, so pre-trained models can be fine-tuned without choosing the
# architecture class by hand.

# Load a pre-trained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b


In [31]:
# Select the 'train' subset and shuffle it with a fixed seed of 10, so that the
# samples are presented to the model in a randomized but reproducible order
train_dataset_base = dataset_base['train'].shuffle(seed=10) #.select(range(40000)) # to select a part

# Shuffle the 'eval' subset with the same seed
eval_dataset_base = dataset_base['eval'].shuffle(seed=10)


In [32]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    return {"rmse": rmse}
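Because the class ids are ordered (0 = negative, 1 = neutral, 2 = positive), an RMSE computed on them penalizes a prediction that is two classes away more heavily than one that lands on the neighbouring class. The small worked example below illustrates this in plain Python; the rmse helper and the sample labels are made up for illustration and are separate from the compute_metrics function used for training.

```python
import math


def rmse(predictions, labels):
    """Root mean squared error between predicted and true class ids."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [0, 1, 2, 2]

# One mistake: a negative tweet predicted as neutral (off by one class)
adjacent_miss = rmse([1, 1, 2, 2], labels)

# One mistake: a negative tweet predicted as positive (off by two classes)
far_miss = rmse([2, 1, 2, 2], labels)

print(adjacent_miss, far_miss)
```

Both prediction lists contain a single error, so their accuracy is identical, but the second one doubles the RMSE.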


In [33]:
trainer_base = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_dataset_base, 
    eval_dataset=eval_dataset_base,
    compute_metrics=compute_metrics    # Metric function called at every evaluation
)


Cloning https://huggingface.co/GhylB/Sentiment_Analysis_BERT_Based_MODEL into local empty directory.


In [34]:
# Launch the learning process: training.
# trainer_base.train() runs training on the train_dataset passed to the Trainer.
#
# During training, the model's parameters are updated to minimize the loss
# between the predicted and actual outputs. Each step consists of forward and
# backward passes through the network, followed by a parameter update from the
# optimizer (in this case, AdamW).
#
# The trainer tracks progress (current epoch, steps completed, training loss,
# and evaluation loss when an evaluation dataset is provided). Training runs
# for num_train_epochs unless a stopping criterion, such as early stopping on
# the evaluation loss, ends it sooner.
trainer_base.train()



Step,Training Loss,Validation Loss,Rmse
500,0.7508,0.595455,0.669534
1000,0.3953,0.748545,0.660516
1500,0.1399,1.056143,0.67028
2000,0.0585,1.309443,0.652524
2500,0.0298,1.438129,0.667291
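In the log above, the validation loss rises after step 500 while the training loss keeps falling, which is the pattern early stopping is meant to catch. The sketch below is a simplified, pure-Python patience check, not the Transformers EarlyStoppingCallback itself. first_stop_index is a hypothetical helper, and the patience and threshold values mirror the ones configured earlier. It replays the logged validation losses to show where such a rule would halt.

```python
def first_stop_index(losses, patience, threshold):
    """Return the index of the evaluation where training would stop, or None.

    An evaluation counts as an improvement only if it beats the best loss
    seen so far by more than `threshold`.
    """
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(losses):
        if best - loss > threshold:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None


# Values copied from the training log
eval_steps = [500, 1000, 1500, 2000, 2500]
val_losses = [0.595455, 0.748545, 1.056143, 1.309443, 1.438129]

stop = first_stop_index(val_losses, patience=3, threshold=0.01)
stop_step = eval_steps[stop] if stop is not None else None
print(stop_step)
```

With these inputs the rule triggers at step 2000, after three consecutive evaluations without a sufficient improvement over the loss at step 500.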


In [35]:
# Evaluate the model
eval_results = trainer_base.evaluate()

# Create a dictionary of the evaluation results
results_dict = {
    "Model": "Bert_base",
    "Loss": eval_results["eval_loss"],
    "RMSE": eval_results["eval_rmse"],
    "Runtime": eval_results["eval_runtime"],
    "Samples Per Second": eval_results["eval_samples_per_second"],
    "Steps Per Second": eval_results["eval_steps_per_second"],
    "Epoch": eval_results["epoch"]
}

# Create a pandas DataFrame from the dictionary
results_df = pd.DataFrame([results_dict])

# Print the results
print(results_df)


       Model      Loss      RMSE  Runtime  Samples Per Second  \
0  Bert_base  0.595455  0.669534  37.6575              53.137   

   Steps Per Second  Epoch  
0            26.582   10.0  


In [36]:

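# Push the final fine-tuned model to the Hugging Face model hub.
# The target repository comes from hub_model_id in training_args;
# the first positional argument of Trainer.push_to_hub is the commit message.
trainer_base.push_to_hub(commit_message="End of training")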

To https://huggingface.co/GhylB/Sentiment_Analysis_BERT_Based_MODEL
   68dafb6..f1cbe2b  main -> main

To https://huggingface.co/GhylB/Sentiment_Analysis_BERT_Based_MODEL
   f1cbe2b..3a5b309  main -> main



'https://huggingface.co/GhylB/Sentiment_Analysis_BERT_Based_MODEL/commit/f1cbe2bb6ee5406b66d36437328aeb96379f87ce'

In [37]:
tokenizer.push_to_hub("GhylB/Sentiment_Analysis_BERT_Based_MODEL")

CommitInfo(commit_url='https://huggingface.co/GhylB/Sentiment_Analysis_BERT_Based_MODEL/commit/f40bde8c14f0e2ba1c70ac01fafe1cdcd31043ae', commit_message='Upload tokenizer', commit_description='', oid='f40bde8c14f0e2ba1c70ac01fafe1cdcd31043ae', pr_url=None, pr_revision=None, pr_num=None)

In [38]:
model.push_to_hub("GhylB/Sentiment_Analysis_BERT_Based_MODEL")

CommitInfo(commit_url='https://huggingface.co/GhylB/Sentiment_Analysis_BERT_Based_MODEL/commit/a227ec7c300259361df54e16e71737bee1b6f1cf', commit_message='Upload BertForSequenceClassification', commit_description='', oid='a227ec7c300259361df54e16e71737bee1b6f1cf', pr_url=None, pr_revision=None, pr_num=None)

### You can load your model from anywhere using from_pretrained!

In [40]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("GhylB/Sentiment_Analysis_BERT_Based_MODEL")

# Load the fine-tuned model into a text-classification pipeline
model = pipeline("text-classification", model="GhylB/Sentiment_Analysis_BERT_Based_MODEL", tokenizer=tokenizer)




In [41]:
label_map = {0: "negative", 1: "neutral", 2: "positive"}

# Make predictions on some example text
result = model("I love these covid vaccines.")

# Map the numerical label to the corresponding class name
result[0]["label"] = label_map[int(result[0]["label"].split("_")[1])]

# Print the predicted label and score
print(result)

[{'label': 'positive', 'score': 0.879347562789917}]


In [42]:
# Write the installed package versions to a requirements file in the working
# directory. A shell redirect needs a file path, not the Python raw-string
# literal of a folder; requirements.txt is an assumed file name.
!pip freeze > requirements.txt

In [44]:
# Confirm that the file was written
!ls requirements.txt
