### Part 3



Import the necessary libraries

In [3]:
import re

import pandas as pd
import torch  # needed by classify_tweet (torch.no_grad, torch.argmax)

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from transformers import BertForSequenceClassification, BertTokenizer
from transformers import RobertaForSequenceClassification, RobertaTokenizer

# Resources used by clean_text
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')  # required by stopwords.words('english')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...


Load the data

In [4]:
df = pd.read_csv('tweets.csv')

In [5]:
# Sample 100k tweets
sample_size = 100000
df_sampled = df.sample(n=sample_size, random_state=42)

Preprocess the data (exactly the same function that was used in part 2)

In [6]:

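def clean_text(text):
    """Lowercase the tweet and strip everything except letters, digits and whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # The token-level steps below are computed but not returned (see the note
    # before the return statement).
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    stemmed = [stemmer.stem(word) for word in tokens]  # unused
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    cleaned_text = ' '.join(lemmatized)  # unused

    # NOTE: returns the lowercased, punctuation-stripped text, not cleaned_text.
    # This matches the outputs shown later in this notebook, where stop words
    # such as "out" and "of" are still present. Returning cleaned_text instead
    # would change the inputs the models receive.
    return text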


df_sampled['text'] = df_sampled['text'].apply(clean_text)
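
The first two steps of `clean_text` (lowercasing and the regular-expression filter) decide what the tokenizer finally sees. The sketch below applies just those two steps to a made-up tweet to show their effect; the helper name `normalize` and the sample text are illustrative assumptions, not part of the project code.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop everything except letters, digits and whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())

# Hypothetical tweet, invented for illustration only
sample = "RT @UnitedHealthGrp: $UNH up 3% today!"
normalized = normalize(sample)
print(normalized)
```

Note that mention markers (`@`), cashtag markers (`$`) and percent signs are removed, so a ticker such as `$UNH` becomes the plain word `unh`.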

Load the RoBERTa model and run it on the sampled data

In [69]:

# Paths to the saved model directory (which holds model.safetensors) and its config.json
config_path = "roberta/roberta_model/config.json"
model_path = "roberta/roberta_model/"

# Load the tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained(model_path, config=config_path)

# Define the classify_tweet function
def classify_tweet(tweet, tokenizer, model):
    inputs = tokenizer(tweet, truncation=True, padding=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_label = torch.argmax(logits, dim=1).item()
    return predicted_label

# Assuming df_sampled is already defined
df_sampled['roberta_label'] = df_sampled['text'].apply(lambda tweet: classify_tweet(tweet, tokenizer, model))
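
`classify_tweet` takes the index of the largest logit as the predicted label. The plain-Python sketch below shows what that step does and how a confidence score could be derived from the same logits with a softmax. The logit values are made up, and `softmax` and `predict` are hypothetical helpers. With PyTorch tensors the equivalent probability call would be `torch.softmax(logits, dim=1)`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (label, confidence): index of the largest logit and its softmax probability."""
    probs = softmax(logits)
    label = max(range(len(logits)), key=lambda i: logits[i])
    return label, probs[label]

# Hypothetical logits for one tweet from a two-class classification head
label, confidence = predict([-0.4, 1.2])
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of the logits, the predicted label is the same whether the argmax is taken before or after it. The probability is only useful as an extra signal, for example to flag low-confidence tweets.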




Load the BERT model and run it on the sampled data


In [66]:
# Paths to the saved model directory (which holds model.safetensors) and its config.json
config_path = "bert/bert_model/config.json"
model_path = "bert/bert_model/"

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model = BertForSequenceClassification.from_pretrained(model_path, config=config_path)

# Define the classify_tweet function
def classify_tweet(tweet, tokenizer, model):
    inputs = tokenizer(tweet, truncation=True, padding=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_label = torch.argmax(logits, dim=1).item()
    return predicted_label

# Assuming df_sampled is already defined
df_sampled['bert_label'] = df_sampled['text'].apply(lambda tweet: classify_tweet(tweet, tokenizer, model))

# Display the results
print(df_sampled[['text', 'bert_label']])

                                                      text  bert_label
4385281  rt unitedhealthgrp unh wichmann unh generated ...           1
1425403  nice shake out of weak hands yesterday mms  we...           0
2171408  rt momostocktrades vdrm cnbx owcp usrm aapl fb...           0
3564549  commercial metals company director just picked...           1
7335060  hunting for stocks to short heres one w a low ...           0
...                                                    ...         ...
7831020  westerndigital ceo in japan to finalize toshib...           1
7460314  first financial bankshares inc ffin upgraded t...           0
3826266  what moves ftse 100  sentiments analysis or in...           1
3831838  must read undervalued microcap about to run ht...           1
6790864  tower research capital llc trc sells 81118 sha...           0

[100 rows x 2 columns]
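
Once both cells have run, `df_sampled` holds a `roberta_label` and a `bert_label` column, so one simple check is how often the two models agree. The following is a minimal sketch of that comparison using raw agreement and Cohen's kappa. The label lists are invented for illustration; in the notebook they would come from `df_sampled['roberta_label'].tolist()` and `df_sampled['bert_label'].tolist()`.

```python
from collections import Counter

def agreement_rate(a, b):
    """Fraction of positions where the two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # both sequences use a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels, not taken from the actual run
roberta_labels = [1, 0, 0, 1, 0, 1, 1, 0]
bert_labels = [1, 0, 0, 1, 0, 1, 0, 1]

agreement = agreement_rate(roberta_labels, bert_labels)
kappa = cohen_kappa(roberta_labels, bert_labels)
print(agreement, kappa)
```

Agreement between the two models does not measure accuracy, since both can be wrong on the same tweet, but tweets where they disagree are natural candidates for manual review.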


Discuss transfer learning and fine-tuning and how they can be used to improve
the overall effect of the project. Use your limited results from the second part.

# Transfer Learning and Fine-Tuning in Natural Language Processing

Transfer learning and fine-tuning are techniques in natural language processing (NLP) that reuse a model pre-trained on a large corpus instead of training a new model from scratch for each task. Because the pre-trained model already encodes general knowledge of the language, a project needs far less task-specific labeled data and compute to reach good performance.

## Transfer Learning

Transfer learning involves transferring knowledge from a pre-trained model, trained on a large general-purpose dataset, to a specific task or domain. In NLP, transfer learning typically involves using pre-trained language models such as BERT, RoBERTa, or GPT to initialize the parameters of a model for a downstream task, such as sentiment analysis or text classification.

### Benefits of Transfer Learning:

1. **Efficient Use of Resources**: Transfer learning allows us to leverage the computational resources and expertise invested in training large-scale pre-trained models, saving time and effort in training from scratch.

2. **Improved Performance**: Pre-trained models have learned rich representations of language from vast amounts of text data, which can lead to better performance on downstream tasks, especially when the task has limited labeled data.

3. **Domain Adaptation**: Transfer learning enables models to adapt to specific domains or tasks by fine-tuning the pre-trained parameters on domain-specific or task-specific data.

## Fine-Tuning

Fine-tuning involves further training the pre-trained model on task-specific data to adapt it to the target task. During fine-tuning, the parameters of the pre-trained model are adjusted using task-specific labeled data, typically with a smaller learning rate compared to the initial pre-training phase.
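
The effect of starting from pre-trained parameters can be seen even in a deliberately tiny setting. The sketch below is a toy illustration, not a transformer: a one-parameter model `y = w * x` is first "pre-trained" on a source task and then fine-tuned for a few steps, with a smaller learning rate, on a related target task. All data, learning rates and step counts are made-up assumptions.

```python
def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(w, xs, ys, lr, steps):
    """Plain gradient descent on the single parameter w."""
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
source_ys = [2.0 * x for x in xs]  # source task: true w = 2.0
target_ys = [2.2 * x for x in xs]  # related target task: true w = 2.2

# "Pre-training": many steps on the source task with a larger learning rate
w_pretrained = train(0.0, xs, source_ys, lr=0.05, steps=200)

# Fine-tuning: a few steps on the target task with a smaller learning rate
w_finetuned = train(w_pretrained, xs, target_ys, lr=0.01, steps=5)

# Baseline: the same few steps on the target task, starting from scratch
w_scratch = train(0.0, xs, target_ys, lr=0.01, steps=5)

loss_finetuned = mse(w_finetuned, xs, target_ys)
loss_scratch = mse(w_scratch, xs, target_ys)
print(loss_finetuned, loss_scratch)
```

With the same small training budget on the target task, the run that starts from the pre-trained parameter ends with a much lower loss, because it begins close to a good solution.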

### Benefits of Fine-Tuning:

1. **Task-Specific Optimization**: Fine-tuning allows the model to learn task-specific patterns and nuances from the labeled data, leading to better performance on the target task.

2. **Flexibility**: Fine-tuning offers flexibility in adapting the pre-trained model to different tasks or domains by adjusting the training data and hyperparameters.

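3. **Regularization**: Starting from pre-trained weights and updating them with a small learning rate keeps the model close to what it learned during pre-training. This acts as an implicit regularizer and reduces, though it does not eliminate, the risk of overfitting a small labeled dataset.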

## Application to Project

In our project, we can use transfer learning and fine-tuning to improve the classification of tweets. By initializing our classifiers from pre-trained language models such as BERT or RoBERTa and fine-tuning them on our own labeled tweets, we can expect better performance than from models trained from scratch on the same data.

### Experiment Results

In our limited experiments in part 2, we observed the benefits of transfer learning and fine-tuning: starting from pre-trained models and fine-tuning them on our dataset improved accuracy in predicting labels for tweets. The BERT and RoBERTa models loaded above were used to label the sampled tweets in this part.

Overall, transfer learning and fine-tuning offer a powerful approach to NLP tasks, allowing us to leverage the knowledge learned from large-scale pre-training and adapt it to our specific project requirements.

