## 2. Creating a baseline


### Vader Sentiment


In [None]:
pip install nltk



Import the necessary libraries and download the VADER lexicon.

In [14]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon (if you haven't already)
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

Create a SentimentIntensityAnalyzer object and use it to analyze text sentiment.

In [15]:
# Create a sentiment analyzer
sia = SentimentIntensityAnalyzer()

In [16]:
import pandas as pd

In [17]:
df = pd.read_csv('/content/amazon_reviews_train.csv')
def analyze_sentiment(text):
    sentiment_scores = sia.polarity_scores(text)
    return sentiment_scores

#df['sentiment'] = df['review'].apply(analyze_sentiment)


In [18]:
doc=df.review.to_list()
tags=df.sentiment.to_list()
seldocs=doc[1:10]
seltags=tags[1:10]


In [19]:
for i in range(len(seldocs)):
  s=sia.polarity_scores(seldocs[i])['compound']
  print(s,  seltags[i], seldocs[i])

0.8265 positive This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
0.0 negative If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal.
0.9468 positive Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.
0.9346 positive This saltwater taffy had great flavors and was very soft and chewy. 

#### Measure the accuracy

In [20]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd

# Assuming you have loaded your data into the 'df' DataFrame

# Initialize the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Extract the text documents and true sentiment labels
doc = df.review.to_list()
tags = df.sentiment.to_list()
seldocs = doc[1:10]
seltags = tags[1:10]

# Initialize variables to track correct predictions
correct_predictions = 0
total_predictions = len(seldocs)

# Classify sentiments and measure accuracy
for i in range(total_predictions):
    compound_score = sia.polarity_scores(seldocs[i])['compound']
    predicted_sentiment = 'positive' if compound_score >= 0 else 'negative'

    # Check if the predicted sentiment matches the true sentiment label
    if predicted_sentiment == seltags[i]:
        correct_predictions += 1

# Calculate accuracy
accuracy = correct_predictions / total_predictions

print(f"Accuracy: {accuracy * 100:.2f}%")


Accuracy: 77.78%
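
The thresholding step used above can be isolated into a small helper. The sketch below reuses the four compound scores and labels printed earlier. `classify` and `accuracy` are hypothetical helper names, and the `threshold` parameter is an assumption added only to show how moving the decision boundary changes the result.

```python
def classify(compound, threshold=0.0):
    """Map a compound score to a label; scores at or above the threshold count as positive."""
    return "positive" if compound >= threshold else "negative"


def accuracy(scores, labels, threshold=0.0):
    """Fraction of scores whose thresholded label matches the reference label."""
    hits = sum(classify(s, threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


# Compound scores and true labels of the four reviews printed above
scores = [0.8265, 0.0, 0.9468, 0.9346]
labels = ["positive", "negative", "positive", "positive"]

print(accuracy(scores, labels))                  # threshold 0.0, as in the notebook
print(accuracy(scores, labels, threshold=0.05))  # a score of exactly 0.0 now counts as negative
```

On these four rows the default threshold yields 0.75, because the "medicinal" review scores exactly 0.0 and is therefore counted as positive.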


# **3**. Data preparation and applying a sentiment lexicon

`nltk.download('all')` downloads every dataset and package available in the NLTK library.


*   Some NLTK functions only work once the specific resource they depend on has been downloaded.
*   Downloading everything up front avoids missing-resource errors, at the cost of a much larger download than fetching individual packages such as `punkt` or `stopwords`.

## Apply Sentiment Lexicon

### nrc_lexicon

In [None]:
#data = pd.read_csv("../data/en/NCR-lexicon.csv", encoding="utf-8")
data = pd.read_csv("https://raw.githubusercontent.com/fmmb/Text-Mining/main/data/NRC-lexicon.csv", encoding="utf-8")
data.sample(5)

In [None]:
data.set_index("English", inplace=True)
lex1 = data["Positive"] - data["Negative"]
lex1.sample(5)

In [None]:
lex2 = lex1.to_dict()

In [None]:
def sentimento(texto):
    soma = 0
    for w in texto.split():
        soma = soma + lex2.get(w, 0)
    if soma >= 0:
        return "positive"
    else:
        return "negative"

In [None]:
# Function to calculate the prediction for each row
def calculate_prediction(row):
    return sentimento(row['review'])

In [None]:
# Apply the function to each row to get the 'Prediction' column
from sklearn.metrics import classification_report

df['Prediction'] = df.apply(calculate_prediction, axis=1)

print(classification_report(df['sentiment'], df['Prediction']))

#### Preprocess the text

In [None]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer


import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer

# Assuming 'df' is your DataFrame containing 'sentiment' and 'review' columns

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def normalize_text(text):
    tokens = word_tokenize(text)  # Tokenize the text
    tokens_lowered = [word.lower() for word in tokens]  # Lowercase each token
    tokens_no_HTML_tag = [word for word in tokens_lowered if word not in ['br', '<br>', '<br />', '<br/>']]
    # The original filtering step is missing from the notebook; dropping stop words,
    # punctuation and quote tokens is an assumed reconstruction based on the imports above
    drop_tokens = stop_words | set(string.punctuation) | {'``', "''"}
    tokens_filtered_no_quotes = [word for word in tokens_no_HTML_tag if word not in drop_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_filtered_no_quotes]  # Lemmatize the tokens

    # Join the tokens back into a string

    processed_text = ' '.join(lemmatized_tokens)

    return processed_text

# Apply text normalization to the 'review' column

df['Normalized_Review'] = df['review'].apply(normalize_text)

# Display the updated DataFrame
print(df)

In [None]:
# Function to calculate the prediction for each row
def calculate_prediction(row):
    return sentimento(row['Normalized_Review'])

In [None]:
# Apply the function to each row to get the 'Prediction' column
from sklearn.metrics import classification_report

df['Prediction'] = df.apply(calculate_prediction, axis=1)

print(classification_report(df['sentiment'], df['Prediction']))

#### Negation Handling

In [None]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/fmmb/Text-Mining/main/data/NRC-lexicon.csv", encoding="utf-8")
data.set_index("English", inplace=True)
lex1 = data["Positive"] - data["Negative"]
lex2 = lex1.to_dict()

negation_words = ["not", "no", "never"]  # Add more negation words as needed

def sentimento(texto):
    words = texto.split()
    soma = 0
    negation_multiplier = 1  # to handle negation

    for i, w in enumerate(words):
        if w in negation_words:
            negation_multiplier = -1  # invert the sentiment
        else:
            soma = soma + (lex2.get(w, 0) * negation_multiplier)
            negation_multiplier = 1  # reset the multiplier after handling the word

    if soma >= 0:
        return "positive"
    else:
        return "negative"

# Function to calculate the prediction for each row
def calculate_prediction(row):
    return sentimento(row['review'])

# Apply the function to each row to get the 'Prediction' column
df['prediction_neg_handled'] = df.apply(calculate_prediction, axis=1)

In [None]:
df

In [None]:
from sklearn.metrics import classification_report

print(classification_report(df['sentiment'], df['prediction_neg_handled']))
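
To see what the negation rule changes, it can help to run it on a tiny example. In the sketch below, `toy_lexicon` is a hypothetical three-word stand-in for the NRC-derived `lex2` dictionary, and `score` is a hypothetical helper that returns the raw sum instead of the final label.

```python
# Hypothetical three-word lexicon standing in for the NRC-derived `lex2` dictionary
toy_lexicon = {"good": 1, "tasty": 1, "bad": -1}
negation_words = {"not", "no", "never"}


def score(text, handle_negation=True):
    """Sum lexicon values, flipping the sign of the word that follows a negation."""
    total = 0
    multiplier = 1
    for word in text.split():
        if handle_negation and word in negation_words:
            multiplier = -1  # invert the next scored word
        else:
            total += toy_lexicon.get(word, 0) * multiplier
            multiplier = 1  # the negation only reaches one word
    return total


print(score("not good", handle_negation=False))  # 1
print(score("not good"))                         # -1
print(score("not very good"))                    # 1: "very" consumes the negation
```

The last call shows a limitation of resetting the multiplier after a single word: any intervening token, even one that is absent from the lexicon, cancels the negation.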

## 4. Training a model (machine learning)

In [8]:
from textblob import TextBlob
import pandas as pd
import nltk
import csv
from nltk import word_tokenize, sent_tokenize
from nltk.probability import FreqDist


In [None]:
nltk.download('punkt')

In [None]:
df = pd.read_csv(r'/content/amazon_reviews_train.csv')


In [None]:
df

In [None]:
# Group by 'Sentiment' and count
sentiment_counts = df.groupby('sentiment').size()

# Print the counts
print(sentiment_counts)

In [None]:
!jupyter notebook --NotebookApp.iopub_data_rate_limit=1000000000

## **Training a model** (machine learning)

NLTK (Natural Language Toolkit) and scikit-learn are two Python libraries widely used for text classification. They provide tools and algorithms to preprocess text, extract features, and build machine learning models.
We will try both and fit the model to our data.
Finally, we will run the model on our test set and measure its accuracy.

### Trying with NLTK classify

**Preprocessing**

1.   Lowercasing, punctuation removal, stop-word removal, and optionally stemming or lemmatization.



In [None]:
import pandas as pd
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

df['review'] = df['review'].str.lower()
df['review'] = df['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

The train_test_split function is a fundamental tool in machine learning. It serves a critical purpose by dividing a dataset into two distinct subsets: the training set and the testing set. The training set is used to train the model, allowing it to learn patterns and relationships within the data. Once trained, the model's performance is evaluated on the testing set, which contains data it has never seen before. This process mimics real-world scenarios where the model encounters new, unseen data. By assessing the model's performance on this independent set, we gain confidence in its ability to generalize well and make accurate predictions on future, unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets: 70% is held out for testing and only 30% is used for training (a smaller training set fits faster than the full set)
# random_state=42 makes the split reproducible

X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.7, random_state=42)
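
As an illustration of what a seeded hold-out split does, here is a minimal standard-library sketch. It is not scikit-learn's implementation: `holdout_split` is a hypothetical helper, and the real `train_test_split` offers more options (for example stratification through its `stratify` parameter).

```python
import random


def holdout_split(items, test_size, seed=42):
    """Shuffle a copy of `items` with a fixed seed and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder reviews, split 30% train / 70% test as in the cell above
reviews = [f"review {i}" for i in range(10)]
train, test = holdout_split(reviews, test_size=0.7)
print(len(train), len(test))

# The same seed always produces the same split
assert holdout_split(reviews, test_size=0.7) == (train, test)
```

Fixing the seed matters when comparing models: each classifier is then trained and evaluated on exactly the same rows.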


**Vectorize the Text**


1.   Convert the text data into a numerical format that can be used by machine learning algorithms. Common techniques include TF-IDF vectorization or word embeddings like Word2Vec or GloVe.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
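
To make the TF-IDF weighting concrete, the sketch below computes it by hand with the standard library. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that `TfidfVectorizer` applies by default; tokenisation here is a plain whitespace split, which differs from the vectorizer's own tokenizer, and `tfidf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["great taffy great price", "taffy was soft", "flavor is very medicinal"]
vectors = tfidf(docs)
print(vectors[1])
```

In the second document, "taffy" receives a lower weight than "soft" because it also occurs in the first document, which is the effect that makes TF-IDF favour distinctive terms.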


**Train the Binary Classifiers**

In [None]:
from sklearn.linear_model import LogisticRegression

binary_classifier = LogisticRegression()
binary_classifier.fit(X_train_tfidf, y_train)

# binary_classifier is now a trained binary classifier that can predict the sentiment (positive or negative) of new, unseen text



**Evaluate the Binary Classifier**

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = binary_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')
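
The numbers in the classification report can be reproduced by hand, which helps when reading it. The sketch below is a simplified illustration rather than scikit-learn's code; `binary_report` is a hypothetical helper and the label lists are made-up examples.

```python
def binary_report(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for five reviews
y_true = ["positive", "negative", "positive", "positive", "negative"]
y_pred = ["positive", "positive", "positive", "negative", "negative"]

print(binary_report(y_true, y_pred))
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall and F1 all come out at 2/3 while accuracy is 3/5.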


Apply the model you have built to the test set, evaluate the results obtained and compare them with the results obtained in the previous task.

**Test**

In [None]:
df_test = pd.read_csv(r'/content/amazon_reviews_test.csv')

In [None]:
# Preprocess the test set
df_test['review'] = df_test['review'].str.lower()
df_test['review'] = df_test['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df_test['review'] = df_test['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# No train/test split is needed here: the whole file is used as the test set

# Vectorize the test set
X_test = df_test['review']
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Apply the binary classifier on the test set
y_pred = binary_classifier.predict(X_test_tfidf)

# Evaluate the performance
accuracy = accuracy_score(df_test['sentiment'], y_pred)
report = classification_report(df_test['sentiment'], y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')



## Trying with scikit-learn classify

1.   With Logistic Regression
2.   With Naive Bayes



**Preprocess and split the data**

In [None]:
df['review'] = df['review'].str.lower()
df['review'] = df['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Step 2: Split the data
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.3, random_state=42)


Vectorize the text data with TF-IDF

In [None]:
# Logistic Regression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

Train with Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

binary_classifier = LogisticRegression()
binary_classifier.fit(X_train_tfidf, y_train)

In [None]:
y_pred = binary_classifier.predict(X_test_tfidf)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

With Naive Bayes

In [None]:
# Naive Bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

**Preprocess and split the data (same as before)**

In [None]:
df['review'] = df['review'].str.lower()
df['review'] = df['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.3, random_state=42)


Vectorize the data

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

**Train the Classifier with Multinomial Naive Bayes**

In [None]:
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)

#Predict sentiments on the test set
y_pred = naive_bayes_classifier.predict(X_test_tfidf)

In [None]:
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)  # fit on the TF-IDF matrix, not on the raw text

**Classify text and measure accuracy**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Individual predictions, to inspect one by one how each review is classified

In [None]:
for idx, (review, true_label, pred_label) in enumerate(zip(df['review'][y_test.index], y_test, y_pred)):
    print(f"Text {idx+1}:")
    print(f"Review: {review}")
    print(f"True Label: {true_label}")
    print(f"Predicted Label: {pred_label}")
    print(f"Accuracy: {'Correct' if true_label == pred_label else 'Incorrect'}\n")

**On the held-out split, logistic regression therefore fits better and gives the higher accuracy.**

So let's evaluate it on the test set.

In [None]:
# Step 1: Preprocess the data
df_test['review'] = df_test['review'].str.lower()
df_test['review'] = df_test['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df_test['review'] = df_test['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Step 2: Vectorize the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(df['review'])
X_test_tfidf = tfidf_vectorizer.transform(df_test['review'])

# Step 3: Train the binary classifier (Logistic Regression)
binary_classifier = LogisticRegression()
binary_classifier.fit(X_train_tfidf, df['sentiment'])

# Step 4: Apply the trained binary classifier on the test set
y_pred = binary_classifier.predict(X_test_tfidf)

# Step 5: Evaluate the performance
accuracy = accuracy_score(df_test['sentiment'], y_pred)
report = classification_report(df_test['sentiment'], y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

# 5. Using transformers

1.  As a first step, run simple experiments with predefined pipelines, applying one or more existing models.
2. As a second step, fine-tune the pre-trained model on your own data to achieve even better results.


In [None]:
pip install transformers

In [None]:
from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("text-classification")

In [None]:
# This command takes a lot of time run

# Apply sentiment analysis using the pipeline
#results = sentiment_pipeline(df['review'].tolist())

# Print out the sentiment predictions
#for idx, result in enumerate(results):
#    print(f"Text {idx+1}: {result['label']} (confidence: {result['score']:.4f})")


In [None]:
# Preprocessing
df_test['review'] = df_test['review'].str.lower()
df_test['review'] = df_test['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df_test['review'] = df_test['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Sample a subset of the test data
subset_df_test = df_test.sample(n=100, random_state=42)

# Apply sentiment analysis using the pipeline on the subset
results = sentiment_pipeline(subset_df_test['review'].tolist())

# Extract predicted sentiments and convert to lowercase
predicted_labels = [result['label'].lower() for result in results]

# Map sentiment labels to binary labels
binary_labels = {'positive': 1, 'negative': 0}
y_pred_binary = [binary_labels[label] for label in predicted_labels]

# Vectorize with TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(df['review'])
X_test_tfidf = tfidf_vectorizer.transform(subset_df_test['review'])

# Train the binary classifier (Logistic Regression)
binary_classifier = LogisticRegression()
binary_classifier.fit(X_train_tfidf, df['sentiment'])

# Apply the trained binary classifier on the test set
y_pred = binary_classifier.predict(X_test_tfidf)

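# Measure the accuracy of the transformer pipeline against the true labels
y_true_binary = subset_df_test['sentiment'].map(binary_labels).tolist()
pipeline_accuracy = accuracy_score(y_true_binary, y_pred_binary)
print(f'Pipeline accuracy: {pipeline_accuracy}')

# Measure the accuracy of the logistic regression baseline on the same subset
accuracy = accuracy_score(subset_df_test['sentiment'], y_pred)
report = classification_report(subset_df_test['sentiment'], y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')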

Now trying another pipeline

In [None]:
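# Note: "distilbert-base-uncased" is the base checkpoint without a fine-tuned classification head,
# so the pipeline attaches a randomly initialised head and returns generic LABEL_0 / LABEL_1 outputs
sentiment_pipeline_distilbert = pipeline("sentiment-analysis", model="distilbert-base-uncased")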

In [None]:
# Preprocessing
df['review'] = df['review'].str.lower()
df['review'] = df['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Sample a subset of the test data
subset_df_test = df_test.sample(n=100, random_state=42)

# Apply sentiment analysis using the pipeline
results_distilbert = sentiment_pipeline_distilbert(subset_df_test['review'].tolist())

# Extract predicted sentiments
predicted_labels_distilbert = [result['label'] for result in results_distilbert]

# Map sentiment labels to binary labels for DistilBERT predictions
binary_labels_distilbert = {'LABEL_1': 1, 'LABEL_0': 0}
y_pred_binary_distilbert = [binary_labels_distilbert[label] for label in predicted_labels_distilbert]

# Convert sentiment labels to binary labels for logistic regression
binary_labels_logistic = {'positive': 1, 'negative': 0}
y_pred_binary_logistic = [binary_labels_logistic[label] for label in y_pred]


# Measure agreement between the two models: DistilBERT's predictions act as the reference,
# so this is not accuracy against the true labels
accuracy = accuracy_score(y_pred_binary_distilbert, y_pred_binary_logistic)
report = classification_report(y_pred_binary_distilbert, y_pred_binary_logistic)


print(f'Agreement: {accuracy}')
print(f'Classification Report:\n{report}')

https://www.kaggle.com/code/pritishmishra/text-classification-with-distilbert-92-accuracy

https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

In [None]:
import pandas as pd
from transformers import pipeline, DistilBertTokenizer, DistilBertForSequenceClassification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Assuming you have already loaded your df_test DataFrame

# Step 1: Preprocess the data
df_test['review'] = df_test['review'].str.lower()
df_test['review'] = df_test['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df_test['review'] = df_test['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Step 2: Sample a smaller subset of the data to speed up processing
subset_df_test = df_test.sample(n=1000, random_state=42)

# Step 3: Vectorize the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(df['review'])
X_test_tfidf = tfidf_vectorizer.transform(subset_df_test['review'])

# Step 4: Define a pipeline for fill-mask task
fill_mask_pipeline = pipeline('fill-mask', model='distilbert-base-uncased')

# Step 5: Mask the middle word of each review
masked_sentences = []
for text in subset_df_test['review']:
    words = text.split()
    mask_position = len(words) // 2
    masked_text = ' '.join('[MASK]' if i == mask_position else word for i, word in enumerate(words))
    masked_sentences.append(masked_text)

# Step 6: Use the pipeline to predict the masked word (top candidate only)
predicted_sentiments = []
for masked_sentence in masked_sentences:
    results = fill_mask_pipeline(masked_sentence)
    predicted_word = results[0]['token_str']
    predicted_sentiments.append(predicted_word)

# Step 7: Apply logistic regression to classify the predicted sentiments
binary_classifier = LogisticRegression()
binary_classifier.fit(X_train_tfidf, df['sentiment'])

# Step 8: Vectorize the predicted sentiments using TF-IDF
X_predicted_tfidf = tfidf_vectorizer.transform(predicted_sentiments)

# Step 9: Classify the predicted sentiments
y_pred = binary_classifier.predict(X_predicted_tfidf)

# Step 10: Measure accuracy
accuracy = accuracy_score(subset_df_test['sentiment'], y_pred)
report = classification_report(subset_df_test['sentiment'], y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')


In [None]:
# Step 1: Preprocess the data (assuming df_test is your test DataFrame)
df_test['review'] = df_test['review'].str.lower()
df_test['review'] = df_test['review'].apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
stop_words = set(stopwords.words('english'))
df_test['review'] = df_test['review'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in stop_words]))

# Step 2: Use a Transformer pipeline for sentiment analysis
# Note: this base checkpoint has no fine-tuned classification head, so its LABEL_0 / LABEL_1 outputs come from a randomly initialised head
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased")

# Step 3: Apply sentiment analysis using the pipeline on the full test set (slow)
results = sentiment_pipeline(df_test['review'].tolist())

# Extract the predicted labels (the pipeline returns 'LABEL_0' / 'LABEL_1')
predicted_labels = [result['label'] for result in results]

# Map the pipeline labels to binary labels
binary_labels = {'LABEL_1': 1, 'LABEL_0': 0}
y_pred_binary = [binary_labels[label] for label in predicted_labels]


# Step 4: Vectorize the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=50000)
X_train_tfidf = tfidf_vectorizer.fit_transform(df['review'])
X_test_tfidf = tfidf_vectorizer.transform(df_test['review'])

# Step 5: Train the binary classifier (Logistic Regression)
binary_classifier = LogisticRegression(max_iter=10000)
binary_classifier.fit(X_train_tfidf, df['sentiment'])

# Step 6: Apply the trained binary classifier on the test set
y_pred = binary_classifier.predict(X_test_tfidf)

# Step 7: Measure the accuracy
accuracy = accuracy_score(df_test['sentiment'], y_pred)
report = classification_report(df_test['sentiment'], y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')


Fine-tuning involves training the pre-trained model on your specific task or dataset. You can use a sentiment analysis dataset and fine-tune a transformer-based model (like BERT or DistilBERT) on it.
This will allow the model to adapt to the specific nuances and characteristics of your sentiment analysis task.

Experiment with Different Models:

Try using different pre-trained models and architectures. For example, you can experiment with BERT, RoBERTa, XLNet, etc., and see which one performs better for your specific task.

Data Augmentation:

You can generate additional training data by applying techniques like paraphrasing, back-translation, or using synonyms. This can help in exposing the model to a wider range of sentence structures and sentiments.
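
A minimal sketch of synonym-based augmentation follows. The `SYNONYMS` table and the `synonym_augment` helper are hypothetical; a real table could come from a thesaurus or from WordNet.

```python
import random

# Hypothetical synonym table; a real one could come from a thesaurus or WordNet
SYNONYMS = {
    "great": ["excellent", "wonderful"],
    "yummy": ["tasty", "delicious"],
}


def synonym_augment(text, seed=0):
    """Return a copy of `text` in which every word found in SYNONYMS is swapped for a random synonym."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(SYNONYMS[word]) if word in SYNONYMS else word
        for word in text.split()
    )


original = "great taffy at a great price"
print(synonym_augment(original))
```

The augmented sentence keeps the label of the original review, so it can be appended to the training data as an extra example.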

Ensemble Methods:

Combine predictions from multiple models to improve accuracy. You can use techniques like averaging, stacking, or even using a voting classifier.
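
Majority voting, the simplest of these, can be sketched in a few lines. The three prediction lists below are made-up examples, and `majority_vote` is a hypothetical helper; an odd number of binary classifiers is assumed so that ties cannot occur.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists by taking the most common label for each sample."""
    combined = []
    for votes in zip(*model_predictions):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Made-up predictions from three models on the same four reviews
logreg_pred = ["positive", "negative", "positive", "negative"]
nb_pred = ["positive", "positive", "positive", "negative"]
vader_pred = ["negative", "negative", "positive", "positive"]

print(majority_vote(logreg_pred, nb_pred, vader_pred))
```

Each model is outvoted on the sample where it disagrees with the other two, which is how an ensemble can correct errors that the models do not share.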

Balancing the Dataset:

If your dataset is imbalanced (i.e., it has significantly more samples of one class than the other), consider techniques like oversampling, undersampling, or using synthetic data generation methods.
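
Random oversampling can be sketched as follows. `random_oversample` is a hypothetical helper and the five reviews are made-up. In practice it would be applied to the training split only, so that duplicated rows never leak into the evaluation data.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen minority-class samples until all classes have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(group) for group in by_label.values())

    out_texts, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_texts.extend(group + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Made-up imbalanced sample: four positive reviews, one negative
texts = ["great", "yummy", "soft", "tasty", "medicinal"]
labels = ["positive", "positive", "positive", "positive", "negative"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```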


## Fine-tuning

In [None]:
df_train = pd.read_csv(r'/content/amazon_reviews_train.csv')
df_test = pd.read_csv(r'/content/amazon_reviews_test.csv')

In [None]:
!pip install datasets

In [None]:
from datasets import Dataset, DatasetDict
import pandas as pd

# Convert the pandas DataFrames to datasets
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

# Create a DatasetDict containing train and test splits
dataset_dict = DatasetDict({'train': train_dataset, 'test': test_dataset})

print(dataset_dict)


In [None]:
small_train_dataset = dataset_dict["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = dataset_dict["test"].shuffle(seed=42).select(range(300))

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
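# Map the string sentiments to the integer labels the model expects
label2id = {"negative": 0, "positive": 1}

def preprocess_function(examples):
   tokenized = tokenizer(examples["review"], truncation=True)
   tokenized["label"] = [label2id[sentiment] for sentiment in examples["sentiment"]]
   return tokenized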

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)


In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# load_metric has been removed from recent versions of datasets, so the metrics are computed with scikit-learn
def compute_metrics(eval_pred):
   # eval_pred holds the raw logits and the reference labels
   logits, labels = eval_pred
   predictions = np.argmax(logits, axis=-1)
   accuracy = accuracy_score(labels, predictions)
   f1 = f1_score(labels, predictions)
   return {"accuracy": accuracy, "f1": f1}


In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
!pip install torch==2.0.0
!pip install "accelerate>=0.20.1"

In [None]:
import transformers
from transformers import TrainingArguments, Trainer

repo_name = "dilanveracruz/textMining"

training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=True,
   prediction_loss_only=False
)

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)


In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
!pip install transformers
!pip install torch
!pip install scikit-learn



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your data (assuming you have a DataFrame 'df' with 'review' and 'sentiment' columns)
# df = pd.read_csv('your_data.csv')

# Preprocess data (if necessary)
df['review'] = df['review'].str.lower()
df['review'] = df['review'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
df['review'] = df['review'].str.replace(r'\d+', '', regex=True)  # Remove numbers
df['review'] = df['review'].str.strip()  # Remove leading/trailing spaces

# Split data into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Save train and test data to separate CSV files
train_df.to_csv('train_data.csv', index=False)
test_df.to_csv('test_data.csv', index=False)


## Segmenting sentences