<a href="https://colab.research.google.com/github/CanardSteak/applied/blob/main/NLP_KNN_BERT_EMBEEDINGS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="0">BERT Embedding Natural Language Processing</a>

## K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative

In this notebook, we use BERT embeddings with the K Nearest Neighbors method to build a classifier that predicts the __isPositive__ field of our review dataset (which is very similar to the final project dataset).


1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Text Processing: Stop words removal and stemming</a>
4. <a href="#4">Train - Validation Split</a>
5. <a href="#5">Use BERT for text embeddings</a>
6. <a href="#6">Train the classifier</a>
7. <a href="#7">Test the classifier</a>
8. <a href="#8">Ideas for improvement</a>

Find more details on the KNN Classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes). *This field is a processed version of the votes field. People can click on the "helpful" button when they find a customer review helpful. This increases the vote by 1. __log_votes__ is calculated like this log(1+votes). This formulation helps us get a smaller range for votes.*
* __isPositive:__ Whether the review is positive or negative (1 or 0)
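As a small illustration of the __log_votes__ transform described above, here is a minimal sketch. It assumes the natural logarithm, and the helper names `log_votes` and `votes_from_log` are hypothetical. Under that assumption, the values 1.098612 and 2.397895 that appear in the preview rows correspond to 2 and 10 helpful votes.

```python
import math

def log_votes(votes: int) -> float:
    """Compress a raw helpful-vote count with log(1 + votes)."""
    return math.log1p(votes)

def votes_from_log(value: float) -> int:
    """Invert the transform to recover the raw vote count."""
    return round(math.expm1(value))

for votes in (0, 2, 10):
    print(votes, round(log_votes(votes), 6))
```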


## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

We will use the __pandas__ library to read our dataset.

In [2]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

print('The shape of the dataset is:', df.shape)

The shape of the dataset is: (70000, 6)


In [3]:
# IMDB Dataset
#df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)

#train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
#train_df.head()

#test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
#test_df.head()


Let's look at the first 10 rows of the dataset.

In [4]:
df.head(10)

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0
5,I purchased the home and business because I wa...,Quicken home and business not for amatures,True,1335312000,0.0,0.0
6,The download doesn't take long at all. And it'...,Great!,True,1377993600,0.0,1.0
7,This program is positively wonderful for word ...,Terrific for practice.,False,1158364800,2.397895,1.0
8,Fantastic protection!! Great customer support!!,Five Stars,True,1478476800,0.0,1.0
9,Obviously Win 7 now the last great operating s...,Five Stars,True,1471478400,0.0,1.0


## 2. <a name="2">Exploratory data analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the distribution of __isPositive__ field.

In [5]:
df["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64
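The counts above show that the classes are not balanced. A quick sketch of the arithmetic, using the two counts printed by `value_counts()` on the full dataset, gives the class shares and the accuracy that a trivial "always predict the majority class" rule would reach. This is a useful reference point when reading accuracy numbers later.

```python
# Class counts copied from the value_counts() output above.
counts = {1.0: 43692, 0.0: 26308}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(f"positive share: {shares[1.0]:.4f}")
print(f"negative share: {shares[0.0]:.4f}")
print(f"always predicting {majority_label} would score {shares[majority_label]:.4f} accuracy")
```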

We can check the number of missing values for each column below.

In [6]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields.

## 3. <a name="3">Text Processing: Stop words removal and stemming</a>
(<a href="#0">Go to top</a>)

In [7]:
df=df.dropna()
print(df.isna().sum())

reviewText    0
summary       0
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


In [8]:
!pip install transformers torch scikit-learn



## 4. <a name="4">Train - Validation Split</a>
(<a href="#0">Go to top</a>)

Let's split our dataset into training (90%) and validation (10%).

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(df["reviewText"], df['isPositive'], test_size=0.1, random_state=42)
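Because the two classes are imbalanced, one option is to stratify the split so that both parts keep the same class proportions. The sketch below is not what the notebook's split does; it only shows the `stratify` argument of `train_test_split` on toy data (the `texts` and `labels` lists are made up for illustration).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the review data: 60 positive and 40 negative labels.
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

# stratify=labels keeps the 60/40 class ratio in both parts of the split.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)

print(Counter(y_train))
print(Counter(y_val))
```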

## 5. <a name="5">Use BERT for text embeddings</a>
(<a href="#0">Go to top</a>)

You can use the Hugging Face Transformers library to load a pre-trained BERT model and tokenize your text data.

In [10]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [11]:
from transformers import BertTokenizer, BertModel
from tqdm import tqdm
from transformers import RobertaTokenizer, RobertaModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import XLNetTokenizer, XLNetModel

#tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
#bert_model = XLNetModel.from_pretrained('xlnet-base-cased').to(device)


#tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
#bert_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").to(device)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

#tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
#bert_model = RobertaModel.from_pretrained('roberta-base').to(device)

# Tokenize and encode the text data, then move the tensors to the selected device
max_length = 128
X_train_tokens = tokenizer(list(train_data), truncation=True, padding=True, max_length=max_length, return_tensors="pt", add_special_tokens=True).to(device)
X_test_tokens = tokenizer(list(test_data), truncation=True, padding=True, max_length=max_length, return_tensors="pt", add_special_tokens=True).to(device)

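# Calculate BERT embeddings for the text data.
# Each review is embedded as the mean of its last-layer token vectors.
# The mean runs over every position, so padding positions are included.
def get_bert_embeddings(tokens):
    embeddings = []
    for i in tqdm(range(len(tokens['input_ids']))):
        # Inference only: no gradients are needed
        with torch.no_grad():
            output = bert_model(input_ids=tokens['input_ids'][i].unsqueeze(0), attention_mask=tokens['attention_mask'][i].unsqueeze(0))
        # output[0] is the last hidden state, shape (1, seq_len, hidden_size)
        embeddings.append(output[0].squeeze().mean(dim=0).cpu().numpy())
    return embeddings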

X_train_bert_embeddings = get_bert_embeddings(X_train_tokens)
X_test_bert_embeddings = get_bert_embeddings(X_test_tokens)

100%|██████████| 62979/62979 [13:12<00:00, 79.48it/s]
100%|██████████| 6998/6998 [01:26<00:00, 80.51it/s]
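The `get_bert_embeddings` function above averages over every token position, including padding. A common alternative is to average only over real tokens by weighting with the attention mask. The sketch below shows that arithmetic with NumPy arrays so it runs without downloading a model; the function name `masked_mean_pool` and the toy values are assumptions for illustration, and this is not the pooling that produced the results reported in this notebook.

```python
import numpy as np

def masked_mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Two toy sequences of length 3 with hidden size 2; the 100.0 rows are padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[5.0, 6.0], [100.0, 100.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0],
                 [1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With the mask applied, the padded rows do not influence the result, whereas a plain mean over the sequence axis would be dominated by them in this toy example.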


## 6. <a name="6">Train the classifier</a>
(<a href="#0">Go to top</a>)

We train our classifiers with __.fit()__ on the training set.
Now that we have BERT embeddings for the review text, we can train a KNN model using scikit-learn. For comparison, we also fit a Gaussian Naive Bayes model and a Logistic Regression model on the same embeddings.

This code demonstrates how to use BERT text embeddings to train a KNN model for sentiment analysis of Amazon product reviews. You may also tune the model and the preprocessing steps to improve performance.
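To make the KNN idea concrete, here is a minimal from-scratch sketch: each query point takes the majority label among its k closest training points, measured by Euclidean distance. The function `knn_predict` and the 2-D toy points are hypothetical stand-ins for the 768-dimensional BERT vectors; scikit-learn's `KNeighborsClassifier` is the implementation actually used in this notebook and is far more efficient.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for query in X_query:
        distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to every training row
        nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Two small clusters: label 0 near the origin, label 1 near (1, 1).
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_train = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, [[0.05, 0.05], [0.95, 1.0]], k=3))
```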

In [12]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train_bert_embeddings, train_labels)

NB = GaussianNB()
NB.fit(X_train_bert_embeddings, train_labels)

# Train a classifier on BERT embeddings (you can use any classifier of your choice)
# Here, we'll use Logistic Regression as an example
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_bert_embeddings, train_labels)

#rf=RandomForestClassifier()
#rf.fit(X_train_bert_embeddings, train_labels)
#xgb=GradientBoostingClassifier()
#xgb.fit(X_train_bert_embeddings, train_labels)



## 7. <a name="7">Test the classifier</a>
(<a href="#0">Go to top</a>)

To evaluate the KNN model's performance on sentiment classification, you can generate a classification report and a confusion matrix. Here's how you can do it using scikit-learn:

In [13]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test data with each classifier and report accuracy
predictions = knn.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'KNN Accuracy: {accuracy * 100:.2f}%')

predictions = NB.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Naive Bayes Accuracy: {accuracy * 100:.2f}%')

predictions = lr.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Logistic Regression Accuracy: {accuracy * 100:.2f}%')

KNN Accuracy: 84.41%
Naive Bayes Accuracy: 66.36%
Logistic Regression Accuracy: 88.03%


In [14]:
from sklearn.metrics import classification_report, confusion_matrix

# Print a classification report and a confusion matrix for each classifier,
# in the order KNN, Naive Bayes, Logistic Regression
for model in (knn, NB, lr):
    # Make predictions on the test data
    predictions = model.predict(X_test_bert_embeddings)

    # Create a classification report and a confusion matrix
    class_report = classification_report(test_labels, predictions, target_names=['negative', 'positive'])
    conf_matrix = confusion_matrix(test_labels, predictions)

    print("Classification Report:")
    print(class_report)

    print("\nConfusion Matrix:")
    print(conf_matrix)




Classification Report:
              precision    recall  f1-score   support

    negative       0.78      0.81      0.79      2583
    positive       0.88      0.87      0.88      4415

    accuracy                           0.84      6998
   macro avg       0.83      0.84      0.83      6998
weighted avg       0.85      0.84      0.84      6998


Confusion Matrix:
[[2082  501]
 [ 590 3825]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.53      0.74      0.62      2583
    positive       0.80      0.62      0.70      4415

    accuracy                           0.66      6998
   macro avg       0.67      0.68      0.66      6998
weighted avg       0.70      0.66      0.67      6998


Confusion Matrix:
[[1918  665]
 [1689 2726]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.84      0.84      0.84      2583
    positive       0.91      0.90      0.90      4415

    accuracy         
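The numbers in a classification report can be derived directly from the confusion matrix. As a worked sketch, the code below takes the first confusion matrix in the output above (the KNN one) and recomputes accuracy, precision, and recall by hand. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# KNN confusion matrix copied from the output above:
# rows are true labels (negative, positive), columns are predicted labels.
conf_matrix = [[2082, 501],
               [590, 3825]]

tn, fp = conf_matrix[0]
fn, tp = conf_matrix[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of the reviews predicted positive, how many are positive
recall_pos = tp / (tp + fn)      # of the truly positive reviews, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy: {accuracy:.4f}")
print(f"negative: precision {precision_neg:.2f}, recall {recall_neg:.2f}")
print(f"positive: precision {precision_pos:.2f}, recall {recall_pos:.2f}")
```

The rounded values agree with the KNN classification report and with the 84.41% accuracy printed earlier.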

With the BERT model and KNN with 10 neighbors, we obtain these results:


|Model|Accuracy|
|------|-----|
|KNN   |  0.84    |
|Naive Bayes   | 0.66    |
|Logistic Regression | 0.88   |

As we can see, Logistic Regression obtains the highest accuracy, at 88%.


With the RoBERTa model on the same AMAZON-REVIEW-DATA-CLASSIFICATION.csv dataset, again using KNN with 10 neighbors and LogisticRegression(max_iter=1000), we obtain these results:


|Model|Accuracy|
|------|-----|
|KNN   |  0.82    |
|Naive Bayes   | 0.59    |
|Logistic Regression | 0.89  |

As we can see, Logistic Regression again obtains the highest accuracy, at 89%.

To use RoBERTa, we need to change the tokenizer and the model:

    from transformers import RobertaTokenizer, RobertaModel


    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    bert_model = RobertaModel.from_pretrained('roberta-base').to(device)

For the XLNet model:


    from transformers import XLNetTokenizer, XLNetModel

    tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    bert_model = XLNetModel.from_pretrained('xlnet-base-cased').to(device)

    #tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    #bert_model = XLNetModel.from_pretrained('xlnet-base-cased').to(device)


    #tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
    #bert_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").to(device)


    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

In the code above, every tokenizer and model pair is set up side by side. To switch models, comment out the active pair, uncomment the one you want, and re-run the embedding and training cells.
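As an alternative to commenting and uncommenting lines, the model choice could be driven by a single name. The sketch below is one possible way to do this, not the approach used in this notebook: the `CHECKPOINTS` registry and the `load_encoder` helper are hypothetical, and the helper assumes that `AutoTokenizer` and `AutoModel` from Transformers resolve each checkpoint to its base encoder (without a task head).

```python
# Hypothetical registry: short names mapped to the Hugging Face checkpoints mentioned in this notebook.
CHECKPOINTS = {
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "xlm-roberta": "xlm-roberta-base",
    "xlnet": "xlnet-base-cased",
}

def load_encoder(name, device="cpu"):
    """Return (tokenizer, model) for a registered short name."""
    # Imported lazily so the registry can be inspected without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint).to(device)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model

print(sorted(CHECKPOINTS))
```

With this in place, a call such as `tokenizer, bert_model = load_encoder("roberta", device)` would replace the block of commented lines, and the rest of the embedding code would stay unchanged.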
