
**Project Choice**

This project predicts the rating of an airline review from the text of the review. Automating this step lets an airline categorize large volumes of customer feedback and gauge satisfaction without reading every review by hand. Accurate rating predictions help the airline identify areas for improvement and address customer concerns, which can in turn support customer retention, brand reputation, and business performance.


In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from langdetect import detect  # For language detection

# Downloading NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Loading the CSV data into a DataFrame
data = pd.read_csv('singapore_airlines_reviews.csv')  # Adjust the path if the CSV file is stored elsewhere

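# Function that returns True for English text and False otherwise
def filter_non_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        # detect() fails on empty or non-text input; treat such rows as non-English
        return False

# Keeping only English reviews and resetting the index so that the
# column-wise concatenation further down lines up row by row
data = data[data['text'].apply(filter_non_english)].reset_index(drop=True)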

# The columns used below ('type', 'rating', 'text') already have English names, so no renaming is needed

# Tokenization (lowercased; keeps alphabetic tokens only, dropping punctuation and numbers)
data['tokens'] = data['text'].apply(lambda x: [word for word in word_tokenize(x.lower()) if word.isalpha()])

# Removing stopwords
stop_words = set(stopwords.words('english'))
data['tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

# Lemmatization process
lemmatizer = WordNetLemmatizer()
data['tokens'] = data['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

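# Text representation using Bag-of-Words
# Note: the vectorizer is fit on the raw review text, not on the 'tokens' column built above
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])
# Reusing data.index keeps the rows aligned with `data` in the concat below;
# toarray() builds a dense matrix, which can be memory-heavy for a large vocabulary
text_representation = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=data.index)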

# Selecting only the desired columns from the original DataFrame
data_selected_columns = data[['type', 'rating', 'text']]

data_with_representation = pd.concat([data_selected_columns, text_representation], axis=1)

# Displaying the DataFrame with tokenization and text representation
data_with_representation.head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,type,rating,text,00,000,000ft,001,0010,0011,0025,...,천하의,첫째,태초에,하나님의,하나님이,하늘이라,하시고,하시니,혼돈하고,흑암이
0,review,3.0,We used this airline to go from Singapore to L...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,review,5.0,The service on Singapore Airlines Suites Class...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,review,1.0,"Booked, paid and received email confirmation f...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,review,5.0,"Best airline in the world, seats, food, servic...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,review,2.0,Premium Economy Seating on Singapore Airlines ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
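
The wide, mostly zero table above is what a Bag-of-Words representation looks like: one column per vocabulary term and one count per review. As a minimal sketch of the idea, the hypothetical helper `bag_of_words` below builds the same kind of matrix with the standard library only. It splits on whitespace, which is a simplification; `CountVectorizer` applies its own tokenization rules, so its vocabulary will differ.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents.

    Simplified illustration: tokens are lowercased and split on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct token, in sorted order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for terms that do not occur in this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Two made-up review snippets, for illustration only
docs = ["great seats great food", "food was cold"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has as many entries as there are vocabulary terms, which is why the real table grows to thousands of columns once every distinct token in the reviews gets its own column.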


In [None]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=9fb465b6aba8fa2f8992cc21515ad00ff4069deea1dc40d28a263db3ecd145b7
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b

**Pre-trained Model**

BERT (Bidirectional Encoder Representations from Transformers) is the pre-trained model used for this task. BERT performs strongly across many natural language processing (NLP) tasks, including text classification, because it represents each word in the context of the words around it. That contextual representation suits airline reviews, where the same word can signal praise or complaint depending on its context. BERT's pre-trained weights also enable transfer learning: the model can be fine-tuned on a domain-specific dataset such as airline reviews instead of being trained from scratch.
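
Fine-tuning for rating prediction also requires a decision on how the rating is encoded as a training target. Two common framings are classification (one class per rating value) and regression (the rating as a number). The sketch below shows both encodings. It assumes ratings are whole numbers from 1 to 5, which the sample rows suggest but the notebook does not state, and the helper names are hypothetical.

```python
def rating_to_class(rating, min_rating=1):
    """Map a rating such as 1..5 to a zero-based class index such as 0..4."""
    return int(rating) - min_rating


def class_to_rating(class_index, min_rating=1):
    """Inverse of rating_to_class: map a class index back to a rating."""
    return class_index + min_rating


# Ratings of the five sample rows displayed earlier in the notebook
ratings = [3.0, 5.0, 1.0, 5.0, 2.0]

# Classification framing: integer class indices (assumed 5 classes)
class_targets = [rating_to_class(r) for r in ratings]

# Regression framing: the rating itself as a float
regression_targets = [float(r) for r in ratings]

print(class_targets)
print(regression_targets)
```

With the classification framing, the model would need one output per class; with the regression framing, a single output is enough.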


In [None]:
from transformers import BertTokenizer

# Instantiating the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def preprocess_text(text, max_length):
    tokenized = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')
    return tokenized['input_ids'], tokenized['attention_mask']

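
`preprocess_text` asks the tokenizer for `padding='max_length'` and `truncation=True`, so every review becomes a fixed-length sequence plus an attention mask. The following is a simplified sketch of that idea only, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are placeholders, and a real BERT tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])      # cut sequences that are too long
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks padding, which the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding


# A short sequence is padded up to max_length
short_ids, short_mask = pad_or_truncate([101, 2307, 3462, 102], max_length=6)
print(short_ids, short_mask)

# A long sequence is cut down to max_length
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(long_ids, long_mask)
```

With `MAX_LENGTH = 128`, reviews longer than 128 tokens lose their tail, so the model never sees the end of long reviews.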

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from sklearn.model_selection import train_test_split

# maximum sequence length for BERT
MAX_LENGTH = 128

# number of labels
NUM_LABELS = 1  # A single output makes BertForSequenceClassification a regression head (MSE loss) that predicts the rating as a number

# Splitting data into train and test sets
train_df, test_df = train_test_split(data_with_representation, test_size=0.2, random_state=42)

# Preprocess train and test data
train_inputs = [preprocess_text(text, MAX_LENGTH) for text in train_df['text']]
test_inputs = [preprocess_text(text, MAX_LENGTH) for text in test_df['text']]

# Converting labels to tensors
train_labels = torch.tensor(train_df['rating'].values)
test_labels = torch.tensor(test_df['rating'].values)

batch_size = 32

# Unzipping train_inputs
train_input_ids, train_attention_masks = zip(*train_inputs)

# Converting to tensors
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)

# Checking mismatch in the lengths
min_length = min(len(train_input_ids), len(train_attention_masks), len(train_labels))
train_labels = train_labels[:min_length]
train_input_ids = train_input_ids[:min_length]
train_attention_masks = train_attention_masks[:min_length]

# Creating TensorDataset for train data
train_data = TensorDataset(train_input_ids, train_attention_masks, train_labels.unsqueeze(1))  # Add unsqueeze to make labels compatible with input shape

test_input_ids, test_attention_masks = zip(*test_inputs)
test_input_ids = torch.cat(test_input_ids, dim=0)
test_attention_masks = torch.cat(test_attention_masks, dim=0)
test_labels = test_labels[:len(test_input_ids)]  # Ensure labels match input length

# Creating TensorDataset for test data
test_data = TensorDataset(test_input_ids, test_attention_masks, test_labels.unsqueeze(1))  # Add unsqueeze to make labels compatible with input shape

# Creating DataLoaders
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)  # Shuffle training batches each epoch
test_loader = DataLoader(test_data, batch_size=batch_size)

# Loading pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=NUM_LABELS)

# Defining optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop
num_epochs = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()

for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        # Moving the batch to the same device as the model
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device).float()

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluation
model.eval()
test_loss = 0.0
correct_predictions = 0
total_predictions = 0

with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        test_loss += loss.item()
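        # With a single regression output, the argmax over dim 1 is always 0,
        # so the predicted rating is the output rounded to the nearest integer
        predicted_labels = torch.round(logits.squeeze(1))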
        correct_predictions += (predicted_labels == labels.squeeze(1)).sum().item()  # Squeeze to match label format
        total_predictions += labels.size(0)

# Calculating evaluation metrics
accuracy = correct_predictions / total_predictions
average_loss = test_loss / len(test_loader)

print(f"Accuracy: {accuracy}")
print(f"Average Loss: {average_loss}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 1.0
Average Loss: 7.220044045847224
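
Because the model is loaded with `num_labels=1`, its output is a continuous score rather than a class index, and the reported loss is a mean squared error. One way to score such predictions, sketched below, is to round each score to the nearest rating for an accuracy figure and to report the mean absolute error alongside it. The function name, the 1 to 5 rating range, and the sample numbers are assumptions for illustration; they are not results from this notebook.

```python
def score_rating_predictions(predictions, targets, min_rating=1, max_rating=5):
    """Return (accuracy after rounding, mean absolute error)."""
    # Round each continuous score and clamp it to the assumed rating range
    rounded = [min(max(round(p), min_rating), max_rating) for p in predictions]
    accuracy = sum(r == t for r, t in zip(rounded, targets)) / len(targets)
    mae = sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)
    return accuracy, mae


# Made-up model outputs and true ratings, for illustration only
predictions = [4.6, 1.2, 2.9, 3.6]
targets = [5, 1, 3, 3]

accuracy, mae = score_rating_predictions(predictions, targets)
print(f"Accuracy after rounding: {accuracy}")
print(f"Mean absolute error: {mae:.3f}")
```

The mean absolute error stays informative even when rounded accuracy is low, since it shows how far off the predictions are in rating points.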


In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from sklearn.model_selection import train_test_split

# Hyperparameters
learning_rates = [1e-5, 2e-5, 3e-5]
batch_sizes = [16, 32, 64]
num_epochs = 3

# best hyperparameters and their performance
best_accuracy = 0
best_hyperparameters = {}
best_model_state = None

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for lr in learning_rates:
    for batch_size in batch_sizes:
        # Creating DataLoaders with current batch_size
        train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)  # Shuffle training batches each epoch
        test_loader = DataLoader(test_data, batch_size=batch_size)

        # Loading pre-trained BERT model for sequence classification
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=NUM_LABELS)
        model.to(device)

        # Defining optimizer with current learning rate
        optimizer = AdamW(model.parameters(), lr=lr)

        # Training loop
        for epoch in range(num_epochs):
            model.train()
            for batch in train_loader:
                input_ids, attention_mask, labels = batch
                input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device).float()

                optimizer.zero_grad()
                outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                loss.backward()
                optimizer.step()

        # Evaluation
        model.eval()
        test_loss = 0.0
        correct_predictions = 0
        total_predictions = 0

        with torch.no_grad():
            for batch in test_loader:
                input_ids, attention_mask, labels = batch
                input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device).float()

                outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                logits = outputs.logits

                test_loss += loss.item()
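                # With a single regression output, the argmax over dim 1 is always 0,
                # so the predicted rating is the output rounded to the nearest integer
                predicted_labels = torch.round(logits.squeeze(1))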
                correct_predictions += (predicted_labels == labels.squeeze(1)).sum().item()
                total_predictions += labels.size(0)

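        accuracy = correct_predictions / total_predictions
        # Note: choosing hyperparameters by test-set accuracy makes the reported best accuracy optimistic
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_hyperparameters = {'learning_rate': lr, 'batch_size': batch_size}
            # Save a CPU copy of the weights so they do not keep GPU memory alive
            best_model_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}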

# Print best hyperparameters
print(f"Best Hyperparameters: {best_hyperparameters}")
print(f"Best Accuracy: {best_accuracy}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Best Hyperparameters: {'learning_rate': 1e-05, 'batch_size': 16}
Best Accuracy: 1.0
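
The grid search above compares configurations by their accuracy on the test split. A common alternative is to hold out a separate validation split for choosing hyperparameters and to keep the test split for the final score. The sketch below shows that structure with the standard library only. `three_way_split`, `select_best`, and `dummy_validation_score` are hypothetical names, and the dummy scorer merely stands in for "fine-tune on the training indices and score on the validation indices".

```python
import random


def three_way_split(n_items, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train, validation, and test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    n_val = int(n_items * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


def select_best(grid, validation_score):
    """Return the configuration with the highest validation score."""
    return max(grid, key=validation_score)


train_idx, val_idx, test_idx = three_way_split(100)

# Same grid as in the cell above
grid = [
    {'learning_rate': lr, 'batch_size': bs}
    for lr in (1e-5, 2e-5, 3e-5)
    for bs in (16, 32, 64)
]


def dummy_validation_score(config):
    # Hypothetical stand-in: a real scorer would train and evaluate a model
    return -abs(config['learning_rate'] - 2e-5) - abs(config['batch_size'] - 32) / 1000


best = select_best(grid, dummy_validation_score)
print(len(train_idx), len(val_idx), len(test_idx))
print(best)
```

Only the selected configuration would then be evaluated once on `test_idx`, so the test score is not influenced by the search.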
