# LinkedIn App Review Sentiment Analysis: Large-Scale Fine-Tuning of DistilBERT

In this project, we built a sentiment analysis model for LinkedIn app reviews from a dataset of more than 322,000 user reviews. We fine-tuned DistilBERT, a lightweight transformer model, on this data. The process involved preprocessing the data, converting star ratings into sentiment categories (Negative for ratings of 2 or below, Neutral for 3, Positive for 4 or above), and training the model with PyTorch and the Hugging Face Transformers library.

We implemented a custom dataset class that tokenizes each review on demand and used data loaders to manage batching. The model was trained for three epochs with a linear learning rate schedule and evaluated with standard classification metrics. On the held-out validation set of 32,265 reviews, the fine-tuned model reaches about 88% accuracy. It is strong on Positive and Negative reviews (F1 of 0.94 and 0.80) but weak on the much rarer Neutral class (F1 of 0.25).

This approach applies a pretrained transformer to a real-world business problem and offers a scalable way to analyze user feedback for professional networking applications.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
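from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# Use PyTorch's AdamW; the copy in transformers is deprecated.
# Note: torch's default weight_decay is 0.01, whereas the transformers version defaulted to 0.0.
from torch.optim import AdamW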
from torch.utils.data import Dataset, DataLoader
import torch
from sklearn.metrics import accuracy_score, classification_report

## Data Loading and Preprocessing

In [None]:
# Load the data
df = pd.read_csv('/content/drive/MyDrive/datasets/LINKEDIN_REVIEWS.csv')

# Convert ratings to sentiment
def rating_to_sentiment(rating):
    if rating <= 2:
        return 0  # Negative
    elif rating == 3:
        return 1  # Neutral
    else:
        return 2  # Positive

df['sentiment'] = df['review_rating'].apply(rating_to_sentiment)
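
App-store ratings often skew toward the extremes, so it can be worth checking how the three classes are distributed before training. The sketch below is a minimal, standard-library illustration. The `sample_ratings` list is made up for the example and is not drawn from the dataset, and the inverse-frequency weighting is one common heuristic, not something this notebook applies.

```python
from collections import Counter

def rating_to_sentiment(rating):
    """Map a star rating to 0 (Negative), 1 (Neutral), or 2 (Positive)."""
    if rating <= 2:
        return 0
    elif rating == 3:
        return 1
    return 2

# Hypothetical ratings, for illustration only (not taken from the dataset).
sample_ratings = [5, 5, 4, 1, 3, 5, 2, 4, 5, 1]

labels = [rating_to_sentiment(r) for r in sample_ratings]
counts = Counter(labels)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
n_samples, n_classes = len(labels), len(counts)
class_weights = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

print(counts)
print(class_weights)
```

Weights like these could, for instance, be passed to a weighted cross-entropy loss so that errors on the rare Neutral class count for more.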


In [None]:

# Split the data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['review_text'].tolist(), df['sentiment'].tolist(), test_size=0.1, random_state=42
)

print(f"Training samples: {len(train_texts)}")
print(f"Validation samples: {len(val_texts)}")

Training samples: 290376
Validation samples: 32265
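
The split above is purely random, so the class proportions in the validation set are only approximately those of the full data. One alternative is a stratified split, which `train_test_split` supports through its `stratify=` argument. The pure-Python sketch below shows the idea on toy data. `stratified_split` and the toy lists are hypothetical names introduced for illustration, and the results reported in this notebook come from the unstratified split.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.1, seed=42):
    """Split so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class (at least one sample).
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    def pick(seq, idxs):
        return [seq[i] for i in idxs]

    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Hypothetical toy data: 70 positive, 20 negative, 10 neutral reviews.
toy_labels = [2] * 70 + [0] * 20 + [1] * 10
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

tr_x, va_x, tr_y, va_y = stratified_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))
print(sorted(va_y))
```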


## Tokenization and Dataset Creation

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

class ReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Call the tokenizer directly; encode_plus is a legacy alias for the same behavior
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
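
To make the effect of `padding='max_length'` and `truncation=True` in `__getitem__` concrete, here is a simplified, dependency-free sketch of what happens to a sequence of token ids. The function `pad_or_truncate` and the example ids are hypothetical. The special-token ids (101, 102, 0) are assumed to match the uncased BERT vocabulary that `distilbert-base-uncased` uses, and the real tokenizer handles more cases than this.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids for [CLS], [SEP], [PAD]

def pad_or_truncate(token_ids, max_len=128):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    body = token_ids[: max_len - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    n_pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# Hypothetical token ids, for illustration only.
short_ids, short_mask = pad_or_truncate([2307, 10439], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_len=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Short reviews are padded with zeros that the attention mask tells the model to ignore, while long reviews lose everything past the length limit.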


In [None]:

# Create datasets
train_dataset = ReviewDataset(train_texts, train_labels, tokenizer)
val_dataset = ReviewDataset(val_texts, val_labels, tokenizer)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)
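
As a quick sanity check on these settings, the number of batches per epoch follows directly from the sample counts and batch sizes. The arithmetic below is a small sketch that assumes the sample counts printed by the split cell and the `DataLoader` default of `drop_last=False`.

```python
import math

n_train, n_val = 290376, 32265      # sample counts printed by the split cell
train_bs, val_bs = 32, 64           # batch sizes passed to the DataLoaders

# With drop_last left at its default (False), the final short batch is kept.
train_batches = math.ceil(n_train / train_bs)
val_batches = math.ceil(n_val / val_bs)

print(train_batches, val_batches)
# Sizes of the final, partially filled batches.
print(n_train % train_bs, n_val % val_bs)
```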

## Fine-tune the model

In [None]:
# Model, optimizer, and scheduler setup
from transformers import get_linear_schedule_with_warmup

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# num_labels=3 adds a freshly initialized 3-way classification head
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 3
total_steps = len(train_loader) * num_epochs  # one scheduler step per batch

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
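
With `num_warmup_steps=0`, the linear schedule configured above simply decays the learning rate from its initial value to zero over all training steps. The sketch below reproduces that multiplier in plain Python. `linear_schedule_lr` is a hypothetical helper, and the default `total_steps` of 27,225 assumes 9,075 batches per epoch (290,376 samples at batch size 32) over 3 epochs.

```python
def linear_schedule_lr(step, base_lr=2e-5, warmup_steps=0, total_steps=27225):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor

# Learning rate at the start of each epoch and at the very end of training.
for step in (0, 9075, 18150, 27225):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate is two thirds of its initial value at the start of epoch 2 and one third at the start of epoch 3.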


In [None]:
# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average training loss: {avg_train_loss:.4f}")

    # Validation
    model.eval()
    val_loss = 0
    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()

            preds = torch.argmax(outputs.logits, dim=1)
            predictions.extend(preds.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())

    avg_val_loss = val_loss / len(val_loader)
    print(f"Validation loss: {avg_val_loss:.4f}")

    # Per-class precision/recall/F1 on the validation set (0=Negative, 1=Neutral, 2=Positive)
    print(classification_report(true_labels, predictions))

Epoch 1/3, Average training loss: 0.3838
Validation loss: 0.3777
              precision    recall  f1-score   support

           0       0.77      0.83      0.80      6305
           1       0.53      0.04      0.07      2336
           2       0.90      0.97      0.94     23624

    accuracy                           0.87     32265
   macro avg       0.74      0.61      0.60     32265
weighted avg       0.85      0.87      0.85     32265

Epoch 2/3, Average training loss: 0.3415
Validation loss: 0.3750
              precision    recall  f1-score   support

           0       0.79      0.82      0.80      6305
           1       0.49      0.12      0.20      2336
           2       0.91      0.97      0.94     23624

    accuracy                           0.88     32265
   macro avg       0.73      0.64      0.65     32265
weighted avg       0.86      0.88      0.86     32265

Epoch 3/3, Average training loss: 0.3079
Validation loss: 0.3893
              precision    recall  f1-score
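
The log above shows training loss falling every epoch while validation loss bottoms out at epoch 2 and rises at epoch 3, which may indicate the start of overfitting. A minimal sketch of picking an epoch by validation loss, using only the numbers printed above, could look like this. Validation loss is just one possible criterion, since the final evaluation shows Neutral recall at its highest (0.17) after the last epoch.

```python
# Losses copied from the training log above.
history = [
    {"epoch": 1, "train_loss": 0.3838, "val_loss": 0.3777},
    {"epoch": 2, "train_loss": 0.3415, "val_loss": 0.3750},
    {"epoch": 3, "train_loss": 0.3079, "val_loss": 0.3893},
]

best = min(history, key=lambda row: row["val_loss"])
print(f"Lowest validation loss: epoch {best['epoch']} ({best['val_loss']:.4f})")

# Gap between validation and training loss, a rough overfitting signal.
for row in history:
    gap = row["val_loss"] - row["train_loss"]
    print(f"epoch {row['epoch']}: val - train = {gap:+.4f}")
```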

## Evaluate the model

In [None]:
model.eval()
predictions = []
true_labels = []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=1)
        predictions.extend(preds.cpu().tolist())
        true_labels.extend(labels.cpu().tolist())

accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy: {accuracy}")
print(classification_report(true_labels, predictions))



Accuracy: 0.8754687742135441
              precision    recall  f1-score   support

           0       0.79      0.80      0.80      6305
           1       0.44      0.17      0.25      2336
           2       0.91      0.96      0.94     23624

    accuracy                           0.88     32265
   macro avg       0.71      0.65      0.66     32265
weighted avg       0.86      0.88      0.86     32265
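
The gap between the macro and weighted averages in this report comes from class imbalance: the macro average treats the three classes equally, while the weighted average is dominated by the large Positive class. The sketch below recomputes both F1 averages from the rounded per-class values in the report, so small rounding differences from scikit-learn's own figures are possible.

```python
# Per-class F1 and support copied from the final classification report.
report = {
    "Negative": {"f1": 0.80, "support": 6305},
    "Neutral":  {"f1": 0.25, "support": 2336},
    "Positive": {"f1": 0.94, "support": 23624},
}

total = sum(row["support"] for row in report.values())
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

print(total)
print(round(macro_f1, 2))
print(round(weighted_f1, 2))
```

Because the Neutral class makes up only about 7% of the validation set, its low F1 barely moves the weighted average or the accuracy, which is why the macro average is the more informative summary here.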



## Save the fine-tuned model

In [None]:
model.save_pretrained("/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model")
tokenizer.save_pretrained("/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model")

('/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model/tokenizer_config.json',
 '/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model/special_tokens_map.json',
 '/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model/vocab.txt',
 '/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model/added_tokens.json')
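
The saved directory can later be reloaded with `from_pretrained`, without repeating the fine-tuning. The helper below is a minimal sketch. `load_sentiment_model` is a hypothetical name, and it assumes the same Drive path is mounted and the `transformers` package is installed in the new session.

```python
MODEL_DIR = "/content/drive/MyDrive/saved_models/fine_tuned_linkedin_sentiment_model"

def load_sentiment_model(model_dir=MODEL_DIR, device="cpu"):
    """Reload the fine-tuned model and tokenizer written by save_pretrained."""
    # Imported lazily so the helper can be defined before transformers is needed.
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained(model_dir)
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    model.to(device)
    model.eval()  # disable dropout for inference
    return model, tokenizer
```

In a fresh session this could be called as `model, tokenizer = load_sentiment_model(device=device)` before running the inference cells.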

## Use the model for inference

In [None]:
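def predict_sentiment(text):
    """Return 'Negative', 'Neutral', or 'Positive' for a single review string."""
    model.eval()  # make sure dropout is disabled before predicting

    # Tokenize with the same length limit used during training
    encoding = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length", truncation=True)
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        # The class with the highest logit is the predicted label
        prediction = torch.argmax(outputs.logits, dim=1)

    sentiment_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
    return sentiment_map[prediction.item()]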

# Example usage
new_review = "The LinkedIn app has greatly improved my professional networking experience."
sentiment = predict_sentiment(new_review)
print(f"Predicted sentiment: {sentiment}")

Predicted sentiment: Positive
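
The function above returns only the top label. When a confidence estimate is useful, the logits can be turned into class probabilities with a softmax (in the notebook, `torch.softmax(outputs.logits, dim=1)` would do this). The dependency-free sketch below shows the computation. The logits are hypothetical values chosen for illustration, not actual model output.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

sentiment_map = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Hypothetical logits for one review; real values would come from outputs.logits.
logits = [-1.2, 0.3, 2.5]
probs = softmax(logits)

for idx, p in enumerate(probs):
    print(f"{sentiment_map[idx]}: {p:.3f}")
```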


In [None]:
# Example usage with a negative review
negative_review = "The LinkedIn app is frustrating to use. It constantly crashes, and the user interface is confusing. Job search functionality is unreliable, and I often miss important notifications. The app feels outdated compared to other professional networking platforms. I'm considering deleting it."

sentiment = predict_sentiment(negative_review)
print(f"Review: {negative_review}")
print(f"Predicted sentiment: {sentiment}")

Review: The LinkedIn app is frustrating to use. It constantly crashes, and the user interface is confusing. Job search functionality is unreliable, and I often miss important notifications. The app feels outdated compared to other professional networking platforms. I'm considering deleting it.
Predicted sentiment: Negative
