<a href="https://colab.research.google.com/github/Mehak-Kamran/jupiterNotebook/blob/main/Assignment3a.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import nltk
from nltk.corpus import brown
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Download Brown Corpus
nltk.download('brown')

# Load the Brown Corpus categories and documents
categories = brown.categories()
Xdoc = []
Y = []

for category in categories:
    for fileid in brown.fileids(category):
        Xdoc.append(" ".join(brown.words(fileid)))  # Join words into a single string (document)
        Y.append(category)  # Append the genre

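# For quick testing, limit the dataset size.
# NOTE: Xdoc is ordered by category, so the first 200 documents cover only the
# genres that come first in brown.categories(), not the whole corpus.
X_train_small, X_test_small, y_train_small, y_test_small = train_test_split(Xdoc[:200], Y[:200], test_size=0.2, random_state=42)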

# Initialize DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', clean_up_tokenization_spaces=False)

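# Tokenize the data. truncation=True cuts each document to the model's maximum
# input length (512 tokens), so only the start of each document is embedded.
train_encodings = tokenizer(X_train_small, truncation=True, padding=True, return_tensors="pt")
test_encodings = tokenizer(X_test_small, truncation=True, padding=True, return_tensors="pt")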

print("Tokenization complete")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
train_encodings = {key: val.to(device) for key, val in train_encodings.items()}
test_encodings = {key: val.to(device) for key, val in test_encodings.items()}

# Load pre-trained DistilBERT model to obtain embeddings
distilbert_model = DistilBertModel.from_pretrained('distilbert-base-uncased')
distilbert_model.to(device)

print("DistilBERT model loaded")

# Get embeddings (use the [CLS] token, first token of each sequence)
with torch.no_grad():
    train_embeddings = distilbert_model(**train_encodings).last_hidden_state[:, 0, :].cpu().numpy()
    test_embeddings = distilbert_model(**test_encodings).last_hidden_state[:, 0, :].cpu().numpy()

print("Embeddings obtained")

# Train a Logistic Regression classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_embeddings, y_train_small)

# Make predictions on the test data
y_pred = classifier.predict(test_embeddings)

# Evaluate the model (average='weighted': per-genre scores weighted by each genre's count in y_test_small)
accuracy = accuracy_score(y_test_small, y_pred)
precision = precision_score(y_test_small, y_pred, average='weighted')
recall = recall_score(y_test_small, y_pred, average='weighted')
f1 = f1_score(y_test_small, y_pred, average='weighted')

print("DistilBERT + Logistic Regression Results:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
Tokenization complete
Using device: cpu


DistilBERT model loaded
Embeddings obtained
DistilBERT + Logistic Regression Results:
Accuracy: 0.675
Precision: 0.7495833333333334
Recall: 0.675
F1-Score: 0.6994791666666667


  _warn_prf(average, modifier, msg_start, len(result))


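Accuracy: 0.675

The model classified 67.5% of the test documents correctly, i.e. 27 of the 40 held-out documents (20% of the 200-document subset).

Precision: 0.750

For a single genre, precision is the fraction of documents predicted as that genre that actually belong to it. The reported value is the support-weighted average of the per-genre precisions (`average='weighted'`), so 0.750 summarizes precision across genres rather than describing any one genre. The truncated `_warn_prf` line in the output is the tail of a scikit-learn warning that some per-genre score was undefined (typically because a genre was never predicted) and was counted as 0 in the average.

Recall: 0.675

For a single genre, recall is the fraction of documents of that genre that the model identified correctly. With `average='weighted'`, each genre's recall is weighted by its number of test documents, which makes weighted recall equal to accuracy; that is why both are 0.675.

F1-Score: 0.699

For a single genre, F1 is the harmonic mean of precision and recall. The reported value is the support-weighted average of the per-genre F1 scores, not the harmonic mean of the two averaged figures above. At 0.699 it sits between the weighted recall and the weighted precision, but with only 40 test documents all four figures are rough estimates.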
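
The averaging behind the reported precision, recall, and F1 values (`average='weighted'` in the code cell) can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper `weighted_scores` and the toy labels are hypothetical and are not the notebook's data. It assumes undefined per-genre ratios are counted as 0, which matches scikit-learn's default behavior when it emits the `_warn_prf` warning.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # true documents per genre
    predicted = Counter(y_pred)    # predictions per genre
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label in labels:
        # Undefined ratios (zero denominator) are counted as 0.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support[label] / n  # genre's share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return accuracy, precision, recall, f1


# Hypothetical toy labels; "humor" is never predicted.
y_true = ["news"] * 6 + ["fiction"] * 4 + ["humor"] * 2
y_pred = (["news"] * 2 + ["fiction"] * 4   # true genre: news
          + ["fiction"] * 4                # true genre: fiction
          + ["fiction"] * 2)               # true genre: humor

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
# Harmonic mean of the two *averaged* figures, for comparison with f1.
harmonic = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} harmonic={harmonic:.3f}")
```

On these toy labels, weighted recall comes out equal to accuracy, while the weighted F1 is noticeably lower than the harmonic mean of the weighted precision and weighted recall. The same relationships apply to the figures reported above.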