<a href="https://colab.research.google.com/github/Apoak/Deep-Learning-Projects/blob/main/Bag_of_Words_Model_Complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Lab 7.1: Bag of Words Model

In this lab you will use the bag of words model to learn author attribution with a [dataset of texts from Victorian authors](https://github.com/agungor2/Authorship_Attribution?tab=readme-ov-file).

In [None]:
import numpy as np
import sklearn
import pandas as pd

Here we download the CSV file containing the text snippets and author IDs.

In [None]:
# Quote the URL so the shell does not treat `&` as a background operator and drop `dl=1`
!wget --no-clobber -O Gungor_2018_VictorianAuthorAttribution_data-train.csv -q "https://www.dropbox.com/scl/fi/emk9db05t9u8yzgrjje7t/Gungor_2018_VictorianAuthorAttribution_data-train.csv?rlkey=kzvbl0mbpnrpjr4c3q18le6w2&dl=1"

In [None]:
df = pd.read_csv('Gungor_2018_VictorianAuthorAttribution_data-train.csv', encoding = "ISO-8859-1")
df.head()

In [None]:
text = list(df['text'])
labels = df['author'].values

In [None]:
text[0]

### Exercises

1. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to produce a term frequency vector for each text. Set `max_features=1000` to keep only the 1000 most frequent terms.

Prepare a 90/10 train-test split with `random_state=42`.

Train the default `MLPClassifier` from `sklearn.neural_network` on the data and report the train and test accuracy. You can pass `verbose=True` to `MLPClassifier` to monitor training.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [None]:
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(text)
# print(vectorizer.get_feature_names_out())
# print(X.toarray())



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, random_state=42)

In [None]:
mlp = MLPClassifier(verbose=True)
mlp.fit(X_train, y_train)

In [None]:
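# Predict on the training and test sets
y_train_pred = mlp.predict(X_train)
y_pred = mlp.predict(X_test)

# Report both accuracies, as the exercise asks
train_accuracy = accuracy_score(y_train, y_train_pred)
accuracy = accuracy_score(y_test, y_pred)
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")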



2. Repeat the steps above, this time using `TfidfVectorizer` to produce term frequency-inverse document frequency (TF-IDF) vectors.

Does the IDF weighting improve the results?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfid = TfidfVectorizer(max_features=1000)
# Fit the TF-IDF vectorizer itself, not the earlier CountVectorizer
X2 = tfid.fit_transform(text)

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, labels, test_size=0.1, random_state=42)

In [None]:
mlp2 = MLPClassifier(verbose=True)
mlp2.fit(X_train2, y_train2)

In [None]:
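# Predict on the training and test sets
y_train_pred2 = mlp2.predict(X_train2)
y_pred2 = mlp2.predict(X_test2)

# Report both accuracies for the TF-IDF features
train_accuracy2 = accuracy_score(y_train2, y_train_pred2)
accuracy2 = accuracy_score(y_test2, y_pred2)
print(f"Train Accuracy: {train_accuracy2:.4f}")
print(f"Test Accuracy: {accuracy2:.4f}")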

**Comparison:**
The two vectorization methods reached the same test accuracy in this run. The most notable difference was that the model trained on `TfidfVectorizer` features took more epochs to minimize the loss.
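
Because `MLPClassifier` initializes its weights randomly, epoch counts and accuracies can vary between runs unless `random_state` is set. The sketch below shows one way to compare the two vectorizers under the same seed. The helper `evaluate_vectorizer` and the toy data are hypothetical stand-ins for the lab's `text` and `labels`, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def evaluate_vectorizer(vectorizer, texts, labels, seed=42):
    """Hypothetical helper: vectorize, split 90/10, train an MLP, return accuracies."""
    features = vectorizer.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.1, random_state=seed
    )
    # A fixed random_state removes weight-initialization noise from the comparison.
    model = MLPClassifier(random_state=seed)
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc


# Made-up toy data standing in for the lab's `text` and `labels`.
toy_text = ["rain wind storm", "sun beach sand"] * 10
toy_labels = [0, 1] * 10

results = {
    "count": evaluate_vectorizer(CountVectorizer(max_features=1000), toy_text, toy_labels),
    "tfidf": evaluate_vectorizer(TfidfVectorizer(max_features=1000), toy_text, toy_labels),
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.4f}, test={test_acc:.4f}")
```

On the real dataset, the same helper could be called with `text` and `labels` in place of the toy inputs, so that both models share the split and the seed.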