
# <font color="red">**BoW vs. TF-IDF:**</font>
- Use BoW when you need a simple word frequency representation.
- Use TF-IDF when you want to down-weight words that appear in many documents and emphasize words that are distinctive to a few.

In [30]:
%pip install datasets



In [31]:
%pip install composable



In [32]:
import os
import composable_records as rec
import composable_tuples as tup

from composable import pipeable
from composable.strict import map, filter
from composable_object import obj
from composable_glob import glob

# You may need to install utility.py to use "with_open"
from composable_utility import apply, get, identity, with_open

In [33]:
# this is the root directory
!pwd

/content


## Loading the emotions dataset

The following dataset contains a large number of sentences, each labeled with the emotion it expresses.

In [34]:
from datasets import load_dataset

emotions = load_dataset("emotion")

In [35]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [36]:
(train_ds := emotions >> get('train')) #getting training set

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

In [37]:
train_ds.column_names

['text', 'label']

In [38]:
train_ds.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [39]:
train_ds[:5]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy'],
 'label': [0, 0, 3, 2, 3]}

### Preparing the dataset

In [40]:
(documents :=
 train_ds
 >> get('text')
) >> tup.head(5)

['i didnt feel humiliated',
 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'im grabbing a minute to post i feel greedy wrong',
 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
 'i am feeling grouchy']

In [101]:
(labels :=
 train_ds
 >> get('label')
) >> tup.head(5)

[0, 0, 3, 2, 3]

### Bag of Words

In Bag of Words (BoW), the higher the value, the more frequently a word appears in a document.

The score is not limited to a range of 0 to 1. The values represent raw word counts, meaning they can be any non-negative integer starting from 0 and increasing based on how often a word appears in a document.

- Minimum: 0 (if the word is absent in a document)
- Maximum: No fixed upper limit (depends on how frequently a word appears)
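
As a rough illustration of these raw counts, here is a minimal sketch in plain Python. The two toy sentences are made up, and the whitespace tokenization is a simplification: `CountVectorizer` uses its own token pattern, which by default ignores single-character tokens such as "i".

```python
from collections import Counter

# Toy corpus (illustrative only, not taken from the emotions dataset)
docs = [
    "i feel happy happy today",
    "i feel sad",
]

# Vocabulary: every distinct word, sorted so each word gets a stable column
vocab = sorted({word for doc in docs for word in doc.split()})

# One Counter per document, then one raw count per vocabulary word
counts_per_doc = [Counter(doc.split()) for doc in docs]
bow = [[counts[word] for word in vocab] for counts in counts_per_doc]

print(vocab)   # ['feel', 'happy', 'i', 'sad', 'today']
print(bow[0])  # [1, 2, 1, 0, 1]
print(bow[1])  # [1, 0, 1, 1, 0]
```

The value 2 for "happy" in the first row shows that counts are not capped at 1, and the zeros mark words that are absent from a document.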

In [103]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create the sparse feature set
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(documents)

# Create a training and test (validation) set
X_train, X_test, y_train, y_test = train_test_split(X, labels,
                                                    test_size = 0.2,
                                                    random_state=42)

# Train the model
classifier = MultinomialNB()

classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = classifier.predict(X_test)

(accuracy := accuracy_score(y_test, y_pred))

0.7375

### TF-IDF

## **TF-IDF Formula:**

**TF-IDF = TF × IDF**

Where:

- Term Frequency (TF): Measures how often a term appears in a document.
- Inverse Document Frequency (IDF): Measures how rare a term is across documents.

Note:
- Minimum score: 0. If a word does not appear in a document, its score is 0.
- With **L2 normalization** (the default `norm='l2'` in scikit-learn's *TfidfVectorizer*), each document vector is scaled to unit length, so every TF-IDF value lies between 0 and 1.
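
To make the formula and the normalization concrete, the sketch below computes TF-IDF by hand on a made-up corpus. It assumes raw counts for TF and the smoothed IDF that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`; the whitespace tokenization is a simplification, so the numbers will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew away",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed IDF (assumed to mirror scikit-learn's smooth_idf=True default)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]

for word, weight in zip(vocab, tfidf[0]):
    print(f"{word:>5}: {weight:.3f}")
```

In the first document, "cat" receives a larger weight than "the" because "the" occurs in every document, and each row has unit length after normalization.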

In [102]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create the sparse feature set
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(documents)

# Create a training and test (validation) set
X_train, X_test, y_train, y_test = train_test_split(X, labels,
                                                    test_size = 0.2,
                                                    random_state=42)

# Train the model
classifier = MultinomialNB()

classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = classifier.predict(X_test)

(accuracy := accuracy_score(y_test, y_pred))

0.6165625

## <font color="red"> Exercise 3.6.1 </font>

The file `train.csv` contains bits of text from three classic authors in the horror genre ([source](https://www.kaggle.com/c/spooky-author-identification/)).

**Preprocessing Tasks.**
1. Read the raw lines in using `with_open`.
2. Split the data into columns and extract the text into one list and the labels into another.
3. Clean up the text by making it lower case and removing any punctuation.
4. Map the text labels to numbers.

**ML tasks.** Test the performance of the naive Bayes classifier on both the Bag of Words and TF-IDF features.
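
For intuition about what the naive Bayes classifier does with word counts, here is a minimal pure-Python sketch of multinomial naive Bayes with add-one smoothing. The function names `train_nb` and `predict_nb` and the toy sentences are hypothetical; the smoothing value `alpha=1.0` is assumed to correspond to the `MultinomialNB` default, and this is not a substitute for the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from (text, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})

    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-alpha smoothing keeps unseen (word, class) pairs from scoring zero
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(text, model):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training and held-out data (illustrative only)
train = [
    ("i feel happy", "joy"),
    ("so happy and glad", "joy"),
    ("i feel sad", "sadness"),
    ("sad and lonely", "sadness"),
]
held_out = [("happy and glad", "joy"), ("lonely and sad", "sadness")]

model = train_nb(train)
predictions = [predict_nb(text, model) for text, _ in held_out]
accuracy = sum(p == y for p, (_, y) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

The accuracy here is the same quantity that `accuracy_score` reports: the fraction of held-out examples whose predicted label matches the true one.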

In [44]:
# Question:

# Your objective is to accurately identify the author of the sentences in the test set?

# What do you mean by labels? Is this a list of specific words for each author?
# We had a list of labels ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'] in the emotions dataset.
# Where can I find a list of labels?

# Am I supposed to split the text for each author?

In [45]:
# Your code here

### **Step 1: Read the raw lines in using with_open**

In [46]:
# This is my current location (root)
!pwd

/content


In [47]:
(paths :=
 "/content/train.csv"
 >> glob()
)

['/content/train.csv']

In [48]:
with open(paths[0], encoding="utf-8") as f:
    lines = f.readlines()
lines[:3]

['"id","text","author"\n',
 '"id26305","This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","EAP"\n',
 '"id17569","It never once occurred to me that the fumbling might be a mere mistake.","HPL"\n']

### **Step 2: Split the data into columns and extract the text into one list and the labels into other**

In [49]:
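# Hint: replace ', ' (comma plus space) inside the text with '- ' so that a later split on ','
# only separates the three columns; this assumes every in-text comma is followed by a space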
lines = [line.replace(', ', '- ') for line in lines]
lines[:3]

['"id","text","author"\n',
 '"id26305","This process- however- afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit- and return to the point whence I set out- without being aware of the fact; so perfectly uniform seemed the wall.","EAP"\n',
 '"id17569","It never once occurred to me that the fumbling might be a mere mistake.","HPL"\n']

In [50]:
split_lines = [line.split(',') for line in lines]
split_lines[:3]

[['"id"', '"text"', '"author"\n'],
 ['"id26305"',
  '"This process- however- afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit- and return to the point whence I set out- without being aware of the fact; so perfectly uniform seemed the wall."',
  '"EAP"\n'],
 ['"id17569"',
  '"It never once occurred to me that the fumbling might be a mere mistake."',
  '"HPL"\n']]
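
Replacing and then splitting only works while every comma inside the text is followed by a space. As an alternative sketch, the standard-library `csv` module understands quoted fields, so the commas in the text survive and the quotes are removed automatically. The two data rows below are the ones printed earlier; reading them from a string through `io.StringIO` is only for illustration.

```python
import csv
import io

# Header plus the first two rows of train.csv, as shown in the output above
raw = (
    '"id","text","author"\n'
    '"id26305","This process, however, afforded me no means of ascertaining '
    'the dimensions of my dungeon; as I might make its circuit, and return to '
    'the point whence I set out, without being aware of the fact; so perfectly '
    'uniform seemed the wall.","EAP"\n'
    '"id17569","It never once occurred to me that the fumbling might be a '
    'mere mistake.","HPL"\n'
)

# DictReader uses the header row as keys and handles the quoted commas
rows = list(csv.DictReader(io.StringIO(raw)))

texts = [row["text"] for row in rows]
authors = [row["author"] for row in rows]

print(authors)        # ['EAP', 'HPL']
print(texts[0][:30])  # This process, however, afforde
```

For the real file, the same reader can wrap a file handle opened with `open(path, encoding="utf-8", newline="")`.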

In [69]:
# Create a record
records = []
for row in split_lines[1:]:  # Skip header
    record = {
        "id": row[0].strip('"'),
        "text": row[1].strip('"'),
        # Remove the trailing newline first, then the surrounding quotes;
        # stripping the quotes first leaves a stray '"' (e.g. 'EAP"')
        "author": row[2].strip().strip('"')
    }
    records.append(record)

records[:3]

[{'id': 'id26305',
  'text': 'This process- however- afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit- and return to the point whence I set out- without being aware of the fact; so perfectly uniform seemed the wall.',
  'author': 'EAP'},
 {'id': 'id17569',
  'text': 'It never once occurred to me that the fumbling might be a mere mistake.',
  'author': 'HPL'},
 {'id': 'id11008',
  'text': 'In his left hand was a gold snuff box- from which- as he capered down the hill- cutting all manner of fantastic steps- he took snuff incessantly with an air of the greatest possible self satisfaction.',
  'author': 'EAP'}]

In [65]:
# double check if my records work correctly
records[0]['id']

'id26305'

### **Step 3: Clean up the text by making it lower case and removing any punctuation.**

#### **Comprehensions Solution:**

In [76]:
# Making text lowercase
text_lower = [record['text'].lower() for record in records]
text_lower[:3]

['this process- however- afforded me no means of ascertaining the dimensions of my dungeon; as i might make its circuit- and return to the point whence i set out- without being aware of the fact; so perfectly uniform seemed the wall.',
 'it never once occurred to me that the fumbling might be a mere mistake.',
 'in his left hand was a gold snuff box- from which- as he capered down the hill- cutting all manner of fantastic steps- he took snuff incessantly with an air of the greatest possible self satisfaction.']

In [77]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [78]:
# Tokenize each sentence into a list of words
from nltk.tokenize import word_tokenize

# Note: this tokenizes the original text, not text_lower, so capitalization is kept
text_words = [word_tokenize(record['text']) for record in records]
text_words[:1]

[['This',
  'process-',
  'however-',
  'afforded',
  'me',
  'no',
  'means',
  'of',
  'ascertaining',
  'the',
  'dimensions',
  'of',
  'my',
  'dungeon',
  ';',
  'as',
  'I',
  'might',
  'make',
  'its',
  'circuit-',
  'and',
  'return',
  'to',
  'the',
  'point',
  'whence',
  'I',
  'set',
  'out-',
  'without',
  'being',
  'aware',
  'of',
  'the',
  'fact',
  ';',
  'so',
  'perfectly',
  'uniform',
  'seemed',
  'the',
  'wall',
  '.']]

In [56]:
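# Removing any punctuation
# string.punctuation holds the 32 ASCII punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
from string import punctuation

# Translation table that maps every punctuation character to None (i.e. deletes it)
(punc_map := str.maketrans('', '', punctuation))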

{33: None,
 34: None,
 35: None,
 36: None,
 37: None,
 38: None,
 39: None,
 40: None,
 41: None,
 42: None,
 43: None,
 44: None,
 45: None,
 46: None,
 47: None,
 58: None,
 59: None,
 60: None,
 61: None,
 62: None,
 63: None,
 64: None,
 91: None,
 92: None,
 93: None,
 94: None,
 95: None,
 96: None,
 123: None,
 124: None,
 125: None,
 126: None}

In [80]:
# Remove punctuation from each word in the tokenized lists
# (tokens that were only punctuation, such as ';' and '.', become empty strings)
text_no_punc = [[word.translate(punc_map) for word in words] for words in text_words]
text_no_punc[:1]

[['This',
  'process',
  'however',
  'afforded',
  'me',
  'no',
  'means',
  'of',
  'ascertaining',
  'the',
  'dimensions',
  'of',
  'my',
  'dungeon',
  '',
  'as',
  'I',
  'might',
  'make',
  'its',
  'circuit',
  'and',
  'return',
  'to',
  'the',
  'point',
  'whence',
  'I',
  'set',
  'out',
  'without',
  'being',
  'aware',
  'of',
  'the',
  'fact',
  '',
  'so',
  'perfectly',
  'uniform',
  'seemed',
  'the',
  'wall',
  '']]
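
The empty strings in this output are tokens that consisted only of punctuation. A minimal sketch of one way to drop them, and to lowercase in the same pass, is shown below; the helper name `clean_tokens` and the short sample list are hypothetical.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
punc_map = str.maketrans("", "", punctuation)

def clean_tokens(tokens):
    """Lowercase each token, strip punctuation, and drop tokens left empty."""
    stripped = (tok.lower().translate(punc_map) for tok in tokens)
    return [tok for tok in stripped if tok]

# A few tokens in the same shape as the tokenized output above
sample = ["This", "process-", "however-", "dungeon", ";", "wall", "."]
print(clean_tokens(sample))  # ['this', 'process', 'however', 'dungeon', 'wall']
```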

#### **Pipeable Solution:**

In [127]:
# Pipeline version: lowercase, strip punctuation, then tokenize each text
(clean_words :=
 [record['text'] for record in records]
 >> pipeable(lambda texts: [s.lower() for s in texts])
 >> pipeable(lambda texts: [s.translate(punc_map) for s in texts])
 >> pipeable(lambda texts: [word_tokenize(s) for s in texts])
)[:2]

[['this',
  'process',
  'however',
  'afforded',
  'me',
  'no',
  'means',
  'of',
  'ascertaining',
  'the',
  'dimensions',
  'of',
  'my',
  'dungeon',
  'as',
  'i',
  'might',
  'make',
  'its',
  'circuit',
  'and',
  'return',
  'to',
  'the',
  'point',
  'whence',
  'i',
  'set',
  'out',
  'without',
  'being',
  'aware',
  'of',
  'the',
  'fact',
  'so',
  'perfectly',
  'uniform',
  'seemed',
  'the',
  'wall'],
 ['it',
  'never',
  'once',
  'occurred',
  'to',
  'me',
  'that',
  'the',
  'fumbling',
  'might',
  'be',
  'a',
  'mere',
  'mistake']]

### **Step 4: Map the text labels to numbers.**

I need to combine `clean_words` with the author in a single record before mapping the labels to numbers.
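
The key point is that there must be one numeric label per document, looked up from that document's own author. A minimal sketch with three hypothetical, shortened records (authors in the same order as the first three rows of the data):

```python
# Shortened stand-ins for the real records (illustrative only)
sample_records = [
    {"text": ["this", "process"], "author": "EAP"},
    {"text": ["it", "never"], "author": "HPL"},
    {"text": ["in", "his"], "author": "EAP"},
]

author_to_number = {"EAP": 0, "HPL": 1, "MWS": 2}

# One label per record, looked up from that record's author
sample_labels = [author_to_number[r["author"]] for r in sample_records]
print(sample_labels)  # [0, 1, 0]
```

Indexing with `[...]` raises a `KeyError` on an unexpected author string, which makes parsing problems visible instead of silently assigning a fallback value.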

In [128]:
# Labels from the Kaggle website - only 3 authors
labels = ['EAP', 'HPL', 'MWS']
label_map = {'EAP': 0, 'HPL': 1, 'MWS': 2}

# Map the label names to numbers
# (this converts only the three class names, not the author of each record)
numeric_labels = [label_map[label] for label in labels]
numeric_labels

[0, 1, 2]

In [125]:
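# Caution: numeric_labels has only three entries, so zip stops after three texts
# and assigns 0, 1, 2 by position rather than by each record's author
cleaned_data = [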
    {'text': text, 'label': label}
    for text, label in zip(clean_words, numeric_labels)
]

cleaned_data

[{'text': ['this',
   'process',
   'however',
   'afforded',
   'me',
   'no',
   'means',
   'of',
   'ascertaining',
   'the',
   'dimensions',
   'of',
   'my',
   'dungeon',
   'as',
   'i',
   'might',
   'make',
   'its',
   'circuit',
   'and',
   'return',
   'to',
   'the',
   'point',
   'whence',
   'i',
   'set',
   'out',
   'without',
   'being',
   'aware',
   'of',
   'the',
   'fact',
   'so',
   'perfectly',
   'uniform',
   'seemed',
   'the',
   'wall'],
  'label': 0},
 {'text': ['it',
   'never',
   'once',
   'occurred',
   'to',
   'me',
   'that',
   'the',
   'fumbling',
   'might',
   'be',
   'a',
   'mere',
   'mistake'],
  'label': 1},
 {'text': ['in',
   'his',
   'left',
   'hand',
   'was',
   'a',
   'gold',
   'snuff',
   'box',
   'from',
   'which',
   'as',
   'he',
   'capered',
   'down',
   'the',
   'hill',
   'cutting',
   'all',
   'manner',
   'of',
   'fantastic',
   'steps',
   'he',
   'took',
   'snuff',
   'incessantly',
   'with',
   '

In [134]:
# Define the label map for authors (mapping authors to numeric labels)
author_map = {'EAP': 0, 'HPL': 1, 'MWS': 2}  # Extend the map as needed

# Strip any stray quote or whitespace left on the author field, so the lookup
# finds 'EAP' instead of missing on 'EAP"' and falling back to -1
def clean_author(record):
    return record['author'].strip().strip('"')

# Combine clean_words and author info into a record
cleaned_data = [
    {'text': text, 'author': clean_author(record), 'label': author_map.get(clean_author(record), -1)}
    for text, record in zip(clean_words, records)
]

# Show the first three cleaned records with text, author, and numeric labels
cleaned_data[:3]

[{'text': ['this',
   'process',
   'however',
   'afforded',
   'me',
   'no',
   'means',
   'of',
   'ascertaining',
   'the',
   'dimensions',
   'of',
   'my',
   'dungeon',
   'as',
   'i',
   'might',
   'make',
   'its',
   'circuit',
   'and',
   'return',
   'to',
   'the',
   'point',
   'whence',
   'i',
   'set',
   'out',
   'without',
   'being',
   'aware',
   'of',
   'the',
   'fact',
   'so',
   'perfectly',
   'uniform',
   'seemed',
   'the',
   'wall'],
  'author': 'EAP',
  'label': 0},
 {'text': ['it',
   'never',
   'once',
   'occurred',
   'to',
   'me',
   'that',
   'the',
   'fumbling',
   'might',
   'be',
   'a',
   'mere',
   'mistake'],
  'author': 'HPL',
  'label': 1},
 {'text': ['in',
   'his',
   'left',
   'hand',
   'was',
   'a',
   'gold',
   'snuff',
   'box',
   'from',
   'which',
   'as',
   'he',
   'capered',
   'down',
   'the',
   'hill',
   'cutting',
   'all',
   'manner',
   'of',
   'fantastic',
   'steps',
   'he',
   'took',
   