## Natural Language Processing: Word Representations

**Goal of NLP**
---

In NLP, the main goal is to enable computers to understand and generate human language. For this to happen, text data must be translated into a form that computers can process numerically.

This involves converting words into vectors: numerical representations that capture **syntactic and semantic** meaning.


---
## 1. Bag of Words (BoW)
---

BoW is one of the simplest approaches to representing text.

- We create a **vocabulary** of all unique words in the dataset.
- Each document (or sentence) is represented as a **vector** whose length equals the vocabulary size.
- Each position in the vector corresponds to a word in the vocabulary, and its value is the number of times that word appears in the text.


### Example Corpus:
1. "The cat sat"
2. "The dog barked"
3. "The cat meowed"

Vocabulary: \[ \text{"the"}, \text{"cat"}, \text{"sat"}, \text{"dog"}, \text{"barked"}, \text{"meowed"} \]

### BoW Vector Representation:

| Sentence             | Vector                    |
|----------------------|---------------------------|
| The cat sat          | [1, 1, 1, 0, 0, 0]        |
| The dog barked       | [1, 0, 0, 1, 1, 0]        |
| The cat meowed       | [1, 1, 0, 0, 0, 1]        |
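
The table above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, assuming lowercase whitespace tokenization and a vocabulary ordered by first appearance; the helper names (`tokenize`, `build_vocabulary`, `bow_vector`) are illustrative and not part of any library.

```python
from collections import Counter

corpus = ["The cat sat", "The dog barked", "The cat meowed"]

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return text.lower().split()

def build_vocabulary(docs):
    # Collect unique words in order of first appearance.
    vocab = []
    for doc in docs:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)
    return vocab

def bow_vector(doc, vocab):
    # Count each vocabulary word in the document (0 if absent).
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]

vocab = build_vocabulary(corpus)
vectors = [bow_vector(doc, vocab) for doc in corpus]

print(vocab)
for sentence, vector in zip(corpus, vectors):
    print(sentence, vector)
```

Each vector has one slot per vocabulary word, so its length grows with the vocabulary even though most entries stay at zero.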

> Pros:
- Simple to implement
  
> Cons:
- Ignores word order and syntactic relationships
- Vectors are sparse and high-dimensional
- Cannot capture word meaning, similarity between words, or context

---
BoW can also use bigrams, trigrams, or general n-grams as tokens. The original BoW uses unigram tokenization.

**Unigrams:**

These are individual words. For example, in the sentence "The cat sat", the unigrams are:

"The", "cat", "sat"

**Bigrams:**

These are consecutive pairs of words. In the sentence "The cat sat", the bigrams are:

("The", "cat"), ("cat", "sat")

**N-Grams:**

These are generalized as groups of N consecutive words. For example, in the sentence "The cat sat on the mat", the trigrams (3-grams) are:

("The", "cat", "sat"), ("cat", "sat", "on"), ("sat", "on", "the"), ("on", "the", "mat")

Note: Bag-of-Words (BoW) treats each word (or n-gram) as a feature and counts how often each occurs in a document.
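
As a rough illustration of the definitions above, n-grams can be extracted by sliding a window of size N over the token list. The function name `ngrams` is chosen here for the sketch, and tokens are assumed to come from a simple whitespace split.

```python
def ngrams(tokens, n):
    # Slide a window of size n across the tokens; each window is one n-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("The cat sat".split(), 2)
trigrams = ngrams("The cat sat on the mat".split(), 3)

print(bigrams)
print(trigrams)
```

A sentence with T tokens yields T - N + 1 n-grams, which is why "The cat sat on the mat" (6 tokens) produces 4 trigrams.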


## 2. TF-IDF: Term Frequency – Inverse Document Frequency

---

**TF-IDF** improves BoW by not just counting word frequency, but also considering how **important** a word is across documents.

### Formula:

Let:
- tf(t, d): Term frequency = how often term \( t \) appears in document \( d \)
- idf(t): Inverse Document Frequency = how rare the term is across documents

**TF and IDF**:
- $\text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$
- $\text{IDF}(t) = \log \left( \frac{\text{total number of documents}}{\text{number of documents containing } t} \right) = \log \left( \frac{N}{df(t)} \right)$

It’s important to understand that IDF in TF-IDF reduces the weight of common words that frequently appear in most documents within the corpus, as these words contribute little to document classification. Conversely, it highlights less common words, making them more important for distinguishing the documents in which they appear.

\begin{equation}
\text{TF-IDF}(t, d) = tf(t, d) \times idf(t)
\end{equation}

Where:
- \( N \): Total number of documents
- \( df(t) \): Document frequency of the term, i.e., the number of documents containing the term \( t \)

### Intuition:
- Common words like "the", "is", "in" will have **low IDF**, reducing their weight.
- Rare but meaningful words like "quantum", "neural" will have **high IDF**, boosting their score.
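
To make the intuition concrete, here is a small sketch that computes TF-IDF by hand on the three-sentence corpus from the BoW section. It assumes term frequency is normalized by document length and uses the natural logarithm for IDF; the function names are illustrative. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Pre-tokenized toy corpus (same sentences as the BoW example).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "meowed"],
]

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / df(t)); assumes the term occurs in at least one document.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "cat", "sat"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
# the 0.0
# cat 0.135
# sat 0.366
```

"the" appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing, while "sat" appears in only one document and receives the highest weight.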

> Pros:
- Adds weighting to reflect importance

> Cons:
- Still loses semantic meaning, context, and word order

---



Let's use TF-IDF representations to perform topic modeling with a clustering model such as K-Means.

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [21]:
!cp /content/drive/MyDrive/Colab\ Notebooks/NLP/data/trc_dn.json .
!cp /content/drive/MyDrive/Colab\ Notebooks/NLP/data/months.json .

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("stopwords")

import string
import json
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
def load_json(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

def write_json(file_path, data):
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

In [7]:
input_file_path = "./trc_dn.json"
data = load_json(input_file_path)
data.keys()  # json file

dict_keys(['names', 'descriptions'])

In [9]:
type(data["names"]), type(data["descriptions"])

(list, list)

In [13]:
data["names"][:10]

['AARON, Thabo Simon',
 'ABBOTT, Montaigne',
 'ABDUL WAHAB, Zakier',
 'ABRAHAM, Nzaliseko Christopher',
 'ABRAHAMS, Achmat Fardiel',
 'ABRAHAMS, Annalene Mildred',
 'ABRAHAMS, Ashraf',
 'ABRAHAMS, Derrek',
 "ABRAHAMS, John (aka 'Gaika')",
 'ABRAHAMS, Moegsien']

In [14]:
data["descriptions"][:10]

["An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.",
 'A member of the SADF who was severely injured in a landmine explosion in Messina, Transvaal, on 5 May 1987.',
 'A member of QIBLA who disappeared in September 1988 after fleeing the country for political reasons. He had been detained several times before he left South Africa.',
 'A COSAS supporter who was kicked and beaten with batons and rifle-butts by members of the Ciskei Police during protests against the Ciskei government at Zwelitsha and Mdantsane, Ciskei, in September 1985.',
 'Was shot and blinded in one eye by members of the SAP in Athlone, Cape Town, in August 1976, after the Soweto uprising.',
 'Was shot and injured by members of the SAP in Robertson, Cape, on 11 

In [12]:
data

{'names': ['AARON, Thabo Simon',
  'ABBOTT, Montaigne',
  'ABDUL WAHAB, Zakier',
  'ABRAHAM, Nzaliseko Christopher',
  'ABRAHAMS, Achmat Fardiel',
  'ABRAHAMS, Annalene Mildred',
  'ABRAHAMS, Ashraf',
  'ABRAHAMS, Derrek',
  "ABRAHAMS, John (aka 'Gaika')",
  'ABRAHAMS, Moegsien',
  'ABRAHAMS, Rashid',
  'ABRAHAMS, Toyer',
  'ACHHURST, EM',
  'ACKERMAN, David Jacobus',
  'ACKERMAN, Marita',
  'ACKERMANN, Gerhardus Oliver',
  'ADAM, Alfred Mawonga',
  'ADAM, Alfred Mawonga',
  'ADAMS,  Zwelinzima Sidwell',
  'ADAMS, Koos',
  'ADAMS, Magadien',
  'ADAMS, Noel',
  'ADAMS, Sandra Joyce',
  'ADONIS,  Jacques Ferdinand',
  'ADONIS, Motlalepule Sunnyboy Slovo',
  'ADONIS, Sandra Noreen',
  'ADOONS, Phineus Zenzile',
  'ADRIAANSE, Noel John',
  'AFRICANDER, Sipho Victor',
  'AFRIKA, Anna',
  'AFRIKA, Dick',
  'AFRIKA, Jan',
  'AFRIKA, Pieter',
  'AGGETT, Joyce',
  'AGGETT, Neil Hudson',
  'ALA, Wezeka Getrude',
  'ALA, Wezeka Getrude',
  'ALBERT, Nombuyiselo Francis',
  'ALEXANDER, Tammas Edwar

In [29]:
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
months = load_json("./months.json")
stop_words + months

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [39]:
from typing import List

def load_json(file_path: str):
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

def write_json(file_path: str, data):
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

def clean_data(stop_words: List, text: str) -> str:
    clean_words = []
    # Lower case
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
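    # Remove case reference codes like AC/2001/023 using re.
    # Note: the text is already lowercased and stripped of punctuation at this point,
    # so this pattern no longer matches here.
    text = re.sub(pattern=r"^AC/\d{4}/\d{3}", repl="", string=text)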
    # Split
    words = text.split()
    # Remove Stop words and digits
    for word in words:
        if word not in stop_words and not word.isdigit():
            clean_words.append(word)
    # Changes the list back to string
    clean_data = " ".join(clean_words)
    # Final check to remove double spaces
    while "  " in clean_data:
        clean_data = clean_data.replace("  ", " ")
    return clean_data


def preprocess_data(docs: List[str]) -> List[str]:
    stop_words = stopwords.words("english")
    months = load_json("./months.json")  # add months to the stop words - we are removing months from the data
    stop_words = stop_words + months
    final_data = []
    for text in docs:
        cleaned_data = clean_data(stop_words, text)
        final_data.append(cleaned_data)
    return final_data
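
One detail in `clean_data` is worth a closer look: the `re.sub` call runs after lowercasing and punctuation removal, so a code such as `AC/2001/023` has already become `ac2001023` and the pattern cannot match. A possible alternative ordering is sketched below. The exact shape of the reference codes, the pattern `CODE_PATTERN`, the helper names, and the sample sentence are all assumptions made for illustration.

```python
import re
import string

# Assumed shape of a reference code: "AC/", a 4-digit year, "/", then digits,
# optionally wrapped in parentheses. Adjust to the real data.
CODE_PATTERN = re.compile(r"\(?AC/\d{4}/\d+\)?")

def strip_reference_codes(text):
    # Must run before punctuation removal, while the slashes are still present.
    return CODE_PATTERN.sub("", text)

def normalize(text):
    text = strip_reference_codes(text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() + join collapses any repeated whitespace.
    return " ".join(text.split())

sample = "He was granted amnesty (AC/2001/023)."  # hypothetical sentence
print(normalize(sample))  # he was granted amnesty
```

Because the pattern is not anchored with `^`, it removes codes anywhere in the text rather than only at the start of the string.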

In [47]:
# Load and Preprocess Data
input_file_path = "./trc_dn.json"
data = load_json(input_file_path)  # json
descriptions = data["descriptions"]
names = data["names"]

cleaned_docs = preprocess_data(descriptions)
cleaned_names = preprocess_data(names)

In [42]:
descriptions[0]

"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."

In [43]:
cleaned_docs[0]

'ancyl member shot severely injured sap members lephoi bethulie orange free state ofs april police opened fire gathering anc supporters house following dispute two neighbours one linked anc sap councillor'

In [44]:
names[0]

'AARON, Thabo Simon'

In [45]:
cleaned_names[0]

'aaron thabo simon'

In [48]:
# tf-idf vectorizer to represent the docs with salient features
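vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words="english",  # applied again to catch any stop words missed during preprocessing
                             ngram_range=(1, 3),    # creates unigrams, bigrams, trigrams
                             min_df=5,              # keep terms that appear in at least 5 documents
                             max_df=10,             # drop terms that appear in more than 10 documents
                             max_features=100)      # keep at most 100 features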

vectors = vectorizer.fit_transform(cleaned_docs)

In [50]:
feature_names = vectorizer.get_feature_names_out()
feature_names.shape, feature_names

((100,),
 array(['aberdeen', 'aberdeen cape', 'ac19980026', 'ac2000250',
        'african security force', 'amaqabane', 'azasm', 'backed',
        'bafokeng', 'basotho', 'bethal', 'carnarvon', 'charged arson',
        'charged convicted sentenced', 'charged sabotage',
        'charges withdrawn', 'chief mangope', 'club mufc', 'coerced ifp',
        'coerced ifp membership', 'colonel', 'committee led',
        'committee led mr', 'congress kyc khutsong', 'cosatu official',
        'dam', 'distributing', 'dladla', 'door', 'eagles', 'eagles club',
        'elizabeth security', 'employer', 'flat', 'gangraped',
        'guard shot', 'guguletu seven', 'ivory', 'ivory park',
        'journalist', 'june detained emergency', 'june operatives sadf',
        'june severely', 'kabokweni', 'kabokweni kangwane', 'kakamas',
        'kakamas cape', 'kutloanong', 'kwelerha', 'led mr',
        'local crossroads leader', 'mcilongo', 'middelburg police',
        'mk soldiers', 'motherwell', 'mr ndlovu', '

In [53]:
type(vectors), vectors.shape

(scipy.sparse._csr.csr_matrix, (21747, 100))

### fit_transform() and transform()

**fit_transform()** is used on the training data: it learns the parameters of a transformation from that data and applies the transformation in one step. For a scaler such as `StandardScaler`, the learned parameters are the mean and variance of each feature. For the `TfidfVectorizer` used above, they are the vocabulary and the IDF weight of each term. These learned parameters are then reused to transform the test data.

The fit method computes the parameters (mean and variance for a scaler, vocabulary and IDF weights for TF-IDF). The transform method applies those parameters to the data it is given.

We want the same transformation applied to the test data, but without the model learning anything from it. The test data should remain completely new to the model. The transform method handles this case: it reuses the parameters computed from the training data, so the test data is transformed in exactly the same way as the training data.

If we called fit on the test data as well, we would compute new parameters from it and let information about the test set leak into the model. The test data would no longer be unseen, and we would not get a reliable estimate of how the model performs on unseen data, which is the ultimate goal when building a machine learning model.
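
The split between fitting and transforming can be sketched with a toy vectorizer. `ToyTfidf` is a hypothetical class written for this illustration, not scikit-learn's implementation: it assumes whitespace tokenization, length-normalized TF, and unsmoothed `log(N / df)` IDF. The point is only that `fit` stores parameters learned from the training documents and `transform` reuses them unchanged.

```python
import math
from collections import Counter

class ToyTfidf:
    """Hypothetical minimal TF-IDF vectorizer, for illustration only."""

    def fit(self, docs):
        # Learn the vocabulary and IDF weights from the training documents.
        tokenized = [doc.lower().split() for doc in docs]
        self.vocab_ = sorted({tok for toks in tokenized for tok in toks})
        df = Counter(tok for toks in tokenized for tok in set(toks))
        self.idf_ = {tok: math.log(len(docs) / df[tok]) for tok in self.vocab_}
        return self

    def transform(self, docs):
        # Apply the learned parameters; words outside the vocabulary are ignored.
        rows = []
        for doc in docs:
            toks = doc.lower().split()
            counts = Counter(toks)
            rows.append([
                counts[tok] / len(toks) * self.idf_[tok] if toks else 0.0
                for tok in self.vocab_
            ])
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

train_docs = ["the cat sat", "the dog barked", "the cat meowed"]
test_docs = ["the dog sat quietly"]

vec = ToyTfidf()
train_vectors = vec.fit_transform(train_docs)  # learns vocabulary + IDF
test_vectors = vec.transform(test_docs)        # reuses them, learns nothing new

print(vec.vocab_)
print(len(train_vectors[0]), len(test_vectors[0]))
```

The test vector has the same length as the training vectors, and "quietly" has no slot because it was never seen during `fit`. Calling `fit` on `test_docs` instead would rebuild the vocabulary and IDF weights from the test set, which is the leakage described above.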