### Importing Libraries

In [None]:
import os
import sys
import numpy as np
import pandas as pd

### Adding `utils` directory to `PYTHONPATH`

In [None]:
sys.path.append(os.path.abspath("../utils"))

### Reading Cleaned Data

In [None]:
# Importing load_csv function from read_data module
from read_data import load_csv

In [None]:
# Loading cleaned data
cleaned_df = load_csv('clean_data', 'cleaned_data.csv')
cleaned_df.head()

### Text Vectorization

##### Vectorization
- Vectorization is the process of converting text into numeric vectors so that a computer can work with it.
- Computers don't understand text directly; they operate on numbers.
- So, before we feed text to a machine learning model, we need to turn the words into numerical form.
- Imagine training a model to detect whether a message is positive or negative.
- If you just give it the sentence:
> "I love this movie!"
- The model doesn't know what "love" or "movie" means.
- You need to translate those words into numbers first.

<hr>

Suppose we have three sentences that we want to convert into numbers:
- `I love NLP`
- `I love deep learning`
- `NLP is fun`

##### Step 1 : Build a Vocabulary (Unique Words)
- From all 3 sentences, list every unique word:

| Word         |
|--------------|
| I            |
| love         |
| NLP          |
| deep         |
| learning     |
| is           |
| fun          |

- So the vocabulary size is 7.
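
The vocabulary step can be sketched in a few lines of plain Python. This is a minimal illustration that assumes whitespace tokenization and keeps the original casing; the variable names are chosen for this example only.

```python
# The three example sentences from the text.
sentences = ["I love NLP", "I love deep learning", "NLP is fun"]

# Collect unique words in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():  # naive whitespace tokenization (assumption)
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)       # ['I', 'love', 'NLP', 'deep', 'learning', 'is', 'fun']
print(len(vocabulary))  # 7
```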

##### Step 2 : Bag of Words (BoW)
Now we represent each sentence as a vector of word counts using the vocabulary above.

| Sentence                 | I | love | NLP | deep | learning | is | fun |
|--------------------------|---|------|-----|------|----------|----|-----|
| I love NLP | 1 | 1 | 1   | 0    | 0        | 0  | 0   |
| I love deep learning | 1 | 1    | 0   | 1    | 1        | 0  | 0   |
| NLP is fun | 0 | 0    | 1   | 0    | 0        | 1  | 1   |

- Each row = one sentence.
- Each column = one word from the vocabulary.
- Each cell = how many times that word appeared in that sentence.
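
The count vectors in the table can be reproduced with a short sketch. It assumes whitespace tokenization and a fixed vocabulary order; `bag_of_words` is a helper written for this illustration, not a library function.

```python
from collections import Counter

sentences = ["I love NLP", "I love deep learning", "NLP is fun"]
vocabulary = ["I", "love", "NLP", "deep", "learning", "is", "fun"]

def bag_of_words(sentence, vocabulary):
    """Return one count per vocabulary word for the given sentence."""
    counts = Counter(sentence.split())
    # Counter returns 0 for words that do not occur in the sentence.
    return [counts[word] for word in vocabulary]

bow_matrix = [bag_of_words(s, vocabulary) for s in sentences]

for sentence, row in zip(sentences, bow_matrix):
    print(f"{sentence:<22} {row}")
```

Each printed row should match the corresponding row of the table.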

<hr>

##### Stop Words
- Stop words are the common, frequently occurring words in a language that don't carry much meaning on their own.
- **Example** : `the`, `is`, `in`, `on`, `and`, `a`, `an`, `to`, `of`, `with`, `for`, `from`, `that`, `this`, `it`, etc.
- They are often removed from text because:
    - They don't help distinguish between documents/sentences.
    - They add noise and increase vector size in models like Bag of Words (BoW).
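
As a rough sketch, stop-word removal is a filter over the tokens. The `STOP_WORDS` set below simply reuses the examples listed above and is far from a complete list; `remove_stop_words` is a hypothetical helper that lowercases and splits on whitespace, with no punctuation handling.

```python
# Small hand-picked stop-word set taken from the examples above (not exhaustive).
STOP_WORDS = {
    "the", "is", "in", "on", "and", "a", "an", "to", "of",
    "with", "for", "from", "that", "this", "it",
}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

filtered = remove_stop_words("He is running in the park")
print(filtered)  # ['he', 'running', 'park']
```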

<hr>

##### What is Stemming?
- Stemming is the process of reducing a word to its base or root form (called the "stem"), usually by stripping suffixes. The stem need not be a dictionary word, as `studi` shows below.

| Word | Stem |
|:--:|:---:|
| running | run |
| runs | run |
| runner | runner |
| studied | studi |
| studies | studi |

##### Problem with BoW without Stemming
BoW treats every distinct word form as a separate feature, even when the forms are grammatically related. For example:
1. `"He is running in the park"`
2. `"She runs every day"`
- Without stemming, `running` ≠ `runs`.
- So BoW thinks these are completely different words, even though they refer to the same root concept: "`run`".

##### BoW Matrix (Without Stemming)
Vocabulary: `["he", "is", "running", "in", "the", "park", "she", "runs", "every", "day"]`

| Sentence          | he | is | running | in | the | park | she | runs | every | day |
|-------------------|----|----|---------|----|-----|------|-----|------|-------|-----|
| Sentence 1        | 1  | 1  | 1       | 1  | 1   | 1    | 0   | 0    | 0     | 0   |
| Sentence 2        | 0  | 0  | 0       | 0  | 0   | 0    | 1   | 1    | 1     | 1   |

These vectors share no nonzero column, so they look completely unrelated even though both sentences are about running.
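
One way to see how different the two vectors are is to count the positions where both are nonzero, which is their dot product. The sketch below assumes lowercased, whitespace-split sentences; `bag_of_words` is an illustrative helper.

```python
from collections import Counter

s1 = "he is running in the park"
s2 = "she runs every day"
vocabulary = ["he", "is", "running", "in", "the", "park",
              "she", "runs", "every", "day"]

def bag_of_words(sentence, vocabulary):
    """Return one count per vocabulary word for the given sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)

# Dot product: sums the products of matching positions.
overlap = sum(a * b for a, b in zip(v1, v2))
print(overlap)  # 0, the sentences have no word form in common
```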

##### Benefit of Stemming
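Stemming maps related word forms to a common stem. For example:
1. `running`, `runs` → `run`
2. `studies`, `studying`, `studied` → `studi`
- Now BoW will group them together!
- Irregular forms such as `ran` are usually left unchanged by suffix-stripping stemmers; mapping `ran` to `run` requires lemmatization.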

##### BoW Matrix (With Stemming)
New Vocabulary: `["he", "is", "run", "in", "the", "park", "she", "every", "day"]`

| Sentence          | he | is | run | in | the | park | she | every | day |
|-------------------|----|----|-----|----|-----|------|-----|-------|-----|
| Sentence 1        | 1  | 1  | 1   | 1  | 1   | 1    | 0   | 0     | 0   |
| Sentence 2        | 0  | 0  | 1   | 0  | 0   | 0    | 1   | 1     | 1   |

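Now "`running`" and "`runs`" both map to the `run` column, so the two vectors overlap in that position. A real stemmer may also change other words (the Porter stemmer, for instance, turns `every` into `everi`); the table leaves them unchanged for readability.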
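
The effect of stemming on the BoW vectors can be sketched with a deliberately tiny suffix stripper. `toy_stem` is a hypothetical helper written only for this example: it handles just the `-ing` and `-s` suffixes, is not the Porter algorithm, and is unrelated to the `stem` function imported from `text_stemming` in the next cell.

```python
from collections import Counter

def toy_stem(word):
    """Strip '-ing' or a plural/verb '-s'. A toy rule set, not a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

sentences = ["he is running in the park", "she runs every day"]
stemmed = [[toy_stem(w) for w in s.split()] for s in sentences]

# Build the vocabulary from the stemmed tokens, in order of first appearance.
vocabulary = []
for tokens in stemmed:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Count each vocabulary word per sentence.
vectors = [[Counter(tokens)[w] for w in vocabulary] for tokens in stemmed]
overlap = sum(a * b for a, b in zip(*vectors))

print(vocabulary)
print(vectors)
print(overlap)  # 1, both sentences now share the "run" column
```

With these two sentences the vocabulary shrinks from 10 to 9 entries, and the two vectors gain one shared nonzero position.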

In [None]:
# Importing stem function from text_stemming module
from text_stemming import stem

In [None]:
# Applying stemming on tags column
cleaned_df['tags'] = cleaned_df['tags'].apply(stem)
cleaned_df.head()