# Preprocessing for **String** data

01. BoW / Binary BoW
02. n-Grams
03. TF-IDF
04. Word2Vec (SOTA) / Avg. Word2Vec / Weighted Word2Vec

# **01. Bag of Words Representation**

   > Separate words become different features ($feat_k$)
   > 
   > $x_i^{feat_k} = \text{count of k-th word in string }x_i \,\,\,\, (\text{where } 0 \le k < size_{vocabulary})$

For Binary BoW,
   > $x_i^{feat_k} = \begin{cases}
      0, & \text{if count of k-th word in string }x_i = 0 \\
      1, & \text{if count of k-th word in string }x_i > 0
    \end{cases}$ 
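The two definitions above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn does it: the helper `bag_of_words` and the toy corpus are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import Counter

def bag_of_words(text, vocabulary, binary=False):
    """Return the BoW vector of `text` over a fixed, ordered `vocabulary`."""
    counts = Counter(text.lower().split())
    vector = [counts[word] for word in vocabulary]   # feature k = count of k-th word
    if binary:
        vector = [1 if count > 0 else 0 for count in vector]
    return vector

# Hypothetical toy corpus, used only for illustration
corpus = ["the cat sat on the mat", "the dog sat"]
vocabulary = sorted({word for text in corpus for word in text.split()})

bow_vectors = [bag_of_words(text, vocabulary) for text in corpus]
binary_vectors = [bag_of_words(text, vocabulary, binary=True) for text in corpus]

print(vocabulary)       # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vectors)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
print(binary_vectors)   # [[1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1]]
```

Note how "the" gets count 2 in the first BoW vector but only 1 in its binary version.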

***As we are simply counting occurrences of the $k^{th}$ word*** in BoW representation, it can be further **IMPROVED** by the following techniques --
   - **Stopword removal:** Remove frequently occurring words that carry little information (*and, is, or, the* etc.) 
       > Some semantic information can be lost (e.g. dropping "not" can flip the meaning)
   - **Stemming:** Reduce related word forms to a common stem ("connect" / "connected" / "connecting" $\rightarrow$ "connect" (stem))
       > *snowball* stemmer is generally preferred over the older *porter* stemmer
   - **Lemmatisation:** Same idea as stemming but,
       > a stem might not be an actual word, whereas a lemma is always an actual language word.
   
### **Note**

- $size_{vocabulary} \propto sparsity_{x_i}$ (As vocabulary increases, sparsity of $x_i$ increases)
- To know how similar / dissimilar two vectors are, take [Norm](https://www.kaggle.com/l0new0lf/l1-l2-ln-norm) of their **difference** *(Norm gives size of a vector)*
- [L1 Norm](https://www.kaggle.com/l0new0lf/l1-l2-ln-norm) of a Binary BoW vector is `= number of distinct words` in the string; L1 Norm of the **difference** of two Binary BoW vectors is `= number of words present in only one of the two strings`
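A small check of the L1-norm notes above, assuming hand-written Binary BoW vectors over a hypothetical six-word vocabulary (the helper names are illustrative only):

```python
def l1_norm(vector):
    """Sum of absolute values of the components."""
    return sum(abs(value) for value in vector)

def difference(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Binary BoW vectors over the hypothetical vocabulary
# ["cat", "dog", "mat", "on", "sat", "the"]
x1 = [1, 0, 1, 1, 1, 1]   # "the cat sat on the mat"
x2 = [0, 1, 0, 0, 1, 1]   # "the dog sat"

distinct_words_x1 = l1_norm(x1)                   # distinct words in x1
words_not_shared = l1_norm(difference(x1, x2))    # words in exactly one string

print(distinct_words_x1, words_not_shared)   # 5 4
```

The four unshared words are "cat", "mat", "on" (only in the first string) and "dog" (only in the second).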

### **Disadvantages**

- **Compute:** In case of Logistic regression, `n_dims = n_vocab` $\Rightarrow$ Large training and inference time
- **Space:** Sparse vector
- **Inherent:**
    - Simple counts 
        - *Doesn't incorporate semantics*
        - *Information is lost*
        - *Sequence info ignored*
    - Problem w/ *heteronyms* (`Lead` in pencil is not same as `Lead` in leader)
    - Problem w/ synonyms (words w/ the same meaning but different spelling become separate features instead of a single one)

# **02. n-Grams**

> Retains sequence information. *( n = number of consecutive words grouped together)*

**BoW** where separate **consecutive-groups-of-words** become different features ($feat_k$). <br>
Size of group is given by `n` in n-Grams

### **Note (Main Disadvantage)**

- $size_{vocabulary} \propto n \propto sparsity_{x_i}\,\,\,$ (As `n` increases, both vocabulary size and sparsity increase)
- Extremely sparse vector

### **Advantage over simple BoW**

- Partial sequence info is retained
    > eg. "do not" -> ["do", "not"] is not informative but bigram,  "do not" -> ["do not"] is!
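A minimal sketch of why bigrams keep some order information that unigrams lose. The `ngrams` helper and the two example strings are hypothetical, and tokens are split on whitespace only.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-grams (as space-joined strings) of a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man"
b = "man bites dog"

unigrams_a, unigrams_b = Counter(ngrams(a, 1)), Counter(ngrams(b, 1))
bigrams_a, bigrams_b = Counter(ngrams(a, 2)), Counter(ngrams(b, 2))

print(unigrams_a == unigrams_b)   # True  -> plain BoW cannot tell the two apart
print(bigrams_a == bigrams_b)     # False -> bigrams can
print(ngrams(a, 2))               # ['dog bites', 'bites man']
```

The price is a larger vocabulary: every distinct pair of consecutive words becomes its own feature.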

# **03. TF-IDF**

> Based on two key ideas - 
> - Normalize BoW counts **in datasample** [by sum](https://www.kaggle.com/l0new0lf/02-08-normalisation-vs-standardisation-vs-probs) i.e. to probabilities **(Property of datasample)**
> - Penalize more frequent words and reward less frequent words **(Property of dataset)**



**TERM FREQUENCY (TF):** Converts simple counts to probabilities by dividing w/ sum of all words' counts **in datasample** <br>
**INVERSE DOC. FREQ. (IDF):** Penalizes more frequent words and rewards less frequent words **in dataset**

<br><br>
**$$\text{TF-IDF(}x_i^{feat_k}) = \text{TF} * \text{IDF}$$**
<br><br>

$\text{TF(}x_i^{feat_k}) = \frac{\text{count of k-th word in }x_i}{\text{sum of counts of all words in }x_i}$ where $x_i$ is a datasample (string or document)
<br>
<br>
$\text{IDF(}x_i^{feat_k}) = \log{\bigg( \frac{n}{n_{\text{with k-th word}}} \bigg)}$ where **$n$** is total number of samples in dataset and **$n_{\text{with k-th word}}$** is number of samples w/ **at least** one occurrence of the k-th word.

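> A method similar to Laplace smoothing (e.g. adding 1 to the counts) can also be used in the IDF equation, which avoids division by zero for words that occur in no sample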
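The TF and IDF formulas above can be worked through by hand on a toy corpus. This is a sketch under stated assumptions: the helpers `tf` and `idf` and the corpus are hypothetical, the natural log is used, and the `smooth` branch is just one possible add-one variant (scikit-learn's `TfidfVectorizer` uses a related but not identical formula).

```python
import math
from collections import Counter

def tf(word, tokens):
    """Count of `word` in the sample divided by the total number of words in it."""
    return Counter(tokens)[word] / len(tokens)

def idf(word, tokenized_corpus, smooth=False):
    """log(n / n_with_word); with smooth=True, add 1 to both counts (an assumption)."""
    n = len(tokenized_corpus)
    n_with_word = sum(1 for tokens in tokenized_corpus if word in tokens)
    if smooth:
        return math.log((1 + n) / (1 + n_with_word))
    return math.log(n / n_with_word)   # fails if the word occurs in no sample

# Hypothetical toy corpus
corpus = ["the cat sat", "the dog sat", "the bird flew away"]
tokenized = [text.split() for text in corpus]

tf_the = tf("the", tokenized[0])     # 1 of 3 words
idf_the = idf("the", tokenized)      # in every sample -> log(3/3) = 0
idf_cat = idf("cat", tokenized)      # in one sample   -> log(3/1)
tfidf_the = tf_the * idf_the
tfidf_cat = tf("cat", tokenized[0]) * idf_cat

print(tfidf_the, tfidf_cat)
```

"the" and "cat" have the same TF in the first sample, yet "the" ends up with TF-IDF 0 because it appears in every sample, while the rare word "cat" keeps a positive weight.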

### **NOTE**

- **TF** Normalizes [by sum](https://www.kaggle.com/l0new0lf/02-08-normalisation-vs-standardisation-vs-probs) by **ROW**
- The raw inverse ratio $n / n_{\text{with k-th word}}$ can be far larger than TF (which lies in `[0, 1]`), so it would swamp TF in the product. Hence, the **log** is applied to dampen the ratio monotonically.
- IDF is **small** `~ log(1)` for frequent words and **large** `~ log(num_of_samples)` for rare words
    
### **DISADVANTAGE**

- Doesn't take semantic meaning into account
- Sparse vector

### **Ponder**

What if **IDF** Inverse-Normalizes original counts [by sum](https://www.kaggle.com/l0new0lf/02-08-normalisation-vs-standardisation-vs-probs) by **COLUMN** i.e $\log{ \bigg( \frac{N_X^{feat_k}}{N_{x_i}^{feat_k}} \bigg)}$ Where, $N_X^{feat_k}$ is count of k-th word in whole dataset  $X$ and $N_{x_i}^{feat_k}$ is count of k-th word in sample $x_i$


# **04. WORD2VEC**

> [2013 Paper](https://arxiv.org/pdf/1301.3781.pdf) *Takes semantic meaning into consideration unlike any methods above.*
>
> - Converts a given word (text) to any d-sized dense vector.
> $d \propto \text{information retained}$
> - Takes neighbors of the word into consideration (while training)

Can be understood w/ help of *Matrix Factorisation / Deep Learning*

## **Advantages**

- Dense vector! (unlike any methods above)
- Incorporate semantic meaning (unlike any methods above)
    - vector distance preserves similarity of different words
    - vector direction preserves temporal information (tenses) and relationships b/w words
    - if $\vec{v_1}$ is related to $\vec{v_2}$ in the same way that $\vec{v_3}$ is related to $\vec{v_4}$, then approximately $\,\,\, (\vec{v_1} - \vec{v_2} ) \,\, \parallel \,\, (\vec{v_3} - \vec{v_4} )$. Thus, the relationship is maintained! 
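The "parallel difference vectors" idea can be checked numerically. The 2-d vectors below are made-up toy values, not real Word2Vec embeddings. They only show how one might test whether two difference vectors point in nearly the same direction.

```python
import math

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine of the angle between u and v (1.0 means exactly parallel)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (NOT taken from a trained model)
king, queen = [0.9, 0.8], [0.9, 0.2]
man, woman = [0.2, 0.7], [0.25, 0.1]

royal_offset = subtract(king, queen)    # v1 - v2
plain_offset = subtract(man, woman)     # v3 - v4

parallelism = cosine(royal_offset, plain_offset)
print(parallelism)   # close to 1 -> the two offsets are nearly parallel
```

With real embeddings the offsets are only approximately parallel, which is why analogy queries search for the nearest word vector rather than an exact match.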
    

## **Disadvantages**

- Black box
- Need extremely large training set

## **Sentence to Vector**

Any of the following three methods can be used

- Use complex SOTA techniques like *Sentence2Vec* (Need retraining and large corpus of data)
- Take avg. of element word vectors 
$$\vec{\text{sentence vector}} = \frac{1}{N_{\text{num of words}}} \sum_{i=1}^{N_{\text{num of words}}}{\text{w2v}(word_i)}$$

- Take TF-IDF weighted avg. of element word vectors

$$\vec{\text{sentence vector}} = \frac{1}{\sum_{i=1}^{N_{\text{num of words}}}{\text{TF-IDF}_i}} \sum_{i=1}^{N_{\text{num of words}}}{\text{TF-IDF}_i * \text{w2v}(word_i)}$$

> In both above cases, the resultant sentence vector will have the **same dims** as an individual word vec

# DATA

In [None]:
import numpy as np
import pandas as pd
import re

df = pd.read_csv("../input/twitter-sentiment-analysis-hatred-speech/train.csv")

# helper function to remove every match of a regex pattern (e.g. twitter handles)
def remove_pattern(input_text, pattern):
    # substitute using the pattern itself: re-using the matched substrings as
    # patterns would misbehave if they contained regex metacharacters
    return re.sub(pattern, "", input_text)

# to lower
df['to_lower'] = df['tweet'].apply(lambda x: x.lower())
# find pattern for twitter handles using regex
# @user
df['handle_removed'] = np.vectorize(remove_pattern)(df['to_lower'], r"@[\w]*")
# 1. `.str` exposes string methods on the pandas series
# 2. call replace method on it
# 3. Use regex to replace everything except [a-z] and [A-Z] with space (" ")
#    (`regex=True` is needed in pandas >= 2.0, where the default is a literal match)
# 4. use "[^a-zA-Z#]" to retain hash-symbol (not doing it here)
df['puncs_removed'] = df['handle_removed'].str.replace("[^a-zA-Z]", " ", regex=True)

df.head(3)

In [None]:
# to numpy array
df['puncs_removed'].values[:3]

In [None]:
len(df['puncs_removed'])

> **Num of samples = n = 31962**

# CODE

# **01. BoW / Binary Bow / n-Grams**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
bow = CountVectorizer(
            stop_words     = 'english',
            binary         = False, # True -> Binary BoW,
            ngram_range    = (1,1), # (1, 1) -> only unigrams, (1, 2) -> unigrams and bigrams, and (2, 2) -> only bigrams
            #vocabulary    = Mapping / iterable (custom vocabulary)
        )

In [None]:
# use `fit_transform` with training data (seeing first time => generates vocabulary)
# use `transform` with test data (not seeing first time => needs vocabulary - already generated by `fit_transform`)
features = bow.fit_transform(df['puncs_removed'].values)

print(features.shape)
print(type(features))

> - For `n = 31962` samples, `37255` dimension vector
> - Vocabulary size is `37255`
> - returns sparse matrix (saves memory)

Use `features.toarray()` to convert to a dense `np.ndarray` (`features.todense()` returns an `np.matrix` instead)

In [None]:
print(dir(bow))

In [None]:
print("vocabulary size is: ", len(bow.vocabulary_))

# method 1:
# out of 37255 
print("method 1:")
for vocab_word, index in bow.vocabulary_.items():
    if index == 0   : print("feature 0 represents word: ", vocab_word)
    if index == 100 : print("feature 100 represents word: ", vocab_word)
        
# method 2: 
# (`get_feature_names_out` replaces `get_feature_names`, which was removed in scikit-learn 1.2)
print("method 2:")
print(f"feature 0 represents word: {bow.get_feature_names_out()[0]}")
print(f"feature 100 represents word: {bow.get_feature_names_out()[100]}")

In [None]:
# `transform` instead of `fit_transform` for test-data
bow.transform(['hi']) # (1x37255)
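
The difference between `fit_transform` and `transform` can be illustrated with a tiny stand-in class. `TinyCountVectorizer` is hypothetical and only mimics the idea: the real `CountVectorizer` tokenizes differently and returns a sparse matrix.

```python
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical, minimal stand-in used only to illustrate fit vs. transform."""

    def fit(self, texts):
        # The vocabulary is learned ONLY here, from the training texts
        words = sorted({w for text in texts for w in text.lower().split()})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            row = [0] * len(self.vocabulary_)
            for word, count in counts.items():
                if word in self.vocabulary_:   # words unseen during fit are dropped
                    row[self.vocabulary_[word]] = count
            rows.append(row)
        return rows

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)

vectorizer = TinyCountVectorizer()
train_features = vectorizer.fit_transform(["good movie", "bad movie"])
test_features = vectorizer.transform(["good good film"])

print(vectorizer.vocabulary_)   # {'bad': 0, 'good': 1, 'movie': 2}
print(train_features)           # [[0, 1, 1], [1, 0, 1]]
print(test_features)            # [[0, 2, 0]] -> "film" is not in the vocabulary
```

Test vectors always have the training vocabulary's width, which is why a string with only unseen words maps to an all-zero row.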

# **02. BoW Improvement**

- Tokenize (space separated list)
- Stopword removal
- Stemming / Lemmatisation

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.corpus import stopwords

''' 
# Lemmatisation:
# base word of stem might not be an actual word whereas, lemma is an actual language word

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> print(wnl.lemmatize('dogs'))
dog
>>> print(wnl.lemmatize('churches'))
church
>>> print(wnl.lemmatize('aardwolves'))
aardwolf

USAGE: instead of stemming pd.series below, use
`.apply(lambda x: [wnl.lemmatize(i) for i in x])`
'''


# requires the NLTK stopword corpus: nltk.download('stopwords')
# use a separate name so that the imported `stopwords` module is not shadowed
stop_words = set( stopwords.words('english') )

# SnowballStemmer is generally the better stemmer
#stemmer = PorterStemmer()
stemmer = SnowballStemmer('english')

# Tokenize before stemming. 
# Tokenize: Split into individual words i.e into list
tokenized_tweet = df['puncs_removed'].apply(lambda x: x.split())

# Stopword removal (drop the tokens instead of leaving empty strings behind)
tokenized_tweet = tokenized_tweet.apply(lambda tokens: [i for i in tokens if i not in stop_words])

# Stemming (can be replaced w/ Lemmatisation)
# Iterate over every word in each list 
# So that `having` and `have` both can be converted into `have`
stemmed_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])

# convert list of words into a line
df["processed"] = stemmed_tweet.apply(' '.join)

# display
df[['puncs_removed', 'processed']].head(3)

# **03. TF-IDF**

> *Same usage as BoW above*

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# use exactly same as BoW
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
tfidf = TfidfVectorizer(
            stop_words   = 'english',
            ngram_range  = (1,2) # uni as well as bi
        )

In [None]:
# use `fit_transform` with training data (seeing first time => generates vocabulary)
# use `transform` with test data (not seeing first time => needs vocabulary - already generated by `fit_transform`)
tfidf_features = tfidf.fit_transform(df['processed'].values)

print(tfidf_features.shape)
print(type(tfidf_features))

In [None]:
def get_topn_tfidfs_of_a_sample(tfidf_features, sample_idx, n=25):
    """
    tfidf_features  : sparse matrix of dims (num_samples, num_tfidf_feats)
    sample_idx      : row_idx
    """
    tfidfs_of_a_row = tfidf_features[sample_idx].toarray().flatten() # (1x31194) -> (31194,)
    desc_idxs = np.argsort(tfidfs_of_a_row)[::-1][:n] # indices of the n largest values
    
    top_n_tfidfs = tfidfs_of_a_row[desc_idxs]
    # `get_feature_names_out` replaces `get_feature_names` (removed in scikit-learn 1.2)
    top_n_tfidfs_featwords = np.array(tfidf.get_feature_names_out())[desc_idxs]
    
    return top_n_tfidfs_featwords, top_n_tfidfs

In [None]:
# Analyze top-n TF-IDFs of a >>data-sample 0 and 1<<
# inspiration: http://buhrmann.github.io/tfidf-analysis
bar_xs_0, bar_ys_0 = get_topn_tfidfs_of_a_sample(tfidf_features, 0, n=25)
bar_xs_1, bar_ys_1 = get_topn_tfidfs_of_a_sample(tfidf_features, 1, n=25)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axarr = plt.subplots(1, 2)
fig.set_size_inches(12,5)

# sample at idx 0
sns.barplot(x=bar_ys_0, y=bar_xs_0, ax=axarr[0]) # keyword args: newer seaborn rejects positional x, y
axarr[0].set_title('For sample at idx 0')
axarr[0].grid()

# sample at idx 1
sns.barplot(x=bar_ys_1, y=bar_xs_1, ax=axarr[1])
axarr[1].set_title('For sample at idx 1')
axarr[1].grid()

fig.tight_layout(pad=3.0)
plt.show()

# **04. WORD2VEC**

With gensim,

- Can train custom model (w/ custom data)
- Can use pretrained model

In [None]:
! curl -O "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

In [None]:
from gensim.models import Word2Vec, KeyedVectors
pretrained_model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

Make sure for pretrained model, input word **exists in google-news vocabulary**

> Prefer lemmatisation over stemming here: a lemma is a real word and is therefore likely to exist in the vocab, whereas a stem may not

In [None]:
# word -> 300 dim vec (pretrained)
pretrained_model['test'].shape

In [None]:
# cosine similarity between the two word vectors
# output lies in [-1, 1]; closer to 1 means more similar
pretrained_model.similarity('king', 'queen')
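
Cosine similarity depends only on the angle between two vectors, not on their lengths. A minimal sketch in plain Python, with a hypothetical helper name and toy vectors:

```python
import math

def cosine_similarity(u, v):
    """dot(u, v) / (|u| * |v|): the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # about  1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # exactly 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # about -1

print(same_direction, orthogonal, opposite)
```

This is why the score can be negative and why scaling a vector does not change its similarity to others.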

In [None]:
pretrained_model.most_similar('king')

To accept words that **do not exist** in google-news vocabulary, train custom model

Input is a list of data samples, each of which is itself a list of individual words
```
[
    ['w1', 'w2', ..., 'wm'],
    ['w1', 'w2', ..., 'wm'],
    ['w1', 'w2', ..., 'wm'],
    .
    .
    .
]
```

Word2Vec parameters (custom model)
- list of sentences (sentence is list of words) as shown above
- **min_count:** min number of occurrences of a word required to create a vector for it
- **vector_size:** dim of output vector (named `size` in gensim < 4.0); larger vectors can retain more information but need more data and compute
- **workers:** cpu cores to use

> Note: 
> - vocabulary size i.e data size must be large (to compensate for almost any random valid input at test time)
> - Here, used **stemmed** text. **Better use raw punc_removed text**

In [None]:
sentences = df['processed'].values # ndarray of stemmed

list_of_list_of_list_of_words = []
for sentence in sentences:
    list_of_list_of_list_of_words.append(
        sentence.split()
    )
    
list_of_list_of_list_of_words[:2]

In [None]:
# train custom model
custom_model = Word2Vec(
                list_of_list_of_list_of_words,
                min_count     = 1,
                vector_size   = 300, # gensim >= 4.0; older versions call this `size`
                workers       = 4
                )

In [None]:
custom_model.wv['lyft'].shape # custom vocabulary

In [None]:
custom_model.wv.most_similar('lyft')

# **05. Sentence to vectors**

**A. Using Avg Word2vec** (see formula above)

In [None]:
list_of_sentences = df['processed'].values # ndarray of stemmed sentences
list_of_sentences[0]

In [None]:
from tqdm import tqdm

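avg_w2v_sentences = []
for sentence in tqdm(list_of_sentences):
    # look up the vector of every word in the sentence
    w2vs = [custom_model.wv[word] for word in sentence.split()]

    if w2vs:
        # divide by the number of words (not the number of characters)
        avg_w2v_sentence_vec = np.sum(w2vs, axis=0) / len(w2vs)
    else:
        # sentence is empty after preprocessing -> fall back to a zero vector
        avg_w2v_sentence_vec = np.zeros(custom_model.wv.vector_size)
    avg_w2v_sentences.append(avg_w2v_sentence_vec)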

In [None]:
# `avg_w2v_sentences` is a list holding one sentence vector per sample

print(avg_w2v_sentences[0].shape)
print(len(avg_w2v_sentences))

**B. Using TF-IDF weighted w2v** (see formula above)

> Just like Avg Word2vec above, use TF-IDF data and gensim w2v to compute the vector representing a sentence
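One possible sketch of the weighted average, kept in plain Python so that it also accepts NumPy vectors. The function name and the toy inputs are hypothetical. It assumes the sum runs over distinct words with weight `TF * IDF`. The common `1 / num_words` factor of TF cancels between numerator and denominator, so each occurrence simply contributes its IDF. Words missing from either mapping are skipped, which is a design choice rather than a rule.

```python
def tfidf_weighted_sentence_vector(sentence, word_vectors, idf_weights, dim):
    """TF-IDF weighted average of the word vectors of `sentence`.

    word_vectors : mapping word -> sequence of `dim` floats
    idf_weights  : mapping word -> IDF value of that word
    """
    weighted_sum = [0.0] * dim
    weight_total = 0.0
    for word in sentence.split():
        if word in word_vectors and word in idf_weights:
            weight = idf_weights[word]   # summed per occurrence => count * IDF
            for j, value in enumerate(word_vectors[word]):
                weighted_sum[j] += weight * float(value)
            weight_total += weight
    if weight_total == 0:
        return weighted_sum              # no usable word -> zero vector
    return [value / weight_total for value in weighted_sum]

# Hypothetical toy inputs
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
toy_idf = {"good": 3.0, "movie": 1.0}

sentence_vec = tfidf_weighted_sentence_vector("good movie unknown", toy_vectors, toy_idf, dim=2)
print(sentence_vec)   # [0.75, 0.25] -> pulled towards the rarer word "good"
```

With the objects from this notebook, one could presumably pass `custom_model.wv` as `word_vectors` and build `idf_weights` from the fitted vectorizer, e.g. `dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))`. Note that scikit-learn's IDF is a smoothed variant of the formula given earlier.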