# Text Mining

In this project we would explore the methods of preprocessing text which includes:
- BOW (Bag-of-Words)
- TF-IDF
- Word2Vec
- GloVe
- FastText
- OneHotEncoding

## Import Libraries

In [14]:
# Common Python Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
from copy import deepcopy

# Text preprocessing/cleaning Libraries
import nltk
import re
import string
import contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer # or LancasterStemmer, RegexpStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

# Bag of Words Libraries
from sklearn.feature_extraction.text  import CountVectorizer

# TF-IDF libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# Word2Vec Libraries
import gensim
from gensim.models import Word2Vec

# Text Representation
from sklearn.feature_extraction.text import CountVectorizer

# Data Preprocessing Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Model Evaluation
from sklearn.metrics import f1_score, classification_report, confusion_matrix

## Import the data

In [15]:
text_data_path = "./train_data.csv"
text_data = pd.read_csv(text_data_path, sep = ",")

In [16]:
# View the data
text_data.head()

Unnamed: 0,text,label
0,Here are Thursday's biggest analyst calls: App...,0
1,Buy Las Vegas Sands as travel to Singapore bui...,0
2,"Piper Sandler downgrades DocuSign to sell, cit...",0
3,"Analysts react to Tesla's latest earnings, bre...",0
4,Netflix and its peers are set for a ‘return to...,0


In [17]:
# text_data["label"].value_counts()

## Text Cleaning

In [31]:
def clean_text(text: str, language: str, tokenize: bool = False, remove_stop_words: bool = False, stem_words: bool = False):
    """
    This function is to clean the text from stopwords, punctuation and return a clean text for further analysis

    Args:
        text (str):
            The dataframe containing the text data
        
        language (str):
            This are the available languages:
            - "catalan": "ca"
            - "czech": "cs"
            - "german": "de"
            - "greek": "el"mlaskjdlj
            - "english": "en"
            - "spanish": "es"
            - "finnish": "fi"
            - "french": "fr"
            - "hungarian": "hu"
            - "icelandic": "is"
            - "italian": "it"
            - "latvian": "lv"
            - "dutch": "nl"
            - "polish": "pl"
            - "portuguese": "pt"
            - "romanian": "ro"
            - "russian": "ru"
            - "slovak": "sk"
            - "slovenian": "sl"
            - "swedish": "sv"
            - "tamil": "ta"
        
        tokenize (bool):
            True = return tokenized data
            False = return untokenized data
        
        remove_stop_words (bool):
            True = remove stop words
            False = do not remove stop words

        stem_words (bool):
            True = get the base words (i.e. spraying -> spray)
            False = leave the words as is
    """

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words(language))

    def tokenize_text(text):
        return [w for s in sent_tokenize(text) for w in word_tokenize(s)]
    
    def remove_special_characters(text, characters=string.punctuation.replace('-', '')):
        pattern = re.compile(f"[{re.escape(characters)}]")
        return pattern.sub("", text)

    def stem_text(tokens):
        return [stemmer.stem(t) for t in tokens]

    def remove_stopwords_func(tokens):
        return [w for w in tokens if w not in stop_words]

    # Clean process
    text = contractions.fix(text)                        # fixing contraction
    text = text.strip().lower()                          # lowercase + trim
    text = remove_special_characters(text)               # remove punctuation
    tokens = tokenize_text(text)                         # tokenize words

    if remove_stop_words:
        tokens = remove_stopwords_func(tokens)           # remove stopwords
        
    if stem_words:
        tokens = stem_text(tokens)                       # stemming

    if tokenize:
        return tokens                                    # return as tokens
    else:
        return " ".join(tokens)                          # return as string

In [19]:
# Test
sample = "I love the smell of freshly brewed coffee in the morning!"
cleaned = clean_text(sample, language="english", remove_stop_words=True, tokenize=False)
print(cleaned)

love smell freshly brewed coffee morning


## Train Test Split

In [20]:
test_size = 0.20
val_size = 0.10

# Splitting the data into train and temp (which will be further split into validation and test)
train_df, test_df = train_test_split(text_data, test_size=test_size, stratify=text_data['label'], random_state=42) #stratify is used to ensure that the same proportion of each class is present in both the training and test sets

# Splitting train into validation and test sets
train_df, val_df = train_test_split(train_df, test_size=val_size, stratify=train_df['label'], random_state=42)

In [21]:
train_df.columns

Index(['text', 'label'], dtype='object')

In [22]:

# Spit the data to x and y values
x_train, y_train = train_df["text"], train_df["label"]
x_test, y_test = test_df["text"], test_df["label"]
x_val, y_val = val_df["text"], val_df["label"]

## Text Preprocessing Methods

### Bag-of-Words

Briefly, the bag-of-words preprocessing method only counts the occurence of each word and does not care about the ordering of the words.

**Example**

Both of these sentences become almost the same for BoW:

- “Coffee is life”
- “Life is coffee”

BoW doesn’t care that the word order is swapped — it just notes that both have “coffee”, “is”, and “life”.

In [23]:
bow_vectorizer = CountVectorizer(max_features=10000)  # you can change this limit
X_train_bow = bow_vectorizer.fit_transform(x_train)
X_test_bow = bow_vectorizer.transform(x_test)
X_val_bow = bow_vectorizer.transform(x_val)

In [24]:
print("Vocabulary size:", len(bow_vectorizer.get_feature_names_out()))
print("BoW shape (train):", X_train_bow.shape)
print("BoW shape (test):", X_test_bow.shape)
print("BoW shape (val):", X_val_bow.shape)

Vocabulary size: 10000
BoW shape (train): (12232, 10000)
BoW shape (test): (3398, 10000)
BoW shape (val): (1360, 10000)


### TF-IDF

TF-IDF or Term Frequency – Inverse Document Frequency.

It’s a way to represent how **important** a word is within a document, compared to the whole dataset (all documents).

1. Term Frequency (TF)
Formula:
$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

<br>

2. Inverse Document Frequency
$$
IDF(t) = \log{\frac{N}{n_t}}
$$
- $N$ = total number of documents
- $n_t$ = number of documents containing the term t

3. Combine TF-IDF
$$
\text{TF-IDF} = TF(t, d) × IDF(t)
$$

$Example:$

Let’s say you have 3 documents:
1. "coffee coffee bean taste nice"
2. "i love coffee"
3. "tea taste nice"

The word coffee appears often in doc1 and doc2, but not in doc3 → quite important!

The word taste appears in doc1 and doc3 → less special.

The word nice appears in doc1 and doc3 → also common.

In [25]:
tfidf_vectorizer = TfidfVectorizer(max_features=10000)  # you can change this limit
X_train_tfidf = tfidf_vectorizer.fit_transform(x_train)
X_test_tfidf = tfidf_vectorizer.transform(x_test)
X_val_tfidf = tfidf_vectorizer.transform(x_val)

In [26]:
print("Vocabulary size:", len(tfidf_vectorizer.get_feature_names_out()))
print("tfidf shape (train):", X_train_tfidf.shape)
print("tfidf shape (test):", X_test_tfidf.shape)
print("tfidf shape (val):", X_val_tfidf.shape)

Vocabulary size: 10000
tfidf shape (train): (12232, 10000)
tfidf shape (test): (3398, 10000)
tfidf shape (val): (1360, 10000)


### Word2Vec

In [29]:
w2v_x_train = deepcopy(pd.DataFrame(x_train))
w2v_x_test = deepcopy(pd.DataFrame(x_test))
w2v_x_val = deepcopy(pd.DataFrame(x_val))
w2v_x_train

Unnamed: 0,text
10369,Carmakers registered the fewest new vehicles i...
5022,Credit Spread Improvement And A Quick Update O...
1542,MOGU Files Annual Report on Form 20-F for Fisc...
6774,$GMAB: Genmab announces net sales of DARZALEX ...
11076,Swiss Producer &amp; Import Prices (M/M) Jun: ...
...,...
4964,$FOMC - Fomo reports Q1 results https://t.co/...
15379,$ULH - Universal Logistics Holdings: Recent St...
5977,China’s banking regulator has asked lenders to...
6323,Quarles Says Fed Should Have Hiked Rates Befor...


In [33]:
w2v_x_train["tokenized"] = w2v_x_train["text"].apply(
    lambda x: clean_text(
        text=x, 
        language="english",
        tokenize=True,
        remove_stop_words=True
    )
)
w2v_x_test["tokenized"] = w2v_x_test["text"].apply(
    lambda x: clean_text(
        text=x, 
        language="english",
        tokenize=True,
        remove_stop_words=True
    )
)
w2v_x_val["tokenized"] = w2v_x_val["text"].apply(
    lambda x: clean_text(
        text=x, 
        language="english",
        tokenize=True,
        remove_stop_words=True
    )
)



In [35]:
w2v_x_train

Unnamed: 0,text,tokenized
10369,Carmakers registered the fewest new vehicles i...,"[carmakers, registered, fewest, new, vehicles,..."
5022,Credit Spread Improvement And A Quick Update O...,"[credit, spread, improvement, quick, update, s..."
1542,MOGU Files Annual Report on Form 20-F for Fisc...,"[mogu, files, annual, report, form, 20-f, fisc..."
6774,$GMAB: Genmab announces net sales of DARZALEX ...,"[gmab, genmab, announces, net, sales, darzalex..."
11076,Swiss Producer &amp; Import Prices (M/M) Jun: ...,"[swiss, producer, amp, import, prices, mm, jun..."
...,...,...
4964,$FOMC - Fomo reports Q1 results https://t.co/...,"[fomc, -, fomo, reports, q1, results, httpstco..."
15379,$ULH - Universal Logistics Holdings: Recent St...,"[ulh, -, universal, logistics, holdings, recen..."
5977,China’s banking regulator has asked lenders to...,"[china, ’, banking, regulator, asked, lenders,..."
6323,Quarles Says Fed Should Have Hiked Rates Befor...,"[quarles, says, fed, hiked, rates, taper, fini..."


In [34]:
# make the tokenized text into one list of tokens for word2vec to learn
train_tokenized_word_list = w2v_x_train["tokenized"].to_list()
test_tokenized_word_list = w2v_x_test["tokenized"].to_list()
val_tokenized_word_list = w2v_x_val["tokenized"].to_list()

In [36]:
# Train the word2vec odel
w2v_model = Word2Vec(
    sentences=train_tokenized_word_list,
    vector_size=100,
    window=5, # Max distance from target word
    min_count=2,
    sg=1, # use skip gram (0 for CBOW)
    workers=-1 #all cores to train
)

In [37]:
# Vectorize the text data after tokenizing the text
def sentence_vector(tokens, model):
    valid_words = [w for w in tokens if w in model.wv]
    if not valid_words:
        return np.zeros(model.vector_size)
    return np.mean(model.wv[valid_words], axis=0)

In [38]:
w2v_x_train["vectorized"] = w2v_x_train["tokenized"].apply(
    lambda x: sentence_vector(x, w2v_model)
)

w2v_x_test["vectorized"] = w2v_x_test["tokenized"].apply(
    lambda x: sentence_vector(x, w2v_model)
)

w2v_x_val["vectorized"] = w2v_x_val["tokenized"].apply(
    lambda x: sentence_vector(x, w2v_model)
)

In [39]:
w2v_x_train

Unnamed: 0,text,tokenized,vectorized
10369,Carmakers registered the fewest new vehicles i...,"[carmakers, registered, fewest, new, vehicles,...","[-0.0037967074, 0.0020488119, 0.0005077503, 0...."
5022,Credit Spread Improvement And A Quick Update O...,"[credit, spread, improvement, quick, update, s...","[0.0019219258, 0.0017877818, 0.0005140198, 0.0..."
1542,MOGU Files Annual Report on Form 20-F for Fisc...,"[mogu, files, annual, report, form, 20-f, fisc...","[-5.4733526e-05, -9.8319724e-05, -0.0039652335..."
6774,$GMAB: Genmab announces net sales of DARZALEX ...,"[gmab, genmab, announces, net, sales, darzalex...","[-0.0016815005, 0.00077633763, 0.0009391241, -..."
11076,Swiss Producer &amp; Import Prices (M/M) Jun: ...,"[swiss, producer, amp, import, prices, mm, jun...","[-0.0027268182, 0.0018887662, -0.0022556628, 0..."
...,...,...,...
4964,$FOMC - Fomo reports Q1 results https://t.co/...,"[fomc, -, fomo, reports, q1, results, httpstco...","[0.0005459329, 0.003093831, 0.00016384262, 0.0..."
15379,$ULH - Universal Logistics Holdings: Recent St...,"[ulh, -, universal, logistics, holdings, recen...","[0.0026075242, -0.001720027, 0.0039549335, 0.0..."
5977,China’s banking regulator has asked lenders to...,"[china, ’, banking, regulator, asked, lenders,...","[-0.00028046552, -0.00025583737, 0.0007133113,..."
6323,Quarles Says Fed Should Have Hiked Rates Befor...,"[quarles, says, fed, hiked, rates, taper, fini...","[-0.0011883086, 0.001673927, 0.002254193, -0.0..."


In [40]:
# Get the vectors and make it into a 2D array
# [[a, b, c], # label 1
#  [d, e, f], # label 2
#  [g, h, i], # label 3
#  ...      ,
#  [., ., .]] # label n

X_train_vec = np.vstack(w2v_x_train["vectorized"].values)
X_test_vec  = np.vstack(w2v_x_test["vectorized"].values)
X_val_vec   = np.vstack(w2v_x_val["vectorized"].values)

In [41]:
X_train_vec

array([[-3.79670737e-03,  2.04881188e-03,  5.07750316e-04, ...,
        -2.52123550e-03, -2.58323317e-03,  6.75106945e-04],
       [ 1.92192581e-03,  1.78778183e-03,  5.14019805e-04, ...,
         2.38531432e-03, -7.66813697e-04,  1.48879190e-04],
       [-5.47335258e-05, -9.83197242e-05, -3.96523345e-03, ...,
         2.04864264e-04,  3.85582773e-03, -4.04579332e-03],
       ...,
       [-2.80465523e-04, -2.55837367e-04,  7.13311310e-04, ...,
         6.00262021e-04, -2.73436162e-04,  1.23293675e-03],
       [-1.18830858e-03,  1.67392695e-03,  2.25419295e-03, ...,
         3.07700131e-04, -4.26686514e-04,  2.30513909e-03],
       [ 1.84311590e-03, -1.16678595e-03,  4.26563993e-03, ...,
         3.60934786e-03, -1.26184896e-03,  1.23401149e-03]])