## Pipeline Preprocessing

This notebook shows the effect of each step of the preprocessing pipeline.
Each example shows the result of applying a different preprocessing function, such as tokenization or stop-word removal.
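
To make the idea of chaining preprocessing functions concrete, here is a minimal sketch of how such a pipeline could work. `SketchPipeline`, `lowercase`, and `strip_whitespace` are hypothetical names used only for illustration; the project's own `Pipeline` class may be implemented differently.

```python
from datetime import datetime
from functools import reduce


class SketchPipeline:
    """Hypothetical stand-in: feeds the output of each function into the next."""

    def __init__(self, functions):
        self.functions = list(functions)

    def execute(self, data):
        start = datetime.now()
        # Apply the functions left to right, starting from the raw data.
        result = reduce(lambda acc, fn: fn(acc), self.functions, data)
        print(f"Execution time: {datetime.now() - start}")
        return result


def lowercase(texts):
    # Illustrative step: lowercase every text.
    return [t.lower() for t in texts]


def strip_whitespace(texts):
    # Illustrative step: trim leading and trailing whitespace.
    return [t.strip() for t in texts]


sketch_result = SketchPipeline([lowercase, strip_whitespace]).execute(["  Hello World  "])
print(sketch_result)
```

The order of the list matters: each function receives whatever the previous one returned.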

In [9]:
import warnings
import sys
import os


warnings.filterwarnings('ignore')
current_dir = %pwd

parent_dir = os.path.abspath(os.path.join(current_dir, '../..'))
sys.path.append(parent_dir)

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [10]:
from src.main.pipeline.pipeline import Pipeline
from src.main.pipeline.functions import stop_words_removal, clean_text, remove_contractions, unify_numbers, tfidf_vectorizer
from src.main.utilities.utils import get_dataset
import numpy as np

<h4><b>Sample text</b></h4>

Let's extract one sentence per label from the dataset. These sentences serve as the reference examples for each pipeline function applied.

In [11]:
inputs, targets = get_dataset()
inputs = inputs[:200].reshape(-1, 1)
targets = targets[:200]

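def print_results(inputs, targets):
    """Print the first sample found for each class in `targets`."""
    unique_classes = np.unique(targets)
    for class_name in unique_classes:
        # Index of the first sample that belongs to this class
        class_index = np.where(targets == class_name)[0][0]
        print(f"Class: {class_name}")
        print(inputs[class_index][0])
        print()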

print_results(inputs, targets)

Class: Entertainment
23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) "Until you have a dog you don't understand what could be eaten."

Class: Life
6 Signs You’re Grinding Your Teeth At Night (And What To Do About It) Beyond toothaches, there are other common red flags that you're dealing with nighttime teeth grinding.

Class: Politics
Biden Says U.S. Forces Would Defend Taiwan If China Invaded President issues vow as tensions with China rise.

Class: Sports
Maury Wills, Base-Stealing Shortstop For Dodgers, Dies At 89 Maury Wills, who helped the Los Angeles Dodgers win three World Series titles with his base-stealing prowess, has died.

Class: Voices
Spirituality Has A New Face — And It’s Queer As Hell Meet three spiritual leaders working hard for queer people to have a safe space in the religious community.



<h4><b>Removing Contractions</b></h4>

As a first step, we remove the contractions, expanding shortened words back into their full forms.

For example, "don't" becomes "do not," "haven't" becomes "have not," and so on.

This preprocessing step makes the dataset more uniform and shrinks the vocabulary, so the model has fewer unique word forms to learn and handle. Note that the expansion can also misfire: in the output below, "U.S." is rewritten as "YOU.S.".
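
As a rough illustration of the idea, the sketch below expands contractions with a lookup table and a regular expression. The `CONTRACTIONS` mapping and `expand_contractions` are hypothetical and deliberately tiny; the project's `remove_contractions` function is not shown in this notebook and may work differently.

```python
import re

# Hypothetical, deliberately small mapping; a real list would be much longer.
CONTRACTIONS = {
    "don't": "do not",
    "haven't": "have not",
    "you're": "you are",
    "it's": "it is",
}

# Match any known contraction as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    # Normalize the typographic apostrophe so "You’re" matches "you're".
    text = text.replace("\u2019", "'")

    def _replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Keep a leading capital letter if the original word had one.
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions("You’re sure you don't know? It's fine."))
```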

In [12]:
pipeline = Pipeline([remove_contractions])
results = pipeline.execute(inputs)
print_results(results.reshape(-1, 1), targets)

Pipeline started
Pipeline execution time: 0:00:00.072688
Class: Entertainment
23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) "Until you have a dog you do not understand what could be eaten."

Class: Life
6 Signs You are Grinding Your Teeth At Night (And What To Do About It) Beyond toothaches, there are other common red flags that you are dealing with nighttime teeth grinding.

Class: Politics
Biden Says YOU.S. Forces Would Defend Taiwan If China Invaded President issues vow as tensions with China rise.

Class: Sports
Maury Wills, Base-Stealing Shortstop For Dodgers, Dies At 89 Maury Wills, who helped the Los Angeles Dodgers win three World Series titles with his base-stealing prowess, has died.

Class: Voices
Spirituality Has A New Face — And It is Queer As Hell Meet three spiritual leaders working hard for queer people to have a safe space in the religious community.



<h4><b>Text Cleaning</b></h4>

The text cleaning operation applies a series of regular expressions to remove noise and standardize the text.

In the output below, the text is lowercased and most punctuation is replaced with spaces, so tokens that differ only in case or punctuation are treated as the same word.
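
The sketch below mimics the kind of cleaning visible in this notebook's output: lowercasing, then replacing every character that is not a letter, digit, whitespace, or parenthesis with a space. `clean_text_sketch` is a hypothetical function written for illustration; the actual regular expressions inside `clean_text` are not shown here and are likely more extensive.

```python
import re


def clean_text_sketch(text):
    # Lowercase first so the character class only needs a-z.
    text = text.lower()
    # Replace every other character (punctuation, symbols) with a space.
    return re.sub(r"[^a-z0-9\s()]", " ", text)


print(clean_text_sketch("Base-Stealing Shortstop, Dies At 89."))
```

Because each removed character becomes a space rather than being deleted, hyphenated words split into two tokens and runs of double spaces can appear, as in the outputs in this notebook.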

In [13]:
# Cleaning the text
pipeline = Pipeline([remove_contractions, clean_text])
results = pipeline.execute(inputs)
print_results(results.reshape(-1, 1), targets)

Pipeline started
Pipeline execution time: 0:00:00.024352
Class: Entertainment
23 of the funniest tweets about cats and dogs this week (sept  17 23)  until you have a dog you do not understand what could be eaten  

Class: Life
6 signs you are grinding your teeth at night (and what to do about it) beyond toothaches  there are other common red flags that you are dealing with nighttime teeth grinding 

Class: Politics
biden says you s  forces would defend taiwan if china invaded president issues vow as tensions with china rise 

Class: Sports
maury wills  base stealing shortstop for dodgers  dies at 89 maury wills  who helped the los angeles dodgers win three world series titles with his base stealing prowess  has died 

Class: Voices
spirituality has a new face  and it is queer as hell meet three spiritual leaders working hard for queer people to have a safe space in the religious community 



<h4><b>Removing Stop Words</b></h4>

Next, we remove stop words from the text using the English stop-word list from NLTK. This common text cleaning technique drops very frequent words that carry little meaning, so later steps can focus on the content-bearing words and work with a smaller vocabulary.
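
A minimal sketch of token-based stop-word filtering follows. `STOP_WORDS` is a small hypothetical subset chosen for illustration, and `remove_stop_words_sketch` is not the project's `stop_words_removal` function, whose implementation is not shown here.

```python
# Tiny hypothetical subset. The full English list is available via
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
STOP_WORDS = {"you", "s", "if", "as", "with", "the", "a", "of", "and", "it", "is"}


def remove_stop_words_sketch(text):
    # Split on single spaces so that existing double spaces are preserved.
    tokens = text.split(" ")
    return " ".join(token for token in tokens if token not in STOP_WORDS)


print(remove_stop_words_sketch("biden says you s  forces would defend taiwan if china invaded"))
```

Matching whole space-separated tokens means that a stop word attached to punctuation, such as `(and` or `it)`, is not removed by this approach.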

In [14]:
pipeline = Pipeline([remove_contractions, clean_text, stop_words_removal])
results = pipeline.execute(inputs)
print_results(results.reshape(-1, 1), targets)

Pipeline started
Pipeline execution time: 0:00:00.046176
Class: Entertainment
23 funniest tweets cats dogs week (sept  17 23)  dog understand could eaten  

Class: Life
6 signs grinding teeth night (and it) beyond toothaches  common red flags dealing nighttime teeth grinding 

Class: Politics
biden says  forces would defend taiwan china invaded president issues vow tensions china rise 

Class: Sports
maury wills  base stealing shortstop dodgers  dies 89 maury wills  helped los angeles dodgers win three world series titles base stealing prowess  died 

Class: Voices
spirituality new face  queer hell meet three spiritual leaders working hard queer people safe space religious community 



<h4><b>TF-IDF Vectorization</b></h4>

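Finally, we convert the text into a numerical representation. TF-IDF weights each word by its frequency within a document and by how rare it is across the entire document collection.

This highlights the words that distinguish a document from the others, which is useful for identifying key words and comparing documents. The result is printed as a sparse matrix: each line has the form `(document index, term index)  weight`.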
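
To show how the weights arise, here is a small pure-Python sketch of TF-IDF. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is the default used by scikit-learn's `TfidfVectorizer`. Whether the project's `tfidf_vectorizer` uses these exact settings is an assumption; `tfidf_sketch` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_sketch(documents):
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {token: math.log((1 + n_docs) / (1 + df[token])) + 1 for token in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = {token: counts[token] * idf[token] for token in counts}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({token: w / norm for token, w in weights.items()})
    return vocabulary, rows


vocabulary, rows = tfidf_sketch(["china taiwan china", "dodgers win", "china rise"])
for doc_index, row in enumerate(rows):
    for token, weight in sorted(row.items()):
        # Same shape as the sparse output: (document index, term index) weight
        print((doc_index, vocabulary.index(token)), round(weight, 4))
```

In the third document, "rise" receives a higher weight than "china" because "china" also appears in another document and is therefore less distinctive.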

In [15]:
pipeline = Pipeline([remove_contractions, clean_text, stop_words_removal, tfidf_vectorizer])
results = pipeline.execute(inputs)
print(results)


Pipeline started
Pipeline execution time: 0:00:00.027754
  (0, 681)	0.281651915451354
  (0, 494)	0.2008518216117173
  (0, 2229)	0.281651915451354
  (0, 642)	0.26129597718823094
  (0, 8)	0.281651915451354
  (0, 1890)	0.26129597718823094
  (0, 2316)	0.26129597718823094
  (0, 643)	0.26129597718823094
  (0, 363)	0.26129597718823094
  (0, 2215)	0.26129597718823094
  (0, 847)	0.26129597718823094
  (0, 33)	0.49370641828898415
  (1, 610)	0.1576247334306647
  (1, 1646)	0.17984497040615585
  (1, 2162)	0.17984497040615585
  (1, 1892)	0.13162872265506462
  (1, 2365)	0.1446267280428647
  (1, 2142)	0.15047141487319912
  (1, 2335)	0.15047141487319912
  (1, 141)	0.17984497040615585
  (1, 1234)	0.17984497040615585
  (1, 972)	0.17984497040615585
  (1, 55)	0.17984497040615585
  (1, 611)	0.1354044964551735
  (1, 641)	0.3596899408123117
  :	:
  (198, 577)	0.17094297234955833
  (198, 1089)	0.40836585883621657
  (198, 2088)	0.15695041981492436
  (198, 2312)	0.17094297234955833
  (198, 894)	0.1532113494914801