### Chapter 5 Code Demo for NLP Data Pre-processing with Python

#### YouTube Comments Spam Detection 

We solved the same problem in the Chapter 1 code demo. This time we will attempt to solve it again with some data pre-processing steps and see whether the results improve.

In [1]:
# Import the required libraries.
import pandas as pd
import numpy as np
# The imports below are for building the machine learning model.
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Ignore warnings.
import warnings 
warnings.filterwarnings('ignore')

In [3]:
import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

#### A quick look at our data

In [196]:
# Read the data files [1] available in the same folder as this code.
Youtube01_psy = pd.read_csv('Youtube01-Psy.csv')
Youtube02_katyperry = pd.read_csv('Youtube02-KatyPerry.csv')
Youtube03_lmfao = pd.read_csv('Youtube03-LMFAO.csv')
Youtube04_eminem = pd.read_csv('Youtube04-Eminem.csv')
Youtube05_shakira = pd.read_csv('Youtube05-Shakira.csv')

In [73]:
# Combine all five datasets.
combined_df = pd.concat([Youtube01_psy, Youtube02_katyperry, Youtube03_lmfao, Youtube04_eminem, Youtube05_shakira])

# Reset the index
combined_df.reset_index(drop=True, inplace=True)

In [6]:
combined_df.head(3)

| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 |


In [74]:
# Keep only the useful "CONTENT" and "CLASS" columns.
combined_df = combined_df[["CONTENT", "CLASS"]]

In [57]:
# Randomly select 5 rows
random_sample = combined_df.sample(n=5)
print(random_sample)

                                                CONTENT  CLASS
1127  The best Song i saw ❤️❤️❤️❤️❤️❤️❤️❤️😍😍😍😍😍😍😍😘😘😘...      0
599   Hey Katycats! We are releasing a movie at midn...      1
574   want to win borderlands the pre-sequel? check ...      1
533   Awesome video this is one of my favorite  song...      0
747             Love this video and the song of courseÔªø      0


- Note the special characters and the misalignment due to spaces in the CLASS column.

In [75]:
# Randomly select 5 rows
random_sample = combined_df.sample(n=5)
print(random_sample)

                                                CONTENT  CLASS
532   http://www.googleadservices.com/pagead/aclk?sa...      1
1447            I love this song sooooooooooooooo muchÔªø      0
1901  Hey youtubers... I really appreciate all of yo...      1
268   https://www.facebook.com/pages/Mathster-WP/149...      1
1304  sorry but eminmem is a worthless wife beating ...      0


#### Convert the text to lowercase.

In [76]:
# Convert all comments to string type for further processing.
combined_df["CONTENT"] = combined_df["CONTENT"].astype(str)

In [77]:
# Convert everything to lowercase.
combined_df["text_lower_case"] = combined_df["CONTENT"].str.lower()
# Randomly select 3 rows.
random_sample_lower = combined_df.sample(n=3)
print(random_sample_lower)

                                               CONTENT  CLASS  \
404  YAY IM THE 11TH COMMENTER!!!!!                ...      1   
907               Check out this playlist on YouTube:Ôªø      1   
942  View 851.247.920<br /><br />¬†Best youtube Vide...      1   

                                       text_lower_case  
404  yay im the 11th commenter!!!!!                ...  
907               check out this playlist on youtube:Ôªø  
942  view 851.247.920<br /><br />¬†best youtube vide...  


In [78]:
# Optionally drop combined_df["CONTENT"], since we will work only with combined_df["text_lower_case"].
# The line is left commented out here, so "CONTENT" stays in the DataFrame.
# combined_df = combined_df.drop(columns=["CONTENT"])

In [79]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case'], dtype='object')

#### Remove all unwanted punctuation

In [80]:
# Punctuation to remove
punctuation_to_remove = "!\"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~`"

# Remove the above punctuation from the "text_lower_case" column.
combined_df["text_lower_case"] = combined_df["text_lower_case"].str.translate(str.maketrans('', '', punctuation_to_remove))
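To make the effect of str.translate concrete, here is a minimal sketch on a single comment. The sample string is invented for illustration and is not taken from the dataset. It also shows a side effect worth keeping in mind: URLs and emoticons lose their delimiters, so a link collapses into one long token.

```python
# Same punctuation set as above, applied to one made-up comment.
punctuation_to_remove = "!\"#$%&'()*+,-./:;<=>?@[\\]^_{|}~`"
removal_table = str.maketrans('', '', punctuation_to_remove)

sample_comment = "check out http://example.com/video?id=1 :) it's great!!!"
cleaned_comment = sample_comment.translate(removal_table)

# The URL survives only as a single run of letters and digits.
print(cleaned_comment)
```

If later steps need to recognise URLs, emoticons or HTML tags, it may be safer to run them before this one.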

In [81]:
# Randomly select 10 rows.
random_sample_lower = combined_df.sample(n=10)
print(random_sample_lower)

                                                CONTENT  CLASS  \
173                     http://www.gofundme.com/gvr7xgÔªø      1   
277   Hey, join me on tsū, a publishing platform whe...      1   
1828                          Shakira is very beautiful      0   
25    marketglory . com/strategygame/andrijamatf ear...      1   
580   Thank you KatyPerryVevo for your instagram lik...      1   
1240  all u should go check out j rants vi about eminem      1   
1788              Please visit this Website: oldchat.tk      1   
962    <br />Please help me get 100 subscribers by t...      1   
1857                                            Love it      0   
301   http://hackfbaccountlive.com/?ref=4436607  psy...      1   

                                        text_lower_case  
173                           httpwwwgofundmecomgvr7xgÔªø  
277   hey join me on tsū a publishing platform where...  
1828                          shakira is very beautiful  
25    marketglory  comstrategygamea

#### Remove stopwords

In [82]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords if you haven't already
nltk.download('stopwords')

# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to C:\Users\Shailendra
[nltk_data]     Kadre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [83]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case'], dtype='object')

In [84]:
# Remove stopwords from the "text_lower_case" column
combined_df["text_lower_case"] = combined_df["text_lower_case"].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
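One detail of this step is easy to miss: NLTK's English list contains contractions such as "don't" with the apostrophe, but punctuation has already been removed from our text, so "dont" no longer matches. The sketch below illustrates this with a tiny hand-written stop-word set, which is only a stand-in for the NLTK list; the sample comment is invented.

```python
# Tiny stand-in for NLTK's English stop-word list (an assumption for this sketch).
tiny_stop_words = {"i", "the", "and", "this", "don't"}

def remove_stop_words(text, stop_words):
    # Keep only the words that are not in the stop-word set.
    return ' '.join(word for word in text.split() if word not in stop_words)

# Punctuation has already been stripped, so "don't" has become "dont".
sample_comment = "i dont get this video and the song"
print(remove_stop_words(sample_comment, tiny_stop_words))
```

If such contractions matter for your task, one option is to strip punctuation from the stop-word list as well before filtering.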

In [85]:
# Randomly select 5 rows.
random_sample_lower = combined_df.sample(n=5)
print(random_sample_lower)

                                                CONTENT  CLASS  \
1025  <a href="https://m.freemyapps.com/share/url/10...      1   
181                         Please check out my vidiosÔªø      1   
724   This awesome song needed 4 years to reach to 8...      0   
1296  5 years and i still dont get the music video h...      0   
878                                                omgÔªø      0   

                                        text_lower_case  
1025  hrefhttpsmfreemyappscomshareurl10b35481httpsmf...  
181                                please check vidiosÔªø  
724   awesome song needed 4 years reach 800 mil view...  
1296   5 years still dont get music video help someoneÔªø  
878                                                omgÔªø  


Whether to remove frequent or rare words depends on the goals of your NLP task and the nature of the dataset at hand. Very frequent words are removed because they carry little information, while very rare words are removed because they have limited relevance to the overall analysis. Removing both reduces noise in the data, and it can improve the performance of many NLP algorithms because they can then focus on the content-rich words that contribute most to the analysis.

- For this problem, we will not remove the frequent and rare words.
- Below we will just share the code on how to remove them, if your project needs it.

#### Remove frequent words

In [86]:
from collections import Counter

# Combine all text into a single string and split into individual words.
all_words = ' '.join(combined_df["text_lower_case"]).split()

# Count the frequency of every word.
word_counts = Counter(all_words)

# Determine the threshold for frequent words (top 10 most common words).
most_common_words = word_counts.most_common(10)

# Print the frequent words with their frequencies.
print("Frequent words with their frequencies:")
for word, count in most_common_words:
    print(f"{word}: {count}")

# Set of the frequent words.
frequent_words = {word for word, count in most_common_words}

# Remove frequent words from the combined_df["text_lower_case"].
combined_df["frequent_removed"] = combined_df["text_lower_case"].apply(
    lambda x: ' '.join([word for word in x.split() if word not in frequent_words])
)

Frequent words with their frequencies:
check: 559
video: 294
Ôªø: 267
like: 235
please: 231
song: 231
subscribe: 209
love: 189
channel: 173
music: 144


In [87]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case', 'frequent_removed'], dtype='object')

In [46]:
# Randomly select 5 rows.
random_sample_lower = combined_df[['CLASS', 'frequent_removed']].sample(n=5)
print(random_sample_lower)

      CLASS                                   frequent_removed
1939      1  peoples earth seen perform every form evil lei...
459       0  comment randomly get lots likes replies reason...
896       0                       almost 1 billion views niceÔªø
1799      0                                      she39s pretty
1426      0                charlieee dddd saw lost understandÔªø


#### Remove rare words. 

In [88]:
from collections import Counter

# Combine all text into a single string and split into individual words.
all_words = ' '.join(combined_df["text_lower_case"]).split()

# Count the frequency of every word.
word_counts = Counter(all_words)

# Define the threshold for rare words.
threshold = 5
rare_words = {word: count for word, count in word_counts.items() if count < threshold}

# Sort the rare words by frequency in ascending order and keep the top 5.
sorted_rare_words = sorted(rare_words.items(), key=lambda item: item[1])
top_5_rare_words = sorted_rare_words[:5]

# Print the top 5 rare words with their frequencies.
print("Top 5 rare words with their frequencies:")
for word, count in top_5_rare_words:
    print(f"{word}: {count}")

# Remove rare words from combined_df["text_lower_case"].
combined_df["rare_removed"] = combined_df["text_lower_case"].apply(
    lambda x: ' '.join([word for word in x.split() if word not in rare_words])
)

Top 5 rare words with their frequencies:
anyway: 1
kobyoshi02: 1
monkeys: 1
shirtplease: 1
test: 1


- Not all of the removed rare words are displayed here, as the list is very long.

In [48]:
combined_df.columns

Index(['CLASS', 'text_lower_case', 'frequent_removed', 'rare_removed'], dtype='object')

In [20]:
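# Randomly select 5 rows. The column created above is "rare_removed";
# the sample shown below was saved from an earlier run that called it "lower_removed".
random_sample_lower = combined_df[['CLASS', 'rare_removed']].sample(n=5)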
print(random_sample_lower)

      CLASS                                      lower_removed
1455      0                                                 so
1787      1                          please visit this website
35        0  why is a korean song so big in the does that m...
1622      1                   check out this video on youtubeÔªø
1386      0                                                  Ôªø


- Below we will apply stemming and lemmatization directly to combined_df["text_lower_case"], as they are necessary steps.

#### Stemming.

In [89]:
import nltk
from nltk.stem import PorterStemmer
from collections import Counter

# Download the 'punkt' tokenizer models. The stemmer itself needs no download;
# the models are used by word_tokenize in the lemmatization step.
nltk.download('punkt')

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Function to stem words in a text and return the stemmed version
def stem_text(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

[nltk_data] Downloading package punkt to C:\Users\Shailendra
[nltk_data]     Kadre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [90]:
# Apply stemming to combined_df["text_lower_case"].
combined_df["text_lower_case"] = combined_df["text_lower_case"].apply(stem_text)

In [91]:
# Combine all stemmed text into a single string and split into individual words
all_stemmed_words = ' '.join(combined_df["text_lower_case"]).split()

# Count the frequency of every stemmed word
word_counts = Counter(all_stemmed_words)

# Get the top 5 stemmed words and their frequencies
top_5_stemmed_words = word_counts.most_common(5)

In [92]:
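# Create a mapping from stemmed words to the word forms they came from.
# Note: combined_df["text_lower_case"] was already stemmed in the previous cell,
# so every stem maps only to itself here. To list the true original forms,
# build this mapping before applying stem_text.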
base_form_mapping = {}
for text in combined_df["text_lower_case"]:
    words = text.split()
    for word in words:
        stemmed_word = stemmer.stem(word)
        if stemmed_word not in base_form_mapping:
            base_form_mapping[stemmed_word] = set()
        base_form_mapping[stemmed_word].add(word)

In [93]:
# Display the top 5 stemmed words with their original base forms
print("Top 5 stemmed words with their original base forms:")
for stemmed_word, count in top_5_stemmed_words:
    base_forms = base_form_mapping.get(stemmed_word, set())
    base_form_display = ', '.join(sorted(base_forms))  # Display unique base forms in a stable order
    print(f"Stemmed Word: {stemmed_word}, Count: {count}, Original Words: {base_form_display}")

Top 5 stemmed words with their original base forms:
Stemmed Word: check, Count: 568, Original Words: check
Stemmed Word: video, Count: 361, Original Words: video
Stemmed Word: song, Count: 274, Original Words: song
Stemmed Word: Ôªø, Count: 267, Original Words: Ôªø
Stemmed Word: like, Count: 256, Original Words: like
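In the output above every stem lists only itself, because the mapping was built from text that had already been stemmed. The sketch below shows the intended idea on a few invented, unstemmed comments. To keep it runnable without NLTK it uses toy_stem, a hypothetical suffix stripper that merely stands in for PorterStemmer.stem and is far cruder than the real algorithm.

```python
from collections import defaultdict

def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer.stem(): strips a few suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Invented comments that have NOT been stemmed yet.
unstemmed_comments = ["check checking checked", "videos video", "songs song"]

# Collect, for every stem, the set of original word forms that produced it.
stem_to_originals = defaultdict(set)
for comment in unstemmed_comments:
    for word in comment.split():
        stem_to_originals[toy_stem(word)].add(word)

for stem, originals in sorted(stem_to_originals.items()):
    print(f"{stem}: {', '.join(sorted(originals))}")
```

The same loop works with the real stemmer, provided it runs on the column before stem_text overwrites it.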


In [94]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case', 'frequent_removed',
       'rare_removed'],
      dtype='object')

In [95]:
combined_df[['CLASS', 'text_lower_case']].head(10)

| | CLASS | text_lower_case |
|---|---|---|
| 0 | 1 | huh anyway check youtub channel kobyoshi02 |
| 1 | 1 | hey guy check new channel first vid us monkey ... |
| 2 | 1 | test say murdevcom |
| 3 | 1 | shake sexi ass channel enjoy Ôªø |
| 4 | 1 | watchvvtarggvgtwq check Ôªø |
| 5 | 1 | hey check new websit site kid stuff kidsmediau... |
| 6 | 1 | subscrib channel Ôªø |
| 7 | 0 | turn mute soon came want check viewsÔªø |
| 8 | 1 | check channel funni videosÔªø |
| 9 | 1 | u shouldd check channel tell nextÔªø |


In [96]:
# Randomly select 5 stemmed rows.
random_sample = combined_df[['CLASS', 'text_lower_case']].sample(n=5)
print(random_sample)

      CLASS                                    text_lower_case
1749      1            brazil pleas subscrib channel love allÔªø
602       0                           song never get old lt3 Ôªø
1329      1  guy check extraordinari websit call zonepacom ...
1791      1  hello guysi found way make money onlin get pai...
1479      0                                  anybodi els 2015Ôªø


#### Lemmatization

We can perform lemmatization in two ways. 

1. Without POS tagging: This is less accurate because the lemmatizer then treats every word as a noun by default, which can lead to errors.
2. With POS tagging: This is more accurate because the lemmatizer takes the word's role in the sentence into account, which reduces the likelihood of errors.

In the following code, we will take up the second process.

Note on the coding approach: the POS tags produced by NLTK's tagger are more detailed than the broader categories used by WordNet. NLTK's POS tagger provides the necessary contextual information, and converting its detailed tags to WordNet's simpler POS tags ensures that the lemmatizer receives that context.

In [97]:
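import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

# Download the resources the tagger and the lemmatizer need (skipped if already present).
# Newer NLTK releases may instead ask for 'averaged_perceptron_tagger_eng' and 'punkt_tab'.
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)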

# Convert NLTK POS tags to WordNet POS tags as explained above.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Assume noun if no match is found.

In [98]:
# Test on sample text.
text = """Note on the approach of coding: POS tags from NLTK’s are more detailed 
          compared to the broader ones from WordNet. NLTK’s POS tagger provides the 
          necessary contextual information. Converting NLTK’s detailed tags to WordNet's 
          simpler POS tags ensures that the lemmatizer has the necessary context.
"""

# Tokenize and find the POS tags.
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Initialize the lemmatizer.
lemmatizer = WordNetLemmatizer()

# Lemmatize with POS tags
lemmatized_words = [lemmatizer.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tags]

print(lemmatized_words)

['Note', 'on', 'the', 'approach', 'of', 'coding', ':', 'POS', 'tag', 'from', 'NLTK', '’', 's', 'be', 'more', 'detailed', 'compare', 'to', 'the', 'broad', 'one', 'from', 'WordNet', '.', 'NLTK', '’', 's', 'POS', 'tagger', 'provide', 'the', 'necessary', 'contextual', 'information', '.', 'Converting', 'NLTK', '’', 's', 'detail', 'tag', 'to', 'WordNet', "'s", 'simpler', 'POS', 'tag', 'ensure', 'that', 'the', 'lemmatizer', 'have', 'the', 'necessary', 'context', '.']


In [99]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case', 'frequent_removed',
       'rare_removed'],
      dtype='object')

In [100]:
# We will apply the lemmatizer directly to combined_df['text_lower_case'], as it is a necessary step in our case.
# First write a function to lemmatize text.
def lemmatize_text(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tags]
    return ' '.join(lemmatized_tokens)

In [101]:
# Apply the lemmatization function directly to combined_df['text_lower_case'].
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(lemmatize_text)

# Display the DataFrame with the lemmatized text.
print(combined_df[['CLASS','text_lower_case']])

      CLASS                                    text_lower_case
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                          watchvvtarggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


We are only halfway through. We still have the following text pre-processing steps, which we will apply directly to our main text column, combined_df['text_lower_case']:
- Removal of emojis
- Removal of emoticons
- Conversion of emoticons to words
- Conversion of emojis to words
- Removal of URLs
- Removal of HTML tags
- Chat words conversion
- Spelling correction

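Below we will take them up one by one. Keep in mind that punctuation has already been stripped from this column, so the steps that rely on characters such as :, /, < and > (emoticons, URLs and HTML tags) will find few matches in it; in a fresh pipeline these steps would normally run before punctuation removal.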

#### Removal of emojis

In text pre-processing, we often remove emojis to simplify the text data. Removing them standardizes the input, makes it more uniform and easier to process, and lets models focus on the most relevant content.

In [102]:
# We will use regular expressions to remove emojis directly from combined_df['text_lower_case'].

import pandas as pd
import re

# Function to remove emojis
def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # Emoticons
        u"\U0001F300-\U0001F5FF"  # Symbols & Pictographs
        u"\U0001F680-\U0001F6FF"  # Transport & Map Symbols
        u"\U0001F1E0-\U0001F1FF"  # Flags (iOS)
        u"\U00002702-\U000027B0"  # Miscellaneous Symbols
        u"\U000024C2-\U0001F251"  # Very broad range: also matches non-emoji characters such as CJK text and the U+FEFF byte order mark
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [103]:
# Apply the function to combined_df['text_lower_case'] and store the result in a new column.
combined_df['emojis_removed'] = combined_df['text_lower_case'].apply(remove_emojis)

# Print the output.
print(combined_df[['CLASS','emojis_removed']])

      CLASS                                     emojis_removed
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                       shake sexi as channel enjoy 
4         1                           watchvvtarggvgtwq check 
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


In [104]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case', 'frequent_removed',
       'rare_removed', 'emojis_removed'],
      dtype='object')
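The last range in the pattern above runs from U+24C2 to U+1F251, so it removes far more than emojis: CJK text falls inside it, and so does the byte order mark U+FEFF, which is why the trailing marker disappears from rows 3 and 4 in the output. If that is not what you want, a narrower pattern is one option. The sketch below is only an illustration; its ranges are a hand-picked subset and not a complete inventory of emoji code points, and the function name remove_emojis_narrow is our own.

```python
import re

# Narrower, illustrative ranges; not a complete inventory of emoji code points.
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows some emoji
    "]+"
)

def remove_emojis_narrow(text):
    # Strip the byte order mark explicitly instead of via a catch-all range.
    return NARROW_EMOJI_PATTERN.sub("", text.replace("\ufeff", ""))

# Heart + variation selector, a smiling face, two Japanese characters, and a BOM.
sample = "best song \u2764\ufe0f\U0001F60D \u97f3\u697d\ufeff"
print(remove_emojis_narrow(sample))
```

Here the emojis and the byte order mark are removed while the Japanese characters are kept.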

#### Removal of emoticons

You might be wondering about the difference between emojis and emoticons. Emojis are colourful digital icons such as 😊 (smiling face) or 🚀 (rocket) that stand for objects or emotions. Emoticons, on the other hand, are text-based symbols such as :-) (smiley face) or <3 (heart), created with keyboard characters to express feelings.

In [116]:
import re

# Apply RegEx to remove emoticons. Write a function first.
def remove_emoticons(text):
    # RegEx pattern to match commonly used emoticons.
    emoticon_pattern = re.compile(
        r'[:;=8][\-o\*]?[)\]\(\[dDpP\||/\\\^]',
        re.UNICODE)
    return emoticon_pattern.sub(r'', text)

In [117]:
# Sample sentence with emoticons.
sample_sentence = "Hello John! :) How are you? :P I hope you're good today. :D"

# Remove emoticons from the sample sentence
emoticons_removed = remove_emoticons(sample_sentence)

# Print the result
print("Original Sentence:", sample_sentence)
print("Cleaned Sentence:", emoticons_removed)

Original Sentence: Hello John! :) How are you? :P I hope you're good today. :D
Cleaned Sentence: Hello John!  How are you?  I hope you're good today. 
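One caveat with the pattern above: it matches anywhere in the text, so a digit 8 followed by a letter such as p or d also counts as an emoticon, and "8pm" would lose its first two characters. A possible refinement, sketched below under the assumption that emoticons are separated from words by whitespace, is to require whitespace or a string boundary on both sides. The sample sentence is invented.

```python
import re

# Same character classes as the pattern above, but the match must be
# delimited by whitespace or by the start/end of the string.
BOUNDED_EMOTICON_PATTERN = re.compile(
    r'(?<!\S)[:;=8][\-o\*]?[)\](\[dDpP|/\\^](?!\S)'
)

def remove_emoticons_bounded(text):
    return BOUNDED_EMOTICON_PATTERN.sub('', text)

sample = "doors open at 8pm :) see you there :D"
print(remove_emoticons_bounded(sample))
```

The trade-off is that an emoticon glued to a word, as in "hi:)", is no longer removed.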


In [118]:
# Apply the function to the combined_df['text_lower_case']. We will store it in another column for now.
combined_df['emoticons_removed'] = combined_df['text_lower_case'].apply(remove_emoticons)

# Print the output.
print(combined_df[['CLASS','emoticons_removed']])

      CLASS                                  emoticons_removed
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                          watchvvtarggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


In [119]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case', 'frequent_removed',
       'rare_removed', 'emojis_removed', 'emoticons_removed'],
      dtype='object')

#### Conversion of emoticons to words

We can map emoticons to corresponding descriptions such as "smiley face" or "grinning face". This conversion preserves the sentiment conveyed by emoticons as plain words, which are simpler to analyse than the raw emoticons. The retained information can help to improve the accuracy and effectiveness of the NLP task at hand.

In [121]:
# Map the common emoticons to words.
emoticon_to_word = {
    ':)': 'smiley face',
    ':D': 'grinning face',
    ':P': 'playful face',
    ':-)': 'smiley face',
    ':-D': 'grinning face',
    ':-P': 'playful face'
}

# Function to replace emoticons with words.
def emoticons_to_word_converter(text):
    for emoticon, word in emoticon_to_word.items():
        text = text.replace(emoticon, word)
    return text

In [122]:
# Sample sentence.
sample_sentence = "Hello there! :) How are you? :P I hope you're doing well. :D"

# Convert emoticons to words.
converted_sentence = emoticons_to_word_converter(sample_sentence)

# Print the output.
print("Original Sentence:", sample_sentence)
print("Converted Sentence:", converted_sentence)

Original Sentence: Hello there! :) How are you? :P I hope you're doing well. :D
Converted Sentence: Hello there! smiley face How are you? playful face I hope you're doing well. grinning face


In [124]:
# Apply the function directly to combined_df['text_lower_case'].
# It is likely to be useful in increasing the accuracy of our analysis.

# Apply the function to the 'text_lower_case' column.
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(emoticons_to_word_converter)

# Print the output.
print(combined_df[['CLASS', 'text_lower_case']])

      CLASS                                    text_lower_case
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                          watchvvtarggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


#### Conversion of emojis to words

In [128]:
# Install emoji if you have not done so earlier.
!pip install emoji  # It is the library required for this step.

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
                                              0.0/431.4 kB ? eta -:--:--
     -------                                 81.9/431.4 kB 2.3 MB/s eta 0:00:01
     ----------------------                 256.0/431.4 kB 3.2 MB/s eta 0:00:01
     -------------------------------------- 431.4/431.4 kB 3.4 MB/s eta 0:00:00
Installing collected packages: emoji
Successfully installed emoji-2.12.1


In [131]:
import emoji 

# Function that converts emojis to words.
def emojis_to_word_converer(text):
    # Convert emojis to their corresponding descriptions.
    return emoji.demojize(text)

In [132]:
# Example usage
sample_text = "Hello John! 😊 How are you? 🚀 I hope you're good today. 🎉"
converted_text = emojis_to_word_converer(sample_text)

print("Original Text:", sample_text)
print("Converted Text:", converted_text)

Original Text: Hello John! 😊 How are you? 🚀 I hope you're good today. 🎉
Converted Text: Hello John! :smiling_face_with_smiling_eyes: How are you? :rocket: I hope you're good today. :party_popper:
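If installing a third-party package is not an option, the standard library can approximate this conversion. The sketch below is not how the emoji package works internally. It takes the official Unicode character names from unicodedata and formats them in a similar :name: style, so some names may differ from those produced by emoji.demojize. Treating every code point from U+1F000 upward as an emoji is a rough assumption made only for this illustration.

```python
import unicodedata

def emojis_to_names(text):
    converted = []
    for char in text:
        # Rough heuristic (an assumption): treat code points from U+1F000
        # upward as emoji; everything else is copied unchanged.
        if ord(char) >= 0x1F000:
            name = unicodedata.name(char, "")
            if name:
                converted.append(":" + name.lower().replace(" ", "_") + ":")
                continue
        converted.append(char)
    return "".join(converted)

# U+1F680 is the rocket, U+1F389 the party popper.
print(emojis_to_names("Launch day \U0001F680 party \U0001F389"))
```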


In [133]:
# Apply the function to combined_df['text_lower_case'].
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(emojis_to_word_converer)

# Print the output.
print(combined_df[['CLASS', 'text_lower_case']])

      CLASS                                    text_lower_case
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                          watchvvtarggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


#### Removal of URLs

In most text analysis tasks, URLs are noise, since the address itself rarely carries useful meaning. Removing them cleans and standardizes the input data.

In [134]:
import re

# Let's write a function to remove URLs from the input text.
def url_remover(text):
    url_pattern = re.compile(r'http[s]?://\S+|www\.\S+')
    return url_pattern.sub('', text)

In [136]:
# Sample usage below.
sample_text = "Check out this link: https://www.example.com and also visit http://example.org."
cleaned_text = url_remover(sample_text)

print("Original Text:", sample_text)
print("Cleaned Text:", cleaned_text)

Original Text: Check out this link: https://www.example.com and also visit http://example.org.
Cleaned Text: Check out this link:  and also visit 
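The order of the pre-processing steps matters for this pattern, because it relies on "://" and "." being present. The following minimal sketch compares both orders on an invented comment; string.punctuation is used here as a convenient equivalent of the punctuation set defined earlier.

```python
import re
import string

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

raw_comment = "great song, more at http://example.org/page"

# URL removal first, punctuation second: the link is removed cleanly.
urls_then_punct = URL_PATTERN.sub('', raw_comment).translate(PUNCTUATION_TABLE)

# Punctuation first: "://" and "." are gone, so the pattern no longer matches.
punct_then_urls = URL_PATTERN.sub('', raw_comment.translate(PUNCTUATION_TABLE))

print(repr(urls_then_punct))
print(repr(punct_then_urls))
```

In the second case the link survives as the single token httpexampleorgpage, much like the httpwww... tokens visible in the earlier outputs.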


In [137]:
# Apply the function to combined_df['text_lower_case'].
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(url_remover)

# Print the output.
print(combined_df[['CLASS','text_lower_case']])

      CLASS                                    text_lower_case
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                          watchvvtarggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


#### Removal of HTML tags

HTML tags are noise and irrelevant data in most text analysis. Removing them can improve the accuracy of our text analysis.

In [141]:
import re

# A simple function to remove HTML tags from the input text.
def html_tag_remover(text):
    html_tag_pattern = re.compile(r'<[^>]+>')
    return html_tag_pattern.sub('', text)

In [145]:
# Sample usage.
sample_text = "<p>Hi John! <a href='https://example_url.com'>Click here</a> to visit.</p>"
cleaned_text = html_tag_remover(sample_text)

print("Original Text:", sample_text)
print("Cleaned Text:", cleaned_text)

Original Text: <p>Hi John! <a href='https://example_url.com'>Click here</a> to visit.</p>
Cleaned Text: Hi John! Click here to visit.
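Besides tags, scraped comments can contain HTML character entities such as &#39; for an apostrophe. Tokens like "she39s" in the earlier outputs are probably remnants of such entities, although that is an inference rather than something the data confirms. A minimal sketch, assuming the raw text is still available, removes the tags and then decodes the entities with the standard library's html.unescape; the sample string is invented.

```python
import html
import re

HTML_TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_html(text):
    # Remove tags first, then decode entities such as &#39; or &amp;.
    without_tags = HTML_TAG_PATTERN.sub('', text)
    return html.unescape(without_tags)

sample = "<b>she&#39;s</b> pretty &amp; talented"
print(strip_html(sample))
```

As with URL removal, this works only while the angle brackets and ampersands are still in the text, that is, before punctuation is stripped.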


In [146]:
# Apply the function to combined_df['text_lower_case'].
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(html_tag_remover)

# Print the output.
print(combined_df[['CLASS','text_lower_case']])

      CLASS                                    text_lower_case
0         1         huh anyway check youtub channel kobyoshi02
1         1  hey guy check new channel first vid u monkey i...
2         1                                 test say murdevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                          watchvvtarggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two reason 1it africa 2i born beauti...
1953      0                                                wow
1954      0                                   shakira u wiredo
1955      0                                shakira best dancer

[1956 rows x 2 columns]


#### Chat words conversion

Chat words are informal, abbreviated, or slang terms that are popular in online messaging, especially among younger users. Examples include "u" for "you" and "lol" for "laughing out loud." Converting chat words into their formal equivalents can help text processing models analyse them accurately.

In [148]:
# Map common chat words to more formal words. You can add on to this list.
chat_to_formal = {
    'u': 'you',
    'r': 'are',
    'lol': 'laughing out loud',
    'brb': 'be right back',
    'ttyl': 'talk to you later',
    'thx': 'thanks',
    'gtg': 'got to go',
    'b4': 'before'
}

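# Let's now write a function to replace chat words with formal words.
# Note: str.replace() matches substrings, so short keys such as 'u' and 'r'
# are also replaced inside longer words (see the sample output below).
def chat_word_converter(text):
    for chat_word, formal_word in chat_to_formal.items():
        text = text.replace(chat_word, formal_word)
    return text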

In [150]:
# Sample usage.
sample_text = "Hey John! r u coming to the function this evening? lol, thx for your invite!"
converted_text = chat_word_converter(sample_text)

print("Original Text:", sample_text)
print("Converted Text:", converted_text)

Original Text: Hey John! r u coming to the function this evening? lol, thx for your invite!
Converted Text: Hey John! are you coming to the fyounction this evening? laughing out loud, thanks foare yoyouare invite!
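
The converted text above shows a side effect of `str.replace`: it matches substrings, so "function" becomes "fyounction" and "for your" becomes "foare yoyouare". One possible remedy, sketched here under the assumption that chat words should only be replaced when they stand alone, is a regular expression with `\b` word boundaries. The function name `chat_word_converter_whole_words` is hypothetical, and the outputs recorded in this notebook were produced with the original function, not with this one.

```python
import re

chat_to_formal = {
    'u': 'you',
    'r': 'are',
    'lol': 'laughing out loud',
    'brb': 'be right back',
    'ttyl': 'talk to you later',
    'thx': 'thanks',
    'gtg': 'got to go',
    'b4': 'before'
}

# Build one alternation pattern from the dictionary keys (longest first).
# The \b anchors restrict matches to whole words.
chat_word_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(word) for word in
                      sorted(chat_to_formal, key=len, reverse=True)) + r')\b'
)

def chat_word_converter_whole_words(text):
    """Replace chat words only where they appear as whole words."""
    return chat_word_pattern.sub(lambda match: chat_to_formal[match.group(1)], text)

sample_text = "Hey John! r u coming to the function this evening? lol, thx for your invite!"
converted_text = chat_word_converter_whole_words(sample_text)
print("Converted Text:", converted_text)
```

The pattern is case-sensitive, which suits a column that has already been lower-cased.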


In [151]:
# Apply the function to combined_df['text_lower_case'].
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(chat_word_converter)

# Print the output.
print(combined_df[['CLASS','text_lower_case']])

      CLASS                                    text_lower_case
0         1   hyouh anyway check yoyoutyoub channel kobyoshi02
1         1  hey gyouy check new channel fiarest vid you mo...
2         1                             test say myouaredevcom
3         1                      shake sexi as channel enjoy Ôªø
4         1                        watchvvtaareggvgtwq check Ôªø
...     ...                                                ...
1951      0                           love song sing camp time
1952      0  love song two areeason 1it afareica 2i boaren ...
1953      0                                                wow
1954      0                             shakiarea you wiareedo
1955      0                            shakiarea best danceare

[1956 rows x 2 columns]


#### Spelling correction

Spelling correction maps misspelled variants of a word to a single form. This reduces noise in the vocabulary and can help in accurate analysis of the input text.

In [None]:
!pip install SpellChecker # Necessary library.

In [None]:
!pip install indexer # Another necessary library.

In [None]:
!pip install pyspellchecker # Another necessary library; it provides the spellchecker module imported below.

In [175]:
from spellchecker import SpellChecker
import pandas as pd

# Initialize the spell checker
spell = SpellChecker()

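# Function to correct spelling in text.
def spelling_corrector(text):
    words = text.split()  # Split the text into words.
    corrected_words = []
    for word in words:
        candidates = spell.candidates(word)  # Look up the candidates once per word.
        # candidates is a set, so pop() returns an arbitrary candidate,
        # not necessarily the most likely correction.
        corrected_words.append(candidates.pop() if candidates else word)
    return ' '.join(corrected_words)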

In [176]:
# Sample usage.
sample_text = """I havv a speling mistke in this sentnce.', 
                            'Anothr exampl with spellig erors.', 
                            'No spellig errors here!
               """
converted_text = spelling_corrector(sample_text)

print("Original Text:", sample_text)
print("Converted Text:", converted_text)

Original Text: I havv a speling mistke in this sentnce.', 
                            'Anothr exampl with spellig erors.', 
                            'No spellig errors here!
               
Converted Text: I have a spieling mistake in this sentnce.', another example with spelling erors.', no spelling errors heres


In [None]:
# Apply the function to combined_df['text_lower_case'].
combined_df['text_lower_case'] = combined_df['text_lower_case'].apply(spelling_corrector)

# Print the output. 
print(combined_df[['CLASS','text_lower_case']])

Note: Applying the spelling_corrector function to combined_df['text_lower_case'] took so long that I aborted the kernel. You can try it on a faster machine or in the cloud. I also tried TextBlob, but it was similarly slow.
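
One way to reduce the running time is to correct each distinct word only once, since the same words recur across many comments. The sketch below illustrates the idea with a toy corrector so that it runs without any extra packages. The names `make_cached_corrector` and `toy_correct_word` are hypothetical. With pyspellchecker, a word-level function such as `spell.correction` could be passed in instead, though whether that makes the full run fast enough is untested here.

```python
from functools import lru_cache

def make_cached_corrector(correct_word):
    """Wrap a word-level corrector so each distinct word is corrected once."""
    @lru_cache(maxsize=None)
    def correct_cached(word):
        fixed = correct_word(word)
        # Fall back to the original word if the corrector returns nothing.
        return fixed if fixed else word

    def correct_text(text):
        return ' '.join(correct_cached(word) for word in text.split())

    return correct_text

# Toy word-level corrector for illustration only (hypothetical).
toy_fixes = {'havv': 'have', 'speling': 'spelling'}
calls = []

def toy_correct_word(word):
    calls.append(word)  # Record every real lookup.
    return toy_fixes.get(word)

correct_text = make_cached_corrector(toy_correct_word)
texts = ["i havv a speling problem", "i havv no problem"]
corrected = [correct_text(text) for text in texts]
print(corrected)
print("Lookups performed:", len(calls))
```

The two sample texts contain nine words in total but only six distinct ones, so the underlying corrector is called six times.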

In [177]:
combined_df.columns

Index(['CONTENT', 'CLASS', 'text_lower_case', 'frequent_removed',
       'rare_removed', 'emojis_removed', 'emoticons_removed'],
      dtype='object')

- The useful columns for further processing are 'CLASS' and 'text_lower_case'. All others are for demonstration. You can use their code in your projects depending upon the type of text analysis you are doing.

#### Apply Machine Learning Model

- Now all the below steps are similar to what we have seen in Chapter 1.

In [183]:
# Separate features and the target.
X = np.array(combined_df['text_lower_case'])
y = np.array(combined_df['CLASS'])

In [181]:
count_vectorizer = feature_extraction.text.CountVectorizer() # Instantiate CountVectorizer().

In [184]:
X.shape

(1956,)

In [185]:
# Split into train and test datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [186]:
# We will use count_vectorizer.fit_transform() for X_train only, so the vocabulary is learned from the training data.
X_train = count_vectorizer.fit_transform(X_train)

In [187]:
X_test = count_vectorizer.transform(X_test) # We will do only transform() with X_test.
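
To see why the vectorizer is fitted on the training split and only applied to the test split, here is a simplified pure-Python sketch of the same idea. It is not scikit-learn's implementation: the helper names `fit_vocabulary` and `transform_counts` are hypothetical, and tokens are split on whitespace, which differs from CountVectorizer's default tokenizer.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({word for doc in train_docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform_counts(docs, vocabulary):
    """Count known words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_docs = ["check my channel", "love this song"]
test_docs = ["love my new channel"]

vocabulary = fit_vocabulary(train_docs)
test_matrix = transform_counts(test_docs, vocabulary)

print(list(vocabulary))
print(test_matrix)
```

The word "new" appears only in the test document, so it has no column. The test matrix therefore has the same width as the training matrix, which is what the classifier expects.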

In [190]:
# Initialize the Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [191]:
scores = model_selection.cross_val_score(clf, X_train, y_train, cv=3, scoring="f1")
scores

array([0.87470449, 0.9044289 , 0.90487239])

Note: Surprisingly, we got considerably better scores in Chapter 1, in which minimal pre-processing was done. It looks like a considerable amount of information is lost in these pre-processing steps. The substring replacements in the chat words step, visible in its output above, are one likely contributor. It's an interesting lesson for all of us.

In [192]:
clf.fit(X_train, y_train) # Fit the model.

In [193]:
# Make predictions with the test data. 
y_pred = clf.predict(X_test)

In [194]:
# Construct a dataframe with columns as y_test and y_pred. 
test_df = pd.DataFrame()
test_df["y"] = y_test
test_df["y_predict"] = y_pred

In [195]:
# Display 10 random rows from test_df
random_sample = test_df.sample(n=10)
print(random_sample)

     y  y_predict
9    1          1
218  1          1
3    1          1
362  1          1
291  1          1
174  0          0
99   1          1
357  0          0
531  0          0
294  1          1


- The ground truth, y_test, still compares well with the predictions.
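
A ten-row sample gives only an impression. A summary metric over the whole test set is more informative. The sketch below computes accuracy and F1 from two label lists in plain Python. The function name `accuracy_and_f1` and the toy labels are hypothetical. In the notebook, `accuracy_score(y_test, y_pred)` (already imported) and `sklearn.metrics.f1_score` would give the corresponding values for the real predictions.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, f1

# Toy labels for illustration only, not taken from the notebook's results.
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 0, 1, 1]

accuracy, f1 = accuracy_and_f1(toy_true, toy_pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```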

#### References (the dataset source)

[1] Alberto, T.C. and Lochter, J.V. (2017). YouTube Spam Collection. UCI Machine Learning Repository. https://doi.org/10.24432/C58885.

Code Snippet 5.4