### Text Preprocessing
Text preprocessing is a series of steps that clean and prepare raw text for analysis or modeling. Typical tasks include removing noise (HTML tags, URLs, punctuation, emojis, and emoticons), converting text to lowercase, correcting spelling, expanding abbreviations into their full forms, removing stopwords, and normalizing words through stemming or lemmatization.


In [1]:
# ! python3 -m pip install ipykernel


In [2]:
import pandas as pd

tweet_df = pd.read_csv('tweets.csv')
tweet_df = tweet_df.head(50)

In [3]:
tweet_df.drop(columns=['id', 'isRetweet', 'isDeleted', 'device', 'favorites',
       'retweets', 'date', 'isFlagged'], inplace=True)

#### REPLACE SHORT FORM WORDS WITH FULL FORM

In [4]:
full_form_dict = {
    'KAG2020':'Keep America Great 2020'
}

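def correct_short_forms(text):
    # Look up each whitespace-separated token; the match is exact and
    # case-sensitive, so a token such as '#KAG2020' is left unchanged.
    words = text.split()
    corrected_words = [full_form_dict.get(word, word) for word in words]
    return ' '.join(corrected_words)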

tweet_df['text'] = tweet_df['text'].apply(correct_short_forms)
tweet_df

Unnamed: 0,text
0,Republicans and Democrats have both created ou...
1,I was thrilled to be back in the Great city of...
2,RT @CBS_Herridge: READ: Letter to surveillance...
3,The Unsolicited Mail In Ballot Scam is a major...
4,RT @MZHemingway: Very friendly telling of even...
5,RT @WhiteHouse: President @realDonaldTrump ann...
6,Getting a little exercise this morning! https:...
7,https://t.co/4qwCKQOiOw
8,https://t.co/VlEu8yyovv
9,https://t.co/z5CRqHO8vg
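
Because the lookup above only matches whole whitespace-separated tokens, an abbreviation written as a hashtag or followed by punctuation is skipped. The following is a minimal sketch of one way around that, using a case-insensitive regular expression with word boundaries. The names `SHORT_FORMS` and `expand_short_forms` are hypothetical and are not part of the original notebook.

```python
import re

# Hypothetical mapping for illustration; keys are stored in lowercase.
SHORT_FORMS = {
    "kag2020": "Keep America Great 2020",
}

# \b word boundaries let an abbreviation match next to '#' or punctuation.
_short_form_re = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in SHORT_FORMS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_short_forms(text):
    """Replace known abbreviations regardless of case or adjacent punctuation."""
    return _short_form_re.sub(lambda m: SHORT_FORMS[m.group(1).lower()], text)

example = "Thank you for a wonderful evening!! #KAG2020"
expanded = expand_short_forms(example)
print(expanded)
```

The `#` is kept here and would be removed later by the punctuation step.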


#### LOWERCASING

In [5]:
tweet_df['text'] = tweet_df['text'].str.lower()
tweet_df.head(5)

Unnamed: 0,text
0,republicans and democrats have both created ou...
1,i was thrilled to be back in the great city of...
2,rt @cbs_herridge: read: letter to surveillance...
3,the unsolicited mail in ballot scam is a major...
4,rt @mzhemingway: very friendly telling of even...


#### REMOVE HTML TAGS

In [6]:
import re

def remove_html_tags(text):
    pattern = re.compile(r'<.*?>') 
    return pattern.sub('', text)


In [7]:
tweet_df['text'] = tweet_df['text'].apply(remove_html_tags)

In [8]:
tweet_df['text'][1]

'i was thrilled to be back in the great city of charlotte, north carolina with thousands of hardworking american patriots who love our country, cherish our values, respect our laws, and always put america first! thank you for a wonderful evening!! #kag2020 https://t.co/dnjzfrsl9y'
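
Stripping tags does not decode HTML entities. The `amp` tokens in the final dataframe suggest that the raw tweets contain `&amp;`, which the punctuation step later reduces to `amp`. As a hedged sketch, the standard library's `html.unescape` can decode entities after the tags are removed. The function name `strip_html` and the sample string are assumptions made for illustration.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Remove tags first, then decode entities such as &amp; into characters."""
    without_tags = _TAG_RE.sub("", text)
    return html.unescape(without_tags)

sample = "much smaller &amp; far fewer <b>ballots</b>"
cleaned = strip_html(sample)
print(cleaned)
```

Decoding after tag removal avoids treating an escaped `&lt;b&gt;` in the text as a real tag.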

#### REMOVE URL

In [9]:
def remove_url(text):
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return pattern.sub(r'',text)

tweet_df['text'] = tweet_df['text'].apply(remove_url)

In [10]:
tweet_df['text'][7]

''
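
A shorter pattern is often enough for tweets. The sketch below is an alternative under stated assumptions, not the pattern used above: it treats any run of non-whitespace characters after `http://`, `https://`, or `www.` as a URL, and it collapses the whitespace left behind. The name `strip_urls` is hypothetical.

```python
import re

# Assumption: a URL ends at the next whitespace character.
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    return " ".join(_URL_RE.sub("", text).split())

print(strip_urls("getting a little exercise this morning! https://t.co/4qwCKQOiOw"))
```

Because `\S+` is greedy, punctuation attached directly to the end of a URL is removed together with it.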

#### REMOVE PUNCTUATION


In [11]:
import string

def remove_punctuation(text):
    pattern = re.compile(f"[{re.escape(string.punctuation)}]")
    return pattern.sub(r'',text)

tweet_df['text'] = tweet_df['text'].apply(remove_punctuation)
tweet_df.head()


Unnamed: 0,text
0,republicans and democrats have both created ou...
1,i was thrilled to be back in the great city of...
2,rt cbsherridge read letter to surveillance cou...
3,the unsolicited mail in ballot scam is a major...
4,rt mzhemingway very friendly telling of events...
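
The same result can be sketched without a regular expression by using `str.translate` with a deletion table. The name `strip_punctuation` is hypothetical. Note that `string.punctuation` contains only ASCII punctuation, so typographic characters such as the curly apostrophe in "wasn’t" are left in place by either approach.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Delete ASCII punctuation; non-ASCII punctuation is not affected."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("rt @cbs_herridge: read: letter"))
print(strip_punctuation("biden wasn’t a force at all."))
```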


#### SPELLING CORRECTION


In [12]:
from textblob import TextBlob

def correct_spelling(text):
    textBLB = TextBlob(text)
    return textBLB.correct().string

tweet_df['text'] = tweet_df['text'].apply(correct_spelling)


In [13]:
tweet_df['text'][14]

'thank you elise '

In [14]:
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')

nltk.download('stopwords')

from nltk.tokenize import word_tokenize,sent_tokenize



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LEGION\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LEGION\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### REMOVING STOPWORDS

In [15]:
stop_words = set(stopwords.words('english'))

In [16]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\LEGION\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [17]:

def removing_stopwords(text):
    # Reuse the stop_words set built above instead of rebuilding it on every call.
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)


tweet_df['text'] = tweet_df['text'].apply(removing_stopwords)
tweet_df.head(3)


Unnamed: 0,text
0,republicans democrats created economic problems
1,thrilled back great city charlotte north carol...
2,cbsherridge read letter surveillance court obt...
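
Stop word lists usually include negations such as `not`, and removing them can flip the meaning of a sentence. If that matters for the downstream task, the set can be customized before filtering. The sketch below uses a small toy list and a regex tokenizer so that it runs without NLTK. `BASE_STOP_WORDS`, `NEGATIONS`, and `drop_stop_words` are hypothetical names, and the toy list stands in for the much longer NLTK list.

```python
import re

# Toy stop word list for illustration only.
BASE_STOP_WORDS = {"i", "was", "to", "be", "in", "the", "of", "and",
                   "have", "both", "our", "not", "no"}
NEGATIONS = {"not", "no"}

# Keep negations by removing them from the stop word set.
STOP_WORDS = BASE_STOP_WORDS - NEGATIONS

def drop_stop_words(text, stop_words=STOP_WORDS):
    """Tokenize on word characters and drop tokens found in stop_words."""
    tokens = re.findall(r"[\w']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)

print(drop_stop_words("i was not thrilled to be back in the great city"))
```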


#### REMOVE EMOJI

In [18]:
def remove_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # Emoticons
                               u"\U0001F300-\U0001F5FF"  # Symbols & Pictographs
                               u"\U0001F680-\U0001F6FF"  # Transport & Map Symbols
                               u"\U0001F700-\U0001F77F"  # Alchemical Symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U0001FB00-\U0001FBFF"  # Symbols for Legacy Computing
                               u"\U0001F004-\U0001F0CF"  # Mahjong tiles, dominoes, and playing cards
                               u"\U0001F10D-\U0001F10F"  # Enclosed Alphanumeric Supplement
                               u"\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


tweet_df['text'] = tweet_df['text'].apply(remove_emojis)


In [19]:
tweet_df['text'][6]

'getting little exercise morning'
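
Hand-written code point ranges need updating as new emoji are added. A broader alternative, sketched here under stated assumptions, is to drop every character whose Unicode general category is `So` (Symbol, other), together with the zero-width joiner and the variation selector that hold emoji sequences together. The name `strip_symbols` is hypothetical. This approach also removes non-emoji symbols such as `©`, so it trades precision for coverage.

```python
import unicodedata

# Zero-width joiner and variation selector-16 appear inside emoji sequences.
_EMOJI_GLUE = {"\u200d", "\ufe0f"}

def strip_symbols(text):
    """Drop characters in category 'So' plus emoji joiner/selector characters."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "So" and ch not in _EMOJI_GLUE
    )

print(strip_symbols("getting a little exercise \U0001F600\U0001F680"))
```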

#### REMOVE EMOTICONS

In [20]:
EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, angry or pouting",
    u":-\(":"Frown, sad, angry or pouting",
    u":\(":"Frown, sad, angry or pouting",
    u":‑c":"Frown, sad, angry or pouting",
    u":c":"Frown, sad, angry or pouting",
    u":‑<":"Frown, sad, angry or pouting",
    u":<":"Frown, sad, angry or pouting",
    u":‑\[":"Frown, sad, angry or pouting",
    u":\[":"Frown, sad, angry or pouting",
    u":-\|\|":"Frown, sad, angry or pouting",
    u">:\[":"Frown, sad, angry or pouting",
    u":\{":"Frown, sad, angry or pouting",
    u":@":"Frown, sad, angry or pouting",
    u">:\(":"Frown, sad, angry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad or Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=	":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(\・\・?":"Confusion",
    u"\(?_?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surprised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}


In [21]:
# Build the pattern once; the keys of EMOTICONS are already regex fragments.
emoticon_pattern = re.compile('(' + '|'.join(EMOTICONS) + ')')

def remove_emoticons(text):
    return emoticon_pattern.sub('', text)

tweet_df['text'] = tweet_df['text'].apply(remove_emoticons)
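
Two details affect how well emoticon removal works. First, most emoticons are made of punctuation, so this step finds little to remove when it runs after punctuation removal, as it does in this notebook. Second, a pattern without boundaries can match inside ordinary words. The sketch below shows one possible approach: it escapes literal emoticons with `re.escape`, tries longer emoticons first, and only matches emoticons that stand alone between whitespace. `SAMPLE_EMOTICONS` and `strip_emoticons` are hypothetical names, and the short list stands in for the full table.

```python
import re

# A few literal emoticons for illustration.
SAMPLE_EMOTICONS = [":)", ":-)", ":-))", ":(", ";)", ":D", "XD"]

# Longest first, so ':-))' is tried before ':-)'.
_alternatives = "|".join(
    re.escape(e) for e in sorted(SAMPLE_EMOTICONS, key=len, reverse=True)
)

# (?<!\S) and (?!\S) require whitespace or a string edge on both sides.
_emoticon_re = re.compile(r"(?<!\S)(?:" + _alternatives + r")(?!\S)")

def strip_emoticons(text):
    """Remove standalone emoticons and collapse leftover whitespace."""
    return " ".join(_emoticon_re.sub("", text).split())

print(strip_emoticons("great rally :-)) thank you :D"))
print(strip_emoticons("XDrive is fast"))
```

To have any effect, a function like this would need to run before the lowercasing and punctuation steps.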


In [None]:

print(tweet_df.iloc[9]['text'])




#### TOKENIZATION - There are many ways to implement tokenization.

In [51]:
normal_text = "republicans democrats created economic problems."
normal_para = "Trump won this debate, handily. Biden wasn’t a force at all. Trump was substantive, on-point, well-tempered."

##### Using the split function 

In [52]:
# word tokenization
tokenize1 = normal_text.split()
tokenize1

['republicans', 'democrats', 'created', 'economic', 'problems.']

In [53]:
# sentence tokenization
tokenize2 = normal_para.split(".")
tokenize2

['Trump won this debate, handily',
 ' Biden wasn’t a force at all',
 ' Trump was substantive, on-point, well-tempered',
 '']

##### Using regular expression

In [54]:
import re
tokenize3 = re.findall(r"[\w']+", normal_text)
tokenize3


['republicans', 'democrats', 'created', 'economic', 'problems']
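
Regular expressions can also handle sentence tokenization. As a minimal sketch, splitting on whitespace that follows `.`, `!`, or `?` keeps the closing punctuation and avoids the empty trailing string produced by `split(".")`. The name `split_sentences` is hypothetical, and the rule is naive: abbreviations such as "Dr." would be split incorrectly.

```python
import re

def split_sentences(paragraph):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())

normal_para = ("Trump won this debate, handily. Biden wasn’t a force at all. "
               "Trump was substantive, on-point, well-tempered.")
sentences = split_sentences(normal_para)
print(sentences)
```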

##### Using NLTK

In [55]:
from nltk.tokenize import word_tokenize,sent_tokenize


In [56]:
word_tokenize(normal_text)

['republicans', 'democrats', 'created', 'economic', 'problems', '.']

In [57]:
sent_tokenize(normal_para)

['Trump won this debate, handily.',
 'Biden wasn’t a force at all.',
 'Trump was substantive, on-point, well-tempered.']

##### Using spaCy

In [58]:
# !python -m spacy download en_core_web_sm

In [59]:
import spacy
nlp = spacy.load('en_core_web_sm')
tokenize4 = nlp(normal_text)
tokenize4

republicans democrats created economic problems.

In [60]:
for token in tokenize4:
    print(token)

republicans
democrats
created
economic
problems
.


In [61]:
def spacy_tokenize(text):
    # Reuse the `nlp` pipeline loaded above; reloading the model for every row is slow.
    return nlp(text)

tweet_df['text'] = tweet_df['text'].apply(spacy_tokenize)


In [62]:
tweet_df.head(3)

Unnamed: 0,text
0,"(republicans, democrats, created, economic, pr..."
1,"(thrilled, back, great, city, charlotte, north..."
2,"(cbsherridge, read, letter, surveillance, cour..."


#### STEMMING

Stemming reduces inflected words to a root form (the stem), mapping related words to the same stem even when the stem itself is not a valid word in the language.

Here each tweet is tokenized again, every token is stemmed inside a list comprehension, and the stems are joined back into a string so the result can be shown in the dataframe.

In [63]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def apply_stemming(text):
    # `nlp` is the spaCy pipeline loaded earlier, so the model is not reloaded here.
    tokenize_value = nlp(text)
    # Stem each token and join the stems back into a single string.
    stemmed_words = [stemmer.stem(token.text) for token in tokenize_value]
    return ' '.join(stemmed_words)

tweet_df['text'] = tweet_df['text'].apply(apply_stemming)


In [64]:
tweet_df.head(3)

Unnamed: 0,text
0,republican democrat creat econom problem
1,thrill back great citi charlott north carolina...
2,cbsherridg read letter surveil court obtain cu...


#### LEMMATIZATION

Lemmatization, unlike stemming, reduces inflected words in a way that ensures the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
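
The difference can be illustrated with a toy example that needs no external library. The suffix stripper below is a deliberately crude stand-in for a stemmer (it is not the Porter algorithm), and the lookup table is a stand-in for the vocabulary a real lemmatizer would consult. `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names.

```python
# Toy lookup table; a real lemmatizer uses a full vocabulary and part-of-speech tags.
LEMMA_TABLE = {
    "was": "be",
    "created": "create",
    "problems": "problem",
    "cities": "city",
}

def toy_stem(word):
    """Strip a few common suffixes without checking that the result is a word."""
    for suffix in ("ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form when it is known, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)

words = ["created", "cities", "problems", "was"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stems `creat` and `cit` are not English words, while every lemma is.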

The final outcome after all of the preprocessing steps applied above is:

In [65]:
pd.set_option('display.max_colwidth', None)
tweet_df

Unnamed: 0,text
0,republican democrat creat econom problem
1,thrill back great citi charlott north carolina thousand hardwork american patriot love countri cherish valu respect law alway put america first thank wonder even kag2020
2,cbsherridg read letter surveil court obtain cub news question disciplinari action who …
3,unsolicit mail ballot scar major threat democraci amp democrat know almost recent elect use system even though much smaller amp far fewer ballot count end disast larg number miss ballot amp fraud
4,mzhemingway friendli tell event come appar leak complaint media read articl the …
5,whitehous presid realdonaldtrump announc histor step protect constitut right pray public school http …
6,get littl exercis morn
7,
8,
9,


#### The data now looks clean enough for downstream tasks such as indexing and embedding.