### Text Preprocessing
Text preprocessing includes a series of steps used to clean and prepare raw text data for analysis or modeling. It typically includes tasks such as removing noise (like HTML tags, URLs, punctuation, emojis, and emoticons), converting text to lowercase, correcting spelling, replacing abbreviations with full forms, removing stopwords, and normalizing words through stemming or lemmatization. 


In [118]:
# ! python3 -m pip install ipykernel


In [119]:
import pandas as pd

data = {
    "Employee Name": ["Sujan","John", "Alice", "Bob", "Yere", "Michael", "Sarah", "David", "Linda", "Emily", "Robert",
                      "Sophia", "Daniel", "Olivia", "James", "Emma", "William", "Ava", "Joseph", "Cia"],
    "Review": [
        "Great company to work for! #TechLife",
        "I found <b> Table Tennis board</b> broken,can you please fix it ASAP! ;_;",
        "Work-life balance is terrible!!! 😡.",
        "Love the new project, it's amazing. #Innovation 💡",
        "<div>The team is very supportive.</div>",
        "The HR dpartmnt is responsive and helpful. 🤗",
        "The cafeteria food needs improvement. 🍔🍟",
        "Meetngs deadlines can be challenging https://www.meetlinglink.com ",
        "The company culture is fantastic! 😄",
        "The company's software dvelpmnt process is quite efficient and productive. 💼🖥️",
        "The office environment is clean and organized. 👌",
        "I wish we had more training opportunities :‑[ .",
        "The management is approachable and open to feedback.",
        "Great benefits and perks for employees! 💼🎉",
        "The project mnager is very understanding.",
        "I love the diversity in our team.",
        "Our IT infrastructure is top-notch. ",
        "I found a broken link on the company website.<a href='https://example.com'>here</a>",
        "I've seen some improvement in our software quality.",
        "The workload is manageable."
    ]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Employee Name,Review
0,Sujan,Great company to work for! #TechLife
1,John,"I found <b> Table Tennis board</b> broken,can ..."
2,Alice,Work-life balance is terrible!!! 😡.
3,Bob,"Love the new project, it's amazing. #Innovation 💡"
4,Yere,<div>The team is very supportive.</div>
5,Michael,The HR dpartmnt is responsive and helpful. 🤗
6,Sarah,The cafeteria food needs improvement. 🍔🍟
7,David,Meetngs deadlines can be challenging https://w...
8,Linda,The company culture is fantastic! 😄
9,Emily,The company's software dvelpmnt process is qui...


In [120]:
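# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
review_df = df[['Review']].copy()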
review_df

Unnamed: 0,Review
0,Great company to work for! #TechLife
1,"I found <b> Table Tennis board</b> broken,can ..."
2,Work-life balance is terrible!!! 😡.
3,"Love the new project, it's amazing. #Innovation 💡"
4,<div>The team is very supportive.</div>
5,The HR dpartmnt is responsive and helpful. 🤗
6,The cafeteria food needs improvement. 🍔🍟
7,Meetngs deadlines can be challenging https://w...
8,The company culture is fantastic! 😄
9,The company's software dvelpmnt process is qui...


#### REPLACE SHORT FORM WORDS WITH FULL FORM

In [121]:
full_form_dict = {
    'HR': 'Human Resource',
    'TT': 'Table Tennis',
    'IT': 'Information Technology',
    'ASAP': 'as soon as possible'
}

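def correct_short_forms(text):
    """Replace whole tokens found in full_form_dict with their full forms."""
    # The lookup is exact and case-sensitive, so this step must run before
    # lowercasing, and tokens with attached punctuation ("ASAP!") are not matched.
    words = text.split()
    corrected_words = [full_form_dict.get(word, word) for word in words]
    return ' '.join(corrected_words)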


review_df['Review'] = review_df['Review'].apply(correct_short_forms)
review_df


Unnamed: 0,Review
0,Great company to work for! #TechLife
1,"I found <b> Table Tennis board</b> broken,can ..."
2,Work-life balance is terrible!!! 😡.
3,"Love the new project, it's amazing. #Innovation 💡"
4,<div>The team is very supportive.</div>
5,The Human Resource dpartmnt is responsive and ...
6,The cafeteria food needs improvement. 🍔🍟
7,Meetngs deadlines can be challenging https://w...
8,The company culture is fantastic! 😄
9,The company's software dvelpmnt process is qui...
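
The token-by-token lookup above only replaces an abbreviation when it stands alone between spaces, which is why `ASAP!` in row 1 is left untouched. A minimal sketch of one alternative, assuming a regular expression with word boundaries is acceptable (the names `expand_short_forms` and `_abbrev_pattern` are illustrative, not part of the notebook):

```python
import re

# Same mapping as the notebook's full_form_dict.
full_form_dict = {
    "HR": "Human Resource",
    "TT": "Table Tennis",
    "IT": "Information Technology",
    "ASAP": "as soon as possible",
}

# One alternation wrapped in word boundaries: "ASAP!" matches, "ITEM" does not.
_abbrev_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, full_form_dict)) + r")\b"
)

def expand_short_forms(text):
    """Replace whole-word abbreviations and keep adjacent punctuation in place."""
    return _abbrev_pattern.sub(lambda m: full_form_dict[m.group(1)], text)

print(expand_short_forms("can you please fix it ASAP! ;_;"))
```

The match is still case-sensitive, so this step has to run before lowercasing; otherwise the pronoun "it" would be indistinguishable from "IT".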


#### LOWERCASING

In [122]:
review_df['Review'] = review_df['Review'].str.lower()
review_df.head(5)


Unnamed: 0,Review
0,great company to work for! #techlife
1,"i found <b> table tennis board</b> broken,can ..."
2,work-life balance is terrible!!! 😡.
3,"love the new project, it's amazing. #innovation 💡"
4,<div>the team is very supportive.</div>


#### REMOVE HTML TAGS

In [123]:
import re

# Compile once; the non-greedy match stops at the first closing bracket.
HTML_TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags(text):
    return HTML_TAG_PATTERN.sub('', text)


In [124]:
review_df['Review'] = review_df['Review'].apply(remove_html_tags)


In [125]:
review_df['Review'][1]

'i found  table tennis board broken,can you please fix it asap! ;_;'
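
The regular expression above is enough for simple tags, but it leaves HTML entities such as `&amp;` in the text. As a hedged alternative, the sketch below uses the standard library's `html.parser.HTMLParser` to keep only text nodes; the class name `_TextExtractor` and the function `strip_html` are illustrative choices, not names from the notebook.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; inside text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

print(strip_html("<div>The team is very supportive.</div>"))
```

For the short reviews in this dataset either approach works; the parser mainly pays off when the input contains entities or attribute values that themselves contain `>`.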

#### REMOVE URL

In [126]:
def remove_url(text):
    # Match http:// or https:// followed by any run of non-whitespace characters.
    # (In the earlier character class, "$-_" was read as a range, not three literals.)
    pattern = re.compile(r'https?://\S+')
    return pattern.sub('', text)

review_df['Review'] = review_df['Review'].apply(remove_url)


In [127]:
review_df['Review'][7]

'meetngs deadlines can be challenging '

#### REMOVE PUNCTUATION


In [128]:
import string

def remove_punctuation(text):
    pattern = re.compile(f"[{re.escape(string.punctuation)}]")
    return pattern.sub(r'',text)

review_df['Review'] = review_df['Review'].apply(remove_punctuation)
review_df.head()



Unnamed: 0,Review
0,great company to work for techlife
1,i found table tennis board brokencan you plea...
2,worklife balance is terrible 😡
3,love the new project its amazing innovation 💡
4,the team is very supportive
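
Deleting punctuation outright glues neighbouring words together when the writer left out a space, as in `brokencan` in row 1. A minimal sketch of one way around this, assuming it is acceptable to replace punctuation with a space and then collapse whitespace (the helper name `punctuation_to_space` is hypothetical):

```python
import re
import string

# Map every ASCII punctuation character to a space instead of deleting it.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    spaced = text.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", spaced).strip()

print(punctuation_to_space("table tennis board broken,can you please fix it"))
```

The trade-off is that contractions and hyphenated words are split as well (`it's` becomes `it s`, `work-life` becomes `work life`), so which variant is preferable depends on the downstream task.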


#### SPELLING CORRECTION


In [129]:
from textblob import TextBlob

def correct_spelling(text):
    textBLB = TextBlob(text)
    return textBLB.correct().string

review_df['Review'] = review_df['Review'].apply(correct_spelling)



In [130]:
review_df['Review'][14]

'the project manager is very understanding'

In [131]:
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')

nltk.download('stopwords')

from nltk.tokenize import word_tokenize,sent_tokenize



[nltk_data] Downloading package punkt to /home/fm-pc-
[nltk_data]     lt-275/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/fm-pc-
[nltk_data]     lt-275/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### REMOVING STOPWORDS

In [132]:
stop_words = set(stopwords.words('english'))

In [133]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/fm-pc-
[nltk_data]     lt-275/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [134]:

def removing_stopwords(text):
    # Reuse the module-level stop_words set instead of rebuilding it for every row.
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)


review_df['Review'] = review_df['Review'].apply(removing_stopwords)
review_df.head(3)



Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible 😡


#### REMOVE EMOJI

In [135]:
def remove_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # Emoticons
                               u"\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
                               u"\U0001F680-\U0001F6FF"  # Transport and Map Symbols
                               u"\U0001F700-\U0001F77F"  # Alchemical Symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U0001FB00-\U0001FBFF"  # Symbols for Legacy Computing
                               u"\U0001F004-\U0001F0CF"  # Mahjong Tiles, Domino Tiles, Playing Cards
                               u"\U0001F10D-\U0001F10F"  # Enclosed Alphanumeric Supplement
                               u"\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


review_df['Review'] = review_df['Review'].apply(remove_emojis)



In [136]:
review_df['Review'][6]

'cafeteria food needs improvement '
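
The ranges above cover the main emoji blocks but not the invisible variation selector (U+FE0F) that often follows an emoji, nor older symbols such as U+2764. The sketch below is one possible extension under that assumption; the name `remove_emojis_extended` and the exact choice of ranges are illustrative, not a complete emoji definition.

```python
import re

_EXTENDED_EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # main emoji and pictograph blocks
    "\u2600-\u27BF"          # Miscellaneous Symbols and Dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner used inside emoji sequences
    "]+"
)

def remove_emojis_extended(text):
    """Remove emoji plus the invisible characters that accompany them."""
    return _EXTENDED_EMOJI.sub("", text)

cleaned = remove_emojis_extended("efficient and productive \U0001F4BC\U0001F5A5\uFE0F")
print(repr(cleaned))
```

Printing with `repr` makes leftover invisible characters easy to spot when checking the result.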

#### REMOVE EMOTICONS

In [137]:
# Keys are regular-expression patterns written as raw strings; values describe the emoticon.
# Some keys use the non-breaking hyphen U+2011 rather than the ASCII hyphen-minus.
EMOTICONS = {
    r":‑\)": "Happy face or smiley",
    r":\)": "Happy face or smiley",
    r":-\]": "Happy face or smiley",
    r":\]": "Happy face or smiley",
    r":-3": "Happy face smiley",
    r":3": "Happy face smiley",
    r":->": "Happy face smiley",
    r":>": "Happy face smiley",
    r"8-\)": "Happy face smiley",
    r":o\)": "Happy face smiley",
    r":-\}": "Happy face smiley",
    r":\}": "Happy face smiley",
    r":-\)": "Happy face smiley",
    r":c\)": "Happy face smiley",
    r":\^\)": "Happy face smiley",
    r"=\]": "Happy face smiley",
    r"=\)": "Happy face smiley",
    r":‑D": "Laughing, big grin or laugh with glasses",
    r":D": "Laughing, big grin or laugh with glasses",
    r"8‑D": "Laughing, big grin or laugh with glasses",
    r"8D": "Laughing, big grin or laugh with glasses",
    r"X‑D": "Laughing, big grin or laugh with glasses",
    r"XD": "Laughing, big grin or laugh with glasses",
    r"=D": "Laughing, big grin or laugh with glasses",
    r"=3": "Laughing, big grin or laugh with glasses",
    r"B\^D": "Laughing, big grin or laugh with glasses",
    r":-\)\)": "Very happy",
    r":‑\(": "Frown, sad, angry or pouting",
    r":-\(": "Frown, sad, angry or pouting",
    r":\(": "Frown, sad, angry or pouting",
    r":‑c": "Frown, sad, angry or pouting",
    r":c": "Frown, sad, angry or pouting",
    r":‑<": "Frown, sad, angry or pouting",
    r":<": "Frown, sad, angry or pouting",
    r":‑\[": "Frown, sad, angry or pouting",
    r":\[": "Frown, sad, angry or pouting",
    r":-\|\|": "Frown, sad, angry or pouting",
    r">:\[": "Frown, sad, angry or pouting",
    r":\{": "Frown, sad, angry or pouting",
    r":@": "Frown, sad, angry or pouting",
    r">:\(": "Frown, sad, angry or pouting",
    r":'‑\(": "Crying",
    r":'\(": "Crying",
    r":'‑\)": "Tears of happiness",
    r":'\)": "Tears of happiness",
    r"D‑':": "Horror",
    r"D:<": "Disgust",
    r"D:": "Sadness",
    r"D8": "Great dismay",
    r"D;": "Great dismay",
    r"D=": "Great dismay",
    r"DX": "Great dismay",
    r":‑O": "Surprise",
    r":O": "Surprise",
    r":‑o": "Surprise",
    r":o": "Surprise",
    r":-0": "Shock",
    r"8‑0": "Yawn",
    r">:O": "Yawn",
    r":-\*": "Kiss",
    r":\*": "Kiss",
    r":X": "Kiss",
    r";‑\)": "Wink or smirk",
    r";\)": "Wink or smirk",
    r"\*-\)": "Wink or smirk",
    r"\*\)": "Wink or smirk",
    r";‑\]": "Wink or smirk",
    r";\]": "Wink or smirk",
    r";\^\)": "Wink or smirk",
    r":‑,": "Wink or smirk",
    r";D": "Wink or smirk",
    r":‑P": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r":P": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r"X‑P": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r"XP": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r":‑Þ": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r":Þ": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r":b": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r"d:": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r"=p": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r">:P": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r":‑/": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r":/": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r":-[.]": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r">:[(\\)]": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r">:/": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r":[(\\)]": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r"=/": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r"=[(\\)]": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r":L": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r"=L": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r":S": "Skeptical, annoyed, undecided, uneasy or hesitant",
    r":‑\|": "Straight face",
    r":\|": "Straight face",
    r":\$": "Embarrassed or blushing",  # "$" escaped so it is not read as end-of-string
    r":‑x": "Sealed lips or wearing braces or tongue-tied",
    r":x": "Sealed lips or wearing braces or tongue-tied",
    r":‑#": "Sealed lips or wearing braces or tongue-tied",
    r":#": "Sealed lips or wearing braces or tongue-tied",
    r":‑&": "Sealed lips or wearing braces or tongue-tied",
    r":&": "Sealed lips or wearing braces or tongue-tied",
    r"O:‑\)": "Angel, saint or innocent",
    r"O:\)": "Angel, saint or innocent",
    r"0:‑3": "Angel, saint or innocent",
    r"0:3": "Angel, saint or innocent",
    r"0:‑\)": "Angel, saint or innocent",
    r"0:\)": "Angel, saint or innocent",
    r":‑b": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    r"0;\^\)": "Angel, saint or innocent",
    r">:‑\)": "Evil or devilish",
    r">:\)": "Evil or devilish",
    r"\}:‑\)": "Evil or devilish",
    r"\}:\)": "Evil or devilish",
    r"3:‑\)": "Evil or devilish",
    r"3:\)": "Evil or devilish",
    r">;\)": "Evil or devilish",
    r"\|;‑\)": "Cool",
    r"\|‑O": "Bored",
    r":‑J": "Tongue-in-cheek",
    r"#‑\)": "Party all night",
    r"%‑\)": "Drunk or confused",
    r"%\)": "Drunk or confused",
    r":-###\.\.": "Being sick",
    r":###\.\.": "Being sick",
    r"<:‑\|": "Dumb",
    r"\(>_<\)": "Troubled",
    r"\(>_<\)>": "Troubled",
    r"\(';'\)": "Baby",
    r"\(\^\^>``": "Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    r"\(\^_\^;\)": "Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    r"\(-_-;\)": "Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    # Entries that held two emoticons in one key are split so each can match on its own.
    r"\(~_~;\)": "Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    r"\(・\.・;\)": "Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    r"\(-_-\)zzz": "Sleeping",
    r"\(\^_-\)": "Wink",
    r"\(\(\+_\+\)\)": "Confused",
    r"\(\+o\+\)": "Confused",
    r"\(o\|o\)": "Ultraman",
    r"\^_\^": "Joyful",
    r"\(\^_\^\)/": "Joyful",
    r"\(\^O\^\)／": "Joyful",
    r"\(\^o\^\)／": "Joyful",
    r"\(__\)": "Kowtow as a sign of respect, or dogeza for apology",
    r"_\(\._\.\)_": "Kowtow as a sign of respect, or dogeza for apology",
    r"<\(_ _\)>": "Kowtow as a sign of respect, or dogeza for apology",
    r"<m\(__\)m>": "Kowtow as a sign of respect, or dogeza for apology",
    r"m\(__\)m": "Kowtow as a sign of respect, or dogeza for apology",
    r"m\(_ _\)m": "Kowtow as a sign of respect, or dogeza for apology",
    r"\('_'\)": "Sad or Crying",
    r"\(/_;\)": "Sad or Crying",
    r"\(T_T\)": "Sad or Crying",
    r"\(;_;\)": "Sad or Crying",
    r"\(;_;": "Sad or Crying",
    r"\(;_:\)": "Sad or Crying",
    r"\(;O;\)": "Sad or Crying",
    r"\(:_;\)": "Sad or Crying",
    r"\(ToT\)": "Sad or Crying",
    r";_;": "Sad or Crying",
    r";-;": "Sad or Crying",
    r";n;": "Sad or Crying",
    r";;": "Sad or Crying",
    r"Q\.Q": "Sad or Crying",
    r"T\.T": "Sad or Crying",
    r"QQ": "Sad or Crying",
    r"Q_Q": "Sad or Crying",
    r"\(-\.-\)": "Shame",
    r"\(-_-\)": "Shame",
    r"\(一一\)": "Shame",
    r"\(；一_一\)": "Shame",
    r"\(=_=\)": "Tired",
    r"\(=\^·\^=\)": "cat",
    r"\(=\^··\^=\)": "cat",
    r"=_\^=": "cat",
    r"\(\.\.\)": "Looking down",
    r"\(\._\.\)": "Looking down",
    r"\^m\^": "Giggling with hand covering mouth",
    r"\(・・\?": "Confusion",
    r"\(\?_\?\)": "Confusion",
    r">\^_\^<": "Normal Laugh",
    r"<\^!\^>": "Normal Laugh",
    r"\^/\^": "Normal Laugh",
    r"（\*\^_\^\*）": "Normal Laugh",
    r"\(\^<\^\)": "Normal Laugh",
    r"\(\^\.\^\)": "Normal Laugh",
    r"\(\^_\^\.\)": "Normal Laugh",
    r"\(\^_\^\)": "Normal Laugh",
    r"\(\^\^\)": "Normal Laugh",
    r"\(\^J\^\)": "Normal Laugh",
    r"\(\*\^\.\^\*\)": "Normal Laugh",
    r"\(\^—\^）": "Normal Laugh",
    r"\(#\^\.\^#\)": "Normal Laugh",
    r"（\^—\^）": "Waving",
    r"\(;_;\)/~~~": "Waving",
    r"\(\^\.\^\)/~~~": "Waving",
    r"\(-_-\)/~~~": "Waving",
    r"\(\$··\)/~~~": "Waving",
    r"\(T_T\)/~~~": "Waving",
    r"\(ToT\)/~~~": "Waving",
    r"\(\*\^0\^\*\)": "Excited",
    r"\(\*_\*\)": "Amazed",
    r"\(\*_\*;": "Amazed",
    r"\(\+_\+\)": "Amazed",
    r"\(@_@\)": "Amazed",
    r"\(\*\^\^\)v": "Laughing,Cheerful",
    r"\(\^_\^\)v": "Laughing,Cheerful",
    r"\(\(d\[-_-\]b\)\)": "Headphones,Listening to music",
    r'\(-"-\)': "Worried",
    r"\(ーー;\)": "Worried",
    r"\(\^0_0\^\)": "Eyeglasses",
    r"\(＾ｖ＾\)": "Happy",
    r"\(＾ｕ＾\)": "Happy",
    r"\(\^\)o\(\^\)": "Happy",
    r"\(\^O\^\)": "Happy",
    r"\(\^o\^\)": "Happy",
    r"\)\^o\^\(": "Happy",
    r"o_O": "Surprised",
    r"o_0": "Surprised",
    r"o\.O": "Surprised",
    r"\(o\.o\)": "Surprised",
    r"oO": "Surprised",
    r"\(\*￣m￣\)": "Dissatisfied",
    r"\(‘A`\)": "Snubbed or Deflated"
}

In [138]:
# Build the pattern once; longer alternatives go first so ":-))" is not cut short by ":-)".
EMOTICON_PATTERN = re.compile('|'.join(sorted(EMOTICONS, key=len, reverse=True)))

def remove_emoticons(text):
    return EMOTICON_PATTERN.sub('', text)

review_df['Review'] = review_df['Review'].apply(remove_emoticons)



In [139]:
review_df['Review'][6]

'cafeteria food needs improvement '

#### TOKENIZATION

There are many ways to implement tokenization.

In [140]:
normal_text = "AI is getting smarter and can now make new things like art and stories."
normal_para = "AI is getting smarter and can now make new things like art and stories. This changes how people create and design."

##### Using the split function 

In [141]:
# word tokenization
tokenize1 = normal_text.split()
tokenize1

['AI',
 'is',
 'getting',
 'smarter',
 'and',
 'can',
 'now',
 'make',
 'new',
 'things',
 'like',
 'art',
 'and',
 'stories.']

In [142]:
# sentence tokenization
tokenize2 = normal_para.split(".")
tokenize2

['AI is getting smarter and can now make new things like art and stories',
 ' This changes how people create and design',
 '']

##### Using regular expression

In [143]:
import re
# Raw string so that \w reaches the regex engine instead of being read as a string escape.
tokenize3 = re.findall(r"[\w']+", normal_text)
tokenize3

['AI',
 'is',
 'getting',
 'smarter',
 'and',
 'can',
 'now',
 'make',
 'new',
 'things',
 'like',
 'art',
 'and',
 'stories']
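
Regular expressions can also handle sentence tokenization. Splitting on `"."` drops the periods and leaves an empty string at the end, whereas a lookbehind keeps each terminator attached to its sentence. This is a minimal sketch for simple prose only; it assumes sentences end in `.`, `!` or `?` followed by whitespace and will mis-split abbreviations such as "e.g. this".

```python
import re

normal_para = ("AI is getting smarter and can now make new things like art "
               "and stories. This changes how people create and design.")

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the mark.
sentences = re.split(r"(?<=[.!?])\s+", normal_para.strip())
print(sentences)
```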

##### Using NLTK

In [144]:
from nltk.tokenize import word_tokenize,sent_tokenize


In [145]:
word_tokenize(normal_text)

['AI',
 'is',
 'getting',
 'smarter',
 'and',
 'can',
 'now',
 'make',
 'new',
 'things',
 'like',
 'art',
 'and',
 'stories',
 '.']

In [146]:
sent_tokenize(normal_para)

['AI is getting smarter and can now make new things like art and stories.',
 'This changes how people create and design.']

##### Using spaCy

In [165]:
# !python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
tokenize4 = nlp(normal_text)
tokenize4

In [None]:
for token in tokenize4:
    print(token)

In [None]:
def spacy_tokenize(text):
    # Reuse the `nlp` pipeline loaded above; calling spacy.load() for every row is very slow.
    return nlp(text)

# Each review becomes a spaCy Doc, which prints as its original text.
review_df['Review'] = review_df['Review'].apply(spacy_tokenize)


In [161]:
review_df.head(3)

Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible


#### STEMMING

Stemming reduces inflected words to a root form by stripping affixes. Several related words map to the same stem, even if that stem is not itself a valid word in the language.

Here each review is tokenized with spaCy, every token is stemmed inside a list comprehension, and the stems are joined back into a single string so the result can be stored in the DataFrame.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
nlp = spacy.load('en_core_web_sm')

def apply_stemming(text):
    # str() also accepts the spaCy Doc objects stored by the previous cell.
    tokens = nlp(str(text))
    stemmed_words = [stemmer.stem(token.text) for token in tokens]
    return ' '.join(stemmed_words)

review_df['Review'] = review_df['Review'].apply(apply_stemming)


In [163]:
review_df.head(3)

Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible


#### LEMMATIZATION

Lemmatization, unlike stemming, reduces inflected words in a way that ensures the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
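
To make the contrast with stemming concrete, here is a deliberately tiny, pure-Python illustration. Both functions are toys written for this example: `toy_stem` strips a few suffixes blindly, and `toy_lemmatize` looks words up in a hand-written table (`LEMMA_TABLE` is invented sample data). In practice one would use a real lemmatizer such as spaCy's `token.lemma_` or NLTK's `WordNetLemmatizer`.

```python
# Hand-written lemma table: an assumption for illustration only.
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "mice": "mouse",
}

def toy_stem(word):
    """Strip a few common suffixes without checking that the result is a word."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "studying", "better", "mice"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer turns `studies` into `stud`, which is not an English word, while the lookup returns the dictionary form `study` and can also handle irregular forms such as `mice`.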

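The final result of these preprocessing steps is shown below. The table contains unstemmed words (for example `company`, where the Porter stemmer would give `compani`), so it reflects the data before the stemming cell was applied.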

In [164]:
pd.set_option('display.max_colwidth', None)
review_df

Unnamed: 0,Review
0,great company work techlife
1,found table tennis board brokencan please fix sap
2,worklife balance terrible
3,love new project amazing innovation
4,team suppurative
5,human resource department responsive helpful
6,cafeteria food needs improvement
7,meetings deadline challenging
8,company culture fantastic
9,company software dvelpmnt process quite efficient productive Ô∏è


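#### The data is now reasonably clean for downstream tasks such as indexing and embedding.

A few artifacts remain in the table above. `asap` became `sap` because the abbreviation lookup missed `ASAP!` and the spelling corrector then altered it, `supportive` was miscorrected to `suppurative`, `dvelpmnt` was left uncorrected, and `brokencan` results from deleting the comma between two words.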