This notebook is a practice run through text preprocessing, based on [Let us Start with Text Processing](https://www.kaggle.com/code/priyankdl/let-us-start-with-text-processing) by **Priyank Thakkar Sir**.

This is how **string.punctuation** can be combined with **str.maketrans** and **str.translate** to replace or delete punctuation in a string.

In [None]:
import string
import pandas as pd

punctuation = string.punctuation

# str.maketrans takes three arguments: the first two are strings of equal length
# that map character-for-character ("?" becomes ","), and the third lists the
# characters to delete (here, every punctuation mark except "?").
mapping = str.maketrans("?", ",", punctuation.replace("?", ""))


print("Punctuations:",punctuation)
print(type(mapping),mapping)
data = {'lowered_text': [
    'Hello, World!',
    'This is an example sentence.',
    'Good morning - have a great day!',
    'Why so serious?'
]}

trdf = pd.DataFrame(data)

print("Before translation:")
print(trdf['lowered_text'].head(10))

trdf['lowered_text'] = trdf["lowered_text"].str.translate(mapping)

print("After translation:")
print(trdf['lowered_text'].head(10))


Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'dict'> {63: 44, 33: None, 34: None, 35: None, 36: None, 37: None, 38: None, 39: None, 40: None, 41: None, 42: None, 43: None, 44: None, 45: None, 46: None, 47: None, 58: None, 59: None, 60: None, 61: None, 62: None, 64: None, 91: None, 92: None, 93: None, 94: None, 95: None, 96: None, 123: None, 124: None, 125: None, 126: None}
Before translation:
0                       Hello, World!
1        This is an example sentence.
2    Good morning - have a great day!
3                     Why so serious?
Name: lowered_text, dtype: object
After translation:
0                       Hello World
1       This is an example sentence
2    Good morning  have a great day
3                   Why so serious,
Name: lowered_text, dtype: object
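
The same translation tables work on plain Python strings, without pandas. Here is a minimal sketch; the sample sentence and the `custom_table` mapping are made up for illustration. It also shows the single-dict form of `str.maketrans`, where a value may be a longer string, or `None` to delete the character.

```python
import string

# Build a table that deletes every punctuation character.
# The first two arguments are empty because nothing is remapped here.
delete_punct = str.maketrans("", "", string.punctuation)

# maketrans also accepts a single dict: keys are single characters,
# values are replacement strings, or None to delete the character.
custom_table = str.maketrans({"?": ",", "!": None, "&": " and "})

sample = "Why so serious? Salt&pepper!"

no_punct = sample.translate(delete_punct)
remapped = sample.translate(custom_table)

print(no_punct)
print(remapped)
```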


**collections.Counter** can be used to count the frequency of every word, much like a map in C++. Here it also helps to filter out the most common words, since they often carry little meaning. This is essentially a map applied to string/text processing.



In [None]:
from collections import Counter

text = """
Once upon a time, in a land far, far away, there lived a young girl named Ella. Ella was kind and gentle, and everyone in the village loved her. She lived with her stepmother and stepsisters who were not so kind. They treated her poorly and made her do all the chores around the house. Despite this, Ella remained positive and hopeful.

One day, a royal ball was announced, and everyone in the village was excited to attend. Ella's stepmother and stepsisters went to the ball, but they did not allow Ella to go with them. Ella was heartbroken. However, her fairy godmother appeared and magically transformed her rags into a beautiful gown and glass slippers. She told Ella to return before midnight, as the magic would wear off then.

At the ball, everyone was enchanted by Ella's beauty. The prince asked her to dance, and they danced the night away. As the clock struck midnight, Ella ran away, leaving behind one of her glass slippers. The prince searched the kingdom for the owner of the glass slipper, and when he found Ella, they lived happily ever after.
"""

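# Count words case-insensitively; punctuation stays attached ("ella." != "ella").
words = text.lower().split()
word_freq = Counter(words)

top_5_words = [word for word, freq in word_freq.most_common(5)]

# Note: this comparison is case-sensitive, so capitalised tokens such as
# "The" and "Ella" are kept even though their lowercase forms are in the top 5.
new_string = ""
for word in text.split():
    if word not in top_5_words:
        new_string = new_string + " " + word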

print("Top 5 most frequent words:", top_5_words)
print("Filtered text:", new_string)


Top 5 most frequent words: ['the', 'and', 'her', 'ella', 'a']
Filtered text:  Once upon time, in land far, far away, there lived young girl named Ella. Ella was kind gentle, everyone in village loved her. She lived with stepmother stepsisters who were not so kind. They treated poorly made do all chores around house. Despite this, Ella remained positive hopeful. One day, royal ball was announced, everyone in village was excited to attend. Ella's stepmother stepsisters went to ball, but they did not allow Ella to go with them. Ella was heartbroken. However, fairy godmother appeared magically transformed rags into beautiful gown glass slippers. She told Ella to return before midnight, as magic would wear off then. At ball, everyone was enchanted by Ella's beauty. The prince asked to dance, they danced night away. As clock struck midnight, Ella ran away, leaving behind one of glass slippers. The prince searched kingdom for owner of glass slipper, when he found Ella, they lived happily ever
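
In the output above, "The" and "Ella" survive the filter because the comparison is case-sensitive, and tokens like "Ella." are counted separately from "Ella". One possible way around both, sketched here on a made-up sentence, is to normalise each token before counting and before comparing. The helper `normalize` is a hypothetical name, not part of the original notebook.

```python
import string
from collections import Counter

# Hypothetical short sample; any text would do.
sample = "The cat saw the dog. The dog saw the cat, and the cat ran!"

def normalize(token):
    """Lowercase a token and strip punctuation from both ends."""
    return token.lower().strip(string.punctuation)

tokens = [normalize(tok) for tok in sample.split()]
tokens = [tok for tok in tokens if tok]  # drop tokens that were only punctuation

word_freq = Counter(tokens)
top_2 = {word for word, _ in word_freq.most_common(2)}

# Compare the normalized form, but keep the original token in the output.
filtered = " ".join(tok for tok in sample.split() if normalize(tok) not in top_2)

print(word_freq.most_common(2))
print(filtered)
```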

Trying out **Stemming** with the Snowball Stemmer.
Stemming reduces a word to a root form by stripping suffixes, so that words like 'run', 'runs' and 'running' can be treated as one and the same. Irregular forms such as 'ran' are not covered by these rules.

Either the **Porter Stemmer** or the **SnowBall Stemmer** can be used; SnowBall is often preferred because it supports many languages.
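
To make the idea of rule-based suffix stripping concrete, here is a toy sketch. It is far simpler than the Porter or Snowball algorithms, and the rule list and the name `toy_stem` are invented for illustration only. It shows why a stem need not be a real word and why irregular forms slip through.

```python
# Toy suffix rules, checked in order; real stemmers use far richer rule sets.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

samples = ["jumps", "jumped", "jumping", "quickly", "ponies", "ran"]
for w in samples:
    print(w, "->", toy_stem(w))
```

Here 'ponies' becomes 'poni', which is not a dictionary word, and 'ran' is left alone because no suffix rule applies to it.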

In [None]:
#Use of a Snowball Stemmer
from nltk.stem.snowball import SnowballStemmer
import pandas as pd

stemmer = SnowballStemmer("english")

data = {
    'text_data': [
        'running runners ran quickly',
        'happily happy happier happiest',
        'swimming swims swam swim',
        'beautifully beautiful beauty beautify',
        'jumps jumped jumping jump'
    ]
}

df = pd.DataFrame(data)

def apply_stemming(text):
    return ' '.join(stemmer.stem(word) for word in text.split())

df["Stemmed_Text"] = df["text_data"].apply(apply_stemming)

print("Original Text:")
print(df["text_data"].head(5))

print("Stemmed Text:")
print(df["Stemmed_Text"].head(5))


Original Text:
0              running runners ran quickly
1           happily happy happier happiest
2                 swimming swims swam swim
3    beautifully beautiful beauty beautify
4                jumps jumped jumping jump
Name: text_data, dtype: object
Stemmed Text:
0              run runner ran quick
1    happili happi happier happiest
2               swim swim swam swim
3     beauti beauti beauti beautifi
4               jump jump jump jump
Name: Stemmed_Text, dtype: object


As you can see, the stemmed output contains forms such as 'happili' and 'beauti' that are not real dictionary words.

To solve this, we can use lemmatisation, which looks words up in a lexical database (WordNet here) and returns a base form that actually exists.

We can also tell it which POS (part of speech) to assume when reducing a word to its root. Let us look at an example.

In [None]:
import nltk
# nltk.download('wordnet')    # WordNet data, required by WordNetLemmatizer
# nltk.download('averaged_perceptron_tagger')   # Tagger model used by nltk.pos_tag(); recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import pandas as pd

lemmatizer = WordNetLemmatizer()
wordnet_tags = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

sample_data = {
    'text_content': [
        'cats are running swiftly',
        'beautifully made beautiful things',
        'swimming is a good exercise',
        'the fox jumped over the fence',
        'she sings very well'
    ]
}

dataframe = pd.DataFrame(sample_data)

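def lemmatize_with_pos(text):
    words = text.split()
    lemmas = []
    for word in words:
        # Tag each word on its own and keep the first letter of the tag (N, V, J, R).
        # Tagging in isolation gives the tagger no sentence context.
        tag = nltk.pos_tag([word])[0][1][0]
        # Fall back to NOUN for any tag without a WordNet equivalent.
        pos_tag = wordnet_tags.get(tag, wordnet.NOUN)
        lemmas.append(lemmatizer.lemmatize(word, pos_tag))
    return ' '.join(lemmas)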

dataframe["Lemmatized_Text"] = dataframe["text_content"].apply(lemmatize_with_pos)

print("Original Text:")
print(dataframe["text_content"].head(5))

print("Lemmatized Text:")
print(dataframe["Lemmatized_Text"].head(5))

print("Missing Values in Lemmatized Text:")
print(dataframe["Lemmatized_Text"].isnull().sum())


Original Text:
0             cats are running swiftly
1    beautifully made beautiful things
2          swimming is a good exercise
3        the fox jumped over the fence
4                  she sings very well
Name: text_content, dtype: object
Lemmatized Text:
0                  cat be run swiftly
1    beautifully make beautiful thing
2             swim be a good exercise
3       the fox jumped over the fence
4                 she sings very well
Name: Lemmatized_Text, dtype: object
Missing Values in Lemmatized Text:
0


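So lemmatisation returns a real dictionary word rather than a truncated stem. Note that 'jumped' and 'sings' were left unchanged above, most likely because each word is tagged in isolation and so is not recognised as a verb.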
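
Why the POS matters can be sketched with a toy lookup table. This is not how WordNetLemmatizer works internally; the mini-lexicon `LEMMAS` and the function `toy_lemmatize` are hypothetical and only illustrate that the same surface form can map to different lemmas depending on the part of speech, and that a wrong POS leaves the word untouched.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
# A real lemmatizer such as WordNetLemmatizer consults WordNet instead.
LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("leaves", "v"): "leave",
    ("leaves", "n"): "leaf",
    ("jumped", "v"): "jump",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("saw", "v"))     # looked up as a verb
print(toy_lemmatize("leaves", "n"))  # looked up as a noun
print(toy_lemmatize("jumped"))       # default POS is noun, so no entry is found
print(toy_lemmatize("jumped", "v"))  # correct POS, entry found
```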

Playing with **Emojis**

There are libraries and ready-made dictionaries that map emojis to their descriptions.

These mappings can then be used to manipulate emojis in our text data. For example, we can replace emojis with their descriptions, or drop certain emojis altogether.

The basic tool is the **sub** function from **re**, Python's regular expression module. It can be called in two ways:

1)re.sub(pattern_matching_the_emojis, replacement_string, original_string)

2)compiled_pattern.sub(replacement_string, original_string)  #compiled_pattern = re.compile(pattern_matching_the_emojis)
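
The two call styles listed above give the same result. Here is a minimal sketch, where the placeholder `<emoji>` and the two-emoji character class are chosen just for illustration:

```python
import re

text = "Game is on 🔥 and the mood is 😊"

# Style 1: module-level function, the pattern is passed as a string.
replaced_1 = re.sub("[🔥😊]", "<emoji>", text)

# Style 2: compile once, then call .sub on the pattern object.
emoji_class = re.compile("[🔥😊]")
replaced_2 = emoji_class.sub("<emoji>", text)

print(replaced_1)
print(replaced_1 == replaced_2)
```

Compiling first is mostly a convenience when the same pattern is reused many times.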

In [None]:
import re

# Emoji to description mapping
UNICODE_EMO = {
    "🔥": "FIRE",
    "😊": "SMILING_FACE",
    "😂": "FACE_WITH_TEARS_OF_JOY",
    "❤️": "RED_HEART",
    "🌟": "STAR",
    "💔": "BROKEN_HEART",
    "👍": "THUMBS_UP",
    "👎": "THUMBS_DOWN"
}

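def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols and pictographs
        u"\U0001F680-\U0001F6FF"  # transport and map symbols
        u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range; it also covers many non-emoji characters such as CJK text
        "]+", flags=re.UNICODE
    )
    # r'\\n' inserts a literal backslash followed by 'n', not a newline; pass '' to delete the emojis.
    return emoji_pattern.sub(r'\\n', text)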

def emojis_to_text(text):
    for emot, desc in UNICODE_EMO.items():
        text = re.sub(
            r'(' + re.escape(emot) + ')',
            desc,
            text
        )
    return text

# Example usage
text_with_emojis = "Game is on 🔥 and the mood is 😊"
cleaned_text = emojis_to_text(text_with_emojis)

print("Original Text:")
print(text_with_emojis)

print("Text with Emojis Replaced:")
print(cleaned_text)

print("Another Function:")
print(remove_emoji(text_with_emojis))

# Two slightly different ways of replacing emojis:
# emojis_to_text() replaces each known emoji with its own description, whereas remove_emoji() replaces every run of characters matched by the compiled pattern with one common string.


Original Text:
Game is on 🔥 and the mood is 😊
Text with Emojis Replaced:
Game is on FIRE and the mood is SMILING_FACE
Another Function:
Game is on \n and the mood is \n
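
Looping over the dictionary runs one `re.sub` per emoji. A possible alternative, sketched here under the assumption that the mapping is small enough to fit in one pattern, is to build a single alternation and pass a function as the replacement. The names `EMOJI_NAMES`, `EMOJI_RE` and `describe_emojis` are hypothetical.

```python
import re

# Hypothetical subset of an emoji-to-description table.
EMOJI_NAMES = {"🔥": "FIRE", "😊": "SMILING_FACE", "👍": "THUMBS_UP"}

# One alternation pattern covering every key; longer keys first so that
# multi-code-point emojis are not shadowed by a shorter prefix.
_keys = sorted(EMOJI_NAMES, key=len, reverse=True)
EMOJI_RE = re.compile("|".join(re.escape(k) for k in _keys))

def describe_emojis(text):
    """Replace each known emoji with its description in a single pass."""
    return EMOJI_RE.sub(lambda m: EMOJI_NAMES[m.group(0)], text)

print(describe_emojis("Game is on 🔥 and the mood is 😊"))
```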


The same **sub()** works for patterns other than emoji ranges, for example to strip **HTML tags** or to mask cuss words in online text.
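
For the word-masking case, a minimal sketch could look like this. The block list is a mild, made-up stand-in, and `mask_words` is a hypothetical helper.

```python
import re

# Hypothetical block list; a real one would be loaded from a file.
BLOCKED = ["darn", "heck"]

# \b anchors keep the match to whole words, so "heckle" is left alone.
BLOCKED_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, BLOCKED)) + r")\b", re.IGNORECASE
)

def mask_words(text):
    """Replace each blocked word with asterisks of the same length."""
    return BLOCKED_RE.sub(lambda m: "*" * len(m.group(0)), text)

print(mask_words("Darn, what the heck? Do not heckle."))
```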

In [None]:
import re

def strip_html_tags(content):
    # Non-greedy match from '<' to the next '>'; fine for simple markup, but not a full HTML parser.
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', content)

html_content = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome to My Page</h1>
<p>This is a <strong>sample</strong> paragraph.</p>
<a href="https://example.com">Visit Example</a>
</body>
</html>
"""

clean_text = strip_html_tags(html_content)

print(clean_text)




Sample Page

Welcome to My Page
This is a sample paragraph.
Visit Example
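
The regex approach can trip on markup where a `>` appears inside an attribute value. As an alternative sketch, the standard library's `html.parser` can collect only the text nodes. The class `TextExtractor` and the function `html_to_text` are hypothetical names, and joining the chunks with single spaces is a simplification that does not preserve the original whitespace.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

snippet = '<p title="a > b">This is a <strong>sample</strong> paragraph.</p>'
print(html_to_text(snippet))
```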





Finally, there is the **pyspellchecker** library (imported as **spellchecker**), which corrects misspelled words by replacing each one with the closest known word, chosen by edit distance and word frequency rather than by meaning.

In [None]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


**spell.correction(word)**: returns the most likely correction for a word

**spell.unknown(list_of_words)**: returns the set of words from the list that are not in the dictionary, i.e. the likely misspellings

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()

text_with_typos = "The quik brown fox jumps ovr the lazi dog."

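def correct_spelling(text):
    words = text.split()
    # Look up the unknown words once instead of on every loop iteration; spell.unknown() returns a set.
    misspelled = spell.unknown(words)
    corrected_words = []

    for word in words:
        if word in misspelled:
            # correction() may return None when no candidate is found, so fall back to the original word.
            corrected_words.append(spell.correction(word) or word)
        else:
            corrected_words.append(word)

    return ' '.join(corrected_words)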

corrected_text = correct_spelling(text_with_typos)

print("Original Text:")
print(text_with_typos)

print("Corrected Text:")
print(corrected_text)


Original Text:
The quik brown fox jumps ovr the lazi dog.
Corrected Text:
The quit brown fox jumps or the lazy dog
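
The result above ('quik' became 'quit', 'ovr' became 'or') shows that the closest known word is not always the intended one. To see the "nearest word" idea in isolation, here is a sketch using the standard library's `difflib` against a tiny made-up vocabulary. This is a different similarity measure from the one pyspellchecker uses, and `VOCAB` and `closest_word` are hypothetical names.

```python
import difflib

# Hypothetical tiny vocabulary; a real checker ships a large frequency list.
VOCAB = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

def closest_word(word, cutoff=0.6):
    """Return the most similar vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word

typos = ["quik", "ovr", "lazi", "xyz"]
for typo in typos:
    print(typo, "->", closest_word(typo))
```

With a vocabulary this small the intended words win easily; against a full dictionary there are many more near neighbours, which is how 'quit' and 'or' can come out ahead.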
