**Lower Casing**

In [1]:
import pandas as pd
import numpy as np

In [4]:
dataset = pd.DataFrame([['No problem. Here is a sample paragraph of 150 words with HTML tags and punctuation marks:<p>As technology keeps advancing at a rapid pace, the world is becoming more digitalized. It’s no surprise that the demand for web developers and designers has increased significantly in recent years. HTML remains the backbone of the web, and it is essential to learn it as it forms the foundation of all web development. With basic knowledge of HTML tags such as headings, paragraphs, images, and links, you can design and develop simple web pages. However, as your web development skills advance, you will need to master complex tags like forms and tables to create more interactive and user-friendly websites. The great thing about HTML is that it is a straightforward language, and there are lots of online resources that can help you learn it in no time.</p><p>Moreover, HTML is compatible with other web technologies like CSS and JavaScript, making it possible to style and add interactivity to web pages. CSS is responsible for web page styles like colors, typography, and layout, while JavaScript adds functionality. HTML, CSS, and JavaScript make up the three main web technologies and are critical to web development. Mastering these technologies will allow you to build dynamic, responsive, and modern websites that meet the demands of the digital age. In conclusion, learning HTML is the first step to becoming a web developer, and with practice and dedication, you can achieve great things in your web development journey.</p>',1]],columns=['review','sent'])

In [8]:
dataset['review']=dataset['review'].str.lower()

**Removing HTML Tags**
- using Regular Expression

In [9]:
import re
def remove_html_tag(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

In [10]:
dataset['review'].apply(remove_html_tag)

0    no problem. here is a sample paragraph of 150 ...
Name: review, dtype: object

**Remove URLs from dataset**
- Again, use regular expressions to remove URLs
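
No code is given for this step, so here is a minimal sketch of what it could look like. The pattern is an assumption: it treats anything starting with `http://`, `https://`, or `www.` up to the next whitespace as a URL, which is a rough heuristic rather than a full URL parser. The names `URL_PATTERN` and `remove_url` are illustrative only.

```python
import re

# Assumed pattern: http://, https://, or www. followed by non-whitespace characters
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    """Return text with URL-like substrings removed."""
    return URL_PATTERN.sub('', text)

sample = "check out https://www.example.com/page and www.example.org for details"
print(remove_url(sample))
```

It can be applied to the column the same way as the other helpers, e.g. `dataset['review'].apply(remove_url)`. Removal leaves the surrounding spaces in place, so extra whitespace may need to be collapsed afterwards.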

**Remove Punctuations From Dataset**
- using string library

In [11]:
import string

In [13]:
exclude = string.punctuation

In [14]:
def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char, "")
    return text

# The loop above makes one pass over the text per punctuation character,
# so it takes a long time for big datasets

In [16]:
def remove_punc1(text):
    return text.translate(str.maketrans('','',exclude))

In [17]:
dataset['review'].apply(remove_punc1)

0    no problem here is a sample paragraph of 150 w...
Name: review, dtype: object
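
To compare the two punctuation removers, a rough timing sketch with `timeit` may help. The sample string and repeat count are arbitrary assumptions, and the translation table is built once up front (a small change from `remove_punc1` above). The actual numbers will vary by machine and input.

```python
import string
import timeit

exclude = string.punctuation

def remove_punctuation(text):
    # One full pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, "")
    return text

# Build the translation table once and reuse it on every call
punct_table = str.maketrans('', '', exclude)

def remove_punc1(text):
    # Single pass over the text using the prebuilt table
    return text.translate(punct_table)

# Arbitrary sample text, repeated to make the timing measurable
sample = "Hello, world! Is this <b>fast</b>? Yes: very." * 1000

# Both approaches should produce the same result
assert remove_punctuation(sample) == remove_punc1(sample)

loop_time = timeit.timeit(lambda: remove_punctuation(sample), number=20)
translate_time = timeit.timeit(lambda: remove_punc1(sample), number=20)
print(f"replace loop: {loop_time:.4f}s, translate: {translate_time:.4f}s")
```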

**Chat Word Treatment**
- Chat text often uses abbreviations such as:
- ASAP
- GN
- LMAO
- TBH
- IDK
- Change them to their full form

In [None]:
# Collect all the chat words, e.g. from the SMS slang translator Git repo
chat_words = []
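
Since the list above is left empty, here is one way the replacement step might work. It assumes the chat words are stored as a dictionary from abbreviation to full form. The handful of entries and the names `chat_word_map` and `chat_conversion` are hypothetical placeholders, not the contents of any particular repo.

```python
# Hypothetical, hand-written sample; a real mapping would be much larger
chat_word_map = {
    "ASAP": "as soon as possible",
    "GN": "good night",
    "TBH": "to be honest",
    "IDK": "i do not know",
}

def chat_conversion(text):
    """Replace whitespace-separated tokens found in chat_word_map with their full form."""
    new_words = []
    for word in text.split():
        # Look the token up in upper case; keep it unchanged if it is not a known chat word
        new_words.append(chat_word_map.get(word.upper(), word))
    return " ".join(new_words)

print(chat_conversion("idk tbh call me asap"))
```

Tokens with punctuation attached (such as `asap.`) will not match, so this step probably works best after punctuation removal.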

**Spelling Correction**
- Use the TextBlob library for spelling correction

In [None]:
from textblob import TextBlob

incorrect_txt = 'ceertain huw ar yu'

textBlb = TextBlob(incorrect_txt)

textBlb.correct().string

**Remove Stop Words**
- Text contains many stop words (i, her, of, the, etc.)

In [4]:
from nltk.corpus import stopwords
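
The import alone does not remove anything, so the filtering step could be sketched as below. To keep the example self-contained, it uses a small hand-picked set (`sample_stop_words`, an assumption for illustration). With the NLTK `stopwords` corpus downloaded, `set(stopwords.words('english'))` could be passed instead.

```python
def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words (case-insensitive)."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

# Small hand-picked set for illustration only; not the NLTK list
sample_stop_words = {"i", "her", "of", "the", "a", "is", "and"}

print(remove_stopwords("the demand for web developers is a sign of the times", sample_stop_words))
```

Using a set rather than a list keeps each membership check cheap, which matters when the function is applied to every row of a large dataset.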

**Handling Emojis**
- 😊😍😘✅👌♥
- remove or replace

In [None]:
import re
def remove_emojis(txt):
    emoji_pattern = re.compile("["
                                u"\U0001F600-\U0001F64F" # Emoticons
                                u"\U0001F300-\U0001F5FF" # Symbols and pictographs
                                u"\U0001F680-\U0001F6FF" # Transport and map symbols
                                u"\U0001F1E0-\U0001F1FF" # Flags (iOS)
                                u"\U00002707-\U000027B0" # Dingbats
                                u"\U000024C2-\U0001F251" # Very broad range: also covers CJK and other non-emoji characters
                                "]+",flags=re.UNICODE)
    return emoji_pattern.sub(r'',txt)

# This code removes the emojis and many other special characters

In [None]:
# Another option is to replace each emoji with its text name using the emoji library
import emoji
print(emoji.demojize('python is 🔥'))

**Tokenization**
- Two ways
1. **Word-level tokenization**: Splitting text into individual words or tokens. Example
- Input: "Hello, how are you?"
- Output: ["Hello", "how", "are", "you"]
2. **Character-level tokenization**: Splitting text into individual characters. Example
- Input: "Hello"
- Output: ["H", "e", "l", "l", "o"]
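
The two examples above can be reproduced with plain Python as a quick check. This is a simplified sketch: the function names are made up for illustration, and the word-level version just keeps runs of word characters, which is cruder than what a library tokenizer does.

```python
import re

def word_tokenize_simple(text):
    # \w+ keeps runs of letters, digits, and underscores; punctuation is dropped
    return re.findall(r"\w+", text)

def char_tokenize(text):
    # A string is already a sequence of characters
    return list(text)

print(word_tokenize_simple("Hello, how are you?"))
print(char_tokenize("Hello"))
```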

**Stopwords**
- Common words like "the", "and", "a", etc. that do not carry much meaning in a sentence.
- Removing stopwords can help reduce dimensionality and improve model performance.

**Stemming and Lemmatization**
- **Stemming**: Reducing words to a base form using simple suffix-stripping rules. Example:
- Input: "running"
- Output: "run"
- **Lemmatization**: Reducing words to their dictionary form (lemma) using a vocabulary lookup. Example:
- Input: "running" (tagged as a verb)
- Output: "run"

**Named Entity Recognition (NER)**
- Identifying named entities in text, such as people, places, organizations, etc. Example:
- Input: "Apple is a technology company."
- Output: ["Apple" (organization)]

**Part-of-Speech (POS) Tagging**
- Identifying the part of speech (noun, verb, adjective, etc.) of each word in a sentence. Example:
- Input: "The quick brown fox jumps over the lazy dog."
- Output: ["The" (article), "quick" (adjective), "brown" (adjective), ...]

In [None]:
# First way: use the built-in split function
text = "hwy we can make a good car. We can give compeletion to bugatti."
text.split() # word-level tokenization on whitespace
text.split('.') # rough sentence-level tokenization
# The split approach leaves punctuation attached to words (e.g. "car.")
# and does not handle special characters or chat words

#############################Regular Expression#############################
# Second way: use a regular expression
import re
token = re.findall(r"[\w']+",text)
# This keeps runs of word characters and apostrophes, so '.', '?'
# and most other special characters are dropped from the tokens

##################################NLTK####################################
from nltk.tokenize import word_tokenize, sent_tokenize
word_tokenize(text)
sent_tokenize(text)


#################################Spacy####################################
import spacy
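nlp = spacy.load('en_core_web_sm')  # requires the en_core_web_sm model to be installed
result = nlp(text)
[tok.text for tok in result]  # list the token strings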

**Stemming and Lemmatization**

In [None]:
###############################Stemming######################################
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

def stem_words(txt):
  return " ".join([ps.stem(word) for word in txt.split()])

In [None]:
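#############################Lemmatizing########################################
from nltk.stem import WordNetLemmatizer

word_lemmatizer = WordNetLemmatizer()

# Iterate over words, not characters; lemmatize() treats each word as a noun
# unless a pos argument (e.g. pos='v') is given
for word in text.split():
    print(word_lemmatizer.lemmatize(word))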