# Removing HTML tags

In [1]:
sample_text='<p>My favorite color is <del>blue</del> <ins>red</ins>.</p>'

In [2]:
sample_text

'<p>My favorite color is <del>blue</del> <ins>red</ins>.</p>'

**import re**: Imports Python's re module, which provides support for regular expressions. A regular expression is a pattern used to match and manipulate strings.


**<**: Matches the opening angle bracket of an HTML tag.

**.*?**: Matches any character except a newline (.) zero or more times (*) in a non-greedy way (?). Because the match is non-greedy, it stops at the first closing bracket, so it covers only the name and attributes of a single tag rather than everything up to the last > in the string.


**>**: Matches the closing angle bracket of an HTML tag.

**return p.sub('', data)**: Calls the sub() method of the compiled pattern (p) to replace every match with an empty string (''). In other words, it removes every tag from the input data and keeps the text between the tags.

In [3]:
import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('',data)

In [4]:
striphtml(sample_text)

'My favorite color is blue red.'

In [5]:
striphtml("<ph2!.>This is the first line.</p><p>This is the second line.</p>")

'This is the first line.This is the second line.'
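
One limitation is worth checking before relying on this pattern: by default **.** does not match a newline, so a tag that is split across two lines is left in place. The sketch below is a variant that is not part of the original notebook; it passes **re.DOTALL**, and the names **TAG_PATTERN** and **striphtml_multiline** are made up for illustration.

```python
import re

# Same pattern as striphtml, but DOTALL lets "." match newlines as well.
TAG_PATTERN = re.compile(r'<.*?>', flags=re.DOTALL)

def striphtml_multiline(data):
    """Remove tags, including tags that span more than one line."""
    return TAG_PATTERN.sub('', data)

broken_tag = '<a\nhref="x">link</a>'

# Without DOTALL the opening tag survives because it contains a newline.
print(repr(re.sub(r'<.*?>', '', broken_tag)))
# With DOTALL both tags are removed.
print(repr(striphtml_multiline(broken_tag)))
```

Either version also removes plain text that only looks like a tag, for example the middle of '2 < 3 and 4 > 1', so a real HTML parser is the safer choice when the input is not known to be simple markup.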

# Applying to a dataset

In [29]:
import pandas as pd
df=pd.read_csv('IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [30]:
df['review']=df['review'].apply(striphtml)

In [31]:
df['review']  # the review column contained tags such as <br /><br />; the compiled regular expression has removed them

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [32]:
df.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
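
For a large column, pandas' vectorized string method is an alternative to **.apply**. The sketch below uses a tiny made-up DataFrame named **demo** in place of 'IMDB Dataset.csv' (both rows are assumptions for illustration) and calls **Series.str.replace** with **regex=True**.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB reviews; these rows are made up.
demo = pd.DataFrame({
    'review': [
        'A wonderful little production. <br /><br />The filming',
        'No tags here',
    ],
    'sentiment': ['positive', 'negative'],
})

# Same pattern as striphtml, applied to the whole column in one call.
demo['review'] = demo['review'].str.replace(r'<.*?>', '', regex=True)
print(demo['review'].tolist())
```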


# Removing links from text

In [28]:
textt1='check out my notebook https://github.com/HarshalJain2'
textt2='check out my notebook https://www.kaggle.com/harshalmjain'
textt3='check out my notebook http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins'
textt4='check out my notebook http://www.udemy.com/'
textt5='check out my notebook www.kaggle.com/datasets/salader/dogs-vs-cats'

In [36]:
def remove_url(textt):
    pattern=re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', textt)

In [43]:
print(remove_url(textt1))
print(remove_url(textt2))
print(remove_url(textt3))
print(remove_url(textt4))
print(remove_url(textt5))

check out my notebook 
check out my notebook 
check out my notebook 
check out my notebook 
check out my notebook 
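
Each result above still ends with the space that preceded the link, and a link in the middle of a sentence would leave a double space. A minimal sketch of one way to tidy this, assuming that collapsing all runs of whitespace is acceptable for the data; the names **URL_PATTERN** and **remove_url_and_tidy** are made up for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url_and_tidy(text):
    """Remove URLs, then collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub('', text)
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r'\s+', ' ', without_urls).strip()

print(repr(remove_url_and_tidy('check out my notebook https://github.com/HarshalJain2')))
print(repr(remove_url_and_tidy('see www.kaggle.com/datasets for details')))
```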


# Removing Punctuation Marks

In [62]:
import string, time  # string.punctuation lists all ASCII punctuation characters
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [63]:
exclude=string.punctuation # store all punctuation characters in one variable

In [64]:
def remove_punct(text):
    for char in exclude:
        text=text.replace(char,'')
    return text    

In [65]:
text= 'hi! i am irry. how can i help you ?, and ask about anything''/. when & hen^ what @ same' 

In [66]:
start= time.time()
print(remove_punct(text))
time1=time.time()-start
print(time1)

hi i am irry how can i help you  and ask about anything when  hen what  same
0.0009992122650146484


In [67]:
def remove_punct_best_way(text):
    return text.translate(str.maketrans('','',exclude))

In [68]:
start= time.time()
print(remove_punct_best_way(text))
time2=time.time()-start
print(time2)

hi i am irry how can i help you  and ask about anything when  hen what  same
0.0


In [70]:
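# The second function took less time than the first: translate() removes all punctuation in a single pass,
# while the loop calls replace() once for every punctuation character.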
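
A single **time.time()** measurement of such a short call can fall below the timer's resolution, which is why the second timing printed 0.0. The sketch below is one way to get a steadier comparison using the standard **timeit** module; the function names, the sample string, and the repeat count of 10,000 are assumptions chosen for illustration.

```python
import string
import timeit

# Translation table built once and reused on every call.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_loop(text):
    # One replace() call per punctuation character.
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

def remove_punct_translate(text):
    # A single pass over the text using the prebuilt table.
    return text.translate(PUNCT_TABLE)

sample = 'hi! i am irry. how can i help you ?'

# Repeat each call many times so the total is well above timer resolution.
loop_seconds = timeit.timeit(lambda: remove_punct_loop(sample), number=10_000)
translate_seconds = timeit.timeit(lambda: remove_punct_translate(sample), number=10_000)

print(f'loop:      {loop_seconds:.4f} s for 10,000 calls')
print(f'translate: {translate_seconds:.4f} s for 10,000 calls')
```

The absolute numbers depend on the machine, so only the ratio between the two lines is meaningful.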

# Removing Emoji from Text

In [6]:
# pip install emoji

Note: you may need to restart the kernel to use updated packages.


In [7]:
text_with_emojis = "Hello 😃, how are you? 🌟"

**emoji_pattern** = **re.compile("[...]")**: This line defines a regular expression pattern stored in the emoji_pattern variable. This pattern is used to match a wide range of emojis. The pattern is constructed using Unicode code points (e.g., \U0001F600), which represent specific emoji characters.

**return emoji_pattern.sub(r'', text)**: This line uses the **sub()** method of the emoji_pattern regular expression to replace all matches of emojis in the input text with an empty string ''. In other words, it removes all emojis from the input text.

In [8]:
import re

def remove_emojis(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F700-\U0001F77F"  # alchemical symbols
        u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        u"\U0001FA00-\U0001FA6F"  # Chess Symbols
        u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        u"\U0001FB00-\U0001FBFF"  # Symbols for Legacy Computing
        u"\U0001FC00-\U0001FCFF"  # currently unassigned code points
        u"\U0001FD00-\U0001FDFF"  # currently unassigned code points
        u"\U0001FE00-\U0001FEFF"  # currently unassigned code points
        u"\U0001FF00-\U0001FFFF"  # currently unassigned code points
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


In [9]:
cleaned_text = remove_emojis(text_with_emojis)
print(cleaned_text)


Hello , how are you? 


In [27]:
# The same function can be applied to a dataset column with .apply(remove_emojis)

# Converting emoji to bytes with UTF-8 encoding

In [10]:
text_with_emojis

'Hello 😃, how are you? 🌟'

Python strings have an **.encode('utf-8')** method that converts text, including emoji, into its UTF-8 byte representation. This is character encoding, not Unicode normalization, which is a separate operation that rewrites equivalent character sequences into one canonical form.

In [11]:
text_with_emojis.encode('utf-8')

b'Hello \xf0\x9f\x98\x83, how are you? \xf0\x9f\x8c\x9f'
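
To make the byte output above concrete, the sketch below encodes the same string, decodes it back, and then shows what normalization does by comparison. It uses only the standard library; the variable names and the 'e' plus combining accent example are assumptions added for illustration.

```python
import unicodedata

text_with_emojis = "Hello 😃, how are you? 🌟"

encoded = text_with_emojis.encode('utf-8')  # str -> bytes
decoded = encoded.decode('utf-8')           # bytes -> str

# Each of the two emoji takes 4 bytes in UTF-8, so the byte string is longer.
print(len(text_with_emojis), len(encoded))
# Decoding recovers the original text exactly.
print(decoded == text_with_emojis)

# Normalization maps str -> str: here 'e' followed by a combining acute
# accent (two code points) is composed into the single character 'é'.
decomposed = 'e\u0301'
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))
```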

# Spelling check

In [12]:
incorrect_text='''I am havng truble speeling corectly. I offten make mistaks wen I type. Somtimes, I dont pay atention to 
my grammar either. Its dificult to rite perfictly all the time. The english langauge can be challanging, 
with its homophones and confusng rules. But I'm commited to improving, and I knw that with practice, I can get beter.'''

1. Import the **TextBlob** class from the textblob library.
2. Call its **.correct()** method, which returns a copy of the text with the spelling corrected.

In [13]:
# pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [14]:
from textblob import TextBlob
correct_text=TextBlob(incorrect_text)
correct_text.correct()

TextBlob("I am having trouble spelling correctly. I often make mistake wen I type. Sometimes, I dont pay attention to 
my grammar either. Its difficult to rite perfectly all the time. The english language can be challenging, 
with its homophones and confusing rules. But I'm committed to improving, and I know that with practice, I can get better.")

In [16]:
# Not every misspelling is fixed: 'wen', 'dont', 'Its' and 'rite' remain, and 'mistaks' became 'mistake' rather than 'mistakes'.