In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv(r'D:\#DATA Science\NLP\NLP\IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


### Handling Emojis in Text:

When working with text data that includes emojis, you have a few options for handling them:

1. **Removing Emojis**: You can remove emojis from the text to simplify the data and focus on the textual content itself.

2. **Replacing Emojis**: Alternatively, you can replace emojis with specific text, which can help maintain the context and meaning of the original emoji.

#### 1. Removing Emojis:

To remove emojis, you can use regular expressions or other text processing methods to identify and remove the Unicode characters representing emojis.




In [6]:
import re
def remove_emojis(text):
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # Emoticons
                               "\U0001F300-\U0001F5FF"  # Symbols & Pictographs
                               "\U0001F680-\U0001F6FF"  # Transport & Map Symbols
                               "\U0001F700-\U0001F77F"  # Alchemical Symbols
                               "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               "\U0001FA00-\U0001FA6F"  # Chess Symbols
                               "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               "\U00002600-\U000026FF"  # Miscellaneous Symbols
                               "\U00002702-\U000027B0"  # Dingbats
                               "\U000024C2"             # Circled Latin capital letter M
                               "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
                               "]+", flags=re.UNICODE)
    clean_text = emoji_pattern.sub(r'', text)
    return clean_text

text = "Hello!😊🚀🌸 How are you? 😊🚀🌸"
cleaned_text = remove_emojis(text)
print(cleaned_text)


Hello! How are you? 
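
As an alternative to listing code point ranges by hand, a rough sketch can filter on the Unicode general category instead. This assumes that treating every character in category `So` (Symbol, other) as an emoji is acceptable for your data; that category also covers symbols such as ©, ® and °, and it does not catch skin-tone modifiers or zero-width joiners. The function name `remove_symbol_characters` is made up for this illustration.

```python
import unicodedata

def remove_symbol_characters(text):
    """Drop every character whose Unicode general category is 'So' (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

sample = "Hello!😊🚀🌸 How are you? 😊🚀🌸"
print(repr(remove_symbol_characters(sample)))
```

The regex version gives finer control over which blocks are removed, while this version needs no maintenance when new emoji blocks are added to Unicode.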


#### 2. Replacing Emojis:

You can replace emojis with text representations that capture the emotion or meaning of the emoji. This can help retain the sentiment or context of the original emoji.

In [11]:
import emoji
text2 = '😊'
no_emoji = emoji.demojize(text2)
print("The emoji is saying: ",no_emoji)

The emoji is saying:  :smiling_face_with_smiling_eyes:


In [12]:
print(emoji.demojize('Babar Azam is on 🔥'))

Babar Azam is on :fire:
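
If installing the `emoji` package is not an option, the same idea can be sketched with a plain dictionary. The table `EMOJI_TO_TEXT` and the helper `replace_emojis` below are hypothetical names for illustration, and the table covers only the two emojis used above; a real dataset would need a much larger mapping.

```python
# Hypothetical lookup table; extend it with the emojis that matter for your data.
EMOJI_TO_TEXT = {
    "😊": ":smiling_face_with_smiling_eyes:",
    "🔥": ":fire:",
}

def replace_emojis(text, table=EMOJI_TO_TEXT):
    """Replace each emoji found in `table` with its text label."""
    for symbol, label in table.items():
        text = text.replace(symbol, label)
    return text

print(replace_emojis("Babar Azam is on 🔥"))
```

Emojis that are missing from the table are left untouched, so this approach can be combined with a removal step afterwards.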


### Tokenization in NLP:

1. **Using `split()` Function**: This is a basic method where text is split based on spaces or other specified delimiters.

2. **Using Regular Expression**: Regular expressions can be used to define more complex tokenization patterns, allowing you to capture words and other patterns.

3. **Using NLTK Tokenizers**: NLTK (Natural Language Toolkit) provides powerful tokenization tools, such as `word_tokenize()` for word-level tokenization and `sent_tokenize()` for sentence-level tokenization.

4. **Using spaCy**: spaCy is a popular NLP library that offers advanced tokenization capabilities along with other linguistic features.


In [18]:
my_intro = "Bismillah! My name is Hamza Ali, I am from Pakistan. I am a Student at U.E.T Peshawar and Alhamdulillah!"
my_intro.split()        # just using split for tokenization

['Bismillah!',
 'My',
 'name',
 'is',
 'Hamza',
 'Ali,',
 'I',
 'am',
 'from',
 'Pakistan.',
 'I',
 'am',
 'a',
 'Student',
 'at',
 'U.E.T',
 'Peshawar',
 'and',
 'Alhamdulillah!']

In [32]:
import re

my_info = "I am a Pakistan!, I love my country.!"
tokens = re.findall(r"[\w']+", my_info)             ### regular expression for tokenization (raw string avoids invalid-escape warnings)
tokens


['I', 'am', 'a', 'Pakistan', 'I', 'love', 'my', 'country']
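
The pattern above keeps only word characters and apostrophes, so punctuation is discarded. If punctuation should survive as separate tokens, one possible variant adds a second alternative to the pattern. The helper name `regex_tokenize` is an assumption made for this sketch, not part of any library.

```python
import re

def regex_tokenize(text):
    """Return words (including apostrophes) and single punctuation marks as tokens."""
    # First alternative: runs of word characters or apostrophes.
    # Second alternative: any single character that is neither a word character nor whitespace.
    return re.findall(r"[\w']+|[^\w\s]", text)

print(regex_tokenize("I am a Pakistan!, I love my country.!"))
```

Whether to keep punctuation depends on the downstream task: sentiment models sometimes benefit from tokens such as `!`, while bag-of-words features usually drop them.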

In [41]:
from nltk.tokenize import word_tokenize, sent_tokenize
word_tokenize(my_info)                  ### tokenization using nltk

['I', 'am', 'a', 'Pakistan', '!', ',', 'I', 'love', 'my', 'country', '.', '!']

In [40]:
sent_tokenize(my_intro)

['Bismillah!',
 'My name is Hamza Ali, I am from Pakistan.',
 'I am a Student at U.E.T Peshawar and Alhamdulillah!']

In [43]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(my_intro)
for token in doc:        # separate name so the earlier `tokens` list is not overwritten
    print(token)

Bismillah
!
My
name
is
Hamza
Ali
,
I
am
from
Pakistan
.
I
am
a
Student
at
U.E.T
Peshawar
and
Alhamdulillah
!


### Stemming using PorterStemmer in NLTK:

Stemming is a text normalization technique that reduces words to a base or root form by stripping suffixes with fixed rules, so the resulting stem is not always a dictionary word. The Porter stemming algorithm, available in NLTK, is a widely used method for this purpose.

To use the PorterStemmer in NLTK:

1. **Import NLTK and PorterStemmer:**

   Import the NLTK library and the PorterStemmer class. The stemmer itself needs no extra data; `nltk.download('punkt')` is required only if you also use NLTK's tokenizers such as `word_tokenize()`.

   ```python
   import nltk
   from nltk.stem import PorterStemmer
   nltk.download('punkt')  # only needed for NLTK tokenizers, not for the stemmer
   ```


In [45]:
from nltk.stem.porter import PorterStemmer
PS = PorterStemmer()
def stemmed_words(text):
    text = text.split()
    return " ".join([PS.stem(word) for word in text])


In [51]:
word_for_stemming = "I am moving over the road and singing a song while chilling over the late night"
stemmed_words(word_for_stemming)

'i am move over the road and sing a song while chill over the late night'
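
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the function `naive_stem` and its suffix list are assumptions chosen only for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["singing", "chilling", "moving", "roads", "night"]:
    print(w, "->", naive_stem(w))
```

Note that this toy version turns "moving" into "mov", whereas the Porter stemmer output above gives "move"; handling such cases is what the additional rules in real stemmers are for.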

### Lemmatization in Natural Language Processing:

Lemmatization is a text normalization technique that reduces words to their base or dictionary form, known as the lemma. Unlike stemming, which strips suffixes by rule and can produce non-words, lemmatization looks words up in a vocabulary and uses their part of speech (POS) to return valid words.

The result depends on the POS tag supplied. NLTK's `WordNetLemmatizer` treats every word as a noun unless told otherwise, so verbs such as "walking" are only reduced to "walk" when `pos='v'` is passed.

To perform lemmatization using NLTK:

1. **Import NLTK and WordNetLemmatizer:**

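   Import the NLTK library and the WordNetLemmatizer class. The lemmatizer relies on the WordNet corpus, which is downloaded with `nltk.download('wordnet')`; the `punkt` data is needed only for tokenizers such as `word_tokenize()`.

   ```python
   import nltk
   from nltk.stem import WordNetLemmatizer
   nltk.download('wordnet')  # corpus used by WordNetLemmatizer
   nltk.download('punkt')    # needed for word_tokenize
   ```
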

In [52]:
from nltk.stem import WordNetLemmatizer
Lemmatizer = WordNetLemmatizer()

In [56]:
word = 'walking'
Lemmatizer.lemmatize(word, pos='v')

'walk'
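
The role of the POS tag can be illustrated with a toy lookup table. This is a hypothetical miniature lexicon, not how WordNet is implemented; the names `LEMMA_TABLE` and `lookup_lemma` and the four entries are assumptions for this sketch.

```python
# Hypothetical miniature lexicon: (surface form, POS) -> lemma.
LEMMA_TABLE = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("are", "v"): "be",
    ("better", "a"): "good",
}

def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not in the table."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(lookup_lemma("meeting", pos="v"))
print(lookup_lemma("meeting", pos="n"))
print(lookup_lemma("walking", pos="v"))
```

The same surface form maps to different lemmas depending on its POS, and unknown words pass through unchanged, which mirrors the behaviour seen when a word is lemmatized with the wrong tag.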

In [57]:
sentence = "My name is Hamza, I am living with my family, I am Studing IT and Engineering and eating pizza."
len(sentence)

95

In [64]:
sentence = "My name is Hamza, I am living with my family in Mansehra, we are living happily and eating dinner!"
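punctuation = set("./!,?")                  # single-character punctuation tokens to drop
sentence_words = word_tokenize(sentence)

# Build a new list instead of removing items from the list while iterating over it
sentence_words = [word for word in sentence_words if word not in punctuation]
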
print("{0:19}{1:19}".format("word", "lemma"))
for word in sentence_words:
    print("{0:19}{1:19}".format(word, Lemmatizer.lemmatize(word, pos='v')))



word               lemma              
My                 My                 
name               name               
is                 be                 
Hamza              Hamza              
I                  I                  
am                 be                 
living             live               
with               with               
my                 my                 
family             family             
in                 in                 
Mansehra           Mansehra           
we                 we                 
are                be                 
living             live               
happily            happily            
and                and                
eating             eat                
dinner             dinner             
