# Importing Important Libraries

In [11]:
import re
import nltk
import pandas as pd
import numpy as np

In [31]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
filePath="/content/drive/MyDrive/Research Material/IMDB Dataset.csv"

In [14]:
data=pd.read_csv(filePath)

In [15]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [16]:
first_review=data.review[1]

It seems our data needs preprocessing. We will focus on the following steps ▶

✴ **Text Cleaning**

Before diving into sophisticated techniques, let's begin with the basics: text cleaning. It involves converting all the words in a text to a uniform form and removing unnecessary words and punctuation from the text.


### **Cleaning**

In [17]:
text= first_review[:200]

In [18]:
text

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece'

Using Python's built-in **lower()** string method, we can convert the text to lowercase. The text above also contains some HTML tags. Much of the text used in NLP projects is scraped from websites, so we need to remove HTML tags because they add no value to our data. We can use **regular expressions** to do this.


In [19]:
text=text.lower()

In [20]:
# Non-greedy pattern: matches the shortest run between '<' and '>', i.e. one tag at a time
pattern = re.compile(r'<.*?>')
text=pattern.sub("",text)
text

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece'

In [21]:
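# Keep only word characters and whitespace; this strips punctuation, including the hyphens in old-time-bbc
text=re.sub(r'[^\w\s]', '', text)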

In [22]:
print(text)

a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece
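
The cleaning steps above can be bundled into one helper so that they can be applied to any review. The sketch below is a minimal version under a few assumptions: the name `clean_text` is hypothetical, and, unlike the cells above, it replaces tags and punctuation with a space instead of an empty string so that hyphenated words such as "old-time-BBC" are not fused into "oldtimebbc".

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')      # non-greedy match of an HTML tag
PUNCT_PATTERN = re.compile(r'[^\w\s]')  # anything that is not a word character or whitespace

def clean_text(raw):
    """Lowercase the text, then strip HTML tags and punctuation (hypothetical helper)."""
    text = raw.lower()
    text = TAG_PATTERN.sub(' ', text)
    text = PUNCT_PATTERN.sub(' ', text)
    # Collapse the runs of whitespace left behind by the substitutions
    return ' '.join(text.split())

sample = 'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion'
print(clean_text(sample))
```

Whether hyphenated words should be split or fused depends on the downstream task, so treat the space substitution as one possible choice rather than the only correct one.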


### ✴ **Tokenization**

One of the most common techniques for text tokenization is whitespace tokenization, which splits a sentence on the whitespace between tokens or words.

In [23]:
words=text.split()
print(words)

['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece']


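We can also use the **“word_tokenize”** function from ***nltk.tokenize*** to convert our text into tokens. Since the punctuation has already been removed, it produces the same tokens as **split()** here.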

In [32]:
from nltk.tokenize import word_tokenize

words=word_tokenize(text)
print(words)


['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece']
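
On our cleaned text both approaches give the same tokens because the punctuation is already gone. The difference shows up on raw text, where whitespace splitting leaves punctuation glued to the words. The sketch below illustrates this with a small regular-expression tokenizer. It is only a rough stand-in written for this example (the name `simple_tokenize` is hypothetical) and is not how **word_tokenize** is implemented.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r'\w+|[^\w\s]', text)

raw = 'A wonderful little production. The filming technique is very unassuming.'

whitespace_tokens = raw.split()
regex_tokens = simple_tokenize(raw)
print(whitespace_tokens)  # punctuation stays attached, e.g. 'production.'
print(regex_tokens)       # punctuation becomes its own token, e.g. 'production', '.'
```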


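Like the word tokenizer, the nltk package provides a sentence-level tokenizer that splits a text into sentences. Because we have already removed the punctuation, it returns our cleaned text as a single sentence.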



In [33]:
from nltk.tokenize import sent_tokenize

sentences=sent_tokenize(text)
print(sentences)

['a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece']
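
The single-element result above is expected: sentence tokenizers rely on punctuation, and our cleaning step removed it. The sketch below makes the same point with a deliberately naive splitter built on `re.split`. It is a stand-in for illustration, not the algorithm behind **sent_tokenize**, and the name `naive_sent_split` is hypothetical.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

raw = 'A wonderful little production. The filming technique is very unassuming.'
cleaned = re.sub(r'[^\w\s]', '', raw.lower())

raw_sentences = naive_sent_split(raw)
cleaned_sentences = naive_sent_split(cleaned)
print(raw_sentences)      # two sentences
print(cleaned_sentences)  # one long "sentence": no boundaries are left
```

In practice this suggests running sentence tokenization before punctuation is stripped.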


### ✴ **StopWords**
Stopwords are very common words, such as “the” or “is”, that carry little meaning on their own. We can use the Natural Language Toolkit (*nltk*) module to remove **stopwords**. The code below prints the nltk package’s list of English stopwords:

In [45]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))
print(stop_words)

{'him', 'theirs', 'only', "won't", 'or', 'out', 'won', 'you', 'did', 'some', 'to', 'is', 'if', 'ma', 'as', 'same', 'with', 'does', 'own', 'then', 'there', 'few', 'at', 'shan', 'them', 'her', 'ain', "didn't", 'were', "she's", 'had', 'no', "shan't", 'needn', 'now', 've', 'of', 'because', 'aren', 'wasn', 'once', 'just', 'any', 'until', 'from', "you'd", "hasn't", 'down', 'himself', 'itself', 'are', 'too', 'a', 'nor', 'was', 'isn', 'your', 'it', 'i', "you're", 'above', 'off', "mustn't", 'under', 'both', 'couldn', "it's", 'before', 'whom', 'the', 'we', 'hers', 'should', 'those', "don't", 'while', 'during', 'ours', 'yours', "you'll", 'up', "couldn't", 'd', 'm', 'she', 'having', 'have', 'that', 'me', 'when', 'such', 'what', 'which', 's', 'yourself', 'didn', 'again', 'themselves', "wouldn't", 'has', "shouldn't", 'these', 'its', 'don', "that'll", 'against', 'through', 'in', 'here', 'shouldn', 'am', 'all', 'for', "hadn't", 'on', 'mightn', 'than', "doesn't", 'not', 'can', 'y', 'further', 'weren', 

In [44]:
text

'a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece'

The **remove_stopwords()** function takes a text as input, removes the stopwords from it, and returns the cleaned text.

In [42]:
def remove_stopwords(text):
  stop_words=set(stopwords.words('english'))
  words=word_tokenize(text)
  filtered_words=[word for word in words if word not in stop_words]
  return ' '.join(filtered_words)

In [43]:
processed_text=remove_stopwords(text)
print(processed_text)

wonderful little production filming technique unassuming oldtimebbc fashion gives comforting sometimes discomforting sense realism entire piece
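
One caveat for sentiment data such as these reviews: the default list also contains negations like 'not' and 'no' (both appear in the set printed above), so removing them can flip the apparent meaning of a review. A possible workaround, sketched below, is to subtract the words to keep from the stopword set. To stay self-contained, the example uses a tiny hand-written stopword set (`base_stop_words`, `keep`, and `remove_stopwords_custom` are made up for illustration); with nltk, the same set difference could be applied to `set(stopwords.words('english'))`.

```python
# Tiny illustrative stopword set; a real run would start from nltk's list
base_stop_words = {'the', 'was', 'a', 'is', 'not', 'no', 'at', 'all'}
keep = {'not', 'no'}  # negations we want to preserve for sentiment analysis
custom_stop_words = base_stop_words - keep

def remove_stopwords_custom(text, stop_words):
    """Drop every whitespace-separated token found in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

review = 'the film was not good at all'
print(remove_stopwords_custom(review, base_stop_words))    # film good
print(remove_stopwords_custom(review, custom_stop_words))  # film not good
```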


 ### ✴ **POS-Tagging**

 We can use the *nltk* **“pos_tag”** function to get the part-of-speech tag for each word in a sentence.


In [50]:
from nltk import pos_tag
words=word_tokenize("I shot an elephant in my pajamas")
pos_tag(words)

[('I', 'PRP'),
 ('shot', 'VBP'),
 ('an', 'DT'),
 ('elephant', 'NN'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('pajamas', 'NN')]
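
The tags follow the Penn Treebank convention, where noun tags start with 'NN' and verb tags start with 'VB'. Once a sentence is tagged, the (word, tag) pairs can be filtered like any other Python list. The sketch below reuses the tagged output shown above, hard-coded so that it runs without nltk; the helper name `words_with_tag_prefix` is hypothetical.

```python
# (word, tag) pairs copied from the pos_tag output above
tagged = [('I', 'PRP'), ('shot', 'VBP'), ('an', 'DT'), ('elephant', 'NN'),
          ('in', 'IN'), ('my', 'PRP$'), ('pajamas', 'NN')]

def words_with_tag_prefix(tagged_words, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, 'NN')
verbs = words_with_tag_prefix(tagged, 'VB')
print(nouns)  # ['elephant', 'pajamas']
print(verbs)  # ['shot']
```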

### ✴ **NER**

Named Entity Recognition (NER) is an essential task in Natural Language Processing (NLP) that identifies and classifies entities (such as names of people, organizations, locations, dates, etc.) in text. We can perform Named Entity Recognition with the Natural Language Toolkit (nltk) package:

In [58]:
from nltk.chunk import ne_chunk

def named_entity_recognition(sentence):
  words=word_tokenize(sentence)
  tagged_words=pos_tag(words)
  named_entities=ne_chunk(tagged_words)
  return named_entities

sentence="Rajib just joined Facebook Inc. in San Francisco in 2024"
entities_tree=named_entity_recognition(sentence)
print(entities_tree)

(S
  (GPE Rajib/NNP)
  just/RB
  joined/VBD
  (ORGANIZATION Facebook/NNP Inc./NNP)
  in/IN
  (GPE San/NNP Francisco/NNP)
  in/IN
  2024/CD)


In the above output, named entities are marked with their corresponding entity types, such as 'ORGANIZATION' for companies and 'GPE' for geopolitical entities (locations). Note that the chunker is not always right: here it labels the person name 'Rajib' as 'GPE'.
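
The result of **ne_chunk** is an nltk Tree in which each named entity is a labelled subtree and every other token is a plain (word, tag) pair. A minimal sketch of pulling the entities out as (text, label) pairs follows. The tree is rebuilt by hand from the output above so that the example does not need the downloaded models, and the helper name `extract_entities` is hypothetical.

```python
from nltk import Tree

# Hand-built copy of the tree printed above
entities_tree = Tree('S', [
    Tree('GPE', [('Rajib', 'NNP')]),
    ('just', 'RB'),
    ('joined', 'VBD'),
    Tree('ORGANIZATION', [('Facebook', 'NNP'), ('Inc.', 'NNP')]),
    ('in', 'IN'),
    Tree('GPE', [('San', 'NNP'), ('Francisco', 'NNP')]),
    ('in', 'IN'),
    ('2024', 'CD'),
])

def extract_entities(tree):
    """Collect (entity text, entity label) for each labelled subtree."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain tokens are tuples, entities are subtrees
            text = ' '.join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

entities = extract_entities(entities_tree)
print(entities)
```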


### ✴ **Stemming**
In Natural Language Processing (NLP), stemming is a text normalization approach that reduces words to their base or root form, known as the stem. The goal is to group words with related meanings together even when their inflected forms differ. We can use the nltk PorterStemmer class to implement stemming easily:

In [60]:
from nltk.stem import PorterStemmer

def stemming(sentence):
  stemmer=PorterStemmer()  # create the stemmer once and reuse it for every word
  words=word_tokenize(sentence)
  stemmed_words=[stemmer.stem(word) for word in words]
  return ' '.join(stemmed_words)

sentence="""Lets reduce some words into their original form
          using stemming"""
print(stemming(sentence))

let reduc some word into their origin form use stem


It's important to note that stemming has its limitations, and sometimes the resulting stems may not be actual words. For example, “reduce” is converted to “reduc” which is not even a word.
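
To see the grouping effect and the limitation side by side, the sketch below stems two small word families with the same PorterStemmer class. The word lists are chosen for illustration. The first family collapses to a real word, while the second collapses to a stem that is not a word and also merges words with different meanings.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

family = ['connect', 'connected', 'connecting', 'connection', 'connections']
merged = ['universe', 'university', 'universal']

family_stems = {stemmer.stem(word) for word in family}
merged_stems = {stemmer.stem(word) for word in merged}
print(family_stems)  # a single stem for the whole family
print(merged_stems)  # a single, non-word stem shared by three different meanings
```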

### ✴ **Lemmatization**
In Natural Language Processing (NLP), lemmatization is a text normalization approach that reduces words to their dictionary form, or lemma. Its objective is the same as that of stemming, but it produces valid words and can take each word's part of speech into account.



In [66]:
from nltk.stem import WordNetLemmatizer

def lemmatization(sentence):
  lemmatizer=WordNetLemmatizer()  # create the lemmatizer once and reuse it
  words=word_tokenize(sentence)
  lemmatized_words=[lemmatizer.lemmatize(word,pos='v')
                    for word in words]
  return ' '.join(lemmatized_words)

sentence="""Lets reduce some words into their original form
          using lemmatization"""
print(lemmatization(sentence))

Lets reduce some word into their original form use lemmatization


In the code above, the parameter **pos='v'** passed to the lemmatize method tells the lemmatizer to treat each word as a verb, which is why “using” becomes “use”. By default, pos is set to noun. Lemmatization is commonly employed in tasks such as information retrieval, question answering, and language modeling.
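
Passing one fixed pos for the whole sentence is a simplification. A common refinement is to tag the tokens first and translate each Penn Treebank tag into the single-letter code that lemmatize expects ('n' noun, 'v' verb, 'a' adjective, 'r' adverb). The mapping below is a sketch of that idea: the function name `penn_to_wordnet` is hypothetical, the tagged list is copied from the POS-Tagging output earlier in this notebook, and falling back to noun for all other tags is an assumption that mirrors the default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos code, defaulting to noun."""
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    return 'n'      # nouns and everything else (assumed fallback)

tagged = [('I', 'PRP'), ('shot', 'VBP'), ('an', 'DT'), ('elephant', 'NN'),
          ('in', 'IN'), ('my', 'PRP$'), ('pajamas', 'NN')]

pos_codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_codes)
```

Each (word, code) pair could then be passed to a WordNetLemmatizer instance as `lemmatize(word, pos=code)`.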
