
## Natural Language Processing

Natural Language Processing (NLP) is the field of computer science that focuses on building systems that can analyse and understand human languages. Because most text data is unstructured, it must first be preprocessed into a form that computers can analyse. The results are then postprocessed into a form that people can understand.

## NLTK - Natural Language Toolkit

NLTK is a Python library that provides support for text preprocessing, analysis and visualisation. Let's install it along with a few other libraries using pip.

In [1]:
!pip install nltk matplotlib numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Tokenisation

Tokenisation splits text into words or sentences. It is the first step in turning unstructured text into structured data, because smaller pieces of text are easier to analyse. Let's see some examples:

In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [3]:
example_text = """
Vivo's latest Vivo T2 series will launch in India today, April 11. 
The new smartphone will be unveiled during a virtual event at 12 PM and fans can watch the live stream for free on Vivo India's official YouTube channel. 
Ahead of the launch, the company has revealed the Vivo T2 series colour and design. 
The lineup includes the Vivo T2 and a toned-down Vivo T2x. 
Both smartphones support 5G.
The Vivo T2's official Flipkart listing reveals that the phone will come with a Full-HD+ AMOLED display with a 360Hz touch sampling rate 
and 1300 nits peak brightness. On the back, there's a 64-megapixel primary camera with OIS and EIS support to offer 
crisp and stable videos and photos. There's also going to be a 2-megapixel bokeh camera, but no ultra-wide camera lens.
"""

In [4]:
sent_tokenize(example_text)

LookupError: ignored

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
sent_tokenize(example_text)

["\nVivo's latest Vivo T2 series will launch in India today, April 11.",
 "The new smartphone will be unveiled during a virtual event at 12 PM and fans can watch the live stream for free on Vivo India's official YouTube channel.",
 'Ahead of the launch, the company has revealed the Vivo T2 series colour and design.',
 'The lineup includes the Vivo T2 and a toned-down Vivo T2x.',
 'Both smartphones support 5G.',
 "The Vivo T2's official Flipkart listing reveals that the phone will come with a Full-HD+ AMOLED display with a 360Hz touch sampling rate \nand 1300 nits peak brightness.",
 "On the back, there's a 64-megapixel primary camera with OIS and EIS support to offer \ncrisp and stable videos and photos.",
 "There's also going to be a 2-megapixel bokeh camera, but no ultra-wide camera lens."]

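The sentences returned above still carry the line breaks of the original string (a leading `\n` in the first one, and a `\n` in the middle of others). A minimal sketch of one way to tidy them up, using only the standard library; the helper name `normalise_whitespace` is hypothetical, and the two sentences are copied from the output above:

```python
# Two sentences copied from the sent_tokenize output above.
sentences = [
    "\nVivo's latest Vivo T2 series will launch in India today, April 11.",
    "On the back, there's a 64-megapixel primary camera with OIS and EIS support to offer \ncrisp and stable videos and photos.",
]

def normalise_whitespace(sentence):
    """Collapse every run of whitespace (including newlines) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces.
    return " ".join(sentence.split())

cleaned = [normalise_whitespace(s) for s in sentences]
for s in cleaned:
    print(s)
```

Whether this step is needed depends on what you do with the sentences next; it mainly helps when they are printed or compared as strings.
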
In [7]:
word_tokenize(example_text)

['Vivo',
 "'s",
 'latest',
 'Vivo',
 'T2',
 'series',
 'will',
 'launch',
 'in',
 'India',
 'today',
 ',',
 'April',
 '11',
 '.',
 'The',
 'new',
 'smartphone',
 'will',
 'be',
 'unveiled',
 'during',
 'a',
 'virtual',
 'event',
 'at',
 '12',
 'PM',
 'and',
 'fans',
 'can',
 'watch',
 'the',
 'live',
 'stream',
 'for',
 'free',
 'on',
 'Vivo',
 'India',
 "'s",
 'official',
 'YouTube',
 'channel',
 '.',
 'Ahead',
 'of',
 'the',
 'launch',
 ',',
 'the',
 'company',
 'has',
 'revealed',
 'the',
 'Vivo',
 'T2',
 'series',
 'colour',
 'and',
 'design',
 '.',
 'The',
 'lineup',
 'includes',
 'the',
 'Vivo',
 'T2',
 'and',
 'a',
 'toned-down',
 'Vivo',
 'T2x',
 '.',
 'Both',
 'smartphones',
 'support',
 '5G',
 '.',
 'The',
 'Vivo',
 'T2',
 "'s",
 'official',
 'Flipkart',
 'listing',
 'reveals',
 'that',
 'the',
 'phone',
 'will',
 'come',
 'with',
 'a',
 'Full-HD+',
 'AMOLED',
 'display',
 'with',
 'a',
 '360Hz',
 'touch',
 'sampling',
 'rate',
 'and',
 '1300',
 'nits',
 'peak',
 'brightness'
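
To see why a dedicated tokenizer helps, compare it with plain whitespace splitting. The sketch below uses only the standard library. The regular expression is a rough approximation written for this example, not the algorithm that `word_tokenize` uses.

```python
import re

sentence = "Both smartphones support 5G."

# Whitespace splitting leaves the full stop glued to the last word.
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# A crude pattern: runs of word characters, or single punctuation marks.
rough_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(rough_tokens)
```

The crude pattern separates the final full stop, but it would also break a hyphenated token such as `toned-down` into three pieces, whereas the `word_tokenize` output above keeps it whole.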

## Stop Words

Stop words are words that contribute little meaning to the text being processed and are usually ignored, for example "in", "to", "from" and "an".

## Filtering Stop Words

This is the next step in the text preprocessing phase: the stop words are removed from the list of tokens.

Let's do that in our next cell.

In [8]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
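# Tokenise the example text into words
word_tok = word_tokenize(example_text)
# Use a set so that membership checks are fast
set_stopwords = set(stopwords.words('english'))
# casefold() lowercases each token, so capitalised stop words such as "The" are matched too
filtered_tok = [tok for tok in word_tok if tok.casefold() not in set_stopwords]
print(filtered_tok)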

['Vivo', "'s", 'latest', 'Vivo', 'T2', 'series', 'launch', 'India', 'today', ',', 'April', '11', '.', 'new', 'smartphone', 'unveiled', 'virtual', 'event', '12', 'PM', 'fans', 'watch', 'live', 'stream', 'free', 'Vivo', 'India', "'s", 'official', 'YouTube', 'channel', '.', 'Ahead', 'launch', ',', 'company', 'revealed', 'Vivo', 'T2', 'series', 'colour', 'design', '.', 'lineup', 'includes', 'Vivo', 'T2', 'toned-down', 'Vivo', 'T2x', '.', 'smartphones', 'support', '5G', '.', 'Vivo', 'T2', "'s", 'official', 'Flipkart', 'listing', 'reveals', 'phone', 'come', 'Full-HD+', 'AMOLED', 'display', '360Hz', 'touch', 'sampling', 'rate', '1300', 'nits', 'peak', 'brightness', '.', 'back', ',', "'s", '64-megapixel', 'primary', 'camera', 'OIS', 'EIS', 'support', 'offer', 'crisp', 'stable', 'videos', 'photos', '.', "'s", 'also', 'going', '2-megapixel', 'bokeh', 'camera', ',', 'ultra-wide', 'camera', 'lens', '.']
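
The filtered list above still contains punctuation marks and the clitic `'s`, which are not in the stop word list. If those are unwanted, one possible follow-up is sketched below. The helper name `drop_punctuation` and its two rules are assumptions made for this example, and the sample tokens are copied from the start of the output above.

```python
def drop_punctuation(tokens):
    """Keep tokens that contain a letter or digit and are not clitics like "'s"."""
    kept = []
    for tok in tokens:
        if tok.startswith("'"):
            # Clitics such as "'s" are split off by the tokenizer; skip them.
            continue
        if not any(ch.isalnum() for ch in tok):
            # Tokens made only of punctuation, such as "," or ".".
            continue
        kept.append(tok)
    return kept

sample = ['Vivo', "'s", 'latest', 'Vivo', 'T2', 'series', 'launch',
          'India', 'today', ',', 'April', '11', '.']
content_tokens = drop_punctuation(sample)
print(content_tokens)
```

Tokens such as `T2` and `11` are kept because they contain alphanumeric characters. Whether numbers should stay depends on the analysis you have in mind.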


## Stemming

Stemming reduces words to their roots, i.e. the core part of each word, so that you can focus on the basic meaning of a word rather than on the details of how it is used in a text. For example, "catching" and "catches" share the root "catch".


In [13]:
from nltk.stem import PorterStemmer
stemmer_obj = PorterStemmer()
to_be_stemmed = """
On April 12, Prime Minister Narendra Modi will launch the first Vande Bharat Express train in Rajasthan and the first semi-high-speed passenger train in the world on high-rise overhead electric (OHE) land. The first train would run between the railway stations in Jaipur and Delhi Cantt. 
"""
word_toks = word_tokenize(to_be_stemmed)
print([stemmer_obj.stem(tok) for tok in word_toks])

['on', 'april', '12', ',', 'prime', 'minist', 'narendra', 'modi', 'will', 'launch', 'the', 'first', 'vand', 'bharat', 'express', 'train', 'in', 'rajasthan', 'and', 'the', 'first', 'semi-high-spe', 'passeng', 'train', 'in', 'the', 'world', 'on', 'high-ris', 'overhead', 'electr', '(', 'ohe', ')', 'land', '.', 'the', 'first', 'train', 'would', 'run', 'between', 'the', 'railway', 'station', 'in', 'jaipur', 'and', 'delhi', 'cantt', '.']
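
Several stems in the output above, such as `minist` and `passeng`, are not dictionary words. That is typical of rule-based suffix stripping. As a rough illustration of the idea, the sketch below strips a few common suffixes. It is a toy and not the Porter algorithm; the function name `toy_stem`, its suffix list and the `min_stem_len` parameter are all hypothetical choices made for this example.

```python
def toy_stem(word, min_stem_len=3):
    """Strip one common English suffix. A toy, not the Porter algorithm."""
    word = word.lower()
    if word.endswith("ss"):
        # Leave words such as "express" intact instead of cutting the final "s".
        return word
    for suffix in ("ing", "ed", "es", "s"):
        stem = word[: -len(suffix)]
        # Only strip when enough of the word is left to be meaningful.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["catching", "catches", "caught", "stations", "Express"]:
    print(w, "->", toy_stem(w))
```

Irregular forms such as "caught" pass through unchanged because no suffix rule applies to them. Mapping them to "catch" needs a dictionary-based approach rather than suffix stripping.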
