<div style="position: relative; width: 100%; height: 300px; display: flex; justify-content: center; align-items: center;">
    <img src="https://miro.medium.com/v2/resize:fit:2000/1*iy12bH-FiUNOy9-0bULgSg.png" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; z-index: 0; opacity: 0.8; border-radius: 37px" >
    <div style="font-size: 28px; border-radius: 10px; position: relative; z-index: 1; text-align: center; background-color: rgba(50, 50, 50, 0); color: rgb(129, 21, 28); display: flex; flex-direction: column; align-items: center; text-align: center; justify-content: center; width: 100%; margin: 10%; padding: 5px ">
        <h1 style="text-align: center; width: 100%" ><b>NLP Preprocessing with NLTK </b></h1>
    </div>
</div>

# <h1 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgb(165, 188, 230); background-color: rgba(130, 21, 128, 0.5); padding: 30px; border-style: solid; border-radius: 10px;"> Imports & Load Data </h1>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [22]:
import pprint

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [10]:
sns.set()
sns.set_palette('BuPu')
SNS_CMAP = 'BuPu'

colors = sns.palettes.color_palette(SNS_CMAP)

In [17]:
255*np.array(colors[5]), 280*np.array(colors[2])

(array([129.90588235,  21.47058824, 128.27058824]),
 array([165.19677047, 188.9230296 , 230.15763168]))

In [8]:
path = '/kaggle/input/aassignment-7-txt/test.txt'
with open(path, 'r') as f:
    text = f.read()

In [9]:
text

'Millions of people in India took part in an annual tree planting drive Sunday. More than 250 million saplings were planted in a single day across the country\'s most-populous state.\nThe campaign was led by Uttar Pradesh state government officials, lawmakers, and activists, in a bid to reduce carbon emissions and combat climate change.\nWhere were the trees planted?\nThe saplings were planted by volunteers in forests, farms, schools, and along riverbanks and highways.\n"We are committed to increasing the forest cover of Uttar Pradesh to over 15% of the total land area in the next five years,\'\' said state forest official Manoj Singh.\nAccording to another government official, the forest cover of the state has increased over the last few years.\n"There has been an increase of 127 sqare kilometers [79 sqare miles] in the forest cover in Uttar Pradesh as compared to 2017," a state government spokesperson was quoted as saying in The Indian Express newspaper.\n"There has also been an incr

# <h1 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgb(165, 188, 230); background-color: rgba(130, 21, 128, 0.5); padding: 30px; border-style: solid; border-radius: 10px;"> Tokenisation </h1>

In [19]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

tokens = word_tokenize(text)

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [25]:
print(tokens[:10])

['Millions', 'of', 'people', 'in', 'India', 'took', 'part', 'in', 'an', 'annual']


In [31]:
print(f"Number of Space-Separated Chunks: {len(text.split(' '))}")
print(f'Number of Tokens: {len(tokens)}')

Number of Space-Separated Chunks: 355
Number of Tokens: 434
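
The gap between the two counts comes from how each method cuts the text. `str.split(' ')` cuts only on single spaces, so punctuation stays attached to words and newlines separate nothing, whereas a tokenizer emits punctuation as separate tokens. The sketch below illustrates this on two fragments from the article. The regular expression is a rough stand-in chosen for illustration, not the algorithm `word_tokenize` uses, and `rough_tokenize` is a hypothetical helper name.

```python
import re

def rough_tokenize(s):
    """Rough stand-in for a word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s)

question = "Where were the trees planted?"
space_chunks = question.split(' ')        # punctuation stays glued to 'planted?'
rough_tokens = rough_tokenize(question)   # '?' becomes its own token

print(len(space_chunks), space_chunks)
print(len(rough_tokens), rough_tokens)

# split(' ') does not break on newlines, whereas split() breaks on any whitespace
fragment = "state.\nThe campaign"
newline_chunks = fragment.split(' ')
whitespace_chunks = fragment.split()
print(len(newline_chunks), len(whitespace_chunks))
```

For a whitespace word count that also respects the newlines in the file, `text.split()` is probably the better baseline than `text.split(' ')`.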


### <h3 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgba(150, 174, 209, 0.8); background-color: rgba(230, 131, 131, 0.3); padding: 10px; border-style: solid; border-radius: 10px;"> Sentence Tokenisation </h3>

In [35]:
sentences = nltk.tokenize.sent_tokenize(text)
print(f'Number of Sentences: {len(sentences)}')
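# text.split(' ') counts space-separated chunks, not periods; text.count('.') would count periods
print(f"Number of Space-Separated Chunks: {len(text.split(' '))}")

Number of Sentences: 21
Number of Space-Separated Chunks: 355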


In [36]:
print(sentences[:5])

['Millions of people in India took part in an annual tree planting drive Sunday.', "More than 250 million saplings were planted in a single day across the country's most-populous state.", 'The campaign was led by Uttar Pradesh state government officials, lawmakers, and activists, in a bid to reduce carbon emissions and combat climate change.', 'Where were the trees planted?', 'The saplings were planted by volunteers in forests, farms, schools, and along riverbanks and highways.']
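
Counting periods is only a rough proxy for the number of sentences, because sentences can also end in `?` or `!`. The sketch below checks this on the five sentences printed above, hard-coded here as `sample` so the block runs on its own. Even the terminator count is naive: abbreviations or decimal numbers would inflate it, which is the kind of case `sent_tokenize` with its trained Punkt model is meant to handle.

```python
import re

# First five sentences of the article, as printed above
sample = (
    "Millions of people in India took part in an annual tree planting drive Sunday. "
    "More than 250 million saplings were planted in a single day across the country's most-populous state. "
    "The campaign was led by Uttar Pradesh state government officials, lawmakers, and activists, "
    "in a bid to reduce carbon emissions and combat climate change. "
    "Where were the trees planted? "
    "The saplings were planted by volunteers in forests, farms, schools, and along riverbanks and highways."
)

period_count = sample.count('.')                       # literal '.' characters only
terminator_count = len(re.findall(r"[.!?]", sample))   # '.', '!' or '?'

print(f"Periods: {period_count}")
print(f"Sentence terminators: {terminator_count}")
```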


# <h1 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgb(165, 188, 230); background-color: rgba(130, 21, 128, 0.5); padding: 30px; border-style: solid; border-radius: 10px;"> Stop Words Detection </h1>

In [37]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [39]:
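# Build the stop-word set once so each membership check is fast
stop_words = set(stopwords.words('english'))
# Compare in lowercase so capitalised stop words such as 'The' are removed too
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]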

In [42]:
print(f'Number of Tokens: {len(tokens)}')
print(f'Number of Filtered Tokens: {len(filtered_tokens)}')
print(filtered_tokens[:5])

Number of Tokens: 434
Number of Filtered Tokens: 282
['Millions', 'people', 'India', 'took', 'part']
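
The stop-word list contains only words, so punctuation tokens such as `'.'` and `','` survive the filter above. A minimal sketch of dropping them in the same pass follows. `sample_tokens` copies the first tokens of the article, and `demo_stop_words` is a tiny stand-in list for illustration; in the notebook it would be the `stop_words` set built from `stopwords.words('english')`.

```python
# First tokens of the article, as produced by word_tokenize above
sample_tokens = ['Millions', 'of', 'people', 'in', 'India', 'took', 'part', 'in',
                 'an', 'annual', 'tree', 'planting', 'drive', 'Sunday', '.']

# Tiny stand-in stop list for illustration only
demo_stop_words = {'of', 'in', 'an'}

# Keep a token only if it is not a stop word and consists purely of letters
content_tokens = [
    word for word in sample_tokens
    if word.lower() not in demo_stop_words and word.isalpha()
]
print(content_tokens)
```

Note that `str.isalpha()` also discards numbers such as `'250'` and hyphenated tokens such as `'most-populous'`. If those should be kept, a looser check such as `any(ch.isalnum() for ch in word)` may be a better fit.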


# <h1 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgb(165, 188, 230); background-color: rgba(130, 21, 128, 0.5); padding: 30px; border-style: solid; border-radius: 10px;"> Stemming / Lemmatization </h1>

### <h3 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgba(150, 174, 209, 0.8); background-color: rgba(230, 131, 131, 0.3); padding: 10px; border-style: solid; border-radius: 10px;"> Stemming </h3>

In [43]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
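# The Porter stemmer lowercases each token and strips suffixes; results such as 'peopl' need not be dictionary words
stemmed_tokens = [stemmer.stem(word) for word in tokens]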

In [44]:
print(tokens[:20])
print(stemmed_tokens[:20])

['Millions', 'of', 'people', 'in', 'India', 'took', 'part', 'in', 'an', 'annual', 'tree', 'planting', 'drive', 'Sunday', '.', 'More', 'than', '250', 'million', 'saplings']
['million', 'of', 'peopl', 'in', 'india', 'took', 'part', 'in', 'an', 'annual', 'tree', 'plant', 'drive', 'sunday', '.', 'more', 'than', '250', 'million', 'sapl']
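
The practical effect of stemming is that different surface forms collapse onto one key. The sketch below groups a few words from the article by their Porter stem; the word list and the `forms_by_stem` name are chosen here for illustration.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Surface forms taken from the article's tokens
words = ['planting', 'planted', 'saplings', 'Millions', 'million']

# Map each stem to the set of surface forms that reduce to it
forms_by_stem = defaultdict(set)
for word in words:
    forms_by_stem[stemmer.stem(word)].add(word)

for stem, forms in forms_by_stem.items():
    print(stem, sorted(forms))
```

Grouping like this is a quick way to spot over-stemming, where unrelated words end up under the same stem.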


### <h3 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgba(150, 174, 209, 0.8); background-color: rgba(230, 131, 131, 0.3); padding: 10px; border-style: solid; border-radius: 10px;"> Lemmatization </h3>

In [55]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [57]:
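# Note: joining the NLTK tokens and passing the string to spaCy makes spaCy re-tokenize it,
# so the lemmas do not align one-to-one with `tokens` (e.g. 'most-populous' becomes 'most', '-', 'populous')
lemmatized_text = [token.lemma_ for token in nlp(" ".join(tokens))]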

In [58]:
lemmatized_text

['million', 'of', 'people', 'in', 'India', 'take', 'part', 'in', 'an', 'annual', 'tree', 'planting', 'drive', 'Sunday', '.', 'More', 'than', '250', 'million', 'sapling', 'be', 'plant', 'in', 'a', 'single', 'day', 'across', 'the', 'country', "'s", 'most', '-', 'populous', 'state', '.', 'the', 'campaign', 'be', 'lead', 'by', 'Uttar', 'Pradesh', 'state', 'government', 'official', ',', 'lawmaker', ',', 'and', 'activist', ',', 'in', 'a', 'bid', 'to', 'reduce', 'carbon', 'emission', 'and', 'combat', 'climate', 'change', '.', 'where', 'be', 'the', 'tree', 'plant', '?', 'the', 'sapling', 'be', 'plant', 'by', 'volunteer', 'in', 'forest', ',', 'farm', ',', 'school', ',', 'and', 'along', 'riverbank', 'and', 'highway', '.', '`', '`', 'we', 'be', 'committed', 'to', 'increase', 'the', 'forest', 'cover', 'of', 'Uttar', 'Pradesh', 'to', 'over', '15', '%', 'of', 'the', 'total', 'land', 'area', 'in', 'the', 'next', 'five', 'year', ',', "''", 'say', 'state', 'forest', 'official', 'Manoj', 'Singh', '.', '

In [59]:
tokens

['Millions', 'of', 'people', 'in', 'India', 'took', 'part', 'in', 'an', 'annual', 'tree', 'planting', 'drive', 'Sunday', '.', 'More', 'than', '250', 'million', 'saplings', 'were', 'planted', 'in', 'a', 'single', 'day', 'across', 'the', 'country', "'s", 'most-populous', 'state', '.', 'The', 'campaign', 'was', 'led', 'by', 'Uttar', 'Pradesh', 'state', 'government', 'officials', ',', 'lawmakers', ',', 'and', 'activists', ',', 'in', 'a', 'bid', 'to', 'reduce', 'carbon', 'emissions', 'and', 'combat', 'climate', 'change', '.', 'Where', 'were', 'the', 'trees', 'planted', '?', 'The', 'saplings', 'were', 'planted', 'by', 'volunteers', 'in', 'forests', ',', 'farms', ',', 'schools', ',', 'and', 'along', 'riverbanks', 'and', 'highways', '.', '``', 'We', 'are', 'committed', 'to', 'increasing', 'the', 'forest', 'cover', 'of', 'Uttar', 'Pradesh', 'to', 'over', '15', '%', 'of', 'the', 'total', 'land', 'area', 'in', 'the', 'next', 'five', 'years', ',', "''", 'said', 'state', 'forest', 'official', 'Mano

# <h1 style="text-align: center; font-family: 'Roboto', sans-serif; color: rgb(165, 188, 230); background-color: rgba(130, 21, 128, 0.5); padding: 30px; border-style: solid; border-radius: 10px;"> TF-IDF Vectorizer </h1>

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [62]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit([text])

In [66]:
tfidf_matrix = tfidf_vectorizer.transform([text])

In [67]:
tfidf_matrix

<1x189 sparse matrix of type '<class 'numpy.float64'>'
	with 189 stored elements in Compressed Sparse Row format>
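
Because the vectorizer was fitted on a single document, every term appears in that one document, so under scikit-learn's default smoothed IDF all terms receive the same IDF weight and the row reduces to normalised term counts. One possible alternative, sketched below as an assumption rather than the notebook's method, is to treat each sentence as its own document so that IDF can distinguish common terms from rare ones. The three sentences are copied from the article.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three sentences from the article, each treated as its own document
docs = [
    "Millions of people in India took part in an annual tree planting drive Sunday.",
    "Where were the trees planted?",
    "The saplings were planted by volunteers in forests, farms, schools, "
    "and along riverbanks and highways.",
]

# Single-document corpus: every term ends up with the same IDF
single = TfidfVectorizer().fit([" ".join(docs)])
print(np.unique(single.idf_))

# Sentence-level corpus: IDF now depends on how many sentences contain a term
multi = TfidfVectorizer()
sentence_matrix = multi.fit_transform(docs)
idf_planted = multi.idf_[multi.vocabulary_['planted']]  # appears in two sentences
idf_india = multi.idf_[multi.vocabulary_['india']]      # appears in one sentence
print(sentence_matrix.shape, idf_planted, idf_india)
```

In the notebook, the same idea would amount to passing the `sentences` list from the sentence tokenisation step to `fit_transform` instead of `[text]`.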