<a href="https://www.kaggle.com/code/iqmansingh/getting-started-with-nlp?scriptVersionId=135564568" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<img src="https://cdn.discordapp.com/attachments/1111599839663370271/1124998618529681428/NLP-Banner.jpg">

# **Getting Started with NLP**

---

### This notebook walks through common text preprocessing techniques used in NLP:
 1. Tokenization
 2. Stemming
 3. Lemmatization
 4. Vectorization
    - sklearn.feature_extraction.text.CountVectorizer
    - sklearn.feature_extraction.text.TfidfVectorizer
    - gensim.models.Word2Vec
    - tf.keras.layers.Embedding

In [68]:
import numpy as np
import pandas as pd 
import tensorflow as tf
import datetime
import warnings
import nltk
import random
import re
import sklearn
import zipfile
import gensim
import matplotlib.pyplot as plt
import seaborn as sns

nltk.download('punkt',download_dir="/kaggle/working/")
nltk.download('wordnet',download_dir="/kaggle/working/")
nltk.download('stopwords',download_dir="/kaggle/working/")
nltk.data.path.append('/kaggle/working/')

with zipfile.ZipFile("/kaggle/working/corpora/wordnet.zip", 'r') as zip_f:
    zip_f.extractall("/kaggle/working/corpora/")
    
warnings.filterwarnings("ignore")
pd.plotting.register_matplotlib_converters()
%matplotlib inline
plt.style.use('dark_background')

[nltk_data] Downloading package punkt to /kaggle/working/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /kaggle/working/...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /kaggle/working/...
[nltk_data]   Package stopwords is already up-to-date!


In [69]:
df = pd.read_csv("/kaggle/input/ted-ultimate-dataset/2020-05-01/ted_talks_en.csv")
df.sort_values(by="views",ascending=False,inplace=True)

In [70]:
# Tim Urban: "Inside the mind of a master procrastinator" speech
speech = df.iloc[6].transcript
speech

'So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know — (Laughter) you get started maybe a little slowly, but you get enough done in the first week that, with some heavier days later on, everything gets done, things stay civil. (Laughter) And I would want to do that like that. That would be the plan. I would have it all ready to go, but then, actually, the paper would come along, and then I would kind of do this. (Laughter) And that would happen every single paper. But then came my 90-page senior thesis, a paper you\'re supposed to spend a year on. And I knew for a paper like that, my normal work flow was not an option. It was way too big a project. So I planned things out, and I decided I kind of had to go something like this. This is how the year would go. So I\'d start off light, and I\'d bump it up in the middle months, and then at the end, I wo

---

# 1. Tokenization
### 1.1 Sentence Tokenization 

In [71]:
sentences = nltk.sent_tokenize(speech)
sentences[:5]

['So in college, I was a government major, which means I had to write a lot of papers.',
 'Now, when a normal student writes a paper, they might spread the work out a little like this.',
 'So, you know — (Laughter) you get started maybe a little slowly, but you get enough done in the first week that, with some heavier days later on, everything gets done, things stay civil.',
 '(Laughter) And I would want to do that like that.',
 'That would be the plan.']

### 1.2 Word Tokenization 

In [72]:
words = nltk.word_tokenize(speech)
len(words)

2769

In [73]:
for i in range(random.randint(1,50),random.randint(100,200)):
    print(words[i],end=" ")

they might spread the work out a little like this . So , you know — ( Laughter ) you get started maybe a little slowly , but you get enough done in the first week that , with some heavier days later on , everything gets done , things stay civil . ( Laughter ) And I would want to do that like that . That would be the plan . I would have it all ready to go , but then , actually , the paper would come along , and then I would kind of do this . ( Laughter ) And that would happen every single paper . But then came my 90-page senior thesis , a paper you 're supposed to spend a year on . And I knew for a paper like that , my normal work flow was not an option . It was way too big a project . So I planned things out , and I decided I kind 

---

# 2. Stemming vs Lemmatization

### 2.1 Stemming

In [74]:
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [75]:
stemmer = nltk.PorterStemmer()
stemmedSentences = []

for j in range(len(sentences)):
    words = nltk.word_tokenize(sentences[j])
#     words = [re.sub('[!,*)@#%(&$_?.^]',"",i).lower() for i in words]
    # Strip non-alphanumeric characters; punctuation tokens become "" and
    # show up as doubled spaces in the joined output.
    words = [re.sub("[^a-zA-Z0-9]","",i).lower().lstrip() for i in words]
    # Drop stop words, then stem what remains.
    words = [stemmer.stem(i) for i in words if i not in stopwords]
    stemmedSentences.append(" ".join(words))
stemmedSentences[:5]

['colleg  govern major  mean write lot paper ',
 ' normal student write paper  might spread work littl like ',
 ' know   laughter  get start mayb littl slowli  get enough done first week  heavier day later  everyth get done  thing stay civil ',
 ' laughter  would want like ',
 'would plan ']

### 2.2 Lemmatization 

In [76]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizedSentences = []

for j in range(len(sentences)):
    words = nltk.word_tokenize(sentences[j])
#     words = [re.sub("[!,*)@#%(&$_?.:’'^]","",i).lower() for i in words]
    # Strip non-alphanumeric characters; punctuation tokens become "" and
    # show up as doubled spaces in the joined output.
    words = [re.sub("[^a-zA-Z0-9]","",i).lower().strip() for i in words]
    # Drop stop words, then lemmatize what remains (default part of speech: noun).
    words = [lemmatizer.lemmatize(i) for i in words if i not in stopwords]
    lemmatizedSentences.append(" ".join(words))
lemmatizedSentences[:5]

['college  government major  mean write lot paper ',
 ' normal student writes paper  might spread work little like ',
 ' know   laughter  get started maybe little slowly  get enough done first week  heavier day later  everything get done  thing stay civil ',
 ' laughter  would want like ',
 'would plan ']

### 2.3 Comparing Stemming vs Lemmatization

In [77]:
print(stemmedSentences[6])
print(lemmatizedSentences[6])

 laughter  would happen everi singl paper 
 laughter  would happen every single paper 
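
The pair above shows the typical difference: the stemmer cuts words down to forms such as "everi" and "singl", while the lemmatizer returns dictionary words. The sketch below is a rough illustration of why. The suffix rules and the lookup table are hypothetical stand-ins, far simpler than the Porter algorithm or WordNet.

```python
# Toy illustration only: these rules and this table are hypothetical,
# not the Porter algorithm and not the WordNet database.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
LEMMA_TABLE = {"studies": "study", "papers": "paper", "mice": "mouse"}

def toy_stem(word):
    """Apply the first matching suffix rule, only when at least three
    characters remain; the result need not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "papers", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Rule-based stripping is cheap but blind to irregular forms ("mice" stays "mice"), whereas a dictionary lookup handles them but only for words it knows.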


---

# 3. Vectorization
### 3.1 Bag of Words (CountVectorizer)
- sklearn.feature_extraction.text.CountVectorizer

In [78]:
# Frequency-based Bag of Words
countVectorizer = sklearn.feature_extraction.text.CountVectorizer(max_features=2000)
X = countVectorizer.fit_transform(lemmatizedSentences).toarray()
X.shape
# rows    = number of sentences
# columns = number of features (vocabulary words kept, at most max_features)

(142, 482)

In [79]:
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
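
The matrix is mostly zeros because each sentence uses only a few of the vocabulary words. To make the layout concrete, here is a minimal plain-Python sketch of the same idea on a made-up three-sentence corpus. It splits on whitespace, which is a simplification of CountVectorizer's own tokenization.

```python
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased.
docs = ["write paper", "student write paper paper", "plan"]

# Vocabulary: sorted unique words; position in the list = column index.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each row has as many entries as there are vocabulary words, which is why the real matrix above has one column per feature.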


### 3.2 TF-IDF (Term Frequency * Inverse Document Frequency)
 - sklearn.feature_extraction.text.TfidfVectorizer

In [80]:
tfidfVectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_features=2000)
X = tfidfVectorizer.fit_transform(lemmatizedSentences).toarray() 
X.shape

(142, 482)

In [81]:
print(X)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
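
For intuition, the weighting can be reproduced by hand on a toy corpus. This sketch assumes the settings scikit-learn documents as defaults (raw counts as term frequency, smoothed IDF, L2 row normalization). The corpus is made up, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased.
docs = ["write paper", "student write paper paper", "plan"]
vocab = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = {w: sum(w in d.split() for d in docs) for w in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Raw count times IDF for every vocabulary word, then L2-normalize."""
    counts = Counter(doc.split())
    raw = [counts.get(w, 0) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

tfidf = [tfidf_row(d) for d in docs]
for row in tfidf:
    print([round(x, 3) for x in row])
```

Words that occur in fewer documents ("plan" here) get a larger IDF than words that occur in many ("paper"), so they weigh more in the rows where they appear.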


### 3.3 Word2Vec

In [82]:
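# speech1 = re.sub(r"[^a-zA-Z0-9\s]","",speech).lower()
wordVecSentences = []

# NOTE: each pass tokenizes the full `speech`, not `sentences[j]`, so every
# entry holds the whole transcript. The stop-word check is also case-sensitive,
# which is why capitalized tokens such as 'I' survive.
for j in range(len(sentences)):
    words = nltk.word_tokenize(re.sub(r"[^a-zA-Z0-9\s]","",speech))
    words = [i for i in words if i not in stopwords]
    wordVecSentences.append(words)
print(wordVecSentences[0][50:60])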

['plan', 'I', 'would', 'ready', 'go', 'actually', 'paper', 'would', 'come', 'along']
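
Word2Vec takes an iterable of token lists, usually one list per sentence, whereas the loop above tokenizes the whole `speech` on every pass. A possible per-sentence variant is sketched below. The sentences, the tiny stop-word set, and the regex tokenizer (used instead of `nltk.word_tokenize` so the snippet runs without downloads) are all assumptions for illustration.

```python
import re

# Hypothetical inputs: two toy sentences and a tiny stop-word set.
toy_sentences = ["I would have it all ready to go.", "That would be the plan!"]
toy_stopwords = {"i", "it", "all", "to", "that", "be", "the", "have"}

def sentence_to_tokens(sentence, stop_words):
    """Lowercase first, keep alphanumeric tokens, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in stop_words]

# One token list per sentence.
corpus = [sentence_to_tokens(s, toy_stopwords) for s in toy_sentences]
print(corpus)
```

Lowercasing before the stop-word check matters because the NLTK stop-word list is all lowercase. A list shaped like `corpus` could then be passed to `gensim.models.Word2Vec` in place of `wordVecSentences`.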


In [83]:
word2Vec = gensim.models.Word2Vec(wordVecSentences,min_count=1)
print("Length of Word2Vec Vocab:",len(word2Vec.wv),"\n")
word2Vec.wv.most_similar(positive=["actually"])[:5]

Length of Word2Vec Vocab: 588 



[('plan', 0.7365055084228516),
 ('lab', 0.734563410282135),
 ('let', 0.7147486209869385),
 ('scan', 0.7056334614753723),
 ('takes', 0.6901529431343079)]
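
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure, using made-up 3-dimensional vectors rather than the trained ones:

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen so the geometry is easy to see.
toy_vectors = {
    "actually": [1.0, 2.0, 0.0],
    "plan": [2.0, 4.0, 0.0],   # same direction as "actually"
    "scan": [0.0, 0.0, 3.0],   # orthogonal to "actually"
}

query = toy_vectors["actually"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "actually"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

With a single transcript as training data, the neighbours listed above are likely to be noisy, so the ranking is better read as a demonstration of the API than as a statement about word meaning.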

### 3.4 Word Embedding
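- Keras `one_hot` lowercases the text and strips punctuation, then hashes each word to an integer index below `VOCAB_SIZE`. Despite the name, the result is a list of indices, not one-hot vectors.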

In [98]:
#One Hot Representation
VOCAB_SIZE = 10000

oneHot = [tf.keras.preprocessing.text.one_hot(i,VOCAB_SIZE) for i in sentences]
print(oneHot[:2])

[[630, 1988, 8608, 1478, 6902, 2303, 5394, 1619, 1853, 7177, 1478, 6432, 9512, 1035, 2303, 5146, 6685, 1582], [3799, 8962, 2303, 6123, 8366, 1713, 2303, 1434, 4228, 7053, 8429, 6763, 3452, 1493, 2303, 218, 8593, 1898]]
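
Repeated numbers in the output (2303 appears several times in both sentences) are the same word mapped to the same index. The sketch below illustrates the general hashing idea only. It is not Keras's implementation: `hashlib.md5` is swapped in so that results are reproducible, and the regex tokenizer is an assumption.

```python
import hashlib
import re

def hashed_indices(text, n):
    """Map each word to an integer in [1, n - 1] using a stable hash."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [
        int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for w in words
    ]

ids = hashed_indices("A paper is a paper.", 10000)
print(ids)   # positions 0 and 3 match, as do positions 1 and 4

# With a very small index space, distinct words are forced to collide.
tiny = hashed_indices("alpha beta gamma", 3)
print(tiny)  # three words, but only the values 1 and 2 are available
```

Because the mapping is a hash rather than a learned vocabulary, two different words can share an index. A larger `VOCAB_SIZE` makes that less likely but does not rule it out.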


In [109]:
#Padding the Vectors
MAXLEN = 50

paddedVecs = tf.keras.utils.pad_sequences(oneHot,padding="pre",maxlen=MAXLEN)
print(paddedVecs[:2])

[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0  630 1988 8608 1478 6902 2303 5394 1619 1853 7177
  1478 6432 9512 1035 2303 5146 6685 1582]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0 3799 8962 2303 6123 8366 1713 2303 1434 4228 7053
  8429 6763 3452 1493 2303  218 8593 1898]]
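
With `padding="pre"`, zeros are added on the left until every sequence has `MAXLEN` entries. By default, sequences longer than `maxlen` are also truncated from the front. A plain-Python sketch of that behaviour for a single sequence (`pad_pre` is a hypothetical helper name, and `maxlen` is assumed to be positive):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if too long, keep only the last `maxlen` items."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([630, 1988, 8608], 5))      # [0, 0, 630, 1988, 8608]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

Index 0 is reserved for padding here, which fits with word indices starting at 1.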


In [114]:
#Embedding Matrix
DIMENSION = 100

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(VOCAB_SIZE,DIMENSION,input_length=MAXLEN))
model.compile(optimizer="adam",loss="mse")
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 50, 100)           1000000   
                                                                 
Total params: 1,000,000
Trainable params: 1,000,000
Non-trainable params: 0
_________________________________________________________________


In [118]:
embeddedVecs = model.predict(paddedVecs)
print(embeddedVecs[0].shape)
# 50 = sequence length after padding (MAXLEN)
# 100 = embedding dimension (DIMENSION)

(50, 100)
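
An embedding layer is, in effect, a lookup table: a matrix with one row per index, from which the rows for the input indices are selected. Since the model here is never fitted, `predict` returns the layer's initial random weights. The NumPy sketch below mimics the lookup with a random stand-in matrix. The uniform range and the example indices are assumptions, not values read from the Keras layer.

```python
import numpy as np

VOCAB_SIZE, DIMENSION, MAXLEN = 10000, 100, 50

rng = np.random.default_rng(0)
# Stand-in for the layer's weights: one row of DIMENSION floats per index.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(VOCAB_SIZE, DIMENSION))

# One pre-padded sequence of MAXLEN integer indices (hypothetical values).
padded = np.zeros(MAXLEN, dtype=int)
padded[-3:] = [630, 1988, 8608]

# Integer-array indexing selects one matrix row per position.
embedded = embedding_matrix[padded]
print(embedded.shape)         # (50, 100)
print(embedding_matrix.size)  # 1000000
```

The matrix holds `VOCAB_SIZE * DIMENSION` values, which matches the 1,000,000 parameters reported by `model.summary()`.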
