<a href="https://www.kaggle.com/code/iqmansingh/getting-started-with-nlp?scriptVersionId=135478047" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Getting Started with NLP**

---

In [176]:
import datetime
import random
import re
import warnings
import zipfile

# Silence library warnings for the rest of the notebook.
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Download the NLTK resources into the writable Kaggle directory and register it.
nltk.download('punkt', download_dir="/kaggle/working/")
nltk.download('wordnet', download_dir="/kaggle/working/")
nltk.download('stopwords', download_dir="/kaggle/working/")
nltk.data.path.append('/kaggle/working/')

# Extract the WordNet archive so the corpus is also available as a plain directory.
with zipfile.ZipFile("/kaggle/working/corpora/wordnet.zip", 'r') as zip_f:
    zip_f.extractall("/kaggle/working/corpora/")

pd.plotting.register_matplotlib_converters()
%matplotlib inline
plt.style.use('dark_background')

[nltk_data] Downloading package punkt to /kaggle/working/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /kaggle/working/...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /kaggle/working/...
[nltk_data]   Package stopwords is already up-to-date!


In [177]:
df = pd.read_csv("/kaggle/input/ted-ultimate-dataset/2020-05-01/ted_talks_en.csv")
df.sort_values(by="views",ascending=False,inplace=True)

In [178]:
# Tim Urban: "Inside the mind of a master procrastinator" speech
speech1 = df.iloc[6].transcript

In [179]:
# Stephen Hawking: "Questioning the Universe" speech
speech2 = """There is nothing bigger or older than the universe. 
The questions I would like to talk about are: one, where did we come from?
How did the universe come into being? Are we alone in the universe?
Is there alien life out there? What is the future of the human race?
Up until the 1920s, everyone thought the universe was essentially static and unchanging in time.
Then it was discovered that the universe was expanding. Distant galaxies were moving away from us.
This meant they must have been closer together in the past. If we extrapolate back, 
we find we must have all been on top of each other about 15 billion years ago. 
This was the Big Bang, the beginning of the universe. But was there anything before the Big Bang?
If not, what created the universe? Why did the universe emerge from the Big Bang the way it did?
We used to think that the theory of the universe could be divided into two parts. 
First, there were the laws like Maxwell’s equations and general relativity that determined the
evolution of the universe, given its state over all of the space at one time. And second, 
there was no question of the initial state of the universe. We have made good progress on the
first part, and now have the knowledge of the laws of evolution in all but the most extreme conditions.
But until recently, we have had little idea about the initial conditions for the universe."""

---

# 1. Tokenization
### 1.1 Sentence Tokenization 

In [193]:
sentences = nltk.sent_tokenize(speech2)
sentences[:5]

['There is nothing bigger or older than the universe.',
 'The questions I would like to talk about are: one, where did we come from?',
 'How did the universe come into being?',
 'Are we alone in the universe?',
 'Is there alien life out there?']
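To get a feel for what sentence tokenization has to do, here is a minimal sketch of a rule-based splitter. It is not how `nltk.sent_tokenize` works internally (NLTK uses the trained Punkt model); `naive_sent_tokenize` is a hypothetical helper that simply splits on whitespace following `.`, `!` or `?`.

```python
import re

def naive_sent_tokenize(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough rule, not Punkt)."""
    # Collapse line breaks first so a newline inside a sentence does not matter.
    flat = " ".join(text.split())
    return re.split(r"(?<=[.!?])\s+", flat)

sample = (
    "There is nothing bigger or older than the universe. "
    "The questions I would like to talk about are: one, where did we come from? "
    "How did the universe come into being?"
)
for sentence in naive_sent_tokenize(sample):
    print(sentence)

# The simple rule breaks on abbreviations, which is why a trained model is preferable.
print(naive_sent_tokenize("Dr. Hawking spoke."))
```

On this sample the rule gives the same three sentences as the NLTK output above, but the last call shows it wrongly splitting after `Dr.`.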

In [194]:
sentences = nltk.sent_tokenize(speech1)
sentences[:5]

['So in college, I was a government major, which means I had to write a lot of papers.',
 'Now, when a normal student writes a paper, they might spread the work out a little like this.',
 'So, you know — (Laughter) you get started maybe a little slowly, but you get enough done in the first week that, with some heavier days later on, everything gets done, things stay civil.',
 '(Laughter) And I would want to do that like that.',
 'That would be the plan.']

### 1.2 Word Tokenization 

In [199]:
words = nltk.word_tokenize(speech1)
len(words)

2769

In [200]:
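# Print a random window of tokens: start index in 1..50, stop index (exclusive) in 100..200.
start, stop = random.randint(1, 50), random.randint(100, 200)
print(" ".join(words[start:stop]))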

a lot of papers . Now , when a normal student writes a paper , they might spread the work out a little like this . So , you know — ( Laughter ) you get started maybe a little slowly , but you get enough done in the first week that , with some heavier days later on , everything gets done , things stay civil . ( Laughter ) And I would want to do that like that . That would be the plan . I would have it all ready to go , but then , actually , the paper would come along , and then I would kind of do this . ( Laughter ) And that would happen every single paper . But then came my 90-page senior thesis , a paper you 're supposed to spend a year on . And I knew for 
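The output above shows punctuation coming out as separate tokens. A rough approximation of that behaviour, assuming a plain regular expression instead of NLTK's tokenizer, can be sketched as follows (`naive_word_tokenize` is a hypothetical helper, not part of NLTK):

```python
import re

def naive_word_tokenize(sentence):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(naive_word_tokenize("That would be the plan."))
print(naive_word_tokenize("(Laughter) And I would want to do that like that."))

# Contractions are where this sketch and NLTK differ:
# NLTK yields "you" and "'re", while the regex yields three tokens.
print(naive_word_tokenize("you're"))
```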

---

# 2. Stemming vs Lemmatization

### 2.1 Stemming

In [184]:
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [201]:
stemmer = nltk.PorterStemmer()
stemmedSentences = []

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    # Keep only letters and digits, then lowercase. Punctuation-only tokens
    # become empty strings, which is why the joined output has doubled spaces.
    tokens = [re.sub("[^a-zA-Z0-9]", "", token).lower() for token in tokens]
    # Drop stop words, then stem what is left.
    tokens = [stemmer.stem(token) for token in tokens if token not in stopwords]
    stemmedSentences.append(" ".join(tokens))
stemmedSentences[:5]

['colleg  govern major  mean write lot paper ',
 ' normal student write paper  might spread work littl like ',
 ' know   laughter  get start mayb littl slowli  get enough done first week  heavier day later  everyth get done  thing stay civil ',
 ' laughter  would want like ',
 'would plan ']
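The doubled and leading spaces in this output come from punctuation tokens that were reduced to empty strings but still joined. A minimal sketch of a cleaning step that drops them is shown here; `clean_tokens` and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration (the notebook itself uses NLTK's full English stop-word list), and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
STOP_WORDS = {"so", "in", "i", "was", "a", "which", "had", "to", "of"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    cleaned = []
    for token in tokens:
        # Strip everything except letters and digits, then lowercase.
        token = re.sub(r"[^a-zA-Z0-9]", "", token).lower()
        # Skip tokens that became empty (pure punctuation) and stop words.
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned

tokens = ["So", "in", "college", ",", "I", "was", "a", "government", "major", "."]
print(" ".join(clean_tokens(tokens)))
```

Since the vectorizers used later ignore extra whitespace, this only affects how the intermediate strings look, not the resulting features.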

### 2.2 Lemmatization 

In [202]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizedSentences = []

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    # Keep only letters and digits, then lowercase. Punctuation-only tokens
    # become empty strings, which is why the joined output has doubled spaces.
    tokens = [re.sub("[^a-zA-Z0-9]", "", token).lower() for token in tokens]
    # Drop stop words, then lemmatize what is left (default part of speech: noun).
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stopwords]
    lemmatizedSentences.append(" ".join(tokens))
lemmatizedSentences[:5]

['college  government major  mean write lot paper ',
 ' normal student writes paper  might spread work little like ',
 ' know   laughter  get started maybe little slowly  get enough done first week  heavier day later  everything get done  thing stay civil ',
 ' laughter  would want like ',
 'would plan ']

### 2.3 Comparing Stemming vs Lemmatization

In [203]:
print(stemmedSentences[6])
print(lemmatizedSentences[6])

 laughter  would happen everi singl paper 
 laughter  would happen every single paper 
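To make the difference easier to spot, the two strings printed above can be compared token by token. This is a small sketch; `differing_tokens` is a hypothetical helper, and it assumes both strings contain the same number of tokens, which holds here because both pipelines process the same tokens one-to-one.

```python
# The two outputs printed above, copied verbatim.
stemmed = " laughter  would happen everi singl paper "
lemmatized = " laughter  would happen every single paper "

def differing_tokens(a, b):
    # str.split() with no argument ignores repeated and leading/trailing spaces.
    return [(x, y) for x, y in zip(a.split(), b.split()) if x != y]

print(differing_tokens(stemmed, lemmatized))
```

The stemmer chops suffixes and can produce non-words such as `everi` and `singl`, while the lemmatizer returns dictionary forms.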


---

# 3. Vectorization
### 3.1 Bag of Words (CountVectorizer)
- sklearn.feature_extraction.text.CountVectorizer

In [204]:
# Frequency-based bag of words
from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer(max_features=2000)
X = countVectorizer.fit_transform(lemmatizedSentences).toarray()
# rows: number of sentences, columns: number of features (vocabulary size)
X.shape

(142, 482)

In [205]:
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
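The printed matrix is mostly zeros because each sentence uses only a few words of the vocabulary. What the counts mean can be sketched in plain Python; `bag_of_words` is a hypothetical helper that splits on whitespace, whereas `CountVectorizer` uses its own token pattern (by default it ignores single-character tokens), so treat this as an illustration rather than a reimplementation.

```python
from collections import Counter

def bag_of_words(docs):
    # Vocabulary: every distinct word, sorted alphabetically.
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # One column per vocabulary word; a missing word counts as 0.
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["would plan", "laughter would want like", "paper paper plan"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, which is the same layout as `X` above.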


### 3.2 TF-IDF (Term Frequency * Inverse Document Frequency)
- sklearn.feature_extraction.text.TfidfVectorizer

In [206]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVectorizer = TfidfVectorizer(max_features=2000)
X = tfidfVectorizer.fit_transform(lemmatizedSentences).toarray()
X.shape

(142, 482)

In [207]:
print(X)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
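The weights behind this matrix can be sketched in plain Python. The hypothetical `tfidf` helper below assumes the scikit-learn defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row) and splits on whitespace instead of using the vectorizer's token pattern, so it is an illustration under those assumptions rather than a drop-in replacement.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[w] * idf[w] for w in vocab]
        # Scale each row to unit Euclidean length (guard against empty documents).
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows

docs = ["would plan", "laughter would want like"]
vocab, matrix = tfidf(docs)
print(vocab)
for row in matrix:
    print([round(x, 3) for x in row])
```

In the first document, `plan` gets a higher weight than `would` because `would` appears in both documents and is therefore less distinctive.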
