# **Getting Started with NLP**

---

In [1]:
# Standard library
import datetime
import random
import re
import warnings
import zipfile

# Third-party
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Download the NLTK resources into the writable Kaggle directory and register it
nltk.download('punkt', download_dir="/kaggle/working/")
nltk.download('wordnet', download_dir="/kaggle/working/")
nltk.download('stopwords', download_dir="/kaggle/working/")
nltk.data.path.append('/kaggle/working/')

# The wordnet archive is not unzipped by the download step here, so extract it manually
with zipfile.ZipFile("/kaggle/working/corpora/wordnet.zip", 'r') as zip_f:
    zip_f.extractall("/kaggle/working/corpora/")

warnings.filterwarnings("ignore")
pd.plotting.register_matplotlib_converters()
%matplotlib inline
plt.style.use('dark_background')


[nltk_data] Downloading package punkt to /kaggle/working/...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /kaggle/working/...
[nltk_data] Downloading package stopwords to /kaggle/working/...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# Stephen Hawking: "Questioning the Universe" speech (corpus)
speech = """There is nothing bigger or older than the universe. 
The questions I would like to talk about are: one, where did we come from?
How did the universe come into being? Are we alone in the universe?
Is there alien life out there? What is the future of the human race?
Up until the 1920s, everyone thought the universe was essentially static and unchanging in time.
Then it was discovered that the universe was expanding. Distant galaxies were moving away from us.
This meant they must have been closer together in the past. If we extrapolate back, 
we find we must have all been on top of each other about 15 billion years ago. 
This was the Big Bang, the beginning of the universe. But was there anything before the Big Bang?
If not, what created the universe? Why did the universe emerge from the Big Bang the way it did?
We used to think that the theory of the universe could be divided into two parts. 
First, there were the laws like Maxwell’s equations and general relativity that determined the
evolution of the universe, given its state over all of the space at one time. And second, 
there was the question of the initial state of the universe. We have made good progress on the
first part, and now have the knowledge of the laws of evolution in all but the most extreme conditions.
But until recently, we have had little idea about the initial conditions for the universe."""

---

# 1. Tokenization
### 1.1 Sentence Tokenization 

In [3]:
sentences = nltk.sent_tokenize(speech)
sentences

['There is nothing bigger or older than the universe.',
 'The questions I would like to talk about are: one, where did we come from?',
 'How did the universe come into being?',
 'Are we alone in the universe?',
 'Is there alien life out there?',
 'What is the future of the human race?',
 'Up until the 1920s, everyone thought the universe was essentially static and unchanging in time.',
 'Then it was discovered that the universe was expanding.',
 'Distant galaxies were moving away from us.',
 'This meant they must have been closer together in the past.',
 'If we extrapolate back, \nwe find we must have all been on top of each other about 15 billion years ago.',
 'This was the Big Bang, the beginning of the universe.',
 'But was there anything before the Big Bang?',
 'If not, what created the universe?',
 'Why did the universe emerge from the Big Bang the way it did?',
 'We used to think that the theory of the universe could be divided into two parts.',
 'First, there were the laws like 

### 1.2 Word Tokenization 

In [4]:
words = nltk.word_tokenize(speech)
len(words)

283

In [5]:
# Print a random contiguous slice of the tokens, separated by " / "
start, stop = random.randint(1, 50), random.randint(50, 100)
for i in range(start, stop):
    print(words[i], end=" / ")

? / How / did / the / universe / come / into / being / ? / Are / we / alone / in / the / universe / ? / Is / there / alien / life / out / there / ? / What / is / the / future / of / the / human / race / ? / Up / until / the / 1920s / , / everyone / thought / the / universe / was / essentially / static / and / unchanging / in / time / . / Then / it / was / discovered / that / the / universe / 

---

# 2. Stemming vs Lemmatization

### 2.1 Stemming

In [6]:
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
stemmer = nltk.PorterStemmer()
stemmedSentences = []

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Strip every non-alphanumeric character and lowercase.
    # Punctuation tokens become "" and are kept, hence the double spaces in the output.
    words = [re.sub("[^a-zA-Z0-9]", "", w).lower() for w in words]
    # Drop stopwords, then reduce each remaining token to its Porter stem
    words = [stemmer.stem(w) for w in words if w not in stopwords]
    stemmedSentences.append(" ".join(words))
    print(stemmedSentences[i])

noth bigger older univers 
question would like talk  one  come 
univers come 
alon univers 
alien life 
futur human race 
1920  everyon thought univers essenti static unchang time 
discov univers expand 
distant galaxi move away us 
meant must closer togeth past 
extrapol back  find must top 15 billion year ago 
big bang  begin univers 
anyth big bang 
 creat univers 
univers emerg big bang way 
use think theori univers could divid two part 
first  law like maxwel  equat gener rel determin evolut univers  given state space one time 
second  question initi state univers 
made good progress first part  knowledg law evolut extrem condit 
recent  littl idea initi condit univers 
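
The double spaces in the output above come from punctuation tokens that the regex reduces to empty strings, which then pass the stopword filter. A minimal sketch of one way to avoid that, assuming a tiny hypothetical stopword set and a caller-supplied `normalize` function (for example `stemmer.stem`). `clean_tokens` is an illustrative helper, not part of NLTK:

```python
import re

def clean_tokens(tokens, stopwords, normalize=lambda t: t):
    """Lowercase, strip non-alphanumerics, drop empties and stopwords, then normalize."""
    cleaned = []
    for token in tokens:
        token = re.sub(r"[^a-zA-Z0-9]", "", token).lower()
        # Skip tokens that were pure punctuation as well as stopwords
        if not token or token in stopwords:
            continue
        cleaned.append(normalize(token))
    return cleaned

# Hypothetical stopword set, used only for this illustration
demo_stopwords = {"the", "of", "was", "this"}
demo_tokens = ["This", "was", "the", "Big", "Bang", ",", "the",
               "beginning", "of", "the", "universe", "."]

result = " ".join(clean_tokens(demo_tokens, demo_stopwords))
print(result)  # big bang beginning universe
```

Passing `normalize=stemmer.stem` or `normalize=lemmatizer.lemmatize` would reuse the same cleaning step for both sections.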


### 2.2 Lemmatization 

In [8]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizedSentences = []

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Strip every non-alphanumeric character and lowercase.
    # Punctuation tokens become "" and are kept, hence the double spaces in the output.
    words = [re.sub("[^a-zA-Z0-9]", "", w).lower() for w in words]
    # Drop stopwords, then lemmatize. lemmatize() defaults to pos="n" (noun),
    # so verb forms such as "expanding" are left unchanged.
    words = [lemmatizer.lemmatize(w) for w in words if w not in stopwords]
    lemmatizedSentences.append(" ".join(words))
    print(lemmatizedSentences[i])

nothing bigger older universe 
question would like talk  one  come 
universe come 
alone universe 
alien life 
future human race 
1920s  everyone thought universe essentially static unchanging time 
discovered universe expanding 
distant galaxy moving away u 
meant must closer together past 
extrapolate back  find must top 15 billion year ago 
big bang  beginning universe 
anything big bang 
 created universe 
universe emerge big bang way 
used think theory universe could divided two part 
first  law like maxwell  equation general relativity determined evolution universe  given state space one time 
second  question initial state universe 
made good progress first part  knowledge law evolution extreme condition 
recently  little idea initial condition universe 


### 2.3 Comparing Stemming vs Lemmatization

In [9]:
print(stemmedSentences[6])
print(lemmatizedSentences[6])

1920  everyon thought univers essenti static unchang time 
1920s  everyone thought universe essentially static unchanging time 
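
The two lines above can also be compared token by token. A minimal sketch, with both strings copied from the output above. It lists only the positions where the stemmer and the lemmatizer disagree:

```python
stemmed = "1920  everyon thought univers essenti static unchang time"
lemmatized = "1920s  everyone thought universe essentially static unchanging time"

# str.split() with no argument collapses runs of whitespace,
# so the double spaces left by removed punctuation do not matter here.
pairs = list(zip(stemmed.split(), lemmatized.split()))
differing = [(s, l) for s, l in pairs if s != l]

for s, l in differing:
    print(f"{s:<10} -> {l}")
```

Five of the eight tokens differ. The stems (`univers`, `essenti`) are truncated forms that need not be real words, while the lemmatizer keeps dictionary forms.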


---

# 3. Vectorization
### 3.1 Bag of Words (CountVectorizer)
- sklearn.feature_extraction.text.CountVectorizer

In [10]:
# Frequency-based bag of words
import sklearn.feature_extraction.text  # load the submodule explicitly
countVectorizer = sklearn.feature_extraction.text.CountVectorizer(max_features=2000)
X = countVectorizer.fit_transform(lemmatizedSentences).toarray()
X.shape
# 20 = number of sentences
# 78 = number of features

(20, 78)

In [11]:
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
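
To make the counts concrete, here is a minimal pure-Python sketch of what a count-based bag of words computes. It assumes a token pattern of two or more word characters, which matches the default `token_pattern` of `CountVectorizer`, so one-letter tokens such as `u` never become features. `bag_of_words` is an illustrative helper, and the three documents are lemmatized sentences from the output earlier:

```python
import re
from collections import Counter

# Tokens of two or more word characters (assumed to mirror CountVectorizer's default)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(documents):
    """Return (sorted vocabulary, count matrix with one row per document)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["universe come", "alone universe", "distant galaxy moving away u"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary term in alphabetical order, and each row to one sentence, which is the same layout as the `(20, 78)` array above.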


### 3.2 TF-IDF (Term Frequency * Inverse Document Frequency)
 - sklearn.feature_extraction.text.TfidfVectorizer

In [12]:
tfidfVectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_features=2000)
X = tfidfVectorizer.fit_transform(lemmatizedSentences).toarray()
X.shape

(20, 78)

In [13]:
print(X)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.44321297 0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
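
The weights above can be reproduced by hand on a small input. A minimal sketch, assuming the smoothed formula that `TfidfVectorizer` uses with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. `tfidf_rows` is an illustrative helper, and the two documents are lemmatized sentences from earlier, already split into tokens:

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (sorted vocabulary, L2-normalized TF-IDF rows) for pre-tokenized documents."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed inverse document frequency (assumed default behaviour)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

docs = [["universe", "come"], ["alone", "universe"]]
vocabulary, rows = tfidf_rows(docs)

for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:<10} {weight:.4f}")
```

In the first document, `come` gets a higher weight than `universe` because `universe` appears in both documents and so carries less information about any single one.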
