#### Document Term Matrix
In this lab we walk through the preprocessing steps involved in constructing a document-term matrix. These steps are:
* Tokenize text
* Remove punctuation
* Remove stopwords
* Stem the words to root form
* Synonymy
* POS tagging

For this we build a sample document on Indian citizenship.

In [None]:
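import nltk
# The cells below need NLTK data (tokenizer models, stop words, WordNet, POS tagger);
# if a LookupError appears, fetch the named resource with nltk.download().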

In [None]:
# Sentence tokenization: break the text into sentences.

from nltk.tokenize import sent_tokenize

text="""India is my country! All indians are my brothers and (sisters); I love my country and am proud to be an Indian. 
Our nationality is Indian national. India has rich culture: Tiger is the National animal and peacock is national bird. 
Our nation is great. India@cross-roads / India is socialist nation. We nationalized 14 banks + railways. Indian society 
is very sociable. Every indian should discharge social tasks & duties. Nationalism is a halmark of india culture. India is culturally diverse. 
We are part of United Nations. Should we nationalize all natural assests? You should do your duty to your country."""

tokenized_sent=sent_tokenize(text)
print(tokenized_sent)
print(len(tokenized_sent))

In [None]:
# Word tokenization: break the text into tokens (words and punctuation marks, i.e. the candidate terms)

from nltk.tokenize import word_tokenize

tokenized_text=word_tokenize(text)
print(tokenized_text)
print(len(tokenized_text))

In [None]:
# obtaining frequency distributions using FreqDist() function

from nltk.probability import FreqDist
fdist = FreqDist(tokenized_text)
print(fdist)

In [None]:
# What are 5 most commonly used words?

fdist.most_common(5)
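
`FreqDist` behaves like a token-to-count mapping (in NLTK it subclasses `collections.Counter`). As a rough illustration of the counting step, the same tally can be sketched with the standard library alone. The list `sample_tokens` is a hypothetical, hand-made example and not the lab text. Note that counting is case-sensitive, so "India" and "india" would be tallied separately.

```python
from collections import Counter

# Hypothetical token list, used only for illustration.
sample_tokens = ["India", "is", "my", "country", "!", "I", "love", "my", "country", "."]

# Counter maps each distinct token to the number of times it occurs.
counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Punctuation and stop words are still counted at this stage, which is why they tend to dominate the top of the distribution until they are filtered out.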

In [None]:
# Frequency Distribution Plot

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]

fdist.plot(30,cumulative=False)

In [None]:
# List stop words

from nltk.corpus import stopwords

stop_words=set(stopwords.words("english"))
print(stop_words)

In [None]:
# Remove stop words (tokens are lower-cased for the comparison only)

filtered_text=[]
for w in tokenized_text:
    if w.lower() not in stop_words:
        filtered_text.append(w)

print("Tokenized text:",tokenized_text)
print('\n')
print("Filtered text:",filtered_text)
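
The loop above can also be written as a list comprehension. The sketch below assumes a tiny hand-made stop-word set (`demo_stop_words`) and token list (`demo_tokens`); both are hypothetical stand-ins so the example runs without NLTK data.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer.
demo_stop_words = {"is", "my", "and", "am", "to", "be", "an", "i"}

# Hypothetical tokens, used only for illustration.
demo_tokens = ["India", "is", "my", "country", "I", "love", "my", "country"]

# Lower-case each token for the membership test, but keep its original casing.
demo_filtered = [w for w in demo_tokens if w.lower() not in demo_stop_words]
print(demo_filtered)
```

Lower-casing only inside the test means capitalized stop words such as "I" are removed, while the surviving tokens keep their original form for later steps.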

In [None]:
# Remove punctuation tokens

puncts = ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '_', '+', '=', 
          '{', '}', '[', ']', '|', ':', ';', '"', "'", '<', '>', '?', '/', '\\', ',', '.']

nopunct_text=[]
for w in filtered_text:
    if w not in puncts:
        nopunct_text.append(w)

print("Tokenized text:",tokenized_text)
print('\n')
print("No Punctuation text:",nopunct_text)
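
The membership test above only drops tokens that exactly match one entry in `puncts`, so a multi-character token made of punctuation (for example "...") would pass through. One possible alternative, sketched here with the standard-library `string.punctuation` constant and a hypothetical token list, is to drop any token that consists entirely of punctuation characters.

```python
import string

# Hypothetical tokens, used only for illustration.
demo_tokens = ["India", "!", "cross-roads", "(", "sisters", ")", "...", "banks"]

# Keep a token unless every character in it is ASCII punctuation.
no_punct = [
    w for w in demo_tokens
    if not all(ch in string.punctuation for ch in w)
]
print(no_punct)
```

Tokens with punctuation inside a word, such as "cross-roads", are kept by this check; whether to split or strip them is a separate design decision.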


In [None]:
# Lexicon normalization
# Compare lemmatization and stemming on a single word

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))

In [None]:
# Stemming: reduce each remaining token to its Porter stem

from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_text=[]
for w in nopunct_text:
    stemmed_text.append(ps.stem(w))

print("No punctuation Text: ", nopunct_text)
print('\n')
print("Stemmed Text: ", stemmed_text)

In [None]:
# POS tagging: sample text to tag
sent="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
print(sent)

In [None]:
# Word tokenization (pos_tag expects a list of tokens)

tokens=nltk.word_tokenize(sent)
print(tokens)

In [None]:
nltk.pos_tag(tokens)

In [None]:
# Synonyms

from nltk.corpus import wordnet

In [None]:
syn = wordnet.synsets("program")
print(syn[0].name())
print(syn[0].lemmas()[0].name())
print(syn[0].definition())
print(syn[0].examples())

In [None]:
# Collect synonyms and antonyms of "good" across all of its WordNet synsets
synonyms = []
antonyms = []

for synset in wordnet.synsets("good"):
    for lemma in synset.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print('\n')
print(set(antonyms))
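
The preprocessing steps above yield a cleaned token list per document; what remains is to count terms per document. The following is a minimal sketch of that final step, assuming each document is already a list of preprocessed tokens. The function `build_dtm` and the sample `docs` are hypothetical names and data introduced for illustration; they are not part of NLTK.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is the count of
    vocabulary[j] in docs[i]. Each doc is a list of preprocessed tokens."""
    # Sorted vocabulary gives a stable column order.
    vocabulary = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical, already-preprocessed documents (lower-cased, stem-like forms).
docs = [
    ["india", "countri"],
    ["love", "countri", "proud", "indian"],
    ["india", "socialist", "nation"],
]

vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Rows correspond to documents and columns to terms. For larger corpora a sparse representation would usually be preferable to nested lists, since most entries are zero.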

##################### End of Lab ######################