### Tokenization
##### Given a character sequence and a defined document unit, tokenization is the task of slicing it up into pieces, called tokens, while throwing away certain characters, such as punctuation.

In [1]:
import nltk # Natural Language ToolKit

In [2]:
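nltk.download() # opens the interactive NLTK downloader; the cells below need the tokenizer models ('punkt'), the 'stopwords' corpus and 'wordnet'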

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
# A general paragraph on Inflation
para="""Inflation is a quantitative measure of the rate at which the average price level of a basket of selected goods and services in an economy increases over some period of time. 
It is the rise in the general level of prices where a unit of currency effectively buys less than it did in prior periods. 
Often expressed as a percentage, inflation thus indicates a decrease in the purchasing power of a nation’s currency.
As prices rise, a single unit of currency loses value as it buys fewer goods and services. 
This loss of purchasing power impacts the general cost of living for the common public which ultimately leads to a deceleration in economic growth. 
The consensus view among economists is that sustained inflation occurs when a nation's money supply growth outpaces economic growth.
To combat this, a country's appropriate monetary authority, like the central bank, then takes the necessary measures to keep inflation within permissible limits and keep the economy running smoothly.
Inflation is measured in a variety of ways depending upon the types of goods and services considered and is the opposite of deflation which indicates a general decline occurring in prices for goods and services when the inflation rate falls below 0%. 
"""

In [4]:
sentences = nltk.sent_tokenize(para) #gives a list of sentences in the paragraph
print(sentences)

['Inflation is a quantitative measure of the rate at which the average price level of a basket of selected goods and services in an economy increases over some period of time.', 'It is the rise in the general level of prices where a unit of currency effectively buys less than it did in prior periods.', 'Often expressed as a percentage, inflation thus indicates a decrease in the purchasing power of a nation’s currency.', 'As prices rise, a single unit of currency loses value as it buys fewer goods and services.', 'This loss of purchasing power impacts the general cost of living for the common public which ultimately leads to a deceleration in economic growth.', "The consensus view among economists is that sustained inflation occurs when a nation's money supply growth outpaces economic growth.", "To combat this, a country's appropriate monetary authority, like the central bank, then takes the necessary measures to keep inflation within permissible limits and keep the economy running smoo

In [5]:
words = nltk.word_tokenize(para) #gives a list of all the words used in the paragraph.
print(words)

['Inflation', 'is', 'a', 'quantitative', 'measure', 'of', 'the', 'rate', 'at', 'which', 'the', 'average', 'price', 'level', 'of', 'a', 'basket', 'of', 'selected', 'goods', 'and', 'services', 'in', 'an', 'economy', 'increases', 'over', 'some', 'period', 'of', 'time', '.', 'It', 'is', 'the', 'rise', 'in', 'the', 'general', 'level', 'of', 'prices', 'where', 'a', 'unit', 'of', 'currency', 'effectively', 'buys', 'less', 'than', 'it', 'did', 'in', 'prior', 'periods', '.', 'Often', 'expressed', 'as', 'a', 'percentage', ',', 'inflation', 'thus', 'indicates', 'a', 'decrease', 'in', 'the', 'purchasing', 'power', 'of', 'a', 'nation', '’', 's', 'currency', '.', 'As', 'prices', 'rise', ',', 'a', 'single', 'unit', 'of', 'currency', 'loses', 'value', 'as', 'it', 'buys', 'fewer', 'goods', 'and', 'services', '.', 'This', 'loss', 'of', 'purchasing', 'power', 'impacts', 'the', 'general', 'cost', 'of', 'living', 'for', 'the', 'common', 'public', 'which', 'ultimately', 'leads', 'to', 'a', 'deceleration', '

In [6]:
len(words)

222
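
To make the definition at the top of this section concrete, here is a minimal sketch that tokenizes with one regular expression from the standard library. The helper name `simple_tokenize` and its pattern are assumptions made for this illustration; this is not how `nltk.word_tokenize` works internally, and unlike NLTK (which returns punctuation marks as separate tokens) it discards punctuation entirely.

```python
import re

def simple_tokenize(text):
    """Return the word tokens in text, dropping punctuation and whitespace."""
    # A token is a run of letters/digits, optionally followed by an
    # apostrophe suffix such as 's (illustrative pattern, not NLTK's).
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

sample = "As prices rise, a single unit of currency loses value."
tokens = simple_tokenize(sample)
print(tokens)
print(len(tokens))
```

A regular expression like this is fine for a quick look at a text, but it makes no attempt to handle abbreviations, hyphenation or Unicode apostrophes, which is why the notebook relies on NLTK.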

### Stemming
###### Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base form correctly most of the time, and often includes the removal of derivational affixes.
###### The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter's algorithm (Porter, 1980). Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix. 
![image.png](attachment:image.png)
###### Many of the later rules use a concept of the measure of a word, which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of a word. For example, the rule:

#### (m>1)    EMENT    →    (empty)
###### would map replacement to replac, but not cement to c. The official site for the Porter Stemmer is:
###### http://www.tartarus.org/~martin/PorterStemmer/
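
The measure and the `(m>1) EMENT` rule described above can be sketched in a few lines. This is a simplified reading of Porter's definition, in which a stem is written as `[C](VC)^m[V]` and `m` counts the vowel-to-consonant transitions. The function names `is_consonant`, `measure` and `strip_ement` are hypothetical, and only this single rule is implemented, not the full five-phase algorithm.

```python
def is_consonant(word, i):
    """True if word[i] acts as a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel when it follows a consonant, a consonant otherwise
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences in stem: the m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        previous_was_vowel = not consonant
    return m

def strip_ement(word):
    """Apply only the rule (m>1) EMENT -> empty string."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word

for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```

For `replacement` the candidate stem `replac` has measure 2, so the suffix is removed; for `cement` the candidate stem `c` has measure 0, so the word is left alone.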

In [7]:
from nltk.stem import PorterStemmer 

In [8]:
from nltk.corpus import stopwords 
# Stop words are very common words (articles, pronouns, prepositions, auxiliaries) that hold a sentence together but carry little meaning on their own
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [9]:
stemmer=PorterStemmer() # create the stemmer; stemmer.stem(word) returns the stem of a word

In [10]:
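stop_words = set(stopwords.words("english"))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # the stop-word test is case-sensitive, so capitalised tokens such as "It" and "As" are kept
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # caution: ''.join glues the stems together with no separator (' '.join would keep them apart)
    sentences[i] = ''.join(words)  # this overwrites the original sentence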
    

In [11]:
sentences # each sentence now holds its words in stemmed form, run together because of ''.join

['inflatquantitmeasurrateaveragpricelevelbasketselectgoodserviceconomiincreasperiodtime.',
 'Itrisegenerlevelpriceunitcurrenceffectbuylesspriorperiod.',
 'oftenexpresspercentag,inflatthuindicdecreaspurchaspowernation’currenc.',
 'Aspricerise,singlunitcurrenclosevalubuyfewergoodservic.',
 'thilosspurchaspowerimpactgenercostlivecommonpublicultimleaddecelereconomgrowth.',
 "theconsensuviewamongeconomistsustaininflatoccurnation'smoneysuppligrowthoutpaceconomgrowth.",
 "Tocombat,countri'sapproprimonetariauthor,likecentralbank,takenecessarimeasurkeepinflatwithinpermisslimitkeepeconomirunsmoothli.",
 'inflatmeasurvarietiwaydependupontypegoodservicconsidoppositdeflatindicgenerdeclinoccurpricegoodservicinflatratefall0%.']
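
The strings above are hard to read because the stems are joined with an empty string, and words such as "It" and "As" survive because the stop-word test is case-sensitive. One possible variant is sketched below using only the standard library so that it runs without NLTK data. `STOP_WORDS`, `toy_stem` and `normalize` are hypothetical stand-ins: in the notebook, `set(stopwords.words("english"))` and `stemmer.stem` would take the place of the first two.

```python
import re

# Hypothetical, deliberately tiny stop-word list (stand-in for NLTK's).
STOP_WORDS = {"a", "as", "it", "of", "the", "and", "is", "in"}

def toy_stem(word):
    """Placeholder transform: strips a trailing 's'. Not the Porter algorithm."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(sentence, transform=toy_stem, stop_words=STOP_WORDS):
    """Lowercase, drop stop words, transform each token, join with spaces."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    kept = [transform(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)

original = [
    "As prices rise, a single unit of currency loses value.",
    "It is the rise in the general level of prices.",
]
# Write the result to a new list so the original sentences stay available.
stemmed = [normalize(s) for s in original]
print(stemmed)
```

Keeping the untouched sentences in their own list matters here, because a later step that wants to lemmatize the original text can then still read it.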

### Lemmatization
###### Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
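
The contrast with stemming can be illustrated with a toy example. The mini-vocabulary `LEMMAS` and the helpers `crude_stem` and `toy_lemmatize` below are hypothetical and exist only for this sketch; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults a full dictionary together with morphological rules.

```python
# Hypothetical mini-vocabulary mapping inflected forms to their lemmas.
LEMMAS = {"services": "service", "economies": "economy", "bought": "buy"}

def crude_stem(word):
    """Chop a common suffix with no vocabulary check (stand-in for a stemmer)."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the vocabulary; unknown words come back unchanged."""
    return LEMMAS.get(word, word)

for w in ["services", "economies", "bought"]:
    print(f"{w}: stem={crude_stem(w)} lemma={toy_lemmatize(w)}")
```

The suffix chopper can return fragments that are not words, while the dictionary lookup returns valid base forms and can also handle irregular forms, at the cost of needing a vocabulary.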

In [12]:
from nltk import WordNetLemmatizer

In [13]:
lemmatizer=WordNetLemmatizer()

In [14]:
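# Caution: the stemming loop above overwrote `sentences`, so each entry is already a single
# run-together string of stems; the lemmatizer finds no dictionary entry for it and returns it unchanged.
# Re-run sentences = nltk.sent_tokenize(para) first to lemmatize the original text.
stop_words = set(stopwords.words("english"))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ''.join(words)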

In [15]:
sentences

['inflatquantitmeasurrateaveragpricelevelbasketselectgoodserviceconomiincreasperiodtime.',
 'Itrisegenerlevelpriceunitcurrenceffectbuylesspriorperiod.',
 'oftenexpresspercentag,inflatthuindicdecreaspurchaspowernation’currenc.',
 'Aspricerise,singlunitcurrenclosevalubuyfewergoodservic.',
 'thilosspurchaspowerimpactgenercostlivecommonpublicultimleaddecelereconomgrowth.',
 "theconsensuviewamongeconomistsustaininflatoccurnation'smoneysuppligrowthoutpaceconomgrowth.",
 "Tocombat,countri'sapproprimonetariauthor,likecentralbank,takenecessarimeasurkeepinflatwithinpermisslimitkeepeconomirunsmoothli.",
 'inflatmeasurvarietiwaydependupontypegoodservicconsidoppositdeflatindicgenerdeclinoccurpricegoodservicinflatratefall0%.']

## Word embedding approaches
#### 1) Bag of Words
#### 2) TF-IDF
#### 3) Word2Vec

### -> Bag of Words

![image.png](attachment:image.png)
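
As a rough sketch of the idea, a bag-of-words model can be built by hand with the standard library: collect the distinct words into a sorted vocabulary, then count how often each vocabulary word occurs in each document. The two short documents below are made up for the illustration, and the variable names are assumptions; scikit-learn's `CountVectorizer`, used later in this notebook, additionally lowercases the text and applies its own tokenization rules.

```python
from collections import Counter

# Two made-up, already cleaned documents.
docs = ["price rise unit currency", "inflation price growth growth"]

tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct words in the corpus.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary word, cells hold counts.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is lost in this representation; only how often each word occurs in each document is kept.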

In [16]:
import re #importing regular expressions

In [17]:
ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(para)

In [18]:
corpus = []  # cleaned sentences: lowercased, stop words removed, remaining words lemmatized
count = 0  # total number of words left in the corpus after removing the stop words
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # replace every non-letter with a space
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    count += len(review)
    review = ' '.join(review)
    corpus.append(review)

In [19]:
corpus

['inflation quantitative measure rate average price level basket selected good service economy increase period time',
 'rise general level price unit currency effectively buy le prior period',
 'often expressed percentage inflation thus indicates decrease purchasing power nation currency',
 'price rise single unit currency loses value buy fewer good service',
 'loss purchasing power impact general cost living common public ultimately lead deceleration economic growth',
 'consensus view among economist sustained inflation occurs nation money supply growth outpaces economic growth',
 'combat country appropriate monetary authority like central bank take necessary measure keep inflation within permissible limit keep economy running smoothly',
 'inflation measured variety way depending upon type good service considered opposite deflation indicates general decline occurring price good service inflation rate fall']

In [20]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus)
print(cv.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

['among', 'appropriate', 'authority', 'average', 'bank', 'basket', 'buy', 'central', 'combat', 'common', 'consensus', 'considered', 'cost', 'country', 'currency', 'deceleration', 'decline', 'decrease', 'deflation', 'depending', 'economic', 'economist', 'economy', 'effectively', 'expressed', 'fall', 'fewer', 'general', 'good', 'growth', 'impact', 'increase', 'indicates', 'inflation', 'keep', 'le', 'lead', 'level', 'like', 'limit', 'living', 'loses', 'loss', 'measure', 'measured', 'monetary', 'money', 'nation', 'necessary', 'occurring', 'occurs', 'often', 'opposite', 'outpaces', 'percentage', 'period', 'permissible', 'power', 'price', 'prior', 'public', 'purchasing', 'quantitative', 'rate', 'rise', 'running', 'selected', 'service', 'single', 'smoothly', 'supply', 'sustained', 'take', 'thus', 'time', 'type', 'ultimately', 'unit', 'upon', 'value', 'variety', 'view', 'way', 'within']


In [21]:
print(count) # Total number of words in the corpus after stop-word removal
print(len(cv.get_feature_names_out())) # Vocabulary size: the number of distinct words in the corpus

118
84


In [22]:
X = cv.fit_transform(corpus).toarray()
# fit_transform learns the vocabulary and returns the document-term matrix.
# This is equivalent to fit followed by transform, but more efficiently implemented.
print(X)
print(len(X[0]))
# Each row has 84 entries, one per feature name, holding how many times that word occurs in the sentence

[[0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0
  0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0
  0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0
  0 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0
  1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 2 0 0 0 1 0 0
  0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0

### TF-IDF: Term Frequency-Inverse Document Frequency
#### TF = Number of occurrences of a word in a document / Number of words in the document
#### IDF = log( Number of Documents / Number of documents containing word )
#### TF-IDF = TF * IDF
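
The three formulas above can be worked through by hand on a tiny corpus. The documents and the function names `tf`, `idf` and `tf_idf` below are made up for this sketch, which follows the formulas exactly as written. scikit-learn's `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF, and L2-normalizes every row, so its numbers will not match this sketch.

```python
import math
from collections import Counter

# Three made-up documents, already tokenized.
docs = [
    ["inflation", "price", "rise"],
    ["price", "currency"],
    ["inflation", "growth", "growth"],
]

def tf(term, doc):
    """Occurrences of term in doc divided by the number of words in doc."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(number of documents / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term, doc in [("growth", docs[2]), ("price", docs[0])]:
    print(term, round(tf_idf(term, doc, docs), 4))
```

"growth" appears twice in one document and nowhere else, so it scores higher than "price", which appears once and is shared by two of the three documents.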

In [23]:
ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(para)

In [24]:
corpus = []
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # replace every non-letter with a space
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [25]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=1500)  # keep at most the 1500 most frequent terms; this small corpus has far fewer
X = cv.fit_transform(corpus).toarray()

In [26]:
print(X)

[[0.         0.         0.         0.29781205 0.         0.29781205
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.24958974 0.
  0.         0.         0.         0.         0.21537547 0.
  0.         0.29781205 0.         0.16715316 0.         0.
  0.         0.24958974 0.         0.         0.         0.
  0.         0.24958974 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.24958974 0.         0.         0.18883682 0.
  0.         0.         0.29781205 0.24958974 0.         0.
  0.29781205 0.21537547 0.         0.         0.         0.
  0.         0.         0.29781205 0.         0.         0.
  0.         0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.29704987 0.         0.         0.         0.         0.
  0.         0.        

### Word2Vec

#### Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.

#### More on this in the next notebook.