# 1. Text Mining & NLP with NLTK
-----
Today, I am going to cover some NLP techniques. These are:

    1.1. Tokenization
    1.2. Stemming
    1.3. Lemmatization
    1.4. Stopword Removal
    1.5. Bag of Words models
    1.6. Parts of Speech (POS)
    1.7. Named Entity Recognition (NER)
    1.8. Chunking
   

## 1.1. Tokenization
-----------
Tokenization splits a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

In [123]:
AI = "When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth. But nothing could be further from the truth. The goals of artificial intelligence include mimicking human cognitive activity."

In [124]:
type(AI)

str

### Word and Sentence Tokenization

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [126]:
AI_sent_tokenizes = sent_tokenize(AI)

for AI_sent_tokenize in AI_sent_tokenizes:
    print(AI_sent_tokenize)

When most people hear the term artificial intelligence, the first thing they usually think of is robots.
That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth.
But nothing could be further from the truth.
The goals of artificial intelligence include mimicking human cognitive activity.


In [127]:
AI_word_tokenizes = word_tokenize(AI)
print(AI_word_tokenizes)

['When', 'most', 'people', 'hear', 'the', 'term', 'artificial', 'intelligence', ',', 'the', 'first', 'thing', 'they', 'usually', 'think', 'of', 'is', 'robots', '.', 'That', "'s", 'because', 'big-budget', 'films', 'and', 'novels', 'weave', 'stories', 'about', 'human-like', 'machines', 'that', 'wreak', 'havoc', 'on', 'Earth', '.', 'But', 'nothing', 'could', 'be', 'further', 'from', 'the', 'truth', '.', 'The', 'goals', 'of', 'artificial', 'intelligence', 'include', 'mimicking', 'human', 'cognitive', 'activity', '.']
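
To make the idea of splitting text into tokens concrete, here is a minimal sketch that uses only Python's `re` module. The name `simple_word_tokenize` and its pattern are illustrative assumptions, not part of NLTK, and the result will differ from `word_tokenize` on contractions such as "That's".

```python
import re

# Hypothetical pattern: a word (optionally hyphenated) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens with one regular expression."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenize("Big-budget films weave stories about human-like machines.")
print(tokens)
# ['Big-budget', 'films', 'weave', 'stories', 'about', 'human-like', 'machines', '.']
```

A rule this small is easy to reason about, but it knows nothing about abbreviations or clitics, which is where a trained tokenizer earns its keep.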


In [128]:
from nltk.probability import FreqDist

In [129]:
fdist = FreqDist()

for AI_word_tokenize in AI_word_tokenizes:
    fdist[AI_word_tokenize.lower()] += 1

fdist

FreqDist({'the': 4, '.': 4, 'artificial': 2, 'intelligence': 2, 'of': 2, 'that': 2, 'when': 1, 'most': 1, 'people': 1, 'hear': 1, ...})

In [130]:
fdist_top5 = fdist.most_common(5)

fdist_top5

[('the', 4), ('.', 4), ('artificial', 2), ('intelligence', 2), ('of', 2)]
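
The full stop shows up among the most common tokens above because punctuation is counted like any other token. `FreqDist` is a subclass of `collections.Counter`, so the sketch below uses a plain `Counter` to show one possible way of skipping punctuation while counting. The helper name `word_frequencies` and the sample token list are assumptions made for illustration.

```python
from collections import Counter

def word_frequencies(tokens):
    """Count lower-cased tokens, ignoring tokens that contain no letter or digit."""
    return Counter(
        token.lower() for token in tokens if any(ch.isalnum() for ch in token)
    )

# A short hand-made token list standing in for AI_word_tokenizes.
sample_tokens = ["The", "goals", "of", "the", "robots", ".", "The", "end", "."]
frequencies = word_frequencies(sample_tokens)
print(frequencies.most_common(1))  # [('the', 3)]
```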

### N-Gram models

In [131]:
from nltk.tokenize import word_tokenize

In [132]:
import nltk

In [133]:
string = "When most people hear the term artificial intelligence, the first thing they usually think of is robots."

In [134]:
string_word_tokenize = word_tokenize(string)

In [135]:
string_word_tokenize_bigrams = list(nltk.bigrams(sequence=string_word_tokenize))

In [136]:
string_word_tokenize_bigrams

[('When', 'most'),
 ('most', 'people'),
 ('people', 'hear'),
 ('hear', 'the'),
 ('the', 'term'),
 ('term', 'artificial'),
 ('artificial', 'intelligence'),
 ('intelligence', ','),
 (',', 'the'),
 ('the', 'first'),
 ('first', 'thing'),
 ('thing', 'they'),
 ('they', 'usually'),
 ('usually', 'think'),
 ('think', 'of'),
 ('of', 'is'),
 ('is', 'robots'),
 ('robots', '.')]

In [137]:
string_word_tokenize_trigrams = list(nltk.trigrams(sequence=string_word_tokenize))

In [138]:
string_word_tokenize_trigrams

[('When', 'most', 'people'),
 ('most', 'people', 'hear'),
 ('people', 'hear', 'the'),
 ('hear', 'the', 'term'),
 ('the', 'term', 'artificial'),
 ('term', 'artificial', 'intelligence'),
 ('artificial', 'intelligence', ','),
 ('intelligence', ',', 'the'),
 (',', 'the', 'first'),
 ('the', 'first', 'thing'),
 ('first', 'thing', 'they'),
 ('thing', 'they', 'usually'),
 ('they', 'usually', 'think'),
 ('usually', 'think', 'of'),
 ('think', 'of', 'is'),
 ('of', 'is', 'robots'),
 ('is', 'robots', '.')]

In [139]:
string_word_tokenize_ngrams = list(nltk.ngrams(string_word_tokenize, 5))

In [140]:
string_word_tokenize_ngrams

[('When', 'most', 'people', 'hear', 'the'),
 ('most', 'people', 'hear', 'the', 'term'),
 ('people', 'hear', 'the', 'term', 'artificial'),
 ('hear', 'the', 'term', 'artificial', 'intelligence'),
 ('the', 'term', 'artificial', 'intelligence', ','),
 ('term', 'artificial', 'intelligence', ',', 'the'),
 ('artificial', 'intelligence', ',', 'the', 'first'),
 ('intelligence', ',', 'the', 'first', 'thing'),
 (',', 'the', 'first', 'thing', 'they'),
 ('the', 'first', 'thing', 'they', 'usually'),
 ('first', 'thing', 'they', 'usually', 'think'),
 ('thing', 'they', 'usually', 'think', 'of'),
 ('they', 'usually', 'think', 'of', 'is'),
 ('usually', 'think', 'of', 'is', 'robots'),
 ('think', 'of', 'is', 'robots', '.')]
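
An n-gram is just a sliding window of `n` consecutive tokens. As a rough sketch of what the calls above compute, the window can be built with `zip` over shifted copies of the token list. The function name `simple_ngrams` is an assumption chosen to avoid clashing with `nltk.ngrams`.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest shifted copy, so no partial windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["When", "most", "people", "hear"]
print(simple_ngrams(words, 2))
# [('When', 'most'), ('most', 'people'), ('people', 'hear')]
```

A list of `k` tokens yields `k - n + 1` n-grams, which matches the 19-token sentence above producing 18 bigrams, 17 trigrams, and 15 five-grams.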

-----------

## 1.2. Stemming
--------------
Stemming is the process of reducing a word to its stem (root form) by stripping affixes such as prefixes and suffixes. The resulting stem does not have to be a valid dictionary word.

In [145]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
import re

In [142]:
string = "Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include mimicking human cognitive activity."

In [144]:
compile_ = []

ps = PorterStemmer()

string_ = re.sub("[^a-zA-Z]", " ", str(string))
string_ = string_.lower().split()
string_ = [ps.stem(word=word_) for word_ in string_]
string_ = " ".join(string_)
compile_.append(string_)

print("Before stemming: \n{}".format(string))
print("----------------")
print("After stemming: \n{}".format(compile_))

Before stemming: 
Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include mimicking human cognitive activity.
----------------
After stemming: 
['artifici intellig is base on the principl that human intellig can be defin in a way that a machin can easili mimic it and execut task from the most simpl to those that are even more complex the goal of artifici intellig includ mimick human cognit activ']


In [146]:
compile_ = []

lts = LancasterStemmer()

string_ = re.sub("[^a-zA-Z]", " ", str(string))
string_ = string_.lower().split()
string_ = [lts.stem(word=word_) for word_ in string_]
string_ = " ".join(string_)
compile_.append(string_)

print("Before stemming: \n{}".format(string))
print("----------------")
print("After stemming: \n{}".format(compile_))

Before stemming: 
Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include mimicking human cognitive activity.
----------------
After stemming: 
['art intellig is bas on the principl that hum intellig can be defin in a way that a machin can easy mim it and execut task from the most simpl to thos that ar ev mor complex the goal of art intellig includ mimick hum cognit act']
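
Both stemmers above are rule-based: they strip or rewrite word endings without consulting a dictionary. The sketch below is a deliberately naive suffix stripper, not the Porter or Lancaster algorithm, and is only meant to show why stems such as "defin" are not real words. The suffix list and the `min_stem_len` parameter are assumptions.

```python
# Hypothetical, very small rule list; real stemmers use many ordered rules.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["mimicking", "defined", "tasks", "is"]])
# ['mimick', 'defin', 'task', 'is']
```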


-----------

## 1.3. Lemmatization
--------------
Lemmatization uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

In [149]:
from nltk.stem import WordNetLemmatizer
import re

In [150]:
string = "Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include mimicking human cognitive activity."

In [153]:
compile_ = []

lemma = WordNetLemmatizer()

string_ = re.sub("[^a-zA-Z]", " ", str(string))
string_ = string_.lower().split()
string_ = [lemma.lemmatize(word=word_) for word_ in string_]
string_ = " ".join(string_)
compile_.append(string_)

print("Before lemmatizer: \n{}".format(string))
print("----------------")
print("After lemmatizer: \n{}".format(compile_))

Before lemmatizer: 
Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include mimicking human cognitive activity.
----------------
After lemmatizer: 
['artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute task from the most simple to those that are even more complex the goal of artificial intelligence include mimicking human cognitive activity']

Note that `lemmatize` treats every word as a noun unless a `pos` argument is given, which is why only the plural nouns (`tasks`, `goals`) change above while verb forms such as `based` and `defined` are left as they are.


--------------

## 1.4. Stopwords
----------
Stopwords are words in a language that add little meaning to a sentence. They can usually be ignored without changing what the sentence means.

In [154]:
from nltk.corpus import stopwords

In [155]:
stopwords_ = stopwords.words("english")

In [158]:
stopwords_[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [159]:
len(stopwords.words("english"))

179
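
Removing stopwords comes down to a membership test against that list. The sketch below uses a small hand-picked set as a stand-in for `stopwords.words("english")`, so `STOP_WORDS` and `remove_stopwords` are illustrative assumptions. Converting the list to a `set` once keeps each lookup cheap.

```python
# Small hand-picked stand-in for stopwords.words("english") (assumption).
STOP_WORDS = {"the", "of", "is", "they", "when", "most"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lower-cased form is not in the stopword set."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "When most people hear the term artificial intelligence".split()
print(remove_stopwords(tokens))
# ['people', 'hear', 'term', 'artificial', 'intelligence']
```

The important detail is `not in`, which tests membership; `is` would compare object identity and never match a word against a set.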

----------------

## 1.5. Bag of Words Model Example

In [162]:
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 

In [160]:
AI = "When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth. But nothing could be further from the truth. The goals of artificial intelligence include mimicking human cognitive activity."

In [161]:
str(AI)

"When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth. But nothing could be further from the truth. The goals of artificial intelligence include mimicking human cognitive activity."

In [163]:
AI_sent_tokenizes = sent_tokenize(AI)

In [164]:
AI_sent_tokenizes

['When most people hear the term artificial intelligence, the first thing they usually think of is robots.',
 "That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth.",
 'But nothing could be further from the truth.',
 'The goals of artificial intelligence include mimicking human cognitive activity.']

In [171]:
compile_ = []

ps = PorterStemmer()

for AI_sent_tokenize in AI_sent_tokenizes:

    string_ = re.sub("[^a-zA-Z]", " ", str(AI_sent_tokenize))
    string_ = string_.lower().split()
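    # Caution: `word_ is set(...)` is an identity test that is never true, so this
    # condition filters nothing (stopwords still appear in the output below);
    # `word_ not in stop_words` with a prebuilt set would remove them.
    string_ = [ps.stem(word=word_) for word_ in string_ if not word_ is set(stopwords.words("english"))]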
    string_ = " ".join(string_)
    compile_.append(string_)

In [172]:
compile_

['when most peopl hear the term artifici intellig the first thing they usual think of is robot',
 'that s becaus big budget film and novel weav stori about human like machin that wreak havoc on earth',
 'but noth could be further from the truth',
 'the goal of artifici intellig includ mimick human cognit activ']

In [165]:
from sklearn.feature_extraction.text import CountVectorizer

In [166]:
vect = CountVectorizer()

In [178]:
vect.fit(compile_)

CountVectorizer()

In [181]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
columns_name = vect.get_feature_names_out()

In [176]:
results = vect.fit_transform(compile_).toarray()

In [177]:
results

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
        1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 2, 1, 1, 1, 0, 1, 0, 1,
        0],
       [1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0,
        1],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
        0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0]], dtype=int64)

In [182]:
import pandas as pd 

In [183]:
# Pass the names directly; wrapping them in a list would create MultiIndex columns
data = pd.DataFrame(data=results, columns=columns_name)

In [184]:
data

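|   | about | activ | and | artifici | be | becaus | big | budget | but | cognit | ... | that | the | they | thing | think | truth | usual | weav | when | wreak |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |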
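
Each row of the matrix above is one sentence and each column is one vocabulary term, with the cell holding how often the term occurs in that sentence. The sketch below rebuilds that idea with `collections.Counter` only. It is an approximation, not `CountVectorizer` itself: it splits on whitespace, whereas `CountVectorizer` by default also drops single-character tokens. The name `bag_of_words` and the sample documents are assumptions.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Sorted union of all terms gives a stable column order.
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix

docs = ["the goal of the robot", "the truth"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['goal', 'of', 'robot', 'the', 'truth']
for row in matrix:
    print(row)
# [1, 1, 1, 2, 0]
# [0, 0, 0, 1, 1]
```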


--------------

## 1.6. Parts of Speech (POS)

https://m-clark.github.io/text-analysis-with-R/topic-modeling.html

In [190]:
from nltk.tokenize import word_tokenize

In [191]:
string = "Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks."

In [192]:
string_word_tokenizes = word_tokenize(string)

In [195]:
for string_word_tokenize in string_word_tokenizes:

    print(nltk.pos_tag([string_word_tokenize]))

[('Artificial', 'JJ')]
[('intelligence', 'NN')]
[('is', 'VBZ')]
[('based', 'VBN')]
[('on', 'IN')]
[('the', 'DT')]
[('principle', 'NN')]
[('that', 'IN')]
[('human', 'NN')]
[('intelligence', 'NN')]
[('can', 'MD')]
[('be', 'VB')]
[('defined', 'VBN')]
[('in', 'IN')]
[('a', 'DT')]
[('way', 'NN')]
[('that', 'IN')]
[('a', 'DT')]
[('machine', 'NN')]
[('can', 'MD')]
[('easily', 'RB')]
[('mimic', 'NN')]
[('it', 'PRP')]
[('and', 'CC')]
[('execute', 'NN')]
[('tasks', 'NNS')]
[('.', '.')]

Because each token is passed to `nltk.pos_tag` on its own, the tagger has no surrounding context to work with, which is why verbs such as "mimic" and "execute" are tagged NN above. Passing the whole token list in one call, `nltk.pos_tag(string_word_tokenizes)`, lets the tagger use neighbouring words.


-----------

## 1.7. Named Entity Recognition
----------
Named entity recognition (NER) ‒ also called entity identification or entity extraction ‒ is a natural language processing (NLP) technique that automatically identifies named entities in a text and classifies them into predefined categories.

In [199]:
from nltk import word_tokenize, pos_tag, ne_chunk

In [200]:
sentence = "Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP, before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season, he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup"

In [206]:
tokens = word_tokenize(sentence)
print(tokens)

['Born', 'and', 'raised', 'in', 'Madeira', ',', 'Ronaldo', 'began', 'his', 'senior', 'club', 'career', 'playing', 'for', 'Sporting', 'CP', ',', 'before', 'signing', 'with', 'Manchester', 'United', 'in', '2003', ',', 'aged', '18', '.', 'After', 'winning', 'the', 'FA', 'Cup', 'in', 'his', 'first', 'season', ',', 'he', 'helped', 'United', 'win', 'three', 'successive', 'Premier', 'League', 'titles', ',', 'the', 'UEFA', 'Champions', 'League', ',', 'and', 'the', 'FIFA', 'Club', 'World', 'Cup']


In [205]:
pos_tags = pos_tag(tokens)
print(pos_tags)

[('Born', 'NNP'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'), ('Madeira', 'NNP'), (',', ','), ('Ronaldo', 'NNP'), ('began', 'VBD'), ('his', 'PRP$'), ('senior', 'JJ'), ('club', 'NN'), ('career', 'NN'), ('playing', 'NN'), ('for', 'IN'), ('Sporting', 'VBG'), ('CP', 'NNP'), (',', ','), ('before', 'IN'), ('signing', 'VBG'), ('with', 'IN'), ('Manchester', 'NNP'), ('United', 'NNP'), ('in', 'IN'), ('2003', 'CD'), (',', ','), ('aged', 'VBD'), ('18', 'CD'), ('.', '.'), ('After', 'IN'), ('winning', 'VBG'), ('the', 'DT'), ('FA', 'NNP'), ('Cup', 'NNP'), ('in', 'IN'), ('his', 'PRP$'), ('first', 'JJ'), ('season', 'NN'), (',', ','), ('he', 'PRP'), ('helped', 'VBD'), ('United', 'NNP'), ('win', 'VB'), ('three', 'CD'), ('successive', 'JJ'), ('Premier', 'NNP'), ('League', 'NNP'), ('titles', 'NNS'), (',', ','), ('the', 'DT'), ('UEFA', 'NNP'), ('Champions', 'NNP'), ('League', 'NNP'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('FIFA', 'NNP'), ('Club', 'NNP'), ('World', 'NNP'), ('Cup', 'NNP')]


In [208]:
name_entities = ne_chunk(pos_tags)

for name_entity in name_entities:
    print(name_entity)

(GPE Born/NNP)
('and', 'CC')
('raised', 'VBN')
('in', 'IN')
(GPE Madeira/NNP)
(',', ',')
(PERSON Ronaldo/NNP)
('began', 'VBD')
('his', 'PRP$')
('senior', 'JJ')
('club', 'NN')
('career', 'NN')
('playing', 'NN')
('for', 'IN')
('Sporting', 'VBG')
(ORGANIZATION CP/NNP)
(',', ',')
('before', 'IN')
('signing', 'VBG')
('with', 'IN')
(PERSON Manchester/NNP United/NNP)
('in', 'IN')
('2003', 'CD')
(',', ',')
('aged', 'VBD')
('18', 'CD')
('.', '.')
('After', 'IN')
('winning', 'VBG')
('the', 'DT')
('FA', 'NNP')
('Cup', 'NNP')
('in', 'IN')
('his', 'PRP$')
('first', 'JJ')
('season', 'NN')
(',', ',')
('he', 'PRP')
('helped', 'VBD')
(GPE United/NNP)
('win', 'VB')
('three', 'CD')
('successive', 'JJ')
('Premier', 'NNP')
('League', 'NNP')
('titles', 'NNS')
(',', ',')
('the', 'DT')
(ORGANIZATION UEFA/NNP)
('Champions', 'NNP')
('League', 'NNP')
(',', ',')
('and', 'CC')
('the', 'DT')
(ORGANIZATION FIFA/NNP Club/NNP)
('World', 'NNP')
('Cup', 'NNP')


--------------

## 1.8. Chunking
-----
Chunking groups individual tokens into larger units, most commonly noun phrases. It adds structure to a sentence by applying regular-expression patterns to the POS tags of its tokens.

In [209]:
from nltk import word_tokenize, pos_tag, RegexpParser

In [210]:
sentence = "Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP, before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season, he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup"

In [211]:
tokens = word_tokenize(sentence)

In [212]:
tokens_pos_tag = pos_tag(tokens)

In [213]:
chunking = "NP: {<DT>?<JJ>*<NN>}"

In [214]:
cp = RegexpParser(chunking)

In [215]:
result = cp.parse(tokens_pos_tag)

In [218]:
print(result)

(S
  Born/NNP
  and/CC
  raised/VBN
  in/IN
  Madeira/NNP
  ,/,
  Ronaldo/NNP
  began/VBD
  his/PRP$
  (NP senior/JJ club/NN)
  (NP career/NN)
  (NP playing/NN)
  for/IN
  Sporting/VBG
  CP/NNP
  ,/,
  before/IN
  signing/VBG
  with/IN
  Manchester/NNP
  United/NNP
  in/IN
  2003/CD
  ,/,
  aged/VBD
  18/CD
  ./.
  After/IN
  winning/VBG
  the/DT
  FA/NNP
  Cup/NNP
  in/IN
  his/PRP$
  (NP first/JJ season/NN)
  ,/,
  he/PRP
  helped/VBD
  United/NNP
  win/VB
  three/CD
  successive/JJ
  Premier/NNP
  League/NNP
  titles/NNS
  ,/,
  the/DT
  UEFA/NNP
  Champions/NNP
  League/NNP
  ,/,
  and/CC
  the/DT
  FIFA/NNP
  Club/NNP
  World/NNP
  Cup/NNP)
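
The grammar `NP: {<DT>?<JJ>*<NN>}` reads as: an optional determiner, any number of adjectives, then one singular noun. To illustrate how a regular expression can run over POS tags, the sketch below joins the tags into one string and maps each match back to its words. It is a simplified stand-in written with `re`, not how `RegexpParser` is implemented, and `chunk_noun_phrases` is a hypothetical name.

```python
import re

# Same shape as the grammar above: optional <DT>, any number of <JJ>, one <NN>.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def chunk_noun_phrases(tagged_tokens):
    """Return the words of every tag run that matches DT? JJ* NN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Count tags before the match to find the first token of the chunk.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        words = [word for word, _ in tagged_tokens[start:start + length]]
        phrases.append(" ".join(words))
    return phrases

tagged = [("his", "PRP$"), ("senior", "JJ"), ("club", "NN"), ("career", "NN"),
          ("playing", "NN"), ("for", "IN"), ("the", "DT"), ("FA", "NNP")]
print(chunk_noun_phrases(tagged))
# ['senior club', 'career', 'playing']
```

Proper nouns (`NNP`) and plurals (`NNS`) are not matched because the pattern asks for exactly `<NN>`, which is also why names such as "Manchester United" stay outside any NP chunk in the output above.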
