In [5]:
import numpy as np
import pandas as pd

In [6]:
paragraph = """
Alexander III of Macedon (Ancient Greek: Ἀλέξανδρος, romanized: Alexandros; 20/21 July 356 BC – 10/11 June 323 BC), most commonly known as Alexander the Great,[c] was a king of the ancient Greek kingdom of Macedon.[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20 and spent most of his ruling years conducting a lengthy military campaign throughout Western Asia, Central Asia, parts of South Asia, and Egypt. By the age of 30, he had created one of the largest empires in history, stretching from Greece to northwestern India.[1] He was undefeated in battle and is widely considered to be one of history's greatest and most successful military commanders.[2][3][4]

Until the age of 16, Alexander was tutored by Aristotle. In 335 BC, shortly after his assumption of kingship over Macedon, he campaigned in the Balkans and reasserted control over Thrace and parts of Illyria before marching on the city of Thebes, which was subsequently destroyed in battle. Alexander then led the League of Corinth, and used his authority to launch the pan-Hellenic project envisaged by his father, assuming leadership over all Greeks in their conquest of Persia.[5][6]

In 334 BC, he invaded the Achaemenid Persian Empire and began a series of campaigns that lasted for 10 years. Following his conquest of Asia Minor, Alexander broke the power of Achaemenid Persia in a series of decisive battles, including those at Issus and Gaugamela; he subsequently overthrew Darius III and conquered the Achaemenid Empire in its entirety.[e] After the fall of Persia, the Macedonian Empire held a vast swath of territory between the Adriatic Sea and the Indus River. Alexander endeavored to reach the "ends of the world and the Great Outer Sea" and invaded India in 326 BC, achieving an important victory over Porus, an ancient Indian king of present-day Punjab, at the Battle of the Hydaspes. Due to the demand of his homesick troops, he eventually turned back at the Beas River and later died in 323 BC in Babylon, the city of Mesopotamia that he had planned to establish as his empire's capital. Alexander's death left unexecuted an additional series of planned military and mercantile campaigns that would have begun with a Greek invasion of Arabia. In the years following his death, a series of civil wars broke out across the Macedonian Empire, eventually leading to its disintegration at the hands of the Diadochi.
"""

In [7]:
paragraph

'\nAlexander III of Macedon (Ancient Greek: Ἀλέξανδρος, romanized: Alexandros; 20/21 July 356 BC – 10/11 June 323 BC), most commonly known as Alexander the Great,[c] was a king of the ancient Greek kingdom of Macedon.[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20 and spent most of his ruling years conducting a lengthy military campaign throughout Western Asia, Central Asia, parts of South Asia, and Egypt. By the age of 30, he had created one of the largest empires in history, stretching from Greece to northwestern India.[1] He was undefeated in battle and is widely considered to be one of history\'s greatest and most successful military commanders.[2][3][4]\n\nUntil the age of 16, Alexander was tutored by Aristotle. In 335 BC, shortly after his assumption of kingship over Macedon, he campaigned in the Balkans and reasserted control over Thrace and parts of Illyria before marching on the city of Thebes, which was subsequently destroyed in battle. Alexande

In [8]:
import nltk
from nltk.stem import PorterStemmer # Stemming is done using this library
from nltk.corpus import stopwords

## Tokenization: split a paragraph into sentences, and sentences into words

In [9]:
nltk.download('punkt') # Punkt models used by sent_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
sentences = nltk.sent_tokenize(paragraph)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
sentences

['\nAlexander III of Macedon (Ancient Greek: Ἀλέξανδρος, romanized: Alexandros; 20/21 July 356 BC – 10/11 June 323 BC), most commonly known as Alexander the Great,[c] was a king of the ancient Greek kingdom of Macedon.',
 '[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20 and spent most of his ruling years conducting a lengthy military campaign throughout Western Asia, Central Asia, parts of South Asia, and Egypt.',
 'By the age of 30, he had created one of the largest empires in history, stretching from Greece to northwestern India.',
 "[1] He was undefeated in battle and is widely considered to be one of history's greatest and most successful military commanders.",
 '[2][3][4]\n\nUntil the age of 16, Alexander was tutored by Aristotle.',
 'In 335 BC, shortly after his assumption of kingship over Macedon, he campaigned in the Balkans and reasserted control over Thrace and parts of Illyria before marching on the city of Thebes, which was subsequently destro

In [11]:
type(sentences)

list
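In the output above, the citation markers (`[c]`, `[1]`, ...) end up attached to the start of the following sentence, as in `'[d] He succeeded ...'`. One possible fix, sketched here with the standard `re` module, is to strip bracketed markers before tokenizing. The helper name `strip_citations` and its pattern are assumptions made for illustration; they are not part of the original notebook.

```python
import re

# Hypothetical helper: matches bracketed citation markers such as [c] or [12].
CITATION = re.compile(r"\[[A-Za-z0-9]+\]")

def strip_citations(text):
    """Return text with bracketed citation markers removed."""
    return CITATION.sub("", text)

sample = "known as Alexander the Great,[c] was a king of Macedon.[d] He succeeded his father."
cleaned = strip_citations(sample)
print(cleaned)
```

The cleaned text could then be passed to the tokenizer, for example `nltk.sent_tokenize(strip_citations(paragraph))`.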

### Stemming: reduce a word to its root form (the result may not be a real word)

In [12]:
stemmer = PorterStemmer()

In [13]:
stemmer.stem('going')

'go'

In [14]:
stemmer.stem('facial')

'facial'

In [15]:
stemmer.stem('thinking')

'think'

In [16]:
stemmer.stem('history')

'histori'
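The point of stemming is that different inflections of a word collapse to the same token, even when that token is not a dictionary word (as with `'histori'`). A small sketch, using three forms of "campaign" that occur in the paragraph; the variable names are only illustrative:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word, all present in the paragraph.
variants = ["campaign", "campaigned", "campaigns"]

# Map each surface form to its stem.
stems = {word: stemmer.stem(word) for word in variants}
print(stems)
```

All three forms should map to the same stem, so a bag-of-words model would count them as one feature.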

### Lemmatization: reduce a word to its dictionary form (a valid word)

In [17]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
lemmatizer = WordNetLemmatizer()

In [19]:
lemmatizer.lemmatize('history')

'history'

In [20]:
lemmatizer.lemmatize('drinking')

'drinking'

In [21]:
lemmatizer.lemmatize('goes')

'go'

## Clean Special Characters

In [22]:
len(sentences)

14

In [23]:
import re
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence) # replace every character other than a-z and A-Z with a space
    review = review.lower()
    corpus.append(review)

In [24]:
corpus

[' alexander iii of macedon  ancient greek              romanized  alexandros        july     bc         june     bc   most commonly known as alexander the great  c  was a king of the ancient greek kingdom of macedon ',
 ' d  he succeeded his father philip ii to the throne in     bc at the age of    and spent most of his ruling years conducting a lengthy military campaign throughout western asia  central asia  parts of south asia  and egypt ',
 'by the age of     he had created one of the largest empires in history  stretching from greece to northwestern india ',
 '    he was undefeated in battle and is widely considered to be one of history s greatest and most successful military commanders ',
 '           until the age of     alexander was tutored by aristotle ',
 'in     bc  shortly after his assumption of kingship over macedon  he campaigned in the balkans and reasserted control over thrace and parts of illyria before marching on the city of thebes  which was subsequently destroyed
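The long runs of spaces above appear because the pattern replaces each non-letter character with its own space. The same pattern also removes digits (the years) and the Greek spelling of the name. A minimal sketch of a variant that collapses each run into a single space; the sample sentence is shortened from the paragraph and the variable names are illustrative:

```python
import re

sentence = "In 335 BC, shortly after his assumption of kingship over Macedon"

# Original pattern: every non-letter character becomes one space.
per_char = re.sub("[^a-zA-Z]", " ", sentence).lower()

# Variant: a whole run of non-letters becomes one space; strip() trims the ends.
collapsed = re.sub("[^a-zA-Z]+", " ", sentence).lower().strip()

print(repr(per_char))
print(repr(collapsed))
```

Both versions yield the same word list once the string is split on whitespace, so the difference only matters when the cleaned strings are inspected or stored as they are.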

### Stemming (lemmatization can be applied the same way)

In [25]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [27]:
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in stop_words:
            print(stemmer.stem(word))

alexand
iii
macedon
ancient
greek
roman
alexandro
juli
bc
june
bc
commonli
known
alexand
great
c
king
ancient
greek
kingdom
macedon
succeed
father
philip
ii
throne
bc
age
spent
rule
year
conduct
lengthi
militari
campaign
throughout
western
asia
central
asia
part
south
asia
egypt
age
creat
one
largest
empir
histori
stretch
greec
northwestern
india
undef
battl
wide
consid
one
histori
greatest
success
militari
command
age
alexand
tutor
aristotl
bc
shortli
assumpt
kingship
macedon
campaign
balkan
reassert
control
thrace
part
illyria
march
citi
thebe
subsequ
destroy
battl
alexand
led
leagu
corinth
use
author
launch
pan
hellen
project
envisag
father
assum
leadership
greek
conquest
persia
bc
invad
achaemenid
persian
empir
began
seri
campaign
last
year
follow
conquest
asia
minor
alexand
broke
power
achaemenid
persia
seri
decis
battl
includ
issu
gaugamela
subsequ
overthrew
dariu
iii
conquer
achaemenid
empir
entireti
e
fall
persia
macedonian
empir
held
vast
swath
territori
adriat
sea
indu
river


In [28]:
## Lemmatization

stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in stop_words:
            print(lemmatizer.lemmatize(word))

alexander
iii
macedon
ancient
greek
romanized
alexandros
july
bc
june
bc
commonly
known
alexander
great
c
king
ancient
greek
kingdom
macedon
succeeded
father
philip
ii
throne
bc
age
spent
ruling
year
conducting
lengthy
military
campaign
throughout
western
asia
central
asia
part
south
asia
egypt
age
created
one
largest
empire
history
stretching
greece
northwestern
india
undefeated
battle
widely
considered
one
history
greatest
successful
military
commander
age
alexander
tutored
aristotle
bc
shortly
assumption
kingship
macedon
campaigned
balkan
reasserted
control
thrace
part
illyria
marching
city
thebe
subsequently
destroyed
battle
alexander
led
league
corinth
used
authority
launch
pan
hellenic
project
envisaged
father
assuming
leadership
greek
conquest
persia
bc
invaded
achaemenid
persian
empire
began
series
campaign
lasted
year
following
conquest
asia
minor
alexander
broke
power
achaemenid
persia
series
decisive
battle
including
issus
gaugamela
subsequently
overthrew
darius
iii
conquere

## Remove Stopwords and Lemmatize

In [29]:
import re
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # keep letters only
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

## Bag of words

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
cv = CountVectorizer(binary=True) # binary=True records presence (0/1) instead of counts; it is optional

In [32]:
X = cv.fit_transform(corpus)

In [33]:
cv.vocabulary_  ## Maps each term to its column index in the feature matrix

{'alexander': 6,
 'iii': 72,
 'macedon': 98,
 'ancient': 8,
 'greek': 64,
 'romanized': 123,
 'alexandros': 7,
 'july': 82,
 'bc': 19,
 'june': 83,
 'commonly': 31,
 'known': 87,
 'great': 61,
 'king': 84,
 'kingdom': 85,
 'succeeded': 132,
 'father': 58,
 'philip': 113,
 'ii': 71,
 'throne': 138,
 'age': 5,
 'spent': 129,
 'ruling': 124,
 'year': 153,
 'conducting': 32,
 'lengthy': 97,
 'military': 103,
 'campaign': 24,
 'throughout': 139,
 'western': 149,
 'asia': 11,
 'central': 27,
 'part': 110,
 'south': 128,
 'egypt': 49,
 'created': 38,
 'one': 106,
 'largest': 88,
 'empire': 50,
 'history': 68,
 'stretching': 130,
 'greece': 63,
 'northwestern': 105,
 'india': 76,
 'undefeated': 143,
 'battle': 18,
 'widely': 150,
 'considered': 35,
 'greatest': 62,
 'successful': 133,
 'commander': 30,
 'tutored': 142,
 'aristotle': 10,
 'shortly': 127,
 'assumption': 13,
 'kingship': 86,
 'campaigned': 25,
 'balkan': 17,
 'reasserted': 121,
 'control': 36,
 'thrace': 137,
 'illyria': 73,
 'ma

In [34]:
corpus[0]

'alexander iii macedon ancient greek romanized alexandros july bc june bc commonly known alexander great c king ancient greek kingdom macedon'

In [35]:
X[0].toarray()

array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

In [36]:
X[0].toarray().shape

(1, 154)
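Two details of the vectors above are easy to miss: `binary=True` caps every entry at 1 even when a word repeats, and the single-letter token `c` from `corpus[0]` never reaches the vocabulary, because the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more characters. A small sketch on `corpus[0]` alone (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# corpus[0] as printed above
doc = ("alexander iii macedon ancient greek romanized alexandros july bc june bc "
       "commonly known alexander great c king ancient greek kingdom macedon")

binary_cv = CountVectorizer(binary=True)
count_cv = CountVectorizer()  # binary=False is the default

binary_row = binary_cv.fit_transform([doc]).toarray()[0]
count_row = count_cv.fit_transform([doc]).toarray()[0]

# Both vectorizers see the same document, so they share column indices.
col = count_cv.vocabulary_["alexander"]
print(binary_row[col], count_row[col])
print("c" in count_cv.vocabulary_)
```

Whether presence or raw counts works better depends on the downstream model, so it is worth trying both.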

## Power of N-Grams
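To capture some semantic context, we use n-grams: besides single words (unigrams), combinations of adjacent words also become features.
Bigrams, for example, are pairs of consecutive words.

`ngram_range=(1, 3)` means unigrams through trigrams are all used as features. The cell below uses `ngram_range=(3, 3)`, which keeps trigrams only.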
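To make the effect of `ngram_range` concrete, here is a small sketch on a three-word document taken from the start of `corpus[0]`. The helper `ngram_features` is a hypothetical name introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["alexander iii macedon"]

def ngram_features(ngram_range):
    """Fit a CountVectorizer and return its terms ordered by column index."""
    cv = CountVectorizer(ngram_range=ngram_range)
    cv.fit(docs)
    return sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(ngram_features((1, 1)))  # unigrams only
print(ngram_features((1, 3)))  # unigrams, bigrams and trigrams
print(ngram_features((3, 3)))  # trigrams only
```

Widening the range grows the vocabulary quickly, so on a larger corpus it may be worth limiting it, for example with the `max_features` parameter.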

In [48]:
cv = CountVectorizer(binary=True, ngram_range=(3, 3))

In [49]:
X = cv.fit_transform(corpus)

In [50]:
cv.vocabulary_

{'alexander iii macedon': 14,
 'iii macedon ancient': 100,
 'macedon ancient greek': 124,
 'ancient greek romanized': 19,
 'greek romanized alexandros': 92,
 'romanized alexandros july': 155,
 'alexandros july bc': 17,
 'july bc june': 109,
 'bc june bc': 37,
 'june bc commonly': 110,
 'bc commonly known': 35,
 'commonly known alexander': 52,
 'known alexander great': 114,
 'alexander great king': 13,
 'great king ancient': 85,
 'king ancient greek': 111,
 'ancient greek kingdom': 18,
 'greek kingdom macedon': 91,
 'succeeded father philip': 169,
 'father philip ii': 81,
 'philip ii throne': 144,
 'ii throne bc': 98,
 'throne bc age': 175,
 'bc age spent': 33,
 'age spent ruling': 9,
 'spent ruling year': 165,
 'ruling year conducting': 156,
 'year conducting lengthy': 189,
 'conducting lengthy military': 53,
 'lengthy military campaign': 123,
 'military campaign throughout': 131,
 'campaign throughout western': 45,
 'throughout western asia': 176,
 'western asia central': 185,
 'asia 

In [51]:
X[0].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)