In [1]:
import numpy as np
import pandas as pd

In [2]:
paragraph = """
Alexander III of Macedon (Ancient Greek: Ἀλέξανδρος, romanized: Alexandros; 20/21 July 356 BC – 10/11 June 323 BC), most commonly known as Alexander the Great,[c] was a king of the ancient Greek kingdom of Macedon.[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20 and spent most of his ruling years conducting a lengthy military campaign throughout Western Asia, Central Asia, parts of South Asia, and Egypt. By the age of 30, he had created one of the largest empires in history, stretching from Greece to northwestern India.[1] He was undefeated in battle and is widely considered to be one of history's greatest and most successful military commanders.[2][3][4]

Until the age of 16, Alexander was tutored by Aristotle. In 335 BC, shortly after his assumption of kingship over Macedon, he campaigned in the Balkans and reasserted control over Thrace and parts of Illyria before marching on the city of Thebes, which was subsequently destroyed in battle. Alexander then led the League of Corinth, and used his authority to launch the pan-Hellenic project envisaged by his father, assuming leadership over all Greeks in their conquest of Persia.[5][6]

In 334 BC, he invaded the Achaemenid Persian Empire and began a series of campaigns that lasted for 10 years. Following his conquest of Asia Minor, Alexander broke the power of Achaemenid Persia in a series of decisive battles, including those at Issus and Gaugamela; he subsequently overthrew Darius III and conquered the Achaemenid Empire in its entirety.[e] After the fall of Persia, the Macedonian Empire held a vast swath of territory between the Adriatic Sea and the Indus River. Alexander endeavored to reach the "ends of the world and the Great Outer Sea" and invaded India in 326 BC, achieving an important victory over Porus, an ancient Indian king of present-day Punjab, at the Battle of the Hydaspes. Due to the demand of his homesick troops, he eventually turned back at the Beas River and later died in 323 BC in Babylon, the city of Mesopotamia that he had planned to establish as his empire's capital. Alexander's death left unexecuted an additional series of planned military and mercantile campaigns that would have begun with a Greek invasion of Arabia. In the years following his death, a series of civil wars broke out across the Macedonian Empire, eventually leading to its disintegration at the hands of the Diadochi.
"""

In [3]:
paragraph

'\nAlexander III of Macedon (Ancient Greek: Ἀλέξανδρος, romanized: Alexandros; 20/21 July 356 BC – 10/11 June 323 BC), most commonly known as Alexander the Great,[c] was a king of the ancient Greek kingdom of Macedon.[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20 and spent most of his ruling years conducting a lengthy military campaign throughout Western Asia, Central Asia, parts of South Asia, and Egypt. By the age of 30, he had created one of the largest empires in history, stretching from Greece to northwestern India.[1] He was undefeated in battle and is widely considered to be one of history\'s greatest and most successful military commanders.[2][3][4]\n\nUntil the age of 16, Alexander was tutored by Aristotle. In 335 BC, shortly after his assumption of kingship over Macedon, he campaigned in the Balkans and reasserted control over Thrace and parts of Illyria before marching on the city of Thebes, which was subsequently destroyed in battle. Alexande

In [4]:
import nltk
from nltk.stem import PorterStemmer  # Porter stemming algorithm
from nltk.corpus import stopwords

## Tokenization --> Splits paragraphs into sentences, and sentences into words

In [5]:
nltk.download('punkt')  # Punkt models used by sent_tokenize
sentences = nltk.sent_tokenize(paragraph)  # split the paragraph into a list of sentences

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [6]:
sentences

['\nAlexander III of Macedon (Ancient Greek: Ἀλέξανδρος, romanized: Alexandros; 20/21 July 356 BC – 10/11 June 323 BC), most commonly known as Alexander the Great,[c] was a king of the ancient Greek kingdom of Macedon.',
 '[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20 and spent most of his ruling years conducting a lengthy military campaign throughout Western Asia, Central Asia, parts of South Asia, and Egypt.',
 'By the age of 30, he had created one of the largest empires in history, stretching from Greece to northwestern India.',
 "[1] He was undefeated in battle and is widely considered to be one of history's greatest and most successful military commanders.",
 '[2][3][4]\n\nUntil the age of 16, Alexander was tutored by Aristotle.',
 'In 335 BC, shortly after his assumption of kingship over Macedon, he campaigned in the Balkans and reasserted control over Thrace and parts of Illyria before marching on the city of Thebes, which was subsequently destro

In [7]:
type(sentences)

list

### Stemming --> Reduces a word to its root form (the result may not be a real word)

In [8]:
stemmer = PorterStemmer()
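
The stemmer is created above but never applied in this notebook, so here is a minimal sketch of what it does. The word list is an assumption, picked from the paragraph for illustration. Note that a stem such as `histori` is not a dictionary word, which is the main reason the later cells use the lemmatizer instead.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words taken from the paragraph above (illustrative choice)
words = ["campaigns", "history", "tutored"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```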

### Lemmatization --> Reduces a word to its dictionary base form (lemma)

In [9]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...


True

In [10]:
lemmatizer = WordNetLemmatizer()

In [11]:
len(sentences)

14

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## Remove stopwords and lemmatize

In [17]:
import re

stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []
for sentence in sentences:
    # Keep letters only; digits, punctuation, and citation brackets become spaces
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower().split()
    # Drop stopwords, then lemmatize what remains
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
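
To see what each cleaning step does, the sketch below traces one sentence from the paragraph through the regex, lowercasing, and stopword filter. It leaves out lemmatization and uses a tiny hand-written stop-word set (`tiny_stop_words`, a hypothetical stand-in for nltk's English list) so that it runs without any downloads. Note how the citation marker `[d]` survives as the stray token `d`.

```python
import re

# Hypothetical miniature stop-word list, used only to keep this sketch self-contained;
# the notebook itself uses stopwords.words('english') from nltk.
tiny_stop_words = {"he", "his", "to", "the", "in", "at", "of", "and"}

sentence = "[d] He succeeded his father Philip II to the throne in 336 BC at the age of 20"

# Step 1: replace every non-letter character (digits, brackets, punctuation) with a space
letters_only = re.sub('[^a-zA-Z]', ' ', sentence)

# Step 2: lowercase and split on whitespace
tokens = letters_only.lower().split()

# Step 3: drop stop words
kept = [token for token in tokens if token not in tiny_stop_words]
print(kept)
# ['d', 'succeeded', 'father', 'philip', 'ii', 'throne', 'bc', 'age']
```

Single-letter leftovers like `d` are harmless here because the scikit-learn vectorizers used below ignore one-character tokens under their default token pattern.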

## Bag of words

In [51]:
from sklearn.feature_extraction.text import CountVectorizer

In [52]:
cv = CountVectorizer(binary=True)  # binary=True records presence (0/1); omit it to get raw counts

In [53]:
X = cv.fit_transform(corpus)

In [54]:
cv.vocabulary_  # Maps each term to its column index in X

{'alexander': 6,
 'iii': 72,
 'macedon': 98,
 'ancient': 8,
 'greek': 64,
 'romanized': 123,
 'alexandros': 7,
 'july': 82,
 'bc': 19,
 'june': 83,
 'commonly': 31,
 'known': 87,
 'great': 61,
 'king': 84,
 'kingdom': 85,
 'succeeded': 132,
 'father': 58,
 'philip': 113,
 'ii': 71,
 'throne': 138,
 'age': 5,
 'spent': 129,
 'ruling': 124,
 'year': 153,
 'conducting': 32,
 'lengthy': 97,
 'military': 103,
 'campaign': 24,
 'throughout': 139,
 'western': 149,
 'asia': 11,
 'central': 27,
 'part': 110,
 'south': 128,
 'egypt': 49,
 'created': 38,
 'one': 106,
 'largest': 88,
 'empire': 50,
 'history': 68,
 'stretching': 130,
 'greece': 63,
 'northwestern': 105,
 'india': 76,
 'undefeated': 143,
 'battle': 18,
 'widely': 150,
 'considered': 35,
 'greatest': 62,
 'successful': 133,
 'commander': 30,
 'tutored': 142,
 'aristotle': 10,
 'shortly': 127,
 'assumption': 13,
 'kingship': 86,
 'campaigned': 25,
 'balkan': 17,
 'reasserted': 121,
 'control': 36,
 'thrace': 137,
 'illyria': 73,
 'ma

In [55]:
corpus[0]

'alexander iii macedon ancient greek romanized alexandros july bc june bc commonly known alexander great c king ancient greek kingdom macedon'

In [56]:
X[0].toarray()

array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

In [57]:
X[0].toarray().shape

(1, 154)
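
The effect of `binary=True` is easiest to see on a toy corpus where a word repeats. The two documents below are made up for illustration and are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (assumed for illustration); "alexander" appears twice in the first one
docs = ["alexander great king alexander", "alexander tutored aristotle"]

counts = CountVectorizer()             # default: raw term counts
binary = CountVectorizer(binary=True)  # presence/absence only

X_counts = counts.fit_transform(docs).toarray()
X_binary = binary.fit_transform(docs).toarray()

# Both vectorizers learn the same vocabulary, so the column index is shared
col = counts.vocabulary_["alexander"]
print(X_counts[0, col])  # 2
print(X_binary[0, col])  # 1
```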

## TF-IDF

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
cv = TfidfVectorizer()

In [21]:
X = cv.fit_transform(corpus)
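
As a rough sketch of where the TF-IDF weights come from, the block below recomputes them by hand on a toy corpus and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, no sublinear scaling), under which the weight is `tf * (ln((1 + n) / (1 + df)) + 1)` followed by L2 normalisation of each row. The documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents (assumed for illustration)
docs = ["alexander great king", "alexander tutored aristotle", "great empire"]

# Raw term counts, one row per document
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency

manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

auto = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Terms that occur in fewer documents get a larger `idf`, which is why rarer words in a sentence receive higher weights than words that appear throughout the corpus.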

In [23]:
corpus[0]

'alexander iii macedon ancient greek romanized alexandros july bc june bc commonly known alexander great c king ancient greek kingdom macedon'

In [25]:
X[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.25915813, 0.22170106, 0.38377032, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.25915813,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.22170106, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.19188516, 0.        , 0.        , 0.34146089,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.19188516, 0.  

In [28]:
cv = TfidfVectorizer(ngram_range=(3,3))  # use only trigrams (3-word sequences) as features
X = cv.fit_transform(corpus)

In [29]:
corpus[0]

'alexander iii macedon ancient greek romanized alexandros july bc june bc commonly known alexander great c king ancient greek kingdom macedon'

In [30]:
X[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.23570226, 0.23570226,
        0.        , 0.        , 0.23570226, 0.23570226, 0.23570226,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.23570226, 0.        , 0.23570226, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.23570226, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [31]:
cv.vocabulary_

{'alexander iii macedon': 14,
 'iii macedon ancient': 100,
 'macedon ancient greek': 124,
 'ancient greek romanized': 19,
 'greek romanized alexandros': 92,
 'romanized alexandros july': 155,
 'alexandros july bc': 17,
 'july bc june': 109,
 'bc june bc': 37,
 'june bc commonly': 110,
 'bc commonly known': 35,
 'commonly known alexander': 52,
 'known alexander great': 114,
 'alexander great king': 13,
 'great king ancient': 85,
 'king ancient greek': 111,
 'ancient greek kingdom': 18,
 'greek kingdom macedon': 91,
 'succeeded father philip': 169,
 'father philip ii': 81,
 'philip ii throne': 144,
 'ii throne bc': 98,
 'throne bc age': 175,
 'bc age spent': 33,
 'age spent ruling': 9,
 'spent ruling year': 165,
 'ruling year conducting': 156,
 'year conducting lengthy': 189,
 'conducting lengthy military': 53,
 'lengthy military campaign': 123,
 'military campaign throughout': 131,
 'campaign throughout western': 45,
 'throughout western asia': 176,
 'western asia central': 185,
 'asia 
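
A small sketch of how the trigram features are built, using a fragment of `corpus[0]`. It also shows why the stray token `c` never appears in the vocabulary above: the vectorizer's default token pattern only keeps tokens of two or more characters, so `great c king` is read as `great king`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fragment of corpus[0], including the leftover single-letter token "c"
doc = ["known alexander great c king ancient"]

vec = TfidfVectorizer(ngram_range=(3, 3))  # trigrams only
vec.fit(doc)

# vocabulary_ maps each trigram to a column index; sort the keys for a stable listing
trigrams = sorted(vec.vocabulary_)
print(trigrams)
# ['alexander great king', 'great king ancient', 'known alexander great']
```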

### Define max_features
-> Keeps only the top max_features terms, ranked by term frequency across the corpus

In [35]:
cv = TfidfVectorizer(ngram_range=(3,3), max_features=10)  
X = cv.fit_transform(corpus)

In [36]:
X[0].toarray()

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
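
The row above has a single non-zero entry because only one of the ten retained trigrams occurs in the first sentence. The toy sketch below, with made-up documents, shows the selection rule on single words: the vocabulary is cut down to the terms with the highest total count across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumed for illustration): "alexander" occurs 3 times, "great" twice,
# every other word once
docs = [
    "alexander great king",
    "alexander great empire",
    "alexander tutored aristotle",
]

vec = TfidfVectorizer(max_features=2)  # keep only the 2 most frequent terms
X_small = vec.fit_transform(docs)

kept_terms = sorted(vec.vocabulary_)
print(kept_terms)     # ['alexander', 'great']
print(X_small.shape)  # (3, 2)
```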