In [62]:
paragraph = """The universe is all of space and time and their contents. It comprises all of existence, any fundamental interaction, physical process and physical constant, and therefore all forms of energy and matter, and the structures they form, from sub-atomic particles to entire galactic filaments. Space and time, according to the prevailing cosmological theory of the Big Bang, emerged together 13.787±0.020 billion years ago, and the universe has been expanding ever since. Today the universe has expanded into an age and size that is physically only in parts observable as the observable universe, which is approximately 93 billion light-years in diameter at the present day, while the spatial size, if any, of the entire universe is unknown.

Some of the earliest cosmological models of the universe were developed by ancient Greek and Indian philosophers and were geocentric, placing Earth at the center. Over the centuries, more precise astronomical observations led Nicolaus Copernicus to develop the heliocentric model with the Sun at the center of the Solar System. In developing the law of universal gravitation, Isaac Newton built upon Copernicus's work as well as Johannes Kepler's laws of planetary motion and observations by Tycho Brahe.

Further observational improvements led to the realization that the Sun is one of a few hundred billion stars in the Milky Way, which is one of a few hundred billion galaxies in the observable universe. Many of the stars in a galaxy have planets. At the largest scale, galaxies are distributed uniformly and the same in all directions, meaning that the universe has neither an edge nor a center. At smaller scales, galaxies are distributed in clusters and superclusters which form immense filaments and voids in space, creating a vast foam-like structure. Discoveries in the early 20th century have suggested that the universe had a beginning and has been expanding since then."""

In [63]:
# Importing libraries
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re

In [64]:
# Tokenization
nltk.download('punkt')
sentences = nltk.sent_tokenize(paragraph)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maitr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [65]:
for i in sentences:
    print(i)

The universe is all of space and time and their contents.
It comprises all of existence, any fundamental interaction, physical process and physical constant, and therefore all forms of energy and matter, and the structures they form, from sub-atomic particles to entire galactic filaments.
Space and time, according to the prevailing cosmological theory of the Big Bang, emerged together 13.787±0.020 billion years ago, and the universe has been expanding ever since.
Today the universe has expanded into an age and size that is physically only in parts observable as the observable universe, which is approximately 93 billion light-years in diameter at the present day, while the spatial size, if any, of the entire universe is unknown.
Some of the earliest cosmological models of the universe were developed by ancient Greek and Indian philosophers and were geocentric, placing Earth at the center.
Over the centuries, more precise astronomical observations led Nicolaus Copernicus to develop the h

In [66]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [67]:
# Cleaning: keep letters only and lowercase each sentence
corpus = []
for sentence in sentences:
    cleaned = re.sub('[^a-zA-Z]', ' ', sentence)
    cleaned = cleaned.lower()
    corpus.append(cleaned)

In [68]:
print(corpus)

['the universe is all of space and time and their contents ', 'it comprises all of existence  any fundamental interaction  physical process and physical constant  and therefore all forms of energy and matter  and the structures they form  from sub atomic particles to entire galactic filaments ', 'space and time  according to the prevailing cosmological theory of the big bang  emerged together              billion years ago  and the universe has been expanding ever since ', 'today the universe has expanded into an age and size that is physically only in parts observable as the observable universe  which is approximately    billion light years in diameter at the present day  while the spatial size  if any  of the entire universe is unknown ', 'some of the earliest cosmological models of the universe were developed by ancient greek and indian philosophers and were geocentric  placing earth at the center ', 'over the centuries  more precise astronomical observations led nicolaus copernicus
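The long runs of spaces in the output above are where digits and punctuation (for example `13.787±0.020`) were replaced one character at a time. Below is a minimal sketch of the same cleaning step that also collapses those runs. `clean_sentence` is a hypothetical helper name, not part of the notebook, and the whitespace collapsing is an added assumption about what a tidier corpus might look like.

```python
import re


def clean_sentence(sentence: str) -> str:
    """Replace non-letters with spaces, collapse whitespace, and lowercase."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    collapsed = re.sub(r"\s+", " ", letters_only).strip()
    return collapsed.lower()


example = "Space and time emerged together 13.787±0.020 billion years ago."
cleaned = clean_sentence(example)
print(cleaned)  # space and time emerged together billion years ago
```

The extra spaces are harmless for `word_tokenize` and `CountVectorizer`, so collapsing them mainly makes the printed corpus easier to read.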

In [69]:
# Stemming
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per word
for sentence in corpus:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            print(stemmer.stem(word))

univers
space
time
content
compris
exist
fundament
interact
physic
process
physic
constant
therefor
form
energi
matter
structur
form
sub
atom
particl
entir
galact
filament
space
time
accord
prevail
cosmolog
theori
big
bang
emerg
togeth
billion
year
ago
univers
expand
ever
sinc
today
univers
expand
age
size
physic
part
observ
observ
univers
approxim
billion
light
year
diamet
present
day
spatial
size
entir
univers
unknown
earliest
cosmolog
model
univers
develop
ancient
greek
indian
philosoph
geocentr
place
earth
center
centuri
precis
astronom
observ
led
nicolau
copernicu
develop
heliocentr
model
sun
center
solar
system
develop
law
univers
gravit
isaac
newton
built
upon
copernicu
work
well
johann
kepler
law
planetari
motion
observ
tycho
brahe
observ
improv
led
realiz
sun
one
hundr
billion
star
milki
way
one
hundr
billion
galaxi
observ
univers
mani
star
galaxi
planet
largest
scale
galaxi
distribut
uniformli
direct
mean
univers
neither
edg
center
smaller
scale
galaxi
distribut
cluster
super
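The stems above are not always dictionary words, and distinct words can collapse to the same stem. A small sketch using only `PorterStemmer` (no corpus download needed) to make that visible. The word list is chosen for illustration from words in the paragraph.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the paragraph; related forms are grouped on purpose.
words = ["universe", "universal", "galaxies", "physical", "physically"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Here `universe` and `universal` both reduce to `univers`, and `physical` and `physically` both reduce to `physic`, which matches the stemming output printed by the notebook.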

In [70]:
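# Lemmatization
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per word
for sentence in corpus:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            # With no pos argument, lemmatize() treats every word as a noun
            print(lemmatizer.lemmatize(word))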

universe
space
time
content
comprises
existence
fundamental
interaction
physical
process
physical
constant
therefore
form
energy
matter
structure
form
sub
atomic
particle
entire
galactic
filament
space
time
according
prevailing
cosmological
theory
big
bang
emerged
together
billion
year
ago
universe
expanding
ever
since
today
universe
expanded
age
size
physically
part
observable
observable
universe
approximately
billion
light
year
diameter
present
day
spatial
size
entire
universe
unknown
earliest
cosmological
model
universe
developed
ancient
greek
indian
philosopher
geocentric
placing
earth
center
century
precise
astronomical
observation
led
nicolaus
copernicus
develop
heliocentric
model
sun
center
solar
system
developing
law
universal
gravitation
isaac
newton
built
upon
copernicus
work
well
johannes
kepler
law
planetary
motion
observation
tycho
brahe
observational
improvement
led
realization
sun
one
hundred
billion
star
milky
way
one
hundred
billion
galaxy
observable
universe
many
star

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maitr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [71]:
# Bag of words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# Alternative: cv = CountVectorizer(binary=True, ngram_range=(min_n, max_n))
# gives a binary (presence/absence) bag of words over n-grams of length min_n..max_n
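To make the `binary` and `ngram_range` options mentioned in the comment concrete, here is a small sketch on a two-sentence toy corpus. `toy_corpus` and the variable names are assumptions for illustration and are not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the universe and the stars", "the stars and the galaxies"]

# Plain counts: "the" occurs twice in each sentence.
counts = CountVectorizer().fit_transform(toy_corpus).toarray()

# Binary bag of words: every non-zero count is clipped to 1.
binary = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()

# Unigrams plus bigrams: the vocabulary also contains two-word phrases.
bigram_cv = CountVectorizer(ngram_range=(1, 2))
bigram_cv.fit(toy_corpus)

print(counts)
print(binary)
print(sorted(bigram_cv.vocabulary_))
```

Columns are ordered alphabetically by term, so for this toy corpus the column order is `and, galaxies, stars, the, universe`.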

In [72]:
X = cv.fit_transform(corpus)

In [73]:
cv.vocabulary_

{'the': 140,
 'universe': 153,
 'is': 77,
 'all': 3,
 'of': 103,
 'space': 128,
 'and': 6,
 'time': 146,
 'their': 141,
 'contents': 28,
 'it': 79,
 'comprises': 26,
 'existence': 48,
 'any': 7,
 'fundamental': 57,
 'interaction': 75,
 'physical': 110,
 'process': 118,
 'constant': 27,
 'therefore': 144,
 'forms': 55,
 'energy': 45,
 'matter': 89,
 'structures': 132,
 'they': 145,
 'form': 54,
 'from': 56,
 'sub': 133,
 'atomic': 13,
 'particles': 107,
 'to': 147,
 'entire': 46,
 'galactic': 59,
 'filaments': 52,
 'according': 0,
 'prevailing': 117,
 'cosmological': 30,
 'theory': 143,
 'big': 17,
 'bang': 14,
 'emerged': 44,
 'together': 149,
 'billion': 18,
 'years': 165,
 'ago': 2,
 'has': 66,
 'been': 15,
 'expanding': 50,
 'ever': 47,
 'since': 123,
 'today': 148,
 'expanded': 49,
 'into': 76,
 'an': 4,
 'age': 1,
 'size': 124,
 'that': 139,
 'physically': 111,
 'only': 105,
 'in': 73,
 'parts': 108,
 'observable': 100,
 'as': 10,
 'which': 161,
 'approximately': 8,
 'light': 86,


In [74]:
print(corpus[0])

the universe is all of space and time and their contents 


In [75]:
print(X[0])

  (0, 140)	1
  (0, 153)	1
  (0, 77)	1
  (0, 3)	1
  (0, 103)	1
  (0, 128)	1
  (0, 6)	2
  (0, 146)	1
  (0, 141)	1
  (0, 28)	1
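Each line of the sparse output reads `(row, column)  count`, where the column is the word's index in `cv.vocabulary_`. Column 6 is `and`, which appears twice in the first sentence. Below is a sketch of turning such a row back into word counts, assuming a fresh vectorizer fitted on that single sentence, so its indices differ from those above. `cv_demo` and `word_counts` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "the universe is all of space and time and their contents"

cv_demo = CountVectorizer()
row = cv_demo.fit_transform([sentence])  # sparse matrix with a single row

# Invert the word -> column mapping so columns can be looked up by index.
index_to_word = {index: word for word, index in cv_demo.vocabulary_.items()}

# row.indices holds the non-zero column indices, row.data the matching counts.
word_counts = {
    index_to_word[col]: int(count) for col, count in zip(row.indices, row.data)
}
print(word_counts)
```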


In [85]:
import numpy as np
# Optional: the vectorizers also accept a plain list of strings
corpus = np.array(corpus)
print(corpus)

['the universe is all of space and time and their contents '
 'it comprises all of existence  any fundamental interaction  physical process and physical constant  and therefore all forms of energy and matter  and the structures they form  from sub atomic particles to entire galactic filaments '
 'space and time  according to the prevailing cosmological theory of the big bang  emerged together              billion years ago  and the universe has been expanding ever since '
 'today the universe has expanded into an age and size that is physically only in parts observable as the observable universe  which is approximately    billion light years in diameter at the present day  while the spatial size  if any  of the entire universe is unknown '
 'some of the earliest cosmological models of the universe were developed by ancient greek and indian philosophers and were geocentric  placing earth at the center '
 'over the centuries  more precise astronomical observations led nicolaus copernicus

In [91]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# max_features=3 keeps only the 3 most frequent terms across the corpus
tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=3)
X = tfidf.fit_transform(corpus)
print(X[0].toarray())

[[0.83536604 0.41768302 0.35735763]]
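The three numbers above are the TF-IDF weights of the only three terms kept by `max_features=3`, in column order. A sketch on a toy corpus (an assumption for illustration, with term counts chosen to avoid ties) showing how to see which terms were kept and that each row is L2-normalised by default:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the universe and the stars",
    "the universe and the galaxies",
    "the universe is expanding",
]

tfidf_demo = TfidfVectorizer(ngram_range=(1, 1), max_features=3)
weights = tfidf_demo.fit_transform(toy_corpus).toarray()

# Terms listed in column order, so kept_terms[j] labels weights[:, j].
kept_terms = sorted(tfidf_demo.vocabulary_, key=tfidf_demo.vocabulary_.get)
row_norms = np.linalg.norm(weights, axis=1)

print(kept_terms)
print(weights)
print(row_norms)
```

In the notebook itself, inspecting `tfidf.vocabulary_` in the same way would show which three words the columns of `X` correspond to.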
