Task 5:
1.	Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2.	Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3.	Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4.	Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?
5.	*(not necessary) Find Top 100 most used verbs in sentences with Alice. Get word vectors using a pre-trained word2vec model and visualize them. Compare the words using embeddings. 



In [1]:
import re
import pandas as pd
import numpy as np
from nltk.tokenize import TreebankWordTokenizer

pd.set_option("display.max_colwidth", 100)

In [2]:
#removing stopwords
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords") 
stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
#lemmatization
from nltk.stem import  WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
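def clean_and_tokenize_text(text):
    # replace punctuation with spaces; flags must be passed by keyword,
    # because the fourth positional argument of re.sub is `count`
    text = re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE)
    # \w also matches the underscore, so remove it separately
    text = re.sub(r"_", " ", text, flags=re.UNICODE)
    text = text.lower()
    tokens = TreebankWordTokenizer().tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    text = " ".join(tokens)

    return text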
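
A note on the `re.sub` calls in the function above: the fourth positional parameter of `re.sub` is `count`, not `flags`, so a flag passed positionally is read as a limit on the number of replacements. The sketch below uses a made-up sample string to show the difference; recent Python versions may also emit a deprecation warning for the positional form.

```python
import re

# Made-up sample with 40 punctuation marks.
sample = "a," * 40

# Positional form: re.UNICODE (numeric value 32) is interpreted as `count`,
# so only the first 32 commas are replaced.
limited = re.sub(r"[^\w\s]", " ", sample, re.UNICODE)

# Keyword form: the value is treated as a flag and every match is replaced.
full = re.sub(r"[^\w\s]", " ", sample, flags=re.UNICODE)

print("commas left (positional):", limited.count(","))
print("commas left (keyword):   ", full.count(","))
```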


Open our text:

In [5]:
with open('Alice.txt', 'r', encoding='utf-8') as f:
    text = f.read()

3.	Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?

In [6]:
# divide the text into chapters for task 3
chapters = text.split('CHAPTER')

In [7]:
print([len(x) for x in chapters])

[970, 30, 27, 39, 43, 35, 24, 25, 36, 33, 31, 30, 29, 11549, 10951, 9259, 13882, 12009, 13842, 12702, 13668, 12629, 11409, 10385, 30604]


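The 12 long entries at the end of the list are the chapters; the short entries before them come from the file header and the table of contents. Let's keep only the long entries.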

In [8]:
chapters = [x for x in chapters if len(x) > 2000]

Let's also cut the license agreement from the last chapter.

In [9]:
chapters[-1] = chapters[-1].split('THE END')[0]

In [10]:
print([len(x) for x in chapters])

[11549, 10951, 9259, 13882, 12009, 13842, 12702, 13668, 12629, 11409, 10385, 11640]
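
The splitting steps above depend on a length threshold and leave the Roman numeral at the start of each chapter. As a possible alternative, here is a minimal sketch that splits on the whole heading with a regular expression. The toy text, the `split_chapters` name, and the `min_length` value are made up for illustration, and the pattern assumes headings of the form `CHAPTER <Roman numeral>.`, which is what the printed excerpts suggest.

```python
import re

# Made-up miniature text that mimics the layout of the Gutenberg file:
# a contents list, the chapter bodies, and a trailing license.
toy_text = (
    "Contents\n"
    " CHAPTER I. Down the Rabbit-Hole\n"
    " CHAPTER II. The Pool of Tears\n"
    "\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n\n"
    + "Alice was beginning to get very tired. " * 3 + "\n\n"
    "CHAPTER II.\nThe Pool of Tears\n\n"
    + "Curiouser and curiouser! " * 3 + "\n"
    "THE END\nLicense text that should be dropped."
)


def split_chapters(text, min_length=60):
    """Split on 'CHAPTER <numeral>.' and keep only the long parts."""
    parts = re.split(r"CHAPTER\s+[IVXLC]+\.", text)
    chapters = [part for part in parts if len(part) >= min_length]
    # Cut everything after the end-of-book marker in the last chapter.
    chapters[-1] = chapters[-1].split("THE END")[0]
    return [chapter.strip() for chapter in chapters]


toy_chapters = split_chapters(toy_text)
for number, chapter in enumerate(toy_chapters, start=1):
    print(f"Chapter {number}: {chapter[:40]!r}")
```

Because the numeral is consumed by the split pattern, each chapter string starts directly with its title.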


In [11]:
for num, chapter in enumerate(chapters):
    print(f"Chapter {num+1}:")
    print(chapter[:300])

Chapter 1:
 I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“withou
Chapter 2:
 II.
The Pool of Tears


“Curiouser and curiouser!” cried Alice (she was so much surprised, that
for the moment she quite forgot how to speak good English); “now I’m
opening out like the largest telescope that ever was! Good-bye, feet!”
(for when she looked down at her feet, they seemed to be almost
Chapter 3:
 III.
A Caucus-Race and a Long Tale


They were indeed a queer-looking party that assembled on the bank—the
birds with draggled feathers, the animals with their fur clinging close
to them, and all dripping wet, cross, and uncomfortable.

The first question of course was, how to get dry again: they h
Chapter 4:
 IV.
The Rabbit Sends in a Little Bill


It was the W

In [12]:
# put the data into a DataFrame
indexes = [f"chapter {i+1}" for i in range(len(chapters))]
data = pd.DataFrame(data=chapters, index=indexes, columns=['orig_text'])
data['processed_text'] = [clean_and_tokenize_text(chapter[4:]) for chapter in chapters]
data

Unnamed: 0,orig_text,processed_text
chapter 1,I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on...,rabbit hole alice beginning get tired sitting sister bank nothing twice peeped book sister readi...
chapter 2,"II.\nThe Pool of Tears\n\n\n“Curiouser and curiouser!” cried Alice (she was so much surprised, ...",pool tear curiouser curiouser cried alice much surprised moment quite forgot speak good english ...
chapter 3,III.\nA Caucus-Race and a Long Tale\n\n\nThey were indeed a queer-looking party that assembled ...,caucus race long tale indeed queer looking party assembled bank bird draggled feather animal fur...
chapter 4,"IV.\nThe Rabbit Sends in a Little Bill\n\n\nIt was the White Rabbit, trotting slowly back again...",rabbit sends little bill white rabbit trotting slowly back looking anxiously went lost something...
chapter 5,V.\nAdvice from a Caterpillar\n\n\nThe Caterpillar and Alice looked at each other for some time...,advice caterpillar caterpillar alice looked time silence last caterpillar took hookah mouth addr...
chapter 6,"VI.\nPig and Pepper\n\n\nFor a minute or two she stood looking at the house, and wondering what...",pig pepper minute two stood looking house wondering next suddenly footman livery came running wo...
chapter 7,"VII.\nA Mad Tea-Party\n\n\nThere was a table set out under a tree in front of the house, and th...",mad tea party table set tree front house march hare hatter tea dormouse sitting fast asleep two ...
chapter 8,VIII.\nThe Queen’s Croquet-Ground\n\n\nA large rose-tree stood near the entrance of the garden:...,queen croquet ground large rose tree stood near entrance garden rose growing white three gardene...
chapter 9,"IX.\nThe Mock Turtle’s Story\n\n\n“You can’t think how glad I am to see you again, you dear old...",mock turtle story think glad see dear old thing said duchess tucked arm affectionately alice wal...
chapter 10,"X.\nThe Lobster Quadrille\n\n\nThe Mock Turtle sighed deeply, and drew the back of one flapper ...",lobster quadrille mock turtle sighed deeply drew back one flapper across eye looked alice tried ...


### TF-IDF

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
vectorizer_tfidf = TfidfVectorizer()

In [15]:
vectorizer_tfidf.fit(data.processed_text)

TfidfVectorizer()

In [16]:
X = vectorizer_tfidf.fit_transform(data.processed_text)
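
To make the weights in `X` easier to interpret, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three toy documents and the helper names `tfidf_rows` and `top_terms` are made up for illustration; the real vectorizer also tokenizes with its own pattern, which this sketch replaces with a plain `split()`.

```python
import math
from collections import Counter

# Made-up toy "chapters", already cleaned and space-separated.
docs = [
    "rabbit hole rabbit door key",
    "mouse pool tear mouse",
    "rabbit mouse race dodo",
]


def tfidf_rows(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalise the row so that documents of different length compare.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


def top_terms(row, top_n=3):
    """Terms with the highest weight; ties are broken alphabetically."""
    ranked = sorted(row.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


rows = tfidf_rows(docs)
for number, row in enumerate(rows, start=1):
    print(f"doc {number}:", top_terms(row))
```

A term that occurs in every document gets the smallest possible idf, which is why very common words such as "said" can only rank highly in a chapter through a large raw count.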

In [17]:
def get_top_words(response, feature_names, top_n=10):
    # indices of the top_n largest TF-IDF weights among the row's non-zero entries
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    return feature_names[response.indices[sorted_nzs]]

In [18]:
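# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
feature_names = np.array(vectorizer_tfidf.get_feature_names_out())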

In [19]:
# get the top words for each chapter; request 11 so that 10 remain if 'alice' is dropped
top_words = [get_top_words(x, feature_names, 11) for x in X]

In [20]:
# remove the word 'alice' and keep at most 10 words
for i, words in enumerate(top_words):
    top_words[i] = [x for x in words if x != 'alice'][:10]

In [21]:
data['top_words'] = [', '.join(x) for x in top_words]

In [22]:
data.top_words.to_frame()

Unnamed: 0,top_words
chapter 1,"little, bat, rabbit, door, key, way, eat, think, like, either"
chapter 2,"mouse, pool, little, oh, swam, dear, ll, cat, said, tear"
chapter 3,"said, mouse, dodo, race, lory, dry, thimble, know, bird, dinah"
chapter 4,"bill, little, window, rabbit, puppy, one, fan, bottle, chimney, said"
chapter 5,"caterpillar, said, pigeon, serpent, egg, youth, you, size, father, little"
chapter 6,"said, footman, cat, baby, mad, duchess, pig, wow, like, cook"
chapter 7,"hatter, dormouse, said, march, hare, tea, twinkle, it, time, well"
chapter 8,"queen, said, hedgehog, king, gardener, soldier, cat, five, procession, executioner"
chapter 9,"turtle, said, mock, gryphon, duchess, moral, queen, went, school, it"
chapter 10,"turtle, mock, gryphon, said, lobster, dance, join, beautiful, soup, whiting"


4.	Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?

In [23]:
data_copy = data.copy()

In [24]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [25]:
# split the text into sentences
data['orig_sent'] = data.orig_text.apply(lambda x:nltk.tokenize.sent_tokenize(x.replace('\n', ' '))[1:])
data['proc_sent'] = data.orig_sent.apply(lambda x: [clean_and_tokenize_text(item) for item in x])

In [26]:
data.head()

Unnamed: 0,orig_text,processed_text,top_words,orig_sent,proc_sent
chapter 1,I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on...,rabbit hole alice beginning get tired sitting sister bank nothing twice peeped book sister readi...,"little, bat, rabbit, door, key, way, eat, think, like, either",[Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the ba...,[rabbit hole alice beginning get tired sitting sister bank nothing twice peeped book sister read...
chapter 2,"II.\nThe Pool of Tears\n\n\n“Curiouser and curiouser!” cried Alice (she was so much surprised, ...",pool tear curiouser curiouser cried alice much surprised moment quite forgot speak good english ...,"mouse, pool, little, oh, swam, dear, ll, cat, said, tear","[The Pool of Tears “Curiouser and curiouser!” cried Alice (she was so much surprised, that for...",[pool tear curiouser curiouser cried alice much surprised moment quite forgot speak good english...
chapter 3,III.\nA Caucus-Race and a Long Tale\n\n\nThey were indeed a queer-looking party that assembled ...,caucus race long tale indeed queer looking party assembled bank bird draggled feather animal fur...,"said, mouse, dodo, race, lory, dry, thimble, know, bird, dinah",[A Caucus-Race and a Long Tale They were indeed a queer-looking party that assembled on the ba...,[caucus race long tale indeed queer looking party assembled bank bird draggled feather animal fu...
chapter 4,"IV.\nThe Rabbit Sends in a Little Bill\n\n\nIt was the White Rabbit, trotting slowly back again...",rabbit sends little bill white rabbit trotting slowly back looking anxiously went lost something...,"bill, little, window, rabbit, puppy, one, fan, bottle, chimney, said","[The Rabbit Sends in a Little Bill It was the White Rabbit, trotting slowly back again, and lo...",[rabbit sends little bill white rabbit trotting slowly back looking anxiously went lost somethin...
chapter 5,V.\nAdvice from a Caterpillar\n\n\nThe Caterpillar and Alice looked at each other for some time...,advice caterpillar caterpillar alice looked time silence last caterpillar took hookah mouth addr...,"caterpillar, said, pigeon, serpent, egg, youth, you, size, father, little","[“Who are _you?_” said the Caterpillar., This was not an encouraging opening for a conversation....","[said caterpillar, encouraging opening conversation, alice replied rather shyly hardly know sir ..."


In [27]:
# keep only the sentences that mention 'alice'
data['proc_sent'] = data.proc_sent.apply(lambda x: ' '.join([item for item in x if 'alice' in item]))
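
The filter above tests `'alice' in item` on a string, which is a substring check. A small sketch with made-up sentences shows how that differs from matching whole tokens; whether the difference matters for this particular book is not checked here.

```python
# Made-up, already cleaned sentences.
sentences = [
    "alice looked rabbit",
    "queen spoke malice",
    "said alice politely",
]

# Substring check: also matches "malice".
substring_hits = [s for s in sentences if "alice" in s]

# Token check: matches only the standalone word.
token_hits = [s for s in sentences if "alice" in s.split()]

print(len(substring_hits), "sentences by substring")
print(len(token_hits), "sentences by token")
```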

In [28]:
data.head()

Unnamed: 0,orig_text,processed_text,top_words,orig_sent,proc_sent
chapter 1,I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on...,rabbit hole alice beginning get tired sitting sister bank nothing twice peeped book sister readi...,"little, bat, rabbit, door, key, way, eat, think, like, either",[Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the ba...,rabbit hole alice beginning get tired sitting sister bank nothing twice peeped book sister readi...
chapter 2,"II.\nThe Pool of Tears\n\n\n“Curiouser and curiouser!” cried Alice (she was so much surprised, ...",pool tear curiouser curiouser cried alice much surprised moment quite forgot speak good english ...,"mouse, pool, little, oh, swam, dear, ll, cat, said, tear","[The Pool of Tears “Curiouser and curiouser!” cried Alice (she was so much surprised, that for...",pool tear curiouser curiouser cried alice much surprised moment quite forgot speak good english ...
chapter 3,III.\nA Caucus-Race and a Long Tale\n\n\nThey were indeed a queer-looking party that assembled ...,caucus race long tale indeed queer looking party assembled bank bird draggled feather animal fur...,"said, mouse, dodo, race, lory, dry, thimble, know, bird, dinah",[A Caucus-Race and a Long Tale They were indeed a queer-looking party that assembled on the ba...,first question course get dry consultation minute seemed quite natural alice find talking famili...
chapter 4,"IV.\nThe Rabbit Sends in a Little Bill\n\n\nIt was the White Rabbit, trotting slowly back again...",rabbit sends little bill white rabbit trotting slowly back looking anxiously went lost something...,"bill, little, window, rabbit, puppy, one, fan, bottle, chimney, said","[The Rabbit Sends in a Little Bill It was the White Rabbit, trotting slowly back again, and lo...",dropped wonder alice guessed moment looking fan pair white kid glove good naturedly began huntin...
chapter 5,V.\nAdvice from a Caterpillar\n\n\nThe Caterpillar and Alice looked at each other for some time...,advice caterpillar caterpillar alice looked time silence last caterpillar took hookah mouth addr...,"caterpillar, said, pigeon, serpent, egg, youth, you, size, father, little","[“Who are _you?_” said the Caterpillar., This was not an encouraging opening for a conversation....",alice replied rather shyly hardly know sir present least know got morning think must changed sev...


In [29]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [30]:
# helper that checks whether a tag returned by nltk.pos_tag marks a verb (Penn Treebank verb tags start with 'VB')
is_verb = lambda pos: pos[:2] == 'VB'

In [31]:
# extract all verbs from the sentences
data['proc_sent'] = data.proc_sent.apply(lambda x: ' '.join([word for (word, pos) in nltk.pos_tag(x.split()) if is_verb(pos)]))

In [32]:
data.head()

Unnamed: 0,orig_text,processed_text,top_words,orig_sent,proc_sent
chapter 1,I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on...,rabbit hole alice beginning get tired sitting sister bank nothing twice peeped book sister readi...,"little, bat, rabbit, door, key, way, eat, think, like, either",[Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the ba...,beginning get sitting peeped reading considering hot made making worth getting picking ran think...
chapter 2,"II.\nThe Pool of Tears\n\n\n“Curiouser and curiouser!” cried Alice (she was so much surprised, ...",pool tear curiouser curiouser cried alice much surprised moment quite forgot speak good english ...,"mouse, pool, little, oh, swam, dear, ll, cat, said, tear","[The Pool of Tears “Curiouser and curiouser!” cried Alice (she was so much surprised, that for...",cried manage walk want go ashamed said say kept waiting felt help came began started dropped glo...
chapter 3,III.\nA Caucus-Race and a Long Tale\n\n\nThey were indeed a queer-looking party that assembled ...,caucus race long tale indeed queer looking party assembled bank bird draggled feather animal fur...,"said, mouse, dodo, race, lory, dry, thimble, know, bird, dinah",[A Caucus-Race and a Long Tale They were indeed a queer-looking party that assembled on the ba...,get seemed find talking known say know allow knowing refused said kept fixed catch getting conti...
chapter 4,"IV.\nThe Rabbit Sends in a Little Bill\n\n\nIt was the White Rabbit, trotting slowly back again...",rabbit sends little bill white rabbit trotting slowly back looking anxiously went lost something...,"bill, little, window, rabbit, puppy, one, fan, bottle, chimney, said","[The Rabbit Sends in a Little Bill It was the White Rabbit, trotting slowly back again, and lo...",dropped guessed looking glove began hunting seen seemed changed vanished went hunting called fri...
chapter 5,V.\nAdvice from a Caterpillar\n\n\nThe Caterpillar and Alice looked at each other for some time...,advice caterpillar caterpillar alice looked time silence last caterpillar took hookah mouth addr...,"caterpillar, said, pigeon, serpent, egg, youth, you, size, father, little","[“Who are _you?_” said the Caterpillar., This was not an encouraging opening for a conversation....",replied know got changed said explain said said put replied begin confusing said found said know...


In [33]:
# find the top 10 verbs per chapter using TF-IDF
X = vectorizer_tfidf.transform(data.proc_sent)
top_verbs  = [get_top_words(x, feature_names, 10) for x in X]
data['top_verbs'] = [', '.join(x) for x in top_verbs]

In [34]:
#results
data[['top_words', 'top_verbs']]

Unnamed: 0,top_words,top_verbs
chapter 1,"little, bat, rabbit, door, key, way, eat, think, like, either","said, going, shutting, say, get, see, went, think, finding, considering"
chapter 2,"mouse, pool, little, oh, swam, dear, ll, cat, said, tear","cried, said, go, went, glove, come, thought, kept, skurried, swimming"
chapter 3,"said, mouse, dodo, race, lory, dry, thimble, know, bird, dinah","said, turning, say, looking, offended, got, kept, known, paused, addressing"
chapter 4,"bill, little, window, rabbit, puppy, one, fan, bottle, chimney, said","made, said, thought, coaxing, began, glove, hunting, came, got, appeared"
chapter 5,"caterpillar, said, pigeon, serpent, egg, youth, you, size, father, little","said, replied, changed, know, tasted, found, looking, think, feel, say"
chapter 6,"said, footman, cat, baby, mad, duchess, pig, wow, like, cook","said, grunted, went, know, grinned, vanished, take, gone, opened, thought"
chapter 7,"hatter, dormouse, said, march, hare, tea, twinkle, it, time, well","said, say, know, replied, took, learning, take, sighed, looking, went"
chapter 8,"queen, said, hedgehog, king, gardener, soldier, cat, five, procession, executioner","said, looked, went, appeared, smiled, lie, tucked, going, bowed, came"
chapter 9,"turtle, said, mock, gryphon, duchess, moral, queen, went, school, it","said, say, ordered, went, replied, love, left, ventured, go, exclaimed"
chapter 10,"turtle, mock, gryphon, said, lobster, dance, join, beautiful, soup, whiting","said, whiting, passed, dancing, checked, follows, replied, thank, began, seen"
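
The table above ranks verbs per chapter by TF-IDF weight. Since task 4 asks for the most used verbs, a plain frequency count over all sentences with Alice may answer it more directly. Below is a minimal sketch of that idea; the tagged tokens are made up, in the `(word, tag)` shape that `nltk.pos_tag` returns, and `most_used_verbs` is a hypothetical helper name.

```python
from collections import Counter

# Made-up (word, tag) pairs with Penn Treebank tags.
tagged = [
    ("alice", "NN"), ("said", "VBD"), ("went", "VBD"), ("rabbit", "NN"),
    ("said", "VBD"), ("thought", "VBD"), ("looking", "VBG"), ("said", "VBD"),
    ("went", "VBD"), ("little", "JJ"),
]


def most_used_verbs(tagged_tokens, top_n=10):
    """Count tokens whose tag starts with 'VB' and return the most frequent."""
    verbs = [word for word, tag in tagged_tokens if tag.startswith("VB")]
    return Counter(verbs).most_common(top_n)


top = most_used_verbs(tagged, top_n=3)
for verb, count in top:
    print(verb, count)
```

In the notebook, the input would be the concatenated `nltk.pos_tag` output for all chapters. Tagging the original sentences rather than the stop-word-stripped text may also reduce mis-tagged nouns such as "glove" or "whiting" in the results.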
