Task 4:
1.	Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2.	Perform any necessary preprocessing on the text: convert it to lower case, remove stop words, numbers and other non-alphabetic characters, and lemmatize.
3.	Find the top 10 most important words (for example, by TF-IDF score) in each chapter of the text, excluding "Alice". How would you name each chapter based on these tokens?
4.	Find the top 10 most used verbs in sentences that mention Alice. What does Alice do most often?
5.	*(optional) Find the top 100 most used verbs in sentences that mention Alice. Get word vectors for them from a pre-trained word2vec model and visualize them. Compare the words using their embeddings.



In [95]:
import re
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy

# Load text

In [96]:
with open("11-0.txt", "r", encoding="utf8") as file:
    text = file.read()

In [97]:
text[:1000]

'The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.\n\nTitle: Alice’s Adventures in Wonderland\n\nAuthor: Lewis Carroll\n\nRelease Date: January, 1991 [eBook #11]\n[Most recently updated: October 12, 2020]\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\nProduced by: Arthur DiBianca and David Widger\n\n*** START OF THE PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***\n\n[Illustration]\n\n\n\n\nAlice’s Adventures in Wonderland\n\nby Lewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3.0\n\nConten
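The preview above shows that the file opens with Project Gutenberg front matter before the `*** START OF THE PROJECT GUTENBERG EBOOK` line. A minimal sketch of trimming that material, assuming the file also carries a matching `*** END OF THE PROJECT GUTENBERG EBOOK` line near its end (that marker is not visible in the preview). `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the notebook:

```python
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
# Assumption: the closing marker mirrors the opening one.
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the START and END marker lines, if present."""
    start = raw_text.find(START_MARKER)
    if start != -1:
        # Skip the remainder of the marker line itself.
        line_end = raw_text.find("\n", start)
        raw_text = raw_text[line_end + 1:] if line_end != -1 else ""
    end = raw_text.find(END_MARKER)
    if end != -1:
        raw_text = raw_text[:end]
    return raw_text.strip()


sample_file = (
    "Title: Alice's Adventures in Wonderland\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ALICE ***\n\n"
    "Alice was beginning to get very tired.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ALICE ***\n"
    "License text"
)
book_only = strip_gutenberg_boilerplate(sample_file)
print(book_only)
```

Running spaCy on the trimmed text instead of the raw file would keep the header and license sentences out of any later counts.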

In [98]:
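# Split on the literal "CHAPTER". The first 13 pieces are the front matter and
# the table-of-contents entries, so the chapter bodies start at index 13.
chapters = re.split("CHAPTER", text)
chapters = chapters[13:]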
for chapter in chapters:
    print(chapter[:500])

 I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would b
 II.
The Pool of Tears


“Curiouser and curiouser!” cried Alice (she was so much surprised, that
for the moment she quite forgot how to speak good English); “now I’m
opening out like the largest telescope that ever was! Good-bye, feet!”
(for when she looked down at her feet, they seemed to be almost out of
sight, they were getting so far off). “Oh, my poor little feet, I
wonder who will put on your shoes and stockings for you now, dears? I’m
sure _I_ shan’t be able! I shall be a great deal too 

In [99]:
# Cut everything from "THE END" onwards (the Project Gutenberg license) off the last chapter.
chapters[11] = re.split("THE END", chapters[11])[0]
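The split above depends on a hard-coded offset and a hard-coded last index. A sketch of an alternative that matches only heading lines of the form `CHAPTER <Roman numeral>.` standing alone on a line. It assumes the table-of-contents entries carry the chapter title on the same line and therefore do not match. `split_chapters` and `CHAPTER_HEADING` are hypothetical names:

```python
import re

# A heading is "CHAPTER", a Roman numeral and a dot, alone on its line.
CHAPTER_HEADING = re.compile(r"^CHAPTER [IVXLC]+\.[ \t]*$", flags=re.MULTILINE)


def split_chapters(text, end_marker="THE END"):
    """Split the book into chapter bodies and trim anything after end_marker."""
    parts = CHAPTER_HEADING.split(text)
    # parts[0] is everything before the first heading (front matter, contents).
    chapters = [part.strip() for part in parts[1:]]
    if chapters:
        chapters[-1] = chapters[-1].split(end_marker)[0].strip()
    return chapters


sample = (
    "Contents\n"
    " CHAPTER I.     Down the Rabbit-Hole\n"
    " CHAPTER II.    The Pool of Tears\n\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired.\n\n"
    "CHAPTER II.\nThe Pool of Tears\n\nCuriouser and curiouser!\n\n"
    "THE END\n\nlicense text"
)
sample_chapters = split_chapters(sample)
print(len(sample_chapters))
```

Each returned chapter still begins with its title line, as in the notebook's own split.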

# Find the 10 most important words in each chapter

In [100]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')



In [101]:
nlp = spacy.load("en_core_web_sm")
tokenized_chapters = []
for chapter in chapters:
    doc = nlp(chapter)
    # Keep lower-cased lemmas of alphabetic, non-stop-word tokens, skipping "Alice".
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and not token.is_stop and token.text != "Alice"
    ]
    tokenized_chapters.append(" ".join(lemmas))

In [102]:
len(tokenized_chapters)

12

In [103]:
for tokenized_chapter in tokenized_chapters:
    print(tokenized_chapter[:20])

rabbit hole begin ti
ii pool tears curiou
iii caucus race long
iv rabbit send littl
advice caterpillar c
vi pig pepper minute
vii mad tea party ta
viii queen croquet g
ix mock turtle story
lobster quadrille mo
xi steal tarts king 
xii evidence cry for
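The previews above show that the Roman numeral from some chapter headings (`ii`, `iii`, `iv`, ...) survives preprocessing as the first token. A sketch of removing it before vectorizing. `drop_leading_numeral` is a hypothetical helper, and the pattern only covers numerals up to 39, which is enough for twelve chapters:

```python
import re

# Lower-case Roman numerals from 1 to 39 (i, ii, ..., xxxix).
ROMAN_NUMERAL = re.compile(r"x{0,3}(ix|iv|v?i{0,3})")


def drop_leading_numeral(tokenized_chapter):
    """Remove the first token if it is a Roman numeral, otherwise return the input."""
    first, _, rest = tokenized_chapter.partition(" ")
    if first and ROMAN_NUMERAL.fullmatch(first):
        return rest
    return tokenized_chapter


previews = ["rabbit hole begin", "ii pool tears", "viii queen croquet", "xii evidence cry"]
cleaned = [drop_leading_numeral(preview) for preview in previews]
print(cleaned)
```

Only the first token is checked, so ordinary words later in a chapter are never touched.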


In [104]:
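# Compute TF-IDF, treating each chapter as one document.
# Note: the [1:] slice leaves Chapter I out, so X has 11 rows and row 0 is Chapter II.
# Pass tokenized_chapters unsliced to score all 12 chapters.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tokenized_chapters[1:])
X = X.toarray()
X.shape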

(11, 1696)

In [105]:
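feature_names = np.asarray(vectorizer.get_feature_names_out())
for chapter_id in range(X.shape[0]):
    # Indices of the 10 highest TF-IDF scores in this row, best first.
    top_words = np.argsort(X[chapter_id])[::-1][:10]
    # chapter_id 0 is Chapter II, because the matrix was fit on tokenized_chapters[1:].
    print(f"Chapter_id {chapter_id} - {feature_names[top_words]}")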

Chapter_id 0 - ['mouse' 'little' 'pool' 'say' 'oh' 'cat' 'dear' 'cry' 'swam' 'think']
Chapter_id 1 - ['say' 'mouse' 'dodo' 'prize' 'race' 'lory' 'know' 'dry' 'thimble' 'dinah']
Chapter_id 2 - ['bill' 'rabbit' 'little' 'window' 'puppy' 'grow' 'say' 'bottle' 'glove'
 'fan']
Chapter_id 3 - ['caterpillar' 'say' 'pigeon' 'serpent' 'egg' 'youth' 'size' 'think'
 'father' 'little']
Chapter_id 4 - ['say' 'cat' 'footman' 'baby' 'mad' 'duchess' 'grin' 'go' 'think' 'sneeze']
Chapter_id 5 - ['hatter' 'dormouse' 'say' 'hare' 'march' 'tea' 'twinkle' 'time' 'know'
 'go']
Chapter_id 6 - ['queen' 'say' 'hedgehog' 'king' 'gardener' 'look' 'go' 'cat' 'soldier'
 'procession']
Chapter_id 7 - ['say' 'turtle' 'mock' 'gryphon' 'duchess' 'moral' 'queen' 'go' 'think'
 'school']
Chapter_id 8 - ['turtle' 'mock' 'gryphon' 'say' 'lobster' 'dance' 'soup' 'beautiful'
 'whiting' 'oop']
Chapter_id 9 - ['king' 'hatter' 'say' 'court' 'dormouse' 'witness' 'jury' 'juror'
 'officer' 'queen']
Chapter_id 10 - ['say' 'king' 'ju
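A word such as `say` stays near the top of almost every list above even though it occurs in every chapter. With scikit-learn's default settings the smoothed inverse document frequency is ln((1 + n) / (1 + df)) + 1, which never drops below 1, so a very frequent term keeps a high score. The pure-Python sketch below reproduces that weighting with L2 normalization on a toy corpus. `tfidf_top_words` is a hypothetical helper and is not a replacement for `TfidfVectorizer`, whose default tokenizer also discards single-character tokens:

```python
import math
from collections import Counter


def tfidf_top_words(documents, k=3):
    """Return the k highest-weighted (term, weight) pairs for each document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, matching scikit-learn's default (smooth_idf=True).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    top = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize each document vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        top.append([(term, weight / norm) for term, weight in ranked[:k]])
    return top


toy_chapters = ["say say say mouse pool", "say say hatter tea tea tea"]
toy_top = tfidf_top_words(toy_chapters)
# Number the chapters from 1 so the labels match the chapter order.
for chapter_number, words in enumerate(toy_top, start=1):
    print(chapter_number, [term for term, _ in words])
```

In the toy corpus `say` occurs in both documents and still ranks first in the one where it is the most frequent term. Adding the most common verbs to the stop list would be one way to get more distinctive chapter keywords.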

# Find the Top 10 most used verbs in sentences with Alice.

In [106]:
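# Note: this parses the whole file, including the Project Gutenberg header and license.
doc = nlp(text)
all_verbs = []
for sent in doc.sents:
    # Case-insensitive substring match, so "Alice's" and "ALICE" also count.
    if "alice" in sent.text.lower():
        for token in sent:
            if (token.pos_ == "VERB") and token.is_alpha and (not token.is_stop):
                all_verbs.append(token.lemma_.lower())

# Count lemmas so that "said" and "says" are both tallied under "say".
counter = Counter(all_verbs)
counter.most_common(10)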

[('say', 178),
 ('think', 85),
 ('go', 51),
 ('look', 44),
 ('know', 39),
 ('begin', 37),
 ('come', 29),
 ('get', 26),
 ('feel', 25),
 ('find', 23)]
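For the optional step 5, verbs are usually compared through the cosine similarity of their vectors. A minimal sketch of that comparison follows. The three-dimensional vectors are invented purely for illustration and do not come from any word2vec model; in practice each verb's vector would be looked up in whichever pre-trained model is loaded:

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Hypothetical toy vectors, NOT real embeddings.
toy_vectors = {
    "say": [0.9, 0.1, 0.0],
    "think": [0.7, 0.3, 0.1],
    "go": [0.0, 0.2, 0.9],
}

say_think = cosine_similarity(toy_vectors["say"], toy_vectors["think"])
say_go = cosine_similarity(toy_vectors["say"], toy_vectors["go"])
print(round(say_think, 3), round(say_go, 3))
```

Values close to 1 mean the two vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal.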