# Task 2

## Done by Arina Shinkorenok, group j4132c

**Description:**

1.	Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2.	Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3.	Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4.	Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?


### Mount the drive with the book file and import the necessary libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from collections import Counter
import spacy
import tqdm
import numpy as np

In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 61.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now l

### Load the text of the book and divide it into chapters

In [4]:
file_path = '/content/drive/MyDrive/ML-ITMO/11-0.txt'
with open(file_path, 'r') as file:
    text = file.read()
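
The Project Gutenberg file wraps the novel in a header and a license footer, and the first token printed in the preprocessing section is flagged as non-alphabetic, which suggests a byte-order mark at the start of the file. A minimal sketch of a loader that handles both is given here. The helper names `strip_gutenberg_boilerplate` and `load_book` are hypothetical, and the `*** START OF` / `*** END OF` marker prefixes are an assumption about the file layout that should be checked against the downloaded copy.

```python
START_MARK = "*** START OF"  # assumed prefix of the Gutenberg header marker line
END_MARK = "*** END OF"      # assumed prefix of the Gutenberg footer marker line


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the text between the START and END marker lines, if present."""
    raw = raw.lstrip("\ufeff")  # drop a byte-order mark if one survived decoding
    start = raw.find(START_MARK)
    if start != -1:
        # continue from the line after the START marker
        newline = raw.find("\n", start)
        raw = raw[newline + 1:] if newline != -1 else ""
    end = raw.find(END_MARK)
    if end != -1:
        raw = raw[:end]
    return raw.strip()


def load_book(path: str) -> str:
    """Read a Gutenberg text file; 'utf-8-sig' decodes UTF-8 and skips a leading BOM."""
    with open(path, "r", encoding="utf-8-sig") as file:
        return strip_gutenberg_boilerplate(file.read())
```

With such a loader the last chapter would no longer carry the license text, which would likely change its top TF-IDF words. The results shown in this notebook were produced without it.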

In [5]:
chapters = text.split('CHAPTER')[13:]
for chapter in chapters:
    print(chapter[:25])

 I.
Down the Rabbit-Hole

 II.
The Pool of Tears



 III.
A Caucus-Race and a
 IV.
The Rabbit Sends in 
 V.
Advice from a Caterpi
 VI.
Pig and Pepper


For
 VII.
A Mad Tea-Party


T
 VIII.
The Queen’s Croque
 IX.
The Mock Turtle’s St
 X.
The Lobster Quadrille
 XI.
Who Stole the Tarts?
 XII.
Alice’s Evidence
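
The `[13:]` slice works because the word `CHAPTER` apparently occurs once per chapter in the table of contents and once per chapter in the body, so the first 13 pieces are the preamble plus the contents entries. A slightly less position-dependent variant is sketched here: it splits on the full heading pattern and keeps the last `n_chapters` pieces. The function name `split_chapters` and the heading regex are assumptions for illustration, not the approach used in this notebook.

```python
import re

# a heading such as "CHAPTER IV." (Roman numeral followed by a period)
CHAPTER_HEADING = re.compile(r"CHAPTER\s+[IVXLC]+\.")


def split_chapters(text: str, n_chapters: int = 12) -> list[str]:
    """Split on chapter headings and keep the last n_chapters pieces (the body)."""
    parts = CHAPTER_HEADING.split(text)
    if len(parts) <= n_chapters:
        raise ValueError(f"expected more than {n_chapters} pieces, got {len(parts)}")
    # the earlier pieces are the preamble and the table-of-contents entries
    return [part.strip() for part in parts[-n_chapters:]]
```

Unlike `text.split('CHAPTER')`, each piece here starts directly with the chapter title, because the Roman numeral is consumed by the pattern.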





### Preprocessing

In [6]:
# load spaCy's small English pipeline (en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# run the pipeline on the lower-cased text
new_text = nlp(text.lower())

In [7]:
# go through the first 10 tokens in the text
for token in new_text[:10]:
    print(token.text, token.is_alpha, token.is_stop, token.lemma_, token.pos_)

﻿the False False ﻿the DET
project True False project NOUN
gutenberg True False gutenberg PROPN
ebook True False ebook PROPN
of True True of ADP
alice True False alice PROPN
’s False True ’s PART
adventures True False adventure NOUN
in True True in ADP
wonderland True False wonderland NOUN


In [8]:
# lemmatize each chapter, keeping only alphabetic, non-stop-word tokens other than "Alice"
tokenized_chapters = []
for chapter in tqdm.tqdm(chapters):
    doc = nlp(chapter)
    tokens = ' '.join(
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and not token.is_stop and token.text != "Alice"
    )
    tokenized_chapters.append(tokens)

100%|██████████| 12/12 [00:08<00:00,  1.35it/s]


In [9]:
# apply TF-IDF vectorization, treating each chapter as one document
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tokenized_chapters)
X = X.toarray()
X.shape

(12, 2093)
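
To make the weights in `X` less opaque, here is a small pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The function name `tfidf` and the toy documents are made up for illustration, and whitespace tokenization is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """TF-IDF with smoothed idf and L2-normalized rows (whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = ["say say rabbit", "say mouse", "say say say queen"]
toy_rows = tfidf(toy_docs)
```

Because of the `+ 1` in the idf term, a word that occurs in every document is not zeroed out; it only loses its bonus. In the toy example `say` still outweighs `rabbit` in the first document, which is consistent with frequent lemmas such as `say` appearing among the top words of many chapters.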

### Top 10 most important words from each chapter

In [10]:
# get the top 10 important words (according to the TF-IDF metric) for each chapter
feature_names = np.array(vectorizer.get_feature_names_out())
for chapter_idx in range(X.shape[0]):
    top_words = np.argsort(X[chapter_idx])[-10:][::-1]
    print(f"Top 10 words, chapter {chapter_idx + 1}: {feature_names[top_words]}")

Top 10 words, chapter 1: ['fall' 'eat' 'think' 'little' 'bat' 'rabbit' 'door' 'key' 'go' 'way']
Top 10 words, chapter 2: ['mouse' 'pool' 'little' 'say' 'oh' 'swam' 'cat' 'think' 'dear' 'cry']
Top 10 words, chapter 3: ['say' 'mouse' 'dodo' 'race' 'lory' 'prize' 'know' 'dry' 'thimble' 'bird']
Top 10 words, chapter 4: ['bill' 'little' 'window' 'rabbit' 'puppy' 'grow' 'glove' 'fan' 'say'
 'bottle']
Top 10 words, chapter 5: ['caterpillar' 'say' 'serpent' 'pigeon' 'youth' 'egg' 'size' 'think'
 'father' 'little']
Top 10 words, chapter 6: ['say' 'footman' 'cat' 'baby' 'mad' 'duchess' 'grin' 'sneeze' 'wow' 'go']
Top 10 words, chapter 7: ['hatter' 'dormouse' 'say' 'hare' 'march' 'tea' 'twinkle' 'time' 'know'
 'go']
Top 10 words, chapter 8: ['queen' 'say' 'hedgehog' 'king' 'gardener' 'look' 'go' 'rose' 'soldier'
 'cat']
Top 10 words, chapter 9: ['turtle' 'say' 'mock' 'gryphon' 'duchess' 'moral' 'queen' 'go' 'think'
 'school']
Top 10 words, chapter 10: ['turtle' 'mock' 'gryphon' 'say' 'lobster' 'd

### How would you name each chapter according to the identified tokens?

These are my attempts; my imagination is limited, so please don't judge too strictly.

1. The little rabbit has the key to the door
2. Cat was swimming in the pool and crying
3. Mouse took the prize
4. The little rabbit grew up
5. A caterpillar is a small serpent
6. Wow, it's a mad duchess!
7. Hatter and tea
8. Queen, king and soldiers
9. Turtle morality
10. Beautiful lobster dance
11. Court of the king and queen
12. Gutenberg's work

### Top 10 most used verbs in sentences with Alice

In [11]:
new_text = nlp(text)

In [12]:
# find the top 10 most common verbs in sentences mentioning "Alice"
verbs = []

for sent in new_text.sents:
    get_verbs = False
    for token in sent:
        if token.text.lower() == "alice":
            get_verbs = True
            break

    if get_verbs:
        for token in sent:
            if (token.pos_ == "VERB") and token.is_alpha and (not token.is_stop):
                verbs.append(token.lemma_.lower())

counter = Counter(verbs)
print(f"Top 10 verbs used with Alice: {counter.most_common(10)}")

Top 10 verbs used with Alice: [('say', 202), ('think', 87), ('go', 56), ('look', 51), ('know', 43), ('begin', 40), ('come', 33), ('get', 27), ('feel', 25), ('find', 23)]
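
The count above includes every verb in a sentence that mentions Alice, whoever performs the action. A stricter variant, sketched here as an assumption rather than as the method used in this notebook, counts only verbs whose grammatical subject is the token "Alice", using the dependency labels of the spaCy parse (`dep_ == "nsubj"`). The function name `verbs_with_subject` is hypothetical.

```python
from collections import Counter


def verbs_with_subject(doc, subject="alice"):
    """Count lemmas of verbs whose nominal subject is the given word.

    `doc` is any iterable of spaCy-like tokens exposing
    text, dep_, head, pos_ and lemma_.
    """
    counts = Counter()
    for token in doc:
        is_subject = token.dep_ == "nsubj" and token.text.lower() == subject
        if is_subject and token.head.pos_ == "VERB":
            counts[token.head.lemma_.lower()] += 1
    return counts


# Usage with the notebook's variables (assumes new_text = nlp(text)):
# print(verbs_with_subject(new_text).most_common(10))
```

This relies on the accuracy of the small model's parser, so the resulting ranking may differ from the sentence-level count and should be read as a complementary view.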


### What does Alice do most often?

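In sentences that mention Alice, the most frequent verb is "say" (202 occurrences), followed by "think" (87). So, most often, Alice speaks and then thinks.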

## Conclusion

In this lab, the text of "Alice in Wonderland" was preprocessed, and the top 10 words of each chapter and the top 10 verbs used in sentences mentioning Alice were found. I used the spaCy library, which provides powerful tools for natural language processing, and gained new experience with it.