# Task 4

1.	Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2.	Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3.	Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4.	Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?
5.	(Optional) Find the Top 100 most used verbs in sentences with Alice. Get word vectors using a pre-trained word2vec model and visualize them. Compare the words using embeddings.

In [1]:
import re
import pandas as pd
import requests
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from collections import Counter
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/emidiant/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 1. Download Alice

In [2]:
alice_text = requests.get("http://www.gutenberg.org/files/11/11-0.txt").text
alice_text[:400]

'ï»¿The Project Gutenberg eBook of Aliceâ\x80\x99s Adventures in Wonderland, by Lewis Carroll\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If'
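The `ï»¿` and `â\x80\x99` sequences in the output above look like UTF-8 bytes decoded as ISO-8859-1, which is the encoding `requests` falls back to for `text/*` responses that declare no charset. The sketch below reproduces the effect offline on a hand-made byte string (an assumption, not the real download) and shows that the `utf-8-sig` codec both decodes the text and drops the byte-order mark.

```python
# Hand-made stand-in for the first bytes of the file: a UTF-8 byte-order mark
# followed by text containing a curly apostrophe (U+2019).
raw = "\ufeffAlice\u2019s Adventures in Wonderland".encode("utf-8")

# Decoding as ISO-8859-1 maps every byte to one character, which yields the
# garbled prefix and apostrophe.
garbled = raw.decode("iso-8859-1")

# 'utf-8-sig' decodes UTF-8 and strips a leading byte-order mark if present.
clean = raw.decode("utf-8-sig")

print(repr(garbled))
print(repr(clean))
```

With `requests`, the same decoding can be obtained by setting `response.encoding = "utf-8-sig"` before reading `response.text`, or by calling `response.content.decode("utf-8-sig")`. The later cleaning step strips non-ASCII characters anyway, so this mainly matters when the raw text is inspected or reused.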

## 2. Data processing

In [3]:
lemmatizer = WordNetLemmatizer()

def data_processing(alice_text, start_book="CHAPTER I.", end_book="THE END", return_split_content=False, drop_point=True):
    # Cut away everything before start_book and after end_book (Gutenberg header and footer)
    text = alice_text.split(start_book)[-1].split(end_book)[0]
    # Drop HTML-like tags and the underscores Gutenberg uses for emphasis
    text = re.sub("<.*?>", " ", text)
    text = re.sub("_", "", text)
    if drop_point:
        # Keep only ASCII letters and whitespace (removes digits and punctuation)
        text = re.sub(r"[^a-zA-Z\s]", "", text)
    else:
        # Also keep full stops, each split off as its own token for sentence splitting
        text = re.sub(r"[^a-zA-Z\s.]", "", text).replace(".", " .")
    # Tokenization
    tokens = TreebankWordTokenizer().tokenize(text)
    print(f"Length after tokenizations: {len(tokens)}")
    if return_split_content:
        # Keep a copy with the original casing and stop words (used for the chapter titles)
        without_clean = " ".join(tokens)
    # Lower case
    tokens = [token.lower() for token in tokens]
    # Remove stop words (a set makes the membership test fast)
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]
    print(f"Length after deleted stop words: {len(tokens)}")
    # Lemmatization (WordNetLemmatizer treats every token as a noun by default)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    if return_split_content:
        return " ".join(tokens), without_clean
    return " ".join(tokens)

In [4]:
contents, real_content = data_processing(alice_text, start_book = "Contents", end_book = "CHAPTER I.\r\n", return_split_content=True)
contents = contents.split("chapter ")[1:]
real_content = re.split(r'.(?=CHAPTER)', real_content)

Length after tokenizations: 70
Length after deleted stop words: 50


In [5]:
alice = data_processing(alice_text)
alice[:400]

Length after tokenizations: 26378
Length after deleted stop words: 12618


'rabbithole alice beginning get tired sitting sister bank nothing twice peeped book sister reading picture conversation use book thought alice without picture conversation considering mind well could hot day made feel sleepy stupid whether pleasure making daisychain would worth trouble getting picking daisy suddenly white rabbit pink eye ran close nothing remarkable alice think much way hear rabbit'
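The cleaning step relies on negated character classes, and one detail is worth checking in filters of this kind: in Python's `re`, `\a` is the bell character (code 7), so a class written as `[^\a-zA-Z\s]` spans the range from bell to `z` and therefore keeps digits. The sample string below is made up for illustration.

```python
import re

sample = "abc 123 {x}"

# '\a-z' is a range from the bell character (7) to 'z' (122); digits fall
# inside it, so only characters above 'z' such as '{' and '}' are removed.
with_escape = re.sub(r"[^\a-zA-Z\s]", "", sample)

# Without the backslash the class keeps only ASCII letters and whitespace.
letters_only = re.sub(r"[^a-zA-Z\s]", "", sample)

print(repr(with_escape))   # 'abc 123 x'
print(repr(letters_only))  # 'abc  x'
```

The body of the book contains very few digits, so the two patterns should give nearly identical results on this text; the difference matters more on corpora with numbers.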

## 3. Finding important words

Division into chapters and removal of chapter titles from the text

In [6]:
chapters = alice.split("chapter ")
chapters = [ch.replace(contents[i], "", 1) for ch, i in zip(chapters, range(len(contents)))]
print(f"Chapters amount: {len(chapters)}")

Chapters amount: 12


TF-IDF

In [7]:
vectorizer = TfidfVectorizer(stop_words=["alice"])
X = vectorizer.fit_transform(chapters)

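# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())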
tfidf_df

|  | abide | able | absence | absurd | acceptance | accident | accidentally | account | accounting | accusation | ... | youare | youcome | youd | youll | young | youre | youth | youve | zealand | zigzag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.013432 | 0.0 | 0.0 | 0.033053 | 0.0 |
| 1 | 0.0 | 0.030978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.030978 | 0.0 | 0.032048 | 0.021094 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.026521 | 0.030881 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.030881 | 0.0 | 0.0 | 0.021028 | 0.025099 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.013615 | 0.0 | 0.0 | 0.010696 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.048198 | 0.016066 | 0.057529 | 0.141563 | 0.014568 | 0.0 | 0.023594 |
| 5 | 0.022692 | 0.0 | 0.0 | 0.019488 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.015451 | 0.0 | 0.046107 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.015939 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0192 | 0.012637 | 0.012637 | 0.007542 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.018921 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.022793 | 0.0 | 0.0 | 0.0 | 0.0 | 0.013604 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.020672 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.010693 | 0.0 | 0.028152 | 0.016801 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.017741 | 0.0 | ... | 0.0 | 0.0 | 0.009177 | 0.0 | 0.0 | 0.00721 | 0.0 | 0.021909 | 0.0 | 0.0 |
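To make the scores in the table above easier to interpret, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. The function name `tfidf_top_words` and the toy "chapters" are hypothetical, and the whitespace tokenisation is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_top_words(docs, k=3, stop_words=frozenset({"alice"})):
    """Return the k highest-scoring (word, score) pairs for each document."""
    tokenized = [[w for w in doc.split() if w not in stop_words] for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    top = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, c in tf.items()
        }
        # L2-normalise so that each document vector has unit length
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        scores = {w: v / norm for w, v in weights.items()}
        # Sort by score (descending), breaking ties alphabetically
        top.append(sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k])
    return top


# Toy stand-ins for preprocessed chapters (not taken from the book)
toy_chapters = [
    "rabbit hole rabbit alice",
    "mouse pool mouse alice",
    "queen king queen alice",
]
for i, words in enumerate(tfidf_top_words(toy_chapters, k=2), start=1):
    print(i, [(w, round(s, 3)) for w, s in words])
```

A word that occurs in every chapter, such as "said", receives the minimum idf of 1 yet can still rank high when its raw count within a chapter is large, which is why it appears in most of the top-10 lists.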


Top 10 most important words in each chapter with its real title

In [8]:
for i in range(len(chapters)):
    print(f"{real_content[i]}")
    df_max_tf_idf = tfidf_df.iloc[i].sort_values(ascending=False).reset_index().rename({"index": "word", i: "tf_idf"}, axis=1).set_index("word").head(10)
    print(", ".join(df_max_tf_idf.index.to_list()))
    display(df_max_tf_idf)

CHAPTER I Down the RabbitHole
little, bat, door, key, eat, think, like, way, either, see


| word | tf_idf |
|---|---|
| little | 0.172644 |
| bat | 0.17032 |
| door | 0.153879 |
| key | 0.150453 |
| eat | 0.142861 |
| think | 0.126606 |
| like | 0.126606 |
| way | 0.126606 |
| either | 0.122452 |
| see | 0.115096 |


CHAPTER II The Pool of Tears
mouse, little, pool, im, swam, cat, dear, said, foot, mabel


| word | tf_idf |
|---|---|
| mouse | 0.306035 |
| little | 0.183377 |
| pool | 0.164506 |
| im | 0.163655 |
| swam | 0.154889 |
| cat | 0.153018 |
| dear | 0.149787 |
| said | 0.129443 |
| foot | 0.125889 |
| mabel | 0.123911 |


CHAPTER III A CaucusRace and a Long Tale
mouse, said, dodo, prize, lory, dry, thimble, know, bird, soon


| word | tf_idf |
|---|---|
| mouse | 0.400416 |
| said | 0.365608 |
| dodo | 0.318252 |
| prize | 0.185286 |
| lory | 0.159126 |
| dry | 0.140565 |
| thimble | 0.123524 |
| know | 0.118285 |
| bird | 0.114405 |
| soon | 0.095021 |


CHAPTER IV The Rabbit Sends in a Little Bill
window, little, bill, puppy, rabbit, fan, bottle, glove, said, one


| word | tf_idf |
|---|---|
| window | 0.210561 |
| little | 0.20163 |
| bill | 0.197145 |
| puppy | 0.184241 |
| rabbit | 0.176991 |
| fan | 0.135624 |
| bottle | 0.135624 |
| glove | 0.135624 |
| said | 0.12831 |
| one | 0.12831 |


CHAPTER V Advice from a Caterpillar
caterpillar, said, serpent, pigeon, im, egg, youth, size, father, little


| word | tf_idf |
|---|---|
| caterpillar | 0.447479 |
| said | 0.427216 |
| serpent | 0.283126 |
| pigeon | 0.283126 |
| im | 0.143822 |
| egg | 0.141563 |
| youth | 0.141563 |
| size | 0.112461 |
| father | 0.101313 |
| little | 0.090373 |


CHAPTER VI Pig and Pepper
said, cat, footman, baby, mad, duchess, pig, wow, like, cook


| word | tf_idf |
|---|---|
| said | 0.37137 |
| cat | 0.336261 |
| footman | 0.272298 |
| baby | 0.214365 |
| mad | 0.189361 |
| duchess | 0.164328 |
| pig | 0.155902 |
| wow | 0.136149 |
| like | 0.126424 |
| cook | 0.120502 |


CHAPTER VII A Mad TeaParty
hatter, dormouse, said, march, hare, twinkle, time, tea, draw, know


| word | tf_idf |
|---|---|
| hatter | 0.464623 |
| dormouse | 0.430343 |
| said | 0.381286 |
| march | 0.265386 |
| hare | 0.265386 |
| twinkle | 0.148472 |
| time | 0.109862 |
| tea | 0.098556 |
| draw | 0.095632 |
| know | 0.090475 |


CHAPTER VIII The Queens CroquetGround
queen, said, hedgehog, king, gardener, soldier, cat, five, executioner, procession


| word | tf_idf |
|---|---|
| queen | 0.447157 |
| said | 0.329889 |
| hedgehog | 0.22032 |
| king | 0.210033 |
| gardener | 0.176256 |
| soldier | 0.150429 |
| cat | 0.14964 |
| five | 0.13245 |
| executioner | 0.132192 |
| procession | 0.132192 |


CHAPTER IX The Mock Turtles Story
said, turtle, mock, gryphon, duchess, moral, queen, went, say, day


| word | tf_idf |
|---|---|
| said | 0.410295 |
| turtle | 0.40774 |
| mock | 0.392058 |
| gryphon | 0.281522 |
| duchess | 0.203166 |
| moral | 0.186045 |
| queen | 0.163157 |
| went | 0.093576 |
| say | 0.077743 |
| day | 0.072694 |


CHAPTER X The Lobster Quadrille
turtle, mock, gryphon, said, dance, lobster, beautiful, soup, join, whiting


| word | tf_idf |
|---|---|
| turtle | 0.417235 |
| mock | 0.376858 |
| gryphon | 0.374501 |
| said | 0.277999 |
| dance | 0.230637 |
| lobster | 0.230637 |
| beautiful | 0.16151 |
| soup | 0.16151 |
| join | 0.159672 |
| whiting | 0.14193 |


CHAPTER XI Who Stole the Tarts
king, hatter, said, court, dormouse, witness, queen, juror, officer, breadandbutter


| word | tf_idf |
|---|---|
| king | 0.405735 |
| hatter | 0.365104 |
| said | 0.319204 |
| court | 0.295224 |
| dormouse | 0.255861 |
| witness | 0.229173 |
| queen | 0.116281 |
| juror | 0.114586 |
| officer | 0.114586 |
| breadandbutter | 0.098408 |


CHAPTER XII Alices Evidence
said, king, jury, queen, sister, dream, would, slate, rabbit, fit


| word | tf_idf |
|---|---|
| said | 0.465603 |
| king | 0.392761 |
| jury | 0.1989 |
| queen | 0.14781 |
| sister | 0.13923 |
| dream | 0.135098 |
| would | 0.118683 |
| slate | 0.112582 |
| rabbit | 0.108495 |
| fit | 0.104872 |


**Real titles:**

 * CHAPTER I.     Down the Rabbit-Hole
 * CHAPTER II.    The Pool of Tears
 * CHAPTER III.   A Caucus-Race and a Long Tale
 * CHAPTER IV.    The Rabbit Sends in a Little Bill
 * CHAPTER V.     Advice from a Caterpillar
 * CHAPTER VI.    Pig and Pepper
 * CHAPTER VII.   A Mad Tea-Party
 * CHAPTER VIII.  The Queen’s Croquet-Ground
 * CHAPTER IX.    The Mock Turtle’s Story
 * CHAPTER X.     The Lobster Quadrille
 * CHAPTER XI.    Who Stole the Tarts?
 * CHAPTER XII.   Alice’s Evidence



My chapter titles based on TF-IDF:

1. The little bat thinks of a way to eat the key to the door...
2. A cat and a small mouse talk about swimming in the pool
3. Thimble as a prize for the Dodo bird
4. Little rabbit with a bill in a bottle
5. Pigeon and caterpillar discuss the small size of serpent eggs
6. The madness of the footman's cat and the Duchess of the pig
7. The Hatter, the Dormouse and the Hare know all about tea time
8. Procession of five cats, hedgehogs and a king and queen
9. Mock turtle as the moral of the day
10. Mock turtle joins the dance of lobster and gryphon in the soup
11. The king at the hatter's court speaks of the jury's lack of bread and butter
12. The rabbit dreams of a jury where there will be no king and queen

## 4. Find the Top 10 most used verbs

Rerun the preprocessing with full stops kept (`drop_point=False`) so that the text can be split into sentences, then keep only the sentences that contain the name Alice

In [9]:
sentences = data_processing(alice_text, drop_point=False)
chapters = sentences.split("chapter ")
chapters = [ch.replace(contents[i], "", 1) for ch, i in zip(chapters, range(len(contents)))]
sentences_clean = "".join(chapters).split(". ")
sentence_alice = [s.strip() for s in sentences_clean if "alice" in s]
print(f"Amount of sentences with word 'Alice': {len(sentence_alice)}")
sentence_alice = " ".join(sentence_alice).split()

Length after tokenizations: 27364
Length after deleted stop words: 13603
Amount of sentences with word 'Alice': 365
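The selection above tests for the substring "alice", which would also match a word such as "malice". As a hedged alternative, the sketch below splits a space-joined token string on standalone `.` tokens and keeps a sentence only when the name appears as a whole token. The function name `sentences_mentioning` and the sample text are hypothetical.

```python
def sentences_mentioning(token_text, name="alice"):
    """Split on standalone '.' tokens and keep sentences that contain `name` as a token."""
    sentences, current = [], []
    for token in token_text.split():
        if token == ".":
            # A full stop closes the current sentence
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        # Text after the last full stop still counts as a sentence
        sentences.append(current)
    return [" ".join(s) for s in sentences if name in s]


# Made-up sample in the same shape as the preprocessed text
sample = "alice begin get tired . sister read book . rabbit run close alice . malice nothing"
print(sentences_mentioning(sample))
# ['alice begin get tired', 'rabbit run close alice']
```

Whole-token matching is stricter than the substring test: a possessive that has lost its apostrophe, such as "alices", would need to be added explicitly, so the sentence count may differ from the 365 reported above.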


Tag each word with its part of speech

In [10]:
tag_words = nltk.pos_tag(sentence_alice)

Keep only the verbs and count the top 10

In [11]:
verbs = [lemmatizer.lemmatize(w, 'v') for w, tag in tag_words if tag[0] == "V"]
pd.DataFrame(Counter(verbs).most_common(10)).rename({0: "verb", 1: "counter"}, axis=1).set_index("verb")

| verb | counter |
|---|---|
| say | 250 |
| go | 95 |
| think | 60 |
| get | 58 |
| look | 49 |
| come | 45 |
| begin | 44 |
| see | 36 |
| make | 31 |
| know | 28 |
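The counting step can be tried without any NLTK data. In the sketch below, the `(word, tag)` pairs are written by hand to stand in for `nltk.pos_tag` output, and a small lookup table replaces `WordNetLemmatizer`; both are assumptions made for illustration. Penn Treebank verb tags (`VB`, `VBD`, `VBG`, `VBN`, `VBP`, `VBZ`) all begin with `V`, which is why checking the first letter of the tag is enough.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output (assumption)
tagged = [
    ("alice", "NN"), ("said", "VBD"), ("went", "VBD"), ("rabbit", "NN"),
    ("said", "VBD"), ("thinking", "VBG"), ("quite", "RB"),
]

# Toy lemma lookup used here in place of WordNetLemmatizer (assumption)
LEMMAS = {"said": "say", "went": "go", "thinking": "think"}

# Keep tokens whose tag starts with "V" and map each to its base form
verbs = [LEMMAS.get(word, word) for word, tag in tagged if tag.startswith("V")]
verb_counts = Counter(verbs)

print(verb_counts.most_common(3))
```

Reducing inflected forms to a lemma before counting is what lets "said" and "say" accumulate under a single entry in the table above.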


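Most often, Alice speaks, then goes somewhere, and only then thinks about why she said it and where she went at all. Note that the counts cover every verb in sentences that mention Alice, so some of these actions belong to other characters.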