Task 4:
1. Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website
http://www.gutenberg.org/files/11/11-0.txt
2. Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3. Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4. Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?


In [96]:
import requests
from bs4 import BeautifulSoup

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import wordnet as wn
from nltk.collocations import BigramCollocationFinder

import re
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [78]:
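import codecs

# 11-0.txt is plain UTF-8 text; BeautifulSoup is used here only to strip any stray markup.
# Naming the parser avoids bs4's "no parser was explicitly specified" warning,
# and the context manager closes the file handle once the text has been read.
with codecs.open("/content/drive/MyDrive/MLT LABS/Lab2/11-0.txt", 'r', 'utf-8') as html:
    soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()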
print(text[0:2000])

The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Alice’s Adventures in Wonderland

Author: Lewis Carroll

Release Date: January, 1991 [eBook #11]
[Most recently updated: October 12, 2020]

Language: English

Character set encoding: UTF-8

Produced by: Arthur DiBianca and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***

[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Content

In [79]:
print(text[-1000:])


Section 5. General Information About Project Gutenberg-tm electronic works

Professor Michael S. Hart was the originator of the Project
Gutenberg-tm concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg-tm eBooks with only a loose network of
volunteer support.

Project Gutenberg-tm eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org

This website includes information about Project Gutenberg-tm,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.




In [80]:
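# "CHAPTER XII.   Alice’s Evidence" (heading and title on one line) appears to be the last
# entry of the table of contents, so keeping what follows it skips the front matter.
text = text.split("CHAPTER XII.   Alice’s Evidence")[1]
# Drop the Project Gutenberg licence text that follows the story.
text = text.split("THE END")[0]
# Every remaining "CHAPTER" heading starts a chapter; element 0 is the text before Chapter I.
chapters = text.split("CHAPTER")[1:]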

for i, chapter in enumerate(chapters):
    print(f"Chapter {i + 1}: {chapter[:200]}...\n")

Chapter 1:  I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was read...

Chapter 2:  II.
The Pool of Tears


“Curiouser and curiouser!” cried Alice (she was so much surprised, that
for the moment she quite forgot how to speak good English); “now I’m
opening out like the largest...

Chapter 3:  III.
A Caucus-Race and a Long Tale


They were indeed a queer-looking party that assembled on the bank—the
birds with draggled feathers, the animals with their fur clinging close
to them, and a...

Chapter 4:  IV.
The Rabbit Sends in a Little Bill


It was the White Rabbit, trotting slowly back again, and looking
anxiously about as it went, as if it had lost something; and she heard
it muttering to i...

Chapter 5:  V.
Advice from a Caterpillar


The Caterpillar and Alice looked at each other for some time in
silence: at last the Cat
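
Splitting on the bare word "CHAPTER" discards the Roman numeral and would also split on any "CHAPTER" that appears inside the prose. A minimal sketch of a stricter alternative is shown below. It assumes the table of contents has already been removed, as in the cell above; `split_chapters` and the sample string are hypothetical names made up for illustration.

```python
import re

def split_chapters(body):
    """Split a book body into (numeral, text) pairs on 'CHAPTER <roman numeral>.' headings."""
    # The capture group makes re.split keep each Roman numeral in the result list.
    parts = re.split(r"CHAPTER ([IVXLC]+)\.", body)
    # parts[0] is whatever precedes the first heading; after it come numeral, text, numeral, text, ...
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

# Made-up sample in the same layout as the chapter headings printed above.
sample = "front matter CHAPTER I.\nDown the Rabbit-Hole\nAlice was... CHAPTER II.\nThe Pool of Tears\nCuriouser..."
for numeral, chapter_text in split_chapters(sample):
    print(numeral, "->", chapter_text[:20])
```

Keeping the numeral next to the text makes it possible to label later results by the chapter's own number rather than by list position.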

In [81]:
stop_words = nltk.corpus.stopwords.words('english')
stop_words.append("alice")

In [82]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # raw string, so '\w' is not read as a string escape
lemmatizer = WordNetLemmatizer()

def preprocess_chapter(chapter):
  """Lower-case, lemmatize as verbs, drop stop words; return one space-joined string."""
  tokens = tokenizer.tokenize(chapter)
  words = [word.lower() for word in tokens]
  # Noun lemmatization (pos='n') is left disabled because it turns "was" into the odd token "wa".
  #words = [lemmatizer.lemmatize(word, pos='n') for word in words]
  words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # pos='v': reduce verbs to their base form

  words_filtered = [word for word in words if word not in stop_words]
  return ' '.join(words_filtered)

In [83]:
# Build the list in one pass so that re-running this cell does not append duplicates.
clean_chapters = [preprocess_chapter(chapter) for chapter in chapters]
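
The task also asks for numbers and non-alphabetic characters to be removed, but a `\w+` pattern matches digits and underscores as well as letters. The sketch below shows one possible stricter tokenization using only the standard library. The function name `alphabetic_tokens` is hypothetical. Restricting tokens to ASCII letters is an assumption that would also drop accented characters.

```python
import re

def alphabetic_tokens(raw_text):
    """Lower-case the text and return runs of ASCII letters only."""
    # Digits, underscores and punctuation act as separators and are discarded.
    return re.findall(r"[a-z]+", raw_text.lower())

print(alphabetic_tokens("Alice was _very_ tired in 1865!"))
# ['alice', 'was', 'very', 'tired', 'in']
```

The resulting tokens could then go through the same lemmatization and stop-word filtering as in the cell above.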

In [88]:
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
corpus_tf_idf = vectorizer.fit_transform(clean_chapters)
# Rows are chapters and columns are terms; transpose so that each chapter becomes a column.
tfidf = pd.DataFrame(corpus_tf_idf.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_matrix = tfidf.T
n_chapters = len(clean_chapters)
tfidf_matrix.columns = [f"Chapter {i}" for i in range(1, n_chapters + 1)]
for i in range(1, n_chapters + 1):
  print(f"Chapter {i}")
  print(tfidf_matrix[f"Chapter {i}"].nlargest(n=10))
  print()

Chapter 1
think     0.204454
eat       0.190808
say       0.172171
go        0.161411
little    0.161411
bat       0.159237
get       0.150650
rabbit    0.143866
key       0.140663
see       0.139889
Name: Chapter 1, dtype: float64

Chapter 2
mouse     0.338822
go        0.219737
say       0.198810
pool      0.182374
little    0.167419
oh        0.158752
cat       0.148432
think     0.146491
swim      0.143233
cry       0.143114
Name: Chapter 2, dtype: float64

Chapter 3
say        0.415799
mouse      0.336597
dodo       0.307650
prize      0.179114
race       0.179114
lory       0.153825
know       0.135135
dry        0.121965
thimble    0.119409
bird       0.110593
Name: Chapter 3, dtype: float64

Chapter 4
bill       0.233369
rabbit     0.202604
little     0.196057
window     0.195839
puppy      0.171359
say        0.161960
grow       0.151564
fan        0.147165
gloves     0.147165
chimney    0.146879
Name: Chapter 4, dtype: float64

Chapter 5
say            0.466604
caterpillar   
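
Generic verbs such as "say" still rank near the top in several chapters above. A small hand-rolled version of the weighting helps to show why. The sketch below assumes the formula that scikit-learn documents for the default `TfidfVectorizer` settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        raw = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, tf in c.items()}
        # Fall back to 1.0 for an empty document to avoid dividing by zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy = ["say rabbit rabbit", "say mouse", "say dodo dodo dodo"]
for row in tfidf_rows(toy):
    print(sorted(row.items(), key=lambda kv: -kv[1]))
```

With this smoothing, a term that occurs in every document still has an IDF of 1 rather than 0, so a very frequent word keeps a sizeable weight. Adding such words to the stop list, or passing `max_df` to the vectorizer, are two possible ways to push them down.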

Proposed new chapter names (original title - proposed title):
1. Down the Rabbit-Hole - A Little Rabbit's House
2. The Pool of Tears - First troubles and unexpected acquaintance near the water
3. A Caucus-Race and a Long Tale - The Dodo's Prize: A Race to Remember
4. The Rabbit Sends in a Little Bill - Little Gloves by the Chimney
5. Advice from a Caterpillar - Caterpillar's Wisdom
6. Pig and Pepper - Going Mad with the Grinning Cat
7. A Mad Tea-Party - Tea Party with the Dormouse and the Hatter
8. The Queen’s Croquet-Ground - Queen's Say
9. The Mock Turtle’s Story - Morals from the Mock Turtle and the Gryphon
10. The Lobster Quadrille - Luxurious banquet
11. Who Stole the Tarts? - Bread court
12. Alice’s Evidence - Jury dreams

In [97]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

tokens = tokenizer.tokenize(text)
words = [word.lower() for word in tokens]
words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # infinitive_verbs 'v' specifies verb

# Reload the stop-word list so that "alice", appended earlier for the TF-IDF step, is kept here.
stop_words = nltk.corpus.stopwords.words('english')
words = [word for word in words if word not in stop_words]

In [99]:
# Top 15 bigrams that pair "alice" with a verb, ranked by PMI.
# PMI favours pairs whose words are rare elsewhere, so this is not a ranking by frequency.
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
bigrams = finder.score_ngrams(bigram_measures.pmi)

def is_verb(word):
  """True if the first WordNet synset of the word is a verb."""
  synsets = wn.synsets(word)
  return bool(synsets) and synsets[0].pos() == 'v'

verbs = [
    (bigram, score)
    for bigram, score in bigrams
    if (bigram[0] == 'alice' and is_verb(bigram[1])) or
       (bigram[1] == 'alice' and is_verb(bigram[0]))
]

verbs[:15]

[(('alice', 'recognise'), 4.972762267024658),
 (('alice', 'soothe'), 4.972762267024658),
 (('inquire', 'alice'), 4.972762267024658),
 (('exclaim', 'alice'), 3.972762267024658),
 (('vanish', 'alice'), 3.650834172137296),
 (('alice', 'allow'), 3.3877997663035018),
 (('plead', 'alice'), 3.3877997663035018),
 (('alice', 'attend'), 2.650834172137296),
 (('alice', 'remain'), 2.650834172137296),
 (('alice', 'wander'), 2.650834172137296),
 (('alice', 'learn'), 2.3877997663035018),
 (('alice', 'consider'), 1.650834172137296),
 (('believe', 'alice'), 1.650834172137296),
 (('alice', 'appear'), 1.5133306483873614),
 (('alice', 'ask'), 1.5133306483873614)]
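
The three top pairs above share exactly the same score, which is what one would expect if each of those verbs occurs only once in the text and that single occurrence is next to "alice". The question "What does Alice do most often?" is about frequency, so raw bigram counts may be the more suitable ranking. The sketch below compares the two on a toy word list. It assumes PMI of the form `log2(count(pair) * N / (count(w1) * count(w2)))`, with N the number of words. The name `bigram_stats` and the toy data are hypothetical.

```python
import math
from collections import Counter

def bigram_stats(words, target="alice"):
    """Map each bigram containing `target` to (pair_count, pmi)."""
    n = len(words)
    word_freq = Counter(words)
    pair_freq = Counter(p for p in zip(words, words[1:]) if target in p)
    return {
        pair: (count, math.log2(count * n / (word_freq[pair[0]] * word_freq[pair[1]])))
        for pair, count in pair_freq.items()
    }

toy = ["alice", "say", "alice", "say", "alice", "say",
       "alice", "soothe", "queen", "say", "queen", "shout"]
stats = bigram_stats(toy)

print("by count:", sorted(stats.items(), key=lambda kv: -kv[1][0])[:3])
print("by PMI:  ", sorted(stats.items(), key=lambda kv: -kv[1][1])[:3])
```

In the toy data, ("alice", "say") occurs three times and ("alice", "soothe") once, yet the rare pair gets the higher PMI. Sorting the notebook's bigrams by count instead of by PMI would be one way to answer the frequency question more directly.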