## Task 2

1. Download Alice in Wonderland by Lewis Carroll from the Project Gutenberg website: http://www.gutenberg.org/files/11/11-0.txt
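The cells below read a copy of the book that was already saved to Google Drive. As a minimal sketch of the download step itself, assuming the URL above is still served as UTF-8 text with a leading byte-order mark, the file could be fetched with the standard library. `decode_book` and `download_book` are hypothetical helper names, not part of the notebook.

```python
from urllib.request import urlopen

BOOK_URL = "http://www.gutenberg.org/files/11/11-0.txt"


def decode_book(raw: bytes) -> str:
    """Decode the downloaded bytes; 'utf-8-sig' drops a leading BOM if present."""
    return raw.decode("utf-8-sig")


def download_book(url: str = BOOK_URL, timeout: float = 30.0) -> str:
    """Fetch the book over HTTP and return it as a string (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return decode_book(response.read())
```

Calling `download_book()` returns the full text, which can then be written to `Alice.txt` or used directly in place of the file read below.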

In [108]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [182]:
# import libraries
import pandas as pd
import numpy as np

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import requests


In [154]:
# load the data (the Project Gutenberg file is UTF-8 encoded)
file_path = '/content/drive/MyDrive/Alice.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

In [155]:
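# the first 12 'CHAPTER' matches come from the table of contents
chapter_names = text.split('CHAPTER')[1:13]

In [156]:
# the remaining parts are the 12 chapter bodies
# (the last part also contains everything that follows Chapter 12 in the file)
text = text.split('CHAPTER')[13:]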
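Because the last part runs to the end of the file, it also carries the Project Gutenberg license, which is why words such as "gutenberg" and "foundation" appear in the Chapter 12 list further down. A minimal sketch of one way to cut that footer off before splitting is shown here. It assumes the footer starts with a line beginning `*** END OF`; `END_MARKER` and `split_chapters` are hypothetical names.

```python
END_MARKER = "*** END OF"  # assumed start of the Gutenberg footer line


def split_chapters(raw: str, n_chapters: int) -> list[str]:
    """Drop the license footer, then return the last n_chapters 'CHAPTER' parts."""
    body = raw.split(END_MARKER)[0]
    parts = body.split("CHAPTER")[1:]
    # Earlier matches belong to the table of contents, so keep only the tail.
    return parts[-n_chapters:]
```

With the real file this would be called as `split_chapters(text, 12)` on the raw string, before any other preprocessing.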

2. Perform any necessary preprocessing on the text:
- including converting to lower case
- removing stop words, numbers, non-alphabetic characters
- lemmatization

Use the Natural Language Toolkit (NLTK) for processing.

In [194]:
# import libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem.snowball import SnowballStemmer

# word_tokenize and WordNetLemmatizer also rely on the 'punkt' and 'wordnet' resources
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [158]:
# convert to lowercase
text = [text_part.lower() for text_part in text]

In [159]:
# remove everything except lowercase letters and spaces
# (digits, punctuation, apostrophes and line breaks are all deleted)
text = [re.sub(r'[^a-z ]', '', text_part) for text_part in text]
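Since the pattern above deletes line breaks rather than replacing them, the last word of one line is glued to the first word of the next. A hedged alternative, not the cleanup that produced the outputs in this notebook, is to drop apostrophes (so contractions stay in one piece, as in "im" and "dont" below) and turn every other run of non-letters into a single space. `normalize` is a hypothetical helper name.

```python
import re


def normalize(raw: str) -> str:
    """Lowercase, drop apostrophes, and replace other non-letter runs with a space."""
    lowered = raw.lower()
    no_apostrophes = re.sub(r"['’]", "", lowered)
    spaced = re.sub(r"[^a-z]+", " ", no_apostrophes)
    return spaced.strip()


print(normalize("Down the\nRabbit-Hole, I'm 7!"))  # down the rabbit hole im
```

It would be applied per chapter, for example `text = [normalize(part) for part in text]`.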

In [160]:
# Tokenize the text
text = [word_tokenize(text_part) for text_part in text]

In [163]:
# remove stop words
stop_words = set(stopwords.words('english'))
text = [[token for token in text_part if token not in stop_words] for text_part in text]

In [168]:
# lemmatization
lemmatizer = WordNetLemmatizer()
text = [[lemmatizer.lemmatize(token) for token in text_part] for text_part in text]
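`WordNetLemmatizer.lemmatize` treats each word as a noun unless a part of speech is passed, which may be why verb forms such as "said" and "went" remain unchanged in the results below. A minimal sketch of mapping the Penn Treebank tags returned by `pos_tag` to WordNet part-of-speech codes follows. `penn_to_wordnet` is a hypothetical helper, and the single-letter codes are the values of the WordNet constants for adjective, verb, noun and adverb.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'n' or 'r')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, as the lemmatizer itself does


# Intended use with the objects defined in the cells above (requires NLTK data):
# lemmas = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
#           for word, tag in pos_tag(tokens)]
```

Tagging works best on the original token order, so this step would need to run before stop words are removed.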

3. Find the Top 10 most important words (for example, in terms of the TF-IDF metric) in each chapter of the text, excluding "Alice"

- How would you name each chapter according to the identified tokens?

In [170]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

In [179]:
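# Find top 10 words for each chapter
top_words_by_chapter = []
for chapter in text:
    # Score the chapter's tokens, leaving out "alice".
    # Each token is passed to the vectorizer as its own one-word document,
    # so a word's summed score is effectively its frequency within the chapter.
    tfidf_matrix = tfidf_vectorizer.fit_transform([token for token in chapter if token != "alice"])
    feature_names = tfidf_vectorizer.get_feature_names_out()
    # Sum the scores per word and flatten the 1 x n_words result to a 1-D array
    scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
    # Indices of the 10 highest scores, ordered from lowest to highest
    top_words = [feature_names[i] for i in scores.argsort()[-10:]]
    top_words_by_chapter.append(top_words)

In [180]:
for i, top_words in enumerate(top_words_by_chapter):
    chapter_name = ', '.join(top_words)
    print(f"Chapter {i+1}: {chapter_name}")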

Chapter 1: thought, could, door, nothing, think, one, way, see, like, little
Chapter 2: oh, went, cried, im, dear, foot, thing, said, mouse, little
Chapter 3: ill, must, one, thing, long, soon, know, dodo, mouse, said
Chapter 4: voice, quite, bill, thought, heard, get, one, rabbit, said, little
Chapter 5: time, youth, ive, size, serpent, pigeon, im, little, caterpillar, said
Chapter 6: could, went, baby, little, footman, much, duchess, like, cat, said
Chapter 7: one, went, thing, know, time, hare, march, dormouse, hatter, said
Chapter 8: began, went, three, two, see, cat, king, head, queen, said
Chapter 9: moral, say, dont, queen, went, gryphon, duchess, turtle, mock, said
Chapter 10: could, join, wont, lobster, beautiful, would, gryphon, turtle, mock, said
Chapter 11: court, thought, rabbit, witness, queen, one, dormouse, hatter, king, said
Chapter 12: king, copy, term, gutenberg, electronic, foundation, gutenbergtm, said, work, project
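For comparison, a corpus-level TF-IDF treats each chapter as one document, so a word that occurs in every chapter gets the lowest possible inverse document frequency. The following is a minimal pure-Python sketch under that assumption. It is not the code that produced the lists above. `top_tfidf_words`, `k` and `exclude` are hypothetical names, and the smoothed IDF, ln((1 + n) / (1 + df)) + 1, follows the formula scikit-learn uses by default.

```python
import math
from collections import Counter


def top_tfidf_words(chapters, k=10, exclude=frozenset()):
    """Return the k highest-scoring words per chapter, best first.

    chapters: list of token lists, one list per chapter.
    """
    n_docs = len(chapters)
    # Document frequency: in how many chapters each word occurs.
    doc_freq = Counter()
    for tokens in chapters:
        doc_freq.update(set(tokens))

    result = []
    for tokens in chapters:
        counts = Counter(t for t in tokens if t not in exclude)
        scores = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:k])
    return result
```

With the preprocessed chapters from above, a call might look like `top_tfidf_words(text, k=10, exclude={"alice"})`.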


Chapter 1 -> Even one little thought could help find the way

Chapter 2 -> Oh, dear little mouse

Chapter 3 -> I must know about one long thing

Chapter 4 -> A little rabbit heard one quite voice

Chapter 5 -> I was a little caterpillar in youth

Chapter 6 -> One little baby likes a duchess

Chapter 7 -> One thing went wrong - time

Chapter 8 -> Three Queens and two Kings see cat

Chapter 9 -> Moral says don't mock at Queen

Chapter 10 -> A beautiful lobster joins a turtle

Chapter 11 -> Hatter brought a witness to the king's court

Chapter 12 -> King terms electronic project foundation

4. Find the Top 10 most used verbs in sentences with Alice.
- What does Alice do most often?

In [206]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [197]:
# reload the raw text (the Project Gutenberg file is UTF-8 encoded)
file_path = '/content/drive/MyDrive/Alice.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

In [198]:
# tokenize sentences
sentences = re.split(r'[.!?]', text)
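The book's lines are hard-wrapped, so a sentence usually spans several lines. One hedged alternative to the split above is to collapse all whitespace first and then split only where sentence-ending punctuation is followed by a space, which keeps the punctuation attached to its sentence. `split_sentences` is a hypothetical helper, and this simple rule will still mis-split abbreviations such as "Mr. Rabbit".

```python
import re


def split_sentences(raw: str) -> list[str]:
    """Collapse whitespace, then split after '.', '!' or '?' followed by a space."""
    flat = " ".join(raw.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


print(split_sentences("Alice was tired.\nShe sat\ndown! Why?"))
# ['Alice was tired.', 'She sat down!', 'Why?']
```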

In [199]:
# convert to lowercase
sentences = [sentence.lower() for sentence in sentences]

In [200]:
# remove everything except lowercase letters and spaces
# (digits, punctuation, apostrophes and line breaks are all deleted)
sentences = [re.sub(r'[^a-z ]', '', sentence) for sentence in sentences]

In [201]:
# Tokenize the text
sentences = [word_tokenize(sentence) for sentence in sentences]

In [202]:
# remove stop words
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]

In [203]:
# lemmatization
sentences = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in sentences]

In [209]:
# find verbs in sentences with Alice
alice_verbs = []

for sentence in sentences:
    if 'alice' in sentence:
        words = word_tokenize(' '.join(sentence))
        tagged_words = pos_tag(words)
        # Extract verbs
        verbs = [word for word, pos in tagged_words if pos.startswith('VB')]
        alice_verbs.extend(verbs)

# Count verb occurrences
verb_counts = Counter(alice_verbs)

In [223]:
print("Top 10 most used verbs:")

for verb, count in verb_counts.most_common(10):
    print(f"{verb} = {count}")



Top 10 most used verbs:
said = 156
thought = 33
went = 23
looked = 18
say = 17
see = 15
got = 15
know = 15
think = 14
began = 14
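Part-of-speech taggers rely on word order and function words, so tagging complete sentences before stop words are removed tends to give more reliable verb labels. The counting step can be sketched independently of NLTK, assuming each sentence is already a list of `(word, tag)` pairs in the shape `pos_tag` returns. `verbs_in_sentences_with` is a hypothetical helper, and the tagged sample is hand-written for illustration, not output from the book.

```python
from collections import Counter


def verbs_in_sentences_with(tagged_sentences, name="alice"):
    """Count verbs (tags starting with 'VB') in sentences that mention name."""
    counts = Counter()
    for tagged in tagged_sentences:
        words = {word.lower() for word, _ in tagged}
        if name in words:
            counts.update(
                word.lower() for word, tag in tagged if tag.startswith("VB")
            )
    return counts


# Hand-written sample in the (word, tag) shape produced by pos_tag
sample = [
    [("Alice", "NNP"), ("said", "VBD"), ("nothing", "NN")],
    [("The", "DT"), ("Queen", "NNP"), ("shouted", "VBD")],
    [("Alice", "NNP"), ("said", "VBD"), ("it", "PRP"), ("was", "VBD")],
]
print(verbs_in_sentences_with(sample).most_common(1))  # [('said', 2)]
```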
