
# Introduction

Process a text corpus with the [NLTK](https://www.nltk.org/) or [spaCy](https://spacy.io/) library (ideally both). In particular, you should do the following:
- Load the `harry_potter` book. You can find this text corpus in the datasets folder.
- Segment the text of the book into sentences. How many sentences does this book have?
- Compute the frequency of each token in the book. What are the most frequent tokens?
- Choose a sentence from the book and analyze it by:
    - Calculating all [n-grams](https://en.wikipedia.org/wiki/N-gram).
    - Finding [POS tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) of tokens.
    - [Stemming](https://en.wikipedia.org/wiki/Stemming) and [lemmatizing](https://en.wikipedia.org/wiki/Lemmatisation) tokens.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

In [1]:
!pip install textacy


# Importing Libraries

In [2]:
import nltk
import spacy
import textacy

# Download the NLTK resources used for tokenizing, tagging, and lemmatizing
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Load the small English spaCy pipeline
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


# Loading Harry Potter Book

In [7]:
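# Read the whole book into memory; the file is assumed to be UTF-8 encoded
with open("/content/harry_potter.txt", encoding="utf-8") as f:
    text = f.read()
print(text[:1000])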

CHAPTER ONE THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. 

The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, b

# Sentence Segmentation

In [8]:
nltk_sentences = nltk.sent_tokenize(text)
len(nltk_sentences)

6394

In [9]:
nltk_sentences[0]

'CHAPTER ONE THE BOY WHO LIVED \n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.'

In [10]:
doc = nlp(text)
spacy_sentences = list(doc.sents)
len(spacy_sentences)

6186

In [11]:
spacy_sentences[0]

CHAPTER ONE THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
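
The two libraries report different sentence counts (6394 for NLTK, 6186 for spaCy). One likely reason is that sentence boundaries are ambiguous: a period can end a sentence or belong to an abbreviation such as `Mr.`. The sketch below uses only the standard library and a deliberately naive regular-expression rule (an illustration, not what either library does internally) to show the problem on the opening of the book:

```python
import re

sample = (
    "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say "
    "that they were perfectly normal, thank you very much. They were the last "
    "people you'd expect to be involved in anything strange or mysterious."
)

# Naive rule: split wherever '.', '!' or '?' is followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

for i, s in enumerate(naive_sentences):
    print(i, repr(s))
```

The naive rule breaks after `Mr.` and `Mrs.`, producing four pieces for what are really two sentences, whereas both `nltk.sent_tokenize` and spaCy keep `Mr. and Mrs. Dursley` together in the outputs above.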

# Word Tokenization

In [16]:
# Count how often each NLTK token occurs in the whole book
tokens = {}
for s in nltk_sentences:
    sentence_tokens = nltk.tokenize.word_tokenize(s)
    for t in sentence_tokens:
        tokens[t] = tokens.get(t, 0) + 1

# Print the 20 most frequent tokens with their counts
frequent_tokens = sorted(tokens, key=tokens.get, reverse=True)[:20]
for t in frequent_tokens:
    print(t, "\t\t\t\t", tokens[t])

, 				 5658
. 				 5119
the 				 3310
'' 				 2441
`` 				 2307
to 				 1845
and 				 1804
a 				 1578
Harry 				 1323
was 				 1253
of 				 1242
he 				 1208
's 				 997
in 				 933
I 				 919
it 				 897
his 				 896
you 				 837
n't 				 826
said 				 793
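
The counting loop above can also be expressed with `collections.Counter` from the standard library. This is a minimal sketch that uses a short hypothetical token list (`sample_tokens`) in place of the book's tokens:

```python
from collections import Counter

# Hypothetical token list standing in for the output of a tokenizer
sample_tokens = ["the", "boy", "who", "lived", ",", "the", "boy", ".", "the"]

# Counter builds the token -> frequency mapping in one step
counts = Counter(sample_tokens)

# most_common(k) returns the k most frequent tokens with their counts
for token, count in counts.most_common(2):
    print(f"{token}\t{count}")
```

In the notebook, the same idea would apply to the tokens of every sentence in `nltk_sentences`.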


In [18]:
tokens

{'CHAPTER': 17,
 'ONE': 2,
 'THE': 18,
 'BOY': 1,
 'WHO': 1,
 'LIVED': 1,
 'Mr.': 79,
 'and': 1804,
 'Mrs.': 44,
 'Dursley': 54,
 ',': 5658,
 'of': 1242,
 'number': 15,
 'four': 30,
 'Privet': 16,
 'Drive': 16,
 'were': 330,
 'proud': 7,
 'to': 1845,
 'say': 72,
 'that': 632,
 'they': 506,
 'perfectly': 5,
 'normal': 10,
 'thank': 4,
 'you': 837,
 'very': 161,
 'much': 74,
 '.': 5119,
 'They': 183,
 'the': 3310,
 'last': 82,
 'people': 87,
 "'d": 267,
 'expect': 13,
 'be': 362,
 'involved': 5,
 'in': 933,
 'anything': 70,
 'strange': 21,
 'or': 96,
 'mysterious': 5,
 'because': 84,
 'just': 161,
 'did': 280,
 "n't": 826,
 'hold': 11,
 'with': 403,
 'such': 21,
 'nonsense': 4,
 'was': 1253,
 'director': 2,
 'a': 1578,
 'firm': 2,
 'called': 44,
 'Grunnings': 2,
 'which': 84,
 'made': 66,
 'drills': 6,
 'He': 548,
 'big': 28,
 'beefy': 1,
 'man': 35,
 'hardly': 21,
 'any': 61,
 'neck': 17,
 'although': 7,
 'he': 1208,
 'have': 307,
 'large': 51,
 'mustache': 6,
 'thin': 10,
 'blonde': 2,

In [17]:
nltk.tokenize.word_tokenize(nltk_sentences[0])

['CHAPTER',
 'ONE',
 'THE',
 'BOY',
 'WHO',
 'LIVED',
 'Mr.',
 'and',
 'Mrs.',
 'Dursley',
 ',',
 'of',
 'number',
 'four',
 ',',
 'Privet',
 'Drive',
 ',',
 'were',
 'proud',
 'to',
 'say',
 'that',
 'they',
 'were',
 'perfectly',
 'normal',
 ',',
 'thank',
 'you',
 'very',
 'much',
 '.']

In [21]:
tokens = {}
for t in doc:
  if t.text not in tokens:
    tokens[t.text] = 0
  tokens[t.text] +=1

frequent_tokens = sorted(tokens, key=tokens.get, reverse=True) [:20]
for t in frequent_tokens:
  print(t.replace("\n","<NEWLINE>"),"\t\t\t\t", tokens[t])

, 				 5658
. 				 5125
" 				 4747
the 				 3312
<NEWLINE><NEWLINE> 				 3014
to 				 1851
and 				 1807
a 				 1581
Harry 				 1324
was 				 1253
of 				 1250
he 				 1208
's 				 998
in 				 935
I 				 922
it 				 898
his 				 896
you 				 838
n't 				 821
said 				 793
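
Both frequency lists are dominated by punctuation, quotation marks, and (for spaCy) newline tokens, and they count `He` and `he` separately. A minimal sketch of one possible clean-up, using a short hypothetical token list (`raw_tokens`) and plain string methods rather than any library feature:

```python
from collections import Counter

# Hypothetical raw tokens, including punctuation and a newline token
raw_tokens = ["Harry", ",", "the", "The", "\n\n", ".", "harry", "the", '"']

# Keep alphabetic tokens only and fold case before counting
word_counts = Counter(t.lower() for t in raw_tokens if t.isalpha())

print(word_counts.most_common())
```

Note that `str.isalpha()` also drops contraction pieces such as `'s` and `n't`. With spaCy tokens, the attributes `is_punct` and `is_space` could serve as the filter instead.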


# N-Gram Computation

In [22]:
nltk_sentence = nltk_sentences[80]
sentence_tokens = nltk.tokenize.word_tokenize(nltk_sentence)
# Build all bigrams (n = 2) of the chosen sentence
ngrams = list(nltk.ngrams(sentence_tokens, 2))
print(nltk_sentence)
ngrams

It was now sitting on his garden wall.


[('It', 'was'),
 ('was', 'now'),
 ('now', 'sitting'),
 ('sitting', 'on'),
 ('on', 'his'),
 ('his', 'garden'),
 ('garden', 'wall'),
 ('wall', '.')]

In [23]:
spacy_sentence = spacy_sentences[100]
sentence_doc = nlp(spacy_sentence.text)
# filter_punct defaults to True, so n-grams containing punctuation are dropped
ngrams = list(textacy.extract.basics.ngrams(sentence_doc, 2, filter_stops=False))
print(spacy_sentence)
ngrams

And now, over to Jim McGuffin with the weather.


[And now, over to, to Jim, Jim McGuffin, McGuffin with, with the, the weather]
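
The cells above compute bigrams only, while the task asks for all n-grams. A minimal sketch in plain Python, assuming a hypothetical helper `all_ngrams` (not part of NLTK or textacy), enumerates every n from 1 up to the sentence length:

```python
def all_ngrams(tokens, max_n=None):
    """Return a dict mapping n to the list of contiguous n-grams (as tuples)."""
    if max_n is None:
        max_n = len(tokens)
    return {
        n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }

# Tokens of the sentence used in the NLTK example above
sentence_tokens = ["It", "was", "now", "sitting", "on", "his", "garden", "wall", "."]
grams = all_ngrams(sentence_tokens)

# Show the number of n-grams and the first two for n = 1, 2, 3
for n in (1, 2, 3):
    print(n, len(grams[n]), grams[n][:2])
```

A sentence of k tokens has k - n + 1 n-grams for each n, so the bigram list here has 8 entries, matching the NLTK output above.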

# POS Tagging

In [25]:
print(nltk_sentence)
pos_tags = nltk.pos_tag(sentence_tokens)
for t, tag in pos_tags:
  print(t,"\t\t",tag)


It was now sitting on his garden wall.
It 		 PRP
was 		 VBD
now 		 RB
sitting 		 VBG
on 		 IN
his 		 PRP$
garden 		 NN
wall 		 NN
. 		 .


In [27]:
print(spacy_sentence)
for t in sentence_doc:
  print(t.text,"\t\t",t.pos_)

And now, over to Jim McGuffin with the weather.
And 		 CCONJ
now 		 ADV
, 		 PUNCT
over 		 ADP
to 		 ADP
Jim 		 PROPN
McGuffin 		 PROPN
with 		 ADP
the 		 DET
weather 		 NOUN
. 		 PUNCT
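
The two outputs use different tagsets: `nltk.pos_tag` returns Penn Treebank tags (`PRP`, `VBD`, ...), while spaCy's `pos_` attribute holds coarse Universal POS tags (`ADV`, `PROPN`, ...); spaCy's fine-grained tag is available as `tag_`. To compare the two, the Penn tags can be collapsed into coarse labels. The mapping below is a hand-written, partial table covering only the tags in the NLTK output above; it is an assumption for illustration, not the mapping either library ships:

```python
# Hand-written, partial mapping from Penn Treebank tags to coarse
# Universal-POS-style labels (an assumption for illustration only)
PENN_TO_COARSE = {
    "PRP": "PRON",
    "PRP$": "PRON",
    "VBD": "VERB",
    "VBG": "VERB",
    "RB": "ADV",
    "IN": "ADP",
    "NN": "NOUN",
    ".": "PUNCT",
}

# (token, tag) pairs copied from the NLTK output above
pos_tags = [
    ("It", "PRP"), ("was", "VBD"), ("now", "RB"), ("sitting", "VBG"),
    ("on", "IN"), ("his", "PRP$"), ("garden", "NN"), ("wall", "NN"), (".", "."),
]

# Unknown tags fall back to "X", the Universal POS label for "other"
coarse = [(token, PENN_TO_COARSE.get(tag, "X")) for token, tag in pos_tags]
for token, label in coarse:
    print(token, "\t\t", label)
```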


# Stemming

In [29]:
print(nltk_sentence)
porter = nltk.stem.PorterStemmer()
for t in sentence_tokens:
  print(t,"\t\t",porter.stem(t))

It was now sitting on his garden wall.
It 		 it
was 		 wa
now 		 now
sitting 		 sit
on 		 on
his 		 hi
garden 		 garden
wall 		 wall
. 		 .
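
The stems `wa` and `hi` look odd but follow from how rule-based stemming works: the stemmer strips suffixes without checking whether the result is a word. The sketch below implements only step 1a of the original Porter algorithm (plural removal) as a hypothetical helper `porter_step_1a`; it is not the full stemmer and not NLTK's implementation:

```python
def porter_step_1a(word):
    """Apply only step 1a (plural removal) of the original Porter algorithm."""
    word = word.lower()
    if len(word) <= 2:          # very short words are left unchanged
        return word
    if word.endswith("sses"):   # caresses -> caress
        return word[:-2]
    if word.endswith("ies"):    # ponies -> poni
        return word[:-2]
    if word.endswith("ss"):     # caress -> caress
        return word
    if word.endswith("s"):      # cats -> cat, but also was -> wa
        return word[:-1]
    return word

for w in ["was", "his", "caresses", "ponies", "caress", "wall"]:
    print(w, "\t\t", porter_step_1a(w))
```

The final rule removes any trailing `s`, which is what turns `was` into `wa` and `his` into `hi` in the output above.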


# Lemmatization

In [31]:
print(nltk_sentence)
# lemmatize() treats each token as a noun unless a pos argument is passed
lemmatizer = nltk.stem.WordNetLemmatizer()
for t in sentence_tokens:
  print(t,"\t\t", lemmatizer.lemmatize(t))

It was now sitting on his garden wall.
It 		 It
was 		 wa
now 		 now
sitting 		 sitting
on 		 on
his 		 his
garden 		 garden
wall 		 wall
. 		 .
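
In the output above, `was` becomes `wa` and `sitting` is left unchanged because `WordNetLemmatizer.lemmatize` defaults to `pos='n'`, so every token is treated as a noun. Passing the part of speech usually helps. The sketch below shows a hypothetical helper, `penn_to_wordnet`, that converts the Penn Treebank tags from `nltk.pos_tag` into the single-letter codes WordNet uses (`'a'`, `'v'`, `'n'`, `'r'`); the helper is an assumption for illustration, not an NLTK function:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'n' or 'r')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

# (token, tag) pairs copied from the NLTK POS tagging output above
pos_tags = [
    ("It", "PRP"), ("was", "VBD"), ("now", "RB"), ("sitting", "VBG"),
    ("on", "IN"), ("his", "PRP$"), ("garden", "NN"), ("wall", "NN"), (".", "."),
]

wordnet_pos = [(token, penn_to_wordnet(tag)) for token, tag in pos_tags]
for token, pos in wordnet_pos:
    print(token, "\t\t", pos)
```

Each code could then be passed as `lemmatizer.lemmatize(token, pos=code)`. With `pos='v'`, the WordNet lemmatizer is expected to return `be` for `was` and `sit` for `sitting`.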


In [33]:
print(spacy_sentence)
for t in sentence_doc:
  print(t.text,"\t\t",t.lemma_)

And now, over to Jim McGuffin with the weather.
And 		 and
now 		 now
, 		 ,
over 		 over
to 		 to
Jim 		 Jim
McGuffin 		 McGuffin
with 		 with
the 		 the
weather 		 weather
. 		 .
