# Our first Notebook!

## 1. Basic Operations

In [9]:
data_dir = 'data/hp/'

### 1.1 Import libraries

In [10]:
import os 

In [11]:
os.path.isdir(data_dir)

True

In [12]:
os.path.isfile(data_dir + 'Book1.txt')

True

In [13]:
from os import path

In [14]:
bk1 = data_dir + 'Book1.txt'
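
Concatenating strings works here only because `data_dir` ends with a slash. As a sketch, the `path` module imported above can build the same kind of path with `path.join`, which inserts the separator when it is missing. The helper name `book_path` and the temporary directory are assumptions made for this example, so that it runs without the book files.

```python
import tempfile
from os import path


def book_path(root, name):
    """Join a directory and a file name; works with or without a trailing slash."""
    return path.join(root, name)


# A throwaway directory stands in for data/hp/ (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = book_path(tmp_dir, 'Book1.txt')
    with open(demo_file, 'w', encoding='utf-8') as f:
        f.write('THE BOY WHO LIVED')

    dir_exists = path.isdir(tmp_dir)                             # the directory is there
    file_exists = path.isfile(demo_file)                         # the file we just wrote
    other_exists = path.isfile(book_path(tmp_dir, 'Book8.txt'))  # never created

print(dir_exists, file_exists, other_exists)
```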

### 1.2 Open files

In [15]:
with open(bk1) as f:
    bk1_text = f.read()

In [16]:
len(bk1_text)

449564

This is the number of characters in the file, not the number of tokens: no tokenization has been applied yet.
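
As a small illustration of the difference between characters and tokens, the sketch below takes the opening sentence of Book 1 and compares its length in characters with a crude whitespace split. The variable names are illustrative, and whitespace splitting is only a stand-in for a real tokenizer, since it leaves punctuation attached to the words.

```python
sample = ("Mr. and Mrs. Dursley, of number four, Privet Drive, "
          "were proud to say that they were perfectly normal, "
          "thank you very much.")

n_chars = len(sample)           # every character counts, spaces and punctuation included
rough_tokens = sample.split()   # split on whitespace only
n_rough = len(rough_tokens)

print(n_chars, n_rough)
print(rough_tokens[:5])         # punctuation stays attached, e.g. 'Dursley,'
```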

In [17]:
print(bk1_text[:210])

THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much. They were the last people you’d 
expect to be involved in anythi


## 2. Work with text and `nltk`

In [18]:
import nltk

In [19]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### 2.1 Tokenizers

Tokenization is handled by functions in NLTK's *tokenize* submodule. The `word_tokenize` function used here relies on the *punkt* sentence-tokenizer models, which have to be downloaded once. 
https://www.nltk.org/api/nltk.tokenize.html

In [20]:
from nltk.tokenize import word_tokenize

In [26]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/camillacanevese/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [27]:
bk1_toks = word_tokenize(bk1_text)

In [22]:
len(bk1_text)

449564

In [28]:
len(bk1_toks)

101221

Let's inspect our tokens:

In [29]:
bk1_toks[100:120]

['and',
 'blonde',
 'and',
 'had',
 'nearly',
 'twice',
 'the',
 'usual',
 'amount',
 'of',
 'neck',
 ',',
 'which',
 'came',
 'in',
 'very',
 'useful',
 'as',
 'she',
 'spent']

### 2.2 Corpus readers

NLTK corpus readers: the modules in this package provide functions for reading corpus files in a variety of formats. They can read both the corpora distributed in the NLTK corpus package and files that belong to external corpora, such as our own.
https://www.nltk.org/api/nltk.corpus.reader.html

In [30]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

*PlaintextCorpusReader*: a reader for corpora that consist of plaintext documents. Paragraphs are assumed to be separated by blank lines, and sentences and words are segmented by tokenizers that can be passed to the constructor.

In [31]:
hp_corpus = PlaintextCorpusReader(data_dir, fileids=r'.*\.txt')

We must specify the *root* directory (in our case *data_dir*) and the file IDs, given here as a regular expression that matches every `.txt` file under the root.
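
To make the two arguments concrete, here is a minimal sketch that builds a reader over a temporary directory instead of `data/hp/`. The file names and contents are made up for the demo; the pattern `r'.*\.txt'` is the one used above and should select only the `.txt` files.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical mini-corpus: two .txt files and one file the pattern should skip.
demo_files = {
    'Book1.txt': 'The boy who lived.',
    'Book2.txt': 'The worst birthday.',
    'notes.md': 'Not part of the corpus.',
}

with tempfile.TemporaryDirectory() as root:
    for name, text in demo_files.items():
        with open(os.path.join(root, name), 'w', encoding='utf-8') as f:
            f.write(text)

    demo_corpus = PlaintextCorpusReader(root, fileids=r'.*\.txt')
    found = sorted(demo_corpus.fileids())             # only the .txt files
    first_raw = demo_corpus.raw(fileids='Book1.txt')  # contents as one string

print(found)
print(first_raw)
```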

In [32]:
hp_corpus.fileids()

['Book1.txt',
 'Book2.txt',
 'Book3.txt',
 'Book4.txt',
 'Book5.txt',
 'Book6.txt',
 'Book7.txt']

`fileids()` lists the files in the corpus, i.e. those matched by the pattern. 

In [39]:
print(hp_corpus.raw(fileids='Book4.txt')[:100])

/ 




THE RIDDLE HOUSE 

The villagers of Little Hangleton still called it “the 
Riddle House,” eve


Here, `raw()` returns the contents of the file as a single *string of characters*; we print the first 100.

In [43]:
hp_corpus.sents(fileids='Book7.txt')[20]

['Peacocks',
 '...”',
 'Yaxley',
 'thrust',
 'his',
 'wand',
 'back',
 'under',
 'his',
 'cloak',
 'with',
 'a',
 'snort',
 '.']

Here, `sents()` returns the text as a list of sentences, each of which is itself a list of tokens (a *list of lists*); index 20 picks out a single sentence.

In [45]:
len(hp_corpus.sents())

84352

In [47]:
len(hp_corpus.sents(fileids='Book7.txt'))

15193

In [49]:
hp_corpus.sents(fileids='Book1.txt')[30]

['Mr',
 '.',
 'Dursley',
 'gave',
 'himself',
 'a',
 'little',
 'shake',
 'and',
 'put',
 'the',
 'cat',
 'out',
 'of',
 'his',
 'mind',
 '.']
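
Note that `Mr.` has been split into `'Mr'` and `'.'`. By default, `PlaintextCorpusReader` tokenizes words with NLTK's `WordPunctTokenizer`, a regular-expression tokenizer that separates runs of word characters from runs of punctuation, rather than with `word_tokenize`. A minimal sketch that reproduces the tokens above directly (the sentence string is reconstructed from that output):

```python
from nltk.tokenize import WordPunctTokenizer

sentence = 'Mr. Dursley gave himself a little shake and put the cat out of his mind.'

# Splits on the pattern \w+|[^\w\s]+ : word characters vs. runs of punctuation.
toks = WordPunctTokenizer().tokenize(sentence)
print(toks)
```

This also means that token counts from the corpus reader are not directly comparable with those from `word_tokenize`.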

In [50]:
hp_words = hp_corpus.words()

In [52]:
len(hp_words)

1396620

Here, `words()` returns a flat list of all the tokens in the corpus, ignoring sentence boundaries. 
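
Conceptually, the flat token list is what is left once the sentence level is removed from a list of sentences. The sketch below shows this with plain Python lists; the two sentences are shortened versions of outputs shown earlier, and `nested` and `flat` are illustrative names, not part of NLTK.

```python
from itertools import chain

# A list of sentences, each a list of tokens (the shape returned by sents()).
nested = [
    ['Mr', '.', 'Dursley', 'gave', 'himself', 'a', 'little', 'shake', '.'],
    ['Yaxley', 'thrust', 'his', 'wand', 'back', 'under', 'his', 'cloak', '.'],
]

# A flat list of tokens with sentence boundaries dropped (the shape returned by words()).
flat = list(chain.from_iterable(nested))

print(len(nested))   # number of sentences
print(len(flat))     # number of tokens
print(flat[:4])
```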

In [53]:
hp_words[500:515]

['first',
 'sign',
 'of',
 'something',
 'peculiar',
 '—',
 'a',
 'cat',
 'reading',
 'a',
 'map',
 '.',
 'For',
 'a',
 'second']

### 2.3 Word frequencies

I want to count words, but first I want to lowercase everything, so that forms like 'Harry' and 'harry' are counted together.

In [55]:
'Harry Potter'.lower()

'harry potter'

In [57]:
toks_lower = [tok.lower() for tok in hp_corpus.words()]

In [58]:
from collections import Counter

In [59]:
c = Counter(toks_lower)

In [60]:
c['harry']

18215

In [61]:
c['knew']

1005

In [63]:
len(c)

21089
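
`len(c)` counts *types* (distinct lowercased tokens), while the sum of all the counts gives back the number of *tokens*. The sketch below shows this on a made-up token list, then computes the type/token ratio from the two figures reported in this notebook.

```python
from collections import Counter

toy_toks = ['the', 'cat', 'saw', 'the', 'other', 'cat', '.']
toy_counts = Counter(toy_toks)

n_types = len(toy_counts)             # distinct tokens
n_tokens = sum(toy_counts.values())   # all tokens, repeats included

# Figures reported above for the whole corpus.
corpus_types = 21089
corpus_tokens = 1396620
ttr = corpus_types / corpus_tokens    # type/token ratio

print(n_types, n_tokens)
print(round(ttr, 4))
```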

How do we find the most frequent types in our corpus?

In [64]:
c.most_common(20)

[(',', 74288),
 ('.', 60720),
 ('the', 51927),
 ('“', 36869),
 ('’', 34270),
 ('and', 27666),
 ('to', 26907),
 ('he', 22223),
 ('of', 21899),
 ('a', 21094),
 ('harry', 18215),
 ('was', 15646),
 ('s', 14845),
 ('you', 14657),
 ('it', 14572),
 ('said', 14491),
 ('his', 14289),
 ('i', 13492),
 ('in', 12686),
 (',”', 11502)]

Note that this ranking also includes *punctuation* and *function words*, which take up most of the top 20. 
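
One way to get at content words is to drop punctuation and function words before looking at the ranking. The sketch below rebuilds a `Counter` from the twenty pairs printed above, keeps only alphabetic tokens, and removes a small hand-written stop list. The stop list is an assumption made for this example, not an NLTK resource, and a fuller list would be needed for the whole vocabulary.

```python
from collections import Counter

# The 20 most common tokens and their counts, copied from the output above.
top20 = Counter({
    ',': 74288, '.': 60720, 'the': 51927, '“': 36869, '’': 34270,
    'and': 27666, 'to': 26907, 'he': 22223, 'of': 21899, 'a': 21094,
    'harry': 18215, 'was': 15646, 's': 14845, 'you': 14657, 'it': 14572,
    'said': 14491, 'his': 14289, 'i': 13492, 'in': 12686, ',”': 11502,
})

# Hand-written stop list (assumption for the demo).
stop_words = {'the', 'and', 'to', 'he', 'of', 'a', 'was', 's',
              'you', 'it', 'his', 'i', 'in'}

# Keep alphabetic tokens that are not in the stop list.
content = Counter({
    tok: n for tok, n in top20.items()
    if tok.isalpha() and tok not in stop_words
})

print(content.most_common())
```

Of the twenty most frequent types, only `harry` and `said` survive this filter.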