# NLTK and the Basics - Frequency Distribution

**Take a list of words and see how many times each word appears**

In [1]:
import nltk

In [2]:
# File IDs of the Project Gutenberg texts that came with the NLTK data download
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [9]:
alice = nltk.corpus.gutenberg.words("carroll-alice.txt")
alice

['[', 'Alice', "'", 's', 'Adventures', 'in', ...]

**Frequency distribution takes a list of words and counts how many times each word is seen.**

In [4]:
alice_fd = nltk.FreqDist(alice)

In [5]:
alice_fd

FreqDist({',': 1993, "'": 1731, 'the': 1527, 'and': 802, '.': 764, 'to': 725, 'a': 615, 'I': 543, 'it': 527, 'she': 509, ...})
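To make the counting step concrete, here is a minimal sketch using only the standard library. In NLTK 3, `FreqDist` is built on Python's `collections.Counter`, so the same tally can be illustrated without the corpus. The `tokens` list is a made-up sample, not text from the book.

```python
from collections import Counter

# Hypothetical sample tokens; the notebook's real input is the `alice` word list.
tokens = ["the", "rabbit", "saw", "the", "cat", "and", "the", "hat"]

# Counting by hand: one dictionary entry per distinct token.
manual_counts = {}
for token in tokens:
    manual_counts[token] = manual_counts.get(token, 0) + 1

# Counter performs the same tally in a single call.
counts = Counter(tokens)

print(manual_counts["the"])   # 3
print(counts["rabbit"])       # 1
print(counts["dormouse"])     # 0: an unseen word counts as zero, no KeyError
print(counts.most_common(1))  # [('the', 3)]
```

Lookups by key and `most_common()` behave the same way on a `FreqDist`.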

**`alice_fd` is structured like a dictionary: each key is a word, and its value is the number of times that word was seen (its frequency).**

In [6]:
alice_fd["Rabbit"]
# Rabbit (a key for the dictionary here) was seen 45 times in the book

45

We can find the **15 most common words seen.**

In [7]:
alice_fd.most_common(15)

[(',', 1993),
 ("'", 1731),
 ('the', 1527),
 ('and', 802),
 ('.', 764),
 ('to', 725),
 ('a', 615),
 ('I', 543),
 ('it', 527),
 ('she', 509),
 ('of', 500),
 ('said', 456),
 (",'", 397),
 ('Alice', 396),
 ('in', 357)]

**Apart from `Alice`, the most common tokens are punctuation and function words, which are not really descriptive. So looking at the most common words might NOT be the best way to learn about the book.**
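One quick way to check this observation is to drop punctuation tokens before counting. The sketch below is illustrative only: it uses `collections.Counter` in place of `nltk.FreqDist`, and `tokens` is a hypothetical sample rather than the real text.

```python
from collections import Counter

# Hypothetical sample tokens standing in for the `alice` word list.
tokens = ["Alice", ",", "said", "the", "Rabbit", ".", "'",
          "the", "Queen", ",", "'", "said", "Alice", "."]

# Keep only purely alphabetic tokens and fold case so "The" and "the" merge.
words = [t.lower() for t in tokens if t.isalpha()]

word_counts = Counter(words)
print(word_counts.most_common(3))
```

Even with punctuation removed, function words such as `the` still rank near the top, so filtering alone does not surface what the book is about.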

**A word used only once in a collection of text is called a `hapax legomenon`. The `hapaxes()` method returns a list of the words that were used only once; the first 15 are shown here.**

In [8]:
alice_fd.hapaxes()[:15]

['Lewis',
 'Carroll',
 '1865',
 ']',
 'Hole',
 'conversations',
 'daisy',
 'chain',
 'daisies',
 'pink',
 'wondered',
 'actually',
 'TOOK',
 'WATCH',
 'OUT']
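The idea behind `hapaxes()` can be sketched with the standard library alone: keep every word whose count is exactly 1. This is an illustration of the concept, not NLTK's own implementation, and the `tokens` list is a made-up sample.

```python
from collections import Counter

# Hypothetical sample tokens; not taken from the book.
tokens = ["the", "rabbit", "saw", "the", "cat", "and", "the", "rabbit"]
counts = Counter(tokens)

# A hapax is any word whose count is exactly 1.
hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)  # ['saw', 'cat', 'and']
```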

These words are descriptive but, by definition, very uncommon in the book. Since neither the most common nor the least common words gave much insight, we will later learn how to find a text's **defining words**.