# Welcome to Info 202!

MIMS students have Info 202 to thank for our recurring dreams of spice racks and closets. The class's textbook, *The Discipline of Organizing*, is truly the manual for all things resource organizing. But what if we reorganized the grand organizer? What if the interaction this textbook optimized for was not learning about organizing but learning about how much this textbook is about organizing?

If that appeals to you, then come with me on a magical journey through language processing!

*&ast;CUE ALADDIN MUSIC&ast;*

## Getting Started

First we need to load up the text. Fortunately, I have a PDF of the book and can go to this [totally not sketchy website](pdftotext.com) to convert it into a .txt file (the file is in the repo). Now let's load her up and check out some of the most common words.

In [1]:
import nltk
from nltk.probability import FreqDist

with open('Discipline_of_Organizing_Informatics_Edition_Fourth_Edition.txt', 'r') as file:
    book_202 = file.read()
word_bag_202 = nltk.word_tokenize(book_202)
dist_202 = FreqDist(word_bag_202)
dist_202.most_common(20)

[('.', 20338),
 (',', 13872),
 ('the', 10921),
 ('of', 8296),
 ('and', 8205),
 ('to', 5708),
 ('a', 5434),
 ('in', 4368),
 (':', 4289),
 (']', 3815),
 ('[', 3815),
 ('is', 3637),
 ('(', 3206),
 (')', 3203),
 ('that', 3192),
 ('or', 3156),
 ('are', 2720),
 ('for', 2428),
 ('The', 2043),
 ('as', 1802)]

Huh, there's a lot of garbage in there. Let's clean that up.

In [7]:
import re

# keep only tokens that start with a letter; match() anchors at the start of the token
alpha_regex = re.compile('[a-zA-Z]+')
word_list_202 = [w.lower() for w in word_bag_202 if alpha_regex.match(w)]
dist_202 = FreqDist(word_list_202)
dist_202.most_common(20)

[('the', 12964),
 ('of', 8320),
 ('and', 8232),
 ('a', 5997),
 ('to', 5746),
 ('in', 4939),
 ('is', 3745),
 ('that', 3224),
 ('or', 3167),
 ('for', 2787),
 ('are', 2752),
 ('resources', 1944),
 ('as', 1913),
 ('organizing', 1719),
 ('be', 1710),
 ('it', 1656),
 ('with', 1613),
 ('by', 1603),
 ('can', 1504),
 ('resource', 1382)]
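
To see what the cleanup step does without the book or NLTK's tokenizer data, here is a minimal sketch on a made-up sentence. It uses only the standard library: `re.findall` stands in for `nltk.word_tokenize` and `collections.Counter` stands in for `FreqDist`, so the tokenization is only an approximation of what NLTK does. The `sample` text and the variable names are invented for illustration.

```python
import re
from collections import Counter

# made-up sample text, not taken from the book
sample = "The resources are organized. The Resource, [1] the organizing system!"

# rough stand-in for nltk.word_tokenize (an assumption, not NLTK's algorithm):
# runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", sample)

# same idea as the filter above: keep tokens that start with a letter, then lowercase
alpha_regex = re.compile('[a-zA-Z]+')
words = [t.lower() for t in tokens if alpha_regex.match(t)]

raw_counts = Counter(tokens)     # punctuation, digits, and mixed case all counted separately
clean_counts = Counter(words)    # letters only, case folded

print(raw_counts.most_common(3))
print(clean_counts.most_common(3))
```

Lowercasing merges `'The'` and `'the'` into a single entry, which is why the count for `'the'` in the cleaned list above (12964) is the sum of the two separate counts from the first run (10921 + 2043).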

## Calculating Word Frequency Percentages

That's a bit better. But this isn't quite the information we came here for. Yes, Bob Glushko clearly has a fondness for resources, but this display doesn't tell us how much he really loves them. Let's see what percentage of all the words in the book some of his favorites account for!

In [3]:
# counts the words that start with the prefix (case-insensitive)
def prefix_frequency(prefix, word_list):
    # re.escape keeps any regex metacharacters in the prefix literal;
    # match() only anchors at the start, so this is a prefix test
    regex = re.compile(re.escape(prefix), re.IGNORECASE)
    return sum(1 for w in word_list if regex.match(w))

# calculates the percent of words in a certain list
def prefix_percent(prefix, word_list):
    p_freq = prefix_frequency(prefix, word_list)
    percent = p_freq/len(word_list) * 100
    return percent

def print_202_prefix(prefix):
    percent = prefix_percent(prefix, word_list_202)
    print('\n\"' + prefix + '\" starts ' + '{:.2f}'.format(percent) + '% of the words in Discipline of Organizing.')

print_202_prefix("resource")
print_202_prefix("organiz")
print_202_prefix("informa")
print_202_prefix("interact")


"resource" starts 1.31% of the words in Discipline of Organizing.

"organiz" starts 1.02% of the words in Discipline of Organizing.

"informa" starts 0.57% of the words in Discipline of Organizing.

"interact" starts 0.29% of the words in Discipline of Organizing.

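The percentage is just the number of words starting with the prefix divided by the total number of words, times 100. A minimal sketch of that arithmetic on a ten-word toy list, assuming a plain `str.startswith` test in place of the regex; `starts_with_percent` and `toy_words` are hypothetical names, not part of the notebook.

```python
def starts_with_percent(prefix, words):
    """Percent of words that begin with prefix (hypothetical helper for illustration)."""
    if not words:
        return 0.0  # avoid dividing by zero on an empty list
    hits = sum(1 for w in words if w.lower().startswith(prefix.lower()))
    return hits / len(words) * 100

# invented ten-word list, so each hit is worth exactly 10%
toy_words = ["resource", "resources", "organize", "organizing", "the",
             "of", "and", "information", "system", "a"]

for prefix in ("resource", "organiz", "informa"):
    percent = starts_with_percent(prefix, toy_words)
    print('"{}" starts {:.2f}% of the toy words.'.format(prefix, percent))
```

Two of the ten toy words start with "resource", so it accounts for 20% of the list; in the real book the same calculation runs over every cleaned token.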

## Comparing Word Frequencies to an Average Text

Hmmm... this is interesting, but not as interesting as it could be. After all, what if these words are just a lot more common than we assume? Well, let's compare the book to a giant set of words, one that ends up nicely following [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law#cite_note-4): the Brown Corpus!

In [6]:
from nltk.corpus import brown

# the brown corpus is big so this might take a bit
brown_words = [w.lower() for w in brown.words() if alpha_regex.match(w)]

def compare_brown_to_202(prefix):
    percent_brown = prefix_percent(prefix, brown_words)
    percent_202 = prefix_percent(prefix, word_list_202)
    return percent_202 / percent_brown

def print_prefix_comparison(prefix):
    val = compare_brown_to_202(prefix)
    print("\nThe prefix \"" + prefix + "\" appears " + '{:.2f}'.format(val) + " times as often in Discipline of Organizing as in the Brown Corpus.\n")

print_prefix_comparison("resource")
print_prefix_comparison("organiz")
print_prefix_comparison("system")
print_prefix_comparison("interact")



The prefix "resource" appears 149.63 times as often in Discipline of Organizing as in the Brown Corpus.

The prefix "organiz" appears 36.62 times as often in Discipline of Organizing as in the Brown Corpus.

The prefix "system" appears 12.70 times as often in Discipline of Organizing as in the Brown Corpus.

The prefix "interact" appears 120.35 times as often in Discipline of Organizing as in the Brown Corpus.
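
One caveat with the ratio: if a prefix never shows up in the reference corpus, `percent_202 / percent_brown` divides by zero and raises a `ZeroDivisionError`. Here is a minimal sketch of the same comparison with that case handled, run on invented word lists. `prefix_share`, `frequency_ratio`, and both toy lists are hypothetical names, not part of the notebook.

```python
def prefix_share(prefix, words):
    """Fraction of words that start with prefix."""
    return sum(1 for w in words if w.startswith(prefix)) / len(words)

def frequency_ratio(prefix, target_words, reference_words):
    """Ratio of the prefix's share in target to its share in reference.

    Returns None when the reference never uses the prefix, since the
    ratio is undefined in that case.
    """
    reference_share = prefix_share(prefix, reference_words)
    if reference_share == 0:
        return None
    return prefix_share(prefix, target_words) / reference_share

# invented data: 3 of 10 words vs. 1 of 20 words start with "resource"
toy_book = ["resource"] * 3 + ["the"] * 7
toy_reference = ["resource"] * 1 + ["the"] * 19

for prefix in ("resource", "organiz"):
    ratio = frequency_ratio(prefix, toy_book, toy_reference)
    if ratio is None:
        print('"{}" never appears in the reference list, so no ratio is defined.'.format(prefix))
    else:
        print('"{}" appears {:.2f} times as often in the toy book.'.format(prefix, ratio))
```

Because both sides are shares of their own word count, the ratio is not skewed by the toy book being half the size of the toy reference, just as the real comparison is not skewed by the size difference between the book and the Brown Corpus.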


<img src="dank_gif.gif">