# Overview

We're now switching focus away from Network Science (for a little bit) and beginning to think about _Natural Language Processing_ instead. In other words, today is all about teaching your computer to "understand" text. This ties in nicely with our work on Wikipedia, since Wikipedia is a network of connected pieces of text. We've looked at the network so far; now let's see if we can include the text. Today is about:

* Installing the _natural language toolkit_ (NLTK) package and learning the basics of how it works (Chapter 1)
* Figuring out how to make NLTK work with other types of text (Chapter 2).

# Installing and the basics

> _Reading_
> The reading for today is Natural Language Processing with Python, first edition (NLPP1e) Chapter 1, Sections 1.1, 1.2, 1.3\. [It's free online](http://www.nltk.org/book_1ed/). 
> 
> * **Important**: Do not use the newest version of this book. Use the first edition. (The newest version is based on Python 3).
> * **Important**: Seriously, remember that we're using the *first edition*.
> 

> _Exercises_: NLPP1e Chapter 1\.
> 
> * First, install `nltk` if it isn't installed already (there are some tips below that I recommend checking out before installing).
> * Second, work through chapter 1. The book is set up as a kind of tutorial with lots of examples for you to work through. I recommend you read the text with an open IPython Notebook and type out the examples that you see. ***It becomes much more fun if you add a few variations and see what happens***. Some of those examples might very well be due as assignments (see below the install tips), so those should definitely be in a `notebook`. 
 

In [2]:
import nltk
nltk.download()
#text1.concordance("monstrous")

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [51]:
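# __future__ imports go first so they apply to the whole cell
from __future__ import print_function
from __future__ import division
import os
import pprint
import nltk
from IPython.display import Image, display
from nltk.draw import TreeWidget
from nltk.draw.util import CanvasFrame

# Point NLTK at the local data folder (an "nltk" folder inside the working directory)
nltk_data = os.path.join(os.getcwd(), "nltk")
nltk.data.path.append(nltk_data)
print(nltk_data)

from nltk.book import *

# concordance(), similar() and common_contexts() print their results themselves
# and return None, which is why "None" shows up in the output
print(text1.concordance("monstrous"), "\n")
print(text1.similar("monstrous"), "\n")
print(text2.common_contexts(["monstrous", "very"]))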

D:\1st Semester Master\Social Graphs and interaction\lesson6\nltk
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
None 

imperial subtly im

### NLTK Install tips 

Check to see if `nltk` is installed on your system by typing `import nltk` in a `notebook`. If it's not already installed, install it as part of _Anaconda_ by typing 

     conda install nltk 

at the command prompt. If you don't have the corpora yet, you can download them using the command-line version of the downloader, which also runs in Python notebooks. In the IPython notebook, run the code 

     import nltk
     nltk.download()

Now you can hit `d` to download, then type "book" to fetch the collection needed for today's `nltk` session. Now that everything is up and running, let's get to the actual exercises.
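If you would rather skip the interactive prompt, the steps above can be sketched in code as follows. This is a rough sketch, not the course's official setup: it assumes that `nltk.download` accepts the collection identifier `"book"` directly (the same identifier you type at the prompt), and the helper names `nltk_is_installed` and `fetch_book_collection` are made up for this example.

```python
from __future__ import print_function


def nltk_is_installed():
    """Return True if `import nltk` succeeds on this system."""
    try:
        import nltk  # imported only to test availability
    except ImportError:
        return False
    return True


def fetch_book_collection():
    """Download the "book" collection without the interactive prompt.

    Needs a network connection, so it is only defined here, not called.
    """
    import nltk
    return nltk.download("book")


print("nltk installed:", nltk_is_installed())
```

Call `fetch_book_collection()` yourself once the check reports that `nltk` is installed.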

> _Exercises_: NLPP1e Chapter 1 (the stuff that might be due in an upcoming assignment).
> 
> The following exercises from Chapter 1 are what might be due for an assignment later on.
>
> * Try out the `concordance` method, using another text and a word of your own choosing.
> * Also try out the `similar` and `common_contexts` methods for a few of your own examples.
> * Create your own version of a dispersion plot ("your own version" means another text and different word).
> * Explain in your own words what aspect of language _lexical diversity_ describes. 
> * Create frequency distributions for `text2`, including the cumulative frequency plot for the 75 most common words.
> * What is a bigram? How does it relate to `collocations`? Explain in your own words.
> * Work through exercises 2-12 in NLPP1e's section 1.8\. 
> * Work through exercises 15, 17, 19, 22, 23, 26, 27, 28 in section 1.8\. 
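To make a few of these concepts concrete before you try them on the NLTK texts, here is a minimal sketch in plain Python on a made-up list of tokens. It does not use NLTK at all: `collections.Counter` stands in for a frequency distribution, the bigrams are built with `zip`, and lexical diversity follows the first-edition definition (number of tokens divided by number of distinct words). The token list and the function names are invented for the illustration.

```python
from __future__ import division, print_function
from collections import Counter

# A tiny, made-up "text": a list of tokens, like the NLTK book texts
tokens = "the whale and the ship and the sea and the whale".split()


def lexical_diversity(words):
    """Average number of times each distinct word is used."""
    return len(words) / len(set(words))


def bigrams(words):
    """All pairs of adjacent words, in order of appearance."""
    return list(zip(words, words[1:]))


# Frequency distribution: how often does each word occur?
word_counts = Counter(tokens)
# Counts of adjacent word pairs: the raw material for collocations
bigram_counts = Counter(bigrams(tokens))

print("lexical diversity:", lexical_diversity(tokens))
print("most common word:", word_counts.most_common(1))
print("most common bigram:", bigram_counts.most_common(1))
```

NLTK's `collocations` goes beyond raw pair counts when it ranks bigrams, so treat the counting above only as the starting intuition.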

In [65]:
##A bigram is a pair of adjacent words. We generate a bigram from every two adjacent words in a text;
##the pairs that occur together unusually often are the collocations.
##############exercises
print (12 / (4 + 1))

print (26 ** 100)

print (['Monty', 'Python'] * 20)

from nltk.book import *
print ("text1 length",len(text1),"distinct words",len(set(text1)), "\n")

from nltk.corpus import brown
#print ("brown categories",brown.categories())

def lexical_diversity(words):
    # average number of times each distinct word is used (tokens / distinct words)
    return len(words) / len(set(words))

print ("Lexical diversity humor",lexical_diversity(brown.words(categories="humor")),
       "Lexical diversity romance",lexical_diversity(brown.words(categories="romance")),"\n")

#Sense and Sensibility
#text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Willoughby"])

# collocations() prints its result itself and returns None, hence the "None" in the output
print (text5.collocations(),'\n')

#####exercise 15
print (sorted([w for w in set(text5) if w.startswith('b')]),"\n")

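#####exercise 17 find the slice for the complete sentence that contains this word
sunset_index = text9.index("sunset")
print ("index of sunset", sunset_index)

# walk outwards from the word until sentence-ending punctuation is found on both sides
# (a heuristic: headings carry no full stop, so the start may still need adjusting by hand)
start = sunset_index
while start > 0 and text9[start - 1] not in (".", "!", "?"):
    start -= 1
end = sunset_index
while end < len(text9) - 1 and text9[end] not in (".", "!", "?"):
    end += 1
print ("the sentence is:", " ".join(text9[start:end + 1]))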
        
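######exercise 19 What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?
# sorted(set([w.lower() for w in text1])): lowercases every token first, then removes duplicates,
#   so "The" and "the" end up as a single entry
# sorted([w.lower() for w in set(text1)]): removes duplicates first, then lowercases,
#   so "The" and "the" both survive as "the"; this list is at least as long as the first one,
#   and longer for any text in which a word occurs with different capitalisation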
 
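######exercise 22 Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.
fdist5=FreqDist(text5)
four_letter_words = [w for w in set(text5) if len(w) == 4]
# sort by decreasing frequency, using the counts stored in the frequency distribution
print (sorted(four_letter_words, key=lambda w: fdist5[w], reverse=True))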

######exercise 23 Use a combination of for and if statements to loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the uppercase words, one per line.
print ("\nUppercase:")
for w in sorted(set(text6)):   # set() so that each word is printed only once
    if w.isupper():
        print (w)

######exercise 26 What does the following Python code do? sum([len(w) for w in text1]) Can you use it to work out the average word length of a text?
print ("\naverage word length",sum([len(w) for w in text1]) / len(text1))

######exercise 27 Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text
def vocab_size(text):
    # the vocabulary is the set of distinct tokens in the text
    return len(set(text))

######exercise 28

2.4
3142930641582938830174357788501626427282669988762475256374173175398995908420104023465432599069702289330964075081611719197835869803511992549376
['Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python', 'Monty', 'Python']
text1 length 260819 distinct words 19317 

Lexical diversity humor 4.32429738888 Lexical diversity romance 8.28466635116 

wanna chat; PART JOIN; MODE #14-19teens; JOIN PART; PART PART;
cute.-ass MP3; MP3 player; JOIN JOIN; times .. .; ACTION watches; guys
wanna; song lasts; last night; ACTION sits; -...)...- S.M.R.; Lime
Player; Player 12%; dont know; lez gurls; long time
None 

[u'b', u'b-day', u'b/c', u'b4', u'babay', u'babble', u'babblein', u'babe', u'babes', u'ba

# Working with NLTK and other types of text

So far, we've worked with text from Wikipedia. But that's not the only source of text in the universe. In fact, it's far from it. Chapter 2 in NLPP1e is all about getting access to nicely curated texts that you can find built into NLTK. 
> 
> _Reading_: NLPP1e Chapter 2.1 - 2.4\.
> 

> _Exercises_: NLPP1e Chapter 2\.
> 
> * Solve exercises 4, 8, 11, 15, 16, 17, 18 in NLPP1e, section 2.8\. As always, I recommend you write up your solutions nicely in a `notebook`.
> * Work through exercise 2.8.23 on Zipf's law. [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) connects to a property of the Barabasi-Albert networks. Which one? Take a look at [this article](http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf) and write a paragraph or two describing other important instances of power laws found on the internet.
>
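As a warm-up for the Zipf's law exercise, the sketch below ranks the words of a small made-up token list by frequency and prints rank times frequency; under Zipf's law that product stays roughly constant. It uses only `collections.Counter`, the toy counts are constructed to follow the law exactly, and the helper name `rank_frequency` is hypothetical. On a real corpus you would pass in its word list instead and look at the trend on a log-log plot.

```python
from __future__ import division, print_function
from collections import Counter

# Made-up tokens whose counts are built to follow Zipf's law exactly:
# the word of rank r occurs 12 / r times (12, 6, 4, 3)
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["whale"] * 3


def rank_frequency(words):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


for rank, word, count in rank_frequency(tokens):
    # Under Zipf's law, rank * count is (roughly) the same for every word
    print(rank, word, count, rank * count)
```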

In [70]:

####exercise 4
from nltk.corpus import state_union 
from collections import Counter
file_list=state_union.fileids()
print ("Exercise The amount of the files in state_union",len(file_list))
#approach 1 
#for state_file in file_list:
#    wordcounts = Counter(w.lower() for w in state_union.words(state_file))
#    women=wordcounts["women"]
#    men=wordcounts["men"]
#    people=wordcounts["people"]
#    print (women,men,people,state_file)
#approach 2
for state_file in file_list:
    fdist = nltk.FreqDist([w.lower() for w in state_union.words(state_file)])
    print("women:",fdist["women"],"men:",fdist["men"],"people:",fdist["people"])

#use a conditional frequency distribution to plot the trend of the three words across the texts
targets = ['women','men','people']
cfd_2=nltk.ConditionalFreqDist((w.lower(),fileid)
                              for fileid in state_union.fileids()
                              for w in state_union.words(fileid)
                              # match whole words only; a substring test such as
                              # w.lower() in 'women' would also count 'men' or 'o'
                              if w.lower() in targets)
#cfd_2.plot()


######exercise 8 Define a conditional frequency distribution over the Names corpus that allows you to see which initial letters are more frequent for males vs. females
from nltk.corpus import names 
name_list=names.fileids()
male_names = names.words('male.txt')
female_names = names.words('female.txt')

cfd_8 = nltk.ConditionalFreqDist(
           (fileid, name[0])
          for fileid in names.fileids()
          for name in names.words(fileid))
cfd_8.plot()

##exercise 11 Investigate the table of modal distributions and look for other patterns.
##Try to explain them in terms of your own impressionistic understanding of the different genres.
##Can you find other closed classes of words that exhibit significant differences across different genres?
from nltk.corpus import brown

# one condition per genre, one sample per word in that genre
cfd_11 = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd_11.tabulate(conditions=genres, samples=modals)
##exercise 15
##exercise 16
##exercise 17
##exercise 18

Exercise The amount of the files in state_union 65
women: 2 men: 2 people: 10
women: 7 men: 12 people: 49
women: 2 men: 7 people: 12
women: 1 men: 5 people: 22
women: 1 men: 2 people: 15
women: 2 men: 6 people: 15
women: 2 men: 8 people: 10
women: 0 men: 3 people: 17
women: 0 men: 2 people: 15
women: 0 men: 4 people: 26
women: 2 men: 2 people: 30
women: 2 men: 5 people: 11
women: 1 men: 2 people: 19
women: 1 men: 4 people: 11
women: 0 men: 2 people: 10
women: 0 men: 6 people: 10
women: 2 men: 6 people: 10
women: 0 men: 0 people: 3
women: 5 men: 8 people: 12
women: 1 men: 3 people: 3
women: 0 men: 7 people: 16
women: 3 men: 12 people: 14
women: 1 men: 12 people: 35
women: 1 men: 11 people: 25
women: 0 men: 4 people: 17
women: 2 men: 5 people: 6
women: 0 men: 2 people: 23
women: 0 men: 1 people: 32
women: 0 men: 1 people: 7
women: 0 men: 1 people: 9
women: 0 men: 0 people: 20
women: 0 men: 0 people: 14
women: 1 men: 3 people: 18
women: 1 men: 2 people: 19
women: 1 men: 0 people: 26
women