# Getting the most out of what we've learned

So, now you know Python and NLTK! The main things we still have to do are:

1. Address some specific questions
2. Manage resources and results
3. Brainstorm some other uses for NLTK
4. Integrate IPython into your existing workflow
5. Have an open discussion about what we've done
6. Summarise and say goodbye!

This lesson is pretty light on content and structure. Please jump in at any point to tell us about your research, and whether what you've learned here is likely to be useful for it.

Or, ask us if Python can do a certain thing. Maybe we have some tips!

In [6]:
import nltk
import os
from urllib import urlopen
%matplotlib inline

## Getting text
The most important skill for using NLTK in your life as a researcher is going to be working with your own texts. First, let's look at reading in text files directly from the web.

In [7]:
cocoa = urlopen('http://gutenberg.digitalfabulists.org/pg19073.txt').read()
print cocoa[:100]

﻿The Project Gutenberg EBook of Cocoa and Chocolate, by Arthur W. Knapp

This eBook is for the u


In [8]:
raw = unicode(cocoa, 'utf-8')
title = next(line for line in raw.splitlines() if line.startswith('Title'))
print title
print raw[:200]

Title: Cocoa and Chocolate
﻿The Project Gutenberg EBook of Cocoa and Chocolate, by Arthur W. Knapp

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give i
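
The invisible character at the very start of the printed text is most likely a UTF-8 byte order mark (BOM), which some Gutenberg files carry. Here is a minimal sketch of one way to drop it while decoding, using Python's `utf-8-sig` codec. The sample bytes are made up for illustration (no download involved), and the snippet is written to run under both Python 2.7 and Python 3.

```python
from __future__ import print_function
import codecs

# Made-up stand-in for the downloaded bytes: a UTF-8 BOM followed by text.
sample_bytes = codecs.BOM_UTF8 + b'The Project Gutenberg EBook'

# Plain 'utf-8' keeps the BOM as the first character (u'\ufeff').
with_bom = sample_bytes.decode('utf-8')

# 'utf-8-sig' removes a leading BOM if there is one.
without_bom = sample_bytes.decode('utf-8-sig')

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

With the real download, the equivalent would be `cocoa.decode('utf-8-sig')` in place of `unicode(cocoa, 'utf-8')`.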


In [9]:
!pwd

/Users/fionatweedie/NLTK/books


In [10]:
# save text to file
! mkdir books
%cd books
filetitle = title.replace(' ', '-')
bookfile = open(filetitle, 'w')
bookfile.write(raw.encode('utf-8'))
bookfile.close()
print 'created file', title

/Users/fionatweedie/NLTK/books/books
created file Title: Cocoa and Chocolate
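
Note that the file name built above still carries the `Title:` prefix and has no extension. As a hedged alternative, the sketch below strips the prefix, keeps only letters and digits, and writes the file with `io.open`, which handles the encoding for us. The helper `title_to_filename` is hypothetical (not part of NLTK), the text written is a made-up sample, and a temporary directory is used so that nothing in your working directory is touched.

```python
from __future__ import print_function
import io
import os
import re
import tempfile


def title_to_filename(title_line):
    """Turn a 'Title: ...' line into a simple file name (hypothetical helper)."""
    # Keep only what follows the first colon, if there is one.
    title = title_line.split(':', 1)[-1].strip()
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r'[^A-Za-z0-9]+', '-', title).strip('-')
    return slug + '.txt'


filename = title_to_filename(u'Title: Cocoa and Chocolate')
path = os.path.join(tempfile.mkdtemp(), filename)

# io.open encodes unicode text on the way out and decodes it on the way in.
with io.open(path, 'w', encoding='utf-8') as bookfile:
    bookfile.write(u'Cocoa \u2013 and chocolate\n')

with io.open(path, encoding='utf-8') as bookfile:
    round_trip = bookfile.read()

print(filename)
```

Using `with` also closes the file for us, even if the write fails part-way.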


### Using Beautiful Soup to read text from the web
Of course, a lot of the text you're going to want to work with won't be in handy text files already. That's where a Python library called Beautiful Soup comes in.

In [11]:
from bs4 import BeautifulSoup

In [12]:
import urllib
from urllib import urlopen

In [13]:
url = "http://en.wikipedia.org/wiki/Smog"

In [14]:
raw = urlopen(url).read()
print type(raw)
print raw[100:200]

<type 'str'>
e>Smog - Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = docum


Beautiful Soup parses the single long string into its constituent parts and returns a `BeautifulSoup` object.

In [15]:
soup = BeautifulSoup(raw, 'html.parser')
print type(soup)

<class 'bs4.BeautifulSoup'>


Find all the paragraphs and put their text into a list:

In [16]:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print texts[:10]

[u'Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmanteau of the words smoke and fog to refer to smoky fog.[1] The word was then intended to refer to what was sometimes known as green soup fog, a familiar and serious problem in Mexico from the 19th century to the mid 20th century. This kind of smog is caused by the burning of large amounts of coal within a city; this smog contains soot particulates from smoke, sulphur dioxide and other components.', u'Modern smog, as found for example in Los Angeles, is a type of air pollution derived from vehicular emission from internal combustion engines and industrial fumes that react in the atmosphere with sunlight to form secondary pollutants that also combine with the primary emissions to form photochemical smog. In certain other cities, such as Delhi, smog severity is often aggravated by stubble burning in neighboring agricultural areas. The atmospheric pollution levels of Los Angeles, Beijing, De

In [17]:
import re
# match Wikipedia-style citation markers such as [1] or [23]
regex = re.compile(r'\[[0-9]*\]')
joined_texts = '\n'.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
print type(joined_texts)
print joined_texts[:100]

<type 'unicode'>
Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmante
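
To see what the citation pattern does, here is a small sketch on a made-up sentence. It uses `+` rather than `*`, so that only brackets which actually contain digits are removed; the pattern in the cell above would also remove an empty `[]`. Treat this as one possible variant rather than the definitive cleaning step.

```python
from __future__ import print_function
import re

# One or more digits inside square brackets, e.g. [1] or [23].
citation = re.compile(r'\[[0-9]+\]')

# Made-up sample text, not taken from the Wikipedia page.
sample = u'smoky fog.[1] The word[23] was coined [] here.'
cleaned = citation.sub(u'', sample)

print(cleaned)
```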


In order to work on the text, the first step is to tokenise it into words.

In [18]:
import nltk
wordlist = nltk.word_tokenize(joined_texts)
wordlist[:8]

[u'Smog', u'is', u'a', u'type', u'of', u'air', u'pollutant', u'.']

For some other types of analysis, we'll need to create an NLTK text object:

In [19]:
good_text = nltk.Text(wordlist)
good_text.concordance('smog')

Displaying 25 of 40 matches:
                                     Smog is a type of air pollutant . The wor
                                     smog '' was coined in the early 20th cent
 the mid 20th century . This kind of smog is caused by the burning of large am
amounts of coal within a city ; this smog contains soot particulates from smok
ioxide and other components . Modern smog , as found for example in Los Angele
mary emissions to form photochemical smog . In certain other cities , such as 
rtain other cities , such as Delhi , smog severity is often aggravated by stub
fe or death . Coinage of the term `` smog '' is generally attributed to Dr. He
 clouds of smoke that contributes to smog . Air pollution from this source has
 , as witnessed by the 2013 autumnal smog in Harbin , China , which closed roa
 major ingredient in the creation of smog in some large cities . The major cul
 ozone , and particles that comprise smog . Photochemical smog is the chemical
s that comprise smog . 

And once we've done all that work creating clean text, it's a good idea to save it for later.

In [20]:
%cd
! mkdir smog
%cd smog

/Users/fionatweedie
/Users/fionatweedie/smog


In [21]:
NLTK_file = open("NLTK-Smog.txt", "w")
NLTK_file.write(str(wordlist))
NLTK_file.close()
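
Writing `str(wordlist)` saves the list exactly as Python prints it, which is awkward to load back in later. One possible alternative, sketched here with the standard `json` module, is to save the tokens as JSON so they can be reloaded as a list. The file name and the temporary directory are assumptions for illustration, and the tokens are the first eight from the output above.

```python
from __future__ import print_function
import json
import os
import tempfile

tokens = [u'Smog', u'is', u'a', u'type', u'of', u'air', u'pollutant', u'.']

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'NLTK-Smog.json')

# By default json escapes non-ASCII characters, so no encoding step is needed.
with open(path, 'w') as token_file:
    json.dump(tokens, token_file)

# Later on, read the tokens back as a list.
with open(path) as token_file:
    reloaded = json.load(token_file)

print(reloaded[:3])
```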

In [22]:
text_file = open("Smog-text.txt", "w")
text_file.write(joined_texts)
text_file.close()

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 2455: ordinal not in range(128)

In [23]:
joined_texts[2450:2470]

u'ions \u2013 such as from '

In [24]:
# encode the unicode text as UTF-8 bytes before writing it out
text_file = open("Smog-text.txt", "w")
text_file.write(joined_texts.encode("UTF-8"))
text_file.close()
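
Why did the first attempt fail? In Python 2, writing a unicode string to a file implicitly encodes it as ASCII, and the en dash (`u'\u2013'`) has no ASCII equivalent. The sketch below reproduces the problem on the slice we inspected above, without touching any files, and runs under both Python 2.7 and Python 3.

```python
from __future__ import print_function

# The slice shown above, containing an en dash.
snippet = u'ions \u2013 such as from '

try:
    snippet.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as err:
    # ASCII only covers code points 0-127, so the en dash cannot be encoded.
    ascii_ok = False
    print('ASCII failed:', err.reason)

# UTF-8 can represent every character; the en dash takes three bytes.
utf8_bytes = snippet.encode('utf-8')
print(len(snippet), 'characters ->', len(utf8_bytes), 'bytes')
```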

### Challenge!
* Find a webpage of interest to your studies and use Beautiful Soup to extract the text
* Tokenise the text
* Find the most common words in your text (Extension: remove the stop words)
* Find trigrams in your text 
* Save your text to a text file
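
If you get stuck on the counting steps, here is one possible starting point that uses only the standard library on a made-up token list. The stop word set is a tiny hand-written stand-in; NLTK's `nltk.FreqDist`, `nltk.trigrams` and `nltk.corpus.stopwords` offer fuller versions of the same ideas.

```python
from __future__ import print_function
from collections import Counter

# Made-up tokens standing in for the output of nltk.word_tokenize.
tokens = u'smog is a type of air pollutant . smog is caused by the burning of coal .'.split()

# Tiny hand-written stop word list, for illustration only.
stop_words = {u'is', u'a', u'of', u'by', u'the'}

# Keep alphabetic tokens that are not stop words, lower-cased.
words = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]
counts = Counter(words)

# Trigrams are every run of three consecutive tokens.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

print(counts.most_common(1))
print(trigrams[:2])
```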

### PDF

This part requires the `slate` library. On DIT4C we hit an error with the standard Slate install, so use these pinned versions instead:

`sudo pip install --upgrade --ignore-installed slate==0.3 pdfminer==20110515`

In [None]:
!wget "http://www.planetebook.com/ebooks/1984.pdf"

In [None]:
import slate
# PDFs are binary files, so open in 'rb' mode
with open('1984.pdf', 'rb') as f:
    doc = slate.PDF(f)

In [None]:
doc.metadata