# Python hands-on session

By: Ties de Kok  
Version: Python 2.7 (see any notes for Python 3.5)

1. handling files
2. data handling
3. web scraping
4. **text mining**
5. (interactive) visualisations

## Introduction

In the previous notebooks we either worked with quantitative data (Pandas) or extracted information from webpages or documents.  
In this notebook we will try to convert qualitative data, more specifically text, into something that we can use for analysis.  

Extracting this "information" from text is often called "text mining" or "natural language processing" (NLP).

NLP is a sub-field that could easily fill an entire separate workshop; here I will only touch upon the basics.

Basic steps for NLP:
1. Obtain and load some raw text
2. Process this text ("clean" the text)
3. Analyse the text
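
As a rough illustration of these three steps, here is a minimal sketch that uses only the standard library. The sample text and the choice of "analysis" (a simple word count) are assumptions made for illustration; the rest of the notebook uses dedicated NLP libraries instead.

```python
import re
from collections import Counter

# Step 1: obtain some raw text (a hard-coded, made-up example here)
raw_text = "Time to act!  Enough time has been wasted. https://example.com"

# Step 2: clean the text (drop links, lowercase, keep only letters)
no_links = re.sub(r'http\S+', '', raw_text)
tokens = re.findall(r'[a-z]+', no_links.lower())

# Step 3: analyse the text (count how often each word occurs)
word_counts = Counter(tokens)
print(word_counts.most_common(2))
```

Each step is deliberately crude; the sections below replace them with proper tokenisers, taggers and sentiment tools.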

**Note: For the sake of consistency I have prepared this notebook with Python 2.7.**  
**However, for text analysis I highly recommend using Python 3.5 instead because of its improved unicode support!**

## Overview of tools

There are many different "NLP" tools in the Python ecosystem.  

Several well-known options:  
1. NLTK (Natural Language Toolkit) (http://www.nltk.org/)
2. TextBlob (http://textblob.readthedocs.io/en/dev/#)
3. Spacy (https://spacy.io/)

**Installation instructions:**
1. NLTK: 
    - `pip install nltk` 
    - run `nltk.download()` **in the notebook** to download and install the NLTK data
2. TextBlob: 
    - `pip install -U textblob`
    - `python -m textblob.download_corpora` **in the console** to download data
3. Spacy: 
    - `conda install -c spacy spacy=0.101.0`
    - `python -m spacy.en.download` **in the console** to download data

## Get some example text

We need some text to extract information from; for this example we will get some from Twitter.  

**Note: You can skip this step by loading the pre-downloaded tweets, this is explained if you scroll down a couple of cells.**

To download some Twitter data we will use the `tweepy` package (http://tweepy.readthedocs.io/en/v3.5.0/).  
Install `tweepy` using `pip install tweepy`.  

Before we can use `tweepy` we need a Twitter account that allows us to use the API.  

Go to https://apps.twitter.com and log in with your Twitter account.  
Create a new app and fill in something like:

![](https://dl.dropboxusercontent.com/u/1265025/python_tut/Python_tweepy.PNG)

*Note:* you need to have verified your Twitter account with your mobile phone (https://twitter.com/settings/devices)

After you have created your app click the `Keys and Access Tokens` tab.  
Click `Generate my access token and Token Secret`.

You will need 4 things from this page:  
1. Consumer Key
2. Consumer Secret
3. Access Token
4. Access Token Secret

### Authenticate ourselves

In [1]:
import tweepy
from tweepy import OAuthHandler

Fill in your details below:

In [None]:
consumer_key = 'YOUR-CONSUMER-KEY'  
consumer_secret = 'YOUR-CONSUMER-SECRET'  
access_token = 'YOUR-ACCESS-TOKEN'  
access_secret = 'YOUR-ACCESS-SECRET'  
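
If you prefer not to paste secrets into the notebook, one option is to read them from environment variables. The sketch below is one possible approach and not part of `tweepy`; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_secret': 'TWITTER_ACCESS_SECRET',
}

def load_credentials(environ=os.environ):
    """Return a dict with the four credentials, or raise if one is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if var not in environ]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(sorted(missing)))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

Calling `creds = load_credentials()` would then give you `creds['consumer_key']` and so on, which can be used in place of the four variables above.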

In [3]:
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

### Download 20 most recent tweets from @TheEconomist

In [4]:
tweets = []
for status in tweepy.Cursor(api.user_timeline, id="TheEconomist").items(20):
    tweets.append(status.text)
    print(status.text)

Many of Russia’s ultra-rich have struggled to preserve their fortunes amidst market turmoil https://t.co/jtupmMTeKN https://t.co/caGdrpa7cI
RT @stevenmazie: This one goes to 11: Trump’s #SCOTUS wish-list is designed to reassure conservatives (my @TheEconomist post today) https:/…
RT @Watchjen: Military exercises this wk near Taiwan were not, Beijing said, "aimed at any specific target”. Tsai Ing-wen sworn in https://…
RT @EconCulture: Inside a mind losing its grip with "The Father" https://t.co/IDI2Skb8yc https://t.co/jZGrs3ucay
Our quote of the day is from American novelist Nathaniel Hawthorne https://t.co/pukODOYo9o
RT @AdamCommentism: A tiny robot collects a swallowed battery (Melanie Gonick/@MIT). Full story: https://t.co/lMcUFqQlqY 😍 Soooo cool https…
Breathalysers are relatively new in Nairobi. And Kenyans have figured out a way around them https://t.co/zO7r9cduzp https://t.co/AJUFmCCSWg
RT @rachelsllloyd: "Since when did @TheEconomist write about film?" Since 1927. https://t.co/

### Note:

If you cannot or do not want to use `tweepy` you can also load the tweets from a file that I included:  

```
import pickle
tweets = pickle.load(open("tweets.p", "rb"))
```

If you want to follow the examples below it is best to load the data from the file as well.

In [None]:
import pickle
tweets = pickle.load(open("tweets.p", "rb"))

## Clean the text (pre-processing)

For example, we are not interested in the links, so we would like to remove them:

In [3]:
import re

In [4]:
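def remove_link(text):
    # Drop everything from the first 'http' to the end of the line;
    # this is fine when the links sit at the end of the tweet
    clean = re.sub(r'http(.*)', '', text)
    clean = clean.strip()
    return clean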

In [5]:
c_tweets = [remove_link(x) for x in tweets]
c_tweets[0:5]

 u"Kosovo's recognition by FIFA may open a Pandora's box",
 u"Israel's Dimona nuclear reactor has long been a subject of speculation. It may now close",
 u"National regulation is needed to protect Canada's stockmarket from fraud",
 u'Secessionist Catalonia is cracking down on businesses that communicate only in Spanish']
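
Note that the pattern `http(.*)` removes everything from the first `http` up to the end of the line, so any words that follow a link are lost as well. If that matters for your data, a narrower pattern is one alternative. The sketch below rests on an assumption about what "a link" looks like (anything starting with `http://` or `https://` up to the next whitespace); `remove_links_only` is a hypothetical helper, not part of the original notebook.

```python
import re

def remove_links_only(text):
    """Remove URL-like tokens but keep the words around them."""
    # Replace each http:// or https:// token (up to the next whitespace)
    clean = re.sub(r'https?://\S+', '', text)
    # Collapse the double spaces left behind and trim the ends
    return re.sub(r'\s+', ' ', clean).strip()

example = "Read this https://t.co/abc123 before the vote"
print(remove_links_only(example))
# prints: Read this before the vote
```

Truncated links such as `https:/…` in retweets would not match this narrower pattern, which is one reason the broader pattern can be convenient for tweets.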

## Basic example NLTK

In [6]:
import nltk

### Split text into a list of sentences

In [7]:
nltk.sent_tokenize(c_tweets[1])

[u"Kosovo's recognition by FIFA may open a Pandora's box"]

In [8]:
sentences = [nltk.sent_tokenize(x) for x in c_tweets]
sentences[0:3]

  u'Time to act'],
 [u"Kosovo's recognition by FIFA may open a Pandora's box"],
 [u"Israel's Dimona nuclear reactor has long been a subject of speculation.",
  u'It may now close']]

### Split text into a list of words

In [9]:
words = [nltk.word_tokenize(x) for x in c_tweets]
words[0]

[u'Enough',
 u'time',
 u'has',
 u'been',
 u'wasted',
 u'issuing',
 u'about',
 u'antibiotic',
 u'resistance',
 u'.',
 u'Time',
 u'to',
 u'act']

### Part-of-speech tagging

In [10]:
pos = [nltk.pos_tag(x) for x in words]
pos[0]

[(u'Enough', 'JJ'),
 (u'time', 'NN'),
 (u'has', 'VBZ'),
 (u'been', 'VBN'),
 (u'wasted', 'VBN'),
 (u'issuing', 'JJ'),
 (u'about', 'IN'),
 (u'antibiotic', 'JJ'),
 (u'resistance', 'NN'),
 (u'.', '.'),
 (u'Time', 'NNP'),
 (u'to', 'TO'),
 (u'act', 'VB')]
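
One common use of these tags is to keep only certain word classes. The sketch below copies part of the tagged output above into a literal list (so it runs without NLTK) and keeps the nouns. `keep_tags` is a hypothetical helper, and treating every tag that starts with `NN` as a noun is an assumption based on the Penn Treebank tag names shown above.

```python
# A few (word, tag) pairs copied from the output above
tagged = [('Enough', 'JJ'), ('time', 'NN'), ('has', 'VBZ'),
          ('been', 'VBN'), ('wasted', 'VBN'), ('antibiotic', 'JJ'),
          ('resistance', 'NN'), ('Time', 'NNP'), ('to', 'TO'), ('act', 'VB')]

def keep_tags(tagged_words, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]

nouns = keep_tags(tagged, 'NN')
print(nouns)
# prints: ['time', 'resistance', 'Time']
```

The same helper with the prefix `'VB'` would keep the verbs instead.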

### Lemmatizer

In [11]:
words[0][6]



In [12]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize(words[0][6])



### Classification

You can do a lot of classification tasks using `NLTK`, but they require a training set.  
To avoid turning this notebook into a machine learning tutorial I will skip this; however, the internet is full of resources.  
For example:  
https://github.com/nltk/nltk/wiki/Sentiment-Analysis

## Basic example TextBlob

In [13]:
from textblob import TextBlob

### Turn text into a `TextBlob`

In [14]:
text_tb = [TextBlob(x) for x in c_tweets]

In [15]:
text_tb[0]



### Access characteristics of a piece of text

In [16]:
text_tb[0].tags

[(u'Enough', u'JJ'),
 (u'time', u'NN'),
 (u'has', u'VBZ'),
 (u'been', u'VBN'),
 (u'wasted', u'VBN'),
 (u'issuing', u'JJ'),
 (u'about', u'IN'),
 (u'antibiotic', u'JJ'),
 (u'resistance', u'NN'),
 (u'Time', u'NNP'),
 (u'to', u'TO'),
 (u'act', u'VB')]

In [17]:
text_tb[0].noun_phrases

WordList([u'enough', u'antibiotic resistance'])

In [18]:
text_tb[0].sentences

 Sentence("Time to act")]

In [19]:
text_tb[0].words



### Lemmatize

In [20]:
for x in text_tb[0].words:
    print(x, x.lemmatize())

(u'Enough', u'Enough')
(u'time', u'time')
(u'has', u'ha')
(u'been', u'been')
(u'wasted', u'wasted')
(u'issuing', u'issuing')
(u'about', u'about')
(u'antibiotic', u'antibiotic')
(u'resistance', u'resistance')
(u'Time', u'Time')
(u'to', u'to')
(u'act', u'act')


### Retrieve definition of a word

In [21]:
for x in text_tb[0].words[0:6]:
    print(x, x.definitions[0])

(u'Enough', u'an adequate quantity; a quantity that is large enough to achieve a purpose')
(u'time', u'an instance or single occasion for some event')
(u'has', u'(astronomy) the angular distance of a celestial point measured westward along the celestial equator from the zenith crossing; the right ascension for an observer at a particular location and time of day')
(u'been', u'have the quality of being; (copula, used with an adjective or a predicate noun)')
(u'wasted', u'spend thoughtlessly; throw away')
(u'issuing', u'the act of providing an item for general use or for official purposes (usually in quantity)')


### Detect language

In [22]:
for x in text_tb[0:5]:
    print(x.detect_language())

en
en
en
en
en


### Sentiment Analysis

*Note:* when it comes to sentiment analysis there really is no one-size-fits-all solution.  
The approach below is decent, but it is better to use a sentiment resource (e.g. a list of positive / negative words) that is tailored to your type of text and goal.

In [23]:
for text in text_tb[0:5]:
    for sentence in text.sentences:
        print(sentence)
        print(sentence.sentiment)

Sentiment(polarity=-0.1, subjectivity=0.25)
Time to act
Sentiment(polarity=0.0, subjectivity=0.0)
Kosovo's recognition by FIFA may open a Pandora's box
Sentiment(polarity=0.0, subjectivity=0.5)
Israel's Dimona nuclear reactor has long been a subject of speculation.
Sentiment(polarity=-0.10833333333333334, subjectivity=0.3666666666666667)
It may now close
Sentiment(polarity=0.0, subjectivity=0.0)
National regulation is needed to protect Canada's stockmarket from fraud
Sentiment(polarity=0.0, subjectivity=0.0)
Secessionist Catalonia is cracking down on businesses that communicate only in Spanish
Sentiment(polarity=-0.051851851851851864, subjectivity=0.42962962962962964)
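
To illustrate the note above about custom word lists, here is a minimal sketch of a word-list scorer. The two word sets are made up for illustration and are far too small for real use; `lexicon_score` is a hypothetical helper, not part of TextBlob.

```python
import re

# Tiny, made-up word lists; replace with a resource that fits your texts
POSITIVE = {'protect', 'recognition', 'rescued'}
NEGATIVE = {'fraud', 'wasted', 'crash'}

def lexicon_score(text):
    """Count positive minus negative words in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives

print(lexicon_score("National regulation is needed to protect Canada's stockmarket from fraud"))
# prints: 0
print(lexicon_score("Enough time has been wasted"))
# prints: -1
```

The first sentence scores zero because one positive and one negative word cancel out, which shows how much the result depends on the word list you choose.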


## Basic example Spacy

*Note:* Loading the English file might take a while.

In [24]:
from spacy.en import English
parser = English()

In [25]:
spacy_text = [parser(x) for x in c_tweets]

In [26]:
spacy_text[1]

Kosovo's recognition by FIFA may open a Pandora's box

### Access characteristics of text

In [27]:
for token in spacy_text[3]:
    print(token.orth_, token.lower_, token.lemma_, token.prob)

(u'National', u'national', u'national', -11.179020881652832)
(u'regulation', u'regulation', u'regulation', -11.475606918334961)
(u'is', u'is', u'be', -4.457748889923096)
(u'needed', u'needed', u'need', -9.039335250854492)
(u'to', u'to', u'to', -3.8560216426849365)
(u'protect', u'protect', u'protect', -10.150605201721191)
(u'Canada', u'canada', u'canada', -9.756773948669434)
(u"'s", u"'s", u"'s", -4.830559253692627)
(u'stockmarket', u'stockmarket', u'stockmarket', -17.218788146972656)
(u'from', u'from', u'from', -6.010132312774658)
(u'fraud', u'fraud', u'fraud', -11.487196922302246)


*Note, **prob**:* The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good Turing estimation. 
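
To make the idea of a unigram log-probability concrete, the sketch below estimates it from raw counts in a tiny made-up corpus. It uses add-one (Laplace) smoothing rather than the Simple Good Turing estimation mentioned above, purely to keep the example short; the corpus and the helper name `unigram_log_prob` are assumptions.

```python
import math
from collections import Counter

# Tiny made-up corpus; real estimates come from a very large one
corpus = "the plane left the city and the plane landed".split()
counts = Counter(corpus)
total = len(corpus)
vocab_size = len(counts)

def unigram_log_prob(word):
    """Natural-log probability of a word with add-one smoothing."""
    # The extra +1 in the denominator reserves mass for one unseen word type
    return math.log((counts[word] + 1.0) / (total + vocab_size + 1.0))

for word in ['the', 'plane', 'stockmarket']:
    print(word, round(unigram_log_prob(word), 3))
```

Frequent words get values closer to zero and rare or unseen words get more negative values, which matches the pattern in the output above (compare `to` with `stockmarket`).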

### Detect named entities

In [28]:
for token in spacy_text[1]:
    if token.ent_type_ != "":
        print(token, token.ent_type_)

(Kosovo, u'GPE')
(FIFA, u'ORG')
(Pandora, u'GPE')


## Sentiment Analysis

There are also packages that focus on particular tasks, such as sentiment analysis.  
One example that is easy to use is `AFINN`: http://neuro.compute.dtu.dk/wiki/AFINN

In essence it is a word list, but you can also install it as a Python package with `pip install afinn`.

In [29]:
from afinn import Afinn
afinn = Afinn()

In [30]:
afinn.score('This is utterly excellent!')

3.0

In [31]:
for text in spacy_text:
    for sentence in text.sents:
        print(afinn.score(sentence.text), sentence)

(0.0, Time to act)
(2.0, Kosovo's recognition by FIFA may open a Pandora's box)
(0.0, Israel's Dimona nuclear reactor has long been a subject of speculation.)
(0.0, It may now close)
(-3.0, National regulation is needed to protect Canada's stockmarket from fraud)
(0.0, Secessionist Catalonia is cracking down on businesses that communicate only in Spanish)
(0.0, Our quote of the day is from the Scottish biologist Sir Alexander Fleming)
(2.0, RT @EconCulture: "Where to Invade Next": a documentary worth watching, despite its cringe-inducing creator)
(-1.0, A plane bound for Cairo from Paris has disappeared.)
(0.0, The cause is unknown as yet)
(2.0, Sir Nicholas Winton, born #onthisday, rescued 669 children from Czechoslovakia in 1938-39)
(0.0, Confusion still reigns over what exactly business "platforms" are)
(-1.0, A plane disappears en route from Paris to Egypt)
(0.0, What tribal hunters teach the modern world about sleep #econarchive)
(-1.0, Supply shortages have increased the price of

## Search Reddit for threads about the Egyptian airline crash

To get some more text to work with we will extract some text from the website http://www.reddit.com  

We could directly use the API and `requests` but it is easier to use a wrapper called `praw`.

You can install `praw` by running `pip install praw`; documentation is available here: https://praw.readthedocs.io/en/stable/index.html

### Getting the Reddit data

In [32]:
import praw

In [33]:
r = praw.Reddit(user_agent='Python tutorial')

In [34]:
new_news = r.get_subreddit('worldnews').get_top_from_day(limit=100)

In [35]:
submission_titles = []
for submission in new_news:
    submission_titles.append(submission.title)

*Note:* you can load the submission titles I am using by running:  
```
import pickle
submission_titles = pickle.load(open(r'submission_titles.p', 'rb'))
```

In [36]:
submission_titles[0:5]

[u'Oil company records from 1960s reveal patents to reduce CO2 emissions in cars: ExxonMobil and others pursued research into technologies, yet blocked government efforts to fight climate change for more than 50 years, findings show',
 u'EgyptAir Flight 804 black boxes located',
 u"The world's largest cruise ship and its supersized pollution problem: each of the Harmony\u2019s three four-storey high 16-cylinder W\xe4rtsil\xe4 engines will, at full power, burn 1,377 US gallons of fuel an hour, or about 96,000 gallons a day of some of the most polluting diesel fuel in the world",
 u"Women in Iran are cutting their hair short and dressing as men in a bid to bypass state 'morality' police who rigorously enforce penalties for not wearing a hijab.",
 u'Poland last week dismissed 32 of 39 scientific experts on the State Council for Nature Conservation after they criticized the logging plan of Bialowieza Forest']

### Find Reddit threads that talk about the Egyptian airplane that went missing

We could use a very simple word-search based approach:

In [37]:
for x in submission_titles:
    if 'egypt' in x.lower():
        print(x)

EgyptAir Flight 804 black boxes located
EgyptAir crash: Internal blast 'tore right side' of jet, pilot says
EgyptAir: Images released of debris found in plane search
EgyptAir Jetliner that crashed into the Mediterranean was once the target of political vandals who wrote in Arabic on its underside, “We will bring this plane down.”
Egyptian military releases photo of debris collected from Egyptair ms804 crash. Search still underway for blackbox and cvr
Egypt Slams ‘Disrespectful’ CNN Coverage of EgyptAir Tragedy
Ominous graffiti was scribbled on the bottom of the crashed EgyptAir flight
EgyptAir Flight 804: Black boxes for crashed airplane located
The Egyptian Army has shared the first photos of wreckage from the crashed EgyptAir flight


However, the problem is that we can also get results that are not related to the plane crash, such as:
> Egypt Gets $25 Billion Loan From Russia for Nuclear Plant

We can try to add additional keywords such as 'plane' and 'flight':

In [38]:
for x in submission_titles:
    if 'egypt' in x.lower():
        if 'plane' in x.lower() or 'flight' in x.lower():
            print(x)

EgyptAir Flight 804 black boxes located
EgyptAir: Images released of debris found in plane search
EgyptAir Jetliner that crashed into the Mediterranean was once the target of political vandals who wrote in Arabic on its underside, “We will bring this plane down.”
Ominous graffiti was scribbled on the bottom of the crashed EgyptAir flight
EgyptAir Flight 804: Black boxes for crashed airplane located
The Egyptian Army has shared the first photos of wreckage from the crashed EgyptAir flight


However, this becomes tedious if we want to keep adding keywords such as 'aircraft'.  
A better approach is to include synonyms for the word 'airplane' automatically.  

We can do this using one of the libraries above, but we could also use `PyDictionary` (install it with `pip install PyDictionary`):

In [39]:
from PyDictionary import PyDictionary
dictionary = PyDictionary()

In [40]:
print(dictionary.synonym('airplane'))

[u'jet', u'aircraft', u'plane', u'cab', u'ship']


In [41]:
plane_words = dictionary.synonym('airplane') + ['airplane', 'flight']

In [42]:
for x in submission_titles:
    if 'egypt' in x.lower():
        if any(word in x.lower() for word in plane_words):
            print(x)

EgyptAir Flight 804 black boxes located
EgyptAir crash: Internal blast 'tore right side' of jet, pilot says
EgyptAir: Images released of debris found in plane search
EgyptAir Jetliner that crashed into the Mediterranean was once the target of political vandals who wrote in Arabic on its underside, “We will bring this plane down.”
Ominous graffiti was scribbled on the bottom of the crashed EgyptAir flight
EgyptAir Flight 804: Black boxes for crashed airplane located
The Egyptian Army has shared the first photos of wreckage from the crashed EgyptAir flight
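
The `word in x.lower()` test above matches substrings, so a short synonym such as 'cab' or 'ship' can also match inside unrelated words (for example 'ship' in 'leadership'). One possible refinement is to require that a keyword starts at a word boundary, using a regular expression. The sketch below rests on that assumption about what counts as a match; the first two titles are taken from the output above and the last one is made up to show the difference.

```python
import re

plane_words = ['jet', 'aircraft', 'plane', 'cab', 'ship', 'airplane', 'flight']

# \b anchors each keyword at the start of a word, so 'ship' no longer
# matches inside 'leadership' but 'jet' still matches 'Jetliner'
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in plane_words) + ')',
                     re.IGNORECASE)

titles = [
    "EgyptAir Flight 804 black boxes located",
    "EgyptAir crash: Internal blast 'tore right side' of jet, pilot says",
    "Egypt's leadership discusses new budget",  # made-up title
]

for title in titles:
    if 'egypt' in title.lower() and pattern.search(title):
        print(title)
```

Only the first two titles are printed; the plain substring test would also have returned the made-up third title.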


## Machine learning: basic topic classification

**Note:** I have little experience with machine learning; the code below is based on existing examples and is meant for illustration only.

### Get some data that we can classify:

We will download the titles of the top posts from two subreddits: '/r/Python' and '/r/Java'.  

The goal is to build a classifier that can identify a title as being from either '/r/Python' or '/r/Java'.

In [59]:
top_python = r.get_subreddit('python').get_top_from_year(limit=300)
top_python = [x.title for x in top_python]
top_python = top_python[50:]

In [61]:
for x in top_python[0:5]:
    print(x)

"High Performance Python" co-author Micha Gorelick has quite an impressive author bio...
TIL about "Google Python Style Guide"
A PEP8 Wallpapper
The Idiomatic Way to Merge Dictionaries in Python
Django awarded Mozilla Open Source Support Grant


In [60]:
top_java = r.get_subreddit('java').get_top_from_year(limit=300)
top_java = [x.title for x in top_java]
top_java = top_java[50:]

In [62]:
for x in top_java[0:5]:
    print(x)

The Top Starred repositories in Github have been analysed to understand which are the most common whitespace types in different programming languages
JDK 8 Massive Open and Online Course: Lambdas and Streams Introduction
Code Academy has a Java Course now
Java is the most searched programming language on Google!
Improving DuckDuckGo's Java-related searches


*Note:* I use `[50:]` to remove the first 50 titles as they are often not representative

Convert it into a data structure that we can use for machine learning:

In [110]:
python_tuple = tuple((x, 0, 'python') for x in top_python)
java_tuple = tuple((x, 1, 'java') for x in top_java)
data = list(python_tuple + java_tuple)

### Import machine learning tools

*Note:* we will use `sklearn` which comes with `Anaconda`

This example is based on: http://nbviewer.jupyter.org/github/gmonce/scikit-learn-book/blob/master/Chapter%202%20-%20Supervised%20Learning%20-%20Text%20Classification%20with%20Naive%20Bayes.ipynb

In [165]:
# Note: newer scikit-learn releases provide these in sklearn.model_selection instead
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

### Split the sample into our training and evaluation set:

In [111]:
from random import shuffle
shuffle(data)

In [145]:
SPLIT_PERC = 0.75
split_size = int(len(data)*SPLIT_PERC)

data_d = [x[0] for x in data]
data_t = [x[1] for x in data]

X_train = data_d[:split_size]
X_test = data_d[split_size:]
y_train = data_t[:split_size]
y_test = data_t[split_size:]
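
Because `shuffle` produces a different order each time, the split (and therefore the scores) changes on every run. If you want a reproducible split, one option is to shuffle with a seeded random generator. The sketch below uses a small made-up dataset in the same `(text, label, name)` format; the seed value and the helper name `split_data` are arbitrary assumptions.

```python
import random

def split_data(data, split_perc=0.75, seed=0):
    """Shuffle a copy of data with a fixed seed and split it in two."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    split_size = int(len(shuffled) * split_perc)
    return shuffled[:split_size], shuffled[split_size:]

# Made-up example in the same (text, label, name) format as `data`
example = [('title %d' % i, i % 2, 'python' if i % 2 == 0 else 'java')
           for i in range(8)]
train, test = split_data(example)
print(len(train), len(test))
```

Calling `split_data` again with the same seed returns the same two lists, which makes results easier to compare between runs.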

### Text Classification with Naïve Bayes

In [146]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [147]:
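def evaluate_cross_validation(clf, X, y, K):
    # Build K shuffled folds with a fixed seed so the folds are reproducible
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    # Report the mean score and its standard error across the folds
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))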

The initial data consists of text, while we need numerical data for the classifier.  
In the example below I use `CountVectorizer` : http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
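
To give an idea of what "numerical data" means here, the sketch below builds a small bag-of-words count matrix by hand. It is a simplified illustration of the idea, not how `CountVectorizer` is implemented; the regular expression used for tokenising is an assumption, and the two example titles are taken from the data shown above.

```python
import re
from collections import Counter

titles = ["A PEP8 Wallpapper",
          "Code Academy has a Java Course now"]

# Tokenise: lowercase words of at least two characters
tokenised = [re.findall(r'\b\w\w+\b', t.lower()) for t in titles]

# The vocabulary is the sorted set of all words; each word gets a column
vocabulary = sorted(set(word for tokens in tokenised for word in tokens))

# Each title becomes a row of word counts, one count per vocabulary word
matrix = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenised]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the numerical representation of one title, and this is the kind of input a classifier such as Naïve Bayes works with.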

In [148]:
clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

In [149]:
evaluate_cross_validation(clf, data_d, data_t, 5)

[ 0.91  0.88  0.91  0.89  0.94]
Mean score: 0.906 (+/-0.010)
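
The idea behind K-fold cross-validation is to split the data into K parts and let each part serve as the evaluation set once, which is why five scores are printed above. The sketch below shows only the index bookkeeping, in plain Python and without shuffling; `kfold_indices` is a hypothetical helper and not the scikit-learn implementation.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in kfold_indices(7, 3):
    print(train, test)
```

Every sample ends up in exactly one test fold, so each title is used for evaluation once and for training in all other rounds.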


In [150]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    # Fit the pipeline on the training titles only
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))

In [151]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.997333333333
Accuracy on testing set:
0.888
Classification Report:
             precision    recall  f1-score   support

          0       0.89      0.89      0.89        64
          1       0.89      0.89      0.89        61

avg / total       0.89      0.89      0.89       125
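
The columns in the classification report can be computed from the true and predicted labels. The sketch below does this by hand for one class; the label lists are made up for illustration and `precision_recall_f1` is a hypothetical helper, not the scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels: 0 = Python, 1 = Java
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred, label=1))
```

Precision is the share of titles predicted as a class that really belong to it, recall is the share of titles of a class that were found, and the `support` column is simply the number of true titles per class.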



### Eye-ball the results:

In [159]:
# Wrap in list() so the result can be sliced (zip returns an iterator in Python 3)
expected_group = list(zip(X_test, y_test, clf.predict(X_test)))

*Note:* we code 0 to indicate Python and 1 to indicate Java:

In [164]:
for text, actual, predict in expected_group[20:40]:
    print(text, actual, predict)

(u'Announcing General Availability of PyCharm 5: Python 3.5, Docker Integration, Thread Concurrency Visualization, and much more', 0, 0)
(u'Pyxley: Python Powered Dashboards', 0, 0)
(u'Livecoding.tv, Stream yourself as a developer', 1, 1)
(u'Decorated Concurrency - Python multiprocessing made really really easy', 0, 0)
(u'Go and Quasar: a comparison of style and performance', 1, 0)
(u'Hashmap Performance Improvements in Java 8', 1, 1)
(u'All the PyCon 2015 talks you should watch', 0, 0)
(u'Transcrypt Python to JavaScript compiler moved to Beta', 0, 0)
(u'MicroPython 1.8 released', 0, 1)
(u'Hey everyone! I made my first Tic Tac Toe game in Python! What do you think?', 0, 0)
(u'PEP 0506 -- Adding A Secrets Module To The Standard Library', 0, 0)
(u'(Free) Introducing Java 8: A Quick-Start Guide to Lambdas and Streams', 1, 1)
(u'matplotlib 1.5.0 is out -- still alive an kicking with pandas DataFrame support and pretty seaborn styles', 0, 0)
(u'How do commercial Java apps deal with exposed 