# Text data representations

## Example 'real' data: SIPA description, #migrantcrisis tweets

SIPA text is taken from the Wikipedia entry about SIPA, https://en.wikipedia.org/wiki/School_of_International_and_Public_Affairs,_Columbia_University.

The #migrantcrisis tweets were read from the Twitter search API (see notebook 3.2 Social Media Data), using the hashtag #migrantcrisis, on 9th March 2016.

In [1]:
# Read the whole SIPA description into one string; 'with' closes the file for us
with open('example_data/sipatext.txt', 'r') as fsipa:
    sipatext = fsipa.read()
print(sipatext)

The School of International and Public Affairs at Columbia University (also known at SIPA) is a public policy and international affairs school and one of Columbia's graduate and professional schools.

Located on Columbia's Morningside Heights campus in the Borough ofManhattan, in New York City, the school has more than 19,000 alumni in more than 150 countries. SIPA's alumni include former heads of state, business leaders, journalists, diplomats, and elected representatives. Many graduates reach the upper echelons of central banks and treasuries, others go to energy companies, non-for-profits and social enterprises.[1] Half of SIPA’s nearly 1,400 students are international, coming from over 100 countries. SIPA has more than 70 full-time faculty and more than 200 adjunct professors, including the world's leading scholars on international relations.

The school offers two traditional two-year master's degrees (Master of Public Administration or Master of International Affairs), an Executi

In [2]:
import json
import pandas as pd

# Load the saved search results; the tweets sit under the 'statuses' key
with open('example_data/migrantcrisis_tweets.json', 'r') as ftweets:
    tweetdata = json.load(ftweets)
tweetframe = pd.DataFrame(tweetdata['statuses'])
tweetframe[['entities', 'text', 'user']]

Unnamed: 0,entities,text,user
0,"{'urls': [{'indices': [26, 49], 'expanded_url'...",Stirile zilei - 9 martie: https://t.co/R3n6Vr8...,{'created_at': 'Thu Oct 13 17:44:25 +0000 2011...
1,"{'urls': [{'indices': [0, 23], 'expanded_url':...",https://t.co/I09bHPUEyM via @youtube @guardian...,{'created_at': 'Tue Feb 17 10:04:20 +0000 2009...
2,"{'urls': [{'indices': [53, 76], 'expanded_url'...",RT @ECentauri: GERMAN GOVT. PROMOTES INTERRACI...,{'created_at': 'Tue Mar 19 16:54:09 +0000 2013...
3,"{'urls': [], 'symbols': [], 'user_mentions': [...",RT @vladokrstevski: #NeoCommies\n#UDBA\n#Soros...,{'created_at': 'Wed Jul 13 14:28:40 +0000 2011...
4,"{'urls': [], 'symbols': [], 'user_mentions': [...",RT @ECentauri: 'I GOT A SOLUTION TO #migrantcr...,{'created_at': 'Sat Feb 05 05:43:38 +0000 2011...
5,"{'urls': [{'indices': [95, 118], 'expanded_url...","#Migrants - Feeling Unwanted In Germany, Some ...",{'created_at': 'Mon Sep 07 06:52:50 +0000 2015...
6,"{'urls': [{'indices': [139, 140], 'expanded_ur...","RT @INTHENOWRT: This drone captures some 14,00...",{'created_at': 'Tue Nov 20 10:18:29 +0000 2012...
7,"{'urls': [], 'symbols': [], 'hashtags': [{'ind...",People who moan about the migrant crisis.... 🖕...,{'created_at': 'Sun Jan 05 19:55:14 +0000 2014...
8,"{'urls': [], 'symbols': [], 'user_mentions': [...",RT @London_360: .@RitaOra delivers a personal ...,{'created_at': 'Tue Oct 28 17:24:30 +0000 2014...
9,"{'urls': [{'indices': [139, 140], 'expanded_ur...","RT @INTHENOWRT: This drone captures some 14,00...",{'created_at': 'Fri Jun 20 18:02:44 +0000 2014...


## Using the raw data

Sometimes you'll want to use the raw data itself as input to algorithms. Examples of this include things like Twitter hashtags or user names, which are short and (hopefully) consistent.
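As a minimal sketch of using raw values directly, the cell below counts hashtags and user names without any further text processing. The records are hand-written stand-ins, not data from the tweet file, and the `hashtags[...]['text']` and `user['screen_name']` keys are assumptions about the tweet layout, since the truncated output above doesn't show them.

```python
from collections import Counter

# Hypothetical, hand-written records that mimic the 'entities' and 'user'
# fields of the tweets loaded above (not real data from the file)
sample_tweets = [
    {'entities': {'hashtags': [{'text': 'migrantcrisis'}, {'text': 'Germany'}]},
     'user': {'screen_name': 'example_user_a'}},
    {'entities': {'hashtags': [{'text': 'MigrantCrisis'}]},
     'user': {'screen_name': 'example_user_b'}},
    {'entities': {'hashtags': []},
     'user': {'screen_name': 'example_user_a'}},
]

def count_hashtags(tweets):
    """Count hashtags across tweets, lower-casing so case variants match."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet['entities']['hashtags']:
            counts[tag['text'].lower()] += 1
    return counts

hashtag_counts = count_hashtags(sample_tweets)
user_counts = Counter(tweet['user']['screen_name'] for tweet in sample_tweets)
print(hashtag_counts.most_common())
print(user_counts.most_common())
```

Lower-casing is the only normalisation applied here; that is usually enough for hashtags, which is why they work well as raw features.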

## Using Bags of Words

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

count_vect = CountVectorizer()
word_counts = count_vect.fit_transform([sipatext])
print('{}'.format(word_counts))
print('{}'.format(count_vect.vocabulary_))

  (0, 120)	13
  (0, 109)	7
  (0, 83)	13
  (0, 61)	6
  (0, 13)	13
  (0, 104)	5
  (0, 9)	3
  (0, 15)	3
  (0, 23)	4
  (0, 128)	4
  (0, 10)	2
  (0, 64)	1
  (0, 114)	5
  (0, 62)	1
  (0, 95)	5
  (0, 87)	1
  (0, 50)	1
  (0, 99)	1
  (0, 110)	2
  (0, 69)	1
  (0, 86)	2
  (0, 75)	1
  (0, 55)	1
  (0, 20)	1
  (0, 57)	5
  :	:
  (0, 130)	2
  (0, 89)	1
  (0, 121)	1
  (0, 46)	1
  (0, 79)	1
  (0, 49)	1
  (0, 70)	1
  (0, 35)	1
  (0, 96)	1
  (0, 60)	1
  (0, 135)	1
  (0, 97)	1
  (0, 27)	1
  (0, 92)	1
  (0, 112)	1
  (0, 94)	1
  (0, 56)	1
  (0, 48)	1
  (0, 17)	1
  (0, 124)	1
  (0, 68)	1
  (0, 65)	1
  (0, 133)	1
  (0, 77)	1
  (0, 113)	1
{'campus': 20, 'known': 64, 'reach': 105, 'two': 127, 'has': 53, 'an': 12, 'traditional': 125, 'program': 102, 'upper': 129, 'with': 130, 'companies': 25, 'degree': 28, 'faculty': 41, 'economic': 34, 'political': 96, 'mpa': 76, 'nearly': 78, 'professional': 99, 'policy': 95, 'offers': 84, 'and': 13, '150': 2, '200': 4, 'the': 120, '70': 6, '400': 5, 'include': 58, 'city': 22, 

Let's have a more civilised look at that word count list, by turning it into a pandas DataFrame, sorted by count...

In [4]:
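# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
df = pd.DataFrame(word_counts.toarray(), columns=count_vect.get_feature_names_out()).transpose()
df.columns = ['count']
df.sort_values(by='count', ascending=False)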

Unnamed: 0,count
and,13
the,13
of,13
school,7
international,6
policy,5
public,5
in,5
sipa,5
columbia,4


## N-grams

In [15]:
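# ngram_range=(2, 2) counts pairs of adjacent words (bigrams) only
count_vectn = CountVectorizer(ngram_range=(2, 2))
word_countsn = count_vectn.fit_transform([sipatext])

# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
df_n = pd.DataFrame(word_countsn.toarray(), columns=count_vectn.get_feature_names_out()).transpose()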
df_n.columns = ['count']
df_n.sort_values(by='count', ascending=False)

Unnamed: 0,count
more than,4
school of,4
the school,3
public policy,3
dual degree,2
of public,2
international affairs,2
columbia university,2
countries sipa,2
of international,2
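To make it clearer what the bigram counts above are, here is a rough pure-Python sketch of the same idea: slide a window of two words along the text and count each pair. The `bigrams` helper and the sample sentence are made up for illustration. It splits on whitespace only, so it will not reproduce CountVectorizer's tokenisation exactly (by default CountVectorizer strips punctuation and drops one-letter words, for instance).

```python
from collections import Counter

def bigrams(text):
    """Return adjacent word pairs from text, lower-cased and split on whitespace."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

# Hypothetical sample sentence, loosely modelled on the SIPA text
sample = "the school of international and public affairs is a public policy school"

bigram_counts = Counter(' '.join(pair) for pair in bigrams(sample))
print(bigram_counts.most_common(3))
```

A text with n words produces n - 1 bigrams, so the bigram vocabulary grows much faster than the single-word vocabulary as the text gets longer.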


### Stopwords

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

count_vect2 = CountVectorizer(stop_words='english')
word_counts2 = count_vect2.fit_transform([sipatext])

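# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
df2 = pd.DataFrame(word_counts2.toarray(), columns=count_vect2.get_feature_names_out()).transpose()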
df2.columns = ['count']
df2.sort_values(by='count', ascending=False)

Unnamed: 0,count
school,7
international,6
public,5
policy,5
sipa,5
university,4
columbia,4
program,3
degree,3
affairs,3


## Using Term frequencies

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False)
tf_features = tf_transformer.fit_transform(word_counts2)
print('{}'.format(tf_features))

  (0, 86)	0.372572406789
  (0, 48)	0.319347777247
  (0, 81)	0.266123147706
  (0, 9)	0.159673888624
  (0, 18)	0.212898518165
  (0, 100)	0.212898518165
  (0, 50)	0.0532246295412
  (0, 91)	0.266123147706
  (0, 72)	0.266123147706
  (0, 39)	0.0532246295412
  (0, 76)	0.0532246295412
  (0, 87)	0.106449259082
  (0, 55)	0.0532246295412
  (0, 59)	0.0532246295412
  (0, 43)	0.0532246295412
  (0, 15)	0.0532246295412
  (0, 13)	0.0532246295412
  (0, 68)	0.0532246295412
  (0, 64)	0.0532246295412
  (0, 105)	0.0532246295412
  (0, 17)	0.0532246295412
  (0, 3)	0.0532246295412
  (0, 0)	0.0532246295412
  (0, 10)	0.106449259082
  (0, 2)	0.0532246295412
  :	:
  (0, 95)	0.0532246295412
  (0, 66)	0.0532246295412
  (0, 26)	0.106449259082
  (0, 80)	0.106449259082
  (0, 36)	0.0532246295412
  (0, 63)	0.0532246295412
  (0, 38)	0.0532246295412
  (0, 56)	0.0532246295412
  (0, 29)	0.0532246295412
  (0, 73)	0.0532246295412
  (0, 47)	0.0532246295412
  (0, 106)	0.0532246295412
  (0, 74)	0.0532246295412
  (0, 69)	0.0532246

In [7]:
tfidf_transformer = TfidfTransformer(use_idf=True)
tfidf_features = tfidf_transformer.fit_transform(word_counts2)
print('{}'.format(tfidf_features))

  (0, 90)	0.0532246295412
  (0, 61)	0.0532246295412
  (0, 104)	0.0532246295412
  (0, 51)	0.0532246295412
  (0, 54)	0.0532246295412
  (0, 97)	0.0532246295412
  (0, 12)	0.0532246295412
  (0, 37)	0.0532246295412
  (0, 44)	0.0532246295412
  (0, 71)	0.0532246295412
  (0, 89)	0.0532246295412
  (0, 69)	0.0532246295412
  (0, 74)	0.0532246295412
  (0, 106)	0.0532246295412
  (0, 47)	0.0532246295412
  (0, 73)	0.0532246295412
  (0, 29)	0.0532246295412
  (0, 56)	0.0532246295412
  (0, 38)	0.0532246295412
  (0, 63)	0.0532246295412
  (0, 36)	0.0532246295412
  (0, 80)	0.106449259082
  (0, 26)	0.106449259082
  (0, 66)	0.0532246295412
  (0, 95)	0.0532246295412
  :	:
  (0, 2)	0.0532246295412
  (0, 10)	0.106449259082
  (0, 0)	0.0532246295412
  (0, 3)	0.0532246295412
  (0, 17)	0.0532246295412
  (0, 105)	0.0532246295412
  (0, 64)	0.0532246295412
  (0, 68)	0.0532246295412
  (0, 13)	0.0532246295412
  (0, 15)	0.0532246295412
  (0, 43)	0.0532246295412
  (0, 59)	0.0532246295412
  (0, 55)	0.0532246295412
  (0, 87)
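The two outputs above contain the same values (0.0532246295412, 0.106449259082 and so on). That is expected when there is only one document: every term then has the same document frequency, so the idf weight is the same for all terms and disappears in the normalisation. The sketch below, using two made-up toy documents rather than the notebook's data, suggests how the weights start to differ once there is more than one document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents, not taken from the notebook's data files
docs = ["school of public policy", "school of international affairs"]

# One document: tf and tf-idf give the same normalised weights
single_counts = CountVectorizer().fit_transform(docs[:1])
tf_single = TfidfTransformer(use_idf=False).fit_transform(single_counts).toarray()
tfidf_single = TfidfTransformer(use_idf=True).fit_transform(single_counts).toarray()
same_for_one_doc = np.allclose(tf_single, tfidf_single)

# Two documents: words that appear in both are down-weighted
vect = CountVectorizer()
counts = vect.fit_transform(docs)
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()
shared_weight = tfidf[0, vect.vocabulary_['school']]   # 'school' is in both documents
unique_weight = tfidf[0, vect.vocabulary_['policy']]   # 'policy' is only in the first

print(same_for_one_doc)
print(shared_weight < unique_weight)
```

So tf-idf only becomes informative when the vectoriser is fitted on a collection of documents, such as the list of tweet texts.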

## Using the NLTK library and corpora

Corpora are collections of language data, such as stopword lists, stemming word lists and grammars, available for multiple languages.

### Get NLTK resources


WARNINGS:

* You will have to conda install nltk before you can run this cell.
* Even when you have the NLTK library, you won't be able to run the cells below without the NLTK corpora.

Run the cell below just once.
* When you run nltk.download(), a popup window will appear.
* Click on the "models" tab, then "punkt", then the "download" button.

If you want more, try clicking on "book" then "download".  If you ever wanted to have the whole of Basque grammar, then click around the tabs in this popup window: this is the place to get it!

In [11]:
import nltk
# nltk.download() #Uncomment this before running the cell

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Word stemming

Some of the words above are plurals, or closely related to each other. Word stemming strips word endings (plural suffixes, for example) to leave just the 'root' of each word.
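As a quick illustration before running it on the whole text, the sketch below stems a handful of individual words with NLTK's PorterStemmer. The word list is chosen by hand for illustration. This part needs only the NLTK library itself, not the downloaded corpora.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked words: singular/plural pairs plus one longer word
words = ['school', 'schools', 'policy', 'policies', 'international']
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print('{} -> {}'.format(word, stem))
```

Note that a stem is not always a dictionary word: the point is that related forms (such as 'policy' and 'policies') end up with the same stem, so they are counted together.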

In [16]:
from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

count_vect3 = CountVectorizer(tokenizer=tokenize, stop_words='english') 
word_counts3 = count_vect3.fit_transform([sipatext])
#print('{}'.format(word_counts3))
#print('{}'.format(count_vect3.vocabulary_))

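# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
df3 = pd.DataFrame(word_counts3.toarray(), columns=count_vect3.get_feature_names_out()).transpose()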
df3.columns = ['count']
df3.sort_values(by='count', ascending=False)

Unnamed: 0,count
",",20
school,9
.,8
intern,6
's,5
polici,5
public,5
program,5
columbia,4
sipa,4
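The table above shows that punctuation tokens such as "," and "." and fragments like "'s" come through the tokenizer and end up among the most frequent 'words'. One possible fix, sketched below, is to drop tokens that are not purely letters or digits before stemming. The `stem_word_tokens` helper is a made-up name, and the token list is written by hand to look like word_tokenize output, so that the example runs without the punkt download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_word_tokens(tokens):
    """Stem tokens, skipping any token that is not purely letters/digits."""
    return [stemmer.stem(tok) for tok in tokens if tok.isalnum()]

# Hand-written tokens, in the style of word_tokenize output (an assumption)
example_tokens = ['SIPA', "'s", 'alumni', ',', 'schools', '.']

cleaned = stem_word_tokens(example_tokens)
print(cleaned)
```

This is a blunt filter: it also throws away hyphenated words such as "full-time", so whether it is appropriate depends on the text. To use it in the cell above, the filtering step would go between word_tokenize and the stemming loop inside tokenize.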
