### Learning Python the Hard Way - Session 2
Toronto Data Literacy Group

Creator: Cindy Zhong

Date: January 09, 2017

#### Reading The Data

The data for this session can be downloaded from the GitHub repository.
If you want to get the data from Twitter yourself, it was created using the code from Session 1: https://github.com/cindyzhong/trt_data_lit_grp_python/tree/master/Lesson1

Next, read the comma-delimited file into Python. To do this, we can use the pandas package, which provides the read_csv function for reading delimited data files. If you haven't used pandas before, you may need to install it.

In [200]:
import pandas as pd
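# Show the full tweet text instead of truncating it; on pandas 1.0 and later, pass None instead of -1
pd.set_option('display.max_colwidth', -1)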
tweet_df = pd.read_csv("tweet_sample.csv", delimiter=",", encoding = "utf-8")

In [201]:
# A look at the dimension of the dataframe
tweet_df.shape

(6478, 9)

In [202]:
# A look at the columns of the dataframe
tweet_df.columns.values

array(['Unnamed: 0', u'handle', u'tweet_body', u'tweet_created_at',
       u'likes', u'retweet', u'hashtags', u'user_mentions', u'place'], dtype=object)
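
A column named 'Unnamed: 0' typically appears when a CSV was saved together with its row index, so the first header cell is empty. The sketch below uses a small made-up sample (not the real tweet_sample.csv) to show how `index_col=0` makes read_csv treat that column as the index instead. The sample rows and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for tweet_sample.csv
csv_text = u""",handle,tweet_body,likes
0,someUser,"Hello, world",3
1,otherUser,Second tweet,5
"""

# Default read: the unnamed first column becomes a data column called 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether to keep or drop the extra column is a matter of taste; it is harmless for the analysis in this session.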

In [203]:
# A look at sample data
tweet_df[0:5]

Unnamed: 0.1,Unnamed: 0,handle,tweet_body,tweet_created_at,likes,retweet,hashtags,user_mentions,place
0,0,realDonaldTrump,Ford said last week that it will expand in Michigan and U.S. instead of building a BILLION dollar plant in Mexico. Thank you Ford &amp; Fiat C!,2017-01-09 14:16:34,0.0,8055.0,[],[],
1,1,realDonaldTrump,"It's finally happening - Fiat Chrysler just announced plans to invest $1BILLION in Michigan and Ohio plants, adding 2000 jobs. This after...",2017-01-09 14:14:10,0.0,8526.0,[],[],
2,2,realDonaldTrump,"""groveling"" when he totally changed a 16 year old story that he had written in order to make me look bad. Just more very dishonest media!",2017-01-09 11:43:26,0.0,9363.0,[],[],
3,3,realDonaldTrump,"Hillary flunky who lost big. For the 100th time, I never ""mocked"" a disabled reporter (would never do that) but simply showed him.......",2017-01-09 11:36:02,0.0,11849.0,[],[],
4,4,realDonaldTrump,"Meryl Streep, one of the most over-rated actresses in Hollywood, doesn't know me but attacked last night at the Golden Globes. She is a.....",2017-01-09 11:27:50,0.0,20514.0,[],[],


In [204]:
# We can also look at a random sample of the rows
tweet_df.sample(5)

Unnamed: 0.1,Unnamed: 0,handle,tweet_body,tweet_created_at,likes,retweet,hashtags,user_mentions,place
1596,1596,realDonaldTrump,"Our not very bright Vice President, Joe Biden, just stated that I wanted to ""carpet bomb"" the enemy. Sorry Joe, that was Ted Cruz!",2016-07-27 12:57:20,0.0,11410.0,[],[],
1995,1995,realDonaldTrump,Thank you @DallasPD! https://t.co/ORJyN4FsNI,2016-06-17 23:24:13,0.0,2649.0,[],[Dallas Police Dept],
1774,1774,realDonaldTrump,#CrookedHillary is not qualified!\r\nhttps://t.co/6qi7KTW43O,2016-07-12 16:45:45,0.0,12559.0,[CrookedHillary],[],
5559,5559,HillaryClinton,"With just 83 days until Election Day, Trump hired one of the most extreme right-wing voices to run his campaign. https://t.co/geausYW6oD",2016-08-17 18:40:09,0.0,5233.0,[],[],
909,909,realDonaldTrump,"Wow, @CNN is so negative. Their panel is a joke, biased and very dumb. I'm turning to @FoxNews where we get a fair shake! Mike will do great",2016-10-05 00:12:59,0.0,8134.0,[],"[CNN, Fox News]",


In [205]:
# Let's use one tweet as an example
tweet_eg = tweet_df['tweet_body'][8]
tweet_eg

u'RT @MeetThePress: Watch our interview with @KellyannePolls: Russia "did not succeed" in attempts to sway election https://t.co/EZhgUIUbYx #'

#### Cleaning and Pre-Processing The Texts
We are interested in the text of the tweets.
What makes text analytics unusual is that there is no standard way of pre-processing the data. Depending on the problem you are trying to solve, the pre-processing can be different.
In most cases, it consists of the following components:
- Removing Unwanted Characters
- Removing Punctuation
- Removing Numbers
- Standardizing Case
- Removing Stopwords

We will explain each of them in this session.
We will be using a package called NLTK (Natural Language Toolkit) and Python's built-in re (regular expression) module extensively in this exercise.

#### Basic Text Cleaning Techniques

In [206]:
# Regular expressions are a very useful skill to learn in their own right.
import re

In [207]:
# A lot of the tweets contain reference URLs; we want to remove them first
def remove_url(text):
	text = re.sub(r'http://[^ ]*', '', text)
	text = re.sub(r'https://[^ ]*', '', text)
	return text

In [208]:
# Using the function on our sample tweet
tweet_eg = remove_url(tweet_eg)
tweet_eg

u'RT @MeetThePress: Watch our interview with @KellyannePolls: Russia "did not succeed" in attempts to sway election  #'
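
The two patterns used in remove_url stop only at a space character, so a URL followed directly by a tab or line break would swallow the text after it. As a hedged alternative, the sketch below uses a single pattern that stops at any whitespace. The function name `remove_url_compact` and the sample string are made up for illustration.

```python
import re

# Assumption: one pattern with an optional "s", followed by any run of non-whitespace
URL_PATTERN = re.compile(r'https?://\S+')

def remove_url_compact(text):
    """Remove http:// and https:// links, stopping at any whitespace character."""
    return URL_PATTERN.sub('', text)

# Made-up sample with a line break before one URL and a tab right after another
sample = 'Not qualified!\r\nhttps://t.co/abc123 see also http://example.com/page\tdone'
print(repr(remove_url_compact(sample)))
```

With the space-only pattern, the word after the tab would be removed along with the second URL; here it survives.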

In [209]:
# Removing the @user mentions
def remove_at_user(text):
	# re is already imported above; the raw string keeps \s from being read as a string escape
	return re.sub(r'@[^\s]+', '', text)

In [210]:
tweet_eg = remove_at_user(tweet_eg)
tweet_eg

u'RT  Watch our interview with  Russia "did not succeed" in attempts to sway election  #'

In [211]:
# Now try to write a function to remove the retweet marker 'RT'
def remove_rt(text):
    # \b word boundaries keep 'RT' inside longer words (e.g. 'SMART') from being removed
    text = re.sub(r'\bRT\b', '', text, count=1)
    return text

In [212]:
tweet_eg = remove_rt(tweet_eg)
tweet_eg

u'  Watch our interview with  Russia "did not succeed" in attempts to sway election  #'

In [213]:
# Let's remove punctuation and numbers, basically all the non-letters for now
def remove_non_letters(text):
	return re.sub(r"[^a-zA-Z]", " ", text)

In [214]:
tweet_eg = remove_non_letters(tweet_eg)
tweet_eg

u'  Watch our interview with  Russia  did not succeed  in attempts to sway election   '
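
One side effect of replacing every non-letter with a space is that HTML entities in the raw tweets, such as `&amp;`, leave behind a stray word ('amp'). A minimal sketch of one way around this, assuming Python 3 (where the standard library provides `html.unescape`), is to decode entities before stripping non-letters. The function name `remove_non_letters_unescaped` is hypothetical.

```python
import html
import re

def remove_non_letters_unescaped(text):
    # Decode HTML entities first, so '&amp;' becomes '&' rather than the stray word 'amp'
    text = html.unescape(text)
    return re.sub(r'[^a-zA-Z]', ' ', text)

raw = 'Thank you Ford &amp; Fiat C!'
plain = re.sub(r'[^a-zA-Z]', ' ', raw)          # same pattern as remove_non_letters
decoded = remove_non_letters_unescaped(raw)

print(plain.split())
print(decoded.split())
```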

In [215]:
# We might want to remove some extra blanks
def remove_extra_blanks(text):
	text = re.sub('\n', ' ', text)
	text = re.sub(" +"," ",text).strip() #remove extra spaces
	return text

In [216]:
tweet_eg = remove_extra_blanks(tweet_eg)
tweet_eg

u'Watch our interview with Russia did not succeed in attempts to sway election'

In [217]:
# Standardizing Cases
def all_lower_case(text):
	return text.lower()

tweet_eg = all_lower_case(tweet_eg)
tweet_eg

u'watch our interview with russia did not succeed in attempts to sway election'

In [218]:
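# Now, let's put all of the above cleaning functions together
# Note: remove_at_user is not called here, so handle names remain in the cleaned text
def my_text_cleanser(text):
    # basestring exists only in Python 2; non-string values fall through and return None
    if isinstance(text,basestring):
        text = text.encode('utf-8')
        text = remove_url(text)
        text = remove_rt(text)
        text = remove_non_letters(text)
        text = remove_extra_blanks(text)
        text = all_lower_case(text)
        return text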
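
The cleanser above relies on `basestring`, which exists only in Python 2. For readers on Python 3, the following is a rough self-contained sketch of the same steps in one function. The name `clean_tweet_py3` is hypothetical, and returning an empty string for non-string input is an assumption of this sketch rather than the behaviour of the original function.

```python
import re

def clean_tweet_py3(text):
    # Assumption: non-string values (for example NaN from missing cells) become ''
    if not isinstance(text, str):
        return ''
    text = re.sub(r'https?://\S+', '', text)      # remove URLs
    text = re.sub(r'\bRT\b', '', text, count=1)   # remove the first retweet marker
    text = re.sub(r'[^a-zA-Z]', ' ', text)        # replace non-letters with spaces
    text = re.sub(r'\s+', ' ', text).strip()      # collapse extra whitespace
    return text.lower()                           # standardize case

raw_tweet = ('RT @TheBriefing2016: Ted Cruz once called Donald Trump a pathological liar. '
             'And yet...here he is. #RNCinCLE https://t.co/ZfTFm1qy5Q')
print(clean_tweet_py3(raw_tweet))
```

As in the original, mentions are not removed, so the handle name stays in the cleaned text.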

In [219]:
# We will apply the text cleanser to our 'tweet_body' column, using a very commonly used function in pandas 'apply'
tweet_df['tweet_body_clean'] = tweet_df.tweet_body.apply(my_text_cleanser)

In [220]:
# Take a look at the old column and the cleaned new column
tweet_df[['tweet_body','tweet_body_clean']].sample(5)

Unnamed: 0,tweet_body,tweet_body_clean
6428,RT @TheBriefing2016: Ted Cruz once called Donald Trump a pathological liar. And yet...here he is. #RNCinCLE https://t.co/ZfTFm1qy5Q,thebriefing ted cruz once called donald trump a pathological liar and yet here he is rncincle
1503,"CNN anchors are completely out of touch with everyday people worried about rising crime, failing schools and vanishing jobs.",cnn anchors are completely out of touch with everyday people worried about rising crime failing schools and vanishing jobs
4496,A Wall Street money manager should not be able to pay a lower tax rate than a teacher or a nurse.,a wall street money manager should not be able to pay a lower tax rate than a teacher or a nurse
4591,"To Donald, women like Alicia are only as valuable as his personal opinion about their looks. https://t.co/OZv8yg8vjZ https://t.co/PZWmPcORBR",to donald women like alicia are only as valuable as his personal opinion about their looks
3983,This election is over in 20 daysbut the decision we make will affect our country for generations. #DebateNight https://t.co/SfQM2FdOAr,this election is over in daysbut the decision we make will affect our country for generations debatenight


#### Removing Stopwords
Stopwords are words that occur frequently in text but carry little meaning on their own, for example, 'am', 'and', 'the'.
We often want to remove these words when we are doing text analytics.
To do this, we will use NLTK.

In [146]:
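# If you haven't done so already, download NLTK's stopwords corpus
# (the WordNet corpus is also needed for the lemmatizer used later).
# Calling nltk.download() with no arguments opens the interactive downloader.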
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [221]:
# Import the stop word list
from nltk.corpus import stopwords 
print (stopwords.words("english")) 

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [222]:
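def remove_stopwords(text):
    # Build the stopword set once per call; set lookups are faster than list lookups
    stop_words = set(stopwords.words("english"))
    words = text.split()
    meaningful_words = [w for w in words if w not in stop_words]
    return meaningful_words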

In [223]:
tweet_eg = remove_stopwords(tweet_eg)
tweet_eg

[u'watch',
 u'interview',
 u'russia',
 u'succeed',
 u'attempts',
 u'sway',
 u'election']
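
Notice that 'not' was removed, so 'did not succeed' has become just 'succeed'. Depending on the analysis, dropping negations can change the meaning of a tweet. The sketch below shows one possible way to keep them. It uses a tiny hand-written stopword set rather than NLTK's list, and the names `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords_keep_negation` are assumptions for illustration.

```python
# Hypothetical mini stopword list for illustration; NLTK's English list is much longer
STOP_WORDS = {'our', 'with', 'did', 'not', 'in', 'to'}
# Assumption: these negation words are worth keeping for some analyses
NEGATIONS = {'not', 'no', 'nor'}

def remove_stopwords_keep_negation(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    # Take the negations out of the stopword set before filtering
    effective = set(stop_words) - set(keep)
    return [w for w in text.split() if w not in effective]

cleaned = 'watch our interview with russia did not succeed in attempts to sway election'
print(remove_stopwords_keep_negation(cleaned))
```

The same idea should carry over to the NLTK list by passing `stopwords.words("english")` as `stop_words`.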

#### Word Stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.

In [224]:
# Examples of stemmed words
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
print (snowball_stemmer.stem('interaction'))
print (snowball_stemmer.stem('interact'))
print (snowball_stemmer.stem('interactions'))
print (snowball_stemmer.stem('interactivity'))

interact
interact
interact
interact


#### Word Lemmatization
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

In [225]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print (wordnet_lemmatizer.lemmatize('interaction'))
print (wordnet_lemmatizer.lemmatize('interact'))
print (wordnet_lemmatizer.lemmatize('interactions'))
print (wordnet_lemmatizer.lemmatize('interactivity'))

interaction
interact
interaction
interactivity
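
To make the difference between the two outputs more concrete, here is a deliberately simplified toy sketch: a stemmer chops suffixes by rule, while a lemmatizer looks words up in a vocabulary. Neither function reproduces the actual Snowball or WordNet algorithms; the suffix list, the lookup table, and the function names are made up for illustration.

```python
# Toy suffix list; real stemmers such as Snowball apply many ordered rules
SUFFIXES = ['ivity', 'ions', 'ion', 's']

def toy_stem(word):
    # Strip the first matching suffix, as long as at least 3 letters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lookup of inflected form -> dictionary form (WordNet plays this role in NLTK)
LEMMAS = {'interactions': 'interaction', 'decisions': 'decision'}

def toy_lemmatize(word):
    # Unknown words are returned unchanged
    return LEMMAS.get(word, word)

for w in ['interaction', 'interact', 'interactions', 'interactivity']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based approach collapses all four words to one stem, while the lookup approach only merges forms it knows to be inflections of the same word.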


In [226]:
# We will be using the lemmatizer for our purpose
def lemmatizer(words):
    return [wordnet_lemmatizer.lemmatize(w) for w in words]

In [227]:
def my_text_tokenizer(text):
    words = remove_stopwords(text)
    words = lemmatizer(words)
    return words

In [228]:
# Now let's apply the functions above to our cleaned tweet
tweet_df['tweet_body_terms'] = tweet_df.tweet_body_clean.apply(my_text_tokenizer)

In [229]:
# Take a look at what we've done so far
tweet_df[['tweet_body','tweet_body_clean','tweet_body_terms']].sample(5)

Unnamed: 0,tweet_body,tweet_body_clean,tweet_body_terms
3443,RT @dougmillsnyt: .@HillaryClinton with JAY Z and Beyonce in Cleveland at the concert https://t.co/e8e8eofTsq,dougmillsnyt hillaryclinton with jay z and beyonce in cleveland at the concert,"[dougmillsnyt, hillaryclinton, jay, z, beyonce, cleveland, concert]"
1729,"It doesn't matter that Crooked Hillary has experience, look at all of the bad decisions she has made. Bernie said she has bad judgement!",it doesn t matter that crooked hillary has experience look at all of the bad decisions she has made bernie said she has bad judgement,"[matter, crooked, hillary, experience, look, bad, decision, made, bernie, said, bad, judgement]"
534,"Great crowd in Johnstown, Pennsylvania- thank you. Get out &amp; VOTE on 11/8! Watch the MOVEMENT in PA. this afternoon https://t.co/DUMlbSkVeY",great crowd in johnstown pennsylvania thank you get out amp vote on watch the movement in pa this afternoon,"[great, crowd, johnstown, pennsylvania, thank, get, amp, vote, watch, movement, pa, afternoon]"
581,You should give the money back @HillaryClinton! #DrainTheSwamp https://t.co/m0LKHRUoHz,you should give the money back hillaryclinton draintheswamp,"[give, money, back, hillaryclinton, draintheswamp]"
5407,Happy #WomensEqualityDay from @realDonaldTrump. https://t.co/YfUdtygL4h,happy womensequalityday from realdonaldtrump,"[happy, womensequalityday, realdonaldtrump]"


### Simple Text Analytics on Tweets
With the text pre-processed, we can now do some simple but interesting analytics on the tweets. In this session, we will look at the following for Trump and Hillary:
- Term Collocation
- Lexical Diversity

In [230]:
# Since we will be creating statistics at user level, we group the dataframe by users
users_df = tweet_df.groupby('handle').agg({'tweet_body_terms':sum,'tweet_body_clean':lambda x: ' '.join(x)})
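
Aggregating a column of lists with `sum` has the effect of concatenating each user's term lists into one long list. The plain-Python sketch below illustrates that effect on made-up data, together with `itertools.chain.from_iterable` as a possible alternative that avoids building many intermediate lists. The variable names are assumptions for illustration.

```python
from itertools import chain

# Made-up term lists standing in for one user's rows of tweet_body_terms
term_lists = [['great', 'crowd'], ['thank', 'you'], ['vote']]

# sum on lists: repeated + starting from an empty list
combined_with_sum = sum(term_lists, [])

# chain.from_iterable produces the same flat list in a single pass
combined_with_chain = list(chain.from_iterable(term_lists))

print(combined_with_sum)
print(combined_with_chain)
```

For a few thousand tweets either approach is fine; the chained version may be worth considering for much larger collections.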

#### Term Collocations
Collocations are partly or fully fixed expressions that become established through repeated context-dependent use. 
For example, 'crystal clear', 'middle management', and 'plastic surgery' are examples of collocated pairs of words.
We are interested in term collocations because the context gives us better insight into the meaning of a term, supporting applications such as word sense disambiguation or semantic similarity.

In [231]:
# Find top collocations in the tweets
from nltk.collocations import BigramCollocationFinder

def top_collocation_text(words):
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    # Ignore bigrams that occur fewer than 5 times
    finder.apply_freq_filter(5)
    # Return the 20 bigrams with the highest pointwise mutual information (PMI)
    return finder.nbest(bigram_measures.pmi, 20)

In [232]:
# Let's see which word pairs are most strongly associated for Hillary and Trump
users_df['top_collocation_text'] = users_df.tweet_body_terms.apply(top_collocation_text)

In [233]:
print (users_df['top_collocation_text'])

handle
HillaryClinton     [(anywhere, near), (bin, laden), (jay, z), (klux, klan), (ku, klux), (role, model), (hiv, aid), (hurricane, matthew), (pm, et), (conspiracy, theory), (glass, ceiling), (north, carolina), (zip, code), (editorial, board), (common, ground), (energy, superpower), (birther, movement), (comprehensive, immigration), (locker, room), (humayun, khan)]              
realDonaldTrump    [(rolling, thunder), (electoral, college), (sometimes, referred), (rhode, island), (supreme, court), (mobile, alabama), (san, diego), (san, jose), (bobby, knight), (referred, pocahontas), (coach, bobby), (lindsey, graham), (self, funding), (paul, ryan), (town, hall), (grand, rapid), (conflict, interest), (facebook, page), (mitt, romney), (radical, islam)]
Name: top_collocation_text, dtype: object
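
The pairs above are ranked by pointwise mutual information (PMI), which rewards words that appear together more often than their individual frequencies would suggest. The following is a rough sketch of the textbook PMI definition in plain Python (assuming Python 3); NLTK's exact counting conventions may differ slightly, and the function name `bigram_pmi` and the sample words are made up for illustration.

```python
import math
from collections import Counter

def bigram_pmi(words, min_freq=1):
    """Score each adjacent word pair with log2(P(w1, w2) / (P(w1) * P(w2)))."""
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    n = len(words)
    scores = {}
    for (w1, w2), n_pair in bigram_counts.items():
        if n_pair < min_freq:
            continue  # plays the role of a frequency filter
        scores[(w1, w2)] = math.log2(n_pair * n / (word_counts[w1] * word_counts[w2]))
    return scores

words = ['supreme', 'court', 'ruling', 'supreme', 'court',
         'today', 'the', 'court', 'the', 'ruling']
all_scores = bigram_pmi(words)
filtered_scores = bigram_pmi(words, min_freq=2)

print(all_scores[('today', 'the')], all_scores[('supreme', 'court')])
print(filtered_scores)
```

In this toy data, a pair seen only once can outscore a pair seen twice, which is why a minimum frequency filter is usually applied before ranking by PMI.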


#### Lexical Diversity
Lexical diversity is a measure of how many different words are used in a text.
The more varied the vocabulary of a text, the higher its lexical diversity.
For a text to be highly lexically diverse, the speaker or writer has to use many
different words, with little repetition of the words already used.
Here, the lexical diversity of a text is defined as the ratio of the number of unique terms to the total number of terms.

In [234]:
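def lexical_diversity(words):
    # Unique terms divided by total terms; the +1 in the denominator avoids division by zero for empty input
    return 1.0*len(set(words))/(len(words)+1)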

In [235]:
users_df['lexical_diversity'] = users_df.tweet_body_terms.apply(lexical_diversity)

In [236]:
print (users_df['lexical_diversity'])

handle
HillaryClinton     0.166746
realDonaldTrump    0.162534
Name: lexical_diversity, dtype: float64
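
When reading these numbers, keep in mind that a unique-to-total ratio tends to fall as a text gets longer, because common words keep repeating. The small sketch below illustrates this on made-up text. The function name `type_token_ratio` is hypothetical, and unlike lexical_diversity above it uses the plain ratio without the +1 in the denominator.

```python
def type_token_ratio(words):
    # Plain unique/total ratio; returns 0.0 for empty input instead of dividing by zero
    return float(len(set(words))) / len(words) if words else 0.0

short_text = 'the cat sat on the mat'.split()
long_text = short_text * 10   # same vocabulary, ten times as many words

print(type_token_ratio(short_text))
print(type_token_ratio(long_text))
```

Because the two accounts may differ in how many terms they contributed, a small gap in the scores is best treated with some caution.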
