# Social Media and Human-Computer Interaction - Part 3



###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - plots and graphs
* [jsonpickle](https://jsonpickle.github.io) - for storing tweets
---

In [1]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime

# Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the natural language processing analysis of our tweet corpus, including 

  1. Natural Language Processing
  2. Construction of classifiers
  
Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 0. Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. Our Tweets class
2. Our Twitter API keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
3. Configuration of our Twitter connection

In [2]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        # sort by code name; a lambda avoids relying on the operator module, which is never imported
        sortedsummary = sorted(summary.items(),key=lambda item: item[0],reverse=True)
        return sortedsummary

In [3]:
# Replace these placeholders with the keys that you generated in Part 1.
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [4]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

## 1. Examination of text patterns

Our ultimate goal is to build a classifier capable of distinguishing tweets related to tobacco smoking from other, unrelated tweets. To do this, we will eventually build natural language processing models. However, we will start by doing some basic processing to explore the types of words and language found in the tweets. 

To do this, we will use the [Spacy](https://spacy.io/) Python NLP package. Spacy provides significant NLP power out of the box, with customization facilities offering greater flexibility at various stages of the pipeline. Details can be found at the [Spacy web site](https://spacy.io/), and in this [tutorial](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/).

However, before we get into the details, a bit of a roadmap. 

Natural language processing involves a series of operations on an input text, each building on the previous step to add insight and understanding. Thus, many NLP packages run as pipeline processors that provide modular components at each stage of the process. Separating key steps into discrete components provides needed modularity, as developers can modify and customize individual components as needed. Spacy, like other NLP tools including [GATE](https://gate.ac.uk/) and [cTAKES](https://ctakes.apache.org), operates on such a model. Although the specific components of each pipeline vary from system to system (and from task to task), the key tasks are roughly similar:

1. *Tokenizing*: splitting the text into words, punctuation, and other markers.
2. *Part of speech tagging*: Classifying terms as nouns, verbs, adjectives, adverbs, etc.
3. *Dependency Parsing* or *Chunking*: Defining relationships between tokens (subject and object of sentence) and grouping into noun and verb phrases.
4. *Named Entity Recognition*: Mapping words or phrases to standard vocabularies or other common, known values. This step is often key for linking free text to accepted terms for diseases, symptoms, and/or anatomic locations.

Each of these steps might be accomplished through rules, machine learning models, or some combination of approaches. After these initial steps are complete, results might be used to identify relationships between items in the text, build classifiers, or otherwise conduct further analysis. We'll get into these topics later.
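The pipeline idea can be illustrated without any NLP library. The following is a minimal sketch, not how Spacy is implemented: the stage names (`tokenize`, `tag_kinds`), the dictionary used as a "document", and the `run_pipeline` helper are all hypothetical, and the second stage is only a toy stand-in for real part-of-speech tagging.

```python
import re

def tokenize(doc):
    # Stage 1: split the raw text into word and punctuation tokens.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc

def tag_kinds(doc):
    # Stage 2 (toy stand-in for tagging): label each token using the
    # output of the previous stage.
    def kind(tok):
        if tok.isalpha():
            return "WORD"
        if tok.isdigit():
            return "NUM"
        return "OTHER"
    doc["kinds"] = [kind(tok) for tok in doc["tokens"]]
    return doc

def run_pipeline(text, stages):
    # Each stage receives the document, adds annotations, and passes it on.
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

result = run_pipeline("Made a sandwich 10 min ago.", [tokenize, tag_kinds])
print(list(zip(result["tokens"], result["kinds"])))
```

Because each stage only reads and extends the shared document, a stage can be swapped out or customized without touching the others, which is the modularity described above.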

The [Spacy documentation](https://spacy.io/usage/spacy-101) and [cTAKES default pipeline description](https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline) provide two examples of how these components might be arranged in practice.  For more information on NLP theory and methods, see [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/), perhaps the leading NLP textbook.

Given this introduction, we can read in our tweets and get to work.

### 1.1 Reading in data

At the end of [Part 2](SocialMedia - Part 2.ipynb) you saved two sets of tweets, one for smoking and one for vaping. Let's read in the vaping tweets.

In [5]:
vaping=Tweets()
vaping.readTweets("tweets-vaping.json")

In [6]:
vaping.countTweets()

100

and the smoking tweets...

In [7]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.countTweets()

100

### 1.2 Tokenizing

To start with, we will grab a specific pre-chosen tweet and process it.  

This will give us a beginning feel for what [Spacy](https://spacy.io) can do and how we might use it.

In [8]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
sample

'#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf \n#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr'

Tweets have usage patterns that are not standard English - URLs, hashtags, and user references (this particular tweet was not selected accidentally). These patterns create challenges for extracting content - we might want to know that "#QuitSmoking" is, in a tweet, a hashtag that should be considered as a complete unit.  

We'll see soon how we might do this, but first, to start the NLP process, we can import the Spacy components and create an NLP object:

In [118]:
import spacy
nlp = spacy.load('en')

We can then parse the text of our sample tweet.

In [11]:
parsed = nlp(sample)

The result is a list of tokens. We can print out each token to start:

In [12]:
print([token.text for token in parsed])

['#', 'Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.', 'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n', '#', 'SwasthaBharat', '#', 'NHPIndia', '#', 'mCessation', '#', 'QuitSmoking', 'https://t.co/x7xHO9G2Cr']


We can see right away that this parsing isn't quite what we would like. Default English parsing treats `#QuitSmoking` as two separate tokens - `#` and `QuitSmoking`. To treat this as a hashtag, we will need to revise the tokenizer - the component of an NLP system responsible for splitting text into tokens.
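To make the goal concrete, here is a minimal stand-alone sketch of tweet-aware tokenizing using only Python's `re` module. It is not Spacy's tokenizer and not the customization mechanism that Spacy provides; the pattern and the `tokenize_tweet` name are assumptions chosen for illustration.

```python
import re

# Alternatives are tried left to right, so URLs, mentions and hashtags
# are matched before plain words and punctuation.
TWEET_TOKEN = re.compile(
    r"https?://\S+"   # URLs, up to the next whitespace
    r"|[@#]\w+"       # @mentions and #hashtags kept whole
    r"|\w+"           # ordinary words and numbers
    r"|[^\w\s]"       # any other single symbol
)

def tokenize_tweet(text):
    """Split a tweet into tokens, keeping hashtags, mentions and URLs intact."""
    return TWEET_TOKEN.findall(text)

sample_text = "#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf #QuitSmoking"
tweet_tokens = tokenize_tweet(sample_text)
print(tweet_tokens)
```

This sketch is deliberately simple: a URL followed directly by punctuation would swallow the punctuation, and the tokens are plain strings with none of the attributes that Spacy attaches.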

### 1.3 Parsing and Part-Of-Speech Tagging 

The next steps in NLP involve *part of speech tagging* and *dependency parsing*. 

Part of speech tagging is the process of classifying each token as one of the parts of speech that we all learned in elementary school. Parts of speech are stored as attributes of each token:


In [120]:
print(parsed[2].text)
print(parsed[2].pos)
print(parsed[2].pos_)

affects
99
VERB


Note that we have two attributes here - `pos` is the hash code for the part of speech, used for efficiency, while `pos_` is the human readable form. Other attributes derived by Spacy follow the same pattern.

A second attribute - `tag` - provides additional information.

As described in the [Spacy documentation for part-of-speech tags](https://spacy.io/api/annotation#pos-tagging), the tags associated with these two fields come from different sources. `tag_` uses parts of speech from a version of the [Penn Treebank](https://www.seas.upenn.edu/~pdtb/), a well-known corpus of annotated text. `pos_` uses a simpler set of tags from [A Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086), published by researchers from Google.  

The tags for `affects` provide an example of the difference. According to the [Spacy documentation](https://spacy.io/api/annotation#pos-tagging), `VBZ` from the Penn tag set indicates a 'verb, 3rd person singular present', while the `VERB` result for `pos_` is a more general tag from the Google set. Many types of verbs in the Penn Treebank correspond to the `VERB` tag from the Google set. 

In [122]:
print(parsed[2].text)
print(parsed[2].tag_)
print(parsed[2].pos_)

affects
VBZ
VERB


If you want to learn more about a part of speech tag, you can use `spacy.explain`:

In [123]:
print(spacy.explain(parsed[2].pos_))
print(spacy.explain(parsed[2].tag_))

verb
verb, 3rd person singular present



Tokens carry several other attributes. Chief among these is `lemma_`: the "standard" or "base" form, reducing verb forms to their base verb, plurals to appropriate singular nouns, etc.  For example, the third token (`parsed[2]`) is `affects`, which has `affect` as the lemmatized version.

In [119]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
parsed=nlp(sample)
print(parsed[2].text)
print(parsed[2].lemma_)

affects
affect


Note that Spacy stores many fields as both hashes for efficiency and as text  for readability. You'll want to use the text form for interpreting results, but the hash for computing. They differ only in the use of the trailing underscore - thus `lemma` is the hash while `lemma_` is the human readable form.

In [14]:
print(parsed[2].lemma)
print(parsed[2].lemma_)

17543419487618836897
affect
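The idea of pairing each string with an integer ID can be sketched in a few lines. This is a toy illustration only: the `ToyStringStore` class is hypothetical, and the hash function used here (BLAKE2b from `hashlib`) is an arbitrary choice, so the integers it produces will not match the values that Spacy prints.

```python
import hashlib

class ToyStringStore:
    """Hypothetical string table: strings are looked up by integer ID."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a stable 64-bit integer from the string's UTF-8 bytes.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def text_for(self, key):
        # Recover the readable form from the integer ID.
        return self._by_id[key]

store = ToyStringStore()
affect_id = store.add("affect")
print(affect_id, store.text_for(affect_id))
```

Comparing or counting integers is cheaper than comparing strings, while the table keeps the readable form available when results need to be interpreted.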


Other values returned by the parser might also be of interest:
* `is_stop` is True if the token is a "stop" word - a commonly found word that might add little or no information.
* `is_alpha` is True if the token consists only of alphabetic characters.

Let's look at token 1 ("Smoking"), token 4 ("parts"), token 12 ("https://t.co/hwTeRdC9Hf"), and token 14 ("#") to see a few more tokens in action.

In [20]:
t1 = parsed[1]
t4 = parsed[4]
t12= parsed[12]
t14 = parsed[14]
print (t1.text,t1.lemma_,t1.pos_,t1.tag_,t1.is_stop,t1.is_alpha)
print (t4.text,t4.lemma_,t4.pos_,t4.tag_,t4.is_stop,t4.is_alpha)
print (t12.text,t12.lemma_,t12.pos_,t12.tag_,t12.is_stop,t12.is_alpha)
print (t14.text,t14.lemma_,t14.pos_,t14.tag_,t14.is_stop,t14.is_alpha)

Smoking smoking NOUN NN False True
parts part NOUN NNS False True
https://t.co/hwTeRdC9Hf https://t.co/hwterdc9hf PROPN NNP False False
# # PROPN NNP False False


Note that URLs are neither alphabetic nor stop words, but the tagger labels them as proper nouns.
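Python's built-in `str.isalpha()` gives a quick way to anticipate the `Alpha?` column. The sketch below assumes that `is_alpha` agrees with `str.isalpha()` for these tokens; the values it produces match the Spacy output shown above, but the two are not guaranteed to agree in every case.

```python
# Token texts taken from the sample tweets in this notebook.
examples = ["Smoking", "parts", "10", "#", "https://t.co/hwTeRdC9Hf"]

# True only when every character in the string is a letter.
alpha = {text: text.isalpha() for text in examples}

for text, flag in alpha.items():
    print("{:25} {}".format(text, flag))
```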

Some NLP systems will go a bit further than Spacy's lemmatization, using a process called "stemming" to reduce words to base forms. With a stemming algorithm, "scared" might be reduced to "scare" - see this description of [Porter's stemming algorithm](https://tartarus.org/martin/PorterStemmer/) for more detail. 
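To get a feel for rule-based stemming, here is a deliberately crude suffix stripper. It is a toy, not Porter's algorithm: the `crude_stem` name, the three suffixes, and the minimum stem length of three characters are all assumptions made for illustration.

```python
def crude_stem(word):
    """Strip one common English suffix; a toy, not Porter's algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["parts", "looking", "scared", "is"]:
    print(w, "->", crude_stem(w))
```

The toy version turns "scared" into "scar" rather than "scare", which shows why practical stemmers need additional rules, and it leaves "is" unchanged, whereas lemmatization maps it to "be".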

Let's turn the code that we used above into a routine, along with a routine to print out token details and try another tweet or two. To make things easy to read, we'll use some spaces to format things in columns. 

In [24]:
def getTweetText(tweets):
    tweet_id=random.choice(list(tweets.getIds()))
    return tweets.getText(tweet_id)

def printTokDetails(parsed):
    print("{:25} {:25} {:7}{:7}{:7}{:7}".format("Token text","Lemma","POS","Tag","Stop?","Alpha?"))
    for tok in parsed:
        print("{:25} {:25} {:7}{:7}{:7}{:7}".format(str(tok.text),str(tok.lemma_),str(tok.pos_),str(tok.tag_),str(tok.is_stop),str(tok.is_alpha)))

In [25]:
sample2=getTweetText(smoking)

In [26]:
sample2

'Seeing kids smoking in school uniform makes me so angry!'

In [27]:
parsed2=nlp(sample2)

In [28]:
printTokDetails(parsed2)

Token text                Lemma                     POS    Tag    Stop?  Alpha? 
Seeing                    see                       VERB   VBG    False  True   
kids                      kid                       NOUN   NNS    False  True   
smoking                   smoke                     VERB   VBG    False  True   
in                        in                        ADP    IN     False  True   
school                    school                    NOUN   NN     False  True   
uniform                   uniform                   NOUN   NN     False  True   
makes                     make                      VERB   VBZ    False  True   
me                        -PRON-                    PRON   PRP    False  True   
so                        so                        ADV    RB     False  True   
angry                     angry                     ADJ    JJ     False  True   
!                         !                         PUNCT  .      False  False  


You might see some interesting patterns arising here.  For example:

* We see many different parts of speech. Initially, we might want to focus on the nouns alone, as they provide much of the content.  

* Look for words like "is" or "was" - these might all refer to a common lemma term - "be", corresponding to the generic form of the verb. Do you see any other instances of lemma forms that differ from the parsed text?

* URLs and icons might be present in tweets. Are they classified as alphanumeric? Should we include them as part of the "useful" text from a tweet? 


---
## EXERCISE 3.1: Filtering tokens

Although NLP parsing is often a good start, further filtering is often necessary to focus on data relevant for specific tasks. In this problem, we will review some additional tweets and develop a post-processing routine capable of filtering tweets as necessary for our needs. 

3.1.1 Using the `getTweetText` and `printTokDetails` routines above, along with the Spacy `nlp` object, examine several tweets to decide which tokens should be included or not.  List criteria for keeping/removing tokens. Remember to use `spacy.explain()` for any unfamiliar POS or tag entries. Note that your criteria will not be perfect, and will likely need refining. Examine enough tweets to feel confident in your criteria.

3.1.2 Write a routine `includeToken` that will return True if a token matches the criteria that you identified in 3.1.1, and False otherwise. Assume for now that we are only interested in nouns and verbs, as they might be a good starting point to find information about vaping or smoking. 

3.1.3 Write a routine `filterTweetTokens` that will filter the parsed tokens from a single tweet, returning a list of the tokens to be included, based on your criteria.

3.1.4 Run `filterTweetTokens` on a few tweets. Identify any inaccuracies and explain them. When possible, identify an approach for improving performance, and implement it in a revised version of `filterTweetTokens`.

3.1.5 Add these routines to the `Tweets` class, along with some new routines.

3.1.5.1 `parseTweet` will parse one of the tweets in the collection, storing the full list of tokens in a new dictionary entry entitled 'tokens'. `parseTweet` will also filter the tokens, storing the resulting list in an entry entitled 'filteredTokens'.

*NOTE*: The `Tweets` class might or might not have an NLP object available for any given call to `parseTweet`. You should have the class create an NLP object when it is initialized. 

3.1.5.2 `parseTweets` will call `parseTweet` on all of the tweets in a collection.

3.1.5.3 `getTokens` will be used to get all of the tokens for a given tweet.

3.1.5.4 `getFilteredTokens` will be used to get all of the filtered tokens for a tweet. 


3.1.6 When you are done, test this new version of the class by reading in and parsing the 'smoking' tweet set. 

---
*ANSWER FOLLOWS Cut below here*

### 3.1.1 Sample tweets

In [29]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
Made                      make                      VERB   VBN    False  True   
a                         a                         DET    DT     False  True   
sandwich                  sandwich                  NOUN   NN     False  True   
10                        10                        NUM    CD     False  False  
min                       min                       NOUN   NN     False  True   
ago                       ago                       ADV    RB     False  True   
and                       and                       CCONJ  CC     False  True   
been                      be                        VERB   VBN    False  True   
looking                   look                      VERB   VBG    False  True   
for                       for                       ADP    IN     False 

In [30]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

"'Weak' evidence linking e-cigarette use with future smoking"

Using e-cigarettes is a recommended way of giving up smoking, but can their use make children more likely to take up the real thing? @nhschoices looks at the evidence

https://t.co/Mh22zscOBI

#behindtheheadlines https://t.co/3h5FetrYWi
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
"                         "                         PUNCT  ``     False  False  
'                         '                         PUNCT  ``     False  False  
Weak                      weak                      ADJ    JJ     False  True   
'                         '                         PUNCT  ''     False  False  
evidence                  evidence                  NOUN   NN     False  True   
linking                   link                      VERB   VBG    False  True   
e                         e                         NOUN   NN     False  True   
-                         -                         

In [31]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

I love smoking weed in beautiful ass places, looking at beautiful ass things.
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
I                         -PRON-                    PRON   PRP    False  True   
love                      love                      VERB   VBP    False  True   
smoking                   smoke                     VERB   VBG    False  True   
weed                      weed                      NOUN   NN     False  True   
in                        in                        ADP    IN     False  True   
beautiful                 beautiful                 ADJ    JJ     False  True   
ass                       ass                       NOUN   NN     False  True   
places                    place                     NOUN   NNS    False  True   
,                         ,                         PUNCT  ,      False  False  
looking                   look                      VERB   VBG    False  True   
at                        at   

In [32]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

History is full of forgotten heroines, but my favorite might be the Québécois woman who hired a hearse so that she could be driven around town smoking in the coffin-bed while enjoying the view. https://t.co/TxjFEO25lR
Token text                Lemma                     POS    Tag    Stop?  Alpha? 
History                   history                   NOUN   NN     False  True   
is                        be                        VERB   VBZ    False  True   
full                      full                      ADJ    JJ     False  True   
of                        of                        ADP    IN     False  True   
forgotten                 forget                    VERB   VBN    False  True   
heroines                  heroine                   NOUN   NNS    False  True   
,                         ,                         PUNCT  ,      False  False  
but                       but                       CCONJ  CC     False  True   
my                        -PRON-                    A

Criteria: 
    
* Alpha is true, and 
* Stop is false, and 
* text is not "RT", and
* POS is NOUN or POS is VERB

### 3.1.2  `includeToken`

Our routine will accept a token only if it meets the criteria given above. 

In [33]:
def includeToken(tok):
    val =False
    if tok.is_alpha == True and tok.is_stop == False:
        if tok.text =='RT':
            val = False
        elif tok.pos_=='NOUN' or tok.pos_=='VERB':
            val = True
    return val
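The criteria can also be checked without loading a language model by using stand-in objects. This is a minimal sketch under stated assumptions: `include_token` is a compact rewrite of the routine above, `fake_token` is a hypothetical helper that exposes only the attributes the filter reads, and the POS values are written by hand rather than produced by Spacy.

```python
from types import SimpleNamespace

def include_token(tok):
    """Same criteria as includeToken above, written as a single expression."""
    return (tok.is_alpha and not tok.is_stop
            and tok.text != "RT"
            and tok.pos_ in ("NOUN", "VERB"))

def fake_token(text, pos, is_stop=False):
    # Hypothetical stand-in: hand-written annotations, not Spacy output.
    return SimpleNamespace(text=text, pos_=pos, is_stop=is_stop,
                           is_alpha=text.isalpha())

tokens = [
    fake_token("RT", "NOUN"),
    fake_token("sandwich", "NOUN"),
    fake_token("a", "DET"),
    fake_token("10", "NUM"),
    fake_token("looking", "VERB"),
]
kept = [t.text for t in tokens if include_token(t)]
print(kept)
```

Stand-ins like these make it easy to add edge cases, such as the "RT" marker, that might not appear in a small random sample of tweets.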

In [49]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)

Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe


In [50]:
print(parsed[0])
includeToken(parsed[0])

Made


True

In [51]:
print(parsed[1])
includeToken(parsed[1])

a


False

In [52]:
print(parsed[2])
includeToken(parsed[2])

sandwich


True

In [53]:
for tok in parsed:
    print(tok,includeToken(tok))

Made True
a False
sandwich True
10 False
min True
ago False
and False
been True
looking True
for False
it False
ever False
since False
then False
🤦 False
🏾‍ False
♂ False
️ False
I False
got True
ta False
stop True
smoking True
😂 False
https://t.co/NCbNOyvZXe False


Looks ok. 

### 3.1.3 Write a routine `filterTweetTokens` that will filter the tokens from a single tweet

In [54]:
def filterTweetTokens(tokens):
    filtered=[]
    for tok in tokens:
        if includeToken(tok) == True:
            filtered.append(tok)
    return filtered

In [55]:
f= filterTweetTokens(parsed)
print(sample)
for tok in f:
    print(tok)

Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe
Made
sandwich
min
been
looking
got
stop
smoking


### 3.1.4 Run `filterTweetTokens` on a few tweets

In [56]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

Level of saltiness: When people post videos of them smoking weed and you watch them just suck it in and blow it out right after
Level
saltiness
people
post
videos
smoking
weed
watch
suck
blow


In [57]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

"Wyd after smoking this?" https://t.co/nNGtlDRS95
Wyd
smoking


In [58]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

Level of saltiness: When people post videos of them smoking weed and you watch them just suck it in and blow it out right after
Level
saltiness
people
post
videos
smoking
weed
watch
suck
blow


In [59]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)


girls smoking hippy eve rapper nude pics https://t.co/HJXd3NXdKL
girls
smoking
eve
rapper
pics


In [60]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

There are now more ex-smokers than smokers across @WYHpartnership We want more! @yorkshirecancer statement published today backs use of e-cigarettes, combined with specialist stop smoking advice from @YSmokefree and others, as best way to quit. @breathe2025 @PHE_uk https://t.co/nqdrYB17Cw
are
smokers
smokers
want
statement
published
today
backs
use
e
cigarettes
combined
specialist
stop
smoking
advice
others
way
quit


In [61]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

Smoking can interrupt blood flow to the brain and may lead to stroke via @FDATobacco #BrainWeek https://t.co/hSI1eeTEeP
Smoking
can
interrupt
blood
flow
brain
may
lead
stroke


### 3.1.5  Adding to the Tweets class

In [70]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        self.nlp = spacy.load('en')
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
  
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        # sort by code name; a lambda avoids relying on the operator module, which is never imported
        sortedsummary = sorted(summary.items(),key=lambda item: item[0],reverse=True)
        return sortedsummary
    
    # new routine for classifying a token
    def includeToken(self,tok):
        val =False
        if tok.is_alpha == True and tok.is_stop == False:
            if tok.text =='RT':
                val = False
            elif tok.pos_=='NOUN' or tok.pos_=='VERB':
                val = True
        return val
    
    # new routine for filtering a list of tokens.
    def filterTweetTokens(self,tokens):
        filtered=[]
        for tok in tokens:
            # call the method on this object rather than the global function
            if self.includeToken(tok) == True:
                filtered.append(tok)
        return filtered
    
    def parseTweet(self,id):
        text = self.getText(id)
        # use the NLP object created in __init__ rather than a global
        parsed = self.nlp(text)
        self.tweets[id]['tokens']=parsed
        filtered= self.filterTweetTokens(parsed)
        self.tweets[id]['filteredTokens']=filtered
        
    def parseTweets(self):
        ids=self.getIds()
        for id in ids:
            self.parseTweet(id)
            
    def getTokens(self,id):
        if 'tokens' in self.tweets[id]:
            return self.tweets[id]['tokens']
        else: 
            return None
    
    def getFilteredTokens(self,id):
        if 'filteredTokens' in self.tweets[id]:
             return self.tweets[id]['filteredTokens']
        else:
            return None
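
The filtering rule in `includeToken` can be tried without loading a language model. The sketch below is only an illustration: `make_token` is a hypothetical helper that builds stand-in objects carrying the four attributes the rule reads (`text`, `pos_`, `is_alpha`, `is_stop`), and the part-of-speech tags are assigned by hand rather than by spaCy.

```python
from types import SimpleNamespace

def include_token(tok):
    """Keep alphabetic, non-stop-word nouns and verbs, excluding the 'RT' marker."""
    if not tok.is_alpha or tok.is_stop:
        return False
    if tok.text == "RT":
        return False
    return tok.pos_ in ("NOUN", "VERB")

def make_token(text, pos, is_stop=False):
    # Hypothetical stand-in for a spaCy token; only the attributes used above.
    return SimpleNamespace(text=text, pos_=pos,
                           is_alpha=text.isalpha(), is_stop=is_stop)

tokens = [
    make_token("RT", "NOUN"),
    make_token("smoking", "VERB"),
    make_token("the", "DET", is_stop=True),
    make_token("joint", "NOUN"),
    make_token(",", "PUNCT"),
    make_token("always", "ADV"),
]

kept = [tok.text for tok in tokens if include_token(tok)]
print(kept)
```

Only the noun and the verb survive; the retweet marker, the stop word, the punctuation, and the adverb are all dropped.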

### 3.1.6 Trying out the new routines for parsing a collection.

In [71]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.parseTweets()
smoking.countTweets()

100

In [72]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['Smoking', 'a', 'joint', 'is', 'always', 'a', 'lil', 'date', 'w', 'Christian', ',', 'but', 'smoking', 'a', 'joint', 'in', 'the', 'hot', 'tub', 'that', 'was', 'NIGHT', 'OUT', 'HA']
['Smoking', 'joint', 'is', 'lil', 'date', 'smoking', 'joint', 'tub', 'was']


In [73]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['me', ':', 'smoking', 'weed', 'has', 'n’t', 'affected', 'me', 'at', 'all', '\n\n', 'someone', ':', 'count', 'to', '10', '\n\n', 'me', ':', 'https://t.co/SUoGzARpom']
['smoking', 'weed', 'has', 'affected', 'someone', 'count']


In [74]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['#', 'UNFAO', 'is', 'scaling', 'up', 'efforts', 'on', 'reducing', 'the', 'amount', 'of', '#', 'wood', 'used', 'as', '#', 'fuel', 'for', '#', 'fish', 'smoking', 'in', 'the', '#', 'Gambia', '.', 'With', 'the', 'new', '#', 'UNFAO', 'Thiaroye', 'Technology', 'stove', ',', '#', 'women', 'in', 'the', '#', 'fishsmoking', '&', 'amp', ';', 'drying', '#', 'industry', 'have', 'improved', 'access', 'to', '#', 'technology', '&', 'amp', ';', '#', 'livelihood', '.', '#', 'ZeroHunger', '\n', '@FAOWestAfrica', 'https://t.co/ifT8KSRo3O']
['is', 'scaling', 'efforts', 'reducing', 'amount', 'wood', 'used', 'fuel', 'fish', 'smoking', 'stove', 'women', 'fishsmoking', 'amp', 'drying', 'industry', 'have', 'improved', 'access', 'technology', 'amp', 'livelihood']


*END OF ANSWER cut above here*

---

### 1.3 Revised tokenizing

Your review of some tweets might lead you to identify text patterns that do not fit well with the initial tokenizing or part-of-speech tagging. Fortunately, spaCy provides a means for extending the tokenizer for special cases. Here, we review an example of how these tools might be used.

Specifically, review of some tweets led to the following concerns: 
1. The word "E-cigarette" is split by the tokenizer into three separate tokens.
2. Hashtags are split into the pound symbol (`#`) and the following text.


#### 1.3.2 Tokenizing "E-cigarette"

Consider the following tweet:

In [75]:
smoketweet='E-cigarette use by teens linked to later tobacco smoking, study says https://t.co/AhTpFUw0TW'

In [76]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['E', '-', 'cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Note that "E-cigarette" becomes three tokens. This is not what we want - we want it to be held together as one. 
To do this, we can refer to the spaCy documentation, which describes the process for adding a [special-case tokenizer rule](https://spacy.io/usage/linguistic-features#section-tokenization). These rules let us customize tokenization for a specific domain.

Each new rule will be a dictionary with three fields:
* `ORTH` is the text that will be matched
* `LEMMA` is the lemma form
* `POS` is the part-of-speech
    
These can then be added to the tokenizer:

In [77]:
from spacy.symbols import ORTH, LEMMA, POS
special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
nlp.tokenizer.add_special_case(u'E-cigarette', special_case)

This code says that the text "e-cigarette" (in either capitalization) should be handled by the special-case rule, which treats it as a single token.

In [78]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['e-cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Now we capture "E-cigarette" as one token. Note the importance of including both capitalizations.
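
To see why both capitalizations are needed, it helps to think of special cases as exact string matches that are consulted before the general splitting rules. The following toy tokenizer is only an illustration of that idea, not spaCy's implementation: `toy_tokenize` and `SPECIAL_CASES` are hypothetical names, and the regular expression is a deliberate simplification.

```python
import re

# Hypothetical special-case table: strings that must survive as single tokens.
SPECIAL_CASES = {"e-cigarette"}

def toy_tokenize(text, special_cases=SPECIAL_CASES):
    """Split on whitespace, then split each chunk into word and punctuation
    pieces unless the chunk exactly matches a special case."""
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:
            tokens.append(chunk)  # exact match: keep the chunk whole
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

lower = toy_tokenize("e-cigarette use by teens")
upper = toy_tokenize("E-cigarette use by teens")
both = toy_tokenize("E-cigarette use by teens", {"e-cigarette", "E-cigarette"})
print(lower)
print(upper)
print(both)
```

The second call splits the capitalized form because the lookup is case-sensitive, which is why the cell above registers both spellings.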

#### 1.3.3 Tokenizing hashtags

Hashtags are important in tweets, as we might want to track frequency and trends of mentions. However, the default tokenizer does not capture hashtags as such. For example:

In [99]:
nlp = spacy.load('en')
hashtag ="@heal_crypto: #VR uses in therapy - for various additictions such as smoking, alcohol, overeating, etc - #HealCoin https://t.co/T65Fboq7…"
parsed=nlp(hashtag)
print([tok.text for tok in parsed])

['@heal_crypto', ':', '#', 'VR', 'uses', 'in', 'therapy', '-', 'for', 'various', 'additictions', 'such', 'as', 'smoking', ',', 'alcohol', ',', 'overeating', ',', 'etc', '-', '#', 'HealCoin', 'https://t.co/T65Fboq7', '…']


In [102]:
print(parsed[2])
print(parsed[3])

#
VR


Note how "#VR" is split into "#" and "VR". To avoid this, we can add a specialized processing component as a member of a [spaCy pipeline](https://spacy.io/usage/processing-pipelines).

A pipeline is simply a set of units that operate on data in a sequential fashion. Each step in the pipeline processes the input in some way and passes it on to the next pipeline component, possibly adding some additional information. Each unit can operate on the original input, or on the outputs of the earlier units.
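
The general pattern can be sketched in a few lines of plain Python. This is a conceptual illustration only: `lowercase`, `drop_punctuation`, and `run_pipeline` are hypothetical names, and the components here pass a list of strings along, whereas spaCy components receive and return a `Doc` object.

```python
def lowercase(tokens):
    # First unit: normalize case.
    return [tok.lower() for tok in tokens]

def drop_punctuation(tokens):
    # Second unit: works on the output of the first unit.
    return [tok for tok in tokens if tok.isalnum()]

def run_pipeline(data, components):
    """Apply each component in order, feeding each one the previous output."""
    for component in components:
        data = component(data)
    return data

result = run_pipeline(["Smoking", ",", "Kills", "!"], [lowercase, drop_punctuation])
print(result)
```

Because each unit only sees what the previous one produced, the order of the components matters, which is why the position at which a new component is added to a pipeline deserves attention.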

To process hashtags, we will use code suggested by a [spaCy GitHub issue](https://github.com/explosion/spaCy/issues/503). To see how this should work, let's walk through some steps:

First, let's look at the tokens in the tweet parsed above. We can iterate through with enumerate. We can also look at a few interesting elements:

* `nbor` gets the next token after a token.
* `idx` is the character offset of the token within the original text.

In [104]:
print(str(parsed[0].idx)+" "+parsed[0].text)
print(str(parsed[0].nbor().idx)+" "+str(parsed[0].nbor().text))
print(str(parsed[1].nbor().idx)+" "+str(parsed[1].nbor().text))
print(str(parsed[2].nbor().idx)+" "+str(parsed[2].nbor().text))

0 @heal_crypto
12 :
14 #
15 VR


We can use this information to find a hashtag. Essentially, we look for a token that has the text '#'. If we find one, we look at the next token and merge all of the characters from the start of the first token to the end of the second token.

In [109]:
start=parsed[2].idx
length = len(parsed[3].text)
end = start+length+1
parsed.merge(start,end)

#VR
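
The offset arithmetic can be checked on the raw string, independently of spaCy. The snippet below uses a shortened copy of the tweet and assumes, as the cell above does, that the hashtag text follows the `#` with no space in between.

```python
text = "@heal_crypto: #VR uses in therapy"

start = text.index("#")    # character offset of the '#' token
length = len("VR")         # length of the token that follows it
end = start + length + 1   # +1 accounts for the '#' character itself

print(start, end, text[start:end])
```

The slice `text[start:end]` covers exactly the characters of the hashtag, which is the span that the merge call is asked to combine into one token.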

To do this over all of the tokens, we have to repeatedly iterate over them until we can't find any more hashtags to merge. This leads us to the following routine:


In [110]:
nlp = spacy.load('en')
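def hashtag_pipe(doc):
    # Merge each '#' token with the token that follows it, restarting the scan
    # after every merge because merging changes the token indices.
    merged_hashtag = False
    while True:
        for token_index,token in enumerate(doc):
            # Only a '#' that is not the last token has a neighbor to merge with.
            if token.text == '#' and token_index + 1 < len(doc):
                start_index = token.idx
                end_index = start_index + len(token.nbor().text) + 1
                if doc.merge(start_index, end_index) is not None:
                    merged_hashtag = True
                    break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc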

We can then add it with `first=True`, which places it first among the pipeline components: after the tokenizer, but before the part-of-speech tagger and other components.

In [113]:
nlp.add_pipe(hashtag_pipe,first=True)

And then we can try it out...

In [114]:
doc = nlp("twitter #hashtag")
print(doc[0].text)
print(doc[1].text)

twitter
#hashtag


Returning to our original example:

In [115]:
parsed=nlp(hashtag)
print([tok.text for tok in parsed])

['@heal_crypto', ':', '#VR', 'uses', 'in', 'therapy', '-', 'for', 'various', 'additictions', 'such', 'as', 'smoking', ',', 'alcohol', ',', 'overeating', ',', 'etc', '-', '#HealCoin', 'https://t.co/T65Fboq7', '…']


Customization of pipelines such as this is often an important part of NLP work.
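
The merging logic itself does not depend on spaCy. As a minimal sketch, here is the same idea applied to a plain list of token strings in a single pass; `merge_hashtags` is a hypothetical helper, and it assumes that any token following a `#` belongs to the hashtag.

```python
def merge_hashtags(tokens):
    """Join each '#' with the token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "#" and i + 1 < len(tokens):
            merged.append("#" + tokens[i + 1])
            i += 2  # skip the token that was absorbed into the hashtag
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["@heal_crypto", ":", "#", "VR", "uses", "in", "therapy",
          "-", "#", "HealCoin"]
merged = merge_hashtags(tokens)
print(merged)
```

Building a new list avoids the restart-after-merge loop that the in-place version needs, at the cost of losing the character offsets and annotations that a spaCy `Doc` carries.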