# Social Media and Human-Computer Interaction - Part 3



###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - plots and graphs
* [jsonpickle](https://jsonpickle.github.io) for storing tweets. 
---

In [6]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime

# Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the natural language processing analysis of our tweet corpus, including 

  1. Natural Language Processing
  2. Construction of classifiers
  
Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 0. Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. Our Tweets class
2. Our Twitter API keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
3. Configuration of our Twitter connection

In [7]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                        summary[code] =0
                    summary[code]=summary[code]+1
        # sort the (code, count) pairs by code, in descending order
        sortedsummary = sorted(summary.items(),key=lambda item: item[0],reverse=True)
        return sortedsummary

In [8]:
# Replace each placeholder with the corresponding key that you generated in Part 1
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [9]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

## 1. Examination of text patterns

Our ultimate goal is to build a classifier capable of distinguishing tweets related to tobacco smoking from other, unrelated tweets. To do this, we will eventually build natural language processing models. However, we will start by doing some basic processing to explore the types of words and language found in the tweets. 

To do this, we will use the [Spacy](https://spacy.io/) Python NLP package. Spacy provides significant NLP power out of the box, with customization facilities offering greater flexibility at various stages of the pipeline. Details can be found at the  [Spacy web site](https://spacy.io/), and in this [tutorial](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/).

However, before we get into the details, a bit of a roadmap. 

Natural Language Processing involves a series of operations on an input text, each building on the previous step to add insight and understanding.  Thus, many NLP packages run as pipeline processors providing modular components at each stage of the process. Separating key steps into discrete packages provides needed modularity, as developers can modify and customize individual components as needed. Spacy, like other NLP tools including [GATE](https://gate.ac.uk/) and [cTAKES](https://ctakes.apache.org), operates on such a model. Although the specific components of each pipeline vary from system to system (and from task to task), the key tasks are roughly similar:

1. *Tokenizing*: splitting the text into words, punctuation, and other markers.
2. *Part of speech tagging*: Classifying terms as nouns, verbs, adjectives, adverbs, etc.
3. *Dependency Parsing* or *Chunking*: Defining relationships between tokens (subject and object of sentence) and grouping into noun and verb phrases.
4. *Named Entity Recognition*: Mapping words or phrases to standard vocabularies or other common, known values. This step is often key for linking free text to accepted terms for diseases, symptoms, and/or anatomic locations.

Each of these steps might be accomplished through rules, machine learning models, or some combination of approaches. After these initial steps are complete, results might be used to identify relationships between items in the text, build classifiers, or otherwise conduct further analysis. We'll get into these topics later.
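
As a rough illustration of the pipeline idea (not of how Spacy is implemented), the sketch below chains two toy stages in plain Python. The names `tokenize`, `label_tokens`, and `run_pipeline` are made up for this example, and the crude labels stand in for real part-of-speech tags.

```python
import re

def tokenize(text):
    """Stage 1: split text into word and punctuation tokens (toy rule)."""
    return re.findall(r"\w+|[^\w\s]", text)

def label_tokens(tokens):
    """Stage 2: attach a crude label (NUM, PUNCT, or WORD) to each token."""
    labelled = []
    for tok in tokens:
        if tok.isdigit():
            label = "NUM"
        elif not tok.isalnum():
            label = "PUNCT"
        else:
            label = "WORD"
        labelled.append((tok, label))
    return labelled

def run_pipeline(text, steps):
    """Feed the output of each stage into the next one."""
    result = text
    for step in steps:
        result = step(result)
    return result

toy_result = run_pipeline("A smoking start to 2018!", [tokenize, label_tokens])
print(toy_result)
```

Because each stage only depends on the output of the previous one, any stage can be swapped for a better implementation without touching the others, which is the modularity described above.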

The [Spacy documentation](https://spacy.io/usage/spacy-101) and [cTAKES default pipeline description](https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline) provide two examples of how these components might be arranged in practice.  For more information on NLP theory and methods, see [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/), perhaps the leading NLP textbook.

Given this introduction, we can read in our tweets and get to work.

### 1.1 Reading in data

At the end of [Part 2](SocialMedia - Part 2.ipynb) you had saved two sets of tweets, one for smoking and one for vaping. Let's read in the vaping tweets.

In [10]:
vaping=Tweets()
vaping.readTweets("tweets-vaping.json")

In [11]:
vaping.countTweets()

100

and the smoking tweets...

In [12]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.countTweets()

100

### 1.2 Initial parsing

To start with, we will grab a tweet and process it. This will give us a beginning feel for what [Spacy](https://spacy.io) can do and how we might use it.

In [21]:
smokingIds=list(smoking.getIds())
tweet_id=smokingIds[11]
sample = smoking.getText(tweet_id)
sample

'#Zenith #Pilot Type 20 Extra Special #Cohiba-#Maduro 5 Edition watch - A smoking start to 2018 for Zenith/Cohiba partnership with Pilot Type 20 special editions \n@ZenithWatches #Type20\nhttps://t.co/n1CnmMTI2Y https://t.co/H82cdAb7MY'

Tweets have usage patterns that are not standard English - URLs, hashtags, user references (this particular tweet was not selected accidentally). These patterns create challenges for extracting content - we might want to know that "#Type20" is, in a tweet, a hashtag that should be considered as a complete unit.  
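
Before involving any NLP package, a few regular expressions can give a quick inventory of these Twitter-specific elements. The patterns below are simple approximations chosen for this sketch, not Twitter's official rules for hashtags, handles, or links.

```python
import re

# The sample tweet shown above, copied as a plain string
sample_text = ('#Zenith #Pilot Type 20 Extra Special #Cohiba-#Maduro 5 Edition watch - '
               'A smoking start to 2018 for Zenith/Cohiba partnership with Pilot Type 20 '
               'special editions \n@ZenithWatches #Type20\n'
               'https://t.co/n1CnmMTI2Y https://t.co/H82cdAb7MY')

hashtags = re.findall(r"#\w+", sample_text)       # '#' followed by word characters
mentions = re.findall(r"@\w+", sample_text)       # '@' followed by word characters
urls = re.findall(r"https?://\S+", sample_text)   # anything up to the next whitespace

print(hashtags)
print(mentions)
print(urls)
```

A pattern-based pass like this only lists the special items; it does not tell an NLP tokenizer to keep "#Type20" together as one token.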

We'll see soon how we might do this, but first, to start the NLP process, we can import the Spacy components and create an NLP object:

In [22]:
import spacy
nlp = spacy.load('en')

We can then parse the text of our sample tweet.

In [23]:
parsed = nlp(sample)

The result is a list of tokens. We can print out each token to start:

In [24]:
print([token.text for token in parsed])

['#', 'Zenith', '#', 'Pilot', 'Type', '20', 'Extra', 'Special', '#', 'Cohiba-#Maduro', '5', 'Edition', 'watch', '-', 'A', 'smoking', 'start', 'to', '2018', 'for', 'Zenith', '/', 'Cohiba', 'partnership', 'with', 'Pilot', 'Type', '20', 'special', 'editions', '\n', '@ZenithWatches', '#', 'Type20', '\n', 'https://t.co/n1CnmMTI2Y', 'https://t.co/H82cdAb7MY']


We can see right away that this parsing isn't quite what we would like. Default English parsing treats the `#Zenith` as two separate tokens - `#` and `Zenith`. To treat this as a hashtag, we will indeed need to revise the parser.


Before we do that, we can also look at some of the other attributes that we might learn from the parser. 

Chief among these is the *lemma_*: the "standard" or "base" form, reducing verb forms to their base verb, plurals to appropriate singular nouns, etc.  For example, the token at index 29 is `editions`, which has `edition` as the lemmatized version.

In [29]:
print(parsed[29].text)
print(parsed[29].lemma_)

editions
edition


*pos_* and *tag_* provide basic and detailed information on the part of speech. Looking at the token at index 15, "smoking", we see

In [30]:
print(parsed[15].text)
print(parsed[15].pos_)
print(parsed[15].tag_)

smoking
NOUN
NN


If you want to learn more about a part of speech tag, you can use `spacy.explain`

In [31]:
print(spacy.explain(parsed[15].pos_))
print(spacy.explain(parsed[15].tag_))

noun
noun, singular or mass


* pos_ is a simple indicator of the part of speech - noun, verb, etc. 
* tag_ is a more detailed indicator of the part of speech. 
* is_stop is True if the token is a "stop" word - a commonly found word that might add little or no information.
* is_alpha is True if the token consists entirely of alphabetic characters.
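
Spacy's documentation describes `is_alpha` as equivalent to calling Python's `str.isalpha()` on the token text, so the flag can be previewed without loading a model. The strings below are examples picked for this sketch.

```python
# Example token texts; digits, '@', and punctuation all make isalpha() False
examples = ["smoking", "@ZenithWatches", "Type20", "2018", ":"]

alpha_flags = {text: text.isalpha() for text in examples}
for text, flag in alpha_flags.items():
    print("{:20} {}".format(text, flag))
```

Only "smoking" passes, which is why an `is_alpha` test is a convenient first filter for handles, numbers, and punctuation.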

Let's look at a few more tokens in action. The output below comes from a different tweet, in which tokens 1, 2, 6, 9, and 17 are "@hunchoSr", ":", "never", "gun", and "smoking".

In [18]:
t1 = parsed[1]
t2 = parsed[2]
t6 = parsed[6]
t9 = parsed[9]
t17 = parsed[17]
# print text, lemma, coarse POS, detailed tag, stop-word flag, and alphabetic flag for each token
for tok in (t1, t2, t6, t9, t17):
    print(tok.text, tok.lemma_, tok.pos_, tok.tag_, tok.is_stop, tok.is_alpha)

@hunchoSr @hunchosr PROPN NNP False False
: : PUNCT : False False
never never ADV RB True True
gun gun NOUN NN False True
smoking smoke VERB VBG False True


A few observations:
* "@hunchoSr" is not alphabetic.
* ":" is neither alphabetic nor a stop word. 
* "never" is an alphabetic stop word. 
* "gun" is alphabetic, but not a stop word.  

Note the part-of-speech information. "@hunchoSr" has pos "PROPN" and tag "NNP", while ":" has pos "PUNCT". `spacy.explain` clarifies these labels, along with others such as "WRB" that appear in other tweets:

In [19]:
print(spacy.explain("PUNCT"))
print(spacy.explain("NNP"))
print(spacy.explain("WRB"))

punctuation
noun, proper singular
wh-adverb


The coarse `pos_` and the detailed `tag_` usually agree on the broad category, with the tag adding detail: "smoking" is a VERB whose tag, VBG, marks it as a gerund or present participle. The machine-learning model used to determine part of speech will still make mistakes, particularly on the informal language found in tweets.

Fortunately, this is not necessarily a problem, as we can combine `pos_`, `tag_`, `is_stop`, and `is_alpha` when deciding how to treat a token. 

Some NLP systems will go a bit further than Spacy's lemmatization, using a process called "stemming" to reduce words to base forms. With a stemming algorithm, "scared" might be reduced to "scare" - see this description of [Porter's stemming algorithm](https://tartarus.org/martin/PorterStemmer/) for more detail. 
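
To make the contrast with lemmatization concrete, here is a deliberately crude suffix-stripping sketch. It is not Porter's algorithm, which applies ordered rule sets with conditions on the remaining stem; `toy_stem` and its suffix list are invented purely for illustration.

```python
def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["smoking", "stopped", "editions", "gun"]:
    print("{:10} {}".format(word, toy_stem(word)))
```

Stems such as "smok" and "stopp" need not be dictionary words, whereas the lemmas reported by Spacy for the same words in this notebook ("smoke", "stop") are.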

Let's turn the code that we used above into a routine, along with a routine to print out token details and try another tweet or two.

In [20]:
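# The routines below work on the raw dictionary of tweets, keyed by tweet id
# (assumed here to be the smoking tweets read in above)
tweets = smoking.tweets

def getTweetText(tweets):
    # pick a random tweet and return its text
    tweet_id=random.choice(list(tweets.keys()))
    tweet=tweets[tweet_id]['tweet']
    # extended tweets store their text under 'full_text' rather than 'text'
    return tweet['text'] if 'text' in tweet else tweet['full_text']

def printTokDetails(parsed):
    # print a header, then one row of attributes per token
    print("{:30} {:30} {:7}{:7}{:7}{:7}".format("Token text","Lemma","POS","Tag","Stop?","Alpha?"))
    for tok in parsed:
        print("{:30} {:30} {:7}{:7}{:7}{:7}".format(str(tok.text),str(tok.lemma_),str(tok.pos_),str(tok.tag_),str(tok.is_stop),str(tok.is_alpha)))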

In [21]:
sample2=getTweetText(tweets)

In [22]:
sample2

'RT @OnlyWayIsShawtz: My boy stopped smoking weed the day he spent 30 minutes looking for his phone under the bed.. While using his phone fl…'

In [23]:
parsed2=nlp(sample2)

In [24]:
printTokDetails(parsed2)

Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
RT                             rt                             PROPN  NNP    False  True   
@OnlyWayIsShawtz               @onlywayisshawtz               NOUN   NN     False  False  
:                              :                              PUNCT  :      False  False  
My                             -PRON-                         ADJ    PRP$   False  True   
boy                            boy                            NOUN   NN     False  True   
stopped                        stop                           VERB   VBD    False  True   
smoking                        smoke                          VERB   VBG    False  True   
weed                           weed                           NOUN   NN     False  True   
the                            the                            DET    DT     True   True   
day                            day                            NOUN   NN     False  True   

You might see some interesting patterns arising here.  For example:

* We see many different parts of speech. Initially, we might want to focus on the nouns alone, as they provide much of the content.  

* Look for words like "is" or "was" - these might all refer to a common lemma term - "be", corresponding to the generic form of the verb. Do you see any other instances of lemma forms that differ from the parsed text?

* URLs and icons might be present in tweets. Are they classified as alphabetic? Should we include them as part of the "useful" text from a tweet? 

* How should we handle the "RT" code for retweets, user handles, and other Twitter idiosyncrasies? 
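
One way to start on these questions is to tabulate the token details rather than read them row by row. The sketch below copies the ten rows printed above into plain tuples (the `Row` type is a stand-in defined here, not a Spacy class) and then pulls out the nouns, the non-alphabetic tokens, and the tokens whose lemma differs from the lowercased text.

```python
from collections import namedtuple

# Stand-in for a parsed token, holding the columns printed by printTokDetails
Row = namedtuple("Row", "text lemma pos tag is_stop is_alpha")

rows = [
    Row("RT", "rt", "PROPN", "NNP", False, True),
    Row("@OnlyWayIsShawtz", "@onlywayisshawtz", "NOUN", "NN", False, False),
    Row(":", ":", "PUNCT", ":", False, False),
    Row("My", "-PRON-", "ADJ", "PRP$", False, True),
    Row("boy", "boy", "NOUN", "NN", False, True),
    Row("stopped", "stop", "VERB", "VBD", False, True),
    Row("smoking", "smoke", "VERB", "VBG", False, True),
    Row("weed", "weed", "NOUN", "NN", False, True),
    Row("the", "the", "DET", "DT", True, True),
    Row("day", "day", "NOUN", "NN", False, True),
]

nouns = [r.text for r in rows if r.pos == "NOUN"]
non_alpha = [r.text for r in rows if not r.is_alpha]
changed = [(r.text, r.lemma) for r in rows if r.lemma != r.text.lower()]

print("Nouns:", nouns)
print("Not alphabetic:", non_alpha)
print("Lemma differs:", changed)
```

With real parsed tweets, the same three list comprehensions work directly on Spacy tokens by using `tok.pos_` and `tok.lemma_` in place of the `Row` fields.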

---
## EXERCISE 3.1: Filtering tokens

Although NLP parsing is often a good start, further filtering is often necessary to focus on data relevant for specific tasks. In this problem, we will review some additional tweets and develop a post-processing routine capable of filtering tweets as necessary for our needs. 

3.1.1 Using the `getTweetText` and `printTokDetails` routines above, along with the spacy `nlp` parser, examine several tweets to decide which tokens should be included or not.  List criteria for keeping/removing tokens. Remember to use `spacy.explain()` for any unfamiliar POS or tag entries. Note that your criteria will not be perfect, and will likely need refining. Examine enough tweets to feel confident in your criteria.

3.1.2 Write a routine  `includeToken` that will return True if a token matches the criteria that you identified in 3.1.1, and False otherwise. Assume for now that we are only interested in nouns and verbs, as they might be a good starting point to find information about vaping or smoking.

3.1.3 Write a routine `filterTweetTokens` that will filter the parsed tokens from a single tweet, returning a list of the tokens to be included, based on your criteria.

3.1.4 Run `filterTweetTokens` on a few tweets. Identify any inaccuracies and explain them. When possible, identify an approach for improving performance, and implement it in a revised version of `filterTweetTokens`.

3.1.5 Write a routine `parseTweets` that will iterate over the tweets in the collection. For each tweet, it will call `filterTweetTokens`, storing the resulting tokens in a list indexed by 'filtered' in the main tweet object.

---
*ANSWER FOLLOWS Cut below here*

### 3.1.1 Sample tweets

In [25]:
sample=getTweetText(tweets)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

RT @Pitsogp: Emtee must be smoking nyaope now how can u show the world yo uncircumcised dick @danielmarven  @tumisole  now the song "Roll u…
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
RT                             rt                             PROPN  NNP    False  True   
@Pitsogp                       @pitsogp                       NOUN   NN     False  False  
:                              :                              PUNCT  :      False  False  
Emtee                          emtee                          NOUN   NN     False  True   
must                           must                           VERB   MD     True   True   
be                             be                             VERB   VB     True   True   
smoking                        smoke                          VERB   VBG    False  True   
nyaope                         nyaope                         NOUN   NN     False  True   
now                            now      

In [26]:
sample=getTweetText(tweets)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

@horsekween This is like when we gave that lighter to the guy smoking spice 😂
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
@horsekween                    @horsekween                    VERB   VBN    False  False  
This                           this                           DET    DT     False  True   
is                             be                             VERB   VBZ    True   True   
like                           like                           ADP    IN     False  True   
when                           when                           ADV    WRB    True   True   
we                             -PRON-                         PRON   PRP    True   True   
gave                           give                           VERB   VBD    False  True   
that                           that                           DET    DT     True   True   
lighter                        light                          ADJ    JJR    False  True   
to          

In [27]:
sample=getTweetText(tweets)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

RT @ForeverGAW: 🔝BIG OP 150 FOLLOWER GAW🔝

➡1100 STEAM KEYS⬅
➡1 HELLCLIENT⬅
➡1 FA⬅
➡1 FA GEN⬅
➡THE FOREST⬅
➡1 ORGIN⬅
➡200 MC ALTS⬅
➡150 SPO…
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
RT                             rt                             PROPN  NNP    False  True   
@ForeverGAW                    @forevergaw                    PROPN  NNP    False  False  
:                              :                              PUNCT  :      False  False  
🔝                              🔝                              VERB   VB     False  False  
BIG                            big                            NOUN   NN     False  True   
OP                             op                             NOUN   NN     False  True   
150                            150                            NUM    CD     False  False  
FOLLOWER                       follower                       PROPN  NNP    False  True   
GAW                            gaw      

In [28]:
sample=getTweetText(tweets)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

RT @johncardillo: .@seanmdav hit a grand slam here. Great work. Smoking gun.  https://t.co/vDwjoi64Gk
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
RT                             rt                             PROPN  NNP    False  True   
@johncardillo                  @johncardillo                  PROPN  NNP    False  False  
:                              :                              PUNCT  :      False  False  
.@seanmdav                     .@seanmdav                     PUNCT  .      False  False  
hit                            hit                            VERB   VBD    False  True   
a                              a                              DET    DT     True   True   
grand                          grand                          ADJ    JJ     False  True   
slam                           slam                           NOUN   NN     False  True   
here                           here                           ADV    RB     Tru

Criteria: 
    
* Alpha is true, and 
* Stop is false, and 
* Text is not "RT", and
* Tag is NN, Tag is NNP, or POS is VERB

### 3.1.2  `includeToken`

Our routine will accept a token only if it meets the criteria given above. 

In [29]:
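def includeToken(tok):
    # reject anything that is not alphabetic, or that is a stop word
    if not tok.is_alpha or tok.is_stop:
        return False
    # reject the retweet marker
    if tok.text == 'RT':
        return False
    # keep singular nouns, proper nouns, and verbs
    return tok.tag_ in ('NN', 'NNP') or tok.pos_ == 'VERB'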

In [30]:
sample=getTweetText(tweets)
parsed=nlp(sample)
print(sample)

Whad you do after smoking some new staff😂😂😂 https://t.co/9HIWJJMoUT


In [31]:
print(parsed[0])
includeToken(parsed[0])

Whad


True

In [32]:
print(parsed[1])
includeToken(parsed[1])

you


False

In [33]:
print(parsed[2])
includeToken(parsed[2])

do


False

In [34]:
for tok in parsed:
    print(tok,includeToken(tok))

Whad True
you False
do False
after False
smoking True
some False
new False
staff True
😂 False
😂 False
😂 False
https://t.co/9HIWJJMoUT False


Looks ok. 

### 3.1.3 Write a routine `filterTweetTokens` that will filter a single tweet

In [35]:
def filterTweetTokens(tokens):
    # keep only the tokens that satisfy the includeToken criteria
    return [tok for tok in tokens if includeToken(tok)]

In [36]:
f= filterTweetTokens(parsed)
print(sample)
for tok in f:
    print(tok)

Whad you do after smoking some new staff😂😂😂 https://t.co/9HIWJJMoUT
Whad
smoking
staff


### 3.1.4 Run `filterTweetTokens` on a few tweets

In [37]:
sample=getTweetText(tweets)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

RT @kingamajorr: Girls have bible verses in their bios on their main instas, but videos of them drinking and smoking on their finstas... th…
drinking
smoking


In [38]:
sample=getTweetText(tweets)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

RT @chocoo_loco: I just want my friends to stop smoking weed😂 https://t.co/LWI2HVofAf
want
stop
smoking
weed


In [39]:
sample=getTweetText(tweets)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

RT @HealthyAirUK: Is being exposed to air pollution like passive smoking? #HelpBritainBreathe https://t.co/WTYfADhJ68
Is
exposed
air
pollution
smoking


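Here, inclusion of "Is" looks questionable: lowercase "is" was flagged as a stop word in earlier output, but the capitalized form is not flagged here. Capitalized "This" and "My" were likewise not marked as stop words above, which suggests that the stop-word check is case-sensitive.

Otherwise, the result fits the criteria. We might ask ourselves if we want verbs, but we do want to include "smoke", "vape", etc.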
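
The capitalized "Is" in the output above was not flagged as a stop word, although lowercase "is" was flagged in earlier output. One possible refinement is to repeat the stop-word test on the lowercased text. The sketch below shows the idea with stand-in tokens; `STOP_WORDS` is a tiny hypothetical list, and `is_stop_any_case` is a helper name invented here, not part of Spacy.

```python
from types import SimpleNamespace

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"is", "be", "the", "to", "do", "a"}

def is_stop_any_case(tok, stop_words=STOP_WORDS):
    """True if the token is flagged as a stop word, or if its lowercased text is in the list."""
    return tok.is_stop or tok.text.lower() in stop_words

# Stand-ins for parsed tokens, carrying only the attributes used here
is_tok = SimpleNamespace(text="Is", is_stop=False)
air_tok = SimpleNamespace(text="air", is_stop=False)

print(is_stop_any_case(is_tok))
print(is_stop_any_case(air_tok))
```

In a revised `includeToken`, a check along these lines could replace the plain `tok.is_stop` test.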

In [40]:
sample=getTweetText(tweets)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)


RT @FootyMemes: This new anti-smoking ad is really powerful... https://t.co/pWHZDLIb7O
smoking
ad


In [41]:
sample=getTweetText(tweets)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

RT @artofdai: My version of Van Gogh's "head of a skeleton smoking a cigarette" https://t.co/jQvaQQSg3P
version
Van
Gogh
head
skeleton
smoking
cigarette


In [42]:
sample=getTweetText(tweets)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

RT @foxnewspolitics: 'Smoking gun'email shows Obama DOJ blocked conservative groups from settlement funds,GOP lawmaker says- @AlexPappas
ht…
Smoking
shows
Obama
DOJ
blocked
settlement
GOP
lawmaker


### 3.1.5  `parseTweets` routine

In [43]:
def parseTweets(tweets):
    # parse and filter every tweet, storing the kept tokens under 'filtered'
    for tweet_id in tweets:
        tweet= tweets[tweet_id]
        text = tweet['tweet']['text']
        parsed=nlp(text)
        filtered = filterTweetTokens(parsed)
        tweet['filtered'] = filtered

In [44]:
parseTweets(tweets)

In [45]:
tweet_id=random.choice(list(tweets.keys()))
tweet=tweets[tweet_id]

In [46]:
tweet['tweet']['text']

'6✨#ℹnfographics for #pot lovers\nhttps://t.co/dgPS5FFIwR\n#weed #marijuana #cannabis #smoking #vaping #herbs #herbal… https://t.co/sjh7HqDMVe'

In [47]:
tweet['filtered']

[pot, weed, marijuana, cannabis, smoking, vaping, herbs]

In [48]:
tweet_id=random.choice(list(tweets.keys()))
tweet=tweets[tweet_id]
print(tweet['tweet']['text'])
tweet['filtered']

RT @onmyworst: Not to sound like tana mongeau but I love smoking weed


[sound, tana, mongeau, love, smoking, weed]

In [49]:
tweet_id=random.choice(list(tweets.keys()))
tweet=tweets[tweet_id]
print(tweet['tweet']['text'])
tweet['filtered']

@komodoteen4u2nv @sleepbutt ooh get a pax! they look like a juul vape so v easy to smoke anywhere


[pax, look, juul, vape, smoke]

In [50]:
tweet_id=random.choice(list(tweets.keys()))
tweet=tweets[tweet_id]
print(tweet['tweet']['text'])
tweet['filtered']

RT @VersaceVersacei: I'm in the shower and my girl is doing her makeup in here and we're smoking a wood together 💕 I love her so much 🖤 htt…


[shower, girl, makeup, smoking, wood, love, htt]

In [51]:
tweet_id=random.choice(list(tweets.keys()))
tweet=tweets[tweet_id]
print(tweet['tweet']['text'])
tweet['filtered']

@JedSanborn I suspect being able to regularly see a doctor might also lower those rates of smoking and obesity. #publichealth


[suspect, doctor, lower, smoking, obesity, publichealth]
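
Once each tweet carries a `filtered` list, a simple next step is to count how often each token occurs across tweets. The sketch below does this with `collections.Counter`, using the five filtered lists shown above copied in as plain strings. In the notebook these would be Spacy tokens, so `tok.text` or `tok.lemma_` would be counted instead.

```python
from collections import Counter

# Filtered token lists from the sample outputs above, as plain strings
filtered_lists = [
    ["pot", "weed", "marijuana", "cannabis", "smoking", "vaping", "herbs"],
    ["sound", "tana", "mongeau", "love", "smoking", "weed"],
    ["pax", "look", "juul", "vape", "smoke"],
    ["shower", "girl", "makeup", "smoking", "wood", "love", "htt"],
    ["suspect", "doctor", "lower", "smoking", "obesity", "publichealth"],
]

counts = Counter()
for tokens in filtered_lists:
    counts.update(tokens)   # add one count per token occurrence

for token, n in counts.most_common(3):
    print(token, n)
```

Counting lemmas rather than surface text would fold "smoking" and "smoke" together, which may or may not be desirable depending on the analysis.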

*END OF ANSWER cut above here*

---

### 1.3 Revised tokenizing

Your review of some tweets might lead you to identify text patterns that do not fit the initial tokenizing or part-of-speech tagging. Fortunately, the spacy tools provide a means for extending the tokenizer for special cases. Here, we review an example of how these tools might be used.

Specifically, review of some tweets led to the following concerns: 
1. The word "E-cigarette" is split by the tokenizer into three separate tokens.
2. Hashtags are split into the pound symbol (`#`) and the following text.


#### 1.3.2 Tokenizing "E-cigarette"

Consider the following tweet:

In [52]:
smoketweet='E-cigarette use by teens linked to later tobacco smoking, study says https://t.co/AhTpFUw0TW'

In [53]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['E', '-', 'cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Note that "E-cigarette" becomes three tokens. This is not what we want - we want it to be held together as one. 
To do this, we can add a [special-case tokenizer rule](https://spacy.io/usage/linguistic-features#section-tokenization) as follows:

In [54]:
from spacy.symbols import ORTH, LEMMA, POS
special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
nlp.tokenizer.add_special_case(u'E-cigarette', special_case)

This code specifies that the text "e-cigarette" should be handled by the special-case rule, which treats it as a single token.

In [55]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['e-cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Now we capture "E-cigarette" as one token. Note the importance of including both capitalizations.
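
The idea behind a special-case rule can be sketched without Spacy: split on whitespace, keep any chunk that appears in a table of exceptions, and apply the normal splitting rules to everything else. The `toy_tokenize` function and its regular expression are simplifications written for this example, not Spacy's actual tokenizer.

```python
import re

# Exceptions that should survive as single tokens, in both capitalizations
SPECIAL_CASES = {"e-cigarette", "E-cigarette"}

def toy_tokenize(text, special_cases=SPECIAL_CASES):
    """Whitespace-split, then split each chunk further unless it is a special case."""
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

text = "E-cigarette use by teens linked to later tobacco smoking, study says"
print(toy_tokenize(text, special_cases=set()))   # no exceptions registered
print(toy_tokenize(text))                        # with the exceptions
```

Because the lookup is an exact string match, each capitalization has to be listed separately, just as both forms were registered with the tokenizer above.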

#### 1.3.3 Tokenizing hashtags

Hashtags are important in tweets, as we might want to track frequency and trends of mentions. However, the default tokenizer does not capture hashtags as such. For example:

In [56]:
hashtag ="RT @heal_crypto: #VR uses in therapy - for various additictions such as smoking, alcohol, overeating, etc - #HealCoin https://t.co/T65Fboq7…"
parsed=nlp(hashtag)
print([tok.text for tok in parsed])

['RT', '@heal_crypto', ':', '#', 'VR', 'uses', 'in', 'therapy', '-', 'for', 'various', 'additictions', 'such', 'as', 'smoking', ',', 'alcohol', ',', 'overeating', ',', 'etc', '-', '#', 'HealCoin', 'https://t.co/T65Fboq7', '…']


In [64]:
parsed[22]

#

Note how "#VR" is split into "#" and "VR". To avoid this, we can add a specialized [spacy pipeline](https://github.com/explosion/spaCy/issues/503) component.

In [65]:
def hashtag_pipe(doc):
    # merge each '#' token with the token that follows it; the scan restarts
    # after every merge because merging changes the token indices
    merged_hashtag = False
    while True:
        for token in doc:
            # token.nbor() raises an IndexError on the last token, so check bounds first
            if token.text == '#' and token.i + 1 < len(doc):
                start_index = token.idx
                end_index = start_index + len(token.nbor().text) + 1
                if doc.merge(start_index, end_index) is not None:
                    merged_hashtag = True
                    break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc
nlp.add_pipe(hashtag_pipe)

doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'

This routine looks for '#' tokens and merges each one with its "nbor" - the next token. The routine is then added to the [spacy pipeline](https://spacy.io/usage/processing-pipelines).

Returning to our original example:

In [66]:
parsed=nlp(hashtag)
print([tok.text for tok in parsed])

['RT', '@heal_crypto', ':', '#VR', 'uses', 'in', 'therapy', '-', 'for', 'various', 'additictions', 'such', 'as', 'smoking', ',', 'alcohol', ',', 'overeating', ',', 'etc', '-', '#HealCoin', 'https://t.co/T65Fboq7', '…']


Customization of pipelines such as this is often an important part of NLP work.
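
The merging logic can also be examined in isolation on a plain list of strings. The `merge_hashtags` helper below is a simplified stand-in written for this sketch: unlike `hashtag_pipe`, it works on strings rather than a Spacy document, and it joins "#" to the next token even if whitespace separated them in the original text.

```python
def merge_hashtags(tokens):
    """Join each '#' with the token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        # only merge when a following token exists
        if tokens[i] == "#" and i + 1 < len(tokens):
            merged.append("#" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Opening and closing tokens of the example above, as produced by the default tokenizer
before = ["RT", "@heal_crypto", ":", "#", "VR", "uses", "in", "therapy", "-", "#", "HealCoin"]
after = merge_hashtags(before)
print(after)
```

Testing the logic on simple lists like this makes edge cases, such as a trailing "#" with nothing after it, easy to check before wiring the routine into a pipeline.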