# Social Media and Human-Computer Interaction - Part 3



###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - plots and graphs
* [jsonpickle](https://jsonpickle.github.io) for storing tweets. 
---

In [1]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime
import operator  # needed by Tweets.getCodeProfile (operator.itemgetter)

# Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the natural language processing analysis of our tweet corpus, including 

  1. Natural language processing
  2. Construction of classifiers
  
Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 0. Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. Our Tweets class
2. Our Twitter API keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
3. Configuration of our Twitter connection

In [2]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary

*Replace the placeholder values below with the keys that you generated in Part 1. Do not share real credentials.*

In [3]:
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [4]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

## 1. Examination of text patterns

Our ultimate goal is to build a classifier capable of distinguishing tweets related to tobacco smoking from other, unrelated tweets. To do this, we will eventually build natural language processing models. However, we will start by doing some basic processing to explore the types of words and language found in the tweets. 

To do this, we will use the [Spacy](https://spacy.io/) Python NLP package. Spacy provides significant NLP power out of the box, with customization facilities offering greater flexibility at various stages of the pipeline. Details can be found at the [Spacy web site](https://spacy.io/), and in this [tutorial](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/).

However, before we get into the details, a bit of a roadmap. 

Natural Language Processing involves a series of operations on an input text, each building on the previous step to add insight and understanding.  Thus, many NLP packages run as pipeline processors providing modular components at each stage of the process. Separating key steps into discrete packages provides needed modularity, as developers can modify and customize individual components as needed. Spacy, like other NLP tools including [GATE](https://gate.ac.uk/) and [cTAKES](https://ctakes.apache.org), operates on such a model. Although the specific components of each pipeline vary from system to system (and from task to task), the key tasks are roughly similar:

1. *Tokenizing*: splitting the text into words, punctuation, and other markers.
2. *Part of speech tagging*: Classifying terms as nouns, verbs, adjectives, adverbs, etc.
3. *Dependency Parsing* or *Chunking*: Defining relationships between tokens (subject and object of sentence) and grouping into noun and verb phrases.
4. *Named Entity Recognition*: Mapping words or phrases to standard vocabularies or other common, known values. This step is often key for linking free text to accepted terms for diseases, symptoms, and/or anatomic locations.

Each of these steps might be accomplished through rules, machine learning models, or some combination of approaches. After these initial steps are complete, results might be used to identify relationships between items in the text, build classifiers, or otherwise conduct further analysis. We'll get into these topics later.
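
To make the pipeline idea concrete, here is a minimal sketch in plain Python. It is not how Spacy, GATE, or cTAKES are implemented: the stage functions, the `TOY_LEXICON` lookup table, and the `run_pipeline` helper are all hypothetical names invented for illustration, and the "tagger" is a hand-written dictionary rather than a trained model. The point is only that each stage consumes the output of the previous one.

```python
import re

# Hypothetical hand-written lexicon standing in for a trained tagger.
TOY_LEXICON = {
    "smoking": "NOUN",
    "affects": "VERB",
    "multiple": "ADJ",
    "parts": "NOUN",
    "of": "ADP",
    "our": "PRON",
    "body": "NOUN",
}

def tokenize(text):
    """Stage 1: split text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Stage 2: attach a coarse part-of-speech label to each token."""
    tagged = []
    for tok in tokens:
        if tok.lower() in TOY_LEXICON:
            pos = TOY_LEXICON[tok.lower()]
        elif not tok.isalnum():
            pos = "PUNCT"
        else:
            pos = "X"  # unknown word
        tagged.append((tok, pos))
    return tagged

def chunk_noun_phrases(tagged):
    """Stage 3: group consecutive adjectives/nouns into simple noun phrases."""
    phrases, current = [], []
    for tok, pos in tagged:
        if pos in ("ADJ", "NOUN"):
            current.append(tok)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

def run_pipeline(text, stages):
    """Feed the output of each stage into the next one."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

phrases = run_pipeline("Smoking affects multiple parts of our body.",
                       [tokenize, tag, chunk_noun_phrases])
print(phrases)
```

Because each stage is a separate function, any one of them could be swapped for a better implementation without touching the others, which is the modularity argument made above.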

The [Spacy documentation](https://spacy.io/usage/spacy-101) and [cTAKES default pipeline description](https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline) provide two examples of how these components might be arranged in practice.  For more information on NLP theory and methods, see [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/), perhaps the leading NLP textbook.

Given this introduction, we can read in our tweets and get to work.

### 1.1 Reading in data

At the end of [Part 2](SocialMedia - Part 2.ipynb) you saved two sets of tweets, one for smoking and one for vaping. Let's read in the vaping tweets.

In [5]:
vaping=Tweets()
vaping.readTweets("tweets-vaping.json")

In [6]:
vaping.countTweets()

100

and the smoking tweets...

In [7]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.countTweets()

100

### 1.2 Initial parsing

To start with, we will grab a specific pre-chosen tweet and process it.  

This will give us a beginning feel for what [Spacy](https://spacy.io) can do and how we might use it.

In [23]:
tweet_id='974316984740429824'
sample=smoking.getText(tweet_id)
sample

'#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf \n#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr'

Tweets have usage patterns that are not standard English - URLs, hashtags, user references (this particular tweet was not selected by accident). These patterns create challenges for extracting content - we might want to know that "#QuitSmoking" is, in a tweet, a hashtag that should be considered as a complete unit.  

We'll see soon how we might do this, but first, to start the NLP process, we can import the Spacy components and create an NLP object:

In [None]:
import spacy
# 'en' is the spaCy 2.x shortcut; spaCy 3+ requires a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

We can then parse the text of our sample tweet.

In [None]:
parsed = nlp(sample)

The result is a list of tokens. We can print out each token to start:

In [None]:
print([token.text for token in parsed])

We can see right away that this parsing isn't quite what we would like. Default English parsing treats  `#QuitSmoking`  as two separate tokens - `#` and `QuitSmoking`. To treat this as a hashtag, we will indeed need to revise the parser.
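
As a quick illustration that does not depend on Spacy at all, the sketch below rejoins each `#` with the token that follows it in a post-processing pass. The `merge_hashtags` helper is a hypothetical name, and the token list is the default tokenization of the sample tweet. This is only one possible workaround, not the tokenizer customization that Spacy itself offers (see section 1.3).

```python
def merge_hashtags(tokens):
    """Rejoin a lone '#' with the token that follows it, e.g. '#', 'QuitSmoking' -> '#QuitSmoking'."""
    merged = []
    i = 0
    while i < len(tokens):
        next_is_word = i + 1 < len(tokens) and tokens[i + 1][:1].isalnum()
        if tokens[i] == "#" and next_is_word:
            merged.append("#" + tokens[i + 1])
            i += 2  # skip the token we just absorbed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Default tokenization of the sample tweet.
tokens = ['#', 'Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.',
          'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n',
          '#', 'SwasthaBharat', '#', 'NHPIndia', '#', 'mCessation',
          '#', 'QuitSmoking', 'https://t.co/x7xHO9G2Cr']

merged = merge_hashtags(tokens)
print(merged)
```

A pass like this works on plain strings, so it loses the part-of-speech and lemma information attached to the original tokens. That is one reason to prefer fixing the tokenizer itself.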

Before we do that, we can also look at some of the other attributes that we might learn from the parser. 

Chief among these is the *lemma_*: the "standard" or "base" form, reducing verb forms to their base verb, plurals to appropriate singular nouns, etc.  For example, the token at index 2 is `affects`, which has `affect` as its lemmatized version.

In [29]:
print(parsed[2].text)
print(parsed[2].lemma_)

affects
affect


Note that Spacy stores many fields as both hashes for efficiency and as text  for readability. You'll want to use the text form for interpreting results, but the hash for computing. They differ only in the use of the trailing underscore - thus `lemma` is the hash while `lemma_` is the human readable form.

In [36]:
print(parsed[2].lemma)
print(parsed[2].lemma_)

17543419487618836897
affect
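
To see why storing a hash alongside the text can be useful, here is a conceptual sketch of a string store in plain Python. It is an assumption-laden toy, not Spacy's implementation: the `ToyStringStore` class is a hypothetical name, and the hash function (`blake2b` truncated to 64 bits) is chosen only for convenience, so the integers it produces will not match the values that Spacy prints.

```python
import hashlib

class ToyStringStore:
    """Map each string to a 64-bit integer ID and back (illustration only)."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        # Assumption: any stable 64-bit hash would do for this illustration.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        return self._by_hash[key]

store = ToyStringStore()
lemma_id = store.add("affect")   # plays the role of `lemma`
lemma_text = store[lemma_id]     # plays the role of `lemma_`
print(lemma_id, lemma_text)
```

Comparing or counting integers is cheap, and each distinct string needs to be stored only once, while the readable text remains one lookup away.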


*pos_* and *tag_* provide basic and detailed information on the part of speech. Looking again at the token at index 2, `affects`, we see

In [34]:
print(parsed[2].text)
print(parsed[2].tag_)
print(parsed[2].pos_)

affects
VBZ
VERB


'tag_' and 'pos_' are slightly different versions of the part of speech identified for each token. As described in the [Spacy documentation for part-of-speech tags](https://spacy.io/api/annotation#pos-tagging), the tags associated with these two fields come from different sources. 'tag_' uses parts-of-speech from a version of the [Penn Treebank](https://www.seas.upenn.edu/~pdtb/), a well-known corpus of annotated text. 'pos_' uses a simpler set of tags from [A Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086), published by researchers from Google.  

The tags for `affects` provide an example of the difference. According to the [Spacy documentation](https://spacy.io/api/annotation#pos-tagging), `VBZ` from the Penn tag set indicates a 'verb, 3rd person singular present', while the `VERB` result for `pos_` covers many types of verbs that the Penn Treebank describes in more detail.
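
The many-to-one relationship between the two tag sets can be sketched with a small grouping exercise. The pairs below are limited to (`tag_`, `pos_`) combinations that appear in the output tables in this notebook; the names `OBSERVED_PAIRS` and `fine_by_coarse` are hypothetical, and the list is far from a complete mapping.

```python
from collections import defaultdict

# (tag_, pos_) pairs as they appear in this notebook's output tables.
OBSERVED_PAIRS = [
    ("VBZ", "VERB"), ("VBG", "VERB"), ("VBN", "VERB"), ("VBD", "VERB"), ("VBP", "VERB"),
    ("NN", "NOUN"), ("NNS", "NOUN"),
    ("NNP", "PROPN"),
    ("DT", "DET"), ("IN", "ADP"), ("RB", "ADV"),
    ("JJ", "ADJ"), ("CD", "NUM"), ("CC", "CCONJ"),
]

# Group the detailed tags under the simpler tag they correspond to.
fine_by_coarse = defaultdict(list)
for fine, coarse in OBSERVED_PAIRS:
    fine_by_coarse[coarse].append(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, sorted(fine_by_coarse[coarse]))
```

Filtering on `pos_ == 'VERB'` therefore catches every verb form in one test, whereas filtering on `tag_` would require listing each detailed tag separately.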

If you want to learn more about a part of speech tag, you can use `spacy.explain`

In [37]:
print(spacy.explain(parsed[2].pos_))
print(spacy.explain(parsed[2].tag_))

verb
verb, 3rd person singular present


Other values returned by the parser might also be of interest:
* is_stop is True if the token is a "stop" word - a commonly found word that might add little or no information.
* is_alpha is True if the token consists entirely of alphabetic characters.

In [38]:
print([token.text for token in parsed])

['#', 'Smoking', 'affects', 'multiple', 'parts', 'of', 'our', 'body', '.', 'Know', 'more', ':', 'https://t.co/hwTeRdC9Hf', '\n', '#', 'SwasthaBharat', '#', 'NHPIndia', '#', 'mCessation', '#', 'QuitSmoking', 'https://t.co/x7xHO9G2Cr']


Let's look at token 1 ("Smoking"), token 4 ("parts"), token 12 ("https://t.co/hwTeRdC9Hf"), and token 14 ("#") to see a few more tokens in action.

In [41]:
t1 = parsed[1]
t4 = parsed[4]
t12= parsed[12]
t14 = parsed[14]
print (t1.text,t1.lemma_,t1.pos_,t1.tag_,t1.is_stop,t1.is_alpha)
print (t4.text,t4.lemma_,t4.pos_,t4.tag_,t4.is_stop,t4.is_alpha)
print (t12.text,t12.lemma_,t12.pos_,t12.tag_,t12.is_stop,t12.is_alpha)
print (t14.text,t14.lemma_,t14.pos_,t14.tag_,t14.is_stop,t14.is_alpha)

Smoking smoking NOUN NN False True
parts part NOUN NNS False True
https://t.co/hwTeRdC9Hf https://t.co/hwterdc9hf PROPN NNP False False
# # PROPN NNP False False


A few observations:
* URLs are neither alphabetic nor stop words, but they are tagged as proper nouns.
* `#` is also tagged as a proper noun. 
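
The alphabetic test can be reproduced with Python's built-in `str.isalpha()`, which the Spacy documentation describes as equivalent to `is_alpha`. The sketch below applies it to a few token strings from the sample tweet. The stop-word set is a tiny hypothetical stand-in, since Spacy's actual stop list is much longer, and `describe` is a helper name invented for this example.

```python
# Hypothetical, tiny stop-word list for illustration only.
TOY_STOP_WORDS = {"of", "our", "more"}

def describe(token_text):
    """Return the alphabetic and stop-word flags for a token string."""
    return {
        "is_alpha": token_text.isalpha(),
        "is_stop": token_text.lower() in TOY_STOP_WORDS,
    }

samples = ["Smoking", "of", "https://t.co/hwTeRdC9Hf", "#", "10"]
results = {text: describe(text) for text in samples}
for text, flags in results.items():
    print(text, flags)
```

A URL fails the alphabetic test because it contains `:`, `/`, and `.`, and a number fails because digits are not letters.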


Some NLP systems will go a bit further than Spacy's lemmatization, using a process called "stemming" to reduce words to base forms. With a stemming algorithm, "scared" might be reduced to "scare" - see this description of [Porter's stemming algorithm](https://tartarus.org/martin/PorterStemmer/) for more detail. 
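
To give a feel for how rule-based stemming works, here is a minimal sketch of only the first step (step 1a, which handles plural endings) of Porter's algorithm as described on the linked page. The full algorithm has several further steps that are not shown here, and the function name `porter_step_1a` is our own. Input words are assumed to be lowercase.

```python
# Step 1a rules from Porter's algorithm: the first matching suffix wins.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply only step 1a (plural handling) of Porter's stemming algorithm."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "caress", "cats", "parts", "smoking"]:
    print(word, "->", porter_step_1a(word))
```

Note that `ponies` becomes `poni`, which is not an English word. Stems only need to be consistent, whereas a lemma such as `pony` is a real dictionary form.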

Let's turn the token-inspection code that we used above into a routine, along with a routine to print out token details, and try another tweet or two.

In [45]:
def getTweetText(tweets):
    tweet_id=random.choice(list(tweets.getIds()))
    return tweets.getText(tweet_id)

def printTokDetails(parsed):
    print("{:30} {:30} {:7}{:7}{:7}{:7}".format("Token text","Lemma","POS","Tag","Stop?","Alpha?"))
    for tok in parsed:
        print("{:30} {:30} {:7}{:7}{:7}{:7}".format(str(tok.text),str(tok.lemma_),str(tok.pos_),str(tok.tag_),str(tok.is_stop),str(tok.is_alpha)))

In [46]:
sample2=getTweetText(smoking)

In [47]:
sample2

'#UNFAO is scaling up efforts on reducing the amount of #wood used as #fuel for #fish smoking in the #Gambia. With the new #UNFAO Thiaroye Technology stove, #women in the #fishsmoking &amp; drying #industry have improved access to #technology &amp; #livelihood. #ZeroHunger \n@FAOWestAfrica https://t.co/ifT8KSRo3O'

In [48]:
parsed2=nlp(sample2)

In [49]:
printTokDetails(parsed2)

Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
#                              #                              PROPN  NNP    False  False  
UNFAO                          unfao                          PROPN  NNP    False  True   
is                             be                             VERB   VBZ    True   True   
scaling                        scale                          VERB   VBG    False  True   
up                             up                             PART   RP     True   True   
efforts                        effort                         NOUN   NNS    False  True   
on                             on                             ADP    IN     True   True   
reducing                       reduce                         VERB   VBG    False  True   
the                            the                            DET    DT     True   True   
amount                         amount                         NOUN   NN     True   True   

You might see some interesting patterns arising here.  For example:

* We see many different parts of speech. Initially, we might want to focus on the nouns alone, as they provide much of the content.  

* Look for words like "is" or "was" - these all map to a common lemma, "be", the base form of the verb. Do you see any other instances of lemma forms that differ from the parsed text?

* URLs and icons might be present in tweets. Are they classified as alphabetic? Should we include them as part of the "useful" text from a tweet? 
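
The first two observations can be checked mechanically. The sketch below copies the first ten rows of the table above into plain tuples (so that it runs without Spacy) and then pulls out the nouns and the tokens whose lemma differs from the text. The names `ROWS`, `nouns`, and `changed` are hypothetical.

```python
# (text, lemma, POS) rows copied from the first ten lines of the table above.
ROWS = [
    ("#", "#", "PROPN"),
    ("UNFAO", "unfao", "PROPN"),
    ("is", "be", "VERB"),
    ("scaling", "scale", "VERB"),
    ("up", "up", "PART"),
    ("efforts", "effort", "NOUN"),
    ("on", "on", "ADP"),
    ("reducing", "reduce", "VERB"),
    ("the", "the", "DET"),
    ("amount", "amount", "NOUN"),
]

# Keep only the nouns.
nouns = [text for text, lemma, pos in ROWS if pos == "NOUN"]

# Find tokens whose lemma differs from the text (ignoring case).
changed = [(text, lemma) for text, lemma, pos in ROWS if text.lower() != lemma]

print(nouns)
print(changed)
```

With real parsed tweets, the same comprehensions would iterate over the tokens directly, using `tok.text`, `tok.lemma_`, and `tok.pos_`.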


---
## EXERCISE 3.1: Filtering tokens

Although NLP parsing is often a good start, further filtering is often necessary to focus on data relevant for specific tasks. In this problem, we will review some additional tweets and develop a post-processing routine capable of filtering tweets as necessary for our needs. 

3.1.1 Using the `getTweetText` and `printTokDetails` routines above, along with the spacy `nlp` object, examine several tweets to decide which tokens should be included or not.  List criteria for keeping/removing tokens. Remember to use `spacy.explain()` for any unfamiliar POS or tag entries. Note that your criteria will not be perfect, and will likely need refining. Examine enough tweets to feel confident in your criteria.

3.1.2 Write a routine  `includeToken` that will return True if a token matches the criteria that you identified in 3.1.1, and False otherwise. Assume for now that we are only interested in nouns and verbs, as they might be a good starting point to find information about vaping or smoking. 

3.1.3 Write a routine `filterTweetTokens` that will filter the parsed tokens from a single tweet, returning a list of the tokens to be included, based on your criteria.

3.1.4 Run `filterTweetTokens` on a few tweets. Identify any inaccuracies and explain them. When possible, identify an approach for improving performance, and implement it in a revised version of `filterTweetTokens`.

3.1.5. Add these routines to the tweet class, along with some new routines.

3.1.5.1 `parseTweet` will parse one of the tweets in the collection, storing the full list of tokens in a new dictionary entry entitled 'tokens'. `parseTweet` will also filter the tokens, storing the resulting list in an entry entitled 'filteredTokens'.

*NOTE*: The Tweets class might or might not have an NLP object available for any given call to `parseTweet`. You should have the class create an NLP object when it is initialized. 

3.1.5.2 `parseTweets` will call `parseTweet` on all of the tweets in a collection.

3.1.5.3 `getTokens` will be used to get all of the tokens for a given tweet.

3.1.5.4 `getFilteredTokens` will be used to get all of the filtered tokens for a tweet. 


3.1.6 When you are done, test this new version of the class by reading in and parsing the 'smoking' tweet set. 

---
*ANSWER FOLLOWS Cut below here*

### 3.1.1 Sample tweets

In [106]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
Made                           make                           VERB   VBN    False  True   
a                              a                              DET    DT     True   True   
sandwich                       sandwich                       NOUN   NN     False  True   
10                             10                             NUM    CD     False  False  
min                            min                            NOUN   NN     False  True   
ago                            ago                            ADV    RB     False  True   
and                            and                            CCONJ  CC     True   True   
been                           be                             VERB   VBD    True   True   
looking                        look                           V

In [107]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

me: smoking weed hasn’t affected me at all

someone: count to 10

me: https://t.co/SUoGzARpom
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
me                             -PRON-                         PRON   PRP    True   True   
:                              :                              PUNCT  :      False  False  
smoking                        smoke                          VERB   VBG    False  True   
weed                           weed                           NOUN   NN     False  True   
has                            have                           VERB   VBZ    True   True   
n’t                            not                            ADV    RB     False  False  
affected                       affect                         VERB   VBN    False  True   
me                             -PRON-                         PRON   PRP    True   True   
at                             at                             ADP    IN     True   True

In [108]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

@Artscphoto @presstelegram That's hilarious because I catch those punk ass kids fighting &amp; smoking dope every week by my house.
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
@Artscphoto                    @artscphoto                    NOUN   NN     False  False  
@presstelegram                 @presstelegram                 NOUN   NN     False  False  
That                           that                           DET    DT     False  True   
's                             be                             VERB   VBZ    False  False  
hilarious                      hilarious                      ADJ    JJ     False  True   
because                        because                        ADP    IN     True   True   
I                              -PRON-                         PRON   PRP    False  True   
catch                          catch                          VERB   VBP    False  True   
those                          those             

In [109]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)
printTokDetails(parsed)

sexy girls droping there pants pictures of girls smoking virginia slims cigarettes https://t.co/RXYgAr8rKZ
Token text                     Lemma                          POS    Tag    Stop?  Alpha? 
sexy                           sexy                           ADJ    JJ     False  True   
girls                          girl                           NOUN   NNS    False  True   
droping                        drop                           VERB   VBG    False  True   
there                          there                          ADV    RB     True   True   
pants                          pant                           NOUN   NNS    False  True   
pictures                       picture                        NOUN   NNS    False  True   
of                             of                             ADP    IN     True   True   
girls                          girl                           NOUN   NNS    False  True   
smoking                        smoke                          VERB   VBG  

Criteria: 

* Alpha is true, and 
* Stop is false, and 
* text is not "RT", and
* POS is NOUN or POS is VERB

### 3.1.2  `includeToken`

Our routine will accept a token only if it meets the criteria given above. 

In [110]:
def includeToken(tok):
    val =False
    if tok.is_alpha == True and tok.is_stop == False:
        if tok.text =='RT':
            val = False
        elif tok.pos_=='NOUN' or tok.pos_=='VERB':
            val = True
    return val

In [111]:
sample=getTweetText(smoking)
parsed=nlp(sample)
print(sample)

I love smoking weed in beautiful ass places, looking at beautiful ass things.


In [112]:
print(parsed[0])
includeToken(parsed[0])

I


False

In [113]:
print(parsed[1])
includeToken(parsed[1])

love


True

In [114]:
print(parsed[2])
includeToken(parsed[2])

smoking


True

In [115]:
for tok in parsed:
    print(tok,includeToken(tok))

I False
love True
smoking True
weed True
in False
beautiful False
ass True
places True
, False
looking True
at False
beautiful False
ass True
things True
. False


Looks ok. 

### 3.1.3 Write a routine `filterTweetTokens` that will filter the tokens from a single tweet

In [116]:
def filterTweetTokens(tokens):
    filtered=[]
    for tok in tokens:
        if includeToken(tok) == True:
            filtered.append(tok)
    return filtered

In [117]:
f= filterTweetTokens(parsed)
print(sample)
for tok in f:
    print(tok)

I love smoking weed in beautiful ass places, looking at beautiful ass things.
love
smoking
weed
ass
places
looking
ass
things


### 3.1.4 Run `filterTweetTokens` on a few tweets

In [118]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

cole sprouse smoking appreciation tweet https://t.co/oJdzn6j9lb
cole
sprouse
smoking
appreciation
tweet


In [119]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

#Zenith #Pilot Type 20 Extra Special #Cohiba-#Maduro 5 Edition watch - A smoking start to 2018 for Zenith/Cohiba partnership with Pilot Type 20 special editions 
@ZenithWatches #Type20
https://t.co/n1CnmMTI2Y https://t.co/H82cdAb7MY
watch
smoking
start
partnership
editions


In [120]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

really tryna stop smoking 🤦🏻‍♀️
tryna
stop
smoking


In [121]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)


Made a sandwich 10 min ago and been looking for it ever since then🤦🏾‍♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe
Made
sandwich
min
looking
got
stop
smoking


In [126]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

@Artscphoto @presstelegram That's hilarious because I catch those punk ass kids fighting &amp; smoking dope every week by my house.
catch
punk
ass
kids
fighting
amp
smoking
dope
week
house


In [127]:
sample=getTweetText(smoking)
print(sample)
parsed=nlp(sample)
f= filterTweetTokens(parsed)
for tok in f:
    print(tok)

เจ๋ง!👍

Cool!
Sweet! (slang)
Great!
Beautiful!
Awesome!
Excellent!
That's smoking!
That's fab!

#บุพเพสันนิวาส https://t.co/hVqtVvoBp4
smoking


### 3.1.5  Adding to the Tweets class

In [138]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        self.nlp = spacy.load('en')
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
  
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary
    
    # new routine for classifying a token
    def includeToken(self,tok):
        val =False
        if tok.is_alpha == True and tok.is_stop == False:
            if tok.text =='RT':
                val = False
            elif tok.pos_=='NOUN' or tok.pos_=='VERB':
                val = True
        return val
    
    # new routine for filtering a list of tokens.
    def filterTweetTokens(self,tokens):
        filtered=[]
        for tok in tokens:
            if self.includeToken(tok):  # call the method, not the notebook-level function
                filtered.append(tok)
        return filtered
    
    def parseTweet(self,id):
        text = self.getText(id)
        parsed = self.nlp(text)  # use the NLP object created in __init__
        self.tweets[id]['tokens']=parsed
        filtered= self.filterTweetTokens(parsed)
        self.tweets[id]['filteredTokens']=filtered
        
    def parseTweets(self):
        ids=self.getIds()
        for id in ids:
            self.parseTweet(id)
            
    def getTokens(self,id):
        return self.tweets[id]['tokens']
    
    def getFilteredTokens(self,id):
         return self.tweets[id]['filteredTokens']

### 3.1.6 Trying out the new routines for parsing a collection.

In [139]:
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
smoking.parseTweets()  # populate 'tokens' and 'filteredTokens' for every tweet
smoking.countTweets()

100

In [158]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['Made', 'a', 'sandwich', '10', 'min', 'ago', 'and', 'been', 'looking', 'for', 'it', 'ever', 'since', 'then', '\U0001f926', '🏾\u200d', '♂', '️', 'I', 'got', 'ta', 'stop', 'smoking', '😂', 'https://t.co/NCbNOyvZXe']
['Made', 'sandwich', 'min', 'looking', 'got', 'stop', 'smoking']


In [159]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['BTS', 'was', 'nt', 'going', 'to', 'disband', 'what', 'are', 'u', 'smoking', '.', 'Show', 'me', 'where', 'it', 'said', 'they', 'were', 'going', 'to', 'disband', '?', 'only', 'a', 'few', 'of', 'y’', 'all', 'multifandom', 'said', 'u', 'were', 'going', 'to', 'help', 'but', 'we', 'do', 'n’t', 'even', 'know', 'if', 'those', 'really', 'voted', 'https://t.co/CfsHoGnxfJ']
['going', 'disband', 'u', 'smoking', 'Show', 'said', 'going', 'disband', 'multifandom', 'said', 'going', 'help', 'know', 'voted']


In [160]:
tweet_id=random.choice(list(smoking.getIds()))
smoking.getText(tweet_id)
toks = smoking.getTokens(tweet_id)
print([token.text for token in toks])
filtered = smoking.getFilteredTokens(tweet_id)
print([f.text for f in filtered])

['Our', 'Well', '-', 'Being', 'Advisor', 'Jenny', 'providing', 'some', 'excellent', 'ideas', 'and', ' ', 'quit', 'smoking', 'tips', 'during', 'our', '#', 'NoSmokingDay', '!', '🚭', '#', 'Pharmacy', '#', 'ThinkPharmacyNI', '#', 'healthpluspharmacy', 'https://t.co/UHn2lSsErQ']
['Being', 'providing', 'ideas', 'quit', 'smoking', 'tips', 'ThinkPharmacyNI', 'healthpluspharmacy']



---

### 1.3 Revised tokenizing

Reviewing some tweets may reveal text patterns that the initial tokenizing or part-of-speech tagging handles poorly. Fortunately, spaCy provides a means of extending the tokenizer for special cases. Here, we review an example of how these tools might be used.

Specifically, review of some tweets led to the following concerns: 
1. The word "E-cigarette" is split by the tokenizer into three separate tokens ("E", "-", and "cigarette").
2. Hashtags are split into the pound symbol (`#`) and the following text.


#### 1.3.2 Tokenizing "E-cigarette"

Consider the following tweet:

In [52]:
smoketweet='E-cigarette use by teens linked to later tobacco smoking, study says https://t.co/AhTpFUw0TW'

In [53]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['E', '-', 'cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


Note that "E-cigarette" becomes three tokens. This is not what we want - we want it to be held together as one. 
To do this, we can add a [special-case tokenizer rule](https://spacy.io/usage/linguistic-features#section-tokenization) as follows:

In [54]:
from spacy.symbols import ORTH, LEMMA, POS
special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
nlp.tokenizer.add_special_case(u'E-cigarette', special_case)

This code tells the tokenizer that the exact strings "e-cigarette" and "E-cigarette" should each be handled by the special-case rule, which emits a single token.

In [55]:
parsed=nlp(smoketweet)
print( [tok.text for tok in parsed])

['e-cigarette', 'use', 'by', 'teens', 'linked', 'to', 'later', 'tobacco', 'smoking', ',', 'study', 'says', 'https://t.co/AhTpFUw0TW']


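Now we capture "E-cigarette" as one token. Special-case rules match the exact string, so each capitalization needs its own rule. Because both rules share the same `ORTH` value, the resulting token text is the lowercase "e-cigarette".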
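
The exact-match behavior can be illustrated with a toy tokenizer written in plain Python. This is only a sketch of the idea, not how spaCy works internally: `toy_tokenize` and `SPECIAL_CASES` are hypothetical names, and the splitting rule (runs of word characters versus single punctuation marks) is a simplifying assumption.

```python
import re

# Hypothetical special-case table: strings that must never be split.
SPECIAL_CASES = {'e-cigarette', 'E-cigarette'}


def toy_tokenize(text, special_cases=SPECIAL_CASES):
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:
            # Exact, case-sensitive match: keep the chunk whole.
            tokens.append(chunk)
        else:
            # Otherwise split into word runs and single punctuation marks.
            tokens.extend(re.findall(r'\w+|[^\w\s]', chunk))
    return tokens


both = toy_tokenize('E-cigarette use by teens')
lower_only = toy_tokenize('E-cigarette use by teens', special_cases={'e-cigarette'})
print(both)
print(lower_only)
```

With only the lowercase rule registered, the capitalized form falls through to the default splitting and comes back as three tokens, which is why the cell above registers both spellings.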

#### 1.3.3 Tokenizing hashtags

Hashtags are important in tweets, as we might want to track frequency and trends of mentions. However, the default tokenizer does not capture hashtags as such. For example:

In [56]:
hashtag ="RT @heal_crypto: #VR uses in therapy - for various additictions such as smoking, alcohol, overeating, etc - #HealCoin https://t.co/T65Fboq7…"
parsed=nlp(hashtag)
print([tok.text for tok in parsed])

['RT', '@heal_crypto', ':', '#', 'VR', 'uses', 'in', 'therapy', '-', 'for', 'various', 'additictions', 'such', 'as', 'smoking', ',', 'alcohol', ',', 'overeating', ',', 'etc', '-', '#', 'HealCoin', 'https://t.co/T65Fboq7', '…']


In [64]:
parsed[22]

#

Note how "#VR" is split into "#" and "VR". To avoid this, we can add a custom component to the [spaCy pipeline](https://github.com/explosion/spaCy/issues/503).

In [65]:
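def hashtag_pipe(doc):
    # Repeatedly merge each '#' token with the token that follows it.
    # Merging changes the doc, so restart the scan after every merge.
    merged_hashtag = False
    while True:
        for token_index,token in enumerate(doc):
            # token.nbor() raises IndexError on the last token, so check bounds first
            if token.text == '#' and token_index + 1 < len(doc):
                # doc.merge() takes character offsets, not token indices
                start_index = token.idx
                end_index = start_index + len(token.nbor().text) + 1
                if doc.merge(start_index, end_index) is not None:
                    merged_hashtag = True
                    break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc
nlp.add_pipe(hashtag_pipe)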

doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'

This routine looks for tokens that consist of '#' and merges each one with its "nbor" (the next token). The routine is then added to the [spaCy pipeline](https://spacy.io/usage/processing-pipelines), so it runs on every document that `nlp` processes.
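
The merging step itself can be sketched on a plain list of token strings, independent of spaCy. `merge_hashtags` is a hypothetical helper written for this illustration, and it assumes that the token after a '#' is always the tag text; real spaCy tokens also carry whitespace information, which this toy version ignores.

```python
def merge_hashtags(tokens):
    """Join each '#' with the token that follows it (toy version)."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == '#' and i + 1 < len(tokens):
            # Glue the pound sign to the next token and skip past both.
            merged.append('#' + tokens[i + 1])
            i += 2
        else:
            # Ordinary token, or a trailing '#' with nothing after it.
            merged.append(tokens[i])
            i += 1
    return merged


example = ['RT', ':', '#', 'VR', 'uses', 'in', 'therapy', '-', '#', 'HealCoin']
result = merge_hashtags(example)
print(result)
```

A '#' at the very end of the list is left alone, which corresponds to the bounds check needed before looking at a token's neighbor.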

Returning to our original example:

In [66]:
parsed=nlp(hashtag)
print([tok.text for tok in parsed])

['RT', '@heal_crypto', ':', '#VR', 'uses', 'in', 'therapy', '-', 'for', 'various', 'additictions', 'such', 'as', 'smoking', ',', 'alcohol', ',', 'overeating', ',', 'etc', '-', '#HealCoin', 'https://t.co/T65Fboq7', '…']
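
Once hashtags survive as single tokens, as in the output above, tallying them across tweets is straightforward. The sketch below works on lists of token strings rather than spaCy objects. `count_hashtags` is a hypothetical helper, and lowercasing the tags before counting is an assumption made here so that "#VR" and "#vr" are treated as the same tag.

```python
from collections import Counter


def count_hashtags(token_lists):
    """Count hashtag tokens across several tokenized tweets."""
    counts = Counter()
    for tokens in token_lists:
        # Skip a bare '#', which is not a hashtag by itself.
        counts.update(t.lower() for t in tokens
                      if t.startswith('#') and len(t) > 1)
    return counts


tweets = [
    ['RT', '#VR', 'uses', 'in', 'therapy', '#HealCoin'],
    ['trying', '#vr', 'to', 'quit', 'smoking', '#'],
]
counts = count_hashtags(tweets)
print(counts.most_common())
```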


Customization of pipelines such as this is often an important part of NLP work.