 <table><tr><td><img src="images/dbmi_logo.png" width="75" height="73" alt="Pitt Biomedical Informatics logo"></td><td><img src="images/pitt_logo.png" width="75" height="75" alt="University of Pittsburgh logo"></td></tr></table>
 
 
 # Social Media and Data Science - Part 5
 
 
Data science modules developed by the University of Pittsburgh Biomedical Informatics Training Program with the support of the National Library of Medicine data science supplement to the University of Pittsburgh (Grant # T15LM007059-30S1). 

Developed by Harry Hochheiser, harryh@pitt.edu. All errors are my responsibility.

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.


### Goal: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as smoking.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

---

In [1]:
%matplotlib inline

import operator
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime
from spacy.symbols import ORTH, LEMMA, POS

# 5.0 Introduction

This final part of our journey through social media data retrieval, annotation, natural language processing, and classification will challenge you to apply these techniques to a new problem. Specifically, you will create, annotate, and process a new data set.

# 5.0.1 Setup

As before, we start with the Tweets class and the configuration for our Twitter API connection.  We may not need this, but we'll load it in any case.

In [2]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",tweet_mode='extended',count=corpus_size)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(120)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def combineTweets(self,other):
        for otherid in other.getIds():
            tweet = other.getTweet(otherid)
            searchTerm = other.getSearchTerm(otherid)
            searchTime = other.getSearchTime(otherid)
            self.addTweet(tweet,searchTime,searchTerm)
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        if 'codes' in tweet:
            return tweet['codes']
        else:
            return None
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary

*Replace the placeholder values below with your own Twitter API credentials.*

In [3]:
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [4]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
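
Rather than pasting keys into a notebook cell, one possible alternative is to read them from environment variables. The sketch below is not part of the original modules: the helper `load_twitter_credentials` and the four environment variable names are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your own setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_twitter_credentials(environ=None):
    """Return the four credentials as a dict, raising KeyError if any are missing."""
    environ = os.environ if environ is None else environ
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values could then be passed to `OAuthHandler` and `set_access_token` in the same way as the variables in the cell above.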

We will also load some routines that we defined in [Part 3](SocialMedia - Part 3.ipynb):
    
1. Our routine for creating a customized NLP pipeline
2. Our routine for including tokens
3. The `filterTweetTokens` routine defined in an exercise (Without the inclusion of named entities. It will be easier to leave them out for now).

In [5]:
def getTwitterNLP():
    nlp = spacy.load('en')
    
    for word in nlp.Defaults.stop_words:
        lex = nlp.vocab[word]
        lex.is_stop = True
    
    special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
    nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
    nlp.tokenizer.add_special_case(u'E-cigarette', special_case)
    vape_case = [{ORTH: u'vape',LEMMA:u'vape',POS: u'NOUN'}]
    
    vape_spellings =[u'vap',u'vape',u'vaping',u'vapor',u'Vap',u'Vape',u'Vapor',u'Vapour']
    for v in vape_spellings:
        nlp.tokenizer.add_special_case(v, vape_case)
    def hashtag_pipe(doc):
        merged_hashtag = True
        while merged_hashtag == True:
            merged_hashtag = False
            for token_index,token in enumerate(doc):
                if token.text == '#':
                    try:
                        nbor = token.nbor()
                        start_index = token.idx
                        end_index = start_index + len(token.nbor().text) + 1
                        if doc.merge(start_index, end_index) is not None:
                            merged_hashtag = True
                            break
                    except:
                        pass
        return doc
    nlp.add_pipe(hashtag_pipe,first=True)
    return nlp

def includeToken(tok):
    val =False
    if tok.is_stop == False:
        if tok.is_alpha == True: 
            if tok.text =='RT':
                val = False
            elif tok.pos_=='NOUN' or tok.pos_=='PROPN' or tok.pos_=='VERB':
                val = True
        elif tok.text[0]=='#' or tok.text[0]=='@':
            val = True
    if val== True:
        stripped =tok.lemma_.lower().strip()
        if len(stripped) ==0:
            val = False
        else:
            val = stripped
    return val

def filterTweetTokens(tokens):
    filtered=[]
    for t in tokens:
        inc = includeToken(t)
        if inc != False:
            filtered.append(inc)
    return filtered

Finally, we will include some additional modules from Scikit-Learn:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words module is not importable in newer scikit-learn releases
from sklearn.metrics import accuracy_score
import string
import re

Now we're ready for the exercise.

---
## EXERCISE 5.1: Annotating and classifying new data


Identifying the source of social media comments might be an important step in the process of interpreting a large corpus. Continuing with our example of smoking and vaping, it might be interesting to compare tweets from users (people who are talking about their own personal use) to those from people who might be either promoting vaping (manufacturers, sponsors, etc.) or warning about the dangers of vaping (physicians, researchers, public health agencies, etc.).

A team of researchers at RTI International tackled this problem in a 2017 paper, [Classification of Twitter Users Who Tweet About E-Cigarettes](http://publichealth.jmir.org/2017/3/e63/). Annice Kim and colleagues collected tweets and attributed them to individuals, enthusiasts, informed agencies (news media or health community), marketers, or spammers.

Your goal here is to collect a small data set and to attempt a smaller version of this challenge. Specifically, we will try to collect preliminary data for a classifier capable of identifying tweets from users of e-cigarettes vs. others.  Using any of the code found in Parts 1-4, complete these steps:

1. Run some searches for terms like 'e-cig', 'e-cigarette', 'vape' and 'vaping'. Collect a corpus of 200-300 or more tweets. You might want to save each of these result sets in files.

2. Combine these tweets into one large collection using the `Tweets` class listed above. Save the results in a file.

3. Annotate 50 of these tweets as pertaining to either 'individual' or 'non-individual'. Be sure that you do at least a few of the tweets from each of the original sets. One way to do this might be to randomize the tweets. Save the annotated results in a file. 

4. Review the distribution. Is it close to even? If not, annotate more tweets.

5. Take your annotated tweets and split them into train (80%) and test (20%) sets.  Process the train data and build a model (based on a TfIdf Vectorizer and an SVM). Evaluate the model on the test data set.

6. Test your model on the remaining tweets. What does your result look like?

7. Review some of the data to identify opportunities for improvement - how might you make these models better?

8. Reflect on the reproducibility and the reusability of the code: what should be done to make these tools easier to apply to other datasets?



----
*ANSWER BELOW - CUT BELOW HERE*

### 1. Running some Searches

In [None]:
ecig = Tweets("vape",100)

In [None]:
ecig.saveTweets("vape1.json")

In [None]:
ecig2 = Tweets("ecig",100)

In [None]:
ecig.countTweets()

In [None]:
ecig2.saveTweets("ecig1.json")

In [None]:
ecig3 = Tweets("vaping",100)

In [None]:
ecig3.saveTweets("vaping1.json")

In [None]:
ecig4 = Tweets("vaping",100)

In [None]:
ecig4.saveTweets("vaping2.json")

In [None]:
ecig5 = Tweets("e-cigarette",100)

In [None]:
ecig5.saveTweets("ecig2.json")

In [None]:
vape2=Tweets("vaping",100)
vape2.saveTweets("vape2.json")

### 2. Combine results of searches and save

In [7]:
fullTweets = Tweets()
fullTweets.readTweets("ecig1.json")

In [8]:
ecig2 = Tweets()
ecig2.readTweets("ecig2.json")

In [9]:
fullTweets.combineTweets(ecig2)

In [10]:
fullTweets.countTweets()

200

In [11]:
vape1=Tweets()
vape1.readTweets("vape1.json")
fullTweets.combineTweets(vape1)

In [12]:
fullTweets.countTweets()

300

In [13]:
vape2=Tweets()
vape2.readTweets("vape2.json")
fullTweets.combineTweets(vape2)

In [14]:
vaping1=Tweets()
vaping1.readTweets("vaping1.json")
fullTweets.combineTweets(vaping1)
vaping2=Tweets()
vaping2.readTweets("vaping2.json")
fullTweets.combineTweets(vaping2)

In [15]:
fullTweets.countTweets()

591

In [16]:
fullTweets.saveTweets("part5.json")

### 3. Annotating 50 tweets

#### Randomly select...

In [17]:
fullTweets = Tweets()
fullTweets.readTweets("part5.json")

ids=list(fullTweets.getIds())

In [18]:
len(ids)

591

In [19]:
import random
random.shuffle(ids)
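
Because `getIds` returns a set and `random.shuffle` is called without a seed, the order of `ids` (and so which tweets end up annotated) may differ from run to run. One possible way to make the selection repeatable is sketched below. The helper name `reproducible_sample` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import random

def reproducible_sample(ids, k, seed=42):
    """Return k ids in an order that is stable across runs for the same seed."""
    ordered = sorted(ids)       # sets have no stable order, so sort first
    rng = random.Random(seed)   # private generator; the global one is left untouched
    rng.shuffle(ordered)
    return ordered[:k]

# Toy ids standing in for real tweet id strings.
sample = reproducible_sample({"1003", "1001", "1002", "1004"}, k=2)
print(sample)
```

With the real data, something like `reproducible_sample(fullTweets.getIds(), 50)` would give the same 50 ids each time the notebook is rerun on the same file.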

In [20]:
id = ids[0]
fullTweets.getText(id)

'virgo energy is not allowing a single thing to be out of place at ur job &amp; being the cleanest worker but having the messiest house &amp; interpersonal relationship skills, vaping/essential oils, dropping everything 4 spontaneous trips to new cities, only reading murder mystery novels'

In [21]:
fullTweets.addCode(id,"INDIVIDUAL")

In [22]:
id = ids[1]
fullTweets.getText(id)

'Flavor Spam | Fake Anti-#VAPING Comments Flood FDA | Published on - https://t.co/VVrENPzvF9 #ecig #ukvapers #vape #VapeUK #vaping https://t.co/XuhE9uVoYJ'

In [23]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [24]:
id = ids[2]
fullTweets.getText(id)

'UBLO HEMP CBD Vape Juice | E Liquid | eliquid 0% Nicotine 6 STRENGTH 6 Flavours Hemp Oil ON E BAY https://t.co/cyXpGNNAAr … … … … … #vapecommunity #vapefamily #vapeshop #Vape #cloudchaser #vapelife #vapeporn #CBD #JBRT18VAPE #ublo #CBDlife #atsocialmedia #tweetmaster #hemp https://t.co/JcMuAUaizD'

In [25]:
fullTweets.addCode(id,"INDIVIDUAL")

In [26]:
id = ids[3]
fullTweets.getText(id)

'Hypersensitivity pneumonitis and #ARDS from e-cigarette use: https://t.co/FSov4xGKi9'

In [27]:
fullTweets.addCode(id,"INDIVIDUAL")

In [28]:
id = ids[4]
fullTweets.getText(id)

"@CBCNews @gq_in_sk It's no different then banning cigarettes.Yes it is medical marijuana and people have a license (I currently do) but there are so many better options..oils are great the effects last longer up to 8h while vaping only a couple and it puts you at risk for pneumonia and bronchitis."

In [29]:
fullTweets.addCode(id,"INDIVIDUAL")

In [30]:
id = ids[5]
fullTweets.getText(id)

'"E-cigarette use by American high school students has increased 900%  from 2011-2015 according to a 2016 study by the U.S. surgeon general." https://t.co/Otm37iDlH4'

In [31]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [32]:
id = ids[6]
fullTweets.getText(id)

'Cults in MI\n-Suburban boys who say they’re from Detroit\n-Fishermen\n-Craft Beer Junkies \n-Yoopers\n-Middle Schoolers who vape\n-Dads with boats \n-Fans of UofM/MSU who didn’t attend either school\n-Red Wings fans \n-Anyone at Electric Forest/Faster Horses \n-Hockey Moms \n-Chicks'

In [33]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [34]:
id = ids[7]
fullTweets.getText(id)

'Serious about shopping for Cig2o E-Cigarette Battery with USB Charger? Learn our newest assessment of the product by Nancy A. through @yotpo Source by https://t.co/N6w3iDvl0F... https://t.co/WLiOiE0Fxg'

In [35]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [36]:
id = ids[8]
fullTweets.getText(id)

"'That's going to be an issue': Lawyers weigh in on apartment smoking and vaping ban https://t.co/BAfCCqRwoT"

In [37]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [38]:
id = ids[9]
fullTweets.getText(id)

'he’s vaping omg'

In [39]:
fullTweets.addCode(id,"INDIVIDUAL")

In [40]:
id = ids[10]
fullTweets.getText(id)

"If you wanna vape or use your ecig or whatever that's totally fine with me, you do you. But don't blow that strawberry watermelon popcorn blueberry cheesecake cotton candy poptart flavored nasty shit on me is all I ask. \n\nThanks."

In [41]:
fullTweets.addCode(id,"INDIVIDUAL")

In [42]:
id = ids[11]
fullTweets.getText(id)

'What are the V2 Vape Pens https://t.co/vrNJsqroqH Remember to use our #coupon SOFLA10 during checkout #Vape #Vaping #VapeBlast #RT #eCigs Best Pen-Style Weed Vaporizers'

In [43]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [44]:
id = ids[12]
fullTweets.getText(id)



In [45]:
fullTweets.addCode(id,"INDIVIDUAL")

In [46]:
id = ids[13]
fullTweets.getText(id)

'Hammered that SLAG right off and into a bottle! Check https://t.co/p3Ba0mlgVZ for more!\n\nNew20 = 20% off 1st order!\n\n#vape #vaper #vapor #vapers #vaping #vapefam #vapekit #vapegear #vapegram… https://t.co/PnC2bQ4s7u'

In [47]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [48]:
id = ids[14]
fullTweets.getText(id)

"#Vaping, medical marijuana backers say apartment renter's total smoking ban is 'ridiculous' https://t.co/qID6LjHjsg"

In [49]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [50]:
id = ids[15]
fullTweets.getText(id)

"I liked a @YouTube video https://t.co/FrwoHoVQ8l What's nicotine addiction like? My e-cigarette habit is out of control O.o"

In [51]:
fullTweets.addCode(id,"INDIVIDUAL")

In [52]:
id = ids[16]
fullTweets.getText(id)

"Date: I'm not very materialistic at all and that's what I look for in a man.\n\nMe: *pouring juice into my solid gold e-cigarette* um, what were you saying?"

In [53]:
fullTweets.addCode(id,"INDIVIDUAL")

In [54]:
id = ids[17]
fullTweets.getText(id)

'vaping  fields'

In [55]:
fullTweets.addCode(id,"INDIVIDUAL")

In [56]:
id = ids[18]
fullTweets.getText(id)

"E-cigarette explodes in man's pocket only 'two inches away from his penis' , more details : https://t.co/nWuwKQF2Gx"

In [57]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [58]:
id = ids[19]
fullTweets.getText(id)

'The UK’s billion-pound vaping industry is a shining example of innovation and light-touch regulation. But could vested interests are hampering this method of harm reduction for smokers? https://t.co/XdYdUGUg1Y'

In [59]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [60]:
id = ids[20]
fullTweets.getText(id)

'Every time I start vaping, my cat does not know how to act when she sees the smoke 😭😂😭😂'

In [61]:
fullTweets.addCode(id,"INDIVIDUAL")

In [62]:
id = ids[21]
fullTweets.getText(id)

'E-cigarettes: How "safe" are they? #Vape #Tobacco #HeartDisease\n"e-cigarette vapors contain toxic substances, including the heavy metals lead, cadmium, and nickel." @DrMarthaGulati https://t.co/Z17K3XxDJG https://t.co/tyuBRAU2XB'

In [63]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [64]:
id = ids[22]
fullTweets.getText(id)

'E-cigarette explodes in man’s pocket only ‘two inches away from his penis’ https://t.co/W39HNxZt5g'

In [65]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [66]:
id = ids[23]
fullTweets.getText(id)

"If you wanna vape or use your ecig or whatever that's totally fine with me, you do you. But don't blow that strawberry watermelon popcorn blueberry cheesecake cotton candy poptart flavored nasty shit on me is all I ask. \n\nThanks."

In [67]:
fullTweets.addCode(id,"INDIVIDUAL")

In [68]:
id = ids[24]
fullTweets.getText(id)

'Pick up a spare SERIES-S17 900mAh battery for vaping on the go for as low as £9.99. https://t.co/cVUBURPHml'

In [69]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [70]:
id = ids[25]
fullTweets.getText(id)

'Cults in MI\n-Suburban boys who say they’re from Detroit\n-Fishermen\n-Craft Beer Junkies \n-Yoopers\n-Middle Schoolers who vape\n-Dads with boats \n-Fans of UofM/MSU who didn’t attend either school\n-Red Wings fans \n-Anyone at Electric Forest/Faster Horses \n-Hockey Moms \n-Chicks'

In [71]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [72]:
id = ids[26]
fullTweets.getText(id)

'Diacetyl is used in e-cigarette flavors. Diacetyl causes popcorn lung, which causes irreversible lung damage that makes it difficult to breath due to inflammation air pathways. Please contact the  Tobacco Control Program at (559) 675-7893 to learn more. https://t.co/3mwqwqfo0c https://t.co/SrB18c7AjJ'

In [73]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [74]:
id = ids[27]
fullTweets.getText(id)

'Set-up vibes ... \n\n#TFC #Cotton #SetUp\n#CrazyWire #Ni80 #Tools\n#VapersChoice #WeekendVibes\n#Vapeland \n#VapersSelfie #VapePics #Vape \n#VapingQuality #Vapers #Vaping\n#VapelandOrdinaryShopsExtraordinaryBrands #VapeLove #VapeCommunity  #VapeWithPassion \nhttps://t.co/fq9ZOf1YWf https://t.co/bz9rXcInsZ'

In [75]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [76]:
id = ids[28]
fullTweets.getText(id)

"I honestly didn't think anything could be gayer than vaping. \n\nBut yet here juuls are, being gayer than vapes."

In [77]:
fullTweets.addCode(id,"INDIVIDUAL")

In [78]:
fullTweets.saveTweets("part5-annotated.json")

In [79]:
id = ids[29]
fullTweets.getText(id)

'I ain’t think my vape could get a nigga high like this'

In [80]:
fullTweets.addCode(id,"INDIVIDUAL")

In [81]:
id = ids[30]
fullTweets.getText(id)

"A few warmup sketches of @_ArrowWolf_ . \nVaping n' stuff! https://t.co/Bq7ElP2J17"

In [82]:
fullTweets.addCode(id,"INDIVIDUAL")

In [83]:
id = ids[31]
fullTweets.getText(id)

'virgo energy is not allowing a single thing to be out of place at ur job &amp; being the cleanest worker but having the messiest house &amp; interpersonal relationship skills, vaping/essential oils, dropping everything 4 spontaneous trips to new cities, only reading murder mystery novels'

In [84]:
fullTweets.addCode(id,"INDIVIDUAL")

In [85]:
id = ids[32]
fullTweets.getText(id)

'Dad of six nearly has his todger blasted off when his e-cigarette battery exploded in his pocket - The Sun https://t.co/kIud2EqFav'

In [86]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [87]:
id = ids[33]
fullTweets.getText(id)

'Add a splash of color to your dotRDTA, new Color Caps and Nolli Designs Drip Tips, now available! ❤️💙🖤💛💜💎\n\nGet yours today 🙌🙌🙌: https://t.co/z9CESnRgYQ\n\n📸: dripmedia\n\n#dotmod #vape #vaping #dotRDTA #colorcaps #driptips #new #brandnew #nowavailable https://t.co/ARENk2ViYX'

In [88]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [89]:
id = ids[34]
fullTweets.getText(id)

'FDA requires additional e-cigarette makers to provide critical information so the agency can better examine youth use and product appeal, amid continued concerns around youth access to products https://t.co/bdSNWxQYly'

In [90]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [91]:
id = ids[35]
fullTweets.getText(id)

'Ok, I been using #sourapple for #vaping . Wonna try something new.\nRecommendation? #vaping #vapeshop #vapelife #vapecommunity https://t.co/jcd3wxuxj2'

In [92]:
fullTweets.addCode(id,"INDIVIDUAL")

In [93]:
id = ids[36]
fullTweets.getText(id)

"If you wanna vape or use your ecig or whatever that's totally fine with me, you do you. But don't blow that strawberry watermelon popcorn blueberry cheesecake cotton candy poptart flavored nasty shit on me is all I ask. \n\nThanks."

In [94]:
fullTweets.addCode(id,"INDIVIDUAL")

In [95]:
id = ids[37]
fullTweets.getText(id)

'@burnett_51 @innokin_ecig Feels good huh'

In [96]:
fullTweets.addCode(id,"INDIVIDUAL")

In [97]:
id = ids[38]
fullTweets.getText(id)

'@MetroUK Did he not take the e-cigarette preparation course? https://t.co/hFwpvfF92q'

In [98]:
fullTweets.addCode(id,"INDIVIDUAL")

In [99]:
id = ids[39]
fullTweets.getText(id)

'Cloud Culture. ✔️ – Vaping at Petes\xa0Place https://t.co/IZ7rlEULxr&lt;br&gt;&lt;br&gt; https://t.co/vxs4zJSvI0'

In [100]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [101]:
ids[37]==ids[39]

False

In [102]:
id = ids[40]
fullTweets.getText(id)

'NEW! JustFog Q16 Starter Kit | 100% Authentic High performance, safety, and portability all in one, Beginning of a new trend for starter kits .@cuecig #vape #ejuice #ecig #vaping #vapefam #vapelife #ejuice #vapeon #vapesale #ecigsale https://t.co/6Uci0wjOiv https://t.co/GFeKgedGMT'

In [103]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [104]:
id = ids[41]
fullTweets.getText(id)

"Should switching from smoking to vaping lower life insurance costs?  UK insurers aren't sure.  Out of 10 insurers, only 1/2 give vapers lower non-smoker insurance rates... and only if they use nicotine-free eCigs.\nhttps://t.co/U2jwABCj5p"

In [105]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [106]:
id = ids[42]
fullTweets.getText(id)

'🤑😀📺📱Join us for another episode of "AVV Live" today at 4pm mountain time on FB/Insta Live. We are going to talk the latest #vaping news as well as flavor tests and another amazing giveaway, as always best question/comment during the show wins! #ejuice #livestream #giveaway https://t.co/PPkuvKk126'

In [107]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [108]:
id = ids[43]
fullTweets.getText(id)

'@Washington_vape THANK YOU!!! ❤\n@mymass_ @twik_star @RBlazick @boxerlad680 @Heavenlyink @Stephen_40s @vaping1967 @mattKirkham5 @Sonic_vaper1 @Dripping_Hippie @ScreamQueen131 @Heavencantwait @sarcasticvaper @LordCVapes @Vaping_Train\xa0 @Vixxen_85 @AnibalAsenjo @ZGyurko @VapingKaren ☇#ARIAS🤘🏾'

In [109]:
fullTweets.addCode(id,"INDIVIDUAL")

In [110]:
id = ids[44]
fullTweets.getText(id)

'Flavor Spam | Fake Anti-#VAPING Comments Flood FDA | Published on - https://t.co/VVrENPzvF9 #ecig #ukvapers #vape #VapeUK #vaping https://t.co/XuhE9uVoYJ'

In [111]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [112]:
id = ids[45]
fullTweets.getText(id)

"#eCig Explosions-Don't Believe the Media Hype!  https://t.co/tmsqZDNiL5 https://t.co/xQmI2oCqg6"

In [113]:
fullTweets.addCode(id,"INDIVIDUAL")

In [114]:
id = ids[46]
fullTweets.getText(id)

'Sleek, sexy &amp; the Looking for a sleek &amp; sexy vape?  Her it is! This is a AMAZING DEAL! Get YOURS today! Amigo Vogue II Curve 50W Box MOD Mini Polestar Tank 2200mAh Sub Ohm .@cuecig #vape #ejuice #ecig #vaping #vapelife #vapefam #vapeon #vapesale #ecigsale https://t.co/S0fegSCd0A https://t.co/g65noCV7i6'

In [115]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [116]:
id = ids[47]
fullTweets.getText(id)

'.@NHSDigital have published smoking statistics for 2018. They show the most common reason e-cigarette users gave for use was to aid themselves in quitting smoking (48%).\n\nOur aim is to create a #smokefree generation in #Cumbria by 2022, see https://t.co/zEFKjs3Gg9 for details. https://t.co/HfoMm35W9M'

In [117]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [118]:
id = ids[48]
fullTweets.getText(id)

'Single vs. Dual Coil #Atomizers in #Ecigs https://t.co/ee3JeojAfw https://t.co/UArAvjEOBu'

In [119]:
fullTweets.addCode(id,"NON-INDIVIDUAL")

In [120]:
id = ids[49]
fullTweets.getText(id)

"If you wanna vape or use your ecig or whatever that's totally fine with me, you do you. But don't blow that strawberry watermelon popcorn blueberry cheesecake cotton candy poptart flavored nasty shit on me is all I ask. \n\nThanks."

In [121]:
fullTweets.addCode(id,"INDIVIDUAL")

In [122]:
id = ids[50]
fullTweets.getText(id)

'dude is vaping meth https://t.co/gOgABB62aQ'

In [123]:
fullTweets.addCode(id,"INDIVIDUAL")

In [124]:
fullTweets.saveTweets("part5-annotated.json")

### 4. Review the distribution

In [209]:
fullTweets=Tweets()
fullTweets.readTweets("part5-annotated.json")
tweets=[]

for id in fullTweets.getIds():
    codes=fullTweets.getCodes(id)
    if codes is not None:
        if 'INDIVIDUAL' in codes:
            code='INDIVIDUAL'
        else:
            code='NONINDIVIDUAL'
        pair = (fullTweets.getText(id),code)
        tweets.append(pair)

In [210]:
len(tweets)

51

In [211]:
indcount=0
for entry in tweets:
    if entry[1]=='INDIVIDUAL':
        indcount= indcount+1
indcount

24

Looks like 24/51 (about 47%) are individual. For a very small set, that's pretty close to even.
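
Several of the annotated texts above are identical. For example, the "If you wanna vape or use your ecig..." text shows up at four different indexes, most likely because `getText` returns the original text for retweets. The 51 pairs therefore contain fewer distinct texts, and a duplicate that lands in both the train and test sets could distort the evaluation. A minimal sketch of one way to drop repeats before splitting follows. The helper `dedupe_pairs` is hypothetical and the example strings are made up.

```python
def dedupe_pairs(pairs):
    """Keep the first (text, label) pair for each distinct text, preserving order."""
    seen = set()
    unique = []
    for text, label in pairs:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique

# Made-up pairs; the first and third share the same text.
example_pairs = [
    ("trying a new flavor today", "INDIVIDUAL"),
    ("20% off all starter kits", "NONINDIVIDUAL"),
    ("trying a new flavor today", "INDIVIDUAL"),
]
unique_pairs = dedupe_pairs(example_pairs)
print(len(example_pairs), "->", len(unique_pairs))
```

Applied to the notebook's data, this would be called as `dedupe_pairs(tweets)` before the train/test split.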

###  5. Create test/train splits and build a model.

In [132]:
def getTestTrainSplit(pairs,splitFactor=0.8):
    random.shuffle(pairs)
    split=int(len(pairs)*splitFactor)
    train=pairs[:split]
    test =pairs[split:]
    return train,test

In [133]:
train,test=getTestTrainSplit(tweets)

In [134]:
len(train)

40

In [135]:
len(test)

11

In [136]:
def getSplit(itemList):
    counts={}
    for item in itemList:
        cat = item[1] # category is second in the pair.
        if cat not in counts:
            counts[cat]=0
        counts[cat]=counts[cat]+1
    count = len(itemList)
    res=[]
    for cat,c in counts.items():
        ratio = c/count
        res.append((cat,ratio))
    return res

In [138]:
trainSplit = getSplit(train)
trainSplit

[('NONINDIVIDUAL', 0.475), ('INDIVIDUAL', 0.525)]

In [139]:
getSplit(test)

[('NONINDIVIDUAL', 0.7272727272727273), ('INDIVIDUAL', 0.2727272727272727)]

Not a great distribution for the test set (roughly 73% non-individual), but acceptable for a first pass.
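
If the skew in the test set is a concern, one option is a stratified split, which keeps the class proportions roughly equal in both parts. The sketch below swaps in scikit-learn's `train_test_split` with its `stratify` argument in place of the hand-written `getTestTrainSplit`. The wrapper `stratified_split`, the seed, and the toy data are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

def stratified_split(pairs, test_size=0.2, seed=0):
    """Split (text, category) pairs so each category keeps its share in both sets."""
    texts, cats = zip(*pairs)
    train_x, test_x, train_y, test_y = train_test_split(
        list(texts), list(cats),
        test_size=test_size,
        stratify=list(cats),    # preserve class proportions
        random_state=seed,      # fixed seed so the split is repeatable
    )
    return list(zip(train_x, train_y)), list(zip(test_x, test_y))

# Toy data: 10 pairs in each category.
toy_pairs = [("text %d" % i, "INDIVIDUAL" if i % 2 == 0 else "NONINDIVIDUAL")
             for i in range(20)]
toy_train, toy_test = stratified_split(toy_pairs)
print(len(toy_train), len(toy_test))
```

With only about 50 annotated tweets the proportions can only be matched approximately, but the test set would no longer drift as far from the overall balance.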

In [140]:
trainTexts,trainCats=zip(*train)
testTexts,testCats=zip(*test)

In [147]:
from sklearn.feature_extraction.text import TfidfVectorizer

# build the spaCy pipeline once, rather than reloading the model for every tweet
nlp=getTwitterNLP()

def tokenizeText(text):
    tokens=nlp(text)
    return filterTweetTokens(tokens)

vectorizer = TfidfVectorizer(tokenizer=tokenizeText,preprocessor=lambda x: x)
clf = LinearSVC()
pipe = Pipeline([('vectorizer', vectorizer), ('clf', clf)])

In [148]:
pipe.fit(trainTexts,trainCats)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<function <...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [149]:
preds = pipe.predict(testTexts)

In [165]:
print("accuracy:", accuracy_score(testCats, preds))

accuracy: 0.45454545454545453


In [166]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(testCats,preds)

In [167]:
print(cm)

[[1 2]
 [4 4]]


In [173]:
def convertToNumeric(cats):
    nums =[]
    for c in cats:
        if c =='INDIVIDUAL':
            nums.append(1)
        elif c=='NONINDIVIDUAL': # we know that all entries are either 'INDIVIDUAL' or 'NONINDIVIDUAL'
            nums.append(-1)
    return nums

numCats=convertToNumeric(testCats)
numPreds=convertToNumeric(preds)

In [174]:
convertToNumeric(testCats)

[-1, 1, -1, -1, 1, -1, -1, -1, 1, -1, -1]

In [175]:
print(numPreds)

[-1, 1, 1, -1, -1, -1, 1, 1, -1, -1, 1]


In [176]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

print("Precision is "+str(precision_score(numCats,numPreds,average=None)))
print("Recall is "+ str(recall_score(numCats,numPreds,average=None)))

Precision is [0.66666667 0.2       ]
Recall is [0.5        0.33333333]


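Since NONINDIVIDUAL is coded as -1 and the scores are listed in sorted label order (-1, then 1), the first value in each array belongs to the non-individual class. Precision (0.67 vs. 0.20) and recall (0.50 vs. 0.33) are both better for non-individual than for individual tweets.

Looking back at the splits above, the test set is skewed towards non-individual tweets (8 of 11), so this result makes some sense.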
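
As a cross-check, the same precision, recall, and accuracy values can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes the usual scikit-learn layout: rows are true labels, columns are predictions, and string labels are sorted alphabetically, so INDIVIDUAL comes first. This is the reverse of the order in the `precision_score` output, where -1 (non-individual) is listed first.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true INDIVIDUAL, true NONINDIVIDUAL; columns: predicted in the same order.
cm_check = np.array([[1, 2],
                     [4, 4]])
labels = ["INDIVIDUAL", "NONINDIVIDUAL"]

true_pos = np.diag(cm_check)
precision = true_pos / cm_check.sum(axis=0)   # correct / all predicted as that class
recall = true_pos / cm_check.sum(axis=1)      # correct / all truly in that class
accuracy = true_pos.sum() / cm_check.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```

For the individual class this gives precision 1/5 and recall 1/3, and for the non-individual class 4/6 and 4/8, matching the scores reported above.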

### 6. Test on remaining tweets

First, we must find all of the remaining tweets. These will be any tweets that don't have codes associated with them.

In [224]:
fullTweets=Tweets()
fullTweets.readTweets("part5-annotated.json")
remainder = []
for id in fullTweets.getIds():
    codes=fullTweets.getCodes(id)
    if codes is None:
        text=fullTweets.getText(id)
        remainder.append(text)

In [225]:
len(remainder)

540

OK. Now let's run the model on them to get predictions.

In [None]:
fullPreds = pipe.predict(remainder)

How many of these are predicted to be individual, and how many non-individual?
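
One way to answer this is to tally the predicted labels. The sketch below uses made-up predictions rather than the real `fullPreds`, whose values are not shown in this notebook. The helper `summarize_predictions` is a hypothetical name.

```python
from collections import Counter

def summarize_predictions(preds):
    """Map each predicted label to a (count, share of total) tuple."""
    counts = Counter(preds)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Made-up predictions for illustration only.
toy_preds = ["INDIVIDUAL", "NONINDIVIDUAL", "NONINDIVIDUAL",
             "INDIVIDUAL", "NONINDIVIDUAL"]
summary = summarize_predictions(toy_preds)
for label, (n, share) in sorted(summary.items()):
    print(f"{label}: {n} ({share:.0%})")
```

With the real model output, the call would be `summarize_predictions(fullPreds)`. Given the weak test accuracy above, the resulting shares should be read as a rough indication at best.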