 <table><tr><td><img src="images/dbmi_logo.png" width="75" height="73" alt="Pitt Biomedical Informatics logo"></td><td><img src="images/pitt_logo.png" width="75" height="75" alt="University of Pittsburgh logo"></td></tr></table>
 
 
 # Social Media and Data Science - Part 5
 
 
Data science modules developed by the University of Pittsburgh Biomedical Informatics Training Program with the support of the National Library of Medicine data science supplement to the University of Pittsburgh (Grant # T15LM007059-30S1). 

Developed by Harry Hochheiser, harryh@pitt.edu. All errors are my responsibility.

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.


### Goal: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as smoking.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

---

In [1]:
%matplotlib inline

import operator
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime
from spacy.symbols import ORTH, LEMMA, POS

# 5.0 Introduction

This final part of our journey through social media data retrieval, annotation, natural language processing, and classification will challenge you to apply these techniques to a new problem. Specifically, you will create, annotate, and process a new data set.

# 5.0.1 Setup

As before, we start with the Tweets class and the configuration for our Twitter API connection.  We may not need this, but we'll load it in any case.

In [2]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",tweet_mode='extended',count=corpus_size)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            count = self.countTweets()
            time.sleep(30)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def combineTweets(self,other):
        for otherid in other.getIds():
            tweet = other.getTweet(otherid)
            searchTerm = other.getSearchTerm(otherid)
            searchTime = other.getSearchTime(otherid)
            self.addTweet(tweet,searchTime,searchTerm)
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        if 'codes' in tweet:
            return tweet['codes']
        else:
            return None
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary

Put the values of your keys into these variables

In [3]:
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'
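One way to keep keys out of a shared notebook is to read them from environment variables. The sketch below is only an illustration: the function name `load_twitter_credentials` and the four environment variable names are assumptions, not part of the original modules.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Read the four Twitter API values from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    names = {
        'consumer_key': 'TWITTER_CONSUMER_KEY',
        'consumer_secret': 'TWITTER_CONSUMER_SECRET',
        'access_token': 'TWITTER_ACCESS_TOKEN',
        'access_secret': 'TWITTER_ACCESS_SECRET',
    }
    # Fail early, and name every variable that is missing
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary could then supply `consumer_key`, `consumer_secret`, `access_token`, and `access_secret` for the `OAuthHandler` setup in the next cell.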

In [4]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

We will also load some routines that we defined in [Part 3](SocialMedia - Part 3.ipynb):
    
1. Our routine for creating a customized NLP pipeline
2. Our routine for including tokens
3. The `filterTweetTokens` routine defined in an exercise (Without the inclusion of named entities. It will be easier to leave them out for now).

In [5]:
def getTwitterNLP():
    nlp = spacy.load('en')
    
    for word in nlp.Defaults.stop_words:
        lex = nlp.vocab[word]
        lex.is_stop = True
    
    special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
    nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
    nlp.tokenizer.add_special_case(u'E-cigarette', special_case)
    vape_case = [{ORTH: u'vape',LEMMA:u'vape',POS: u'NOUN'}]
    
    vape_spellings =[u'vap',u'vape',u'vaping',u'vapor',u'Vap',u'Vape',u'Vapor',u'Vapour']
    for v in vape_spellings:
        nlp.tokenizer.add_special_case(v, vape_case)
    def hashtag_pipe(doc):
        merged_hashtag = True
        while merged_hashtag == True:
            merged_hashtag = False
            for token_index,token in enumerate(doc):
                if token.text == '#':
                    try:
                        nbor = token.nbor()
                        start_index = token.idx
                        end_index = start_index + len(token.nbor().text) + 1
                        if doc.merge(start_index, end_index) is not None:
                            merged_hashtag = True
                            break
                    except:
                        pass
        return doc
    nlp.add_pipe(hashtag_pipe,first=True)
    return nlp

def includeToken(tok):
    val =False
    if tok.is_stop == False:
        if tok.is_alpha == True: 
            if tok.text =='RT':
                val = False
            elif tok.pos_=='NOUN' or tok.pos_=='PROPN' or tok.pos_=='VERB':
                val = True
        elif tok.text[0]=='#' or tok.text[0]=='@':
            val = True
    if val== True:
        stripped =tok.lemma_.lower().strip()
        if len(stripped) ==0:
            val = False
        else:
            val = stripped
    return val

def filterTweetTokens(tokens):
    filtered=[]
    for t in tokens:
        inc = includeToken(t)
        if inc != False:
            filtered.append(inc)
    return filtered

Finally, we will include some additional modules from Scikit-Learn:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
import string
import re

Now we're ready for an exercise.

Identifying the source of social media comments can be an important step in interpreting a large corpus. Continuing with our example of smoking and vaping, it might be interesting to compare tweets from users (people who are talking about their own personal use) to tweets from those who are either promoting vaping (manufacturers, sponsors, etc.) or warning about its dangers (physicians, researchers, public health agencies, etc.).

A team of researchers at RTI International tackled this problem in a 2017 paper, [Classification of Twitter Users Who Tweet About E-Cigarettes](http://publichealth.jmir.org/2017/3/e63/). Annice Kim and colleagues collected tweets and attributed them to individuals, enthusiasts, "informed agencies" (news media or health community), marketers, or spammers.

Your goal here is to collect a small data set and to attempt a smaller version of this challenge. Specifically, we will try to collect preliminary data for a classifier capable of identifying tweets from users of e-cigarettes vs. others. Using any of the code found in Parts 1-4, complete these steps:

1. Run some searches for tweets like 'e-cig', 'e-cigarette', 'vape' and 'vaping'. Collect a corpus of 200-300  or more tweets. You might want to save each of these result sets in files.

2. Combine these tweets into one large collection using the 'Tweets' class listed above. Save the results in a file.

3. Annotate 50 of these tweets as pertaining to either 'individual' or 'non-individual'. Be sure that you do at least a few of the tweets from each of the original sets. One way to do this might be to randomize the tweets. Save the annotated results in a file. 

4. Review the distribution of the annotations. Is it close to even? If not, annotate more tweets.

5. Take your annotated tweets - split them into train (80%) and test (20%) sets.  Process the train data and build a model (based on a TfIdf Vectorizer and an SVM). Evaluate the model on the test data sets.

6. Test your model on the remaining tweets. What does your result look like?

7. Review some of the data to identify opportunities for improvement - how might you make these models better?

8. Reflect on the reproducibility and the reusability of the code: what should be done to make these tools easier to apply to other datasets?
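
Step 3 asks for annotated tweets from each of the original searches. A minimal sketch of one way to do this is to sample a fixed number of tweet IDs per search term. The function name `sample_per_term` and its parameters are hypothetical; the input is assumed to be a plain dictionary mapping tweet ID to search term.

```python
import random


def sample_per_term(term_by_id, per_term, seed=None):
    """Pick up to `per_term` tweet IDs at random from each search term."""
    rng = random.Random(seed)
    # Group the tweet IDs by the search term that found them
    by_term = {}
    for tweet_id, term in term_by_id.items():
        by_term.setdefault(term, []).append(tweet_id)
    chosen = []
    for term in sorted(by_term):
        ids = sorted(by_term[term])  # sort first so a given seed is repeatable
        rng.shuffle(ids)
        chosen.extend(ids[:per_term])
    # Mix the terms so annotation does not proceed one search at a time
    rng.shuffle(chosen)
    return chosen
```

With the `Tweets` class above, the input could be built as `{i: tweets.getSearchTerm(i) for i in tweets.getIds()}`. Note that after `combineTweets`, a tweet found by several searches keeps only the first search term recorded for it.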



----
*ANSWER FOLLOWS - insert answer here*

1. Run some searches for tweets like 'e-cig', 'e-cigarette', 'vape' and 'vaping'. Collect a corpus of 200-300 or more tweets. You might want to save each of these result sets in files.

In [7]:
tweet_e_cig = Tweets("e-cig",100)
tweet_e_cigarette = Tweets("e-cigarette",100)
tweet_vape = Tweets("vape",100)
tweet_vaping = Tweets("vaping",100)

In [8]:
tweet_e_cig.saveTweets('e-cig.json') 
tweet_e_cigarette.saveTweets('e-cigarette.json') 
tweet_vape.saveTweets('vape.json')
tweet_vaping.saveTweets('vaping.json')

2. Combine these tweets into one large collection using the 'Tweets' class listed above. Save the results in a file.

In [10]:
tweet_e_cig.combineTweets(tweet_e_cigarette)
tweet_e_cig.combineTweets(tweet_vape)
tweet_e_cig.combineTweets(tweet_vaping)
tweet_e_cig.saveTweets('tweets.json')

3. Annotate 50 of these tweets as pertaining to either 'individual' or 'non-individual'. Be sure that you do at least a few of the tweets from each of the original sets. One way to do this might be to randomize the tweets. Save the annotated results in a file.
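
Annotation can also be done in a single loop rather than one pair of cells per tweet. The following is a sketch under stated assumptions: the function `annotate` and its `ask` parameter are hypothetical, and the tweets are assumed to be available as a dictionary mapping tweet ID to text.

```python
def annotate(texts_by_id, ask, labels=('INDIVIDUAL', 'NON-INDIVIDUAL')):
    """Ask for one label per tweet and return {tweet_id: label}.

    `ask` is any function that takes a prompt string and returns the answer;
    in a notebook it could be the built-in `input`.
    """
    coded = {}
    for tweet_id, text in texts_by_id.items():
        prompt = text + "\n" + " / ".join(labels) + "? "
        answer = ask(prompt).strip().upper()
        # Keep asking until the answer is one of the allowed labels
        while answer not in labels:
            answer = ask("Please enter one of " + ", ".join(labels) + ": ").strip().upper()
        coded[tweet_id] = answer
    return coded
```

Each resulting pair could then be stored with `tweets.addCode(tweet_id, label)`. Showing the tweet and recording its label in the same step keeps every label attached to the tweet that was just displayed.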

In [15]:
tweets = Tweets()
tweets.readTweets('tweets.json')

In [17]:
tweets.countTweets()

378

In [82]:
all = list(tweets.getIds())
id = np.random.choice(378,50,replace = False)

Start annotating the randomly selected tweets:

In [85]:
curid = id[0]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@tanamongeau real shit please vape cum on live for us


In [86]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [87]:
curid = id[1]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

£ 48.99 - from £89.00
https://t.co/49puqK4cd2

EUR 58,98 

 Electronic Cigarette TC Vape Box Mod Rofvape Witcher 75W E Cigarette Starter Kit E Shisha Vape Kit All-in-One 510 Thread Ecig Kits | | OTHD 
https://t.co/AJNYoET3Ma 
#vape #witchervape #tcvape #shishavape #allinonevape https://t.co/IyOCgIHebA


In [88]:
tweets.addCode(tid, 'INDIVIDUAL')

In [90]:
curid = id[2]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

100% Authentic VOOPOO Drag 157W TC Box Mod TC mod VW 18650 Battery Temperature Control e-cig 157W 18650 box mod vape NO battery https://t.co/g3HNUcOq0P


In [91]:
tweets.addCode(tid, 'INDIVIDUAL')

In [92]:
curid = id[3]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

There is not yet any research on e-cigarettes and SIDS, but using an e-cigarette appears to be much safer than continuing to smoke; both during pregnancy and once your baby is born. https://t.co/KZ1ntPmQhl


In [93]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [94]:
curid = id[4]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

We’ll release additional data soon on kids use of tobacco. None of the metrics we're seeing are moving in the right direction in relation to e-cig use by kids. We can’t allow a new generation to become addicted to nicotine. We'll be taking new steps to help reverse these trends.


In [95]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [96]:
curid = id[5]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/9JXSc8uUGS https://t.co/gw2XYZW6PB


In [97]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [98]:
curid = id[6]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Call Nurse Leahy at 406-258-3882 and tell her how you feel. #vaping #notblowingsmoke https://t.co/TrGtHBAGcQ


In [99]:
tweets.addCode(tid, 'INDIVIDUAL')

In [100]:
curid = id[7]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Affordable vaping for smokers in poor countries branded 'a human rights issue' https://t.co/clVQj2T0sU https://t.co/gnlZmV4SY8


In [101]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [102]:
curid = id[8]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

RT NIDAnews: An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/KMapfY6JfS https://t.co/0bnMXn1qHM


In [103]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [104]:
curid = id[9]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

clouds are just god’s vape smoke


In [105]:
tweets.addCode(tid, 'INDIVIDUAL')

In [106]:
curid = id[10]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Dude just cruised past me riding a lawn chair taped to an electric skateboard while vaping and blasting Jack Johnson. Now I’m questioning all my life choices. https://t.co/VfFlJZKil4


In [107]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [108]:
curid = id[11]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/9JXSc8uUGS https://t.co/gw2XYZW6PB


In [109]:
tweets.addCode(tid, 'INDIVIDUAL')

In [110]:
curid = id[12]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Re-watching Public Enemies, patiently waiting for Stephen Dorff to bust out his blu e-cig...just in case I missed it the first time around.


In [111]:
tweets.addCode(tid, 'INDIVIDUAL')

In [112]:
curid = id[13]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@DJStrifeGaming vaping outside near Marriott. Any good (bad) ideas


In [113]:
tweets.addCode(tid, 'INDIVIDUAL')

In [114]:
curid = id[14]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

.@SecAzar: "A kid should simply never have an e-cigarette in their mouth, whatever the product is that they might be consuming." https://t.co/WjAFd5jtWW


In [115]:
tweets.addCode(tid, 'INDIVIDUAL')

In [116]:
curid = id[15]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

“are you vaping on the pork chops?”


In [117]:
tweets.addCode(tid, 'INDIVIDUAL')

In [118]:
curid = id[16]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@JPNadda @MoHFW_INDIA @vapeindia It's interesting to see the Australian Political support for the vaping regulatory compliance. I hope the Indian authorities also take such a democratic outlook. Lets give our people a safe alternative.

https://t.co/tYN5MsjYJ0


In [119]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [120]:
curid = id[17]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/9JXSc8uUGS https://t.co/gw2XYZW6PB


In [121]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [122]:
curid = id[18]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Robbers Come Back After Belgian E-Cigarette Shop Owner Tells Them to Come Back Later https://t.co/38iBonZbWk


In [123]:
tweets.addCode(tid, 'INDIVIDUAL')

In [124]:
curid = id[19]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

THE ENTIRE WEED INDUSTRY: were gonna use one kind of vape pen battery. even though represent thousands of companies, all weed oil pods will be compatible with it

APPLE: ooh sorry you’re gonna need 18 adapters for that thing you bought 2 months ago. we’re innovators


In [125]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [126]:
curid = id[20]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

This raspberry cheese cake vape juice is a dessert lmao


In [127]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [128]:
curid = id[21]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Jenks Schools Says E-Cigarette Use Is Growing Among Younger Students - News On 6: https://t.co/YCMrQ7NE61


In [129]:
tweets.addCode(tid, 'INDIVIDUAL')

In [130]:
curid = id[22]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@IanMcMian1 Yes. Luckily I’m white and will be staying away from all crowds and political demonstrations. Not worth getting shot by a cop reaching for my vape


In [131]:
tweets.addCode(tid, 'INDIVIDUAL')

In [132]:
curid = id[23]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

THE ENTIRE WEED INDUSTRY: were gonna use one kind of vape pen battery. even though represent thousands of companies, all weed oil pods will be compatible with it

APPLE: ooh sorry you’re gonna need 18 adapters for that thing you bought 2 months ago. we’re innovators


In [133]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [134]:
curid = id[24]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@shelbylfunk Get an usher. No vaping inside the arena.


In [135]:
tweets.addCode(tid, 'INDIVIDUAL')

In [136]:
curid = id[25]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

.@SecAzar: "A kid should simply never have an e-cigarette in their mouth, whatever the product is that they might be consuming." https://t.co/WjAFd5jtWW


In [137]:
tweets.addCode(tid, 'INDIVIDUAL')

In [138]:
curid = id[26]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

In @AddictionJrnl:
From 2014-2017 in England, e-cigarette use was greater among smokers from higher compared with lower socioeconomic status (SES) groups, but this difference attenuated over time.
https://t.co/AJULHNTXsF https://t.co/KNmsrwfdg5


In [139]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [140]:
curid = id[27]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

So it’s this thing called an e-cig https://t.co/NPHPot6SDe


In [141]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [142]:
curid = id[28]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Citrus never tasted so good. 
#lorange #cartridges #moxie #thccartridges #vaping #vaporization #weedstagram #weed #weedporn #pammj #medicalmarijuana #marijuanadoctor… https://t.co/YherJrxFDP


In [143]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [144]:
curid = id[29]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

E-cigarette disaster - Southern Standard: https://t.co/YCMrQ7NE61


In [145]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [146]:
curid = id[30]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/9JXSc8uUGS https://t.co/gw2XYZW6PB


In [147]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [152]:
curid = id[31]
tid = all[curid]
print(tid)
tweet = tweets.getText(tid)
print(tweet)

1056344494541422592
A poem:

New York in the fall
Makes me feel less alone
You can see everybody's breath
It's almost like
We're all vaping


In [153]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [151]:
curid = id[32]
tid = all[curid]
print(tid)
tweet = tweets.getText(tid)
print(tweet)

1056256485288747015
An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/9JXSc8uUGS https://t.co/gw2XYZW6PB


In [154]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [155]:
curid = id[33]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Anyone see a iPhone10 or Vape from last night at the party? Someone’s saying it’s gone missing and any info would be greatly appreciated!


In [156]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [157]:
curid = id[34]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

High Tech Companies Report Low Earnings; Trouble In E-Cig Paradise For Altria https://t.co/pVI8YUG3WC


In [158]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [159]:
curid = id[35]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

THE ENTIRE WEED INDUSTRY: were gonna use one kind of vape pen battery. even though represent thousands of companies, all weed oil pods will be compatible with it

APPLE: ooh sorry you’re gonna need 18 adapters for that thing you bought 2 months ago. we’re innovators


In [160]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [161]:
curid = id[36]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@FakingDancy @pandorable1968 @thunderbella @mally_da @theRealDawson83 @QueenBeeCanadas @CCfanessa @moseley_carla @DetroitLove88 @JettaAngeli @icandisf @CunningSq16 @CanuckSassy @stephlococcus @lynda424200 @superparentx4 @over_nurse3 @HeartOfGlass_1 Ahh yes. I don't actually *watch* tv most of the time. I'm always doing something. Smoking/vaping, writing, chatting, tweeting, texting, reading, researching, always doing something. Come November 1st at midnight, the tv will be 95% background noise, regardless of the station.


In [162]:
tweets.addCode(tid, 'INDIVIDUAL')

In [163]:
curid = id[37]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Just in - #e-cigarette maker #JUUL spent USD $560k in just the last 3 months lobbying US lawmakers while regulators at the FDA try to crack down on the youth e-cig crisis in the US. More: https://t.co/ZTgjR8nSx1


In [164]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [165]:
curid = id[38]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

🔥Kangertech EVOD MEGA?? ??Starter kit🔥
👉https://t.co/NX2SfBfvrq👈.
1900mAh Capacity USB Charger 2.4 ml Tank
🌎FREE SHIPPING🌎
#buynow #shopping #ebay #gift#e-cig #gear #smoke #ebayseller #ebaystore#onlinestore #electronic #kit #starter #EVOD#Kangertech @eBay_UK


In [166]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [167]:
curid = id[39]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

£ 48.99 - from £89.00
https://t.co/49puqK4cd2

EUR 58,98 

 Electronic Cigarette TC Vape Box Mod Rofvape Witcher 75W E Cigarette Starter Kit E Shisha Vape Kit All-in-One 510 Thread Ecig Kits | | OTHD 
https://t.co/AJNYoET3Ma 
#vape #witchervape #tcvape #shishavape #allinonevape https://t.co/IyOCgIHebA


In [168]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [169]:
curid = id[40]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Cherry Wood Ego Evod Vape stand holder for tanks, battery,vaper and e-cig: $9.95 End Date… https://t.co/KFcZl5qIQk


In [170]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [171]:
curid = id[41]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

THE ENTIRE WEED INDUSTRY: were gonna use one kind of vape pen battery. even though represent thousands of companies, all weed oil pods will be compatible with it

APPLE: ooh sorry you’re gonna need 18 adapters for that thing you bought 2 months ago. we’re innovators


In [172]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [173]:
curid = id[42]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

.@SecAzar: "A kid should simply never have an e-cigarette in their mouth, whatever the product is that they might be consuming." https://t.co/WjAFd5jtWW


In [174]:
tweets.addCode(tid, 'INDIVIDUAL')

In [175]:
curid = id[43]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

100ML "UBLO" E Liquid Juice Vape 0mg £9.99 each buy 2 get 1 #free #UBLO
https://t.co/2Ncti0cIM5 … … … …
#tweetmaster #atsocialmedia #vapecommunity #vapefamily #vapenation #vapeshop  #Vape #cloudchaser #vapelife #vapeporn #JBRT18VAPE #SHORTFILL GREAT TASTE GREAT VALUE https://t.co/buOnja4VLs


In [176]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [177]:
curid = id[44]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Marlboro Maker Altria Pulls Some E-Cig Products https://t.co/CZplUFSU5l


In [180]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [179]:
curid = id[45]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

An analysis on e-cigarette use reveals that they can increase your risk for a heart attack.
https://t.co/9JXSc8uUGS https://t.co/gw2XYZW6PB


In [181]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [183]:
curid = id[46]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

@KahliroKory @Kris_Z_Massey @sullysfca @VettingBernie @TrudeauMacron @DineshDSouza @NRATV @DLoesch @AssataProtege @AnooshMCL @gailborges @thereal_hair @TwumpFaschion @FirstDudeUS @LokiLoptr @GammaRae206 @EriqKunz1 @ShellyRKirchoff @CarmaCreated @TheRealMrAleem You vaping  some real good kush, or WHAT?


In [184]:
tweets.addCode(tid, 'INDIVIDUAL')

In [185]:
curid = id[47]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Today's Roundup: Tiny beauty &amp; protein ribbons; cross-cultural ethics meets #AI ; vaccine confidence declines in EU; Right to Try dissected; good #scicomm advice; CMS approves NC #Medicaid pilot; e-cig maker ramps up lobbying; more: https://t.co/4TziB3rFbE @DCRINews @califf001 https://t.co/6yTCGO2BOm


In [186]:
tweets.addCode(tid, 'NON-INDIVIDUAL')

In [187]:
curid = id[48]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Smokin tree ain’t no vape pens


In [188]:
tweets.addCode(tid, 'INDIVIDUAL')

In [189]:
curid = id[49]
tid = all[curid]
tweet = tweets.getText(tid)
print(tweet)

Voluntary industry action won’t solve the problem of youth e-cig use. We need mandatory @FDATobacco rules that apply to all manufacturers, including a ban on flavors that attract kids and FDA review before new products are introduced. https://t.co/Ems0HBamW3


In [190]:
tweets.addCode(tid, 'INDIVIDUAL')

In [193]:
annotated_id = []
for i in range(50):
    curid = id[i]
    tid = all[curid]
    annotated_id.append(tid)

In [195]:
len(annotated_id)

50

In [200]:
annotated = Tweets()
for ids in annotated_id:
    tweet = tweets.getTweet(ids)
    searchTerm = tweets.getSearchTerm(ids)
    searchTime = tweets.getSearchTime(ids)
    annotated.addTweet(tweet,searchTime,searchTerm)
print(annotated.countTweets())

50


In [201]:
annotated.saveTweets('annotated.json')

4. Review the distribution of the annotations. Is it close to even? If not, annotate more tweets.

In [202]:
annotated.getCodeProfile()

[('NON-INDIVIDUAL', 30), ('INDIVIDUAL', 19)]
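
To judge how far the profile is from even, the counts can be turned into proportions and a majority-class baseline. This is a small sketch; the helper name `label_balance` is hypothetical and it works on a plain list of labels.

```python
from collections import Counter


def label_balance(labels):
    """Return (share per label, majority label, majority share)."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    majority_label, majority_n = counts.most_common(1)[0]
    return shares, majority_label, majority_n / total


# The counts reported in the profile above
labels = ['NON-INDIVIDUAL'] * 30 + ['INDIVIDUAL'] * 19
shares, majority_label, baseline = label_balance(labels)
print(majority_label, round(baseline, 3))
```

With these counts, a classifier that always answers NON-INDIVIDUAL would be right for 30 of 49 tweets, so any model accuracy should be read against that baseline.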

5. Take your annotated tweets - split them into train (80%) and test (20%) sets. Process the train data and build a model (based on a TfIdf Vectorizer and an SVM). Evaluate the model on the test data sets.

In [224]:
from sklearn.feature_extraction.text import TfidfVectorizer

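_twitter_nlp = None

def tokenizeText(text):
    # build the spaCy pipeline once and reuse it; reloading it for every tweet is slow
    global _twitter_nlp
    if _twitter_nlp is None:
        _twitter_nlp = getTwitterNLP()
    tokens=_twitter_nlp(text)
    return filterTweetTokens(tokens)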

def flattenTweets(tweets):
    flat=[]
    for i in tweets.getIds():
        text = tweets.getText(i)
        if tweets.getCodes(i) is not None :
            for x in tweets.getCodes(i):
                cat = x
            pair =(text,cat)
            flat.append(pair)
    return flat

def getTestTrainSplit(pairs,splitFactor=0.8):
    random.shuffle(pairs)
    split=int(len(pairs)*splitFactor)
    train=pairs[:split]
    test =pairs[split:]
    return train,test

def getTestTrain(tweets,splitFactor=0.8):
    tweets = flattenTweets(tweets)
    train,test=getTestTrainSplit(tweets,splitFactor)
    return train,test



In [221]:
vectorizer= TfidfVectorizer(tokenizer=tokenizeText,preprocessor=lambda x: x)
clf = LinearSVC()
pipe = Pipeline([('vectorizer', vectorizer), ('clf', clf)])

In [225]:
train,test=getTestTrain(annotated)

In [229]:
print("train is: ",len(train),"test is :",len(test))

train is:  39 test is : 10


In [230]:
trainTexts,trainCats=zip(*train)
testTexts,testCats=zip(*test)

In [231]:
pipe.fit(trainTexts,trainCats)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<function <...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [232]:
preds = pipe.predict(testTexts)

6. Test your model on the remaining tweets. What does your result look like?

In [243]:
print("accuracy:", accuracy_score(testCats, preds))

accuracy: 0.9


In [244]:
def convertToNumeric(cats):
    nums =[]
    for c in cats:
        if c =='INDIVIDUAL':
            nums.append(1)
        elif c=='NON-INDIVIDUAL':
            nums.append(-1)
    return nums

In [245]:
numCats=convertToNumeric(testCats)
numPreds=convertToNumeric(preds)

In [246]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [247]:
print("Precision is ",precision_score(numCats,numPreds,average=None))
print("Recall is ",recall_score(numCats,numPreds,average=None))

Precision is  [0.85714286 1.        ]
Recall is  [1.   0.75]
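
With `average=None`, scikit-learn returns one score per label in sorted label order, so the first value in each array belongs to -1 (NON-INDIVIDUAL) and the second to 1 (INDIVIDUAL). The sketch below reproduces the reported scores from raw counts. The counts are inferred from the reported precision, recall, and accuracy on the 10 test tweets; they are not printed in the notebook. The helper `precision_recall` is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts inferred from the reported scores (an assumption, see the text above):
# 6 NON-INDIVIDUAL test tweets, all predicted correctly;
# 4 INDIVIDUAL test tweets, 3 correct and 1 predicted as NON-INDIVIDUAL.
non_individual = precision_recall(tp=6, fp=1, fn=0)
individual = precision_recall(tp=3, fp=0, fn=1)
accuracy = (6 + 3) / 10

print(non_individual, individual, accuracy)
```

Read this way, the single error is an INDIVIDUAL tweet classified as NON-INDIVIDUAL. With only 10 test tweets, one error changes accuracy by 0.1, so these scores are rough estimates.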


7. Review some of the data to identify opportunities for improvement - how might you make these models better?

    First, more data should be added to the corpus.
    Second, we should analyze the content of the data. For example, hashtags appear often in tweets about smoking, and NON-INDIVIDUAL accounts such as companies like to use hashtags for advertisements, so hashtag use could serve as a feature. Icons and other features typical of individuals should also be taken into consideration.
    Third, more filtering should be applied; for example, uninformative icons should be dropped.
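
The hashtag idea above can be sketched as a few surface counts per tweet. This is an illustration only: the function `surface_features` and the regular expressions are assumptions, and the patterns are rough approximations of Twitter's own rules.

```python
import re

# Rough patterns (assumptions for this sketch, not Twitter's exact rules)
HASHTAG = re.compile(r'#\w+')
MENTION = re.compile(r'@\w+')
URL = re.compile(r'https?://\S+')


def surface_features(text):
    """Count hashtags, mentions, and links in the raw tweet text."""
    return {
        'hashtags': len(HASHTAG.findall(text)),
        'mentions': len(MENTION.findall(text)),
        'urls': len(URL.findall(text)),
    }


example = ("Call Nurse Leahy at 406-258-3882 and tell her how you feel. "
           "#vaping #notblowingsmoke https://t.co/TrGtHBAGcQ")
print(surface_features(example))
```

Counts like these could be added alongside the TF-IDF features, although whether they help would need to be checked on a larger annotated set.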

8. Reflect on the reproducibility and the reusability of the code: what should be done to make these tools easier to apply to other datasets?

    I think getTwitterNLP() can be reused as is. Most of the other routines above would need to be revised, for example by reducing the number of parameters (such as the search time arguments). Related functionality, such as evaluation, could also be grouped into one class. To make the code more generic, a standard interface should be established and the coupling between the functions should be loosened.
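
One small step toward reproducibility is a train/test split that gives the same result on every run. The sketch below is an assumption-labeled variant of `getTestTrainSplit`: the name `seeded_split` and the `seed` parameter are hypothetical, and unlike the original it shuffles a copy rather than the input list.

```python
import random


def seeded_split(pairs, split_factor=0.8, seed=42):
    """Shuffle a copy of `pairs` with a fixed seed, then split it."""
    rng = random.Random(seed)
    shuffled = list(pairs)  # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split_factor)
    return shuffled[:cut], shuffled[cut:]


# 49 placeholder (text, label) pairs, matching the number of coded tweets above
pairs = [(str(i), 'X') for i in range(49)]
train, test = seeded_split(pairs)
print(len(train), len(test))
```

Recording the seed together with the saved annotation file would allow the same split, and therefore the same scores, to be regenerated later.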

In [223]:
#for x in annotated.getIds():
    #if annotated.getCodes(x) is not None :
#    for y in annotated.getCodes(x) :
#        print(y)
#annotated.getIds()
#str = '1056012093714063366'
#annotated.getCodes(str)

{'NON-INDIVIDUAL'}

*END ANSWER*

---