# Social Media and Human-Computer Interaction - Part 4

### Goal: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as smoking.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - plots and graphs
* [jsonpickle](https://jsonpickle.github.io) for storing tweets. 
---

In [None]:
%matplotlib inline

import operator
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import spacy
import time
from datetime import datetime
from spacy.symbols import ORTH, LEMMA, POS

# 4.0 Introduction

Picking up where  [Part 3](SocialMedia - Part 3.ipynb) left off, this module introduces basics of classification of textual content, using classifiers from the popular [Scikit-learn](http://scikit-learn.org/stable/index.html) machine-learning tools. 

## 4.0.1 Setup

As before, we start with the Tweets class and the configuration for our Twitter API connection.  We may not need this, but we'll load it in any case.

In [None]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary

*Replace the placeholder values below with your own Twitter API credentials.*

In [None]:
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'
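
Credentials pasted into a notebook are easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that you are free to set environment variables on your machine, is to read the four values from the environment. The variable names and the `load_credentials` helper are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(env=os.environ):
    """Return a dict of credentials, raising KeyError if any variable is unset."""
    missing = [var for var in CREDENTIAL_VARS.values() if var not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

With the variables set, `creds = load_credentials()` yields values that can be assigned to `consumer_key`, `consumer_secret`, `access_token`, and `access_secret` before creating the `OAuthHandler`.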

In [None]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

We will also load some routines that we defined in [Part 3](SocialMedia - Part 3.ipynb):
    
1. Our routine for creating a customized NLP pipeline
2. Our routine for including tokens
3. The `filterTweetTokens` routine defined in an exercise (Without the inclusion of named entities. It will be easier to leave them out for now).

In [None]:
def getTwitterNLP():
    nlp = spacy.load('en')
    
    for word in nlp.Defaults.stop_words:
        lex = nlp.vocab[word]
        lex.is_stop = True
    
    special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]
    nlp.tokenizer.add_special_case(u'e-cigarette', special_case)
    nlp.tokenizer.add_special_case(u'E-cigarette', special_case)
    vape_case = [{ORTH: u'vape',LEMMA:u'vape',POS: u'NOUN'}]
    
    vape_spellings =[u'vap',u'vape',u'vaping',u'vapor',u'Vap',u'Vape',u'Vapor',u'Vapour']
    for v in vape_spellings:
        nlp.tokenizer.add_special_case(v, vape_case)
    def hashtag_pipe(doc):
        merged_hashtag = True
        while merged_hashtag == True:
            merged_hashtag = False
            for token_index,token in enumerate(doc):
                if token.text == '#':
                    try:
                        nbor = token.nbor()
                        start_index = token.idx
                        end_index = start_index + len(token.nbor().text) + 1
                        if doc.merge(start_index, end_index) is not None:
                            merged_hashtag = True
                            break
                    except:
                        pass
        return doc
    nlp.add_pipe(hashtag_pipe,first=True)
    return nlp

def includeToken(tok):
    val =False
    if tok.is_stop == False:
        if tok.is_alpha == True: 
            if tok.text =='RT':
                val = False
            elif tok.pos_=='NOUN' or tok.pos_=='PROPN' or tok.pos_=='VERB':
                val = True
        elif tok.text[0]=='#' or tok.text[0]=='@':
            val = True
    if val== True:
        stripped =tok.lemma_.lower().strip()
        if len(stripped) ==0:
            val = False
        else:
            val = stripped
    return val

def filterTweetTokens(tokens):
    filtered=[]
    for t in tokens:
        inc = includeToken(t)
        if inc != False:
            filtered.append(inc)
    return filtered

We will start things off by reading in our two stored sets of tweets, and creating an NLP object for processing the tweets:

In [None]:
vaping=Tweets()
vaping.readTweets("tweets-vaping.json")
print("Number of vaping tweets: "+str(vaping.countTweets()))
smoking=Tweets()
smoking.readTweets("tweets-smoking.json")
print("Number of smoking tweets: "+str(smoking.countTweets()))

Finally, we will include some additional modules from Scikit-Learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
import string
import re

## 4.0.2 An outline for classification

Our examination of basic machine learning from text will address several key tasks:
    
1. Vectorization: Converting text into numerical representations appropriate for machine learning algorithms
2. A classification pipeline.
3. Dividing a dataset into test and train sets. 
4. Training, testing, and evaluating a model.

Many of the ideas below are informed by Nic Schrading's [Intro to NLP with spaCy](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/) and the [scikit-learn Working with Text data tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files).

# 4.1 Vectorization: converting text into numerical representation

Even when tokenized, tweets are little more than glorified collections of text. As machine learning algorithms used for building classifiers operate on numbers, not text, we must convert the collection of words/tokens in each tweet into an alternative form that encodes the information in a tweet numerically. These numeric representations can be used to calculate similarities between items, thus forming the basis for comparisons used in clustering and machine learning.


## 4.1.1 Basic vectorization
The easiest way to do this is to convert each tweet into a *vector* of numbers, one for each word that might possibly show up in any of the tweets. The entry for each word will contain the number of times that that word occurs in the given tweet. This simple representation captures enough information to distinguish between tweets, at the expense of losing some information that might prove valuable for some tasks (try to think about what information a vector might not include - we'll get back to that later).  
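
To make the idea concrete before turning to a library, here is a minimal hand-rolled sketch of count vectorization. It assumes naive lowercase, whitespace-only tokenization, and the `count_vectorize` helper and toy texts are hypothetical illustrations rather than part of the original notebook.

```python
from collections import Counter

def count_vectorize(texts):
    """Build a sorted vocabulary and one count vector per text (whitespace tokens only)."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One entry per vocabulary term, 0 when the term is absent from this text.
        vectors.append([counts.get(term, 0) for term in vocabulary])
    return index, vectors

toy_texts = ["stop smoking now", "smoking or non smoking"]
toy_index, toy_vectors = count_vectorize(toy_texts)
print(toy_index)    # {'non': 0, 'now': 1, 'or': 2, 'smoking': 3, 'stop': 4}
print(toy_vectors)  # [[0, 1, 0, 1, 1], [1, 0, 1, 2, 0]]
```

Every text gets a vector of the same length, and most entries are zero. Word order is not recorded anywhere in the vectors.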

Fortunately, the scikit-learn library makes it easy to convert tokens from a tweet into a vector.  To see how this works, let's start with the text from a few pre-selected tweets.

In [113]:
tweet_ids=['974316600072404992','974317062372974592','974317442796208128','974316896840568833','974316873469841410']
texts=[smoking.getText(t) for t in tweet_ids]

In [114]:
texts

['Welcome to Eternity...Smoking or non smoking ? https://t.co/W9nAU7GEaQ',
 'Cochrane Podcast: Are there any smoking cessation programmes that can help adolescents to stop smoking?https://t.co/cyrRqGZmgV',
 'BTS wasnt going to disband what are u smoking. Show me where it said they were going to disband? only a few of y’all multifandom said u were going to help but we don’t even know if those really voted https://t.co/CfsHoGnxfJ',
 "@hunteroffwitch @RealRedElephant @Taz53556229 Well that's a false analogy. Cigarettes DON'T kill people laying on a table. People who smoke cigarettes are the one's doing harm to themselves. It's not Marlboro's fault, it's your fault for smoking. This is an idiotic comparison.",
 'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe']

We can then take this text and run it through Scikit-learn's `CountVectorizer`, which will turn this text into a vector representation suitable for machine learning. We must first fit the vectorizer to the texts and then use it to transform them:

In [115]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
vectorizer.fit(texts)
vec =vectorizer.transform(texts)

The transformed result has five rows (one for each entry) and one column for each unique term found in any of the entries.  We can see the terms by looking at the vocabulary:

In [116]:
print(vectorizer.vocabulary_)

{'it': 39, 'well': 85, 'doing': 21, 'ncbnoyvzxe': 49, 'marlboro': 45, 'only': 55, 'there': 75, 'show': 64, 'all': 3, 'analogy': 5, 'what': 87, 'we': 83, 'few': 28, 'going': 30, 'taz53556229': 70, 'and': 6, 'comparison': 18, 'your': 90, 'stop': 68, 'min': 47, 'disband': 20, 'cessation': 13, 'smoke': 66, 'fault': 27, 'adolescents': 1, 'people': 57, 'who': 89, 'co': 16, 'smoking': 67, 'bts': 10, 'were': 86, 'the': 72, 'idiotic': 36, 'said': 62, 'ever': 25, 'welcome': 84, 'are': 8, 'or': 56, 'gotta': 31, 'help': 33, 'but': 11, 'programmes': 59, 'realredelephant': 61, '10': 0, 'podcast': 58, 'if': 37, 'an': 4, 'been': 9, 'since': 65, 'even': 24, 'multifandom': 48, 'kill': 40, 'don': 22, 'eternity': 23, 'they': 76, 'know': 41, 'themselves': 73, 'to': 79, 'harm': 32, 'can': 12, 'then': 74, 'not': 51, 'is': 38, 'where': 88, 'cfshognxfj': 14, 'of': 52, 'for': 29, 'hunteroffwitch': 35, 'table': 69, 'me': 46, 'really': 60, 'on': 53, 'ago': 2, 'wasnt': 82, 'w9nau7geaq': 81, 'this': 77, 'cochrane':

And we can look at the results by printing the array version of the transformed vectorizer:

In [117]:
print(vec.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
  0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 2 1 0 0 1
  0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 2 0 1 0 1 0 0 0 1 0 3 0 0 1 1 0
  0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 2 0 1 0 0 1 0 0 0 0
  0 0 0 0 1 0 1 3 1 0 1 1 0 0 2 1 1 0 0]
 [0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 2 0 0 1 0 0 1 1 0 0 0 1 2 0 1 0 0 1 0 0 1
  1 0 1 2 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 2 0 0 0 1 0 0 0 0 1 1 0 1 1 1
  1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1]
 [1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0
  0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0
  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


To interpret this, we can look at the first tweet:

In [118]:
texts[0]

'Welcome to Eternity...Smoking or non smoking ? https://t.co/W9nAU7GEaQ'

Notice that the word 'smoking' occurs twice in this tweet. We can start by finding the position of 'smoking' in the vocabulary:

In [119]:
index = vectorizer.vocabulary_.get("smoking")
print(index)

67


So column 67 holds the count for 'smoking'. Let's look at the value in row 0, column 67.

In [120]:
vec[0,67]

2

The count is 2, as expected!

Try a few more examples to confirm that the counts are working. 
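
One way to check many counts at once is to invert the vocabulary mapping and list the nonzero entries of a row. The sketch below uses a small made-up vocabulary and row with the same shapes as `vectorizer.vocabulary_` and one row of `vec.toarray()`. The `decode_row` helper is hypothetical.

```python
def decode_row(vocabulary, row):
    """Return {term: count} for the nonzero entries of one count vector."""
    # vocabulary maps term -> column; flip it to map column -> term.
    column_to_term = {column: term for term, column in vocabulary.items()}
    return {column_to_term[i]: int(count) for i, count in enumerate(row) if count > 0}

# Toy values, not the real vocabulary from the tweets above.
toy_vocabulary = {"welcome": 3, "eternity": 0, "smoking": 2, "non": 1}
toy_row = [1, 1, 2, 1]
print(decode_row(toy_vocabulary, toy_row))
# {'eternity': 1, 'non': 1, 'smoking': 2, 'welcome': 1}
```

In the notebook, a call such as `decode_row(vectorizer.vocabulary_, vec.toarray()[0])` should list the terms and counts for the first tweet.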

## 4.1.2 TF-IDF Vectorization

Vectorization by count is only one approach that we might take, and direct counts suffer from an important shortcoming: if we leave in all of the prepositions and other low-content words, we might see vectors that are dominated by these words, making comparisons based on more informative words more difficult.  More generally, if we have texts with very similar words, the counts might obscure some of the key differences.


To see how this might work, try the following experiment:

Imagine these two texts:

* "The man went to the store to buy some milk."
* "The man went to the store to buy some coffee."

Clearly, the most interesting difference here is 'milk' vs. 'coffee'. Let's see what happens when we try to vectorize these texts. 

In [121]:
simtexts=["The man went to the store to buy some milk.","The man went to the store to buy some coffee."]
simvectorizer=CountVectorizer()
simvectorizer.fit(simtexts)
simvec =simvectorizer.transform(simtexts)
print(simvec.toarray())

[[1 0 1 1 1 1 2 2 1]
 [1 1 1 0 1 1 2 2 1]]


Note that these vectors are almost identical, differing only in the second and fourth positions.  We can guess that one of these corresponds to 'milk' and the other to 'coffee'. Let's confirm:

In [122]:
print(simvectorizer.vocabulary_.get("milk"))
print(simvectorizer.vocabulary_.get("coffee"))

3
1


As expected. 

For some applications, this similarity might be fine. For others, it might be undesirable. In those cases, a different approach is needed.
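
To put a number on how similar the two count vectors are, one common measure is cosine similarity. The text above does not name a specific measure, so treat this choice as an assumption. The sketch copies the two rows from the output above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

milk_counts = [1, 0, 1, 1, 1, 1, 2, 2, 1]
coffee_counts = [1, 1, 1, 0, 1, 1, 2, 2, 1]
print(round(cosine_similarity(milk_counts, coffee_counts), 4))  # 0.9286
```

A value this close to 1 shows that the shared words dominate the comparison, even though the one word that differs carries most of the meaning.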

Specifically, we might like to have a technique that emphasizes words that are distinctive, downplaying those that are frequently found. In the above example, we might reduce the importance of 'man', 'went', and 'store', focusing instead on 'milk' and 'coffee'. 

This problem was addressed years ago by the information retrieval community, who found that searching for uncommon or distinctive words was more effective than searching for common words. This led to the development of the 'term frequency/inverse document frequency' (tf-idf) model, which uses the frequencies of words across texts to adjust the counts in the vector representation of a text by computing the product of two numbers for each term:

* The 'term frequency' is the number of times a term appears in a text divided by the total number of terms in the text.
* The 'inverse document frequency' is the logarithm of the number of documents divided by the number of documents containing the term.

The term frequency is higher for words that are used frequently in each document, while the inverse document frequency decreases as the number of documents with a term increases. The product of these terms forms the tf-idf score.  The tf-idf score applied to each term in a text (in our case, a tweet), can form a vector analogous to the count vector shown above. 
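
The two definitions above can be sketched directly in a few lines. This is a literal reading of those definitions, assuming the natural logarithm and whitespace tokenization. It is not necessarily the variant that any particular library implements, and the `tf_idf` helper is hypothetical.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook tf-idf: (count / document length) * log(N / documents containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / containing)
    return tf * idf

corpus = [
    "the man went to the store to buy some milk".split(),
    "the man went to the store to buy some coffee".split(),
]
print(tf_idf("milk", corpus[0], corpus))  # 0.1 * ln(2), roughly 0.0693
print(tf_idf("the", corpus[0], corpus))   # 0.0, since 'the' is in every document
```

Under this plain definition, a term that appears in every document gets a score of exactly zero, because the logarithm of 1 is 0.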

The [Wikipedia page on tf-idf](https://en.wikipedia.org/wiki/Tf–idf) provides a reasonably good overview of tf-idf scores and some common variants.

If this were an information retrieval course, the next exercise might be to write a TF-IDF vectorizer, but scikit-learn has one ready to go. It can be used just like the `CountVectorizer`. 

In [123]:
from sklearn.feature_extraction.text import TfidfVectorizer
simvectorizer2=TfidfVectorizer()
simvectorizer2.fit(simtexts)
simvec2 =simvectorizer2.transform(simtexts)
print(simvectorizer2.vocabulary_)
print(simvec2.toarray())

{'milk': 3, 'to': 7, 'store': 5, 'coffee': 1, 'buy': 0, 'man': 2, 'some': 4, 'went': 8, 'the': 6}
[[0.25841146 0.         0.25841146 0.36318829 0.25841146 0.25841146
  0.51682292 0.51682292 0.25841146]
 [0.25841146 0.36318829 0.25841146 0.         0.25841146 0.25841146
  0.51682292 0.51682292 0.25841146]]


If we compare this result to the previous vector, we can notice a few things. 

1. Terms that had a weight of 1 are reduced to 0.258.
2. More frequent terms (weight of 2: 'to' and 'the') are reduced to 0.516.
3. Unique terms - 'milk' and 'coffee' still have a zero weight when they are not seen, but a higher weight when they are seen. 

This vectorization does not eliminate the weights of the frequent terms, but it does reduce their importance relative to the distinctive terms.  With more documents, the relative weight of frequent terms would decrease further. 
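
The weights printed above can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings: raw counts as the term frequency, a smoothed inverse document frequency of `ln((1 + N) / (1 + df)) + 1`, and scaling of each row to unit Euclidean length. The `smoothed_tfidf_row` helper is hypothetical.

```python
import math
from collections import Counter

def smoothed_tfidf_row(document, corpus):
    """Return (sorted vocabulary, L2-normalized tf-idf weights) for one tokenized document."""
    n_docs = len(corpus)
    vocabulary = sorted({tok for doc in corpus for tok in doc})
    counts = Counter(document)
    weights = []
    for term in vocabulary:
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
        weights.append(counts[term] * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return vocabulary, [w / norm for w in weights]

sim_tokens = [
    "the man went to the store to buy some milk".split(),
    "the man went to the store to buy some coffee".split(),
]
terms, row = smoothed_tfidf_row(sim_tokens[0], sim_tokens)
for term, weight in zip(terms, row):
    print(term, round(weight, 8))
```

Because of the `+ 1` in the smoothed formula, terms found in every document keep a nonzero weight. This is why 'the' and 'to' are reduced rather than eliminated.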

To see this, let's go back to the set of 5 tweets:

In [124]:
vectorizer =TfidfVectorizer()
vectorizer.fit(texts)
vec2=vectorizer.transform(texts)
print(vectorizer.vocabulary_)
print(vec2.toarray()[0])

{'it': 39, 'well': 85, 'doing': 21, 'ncbnoyvzxe': 49, 'marlboro': 45, 'only': 55, 'there': 75, 'show': 64, 'all': 3, 'analogy': 5, 'what': 87, 'we': 83, 'few': 28, 'going': 30, 'taz53556229': 70, 'and': 6, 'comparison': 18, 'your': 90, 'stop': 68, 'min': 47, 'disband': 20, 'cessation': 13, 'smoke': 66, 'fault': 27, 'adolescents': 1, 'people': 57, 'who': 89, 'co': 16, 'smoking': 67, 'bts': 10, 'were': 86, 'the': 72, 'idiotic': 36, 'said': 62, 'ever': 25, 'welcome': 84, 'are': 8, 'or': 56, 'gotta': 31, 'help': 33, 'but': 11, 'programmes': 59, 'realredelephant': 61, '10': 0, 'podcast': 58, 'if': 37, 'an': 4, 'been': 9, 'since': 65, 'even': 24, 'multifandom': 48, 'kill': 40, 'don': 22, 'eternity': 23, 'they': 76, 'know': 41, 'themselves': 73, 'to': 79, 'harm': 32, 'can': 12, 'then': 74, 'not': 51, 'is': 38, 'where': 88, 'cfshognxfj': 14, 'of': 52, 'for': 29, 'hunteroffwitch': 35, 'table': 69, 'me': 46, 'really': 60, 'on': 53, 'ago': 2, 'wasnt': 82, 'w9nau7geaq': 81, 'this': 77, 'cochrane':

Let's compare this to the previous vector:

In [125]:
print(vec.toarray()[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0]


You should be able to see a similar trend as some values are scaled up and others are scaled down.

Our goal in this section was to convert each tweet into a numeric representation suitable for use in a classifier. We've now seen two ways to do this. However, you may have noticed that we are doing this without the benefit of any of the spaCy tokenizing that we developed in [Part 3](SocialMedia - Part 3.ipynb). We'll see how we can add this back in the next section.

## 4.1.3 Vectorizing with our tokenizer

Recall from [Part 3](SocialMedia - Part 3.ipynb) that we established a set of routines to create a custom tokenizer to handle hashtags appropriately (the `getTwitterNLP` routine), a routine to only include certain types of tokens (`includeToken`), and a routine to process all tokens (`filterTweetTokens`; as noted above, we are using the version without named entities).  We'd like to find a way to build these processes into the vectorization process. 

Fortunately, this is easy to do. To fit with the way that the scikit-learn vectorizers work, we start by writing a routine that takes a text, calls the spaCy pipeline, and then filters the tokens:

In [126]:
# Build the spaCy pipeline once; reloading the model on every call is slow.
nlp=getTwitterNLP()

def tokenizeText(text):
    tokens=nlp(text)
    return filterTweetTokens(tokens)

Then, we create a vectorizer that uses this routine as the tokenizer:

In [127]:
simvectorizer3=CountVectorizer(tokenizer=tokenizeText, preprocessor=lambda x: x)
simvectorizer3.fit(texts)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function <lambda> at 0x138bc2e18>, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenizeText at 0x138bc2a60>, vocabulary=None)

*Note* the `preprocessor` argument passed to the `CountVectorizer`. This is necessary to override some of the functions in the vectorizer. 

By default, the vectorizer will perform a number of functions, including a preprocessing stage that will perform some string transformations. See the [Count Vectorizer API docs](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) for some details.

Ordinarily, this would be a fine thing to do. However, we are not interested in having scikit-learn do any preprocessing or tokenizing for us, as our `tokenizeText` routine does what we need it to do. So, we need to turn the built-in preprocessing off. To do this, we use the `preprocessor` argument, with this value:

    lambda x: x
    
which defines an anonymous function that simply returns the value it is given. Using this function as the argument effectively turns the preprocessing step into a no-op, ensuring that the `CountVectorizer` will use exactly the tokens that result from `tokenizeText`.
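
To see why this matters, compare the identity function with a preprocessor that lowercases its input, which is one of the things the default preprocessing does when `lowercase=True`. The sample text and the `lowercasing_preprocessor` helper are made up for illustration.

```python
identity = lambda x: x

def lowercasing_preprocessor(text):
    """Stand-in for a preprocessor that lowercases the raw text."""
    return text.lower()

sample = "RT @SomeUser: Vaping and #Smoking"
print(identity(sample))                  # RT @SomeUser: Vaping and #Smoking
print(lowercasing_preprocessor(sample))  # rt @someuser: vaping and #smoking
```

Lowercasing before tokenization would hide the capitalized 'RT' that `includeToken` checks for, and could also affect spaCy's part-of-speech tagging. Passing the identity function leaves the text untouched until our own tokenizer sees it.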

As before, we can look at the vocabulary:

In [128]:
print(simvectorizer3.vocabulary_)

{'podcast': 28, 'get': 15, 'marlboro': 24, 'vote': 37, 'show': 32, 'multifandom': 26, 'analogy': 4, 'kill': 19, 'be': 5, 'eternity': 13, 'know': 20, 'comparison': 10, 'harm': 17, '@taz53556229': 2, 'look': 22, 'stop': 35, 'min': 25, 'cessation': 7, '@hunteroffwitch': 0, 'table': 36, 'disband': 12, 'make': 23, 'smoke': 33, 'fault': 14, 'lay': 21, 'people': 27, 'cochrane': 9, 'cyrrqgzmgv': 11, 'bts': 6, 'cigarette': 8, 'say': 31, '@realredelephant': 1, 'sandwich': 30, 'adolescent': 3, 'smoking': 34, 'welcome': 38, 'help': 18, 'programme': 29, 'go': 16}


and examine the resulting count vectors:

In [129]:
simvec3 =simvectorizer3.transform(texts)
svarray3=simvec3.toarray()
print(simvec3.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0
  0 0 1]
 [0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1
  0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 3 0 1 0 1 0 0 0 0 0 1 0 0 0 0 2 1 0 1 0
  0 1 0]
 [1 1 1 0 1 0 0 0 2 0 1 0 0 0 2 0 0 1 0 1 0 1 0 0 1 0 0 2 0 0 0 0 0 1 1 0
  1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 1
  0 0 0]]


Of course, you can do something similar with a `TfidfVectorizer`. 
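
As a rough sketch of what that might look like, the example below passes a custom tokenizer and the identity preprocessor to `TfidfVectorizer`. So that it runs without the spaCy model, it uses a hypothetical stand-in tokenizer (`simple_tokenizer`) rather than `tokenizeText`. In the notebook you would pass `tokenizer=tokenizeText` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(text):
    """Hypothetical stand-in for tokenizeText: lowercase, split on whitespace, strip punctuation."""
    tokens = [tok.strip(".,?!:;") for tok in text.lower().split()]
    return [tok for tok in tokens if tok]

demo_texts = [
    "The man went to the store to buy some milk.",
    "The man went to the store to buy some coffee.",
]
# The identity preprocessor keeps scikit-learn from altering the text first.
tfidf_vectorizer = TfidfVectorizer(tokenizer=simple_tokenizer, preprocessor=lambda x: x)
tfidf_matrix = tfidf_vectorizer.fit_transform(demo_texts)
print(sorted(tfidf_vectorizer.vocabulary_))
print(tfidf_matrix.shape)
```

scikit-learn may warn that `token_pattern` is ignored when a tokenizer is supplied. That is expected here.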

---
## EXERCISE 4.1: Verification

How would you go about verifying the correctness of this output?  Describe a plan. You can assume, based on our prior work, that the tokenizer works well.

----
*ANSWER BELOW - CUT BELOW HERE*

Here's one possible strategy. For each text in the sample, we will do the following:

1. Get the tokens for the text
2. Convert the list of tokens into a dictionary, holding the number of times each token is found.
3. We will then iterate through the tokens in that dictionary. For each one, we will do the following steps:
    
    * Find the entry in the vectorizer vocabulary.
    * Use that entry to find the corresponding entry in the array from the transformed vectorizer.
    * Ensure the counts match. If they don't, note a discrepancy by storing the token and the count in a list.

In [130]:
def textToCount(text):
    toks =tokenizeText(text)
    counted={}
    for tok in toks:
        if tok not in counted:
            counted[tok]=0
        counted[tok]=counted[tok]+1
    return counted

def checkText(text,vecEntries):
    counted = textToCount(text)
    
    # now, look for these in the vocabulary
    errs=[]
    for tok,count in counted.items():
        pos = simvectorizer3.vocabulary_.get(tok)
        if pos is None:
            errs.append((tok,count))
        else:
            vecCount = vecEntries[pos]
            if vecCount!=count:
                errs.append((tok,count))
    return errs

In [131]:
text=texts[0]
ve=svarray3[0]
errs=checkText(text,ve)

In [132]:
errs

[]

In [133]:
text

'Welcome to Eternity...Smoking or non smoking ? https://t.co/W9nAU7GEaQ'

Now, we can tie it together for all of the texts.

In [134]:
texts

['Welcome to Eternity...Smoking or non smoking ? https://t.co/W9nAU7GEaQ',
 'Cochrane Podcast: Are there any smoking cessation programmes that can help adolescents to stop smoking?https://t.co/cyrRqGZmgV',
 'BTS wasnt going to disband what are u smoking. Show me where it said they were going to disband? only a few of y’all multifandom said u were going to help but we don’t even know if those really voted https://t.co/CfsHoGnxfJ',
 "@hunteroffwitch @RealRedElephant @Taz53556229 Well that's a false analogy. Cigarettes DON'T kill people laying on a table. People who smoke cigarettes are the one's doing harm to themselves. It's not Marlboro's fault, it's your fault for smoking. This is an idiotic comparison.",
 'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe']

In [135]:
def checkTextList(texts,vectorArray):
    errs=[]
    for i in range(len(texts)):
        text = texts[i]
        entries=vectorArray[i]
        errs_i=checkText(text,entries)
        errs.append(errs_i)
    return errs

In [136]:
checkTextList(texts,svarray3)

[[], [], [], [], []]

Looks good! No errors on the five texts in the list. 

For an added challenge, consider how you might verify the TF-IDF vectorizer. This will require recreating the calculations for the term frequency and the inverse document frequency.

*END ANSWER*

---

# 4.2 A Classification pipeline