
```
---
title: API Review with Foursquare API
type:  lesson + lab + demo
duration: "1:25"
creator:
    name: David Yerrington
    city: SF
---
```
<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px">

#  API Demo / Lab + NLP
Week 8 | 1.3


<img src="https://snag.gy/RNAEgP.jpg" width="600">

Can we correctly identify which of these two old men tweeted what?


## (5 mins) Opening 

Today we are going to attempt to classify wether a tweet comes from Trump, or Sanders.  We are going to:

- Create a developer account on Twitter
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Twitter API Developer Registration

If you haven't registered a Twitter account yet, this is a requirement in order to have a "developer" account.

[Twitter Rest API](https://dev.twitter.com/rest/public)



## Create an "App"

![](https://snag.gy/HPBQbJ.jpg)

We now will now go to Twitter and register an "app" [apps.twitter.com](https://apps.twitter.com/), just like we did for Foursquare.  After we set up our app, we will only need to reference the cooresponding keys Twitter generates for our app.  These are the keys that we will use with our application to communicate with the Twitter API.

## Install Python Twitter API library

Someone was nice enough to build a nice libary for us in Python that we only need to plug in our keys and start collecting data with.  The library we will be using is provided by [Python Twitter Tools](http://mike.verdone.ca/twitter/).

To install it, just run the next frame (there is no conda package).

In [2]:
!pip install twitter python-twitter

Collecting twitter
  Downloading twitter-1.17.1-py2.py3-none-any.whl (55kB)
[K    100% |████████████████████████████████| 61kB 4.2MB/s 
[?25hCollecting python-twitter
  Downloading python-twitter-3.1.tar.gz (80kB)
[K    100% |████████████████████████████████| 81kB 4.2MB/s 
[?25hCollecting future (from python-twitter)
  Downloading future-0.15.2.tar.gz (1.6MB)
[K    100% |████████████████████████████████| 1.6MB 380kB/s 
Collecting requests-oauthlib (from python-twitter)
  Downloading requests_oauthlib-0.6.2-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests-oauthlib->python-twitter)
  Downloading oauthlib-2.0.0.tar.gz (122kB)
[K    100% |████████████████████████████████| 122kB 3.8MB/s 
[?25hBuilding wheels for collected packages: python-twitter, future, oauthlib
  Running setup.py bdist_wheel for python-twitter ... [?25l- done
[?25h  Stored in directory: /Users/davidyerrington/Library/Caches/pip/wheels/22/1e/2e/506871fa7dc610616948e70812d5e2518cd89c13f757b98f6c
  

## Some Boring Twitter Rules

Twitter says they will rate limit your requests:

>When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window — on behalf of your application. This limit is considered completely separately from per-user limits. https://dev.twitter.com/rest/public/rate-limiting

Here's a quick overview of what Twitter says are "the rulez":

![](https://snag.gy/yJ6vIH.jpg)


## About those Keys: OAuth Review

![](https://g.twimg.com/dev/documentation/image/appauth_0.png)

## What's going on here?  Take a minute..

## Our Application Keys

Take note of our application keys that we will be using with our little application that will be connecting to Twitter and mining Tweets from the official Bernie Sanders and Donald Trump twitter accounts.

![](https://snag.gy/H1djQK.jpg)

## Tweet Miner Class Setup

The following code is meant to get us up and running with connectivity to twitter, and the ability to make requests and easily transform the JSON responses to DataFrames.  We will be using object oriented Python in order to organize our code.  We may go into review since this was a topic we covered earlier in the class but we can review it during the lab for those who want to know more about it.


> "request_limit" is used in this class to limit the number of tweets that are pulled per instance request.  Setting it to something lower until you've worked the bugs out of your request, and captured the data you want, is essential to avoiding any rate limit blocks.

#### Key Setup

- **consumer_key** - Find this in your app page under the "Keys and Access Tokens"
- **consumer_secret** - Right under **consumer_key** in the "Keys and Access Tokens" tab
- **access_token_key** - You will need to click the "generate tokens" button to get this
- **access_token_secret** - Also available after "generate tokens" is pressed


In [79]:
twitter_keys = {
    'consumer_key':        'KmN03M1X1pImZ43sqdIu4yfnE',
    'consumer_secret':     'ePlIrIX5VXbZnO7DBu1RbFlw5lOai9dQr9n5TZb6vxnIdrr5Fz',
    'access_token_key':    '185036086-Q7K5IjuSoQZJwSIqD0wyHf6t62iPKatmfaPkriAM',
    'access_token_secret': 'cYpQz3xWHQbLplOj8iSeiNSOmMcsTOXmWcKMrJ9buLj5d'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret']
)

api.GetUserTimeline(?)

In [120]:
import twitter, re, datetime, pandas as pd

class twitterminer():

    request_limit   =   20    
    api             =   False
    data            =   []
    
    twitter_keys = {
        'consumer_key':        'KmN03M1X1pImZ43sqdIu4yfnE',
        'consumer_secret':     'ePlIrIX5VXbZnO7DBu1RbFlw5lOai9dQr9n5TZb6vxnIdrr5Fz',
        'access_token_key':    '185036086-Q7K5IjuSoQZJwSIqD0wyHf6t62iPKatmfaPkriAM',
        'access_token_secret': 'cYpQz3xWHQbLplOj8iSeiNSOmMcsTOXmWcKMrJ9buLj5d'
    }
    
    def __init__(self,  request_limit = 20):
        
        self.request_limit = request_limit
        
        # This sets the twitter API object for use internall within the class
        self.set_api()
        
    def set_api(self):
        
        self.api = twitter.Api(
            consumer_key         =   self.twitter_keys['consumer_key'],
            consumer_secret      =   self.twitter_keys['consumer_secret'],
            access_token_key     =   self.twitter_keys['access_token_key'],
            access_token_secret  =   self.twitter_keys['access_token_secret']
        )

    def mine_user_tweets(self, user="dyerrington", mine_rewteets=False, max_pages=5):

        data           =  []
        last_tweet_id  =  False
        page           =  0
        
        while page <= max_pages:
            
            if last_tweet_id:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.request_limit, max_id=last_tweet_id - 1)        
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.request_limit)
                
            for item in statuses:

                mined = {
                    'tweet_id': item.id,
                    'handle': item.user.name,
                    'retweet_count': item.retweet_count,
                    'text': item.text,
                    'mined_at': datetime.datetime.now(),
                    'created_at': item.created_at,
                }
                
                last_tweet_id =   item.id
                data.append(mined)
                
            page          +=  1
            
        return data

## Does anyone remember how we "instantiate" a new instance of this class?

**Bonus bonus** How do we call the method to *mine_user_tweets()*?

In [121]:
# twitter ids:  realDonaldTrump, berniesanders
# Let's test this out here..
miner = twitterminer(request_limit=200)
sanders = miner.mine_user_tweets(user="berniesanders", max_pages=5)

In [122]:
df = pd.DataFrame(sanders)
df["tweet_id"].value_counts()

758346361393655808    1
776507334738710529    1
743451632650723329    1
768626058622930944    1
770631570642366464    1
768078475387338752    1
750770008154714113    1
738525888141107200    1
753982394617638916    1
733441885218308096    1
745985033626804226    1
731893684354981888    1
736995427652767744    1
757771807701266432    1
758405138558033921    1
749362783213330433    1
777176037776130048    1
751807900549472257    1
765564970763366401    1
737347348574011392    1
733084849230077954    1
741797689369579521    1
740918095661961216    1
770401018198777858    1
745720776842567681    1
749256829117444096    1
771792432384004097    1
770754509379346432    1
738173810071900160    1
770631412756180993    1
                     ..
730756163994705923    1
745756399393603584    1
773593154746482688    1
741393533202894850    1
775792193663406084    1
738021814983491584    1
772608074603384833    1
757777881695133696    1
776882976739168260    1
743601708295659521    1
7713414752212090

##  Now we create some training data

We will have to munge a little bit in order to get our "mined" data from the Twitter API.  

 - Mine Trump Tweets
 - Create DataFrame
 - Mine Sanders Tweets
 - Append to DataFrame

In [41]:
# we only need to "instantiate" once.  Then we can call mine_user_tweets as much as we want.
miner = twitterminer(request_limit=400)
trump_tweets = miner.mine_user_tweets("realDonaldTrump")

In [42]:
trump_df = pd.DataFrame(trump_tweets)
trump_df.head(10)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Mon Sep 19 19:52:00 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072053,3061,Philly FOP Chief On Presidential Endorsement: ...,777958440211771392
1,Mon Sep 19 16:53:42 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072069,6163,Hillary Clinton's weakness while she was Secre...,777913567676866560
2,Mon Sep 19 16:41:15 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072076,10104,Once again someone we were told is ok turns ou...,777910435425226753
3,Mon Sep 19 16:32:32 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072080,10503,Great job once again by law enforcement! We ar...,777908242538196992
4,Mon Sep 19 12:27:28 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072088,6137,"""@TarukMatuk: @CNN @FoxNews @realDonaldTrump @...",777846568741441536
5,Mon Sep 19 12:14:59 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072094,3699,"""@AngPiazza: @foxandfriends @realDonaldTrump ...",777843428449251328
6,Mon Sep 19 11:02:30 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072099,1778,Will be on @foxandfriends at 7:02 A.M. Enjoy.,777825186758418432
7,Mon Sep 19 02:32:03 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072104,11494,"Terrible attacks in NY, NJ and MN this weekend...",777696726933180416
8,Mon Sep 19 02:30:34 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072109,16274,"Under the leadership of Obama &amp; Clinton, A...",777696356211326976
9,Sun Sep 18 21:11:48 +0000 2016,Donald J. Trump,2016-09-19 14:39:37.072117,10959,HAPPY BIRTHDAY - to the United States Air Forc...,777616135856545792


## Any interesting ngrams going on with Trump?

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4))

# Pulls all of trumps tweet text's into one giant string
summaries = "".join(trump_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'https co', 116),
 (u'thank you', 25),
 (u'will be', 13),
 (u'imwithyou https', 13),
 (u'imwithyou https co', 13),
 (u'americafirst https', 12),
 (u'americafirst https co', 12),
 (u'of the', 10),
 (u'to the', 10),
 (u'crooked hillary', 9),
 (u'for the', 9),
 (u'america great', 8),
 (u'maga https', 8),
 (u'maga https co', 8),
 (u'hillary clinton', 8),
 (u'great again', 8),
 (u'america great again', 7),
 (u'make america', 7),
 (u'last night', 6),
 (u'make america great again', 6)]

In [51]:
sanders_tweets = miner.mine_user_tweets("berniesanders")

In [56]:
all_tweets = pd.DataFrame(trump_tweets + sanders_tweets)
all_tweets.handle.value_counts()
# all_tweets.groupby("handle").size()

Bernie Sanders     200
Donald J. Trump    200
Name: handle, dtype: int64

## Preprocessing our Tweets

In order to do classfication recall that we need a set of features.  Our features are literally what our presidential hopefulls say on Twitter. 

We will need to:
- Vectorize input text data
- Intialize a model (let's try Logistic regression)
- Train / Predict / Cross Validate
- Score / Evaluate


In [57]:
from sklearn.linear_model import LogisticRegression

# Preprocess our text data to Tfidf
tfv = TfidfVectorizer(lowercase=True, strip_accents='unicode')
X_all = tfv.fit_transform(all_tweets['text'])

# Setup logistic regression (or try another classification method here)
estimator = LogisticRegression()
estimator.fit(X_all, all_tweets['handle'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Check Prediction vs Random Sanders Tweet

In [58]:
# Prep our source as TfIdf vectors
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# NOTE:  Do not re-initialize the tfidf vectorizor or the feature space willbe overwritten and
# hence your transform will not match the number of features you trained your model on.
#
# This is why you only need to "transform" since you already "fit" previously
#
####

X_all = tfv.transform(source_test)

# Predict using previously trained logist regression `estimator`
estimator.predict_proba(X_all)

array([[ 0.74252626,  0.25747374],
       [ 0.33364651,  0.66635349]])

## Lab Time

We would like you to perform an analysis using a proper cross validation.  Also, try classfication using other models.

### 1. Implement the same analysis using more data.

Experiment with using more data.  The API may not like that you are blowing through their limits so definitely be careful.  Try to grab only what you need 1x, then work on the copy of the objects that are returned.  Read the documents about rate limits and see if you can get enough without hitting the rate limit.  Are there any options availabl in the API to avoid such a problem?

In [None]:
# We deviate from trump / sanders using student tweets here to illustrate the NLP pipeine with twitter data

twitter_handles = ["sayambuiar", "five_virtues", "vnessified"]
tweets = {}

for twitter_handle in twitter_handles:
    print "Mining tweets for: ", twitter_handle
    miner = twitterminer(request_limit=200)
    tweets[twitter_handle] = miner.mine_user_tweets(user=twitter_handle, max_pages=10)


Mining tweets for:  sayambuiar


In [124]:
student_tweets = pd.DataFrame(tweets['sayambuiar'])
student_tweets = student_tweets.append(pd.DataFrame(tweets['five_virtues']))
student_tweets = student_tweets.append(pd.DataFrame(tweets['vnessified']))


In [126]:
student_tweets.handle.value_counts()




Vanessa Grass    2197
Cecilia L.       1022
deniise           590
Name: handle, dtype: int64

In [157]:
# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(3,5), stop_words="english")

# Pulls all of trumps tweet text's into one giant string
summaries = "".join(student_tweets[student_tweets['handle'] == "Vanessa Grass"]['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'san francisco ca', 16),
 (u'francisco ca http', 12),
 (u'san francisco ca http', 12),
 (u'll let know', 4),
 (u'heard good things', 3),
 (u'kolsvein oh yea', 3),
 (u'hq san francisco ca', 3),
 (u'onthefirefly thanks james', 3),
 (u'hq san francisco', 3),
 (u'fav tweet http', 2),
 (u'public speaking workshop', 2),
 (u'plaid collar privilege', 2),
 (u'francisco international airport sfo', 2),
 (u'benedictfritz seanwolter http', 2),
 (u'themillsf san francisco ca', 2),
 (u'lying lack self awareness', 2),
 (u'hbo silicon valley', 2),
 (u'thanks james onthefirefly', 2),
 (u'look like https', 2),
 (u'really going miss', 2)]

In [135]:
tfv = TfidfVectorizer(lowercase=True, strip_accents='unicode', stop_words='english')
X_all = tfv.fit_transform(student_tweets['text'])

In [137]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_all, student_tweets['handle'], test_size=0.33, random_state=42)

In [149]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# estimator = MultinomialNB()
estimator = LogisticRegression()
estimator.fit(X_train, y_train)

probabilities = estimator.predict_proba(X_test)
predictions = estimator.predict(X_test)

In [150]:
estimator.score(X_test, y_test)

0.71280827366746224

In [178]:
from sklearn.metrics import classification_report, confusion_matrix

print classification_report(y_test, predictions)


             precision    recall  f1-score   support

 Cecilia L.       0.94      0.52      0.67       338
Vanessa Grass       0.66      0.99      0.79       695
    deniise       1.00      0.14      0.25       224

avg / total       0.80      0.71      0.66      1257



In [179]:
# Confusion Matrix
print confusion_matrix(y_test, predictions)

[[175 163   0]
 [  6 689   0]
 [  5 187  32]]


In [163]:
probabilities = estimator.predict_proba(X_all)

student_tweets['cecilia_proba'] = probabilities[:, 0]
student_tweets['vanessa_proba'] = probabilities[:, 1]
student_tweets['deniise_proba'] = probabilities[:, 2]

In [169]:
student_tweets[(student_tweets['handle'] == 'Cecilia L.') & (student_tweets['vanessa_proba'] > .5)].sort_values(by="vanessa_proba", ascending=False)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,cecilia_proba,vanessa_proba,deniise_proba
11,Thu Sep 05 06:50:25 +0000 2013,Cecilia L.,2016-09-19 16:18:32.247347,0,http://t.co/Pin2Da9W4m,375511218716286976,0.176034,0.765964,0.058002
47,Tue Feb 08 17:15:50 +0000 2011,Cecilia L.,2016-09-19 16:18:32.247460,0,Haha RT @someecards Wanted to thank you in adv...,35024012887195648,0.138061,0.720472,0.141467
4,Wed Apr 09 06:53:20 +0000 2014,Cecilia L.,2016-09-19 16:18:32.247324,0,♫ Nirvana – Sam Smith http://t.co/uPC7lLsHzB #...,453787730426621952,0.192774,0.696613,0.110613
584,Mon Mar 02 03:59:46 +0000 2009,Cecilia L.,2016-09-19 16:18:33.238442,0,Looking forward to watching the Dark Knight on...,1267920104,0.221683,0.689121,0.089196
9,Sat Oct 05 22:17:03 +0000 2013,Cecilia L.,2016-09-19 16:18:32.247341,0,♫ Sacred and profane by Cecilia Lam http://t.c...,386616049371975680,0.201197,0.685600,0.113203
824,Tue Oct 14 04:06:27 +0000 2008,Cecilia L.,2016-09-19 16:18:33.976438,0,yes SY i am in austin now,958593087,0.221097,0.680583,0.098320
827,Wed Oct 08 15:13:09 +0000 2008,Cecilia L.,2016-09-19 16:18:33.976450,0,just finished watching 2001 space odyssey last...,951389174,0.239252,0.671230,0.089518
463,Thu Aug 06 06:33:56 +0000 2009,Cecilia L.,2016-09-19 16:18:33.237844,0,http://bit.ly/10Va67\n What an astounding and ...,3162305078,0.233047,0.670867,0.096087
699,Mon Dec 29 06:24:30 +0000 2008,Cecilia L.,2016-09-19 16:18:33.657248,0,Just finished watching Superman.,1083531237,0.257338,0.666151,0.076511
88,Sat Dec 11 23:45:44 +0000 2010,Cecilia L.,2016-09-19 16:18:32.247590,0,Fun! http://www.bbc.co.uk/bbcone/wallaceandgr...,13741253984256000,0.254823,0.652025,0.093152


In [172]:
student_tweets[(student_tweets['handle'] == 'deniise') & (student_tweets['vanessa_proba'] > .5)].sort_values(by="vanessa_proba", ascending=False)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,cecilia_proba,vanessa_proba,deniise_proba
26,Fri Aug 19 18:34:45 +0000 2016,deniise,2016-09-19 16:18:29.471354,0,thanks for the suggestion quora https://t.co/m...,766704974759723008,0.071407,0.770974,0.157619
528,Sun Oct 11 22:48:24 +0000 2015,deniise,2016-09-19 16:18:30.740458,1,KATE MCKINNON AS DANA SCULLY http://t.co/xaT97...,653341407097188352,0.198481,0.699471,0.102048
555,Fri Oct 02 08:09:53 +0000 2015,deniise,2016-09-19 16:18:30.740548,0,"Studying Obama's winning digital campaigns, th...",649858830633824256,0.205888,0.695787,0.098325
549,Sat Oct 03 16:30:05 +0000 2015,deniise,2016-09-19 16:18:30.740528,950,"RT @Salon: Thank you, @fart, for one of the fu...",650347095312990208,0.161160,0.695261,0.143579
570,Fri Sep 25 19:56:38 +0000 2015,deniise,2016-09-19 16:18:30.740598,0,When people I've known for months meet my drun...,647499973018300416,0.191042,0.694465,0.114492
376,Sun Jan 31 11:15:14 +0000 2016,deniise,2016-09-19 16:18:30.166440,0,"yo, it's got to be cause I'm seasoned/haters g...",693754404558614528,0.152214,0.685417,0.162369
346,Wed Feb 10 06:28:29 +0000 2016,deniise,2016-09-19 16:18:30.166331,0,"my fav is when they add ""OBVIOUSLY YOU AREN'T ...",697306120977649664,0.143985,0.683889,0.172126
367,Fri Feb 05 23:19:04 +0000 2016,deniise,2016-09-19 16:18:30.166407,0,great Venn diagram of interests here https://t...,695748505596354560,0.128102,0.681980,0.189918
536,Fri Oct 09 01:24:09 +0000 2015,deniise,2016-09-19 16:18:30.740485,0,all right all right all right http://t.co/fI48...,652293440839880704,0.172901,0.677347,0.149752
331,Thu Feb 18 04:00:06 +0000 2016,deniise,2016-09-19 16:18:30.166275,0,important questions: does anyone actually like...,700167883985256448,0.086985,0.673641,0.239374


In [177]:
student_tweets[(student_tweets['handle'] == 'Vanessa Grass') & (student_tweets['cecilia_proba'] > .5)].sort_values(by="cecilia_proba", ascending=False)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,cecilia_proba,vanessa_proba,deniise_proba
1057,Sat Jul 26 19:33:14 +0000 2014,Vanessa Grass,2016-09-19 16:18:37.662977,0,Today is a great day.,493116855783915521,0.660448,0.280438,0.059114
573,Sun Aug 23 20:39:50 +0000 2015,Vanessa Grass,2016-09-19 16:18:36.378094,0,Just booked my flight to Portland for @xoxo an...,635552048277139456,0.617015,0.301389,0.081596
1078,Tue Jul 22 03:38:24 +0000 2014,Vanessa Grass,2016-09-19 16:18:37.663050,0,"@postobject @MakeshiftSoc dang, I work in Sunn...",491427012095926272,0.583399,0.357188,0.059413
587,Thu Aug 13 20:08:30 +0000 2015,Vanessa Grass,2016-09-19 16:18:36.378148,0,Today we got chair massages AND pupusas at wor...,631920282156949504,0.543074,0.359195,0.097731
1485,Tue Apr 29 00:30:05 +0000 2014,Vanessa Grass,2016-09-19 16:18:38.653814,0,Today was a good day because I got to tween so...,460939039810387970,0.533443,0.391663,0.074894


### 2. Implement K-Folds or test/train split.

Double check that you are getting random data before moving forward.  What would happen if you over sample Trump more than Sanders?

### 3. Mine more Tweets that aren't in your data set
Or use the hold-out method to do a proper test.  Refer back to our advanced classification evaluation lesson if you need to.

### 4. Check your classification report
How's precision / recall of your model?

### 5.  Change out your TFIDF vectorizer for CountVectorizer.
How has this impacted your mode performance at all?

### 6.  Implement a different classification method such as random forrests.
Or pick one of your favorites

### 7.  Try to remove stopwords from your text during your preprocessing step

Then double check your classfication report.  Have things improved?

### 8.  Try removing samples that have links or that are obviously just announcements or "noise" that doesn't appear to represent "True" tweets by the authors.

### 9. What are some contrasting words or phrases that you can see between the ngrams for each author?

### 10.  What do you think you can do to improve the scores further?

### 11. **BONUS** Using TextBlob, add a sentiment feature to your dataset.

### 12. BONUS BONUS Apply PCA to your text features
Is this effective? (ie: we could talk about LDA here a little bit)

## Closing

- What where the most impactful changes that helped your models?
- What do you think would happen if we had more Trump Tweets than Sanders?
- What other projects might you think to apply these problems against?