# Building a Classifier

**GOALS**

Build and compare several classification models that predict which of the following accounts wrote a given tweet:


```
@iamcardib
@hillaryclinton
@_yiannopoulos
@thrashermag
@fwmagazine
```

To do this, we first review how to retrieve a tweet and build a dataframe from the tweet text.  From there, your goals are to:

1. Build a labeled dataframe containing at least 100 tweets from the five users.  
2. Explore the top 5 retweeted tweets from each user, make a visualization, and discuss what you find.
3. Prepare the data for modeling using a `CountVectorizer` or `TfidfVectorizer`.  Remember to incorporate stop words and n-grams in your work.
4. Use a `LogisticRegression` classifier to determine the user.  How did it perform?
5. Use a Naive Bayes classifier (for example, scikit-learn's `MultinomialNB`) to determine the user.  Did this do better?
6. Use a `DecisionTreeClassifier` to model the tweets.  How did this compare to the other two methods?
7. Build a table that compares the important information about these models.  
8. Suppose your task is to verify whether or not another account was actually Milo (`@_yiannopoulos`) all along.  Which model would you use?  Why?  
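
Once a labeled dataframe exists, goals 4 through 7 could be approached roughly as follows. This is only a sketch: the `toy` dataframe, its made-up text, and the `artist`/`politician` labels are placeholders rather than real tweets, and 2-fold cross-validated accuracy is just one possible choice of "important information" for the comparison table.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the labeled tweet dataframe (invented text, not real tweets).
toy = pd.DataFrame({
    "tweets": [
        "new single out tonight", "studio session with the crew",
        "tour dates announced today", "new video drops friday",
        "vote in the primary today", "register to vote this week",
        "proud of these candidates", "health care is a right",
    ],
    "user": ["artist"] * 4 + ["politician"] * 4,
})

# The three model families named in the goals.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    # Vectorizer and model share one pipeline so each fold refits the vocabulary.
    pipe = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), model
    )
    scores = cross_val_score(pipe, toy["tweets"], toy["user"], cv=2)
    rows.append({"model": name, "mean_accuracy": scores.mean(), "std": scores.std()})

# One row per model, ready to display or extend with more metrics.
comparison = pd.DataFrame(rows).set_index("model")
print(comparison)
```

With only eight toy rows the scores are meaningless; the point is the shape of the loop and the resulting table.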

In [1]:
#auth info
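
The cell above is left empty so that credentials are not published with the notebook. One possible way to fill it is to read the four values from environment variables. The helper name `load_twitter_credentials` and the `TWITTER_*` variable names are assumptions made for this sketch; only the four key names (`consumer_key`, `consumer_secret`, `access_token`, `access_token_secret`) come from the next cell.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Return the four credential strings used by the next cell.

    The environment variable names are hypothetical; use whatever
    names you actually exported in your shell.
    """
    names = {
        "consumer_key": "TWITTER_CONSUMER_KEY",
        "consumer_secret": "TWITTER_CONSUMER_SECRET",
        "access_token": "TWITTER_ACCESS_TOKEN",
        "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
    }
    # Fail early with a readable message instead of a later auth error.
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {key: env[var] for key, var in names.items()}


# Example use (commented out so the sketch runs without real credentials):
# creds = load_twitter_credentials()
# consumer_key = creds["consumer_key"]
# consumer_secret = creds["consumer_secret"]
# access_token = creds["access_token"]
# access_token_secret = creds["access_token_secret"]
```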



In [2]:
import tweepy
import json
from tweepy import OAuthHandler
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth)

In [3]:
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)
    
# The cursor pages through results from the Twitter API;
# here it prints the last 10 tweets on the home timeline.

Paul Manafort was heading to jail, and the judge had a few choice words for him about recent allegations that he’d… https://t.co/dCbmdfeRMb
China’s vaccine scandal is sparking protests and panic among parents https://t.co/HAWePDiSuR https://t.co/e688fMVehw
How bad are the finances for state and local pensions? They have less than three quarters of the money they need to… https://t.co/gQxspCVj2x
Is the third time the charm for the Three Stripes and James Harden? 
https://t.co/WYBf8G6zLW https://t.co/sdrLXFZlen
Some entrepreneurs are meant to stay with their companies long term. But some are destined to start something new o… https://t.co/w8DnkAN1Sm
It’s 2018. Can an app save a business? Absolutely, so long as it works. https://t.co/NITHpSUWJX
The world's biggest toilet-building spree is under way in India https://t.co/m2V8NkvUv7 https://t.co/AE7O0p1rGu
“Esto es solo un comienzo, no un final, de posibles sanciones”, afirma el 
comunicado de la Casa Blanca sobre Nicar… https://t.co/7JayYU

In [4]:
for status in tweepy.Cursor(api.home_timeline).items(1):
    # Process a single status
    print(status._json)

{'created_at': 'Tue Jul 31 01:05:13 +0000 2018', 'id': 1024098623154532352, 'id_str': '1024098623154532352', 'text': 'Officials have ordered more Californian towns to evacuate as fires grow https://t.co/zq83urXmXQ https://t.co/vTssSbPOQa', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/zq83urXmXQ', 'expanded_url': 'https://bloom.bg/2vjI8G8', 'display_url': 'bloom.bg/2vjI8G8', 'indices': [72, 95]}], 'media': [{'id': 1024098619803107329, 'id_str': '1024098619803107329', 'indices': [96, 119], 'media_url': 'http://pbs.twimg.com/media/DjZURNBVAAEWEQY.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DjZURNBVAAEWEQY.jpg', 'url': 'https://t.co/vTssSbPOQa', 'display_url': 'pic.twitter.com/vTssSbPOQa', 'expanded_url': 'https://twitter.com/business/status/1024098623154532352/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 680, 'h': 419, 'resize': 'fit'}, 'large': {'w': 12

In [5]:
def process_or_store(tweet):
    print(json.dumps(tweet))

In [6]:
for status in tweepy.Cursor(api.home_timeline).items(1):
    print(status)

Status(_api=<tweepy.api.API object at 0x10efa2128>, _json={'created_at': 'Tue Jul 31 01:05:13 +0000 2018', 'id': 1024098623154532352, 'id_str': '1024098623154532352', 'text': 'Officials have ordered more Californian towns to evacuate as fires grow https://t.co/zq83urXmXQ https://t.co/vTssSbPOQa', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/zq83urXmXQ', 'expanded_url': 'https://bloom.bg/2vjI8G8', 'display_url': 'bloom.bg/2vjI8G8', 'indices': [72, 95]}], 'media': [{'id': 1024098619803107329, 'id_str': '1024098619803107329', 'indices': [96, 119], 'media_url': 'http://pbs.twimg.com/media/DjZURNBVAAEWEQY.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DjZURNBVAAEWEQY.jpg', 'url': 'https://t.co/vTssSbPOQa', 'display_url': 'pic.twitter.com/vTssSbPOQa', 'expanded_url': 'https://twitter.com/business/status/1024098623154532352/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small'

In [7]:
for tweet in tweepy.Cursor(api.user_timeline, id = "hillaryclinton").items(1):
    print(tweet)

Status(_api=<tweepy.api.API object at 0x10efa2128>, _json={'created_at': 'Fri Jul 27 22:40:58 +0000 2018', 'id': 1022975158536097792, 'id_str': '1022975158536097792', 'text': 'Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for… https://t.co/eSxSVjMh7I', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'BetteMidler', 'name': 'Bette Midler', 'id': 139823781, 'id_str': '139823781', 'indices': [23, 35]}, {'screen_name': 'HelloDollyBway', 'name': 'Hello, Dolly!', 'id': 4785031154, 'id_str': '4785031154', 'indices': [96, 111]}], 'urls': [{'url': 'https://t.co/eSxSVjMh7I', 'expanded_url': 'https://twitter.com/i/web/status/1022975158536097792', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': N

In [8]:
for tweet in tweepy.Cursor(api.user_timeline, id = "hillaryclinton").items(1):
    print(tweet._json['text'])

Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for… https://t.co/eSxSVjMh7I


In [9]:
for tweet in tweepy.Cursor(api.user_timeline, id = "hillaryclinton").items(1):
    print(json.dumps(tweet._json['text']))

"Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for\u2026 https://t.co/eSxSVjMh7I"


In [10]:
for tweet in tweepy.Cursor(api.user_timeline, id = "hillaryclinton").items(1):
    print(json.dumps(tweet._json))

{"created_at": "Fri Jul 27 22:40:58 +0000 2018", "id": 1022975158536097792, "id_str": "1022975158536097792", "text": "Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for\u2026 https://t.co/eSxSVjMh7I", "truncated": true, "entities": {"hashtags": [], "symbols": [], "user_mentions": [{"screen_name": "BetteMidler", "name": "Bette Midler", "id": 139823781, "id_str": "139823781", "indices": [23, 35]}, {"screen_name": "HelloDollyBway", "name": "Hello, Dolly!", "id": 4785031154, "id_str": "4785031154", "indices": [96, 111]}], "urls": [{"url": "https://t.co/eSxSVjMh7I", "expanded_url": "https://twitter.com/i/web/status/1022975158536097792", "display_url": "twitter.com/i/web/status/1\u2026", "indices": [117, 140]}]}, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_re

In [11]:
for tweet in tweepy.Cursor(api.user_timeline, id = "hillaryclinton").items(1):
    print((tweet._json.keys()))

dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])


In [12]:
tweets = []
retweets = []
user = []
for tweet in tweepy.Cursor(api.user_timeline, id = "hillaryclinton").items(2000):
    tweets.append(tweet._json['text'])
    retweets.append(tweet._json['retweet_count'])
    user.append(tweet._json['user']['screen_name'])

In [13]:
import pandas as pd
df = pd.DataFrame({'tweets': tweets, 'retweets': retweets, 'user': user})
df.head()

| | tweets | retweets | user |
|---|---|---|---|
| 0 | Such a pleasure seeing @BetteMidler back where... | 5717 | HillaryClinton |
| 1 | Yesterday was the court-ordered deadline for t... | 18964 | HillaryClinton |
| 2 | From mother to activist to candidate - congrat... | 7944 | HillaryClinton |
| 3 | It was wonderful to spend some time with the t... | 7514 | HillaryClinton |
| 4 | RT @domesticworkers: Miles de niños y niñas si... | 590 | HillaryClinton |


### To Do

- Write a function that takes in usernames and tweet number, and returns a `DataFrame` with the appropriate number of tweets, labeled user, tweet body, retweets, and geo location information.
- Explore top retweets
- Prepare for `sklearn`
- Classification Models
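
The first item could be sketched along these lines. To keep the example runnable offline, the function takes a `fetch_timeline` callable instead of calling Tweepy directly; `build_tweet_frame`, `fetch_timeline`, and `fake_timeline` are names invented for this sketch, and the fake timeline produces placeholder rows, not real tweets. The dictionary keys (`text`, `retweet_count`, `user`, `geo`) are the ones shown in the `dict_keys` output earlier.

```python
import pandas as pd


def build_tweet_frame(usernames, n_tweets, fetch_timeline):
    """Collect up to n_tweets per user into one labeled DataFrame.

    fetch_timeline(username, n) is a hypothetical callable that yields
    dicts shaped like tweet._json.
    """
    rows = []
    for username in usernames:
        for tweet in fetch_timeline(username, n_tweets):
            rows.append({
                "user": tweet["user"]["screen_name"],
                "tweets": tweet["text"],
                "retweets": tweet["retweet_count"],
                "geo": tweet.get("geo"),  # often None
            })
    return pd.DataFrame(rows, columns=["user", "tweets", "retweets", "geo"])


def fake_timeline(username, n):
    """Offline stand-in that fabricates placeholder tweets for testing."""
    for i in range(n):
        yield {
            "user": {"screen_name": username},
            "text": f"placeholder tweet {i}",
            "retweet_count": i,
            "geo": None,
        }


df_all = build_tweet_frame(["user_a", "user_b"], 3, fake_timeline)
print(df_all.shape)  # (6, 4)
```

Against the live API, the callable could wrap the same call used above, for example `lambda name, n: (t._json for t in tweepy.Cursor(api.user_timeline, id=name).items(n))`. Newer Tweepy releases may expect `screen_name=name` in place of `id=name`.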

In [None]:
# Use CountVectorizer
# or TfidfVectorizer
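
As a starting point for this cell, here is a minimal sketch of a `TfidfVectorizer` with English stop words and unigrams plus bigrams. The three `texts` are invented placeholders; in the notebook they would be replaced by the `tweets` column of the dataframe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents standing in for df['tweets'].
texts = [
    "the fires grow near the town",
    "the town votes today",
    "new album out today",
]

# stop_words drops common English words; ngram_range=(1, 2) keeps
# single words and two-word phrases as features.
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = tfidf.fit_transform(texts)

print(X.shape)                   # (number of documents, number of features)
print(sorted(tfidf.vocabulary_))  # the learned unigrams and bigrams
```

`CountVectorizer` accepts the same `stop_words` and `ngram_range` arguments, so swapping it in only changes raw counts versus TF-IDF weights.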

In [14]:
tweets = []
retweets = []
user = []
for tweet in tweepy.Cursor(api.user_timeline, id = "iamcardib").items(2000):
    tweets.append(tweet._json['text'])
    retweets.append(tweet._json['retweet_count'])
    user.append(tweet._json['user']['screen_name'])

In [15]:
import pandas as pd
df = pd.DataFrame({'tweets': tweets, 'retweets': retweets, 'user': user})
df.head()

| | tweets | retweets | user |
|---|---|---|---|
| 0 | Mood https://t.co/burVuT9Apz | 11001 | iamcardib |
| 1 | @CardiMila__ 😎 | 16 | iamcardib |
| 2 | I got a baby i need some money shieeettt i nee... | 1051 | iamcardib |
| 3 | DO YOU SMELL WHAT CARDI IS COOKING ? | 2748 | iamcardib |
| 4 | People love doubting you then when you hit the... | 2023 | iamcardib |
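
With dataframes like the ones above combined into one, the "Explore top retweets" item could be approached with a sort followed by a per-user `head`. The `toy` frame, its user labels, and its retweet counts are invented for illustration, and `top_n` is set to 2 only to keep the output short (the goal asks for 5).

```python
import pandas as pd

# Invented placeholder data with the same columns as the notebook's df.
toy = pd.DataFrame({
    "user": ["user_a"] * 4 + ["user_b"] * 3,
    "tweets": [f"placeholder {i}" for i in range(7)],
    "retweets": [10, 500, 30, 7, 90, 2, 45],
})

top_n = 2  # use 5 for the actual exercise

# Sort by retweets, keep the first top_n rows within each user,
# then order the result by user for readability.
top = (
    toy.sort_values("retweets", ascending=False)
       .groupby("user")
       .head(top_n)
       .sort_values(["user", "retweets"], ascending=[True, False])
)
print(top)
```

A horizontal bar chart of `top` (for example via `top.plot.barh(x="tweets", y="retweets")`) is one possible visualization.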


# Workflow for using tweets to predict the user
## `tfidf.fit_transform` > sparse word matrix > X values > use X to predict the user
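
The workflow in the heading above can be sketched end to end as follows. The training texts and the `artist`/`politician` labels are invented placeholders, and `LogisticRegression` is used only as an example; any of the three classifiers would slot into the same place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented placeholder data standing in for df['tweets'] and df['user'].
train_texts = [
    "new single out tonight", "tour dates announced today",
    "studio session with the crew", "vote in the primary today",
    "register to vote this week", "proud of these candidates",
]
train_users = ["artist"] * 3 + ["politician"] * 3

# Step 1: fit_transform learns the vocabulary and returns a sparse matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(train_texts)

# Step 2: the sparse matrix is the X passed to the classifier.
clf = LogisticRegression(max_iter=1000).fit(X, train_users)

# Step 3: new text is transformed (not re-fit) so columns line up with X.
new_texts = ["remember to vote"]
X_new = tfidf.transform(new_texts)

print(clf.predict(X_new))
print(dict(zip(clf.classes_, clf.predict_proba(X_new)[0])))
```

The per-class probabilities from `predict_proba` are one thing to look at for the verification question in goal 8, since they show how confident the model is rather than just which label it picked.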