# Do More with Twitter Data

Twitter is what's happening and what people are talking about right now, with hundreds of millions of Tweets sent each day.

## Intro

Often, when people think about analyzing data from Twitter, they think of analyzing Tweet content. While this is a rich collection of data, another important dimension of Twitter data analysis is its *users*.

Twitter users post all sorts of interesting content in Tweets, but they also frequently share information about themselves by way of their account profile. If you visit [this author's profile](https://twitter.com/jrmontag), you'll find a handful of data points that are not Tweet-related, but user-related. Among other things, you might find geographical data, pointers to other websites, and a free-text profile description, e.g. "counts 🐥💬, drinks ☕️, takes 📷, climbs 🗻". This is data that a user may not regularly Tweet about, and which you would miss if you were only looking at their posted content.
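
As a rough illustration of where this user-level data lives, the sketch below pulls profile fields out of a single Tweet represented as a Python dict. It assumes the standard (v1.1-style) Tweet JSON layout, in which a nested `user` object carries `location`, `url`, and `description`; the `sample_tweet` payload and the `profile_fields` helper are hypothetical and exist only for this example.

```python
# Hypothetical, heavily trimmed Tweet payload; real payloads carry many more fields.
sample_tweet = {
    "text": "Morning coffee before the climb.",
    "user": {
        "screen_name": "example_user",
        "location": "Boulder, CO",
        "url": "https://example.com",
        "description": "counts things, drinks coffee, takes photos",
    },
}


def profile_fields(tweet):
    """Return the user-level (not Tweet-level) fields of one Tweet dict."""
    user = tweet.get("user", {})
    return {
        "location": user.get("location"),
        "url": user.get("url"),
        "description": user.get("description"),
    }


profile = profile_fields(sample_tweet)
print(profile["description"])
```

Missing fields come back as `None`, which is convenient because many users leave parts of their profile blank.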


In this notebook, we're going to look at how to use the Twitter Search APIs to collect data around a cultural topic, and then use the resulting data to learn something interesting about the users participating in that discussion. Specifically, we'll look for clusters of similar users among all of the users we identify. Along the way, we'll walk through each stage of the journey: collecting JSON data, processing the relevant elements of each Tweet, engineering features that can be used for model training, and finally, inspecting the results of our models to see what we've learned.


This notebook will follow the outline below:

- data collection
- data inspection
- feature engineering
    - source data
    - preprocessing
    - tokenization
    - stopwords
    - vectorization
- selecting and tuning a model
- inspecting a model
- model iteration

## Environment Setup
First, some imports.

In [1]:
from collections import Counter
import itertools as it
import json
import logging
import os
import re
import string
import sys

from bokeh.plotting import figure, ColumnDataSource, show, output_notebook; output_notebook()
from bokeh.models import HoverTool
from bokeh.palettes import brewer, Viridis256
import hdbscan
import matplotlib.pyplot as plt
from nltk.util import everygrams
from nltk.tokenize.casual import TweetTokenizer
import numpy as np
import pandas as pd
import seaborn as sns
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.decomposition import TruncatedSVD
from stop_words import get_stop_words
from tweet_parser.tweet import Tweet
from searchtweets import load_credentials, gen_rule_payload, collect_results
# from MulticoreTSNE import MulticoreTSNE as TSNE
import yaml



# Data Collection

We'll use the [2019 Amazon Rainforest Wildfires](https://en.wikipedia.org/wiki/2019_Amazon_rainforest_wildfires) as our topic. Ultimately we are interested in those users who are Tweeting about the fire, so we start by looking for relevant Tweets and then we'll dig into the users behind those Tweets.

When in doubt, it's a reasonable strategy to start broad and simple with our rule; in this case we can simply use "amazon". After inspecting the data, we can refine the rule (and the resulting data) to increase its relevance to the task at hand.

In [3]:
import json
import tweepy
from access_keys import ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET
import csv
import datetime

In [12]:
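#Function to extract Tweets
def get_topic_tweets(file_name, topic=None, date_start=None, date_end=None, num_tweets=1000):
    """Search for Tweets matching `topic` and append (created_at, text) rows to a CSV file."""

    #Authorization with consumer key and consumer secret
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

    #Access with the user's access token and access secret
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    #Create the API client (tweepy 3.x signature; tweepy 4 removed
    #wait_on_rate_limit_notify and renamed api.search to api.search_tweets)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    #Open/create a file to append data; the with-block closes it when we are done
    with open(file_name, 'a', newline='', encoding='utf-8') as csvFile:
        csvWriter = csv.writer(csvFile)

        #count is the page size (the standard search API caps it at 100);
        #items(num_tweets) caps the total number of Tweets collected
        cursor = tweepy.Cursor(api.search, q=topic, count=100, lang='en',
                               since=date_start, until=date_end)
        for tweet in cursor.items(num_tweets):
            #Write the text as str; encoding to bytes would store "b'...'" literals in Python 3
            csvWriter.writerow([tweet.created_at, tweet.text])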

In [20]:
get_topic_tweets(file_name="amazon.csv", topic='amazon', date_start='2019-09-12', date_end='2019-09-13', num_tweets=10000)
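
Since the collection step appends rows to a CSV file, one possible way to get the data back into the session is sketched below. It assumes each row holds a timestamp in ISO-like form (for example `2019-09-12 23:59:58`, which is how a Python `datetime` is written by `csv.writer`) followed by the Tweet text; the `load_tweet_rows` helper is hypothetical and not part of tweepy or any other library used here.

```python
import csv
from datetime import datetime


def load_tweet_rows(file_name):
    """Read (created_at, text) rows back from the CSV into a list of dicts."""
    rows = []
    with open(file_name, newline="", encoding="utf-8") as csv_file:
        for record in csv.reader(csv_file):
            if len(record) < 2:
                continue  # skip blank or malformed lines
            rows.append({
                "ts": datetime.fromisoformat(record[0]),  # assumes ISO-like timestamps
                "text": record[1],
            })
    return rows
```

Calling `load_tweet_rows` with the same `file_name` used for collection returns a list that `pd.DataFrame(rows)` would turn into a frame with `ts` and `text` columns.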

# Data Inspection

Great, now we have some data to work with. Importantly, the first step is always to inspect the data. Is it what you were expecting? Is it relevant? Are there sources of noise you can negate in your rule? All of these issues can be addressed by iterating on your filters and inspecting the results.
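
One quick way to judge whether a suspected source of noise is worth excluding is to measure how much of the sample it accounts for. The sketch below is a minimal illustration under stated assumptions: the `match_share` helper, the sample texts, and the noise terms are all hypothetical, and plain substring matching is only a rough stand-in for how a search rule would behave.

```python
def match_share(texts, terms):
    """Fraction of texts containing at least one of `terms` (case-insensitive)."""
    if not texts:
        return 0.0
    lowered_terms = [term.lower() for term in terms]
    hits = sum(
        any(term in text.lower() for term in lowered_terms)
        for text in texts
    )
    return hits / len(texts)


# Hypothetical sample texts and noise terms, for illustration only.
sample_texts = [
    "The Amazon fire is still spreading",
    "Amazon Prime delivery arrived early",
    "Smoke from the Amazon rainforest reaches the city",
    "New Alexa skill on my Amazon Echo",
]
noise_terms = ["prime", "alexa"]
print(match_share(sample_texts, noise_terms))
```

If a large share of the sample matches the noise terms, that is a signal that the rule would benefit from excluding them.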

Additionally, since we intentionally capped the number of total Tweets, it's good to inspect the time series of data to see what range it covers.

Since Tweets are automatically parsed with the [Tweet Parser](https://tw-ddis.github.io/tweet_parser/index.html) in our Python session, we can use some of the convenient attributes to pull out the text data. 

In [None]:
def tweets_to_df(tweets):
    """Helper function to extract specific tweet features into a dataframe."""
    tweet_df = pd.DataFrame({'ts': [t.created_at_datetime for t in tweets],
                              'text': [t.all_text for t in tweets],
                              'uid': [t.user_id for t in tweets], }
                            )
    #Creating a datetimeindex will allow us to do more timeseries manipulations
    tweet_df['ts'] = pd.to_datetime(tweet_df['ts'])
    return tweet_df
                             
                             

In [None]:
tweet_df = tweets_to_df(tweets)

tweet_df.head()

In [None]:
#Plot a time series
(tweet_df[['ts','text']].set_index('ts')
 # 'min' = minute (the older 'T' alias is deprecated in recent pandas)
 .resample('min')
 .count()
 .rename(columns=dict(text='1-minute counts'))
 .plot(figsize=(10,7))
);

With this small sample, let's do a bit of rough text processing to look at the text we're seeing in these Tweets. A simple - and often, informative - first way to inspect the content of text data is through looking at the most common n-grams. In language modeling, an "n-gram" is a contiguous collection of some n items - in languages where appropriate, this is often white-space separated words. For example, two-grams in the sentence "The dog ate my homework" would be "the dog", "dog ate", "ate my", "my homework".
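
The two-gram example above can be reproduced in a few lines of plain Python. This is only a sketch of the idea: the `ngrams` helper is hypothetical, and splitting on whitespace is a simplification compared with a real tokenizer.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "The dog ate my homework".lower().split()
bigrams = ngrams(words, 2)
print(bigrams)
# [('the', 'dog'), ('dog', 'ate'), ('ate', 'my'), ('my', 'homework')]
```

A sentence of five words yields four two-grams, and more generally a list of k tokens yields k - n + 1 n-grams.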

We'll use the all_text attribute of our Tweet objects to simply pull in all the text, regardless of whether it was a Retweet, original Tweet, or Quote Tweet. Then we'll concatenate all the Tweet text together (from the whole corpus), split it up into words using an open-source tokenizer from NLTK (we'll talk more about this, shortly), remove some punctuation, and then simply count the most common set of n-grams.

This is a very rough (but quick) way of getting a feel for the text data we have. If we see content that we don't think is relevant, we can go back and modify our rule.

In [None]:
def get_all_tokens(tweet_list):
    """
    Helper function to generate a list of text tokens from concatenating
    all of the text contained in Tweets in 'tweet_list'
    """
    
    #concat entire corpus
    all_text = ' '.join((t.all_text for t in tweet_list))
    #tokenize
    tokens = (TweetTokenizer(preserve_case=False,
                            reduce_len=True,
                            strip_handles=False)
             .tokenize(all_text))
    #Remove symbol-only tokens for now
    tokens = [tok for tok in tokens if tok not in string.punctuation]
    return tokens

In [None]:
tokens = get_all_tokens(tweets)

print('Total Number Of Tokens: {}'.format(len(tokens)))

In [None]:
#Calculate a range of ngrams using some handy functions
top_grams = Counter(everygrams(tokens, min_len=2, max_len=4))

top_grams.most_common(25)

Using these top n-grams, we can see that the phrases "amazon fire" and "brazil fire" were very common in this sample. If you don't happen to be familiar with the Amazon fires, you may want to inspect those terms a bit more to understand their context.

We can go back to the DataFrame and filter on one of those terms to see what the original content was about.

In [None]:
# Create a filter series matching "amazon fire"
mask = tweet_df['text'].str.lower().str.contains("amazon fire")

# Look at text only from matching rows
tweet_df[mask][['text']].head(10)