## Why use MongoDB?

This Jupyter Notebbok is a very brief introduction into the use of the MongoDB database to house the Streaming Twitter data as you are pulling from the API. [MongoDB](https://en.wikipedia.org/wiki/MongoDB) is a document oriented database and is not a relational database like Oracle or MySQL. Also known as a [schemaless database](https://www.mongodb.com/blog/post/why-schemaless), this database is ideal for a [JSON](https://en.wikipedia.org/wiki/JSON) object, which is what a tweet is behind the scenes. When working with the Streaming Twitter API, you want a database that can accept the JSON object as it is and immediately insert that object into the database without defining a schema beforehand.

By using MongoDB to house the Streaming Twitter API data, you can essentially have data retrieved, stored and analyzed by simply turning the Twitter faucet on and off. Making Twitter data analysis easy and reproducible. 

## Load the libraries for this Notebook

In [1]:
from __future__ import print_function
import tweepy
import pprint
import json
import datetime
from pymongo import MongoClient 
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import re, string
import string
import config as cf
import codecs

## Set up MongoDB host path/Set Twitter access

In [2]:
WORDS = ["#StarWars","Trump","#sundaymorning"]

MONGO_HOST= 'mongodb://127.0.0.1:27017/twitterdb'

CONSUMER_KEY = cf.CONSUMER_KEY
CONSUMER_SECRET = cf.CONSUMER_SECRET
ACCESS_TOKEN = cf.ACCESS_TOKEN
ACCESS_TOKEN_SECRET = cf.ACCESS_SECRET

## StreamListener Function (Insert into MongoDB while streaming)

This function is the standard Tweepy StreamListener, but has the needed code for inserting into MongoDB as the StreamListener pulls from Twitter

In [3]:
class StreamListener(tweepy.StreamListener):    
 
    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")
 
    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occured: ' + repr(status_code))
        return False
 
    def on_data(self, data):
        #This is the meat of the script...it connects to your mongoDB and stores the tweet
        try:
            
            client = MongoClient(MONGO_HOST)
        
            # Use twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb
    
            # Decode the JSON from Twitter
            datajson = json.loads(data)
            
            #grab the 'created_at' data from the Tweet to use for display
            created_at = datajson['created_at']
            
            #insert the data into the mongoDB into a collection called twitter_search
            #if twitter_search doesn't exist, it will be created.
            db.twitter_search.insert_one(datajson)
            
            return db
        
        except Exception as e:
            return "Exception occured on insert. Continuing to load..."
 

## Call Twitter using Auth

The Twitter API has many rate limit rules. Calling the API using "Auth" allows for larger rates of retrieval of the streaming data in a [shorter time](https://www.karambelkar.info/2015/01/how-to-use-twitters-search-rest-api-most-effectively./).

>...."The solution is to use Application only Auth instead of the Access Token Auth. Application only auth has higher limits, precisely up to 450 request/sec and again with a limitation of requesting maximum 100 tweets per request, this gives a rate of 45,000 tweets/15-min, which is 2.5 times more than the Access Token Limit."

I ran this for approximately one minute and put around 1,800 tweets into the database. This is used in conjuction with "wait_on_rate_limit=True" which allows Tweepy to wait to be able to pull from the Twitter stream again without having to explicitly code for the reconnection.  

In [None]:
#Set auth access
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

#Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
streamer = tweepy.Stream(auth=auth, listener=listener)

#Track Twitter Stream
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)

## Connect to MongoDB to retrieve records (tweets)

In [5]:
#Reintiate connection to MongoDB
client = MongoClient(MONGO_HOST)
        
#Grab the tweets from the MongoDB
tweets = client.twitterdb

In [6]:
#Number of records in MongoDB
tweets.twitter_search.count()

#Delete "All" records in Collection, if needed
#tweets.twitter_search.remove()

1772

## Tweet JSON

The tweet is a JSON object. The below is pulling one random record from the TwitterDB database we just created. As you can see, the tweet contains a lot of metadata that can be leveraged in many different ways. One of the big disadvantages of using a relational database is that you have to define the schema up front. By defining the schema up front, you limit what analyis you can do with the tweet by the data you have selected as important before hand. Since MongoDB stores it's records as JSON objects, the tweet as a whole can be stored and accessed like any record in a relational database. If the analytical questions change as the analysis progresses, you as the analyst have access to the entire tweet object's metadata. 

In [7]:
#Record is in JSON format, which is similiar to Python Dictionary. 
pprint.pprint(tweets.twitter_search.find_one())

{u'_id': ObjectId('5a2de8f57fc89c0268327fa2'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Mon Dec 11 02:09:57 +0000 2017',
 u'entities': {u'hashtags': [],
               u'symbols': [],
               u'urls': [],
               u'user_mentions': [{u'id': 2962868158L,
                                   u'id_str': u'2962868158',
                                   u'indices': [3, 15],
                                   u'name': u'Rep. Don Beyer',
                                   u'screen_name': u'RepDonBeyer'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'low',
 u'geo': None,
 u'id': 940040928806367232L,
 u'id_str': u'940040928806367232',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'is_quote_status': True,
 u'lang': u'en',
 u'place': None,
 u'quote_count': 0,
 u'quoted_status': {u'contributors': None,
              

## Load entire database into a Pandas dataframe

The combination of using MongoDB and the Pandas framework, we can load the data into the database and then put into a Pandas dataframe with relatively litte code. This Pandas dataframe has only the top level elements from the Tweet JSON object and would still require pulling out hashtags by looking deeper into the JSON object in the database. In order to add those deeper level elements, we need to explicitly set those deeper level values and add them to the dataframe. To perform the anlysis we are going to do however, capturing the top level elements is fine. We just need the twwet text. 

In [8]:
#We can load the entire DB into Pandas dataframe to work with data in a relational database way
df=pd.DataFrame(list(tweets.twitter_search.find()))

In [9]:
#First 5 records in df
df.head(5)

Unnamed: 0,_id,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,...,quoted_status_id_str,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,5a2de8f57fc89c0268327fa2,,,Mon Dec 11 02:09:57 +0000 2017,,"{u'user_mentions': [{u'indices': [3, 15], u'sc...",,,0,False,...,9.395425704678441e+17,0,0,False,"{u'quote_count': 504, u'contributors': None, u...","<a href=""http://twitter.com/download/android"" ...",RT @RepDonBeyer: According to his own aides Tr...,1512958197200,False,"{u'follow_request_sent': None, u'profile_use_b..."
1,5a2de8f57fc89c0268327fa3,,,Mon Dec 11 02:09:57 +0000 2017,"[46, 140]","{u'user_mentions': [{u'indices': [0, 15], u'sc...",,"{u'display_text_range': [46, 183], u'entities'...",0,False,...,,0,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",@the_jimmy_dean @Alyssa_Milano @GDouglasJones ...,1512958197117,True,"{u'follow_request_sent': None, u'profile_use_b..."
2,5a2de8f57fc89c0268327fa4,,,Mon Dec 11 02:09:57 +0000 2017,,"{u'user_mentions': [{u'indices': [3, 15], u'sc...",,,0,False,...,9.396370423460659e+17,0,0,False,"{u'quote_count': 16, u'contributors': None, u'...","<a href=""http://twitter.com/download/iphone"" r...",RT @W_C_Patriot: #BrianStelter says calling ou...,1512958197395,False,"{u'follow_request_sent': None, u'profile_use_b..."
3,5a2de8f57fc89c0268327fa5,,,Mon Dec 11 02:09:57 +0000 2017,,"{u'user_mentions': [{u'indices': [3, 16], u'sc...",,,0,False,...,,0,0,False,"{u'quote_count': 5, u'contributors': None, u't...","<a href=""http://twitter.com/download/android"" ...",RT @Trumpfan1995: Have you ever been friends w...,1512958197558,False,"{u'follow_request_sent': None, u'profile_use_b..."
4,5a2de8f57fc89c0268327fa6,,,Mon Dec 11 02:09:57 +0000 2017,,"{u'user_mentions': [{u'indices': [3, 12], u'sc...",,,0,False,...,9.400376338702418e+17,0,0,False,"{u'quote_count': 0, u'contributors': None, u't...","<a href=""http://ctriq.org"" rel=""nofollow"">Trum...",RT @EastinCM: Trump/Moore 2020...perfect blend...,1512958197577,False,"{u'follow_request_sent': None, u'profile_use_b..."


## Perform LDA (Latent Dirichlet allocation) Topic Modeling. 

[LDA analysis](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) can be used for topic modeling. We can take the tweets from our database and see what the general topics are for our tweet corpus. LDA can be summarized as... 

>In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael I. Jordan in 2003.[1] Essentially the same model was also proposed independently by J. K. Pritchard, M. Stephens, and P. Donnelly in the study of population genetics in 2000. Both papers have been highly influential, with 19858 and 20416 citations respectively by August 2017.



In [11]:
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.collocations import *
tweets_texts = df["text"].tolist()
stopwords=stopwords.words('english')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
def process_tweet_text(tweet):
    
    if tweet.startswith('@null'):
        return "[Tweet not available]"
    
    # Remove tickers
    tweet = re.sub(r'\$\w*','',tweet)
    
    # Remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*\/\w*','',tweet) 
    
    # Remove puncutations like 's
    tweet = re.sub(r'['+string.punctuation+']+', ' ',tweet) 
    twtok = TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = twtok.tokenize(tweet)
    tokens = [i.lower() for i in tokens if i not in stopwords and len(i) > 2 and  
                                             i in english_vocab]
    return tokens

words = []
for tw in tweets_texts:
      words += process_tweet_text(tw)

In [12]:
cleaned_tweets = []

#Form sentences of processed words
for tw in tweets_texts:
    cleaned_tweet = " ".join(w for w in words if len(w) > 2 and w.isalpha()) 
    cleaned_tweets.append(cleaned_tweet)
    
#Add new column to df    
df['CleanTweetText'] = cleaned_tweets

## Topics found in this Twitter data

In [14]:
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.collocations import *

#Stop words and remove punctuation
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)

#Maps several words into one common root. 
lemma = WordNetLemmatizer()

#Normalize
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized


texts = [text for text in cleaned_tweets if len(text) > 2]
doc_clean = [clean(doc).split() for doc in texts]
dictionary = corpora.Dictionary(doc_clean)

#Create Doc/Term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

#Create LDA model
ldamodel = models.ldamodel.LdaModel(doc_term_matrix, num_topics=6, id2word = 
dictionary, passes=5)

#Print topics
for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
    print("Topic {}: Words: ".format(topic[0]))
    topicwords = [w for (w, val) in topic[1]]
    print(topicwords)

Topic 0: Words: 
[u'man', u'like', u'trump', u'story', u'false', u'day']
Topic 1: Words: 
[u'day', u'man', u'like', u'idea', u'trump', u'medium']
Topic 2: Words: 
[u'man', u'day', u'idea', u'fake', u'trump', u'one']
Topic 3: Words: 
[u'day', u'man', u'trump', u'fake', u'idea', u'story']
Topic 4: Words: 
[u'man', u'day', u'idea', u'trump', u'people', u'three']
Topic 5: Words: 
[u'man', u'day', u'idea', u'think', u'false', u'call']
