The Twitter mining code is based on the following references:
- https://github.com/agalea91/twitter_search/blob/master/twitter_search.py
- https://galeascience.wordpress.com/2016/03/18/collecting-twitter-data-with-python/
To use twitter_search.py:
1. Create a Twitter app at https://apps.twitter.com
2. Open twitter_search.py, then
    - In the function load_api(), fill in consumer_key, consumer_secret, access_token, and access_secret. These credentials are available from the Twitter app you created.
    - In the function main(), set time_limit, max_tweets, min_days_old, max_days_old, and USA.
3. Run twitter_search.py from the command prompt.

This notebook will:
1. Convert the scraped .json files into a table that can be opened in Excel.
2. Run a sentiment analysis of the collected tweets.

# 1. JSON to Excel

In [1]:
import os
import json
import glob
import pandas as pd
    
# change working directory
path = r'C:\Users\LW130003\Desktop\twitter_search-master\json files'
os.chdir(path)

# load json files
"""
Here, I cheat a bit, I moved all the .json files to one folder to make it easier for the computer to search.
"""
tweets = []
file_list = glob.glob(os.path.join(os.getcwd(), "*.json"))
for file in file_list:
    with open(file, 'r') as f:
        for line in f.readlines():
            tweets.append(json.loads(line))
            
# print dictionary key and value
for key, value in tweets[0].items():
    print(key, value)

created_at Tue Feb 20 23:14:14 +0000 2018
id 966088636776271877
id_str 966088636776271877
text RT @crazymarolive: #KeatonJones mama got me messed up! https://t.co/yXOyFHAuGY
truncated False
entities {'hashtags': [{'text': 'KeatonJones', 'indices': [19, 31]}], 'symbols': [], 'user_mentions': [{'screen_name': 'crazymarolive', 'name': 'CrazyMaro', 'id': 809133687753342977, 'id_str': '809133687753342977', 'indices': [3, 17]}], 'urls': [], 'media': [{'id': 940404681230217216, 'id_str': '940404681230217216', 'indices': [55, 78], 'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/940404681230217216/pu/img/6vZq_bFKe05fKkRB.jpg', 'media_url_https': 'https://pbs.twimg.com/ext_tw_video_thumb/940404681230217216/pu/img/6vZq_bFKe05fKkRB.jpg', 'url': 'https://t.co/yXOyFHAuGY', 'display_url': 'pic.twitter.com/yXOyFHAuGY', 'expanded_url': 'https://twitter.com/crazymarolive/status/940404959669080065/video/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 

Each file is read line by line, and every line holds one tweet as a JSON object. The full record looks complicated, but we only need a small subset of the available tweet attributes:
- text
- date created
- user name
- number of favorites
- number of retweets
- number of accounts the user follows
- number of followers the user has
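
The selection above can be sketched for a single line of a scraped file. This is a minimal illustration using only the standard library. The helper name `extract_fields` and the sample record are hypothetical, and the date format string is an assumption based on the `created_at` value printed earlier.

```python
import json
from datetime import datetime

# Format of the created_at field as printed above (assumed to be stable)
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def extract_fields(tweet):
    """Keep only the attributes of interest from one tweet dictionary."""
    user = tweet.get("user", {})
    return {
        "text": tweet.get("text", ""),
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "screen_name": user.get("screen_name"),
        "favorite_count": tweet.get("favorite_count", 0),
        "retweet_count": tweet.get("retweet_count", 0),
        "friends_count": user.get("friends_count", 0),
        "followers_count": user.get("followers_count", 0),
    }

# Hypothetical one-line record in the same shape as the scraped .json files
sample_line = json.dumps({
    "created_at": "Tue Feb 20 23:14:14 +0000 2018",
    "text": "example tweet text",
    "retweet_count": 59,
    "favorite_count": 0,
    "user": {"screen_name": "example_user",
             "friends_count": 1569,
             "followers_count": 145},
})

row = extract_fields(json.loads(sample_line))
print(row["created_at"].isoformat(), row["screen_name"], row["retweet_count"])
```

Parsing `created_at` into a `datetime` at this stage is optional, but it makes later sorting or grouping by date simpler than working with the raw string.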

In [2]:
# convert to dataframe

# create a dictionary
data = {'text': [], 'screen_name': [], 'created_at': [],
        'retweet_count': [], 'favorite_count': [],
        'friends_count': [], 'followers_count': []}
 
# loop the tweets and assign it to dictionary
for t in tweets:
    data['text'].append(t['text'])
    data['screen_name'].append(t['user']['screen_name'])
    data['created_at'].append(t['created_at'])
    data['retweet_count'].append(t['retweet_count'])
    data['favorite_count'].append(t['favorite_count'])
    data['friends_count'].append(t['user']['friends_count'])
    data['followers_count'].append(t['user']['followers_count'])

# create dataframe
df = pd.DataFrame(data)
print(df.shape)
df.head()

# export dataframe
#df.to_csv(r'C:\Users\LW130003\Desktop\twitter_keaton_jones.csv', index=False)

(103, 7)


Unnamed: 0,created_at,favorite_count,followers_count,friends_count,retweet_count,screen_name,text
0,Tue Feb 20 23:14:14 +0000 2018,0,145,1569,59,ROBINEYOUNG1,RT @crazymarolive: #KeatonJones mama got me me...
1,Tue Feb 20 23:07:07 +0000 2018,0,1145,2445,2,DeanoTheBeano72,This poor soul. Kids can be so bloody cruell. ...
2,Tue Feb 20 22:46:42 +0000 2018,0,12,229,607,Desiflo33,RT @iamlatocha: i need each and everyone of ya...
3,Tue Feb 20 21:18:08 +0000 2018,0,64,79,4,KadedollaDolla,RT @BrownEyeEnt: #Petty (Ep.23) (BULLY ADDITIO...
4,Tue Feb 20 14:52:17 +0000 2018,0,6202,5682,2008,thatsmydea,RT @ufc: If any of you have any info on how to...


# 2. Sentiment Analysis

In this section we will do:
1. Natural Language Processing
2. Topic Modeling

## 2.1 Natural Language Processing

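Natural language processing here consists of tokenizing the text and removing stopwords (words that appear very commonly in sentences, such as I, he, she, is, and are). Tokens that are shorter than three characters or that are not found in the NLTK English word list are also dropped.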
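
To make the idea concrete before the NLTK version, here is a minimal sketch that uses only the standard library. The stopword set is a tiny hand-written stand-in for NLTK's English list, and `simple_clean` is a hypothetical helper, not part of the notebook's pipeline.

```python
import re
import string

# Tiny illustrative stopword list (assumption); NLTK's English list is much longer
STOPWORDS = {"i", "he", "she", "is", "are", "the", "a", "so", "can", "be", "this"}

# Translation table that turns every punctuation character into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def simple_clean(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase, strip URLs and punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove hyperlinks
    text = text.translate(PUNCT_TO_SPACE)      # remove punctuation
    return [t for t in text.split() if t not in stopwords and len(t) >= min_len]

print(simple_clean("This poor soul. Kids can be so bloody cruel! https://t.co/abc"))
```

Lowercasing before the stopword check means that capitalized words such as "This" are matched against the lowercase stopword list.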

In [3]:
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import re, string
import nltk
tweets_texts = df["text"].tolist()
stopwords=stopwords.words('english')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

# Tokenization
def tokenize_tweet(text):
    """
    Process a tweet and tokenize it.
    input: tweet text
    output: list of tokens
    """
    if text.startswith('@null'):
        return "[Tweet not available]"
    text = re.sub(r'\$\w*', '', text) # Remove tickers
    text = re.sub(r'https?:\/\/.*\/\w*', '', text) # Remove hyperlinks
    text = re.sub(r'['+string.punctuation+']+', ' ', text) # Remove punctuation
    
    # Tokenize text
    tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = tweet_tokenizer.tokenize(text)
    
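    # keep tokens that are not stopwords, are longer than 2 characters, and are in the English vocabulary, then lowercase them
    # (the checks run before lowercasing, so tokens containing capital letters are dropped because english_vocab is all lowercase)
    tokens = [i.lower() for i in tokens if i not in stopwords and len(i) > 2 and i in english_vocab]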
    
    return tokens

# Clean tweets and put it in dataframe
cleaned_tweets = []
for t in df.text:
    words = tokenize_tweet(t)
    cleaned_tweet = " ".join(w for w in words if len(w) > 2 and w.isalpha()) #Form sentences of processed words
    cleaned_tweets.append(cleaned_tweet)
df['CleanTweetText'] = cleaned_tweets
df.head()

Unnamed: 0,created_at,favorite_count,followers_count,friends_count,retweet_count,screen_name,text,CleanTweetText
0,Tue Feb 20 23:14:14 +0000 2018,0,145,1569,59,ROBINEYOUNG1,RT @crazymarolive: #KeatonJones mama got me me...,got
1,Tue Feb 20 23:07:07 +0000 2018,0,1145,2445,2,DeanoTheBeano72,This poor soul. Kids can be so bloody cruell. ...,poor soul bloody school got stage
2,Tue Feb 20 22:46:42 +0000 2018,0,12,229,607,Desiflo33,RT @iamlatocha: i need each and everyone of ya...,need everyone repost
3,Tue Feb 20 21:18:08 +0000 2018,0,64,79,4,KadedollaDolla,RT @BrownEyeEnt: #Petty (Ep.23) (BULLY ADDITIO...,bully racist bully petty
4,Tue Feb 20 14:52:17 +0000 2018,0,6202,5682,2008,thatsmydea,RT @ufc: If any of you have any info on how to...,connect please let know


In [4]:
# Text exploration: bigrams
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
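# NOTE: `words` is the leftover loop variable from the previous cell, i.e. the tokens of the last tweet only
finder = BigramCollocationFinder.from_words(words, 5)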
finder.apply_freq_filter(5)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[]
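
The empty list is likely because `words` holds only the tokens of the last tweet processed in the previous cell, so no pair reaches the frequency threshold of 5. As a rough alternative, the sketch below counts adjacent word pairs across all documents with a plain `Counter`. It does not use NLTK's likelihood-ratio scoring, and the sample documents are hypothetical.

```python
from collections import Counter

def count_bigrams(documents):
    """Count adjacent word pairs across all documents (space-separated strings)."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        # pair each token with its successor; pairs never span two documents
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical cleaned tweets, in the same form as the CleanTweetText column
docs = [
    "bully racist bully petty",
    "bully racist bully petty",
    "heartbreaking message school",
]
bigrams = count_bigrams(docs)
print(bigrams.most_common(3))
```

With the notebook's data, `count_bigrams(cleaned_tweets)` would give counts over the whole collection rather than a single tweet.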


## 2.2 Topic Modeling
In this section, we will:
1. Use a TF-IDF vectorizer to vectorize the text
2. Use KMeans and LDA to cluster the text data.

### 2.2.1 KMeans

In [5]:
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True) #to show plotted graph
from imageio import imread
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline
import base64 
import codecs
import nltk
import sklearn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer  
from sklearn.metrics.pairwise import cosine_similarity  

In [6]:
# Use TFIDF vectorizer to convert tweet into a vector
tfidf_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1,3))  
tf = tfidf_vectorizer.fit_transform(cleaned_tweets)  
feature_names = tfidf_vectorizer.get_feature_names_out()  # vocabulary of 1- to 3-grams (use get_feature_names() on scikit-learn < 1.0)

# Check similarity
dist = 1 - cosine_similarity(tf)  
print(dist) 

[[  0.00000000e+00   7.57199721e-01   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   1.00000000e+00]
 [  7.57199721e-01  -8.88178420e-16   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   1.00000000e+00]
 [  1.00000000e+00   1.00000000e+00  -2.22044605e-16 ...,   1.00000000e+00
    1.00000000e+00   1.00000000e+00]
 ..., 
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   1.00000000e+00]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00 ...,   1.00000000e+00
   -2.22044605e-16   1.00000000e+00]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   0.00000000e+00]]
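
To read this matrix: entry (i, j) is the cosine distance between tweets i and j. A value of 1.0 means the two tweets share no terms. Diagonal values such as -2.22e-16 are zero up to floating-point rounding. The sketch below computes the same quantity by hand on hypothetical term weights stored as dictionaries. Returning 1.0 for an empty document is a convention chosen for this sketch.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for empty documents (assumption)
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical term weights for three short documents
doc_a = {"bully": 0.8, "racist": 0.6}
doc_b = {"bully": 0.6, "petty": 0.8}
doc_c = {"help": 1.0}

print(cosine_distance(doc_a, doc_b))  # one shared term: strictly between 0 and 1
print(cosine_distance(doc_a, doc_c))  # no shared terms
print(cosine_distance(doc_a, doc_a))  # identical documents: zero up to rounding
```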


In [7]:
# Plot top words frequency

count_vec = np.asarray(tf.sum(axis=0)).ravel()
zipped = list(zip(feature_names, count_vec))
x, y = (list(x) for x in zip(*sorted(zipped, key=lambda x: x[1], reverse=True)))
# Extract the top 15 and bottom 15 terms (X and Y are not used in the plots below)
Y = np.concatenate([y[0:15], y[-15:]])
X = np.concatenate([x[0:15], x[-15:]])

# Plotting the Plot.ly plot for the Top 50 word frequencies
data = [go.Bar(
            x = x[0:50],
            y = y[0:50],
            marker= dict(colorscale='Jet',
                         color = y[0:50]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 Word frequencies'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

# Plotting the Plot.ly plot for the Bottom 100 word frequencies
data = [go.Bar(
            x = x[-100:],
            y = y[-100:],
            marker= dict(colorscale='Portland',
                         color = y[-100:]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Bottom 100 Word frequencies'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')


In [8]:
from sklearn.cluster import KMeans  
num_clusters = 3  
km = KMeans(n_clusters=num_clusters)  
km.fit(tf)  
clusters = km.labels_.tolist()  
df['ClusterID'] = clusters  
print(df['ClusterID'].value_counts())

# for each cluster, sort the term indices by their weight in the centroid, largest first
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster {} : Words :".format(i))
    for ind in order_centroids[i, :10]: 
        print(' %s' % feature_names[ind])

0    76
1    21
2     6
Name: ClusterID, dtype: int64
Cluster 0 : Words :
 help hate
 help
 hate
 las
 kids
 got
 responsibility
 responsibility kids
 much
 heart
Cluster 1 : Words :
 bully
 petty
 bully racist bully
 bully racist
 bully petty
 racist bully petty
 racist bully
 racist
 heartbreaking message school
 hate
Cluster 2 : Words :
 message school
 backing
 heartbreaking message school
 heartbreaking message
 heartbreaking
 entire
 entire backing heartbreaking
 backing heartbreaking message
 backing heartbreaking
 entire backing


The result is:
    1. Cluster 0 contains words that show compassion for Keaton Jones.
    2. Cluster 1 contains words that show disapproval of bullying and racism.
    3. Cluster 2 contains words used to spread the news about Keaton Jones.
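
The word lists behind this interpretation come from ranking the terms of each cluster centroid by weight. A minimal sketch of that ranking step in plain Python follows. The vocabulary and centroid weights are hypothetical and not taken from the fitted model.

```python
def top_terms(centroid, vocab, n=3):
    """Return the n terms with the largest weight in one centroid vector."""
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [vocab[i] for i in order[:n]]

# Hypothetical vocabulary and centroid weights (illustration only)
vocab = ["bully", "help", "message", "racist", "school"]
centroids = [
    [0.05, 0.70, 0.00, 0.02, 0.01],
    [0.60, 0.03, 0.00, 0.45, 0.00],
    [0.00, 0.01, 0.55, 0.00, 0.50],
]
for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, top_terms(centroid, vocab, n=2))
```

Because KMeans starts from random centroids, cluster numbering and membership can change between runs unless a fixed `random_state` is passed to `KMeans`.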

### 2.2.2 Latent Dirichlet Allocation

In [9]:
# Latent Dirichlet Allocation
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
texts = [text for text in cleaned_tweets if len(text) > 2]
doc_clean = [clean(doc).split() for doc in texts]
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
ldamodel = models.ldamodel.LdaModel(doc_term_matrix, num_topics=3,
                                    id2word=dictionary, passes=5)
for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
    print("Topic {}: Words: ".format(topic[0]))
    topicwords = [w for (w, val) in topic[1]]
    print(topicwords)


detected Windows; aliasing chunkize to chunkize_serial



Topic 0: Words: 
['school', 'backing', 'heartbreaking', 'entire', 'message', 'got']
Topic 1: Words: 
['bully', 'racist', 'petty', 'get', 'truth', 'god']
Topic 2: Words: 
['baby', 'hate', 'help', 'people', 'much', 'kid']


The result is:
    1. Topic 0 contains words used to spread the news about Keaton Jones.
    2. Topic 1 contains words that show disapproval of bullying and racism.
    3. Topic 2 contains words that show compassion for Keaton Jones.
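
To relate topics back to individual tweets, one option is to assign each document the topic with the highest probability. The sketch below works on plain `(topic_id, probability)` pairs, which is the shape gensim returns for a document's topic distribution. The numbers are made up, and `dominant_topic` is a hypothetical helper.

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id

# Made-up topic distributions for four documents (illustration only)
doc_topic_lists = [
    [(0, 0.10), (1, 0.85), (2, 0.05)],
    [(0, 0.70), (1, 0.20), (2, 0.10)],
    [(0, 0.05), (1, 0.15), (2, 0.80)],
    [(0, 0.20), (1, 0.75), (2, 0.05)],
]
assignments = [dominant_topic(t) for t in doc_topic_lists]
print(Counter(assignments))
```

Counting the assignments gives a rough topic size that can be compared with the KMeans cluster sizes above.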

More things to do:
1. More visualization: word clouds, trigrams, etc.
2. Create a predictive model to classify whether a tweet is positive or negative.