# Analysis of Twitter's Trending Topics in Comparison to Top News Headlines
## Zion Calvo | Surya Keswani | Donald Stewart 
### sukeswan@ucsc.edu | dolstewa@ucsc.edu | zcalvo@ucsc.edu

This notebook builds an interactive visualization of Twitter's trending topics from March 1 to March 7, 2011. The dataset comes from RMIT University in Australia and can be downloaded [here](http://nlp.uned.es/~damiano/datasets/TT-classification.html). The visualization uses Google's open-source Facets library ([linked here](https://pair-code.github.io/facets/)). To run this notebook, download the dataset and place the notebook in the same directory as the dataset.

#### Information about the Twitter Trending Topics dataframe (processed in the read_data function)
1. `Md5 Hash`: Each of the 1,036 trending topics is identified by an MD5 hash. The dataset also contains 1,036 files of tweet information, each named after the MD5 hash of its trending topic.

2. `Date`: The date on which each topic became trending. The dates in the original dataset are formatted as yyyyMMdd. The `read_data` function reformats them as the weekday followed by M/D/YY (for example, `Tuesday 3/1/11`).

3. `Topic`: The trending Twitter topics, ranging from news stories to people to hashtags.

4. `Type`: Each of the 1,036 topics has been analyzed by RMIT and manually categorized. The manual annotation assigns each topic one of the four classes in the taxonomy:

    + news
    + ongoing-event
    + meme
    + commemorative
    
    
5. `# Tweets`: The number of times a topic was tweeted about during the dataset collection period. The count is extracted from the tweets folder by reading the file named after the topic's MD5 hash.
6. `Unique Users`: The number of unique users that tweeted about the corresponding trending topic during the dataset collection period. The count is extracted from the same MD5 hash file in the tweets folder.
7. `Average Traffic`: The average of `# Tweets` and `Unique Users`. This score serves as another way to measure how strongly a topic is trending.
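
The three numeric columns above can be illustrated on a tiny in-memory file. This is a minimal sketch using only the standard library rather than pandas; `topic_stats` and the sample rows are hypothetical, and the column order (tweet ID, user, topic hash) is assumed from the column names used in `read_data`.

```python
import csv
import io

# Hypothetical contents of one tweets/<md5 hash> file: tab-separated
# tweet ID, user screen name, and topic hash (all sample values are made up).
SAMPLE_FILE = (
    "1001\talice\tabc123\n"
    "1002\tbob\tabc123\n"
    "1003\talice\tabc123\n"
    "1004\tcarol\tabc123\n"
)

def topic_stats(handle):
    """Return (# tweets, unique users, average traffic) for one topic file."""
    rows = list(csv.reader(handle, delimiter="\t"))
    n_tweets = len(rows)                      # one row per tweet
    n_users = len({row[1] for row in rows})   # distinct screen names
    avg_traffic = (n_tweets + n_users) / 2    # mean of the two counts
    return n_tweets, n_users, avg_traffic

print(topic_stats(io.StringIO(SAMPLE_FILE)))  # (4, 3, 3.5)
```

Because a user can tweet more than once about a topic, `Unique Users` can never exceed `# Tweets`, so `Average Traffic` always lies between the two.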

#### Functions
1. `read_data()`: Takes the data in the directory and converts it to a pandas dataframe. 
2. `read_headlines()`: Reads the top 35 headlines from the provided headlines file. The headlines were web-scraped from top news sources and are stored in a pandas dataframe.
3. `get_topX()`: Takes the ttt dataframe and an integer X as input and returns three dataframes:

    + `topX_avg`: Dataframe with the top X topics based on highest average traffic score
    + `topX_users`: Dataframe with the top X topics based on highest number of unique users
    + `topX_tweets`: Dataframe with the top X topics based on the highest number of tweets
    
    
4. `run_facets()`: Takes a dataframe as input and displays an interactive data visualization tool
5. `get_tweets()`: *Coming soon* 
6. `main()`: Makes function calls and prints data analysis to console. To see full output, change the `main()` cell settings by clicking *Cell -> Cell Output -> Toggle Scrolling*

***Note***: In order to respect Twitter's TOS, tweets are not redistributed; only tweet IDs and author screen names are provided. Tweet texts can be downloaded using scripts.

***Note***: The Facets library visualization takes 30-60 seconds to load.

In [1]:
import pandas as pd
import io
!pip install facets-overview
from IPython.display import display, HTML
print("Imports completed")

Collecting facets-overview
  Downloading https://files.pythonhosted.org/packages/df/8a/0042de5450dbd9e7e0773de93fe84c999b5b078b1f60b4c19ac76b5dd889/facets_overview-1.0.0-py2.py3-none-any.whl
Installing collected packages: facets-overview
Successfully installed facets-overview-1.0.0
Imports completed


In [2]:
def read_data(): # read data and store in ttt dataframe
    # pandas reads the semicolon-separated annotations file and stores it in the twitter trending topics dataframe
    fields = ["Md5 Hash", "Date", "Topic", "Type"]
    ttt = pd.read_csv('TT-annotations.csv', sep=";", header=None, names=fields)
    hashList = ttt["Md5 Hash"].tolist()
    amount_tweets = []
    unique_users = []

    # each file in tweets/ is named after a topic hash and holds one tweet per row
    for file in hashList:
        hash_file = pd.read_csv("tweets/" + file, sep="\t", header=None, names=["Tweet ID", "User", "MD5 Hash"])
        amount_tweets.append(len(hash_file))
        unique_users.append(len(hash_file["User"].unique()))

    # convert the date formats to be more readable
    # (assign the result back instead of using inplace=True on a column selection)
    ttt['Date'] = ttt['Date'].replace({20110301: "Tuesday 3/1/11",
                                       20110302: "Wednesday 3/2/11",
                                       20110303: "Thursday 3/3/11",
                                       20110304: "Friday 3/4/11",
                                       20110305: "Saturday 3/5/11",
                                       20110306: "Sunday 3/6/11",
                                       20110307: "Monday 3/7/11"})

    # add the number of tweets and unique users per topic in the trending list
    ttt['# Tweets'] = amount_tweets
    ttt['Unique Users'] = unique_users
    ttt['Average Traffic'] = (ttt['# Tweets'] + ttt['Unique Users'])/2
    return ttt
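
The date mapping in `read_data()` lists each of the seven days by hand. As a hedged alternative, the same labels could be derived from the yyyyMMdd value itself. The helper name `readable_date` below is hypothetical, and the weekday name from `%A` is assumed to be English (the default C locale).

```python
from datetime import datetime

def readable_date(yyyymmdd):
    """Convert 20110301 (int or str) to 'Tuesday 3/1/11'."""
    d = datetime.strptime(str(yyyymmdd), "%Y%m%d")
    # Month and day are formatted by hand because zero-stripping
    # strftime flags such as %-m are not portable across platforms.
    return f"{d:%A} {d.month}/{d.day}/{d:%y}"

print(readable_date(20110301))  # Tuesday 3/1/11
print(readable_date(20110307))  # Monday 3/7/11
```

Under these assumptions it could be applied to the column with `ttt['Date'] = ttt['Date'].map(readable_date)`, which would also cover dates outside the collection week.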

In [3]:
def read_headlines():
    # read the web-scraped headlines file into a dataframe
    headlines = pd.read_csv('headline - Sheet1.csv')

    # convert the date formats to be more readable
    # (assign the result back instead of using inplace=True on a column selection)
    headlines['Date'] = headlines['Date'].replace({20110301: "Tuesday 3/1/11",
                                                   20110302: "Wednesday 3/2/11",
                                                   20110303: "Thursday 3/3/11",
                                                   20110304: "Friday 3/4/11",
                                                   20110305: "Saturday 3/5/11",
                                                   20110306: "Sunday 3/6/11",
                                                   20110307: "Monday 3/7/11"})
    return headlines

In [4]:
def get_topX(ttt, X): # returns the top X topics based on avg traffic, users, and # of tweets, each stored in a dataframe
    topX_avg = ttt.nlargest(X, ['Average Traffic']) 
    topX_users = ttt.nlargest(X, ['Unique Users'])
    topX_tweets = ttt.nlargest(X, ['# Tweets'])
    return topX_avg, topX_users, topX_tweets
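
To make the behavior of `nlargest` concrete, here is a small sketch on invented data (the topic names and counts are not from the dataset). It shows that the selected rows keep their original index labels, and that ties are resolved by order of appearance unless `keep="all"` is passed.

```python
import pandas as pd

# Toy data; topic names and counts are invented for illustration.
toy = pd.DataFrame({
    "Topic": ["a", "b", "c", "d"],
    "# Tweets": [100, 300, 200, 300],
    "Unique Users": [90, 120, 180, 60],
})
toy["Average Traffic"] = (toy["# Tweets"] + toy["Unique Users"]) / 2

top2_tweets = toy.nlargest(2, "# Tweets")        # rows b and d (tied at 300)
top2_avg = toy.nlargest(2, "Average Traffic")    # rows b (210.0) and c (190.0)
tied = toy.nlargest(1, "# Tweets", keep="all")   # keeps every row tied for first

print(top2_tweets["Topic"].tolist())  # ['b', 'd']
print(top2_avg["Topic"].tolist())     # ['b', 'c']
print(tied["Topic"].tolist())         # ['b', 'd']
```

Ties may matter for this dataset, since several topics in the output share the same `# Tweets` value; with the default `keep="first"`, which of the tied topics makes the cut depends on row order.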

In [5]:
def run_facets(dataframe):
    # Display the Dive visualization for the data
    # Note: This code block has been taken from the Google Facets tutorial
    # Link to tutorial: https://colab.research.google.com/github/PAIR-code/facets/blob/master/colab_facets.ipynb#scrollTo=blPpZw5R3Bb4
    jsonstr = dataframe.to_json(orient='records')
    HTML_TEMPLATE = """
            <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
            <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
            <facets-dive id="elem" height="600"></facets-dive>
            <script>
              var data = {jsonstr};
              document.querySelector("#elem").data = data;
            </script>"""
    html = HTML_TEMPLATE.format(jsonstr=jsonstr)
    display(HTML(html))
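
`run_facets()` hands the dataframe to the page as a JSON string. The sketch below shows what `to_json(orient='records')` produces: a JSON array with one object per row, keyed by column name. The two rows reuse values from the dataframe preview; the variable names are illustrative only.

```python
import json
import pandas as pd

# Two rows with values taken from the ttt dataframe preview.
toy = pd.DataFrame({
    "Topic": ["Sky News", "Leap Year"],
    "Type": ["news", "ongoing-event"],
    "# Tweets": [1483, 1496],
})

jsonstr = toy.to_json(orient="records")  # one JSON object per dataframe row
records = json.loads(jsonstr)            # parse it back to inspect the shape

print(len(records))   # 2
print(records[0])     # {'Topic': 'Sky News', 'Type': 'news', '# Tweets': 1483}
```

The dataframe index is not part of the `records` output, so the row numbers seen in the printed tables are not available inside the visualization unless they are added as a regular column first.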

In [6]:
def get_tweets():
    # work in progress: fetches a single tweet page and returns the text of its <title> tag
    from requests import get
    from bs4 import BeautifulSoup as bs4

    main_page = get('https://twitter.com/anyuser/status/44004278418939904')
    page_html = bs4(main_page.text, 'html.parser')
    return page_html.find('title').text
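
The title lookup in `get_tweets()` can be tried offline on a static page. This sketch swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages or network access. `TitleParser`, `extract_title`, and the sample HTML are hypothetical, and the markup of a real tweet page is not assumed here.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Made-up page standing in for a downloaded tweet page.
SAMPLE_HTML = "<html><head><title> Sample tweet page </title></head><body></body></html>"

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

print(extract_title(SAMPLE_HTML))  # Sample tweet page
```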

In [9]:
def main():
    ttt = read_data()
    #headlines = read_headlines()
    topX_avg, topX_users, topX_tweets = get_topX(ttt, 35)

    print("Interactive Facets Visualization Below:")
    run_facets(ttt)

    print("TTT Dataframe Preview Below:")
    display(ttt)

    print("Top 35 Trending Topics based on Average Traffic Below:")
    display(topX_avg)

    print("Top 35 Trending Topics based on Unique Users Below:")
    display(topX_users)

    print("Top 35 Trending Topics based on # of Tweets Below:")
    display(topX_tweets)
    
    # print("Top 35 News Headlines Below:")
    # display(headlines)

In [10]:
main()

Interactive Facets Visualization Below:


TTT Dataframe Preview Below:


| | Md5 Hash | Date | Topic | Type | # Tweets | Unique Users | Average Traffic |
|---|---|---|---|---|---|---|---|
| 0 | e86dba234c7429d9aea70c5f104dbebc | Sunday 3/6/11 | #mileyonsnl | ongoing-event | 1358 | 536 | 947.0 |
| 1 | fe2dca2bdfca21e6c83567a531d6534e | Sunday 3/6/11 | Leap Year | ongoing-event | 1496 | 1335 | 1415.5 |
| 2 | cf90d03111f55e55ff9c7b5ef079fa30 | Saturday 3/5/11 | Sober Valley Lodge | meme | 1490 | 1376 | 1433.0 |
| 3 | 748089790952e9dbfb174b7ef39adf2b | Thursday 3/3/11 | Howard Davies | news | 557 | 452 | 504.5 |
| 4 | 3b7dd11b6654dd5f35745285ededdbd9 | Thursday 3/3/11 | Sky News | news | 1483 | 1203 | 1343.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1031 | 05d739a093794343728b66d5802db10b | Wednesday 3/2/11 | Antonio Damasio | ongoing-event | 197 | 166 | 181.5 |
| 1032 | 8669b32af04ae65b88b38df5662545ea | Saturday 3/5/11 | Sheila Mello | ongoing-event | 190 | 170 | 180.0 |
| 1033 | 4e49ec9f0f036f91a5b34278d15484f3 | Friday 3/4/11 | Gareca | ongoing-event | 284 | 222 | 253.0 |
| 1034 | d93d84ae006f530cbf2b33ebf824d48e | Monday 3/7/11 | Patrícia Poeta | ongoing-event | 309 | 283 | 296.0 |


Top 35 Trending Topics based on Average Traffic Below:


| | Md5 Hash | Date | Topic | Type | # Tweets | Unique Users | Average Traffic |
|---|---|---|---|---|---|---|---|
| 607 | f6c5c0ae6c636b204045d21267608be5 | Saturday 3/5/11 | YOURMAN | meme | 1492 | 1479 | 1485.5 |
| 711 | 6401813085837e9f1f8b27eae98ac4f3 | Saturday 3/5/11 | Mountain Dew | meme | 1493 | 1454 | 1473.5 |
| 1011 | 5eb1c412235828a21e3767fc12ab9975 | Friday 3/4/11 | Fast Times | ongoing-event | 1498 | 1441 | 1469.5 |
| 10 | e8611c0126305e7a3af7ebe8317dc699 | Tuesday 3/1/11 | RIP Jane Russell | news | 1497 | 1441 | 1469.0 |
| 408 | 257c941753478852a2436136bc72ab61 | Wednesday 3/2/11 | TOP10 Profile STALKERS | meme | 1496 | 1439 | 1467.5 |
| 324 | f466ab0a8cd1f819044a225a3e2bfeaa | Thursday 3/3/11 | Google Profiles | news | 1497 | 1435 | 1466.0 |
| 460 | 6a5c42fc43e58b8ed67ab5da8d965c51 | Wednesday 3/2/11 | Wild Thing | ongoing-event | 1500 | 1429 | 1464.5 |
| 529 | 43a77176c021459123b4c9087be6690c | Wednesday 3/2/11 | Emilio Estevez | meme | 1496 | 1433 | 1464.5 |
| 933 | b9ce5104bf8173b0565bda046bf8116c | Wednesday 3/2/11 | Adonis DNA | meme | 1493 | 1424 | 1458.5 |
| 333 | b4247c18124765ae6cacb49f80d5d0ab | Tuesday 3/1/11 | Lázaro Ramos | ongoing-event | 1498 | 1416 | 1457.0 |


Top 35 Trending Topics based on Unique Users Below:


| | Md5 Hash | Date | Topic | Type | # Tweets | Unique Users | Average Traffic |
|---|---|---|---|---|---|---|---|
| 607 | f6c5c0ae6c636b204045d21267608be5 | Saturday 3/5/11 | YOURMAN | meme | 1492 | 1479 | 1485.5 |
| 711 | 6401813085837e9f1f8b27eae98ac4f3 | Saturday 3/5/11 | Mountain Dew | meme | 1493 | 1454 | 1473.5 |
| 877 | 463636c775a2ba72afdfd29d240b4a19 | Wednesday 3/2/11 | Nancy Grace | meme | 1464 | 1445 | 1454.5 |
| 736 | c80e68f4191a8150a52cf5cf43a6bdaf | Wednesday 3/2/11 | Still Winning | meme | 1462 | 1442 | 1452.0 |
| 10 | e8611c0126305e7a3af7ebe8317dc699 | Tuesday 3/1/11 | RIP Jane Russell | news | 1497 | 1441 | 1469.0 |
| 1011 | 5eb1c412235828a21e3767fc12ab9975 | Friday 3/4/11 | Fast Times | ongoing-event | 1498 | 1441 | 1469.5 |
| 408 | 257c941753478852a2436136bc72ab61 | Wednesday 3/2/11 | TOP10 Profile STALKERS | meme | 1496 | 1439 | 1467.5 |
| 357 | 4dcf6c45da315ab121d9bcc23a157866 | Thursday 3/3/11 | World Record | news | 1475 | 1438 | 1456.5 |
| 324 | f466ab0a8cd1f819044a225a3e2bfeaa | Thursday 3/3/11 | Google Profiles | news | 1497 | 1435 | 1466.0 |
| 529 | 43a77176c021459123b4c9087be6690c | Wednesday 3/2/11 | Emilio Estevez | meme | 1496 | 1433 | 1464.5 |


Top 35 Trending Topics based on # of Tweets Below:


| | Md5 Hash | Date | Topic | Type | # Tweets | Unique Users | Average Traffic |
|---|---|---|---|---|---|---|---|
| 70 | 050feefcdd7d85023c61753474848a48 | Monday 3/7/11 | Rosenmontag | meme | 1500 | 1298 | 1399.0 |
| 114 | 4a4de7aa4553a1bd4fea73042d479f6b | Tuesday 3/1/11 | Robot Chicken | ongoing-event | 1500 | 1279 | 1389.5 |
| 460 | 6a5c42fc43e58b8ed67ab5da8d965c51 | Wednesday 3/2/11 | Wild Thing | ongoing-event | 1500 | 1429 | 1464.5 |
| 602 | 6de9a283570262f6978fdbbab2c5bc41 | Wednesday 3/2/11 | Kalla | news | 1500 | 1203 | 1351.5 |
| 809 | 9ff084e11fff0932581f87e9a5e19c1a | Thursday 3/3/11 | Wellington Paulista | ongoing-event | 1500 | 987 | 1243.5 |
| 134 | 9b5ff8a3f9652462677d4ffe8dbc5a67 | Friday 3/4/11 | Java Jazz Festival | ongoing-event | 1499 | 1305 | 1402.0 |
| 300 | 028cd4c37ed3a705a309424766cccb5e | Tuesday 3/1/11 | Carlos Estevez | meme | 1499 | 1324 | 1411.5 |
| 353 | 1c6d0fdb4b97bf152dd965d516873a45 | Thursday 3/3/11 | Nick Carter | ongoing-event | 1499 | 1150 | 1324.5 |
| 596 | 86007de4b13059d7af68019129b25d54 | Saturday 3/5/11 | Destra | ongoing-event | 1499 | 1133 | 1316.0 |
| 716 | 20eb10b2a243bb876e3db547bdfd3e8f | Friday 3/4/11 | Taylor Hall | news | 1499 | 1164 | 1331.5 |
