# TATR: Finding Hashtag Popularity 

This notebook is part of a larger series of Jupyter Notebooks structured around the analysis of Twitter tweets. This particular notebook looks at finding the most popular hashtag for a given date, and showcases two main features along the way. The first is collapsing all the tokenized hashtags of a date into one larger list. The second is saving the "tokenized list" into a CSV and loading it back so that pandas dataframes can still treat it as a list. This notebook also provides a framework that can be expanded to suit your needs.

Any additional assumptions and clarification will be discussed and declared throughout the notebook.

### Note: 
This notebook uses concepts introduced in other notebooks of the TATR notebook series.

Written 2018.

## Import Libraries

Now we will import all the Python 3 libraries used in this notebook. You do not need to know everything each library can do, as some are massive. Any functionality that is used will be explained as it appears, so do not worry if you do not recognize a library.

To install the required libraries, see the Jupyter documentation or each library's home page for instructions.

### Note: 
All libraries that are used are available for Anaconda 

In [127]:
# Importing data structure libraries
import pandas as pd
import numpy as np

# Import text analysis tools
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TweetTokenizer

# Import Counter
from collections import Counter

# Import libraries for loading files
import ast

## Setting up some dummy data

Similar to before, this notebook creates a small set of dummy data for demonstration. This ensures that everyone using this notebook can test its functionality. However, we will make some changes to the dummy data created in the other notebook.

### Dummy Panda Dataframe Structure

We are going to create 15 entries spread across 3 dates (5 entries per date). Each entry has 2 values (Date, Hashtag). To find out how to generate these values see TATR Tokenization and Extraction.

To do so we will use 2 functions to help create the data:
* pd.Timestamp(Some date format) : Turns the input into a timestamp (date)
* np.random.randint(range, size=how many) : Creates the declared amount of random integers from 0 up to (but not including) range; without size, a single integer is returned

### Note:

We will be randomly assigning hashtags to each date. Therefore it is possible your results will be different each time.
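If repeatable output is preferred, one option (an addition here, not part of the original notebook) is to seed NumPy's random generator before drawing. The sketch below uses a hypothetical helper, `draw_hashtag_lists`, and a list slice as a stand-in for the list comprehension used in the next cell.

```python
import numpy as np

dummy_hashtags = ["Twitter", "New2018", "JupyterLearning", "TwitterApp", "GenericTwitterNews"]

def draw_hashtag_lists(seed, entries=15):
    """Hypothetical helper: draw one hashtag list per entry, repeatably."""
    # Seeding NumPy's global generator makes the following draws repeat.
    np.random.seed(seed)
    # randint(5) returns an integer from 0 to 4, so each entry keeps the
    # first 0 to 4 hashtags (an entry can therefore be an empty list).
    return [dummy_hashtags[:np.random.randint(5)] for _ in range(entries)]

first_run = draw_hashtag_lists(seed=42)
second_run = draw_hashtag_lists(seed=42)

# The same seed gives the same lists on every run.
print(first_run == second_run)
```

Calling `np.random.seed` once at the top of the notebook should have the same effect for the cell below, as long as the cells are run in order.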

In [106]:
# Setup empty dataframe
DummyDataframe = pd.DataFrame(columns=['Date', 'Hashtag'])

# Setup 3 different dates
dummyDates = [ pd.Timestamp('20180101'),  pd.Timestamp('20180201'),  pd.Timestamp('20180301')]

# List of hashtags we are going to use
dummyHashtags = ["Twitter","New2018", "JupyterLearning", "TwitterApp", "GenericTwitterNews"]

# Create 15 entries and assign a random amount of hashtags to each
for i in range(15):
    DummyDataframe.loc[i] = [dummyDates[i % 3], [dummyHashtags[x] for x in range(np.random.randint(5))] ]
    
# See what the dataframe looks like
DummyDataframe

Unnamed: 0,Date,Hashtag
0,2018-01-01,[Twitter]
1,2018-02-01,"[Twitter, New2018]"
2,2018-03-01,"[Twitter, New2018]"
3,2018-01-01,"[Twitter, New2018, JupyterLearning]"
4,2018-02-01,"[Twitter, New2018, JupyterLearning]"
5,2018-03-01,"[Twitter, New2018]"
6,2018-01-01,[Twitter]
7,2018-02-01,[Twitter]
8,2018-03-01,[Twitter]
9,2018-01-01,"[Twitter, New2018, JupyterLearning, TwitterApp]"


Again, similar to the previous notebook (TATR: Graphing), we are going to sort the dates (mostly so they are more readable for us). In addition, we are going to take all the hashtags of each date and collapse them into one larger list.
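Collapsing works because adding two Python lists concatenates them, which is presumably what the 'sum' aggregation relies on for a column that holds lists. A minimal plain-Python sketch with made-up lists (not the notebook's random data):

```python
from itertools import chain

# Three made-up tweets from the same date, each with its own hashtag list.
per_tweet = [["Twitter"], ["Twitter", "New2018"], []]

# Summing with an empty list as the start value concatenates the lists.
collapsed = sum(per_tweet, [])
print(collapsed)  # ['Twitter', 'Twitter', 'New2018']

# itertools.chain gives the same result without building intermediate lists.
print(list(chain.from_iterable(per_tweet)) == collapsed)  # True
```

Duplicates are kept on purpose: the repeated entries are what make counting possible later.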

In [107]:
# Set index to Date
DummyDataframe = DummyDataframe.set_index("Date")

# Group by date and then sum (add together) all the hashtag of a single date
DummyDataframe = DummyDataframe[["Hashtag"]].groupby('Date').agg({'Hashtag': 'sum'})

# Look at the data
DummyDataframe

Unnamed: 0_level_0,Hashtag
Date,Unnamed: 1_level_1
2018-01-01,"[Twitter, Twitter, New2018, JupyterLearning, T..."
2018-02-01,"[Twitter, New2018, Twitter, New2018, JupyterLe..."
2018-03-01,"[Twitter, New2018, Twitter, New2018, Twitter, ..."


As you can see, the individual lists of hashtags for each date were added together into a single list. We can now find out which hashtag is the most popular on each date. To do so, we first write a function that counts the occurrences of each hashtag. Afterwards, we use pandas' apply and lambda features to update the dataframe.

In [108]:
"""
Helper function to get the most popular hashtag

:dataframe     = A single row of the dataframe (passed in by apply with axis = 1)
:column        = Name of the column that holds the list of hashtags
"""
def most_popular_hashtag(dataframe, column):
    
    # Calculates what is most popular
    popular = max(set(dataframe[column]), key=dataframe[column].count)
    
    # Assigns the values
    dataframe["most_popular_hashtag"] = popular
    
    return dataframe

Now that we have declared the function, we apply it using pandas' apply and lambda features.

In [109]:
# Save the most popular hashtag of each day into its own column
DummyDataframe = DummyDataframe.apply(lambda x: most_popular_hashtag(x, "Hashtag"), axis = 1)

# Display the dataframe
DummyDataframe

Unnamed: 0_level_0,Hashtag,most_popular_hashtag
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-01,"[Twitter, Twitter, New2018, JupyterLearning, T...",Twitter
2018-02-01,"[Twitter, New2018, Twitter, New2018, JupyterLe...",Twitter
2018-03-01,"[Twitter, New2018, Twitter, New2018, Twitter, ...",Twitter
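One edge case is worth noting: if a date ends up with no hashtags at all, the helper above calls max on an empty set, which raises a ValueError. With the random dummy data this is unlikely but possible. As a hedged alternative sketch (not the notebook's own method), the hypothetical function `most_popular_or_none` below counts with the `Counter` class imported earlier and returns None for an empty list.

```python
from collections import Counter

def most_popular_or_none(hashtags):
    """Hypothetical helper: most frequent hashtag, or None if there are none."""
    if not hashtags:
        return None
    # most_common(1) returns a list holding one (hashtag, count) pair.
    tag, _count = Counter(hashtags).most_common(1)[0]
    return tag

print(most_popular_or_none(["Twitter", "New2018", "Twitter"]))  # Twitter
print(most_popular_or_none([]))                                 # None
```

Counter tallies the list in a single pass, whereas `max(set(...), key=list.count)` rescans the list once per distinct hashtag. For small lists the difference is negligible.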


In [110]:
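"""
Helper function to get the most popular hashtag and the percent of all hashtags it accounts for

:dataframe     = A single row of the dataframe (passed in by apply with axis = 1)
:column        = Name of the column that holds the list of hashtags
"""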
def most_popular_hashtag_percent(dataframe, column):
    
    # Calculates what is most popular
    popular = max(set(dataframe[column]), key=dataframe[column].count)
    
    # Assigns the values
    dataframe["most_popular_hashtag"] = popular
    
    # Calculate the percent of the popular hashtag, rounded to two decimal places (stored as a string)
    dataframe["popular_hashtag_percent"] = "%.2f" % ((Counter(dataframe[column])[popular] / len(dataframe[column])) * 100)
    
    return dataframe

Now that we declared a new version of the previous function, we will now see what it looks like.

In [111]:
# Save the most popular hashtag of each day and its percent into their own columns
DummyDataframe = DummyDataframe.apply(lambda x: most_popular_hashtag_percent(x, "Hashtag"), axis = 1)

# Display the dataframe
DummyDataframe

Unnamed: 0_level_0,Hashtag,most_popular_hashtag,popular_hashtag_percent
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-01-01,"[Twitter, Twitter, New2018, JupyterLearning, T...",Twitter,41.67
2018-02-01,"[Twitter, New2018, Twitter, New2018, JupyterLe...",Twitter,45.45
2018-03-01,"[Twitter, New2018, Twitter, New2018, Twitter, ...",Twitter,45.45


As you can see, we now have a percent associated with the most popular hashtag, which gives us more insight into how "popular" that hashtag is overall.
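To make the arithmetic concrete, here is a small worked sketch on a made-up list of 12 hashtags in which "Twitter" appears 5 times. The list is illustrative only and is not the notebook's actual random data.

```python
from collections import Counter

# 5 + 3 + 3 + 1 = 12 hashtags in total (made-up data).
hashtags = (["Twitter"] * 5 + ["New2018"] * 3
            + ["JupyterLearning"] * 3 + ["TwitterApp"])

# Most frequent hashtag and how often it occurs.
popular, occurrences = Counter(hashtags).most_common(1)[0]

# Share of all hashtags, formatted to two decimal places as a string.
percent = "%.2f" % (occurrences / len(hashtags) * 100)

print(popular, occurrences, len(hashtags), percent)  # Twitter 5 12 41.67
```

Because the value is produced with string formatting, the column holds text such as "41.67" rather than a number. Convert it with float() if you need to do further arithmetic on it.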

## Saving our results

Now that we have our results, we want to save them. This differs from previous notebooks because the Hashtag column stores Python lists. When the dataframe is written to a CSV, each list is saved as text, exactly as it appears in the pandas dataframe. This is a problem when loading the CSV back in, because the loader does not know that the text is supposed to be a list and reads it as a plain string. We will therefore reuse the save function from TATR Panda and CSV of Tweets (with a slight modification to control whether the index is saved), but we will change our load function.

Using the function "ast.literal_eval" we can interpret the string of characters as a list.

To find out more about this function see:
* https://docs.python.org/2/library/ast.html
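A minimal sketch of what `ast.literal_eval` does with the text form of a list (the example values are made up):

```python
import ast

# The text form of a list, similar to what ends up in a CSV cell.
stored = str(["Twitter", "New2018"])
restored = ast.literal_eval(stored)

print(type(stored).__name__, type(restored).__name__)  # str list
print(restored == ["Twitter", "New2018"])              # True

# literal_eval only accepts Python literals; other expressions are rejected.
try:
    ast.literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

This restriction to literals is why `ast.literal_eval` is generally preferred over `eval` when the text comes from a file.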

In [124]:
'''
Save the dataframe into a file

:dataframe:    The dataframe that is being saved
:name_of_file: The name of the CSV file you wish to save it as (without the ".csv" extension)
:index_save:   Set to True if you want to save the index
'''
def save_frame_to_CSV(dataframe, name_of_file, index_save):
    
    print("Begin saving dataframe into a csv...\n")
    
    # Attach the CSV file extension to the name
    name = name_of_file + ".csv"
    
    # Convert the dataframe into a CSV that is separated by commas
    # The index is written only when index_save is True
    dataframe.to_csv(name, sep=',', encoding='utf-8', index=index_save)
    
    print("Finish and saved into " + name + "\n")

In [160]:
'''
Similar to the earlier load CSV function, except this one converts the lists stored in the CSV back into list objects

Parameters:

:csv_file_name   = Name of the file to load (without the ".csv" extension)
:colList         = The columns to load from the CSV
:convertList     = The columns in colList that contain a list
'''
def read_CSV_file_convert(csv_file_name, colList, convertList):
    
    # Create the function to convert to a list
    string_to_list = lambda x: ast.literal_eval(str(x))

    # Create the converter that will convert all the columns we tell it to back to a list object
    conv = {}
    
    # Assign the converter for each of the columns we want to convert (a.k.a the columns with a list)
    for entry in colList:
        if entry in convertList:            
            conv[entry] = string_to_list

    print("Loading csv " + csv_file_name + "\n")
    name = csv_file_name + ".csv"

    # Load a specified amount of columns from the csv
    return_frame = pd.read_csv(name, usecols = colList, converters=conv)
    
    print("Finish loading csv " + csv_file_name)
    
    return return_frame
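The effect of the converter can be seen in a small in-memory round trip. This sketch assumes a tiny made-up dataframe and uses `io.StringIO` in place of a file on disk; it passes `ast.literal_eval` directly as the converter.

```python
import ast
import io

import pandas as pd

# Made-up frame with a column that holds lists.
frame = pd.DataFrame({
    "Date": ["2018-01-01", "2018-02-01"],
    "Hashtag": [["Twitter", "New2018"], ["Twitter"]],
})

# Write the frame to an in-memory CSV instead of a file.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Without a converter the list comes back as plain text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With a converter each cell of the column is turned back into a list.
buffer.seek(0)
converted = pd.read_csv(buffer, converters={"Hashtag": ast.literal_eval})

print(type(plain.loc[0, "Hashtag"]).__name__)      # str
print(type(converted.loc[0, "Hashtag"]).__name__)  # list
```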

Now that we have defined both functions we are going to save the results we have.

In [135]:
# Save the CSV and the index
save_frame_to_CSV(DummyDataframe, "TATR_Finding_Hashtag", True)

Begin saving dataframe into a csv...

Finish and saved into TATR_Finding_Hashtag.csv



In [161]:
# Load back the dataframe and the column names
LoadFrame = read_CSV_file_convert("TATR_Finding_Hashtag",['Date','Hashtag', 'most_popular_hashtag', 'popular_hashtag_percent'],['Hashtag'])

# See the dataframe
LoadFrame

Loading csv TATR_Finding_Hashtag
Finish loading csv TATR_Finding_Hashtag


Unnamed: 0,Date,Hashtag,most_popular_hashtag,popular_hashtag_percent
0,2018-01-01,"[Twitter, Twitter, New2018, JupyterLearning, T...",Twitter,41.67
1,2018-02-01,"[Twitter, New2018, Twitter, New2018, JupyterLe...",Twitter,45.45
2,2018-03-01,"[Twitter, New2018, Twitter, New2018, Twitter, ...",Twitter,45.45


Now that we have loaded the CSV back in, the only difference from before is that "Date" is no longer the index. This is easily fixed by setting the "Date" column as the index.

In [162]:
# Set Date as the index
LoadFrame = LoadFrame.set_index("Date")

# See the frame
LoadFrame

Unnamed: 0_level_0,Hashtag,most_popular_hashtag,popular_hashtag_percent
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-01-01,"[Twitter, Twitter, New2018, JupyterLearning, T...",Twitter,41.67
2018-02-01,"[Twitter, New2018, Twitter, New2018, JupyterLe...",Twitter,45.45
2018-03-01,"[Twitter, New2018, Twitter, New2018, Twitter, ...",Twitter,45.45
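As an alternative sketch (a possible convenience, not the notebook's method), `pd.read_csv` can set the index and parse the dates while loading, through its `index_col` and `parse_dates` parameters. Without `parse_dates`, the Date values come back as plain strings rather than timestamps. The CSV text below is made up and read from memory.

```python
import ast
import io

import pandas as pd

# Made-up CSV content in the same shape as the saved file.
csv_text = """Date,Hashtag
2018-01-01,"['Twitter', 'New2018']"
2018-02-01,"['Twitter']"
"""

loaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],                      # turn the Date text into timestamps
    index_col="Date",                          # use Date as the index right away
    converters={"Hashtag": ast.literal_eval},  # rebuild the lists
)

print(type(loaded.index).__name__)
print(loaded.loc[pd.Timestamp("2018-01-01"), "Hashtag"])
```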


Finally, to check that our dataframe still works, we are going to run the algorithm for finding the most popular hashtag again. The results are stored in a new dataframe.

In [163]:
# Load in the Loadframe with just the hashtag column and run the most popular hashtag function
TestingLoadFrame = LoadFrame[['Hashtag']].apply(lambda x: most_popular_hashtag_percent(x, "Hashtag"), axis = 1)

# See what it looks like
TestingLoadFrame

Unnamed: 0_level_0,Hashtag,most_popular_hashtag,popular_hashtag_percent
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-01-01,"[Twitter, Twitter, New2018, JupyterLearning, T...",Twitter,41.67
2018-02-01,"[Twitter, New2018, Twitter, New2018, JupyterLe...",Twitter,45.45
2018-03-01,"[Twitter, New2018, Twitter, New2018, Twitter, ...",Twitter,45.45


As you can see, we have the same results, meaning our load function works as intended.

## Conclusion

In this notebook we went over how to find the most popular hashtag for each date, as well as what share of that date's hashtags it accounts for. This provides useful information, as you can now trace which hashtags are popular in your corpus. There are other functionalities and uses that can be applied to the tokenized hashtags; this notebook presents one common analysis. This notebook also showcases how to load back dataframes that use "list" columns, removing the need to recompute the tokens.

This notebook is one of the more advanced notebooks in the TATR notebook series.