# TATR: Tokenization and Extraction

This notebook is part of a larger series of Jupyter Notebooks structured around the analysis of Twitter tweets. This particular notebook looks at tokenizing tweet text and extracting key features from it. It also serves as one of the introductory notebooks for TATR, as tokenization and extraction are fundamental to any analysis.

Any additional assumptions and clarifications will be discussed and declared throughout the notebook.

### Note: 
We recommend that you first look at TATR: Panda and CSV of Tweets, because we will be using some of the features discussed in that notebook.

Written 2018.

## Introduction: Tokenization

Tokenization is the process of separating a string into parts, whether by words, sentences, or some other rubric. It is therefore an important step when segmenting your Twitter data. This will be elaborated on further in the tokenization section below.

To find out more about tokenization in general:
* https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization
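
As a rough illustration of splitting by different rubrics, the sketch below uses only Python's built-in `re` module. The sample string and both patterns are assumptions made for this example, and they are much cruder than the NLTK tokenizers used later in this notebook.

```python
import re

# Hypothetical sample text, not taken from any tweet corpus
sample = "Tokenization splits text. It can work on words or sentences!"

# Word-level tokens: runs of letters, digits or underscores
word_tokens = re.findall(r"\w+", sample)

# Sentence-level tokens: split on whitespace that follows ".", "!" or "?"
sentence_tokens = re.split(r"(?<=[.!?])\s+", sample)

print(word_tokens)
print(sentence_tokens)
```

The same string yields ten word tokens but only two sentence tokens, which is why the choice of rubric matters for later analysis.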


## Import Libraries

Now we will import all the Python 3 libraries that will be used in this notebook. You do not need to know every feature of each library, as some of them are massive. Any functionality that is used will be explained as it appears, so do not worry too much if you do not recognize the libraries.

To import or download the required libraries, see the Jupyter documentation or each library's home page for instructions.

### Note: 
All libraries used in this notebook are available for Anaconda.

In [28]:
# Importing data structure libraries
import pandas as pd
import numpy as np

# Importing text analysis tools
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TweetTokenizer

## Tokenization using NLTK

In this notebook we will look at using NLTK (Natural Language Toolkit) for tokenization. NLTK is a popular and powerful Python library for working with human language data. The library offers a wide range of functionality, from tokenization to word classification. In this notebook we will only examine the tokenization features of NLTK.

To find out more about NLTK see:
* http://www.nltk.org/

To find out more about NLTK tokenization see:
* http://www.nltk.org/api/nltk.tokenize.html

## Prefix

In this notebook we will be creating 4 dummy tweets. This ensures that anyone coming to this notebook can test its functionality without needing a set of tweets beforehand. If you already have a corpus, feel free to skip this section.

### Note:

This notebook uses pandas as its primary data structure and CSV as its data file format.

In [29]:
# Creating the pandas dataframe
pandaDataFrame = pd.DataFrame({ 
                                'Text' :["This is a basic tweet without anything speical",
                                         "This tweet uses #twitter hashtags",
                                         "This is used to reply to @user",
                                         "#Combing different @uses of things! and http://www.google.ca :D"]
                              })

# Let's see what the dataframe looks like
pandaDataFrame

Unnamed: 0,Text
0,This is a basic tweet without anything speical
1,This tweet uses #twitter hashtags
2,This is used to reply to @user
3,#Combing different @uses of things! and http:/...


## Tokenization

Now that we have some data loaded into a pandas dataframe, we can begin tokenization. NLTK offers a variety of methods for tokenization. In this notebook we will look at one particular tokenizer that NLTK offers, TweetTokenizer, which is designed especially for tweets.

We will first create a function to conduct the tokenization. The reason for this will be explained later in the notebook.

### Note: 
During tokenization we also record the number of tokens created. Although not important to this particular notebook, having the count can be useful for analysis.

There are also additional options that can be enabled with NLTK's TweetTokenizer; however, we will not be using them in this notebook.
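
For readers curious about those options, here is a minimal sketch of three keyword arguments that TweetTokenizer accepts: `preserve_case`, `strip_handles` and `reduce_len`. The sample tweet is an assumption made for this example, and the rest of this notebook keeps the default settings.

```python
from nltk.tokenize import TweetTokenizer

# Hypothetical sample tweet with a handle and stretched-out letters
sample = "@remy: This is waaaaayyyy too much for you!!!!!!"

# Default settings, as used elsewhere in this notebook
default_tokens = TweetTokenizer().tokenize(sample)

# Lowercase the tokens, drop @handles, and shorten runs of
# three or more repeated characters to exactly three
custom_tokenizer = TweetTokenizer(preserve_case=False,
                                  strip_handles=True,
                                  reduce_len=True)
custom_tokens = custom_tokenizer.tokenize(sample)

print(default_tokens)
print(custom_tokens)
```

Whether these options help depends on the analysis: stripping handles, for example, would remove exactly the reply information that is extracted later in this notebook.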

In addition, depending on your dataset, this process can take a long time. It is therefore best to segment larger datasets into smaller subsets. The process of segmenting a larger dataset into smaller ones can be found in a more advanced notebook.
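
As a small preview of that idea, the sketch below cuts a dataframe into fixed-size pieces with `iloc`. The frame contents and the `chunk_size` value are assumptions for illustration only, and this is not necessarily the approach taken in the more advanced notebook.

```python
import pandas as pd

# Hypothetical stand-in for a larger tweet dataframe
frame = pd.DataFrame({"Text": ["tweet number " + str(i) for i in range(5)]})

# Assumed subset size; tune this to your memory and time budget
chunk_size = 2

# Slice the frame into consecutive subsets of at most chunk_size rows
chunks = [frame.iloc[start:start + chunk_size]
          for start in range(0, len(frame), chunk_size)]

for number, chunk in enumerate(chunks):
    print(number, len(chunk))
```

Each subset can then be tokenized on its own, so a failure or a long run only affects one piece at a time.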

In [30]:
def tokenize_text(dataframe, column_name):
    """
    Tokenize the tweet text and record the number of tokens

    :dataframe: the row of tweet data passed in by apply (axis=1)
    :column_name: the column name with the tweet text
    """
    # Initialize the tokenizer
    tweetTokenizer = TweetTokenizer()

    # Split the tweet text into a list of tokens
    token = tweetTokenizer.tokenize(dataframe[column_name])

    # Save the tokenized text into a new column called "token"
    dataframe['token'] = token

    # Save the token count into a new column called "count_token"
    dataframe['count_token'] = len(token)

    return dataframe

Now that the function is defined, we can apply it to each row of the dataframe. To do this we will use two handy features: the pandas "apply" method and Python's "lambda" expressions. "apply" calls a function once for every row (or column) of the dataframe, and a "lambda" lets us wrap a function that takes multiple parameters so that "apply" can call it.

In "apply", "axis=0" passes each column to the function, while "axis=1" passes each row. We use "axis=1" so that the function receives one tweet at a time.
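
To make the role of "axis" and "lambda" concrete, here is a minimal sketch on a tiny numeric dataframe. The dataframe and the `scale_and_sum` helper are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-row, two-column dataframe
numbers = pd.DataFrame({"a": [1, 2], "b": [10, 20]})


def scale_and_sum(values, factor):
    """Sum the values and multiply the total by factor (hypothetical helper)."""
    return values.sum() * factor


# axis=0: the function receives each column in turn
column_totals = numbers.apply(lambda column: scale_and_sum(column, 1), axis=0)

# axis=1: the function receives each row in turn; the lambda supplies
# the second parameter that apply would not pass by itself
row_totals = numbers.apply(lambda row: scale_and_sum(row, 2), axis=1)

print(column_totals)
print(row_totals)
```

With "axis=0" the result has one value per column (3 and 30), while with "axis=1" it has one value per row (22 and 44).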

In [31]:
# Run the tokenizer on our dataframe and save it into a new dataframe
# In this case "x" refers to one row of "pandaDataFrame" and "Text" is the column label
TokenizeTweetFrame = pandaDataFrame.apply(lambda x: tokenize_text(x, "Text"), axis=1)

# See what TokenizeTweetFrame looks like
TokenizeTweetFrame

Unnamed: 0,Text,token,count_token
0,This is a basic tweet without anything speical,"[This, is, a, basic, tweet, without, anything,...",8
1,This tweet uses #twitter hashtags,"[This, tweet, uses, #twitter, hashtags]",5
2,This is used to reply to @user,"[This, is, used, to, reply, to, @user]",7
3,#Combing different @uses of things! and http:/...,"[#Combing, different, @uses, of, things, !, an...",9


As you can see, we now have 3 columns in our dataframe (excluding the index). For each text, the hashtags, replies, words, punctuation, and emoticons are separated into their own individual tokens.
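
The same two columns can also be built column-wise rather than row-wise. The sketch below is one possible alternative, not the method used in this notebook: it applies the tokenizer directly to the "Text" column of a small hypothetical dataframe holding two of the dummy tweets.

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

# Two of the dummy tweets from this notebook
tweets = pd.DataFrame({"Text": ["This tweet uses #twitter hashtags",
                                "This is used to reply to @user"]})

tokenizer = TweetTokenizer()

# Series.apply calls the tokenizer once per tweet text
tweets["token"] = tweets["Text"].apply(tokenizer.tokenize)

# The count is the length of each token list
tweets["count_token"] = tweets["token"].apply(len)

print(tweets)
```

Because the tokenizer takes a single argument here, no lambda is needed, and the tokenizer object is created once instead of once per row.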

## Extracting Token

Now that we have each tweet tokenized, we can move on to extraction. Although having the tokenized version of each tweet is useful, you may only be concerned with one particular aspect of the tweet text. For example, hashtags are an interesting area to analyze, so having another column with just the hashtags will be useful.
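
Since TweetTokenizer keeps each hashtag as a single token, one way to pick them out is to filter the token list directly. This is a minimal sketch of that idea under that assumption; the functions in the rest of this notebook use regular expressions on the tweet text instead.

```python
# Token list as produced for the second dummy tweet in this notebook
tokens = ["This", "tweet", "uses", "#twitter", "hashtags"]

# Keep tokens that start with "#" and drop the leading "#";
# the length check skips a lone "#" with nothing after it
hashtags = [token[1:] for token in tokens
            if token.startswith("#") and len(token) > 1]

print(hashtags)
print(len(hashtags))
```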

## Extracting Hashtags from Tokens

We will start by extracting hashtags into their own column. To do so we are going to use a regular expression, or regex. A regular expression is a pattern that matches a sequence of characters, in this case hashtags. We will again declare a function, similar to the one used for tokenization, that extracts the hashtags and their count.

To find out more on regular expression see:
* https://en.wikipedia.org/wiki/Regular_expression

For a quick reference on what regular expression can do see:
* https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference

### Note:
During extraction of the hashtags, we will be removing the "#" from the results. This is because we know all the results in the new column will be hashtags, therefore it will be easier and cleaner to remove them during this process than later on.
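
The removal of the "#" comes from the capture group in the pattern. The sketch below compares three patterns on a hypothetical sample string; the last one, which uses `\w+`, is an assumed stricter variant and not the pattern used in this notebook.

```python
import re

# Hypothetical sample text
text = "Learning #NLP with #python!"

# No capture group: the whole match, including "#", is returned
with_symbol = re.findall(r"#\S+", text)

# Capture group: only the part after "#" is returned
without_symbol = re.findall(r"#(\S+)", text)

# Assumed stricter variant: \w+ stops at punctuation such as "!"
words_only = re.findall(r"#(\w+)", text)

print(with_symbol)     # ['#NLP', '#python!']
print(without_symbol)  # ['NLP', 'python!']
print(words_only)      # ['NLP', 'python']
```

Note that `\S+` keeps any punctuation attached to the hashtag, which may or may not be what you want for your data.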

In [32]:
def extract_hashtag(dataframe, column_name):
    """
    Extract the hashtags from the tweet text

    :dataframe: the row of tweet data passed in by apply (axis=1)
    :column_name: the column name with the tweet text
    """
    # Find all the hashtags with regex
    # The regex matches every sequence that starts with "#"
    # and captures only the part after the "#"
    hashtag = re.findall(r"#(\S+)", dataframe[column_name])

    # Insert the hashtags and their count into the dataframe

    # If there are any hashtags, insert them and their count
    if hashtag:
        dataframe['HASHTAG'] = hashtag
        dataframe['count_hashtag'] = len(hashtag)

    # If there are no hashtags, just insert empty values
    else:
        dataframe['HASHTAG'] = []
        dataframe['count_hashtag'] = 0

    return dataframe

As before, we will use the pandas apply method and a lambda to apply this function to the dataset. In addition, we will be expanding the previous dataframe "TokenizeTweetFrame" with the new columns. However, feel free to create a new dataframe for the extracted tokens and counts.

In [33]:
# Run the extractor on our TokenizeTweetFrame
# In this case "x" refers to one row of "TokenizeTweetFrame" and "Text" is the column label
TokenizeTweetFrame = TokenizeTweetFrame.apply(lambda x: extract_hashtag(x, "Text"), axis=1)

# See what TokenizeTweetFrame looks like
TokenizeTweetFrame

Unnamed: 0,Text,token,count_token,HASHTAG,count_hashtag
0,This is a basic tweet without anything speical,"[This, is, a, basic, tweet, without, anything,...",8,[],0
1,This tweet uses #twitter hashtags,"[This, tweet, uses, #twitter, hashtags]",5,[twitter],1
2,This is used to reply to @user,"[This, is, used, to, reply, to, @user]",7,[],0
3,#Combing different @uses of things! and http:/...,"[#Combing, different, @uses, of, things, !, an...",9,[Combing],1


Now that we have extracted the hashtags from the text, we can do the same for other kinds of tokens. Next we will show how to extract URLs and replies. This time, however, we will move them to their own dataframe. This is done mostly to keep the dataframe small and readable in this notebook format. Feel free to keep everything in one dataframe.

We will start by writing a function that extracts the replies. You will find it is very similar to the extract_hashtag function.

In [34]:
def extract_replies(dataframe, column_name):
    """
    Extract the replies from the tweet text

    :dataframe: the row of tweet data passed in by apply (axis=1)
    :column_name: the column name with the tweet text
    """
    # Find all the replies with regex
    # The regex matches every sequence that starts with "@"
    # and captures only the part after the "@"
    replies = re.findall(r"@(\S+)", dataframe[column_name])

    # If there are any replies, insert them and their count
    if replies:
        dataframe['REPLIES'] = replies
        dataframe['count_replies'] = len(replies)

    # If there are no replies, just insert empty values
    else:
        dataframe['REPLIES'] = []
        dataframe['count_replies'] = 0

    return dataframe

It is important to note that we will be using the original pandaDataFrame here and not TokenizeTweetFrame.

In [35]:
# Run the extractor on our pandaDataFrame
# In this case "x" refers to one row of "pandaDataFrame" and "Text" is the column label
ExtractTweetFrame = pandaDataFrame.apply(lambda x: extract_replies(x, "Text"), axis=1)

# See what ExtractTweetFrame looks like
ExtractTweetFrame

Unnamed: 0,Text,REPLIES,count_replies
0,This is a basic tweet without anything speical,[],0
1,This tweet uses #twitter hashtags,[],0
2,This is used to reply to @user,[user],1
3,#Combing different @uses of things! and http:/...,[uses],1


The final type of token we will extract is the URL links. 

In [36]:
def extract_URL(dataframe, column_name):
    """
    Extract the URLs from the tweet text

    :dataframe: the row of tweet data passed in by apply (axis=1)
    :column_name: the column name with the tweet text
    """
    # Find all the URLs with regex
    # The regex matches every sequence that starts with "http"
    URL = re.findall(r"http\S+", dataframe[column_name])

    # If there are any URLs, insert them and their count
    if URL:
        dataframe['URL'] = URL
        dataframe['count_URL'] = len(URL)

    # If there are no URLs, just insert empty values
    else:
        dataframe['URL'] = []
        dataframe['count_URL'] = 0

    return dataframe

Again, we will be expanding ExtractTweetFrame in this example.

In [37]:
# Run the extractor on our ExtractTweetFrame
# In this case "x" refers to one row of "ExtractTweetFrame" and "Text" is the column label
ExtractTweetFrame = ExtractTweetFrame.apply(lambda x: extract_URL(x, "Text"), axis=1)

# See what ExtractTweetFrame looks like
ExtractTweetFrame

Unnamed: 0,Text,REPLIES,count_replies,URL,count_URL
0,This is a basic tweet without anything speical,[],0,[],0
1,This tweet uses #twitter hashtags,[],0,[],0
2,This is used to reply to @user,[user],1,[],0
3,#Combing different @uses of things! and http:/...,[uses],1,[http://www.google.ca],1
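
The three extraction functions in this notebook differ only in their pattern and column names. As a rough sketch of how they could be folded together, the hypothetical `extract_all` helper below loops over a dictionary of patterns and works on a plain string rather than a dataframe row. The helper, the `PATTERNS` dictionary and the key names are assumptions made for this example.

```python
import re

# The three patterns used in this notebook, keyed by a label
PATTERNS = {
    "HASHTAG": r"#(\S+)",
    "REPLIES": r"@(\S+)",
    "URL": r"http\S+",
}


def extract_all(text):
    """Return the matches and their count for every pattern (hypothetical helper)."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = re.findall(pattern, text)
        found[label] = matches
        found["count_" + label] = len(matches)
    return found


# The fourth dummy tweet from this notebook
result = extract_all("#Combing different @uses of things! and http://www.google.ca :D")
print(result)
```

Adding a new kind of token then only requires a new entry in the dictionary instead of a new function.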


## Conclusion

In this notebook we looked at tokenization using NLTK. The library allowed us to separate each tweet into its individual components. Using the pandas apply method with lambda expressions and regular expressions, we were then able to extract the hashtags, replies, URLs, and their counts into their own columns.

Although this notebook does not go into detail on how to save these results (this will be explored in a different notebook), it does provide some of the foundational work that may be needed in one's research. To keep this notebook from touching on too many topics, features such as cleaning are not discussed here but will be covered in another notebook.

This notebook serves as one of the introductory notebooks in the TATR series.