# TF-IDF

-------------------------------------------------

The raw code for this Jupyter notebook is hidden by default for easier reading. The main focus of this page of the notebook is on the graphs and their interpretation. To toggle the raw code on or off, click below:

In [1]:
# Setup Code toggle button
from IPython.core.display import HTML  

HTML(''' 
<center><h3>
<a href="javascript:code_toggle()">Talk is cheap, show me the code.</a>
</h3></center>
<script>
    var code_show=true; //true -> hide code at first

    function code_toggle() {
        $('div.prompt').hide(); // always hide prompt

        if (code_show){
            $('div.input').hide();
        } else {
            $('div.input').show();
        }
        code_show = !code_show
    }
    $( document ).ready(code_toggle);
</script>
''')

In [2]:
# Setup notebook theme
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
set_nb_theme(get_themes()[1])

In [2]:
# Load R magic
%load_ext rpy2.ipython

&nbsp;

## Problem Description

Compute the TF-IDF values for a non-stop-word in $10$ selected documents. Create a table with the TF, IDF, and TF-IDF values, as well as the corresponding URIs. Rank the results by TF-IDF values in decreasing order.

&nbsp;
## Prepare the Data

First, the query term "hacker" is chosen for investigation; `grep` finds $28$ files containing it in the run shown below. I then draw a random sample of $10$ of those files to work with.
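
The selection step can also be sketched in plain Python, where seeding the random generator makes the sample repeatable between runs. This is only an illustration under stated assumptions: the helpers `files_containing` and `sample_files`, the seed value, and the case-insensitive whole-word match are hypothetical and differ from the space-delimited `grep` pattern the notebook actually runs.

```python
import random
import re
from pathlib import Path


def files_containing(term, directory):
    """Return sorted paths of .txt files in `directory` containing `term` as a whole word."""
    # Assumption: case-insensitive whole-word matching (not the notebook's grep pattern).
    pattern = re.compile(r"\b{}\b".format(re.escape(term)), re.IGNORECASE)
    matches = []
    for path in sorted(Path(directory).glob("*.txt")):
        if pattern.search(path.read_text(errors="ignore")):
            matches.append(str(path))
    return matches


def sample_files(paths, k, seed=0):
    """Draw k paths without replacement; a fixed seed gives the same sample every run."""
    rng = random.Random(seed)
    return rng.sample(paths, k)
```

With a fixed seed, rerunning the notebook selects the same documents, so the printed outputs and the final table stay consistent with each other.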

&nbsp;

In [4]:
TERM = 'hacker'

# Use IPython's `!` shell escape to locate the needed files
# and put them in a list for use in Python.
term = ' ' + TERM + ' '
term_files = ! grep -H -r "{term}" ../data/words/* | cut -d ':' -f 1 | sort -u
print("Number of files with the term '{}': {}\n".format(TERM, len(term_files)))

import random
print("Chosen Files:")
files = random.sample(term_files, 10)
for file in files:
    print(file)

Number of files with the term 'hacker': 28

Chosen Files:
../data/words/e28f112a102a722c8c5c74289a17b78c197655b1.txt
../data/words/7982f059c834722e29cf2a4ab6bdacfb3af5f1f6.txt
../data/words/e8d480c1f853cd16427c2a50ec2f82e00192e11e.txt
../data/words/97b624d4a30335aaad8ef41538b21e4f157f954e.txt
../data/words/6391ce21cc8816c106a4e70af11628146b723bcc.txt
../data/words/e27595e9b26ece682f34d1cbda74bc7323a3c4c7.txt
../data/words/2521d3727d75a711dc3c1e3dc0afdbafeb85bebb.txt
../data/words/61b814f84bfbc8d437fb16e15df5901fbda3a6e3.txt
../data/words/038682164dc20c865f747846cc76644c11a18b05.txt
../data/words/451433b6ced8098a09b9f2523ad75b764d80d580.txt


&nbsp;

###### Data Needed to Calculate TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) scores the importance of a word in a document by weighing how often it occurs in that document against how many documents in the corpus contain it. Very common words such as "the" or "a" are scaled down, while words that appear often in a single document but rarely elsewhere are scaled up. The term frequency is normalized by document length so that documents of different sizes can be compared, and very high term frequency values should be regarded as suspicious.

*  **Term Frequency:** the number of times a word appears in a document, divided by the total number of words in that document.
*  **Inverse Document Frequency:** the logarithm of the total number of documents in the corpus divided by the number of documents containing the term.

$$
    \begin{align*}
    \text{Let}\ t&=\text{Term}\\
    \text{Let}\ IDF&=\text{Inverse Document Frequency}\\
    \text{Let}\ TF&=\text{Term Frequency}\\[2em]
    TF \:&=\: \frac{\text{term frequency in document}}{\text{total words in document}}\\[1em]
    IDF(t) \:&=\: \log_2\left(\frac{\text{total documents in corpus}}{\text{documents with term}}\right)
    \end{align*}
$$
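
The definitions above translate directly into a few lines of Python. The sketch below is a minimal illustration, not the notebook's own implementation (which does this step in R); the function names are hypothetical, and the example values are a 432-word document with one occurrence together with the corpus figures used later in this notebook.

```python
import math


def term_frequency(term_count, total_words):
    """TF: occurrences of the term divided by the document's word count."""
    return term_count / total_words


def inverse_document_frequency(corpus_size, docs_with_term):
    """IDF: base-2 log of corpus size over the number of documents with the term."""
    return math.log2(corpus_size / docs_with_term)


def tf_idf(term_count, total_words, corpus_size, docs_with_term):
    """TF-IDF: the product of TF and IDF."""
    tf = term_frequency(term_count, total_words)
    idf = inverse_document_frequency(corpus_size, docs_with_term)
    return tf * idf


# One occurrence in a 432-word document, corpus figures as used later on.
score = tf_idf(1, 432, 4460000000, 230000000)
print(round(score, 3))  # 0.01
```

A term that appears in every document has an IDF of $\log_2(1) = 0$, so its TF-IDF is zero no matter how often it occurs; this is how very common words get scaled down.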

###### Getting the Data

To make these calculations, the following data is needed:

*  Per document:
  *  The number of occurrences of the term
  *  Total number of words in the document
*  From the corpus:
  *  Total number of documents in the corpus
  *  Number of documents containing the term

&nbsp;

In [5]:
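import os
import string
from collections import defaultdict

file_data = []
for file in files:
    with open(file, 'r') as infile:
        # Replace newlines with spaces so words on adjacent lines are not joined.
        text = infile.read().replace('\n', ' ')
    # Normalize each token: strip surrounding punctuation and lowercase it.
    words = [word.strip(string.punctuation).lower() for word in text.split()]
    # Count how often each normalized word occurs in this document.
    frequency_dict = defaultdict(int)
    for word in words:
        frequency_dict[word] += 1
    # The hash is the file name without its directory or '.txt' extension.
    h = os.path.splitext(os.path.basename(file))[0]
    file_data.append((file, h, words, frequency_dict))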

# Print data needed for calculations
for file, h, words, frequency_dict in file_data:
    print("File: {}\nHash: {}\nWords: {}\nTerm Occurrences: {}\n"\
          .format(file, h, len(words), frequency_dict[TERM]))

File: ../data/words/e28f112a102a722c8c5c74289a17b78c197655b1.txt
Hash: e28f112a102a722c8c5c74289a17b78c197655b1
Words: 432
Term Occurrences: 1

File: ../data/words/7982f059c834722e29cf2a4ab6bdacfb3af5f1f6.txt
Hash: 7982f059c834722e29cf2a4ab6bdacfb3af5f1f6
Words: 1246
Term Occurrences: 1

File: ../data/words/e8d480c1f853cd16427c2a50ec2f82e00192e11e.txt
Hash: e8d480c1f853cd16427c2a50ec2f82e00192e11e
Words: 1006
Term Occurrences: 1

File: ../data/words/97b624d4a30335aaad8ef41538b21e4f157f954e.txt
Hash: 97b624d4a30335aaad8ef41538b21e4f157f954e
Words: 583
Term Occurrences: 1

File: ../data/words/6391ce21cc8816c106a4e70af11628146b723bcc.txt
Hash: 6391ce21cc8816c106a4e70af11628146b723bcc
Words: 3341
Term Occurrences: 1

File: ../data/words/e27595e9b26ece682f34d1cbda74bc7323a3c4c7.txt
Hash: e27595e9b26ece682f34d1cbda74bc7323a3c4c7
Words: 1309
Term Occurrences: 2

File: ../data/words/2521d3727d75a711dc3c1e3dc0afdbafeb85bebb.txt
Hash: 2521d3727d75a711dc3c1e3dc0afdbafeb85bebb
Words: 569
Term Occu

&nbsp;

Each hash still needs its associated URI, which was stored in a `.csv` file by the `line_hash.sh` script. At this point the data is also better organized in a dataframe than in the loose list of tuples built so far.
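
One way to attach a URI to each hash is a dataframe join. The sketch below assumes the lookup table has the `hash` and `link` columns that the notebook reads from `link_hash.csv`; the hashes, counts, and URIs are hypothetical placeholders, and the notebook itself uses a per-row lookup instead of a merge.

```python
import pandas as pd

# Hypothetical per-document counts keyed by hash (placeholder values).
counts = pd.DataFrame({
    "hash": ["aaa111", "bbb222"],
    "word_count": [432, 1246],
    "term_occurrences": [1, 1],
})

# Hypothetical hash-to-URI lookup with 'hash' and 'link' columns.
link_hash = pd.DataFrame({
    "hash": ["bbb222", "aaa111", "ccc333"],
    "link": ["example.org/b", "example.org/a", "example.org/c"],
})

# A left join keeps every counted document and attaches its URI;
# validate="one_to_one" raises if a hash is duplicated on either side.
table = counts.merge(link_hash, on="hash", how="left", validate="one_to_one")
print(table)
```

Writing `table` out with `DataFrame.to_csv` would also quote any URI that happens to contain a comma, which hand-formatted CSV lines do not.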

&nbsp;

In [6]:
import pandas

link_hash = pandas.read_csv("../../data/link_hash.csv", delimiter=',')

hashes = [h for _, h, _, _ in file_data]

df = link_hash[link_hash['hash'].isin(hashes)]
with open('../data/tfidf.dat', 'w') as csv_out:

    csv_out.write("uri,hash,file,word_count,term_occurrences\n")
    for file, h, words, frequency_dict in file_data:
        # Look up the URI for this hash once and reuse it below
        uri = df.loc[df['hash'] == h, 'link'].iloc[0]

        # Create a csv for R
        csv_out.write("{},{},{},{},{}\n"
                      .format(uri, h, file, len(words), frequency_dict[TERM]))

        # Display for inspection
        print("URI: {}\nHash: {}\nFile: {}\nTotal Words: {}\nTerm Occurrences: {}\n"
              .format(uri, h, file, len(words), frequency_dict[TERM]))

URI: www.tekbotic.com
Hash: e28f112a102a722c8c5c74289a17b78c197655b1
File: ../data/words/e28f112a102a722c8c5c74289a17b78c197655b1.txt
Total Words: 432
Term Occurrences: 1

URI: www.rtinsights.com/blockchain-technology-and-iot-security/
Hash: 7982f059c834722e29cf2a4ab6bdacfb3af5f1f6
File: ../data/words/7982f059c834722e29cf2a4ab6bdacfb3af5f1f6.txt
Total Words: 1246
Term Occurrences: 1

URI: www.grahamcluley.com/smashing-security-podcast-007-ascii-art-attack/
Hash: e8d480c1f853cd16427c2a50ec2f82e00192e11e
File: ../data/words/e8d480c1f853cd16427c2a50ec2f82e00192e11e.txt
Total Words: 1006
Term Occurrences: 1

URI: www.govtech.com/security/national-cybersecurity-center-works-quickly-to-help-its-clients-recover.html
Hash: 97b624d4a30335aaad8ef41538b21e4f157f954e
File: ../data/words/97b624d4a30335aaad8ef41538b21e4f157f954e.txt
Total Words: 583
Term Occurrences: 1

URI: www.theatlantic.com/technology/archive/2017/02/what-happened-to-trumps-secret-hacking-intel/515889/
Hash: 6391ce21cc8816c106a4

&nbsp;

## Analyze the Data

Python now has many utilities for working with data, such as [pandas](http://pandas.pydata.org/), [numpy](http://www.numpy.org/), and [scipy](https://www.scipy.org/), but in my experience its syntax lends itself less naturally to thinking of data as vectors. Languages like [R](https://cran.r-project.org/) and [GNU Octave](https://www.gnu.org/software/octave/) are built around vectorized operations. With the data neatly packaged in a dataframe format, this is a natural way to solve the problem.

To begin, the data is read into an R dataframe from the CSV file, and the variables `corpus_size` and `docs_with_term` are defined in the R session. Some basic statistics are then shown for the vectors `word_count` and `term_occurrences`.

&nbsp;

In [4]:
%%R

# From http://www.worldwidewebsize.com/ 23-FEB-2017
corpus_size <- 4460000000

# Google results 23-FEB-2017
docs_with_term <- 230000000

df.words <- read.csv('../data/tfidf.dat')

library(stargazer)
stargazer(df.words, type = 'text')


Statistic        N    Mean    St. Dev.  Min  Max 
-------------------------------------------------
word_count       10 1,555.900 1,204.202 432 4,039
term_occurrences 10   1.800     1.619    1    6  
-------------------------------------------------


&nbsp;

###### Calculate TF

As stated earlier:

$$
TF \:=\: \frac{\text{term frequency in document}}{\text{total words in document}}
$$

In the dataframe `df.words` each row represents a document, the column vector `word_count` holds the total words in the document, and the column vector `term_occurrences` holds the term frequency in the document. To calculate the TF for each document, elementwise division of the column vectors is performed. In R this is done with:

    df$newcol <- (df$x / df$y)

which is equivalent to:

$$
\begin{bmatrix}
    x_1\\
    \vdots\\
    x_n
\end{bmatrix} \odot
\begin{bmatrix}
    \frac{1}{y_1}\\
    \vdots\\
    \frac{1}{y_n}
\end{bmatrix}\:=\:
\begin{bmatrix}
    \frac{x_1}{y_1}\\
    \vdots\\
    \frac{x_n}{y_n}
\end{bmatrix}
$$

Summary statistics that include the new values are then displayed.

&nbsp;

In [8]:
%%R

df.words$TF <- (df.words$term_occurrences / df.words$word_count)
stargazer(df.words, type = 'text')


Statistic        N    Mean    St. Dev.   Min    Max 
----------------------------------------------------
word_count       10 1,555.900 1,204.202  432   4,039
term_occurrences 10   1.800     1.619     1      6  
TF               10   0.002     0.001   0.0003 0.004
----------------------------------------------------
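
For comparison, the same elementwise division can be sketched in Python with NumPy. The array values are placeholders chosen for illustration, not the notebook's data.

```python
import numpy as np

# Placeholder vectors standing in for term_occurrences and word_count.
term_occurrences = np.array([1, 2, 3])
word_count = np.array([100, 400, 1000])

# Dividing two equal-length arrays divides them element by element,
# giving one TF value per document.
tf = term_occurrences / word_count
print(tf)
```

As in R, no explicit loop is needed: the `/` operator applies to each pair of corresponding elements.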


&nbsp;

###### Calculate IDF

Again the IDF equation is:

$$
IDF(t) \:=\: \log_2\left(\frac{\text{total documents in corpus}}{\text{documents with term}}\right)
$$

The total number of documents in the corpus is stored in the R session as `corpus_size`, and the number of documents that contain the term as `docs_with_term`.

&nbsp;

In [9]:
%%R

cat("IDF: log2("); cat(corpus_size); cat(' / '); cat(docs_with_term); cat(') = ')
idf <- log2(corpus_size / docs_with_term)
cat(idf)

IDF: log2(4.46e+09 / 2.3e+08) = 4.277338
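
Written out with the two figures above, the arithmetic is:

$$
IDF(\text{hacker}) \:=\: \log_2\left(\frac{4.46 \times 10^{9}}{2.3 \times 10^{8}}\right) \:\approx\: \log_2(19.39) \:\approx\: 4.277
$$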

&nbsp;

Finally, to calculate the TF-IDF for each document, the TF of each document is multiplied by the IDF. This is done in the same vectorized manner as the TF calculation.

&nbsp;

In [10]:
%%R

df.words$TFIDF <- (df.words$TF * idf)
stargazer(df.words, type = 'text')


Statistic        N    Mean    St. Dev.   Min    Max 
----------------------------------------------------
word_count       10 1,555.900 1,204.202  432   4,039
term_occurrences 10   1.800     1.619     1      6  
TF               10   0.002     0.001   0.0003 0.004
TFIDF            10   0.006     0.005   0.001  0.019
----------------------------------------------------
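
The scalar multiplication and the ranking asked for in the problem description can be sketched in pandas as follows. The URIs and TF values are hypothetical placeholders; only the IDF value is taken from the calculation above.

```python
import pandas as pd

idf = 4.277338  # IDF value computed above for the term

# Hypothetical TF values and URIs, for illustration only.
df = pd.DataFrame({
    "URI": ["example.org/a", "example.org/b", "example.org/c"],
    "TF": [0.001, 0.004, 0.002],
})

# Multiplying a column by a scalar scales every row.
df["TFIDF"] = df["TF"] * idf

# Rank documents from highest to lowest TF-IDF.
ranked = df.sort_values("TFIDF", ascending=False).reset_index(drop=True)
print(ranked)
```

Because the IDF is the same constant for every document when only one term is considered, the ranking by TF-IDF is identical to the ranking by TF.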


&nbsp;

## Format Data for Display

The `xtable` R package can create $\LaTeX$ tables from dataframes. First, though, a dataframe containing only the columns to be displayed must be created.

&nbsp;

In [11]:
%%R

IDF <- rep.int(idf, nrow(df.words))
df <- data.frame("TF-IDF" = df.words$TFIDF,
                 "TF" = df.words$TF,
                 "IDF" = IDF,
                 "URI" = df.words$uri
                )

library(plyr)
df <- arrange(df, -TF.IDF)

library(xtable)
latex <- xtable(df)
digits(latex) <- c(0, 3, 3, 3, 0)
latex

% latex table generated in R 3.3.2 by xtable 1.8-2 package
% Thu Feb 23 22:16:33 2017
\begin{table}[ht]
\centering
\begin{tabular}{rrrrl}
  \hline
 & TF.IDF & TF & IDF & URI \\ 
  \hline
1 & 0.019 & 0.004 & 4.277 & www.generation-nt.com/nsa-harold-martin-vol-outils-hacking-tao-actualite-1939093.html \\ 
  2 & 0.010 & 0.002 & 4.277 & www.tekbotic.com \\ 
  3 & 0.008 & 0.002 & 4.277 & www.securezoo.com \\ 
  4 & 0.007 & 0.002 & 4.277 & www.govtech.com/security/national-cybersecurity-center-works-quickly-to-help-its-clients-recover.html \\ 
  5 & 0.007 & 0.002 & 4.277 & www.24brasil.com \\ 
  6 & 0.004 & 0.001 & 4.277 & www.grahamcluley.com/smashing-security-podcast-007-ascii-art-attack/ \\ 
  7 & 0.003 & 0.001 & 4.277 & www.rtinsights.com/blockchain-technology-and-iot-security/ \\ 
  8 & 0.003 & 0.001 & 4.277 & www.esquire.com/news-politics/a49791/russian-dnc-emails-hacked/ \\ 
  9 & 0.003 & 0.001 & 4.277 & www.vanguardngr.com/2017/02/6-ways-bank-account-can-hacked/ \\ 
  10 & 0.001 & 0.