# TFIDF

-------------------------------------------------

The raw code for this Jupyter notebook is by default hidden for easier reading. The main focus of this particular page of the notebook is on the graphs and their interpretation. To toggle on/off the raw code, click below:

In [1]:
# Setup Code toggle button
from IPython.core.display import HTML  

HTML(''' 
<center><h3>
<a href="javascript:code_toggle()">Talk is cheap, show me the code.</a>
</center></h3>
<script>
    var code_show=true; //true -> hide code at first

    function code_toggle() {
        $('div.prompt').hide(); // always hide prompt

        if (code_show){
            $('div.input').hide();
        } else {
            $('div.input').show();
        }
        code_show = !code_show
    }
    $( document ).ready(code_toggle);
</script>
''')

In [2]:
# Setup notebook theme
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
set_nb_theme(get_themes()[1])

In [3]:
# Load R magic
%load_ext rpy2.ipython

&nbsp;

## Problem Description

Compute the TF-IDF values for a non-stop-word in $10$ selected documents. Create a table with the TF, IDF, and TF-IDF values, as well as the corresponding URIs. Rank the results by TF-IDF values in decreasing order.

&nbsp;
## Prepare the Data

First the query term "hacker" is chosen to be investigated, and `grep` finds $191$ files containing the term. Then I choose any random sample of $10$ and continue on.

&nbsp;

In [4]:
TERM = 'hacker'

# Use `bash magic' to locate needed files
# and put them in a list for use in python.
term = ' ' + TERM + ' '
term_files = ! grep -H -r "{term}" ../data/words/* | cut -d ':' -f 1
print("Number of files with the term 'hacker': {}\n".format(len(term_files)))

import random
print("Chosen Files:")
files = random.sample(term_files, 10)
for file in files:
    print(file)

Number of files with the term 'hacker': 38

Chosen Files:
../data/words/0526a5b31b8343e8efad4970531525976e00bb4b.txt
../data/words/74ee2852ae0adb0dcedd9cb50913cfbe291f7f6d.txt
../data/words/e28f112a102a722c8c5c74289a17b78c197655b1.txt
../data/words/e8d480c1f853cd16427c2a50ec2f82e00192e11e.txt
../data/words/6391ce21cc8816c106a4e70af11628146b723bcc.txt
../data/words/61b814f84bfbc8d437fb16e15df5901fbda3a6e3.txt
../data/words/038682164dc20c865f747846cc76644c11a18b05.txt
../data/words/74ee2852ae0adb0dcedd9cb50913cfbe291f7f6d.txt
../data/words/df19e825fc38b7a0c4c9f5adea93c9a0948f6367.txt
../data/words/429ee67adf0e4bd44c4302ee737749c657ac98d4.txt


&nbsp;

###### Data Needed to Calculate TF-IDF

To calculate Term Frequency - Inverse Document Frequency (TF-IDF) the importance of the words in a document are scored on how frequently they occur across multiple documents. Words that are very common like "the" or "a" will be scaled down while words that appear often in a single document will be scaled up. The term frequency value should be normalized to take different sized documents into account, and very high term frequency values should be regarded as suspicious.

*  **Term Frequency:** the number of times a word appears in a document.
*  **Inverse Document Frequency:** total documents in corpus over number of documents with term.

$$
    \begin{align*}
    \text{Let}\ t&=\text{Term}\\
    \text{Let}\ IDF&=\text{Inverse Document Frequency}\\
    \text{Let}\ TF&=\text{Term Frequency}\\[2em]
    TF \:&=\: \frac{\text{term frequency in document}}{\text{total words in document}}\\[1em]
    IDF(t) \:&=\: \log_2\left(\frac{\text{total documents in corpus}}{\text{documents with term}}\right)
    \end{align*}
$$

###### Getting the Data

In order to make these calculations the data needed from each file is:

*  Per Document:
  *  The number of occurances of the term
  *  Total number of words in the document
*  From the corpus:
  *  Total number of documents in the corpus
  *  Number of documents containing the term.

&nbsp;

In [5]:
import string
from collections import defaultdict

corpus_size = !ls -1q ../data/words/ | wc -l
corpus_size = corpus_size[0]

docs_with_term = len(term_files)

words = []
file_data = []
for file in files:
    with open(file, 'r') as infile:
        text = infile.read().replace('\n', '')
        words = [word.strip(string.punctuation).lower() for word in text.split()]
        frequency_dict = defaultdict(int)
        for word in words:
            frequency_dict[word] += 1
        h = file[14:-4]
        file_data.append((file, h, words, frequency_dict))

# Print data needed for calcualations
print("Corpus Size: {}\nDocuments with term: {}\n".format(corpus_size, docs_with_term))
for file, h, words, frequency_dict in file_data:
    print("File: {}\nHash: {}\nWords: {}\nTerm Occurrences: {}\n"\
          .format(file, h, len(words), frequency_dict[TERM]))

Corpus Size: 928
Documents with term: 38

File: ../data/words/0526a5b31b8343e8efad4970531525976e00bb4b.txt
Hash: 0526a5b31b8343e8efad4970531525976e00bb4b
Words: 898
Term Occurrences: 1

File: ../data/words/74ee2852ae0adb0dcedd9cb50913cfbe291f7f6d.txt
Hash: 74ee2852ae0adb0dcedd9cb50913cfbe291f7f6d
Words: 845
Term Occurrences: 4

File: ../data/words/e28f112a102a722c8c5c74289a17b78c197655b1.txt
Hash: e28f112a102a722c8c5c74289a17b78c197655b1
Words: 432
Term Occurrences: 1

File: ../data/words/e8d480c1f853cd16427c2a50ec2f82e00192e11e.txt
Hash: e8d480c1f853cd16427c2a50ec2f82e00192e11e
Words: 1006
Term Occurrences: 1

File: ../data/words/6391ce21cc8816c106a4e70af11628146b723bcc.txt
Hash: 6391ce21cc8816c106a4e70af11628146b723bcc
Words: 3341
Term Occurrences: 1

File: ../data/words/61b814f84bfbc8d437fb16e15df5901fbda3a6e3.txt
Hash: 61b814f84bfbc8d437fb16e15df5901fbda3a6e3
Words: 1355
Term Occurrences: 6

File: ../data/words/038682164dc20c865f747846cc76644c11a18b05.txt
Hash: 038682164dc20c865f74

&nbsp;

The data still needs the URI associated with each hash, this was stored in a `.csv` file using the `line_hash.sh` script. The data would be best structured at this point in a dataframe instead of as unorganized as it currently is. 

&nbsp;

In [6]:
import pandas

link_hash = pandas.read_csv("../../data/link_hash.csv", delimiter=',')

hashes = []
for file, h, words, frequency_dict in file_data:
    hashes.append(h)

df = link_hash[link_hash['hash'].isin(hashes)]
with open('../data/tfidf.dat', 'w') as csv_out:
    
    csv_out.write("uri,hash,file,word_count,term_occurrences\n")
    for file, h, words, frequency_dict in file_data:
        
        # Create a csv for R
        csv_out.write("{},{},{},{},{}\n"\
                     .format(df.loc[df['hash'] == h, 'link'].iloc[0],
                             h, file, len(words), frequency_dict[TERM]))
        
        # Display for inspection
        print("URI: {}\nHash: {}\nFile: {}\nTotal Words: {}\nTerm Occurrences: {}\n"\
             .format(df.loc[df['hash'] == h, 'link'].iloc[0],
                     h, file, len(words), frequency_dict[TERM]))

URI: www.scmagazine.com/attackers-steal-from-atms-after-infecting-banks-with-memory-only-malware/article/637029/?dcmp=emc-scus_newswire&spmailingid=16524299&spuserid=mzeyntk5nzmznjus1&spjobid=960720405&spreportid=otywnziwnda1s0
Hash: 0526a5b31b8343e8efad4970531525976e00bb4b
File: ../data/words/0526a5b31b8343e8efad4970531525976e00bb4b.txt
Total Words: 898
Term Occurrences: 1

URI: www.redmondpie.com/firm-that-helped-fbi-break-into-san-bernardino-iphone-gets-hacked-tools-leaked-online/
Hash: 74ee2852ae0adb0dcedd9cb50913cfbe291f7f6d
File: ../data/words/74ee2852ae0adb0dcedd9cb50913cfbe291f7f6d.txt
Total Words: 845
Term Occurrences: 4

URI: www.tekbotic.com
Hash: e28f112a102a722c8c5c74289a17b78c197655b1
File: ../data/words/e28f112a102a722c8c5c74289a17b78c197655b1.txt
Total Words: 432
Term Occurrences: 1

URI: www.grahamcluley.com/smashing-security-podcast-007-ascii-art-attack/
Hash: e8d480c1f853cd16427c2a50ec2f82e00192e11e
File: ../data/words/e8d480c1f853cd16427c2a50ec2f82e00192e11e.txt
Tot

&nbsp;

## Analyze the Data

Python has many utilities now to deal with data such as [pandas](http://pandas.pydata.org/), [numpy](http://www.numpy.org/), and [scipy](https://www.scipy.org/), but the syntax does not lend itself to thinking of data as vectors. For performing vectorized operations languages like [R](https://cran.r-project.org/) and [GNU Octave](https://www.gnu.org/software/octave/) are better suited. With the data all neatly packaged in a dataframe format this will be the most efficient way to solve the problem.

To begin the data is read into a R dataframe from the CSV file, and the variables `corpus_size` and `docs_with_term` are passed into the R kernel. Some basic statistics are shown about the vectors `word_count` and `term_occurrences`.

&nbsp;

In [7]:
%%R -i corpus_size -i docs_with_term

df.words <- read.csv('../data/tfidf.dat')

library(stargazer)
stargazer(df.words, type = 'text')

Please cite as: 








Statistic        N    Mean    St. Dev.  Min  Max 
-------------------------------------------------
word_count       10 1,462.600 1,212.145 432 4,039
term_occurrences 10   2.300     1.829    1    6  
-------------------------------------------------


&nbsp;

###### Calculate TF

As stated earlier:

$$
TF \:=\: \frac{\text{term frequency in document}}{\text{total words in document}}
$$

In the dataframe `data.words` each row represents a document, the column vector `word_count` represents the total words in the document, and the column vector `term_ocurrences` represents term frequency in the document. To calculate the TF for each document elementwise division of the column vectors is performed. In R this done with:

    df$newcol <- (df$x / df$y)

Which is the equivalent to:

$$
\begin{bmatrix}
    x_1\\
    \vdots\\
    x_n
\end{bmatrix} \odot
\begin{bmatrix}
    \frac{1}{y_1}\\
    \vdots\\
    \frac{1}{y_n}
\end{bmatrix}\:=\:
\begin{bmatrix}
    \frac{x_1}{y_1}\\
    \vdots\\
    \frac{x_n}{y_n}
\end{bmatrix}
$$

And display some statistics with the new values.

&nbsp;

In [8]:
%%R

df.words$TF <- (df.words$term_occurrences / df.words$word_count)
stargazer(df.words, type = 'text')


Statistic        N    Mean    St. Dev.   Min    Max 
----------------------------------------------------
word_count       10 1,462.600 1,212.145  432   4,039
term_occurrences 10   2.300     1.829     1      6  
TF               10   0.002     0.002   0.0003 0.005
----------------------------------------------------


&nbsp;

###### Calculate IDF

Again the IDF equation is:

$$
IDF(t) \:=\: \log_2\left(\frac{\text{total documents in corpus}}{\text{documents with term}}\right)
$$

The total documents in corpus has been passed into the R kernel as `corpus_size` and the number of documents that contain the term as `docs_with_term`.

&nbsp;

In [14]:
%%R

idf <- log2((as.integer(corpus_size)/as.integer(docs_with_term)))
cat("IDF: ")
cat(idf)

IDF: 4.610053

&nbsp;

Finally to calculate the TF-IDF for each document the TF of each document is multiplied by the IDF. This can be done in the same vectorized manner that the TF calculation was performed.

&nbsp;

In [17]:
%%R

df.words$TFIDF <- (df.words$TF * idf)
stargazer(df.words, type = 'text')


Statistic        N    Mean    St. Dev.   Min    Max 
----------------------------------------------------
word_count       10 1,462.600 1,212.145  432   4,039
term_occurrences 10   2.300     1.829     1      6  
TF               10   0.002     0.002   0.0003 0.005
IDF              10   0.010     0.008   0.001  0.022
TFIDF            10   0.010     0.008   0.001  0.022
----------------------------------------------------


&nbsp;

## Format Data for Display

The `xtable` R package can create $\LaTeX$ tables from dataframes. First the dataframe must be created with only the wanted data to display though.

&nbsp;

In [27]:
%%R

IDF <- rep.int(idf, 10)
df <- data.frame(df.words$uri, df$TF, idf, df$TFIDF)

library(xtable)
xtable(df)

df


Fehler in data.frame(df.words$uri, df$TF, idf, df$TFIDF) : 
  Argumente implizieren unterschiedliche Anzahl Zeilen: 10, 0, 1


  Argumente implizieren unterschiedliche Anzahl Zeilen: 10, 0, 1

