# filter_using_TFIDF Documentation

Zihuan Ran: zran@usc.edu

April 2020

This document describes the `filter_using_TFIDF` repository: what it does, how to use it, and possible improvements.

Dependencies: 

    pandas, numpy, pymongo, configparser, matplotlib.pyplot; 

    final_kw_list.csv, resources/secrets.ini

## 1 Background

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

The tf–idf is the product of two statistics, term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics.

Here we use:

$${\displaystyle \mathrm {tfidf} (t,d,D)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t,D)}$$

Term frequency:

$$\mathrm{tf}(t,d) = f_{t,d}, \quad \text{the frequency of term } t \text{ in document } d$$

Probabilistic inverse document frequency:

$$ \mathrm{idf}(t, D) = {\displaystyle \log {\frac {N-n_{t}}{n_{t}}}}$$

with

$$N:\ \text{total number of documents in the corpus}$$

$$n_{t} = |\{d\in D:t\in d\}|:\ \text{number of documents in which the term } t \text{ appears}$$
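The formulas above can be illustrated with a minimal sketch. It assumes each document is already tokenized into a list of terms; the function names and the toy corpus are hypothetical and are not taken from `filter_using_TFIDF.py`. Note that this probabilistic idf is undefined when a term appears in no document or in every document, and it becomes negative when a term appears in more than half of the documents.

```python
import math
from collections import Counter


def probabilistic_idf(term, documents):
    """Return log((N - n_t) / n_t), or None where the formula is undefined."""
    n_docs = len(documents)
    n_t = sum(1 for doc in documents if term in doc)
    if n_t == 0 or n_t == n_docs:
        # Assumption: the caller decides how to treat these edge cases.
        return None
    return math.log((n_docs - n_t) / n_t)


def tfidf(term, document, documents):
    """Return tf(t, d) * idf(t, D) with tf as the raw count of the term."""
    tf = Counter(document)[term]
    idf = probabilistic_idf(term, documents)
    return None if idf is None else tf * idf


# Hypothetical toy corpus of four tokenized documents.
corpus = [
    ["malware", "exploit", "malware"],
    ["recipe", "pasta"],
    ["exploit", "patch"],
    ["news", "weather"],
]

# "malware" occurs twice in the first document and in 1 of 4 documents,
# so the score is 2 * log(3).
score = tfidf("malware", corpus[0], corpus)
print(score)
```

In this corpus "exploit" appears in exactly half of the documents, so its idf is `log(1) = 0` and it contributes nothing to the score.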


The goal of this repository is to compute the **relevance score** $R$, defined as:

$$R(a) = \sum_{t} \mathrm{tfidf}(t) \cdot (\text{frequency of } t \text{ in } a), \quad \text{for an artifact } a$$

It uses the importance of a keyword or phrase within the database as an **estimate** of how related it is to cybersecurity. This is discussed further in later sections.

In this project, we apply the TF-IDF scoring scheme to evaluate **how related each artifact** in our database is to cybersecurity.
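As a rough illustration of the relevance score, the sketch below sums each keyword's TF-IDF score weighted by how often the keyword occurs in an artifact. It assumes single-word keywords and pre-tokenized artifact text; `relevance_score` and the keyword scores are hypothetical, and multi-word phrases would need a different matching step.

```python
from collections import Counter


def relevance_score(artifact_tokens, keyword_scores):
    """Sum tfidf(t) * (frequency of t in the artifact) over all keywords t."""
    counts = Counter(artifact_tokens)
    return sum(score * counts[kw] for kw, score in keyword_scores.items())


# Hypothetical per-keyword TF-IDF scores (not real values from the repository).
keyword_scores = {"malware": 2.0, "exploit": 0.5}

artifact = ["new", "malware", "uses", "exploit", "malware"]
r = relevance_score(artifact, keyword_scores)  # 2.0 * 2 + 0.5 * 1
print(r)
```

An artifact that contains none of the keywords gets a score of 0 under this definition.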

## 2 Usage

Run the script `filter_using_TFIDF.py` as follows:

`python filter_TF-IDF.py`
    
* The TFIDF scores for the keywords are written to `final_kw_TFIDF_Score.csv`.
* The relevance results are written to the JSON file `final_filter_TFIDF_result.json`, in the format `{"_id": relevance score}`.
* The program also automatically generates a CDF figure: `rlv_score_cdf.png`.

The Jupyter notebook `Filter_using_TFIDF.ipynb` does similar work, as described in the notebook itself, and additionally produces samples with scores for manual checking. For details, please refer to the section **Generate samples** in `Filter_using_TFIDF.ipynb`.

## 3 Results

```python
import pandas as pd
import json

# Load the relevance scores produced by the script.
with open('final_filter_TFIDF_result.json') as f:
    data = json.load(f)
```

Examples:
```
{'5e5fd5726dc9c2e22610ca33': 10.24767956100582,
 '5e5fd5726dc9c2e22610ca35': 0.0,
 '5e5fd5726dc9c2e22610ca3b': 7.1031021197942525,
 '5e5fd5726dc9c2e22610ca3e': 11.618451817838762,
 ...}
```
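One way such a result could be used as a filter is to keep only the artifacts whose score reaches a cutoff. The sketch below assumes a dictionary shaped like the example above; `filter_by_score` and the threshold of 5.0 are hypothetical and would need to be chosen from the data.

```python
def filter_by_score(scores, threshold):
    """Keep the artifacts whose relevance score is at least the threshold."""
    return {
        artifact_id: score
        for artifact_id, score in scores.items()
        if score >= threshold
    }


# The four example entries shown above.
data = {
    "5e5fd5726dc9c2e22610ca33": 10.24767956100582,
    "5e5fd5726dc9c2e22610ca35": 0.0,
    "5e5fd5726dc9c2e22610ca3b": 7.1031021197942525,
    "5e5fd5726dc9c2e22610ca3e": 11.618451817838762,
}

# Hypothetical cutoff; the artifact with score 0.0 is dropped.
kept = filter_by_score(data, threshold=5.0)
print(sorted(kept))
```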

The figure below shows the cumulative count plotted against the log relevance score.

Most keywords have a relevance score at or below 3e+10, so it is reasonable to take the log of the score.

![CDF](rlv_score_cdf.png)
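The points of such a curve can be computed with a short sketch like the following. It is not the plotting code from the script: it assumes a base-10 logarithm, drops zero scores (whose log is undefined), and returns cumulative fractions rather than counts. `log_score_cdf` and the sample scores are hypothetical.

```python
import math


def log_score_cdf(scores):
    """Return (log10 score, cumulative fraction) pairs in increasing order."""
    # Assumption: zero scores are skipped because log10(0) is undefined.
    logs = sorted(math.log10(s) for s in scores.values() if s > 0)
    n = len(logs)
    return [(x, (i + 1) / n) for i, x in enumerate(logs)]


# Hypothetical scores keyed by artifact id.
sample_scores = {"a": 1.0, "b": 100.0, "c": 0.0, "d": 10.0}

cdf_points = log_score_cdf(sample_scores)
for log_score, fraction in cdf_points:
    print(f"{log_score:6.2f}  {fraction:.2f}")
```

The resulting pairs can be passed to any plotting library as x and y values.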

## 4 Future Improvement

* As mentioned in the 'Background' section above, this way of generating relevance scores has a systematic problem:

We are using the *importance of cybersecurity keywords in the **unfiltered** database* to represent the *effectiveness of distinguishing CS-related content*. This approach is useful and efficient in covering all data in the database, but it is also circular: **the more CS-related the database is, the better the keywords' scores perform**, and hence the better the filter will be.

One possible solution is to manually check the file `final_kw_TFIDF_Score.csv` to see whether the more CS-specific words do have better scores, possibly followed by some further human tuning.

However, please note that this is by no means the only solution.
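To support the manual check of the keyword scores, a small helper like the one below could list the keywords from highest to lowest score. The column names `keyword` and `score` are assumptions, since the layout of `final_kw_TFIDF_Score.csv` is not described here, and the sample rows are made up for illustration.

```python
import csv
import io


def rank_keywords(csv_file, keyword_col="keyword", score_col="score"):
    """Read (keyword, score) rows and return them sorted by score, highest first."""
    rows = csv.DictReader(csv_file)
    scored = [(row[keyword_col], float(row[score_col])) for row in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO("keyword,score\nmalware,8.1\nupdate,1.2\nransomware,9.4\n")

ranking = rank_keywords(sample)
for keyword, score in ranking:
    print(f"{score:8.2f}  {keyword}")
```

For the real file, an open file handle can be passed in place of the `io.StringIO` sample, with the column names adjusted to match the actual header.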