# filter_TF-IDF Documentation

Zihuan Ran

April 2020

This document describes the functionality of `filter_TF-IDF.py` and the conclusions drawn from its results.

Dependencies:

    Python modules: pandas, numpy, pymongo, configparser, sys, matplotlib.pyplot

    Files: keywords_for_TFIDF.csv, secrets.ini

## Background

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

The tf–idf is the product of two statistics, term frequency and inverse document frequency. There are various ways for determining the exact values of both statistics.
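As a minimal sketch of the product described above, the snippet below uses one common variant: relative term frequency multiplied by `log(N / df)`. This variant, the function name `tf_idf`, and the toy corpus are assumptions for illustration; the exact weighting used by `filter_TF-IDF.py` is not stated here.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one tokenized document, relative to `corpus`.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens equal to `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for tokens in corpus if term in tokens)
    # Inverse document frequency; 0 when the term appears in no document.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


# Hypothetical toy corpus, already split into tokens.
corpus = [
    "climate model output data".split(),
    "ocean temperature data".split(),
    "survey of software tools".split(),
]
score_data = tf_idf("data", corpus[0], corpus)        # appears in 2 of 3 documents
score_climate = tf_idf("climate", corpus[0], corpus)  # appears in 1 of 3 documents
```

In this toy corpus, "climate" scores higher than "data" for the first document because it occurs in fewer documents overall.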

The score built in this part measures the importance of a keyword or phrase within the database.

In this project, we apply the TF-IDF scoring scheme to evaluate **how relevant each artifact** in our database is to the provided keywords.

## Usage

The script `filter_TF-IDF.py` can be used in the following two ways:
1. To calculate the TF-IDF score for each keyword we provide.

    Execution: `python filter_TF-IDF.py TFIDF`
    
    The provided keywords should be in the file `keywords_for_TFIDF.csv`.
    
    The script retrieves all text elements (description, keywords, title) of the artifacts for this calculation.

2. To calculate the relevance score for each artifact in the database.

    Execution: `python filter_TF-IDF.py filter <json-file-for-result>`
    
    The relevance score is defined as:
    > sum of (frequency of keyword) \* (TF-IDF score for that keyword) over all keywords the user provides in `keywords_for_TFIDF.csv`
    
    The result is written to the user-specified JSON file: `<json-file-for-result>`
    
    The result is in the format `{"_id": relevance score}`
    
    The program also automatically generates a CDF figure: `rlv_score_cdf.png`
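The relevance score defined in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function name `relevance_score`, the whitespace tokenization, and the sample scores and artifact texts are all hypothetical.

```python
def relevance_score(text, tfidf_scores):
    """Sum of (keyword frequency in `text`) * (TF-IDF score of that keyword).

    Assumption: lowercase whitespace tokenization, so only single-word
    keywords are counted; phrases would need a different counting step.
    """
    tokens = text.lower().split()
    total = 0.0
    for keyword, score in tfidf_scores.items():
        frequency = tokens.count(keyword.lower())
        total += frequency * score
    return total


# Hypothetical TF-IDF scores and artifact texts, for illustration only.
tfidf_scores = {"dataset": 2.0, "climate": 3.5}
artifacts = {
    "a1": "Climate dataset with climate projections",
    "a2": "A cooking blog",
}
# Same shape as the result file: {"_id": relevance score}
result = {
    artifact_id: relevance_score(text, tfidf_scores)
    for artifact_id, text in artifacts.items()
}
```

An artifact whose text contains none of the keywords gets a score of 0, and repeated keywords raise the score linearly.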

## Results

In [1]:
import pandas as pd

In [2]:
import json

In [3]:
with open('filter_TFIDF_result.json') as f:
    data = json.load(f)

In [4]:
data

{'5e5fd56e6dc9c2e22610c9d6': 3402243195.205805,
 '5e5fd56e6dc9c2e22610c9d7': 2769302329.7568436,
 '5e5fd56e6dc9c2e22610c9d8': 3377061301.0105367,
 '5e5fd56e6dc9c2e22610c9d9': 3189025661.0815034,
 '5e5fd56e6dc9c2e22610c9da': 2071384977.6706123,
 '5e5fd56e6dc9c2e22610c9db': 1566759323.1248379,
 '5e5fd56e6dc9c2e22610c9dc': 902373197.5565042,
 '5e5fd56e6dc9c2e22610c9dd': 3035254622.189856,
 '5e5fd56e6dc9c2e22610c9de': 2300500771.94303,
 '5e5fd56e6dc9c2e22610c9df': 1922451170.9404886,
 '5e5fd56e6dc9c2e22610c9e0': 3061280237.3001404,
 '5e5fd56e6dc9c2e22610c9e1': 76441.16435319686,
 '5e5fd56e6dc9c2e22610c9e2': 2059477223.6540704,
 '5e5fd56e6dc9c2e22610c9e3': 2059477223.6540704,
 '5e5fd56e6dc9c2e22610c9e4': 1360031307.1198237,
 '5e5fd56e6dc9c2e22610c9e5': 3658210481.1818275,
 '5e5fd56e6dc9c2e22610c9e6': 3058182402.1447353,
 '5e5fd56e6dc9c2e22610c9e7': 3620063500.6513824,
 '5e5fd56e6dc9c2e22610c9e8': 4191675173.534049,
 '5e5fd56e6dc9c2e22610c9e9': 3504577964.3099027,
 '5e5fd56e6dc9c2e22610c9ea'

In [6]:
len(data)

834199

In [22]:
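low = sorted(data, key=data.get, reverse=True)[:200]  # reverse=True ranks the highest score first, so these are the 200 highest-scoring ids despite the name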

In [10]:
0.5*834199

417099.5

In [23]:
middle = sorted(data, key=data.get, reverse=True)[417099:417099+200]
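The two cells above sort the full result twice. A minimal sketch of the same sampling that sorts once and slices several bands is shown below; the helper name `sample_bands`, the band labels, and the toy scores are assumptions for illustration.

```python
def sample_bands(scores, size):
    """Rank ids by score (highest first) and take `size` ids from three bands.

    `size` is expected to be a positive integer.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    ranked = sorted(scores, key=scores.get, reverse=True)
    mid_start = len(ranked) // 2  # same index as int(0.5 * len(scores))
    return {
        "highest": ranked[:size],
        "middle": ranked[mid_start:mid_start + size],
        "lowest": ranked[-size:],
    }


# Hypothetical toy scores; the real input would be the loaded `data` dict.
toy_scores = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 4.0, "e": 2.0, "f": 0.5}
bands = sample_bands(toy_scores, 2)
```

With the real result, `sample_bands(data, 200)` would give the same `[:200]` and `[417099:417099+200]` slices as the cells above, plus the 200 lowest-scoring ids.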

In [12]:
import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pymongo
from configparser import ConfigParser
import sys
import matplotlib.pyplot as plt

In [13]:
# mongodb configuration
config = ConfigParser()
config.read('secrets.ini')
DB_USER = config['MONGODB']['CKIDS_USER']
DB_PASS = config['MONGODB']['CKIDS_PASS']
DB_NAME = config['MONGODB']['CKIDS_DB_NAME']
HOST = config['AWS']['HOST_IP']
PORT = config['AWS']['HOST_PORT']
# connect to mongodb
client = pymongo.MongoClient("mongodb://{DB_USER}:{DB_PASS}@{HOST}:{PORT}/{DB_NAME}".format(
    DB_USER=DB_USER, DB_PASS=DB_PASS, HOST=HOST, PORT=PORT, DB_NAME=DB_NAME))
db = client[DB_NAME]
collection = db["raw_artifacts"]
print("Connected to MongoDB.")

Connected to MongoDB.


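In [24]:
from bson import ObjectId  # ships with pymongo; assumes `_id` is stored as an ObjectId
low_urls = []
for ids in low:
    # use the `collection` handle from above; `db.collection` would address a collection literally named "collection"
    low_urls.extend(collection.find({"_id": ObjectId(ids)}, {"url": 1}))

In [27]:
ids

'5e5ffcf96e9d3cc9bb476aa6'

In [28]:
collection.find({"_id": ObjectId(ids)}, {"url": 1})  # returns a lazy cursor; iterate it to fetch the document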

<pymongo.cursor.Cursor at 0x11ee3ce10>

Below is the cumulative distribution figure, showing cumulative count vs. relevance score.

We can see that most artifacts have a relevance score at or below 3e+10.

Because the relevance scores are very large and hard to compare clearly, one possible action is to take the log of the current score.

![CDF](rlv_score_cdf.png)
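The log transform suggested above could be sketched as follows. The helper names `log_scores` and `empirical_cdf` are hypothetical, and the use of `log10(1 + score)` assumes that all relevance scores are non-negative; the three sample values are taken from the result shown earlier.

```python
import math


def log_scores(scores):
    """Map each score to log10(1 + score), so a score of 0 stays at 0.

    Assumption: scores are non-negative.
    """
    return {key: math.log10(1.0 + value) for key, value in scores.items()}


def empirical_cdf(values):
    """Return (sorted values, fraction of values at or below each one)."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


# Three scores copied from the result file shown earlier.
sample = {
    "5e5fd56e6dc9c2e22610c9e1": 76441.16435319686,
    "5e5fd56e6dc9c2e22610c9dc": 902373197.5565042,
    "5e5fd56e6dc9c2e22610c9e8": 4191675173.534049,
}
transformed = log_scores(sample)
xs, ys = empirical_cdf(transformed.values())
```

The resulting `xs` and `ys` can be passed to `matplotlib.pyplot.step(xs, ys, where="post")` to draw the CDF on the log scale, where scores that differ by orders of magnitude become easier to compare.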