## Deadline + Late Penalty

$\textbf{Note:}$ This project will take considerable time to complete, so we strongly recommend that you start working as early as possible. You should read the specs carefully at least 2-3 times before you start coding.

* $\textbf{Submission deadline for the Project (Part-1) is 20:59:59 (08:59:59 PM) on 4th Nov, 2019}$
* $\textbf{LATE PENALTY: 10% on day-1 and 20% on each subsequent day.}$

## Instructions
1. This notebook contains instructions for $\textbf{COMP6714-Project (Part-1)}$. We will release the instructions for the $\textbf{Part-2 of the Project}$ in a separate notebook. 

* You are required to complete your implementation for part-1 in a file `project_part1.py` provided along with this notebook. Please $\textbf{DO NOT ALTER}$ the name of the file.

* You are not allowed to print out unnecessary stuff. We will not consider any output printed out on the screen. All results should be returned in appropriate data structures via corresponding functions.

* You can submit your implementation for **Project (Part-1)** via submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems please post your request @ Piazza.

* For each question, we have provided you with detailed instructions along with question headings. In case of problems, you can post your query @ Piazza.

* You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. **Only functions are allowed** in `project_part1.py`

* You should not import unnecessary and non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence 0 mark for your project. If you are not sure, please ask @ Piazza. 

* We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day. 

* For the **Final Evaluation**, we will be using a different dataset, so your final scores may vary.  

* You are allowed a limited number of Feedback Attempts $\textbf{(15 Attempts for each student)}$. We will use your **LAST** submission for Final Evaluation.

### Allowed Libraries:

You are required to write your implementation for the project (part-1) using `Python 3.6.5`. You are only allowed to use the following python libraries:
* $\textbf{spacy (v2.1.8)}$

# Q1: Compute TF-IDF score for query (80 Points)

### Introduction

In this project, you are required to compute the $TF\text{-}IDF$ score of a document $D_{j}$ $\textit{w.r.t}$ an input query $Q$ and a Dictionary of Entities $(DoE)$.

### Inputs (Q1):
Inputs to your model are as follows:
1. Documents ($D$), as a dictionary with $key:$ doc_id; $value:$ document text
2. Query ($Q$), as a string of words
3. Dictionary of Entities ($DoE$), with $key:$ entity; $value:$ entity_id

The $TF\text{-}IDF$ score is computed in the following steps:

1. $\textbf{TF-IDF index construction for Entities and Tokens}$
2. $\textbf{Split the query into lists of Entities and Tokens}$
3. $\textbf{Query Score Computation}$

These steps are described in detail below:

## 1. TF-IDF index construction for Entities and Tokens

You are required to build separate TF-IDF indexes for tokens ($TF\text{-}IDF_{token}$) and entities ($TF\text{-}IDF_{entity}$). Each index is computed as follows: 

### TF-IDF index For Tokens:

The term frequency of the token $t$ in a document $D_{j}$ is computed as follows:

$$
TF_{token}(t,D_{j}) = {\# \; of \; times \; token \; t \; appears \; in \; D_{j}}
$$


To de-emphasize high token frequencies, we normalize the term frequency with a double natural log. The normalized term frequency of token $t$ is computed as follows:

$$
TF_{norm\_token}(t,D_{j}) =  1.0 + \ln(1.0 + \ln(TF_{token}(t,D_{j})))
$$

And, the Inverse Document Frequency of the token $t$ is computed as follows: 

$$
IDF_{token}(t) = 1.0 + \ln(\frac{total \; \# \; of \; docs}{1.0 + \# \; of \; docs \; containing \; token \; \textit{t}})
$$

The TF-IDF score of token $t$ in document $D_{j}$ is computed as: <br>

$$
TF\text{-}IDF_{token}(t,D_{j}) = TF_{norm\_token}(t,D_{j}) * IDF_{token}(t)
$$


### TF-IDF index for Entities:
The term frequency of the entity $e$ in a document $D_{j}$ is computed as follows:

$$
TF_{entity}(e,D_{j}) = {\# \; of \; times \; entity \; e \; appears \; in \; D_{j}}
$$

We simply use natural log to normalize the term frequency of the entities, as given below:

$$
TF_{norm\_entity}(e,D_{j}) =  1.0 + \ln(TF_{entity}(e,D_{j}))
$$

And, the Inverse Document Frequency of the entity $e$ is computed as follows: 


$$
IDF_{entity}(e) = 1.0 + \ln(\frac{total \; \# \; of \; docs}{1.0 + \# \; of \; docs \; containing \; entity \; \textit{e}})
$$


The TF-IDF score of the entity $e$ in the document $D_{j}$ is computed as: <br>

$$
TF\text{-}IDF_{entity}(e,D_{j}) = TF_{norm\_entity}(e,D_{j}) * IDF_{entity}(e)
$$

$\textbf{Note:}$ We assign `TF-IDF score = 0.0` for the cases where the term frequency (TF) for the token and/or entity is `ZERO`.
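
As a minimal sketch of the formulas above, the two scores can be written as small Python functions. The function names and the arguments `tf`, `n_docs` and `df` (the number of documents containing the term) are illustrative assumptions and not part of the provided template. The zero-TF rule from the note is applied before any logarithm is taken, because $\ln(0)$ is undefined.

```python
from math import log

def idf(n_docs, df):
    """IDF shared by tokens and entities: 1 + ln(N / (1 + df))."""
    return 1.0 + log(n_docs / (1.0 + df))

def tf_idf_token(tf, n_docs, df):
    """Token score: double-log normalized TF times IDF (0.0 when TF is zero)."""
    if tf == 0:
        return 0.0
    return (1.0 + log(1.0 + log(tf))) * idf(n_docs, df)

def tf_idf_entity(tf, n_docs, df):
    """Entity score: single-log normalized TF times IDF (0.0 when TF is zero)."""
    if tf == 0:
        return 0.0
    return (1.0 + log(tf)) * idf(n_docs, df)

# Hypothetical counts: 3 documents in total.
print(tf_idf_token(1, 3, 1))   # token seen once, in 1 of 3 documents
print(tf_idf_entity(1, 3, 2))  # entity seen once, in 2 of 3 documents
```

For the same raw count greater than one, the entity formula grows faster than the token formula, since it applies only a single logarithm.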

## 2. Split the Query into Entities and Tokens:

At first, you are required to split the query ($Q$) into all possible combinations of free keywords, i.e., tokens ($K = \{k_{i}\}_{i=1}^{N}$) and entities ($E= \{e_{i}\}_{i=1}^{N}$), where entities correspond to a subset of entities found in $DoE$ formed by individual and/or combination of tokens in $Q$. This process is explained below:

> $\textbf{Step 1:}$ We look for probable entities in $Q$ by considering individual query tokens and/or combinations of query tokens, combined in the increasing order in which they appear in the query string. Among them, we only select the entities present in $DoE$.<br>
> $\textbf{Step 2:}$ Based on the selected list of entities found in $\textbf{Step-1}$ enumerate all possible subsets of entities.<br>
> $\textbf{Step 3:}$ Filter subsets of entities found in $\textbf{Step-2}$ such that for each subset the token count does not exceed the corresponding token count in $Q$. We treat the filtered subset as the final entities of the corresponding query split.<br>
> $\textbf{Step 4:}$ For each filtered entity subset, the rest of the keywords in the query, i.e., $(Q \setminus wordsInEntities(e_{i}))$ are treated as the tokens of the query split.<br>


Formally, let the query be a string of tokens, e.g., $Q = \;"A\;B \;C \;D \;E \;F\; G"$, and let the dictionary of entities be $DoE = \{AB, DF, GK\}$. The list of entities formed by the tokens in the query and/or combinations of query tokens (contained in $DoE$) is $[AB, DF]$. Enumerating the possible subsets of these entities gives the following splits of the query into lists of entities and tokens:

$\textbf{Split-1:}$ $e_{1} = []$; $k_{1} = [A,B,C,D,E,F,G]$

$\textbf{Split-2:}$ $e_{2} = [AB]$; $k_{2} = [C,D,E,F,G]$

$\textbf{Split-3:}$ $e_{3} = [DF]$; $k_{3} = [A,B,C,E,G]$

$\textbf{Split-4:}$ $e_{4} = [AB, DF ]$; $k_{4} = [C,E,G]$

$\textbf{Note:}$ <br>
1. In order to split the query, we only care about the subset of entities contained in $DoE$ that can be formed by individual and/or combination of tokens in the $Q$.

* Entities in $DoE$ may correspond to single and/or multiple tokens, e.g., in the example given above $A$, $ABC$ etc., may also correspond to valid entities and may appear in the $DoE$.

* The maximum number of query splits is upper-bounded by the number of subsets of the entities in $DoE$ that can be formed by the tokens in $Q$.

* For every query split, the leftover keywords $Q \setminus wordsInEntities(e_{i})$ are considered as the corresponding token split.

* In order to form entities, we only combine keywords in the increasing order of the elements in the query string. For example, in $Q =\; "A\; B\; C\; D\; E\; F\; G\;"$, the entities such as: $BA$, $CB$ etc., will not be considered as entities and hence will not appear in the $DoE$.

* In the example given above, if $DoE$ = $\{AB, BC\}$, then there will be only three possible splits of the query. Because $Q$ contains only one instance of the token $B$, it is not possible to form a subset with multiple entities $[AB, BC]$, as that would require at least two instances of token $B$ in $Q$ (also discussed in $\textbf{Step-3}$ above).
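
Steps 1 to 4 can be sketched as follows. This is one possible reading of the procedure, not the reference solution: it assumes the query is already tokenized into a list, writes multi-token entities as space-joined strings (so $AB$ becomes `"A B"`), and uses `collections.Counter` arithmetic for the token-count check of Step 3. The function name `enumerate_splits` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def enumerate_splits(query_tokens, doe):
    # Step 1: in-order token combinations that appear in DoE.
    candidates = []
    for size in range(1, len(query_tokens) + 1):
        for combo in combinations(query_tokens, size):
            name = " ".join(combo)
            if name in doe and name not in candidates:
                candidates.append(name)

    query_count = Counter(query_tokens)
    splits = []
    # Step 2: every subset of the candidates, including the empty one.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            used = Counter(tok for ent in subset for tok in ent.split(" "))
            # Step 3: Counter subtraction keeps only positive counts, so a
            # non-empty result means some token is used more often than Q has it.
            if used - query_count:
                continue
            # Step 4: whatever is left of the query becomes the free tokens.
            leftover = list((query_count - used).elements())
            splits.append({"tokens": leftover, "entities": list(subset)})
    return splits

tokens = ["A", "B", "C", "D", "E", "F", "G"]
for split in enumerate_splits(tokens, {"A B": 0, "D F": 1, "G K": 2}):
    print(split)
```

Enumerating all token combinations is exponential in the query length, which is acceptable for short queries such as the examples in this notebook.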

## 3. Query Score Computation:

Later, you are required to use the corresponding $TF\text{-}IDF$ index to separately compute scores for the list of tokens and entities corresponding to each query split, i.e., $(k_{i},e_{i})$, $\textit{w.r.t}$ the document $D_{j}$ as follows:


$$
s_{i1} = \sum_{entity \in e_{i}} TF_{norm\_entity}(entity,D_{j}) * IDF_{entity}(entity) \\
s_{i2} = \sum_{token \in k_{i}} TF_{norm\_token}(token,D_{j}) * IDF_{token}(token) \\
score_{i}(\{k_{i},e_{i}\}, D_{j}) = s_{i1} + \lambda * s_{i2}|_{\lambda = 0.4}
$$

Finally, you are required to return the maximum score among all the query splits, i.e.,

$$
score_{max} = max\{score_{i}\}_{i=1}^{N}\\
$$

Note, in the above-mentioned equations, we use two separate $TF\text{-}IDF$ indexes, i.e., ($TF\text{-}IDF_{token}$) and ($TF\text{-}IDF_{entity}$) to compute the scores for the token splits and the entity splits of the query respectively.
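
The scoring step can be sketched as below, assuming the two indexes are stored as nested dictionaries of pre-multiplied TF-IDF values, `{term: {doc_id: tf_idf}}`. That storage layout, the function names, and the rounded numbers are illustrative assumptions; a term missing from an index, or from the document, contributes `0.0`.

```python
def score_split(split, doc_id, tfidf_tokens, tfidf_entities, lam=0.4):
    """score_i = s_i1 + lambda * s_i2 for one query split."""
    s1 = sum(tfidf_entities.get(e, {}).get(doc_id, 0.0) for e in split["entities"])
    s2 = sum(tfidf_tokens.get(t, {}).get(doc_id, 0.0) for t in split["tokens"])
    return s1 + lam * s2

def best_split(splits, doc_id, tfidf_tokens, tfidf_entities):
    """Return (max_score, split) over all query splits."""
    scores = [score_split(s, doc_id, tfidf_tokens, tfidf_entities) for s in splits]
    best = max(range(len(splits)), key=scores.__getitem__)
    return scores[best], splits[best]

# Hypothetical, rounded TF-IDF values for a single document with id 3.
tfidf_tokens = {"New": {3: 0.7123}, "York": {3: 0.7123}, "travel": {3: 1.4055}}
tfidf_entities = {"New York": {3: 1.0}}
splits = [
    {"tokens": ["New", "York", "Times", "Trump", "travel"], "entities": []},
    {"tokens": ["Times", "Trump", "travel"], "entities": ["New York"]},
    {"tokens": ["Trump", "travel"], "entities": ["New York Times"]},
]
print(best_split(splits, 3, tfidf_tokens, tfidf_entities))
```

Because $\lambda = 0.4$ down-weights free tokens, a split that matches an indexed entity can outscore the all-token split even though it has fewer matching tokens.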

Some key instructions regarding TF-IDF indexing, parsing and text processing are as follows:

### Key Instructions:

1. **Note** that for a given set of documents, you need to index the documents only once and later use your index to compute the query scores.

* You are only allowed to use Spacy (v2.1.8) for text processing and parsing. You can install Spacy via the following web-link: [Spacy](https://spacy.io/usage)

* We assume the parsing result of Spacy is always correct. We will not cater to any inconsistency in Spacy's parsing results. 

* All the tokens in the documents $(D)$, query $(Q)$ and dictionary of entities $(DoE)$ are case-sensitive. You  $\textbf{SHOULD NOT ALTER}$ the case of the tokens.

* You are required to compute two separate indexes, i.e., (i) For tokens, and (ii) For Entities, such that:

> 1. In order to compute the index of the Entities (i.e., $TF\text{-}IDF_{entity}$), you should index all the entities detected by spacy irrespective of their entity types and/or presence in $DoE$. For details on spacy's parsing and entity recognition, please see the web-link: [Spacy Parsing](https://spacy.io/usage/linguistic-features)<br>
> 2. For single-word Entities, e.g., `Trump` etc., you should only compute the index corresponding to the entities. For such entities, you should not consider the corresponding token for computing the TF-IDF index of tokens.<br>
> 3. For multi-word entities, e.g., `New York Times` etc., individual tokens corresponding to the entities should be considered as free tokens and should be indexed while TF-IDF index construction of tokens (i.e., $TF\text{-}IDF_{token}$).<br>

* `Stopwords`: You should only use the token's attribute `is_stop` on a string parsed by Spacy to declare any token as stopword and eventually remove it. This also applies to stopwords within multi-word entities, e.g., `Times of India`.

* `Punctuation`: You should only use the token's attribute `is_punct` on a string parsed by Spacy to declare any token as a punctuation mark and eventually remove it.

* `Special Cases`: You should not explicitly strip out punctuations or amend the Spacy's tokenization and parsing results. Some examples in this regard are as follows:
> 1. In the sentence: `I am going to U.S.` the correctly extracted entity is `U.S.`<br>
> 2. Likewise, in the sentence: `I am going to school.` Spacy will extract the token `school` and will consider the full stop `.` as a punctuation mark.
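
The counting rules for the two indexes can be illustrated on already-parsed input. In the sketch below each token is a hand-written `(text, is_stop, is_punct)` tuple standing in for the attributes of a Spacy token, and each entity is a list of token positions standing in for an entity span. These stand-ins, their flag values, and the function name `count_terms` are assumptions made for illustration and are not actual Spacy output.

```python
from collections import Counter, defaultdict

def count_terms(parsed_docs):
    """Build raw term-frequency tables {term: {doc_id: count}} for tokens and entities."""
    tf_tokens, tf_entities = defaultdict(dict), defaultdict(dict)
    for doc_id, doc in parsed_docs.items():
        tokens, spans = doc["tokens"], doc["entities"]
        # Positions covered by single-word entities are indexed as entities only.
        single_word = {span[0] for span in spans if len(span) == 1}
        # Every detected entity is counted, whatever its length.
        ent_counts = Counter(" ".join(tokens[i][0] for i in span) for span in spans)
        # Free tokens: drop stopwords, punctuation and single-word entities;
        # words inside multi-word entities are still counted as tokens.
        tok_counts = Counter(
            text
            for i, (text, is_stop, is_punct) in enumerate(tokens)
            if not is_stop and not is_punct and i not in single_word
        )
        for ent, n in ent_counts.items():
            tf_entities[ent][doc_id] = n
        for tok, n in tok_counts.items():
            tf_tokens[tok][doc_id] = n
    return dict(tf_tokens), dict(tf_entities)

# Hand-made stand-in for one parsed document (flags are assumed, not Spacy output).
parsed = {2: {
    "tokens": [("New", False, False), ("York", False, False), ("Times", False, False),
               ("mentioned", False, False), ("an", True, False),
               ("interesting", False, False), ("story", False, False),
               ("about", True, False), ("Trump", False, False), (".", False, True)],
    "entities": [[0, 1, 2], [8]],
}}
tf_tokens, tf_entities = count_terms(parsed)
print(tf_tokens)
print(tf_entities)
```

Working with token positions rather than token text avoids removing a word from the token index when it also occurs elsewhere in the document outside a single-word entity.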

### Toy Example for Illustration

Here, we provide a small toy example for illustration: <br>
Let the dictionary of documents ($D$) be:

The term frequencies corresponding to the tokens (i.e., $TF_{token}$) are shown below as a dictionary of dictionaries of the form: <br> 
$\{token$ : $\{doc\_id: count\}\}$.

Likewise, the term frequencies corresponding to the entities (i.e., $TF_{entity}$) are shown below as a dictionary of dictionaries of the form: <br> 
$\{entity$ : $\{doc\_id: count\}\}$.

Let the query ($Q$) be: `'New York Times Trump travel'`

Let the $DoE$ be: `{'New York Times':0, 'New York':1,'New York City':2}`

The possible query splits are:

$e_1$ = [], $k_1$ =  [`New`, `York`, `Times`, `Trump`, `travel`]

$e_2$ = [`New York Times`], $k_2$ = [`Trump`, `travel`]

$e_3$ = [`New York`], $k_3$= [`Times`, `Trump`, `travel`]

$\textbf{Note:}$ We cannot select the query split with the entity part as the combination of following entities: $e_{i}$ = [`New York`, `New York Times`], because there are only single instances of the tokens `New` and `York` in the $Q$.

For `doc_id=3`, after applying the formulas mentioned in sub-headings `2,3` given above, we get the following scores for all the query splits:

And the maximum score `max_score` among all the query splits is: <br>

`1.562186043243266` <br>

And, the corresponding query split is:<br>

`{'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}`

### Output Format (Q1):
Your output should be a tuple of the form:<br> 
`(max_score, {'tokens':[...], 'entities':[...]})`, where <br>
* `max_score` corresponds to the maximum TF-IDF score among all the query splits based on $Q$ and $DoE$.
* The query split corresponding to the `max_score`, i.e., a python dictionary containing the tokens and entities list corresponding to the query split `{'tokens':[...], 'entities':[...]}`.

### Running Time (Q1):
* On CSE machines, your implementation for $\textbf{parsing and indexing}$ approx 500 documents of average length of 500 tokens $\textbf{SHOULD NOT take more than 120 seconds}$. 
* Once all the documents are indexed, $\textbf{the query splitting and score}$ computation $\textbf{SHOULD NOT take more than 15 sec}$.

### How we test implementation of Q1

In [2]:
import pickle
import project_part1 as project_part1

In [3]:
fname = './Data/sample_documents.pickle'
documents = pickle.load(open(fname,"rb"))

documents

{1: 'President Trump was on his way to new New York in New York City.',
 2: 'New York Times mentioned an interesting story about Trump.',
 3: 'I think it would be great if I can travel to New York this summer to see Trump.'}

In [4]:
## Step- 1. Construct the index...
index = project_part1.InvertedIndex()

index.index_documents(documents)

In [5]:
## Test cases
Q = 'New York Times Trump travel'
DoE = {'New York Times':0, 'New York':1,'New York City':2}
doc_id = 3

## 2. Split the query...
query_splits = index.split_query(Q, DoE)


## 3. Compute the max-score...
result = index.max_score_query(query_splits, doc_id)
result

[{'tokens': ['New', 'York', 'Times', 'Trump', 'travel'], 'entities': []}, {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}, {'tokens': ['Trump', 'travel'], 'entities': ['New York Times']}]


(1.562186043243266,
 {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']})

## Project Submission and Feedback

For project submission, you are required to submit the following files:

1. Your implementation in a python file `project_part1.py`.

2. A report `project_part1.pdf`. You need to write a concise and simple report illustrating
    - Implementation details of $Q1$.

**Note:** Every student will be entitled to **15 Feedback Attempts** (use them wisely). We will use the last submission for final evaluation of **part-1**.

In [None]:
Below are my notes:

In [1]:
import spacy
from collections import Counter
from math import log
from copy import deepcopy

nlp = spacy.load('en') #load the English module

class InvertedIndex:
    def __init__(self):
        ## You should use these variables to store the term frequencies for tokens and entities...
        self.tf_tokens = {}
        self.tf_entities = {}

        ## You should use these variables to store the inverse document frequencies for tokens and entities...
        self.idf_tokens = {}
        self.idf_entities = {}

    ## Your implementation for indexing the documents...
    def index_documents(self, documents):
        
        #1. construct the term frequency for tokens and entities
        token_dict = {}
        ent_dict = {}

        for i in documents.keys():
            doc = nlp(documents[i])
            tokenlist, entlist = [], []

            for token in doc:
                if ((not token.is_stop) and (not token.is_punct)):
                    tokenlist.append(token.text)
            for ent in doc.ents:
                entlist.append(ent.text)

            tcount, ecount = Counter(tokenlist), Counter(entlist)  # per-document counts

            #aggregate entities into a dict, key:entity, value: {docID:count}
            for key in ecount.keys():
                if len(key.split()) == 1 and key in tcount:
                    # a single-word entity is indexed only as an entity, so
                    # remove every occurrence of it from the token counts
                    tcount[key] -= ecount[key]
                if key not in ent_dict:
                    ent_dict[key] = {i: ecount[key]}
                else:
                    ent_dict[key][i] = ecount[key]

            #aggregate tokens into a dict, key:token, value: {docID:count}
            for key in tcount.keys():
                if tcount[key] > 0:  # skip tokens fully consumed by single-word entities
                    if key not in token_dict:
                        token_dict[key] = {i: tcount[key]}
                    else:
                        token_dict[key][i] = tcount[key]

        #2. construct the tf and idf index
        tf_tokens = deepcopy(token_dict) #use the same data format 
        tf_entities = deepcopy(ent_dict)
        idf_tokens, idf_entities = {},{}
        docs_num = len(documents)

        # calculate tf(norm) for tokens: 1 + ln(1 + ln(tf))
        for token in tf_tokens.keys():
            for docid in tf_tokens[token].keys():
                tf_tokens[token][docid] = 1.0 + log(1.0 + log(tf_tokens[token][docid]))

        # calculate tf(norm) for entities: 1 + ln(tf) (single log, unlike tokens)
        for ent in tf_entities.keys():
            for docid in tf_entities[ent].keys():
                tf_entities[ent][docid] = 1.0 + log(tf_entities[ent][docid])

        # calculate idf for tokens
        for token in token_dict.keys():
            idf_tokens[token] = log(docs_num / (1.0 + len(token_dict[token]))) + 1.0

        # calculate idf for entities
        for ent in ent_dict.keys():
            idf_entities[ent] = log(docs_num / (1.0 + len(ent_dict[ent]))) + 1.0
        
        self.tf_tokens = tf_tokens
        self.tf_entities = tf_entities
        self.idf_tokens = idf_tokens
        self.idf_entities = idf_entities


    ## Your implementation to split the query to tokens and entities...
    def split_query(self, Q, DoE):
        pass
        return query_splits #a list



    ## Your implementation to return the max score among all the query splits...
    def max_score_query(self, query_splits, doc_id):
        pass ## Replace this line with your implementation...
        ## Output should be a tuple (max_score, {'tokens': [...], 'entities': [...]})


In [33]:
documents


{1: 'President Trump was on his way to new New York in New York City.',
 2: 'New York Times mentioned an interesting story about Trump.',
 3: 'I think it would be great if I can travel to New York this summer to see Trump.'}

In [70]:
#part 3
query =   {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}

s,s1,s2 = 0.0, 0.0, 0.0
docid = 3
for ent in query['entities']:
    if (ent in tf_entities) and (docid  in tf_entities[ent]): # if the token appear in this doc
        s1 += tf_entities[ent][docid] * idf_entities[ent] 
for token in query['tokens']:
    if (token in tf_tokens) and (docid  in tf_tokens[token]): # if the token appear in this doc
        s2 += tf_tokens[token][docid] * idf_tokens[token]  
s = s1+0.4*s2
print(f'\'tokens_score\': {s2}, \'entities_score\': {s1}, \'combined_score\': {s1+0.4*s2}')

'tokens_score': 1.4054651081081644, 'entities_score': 1.0, 'combined_score': 1.562186043243266


In [54]:
Q = 'New York Times Trump travel'
DoE = {'New York Times':0, 'New York':1,'New York City':2}
doc_id = 3

## 2. Split the query...
query_splits = index.split_query(Q, DoE)

NameError: name 'index' is not defined

In [109]:
from itertools import combinations

def split_query(Q, DoE):
    query_splits = []
    
    #1.select the probable entities
    '''1) start with an empty entity list
       2) form all in-order combinations of the query tokens as candidate entities
       3) if a candidate is in DoE, append it to the entity list
    '''
    tokenlist, entitylist = [], []
    query = nlp(Q)
    #1.1 find the single-word entities and the tokens
    for token in query:
        tokenlist.append(token.text)
        if (token.text in DoE): #compare the text, not the spaCy Token object
            entitylist.append(token.text) #this is the single-word entity
    print('query_tokens: ', tokenlist)
    query_splits.append({'tokens': tokenlist, 'entities': []})
    
    #1.2 construct combinations of tokens and find multi-word entities
    for size in range(2, len(tokenlist)+1): #from length = 2 to len(tokenlist)
        for combi in combinations(tokenlist, size): #in-order token combinations
            ent = ' '.join(combi)
            if (ent in DoE):
                entitylist.append(ent) #this is the multi-word entity
    print('step1: ', entitylist)
    
    #2.enumerate all possible subsets of entities, using combinations
    '''
    2.1 find the combinations
    2.2 check token count of each combination doesn't exceed Q_count
    '''
    entity_subset = []
    Q_count = Counter(tokenlist) #token count in Query
    print ('qurey counter: ', Q_count)
    #2.1 find the combinations
    for subset_len in range(1, len(entitylist)+1):
        for subset in combinations(entitylist, subset_len):
            print(subset)
            #2.2.1 count the token in subset
            subset_word = []
            for j in range(subset_len): #split the words in each entity of the subset
                subset_word += subset[j].split(' ')
            subset_counter = Counter(subset_word)
            print(subset_counter)
            #2.2.2 check that if the count of subset exceed Q-count
            exceed = 0
            for key in subset_counter.keys():
                if subset_counter[key] > Q_count[key]:
                    exceed = 1
                    break
                    
            if exceed == 0:
                #3 for filtered entity subset,append it to the query_splits list
                token_counter = Q_count - subset_counter
                query_splits.append({'tokens': list(token_counter.elements()), 'entities': list(subset)})
                entity_subset.append(subset)
    print('entity subset: ', entity_subset)
    print()
    print(query_splits)
    return query_splits
            

In [110]:
Q = 'New York Times Trump travel'
DoE = {'New York Times':0, 'New York':1,'New York City':2}
query_splits = split_query(Q, DoE)

query_tokens:  ['New', 'York', 'Times', 'Trump', 'travel']
step1:  ['New York', 'New York Times']
qurey counter:  Counter({'New': 1, 'York': 1, 'Times': 1, 'Trump': 1, 'travel': 1})
('New York',)
Counter({'New': 1, 'York': 1})
('New York Times',)
Counter({'New': 1, 'York': 1, 'Times': 1})
('New York', 'New York Times')
Counter({'New': 2, 'York': 2, 'Times': 1})
entity subset:  [('New York',), ('New York Times',)]

[{'tokens': ['New', 'York', 'Times', 'Trump', 'travel'], 'entities': []}, {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}, {'tokens': ['Trump', 'travel'], 'entities': ['New York Times']}]


In [121]:
def max_score_query(query_splits, doc_id):
    score_list = []
    
    for query in query_splits:
        score,score1,score2 = 0.0, 0.0, 0.0
        for ent in query['entities']:
            if (ent in tf_entities) and (doc_id in tf_entities[ent]): # if the entity appears in this doc
                score1 += tf_entities[ent][doc_id] * idf_entities[ent] 
        for token in query['tokens']:
            if (token in tf_tokens) and (doc_id in tf_tokens[token]): # if the token appears in this doc
                score2 += tf_tokens[token][doc_id] * idf_tokens[token]  
        score = score1 + 0.4*score2
        score_list.append(score)
        print(f'\'tokens_score\': {score2}, \'entities_score\': {score1}, \'combined_score\': {score}')
    print(score_list)
    
    max_score = max(score_list)
    return (max_score, query_splits[score_list.index(max_score)])

In [122]:
doc_id = 3
result = max_score_query(query_splits, doc_id)
result

'tokens_score': 2.8301009632046026, 'entities_score': 0.0, 'combined_score': 1.132040385281841
'tokens_score': 1.4054651081081644, 'entities_score': 1.0, 'combined_score': 1.562186043243266
'tokens_score': 1.4054651081081644, 'entities_score': 0.0, 'combined_score': 0.5621860432432658
[1.132040385281841, 1.562186043243266, 0.5621860432432658]


(1.562186043243266,
 {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']})

In [99]:
Counter(t)

Counter({'New York': 1, 'New York Times': 2})

In [100]:
c - Counter(t)

Counter()

In [93]:
print('nelo' * 2)

nelonelo


In [35]:
class Parent():
    def __init__(self):
        self.name = ''
        
tom = Parent()
tom.name

''

In [38]:
class InvertedIndex:
    def __init__(self):
        ## You should use these variable to store the term frequencies for tokens and entities...
        self.tf_tokens = []
        self.tf_entities = [] 

        ## You should use these variable to store the inverse document frequencies for tokens and entities...
        self.idf_tokens = []
        self.idf_entities = []

index = InvertedIndex()

In [40]:
index.tf_tokens

[]

In [1]:
import project_part1 as project_part1

This is me!


In [2]:
type(project_part1)

module

In [5]:
project_part1.InvertedIndex()

<project_part1.InvertedIndex at 0x17ef1fffda0>