## TFIDF dataset

This notebook prepares the data and computes TFIDF scores for the announcements
stored in the file "flat_data.txt".
Running the code generates the following files:
+ vocabulary.txt
+ inverted_indx.txt
+ tfid.csv


In [2]:
import csv
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import ItalianStemmer
from nltk.tokenize import word_tokenize, regexp_tokenize
from collections import defaultdict
import numpy as np

In [4]:
def preprocess(text):
    '''Lowercase, tokenize, filter and stem the text of a description.'''
    text = text.lower()
    # replace literal "\n" sequences (backslash + n) in the data with spaces
    text = text.replace('\\n', ' ')
    # tokenize on runs of word characters or '$', which drops punctuation
    tokens = regexp_tokenize(text, r"[\w\$]+")
    # remove purely numeric tokens
    filtered = [w for w in tokens if not w.isnumeric()]
    # remove Italian stopwords (build the set once, not once per token)
    italian_stopwords = set(stopwords.words('italian'))
    filtered = [w for w in filtered if w not in italian_stopwords]
    # reduce each remaining word to its stem
    its = ItalianStemmer()
    filtered = [its.stem(word) for word in filtered]
    return filtered
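
To see what the tokenization and number-filtering steps do without installing NLTK, the sketch below reproduces them with the standard `re` module. It assumes that `re.findall` behaves like `regexp_tokenize` for this pattern; the helper name `tokenize_and_filter` and the sample sentence are made up for illustration, and stopword removal and stemming are left out.

```python
import re

def tokenize_and_filter(text):
    """Hypothetical helper mirroring the first steps of preprocess()."""
    text = text.lower()
    # replace the literal two-character sequence backslash + n with a space
    text = text.replace('\\n', ' ')
    # keep runs of word characters or '$'; punctuation is dropped
    tokens = re.findall(r"[\w$]+", text)
    # drop purely numeric tokens; '500$' is kept because of the '$'
    return [w for w in tokens if not w.isnumeric()]

# made-up sample text containing a literal "\n" sequence
sample = "Bilocale luminoso\\nvicino al centro, 2 camere: 500$ al mese!"
print(tokenize_and_filter(sample))
# ['bilocale', 'luminoso', 'vicino', 'al', 'centro', 'camere', '500$', 'al', 'mese']
```

Note that a token such as `500$` survives the number filter, because `isnumeric()` is only true for strings made up entirely of numeric characters.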

**PREPROCESSING AND CREATING VOCABULARY FILE**

In [5]:
vocabulary_set = set()
annouc_list = []
occurence_words_list = []

# open the file with our data
with open("flat_data.txt", "r", encoding="utf-8", newline="") as flat_data:
    reader = csv.reader(flat_data, delimiter=",")
    for i, line in enumerate(reader):
        #if i%100==0: print(i) #to see the progress of calculations
        # skip empty rows and the first row (assumed to be the header)
        if line != [] and i != 0:
            # preprocess the description text (field at index 5)
            description = preprocess(line[5])
            # add new words to the vocabulary set
            vocabulary_set.update(description)
            # store the distinct words of this announcement
            annouc_list.append(set(description))
            # count how often each word occurs in the description
            freq_word_dict = {}
            for w in description:
                freq_word_dict[w] = freq_word_dict.get(w, 0) + 1
            # save the frequency dict for words in description
            occurence_words_list.append(freq_word_dict)

Save all collected words, one per line together with their numeric id, into the "vocabulary.txt" file.

In [7]:
# map each term to a numeric id
vocabulary = {term: term_id for term_id, term in enumerate(vocabulary_set)}
with open("vocabulary.txt", 'w', encoding="utf8") as voc_file:
    for term, term_id in vocabulary.items():
        voc_file.write('{0}\t{1}\n'.format(term, term_id))
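
Because the ids come from enumerating a set, they can differ from one run to the next, so a later step would probably want to read the mapping back from the file instead of rebuilding it. A minimal sketch of such a reader, assuming the tab-separated `term<TAB>id` layout written above; `load_vocabulary` is a hypothetical helper, not part of the notebook:

```python
import os
import tempfile

def load_vocabulary(path):
    """Hypothetical helper: read 'term<TAB>id' lines into a dict."""
    vocabulary = {}
    with open(path, encoding="utf8") as voc_file:
        for line in voc_file:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            term, term_id = line.split("\t")
            vocabulary[term] = int(term_id)
    return vocabulary

# demo with a tiny file written in the same format
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "vocabulary.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("cas\t0\ncentr\t1\n")
    demo_vocabulary = load_vocabulary(demo_path)

print(demo_vocabulary)  # {'cas': 0, 'centr': 1}
```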

#### CREATING INVERTED INDEX
Iterate through the announcements:
-> for each word in an announcement, add the announcement's index to that word's entry in the inverted index

In [8]:
# inverted index dictionary has the structure {id_word: set of announcement indices}
inv_indx = defaultdict(set)
for idx, words in enumerate(annouc_list):
    for word in words:
        inv_indx[vocabulary[word]].add(idx)
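
As an illustration of the resulting structure, here is the same loop run on three made-up announcements. The `toy_` names and the stems are assumptions for the example only; the vocabulary is built from a sorted list so that the ids are reproducible.

```python
from collections import defaultdict

# three toy announcements, already preprocessed into sets of stems
toy_annouc_list = [{"cas", "centr"}, {"cas", "giardin"}, {"centr"}]

# sorted() makes the ids reproducible for this demo
toy_terms = sorted(set().union(*toy_annouc_list))
toy_vocabulary = {term: term_id for term_id, term in enumerate(toy_terms)}

# same loop as above: word id -> set of announcement indices
toy_inv_indx = defaultdict(set)
for idx, words in enumerate(toy_annouc_list):
    for word in words:
        toy_inv_indx[toy_vocabulary[word]].add(idx)

print(toy_vocabulary)  # {'cas': 0, 'centr': 1, 'giardin': 2}
for term_id in sorted(toy_inv_indx):
    print(term_id, sorted(toy_inv_indx[term_id]))
# 0 [0, 1]
# 1 [0, 2]
# 2 [1]
```

The number of announcements that contain a word is then simply the size of its set, which is what the TFIDF computation needs.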

Saving to file - inverted_indx.txt

In [9]:
# one line per word: its id, then the tab-separated announcement indices
with open("inverted_indx.txt", 'w', encoding="utf8") as inv_file:
    for id_word, docs in inv_indx.items():
        inv_file.write('{0}\t{1}\n'.format(id_word, '\t'.join(map(str, docs))))

#### Computing TFIDF

Function **computeTFID** calculates the TFIDF score of every word in a specific announcement (doc_id) given as input.

In [10]:
def computeTFID(freq_dict, doc_id, tot_num_docs, inv_indx):
    # freq_dict - dictionary with the frequency of words in the description
    # doc_id - announcement_id
    # tot_num_docs - total number of announcements (>10000)
    # inv_indx - inverted index (also saved in inverted_indx.txt)
    # Relies on the globals `vocabulary` (word -> id) and `tfid` (results).
    # number of distinct words in the description, used to normalize the frequency
    numWords = len(freq_dict)
    tfid_per_annoucement = {}
    for word in freq_dict.keys():
        word_id = vocabulary[word]
        # number of announcements that contain this word
        num_in_annouc = len(inv_indx[word_id])
        log_part = np.log(float(tot_num_docs)/num_in_annouc)
        tfid_per_annoucement[word_id] = round(float(freq_dict[word])/numWords * log_part, 5)
    tfid[doc_id] = tfid_per_annoucement
    return

Compute all tfidf scores:

In [11]:
# tfid maps each announcement id to its {word_id: score} dictionary;
# computeTFID fills it in as a side effect
tfid = {}
inv_indx_tfid = {}
len_rows = len(annouc_list)
# iterate through all announcements
for doc_id, freq_dict in enumerate(occurence_words_list):
    computeTFID(freq_dict, doc_id, len_rows, inv_indx)
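
To make the scoring concrete, the sketch below works through one made-up description by hand. It follows the notebook in dividing each count by the number of distinct words (`len(freq_dict)`) and in using the natural logarithm; a common alternative divides by the total number of tokens instead. The helper `toy_tfidf`, the document frequencies and the corpus size of 10 are assumptions for the example.

```python
import math

def toy_tfidf(freq_dict, doc_freq, tot_num_docs):
    """Hypothetical helper: score every word of one description."""
    num_words = len(freq_dict)  # distinct words, as in the notebook
    scores = {}
    for word, count in freq_dict.items():
        # inverse document frequency with the natural logarithm
        idf = math.log(tot_num_docs / doc_freq[word])
        scores[word] = round(count / num_words * idf, 5)
    return scores

# "cas" occurs twice here and appears in 5 of 10 announcements;
# "centr" occurs once here and appears in only 1 of 10
freq_dict_demo = {"cas": 2, "centr": 1}
doc_freq_demo = {"cas": 5, "centr": 1}
toy_scores = toy_tfidf(freq_dict_demo, doc_freq_demo, tot_num_docs=10)
print(toy_scores)  # {'cas': 0.69315, 'centr': 1.15129}
```

Although "cas" is more frequent in the description, the rarer "centr" gets the higher score, which is the intended effect of the logarithmic factor.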

Saving **tfid** into the **"tfid.csv"** file.

In [12]:
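with open("tfid.csv", 'w', encoding="utf8") as tfid_file:
    for id_annouc, scores_dict in tfid.items():
        # one line per announcement: its id, then tab-separated word_id:score pairs
        pairs = ('{0}:{1}'.format(word_id, score) for word_id, score in scores_dict.items())
        tfid_file.write('{0}\t{1}\n'.format(id_annouc, '\t'.join(pairs)))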