# Objective : Learning the process of identifying most discriminative words using Entropy

## Question :
### 1. Create a hypothetical document data set consisting of 25 documents and 75 keywords assuming there are four possible ratings and the keywords have frequencies 0,1, 2,3,4
### 2. Print the indices of the first 5 most discriminative words
### 3. Print the TFD corresponding the the discriminative words

In [1]:
import numpy as np
n_docs=25 # number of documents
n_kwds=75 # number of keywords in each document
n_rts=4 # number of possible ratings
k=n_docs*n_kwds # total number of elements in the TDF matrix  (Term-Document-Frequency matrix)
sim_1d=np.random.choice(5, k) # gives 80 randomly obtained numbers from 0,1,2,3,4
sim_1d


array([3, 1, 0, ..., 2, 0, 4])

In [2]:
sim_d2=sim_1d.reshape(n_docs,n_kwds) # coversion of 1d-frequencies to matrix form
sim_d2 # displaying the matrix


array([[3, 1, 0, ..., 4, 2, 2],
       [3, 0, 4, ..., 0, 1, 3],
       [3, 2, 3, ..., 1, 4, 0],
       ...,
       [3, 2, 3, ..., 3, 0, 1],
       [2, 3, 2, ..., 3, 2, 4],
       [3, 2, 1, ..., 2, 0, 4]])

In [3]:
sim_ratings=np.random.choice(n_rts,n_docs)
sim_ratings


array([3, 2, 2, 1, 3, 0, 0, 0, 2, 0, 3, 1, 0, 2, 3, 2, 0, 3, 3, 1, 2, 3,
       2, 2, 2])

## Computing Entropies
### Creation of 1-d array to store GI values

In [4]:
et=np.zeros(n_kwds)  # initialized an array for storing gini indices
et


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

In [6]:
import math
for j in range(n_kwds): # Running through the columns
    doc_list=[] # initialization for every keyword to store the indices of documents
    for i in range(8): # Running throug the rows
        if (sim_d2[i,j]!=0): # checking the occurrence of  the jth keyword in ith document
            doc_list.append(i)
    #print('keyword ...',j) # todispplay the keyword index
    # #print(doc_list) # to display the list of documents
    # #print(sim_ratings[doc_list]) # to display the corresponding rating
    rt_arr=sim_ratings[doc_list]
    u,c=np.unique(rt_arr,return_counts=True)
    #print('Frequencies are',c)
    p=c/sum(c)
    #print('Probabilities are',p)
    et[j]=-sum(p*np.log(p))
  # print('----------------')
print('Entropies are',et)



Entropies are [1.27703426 1.27703426 1.01140426 1.32088834 1.32966135 1.01140426
 1.35178399 1.32966135 1.32966135 1.27703426 1.33217904 1.27703426
 1.35178399 1.27703426 1.27703426 1.32966135 1.33217904 1.32966135
 1.05492017 1.35178399 1.35178399 1.35178399 1.32966135 1.24245332
 1.32966135 1.33217904 1.27703426 1.35178399 1.35178399 1.35178399
 1.07899221 1.01140426 1.32966135 1.32966135 1.01140426 1.07899221
 1.32088834 1.27703426 1.27703426 1.32966135 1.24245332 1.24245332
 1.27703426 1.05492017 1.32966135 1.33217904 1.03972077 1.32966135
 1.05492017 1.27703426 1.01140426 1.27703426 1.35178399 1.27703426
 1.32088834 1.27703426 1.35178399 1.32966135 1.32088834 1.35178399
 1.35178399 1.03972077 1.01140426 1.01140426 1.35178399 1.32088834
 1.35178399 1.35178399 1.27703426 1.07899221 1.33217904 1.27703426
 1.32966135 1.32088834 1.27703426]


### Listing the first 5 most discriminative keywords

In [7]:
indices=np.argsort(et) # indices of the sorted values
print('Indices of the  first 5 most discriminative keywords are',  indices[0:5] )


Indices of the  first 5 most discriminative keywords are [ 2 34  5 63 62]


### Printing the TFD for the most discriminative words

In [8]:
print(sim_d2[:,indices[0:5]])


[[0 2 1 0 1]
 [4 1 0 3 1]
 [3 0 1 2 0]
 [0 0 0 3 0]
 [3 2 3 0 1]
 [1 2 4 3 2]
 [3 4 1 3 2]
 [1 4 3 4 2]
 [1 0 2 2 1]
 [1 2 1 1 3]
 [2 4 4 3 2]
 [4 3 3 0 2]
 [3 2 1 1 4]
 [1 0 3 3 4]
 [1 0 4 0 0]
 [4 0 3 2 4]
 [2 3 4 1 3]
 [0 0 4 0 2]
 [1 3 3 3 3]
 [4 4 3 3 2]
 [4 0 4 1 0]
 [1 4 2 2 0]
 [3 0 4 3 3]
 [2 4 4 1 3]
 [1 1 4 0 3]]


### Conclusion
### The indices of the first 5 most discriminative words