# Objective: Learning how to identify the most discriminative words using the Gini index
## Question :
### 1. Create a hypothetical document data set consisting of 8 documents and 10 keywords, assuming there are three possible ratings and the keywords have frequencies 0, 1 and 2
### 2. Print the indices of the 5 most discriminative words
### 3. Print the TDF (Term-Document-Frequency) columns corresponding to the discriminative words

In [1]:
import numpy as np
n_docs=8 # number of documents
n_kwds=10 # number of keywords (columns of the TDF matrix)
n_rts=3 # number of possible ratings
k=n_docs*n_kwds # total number of elements in the TDF matrix (Term-Document-Frequency matrix)
sim_1d=np.random.choice(3, k) # k=80 random frequencies drawn from 0, 1, 2; values change on every run
#sim_1d


In [2]:
sim_d2=sim_1d.reshape(n_docs,n_kwds) # conversion of the 1-d frequencies to matrix form (rows: documents, columns: keywords)
sim_d2 # displaying the matrix


array([[2, 0, 0, 1, 0, 0, 1, 2, 1, 0],
       [1, 0, 2, 0, 0, 0, 1, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 1, 2, 0],
       [2, 2, 2, 2, 1, 0, 2, 0, 0, 2],
       [2, 0, 1, 0, 0, 0, 1, 0, 2, 2],
       [2, 1, 0, 1, 0, 0, 2, 0, 2, 0],
       [1, 1, 2, 0, 0, 2, 1, 1, 2, 2],
       [0, 2, 2, 1, 0, 1, 2, 2, 1, 1]])

In [3]:
sim_ratings=np.random.choice(n_rts,n_docs)
sim_ratings


array([1, 2, 0, 2, 2, 2, 2, 0])
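
The cells above draw fresh random numbers on every run, so the matrix and ratings printed here will not reappear when the notebook is re-executed. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator; the function name `simulate` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def simulate(n_docs=8, n_kwds=10, n_rts=3, seed=42):
    """Hypothetical helper: seeded TDF matrix and ratings (seed is an arbitrary choice)."""
    rng = np.random.default_rng(seed)
    # Frequencies 0, 1 or 2 for every (document, keyword) pair
    tdf = rng.integers(0, 3, size=(n_docs, n_kwds))
    # One rating in 0..n_rts-1 per document
    ratings = rng.integers(0, n_rts, size=n_docs)
    return tdf, ratings

tdf, ratings = simulate()
print(tdf.shape, ratings.shape)
```

With a fixed seed, every run yields the same data, so the printed outputs and any written conclusion stay in agreement.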

## Computing Gini Indices
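
The code in this section measures how evenly the ratings are spread over the documents that contain a given keyword. Written as a formula (the symbols $G_j$, $D_j$ and $p_{jr}$ are notation introduced here for illustration, not names used in the notebook):

$$
G_j = 1 - \sum_{r} p_{jr}^{2},
\qquad
p_{jr} = \frac{\left|\{\, d \in D_j : \text{rating}(d) = r \,\}\right|}{\left|D_j\right|}
$$

Here $D_j$ is the set of documents in which keyword $j$ has a nonzero frequency. A value of $G_j$ close to 0 means those documents share almost the same rating, so keywords with lower Gini index values are treated as more discriminative.
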
### Creation of 1-d array to store GI values

In [4]:
gi=np.zeros(n_kwds)  # initialized an array for storing gini indices
gi


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [5]:
for j in range(n_kwds): # running through the columns (keywords)
    doc_list=[] # re-initialized for every keyword to store the indices of the documents containing it
    for i in range(n_docs): # running through the rows (documents)
        if (sim_d2[i,j]!=0): # checking the occurrence of the jth keyword in the ith document
            doc_list.append(i)
            #print('keyword ...',j) # to display the keyword index
            #print(doc_list) # to display the list of documents
            #print(sim_ratings[doc_list]) # to display the corresponding ratings
    rt_arr=sim_ratings[doc_list] # ratings of the documents that contain the jth keyword
    u,c=np.unique(rt_arr,return_counts=True) # distinct ratings and their counts
    #print('Frequencies are',c)
    p=c/sum(c) # fraction of those documents falling in each rating
    #print('Probabilities are',p)
    gi[j]=1-sum(p*p) # Gini index of the jth keyword
    # print('----------------')
print('Gini Index values are',gi)


Gini Index values are [0.44897959 0.375      0.44444444 0.64       0.5        0.5
 0.44897959 0.625      0.61111111 0.32      ]
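
The loop above can also be packaged as a function and checked against the printed values. The sketch below hard-codes the matrix and ratings displayed earlier in this notebook; the function name `gini_indices` and the boolean-mask formulation are assumptions made for illustration. A keyword that occurs in no document keeps a value of 1, which matches what the original loop produces in that case.

```python
import numpy as np

# TDF matrix and ratings copied from the outputs shown in this notebook
tdf = np.array([[2, 0, 0, 1, 0, 0, 1, 2, 1, 0],
                [1, 0, 2, 0, 0, 0, 1, 0, 0, 1],
                [1, 0, 1, 1, 1, 0, 0, 1, 2, 0],
                [2, 2, 2, 2, 1, 0, 2, 0, 0, 2],
                [2, 0, 1, 0, 0, 0, 1, 0, 2, 2],
                [2, 1, 0, 1, 0, 0, 2, 0, 2, 0],
                [1, 1, 2, 0, 0, 2, 1, 1, 2, 2],
                [0, 2, 2, 1, 0, 1, 2, 2, 1, 1]])
ratings = np.array([1, 2, 0, 2, 2, 2, 2, 0])

def gini_indices(tdf, ratings):
    """Hypothetical helper: Gini index of every keyword (column) of tdf."""
    n_kwds = tdf.shape[1]
    gi = np.ones(n_kwds)  # keywords absent from all documents stay at 1
    for j in range(n_kwds):
        present = tdf[:, j] != 0  # documents that contain keyword j
        if not present.any():
            continue
        _, counts = np.unique(ratings[present], return_counts=True)
        p = counts / counts.sum()  # rating proportions among those documents
        gi[j] = 1.0 - np.sum(p ** 2)
    return gi

gi = gini_indices(tdf, ratings)
print(np.round(gi, 4))
```

As a hand check, keyword 9 occurs in documents 1, 3, 4, 6 and 7, whose ratings are 2, 2, 2, 2 and 0. The proportions are 0.8 and 0.2, so the Gini index is 1 - (0.64 + 0.04) = 0.32, which agrees with the last value printed above.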


### Listing the first 5 most discriminative keywords

In [6]:
indices=np.argsort(gi) # indices that sort gi in ascending order; a lower Gini index means a more discriminative keyword
print('Indices of the  first 5 most discriminative keywords are',  indices[0:5] )


Indices of the  first 5 most discriminative keywords are [9 1 2 0 6]
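
In this run keywords 0 and 6 have the same Gini index (0.44897959), so their relative order depends on how the sort breaks ties. The sketch below, which assumes the rounded values printed above and uses a hypothetical helper name `top_k_discriminative`, requests a stable sort so that tied keywords keep their original index order.

```python
import numpy as np

# Gini index values copied from the output shown in this notebook
gi = np.array([0.44897959, 0.375, 0.44444444, 0.64, 0.5,
               0.5, 0.44897959, 0.625, 0.61111111, 0.32])

def top_k_discriminative(gi, k=5):
    """Hypothetical helper: indices of the k smallest Gini values, ties kept in index order."""
    order = np.argsort(gi, kind="stable")
    return order[:k]

top5 = top_k_discriminative(gi)
for idx in top5:
    # Show each selected keyword next to its Gini index
    print(f"keyword {idx}: GI = {gi[idx]:.4f}")
```

Printing the values next to the indices makes ties and near-ties visible, which helps when deciding whether a cutoff at 5 keywords is meaningful.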


### Printing the TDF for the most discriminative words

In [7]:
print(sim_d2[:,indices[0:5]])


[[0 0 0 2 1]
 [1 0 2 1 1]
 [0 0 1 1 0]
 [2 2 2 2 2]
 [2 0 1 2 1]
 [0 1 0 2 2]
 [2 1 2 1 1]
 [1 2 2 0 2]]


### Conclusion
#### For the run shown above, the indices of the 5 most discriminative words using GI are [9 1 2 0 6]