# Calculating the confidence values for a sample dataset 

The aim of this notebook is to explore algorithms for calculating the confidence of a data point when multiple data sources conflict.

# Setup

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML

Toggling Code blocks:

In [2]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

# Reading data 

In [3]:
data = pd.read_csv('sample_data.csv')
data.drop('id', axis=1, inplace=True)

# User defined functions: 

## carry_out_iterations:

The function carries out the iterations required by the Truth Finder algorithm.

Given the data and an initial array of trustworthiness scores, it runs the iterations, prints the trustworthiness scores at each iteration, and stores the confidence scores for every iteration, including the final one.


Inputs : 

- **data**: The data on which the algorithm will run
    
- **list_of_cols**: The list of columns from data to treat as data sources, e.g. `list_of_cols = ['Source_A','Source_B','Source_C','Source_D']`
    
- **t_w**: An array of initial source trustworthiness values, e.g. `t_w = np.array([0.5,0.5,0.5,0.5])`

Outputs : 

- **t_w_df**: A dataframe containing the source trustworthiness scores for all iterations
        
- **confidence_iterations**: A dataframe containing the confidence scores across all iterations
        
- **train_data_confidence**: A dataframe containing the confidence scores at the last iteration
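
As a minimal sketch of the trust update that the function performs at the end of each pass, the snippet below takes a small confidence table and recomputes t(w) and tau(w) from it. The confidence values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical confidence scores for three rows and two sources
confidence = pd.DataFrame({"Source_A": [0.8, 0.8, 0.5],
                           "Source_B": [0.2, 0.8, 0.5]})

# t(w): mean confidence of the facts each source provides
t_w = confidence.mean()

# tau(w) = -ln(1 - t(w)); grows without bound as t(w) approaches 1
tau_w = -np.log(1.0 - t_w)

print(t_w.round(3).to_dict())
print(tau_w.round(3).to_dict())
```

A source whose facts score higher on average gets a larger tau(w), so its votes weigh more in the next pass.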

In [4]:
def carry_out_iterations( data,list_of_cols,t_w): 
    
    ## copying only required columns 
    train_data =  data[list_of_cols].copy()
    
    ## creating a float data frame with the same shape as train_data to hold confidence scores
    train_data_confidence = pd.DataFrame(0.0, index=train_data.index, columns=train_data.columns)
    
    ## calculating (1-t(w)). Carrying out calculation required for the equation
    t_w_inv =  1- t_w
    tau_w =  -np.log(t_w_inv)
    
    ## creating dataframe that maintains list of confidence values through each iteration
    confidence_iterations = pd.DataFrame(columns =train_data.columns.tolist() + ['iteration'])
    t_w_df = pd.DataFrame(columns = train_data.columns)
    
    ## specifying max number of iterations as 70 
    
    for iteration in range(0,70):

        for data_row in range(0,len(train_data_confidence)):
            
            ##pulling the facts from each row and calculating tau
            
            row = train_data.iloc[data_row,:]
            Row_values = pd.DataFrame(row.values,columns=['field_values'], index = row.index )
            Row_values['t_w'] = t_w
            Row_values['t_w_inv'] =  t_w_inv
            Row_values['tau_inv'] = tau_w
            
            ## initializing sigma and confidence for each fact 
            
            facts_df =  pd.DataFrame(row.unique(), columns = ['field_values'])
            ## float defaults so that the float scores assigned below keep the column dtype
            facts_df['sigma'] =  0.0
            facts_df['sigma_star'] =  0.0
            facts_df['confidence'] = 0.0
            
            ## calculating sigma (unadjusted confidence score), sigma_star (adjusted confidence score) and confidence for each fact
            
            for i in range(0,len(facts_df)): 
                facts_df.loc[i,'sigma'] =  Row_values.loc[Row_values.field_values==facts_df.field_values[i],'tau_inv'].sum()

            for i in range(0,len(facts_df)):
                facts_df.loc[i,'sigma_star'] =  facts_df.loc[i,'sigma'] - (facts_df.loc[~facts_df.index.isin([i]),'sigma'].sum())
            
            ## carrying out final calculation for confidence 
            facts_df['confidence'] =  1/(1 + np.exp(-facts_df['sigma_star'] ))
            
            ## feeding the confidence values into the train_data_confidence dataframe 
            Row_values = Row_values.merge(facts_df[['field_values','confidence']], how='left')
                       
            train_data_confidence.iloc[data_row,:] = Row_values['confidence'].values 
        
        
        ## maintaining record of the trustworthiness scores of the data sources
        t_w_prev =  t_w.copy()
        t_w_df.loc[iteration]= t_w
        t_w = train_data_confidence.mean()
        t_w_inv =  1- t_w
        tau_w =  -np.log(t_w_inv)
        
        ## maintaining record of the confidence scores for all iterations
        train_data_conf_iter = train_data_confidence.copy()

        train_data_conf_iter['iteration'] = iteration
        confidence_iterations =  pd.concat([confidence_iterations.reset_index(drop=True), train_data_conf_iter.reset_index(drop=True)], axis=0)
        
        ## printing iteration number and the trustworthiness score
        print(iteration, np.array(t_w_prev))
        if iteration > 5:
            if np.sum(np.abs(t_w.values - t_w_prev.values)) < 0.001:
                break
            
            
    return(t_w_df,confidence_iterations,train_data_confidence )

## get_final_confidence

This function calculates the final confidence for the required column.

Its inputs are the data, the column whose confidence needs to be calculated, and the confidence scores from the last iteration of the algorithm run.

Inputs:
- **data**: The data on which the algorithm was run and for which the confidence is calculated

- **column_to_check_confidence**: The column whose confidence needs to be checked

- **list_of_cols**: The columns used to run the algorithm

- **train_data_confidence**: The final confidence values generated by the last iteration of the algorithm
    
Outputs:  
    
- **data_copy**: A copy of the input data with an added `KO_confidence_score` column holding the confidence of each data point
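
The lookup for a single row can be sketched as below. The helper name `lookup_confidence` and the scores are hypothetical and only illustrate the rule: a value reported by some source inherits that source's confidence for the row, and a value reported by no source gets 0.

```python
import pandas as pd

def lookup_confidence(value, row_values, row_confidence):
    """Return the confidence of `value` if any source reports it, else 0.0."""
    matches = row_values[row_values == value]
    if matches.empty:
        return 0.0
    # all sources reporting the same fact share one confidence; take the first
    return float(row_confidence[matches.index[0]])

sources = pd.Series({"Source_A": "Virat", "Source_B": "Kohli"})
scores = pd.Series({"Source_A": 0.8, "Source_B": 0.2})  # hypothetical scores

print(lookup_confidence("Kohli", sources, scores))
print(lookup_confidence("Dhoni", sources, scores))
```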

In [5]:
def get_final_confidence(data, column_to_check_confidence ,list_of_cols ,train_data_confidence):
    
    ##copying the data
    train_data =  data[list_of_cols].copy()
    data_copy = data.copy()
    
    ##initializing the confidence scores
    data_copy['KO_confidence_score'] = 0.00
    
    for i in range(0,len(data)):
        
        ## making confidence 0 if data point not present 
        KO_value =  data_copy.loc[i,column_to_check_confidence]
        confidence_score =  0  
        
        ## adding confidence from the values matching with train_data_confidence 
        if np.isin(KO_value,train_data.iloc[i,:].values):
            col_name = train_data.iloc[i,:][train_data.iloc[i,:].values==KO_value].index[0]
            confidence_score = train_data_confidence.loc[i,col_name]
        data_copy.loc[i,'KO_confidence_score'] = confidence_score
    return(data_copy)

# Truth-Finder Algorithm:

The [Truth Finder algorithm](https://ieeexplore.ieee.org/document/4415269) is a [truth discovery method](https://en.wikipedia.org/wiki/Truth_discovery), that is, a method for extracting the true value from a set of data sources that provide conflicting information. These methods typically calculate a confidence, i.e. the probability of being true, for each fact provided by the data sources and pick the fact with the highest confidence as the ‘true value’. We can leverage these methods to calculate the confidence that any given value is the correct data point.


## Brief description  :

A more complete description of the algorithm and its relevance to farmer data can be found [here](https://docs.google.com/document/d/1IxaRkn9iFSCvxlUnMGdIm4JDJFK0SDSyWHXxR2iMzEk/edit?usp=sharing).

## Algorithm premise:

Truth Finder is an iterative algorithm that optimizes the confidence of data points based on their adherence to 4 premises:

- Premise 1: There is only one true fact for each data point.


- Premise 2: This true fact appears to be the same or similar on different sources. Different sources that provide this true fact may present it in either the same or slightly different ways. For example, if ‘Aman’ is the true value, then there are likely to be multiple data sources saying ‘Aman’ and some sources providing similar names like ‘Amin’.


- Premise 3: The false facts on different sources are less likely to be the same or similar. Among a set of facts, the subset that doesn't match any of the others is unlikely to be true.


- Premise 4: A source that provides mostly true facts for many objects will likely provide true facts for other objects.
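
The functions in this notebook compare values by exact equality, so ‘Aman’ and ‘Amin’ count as conflicting facts. As a rough illustration of the "similar" part of Premise 2, string similarity could be measured with the standard library's `difflib`. Using it here is an assumption for illustration and is not part of the notebook's implementation; the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# A near-duplicate name scores much higher than an unrelated one
print(similarity("Aman", "Amin"))
print(similarity("Aman", "Patel"))
```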

## Algorithm iteration steps:

The iterative steps are as follows:

1. Initialise the average confidence t(w) of each data source. The starting values could be random; we take 0.5 for all data sources.

2. Calculate the confidence score s*(f) of each fact provided by each data source using the equation below.

![Eqn.PNG](attachment:Eqn.PNG)

3. Recalculate the average data source confidence t(w) as the average confidence of all data points provided by that data source.

4. Repeat steps 2 and 3 until the t(w) values no longer change from one iteration to the next.
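
The confidence calculation in the steps above can be sketched for a single row as follows. This is a compact, vectorized restatement of what `carry_out_iterations` does per row, assuming that every non-matching fact counts fully against a fact; the helper name `row_confidence` is hypothetical.

```python
import numpy as np
import pandas as pd

def row_confidence(values, t_w):
    """Confidence of each distinct fact reported in one row."""
    keys = np.asarray(values, dtype=object)
    # tau(w) = -ln(1 - t(w)): trust score of each source
    tau = -np.log(1.0 - np.asarray(t_w, dtype=float))
    # sigma(f): sum of tau over the sources that report fact f
    sigma = pd.Series(tau).groupby(keys).sum()
    # sigma*(f): sigma(f) minus the sigma of every conflicting fact
    sigma_star = 2.0 * sigma - sigma.sum()
    # the logistic function maps the adjusted score into (0, 1)
    return 1.0 / (1.0 + np.exp(-sigma_star))

conf = row_confidence(["Sehwag", "Viru", "Sehwag", "Sehwag"],
                      [0.5, 0.5, 0.5, 0.5])
print(conf.round(3).to_dict())
```

With all four sources at t(w) = 0.5, three agreeing sources against one dissenting source give a confidence of 0.8 for the majority fact and 0.2 for the other.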

## Sample data :

We have manually created a simple 23-row sample dataset to test the algorithm and check whether it converges.

It has 4 data sources, named Source_A, Source_B, Source_C and Source_D, and a column 'Krushak_Odisha' holding the data points whose confidence is ultimately to be calculated.

Showing the data below:

In [753]:
data

| | Krushak_Odisha | Source_A | Source_B | Source_C | Source_D |
|---|---|---|---|---|---|
| 0 | Dhoni | MS | MS | MS | MS |
| 1 | Sehwag | Sehwag | Viru | Sehwag | Sehwag |
| 2 | Gambhir | Gautam | Gautam | Gautam | Gautam |
| 3 | Sachin | Sachin | Sachin | Sachin | Sachin |
| 4 | Yuvaraj | Yuvaraj | Yuvaraj | Yuvaraj | Yuvaraj |
| 5 | Raina | Raina | Suresh | Raina | Raina |
| 6 | Kohli | Virat | Kohli | Kohli | Virat |
| 7 | Nehra | Ashish | Ashish | Ashish | Ashish |
| 8 | Yusuf | Yusuf | Yusuf | Yusuf | Yusuf |
| 9 | Munaf | Munaf | Patel | Patel | Patel |


## Running the algorithm: 

### Initializing the trustworthiness of data sources with user input :
We need to provide the initial trustworthiness of the data sources here, as decimal numbers separated by spaces.
E.g. to initialize a value of 0.5 for all the data sources, the input must be:


0.5 0.5 0.5 0.5

If no input is provided, or the input does not contain exactly 4 numbers, the default values 0.5,0.5,0.5,0.5 are used.

Only provide values from 0 up to, but not including, 1: the algorithm computes -log(1 - t(w)), which is undefined at t(w) = 1.
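
A slightly more defensive version of the input handling might look like the sketch below. The helper `parse_trust` is hypothetical, and falling back to the defaults on non-numeric or out-of-range values is an assumption; the cells in this notebook only check the number of values.

```python
import numpy as np

def parse_trust(text, n_sources=4, default=0.5):
    """Parse space-separated trust values; fall back to `default` on bad input."""
    fallback = np.full(n_sources, default)
    try:
        values = np.array([float(x) for x in text.split()])
    except ValueError:
        return fallback  # a token was not a number
    if len(values) != n_sources:
        return fallback  # wrong number of values
    if np.any(values < 0) or np.any(values >= 1):
        return fallback  # -log(1 - t) is undefined at t = 1
    return values

print(parse_trust("0.5 0.6 0.7 0.8"))
print(parse_trust("0. 50."))
```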

In [6]:
input_int_array = [ float(x) for x in input().split()]

0. 50.


In [7]:
t_w =  np.array(input_int_array)
if len(t_w) != 4:
    t_w = np.array([0.5,0.5,0.5,0.5])

The input array of initial trustworthiness scores is:

In [8]:
print (t_w)

[0.5 0.5 0.5 0.5]


### Carrying out the iterations:  

We use the user-defined function `carry_out_iterations` to carry out the iterations. It prints the trustworthiness scores at each iteration, and the algorithm ends when the trustworthiness scores change minimally between two iterations (or after a maximum of 70 iterations).
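
The stopping rule can be isolated as a small sketch. The helper name `has_converged` is hypothetical; the tolerance of 0.001 and the minimum of 5 iterations are the values hard-coded in `carry_out_iterations`.

```python
import numpy as np

def has_converged(t_w, t_w_prev, iteration, tol=0.001, min_iterations=5):
    """Stop once the total absolute change in trust falls below `tol`."""
    if iteration <= min_iterations:
        return False  # always run the first few iterations
    change = np.sum(np.abs(np.asarray(t_w) - np.asarray(t_w_prev)))
    return bool(change < tol)

print(has_converged([0.9, 0.7], [0.9, 0.7], iteration=6))
print(has_converged([0.9, 0.7], [0.8, 0.7], iteration=10))
```

Summing the absolute changes over all sources means the tolerance applies to the sources jointly, so a run with many sources needs smaller per-source changes before it stops.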

In [10]:
list_of_cols = ['Source_A','Source_B','Source_C','Source_D']
t_w_df,confidence_iterations,train_data_confidence = carry_out_iterations( data,list_of_cols,t_w)

0 [0.5 0.5 0.5 0.5]
1 [0.76504575 0.68460131 0.66771242 0.72771242]
2 [0.85010205 0.70485743 0.66909746 0.78404346]
3 [0.88311666 0.69632968 0.65415882 0.799436  ]
4 [0.89939943 0.68981138 0.64639339 0.80377826]
5 [0.90812893 0.6863121  0.64279198 0.80418776]
6 [0.91307497 0.68443451 0.64099311 0.8035765 ]
7 [0.91599229 0.68336654 0.64000606 0.80291665]
8 [0.91775929 0.68272907 0.63942717 0.80242243]
9 [0.91884755 0.68233776 0.63907476 0.80208958]
10 [0.91952472 0.68209404 0.63885613 0.8018739 ]


Showing the trustworthiness values for each iteration in a dataframe (same values as printed above): 


In [11]:
t_w_df

| | Source_A | Source_B | Source_C | Source_D |
|---|---|---|---|---|
| 0 | 0.5 | 0.5 | 0.5 | 0.5 |
| 1 | 0.765046 | 0.684601 | 0.667712 | 0.727712 |
| 2 | 0.850102 | 0.704857 | 0.669097 | 0.784043 |
| 3 | 0.883117 | 0.69633 | 0.654159 | 0.799436 |
| 4 | 0.899399 | 0.689811 | 0.646393 | 0.803778 |
| 5 | 0.908129 | 0.686312 | 0.642792 | 0.804188 |
| 6 | 0.913075 | 0.684435 | 0.640993 | 0.803576 |
| 7 | 0.915992 | 0.683367 | 0.640006 | 0.802917 |
| 8 | 0.917759 | 0.682729 | 0.639427 | 0.802422 |
| 9 | 0.918848 | 0.682338 | 0.639075 | 0.80209 |


Showing how the confidence values change across iterations (first 40 rows of `confidence_iterations`):

In [759]:
confidence_iterations.head(40)

| | Source_A | Source_B | Source_C | Source_D | iteration |
|---|---|---|---|---|---|
| 0 | 0.961538 | 0.961538 | 0.961538 | 0.961538 | 0 |
| 1 | 0.8 | 0.2 | 0.8 | 0.8 | 0 |
| 2 | 0.961538 | 0.961538 | 0.961538 | 0.961538 | 0 |
| 3 | 0.961538 | 0.961538 | 0.961538 | 0.961538 | 0 |
| 4 | 0.961538 | 0.961538 | 0.961538 | 0.961538 | 0 |
| 5 | 0.8 | 0.2 | 0.8 | 0.8 | 0 |
| 6 | 0.5 | 0.5 | 0.5 | 0.5 | 0 |
| 7 | 0.961538 | 0.961538 | 0.961538 | 0.961538 | 0 |
| 8 | 0.961538 | 0.961538 | 0.961538 | 0.961538 | 0 |
| 9 | 0.137931 | 0.862069 | 0.862069 | 0.862069 | 0 |


### Getting Final confidence scores:

We use the user-defined function `get_final_confidence` to get the confidence scores for the 'Krushak_Odisha' column.

We need to provide the data, the results of the algorithm, and the column for which to calculate the confidence as inputs.

In [760]:
column_to_check_confidence = 'Krushak_Odisha'
data_copy = get_final_confidence(data, column_to_check_confidence ,list_of_cols ,train_data_confidence)

Showing final results :

In [761]:
data_copy

| | Krushak_Odisha | Source_A | Source_B | Source_C | Source_D | KO_confidence_score |
|---|---|---|---|---|---|---|
| 0 | Dhoni | MS | MS | MS | MS | 0.0 |
| 1 | Sehwag | Sehwag | Viru | Sehwag | Sehwag | 0.976337 |
| 2 | Gambhir | Gautam | Gautam | Gautam | Gautam | 0.0 |
| 3 | Sachin | Sachin | Sachin | Sachin | Sachin | 0.998502 |
| 4 | Yuvaraj | Yuvaraj | Yuvaraj | Yuvaraj | Yuvaraj | 0.998502 |
| 5 | Raina | Raina | Suresh | Raina | Raina | 0.976337 |
| 6 | Kohli | Virat | Kohli | Kohli | Virat | 0.165096 |
| 7 | Nehra | Ashish | Ashish | Ashish | Ashish | 0.0 |
| 8 | Yusuf | Yusuf | Yusuf | Yusuf | Yusuf | 0.998502 |
| 9 | Munaf | Munaf | Patel | Patel | Patel | 0.124915 |
