# Advanced Topics in Computer Science - Project 1

## A Neural Network Approach for Truth Discovery in Social Sensing

<b>Team Name:</b> Carbonara Bros <br>
<b>Team Members:</b> Andrea De Angelis, Vincenzo Di Cicco, Maurizio Mazzei, Paolo Montana

Paper: https://www3.nd.edu/~sslab/pdf/mass17.pdf <br>
Dataset: Population_claims (https://amubox.univ-amu.fr/index.php/s/mB2orlnFUkZgtnk)

In [90]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras import optimizers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [2]:
dataset = pd.read_csv('datasets/Population_claims.csv')
dataset.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue,SourceID,TimeStamp
0,abu dhabi,Population2006,1000230,0 (68.162.248.83),
1,abu dhabi,Population2006,1850230,1513217: Mohammedfairouz,
2,amsterdam,Population2006,741329,141597: Ilse@,
3,amsterdam,Population2006,742884,1300620: Krator,
4,adelaide,Population2006,1124315,3922171: Pirate05,


In [3]:
ground_truth = pd.read_csv('datasets/Population_groundtruth.csv')
ground_truth.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue
0,"cleveland, ohio",Population2000,478403
1,"gary, indiana",Population2000,102746
2,"flint, michigan",Population2000,124943
3,"compton, california",Population2000,93493
4,"washington, d.c.",Population2005,582049


### Original Dataset Statistics

In [11]:
number_original_claims = len(dataset)
print('# of claims in the original dataset: {}'.format(number_original_claims))

# of claims in the original dataset: 49955


* A unique claim is identified by the triple (ObjectID, PropertyID, PropertyValue)

In [12]:
number_original_unique_claims = len(dataset.drop_duplicates(subset=['ObjectID', 'PropertyID', 'PropertyValue']))
print('# of unique claims in the original dataset: {}'.format(number_original_unique_claims))

# of unique claims in the original dataset: 44590


In [14]:
unique_claims_ratio = number_original_unique_claims / number_original_claims
# '{:.2%}' multiplies the fraction by 100 before appending the percent sign
print('Only {:.2%} of the claims are duplicates'.format(1 - unique_claims_ratio))
print('The remaining {:.2%} of the claims are unique'.format(unique_claims_ratio))

Only 10.74% of the claims are duplicates
The remaining 89.26% of the claims are unique


In [16]:
original_unique_sources = dataset.SourceID.unique()
number_original_unique_sources = len(original_unique_sources)
print('# of unique sources: {}'.format(number_original_unique_sources))

# of unique sources: 4264


### Ground Truth Construction

* The ground truth has 308 rows, all of them true claims, but training also needs false claims
* Each row is a triple (ObjectID, PropertyID, PropertyValue) defining a true claim
* For each claim in the ground truth, we identify the rows in the dataset that refer to the same claim
    * These examples are labeled 1 (true claim)
* For each claim in the ground truth, we also identify the rows that have the same ObjectID and PropertyID but a different PropertyValue
    * These examples are labeled 0 (false claim, since they contradict the ground truth)

In [17]:
# List of true claims (only for convenience)
truth_claims = []

for index, row in ground_truth.iterrows():
    truth_claims.append(row)

In [19]:
# Original dataset's header
columns = dataset.columns
print(columns)

Index(['ObjectID', 'PropertyID', 'PropertyValue', 'SourceID', 'TimeStamp'], dtype='object')


* Create an empty DataFrame with the same header as the original one
* Append the positive examples to this new DataFrame
* Add a 'Target' column (all values set to 1)

In [20]:
df_positives = pd.DataFrame(columns = columns)

In [21]:
for claim in truth_claims:
    # Rows reporting exactly the ground-truth value for this (ObjectID, PropertyID)
    mask = ((dataset.ObjectID == claim.ObjectID) &
            (dataset.PropertyID == claim.PropertyID) &
            (dataset.PropertyValue == claim.PropertyValue))
    # Work on a copy so that adding a column does not raise SettingWithCopyWarning
    d3 = dataset[mask].copy()
    d3['Target'] = 1
    df_positives = df_positives.append(d3)


In [22]:
df_positives.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue,SourceID,Target,TimeStamp
169,"cleveland, ohio",Population2000,478403,81676: Beirne,1.0,
171,"cleveland, ohio",Population2000,478403,94900: EurekaLott,1.0,
172,"cleveland, ohio",Population2000,478403,627347: MJCdetroit,1.0,
175,"cleveland, ohio",Population2000,478403,1960810: Nyttend,1.0,
177,"cleveland, ohio",Population2000,478403,1948715: Confiteordeo,1.0,


In [23]:
len(df_positives) # positive examples

598

* Create an empty DataFrame with the same header as the original one
* Append the negative examples to this new DataFrame
* Add a 'Target' column (all values set to 0)

In [24]:
df_negatives = pd.DataFrame(columns = columns)

In [25]:
for claim in truth_claims:
    # Same (ObjectID, PropertyID) as the ground truth, but a different PropertyValue
    mask = ((dataset.ObjectID == claim.ObjectID) &
            (dataset.PropertyID == claim.PropertyID) &
            (dataset.PropertyValue != claim.PropertyValue))
    # Work on a copy so that adding a column does not raise SettingWithCopyWarning
    d3 = dataset[mask].copy()
    d3['Target'] = 0
    df_negatives = df_negatives.append(d3)


In [26]:
df_negatives.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue,SourceID,Target,TimeStamp
174,"cleveland, ohio",Population2000,444323,0 (75.179.37.144),0.0,
176,"cleveland, ohio",Population2000,372628,0 (130.13.0.12),0.0,
178,"cleveland, ohio",Population2000,478404,0 (71.17.156.29),0.0,
383,"gary, indiana",Population2000,0,2278184: El diablo21,0.0,
2220,"flint, michigan",Population2000,118551,551240: Blueskiesfalling,0.0,


In [27]:
len(df_negatives) # negative examples

448
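
The positive and negative loops above can also be expressed as a single join. The following is a minimal sketch on toy data, not the code used in this notebook: the frames `claims` and `truth` and their values are illustrative, and it assumes the ground truth holds at most one row per (ObjectID, PropertyID) pair.

```python
import pandas as pd

# Toy stand-ins for the claims and ground-truth tables (values are illustrative).
claims = pd.DataFrame({
    'ObjectID': ['rome', 'rome', 'rome', 'oslo'],
    'PropertyID': ['Population2006'] * 4,
    'PropertyValue': [2700000, 2700000, 1500000, 540000],
    'SourceID': ['s1', 's2', 's3', 's4'],
})
truth = pd.DataFrame({
    'ObjectID': ['rome'],
    'PropertyID': ['Population2006'],
    'PropertyValue': [2700000],
})

# The inner join keeps only claims whose (ObjectID, PropertyID) has a known truth.
labeled = claims.merge(truth, on=['ObjectID', 'PropertyID'], suffixes=('', '_true'))
# Target is 1 when the claimed value equals the true value, 0 otherwise.
labeled['Target'] = (labeled['PropertyValue'] == labeled['PropertyValue_true']).astype(int)
labeled = labeled.drop(columns='PropertyValue_true')
print(labeled)
```

Claims about objects that are missing from the ground truth (here `oslo`) are dropped by the join, which matches the behaviour of the two loops.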

In [30]:
final_df = df_positives.append(df_negatives)

In [31]:
final_df.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue,SourceID,Target,TimeStamp
169,"cleveland, ohio",Population2000,478403,81676: Beirne,1.0,
171,"cleveland, ohio",Population2000,478403,94900: EurekaLott,1.0,
172,"cleveland, ohio",Population2000,478403,627347: MJCdetroit,1.0,
175,"cleveland, ohio",Population2000,478403,1960810: Nyttend,1.0,
177,"cleveland, ohio",Population2000,478403,1948715: Confiteordeo,1.0,


### Final DataFrame Statistics

In [34]:
len(final_df)

1046

In [32]:
number_of_claims = final_df.groupby(['ObjectID','PropertyID', 'PropertyValue']).ngroups
number_of_claims

685

In [33]:
number_of_sources = final_df.groupby('SourceID').ngroups
number_of_sources

643

* From this final_df we want to build the "Sensing Matrix"
* The sensing matrix shape will be (685, 643)
* N = 685; M = 643

<table>
    <tr>
        <td></td>
        <th>Source 1</th>
        <th>Source 2</th> 
        <th>...</th>
        <th>Source M</th>
    </tr>
    <tr>
        <th>Claim 1</th>
        <td>1</td>
        <td>0</td> 
        <td>...</td>
        <td>0</td>
    </tr>
    <tr>
        <th>Claim 2</th>
        <td>0</td>
        <td>1</td> 
        <td>...</td>
        <td>1</td>
    </tr>
    <tr>
        <th>...</th>
        <td>...</td>
        <td>...</td> 
        <td>...</td>
        <td>...</td>
    </tr>
    <tr>
        <th>Claim N</th>
        <td>0</td>
        <td>0</td> 
        <td>...</td>
        <td>0</td>
    </tr>
</table>

* M sources {$S_{1}, S_{2}, .., S_{M}$}

* N claims {$ C_{1}, C_{2}, .., C_{N} $}

* Every source can report any of the N claims
    * If the ith source reports the jth claim, the corresponding sensing matrix cell contains 1
    * If the ith source does not report the jth claim, the corresponding sensing matrix cell contains 0
* That is, $S_iC_j = 1$ or $S_iC_j = 0$

### Sensing Matrix Construction

* We need to identify the unique claims (N claims)
* For every unique claim we need to know the source that reports it
* For every triple (ObjectID, PropertyID, PropertyValue) we need to know the 'Target' (truth value)

In [35]:
sensing_df = (final_df.iloc[:, 0:3]).drop_duplicates()

In [36]:
sensing_df.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue
169,"cleveland, ohio",Population2000,478403
382,"gary, indiana",Population2000,102746
2219,"flint, michigan",Population2000,124943
5927,"compton, california",Population2000,93493
7253,"washington, d.c.",Population2005,582049


In [38]:
# append SourceID column
for index, row in sensing_df.iterrows():
    source_id = final_df.loc[index, 'SourceID']
    sensing_df.loc[index, 'SourceID'] = source_id

In [39]:
# append Target column
for index, row in sensing_df.iterrows():
    target = final_df.loc[index, 'Target']
    sensing_df.loc[index, 'Target'] = target

In [41]:
print('Now we have {} unique claims'.format(len(sensing_df)))

Now we have 685 unique claims


In [42]:
sensing_df.head()

Unnamed: 0,ObjectID,PropertyID,PropertyValue,SourceID,Target
169,"cleveland, ohio",Population2000,478403,81676: Beirne,1.0
382,"gary, indiana",Population2000,102746,620554: Harpchad,1.0
2219,"flint, michigan",Population2000,124943,201610: Pentawing,1.0
5927,"compton, california",Population2000,93493,541143: Postoak,1.0
7253,"washington, d.c.",Population2005,582049,0 (84.171.251.182),1.0


* Next, we build a nested mapping:
    * each claim maps to every source, with a flag telling whether that source reports the claim
    * e.g. {Claim1: {Source1: 1, Source2: 0, ...}}
* First, we initialize every flag to 0
* Then, we set the flag to 1 when a source reports a claim

In [43]:
matrix = {}
for index, row in sensing_df.iterrows():
    # index = row label of the unique claim in sensing_df
    values = {}
    for source in final_df.SourceID.unique():
        values[source] = 0 
        
    matrix[index] = values

In [44]:
for index, row in sensing_df.iterrows():
    # index = row label of the unique claim in sensing_df
    source_id = row['SourceID']
    matrix[index][source_id] = 1
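
Because `sensing_df` carries one SourceID per unique claim, the loop above sets a single 1 in each row. If the intent is to mark every source that reported a claim, a cross-tabulation over all reports is one option. This is a hedged sketch on toy data; the `reports` frame and its values are assumptions for illustration, and the resulting matrix would differ from the one used in the rest of this notebook.

```python
import pandas as pd

# Toy labeled reports: one row per (claim, source) pair; values are illustrative.
reports = pd.DataFrame({
    'ObjectID': ['rome', 'rome', 'rome', 'oslo'],
    'PropertyID': ['Population2006'] * 4,
    'PropertyValue': [2700000, 2700000, 1500000, 540000],
    'SourceID': ['s1', 's2', 's3', 's1'],
})

claim_cols = ['ObjectID', 'PropertyID', 'PropertyValue']
# Rows are unique claims, columns are sources, cells count the matching reports.
counts = pd.crosstab([reports[c] for c in claim_cols], reports['SourceID'])
# Clip to {0, 1}: a cell is 1 if the source reported the claim at least once.
sensing = counts.clip(upper=1)
X_toy = sensing.to_numpy()
print(sensing)
```

In this toy case the claim (rome, Population2006, 2700000) gets a 1 under both `s1` and `s2`, which a one-source-per-claim construction would not capture.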

* Building y numpy array (truth values)

In [47]:
y = sensing_df.iloc[:,4].values

In [48]:
len(y)

685

In [49]:
y

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.

* Building the sensing matrix numpy array, appending the truth value to every row

In [53]:
sensing_matrix = []
for k, v in matrix.items():
    row = []
    for k1, v1 in v.items():
        row.append(v1) 
    sensing_matrix.append(row)

In [54]:
for count, sources in enumerate(sensing_matrix):
    sources.append(y[count])

In [55]:
sm = np.array(sensing_matrix)

In [56]:
sm

array([[1., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [57]:
sm.shape

(685, 644)

### Training the Neural Network

* Split X and y
* Split X_train and X_test, and y_train and y_test

In [58]:
X = sm[:,0:-1]

In [59]:
y = sm[:,-1]

In [60]:
y.shape

(685,)

In [61]:
X.shape

(685, 643)

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [63]:
X_train.shape

(513, 643)

In [64]:
y_train.shape

(513,)

In [66]:
# Neural Network Architecture

model = Sequential()
# Dropout on the inputs; input_shape declares the 643 source features
model.add(Dropout(0.3, input_shape=(643,)))
# `units` replaces the deprecated `output_dim` argument
model.add(Dense(units=643, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(units=321, activation='relu'))
# A single sigmoid unit outputs the probability that the claim is true
model.add(Dense(units=1, activation='sigmoid'))

sgd = optimizers.SGD(lr=0.025)
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['accuracy'])


In [68]:
model.fit(x = X_train, y = y_train, epochs=150, batch_size=20, verbose=2)

Epoch 1/150
 - 1s - loss: 0.6920 - acc: 0.5224
Epoch 2/150
 - 0s - loss: 0.6862 - acc: 0.5887
Epoch 3/150
 - 0s - loss: 0.6795 - acc: 0.5809
Epoch 4/150
 - 0s - loss: 0.6747 - acc: 0.5926
Epoch 5/150
 - 0s - loss: 0.6724 - acc: 0.5945
Epoch 6/150
 - 0s - loss: 0.6676 - acc: 0.6062
Epoch 7/150
 - 0s - loss: 0.6682 - acc: 0.6296
Epoch 8/150
 - 0s - loss: 0.6643 - acc: 0.6413
Epoch 9/150
 - 0s - loss: 0.6626 - acc: 0.6394
Epoch 10/150
 - 0s - loss: 0.6584 - acc: 0.6413
Epoch 11/150
 - 0s - loss: 0.6535 - acc: 0.6550
Epoch 12/150
 - 0s - loss: 0.6532 - acc: 0.6433
Epoch 13/150
 - 0s - loss: 0.6490 - acc: 0.6452
Epoch 14/150
 - 0s - loss: 0.6540 - acc: 0.6277
Epoch 15/150
 - 0s - loss: 0.6468 - acc: 0.6374
Epoch 16/150
 - 0s - loss: 0.6436 - acc: 0.6413
Epoch 17/150
 - 0s - loss: 0.6425 - acc: 0.6433
Epoch 18/150
 - 0s - loss: 0.6288 - acc: 0.6706
Epoch 19/150
 - 0s - loss: 0.6361 - acc: 0.6472
Epoch 20/150
 - 0s - loss: 0.6388 - acc: 0.6433
Epoch 21/150
 - 0s - loss: 0.6297 - acc: 0.6550
E

<keras.callbacks.History at 0x7f90ad0b60f0>
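
To put the architecture above in perspective, the number of trainable parameters can be worked out by hand. The sketch below uses a hypothetical helper, `dense_params`, and relies only on the layer sizes stated in the model definition (643 inputs, then 643, 321 and 1 units); Dropout layers add no parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the width of each Dense layer.
layer_sizes = [643, 643, 321, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [414092, 206724, 322] 621138
```

With 513 training rows, the network has far more parameters than examples, which is one reason the dropout layers and a held-out test set matter here.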

### Confusion Matrix

<table>
    <tr>
        <th></th>
        <th>Predicted no</th>
        <th>Predicted yes</th>
    </tr>
    <tr>
        <th>Actual no</th>
        <td>true negatives</td>
        <td>false positives</td>
    </tr>
    <tr>
        <th>Actual yes</th>
        <td>false negatives</td>
        <td>true positives</td>
    </tr>
</table>
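
scikit-learn orders the rows and columns of `confusion_matrix` by sorted label, so with labels 0 and 1 the result is laid out as `[[TN, FP], [FN, TP]]`. A small check on made-up labels (the lists `y_true` and `y_hat` are illustrative only):

```python
from sklearn.metrics import confusion_matrix

# Three actual negatives followed by two actual positives.
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 0, 1, 1, 0]

cm_toy = confusion_matrix(y_true, y_hat)
# ravel() flattens row by row: first the actual-0 row, then the actual-1 row.
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print('TN={} FP={} FN={} TP={}'.format(tn, fp, fn, tp))
```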

* Training set

In [79]:
y_pred = model.predict(X_train)
y_pred = (y_pred > 0.6)
    
cm = confusion_matrix(y_train, y_pred)

In [80]:
cm

array([[282,  14],
       [  6, 211]])

* 513 observations
* 493 correct predictions
* 20 wrong predictions

In [89]:
print('Accuracy: {}'.format(493/513))

Accuracy: 0.9610136452241715


* Test set

In [77]:
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.6)
    
cm = confusion_matrix(y_test, y_pred)

In [78]:
cm

array([[93,  3],
       [24, 52]])

* 172 new observations
* 145 correct predictions
* 27 wrong predictions
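
The share of correct predictions is the accuracy. Precision and recall for the "true claim" class can be read off the same matrices. The helper below, `binary_metrics`, is a hypothetical name introduced for illustration; it assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` and treats label 1 (true claim) as the positive class.

```python
import numpy as np

def binary_metrics(cm):
    """Accuracy, precision and recall from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    cm = np.asarray(cm, dtype=float)
    (tn, fp), (fn, tp) = cm
    return {
        'accuracy': (tp + tn) / cm.sum(),
        'precision': tp / (tp + fp),  # share of predicted-true claims that are true
        'recall': tp / (tp + fn),     # share of true claims that were found
    }

# Confusion matrices printed in this notebook.
m_train = binary_metrics([[282, 14], [6, 211]])
m_test = binary_metrics([[93, 3], [24, 52]])
print(m_train)
print(m_test)
```

On the test matrix this gives a precision of 52/55 (about 0.945) but a recall of 52/76 (about 0.684), so most of the test errors are true claims predicted as false.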

In [88]:
print('Accuracy: {}'.format(145/172))

Accuracy: 0.8430232558139535
