# Analysing Dataset

In [1]:
def fix_layout(width:int=95):
    from IPython.display import display, HTML
    display(HTML('<style>.container { width:' + str(width) + '% !important; }</style>'))
    
fix_layout()

This notebook is dedicated to dataset analysis.

Here we will concentrate on connecting our datasets into one dataframe that will later be used to extract meaningful information that can help us answer our research questions. 

In addition, data cleaning is performed where necessary since we will not use all the data provided. The goal of this notebook is to make data as easy as possible to use for future plotting and data story writing.

So, let's dive into our data!

In [2]:
import os
import re
import json
import time
import datetime
from functools import reduce
from itertools import product

from json import load, JSONDecodeError
from functional import pseq, seq
import pandas as pd
import pandas_profiling
import requests
import pathlib

# necessary to load the utils, which are in src
import sys
sys.path.append('../src')

from utils import file, logging
from utils.statement_handling import extract_information, safe_json_read

In [3]:
def group_and_count(df, groupby_column, with_pct=False, with_avg=False):
    result = df.groupby(groupby_column).size().sort_values(ascending=False).reset_index().rename(columns={0: 'count'})
    if with_pct:
        result['count_pct'] = result['count'] / result['count'].sum()
    if with_avg:
        result['count_avg'] = result['count'].mean()
    return result

In [4]:
directory_liar_dataset = "../data/liar_dataset"
directory_statements = f"{directory_liar_dataset}/statements"
directory_election_results = "../data/election_results"
directory_county_data = "../data/county_data"

# LIAR Dataset

Now let's read the individual statement files and combine the useful fields from each statement into a single LIAR dataframe:

In [5]:
lies = seq(pathlib.Path(directory_statements).iterdir()).map(safe_json_read)\
                               .filter(lambda x: len(x) > 0)\
                               .map(extract_information)\
                               .to_pandas()

lies['statement_date'] = pd.to_datetime(lies['statement_date'])
lies.head()

ERROR:root:File ..\data\liar_dataset\statements\5355.json is empty or something...
ERROR:root:File ..\data\liar_dataset\statements\9.json is empty or something...


Unnamed: 0,author_name_slug,context,label,ruling_date,speaker_current_job,speaker_first_name,speaker_home_state,speaker_last_name,statement,statement_date,statement_id,statement_type,statement_type_description
0,meghan-ashford-grooms,in a Web site video,pants-fire,2010-01-12T15:52:21,,Barbara Ann,,Radnofsky,The attorney general requires that rape victim...,2009-10-22,1,Claim,blog post
1,jody-kyle,an interview on MSNBC,true,2007-10-03T00:00:00,author,Mike,Arkansas,Huckabee,Hes sued gun manufacturers. He was supportive ...,2007-09-21,100,Attack,<p>\r\n\tA criticism of a candidate.</p>\r\n
2,bill-adair,an interview with Pajamas Media.,pants-fire,2009-04-30T18:20:28,Congresswoman,Michele,Minnesota,Bachmann,"In the 1970s, the swine flu broke out . . . un...",2009-04-27,1000,Claim,blog post
3,w-gardner-selby,a debate in Austin,half-true,2014-10-02T14:07:32,Pharmacist,Leticia,Texas,Van de Putte,Dan Patrick said that if women get paid less t...,2014-09-29,10000,Claim,blog post
4,tom-kertscher,a news conference,barely-true,2014-10-02T13:15:42,,,Wisconsin,Wisconsin Professional Police Association,Data on violent crime shows Wisconsin has beco...,2014-09-25,10001,Attack,<p>\r\n\tA criticism of a candidate.</p>\r\n


In [6]:
lies.shape

(15471, 13)

#### Label

In [7]:
group_and_count(lies, 'label')

Unnamed: 0,label,count
0,half-true,3009
1,false,2919
2,mostly-true,2860
3,barely-true,2587
4,true,2250
5,pants-fire,1604
6,full-flop,149
7,half-flip,68
8,no-flip,25


In [8]:
def label_to_nb(l): 
    """ Converting label to number
    
    Parameters
    ----------
    l: str
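        Label of the statement: a Truth-O-Meter label ('true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire')
        or a Flip-O-Meter label ('full-flop', 'half-flip', 'no-flip')
        
    Returns
    -------
    int
        Index of the label in the list below, in range from 0 to 8 (Flip-O-Meter labels take 0 to 2,
        Truth-O-Meter labels 3 to 8). We still need to think about lie representation.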
    
    ToDos
    -----
    - think about this, this will give the false-hoods more weight
    """
    # TODO: handle FLIP-FLOPS!!!!
    #return ['true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire'].index(l)
    return ['full-flop','half-flip', 'no-flip', 'true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire'].index(l)


In [9]:
# TODO: check this cell, why it is not deleted
# no longer necessary
lies['label_as_nb'] = lies['label'].apply(label_to_nb) * 2 
lies['statement_id'] = pd.to_numeric(lies['statement_id'])
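The mapping above places the Flip-O-Meter labels on the same axis as the Truth-O-Meter labels, which is what the TODO about flip-flops is worried about. One possible alternative, shown here only as a sketch, keeps the two scales apart. The names `TRUTH_O_METER`, `FLIP_O_METER` and `label_to_scores` are hypothetical and not part of the notebook, and the ordering of the flip labels is an assumption.

```python
# Hypothetical sketch: keep truthfulness and flip-flopping on separate scales.
TRUTH_O_METER = ['true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire']
FLIP_O_METER = ['no-flip', 'half-flip', 'full-flop']  # assumed order: least to most flipping


def label_to_scores(label):
    """Return (truth_score, flip_score); the score that does not apply is None."""
    if label in TRUTH_O_METER:
        return TRUTH_O_METER.index(label), None
    if label in FLIP_O_METER:
        return None, FLIP_O_METER.index(label)
    raise ValueError(f"unknown label: {label!r}")


print(label_to_scores('pants-fire'))  # (5, None)
print(label_to_scores('full-flop'))   # (None, 2)
```

With two separate scores, a median over truthfulness would no longer be pulled towards zero by Flip-O-Meter rulings.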


#### Context

In [10]:
group_and_count(lies, 'context').shape[0]

5828

In [11]:
print(f"The number of different context names is: {group_and_count(lies, 'context').shape[0]}.")

The number of different context names is: 5828.


That is far too many different contexts, and many of them appear only a few times.\
We thus need to regroup them and reduce the number of contexts.

In [12]:
group_and_count(lies, 'context').head(100)

Unnamed: 0,context,count
0,a tweet,471
1,an interview,380
2,a news release,329
3,a press release,325
4,a speech,319
5,a TV ad,275
6,a campaign ad,221
7,a headline,175
8,a television ad,168
9,,166


So how can we regroup all or part of these?
We could use the means of communication, for example:
radio/tv/facebook/twitter/internet
Note that these classes can overlap.

In [13]:
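# Keyword search to regroup similar contexts together; the first matching class wins.
# Most checks are case-insensitive, but the 'tv' keywords and 'blog' are matched case-sensitively.
# Classes: tweet, facebook, tv, campaign, blog, conference, fox, others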
def clean_up_context(c):
    if 'tweet' in c.lower():
        return 'tweet'
    elif 'facebook' in c.lower():
        return 'facebook'
    elif any([s in c for s in ['television', 'TV', 'broadcast', 'press', 'CNN', 'radio', 'magazine']]):
        return 'tv'
    elif 'campaign' in c.lower():
        return 'campaign'
    elif 'blog' in c:
        return 'blog'
    elif 'conference' in c.lower():
        return 'conference'
    elif 'fox' in c.lower():
        return 'fox'
    else:
        return 'others'

lies['clean_context'] = lies['context'].apply(clean_up_context)

In [14]:
group_and_count(lies, 'clean_context')

Unnamed: 0,clean_context,count
0,others,9734
1,tv,2773
2,campaign,1030
3,tweet,584
4,fox,457
5,facebook,325
6,blog,309
7,conference,259
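The regrouping above assigns exactly one class per context, although the classes can overlap (a TV ad for a campaign belongs to both). A minimal sketch of a multi-label, fully case-insensitive variant could look like the following. The class names, keyword patterns and the function `context_classes` are illustrative assumptions, not the notebook's method; unlike the cell above, this sketch treats press items as their own class.

```python
import re

# Hypothetical keyword patterns per class; \b avoids matching inside longer words.
CONTEXT_PATTERNS = {
    "tweet": r"\btweets?\b|\btwitter\b",
    "facebook": r"\bfacebook\b",
    "tv": r"\btv\b|\btelevision\b|\bbroadcast\b|\bcnn\b",
    "radio": r"\bradio\b",
    "campaign": r"\bcampaign\b",
    "press": r"\bpress\b|\bnews release\b",
}
_COMPILED = {name: re.compile(pattern, re.IGNORECASE)
             for name, pattern in CONTEXT_PATTERNS.items()}


def context_classes(context):
    """Return the set of all classes whose pattern occurs in the context string."""
    found = {name for name, pattern in _COMPILED.items() if pattern.search(context)}
    return found or {"others"}


print(sorted(context_classes("a TV ad for his campaign")))  # ['campaign', 'tv']
print(sorted(context_classes("a debate in Austin")))        # ['others']
```

Returning a set keeps the overlap explicit; counting per class then means counting set membership rather than grouping on a single column.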


In [15]:
# In search for empty contexts
lies.loc[lies['context']=='']

Unnamed: 0,author_name_slug,context,label,ruling_date,speaker_current_job,speaker_first_name,speaker_home_state,speaker_last_name,statement,statement_date,statement_id,statement_type,statement_type_description,label_as_nb,clean_context
34,robert-farley,,half-true,2009-05-05T15:12:52,,,,Chain email,Barack Obamas nominee for regulatory czar has ...,2009-04-27,1003,Claim,blog post,10,others
353,lauren-carroll,,mostly-true,2014-12-19T10:57:40,U.S. Senator,Marco,Florida,Rubio,The reason why Cubans don&#39;t have access to...,2014-12-17,10330,Claim,blog post,8,others
467,sean-gorman,,mostly-true,2015-01-26T13:00:00,,,,Virginia House Democratic Caucus,Assault weapons and handguns are allowed in th...,2015-01-19,10439,Claim,blog post,8,others
531,michael-van-sickler,,barely-true,2007-10-05T00:00:00,author,Mike,Arkansas,Huckabee,"The reality is, with a $2 trillion-a-year heal...",2007-09-02,105,Claim,blog post,12,others
686,sean-gorman,,mostly-true,2015-03-23T14:30:00,State delegate,Ken,Virginia,Plum,We&#39;ve got 40 years of study now that show ...,2015-03-12,10643,Claim,blog post,8,others
699,w-gardner-selby,,barely-true,2015-03-30T11:08:40,Senator,Ted,Texas,Cruz,Today roughly half of born-again Christians ar...,2015-03-23,10655,Claim,blog post,12,others
992,sarah-hauer,,half-true,2015-06-10T14:26:02,State Representative,Dale,Wisconsin,Kooyenga,<p dir=ltr>Since Republicans took over after t...,2015-05-22,10931,Claim,blog post,10,others
1387,rachel-brooks,,barely-true,2015-10-02T15:05:02,,,,National Republican Senatorial Committee,Says Missouri Democratic Senate candidate&nbsp...,2015-09-14,11302,Claim,blog post,12,others
1613,james-b-nelson,,full-flop,2015-11-06T05:00:00,U.S. representative,Glenn,Wisconsin,Grothman,On support for the Export-Import Bank,2015-11-06,11527,Flip,Flip-o-Meter items,0,others
1818,nancy-badertscher,,mostly-true,2015-12-30T00:00:00,promoting safe driving,State,Georgia,Public Service Announcement,"By the end of 2015, more than 1,300 people wil...",2015-12-28,11720,Claim,blog post,8,others


In [16]:
# What are the other contexts about
lies.loc[lies['clean_context']=='others']
# We need to consider also column called statement_type_description

Unnamed: 0,author_name_slug,context,label,ruling_date,speaker_current_job,speaker_first_name,speaker_home_state,speaker_last_name,statement,statement_date,statement_id,statement_type,statement_type_description,label_as_nb,clean_context
0,meghan-ashford-grooms,in a Web site video,pants-fire,2010-01-12T15:52:21,,Barbara Ann,,Radnofsky,The attorney general requires that rape victim...,2009-10-22,1,Claim,blog post,16,others
1,jody-kyle,an interview on MSNBC,true,2007-10-03T00:00:00,author,Mike,Arkansas,Huckabee,Hes sued gun manufacturers. He was supportive ...,2007-09-21,100,Attack,<p>\r\n\tA criticism of a candidate.</p>\r\n,6,others
2,bill-adair,an interview with Pajamas Media.,pants-fire,2009-04-30T18:20:28,Congresswoman,Michele,Minnesota,Bachmann,"In the 1970s, the swine flu broke out . . . un...",2009-04-27,1000,Claim,blog post,16,others
3,w-gardner-selby,a debate in Austin,half-true,2014-10-02T14:07:32,Pharmacist,Leticia,Texas,Van de Putte,Dan Patrick said that if women get paid less t...,2014-09-29,10000,Claim,blog post,10,others
10,dylan-baddour,a debate in the Rio Grande Valley.,false,2014-10-03T13:00:00,governor,Greg,Texas,Abbott,I&rsquo;ve been involved in prosecuting a terr...,2014-09-19,10007,Claim,blog post,14,others
13,angie-drobnic-holan,interviews.,full-flop,2009-05-01T14:56:32,Senator,Arlen,Pennsylvania,Specter,On switching parties.,2009-04-28,1001,Flip,Flip-o-Meter items,0,others
15,louis-jacobson,a speech at Northwestern University,half-true,2014-10-06T15:49:39,President,Barack,Illinois,Obama,"When I took office, the deficit was nearly 10 ...",2014-10-02,10012,Claim,blog post,10,others
16,al-mckeon,an ad,mostly-true,2014-10-03T17:20:47,,,,Americans for Responsible Solutions,Republican House candidate&nbsp;Marilinda Garc...,2014-09-16,10013,Claim,blog post,8,others
17,c-eugene-emery,a Providence Journal-WPRI debate,true,2014-10-05T00:01:00,Law professor,Jorge,Rhode Island,Elorza,"We have a retiree that is collecting a $17,000...",2014-09-30,10014,Claim,blog post,6,others
18,al-mckeon,an ad,mostly-true,2014-10-03T17:31:05,,,,Americans for Responsible Solutions,Most people in New Hampshire want to raise the...,2014-09-16,10015,Claim,blog post,8,others


Let's compare the label distribution of two well-known speakers:

In [17]:
def _count_for_last_name_(df, last_name):
    # use the dataframe passed in rather than the global `lies`
    return group_and_count(df.loc[df['speaker_last_name'].str.contains(last_name, flags=re.IGNORECASE), :], 'label', with_pct=True)\
            .rename(columns={'count': f'count_{last_name}', 'count_pct': f'count_pct{last_name}'})

In [18]:
pd.merge(_count_for_last_name_(lies, 'obama'), _count_for_last_name_(lies, 'trump'), on='label')

Unnamed: 0,label,count_obama,count_pctobama,count_trump,count_pcttrump
0,mostly-true,169,0.267829,74,0.112977
1,half-true,162,0.256735,93,0.141985
2,true,127,0.201268,30,0.045802
3,false,73,0.115689,217,0.331298
4,barely-true,71,0.11252,137,0.20916
5,full-flop,13,0.020602,7,0.010687
6,pants-fire,9,0.014263,97,0.148092


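Here we can see that speakers whose last name contains _obama_ had only 9 statements labeled _pants on fire_ (about 1.4% of their statements), compared with 97 (about 14.8%) for _trump_.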

In [19]:
lies[lies['speaker_current_job'].str.contains('County') == True].shape

(335, 15)

In [20]:
lies['statement_date'].describe()

count                   15471
unique                   3555
top       2011-10-11 00:00:00
freq                       26
first     1995-04-01 00:00:00
last      2018-11-22 00:00:00
Name: statement_date, dtype: object

Above, we can see that statements range from 1995 to 2018.

Now, let's profile the data to get an intuitive understanding of it and to spot patterns, if any exist.

In [21]:
profile = pandas_profiling.ProfileReport(lies)
profile.to_file(outputfile="profiler/output.html")

# Federal Election Results

We have another dataset, the federal election results, that we will explore and merge with our LIAR dataset to gain more insight into the data.

In [22]:
pd.options.display.max_colwidth = 300
pd.options.display.max_columns = 300

In [23]:
from itertools import product
from functools import reduce

In [24]:
def add_ending(f):
    """ File ending depending on a year
    
    Parameters
    ----------
    f: str
        Name of the file
    
    ToDos:
    - do 2012 it's a special snowflake
    """
    if '2016' in f:
        return f"{f}x"
    else:
        return f


election_files = [(add_ending(f'{directory_election_results}/federalelections{year}.xls'), year) for year in [2014, 2016]]

Now, let's prepare some data for viewing:

In [25]:
election_results_cols_of_interest = ['CANDIDATE NAME', 'PRIMARY VOTES', 'PRIMARY %']

def fix_columns_election_results(df, year, type_):
    """we are only interested in the primary votes, since these reflect the opinion the most"""
    df = df.loc[:, election_results_cols_of_interest].copy()  # copy so the assignments below do not act on a view
    df[f'primary_votes_{type_.lower()}_{year}'] = df['PRIMARY VOTES']
    df[f'primary_votes_{type_.lower()}_{year}_pct'] = df['PRIMARY %']
    return df.drop(columns=['PRIMARY VOTES', 'PRIMARY %'])


def get_only_voting_results(df):
    return df.loc[df['CANDIDATE NAME'].notna() & df['PRIMARY VOTES'].notna() & df['CANDIDATE NAME'].ne('Scattered') & df['CANDIDATE NAME'].ne('All Others'), :]


def prep_election_results(df, year, type_):
    return fix_columns_election_results(get_only_voting_results(df), year, type_)

In [26]:
election_results = [prep_election_results(pd.read_excel(f, sheet_name=f'{year} US {type_} Results by State'), year, type_) for (f, year), type_ in product(election_files, ['Senate', 'House'])]

# we leave the results as they are, merge, and then check if the person is a senator or a member of the house based on the other results
# the sheet name really is misspelled ("Resuts") in the 2012 file
election_results += [prep_election_results(pd.read_excel(f'{directory_election_results}/federalelections2012.xls', sheet_name='2012 US House & Senate Resuts'), 2012, 'all')]
election_results = reduce(lambda acc, el: pd.merge(acc, el, on='CANDIDATE NAME', how='outer'), election_results)

In [27]:
election_results.head()

Unnamed: 0,CANDIDATE NAME,primary_votes_senate_2014,primary_votes_senate_2014_pct,primary_votes_house_2014,primary_votes_house_2014_pct,primary_votes_senate_2016,primary_votes_senate_2016_pct,primary_votes_house_2016,primary_votes_house_2016_pct,primary_votes_all_2012,primary_votes_all_2012_pct
0,"Sessions, Jeff",Unopposed,,,,,,,,,
1,"Sullivan, Dan",44740,0.400548,,,,,,,,
2,"Miller, Joe",35904,0.321441,,,,,,,,
3,"Treadwell, Mead",27807,0.24895,,,,,,,,
4,"Jaramillo, John M.",3246,0.029061,,,,,,,,


In [28]:
idx_multiple_election_results = election_results.loc[:, [c for c in election_results.columns if any((c.endswith(str(y)) for y in [2012, 2014, 2016]))]].notna().sum(axis=1) > 1

In [29]:
print(f"we have multiple election results for {idx_multiple_election_results.sum()} politicians")

we have multiple election results for 1110 politicians


In [30]:
idx_multiple_election_results.mean()

0.19618239660657477

In [31]:
election_results[idx_multiple_election_results].head()

Unnamed: 0,CANDIDATE NAME,primary_votes_senate_2014,primary_votes_senate_2014_pct,primary_votes_house_2014,primary_votes_house_2014_pct,primary_votes_senate_2016,primary_votes_senate_2016_pct,primary_votes_house_2016,primary_votes_house_2016_pct,primary_votes_all_2012,primary_votes_all_2012_pct
16,"Gardner, Cory",338324,1.0,,,,,,,49340,1.0
19,"Wade, Kevin",18181,0.756627,,,,,,,Unopposed,
32,"Schatz, Brian",115445,0.49346,,,162905.0,0.861662,,,,
33,"Hanabusa, Colleen Wakako",113663,0.485843,,,,,74022.0,0.803757,,
36,"Roco, John P.",4425,0.123572,,,3956.0,0.110303,,,545,0.0112848


In [32]:
# yeah ... let's see how many we can join. the one letter endings might be a problem
election_results['CANDIDATE NAME'].value_counts()

Tonko, Paul D.                  36
Reed, Thomas W., II             36
Collins, Chris                  24
Lowey, Nita M.                  12
Slaughter, Louise M.            12
Katko, John M.                  12
Crowley, Joseph                 12
Higgins, Brian                  12
Maloney, Sean Patrick           12
Engel, Eliot L.                 12
Nadler, Jerrold L.              12
King, Peter T.                   9
Gibson, Christopher P.           9
Israel, Steve J.                 9
Kuster, Ann McLane               8
Jeffries, Hakeem S.              8
Assini, Mark W.                  8
Stefanik, Elise M.               8
Maloney, Carolyn B.              8
Clarke, Yvette D.                8
Zeldin, Lee M.                   6
Long, Wendy                      6
Bishop, Timothy H.               6
Grimm, Michael G.                6
Hayworth, Nan                    6
Guinta, Frank C.                 4
Turek, Jessica L.                4
Hanna, Richard L.                4
Rice, Kathleen M.   

In [33]:
# we are only interested in people, and people have a first name
lies = lies.loc[lies['speaker_first_name'].notnull(), :].copy()

In [34]:
# to aggregate the statements
lies['statement_year'] = lies['statement_date'].dt.year

# for the merging
lies['speaker_full_name'] = lies['speaker_last_name'] + ', ' + lies['speaker_first_name']
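The `speaker_full_name` key above is matched exactly against `CANDIDATE NAME`, which often carries middle initials or suffixes (for example `Tonko, Paul D.` or `Reed, Thomas W., II` in the counts above). A hedged sketch of a normalised join key follows. The helper `name_key` and the suffix list are hypothetical, and reducing names to "last, first" can make different people collide, so the result would need checking before use.

```python
import re

# Assumed list of generational suffixes to ignore.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_key(name):
    """Reduce 'Last, First M., II' to a lower-case 'last, first' key."""
    last, _, rest = name.partition(",")
    # split the given-name part on whitespace, commas and periods
    tokens = [t for t in re.split(r"[\s,.]+", rest) if t]
    # drop one-letter initials and suffixes, keep the first remaining given name
    given = [t for t in tokens if len(t) > 1 and t.lower() not in SUFFIXES]
    first = given[0] if given else ""
    return ", ".join(p for p in (last.strip(), first) if p).lower()


print(name_key("Tonko, Paul D."))       # tonko, paul
print(name_key("Reed, Thomas W., II"))  # reed, thomas
```

Applying the same function to both columns before merging would give both sides a comparable key while leaving the original names untouched.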

### Cleaning job titles

In [35]:
# todo expand this and check this! this is just a quick and dirty fix
# is it really houseman? probably not...
_job_titles_of_interest = [('senat', 'senator'), ('governor', None), ('congress', 'congressman'), ('mayor', None), ('president', None), ('house', 'houseman'), ('rep', 'houseman')]
job_titles_of_interest = [out if out is not None else j for j, out in _job_titles_of_interest]

def cleaned_job_title(jt):
    jt = str(jt).lower()
    
    for j, out in _job_titles_of_interest:
        if j in jt:
            return out if out is not None else j
    else:
        return jt

lies['speakers_job_title_cleaned'] = lies['speaker_current_job'].apply(cleaned_job_title)

In [36]:
_t = lies.merge(election_results, left_on='speaker_full_name', right_on='CANDIDATE NAME', how='outer')

In [37]:
print(f"found election results for {_t['CANDIDATE NAME'].notnull().sum()} ({_t['CANDIDATE NAME'].notnull().mean():.1%}) rows")

found election results for 8272 (39.3%) rows


In [38]:
votes_cols = [c for c in _t.columns if 'votes' in c]
useful_idx = reduce(lambda acc, el: acc | el, [_t[c].notnull() for c in votes_cols]) & _t['speaker_full_name'].notnull() 

print(f"found useful results for {useful_idx.sum()} people")

columns_of_interest = ['label', 'label_as_nb', 'subject', 'speaker_full_name', 'speakers_job_title_cleaned', 'state_info', 'party_affiliation', 'context', 'statement_date'] + votes_cols
_t.loc[useful_idx, columns_of_interest]

found useful results for 3020 people


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,label,label_as_nb,subject,speaker_full_name,speakers_job_title_cleaned,state_info,party_affiliation,context,statement_date,primary_votes_senate_2014,primary_votes_senate_2014_pct,primary_votes_house_2014,primary_votes_house_2014_pct,primary_votes_senate_2016,primary_votes_senate_2016_pct,primary_votes_house_2016,primary_votes_house_2016_pct,primary_votes_all_2012,primary_votes_all_2012_pct
42,pants-fire,16.0,,"Bachmann, Michele",congressman,,,an interview with Pajamas Media.,2009-04-27,,,,,,,,,14569,0.803452
43,false,14.0,,"Bachmann, Michele",congressman,,,a press release,2009-05-06,,,,,,,,,14569,0.803452
44,pants-fire,16.0,,"Bachmann, Michele",congressman,,,a Washington Times interview,2009-06-17,,,,,,,,,14569,0.803452
45,pants-fire,16.0,,"Bachmann, Michele",congressman,,,an interview with the Washington Times.,2009-06-17,,,,,,,,,14569,0.803452
46,false,14.0,,"Bachmann, Michele",congressman,,,a statement on the House floor,2009-07-27,,,,,,,,,14569,0.803452
47,false,14.0,,"Bachmann, Michele",congressman,,,an interview on CNN,2015-09-10,,,,,,,,,14569,0.803452
48,pants-fire,16.0,,"Bachmann, Michele",congressman,,,an interview on Sean Hannity's show on the Fox News Channel.,2009-10-30,,,,,,,,,14569,0.803452
49,false,14.0,,"Bachmann, Michele",congressman,,,an interview on CNN's Larry King Live,2010-03-03,,,,,,,,,14569,0.803452
50,false,14.0,,"Bachmann, Michele",congressman,,,"on CBS's ""Face the Nation""",2010-03-28,,,,,,,,,14569,0.803452
51,false,14.0,,"Bachmann, Michele",congressman,,,,2010-06-11,,,,,,,,,14569,0.803452


In [39]:
_t.loc[useful_idx, 'speakers_job_title_cleaned'].value_counts()

senator                                                                                                 1070
houseman                                                                                                 700
congressman                                                                                              442
                                                                                                         261
milwaukee county executive                                                                               211
governor                                                                                                  89
ohio treasurer                                                                                            29
state assembly member, 78th district                                                                      25
actor                                                                                                     19
businessman        

In [40]:
_t.loc[_t['speakers_job_title_cleaned'].isin(job_titles_of_interest) & useful_idx, columns_of_interest]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,label,label_as_nb,subject,speaker_full_name,speakers_job_title_cleaned,state_info,party_affiliation,context,statement_date,primary_votes_senate_2014,primary_votes_senate_2014_pct,primary_votes_house_2014,primary_votes_house_2014_pct,primary_votes_senate_2016,primary_votes_senate_2016_pct,primary_votes_house_2016,primary_votes_house_2016_pct,primary_votes_all_2012,primary_votes_all_2012_pct
42,pants-fire,16.0,,"Bachmann, Michele",congressman,,,an interview with Pajamas Media.,2009-04-27,,,,,,,,,14569,0.803452
43,false,14.0,,"Bachmann, Michele",congressman,,,a press release,2009-05-06,,,,,,,,,14569,0.803452
44,pants-fire,16.0,,"Bachmann, Michele",congressman,,,a Washington Times interview,2009-06-17,,,,,,,,,14569,0.803452
45,pants-fire,16.0,,"Bachmann, Michele",congressman,,,an interview with the Washington Times.,2009-06-17,,,,,,,,,14569,0.803452
46,false,14.0,,"Bachmann, Michele",congressman,,,a statement on the House floor,2009-07-27,,,,,,,,,14569,0.803452
47,false,14.0,,"Bachmann, Michele",congressman,,,an interview on CNN,2015-09-10,,,,,,,,,14569,0.803452
48,pants-fire,16.0,,"Bachmann, Michele",congressman,,,an interview on Sean Hannity's show on the Fox News Channel.,2009-10-30,,,,,,,,,14569,0.803452
49,false,14.0,,"Bachmann, Michele",congressman,,,an interview on CNN's Larry King Live,2010-03-03,,,,,,,,,14569,0.803452
50,false,14.0,,"Bachmann, Michele",congressman,,,"on CBS's ""Face the Nation""",2010-03-28,,,,,,,,,14569,0.803452
51,false,14.0,,"Bachmann, Michele",congressman,,,,2010-06-11,,,,,,,,,14569,0.803452


Now, our dataframe looks like this:

In [41]:
_t.head(1)

Unnamed: 0,author_name_slug,context,label,ruling_date,speaker_current_job,speaker_first_name,speaker_home_state,speaker_last_name,statement,statement_date,statement_id,statement_type,statement_type_description,label_as_nb,clean_context,statement_year,speaker_full_name,speakers_job_title_cleaned,CANDIDATE NAME,primary_votes_senate_2014,primary_votes_senate_2014_pct,primary_votes_house_2014,primary_votes_house_2014_pct,primary_votes_senate_2016,primary_votes_senate_2016_pct,primary_votes_house_2016,primary_votes_house_2016_pct,primary_votes_all_2012,primary_votes_all_2012_pct
0,meghan-ashford-grooms,in a Web site video,pants-fire,2010-01-12T15:52:21,,Barbara Ann,,Radnofsky,The attorney general requires that rape victims pay for the rape kit.,2009-10-22,1.0,Claim,blog post,16.0,others,2009.0,"Radnofsky, Barbara Ann",,,,,,,,,,,,


# County Data

In [42]:
# load data file
county_raw = pd.read_csv(f"{directory_county_data}/acs2015_county_data.csv")
US_states = county_raw['State'].unique()
county_raw.head()

Unnamed: 0,CensusId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,Citizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001,Alabama,Autauga,55221,26745,28476,2.6,75.8,18.5,0.4,1.0,0.0,40725,51281.0,2391.0,24974,1080,12.9,18.6,33.2,17.0,24.2,8.6,17.1,87.5,8.8,0.1,0.5,1.3,1.8,26.5,23986,73.6,20.9,5.5,0.0,7.6
1,1003,Alabama,Baldwin,195121,95314,99807,4.5,83.1,9.5,0.6,0.7,0.0,147695,50254.0,1263.0,27317,711,13.4,19.2,33.1,17.7,27.1,10.8,11.2,84.7,8.8,0.1,1.0,1.4,3.9,26.4,85953,81.5,12.3,5.8,0.4,7.5
2,1005,Alabama,Barbour,26932,14497,12435,4.6,46.2,46.7,0.2,0.4,0.0,20714,32964.0,2973.0,16824,798,26.7,45.3,26.8,16.1,23.1,10.8,23.1,83.8,10.9,0.4,1.8,1.5,1.6,24.1,8597,71.8,20.8,7.3,0.1,17.6
3,1007,Alabama,Bibb,22604,12073,10531,2.2,74.5,21.4,0.4,0.1,0.0,17495,38678.0,3995.0,18431,1618,16.8,27.9,21.5,17.9,17.8,19.0,23.7,83.2,13.5,0.5,0.6,1.5,0.7,28.8,8294,76.8,16.1,6.7,0.4,8.3
4,1009,Alabama,Blount,57710,28512,29198,8.6,87.9,1.5,0.3,0.1,0.0,42345,45813.0,3141.0,20532,708,16.7,27.2,28.5,14.1,23.9,13.5,19.9,84.9,11.2,0.4,0.9,0.4,2.3,34.9,22189,82.0,13.5,4.2,0.4,7.7


# DATA SET COMPLETE

At this point, we have collected all the columns we need. Let's see how we can clean them:

In [43]:
median_speaker_value = _t.groupby(['statement_year', 'speaker_full_name'])['label_as_nb'].median().reset_index()

In [44]:
median_speaker_value[median_speaker_value['statement_year'] == 2016]

Unnamed: 0,statement_year,speaker_full_name,label_as_nb
4637,2016.0,", Foodmentum",8.0
4638,2016.0,"My City Bikes,",14.0
4639,2016.0,"18% of the American public,",6.0
4640,2016.0,"57 campaign, Stop",12.0
4641,2016.0,"ACLU of North Carolina,",6.0
4642,2016.0,"AFSCME People,",8.0
4643,2016.0,"AFSCME,",10.0
4644,2016.0,"Abbott, Greg",14.0
4645,2016.0,"Abele, Chris",15.0
4646,2016.0,"Action Fund, ClearPath",12.0


### Non-People Speakers Handling

Removing non-people (_tweets, facebook posts, etc._) from the dataset:

In [45]:
from nltk import download
download('punkt')
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk import sent_tokenize
from collections import Counter

model = 'nlp/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz'
jar = 'nlp/stanford-ner-2018-10-16/stanford-ner-3.9.2.jar'
st = StanfordNERTagger(model, jar, encoding='utf-8')

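def get_tag(speaker):
    """Return the most common NER tag for a speaker slug, or 0 if it cannot be tagged."""
    if not isinstance(speaker, str):
        return 0
    full_speaker_name = speaker.replace("-", " ").title()

    # collect the tags of every sentence instead of keeping only the last one
    tags = []
    for sent in sent_tokenize(full_speaker_name):
        tags.extend(st.tag(word_tokenize(sent)))
    if not tags:
        return 0

    ner_tag = Counter(dict(tags).values()).most_common(1)[0][0]
    print(tags, " --> ", ner_tag)
    return ner_tag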
    

# just to see if/how it works
word = "Twitter-Post-Anna"
get_tag(word)

full_speaker_name = "Barack-Obama"
get_tag(full_speaker_name)

full_speaker_name = 0
get_tag(full_speaker_name)


[nltk_data] Downloading package punkt to C:\Users\Jelena
[nltk_data]     Banjac\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[('Twitter', 'ORGANIZATION'), ('Post', 'ORGANIZATION'), ('Anna', 'PERSON')]  -->  ORGANIZATION
[('Barack', 'PERSON'), ('Obama', 'PERSON')]  -->  PERSON


0

In [46]:
_t["speaker"] = _t.apply(lambda row: "-".join([str(row["speaker_first_name"]).strip().replace(" ","-").lower(), str(row["speaker_last_name"]).strip().replace(" ","-").lower()]), axis=1)
_t["speaker"] = _t.apply(lambda row: row["speaker"][1:] if row["speaker"].startswith("-") else row["speaker"][:-1] if row["speaker"].endswith("-") else row["speaker"], axis=1)
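The two `apply` calls above build a slug such as `barbara-ann-radnofsky`, but `str()` turns a missing name into the text `nan`, which then ends up inside the slug. A small sketch of a helper that skips missing parts instead is given below. `is_missing` and `speaker_slug` are hypothetical names, and the sketch assumes missing values arrive as `None` or a float NaN.

```python
import math


def is_missing(value):
    """True for None and float NaN (how pandas usually represents missing values)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def speaker_slug(first_name, last_name):
    """Join the available name parts into a lower-case, dash-separated slug."""
    parts = [str(p).strip().lower().replace(" ", "-")
             for p in (first_name, last_name) if not is_missing(p)]
    parts = [p for p in parts if p]  # drop empty strings
    return "-".join(parts) if parts else None


print(speaker_slug("Barbara Ann", "Radnofsky"))   # barbara-ann-radnofsky
print(speaker_slug(float("nan"), "Chain email"))  # chain-email
```

Returning `None` when both parts are missing keeps such rows detectable with `pd.isnull` later on.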

In [57]:
import os.path

file_path = 'nlp/speaker_tags_lg.json'

if not os.path.exists(file_path):
    print(f"Total number of values to classify: {len(_t['speaker'].value_counts().index)}")

    words_with_tags = {}
    for word in _t['speaker'].value_counts().index:
        words_with_tags[word] = get_tag(word)
    
    # save tags, since it took ~3h to tag all 3214 unique speakers
    with open(file_path, 'w') as fp:
        json.dump(words_with_tags, fp, indent=4)
else:
    with open(file_path, 'r') as f:
        words_with_tags = json.load(f)
    print(f"Total number of classified values (from file): {len(words_with_tags)}")

Total number of classified values (from file): 3962


In [48]:
_t["speaker_tag"] = _t.apply(lambda row: words_with_tags.get(row['speaker'], "UNKNOWN") if not pd.isnull(row['speaker']) else row['speaker'], axis=1)
_t[['speaker','speaker_tag']].drop_duplicates()

Unnamed: 0,speaker,speaker_tag
0,barbara-ann-radnofsky,PERSON
1,mike-huckabee,PERSON
42,michele-bachmann,PERSON
103,leticia-van-de-putte,PERSON
115,wisconsin-professional-police-association,ORGANIZATION
117,allen-west,PERSON
143,georgians-together,O
145,thom-tillis,PERSON
151,reza-aslan,PERSON
152,republican-party-of-florida,ORGANIZATION


Seems good, so now let's check how many rows would remain if we removed non-people from the dataset:

In [49]:
_t.shape

(21026, 31)

In [50]:
_t[_t['speaker_tag'] == "PERSON"].shape

(18194, 31)

In [51]:
_t[_t['speaker_tag'] == "UNKNOWN"].shape

(0, 31)

We see that there are ~2,800 rows (21026 - 18194) whose speaker is not tagged as a person, e.g. _Twitter, Facebook, Blog post, Republican Party Texas, etc._

In [52]:
# removing non-people statements
# _t = _t[_t['speaker_tag'] == "PERSON"]

### Clean-up Context

In [53]:
_t["context"].value_counts().index.values

array(['a tweet', 'an interview', 'a news release', ...,
       'a press release from the Communication Workers of America',
       ' Conference call with reporters', 'a CBS4 interview'],
      dtype=object)

It would be good to try a tool that extracts keywords from these phrases. Let's use RAKE from the `rake_nltk` package:

In [67]:
from rake_nltk import Rake, Metric
from collections import Counter

def do_keyword_extraction(words, debug = False):
    if debug: print("---\n", words)
        
    rake_all = Rake()
    rake_all.extract_keywords_from_sentences(_t["context"].value_counts().index.values)

    word_degrees = dict(rake_all.get_word_degrees())
    
    r = Rake()
    r.extract_keywords_from_text(words)

    keywords = dict(r.get_word_degrees())
    
    if debug: print(keywords)
        
    for k, v in keywords.items():
        keywords[k] = word_degrees[k]
    
    if debug: print(keywords)
#     print(Counter(keywords).most_common(1))
#     print(Counter(keywords).most_common(1)[0])
    return Counter(keywords).most_common(1)[0] if len(Counter(keywords).most_common(1)) > 0 else "OTHER"

In [55]:
# try to see how it works (only the result of the last call is displayed)
do_keyword_extraction("an interview")
do_keyword_extraction("a television interview")
do_keyword_extraction("a TV interview")

('interview', 219)
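
The second element of the returned pair is the word's degree over all contexts. In RAKE, a word's degree grows with how often the word occurs and with the length of the candidate phrases it occurs in. The sketch below is a simplified re-implementation of that idea, not the `rake_nltk` code: the tiny stopword list and the whitespace tokenization are assumptions, so the numbers will differ from the real output.

```python
from collections import defaultdict

# Assumption: a tiny hand-picked stopword list; rake_nltk relies on NLTK's list instead.
STOPWORDS = {"a", "an", "the", "on", "in", "of"}

def split_into_phrases(text):
    """Split text into runs of consecutive non-stopwords (candidate phrases)."""
    phrases, current = [], []
    for word in text.lower().split():
        if word in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)
    return phrases

def word_degrees(texts):
    """Add the phrase length to a word's degree for each occurrence in a phrase."""
    degrees = defaultdict(int)
    for text in texts:
        for phrase in split_into_phrases(text):
            for word in phrase:
                degrees[word] += len(phrase)
    return dict(degrees)

contexts = ["a television interview", "an interview", "a TV interview on CNN"]
degrees = word_degrees(contexts)
```

Here `interview` ends up with the highest degree because it appears in three phrases, which is why a frequent, generic word tends to win as the tag for a context.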

In [68]:
_t["context_tag"] = _t.apply(lambda row: do_keyword_extraction(row['context']) if not pd.isnull(row['context']) else row['context'], axis=1)

In [69]:
context_tags = _t[['context','context_tag']]['context_tag'].value_counts()
print(f"Number of different context tags is {len(context_tags)}")
context_tags

Number of different context tags is 345


(interview, 219)       1753
(news, 478)            1282
(campaign, 533)        1177
(speech, 226)          1107
(press, 304)           1017
(debate, 476)          1010
(ad, 240)               973
(post, 159)             631
(tweet, 41)             573
(radio, 236)            315
(state, 247)            314
(house, 309)            293
(show, 265)             255
(senate, 311)           203
(email, 82)             201
(headline, 9)           178
OTHER                   166
(website, 153)          165
(article, 108)          158
(mail, 73)              158
(column, 63)            155
(meeting, 184)          154
(statement, 86)         147
(web, 163)              143
(video, 144)            142
(new, 251)              128
(letter, 40)            103
(internet, 45)           99
(op, 50)                 96
(comments, 29)           87
                       ... 
(.,, 18)                  1
(executive, 24)           1
(retweet, 1)              1
(reilly, 6)               1
(church, 9)         

We see that the number of context tags is 345, which is a pretty big number. Let's consider decreasing this number by merging the tags into smaller groups.
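
One possible way to shrink the tag set is a hand-written lookup from keyword to a coarse group. The following is only a sketch: `CONTEXT_GROUPS`, `KEYWORD_TO_GROUP` and `to_context_group` are hypothetical names, and the grouping itself is illustrative rather than a vetted taxonomy.

```python
# Hypothetical mapping from extracted keywords to coarse groups; the group
# names and their members are illustrative only.
CONTEXT_GROUPS = {
    "interview": {"interview", "radio", "show"},
    "news": {"news", "press", "headline", "article", "column", "statement"},
    "campaign": {"campaign", "ad", "mail", "letter"},
    "speech": {"speech", "debate", "meeting", "comments"},
    "online": {"tweet", "post", "email", "website", "web", "video", "internet"},
}

# Reverse lookup: keyword -> group
KEYWORD_TO_GROUP = {
    keyword: group
    for group, keywords in CONTEXT_GROUPS.items()
    for keyword in keywords
}

def to_context_group(context_tag):
    """Map a context_tag such as ('web', 163) or 'OTHER' to a coarse group."""
    if isinstance(context_tag, tuple):
        keyword = context_tag[0]
        return KEYWORD_TO_GROUP.get(keyword, "other")
    return "other"  # covers the "OTHER" string and missing values
```

It could then be applied with something like `_t["context_tag"].map(to_context_group)`. Any keyword not listed falls into `"other"`, so the mapping can be extended gradually while inspecting what remains unassigned.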

## Dataset we will use for our visualizations

In [70]:
_t.head()

Unnamed: 0,author_name_slug,context,label,ruling_date,speaker_current_job,speaker_first_name,speaker_home_state,speaker_last_name,statement,statement_date,statement_id,statement_type,statement_type_description,label_as_nb,clean_context,statement_year,speaker_full_name,speakers_job_title_cleaned,CANDIDATE NAME,primary_votes_senate_2014,primary_votes_senate_2014_pct,primary_votes_house_2014,primary_votes_house_2014_pct,primary_votes_senate_2016,primary_votes_senate_2016_pct,primary_votes_house_2016,primary_votes_house_2016_pct,primary_votes_all_2012,primary_votes_all_2012_pct,speaker,speaker_tag,context_tag
0,meghan-ashford-grooms,in a Web site video,pants-fire,2010-01-12T15:52:21,,Barbara Ann,,Radnofsky,The attorney general requires that rape victims pay for the rape kit.,2009-10-22,1.0,Claim,blog post,16.0,others,2009.0,"Radnofsky, Barbara Ann",,,,,,,,,,,,,barbara-ann-radnofsky,PERSON,"(web, 163)"
1,jody-kyle,an interview on MSNBC,true,2007-10-03T00:00:00,author,Mike,Arkansas,Huckabee,Hes sued gun manufacturers. He was supportive of Brady. He was supportive of things like assault weapon bans.,2007-09-21,100.0,Attack,<p>\r\n\tA criticism of a candidate.</p>\r\n,6.0,others,2007.0,"Huckabee, Mike",author,,,,,,,,,,,,mike-huckabee,PERSON,"(interview, 219)"
2,michael-van-sickler,,barely-true,2007-10-05T00:00:00,author,Mike,Arkansas,Huckabee,"The reality is, with a $2 trillion-a-year health care budget, were spending more on health care, nearly 17 percent of our gross domestic product, versus 3.8 percent of GDP on the entire military budget.",2007-09-02,105.0,Claim,blog post,12.0,others,2007.0,"Huckabee, Mike",author,,,,,,,,,,,,mike-huckabee,PERSON,OTHER
3,louis-jacobson,a video,barely-true,2015-05-04T17:10:31,author,Mike,Arkansas,Huckabee,Says he raised average family income by 50 percent during his tenure as Arkansas governor.,2015-05-01,10796.0,Claim,blog post,12.0,others,2015.0,"Huckabee, Mike",author,,,,,,,,,,,,mike-huckabee,PERSON,"(video, 144)"
4,lauren-carroll,"an interview on ""Fox News Sunday""",barely-true,2015-05-29T10:00:50,author,Mike,Arkansas,Huckabee,The Supreme Court can&rsquo;t overrule the other two branches&nbsp;of government.&nbsp;,2015-05-24,10879.0,Claim,blog post,12.0,fox,2015.0,"Huckabee, Mike",author,,,,,,,,,,,,mike-huckabee,PERSON,"(news, 478)"


In [71]:
_t.shape

(21026, 32)

In [73]:
_t.to_json('data/liar_dataset.json')

In [None]:
# TODO: make smaller context groups, ideally around 10

In [None]:
# TODO: implement tagging on the cleaned jobs as well

In [None]:
# TODO: plot the answers from research questions we have

## Some initial insights

In [None]:
# share of pants-on-fire rulings among all counted rulings
# (assumes the *_counts columns are present in _t)
_t['sum_not_so_true'] = _t['pants_on_fire_counts']/(_t['barely_true_counts'] + _t['false_counts'] + _t['half_true_counts'] + _t['mostly_true_counts'] + _t['pants_on_fire_counts'])
number_of_party_affiliation = _t.groupby('party_affiliation')['sum_not_so_true'].sum().sort_values(ascending=False)
number_of_party_affiliation

Here are the `party_affiliation` values ordered by their summed proportion of pants-on-fire statements. We already know that the two dominant parties in the USA are the Republicans and the Democrats. We also see many unknown party affiliations, from which we can identify 2 possibilities.

In [None]:
lies_by_speaker = _t.groupby(['speaker'])['sum_not_so_true'].sum().sort_values(ascending=False)
lies_by_speaker.head(10)

The speakers above are sorted by their quantity of lies, in descending order.

In [None]:
all_contexts = _t['context_tag'].unique()
nb_elements_context = _t.groupby(['context_tag'])['context_tag'].count().sort_values(ascending=False)
nb_elements_context.head(50)

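Looking at the context, most statements come from interviews, followed by news, campaign, speech, press and debate contexts, and so on. Note that these counts cover all statements, not only the false ones, so they show where people speak rather than where they lie the most.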
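
Counting rows per context tag measures how many statements were made in each context, true or false. To compare how often people lie per context, the share of false rulings within each group is more informative. Below is a hedged sketch on toy data: the rows are made up, plain strings stand in for the `(word, degree)` tags, and treating `false` and `pants-fire` as "lies" is an assumption.

```python
import pandas as pd

# Toy stand-in for _t; rows are made up for illustration.
toy = pd.DataFrame({
    "context_tag": ["interview", "interview", "interview", "speech", "speech", "debate"],
    "label": ["false", "true", "pants-fire", "half-true", "false", "true"],
})

# Assumption: these two labels count as a "lie"; adjust as needed.
LIE_LABELS = {"false", "pants-fire"}

toy["is_lie"] = toy["label"].isin(LIE_LABELS)

# Number of statements and share of lies per context, highest share first
by_context = (
    toy.groupby("context_tag")["is_lie"]
       .agg(n_statements="count", lie_share="mean")
       .sort_values("lie_share", ascending=False)
)
```

Keeping `n_statements` next to `lie_share` makes it easy to ignore contexts with only a handful of statements, where the share is not meaningful.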