# Loading the "missing" PSC statements

Some companies file a statement instead of naming a person with significant control (PSC), for example to say that the PSC has not yet been identified. Here we build those statement nodes and link the relevant companies to them.

In [1]:
import pandas as pd
import json
from pandas.io.json import json_normalize

import blaze as bz


First, load the data and delete the intermediate frame once it has been flattened, so we don't waste RAM.

In [3]:
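original_psc_data = pd.read_json('../data/psc_snapshot-2017-09-08.json')
# Flatten the nested 'data' field into columns and keep the company number alongside it
all_records_psc = pd.concat(
    [original_psc_data['company_number'], json_normalize(original_psc_data['data'])],
    axis=1)
# The raw frame is no longer needed
del original_psc_data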

In [4]:
all_records_psc[~all_records_psc.statement.isnull()].statement.map(lambda x: ',' in x).value_counts()

False    404603
Name: statement, dtype: int64

No statement value contains a comma, so each record holds a single statement type, and there are 404,603 statement records in total. The breakdown by statement type is shown below.

In [5]:
all_records_psc.statement.value_counts()

no-individual-or-entity-with-signficant-control                328286
steps-to-find-psc-not-yet-completed                             26806
psc-exists-but-not-identified                                   22995
psc-details-not-confirmed                                       20025
steps-to-find-psc-not-yet-completed-partnership                  3400
no-individual-or-entity-with-signficant-control-partnership       836
psc-exists-but-not-identified-partnership                         760
psc-details-not-confirmed-partnership                             726
psc-contacted-but-no-response                                     717
restrictions-notice-issued-to-psc                                  48
psc-has-failed-to-confirm-changed-details                           4
Name: statement, dtype: int64

## Inserting the statements into Neo4j

Since all the statement types are known up front, we can create the full set of statement nodes that a company can declare. In theory this should be done only once, when the database is first built. Running it again would cause clashes or produce duplicate nodes unless we use MERGE rather than CREATE.

In [6]:
from neo4j.v1 import GraphDatabase
driver = GraphDatabase.driver("bolt://10.0.0.1:7687", auth=("myusername", "mypassword"))

In [7]:
with driver.session() as session:
    for statement_type in all_records_psc.statement.dropna().unique():
        session.run("CREATE (s:Statement {type: {statement_type}})", statement_type=statement_type)
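Because the cell above uses `CREATE`, running it a second time would add a second set of `Statement` nodes. Below is a minimal sketch of an idempotent variant, assuming the same legacy `{param}` placeholder syntax as the rest of this notebook. The names `merge_statement_nodes` and `RecordingSession` are hypothetical. The function accepts any object with a `run` method, so it can be tried without a database.

```python
def merge_statement_nodes(session, statement_types):
    """Ensure one Statement node exists per type, without creating duplicates."""
    # MERGE matches an existing node with this exact type or creates it,
    # so calling the function again leaves the graph unchanged.
    query = "MERGE (s:Statement {type: {statement_type}})"
    for statement_type in statement_types:
        session.run(query, statement_type=statement_type)


class RecordingSession:
    """Stand-in for a Neo4j session that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))


# Dry run against the stand-in; with a real database, pass a session
# obtained from driver.session() instead.
dry_run = RecordingSession()
merge_statement_nodes(dry_run, ["psc-details-not-confirmed", "psc-contacted-but-no-response"])
print(len(dry_run.calls))
```

Newer Neo4j releases expect `$statement_type` in place of `{statement_type}`, so the query string may need adjusting for the server version in use.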

## Now to connect companies to statements

1. First let's define a function that takes a company record and creates the relationship back to the statement. 
2. Then we need to loop over all companies that have made statements and insert those into the Neo4j database.
3. We will chunk over the data so as not to put excess strain on the database.

In [8]:
def write_no_psc_statement(input_data):
    """Write a batch of company-to-statement relationships to the Neo4j database.

    input_data is a list of dicts, each with 'company_id' and 'statement' keys.
    """
    with driver.session() as session:
        session.run(("UNWIND {list} AS d "
                     "MERGE (c:Company {uid: d.company_id}) "
                     "MERGE (s:Statement {type: d.statement}) "
                     "MERGE (c)-[:DECLARED]->(s);"), 
                    {"list": input_data})
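The query unwinds a list of maps, each with a `company_id` and a `statement` key. To illustrate that shape, the sketch below builds one such map from a raw record. The sample path and the helper name `to_statement_row` are made up for the example. It assumes `links.self` has the form `/company/<company number>/...`, which is what the `split('/')[2]` step used later in this notebook relies on.

```python
def to_statement_row(self_link, statement):
    """Build one map in the shape the UNWIND query expects."""
    # The leading slash makes the first split element empty, so index 2
    # is the path segment that follows "company".
    company_id = self_link.split("/")[2]
    return {"company_id": company_id, "statement": statement}


# Hypothetical record: the company number and trailing id are invented.
row = to_statement_row(
    "/company/01234567/persons-with-significant-control-statements/abc123",
    "psc-details-not-confirmed",
)
print(row)
```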

We shall loop over the data in chunks of 50k records.

In [12]:
# Copy the filtered rows so the new column is added to an independent frame
# rather than to a slice of all_records_psc
no_statement_df = all_records_psc[all_records_psc.statement.notnull()].copy()
# Take the path segment of links.self that is expected to hold the company number
no_statement_df['company_id'] = no_statement_df['links.self'].map(lambda x: x.split('/')[2])


In [13]:
for chunk in bz.odo(no_statement_df[['company_id', 'statement']], target=bz.chunks(pd.DataFrame), chunksize=50000):
    print(chunk.shape)
    # One dict per row, with 'company_id' and 'statement' keys
    input_data = chunk.to_dict('records')
    write_no_psc_statement(input_data)

(50000, 2)
(50000, 2)
(50000, 2)
(50000, 2)
(50000, 2)
(50000, 2)
(50000, 2)
(50000, 2)
(4603, 2)
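The chunk sizes printed above follow from splitting 404,603 records into batches of 50,000. For readers without blaze installed, here is a minimal sketch of the same batching using plain slicing. This is a swapped-in technique, not what the notebook ran. The helper name `iter_chunks` is hypothetical, and the records are placeholders rather than real data.

```python
def iter_chunks(records, chunksize):
    """Yield consecutive slices of `records`, each at most `chunksize` long."""
    for start in range(0, len(records), chunksize):
        yield records[start:start + chunksize]


# Placeholder records standing in for the 404,603 statement rows.
records = [{"company_id": str(i), "statement": "placeholder"} for i in range(404603)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(records, 50000)]
print(chunk_sizes)
```

Each chunk produced this way is already a list of dicts, so it could be passed to `write_no_psc_statement` directly.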
