# Querying "cluster JSON" with Pandas

In [1]:
import codecs
import json
import pprint

import pandas as pd

We will:

* Open a single JSON file for a cluster and look at it.
* Load the records into a list of dictionaries, which is then easy to turn into a Pandas DataFrame to explore.

In [2]:
data = []
with codecs.open('../../data/SKNews.json', encoding='utf-8') as f:
    for line in f:
        entry = json.loads(line)
        data.append(entry)

In [3]:
pprint.pprint(data[:2])
print(data[1]['body_text'])

[{u'body_text': u"KT apologizes after 8.7M subscribers' data hacked Published: July 29, 2012 9:23 a.m. ET SEOUL--South Korea's KT Corp. (KT, 030200.SE) apologized Sunday after acknowledging that the personal data for 8.7 million of the company's subscribers had been hacked and said it will devote efforts to prevent any recurrences. Earlier in the day, media reports said the police have arrested hackers who are suspected of having stolen personal data of the customers of the nation's second-largest mobile operator. KT had requested a police investigation on July 13 after detecting signs of possible hacking through internal monitoring, the company said in a statement. The company has has since retrieved all the personal data that had been allegedly collected by the hackers, it added.",
  u'cluster_id': 198,
  u'corpus': u'SKNews',
  u'novelty': False,
  u'order': 1,
  u'post_id': u'1981'},
 {u'body_text': u'South Korean crackers arrested Yet another data leak at KT 29 Jul 2012 at 23:30, 

In [4]:
pd_data = pd.DataFrame(data)

In [5]:
print(pd_data)

                                           body_text  cluster_id  corpus  \
0  KT apologizes after 8.7M subscribers' data hac...         198  SKNews
1  South Korean crackers arrested Yet another dat...         198  SKNews
2  Two arrested for hacking personal data of 8.7 ...         198  SKNews
3  Hackers steal, sell data on 8.7 million Korea ...         198  SKNews
4  Data of 8.7 Million KT Subscribers Hacked in S...         198  SKNews
5  South Korea arrests phone firm KT Corp hacking...         198  SKNews

  novelty  order post_id
0   False      1    1981
1    True      0    1980
2    True      2    1982
3   False      3    1983
4   False      4    1984
5   False      5    1985
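
As an aside, the manual loop plus `pd.DataFrame(data)` is not the only route. The sketch below assumes the file holds one JSON object per line (as the loop above implies) and uses `pd.read_json` with `lines=True` on two made-up records held in memory; the records and the `json_lines` name are illustrative, not taken from SKNews.json.

```python
import io

import pandas as pd

# Two made-up records in a one-JSON-object-per-line layout (illustrative only).
json_lines = io.StringIO(
    '{"post_id": "1981", "cluster_id": 198, "order": 1, "novelty": false}\n'
    '{"post_id": "1980", "cluster_id": 198, "order": 0, "novelty": true}\n'
)

# lines=True parses each line as a separate JSON object.
# dtype=False asks pandas not to infer dtypes, so post_id is left as text.
df = pd.read_json(json_lines, lines=True, dtype=False)
print(df)
```

For a real file, the same call should accept a path in place of the `StringIO` object.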


You can use Pandas functionality like indexing and sorting:

In [6]:
pd_data = pd_data.set_index('order')
# .ix was removed in pandas 1.0; .loc selects the row whose index label (order) is 0
print(pd_data.loc[0].body_text)

South Korean crackers arrested Yet another data leak at KT 29 Jul 2012 at 23:30, Richard Chirgwin South Korean police say they have arrested two malicious hackers that obtained personal details of 8.7 million KT mobile customers and on-selling the data to telemarketing firms. The police accuse the pair of earning one billion won, which sounds a lot more than the roughly $US800,000 it converts to, in the scam. The data theft took place between February this year and early this month, when KT detected signs of intrusion on their networks. Seven individuals have been charged over buying the data, according to AFP. In the kind of apology you never get from Western companies suffering data breaches (or, for that matter, repeatedly and egregiously breaching their customers’ privacy – we know who we’re talking about), KT issued a statement to customers saying that “We deeply bow our head in apology” for the leaks. This may, however, reflect KT’s humiliation at being still vulnerable to data l
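
To make the indexing and sorting steps concrete, here is a minimal sketch on a toy frame. The values are made up to mirror the columns above; `toy`, `by_order`, and the other variable names are only for illustration.

```python
import pandas as pd

# Toy stand-in for the cluster records; the values are illustrative only.
toy = pd.DataFrame(
    {
        "order": [1, 0, 2],
        "post_id": ["1981", "1980", "1982"],
        "novelty": [False, True, True],
    }
)

by_order = toy.set_index("order")

# .loc looks rows up by index label, so this is the row whose order == 0.
first_post = by_order.loc[0, "post_id"]

# sort_index reorders the rows by index label, i.e. by order.
in_reading_order = by_order.sort_index()

print(first_post)
print(in_reading_order)
```

Because the lookup is by label rather than by position, `by_order.loc[0]` returns the `order == 0` row even though it is stored second.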

## Loading all the json files from a folder

We can also pass a directory to load_json_as_pandas and have it load all the files therein into a single dataframe:

In [7]:
import os

# You need the pythia repo root on sys.path (or provided in the $PYTHONPATH env var)
from src.utils.load_json_to_pandas import load_json_as_pandas, load_json_file

In [12]:
json_path = '../../data/'
!ls -al $json_path

total 20
drwxr-xr-x 2 pcallier pcallier 4096 Jun 15 18:54 .
drwxr-xr-x 7 pcallier pcallier 4096 Jun 15 21:50 ..
-rw-r--r-- 1 pcallier pcallier 9530 Jun 15 18:54 SKNews.json


In [13]:
data_frame = load_json_as_pandas(json_path)

This demonstrates the code in utils/load_json_to_pandas.py, which returns a dataframe like the one built by hand above:

In [15]:
print(data_frame)

                                           body_text  cluster_id  corpus  \
0  KT apologizes after 8.7M subscribers' data hac...         198  SKNews   
1  South Korean crackers arrested Yet another dat...         198  SKNews   
2  Two arrested for hacking personal data of 8.7 ...         198  SKNews   
3  Hackers steal, sell data on 8.7 million Korea ...         198  SKNews   
4  Data of 8.7 Million KT Subscribers Hacked in S...         198  SKNews   
5  South Korea arrests phone firm KT Corp hacking...         198  SKNews   

  novelty  order post_id  
0   False      1    1981  
1    True      0    1980  
2    True      2    1982  
3   False      3    1983  
4   False      4    1984  
5   False      5    1985  
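
The source of `load_json_as_pandas` is not shown here, so the following is only a hypothetical sketch of what a directory loader of this kind might do, not the repo's implementation. The name `load_json_dir`, the `.json` suffix filter, and the sample files are all assumptions made for illustration.

```python
import json
import os
import tempfile

import pandas as pd


def load_json_dir(directory):
    """Hypothetical helper: read every *.json file of JSON lines into one DataFrame."""
    records = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return pd.DataFrame(records)


# Demonstrate on a throwaway directory holding two small made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for fname, cluster in [("a.json", 1), ("b.json", 2)]:
        with open(os.path.join(tmp, fname), "w", encoding="utf-8") as f:
            f.write(json.dumps({"cluster_id": cluster, "order": 0}) + "\n")
            f.write(json.dumps({"cluster_id": cluster, "order": 1}) + "\n")
    combined = load_json_dir(tmp)

print(combined)
```

Collecting all records in one list before building the frame keeps the sketch short; for very large directories, building one frame per file and concatenating may be preferable.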


And you can sort by arbitrary columns:

In [25]:
print(data_frame.sort_values('post_id'))

                                           body_text  cluster_id  corpus  \
1  South Korean crackers arrested Yet another dat...         198  SKNews   
0  KT apologizes after 8.7M subscribers' data hac...         198  SKNews   
2  Two arrested for hacking personal data of 8.7 ...         198  SKNews   
3  Hackers steal, sell data on 8.7 million Korea ...         198  SKNews   
4  Data of 8.7 Million KT Subscribers Hacked in S...         198  SKNews   
5  South Korea arrests phone firm KT Corp hacking...         198  SKNews   

  novelty  order post_id  
1    True      0    1980  
0   False      1    1981  
2    True      2    1982  
3   False      3    1983  
4   False      4    1984  
5   False      5    1985  
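
One caveat worth sketching: `post_id` is stored as a string in the JSON (for example `u'1981'`), so sorting on it compares text, not numbers. That matches numeric order here only because every ID has four digits. The toy IDs below are made up to show where the two orders diverge.

```python
import pandas as pd

# Made-up IDs of different lengths to expose the difference.
toy = pd.DataFrame({"post_id": ["1980", "999", "10000"]})

# String comparison goes character by character: "10000" < "1980" < "999".
as_text = toy.sort_values("post_id")["post_id"].tolist()

# Casting to int first gives numeric order: 999 < 1980 < 10000.
as_number = (
    toy.assign(post_num=toy["post_id"].astype(int))
    .sort_values("post_num")["post_id"]
    .tolist()
)

print(as_text)
print(as_number)
```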


A handy webpage to learn about querying data in Pandas is at: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

In [26]:
# Look to see how many documents are in each cluster
pd_data.groupby("cluster_id").size()

cluster_id
198    6
dtype: int64

In [27]:
# How many non-empty clusters does the current dataset contain?
len(pd_data.groupby("cluster_id").size())

1
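
In the spirit of the SQL comparison page linked above, here is a small sketch of a few SQL-style queries written in Pandas. The toy frame (including the second cluster, 201) is made up so that the counts are not trivial.

```python
import pandas as pd

# Illustrative data only; cluster 201 does not exist in SKNews.json.
toy = pd.DataFrame(
    {
        "cluster_id": [198, 198, 198, 201],
        "order": [0, 1, 2, 0],
        "novelty": [True, False, True, True],
    }
)

# SELECT * FROM toy WHERE novelty  ->  boolean mask
novel = toy[toy["novelty"]]

# SELECT COUNT(DISTINCT cluster_id) FROM toy  ->  nunique
n_clusters = toy["cluster_id"].nunique()

# SELECT cluster_id, COUNT(*) FROM toy WHERE novelty GROUP BY cluster_id
novel_per_cluster = novel.groupby("cluster_id").size()

print(n_clusters)
print(novel_per_cluster)
```

`nunique()` gives the same number as `len(df.groupby("cluster_id").size())` without building the intermediate counts.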

## Extra: PySpark
Next we load the same data into Spark.
This part only works if you have Spark running and PySpark installed. Any of the operations above can also be performed in Spark.

In [None]:
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the active SparkContext if one exists; otherwise start a new one.
sc = SparkContext.getOrCreate()

sqlCtx = SQLContext(sc)

In [None]:
data1 = sqlCtx.read.json("../../data/SKNews.json")

In [None]:
data1.collect()

In [None]:
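# createOrReplaceTempView supersedes registerTempTable, which is deprecated since Spark 2.0.
data1.createOrReplaceTempView("stack")
# order is an SQL keyword, so the column name is quoted with backticks.
sqlCtx.sql("select * from stack where `order` = 0").take(1)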

In [None]:
data1.printSchema()