# Parse a Nested JSON Object & Convert It to a DataFrame Using Python

**<font color=red>Mr Fugu Data Science</font>** 

# (◕‿◕✿)

[Github](https://github.com/MrFuguDataScience) | [Youtube](https://www.youtube.com/channel/UCbni-TDI-Ub8VlGaP8HLTNw/)

# Purpose & Outcome:

+ Convert nested JSON into a DataFrame ( `Four` *different ways* )

+ Adjust the nesting function to take more parameters ( *such as new data* )
    + *1<sup>st</sup>* : Create **`New Nested`** entries inside our function
    + *2<sup>nd</sup>* : Add **`Non-Nested`** new entries
    
+ Last: Discuss timing these operations and show an example
    

`____________________________________________`


In [11]:
import json
import pandas as pd
import time
from faker import Faker # Create fake data
import numpy as np
# pandas < 1.0: from pandas.io.json import json_normalize

In [12]:
# Our Nested Data: What do we do with this?

# ¯\_(ツ)_/¯

pd.read_json("employee_data.json").head()

Unnamed: 0,features
0,"{'candidate': {'first_name': 'Margaret', 'last..."
1,"{'candidate': {'first_name': 'Michael', 'last_..."
2,"{'candidate': {'first_name': 'Brenda', 'last_n..."
3,"{'candidate': {'first_name': 'Joseph', 'last_n..."
4,"{'candidate': {'first_name': 'Laura', 'last_na..."


In [13]:
# The file holds a single JSON document, so parse it in one call
with open('employee_data.json') as datafile:
    data = json.load(datafile)
ww=pd.DataFrame(data)
ww.head()

Unnamed: 0,features
0,"{'candidate': {'first_name': 'Margaret', 'last..."
1,"{'candidate': {'first_name': 'Michael', 'last_..."
2,"{'candidate': {'first_name': 'Brenda', 'last_n..."
3,"{'candidate': {'first_name': 'Joseph', 'last_n..."
4,"{'candidate': {'first_name': 'Laura', 'last_na..."
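Reading a file line by line with `json.loads` only works when each line is a complete JSON value. A minimal sketch of the two common layouts, using made-up in-memory text (the `single_doc` and `json_lines` strings are illustrative, not the contents of `employee_data.json`):

```python
import io
import json

# One JSON document (it may span several lines): parse it in a single call.
single_doc = io.StringIO('{"features": [\n  {"candidate": {"first_name": "Ada"}}\n]}')
data_whole = json.load(single_doc)

# JSON Lines: one complete JSON value per line, so parse line by line
# and keep every record instead of overwriting a single variable.
json_lines = io.StringIO('{"first_name": "Ada"}\n{"first_name": "Grace"}\n')
records = [json.loads(line) for line in json_lines if line.strip()]

print(len(data_whole["features"]), len(records))
```

If the file were in JSON Lines form, `pd.read_json(path, lines=True)` would be the pandas counterpart.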


# Four Ways to Convert Nested JSON to Data Frame:

+ Start with `pandas.read_json` (ways 1 and 2) or with the `data` dictionary loaded by `json` above (ways 3 and 4):

    + *1<sup> st</sup> :* convert the `features` column to a list, then create a new data frame

    + *2<sup> nd</sup> :* pass the nested records to `pandas.json_normalize`
    
    + *3<sup> rd</sup> :* iterate from the outer level to the inner `candidate` entries
    
    + *4<sup> th</sup> :* list comprehension

# 1 ) read_json 

In [15]:
df = pd.read_json("employee_data.json")

bn=pd.DataFrame(df.features.values.tolist())['candidate']

pd.DataFrame.from_records(bn).head()



Unnamed: 0,first_name,last_name,skills,state,specialty,experience,relocation
0,Margaret,Mcdonald,"[skLearn, Java, R, SQL, Spark, C++]",AL,Database,Mid,no
1,Michael,Carter,"[TensorFlow, R, Spark, MongoDB, C++, SQL]",AR,Statistics,Senior,yes
2,Brenda,Tyler,[Spark],UT,Database,Mid,no
3,Joseph,King,"[skLearn, SQL, R, Spark, Java, C++, Python, Te...",FL,Machine Learning,Senior,maybe
4,Laura,Webb,"[TensorFlow, C++, SQL, Java, R, MongoDB]",WY,Machine Learning,Junior,maybe


# 2 ) json_normalize:

+ Takes semi-structured (JSON) data as input and returns a `flat table`

In [16]:
df = pd.read_json("employee_data.json")

bn=pd.DataFrame(df.features.values.tolist())['candidate']

pd.json_normalize(bn).head()

Unnamed: 0,first_name,last_name,skills,state,specialty,experience,relocation
0,Margaret,Mcdonald,"[skLearn, Java, R, SQL, Spark, C++]",AL,Database,Mid,no
1,Michael,Carter,"[TensorFlow, R, Spark, MongoDB, C++, SQL]",AR,Statistics,Senior,yes
2,Brenda,Tyler,[Spark],UT,Database,Mid,no
3,Joseph,King,"[skLearn, SQL, R, Spark, Java, C++, Python, Te...",FL,Machine Learning,Senior,maybe
4,Laura,Webb,"[TensorFlow, C++, SQL, Java, R, MongoDB]",WY,Machine Learning,Junior,maybe
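`json_normalize` can also flatten the nesting itself, rather than only the already-extracted `candidate` records. A small sketch with made-up records (the names and values are illustrative):

```python
import pandas as pd

# Two records shaped like the entries of the 'features' list
records = [
    {"candidate": {"first_name": "Ada", "skills": ["SQL", "R"], "state": "NY"}},
    {"candidate": {"first_name": "Grace", "skills": ["Python"], "state": "VA"}},
]

# Nested keys are joined with "." by default; lists stay as list values
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

The column names carry the path (`candidate.first_name`, and so on), which can help when several nested groups share key names.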


# 3 ) Iterate:

In [39]:
f=[]
for i in data['features']:
    f.append(i['candidate'])

pd.DataFrame(f).head()

Unnamed: 0,first_name,last_name,skills,state,specialty,experience,relocation
0,Margaret,Mcdonald,"[skLearn, Java, R, SQL, Spark, C++]",AL,Database,Mid,no
1,Michael,Carter,"[TensorFlow, R, Spark, MongoDB, C++, SQL]",AR,Statistics,Senior,yes
2,Brenda,Tyler,[Spark],UT,Database,Mid,no
3,Joseph,King,"[skLearn, SQL, R, Spark, Java, C++, Python, Te...",FL,Machine Learning,Senior,maybe
4,Laura,Webb,"[TensorFlow, C++, SQL, Java, R, MongoDB]",WY,Machine Learning,Junior,maybe


# 4 ) List Comprehension:

In [40]:
# Nested list comprehension: collect the value of every key in each feature (here only 'candidate')

tt=[val for sublist in data['features'] for val in sublist.values() ]
pd.DataFrame(tt).head()


Unnamed: 0,first_name,last_name,skills,state,specialty,experience,relocation
0,Margaret,Mcdonald,"[skLearn, Java, R, SQL, Spark, C++]",AL,Database,Mid,no
1,Michael,Carter,"[TensorFlow, R, Spark, MongoDB, C++, SQL]",AR,Statistics,Senior,yes
2,Brenda,Tyler,[Spark],UT,Database,Mid,no
3,Joseph,King,"[skLearn, SQL, R, Spark, Java, C++, Python, Te...",FL,Machine Learning,Senior,maybe
4,Laura,Webb,"[TensorFlow, C++, SQL, Java, R, MongoDB]",WY,Machine Learning,Junior,maybe
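Because the comprehension takes every value of each feature, it matches the loop in way 3 only while `candidate` is the sole key. A sketch with made-up data showing the difference once a second key exists (`HR_related` mirrors the key added later in this notebook; the values are illustrative):

```python
features = [
    {"candidate": {"first_name": "Ada"}, "HR_related": {"Salary": [100000]}},
    {"candidate": {"first_name": "Grace"}, "HR_related": {"Salary": [120000]}},
]

# Every value of every feature: candidate and HR_related dicts get mixed together
all_values = [val for feature in features for val in feature.values()]

# Selecting the key explicitly keeps one row per person
only_candidates = [feature["candidate"] for feature in features]

print(len(all_values), len(only_candidates))
```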


# `Create New Data frame that will be used to update our nested JSON object`


In [20]:
# Create List of fake hire dates:
fake_data=Faker()

Faker.seed(10) # seed() is a class method in current Faker releases

hire_dates=[]

for _ in range(len(tt)):
    hire_dates.append(fake_data.date_between(start_date='-7y', end_date='today'))

hire_dates[:2]

[datetime.date(2018, 4, 10), datetime.date(2013, 9, 9)]

In [21]:
# Work Status: FT/PT/Contract
# import numpy as np
work_status=['FT','PT','Contract']


worker_emp_status=np.random.choice(work_status,size=len(tt),replace=True)
# len(worker_emp_status)

In [22]:
# Salary:

Salary=np.random.randint(75000,230000,size=len(tt))
Salary[:4]

array([129784, 176911,  88253, 116101])

In [23]:
# Does person have healthcare coverage?:
healthcare=['yes','no']

healthcare_elig=np.random.choice(healthcare,size=len(tt),replace=True)
healthcare_elig[:3]

array(['yes', 'no', 'yes'], dtype='<U3')
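The NumPy draws above are not seeded, so the values change on each run. One way to make them repeatable, sketched here with NumPy's `default_rng` (the seed value and the size of 5 are arbitrary choices for illustration):

```python
import numpy as np

work_status = ["FT", "PT", "Contract"]

# A generator created from a fixed seed produces the same draws every time
rng = np.random.default_rng(10)
status = rng.choice(work_status, size=5, replace=True)
salary = rng.integers(75000, 230000, size=5)  # upper bound is exclusive

# Re-creating the generator with the same seed repeats the sequence
rng_again = np.random.default_rng(10)
status_again = rng_again.choice(work_status, size=5, replace=True)

print((status == status_again).all())
```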

In [49]:
HR_df=pd.DataFrame(np.stack([hire_dates,Salary,healthcare_elig],axis=1),
                   columns=['hire_date','Salary','healthcare'])

HR_df.head()

HR_df.to_csv('HR_CSV', index = False)

# Combine data frames and insert new data:


In [43]:
df_update=pd.concat([pd.DataFrame(tt),HR_df],axis=1)
df_update.head()

Unnamed: 0,first_name,last_name,skills,state,specialty,experience,relocation,hire_date,Salary,healthcare
0,Margaret,Mcdonald,"[skLearn, Java, R, SQL, Spark, C++]",AL,Database,Mid,no,2018-04-10,129784,yes
1,Michael,Carter,"[TensorFlow, R, Spark, MongoDB, C++, SQL]",AR,Statistics,Senior,yes,2013-09-09,176911,no
2,Brenda,Tyler,[Spark],UT,Database,Mid,no,2017-01-22,88253,yes
3,Joseph,King,"[skLearn, SQL, R, Spark, Java, C++, Python, Te...",FL,Machine Learning,Senior,maybe,2017-07-08,116101,yes
4,Laura,Webb,"[TensorFlow, C++, SQL, Java, R, MongoDB]",WY,Machine Learning,Junior,maybe,2018-05-01,96059,no
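`pd.concat(..., axis=1)` lines rows up by index label, not by position. Both frames above carry the default 0..n-1 index, so they match. A sketch with made-up frames of what happens when the labels differ, and one way around it:

```python
import pandas as pd

left = pd.DataFrame({"first_name": ["Ada", "Grace"]}, index=[0, 1])
right = pd.DataFrame({"Salary": [100000, 120000]}, index=[5, 6])

# Different labels: the result is the union of both indexes, padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the join positional again
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(misaligned.shape, aligned.shape)
```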


# `Now UPDATE OUR NESTED JSON Object, with Human Resources Data Frame:`

**This will create a `New Nested` object inside each feature**

In [44]:
def df_to_nested_json(df, candidate, hire='hire_date',
                      salary='Salary',healthcare='healthcare'):
    _json = {'features':[]}
    
    for _, row in df.iterrows():
        feature = {'candidate':{},
                   'HR_related':{}} # New Nesting Creation here!
        
        #Nested Entries:
        feature['HR_related']['hire_date'] = [row[hire]]
        feature['HR_related']['Salary'] = [row[salary]]
        feature['HR_related']['healthcare'] = [row[healthcare]]

        for prop in candidate:
            feature['candidate'][prop] = row[prop]
        _json['features'].append(feature)
    return _json


cols=pd.DataFrame(tt).columns

people_json=df_to_nested_json(df_update, cols)


In [27]:
people_json['features'][:2]

[{'candidate': {'first_name': 'Margaret',
   'last_name': 'Mcdonald',
   'skills': ['skLearn', 'Java', 'R', 'SQL', 'Spark', 'C++'],
   'state': 'AL',
   'specialty': 'Database',
   'experience': 'Mid',
   'relocation': 'no'},
  'HR_related': {'hire_date': [datetime.date(2018, 4, 10)],
   'Salary': [129784],
   'healthcare': ['yes']}},
 {'candidate': {'first_name': 'Michael',
   'last_name': 'Carter',
   'skills': ['TensorFlow', 'R', 'Spark', 'MongoDB', 'C++', 'SQL'],
   'state': 'AR',
   'specialty': 'Statistics',
   'experience': 'Senior',
   'relocation': 'yes'},
  'HR_related': {'hire_date': [datetime.date(2013, 9, 9)],
   'Salary': [176911],
   'healthcare': ['no']}}]
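The result still holds `datetime.date` objects, which the standard `json` module cannot serialize by default. A minimal sketch of one way to write such a structure out, assuming ISO-format strings are acceptable for dates (the `encode_dates` helper and the sample record are illustrative, not part of the notebook):

```python
import datetime
import json

record = {
    "features": [
        {
            "candidate": {"first_name": "Margaret"},
            "HR_related": {
                "hire_date": [datetime.date(2018, 4, 10)],
                "Salary": [129784],
            },
        }
    ]
}

def encode_dates(value):
    """Hypothetical helper: turn dates into ISO strings for json.dumps."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")

# json.dumps calls `default` only for objects it cannot handle itself
text = json.dumps(record, default=encode_dates)
restored = json.loads(text)
print(restored["features"][0]["HR_related"]["hire_date"])
```

After the round trip the date comes back as a plain string, so it would need to be parsed again if date arithmetic is required.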

# Another way to update our data: with `new entries`

+ `Non-Nested` entry update

In [28]:
def df_to_nested_json(df, candidate, hire='hire_date',
                      salary='Salary',healthcare='healthcare'):
    _json = {'features':[]}
    
    for _, row in df.iterrows():
        feature = {'candidate':{}}

        # Non-Nested Entries: stored directly under 'candidate'
        feature['candidate']['hire_date'] = [row[hire]]
        feature['candidate']['Salary'] = [row[salary]]
        feature['candidate']['healthcare'] = [row[healthcare]]

        for prop in candidate:
            feature['candidate'][prop] = row[prop]
        _json['features'].append(feature)
    return _json


cols=pd.DataFrame(tt).columns

people_json=df_to_nested_json(df_update, cols)


In [29]:
people_json['features'][:2]

[{'candidate': {'hire_date': [datetime.date(2018, 4, 10)],
   'Salary': [129784],
   'healthcare': ['yes'],
   'first_name': 'Margaret',
   'last_name': 'Mcdonald',
   'skills': ['skLearn', 'Java', 'R', 'SQL', 'Spark', 'C++'],
   'state': 'AL',
   'specialty': 'Database',
   'experience': 'Mid',
   'relocation': 'no'}},
 {'candidate': {'hire_date': [datetime.date(2013, 9, 9)],
   'Salary': [176911],
   'healthcare': ['no'],
   'first_name': 'Michael',
   'last_name': 'Carter',
   'skills': ['TensorFlow', 'R', 'Spark', 'MongoDB', 'C++', 'SQL'],
   'state': 'AR',
   'specialty': 'Statistics',
   'experience': 'Senior',
   'relocation': 'yes'}}]
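When every column should land directly under `candidate`, the row loop can be swapped for `DataFrame.to_dict(orient="records")`. This is an alternative technique, not the function used above; a sketch with a made-up frame (the `df_to_flat_json` name is hypothetical), and note that values are stored as-is rather than wrapped in lists:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "first_name": ["Margaret", "Michael"],
        "state": ["AL", "AR"],
        "Salary": [129784, 176911],
    }
)

def df_to_flat_json(frame):
    """Hypothetical helper: one {'candidate': {...}} entry per row."""
    return {
        "features": [
            {"candidate": rec} for rec in frame.to_dict(orient="records")
        ]
    }

result = df_to_flat_json(df)
print(result["features"][0])
```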

# Considerations: 
 When you time these operations, also consider memory usage. Speed and memory use often trade off against each other, and that tradeoff matters most when dealing with large data.
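For the memory side, the standard library's `tracemalloc` gives a rough picture of Python-level allocations. A minimal sketch, using a generated payload in place of `employee_data.json` (the payload size is an arbitrary choice):

```python
import json
import tracemalloc

# Build a made-up JSON string with the same outer shape as the notebook's data
payload = json.dumps(
    {"features": [{"candidate": {"id": i}} for i in range(10_000)]}
)

tracemalloc.start()
data = json.loads(payload)
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
tracemalloc.stop()

print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```

The figures cover only memory allocated through Python's allocator while tracing was on, so treat them as a relative measure rather than the process total.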
 

In [45]:
# Ex.) Compare the time to read the file with pd.read_json vs. with open + json.loads:

print('Notice the Differences:')
start = time.time()

pd.read_json("employee_data.json").head()
# elapsed time for pd.read_json
elapsed_time_lc=(time.time()-start)
print('Using only pd.read_json() :',elapsed_time_lc)
print(20*'-')
start = time.time()
with open('employee_data.json') as datafile:
    for line in datafile:
        data = json.loads(line)
    ww=pd.DataFrame(data)
ww.head()
# elapsed time for open + json.loads
elapsed_time_=(time.time()-start)
print('with open :',elapsed_time_)


Notice the Differences:
Using only pd.read_json() : 0.007957935333251953
--------------------
with open : 0.004159212112426758
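A single run of a few milliseconds is noisy (disk caching alone can swing the result), so repeated measurements give a steadier comparison. A sketch using the standard `timeit` module on a made-up in-memory payload, so no file access is involved; the payload size and repeat counts are arbitrary:

```python
import io
import json
import timeit

import pandas as pd

payload = json.dumps(
    {"features": [{"candidate": {"id": i}} for i in range(1_000)]}
)

def with_read_json():
    # pandas parses the JSON text and builds the frame in one step
    return pd.read_json(io.StringIO(payload))

def with_json_loads():
    # the json module parses first, then pandas builds the frame
    return pd.DataFrame(json.loads(payload))

# 5 rounds of 20 calls each; the fastest round is the least disturbed one
calls = 20
t_read_json = min(timeit.repeat(with_read_json, number=calls, repeat=5)) / calls
t_json_loads = min(timeit.repeat(with_json_loads, number=calls, repeat=5)) / calls

print(f"pd.read_json : {t_read_json:.6f} s per call")
print(f"json.loads   : {t_json_loads:.6f} s per call")
```

Which approach comes out ahead can depend on the pandas version and the data, so the numbers are best read as a comparison on one machine.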


# Citations:

# ◔̯◔


**JSON Related**:

https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/

https://stackoverflow.com/questions/40588852/pandas-read-nested-json

https://www.geeksforgeeks.org/nested-list-comprehensions-in-python/

**Memory Usage**: Future Tasks To Consider

https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba

https://towardsdatascience.com/memory-management-in-python-6bea0c8aecc9