### <center>Hello! Today I am going to demonstrate how `flatten_json` can be used to extract data from complex json structures</center>

#### <center><b>#1:</b> First, we need to load in the json and learn about its structure.</center>

### Import the required libraries
<hr>

In [33]:
import json, pandas as pd, pprint, graphviz, json2txttree, ipytree
from json2txttree import json2txttree
from flatten_json import flatten
from anytree import Node, RenderTree

### Load the data
<hr>

In [None]:
with open('/Users/brtelfer/Documents/Python_Data_Projects/Regex_Training/DataSets/large_dummy_data.json', 'r') as f:
  data = json.load(f)

### And using json2txttree, observe the structure of the json file in tree format
<hr>

Print json hierarchy as a tree. In notebooks, you can click on the <mark>view as a scrollable element</mark> to see the whole structure

This helps with visualizing complex json files

In [68]:
print(json2txttree(data))

└─  (array)
   └─  (object)
      ├─ "_id" (string)
      ├─ "statement" (object)
      │  ├─ "actor" (object)
      │  │  ├─ "account" (object)
      │  │  │  ├─ "homePage" (string)
      │  │  │  └─ "name" (string)
      │  │  ├─ "name" (string)
      │  │  ├─ "id" (string)
      │  │  ├─ "objectType" (number)
      │  │  └─ "mbox" (number)
      │  ├─ "verb" (object)
      │  │  ├─ "id" (string)
      │  │  └─ "display" (object)
      │  │     ├─ "en-US" (string)
      │  │     ├─ "de-DE" (string)
      │  │     ├─ "fr-FR" (string)
      │  │     └─ "es-ES" (string)
      │  ├─ "object" (object)
      │  │  ├─ "id" (string)
      │  │  ├─ "definition" (object)
      │  │  │  ├─ "name" (object)
      │  │  │  │  └─ "en-US" (string)
      │  │  │  ├─ "description" (object)
      │  │  │  │  └─ "en-US" (string)
      │  │  │  ├─ "type" (string)
      │  │  │  └─ "extensions" (number)
      │  │  └─ "objectType" (string)
      │  ├─ "context" (object)
      │  │  └─ "contextActivities" 

### <center><b>#2:</b> Next, we will use the `flatten_json` package in conjunction with `pandas` to load the data into python</center>
<hr></hr>

`flatten` can be passed straight to `pd.DataFrame`: flatten each record, and every nested key becomes a column named by its full path, with levels separated by `.`

In [69]:
df = pd.DataFrame((flatten(record, '.') for record in data))

### Assess the resulting dataframe
<hr></hr>

In [53]:
df.head()

Unnamed: 0,_id,statement.actor.account.homePage,statement.actor.account.name,statement.actor.name,statement.actor.id,statement.actor.objectType,statement.actor.mbox,statement.verb.id,statement.verb.display.en-US,statement.verb.display.de-DE,...,statement.object.definition.extensions.bXEY1E5WLG,statement.context.contextActivities.grouping.0.definition.extensions.yTbtUTVPie,statement.object.definition.extensions.fsQa37o4cK,statement.object.definition.extensions.rzcLgSYOzp,statement.context.contextActivities.grouping.0.definition.extensions.6Wj3DVYpPu,statement.object.definition.extensions.9hBFN0lGH7,statement.context.contextActivities.grouping.0.definition.extensions.4PuVNMKsQe,statement.result.extensions.jlhkcSzQM6,statement.object.definition.extensions.EOgJ4f9Rm0,statement.context.contextActivities.grouping.0.definition.extensions.rjxve9a8sm
0,p6GaHcnPHB,https://9qZdL2Qu89.com,2qAquWLlSw,L6igBbjHa2,OLUcQAg1LN,,,lG2RychGdw,Z3oisqGCEN,bKOJi25Ro8,...,,,,,,,,,,
1,dvAFvr4vo7,https://nzTBUvw9wh.com,LP86vkaT1W,RNJNFndINX,UR5Sz07K14,Group,,7iR5m4YtuN,0snI3o3RqB,2YU0Ah0Xth,...,,,,,,,,,,
2,4NsB21aZlA,https://ZmiZFwPWQN.com,Vydbp7RVow,XqWGJzNdNR,ASBXxy54sk,Group,,8sLR091mKK,s6debveCaX,MzW30R0BW5,...,,,,,,,,,,
3,8txsdHVgKf,https://dFtOa7gU03.com,E29FHlrP94,TDDJGfVLha,RKHILVWsdm,,,0gxnw6hd1q,7WBncdjbuc,pb2FGUpzEw,...,,,,,,,,,,
4,y2WhJMaUlu,https://pQF3t3iGoI.com,HY6xRHXo2k,pIAo4mRpo0,mcDVLRs8pX,,,HwO32gpkUy,ekeKrNTmno,,...,,,,,,,,,,


`flatten` transforms the json file into a single level of key-value pairs. Hence, all hierarchies are "flattened".
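To make the idea concrete, here is a minimal pure-Python sketch of this kind of flattening. `flatten_sketch` is a hypothetical helper written only for illustration; it is not the `flatten_json` implementation. It assumes list items are keyed by their position, which is consistent with column names such as `grouping.0.definition` in the output above. The sample record is made up.

```python
def flatten_sketch(obj, sep=".", prefix=""):
    """Illustrative only: collapse nested dicts/lists into one level of key-value pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # assumption: list items are keyed by position
    else:
        return {prefix: obj}  # a leaf value: the path so far becomes its key

    flat = {}
    for key, value in items:
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_sketch(value, sep, path))
    return flat


# Made-up record with two levels of nesting and one list
record = {
    "_id": "a1",
    "statement": {"actor": {"name": "x"}, "tags": ["p", "q"]},
}
flat = flatten_sketch(record)
print(flat)
```

Each key in `flat` is a full path, which is why the dataframe ends up with one column per distinct path found anywhere in the data.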

#### Let's look at the number of columns in our dataframe

In [54]:
len(df.columns)

1549

### Wow, that's a lot of columns!
- We need a surefire strategy for identifying which of these 1,549 are relevant to our query

First, let's identify which values we actually want to extract by looking at our raw json file.<br>
Opening the file with Notepad++ or VS Code will let you read the structure more easily.<br>

### <b>#3: </b>We will extract the following values from the database by first getting the column headers.
<hr>

-   `timestamp`
-   `session`
-   `ip`
-   `raw`
-   `max`
-   `homePage`
-   `useragent`
-   `mbox`

A good way to start is to find which column names contain these values somewhere in their path.

Filter the <b>columns</b> (`df.columns`) with the string accessor (`.str`), keeping those whose name contains one of the target values.

The `r''` prefix makes the pattern a raw string, so backslashes such as `\b` reach the <mark>Regex</mark> engine unchanged; `contains()` treats its pattern as a regex by default. Within regex, `|` means OR. `case=False` makes `contains()` case-insensitive.
<hr>

In [51]:
df.columns[df.columns.str.contains(r'timestamp|session|\bip\b|raw|max|homepage|useragent|mbox',case=False)]

Index(['statement.actor.account.homePage', 'statement.actor.mbox',
       'statement.result.score.raw', 'statement.result.score.max',
       'statement.timestamp', 'statement.authority.account.homePage',
       'meta.session', 'meta.useragent', 'meta.ip'],
      dtype='object')

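<b>ip</b> is wrapped in `\b...\b` to avoid matching things like .<mark>ip</mark>K0gYeMmu in the pathway list. `\b` is a regex word boundary: it matches the position between a word character (a letter, digit, or underscore) and a non-word character such as `.`, so `\bip\b` only matches `ip` when it stands alone as a word.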
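The effect of the word boundary can be checked with Python's built-in `re` module. This is a small sketch on made-up column names (the last one imitates a random extension key that happens to start with "ip"); it is not run against the real dataframe.

```python
import re

# Made-up column names for illustration
columns = [
    "meta.ip",
    "meta.session",
    "statement.timestamp",
    "statement.result.extensions.ipK0gYeMmu",
]

# Without boundaries, "ip" matches anywhere inside a name
loose = [c for c in columns if re.search(r"ip", c, flags=re.IGNORECASE)]

# With boundaries, "ip" must stand alone as a word
strict = [c for c in columns if re.search(r"\bip\b", c, flags=re.IGNORECASE)]

print(loose)
print(strict)
```

Here `loose` keeps both `meta.ip` and the extension key, while `strict` keeps only `meta.ip`.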

Now that we have the whole list, let's narrow down the homePage value, because there are two homePage columns:
-   `statement.actor.account.homePage`<br>
-   `statement.authority.account.homePage`<br>

Make sure to select the one you actually want! For this example, let's go with the second one.

### <center><b>#4:</b> Now we will use the columns index to create a copy dataframe with the requested information</center>
<hr></hr>

<b>1.</b> First, let's split the pathway headers by `'.'` and select the last element to get a clean list of the desired column headers.
-   This will allow us to create a dataframe with clear and concise column headers as opposed to long directory pathways

<b>2.</b> Our previous query of the current column headers will become our target pathways

<b>3.</b> And remember, because we only want the authority homepages, we will specify that path in the regex as `authority\.account\.homePage`. The dots are escaped because an unescaped `.` in a regex matches any character, not just a literal dot.

In [62]:
Target_Pathways = df.columns[df.columns.str.contains(r'timestamp|session|\bip\b|raw|max|authority\.account\.homePage|useragent|mbox',case=False)]
Target_Pathways

Index(['statement.actor.mbox', 'statement.result.score.raw',
       'statement.result.score.max', 'statement.timestamp',
       'statement.authority.account.homePage', 'meta.session',
       'meta.useragent', 'meta.ip'],
      dtype='object')
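When a path fragment containing dots is used inside a regex, keep in mind that `.` matches any single character unless it is escaped as `\.`. The sketch below uses Python's `re` module; the second candidate name is hypothetical and exists only to show the difference.

```python
import re

unescaped = r".authority"   # "." matches any single character
escaped = r"\.authority"    # "\." matches only a literal dot

candidates = [
    "statement.authority.account.homePage",
    "statement.xauthority.homePage",  # hypothetical name for illustration
]

unescaped_hits = [c for c in candidates if re.search(unescaped, c)]
escaped_hits = [c for c in candidates if re.search(escaped, c)]

print(unescaped_hits)
print(escaped_hits)
```

With this dataset both forms select the same column, but the escaped form is safer when column names are unpredictable.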

### Separate by `.`
<hr>

In [59]:
Target_Pathways.str.split('.')

Index([                   ['statement', 'actor', 'mbox'],
                 ['statement', 'result', 'score', 'raw'],
                 ['statement', 'result', 'score', 'max'],
                              ['statement', 'timestamp'],
       ['statement', 'authority', 'account', 'homePage'],
                                     ['meta', 'session'],
                                   ['meta', 'useragent'],
                                          ['meta', 'ip']],
      dtype='object')

### Select the last value of every list
<hr>

In [41]:
# Add .str[-1] to select the last element
Target_Pathways.str.split('.').str[-1]

Index(['mbox', 'raw', 'max', 'timestamp', 'homePage', 'session', 'useragent',
       'ip'],
      dtype='object')

### Remove index and dtype information
<hr>

In [60]:
# Add .to_list() to remove index and dtype information
Target_Columns = Target_Pathways.str.split('.').str[-1].to_list()
Target_Columns

['mbox', 'raw', 'max', 'timestamp', 'homePage', 'session', 'useragent', 'ip']
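Keeping only the last path segment can produce duplicate headers when two paths end in the same key, which is why only one of the two `homePage` columns was kept above. A quick way to check for this before building the new dataframe is sketched below in plain Python; the `paths` list is a shortened stand-in for `Target_Pathways`.

```python
from collections import Counter

# Shortened stand-in for the list of full column paths
paths = [
    "statement.actor.account.homePage",
    "statement.authority.account.homePage",
    "statement.result.score.raw",
    "meta.ip",
]

# Last segment of each path
leaf_names = [p.rsplit(".", 1)[-1] for p in paths]

# Any short name that appears more than once would collide as a column header
duplicates = sorted(name for name, count in Counter(leaf_names).items() if count > 1)

print(leaf_names)
print(duplicates)
```

If `duplicates` is not empty, either tighten the regex or keep more of the path in the header.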

### Create a for loop to assign our data to a new dataframe
<hr>

Because `Target_Pathways` holds the full column paths of the flattened dataframe, we can index it by position to get each header, and then use that header to select the column from `df`.

In this case, position 1 is 'statement.result.score.raw'

In [61]:
df[Target_Pathways[1]]

0      45.407172
1      16.452983
2      92.515854
3      11.469594
4       0.775795
         ...    
995    37.372111
996    52.315517
997    58.033325
998          NaN
999    51.368381
Name: statement.result.score.raw, Length: 1000, dtype: float64

### Create an empty list
<hr>

In [None]:
df_copy = []

### Initialize a value for numeric incrementals
<hr>

In [63]:
i = 0

### Convert the list to a pandas DataFrame
<hr>

In [64]:
df_copy = pd.DataFrame(df_copy)

### Create a for loop to perform two tasks:
- 1: Create a new column header
- 2: Copy the matching data from `df` into `df_copy`
<hr>

In [65]:
for x in Target_Columns:
    df_copy[x] = df[Target_Pathways[i]]
    i += 1
df_copy

| | mbox | raw | max | timestamp | homePage | session | useragent | ip |
|---|---|---|---|---|---|---|---|---|
| 0 | | 45.407172 | | 2021-10-04T01:10:02 | https://exYiscKArt.com | bW9byBC3NV | | 192.168.91.211 |
| 1 | | 16.452983 | 87.611519 | 2022-10-23T11:48:55 | https://05VJkFROri.com | | | 192.168.82.234 |
| 2 | | 92.515854 | 80.095236 | 2022-06-26T11:51:23 | https://fB2xRd8NYH.com | | Yc6q81bi0f | |
| 3 | | 11.469594 | | 2022-11-07T15:01:52 | https://grJ5Jwwo1N.com | d4I8SubqOR | | |
| 4 | | 0.775795 | 25.639499 | 2021-01-30T02:15:21 | https://qqXWVhZUeT.com | NWypH3szzl | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | | 37.372111 | | 2022-07-25T11:29:52 | https://BDzWWm6WPl.com | 1yHjETSXM4 | wWt6D3bf8W | 192.168.174.124 |
| 996 | | 52.315517 | | 2021-08-03T12:57:21 | https://6sZwnD2jAE.com | Ia1hGFXt0K | UhzGMCOSWl | |
| 997 | | 58.033325 | | 2022-11-22T03:33:03 | https://Ssj3Hznc9u.com | | | 192.168.157.67 |
| 998 | | | 69.737578 | 2021-05-03T13:52:28 | https://NOoQ1DXawN.com | | ywpnLvmtk6 | 192.168.232.104 |
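The same result can also be reached without a counter variable, by selecting the target columns in one step and renaming them. This is an alternative sketch, not the approach used above; `df_demo`, `target_pathways`, and `target_columns` are stand-in names with made-up values so the snippet runs on its own.

```python
import pandas as pd

# Tiny stand-in for the flattened dataframe (values are made up)
df_demo = pd.DataFrame({
    "statement.result.score.raw": [45.4, 16.5],
    "meta.ip": ["192.168.0.1", None],
    "statement.verb.id": ["a", "b"],
})

target_pathways = ["statement.result.score.raw", "meta.ip"]
target_columns = [p.split(".")[-1] for p in target_pathways]

# Select the wanted columns, then rename each full path to its short header
df_short = df_demo[target_pathways].rename(
    columns=dict(zip(target_pathways, target_columns))
)
print(df_short)
```

`zip` pairs each full path with its short name, so the mapping stays correct even if the order of the lists changes together.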


### Finally, let us consider an alternative to <b><i>flattening</i></b> json, which is called <b><i>normalizing</i></b> json
<hr>

In Python, this can be performed with `pd.json_normalize()`. Let us consider the following json dataset:

In [78]:
data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {
            "governor": "Rick Scott",
            "capital": "Tallahassee",
            "population": 21538187,
            "economy": {
                "GDP": 1196600000000,
                "major_industries": ["Tourism", "Agriculture", "Aerospace"]
            }
        },
        "major_cities": [
            {"name": "Miami", "population": 467912},
            {"name": "Orlando", "population": 285713}
        ],
        "counties": [
            {
                "name": "Dade",
                "population": 12345,
                "major_city": "Miami",
                "area": 2497,
                "economy": {
                    "major_industries": ["Tourism", "Finance"],
                    "unemployment_rate": 4.5
                }
            },
            {
                "name": "Broward",
                "population": 40000,
                "major_city": "Fort Lauderdale",
                "area": 1323,
                "economy": {
                    "major_industries": ["Healthcare", "Tourism"],
                    "unemployment_rate": 3.9
                }
            },
            {
                "name": "Palm Beach",
                "population": 60000,
                "major_city": "West Palm Beach",
                "area": 2386,
                "economy": {
                    "major_industries": ["Agriculture", "Tourism"],
                    "unemployment_rate": 4.2
                }
            }
        ]
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {
            "governor": "John Kasich",
            "capital": "Columbus",
            "population": 11799495,
            "economy": {
                "GDP": 727300000000,
                "major_industries": ["Manufacturing", "Healthcare", "Aerospace"]
            }
        },
        "major_cities": [
            {"name": "Columbus", "population": 905748},
            {"name": "Cleveland", "population": 372624}
        ],
        "counties": [
            {
                "name": "Summit",
                "population": 1234,
                "major_city": "Akron",
                "area": 419,
                "economy": {
                    "major_industries": ["Manufacturing", "Healthcare"],
                    "unemployment_rate": 5.1
                }
            },
            {
                "name": "Cuyahoga",
                "population": 1337,
                "major_city": "Cleveland",
                "area": 457,
                "economy": {
                    "major_industries": ["Healthcare", "Manufacturing"],
                    "unemployment_rate": 5.3
                }
            }
        ]
    }
]
print(f'Here is the structure of our json file: \n {json2txttree(data)}')

Here is the structure of our json file: 
 └─  (array)
   └─  (object)
      ├─ "state" (string)
      ├─ "shortname" (string)
      ├─ "info" (object)
      │  ├─ "governor" (string)
      │  ├─ "capital" (string)
      │  ├─ "population" (number)
      │  └─ "economy" (object)
      │     ├─ "GDP" (number)
      │     └─ "major_industries" (array)
      │        └─  (string)
      ├─ "major_cities" (array)
      │  └─  (object)
      │     ├─ "name" (string)
      │     └─ "population" (number)
      └─ "counties" (array)
         └─  (object)
            ├─ "name" (string)
            ├─ "population" (number)
            ├─ "major_city" (string)
            ├─ "area" (number)
            └─ "economy" (object)
               ├─ "major_industries" (array)
               │  └─  (string)
               └─ "unemployment_rate" (number)



`pd.json_normalize()` extracts specific information by querying exact pathways in the json file.

This example uses three of its parameters: `data`, the json to read (a dict or a list of dicts); `record_path`, the path in each object to the list of records; and `meta`, the fields to attach to each record as metadata.
<hr>

Let's extract the following information:
-   `state`
-   `shortname`
-   `counties`
-   `governor`
-   `area`
-   `unemployment_rate`

In [None]:
pd.json_normalize(
    data,
    record_path=['counties'],
    meta=['state', 'shortname', ['info', 'governor']],
    meta_prefix='meta_'
)

| | name | population | major_city | area | economy.major_industries | economy.unemployment_rate | meta_state | meta_shortname | meta_info.governor |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dade | 12345 | Miami | 2497 | [Tourism, Finance] | 4.5 | Florida | FL | Rick Scott |
| 1 | Broward | 40000 | Fort Lauderdale | 1323 | [Healthcare, Tourism] | 3.9 | Florida | FL | Rick Scott |
| 2 | Palm Beach | 60000 | West Palm Beach | 2386 | [Agriculture, Tourism] | 4.2 | Florida | FL | Rick Scott |
| 3 | Summit | 1234 | Akron | 419 | [Manufacturing, Healthcare] | 5.1 | Ohio | OH | John Kasich |
| 4 | Cuyahoga | 1337 | Cleveland | 457 | [Healthcare, Manufacturing] | 5.3 | Ohio | OH | John Kasich |


`record_path` is basically performing a `flatten` on everything within `counties`, one row per county, to get all of the paths below. Unlike `flatten`, lists such as `major_industries` are kept as lists in a single column:
-  `counties.name`
-  `counties.population`
-  `counties.major_city`
-  `counties.area`
-  `counties.economy`
-  `counties.economy.major_industries`
-  `counties.economy.unemployment_rate`
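The row-building idea behind `record_path` and `meta` can be sketched in plain Python. `rows_from_records` is a hypothetical helper for illustration only, not how pandas implements `json_normalize`; it assumes a single-key record path and does not expand nested objects inside each record.

```python
def rows_from_records(data, record_key, meta_paths):
    """Illustrative only: one output row per item in parent[record_key]."""
    rows = []
    for parent in data:
        meta = {}
        for path in meta_paths:
            value = parent
            for key in path:  # walk down the path, e.g. ["info", "governor"]
                value = value[key]
            meta[".".join(path)] = value
        for child in parent[record_key]:
            rows.append({**child, **meta})  # child fields first, then parent metadata
    return rows


# Trimmed copy of the Ohio entry from the dataset above
states = [
    {
        "state": "Ohio",
        "info": {"governor": "John Kasich"},
        "counties": [{"name": "Summit", "area": 419}, {"name": "Cuyahoga", "area": 457}],
    }
]
rows = rows_from_records(states, "counties", [["state"], ["info", "governor"]])
print(rows)
```

Each county becomes its own row, and the parent's `state` and `info.governor` values are repeated on every row that came from that parent.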

`meta` selects fields from the parent object that contains `counties`. A plain string in the outer `[]` names a first-level key, while a nested list `[[]]` spells out the path to a key on a deeper level.

`['Level_1_Key', ['Level_1_Key', 'Level_2_Key']]`

So `meta = [['info', 'governor']]` basically means the path `info.governor`

This method is useful when you do not want to flatten everything at once, which gets messy with large datasets that have similar key and value names and long json pathways.
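The same pattern works for the other list in the dataset, `major_cities`. The sketch below uses a trimmed copy of the Florida entry and leaves out `meta_prefix`, so the metadata columns keep their plain names; the variable names `states` and `cities` are chosen here for illustration.

```python
import pandas as pd

# Trimmed copy of the Florida entry from the dataset above
states = [
    {
        "state": "Florida",
        "info": {"governor": "Rick Scott"},
        "major_cities": [
            {"name": "Miami", "population": 467912},
            {"name": "Orlando", "population": 285713},
        ],
    }
]

cities = pd.json_normalize(
    states,
    record_path=["major_cities"],          # one row per city
    meta=["state", ["info", "governor"]],  # top-level key and a nested path
)
print(cities)
```

The nested meta path `['info', 'governor']` shows up as the column `info.governor`, next to the city fields `name` and `population`.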