# Notebook containing scripts to explore and preprocess .har files

## Output format

```json
{
  "categories": [
    {
      "cat_name": "Search",
      "sites": [
        <site1>,
        <site2>,
        ...
        <average site>
      ]
    },
    {
      "cat_name": "Video",
      "sites": [
        <site1>,
        <site2>,
        ...
        <average site>
      ]
    }
  ]
}
```
This is the final format of our data. Each `<site>` placeholder holds the information for one website, in the following form:


```json
{
  "total": 3085729,
  "session_length": 0.196933333333277,
  "data": {
    "text": {
      "html": {
        "val": 756083,
        "prop": 245
      },
      "javascript": {
        "val": 2183083,
        "prop": 707
      }
    },
    "application": {
      "json": {
        "val": 50103,
        "prop": 16
      }
    },
    "image": {
      "png": {
        "val": 66459,
        "prop": 21
      }
    },
    "others": {
      "val": 30001,
      "prop": 11,
      "types": "text/plain, application/javascript, application/binary, image/webp, image/x-icon, font/woff2, x-unknown"
    }
  },
  "total_proportion": 1000,
  "website": "Google"
}
```

The average website of a category has the same form, but without sub-categories:

```json
{
  "website": "Average Search website",
  "data": {
    "text": {
      "val": 2939166,
      "prop": 952
    },
    "application": {
      "val": 50103,
      "prop": 16
    },
    "image": {
      "val": 66459,
      "prop": 21
    },
    "others": {
      "val": 30001,
      "prop": 11
    }
  },
  "total": 3085729,
  "total_proportion": 1000
}
```
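
As a quick sanity check on this format, the leaf `prop` values of a site should add up to `total_proportion`. The sketch below is illustrative only: `leaf_props` is a hypothetical helper that is not part of the notebook, and the sample dict is copied from the Google example above.

```python
def leaf_props(node):
    """Yield the 'prop' of every leaf entry under a site's 'data' mapping."""
    if "prop" in node:
        # Entry without sub-types (e.g. 'others'), or a group carrying its own total.
        yield node["prop"]
    else:
        for child in node.values():
            if isinstance(child, dict):
                yield from leaf_props(child)

# Sample copied from the Google example above ('val' fields omitted for brevity).
site = {
    "total_proportion": 1000,
    "data": {
        "text": {"html": {"prop": 245}, "javascript": {"prop": 707}},
        "application": {"json": {"prop": 16}},
        "image": {"png": {"prop": 21}},
        "others": {"prop": 11},
    },
}

# 245 + 707 + 16 + 21 + 11 == 1000
print(sum(leaf_props(site["data"])) == site["total_proportion"])
```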

# Exploration of .har files

### Loading

In [1]:
import json

In [34]:
with open("raw_data/Video/YouTube.har", 'r') as f:
    data = json.load(f)

### Displaying keys and values on the resulting python dict recursively 

In [2]:
def print_keys(d, indent='\n'):
    if type(d) == dict:
        for k in d.keys():
            print(indent, k, end='')
            print_keys(d[k], (indent + ' |  '))
    elif type(d) in (str,list):
        print(' <', str(type(d)).split("'")[1], '> (len: ', len(d), end=')', sep='')
    else:
        print(' <', str(type(d)).split("'")[1], '>', end='', sep='')
        
def print_values(d, indent='\n'):
    if type(d) == dict:
        for k in d.keys():
            print(indent, k, end='')
            print_values(d[k], (indent + ' |  '))
    else:
        print(':', d, end='')

### - Log (root)

In [36]:
print_keys(data)


 log
 |   version <str> (len: 3)
 |   creator
 |   |   name <str> (len: 12)
 |   |   version <str> (len: 6)
 |   pages <list> (len: 1)
 |   entries <list> (len: 343)

In [37]:
print(data['log']['creator'])

{'name': 'WebInspector', 'version': '537.36'}


In [38]:
print(data['log']['pages'])

[{'startedDateTime': '2019-12-14T14:51:43.280Z', 'id': 'page_1', 'title': 'https://www.youtube.com/', 'pageTimings': {'onContentLoad': 2485.2490000002945, 'onLoad': 3324.8200000002726}}]


### - Element of log.entries

In [39]:
print_keys(data['log']['entries'][0])


 startedDateTime <str> (len: 24)
 time <float>
 request
 |   method <str> (len: 3)
 |   url <str> (len: 24)
 |   httpVersion <str> (len: 8)
 |   headers <list> (len: 13)
 |   queryString <list> (len: 0)
 |   cookies <list> (len: 1)
 |   headersSize <int>
 |   bodySize <int>
 response
 |   status <int>
 |   statusText <str> (len: 0)
 |   httpVersion <str> (len: 8)
 |   headers <list> (len: 18)
 |   cookies <list> (len: 5)
 |   content
 |   |   size <int>
 |   |   mimeType <str> (len: 9)
 |   |   text <str> (len: 528804)
 |   |   encoding <str> (len: 6)
 |   redirectURL <str> (len: 0)
 |   headersSize <int>
 |   bodySize <int>
 |   _transferSize <int>
 cache
 timings
 |   blocked <float>
 |   dns <int>
 |   ssl <int>
 |   connect <int>
 |   send <float>
 |   wait <float>
 |   receive <float>
 |   _blocked_queueing <float>
 serverIPAddress <str> (len: 14)
 _initiator
 |   type <str> (len: 5)
 _priority <str> (len: 8)
 _resourceType <str> (len: 8)
 connection <str> (len: 2)
 pageref <str>

### - Entries types

In [30]:
entries_types = []
min_size = 1e7
max_size = 0
for e in data['log']['entries']:
    if not e['response']['content']['mimeType'] in entries_types:
        entries_types.append(e['response']['content']['mimeType'])
        
    min_size = min(min_size, e['response']['content']['size'])
    max_size = max(max_size, e['response']['content']['size'])
    
print(entries_types)
print('min size:',min_size)
print('max size:',max_size)

['text/html', 'text/plain', 'text/javascript', 'text/css', 'image/jpeg', 'font/woff2', 'application/json', 'image/webp', 'image/gif', 'application/javascript', 'application/x-www-form-urlencoded', 'video/webm', 'audio/webm', 'image/x-icon', 'image/png', 'video/x-flv', 'text/xml', 'x-unknown']
min size: 0
max size: 2909592


#### Some tests

In [31]:
for i in range(100,104):
    print(i, ':')
    print('content type:', data['log']['entries'][i]['response']['content']['mimeType'])
    print('content size:', data['log']['entries'][i]['response']['content']['size'])
    print('head size:', data['log']['entries'][i]['response']['headersSize'])
    print('body size:', data['log']['entries'][i]['response']['bodySize'])
    print('transfer size:', data['log']['entries'][i]['response']['_transferSize'])

100 :
content type: text/html
content size: 0
head size: -1
body size: -1
transfer size: 61
101 :
content type: video/x-flv
content size: 0
head size: -1
body size: -1
transfer size: 75
102 :
content type: text/html
content size: 0
head size: -1
body size: -1
transfer size: 66
103 :
content type: audio/webm
content size: 65536
head size: 1010
body size: 65536
transfer size: 66546


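=> In these samples the content, header and body sizes are often missing (0 or -1), while the transfer size is always set, so it is probably best to use the transfer size.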
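
The observation above can be checked over a whole file rather than four entries. This is a minimal sketch, assuming every entry has a `response.content` object; `summarize_size_fields` is a hypothetical helper, and `_transferSize` is a custom (underscore-prefixed) field that may be absent from HAR files produced by other tools. The sample reuses the four entries printed above.

```python
def summarize_size_fields(entries):
    """Count how often each size field is usable across HAR entries."""
    summary = {"entries": 0, "missing_body_size": 0,
               "empty_content": 0, "has_transfer_size": 0}
    for e in entries:
        resp = e["response"]
        summary["entries"] += 1
        if resp.get("bodySize", -1) < 0:          # -1 means "not available"
            summary["missing_body_size"] += 1
        if resp["content"].get("size", 0) == 0:
            summary["empty_content"] += 1
        if resp.get("_transferSize", 0) > 0:      # custom field, may be missing
            summary["has_transfer_size"] += 1
    return summary

# The four entries (indices 100 to 103) shown in the output above.
sample = [
    {"response": {"content": {"size": 0}, "bodySize": -1, "_transferSize": 61}},
    {"response": {"content": {"size": 0}, "bodySize": -1, "_transferSize": 75}},
    {"response": {"content": {"size": 0}, "bodySize": -1, "_transferSize": 66}},
    {"response": {"content": {"size": 65536}, "bodySize": 65536, "_transferSize": 66546}},
]
print(summarize_size_fields(sample))
```

On the real data this would be called as `summarize_size_fields(data['log']['entries'])`.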

## Retrieving useful data

In [3]:
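def parse_time(s):
    # Convert the time-of-day part of a timestamp such as
    # '2019-12-14T14:51:43.280Z' to minutes: HH*60 + MM + SS/60.
    # The date is ignored, so a session crossing midnight gives a negative length.
    t = s.split(':')
    return float(t[0][-2:])*60 + float(t[1]) + float(t[2][:-1])/60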

def retrieve_data(data):
    out = {
        'total': 0,
        'session_length': (parse_time(data['log']['entries'][-1]['startedDateTime'])
                           - parse_time(data['log']['entries'][0]['startedDateTime']))
    }

    d = {}
    
    for e in data['log']['entries']:
        t = e['response']['content']['mimeType'].split('/') # data type (list)
        size = e['response']['_transferSize']

        if not t[0] in d.keys():
            if len(t) > 1:
                d[t[0]] = {}
                d[t[0]][t[1]] = size
            else:
                d[t[0]] = size

        elif len(t) > 1:
            if not t[1] in d[t[0]].keys():
                d[t[0]][t[1]] = size
            else:
                d[t[0]][t[1]] += size            

        else:
            d[t[0]] += size

        out['total'] += size
        
        #print(e['startedDateTime'])
    out['data'] = d
    return out

with open("raw_data/Search/Google.har", 'r') as f:
    data = json.load(f)
print_values(retrieve_data(data))


 total: 3832655
 session_length: 0.1433833333334178
 data
 |   text
 |   |   html: 714480
 |   |   javascript: 2133736
 |   |   css: 137636
 |   |   plain: 11632
 |   image
 |   |   png: 376889
 |   |   webp: 37735
 |   |   x-icon: 4125
 |   |   svg+xml: 4407
 |   |   jpeg: 319958
 |   |   gif: 22691
 |   |   vnd.microsoft.icon: 0
 |   x-unknown: 0
 |   font
 |   |   woff2: 39531
 |   application
 |   |   javascript: 1175
 |   |   json: 28660

## Preprocessing 

### _Calculating per-minute ratios_

In [4]:
def dict_to_list(dic):
    liste = []
    for k in dic.keys():
        ob = dic[k]
        ob['type'] = k
        liste.append(ob)
    return liste

def preprocess(js, total_prop = 1000, remove_threshold=10):
    factor = js['session_length']
    total = js['total']
    js['total'] = int(js['total']/js['session_length'])
    js['total_proportion'] = total_prop
    
    to_remove = []
    
    s_val = 0
    s_prop = 0
    
    d = js['data']
    
    for k in d.keys():
        
        if type(d[k]) == dict:
            dict_val = 0
            dict_prop = 0
            for k2 in d[k].keys():
                val = int(d[k][k2]/factor)
                prop = int(d[k][k2]/total*total_prop)
                d[k][k2] = {'val': val, 'prop': prop}
                s_val += val
                s_prop += prop
                dict_val += val
                dict_prop += prop
            d[k]['val'] = dict_val
            d[k]['prop'] = dict_prop
        
        else:
            val = int(d[k]/factor)
            prop = int(d[k]/total*total_prop)
            d[k] = {'val': val, 'prop': prop}
            s_val += val
            s_prop += prop
            
        if remove_threshold > 0:
            if 'prop' in d[k].keys():
                if d[k]['prop'] < remove_threshold:
                    to_remove.append(k)
            else:
                for k2 in d[k].keys():
                    if 'prop' in d[k][k2].keys():
                        if d[k][k2]['prop'] < remove_threshold:
                            to_remove.append((k + '/' + k2))
    
    string = ''
    r_prop = 0
    r_val = 0
    for cat in to_remove:
        string+=cat+', '
        keys = cat.split('/')
        
        if len(keys) > 1:
            r_prop += d[keys[0]][keys[1]]['prop']  # very ugly, but easier
            r_val += d[keys[0]][keys[1]]['val']
            del d[keys[0]][keys[1]]
            if len(d[keys[0]].keys()) == 0:
                del d[keys[0]]
        else:
            r_prop += d[keys[0]]['prop']  # very ugly, but easier
            r_val += d[keys[0]]['val']
            del d[keys[0]]
    string = string[:-2]
    
    js['data']['others'] = {
        'val': js['total'] - s_val + r_val, 
        'prop': total_prop - s_prop + r_prop,
        'types': string.split(', ')
    }
    
    js['array_data'] = dict_to_list(js['data'])
    
#     print('sum_val:',s_val)
#     print('tot_val:', js['total'])
#     print('sum_prop:',s_prop)
#     print('tot_prop:', total_prop)
    
    return js
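
Because every proportion is truncated to an integer, the proportions rarely add up to `total_prop`, which is why `preprocess` puts the remainder in `others`. The sketch below reproduces only that arithmetic on the raw Google sizes printed earlier. It is a simplified illustration: `proportions` is a hypothetical helper, the nested dict is flattened to full MIME types, and integer floor division stands in for the notebook's `int(x/total*total_prop)`.

```python
def proportions(sizes, total_prop=1000):
    """Truncated per-type proportions, plus the part lost to truncation."""
    total = sum(sizes.values())
    props = {k: v * total_prop // total for k, v in sizes.items()}
    leftover = total_prop - sum(props.values())
    return props, leftover

# Raw transfer sizes from the Google.har output above (they sum to 3832655).
sizes = {
    "text/html": 714480, "text/javascript": 2133736, "text/css": 137636,
    "text/plain": 11632, "image/png": 376889, "image/webp": 37735,
    "image/x-icon": 4125, "image/svg+xml": 4407, "image/jpeg": 319958,
    "image/gif": 22691, "image/vnd.microsoft.icon": 0, "x-unknown": 0,
    "font/woff2": 39531, "application/javascript": 1175,
    "application/json": 28660,
}

props, leftover = proportions(sizes)
print(props["text/html"], props["text/javascript"], leftover)  # 186 556 6
```

The 6 units lost to truncation are what ends up in the `prop` of `others`, on top of the types removed by `remove_threshold`.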

In [5]:
with open("raw_data/Search/Google.har", 'r') as f:
    data = json.load(f)
print_values(preprocess(retrieve_data(data)))


 total: 26730129
 session_length: 0.1433833333334178
 data
 |   text
 |   |   html
 |   |   |   val: 4983005
 |   |   |   prop: 186
 |   |   javascript
 |   |   |   val: 14881339
 |   |   |   prop: 556
 |   |   css
 |   |   |   val: 959916
 |   |   |   prop: 35
 |   |   plain
 |   |   |   val: 81125
 |   |   |   prop: 3
 |   |   val: 20905385
 |   |   prop: 780
 |   |   type: text
 |   image
 |   |   png
 |   |   |   val: 2628541
 |   |   |   prop: 98
 |   |   webp
 |   |   |   val: 263175
 |   |   |   prop: 9
 |   |   x-icon
 |   |   |   val: 28769
 |   |   |   prop: 1
 |   |   svg+xml
 |   |   |   val: 30735
 |   |   |   prop: 1
 |   |   jpeg
 |   |   |   val: 2231486
 |   |   |   prop: 83
 |   |   gif
 |   |   |   val: 158254
 |   |   |   prop: 5
 |   |   vnd.microsoft.icon
 |   |   |   val: 0
 |   |   |   prop: 0
 |   |   val: 5340960
 |   |   prop: 197
 |   |   type: image
 |   font
 |   |   woff2
 |   |   |   val: 275701
 |   |   |   prop: 10
 |   |   val: 275701
 |   |   prop: 

## Average sites

In [6]:
def compute_average_site(sites, cat_name):
    avg = {
        'website': ('Average ' + cat_name + ' website'),
        'data': {}
    }
    
    tot_val = 0
    
    # get main datatypes
    datatypes = []
    for site in sites:
        for k in site["data"].keys():
            if k not in datatypes:
                datatypes.append(k)
            
    # computing and saving average
    for t in datatypes:
        s_val = 0
        for site in sites:
            data = site['data'] 
            
            if t in data.keys():
                # test sub categories
                if 'val' not in data[t].keys():
                    for k in data[t].keys():
                        s_val += data[t][k]['val']
                else:
                    s_val += data[t]['val']
                    
        val = int(s_val/len(sites))
        
        tot_val += val
        
        avg['data'][t] = {'val': val }
        
    # computing prop:
    tot_prop = 0
    for t in avg['data'].keys():
        prop = int(avg['data'][t]['val'] * 1000 / tot_val)
        tot_prop += prop
        avg['data'][t]['prop'] = prop
    # adding the leftovers in 'others'
    if 'others' in avg['data'].keys():
        avg['data']['others']['prop'] += 1000 - tot_prop
    else:
        avg['data']['others'] = {'val': 0, 'prop': 1000 - tot_prop}
        
    avg['total'] = tot_val
    avg['total_proportion'] = 1000  # TODO: make this a global variable (also used twice above)
    
    
    return avg 
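
To make the averaging step easier to follow, here is a condensed sketch of the same idea on toy data. It is not the notebook's function: `average_top_level` is a hypothetical helper, it assumes every top-level entry already carries a `val`, and it assumes the averaged total is not zero.

```python
def average_top_level(sites, total_prop=1000):
    """Average each top-level type over all sites, then derive proportions."""
    # Collect the top-level data types, keeping first-seen order.
    types = []
    for site in sites:
        for t in site["data"]:
            if t not in types:
                types.append(t)

    # A type missing from a site counts as 0 for that site.
    n = len(sites)
    avg = {t: {"val": sum(s["data"].get(t, {}).get("val", 0) for s in sites) // n}
           for t in types}

    total = sum(entry["val"] for entry in avg.values())
    for entry in avg.values():
        entry["prop"] = entry["val"] * total_prop // total

    # Whatever truncation lost goes to 'others', created if needed.
    leftover = total_prop - sum(entry["prop"] for entry in avg.values())
    others = avg.setdefault("others", {"val": 0, "prop": 0})
    others["prop"] += leftover
    return avg, total

# Two made-up sites.
toy_sites = [
    {"data": {"text": {"val": 100}, "image": {"val": 200}}},
    {"data": {"text": {"val": 300}}},
]
avg, total = average_top_level(toy_sites)
print(avg, total)
```

Here `text` averages to 200 and `image` to 100, so the proportions are 666 and 333, and the remaining 1 is assigned to `others`.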

## Final script

In [7]:
from os import listdir
from os.path import isfile, join, isdir
#onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
path = 'raw_data'

In [8]:
categories = [d for d in listdir(path) if isdir(join(path, d))]

out = {
    'categories': []
}

for c in categories:
    cat_dir = join(path,c)
    sites = []
    files = [f for f in listdir(cat_dir) if isfile(join(cat_dir, f))]
    
    for filename in files:
        with open(join(cat_dir,filename), 'r') as f:
            try:
                data = json.load(f)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                print('/ ! \\ Error in file', filename)
            else:
                data = preprocess(retrieve_data(data))
                data['website'] = filename.split('/')[-1][:-4]
                sites.append(data)
    
    d = {
        'cat_name': c,
        'sites': sites,
        'average': compute_average_site(sites, c)
    }
    
    out['categories'].append(d)


## Saving

In [9]:
with open('data2.json', 'w') as f:
    json.dump(out, f)

TODO:
- possibly change the approach (add a name attribute, and a boolean indicating whether there are sub-categories)
- better formatting of `others`?
- overall average site
- case where a prop in the average sites is below the threshold

In [32]:
for c in out['categories']:
    print('---', c['cat_name'])
    for s in c['sites']:
        print(' - ')
        print_values(s)
    print(' - ')
    print_values(c['average'])

--- Search
 - 

 total: 26730129
 session_length: 0.1433833333334178
 data
 |   text
 |   |   html
 |   |   |   val: 4983005
 |   |   |   prop: 186
 |   |   javascript
 |   |   |   val: 14881339
 |   |   |   prop: 556
 |   |   css
 |   |   |   val: 959916
 |   |   |   prop: 35
 |   |   plain
 |   |   |   val: 81125
 |   |   |   prop: 3
 |   |   val: 20905385
 |   |   prop: 780
 |   image
 |   |   png
 |   |   |   val: 2628541
 |   |   |   prop: 98
 |   |   webp
 |   |   |   val: 263175
 |   |   |   prop: 9
 |   |   x-icon
 |   |   |   val: 28769
 |   |   |   prop: 1
 |   |   svg+xml
 |   |   |   val: 30735
 |   |   |   prop: 1
 |   |   jpeg
 |   |   |   val: 2231486
 |   |   |   prop: 83
 |   |   gif
 |   |   |   val: 158254
 |   |   |   prop: 5
 |   |   vnd.microsoft.icon
 |   |   |   val: 0
 |   |   |   prop: 0
 |   |   val: 5340960
 |   |   prop: 197
 |   font
 |   |   woff2
 |   |   |   val: 275701
 |   |   |   prop: 10
 |   |   val: 275701
 |   |   prop: 10
 |   others
 |   |   va