### Problem Description

In 2012, URL shortening service Bitly partnered with the US government website USA.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil.

The text files are in JSON format. The table below describes the keys that matter most for this task.

| key | description |
|-----|-------------|
| a | Information about the web browser and operating system |
| tz | Time zone |
| r | URL the user came from |
| u | URL the user went to |
| t | Timestamp (UNIX format) of when the user started using the website |
| hc | Timestamp (UNIX format) of when the user exited the website |
| cy | City from which the request was initiated |
| ll | Latitude and longitude |
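To make the key table concrete, here is a minimal sketch of reading one record and keeping only these keys. It assumes that each line of a file holds one JSON object, which is how the helper code later in this notebook reads the files. The sample record and the helper name `pick_keys` are invented for illustration and are not taken from the data set.

```python
import json

# Hypothetical single-line record; every value here is made up.
sample_line = (
    '{"a": "Mozilla/5.0 (Windows NT 6.1) Example/1.0", "tz": "America/New_York", '
    '"r": "http://www.example.com/page", "u": "http://www.example.gov/info", '
    '"t": 1331923247, "hc": 1331822918, "cy": "Exampleville", "ll": [40.0, -75.0]}'
)

KEYS = ["a", "tz", "r", "u", "t", "hc", "cy", "ll"]


def pick_keys(line, keys=KEYS):
    """Parse one JSON line and keep only the keys of interest (None if absent)."""
    record = json.loads(line)
    return {key: record.get(key) for key in keys}


row = pick_keys(sample_line)
print(row["tz"], row["ll"])
```

Using `record.get(key)` rather than `record[key]` means an incomplete record yields `None` for the missing keys instead of raising an error.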

### Script Details

All CSV files must have the following columns:

- web_browser

    The web browser that requested the service.

- operating_sys

    The operating system that initiated the request.

- from_url

    The main URL the user came from.

    **Note**: if the retrieved URL is in a long format such as `http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf`, write it to the file in a short format like `www.facebook.com`.

- to_url

    The URL the user went to, shortened in the same way as `from_url`.

- city

    The city from which the request was sent.

- longitude

    The longitude from which the request was sent.

- latitude

    The latitude from which the request was sent.

- time_zone

    The time zone of the city.

- time_in

    The time when the request started.

- time_out

    The time when the request ended.
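The URL shortening required for `from_url` and `to_url` can be sketched with the standard library's `urllib.parse` instead of a hand-written regular expression. This is only one possible approach; the function name `short_host` is an assumption made for this example.

```python
from urllib.parse import urlparse


def short_host(url):
    """Return only the host part of a URL, or None when there is no host."""
    if not isinstance(url, str):
        return None  # e.g. NaN coming from an incomplete record
    return urlparse(url).hostname


print(short_host("http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf"))
# www.facebook.com
```

With pandas, such a function could be applied to a column with `df["r"].map(short_host)`. Values that are not real URLs produce `None`, which can then be handled together with the other missing values.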
        
        
**NOTE**:

Because some records in the files are incomplete, you may encounter NaN values during the transformation. Make sure that the final DataFrames contain no NaNs at all.
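One way to meet this requirement is to treat blank strings as missing and then drop every incomplete row. The sketch below uses a tiny made-up DataFrame; whether dropping rows (rather than filling in values) is the right policy is an assumption, since the task only says that no NaNs may remain.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one with a missing city, one with a blank city.
frame = pd.DataFrame(
    {
        "city": ["Springfield", None, "   "],
        "time_zone": ["America/Chicago", "America/Chicago", "America/Chicago"],
    }
)

# Turn empty or whitespace-only strings into NaN, then drop incomplete rows.
cleaned = frame.replace(r"^\s*$", np.nan, regex=True).dropna(axis=0)

print(len(frame), "->", len(cleaned))
# 3 -> 1
```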

The script itself must do the following before and after the transformation:

- Accept one positional argument: the path of the directory that contains the files.

- Accept one optional argument **-u**. If it is passed, the timestamps keep their UNIX format; if it is not passed, the timestamps are converted to human-readable date-times.

- Check whether any of the files are duplicates of each other (by **checksum**) and print a message that says so.

- Print a message after converting each file, with the number of rows transformed and the path of the file.

- At the end of the script, print the total execution time.
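The command-line and timing requirements above could be sketched as follows, using only the standard library. The argument names `dir` and `-u/--unix` follow the commented-out parser further down in this notebook; the function names, the use of `time.perf_counter`, and the choice of UTC for converted timestamps are assumptions made for this example.

```python
import argparse
import time
from datetime import datetime, timezone


def build_parser():
    parser = argparse.ArgumentParser(description="Convert click-data JSON files to CSV.")
    parser.add_argument("dir", help="directory that contains the JSON files")
    parser.add_argument("-u", "--unix", action="store_true",
                        help="keep timestamps in UNIX format")
    return parser


def format_time(unix_seconds, keep_unix):
    """Return the raw value when keep_unix is set, otherwise a UTC datetime."""
    if keep_unix:
        return unix_seconds
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def main(argv=None):
    start = time.perf_counter()
    args = build_parser().parse_args(argv)
    # ... the duplicate check and the per-file transformation would go here ...
    elapsed = time.perf_counter() - start
    print(f"Total execution time: {elapsed:.3f} seconds")
    return args


# Example call with an explicit argument list ("data" is a hypothetical directory).
args = main(["data", "-u"])
print(args.dir, args.unix)
```

In a real script, `main()` would be called without arguments so that `argparse` reads them from the command line.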
    

In the cells below, I have provided some helper code to make the task clearer.

**HINT**: This code may not help with your task at all.

## Required

Write a script that transforms the JSON files into DataFrames and writes each one to a separate CSV file in the target directory, subject to the requirements described above.

        

In [1]:
import os
import pandas as pd 
from pandas import json_normalize
import json
from subprocess import run, PIPE, Popen
import subprocess
import argparse
import fnmatch
import numpy as np
import time
import re
from os import listdir
from os.path import isfile, join

In [2]:
path = os.getcwd()
print("path to search for json files ==> " , path)
files=[]
for file in listdir(path):
    if file.endswith('.json'):
        files.append(file)
print("files detected :\n", files)

path to search for json files ==>  /mnt/d/ITI/python_for_data_management/Task_2
files detected :
 ['usa.gov_click_data_1.json', 'usa.gov_click_data_2.json', 'usa.gov_click_data_3.json']


In [3]:
checksums = {}
duplicates = []

for filename in files:
    # Use Popen to call the md5sum utility
    with Popen(["md5sum", filename], stdout=PIPE) as proc:
        checksum, _ = proc.stdout.read().split()
        
        # Append duplicate to a list if the checksum is found
        if checksum in checksums:
            duplicates.append(filename)
        checksums[checksum] = filename

print(f"Found Duplicates: {duplicates}")

Found Duplicates: ['usa.gov_click_data_3.json']
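The cell above depends on the external `md5sum` utility. As a sketch of the same idea without that dependency, the checksum can be computed with the standard library's `hashlib`. SHA-256 is used here instead of MD5, which is a swap made for this example; any digest works for spotting identical files. The function names and the temporary demo files are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return the paths whose content matches an earlier path in the list."""
    seen = {}
    duplicates = []
    for path in paths:
        checksum = file_checksum(path)
        if checksum in seen:
            duplicates.append(path)
        else:
            seen[checksum] = path
    return duplicates


# Self-contained demonstration with three temporary files, two of them identical.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.json": '{"x": 1}\n', "b.json": '{"x": 2}\n', "c.json": '{"x": 1}\n'}
    paths = []
    for name, text in contents.items():
        full_path = os.path.join(tmp, name)
        with open(full_path, "w") as handle:
            handle.write(text)
        paths.append(full_path)
    demo_duplicates = [os.path.basename(p) for p in find_duplicates(paths)]

print(f"Found Duplicates: {demo_duplicates}")
# Found Duplicates: ['c.json']
```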


In [4]:

#parser = argparse.ArgumentParser()

#parser.add_argument("dir", help = "Enter path of Directory")

#parser.add_argument("-u", "--unix", action="store_true", dest="unix", default=False, help="This to manage time")

#args = parser.parse_args()

In [5]:
for file in files:
    # Fresh per-file accumulators for the derived columns.
    To_Url_lst = []
    lst = []
    Operating_System_lst = []
    From_Url_lst = []

    # Each line of the file is one JSON record.
    with open(file) as json_file:
        records = [json.loads(line) for line in json_file]
    df = pd.DataFrame(records)
    df = df[['a','tz','r','u','t','ll','hc','cy']]
    # Browser name: the leading run of letters in the user agent (e.g. "Mozilla").
    df['web_Browser'] = df.a.str.extract(r'^([a-zA-Z]*)', expand=False)
    # Parenthesised part of the user agent; the loop below reduces it to its first word.
    df['operating_system'] = df.a.str.extract(r'(\([^(]+\))', expand=True)

    for i in df['operating_system']:
        # Keep the first word and strip the opening parenthesis and any semicolons.
        first_word = str(i).split(" ")[0]
        Operating_System_lst.append(first_word.replace("(", "").replace(";", ""))
    df['operating_system'] = Operating_System_lst
    # Referrer host with its leading slashes, e.g. "//www.facebook.com" or "//t.co/N".
    # The dots are unescaped and hyphens are not matched, so hosts that contain
    # hyphens or more than three labels are cut short.
    df['from_url'] = df.r.str.extract(r'(//[a-zA-Z]*.[a-zA-Z]*.[a-zA-Z]*)', expand=True)

    for i in df['from_url']:
        x = re.sub("//", "", str(i))
        # Keep only the part before the first "/".
        From_Url_lst.append(x.split("/")[0])
    df['from_url'] = From_Url_lst
    
    # [.gov] is a character class, so this matches greedily up to the last
    # '.', 'g', 'o' or 'v' in the URL.
    df['to_url'] = df.u.str.extract(r'(.*[.gov])', expand=True)
    for i in df['to_url']:
        # Only an "http://" prefix is stripped; "https://" URLs are not handled here.
        x = re.sub("http://", "", str(i))
        # Keep only the part before the first "/".
        To_Url_lst.append(x.split("/")[0])
    df['to_url'] = To_Url_lst
    
    df.rename(columns = {'tz' : 'time_zone', 'cy': 'city','t': 'time_in','hc':'time_out'}, inplace = True)
    df['ll'] = df['ll'].fillna('')
    df[['longitude','latitude']] = pd.DataFrame(df['ll'].values.tolist(), index = df.index)
    df = df[['web_Browser','operating_system','from_url','to_url','city','longitude','latitude','time_zone','time_in','time_out']]
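    # Treat the string "nan" (produced by str() on missing values) and blank
    # strings as missing, then drop every incomplete row.
    df = df.replace(to_replace="nan", value=np.nan)
    df = df.replace(to_replace=r'^\s*$', value=np.nan, regex=True)
    df = df.dropna(axis=0)

    # Write one CSV per input file; the "target" directory must already exist.
    df.to_csv(os.path.join(path, "target", file + ".csv"), header=True, index=False)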
    
    

In [6]:
df

| Unnamed: 0 | web_Browser | operating_system | from_url | to_url | city | longitude | latitude | time_zone | time_in | time_out |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mozilla | Windows | www.facebook.com | www.ncbi.nlm.nih.gov | Danvers | 42.576698 | -70.954903 | America/New_York | 2012-03-16 22:40:47+00:00 | 2012-03-15 18:48:38+00:00 |
| 2 | Mozilla | Windows | t.co | boxer.senate.gov | Washington | 38.900700 | -77.043098 | America/New_York | 2012-03-16 22:40:50+00:00 | 2012-03-16 21:45:41+00:00 |
| 4 | Mozilla | Windows | www.shrewsbury-ma | www.shrewsbury-ma.gov | Shrewsbury | 42.286499 | -71.714699 | America/New_York | 2012-03-16 22:40:51+00:00 | 2010-05-12 17:53:31+00:00 |
| 5 | Mozilla | Windows | www.shrewsbury-ma | www.shrewsbury-ma.gov | Shrewsbury | 42.286499 | -71.714699 | America/New_York | 2012-03-16 22:40:52+00:00 | 2010-05-12 17:55:06+00:00 |
| 6 | Mozilla | Windows | plus.url.google | www.nasa.gov | Luban | 51.116699 | 15.283300 | Europe/Warsaw | 2012-03-16 17:40:55+00:00 | 2012-03-16 17:34:14+00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3550 | Mozilla | Windows | www.linkedin.com | www.nlm.nih.gov | Conshohocken | 40.079800 | -75.285500 | America/New_York | 2012-03-16 23:40:37+00:00 | 2012-03-16 21:07:12+00:00 |
| 3553 | Mozilla | Windows | www.shrewsbury-ma | www.shrewsbury-ma.gov | Shrewsbury | 42.286499 | -71.714699 | America/New_York | 2012-03-16 23:40:40+00:00 | 2010-05-12 17:53:31+00:00 |
| 3554 | Mozilla | Windows | www.shrewsbury-ma | www.shrewsbury-ma.gov | Shrewsbury | 42.286499 | -71.714699 | America/New_York | 2012-03-16 23:40:40+00:00 | 2010-05-12 17:55:06+00:00 |
| 3556 | Mozilla | Windows | www.facebook.com | www.okc.gov | Oklahoma City | 35.471500 | -97.518997 | America/Chicago | 2012-03-17 00:40:44+00:00 | 2011-06-08 15:50:47+00:00 |
