### Problem Description

In 2012, URL shortening service Bitly partnered with the US government website USA.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil.

The text file is in JSON format, with one record per line. The keys below are the most important ones for this task.

| key | description |
|---|---|
| a | Information about the web browser and operating system |
| tz | Time zone |
| r | URL the user came from |
| u | URL the user went to |
| t | Timestamp when the user started using the website, in UNIX format |
| hc | Timestamp when the user exited the website, in UNIX format |
| cy | City from which the request was initiated |
| ll | Longitude and latitude |

In the cell below, I provide some helper code to give a clearer picture of the data.

**HINT**: These lines of code may not help with your task at all.

In [1]:
# Load every line of the file into a list of dictionaries
import json
with open('usa.gov_click_data_1.json') as f:
    records = [json.loads(line) for line in f]
# Print the first record
records[0]

{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'c': 'US',
 'nk': 1,
 'tz': 'America/New_York',
 'gr': 'MA',
 'g': 'A6qOVH',
 'h': 'wfLQtf',
 'l': 'orofrog',
 'al': 'en-US,en;q=0.8',
 'hh': '1.usa.gov',
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
 't': 1333307030,
 'hc': 1333307037,
 'cy': 'Danvers',
 'll': [42.576698, -70.954903]}

## Required

In [2]:
import json
import pandas as pd

# From the list of JSON records to a DataFrame
df = pd.json_normalize(records)

# Drop rows with missing values
df = df.dropna()

# The web browser that requested the service, e.g.
# 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11'
# Split on spaces (at most two splits); expand=True returns the parts as columns
arr1 = df['a'].str.split(" ", n=2, expand=True)
# Split the first part on "/" and keep the name before the version
arr2 = arr1[0].str.split("/", n=1, expand=True)
df['web_browser'] = arr2[0]

# The operating system that initiated the request
# Drop the first character of the second part, which is "("
df['operating_sys'] = arr1[1].str[1:]

# The main URL the user came from, e.g.
# 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf' -> 'www.facebook.com'
arr3 = df['r'].str.split("/", n=5, expand=True)
df['from_url'] = arr3[2]

# The same applies to the URL the user went to, e.g.
# 'http://www.ncbi.nlm.nih.gov/pubmed/22415991' -> 'www.ncbi.nlm.nih.gov'
arr4 = df['u'].str.split("/", n=5, expand=True)
df['to_url'] = arr4[2]

# The city from which the request was sent
df['city'] = df['cy']

# Example value of 'll': [42.576698, -70.954903]
# The longitude where the request was sent
df['longitude'] = df['ll'].str[0]

# The latitude where the request was sent
df['latitude'] = df['ll'].str[1]


# The time zone that the city follows
df['time_zone'] = df['tz']


# Drop rows with missing values
df = df.dropna()


# Time when the request started, built from the UNIX timestamp and the row's time zone
time_in_timestamp = []
for i, row in df.iterrows():
    time_in = pd.to_datetime(row['t'], unit = 's').tz_localize(row['time_zone']).tz_convert('UTC')
    time_in_timestamp.append(time_in)
    
df['time_in'] = time_in_timestamp
 
    
# Time when the request ended
time_out_timestamp = []
for i, row in df.iterrows():
    time_out = pd.to_datetime(row['hc'], unit = 's').tz_localize(row['time_zone']).tz_convert('UTC')
    time_out_timestamp.append(time_out)
    
df['time_out'] = time_out_timestamp


#drop old columns 
df.drop(columns=['a', 'c','nk','tz','gr','g','h','l','al','hh','r','u','t','hc','cy','ll'], inplace=True)

df.head(12)

#df.to_csv('f1.csv', index = False)

  


| | web_browser | operating_sys | from_url | to_url | city | longitude | latitude | time_zone | time_in | time_out |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mozilla | Windows | www.facebook.com | www.ncbi.nlm.nih.gov | Danvers | 42.576698 | -70.954903 | America/New_York | 2012-04-01 23:03:50+00:00 | 2012-04-01 23:03:57+00:00 |
| 2 | Mozilla | Windows | t.co | boxer.senate.gov | Washington | 38.9007 | -77.043098 | America/New_York | 2012-04-03 02:50:30+00:00 | 2012-04-03 02:50:35+00:00 |
| 4 | Mozilla | Windows | www.shrewsbury-ma.gov | www.shrewsbury-ma.gov | Shrewsbury | 42.286499 | -71.714699 | America/New_York | 2012-04-05 10:23:50+00:00 | 2012-04-05 10:23:59+00:00 |
| 5 | Mozilla | Windows | www.shrewsbury-ma.gov | www.shrewsbury-ma.gov | Shrewsbury | 42.286499 | -71.714699 | America/New_York | 2012-04-06 14:10:30+00:00 | 2012-04-06 14:10:48+00:00 |
| 6 | Mozilla | Windows | plus.url.google.com | www.nasa.gov | Luban | 51.116699 | 15.2833 | Europe/Warsaw | 2012-04-07 11:57:10+00:00 | 2012-04-07 11:57:20+00:00 |
| 11 | Mozilla | Macintosh; | t.co | oversight.house.gov | Washington | 38.937599 | -77.092796 | America/New_York | 2012-04-12 09:04:00+00:00 | 2012-04-12 09:04:03+00:00 |


Write a script that transforms each JSON file into a DataFrame and writes it to a separate CSV file in the target directory, taking the following into account:

        

All CSV files must have the following columns:

- web_browser: the web browser that requested the service
- operating_sys: the operating system that initiated the request
- from_url: the main URL the user came from

    **Note**: if the retrieved URL is in a long format such as `http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf`, write it to the file in a short format such as `www.facebook.com`.

- to_url: the main URL the user went to; the same shortening rule as for `from_url` applies
- city: the city from which the request was sent
- longitude: the longitude from which the request was sent
- latitude: the latitude from which the request was sent
- time_zone: the time zone that the city follows
- time_in: the time when the request started
- time_out: the time when the request ended
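
As an illustration of the short-format rule for `from_url` and `to_url`, here is a minimal sketch that uses the standard library instead of string splitting. The helper name `short_url` is an assumption made for this example and is not part of the task. A value that has no scheme (for instance a plain word rather than a full URL) yields an empty string here, which you would still need to handle as a missing value.

```python
from urllib.parse import urlparse


def short_url(url):
    """Return only the host part of a URL (hypothetical helper name)."""
    # With a scheme such as http:// present, urlparse puts the host in .netloc
    return urlparse(url).netloc


print(short_url("http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf"))  # www.facebook.com
print(short_url("http://www.ncbi.nlm.nih.gov/pubmed/22415991"))           # www.ncbi.nlm.nih.gov
```

On a DataFrame column, the same function could be applied with `df['r'].map(short_url)`.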
        
        
**NOTE**:

Because some records in the files are incomplete, you may encounter NaN values during the transformation. Make sure that the final DataFrames contain no NaNs at all.
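
One possible way to meet this requirement, sketched under assumptions, is to filter out incomplete records before the DataFrame is built, checking only the keys the task needs. The names `REQUIRED_KEYS` and `is_complete` and the two sample records are made up for this illustration. Calling `dropna()` on the finished DataFrame is an equally valid approach.

```python
# Keys the task needs, taken from the key table in the problem description
REQUIRED_KEYS = ("a", "tz", "r", "u", "t", "hc", "cy", "ll")


def is_complete(record):
    """Return True when every required key is present with a non-empty value."""
    return all(record.get(key) not in (None, "", []) for key in REQUIRED_KEYS)


# Two made-up records: the second has an empty time zone and no city or coordinates
sample = [
    {"a": "Mozilla/5.0", "tz": "America/New_York", "r": "http://t.co/x",
     "u": "http://www.nasa.gov/", "t": 1333307030, "hc": 1333307037,
     "cy": "Danvers", "ll": [42.576698, -70.954903]},
    {"a": "Mozilla/5.0", "tz": "", "r": "direct",
     "u": "http://www.nasa.gov/", "t": 1333307030, "hc": 1333307037},
]

complete = [record for record in sample if is_complete(record)]
print(len(complete))  # 1
```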

### Script Details

The script itself must do the following before and after the transformation:

- Accept one positional argument: the path of the directory that contains the files.

- Accept one optional argument **-u**. If it is passed, the timestamps keep their UNIX format; if it is not passed, the timestamps are converted.

- Check whether any of the files are duplicates of each other by comparing their **checksums**, and print a message indicating that.

- Print a message after converting each file, giving the number of rows transformed and the path of the file.

- At the end of the script, print the total execution time.
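
The requirements above can be sketched as a script skeleton. This is only one possible layout under stated assumptions: the long option name `--unix`, the function names, the `*.json` file pattern, and the choice of SHA-256 as the checksum are all illustrative, and `convert_file` is a stand-in that merely counts lines where the real transformation and CSV export would go.

```python
import argparse
import hashlib
import time
from pathlib import Path


def parse_args(argv=None):
    """One positional directory argument and one optional -u flag."""
    parser = argparse.ArgumentParser(description="Convert click-data JSON files to CSV.")
    parser.add_argument("directory", type=Path, help="directory that contains the JSON files")
    parser.add_argument("-u", "--unix", action="store_true",
                        help="keep timestamps in UNIX format instead of converting them")
    return parser.parse_args(argv)


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_duplicates(paths):
    """Return (unique_paths, duplicate_paths) by comparing file checksums."""
    seen = set()
    unique, duplicates = [], []
    for path in paths:
        checksum = file_checksum(path)
        if checksum in seen:
            duplicates.append(path)
        else:
            seen.add(checksum)
            unique.append(path)
    return unique, duplicates


def convert_file(path, keep_unix):
    """Stand-in for the real transformation; returns the number of rows handled."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())


def main(argv=None):
    start = time.perf_counter()
    args = parse_args(argv)
    unique, duplicates = split_duplicates(sorted(args.directory.glob("*.json")))
    for path in duplicates:
        print(f"Duplicate file skipped: {path}")
    for path in unique:
        rows = convert_file(path, keep_unix=args.unix)
        print(f"{rows} rows transformed from {path}")
    print(f"Total execution time: {time.perf_counter() - start:.2f} seconds")
```

To run it from the command line, call `main()` under the usual `if __name__ == "__main__":` guard. Comparing checksums rather than file names catches files whose contents are identical even when their names differ.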
    