# Extracting NL Domain URLs from CommonCrawl Archive
#### *1. Using CommonCrawl [Index Client](http://index.commoncrawl.org) (CDX Index API Client)*
  * For retrieving the URL’s of the Dutch Domain in the Common Crawl achieve of September, the [cdx-index-client](https://github.com/ikreymer/cdx-index-client) can be used in [cygwin64](https://www.cygwin.com) terminal (a linux like environment for windows) it can also be used in the Jupyter environment as shown below.<br>
  * The following code is used download gz files from the September 2016 archive: `./cdx-index-client.py -c CC-MAIN-2016-40 *.nl --fl url -z` (Changing 2016-40 to 2017-13 gives the March 2017 archive).(`!cdx-index-client.py -c CC-MAIN-2016-40 *.nl --fl url -z`   in Jupyter.)  <br>
  * The cdx-index-client ran with python3 does gives a couple of errors. These could be easily by changing some bits of code (mainly typos and capitals).<br>
  * The cdx-index-client downloaded 552 gz files which all contain several URLS. Many of the Urls not only contain the hosting address but also an additional path, e.g., the http://www.doof.nl has more than a hundred URLS in the the 100th gz file.<br>
  * Total runningtime for downloading the urls from the September 2016 Common Crawl archive was: 52 minutes.<br>
  * Total runningtime for downloading the urls from the March 2017 Common Crawl archive was: 3 hours and 16 minutes.



In [None]:
!cdx-index-client.py -c CC-MAIN-2016-40 *.nl --fl url -z

#### *2. Funtions for extracting unique URLS from gz files*

Two functions are created to extract urls from the gz files. The extracturls functions and the create functions. The create function will be used in the extracturls function and thus only the extracturls function needs to be ran.  

#### The Extracturls Function

The purpose of the extracturls function is to extract the urls from the gz files which are downloaded with the cdx-index-client and return a txt file with unique urls. Beside this it will give the total running time the total urls extracted, average urls per gz file, total size of processed files and average size per gz file. 

The function six optional arguments. With the **'path'** argument a directory in which the gz files are located can be choosen. The argument **'name'** makes it possible for the user to choose a name for the final file(s) (e.g. september_2016_urls.csv) when no name is given by user it will automatically choose **'output'** as name for the file. Another optional argument is **'prnt'**, by setting it to *True* (prnt = *True*) for each gz files processed it will print running time per file, the amount of urls extractied from it and the filesize of the url. 

Finally the three arguments txt, json and csv enables the user to choose a format for the outputfile **csv** is set to *True* by default, while **json** and **txt** are set to *False* by default.

In [78]:
def extracturls(path = "", name = 'output', prnt = False, csv = True, txt = False, json = False):
    import glob
    import time
    import json as js
    import builtins # is used because conflict with gzip open.
    import os
    import csv as cs
    from gzip import open
    if len(path) == 0:
        path = os.getcwd()
        print("Using files in working directory")
    urls =[] # creates empty list for list complete list of urls.
    start_time0 = time.time()
    N = len(glob.glob1(path,"*.gz")) # counts the how many gz files are in the directory.
    name += "_"+str(N) #adds number of gz files to name.
    total_size = 0
    for i in range(N):
        start_time = time.time() #starts counting time for each loop (.gz file)
        clean_urls = []  #create empty lists for cleaned urls for each loop.
        j = ((len(str(N))-len(str(i)))*'0')+str(i) #This line determines the how many 0's if any should be added to file name.
        f = open(path+'\domain-nl-'+j+'.gz', mode='rt') # Any streaming file object that supports `read`
        size = os.stat(path+'\domain-nl-'+j+'.gz').st_size
        url = f.read().split() # spilts into seperate urls.  
        for k in url: # iterate through seperated urls.
            clean_url = k[k.find("http"):k.find(".nl")+3] #only selects first two segments of urls.
            clean_urls.append(clean_url) # append to clean_urls.
        total_size += size
        urls.extend(clean_urls) #extends complete list with clean_urls.
        #creating an empty list for each loop and copy results into a second complete list
        #dramatically decreases running time since the urls to check do not increase after each loop
        #otherwise running time increases exponentially. 
        runningtime = (time.time() - start_time) #calculates running time.
        if prnt == True:
            print("runningtime {}:".format(i),  round(runningtime,2),"seconds,", len(clean_urls), 'URL\'s', 'filesize', round(size/1024,2), 'KB')
        #m, s = divmod(runningtime, 60) # separates seconds into seconds and minutes.
        #h, m = divmod(m, 60) # separates minutes into minutes and hours.
        #print (" Running time {} = %d:%02d:%02d".format(i) % (h, m, s)) #gives running total running time.
    uniqueurls = list(set(urls))#removes duplicates, but distorts the order. 
    uniqueurls.sort()
    create(name, uniqueurls, txt, json, csv)
    runningtime0 = (time.time() - start_time0) #gets total running time.
    m, s = divmod(runningtime0,60) # separates seconds into seconds and minutes.
    h, m = divmod(m, 60) # separates minutes into minutes and hours.
    #uniqueurls2.sort() #puts in alphabethical order.
    print ("Total Runningtime: %d:%02d:%02d" % (h, m, s)) #gives running total running time.
    print ("Total # Urls {}, Avarage Urls/file {}".format(len(urls), (len(urls)/N)))
    print ("Total Size {} MB, Average filesize {} KB".format((round((total_size/(1024*1024)),2)) , (round((total_size/ N)/(1024),2))))

#### The Create Function

The create function will use the arguments from the extracturls function in order to output the in format which the user has chosen.

In [79]:
def create(name,X,txt, json, csv): 
    import time
    import builtins
    import json as js
    import os
    import csv as cs
    while True:
        if txt == True:
            print("creating txt file")
            output = builtins.open(name+'.txt','w') #creates a txt file.
            js.dump(X,output) #outputs the created urls list to created json file.
            output.close()
            print ("textfile {} created".format(name+'.txt'))
            size = os.stat(r'C:\Users\Eltebook 01\Documents\CBS Text mining\gzipstream-master\{}.txt'.format(name)).st_size
            print (round((size/1024)/1024,2), 'MB')
            txt = False
            print(txt)
        if json == True:
            print("creating json file")
            output = builtins.open(name+'.json','w') #creates a json file.
            js.dump(X,output) #outputs the created urls list to created csv file.
            output.close()
            print ("jsonfile {} created".format(name+'.json'))
            size = os.stat(r'C:\Users\Eltebook 01\Documents\CBS Text mining\gzipstream-master\{}.json'.format(name)).st_size
            print (round((size/1024)/1024,2), 'MB')
            js = False
        if csv == True:  
            print("creating csv file")
            with builtins.open(name+'.csv', 'w') as output:
                writer = cs.writer(output)
                writer.writerows(X)
            output.close()
            print ("csvfile {} created".format(name+'.csv'))
            size = os.stat(r'C:\Users\Eltebook 01\Documents\CBS Text mining\gzipstream-master\{}.csv'.format(name)).st_size
            print (round((size/1024)/1024,2), 'MB')
            csv = False
        else:
            break
    print("files created")

Total urls in the September 2016 archive: **8261874** <br>
Total unique urls in the September 2016 archive: **635740** <br><br>
Total urls in March 2017 archive: **25127201**<br>
Total unique urls in the March archive: **1117424**

#### *3. Running the Program*

Running the codeblock below will create by defealt creates a csv file with unique urls extracted from the gz files in the working directory. By setting txt to True (extracturls(`txt = True`) will create a csv file and a txt file. If only a json file is to be created use (extracturls(csv = False, json = True)

If extracturls(X) is ran it will produce a csv file from path X.

In [80]:
X = r'C:\Users\Eltebook 01\Documents\CBS Thesis\.nl domain' 
extracturls()

Using files in working directory
creating csv file
csvfile output_552.csv created
33.98 MB
files created
Total Runningtime: 0:01:50
Total # Urls 8261874, Avarage Urls/file 14967.16304347826
Total Size 132.25 MB, Average filesize 245.33 KB
