# Carbon Date 

-------------------------------------------------

&nbsp;

## Contents

*  [Prepare the Data](#cleanup)
*  [Graph the Data](#graphs)
*  [Scatterplot Graph (non-interactive)](#scatter)
*  [Scatterplot Graph (external)](scatter.html)

&nbsp;

The raw code for this Jupyter notebook is shown by default. The notebook walks through the process and code that lead to the final data and graph. If you only came for the graphs, [click here](#graphs). To toggle the raw code on or off, click below:

&nbsp;

In [1]:
# Setup Code toggle button
from IPython.display import HTML

HTML(''' 
<center><h3>
<a href="javascript:code_toggle()">Code is cheap, show me the data.</a>
</h3></center>
<script>
    var code_show=false; //false -> show code at first

    function code_toggle() {
        $('div.prompt').hide(); // always hide prompt

        if (code_show){
            $('div.input').hide();
        } else {
            $('div.input').show();
        }
        code_show = !code_show
    }
    $( document ).ready(code_toggle);
</script>
''')

In [2]:
# Setup notebook theme
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
set_nb_theme(get_themes()[2])

In [3]:
# Setup offline mode
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

&nbsp;

<a id='cleanup'></a>
## Prepare the Data

&nbsp;

In [4]:
link_path = '../data/final_link_list.dat'
with open(link_path) as link_file:
    init_links = [link.strip().lower() for link in link_file]

print("Grabbed the {} links".format(len(init_links)))

Grabbed the 1000 links


&nbsp;

With some bash magic, the Docker container prepared for the [CarbonDate](https://github.com/oduwsdl/CarbonDate) tool can be used to get the JSON data. The `%%bash` cell magic lets us run the CarbonDate Docker image, together with the command that downloads the needed data, directly from the notebook.

&nbsp;

In [5]:
%%bash

docker run --rm -i -p 4444:4444 \
            registry.gitlab.com/datenstrom/cs532-s17:assignment-2 \
            ./main.py -l search 'www.nostarch.com'

cdGetBitly.py::GetBitlyJson(), please set bitly access token in config
runtime in seconds:  23
{
  "URI": "http://www.nostarch.com",
  "Estimated Creation Date": "2009-10-26T00:00:00",
  "Google.com": "2009-10-26T00:00:00",
  "Twitter.com": "2017-02-06T21:30:21",
  "Backlinks": "",
  "Archives": "",
  "Bing.com": "",
  "Pubdate tag": "",
  "Bitly.com": "",
  "Last Modified": "2017-02-14T01:18:37"
}



&nbsp;

Now that we know it works, we can use the inline shell syntax, `!<bash command>`, and wrap it in a Python loop over our URI list.
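
The parsing half of this step can also be sketched in plain Python, outside the notebook. This is a minimal sketch, not the notebook's own code: `carbondate_command` and `parse_estimated_creation_date` are hypothetical helper names, and the sample text is an abridged copy of the output printed above. Rather than slicing off a fixed number of lines, it searches for the first `{`, which assumes the log lines before the JSON never contain a brace.

```python
import json
from datetime import datetime

# Image name and arguments copied from the docker command used above.
IMAGE = "registry.gitlab.com/datenstrom/cs532-s17:assignment-2"


def carbondate_command(uri):
    """Build the argv list for one CarbonDate run (hypothetical helper)."""
    return ["docker", "run", "--rm", "-i", "-p", "4444:4444",
            IMAGE, "./main.py", "-l", "search", uri]


def parse_estimated_creation_date(raw_output):
    """Return the estimated creation date as a datetime, or None."""
    # Skip any log lines printed before the JSON object.
    start = raw_output.find("{")
    if start == -1:
        return None
    try:
        record = json.loads(raw_output[start:])
    except json.JSONDecodeError:
        return None
    # An empty string means CarbonDate found no estimate.
    stamp = record.get("Estimated Creation Date", "")
    if not stamp:
        return None
    try:
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None


# Abridged copy of the output printed for www.nostarch.com.
SAMPLE_OUTPUT = """cdGetBitly.py::GetBitlyJson(), please set bitly access token in config
runtime in seconds:  23
{
  "URI": "http://www.nostarch.com",
  "Estimated Creation Date": "2009-10-26T00:00:00",
  "Bing.com": "",
  "Last Modified": "2017-02-14T01:18:37"
}
"""

print(parse_estimated_creation_date(SAMPLE_OUTPUT))
```

The list returned by `carbondate_command` could be handed to `subprocess.run(..., capture_output=True, text=True)` and the resulting `stdout` passed to the parser. That call is left out here because it needs Docker and network access.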

&nbsp;

In [6]:
import json
from datetime import datetime

creation_dates = []
for link in init_links:
    
    JSON = !docker run --rm -i -p 4444:4444 \
                registry.gitlab.com/datenstrom/cs532-s17:assignment-2 \
                ./main.py -l search {link}
    
    # Join `SList` object elements with `\n`
    j = JSON.n
    try:
        # Split into a list at newlines,
        # slice off the two log lines that precede the JSON,
        # join into a single-line string,
        # and parse it as JSON
        j = json.loads(' '.join(j.split('\n')[2:]))
        # Date string format is 'YYYY-MM-DDTHH:MM:SS';
        # it is converted to a datetime object below
        time = j['Estimated Creation Date']
    except ValueError as e:
        print('ValueError: {}\nlikely malformed JSON output'.format(e))
        time = ''
    
    if time != '':
        time = datetime.strptime(time, '%Y-%m-%dT%H:%M:%S')
        creation_dates.append(time)
        print("Link: {}\nTime: {}\n\n\n".format(link, time))
    else:
        creation_dates.append(None)
        print("Link: {}\nTime: {}\n\n\n".format(link, None))

Link: www.werkenbijdeloitte.nl
Time: 2012-03-05 00:00:00



Link: www.linkedin.com/today/author/jamesljohnson
Time: None



Link: www.scmagazine.com/david-beckhams-emails-hacked-and-released-after-ransom-refusal/article/636560/?dcmp=emc-scus_newswire&spmailingid=16524299&spuserid=mzqzmze2nju5ota3s0&spjobid=960720405&spreportid=otywnziwnda1s0
Time: 2017-02-07 10:30:00



Link: www.systron.hu
Time: 2014-02-09 11:33:39



Link: www.tripwire.com/state-of-security/featured/french-man-sues-uber-after-privacy-bug-led-wife-to-suspect-adultery/
Time: 2017-02-09 05:09:38



Link: www.vanguardngr.com
Time: 2016-02-05 00:00:00



Link: www.bitfeed.co/
Time: 2015-07-04 00:00:00



Link: www.cpgconnect.ca
Time: 2016-05-30 00:00:00



Link: www.cartoonnetwork.com
Time: 2005-02-06 00:00:00



Link: www.instagram.com/hackedbystacy/
Time: 2017-02-09 16:56:34



Link: www.avst.com/products/cx-e/
Time: None



Link: www.stickleyonsecurity.com
Time: None



Link: www.redhawksecurity.com
Time: None



Link:

In [7]:
import plotly.plotly as py
from plotly import figure_factory as FF

import pandas as pd

link_data = pd.read_csv('../data/link_mementos.dat')  

link_data

Unnamed: 0,link,memento
0,www.werkenbijdeloitte.nl,203
1,www.linkedin.com/today/author/jamesljohnson,0
2,www.scmagazine.com/david-beckhams-emails-hacke...,0
3,www.systron.hu,21
4,www.tripwire.com/state-of-security/featured/fr...,1
5,www.vanguardngr.com,12866
6,www.bitfeed.co/,6
7,www.cpgconnect.ca,198
8,www.cartoonnetwork.com,15546
9,www.instagram.com/hackedbystacy/,0


In [8]:
dates = [None] * 1000
for link, date in zip(init_links, creation_dates):
    target_index = link_data['link'].tolist().index(link)
    dates[target_index] = date

link_data['creation_date'] = dates
link_data

Unnamed: 0,link,memento,creation_date
0,www.werkenbijdeloitte.nl,203,2012-03-05 00:00:00
1,www.linkedin.com/today/author/jamesljohnson,0,NaT
2,www.scmagazine.com/david-beckhams-emails-hacke...,0,2017-02-07 10:30:00
3,www.systron.hu,21,2014-02-09 11:33:39
4,www.tripwire.com/state-of-security/featured/fr...,1,2017-02-09 05:09:38
5,www.vanguardngr.com,12866,2016-02-05 00:00:00
6,www.bitfeed.co/,6,2015-07-04 00:00:00
7,www.cpgconnect.ca,198,2016-05-30 00:00:00
8,www.cartoonnetwork.com,15546,2005-02-06 00:00:00
9,www.instagram.com/hackedbystacy/,0,2017-02-09 16:56:34


&nbsp;

R expects a date to be represented as an integer: the number of days since the 1970-01-01 epoch, with negative values for earlier dates. This is easy to compute from the stored datetime objects, but the column also contains many `NaT` values, which should be converted to `NaN`. First calculate the `datetime.timedelta` value for each row, then apply a lambda over the deltas, using `.notnull()` to skip the `NaT` values, to extract the days.

&nbsp;

In [9]:
link_data['creation_date'] = (pd.to_datetime(link_data['creation_date'])
                              - pd.Timestamp(1970, 1, 1))


# Rows skipped by the mask are left as NaN after index alignment
link_data['creation_date'] = link_data.loc[link_data['creation_date'].notnull(),
                             'creation_date'].apply(lambda x: pd.Timedelta(x).days)

link_data

Unnamed: 0,link,memento,creation_date
0,www.werkenbijdeloitte.nl,203,15404.0
1,www.linkedin.com/today/author/jamesljohnson,0,
2,www.scmagazine.com/david-beckhams-emails-hacke...,0,17204.0
3,www.systron.hu,21,16110.0
4,www.tripwire.com/state-of-security/featured/fr...,1,17206.0
5,www.vanguardngr.com,12866,16836.0
6,www.bitfeed.co/,6,16620.0
7,www.cpgconnect.ca,198,16951.0
8,www.cartoonnetwork.com,15546,12820.0
9,www.instagram.com/hackedbystacy/,0,17206.0
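
The day counts in this table can be sanity-checked with the standard library alone. The sketch below is illustrative, not part of the notebook's pipeline: `days_since_epoch` and `date_from_days` are hypothetical names. It uses `math.nan` for a missing date, as the `NaN` entries do here. The inverse function shows how a day count maps back to a calendar date, which is the direction the hover text in the plots needs.

```python
import math
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def days_since_epoch(moment):
    """Whole days between the Unix epoch and `moment`; NaN when missing."""
    if moment is None:
        return math.nan
    # timedelta.days floors toward negative infinity, so dates before
    # the epoch come out negative.
    return float((moment - EPOCH).days)


def date_from_days(days):
    """Inverse mapping: day count back to a calendar date, None for NaN."""
    if math.isnan(days):
        return None
    return EPOCH.date() + timedelta(days=int(days))


print(days_since_epoch(datetime(2012, 3, 5)))
print(days_since_epoch(None))
print(date_from_days(15404.0))
```

Because only the `.days` component is kept, the time of day is discarded: `2014-02-09 11:33:39` and `2014-02-09 00:00:00` map to the same day count.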


&nbsp;

All set to wrap this up and pass it to R.

&nbsp;

In [14]:
# Read back from saved read only data,
# Let us never run the carbon date loop again.
# Comment out to run with new data.
link_data = pd.read_csv('../data/_static/link_mementos_date.dat')

# Convert Dataframe to be importable by R
from rpy2.robjects import pandas2ri
df = pandas2ri.py2ri(link_data)

# Load R magic
%load_ext rpy2.ipython

&nbsp;

<a id='graphs'></a>
## Graph the Data

Graph all URIs that have more than zero mementos and an estimated creation date, with days since the epoch on the $x$-axis and the number of mementos on the $y$-axis. Since we do not care about data without a creation date, all rows with `NaN` values can be thrown out. Unfortunately, R dataframes do not print as nicely as Python's.
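
The two filtering steps and their bookkeeping can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's method, which filters in R: `filter_rows` is a hypothetical name, and the sample rows are the memento counts and day values from the ten-row preview shown earlier.

```python
import math

# (mementos, creation_day) pairs taken from the ten-row preview;
# math.nan marks the row with no date estimate.
rows = [
    (203, 15404.0), (0, math.nan), (0, 17204.0), (21, 16110.0),
    (1, 17206.0), (12866, 16836.0), (6, 16620.0), (198, 16951.0),
    (15546, 12820.0), (0, 17206.0),
]


def filter_rows(rows):
    """Drop undated rows, then rows with zero mementos, counting each stage."""
    dated = [r for r in rows if not math.isnan(r[1])]
    no_date = len(rows) - len(dated)
    kept = [r for r in dated if r[0] > 0]
    no_mementos = len(dated) - len(kept)
    return kept, no_date, no_mementos


kept, no_date, no_mementos = filter_rows(rows)
print("No Date Estimate:", no_date)
print("No Mementos:", no_mementos)
print("Remaining Data Points:", len(kept))
```

The order of the two filters affects the counts: a row that lacks both a date and mementos is counted under "no date" because that filter runs first.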

&nbsp;

In [11]:
%%R -i df -o df

# Remove all rows without date
df <- df[complete.cases(df),]
no_date <- 1000 - nrow(df)
remaining <- nrow(df)

# Remove all rows without mementos
df <- df[!(df$X.memento==0),]
no_mementos <- remaining - nrow(df)
remaining <- nrow(df)

# Add a string date column
#library(anytime)
#seconds_day = 86400
#cbind(df, sapply(df$creation_date, anytime()))

cat("Total: 1000\n")
cat("No Mementos: "); cat(no_mementos); cat('\n')
cat("No Date Estimate: "); cat(no_date);  cat('\n')
cat("Remaining Data Points: "); cat(remaining); cat('\n\n\n\n')
head(df)

Total: 1000
No Mementos: 377
No Date Estimate: 0
Remaining Data Points: 623



  Unnamed..0                                            link X.memento
1          0                              www.simplycast.com       565
2          1                                   www.cobalt.io        38
5          4 www.spotlight.com/interactive/cv/0358-9086-7967        12
6          5                               www.sailpoint.com       405
7          6                              www.socialflow.com      2332
9          8                           www.socialjukebox.com        39
  creation_date
1       14229.0
2       16423.0
5       15297.0
6       11522.0
7       14717.0
9       14047.0


&nbsp;

Graph for web.

&nbsp;

In [15]:
%%R -i df

library(plotly)
library(anytime)
library(RColorBrewer)
secs = 86400

p <- plot_ly(df, x = ~creation_date, y = ~X.memento,
             type = 'scatter',
             mode = 'markers',
             color = ~X.memento,
             marker = list(
                 size = 8
             ),
             # Hover text
             text = ~paste("Mementos: ", X.memento,
                           '<br>Date: ', (function(x) anydate(x*secs))(creation_date),
                           '<br>URI: ', link
                          )
            ) %>%
            layout(
                 title = "Memento Frequency in Time",
                 xaxis = list(
                     title = "Creation Time (In days since Unix epoch)"
                 ),
                 yaxis = list(
                     title = "Number of Mementos"
                 )
            ) %>%
            colorbar(
                title = "Number of<br />Mementos"
            )

# Create HTML graph
htmlwidgets::saveWidget(p, "scatter.html")

&nbsp;

Graph for print.

&nbsp;

In [17]:
%%R

title_font <- list(
    family = "Courier New, monospace",
    size = 70,
    color = "666666"
)

axis_font <- list(
    family = "Courier New, monospace",
    size = 50,
    color = "666666"
)

tick_font <- list(
    family = "Courier New, monospace",
    size = 30
)

margin <- list(
    l = 200,
    r = 200,
    b = 200,
    t = 200,
    pad = 8
)

# Create graph for print
q <- plot_ly(df, x = ~creation_date, y = ~X.memento,
             type = 'scatter',
             mode = 'markers',
             color = ~X.memento,
             marker = list(
                 size = 25
             ),
             # Hover text
             text = ~paste("Mementos: ", X.memento,
                           '<br>Date: ', (function(x) anydate(x*secs))(creation_date),
                           '<br>URI: ', link
                          )
            ) %>%  layout(
                 title = "Memento Frequency in Time",
                 font = title_font,
                 xaxis = list(
                     title = "Creation Time (In days since Unix epoch)",
                     font = axis_font,
                     tickfont = tick_font
                 ),
                 yaxis = list(
                     title = "Number of Mementos",
                     font = axis_font,
                     tickfont = tick_font
                 ),
                 autosize = F,
                 width=2556,
                 height=2556,
                 margin = margin
            ) %>%
            colorbar(
                title = "Number of<br />Mementos"
            )

plotly_IMAGE(
    q, 
    format = "png",
    out_file = "scatter.png",
    width=2556,
    height=2556,
    fileopt = "overwrite"
)

&nbsp;

Click the graph for the interactive version.

&nbsp;

<a id='scatter'></a>

[![graph](scatter.png)](scatter.html)