# *Sci-Hub server log data analysis*

### *John Bohannon, Science magazine*

This Notebook will help you process the 6 months of raw server log data provided by Sci-Hub to Science magazine in March 2016.

**Science article:**

http://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone

**Data set:**

http://dx.doi.org/10.5061/dryad.q447c

In [None]:
import pandas as pd
import numpy as np

In [None]:
months = ("sep2015", "oct2015", "nov2015", "dec2015", "jan2016", "feb2016")

In [None]:
%mkdir "scihub_data_temp"

**We will need CrossRef's database of DOI prefixes to identify publishers from the DOIs.**

http://www.crossref.org/06members/50go-live.html

**Don't worry, I scraped all that for you:**

In [None]:
journal_DOIs = pd.read_csv("publisher_DOI_prefixes.csv", index_col = 0)
journal_DOIs.head()

**Holding all months of raw data in memory is a bit much for most laptop computers, so let's process each month separately to generate aggregate data.**

In [None]:
def process_data(month):
    
    # load the file as a dataframe
    path = "scihub_data/"
    filename = month + ".tab"
    with open(path + filename, "r") as f:
        data = pd.read_table(f)

    # this is the format of the columns
    data.columns = ["date","doi","IP_code","country","city","coords"]

    # create a few more useful columns
    data[["latitude", "longitude"]] = data.coords.str.split(",", expand = True)
    data["prefix"] = data.dropna(subset = ["doi"]).doi.apply(lambda x: x.split("/")[0])

    # group by DOI prefix and count total downloads for each
    publishers = data.dropna(subset = ["prefix"]).groupby("prefix").count()
    publishers = publishers.sort_values(by = "date", ascending = False).date
    publishers = publishers.reset_index()
    publishers.columns = ["prefix","downloads"]

    # translate those DOI prefixes into publisher names using the CrossRef data
    data_publishers = pd.merge(publishers, journal_DOIs[["Prefix","Name"]],
                               left_on = "prefix", right_on = "Prefix", how = "left")
    data_publishers[["prefix","downloads","Name"]].to_csv("scihub_data_temp/%s_publishers.csv" %(month))

    # calculate the 100 most downloaded DOIs of the month
    top100_doi = data.groupby("doi").count().sort_values(by = "date", ascending = False)[:100].date
    top100_doi.name = "downloads"
    top100_doi.to_csv("scihub_data_temp/%s_top100_doi.csv" %month, header = True)

In [None]:
for month in months:
    print(month)
    process_data(month)
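
**If even a single month is too big for memory, the same counting can be done in pieces. What follows is a minimal sketch, not part of the original analysis. It assumes the files are tab-separated with no header row (check your copy), and `count_prefixes_chunked`, the chunk size, and the sample rows are all made up for illustration.**

```python
import io
from collections import Counter

import pandas as pd

COLUMNS = ["date", "doi", "IP_code", "country", "city", "coords"]

def count_prefixes_chunked(source, chunksize=100000):
    """Count downloads per DOI prefix without loading the whole file."""
    totals = Counter()
    reader = pd.read_csv(source, sep="\t", header=None, names=COLUMNS,
                         chunksize=chunksize)
    for chunk in reader:
        # everything before the first slash is the DOI prefix
        prefixes = chunk["doi"].dropna().astype(str).str.split("/").str[0]
        totals.update(prefixes.value_counts().to_dict())
    return totals

# tiny in-memory stand-in for a monthly .tab file (invented rows)
sample = io.StringIO(
    "2015-09-01 00:00:01\t10.1016/a\tip1\tIran\tTehran\t35.6,51.4\n"
    "2015-09-01 00:00:02\t10.1016/b\tip2\tIndia\tDelhi\t28.6,77.2\n"
    "2015-09-01 00:00:03\t10.1007/c\tip3\tChina\tBeijing\t39.9,116.4\n"
)
totals = count_prefixes_chunked(sample, chunksize=2)
print(totals)
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to whatever the machine tolerates.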

**Now let's see the big picture, starting with the publishers...**

In [None]:
publishers_by_month = [pd.read_csv("scihub_data_temp/" + i + "_publishers.csv") for i in months]
for i in publishers_by_month:
    print(len(i))

In [None]:
all_publishers = pd.concat(publishers_by_month)

In [None]:
all_publishers = all_publishers.groupby("Name").sum().downloads.sort_values(ascending = False)
all_publishers.head()
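
**Keep in mind that `groupby("Name")` silently drops rows whose `Name` is missing, so downloads for any prefix with no match in the CrossRef table are not counted here. A minimal sketch of how one might measure that share, using invented numbers and a two-row stand-in for the prefix table (the `indicator` argument of `pd.merge` adds a `_merge` column):**

```python
import pandas as pd

# invented download counts per prefix; 10.9999 is deliberately unknown
publishers = pd.DataFrame({"prefix": ["10.1016", "10.1007", "10.9999"],
                           "downloads": [50, 30, 20]})
# hypothetical two-row stand-in for the CrossRef prefix table
lookup = pd.DataFrame({"Prefix": ["10.1016", "10.1007"],
                       "Name": ["Elsevier", "Springer"]})

merged = pd.merge(publishers, lookup, left_on="prefix", right_on="Prefix",
                  how="left", indicator=True)

# rows tagged "left_only" found no publisher name
unmatched = merged[merged["_merge"] == "left_only"]
share = unmatched["downloads"].sum() / merged["downloads"].sum()
print(unmatched[["prefix", "downloads"]])
print("share of downloads with no publisher name: %.2f" % share)
```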

In [None]:
# .sum() also works if a month has more than one row named "Elsevier"
sum([i[i.Name == "Elsevier"].downloads.sum() for i in publishers_by_month])

**Yep, that checks out. Nearly 10 million Elsevier downloads in 6 months.**

In [None]:
all_publishers.to_csv("downloads_by_publishers.csv", header = ["downloads"])

**Next let's get a list of most downloaded papers across the 6-month period.**

In [None]:
all_papers = pd.concat([pd.read_csv("scihub_data_temp/" + i + "_top100_doi.csv") for i in months])

In [None]:
all_papers.groupby("doi").count().sort_values(by = "downloads",ascending = False).head()

**Plenty of papers are in the top100 across all 6 months. Let's see what the most downloaded paper is across the entire time period...**

In [None]:
top25_doi = all_papers.groupby("doi").sum().downloads.sort_values(ascending = False)[:25]
top25_doi

In [None]:
data_list = list()
for month in months:
    print(month)
    datafile = "scihub_data/" + month + ".tab"
    with open(datafile, "r") as f:
        this_month = pd.read_table(f)
        this_month.columns = ["date", "doi", "IP_code", "country", "city", "coords"]
        data_list.append(this_month.groupby("doi").count().date)   

In [None]:
data = data_list[0]
for i in data_list[1:]:
    data = data.add(i, fill_value = 0)
data.head()

In [None]:
data.name = "downloads"
data = data.sort_values(ascending = False)
data.head()
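
**The `fill_value = 0` in the accumulation above matters: `Series.add` aligns on the index, and a DOI present in one month but not another would otherwise come out as NaN. A small sketch with invented counts:**

```python
import pandas as pd

# invented per-month counts; "10.1/b" appears only in the first month
month_1 = pd.Series({"10.1/a": 3, "10.1/b": 1})
month_2 = pd.Series({"10.1/a": 2, "10.1/c": 5})

plain = month_1.add(month_2)                 # labels missing on one side become NaN
filled = month_1.add(month_2, fill_value=0)  # labels missing on one side count as 0

print(plain)
print(filled)
```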

**Note that some DOIs are invalid, due to typos from the Sci-Hub users or, in the case of 10.1182/asheducation-2015.1.8, because a website listed the wrong DOI.**
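
**A rough way to flag the malformed ones is a pattern check. This is a minimal sketch that assumes a loose pattern for modern DOIs (`10.`, four to nine digits, a slash, then a suffix). It tests syntax only, so a well-formed but wrong DOI like the one above still passes. `looks_like_doi` and the two typo examples are hypothetical.**

```python
import re

# Loose syntax check: "10.", 4-9 digits, "/", then a non-empty suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:a-z0-9]+$", re.IGNORECASE)

def looks_like_doi(doi):
    return isinstance(doi, str) and DOI_PATTERN.match(doi.strip()) is not None

candidates = [
    "10.1007/978-1-4419-9716-6_11",   # appears in this notebook
    "10.1182/asheducation-2015.1.8",  # well-formed, even though it is the wrong DOI
    "10.1016/",                       # hypothetical typo: empty suffix
    "doi:10.1016/abc",                # hypothetical typo: stray scheme
]
for doi in candidates:
    print(doi, looks_like_doi(doi))
```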

In [None]:
data[:100].astype(int).to_csv("top100_downloads_by_DOI.csv", header = True)

In [None]:
import requests, json
from pandas import json_normalize  # top-level in pandas 1.0 and later

def get_top25_metadata(doi):
    fields = ["title", "type", "publisher","container-title", "subject", "published-print.date-parts"]
    metadata = dict([(i, None) for i in fields])
    metadata["doi"] = doi
    try:
        url = "http://dx.doi.org/" + doi
        headers = {"accept": "application/citeproc+json"}
        r = requests.get(url, headers = headers)
        full_metadata = json_normalize(json.loads(r.text))
        for i in fields:
            if i in full_metadata.columns:
                metadata[i] = full_metadata.iloc[0][i]
    except Exception:
        # on any request or parsing failure, return the empty metadata
        pass
    return metadata

In [None]:
get_top25_metadata("10.1007/978-1-4419-9716-6_11")

In [None]:
top25 = pd.DataFrame([get_top25_metadata(i) for i in data[:25].index.values])
top25

In [None]:
top25.to_csv("top25_papers.csv")

**Now let's look at the geography of Sci-Hub downloads.**

In [None]:
def get_country_data(month):
    path = "scihub_data/"
    filename = month + ".tab"
    with open(path + filename, "r") as f:
        data = pd.read_table(f)
        data.columns = ["date", "doi", "IP_code", "country", "city", "coords"]
        data = data[["country", "date"]]
        data = data.groupby("country").count()
    return data

In [None]:
data_list = [get_country_data(month) for month in months]

In [None]:
by_country = data_list[0]
for i in data_list[1:]:
    # fill_value keeps countries that are missing from some months
    by_country = by_country.add(i, fill_value = 0)
data_list = None
by_country = by_country.sort_values(by = "date", ascending = False)
by_country.head()

In [None]:
by_country.columns = ["downloads"]

In [None]:
by_country.to_csv("downloads_by_country.csv")

**Now to get downloads by lat/lon coordinates...**

In [None]:
def get_cities(month):
    filepath = "scihub_data/" + month + ".tab"
    with open(filepath, "r") as f:
        data = pd.read_table(f)
        data.columns = ["date","doi","IP_code","country","city","coords"]
        data = data.groupby(["coords","city","country"]).count()
        data = data.rename(columns = {"date":"downloads"})
        return data["downloads"]

In [None]:
downloads_by_coords = get_cities(months[0])
for month in months[1:]:
    print(month)
    downloads_by_coords = downloads_by_coords.add(get_cities(month), fill_value = 0)

In [None]:
len(downloads_by_coords)

**So there are about 22,000 locations with unique triples of (lat/lon, city name, country name).**

In [None]:
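downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["coords","city","country","downloads"]
# save to disk: the mapping section further down reloads this file
downloads_by_coords.to_csv("downloads_by_coords.csv", index = False)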

In [None]:
downloads_by_coords.head()

**Some coords cluster to the same city/country. Let's find them...**

In [None]:
by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["city","country"])]
coord_dupes
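
**Note that `duplicated` with its default `keep = "first"` flags only the second and later rows of each group, while `keep = False` flags every member of the group. A sketch on invented rows:**

```python
import pandas as pd

# invented rows: two coords share the same city and country
demo = pd.DataFrame({
    "coords": ["1,1", "1,2", "5,5"],
    "city": ["York", "York", "Akron"],
    "country": ["United States", "United States", "United States"],
})

later_only = demo[demo.duplicated(subset=["city", "country"])]
whole_group = demo[demo.duplicated(subset=["city", "country"], keep=False)]
print(len(later_only), len(whole_group))
```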

**Yeah, look at that. Many Sci-Hub user IP addresses clustered to different lat/lon coordinates within the same city, probably because the Google Maps API treats big cities as supersets of several smaller cities. We'll need to pull out the lower-level "administrative_area_level" names using Google Maps...**

**From here to the end of this Notebook is data-wrangling I needed to build the map that features in the Science article. You probably don't need any of this for your own analyses. The code above gets you to the starting point.**

In [None]:
import requests, json
from pandas import json_normalize  # top-level in pandas 1.0 and later

# You will need to register with Google Maps and get your own free API key
# https://developers.google.com/maps/documentation/javascript/get-api-key
API_KEY = "AjZaSyCaecKeKEr9NEv4zXaPzVVSds1FLTrtM4x"

def get_admin(coords, country):
    url = "https://maps.googleapis.com/maps/api/geocode/json?key=%s&latlng=%s" %(API_KEY, coords)
    r = requests.get(url)
    data = json_normalize(json.loads(r.text)["results"], "address_components")
    if country == "United States":
        level = "administrative_area_level_1"
    else:
        level = "administrative_area_level_4"
    return data[data.types.map(lambda x: level in x)]["short_name"].iloc[0]

In [None]:
get_admin("-15.813415,-48.1044183", "Brazil")
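
**For readers without an API key, here is a sketch of what `get_admin` picks out. The `sample` dictionary is a hand-written, heavily trimmed stand-in shaped like a reverse-geocoding reply (a `results` list whose items carry `address_components` with `short_name` and `types`). Its values are invented, and `first_component` is a hypothetical helper.**

```python
# Hand-written stand-in for a reverse-geocoding reply (values invented)
sample = {
    "status": "OK",
    "results": [
        {"address_components": [
            {"long_name": "Sterling", "short_name": "Sterling",
             "types": ["locality", "political"]},
            {"long_name": "Virginia", "short_name": "VA",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "United States", "short_name": "US",
             "types": ["country", "political"]},
        ]},
    ],
}

def first_component(response, level):
    """Return the short_name of the first component tagged with `level`."""
    for result in response.get("results", []):
        for component in result.get("address_components", []):
            if level in component.get("types", []):
                return component["short_name"]
    return None  # nothing matched, e.g. an empty or error response

print(first_component(sample, "administrative_area_level_1"))
print(first_component(sample, "administrative_area_level_4"))
```

Returning `None` instead of raising makes a missing level easy to tell apart from a network failure.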

In [None]:
with open("downloads_by_coords.csv") as f:
    by_coords = pd.read_csv(f)
by_coords.head()

In [None]:
ambiguous = by_coords[by_coords.duplicated(["city","country"], keep = False)].copy()
ambiguous

**So we have 2106 city names duplicated within countries...**

In [None]:
ambiguous[ambiguous.city == "Sterling"]

**For the US, we want the state, which is administrative_area_level_1 in the Google Maps JSON.**

In [None]:
get_admin("39.0026518,-77.3956004", "United States")

In [None]:
US_cities = ambiguous[ambiguous.country == "United States"].copy()
US_coords = US_cities["coords"].tolist()
get_admin(US_coords[0], "United States")

In [None]:
US_states = dict()

In [None]:
import sys

total = len(US_coords)
fails = list()
for n,i in enumerate(US_coords):
    sys.stdout.write("%s\t%s" %(total - n, len(fails)))
    if i not in US_states:
        try:
            US_states[i] = get_admin(i, "United States")
        except Exception:
            fails.append(i)
    sys.stdout.flush()
    sys.stdout.write('\r')
print("%s\t%s" %(total - n, len(fails)))

In [None]:
len(set(US_states.values()))

**That takes care of the US states. Now to deal with non-US cities.**

In [None]:
ambiguous["admin"] = pd.Series()

In [None]:
ambiguous.tail()

In [None]:
non_US_cities = ambiguous[ambiguous.country != "United States"].copy()
non_US_coords = non_US_cities["coords"].tolist()
len(non_US_coords)

**Not too bad. Just 363 non-US locations to deal with.**

In [None]:
import requests, json
from pandas import json_normalize  # top-level in pandas 1.0 and later

# You will need to register with Google Maps and get your own free API key
# https://developers.google.com/maps/documentation/javascript/get-api-key
API_KEY = "AjZaSyCaecKeKEr9NEv4zXaPzVVSds1FLTrtM4x"

def get_non_US_admin(coords):
    url = "https://maps.googleapis.com/maps/api/geocode/json?key=%s&latlng=%s" %(API_KEY, coords)
    r = requests.get(url)
    response = json.loads(r.text)
    # pass the quota status through so the calling loop can detect it and stop
    if response.get("status") == "OVER_QUERY_LIMIT":
        return "OVER_QUERY_LIMIT"
    data = json_normalize(response["results"], "address_components")
    # try the most specific administrative level first, then fall back
    for level in ("administrative_area_level_4", "administrative_area_level_3",
                  "administrative_area_level_2", "locality"):
        match = data[data.types.map(lambda x: level in x)]
        if len(match) > 0:
            return match["short_name"].iloc[0]
    raise IndexError("no administrative area or locality found for " + coords)

get_non_US_admin("65.0120888,25.4650773")

In [None]:
non_US_cities = dict()

**I started running into the 2500 queries/day limit, requiring multiple fresh API keys. So I'm keeping track of that with error messaging below. Loading the results into a dict as we go allows this to fail gracefully and always pick up where it left off.**

In [None]:
import time

total = len(non_US_coords)
fails = list()
for n,i in enumerate(non_US_coords):
    time.sleep(0.15)
    sys.stdout.write("%s\t%s" %(total - n, len(fails)))
    if i not in non_US_cities:
        try:
            result = get_non_US_admin(i)
            if result == "OVER_QUERY_LIMIT":
                print("OVER_QUERY_LIMIT")
                break
            else:
                non_US_cities[i] = result
        except Exception:
            fails.append(i)
    sys.stdout.flush()
    sys.stdout.write('\r')
print("%s\t%s" %(total - n, len(fails)))
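
**The dict only survives as long as the kernel does. Here is a minimal sketch of stretching the same pick-up-where-you-left-off idea across sessions by writing the dict to disk after each successful lookup. `geocode_cache.json`, `load_cache`, `save_cache`, `fill_cache`, and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for the real API call.**

```python
import json
import os
import tempfile

def load_cache(path):
    """Load previously saved results, or start empty."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return {}

def save_cache(cache, path):
    with open(path, "w") as f:
        json.dump(cache, f)

def fake_lookup(coords):
    # stand-in for get_non_US_admin(); returns a dummy label
    return "area for " + coords

def fill_cache(all_coords, path, lookup):
    cache = load_cache(path)
    for coords in all_coords:
        if coords in cache:
            continue  # already done in an earlier run
        cache[coords] = lookup(coords)
        save_cache(cache, path)  # persist progress after every success
    return cache

path = os.path.join(tempfile.mkdtemp(), "geocode_cache.json")
first = fill_cache(["65.0120888,25.4650773"], path, fake_lookup)
second = fill_cache(["65.0120888,25.4650773", "45.83316,5.096755"], path, fake_lookup)
print(len(first), len(second))
```

Writing after every lookup is wasteful for large jobs, but for a few hundred coordinates it keeps the code simple and loses nothing when a key runs out of quota.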

In [None]:
coords_and_areas = US_states.copy()
for k,v in non_US_cities.items():
    coords_and_areas[k] = v
len(coords_and_areas)

In [None]:
coords = "45.83316,5.096755"
url = "https://maps.googleapis.com/maps/api/geocode/json?key=%s&latlng=%s" %(API_KEY, coords)
r = requests.get(url)
data = json_normalize(json.loads(r.text)["results"], "address_components")
data[data.types.map(lambda x: "locality" in x)]["short_name"].iloc[0]

In [None]:
dupes = pd.Series(non_US_cities)[pd.Series(non_US_cities).duplicated(keep=False)].sort_values()
dupes

In [None]:
def get_locality(coords):
    url = "https://maps.googleapis.com/maps/api/geocode/json?key=%s&latlng=%s" %(API_KEY, coords)
    r = requests.get(url)
    data = json_normalize(json.loads(r.text)["results"], "address_components")
    return data[data.types.map(lambda x: "locality" in x)]["short_name"].iloc[0]

In [None]:
get_locality(dupes.index[2])

In [None]:
localities = dict()

In [None]:
fails = list()
for coords in dupes.index.tolist():
    time.sleep(0.15)
    localities[coords] = get_locality(coords)

In [None]:
coords_and_areas.update(localities)
len(coords_and_areas)

In [None]:
missing = [i for i in ambiguous.coords if i not in coords_and_areas]
len(missing)

In [None]:
missing_areas = dict()

In [None]:
for i in missing:
    time.sleep(0.15)
    missing_areas[i] = get_non_US_admin(i)

In [None]:
coords_and_areas.update(missing_areas)
len(coords_and_areas)

In [None]:
coords_and_areas = pd.Series(coords_and_areas, name = "coords")

In [None]:
disambiguated_cities = ambiguous.join(coords_and_areas, on="coords", how="left", rsuffix="_new").sort_values(by="city")
disambiguated_cities = disambiguated_cities.rename(columns = {"coords_new":"admin_area"})
disambiguated_cities.head()

In [None]:
disambiguated_cities.groupby(["coords","country","city","admin_area"]).sum()

**OK! All 2106 ambiguous cities are now accounted for.**

In [None]:
disambiguated_cities.to_csv("disambiguated_cities.csv")

**Now let's put it all together for mapping...**

In [None]:
disambiguated_cities = pd.read_csv("disambiguated_cities.csv", index_col = 0)
disambiguated_cities.head()

In [None]:
by_coords = by_coords.join(disambiguated_cities.admin_area, how="left")
by_coords[by_coords.admin_area.notnull()]

In [None]:
stragglers = by_coords[by_coords.duplicated(subset = ["city", "country", "admin_area"], 
                                            keep = False)].sort_values(by = "city")
stragglers

**There are still 629 coords that have identical (country, city, admin_area).**

In [None]:
import requests, json
from pandas import json_normalize  # top-level in pandas 1.0 and later

# You will need to register with Google Maps and get your own free API key
# https://developers.google.com/maps/documentation/javascript/get-api-key
API_KEY = "AjZaSyCaecKeKEr9NEv4zXaPzVVSds1FLTrtM4x"

def get_google_maps_data(coords):
    url = "https://maps.googleapis.com/maps/api/geocode/json?key=%s&latlng=%s" %(API_KEY, coords)
    r = requests.get(url)
    data = json_normalize(json.loads(r.text)["results"], "address_components")
    level_dict = dict([(i, None) for i in ("level_1", "level_2", "level_3", "level_4")])
    for level in level_dict:
        try:
            level_dict[level] = data[data.types.map(lambda x: "administrative_area_" + level in x)]["short_name"].iloc[0]
        except:
            pass
    return pd.Series(level_dict)

In [None]:
get_google_maps_data("39.9911093,-76.6699169")

In [None]:
get_google_maps_data("39.9625984,-76.727745")

**And this makes it clear that it is not even possible to disambiguate some of these. Google Maps does not have specific administrative level information that we could use to distinguish them by name. Well, how far apart geographically are these ambiguous coords from the others that share their city name?...**

In [None]:
from math import sin, cos, sqrt, atan2, radians

# radius of earth (km)
R = 6373.0

def geo_distance(coords1, coords2):
    lat1, lon1 = [radians(float(i)) for i in coords1.split(",")]
    lat2, lon2 = [radians(float(i)) for i in coords2.split(",")]
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance

In [None]:
geo_distance("39.9911093,-76.6699169","39.9617415,-76.7471062")
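
**For reference, `geo_distance` is the haversine formula, with latitudes $\varphi$ and longitudes $\lambda$ in radians and $R$ the Earth radius set in the cell above:**

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right)
```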

**Even 7 km is far enough to make the map wonky. So, executive decision: I'm just going to give the duplicates generic names such as "area 1 near York, PA, United States"... Expediency!**

In [None]:
def get_generic_area_name(map_name, n):
    return "area %s near %s" %(n, map_name)

get_generic_area_name("York, PA, United States", 2)

In [None]:
name_dict = dict()

def fix_stragglers(df):
    total = len(stragglers)
    x, y, z = df.iloc[0][["city","admin_area", "country"]]
    for n, c in enumerate(df.coords.tolist()):
        if z == "United States":
            name_dict[c] = get_generic_area_name("%s, %s, %s" %(x, y, z), n + 1)
        else:
            name_dict[c] = get_generic_area_name("%s, %s" %(x, z), n + 1)

In [None]:
x, y, z = ("United States", "Akron", "OH")
akron = stragglers[(stragglers.country == x) & (stragglers.city == y) & (stragglers.admin_area == z)]
akron

In [None]:
fix_stragglers(akron)
name_dict

In [None]:
for i in stragglers[["coords","city","admin_area", "country"]].values:
    x = i
print(x)

In [None]:
for i in stragglers[["city","admin_area", "country"]].values:
    x, y, z = i
    df = stragglers[(stragglers.city == x) & (stragglers.admin_area == y) & (stragglers.country == z)]
    fix_stragglers(df)
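
**The same numbering can be done without the explicit loop. This is a sketch on invented rows using `groupby(...).cumcount()`. `dropna = False` (pandas 1.1 and later) keeps groups whose `admin_area` is missing. The name format mirrors `get_generic_area_name`, and for brevity only the US-style place string is built.**

```python
import pandas as pd

# invented rows: two York, PA locations and one Akron, OH location
demo = pd.DataFrame({
    "coords": ["39.99,-76.67", "39.96,-76.73", "41.08,-81.52"],
    "city": ["York", "York", "Akron"],
    "admin_area": ["PA", "PA", "OH"],
    "country": ["United States"] * 3,
})

# 1-based position of each row within its (city, admin_area, country) group
n = demo.groupby(["city", "admin_area", "country"], dropna=False).cumcount() + 1

# same "area N near ..." pattern as get_generic_area_name
place = demo["city"] + ", " + demo["admin_area"] + ", " + demo["country"]
demo["map_name"] = "area " + n.astype(str) + " near " + place
print(demo["map_name"].tolist())
```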

In [None]:
def make_name(coords):
    return name_dict[coords]

stragglers["map_name"] = stragglers.coords.apply(make_name)
stragglers

In [None]:
by_coords.head()

In [None]:
def get_map_name(admin_area, city, country):
    if type(admin_area) != str or admin_area == city:
        return "%s, %s" %(city, country)
    elif country == "United States":
        return "%s, %s, %s" %(city, admin_area, country)
    else:
        return "%s, %s, %s" %(admin_area, city, country)

by_coords["map_name"] = by_coords.apply(lambda x: get_map_name(x["admin_area"], x["city"], x["country"]), axis=1)
by_coords.head()

In [None]:
len(by_coords)

In [None]:
by_coords[by_coords.admin_area.notnull()]

In [None]:
still_straggling = by_coords[by_coords.duplicated(subset=["country","city","admin_area"], keep=False)].sort_values(by="city")
still_straggling

In [None]:
name_dict = dict()

def fix_final_stragglers(df):
    total = len(stragglers)
    x, y, z = df.iloc[0][["city","admin_area", "country"]]
    for n, c in enumerate(df.coords.tolist()):
        if z == "United States":
            name_dict[c] = get_generic_area_name("%s, %s, %s" %(x, y, z), n + 1)
        else:
            name_dict[c] = get_generic_area_name("%s, %s" %(x, z), n + 1)

In [None]:
for i in still_straggling[["city","admin_area", "country"]].values:
    x, y, z = i
    df = still_straggling[(still_straggling.city == x) & (still_straggling.admin_area == y) & (still_straggling.country == z)]
    fix_final_stragglers(df)

In [None]:
name_dict

In [None]:
def make_name(coords):
    return name_dict[coords]

still_straggling["map_name"] = still_straggling.coords.apply(make_name)
still_straggling

**Finally, I noticed that there are still duplicated records for 6 pairs of coords. Must fix these:**

In [None]:
dupes = by_coords[by_coords.duplicated(subset = ["coords"], keep = False)].sort_values(by = "coords")
dupes

In [None]:
fixed_dupes = dupes[dupes.duplicated(subset = ["coords"], keep = "last")]
fixed_dupes = fixed_dupes.set_index("coords")
other = dupes[dupes.duplicated(subset = ["coords"], keep = "first")].set_index("coords")
fixed_dupes.downloads = fixed_dupes.downloads.add(other.downloads, fill_value = 0)
fixed_dupes

In [None]:
len(by_coords)

In [None]:
by_coords = by_coords.drop(dupes.index)
len(by_coords)

In [None]:
by_coords = by_coords.set_index("coords")
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
by_coords = pd.concat([by_coords, fixed_dupes])
len(by_coords)

**Ready for mapping!**

**Finally, let's do some frequency analysis...**

In [None]:
def get_days(month):
    filepath = "scihub_data/" + month + ".tab"
    with open(filepath, "r") as f:
        data = pd.read_table(f)
        data.columns = ["date","doi","IP_code","country","city","coords"]
        data = data[["date","doi"]]
        # zero-padded ISO dates sort chronologically even as strings
        data.date = pd.to_datetime(data.date).dt.strftime("%Y-%m-%d")
        data = data.groupby("date").count()
        data = data.reset_index()
        data = data.rename(columns = {"doi":"downloads"})
        return data

In [None]:
by_day = pd.concat([get_days(month) for month in months])
by_day = by_day.groupby("date").sum()
by_day.to_csv("downloads_by_day.csv")

In [None]:
by_day = by_day[:164].copy()
by_day.tail()

In [None]:
# after the groupby above, the dates are the index, not a column
by_day["day"] = pd.to_datetime(by_day.index).map(lambda x: x.isoweekday())
by_day.head()

In [None]:
%matplotlib inline

by_day.groupby("day").sum().downloads.plot(kind="bar")

**TUESDAY is the busiest day for Sci-Hub. Well how about that?**
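
**One caveat: the bar chart sums downloads, and a 164-day window does not hold the same number of each weekday (164 = 23 × 7 + 3), so three weekdays get an extra day. A sketch of comparing the mean per weekday instead, on invented numbers; the column names follow the cells above.**

```python
import pandas as pd

# invented daily counts over 10 consecutive days starting on a Monday
days = pd.date_range("2015-09-07", periods=10, freq="D")
by_day_demo = pd.DataFrame({"downloads": [10, 12, 10, 10, 10, 5, 5, 10, 12, 10]},
                           index=days)
by_day_demo["day"] = by_day_demo.index.dayofweek + 1  # 1 = Monday ... 7 = Sunday

totals = by_day_demo.groupby("day")["downloads"].sum()
means = by_day_demo.groupby("day")["downloads"].mean()
print(totals)
print(means)
```

In this toy data Monday's total is double Thursday's only because Monday occurs twice; the means are equal.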