![dsl_logo](dsl_logo.png)


# Sci Hub Usage in Niagara
## A Data Science case study

This tutorial is meant to give you an introduction to the main ideas behind data science by analyzing usage logs of the Sci-Hub website in the Niagara region using the Python programming language. This tutorial is presented in a Jupyter notebook that blends code into web pages. Please feel free to run through this on your own.

Jupyter Notebooks are pretty easy to use. They have code 'cells' that allow you to enter and run code. Let's demonstrate. Click in the box below and hit the _Run_ button in the menu above, or the play button on the left side of the cell (if you're using Google Colab, it looks like a circle with a triangle in it).

In [None]:
#Let's just print a basic message
print("Welcome to our Data Science Tutorial")

## Background info


![scihub_log](https://upload.wikimedia.org/wikipedia/fr/c/c4/Sci-Hub_logo.png)

Sci-Hub is a resource that a person can use to download academic PDFs. It is somewhat controversial, however. Periodically, the owner of the site releases usage logs that curious people like us can review. The most recent [log file](https://zenodo.org/record/1158301) is from 2017. We are going to explore this data to see if we can spot anything interesting. At the same time, we're going to learn some [Python](https://www.python.org/), in particular the [pandas](https://pandas.pydata.org/) library and a visualization library called [Matplotlib](https://matplotlib.org/).


## Loading the Libraries and the data

Our first step is to get Python ready and to load the data file. The next cell will take care of that. We'll also look at the first 10 lines of our data file.

In [None]:
#Our Libraries
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

#Loading our Data
data = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/SciHub_Workshop/master/niagara_scihub_2017_use.tab",sep="\t")

#Tell pandas what is in the data
data.columns = ["date","doi","pub_code","user_code","country","city","lat","long"]
data = data.sort_values(by = "date", ascending = False)

#Let's look at the first ten lines of the data
data.head(10)
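
If you'd like to see what `read_csv` does with tab-separated text without downloading anything, here is a minimal sketch. The three rows are made up for illustration (they are not real Sci-Hub data), and the sketch assumes a file with no header row, so it passes the column names directly with `names=` instead of renaming the columns afterwards.

```python
import io
import pandas as pd

# Made-up rows in the same tab-separated layout; not real Sci-Hub data.
toy_tsv = (
    "2017-01-02\t10.1000/aaa\tpub1\tuser1\tCanada\tThorold\t43.12\t-79.20\n"
    "2017-03-05\t10.1000/bbb\tpub2\tuser2\tCanada\tWelland\t42.99\t-79.25\n"
    "2017-02-01\t10.1000/aaa\tpub1\tuser2\tCanada\tWelland\t42.99\t-79.25\n"
)

columns = ["date", "doi", "pub_code", "user_code", "country", "city", "lat", "long"]

# header=None says there is no header row; names= supplies the column labels.
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None, names=columns)

# Newest downloads first, as in the cell above.
toy = toy.sort_values(by="date", ascending=False)
print(toy.head(2))
```

`io.StringIO` just lets pandas read a string as if it were a file, which is handy for small experiments like this one.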

## Looking at the data

Let's look at the different columns in our data.


- _date_ - The date the article was downloaded
- _doi_ - Something like the serial number of the article ([more info](https://apastyle.apa.org/learn/faqs/what-is-doi))
- _pub_code_ - A randomized serial number that represents the publisher behind the article
- _user_code_ - A randomized serial number that represents the user who downloaded the article
- _country_ - The country the usage is from (the original data file is global)
- _city_ - Which city in Niagara the user lives in
- _lat_ - The latitude of the center of the city found in _city_
- _long_ - The longitude of the center of the city found in _city_


Don't worry, the data is totally randomized!


## Some general questions about averages

Let's ask some basic questions about what is in our data.

### How many entries are in our datafile?

We just apply the `len` function to our dataframe.

In [None]:
total_papers = len(data)
print(total_papers)

### How many unique users are in the data?

We'll select the _user_code_ column and see how many unique values there are with `nunique()`

In [None]:
unique_users = data["user_code"].nunique()
print(unique_users)

### How many unique papers are in the data?

We'll select the _doi_ column and see how many unique values there are with `nunique()`

In [None]:
unique_papers = data["doi"].nunique()
print(unique_papers)

### How many unique publishers are in the data?

Same as before but with the _pub_code_ column.


In [None]:
unique_publishers = data["pub_code"].nunique()
print(unique_publishers)
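
The difference between `len` and `nunique()` is easier to see on a tiny table. Here is a small sketch with a made-up download log (the codes are invented, not taken from our data file): five rows, but only three different users and two different papers.

```python
import pandas as pd

# Made-up download log: five rows, three users, two papers.
toy = pd.DataFrame({
    "user_code": ["u1", "u2", "u1", "u3", "u1"],
    "doi": ["10.1000/aaa", "10.1000/aaa", "10.1000/bbb", "10.1000/aaa", "10.1000/aaa"],
})

toy_total = len(toy)                      # every row counts as one download
toy_users = toy["user_code"].nunique()    # repeated users are counted once
toy_papers = toy["doi"].nunique()         # repeated papers are counted once

print(toy_total, toy_users, toy_papers)
```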

---
Now run the next cell to figure out some averages.

In [None]:
print("Average papers downloaded per user: ", total_papers / unique_users)

--- 
Can you come up with other interesting averages?
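
If you're stuck, here is one possible direction, sketched on a made-up table so the numbers are easy to check by hand. The same two lines should work on `data` with the variables we computed above, but the values below are purely illustrative.

```python
import pandas as pd

# Made-up log: six downloads, three papers, two publishers.
toy = pd.DataFrame({
    "user_code": ["u1", "u2", "u1", "u3", "u1", "u2"],
    "doi": ["d1", "d1", "d2", "d1", "d3", "d2"],
    "pub_code": ["p1", "p1", "p2", "p1", "p2", "p2"],
})

# Average number of times each paper was downloaded.
downloads_per_paper = len(toy) / toy["doi"].nunique()

# Average number of downloads each publisher "lost".
downloads_per_publisher = len(toy) / toy["pub_code"].nunique()

print("Average downloads per paper: ", downloads_per_paper)
print("Average downloads per publisher: ", downloads_per_publisher)
```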

---

##  Lost Revenue?

If each paper cost *30* dollars on average, how much revenue was 'lost'? In Python, `*` is the multiplication operator. What if each paper cost *50* dollars on average?

In [None]:
cost_per_article = 30
lost_revenue = cost_per_article * total_papers

print("Approximately",lost_revenue,"dollars would be lost.")
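
To compare a few prices at once, you could wrap the multiplication in a small function. This is only a sketch: `estimate_lost_revenue` is a name made up for this example, and the download count of 1000 is a placeholder, not the real total from our data.

```python
def estimate_lost_revenue(number_of_papers, cost_per_article):
    """Hypothetical helper: price per paper times number of downloads."""
    return cost_per_article * number_of_papers

# Placeholder download count, used only for illustration.
example_total_papers = 1000

for price in (30, 50):
    estimate = estimate_lost_revenue(example_total_papers, price)
    print("At", price, "dollars per paper, approximately", estimate, "dollars would be lost.")
```

In the notebook you would pass `total_papers` instead of the placeholder.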

---

## Most popular 


### Which paper has been downloaded the most

This is a bit more complex. We first need to `groupby()` the _doi_ column, then `count()` the rows in each group, and finally sort the result to go from most to least. The result may look a bit odd because `count()` is applied to every remaining column, so each row shows one count per column. In our case the counts are the same in every column, so any of them answers our question.


In [None]:
top_article = data.groupby("doi").count().sort_values(by = "date", ascending = False)
top_article
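
If the repeated columns bother you, `size()` is a related pandas method that returns just one number per group. Here is a minimal sketch on made-up data (the DOIs and user codes are invented) that shows both side by side.

```python
import pandas as pd

# Made-up log: paper d1 is downloaded three times, d2 twice, d3 once.
toy = pd.DataFrame({
    "doi": ["d1", "d2", "d1", "d3", "d1", "d2"],
    "user_code": ["u1", "u2", "u3", "u1", "u2", "u1"],
})

# count() tallies the non-missing values in every remaining column.
per_column = toy.groupby("doi").count()

# size() returns a single number of rows per group.
downloads = toy.groupby("doi").size().sort_values(ascending=False)

print(per_column)
print(downloads)
```

The two agree whenever there are no missing values, which is why any column works as the answer in the cell above.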

Holy cats! One article was downloaded *a lot*. Have a look at the [article](https://dx.doi.org/10.1071/CH06322) and check the stated price to access it.

### Which user has downloaded the most papers

Let's use the same rationale and figure out who our busiest users were. Notice the same behaviour with `count()` shows up there too.

In [None]:
top_users = data.groupby("user_code").count().sort_values(by = "date", ascending = False)
top_users

Wow, there have been some busy users in the data!

---

## Location questions

Which cities in Niagara used SciHub the most?

We'll apply `groupby()` to the _city_ column and get a `count()`

In [None]:
data.groupby("city").count()

We see that St. Catharines did the most downloading by a good margin.
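
The table above lists the cities alphabetically, so you have to scan it to find the biggest number. One way to put the busiest city on top is to pick a single column and sort it. A small sketch with made-up rows:

```python
import pandas as pd

# Made-up log; the counts here have nothing to do with the real data.
toy = pd.DataFrame({
    "city": ["Thorold", "Welland", "Thorold", "Grimsby", "Thorold", "Welland"],
    "doi": ["d1", "d2", "d3", "d1", "d2", "d1"],
})

# Count the doi column per city, then sort from most to least.
city_counts = toy.groupby("city")["doi"].count().sort_values(ascending=False)
print(city_counts)
```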

## Visualize some results

Let's draw some charts with our data to see if we can spot any other interesting details.

### How many papers do users download?

We'll plot a [histogram](https://en.wikipedia.org/wiki/Histogram) of how many papers each user downloaded. Explore different values for `bins` to see if you can get a better graph.

In [None]:
#Let's get the data we need in a new `dataframe`
user_downloads = data.groupby("user_code").count().sort_values( by = "doi",ascending = False).doi

#how many different values on the x-axis we'll use for the data.
bins = 200

#Now we plot it all out
plt.hist(user_downloads, bins)
plt.ylabel("Users")
plt.xlabel("Downloads")
plt.title("Downloads per user")
plt.show()
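
To get a feel for what `bins` does, it can help to look at the numbers behind a histogram. The sketch below uses NumPy's `histogram` function on made-up download counts: it splits the range of the data into equal-width bins and counts how many values land in each one.

```python
import numpy as np

# Made-up downloads per user: seven light users and one very heavy one.
toy_downloads = [1, 1, 2, 2, 3, 3, 4, 100]

# With 3 bins, all the light users end up lumped together in the first bin.
counts_coarse, edges_coarse = np.histogram(toy_downloads, bins=3)
print(counts_coarse, edges_coarse)

# More bins means narrower bins, but the range is still set by the largest value.
counts_fine, edges_fine = np.histogram(toy_downloads, bins=10)
print(counts_fine)
```

A single very large value stretches the x-axis, which is one reason changing `bins` can make such a difference to how readable the graph is.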

As we saw above, the average number of papers per user is 16 and change. We can see with this graph, however, that the data does not follow a [normal distribution](https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/2-mean-and-standard-deviation).
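
When data is lopsided like this, the average can be misleading, and the median (the middle value) is often worth a look too. Here is a minimal sketch with made-up download counts, not our real data, where one heavy user drags the average far away from what a typical user does.

```python
import pandas as pd

# Made-up downloads per user: most download a few papers, one downloads 100.
toy_downloads = pd.Series([1, 1, 2, 2, 3, 3, 4, 100])

toy_mean = toy_downloads.mean()      # pulled up by the one heavy user
toy_median = toy_downloads.median()  # the middle of the sorted values

print("mean:", toy_mean)
print("median:", toy_median)
```

On the real data you could try `user_downloads.median()` and compare it with the average we computed earlier.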

### What's popular in Thorold

Let's graph the number of papers per user, but just in Thorold. For fun we'll make it look like an [XKCD](https://xkcd.com/) cartoon.

In [None]:
#Let's get the data we need in a new `dataframe`
thorold_downloads = data[data["city"] == "Thorold"]
thorold_downloads = thorold_downloads.groupby("user_code").count().sort_values( by = "doi",ascending = False).doi

#how many different values on the x-axis we'll use for the data.
bins = 50

with plt.xkcd():
    #Now we plot it all out
    plt.hist(thorold_downloads[0:100], bins)
    plt.ylabel("Users")
    plt.xlabel("Downloads")
    plt.title("Thorold Downloads per user")
    plt.show()
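
The line `data[data["city"] == "Thorold"]` does its work in two steps, and it can help to see them separately. A small sketch on made-up rows: the comparison produces a column of `True`/`False` values, and putting that inside the square brackets keeps only the `True` rows.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "city": ["Thorold", "Welland", "Thorold", "Grimsby"],
    "user_code": ["u1", "u2", "u3", "u4"],
})

# Step 1: compare every value in the city column with "Thorold".
is_thorold = toy["city"] == "Thorold"
print(is_thorold.tolist())

# Step 2: keep only the rows where the comparison was True.
thorold_only = toy[is_thorold]
print(thorold_only)
```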

### Everyone likes pie

In our last example we'll draw a pie chart of the top 5 cities in the data, 'cause everybody loves pie. We'll use `value_counts()` to count how many times each city shows up in the data, sorted from most to least. We apply the slice operator `[0:5]`, which grabs only the first five values, and we take our labels from the index of that result so that each label matches its slice. Let's also make it look like XKCD, 'cause why not.

In [None]:
cities = data["city"].value_counts()[0:5]
#The index of `cities` holds the city names in the same order as the counts
city_labels = cities.index

with plt.xkcd():
    plt.pie(cities,labels=city_labels)
    plt.title("Top 5 Cities that 'download'")
    plt.show()
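
A note on label order: `value_counts()` sorts cities from most to least frequent, while `unique()` returns them in the order they first appear in the data, so the two need not line up. Here is a small sketch with made-up city names that shows the difference.

```python
import pandas as pd

# Made-up city column: Thorold appears 3 times, Grimsby 2, Welland 1.
toy_cities = pd.Series(["Welland", "Thorold", "Thorold", "Grimsby", "Thorold", "Grimsby"])

counts = toy_cities.value_counts()

labels_from_counts = list(counts.index)         # most frequent first
labels_from_unique = list(toy_cities.unique())  # order of first appearance

print(labels_from_counts)
print(labels_from_unique)
```

Taking the labels from the index of the counts keeps every pie slice paired with the right city name.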

# The End

Thanks for taking a look at our tutorial. Now you have the basics all taken care of. Here are some links you might find useful:

- [Introduction to Python](https://brockdsl.github.io/Intro_to_Python_Workshop/) - Just like the name says, it's our first intro to Python workshop
- [Python Part 2: Introduction to Data Science](https://brockdsl.github.io/Python_2.0_Workshop/) - Dig into a bit more Python and find out how to use it to do some data science stuff
- [Machine Learning with Python](https://brockdsl.github.io/Machine_Learning_with_Python/) - Once you get the basics this workshop will run through how to make predictions with your data.
- [Workshop listings](https://experiencebu.brocku.ca/organization/dsl) - All of the workshops we host can be found on ExperienceBU or if you're not a student at Brock, we list everything on [Eventbrite](https://brockdsl.eventbrite.com) too
- [Python the Hard Way](https://learntocodetogether.com/learn-python-the-hard-way-free-ebook-download/) - Don't let the name fool you. This great resource will teach you all of the basics of Python.

Check out the [DSL website](https://brocku.ca/library/dsl) too. We're also on [Twitter](https://twitter.com/brock_dsl) and [Insta](https://www.instagram.com/brock_dsl).