# Wikimedia data

- [Wikimedia Downloads: Analytics Datasets](https://dumps.wikimedia.org/other/analytics/):
  information about pageviews, mediacounts and unique devices.
- [Pageviews since May 2015](https://dumps.wikimedia.org/other/pageviews/):
```
https://dumps.wikimedia.org/other/pageviews/[YEAR]/[YEAR]-[2-DIGIT-MONTH]/pageviews-YYYYMMDD-HHMMSS.gz
```

- [Siteviews interactive analysis](https://tools.wmflabs.org/siteviews/?platform=all-access&source=pageviews&agent=all-agents&start=2015-07&end=2017-09&sites=all-projects)
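
The pageviews URL pattern above can be expanded into concrete file URLs for a date range. The following is only a minimal sketch: `hourly_dump_urls` and `BASE_URL` are hypothetical names introduced here, and it assumes one dump file per hour, with minutes and seconds set to `0000` and a lowercase `.gz` extension. It is not the code used by `wikimedia_scraper`.

```python
from datetime import datetime, timedelta

# Assumed base location, taken from the pageviews link above.
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


def hourly_dump_urls(start, end):
    """Yield one pageviews dump URL per hour in the half-open range [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        # Assumption: files are named pageviews-YYYYMMDD-HH0000.gz
        yield (
            f"{BASE_URL}/{current:%Y}/{current:%Y-%m}/"
            f"pageviews-{current:%Y%m%d-%H}0000.gz"
        )
        current += timedelta(hours=1)


urls = list(hourly_dump_urls(datetime(2016, 10, 31), datetime(2016, 11, 1)))
print(len(urls))
print(urls[0])
```

Using a half-open range means consecutive ranges can be chained without fetching the boundary hour twice.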

## Running this notebook:

Dependencies:
- Bokeh
- Pandas

Enable the `widgetsnbextension` notebook extension:
```
$ jupyter nbextension enable --py --sys-prefix widgetsnbextension
```


In [None]:
# Use the new Wikipedia scraper and store the results in a DataFrame
import wikimedia_scraper as ws
from datetime import datetime
import pandas as pd

start_date = datetime(2016,10,31)
end_date  = datetime(2016,11,30)

ws.output_notebook()

traffic_generator = ws.get_traffic_generator(start_date, end_date)
df = pd.DataFrame(list(traffic_generator))

In [None]:
#df = df.set_index(['date'])
#df.index = pd.DatetimeIndex(df.date)

df = df.set_index(pd.DatetimeIndex(df['date']))
df = df.drop(['date'], axis=1)
df = df.loc[df['project']=='es']

df.describe()

In [None]:
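# Convert hits to float to avoid an inf value when computing the mean (number too large?)
df['hits']=df['hits'].astype(float)

# z-score of the hits, using the population standard deviation (ddof=0)
df["col_zscore"] = (df['hits'] - df['hits'].mean())/df['hits'].std(ddof=0)

# rolling mean of the z-score over one week of hourly samples (24*7),
# computed as soon as at least 3 observations are available
df["rolling"] = df['col_zscore'].rolling(window=24*7, min_periods=3).mean()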

df.head()
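
To make the two derived columns above concrete, here is a plain-Python sketch of what the z-score and rolling-mean expressions compute. The helper names `zscores` and `rolling_mean` and the sample `hits` values are made up for illustration. The sketch returns `None` where pandas would give `NaN`, and it does not reproduce how pandas treats missing values inside a window.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values using the population standard deviation (like std(ddof=0))."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def rolling_mean(values, window, min_periods):
    """Trailing mean over up to `window` items; None until `min_periods` items exist."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(mean(chunk) if len(chunk) >= min_periods else None)
    return out


# Illustrative values only, not real pageview counts.
hits = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = zscores(hits)
smoothed = rolling_mean(z, window=4, min_periods=3)
print(z)
print(smoothed)
```

In the notebook the window is `24*7`, so the smoothed curve only reflects a full week once 168 hourly rows have accumulated; before that it averages whatever rows exist, provided there are at least 3.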

In [None]:
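# Filter rows between two timestamps (inclusive).
# Note: these bounds lie outside the 2016-10-31 to 2016-11-30 range loaded above;
# adjust them to match the data actually loaded.
mask = (df.index >= '2017-05-22 15:00:00') & (df.index <= '2017-05-23 5:00:00')
df.loc[mask]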

# Now the plots


In [None]:
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure

from bokeh.models import DatetimeTickFormatter

In [None]:
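# x_axis_type="datetime" makes Bokeh treat the DatetimeIndex as dates on the x axis
p = figure(title="Wikipedia visits per hour", x_axis_type="datetime")

p.xaxis.formatter=DatetimeTickFormatter(
        hours=["%H:%M %d %B %Y"],  # %H is the hour; %h is the abbreviated month name
        days=["%d %B %Y"],
        months=["%d %B %Y"],
        years=["%d %B %Y"],
    )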

r = p.line(df.index, df['col_zscore'], color="#2222aa", line_width=1)
r = p.line(df.index, df['rolling'], color="red", line_width=1)

output_notebook()
show(p, notebook_handle=True)
push_notebook()
