# Wikipedia clickstream
The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. The referer is an HTTP header field that identifies the address of the webpage that linked to the requested resource. The data shows how people reach a Wikipedia article and which links they click from it. See the [documentation](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream).

This exploration uses the English Wikipedia file for March 2020 (`clickstream-enwiki-2020-03.tsv`), available from the [clickstream dumps](https://dumps.wikimedia.org/other/clickstream/). Files for other languages are available as well.

[Other Analytics Resources for Wikimedia](https://dumps.wikimedia.org/other/analytics/)

In [1]:
import pandas as pd
import re

In [2]:
df = pd.read_csv('../clickstream-enwiki-2020-03.tsv', delimiter='\t', header=None, names=['prev', 'curr', 'type', 'n'], usecols=[0, 1, 2, 3])
df.head()

index,prev,curr,type,n
0,other-empty,Kamasḥalta,external,11
1,other-search,Melanie_Windridge,external,64
2,other-empty,Melanie_Windridge,external,13
3,other-empty,Malaysia–Namibia_relations,external,15
4,other-search,Ding_Chao,external,19
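
Article titles can contain double quotes, and a few titles (such as `NaN`) look like missing-value markers to pandas, so the default `read_csv` settings may alter them. The following is a minimal sketch of a stricter load, not the method used in this notebook. The helper name `load_clickstream` and the three sample rows are hypothetical, and whether the March 2020 file is affected has not been checked here.

```python
import csv
import io

import pandas as pd

# Hypothetical three-row sample in the clickstream layout (prev, curr, type, n).
# The counts are made up for illustration.
sample_tsv = (
    'other-search\tNaN\texternal\t25\n'
    'other-empty\t"Heroes"_(David_Bowie_album)\texternal\t40\n'
    'Main_Page\tDeaths_in_2020\tlink\t12\n'
)


def load_clickstream(source):
    """Read a clickstream TSV without quote handling or NaN conversion."""
    return pd.read_csv(
        source,
        sep='\t',
        header=None,
        names=['prev', 'curr', 'type', 'n'],
        quoting=csv.QUOTE_NONE,   # treat double quotes as ordinary characters
        keep_default_na=False,    # keep titles such as "NaN" as strings
        dtype={'prev': str, 'curr': str, 'type': 'category', 'n': 'int64'},
    )


sample_df = load_clickstream(io.StringIO(sample_tsv))
print(sample_df)
```

Passing the path of the real file instead of the `io.StringIO` object should work the same way.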


## Top Articles

In [3]:
df.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:5]

curr,n
Main_Page,300944959
United_States_Senate,132753424
Hyphen-minus,55283299
2019–20_coronavirus_pandemic,30709138
Coronavirus,11074539


## Top Referers

[This post](https://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/) by Ellery Wulczyn (Data Scientist @ WMF) states that `other-empty` (refererless traffic) usually comes from clients using HTTPS.

In [4]:
df.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:5]

prev,n
other-search,3204375198
other-empty,1980695162
other-internal,145878947
other-external,90525286
Main_Page,36256389
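
Raw totals are easier to compare as shares of all traffic. The sketch below shows one way to compute the share per referer. The `toy` frame and its counts are made up for illustration; on the real data, `df` would take its place.

```python
import pandas as pd

# Hypothetical miniature clickstream; the counts are invented.
toy = pd.DataFrame({
    'prev': ['other-search', 'other-empty', 'Main_Page', 'other-search'],
    'curr': ['A', 'A', 'A', 'B'],
    'type': ['external', 'external', 'link', 'external'],
    'n': [60, 25, 5, 10],
})

# Total requests per referer, largest first.
referer_totals = toy.groupby('prev')['n'].sum().sort_values(ascending=False)

# Fraction of all requests that each referer accounts for.
referer_share = referer_totals / referer_totals.sum()
print(referer_share)
```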


## Outgoing requests from main page

In [5]:
outgoingWPMain = df.loc[(df['prev'] == 'Main_Page')]
outgoingWPMain.sort_values('n', ascending=False)[:5]

index,prev,curr,type,n
18436309,Main_Page,2019–20_coronavirus_pandemic,link,3132537
18987983,Main_Page,Hyphen-minus,other,3034711
25130927,Main_Page,Deaths_in_2020,link,1407219
16102638,Main_Page,2019–20_coronavirus_pandemic_by_country_and_te...,link,210114
5421934,Main_Page,Coronavirus_disease_2019,link,177632
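
The same filter-then-sort pattern recurs throughout this notebook, so it could be wrapped in a small helper. This is only a sketch: `top_outgoing` is a hypothetical name, and the `toy` frame is invented for illustration.

```python
import pandas as pd


def top_outgoing(clicks, article, k=5):
    """Return the k rows with the highest counts whose referer is `article`."""
    outgoing = clicks.loc[clicks['prev'] == article]
    return outgoing.nlargest(k, 'n')


# Hypothetical miniature clickstream; the counts are invented.
toy = pd.DataFrame({
    'prev': ['Main_Page', 'Main_Page', 'Main_Page', 'other-search'],
    'curr': ['A', 'B', 'C', 'A'],
    'type': ['link', 'link', 'other', 'external'],
    'n': [3, 40, 12, 99],
})

top2 = top_outgoing(toy, 'Main_Page', k=2)
print(top2)
```

With the full dataset, `top_outgoing(df, 'Main_Page')` should select the same rows as the cell above.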


## Coronavirus data exploration

In [6]:
coronaDf = df.loc[(df['prev'] == '2019–20_coronavirus_pandemic') | (df['curr'] == '2019–20_coronavirus_pandemic')]
coronaDf.sort_values('n', ascending=False)

index,prev,curr,type,n
18437380,other-search,2019–20_coronavirus_pandemic,external,10653762
18434022,other-empty,2019–20_coronavirus_pandemic,external,9985625
18436309,Main_Page,2019–20_coronavirus_pandemic,link,3132537
27228795,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_the_United_States,link,2006708
18435884,other-internal,2019–20_coronavirus_pandemic,external,1704299
...,...,...,...,...
18438802,We_Are_Number_One,2019–20_coronavirus_pandemic,other,10
18435317,Benito_Mussolini,2019–20_coronavirus_pandemic,other,10
3514911,2019–20_coronavirus_pandemic,Nick_Foles,other,10
18437362,Les_Prophéties,2019–20_coronavirus_pandemic,other,10


In [7]:
# Work on a copy so that renaming the columns does not also rename coronaDf
exportCov = coronaDf.copy()
exportCov.columns = ["source", "target", "type", "value"]
exportCov = exportCov.sort_values('value', ascending=False)
# Collect targets that also occur as a source, excluding the main article itself
sources = set(exportCov['source'])
targetsAlsoSources = []
for row in exportCov.itertuples():
    # row[0] is the index, row[1] the source, row[2] the target
    if row[2] in sources and row[2] != '2019–20_coronavirus_pandemic':
        targetsAlsoSources.append(row[2])
# Mark those targets with a trailing " *"
exportCov.loc[exportCov['target'].isin(targetsAlsoSources), 'target'] = exportCov['target'] + " *"
exportCov.to_csv("InOut_2019–20_coronavirus_pandemic.tsv", index=False, sep="\t" )
exportCov[:100].to_csv("InOutTop100_2019–20_coronavirus_pandemic.tsv", index=False, sep="\t" )
exportCov

index,source,target,type,value
18437380,other-search,2019–20_coronavirus_pandemic,external,10653762
18434022,other-empty,2019–20_coronavirus_pandemic,external,9985625
18436309,Main_Page,2019–20_coronavirus_pandemic,link,3132537
27228795,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_the_United_States *,link,2006708
18435884,other-internal,2019–20_coronavirus_pandemic,external,1704299
...,...,...,...,...
18438802,We_Are_Number_One,2019–20_coronavirus_pandemic,other,10
18435317,Benito_Mussolini,2019–20_coronavirus_pandemic,other,10
3514911,2019–20_coronavirus_pandemic,Nick_Foles,other,10
18437362,Les_Prophéties,2019–20_coronavirus_pandemic,other,10
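
The ` *` suffix marks targets that also appear as a source, meaning articles that are reached from the main article and also send readers back to it. As an illustration of that rule, here is a vectorized sketch on a toy edge list. The title `Hub` stands in for the main article, and all names and counts are made up.

```python
import pandas as pd

HUB = 'Hub'  # hypothetical stand-in for the main article title

# Invented edge list in the exported layout (source, target, type, value).
edges = pd.DataFrame({
    'source': ['other-search', 'Hub', 'Hub', 'X'],
    'target': ['Hub', 'X', 'Y', 'Hub'],
    'type': ['external', 'link', 'link', 'link'],
    'value': [100, 40, 30, 20],
})

# True where the target also occurs in the source column and is not the hub.
also_source = edges['target'].isin(edges['source']) & (edges['target'] != HUB)

marked = edges.copy()
marked.loc[also_source, 'target'] = marked.loc[also_source, 'target'] + ' *'
print(marked)
```

Here only `X` is marked, because it is both reached from `Hub` and a referer to `Hub`.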


### Incoming requests to main pandemic article


In [8]:
coronaDf.columns = ["prev", "curr", "type", "n"]
incomingMain = coronaDf.loc[(coronaDf['curr'] == '2019–20_coronavirus_pandemic')]
incomingMain.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:10]

prev,n
other-search,10653762
other-empty,9985625
Main_Page,3132537
other-internal,1704299
Coronavirus_disease_2019,1203042
Coronavirus,759147
Severe_acute_respiratory_syndrome_coronavirus_2,263889
other-external,195789
Pandemic,132594
2020_coronavirus_pandemic_in_the_United_States,125656


### Outgoing requests from main pandemic article

In [9]:
outgoingMain = df.loc[(df['prev'] == '2019–20_coronavirus_pandemic')]
outgoingMain.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:10]

curr,n
2020_coronavirus_pandemic_in_the_United_States,2006708
2020_coronavirus_pandemic_in_Italy,1297838
2019–20_coronavirus_pandemic_by_country_and_territory,679592
2020_coronavirus_pandemic_in_Spain,624386
2020_coronavirus_pandemic_in_Germany,550413
2020_coronavirus_pandemic_in_the_United_Kingdom,469164
2019–20_coronavirus_pandemic_in_mainland_China,453397
2020_coronavirus_pandemic_in_India,376995
2020_coronavirus_pandemic_in_France,341264
Coronavirus_disease_2019,308027


In [10]:
incomingMain['n'].sum()

30709138

In [11]:
outgoingMain.sort_values('n', ascending=False)

index,prev,curr,type,n
27228795,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_the_United_States,link,2006708
29488393,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_Italy,link,1297838
16102610,2019–20_coronavirus_pandemic,2019–20_coronavirus_pandemic_by_country_and_te...,link,679592
10947455,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_Spain,link,624386
12622707,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_Germany,link,550413
...,...,...,...,...
29289909,2019–20_coronavirus_pandemic,German_Empire,other,10
29240726,2019–20_coronavirus_pandemic,Roman_Reigns,other,10
21142317,2019–20_coronavirus_pandemic,Ahsoka_Tano,other,10
29201601,2019–20_coronavirus_pandemic,United_Airlines,other,10


### Outgoing requests from main pandemic article that follow a link in that article

In [12]:
outgoingMainLinks = outgoingMain.loc[(outgoingMain['type'] == 'link')]
outgoingMainLinks.sort_values('n', ascending=False)

index,prev,curr,type,n
27228795,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_the_United_States,link,2006708
29488393,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_Italy,link,1297838
16102610,2019–20_coronavirus_pandemic,2019–20_coronavirus_pandemic_by_country_and_te...,link,679592
10947455,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_Spain,link,624386
12622707,2019–20_coronavirus_pandemic,2020_coronavirus_pandemic_in_Germany,link,550413
...,...,...,...,...
16268096,2019–20_coronavirus_pandemic,Handle_System,link,10
1817503,2019–20_coronavirus_pandemic,MS_Zaandam,link,10
28775653,2019–20_coronavirus_pandemic,Indian_local_government_response_to_the_2020_c...,link,10
1205451,2019–20_coronavirus_pandemic,Our_World_in_Data,link,10


### Searches while on main pandemic article
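Type `link` means that the main pandemic article contains a link to the requested article. Type `other` means that both pages are articles but the referer does not link to the request; this can indicate a search made from the article, but it can also result from an incorrect referer.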

In [13]:
outgoingMain.groupby('type')[['n']].sum()

type,n
link,13879231
other,454349
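
These totals can be set against the 30709138 incoming requests computed earlier to estimate how much of the traffic continues to another article. The sketch below only does arithmetic on numbers copied from the outputs in this notebook, and the variable names are illustrative. The ratio is a rough indicator rather than a true click-through rate: one visit can produce several outgoing requests, and the dataset appears to omit low-count pairs (the smallest `n` shown in the outputs is 10).

```python
# Totals copied from the outputs in this notebook.
incoming_total = 30_709_138   # requests arriving at the main pandemic article
outgoing_link = 13_879_231    # outgoing requests of type `link`
outgoing_other = 454_349      # outgoing requests of type `other`

outgoing_total = outgoing_link + outgoing_other

# Outgoing requests per incoming request.
onward_ratio = outgoing_total / incoming_total

# Share of outgoing requests that followed a link in the article.
link_share = outgoing_link / outgoing_total

print(f"outgoing total: {outgoing_total}")
print(f"outgoing / incoming: {onward_ratio:.3f}")
print(f"share via links: {link_share:.3f}")
```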


In [14]:
coronaMainSearch = coronaDf.loc[coronaDf['type'] == 'other']
coronaMainSearch.sort_values('n', ascending=False)

index,prev,curr,type,n
20893419,2019–20_coronavirus_pandemic,Main_Page,other,97652
18998157,2019–20_coronavirus_pandemic,Hyphen-minus,other,33262
1903860,2019–20_coronavirus_pandemic,Horseshoe_bat,other,32520
29735986,2019–20_coronavirus_pandemic,Hemoptysis,other,18386
11672604,2019–20_coronavirus_pandemic,Worldometer,other,16241
...,...,...,...,...
18439302,Coyote,2019–20_coronavirus_pandemic,other,10
18439298,Roberto_Benigni,2019–20_coronavirus_pandemic,other,10
18436379,State_of_Palestine,2019–20_coronavirus_pandemic,other,10
18439295,Emilio_Salgari,2019–20_coronavirus_pandemic,other,10
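
The rows above mix both directions: requests that leave the main article and requests that arrive at it from other articles such as `Coyote`. To look only at possible searches made while on the article, the filter can also require the article to be the referer. This is a minimal sketch on invented data, with `Hub` standing in for the main article title.

```python
import pandas as pd

ARTICLE = 'Hub'  # hypothetical stand-in for the main article title

# Invented rows covering both directions and several types.
toy = pd.DataFrame({
    'prev': ['Hub', 'Hub', 'Z', 'other-search'],
    'curr': ['Q', 'R', 'Hub', 'Hub'],
    'type': ['other', 'link', 'other', 'external'],
    'n': [15, 80, 10, 500],
})

# Keep only type `other` requests whose referer is the article itself.
outgoing_other = toy.loc[(toy['prev'] == ARTICLE) & (toy['type'] == 'other')]
print(outgoing_other.sort_values('n', ascending=False))
```

On the real data, the same condition applied to `outgoingMain` (which already has the article as `prev`) reduces to `outgoingMain['type'] == 'other'`.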
