# Wikipedia Recommender System - statistics

## Dataset analysis

The bag-of-words representation is probably the most useful one for dataset analysis. The recommender system stores articles in the TF-IDF representation, from which the bag-of-words representation cannot be recovered, so a separate dataset is needed.
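To illustrate why raw counts cannot be read back from TF-IDF vectors, here is a minimal sketch. It assumes a smoothed IDF and L2-normalised vectors, which is a common setup; the source does not state which TF-IDF variant the recommender actually uses, and `tfidf_from_bow` is a hypothetical helper, not part of `wikirecommender`.

```python
import math


def tfidf_from_bow(bow_rows):
    """Turn raw count vectors into L2-normalised TF-IDF vectors (assumed variant)."""
    n_docs = len(bow_rows)
    n_terms = len(bow_rows[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in bow_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (one common formulation).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in bow_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result


# Toy counts: the second document is the first one with every count doubled.
bow_rows = [[1, 2, 0], [2, 4, 0], [0, 1, 3]]
tfidf_rows = tfidf_from_bow(bow_rows)

# After normalisation the first two rows coincide, so the original counts are lost.
same = all(abs(a - b) < 1e-12 for a, b in zip(tfidf_rows[0], tfidf_rows[1]))
print(same)
```

Under these assumptions, two articles with different word counts map to the same TF-IDF vector, which is why the counts are shipped as a separate file.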

Download the file `bow1000.csv` from [this link](https://drive.google.com/file/d/1X-RmXG_21r1XKo3ElARkNVejqPd9WxB3/view?usp=sharing) and place it in the notebook's directory. It contains the bag-of-words representation of every Wikipedia article scraped when generating the recommender system for 1000 articles. It should be sufficient for most dataset statistics. If you need similarities between articles, refer to the next section. If you need the TF-IDF vector representations of the articles, use the `recommender.dataset` property.

In [35]:
import pandas as pd

bow = pd.read_csv("bow1000.csv", index_col=0)
bow.head()

| wikipedia_url | aa | aam | aardvark | aba | aback | abandon | abas | abattoir | abb | abbasi | ... | zoo | zoologist | zoom | zoomorph | zooplankton | zorro | zoster | zucchini | zwitter | zygon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://en.wikipedia.org/wiki/Wikipedia:Popular_pages | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| https://en.wikipedia.org/wiki/Leviathan_(Hobbes_book) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| https://en.wikipedia.org/wiki/Adele | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| https://en.wikipedia.org/wiki/Jay-Z | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| https://en.wikipedia.org/wiki/The_New_York_Times | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
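As a starting point for the statistics mentioned above, the following sketch computes a few summary numbers from a documents-by-terms count matrix shaped like `bow`. The function name `dataset_statistics`, the chosen statistics, and the toy values are illustrative assumptions; a small stand-in frame is used so the example runs without the downloaded file.

```python
import pandas as pd


def dataset_statistics(bow: pd.DataFrame) -> dict:
    """Summarise a count matrix with one row per article and one column per term."""
    doc_lengths = bow.sum(axis=1)      # total number of counted words per article
    doc_freq = (bow > 0).sum(axis=0)   # number of articles containing each term
    term_totals = bow.sum(axis=0)      # corpus-wide count of each term
    return {
        "n_articles": bow.shape[0],
        "vocabulary_size": bow.shape[1],
        "mean_article_length": float(doc_lengths.mean()),
        "sparsity": float((bow == 0).to_numpy().mean()),  # fraction of zero cells
        "top_terms": term_totals.sort_values(ascending=False).head(3).index.tolist(),
        "most_widespread_term": doc_freq.idxmax(),
    }


# Toy stand-in for bow1000.csv (hypothetical URLs, terms, and counts).
toy_bow = pd.DataFrame(
    [[2, 0, 1], [0, 3, 1], [0, 0, 4]],
    index=["url_a", "url_b", "url_c"],
    columns=["chess", "ship", "game"],
)
stats = dataset_statistics(toy_bow)
print(stats)
```

With the real file loaded as shown earlier, `dataset_statistics(bow)` should work the same way, provided every column holds integer counts.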


## Generating recommendations

Download the file `recommender1000.csv` from [this link](https://drive.google.com/file/d/1j5H7VVNuhuU2ZWgk4b-TyWRIfBsJ9MmD/view?usp=sharing) and place it in the notebook's directory. Once this is done, you can load the recommender for 1000 Wikipedia articles with the following code:

In [1]:
from wikirecommender import WikipediaRecommender

recommender = WikipediaRecommender.load_from_file("recommender1000.csv")

The recommender can generate recommendations based on either a single article or multiple articles. Here is an example for one article:

In [3]:
recommendations = recommender.recommend("https://en.wikipedia.org/wiki/Titanic")
recommendations.head()

| | URL | Similarity |
|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Titanic | 1.0 |
| 2 | https://en.wikipedia.org/wiki/Titanic_(1997_film) | 0.508459 |
| 3 | https://en.wikipedia.org/wiki/Bermuda_Triangle | 0.42408 |
| 4 | https://en.wikipedia.org/wiki/Dunkirk_evacuation | 0.390148 |
| 5 | https://en.wikipedia.org/wiki/Venice | 0.341528 |
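A similarity of 1.0 for the query article itself is consistent with cosine similarity over TF-IDF vectors, although the source does not name the measure. Under that assumption, a ranking like the one above could be produced roughly as follows. The helper names and the three-dimensional toy vectors are hypothetical and are not taken from `wikirecommender`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_url, vectors):
    """Return (url, similarity) pairs sorted from most to least similar to the query."""
    query = vectors[query_url]
    scored = [(url, cosine_similarity(query, vec)) for url, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical TF-IDF-like vectors keyed by article name.
toy_vectors = {
    "Titanic": [0.9, 0.1, 0.0],
    "Titanic_(1997_film)": [0.6, 0.0, 0.8],
    "Venice": [0.1, 0.9, 0.2],
}
ranking = rank_by_similarity("Titanic", toy_vectors)
for url, score in ranking:
    print(url, round(score, 3))
```

As in the output above, the query article ranks first because every non-zero vector has cosine similarity 1 with itself.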


And here is an example for more than one article:

In [36]:
recommendations = recommender.recommend([
    "https://en.wikipedia.org/wiki/Chess",
    "https://en.wikipedia.org/wiki/Checkers"
])
recommendations.head()

| | URL | Similarity |
|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Tennis | 0.164198 |
| 2 | https://en.wikipedia.org/wiki/Fallout_4 | 0.16045 |
| 3 | https://en.wikipedia.org/wiki/Association_foot... | 0.156599 |
| 4 | https://en.wikipedia.org/wiki/Minecraft | 0.156032 |
| 5 | https://en.wikipedia.org/wiki/American_football | 0.139123 |
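The source does not describe how several query articles are combined. One plausible approach, sketched below purely as an assumption, is to average the query vectors into a centroid and rank the remaining articles by cosine similarity to it. Excluding the query articles from the result is also a design choice made here for illustration; all names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend_for_many(query_urls, vectors):
    """Rank non-query articles by similarity to the mean of the query vectors."""
    queries = [vectors[url] for url in query_urls]
    # Centroid: component-wise mean of the query vectors.
    centroid = [sum(component) / len(queries) for component in zip(*queries)]
    scored = [
        (url, cosine_similarity(centroid, vec))
        for url, vec in vectors.items()
        if url not in query_urls  # assumption: do not recommend the inputs themselves
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical TF-IDF-like vectors keyed by article name.
toy_vectors = {
    "Chess": [1.0, 0.0, 0.2],
    "Checkers": [0.8, 0.1, 0.0],
    "Tennis": [0.5, 0.5, 0.1],
    "Venice": [0.0, 1.0, 0.3],
}
result = recommend_for_many(["Chess", "Checkers"], toy_vectors)
for url, score in result:
    print(url, round(score, 3))
```

Averaging similarities per query article instead of averaging the vectors would be an equally reasonable alternative, and the two can give different rankings.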
