## Matteo Renzi mentions across Italian and Portuguese Wikipedia

#### Table of contents
1. [Find articles](#parse)
2. [Rank articles according to November pageviews](#rank)

In [1]:
from pageviews import *
from wiki_parser import *
from helpers_parser import *

### 1. Find articles <a name="parse"></a>
The first goal is to find all the Italian and Portuguese Wikipedia articles that mention `Matteo Renzi`. To do so, we use the [WikiHandler](wiki_parser.py) class, which goes through the raw data and stores the `title` and the `text` of the articles in the corpora that mention the [Italian (almost) ex Prime Minister](http://gph.is/29zqwgy).

In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20 November 2016 and 1 December 2016 respectively. More generally, the code can handle more than two languages. The [`README`](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) file describes how the data were collected.

In [2]:
# Define the path of the corpora
path = '/Users/cristinamenghini/Downloads/'
# Xml file
xml_files = ['itwiki-20161120-pages-articles-multistream.xml', 
             'ptwiki-20161201-pages-articles-multistream.xml']

A quick look at a snippet of the `xml` shows that the elements we are interested in sit under the `page` child, which identifies an article. For each page we want the contents of `title` and `text`.

Because the `xml` file is large, we opted for a parser that registers callbacks for the events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed, since our analysis does not examine the text itself.
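The callback approach can be sketched with the standard library's `xml.sax` module. This is a minimal illustration and not the repository's `WikiHandler`: the names `MentionHandler` and `find_mentions` are hypothetical, and the sample document is made up. The sketch assumes that a plain substring test is enough to detect a mention.

```python
import io
import xml.sax


class MentionHandler(xml.sax.ContentHandler):
    """Collect the title and text of every <page> whose text contains `mention`."""

    def __init__(self, mention):
        super().__init__()
        self.mention = mention
        self.current_tag = None                     # tag being read, if of interest
        self.buffers = {'title': [], 'text': []}    # characters seen per tag
        self.matches = []                           # pages that mention the term

    def startElement(self, name, attrs):
        if name == 'page':
            # A new article starts: reset the buffers
            self.buffers = {'title': [], 'text': []}
        elif name in ('title', 'text'):
            self.current_tag = name

    def characters(self, content):
        # SAX may deliver the content in several chunks, so accumulate them
        if self.current_tag is not None:
            self.buffers[self.current_tag].append(content)

    def endElement(self, name):
        if name in ('title', 'text'):
            self.current_tag = None
        elif name == 'page':
            text = ''.join(self.buffers['text'])
            if self.mention in text:
                title = ''.join(self.buffers['title'])
                self.matches.append({'title': title, 'text': text})


def find_mentions(source, mention):
    """Stream `source` (a file name or a binary stream) through the handler."""
    handler = MentionHandler(mention)
    xml.sax.parse(source, handler)
    return handler.matches


# Made-up sample with the same page/title/text nesting as the dump
sample = (b"<mediawiki>"
          b"<page><title>A</title><revision><text>Matteo Renzi is cited here.</text></revision></page>"
          b"<page><title>B</title><revision><text>Nothing relevant.</text></revision></page>"
          b"</mediawiki>")
matches = find_mentions(io.BytesIO(sample), 'Matteo Renzi')
```

Since the document is streamed, only the buffers of the current page and the matching pages are held in memory, which is the reason to prefer callbacks over loading the whole tree.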

We therefore parse the Italian corpus with the `parse_articles` function from the [`wiki_parser`](wiki_parser.py) module, which essentially starts the parser.

In [4]:
# Parse italian corpus
parse_articles('ita', path + xml_files[0], 'Matteo Renzi')

Then we do the same for the Portuguese one.

In [5]:
# Parse portuguese corpus
parse_articles('port', path + xml_files[1], 'Matteo Renzi')

The articles are filtered according to whether they mention Matteo Renzi. Those in Italian are stored in a [`.json`](Corpus/wiki_ita_matteo_renzi.json) file in which each line corresponds to a page (`title`, `text`). The same holds for the [articles](Corpus/wiki_port_matteo_renzi.json) in Portuguese. The two corpora are automatically stored in the folder [`Corpus`](https://github.com/CriMenghini/Wikipedia/tree/master/Corpus).

                                  {"title": "title_1", "text": "text_1"}
                                                        ...
                                                        ...
                                  {"title": "title_n", "text": "text_n"}
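A one-object-per-line layout like the one above can be written and read back line by line. The following is a small sketch under that assumption. The helpers `write_pages` and `read_pages` are hypothetical names, not functions from the repository, and the page content is a placeholder.

```python
import io
import json


def write_pages(pages, handle):
    """Write one JSON object per line, keeping only `title` and `text`."""
    for page in pages:
        record = {'title': page['title'], 'text': page['text']}
        # ensure_ascii=False keeps accented characters readable in the file
        handle.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_pages(handle):
    """Read the pages back, skipping empty lines."""
    return [json.loads(line) for line in handle if line.strip()]


# Round trip through an in-memory buffer instead of a file on disk
buffer = io.StringIO()
write_pages([{'title': 'Jobs Act', 'text': '... Matteo Renzi ...'}], buffer)
buffer.seek(0)
pages = read_pages(buffer)
```

With a real file, `handle` would be the object returned by `open(..., encoding='utf-8')`.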

### 2. Rank articles according to November pageviews <a name="rank"></a>

Once the data has been filtered, we proceed with a *simple* analysis of the pageviews. In particular, using the [`article_df_from_json`](pageviews.py) function, all the article titles are extracted from the corpus and then stored in a `DataFrame`.

In [3]:
# Get the df for the Italian articles
df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')

# Get the df for the Portuguese articles
df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')
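As a rough idea of what this step involves, the sketch below builds a one-column `DataFrame` of titles from a corpus with one JSON object per line. It is an assumption about the behaviour of `article_df_from_json`, not its actual code. The name `titles_df` and the two sample lines are hypothetical.

```python
import io
import json

import pandas as pd


def titles_df(handle):
    """Return a DataFrame with a single `Title` column, one row per page."""
    titles = [json.loads(line)['title'] for line in handle if line.strip()]
    return pd.DataFrame({'Title': titles})


# Placeholder corpus held in memory
corpus = io.StringIO('{"title": "G7", "text": "a"}\n'
                     '{"title": "G20", "text": "b"}\n')
df_titles = titles_df(corpus)
```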

Take a look at the obtained `DataFrame`.

In [5]:
df_it_titles.sample(5)

| | Title |
|---|---|
| 255 | The Huffington Post |
| 308 | Malala Yousafzai |
| 205 | Anna Cappellini |
| 25 | Sergio Chiamparino |
| 278 | Italialand |


Next, we extract the monthly pageviews of each article in the languages of interest (i.e. `it` and `pt`) from the *pageviews* file (see [Additional data](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) in the `README`). To filter the file we use the [`filter_pageviews_file`](pageviews.py) function, which returns a dictionary of dictionaries mapping each title to its number of pageviews. In our example it has the following structure:

                                {'it':{'Title_1':'No pageviews',
                                               ...
                                       'Title_n':'No pageviews'},
                                 'pt':{'Title_1':'No pageviews',
                                               ...
                                       'Title_k':'No pageviews'}}
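A nested dictionary of this shape could be built as sketched below. This is not the code of `filter_pageviews_file`, and the function name `filter_pageviews` is hypothetical. The sketch assumes that each line of the file has three whitespace-separated fields (project code, title, monthly total), that a code such as `it.z` denotes the Italian Wikipedia, and that titles use underscores instead of spaces. These assumptions should be checked against the actual file.

```python
from collections import defaultdict


def filter_pageviews(lines, languages):
    """Map language -> {title: monthly pageviews} for the requested languages."""
    wanted = set(languages)
    views = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        project, title, count = parts
        # Assumed convention: '<lang>.z' identifies a Wikipedia edition
        lang, _, suffix = project.partition('.')
        if suffix != 'z' or lang not in wanted or not count.isdigit():
            continue
        views[lang][title.replace('_', ' ')] = int(count)
    return dict(views)


# Illustrative input lines, not taken from the real file
sample_lines = ['it.z Jobs_Act 8908',
                'pt.z G7 259',
                'en.z Main_Page 100',
                'malformed line with too many fields']
pageviews = filter_pageviews(sample_lines, ['pt', 'it'])
```

Iterating over an open file handle instead of a list keeps memory use low, because the file is read one line at a time.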

In [6]:
# Pageviews file
pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'

# Filter the pageview file
articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])

Next, a right join is performed between the two `DataFrames`, the one obtained from the pageviews and the one obtained from the corpus. It turns out that, for both Italian and Portuguese, some articles that mention Matteo Renzi were not viewed in November. The `define_ranked_df` function is defined in the [`pageviews`](pageviews.py) module.
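The join itself can be illustrated with `pandas.DataFrame.merge`. The sketch below uses small made-up frames and is not the code of `define_ranked_df`. The corpus titles are on the right, so every article of the corpus is kept and those missing from the pageviews get `NaN`.

```python
import pandas as pd

# Made-up pageviews (left) and corpus titles (right)
views = pd.DataFrame({'Title': ['G7', 'G20', 'Other'],
                      'Pageviews': [259, 185, 10]})
titles = pd.DataFrame({'Title': ['G7', 'G20', 'Unviewed article']})

# Right join: keep every corpus title, drop pageviews of unrelated articles
ranked = (views.merge(titles, on='Title', how='right')
               .sort_values('Pageviews', ascending=False)
               .reset_index(drop=True))

# Articles of the corpus with no recorded pageviews
not_viewed = int(ranked['Pageviews'].isna().sum())
```

Because of the missing values, the `Pageviews` column is converted to floats, and `sort_values` places the `NaN` rows last by default.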

In [7]:
ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)
ranked_df_ita.head(10)

Over the whole number of articles in the corpus  39  have not been visited during the considered perion.


| | Title | Pageviews |
|---|---|---|
| 235 | Marco Travaglio | 19795.0 |
| 196 | Pif (conduttore televisivo) | 19557.0 |
| 457 | Partito Democratico (Italia) | 11653.0 |
| 398 | Vittorio Sgarbi | 11324.0 |
| 425 | Malala Yousafzai | 9452.0 |
| 436 | Jobs Act | 8908.0 |
| 403 | Enrico Letta | 7791.0 |
| 473 | Startup (economia) | 7698.0 |
| 148 | Marianna Madia | 7608.0 |
| 76 | Nuovo Centro Congressi | 6894.0 |


In [9]:
ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)
ranked_df_port.head(10)

Over the whole number of articles in the corpus  4  have not been visited during the considered perion.


| | Title | Pageviews |
|---|---|---|
| 24 | Partido Democrático (Itália) | 567.0 |
| 36 | Lista de chefes de Estado e de governo atuais | 410.0 |
| 30 | G7 | 259.0 |
| 6 | Federica Mogherini | 215.0 |
| 26 | G20 | 185.0 |
| 4 | Centro-esquerda | 141.0 |
| 35 | Privatização | 93.0 |
| 11 | Lista de chefes de Estado e de governo por dat... | 83.0 |
| 10 | 9.ª reunião de cúpula do G20 | 81.0 |
| 16 | 10.ª reunião de cúpula do G20 | 70.0 |
