## Matteo Renzi mentions across Italian and Portuguese Wikipedia

__*Remark:*__ Since this notebook contains interactive plots, open [this](http://nbviewer.jupyter.org/github/CriMenghini/Wikipedia/blob/master/Mention/Mention_draft.ipynb) link to read the `Notebook` correctly.

#### Table of contents
1. [Find articles](#parse)
2. [Rank articles according to November pageviews](#rank)
3. [Make comparisons](#comp)

In [1]:
import plotly
from pageviews import *
from wiki_parser import *
import plotly.tools as tls
from helpers_parser import *
from across_languages import *
plotly.tools.set_credentials_file(username='crimenghini', api_key='t5q05yuxzu')

In [14]:
import plotly.plotly as py
import plotly.graph_objs as go
import requests
from bs4 import BeautifulSoup

### 1. Find articles <a name="parse"></a>
The first goal is to find all the Italian and Portuguese Wikipedia articles that mention `Matteo Renzi`. To do so, we use the [WikiHandler](wiki_parser.py) class, which goes through the raw data and stores the `title` and the `text` of the elements of the corpora that mention the [Italian (almost) ex-Prime Minister](http://gph.is/29zqwgy).

In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20th November 2016 and 1st December 2016 respectively. In general, the code allows you to take more than two languages into account. In the [`README`](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) file you can find the information related to the collection of data.

In [2]:
# Define the path of the corpora
path = '/Users/cristinamenghini/Downloads/'
# Xml file
xml_files = ['itwiki-20161120-pages-articles-multistream.xml', 
             'ptwiki-20161201-pages-articles-multistream.xml']

A quick peek at a snippet of the `XML` shows that the elements we are interested in sit under the child `page`, which identifies an article. From each page we want the contents of `title` and `text`.

Given the large size of the `XML`, we opted for a parser that registers callbacks for events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed, since for the purpose of our analysis we are not going to analyze the text itself.
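To make the callback idea concrete, here is a minimal sketch of an event-driven parser built on the standard `xml.sax` module. It is not the `WikiHandler` class from `wiki_parser.py`: the names `MentionHandler` and `find_mentions` are hypothetical, and the sketch only assumes that each `page` element contains a `title` and a `text` element.

```python
import io
import xml.sax


class MentionHandler(xml.sax.ContentHandler):
    """Collect title and text of every <page> whose text contains `mention`.

    Hypothetical stand-in for the notebook's WikiHandler class.
    """

    def __init__(self, mention):
        super().__init__()
        self.mention = mention
        self.current = None   # tag currently being read ('title' or 'text')
        self.title = []
        self.text = []
        self.matches = []

    def startElement(self, name, attrs):
        if name == 'page':
            # A new article starts: reset the buffers
            self.title, self.text = [], []
        elif name in ('title', 'text'):
            self.current = name

    def characters(self, content):
        # The parser may deliver the content in several chunks
        if self.current == 'title':
            self.title.append(content)
        elif self.current == 'text':
            self.text.append(content)

    def endElement(self, name):
        if name in ('title', 'text'):
            self.current = None
        elif name == 'page':
            text = ''.join(self.text)
            if self.mention in text:
                self.matches.append({'title': ''.join(self.title), 'text': text})


def find_mentions(xml_source, mention):
    """Parse a file name or file object and return the matching pages."""
    handler = MentionHandler(mention)
    xml.sax.parse(xml_source, handler)
    return handler.matches
```

Because the handler only keeps the buffers of the page being read plus the matching pages, memory use stays bounded by the matches rather than by the size of the dump.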

Hence, we proceed to parse the Italian corpus using the `parse_articles` function stored in the [`wiki_parser`](wiki_parser.py) library - it basically activates the parser.

In [4]:
# Parse italian corpus
parse_articles('ita', path + xml_files[0], 'Matteo Renzi')

Then we do the same for the Portuguese one.

In [5]:
# Parse portuguese corpus
parse_articles('port', path + xml_files[1], 'Matteo Renzi')

The articles are filtered according to whether they mention Matteo Renzi. Those in Italian are stored in a [`.json`](Corpus/wiki_ita_matteo_renzi.json) file in which each line corresponds to a page (`title`, `text`). The same holds for the [articles](Corpus/wiki_port_matteo_renzi.json) in Portuguese. The two corpora are automatically stored in the folder [`Corpus`](https://github.com/CriMenghini/Wikipedia/tree/master/Corpus).

                                  {"title": "title_1", "text": "text_1"}
                                                        ...
                                                        ...
                                  {"title": "title_n", "text": "text_n"}
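A file with this layout (one JSON object per line) can be written and read back with the standard `json` module. The sketch below is only an illustration of the format: `write_articles` and `read_articles` are hypothetical helpers, not functions from the repository.

```python
import json


def write_articles(path, articles):
    """Write one JSON object (title, text) per line."""
    with open(path, 'w', encoding='utf-8') as handle:
        for article in articles:
            handle.write(json.dumps(article, ensure_ascii=False) + '\n')


def read_articles(path):
    """Yield one article dictionary per non-empty line."""
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Reading line by line means the whole corpus never has to be loaded at once, which matters little for the filtered files but keeps the same code usable on larger ones.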

### 2. Rank articles according to November pageviews <a name="rank"></a>

Once the data has been filtered, we proceed with a *simple* analysis of the pageviews. In particular, using the [`article_df_from_json`](pageviews.py) function, all the article titles are extracted from the corpus and then stored in a `DataFrame`.

In [3]:
# Get the df for the Italian articles
df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')

# Get the df for the Portuguese articles
df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')

Take a look at the obtained `DataFrame`.

In [4]:
df_it_titles.sample(5)

Unnamed: 0,Title
207,Auto blu
373,Lorenza Bonaccorsi
302,Chiara Braga
322,Renata Bueno
286,Strage di Firenze


Next, we extract the number of monthly page views for each article in the languages of interest (i.e. `it` and `pt`) from the *page views* file (see [Additional data](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) in the `README`). To filter the file we use the [`filter_pageviews_file`](pageviews.py) function, which returns a dictionary of dictionaries with the following structure (according to our example), where `'No pageviews'` stands for the number of page views:

                                {'it':{'Title_1':'No pageviews',
                                               ...
                                       'Title_n':'No pageviews'},
                                 'pt':{'Title_1':'No pageviews',
                                               ...
                                       'Title_k':'No pageviews'}}
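As a rough illustration of how such a nested dictionary can be built, the sketch below filters an iterable of lines. It is not the `filter_pageviews_file` function: the helper name is hypothetical, and the line layout `<language> <title> <views>` is an assumption made for the example, so the real page views file may well be laid out differently.

```python
from collections import defaultdict


def filter_pageviews_lines(lines, languages):
    """Return {language: {title: views}} for the requested languages.

    ASSUMPTION: every line reads '<language> <title> <views>', separated by
    whitespace. The real page views file may use a different layout.
    """
    wanted = set(languages)
    result = defaultdict(dict)
    for line in lines:
        parts = line.split()
        # Skip malformed lines and languages we are not interested in
        if len(parts) != 3 or parts[0] not in wanted:
            continue
        language, title, views = parts
        if views.isdigit():
            result[language][title] = int(views)
    return dict(result)
```

Since the function accepts any iterable of strings, an open file handle can be passed directly and the file is read one line at a time.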

In [5]:
# Page views file
pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'

# Filter the page view file
articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])

Next, a right join is performed between the two `DataFrames`, namely the one obtained from the pageviews and the one obtained from the corpus. It turns out that, for both Italian and Portuguese, some of the articles that mention Matteo Renzi were not viewed in November. The `define_ranked_df` function is stored in the [`pageviews`](pageviews.py) library.
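The logic of this step can be sketched in plain Python: attach the page views to the corpus titles, sort them, and set aside the titles with no page view entry. This is a simplified stand-in with a hypothetical name (`rank_titles`), not the `define_ranked_df` function, which works on `DataFrames`.

```python
def rank_titles(corpus_titles, pageviews):
    """Attach page views to corpus titles and sort in decreasing order.

    Returns (ranked, missing): `ranked` is a list of (title, views) pairs,
    `missing` lists the corpus titles that have no entry in `pageviews`.
    """
    ranked = [(title, pageviews[title]) for title in corpus_titles
              if title in pageviews]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    missing = [title for title in corpus_titles if title not in pageviews]
    return ranked, missing
```

In a right join the titles in `missing` would stay in the table with an empty page view value; here they are returned separately so they can be counted.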

In [6]:
# Define the italian ranked article df according to the number of page views
ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)
# Show the df head
ranked_df_ita.head(10)

Over the whole number of articles in the corpus  39  have not been visited during the considered period.


Unnamed: 0,Title,Pageviews
360,Marco Travaglio,19795.0
300,Pif (conduttore televisivo),19557.0
43,Partito Democratico (Italia),11653.0
472,Vittorio Sgarbi,11324.0
342,Malala Yousafzai,9452.0
137,Jobs Act,8908.0
277,Enrico Letta,7791.0
22,Startup (economia),7698.0
411,Marianna Madia,7608.0
338,Nuovo Centro Congressi,6894.0


In [7]:
# Define the Portuguese ranked article df according to the number of page views
ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)
# Show the df
ranked_df_port.head(10)

Over the whole number of articles in the corpus  4  have not been visited during the considered period.


Unnamed: 0,Title,Pageviews
14,Partido Democrático (Itália),567.0
33,Lista de chefes de Estado e de governo atuais,410.0
35,G7,259.0
29,Federica Mogherini,215.0
37,G20,185.0
7,Centro-esquerda,141.0
36,Privatização,93.0
30,Lista de chefes de Estado e de governo por dat...,83.0
20,9.ª reunião de cúpula do G20,81.0
3,10.ª reunião de cúpula do G20,70.0


Glancing at the two top 10s, we notice:
* The number of page views for the Italian articles that mention Matteo Renzi is considerably higher than for those written in Portuguese.
* The only article present in both top 10 rankings is `Partito Democratico (Italia)` (`Partido Democrático (Itália)` in Portuguese).
* The pages seem to differ in content: the Portuguese ones mostly concern international politics, whereas the Italian ones cover politics, journalists and public figures.

### 3. Make comparisons <a name = 'comp'></a>

We now move on to exploring the preprocessed data, trying to find something interesting.
* We take a look at the number of mentions received in each article. In this context, Matteo Renzi may receive more than one mention just because of the presence of references. For instance, if you search for Matteo Renzi on [this](https://it.wikipedia.org/wiki/Francesco_Guccini) page you will find 2 mentions, but one of them merely refers to the other. For the moment we do not address this issue.

The `DataFrame` below, obtained using the `article_mentions` function in [this](across_languages.py) library, shows the number of mentions that Matteo Renzi receives in each article, for both the Italian and Portuguese corpora. The `DataFrames` are sorted by the number of mentions, so that we get the pages where Matteo Renzi is most "popular".
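Counting mentions can be sketched as a plain substring count per article. The helper below is hypothetical and assumes exact, non-overlapping matches of the string; the `article_mentions` function may count differently.

```python
def count_mentions(articles, mention):
    """Return (title, number of mentions) pairs, most mentions first.

    ASSUMPTION: a mention is an exact, non-overlapping occurrence of the
    string in the article text.
    """
    counts = [(article['title'], article['text'].count(mention))
              for article in articles]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

A plain count like this also picks up occurrences inside references, which is exactly the caveat raised in the bullet above.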

In [8]:
# Italian df of mentions per page
df_it_mentions = article_mentions('Corpus/wiki_ita_Matteo_Renzi.json', 'Matteo Renzi')

# Sort the df by the number of mentions and see the top 5
df_it_mentions = df_it_mentions.sort_values('Number of mentions', ascending = False)

# Show results
df_it_mentions.head(5)

Unnamed: 0,Title,Number of mentions
37,Matteo Renzi,62
424,Governo Renzi,30
195,Partito Democratico (Italia),17
492,Riforma costituzionale Renzi-Boschi,12
214,Storia del Partito Democratico (Italia),12


In [9]:
# Portuguese df of mentions per page
df_pt_mentions = article_mentions('Corpus/wiki_port_Matteo_Renzi.json', 'Matteo Renzi')

# Sort the df by the number of mentions and see the top 5
df_pt_mentions = df_pt_mentions.sort_values('Number of mentions', ascending = False)

# Show results
df_pt_mentions.head(5)

Unnamed: 0,Title,Number of mentions
20,Matteo Renzi,11
7,Partido Democrático (Itália),6
9,Itália,3
0,Lista de primeiros-ministros da Itália,2
16,Lista de viagens presidenciais de Dilma Rousseff,2


Comparing the two `DataFrames`, we immediately notice that the maximum number of mentions that Matteo Renzi receives differs widely between the Italian and Portuguese articles. In the Portuguese corpus only two articles have more than 5 mentions. It is therefore interesting to visualize the distribution of the mentions for both the IT and PT corpora.

The distributions are represented using boxplots. They show that, for both languages, 75% of the articles contain no more than 3 mentions of the Italian premier. In the Portuguese corpus two outliers stand out, `Matteo Renzi` (11 mentions) and `Partido Democrático (Itália)` (6 mentions), whereas in the Italian corpus the number of outliers is larger and the maximum is reached by `Matteo Renzi` (62 mentions). Moreover, zooming in on the boxes, we observe that both distributions are concentrated at the low end (number of mentions equal to 1), with a long tail towards higher values.

In [10]:
#boxplot_mentions(df_pt_mentions, df_it_mentions, 'PT', 'IT', 'Number of mentions')
tls.embed("https://plot.ly/~crimenghini/20")
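The quantities a boxplot displays can also be computed directly. The sketch below uses the standard `statistics` module and the common 1.5 × IQR rule for outliers; the function name is hypothetical, and the plotting library may use a slightly different quartile method, so the values need not match the plot exactly.

```python
import statistics


def boxplot_summary(values):
    """Return quartiles and 1.5 * IQR outliers for a list of numbers."""
    q1, median, q3 = statistics.quantiles(sorted(values), n=4,
                                          method='inclusive')
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Points beyond the fences are the ones drawn as separate markers
    outliers = [value for value in values if value < low or value > high]
    return {'q1': q1, 'median': median, 'q3': q3, 'outliers': outliers}
```

Applied to a column of mention counts (for instance `list(df_pt_mentions['Number of mentions'])`), this gives the numbers behind the boxes without zooming into the plot.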

In this direction, one aspect that can be considered is the following: 
> Define how important Matteo Renzi is in the articles that mention him. This requires defining the concept of *importance*. Intuitively, the higher the number of mentions, the more important our subject is in the article. Moreover, it may be useful to weight the number of mentions by the number of words in the article. 
$$I_{string} = \frac{M}{|D|}$$
Where *I* is the importance, *M* is the number of mentions and *|D|* is the number of words in the document *D*. In this way, if an article cites Renzi only once but consists of just a few lines, the string of interest will turn out to be more significant.
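A direct translation of this ratio into code could look as follows. The function is a hypothetical sketch that assumes words are separated by whitespace and that mentions are exact occurrences of the string.

```python
def importance(text, mention):
    """Number of mentions divided by the number of words in the text.

    ASSUMPTION: words are whitespace-separated tokens.
    """
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return text.count(mention) / n_words
```

For a five-word text with a single mention the value is 1/5; the same single mention inside a longer text yields a smaller value, which is the weighting effect described above.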

Another thing that can be visualized is the relationship between the `Number of mentions` and the `Pageviews`. To do that, we first merge the pageviews and mentions `DataFrames`.

In [11]:
# Merge pageviews and mentions DataFrames for IT
df_it_mension_pageview = pd.merge(df_it_mentions, ranked_df_ita, on=['Title'])

# Show it
df_it_mension_pageview.sample(5)

Unnamed: 0,Title,Number of mentions,Pageviews
472,Anonymous,1,6.0
291,Seconda guerra civile in Libia,1,1669.0
165,Pietro Langella,1,64.0
26,Guglielmo Epifani,4,849.0
130,Enrico Costa (politico),2,744.0


In [12]:
# Merge pageviews and mentions DataFrames for PT
df_pt_mension_pageview = pd.merge(df_pt_mentions, ranked_df_port, on=['Title'])

# Show it
df_pt_mension_pageview.sample(5)

Unnamed: 0,Title,Number of mentions,Pageviews
23,Lista de chefes de Estado e de governo atuais,1,410.0
11,Giuliano Poletti,2,7.0
12,Maria Elena Boschi,2,5.0
17,Terremotos da Itália de outubro de 2016,1,26.0
28,Canonização de João XXIII e João Paulo II,1,26.0


A scatterplot is used to show how each article is positioned with respect to these two variables.

In [19]:
# Write function - Code in plots.py so far
tls.embed('https://plot.ly/~crimenghini/36')

Regarding these two features, another direction worth exploring is the following:
> Consider how the number of pageviews of an article changes when the number of Matteo Renzi citations increases from one revision to the next. In particular, the *importance* (I) is redefined as: 
$$I = \sum_{t = 1}^{T} \frac{(p_t-p_{t-1}) \times m_t}{|D_t|}$$
Where *t* indexes the sequential revisions of the article, $p_t$ is the number of page views at time *t*, $m_t$ is the number of mentions and $|D_t|$ is the number of words in the article at time *t*.
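The sum can be sketched in a few lines once the revision history is available. The function below is hypothetical, and so is the input layout: a starting page view count $p_0$ plus one `(page views, mentions, number of words)` triple per revision.

```python
def revision_importance(initial_views, revisions):
    """Sum of (p_t - p_{t-1}) * m_t / |D_t| over the revisions t = 1..T.

    `revisions` is a list of (views, mentions, n_words) triples in
    chronological order; `initial_views` plays the role of p_0.
    """
    total = 0.0
    previous = initial_views
    for views, mentions, n_words in revisions:
        total += (views - previous) * mentions / n_words
        previous = views
    return total
```

With toy values $p_0 = 100$ and revisions `(120, 1, 200)` and `(150, 2, 400)`, the two terms are 20/200 and 60/400, so the importance is 0.25.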

We then look for articles that mention Matteo Renzi in both languages. To do so, we send a request for each Portuguese *Wikipedia* page that cites Renzi, then parse the `HTML` source to extract, where available, the title of the corresponding IT article. More precisely, the requests are sent for the titles of the language that has fewer articles matching Matteo Renzi. The function `get_matches` is stored in [this](across_languages) library. 
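The parsing step can be sketched with the standard `html.parser` module. This is not the `get_matches` function: the names below are hypothetical, and the sketch assumes that the page marks its interlanguage links with an `<a>` tag carrying the class `interlanguage-link-target` and an `hreflang` attribute, which may not be the markup the original code relies on.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class InterlanguageLinkParser(HTMLParser):
    """Find the title of the linked article in `target_lang`, if any.

    ASSUMPTION: interlanguage links look like
    <a class="interlanguage-link-target" hreflang="it" href=".../wiki/Title">.
    """

    def __init__(self, target_lang):
        super().__init__()
        self.target_lang = target_lang
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or self.title is not None:
            return
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        if ('interlanguage-link-target' in classes
                and attributes.get('hreflang') == self.target_lang):
            path = urlparse(attributes.get('href') or '').path
            # The title is the part of the path after '/wiki/'
            raw_title = path.split('/wiki/', 1)[-1]
            self.title = unquote(raw_title).replace('_', ' ')


def linked_title(html, target_lang):
    """Return the linked title in `target_lang`, or None if there is none."""
    parser = InterlanguageLinkParser(target_lang)
    parser.feed(html)
    return parser.title
```

The page source would come from one request per Portuguese title; returning `None` when no link is found mirrors the "where available" caveat above.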

In [None]:
# Build the matches between the common articles
dict_italian = get_matches(df_pt_titles)

We then create a `DataFrame` that contains the information related to those articles.

In [32]:
italian_titles = list(dict_italian.values())
portugues_titles = list(dict_italian.keys())

In [57]:
ita = df_it_mension_pageview.query('Title in @italian_titles') # df_it_titles
# Add a column with the tuple

In [51]:
# The length is different because there are some italian articles that have not been visited in November
print (len(italian_titles),len(ita))

31 19


In [52]:
ita.head(5)

Unnamed: 0,Title,Number of mentions,Pageviews
0,Matteo Renzi,62,31.0
2,Partito Democratico (Italia),17,11653.0
12,Maria Elena Boschi,6,11.0
30,Federica Mogherini,4,8.0
39,Presidenti del Consiglio dei ministri della Re...,3,8.0


In [54]:
port = df_pt_mension_pageview.query('Title in @portugues_titles')
port.head(5)

Unnamed: 0,Title,Number of mentions,Pageviews
0,Matteo Renzi,11,26.0
1,Partido Democrático (Itália),6,567.0
2,Itália,3,5.0
3,Lista de primeiros-ministros da Itália,2,12.0
7,Presidente do Conselho de Ministros da Itália,2,14.0


Join the two `DataFrames`