## Matteo Renzi mentions across Italian and Portuguese Wikipedia

#### Table of contents
1. [Find articles](#parse)
2. [Rank articles according to November pageviews](#rank)

In [1]:
from pageviews import *
from wiki_parser import *
from helpers_parser import *

### 1. Find articles <a name="parse"></a>
The first goal is to find all the Italian and Portuguese Wikipedia articles that mention `Matteo Renzi`. To do so, we use the [WikiHandler](wiki_parser.py) class, which goes through the raw data and stores the `title` and the `text` of the elements of the corpora that mention the [Italian (almost) ex Prime Minister](http://gph.is/29zqwgy).

In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20th November 2016 and 1st December 2016 respectively. In general, the code can handle more than two languages. The [`README`](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) file describes how the data were collected.

In [2]:
# Define the path of the corpora
path = '/Users/cristinamenghini/Downloads/'
# Xml file
xml_files = ['itwiki-20161120-pages-articles-multistream.xml', 
             'ptwiki-20161201-pages-articles-multistream.xml']

A quick peek at a snippet of the `xml` shows that the elements we are interested in are under the child `page`, which identifies an article. From each of them we want the contents of `title` and `text`.

Due to the large size of the `xml`, we opted for a parser that registers callbacks for events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed since, for the purpose of our analysis, we are not going to analyze the text itself.
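The callback-based approach can be illustrated with the standard-library `xml.sax` module. This is only a minimal sketch: `MentionHandler` and the tiny in-memory document are made up for the illustration, and the actual `WikiHandler` class in `wiki_parser.py` may be organised differently.

```python
import io
import xml.sax


class MentionHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects (title, text) of pages mentioning a string."""

    def __init__(self, mention):
        super().__init__()
        self.mention = mention
        self.current_tag = None   # tag whose character data is being read
        self.title_parts = []
        self.text_parts = []
        self.matches = []         # pages that mention the string

    def startElement(self, name, attrs):
        if name == 'page':
            # A new article starts: reset the buffers
            self.title_parts, self.text_parts = [], []
        self.current_tag = name

    def characters(self, content):
        # The parser may deliver the content of one element in several chunks
        if self.current_tag == 'title':
            self.title_parts.append(content)
        elif self.current_tag == 'text':
            self.text_parts.append(content)

    def endElement(self, name):
        if name == 'page':
            text = ''.join(self.text_parts)
            if self.mention in text:
                self.matches.append({'title': ''.join(self.title_parts), 'text': text})
        self.current_tag = None


# Tiny in-memory document standing in for the real dump
sample = """<mediawiki>
  <page><title>A</title><revision><text>Matteo Renzi is mentioned here.</text></revision></page>
  <page><title>B</title><revision><text>Nothing relevant.</text></revision></page>
</mediawiki>"""

handler = MentionHandler('Matteo Renzi')
xml.sax.parse(io.BytesIO(sample.encode('utf-8')), handler)
print([page['title'] for page in handler.matches])
```

Since the handler only keeps the buffers of the page being read, memory usage does not grow with the size of the dump, apart from the list of matching pages.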

Hence, we proceed to parse the Italian corpus using the `parse_articles` function defined in the [`wiki_parser`](wiki_parser.py) module, which essentially runs the parser.

In [4]:
# Parse italian corpus
parse_articles('ita', path + xml_files[0], 'Matteo Renzi')

Then we do the same for the Portuguese corpus.

In [5]:
# Parse portuguese corpus
parse_articles('port', path + xml_files[1], 'Matteo Renzi')

The articles are filtered according to whether they mention Matteo Renzi. Those in Italian are stored in a [`.json`](Corpus/wiki_ita_matteo_renzi.json) file in which each line corresponds to a page (`title`, `text`). The same holds for the [articles](Corpus/wiki_port_matteo_renzi.json) in Portuguese. The two corpora are automatically stored in the folder [`Corpus`](https://github.com/CriMenghini/Wikipedia/tree/master/Corpus).

                                  {"title": "title_1", "text": "text_1"}
                                                        ...
                                                        ...
                                  {"title": "title_n", "text": "text_n"}
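A file with this layout (one JSON object per line) can be read back one page at a time. The sketch below uses a hypothetical helper, `read_titles`, and a temporary two-line file; it is not necessarily how `article_df_from_json` is implemented.

```python
import json
import os
import tempfile


def read_titles(json_path):
    """Return the list of titles stored in a JSON-lines file (one page per line)."""
    titles = []
    with open(json_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:                       # skip blank lines
                titles.append(json.loads(line)['title'])
    return titles


# Write a two-line example file and read it back
pages = [{'title': 'title_1', 'text': 'text_1'},
         {'title': 'title_2', 'text': 'text_2'}]
tmp_path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(tmp_path, 'w', encoding='utf-8') as f:
    for page in pages:
        f.write(json.dumps(page, ensure_ascii=False) + '\n')

print(read_titles(tmp_path))
```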

### 2. Rank articles according to November pageviews <a name="rank"></a>

Once the data has been filtered, we proceed with a *simple* analysis of the pageviews. In particular, using the [`article_df_from_json`](pageviews.py) function, all the article titles are extracted from the corpus and then stored in a `DataFrame`.

In [3]:
# Get the df for the Italian articles
df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')

# Get the df for the Portuguese articles
df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')

Take a look at the obtained `DataFrame`.

In [5]:
df_it_titles.sample(5)

| | Title |
|---|---|
| 255 | The Huffington Post |
| 308 | Malala Yousafzai |
| 205 | Anna Cappellini |
| 25 | Sergio Chiamparino |
| 278 | Italialand |


Next, we extract the number of monthly pageviews of each article in the languages of interest (i.e. `it` and `pt`) from the *pageviews* file (see [Additional data](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) in the `README`). To filter the file we use the [`filter_pageviews_file`](pageviews.py) function, which returns a dictionary of dictionaries with the following structure for our example, where `No pageviews` stands for the number of pageviews of the corresponding title:

                                {'it':{'Title_1':'No pageviews',
                                               ...
                                       'Title_n':'No pageviews'},
                                 'pt':{'Title_1':'No pageviews',
                                               ...
                                       'Title_k':'No pageviews'}}

In [4]:
# Pageviews file
pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'

# Filter the pageview file
articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])
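To make the nested structure concrete, here is a minimal sketch of such a filter. The function `filter_pageviews` is hypothetical, and the line format it expects (`<project> <title> <count>`, with the language being the part of the project code before the first dot) is an assumption made for the illustration, not a description of `filter_pageviews_file`. In this simplified version, different projects sharing the same language prefix would end up in the same dictionary.

```python
import io


def filter_pageviews(lines, languages):
    """Hypothetical filter: keep the counts of the requested languages only.

    ASSUMPTION: each line looks like '<project> <title> <count>', and the
    language is the part of <project> before the first dot (e.g. 'it.z').
    """
    views = {lang: {} for lang in languages}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                       # skip malformed lines
        project, title, count = fields
        lang = project.split('.')[0]
        if lang in views:
            views[lang][title.replace('_', ' ')] = int(count)
    return views


# Made-up lines standing in for the real pageviews file
sample = io.StringIO(
    "it.z Matteo_Renzi 100\n"
    "pt.z G7 259\n"
    "en.z Main_Page 5000\n"
)
result = filter_pageviews(sample, ['pt', 'it'])
print(result)
```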

Next, a right join is performed between the two `DataFrames`: the one obtained from the pageviews and the one obtained from the corpus. It turns out that, for both Italian and Portuguese, some articles that mention Matteo Renzi have no recorded pageviews in November (the name of the pageviews file, `views-ge-5`, suggests that pages with fewer than 5 views are not listed). The `define_ranked_df` function is defined in the [`pageviews`](pageviews.py) module.
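The join can be sketched with `pandas.merge` on two small, made-up `DataFrames`. This is an illustration under those assumptions and may differ from what `define_ranked_df` does internally.

```python
import pandas as pd

# Hypothetical inputs: pageviews per title and titles found in the corpus
pageviews = pd.DataFrame({'Title': ['A', 'B'], 'Pageviews': [120, 30]})
corpus_titles = pd.DataFrame({'Title': ['A', 'B', 'C']})

# Right join: every corpus title is kept; titles without pageviews get NaN
ranked = pd.merge(pageviews, corpus_titles, on='Title', how='right')
ranked = ranked.sort_values('Pageviews', ascending=False)

# Count the corpus articles that have no pageviews entry
not_viewed = int(ranked['Pageviews'].isna().sum())
print(ranked)
print(not_viewed, 'article(s) without recorded pageviews')
```

The `NaN` introduced by the join is also the reason why the `Pageviews` column is shown as floats in the ranked tables.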

In [7]:
ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)
ranked_df_ita.head(10)

Over the whole number of articles in the corpus  39  have not been visited during the considered period.


| | Title | Pageviews |
|---|---|---|
| 415 | Marco Travaglio | 19795.0 |
| 223 | Pif (conduttore televisivo) | 19557.0 |
| 76 | Partito Democratico (Italia) | 11653.0 |
| 248 | Vittorio Sgarbi | 11324.0 |
| 118 | Malala Yousafzai | 9452.0 |
| 322 | Jobs Act | 8908.0 |
| 104 | Enrico Letta | 7791.0 |
| 295 | Startup (economia) | 7698.0 |
| 365 | Marianna Madia | 7608.0 |
| 286 | Nuovo Centro Congressi | 6894.0 |


In [8]:
ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)
ranked_df_port.head(10)

Over the whole number of articles in the corpus  4  have not been visited during the considered period.


| | Title | Pageviews |
|---|---|---|
| 11 | Partido Democrático (Itália) | 567.0 |
| 24 | Lista de chefes de Estado e de governo atuais | 410.0 |
| 5 | G7 | 259.0 |
| 4 | Federica Mogherini | 215.0 |
| 30 | G20 | 185.0 |
| 13 | Centro-esquerda | 141.0 |
| 9 | Privatização | 93.0 |
| 10 | Lista de chefes de Estado e de governo por dat... | 83.0 |
| 25 | 9.ª reunião de cúpula do G20 | 81.0 |
| 15 | 10.ª reunião de cúpula do G20 | 70.0 |


In [9]:
import re
import json
import codecs
import pandas as pd
import numpy as np

In [10]:
def article_mentions(json_, string):
    """ This function returns the data frame of the article titles together with
    the number of times each article mentions the string of interest.
    It takes as input:
    
    @json_: is the path of the .json to import
    @string: is the string to look for (matched literally, ignoring case)"""
    
    # Initialise the lists to store titles and number of matches
    list_titles = []
    list_matches = []
    
    # Read line by line (each line is a page stored as a json object)
    with open(json_, encoding='utf-8') as f:
        for line in f:
            load_line = json.loads(line)
            text = load_line['text']
            # Count the case-insensitive occurrences of the string in the text
            match = len(re.findall(re.escape(string), text, flags=re.I))
            list_titles.append(load_line['title'])
            list_matches.append(match)
    
    # Build the df column by column, so that the counts stay integers
    #print (len(list_titles), len(list_matches))
    df_mentions = pd.DataFrame({'Title': list_titles, 'Number of mentions': list_matches})
    
    return df_mentions

In [11]:
a = article_mentions('Corpus/wiki_ita_Matteo_Renzi.json', 'Matteo Renzi')

In [12]:
a_bis = pd.merge(a, ranked_df_ita, on=['Title'])

In [13]:
a_bis.head()

| | Title | Number of mentions | Pageviews |
|---|---|---|---|
| 0 | Lecco | 1 | 7.0 |
| 1 | Anni 2010 | 1 | 10.0 |
| 2 | Francesco Guccini | 2 | 8.0 |
| 3 | Antonio Meucci | 1 | 3904.0 |
| 4 | Roberto Benigni | 1 | 8.0 |


In [14]:
a['Number of mentions'].value_counts()/len(a)*100

1     68.932039
2     18.446602
3      5.436893
4      3.689320
7      0.582524
6      0.388350
5      0.388350
12     0.388350
8      0.388350
11     0.388350
10     0.194175
30     0.194175
17     0.194175
62     0.194175
9      0.194175
Name: Number of mentions, dtype: float64
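The same percentages can be obtained with the `normalize` option of `value_counts`. A minimal sketch on a made-up series of counts:

```python
import pandas as pd

# Made-up mention counts standing in for a['Number of mentions']
mentions = pd.Series([1, 1, 1, 2, 3], name='Number of mentions')

# Manual percentage, as in the cell above
manual = mentions.value_counts() / len(mentions) * 100

# Same result using the built-in normalisation
normalized = mentions.value_counts(normalize=True) * 100

print(normalized)
```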

In [60]:
b = article_mentions('Corpus/wiki_port_Matteo_Renzi.json', 'Matteo Renzi')

42 42


In [15]:
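# Caution: this merges `a` (the Italian mention counts) with the Portuguese
# pageviews, so only titles present in both corpora are kept; merge `b`
# instead to use the Portuguese mention counts.
b_bis = pd.merge(a, ranked_df_port, on=['Title'])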

In [16]:
b_bis

| | Title | Number of mentions | Pageviews |
|---|---|---|---|
| 0 | Matteo Renzi | 62 | 26.0 |
| 1 | Sergio Mattarella | 2 | 68.0 |
| 2 | Marianna Madia | 2 | 6.0 |
| 3 | Enrico Letta | 3 | 9.0 |
| 4 | Giorgio Napolitano | 2 | 16.0 |
| 5 | Angelino Alfano | 1 | 12.0 |
| 6 | Federica Mogherini | 4 | 215.0 |
| 7 | Maria Elena Boschi | 6 | 5.0 |
| 8 | Giuliano Poletti | 1 | 7.0 |


In [95]:
b['Number of mentions'].value_counts()/len(b)*100

1     61.904762
2     30.952381
6      2.380952
11     2.380952
3      2.380952
Name: Number of mentions, dtype: float64

In [17]:
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
# Replace the placeholder with your own plotly API key
plotly.tools.set_credentials_file(username='crimenghini', api_key='YOUR_API_KEY')

In [93]:
sum(a_bis['Pageviews'])

505191.0

In [94]:
trace0 = go.Scatter(
    x = a_bis['Number of mentions'],
    y = a_bis['Pageviews']/sum(a_bis['Pageviews'])*100,
    name = 'Italian',
    mode = 'markers',
    text= a_bis['Title'],
    marker = dict(
        size = 10,
        color = 'rgba(152, 0, 0, .8)',
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    )
)

trace1 = go.Scatter(
    x = b_bis['Number of mentions'],
    y = b_bis['Pageviews']/sum(b_bis['Pageviews'])*100,
    name = 'Portuguese',
    mode = 'markers',
    text= b_bis['Title'],
    marker = dict(
        size = 10,
        color = 'rgba(255, 182, 193, .9)',
        line = dict(
            width = 2,
        )
    )
)

data = [trace0, trace1]

layout = dict(title = 'Mentions of Matteo Renzi vs. share of pageviews',
              yaxis = dict(title = 'Share of pageviews (%)'),
              xaxis = dict(title = 'Number of mentions')
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-scatter')

In [96]:
import requests
from bs4 import BeautifulSoup

In [117]:
dict_italian = {}
for i in df_pt_titles['Title']:
    print (i)
    title = i.replace(' ','_')
    req = requests.get('https://pt.wikipedia.org/wiki/' + title)
    html = req.content
    soup = BeautifulSoup(html, 'html.parser')
    title_dirty = soup.findAll("a", { "lang" : "it" })
    if len(title_dirty) != 0:
        ita_title = title_dirty[0]['title'].split('—')[0].strip()
        dict_italian[i] = ita_title

Lista de primeiros-ministros da Itália
Giorgio Napolitano
Centro-esquerda
La Repubblica
G20
G8+5
Lista de chefes de Estado e de governo atuais
Partido Democrático (Itália)
Predefinição:Membros do Conselho Europeu
Itália
Conselho Europeu
Privatização
G7
Lista de chefes de Estado e de governo por data da tomada de posse
Predefinição:Atuais líderes G20
Predefinição:Política da Itália
Lista de viagens presidenciais de Dilma Rousseff
Enrico Letta
Milena Canonero
Cerimônia de abertura dos Jogos Olímpicos de Verão de 2016
Matteo Renzi
Marianna Madia
Angelino Alfano
Maria Elena Boschi
Giuliano Poletti
Canonização de João XXIII e João Paulo II
Federica Mogherini
Massacre do Charlie Hebdo
Pietro Grasso
Sergio Mattarella
Ataque ao Museu Nacional do Bardo
Eleição presidencial na Argentina em 2015
9.ª reunião de cúpula do G20
10.ª reunião de cúpula do G20
Lista de líderes do G8
Predefinição:Guerra contra o Estado Islâmico
Lista de líderes do G20
Presidente do Conselho de Ministros da Itália
Lista d

In [120]:
print (len(dict_italian), len(df_pt_titles))

31 42


In [125]:
common_articles_ita = pd.DataFrame(list(dict_italian.values()), columns = ['Title'])

In [126]:
ita = pd.merge(common_articles_ita, a, on='Title')

In [127]:
common_articles_port = pd.DataFrame(list(dict_italian.keys()), columns = ['Title'])

In [128]:
port = pd.merge(common_articles_port, b, on='Title')

In [129]:
pd.concat([ita,port], axis = 1)

| | Title | Number of mentions | Title | Number of mentions |
|---|---|---|---|---|
| 0 | Presidenti del Consiglio dei ministri della Re... | 3.0 | Lista de primeiros-ministros da Itália | 2 |
| 1 | Marianna Madia | 2.0 | Cerimônia de abertura dos Jogos Olímpicos de V... | 1 |
| 2 | Template:Leader G20 | 1.0 | La Repubblica | 1 |
| 3 | Italia | 2.0 | Conselho Europeu | 1 |
| 4 | Attentato al museo nazionale del Bardo | 2.0 | Marianna Madia | 1 |
| 5 | Centro-sinistra | 2.0 | Predefinição:Atuais líderes G20 | 1 |
| 6 | Maria Elena Boschi | 6.0 | Itália | 3 |
| 7 | Enrico Letta | 3.0 | Ataque ao Museu Nacional do Bardo | 1 |
| 8 | Angelino Alfano | 1.0 | Centro-esquerda | 2 |
| 9 | Capi di Stato e di governo in carica | 1.0 | Maria Elena Boschi | 2 |
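Note that `pd.concat(..., axis=1)` places the two frames side by side by row position, so a row of the table above does not necessarily describe the same article in both languages (for instance, `Marianna Madia` sits next to `Cerimônia de abertura dos Jogos Olímpicos de V...`). One way to keep each pair aligned is to start from the (Portuguese title, Italian title) pairs and merge the counts onto them. The sketch below uses small made-up stand-ins for `dict_italian`, `a` and `b`; the column names `Title_pt`, `Title_it`, `Mentions_pt` and `Mentions_it` are introduced only for the illustration.

```python
import pandas as pd

# Hypothetical stand-ins for `dict_italian`, `a` and `b`
dict_italian = {'Itália': 'Italia', 'Centro-esquerda': 'Centro-sinistra'}
a = pd.DataFrame({'Title': ['Centro-sinistra', 'Italia'],
                  'Number of mentions': [2, 2]})
b = pd.DataFrame({'Title': ['Itália', 'Centro-esquerda'],
                  'Number of mentions': [3, 2]})

# One row per (Portuguese title, Italian title) pair
pairs = pd.DataFrame(list(dict_italian.items()), columns=['Title_pt', 'Title_it'])

# Attach the counts of each language by merging on the matching title column
pairs = pairs.merge(b.rename(columns={'Title': 'Title_pt',
                                      'Number of mentions': 'Mentions_pt'}),
                    on='Title_pt', how='left')
pairs = pairs.merge(a.rename(columns={'Title': 'Title_it',
                                      'Number of mentions': 'Mentions_it'}),
                    on='Title_it', how='left')
print(pairs)
```

With left joins, pairs whose Italian article is missing from the Italian corpus are kept and simply show `NaN` in `Mentions_it`.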


In [97]:
req = requests.get('https://pt.wikipedia.org/wiki/Privatização')
html = req.content

In [99]:
soup = BeautifulSoup(html, 'html.parser')

In [104]:
prova = soup.findAll("a", { "lang" : "it" })

In [114]:
# Get the name of the article
prova[0]['title'].split('—')[0].strip()

'Privatizzazione'
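The extraction steps above can be wrapped in a small function that also copes with pages lacking an Italian link. This is a sketch: `italian_title` is a hypothetical helper, and the HTML snippet is made up to imitate an interlanguage link, including the assumed `title` layout (article title, a dash, then the language name).

```python
from bs4 import BeautifulSoup


def italian_title(html):
    """Return the Italian title found in the interlanguage links, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.find('a', attrs={'lang': 'it'})
    if link is None or not link.has_attr('title'):
        return None
    # ASSUMPTION: the attribute looks like '<title> <dash> <language name>'
    return link['title'].split('\u2014')[0].strip()


# Made-up snippet imitating the interlanguage link of a Wikipedia page
sample_html = ('<ul><li><a href="https://it.wikipedia.org/wiki/Privatizzazione" '
               'title="Privatizzazione \u2014 italiano" lang="it" hreflang="it">'
               'Italiano</a></li></ul>')

print(italian_title(sample_html))
print(italian_title('<p>No interlanguage links here</p>'))
```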

In [11]:
from itertools import islice
with open(path + xml_files[0]) as myfile:
    head = list(islice(myfile, 350))

In [12]:
head

['<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="it">\n',
 '  <siteinfo>\n',
 '    <sitename>Wikipedia</sitename>\n',
 '    <dbname>itwiki</dbname>\n',
 '    <base>https://it.wikipedia.org/wiki/Pagina_principale</base>\n',
 '    <generator>MediaWiki 1.29.0-wmf.3</generator>\n',
 '    <case>first-letter</case>\n',
 '    <namespaces>\n',
 '      <namespace key="-2" case="first-letter">Media</namespace>\n',
 '      <namespace key="-1" case="first-letter">Speciale</namespace>\n',
 '      <namespace key="0" case="first-letter" />\n',
 '      <namespace key="1" case="first-letter">Discussione</namespace>\n',
 '      <namespace key="2" case="first-letter">Utente</namespace>\n',
 '      <namespace key="3" case="first-letter">Discussioni utente</namespace>\n',
 '      <namespace key="4" case="firs

In [13]:
from itertools import islice
with open(path + xml_files[1]) as myfile:
    head = list(islice(myfile, 350))

In [14]:
head

['<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="pt">\n',
 '  <siteinfo>\n',
 '    <sitename>Wikipédia</sitename>\n',
 '    <dbname>ptwiki</dbname>\n',
 '    <base>https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal</base>\n',
 '    <generator>MediaWiki 1.29.0-wmf.4</generator>\n',
 '    <case>first-letter</case>\n',
 '    <namespaces>\n',
 '      <namespace key="-2" case="first-letter">Multimédia</namespace>\n',
 '      <namespace key="-1" case="first-letter">Especial</namespace>\n',
 '      <namespace key="0" case="first-letter" />\n',
 '      <namespace key="1" case="first-letter">Discussão</namespace>\n',
 '      <namespace key="2" case="first-letter">Usuário(a)</namespace>\n',
 '      <namespace key="3" case="first-letter">Usuário(a) Discussão</namespace>\n',
 '      <