## Get in touch with Wikipedia

*Remark:* to keep the `Notebook` tidy, the classes and functions used are gathered in external libraries.

#### Table of contents
1. [Find articles](#parse)
2. [Rank articles according to November pageviews](#rank)

In [1]:
from wiki_parser import *
from helpers_parser import *

### 1. Find articles <a name="parse"></a>
The first goal is to find all the Italian and Portuguese Wikipedia articles that mention `Matteo Renzi`. To do so, I implemented a [parser](wiki_parser.py) that goes through the raw data and stores the `title` and the `text` of the elements of the corpora that mention the [Italian (almost) ex Prime Minister](http://gph.is/29zqwgy).

In [2]:
# Define the path of the corpora
path = '/Users/cristinamenghini/Downloads/'
xml_files = ['itwiki-20161120-pages-articles-multistream.xml', 'ptwiki-20161201-pages-articles-multistream.xml']

Have a quick peek at a snippet of the `xml`. The elements we are interested in are under the child `page`, which identifies an article. Then we want to get the contents of `title` and `text`.

In [3]:
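# Read the first 80 lines of the Italian dump
with open(path + xml_files[0], encoding='utf-8') as myfile:
    head = [next(myfile) for _ in range(80)]

# Snippet: lines 35 to 54 of the 80 read above
head[35:55]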

['      <namespace key="2303" case="case-sensitive">Discussioni definizione accessorio</namespace>\n',
 '      <namespace key="2600" case="first-letter">Argomento</namespace>\n',
 '    </namespaces>\n',
 '  </siteinfo>\n',
 '  <page>\n',
 '    <title>Armonium</title>\n',
 '    <ns>0</ns>\n',
 '    <id>2</id>\n',
 '    <revision>\n',
 '      <id>83431090</id>\n',
 '      <parentid>82477294</parentid>\n',
 '      <timestamp>2016-09-24T14:25:49Z</timestamp>\n',
 '      <contributor>\n',
 '        <username>Pufui PcPifpef</username>\n',
 '        <id>461628</id>\n',
 '      </contributor>\n',
 '      <comment>/* Altri progetti */</comment>\n',
 '      <model>wikitext</model>\n',
 '      <format>text/x-wiki</format>\n',
 '      <text xml:space="preserve">{{nota disambigua|il complesso italiano|Armonium (gruppo musicale)}}&lt;!--\n']

Because the `xml` is very large, I opted for an event-driven parser: it registers callbacks for the events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed, since for the purpose of this analysis we are not going to analyze the text itself.
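To make the callback idea concrete, here is a minimal sketch based on the standard library `xml.sax` module. It is not the code in `wiki_parser.py`: the class name `MentionHandler`, the keyword argument and the tiny in-memory sample are assumptions made only for illustration.

```python
import xml.sax


class MentionHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects (title, text) of pages that mention a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.current_tag = None   # element whose characters are being read
        self.title_parts = []     # characters() may deliver the content in chunks
        self.text_parts = []
        self.matches = []         # (title, text) pairs that mention the keyword

    def startElement(self, name, attrs):
        self.current_tag = name
        if name == 'page':
            # A new article starts: reset the buffers
            self.title_parts = []
            self.text_parts = []

    def characters(self, content):
        if self.current_tag == 'title':
            self.title_parts.append(content)
        elif self.current_tag == 'text':
            self.text_parts.append(content)

    def endElement(self, name):
        if name == 'page':
            # The article is complete: keep it only if it mentions the keyword
            title = ''.join(self.title_parts).strip()
            text = ''.join(self.text_parts)
            if self.keyword in text:
                self.matches.append((title, text))
        self.current_tag = None


# Made-up sample that mimics the page/title/text nesting of the dump
sample = b"""<mediawiki>
  <page>
    <title>Partito Democratico (Italia)</title>
    <revision><text>Il segretario del partito: Matteo Renzi.</text></revision>
  </page>
  <page>
    <title>Armonium</title>
    <revision><text>Strumento musicale a tastiera.</text></revision>
  </page>
</mediawiki>"""

handler = MentionHandler('Matteo Renzi')
xml.sax.parseString(sample, handler)
print([title for title, _ in handler.matches])
```

For a real dump, `xml.sax.parse(path_to_xml, handler)` streams the file from disk in the same way, so only the article currently being read is held in memory.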

Next, I parse the Italian corpus.

In [4]:
# Parse Italian corpus
parse_articles('ita', path + xml_files[0])

Then I do the same for the Portuguese one.

In [5]:
# Parse Portuguese corpus
parse_articles('port', path + xml_files[1])

The articles are filtered by whether they mention Matteo Renzi. Those in Italian are stored in a [`.json` file](Corpus/wiki_ita_matteo_renzi.json) in which each line corresponds to a page (`title`, `text`). The same holds for the [articles](Corpus/wiki_port_matteo_renzi.json) in Portuguese. Both corpora are stored in the folder [`Corpus`](https://github.com/CriMenghini/Wikipedia/tree/master/Corpus). Each file has the following layout:

                                      {"title": "title_1", "text": "text_1"}
                                                        ...
                                                        ...
                                      {"title": "title_n", "text": "text_n"}
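A minimal sketch of this one-object-per-line layout, using only the standard library. The in-memory buffer stands in for the `.json` file and the two pages are placeholders, not real articles.

```python
import io
import json

# Placeholder pages, in the same (title, text) shape as the stored articles
pages = [
    {"title": "title_1", "text": "text_1"},
    {"title": "title_2", "text": "text_2"},
]

# Write: one JSON object per line
buffer = io.StringIO()
for page in pages:
    buffer.write(json.dumps(page, ensure_ascii=False) + '\n')

# Read back line by line, keeping only the titles
buffer.seek(0)
titles = [json.loads(line)['title'] for line in buffer]
print(titles)
```

Since every line is a complete JSON object, the file can be read one page at a time without loading the whole corpus.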

### 2. Rank articles according to November pageviews <a name="rank"></a>

In [3]:
import pandas as pd
import json
import codecs

In [18]:
def read_json(json_):
    """ This function returns the data frame of the titles of the articles that mention Matteo Renzi.
    It takes as input:
    
    @json_: is the path of the .json to import"""
    
    # Initialise the list to store titles
    list_titles = []
    
    # Read line by line (each line is one JSON object)
    with codecs.open(json_, 'r', 'utf-8') as f:
        for line in f:
            list_titles.append(json.loads(line)['title'])
    
    # Convert the list of titles into a df
    df_titles = pd.DataFrame(list_titles, columns=['Title'])
    
    return df_titles

In [21]:
df_it_titles = read_json('Corpus/wiki_ita_matteo_renzi.json')

In [22]:
df_pt_titles = read_json('Corpus/wiki_port_matteo_renzi.json')

In [26]:
import time

In [63]:
s = time.time()
it_art = {}
pt_art = {}
with open('/Users/cristinamenghini/Downloads/pagecounts-2016-11-views-ge-5-totals') as fp:
    for line in fp:
        split_line = line.split(' ')
        
        # Get title
        title_split = split_line[1].split(':')
        
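        # Filter by language (first column); get_pageviews is defined in a cell further down
        if split_line[0].startswith('it'):
            it_art = get_pageviews(it_art, title_split, split_line)
        
        elif split_line[0].startswith('pt'):
            pt_art = get_pageviews(pt_art, title_split, split_line)

# Elapsed time in seconds
print(time.time() - s)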

115.73478293418884


In [104]:
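def define_ranked_df(dict_art, mention_title):
    """ This function returns the data frame of the articles that mention Matteo Renzi,
    ranked by pageviews. It takes as inputs:
    
    @dict_art: the dictionary (title, pageviews)
    @mention_title: the data frame of the titles of the articles that mention Matteo Renzi"""
    
    # Convert the dictionary into a df
    df_total_article = pd.DataFrame(list(dict_art.items()), columns=['Title', 'Pageviews'])

    # Keep every article that mentions Matteo Renzi (right join on the title)
    interested_art = pd.merge(df_total_article, mention_title, how='right', on=['Title'])
    
    # Drop the articles with no recorded pageviews
    visited_article = interested_art[interested_art['Pageviews'].notnull()]
    
    # Sort by pageviews, most viewed first
    ranked_articles = visited_article.sort_values('Pageviews', ascending=False)
    
    return ranked_articles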
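The three steps of `define_ranked_df` (right join, drop the missing pageviews, sort) can be checked on a toy example. The titles and counts below are made up for illustration only.

```python
import pandas as pd

# Made-up inputs: title -> pageviews, and the titles that mention the keyword
views = {'A': 10, 'B': 30, 'C': 5}
mentions = pd.DataFrame({'Title': ['B', 'A', 'D']})   # 'D' has no recorded pageviews

df_views = pd.DataFrame(list(views.items()), columns=['Title', 'Pageviews'])

# Right join: keep every mentioning title, with NaN where pageviews are missing
merged = pd.merge(df_views, mentions, how='right', on='Title')

# Drop the titles without pageviews and rank the rest
ranked = merged[merged['Pageviews'].notnull()].sort_values('Pageviews', ascending=False)
print(ranked['Title'].tolist())
```

The right join introduces `NaN` for the titles without pageviews, which turns the `Pageviews` column into floats. This is why the counts in the ranked tables are shown with a trailing `.0`.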

In [105]:
define_ranked_df(it_art, df_it_titles).head()

| | Title | Pageviews |
|---|---|---|
| 319 | Marco Travaglio | 19795.0 |
| 88 | Pif (conduttore televisivo) | 19557.0 |
| 91 | Partito Democratico (Italia) | 11653.0 |
| 234 | Vittorio Sgarbi | 11324.0 |
| 60 | Malala Yousafzai | 9452.0 |


In [109]:
define_ranked_df(pt_art, df_pt_titles)

| | Title | Pageviews |
|---|---|---|
| 23 | Partido Democrático (Itália) | 567.0 |
| 29 | Lista de chefes de Estado e de governo atuais | 410.0 |
| 34 | G7 | 259.0 |
| 5 | Federica Mogherini | 215.0 |
| 18 | G20 | 185.0 |
| 31 | Centro-esquerda | 141.0 |
| 2 | Privatização | 93.0 |
| 7 | Lista de chefes de Estado e de governo por dat... | 83.0 |
| 37 | 9.ª reunião de cúpula do G20 | 81.0 |
| 25 | 10.ª reunião de cúpula do G20 | 70.0 |


In [59]:
a = list(it_art.keys())

In [64]:
it_art['Villa di Faragola']

61

In [65]:
def get_pageviews(lang_dict, title_split, split_line):
    """ This function adds an element (key, value): (title, pageviews) to the dictionary
    and returns the dictionary. 
    It takes as inputs:
    
    @lang_dict: the dictionary to update
    @title_split: list of elements in the second column (splitting by ':')
    @split_line: three column entries of a line of the pageviews file"""
    
    # The split by ':' produces exactly two elements (prefix:title): keep the title
    if len(title_split) == 2:
        # Add the article and the respective pageviews to the dictionary
        lang_dict[title_split[1].replace('_', ' ')] = int(split_line[2])
    
    elif len(title_split) == 1:
        # Add the article and the respective pageviews to the dictionary
        lang_dict[title_split[0].replace('_', ' ')] = int(split_line[2])
        
    return lang_dict
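To see how the function above treats different titles, here is a small self-contained check. The helper `add_pageviews` is only a copy of the same logic so that the snippet runs on its own, and the sample lines are made up, assuming the three space-separated columns (project, title, pageviews) described in the docstring.

```python
def add_pageviews(lang_dict, title_split, split_line):
    # Copy of the logic of get_pageviews, repeated so the snippet is standalone
    if len(title_split) == 2:
        lang_dict[title_split[1].replace('_', ' ')] = int(split_line[2])
    elif len(title_split) == 1:
        lang_dict[title_split[0].replace('_', ' ')] = int(split_line[2])
    return lang_dict


# Made-up lines in the assumed "project title pageviews" layout
sample_lines = [
    'it.z Villa_di_Faragola 61\n',
    'it.z Categoria:Politici_italiani 12\n',
    'it.z Star_Trek:_Voyager:_Elite_Force 7\n',
]

sample_art = {}
for line in sample_lines:
    split_line = line.split(' ')
    title_split = split_line[1].split(':')
    sample_art = add_pageviews(sample_art, title_split, split_line)

print(sample_art)
# {'Villa di Faragola': 61, 'Politici italiani': 12}
```

Two things follow from this logic: a title with a single `:` loses the part before the colon, so it could overwrite an article with the same name, and a title with more than one `:` matches neither branch and is skipped.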

In [70]:
df_total_article_it = pd.DataFrame(list(it_art.items()), columns=['Title', 'Pageviews'])

In [73]:
df_total_article_it.head()

| | Title | Pageviews |
|---|---|---|
| 0 | | 49 |
| 1 | Villa di Faragola | 61 |
| 2 | Giulio Sciorilli Borrelli | 32 |
| 3 | Consiglio ONU per i Diritti Umani | 44 |
| 4 | Voltammetria ad onda quadra | 55 |


In [78]:
interested_it_art = pd.merge(df_total_article_it, df_it_titles, how = 'right', on=['Title'])

In [79]:
interested_it_art.shape

(515, 2)

In [96]:
visited_article = interested_it_art[interested_it_art['Pageviews'].notnull()]

In [97]:
ranked_articles = visited_article.sort_values('Pageviews', ascending = False)