# Web Indexing - Lab 1
BERNARD Renan

## Introduction

First, the imports needed for this notebook.
The helper functions used here are defined in the file __utils.py__.

In [4]:
import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt

from functools import reduce
import multiprocessing

from utils import *

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Next, let us load the full dataset. We keep a DataFrame containing the text, the author, and the document identifier.

In [2]:
%%time
texts = generate_texts_dataframe()
texts.head()

CPU times: user 121 ms, sys: 71.9 ms, total: 193 ms
Wall time: 464 ms


|   | Text | Author | DocumentId |
|---|------|--------|------------|
| 0 | Russia's Fuel and Energy Ministry, sitting on ... | Lynnley Browning | 116673 |
| 1 | Russia's Western oil joint ventures are findin... | Lynnley Browning | 248885 |
| 2 | Russian oil company officials said on Friday t... | Lynnley Browning | 314644 |
| 3 | Azerbaijan is proving more successful in attra... | Lynnley Browning | 219830 |
| 4 | A multinational group trying to build a $1.5 b... | Lynnley Browning | 239689 |


Let us read a few articles.

In [4]:
view_article(10, texts)
view_article(102, texts)


-------------------------------------------- 
Author :  Lynnley Browning 
Id :  114005 
--------------------------------------------
Aluminium industry sources expressed doubt on Monday over whether producers association Kontsern Alyuminiy would be able to contain output to support sagging world prices.
"I don't believe they'll do it," said a Western trade source in Moscow.
A Moscow representative of the Krasnoyarsk smelter and two sources at the giant Bratsk plant said they had no information on plans to contain output.
"This is a very strategic question on which I have no information," said Bratsk-based Andrei Toropovsky of the plant's foreign economic relations department.
Kontsern Alyuminiy chief executive Igor Prokopov said Russian smelters had agreed to keep aluminium output at current levels and not to exploit at least 80,000-100,000 tonnes of idle annual capacity.
He and Vladimir Kalchenko, the group's first deputy chief executive, referred to weak world prices and unprofitabl

## Building the Index

The index is built according to the specification of the function __create_index_from_text(...)__, which is documented in __utils.py__. To speed up the computation, we use the _MapReduce_ model. Here the __mapper__ builds the index of a single document, and the __reducer__ _adds_ (merges) the resulting indexes.
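As a rough illustration of the mapper and reducer described above, here is a minimal sketch on a two-document toy corpus. The names `build_doc_index` and `merge_indexes` are hypothetical stand-ins, not the actual functions from __utils.py__, and the whitespace tokenizer is an assumption (the real index appears to store stemmed tokens). Only the shape of the entries follows the output shown in this notebook.

```python
from functools import reduce


def build_doc_index(text, doc_id):
    """Hypothetical mapper: positional index of a single document."""
    index = {}
    # Assumption: lowercase + whitespace split, with no stemming.
    for position, token in enumerate(text.lower().split()):
        entry = index.setdefault(token, {"total_occurences": 0})
        posting = entry.setdefault(doc_id, {"locations": [], "occurences": 0})
        posting["locations"].append(position)
        posting["occurences"] += 1
        entry["total_occurences"] += 1
    return index


def merge_indexes(left, right):
    """Hypothetical reducer: 'add' two indexes built from disjoint documents."""
    merged = {token: dict(entry) for token, entry in left.items()}
    for token, entry in right.items():
        target = merged.setdefault(token, {"total_occurences": 0})
        for key, value in entry.items():
            if key == "total_occurences":
                target["total_occurences"] += value
            else:
                target[key] = value  # posting of a document absent from `left`
    return merged


docs = ["air france orders boeing", "france backs air france"]
toy_index = reduce(merge_indexes, (build_doc_index(t, i) for i, t in enumerate(docs)))
```

Because the merge is associative, the per-document indexes can be built in any order, or in parallel, before being folded together.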

First, let us measure the execution time without __MapReduce__:

In [3]:
%%time
index = create_index_from_text(texts.Text[0], 0)
for i in range(1, len(texts)):
    index_i = create_index_from_text(texts.Text[i], i)
    index = sum_two_indexes(index, index_i)

CPU times: user 31.5 s, sys: 128 ms, total: 31.7 s
Wall time: 31.7 s


In [7]:
index['abandon']

{'total_occurences': 33,
 78: {'locations': [213], 'occurences': 1},
 80: {'locations': [350], 'occurences': 1},
 93: {'locations': [213], 'occurences': 1},
 257: {'locations': [205], 'occurences': 1},
 266: {'locations': [96], 'occurences': 1},
 278: {'locations': [163], 'occurences': 1},
 293: {'locations': [84], 'occurences': 1},
 462: {'locations': [138], 'occurences': 1},
 472: {'locations': [214], 'occurences': 1},
 478: {'locations': [141], 'occurences': 1},
 959: {'locations': [178], 'occurences': 1},
 968: {'locations': [274], 'occurences': 1},
 979: {'locations': [212], 'occurences': 1},
 1045: {'locations': [162], 'occurences': 1},
 1155: {'locations': [190], 'occurences': 1},
 1213: {'locations': [17], 'occurences': 1},
 1219: {'locations': [61], 'occurences': 1},
 1241: {'locations': [114], 'occurences': 1},
 1249: {'locations': [244], 'occurences': 1},
 1295: {'locations': [137], 'occurences': 1},
 1398: {'locations': [179], 'occurences': 1},
 1439: {'locations': [183, 29

Now let us try with the __map__ and __reduce__ functions:

In [15]:
%%time 
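def mapper(x):
    # Map step: build the index of document number x.
    return create_index_from_text(texts.Text[x], x)

# Build the per-document indexes in parallel on 4 worker processes.
with multiprocessing.Pool(processes=4) as pool:
    results = pool.map(mapper, range(len(texts)))

# Reduce step: merge the per-document indexes one after another.
index = reduce(sum_two_indexes, results)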

CPU times: user 4.25 s, sys: 908 ms, total: 5.16 s
Wall time: 15.6 s


In [8]:
index['abandon']

{'total_occurences': 33,
 78: {'locations': [213], 'occurences': 1},
 80: {'locations': [350], 'occurences': 1},
 93: {'locations': [213], 'occurences': 1},
 257: {'locations': [205], 'occurences': 1},
 266: {'locations': [96], 'occurences': 1},
 278: {'locations': [163], 'occurences': 1},
 293: {'locations': [84], 'occurences': 1},
 462: {'locations': [138], 'occurences': 1},
 472: {'locations': [214], 'occurences': 1},
 478: {'locations': [141], 'occurences': 1},
 959: {'locations': [178], 'occurences': 1},
 968: {'locations': [274], 'occurences': 1},
 979: {'locations': [212], 'occurences': 1},
 1045: {'locations': [162], 'occurences': 1},
 1155: {'locations': [190], 'occurences': 1},
 1213: {'locations': [17], 'occurences': 1},
 1219: {'locations': [61], 'occurences': 1},
 1241: {'locations': [114], 'occurences': 1},
 1249: {'locations': [244], 'occurences': 1},
 1295: {'locations': [137], 'occurences': 1},
 1398: {'locations': [179], 'occurences': 1},
 1439: {'locations': [183, 29

Fortunately, the results are identical, and using __map__ and __reduce__ roughly halves the wall-clock time (15.6 s versus 31.7 s).

## Building the TF-IDF Matrix

The TF-IDF matrix (the representation of the corpus in TF-IDF space) is also built following a _MapReduce_ model: the __mapper__ computes the TF-IDF of one _token_ for every document, and the __reducer__ is a simple concatenation.
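To make the per-token mapper concrete, here is a minimal, dependency-free sketch. It assumes one common weighting, tf = occurrences / document length and idf = log(N / document frequency), which may differ from what `calculate_tf_idf_for_token` actually computes. The function `tf_idf_column`, the toy index, and `doc_lengths` are all hypothetical.

```python
import math

# Toy index in the same shape as the index output shown earlier.
toy_index = {
    "air": {
        "total_occurences": 2,
        0: {"locations": [0], "occurences": 1},
        1: {"locations": [2], "occurences": 1},
    },
    "boeing": {
        "total_occurences": 1,
        0: {"locations": [3], "occurences": 1},
    },
}
doc_lengths = [4, 4]  # assumed: number of tokens in each document


def tf_idf_column(entry, doc_lengths):
    """Hypothetical mapper: TF-IDF of one token for every document."""
    n_docs = len(doc_lengths)
    doc_ids = [key for key in entry if key != "total_occurences"]
    idf = math.log(n_docs / len(doc_ids))  # assumed IDF variant
    column = [0.0] * n_docs
    for doc_id in doc_ids:
        tf = entry[doc_id]["occurences"] / doc_lengths[doc_id]
        column[doc_id] = tf * idf
    return column


tokens = sorted(toy_index)
columns = [tf_idf_column(toy_index[token], doc_lengths) for token in tokens]
# Reducer: put the columns side by side (rows = documents, columns = tokens).
toy_matrix = [list(row) for row in zip(*columns)]
```

With this weighting, a token that occurs in every document ("air" here) gets an IDF of zero and therefore contributes nothing to any score.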

In [16]:
%%time
tokens_count = generate_tokens_count(index)

def mapper(token):
    # Map step: TF-IDF of one token for every document.
    return calculate_tf_idf_for_token(index[token], tokens_count)

with multiprocessing.Pool(processes=4) as pool:
    results = pool.map(mapper, list(index.keys()))

# Reduce step: concatenate the per-token results along axis 1.
matrix_tfidf = np.concatenate(results, axis=1)

CPU times: user 960 ms, sys: 850 ms, total: 1.81 s
Wall time: 3.07 s


## Querying the Corpus

To query the corpus, we simply compute the dot product between the matrix and the representation of the query text in TF-IDF space.
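The ranking step can be sketched as follows, in plain Python rather than NumPy so that it runs anywhere. The helper `rank_documents`, the toy matrix, and the way the query is encoded are assumptions for illustration; the actual `query_corpus` in __utils.py__ may build the query vector differently.

```python
def rank_documents(query_vector, matrix, top_k=3):
    """Score each document (row) by its dot product with the query, best first."""
    scores = [sum(q * w for q, w in zip(query_vector, row)) for row in matrix]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(doc_id, scores[doc_id]) for doc_id in order[:top_k]]


# Toy TF-IDF matrix: 3 documents (rows) x 2 tokens (columns: "china", "wheat").
toy_matrix = [
    [0.10, 0.00],
    [0.00, 0.30],
    [0.25, 0.05],
]
query_vector = [1.0, 0.0]  # assumed encoding of the one-word query "china"
ranking = rank_documents(query_vector, toy_matrix, top_k=2)
```

Here `ranking` lists document 2 first and document 0 second, since they carry the largest weights on the queried token.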

In [28]:
%%time
query_corpus("china", index, matrix_tfidf, texts, nb_to_show=20)

CPU times: user 81.1 ms, sys: 14.9 ms, total: 96 ms
Wall time: 58.5 ms


|      | DotProduct | Text | Author | DocumentId |
|------|------------|------|--------|------------|
| 1312 | 0.138641 | Zhuhai, China Nov 5. (Reuter) China's flag car... | Jim Gilchrist | 166432 |
| 1198 | 0.135079 | China is unlikely to concede on its demand to ... | Jane Macartney | 242319 |
| 1150 | 0.134272 | China is unlikely to concede on its demand to ... | Jane Macartney | 242519 |
| 830 | 0.131013 | China must make real changes to its economy if... | Mure Dickie | 126632 |
| 1825 | 0.124948 | China, in a bid to boost the its aerospace ind... | Tan Ee Lyn | 169589 |
| 1320 | 0.124416 | DHL Worldwide Express plans to strengthen its ... | Jim Gilchrist | 138497 |
| 1560 | 0.118853 | Wheat is unlikely to become a casualty of the ... | Lynne Donnell | 15736 |
| 1380 | 0.118372 | Russia will push to expand economic ties with ... | William Kazer | 209813 |
| 1579 | 0.111065 | Wheat is unlikely to become a casualty of the ... | Lynne Donnell | 17858 |
| 2287 | 0.101874 | Hong Kong business groups hit back on Wednesda... | Sarah Davison | 310654 |


In [91]:
%%time
search_results = search_words("france air", index)
print(search_results.get("exact_matches"))

"franc" appears in 199 documents.
"air" appears in 168 documents.
All words in 39 documents.
Exact matches in  9 documents.
{1258, 1291, 1295, 1297, 948, 919, 1274, 1276, 1279}
CPU times: user 3.99 ms, sys: 0 ns, total: 3.99 ms
Wall time: 3.38 ms


In [73]:
len(set(np.asarray(index['china'][1584]['locations'])).intersection(set(np.asarray(index['decis'][1584]['locations']) - 1)))

1
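The one-line check above relies on the positions stored in the index: one token directly precedes another wherever a location of the first equals a location of the second minus one. A more readable sketch of the same idea is given below; `phrase_positions` and the toy index are hypothetical and are not taken from __utils.py__.

```python
def phrase_positions(index, first, second, doc_id):
    """Positions in `doc_id` where `first` is immediately followed by `second`."""
    first_locations = set(index[first][doc_id]["locations"])
    second_locations = index[second][doc_id]["locations"]
    # Shift the second token's locations back by one and intersect.
    return sorted(first_locations & {loc - 1 for loc in second_locations})


# Toy index in the same shape as the index output shown earlier.
toy_index = {
    "air": {"total_occurences": 2, 7: {"locations": [1, 12], "occurences": 2}},
    "franc": {"total_occurences": 2, 7: {"locations": [2, 30], "occurences": 2}},
}
matches = phrase_positions(toy_index, "air", "franc", 7)
reverse_matches = phrase_positions(toy_index, "franc", "air", 7)
```

The check is order-sensitive: in the toy data "air" is followed by "franc" once, while "franc" is never followed by "air".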

In [89]:
view_article(948, texts)


--------------------------------------------
Author : Pierre Tran
Id : 204524
--------------------------------------------
State-owned Air France on Wednesday reported dramatically improved earnings of 802 million francs ($158 million) in the first half and placed a bumper order for 20 Boeing and Airbus aircraft, plus options.
The profit estimate for the six months to September 30, which emerged in an Air France board statement, compared with a loss of 335 million francs a year ago.
The improved results and Air France's aim to break even in 1996/97 was underlined with orders for 10 Boeing 777 twinjets, and options for 10 more. The airline also ordered five Airbus A340s, confirmed five orders made in June and options for a further five.
Air France chairman Christian Blanc won vital support from Prime Minister Alain Juppe to buy from Seattle-based Boeing instead of ordering solely from its archrival Airbus Industrie , the European consortium based in southwest France.
At catalogue price