# Search engine

The goal of this lab is to build a search engine over a collection of texts. We use a collection of a little over 11,000 messages posted on (English-language) discussion forums, which is frequently used in data analysis.

The dataset is documented at http://scikit-learn.org/stable/datasets/twenty_newsgroups.html

The lab consists of several exercises and one problem. The main operations are carried out with the ``scikit-learn`` library:
  * indexing the files and the terms
  * building a sparse matrix that counts term occurrences
  * vectorizing the messages and the queries
  * inverted file
  * computing similarities and sorting the list of results


See: http://scikit-learn.org/stable/modules/feature_extraction.html
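
As a rough, library-free sketch of the inverted-file idea listed above: an inverted index maps each term to the documents that contain it, so a query only has to look at those documents. The toy documents and the helper names `build_inverted_index` and `lookup` are made up for illustration and are not part of ``scikit-learn``.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Tokens of at least two word characters, lowercased
        for token in re.findall(r"\b\w\w+\b", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def lookup(index, keywords):
    """Return the ids of documents containing at least one of the keywords."""
    hits = set()
    for word in keywords:
        hits.update(index.get(word.lower(), []))
    return sorted(hits)

# Hypothetical toy documents, not taken from the 20 newsgroups data
toy_docs = [
    "I saw a funky looking car the other day",
    "The clock oscillator upgrade went fine",
    "Is this car a Bricklin?",
]
toy_index = build_inverted_index(toy_docs)
print(toy_index["car"])                      # documents containing 'car'
print(lookup(toy_index, ["car", "clock"]))   # documents matching either keyword
```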

# Importing the numerical libraries

In [None]:
import numpy as np
import scipy.sparse as sp
from pprint import pprint

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


# Downloading the data

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [None]:
type(newsgroups_train)

sklearn.utils.Bunch

### Number of files

In [None]:
newsgroups_train.filenames

array(['/root/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51861',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51879',
       ...,
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.ibm.pc.hardware/60695',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38319',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/rec.motorcycles/104440'],
      dtype='<U86')

In [None]:
newsgroups_train.filenames.shape

(11314,)

### File names

In [None]:
print(newsgroups_train.filenames[0])

/root/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994


### Displaying the first message

In [None]:
print(newsgroups_train.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


### Category of the first message

In [None]:
print(newsgroups_train.target[0])

7


In [None]:
print(newsgroups_train.target_names[7])

rec.autos


### List of categories

In [None]:
print(newsgroups_train.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


# 2. Vectorization (term counting)

### Corpus

In [None]:
corpus = newsgroups_train.data

In [None]:
len(corpus)

11314

### Vectorizer

In [None]:
vectoriseur = CountVectorizer()

### Analysis (fitting the vocabulary)

In [None]:
vectoriseur.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

### Index dictionary

In [None]:
indice = vectoriseur.vocabulary_

In [None]:
len(indice)

101631

In [None]:
indice['car']

25775

### List of terms

In [None]:
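# Note: get_feature_names() was removed in scikit-learn 1.2;
# recent versions provide get_feature_names_out() instead.
terme = vectoriseur.get_feature_names()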

In [None]:
terme[25775]

'car'

## <font color='purple'>Exercise 1

<font color='purple'>
Display the 10 terms that follow the term 'home', and those that follow the term 'car', in the vocabulary list.
</font>

In [None]:
print(terme[indice['home']:(indice['home']+11)])
print(terme[indice['car']:(indice['car']+11)])

['home', 'homeboy', 'homeboys', 'homebrew', 'homecoming', 'homeland', 'homelands', 'homeless', 'homelessness', 'homemade', 'homeo']
['car', 'car377', 'caraballo', 'caramate', 'caramel', 'caramelizing', 'caramete', 'caratzas', 'caravan', 'caravans', 'caray']
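
The neighbours printed above are alphabetical because the vocabulary indices follow the sorted order of the terms. The following is only a simplified imitation of the default settings shown in the `CountVectorizer` output (lowercasing, tokens of at least two word characters), not the library's actual code; `tokenize`, `build_vocabulary` and the toy corpus are hypothetical.

```python
import re

# Same token pattern as the one shown in the CountVectorizer parameters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents):
    """Assign each distinct token an index following alphabetical order."""
    distinct = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(distinct)}

# Hypothetical toy corpus
toy_corpus = ["The car is at home.", "A homemade caramel, a car!"]
vocab = build_vocabulary(toy_corpus)
terms = sorted(vocab, key=vocab.get)   # list of terms, indexable by position

print(terms)
print(terms[vocab["car"] + 1])   # the term right after 'car'
```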


### Text-to-vector transformation (sparse matrix)

In [None]:
texte = [corpus[0]]

In [None]:
print(texte)

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.']


In [None]:
vect = vectoriseur.transform(texte)

In [None]:
print(vect)

  (0, 9843)	1
  (0, 11174)	1
  (0, 16809)	1
  (0, 17936)	1
  (0, 18915)	2
  (0, 21987)	1
  (0, 23480)	1
  (0, 24160)	1
  (0, 24635)	1
  (0, 25492)	1
  (0, 25590)	1
  (0, 25775)	4
  (0, 30074)	1
  (0, 31990)	1
  (0, 34809)	1
  (0, 34810)	1
  (0, 35974)	1
  (0, 37287)	1
  (0, 37335)	1
  (0, 41715)	2
  (0, 41724)	1
  (0, 41979)	1
  (0, 45885)	1
  (0, 46814)	1
  (0, 48754)	2
  :	:
  (0, 68080)	2
  (0, 68409)	1
  (0, 68781)	1
  (0, 68847)	1
  (0, 71850)	1
  (0, 73373)	1
  (0, 76471)	1
  (0, 77878)	1
  (0, 80623)	1
  (0, 81658)	1
  (0, 83426)	1
  (0, 84276)	1
  (0, 84538)	1
  (0, 88143)	1
  (0, 88532)	6
  (0, 88638)	1
  (0, 88767)	4
  (0, 89360)	1
  (0, 95844)	4
  (0, 96247)	1
  (0, 96395)	1
  (0, 96433)	1
  (0, 97181)	1
  (0, 99911)	1
  (0, 100208)	1


In [None]:
vue = sp.find(vect)

In [None]:
vue[2].tolist().index(max(vue[2]))

53

In [None]:
print(vue)

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32), array([  9843,  11174,  16809,  17936,  18915,  21987,  23480,  24160,
        24635,  25492,  25590,  25775,  30074,  31990,  34809,  34810,
        35974,  37287,  37335,  41715,  41724,  41979,  45885,  46814,
        48754,  49447,  49932,  51136,  51326,  54632,  55746,  57390,
        57393,  59079,  59216,  60560,  62746,  64931,  67670,  68080,
        68409,  68781,  68847,  71850,  73373,  76471,  77878,  80623,
        81658,  83426,  84276,  84538,  88143,  88532,  88638,  88767,
        89360,  95844,  96247,  96395,  96433,  97181,  99911, 100208],
      dtype=int32), array([1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1,

## <font color='purple'>Exercise 2

Using the vector vue, find:
  * the most frequent term in the message
  * the number of occurrences of the term 'car'



</font>

In [None]:
# index of the most frequent term
vue[1].tolist()[vue[2].tolist().index(max(vue[2]))]

88532

In [None]:
# look up the term corresponding to the index found
terme[88532]
terme[vue[1].tolist()[vue[2].tolist().index(max(vue[2]))]]

'the'

In [None]:
# number of occurrences of 'car'
vue[2].tolist()[vue[1].tolist().index(indice['car'])]

4
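
The chained `tolist().index(...)` calls can be hard to read. One possible alternative, shown here on a few (column, count) pairs copied from the printed vector above rather than on the real `vue` tuple, is to pair columns and counts in a dictionary. With the notebook's objects, the same dictionary could presumably be built as `dict(zip(vue[1].tolist(), vue[2].tolist()))`.

```python
# A few (column index, count) pairs copied from the printed vector above
cols = [18915, 25775, 88532, 100208]
values = [2, 4, 6, 1]

# Map each column index to its count
counts = dict(zip(cols, values))

# Column index with the highest count
most_frequent = max(counts, key=counts.get)
print(most_frequent)    # 88532, the index of 'the'
print(counts[25775])    # 4, the count of 'car'
```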

### Vectorization of the whole corpus

In [None]:
X_comptage = vectoriseur.transform(corpus)

In [None]:
print(X_comptage)

  (0, 9843)	1
  (0, 11174)	1
  (0, 16809)	1
  (0, 17936)	1
  (0, 18915)	2
  (0, 21987)	1
  (0, 23480)	1
  (0, 24160)	1
  (0, 24635)	1
  (0, 25492)	1
  (0, 25590)	1
  (0, 25775)	4
  (0, 30074)	1
  (0, 31990)	1
  (0, 34809)	1
  (0, 34810)	1
  (0, 35974)	1
  (0, 37287)	1
  (0, 37335)	1
  (0, 41715)	2
  (0, 41724)	1
  (0, 41979)	1
  (0, 45885)	1
  (0, 46814)	1
  (0, 48754)	2
  :	:
  (11313, 57131)	1
  (11313, 60560)	1
  (11313, 61975)	1
  (11313, 62086)	1
  (11313, 64435)	1
  (11313, 66242)	1
  (11313, 66857)	2
  (11313, 68080)	1
  (11313, 68409)	1
  (11313, 68997)	1
  (11313, 70066)	1
  (11313, 71786)	1
  (11313, 71992)	1
  (11313, 78365)	1
  (11313, 81742)	1
  (11313, 81792)	1
  (11313, 82660)	1
  (11313, 84605)	1
  (11313, 85524)	1
  (11313, 87730)	1
  (11313, 89465)	1
  (11313, 89804)	1
  (11313, 90644)	1
  (11313, 96497)	1
  (11313, 96707)	1


X_comptage is a sparse matrix with 11314 rows, each row being the count vector of one text.

## <font color='purple'>Exercise 3

<font color='purple'>
How many texts contain the term 'home'? Deduce the 'term frequency' score (computed here as the fraction of texts that contain the term).
</font>

In [None]:
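i = indice['home']
Col = X_comptage.getcol(i)   # column of the term 'home' (one entry per text)
L = Col.nonzero()            # (row indices, column indices) of the non-zero entries
d = len(L[0])                # number of texts containing 'home'
d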

352

In [None]:
print("the term frequency score of the word home is {}".format(d/11314))

the term frequency score of the word home is 0.031111896765069823


## <font color='purple'>Exercise 4

<font color='purple'>
Display a message from the corpus that contains the term 'platypus'
</font>

In [None]:
corpus[1]

"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks."

In [None]:
i=indice['platypus']
Col=X_comptage.getcol(i)
L=Col.nonzero()
corpus[L[0].tolist()[0]]

'\n\nSee, there you go again, saying that a moral act is only significant\nif it is "voluntary."  Why do you think this?\n\nAnd anyway, humans have the ability to disregard some of their instincts.\n\n\nYou are attaching too many things to the term "moral," I think.\nLet\'s try this:  is it "good" that animals of the same species\ndon\'t kill each other.  Or, do you think this is right? \n\nOr do you think that animals are machines, and that nothing they do\nis either right nor wrong?\n\n\n\nThose weren\'t arbitrary killings.  They were slayings related to some sort\nof mating ritual or whatnot.\n\n\nYes it was, but I still don\'t understand your distinctions.  What\ndo you mean by "consider?"  Can a small child be moral?  How about\na gorilla?  A dolphin?  A platypus?  Where is the line drawn?  Does\nthe being need to be self aware?\n\nWhat *do* you call the mechanism which seems to prevent animals of\nthe same species from (arbitrarily) killing each other?  Don\'t\nyou find the fact 

## <font color='purple'>Exercise 5

<font color='purple'>
Now vectorize an English text of your choice and display the resulting vector</font>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# The context manager closes the file once it has been read
with open('/content/drive/MyDrive/Text mining/pg2701.txt', 'r') as Doc:
    texte1 = Doc.read()

In [None]:
L=[]
L.append(texte1)

In [None]:
X_comptage1 = vectoriseur.transform(L)
print(X_comptage1)

  (0, 1)	21
  (0, 1469)	4
  (0, 1470)	1
  (0, 1622)	1
  (0, 1708)	1
  (0, 1748)	1
  (0, 1769)	1
  (0, 1786)	1
  (0, 1807)	1
  (0, 1821)	1
  (0, 1833)	1
  (0, 1846)	1
  (0, 1912)	1
  (0, 1913)	1
  (0, 1996)	1
  (0, 2073)	1
  (0, 2086)	1
  (0, 2107)	1
  (0, 2124)	1
  (0, 2142)	1
  (0, 2157)	1
  (0, 2178)	1
  (0, 2194)	1
  (0, 2222)	1
  (0, 2223)	1
  :	:
  (0, 100049)	3
  (0, 100185)	3
  (0, 100186)	4
  (0, 100187)	1
  (0, 100195)	2
  (0, 100197)	6
  (0, 100208)	963
  (0, 100212)	80
  (0, 100214)	2
  (0, 100215)	1
  (0, 100216)	1
  (0, 100221)	258
  (0, 100231)	9
  (0, 100233)	26
  (0, 100235)	7
  (0, 100239)	9
  (0, 100893)	2
  (0, 100894)	7
  (0, 100928)	1
  (0, 101025)	1
  (0, 101058)	1
  (0, 101277)	5
  (0, 101278)	3
  (0, 101285)	2
  (0, 101297)	1


# 3. TF-IDF transformation

In [None]:
transformeur = TfidfTransformer(norm=None, smooth_idf=False)

### IDF computation

In [None]:
transformeur.fit(X_comptage)

TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=False, use_idf=True)

In [None]:
idf = transformeur.idf_

In [None]:
print(idf)

[ 4.84073473  4.76545167  8.38788603 ... 10.33379618 10.33379618
 10.33379618]


## <font color='purple'>Exercise 6

<font color='purple'>
Display the IDF score (inverse document frequency) of 'the', 'car', 'spherical', 'platypus' </font>

In [None]:
len(idf)

101631

In [None]:
lis = ['the', 'car', 'spherical', 'platypus']
for mot in lis:
  # idf is indexed by column number, so look the term up in the vocabulary first
  print("the IDF score of {} is {}".format(mot, idf[indice[mot]]))


the IDF score of the is 1.1773841459524739
the IDF score of car is 4.35238196464862
the IDF score of spherical is 8.031211082909056
the IDF score of platypus is 10.333796175903101
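
These values are consistent with the formula documented for `TfidfTransformer` when `smooth_idf=False`, namely idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) the number of documents containing t. The sketch below only replays that arithmetic; the helper names `idf_unsmoothed` and `df_from_idf` are made up for illustration.

```python
import math

N_DOCS = 11314   # number of messages in the training subset

def idf_unsmoothed(df, n_docs=N_DOCS):
    """IDF without smoothing: ln(n / df) + 1."""
    return math.log(n_docs / df) + 1.0

def df_from_idf(idf_value, n_docs=N_DOCS):
    """Invert the formula to recover the document frequency from an IDF value."""
    return n_docs / math.exp(idf_value - 1.0)

# A term that appears in a single message gets the largest possible IDF
print(idf_unsmoothed(1))

# Number of messages containing 'car', recovered from its printed IDF
print(round(df_from_idf(4.35238196464862)))
```

The first value matches the score printed for 'platypus', which suggests that this term occurs in only one message of the corpus.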


### TF-IDF vectorization of the first message of the corpus

In [None]:
vecteur_comptage = X_comptage[0,:]

In [None]:
vecteur_tfidf = transformeur.transform(vecteur_comptage)

In [None]:
print(vecteur_tfidf)

  (0, 100208)	1.790155814472015
  (0, 99911)	3.5192532786431427
  (0, 97181)	5.12431002306168
  (0, 96433)	3.1933431328019424
  (0, 96395)	4.620063370393732
  (0, 96247)	2.896001054231169
  (0, 95844)	8.969899609490763
  (0, 89360)	1.2901008813358654
  (0, 88767)	7.053765745490525
  (0, 88638)	2.232724672783557
  (0, 88532)	7.064304875714843
  (0, 88143)	10.333796175903101
  (0, 84538)	6.129103556512135
  (0, 84276)	6.114288470726994
  (0, 83426)	4.237971613470876
  (0, 81658)	5.466261725447518
  (0, 80623)	4.750299867121402
  (0, 77878)	4.559244630358693
  (0, 76471)	3.354650900834291
  (0, 73373)	6.04333673475471
  (0, 71850)	3.2758982384912447
  (0, 68847)	2.4962418150220165
  (0, 68781)	2.6722690945445837
  (0, 68409)	1.969288072152512
  (0, 68080)	3.5671102608813268
  :	:
  (0, 48754)	3.855516067706052
  (0, 46814)	4.67431396014348
  (0, 45885)	1.7702920816236132
  (0, 41979)	8.542036706675045
  (0, 41724)	4.86573603476797
  (0, 41715)	4.371858091958309
  (0, 37335)	7.768846818441
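
Because the transformer was created with `norm=None`, each value above should simply be the raw count multiplied by the IDF of the term. A quick check on the term 'the' (column 88532), using only numbers already printed in this notebook:

```python
idf_the = 1.1773841459524739   # IDF of 'the', printed in Exercise 6
count_the = 6                  # count of column 88532 ('the') in the first message

# TF-IDF without normalisation: count times IDF
tfidf_the = count_the * idf_the
print(tfidf_the)   # compare with the value printed for (0, 88532)
```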

## <font color='purple'>Exercise 7

<font color='purple'>
Display the terms whose TF-IDF score is greater than 8 </font>

In [None]:
terme[1293]

'0tq33'

In [None]:
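# Note: this loop filters on the IDF score (idf >= 8) over the whole vocabulary,
# not on the TF-IDF values of a particular message.
Num = []
for i in range(len(idf)):
  if idf[i] >= 8:
    Num.append(terme[i])
Num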

['0000',
 '00000',
 '000000',
 '00000000',
 '0000000004',
 '00000000b',
 '00000001',
 '00000001b',
 '00000010',
 '00000010b',
 '00000011',
 '00000011b',
 '00000074',
 '00000093',
 '000000e5',
 '00000100',
 '00000100b',
 '00000101',
 '00000101b',
 '00000110',
 '00000110b',
 '00000111',
 '00000111b',
 '000005102000',
 '00000510200001',
 '00000ee5',
 '00001000',
 '00001000b',
 '00001001',
 '00001001b',
 '00001010',
 '00001010b',
 '00001011',
 '00001011b',
 '000010af',
 '00001100',
 '00001100b',
 '00001101',
 '00001101b',
 '00001110',
 '00001110b',
 '00001111',
 '00001111b',
 '000042',
 '000062david42',
 '000094',
 '0001',
 '00010000',
 '00010000b',
 '00010001',
 '00010001b',
 '00010010',
 '00010010b',
 '00010011',
 '00010011b',
 '000100255pixel',
 '00010100',
 '00010100b',
 '00010101',
 '00010101b',
 '00010110',
 '00010110b',
 '00010111',
 '00010111b',
 '00011000',
 '00011000b',
 '00011001',
 '00011001b',
 '00011010',
 '00011010b',
 '00011011',
 '00011011b',
 '00011100',
 '00011100b',
 '0
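
The list above is based on IDF alone. If the intent of the exercise is to filter on the TF-IDF values of one message, a possible sketch is the helper below, which is hypothetical and is demonstrated on made-up toy data. With the notebook's objects it could presumably be called as `terms_above(vecteur_tfidf.indices, vecteur_tfidf.data, terme, 8)`, since a single-row CSR matrix exposes its column indices and values through `.indices` and `.data`.

```python
def terms_above(indices, values, terms, threshold):
    """Return (term, score) pairs whose score exceeds threshold, highest first."""
    selected = [(terms[j], v) for j, v in zip(indices, values) if v > threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data standing in for a sparse TF-IDF row and the vocabulary
toy_terms = ["car", "engine", "the", "bricklin"]
toy_indices = [0, 2, 3]          # columns with a non-zero score
toy_values = [17.4, 7.1, 10.3]   # made-up TF-IDF scores for those columns

print(terms_above(toy_indices, toy_values, toy_terms, 8))
```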

### Vector norm

In [None]:
np.linalg.norm(vecteur_tfidf.toarray())

43.35287803918421

### TF-IDF vectorization of the corpus

In [None]:
X = transformeur.transform(X_comptage)


In [None]:
X

<11314x101631 sparse matrix of type '<class 'numpy.float64'>'
	with 1103627 stored elements in Compressed Sparse Row format>
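
The figures reported above show why a sparse format is used. The small computation below relies only on the printed shape and number of stored elements; the 8 bytes per cell correspond to the `float64` type shown.

```python
n_rows, n_cols = 11314, 101631   # shape reported above
stored = 1103627                 # stored (non-zero) elements reported above

density = stored / (n_rows * n_cols)          # fraction of non-zero cells
avg_terms_per_message = stored / n_rows       # distinct terms per message, on average
dense_gigabytes = n_rows * n_cols * 8 / 1e9   # size of an equivalent dense float64 array

print(f"density: {density:.4%}")
print(f"distinct terms per message: {avg_terms_per_message:.1f}")
print(f"dense equivalent: {dense_gigabytes:.1f} GB")
```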

### Dot product between two vectors of the corpus (X[0,:] and X[1,:])

In [None]:
print(X[0])

  (0, 100208)	1.790155814472015
  (0, 99911)	3.5192532786431427
  (0, 97181)	5.12431002306168
  (0, 96433)	3.1933431328019424
  (0, 96395)	4.620063370393732
  (0, 96247)	2.896001054231169
  (0, 95844)	8.969899609490763
  (0, 89360)	1.2901008813358654
  (0, 88767)	7.053765745490525
  (0, 88638)	2.232724672783557
  (0, 88532)	7.064304875714843
  (0, 88143)	10.333796175903101
  (0, 84538)	6.129103556512135
  (0, 84276)	6.114288470726994
  (0, 83426)	4.237971613470876
  (0, 81658)	5.466261725447518
  (0, 80623)	4.750299867121402
  (0, 77878)	4.559244630358693
  (0, 76471)	3.354650900834291
  (0, 73373)	6.04333673475471
  (0, 71850)	3.2758982384912447
  (0, 68847)	2.4962418150220165
  (0, 68781)	2.6722690945445837
  (0, 68409)	1.969288072152512
  (0, 68080)	3.5671102608813268
  :	:
  (0, 48754)	3.855516067706052
  (0, 46814)	4.67431396014348
  (0, 45885)	1.7702920816236132
  (0, 41979)	8.542036706675045
  (0, 41724)	4.86573603476797
  (0, 41715)	4.371858091958309
  (0, 37335)	7.768846818441

### Dot product function

In [None]:
def prod(x,y):
    return x.dot(y.T).toarray()[0][0]

### Similarity function

In [None]:
def sim(x,y):
    num = prod(x,y)
    den1 = np.sqrt(prod(x,x))
    den2 = np.sqrt(prod(y,y))
    return num / (den1 * den2)    

In [None]:
sim(X[0,:],X[1,:])

0.062894713024627
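
The `sim` function above is the cosine similarity. It divides by the two norms, so it yields `nan` when one of the vectors is empty, which may happen for a message left without text once headers, footers and quotes are removed, or for a keyword absent from the vocabulary. The sketch below illustrates the same computation on plain dictionaries, with an explicit guard; returning 0.0 for an empty vector is a convention chosen here, not something taken from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # assumed convention for empty vectors
    return dot / (norm_u * norm_v)

a = {0: 3.0, 2: 4.0}
b = {0: 3.0, 2: 4.0}
c = {1: 5.0}

print(cosine(a, b))    # identical vectors
print(cosine(a, c))    # no term in common
print(cosine(a, {}))   # empty vector, guarded
```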

## <font color='purple'>Problem

<font color='purple'>
Write and test a function that takes a user query as a list of keywords (in English) and returns the 10 texts of the collection most similar to the query, in decreasing order of similarity.
</font>

In [None]:
def response(req):
  # Each keyword of the query is vectorized as its own one-word "document"
  X_comptagereq = vectoriseur.transform(req)
  X_req = transformeur.transform(X_comptagereq)
  A = {}
  for i in range(len(corpus)):
    # Score of document i: sum of its similarities with each keyword of the query
    z = 0
    for j in range(len(req)):
      z = z + sim(X[i,:], X_req[j,:])
    A[i] = z
  # Sort the documents by decreasing score and keep the 10 best
  sorted_A = sorted(A.items(), key=lambda x: x[1], reverse=True)
  return sorted_A[:10]



The function returns 10 tuples, one for each of the 10 best-ranked documents, each holding the document index and its computed similarity score (sum of similarities).

In [None]:
B=['the', 'max']

In [None]:
B=['summer']
response(B)



[(200, 0.2729512639508488),
 (720, 0.21356983431342014),
 (620, 0.173523465367272),
 (9240, 0.10567164106946193),
 (388, 0.08805599507313351),
 (327, 0.06387490347308232),
 (2, 0.061413843384225834),
 (59, 0.012174940451679625),
 (0, 0.0),
 (3, 0.0)]
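
The list above ends with documents whose score is 0.0, and the similarity can also be `nan` when a document vector is empty. Sorting a list that contains `nan` is not reliable, because every comparison with `nan` is false. A possible way to make the ranking more robust is sketched below on made-up scores; `top_k` is a hypothetical helper, and dropping zero scores is a design choice rather than a requirement of the problem.

```python
import heapq
import math

def top_k(scores, k=10):
    """Return the k best (doc_id, score) pairs, ignoring NaN and zero scores."""
    valid = [(doc_id, s) for doc_id, s in scores.items()
             if not math.isnan(s) and s > 0.0]
    # nlargest avoids sorting the whole list when only k results are needed
    return heapq.nlargest(k, valid, key=lambda pair: pair[1])

# Hypothetical scores; document 1 mimics an empty message whose similarity is NaN
toy_scores = {0: 0.0, 1: float("nan"), 2: 0.06, 3: 0.27, 4: 0.21}
print(top_k(toy_scores, k=3))
```

With this variant, a query matching fewer than k documents returns a shorter list instead of padding it with unrelated documents.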

In [None]:
corpus[200]

'\nI first read and consulted rec.guns in the summer of 1991.  I\njust purchased my first firearm in early March of this year.'

In [None]:
corpus[720]

'Just out of curiosity, what happened to the weekly AL and NL Game\nScore Reports?  I used to enjoy reading them throughout the summer\nfor the last two years.\n\nInquisitively yours,\n\nJoel'