# **Document search based on an inverted index and TF-IDF**

### **1.** Preprocessing the corpus (a news dataset): removing punctuation and extraneous characters, tokenization, normalization to base form, stop-word removal

### **2.** Implementing search based on an inverted index

### **3.** Computing TF-IDF for the corpus and implementing document search that uses the distance between vectors

In [None]:
import os
import numpy as np
import lxml
from lxml import objectify, etree, html
from xml.etree import ElementTree as xml
import urllib3
from io import StringIO, BytesIO
import pandas as pd
import re
import codecs
import nltk
import string
nltk.download("stopwords")
from nltk.corpus import stopwords
from pymystem3 import Mystem
import copy
from collections import defaultdict
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
import json
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt

%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# XML

In [None]:
xmlstr = codecs.open('dataPatina.xml', encoding='utf-8', mode='r').read()
print(xmlstr)

<catalog>
 <article id="https://www.bbc.com/culture/story/20200116-are-authentic-accents-important-in-film-and-tv">
  <title>
   Are authentic accents important in film and TV?
  </title>
  <category>
   culture
  </category>
  <tags>
   Television,Books,TV,Photography
  </tags>
  <text>
   At the ripe old age of 100, Dr Dolittle has been reincarnated in the form of Robert Downey Jr. In the latest screen version of the children’s literature classic, Dolittle, released in the US today, he is also Welsh… or at least Wales-adjacent. Because whilst he’s still able to talk to the animals, it would appear that Dr Dolittle is not so proficient in his new native voice.

More like this:

- The brilliant women the Oscars ignored

- What makes a Hollywood heartthrob?

- Will the Star Wars universe survive?

There was tantalisingly (and tellingly) little of Downey Jr speaking in the film’s trailer but just enough to threaten a film already beset by production issues – a title change, reshoots, and

In [None]:
root = etree.fromstring(xmlstr)
print(root)

<Element catalog at 0x2e57c125848>


In [None]:
catalog = []

for element_lvl1 in root:
    article = {}
    for element_lvl2 in element_lvl1:
        txt = element_lvl2.text
        article[element_lvl2.tag] = '' if txt is None else txt
        print('{}: {}'.format(element_lvl2.tag, article[element_lvl2.tag]))
    catalog.append(article)
        
    print()

title: 
   Are authentic accents important in film and TV?
  
category: 
   culture
  
tags: 
   Television,Books,TV,Photography
  
text: 
   At the ripe old age of 100, Dr Dolittle has been reincarnated in the form of Robert Downey Jr. In the latest screen version of the children’s literature classic, Dolittle, released in the US today, he is also Welsh… or at least Wales-adjacent. Because whilst he’s still able to talk to the animals, it would appear that Dr Dolittle is not so proficient in his new native voice.

More like this:

- The brilliant women the Oscars ignored

- What makes a Hollywood heartthrob?

- Will the Star Wars universe survive?

There was tantalisingly (and tellingly) little of Downey Jr speaking in the film’s trailer but just enough to threaten a film already beset by production issues – a title change, reshoots, and a postponed release date – with the heavy burden of an infamously bad accent. And now reviews have confirmed that his pronunciation is wayward, to sa

In [None]:
catalog

[{'title': '\n   Are authentic accents important in film and TV?\n  ',
  'category': '\n   culture\n  ',
  'tags': '\n   Television,Books,TV,Photography\n  ',
  'text': '\n   At the ripe old age of 100, Dr Dolittle has been reincarnated in the form of Robert Downey Jr. In the latest screen version of the children’s literature classic, Dolittle, released in the US today, he is also Welsh… or at least Wales-adjacent. Because whilst he’s still able to talk to the animals, it would appear that Dr Dolittle is not so proficient in his new native voice.\n\nMore like this:\n\n- The brilliant women the Oscars ignored\n\n- What makes a Hollywood heartthrob?\n\n- Will the Star Wars universe survive?\n\nThere was tantalisingly (and tellingly) little of Downey Jr speaking in the film’s trailer but just enough to threaten a film already beset by production issues – a title change, reshoots, and a postponed release date – with the heavy burden of an infamously bad accent. And now reviews have confirm

# Preprocessing articles

In [None]:
def preprocess_article(text, mystem=Mystem(entire_input=False)):
    text = text.lower()
    en_stopwords = stopwords.words('english')
    dgts = [str(i) for i in range(10)]
    for s in string.punctuation:
        text = text.replace(s, '')
    text = re.sub(r'\s+', ' ', text).strip()  # raw string avoids an invalid-escape warning
    
    tokens = mystem.lemmatize(text)
    tokens = [token for token in tokens if token not in en_stopwords\
              and token not in dgts\
              and token != " "]
    text = " ".join(tokens)
    return text
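
The cleanup steps listed in item 1 can also be sketched without external dependencies. This is only an illustration, not the notebook's method: `preprocess_sketch` and `DEMO_STOPWORDS` are hypothetical names, the stop-word set is a tiny stand-in for `stopwords.words('english')`, whitespace splitting replaces the `Mystem` tokenizer, and digit-only tokens of any length are dropped.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set; the notebook uses
# nltk's stopwords.words('english') instead.
DEMO_STOPWORDS = {"the", "of", "a", "an", "in", "is", "has", "been", "at"}

def preprocess_sketch(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace,
    then drop stop words and digit-only tokens."""
    text = text.lower()
    # str.translate removes every ASCII punctuation character in one pass
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    kept = [t for t in tokens if t and t not in stop_words and not t.isdigit()]
    return " ".join(kept)

sample = "At the ripe old age of 100, Dr Dolittle has been reincarnated."
print(preprocess_sketch(sample))  # ripe old age dr dolittle reincarnated
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as `’` would survive this sketch.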

In [None]:
def preprocess_docs(catalog):
    preproc_catalog = copy.deepcopy(catalog)
    for i in range(len(preproc_catalog)):
        preproc_catalog[i]['text'] = preprocess_article(catalog[i]['text'])
    return preproc_catalog

I do not use all of the articles as data, because my implementation of the inverted-index construction runs slowly.

In [None]:
preproc_catalog = preprocess_docs(catalog[0:50])
preproc_catalog

[{'title': '\n   Are authentic accents important in film and TV?\n  ',
  'category': '\n   culture\n  ',
  'tags': '\n   Television,Books,TV,Photography\n  ',
  'text': 'ripe old age dr dolittle reincarnated form robert downey jr latest screen version children literature classic dolittle released us today also welsh least walesadjacent whilst still able talk animals would appear dr dolittle proficient new native voice like brilliant women oscars ignored makes hollywood heartthrob star wars universe survive tantalisingly tellingly little downey jr speaking film trailer enough threaten film already beset production issues title change reshoots postponed release date heavy burden infamously bad accent reviews confirmed pronunciation wayward say least vulture bilge ebiri describing illadvised halfhearted welsh accent occasionally assuming inadvertently slips irish indian jamaican intonations dodgy accent hall fame ruled course dick van dyke busy place echoing chambers russell crowe notvery

# Inverted index

In [None]:
def inv_index(catalog):
    vocab = set([])
    for i in range(len(catalog)):
        if len(catalog[i]['text']) != 0:
            tokens_article = set(nltk.word_tokenize(catalog[i]['text']))
            vocab.update(tokens_article)
        else:
            continue
                                 
    index = {i:[] for i in vocab} #defaultdict(list)
    for word in vocab:
        for i in range(len(catalog)):
            tokens = set(nltk.word_tokenize(catalog[i]['text']))
            if word in tokens:
                index[word].append(i)
    return index
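
Much of the running time of `inv_index` likely comes from re-tokenizing every document once per vocabulary word. A minimal sketch of a single-pass alternative follows; `build_inverted_index` is a hypothetical name, and `str.split()` stands in for `nltk.word_tokenize` so that the example is self-contained.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the ascending list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() lists a document once per token; str.split() is an
        # assumption standing in for nltk.word_tokenize
        for token in set(text.split()):
            index[token].append(doc_id)
    return dict(index)

docs = ["trade deal china", "china economy", "earth fossils"]  # made-up toy documents
index = build_inverted_index(docs)
print(index["china"])  # [0, 1]
```

Each document is tokenized exactly once, so the cost grows with the total number of tokens rather than with vocabulary size times corpus size.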

In [None]:
ind = inv_index(preproc_catalog)
ind

{'assistant': [11, 38],
 'unravel': [30],
 'tokyo': [27],
 'tension': [6],
 'central': [6, 10, 26],
 'asian': [0, 17],
 'answer': [14, 22, 29],
 'chris': [0, 3, 7, 20],
 'ceiling': [5],
 'ohioset': [0],
 'middle': [44],
 'owner': [12, 19, 20, 21, 24],
 'police': [28, 38],
 'attacks': [43],
 'anchors': [30],
 'miscarriage': [44],
 'national': [0, 6, 11, 18, 28],
 'begin': [18, 19],
 'neesons': [29],
 'burrell': [40],
 'womb': [29, 40, 44],
 'roberts': [0],
 'kong': [8, 11],
 'understanding': [42],
 'put': [0, 6, 7, 8, 9, 10, 15, 19, 24, 29, 30, 39, 40],
 'releasing': [29],
 'everchanging': [34],
 'tract': [40],
 'brussels': [19],
 'see': [2, 5, 11, 13, 15, 17, 18, 30, 34, 40, 41, 43, 44],
 'losses': [6, 10, 15, 44],
 'reforms': [31],
 'february': [0, 19, 39, 49],
 'department': [6, 8, 9, 13, 36, 41, 43],
 'cognitive': [39, 42, 44],
 'emissions': [5, 13],
 'batten': [40],
 'minute': [5],
 'footing': [19],
 'lionsgate': [30],
 'commerce': [9],
 'adams': [7],
 'diesel': [5],
 'expanded': [

In [None]:
def searchInvInd(phrase):
    '''Print the article that contains every word of the phrase.
    If several articles match, the one in which the query words
    occur most often is selected.'''
    words = phrase.lower().strip().split()
    res = list()
    for word in words:
        res_ = ind.get(word)
        if res_ is None:
            return 'This phrase wasn\'t found in any document!'
        res.extend(res_)
    # keep only the documents that contain every query word
    res = [i for i in res if res.count(i) == len(words)]
    result = np.unique(res)
    if len(result) == 0:
        # every word exists somewhere, but no single document has them all
        return 'This phrase wasn\'t found in any document!'

    # count query-word occurrences in each matching document;
    # doc_id is the document's position in catalog / preproc_catalog
    num_words = dict()
    for doc_id in result:
        tokens = nltk.word_tokenize(preproc_catalog[doc_id]['text'])
        num_words[doc_id] = sum(tokens.count(word) for word in words)
    best = max(num_words, key=num_words.get)

    keys = list(catalog[0].keys())
    print('---------------------------Article----------------------------')
    for k in keys:
        print('{}:{}'.format(k, catalog[best][k]))
    print('--------------------------------------------------------------\n')
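
The lookup described in the docstring (every query word must occur, ties broken by occurrence count) can be sketched with set intersection over the posting lists. This is an illustration under assumptions, not the notebook's code: `search_all_words` is a hypothetical helper, the toy `index` and `docs` are made up, and `str.split()` replaces `nltk.word_tokenize`.

```python
from collections import Counter

def search_all_words(phrase, index, docs):
    """Return ids of documents containing every query word, best match first."""
    words = phrase.lower().split()
    if not words:
        return []
    # Intersect the posting lists; a word missing from the index gives an empty set
    postings = [set(index.get(w, ())) for w in words]
    matches = set.intersection(*postings)

    def score(doc_id):
        # total number of query-word occurrences in the document
        counts = Counter(docs[doc_id].split())
        return sum(counts[w] for w in words)

    # highest score first; document id breaks ties
    return sorted(matches, key=lambda d: (-score(d), d))

docs = ["china trade deal china", "china economy", "earth fossils"]  # made-up toy documents
index = {"china": [0, 1], "trade": [0], "deal": [0],
         "economy": [1], "earth": [2], "fossils": [2]}
print(search_all_words("china", index, docs))        # [0, 1]
print(search_all_words("china earth", index, docs))  # []
```

Returning a ranked list instead of printing keeps the search logic separate from the display code.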


In [None]:
searchInvInd('theory')

---------------------------Article----------------------------
title:
   US-China trade deal: Five things that aren't in it
  
category:
   business
  
tags:
   Trump tariffs,China-US relations,China economy,China,Trade war,Donald Trump,United States,Huawei
  
text:
   Image copyright Getty Images

The US and China have finally - after almost two years of hostilities - signed a "phase one" deal. But it only covers the easier aspects of their difficult relationship, and only removes some of the tariffs.

The biggest hurdles are still to come, and could stand in the way of a second phase agreement - one that would in theory remove all of the tariffs, bringing some much needed relief for the global economy, which is in the interests of all of us.

What's not in the phase one deal tells us where the flashpoints are in the US-China relationship - and what could derail the second round of negotiations.

So what didn't make it into the agreement?

1. Industrial subsidies and 'Made in China 20

In [None]:
searchInvInd('earth')

---------------------------Article----------------------------
title:
   What the earliest life on Earth looked like
  
category:
   future
  
tags:
   Biology
  
text:
   At the south-eastern tip of Newfoundland, rugged cliffs rise imposingly above the sea. The craggy rocks are known as Mistaken Point, an homage to the many ships that met their untimely end there after sailors ‘mistook’ them for a different place. Now the wild and jagged cliffs are famous for another reason. They are at the centre of a debate about one of Earth’s greatest mysteries – just how and when did complex life first evolve?

“If you walk around the rocks you will find surfaces covered in literally thousands of fossils,” says Frankie Dunn, a paleobiologist at Oxford University.

The fossils were preserved about 570 million years ago during the Ediacaran period, when a series of volcanic eruptions covered the seafloor in ash, providing a ‘snapshot’ of life at the time.

“The way I would describe it is like walki

# TF-IDF

In [None]:
data_train = list()
for i in range(len(preproc_catalog)):
    data_train.append(preproc_catalog[i]['text'])
len(data_train)

50

In [None]:
data_train

['ripe old age dr dolittle reincarnated form robert downey jr latest screen version children literature classic dolittle released us today also welsh least walesadjacent whilst still able talk animals would appear dr dolittle proficient new native voice like brilliant women oscars ignored makes hollywood heartthrob star wars universe survive tantalisingly tellingly little downey jr speaking film trailer enough threaten film already beset production issues title change reshoots postponed release date heavy burden infamously bad accent reviews confirmed pronunciation wayward say least vulture bilge ebiri describing illadvised halfhearted welsh accent occasionally assuming inadvertently slips irish indian jamaican intonations dodgy accent hall fame ruled course dick van dyke busy place echoing chambers russell crowe notverynottinghamshire robin hood anne hathaway clunky yorkshire twang one day cheadle allgonepetetong cockney ocean eleven franchise rather many nonnative bostonians departed

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(data_train)
bow = {}
for i in range(0,len(data_train)):
    bow[i] = [vectorizer.transform([data_train[i]]).indices]

In [None]:
bow

{0: [array([   2,    9,   12,   17,   25,   26,   27,   28,   36,   37,   44,
           50,   52,   58,   59,   60,   64,   68,   73,  108,  113,  123,
          149,  158,  163,  164,  165,  172,  174,  176,  181,  185,  186,
          191,  192,  203,  215,  222,  228,  229,  231,  232,  237,  246,
          248,  257,  267,  270,  274,  277,  284,  297,  299,  304,  310,
          317,  320,  321,  324,  330,  336,  349,  351,  362,  363,  364,
          366,  368,  370,  373,  382,  385,  387,  388,  391,  394,  395,
          403,  407,  417,  447,  451,  458,  459,  464,  472,  473,  488,
          489,  490,  503,  507,  508,  523,  534,  536,  540,  542,  550,
          560,  561,  578,  582,  589,  596,  598,  601,  608,  619,  635,
          636,  638,  645,  646,  647,  657,  663,  665,  672,  678,  679,
          694,  705,  706,  714,  715,  726,  738,  743,  764,  766,  788,
          795,  799,  824,  832,  842,  844,  845,  848,  849,  851,  855,
          856,  864,  

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', strip_accents='ascii')
tfidf_train = vectorizer.fit_transform(data_train)
print(tfidf_train.shape)

(50, 5717)
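
To make the weights in this matrix less opaque, here is a dependency-free sketch of the TF-IDF variant that scikit-learn documents as the `TfidfVectorizer` default: raw term counts, smoothed `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each row. The function `tfidf_vectors` is hypothetical, and whitespace tokenization is an assumption, so the numbers will not match the vectorizer exactly on real text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf per term, list of L2-normalised TF-IDF vectors as dicts)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {tok: count * idf[tok] for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()} if norm else {})
    return idf, vectors

docs = ["china trade deal", "china economy", "earth fossils"]  # made-up toy documents
idf, vectors = tfidf_vectors(docs)
print(idf["china"], idf["earth"])
```

A term that occurs in fewer documents ("earth") receives a larger idf than one that occurs in more of them ("china"), which is what lets rare query words dominate the ranking.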


In [None]:
tfidf_train

<50x5717 sparse matrix of type '<class 'numpy.float64'>'
	with 11282 stored elements in Compressed Sparse Row format>

In [None]:
vectorizer.vocabulary_

{'ripe': 4332,
 'old': 3520,
 'age': 111,
 'dr': 1531,
 'dolittle': 1505,
 'reincarnated': 4175,
 'form': 2061,
 'robert': 4350,
 'downey': 1527,
 'jr': 2787,
 'latest': 2883,
 'screen': 4488,
 'version': 5450,
 'children': 858,
 'literature': 2991,
 'classic': 901,
 'released': 4186,
 'today': 5177,
 'welsh': 5568,
 'walesadjacent': 5509,
 'whilst': 5583,
 'able': 9,
 'talk': 5035,
 'animals': 220,
 'appear': 251,
 'proficient': 3951,
 'new': 3419,
 'native': 3380,
 'voice': 5492,
 'like': 2961,
 'brilliant': 609,
 'women': 5640,
 'oscars': 3574,
 'ignored': 2503,
 'makes': 3097,
 'hollywood': 2431,
 'heartthrob': 2366,
 'star': 4832,
 'wars': 5530,
 'universe': 5365,
 'survive': 4999,
 'tantalisingly': 5041,
 'tellingly': 5080,
 'little': 2994,
 'speaking': 4768,
 'film': 1980,
 'trailer': 5221,
 'threaten': 5141,
 'beset': 480,
 'production': 3943,
 'issues': 2717,
 'title': 5173,
 'change': 819,
 'reshoots': 4258,
 'postponed': 3848,
 'release': 4185,
 'date': 1273,
 'heavy': 2370,

In [None]:
valid_inds = list()
for i_el, el in enumerate(tfidf_train):
    if el.getnnz() > 0:
        valid_inds.append(i_el)
        
valid_inds = np.asarray(valid_inds)
print(len(valid_inds))        
tfidf_train_filt = tfidf_train[valid_inds]
print(tfidf_train_filt.shape)

50
(50, 5717)


In [None]:
predictor = NearestNeighbors(n_neighbors=1, algorithm='brute', metric='cosine').fit(tfidf_train_filt)

In [None]:
def searchTF_IDF(phrase):
    '''Print the document whose TF-IDF vector has the smallest
    cosine distance to the vector of the query phrase.'''
    phrase = phrase.lower()
    test_text = vectorizer.transform([vectorizer.decode(phrase)])
    if test_text.getnnz() == 0:
        # none of the query words are in the vocabulary, so the query
        # vector is all zeros and no document can be ranked
        print('This phrase wasn\'t found in any document!')
        return
    distances, indices = predictor.kneighbors(test_text, n_neighbors=5)
    distances = np.squeeze(distances)
    indices = np.squeeze(indices)

    keys = list(catalog[0].keys())
    # map the row of the filtered matrix back to the catalog index
    i = valid_inds[indices[0]]
    print(indices, ':', distances, '\n')
    print('---------------------------Article----------------------------')
    for k in keys:
        print('{}:{}'.format(k, catalog[i][k]))
    print('--------------------------------------------------------------')
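
The ranking relies on cosine distance, `1 - cos(u, v)`. The sketch below shows the computation on sparse vectors stored as dictionaries; `cosine_distance` and the weights are made up for illustration, and returning 1.0 for an empty vector is a convention chosen here for the sketch.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention for this sketch: an empty vector matches nothing
    return 1.0 - dot / (norm_u * norm_v)

doc_vectors = [  # made-up weights, not taken from the notebook
    {"australia": 0.8, "fires": 0.6},
    {"china": 1.0},
]
query = {"australia": 1.0}
# smallest distance first
ranked = sorted(range(len(doc_vectors)),
                key=lambda i: cosine_distance(query, doc_vectors[i]))
print(ranked)  # [0, 1]
```

Documents that share no term with the query have a dot product of zero and therefore a distance of exactly 1, which is why a distance of 1 is treated as "not found".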

In [None]:
searchTF_IDF('abra cadabra')

This phrase wasn't found in any document!


In [None]:
searchTF_IDF('Australia')

[ 3 15 40 34 36] : [0.88340095 0.90354362 0.98146994 1.         1.        ] 

---------------------------Article----------------------------
title:
   Australia fires: How do we know how many animals have died?
  
category:
  
tags:
   Reality Check,Reality Check,Australia fires,Australia
  
text:
   Image copyright Paul Sudmals / Reuters

There is a widely-reported estimate that almost half a billion (480 million) animals have been killed by the bush fires in Australia.

It's a figure that came from Prof Chris Dickman, an expert on Australian biodiversity at the University of Sydney.

He released a statement explaining how he had reached the figure - a statement which refers to the number of animals affected rather than those necessarily dying as a direct result of the fire (although the title of the release talks about 480 million being killed).

The numbers are based on a report he co-wrote in 2007 for the World Wide Fund for Nature (WWF) on the impact of land-clearing on Australian