## Lab 2. Simple search and TF-IDF text representation
### Completed by: Galina Zalesskaya, 16ПМИ


In [1]:
import spacy
import spacy.lang.en
from lxml import etree
from collections import Counter
from numpy import intersect1d, unique
from functools import reduce
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors

In [2]:
# read the whole XML file and close the handle afterwards
with open('news.xml') as f:
    news_df = f.read()

In [3]:
root = etree.fromstring(news_df)
catalog = []

for element_lvl1 in root:
    news = {}
    for element_lvl2 in element_lvl1:
        txt = element_lvl2.text
        news[element_lvl2.tag] = '' if txt is None else txt
    
    text = news['text']
    important_text = news['title'] + '\n' + news['tags'] + '\n' + news['category']
    catalog.append([important_text, text])
    print('Text: {}\nImportant text: {}\n'.format(text[:100], important_text))

Text: Image copyright PA Media Image caption Nicola Sturgeon said her government would accept the result o
Important text: Sturgeon agrees to wider review of Scottish education
Scotland Education, Scottish Parliament, Scottish government, Nicola Sturgeon
uk

Text: Media playback is unsupported on your device Media caption Sir Keir Starmer: We've lost four electio
Important text: Labour leadership: Don't just blame 2019 campaign, Starmer warns
Labour Party leadership election, Keir Starmer, Labour Party leadership election
uk

Text: Image copyright Getty Images

Entertainment streaming giants including Amazon, Apple, Google, Netfli
Important text: Amazon, Apple and Google face data complaints
Netflix, Privacy, Data protection, Google, Spotify, Amazon, Apple, GDPR, Streaming
technology

Text: Video

TV cameras are to be allowed to film in Crown Courts in England and Wales for the first time.
Important text: Crown Court filming ban to be lifted in England and Wales

uk

Text: Image copyright 

In [4]:
def data_preparation(paragraph):
    # Tokenize and lemmatize with spaCy, drop stop words and punctuation,
    # drop numbers unless they look like a year (> 1500), then count the lemmas.
    doc = nlp(paragraph.lower())
    tokens = [token.lemma_ for token in doc
              if not token.is_stop and not token.is_punct
              and (not token.is_digit or float(token.text) > 1500)]
    return Counter(tokens)
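
The digit rule in `data_preparation` can be tried in isolation. The sketch below is only an illustration: `keep_token` and `toy_preparation` are hypothetical helpers, and a plain whitespace split stands in for spaCy, so there is no lemmatization or stop-word removal.

```python
from collections import Counter

def keep_token(text):
    """Digit rule only: drop numeric tokens unless they look like a year (> 1500)."""
    if text.isdigit():
        return float(text) > 1500
    return True

def toy_preparation(paragraph):
    # Whitespace split is an assumption of this sketch, not what spaCy does.
    return Counter(t for t in paragraph.lower().split() if keep_token(t))

counts = toy_preparation("In 2019 Labour lost 60 seats and 2019 was hard")
print(counts)
```

Here `2019` survives as a year-like number, while `60` is dropped.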

In [5]:
def collect_dict(catalog):
    # Inverted index: word -> list of [article index, count, isTitle] entries.
    dictionary = {}
    for i, news in enumerate(catalog):
        for n in range(2):
            counter = data_preparation(news[n])
            isTitle = (n == 0)  # field 0 is title + tags + category, field 1 is the body
            for word in counter:
                dictionary.setdefault(word, []).append([i, counter[word], isTitle])
    return dictionary
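
To make the shape of this index concrete, here is a minimal sketch on a two-article toy catalog. `toy_index` and `toy_catalog` are made up for the illustration, and whitespace tokenization replaces `data_preparation`, so the entries are not what spaCy would produce on the real data.

```python
from collections import Counter

def toy_index(catalog):
    """word -> list of [article index, count, is_title], one entry per field."""
    index = {}
    for i, fields in enumerate(catalog):
        for n, field in enumerate(fields):
            for word, count in Counter(field.lower().split()).items():
                index.setdefault(word, []).append([i, count, n == 0])
    return index

toy_catalog = [
    ["Crown Court filming", "cameras will film in crown courts"],
    ["Scottish education review", "the review covers education"],
]
index = toy_index(toy_catalog)
print(index["crown"])   # one entry for the title field, one for the body
```

A word that occurs in both fields of the same article gets two entries with the same article index, which matters when the entries are later merged into a list of matching articles.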

In [12]:
def nice_visualization(article):
    print('\nTitle:\n{}\nText:\n{}\n'.format(article[0],article[1]))

In [8]:
nlp = spacy.load('en_core_web_sm')

In [9]:
dictionary = collect_dict(catalog)

In [56]:
def simple_search(request, dictionary):
    # Prints every article that contains all request words found in the index;
    # if no article contains all of them, falls back to articles with any of them.
    # The index also stores per-article word counts and a title flag, but they are not used here.
    request = data_preparation(request)
    results = []
    for word in request:
        if dictionary.get(word, None):
            results.append(dictionary[word])
    if len(results)==0:
        return []
    # intersection rule; unique() removes repeats when a word is in both title and body
    docs_num = unique(reduce(intersect1d, [[doc_num for [doc_num, count, isTitle] in line] for line in results]))
    # fallback: any word in article
    if len(docs_num)==0:
        docs_num = unique([doc_num for line in results for [doc_num, count, isTitle] in line])
        print("Best match is not found")

    print("Found {} articles with following indexes: {}".format(len(docs_num), docs_num))
    for i in docs_num:
        print("----------------------------------------")
        print('Number of article: ' + str(i+1))
        nice_visualization(catalog[i])
        print("----------------------------------------")
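
The matching rule (intersection first, union as a fallback) can also be sketched with plain Python sets. This is an illustrative variant, not the notebook's code: `match_documents` and `toy_postings` are hypothetical, and the function returns its result instead of printing it.

```python
from functools import reduce

def match_documents(query_words, index):
    """Return (sorted article indexes, True if every known query word matched)."""
    postings = [{doc for doc, _count, _is_title in index[w]}
                for w in query_words if w in index]
    if not postings:
        return [], False
    common = reduce(set.intersection, postings)
    if common:
        return sorted(common), True
    # Fallback: articles that contain at least one query word.
    return sorted(set.union(*postings)), False

toy_postings = {
    "europe": [[0, 1, False], [2, 3, False]],
    "town": [[2, 1, True], [5, 1, False]],
    "jokes": [[7, 1, False]],
}
print(match_documents(["europe", "town"], toy_postings))
print(match_documents(["europe", "jokes"], toy_postings))
```

Returning a flag alongside the indexes lets the caller tell an exact match from the looser fallback.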

In [49]:
simple_search('saturday evening what to do', dictionary)
# not very good advice, to be honest

Found 1 articles with following indexes: [89]
----------------------------------------
Number of article: 90

Title:
A reckoning in Charlottesville
Alt-right, United States, Virginia
world
Text:
Image copyright Joel Gunter Image caption Nationalists descended on Charlottesville to defend a Confederate statue

In the middle of Emancipation Park in Charlottesville on Saturday, two young women, one white and one black, took each other's hands and held them tightly, and with their other hands they gripped the steel barrier in front of them.

A few feet away, a young white man with a buzzed haircut and sunglasses leaned towards them over a facing barrier. "You'll be on the first f*****g boat home," he screamed at the black woman, before turning to the white woman. "And as for you," he said coolly, "you're going straight to hell." Then he gave a Nazi salute.

For the third time in a few months, white nationalists had descended on the small, liberal city of Charlottesville in the southern stat

In [58]:
simple_search('Europe town', dictionary)

Found 5 articles with following indexes: [ 34  38 172 181 196]
----------------------------------------
Number of article: 35

Title:
Air pollution: How three global cities tackle the problem
Air pollution, Beijing, Mexico City, Delhi
world
Text:
Image copyright AFP/Getty

India's capital Delhi is blanketed under a hazardous shroud of air pollution.

City authorities have imposed a car rationing scheme in a bid to bring levels down, but experts believe the real blame lies with crop burning by farmers in neighbouring states.

Delhi is the latest city to try to come up with ways to tackle increasingly dangerous pollutants in the air.

This is what other cities have done in a bid to beat air pollution.

London

Media playback is unsupported on your device Media caption The Great Smog of London remembered 60 years on

When was pollution at its worst?

Thick smog used to frequently blanket the UK capital in the 19th and 20th centuries, when people burned coal to warm homes and heavy industry

## TF-IDF search

In [63]:
from numpy import arange

In [64]:
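def get_data_train():
    # One string per article: title/tags/category block followed directly by the body
    # (no separator is inserted between the two parts).
    dumped_catalog = [imp + text for [imp, text] in catalog]
    data_train = []
    for article in dumped_catalog:
        doc = nlp(article.lower())
        # drop purely numeric tokens; stop words are removed later by TfidfVectorizer
        tokens = [token.text for token in doc if not token.is_digit]
        data_train.append(' '.join(tokens))
    return data_train

def get_tfidf_representation(data_train):
    vectorizer = TfidfVectorizer(stop_words='english', strip_accents='ascii')
    tfidf_train = vectorizer.fit_transform(data_train)
    tfidf_train = tfidf_train[arange(len(data_train))]  # selects every row, so the matrix is unchanged
    print(tfidf_train.shape)  # (number of articles, vocabulary size)
    return vectorizer, tfidf_train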
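
For intuition about what the vectorizer computes, here is a small pure-Python sketch of TF-IDF weighting. It assumes scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization) and uses whitespace tokenization with no stop-word list, so it is an approximation of `TfidfVectorizer`, not a replacement. `tfidf_vectors` and `docs` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (sorted vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # document frequency: in how many documents each word occurs
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

docs = ["rugby world cup draw", "world news", "rugby league"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(x, 3) for x in vectors[0]])
```

In the first toy document, `cup` occurs in one document and `world` in two, so `cup` receives the larger weight. The vector length equals the vocabulary size, which is why the real matrix has one column per distinct term.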

In [66]:
data_train = get_data_train()
vectorizer, tfidf_train = get_tfidf_representation(data_train)
# each article is represented by a 13592-dimensional vector (wow!)

(249, 13592)


In [69]:
def tfidf_search(request, vectorizer, tfidf_train):
    # brute-force nearest neighbour by cosine distance; the index is refitted on every call
    predictor = NearestNeighbors(n_neighbors=1, algorithm='brute', metric='cosine').fit(tfidf_train)
    tfidf_request = vectorizer.transform([vectorizer.decode(request)])
    dist, [[predicted_index]] = predictor.kneighbors(tfidf_request)
    nice_visualization(catalog[predicted_index])
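
The lookup itself is a nearest-neighbour search under cosine distance. Below is a minimal pure-Python sketch with hypothetical helpers (`cosine_distance`, `nearest`) and a made-up three-row matrix. Treating an all-zero vector as distance 1.0 from everything is a convention chosen for this sketch, not a claim about scikit-learn internals.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0  # sketch convention: a zero vector is equally far from everything
    return 1.0 - dot / (nu * nv)

def nearest(query, matrix):
    """Return (row index, distance) of the row closest to the query."""
    distances = [cosine_distance(query, row) for row in matrix]
    best = min(range(len(matrix)), key=distances.__getitem__)
    return best, distances[best]

matrix = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8], [0.7, 0.7, 0.0]]
best, dist = nearest([0.0, 1.0, 1.0], matrix)
print(best, round(dist, 4))
```

In this sketch, a request that shares no terms with the vocabulary becomes an all-zero vector. Every row is then equally distant and the first row is returned, so the "best" match carries no information.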

In [70]:
tfidf_search("group of people", vectorizer, tfidf_train)


Title:
Rugby League World Cup 2021 draw: England drawn with Samoa, France and Greece

sport
Text:
England, captained by the now-retired Sam Burgess, lost 6-0 to Australia in the 2017 World Cup final

Rugby League World Cup 2021 Dates: 23 October to 27 November 2021 Coverage: All 31 matches of the men's World Cup will be broadcast live by the BBC, with at least 16 games on BBC One or BBC Two.

Hosts England have been drawn alongside Samoa, France and Greece in the group stage of the 2021 Rugby League World Cup.

England will play Samoa in the opening match of the tournament at St James' Park, Newcastle on 23 October.

Australia, who beat England in the 2017 final, are in Group B and have been drawn with Fiji, Scotland and Italy.

Meanwhile, 1995 and 2000 semi-finalists Wales have been drawn alongside Tonga, Papua New Guinea and the Cook Islands.

New Zealand, who were knocked out by Fiji in the 2017 quarter-finals, will face Lebanon, Jamaica and Ireland.

The Duke of Sussex, Jason Robi

In [71]:
tfidf_search("saturday evening what to do", vectorizer, tfidf_train)


Title:
Ings in 'a better place mentally' with Saints

sport
Text:
Southampton in-form striker Danny Ings tells Football Focus how he feels "in a better place mentally" this season after previous injuries had hampered his start to life with the Saints.

READ MORE: How Danny Ings got his groove back

Watch the full interview on Football Focus, Saturday 18 January at 12:00 on BBC One.

Available to UK users only,



In [72]:
tfidf_search("Funny jokes", vectorizer, tfidf_train)


Title:
Impeachment trial: Why did Pelosi use so many pens?
Impeachment of Donald Trump, Nancy Pelosi, Impeachment of Donald Trump
world
Text:
Image copyright Reuters Image caption Nancy Pelosi used pens engraved with her name to sign the articles of impeachment

The US Senate formally received the articles on impeachment from the House on Thursday.

The whole process has been more colourful, and dramatic, than your standard Congress fare.

Speaker Nancy Pelosi used several pens to put her signature on the impeachment bill - annoying Republicans - before House Democrats and the House clerk formally marched over to the Senate Chamber to present the articles.

And the sergeant at arms kicked off proceedings on Thursday with a medieval-sounding threat: "Hear ye, hear ye, hear ye, all persons are commanded to keep silent, on pain of imprisonment."

It's rare to see so much pomp and circumstance in the US Congress, which typically has far fewer ceremonial traditions than the UK's Parliament