## Description

Choose a corpus of product reviews from one of the Amazon categories:
http://jmcauley.ucsd.edu/data/amazon/


Suppose you need to prepare an analytical report on these reviews, for example for the manufacturer of a new product in this category. To do this, we will look for product mentions in the reviews and treat them as named entities (NE). Note that a mention may look not only like "Iphone 10" but also like "model", "phone", and so on.

**Important note**: the assignment includes example solutions, and you may use them!


### Solution options:

1. **Rule-based**: write rules (syntactic patterns) with yargy.
   - Advantages: speed
   - Disadvantages: the rules are fairly hard to write (there are many of them)
   
2. **Classification-based**: train a binary classifier (NE / not NE).
   - Advantages: speed, no hand-written rules needed
   - Disadvantages: requires labeled training data
   
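3. **SpaCy**: use a ready-made model.
   - Advantages: speed, simplicity
   - Disadvantages: none listed, though a pretrained general-purpose model may miss generic mentions such as "program"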
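
To make the rule-based option more concrete, here is a minimal sketch of a single hand-written rule that uses only Python's `re` module instead of yargy or spaCy. The descriptor list, the regular expression and the helper name `find_descriptor_mentions` are illustrative assumptions, not part of the assignment: the rule treats one or more capitalized words followed by a generic descriptor as a product mention.

```python
import re

# Hypothetical descriptor nouns; in practice they would come from corpus frequencies.
DESCRIPTORS = ("program", "software", "tool", "game")

# One or more capitalized words, then a descriptor, e.g. "Pinnacle Studio software".
RULE = re.compile(
    r"\b((?:[A-Z][A-Za-z0-9]*\s+)+)(" + "|".join(DESCRIPTORS) + r")\b"
)

def find_descriptor_mentions(text):
    """Return lowercased 'name descriptor' mentions found by the single rule."""
    return [
        " ".join((name + descriptor).lower().split())
        for name, descriptor in RULE.findall(text)
    ]

print(find_descriptor_mentions("I bought the Pinnacle Studio software last year."))
```

A rule this simple also fires on strings such as "I program", which is one reason real rule sets grow quickly and rely on part-of-speech information rather than capitalization alone.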

In [1]:
import gzip
import json
import string
import pickle
from collections import Counter

import nltk
import pandas as pd
import spacy
from nltk.tokenize import word_tokenize, MWETokenizer
from nltk.collocations import BigramCollocationFinder
from spacy.matcher import Matcher
from spacy.util import filter_spans
from tqdm.auto import tqdm

# Small English pipeline; it provides the POS tags and lemmas used below
nlp = spacy.load("en_core_web_sm")

In [2]:
def parse(path):
    # Each line of the gzip archive is one JSON-encoded review
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('Software_5.json.gz')

In [3]:
df.dropna(subset=['reviewText'], inplace=True)
df.tail(5)

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12800 | 4.0 | False | 07 16, 2016 | A1E50L7PCVXLN4 | B01FFVDY9M | {'Platform:': ' Key Card'} | Colinda | When I ordered this it was listed as Photo Edi... | File Management Software with Basic Editing Ca... | 1468627200 | | |
| 12801 | 3.0 | False | 06 17, 2017 | AVU1ILDDYW301 | B01HAP3NUG | | G. Hearn | This software has SO much going on. Theres a ... | Might not be for the "novice" | 1497657600 | | |
| 12802 | 4.0 | False | 01 24, 2017 | A2LW5AL0KQ9P1M | B01HAP3NUG | | Dr. E | I have used both more complex and less complex... | Great, Inexpensive Software for Those Who Have... | 1485216000 | | |
| 12803 | 3.0 | False | 06 14, 2018 | AZ515FFZ7I2P7 | B01HAP47PQ | {'Platform:': ' PC Disc'} | Jerry Jackson Jr. | Pinnacle Studio 20 Ultimate is a perfectly ser... | Gets the job done ... but not as easy as it sh... | 1528934400 | | |
| 12804 | 4.0 | False | 04 16, 2018 | A2WPL6Y08K6ZQH | B01HAP47PQ | {'Platform:': ' PC Disc'} | Narut Ujnat | A program that is fairly easy to use and provi... | Good overall program. | 1523836800 | | |


In [4]:
reviews = df['reviewText'].tolist()
summaries = df['summary'].tolist()

Count the most frequent verbs and nouns.

In [5]:
verbs = Counter()
reviews_lemmas = []
for review in tqdm(reviews):
    doc = nlp(review)
    review_lemmas = []
    for token in doc:
        review_lemmas.append(token.lemma_)
        if token.pos_ == 'VERB':
            verbs[token.lemma_] += 1
    reviews_lemmas.append(' '.join(review_lemmas))
    
verbs.most_common(10)

[('have', 19660),
 ('use', 16735),
 ('do', 9871),
 ('get', 7832),
 ('work', 6988),
 ('make', 5866),
 ('go', 5189),
 ('be', 5139),
 ('find', 5071),
 ('need', 5007)]

In [6]:
nouns = Counter()
summaries_lemmas = []
for summary in tqdm(summaries):
    if pd.isna(summary):
        # Keep the list homogeneous: every entry is a string
        summaries_lemmas.append('')
        continue
    
    doc = nlp(summary)
    summary_lemmas = []
    for token in doc:
        summary_lemmas.append(token.lemma_) 
        if token.pos_ == 'NOUN':
            nouns[token.lemma_] += 1
    summaries_lemmas.append(' '.join(summary_lemmas))
    
nouns.most_common(10)

[('star', 1335),
 ('product', 565),
 ('software', 537),
 ('program', 309),
 ('version', 274),
 ('year', 239),
 ('price', 214),
 ('feature', 176),
 ('time', 172),
 ('computer', 167)]

Write rules with the spaCy Matcher and look for spans that contain product mentions.

In [7]:
matcher = Matcher(nlp.vocab)
matcher.add(
    "verb_pattern", 
    [
        [
            {
                "LEMMA": {
                    "IN": [
                        "use", 
                        "like", 
                        "install"
                    ]
                }
            }, 
            {
                "lower": "this", 
                "OP": "*"
            }, 
            {
                "POS": "PROPN", 
                "OP": "+"
            }
        ]
    ]
)
matcher.add(
    "this_pattern", 
    [
        [
            {
                "lower": "this"
            }, 
            {
                "POS": {
                    "IN": [
                        "PROPN", 
                        "NOUN"
                    ]
                }, 
                "OP": "+"
            }, 
            {
                "LEMMA": {
                    "IN": [
                        "be", 
                        "have"
                    ]
                }
            }, 
            {
                "POS": "ADJ", 
                "OP": "*"
            }
        ]
    ]
)
matcher.add(
    "descriptor_pattern", 
    [
        [
            {
                "POS": "PROPN"
            }, 
            {
                "POS": "PROPN", 
                "OP": "*"
            }, 
            {
                "lower": {
                    "IN": [
                        "program", 
                        "software", 
                        "player", 
                        "package", 
                        "tool", 
                        "game"
                    ]
                }
            }
        ]
    ]
)

In [8]:
def get_spans(text):
    doc = nlp(text)
    return filter_spans([doc[start:stop] for _, start, stop in matcher(doc)])
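
`filter_spans` matters here because the patterns can match overlapping stretches of the same text, and only one of them should count as the mention. According to the spaCy documentation, it prefers the (first) longest span when spans overlap. The sketch below imitates that selection on plain `(start, end)` token offsets; the function name `keep_longest_spans` and the offsets are hypothetical, and this is an illustration of the idea rather than spaCy's own code.

```python
def keep_longest_spans(spans):
    """Pick non-overlapping (start, end) spans, longest first, earliest on ties."""
    # Longest spans first; among equally long spans, the one that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    taken = set()  # token positions already covered by a kept span
    kept = []
    for start, end in ordered:
        positions = set(range(start, end))
        if positions.isdisjoint(taken):
            kept.append((start, end))
            taken |= positions
    return sorted(kept)

# Hypothetical overlapping matches over one document.
print(keep_longest_spans([(0, 2), (0, 3), (5, 7), (6, 7)]))
```

With these made-up offsets, `(0, 3)` wins over `(0, 2)` and `(5, 7)` wins over `(6, 7)`, so each stretch of text contributes one mention.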

In [9]:
VERBS = ("use", "like", "install")
DESCRIPTORS = ("program", "software", "player", "package", "tool", "game")

def extract_products(match):
    """Return the product mention inside a matched span, or None if there is none."""
    tokens = [token.text.lower() for token in match]
    if tokens[0] in VERBS:
        # verb_pattern: everything after the verb
        return ' '.join(tokens[1:])
    if 'this' in tokens:
        # this_pattern: the tokens between "this" and the verb "be" / "have"
        this_ind = tokens.index('this')
        for verb in ('be', 'have'):
            if verb in tokens:
                return ' '.join(tokens[this_ind + 1:tokens.index(verb)])
        return None
    if tokens[-1] in DESCRIPTORS:
        # descriptor_pattern: proper nouns plus the descriptor
        return ' '.join(tokens)
    return None

In [10]:
def get_products_mentions(text):
    all_prodnames = []
    for span in get_spans(text):
        mention = extract_products(span)
        # Skip spans from which no mention could be extracted
        if mention:
            all_prodnames.append(mention)
    return all_prodnames

In [11]:
# Flatten the per-review lists of mentions into one list
products_mentions = [
    mention
    for text in tqdm(reviews_lemmas)
    for mention in get_products_mentions(text)
]
products_mentions[:10]

['dreamweaver',
 'course',
 'courseware',
 'course',
 'flash files',
 'flash video',
 'div',
 'ap',
 'spry',
 'dw']

In [13]:
mwe_tokenizer = MWETokenizer(separator=" ")
for products_mention in products_mentions:
    mwe_tokenizer.add_mwe(tuple(products_mention.split()))

# A set gives constant-time membership tests in the loop below
products_set = set(products_mentions)

bigrams = Counter()
reviews_lemmas_mwe = []
for review in tqdm(reviews):
    # Lowercase, tokenize, then merge multiword product mentions into single tokens
    tokens = mwe_tokenizer.tokenize(word_tokenize(review.lower()))
    reviews_lemmas_mwe.append(tokens)

    for review_bigram in nltk.bigrams(tokens):
        if review_bigram[0] in string.punctuation or review_bigram[1] in string.punctuation:
            continue
        # Keep only bigrams in which at least one token is a product mention
        if review_bigram[0] in products_set or review_bigram[1] in products_set:
            bigrams[review_bigram] += 1

bigrams.most_common(10)

[(('i', 'have'), 6616),
 (('it', "'s"), 5173),
 (('it', 'is'), 4617),
 (('i', "'ve"), 3276),
 (('i', "'m"), 3035),
 (('and', 'i'), 2997),
 (('that', 'i'), 2922),
 (('the', 'software'), 2917),
 (('i', 'was'), 2904),
 (('the', 'program'), 2673)]
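
The filtering step can be traced on a toy example: a bigram is kept only if neither token is punctuation and at least one token is a known mention. The sketch below uses only the standard library; the token list and the mention set are made up for illustration, and `mention_bigrams` is a hypothetical helper name. It checks punctuation per single-character token, which is slightly stricter than the substring test in the cell above.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)

def mention_bigrams(tokens, mentions):
    """Count adjacent token pairs that contain a mention and no punctuation token."""
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left in PUNCT or right in PUNCT:
            continue
        if left in mentions or right in mentions:
            counts[(left, right)] += 1
    return counts

# Toy input: "pinnacle studio" is assumed to be already merged into one token.
tokens = ["the", "pinnacle studio", "is", "slow", ",", "but",
          "pinnacle studio", "is", "cheap"]
mentions = {"pinnacle studio"}
print(mention_bigrams(tokens, mentions).most_common(2))
```

Storing the mentions in a `set` keeps each membership test constant-time, which matters once the loop runs over every bigram of every review.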

Computing PMI and other association measures
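
As a reminder of what these scores measure, the sketch below computes PMI and the t-score from raw counts with the textbook formulas. The counts are made up, the function names are hypothetical, and NLTK's `BigramAssocMeasures` may differ in small details such as smoothing, so treat this as an illustration rather than a replacement for the library calls.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    return math.log2(n_xy * n_total / (n_x * n_y))

def t_score(n_xy, n_x, n_y, n_total):
    """Student's t: (observed - expected) / sqrt(observed)."""
    expected = n_x * n_y / n_total
    return (n_xy - expected) / math.sqrt(n_xy)

# Made-up counts: the pair occurs 8 times, each word 16 times, in 1024 tokens.
print(pmi(8, 16, 16, 1024))      # log2(8 * 1024 / 256) = log2(32) = 5.0
print(t_score(8, 16, 16, 1024))  # (8 - 0.25) / sqrt(8)

# A pair seen once, made of words seen once each: maximal PMI, t-score below 1.
print(pmi(1, 1, 1, 1024), t_score(1, 1, 1, 1024))
```

PMI rewards rare pairs: a bigram seen once whose words each occur once scores log2 of the corpus size, so low-frequency pairs tend to top PMI rankings, while the t-score of a bigram seen once stays below 1.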

In [14]:
collocation_measures = nltk.collocations.BigramAssocMeasures()
collocation_finder = BigramCollocationFinder.from_documents(reviews_lemmas_mwe)

pmi, likelihood_ratio, student_t = [], [], []
for bigram in tqdm(bigrams):
    w1, w2 = bigram
    # Score the same bigram with each association measure
    pmi.append(
        (bigram, collocation_finder.score_ngram(collocation_measures.pmi, w1, w2))
    )
    likelihood_ratio.append(
        (bigram, collocation_finder.score_ngram(collocation_measures.likelihood_ratio, w1, w2))
    )
    student_t.append(
        (bigram, collocation_finder.score_ngram(collocation_measures.student_t, w1, w2))
    )

There are many entities, so I save everything to separate CSV files.

In [15]:
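# Set lookup keeps the row-wise apply below fast
products_lookup = set(products_mentions)

def get_item_group(bigram):
    # Group a bigram by whichever of its tokens is a product mention
    if bigram[0] in products_lookup:
        return bigram[0]
    elif bigram[1] in products_lookup:
        return bigram[1]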

scores = pd.DataFrame()
scores['bigram'] = [b[0] for b in pmi]
scores['pmi'] = [b[1] for b in pmi]
scores['likelihood_ratio'] = [b[1] for b in likelihood_ratio]
scores['student_t'] = [b[1] for b in student_t]
scores['item_group'] = scores['bigram'].apply(get_item_group)

In [16]:
pmi_scores = (
    scores[
        [
            'item_group', 
            'bigram', 
            'pmi'
        ]
    ]
    .groupby('item_group')
    .apply(
        lambda x: x.sort_values('pmi', ascending=False)
    )
    .reset_index(drop=True)
)
pmi_scores.head()

| | item_group | bigram | pmi |
|---|---|---|---|
| 0 | * program | (like, * program) | 7.49871 |
| 1 | * program | (* program, is) | 5.179029 |
| 2 | * software | (* software, rely) | 12.876201 |
| 3 | * software | (* software, cds) | 12.415228 |
| 4 | * software | (backup, * software) | 9.561079 |


In [17]:
likelihood_ratio_scores = (
    scores[
        [
            'item_group', 
            'bigram', 
            'likelihood_ratio'
        ]
    ]
    .groupby('item_group')
    .apply(
        lambda x: x.sort_values('likelihood_ratio', ascending=False)
    )
    .reset_index(drop=True)
)
likelihood_ratio_scores.head()

| | item_group | bigram | likelihood_ratio |
|---|---|---|---|
| 0 | * program | (like, * program) | 9.014803 |
| 1 | * program | (* program, is) | 5.821187 |
| 2 | * software | (* software, will) | 17.901573 |
| 3 | * software | (* software, rely) | 16.136121 |
| 4 | * software | (* software, cds) | 15.493911 |


In [18]:
student_t_scores = (
    scores[
        [
            'item_group', 
            'bigram', 
            'student_t'
        ]
    ]
    .groupby('item_group')
    .apply(
        lambda x: x.sort_values('student_t', ascending=False)
    )
    .reset_index(drop=True)
)
student_t_scores.head()

| | item_group | bigram | student_t |
|---|---|---|---|
| 0 | * program | (like, * program) | 0.994471 |
| 1 | * program | (* program, is) | 0.972397 |
| 2 | * software | (* software, will) | 1.406138 |
| 3 | * software | (* software, rely) | 0.999867 |
| 4 | * software | (* software, cds) | 0.999817 |


In [19]:
pmi_scores.to_csv('pmi.csv')
likelihood_ratio_scores.to_csv('likelihood_ratio.csv')
student_t_scores.to_csv('student_t.csv')