# Imports

In [1]:
import pandas as pd
import numpy as np
import gzip
import json
import ast
import textwrap
from transformers import pipeline


# The Data

In [2]:
# explore the data and select features
def parse(path):
  # stream one JSON record per line; the context manager closes the file when done
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

In [3]:
fashion_reviews = getDF('../../data/text_data/Amazon Fashion Review Data.json.gz')

In [4]:
fashion_reviews.shape

(3176, 12)

In [5]:
fashion_reviews.iloc[0]

overall                                                        5.0
verified                                                      True
reviewTime                                              09 4, 2015
reviewerID                                           ALJ66O1Y6SLHA
asin                                                    B000K2PJ4K
style             {'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}
reviewerName                                              Tonya B.
reviewText                                Great product and price!
summary                                                 Five Stars
unixReviewTime                                          1441324800
vote                                                           NaN
image                                                          NaN
Name: 0, dtype: object

# The Models & Tasks
All of the following tasks use the same text feature: reviewText. Only the model changes, since each task requires a different architecture.

### Named Entity Recognition
The `ner` pipeline from Hugging Face accepts a single text input and returns a list of JSON-like objects (Python dicts), one per detected entity, each holding the estimated entity and its probability. From each entity I keep three output features: the word, the entity group, and the score. A single reviewText input can yield many predicted entities. For the purposes of this assignment I will only keep the entity with the highest score.
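The selection of the highest-scoring entity can be sketched on its own, without loading a model. The helper name `top_entity` and the sample values are assumptions made for illustration; the dict keys follow the shape the pipeline returns when called with `aggregation_strategy="simple"`.

```python
def top_entity(entities):
    """Return the entity dict with the highest score, or None if the list is empty."""
    if not entities:
        return None
    return max(entities, key=lambda entity: entity["score"])


# Hypothetical entities shaped like aggregated ner pipeline output.
# The values are illustrative only, not real model output.
sample_entities = [
    {"entity_group": "ORG", "score": 0.91, "word": "Nike", "start": 0, "end": 4},
    {"entity_group": "MISC", "score": 0.62, "word": "Flex Supreme", "start": 13, "end": 25},
]

best = top_entity(sample_entities)
print(best["word"], best["entity_group"], best["score"])
print(top_entity([]))  # a review with no detected entities
```

Returning `None` for an empty list mirrors the case where a review contains no entities, which the notebook records as NaN.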

In [6]:
fashion_reviews['reviewText'] = fashion_reviews['reviewText'].fillna('no review')

In [7]:
ner_pipeline = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
def apply_ner(column, pipeline=ner_pipeline):
    output = {'reviewText':[],'word':[], 'entity_group':[], 'score':[]}
    def map_to_input(row, input_map=output):
        # use the pipeline passed as an argument rather than the global ner_pipeline
        result = pipeline(row, aggregation_strategy="simple")
        input_map['reviewText'].append(row)
        max_score = float('-inf')  # Initialize max score to negative infinity
        max_score_entity = None # we will only select the entity with the max score
        for entity in result:
            if entity['score'] > max_score:
                max_score = entity['score']
                max_score_entity = entity
        if max_score_entity is not None:
            input_map['word'].append(max_score_entity['word'])
            input_map['entity_group'].append(max_score_entity['entity_group'])
            input_map['score'].append(max_score_entity['score'])
        else:
            input_map['word'].append(np.nan)
            input_map['entity_group'].append(np.nan)
            input_map['score'].append(np.nan)
    column.apply(map_to_input)
    return output


In [9]:
entity_features = apply_ner(fashion_reviews['reviewText'][:100]) # would take ~13 minutes for the full dataset

In [10]:
entity_feature_lengths = [len(values) for values in entity_features.values()]

In [11]:
entity_feature_lengths

[100, 100, 100, 100]

In [12]:
fashion_entities = pd.DataFrame(entity_features).dropna(how='any')
fashion_entities.head()

index,reviewText,word,entity_group,score
10,Relieved my Plantar Fascitis for 3 Days. Then ...,##tis,MISC,0.557792
11,This is my 6th pair and they are the best thin...,SmartDestination,ORG,0.979718
13,Pinnacle seems to have more cushioning so my h...,Powers,MISC,0.651797
15,A little more cushion than the Powerstep Prote...,Powerstep Protech,MISC,0.929084
17,Relieved my Plantar Fascitis for 3 Days. Then ...,##tis,MISC,0.557792


In [13]:
fashion_entities.shape

(17, 4)

#### Conclusion
Running the named entity recognition model on the reviewText of 100 rows detected an entity in only 17 of the reviews. For each of those 17 reviews I kept the entity with the max score, but there are some discrepancies. For example, '##tis' does not seem like an entity (the '##' prefix marks a WordPiece subword fragment, most likely the tail of 'Fascitis'), while SmartDestination and Powerstep Protech do seem like product or brand names.
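One way to screen out results like '##tis' is a simple post-filter on the selected entities. This is a minimal sketch, not part of the original analysis: the function name `is_plausible_entity` and the 0.8 score threshold are assumptions, and the rows are copied from the table above.

```python
import math


def is_plausible_entity(word, score, min_score=0.8):
    """Heuristic filter: reject missing values, WordPiece fragments, and low scores.

    min_score=0.8 is an arbitrary threshold chosen for illustration.
    """
    if word is None or (isinstance(word, float) and math.isnan(word)):
        return False
    if word.startswith("##"):  # subword continuation piece, not a whole word
        return False
    return score >= min_score


# (word, entity_group, score) rows taken from the fashion_entities output above.
rows = [
    ("##tis", "MISC", 0.557792),
    ("SmartDestination", "ORG", 0.979718),
    ("Powers", "MISC", 0.651797),
    ("Powerstep Protech", "MISC", 0.929084),
]

kept = [word for word, group, score in rows if is_plausible_entity(word, score)]
print(kept)
```

With this threshold only SmartDestination and Powerstep Protech survive, which matches the entities that looked like real products.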

### Summarization

In [14]:
summarization_pipeline = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [15]:
# since we want to summarize, keep only the long reviews (more than 56 space-separated words)
is_long = fashion_reviews['reviewText'].apply(lambda x: len(x.split(' ')) > 56)
long_reviews = fashion_reviews[is_long].copy()

In [16]:
long_review_sample = long_reviews.sample(10)

In [17]:
display(long_reviews['reviewText'].sample().iloc[0])

"Love these shoes, I have worn them all day and found them really comfortable with no aching feet at the end of the day. Fantastic colour and true to size. I also bought  Nike Women's Flex Supreme Tr 3 Pnk Pw/Mtllc Slvr/Cl Gry/White Training Shoe and these shoes are slightly more comfortable than those even though they look very similar."

In [24]:
# converts strings to a readable format
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

In [26]:
def apply_summarization(column, pipeline=summarization_pipeline, wrapper=wrapper):
    output = {'reviewText':[], 'summary_reviewText':[]}
    def map_to_input(row, input_map = output):
        input_map['reviewText'].append(row)
        # find the summary of the reviewText
        results = pipeline(row, max_length=56, clean_up_tokenization_spaces=True)
        input_map['summary_reviewText'].append(wrapper.fill(results[0]['summary_text']))
    column.apply(map_to_input)
    return output

In [19]:
summarization_features = apply_summarization(long_review_sample['reviewText'])

In [20]:
summarizations = pd.DataFrame(summarization_features)
summarizations.head()

index,reviewText,summary_reviewText
0,A-MA-ZING! I needed a good jogging shoe and t...,A-MA-ZING! I needed a good jogging shoe and ...
1,"Love these shoes, I have worn them all day and...",Nike Women's Flex Supreme Tr 3 Pnk Pw/Mtllc S...
2,I was wearing running shoes for cardio dance c...,The lacing is done in a way that allows you t...
3,Tried them on in a store before buying online ...,Overall I was looking for a durable cross tra...
4,I have pretty high arches and my calves always...,I have been on the hunt for a good cross trai...


In [27]:
example_summary = summarizations.sample()
print(wrapper.fill(example_summary['reviewText'].iloc[0]))

I was wearing running shoes for cardio dance classes at the gym because I
thought they would be sufficient. After getting 2 blood blisters I did some
research and found out cross trainers would be more appropriate which is how I
ended up buying this pair. They are slightly longer in look then I  would like
but I think that is typical of cross trainers. Also, the lacing is done in a way
that allows you to tie too tightly if you're not careful (my toes were tingling
during one particular work out when I had done this). Other then that I am very
happy with the shoes.


In [22]:
print(example_summary['summary_reviewText'].iloc[0])

 Super comfortable and fit my small feet perfectly. I have flat feet so a lot of
shoes are not comfortable for long periods of time. I can wear the shoe all day
long and they are super comfortable. They are light colored so any dirt will be
seen right away


#### Conclusion
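The sample summary captured the color of the shoe being reviewed as well as the general sentiment. Note that the review printed in cell 27 and the summary printed in cell 22 describe different shoes: the execution counts suggest cell 27 was re-run and drew a new random sample, so the two outputs should not be read as a matched pair. The summarization model was a little slow because summaries are generated autoregressively: each output token needs its own forward pass conditioned on the tokens before it, so longer summaries take longer. A better implementation would run the pipeline on a GPU, which would make it practical to summarize columns with more records and therefore more tokens.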
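A possible shape for that faster implementation is to pass reviews to the pipeline in batches rather than one row at a time. The sketch below is an assumption about how this could be organized, not the notebook's method: `batched`, `summarize_in_batches`, and the batch size of 8 are hypothetical, and `fake_summarizer` stands in for the real pipeline so the example runs without downloading a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_in_batches(texts, summarizer, batch_size=8, max_length=56):
    """Run `summarizer` on lists of texts and collect the summary strings.

    `summarizer` is assumed to behave like a Hugging Face summarization
    pipeline: given a list of strings it returns one dict per input,
    each with a 'summary_text' key.
    """
    summaries = []
    for batch in batched(list(texts), batch_size):
        results = summarizer(batch, max_length=max_length, truncation=True)
        summaries.extend(result["summary_text"] for result in results)
    return summaries


def fake_summarizer(batch, max_length=56, truncation=True):
    """Stand-in for the real pipeline: 'summarizes' by cutting each text short."""
    return [{"summary_text": text[:max_length]} for text in batch]


demo = summarize_in_batches(
    ["a" * 100, "short review", "another one"], fake_summarizer, batch_size=2
)
print(demo)
```

With the real model, the pipeline could be created as `pipeline("summarization", device=0)` to place it on the first GPU, assuming a CUDA device is available, and then passed in as `summarizer`.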

### Translation (EN->ES)

In [32]:
translation_pipeline = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")

Translation is also a text generation task, so it may be wiser to apply it to the summarized reviews we have just prepared, which speeds up processing!
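The idea of summarizing first and translating only the shorter text can be sketched as a two-step chain. The function `summarize_then_translate` and both stand-in callables are hypothetical; they only imitate the output shapes of the summarization and translation pipelines (`summary_text` and `translation_text` keys) so the example runs without any model.

```python
def summarize_then_translate(text, summarizer, translator):
    """Summarize `text` first, then translate only the shorter summary."""
    summary = summarizer(text)[0]["summary_text"]
    translation = translator(summary)[0]["translation_text"]
    return summary, translation


def fake_summarizer(text):
    """Stand-in summarizer: keeps only the first sentence."""
    first_sentence = text.split(". ")[0]
    return [{"summary_text": first_sentence}]


def fake_translator(text):
    """Stand-in translator: wraps the text in a marker instead of translating."""
    return [{"translation_text": f"<es>{text}</es>"}]


review = "Love these shoes. I have worn them all day. Fantastic colour and true to size."
summary, translation = summarize_then_translate(review, fake_summarizer, fake_translator)
print(len(review.split()), "words in,", len(summary.split()), "words translated")
print(translation)
```

Because the translator only sees the summary, the number of tokens it has to generate shrinks along with the text.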

In [37]:
def apply_translation(column, pipeline=translation_pipeline, wrapper=wrapper):
    output = {'summary_reviewText':[], 'translated_summary':[]}
    def map_to_input(row, input_map = output):
        input_map['summary_reviewText'].append(row)
        # translate the summarized review into Spanish
        results = pipeline(row, max_length=84, clean_up_tokenization_spaces=True)
        input_map['translated_summary'].append(wrapper.fill(results[0]['translation_text']))
    column.apply(map_to_input)
    return output

In [38]:
spanish_translation_summary_reviews = apply_translation(summarizations['summary_reviewText'])

In [40]:
translations = pd.DataFrame(spanish_translation_summary_reviews)
translations.head()

index,summary_reviewText,translated_summary
0,A-MA-ZING! I needed a good jogging shoe and ...,A-MA-ZING! Necesitaba un buen zapato de joggin...
1,Nike Women's Flex Supreme Tr 3 Pnk Pw/Mtllc S...,Zapato de entrenamiento Nike Women's Flex Supr...
2,The lacing is done in a way that allows you t...,El encaje se hace de una manera que te permite...
3,Overall I was looking for a durable cross tra...,"En general, estaba buscando un zapato de entre..."
4,I have been on the hunt for a good cross trai...,He estado en la caza de un buen entrenador de ...


In [41]:
example_translation = translations.sample()

In [43]:
print(example_translation['summary_reviewText'].iloc[0])

 Super comfortable and fit my small feet perfectly. I have flat feet so a lot of
shoes are not comfortable for long periods of time. I can wear the shoe all day
long and they are super comfortable. They are light colored so any dirt will be
seen right away


In [44]:
print(example_translation['translated_summary'].iloc[0])

Tengo pies planos por lo que muchos zapatos no son cómodos durante largos
períodos de tiempo. Puedo usar el zapato todo el día y son súper cómodos. Son de
color claro por lo que cualquier suciedad se verá de inmediato


#### Conclusion

Running summarization before translation worked well. The summary preserves the meaning and sentiment of the review while reducing how much text has to be translated. The translation seems accurate, given my native-speaker understanding of Spanish, although in the example above the first sentence of the summary ("Super comfortable and fit my small feet perfectly.") does not appear in the Spanish output. In a couple of reviews the model did not translate English that was broken up with hyphens, e.g. "A-MA-ZING" was left as-is rather than rendered as something like "A-SOM-BROSO". The model also doesn't seem to translate the name of the product/entity. This may or may not be intended.
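To check how often words pass through the translation unchanged, the source and translated text can be compared token by token. This is a rough sketch under stated assumptions: `untranslated_tokens`, the tokenizing regex, and the minimum length of 4 are hypothetical choices, and the sentence pair is made up for illustration rather than taken from the model output.

```python
import re


def tokenize(text):
    """Split text into word-like tokens, keeping hyphens and apostrophes."""
    return re.findall(r"[\w'-]+", text)


def untranslated_tokens(source, translation, min_length=4):
    """Return source tokens that reappear unchanged in the translation."""
    translated = set(tokenize(translation))
    found = []
    for token in tokenize(source):
        if len(token) >= min_length and token in translated and token not in found:
            found.append(token)
    return found


# Illustrative pair only; not real output from the translation pipeline.
source = "A-MA-ZING! These Nike shoes are comfortable."
translation = "A-MA-ZING! Estos zapatos Nike son cómodos."
print(untranslated_tokens(source, translation))
```

Tokens flagged this way are candidates for either product names that should stay as they are or words the model failed to translate; telling the two apart still needs a manual look.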