# NLP Challenge

## 1. Classifier

>You just trained a model to classify soft drinks according to their flavor-profile. The results are below.
>- What has the model learned about the correspondence between the item name and the classification?
>- Can you find any problems with what the model has learned?
>- How would you address this problem?

In [1]:
import pandas as pd

In [8]:
results_df = pd.DataFrame(data={
    "item_name": [
        "7up can 355ml", "pepsi bottle 14oz", "7up bottle 500ml", "7up can 500ml",
        "pepsi bottle 20oz", "coke bottle 14oz", "pepsi can 20oz", "coke bottle 20oz",
        "sprite bottle 500ml", "sprite can 500ml", "sprite can 355ml", "coke can 20oz",
        "sprite bottle 355ml", "7up bottle 355ml", "pepsi can 14oz", "coke can 14oz"
        ],
    "flavor_profile": [
        "lemon-line", "cola", "lemon-line", "lemon-line", "cola", "cola", "cola",
        "cola", "lemon-line", "lemon-line", "lemon-line", "cola", "lemon-line",
        "lemon-line", "cola", "cola"
        ]})

results_df

| | item_name | flavor_profile |
|---|---|---|
| 0 | 7up can 355ml | lemon-line |
| 1 | pepsi bottle 14oz | cola |
| 2 | 7up bottle 500ml | lemon-line |
| 3 | 7up can 500ml | lemon-line |
| 4 | pepsi bottle 20oz | cola |
| 5 | coke bottle 14oz | cola |
| 6 | pepsi can 20oz | cola |
| 7 | coke bottle 20oz | cola |
| 8 | sprite bottle 500ml | lemon-line |
| 9 | sprite can 500ml | lemon-line |


In [10]:
results_df[['item_name']].value_counts()

item_name          
7up bottle 355ml       1
7up bottle 500ml       1
7up can 355ml          1
7up can 500ml          1
coke bottle 14oz       1
coke bottle 20oz       1
coke can 14oz          1
coke can 20oz          1
pepsi bottle 14oz      1
pepsi bottle 20oz      1
pepsi can 14oz         1
pepsi can 20oz         1
sprite bottle 355ml    1
sprite bottle 500ml    1
sprite can 355ml       1
sprite can 500ml       1
dtype: int64

### What has the model learned about the correspondence between the item name and the classification?

According to each brand's website, both [7Up](https://www.7up.com/en/products/7up) and [Sprite](https://www.sprite.com) are lemon-lime flavoured soft drinks, and both [Pepsi](https://www.pepsi.ca/products/pepsi-591ml) and [Coke](https://www.coca-cola.ca/brands/coca-cola#facts) are cola flavoured soft drinks. Judging by the classification results, the model is therefore 100% accurate, assuming we ignore the specialty flavour variations available from each brand (e.g. cherry).

Whether the data was classified using supervised or unsupervised methods, the model appears to have learned a direct relationship between brand and flavour. If trained on data as clean and structured as the test data above, this could be satisfactory.

### Can you find any problems with what the model has learned?

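On closer examination, however, the model may be severely over-fit (low bias, high variance): it may have learned to classify solely on the volumetric unit of each item, millilitres or fluid ounces, irrespective of volume and container type. In these results every 7up and Sprite item is measured in ml and every Pepsi and Coke item in oz, so unit and flavour are perfectly confounded.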
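One way to probe this suspicion is to split each item name into its parts and ask which part, taken alone, already determines the label. The sketch below is a minimal standard-library illustration on a subset of the rows above; the helper name `perfect_predictors` is hypothetical and not part of the original analysis.

```python
import re
from collections import defaultdict

# A subset of the (item_name, flavor_profile) pairs from the results above.
rows = [
    ("7up can 355ml", "lemon-line"),
    ("pepsi bottle 14oz", "cola"),
    ("7up bottle 500ml", "lemon-line"),
    ("coke bottle 20oz", "cola"),
    ("sprite can 500ml", "lemon-line"),
    ("coke can 14oz", "cola"),
    ("sprite bottle 355ml", "lemon-line"),
    ("pepsi can 20oz", "cola"),
]

PATTERN = re.compile(r"(\w+) (\w+) (\d+)(\w+)")
FIELDS = ("brand", "container", "volume", "unit")


def perfect_predictors(rows):
    """Return the fields whose every value maps to exactly one label."""
    labels_seen = {field: defaultdict(set) for field in FIELDS}
    for name, label in rows:
        match = PATTERN.fullmatch(name)
        if match is None:
            continue  # skip names that do not fit the expected layout
        for field, value in zip(FIELDS, match.groups()):
            labels_seen[field][value].add(label)
    return [
        field
        for field in FIELDS
        if all(len(labels) == 1 for labels in labels_seen[field].values())
    ]


print(perfect_predictors(rows))
```

On these rows the sketch reports brand, volume, and unit as each perfectly predictive, while container is not. Accuracy on this data therefore cannot reveal which of those signals the model actually relies on.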

An over-fit model would normally show low test accuracy rather than the apparent 100% obtained above. Such a result can still occur if duplicate records exist in both the training and test sets.
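A quick check for this kind of leakage is to intersect the two splits after light normalisation. The following is a sketch only: the `train` and `test` lists are hypothetical, since the original splits are not shown, and `find_leakage` is an illustrative helper name.

```python
def find_leakage(train_names, test_names):
    """Return item names present in both splits, ignoring case and extra spaces."""

    def normalise(name):
        return " ".join(name.lower().split())

    train_set = {normalise(name) for name in train_names}
    test_set = {normalise(name) for name in test_names}
    return sorted(train_set & test_set)


# Hypothetical splits, for illustration only.
train = ["7up can 355ml", "Pepsi bottle 14oz", "coke can 20oz"]
test = ["pepsi  bottle 14oz", "sprite can 500ml"]

print(find_leakage(train, test))  # ['pepsi bottle 14oz']
```

Any names returned here should be removed from one of the splits before test accuracy is trusted.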

### How would you address this problem?

- regularization
- shuffle the order of words (if using NLP techniques)
- character-level modelling: these are all short strings, so a fixed maximum length is easy to impose
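As one possible reading of the word-shuffling idea, the training names could be augmented with reordered copies so that a model cannot rely on token position. This is a minimal sketch under that assumption; `shuffled_variants` and its parameters are hypothetical.

```python
import random


def shuffled_variants(item_name, n_variants=3, seed=0):
    """Return up to n_variants distinct word-order permutations of item_name."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    tokens = item_name.split()
    variants = set()
    # Cap the attempts so names with few possible permutations cannot loop forever.
    for _ in range(n_variants * 10):
        candidate = tokens[:]
        rng.shuffle(candidate)
        variants.add(" ".join(candidate))
        if len(variants) == n_variants:
            break
    return sorted(variants)


for variant in shuffled_variants("7up can 355ml"):
    print(variant)
```

Shuffling alone does not remove the unit signal, because the unit token is still present in every variant. It would need to be combined with unit normalisation or with training data in which units and flavours are not confounded.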


In [50]:
# Split item_name into brand, container, volume, and unit columns
results_df[['brand', 'container', 'volume', 'unit']] = results_df['item_name'].str.extract(pat=r"(\w+) (\w+) (\d+)(\w+)", expand=True)

# Normalize units: parse volume as floats, since the converted values are fractional
results_df['volume'] = pd.to_numeric(results_df['volume']).astype(float)
# Convert fluid ounces to millilitres (1 US fl oz = 29.5735 ml)
is_oz = results_df['unit'] == 'oz'
results_df.loc[is_oz, 'volume'] = results_df.loc[is_oz, 'volume'] * 29.5735
results_df.loc[is_oz, 'unit'] = 'ml'
# Rename the volume column and drop the now-redundant unit column
results_df = results_df.rename(columns={'volume': 'ml'}).drop(columns=['unit'])



In [51]:
results_df

| | item_name | flavor_profile | brand | container | ml |
|---|---|---|---|---|---|
| 0 | 7up can 355ml | lemon-line | 7up | can | 355.0 |
| 1 | pepsi bottle 14oz | cola | pepsi | bottle | 414.029 |
| 2 | 7up bottle 500ml | lemon-line | 7up | bottle | 500.0 |
| 3 | 7up can 500ml | lemon-line | 7up | can | 500.0 |
| 4 | pepsi bottle 20oz | cola | pepsi | bottle | 591.47 |
| 5 | coke bottle 14oz | cola | coke | bottle | 414.029 |
| 6 | pepsi can 20oz | cola | pepsi | can | 591.47 |
| 7 | coke bottle 20oz | cola | coke | bottle | 591.47 |
| 8 | sprite bottle 500ml | lemon-line | sprite | bottle | 500.0 |
| 9 | sprite can 500ml | lemon-line | sprite | can | 500.0 |


Some initial thoughts:

- Over fit/under fit?
- Bias!!

> What has the model learned about the correspondence between the item name and the classification?

It appears as though the model has learned to classify on brand name and volume unit. All 7up and Sprite items are measured in ml and return lemon-line, while all Pepsi and Coke items are measured in oz and return cola, regardless of container type and size.

- Is this a consequence of over fitting or under fitting?
- Add regularization to bump the model out of the local minimum

Sentiment analysis is a supervised classification problem. Multinomial Naive Bayes works fine.

My approach above isn't NLP.

According to [width.ai](https://www.width.ai/post/extracting-information-from-unstructured-text-using-algorithms), an SVM can also work. Both are available in scikit-learn.

## 2. Brand Extraction

> You’ve been given a product list of ~16,000 carbonated soft drinks (see carbonated_soft_drinks.csv) and you’re asked to build a model that can extract the product’s brand (e.g., Coke, Pepsi, Sunkist), when available.
>- What type of model would you use?
>- If you choose a supervised-learning model, where will you get your training data?

In [58]:
# load data
soft_drinks_df = pd.read_csv('../data/carbonated_soft_drinks.csv')
soft_drinks_df

| | item_name |
|---|---|
| 0 | Bottle Coke Classic 20oz |
| 1 | Bottle Coke Diet 20oz |
| 2 | 20oz Fountain Beverage |
| 3 | Bottle Pepsi 20oz |
| 4 | 32oz Fountain Beverage |
| ... | ... |
| 16979 | Large Dr. Pepper BOOST |
| 16980 | Large Coca-Cola BOOST |
| 16981 | Diet Ginger Ale |
| 16982 | UpSZ Soda |


### What type of model would you use?

I would choose a supervised learning model.

I must attribute much of my thinking on this problem to Ajinkya More's 2016 article [*Attribute Extraction from Product Titles in eCommerce*](https://www.arxiv-vanity.com/papers/1608.04670/).

At a high level, this is a Named Entity Recognition (NER) and Information Extraction (IE) problem.

A deterministic solution may be possible if the number of brands is small. One could simply look up terms in a dictionary of brand names based on a client list. Fuzzy matching would account for spelling variations. However, this approach does not work well if the objective grows to include extracting additional features like container type, drink name, or volume. The deterministic solution also does not generalize to products besides soft drinks. It would not be possible to write parsing rules for every possible variation, and in short order the problem becomes intractable.
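For illustration, the dictionary approach with fuzzy matching might look like the sketch below. It uses `difflib` from the standard library; the brand list, the `lookup_brand` helper, and the 0.8 similarity cutoff are assumptions chosen for the example, not values from the original analysis.

```python
import difflib

# Hypothetical dictionary of known brands (e.g. taken from a client list).
KNOWN_BRANDS = ["coke", "pepsi", "sunkist", "7up", "sprite"]


def lookup_brand(item_name, brands=KNOWN_BRANDS, cutoff=0.8):
    """Return the first known brand that fuzzily matches a token, else None."""
    for token in item_name.lower().split():
        # get_close_matches returns the best candidates at or above the cutoff.
        matches = difflib.get_close_matches(token, brands, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return None


print(lookup_brand("Bottle Coke Classic 20oz"))  # coke
print(lookup_brand("Sprit Zero 12oz"))           # sprite (misspelling tolerated)
print(lookup_brand("20oz Fountain Beverage"))    # None
```

The sketch matches one whitespace token at a time, so multi-word brands and abbreviations would each need extra handling. That growing list of special cases is the scaling problem described above.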

Therefore, an inferential/probabilistic model using NLP techniques for Named Entity Recognition is required.

NLTK and spaCy both provide pre-trained NER implementations.

#### NER with NLTK

In [57]:
import nltk

To test this approach, the code below looks at the first 10 product names. It tokenizes each product name (document), applies part-of-speech tagging, and then applies chunking to identify named entities.

In [90]:
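# The NLTK tokenizer, tagger, and NE chunker resources must be fetched beforehand with nltk.download()
for name in soft_drinks_df['item_name'].head(10):  # first 10 products
    # tokenize -> part-of-speech tag -> chunk into named entities
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(name))):
        print(chunk)
    print()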

(PERSON Bottle/NNP)
(PERSON Coke/NNP Classic/NNP)
('20oz', 'CD')

(PERSON Bottle/NNP)
(PERSON Coke/NNP Diet/NNP)
('20oz', 'CD')

('20oz', 'CD')
(GPE Fountain/NNP)
('Beverage', 'NN')

(PERSON Bottle/NNP)
('Pepsi', 'NNP')
('20oz', 'CD')

('32oz', 'CD')
(GPE Fountain/NNP)
('Beverage', 'NN')

(PERSON Bottle/NNP)
(ORGANIZATION Mountain/NNP)
('Dew', 'NNP')
('20oz', 'CD')

(PERSON Bottle/NNP)
(PERSON Coke/NNP Zero/NNP)
('20oz', 'CD')

(PERSON Bottle/NNP)
('Dr.', 'NNP')
('Pepper', 'NNP')
('20oz', 'CD')

(PERSON Bottle/NNP)
('Dr.', 'NNP')
(PERSON Pepper/NNP Diet/NNP)
('20oz', 'CD')

(PERSON Diet/NNP)
('Pepsi', 'NNP')
('20oz', 'CD')



These results are very poor. Common nouns are mislabelled as proper nouns and most products are tagged as persons. This is expected, however, since product names do not follow the syntactic and grammatical rules of the running text on which the pre-trained models in NLTK and spaCy were trained. spaCy fares no better when tested on our dataset.

One of the largest problems with this method is that the pre-trained models NLTK and spaCy rely on are used as shipped, without being adapted to product names.

### If you choose a supervised-learning model, where will you get your training data?

#### Data Creation

Many of the product names are clean when it comes to the brand name. Therefore, a small list of brand names known to exist in the data can be used to automatically label the clean product names. This will give us an initial dataset to begin training on. Patterns learned from the clean product names can then be applied to unlabelled data to identify new brand names.

This curated dataset will also need to include examples of product names with no brand name.
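To train a sequence-labelling NER model, the automatically labelled names would need to be converted into token-level tags. The sketch below assumes the BIO scheme, a common choice that the original does not specify. The `bio_tags` helper and the `BRAND` tag name are hypothetical.

```python
def bio_tags(item_name, brand=None):
    """Tag each whitespace token as B-BRAND, I-BRAND, or O (outside)."""
    tokens = item_name.split()
    tags = ["O"] * len(tokens)
    if brand:
        brand_tokens = brand.lower().split()
        lowered = [token.lower() for token in tokens]
        width = len(brand_tokens)
        for start in range(len(tokens) - width + 1):
            if lowered[start:start + width] == brand_tokens:
                tags[start] = "B-BRAND"  # first token of the brand
                for k in range(start + 1, start + width):
                    tags[k] = "I-BRAND"  # continuation tokens
                break  # tag only the first occurrence
    return list(zip(tokens, tags))


print(bio_tags("Bottle Coke Classic 20oz", "coke"))
# [('Bottle', 'O'), ('Coke', 'B-BRAND'), ('Classic', 'O'), ('20oz', 'O')]

print(bio_tags("20oz Fountain Beverage"))  # no brand: every token is 'O'
# [('20oz', 'O'), ('Fountain', 'O'), ('Beverage', 'O')]
```

Product names without a brand yield all-`O` sequences, which gives the model the negative examples mentioned above. This sketch requires an exact token match, so a name such as "Coca-Cola" would not be tagged from the brand "coke".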

In [142]:
# Use the brands listed in the problem description and the previous problem
lookup = ['coke', 'pepsi', 'sunkist', '7up', 'sprite']

mask = soft_drinks_df['item_name'].str.contains('|'.join(lookup), case=False)

labelled_data = soft_drinks_df[mask].reset_index(drop=True)
labelled_data['brand'] = labelled_data['item_name'].str.lower().str.extract(pat = '(' + '|'.join(lookup) + ')')

unlabelled_data = soft_drinks_df[~mask].reset_index(drop=True)

labelled_data

| | item_name | brand |
|---|---|---|
| 0 | Bottle Coke Classic 20oz | coke |
| 1 | Bottle Coke Diet 20oz | coke |
| 2 | Bottle Pepsi 20oz | pepsi |
| 3 | Bottle Coke Zero 20oz | coke |
| 4 | Diet Pepsi 20oz | pepsi |
| ... | ... | ... |
| 5469 | sprite 7.5 oz | sprite |
| 5470 | Sunkist Orange 12 oz | sunkist |
| 5471 | BVC - Can Sunkist Lemonade | sunkist |
| 5472 | SPRITE 7.5OZ CAN | sprite |


In [97]:
# percent of data automatically labelled
labelled_data.shape[0] / soft_drinks_df.shape[0] * 100

32.23033443240697

From a short list of known labels, nearly one-third of our data is now labelled.

In [141]:
labelled_data['brand'].value_counts()

coke       2479
pepsi      1766
sprite      792
sunkist     335
7up         102
Name: brand, dtype: int64