### Project Objective:
The objective of this project is to categorize products based on their attributes.

One business application is automating shelf arrangement for retail stores. When a new product arrives, correctly classifying it from its description tells us which shelf it belongs on, which improves efficiency.

Another application is e-commerce. Categorizing products groups together items that are potential substitutes. When a customer searches for one item, products in the same category can be shown in the 'Recommendation' section, which could increase sales.

### Project Details:
The project has three steps:

1. Convert each UPC to a product name and collect the attributes associated with the product

2. Build a machine learning model with Natural Language Processing techniques to classify the products based on their attributes

3. Compute a score for the brand of each product

### We are using the following packages for this project. 

In [757]:
import json
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import requests
import re
import matplotlib.image as img
from collections import defaultdict
import random
from retrying import retry
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/csun1992/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Loading Data
First, load the UPCs into the notebook and remove all duplicates.

In [551]:
fileName = 'data scientist case study sample UPCs.csv'
upcs = []
with open(fileName) as file:
    for i in file:
        upc = json.loads(i.strip())
        if upc != '':
            upcs.append(str(int(upc)))
upcs = list(set(upcs)) # get unique upcs

### Get Product Name for Each UPC
Next, we need to obtain the product name for each UPC.

I needed to solve the following issues:
1. Although some online databases allow retrieving a product name by UPC, they restrict how many queries can be made each day. Usually, the limit is 100 queries.
2. I used Walmart.com to find product attributes. However, the walmart.com search engine is not ideal, and reaching the exact product page can be difficult.
3. A large number of UPCs in the file do not correspond to any record in the databases.

I cannot do much about the third issue. I handled the first two issues as follows.
#### 1. I used the databases as a backup. The procedure for getting product names is as follows.
- Search Google using the word 'UPC' and the actual UPC as keywords.
- Scrape all the results on the first results page.
- Use a MapReduce-style word count to find the words that occur most frequently in the result titles.
- Combine those words into a candidate product name.
- Fall back to the databases if Google returns at most one result or the candidate name has fewer than three words.
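
The word-count step above can be tried offline on a few result titles. This is a minimal sketch, assuming the titles have already been scraped; `candidate_name`, its `min_share` parameter, and the sample titles (including the UPC in them) are made up for illustration and are not part of the notebook.

```python
import re
from collections import Counter

# Separator characters, similar to the ones the notebook splits on.
SPLIT_PATTERN = re.compile(r"[,$%'\-\s|:;<>?()\[\]/&]+")

def candidate_name(titles, ignore=(), min_share=0.4):
    """Join the words whose count is at least `min_share` times the number of titles."""
    counts = Counter()
    for title in titles:
        # Count every occurrence, skipping empty tokens and ignored words.
        for word in SPLIT_PATTERN.split(title.lower()):
            if word and word not in ignore:
                counts[word] += 1
    threshold = len(titles) * min_share
    # Counter keeps insertion order, so words stay in first-seen order.
    return ' '.join(word for word, n in counts.items() if n >= threshold)

# Made-up search-result titles for a made-up UPC.
titles = [
    'Acme Crunchy Peanut Butter 16 oz - UPC 012345678905',
    'Acme Peanut Butter, Crunchy | Walmart',
    'UPC 012345678905 Acme Crunchy Peanut Butter',
]
print(candidate_name(titles, ignore={'upc', '012345678905', 'walmart'}))
# prints: acme crunchy peanut butter
```

Words that appear in only one title ('16', 'oz') fall below the threshold and are dropped, which is the effect the frequency filter is meant to have.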

#### 2. To reach the exact Walmart product page, I used the following procedure, since the Google search engine is more precise than the Walmart search engine.
- Search Google with keywords consisting of the word 'Walmart' and the retrieved product name.
- Follow the first result that links to walmart.com.
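
The link-selection step can be sketched without a browser. The example below assumes the result links have already been extracted from the search page; `build_search_query` and `first_walmart_link` are hypothetical helpers, and the URLs are placeholders.

```python
from urllib.parse import quote_plus, urlparse

def build_search_query(product_name):
    """URL-encode 'Walmart <product name>' for use as a search query string."""
    return quote_plus('Walmart ' + product_name)

def first_walmart_link(links):
    """Return the first link hosted on walmart.com (or a subdomain), else None."""
    for link in links:
        # hostname is None for relative links, so fall back to an empty string.
        host = (urlparse(link).hostname or '').lower()
        if host == 'walmart.com' or host.endswith('.walmart.com'):
            return link
    return None

# Placeholder links standing in for scraped search results.
links = [
    'https://www.example.com/review/peanut-butter',
    'https://www.walmart.com/ip/example-item/123',
    'https://grocery.walmart.com/ip/other-item/456',
]
print(build_search_query('acme crunchy peanut butter'))
print(first_walmart_link(links))
```

Comparing the parsed hostname, rather than splitting the URL on dots, avoids errors on relative links and on hosts that merely contain the word 'walmart' elsewhere in the URL.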

In [520]:
# get product name from upc
def getProductName(upc): 
    wordRanks = defaultdict(int) # counts how often each word appears in search results
    lines = 0
    googleKeyWord = 'upc+'+upc 
    url = 'https://www.google.com/search?q=%s'%googleKeyWord # google link with keyword "upc" and upc number
    driver = webdriver.Chrome(executable_path='./chromedriver')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for i in soup.find_all('h3', class_="LC20lb"): # go through words in each result in the 1st page
        bagOfWords = re.split(r'\s*[,$%\'\-\s|:;<>?\(\)\[\]\/&]\s*', i.text.lower())
        for word in bagOfWords: # separate name so the outer loop variable is not shadowed
            if word and word not in ['upc', 'gtin', 'ean', upc, '0'+upc, \
                        '00'+upc, '000'+upc, '0000'+upc, 'amazon', '...',\
                        'target', 'walmart', 'heb']: # skip empty tokens and the words in this list
                wordRanks[word] += 1
        lines += 1
    driver.quit()
    if lines <= 1: # if no results or 1 result, use database to find the product name
        try: 
            url = 'https://www.buycott.com/upc/%s'%upc # database 1
            page = requests.get(url)
            soup = BeautifulSoup(page.content, 'html.parser')
            name = soup.find('div', {'id':'container_header'}).find('h2').text
        except:
            try:
                url = 'https://www.upcitemdb.com/upc/%s'%upc # database 2
                page = requests.get(url)
                soup = BeautifulSoup(page.content, 'html.parser')
                name = soup.find('p', class_='detailtitle').b.text.strip()
            except:
                raise Exception('no such item')
    elif lines <= 5: # if 5 results or fewer, use words that appear at least twice
        name = [key for key, val in wordRanks.items() if val >= 2]
        name = ' '.join(name)
    else: 
        name = [key for key, val in wordRanks.items() if val >= lines * 0.4]
        name = ' '.join(name)
    if len(name.split(' ')) < 3: # if name has fewer than 3 words, use the databases
        try: 
            url = 'https://www.buycott.com/upc/%s'%upc
            page = requests.get(url)
            soup = BeautifulSoup(page.content, 'html.parser')
            name = soup.find('div', {'id':'container_header'}).find('h2').text
        except:
            try:
                url = 'https://www.upcitemdb.com/upc/%s'%upc
                page = requests.get(url)
                soup = BeautifulSoup(page.content, 'html.parser')
                name = soup.find('p', class_='detailtitle').b.text.strip()
            except:
                raise Exception('no such item')
    return name

In [502]:
def walmartUrlFromGoogle(productName):
    url = 'https://www.google.com/search?q=%s'%productName # google search with "Walmart" and product name
    driver = webdriver.Chrome(executable_path='./chromedriver')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    found = False
    for i in soup.find_all('div', class_='r'): # find the first link that contains "walmart"
        refTag = i.find('a')
        if refTag:
            link = refTag.get('href')
            if link and link.split('.')[1].lower() == 'walmart':
                found = True
                break
    driver.quit()
    if found:
        return link
    else:
        return

In [503]:
# get product Walmart link from product name
def getWalmartUrl(productName):
    walmartName = '+'.join(re.split(r'\s*[,\-\s]+\s*', 'Walmart '+productName))
    link = walmartUrlFromGoogle(walmartName)
    if link:  # first try to get the product link with google search
        return link
    url = 'https://www.walmart.com/search/?query=%s'%walmartName # if google search fails use walmart search
    driver = webdriver.Chrome(executable_path='./chromedriver') 
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    try:
        link = soup.find('a', class_="product-title-link line-clamp line-clamp-2").get('href')
    except:
        driver.quit()
        raise Exception('no link')
    driver.quit()
    return 'https://www.walmart.com/%s'%link

In [504]:
def downloadImages(imgUrl,title):
    Picture_request = requests.get(imgUrl)
    name = "./images/%s.jpg"%title
    if Picture_request.status_code == 200:
        with open(name, 'wb') as f:
            f.write(Picture_request.content)
    return name

In [550]:
# get attributes of each result from the product Walmart link
def getAttribute(productUrl, upc):
    data=defaultdict()
    page = requests.get(productUrl)
    soup = BeautifulSoup(page.content, 'html.parser')
    try:
        category = soup.find('ol', class_="breadcrumb-list")
        allSubCat = category.find_all('li')
        cat = ['-'.join(re.split(r'\s*[,&\-\s]+\s*', i.find('a').get('itemname').lower())) for i in allSubCat]
        data['category'] = '/'.join(cat)
    except:
        pass
        
    try:
        aboutProduct = soup.find('div', {'id':'about-product-section'})
        try:
            nutrition = aboutProduct.find('div', {'id':'nutritionFacts'})
            infos = ['nutrition-facts-all-facts-servingSize', "nutrition-facts-all-facts-calorie-info",\
                    "nutrition-facts-all-facts-nutrient-info", \
                     "nutrition-facts-all-facts-vitamins-minerals-info"]
            try:
                for info in infos:
                    fact = nutrition.find('div', class_=info).find_all('div')
                    for i in fact:
                        keyVal = [j.text for j in i.find_all('span')]
                        if len(keyVal) >= 2:
                            data[keyVal[0].lower()] = keyVal[1].lower()
            except:
                pass
        except:
            pass
        try:
            specifications = aboutProduct.find('div', {'id':'specifications'}).find_all('tr')
            for i in specifications:
                keyVal = [j.text for j in i.find_all('td')]
                if len(keyVal) >= 2:
                        data[keyVal[0].lower()] = keyVal[1].lower()
        except:
            pass
    except:
        pass
    try:
        ingredients = soup.find('span', class_="aboutModuleText").text
        if ingredients:
            data['ingredients'] = ingredients.lower()
        else:
            aboutItem = soup.find('div', {'id':'product-about'})
    except:
        try:
            aboutItem = soup.find('div', {'id':'product-about'}).find('ul').find_all('li')
            for i in aboutItem:
                line = i.text.split()
                if line[-1] == 'serving' and line[-2] == 'per':
                    data[line[-3].lower()] = line[0].lower()
        except:
            pass
    try:
        reviews = soup.find('div', class_='ReviewHistogram ReviewsHeader-filter')\
                    .find_all('div', {'role':'button'})
        reviews = [i.get('aria-label') for i in reviews]
        reviewByStars = {}
        for review in reviews:
            splitedReview = re.split(r'[\s\-]', review)
            value, key = int(splitedReview[0]), int(splitedReview[1]) 
            reviewByStars[key] = value
        data['reviews'] = reviewByStars
    except:
        pass
    try:
        imageLink = soup.find('div', class_='prod-hero-image').find('img').get('src')
        imgName = downloadImages(imageLink, upc)
        data['images'] = img.imread(imgName)
    except:
        pass
    try:
        singlePrice = soup.find('div', class_='variant-options-container')\
                        .find('div', class_="variant-ppu-price variant-option-text big bold").text
        if singlePrice != 'See price':
            data['price'] = float(singlePrice[1:])
        else:
            allPrices = soup.find_all('div', class_='ppu-transactional-variant')[-1]
            quant = int(allPrices.find('div', class_='LinesEllipsis').text.split()[0])
            totalPrice = float(allPrices.find('div', class_='variant-ppu-price variant-option-text big bold')\
                             .text[1:])
            data['price'] = totalPrice / quant
    except:
        try:
            data['price'] = float(soup.find('span', class_='price-group').get('aria-label')[1:])
        except:
            pass
    return data    

In [546]:
# get all the needed info about a product directly from its UPC
def upcToProductInfo(upc):
    # Each step has its own try block so that the raised message names the step that failed
    try:
        title = getProductName(upc)
    except Exception:
        raise Exception('no product name')
    try:
        walmartUrl = getWalmartUrl(title)
    except Exception:
        raise Exception('google fault')
    try:
        productAttr = getAttribute(walmartUrl, upc)
        productAttr['name'] = title.lower()
        return upc, productAttr
    except Exception:
        raise Exception('something is wrong with attribute'+str(upc))

In [508]:
@retry(stop_max_attempt_number=5)
def productInfoList(upcList):
    productList = []
    badQuery = []
    for upc in upcList:
        try:
            productList.append(upcToProductInfo(upc))
        except:
            badQuery.append(upc)
    return productList, badQuery

In [552]:
data, badUpc = productInfoList(upcs) # get the data and bad UPCs

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


In [553]:
import dill as pickle
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)

### Machine Learning Model for Classification

In [697]:
# Use the data with categories as training set
# Data without categories is used as test set
allData.columns = ['_'.join(re.split(r'\s*[\-,&\s\|\/]+\s*', i)) for i in allData.columns]
testData = allData[allData.category.isnull()].reset_index(drop=True)
trainData = allData[allData.category.notnull()].reset_index(drop=True)

Because the categories are unbalanced, I used the SMOTE algorithm to upsample the minority classes.
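
The interpolation idea behind SMOTE can be illustrated without any library. The following is a simplified sketch, not imblearn's implementation: `smote_like_oversample` is a hypothetical helper, and the toy points are made up.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with three 2-D points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_points = smote_like_oversample(minority, n_new=4)
print(len(new_points))
```

Each synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.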

Because the number of food items is extremely large, I first classify each item as 'food' or 'other'. For items classified as 'food', I use hierarchical classification to further assign them to the subcategories suggested by Walmart.

Some features are associated only with food items and others only with non-food items. At prediction time, I first check whether the item has any attribute associated with food. If it does, it goes directly to the second-stage classification. If it has attributes associated with non-food items, it is labeled 'other'. Otherwise, the first-stage classification is performed first.
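
The routing rule can be written as a small stand-alone function. This is a minimal sketch: `route` is a hypothetical helper, the two feature sets are short illustrative subsets of the attribute names, and a plain dict stands in for a DataFrame row.

```python
FOOD_FEATURES = {'calories', 'ingredients', 'serving_size'}   # subset, for illustration
NON_FOOD_FEATURES = {'author', 'isbn_13', 'tire_size'}        # subset, for illustration

def route(product):
    """Decide which stage handles a product, given a dict of its attributes."""
    # Only attributes with a non-missing value count as present.
    present = {key for key, value in product.items() if value is not None}
    if present & FOOD_FEATURES:
        return 'second stage'   # known food: predict the food subcategory directly
    if present & NON_FOOD_FEATURES:
        return 'other'          # known non-food: no model needed
    return 'first stage'        # ambiguous: run the food / other classifier first

print(route({'name': 'peanut butter', 'calories': '190'}))
print(route({'name': 'a novel', 'author': 'someone'}))
print(route({'name': 'mystery item', 'calories': None}))
```

Checking attributes before calling a model means the first-stage classifier only has to handle products whose attributes give no hint either way.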

In [867]:
class Classifier(object):
    
    def __init__(self):
        clf1 = RandomForestClassifier(n_estimators=50)
        clf2 = RandomForestClassifier(n_estimators=50)
        parameters = {'max_depth':[2**i for i in range(2, 10)], \
                      'min_samples_split':[2**i for i in range(1, 8)], \
                      'max_features':['auto', 'sqrt', 'log2']}
        self.clf1 = GridSearchCV(clf1, parameters, cv=5, scoring='f1')
        self.clf2 = GridSearchCV(clf2, parameters, cv=5)

        stopWords = set(stopwords.words('english'))
        self.vectorizer = TfidfVectorizer(stop_words=stopWords, ngram_range=(1, 2))

        self.secondStageLabel = LabelEncoder()
        
    def transformTrainData(self, data):
        cat1 = pd.Series([int(i.split('/')[0] == 'food') for i in data.category], name='cat1')
        cat2Total = [i.split('/')[1] for i in data[cat1 == 1].category]
        cat2Count = Counter(cat2Total)
        cat2 = [i if cat2Count[i] >= 10 else 'other food' for i in cat2Total]
        cat2 = self.secondStageLabel.fit_transform(cat2)
        
        trainData = data['name'].fillna('') + ', ' + data['ingredients'].fillna('')
        try:
            trainData = trainData + ', ' + data['form'].fillna('')
        except:
            pass
        try:
            trainData = trainData + ', ' + data['food_form'].fillna('')
        except:
            pass
          
        train2 = trainData[cat1 == 1]
        return trainData, cat1, train2, cat2
    
    def transformTestData(self, data):
        data = data.fillna('')
        testData = data['name'] + ', ' + data['ingredients']
        try:
            testData = testData + ', ' + data['form']
        except:
            pass
        try:
            testData = testData + ', ' + data['food_form']
        except:
            pass
        return self.vectorizer.transform([testData])
    
    def fit(self, data):
        train1, label1, train2, label2 = self.transformTrainData(data)
        train1 = self.vectorizer.fit_transform(train1)
        train2 = self.vectorizer.transform(train2)
  
        smote = SMOTE('minority')
        X1, y1 = smote.fit_sample(train1, label1)
        X2, y2 = smote.fit_sample(train2, label2)
        self.clf1.fit(X1, y1)
        self.clf2.fit(X2, y2)
        
    def predict(self, data):
        label = []
        foodFeatures = [
       'calcium', 'calories', 'calories_from_fat', 'cholesterol',
       'flavor', 'food_allergen_statements','food_form', 'ingredients', 
       'protein', 'saturated_fat', 'sugars', 'total_carbohydrate', 
        'total_fat', 'trans_fat', 'serving_per_container','serving_size'
        ]
        
        nonFoodFeatures = ['animal_type', 'assembled_product_dimensions_(l_x_w_x_h)',
       'assembled_product_weight', 'author', 'book_format', 'capacity', 'country_of_origin_assembly', 
       'cpu_socket_type', 'dietary_fiber', 'duration', 'fabric_content','isbn_10', 'isbn_13', 
        'makeup_form', 'manganese','manufacturer_part_number', 'manufacturer_part_number', 'material',
       'maximum_ram_supported', 'model', 'number_of_pages', 'occasion', 'original_languages',
       'publication_date', 'publisher', 'release_date', 'series_title', 'studio_production_company', 
       'tire_diameter', 'tire_load_index', 'tire_season','tire_size', 'tire_width',
        'vehicle_make', 'vehicle_model']
        
        foodFeatIncluded = [i for i in foodFeatures if i in data.columns]
        nonFoodFeatIncluded = [i for i in nonFoodFeatures if i in data.columns]
        for _, row in data.iterrows():
            # preliminary check whether the product is food or not based on whether it has
            # attributes associated with food or non-food
            if row[foodFeatIncluded].notnull().any():  
                row = self.transformTestData(row)
                result = self.clf2.predict(row)
                label.extend(self.secondStageLabel.inverse_transform(result))
            elif row[nonFoodFeatIncluded].notnull().any():
                label.append('other')
            else:
                row = self.transformTestData(row)
                if self.clf1.predict(row)[0]:
                    result = self.clf2.predict(row)
                    label.extend(self.secondStageLabel.inverse_transform(result))
                else:
                    label.append('other')
        return label


In [868]:
clf = Classifier()
clf.fit(trainData)



In [870]:
clf.predict(testData) # result for test set

['fresh-food',
 'snacks-cookies-chips',
 'meal-solutions-grains-pasta',
 'snacks-cookies-chips',
 'fresh-food',
 'condiments-sauces-spices',
 'condiments-sauces-spices',
 'snacks-cookies-chips',
 'fresh-food',
 'condiments-sauces-spices',
 'fresh-food',
 'condiments-sauces-spices',
 'snacks-cookies-chips',
 'meal-solutions-grains-pasta',
 'meal-solutions-grains-pasta',
 'fresh-food',
 'meal-solutions-grains-pasta',
 'fresh-food',
 'fresh-food',
 'fresh-food',
 'fresh-food',
 'condiments-sauces-spices',
 'fresh-food',
 'fresh-food',
 'fresh-food',
 'fresh-food',
 'fresh-food',
 'beverages',
 'fresh-food',
 'condiments-sauces-spices',
 'fresh-food',
 'fresh-food',
 'fresh-food',
 'condiments-sauces-spices',
 'fresh-food',
 'fresh-food',
 'condiments-sauces-spices',
 'other food',
 'fresh-food',
 'fresh-food',
 'meal-solutions-grains-pasta',
 'condiments-sauces-spices',
 'other',
 'fresh-food',
 'fresh-food',
 'condiments-sauces-spices',
 'condiments-sauces-spices',
 'condiments-sauces-sp

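### Brand score is obtained based on the Walmart reviews
For each product, the star ratings are summed over all of its reviews (stars times number of reviews). A brand's score is the average of these sums over the brand's products. Products with five or fewer reviews are dropped.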
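
A per-product rating on the 1 to 5 star scale can be obtained by dividing the star total by the number of reviews. A minimal sketch, assuming the same `{stars: number_of_reviews}` dictionary format stored under `data['reviews']`; `mean_star_rating` is a hypothetical helper and the review counts are made up.

```python
def mean_star_rating(review_counts):
    """Average star rating from a {stars: number_of_reviews} mapping."""
    total_reviews = sum(review_counts.values())
    if total_reviews == 0:
        return None  # no reviews, so no rating
    total_stars = sum(stars * count for stars, count in review_counts.items())
    return total_stars / total_reviews

# Made-up histogram: six 5-star, two 4-star, one 3-star and one 1-star review.
reviews = {5: 6, 4: 2, 3: 1, 2: 0, 1: 1}
print(mean_star_rating(reviews))  # 42 stars over 10 reviews
```

Normalizing by the review count keeps heavily reviewed products from dominating a brand's score simply because they have more reviews.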

In [871]:
brands = list(set(df.brand.dropna()))
for i in df[df.brand.isnull()].index:
    for brand in brands:
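        # re.escape keeps characters such as '.' in brand names from being read as regex syntax
        if re.match(re.escape(brand.lower()), df.loc[i, 'name'].lower()) \
        or re.match(re.escape(' '.join(re.split(r'\s*-\s*', brand.lower()))), df.loc[i, 'name'].lower()):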
            df.loc[i, 'brand'] = brand
            break

In [872]:
def getBrandScores(df):
    scores = defaultdict(int)
    for name, group in df.groupby('brand'):
        totalProduct = 0
        for review in group.reviews.dropna():
            if sum(review.values()) <= 5:
                pass
            else:
                scores[name] += sum([i * j for i, j in zip(review.keys(), review.values())])
                totalProduct += 1
        if totalProduct > 0:
            scores[name] = scores[name] / totalProduct
    return scores

Brand scores are given below.

In [873]:
df = pd.DataFrame(t[1] for t in data)
df.index = [i[0] for i in data]
df = df.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
getBrandScores(df)

defaultdict(int,
            {'5strands': 35.0,
             'a.1.': 238.7,
             'athenos': 69.75,
             'bagel bites': 92.0,
             "baker's": 90.0,
             'boca': 265.57142857142856,
             "breakstone's": 349.5,
             "bull's-eye": 151.0,
             'bumble bee': 380.0,
             'calumet': 33.0,
             'capri sun': 216.125,
             'certo': 130.0,
             'cheez whiz': 201.0,
             'classico': 190.0,
             'claussen': 192.75,
             'community coffee': 458.0,
             'cool whip': 30.0,
             'corn nuts': 52.0,
             'cornnuts': 40.4,
             'country time': 288.0,
             'cracker barrel cheese': 242.0,
             'crystal light': 252.6315789473684,
             'delimex': 67.0,
             'devour': 25.0,
             'dream whip': 140.0,
             'franks': 27.0,
             "french's": 198.0,
             'gevalia': 1883.888888888889,
             'good seasons': 