This analysis uses machine learning to categorize products based on their ingredient lists.

There are over 16,000 ingredients registered under INCI (International Nomenclature of Cosmetic Ingredients).
The data was scraped from Ulta.com using my Scrapy SpiderCrawler project.

In [1]:
#import all packages upfront for transparency
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [2]:
#load in the dataset, contains 1151 products
df = pd.read_csv("/Users/tylerjensen/Data-Science-Portfolio-Projects/scraping_project/ulta/shampoo_conditioner1.csv")
display(df.sample(5))
print(f"original shape {df.shape}")

Unnamed: 0,product_name,brand-name,rating,ingredients,product_url
1048,Core Strength Shampoo,Curlsmith,3.3,"Water (Aqua), Sodium Methyl Cocoyl Taurate, Co...",https://www.ulta.com/p/core-strength-shampoo-p...
1065,Flourish Conditioner for Thinning Hair,Virtue,4.7,"Aqua (Water, Eau), Cetearyl Alcohol, Dimethico...",https://www.ulta.com/p/flourish-conditioner-th...
325,ColorSolve Daily Moisture Conditioner,Madison Reed,3.0,"Aqua (Water), Cetearyl Alcohol, Behentrimonium...",https://www.ulta.com/p/colorsolve-daily-moistu...
559,Travel Size Intense Therapy Leave-In Treatment,Pravana,4.4,"Water (Aqua/Eau), Cyclomethicone, Phenyl Trime...",https://www.ulta.com/p/travel-size-intense-the...
222,Manuka Honey & Yogurt Hydrate + Repair Shampoo,SheaMoisture,4.4,"Water, Sodium Lauroyl Methyl lsethionate, Coco...",https://www.ulta.com/p/manuka-honey-yogurt-hyd...


original shape (1151, 5)


In [3]:
#drop products without a review yet; copy so the column assignment below acts on its own DataFrame
df = df[df.rating != "Write A Review"].copy()
df['rating'] = pd.to_numeric(df['rating'])

#Count the number of unique brands, shape of DataFrame and overall average product rating
print(f"unique_brands {df['brand-name'].nunique()} ")
print(f"rows/columns {df.shape}")
print(f"overall average rating {df['rating'].mean():.2f}")

unique_brands 111 
rows/columns (1143, 5)
overall average rating 4.33


In [4]:
#there are 8 rows with no ingredient list; no other values are empty
#let's drop those rows
df = df.dropna()
print(df.isna().sum())

print(f"rows,columns: {df.shape}")

product_name    0
brand-name      0
rating          0
ingredients     0
product_url     0
dtype: int64
rows,columns: (1135, 5)


In [5]:
# which brands have the most products on Ulta under the shampoo/conditioner category
print(df['brand-name'].value_counts())

Paul Mitchell        52
Redken               45
Bumble and bumble    33
Joico                32
Not Your Mother's    31
                     ..
Beast                 1
Thick Head            1
Every Man Jack        1
Pipette               1
Punky Colour          1
Name: brand-name, Length: 111, dtype: int64


This analysis trains an ML model to classify each cosmetic product's type based on its ingredient label. First we need to label each product by type, using the product name.

In [6]:
df['product_name'] = df['product_name'].str.lower()

#drop rows that contain sets of multiple products as a kit
df = df[~df['product_name'].str.contains("set|kit")].copy()

#categorize products by name: shampoos and cleansers are 'shampoo', everything else is 'conditioner'
df['type'] = np.where(df['product_name'].str.contains('shampoo|cleanser'), 'shampoo', 'conditioner')

df.head()

Unnamed: 0,product_name,brand-name,rating,ingredients,product_url,type
0,hydrate shampoo,Pureology,4.5,"Aqua/Water/Eau, Sodium Cocoyl Isethionate, Dis...",https://www.ulta.com/p/hydrate-shampoo-pimprod...,shampoo
1,miracle leave-in product,It's A 10,4.6,"Water (Aqua, Eau), Propylene Glycol, Cetearyl ...",https://www.ulta.com/p/miracle-leave-in-produc...,conditioner
2,no.4 bond maintenance shampoo,OLAPLEX,4.3,"Water (Aqua/Eau), Sodium Lauroyl Methyl Isethi...",https://www.ulta.com/p/no4-bond-maintenance-sh...,shampoo
3,"hg shampoo for thicker, stronger, fuller-looki...",Bondi Boost,4.7,"Aloe Barbadensis (Aloe Vera) Leaf Juice*, Sodi...",https://www.ulta.com/p/hg-shampoo-thicker-stro...,shampoo
4,no.5 bond maintenance conditioner,OLAPLEX,4.4,"Water (Aqua/Eau), Cetearyl Alcohol, PPG-3 Benz...",https://www.ulta.com/p/no5-bond-maintenance-co...,conditioner
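
One caveat with the name-based filter above: "set|kit" is a plain substring pattern, so it also matches names such as "reset" that are not kits. Below is a minimal sketch of a word-boundary alternative. The product names are made up for illustration, and this is not what the notebook ran, so the row counts reported later would likely differ if it were applied.

```python
import pandas as pd

# Hypothetical product names, chosen to show where substring matching misfires.
names = pd.Series([
    "hydrate shampoo",
    "reset conditioner",          # contains "set" only inside "reset"
    "shampoo & conditioner kit",  # a genuine multi-product kit
    "travel set",                 # a genuine set
    "co-wash cleanser",
])

# Substring match, as in the cell above: "set" also matches "reset".
substring_mask = names.str.contains("set|kit")

# Word-boundary match: only standalone "set", "sets", "kit" or "kits" count.
word_mask = names.str.contains(r"\b(?:set|kit)s?\b", regex=True)

# Keep the non-kit rows and label them by name, as the notebook does.
kept = names[~word_mask]
types = kept.str.contains("shampoo|cleanser").map({True: "shampoo", False: "conditioner"})
print(pd.DataFrame({"product_name": kept, "type": types}))
```

The non-capturing group `(?:...)` is used so pandas does not warn about capture groups in the pattern.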


In [7]:
#DataFrame info after cleaning 
print(f"number of product types: {df['type'].nunique()}")
print(f"df shape after cleaning {df.shape}")
print(f"count of each type of product:\n{df['type'].value_counts()}")


number of product types: 2
df shape after cleaning (1123, 6)
count of each type of product:
conditioner    563
shampoo        560
Name: type, dtype: int64


We see a nearly even split between shampoos (560) and conditioners (563), which is expected since most product lineups contain both.

Next we clean up the ingredients column.

In [8]:
df['ingredients'] = df['ingredients'].str.lower()

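#remove literal asterisks and periods; regex=False keeps '*' and '.' from being read as regex
df['ingredients'] = df['ingredients'].str.replace('*', '', regex=False)
df['ingredients'] = df['ingredients'].str.replace('.', '', regex=False)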

df['ingredients'] = df['ingredients'].str.split(',')

#nested list comprehension for stripping white space in each ingredient 
df['ingredients'] = [[x.strip() for x in l] for l in df['ingredients']]

#example print out 
print(df['ingredients'][11])

['water (aqua/eau)', 'propylene glycol', 'cetearyl alcohol', 'cyclopentasiloxane', 'behentrimonium chloride', 'quaternium-80', 'phenoxyethanol', 'fragrance (parfum)', 'methylparaben', 'panthenol', 'propylparaben', 'ethylparaben', 'aloe barbadensis leaf juice', 'butylene glycol', 'sodium hydroxide', 'hydrolyzed keratin', 'keratin amino acids', 'helianthus annuus (sunflower) seed extract', 'silk amino acids', 'camellia sinensis leaf extract', 'benzyl benzoate', 'benzyl salicylate', 'hydroxycitronellal', 'linalool']
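
The cleaning steps in this cell can also be wrapped in one reusable function. The sketch below is an assumption about how that might look, not the notebook's own code: the name `clean_ingredients` and the sample string are hypothetical, and unlike the cell above it also discards empty entries left behind by stray commas.

```python
import pandas as pd

def clean_ingredients(raw: pd.Series) -> pd.Series:
    """Lowercase, drop literal '*' and '.', split on commas, strip whitespace."""
    cleaned = (
        raw.str.lower()
           .str.replace("*", "", regex=False)
           .str.replace(".", "", regex=False)
           .str.split(",")
    )
    # Strip each ingredient and discard empty entries (an addition to the original steps).
    return cleaned.apply(lambda items: [x.strip() for x in items if x.strip()])

# Hypothetical raw label text for illustration.
sample = pd.Series(["Water (Aqua), Cetearyl Alcohol*, Parfum."])
print(clean_ingredients(sample)[0])
```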



In [9]:
df.head()

Unnamed: 0,product_name,brand-name,rating,ingredients,product_url,type
0,hydrate shampoo,Pureology,4.5,"[aqua/water/eau, sodium cocoyl isethionate, di...",https://www.ulta.com/p/hydrate-shampoo-pimprod...,shampoo
1,miracle leave-in product,It's A 10,4.6,"[water (aqua, eau), propylene glycol, cetearyl...",https://www.ulta.com/p/miracle-leave-in-produc...,conditioner
2,no.4 bond maintenance shampoo,OLAPLEX,4.3,"[water (aqua/eau), sodium lauroyl methyl iseth...",https://www.ulta.com/p/no4-bond-maintenance-sh...,shampoo
3,"hg shampoo for thicker, stronger, fuller-looki...",Bondi Boost,4.7,"[aloe barbadensis (aloe vera) leaf juice, sodi...",https://www.ulta.com/p/hg-shampoo-thicker-stro...,shampoo
4,no.5 bond maintenance conditioner,OLAPLEX,4.4,"[water (aqua/eau), cetearyl alcohol, ppg-3 ben...",https://www.ulta.com/p/no5-bond-maintenance-co...,conditioner


In [10]:
#set up the model inputs by splitting the data

#cast each ingredient list to one string so CountVectorizer can tokenize it
X = df['ingredients'].astype(str)

y = df['type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)


In [11]:
#Convert a collection of text documents to a matrix of token counts.
vectorizer = CountVectorizer()

#Learn the vocabulary dictionary and return document-term matrix.
X_train_vect = vectorizer.fit_transform(X_train)

X_train_vect.shape


(898, 2502)

In [12]:
# Logistic Regression model for this binary classification problem

model = LogisticRegression()

model.fit(X_train_vect, y_train)

X_test_vect = vectorizer.transform(X_test)

prediction = model.predict(X_test_vect)


In [13]:
#Print out metrics of the model

class_labels = ['shampoo', 'conditioner']
confusion_matrix_1 = pd.DataFrame(
    metrics.confusion_matrix(y_test, prediction, labels=class_labels),
    index=class_labels,      #rows: true class
    columns=class_labels,    #columns: predicted class
)

print(f"confusion matrix: \n {confusion_matrix_1}\n")

print(metrics.classification_report(y_test, prediction))

confusion matrix: 
              shampoo  conditioner
shampoo          110            9
conditioner        2          104

              precision    recall  f1-score   support

 conditioner       0.92      0.98      0.95       106
     shampoo       0.98      0.92      0.95       119

    accuracy                           0.95       225
   macro avg       0.95      0.95      0.95       225
weighted avg       0.95      0.95      0.95       225
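
The figures in the classification report can be recomputed by hand from the confusion matrix, which is a useful check on how the two outputs relate. The sketch below hard-codes the four counts printed above; only the variable names are new.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, in the order [shampoo, conditioner].
cm = np.array([[110, 9],
               [2, 104]])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of the class

print(f"accuracy: {accuracy:.2f}")
for name, p, r in zip(["shampoo", "conditioner"], precision, recall):
    print(f"{name}: precision {p:.2f}, recall {r:.2f}")
```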



The model reaches 95% accuracy on the 225 held-out products, misclassifying 11 of them (9 shampoos predicted as conditioners and 2 conditioners predicted as shampoos). Cleaning the data was essential, as the raw ingredient text was quite noisy.

An interesting next step for this project would be to see if we can generate a fake product ingredient label for each type of product.