## The goal

Hair cosmetics can be grouped by whether they contain proteins, emollients, or humectants (PEH). Shiny, smooth hair depends on keeping these three components in balance. The dataset used in this notebook lists each cosmetic's ingredients together with a label indicating whether the product is mostly P, E, or H, or a mixture of them.

I used a multi-layer perceptron (MLP) to build a model that identifies the PEH group of a cosmetic.

In [17]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate

In [13]:
peh_df = pd.read_csv("../input/peh-cosmetics/PEH cosmetics.csv")

In [3]:
peh_df

Unnamed: 0,ingredients,PEH group
0,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",PEH
1,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",PEH
2,"Aqua, Glycerin, Cetearyl Alcohol, Isopropyl Pa...",PEH
3,"Aqua, Decyl Glucoside, Glycerin, Babassu Oil P...",PEH
4,"Aqua, Decyl Glucoside, Glycerin, Cetearyl Alco...",EH
5,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",E
6,"Aqua, Glycerin, Cetearyl Alcohol, Propanediol,...",EH
7,"Aqua, Cetearyl Alcohol, Macadamia Ternifolia S...",PEH
8,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",EH
9,"Aqua, Myristyl alcohol, Behenamidopropyl dimet...",E


## Data transformation

Using LabelEncoder, the PEH group variable is transformed from categorical labels to integer codes.

In [14]:
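from sklearn.preprocessing import LabelEncoder

# Map each PEH group label to an integer code (classes are sorted alphabetically)
le = LabelEncoder()
peh_df["PEH group"] = le.fit_transform(peh_df["PEH group"])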

In [16]:
peh_df

Unnamed: 0,ingredients,PEH group
0,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",4
1,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",4
2,"Aqua, Glycerin, Cetearyl Alcohol, Isopropyl Pa...",4
3,"Aqua, Decyl Glucoside, Glycerin, Babassu Oil P...",4
4,"Aqua, Decyl Glucoside, Glycerin, Cetearyl Alco...",1
5,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",0
6,"Aqua, Glycerin, Cetearyl Alcohol, Propanediol,...",1
7,"Aqua, Cetearyl Alcohol, Macadamia Ternifolia S...",4
8,"Aqua, Cetearyl Alcohol, Behentrimonium Chlorid...",1
9,"Aqua, Myristyl alcohol, Behenamidopropyl dimet...",0
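To see which integer stands for which group, the fitted encoder can be inspected. The sketch below uses a small, hypothetical label list (the real dataset may contain other groups); it only illustrates that LabelEncoder sorts the class names and that `inverse_transform` maps codes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration; the real column may hold other groups.
demo_labels = ["PEH", "EH", "E", "H", "P"]

demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(demo_labels)

# classes_ holds the sorted label names; position in the array == integer code.
code_to_label = dict(enumerate(demo_le.classes_))

# inverse_transform turns integer codes back into the original labels.
decoded = demo_le.inverse_transform(demo_codes)

print(code_to_label)
print(list(decoded))
```

In the notebook itself, `le.classes_` and `le.inverse_transform(...)` would give the same information for the real labels.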


Next, we transform the ingredients column into a matrix, because a numeric representation is easier for a model to work with than raw text. The CountVectorizer class used below builds a matrix that counts how often each word occurs in each document (row).

In [18]:
count_vectorizer = CountVectorizer()
peh_vectors = count_vectorizer.fit_transform(peh_df["ingredients"])
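With its default settings, CountVectorizer splits on word boundaries, so a multi-word ingredient such as "Cetearyl Alcohol" becomes two separate features. The sketch below contrasts that with a possible alternative, a comma-based tokenizer that keeps each ingredient whole. The two sample strings and the `split_ingredients` helper are made up for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up ingredient lists in the same comma-separated style as the dataset.
docs = [
    "Aqua, Cetearyl Alcohol, Glycerin",
    "Aqua, Glycerin, Propanediol",
]

# Default behaviour: one feature per word.
word_vec = CountVectorizer()
word_matrix = word_vec.fit_transform(docs)
word_vocab = sorted(word_vec.vocabulary_)

def split_ingredients(text):
    """Hypothetical tokenizer: one token per comma-separated ingredient."""
    return [part.strip() for part in text.split(",") if part.strip()]

# Alternative: one feature per whole ingredient name.
# token_pattern=None because the custom tokenizer replaces the default pattern.
ingredient_vec = CountVectorizer(tokenizer=split_ingredients, token_pattern=None)
ingredient_matrix = ingredient_vec.fit_transform(docs)
ingredient_vocab = sorted(ingredient_vec.vocabulary_)

print(word_vocab)
print(ingredient_vocab)
print(ingredient_matrix.toarray())
```

Whether whole-ingredient features work better than word features on this dataset is an open question; the sketch only shows how the two representations differ.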

## Model

First, I split the data into training and test sets.

In [6]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(peh_vectors, peh_df["PEH group"], shuffle=True)
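The call above relies on the default `test_size` of 0.25 and sets no `random_state`, so every run produces a different split. A minimal sketch of how the split sizes come out and how a fixed seed makes the split repeatable, using 32 hypothetical rows rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 32 hypothetical rows; the real dataset size is not assumed here.
X_demo = np.arange(32).reshape(-1, 1)
y_demo = np.arange(32) % 4

# Default test_size=0.25: a quarter of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=0)

split_sizes = (len(X_tr), len(X_te))
same_split = np.array_equal(X_te, X_te2)

print(split_sizes, same_split)
```

With so few records, a single split can swing the score considerably, which is one reason to fix the seed when comparing model settings.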

To make predictions I used a multi-layer perceptron (MLP) classifier. A perceptron is a single-neuron model; an MLP stacks multiple layers of such neurons into a feedforward neural network, in which information moves in one direction only and never forms cycles or loops. The MLP is trained with backpropagation, which computes how much each weight contributed to the prediction error so that the weights can be adjusted accordingly.
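To make the feedforward idea concrete, here is a rough NumPy sketch of a single forward pass through a network with two hidden layers of 6 and 5 neurons. The feature count, class count, and random weights are placeholders, and the sketch leaves out training (backpropagation) entirely. It uses ReLU in the hidden layers and softmax at the output, which are scikit-learn's defaults for a multi-class MLPClassifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Hidden-layer activation: negative values are clipped to zero.
    return np.maximum(z, 0.0)

def softmax(z):
    # Turn output scores into probabilities that sum to 1 per row.
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Placeholder sizes: 10 input features, hidden layers (6, 5), 5 classes.
n_features, n_classes = 10, 5
layer_sizes = [n_features, 6, 5, n_classes]

# Random, untrained weights and zero biases, one pair per layer transition.
weights = [rng.normal(scale=0.1, size=(a, b))
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

def forward(X):
    """Pass inputs through each layer in order; nothing flows backwards."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    return softmax(a @ weights[-1] + biases[-1])

X_demo = rng.integers(0, 2, size=(3, n_features)).astype(float)
probs = forward(X_demo)
print(probs.shape)
```

Training would then adjust `weights` and `biases` to reduce the loss that the verbose output of MLPClassifier reports per iteration.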

In [23]:
mlp_peh = MLPClassifier(hidden_layer_sizes=(6, 5), random_state=3, verbose=True, learning_rate_init=0.1)

mlp_peh.fit(X_train, y_train)

Iteration 1, loss = 1.74012494
Iteration 2, loss = 1.44373104
Iteration 3, loss = 1.25113079
Iteration 4, loss = 1.00319683
Iteration 5, loss = 0.83466674
Iteration 6, loss = 0.71751324
Iteration 7, loss = 0.62612494
Iteration 8, loss = 0.54251049
Iteration 9, loss = 0.46156270
Iteration 10, loss = 0.39442629
Iteration 11, loss = 0.29069830
Iteration 12, loss = 0.21198229
Iteration 13, loss = 0.12965060
Iteration 14, loss = 0.06779080
Iteration 15, loss = 0.04211196
Iteration 16, loss = 0.01595517
Iteration 17, loss = 0.00709989
Iteration 18, loss = 0.00506688
Iteration 19, loss = 0.00541537
Iteration 20, loss = 0.00445682
Iteration 21, loss = 0.00362178
Iteration 22, loss = 0.00352644
Iteration 23, loss = 0.00377916
Iteration 24, loss = 0.00361537
Iteration 25, loss = 0.00313227
Iteration 26, loss = 0.00288638
Iteration 27, loss = 0.00282576
Iteration 28, loss = 0.00283432
Iteration 29, loss = 0.00286652
Iteration 30, loss = 0.00290706
Iteration 31, loss = 0.00294976
Iteration 32, los

MLPClassifier(hidden_layer_sizes=(6, 5), learning_rate_init=0.1, random_state=3,
              verbose=True)

In [24]:
y_pred = mlp_peh.predict(X_test)

## Validation

To validate the model I calculated the accuracy score, which equals 62.5%. It could be better, but it satisfies me for now.

One way to improve the predictions is to add more data to the training set, which currently has only around 23 records.

In [26]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.625
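Accuracy is simply the share of test records whose predicted label matches the true one. The sketch below uses made-up labels, chosen so that 5 of 8 predictions are correct, to show that `accuracy_score` and the manual calculation agree. The test set size of 8 is an assumption for illustration, not something the notebook states.

```python
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels: 5 of the 8 positions match.
y_true_demo = [4, 4, 1, 0, 1, 4, 0, 1]
y_pred_demo = [4, 4, 1, 0, 4, 1, 1, 1]

# Manual accuracy: number of matches divided by number of records.
n_correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = n_correct / len(y_true_demo)

sk_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(n_correct, manual_accuracy, sk_accuracy)
```

On a test set this small, each individual prediction moves the score by a large step, so the figure should be read with some caution.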

Following Kaggle comments, I also evaluated the model using cross-validation with 3 folds.

In [35]:
cv_results = cross_validate(mlp_peh, X_train, y_train, cv=3)

Iteration 1, loss = 1.75051435
Iteration 2, loss = 1.49440246
Iteration 3, loss = 1.28124596
Iteration 4, loss = 1.03568339
Iteration 5, loss = 0.91916022
Iteration 6, loss = 0.80772376
Iteration 7, loss = 0.71860902
Iteration 8, loss = 0.67267575
Iteration 9, loss = 0.59669998
Iteration 10, loss = 0.53487492
Iteration 11, loss = 0.47489657
Iteration 12, loss = 0.39731263
Iteration 13, loss = 0.33029565
Iteration 14, loss = 0.28213817
Iteration 15, loss = 0.24679809
Iteration 16, loss = 0.22019425
Iteration 17, loss = 0.19491933
Iteration 18, loss = 0.16542336
Iteration 19, loss = 0.14671330
Iteration 20, loss = 0.13554343
Iteration 21, loss = 0.12341962
Iteration 22, loss = 0.11923727
Iteration 23, loss = 0.10062887
Iteration 24, loss = 0.09149019
Iteration 25, loss = 0.07838900
Iteration 26, loss = 0.03452201
Iteration 27, loss = 0.02333014
Iteration 28, loss = 0.02592488
Iteration 29, loss = 0.02114430
Iteration 30, loss = 0.01357177
Iteration 31, loss = 0.01335078
Iteration 32, los

The 3-fold cross-validation results show that the model does not generalize very well (although it is still better than guessing :)). The test score is 50% in the first split, but only 37.5% in the second and third.

In [36]:
cv_results

{'fit_time': array([0.05460906, 0.03918886, 0.03629375]),
 'score_time': array([0.00066137, 0.00064349, 0.00062394]),
 'test_score': array([0.5  , 0.375, 0.375])}
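The per-fold scores are easier to compare with the single train/test result when summarized as a mean and a spread. A small sketch using the three `test_score` values printed above (copied by hand here, so it runs without the fitted model):

```python
import numpy as np

# The three fold scores reported by cross_validate above.
test_scores = np.array([0.5, 0.375, 0.375])

mean_score = test_scores.mean()
std_score = test_scores.std()  # population standard deviation across folds

print(f"CV accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

In the notebook, the same summary could be computed directly from `cv_results["test_score"]`.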

## Upcoming changes

* More data will be added to the dataset, which should improve the model's performance.