## Pair Programming Encoding


In today's pair programming session we will use the dataset you saved in the normalization and standardization pair programming.

Your dataset should contain at least one categorical variable. The goal of today's pair programming is:


Encode the categorical variable(s) in your dataset.

Remember that the first thing to do is decide whether or not your variables have an order, so that you can choose one approach or the other accordingly.
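As a minimal sketch of the two approaches, assume a toy dataframe with a hypothetical ordered column `size_label` and a hypothetical unordered column `colour` (neither comes from our dataset). An ordered variable can be encoded with an explicit map that preserves the ranking, while an unordered one can be one-hot encoded, here with `pd.get_dummies`:

```python
import pandas as pd

# Toy data: column names and values are illustrative, not from the Sephora dataset
toy = pd.DataFrame({
    "size_label": ["small", "large", "medium", "small"],
    "colour": ["red", "blue", "red", "green"],
})

# Ordered variable: an explicit map preserves the ranking small < medium < large
order_map = {"small": 0, "medium": 1, "large": 2}
toy["size_label_map"] = toy["size_label"].map(order_map)

# Unordered variable: one-hot encoding creates one indicator column per category
dummies = pd.get_dummies(toy["colour"], prefix="colour", dtype=int)
toy_encoded = pd.concat([toy, dummies], axis=1)

print(toy_encoded)
```

The map keeps a single column whose integers carry the order, whereas the one-hot version avoids implying any ranking at the cost of one extra column per category.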


Save the dataframe, which should now contain the standardized, normalized, and encoded variables, to a CSV file so you can use it in the next pair programming session.

In [1]:
# Data handling
# -----------------------------------------------------------------------
import numpy as np
import pandas as pd
import random

# Plotting
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
# ------------------------------------------------------------------------------
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

# Default figure size
plt.rcParams["figure.figsize"] = (10, 8)

# Show every column when displaying a dataframe
pd.options.display.max_columns = None



In [2]:
df = pd.read_csv("../datos/sephora_website_dataset2.csv", index_col = 0)
df.sample(1)

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer,rating_raiz
1365,2309409,Charlotte Tilbury,Lip Liner,Lip Cheat Lip Liner - Pillow Talk Collection,.04 oz / 1.2 g,4.0,-0.049562,-0.196661,-0.595034,-0.603029,https://www.sephora.com/product/charlotte-tilb...,False,0,no options,What it is: A collection of lip liners in univ...,Suggested Usage:-Outline your lips starting wi...,Cyclopentasiloxane- Synthetic Wax- Polybutene-...,0,0,0,0,0.116885


When we start the encoding process, the first thing we need to ask is *whether the categorical variables have an order*.

We store them in the variable `categoricas`, selecting them by dtype:

In [4]:
categoricas = df.select_dtypes(exclude = ['float64', 'int64'])
categoricas

Unnamed: 0,brand,category,name,size,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients
0,Acqua Di Parma,Fragrance,Blu Mediterraneo MINIATURE Set,5 x 0.16oz/5mL,https://www.sephora.com/product/blu-mediterran...,True,online only,no options,This enchanting set comes in a specially handc...,Suggested Usage:-Fragrance is intensified by t...,Arancia di Capri Eau de Toilette: Alcohol Dena...
1,Acqua Di Parma,Cologne,Colonia,0.7 oz/ 20 mL,https://www.sephora.com/product/colonia-P16360...,True,online only,- 0.7 oz/ 20 mL Spray - 1.7 oz/ 50 mL Eau d...,An elegant timeless scent filled with a fresh-...,no instructions,unknown
2,Acqua Di Parma,Perfume,Arancia di Capri,5 oz/ 148 mL,https://www.sephora.com/product/blu-mediterran...,True,online only,- 1oz/30mL Eau de Toilette - 2.5 oz/ 74 mL E...,Fragrance Family: Fresh Scent Type: Fresh Citr...,no instructions,Alcohol Denat.- Water- Fragrance- Limonene- Li...
3,Acqua Di Parma,Perfume,Mirto di Panarea,2.5 oz/ 74 mL,https://www.sephora.com/product/blu-mediterran...,True,online only,- 1 oz/ 30 mL Eau de Toilette Spray - 2.5 oz/...,Panarea near Sicily is an an island suspended ...,no instructions,unknown
4,Acqua Di Parma,Fragrance,Colonia Miniature Set,5 x 0.16oz/5mL,https://www.sephora.com/product/colonia-miniat...,True,online only,no options,The Colonia Miniature Set comes in an iconic A...,Suggested Usage:-Fragrance is intensified by t...,Colonia: Alcohol Denat.- Water- Fragrance- Lim...
...,...,...,...,...,...,...,...,...,...,...,...
9163,SEPHORA COLLECTION,Face Masks,The Rose Gold Mask,no size,https://www.sephora.com/product/the-rose-gold-...,True,limited edition · exclusive,no options,What it is: A limited-edition- nurturing and h...,Suggested Usage:-Unfold the mask.-Apply the ma...,-Rose Quartz Extract: Hydrates dry skin. Aqua...
9164,SEPHORA COLLECTION,Lip Sets,Give Me Some Sugar Colorful Gloss Balm Set,3 x 0.32 oz/ 9 g,https://www.sephora.com/product/sephora-collec...,True,exclusive,no options,What it is: A set of three bestselling Colorfu...,Suggested Usage:-Apply directly to lips using ...,Colorful Gloss Balm Wanderlust: Hydrogenated P...
9165,SEPHORA COLLECTION,Tinted Moisturizer,Weekend Warrior Tone Up Cream,0.946 oz/ 28 mL,https://www.sephora.com/product/sephora-collec...,True,exclusive,no options,What it is: A weightless complexion booster- i...,Suggested Usage:-Use this product as the last ...,Aqua (Water)- Dimethicone- Isohexadecane- Poly...
9166,SEPHORA COLLECTION,no category,Gift Card,no size,https://www.sephora.com/product/gift-card-P370...,False,0,no options,What it is:- Available in denominations of $10...,no instructions,unknown


We explore them with boxplots to see whether the median of `rating` in each category reveals any order. Unfortunately, our data are poorly suited to this analysis.
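A numeric counterpart of the boxplot is to compute the median of the response variable for each category and sort the result. The sketch below uses toy data with made-up values, assuming a categorical column `category` and a numeric column `rating`:

```python
import pandas as pd

# Toy data: values are illustrative, not taken from the Sephora dataset
toy = pd.DataFrame({
    "category": ["Perfume", "Perfume", "Cologne", "Cologne", "Perfume"],
    "rating": [4.5, 4.0, 3.0, 3.5, 5.0],
})

# Median rating per category, sorted from lowest to highest
medians = toy.groupby("category")["rating"].median().sort_values()
print(medians)
```

If the sorted medians are clearly separated, that suggests treating the variable as ordered; if they are all similar, an order is hard to justify.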

In [5]:


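# Keep only low-cardinality columns: a boxplot over thousands of categories
# (name, url, details...) is unreadable and very slow to draw.
# The threshold of 10 is an arbitrary choice.
columnas = [col for col in categoricas.columns if categoricas[col].nunique() <= 10]

fig, axes = plt.subplots(len(columnas), 1, figsize=(30, 25), squeeze=False)
axes = axes.flat

for indice, columna in enumerate(columnas):
    sns.boxplot(x=columna, y='rating', data=df, ax=axes[indice], color="aquamarine")

plt.tight_layout()
plt.show()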

We explore the `marketingflags` variable.

In [None]:
df['marketingflags'].unique()

array([ True, False])

In [None]:
print(df['marketingflags'].dtype)

bool


In [None]:
df["marketingflags"].isnull().sum()

0

We decide to treat the `marketingflags` variable as ordinal and encode it with a dictionary map.

In [None]:
flag_map = {False:0, True:1} 

In [None]:
df['MarketingFlags_map'] = df['marketingflags'].map(flag_map)
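For a boolean column, a dictionary map such as `flag_map` gives the same values as casting the column to integers. A small sketch on toy values (the series below is illustrative, not read from the dataset):

```python
import pandas as pd

# Toy boolean series standing in for the marketing flag column
flags = pd.Series([True, False, True], name="marketingflags")

# Option 1: explicit dictionary map
flag_map = {False: 0, True: 1}
via_map = flags.map(flag_map)

# Option 2: cast the booleans directly to integers
via_astype = flags.astype(int)

print(via_map.tolist())
print(via_astype.tolist())
```

The explicit map documents the intended coding, which helps when the categories are not booleans; the cast is shorter when they are.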

In [None]:
df.sample(5)

Unnamed: 0,id,category,rating,number_of_reviews,love,price,value_price,marketingflags,marketingflags_content,online_only,exclusive,limited_edition,limited_time_offer,rating_raiz
4004,1947019,Lotions & Oils,4.0,-0.311185,-0.332798,-0.425407,-0.44127,False,0,0,0,0,0,0.116885
3674,2067072,Hair,4.5,-0.273008,-0.252994,-0.743457,-0.744568,True,online only,1,0,0,0,0.393366
2966,2254720,Rollerballs & Travel Size,1.0,-0.314553,-0.375212,-0.467814,-0.48171,True,exclusive · online only,1,1,0,0,-2.162048
4392,1988716,Toners,4.0,-0.26178,-0.227174,-0.679847,-0.683908,False,0,0,0,0,0,0.116885
2054,1815984,Value & Gift Sets,4.0,-0.205637,-0.091038,-0.319391,-0.057093,False,0,0,0,0,0,0.116885


In [None]:
df['marketingflags_content'].value_counts()

0                                                   4786
exclusive                                           1692
online only                                         1528
exclusive · online only                              318
limited edition · exclusive                          297
limited edition                                      237
limited edition · online only                        188
limited edition · exclusive · online only            119
limited time offer                                     2
limited time offer · limited edition · exclusive       1
Name: marketingflags_content, dtype: int64

In [None]:
df["marketingflags_content"].unique()

array(['online only', 'exclusive · online only', '0',
       'limited edition · exclusive · online only',
       'limited edition · online only', 'exclusive',
       'limited edition · exclusive', 'limited edition',
       'limited time offer',
       'limited time offer · limited edition · exclusive'], dtype=object)
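The unique values above are combinations of individual labels joined by ` · `. If no ranking among those combinations is intended, one alternative to a single integer code is one indicator column per label. The sketch below uses `Series.str.get_dummies` on a few toy values; it assumes that ` · ` is the only separator and that `"0"` means "no flags". The dataset already appears to carry similar columns (`online_only`, `exclusive`, `limited_edition`, `limited_time_offer`), so this is only an illustration of the technique:

```python
import pandas as pd

# A few toy values in the same format as the column
content = pd.Series(
    ["online only", "exclusive · online only", "0", "limited edition · exclusive"],
    name="marketingflags_content",
)

# Assumption: "0" means "no flags", so replace it with an empty string
clean = content.replace("0", "")

# Split on the separator and build one 0/1 column per individual label
indicators = clean.str.get_dummies(sep=" · ")
print(indicators)
```

Each row can then have several indicators set at once, which a single integer code cannot express.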

In [None]:
print(df['marketingflags_content'].dtype)

object


In [None]:
df["marketingflags_content"].isnull().sum()

0

In [None]:
# Integer codes follow the frequency order returned by value_counts()
flag_map2 = {
    "0": 0,
    "exclusive": 1,
    "online only": 2,
    "exclusive · online only": 3,
    "limited edition · exclusive": 4,
    "limited edition": 5,
    "limited edition · online only": 6,
    "limited edition · exclusive · online only": 7,
    "limited time offer": 8,
    "limited time offer · limited edition · exclusive": 9,
}

In [None]:
df["MarketingFlags_content_map2"] = df['marketingflags_content'].map(flag_map2)
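`Series.map` returns `NaN` for any value that is missing from the dictionary, so it is worth checking that every category was covered. A minimal sketch on toy values, where the third value is deliberately left out of a hypothetical `mapping`:

```python
import pandas as pd

# Toy values: the last one is deliberately absent from the mapping
values = pd.Series(["exclusive", "online only", "brand new flag"])
mapping = {"exclusive": 1, "online only": 2}

encoded = values.map(mapping)

# Categories missing from the dictionary become NaN after the map
unmapped = values[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means the dictionary covered every category present in the column.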

In [None]:
df.head(30)

id,category,rating,number_of_reviews,love,price,value_price,MarketingFlags,MarketingFlags_content,online_only,exclusive,limited_edition,limited_time_offer,rating_norm,rating_log,rating_raiz,rating_minmax,MarketingFlags_map,MarketingFlags_content_map2
2218774,Fragrance,4.0,4,3002,66.0,75.0,True,online only,1,0,0,0,0.001996,1.386294,2.0,,1,2
2044816,Cologne,4.5,76,2700,66.0,66.0,True,online only,1,0,0,0,0.101996,1.504077,2.12132,,1,2
1417567,Perfume,4.5,26,2600,180.0,180.0,True,online only,1,0,0,0,0.101996,1.504077,2.12132,,1,2
1417617,Perfume,4.5,23,2900,120.0,120.0,True,online only,1,0,0,0,0.101996,1.504077,2.12132,,1,2
2218766,Fragrance,3.5,2,943,72.0,80.0,True,online only,1,0,0,0,-0.098004,1.252763,1.870829,,1,2
1417609,Perfume,4.5,79,2600,180.0,180.0,True,online only,1,0,0,0,0.101996,1.504077,2.12132,,1,2
1638832,Perfume,4.5,79,5000,210.0,210.0,True,online only,1,0,0,0,0.101996,1.504077,2.12132,,1,2
1284462,Cologne,5.0,13,719,120.0,120.0,True,online only,1,0,0,0,0.201996,1.609438,2.236068,,1,2
2221588,Body Mist & Hair Mist,4.0,5,800,58.0,58.0,True,online only,1,0,0,0,0.001996,1.386294,2.0,,1,2
2221596,Perfume,3.0,5,2100,58.0,58.0,True,exclusive · online only,1,1,0,0,-0.198004,1.098612,1.732051,,1,3


In [None]:
df.to_csv("../datos/sephora_website_dataset4.csv")