# ETL for the CosIng database
The goal of this ETL is to **extract** the five annexes (II to VI) published by the European Commission at https://ec.europa.eu/growth/tools-databases/cosing/reference/annexes, **transform** each one so that they can be combined into a single dataset, and then **load** that transformed dataset for the other purposes explained later on.

Keep in mind that the regulation itself describes the annexes to this legislative text (https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32009R1223&qid=1611669486511). The following five are the ones used here:

- ANNEX II: LIST OF SUBSTANCES PROHIBITED IN COSMETIC PRODUCTS
- ANNEX III: LIST OF SUBSTANCES WHICH COSMETIC PRODUCTS MUST NOT CONTAIN EXCEPT SUBJECT TO THE RESTRICTIONS LAID DOWN
- ANNEX IV: LIST OF COLORANTS ALLOWED IN COSMETIC PRODUCTS
- ANNEX V: LIST OF PRESERVATIVES ALLOWED IN COSMETIC PRODUCTS
- ANNEX VI: LIST OF UV FILTERS ALLOWED IN COSMETIC PRODUCTS


With this information, we establish the following for our analysis:

The variable we want to extract from these annexes is ***Name of Common Ingredients Glossary***.

Regulation (EC) No 1223/2009 of the European Parliament and of the Council of 30 November 2009 on cosmetic products
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02009R1223-20240424
states in Article 33, "Glossary of common ingredient names", that the Commission will keep a glossary of common ingredient names up to date, based on the International Nomenclature of Cosmetic Ingredients (INCI). The common ingredient name is to be used for labelling cosmetic products placed on the market.

**IMPORTANT:** From here on, "INCI name" means the name that appears on the product label, for the reason just explained.

## 1. Importing libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Helper to load a file and overwrite its header

# 1. Define the headers.
# Annex II has several columns fewer.
# Annex IV has one extra column.

header_2 = ['Reference Number', 'Chemical name','CAS Number', 'EC Number', 'Regulation', 'Other Directives/Regulations', 
'SCCS opinions', 'Chemical/IUPAC Name', 'Identified INGREDIENTS or substances e.g.', 'CMR', 'Update date']

header = ['Reference Number', 'Chemical name', 'Name of Common Ingredients Glossary' ,
'CAS Number', 'EC Number', 'Product Type, body parts','Maximum concentration in ready for use preparation' , 'Other Restrictions','Wording of conditions of use and warnings',
'Regulation', 'Other Directives/Regulations', 'SCCS opinions', 'Chemical/IUPAC Name', 'Identified INGREDIENTS or substances e.g.', 'CMR', 'Update date']

header_4 = ['Reference Number', 'Chemical name', 'Name of Common Ingredients Glossary' ,
'CAS Number', 'EC Number','Color', 'Product Type, body parts','Maximum concentration in ready for use preparation' , 'Other Restrictions','Wording of conditions of use and warnings',
'Regulation', 'Other Directives/Regulations', 'SCCS opinions', 'Chemical/IUPAC Name', 'Identified INGREDIENTS or substances e.g.', 'CMR', 'Update date']

# 2. Define the function
def cargar_anexo(ruta, columnas):
    df = pd.read_excel(ruta, skiprows=6, header=[0,1])
    df.columns = columnas  # rename the columns using the lists defined above
    return df
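
Assigning a list to `df.columns` only works when its length matches the number of columns read from the sheet. pandas already raises in that case, so the sketch below only adds a clearer message; `rename_columns_checked` is a hypothetical helper and the toy frame stands in for a sheet read with `header=[0, 1]`.

```python
import pandas as pd

def rename_columns_checked(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with flat column names, failing loudly on a length mismatch."""
    if len(columns) != df.shape[1]:
        raise ValueError(
            f"expected {df.shape[1]} column names, got {len(columns)}"
        )
    out = df.copy()
    out.columns = columns  # replaces the two-level header with flat names
    return out

# Toy frame with a two-level header (hypothetical stand-in for an annex sheet)
toy = pd.DataFrame(
    [["2a", "THIOGLYCOLIC ACID"]],
    columns=pd.MultiIndex.from_tuples([
        ("Reference Number", "Unnamed: 0_level_1"),
        ("Substance identification", "Name of Common Ingredients Glossary"),
    ]),
)
renamed = rename_columns_checked(
    toy, ["Reference Number", "Name of Common Ingredients Glossary"]
)
```

A check like this would point directly at the annex whose layout changed if the Commission adds or removes a column in a later release.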


## 2. Reading the data and DataFrame structure

The files have their header in merged cells, so it spans two rows.

In [3]:
# Read the CosIng Annex III dataset
anexo_3 = pd.read_excel("../../data/raw/COSING_Annex_III_v2.xlsx", skiprows=6, header = [0,1])
anexo_3.head(1)

Unnamed: 0_level_0,Reference Number,Substance identification,Substance identification,Substance identification,Substance identification,Restrictions,Restrictions,Restrictions,Wording of conditions of use and warnings,Regulation,Other Directives/Regulations,SCCS opinions,Chemical/IUPAC Name,Identified INGREDIENTS or substances e.g.,CMR,Update date
Unnamed: 0_level_1,Unnamed: 0_level_1,Chemical name / INN,Name of Common Ingredients Glossary,CAS Number,EC Number,"Product Type, body parts",Maximum concentration in ready for use preparation,Other,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,2a,Thioglycolic acid and its salts,THIOGLYCOLIC ACID,68-11-1,200-677-4,(a) Hair products\n\n(b) Depilatories\n\n(c) H...,(a) (i) 8% (ii) 11% \n\n(b) 5% \n\n(c) 2%\n...,(a) (i) General use ready for use pH 7 to 9.5 ...,Conditions of use:\n\n(a) (b) (c) (d)\nAvoid c...,(EU) 2015/1190,,Thioglycolic acid and its salts (TGA),Thioglycolic acid and its salts,AMMONIUM THIOGLYCOLATE\nCALCIUM THIOGLYCOLATE\...,,27/07/2020
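
The two-row header shown above arrives as a `MultiIndex` in which empty cells become `Unnamed: ...` placeholders. As a rough illustration of what an automatic flattening would give, the sketch below keeps the second header row unless it is a placeholder; `flatten_header` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def flatten_header(columns: pd.MultiIndex) -> list:
    """Prefer the second header row; fall back to the first when it is an 'Unnamed' placeholder."""
    flat = []
    for top, bottom in columns:
        flat.append(top if str(bottom).startswith("Unnamed") else bottom)
    return flat

# A few header pairs copied from the Annex III output above
cols = pd.MultiIndex.from_tuples([
    ("Reference Number", "Unnamed: 0_level_1"),
    ("Substance identification", "Chemical name / INN"),
    ("Substance identification", "CAS Number"),
    ("Restrictions", "Other"),
    ("Regulation", "Unnamed: 9_level_1"),
])
flat_names = flatten_header(cols)
```

Automatic flattening yields names such as `Other` that lose their parent group (`Restrictions`), which is one reason hand-written header lists can be the clearer choice.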


Given this, we read the files directly with the function defined earlier, which replaces the column names on load.

In [4]:
# Read the CosIng datasets

anexo_2 = cargar_anexo("../../data/raw/COSING_Annex_II_v2.xlsx", header_2)
anexo_3 = cargar_anexo("../../data/raw/COSING_Annex_III_v2.xlsx", header)
anexo_4 = cargar_anexo("../../data/raw/COSING_Annex_IV_v2.xlsx", header_4)
anexo_5 = cargar_anexo("../../data/raw/COSING_Annex_V_v2.xlsx", header)
anexo_6 = cargar_anexo("../../data/raw/COSING_Annex_VI_v2.xlsx", header)

We inspect the data and its structure.

In [5]:
anexo_3.head(1)

Unnamed: 0,Reference Number,Chemical name,Name of Common Ingredients Glossary,CAS Number,EC Number,"Product Type, body parts",Maximum concentration in ready for use preparation,Other Restrictions,Wording of conditions of use and warnings,Regulation,Other Directives/Regulations,SCCS opinions,Chemical/IUPAC Name,Identified INGREDIENTS or substances e.g.,CMR,Update date
0,2a,Thioglycolic acid and its salts,THIOGLYCOLIC ACID,68-11-1,200-677-4,(a) Hair products\n\n(b) Depilatories\n\n(c) H...,(a) (i) 8% (ii) 11% \n\n(b) 5% \n\n(c) 2%\n...,(a) (i) General use ready for use pH 7 to 9.5 ...,Conditions of use:\n\n(a) (b) (c) (d)\nAvoid c...,(EU) 2015/1190,,Thioglycolic acid and its salts (TGA),Thioglycolic acid and its salts,AMMONIUM THIOGLYCOLATE\nCALCIUM THIOGLYCOLATE\...,,27/07/2020


In [6]:
anexo_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1744 entries, 0 to 1743
Data columns (total 11 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   Reference Number                           1744 non-null   int64 
 1   Chemical name                              1744 non-null   object
 2   CAS Number                                 1744 non-null   object
 3   EC Number                                  1744 non-null   object
 4   Regulation                                 1744 non-null   object
 5   Other Directives/Regulations               47 non-null     object
 6   SCCS opinions                              946 non-null    object
 7   Chemical/IUPAC Name                        612 non-null    object
 8   Identified INGREDIENTS or substances e.g.  424 non-null    object
 9   CMR                                        1026 non-null   object
 10  Update date                         

In [7]:
anexo_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 16 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   Reference Number                                    373 non-null    object
 1   Chemical name                                       373 non-null    object
 2   Name of Common Ingredients Glossary                 313 non-null    object
 3   CAS Number                                          357 non-null    object
 4   EC Number                                           350 non-null    object
 5   Product Type, body parts                            233 non-null    object
 6   Maximum concentration in ready for use preparation  209 non-null    object
 7   Other Restrictions                                  277 non-null    object
 9   Regulation                                          373 non-null    object
 10  Other Dire

In [8]:
anexo_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 17 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   Reference Number                                    154 non-null    object 
 1   Chemical name                                       154 non-null    object 
 2   Name of Common Ingredients Glossary                 154 non-null    object 
 3   CAS Number                                          148 non-null    object 
 4   EC Number                                           152 non-null    object 
 5   Color                                               154 non-null    object 
 6   Product Type, body parts                            0 non-null      float64
 7   Maximum concentration in ready for use preparation  59 non-null     object 
 8   Other Restrictions                                  6 non-null      object 
 10 

In [9]:
anexo_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 16 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   Reference Number                                    57 non-null     object 
 1   Chemical name                                       57 non-null     object 
 2   Name of Common Ingredients Glossary                 53 non-null     object 
 3   CAS Number                                          54 non-null     object 
 4   EC Number                                           54 non-null     object 
 5   Product Type, body parts                            16 non-null     object 
 6   Maximum concentration in ready for use preparation  54 non-null     object 
 7   Other Restrictions                                  17 non-null     object 
 9   Regulation                                          57 non-null     object 
 10  O

In [10]:
anexo_6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 16 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   Reference Number                                    34 non-null     object 
 1   Chemical name                                       34 non-null     object 
 2   Name of Common Ingredients Glossary                 33 non-null     object 
 3   CAS Number                                          33 non-null     object 
 4   EC Number                                           31 non-null     object 
 5   Product Type, body parts                            4 non-null      object 
 6   Maximum concentration in ready for use preparation  33 non-null     object 
 7   Other Restrictions                                  8 non-null      object 
 9   Regulation                                          34 non-null     object 
 10  O

## 3. Preparing the variables before the union

Before combining the annexes, we add a column indicating which annex each row belongs to.

In [11]:
# Add the annex label column
anexo_2['Anexo_cosIng'] = 'Anexo_2'
anexo_3['Anexo_cosIng'] = 'Anexo_3'
anexo_4['Anexo_cosIng'] = 'Anexo_4'
anexo_5['Anexo_cosIng'] = 'Anexo_5'
anexo_6['Anexo_cosIng'] = 'Anexo_6'

In [12]:
# Concatenate
cosing = pd.concat([anexo_2,anexo_3, anexo_4, anexo_5, anexo_6], ignore_index=True)
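
The label-then-concatenate step can also be written as a loop over a dictionary, which avoids repeating one assignment per annex. This is a minimal sketch on toy frames (the real ones come from `cargar_anexo`); it also shows why columns that exist in only one annex, such as `Color`, end up null for the others.

```python
import pandas as pd

# Toy stand-ins for the loaded annexes
frames = {
    "Anexo_2": pd.DataFrame({"Chemical name": ["A", "B"]}),
    "Anexo_3": pd.DataFrame({"Chemical name": ["C"], "Color": ["White"]}),
}

# Label each frame with its annex, then stack them with a fresh 0..n-1 index
labelled = [df.assign(Anexo_cosIng=name) for name, df in frames.items()]
combined = pd.concat(labelled, ignore_index=True)
```

`pd.concat` aligns on column names, so a row from a frame that lacks a column simply gets a missing value there.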

We inspect the structure.

In [13]:
cosing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2362 entries, 0 to 2361
Data columns (total 18 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   Reference Number                                    2362 non-null   object
 1   Chemical name                                       2362 non-null   object
 2   CAS Number                                          2336 non-null   object
 3   EC Number                                           2331 non-null   object
 4   Regulation                                          2362 non-null   object
 5   Other Directives/Regulations                        49 non-null     object
 6   SCCS opinions                                       1230 non-null   object
 7   Chemical/IUPAC Name                                 954 non-null    object
 8   Identified INGREDIENTS or substances e.g.           949 non-null    object
 9   CMR     

In [14]:
# The index has been reset correctly and runs continuously from Annex II into Annex III
# The change is also visible in the "Reference Number" variable
cosing.iloc[1743:1745]

Unnamed: 0,Reference Number,Chemical name,CAS Number,EC Number,Regulation,Other Directives/Regulations,SCCS opinions,Chemical/IUPAC Name,Identified INGREDIENTS or substances e.g.,CMR,Update date,Anexo_cosIng,Name of Common Ingredients Glossary,"Product Type, body parts",Maximum concentration in ready for use preparation,Other Restrictions,Wording of conditions of use and warnings,Color
1743,1751,"foramsulfuron (ISO); 2-{[(4,6-dimethoxypyrimid...",173159-57-4,-,(EU) 2025/877,,,,,,18/06/2025,Anexo_2,,,,,,
1744,2a,Thioglycolic acid and its salts,68-11-1,200-677-4,(EU) 2015/1190,,Thioglycolic acid and its salts (TGA),Thioglycolic acid and its salts,AMMONIUM THIOGLYCOLATE\nCALCIUM THIOGLYCOLATE\...,,27/07/2020,Anexo_3,THIOGLYCOLIC ACID,(a) Hair products\n\n(b) Depilatories\n\n(c) H...,(a) (i) 8% (ii) 11% \n\n(b) 5% \n\n(c) 2%\n...,(a) (i) General use ready for use pH 7 to 9.5 ...,Conditions of use:\n\n(a) (b) (c) (d)\nAvoid c...,


## 4. Handling duplicates

In [15]:
# Check nulls
cosing.isna().sum()

Reference Number                                         0
Chemical name                                            0
CAS Number                                              26
EC Number                                               31
Regulation                                               0
Other Directives/Regulations                          2313
SCCS opinions                                         1132
Chemical/IUPAC Name                                   1408
Identified INGREDIENTS or substances e.g.             1413
CMR                                                   1326
Update date                                              0
Anexo_cosIng                                             0
Name of Common Ingredients Glossary                   1809
Product Type, body parts                              2109
Maximum concentration in ready for use preparation    2007
Other Restrictions                                    2054
Color                                                 22
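
Many of these nulls are structural rather than missing data: Annex II, with 1744 rows, has no glossary, restriction or colour columns at all. A minimal sketch, on toy data, of how the null counts could be broken down per annex to make that visible:

```python
import pandas as pd

toy = pd.DataFrame({
    "Anexo_cosIng": ["Anexo_2", "Anexo_2", "Anexo_3", "Anexo_4"],
    "Name of Common Ingredients Glossary": [None, None, "THIOGLYCOLIC ACID", "ZINC OXIDE"],
    "Color": [None, None, None, "White"],
})

# Count nulls in every column separately for each annex
nulls_by_annex = (
    toy.drop(columns="Anexo_cosIng")
    .isna()
    .groupby(toy["Anexo_cosIng"])
    .sum()
)
```

Applied to `cosing`, a table like this would separate columns that an annex never had from values that are genuinely absent.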

Before removing duplicates, we sort by the relevant variables that could be left without information after the removal.
That way, deduplication takes into account these variables, which we will need later, and the DataFrame does not end up with unnecessary nulls.

In [16]:
# To compare duplicates we use the "Chemical name" variable, since it is an important variable with no nulls.
duplicados = cosing[cosing.duplicated(subset=['Chemical name'], keep=False)].sort_values(by= ['Chemical name', 'Identified INGREDIENTS or substances e.g.'])
duplicados.head(8)

Unnamed: 0,Reference Number,Chemical name,CAS Number,EC Number,Regulation,Other Directives/Regulations,SCCS opinions,Chemical/IUPAC Name,Identified INGREDIENTS or substances e.g.,CMR,Update date,Anexo_cosIng,Name of Common Ingredients Glossary,"Product Type, body parts",Maximum concentration in ready for use preparation,Other Restrictions,Wording of conditions of use and warnings,Color
2360,33,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",919803-06-8,485-100-6,(EU) 2022/2195,,OPINIONON2-(4-(2-(4-Diethylamino-2-hydroxy-ben...,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,27/11/2022,Anexo_6,BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,10 %\n\n(In case of combined use of Bis-(Dieth...,,,
2361,34,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",919803-06-8,485-100-6,(EU) 2022/2195,,OPINIONON2-(4-(2-(4-Diethylamino-2-hydroxy-ben...,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,27/11/2022,Anexo_6,BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,10 %\n(In case of combined use of Bis-(Diethyl...,Only nanomaterials having the following charac...,,
2301,32,"1-(4-Chlorophenoxy)-1-(imidazol-1-yl)-3,3-dime...",38083-17-9,253-775-4,(EU) 2019/698,,Opinion concerning Restrictions on Materials l...,"(R,S)-1-(4-chlorophenoxy)-1-imidazol-1-yl-3,3-...",CLIMBAZOLE,,28/07/2020,Anexo_5,CLIMBAZOLE,(a) Hair lotions\n(b) Face creams\n(c) Foot ca...,"(a) 0,2 %\n(b) 0,2 %\n(c) 0,2 %\n(d) 0,5 %",,,
2049,310,"1-(4-Chlorophenoxy)-1-(imidazol-1-yl)-3,3-dime...",38083-17-9;,253-775-4;,(EU) 2019/698,,Opinion concerning Restrictions on Materials l...,"1-(4-Chlorophenoxy)-1-(imidazol-1-yl)-3,3-dime...",,,28/07/2020,Anexo_3,CLIMBAZOLE,Rinse-off anti-dandruff shampoo,"2,0 %",For purposes other than inhibiting the develop...,,
1803,54,1-Phenoxypropan-2-ol (8),770-35-4,212-222-7,(EC) 2009/1223,,,,PHENOXYISOPROPANOL,,09/11/2010,Anexo_3,,Rinse-off products \nNot to be used in oral pr...,2%,For purposes other than inhibiting the develop...,,
2310,43,1-Phenoxypropan-2-ol (8),770-35-4,212-222-7,(EC) 2009/1223,,,,PHENOXYISOPROPANOL,,09/11/2010,Anexo_5,PHENOXYISOPROPANOL,Only for rinse-off products,1.0%,,,
1164,1171,Moved or deleted,-,-,(EC) 2009/1223,,Opinion concerning Request for Confirmation of...,,DICHLOROETHYLENES (ACETYLENE CHLORIDES) E.G. V...,,03/10/2016,Anexo_2,,,,,,
672,673,Moved or deleted,-,-,(EC) 2009/1223,,Opinion concerning Chemical Ingredients in Cos...,,ETHOXYETHANOL ACETATE,Reprotoxic Cat. 1B(),03/10/2016,Anexo_2,,,,,,


Some ingredients have been moved or deleted and carry no information in the CAS Number and EC Number variables.

In [17]:
# Remove the 'Moved or deleted' records
cosing = cosing[~cosing['Chemical name'].str.contains('Moved or deleted', na=False)]

In [18]:
duplicados = cosing[cosing.duplicated(subset=['Chemical name'], keep=False)].sort_values(by= ['Chemical name', 'Identified INGREDIENTS or substances e.g.'])
duplicados.head(20)

Unnamed: 0,Reference Number,Chemical name,CAS Number,EC Number,Regulation,Other Directives/Regulations,SCCS opinions,Chemical/IUPAC Name,Identified INGREDIENTS or substances e.g.,CMR,Update date,Anexo_cosIng,Name of Common Ingredients Glossary,"Product Type, body parts",Maximum concentration in ready for use preparation,Other Restrictions,Wording of conditions of use and warnings,Color
2360,33,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",919803-06-8,485-100-6,(EU) 2022/2195,,OPINIONON2-(4-(2-(4-Diethylamino-2-hydroxy-ben...,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,27/11/2022,Anexo_6,BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,10 %\n\n(In case of combined use of Bis-(Dieth...,,,
2361,34,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",919803-06-8,485-100-6,(EU) 2022/2195,,OPINIONON2-(4-(2-(4-Diethylamino-2-hydroxy-ben...,"1,1'-(1,4-piperazinediyl)bis[1-[2-[4-(diethyla...",BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,27/11/2022,Anexo_6,BIS-(DIETHYLAMINOHYDROXYBENZOYL BENZOYL) PIPER...,,10 %\n(In case of combined use of Bis-(Diethyl...,Only nanomaterials having the following charac...,,
2301,32,"1-(4-Chlorophenoxy)-1-(imidazol-1-yl)-3,3-dime...",38083-17-9,253-775-4,(EU) 2019/698,,Opinion concerning Restrictions on Materials l...,"(R,S)-1-(4-chlorophenoxy)-1-imidazol-1-yl-3,3-...",CLIMBAZOLE,,28/07/2020,Anexo_5,CLIMBAZOLE,(a) Hair lotions\n(b) Face creams\n(c) Foot ca...,"(a) 0,2 %\n(b) 0,2 %\n(c) 0,2 %\n(d) 0,5 %",,,
2049,310,"1-(4-Chlorophenoxy)-1-(imidazol-1-yl)-3,3-dime...",38083-17-9;,253-775-4;,(EU) 2019/698,,Opinion concerning Restrictions on Materials l...,"1-(4-Chlorophenoxy)-1-(imidazol-1-yl)-3,3-dime...",,,28/07/2020,Anexo_3,CLIMBAZOLE,Rinse-off anti-dandruff shampoo,"2,0 %",For purposes other than inhibiting the develop...,,
1803,54,1-Phenoxypropan-2-ol (8),770-35-4,212-222-7,(EC) 2009/1223,,,,PHENOXYISOPROPANOL,,09/11/2010,Anexo_3,,Rinse-off products \nNot to be used in oral pr...,2%,For purposes other than inhibiting the develop...,,
2310,43,1-Phenoxypropan-2-ol (8),770-35-4,212-222-7,(EC) 2009/1223,,,,PHENOXYISOPROPANOL,,09/11/2010,Anexo_5,PHENOXYISOPROPANOL,Only for rinse-off products,1.0%,,,
2357,30,Zinc oxide,1314-13-2,215-222-5,(EU) 2016/621,,Opinion concerning Zinc oxide\nStatement on Zi...,,ZINC OXIDE,,13/09/2016,Anexo_6,ZINC OXIDE,,- 25%\n\n- In case of combined use of zinc oxi...,,Not to be used in applications that may lead t...,
2261,144,Zinc oxide,1314-13-2,215-222-5,(EU) 2017/1413,,,Zinc oxide,ZINC OXIDE\nCI 77947,,18/10/2021,Anexo_4,ZINC OXIDE,,,,Not to be used in applications that may lead ...,White


For these duplicates the CAS Number and EC Number variables are duplicated as well (apart from stray separators such as a trailing ';'), so we deduplicate on 'Chemical name' alone.

In [19]:
# Drop duplicates by 'Chemical name', keeping the first occurrence
cosing_final = cosing.drop_duplicates(subset='Chemical name' , keep='first')
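
Note that this call runs on `cosing` in its original order, so the sorted `duplicados` view does not influence which row is kept. For instance, the `1-Phenoxypropan-2-ol (8)` pair listed above would keep the Annex III row, whose glossary name is empty. A minimal sketch of the sort-then-drop idea, assuming the glossary name is the column we most want to preserve:

```python
import pandas as pd

toy = pd.DataFrame({
    "Chemical name": ["1-Phenoxypropan-2-ol (8)", "1-Phenoxypropan-2-ol (8)", "Zinc oxide"],
    "Name of Common Ingredients Glossary": [None, "PHENOXYISOPROPANOL", "ZINC OXIDE"],
    "Anexo_cosIng": ["Anexo_3", "Anexo_5", "Anexo_6"],
})

# Put rows that have a glossary name first; the stable sort keeps the
# original order among rows that tie
ordered = toy.sort_values(
    by="Name of Common Ingredients Glossary",
    key=lambda s: s.isna(),
    kind="stable",
)

# keep="first" now retains the row with a glossary name when one exists
deduped = ordered.drop_duplicates(subset="Chemical name", keep="first").sort_index()
```

Which column should drive the ordering is a design choice; sorting on several completeness flags is possible in the same way.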

With the rows whose "Chemical name" is 'Moved or deleted' already removed above, we check the remaining nulls.

In [20]:
# Check nulls
cosing_final.isnull().sum()

Reference Number                                         0
Chemical name                                            0
CAS Number                                              20
EC Number                                               25
Regulation                                               0
Other Directives/Regulations                          2285
SCCS opinions                                         1117
Chemical/IUPAC Name                                   1383
Identified INGREDIENTS or substances e.g.             1393
CMR                                                   1305
Update date                                              0
Anexo_cosIng                                             0
Name of Common Ingredients Glossary                   1785
Product Type, body parts                              2084
Maximum concentration in ready for use preparation    1983
Other Restrictions                                    2027
Color                                                 21

## 5. Variable "Name of Common Ingredients Glossary"

We check the 'Name of Common Ingredients Glossary' variable because it is the target variable: it is where the INCI name comes from.

We want to see whether any record holds more than one name in this variable, separated by commas, slashes or semicolons.

Hyphens and commas are not reliable separators, because they clearly form part of some names. These values look more like chemical names, yet they appear in this variable, for example:

2-METHOXYMETHYL-P-PHENYLENEDIAMINE

1-Hydroxy-4-methyl-6-(2,4,4-trimethylpentyl)-2 pyridon

In [21]:
cosing_final[cosing_final['Name of Common Ingredients Glossary'].str.contains(r'[/;,]', regex= True, na=False)].count()

Reference Number                                      95
Chemical name                                         95
CAS Number                                            93
EC Number                                             93
Regulation                                            95
Other Directives/Regulations                           0
SCCS opinions                                         40
Chemical/IUPAC Name                                   49
Identified INGREDIENTS or substances e.g.             78
CMR                                                    0
Update date                                           95
Anexo_cosIng                                          95
Name of Common Ingredients Glossary                   95
Product Type, body parts                              39
Maximum concentration in ready for use preparation    49
Other Restrictions                                    56
Color                                                 18
dtype: int64
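
Because the pattern mixes three symbols, the total alone does not say which one drives it. A small sketch of a per-separator breakdown; the two multi-name values are hypothetical and only illustrate the idea.

```python
import pandas as pd

names = pd.Series([
    "2-METHOXYMETHYL-P-PHENYLENEDIAMINE",
    "1-Hydroxy-4-methyl-6-(2,4,4-trimethylpentyl)-2 pyridon",
    "NAME A / NAME B",   # hypothetical multi-name value
    "NAME C; NAME D",    # hypothetical multi-name value
    None,
])

# Count how many values contain each candidate separator (literal match, no regex)
per_separator = {
    sep: int(names.str.contains(sep, regex=False, na=False).sum())
    for sep in [",", ";", "/"]
}
```

Here the only comma hit is a single chemical-style name, which is the kind of false positive the prose above warns about.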

So up to 95 records meet these conditions; this is an upper bound, because the pattern also matches commas that belong to a single name. We will not process them until after the tables are joined: doing so now would enlarge this table needlessly, when most of these records may have no match in the other table.

This variable will be reviewed once the join is done.

## 6. Variables "CAS Number" and "EC Number"

These variables are shared with the Edlists and ECHA DataFrames and are the keys on which the tables will be joined.

In the other DataFrames we found records with several "CAS Number" and "EC Number" values, so we check whether the same happens here.

Unlike in the other DataFrames, here the CAS Numbers are separated by the symbols **','**, **';'** and **'/'**.

In [22]:
# Filter rows with more than one CAS Number
cosing_final[cosing_final['CAS Number'].str.contains(r'[,;/]', regex= True, na=False)].count()

Reference Number                                      213
Chemical name                                         213
CAS Number                                            213
EC Number                                             213
Regulation                                            213
Other Directives/Regulations                            8
SCCS opinions                                          89
Chemical/IUPAC Name                                   100
Identified INGREDIENTS or substances e.g.             162
CMR                                                    32
Update date                                           213
Anexo_cosIng                                          213
Name of Common Ingredients Glossary                   112
Product Type, body parts                               30
Maximum concentration in ready for use preparation     47
Other Restrictions                                     67
Color                                                  27
dtype: int64

There are 213 records whose CAS Number contains at least one of these separators.

The same happens with EC Number:

In [23]:
# Filter rows with more than one EC Number
cosing_final[cosing_final['EC Number'].str.contains(r'[,;/]', regex= True, na=False)].count()

Reference Number                                      200
Chemical name                                         200
CAS Number                                            197
EC Number                                             200
Regulation                                            200
Other Directives/Regulations                            8
SCCS opinions                                          78
Chemical/IUPAC Name                                    96
Identified INGREDIENTS or substances e.g.             151
CMR                                                    32
Update date                                           200
Anexo_cosIng                                          200
Name of Common Ingredients Glossary                   110
Product Type, body parts                               28
Maximum concentration in ready for use preparation     44
Other Restrictions                                     66
Color                                                  27
dtype: int64

In [24]:
# Split multiple CAS and EC values on ',', ';' or '/' into a list
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
cosing_final = cosing_final.copy()
cosing_final['CAS Number_2'] = cosing_final['CAS Number'].str.split(r'[,;/]')
cosing_final['EC Number_2'] = cosing_final['EC Number'].str.split(r'[,;/]')


In [25]:
# Check the result
cosing_final[cosing_final['CAS Number_2'].str.contains(r'[,;/]', regex= True, na=False)].count()

Reference Number                                      0
Chemical name                                         0
CAS Number                                            0
EC Number                                             0
Regulation                                            0
Other Directives/Regulations                          0
SCCS opinions                                         0
Chemical/IUPAC Name                                   0
Identified INGREDIENTS or substances e.g.             0
CMR                                                   0
Update date                                           0
Anexo_cosIng                                          0
Name of Common Ingredients Glossary                   0
Product Type, body parts                              0
Maximum concentration in ready for use preparation    0
Other Restrictions                                    0
Color                                                 0
CAS Number_2                                    
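
One caveat about this check: `CAS Number_2` now holds lists, and `.str.contains` returns a missing value for any element that is not a string, so with `na=False` the count is 0 whatever the lists contain. A minimal sketch of an element-wise check on toy values; `has_separator` is a hypothetical helper.

```python
import pandas as pd

cas = pd.Series(["68-11-1", "38083-17-9;", "111-11-1, 222-22-2", None])
parts = cas.str.split(r"[,;/]")

# On a column of lists this is False everywhere, regardless of the contents
vacuous = parts.str.contains(r"[,;/]", regex=True, na=False)

def has_separator(items) -> bool:
    """True when any string inside the list still contains a separator."""
    return isinstance(items, list) and any(
        ch in item for item in items for ch in ",;/"
    )

meaningful = parts.apply(has_separator)
lengths = parts.str.len()  # number of pieces per row (missing stays missing)
```

The toy values also show two side effects of the split: a trailing `;` leaves an empty string in the list, and a separator followed by a space leaves a leading blank in the next piece.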

We now break up each element of the lists in CAS Number_2 and EC Number_2. Doing so creates combinations between the two, because the rows are expanded until each column holds a single element.

So if CAS Number_2 has 4 elements, 4 rows are created, one per CAS value; if EC Number_2 also has 4 elements, the result is not 8 rows but 4 × 4 = 16, because every CAS value is combined with every EC value.
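
The row growth described above can be reproduced on a single toy row. The sketch also shows, for comparison, a positional pairing via multi-column `explode` (pandas 1.3 or later, lists of equal length per row); whether CAS and EC numbers in the annexes really correspond by position is an assumption that would need checking.

```python
import pandas as pd

toy = pd.DataFrame({
    "CAS Number_2": [["c1", "c2", "c3", "c4"]],
    "EC Number_2": [["e1", "e2", "e3", "e4"]],
})

# Two successive explodes: every CAS value is paired with every EC value
crossed = toy.explode("CAS Number_2").explode("EC Number_2")

# Exploding both columns in one call pairs the values by position instead
paired = toy.explode(["CAS Number_2", "EC Number_2"])
```

`crossed` has 16 rows and `paired` has 4, so the choice between the two has a large effect on the size of the final table.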

In [26]:
# Expand to one row per individual CAS
cosing_final = cosing_final.explode('CAS Number_2')
# Expand to one row per individual EC
cosing_final = cosing_final.explode('EC Number_2')

In [27]:
# Check that 'CAS Number_2' and 'EC Number_2' each hold a single number
cosing_final[['CAS Number','EC Number', 'CAS Number_2','EC Number_2' ]].iloc[15:30]

Unnamed: 0,CAS Number,EC Number,CAS Number_2,EC Number_2
12,84649-73-0,283-458-6,84649-73-0,283-458-6
13,51-43-4,200-098-7,51-43-4,200-098-7
14,90106-13-1,290-234-1,90106-13-1,290-234-1
15,-,-,-,-
16,7683-59-2,231-687-7,7683-59-2,231-687-7
17,57-06-7,200-309-2,57-06-7,200-309-2
18,5486-77-1,-,5486-77-1,-
19,62-67-9,200-546-1,62-67-9,200-546-1
20,300-62-9,206-096-2,300-62-9,206-096-2
21,62-53-3,200-539-3,62-53-3,200-539-3


We drop the original variables, which may hold more than one number per cell.

In [28]:
# Drop 'CAS Number' and 'EC Number' because they are no longer needed.
cosing_final.drop(columns='CAS Number', inplace=True)
cosing_final.drop(columns='EC Number', inplace=True)

In [29]:
# Rename the new variables back to the original names
cosing_final = cosing_final.rename(columns={'CAS Number_2': 'CAS Number', 'EC Number_2':'EC Number'})

Before exporting, we check for odd values in the "EC Number" and "CAS Number" variables.

In [30]:
cosing_final[cosing_final['CAS Number'] == '-'].count()

Reference Number                                      70
Chemical name                                         70
Regulation                                            70
Other Directives/Regulations                           1
SCCS opinions                                         24
Chemical/IUPAC Name                                   16
Identified INGREDIENTS or substances e.g.             24
CMR                                                   25
Update date                                           70
Anexo_cosIng                                          70
Name of Common Ingredients Glossary                    1
Product Type, body parts                               2
Maximum concentration in ready for use preparation     3
Other Restrictions                                     1
Color                                                  0
CAS Number                                            70
EC Number                                             70
dtype: int64

In [31]:
cosing_final[cosing_final['EC Number'] == '-'].count()

Reference Number                                      314
Chemical name                                         314
Regulation                                            314
Other Directives/Regulations                            1
SCCS opinions                                          59
Chemical/IUPAC Name                                    90
Identified INGREDIENTS or substances e.g.             215
CMR                                                    17
Update date                                           314
Anexo_cosIng                                          314
Name of Common Ingredients Glossary                   141
Product Type, body parts                               16
Maximum concentration in ready for use preparation     31
Other Restrictions                                    113
Color                                                  24
CAS Number                                            314
EC Number                                             314
dtype: int64

In [32]:
cosing_final['CAS Number'] = cosing_final['CAS Number'].replace(['-', '', ' '], np.nan)
cosing_final['EC Number'] = cosing_final['EC Number'].replace(['-', '', ' '], np.nan)
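Beyond the `-` placeholder, a format check can flag other odd values. The following is a sketch under assumptions, not part of the original notebook: the regular expressions encode the usual layout of CAS numbers (2 to 7 digits, 2 digits, 1 check digit) and EC numbers (3 digits, 3 digits, 1 digit), and the frame `toy` holds illustrative values only.

```python
import numpy as np
import pandas as pd

# Assumed identifier layouts (not taken from the notebook)
CAS_PATTERN = r"\d{2,7}-\d{2}-\d"
EC_PATTERN = r"\d{3}-\d{3}-\d"

toy = pd.DataFrame({
    "CAS Number": ["51-43-4", "-", " ", "5486-77-1"],
    "EC Number": ["200-098-7", "-", "", "bad value"],
})

# Same placeholder normalization as in the notebook cell
for col in ["CAS Number", "EC Number"]:
    toy[col] = toy[col].replace(["-", "", " "], np.nan)

# Non-null values that do not match the expected layout
bad_cas = toy["CAS Number"].dropna().loc[lambda s: ~s.str.fullmatch(CAS_PATTERN)]
bad_ec = toy["EC Number"].dropna().loc[lambda s: ~s.str.fullmatch(EC_PATTERN)]
print(bad_cas.tolist(), bad_ec.tolist())
```

The pattern only checks the shape of the identifier; it does not validate the CAS check digit.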

In [33]:
cosing_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42780 entries, 0 to 2360
Data columns (total 18 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   Reference Number                                    42780 non-null  object
 1   Chemical name                                       42780 non-null  object
 2   Regulation                                          42780 non-null  object
 3   Other Directives/Regulations                        100 non-null    object
 4   SCCS opinions                                       2057 non-null   object
 5   Chemical/IUPAC Name                                 39901 non-null  object
 6   Identified INGREDIENTS or substances e.g.           2973 non-null   object
 7   CMR                                                 39277 non-null  object
 8   Update date                                         42780 non-null  object
 9   Anexo_cosIng

Since all we want from this table is to locate the INCI name, we keep only the minimum set of variables that help identify the ingredient, which also avoids carrying columns with many null values.

In [34]:
cosing_final= cosing_final[['Chemical name', 'Chemical/IUPAC Name', 'Identified INGREDIENTS or substances e.g.', 'CAS Number', 'EC Number','Name of Common Ingredients Glossary', 'Product Type, body parts', 'Anexo_cosIng' ]]

In [35]:
# Count the records where both "CAS Number" and "EC Number" are null.

cosing_final[(cosing_final['CAS Number'].isna() & cosing_final['EC Number'].isna())].count()

Chemical name                                68
Chemical/IUPAC Name                          22
Identified INGREDIENTS or substances e.g.    35
CAS Number                                    0
EC Number                                     0
Name of Common Ingredients Glossary           8
Product Type, body parts                     16
Anexo_cosIng                                 68
dtype: int64

In [36]:
# Keep the records where at least one of "CAS Number" or "EC Number" is not null.
cosing_clean = cosing_final[~(cosing_final['CAS Number'].isna() & cosing_final['EC Number'].isna())]
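The filter above removes a row only when both identifiers are missing. As an illustrative sketch on made-up data (the frame `toy` is hypothetical), the same selection can also be written with `dropna(subset=..., how="all")`, which may read more directly:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Chemical name": ["A", "B", "C", "D"],
    "CAS Number": ["51-43-4", np.nan, "62-53-3", np.nan],
    "EC Number": ["200-098-7", "200-539-3", np.nan, np.nan],
})

# Notebook style: negate the "both missing" mask
both_missing = toy["CAS Number"].isna() & toy["EC Number"].isna()
kept_mask = toy[~both_missing]

# Alternative: drop a row only if all of the listed columns are null
kept_dropna = toy.dropna(subset=["CAS Number", "EC Number"], how="all")

print(kept_mask["Chemical name"].tolist())
```

Rows B and C survive because each still has one identifier; only D, with neither, is dropped.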

In [37]:
cosing_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42712 entries, 0 to 2360
Data columns (total 8 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   Chemical name                              42712 non-null  object
 1   Chemical/IUPAC Name                        39879 non-null  object
 2   Identified INGREDIENTS or substances e.g.  2938 non-null   object
 3   CAS Number                                 42375 non-null  object
 4   EC Number                                  42118 non-null  object
 5   Name of Common Ingredients Glossary        2291 non-null   object
 6   Product Type, body parts                   471 non-null    object
 7   Anexo_cosIng                               42712 non-null  object
dtypes: object(8)
memory usage: 2.9+ MB


In [38]:
cosing_clean['Chemical name'].nunique()

2277

In [39]:
cosing_clean['CAS Number'].nunique()

2908

In [40]:
cosing_clean['EC Number'].nunique()

2691

In [41]:
# Look up duplicated records
duplicados = cosing_clean[
    cosing_clean.duplicated(
        subset=['Chemical name','EC Number','CAS Number', 'Identified INGREDIENTS or substances e.g.'], keep='first')].sort_values(
            by= ['Chemical name', 'Identified INGREDIENTS or substances e.g.','CAS Number','EC Number','Chemical/IUPAC Name'])
duplicados.head(10)

Unnamed: 0,Chemical name,Chemical/IUPAC Name,Identified INGREDIENTS or substances e.g.,CAS Number,EC Number,Name of Common Ingredients Glossary,"Product Type, body parts",Anexo_cosIng
2089,Citrus aurantium amara and dulcis peel oil,,CITRUS AURANTIUM AMARA PEEL OIL\nCITRUS AURANT...,72968-50-4,,Citrus Aurantium Amara Peel Oil; Citrus Aurant...,,Anexo_3
2089,Citrus aurantium amara and dulcis peel oil,,CITRUS AURANTIUM AMARA PEEL OIL\nCITRUS AURANT...,8008-57-9,,Citrus Aurantium Amara Peel Oil; Citrus Aurant...,,Anexo_3
2089,Citrus aurantium amara and dulcis peel oil,,CITRUS AURANTIUM AMARA PEEL OIL\nCITRUS AURANT...,8028-48-6,,Citrus Aurantium Amara Peel Oil; Citrus Aurant...,,Anexo_3
2089,Citrus aurantium amara and dulcis peel oil,,CITRUS AURANTIUM AMARA PEEL OIL\nCITRUS AURANT...,97766-30-8,,Citrus Aurantium Amara Peel Oil; Citrus Aurant...,,Anexo_3
2089,Citrus aurantium amara and dulcis peel oil,,CITRUS AURANTIUM AMARA PEEL OIL\nCITRUS AURANT...,68916-04-1,,Citrus Aurantium Amara Peel Oil; Citrus Aurant...,,Anexo_3
2092,Cymbopogon citratus / schoenanthus/ flexuosus ...,,CYMBOPOGON CITRATUS LEAF OIL\nCYMBOPOGON FLEXU...,8007-02-1,295-161-9,Cymbopogon Schoenanthus Oil; Cymbopogon Flexuo...,,Anexo_3
2092,Cymbopogon citratus / schoenanthus/ flexuosus ...,,CYMBOPOGON CITRATUS LEAF OIL\nCYMBOPOGON FLEXU...,8007-02-1,295-161-9,Cymbopogon Schoenanthus Oil; Cymbopogon Flexuo...,,Anexo_3
2092,Cymbopogon citratus / schoenanthus/ flexuosus ...,,CYMBOPOGON CITRATUS LEAF OIL\nCYMBOPOGON FLEXU...,89998-16-3,295-161-9,Cymbopogon Schoenanthus Oil; Cymbopogon Flexuo...,,Anexo_3
2092,Cymbopogon citratus / schoenanthus/ flexuosus ...,,CYMBOPOGON CITRATUS LEAF OIL\nCYMBOPOGON FLEXU...,89998-16-3,295-161-9,Cymbopogon Schoenanthus Oil; Cymbopogon Flexuo...,,Anexo_3
2092,Cymbopogon citratus / schoenanthus/ flexuosus ...,,CYMBOPOGON CITRATUS LEAF OIL\nCYMBOPOGON FLEXU...,91844-92-7,289-754-1,Cymbopogon Schoenanthus Oil; Cymbopogon Flexuo...,,Anexo_3


In [42]:
duplicados.count()

Chemical name                                172
Chemical/IUPAC Name                            5
Identified INGREDIENTS or substances e.g.    170
CAS Number                                   172
EC Number                                    113
Name of Common Ingredients Glossary          170
Product Type, body parts                       0
Anexo_cosIng                                 172
dtype: int64

In [43]:
# Drop the duplicates using the same variables as the query above
cosing_clean = cosing_clean.drop_duplicates(subset=['Chemical name','EC Number','CAS Number', 'Identified INGREDIENTS or substances e.g.'] , keep='first').sort_values(
            by= ['Chemical name', 'Identified INGREDIENTS or substances e.g.','CAS Number','EC Number','Name of Common Ingredients Glossary','Chemical/IUPAC Name', 'Product Type, body parts'])
cosing_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42540 entries, 1935 to 999
Data columns (total 8 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   Chemical name                              42540 non-null  object
 1   Chemical/IUPAC Name                        39874 non-null  object
 2   Identified INGREDIENTS or substances e.g.  2768 non-null   object
 3   CAS Number                                 42203 non-null  object
 4   EC Number                                  42005 non-null  object
 5   Name of Common Ingredients Glossary        2121 non-null   object
 6   Product Type, body parts                   471 non-null    object
 7   Anexo_cosIng                               42540 non-null  object
dtypes: object(8)
memory usage: 2.9+ MB
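The counts are consistent: 42712 records minus the 172 flagged duplicates leaves the 42540 shown above. A small sketch on hypothetical data illustrates why `duplicated` and `drop_duplicates` agree when given the same `subset` and `keep`, and that pandas treats missing values in the key columns as equal to each other when comparing rows:

```python
import numpy as np
import pandas as pd

key = ["Chemical name", "CAS Number", "EC Number"]

# Illustrative rows: the first two share the same key, including a missing EC number
toy = pd.DataFrame({
    "Chemical name": ["Oil X", "Oil X", "Oil X", "Oil Y"],
    "CAS Number": ["8007-02-1", "8007-02-1", "89998-16-3", "62-53-3"],
    "EC Number": [np.nan, np.nan, "295-161-9", "200-539-3"],
})

dup_mask = toy.duplicated(subset=key, keep="first")      # True for repeats after the first
deduped = toy.drop_duplicates(subset=key, keep="first")  # keeps the first occurrence

n_dups = int(dup_mask.sum())
print(n_dups, len(toy), len(deduped))
```

Checking `len(toy) - n_dups == len(deduped)` is a cheap sanity test to run after any deduplication step.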


## 8. Export the clean database
Later we will join this table with the ECHA and CosIng tables.

In [44]:
# Reset the index
cosing_clean.reset_index(drop=True, inplace=True)

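The index no longer matched the record positions (its labels repeat and are out of order after the explode, filter and sort steps), which is why it is reset.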
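A minimal sketch of the effect, using a made-up two-row frame: `explode` copies the original index label onto every row it creates, so labels repeat until the index is reset.

```python
import pandas as pd

# Hypothetical frame: the first record holds two values, the second holds one
toy = pd.DataFrame({"name": ["A", "B"], "cas": [["1", "2"], ["3"]]})

exploded = toy.explode("cas")
labels_before = exploded.index.tolist()  # original labels, repeated per exploded row

clean = exploded.reset_index(drop=True)
labels_after = clean.index.tolist()      # fresh consecutive labels

print(labels_before, labels_after)
```

Using `drop=True` discards the old labels instead of keeping them as an extra column.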

In [45]:
# Parquet
cosing_clean.to_parquet("../../data/processed/notebooks/cosing_clean.parquet", index=False)

# Excel
# cosing_clean.to_excel("../../data/processed/notebooks/cosing_clean.xlsx", index=False)