Using the food_100 file:

1. Clean the file by dropping the Unnamed columns.
2. What percentage of NaN values does each column contain?
3. Would it make any sense to classify the food names using the 5 numeric columns with the fewest NaN values?

In [13]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

pd.set_option('display.max_rows', 500)


In [2]:
df = pd.read_csv("../data/food_100.csv")
target = df['product_name']
df.head()

Unnamed: 0.1,Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,...,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,...,,,,,,,,,,
1,1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,...,,,,,,,14.0,14.0,,
2,2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,...,,,,,,,0.0,0.0,,
3,3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,...,,,,,,,12.0,12.0,,
4,4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,...,,,,,,,,,,


In [3]:
df = df.iloc[:, ~df.columns.str.match(r'^Unnamed')]  # Drop any column whose name starts with 'Unnamed'
df.head()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,...,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,...,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,...,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,...,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,...,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,...,,,,,,,,,,
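The same cleanup can be written with `str.startswith` and `DataFrame.loc`, which avoids the regular expression. This is a minimal sketch on a toy frame; the `toy` data is made up for illustration and is not the real food_100 content.

```python
import pandas as pd

# Toy frame standing in for food_100.csv (hypothetical values).
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "code": [3087, 4530],
    "product_name": ["Farine de blé noir", "Peanuts"],
})

# Boolean mask over the column labels: True marks a column to keep.
keep = ~toy.columns.str.startswith("Unnamed")
cleaned = toy.loc[:, keep]

print(list(cleaned.columns))
```

Columns like these usually appear when a frame was written to CSV together with its index; writing with `to_csv(..., index=False)` avoids creating them in the first place.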


In [4]:
df.dtypes.unique()
# Check which dtypes are present so that we can keep only the numeric ones

array([dtype('int64'), dtype('O'), dtype('float64')], dtype=object)

In [5]:
df = df.select_dtypes(include=['int64', 'float64'])
df['target'] = target
df.head()

Unnamed: 0,code,created_t,last_modified_t,generic_name,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,...,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,target
0,3087,1474103866,1474103893,,,,,,,,...,,,,,,,,,,Farine de blé noir
1,4530,1489069957,1489069957,,,,,,,,...,,,,,,14.0,14.0,,,Banana Chips Sweetened (Whole)
2,4559,1489069957,1489069957,,,,,,,,...,,,,,,0.0,0.0,,,Peanuts
3,16087,1489055731,1489055731,,,,,,,,...,,,,,,12.0,12.0,,,Organic Salted Nut Mix
4,16094,1489055653,1489055653,,,,,,,,...,,,,,,,,,,Organic Polenta
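The table above shows that columns such as `generic_name` pass the numeric filter: a column with no values at all is read as float64 filled with NaN. As a minimal sketch on made-up data, numeric columns can be selected with `include="number"` and the entirely empty ones dropped with `dropna(axis=1, how="all")`. The `toy` frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one completely empty column.
toy = pd.DataFrame({
    "code": [3087, 4530, 4559],
    "generic_name": [np.nan, np.nan, np.nan],          # empty column, dtype float64
    "nutrition-score-fr_100g": [np.nan, 14.0, 0.0],
    "product_name": ["Flour", "Banana chips", "Peanuts"],
})

numeric = toy.select_dtypes(include="number")   # keeps int and float columns
numeric = numeric.dropna(axis=1, how="all")     # drops columns that are all NaN

print(list(numeric.columns))
```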


In [6]:
df.isnull().mean() * 100
# Percentage of NaN values per numeric column

code                                         0.0
created_t                                    0.0
last_modified_t                              0.0
generic_name                               100.0
origins                                    100.0
origins_tags                               100.0
manufacturing_places                       100.0
manufacturing_places_tags                  100.0
labels                                     100.0
labels_tags                                100.0
labels_en                                  100.0
emb_codes                                  100.0
emb_codes_tags                             100.0
first_packaging_code_geo                   100.0
cities                                     100.0
cities_tags                                100.0
purchase_places                            100.0
stores                                     100.0
allergens                                  100.0
allergens_en                               100.0
no_nutriments       
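The cell above reports only the numeric columns, because the non-numeric ones were dropped in the previous step. The same expression works on a full frame, text columns included. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mixing numeric and text columns.
toy = pd.DataFrame({
    "code": [1, 2, 3, 4],
    "generic_name": [None, "corn flour", None, None],
    "labels": [np.nan, np.nan, np.nan, np.nan],
    "nutrition-score-fr_100g": [np.nan, 14.0, 0.0, 12.0],
})

# isna() yields a boolean frame; the mean of each boolean column is its NaN fraction.
nan_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

print(nan_pct)
```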

In [7]:
df.isnull().mean().sort_values().iloc[:6]
# Take the 6 columns with the fewest NaN values: 5 numeric features plus target (values are fractions, not percentages)

code                                       0.00
created_t                                  0.00
last_modified_t                            0.00
target                                     0.01
ingredients_from_palm_oil_n                0.05
ingredients_that_may_be_from_palm_oil_n    0.05
dtype: float64
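Taking the first six entries works here only because `target` happens to rank among them. A more explicit variant ranks the numeric columns on their own with `Series.nsmallest` and appends the target afterwards. This is a minimal sketch on made-up data; `toy` and `top_k` are names introduced for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "code": [1, 2, 3, 4],
    "created_t": [10, 11, 12, 13],
    "cocoa_100g": [np.nan, np.nan, np.nan, 5.0],
    "nutrition-score-fr_100g": [np.nan, 14.0, 0.0, 12.0],
    "target": ["a", None, None, "d"],
})

top_k = 3  # the notebook uses 5

# NaN fraction of the numeric feature columns only (target is excluded).
nan_frac = toy.drop(columns="target").select_dtypes(include="number").isna().mean()
features = nan_frac.nsmallest(top_k).index.tolist()

# Keep the chosen features plus the target, then drop incomplete rows.
subset = toy[features + ["target"]].dropna()

print(features, subset.shape)
```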

In [8]:
df = df[df.isnull().mean().sort_values().iloc[:6].index]
df = df.dropna(axis=0, how='any')
# Keep those 6 columns and drop every row that contains a NaN; the dataset is now ready

In [9]:
df.head()

Unnamed: 0,code,created_t,last_modified_t,target,ingredients_from_palm_oil_n,ingredients_that_may_be_from_palm_oil_n
1,4530,1489069957,1489069957,Banana Chips Sweetened (Whole),0.0,0.0
2,4559,1489069957,1489069957,Peanuts,0.0,0.0
3,16087,1489055731,1489055731,Organic Salted Nut Mix,0.0,0.0
4,16094,1489055653,1489055653,Organic Polenta,0.0,0.0
5,16100,1489055651,1489055651,Breadshop Honey Gone Nuts Granola,0.0,0.0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 1 to 99
Data columns (total 6 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   code                                     95 non-null     int64  
 1   created_t                                95 non-null     int64  
 2   last_modified_t                          95 non-null     int64  
 3   target                                   95 non-null     object 
 4   ingredients_from_palm_oil_n              95 non-null     float64
 5   ingredients_that_may_be_from_palm_oil_n  95 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 5.2+ KB


Let's see whether it makes any sense to classify the food names from this dataset.

In [11]:
seed = 42
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=seed)

In [12]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(f'The score is {clf.score(X_test, y_test)}')

The score is 0.0


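The model did not get a single prediction right. That is expected: columns such as code, created_t and last_modified_t are identifiers and timestamps that say nothing about what the product is, and the two palm-oil counts are 0.0 in every row shown above. More fundamentally, product_name is free text that is most likely different for almost every row, so with only 95 samples the test set consists of names the model never saw during training. The answer to question 3 is therefore no: classifying the food name from these columns does not make sense.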
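A quick check worth running before fitting a classifier on a text label is to compare the number of distinct labels with the number of rows, and to measure how many test labels also occur in the training split. A minimal sketch with made-up product names (the `names` series is illustrative and not taken from food_100):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: every row has its own name, as is typical for product names.
names = pd.Series([f"product {i}" for i in range(20)], name="target")

y_train, y_test = train_test_split(names, test_size=0.2, random_state=42)

# Share of distinct labels: 1.0 means one class per row.
distinct_ratio = names.nunique() / len(names)

# Share of test labels that the training split also contains.
seen_in_train = y_test.isin(y_train).mean()

print(distinct_ratio, seen_in_train)
```

When `seen_in_train` is 0, a classifier that can only predict labels seen during training, as `LogisticRegression` does, cannot score above 0 on the test split, whatever features it is given.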