Homework: lesson 4


In this dataset: https://raw.githubusercontent.com/fspot/INFMDI-721/master/lesson5/products.csv, each row corresponds to a food product put up for sale by a user.

Goal: clean the dataset.

We would like a column of prices unified in euros. Problem: the currency is not given for every product, so the missing currencies have to be "guessed" from the user's IP address.
The "infos" column lists the ingredients present in the product. We would prefer one bool column per ingredient, indicating whether or not the product contains that ingredient.
Here is a list of APIs that may be useful: https://github.com/public-apis/public-apis (but you may use others if you wish).

In [1]:
# imports
import pandas as pd
import requests

In [2]:
# load data file as pandas dataframe
products = pd.read_csv('https://raw.githubusercontent.com/fspot/INFMDI-721/master/lesson5/products.csv', sep=';')

# basic descriptive elements
products.head()

Unnamed: 0,username,ip_address,product,price,infos
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish"
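
The preview above contains ip_address values that are not valid addresses (666.666.666.666, nope). A minimal sketch, assuming we want to filter such values out before sending any request to a geolocation API, using the standard-library ipaddress module. The helper name is_valid_ip is hypothetical and is not part of the original notebook.

```python
import ipaddress


def is_valid_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(str(value))
        return True
    except ValueError:
        # malformed strings, out-of-range octets, missing values rendered as 'nan'
        return False


# values taken from the preview above
print([is_valid_ip(v) for v in ["666.666.666.666", "nope", "26.191.237.49"]])
# [False, False, True]
```

A syntactically valid address can still be reserved or unroutable, so the lookup itself may still fail and needs its own error handling.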


In [3]:
# currency conversion:
# I use the API https://ipgeolocationapi.com/ to obtain the currency code from the IP address
# I use the API https://api.exchangeratesapi.io/ to obtain the conversion rate to EUR from the currency code


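# function to obtain currency from IP
# example: get_currency("219.118.201.119")
def get_currency(ip):
    try:
        # str() guards against missing values, which pandas reads as float NaN
        response = requests.get("https://api.ipgeolocationapi.com/geolocate/" + str(ip), timeout=10)
        return response.json()["currency_code"]
    except (requests.RequestException, ValueError, KeyError, TypeError):
        # network failure, non-JSON reply, or reply without a currency code
        return 'None'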
    
    
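# function to obtain the conversion rate to EUR
# example: get_euro_rate("USD")
def get_euro_rate(currency_code):
    # 'None' is the sentinel returned by get_currency when the lookup failed
    if currency_code == 'None':
        return float('NaN')
    if currency_code == "EUR":
        return 1.0
    try:
        response = requests.get("https://api.exchangeratesapi.io/latest",
                                params={"base": currency_code}, timeout=10)
        return response.json()["rates"]["EUR"]
    except (requests.RequestException, ValueError, KeyError, TypeError):
        # network failure, non-JSON reply, or currency not supported by the API
        return float('NaN')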

In [4]:
# obtain all currency codes
currency_code = [get_currency(ip) for ip in products['ip_address']]
# obtain all conversion rates
conversion_rate = [get_euro_rate(code) for code in currency_code]                                 
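
The second comprehension above sends one HTTP request per row, even though many rows share the same currency code. A minimal sketch of one way to avoid the repeated calls, assuming the rate for a given code does not change during a run. The names rates_for and fake_rate are hypothetical: fake_rate is an offline stand-in for get_euro_rate, and its values are illustrative only.

```python
def rates_for(codes, fetch_rate):
    """Look up each distinct code once, then expand back to one rate per row."""
    rate_by_code = {code: fetch_rate(code) for code in set(codes)}
    return [rate_by_code[code] for code in codes]


calls = []


def fake_rate(code):
    # offline stand-in for get_euro_rate; records each lookup it receives
    calls.append(code)
    return {"USD": 0.9, "MXN": 0.05}.get(code, float("nan"))


rates = rates_for(["USD", "MXN", "MXN", "USD"], fake_rate)
print(rates)       # [0.9, 0.05, 0.05, 0.9]
print(len(calls))  # 2
```

In the notebook, rates_for(currency_code, get_euro_rate) would play the role of the second comprehension.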

In [5]:
# creation of column of Euro prices
products['currency_code'] = currency_code
products['conversion_rate'] = conversion_rate

# convert price to float first, so that unparseable prices become NaN
products['price'] = pd.to_numeric(products['price'], errors='coerce')

# drop rows with any NaN (missing rate, unparseable price, ...)
products = products.dropna().copy()

# create a new column with price in euros
products['euro_price'] = products['price'] * products['conversion_rate']

# display results
products.head()

Unnamed: 0,username,ip_address,product,price,infos,currency_code,conversion_rate,euro_price
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,0.895015,313.389421
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,0.008245,7.830736
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN,0.046729,5.213084
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN,0.046729,4.114486
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD,0.895015,770.831469


In [6]:
# extract column of ingredients
full_list = products['infos'].tolist()
# clean, separate the words
word_list = [element.lower().replace(',','').replace(':','').split() for element in full_list]
# explode to create a single list
word_list = [item for sublist in word_list for item in sublist]
# remove duplicates
word_list = set(word_list)
# remove irrelevant words
irr_words=['may', 'contain', 'contains', 'ingredient', 'ingredients', 'and']
ingredient_list = [word for word in word_list if word not in irr_words ]

# display results
print(ingredient_list)

['soja', 'gluten', 'sugar', 'peanut', 'mustard', 'fish', 'milk', 'egg']


In [7]:
# loop over ingredients
for ingredient in ingredient_list:
    # create list of booleans; ingredient names are lowercase, so lowercase the entry too
    products[ingredient] = [ingredient in entry.lower() for entry in full_list]

# display results
products.head()

Unnamed: 0,username,ip_address,product,price,infos,currency_code,conversion_rate,euro_price,soja,gluten,sugar,peanut,mustard,fish,milk,egg
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,0.895015,313.389421,False,True,False,False,False,False,False,False
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,0.008245,7.830736,False,False,True,False,False,True,False,True
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN,0.046729,5.213084,False,False,True,False,False,False,False,False
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN,0.046729,4.114486,False,False,True,False,False,True,True,False
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD,0.895015,770.831469,False,False,True,True,False,True,True,False
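
The membership test in the loop above is a substring match, so a short ingredient name would also match inside a longer word (for example egg inside eggplant, a hypothetical case that has not been checked against this dataset). A minimal sketch of a whole-word, case-insensitive alternative using the standard-library re module. The helper name contains_ingredient is hypothetical.

```python
import re


def contains_ingredient(text, ingredient):
    """Return True if ingredient appears in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(ingredient) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_ingredient("May contain sugar, egg and fish", "egg"))  # True
print(contains_ingredient("Contains eggplant", "egg"))                # False
```

Inside the loop, the list of booleans could then be built as [contains_ingredient(entry, ingredient) for entry in full_list].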
