# Are we consuming more local?

## Research questions

1. Where are the products we consume in our everyday life coming from?

    - Which countries produce the primary resources (ground ingredients) consumed in Switzerland?
    - Which countries manufacture most of the products consumed in Switzerland?


2. Is there a trend over time to consume more local products?

    - Are new products mostly using primary resources from Switzerland? Or from other countries inside Europe?
    - Are new products mostly manufactured in Switzerland? Or from other countries inside Europe?
    - Is there a trend over time to local products to promote their origin?

## Datasets

Open Food Facts (https://world.openfoodfacts.org/data)

Additional datasets “Evolution de la consommation de denrées alimentaires en Suisse” (https://opendata.swiss/fr/dataset/entwicklung-des-nahrungsmittelverbrauches-in-der-schweiz-je-kopf-und-jahr1) and “Dépenses fédérales pour l’agriculture et l’alimentation” (https://opendata.swiss/fr/dataset/bundesausgaben-fur-die-landwirtschaft-und-die-ernahrung1) from https://opendata.swiss/fr/group/agriculture

A last additional dataset for the second question of the project
https://www.gate.ezv.admin.ch/swissimpex/public/bereiche/waren/result.xhtml
Total of imports of agriculture, forestry and fishing goods


In [1]:
#imports
import re
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib.pyplot as plt
from datetime import timedelta

In [2]:
import findspark
findspark.init()
import pyspark

from functools import reduce
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
from pyspark.sql.functions import min
from pyspark.sql.functions import to_date, last_day,date_add
from datetime import timedelta

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [36]:
DATA_FOLDER = 'data'
openfood_file = "/en.openfoodfacts.org.products.csv"
cities_file = "/worldcitiespop.csv"
countries_file = "/GEODATASOURCE-COUNTRY.csv"

# Loading data

In [4]:
dataset_main = spark.read.csv(DATA_FOLDER+ openfood_file, header=True, mode="DROPMALFORMED", sep = '\t')

dataset_main.createOrReplaceTempView("data_main")

# Filter required columns
p_id_col = " code, "
general_cols = " brands, brands_tags, categories, categories_tags, origins, origins_tags, manufacturing_places, manufacturing_places_tags,labels,labels_tags,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags "
geo_cols = " origins, manufacturing_places, countries "
geo_tags_cols = " origins_tags, manufacturing_places_tags, countries_tags "

off_df = spark.sql("SELECT" + p_id_col + geo_cols + "," + geo_tags_cols + " FROM data_main")
off_df.printSchema()

root
 |-- code: string (nullable = true)
 |-- origins: string (nullable = true)
 |-- manufacturing_places: string (nullable = true)
 |-- countries: string (nullable = true)
 |-- origins_tags: string (nullable = true)
 |-- manufacturing_places_tags: string (nullable = true)
 |-- countries_tags: string (nullable = true)



In [5]:
off_all_size = off_df.count()
off_cols_size = len(off_df.columns)
print("All data Size:\n" + str(off_cols_size) + "(columns) * " + str(off_all_size) + "(rows)")

All data Size:
7(columns) * 693829(rows)


### Data Cleaning and Preprocessing

In [6]:
# Find number of missing data

off_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in off_df.columns]).show()

+----+-------+--------------------+---------+------------+-------------------------+--------------+
|code|origins|manufacturing_places|countries|origins_tags|manufacturing_places_tags|countries_tags|
+----+-------+--------------------+---------+------------+-------------------------+--------------+
|   0| 651635|              626848|      561|      651689|                   626868|           561|
+----+-------+--------------------+---------+------------+-------------------------+--------------+



In [7]:
off_df.createOrReplaceTempView("off_df")

sql_filter = "SELECT * FROM off_df WHERE countries is not NULL \
              AND countries_tags is not NULL \
             AND origins is not NULL AND origins_tags is not NULL AND\
             manufacturing_places is not NULL AND manufacturing_places_tags is not NULL "

off_p_df = spark.sql(sql_filter)
off_p_all_size = off_p_df.count()
off_p_cols_size = len(off_p_df.columns)
print("Full GEO information data Size:\n" + str(off_p_cols_size) + "(columns) * " + str(off_p_all_size) + "(rows)")

Full GEO information data Size:
7(columns) * 26953(rows)


In [None]:
off_p_df.show(30)

Since columns with _tag label have more consistent data, we will use these columns from now.

In [8]:
off_p_df = off_p_df.drop('countries').drop('manufacturing_places').drop('origins')

In [None]:
off_p_df.show(10)

In [9]:
# Explode data

countries_df = off_p_df.withColumn('origins_tags', F.explode_outer(F.split('origins_tags', ',')))\
.withColumn('manufacturing_places_tags', F.explode_outer(F.split('manufacturing_places_tags', ',')))\
.withColumn('countries_tags', F.explode_outer(F.split('countries_tags', ',')))

In [10]:
# Remove "en:" occurances before name of each country in coutries_tags
countries_df = countries_df.withColumn('countries_tags', F.regexp_replace('countries_tags', "en:", ""))

In [11]:
countries_df.show(10)

+-------------+------------+-------------------------+--------------+
|         code|origins_tags|manufacturing_places_tags|countries_tags|
+-------------+------------+-------------------------+--------------+
|0000000274722|      france|                   france|        france|
|0000000290616|      quebec|          brossard-quebec|        canada|
|0000000394710|      quebec|          brossard-quebec|        canada|
|0000001071894|      france|           united-kingdom|united-kingdom|
|0000001938067|      quebec|          brossard-quebec|        canada|
|0000004302544|      quebec|                 brossard|        canada|
|0000004302544|      quebec|                   quebec|        canada|
|0000008237798|      quebec|                 brossard|        canada|
|0000008237798|      quebec|                   quebec|        canada|
|0000008240095|      quebec|          brossard-quebec|        canada|
+-------------+------------+-------------------------+--------------+
only showing top 10 

In [None]:

# Add each row of three columns as new entry in this database as index
# Remove repetative words
# Map each row to two columns: "cleaned country name" and "is european"




In [25]:
countries_mapping = countries_df.toPandas()
countries_mapping.head()

In [29]:
# create a new database mapping each country to some labels

countries_mapping['all_countries'] = countries_mapping.origins_tags + "," + countries_mapping.manufacturing_places_tags +"," + countries_mapping.manufacturing_places_tags

In [160]:
countries = pd.concat([pd.Series(row['all_countries'].split(','))              
                    for _, row in countries_mapping.iterrows()]).reset_index(drop=True)

In [161]:
countries = countries.drop_duplicates().reset_index(drop=True)
countries = countries.str.replace("-", " ") 

# Remove numbers from name of countries
countries = countries.str.replace('\d+', '')

print(len(countries))

8469


In [162]:
countries.head()

0             france
1             quebec
2    brossard quebec
3     united kingdom
4           brossard
dtype: object

## Using country name

An external database was used for country names. This database maps each country with country code
https://www.geodatasource.com

In [163]:
dataset_countries = pd.read_csv(DATA_FOLDER+ countries_file,sep='\t', error_bad_lines=False)
dataset_countries.head()

Unnamed: 0,CC_FIPS,CC_ISO,TLD,COUNTRY_NAME
0,AA,AW,.aw,Aruba
1,AC,AG,.ag,Antigua and Barbuda
2,AE,AE,.ae,United Arab Emirates
3,AF,AF,.af,Afghanistan
4,AG,DZ,.dz,Algeria


In [164]:
def map_country(data, country_code):
    global map_countries
    map_countries = map_countries.append({'input': data, 'country_code': country_code}, ignore_index=True)

def find_country(data):
    output = dataset_countries[dataset_countries.COUNTRY_NAME.str.contains(data, case=False)]
    if len(output):
        return output.iloc[0].CC_FIPS
    return 0

def assign_country_code(row):
    output = find_country(row)
    if output:
        map_country(row, output)
        return True
    return False

In [165]:
map_countries = pd.DataFrame(columns=['input', 'country_code'])

for i in range(len(countries)):
    if assign_country_code(countries[i]):
        countries = countries.drop([i])
        i -=1

195 name of countries detected in uncleaned dataset.

## Using City names

An external database was used for city names. This database maps each city with country code

https://www.maxmind.com/en/free-world-cities-database

In [168]:
# Find name of cities and replace with country code (Here some bias may happen, some cities have similar name)

dataset_cities = pd.read_csv(DATA_FOLDER+ cities_file,sep=',', error_bad_lines=False, encoding = "utf-8")
dataset_cities = dataset_cities[['Country', 'City']]
dataset_cities.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,Country,City
0,ad,aixas
1,ad,aixirivali
2,ad,aixirivall
3,ad,aixirvall
4,ad,aixovall


In [169]:
def find_city(data):
    output = dataset_cities[dataset_cities.City.str.contains(data, case=False, na=False)]
    if len(output):
        return output.iloc[0].Country
    return 0

def assign_country_code_using_city(row):
    output = find_city(row)
    if output:
        map_country(row, output)
        return True
    return False

In [212]:
seen = 0

In [218]:
# Try to map more locations using city names
countries = countries.reset_index(drop=True)
for i in range(seen,2000):
    if assign_country_code_using_city(countries[i]):
        countries = countries.drop([i])
        i -=1
        seen = i
print(seen)   

1900


2319 locations were matched with city names

In [None]:
# evaluate the country detection algorithm by manual checking of a sample of 100 entries

## Using country name (contain)

In [228]:
for index, row in dataset_countries.iterrows():
    output = countries[countries.str.contains(row.COUNTRY_NAME, case=False, na=False)]
    for i in range(len(output)):
        map_country(output.iloc[i], row.CC_FIPS)
        countries = countries.drop(countries[countries == output.iloc[i]].index[0])

  


629 locations were matched with city names

## Using City name (contain)

In [263]:
for j in range(len(dataset_cities)):
    output = countries[countries.str.contains(str(dataset_cities.iloc[j].City) + " ", case=False, na=False)]
    for i in range(len(output)):
        map_country(output.iloc[i], dataset_cities.iloc[j].Country)
        countries = countries.drop(countries[countries == output.iloc[i]].index[0])
        
    output = countries[countries.str.contains(" " + str(dataset_cities.iloc[j].City), case=False, na=False)]
    for i in range(len(output)):
        map_country(output.iloc[i], dataset_cities.iloc[j].Country)
        countries = countries.drop(countries[countries == output.iloc[i]].index[0])

5,326 name of countries contain name of one city.

In [273]:
print("{0} strings remained\n{1} strings mapped to a country".format(len(countries), len(map_countries)))

1318 strings remained
7151 strings mapped to a country


In [None]:
# Export mapping of countries and remained countries
countries.to_csv('output/remained_countries_.csv')
map_countries.to_csv('output/mapping_countries_.csv')

In [None]:
cleaned_countries = pd.read_csv("cleaning_data/origins_cleaning.csv")

# Flatten origins column
origins_p = non_swiss_sold.withColumn('origins', F.explode_outer(F.split('origins', ',')))

# Clean origins countries
for index, row in cleaned_countries.iterrows():
    if(str(row['replace_with']) != "nan"):
        origins_p = origins_p.withColumn('origins', F.regexp_replace('origins', "^" + row['origins'] + "$", row['replace_with']))
    else:
        origins_p = origins_p.withColumn('origins', F.regexp_replace('origins', row['origins'], "Not Specified"))

In [None]:
# Find dominant importers of ingredients

origins_p.createOrReplaceTempView("origins_p")
target_origins = spark.sql("SELECT origins, COUNT(origins) FROM origins_p GROUP BY origins ORDER BY COUNT(origins) DESC")
target_origins = target_origins.withColumnRenamed('count(origins)' , 'Count')

print("Number of ingredient importers:\n" + str(target_origins.count()))

In [None]:
target_origins.show(5)

In [None]:
# Extract the 10 most important countries and number of occurances in ingredients.
origin_countries = target_origins.toPandas() 
origin_countries = origin_countries.head(10)

# Plot the number of products imported by 10 most important countries
fig, ax = plt.subplots()
ax.grid(zorder=-1)
plt.bar(origin_countries.origins, origin_countries.Count, zorder=3, color='skyblue')
plt.ylabel('Number of products')   
plt.xticks(origin_countries.origins, rotation='80')
plt.title('Fig. The number of products which ingredients are from the 10 most frequent countries')
plt.show()

In [None]:
# Export origins to CSV 
target_origins.select("origins").toPandas().to_csv('output/origins.csv')

Now that we found the most important countries in case of producing ingredients for products which are imported to Switzerland, We should take the same approach to find the most important manufacturers of this products.

In [None]:
# Flatten manufacturing_places column
manufacturers_p = non_swiss_sold.withColumn('manufacturing_places', F.explode_outer(F.split('manufacturing_places', ',')))

# Clean manufacturers countries
for index, row in cleaned_countries.iterrows():
    if(str(row['replace_with']) != "nan"):
        manufacturers_p = manufacturers_p.withColumn('manufacturing_places', F.regexp_replace('manufacturing_places', "^" + row['origins'] + "$", row['replace_with']))
    else:
        manufacturers_p = manufacturers_p.withColumn('manufacturing_places', F.regexp_replace('manufacturing_places', row['origins'], "Not Specified"))


In [None]:
# Find dominant importers (manufacturers)

manufacturers_p.createOrReplaceTempView("manufacturers_p")
target_manufacturers = spark.sql("SELECT manufacturing_places, COUNT(manufacturing_places) FROM manufacturers_p GROUP BY manufacturing_places ORDER BY COUNT(manufacturing_places) DESC")
target_manufacturers = target_manufacturers.withColumnRenamed('count(manufacturing_places)' , 'Count')
# target_manufacturers.show()

print("Number of Manufacturers:\n" + str(target_manufacturers.count()))

In [None]:
# Extract the 10 most important manufacturers countries and number of occurances.
manufacturer_countries = target_manufacturers.toPandas() 
manufacturer_countries = manufacturer_countries.head(10)

# Plot the number of products manufactured by 10 most important countries
fig, ax = plt.subplots()
ax.grid(zorder=-1)
plt.bar(manufacturer_countries.manufacturing_places, manufacturer_countries.Count, zorder=3, color='skyblue')
plt.ylabel('Number of products')   
plt.xticks(manufacturer_countries.manufacturing_places, rotation='80')
plt.title('Fig. The number of products manufactured by 10 most important countries')
plt.show()

In [None]:
# Export manufacturing_places to CSV 
target_manufacturers.select("manufacturing_places").toPandas().to_csv('output/manufacturing_places.csv')

### Working with products' categories

We would like to find which countries import what products?
In order to answer to this question, first we extract categories corresponding to the products sold in Switzerland but not manufactured in this country. Then we will combine the information provided for manufacturing places of these products with corresponding category. 

In the end we will extract the 5 most important importers of each category.

In [None]:
extra_info_df = ' categories '
categories_df = spark.sql("SELECT" + p_id_col + extra_info_df + " FROM data_main")
categories_df.show(10)

In [None]:
categories_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in traces_df.columns]).show()

In [None]:
# Join table of categories with table of target products (Sold in Switzerland but not manufactured in it)
non_swiss_sold.createOrReplaceTempView("target_products_df")
categories_df.createOrReplaceTempView("categories_df")
joined_df = spark.sql("SELECT p.code, c.categories, p.origins, p.manufacturing_places, p.countries  FROM target_products_df p INNER JOIN categories_df c ON p.code = c.code")

In [None]:
joined_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in joined_df.columns]).show()

According to above table, just 15 products do not have category. 

In [None]:
joined_df.createOrReplaceTempView("target_products_cats")
sql_filter = "SELECT * FROM target_products_cats WHERE categories is not NULL"
target_products_categories_p = spark.sql(sql_filter)

print("Number of Products sold in Switzerland with categories:\n" + str(target_products_categories_p.count()))

In [None]:
target_products_categories_p.show(5)

In [None]:
# Flatten categories column
target_products_categories_p = target_products_categories_p.withColumn('categories', F.explode_outer(F.split('categories', ',')))

# Remove occurances of en: in name of categories
target_products_categories_p = target_products_categories_p.withColumn('categories', F.regexp_replace('categories', 'en:', ''))

target_products_categories_p.show(5)

In [None]:
# Find dominant categories
target_products_categories_p.createOrReplaceTempView("target_products_categories_p")
target_categories = spark.sql("SELECT categories, COUNT(categories) FROM target_products_categories_p GROUP BY categories ORDER BY COUNT(categories) DESC")
target_categories = target_categories.withColumnRenamed('count(categories)' , 'Count')
target_categories.show()

print("Number of Categories:\n" + str(target_categories.count()))

According to above table not all categories are independant and they may contain similar products. We did not change categories for this milestone but we would like to merge some categories in future to have better distinction between countries which import these products. 

In [None]:
# Export Categories to CSV 
target_categories.toPandas().to_csv('output/categories.csv')

In [None]:
# Extract the 15 most frequent categories and number of occurances.
categories = target_categories.toPandas() 
categories = categories.head(15)

# Plot the most frequent categories
fig, ax = plt.subplots()
ax.grid(zorder=-1)
plt.barh(categories.categories, categories.Count, zorder=3, color='skyblue')
plt.xlabel('Number of products')   
plt.title('Fig. The number of products sold in Switzerland')
ax.invert_yaxis() 
plt.show()

In [None]:
cleaned_countries = pd.read_csv("cleaning_data/origins_cleaning.csv")

In [None]:
# Extract the most important manufacturer countries for 15 most frequent categories

categories_countires_p = target_products_categories_p.withColumn('manufacturing_places', F.explode_outer(F.split('manufacturing_places', ',')))

# Clean manufacturers countries
for index, row in cleaned_countries.iterrows():
    if(str(row['replace_with']) != "nan"):
        categories_countires_p = categories_countires_p.withColumn('manufacturing_places', F.regexp_replace('manufacturing_places', "^" + row['origins'] + "$", row['replace_with']))
    else:
        categories_countires_p = categories_countires_p.withColumn('manufacturing_places', F.regexp_replace('manufacturing_places', row['origins'], "Not Specified"))

categories_countires_p = categories_countires_p.withColumn('categories', F.explode_outer(F.split('categories', ',')))
products_cats_df = categories_countires_p.toPandas()

countries_cats = []
cat = "Snacks sucrés"
for index, row in categories.iterrows():
    temp_df = products_cats_df[(products_cats_df.categories == row.categories)].groupby('manufacturing_places').count().sort_values(by=['code'], ascending=False).head(5)
    countries_cats.append(temp_df)

# Each element of countries_cats is a dataframe containing the most important manufacturers
# of the corresponding category

i = 0
print("category: " + str(categories.iloc[i].categories))
print("The most important manufacturers:")
print(countries_cats[0].head(3).index.values)

We will use extracted information about the most important manufacturer countries for 15 most frequent categories for our visualization.

## Descriptive Analysis

In [None]:
openfood = pd.read_csv(DATA_FOLDER + openfood_file, sep = '\t')

In [None]:
foodSwitzerland = openfood[openfood['countries_tags']=="en:switzerland"]

In [None]:
nbProdSwit = len(foodSwitzerland)
print("Number of products: ", nbProdSwit)

# first perspective: Compare old vs new products inside the dataset

A product in the dataset is considered "old" if its description was uploaded to the dataset before February 2017. By the contrary, it is considered as "new" if its description was uploaded after that date.

_As that definition is a little rigid, in order to get closer to the real situation, the following assumption is made: From the total products uploaded to the dataset after February 2017, 20% are old products._  

A first perspective to tackle the research question, is to compare how in this more than 6 years of existance of the dataset, the characteristics of the products have changed. Specifically, we would like to know if there have been some changes in the origin of the primary resources, or in the origin of the manufacture or in the labels of the products.

The study of the evolution in time of each one of those features, will include an __exploratory data analysis__.
Finally, a study including the three features will be done, aiming to find an aggregated differentiated behavior in time, reflected in different clusters of periods of time. For that the __K-modes algorithm__ will be used.

In [None]:
print("Date of first upload: ", min(foodSwitzerland['created_datetime']))
print("Date of last upload retrieved: ", max(foodSwitzerland['created_datetime']))

In [None]:
foodSwitzerland['created_datetime'] = pd.to_datetime(foodSwitzerland['created_datetime'])
foodSwitzerland = foodSwitzerland.sort_values(by='created_datetime')

The uploads of products sold in Switzerland, behaves differently in time. Taking into account the histogram presented below, two periods of time are defined:
- Period 1: Created for studying the behaviour from "old" products. Products uploaded before Feb 2017.
- Period 2: Created for studying the behaviour from "new" products. Products uploaded after Feb 2017.

In [None]:
rateOldInNew = 0.2
old_products = len(foodSwitzerland[foodSwitzerland['created_datetime']<"2017-03-01 00:00:00"])
print("Old products: ", old_products + (len(foodSwitzerland)-old_products)*(rateOldInNew))
print("New products: ", (len(foodSwitzerland)-old_products)*(1-rateOldInNew))
foodSwitzerland['created_datetime'].hist()

Before making subdivision of data:

In [None]:
filter_ch = '[Ss]witzerland|[Ss]uisse|[Ss]chweiz|[Ss]vizerra'
filter_local = '[Ss]witzerland|[Ss]uisse|[Ss]chweiz|[Ss]vizerra|[Ll]ocal'

place = pd.Series(['Other country','Switzerland', 'No information'], index=[0,1,2])
refLabel = pd.Series(['Other Label','Related with Switzerland', 'No information'], index=[0,1,2]) 

foodSwitzerland["originsCat"] = foodSwitzerland["origins"].str.contains(filter_ch,regex=True).map(place,na_action='ignore')
foodSwitzerland["manuCat"] = foodSwitzerland["manufacturing_places"].str.contains(filter_ch,regex=True).map(place,na_action='ignore')
foodSwitzerland["labCat"] = foodSwitzerland["labels_tags"].str.contains(filter_local,regex=True).map(refLabel,na_action='ignore')

Now, let's do the subdivision in the two periods, taking care of our assumption of 20%:

In [None]:
foodSwitzerlandBef = foodSwitzerland[foodSwitzerland['created_datetime']<"2017-03-01 00:00:00"]
foodSwitzerlandAft = foodSwitzerland[foodSwitzerland['created_datetime']>="2017-03-01 00:00:00"]

befInAft = foodSwitzerlandAft.sample(n=int(rateOldInNew*len(foodSwitzerlandAft)), replace=False)
foodSwitzerlandBef = pd.concat([foodSwitzerlandBef,befInAft],axis=0)

for i in range (0,len(befInAft)):
    foodSwitzerlandAft = foodSwitzerlandAft[foodSwitzerlandAft['code']!=befInAft['code'].iloc[i]]
print("Number of Old products: ", np.shape(foodSwitzerlandBef))
print("Number of New products: ", np.shape(foodSwitzerlandAft))

### 2.1 Study of the evolution in time of each one of the interest features

### 2.1.1 With respect to: Origin of the primary resources

First, bootstrapping for include confidence intervals to results

In [None]:
#before
propSB_100ite = np.zeros (100)
propOCB_100ite = np.zeros (100)
#after
propSA_100ite = np.zeros (100)
propOCA_100ite = np.zeros (100)

for iteration in range(0,100):
    #before
    temp_bef = foodSwitzerlandBef["originsCat"].sample(n=len(foodSwitzerlandBef), replace=True)
    propSB_100ite[iteration] = (sum(temp_bef=="Switzerland")/len(foodSwitzerlandBef))
    propOCB_100ite[iteration] = sum(temp_bef=="Other country")/len(foodSwitzerlandBef)
    #after
    temp_aft = foodSwitzerlandAft["originsCat"].sample(n=len(foodSwitzerlandAft), replace=True)
    propSA_100ite[iteration] = sum(temp_aft=="Switzerland")/len(foodSwitzerlandAft)
    propOCA_100ite[iteration] = sum(temp_aft=="Other country")/len(foodSwitzerlandAft)
    
    #Relation of time evolution and difference of products in each categories
    difpropB = propSB_100ite[iteration]-propOCB_100ite[iteration]
    difpropA = propSB_100ite[iteration]-propOCA_100ite[iteration]

#Calculating standard deviation of count of each category
SBstd = np.std(propSB_100ite)
OCBstd = np.std(propOCB_100ite)
#Calculating standard deviation of count of each category
SAstd = propSA_100ite.std()
OCAstd = propOCA_100ite.std()

Then, plotting of behavior 

In [None]:
#Report information of NAN cases
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
print("There was not information for ",len(foodSwitzerlandBef["originsCat"])-sum(foodSwitzerlandBef["originsCat"]=="Switzerland")-sum(foodSwitzerlandBef["originsCat"]=="Other country")," products in the period 1.")
print("There was not information for ",len(foodSwitzerlandAft["originsCat"])-sum(foodSwitzerlandAft["originsCat"]=="Switzerland")-sum(foodSwitzerlandAft["originsCat"]=="Other country")," products in the period 2.")

#Plot origin of primary resources by periods
plt.subplot(1,2,1)
(foodSwitzerlandBef["originsCat"].value_counts()/len(foodSwitzerlandBef["originsCat"])).plot(kind='bar', yerr = [SBstd,OCBstd],title='Period 1')
plt.subplot(1,2,2)
(foodSwitzerlandAft["originsCat"].value_counts()/len(foodSwitzerlandAft["originsCat"])).plot(kind='bar', yerr = [SAstd,OCAstd],title='Period 2')

plt.show()

Most of old products have as origin of primary resources Switzerland. However, in new products, the origin of primary resources is for Switzerland and Other countries statistically the same.

### 2.1.2 With respect to: Manufacture

First, bootstrapping for include confidence intervals to results

In [None]:
#before
propSB_100ite = np.zeros (100)
propOCB_100ite = np.zeros (100)

#after
propSA_100ite = np.zeros (100)
propOCA_100ite = np.zeros (100)

for iteration in range(0,100):
    #before
    temp_bef = foodSwitzerlandBef["manuCat"].sample(n=len(foodSwitzerlandBef), replace=True)
    propSB_100ite[iteration] = sum(temp_bef=="Switzerland")/len(foodSwitzerlandBef)
    propOCB_100ite[iteration] = sum(temp_bef=="Other country")/len(foodSwitzerlandBef)
    #after
    temp_aft = foodSwitzerlandAft["manuCat"].sample(n=len(foodSwitzerlandAft), replace=True)
    propSA_100ite[iteration] = sum(temp_aft=="Switzerland")/len(foodSwitzerlandAft)
    propOCA_100ite[iteration] = sum(temp_aft=="Other country")/len(foodSwitzerlandAft)
    

#Calculating standard deviation of count of each category
SBstd = propSB_100ite.std()
OCBstd = propOCB_100ite.std()
#Calculating standard deviation of count of each category
SAstd = propSA_100ite.std()
OCAstd = propOCA_100ite.std()

In [None]:
#Report information of NAN cases
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
print("There was not information for ",len(foodSwitzerlandBef["manuCat"])-sum(foodSwitzerlandBef["manuCat"]=="Switzerland")-sum(foodSwitzerlandBef["manuCat"]=="Other country")," products in the period 1.")
print("There was not information for ",len(foodSwitzerlandAft["manuCat"])-sum(foodSwitzerlandAft["manuCat"]=="Switzerland")-sum(foodSwitzerlandAft["manuCat"]=="Other country")," products in the period 2.")

#Plot origin of primary resources by periods
plt.subplot(1,2,1)
(foodSwitzerlandBef["manuCat"].value_counts()/len(foodSwitzerlandBef["manuCat"])).plot(kind='bar', yerr = [SBstd,OCBstd],title='Period 1')
plt.subplot(1,2,2)
(foodSwitzerlandAft["manuCat"].value_counts()/len(foodSwitzerlandBef["manuCat"])).plot(kind='bar', yerr = [SAstd,OCAstd],title='Period 2')

plt.show()

Most of old products were manufactured in Switzerland. However, in new products, the manufacture is for Switzerland and Other countries statistically the same.

### 2.1.3 With respect to: Labels

In [None]:
#before
propSB_100ite = np.zeros (100)
propOCB_100ite = np.zeros (100)
#after
propSA_100ite = np.zeros (100)
propOCA_100ite = np.zeros (100)

for iteration in range(0,100):
    #before
    temp_bef = foodSwitzerlandBef["labCat"].sample(n=len(foodSwitzerlandBef), replace=True)
    propSB_100ite[iteration] = sum(temp_bef=="Related with Switzerland")/len(foodSwitzerlandBef)
    propOCB_100ite[iteration] = sum(temp_bef=="Other Label")/len(foodSwitzerlandBef)
    #after
    temp_aft = foodSwitzerlandAft["labCat"].sample(n=len(foodSwitzerlandAft), replace=True)
    propSA_100ite[iteration] = sum(temp_aft=="Related with Switzerland")/len(foodSwitzerlandAft)
    propOCA_100ite[iteration] = sum(temp_aft=="Other Label")/len(foodSwitzerlandAft)
    
    #Relation of time evolution and difference of products in each categories
    difpropB = propSB_100ite[iteration]-propOCB_100ite[iteration]
    difpropA = propSB_100ite[iteration]-propOCA_100ite[iteration]

#Calculating standard deviation of count of each category
SBstd = propSB_100ite.std()
OCBstd = propOCB_100ite.std()
#Calculating standard deviation of count of each category
SAstd = propSA_100ite.std()
OCAstd = propOCA_100ite.std()

In [None]:
#Report information of NAN cases
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
print("There was not information for ",len(foodSwitzerlandBef["labCat"])-sum(foodSwitzerlandBef["labCat"]=="Switzerland")-sum(foodSwitzerlandBef["labCat"]=="Other country")," products in the period 1.")
print("There was not information for ",len(foodSwitzerlandAft["labCat"])-sum(foodSwitzerlandAft["labCat"]=="Switzerland")-sum(foodSwitzerlandAft["labCat"]=="Other country")," products in the period 2.")

#Plot origin of primary resources by periods
plt.subplot(1,2,1)
(foodSwitzerlandBef["labCat"].value_counts()/len(foodSwitzerlandBef["labCat"])).plot(kind='bar', yerr = [SBstd,OCBstd],title='Period 1')
plt.subplot(1,2,2)
(foodSwitzerlandAft["labCat"].value_counts()/len(foodSwitzerlandBef["labCat"])).plot(kind='bar', yerr = [SAstd,OCAstd],title='Period 2')

plt.show()

For the two periods studied, other labels are more frequent than the related with Switzerland

## Future plan

### Improving data cleaning for countries 

Although we cleaned the data of name of countries but about 200 are still remained.

### Extend research questions to Europe

After talking with TA, we decided to extend our research questions to Europe in order to have more data. The similar approach will be taken for answering our main questions about Europe.

### Evaluate and complement our results

We would like to complement our results using following databases:
- Additional datasets “Evolution de la consommation de denrées alimentaires en Suisse” (https://opendata.swiss/fr/dataset/entwicklung-des-nahrungsmittelverbrauches-in-der-schweiz-je-kopf-und-jahr1) and “Dépenses fédérales pour l’agriculture et l’alimentation” (https://opendata.swiss/fr/dataset/bundesausgaben-fur-die-landwirtschaft-und-die-ernahrung1) from https://opendata.swiss/fr/group/agriculture

- A last additional dataset for the second question of the project https://www.gate.ezv.admin.ch/swissimpex/public/bereiche/waren/result.xhtml Total of imports of agriculture, forestry and fishing goods

### Find the most important characteristic of Swiss-made products

After answering our main research questions we would like to find which products are likely to import to Switzerland. To answer this question we would like to train a classifier in order to see which features are the most important features to decide if a product is Swiss made or not.

### Improve our assumption about date of entering products to the Swiss market

Study of the evolution in time of the interest features combined
- The k-modes and other machine learning algorithms are expected to be done for the milestone 3.