<h2>Market Study - Chicken Exports</h2>

<a href="https://openclassrooms.com/en/" >Openclassrooms</a>, Data Analyst Course
<br>Project 5 - Michael Orange


<hr>

A food-industry company specializing in chicken meat wants to expand worldwide. All countries are considered.

Production is located in France.
<br>The mission is to provide the company with data that identify the most promising export markets.

Countries must be grouped into 'clusters' based on their similarity. 
<br>Default variables are: 
- population growth, 
- calorie supply per capita, 
- protein supply per capita, 
- share of animal proteins in the total protein supply. 

Additional relevant variables may be included (e.g. GDP per capita).

<hr>

**Section 1** [Importing FAOSTAT Datasets](#import)


**Section 2** [Adding General information](#general)
- [Population](#pop)
- [Gross Domestic Product](#gdp)
- [Political Stability](#stab)
- [European Union countries](#eu)
    
    
**Section 3** [Adding Food-related data](#food)
- [Food-Balance](#fb)
- [Diet - Calories and Proteins](#diet)
- [Poultry key data](#poultry)
- [Chicken meat imports from France](#france)

**Section 4** [Imputing missing data](#imputation)

**Section 5** [Preparing datasets](#prep) 
    
**Section 6** [Exporting Data set](#export) 

<hr>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from display_pca import *

<h2>Importing FAOSTAT Datasets</h2><a id='import'></a>

Data are collected from the Food and Agriculture Organization Corporate Statistical Database (FAOSTAT). The FAOSTAT website disseminates statistical data collected and maintained by the Food and Agriculture Organization (FAO).

FAOSTAT data were collected for 2013 and 2017: the analysis is based on 2017 figures, with 2013 data used for comparison. 
2017 is the most recent year with comprehensive data in FAOSTAT.

The following datasets are used:

- Datasets Food Balance Animal and Vegetal
- Dataset Population
- Dataset Gross Domestic Product (GDP)
- Dataset Political Stability Index

In [2]:
#import food datasets
veg2013 = pd.read_csv("data/raw/vegetal2013.csv")
ani2013 = pd.read_csv("data/raw/animal2013.csv")
veg2017 = pd.read_csv('data/raw/vegetal2017.csv', dtype={'Note': str })
ani2017 = pd.read_csv("data/raw/animal2017.csv")
ani2013["origin"] = "animal"
veg2013["origin"] = "vegetal"
ani2017["origin"] = "animal"
veg2017["origin"] = "vegetal"

#import population
pop = pd.read_csv("data/raw/FAOSTAT_data_Population_2013-2017.csv")

# import GDP
gdp = pd.read_csv("data/raw/FAOSTAT_data_MacroIndicators_20132017.csv")

# import Stability Index
stability = pd.read_csv("data/raw/FAOSTAT_data_PoliticalStability_2017.csv")

# import EU countries
eu_country = pd.read_csv("data/raw/listofeucountries.csv")

# import chicken trade
chicken_trade = pd.read_csv("data/raw/data_Trade_ChickenMeat_20132017_France.csv")

**Important note** 
Naming convention: the analysis is based on 2017 data, so 2017 is the default year for variable names. 
- Variables for 2017 carry no year suffix, e.g. 'pop_thousand' = population in 2017.
- Variables for 2013 always carry the suffix '2013', e.g. 'pop_thousand_2013' = population in 2013.

<hr>

<h2>Adding general information</h2><a id='general'></a>

<h3>Population</h3><a id='pop'></a>

In [3]:
# population
pop.columns = ["xx","xx2","country_code","country", 'xx3', 'xx4', 'xx5','xx6', 
               'xx7', 'year',"xx8","population_total_thousand","xx9", 'xx10', 'xx11']

pop = pop.drop(["xx","xx2",'xx3','xx4','xx5', 'xx6', 'xx7','xx8', 'xx9', 'xx10', 'xx11'], axis = 1)
pop.reset_index(drop=True, inplace=True)

# one row per country, one column per year
data_country = pop.pivot_table(index=['country_code', 'country'], columns = ['year'], values=['population_total_thousand'])
# pivot_table sorts the year columns in ascending order: 2013 first, then 2017
data_country.columns = ['pop_thousand_2013', 'pop_thousand']
data_country.reset_index(inplace=True)

# population growth between 2013 and 2017, in %
data_country['pop_growth'] = (data_country['pop_thousand'] / data_country['pop_thousand_2013'] - 1) *100
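
The reshaping and growth computation above can be sketched on a toy table. This is only an illustration: the countries and figures are invented, and the explicit `rename` is an alternative to relying on the sorted column order, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical long-format data: one row per country and year (values invented).
pop_long = pd.DataFrame({
    "country_code": [1, 1, 2, 2],
    "country": ["Aland", "Aland", "Borduria", "Borduria"],
    "year": [2013, 2017, 2013, 2017],
    "population_total_thousand": [1000.0, 1100.0, 500.0, 450.0],
})

# Long to wide: one row per country, one column per year.
wide = pop_long.pivot_table(
    index=["country_code", "country"],
    columns="year",
    values="population_total_thousand",
)

# Rename by year label, so the result does not depend on column order.
wide = wide.rename(columns={2013: "pop_thousand_2013", 2017: "pop_thousand"}).reset_index()
wide.columns.name = None

# Growth between the two years, in %.
wide["pop_growth"] = (wide["pop_thousand"] / wide["pop_thousand_2013"] - 1) * 100
print(wide)
```

Renaming through a dictionary keeps the 2013 and 2017 columns correctly labelled even if further years are added to the input later.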

<h3>Gross Domestic Product (GDP) per capita</h3><a id='gdp'></a>

In [4]:
gdp.columns = ["xx", "xx2", "country_code", "country", 'xx3', 'xx4', 'xx5', 'item', 'xx6', "year", "xx7", "value", 'xx8', 'xx9', 'xx10']
gdp = gdp.loc[gdp['item'] == 'Gross Domestic Product per capita'].pivot_table(
    index=['country_code', 'country'], columns=['year', 'item'], values=['value'])
gdp.columns = ['gdp_percapita_usd_2013', 'gdp_percapita_usd']
gdp.reset_index(inplace=True)

data_country = pd.merge(data_country, gdp, how='left')

The GDP per capita of Taiwan is missing from the FAOSTAT data. 
It is imputed with figures from the International Monetary Fund (IMF):
- Taiwan GDP per capita 2013: 43,831 USD 
- Taiwan GDP per capita 2017: 50,593 USD 

Source: https://bit.ly/2U2251b

In [5]:
# Imputation GDP Taiwan
data_country.loc[data_country['country'] == 'China, Taiwan Province of', 'gdp_percapita_usd_2013'] = 43831
data_country.loc[data_country['country'] == 'China, Taiwan Province of', 'gdp_percapita_usd'] = 50593

In [6]:
# Growth GDP
data_country['gdp_growth'] = (data_country['gdp_percapita_usd'] / data_country['gdp_percapita_usd_2013'] - 1) * 100

In [7]:
#remove column not needed for the analysis
data_country.drop(['gdp_percapita_usd_2013'], axis = 1, inplace=True)

<h3>Political Stability</h3><a id='stab'></a>

In [8]:
stability.columns = ["xx", "xx2", "country_code", "country", 'xx3', 'xx4', 'xx5', 'item', 'xx6', "xx7", "xx8", "value", 'xx9', 'xx10', 'xx11']
stability = stability.pivot_table(index=['country_code', 'country'], columns = ['item'], values=['value'])
stability.columns = ['political_stability_index']
stability.reset_index(inplace=True)

data_country = pd.merge(data_country, stability, how='left')

<h3>European Union countries</h3><a id='eu'></a>

In [9]:
#countries with different spellings between eu_country and data_country
eu_country.loc[~eu_country['x'].isin(data_country['country'])]

Unnamed: 0,x
5,Czech Republic
23,Slovak Republic


In [10]:
# correct names
eu_country.loc[eu_country['x'] == 'Czech Republic', 'x'] = 'Czechia'
eu_country.loc[eu_country['x'] == 'Slovak Republic', 'x'] = 'Slovakia'
eu_country.loc[eu_country['x'] == 'United Kingdom', 'x'] = 'United Kingdom of Great Britain and Northern Ireland'

In [11]:
# flag EU countries
data_country.loc[data_country['country'].isin(eu_country['x']), 'euro_union'] = 'EU'
data_country.loc[~data_country['country'].isin(eu_country['x']), 'euro_union'] = 'Outside EU'
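
The name reconciliation and flagging steps above can be sketched with a mapping dictionary. This is a minimal illustration on invented lists; the variable names (`name_fixes`, `eu_fixed`, `flags`) are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy inputs: an EU list with spellings that differ from the reference list.
eu = pd.Series(["France", "Czech Republic", "Slovak Republic"], name="x")
countries = pd.Series(["France", "Czechia", "Slovakia", "Norway"], name="country")

# Map each differing spelling to the spelling used in the reference list.
name_fixes = {"Czech Republic": "Czechia", "Slovak Republic": "Slovakia"}
eu_fixed = eu.replace(name_fixes)

# Any EU name still missing from the reference list would show up here.
unmatched = eu_fixed[~eu_fixed.isin(countries)]

# Flag each reference country as inside or outside the EU.
flags = countries.isin(eu_fixed).map({True: "EU", False: "Outside EU"})
print(unmatched.tolist(), flags.tolist())
```

Checking `unmatched` again after the corrections is a cheap way to confirm that no EU member was silently flagged as 'Outside EU' because of a spelling difference.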

<hr>

<h2>Adding Food-related data</h2><a id='food'></a>

<h3>Food Balance</h3><a id='fb'></a>

In [12]:
# group food datasets
temp = [veg2013, ani2013, ani2017, veg2017]
temp = pd.concat(temp, ignore_index=True)

# delete ani2013, veg2013, ani2017, veg2017
del ani2013, veg2013, ani2017, veg2017

temp.columns = ["xx", "xx2", "country_code", "country", 'xx3', 'element', 'item_code', 'item',
                'xx4', "year", "unit", "value", 'xx5', 'flag_description', 'origin', 'xx6']

data = temp.pivot_table(index=["year", "country_code", "country", "origin", "item_code", "item"], columns = ["element"], values=["value"], aggfunc=sum)

# rename columns
data.columns = ['domestic_supply_quantity','export_quantity','fat_supply_quantity_gcapitaday','feed',
                'food','food_supply_kcalcapitaday','food_supply_quantity_kgcapitayr','import_quantity','losses','other_uses','processing',
                'production', 'protein_supply_quantity_gcapitaday', 'residuals', 'seed','stock_variation','tourist_consumption']

data = data.reset_index()

# merge data and pop
data = pd.merge(data, pop, how='left')

Data for China are duplicated. 
<br>The rows for 'China' are the sum of the rows for 'China, mainland', 'China, Hong Kong', 'China, Macao' and 'China, Taiwan Province of'.
- The aggregated 'China' rows (code 351) are deleted.
- The rows for 'China, mainland', 'China, Hong Kong', 'China, Macao' and 'China, Taiwan Province of' are kept to preserve finer granularity.
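
A quick way to confirm such a duplication before deleting the aggregate is to compare it with the sum of its parts. The sketch below uses invented production figures; only code 351 comes from the dataset, the other country codes are placeholders.

```python
import pandas as pd

# Toy rows: one aggregate (code 351) and its four components (placeholder codes).
rows = pd.DataFrame({
    "country_code": [351, 9001, 9002, 9003, 9004],
    "country": ["China", "China, mainland", "China, Hong Kong",
                "China, Macao", "China, Taiwan Province of"],
    "production": [100.0, 90.0, 4.0, 1.0, 5.0],
})

is_aggregate = rows["country_code"] == 351
aggregate_total = rows.loc[is_aggregate, "production"].sum()
components_total = rows.loc[~is_aggregate, "production"].sum()

# The aggregate duplicates the components if both totals match.
is_duplicate = abs(aggregate_total - components_total) < 1e-9

# Keep only the components so that nothing is counted twice.
deduplicated = rows.loc[~is_aggregate]
print(is_duplicate, len(deduplicated))
```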


Data for Bermuda and Brunei Darussalam are not available for 2017.

In [13]:
# remove the duplicated aggregate for China (country_code 351)
data = data.loc[data.country_code != 351]

# remove Bermuda and Brunei - no information for 2017
data = data.loc[data.country != 'Bermuda']
data = data.loc[data.country != 'Brunei Darussalam']

<h3>Diet - Calories and Proteins</h3><a id='diet'></a>

In [14]:
temp = data.pivot_table(index=['country_code', 'country'], columns=['year'],
                        values=['food_supply_kcalcapitaday',  'protein_supply_quantity_gcapitaday'], aggfunc=sum)

temp.columns = ['food_supply_kcalcapitaday_2013', 'food_supply_kcalcapitaday', 
                'protein_supply_gcapitaday_2013',  'protein_supply_gcapitaday']
temp = temp.reset_index()

data_country = pd.merge(temp, data_country, how='left')

**Animal proteins in the total protein supply**

In [15]:
NB_DAYS_YEAR = 365 

In [16]:
# Total protein supply in kg per year:
# g/capita/day -> kg/capita/day (/ 1000), per year (* NB_DAYS_YEAR), whole population (* thousands * 1000)
for (x, y, z) in [('protein_supply_gcapitaday', 'pop_thousand', 'protein_supply_kg'),
                  ('protein_supply_gcapitaday_2013', 'pop_thousand_2013', 'protein_supply_kg_2013')]:

    data_country[z] = data_country[x] / 1000 * NB_DAYS_YEAR * data_country[y] * 1000

In [17]:
temp = data.loc[data['origin']=='animal'].pivot_table(index=['country_code', 'country'], columns=['year'],
                                                      values=[ 'protein_supply_quantity_gcapitaday'], aggfunc=sum)
temp.columns = ['protein_supply_animal_gcapitaday_2013', 'protein_supply_animal_gcapitaday']
temp.reset_index(inplace=True)

data_country = pd.merge(data_country, temp, how='left')

In [18]:
# Protein Supply from animals in kg 
for (x,y, z) in [('protein_supply_animal_gcapitaday','pop_thousand', 'protein_supply_animal_kg'),
              ('protein_supply_animal_gcapitaday_2013','pop_thousand_2013', 'protein_supply_animal_kg_2013') ]:
    
    data_country[z] = data_country[x] / 1000 * NB_DAYS_YEAR *  data_country[y] * 1000

In [19]:
# Proportion animal proteins in the total protein supply
for (x,y, z) in [('protein_supply_animal_kg','protein_supply_kg', 'protein_animal_over_protein'),
              ('protein_supply_animal_kg_2013','protein_supply_kg_2013', 'protein_animal_over_protein_2013'), 
                ]:
    
    data_country[z] = data_country[x] / data_country[y] * 100
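
The unit conversion and the share computed in the cells above can be checked with a small worked example. This is a sketch with invented figures; `total_supply_kg` is a hypothetical helper, not a function from the notebook.

```python
NB_DAYS_YEAR = 365


def total_supply_kg(g_per_capita_day, pop_thousand):
    """Convert a daily per-capita supply in grams into a yearly national total in kg."""
    kg_per_capita_day = g_per_capita_day / 1000   # grams to kilograms
    population = pop_thousand * 1000              # thousands of people to people
    return kg_per_capita_day * NB_DAYS_YEAR * population


# Invented values: 80 g of protein per capita per day, 32 g of which from animals,
# in a country of 2 million inhabitants (2,000 thousand).
total_kg = total_supply_kg(80, 2000)
animal_kg = total_supply_kg(32, 2000)

# Share of animal proteins computed from the national totals...
share_from_totals = animal_kg / total_kg * 100
# ...and directly from the per-capita figures.
share_from_per_capita = 32 / 80 * 100

print(total_kg, share_from_totals, share_from_per_capita)
```

Population and number of days cancel out in the ratio, so for a given year the share is the same whether it is computed from the kg totals or from the g/capita/day values.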

In [20]:
# remove columns not needed for the analysis
data_country.drop(['food_supply_kcalcapitaday_2013' , 'protein_supply_gcapitaday_2013', 
       'pop_thousand_2013', 
       'protein_supply_kg', 'protein_supply_kg_2013',
       'protein_supply_animal_gcapitaday_2013',
       'protein_supply_animal_gcapitaday', 'protein_supply_animal_kg',
       'protein_supply_animal_kg_2013', 
       'protein_animal_over_protein_2013'], axis = 1, inplace=True)

<h3>Poultry key data</h3><a id='poultry'></a>

Poultry are domesticated avian species that can be raised for eggs, meat and/or feathers. The term “poultry” covers a wide range of birds, from indigenous and commercial breeds of chickens to Muscovy ducks, mallard ducks, turkeys, guinea fowl, geese, quail, pigeons, ostriches and pheasants. 

- In 2017, chickens accounted for some 92 percent of the world’s poultry population, followed by ducks (5 percent), and turkeys (2 percent). 
- Chickens contribute 89 percent of world poultry meat production, followed by turkeys with 5 percent, ducks with 4 percent and geese and guinea fowl with 2 percent. The rest comes from other poultry species.
- Chickens provide 92 percent of world egg production.

source: http://www.fao.org/poultry-production-products/production/en/

The 'animal' dataset has no data for chicken alone, but the 'Poultry Meat' category (of which chicken accounts for around 90%) gives a fair indicator of the chicken market in each country. 

Poultry composition: 
- Meat, chicken; Fat, liver prepared (foie gras); Meat, chicken, canned; Meat, duck; Meat, goose and guinea fowl; Meat, turkey.

In [21]:
temp = data.loc[data['item'] == 'Poultry Meat'].pivot_table(index=['country_code', 'country'], columns=['year'],
                                                            values=[ 'domestic_supply_quantity', 'export_quantity', 'import_quantity', 'production', 'protein_supply_quantity_gcapitaday'], 
                                                            aggfunc=sum)

temp.columns = ['dom_supply_poultry_tons_2013', 'dom_supply_poultry_tons', 
                  'export_poultry_tons_2013', 'export_poultry_tons',
                'import_poultry_tons_2013', 'import_poultry_tons', 
                 'prod_poultry_tons_2013', 'prod_poultry_tons', 
                'protein_poultry_gcapitaday_2013', 'protein_poultry_gcapitaday']

temp.reset_index(inplace=True)

# convert thousand tonnes to tonnes: all quantity columns, i.e. everything except
# the two key columns (first) and the two protein columns in g/capita/day (last)
for z in temp.columns[2:-2]:
    temp[z] = temp[z] * 1000

data_country = pd.merge(data_country, temp, how='left')

In [22]:
data_country['net_import_poultry_tons'] = data_country['import_poultry_tons'] - data_country['export_poultry_tons']
data_country['net_import_poultry_tons_2013'] = data_country['import_poultry_tons_2013'] - data_country['export_poultry_tons_2013']

In [23]:
# Poultry: share of net imports, imports and production in the domestic supply (in %)
# Convention when the domestic supply is zero: 0 if the quantity is also zero, otherwise 100

for (x, y, z) in [('dom_supply_poultry_tons', 'net_import_poultry_tons', 'net_import_poultry_over_domsupply'),
                  ('dom_supply_poultry_tons', 'import_poultry_tons', 'import_poultry_over_domsupply'),
                  ('dom_supply_poultry_tons', 'prod_poultry_tons', 'prod_poultry_over_domsupply')]:
    data_country.loc[(data_country[x] ==0) & (data_country[y] ==0), z] = 0
    data_country.loc[(data_country[x] ==0) & (data_country[y] !=0), z] = 100
    data_country.loc[data_country[x] !=0, z] = (data_country[y] / data_country[x]) * 100
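
The three-branch rule used in the loop above can be written as a small scalar function, which makes the convention for a zero domestic supply explicit. The function name and the figures are invented for illustration.

```python
def share_of_domestic_supply(quantity, domestic_supply):
    """Return `quantity` as a percentage of `domestic_supply`."""
    if domestic_supply == 0:
        # No domestic supply: 0 % if the quantity is also zero, otherwise 100 % by convention.
        return 0.0 if quantity == 0 else 100.0
    return quantity / domestic_supply * 100


examples = {
    "regular case": share_of_domestic_supply(30_000, 40_000),
    "net exporter": share_of_domestic_supply(-50_000, 100_000),
    "nothing at all": share_of_domestic_supply(0, 0),
    "imports but no recorded supply": share_of_domestic_supply(5_000, 0),
}
for label, value in examples.items():
    print(label, value)
```

Net imports are negative for net exporters, and imports or production can exceed the domestic supply, so these shares are not bounded between 0 and 100.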


In [24]:
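# Poultry: growth of imports and of domestic supply between 2013 and 2017 (in %)
# Convention when the 2013 value is zero: 0 if the 2017 value is also zero, otherwise 100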

for (x,y, z) in [('import_poultry_tons_2013','import_poultry_tons', 'growth_import_poultry'),
                ('dom_supply_poultry_tons_2013','dom_supply_poultry_tons', 'growth_domsupply_poultry')]:
    
    data_country.loc[(data_country[x] ==0) & (data_country[y] ==0), z] = 0
    data_country.loc[(data_country[x] ==0) & (data_country[y] !=0), z] = 100
    data_country.loc[data_country[x] !=0, z] = (data_country[y] / data_country[x] - 1) * 100

In [25]:
# remove columns not needed for the analysis
data_country.drop(['dom_supply_poultry_tons_2013', 'prod_poultry_tons_2013', 'import_poultry_tons_2013', 'export_poultry_tons_2013', 
                    'net_import_poultry_tons_2013', 'protein_poultry_gcapitaday_2013']
                  , axis = 1, inplace=True)

<h3>Chicken meat imports from France</h3><a id='france'></a>

Data are from the food and agricultural trade dataset collected, processed and disseminated by FAO. The data is mainly provided by UNSD, Eurostat, and other national authorities as needed.

Products: 'meat, chicken', 'meat, chicken, canned'

In [26]:
chicken_trade.columns =['xx', 'xx2', 'xx3', 'xx4', 'country_code', 'country', 'xx6', 
                        'element', 'xx8', 'xx9', 'xx10', 'year', 'unit', 'value', 'xx13', 'xx14', 'xx15']

# the name of the UK differs between the FAO trade dataset and the other FAO datasets
chicken_trade.loc[chicken_trade['country'] == 'United Kingdom', 'country'] = 'United Kingdom of Great Britain and Northern Ireland'

In [27]:
# 'Export Quantity' in the French trade file is read as each partner country's imports from France
chicken_import = chicken_trade.loc[(chicken_trade['element'].isin(['Export Quantity'])) & (chicken_trade['year']==2017)]
chicken_import = chicken_import.pivot_table(index=["country_code", "country"], columns=['element'], values=['value'], aggfunc=sum)
chicken_import.columns = ['import_french_chicken_tons']
chicken_import.reset_index(inplace=True)

data_country = pd.merge(data_country, chicken_import, how='left')

# replace NaN with zero for countries with no imports from France
data_country['import_french_chicken_tons'] = data_country['import_french_chicken_tons'].fillna(0)
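
Before filling the gaps with zero, it can help to see which countries had no match in the trade data. The sketch below uses the `indicator` option of `merge` on invented data; it is a possible check, not a step of the original notebook.

```python
import pandas as pd

# Toy tables: three countries, trade data available for two of them (values invented).
countries = pd.DataFrame({"country_code": [1, 2, 3], "country": ["A", "B", "C"]})
imports = pd.DataFrame({
    "country_code": [1, 3],
    "country": ["A", "C"],
    "import_french_chicken_tons": [250.0, 40.0],
})

# Left join keeps every country; `_merge` records where each row came from.
merged = countries.merge(imports, on=["country_code", "country"], how="left", indicator=True)

# Countries absent from the trade table.
no_trade = merged.loc[merged["_merge"] == "left_only", "country"].tolist()

# Treat the absence of a trade record as zero imports, then drop the helper column.
merged["import_french_chicken_tons"] = merged["import_french_chicken_tons"].fillna(0)
merged = merged.drop(columns="_merge")
print(no_trade, merged["import_french_chicken_tons"].tolist())
```

Listing `no_trade` helps separate genuine zero-import countries from rows that failed to match because of a spelling difference in the country name.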

<hr>

<h2>Imputing missing data and verifying dataset</h2><a id='imputation'></a>

Countries with no political stability index: {{data_country.loc[data_country['political_stability_index'].isna()].country.tolist()}}

- The two French territories (French Polynesia and New Caledonia) are imputed with France's index.
- Namibia is imputed with the index of the United Arab Emirates, whose 2017 political stability index is the closest (almost identical) to Namibia's (source: World Bank - https://www.theglobaleconomy.com/rankings/wb_political_stability/)

In [28]:
# imputation French Polynesia and New Caledonia
for z in ['French Polynesia', 'New Caledonia']:
    data_country.loc[data_country['country'] == z, 'political_stability_index'] = \
        data_country.loc[data_country['country'] == 'France']['political_stability_index'].values

# imputation Namibia
data_country.loc[data_country['country'] == 'Namibia', 'political_stability_index'] = \
    data_country.loc[data_country['country'] == 'United Arab Emirates']['political_stability_index'].values

In [29]:
data_country.isna().any()

country_code                         False
country                              False
food_supply_kcalcapitaday            False
protein_supply_gcapitaday            False
pop_thousand                         False
pop_growth                           False
gdp_percapita_usd                    False
gdp_growth                           False
political_stability_index            False
euro_union                           False
protein_animal_over_protein          False
dom_supply_poultry_tons              False
export_poultry_tons                  False
import_poultry_tons                  False
prod_poultry_tons                    False
protein_poultry_gcapitaday           False
net_import_poultry_tons              False
net_import_poultry_over_domsupply    False
import_poultry_over_domsupply        False
prod_poultry_over_domsupply          False
growth_import_poultry                False
growth_domsupply_poultry             False
import_french_chicken_tons           False
dtype: bool

<h2>Preparing datasets</h2><a id='prep'></a>

In [30]:
#remove column not needed for the analysis
data_country.drop(['country_code'], axis = 1, inplace=True)

In [31]:
data_country

Unnamed: 0,country,food_supply_kcalcapitaday,protein_supply_gcapitaday,pop_thousand,pop_growth,gdp_percapita_usd,gdp_growth,political_stability_index,euro_union,protein_animal_over_protein,...,import_poultry_tons,prod_poultry_tons,protein_poultry_gcapitaday,net_import_poultry_tons,net_import_poultry_over_domsupply,import_poultry_over_domsupply,prod_poultry_over_domsupply,growth_import_poultry,growth_domsupply_poultry,import_french_chicken_tons
0,Armenia,3072.0,97.33,2944.791,1.629045,3078.978564,9.889928,-0.71,Outside EU,45.782390,...,35000.0,11000.0,5.44,35000.0,74.468085,74.468085,23.404255,9.375000,17.500000,275.0
1,Afghanistan,1997.0,54.09,36296.113,12.477767,468.297893,-5.206857,-2.78,Outside EU,19.523017,...,29000.0,28000.0,0.54,29000.0,50.877193,50.877193,49.122807,-39.583333,-24.000000,244.0
2,Albania,3400.0,119.50,2884.169,-0.675703,3347.701760,7.818559,0.40,Outside EU,55.497908,...,38000.0,13000.0,6.26,38000.0,80.851064,80.851064,27.659574,52.000000,11.904762,88.0
3,Algeria,3345.0,92.85,41389.189,8.518733,3264.338962,-7.348644,-0.96,Outside EU,27.679052,...,2000.0,275000.0,1.97,2000.0,0.722022,0.722022,99.277978,-33.333333,-4.810997,46.0
4,Angola,2266.0,54.09,29816.766,14.610305,2805.692595,-15.509934,-0.29,Outside EU,30.449251,...,277000.0,42000.0,3.60,277000.0,86.833856,86.833856,13.166144,-19.005848,-14.247312,205.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
167,Belgium,3770.0,101.35,11419.748,2.382453,32988.016625,0.652140,0.42,EU,57.109028,...,338000.0,463000.0,4.57,-318000.0,-209.210526,222.368421,304.605263,34.126984,6.293706,19743.0
168,Luxembourg,3334.0,105.79,591.910,9.095768,81059.319125,3.178482,1.34,EU,63.692220,...,11000.0,0.0,7.19,10000.0,90.909091,100.000000,0.000000,10.000000,0.000000,1997.0
169,Serbia,2799.0,82.43,8829.628,-1.000975,4580.333116,6.113513,0.10,Outside EU,48.538154,...,12000.0,85000.0,3.50,5000.0,5.555556,13.333333,94.444444,0.000000,-10.891089,4109.0
170,Montenegro,3478.0,113.12,627.563,0.194462,5609.035115,11.469796,0.01,Outside EU,60.705446,...,8000.0,4000.0,5.79,8000.0,80.000000,80.000000,40.000000,14.285714,0.000000,0.0


In [32]:
# variables to be used in the PCA 
data_pca = data_country[['country', 'pop_growth', 'gdp_percapita_usd',  'political_stability_index',
        'food_supply_kcalcapitaday', 'protein_supply_gcapitaday',
        'protein_animal_over_protein',
       'protein_poultry_gcapitaday',
        'prod_poultry_over_domsupply', 'import_poultry_over_domsupply', 'net_import_poultry_over_domsupply', 
        'growth_import_poultry', 'growth_domsupply_poultry']]

In [33]:
# subset of data_pca with only data related to Poultry and Chicken 
data_pca_poultry = data_country[['country', 'protein_poultry_gcapitaday',
         'prod_poultry_over_domsupply','growth_import_poultry', 'growth_domsupply_poultry', 'import_poultry_over_domsupply',
                                 'net_import_poultry_over_domsupply']]

In [34]:
# subset of data_pca with data NOT related to Poultry and Chicken 
poultry_cols = [c for c in data_pca.columns if c in data_pca_poultry.columns and c != 'country']
data_pca_not_poultry = data_pca.drop(columns=poultry_cols)

<hr>

<h2>Exporting Dataset data_country</h2><a id='export'></a>

In [35]:
data_country.to_csv(r'data/output/data_country.csv', index = False)

In [36]:
data_pca.to_csv(r'data/output/data_pca.csv', index = False)

In [37]:
data_pca_poultry.to_csv(r'data/output/data_pca_poultry.csv', index = False)

In [38]:
data_pca_not_poultry.to_csv(r'data/output/data_pca_not_poultry.csv', index = False)