# Coffee Industry

## 1. Business Understanding

### The Uniqueness of Colombian Coffee

Colombian coffee is renowned worldwide for its unique qualities, which stem from several factors related to the country's geography, climate, and cultivation methods. Here are some key aspects that make Colombian coffee distinctive:

1. Ideal Growing Conditions
    - Geography:\
    Colombia is located near the equator and has mountainous terrain, providing an ideal environment for coffee cultivation. The Andes mountain range creates diverse microclimates and elevations, which are perfect for growing coffee.
    - Climate:\
    The country enjoys a climate that is neither too hot nor too cold, with well-distributed rainfall throughout the year. This consistent climate allows for two harvests annually in many regions, leading to fresh coffee availability year-round.

2. High-Altitude Growing\
Colombian coffee is typically grown at elevations between 1,200 and 1,800 meters above sea level. Higher altitudes slow the maturation of coffee beans, allowing them to develop more complex flavors and a higher acidity, which are prized in specialty coffee.

3. Varieties\
The dominant coffee species grown in Colombia is Arabica, known for its smooth and mild flavor profile. Within Arabica, Colombia grows several varieties, such as Typica, Bourbon, and Caturra, each contributing its own flavor notes to the coffee.

4. Hand-Picking\
Colombian coffee is predominantly hand-picked. This selective harvesting ensures that only the ripest cherries are picked, which contributes to the quality and consistency of the beans. It also allows for careful selection and minimal damage to the plants.

5. Processing Methods\
Most Colombian coffee is wet-processed, or "washed": the skin and pulp are removed from the cherries, and the beans are then fermented to loosen the mucilage, washed, and dried. This process enhances the coffee's bright acidity and clean flavor.

6. Flavor Profile\
Colombian coffee is known for its well-balanced flavor with medium body, bright acidity, and notes of citrus, chocolate, caramel, and sometimes fruity undertones. The specific flavors can vary depending on the region, altitude, and processing methods.

7. Regional Diversity\
Different regions in Colombia produce coffee with distinct flavor profiles due to the varied climates and altitudes. For instance, coffee from the Huila region often has floral and fruity notes, while coffee from the Antioquia region might have a nuttier and sweeter profile.

8. Cultural and Economic Importance\
Coffee is deeply embedded in Colombian culture and economy. The country has a rich coffee tradition, supported by the National Federation of Coffee Growers of Colombia (FNC), which promotes high-quality standards and the international reputation of Colombian coffee.

9. Sustainability and Certification\
Many Colombian coffee farms adhere to sustainable farming practices, and some coffees are certified organic, Fair Trade, or Rainforest Alliance. These certifications further enhance the appeal of Colombian coffee to consumers looking for ethically sourced products.

These factors combine to create a coffee that is not only high in quality but also highly distinctive, making Colombian coffee a favorite among coffee aficionados worldwide.

## 2. Data Mining

### Installs

In [1]:
# !pip install pycountry

### Libraries

In [2]:
import numpy as np
import pandas as pd
import os

# Data Profile Reporting Tool
from ydata_profiling import ProfileReport
# To avoid unneeded warning display
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

import time
import datetime
import pycountry

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
import seaborn as sns

import pymysql
from sqlalchemy import create_engine, text

### Importing my Functions

In [3]:
from coffee_functions import process_files, clean_and_prepare_dataframe, create_sqlalchemy_engine, insert_dataframe_to_mysql
import config  # Access to MySQL

### Load the Data

#### All Exports

#### Coffee Exports

In [4]:
colombia_trade_raw = process_files(r"source\datasets\UN_Comtrade_Exports_Coffee\3_Colombia", "ImportsExports_Coffee", (2017, 2023))
colombia_trade_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8907 entries, 0 to 8906
Data columns (total 48 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   TypeCode                  8907 non-null   object 
 1   FreqCode                  8907 non-null   object 
 2   RefPeriodId               8907 non-null   int64  
 3   RefYear                   8907 non-null   int64  
 4   RefMonth                  8907 non-null   int64  
 5   Period                    8907 non-null   int64  
 6   ReporterCode              8907 non-null   int64  
 7   ReporterISO               8907 non-null   object 
 8   ReporterDesc              8907 non-null   object 
 9   FlowCode                  8907 non-null   object 
 10  FlowDesc                  8907 non-null   object 
 11  PartnerCode               8907 non-null   int64  
 12  PartnerISO                8907 non-null   object 
 13  PartnerDesc               8907 non-null   object 
 14  Partner2

In [5]:
colombia_trade_raw

Unnamed: 0,TypeCode,FreqCode,RefPeriodId,RefYear,RefMonth,Period,ReporterCode,ReporterISO,ReporterDesc,FlowCode,FlowDesc,PartnerCode,PartnerISO,PartnerDesc,Partner2Code,Partner2ISO,Partner2Desc,ClassificationCode,ClassificationSearchCode,IsOriginalClassification,CmdCode,CmdDesc,AggrLevel,IsLeaf,CustomsCode,CustomsDesc,MosCode,MotCode,MotDesc,QtyUnitCode,QtyUnitAbbr,Qty,IsQtyEstimated,AltQtyUnitCode,AltQtyUnitAbbr,AltQty,IsAltQtyEstimated,NetWgt,IsNetWgtEstimated,GrossWgt,IsGrossWgtEstimated,Cifvalue,Fobvalue,PrimaryValue,LegacyEstimationFlag,IsReported,IsAggregate,Unnamed: 47
0,C,M,20170101,2017,1,201701,170,COL,Colombia,X,Export,899,_X,"Areas, nes",0,W00,World,H5,HS,True,90121,"Coffee; roasted, not decaffeinated",6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,1484.84,False,8,kg,1484.84,False,1484.84,False,0,False,0.00,2.499957e+04,2.499957e+04,0,False,True,
1,C,M,20170101,2017,1,201701,170,COL,Colombia,X,Export,32,ARG,Argentina,0,W00,World,H5,HS,True,90111,Coffee; not roasted or decaffeinated,6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,98202.00,False,8,kg,98202.00,False,98202.00,False,0,False,0.00,3.931844e+05,3.931844e+05,0,False,True,
2,C,M,20170101,2017,1,201701,170,COL,Colombia,X,Export,533,ABW,Aruba,0,W00,World,H5,HS,True,90121,"Coffee; roasted, not decaffeinated",6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,525.00,False,8,kg,525.00,False,525.00,False,0,False,0.00,4.481000e+03,4.481000e+03,0,False,True,
3,C,M,20170101,2017,1,201701,170,COL,Colombia,X,Export,36,AUS,Australia,0,W00,World,H5,HS,True,90111,Coffee; not roasted or decaffeinated,6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,621420.00,False,8,kg,621420.00,False,621420.00,False,0,False,0.00,2.443180e+06,2.443180e+06,0,False,True,
4,C,M,20170101,2017,1,201701,170,COL,Colombia,X,Export,56,BEL,Belgium,0,W00,World,H5,HS,True,90111,Coffee; not roasted or decaffeinated,6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,2452090.00,False,8,kg,2452090.00,False,2452090.00,False,0,False,0.00,9.986221e+06,9.986221e+06,0,False,True,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8902,C,M,20231201,2023,12,202312,170,COL,Colombia,M,Import,0,W00,World,0,W00,World,H6,HS,True,90122,"Coffee; roasted, decaffeinated",6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,229.81,False,8,kg,229.81,False,229.81,False,0,False,6298.43,6.112320e+03,6.298430e+03,0,False,True,
8903,C,M,20231201,2023,12,202312,170,COL,Colombia,X,Export,0,W00,World,0,W00,World,H6,HS,True,90111,Coffee; not roasted or decaffeinated,6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,66749600.00,False,8,kg,66749600.00,False,66749600.00,False,0,False,,2.940802e+08,2.940802e+08,0,False,True,
8904,C,M,20231201,2023,12,202312,170,COL,Colombia,X,Export,0,W00,World,0,W00,World,H6,HS,True,90112,"Coffee; decaffeinated, not roasted",6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,393139.00,False,8,kg,393139.00,False,393139.00,False,0,False,,2.122915e+06,2.122915e+06,0,False,True,
8905,C,M,20231201,2023,12,202312,170,COL,Colombia,X,Export,0,W00,World,0,W00,World,H6,HS,True,90121,"Coffee; roasted, not decaffeinated",6,True,C00,TOTAL CPC,0,0,TOTAL MOT,8,kg,1094070.40,False,8,kg,1094070.40,False,1094070.40,False,0,False,,9.145831e+06,9.145831e+06,0,False,True,


In [6]:
colombia_trade_raw.shape

(8907, 48)

## 3. Data Cleaning

In [7]:
colombia_trade = clean_and_prepare_dataframe(colombia_trade_raw)
colombia_trade.shape

(8044, 13)
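
The helper `clean_and_prepare_dataframe` is imported from `coffee_functions`, and its body is not shown in this notebook. As a rough sketch only, the column selection and renaming implied by the outputs (48 raw columns reduced to 13) could look like the code below. The function name `select_and_rename`, the mapping itself, and the choice of `Qty` as the source of `Qty_in_kg` are assumptions inferred from the displayed column names. The rule that reduces the rows from 8907 to 8044 is not visible here and is not reproduced.

```python
import pandas as pd

# Assumed mapping from raw UN Comtrade column names to the cleaned names
# shown in the notebook output; inferred, not taken from coffee_functions.
COLUMN_MAP = {
    "RefYear": "Year",
    "RefMonth": "Month",
    "Period": "Period",
    "ReporterISO": "ReporterISO",
    "ReporterDesc": "ReporterDesc",
    "FlowCode": "FlowCode",
    "FlowDesc": "FlowDesc",
    "PartnerISO": "PartnerISO",
    "PartnerDesc": "PartnerDesc",
    "CmdCode": "CmdCode",
    "CmdDesc": "CmdDesc",
    "Qty": "Qty_in_kg",  # assumption: Qty (not NetWgt) is the source column
    "PrimaryValue": "PrimaryValue",
}


def select_and_rename(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the 13 mapped columns and rename them."""
    return raw[list(COLUMN_MAP)].rename(columns=COLUMN_MAP)


# Tiny illustrative frame with one extra column that should be dropped.
toy_raw = pd.DataFrame([{**{name: 0 for name in COLUMN_MAP}, "TypeCode": "C"}])
toy_clean = select_and_rename(toy_raw)
print(toy_clean.columns.tolist())
```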

In [8]:
colombia_trade.head()

Unnamed: 0,Year,Month,Period,ReporterISO,ReporterDesc,FlowCode,FlowDesc,PartnerISO,PartnerDesc,CmdCode,CmdDesc,Qty_in_kg,PrimaryValue
1,2017,1,201701,COL,Colombia,X,Export,ARG,Argentina,90111,Coffee; not roasted or decaffeinated,98202.0,393184.36
2,2017,1,201701,COL,Colombia,X,Export,ABW,Aruba,90121,"Coffee; roasted, not decaffeinated",525.0,4481.0
3,2017,1,201701,COL,Colombia,X,Export,AUS,Australia,90111,Coffee; not roasted or decaffeinated,621420.0,2443180.34
4,2017,1,201701,COL,Colombia,X,Export,BEL,Belgium,90111,Coffee; not roasted or decaffeinated,2452090.0,9986220.53
5,2017,1,201701,COL,Colombia,X,Export,BOL,"Bolivia, Plurinational State of",90121,"Coffee; roasted, not decaffeinated",1944.0,24373.41


### Missing data (Null values)

In [9]:
# Checking for missing data
colombia_trade.isnull().sum().sort_values(ascending=False)

Year            0
Month           0
Period          0
ReporterISO     0
ReporterDesc    0
FlowCode        0
FlowDesc        0
PartnerISO      0
PartnerDesc     0
CmdCode         0
CmdDesc         0
Qty_in_kg       0
PrimaryValue    0
dtype: int64

### Finding Duplicates

In [10]:
# Find duplicates
colombia_trade.duplicated().sum()

0

### Filtering Data

In [None]:
# As colombia_trade includes data from Jan 2017 up to Feb 2024, let's filter the dataset up to the end of 2023.
# This condition will include all rows where the 'Period' is less than or equal to 202312 (December 2023).
colombia_trade = colombia_trade[colombia_trade['Period'] <= 202312]
colombia_trade

## 4. Exploratory Data Analysis

### Initial Exploration

In [None]:
colombia_trade.head()

In [None]:
# Retrieving the number of rows and columns in the dataframe
colombia_trade.shape

In [None]:
# Displaying the data types of each column in the dataframe
colombia_trade.dtypes

### Exploring numerical and categorical variables

In [None]:
# Retrieving the unique data types present in the dataframe columns
list(set(colombia_trade.dtypes.tolist()))

In [None]:
# Extracting column names with numerical data types from the dataframe
colombia_trade.select_dtypes("number").columns

In [None]:
# Counting and sorting the unique values for each numerical column in descending order
colombia_trade.select_dtypes("number").nunique().sort_values(ascending=False)

In [None]:
# Separating discrete from continuous variables, as discrete ones could potentially be treated as categorical.
# Adjust the threshold (here, fewer than 90 unique values) to your dataset's characteristics and domain knowledge.
numerical_columns = colombia_trade.select_dtypes("number")
potential_categorical_from_numerical = numerical_columns.loc[:, numerical_columns.nunique() < 90]
potential_categorical_from_numerical

In [None]:
# Retrieving column names with object (typically string) data types from the dataframe
colombia_trade.select_dtypes("object").columns

In [None]:
# Counting and sorting the unique values for each object (string) column in descending order
colombia_trade.select_dtypes("object").nunique().sort_values(ascending=False)

# All object columns appear categorical, as none shows a wide variability of values.
# The country-related columns have a considerable number of values, which is expected.

In [None]:
# Extracting columns with object (typically string) data types to create a categorical dataframe
# For demonstration purposes, let's consider the columns in potential_categorical_from_numerical as categorical variables.
colombia_trade_categorical = pd.concat([colombia_trade.select_dtypes("object"), potential_categorical_from_numerical], axis=1)

# Adjusting the numerical dataframe by removing the moved columns
colombia_trade_numerical = colombia_trade.select_dtypes("number").drop(columns=potential_categorical_from_numerical.columns)

In [None]:
# Verifying that the total number of columns in the dataframe is the sum of object (string) and numerical columns
len(colombia_trade.columns) == len(colombia_trade.select_dtypes("object").columns) + len(colombia_trade.select_dtypes("number").columns)
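
The cardinality-based split used in this section can be wrapped in a small reusable function. The sketch below is illustrative only: `split_numeric_by_cardinality` and the toy frame are hypothetical names and data, not part of the notebook or of `coffee_functions`.

```python
import pandas as pd


def split_numeric_by_cardinality(df: pd.DataFrame, max_unique: int):
    """Split numeric columns into (discrete, continuous) by unique-value count."""
    numeric = df.select_dtypes("number")
    # Boolean Series indexed by column name: True where the column is low-cardinality
    is_discrete = numeric.nunique() < max_unique
    return numeric.loc[:, is_discrete], numeric.loc[:, ~is_discrete]


# Toy data standing in for colombia_trade
toy = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018],
    "Qty_in_kg": [1.5, 2.5, 3.5, 4.5],
    "FlowDesc": ["Export", "Import", "Export", "Export"],
})
discrete, continuous = split_numeric_by_cardinality(toy, max_unique=3)
print(discrete.columns.tolist(), continuous.columns.tolist())
```

Keeping the threshold as an explicit argument makes it harder for the comment and the code to drift apart.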

### Categorical Variables

In [None]:
colombia_trade_categorical.head()

#### Trade Type: FlowDesc

In [None]:
# Frequency table for 'FlowDesc'
frequency_table = colombia_trade['FlowDesc'].value_counts()

# Calculating the proportion of each unique value in the 'FlowDesc'
proportion_table = colombia_trade['FlowDesc'].value_counts(normalize=True)

frequency_table, proportion_table

In [None]:
# Creating a crosstab table for the 'FlowDesc' column, counting occurrences for each unique value
FlowDesc_ct = pd.crosstab(index = colombia_trade_categorical['FlowDesc'],  # Make a crosstab
                              columns="count")      # Name the count column
FlowDesc_ct

In [None]:
# Calculating the proportions for each value in 'FlowDesc_ct' and rounding the results to two decimal places
(FlowDesc_ct/FlowDesc_ct.sum()).round(2)
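
Counts and proportions can also be shown side by side in a single table. This is a minimal sketch on toy data; `frequency_summary` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
import pandas as pd


def frequency_summary(series: pd.Series) -> pd.DataFrame:
    """Return one table holding the count and proportion of each unique value."""
    counts = series.value_counts()
    proportions = (counts / counts.sum()).round(2)
    return pd.DataFrame({"count": counts, "proportion": proportions})


# Toy data standing in for colombia_trade['FlowDesc']
toy_flow = pd.Series(["Export", "Export", "Export", "Import"], name="FlowDesc")
summary = frequency_summary(toy_flow)
print(summary)
```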

In [None]:
# Plotting a pie chart of the 'FlowDesc' column value counts, with percentage labels, 
# starting at angle 90, and using colors from the "Set3" Seaborn palette
colombia_trade['FlowDesc'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=sns.color_palette("Set3"))

#### Commodity Type: CmdCode

### Numerical Variables

In [None]:
colombia_trade_numerical.head()

#### Correlation matrix

In [None]:
# Compute the correlation matrix
correlation_matrix = colombia_trade_numerical.corr()

# Display the correlation matrix
print(correlation_matrix)
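
`DataFrame.corr()` computes the Pearson coefficient by default, which is sensitive to extreme values. If the trade quantities and values turn out to be heavily skewed, a rank-based (Spearman) coefficient may be a useful complement. The sketch below uses made-up numbers purely to show the difference between the two methods.

```python
import pandas as pd

# Toy data: a perfectly monotonic relationship with one extreme value
toy = pd.DataFrame({
    "Qty_in_kg": [1.0, 2.0, 3.0, 4.0, 5.0],
    "PrimaryValue": [1.0, 4.0, 9.0, 16.0, 1000.0],
})

# Pearson (the default) measures linear association
pearson = toy.corr().loc["Qty_in_kg", "PrimaryValue"]
# Spearman works on ranks, so it only measures monotonic association
spearman = toy.corr(method="spearman").loc["Qty_in_kg", "PrimaryValue"]

print(round(pearson, 3), round(spearman, 3))
```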

In [None]:
# Summary statistics for the dataset
colombia_trade_numerical.describe()

### Univariate Analysis

In [None]:
colombia_trade.columns

#### Year

In [None]:
colombia_trade['Year'].nunique() # Expected: 7 years of data (2017 to 2023), given the Period filter applied above

In [None]:
colombia_trade['Year'].unique() # All years are included

#### Month

In [None]:
colombia_trade['Month'].nunique() # Expected: 12 months in each year

In [None]:
colombia_trade['Month'].unique() # All months are included

#### Period

In [None]:
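colombia_trade['Period'].nunique() # Combination of Year and Month
# At most 84 after the Period filter above (7 full years x 12 months, Jan 2017 to Dec 2023).
# The unfiltered data was expected to reach 91-92 periods (84 plus 7 or 8 up to Jul-Aug 2024) but was missing 5 periods (from Mar 2024).

In [None]:
colombia_trade['Period'].unique()
# Unfiltered data ran from Jan 2017 up to Feb 2024, missing the last 5 months; the filter above keeps Jan 2017 to Dec 2023.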

In [None]:
# TODO
# Take out the 2 months of 2024? (The Period filter above already excludes them.)
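
Gaps in the monthly series can be checked directly rather than inferred from a count. The sketch below builds the expected `YYYYMM` codes with `pd.period_range` and compares them with the observed ones. `missing_periods` is a hypothetical helper, and the default date range assumes the Jan 2017 to Dec 2023 window kept by the filter.

```python
import pandas as pd


def missing_periods(periods, start="2017-01", end="2023-12"):
    """Return the YYYYMM integers expected in [start, end] but absent from periods."""
    expected = pd.period_range(start=start, end=end, freq="M")
    expected_codes = {p.year * 100 + p.month for p in expected}
    observed_codes = {int(p) for p in periods}
    return sorted(expected_codes - observed_codes)


# Toy example: March 2017 is absent from a four-month window
observed = [201701, 201702, 201704]
gaps = missing_periods(observed, start="2017-01", end="2017-04")
print(gaps)  # [201703]
```

On the notebook's data, the call would be `missing_periods(colombia_trade['Period'].unique())`; an empty list would mean every month in the window is present.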

#### ReporterISO

In [None]:
colombia_trade['ReporterISO'].nunique() # Expected output: 1, as this dataset covers only the Colombian market

In [None]:
colombia_trade['ReporterISO'].unique() # Expected output: a single code, as this dataset covers only the Colombian market

#### ReporterDesc

In [None]:
colombia_trade['ReporterDesc'].nunique() # Expected output: 1, as this dataset covers only the Colombian market

In [None]:
colombia_trade['ReporterDesc'].unique() # Expected output: Colombia 

#### FlowDesc

In [None]:
colombia_trade['FlowDesc'].nunique() # Expected number of types of trade (Exports and Imports)

In [None]:
colombia_trade['FlowDesc'].unique() # Expected types of trade (Exports and Imports)

#### PartnerISO

In [None]:
colombia_trade['PartnerISO'].nunique() # Number of countries that have traded with Colombia in those years
# Count of ISO 3 letter country code for trade partners

In [None]:
colombia_trade['PartnerISO'].unique() # List of trade partners (3-letter ISO country codes)
# Unexpected: '_X ', 'X2 ', 'S19'
# TODO: check whether the code abbreviations match the ISO standard
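
As a first pass at the TODO above, partner codes that do not even have the shape of an ISO alpha-3 code (three uppercase letters) can be flagged after stripping stray whitespace. This is only a format check, sketched on sample values taken from the comment above; `flag_non_iso_codes` is a hypothetical helper. A stricter follow-up could look each remaining code up in `pycountry`, which is already imported in this notebook.

```python
import pandas as pd


def flag_non_iso_codes(codes: pd.Series) -> list:
    """Return the distinct codes that are not three uppercase letters after stripping."""
    cleaned = codes.astype(str).str.strip()
    is_iso_shaped = cleaned.str.fullmatch(r"[A-Z]{3}")
    return sorted(cleaned[~is_iso_shaped].unique())


# Sample values, including the unexpected codes noted above
sample_codes = pd.Series(["ARG", "_X ", "X2 ", "S19", "AUS", "ARG"])
flagged = flag_non_iso_codes(sample_codes)
print(flagged)  # ['S19', 'X2', '_X']
```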

#### PartnerDesc

In [None]:
colombia_trade['PartnerDesc'].nunique() # Number of countries that have traded with Colombia in those years

In [None]:
colombia_trade['PartnerDesc'].unique() # List of trade partners (country names)
# Unexpected: 'Areas, nes' and 'Other Asia, nes'.
# TODO: confirm the official names: are 'China', 'China, Hong Kong SAR' and 'China, Macao SAR' separate partners rather than duplicates?
# TODO: check whether the naming convention matches the ISO standard

#### CmdCode

In [None]:
colombia_trade['CmdCode'].nunique() # Expected output: 4, one for each commodity code included in this report.

In [None]:
colombia_trade['CmdCode'].unique() # Expected commodity codes.

#### Qty_in_kg

In [None]:
colombia_trade['Qty_in_kg'].nunique()

In [None]:
colombia_trade['Qty_in_kg'].describe()

In [None]:
colombia_trade['Qty_in_kg'].min()

In [None]:
colombia_trade['Qty_in_kg'].max()

In [None]:
# Adjust the figure size
plt.figure(figsize=(3, 5))
colombia_trade['Qty_in_kg'].plot(kind='box')
# Adjust the scale of the y-axis
plt.ylim(0, 45000000)  # Set the lower and upper limits of the y-axis
plt.show()

#### PrimaryValue

In [None]:
colombia_trade['PrimaryValue'].nunique()

In [None]:
colombia_trade['PrimaryValue'].describe()

In [None]:
colombia_trade['PrimaryValue'].min()

In [None]:
colombia_trade['PrimaryValue'].max()

In [None]:
# Adjust the figure size
plt.figure(figsize=(3, 5))
colombia_trade['PrimaryValue'].plot(kind='box')
# Adjust the scale of the y-axis
plt.ylim(0, 180000000)  # Set the lower and upper limits of the y-axis
plt.show()
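
The box plots for `Qty_in_kg` and `PrimaryValue` draw their whiskers at 1.5 times the interquartile range by default, so the same fences can be computed numerically to list the points that fall outside them. The sketch below uses toy values; `iqr_fences` is a hypothetical helper name.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences at k times the IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data standing in for colombia_trade['Qty_in_kg']
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_fences(sample)
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers.tolist())
```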

### Bivariate Analysis
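
One possible starting point for this section, sketched under assumptions: aggregate quantity and value by year and trade flow, then derive a value per kilogram. `yearly_totals` is a hypothetical helper and the figures are made up; only the column names come from the cleaned dataframe shown earlier.

```python
import pandas as pd


def yearly_totals(trade: pd.DataFrame) -> pd.DataFrame:
    """Sum quantity and value per (Year, FlowDesc) and add value per kilogram."""
    totals = trade.groupby(["Year", "FlowDesc"], as_index=False)[
        ["Qty_in_kg", "PrimaryValue"]
    ].sum()
    # Groups with a total quantity of zero would yield inf or NaN here
    totals["value_per_kg"] = totals["PrimaryValue"] / totals["Qty_in_kg"]
    return totals


# Toy data using the cleaned column names
toy_trade = pd.DataFrame({
    "Year": [2017, 2017, 2018],
    "FlowDesc": ["Export", "Export", "Export"],
    "Qty_in_kg": [100.0, 300.0, 200.0],
    "PrimaryValue": [400.0, 1200.0, 1000.0],
})
totals = yearly_totals(toy_trade)
print(totals)
```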

## 5. Insights

- The global coffee market is projected to grow, driven by increasing coffee consumption, particularly in emerging markets such as Asia-Pacific, where coffee culture is expanding rapidly.

Major Coffee Producing and Consuming Regions:
- Producing Regions: The majority of coffee is produced in developing countries, especially in South America (Brazil, Colombia), Asia (Vietnam, Indonesia), and Africa (Ethiopia).

- Consuming Regions: The largest coffee markets in terms of consumption are in North America, Europe, and increasingly in Asia-Pacific. Europe leads as the largest consuming region, with about 3.2 million metric tons (3,186,000) consumed.