## Exploratory Data Analysis

### Section 1 - Initial Data Handling

Our first step is to import our libraries for analysis:

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Next we need to load our dataset for inspection:

In [3]:
df = pd.read_csv("../data/food_price_feature_engineered_clean.csv")

In [4]:
print(df.info())      # shows column names, types, and missing values



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 18 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   country                             25 non-null     object 
 1   iso3                                25 non-null     object 
 2   components                          25 non-null     object 
 3   currency                            25 non-null     object 
 4   number_of_food_items                25 non-null     int64  
 5   data_coverage_food                  25 non-null     float64
 6   average_annualized_food_inflation   25 non-null     float64
 7   maximum_food_drawdown               25 non-null     float64
 8   average_annualized_food_volatility  25 non-null     float64
 9   date                                25 non-null     object 
 10  open                                25 non-null     float64
 11  high                                25 non-null

Here we can see that our dataset has 25 rows and 18 columns.
We notice that each row holds the data for a separate country.
We also notice that our 'date' column is stored as 'object' (string) rather than as a date, so it will be necessary to convert it to 'datetime':

In [5]:
# Convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 18 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   country                             25 non-null     object        
 1   iso3                                25 non-null     object        
 2   components                          25 non-null     object        
 3   currency                            25 non-null     object        
 4   number_of_food_items                25 non-null     int64         
 5   data_coverage_food                  25 non-null     float64       
 6   average_annualized_food_inflation   25 non-null     float64       
 7   maximum_food_drawdown               25 non-null     float64       
 8   average_annualized_food_volatility  25 non-null     float64       
 9   date                                25 non-null     datetime64[ns]
 10  open                        
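
As a small aside, the sketch below shows one way to check that a date conversion did not silently produce bad values. It uses a hypothetical three-row sample (not our real file) and assumes we would rather flag unparseable strings as `NaT` than have the conversion raise an error:

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real file has 25 rows and 18 columns.
sample_csv = io.StringIO(
    "country,date,close\n"
    "A,2023-01-01,1.46\n"
    "B,2023-02-01,2.56\n"
    "C,not-a-date,1.06\n"
)
sample = pd.read_csv(sample_csv)

# errors="coerce" turns strings that cannot be parsed into NaT instead of raising
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")

# Count the rows that failed to parse so they can be inspected by hand
n_bad_dates = int(sample["date"].isna().sum())
print(sample.dtypes)
print(f"Unparseable dates: {n_bad_dates}")
```

In our dataset the non-null count for 'date' stays at 25 after conversion, which suggests every value was parsed.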

Now that we have converted our 'date' column into the correct format, let's go ahead and transform our 'object' data types to categories.

In [6]:
# Convert object columns to category dtype
cat_cols = ["country", "iso3", "components", "currency", "inflation_band"]
for cat in cat_cols:
    df[cat] = df[cat].astype("category")

In [7]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 18 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   country                             25 non-null     category      
 1   iso3                                25 non-null     category      
 2   components                          25 non-null     category      
 3   currency                            25 non-null     category      
 4   number_of_food_items                25 non-null     int64         
 5   data_coverage_food                  25 non-null     float64       
 6   average_annualized_food_inflation   25 non-null     float64       
 7   maximum_food_drawdown               25 non-null     float64       
 8   average_annualized_food_volatility  25 non-null     float64       
 9   date                                25 non-null     datetime64[ns]
 10  open                        
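
To illustrate what the category conversion changes, here is a minimal sketch on a made-up `currency` column (the values and the row count are assumptions, not taken from our data). It compares memory usage before and after the conversion and lists the categories pandas stores:

```python
import pandas as pd

# Made-up column with many repeated labels; our real data has only 25 rows,
# so any memory saving there would likely be negligible.
toy = pd.DataFrame({"currency": ["USD", "EUR", "USD", "EUR", "USD"] * 1000})

# Memory used while the column holds one string per row
object_bytes = toy["currency"].memory_usage(deep=True)

# After conversion each row stores a small integer code pointing at a category
toy["currency"] = toy["currency"].astype("category")
category_bytes = toy["currency"].memory_usage(deep=True)

categories = list(toy["currency"].cat.categories)
print(f"before: {object_bytes} bytes, after: {category_bytes} bytes")
print(f"categories: {categories}")
```

With a dataset as small as ours, the main benefit is probably that the dtype documents which columns are labels rather than free text.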

We have now successfully converted our columns into the correct data types.
Next let us check for any missing values and duplicates:

In [None]:
# Inspect missing values.
df.isna().sum().sort_values(ascending=False)


country                               0
iso3                                  0
month                                 0
year                                  0
inflation                             0
close                                 0
low                                   0
high                                  0
open                                  0
date                                  0
average_annualized_food_volatility    0
maximum_food_drawdown                 0
average_annualized_food_inflation     0
data_coverage_food                    0
number_of_food_items                  0
currency                              0
components                            0
inflation_band                        0
dtype: int64
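
For datasets that do contain gaps, a count alone can be hard to judge. The sketch below, on a made-up frame with a few deliberately missing values, puts the count and the percentage of missing values side by side:

```python
import numpy as np
import pandas as pd

# Made-up frame with deliberate gaps; our real dataset has none.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "open": [1.07, np.nan, 1.45, 2.57],
        "close": [1.06, 1.39, np.nan, np.nan],
    }
)

# isna().sum() counts missing values; isna().mean() gives the missing fraction
missing = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "pct_missing": toy.isna().mean().mul(100).round(1),
    }
).sort_values("n_missing", ascending=False)

print(missing)
```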

Great, we see that there are no missing values. Now let us check for duplicates:

In [14]:
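# Inspect duplicates.
df.duplicated().sum()  # count of fully duplicated rows (not displayed: only the last expression in a cell is shown)
df[df.duplicated()]    # rows that repeat an earlier row; an empty result means no duplicates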


Empty DataFrame
Columns: [country, iso3, components, currency, number_of_food_items, data_coverage_food, average_annualized_food_inflation, maximum_food_drawdown, average_annualized_food_volatility, date, open, high, low, close, inflation, year, month, inflation_band]
Index: []
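
Note that `df.duplicated()` only flags rows that match on every column. Since each row should describe a separate country, it may also be worth checking for repeats on a key column. The sketch below does this on made-up data, and treating `iso3` as the key is our assumption here:

```python
import pandas as pd

# Made-up rows: "AAA" appears twice but with different close values,
# so the rows are not exact duplicates.
toy = pd.DataFrame(
    {
        "iso3": ["AAA", "BBB", "AAA"],
        "close": [1.0, 2.0, 1.5],
    }
)

# Rows that match an earlier row on every column
n_full_duplicates = int(toy.duplicated().sum())

# Rows that share a key value; keep=False marks every member of a repeated group
key_duplicates = toy[toy.duplicated(subset=["iso3"], keep=False)]

print(f"Exact duplicate rows: {n_full_duplicates}")
print(key_duplicates)
```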


In [15]:
# Identify numeric vs categorical columns
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

num_cols, cat_cols

(['number_of_food_items',
  'data_coverage_food',
  'average_annualized_food_inflation',
  'maximum_food_drawdown',
  'average_annualized_food_volatility',
  'open',
  'high',
  'low',
  'close',
  'inflation',
  'year',
  'month'],
 ['country', 'iso3', 'components', 'currency', 'inflation_band'])
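
Notice that 'date' appears in neither list, because a datetime column is neither numeric nor object/category. A minimal sketch on a made-up frame shows how datetime columns could be collected separately, and how to confirm that every column lands in some group:

```python
import pandas as pd

# Made-up frame with one column of each kind of dtype we care about
toy = pd.DataFrame(
    {
        "country": pd.Categorical(["A", "B"]),
        "close": [1.06, 1.39],
        "year": [2023, 2023],
        "date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    }
)

num = toy.select_dtypes(include="number").columns.tolist()
cat = toy.select_dtypes(include="category").columns.tolist()
dt = toy.select_dtypes(include="datetime").columns.tolist()

# Any column left over here was not captured by the three groups above
unassigned = sorted(set(toy.columns) - set(num) - set(cat) - set(dt))

print(num, cat, dt, unassigned)
```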

The duplicate check returned no rows, so we do not have any duplicated entries in our dataset. We have also separated our numeric columns from our categorical ones.
Now that we have converted our data types and confirmed that there are no missing values or duplicated entries, we can move on to our initial exploratory data analysis.
First, we will look at the summary statistics of each numeric column:

In [16]:
desc = df[num_cols].describe().T  # count, mean, std, min, quartiles, max
desc


| column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| number_of_food_items | 25.0 | 9.32 | 7.016172 | 3.0 | 4.0 | 7.0 | 11.0 | 26.0 |
| data_coverage_food | 25.0 | 35.3404 | 17.45147 | 8.84 | 21.55 | 31.08 | 47.97 | 69.91 |
| average_annualized_food_inflation | 25.0 | 12.1988 | 15.052045 | 1.24 | 3.58 | 6.68 | 10.79 | 55.3 |
| maximum_food_drawdown | 25.0 | -22.858 | 11.123922 | -40.67 | -31.98 | -23.71 | -13.96 | -2.79 |
| average_annualized_food_volatility | 25.0 | 10.5904 | 5.413819 | 1.84 | 7.15 | 9.89 | 12.58 | 24.77 |
| open | 25.0 | 7.0976 | 16.762704 | 1.07 | 1.4 | 1.45 | 2.57 | 77.3 |
| high | 25.0 | 7.2752 | 17.269724 | 1.08 | 1.42 | 1.48 | 2.62 | 79.99 |
| low | 25.0 | 6.9152 | 16.255538 | 1.05 | 1.36 | 1.43 | 2.52 | 74.61 |
| close | 25.0 | 6.9876 | 16.377158 | 1.06 | 1.39 | 1.46 | 2.56 | 75.16 |
| inflation | 25.0 | 20.084 | 35.317431 | -15.91 | 0.04 | 6.41 | 27.13 | 139.28 |
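
In the summary above, the mean of 'open' (7.0976) sits far above its median (1.45), which hints at a strong right skew driven by a few large values. The sketch below shows one common way to quantify this, using skewness and the 1.5 × IQR rule. The five values are only the min, quartiles and max of 'open' from the table, used as a stand-in for the real column, so the numbers it prints are illustrative:

```python
import pandas as pd

# Illustrative values only: the five-number summary of `open` from the table,
# standing in for the real 25-row column.
open_like = pd.Series([1.07, 1.40, 1.45, 2.57, 77.30], name="open")

# A mean well above the median and a positive skewness both point to a right skew
mean_minus_median = open_like.mean() - open_like.median()
skewness = open_like.skew()

# 1.5 * IQR rule: flag values far outside the middle 50% of the data
q1, q3 = open_like.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = open_like[(open_like < lower) | (open_like > upper)]

print(f"mean - median: {mean_minus_median:.2f}, skewness: {skewness:.2f}")
print(outliers)
```

The 1.5 × IQR threshold is a convention rather than a rule, so flagged values are candidates for closer inspection, not automatic removals.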
