<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Preprocess Data
</div>

In [1]:
import pandas as pd

### Read the raw data we collected

In [23]:
raw_df = pd.read_csv("../data/raw/raw_data.csv")
raw_df.head()

Unnamed: 0.1,Unnamed: 0,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
0,0,QCL,Crops and livestock products,237,Viet Nam,5312,Area harvested,711,"Anise, badian, coriander, cumin, caraway, fenn...",1961,1961,ha,1000.0,E,Estimated value,
1,1,QCL,Crops and livestock products,237,Viet Nam,5419,Yield,711,"Anise, badian, coriander, cumin, caraway, fenn...",1961,1961,100 g/ha,7000.0,E,Estimated value,
2,2,QCL,Crops and livestock products,237,Viet Nam,5510,Production,711,"Anise, badian, coriander, cumin, caraway, fenn...",1961,1961,t,700.0,E,Estimated value,
3,3,QCL,Crops and livestock products,237,Viet Nam,5312,Area harvested,711,"Anise, badian, coriander, cumin, caraway, fenn...",1962,1962,ha,1000.0,E,Estimated value,
4,4,QCL,Crops and livestock products,237,Viet Nam,5419,Yield,711,"Anise, badian, coriander, cumin, caraway, fenn...",1962,1962,100 g/ha,7000.0,E,Estimated value,


### Drop redundant columns
-   The `Unnamed: 0` column is only a leftover row index that we don't need, so we drop it.
-   The `Year Code` and `Year` columns hold the same values, so we drop `Year Code`.
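Before dropping anything, both claims above can be checked rather than assumed. The following is a minimal sketch on a tiny in-memory CSV (toy rows, not the real `raw_data.csv`); the names `csv_text`, `toy_df`, and `cleaned_df` are illustrative only.

```python
import io

import pandas as pd

# Toy CSV whose first header cell is empty, as happens when a DataFrame
# is saved together with its index. These rows are illustrative only.
csv_text = (
    ",Area,Year Code,Year,Value\n"
    "0,Viet Nam,1961,1961,1000.0\n"
    "1,Viet Nam,1962,1962,1000.0\n"
)
toy_df = pd.read_csv(io.StringIO(csv_text))

# pandas names the header-less first column "Unnamed: 0".
# Check that it is only a 0..n-1 row counter.
is_row_counter = toy_df["Unnamed: 0"].tolist() == list(range(len(toy_df)))

# Check that "Year Code" and "Year" carry identical values.
years_match = (toy_df["Year Code"].astype(str) == toy_df["Year"].astype(str)).all()

# Drop the redundant columns only if both checks pass.
if is_row_counter and years_match:
    cleaned_df = toy_df.drop(columns=["Unnamed: 0", "Year Code"])
else:
    cleaned_df = toy_df

print(cleaned_df.columns.tolist())
```

If the file is written by our own code, saving it with `to_csv(..., index=False)` would avoid creating the extra index column in the first place.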

In [24]:
raw_df = raw_df.drop(columns=['Unnamed: 0', 'Year Code'])
raw_df.head()

Unnamed: 0,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year,Unit,Value,Flag,Flag Description,Note
0,QCL,Crops and livestock products,237,Viet Nam,5312,Area harvested,711,"Anise, badian, coriander, cumin, caraway, fenn...",1961,ha,1000.0,E,Estimated value,
1,QCL,Crops and livestock products,237,Viet Nam,5419,Yield,711,"Anise, badian, coriander, cumin, caraway, fenn...",1961,100 g/ha,7000.0,E,Estimated value,
2,QCL,Crops and livestock products,237,Viet Nam,5510,Production,711,"Anise, badian, coriander, cumin, caraway, fenn...",1961,t,700.0,E,Estimated value,
3,QCL,Crops and livestock products,237,Viet Nam,5312,Area harvested,711,"Anise, badian, coriander, cumin, caraway, fenn...",1962,ha,1000.0,E,Estimated value,
4,QCL,Crops and livestock products,237,Viet Nam,5419,Yield,711,"Anise, badian, coriander, cumin, caraway, fenn...",1962,100 g/ha,7000.0,E,Estimated value,


### Check for duplicate rows

In [25]:
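# Flag every row whose values repeat an earlier row.
# (Checking raw_df.index would only compare row labels, not row contents.)
detectDupSeries = raw_df.duplicated(keep='first')
num_duplicated_rows = detectDupSeries.sum()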

In [26]:
if num_duplicated_rows == 0:
    print("Your raw data has no duplicated lines.")
else:
    ext = "lines" if num_duplicated_rows > 1 else "line"
    print(f"Your raw data has {num_duplicated_rows} duplicated {ext}. Please deduplicate your raw data.")

Your raw data has no duplicated lines.
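A point worth keeping in mind for this check: `Index.duplicated` compares only the row labels, while `DataFrame.duplicated` compares the values in each row. The sketch below uses a small made-up frame (`toy_df` is not part of the dataset) to show the difference and one way to remove repeated rows.

```python
import pandas as pd

# Toy frame: the first two rows have identical contents,
# but the default index labels (0, 1, 2) are all distinct.
toy_df = pd.DataFrame({
    "Item": ["Bananas", "Bananas", "Avocados"],
    "Year": ["1961", "1961", "1961"],
    "Value": [1.0, 1.0, 2.0],
})

# Counts repeated index labels only.
index_dups = toy_df.index.duplicated(keep="first").sum()

# Counts rows whose values repeat an earlier row.
row_dups = toy_df.duplicated(keep="first").sum()

# Keep the first occurrence of each distinct row.
deduplicated_df = toy_df.drop_duplicates(keep="first")

print(index_dups, row_dups, len(deduplicated_df))
```

With a default integer index the label-based count is always zero, so the row-based check is the one that answers the question in this section.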


### Meaning of each column

// Explain here

In [None]:
# TODO: use code to explain if needed

### Check the data type of each column

In [29]:
raw_df.dtypes

Domain Code          object
Domain               object
Area Code             int64
Area                 object
Element Code          int64
Element              object
Item Code             int64
Item                 object
Year                 object
Unit                 object
Value               float64
Flag                 object
Flag Description     object
Note                 object
dtype: object

`Area Code`, `Element Code`, and `Item Code` are stored as integers, but their magnitudes carry no meaning: they are just category identifiers. Likewise, `Year` labels a period rather than measuring a quantity. These columns can therefore be treated as categorical, so we convert every column except `Value` to strings.

In [36]:
# Convert every column except `Value` to string labels
for col in raw_df.drop(columns='Value').columns:
    raw_df[col] = raw_df[col].astype(str)
raw_df.dtypes

Domain Code          object
Domain               object
Area Code            object
Area                 object
Element Code         object
Element              object
Item Code            object
Item                 object
Year                 object
Unit                 object
Value               float64
Flag                 object
Flag Description     object
Note                 object
dtype: object
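One caveat, stated tentatively because it depends on the pandas version: `astype(str)` may turn missing entries (such as the empty `Note` cells) into the literal text `nan`, which later missing-value checks would no longer detect. A possible alternative, sketched here on toy data with hypothetical `Note` text, is the `category` dtype, which keeps missing values missing.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring a few of the columns; the Note text is made up.
toy_df = pd.DataFrame({
    "Area Code": [237, 237, 237],
    "Year": ["1961", "1961", "1962"],
    "Value": [1000.0, 7000.0, 700.0],
    "Note": [np.nan, "Unofficial figure", np.nan],
})

# Treat every column except the measurement itself as a categorical label.
label_cols = [col for col in toy_df.columns if col != "Value"]
for col in label_cols:
    toy_df[col] = toy_df[col].astype("category")

print(toy_df.dtypes)
print(toy_df["Note"].isna().sum())  # missing notes are still counted as missing
```

Whether strings or categories fit better depends on how the columns are used later; this is only one option.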

### Check the distribution of `Value` for each item

In [None]:
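# Helper aggregations: each one receives a single numeric column (a Series)
def missing_ratio(s):
    # Percentage of missing values, rounded to one decimal place
    return (s.isna().mean() * 100).round(1)

def median(s):
    return (s.quantile(0.5)).round(1)

def lower_quartile(s):
    return (s.quantile(0.25)).round(1)

def upper_quartile(s):
    return (s.quantile(0.75)).round(1)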

In [57]:
# Build a DataFrame that summarizes how each item's values are distributed
data_distribute_df = pd.DataFrame()

for item in raw_df['Item'].unique():
    item_df = raw_df[raw_df['Item'] == item]

    num_col_info_df = item_df.select_dtypes(exclude='object')
    num_col_info_df = num_col_info_df.agg([missing_ratio, "min", lower_quartile, median, upper_quartile, "max"])

    num_col_info_df = num_col_info_df.transpose()

    num_col_info_df['Item'] = item
    
    data_distribute_df = pd.concat([data_distribute_df, num_col_info_df])

In [58]:
data_distribute_df

Unnamed: 0,missing_ratio,min,lower_quartile,median,upper_quartile,max,Item
Value,0.0,300.0,1000.0,4200.0,6885.5,25817.00,"Anise, badian, coriander, cumin, caraway, fenn..."
Value,0.0,0.0,13644.2,63587.0,119814.8,212977.00,Avocados
Value,0.0,36300.0,95900.0,138357.0,525000.0,2346877.71,Bananas
Value,0.0,3078.0,9327.0,73000.0,157266.0,221500.00,"Beans, dry"
Value,0.0,52700.0,93525.0,147780.0,815450.0,4600000.00,"Beer of barley, malted"
...,...,...,...,...,...,...,...
Value,0.0,150000.0,190000.0,410100.0,1288351.0,2720955.00,Sheep and Goats
Value,0.0,0.0,0.0,140.0,235.0,278.00,"Skim Milk & Buttermilk, Dry"
Value,0.0,27000.0,275924.5,424770.0,2344100.0,20128588.00,Sugar Crops Primary
Value,0.0,1700.0,6774.0,10166.0,139258.5,405225.85,"Treenuts, Total"
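The same kind of summary can also be produced with `groupby`, which avoids concatenating inside a loop. Since a single item mixes elements measured in different units (`ha`, `100 g/ha`, `t`), it may be worth grouping by `Element` as well. The following is a minimal sketch on made-up numbers; `toy_df`, `summary_df`, and the helper functions are illustrative, not the notebook's own objects.

```python
import pandas as pd

# Toy data: two items, each with two elements (values are made up).
toy_df = pd.DataFrame({
    "Item": ["Bananas"] * 4 + ["Avocados"] * 4,
    "Element": ["Production", "Production", "Yield", "Yield"] * 2,
    "Value": [10.0, 30.0, 1.0, 3.0, 100.0, 300.0, 5.0, 7.0],
})

def missing_pct(s):
    # Percentage of missing values in one group
    return round(s.isna().mean() * 100, 1)

def q25(s):
    return s.quantile(0.25)

def q75(s):
    return s.quantile(0.75)

# One row per (Item, Element) pair, one column per statistic.
summary_df = toy_df.groupby(["Item", "Element"])["Value"].agg(
    missing_ratio=missing_pct,
    min="min",
    lower_quartile=q25,
    median="median",
    upper_quartile=q75,
    max="max",
)

print(summary_df)
```

Grouping by `Item` alone would reproduce the layout of the table above; adding `Element` keeps values with different units in separate rows.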
