<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Preprocess Data
</div>

In [None]:
import pandas as pd

### Read raw data that we have collected

In [None]:
annual_pop_df = pd.read_csv("../data/raw/OA.csv")
agri_employments_df = pd.read_csv("../data/raw/OEA.csv")
rural_employments_df = pd.read_csv("../data/raw/OER.csv")
values_df = pd.read_csv("../data/raw/QCL.csv")
products_df = pd.read_csv("../data/raw/QV.csv")

display(annual_pop_df.head(3))
display(agri_employments_df.head(3))
display(rural_employments_df.head(3))
display(values_df.head(3))
display(products_df.head(3))

### Drop duplicate columns
-   The `Year Code` and `Year` columns hold the same values, so we drop `Year Code`.

In [None]:
annual_pop_df = annual_pop_df.drop(columns=['Year Code'])
agri_employments_df = agri_employments_df.drop(columns=['Year Code'])
rural_employments_df = rural_employments_df.drop(columns=['Year Code'])
values_df = values_df.drop(columns=['Year Code'])
products_df = products_df.drop(columns=['Year Code'])
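
The drop above relies on `Year Code` and `Year` really being identical. A minimal sketch of how that could be checked before dropping is shown here; `drop_if_redundant` is a hypothetical helper (not part of the original notebook) and the toy frame uses made-up values.

```python
import pandas as pd

def drop_if_redundant(df, redundant, reference):
    """Drop `redundant` only when it carries the same values as `reference`."""
    # Compare as strings so that 2020 and "2020" count as equal.
    same = df[redundant].astype(str).equals(df[reference].astype(str))
    if not same:
        raise ValueError(f"{redundant!r} differs from {reference!r}; not dropping it")
    return df.drop(columns=[redundant])

# Toy frame standing in for one of the raw tables (made-up values).
toy = pd.DataFrame({"Year Code": [2019, 2020], "Year": [2019, 2020], "Value": [1.0, 2.0]})
toy_clean = drop_if_redundant(toy, "Year Code", "Year")
print(list(toy_clean.columns))
```

The same call could be applied to each of the five dataframes in place of the plain `drop`, so that a table where the two columns disagree fails loudly instead of silently losing information.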

### Check for duplicate rows

In [None]:
for df in [annual_pop_df, agri_employments_df, rural_employments_df, values_df, products_df]:
    # Count rows that exactly repeat an earlier row (the first occurrence is not counted)
    num_duplicated_rows = df.duplicated(keep='first').sum()
    domain = df['Domain'].values[0]
    if num_duplicated_rows == 0:
        print(f"{domain} has no duplicated lines.")
    else:
        ext = "lines" if num_duplicated_rows > 1 else "line"
        print(f"{domain} has {num_duplicated_rows} duplicated {ext}. Please deduplicate your raw data.")
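
To illustrate what a duplicate-row check measures, the sketch below contrasts duplicated index labels with duplicated row contents on a toy frame (made-up values, not the real data). A frame read with `pd.read_csv` gets a fresh `RangeIndex`, so its index labels are unique even when whole rows repeat.

```python
import pandas as pd

# Toy frame: the first two rows are identical, the index labels are 0, 1, 2.
toy = pd.DataFrame({
    "Area": ["A", "A", "B"],
    "Year": [2020, 2020, 2020],
    "Value": [1.0, 1.0, 3.0],
})

# Duplicated index labels: none, because the index is a RangeIndex.
index_dups = int(toy.index.duplicated(keep="first").sum())

# Duplicated rows: compares the values of all columns.
row_dups = int(toy.duplicated(keep="first").sum())

# Dropping the repeats and rebuilding the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(index_dups, row_dups, len(deduped))
```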

### Determine the features that have a large number of missing values

In [None]:
for df in [annual_pop_df, agri_employments_df, rural_employments_df, values_df, products_df]:
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'column_name': df.columns,
                                    'percent_missing': percent_missing})
    
    miss = (missing_value_df['column_name'][missing_value_df['percent_missing'].values >= 75]).values
    print(f"{df['Domain'].values[0]}, features with mostly missing values (>= 75%): {miss}")

-   In `annual_pop_df` and `products_df`, the `Note` column is mostly missing (>= 75%), but we keep it for now.
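
The missing-value share can also be computed more compactly, since the mean of a boolean mask is the fraction of `True` values. This is a sketch on a toy frame; `mostly_missing_columns` is a hypothetical helper name and the 75% default simply mirrors the threshold used above.

```python
import pandas as pd

def mostly_missing_columns(df, threshold=75.0):
    """Return the columns whose share of missing values is >= threshold (in percent)."""
    percent_missing = df.isna().mean() * 100
    return percent_missing[percent_missing >= threshold].index.tolist()

# Toy frame (made-up values): 3 of the 4 notes are missing, i.e. 75%.
toy = pd.DataFrame({
    "Value": [1.0, 2.0, 3.0, 4.0],
    "Note": [None, None, None, "estimate"],
})
flagged = mostly_missing_columns(toy)
print(flagged)
```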

### Check the data type of each column

In [None]:
for df in [annual_pop_df, agri_employments_df, rural_employments_df, values_df, products_df]:
    df.info()

-   In all of the dataframes above, `Area Code`, `Element Code`, `Item Code`, `Year`, and `Indicator Code` have a numeric type, but their magnitudes carry no meaning: `Year` represents a period rather than a quantity, and `Area Code`, `Element Code`, `Item Code`, and `Indicator Code` are just category labels. We therefore convert every column except `Value` to string, so these codes are treated as labels rather than numbers.

In [151]:
for df in [annual_pop_df, agri_employments_df, rural_employments_df, values_df, products_df]:
    # Convert every column except 'Value' to string
    for col in df.columns.drop('Value'):
        df[col] = df[col].astype(str)
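
As an alternative to casting every non-`Value` column to string, the sketch below converts only the code-like columns to pandas' `category` dtype and leaves the others alone, so missing entries in a column such as `Note` stay missing. Depending on the pandas version, `astype(str)` can turn missing values into the literal text `nan`, which may be why `Note` is reported as fully non-null in the output further down. `CODE_COLUMNS` and `to_categorical` are hypothetical names, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Assumed list of label-like columns, taken from the description above.
CODE_COLUMNS = ["Area Code", "Element Code", "Item Code", "Indicator Code", "Year"]

def to_categorical(df, columns):
    """Return a copy of df with the listed columns (when present) as category dtype."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out

toy = pd.DataFrame({
    "Area Code": [237, 237, 4],
    "Year": [2019, 2020, 2020],
    "Value": [1.0, 2.0, 3.0],
    "Note": ["estimate", np.nan, np.nan],
})
toy_cat = to_categorical(toy, CODE_COLUMNS)
print(toy_cat.dtypes)
print(toy_cat["Note"].isna().sum())
```

Whether `category` or plain strings is preferable depends on the later analysis; `category` keeps the set of distinct labels explicit and usually uses less memory for repeated codes.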

-   Let's check one last time.

In [152]:
for df in [annual_pop_df, agri_employments_df, rural_employments_df, values_df, products_df]:
    df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 655 entries, 0 to 654
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Domain Code       655 non-null    object 
 1   Domain            655 non-null    object 
 2   Area Code         655 non-null    object 
 3   Area              655 non-null    object 
 4   Element Code      655 non-null    object 
 5   Element           655 non-null    object 
 6   Item Code         655 non-null    object 
 7   Item              655 non-null    object 
 8   Year              655 non-null    object 
 9   Unit              655 non-null    object 
 10  Value             655 non-null    float64
 11  Flag              655 non-null    object 
 12  Flag Description  655 non-null    object 
 13  Note              655 non-null    object 
dtypes: float64(1), object(13)
memory usage: 71.8+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (t

### Save processed data to a new csv file

In [153]:
annual_pop_df.to_csv("../data/processed/preprocess_OA.csv")
agri_employments_df.to_csv("../data/processed/preprocess_OEA.csv")
rural_employments_df.to_csv("../data/processed/preprocess_OER.csv")
values_df.to_csv("../data/processed/preprocess_QCL.csv")
products_df.to_csv("../data/processed/preprocess_QV.csv")
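
One caveat when reading these files back: CSV stores no dtype information, so columns that were cast to string are re-inferred as numbers unless `dtype` is passed to `pd.read_csv`. The sketch below shows this on an in-memory buffer with made-up values (the zero-padded codes are purely illustrative). It also uses `index=False`, which is an assumption about the desired layout and differs from the cell above, where the index is written as an extra column.

```python
import io
import pandas as pd

# Toy frame with string-typed codes (made-up values).
toy = pd.DataFrame({
    "Item Code": ["0015", "0027"],
    "Year": ["2019", "2020"],
    "Value": [1.5, 2.5],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the code columns are inferred as integers again.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Read with explicit dtypes: the codes stay strings, leading zeros included.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"Item Code": str, "Year": str})

print(inferred.dtypes)
print(typed["Item Code"].tolist())
```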