# **Food Price Inflation: Data Cleaning and Merging**

## Objectives

Create a single, analysis-ready dataset by:

* Loading and cleaning the two raw CSVs from the Global Food Price Inflation 2024 Kaggle archive

* Standardising column names, parsing dates, and coercing numeric fields

* Harmonising key categorical fields (e.g., country, item)

* Merging item-/series-level details with country-level context using reliable keys

* Saving cleaned per-file outputs and a merged dataset for downstream analysis

## Inputs

* data/raw/WLD_RTFP_country_2023-10-02.csv

* data/raw/WLD_RTP_details_2023-10-02.csv

* Python packages: pandas and numpy (third-party); os, re, and pathlib (standard library)

## Outputs

* data/processed/WLD_RTFP_country_2023-10-02_clean.csv

* data/processed/WLD_RTP_details_2023-10-02_clean.csv

* data/processed/food_price_merged_clean.csv (final, analysis-ready)

## Additional Comments

* Join strategy: left join from details → country, prioritising keys ['country','date','item'] then ['country','date'], falling back to ['country'] only if needed.

* If your columns differ slightly (e.g., country_name instead of country, month instead of date), the notebook will adapt and log exactly which keys were used.

* Currency/unit normalisation can be added as a follow-up step once business rules are agreed.
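
The join strategy and key adaptation described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual merge cell: `ALIASES`, `KEY_OPTIONS`, `harmonise_keys`, and `merge_with_fallback` are hypothetical names, and the toy frames (with made-up values) stand in for the cleaned details and country tables.

```python
import pandas as pd

# Hypothetical alias map: alternative column names -> canonical key names
ALIASES = {"country_name": "country", "month": "date"}

# Candidate key sets, tried in priority order
KEY_OPTIONS = [["country", "date", "item"], ["country", "date"], ["country"]]


def harmonise_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Rename alias columns to canonical names unless the canonical name already exists."""
    rename = {old: new for old, new in ALIASES.items()
              if old in df.columns and new not in df.columns}
    return df.rename(columns=rename)


def merge_with_fallback(details: pd.DataFrame, country: pd.DataFrame):
    """Left-join details -> country on the first key set present in both frames."""
    details, country = harmonise_keys(details), harmonise_keys(country)
    for keys in KEY_OPTIONS:
        if all(k in details.columns and k in country.columns for k in keys):
            print(f"Merging on keys: {keys}")  # log which keys were used
            merged = details.merge(country, on=keys, how="left",
                                   suffixes=("", "_country"))
            return merged, keys
    raise KeyError("No shared key set found between the two frames")


# Toy data standing in for the cleaned frames (values are made up)
details_toy = pd.DataFrame({
    "country_name": ["Kenya", "Kenya", "Peru"],
    "date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-01-01"]),
    "item": ["maize", "maize", "rice"],
    "price": [1.0, 1.1, 2.0],
})
country_toy = pd.DataFrame({
    "country": ["Kenya", "Kenya"],
    "date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "inflation": [5.0, 5.5],
})

merged, used_keys = merge_with_fallback(details_toy, country_toy)
```

Here the country table has no `item` column, so the sketch falls back to `['country', 'date']`. Falling back to `['country']` alone can multiply rows if the country table holds several rows per country; passing `validate="many_to_one"` to `merge` is one way to catch that early.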



---

# Change working directory

* We assume this notebook sits in a subfolder (e.g., jupyter_notebooks/). We make its parent directory the working directory so that relative paths such as data/raw/ resolve from the project root.

In [None]:
# Access the current directory
import os
current_dir = os.getcwd()
current_dir

'/Users/aminaibrahim/Documents/vscode-projects/food-price-inflation-analysis/jupyter_notebooks'

In [None]:
# Make the parent directory current
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [None]:
# Confirm new current directory
current_dir = os.getcwd()
current_dir

'/Users/aminaibrahim/Documents/vscode-projects/food-price-inflation-analysis'

# Section 1

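## Loading the Raw Data

We load the two raw CSVs listed under Inputs into pandas DataFrames (`country_df` and `details_df`).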

In [11]:
import pandas as pd

# Load the two raw datasets (paths match the Inputs section)
country_path = "data/raw/WLD_RTFP_country_2023-10-02.csv"
details_path = "data/raw/WLD_RTP_details_2023-10-02.csv"

country_df = pd.read_csv(country_path)
details_df = pd.read_csv(details_path)
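
Because the paths are relative, a wrong working directory shows up as a `FileNotFoundError`. A small guard like the one below can make that failure easier to diagnose. `read_csv_checked` is a hypothetical helper, not part of the notebook, and the demo writes a throwaway CSV to a temporary folder so it runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_checked(path) -> pd.DataFrame:
    """Read a CSV, raising an error that names the working directory if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found (working directory: {Path.cwd()})"
        )
    return pd.read_csv(path)


# Demo on a temporary file; in the notebook you would pass country_path / details_path
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("country,date\nKenya,2023-01-01\n")
    demo_df = read_csv_checked(demo_path)

    try:
        read_csv_checked(Path(tmp) / "missing.csv")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```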

---

# Section 2

## Data Cleaning Utility Functions

The following code defines a set of reusable functions to clean and standardize tabular data using pandas. These functions help ensure consistent column naming, tidy string values, parse dates, convert numeric columns, drop nearly empty columns, and provide a quick summary report. You can use these utilities to prepare raw datasets for analysis.

In [2]:
import re
import numpy as np
import pandas as pd

# Standardize column names: strip spaces, remove special characters, replace spaces with underscores, and lowercase
def normalise_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = (
        df.columns
          .str.strip()
          .str.replace(r"[^\w\s]", "", regex=True)
          .str.replace(r"\s+", "_", regex=True)
          .str.lower()
    )
    return df

# Clean string columns: strip leading/trailing spaces and collapse multiple spaces into one
def tidy_strings(df: pd.DataFrame, cols) -> pd.DataFrame:
    df = df.copy()
    for c in cols:
        if c in df.columns:
            cleaned = (df[c].astype(str)
                            .str.strip()
                            .str.replace(r"\s+", " ", regex=True))
            # Keep missing values missing: astype(str) alone would turn NaN into the string "nan"
            df[c] = cleaned.where(df[c].notna())
    return df

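# Parse non-numeric columns with names like 'date', 'month', 'year', or 'period' into datetime objects
def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for c in df.columns:
        # Skip numeric columns: pd.to_datetime would read an integer such as 2023
        # as nanoseconds since 1970-01-01 rather than as a year
        if re.search(r"(date|month|year|period)", c) and not pd.api.types.is_numeric_dtype(df[c]):
            try:
                df[c] = pd.to_datetime(df[c], errors="coerce")
            except Exception:
                pass
    return df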

# Convert columns that look numeric (even if stored as strings) to numeric dtype
def coerce_numeric(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for c in df.columns:
        if df[c].dtype == object:
            sample = df[c].dropna().astype(str).head(50)
            looks_numeric = (
                not sample.empty and
                sample.str.replace(",","", regex=False)
                      .str.replace("%","", regex=False)
                      .str.match(r"^-?\d+(\.\d+)?$")
                      .mean() > 0.6
            )
            if looks_numeric:
                df[c] = (df[c].astype(str)
                               .str.replace(",","", regex=False)
                               .str.replace("%","", regex=False)
                               .str.strip())
                df[c] = pd.to_numeric(df[c], errors="coerce")
    return df

# Drop columns that are nearly empty (default: 98% or more missing values)
def drop_nearly_empty(df: pd.DataFrame, thresh=0.98) -> pd.DataFrame:
    na_ratio = df.isna().mean()
    to_drop = na_ratio[na_ratio >= thresh].index.tolist()
    return df.drop(columns=to_drop) if to_drop else df

# Run all cleaning steps in sequence for a generic DataFrame
def clean_generic(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out = normalise_columns(out)
    out = out.drop_duplicates()
    # Identify likely categorical columns and clean them
    cat_cols = [c for c in out.columns if out[c].dtype == object]
    priority = [c for c in ["country","country_name","item","commodity","product","unit","currency","series","market"] if c in out.columns]
    out = tidy_strings(out, list(dict.fromkeys(priority + cat_cols)))
    for c in ["country","country_name"]:
        if c in out.columns:
            out[c] = out[c].str.title()
    out = parse_dates(out)
    out = coerce_numeric(out)
    out = drop_nearly_empty(out, thresh=0.98)
    return out

# Print a brief summary report of the DataFrame's shape, duplicate count, and top missing columns
def brief_report(df: pd.DataFrame, name: str):
    print(f"{name}: shape={df.shape}, duplicates={df.duplicated().sum()}")
    miss = df.isna().sum()
    if miss.any():
        print("  top missing:", miss.sort_values(ascending=False).head(5).to_dict())

# Section 3

## Applying Data Cleaning Functions

Now we use the cleaning utilities to process both datasets (`country_df` and `details_df`).  
We then print a brief summary report for each cleaned DataFrame to check their shape, duplicate count, and missing values.
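
A cell along these lines would carry out the step described above. It is a sketch only: `clean_all` is a hypothetical wrapper, and the toy frames and stand-in cleaner are there so the block runs on its own. In the notebook you would pass `country_df` and `details_df` together with `clean_generic`, then call `brief_report` on each result.

```python
import pandas as pd


def clean_all(frames: dict, cleaner) -> dict:
    """Apply `cleaner` to each named DataFrame and print a one-line before/after summary."""
    cleaned = {}
    for name, df in frames.items():
        out = cleaner(df)
        print(f"{name}: {df.shape} -> {out.shape}, "
              f"duplicates left={int(out.duplicated().sum())}")
        cleaned[name] = out
    return cleaned


# Stand-in for clean_generic so the example is self-contained
def toy_cleaner(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower()
    return out.drop_duplicates()


# Made-up frames standing in for country_df and details_df
toy_frames = {
    "country": pd.DataFrame({" Country ": ["Kenya", "Kenya", "Peru"],
                             "Inflation": [5.0, 5.0, 3.2]}),
    "details": pd.DataFrame({"Country": ["Kenya", "Peru"],
                             "Currency": ["KES", "PEN"]}),
}

cleaned = clean_all(toy_frames, toy_cleaner)
country_clean, details_clean = cleaned["country"], cleaned["details"]
```

Keeping the cleaned frames under new names (rather than overwriting the raw ones) lets the notebook run top-down without any cell depending on a variable being refreshed.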

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down. Do not create a workflow in which you have to go back to an earlier cell to re-run a task, such as refreshing a variable's content.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
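import os
try:
  # Create the folder for the processed files listed under Outputs.
  # A try block containing only comments is a syntax error, so it needs a real statement.
  os.makedirs("data/processed", exist_ok=True)
except Exception as e:
  print(e)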
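
The files listed under Outputs can then be written with a small helper. This is a minimal sketch, assuming the cleaned and merged frames are held in variables of your choosing. `save_csv` is a hypothetical helper, and the demo writes to a temporary directory rather than data/processed/ so it has no side effects.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, path) -> Path:
    """Write df to CSV without the index, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Made-up frame standing in for the merged dataset
merged_toy = pd.DataFrame({"country": ["Kenya"], "inflation": [5.0]})

with tempfile.TemporaryDirectory() as tmp:
    # In the notebook the target would be data/processed/food_price_merged_clean.csv
    out_path = save_csv(merged_toy, Path(tmp) / "processed" / "food_price_merged_clean.csv")
    round_trip = pd.read_csv(out_path)
```

CSV does not preserve dtypes, so datetime columns come back as strings when the file is re-read; passing `parse_dates=[...]` to `pd.read_csv` in downstream notebooks restores them.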
