# **FOOD SHEET PREDICTION APP - DATA COLLECTION**

## Objectives

* This notebook handles the initial data collection for the Food Balance App using the Kaggle dataset "Food Balance Sheet Europe." We will:
  - Authenticate with Kaggle using `kaggle.json`.
  - Download the dataset.
  - Load and preprocess it.
  - Save the result to the `data/` folder.

## Inputs

* Kaggle JSON file (`kaggle.json`): the authentication token.

## Outputs

- `data/input/FAOSTAT_data_10-23-2018.csv`: the raw dataset downloaded from Kaggle.
- `data/food_balance_sheet_europe.csv`: the dataset saved from the loaded DataFrame.
- `data/food_balance_sheet_preprocessed.csv`: the preprocessed dataset.

## Additional Comments

- Install and import necessary packages.



---

# Change Working Directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [14]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\User\\Documents\\food sheet prediction\\food-sheet-prediction'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory.
* `os.chdir()` sets the new current directory.

In [15]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory.

In [16]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\User\\Documents\\food sheet prediction'
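Re-running the directory-change cell after the confirmation cell can move up one more level, because `current_dir` has been updated in between. A minimal sketch of a guard against that is shown here. The helper name `move_to_parent_once` and its `expected_child` argument are illustrative assumptions, not part of the project.

```python
import os
from pathlib import Path


def move_to_parent_once(expected_child: str) -> Path:
    """Change to the parent directory only while the cwd is still `expected_child`.

    Hypothetical helper: calling it repeatedly does not keep climbing the tree.
    """
    cwd = Path.cwd()
    if cwd.name == expected_child:
        os.chdir(cwd.parent)
    return Path.cwd()
```

In this notebook it might be called as `move_to_parent_once("food-sheet-prediction")`, assuming the notebook starts inside that folder.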

# Import Packages

### Install Packages

In [7]:
!pip install kaggle pandas




[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Import libraries and set up paths

In [23]:
import os
import sys
import json
from pathlib import Path
from kaggle.api.kaggle_api_extended import KaggleApi
import pandas as pd  


# Define project and data paths
project_root = Path(r"C:\Users\User\Documents\food sheet prediction\food-sheet-prediction")
data_dir = project_root / 'data'
input_dir = data_dir / 'input'
input_dir.mkdir(parents=True, exist_ok=True)  

# Add project root to Python path
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Load Kaggle credentials
kaggle_json_path = data_dir / 'kaggle.json'
assert kaggle_json_path.exists(), f"Please place kaggle.json in {data_dir}"

with open(kaggle_json_path) as f:
    creds = json.load(f)

os.environ['KAGGLE_USERNAME'] = creds['username']
os.environ['KAGGLE_KEY'] = creds['key']

# Authenticate with Kaggle
api = KaggleApi()
api.authenticate()

# Download the dataset
dataset = "cameronappel/food-balance-sheet-europe"
api.dataset_download_files(dataset, path=str(input_dir), unzip=True)

print(f"✅ Dataset downloaded and extracted to: {input_dir}")

print("Files in input_dir:", [f.name for f in input_dir.iterdir()])


# Preview the first CSV file found in the input directory
csv_files = list(input_dir.glob("*.csv"))
if csv_files:
    df = pd.read_csv(csv_files[0])
    print("\nDataset Preview:")
    print(df.head())
else:
    print("No CSV files found in the dataset.")

Dataset URL: https://www.kaggle.com/datasets/cameronappel/food-balance-sheet-europe
✅ Dataset downloaded and extracted to: C:\Users\User\Documents\food sheet prediction\food-sheet-prediction\data\input
Files in input_dir: ['FAOSTAT_data_10-23-2018.csv']

Dataset Preview:
  Domain Code               Domain  Country Code Country  Element Code  \
0         FBS  Food Balance Sheets          5400  Europe           511   
1         FBS  Food Balance Sheets          5400  Europe          5511   
2         FBS  Food Balance Sheets          5400  Europe          5611   
3         FBS  Food Balance Sheets          5400  Europe          5072   
4         FBS  Food Balance Sheets          5400  Europe          5911   

                         Element  Item Code                Item  Year Code  \
0  Total Population - Both sexes       2501          Population       2013   
1                     Production       2511  Wheat and products       2013   
2                Import Quantity       2511  Whea
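The credential-loading step in the cell above could be wrapped in a small helper that raises an explicit error when the file or one of its fields is missing, since `assert` statements are skipped when Python runs with `-O`. This is only a sketch: the function name `load_kaggle_credentials` is an assumption, and the two field names are the ones the cell above reads from `kaggle.json`.

```python
import json
from pathlib import Path

# Field names taken from the cell above (creds['username'], creds['key'])
REQUIRED_KEYS = ("username", "key")


def load_kaggle_credentials(path) -> dict:
    """Read kaggle.json and return only the required fields (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Please place kaggle.json in {path.parent}")
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    if not isinstance(creds, dict):
        raise ValueError("kaggle.json should contain a JSON object")
    missing = [k for k in REQUIRED_KEYS if not creds.get(k)]
    if missing:
        raise KeyError(f"kaggle.json is missing: {', '.join(missing)}")
    return {k: creds[k] for k in REQUIRED_KEYS}


# Possible use in the notebook (not executed here):
# creds = load_kaggle_credentials(data_dir / "kaggle.json")
# os.environ["KAGGLE_USERNAME"] = creds["username"]
# os.environ["KAGGLE_KEY"] = creds["key"]
```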

In [24]:
from pathlib import Path

input_dir = Path(r"C:\Users\User\Documents\food sheet prediction\food-sheet-prediction\data\input")
print(list(input_dir.glob("*.csv")))


[WindowsPath('C:/Users/User/Documents/food sheet prediction/food-sheet-prediction/data/input/FAOSTAT_data_10-23-2018.csv')]
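The cells above take `csv_files[0]`, which works while the folder holds a single CSV. Once a folder contains several (as `data/` does later in this notebook), the order returned by `glob` should not be relied on. One possible way to make the choice explicit is sketched below; `pick_csv` and its `preferred_name` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path


def pick_csv(directory, preferred_name=None, pattern="*.csv") -> Path:
    """Return one CSV from `directory` in a predictable way (hypothetical helper).

    If `preferred_name` is given and present, return it; otherwise return the
    first match in alphabetical order.
    """
    matches = sorted(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    if preferred_name is not None:
        for candidate in matches:
            if candidate.name == preferred_name:
                return candidate
    return matches[0]


# Possible use in the notebook (not executed here):
# input_file = pick_csv(input_dir, preferred_name="FAOSTAT_data_10-23-2018.csv")
```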


### Download Dataset

In [19]:
# Dataset identifier on Kaggle
dataset = "cameronappel/food-balance-sheet-europe"

# Make sure data directory exists
os.makedirs(data_dir, exist_ok=True)

# Download and unzip the dataset
api.dataset_download_files(dataset, path=str(data_dir), unzip=True)

# List downloaded CSVs
downloaded_files = list(data_dir.glob('*.csv'))
print("Downloaded files:", downloaded_files)


Dataset URL: https://www.kaggle.com/datasets/cameronappel/food-balance-sheet-europe
Downloaded files: [WindowsPath('C:/Users/User/Documents/food sheet prediction/food-sheet-prediction/data/FAOSTAT_data_10-23-2018.csv'), WindowsPath('C:/Users/User/Documents/food sheet prediction/food-sheet-prediction/data/food_balance_sheet_preprocessed.csv')]
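The dataset is downloaded each time a download cell runs. A sketch of skipping the download when a CSV is already present is given below. The helper `download_if_missing` is an assumption, and the download itself is passed in as a callable so the sketch does not depend on the Kaggle client.

```python
from pathlib import Path
from typing import Callable


def download_if_missing(target_dir, download: Callable[[Path], None],
                        pattern: str = "*.csv") -> bool:
    """Call `download(target_dir)` only if no file matching `pattern` exists.

    Returns True when a download was triggered, False when it was skipped.
    Hypothetical helper, not part of the Kaggle API.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    if any(target_dir.glob(pattern)):
        return False
    download(target_dir)
    return True


# Possible use with the objects defined earlier in this notebook (not executed here):
# download_if_missing(
#     input_dir,
#     lambda d: api.dataset_download_files(dataset, path=str(d), unzip=True),
# )
```

Passing the download step as a function also makes the skip logic easy to check without network access.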


In [28]:
# Optional: Save cleaned version to /data folder
cleaned_output = data_dir / "food_balance_sheet_europe.csv"
df.to_csv(cleaned_output, index=False)
print(f"💾 Cleaned dataset saved to: {cleaned_output}")


💾 Cleaned dataset saved to: C:\Users\User\Documents\food sheet prediction\food-sheet-prediction\data\food_balance_sheet_europe.csv


### Load, Preprocess, and Inspect Dataset

In [None]:
# Load, preprocess, and inspect the dataset

# Locate the CSV file in the input directory
csv_files = list(input_dir.glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {input_dir}")
    
input_file = csv_files[0]
print(f"📂 Loading file: {input_file.name}")

# Load the dataset
df = pd.read_csv(input_file)

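# Basic preprocessing
df.dropna(how='all', axis=1, inplace=True)  # drop columns that are entirely empty
df.dropna(how='all', axis=0, inplace=True)  # drop rows that are entirely empty
df.columns = df.columns.str.strip()         # trim whitespace from column names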

# Show dataset info
print("✅ Dataset loaded and preprocessed.")
print("🔢 Shape:", df.shape)
print("\n🧱 Columns:", df.columns.tolist())
df.head()


📂 Loading file: FAOSTAT_data_10-23-2018.csv
✅ Dataset loaded and preprocessed.
🔢 Shape: (1214, 14)

🧱 Columns: ['Domain Code', 'Domain', 'Country Code', 'Country', 'Element Code', 'Element', 'Item Code', 'Item', 'Year Code', 'Year', 'Unit', 'Value', 'Flag', 'Flag Description']


| | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FBS | Food Balance Sheets | 5400 | Europe | 511 | Total Population - Both sexes | 2501 | Population | 2013 | 2013 | 1000 persons | 742186.0 | A | Aggregate, may include official, semi-official... |
| 1 | FBS | Food Balance Sheets | 5400 | Europe | 5511 | Production | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 226089.0 | A | Aggregate, may include official, semi-official... |
| 2 | FBS | Food Balance Sheets | 5400 | Europe | 5611 | Import Quantity | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 45338.0 | A | Aggregate, may include official, semi-official... |
| 3 | FBS | Food Balance Sheets | 5400 | Europe | 5072 | Stock Variation | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | -4775.0 | A | Aggregate, may include official, semi-official... |
| 4 | FBS | Food Balance Sheets | 5400 | Europe | 5911 | Export Quantity | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 91363.0 | A | Aggregate, may include official, semi-official... |
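The three preprocessing steps in the cell above could also be collected into one function that returns a new DataFrame instead of modifying `df` in place, which makes the step easier to reuse and test. The sketch below is only an illustration: `basic_clean` is a hypothetical name, and it is not the project's `data_loader.py`, whose contents are not shown in this notebook.

```python
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of `df` (hypothetical helper).

    Drops columns and rows that are entirely empty, then trims
    whitespace from the column names. The input is left unchanged.
    """
    cleaned = df.dropna(how="all", axis=1).dropna(how="all", axis=0)
    cleaned.columns = cleaned.columns.str.strip()
    return cleaned
```

Because `dropna` returns a new object here, the original DataFrame keeps its columns and rows, so the raw data stays available for comparison.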


## Save Dataset

In [29]:
# Save the preprocessed dataset
# Define output file name and path
output_filename = 'food_balance_sheet_preprocessed.csv'
output_path = data_dir / output_filename

# Save the DataFrame to CSV
df.to_csv(output_path, index=False)
print(f"✅ Preprocessed dataset saved to: {output_path}")


✅ Preprocessed dataset saved to: C:\Users\User\Documents\food sheet prediction\food-sheet-prediction\data\food_balance_sheet_preprocessed.csv
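After saving, it can be worth confirming that the file on disk matches what was in memory. A minimal sketch using only the standard library `csv` module follows; `summarize_csv` is a hypothetical helper name, and UTF-8 encoding is assumed.

```python
import csv


def summarize_csv(path):
    """Return (header, data_row_count) for a CSV file (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


# Possible use in the notebook (not executed here):
# header, n_rows = summarize_csv(output_path)
# print(len(header), n_rows)  # compare with df.shape
```

The header length and row count can then be compared with the `df.shape` printed in the preprocessing cell.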


---

### Notes
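- The dataset is now saved in `data/food_balance_sheet_europe.csv`, and the preprocessed version in `data/food_balance_sheet_preprocessed.csv`.
- Preprocessing in this notebook is done inline with pandas (dropping empty rows and columns, stripping column names); `load_and_preprocess_data` should apply the same steps to stay consistent with the app's preprocessing.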
- Next steps: Explore the data in `DataVisualisation.ipynb`.

---

# Push files to Repo



### Push generated/new files from this section to your GitHub repo

### .gitignore

```gitignore
core.Microsoft*
core.mongo*
core.python*
env.py
__pycache__/
*.py[cod]
node_modules/
.github/
cloudinary_python.txt
kaggle.json
```
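Since `kaggle.json` holds an API key, it may be worth checking that it is listed in `.gitignore` before pushing. The sketch below does a deliberately naive exact-line match and does not implement gitignore pattern rules (wildcards, negation, directory matching); `git check-ignore` remains the authoritative check. The function name `lists_exact_entry` is an assumption.

```python
def lists_exact_entry(gitignore_text: str, name: str) -> bool:
    """Return True if `name` appears as its own line in the .gitignore text.

    Naive check (hypothetical helper): ignores blank lines and comments,
    and treats a leading '/' as optional. Patterns are not expanded.
    """
    for raw in gitignore_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.lstrip("/") == name:
            return True
    return False


# Possible use in the notebook (not executed here):
# from pathlib import Path
# text = Path(".gitignore").read_text(encoding="utf-8")
# print(lists_exact_entry(text, "kaggle.json"))
```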

In [15]:
import os

try:
    folder_name = 'outputs'  
    os.makedirs(folder_name, exist_ok=True)
    print(f"✅ Folder '{folder_name}' created (or already exists).")
except Exception as e:
    print(f"❌ Error: {e}")


✅ Folder 'outputs' created (or already exists).
