# Canada's DataFarm Crops — Initial Data Overview

This notebook covers the first phase of Exploratory Data Analysis (EDA) on a dataset of Canadian agricultural crops.
It walks through the preliminary steps: importing the data, inspecting its structure, identifying column types, checking for missing values, and computing descriptive statistics.
The goal is to become familiar with the structure of the dataset before proceeding to cleaning and visualization.

In [2]:
# Main libraries for data analysis
import pandas as pd          # Data manipulation and analysis
import numpy as np           # Numerical computations and array handling
import matplotlib.pyplot as plt  # Basic plotting
import seaborn as sns        # Statistical data visualization

# Display settings for better readability
pd.set_option('display.max_columns', None)   # Show all columns when displaying a DataFrame
pd.set_option('display.precision', 2)        # Display float numbers with 2 decimal places

# Set Seaborn theme for nicer plots
sns.set_theme(style='whitegrid')   # White background with gridlines; set_theme is the preferred name for sns.set

# Enable inline plotting in Jupyter Notebooks (if applicable)
%matplotlib inline

In [3]:
# Load the dataset from the raw data folder
df = pd.read_csv("../data_raw/farm_production_dataset.csv")

# Preview the first 5 rows to understand the structure
df.head()

| | REF_DATE | GEO | Type of crop | Average farm price (dollars per tonne) | Average yield (kilograms per hectare) | Production (metric tonnes) | Seeded area (acres) | Seeded area (hectares) | Total farm value (dollars) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1908 | AB | Barley | 15.0 | 1585.0 | 84000.0 | 129800.0 | 53000.0 | 1296 |
| 1 | 1908 | AB | Flaxseed | 29.0 | 950.0 | 1900.0 | 5900.0 | 2000.0 | 56 |
| 2 | 1908 | AB | Oats | 18.0 | 1675.0 | 352000.0 | 519400.0 | 210000.0 | 6316 |
| 3 | 1908 | AB | Rye, all | 23.0 | 1665.0 | 5000.0 | 6500.0 | 3000.0 | 117 |
| 4 | 1908 | AB | Sugar beets | 0.55 | 18100.0 | 38100.0 | 5200.0 | 2100.0 | 208 |


In [4]:
# Check the number of rows and columns in the dataset (requires df from the cell above)
print("Shape of the dataset:")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

Shape of the dataset:
Rows: 10273
Columns: 9
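
The relative path passed to `pd.read_csv` depends on the directory the notebook is started from. As a hedged sketch, a small wrapper could fail early with a readable message when the file is not where it is expected. The helper name `load_csv` and the throwaway demo file are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv(path):
    """Read a CSV file, raising a clear error if the file is missing."""
    csv_path = Path(path).expanduser().resolve()
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {csv_path} (current directory: {Path.cwd()})"
        )
    return pd.read_csv(csv_path)


# Demonstration on a temporary file with made-up contents; in the notebook the
# argument would be "../data_raw/farm_production_dataset.csv".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("REF_DATE,GEO,Type of crop\n1908,AB,Barley\n1908,AB,Oats\n")
    demo_df = load_csv(demo_path)

print(demo_df.shape)
```

Resolving the path before reading means the error message shows the absolute location that was tried, which is usually the quickest way to spot a wrong working directory.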


In [5]:
# Check the data types of each column
df.dtypes

REF_DATE                                    int64
GEO                                        object
Type of crop                               object
Average farm price (dollars per tonne)    float64
Average yield (kilograms per hectare)     float64
Production (metric tonnes)                float64
Seeded area (acres)                       float64
Seeded area (hectares)                    float64
Total farm value (dollars)                  int64
dtype: object
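
The dtype listing separates naturally into numeric columns and text columns. As a minimal sketch of how that split could feed the descriptive-statistics step, the snippet below rebuilds a few columns of the five preview rows (an illustrative sample, not the full 10273-row `df`) and summarizes each group with `select_dtypes` and `describe`. The names `sample`, `numeric_cols`, and `text_cols` are assumptions introduced here.

```python
import pandas as pd

# Five rows copied from the df.head() preview, restricted to a few columns.
sample = pd.DataFrame({
    "REF_DATE": [1908, 1908, 1908, 1908, 1908],
    "GEO": ["AB", "AB", "AB", "AB", "AB"],
    "Type of crop": ["Barley", "Flaxseed", "Oats", "Rye, all", "Sugar beets"],
    "Average farm price (dollars per tonne)": [15.0, 29.0, 18.0, 23.0, 0.55],
    "Production (metric tonnes)": [84000.0, 1900.0, 352000.0, 5000.0, 38100.0],
})

# Split the columns by dtype: integers and floats versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# describe() reports count/mean/std/quartiles for numeric columns and
# count/unique/top/freq for text columns.
numeric_summary = sample[numeric_cols].describe()
text_summary = sample[text_cols].describe()

print(numeric_summary)
print(text_summary)
```

On the real dataset the same two calls would run on `df` directly. Note that `REF_DATE` is stored as `int64`, so it lands in the numeric group even though its mean is rarely meaningful for a year column.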

In [6]:
# Check for missing values (NaN) in each column
missing_values = df.isnull().sum()

# Calculate the percentage of missing values
missing_percentage = (missing_values / len(df)) * 100

# Combine the count and percentage in a single DataFrame
missing_data = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percentage.round(2)
})

# Display only columns with missing values
missing_data[missing_data['Missing Values'] > 0]

| | Missing Values | Percentage (%) |
|---|---|---|
| Type of crop | 1 | 0.01 |
| Average farm price (dollars per tonne) | 30 | 0.29 |
| Average yield (kilograms per hectare) | 27 | 0.26 |
| Production (metric tonnes) | 28 | 0.27 |
| Seeded area (acres) | 400 | 3.89 |
| Seeded area (hectares) | 426 | 4.15 |
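
The two seeded-area columns report different missing counts (400 versus 426), so a natural follow-up is to ask whether the gaps fall on the same rows. The sketch below shows one way to check this with boolean masks. The frame `toy` holds made-up values for illustration only, and the variable names are assumptions; on the real data the same masks would be built from `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from the real dataset.
toy = pd.DataFrame({
    "Seeded area (acres)": [129800.0, np.nan, 519400.0, np.nan, 5200.0],
    "Seeded area (hectares)": [53000.0, np.nan, np.nan, np.nan, 2100.0],
})

# Boolean masks marking where each column is missing.
acres_missing = toy["Seeded area (acres)"].isna()
hectares_missing = toy["Seeded area (hectares)"].isna()

# Count rows where both, or only one, of the two columns is missing.
both_missing = (acres_missing & hectares_missing).sum()
only_acres_missing = (acres_missing & ~hectares_missing).sum()
only_hectares_missing = (hectares_missing & ~acres_missing).sum()

print(f"Both missing: {both_missing}")
print(f"Only acres missing: {only_acres_missing}")
print(f"Only hectares missing: {only_hectares_missing}")
```

Rows where only one of the two columns is missing are candidates for filling from the other column during cleaning, since both columns describe the same area in different units.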
