# Exploratory Data Analysis: Crop Production Trends
This notebook explores crop production trends over the years, focusing on the crop with the **highest average production (Tame Hay)** and the crop with the **lowest average production (Peas, Dry)**. The analysis includes:
- General overview of the dataset
- Trends in crop production over the years
- Comparison of Tame Hay and Peas, Dry
- Insights into production variability and country-level contributions
- Correlation between production and farm value

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set(style="whitegrid")

## 1. Load and Inspect the Dataset
First, we load the dataset and perform basic checks to understand its structure and contents.

In [None]:
df = pd.read_csv('farm_production_dataset.csv')
print("Dataset Overview:")
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10273 entries, 0 to 10272
Data columns (total 9 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   REF_DATE                                10273 non-null  int64  
 1   GEO                                     10273 non-null  object 
 2   Type of crop                            10272 non-null  object 
 3   Average farm price (dollars per tonne)  10243 non-null  float64
 4   Average yield (kilograms per hectare)   10246 non-null  float64
 5   Production (metric tonnes)              10245 non-null  float64
 6   Seeded area (acres)                     9873 non-null   float64
 7   Seeded area (hectares)                  9847 non-null   float64
 8   Total farm value (dollars)              10273 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 722.4+ KB


| Statistic | REF_DATE | Average farm price (dollars per tonne) | Average yield (kilograms per hectare) | Production (metric tonnes) | Seeded area (acres) | Seeded area (hectares) | Total farm value (dollars) |
|---|---|---|---|---|---|---|---|
| count | 10273.0 | 10243.0 | 10246.0 | 10245.0 | 9873.0 | 9847.0 | 10273.0 |
| mean | 1947.661053 | 59.633078 | 3647.442319 | 1010887.0 | 1310067.0 | 531623.1 | 54900.84 |
| std | 22.204519 | 90.920549 | 8068.854966 | 3044681.0 | 4881387.0 | 1978043.0 | 250241.3 |
| min | 1908.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1929.0 | 21.0 | 1055.0 | 8300.0 | 10500.0 | 4350.0 | 369.0 |
| 50% | 1948.0 | 43.0 | 1485.0 | 71950.0 | 83000.0 | 34000.0 | 3006.0 |
| 75% | 1967.0 | 76.0 | 2363.75 | 548000.0 | 583100.0 | 238900.0 | 19541.0 |
| max | 1984.0 | 6663.3 | 460305.0 | 133679000.0 | 317203500.0 | 128389000.0 | 4654194.0 |


### Observations:
- The dataset contains information on crop production, farm value, and other metrics.
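- Key columns include `REF_DATE` (the year, renamed to `Year` during preprocessing), `GEO` (renamed to `Country` during preprocessing), `Type of crop`, `Production (metric tonnes)`, and `Total farm value (dollars)`.
- Several columns contain missing values in the raw data (for example, `Seeded area (hectares)` has 9847 non-null entries out of 10273, and `Type of crop` has one missing entry). These are handled during preprocessing.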
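
As a minimal sketch of how the missing values noted above could be summarized per column, the snippet below reports counts and percentages. The helper name `missing_summary` and the toy frame `sample` are hypothetical stand-ins, not part of the original notebook; on the real data, `df` would be passed instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "REF_DATE": [1908, 1908, 1909, 1909],
    "GEO": ["A", "B", "A", "B"],
    "Type of crop": ["Barley", None, "Oats", "Oats"],
    "Production (metric tonnes)": [1000.0, np.nan, 2500.0, np.nan],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


missing_report = missing_summary(sample)
print(missing_report)
```

Reporting percentages alongside counts makes it easier to judge whether filling with a constant is reasonable for a given column.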

## 2. Number of Crops Over the Years
We start by visualizing the number of crop records per year to identify any anomalies or trends.

In [None]:
# The column is still named REF_DATE here; it is renamed to Year in the preprocessing cell
df['REF_DATE'].value_counts().sort_index().plot(kind='bar', figsize=(13, 5), color='skyblue')
plt.title('Number of Crops per Year')
plt.xlabel('Year')
plt.ylabel('Number of Crops')
plt.show()

### Observations:
- The year 1954 appears anomalous, as it includes data for 1955. The preprocessing cell that follows drops all rows labeled 1954.
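
An anomaly like this can also be flagged programmatically rather than read off the bar chart. The sketch below is one possible check on made-up data: it flags years whose row count is well above the median and lists years absent from the range. The 1.5x threshold is an arbitrary assumption, not something taken from the dataset.

```python
import pandas as pd

# Toy year column: 1954 has twice as many rows as its neighbours and 1955 is absent.
years = pd.Series([1952] * 3 + [1953] * 3 + [1954] * 6 + [1956] * 3, name="REF_DATE")

counts = years.value_counts().sort_index()
median_count = counts.median()

# Flag years whose row count exceeds 1.5x the median (assumed threshold).
suspicious_years = counts[counts > 1.5 * median_count].index.tolist()

# Years missing from the otherwise continuous range.
first_year, last_year = int(counts.index.min()), int(counts.index.max())
missing_years = [y for y in range(first_year, last_year + 1) if y not in counts.index]

print("Suspicious years:", suspicious_years)
print("Missing years:", missing_years)
```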

In [None]:
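# Inspect missing values before filling them
print(df.isnull().sum())
df = df.rename(columns={'REF_DATE': 'Year', 'GEO': 'Country'})
# Drop rows with no crop type first: fillna(0) would insert the integer 0,
# which a string comparison against '0' would not remove
df = df.dropna(subset=['Type of crop'])
df = df.fillna(0)
# Remove the anomalous 1954 rows
df = df[df['Year'] != 1954]
df.to_csv('farm_corrected.csv', index=False)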

In [None]:
df=pd.read_csv('farm_corrected.csv')

## 3. Average Production by Crop Type
Next, we calculate the average production for each crop type to identify the highest and lowest producers.

In [None]:
crop_production = df.groupby('Type of crop')['Production (metric tonnes)'].mean().sort_values(ascending=False)
crop_production_df = crop_production.reset_index()
crop_production_df['Production (metric tonnes)'] = crop_production_df['Production (metric tonnes)'] / 1e6  # Convert to millions

plt.figure(figsize=(10, 6))
sns.barplot(data=crop_production_df, y='Type of crop', x='Production (metric tonnes)', palette='viridis')
plt.xlabel("Average Production (in millions of metric tonnes)")
plt.title("Average Production by Crop Type")
plt.show()

### Observations:
- **Tame Hay** is the highest producer on average.
- **Peas, Dry** is the lowest producer on average.
- We will now focus on comparing these two crops in detail.
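
The highest and lowest producers can also be read directly from the grouped means instead of from the chart. A minimal sketch on made-up numbers, using `idxmax` and `idxmin` (the toy frame `toy_crops` is hypothetical):

```python
import pandas as pd

# Toy data with the same column names as the dataset (values are made up).
toy_crops = pd.DataFrame({
    "Type of crop": ["Tame hay", "Tame hay", "Peas, dry", "Peas, dry", "Oats"],
    "Production (metric tonnes)": [900.0, 1100.0, 10.0, 30.0, 400.0],
})

mean_production = toy_crops.groupby("Type of crop")["Production (metric tonnes)"].mean()

# idxmax / idxmin return the index label (crop name) of the extreme value.
highest_crop = mean_production.idxmax()
lowest_crop = mean_production.idxmin()

print(f"Highest average production: {highest_crop}")
print(f"Lowest average production: {lowest_crop}")
```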

## 4. Yearly Trends for Tame Hay and Peas, Dry
We analyze the yearly production trends for the highest and lowest producers.

In [None]:
# Tame Hay
tame_hay_yearly = df[df['Type of crop'] == 'Tame hay'].groupby('Year')['Production (metric tonnes)'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(data=tame_hay_yearly, x='Year', y='Production (metric tonnes)', label='Tame Hay', color='green')
plt.title('Yearly Production of Tame Hay')
plt.xlabel('Year')
plt.ylabel('Production (metric tonnes)')
plt.show()

# Peas, Dry
peas_dry_yearly = df[df['Type of crop'] == 'Peas, dry'].groupby('Year')['Production (metric tonnes)'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(data=peas_dry_yearly, x='Year', y='Production (metric tonnes)', label='Peas, Dry', color='orange')
plt.title('Yearly Production of Peas, Dry')
plt.xlabel('Year')
plt.ylabel('Production (metric tonnes)')
plt.show()

### Observations:
- Tame Hay shows consistent production over the years with some fluctuations.
- Peas, Dry has much lower production and shows more variability.

## 5. Country-Level Contributions
We analyze how different countries contribute to the production of Tame Hay and Peas, Dry.

In [None]:
# Tame Hay by Country
tame_hay_by_country = df[df['Type of crop'] == 'Tame hay'].groupby('Country')['Production (metric tonnes)'].mean().sort_values(ascending=False)
tame_hay_by_country.plot(kind='bar', figsize=(10, 6), color='green')
plt.title('Average Production of Tame Hay by Country')
plt.xlabel('Country')
plt.ylabel('Production (metric tonnes)')
plt.show()

# Peas, Dry by Country
peas_dry_by_country = df[df['Type of crop'] == 'Peas, dry'].groupby('Country')['Production (metric tonnes)'].mean().sort_values(ascending=False)
peas_dry_by_country.plot(kind='bar', figsize=(10, 6), color='orange')
plt.title('Average Production of Peas, Dry by Country')
plt.xlabel('Country')
plt.ylabel('Production (metric tonnes)')
plt.show()

### Observations:
- Certain countries dominate the production of Tame Hay.
- Peas, Dry production is more evenly distributed but remains low overall.
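
To quantify how strongly a few countries dominate, each country's share of total production can be computed. The sketch below uses made-up data and a hypothetical helper, `production_share`. Note that it sums production per country rather than averaging it as the cells above do, since shares of a total are easier to interpret.

```python
import pandas as pd

# Toy data with the renamed columns (values are made up).
toy_countries = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Type of crop": ["Tame hay"] * 6,
    "Production (metric tonnes)": [600.0, 600.0, 300.0, 300.0, 100.0, 100.0],
})


def production_share(frame: pd.DataFrame, crop: str) -> pd.Series:
    """Return each country's percentage share of total production for one crop."""
    subset = frame[frame["Type of crop"] == crop]
    totals = subset.groupby("Country")["Production (metric tonnes)"].sum()
    return (totals / totals.sum() * 100).sort_values(ascending=False)


hay_share = production_share(toy_countries, "Tame hay")
print(hay_share.round(1))
```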

## 6. Correlation Between Production and Farm Value
We explore the relationship between production and total farm value for all crops.

In [None]:
sns.scatterplot(data=df, x='Production (metric tonnes)', y='Total farm value (dollars)', hue='Type of crop', alpha=0.7)
plt.title('Correlation Between Production and Total Farm Value')
plt.xlabel('Production (metric tonnes)')
plt.ylabel('Total Farm Value (dollars)')
plt.show()

### Observations:
- The scatter plot suggests a positive relationship between production and farm value, as expected.
- Tame Hay contributes significantly to farm value, while Peas, Dry has a minimal impact.
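
A scatter plot only suggests a relationship, so a correlation coefficient is a useful complement. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient on made-up numbers. Spearman is obtained here as the Pearson correlation of the ranks, which avoids an extra dependency. The toy frame `toy_values` is hypothetical.

```python
import pandas as pd

# Toy data with one large outlier (values are made up).
toy_values = pd.DataFrame({
    "Production (metric tonnes)": [10.0, 20.0, 30.0, 40.0, 1000.0],
    "Total farm value (dollars)": [1.0, 3.0, 4.0, 9.0, 50.0],
})

production = toy_values["Production (metric tonnes)"]
farm_value = toy_values["Total farm value (dollars)"]

# Pearson measures linear association and is sensitive to outliers.
pearson_r = production.corr(farm_value)

# Spearman is the Pearson correlation of the ranks; it captures any monotonic relationship.
spearman_rho = production.rank().corr(farm_value.rank())

print(f"Pearson r: {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Because production values in the dataset span several orders of magnitude, the rank-based coefficient may be the more robust of the two.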

## 7. Variability in Production
Finally, we analyze the variability in production for all crops to understand consistency.

In [None]:
crop_variability = df.groupby('Type of crop')['Production (metric tonnes)'].std().sort_values(ascending=False)
crop_variability.plot(kind='bar', figsize=(12, 6), color='purple')
# The standard deviation is taken over all rows of each crop, not per year
plt.title('Variability in Crop Production by Crop Type')
plt.xlabel('Type of Crop')
plt.ylabel('Standard Deviation of Production (metric tonnes)')
plt.show()

### Observations:
- Tame Hay shows moderate variability, indicating consistent production.
- Peas, Dry has higher variability relative to its low production levels.
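
Comparing variability "relative to production levels" is what the coefficient of variation (standard deviation divided by mean) measures. A minimal sketch on made-up numbers, with a hypothetical result table named `cv_table`:

```python
import pandas as pd

# Toy data (values are made up): hay is large and steady, peas are small and erratic.
toy_variability = pd.DataFrame({
    "Type of crop": ["Tame hay"] * 3 + ["Peas, dry"] * 3,
    "Production (metric tonnes)": [900.0, 1000.0, 1100.0, 5.0, 10.0, 15.0],
})

cv_table = toy_variability.groupby("Type of crop")["Production (metric tonnes)"].agg(["mean", "std"])

# Coefficient of variation: sample standard deviation relative to the mean (unitless).
cv_table["cv"] = cv_table["std"] / cv_table["mean"]

print(cv_table.sort_values("cv", ascending=False))
```

In this toy example the crop with the smaller absolute standard deviation has the larger coefficient of variation, which is the pattern the last observation describes.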