# Exploratory Data Analysis: Climate Change Impact on Agriculture Dataset (Kaggle)
### https://www.kaggle.com/datasets/waqi786/climate-change-impact-on-agriculture

### This notebook is primarily for personal study and for refreshing what I learned in past data science courses. That said, I believe climate change is an important topic that is worth researching.

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
# Load dataset 
df = pd.read_csv("../data/climate_change_impact_on_agriculture_2024.csv")
df.head()

| | Year | Country | Region | Crop_Type | Average_Temperature_C | Total_Precipitation_mm | CO2_Emissions_MT | Crop_Yield_MT_per_HA | Extreme_Weather_Events | Irrigation_Access_% | Pesticide_Use_KG_per_HA | Fertilizer_Use_KG_per_HA | Soil_Health_Index | Adaptation_Strategies | Economic_Impact_Million_USD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | India | West Bengal | Corn | 1.55 | 447.06 | 15.22 | 1.737 | 8 | 14.54 | 10.08 | 14.78 | 83.25 | Water Management | 808.13 |
| 1 | 2024 | China | North | Corn | 3.23 | 2913.57 | 29.82 | 1.737 | 8 | 11.05 | 33.06 | 23.25 | 54.02 | Crop Rotation | 616.22 |
| 2 | 2001 | France | Ile-de-France | Wheat | 21.11 | 1301.74 | 25.75 | 1.719 | 5 | 84.42 | 27.41 | 65.53 | 67.78 | Water Management | 796.96 |
| 3 | 2001 | Canada | Prairies | Coffee | 27.85 | 1154.36 | 13.91 | 3.89 | 5 | 94.06 | 14.38 | 87.58 | 91.39 | No Adaptation | 790.32 |
| 4 | 1998 | India | Tamil Nadu | Sugarcane | 2.19 | 1627.48 | 11.81 | 1.08 | 9 | 95.75 | 44.35 | 88.08 | 49.61 | Crop Rotation | 401.72 |


In [9]:
# Check for null values
print(len(df))
df.info()

10000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Year                         10000 non-null  int64  
 1   Country                      10000 non-null  object 
 2   Region                       10000 non-null  object 
 3   Crop_Type                    10000 non-null  object 
 4   Average_Temperature_C        10000 non-null  float64
 5   Total_Precipitation_mm       10000 non-null  float64
 6   CO2_Emissions_MT             10000 non-null  float64
 7   Crop_Yield_MT_per_HA         10000 non-null  float64
 8   Extreme_Weather_Events       10000 non-null  int64  
 9   Irrigation_Access_%          10000 non-null  float64
 10  Pesticide_Use_KG_per_HA      10000 non-null  float64
 11  Fertilizer_Use_KG_per_HA     10000 non-null  float64
 12  Soil_Health_Index            10000 non-null  float64
 13  Adaptation_
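
Reading the non-null counts off `df.info()` works, but the missing values can also be counted directly. This is only a minimal sketch: `sample` is a hypothetical stand-in for `df`, built from a few columns of the first rows shown earlier, so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for df: a few columns from the first rows shown earlier.
sample = pd.DataFrame({
    "Year": [2001, 2024, 2001],
    "Country": ["India", "China", "France"],
    "Crop_Yield_MT_per_HA": [1.737, 1.737, 1.719],
})

# Count missing values per column; a zero means the column is complete.
null_counts = sample.isna().sum()
print(null_counts)

# Collapse to a single yes/no answer for the whole frame.
has_nulls = bool(sample.isna().any().any())
print(f"Any missing values: {has_nulls}")
```

Swapping `sample` for `df` should give zero for every column if the `df.info()` output holds for the full dataset.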

## Column Explanations
### The dataset appears to be complete: no column contains null values. It has a mixture of integer, float, and object (text) columns. <br>
1. **Year**: The year when the data was collected. It is stored as an integer.
2. **Country**: The name of the country where the data was collected. This is stored as text.
3. **Region**: The specific region within the country where the data was gathered. Also stored as text.
4. **Crop_Type**: The type of crop being grown in the region (e.g., wheat, corn, rice). This is a categorical field.
5. **Average_Temperature_C**: The average temperature for the region during the growing season, measured in degrees Celsius. Stored as a floating-point number.
6. **Total_Precipitation_mm**: The total amount of rainfall or precipitation during the growing season, measured in millimeters. Stored as a floating-point number.
7. **CO2_Emissions_MT**: The amount of CO2 emissions from the region during the specified year, measured in metric tons (MT). This field indicates how much CO2 was released into the atmosphere.
8. **Crop_Yield_MT_per_HA**: The yield or production of the crop in metric tons per hectare (MT/HA). It measures how much crop is produced per unit of land.
9. **Extreme_Weather_Events**: The number of extreme weather events (like floods, droughts, etc.) that occurred in the region during the year. This is an integer count.
10. **Irrigation_Access_%**: The percentage of cropland that has access to irrigation. Stored as a floating-point number.
11. **Pesticide_Use_KG_per_HA**: The amount of pesticide applied to the land, measured in kilograms per hectare (KG/HA). Indicates how much pesticide is used on the crops.
12. **Fertilizer_Use_KG_per_HA**: The amount of fertilizer used, measured in kilograms per hectare (KG/HA).
13. **Soil_Health_Index**: A numerical index indicating the health or quality of the soil in the region, often based on factors like nutrient content, pH, and organic matter. Stored as a floating-point number.
14. **Adaptation_Strategies**: The strategies employed by farmers or the region to adapt to environmental challenges, like climate change or soil degradation. This is a categorical field (text).
15. **Economic_Impact_Million_USD**: The economic impact of farming in the region, measured in millions of US dollars. This could reflect profits, losses, or economic damages due to factors like extreme weather or low crop yield.
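
Since several of the columns above are described as categorical text, one option is to store them with pandas' `category` dtype instead of `object`. This is a minimal sketch, not something the notebook does: `sample` is a hypothetical frame built from the first rows shown earlier, and the choice of which columns to convert is my own assumption.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from the first rows shown earlier.
sample = pd.DataFrame({
    "Country": ["India", "China", "France", "Canada", "India"],
    "Crop_Type": ["Corn", "Corn", "Wheat", "Coffee", "Sugarcane"],
    "Adaptation_Strategies": [
        "Water Management", "Crop Rotation", "Water Management",
        "No Adaptation", "Crop Rotation",
    ],
})

# Assumed list of text columns that hold a small set of repeated labels.
categorical_cols = ["Country", "Crop_Type", "Adaptation_Strategies"]

# Convert each listed column from object to category.
sample = sample.astype({col: "category" for col in categorical_cols})

print(sample.dtypes)
# The distinct labels pandas recorded for one of the columns.
print(sample["Crop_Type"].cat.categories.tolist())
```

With only ten countries and ten crop types across 10,000 rows, this would probably reduce memory use and make group-by operations a little cheaper, but it is optional for an exploration of this size.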
<br>
<br>
### First, I'll take a look at some of the columns individually to get a better idea of the diversity of the data collected.

In [26]:
# Collection period ('Year') column
start_year = df['Year'].min()
end_year = df['Year'].max()
# Difference between the last and first year (not the count of distinct years)
year_span = end_year - start_year
print(f'Data spans {year_span} years, from {start_year} to {end_year}')

Data spans 34 years, from 1990 to 2024
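
The difference between the last and first year is not the same as the number of calendar years covered, and it does not show whether any year in between is missing. Here is a small sketch of how both could be checked, assuming a hypothetical `years` Series in place of `df['Year']` (the values are illustrative, not the real column):

```python
import pandas as pd

# Hypothetical stand-in for df['Year']; values are illustrative only.
years = pd.Series([2001, 2024, 2001, 2001, 1998, 1990], name="Year")

start_year = int(years.min())
end_year = int(years.max())

# Difference between the endpoints, as computed above.
year_span = end_year - start_year
# Number of calendar years in the range when both endpoints are counted.
inclusive_years = year_span + 1
# Number of distinct years that actually appear in the data.
distinct_years = years.nunique()
# Years inside the range that have no rows at all.
missing_years = sorted(set(range(start_year, end_year + 1)) - set(years.tolist()))

print(f"Span: {year_span}, inclusive: {inclusive_years}, distinct: {distinct_years}")
print(f"Years with no rows: {len(missing_years)}")
```

If `distinct_years` equals `inclusive_years` on the real column, every year in the range is represented.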


In [27]:
# Country representation
df['Country'].value_counts()

Country
Australia    1032
USA          1032
China        1031
Nigeria      1029
India        1025
Canada        984
Argentina     984
France        978
Russia        961
Brazil        944
Name: count, dtype: int64
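
To judge how balanced the countries are, the raw counts can be turned into percentage shares. The sketch below rebuilds the counts printed above as a Series, so it runs without the CSV. On the real frame, `df['Country'].value_counts(normalize=True)` should give the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
country_counts = pd.Series({
    "Australia": 1032, "USA": 1032, "China": 1031, "Nigeria": 1029,
    "India": 1025, "Canada": 984, "Argentina": 984, "France": 978,
    "Russia": 961, "Brazil": 944,
}, name="count")

total_rows = int(country_counts.sum())

# Share of all rows per country, in percent.
shares = (country_counts / total_rows * 100).round(2)
print(shares)

# Gap between the most and least represented country, in percentage points.
share_gap = shares.max() - shares.min()
print(f"Total rows: {total_rows}, largest gap: {share_gap:.2f} percentage points")
```

The shares all sit close to 10%, so no single country dominates the dataset.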

In [28]:
# Crop type representation
df['Crop_Type'].value_counts()

Crop_Type
Wheat         1047
Cotton        1044
Vegetables    1036
Corn          1022
Rice          1022
Sugarcane      995
Fruits         979
Soybeans       958
Barley         952
Coffee         945
Name: count, dtype: int64
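
A bar chart makes the spread of these counts easier to see than the table. This is just a rough sketch using matplotlib with the counts printed above hard-coded. The labels, figure size, and the dashed mean line are my own choices rather than anything from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the value_counts() output above.
crop_counts = pd.Series({
    "Wheat": 1047, "Cotton": 1044, "Vegetables": 1036, "Corn": 1022,
    "Rice": 1022, "Sugarcane": 995, "Fruits": 979, "Soybeans": 958,
    "Barley": 952, "Coffee": 945,
}, name="count")

fig, ax = plt.subplots(figsize=(8, 4))

# One bar per crop type, in the same order as the table.
ax.bar(crop_counts.index, crop_counts.values)

# Dashed reference line at the mean number of rows per crop.
ax.axhline(crop_counts.mean(), linestyle="--", color="gray", label="Mean count")

ax.set_xlabel("Crop type")
ax.set_ylabel("Number of rows")
ax.set_title("Rows per crop type")
ax.tick_params(axis="x", rotation=45)
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. In a plain script, a call to `plt.show()` would be needed.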