<a href="https://colab.research.google.com/github/Oak-ke/Correlation-Project-/blob/main/Copy_of_Maize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Kenyan Maize Dataset Analysis

## Objective
The main objective of this notebook is to explore and understand the 'Kenyan Maize Dataset' to potentially identify factors influencing maize yield.

## Dataset Description

The dataset used in this notebook is the 'Kenyan Maize Dataset', downloaded from Kaggle and loaded from the file 'maize.csv'.
It contains 4,121 records related to maize yield in Kenya, covering the years 1990 to 2013.
The columns in the dataset include:
- Item: The crop item (Maize)
- Year: The year of observation
- hg/ha_yield: Maize yield in hectograms per hectare
- average_rain_fall_mm_per_year: Average annual rainfall in millimeters
- pesticides_tonnes: Amount of pesticides used in tonnes
- avg_temp: Average temperature in degrees Celsius
- Area: The geographical area (Kenya)
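
Hectograms per hectare is an unusual unit, so converting to tonnes per hectare can make the yield figures easier to read. The sketch below is illustrative only: the helper name `hg_per_ha_to_tonnes_per_ha` is hypothetical and is not part of the notebook. It relies on 1 hg = 0.1 kg and 1 tonne = 1000 kg, so 1 tonne/ha = 10,000 hg/ha.

```python
# Conversion factors: 1 kg = 10 hectograms (hg), 1 tonne = 1000 kg.
HG_PER_KG = 10
KG_PER_TONNE = 1000


def hg_per_ha_to_tonnes_per_ha(yield_hg_per_ha):
    """Convert a yield in hg/ha to tonnes/ha (hypothetical helper)."""
    return yield_hg_per_ha / (HG_PER_KG * KG_PER_TONNE)


# Yield of the first record in the dataset (1990): 36613 hg/ha.
first_yield_t_per_ha = hg_per_ha_to_tonnes_per_ha(36613)
print(f"{first_yield_t_per_ha:.4f} t/ha")
```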


## Notebook Steps

The notebook performed the following steps, in order:

1.  **Data Loading:** The dataset was loaded into a pandas DataFrame named `df` from the CSV file 'maize.csv'.

2.  **Initial Data Inspection:**
    *   The first few rows of the DataFrame were displayed using `df.head()` to get a preliminary look at the data.
    *   The column names were checked using `df.columns`.
    *   Information about the DataFrame, including data types and non-null counts, was examined using `df.info()`. This confirmed that there were no missing values in the dataset.
    *   Descriptive statistics (count, mean, std, min, max, quartiles) for the numerical columns were generated using `df.describe()`.
    *   The shape of the DataFrame (number of rows and columns) was checked using `df.shape`.

3.  **Data Cleaning - Column Removal:** The 'Unnamed: 0' column, which appeared to be an index column and not relevant for the analysis, was removed from the DataFrame.

4.  **Data Cleaning - Column Renaming:** `df.rename()` was called to give three columns more readable names ('average_rain_fall_mm_per_year' to 'Average_Rain_Fall_MM_PER_YEAR', 'pesticides_tonnes' to 'Pesticides_Tonnes', 'avg_temp' to 'Avg_Temp'). Because the result was not assigned back to `df`, the DataFrame kept its original column names.

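5.  **Data Cleaning - Duplicate Check:** Duplicate rows were flagged using `df.duplicated()`, which returns one boolean per row. Every row shown in the truncated output is `False`; summing the result (`df.duplicated().sum()`) gives the exact duplicate count.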

These steps covered the initial loading, basic inspection, and some preliminary cleaning of the dataset before any deeper analysis or modeling.


## Key Findings

The initial exploratory data analysis produced the following observations:

1.  **Numerical Column Summary (`df.describe()`):**
    *   `hg/ha_yield`: The maize yield varies significantly, with a minimum of 849 hg/ha and a maximum of 207556 hg/ha. The mean yield is around 36310 hg/ha, with a standard deviation of 27456, indicating a wide distribution in yield values.
    *   `average_rain_fall_mm_per_year`: Rainfall also shows considerable variation, ranging from 51 mm to 3240 mm annually. The mean rainfall is approximately 1098 mm.
    *   `pesticides_tonnes`: Pesticide usage ranges from 0.04 tonnes to 367778 tonnes. The mean is about 32766 tonnes, well above the median of about 14485 tonnes, and the standard deviation is large (about 54089), which suggests a right-skewed distribution or outliers in pesticide application.
    *   `avg_temp`: Average temperature ranges from 1.61°C to 30.65°C, with a mean of about 19.93°C and a standard deviation of about 6.65°C, so it varies far less relative to its mean than yield, rainfall, or pesticide use.
    *   `Year`: The data spans from 1990 to 2013.

2.  **Data Quality (`df.info()`, Duplicate Check):**
    *   `df.info()` revealed that there are no missing values across all columns, which is excellent for data quality and means no imputation or handling of missing data is required at this stage.
    *   The duplicate check (`df.duplicated()`) returned `False` for every row shown in the truncated output, so no duplicates are visible. The displayed output alone does not rule out duplicates among the hidden rows.

Overall, the dataset appears clean, with no missing values and no duplicates observed. Yield ('hg/ha_yield') varies widely, as do two candidate explanatory variables, 'average_rain_fall_mm_per_year' and 'pesticides_tonnes', which makes them natural starting points when looking for factors that influence maize yield. The wide range in pesticide usage warrants a closer look at its distribution and its relationship with yield.
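
One quick way to back up the skewness remark about pesticide usage is to compare the mean with the median and to look at the spread relative to the mean. The sketch below only re-uses the three `pesticides_tonnes` figures reported by `df.describe()` in this notebook; the variable names are illustrative, and the ratios are rough indicators rather than a formal skewness test.

```python
# Summary statistics for pesticides_tonnes, copied from the df.describe() output.
mean = 32765.983322
median = 14485.33      # the 50% row
std = 54088.622824

# A mean well above the median is a common sign of right skew.
mean_to_median = mean / median

# Coefficient of variation: spread relative to the mean.
coeff_of_variation = std / mean

print(f"mean / median: {mean_to_median:.2f}")
print(f"std / mean:    {coeff_of_variation:.2f}")
```

On the full DataFrame, `df['pesticides_tonnes'].skew()` would give a direct skewness estimate.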



In [None]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile

In [None]:
!pip install kaggle



In [None]:
from google.colab import files
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [None]:
# Configure the Kaggle API by creating the required directory, copying the API key, and setting secure permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
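
The same three shell steps can also be written in plain Python, which may be handy outside Colab. This is a minimal sketch, not the notebook's method: the function name `install_kaggle_key` is hypothetical, and the demo runs against a temporary directory with a fake key so that no real credentials are touched.

```python
import shutil
import tempfile
from pathlib import Path


def install_kaggle_key(key_file, home):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    target_dir = Path(home) / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.copy(key_file, target)                   # cp kaggle.json ~/.kaggle/
    target.chmod(0o600)                             # chmod 600
    return target


# Demo in a throwaway directory with a placeholder key file.
with tempfile.TemporaryDirectory() as tmp:
    fake_key = Path(tmp) / "kaggle.json"
    fake_key.write_text('{"username": "demo", "key": "not-a-real-key"}')
    installed = install_kaggle_key(fake_key, home=Path(tmp) / "home")
    installed_name = installed.name
    installed_exists = installed.exists()
    print(installed)
```

For real use, the call would be `install_kaggle_key("kaggle.json", Path.home())`.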

In [None]:
!kaggle datasets download yvvonjemmy/kenyan-maize-dataset

Dataset URL: https://www.kaggle.com/datasets/yvvonjemmy/kenyan-maize-dataset
License(s): unknown
Downloading kenyan-maize-dataset.zip to /content
  0% 0.00/42.4k [00:00<?, ?B/s]
100% 42.4k/42.4k [00:00<00:00, 145MB/s]


In [None]:
with zipfile.ZipFile('kenyan-maize-dataset.zip', 'r') as zip_ref:
    # This will print all files contained in the zip
    file_list = zip_ref.namelist()
    print("Files in the zip archive:")
    for file_name in file_list:
        print(f" - {file_name}")

    # Then extract them
    zip_ref.extractall()

Files in the zip archive:
 - maize.csv
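
As an alternative to extracting the archive to disk, pandas can read a CSV member straight from the open zip file. The sketch below is an illustration under stated assumptions: it builds a tiny in-memory archive (two rows copied from this notebook's `df.head()` output, three columns only) so that it runs without the Kaggle download, and the name `df_from_zip` is hypothetical.

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive that stands in for kenyan-maize-dataset.zip.
csv_text = (
    "Item,Year,hg/ha_yield\n"
    "Maize,1990,36613\n"
    "Maize,1991,29068\n"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("maize.csv", csv_text)

# Re-open the archive and pass the member's file object directly to pandas.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as zf:
    with zf.open("maize.csv") as member:
        df_from_zip = pd.read_csv(member)

print(df_from_zip)
```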


In [None]:
df = pd.read_csv('maize.csv')
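
The leftover index column could also be skipped at load time. The sketch below is one possible approach, not what the notebook does: it passes a callable to the `usecols` parameter of `pd.read_csv` so that any column whose name starts with "Unnamed" is ignored. The inline CSV text is a stand-in for 'maize.csv' built from the first rows shown in this notebook, and it assumes the column appears under the name 'Unnamed: 0'.

```python
import io

import pandas as pd

# Stand-in for maize.csv: header and first rows as shown by df.head().
csv_text = """Unnamed: 0,Item,Year,hg/ha_yield,average_rain_fall_mm_per_year,pesticides_tonnes,avg_temp,Area
0,Maize,1990,36613,1485.0,121.0,16.37,Kenya
6,Maize,1991,29068,1485.0,121.0,15.36,Kenya
12,Maize,1992,24876,1485.0,121.0,16.06,Kenya
"""

# usecols accepts a callable that is evaluated against each column name;
# columns for which it returns False are not loaded.
df_clean = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: not name.startswith("Unnamed"),
)

print(df_clean.columns.tolist())
```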

In [None]:
# Check the first five rows of the dataset
df.head()

,Unnamed: 0,Item,Year,hg/ha_yield,average_rain_fall_mm_per_year,pesticides_tonnes,avg_temp,Area
0,0,Maize,1990,36613,1485.0,121.0,16.37,Kenya
1,6,Maize,1991,29068,1485.0,121.0,15.36,Kenya
2,12,Maize,1992,24876,1485.0,121.0,16.06,Kenya
3,18,Maize,1993,24185,1485.0,121.0,16.05,Kenya
4,23,Maize,1994,25848,1485.0,201.0,16.96,Kenya


In [None]:
# Check the column names
df.columns

Index(['Unnamed: 0', 'Item', 'Year', 'hg/ha_yield',
       'average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp',
       'Area'],
      dtype='object')

In [None]:
# Check the data type and non-null count of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4121 entries, 0 to 4120
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     4121 non-null   int64  
 1   Item                           4121 non-null   object 
 2   Year                           4121 non-null   int64  
 3   hg/ha_yield                    4121 non-null   int64  
 4   average_rain_fall_mm_per_year  4121 non-null   float64
 5   pesticides_tonnes              4121 non-null   float64
 6   avg_temp                       4121 non-null   float64
 7   Area                           4121 non-null   object 
dtypes: float64(3), int64(3), object(2)
memory usage: 257.7+ KB


In [None]:
df.describe()

,Unnamed: 0,Year,hg/ha_yield,average_rain_fall_mm_per_year,pesticides_tonnes,avg_temp
count,4121.0,4121.0,4121.0,4121.0,4121.0,4121.0
mean,14135.846639,2001.553749,36310.070614,1098.124242,32765.983322,19.925159
std,8201.607601,7.04449,27456.370877,721.559071,54088.622824,6.654389
min,0.0,1990.0,849.0,51.0,0.04,1.61
25%,6847.0,1995.0,17086.0,537.0,1597.0,15.67
50%,14595.0,2001.0,25401.0,1020.0,14485.33,20.81
75%,21228.0,2008.0,48243.0,1622.0,43720.04,25.92
max,28235.0,2013.0,207556.0,3240.0,367778.0,30.65
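
The quartiles above are enough for a rough outlier check using Tukey's fences (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR). This is a sketch under stated assumptions: the helper `tukey_fences` is hypothetical, the 1.5 multiplier is the conventional default rather than anything the notebook specifies, and the numbers are copied from the `df.describe()` output.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (25%, 75%, max) copied from the df.describe() output.
summary = {
    "hg/ha_yield": (17086.0, 48243.0, 207556.0),
    "pesticides_tonnes": (1597.0, 43720.04, 367778.0),
}

max_above_fence = {}
for column, (q1, q3, col_max) in summary.items():
    lower, upper = tukey_fences(q1, q3)
    max_above_fence[column] = col_max > upper
    print(f"{column}: upper fence = {upper:.1f}, max above fence: {col_max > upper}")
```

A maximum above the upper fence means at least one value would be flagged as a potential outlier under this rule; it does not by itself mean the value is an error.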


In [None]:
# Check the number of rows and columns (shape)
df.shape

(4121, 8)

In [None]:
# Percentage of missing values (NaN) in each column
df.isna().sum() * 100/len(df)

Unnamed: 0,0.0
Item,0.0
Year,0.0
hg/ha_yield,0.0
average_rain_fall_mm_per_year,0.0
pesticides_tonnes,0.0
avg_temp,0.0
Area,0.0
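
Since `isna()` returns booleans, taking the mean gives the missing fraction directly, which is a common shorthand for the sum-divided-by-length form used above. The frame below is a small illustrative example with one deliberately missing value; it is not the maize data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the maize data) with one missing value.
demo = pd.DataFrame({
    "Year": [1990, 1991, 1992, 1993],
    "avg_temp": [16.37, np.nan, 16.06, 16.05],
})

pct_sum = demo.isna().sum() * 100 / len(demo)   # form used in this notebook
pct_mean = demo.isna().mean() * 100             # equivalent shorthand

print(pct_mean)
```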


In [None]:
# Data preparation: look at the first rows again before selecting columns
df.head()

,Unnamed: 0,Item,Year,hg/ha_yield,average_rain_fall_mm_per_year,pesticides_tonnes,avg_temp,Area
0,0,Maize,1990,36613,1485.0,121.0,16.37,Kenya
1,6,Maize,1991,29068,1485.0,121.0,15.36,Kenya
2,12,Maize,1992,24876,1485.0,121.0,16.06,Kenya
3,18,Maize,1993,24185,1485.0,121.0,16.05,Kenya
4,23,Maize,1994,25848,1485.0,201.0,16.96,Kenya


In [None]:
df.columns

Index(['Unnamed: 0', 'Item', 'Year', 'hg/ha_yield',
       'average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp',
       'Area'],
      dtype='object')

In [22]:
# Keep only the analysis columns; this leaves out the leftover 'Unnamed: 0' index column
columns_to_keep = ['Item', 'Year', 'hg/ha_yield',
                   'average_rain_fall_mm_per_year', 'pesticides_tonnes',
                   'avg_temp', 'Area']
df = df[columns_to_keep].copy()

In [23]:
df.head()


,Item,Year,hg/ha_yield,average_rain_fall_mm_per_year,pesticides_tonnes,avg_temp,Area
0,Maize,1990,36613,1485.0,121.0,16.37,Kenya
1,Maize,1991,29068,1485.0,121.0,15.36,Kenya
2,Maize,1992,24876,1485.0,121.0,16.06,Kenya
3,Maize,1993,24185,1485.0,121.0,16.05,Kenya
4,Maize,1994,25848,1485.0,201.0,16.96,Kenya


In [25]:
df.shape

(4121, 7)

In [28]:
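# rename() returns a new DataFrame; it is only displayed here, so df keeps its original column names
df.rename(columns={'average_rain_fall_mm_per_year': 'Average_Rain_Fall_MM_PER_YEAR', 'pesticides_tonnes': 'Pesticides_Tonnes', 'avg_temp': 'Avg_Temp'})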

,Item,Year,hg/ha_yield,Average_Rain_Fall_MM_PER_YEAR,Pesticides_Tonnes,Avg_Temp,Area
0,Maize,1990,36613,1485.0,121.00,16.37,Kenya
1,Maize,1991,29068,1485.0,121.00,15.36,Kenya
2,Maize,1992,24876,1485.0,121.00,16.06,Kenya
3,Maize,1993,24185,1485.0,121.00,16.05,Kenya
4,Maize,1994,25848,1485.0,201.00,16.96,Kenya
...,...,...,...,...,...,...,...
4116,Maize,2009,4642,657.0,3269.99,20.52,Kenya
4117,Maize,2010,8751,657.0,3305.17,21.17,Kenya
4118,Maize,2011,6568,657.0,3340.35,20.78,Kenya
4119,Maize,2012,7912,657.0,3375.53,20.52,Kenya
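
Because `rename()` returns a new DataFrame by default, the renamed columns above exist only in the displayed result. The sketch below shows the difference between calling `rename()` and assigning its result back. It uses two rows copied from this notebook's output, and the names `new_names` and `renamed` are illustrative.

```python
import pandas as pd

# Two rows copied from the df.head() output in this notebook.
df = pd.DataFrame({
    "average_rain_fall_mm_per_year": [1485.0, 1485.0],
    "pesticides_tonnes": [121.0, 121.0],
    "avg_temp": [16.37, 15.36],
})

new_names = {
    "average_rain_fall_mm_per_year": "Average_Rain_Fall_MM_PER_YEAR",
    "pesticides_tonnes": "Pesticides_Tonnes",
    "avg_temp": "Avg_Temp",
}

# rename() returns a new DataFrame; df itself keeps the old names.
renamed = df.rename(columns=new_names)
columns_before = list(df.columns)
print(columns_before)
print(list(renamed.columns))

# Assigning the result back is one way to make the change stick.
df = df.rename(columns=new_names)
print(list(df.columns))
```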


In [31]:
# Flag rows that repeat an earlier row (True = duplicate)
df.duplicated()

0,False
1,False
2,False
3,False
4,False
...,...
4116,False
4117,False
4118,False
4119,False
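
A truncated boolean Series is hard to read for thousands of rows, so reducing it to a single number is usually more informative. The sketch below uses a small illustrative frame (not the maize data) in which the last row deliberately repeats the first; the variable names are hypothetical.

```python
import pandas as pd

# Illustrative frame (not the maize data): the last row repeats the first.
demo = pd.DataFrame({
    "Year": [1990, 1991, 1990],
    "hg/ha_yield": [36613, 29068, 36613],
})

mask = demo.duplicated()          # boolean Series, True for repeated rows
n_duplicates = int(mask.sum())    # one number instead of a long Series
has_duplicates = bool(mask.any())

print(n_duplicates, has_duplicates)

# drop_duplicates() removes repeats, keeping the first occurrence by default.
deduped = demo.drop_duplicates()
print(len(deduped))
```

Applied to the notebook's DataFrame, `df.duplicated().sum()` returning 0 would confirm that no row is an exact repeat of an earlier one.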
