
In [1]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
chatermarzougui_cote_divoire_byte_sized_agriculture_challenge_path = kagglehub.dataset_download('chatermarzougui/cote-divoire-byte-sized-agriculture-challenge')

print('Data source import complete.')


Data source import complete.
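
The download call above returns a local directory, while the loading cells further down read from hardcoded `/kaggle/input/...` paths that exist only on Kaggle. A minimal sketch of resolving files relative to the downloaded directory instead, assuming the CSV and manifest files sit somewhere under it. `find_data_file` is a hypothetical helper written for this notebook, not part of `kagglehub`.

```python
from pathlib import Path


def find_data_file(base_dir, pattern):
    """Return the first file under base_dir whose name matches a glob pattern.

    Hypothetical helper: the search is recursive, so it works whether the
    files sit at the top level of the download or inside a subfolder.
    """
    matches = sorted(Path(base_dir).rglob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} under {base_dir}")
    return matches[0]


# Possible usage with the path returned by kagglehub (assumed layout):
# base = chatermarzougui_cote_divoire_byte_sized_agriculture_challenge_path
# train_csv = find_data_file(base, 'sample_train.csv')
# manifest_json = find_data_file(base, 'manifest-*.json')
```

Matching the manifest with a wildcard avoids copying its long generated file name by hand.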


<img src="https://devra.ai/analyst/notebook/1850/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;"><div style="font-size:150%; color:#FEE100"><b>Cote d'Ivoire Agriculture Analysis</b></div><div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

# Introduction

The data explored here comes from the Cote d'Ivoire Byte-Sized Agriculture Challenge. The sample training set has only six columns (`ID`, `year`, `month`, `tifPath`, `Target`, and `class`), so the analysis is deliberately small: we clean the data, run a brief exploratory analysis, and build a simple classifier that predicts `class` from `year` and `month`. If you find this analysis useful, please upvote it.

## Table of Contents

- [Data Loading](#Data-Loading)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Prediction Model](#Prediction-Model)
- [Conclusion and Future Work](#Conclusion-and-Future-Work)

In [2]:
# Import necessary libraries and set up configurations
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend; the figures below are saved to PNG files
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings

warnings.filterwarnings('ignore')

# Inline display for notebooks; this magic switches the backend to inline, superseding the Agg selection above
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Setting a random seed for reproducibility
RANDOM_STATE = 42

In [None]:
# Data Loading
# Read the sample training dataset and manifest file
train_df = pd.read_csv('/kaggle/input/cote-divoire-byte-sized-agriculture-challenge/sample_train.csv',
                       delimiter=',', encoding='ascii')

with open('/kaggle/input/cote-divoire-byte-sized-agriculture-challenge/manifest-8b1c012925d4ea8da8a753934dfc9e0320250427-32468-3f0dl3.json', 'r') as f:
    manifest_data = json.load(f)

# Read the sample submission file for reference
submission_df = pd.read_csv('/kaggle/input/cote-divoire-byte-sized-agriculture-challenge/SampleSubmission.csv',
                            delimiter=',', encoding='ascii')

print('Data loading complete. The sample training dataset has {} rows and {} columns.'.format(train_df.shape[0], train_df.shape[1]))
# A quick look at the first few rows (this operation may be commented out in production notebooks)
train_df.head()

In [5]:
# Data Cleaning and Preprocessing
# Let's inspect the data types and missing values
print('Data Types:')
print(train_df.dtypes)

print('\nMissing Values:')
print(train_df.isnull().sum())

# As a reminder, the columns are:
# - ID (string)
# - year (integer)
# - month (string): This implies a categorical variable, even if it might represent a date component
# - tifPath (string): a file path (unlikely to be used in prediction)
# - Target (string): a target descriptor, but note that sample submission expects an integer Target
# - class (integer): This seems to be a numerical version of the target and will be the label for classification

# For modeling, we will focus on predicting 'class' using 'year' and 'month'.
# Convert 'month' to category if not already
if train_df['month'].dtype != 'category':
    train_df['month'] = train_df['month'].astype('category')

# Optionally, drop columns that are not useful for modeling. In this case, 'ID', 'tifPath', and 'Target' are dropped.
df_model = train_df.drop(['ID', 'tifPath', 'Target'], axis=1)

print('\nProcessed DataFrame head:')
df_model.head()  # This shows the prepared data for our prediction model

Data Types:
ID           object
year          int64
month      category
tifPath      object
Target       object
class         int64
dtype: object

Missing Values:
ID         0
year       0
month      0
tifPath    0
Target     0
class      0
dtype: int64

Processed DataFrame head:


|   | year | month | class |
|---|------|-------|-------|
| 0 | 2024 | Jan   | 3     |
| 1 | 2024 | Jan   | 3     |
| 2 | 2024 | Jan   | 3     |
| 3 | 2024 | Jan   | 3     |
| 4 | 2024 | Jan   | 3     |
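
A plain `astype('category')` sorts the month labels alphabetically, so plots and group-bys would list `Apr` before `Jan`. A minimal sketch of an ordered categorical in calendar order, assuming the months are three-letter English abbreviations like the `Jan` values shown above. `MONTH_ORDER` and `to_ordered_month` are illustrative names, and any label outside the list would become `NaN`.

```python
import pandas as pd

# Assumption: month labels are three-letter English abbreviations ('Jan', 'Feb', ...).
MONTH_ORDER = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def to_ordered_month(series):
    """Cast month labels to an ordered categorical in calendar order."""
    month_dtype = pd.CategoricalDtype(categories=MONTH_ORDER, ordered=True)
    return series.astype(month_dtype)


# Toy example; the values are illustrative, not taken from the dataset.
demo = pd.Series(['Mar', 'Jan', 'Feb', 'Jan'], name='month')
ordered = to_ordered_month(demo)
print(ordered.sort_values().tolist())  # ['Jan', 'Jan', 'Feb', 'Mar']
```

In the notebook this could be applied as `train_df['month'] = to_ordered_month(train_df['month'])` in place of the plain category conversion.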


In [7]:
# Exploratory Data Analysis (EDA)

sns.set(style='whitegrid')

## Histogram for 'year'
plt.figure(figsize=(8,4))
sns.histplot(df_model['year'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Year')
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('year_histogram.png')
plt.close()

## Countplot for 'month'
plt.figure(figsize=(8,4))
sns.countplot(x='month', data=df_model, palette='viridis')
plt.title('Count of Records per Month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('month_countplot.png')
plt.close()

## Pair Plot for numeric features
# Our numeric features include 'year' and 'class'. While limited, the pairplot can help spot any anomalies.
sns.pairplot(df_model.select_dtypes(include=[np.number]))
plt.savefig('pairplot.png')
plt.close()

# Note: We skip the correlation heatmap since there are fewer than 4 numeric columns.
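
Because `month` and `class` are both categorical, a contingency table may say more about their relationship than a pair plot of `year` against `class`. The sketch below uses a small made-up frame in place of `df_model`. Only the column names come from the notebook; the values are illustrative.

```python
import pandas as pd

# Toy stand-in for df_model; the values are illustrative, not from the dataset.
toy = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar'],
    'class': [3, 3, 1, 2, 1, 2],
})

# Rows are months, columns are classes, cells are sample counts.
counts = pd.crosstab(toy['month'], toy['class'])

# Row-normalised shares show the class mix within each month.
shares = pd.crosstab(toy['month'], toy['class'], normalize='index')

print(counts)
print(shares.round(2))
```

If every month shows roughly the same class mix in the real data, `month` alone is unlikely to separate the classes well.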

In [8]:
# Prediction Model
# In this section, we build a predictor to classify 'class' using 'year' and 'month'.
# We'll use a simple logistic regression pipeline with one-hot encoding for the categorical 'month'.

# Define features and target
X = df_model.drop('class', axis=1)
y = df_model['class']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y)

# Identify numeric and categorical columns
numeric_features = ['year']
categorical_features = ['month']

# Create transformations for each type
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline with the preprocessor and Logistic Regression
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=RANDOM_STATE, max_iter=1000))
])

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Prediction accuracy: {:.2f}%'.format(accuracy * 100))

# Confusion matrix visualization using seaborn's heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.savefig('confusion_matrix.png')
plt.close()

# Note: if LogisticRegression fails to converge, scaling the numeric features (StandardScaler) and
# one-hot encoding the categorical features (OneHotEncoder), as done above, often helps.

Prediction accuracy: 42.41%
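
An accuracy figure is easier to judge next to a trivial baseline. The sketch below scores a majority-class predictor with scikit-learn's `DummyClassifier` on made-up labels. The toy arrays are assumptions for illustration and do not reflect the class distribution of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for y_train / y_test; the values are illustrative only.
y_train_toy = np.array([3, 3, 3, 1, 2, 1, 3, 2])
y_test_toy = np.array([3, 1, 3, 2])

# The 'most_frequent' strategy ignores the features, so zeros suffice here.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most common training label (class 3 in this toy data).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print('Majority-class baseline accuracy: {:.2f}%'.format(baseline_acc * 100))
```

In the notebook, the same calls could be run with `X_train`, `y_train`, `X_test`, and `y_test`. If the logistic regression does not clearly beat this baseline, `year` and `month` probably add little beyond the class frequencies.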


# Conclusion and Future Work

In this notebook, we explored the Cote d'Ivoire Agriculture dataset by cleaning the data, conducting exploratory analysis, and building a prediction model using logistic regression. The model uses only `year` and `month` and reached 42.41% accuracy on the held-out 20% test split, which leaves considerable room for improvement. Even so, the workflow from data exploration to prediction provides a starting point that richer features can build on.

Future analyses may include:

- Integrating additional features, perhaps from the tifPath if image analysis is feasible.
- Exploring more complex models or ensemble methods to improve prediction accuracy.
- Conducting a time series analysis if the temporal dimensions (year and month) reveal seasonality trends.
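
As one possible starting point for the seasonality idea, the month can be encoded as a position on a circle so that December and January end up close together, which one-hot encoding does not capture. This is a sketch under the assumption that months are three-letter English abbreviations. `MONTH_TO_NUM` and `add_cyclical_month` are hypothetical names, not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Assumption: month labels are three-letter English abbreviations.
MONTH_TO_NUM = {m: i for i, m in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}


def add_cyclical_month(df, month_col='month'):
    """Return a copy of df with sine and cosine encodings of the month."""
    out = df.copy()
    # Cast to str first so the mapping also works on categorical columns.
    month_num = out[month_col].astype(str).map(MONTH_TO_NUM)
    # Map months 1..12 onto angles 0, 30, ..., 330 degrees.
    angle = 2 * np.pi * (month_num - 1) / 12
    out['month_sin'] = np.sin(angle)
    out['month_cos'] = np.cos(angle)
    return out


# Toy example; the values are illustrative, not taken from the dataset.
demo = pd.DataFrame({'month': ['Jan', 'Jul', 'Dec']})
enc = add_cyclical_month(demo)
print(enc.round(3))
```

The two new columns could replace or complement the one-hot `month` columns in the `ColumnTransformer`. Whether that helps here would need to be checked against the data.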

Your feedback is appreciated. If this notebook provided value, kindly upvote it.

Thank you for reading.

In [4]:
# Data Loading and Preprocessing
# Read the sample training dataset and manifest file
train_df = pd.read_csv('/kaggle/input/cote-divoire-byte-sized-agriculture-challenge/sample_train.csv',
                       delimiter=',', encoding='ascii')

with open('/kaggle/input/cote-divoire-byte-sized-agriculture-challenge/manifest-8b1c012925d4ea8da8a753934dfc9e0320250427-32468-3f0dl3.json', 'r') as f:
    manifest_data = json.load(f)

# Read the sample submission file for reference
submission_df = pd.read_csv('/kaggle/input/cote-divoire-byte-sized-agriculture-challenge/SampleSubmission.csv',
                            delimiter=',', encoding='ascii')

print('Data loading complete. The sample training dataset has {} rows and {} columns.'.format(train_df.shape[0], train_df.shape[1]))
# A quick look at the first few rows (this operation may be commented out in production notebooks)
train_df.head()

# Data Cleaning and Preprocessing
# Let's inspect the data types and missing values
print('Data Types:')
print(train_df.dtypes)

print('\nMissing Values:')
print(train_df.isnull().sum())

# As a reminder, the columns are:
# - ID (string)
# - year (integer)
# - month (string): This implies a categorical variable, even if it might represent a date component
# - tifPath (string): a file path (unlikely to be used in prediction)
# - Target (string): a target descriptor, but note that sample submission expects an integer Target
# - class (integer): This seems to be a numerical version of the target and will be the label for classification

# For modeling, we will focus on predicting 'class' using 'year' and 'month'.
# Convert 'month' to category if not already
if train_df['month'].dtype != 'category':
    train_df['month'] = train_df['month'].astype('category')

# Optionally, drop columns that are not useful for modeling. In this case, 'ID', 'tifPath', and 'Target' are dropped.
df_model = train_df.drop(['ID', 'tifPath', 'Target'], axis=1)

print('\nProcessed DataFrame head:')
df_model.head()  # This shows the prepared data for our prediction model

Data loading complete. The sample training dataset has 953 rows and 6 columns.
Data Types:
ID         object
year        int64
month      object
tifPath    object
Target     object
class       int64
dtype: object

Missing Values:
ID         0
year       0
month      0
tifPath    0
Target     0
class      0
dtype: int64

Processed DataFrame head:


|   | year | month | class |
|---|------|-------|-------|
| 0 | 2024 | Jan   | 3     |
| 1 | 2024 | Jan   | 3     |
| 2 | 2024 | Jan   | 3     |
| 3 | 2024 | Jan   | 3     |
| 4 | 2024 | Jan   | 3     |
