# **Business Understanding for Maize Price Prediction Model**

### 1. **Objective**
The project aims to develop a predictive model that forecasts the wholesale and retail prices of maize in different markets across Kenya, using historical maize price data and related factors. Accurate forecasts will help key stakeholders in the maize supply chain, such as farmers, traders, and policymakers, make informed decisions.

### 2. **Key Stakeholders**
- **Farmers**: To optimize when to sell their maize based on expected future prices.
- **Traders**: To plan their buying and selling strategies and improve profitability.
- **Policymakers**: To inform policy decisions aimed at ensuring food security and stabilizing market prices.
- **Suppliers and Distributors**: To manage stock levels and plan supply chain logistics effectively.
- **Retailers**: To predict retail prices, helping them manage inventory and sales strategies.

### 3. **Business Problem**
Fluctuations in maize prices directly affect Kenya's economy and food security. They are driven by factors such as seasonality, supply volume, market location, and maize classification (for example, yellow or white maize). Price changes remain hard to predict, however, because of shifts in the agricultural market and external factors such as weather conditions or policy amendments.

The inability to forecast maize prices accurately results in:
- Uncertainty in income for farmers.
- Exploitation of farmers.
- Price volatility for consumers.
- Inefficient stock management for traders and suppliers.

By predicting prices accurately, stakeholders can mitigate risks, optimize profit margins, and contribute to the stability of the maize market.

### 4. **Current Situation**
Currently, the maize market operates with limited forward-looking insights. Most pricing decisions are made based on historical prices, intuition, or delayed market reports. This has led to:
- **Unanticipated price rises or drops**, negatively affecting both farmers and consumers.
- **Supply chain inefficiencies** caused by unanticipated demand-supply imbalances.

### 5. **Expected Benefits**
- **For Farmers**: Improved decision-making for when to sell maize, optimizing profit margins.
- **For Traders and Distributors**: Efficient stock management, reducing losses from oversupply or undersupply.
- **For Policymakers**: Better policy formulation and food security planning.
- **For Retailers**: Pricing strategies that minimize losses and increase profitability.

### 6. **Scope of the Model**
The model will predict the wholesale and retail prices of maize using key input variables:
- **Date** (to capture seasonality and trends over time).
- **Supply Volume** (to reflect how the quantity of maize brought to the market affects prices).
- **County** (to capture regional price differences).
- **Market** (to account for market-specific variations).
- **Classification** (to distinguish between types of maize, e.g., yellow and white maize).

The model will be trained on historical data, taking into account seasonality, trend, and the market factors listed above. It can then be used to forecast prices in wholesale and retail markets.

### 7. **Data Sources**
The primary data source is the **Ministry of Agriculture and Livestock Development's AMIS website**, which provides detailed information on:
- Maize commodity prices (wholesale and retail)
- Classification (yellow or white maize)
- Supply volume
- Market and county location
- Dates and other relevant time-based data.

### 8. **Challenges and Risks**
- **Data Gaps**: Missing data for certain markets, maize types, or time periods may affect the accuracy of predictions. Data imputation techniques (like `bfill`, `ffill`, and KNN) will be used to mitigate this.
- **Price Volatility**: Sudden external factors, such as weather changes or government interventions, could cause price changes that are hard to predict.
- **Seasonality**: Maize prices are highly seasonal, and capturing this with sufficient precision is essential for accurate forecasting.

### 9. **Model Output**
The model will output predictions for:
- **Wholesale prices**,
- **Retail prices**.

These predictions will be available for a given future date and specific locations/markets across Kenya.

### 10. **Future Considerations**
- **Expansion to Other Crops**: The model could be adapted to predict prices for other staple crops, provided relevant data is available.
- **Improvement in Accuracy**: By incorporating additional data, such as weather patterns, government interventions, or global commodity prices, the model’s accuracy could be further enhanced.




# **Data Understanding for Maize Price Prediction Model**

### 1. **Overview of the Dataset**
The dataset for the maize price prediction model has been sourced from the **Ministry of Agriculture and Livestock Development's AMIS** website. It contains information about maize prices in various markets across Kenya. The data includes several key features that are expected to influence maize prices.

### 2. **Key Features in the Dataset**
The dataset consists of the following columns:

- **Date**: The date of the record. This field is essential because it captures temporal patterns, trends, and seasonality in prices.
  
- **Commodity**: The name of the product, in this case maize (recorded as `Dry Maize`). Because the current model covers only maize, this field adds no variability, but it could be useful if the model is later extended to other crops.
  
- **Classification**: The type of maize, for example Yellow Maize or White Maize. This categorical feature helps differentiate price patterns across maize types.
  
- **Market**: The market in which the maize price was recorded. This feature accounts for geographical variation in prices caused by differences in demand and supply between markets.
  
- **County**: The county in which the market is located. County-level data accounts for regional factors, such as transport costs and demand, that drive price setting.
  
- **Wholesale Price**: The wholesale price of maize. This is one of the main target variables that will be forecasted by the model.
  
- **Retail Price**: This is the amount paid by consumers for maize. This is the second target variable to be forecasted.
  
- **Supply Volume**: The amount of maize brought to the market, which directly impacts price movements. A higher supply typically reduces prices, while lower supply may increase prices.
  
- **Other Time Features**: In addition to the raw date, several engineered time features are included:
  - **Day**, **Month**, **Year**: These capture the specific day, month, and year of each record.
  - **Day_sin**, **Day_cos**, **Month_sin**, **Month_cos**, **Year_sin**, **Year_cos**: These are cyclic transformations of day, month, and year to help capture seasonality patterns and cyclical trends in maize prices.
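
The cyclic transformations listed above can be sketched as follows. This is a minimal illustration rather than the notebook's actual feature code: the helper name `add_cyclic_features` and the fixed period of 31 for the day of month are assumptions, and the sketch covers month and day only because the source does not state what period is used for year.

```python
import numpy as np
import pandas as pd


def add_cyclic_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Return a copy of df with Month/Day columns and their sin/cos encodings.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["Month"] = dates.dt.month
    out["Day"] = dates.dt.day
    # Map months 1..12 onto a circle so December and January end up adjacent.
    out["Month_sin"] = np.sin(2 * np.pi * out["Month"] / 12)
    out["Month_cos"] = np.cos(2 * np.pi * out["Month"] / 12)
    # Day of month uses a fixed period of 31 (an assumption for this sketch).
    out["Day_sin"] = np.sin(2 * np.pi * out["Day"] / 31)
    out["Day_cos"] = np.cos(2 * np.pi * out["Day"] / 31)
    return out


sample = pd.DataFrame({"Date": ["2021-12-15", "2022-01-15", "2022-06-15"]})
encoded = add_cyclic_features(sample)
print(encoded[["Month", "Month_sin", "Month_cos"]])
```

With the raw month number, December (12) and January (1) look far apart; in the sin/cos plane they sit next to each other, which is the property the encoded columns are meant to give the model.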

### 3. **Data Types**
- **Numerical Features**:
  - **Wholesale Price**: Continuous, float.
  - **Retail Price**: Continuous, float.
  - **Supply Volume**: Continuous, float.
  - **Time Features**: Discrete (Day, Month, Year) and continuous (sin/cos transformations).

- **Categorical Features** (One-hot Encoded):
  - **Classification**
  - **Market**
  - **County**

### 4. **Data Quality**
#### **Missing Values**
- Missing values are present in the dataset, particularly in the **Wholesale**, **Retail**, and **Supply Volume** columns. Several imputation strategies have been considered:
  - **Backward Fill (`bfill`)**: Filling missing values with the next valid value in the column.
  - **Forward Fill (`ffill`)**: Filling missing values with the last valid value in the column.
  - **KNN Imputation**: Using K-Nearest Neighbors to impute missing values based on similarity in other features.
  
  The approach to be used depends on the data distribution and model performance after each technique.
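
To make the difference between forward and backward fill concrete, here is a small sketch on made-up prices for two markets. The values and the choice to fill within each market (via `groupby`) are assumptions for illustration; the source only names the techniques.

```python
import numpy as np
import pandas as pd

# Made-up weekly prices with gaps, for illustration only.
prices = pd.DataFrame({
    "Market": ["Ahero"] * 4 + ["Nyamakima"] * 4,
    "Date": pd.to_datetime(
        ["2024-08-01", "2024-08-08", "2024-08-15", "2024-08-22"] * 2
    ),
    "Wholesale": [45.0, np.nan, np.nan, 47.0, np.nan, 44.0, np.nan, 46.0],
})
prices = prices.sort_values(["Market", "Date"])

# Filling inside each market keeps one market's price from leaking into another.
by_market = prices.groupby("Market")["Wholesale"]
prices["Wholesale_ffill"] = by_market.ffill()  # carry the last known price forward
prices["Wholesale_bfill"] = by_market.bfill()  # pull the next known price backward

print(prices)
```

Note that forward fill leaves the first Nyamakima row empty because no earlier value exists in that market, while backward fill covers it. Which behaviour is preferable depends on the data, as stated above.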

#### **Outliers**
- **Wholesale and Retail Prices**: Outliers in price data may occur due to exceptional circumstances (e.g., price spikes due to drought or political interventions). 

- **Supply Volume**: Sudden, extreme variations in supply volume also need to be examined.

### 5. **Feature Engineering**
To enhance the predictive power of the model, several engineered features have been created:
- **Rolling Means and Standard Deviations**: 7-day rolling means and standard deviations for **Wholesale Price**, **Retail Price**, and **Supply Volume**. These features help smooth out short-term fluctuations and provide a clearer view of price trends.
  
- **Lagged Features**: Lag features for **Supply Volume** and prices help capture the delayed effects of supply changes on prices.
  
- **Binary Encoding**: For categorical variables like **Classification**, **Market**, and **County**, to ensure they can be properly utilized in machine learning algorithms.
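
The rolling and lagged features described above could be built along these lines. This is a sketch under assumptions: the prices are made up, a 3-observation window stands in for the 7-day window to keep the example short, and shifting by one row before rolling (so each feature uses only earlier observations) is a design choice of this sketch, not something the source specifies.

```python
import pandas as pd

# Made-up daily prices for two markets, for illustration only.
history = pd.DataFrame({
    "Market": ["Ahero"] * 5 + ["Nyamakima"] * 5,
    "Date": pd.to_datetime(
        ["2024-08-01", "2024-08-02", "2024-08-03", "2024-08-04", "2024-08-05"] * 2
    ),
    "Wholesale": [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 50.0, 53.0, 53.0, 56.0],
}).sort_values(["Market", "Date"])

grouped = history.groupby("Market")["Wholesale"]

# Lag feature: the previous observation within the same market.
history["Wholesale_lag1"] = grouped.shift(1)

# Rolling mean over the three previous observations within the same market.
history["Wholesale_roll_mean3"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(history)
```

Grouping by market means the first row of each market has no lag value, rather than inheriting the last price of the market sorted before it.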



# Data Cleaning: Preparing Our Dataset for Analysis

In every data analysis project, the quality of the data we work with is critical. This notebook covers the major steps of cleaning our dataset so that it is accurate, complete, and ready for insightful analysis.

### Objectives:
1. **Understand the Data**: Explore the dataset to identify its structure and contents.
2. **Identify Issues**: Detect missing values, duplicates, and inconsistencies.
3. **Apply Cleaning Techniques**: Utilize various methods to correct and standardize the data.
4. **Prepare for Analysis**: Finalize the dataset for visualization and modeling.

Let’s dive in and transform our raw data into a clean, usable format!

In [1]:
# importing all necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from scipy import stats


In [2]:
# importing and combining the data
file_paths = (
  "raw data/Market Prices.xls",
  "raw data/Market Prices 2.xls",
  "raw data/Market Prices 3.xls",
  "raw data/Market Prices 4.xls",
  "raw data/Market Prices 5.xls",
  "Raw Data/Market Prices 6.xls",
  "raw data/Market Prices 7.xls",
  "Raw Data/Market Prices 8.xls",
)


dfs = []
for file in file_paths:
  df = pd.read_excel(file)
  dfs.append(df)

df = pd.concat(dfs)

df

Unnamed: 0,Commodity,Classification,Grade,Sex,Market,Wholesale,Retail,Supply Volume,County,Date
0,Dry Maize,Mixed-Traditional,-,-,Isebania Market,-,32.00/Kg,45000.0,Migori,2024-08-27
1,Dry Maize,White Maize,-,-,Ahero,45.00/Kg,50.00/Kg,7000.0,Kisumu,2024-08-27
2,Dry Maize,White Maize,-,-,Nyamakima,44.44/Kg,70.00/Kg,,Nairobi,2024-08-27
3,Dry Maize,White Maize,-,-,Kathonzweni,35.00/Kg,40.00/Kg,7200.0,Makueni,2024-08-27
4,Dry Maize,White Maize,-,-,Kawangware,40.00/Kg,50.00/Kg,,Nairobi,2024-08-27
...,...,...,...,...,...,...,...,...,...,...
1272,Dry Maize,-,-,-,Loitoktok Market,25.56/Kgs,35.00/Kgs,9900.0,Kajiado,2021-05-24
1273,Dry Maize,White Maize,-,-,Chepseon,24.56/Kgs,28.33/Kgs,1800.0,Kericho,2021-05-24
1274,Dry Maize,Yellow Maize,-,Male,Nakuru Wakulima,36.67/Kgs,40.00/Kgs,,Nakuru,2021-05-24
1275,Dry Maize,White Maize,-,-,Elwak Market,40.00/Kgs,50.00/Kgs,1200.0,Mandera,2021-05-24
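
The leftmost column in the output above restarts at 0 for each source file because `pd.concat` keeps every frame's own row labels by default. The sketch below shows the effect on two tiny frames (values taken from rows shown above) and what `ignore_index=True` would do instead. The notebook keeps the original labels; this is only an illustration of the alternative.

```python
import pandas as pd

part_a = pd.DataFrame({"Market": ["Ahero", "Nyamakima"], "Retail": ["50.00/Kg", "70.00/Kg"]})
part_b = pd.DataFrame({"Market": ["Chepseon", "Elwak Market"], "Retail": ["28.33/Kgs", "50.00/Kgs"]})

# Default behaviour: each frame's labels are kept, so labels repeat.
stacked = pd.concat([part_a, part_b])

# Alternative: discard the old labels and number the rows 0..n-1.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for positional or boolean filtering, but label-based lookups such as `stacked.loc[0]` return several rows.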


Given that the Commodity column holds a single value and the Grade and Sex columns carry no usable information (almost every entry is a `-` placeholder), we can drop these columns.


In [3]:
# Drop irrelevant columns
df.drop(['Commodity','Grade','Sex'], axis=1, inplace=True)


# Verify columns have been dropped
df.head()

Unnamed: 0,Classification,Market,Wholesale,Retail,Supply Volume,County,Date
0,Mixed-Traditional,Isebania Market,-,32.00/Kg,45000.0,Migori,2024-08-27
1,White Maize,Ahero,45.00/Kg,50.00/Kg,7000.0,Kisumu,2024-08-27
2,White Maize,Nyamakima,44.44/Kg,70.00/Kg,,Nairobi,2024-08-27
3,White Maize,Kathonzweni,35.00/Kg,40.00/Kg,7200.0,Makueni,2024-08-27
4,White Maize,Kawangware,40.00/Kg,50.00/Kg,,Nairobi,2024-08-27


In [4]:
# replacing the hyphen placeholders with NaN for better visibility and easier manipulation
df.replace(['-', ' - ', '- ', ' -'], np.nan, inplace=True)
df.isna().sum()

Classification     668
Market               0
Wholesale         1936
Retail             908
Supply Volume     4401
County               0
Date                 0
dtype: int64

In [5]:
df.shape

(22277, 7)

In [6]:
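# Convert 'Wholesale' and 'Retail' to numerical values for exploration
price_columns = ["Wholesale", "Retail"]

for col in price_columns:
  # Lowercase, then strip the "/kg" unit suffix (literal match, not a regex)
  df[col] = df[col].str.lower().str.replace("/kg", "", regex=False).str.strip()
  # Strip the "s" left over from "/kgs" entries, then cast to float
  df[col] = df[col].str.lower().str.replace("s", "", regex=False).str.strip().astype(float)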

df.head()

Unnamed: 0,Classification,Market,Wholesale,Retail,Supply Volume,County,Date
0,Mixed-Traditional,Isebania Market,,32.0,45000.0,Migori,2024-08-27
1,White Maize,Ahero,45.0,50.0,7000.0,Kisumu,2024-08-27
2,White Maize,Nyamakima,44.44,70.0,,Nairobi,2024-08-27
3,White Maize,Kathonzweni,35.0,40.0,7200.0,Makueni,2024-08-27
4,White Maize,Kawangware,40.0,50.0,,Nairobi,2024-08-27
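
As an alternative to chaining two replacements, the numeric part of each price string can be pulled out with one regular expression. This is a sketch of a different technique from the one used in the cell above, shown on a few strings in the formats seen in the raw data.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["45.00/Kg", "25.56/Kgs", " 44.44/kg ", np.nan])

# Capture the first run of digits with an optional decimal part,
# then convert to float; anything that does not match becomes NaN.
numeric = pd.to_numeric(
    raw.str.extract(r"(\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)

print(numeric.tolist())
```

This approach does not depend on the exact spelling of the unit suffix, so a new variant of the suffix would not need another replacement step.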


I am going to use KNN to fill the missing numerical values and drop the records with missing categorical values.

In [7]:
# using knn to fill null values
knn_imputer = KNNImputer(n_neighbors = 5)

knn_columns = ["Supply Volume", "Retail", "Wholesale"]
df[knn_columns] = knn_imputer.fit_transform(df[knn_columns]) 
df.head()


Unnamed: 0,Classification,Market,Wholesale,Retail,Supply Volume,County,Date
0,Mixed-Traditional,Isebania Market,22.578,32.0,45000.0,Migori,2024-08-27
1,White Maize,Ahero,45.0,50.0,7000.0,Kisumu,2024-08-27
2,White Maize,Nyamakima,44.44,70.0,1870.0,Nairobi,2024-08-27
3,White Maize,Kathonzweni,35.0,40.0,7200.0,Makueni,2024-08-27
4,White Maize,Kawangware,40.0,50.0,1560.0,Nairobi,2024-08-27
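
`KNNImputer` picks neighbours using a NaN-aware Euclidean distance, so a column measured in thousands (Supply Volume) weighs far more in that distance than columns measured in tens (the prices). One possible variant, not used in this notebook, is to standardize the columns before imputing and convert back afterwards. The sketch below assumes a small sample built from rows shown earlier and `n_neighbors=2` to suit its size.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Small sample for illustration only.
toy = pd.DataFrame({
    "Supply Volume": [7000.0, 7200.0, np.nan, 1200.0, 1800.0, 45000.0],
    "Retail": [50.0, 40.0, 70.0, 50.0, 28.33, 32.0],
    "Wholesale": [45.0, 35.0, 44.44, 40.0, 24.56, np.nan],
})

# StandardScaler ignores NaNs when fitting and leaves them in place when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute on the standardized values so every column contributes comparably.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Convert back to the original units.
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)

print(imputed)
```

Whether scaling improves the imputed values for this dataset would need to be checked against model performance; the sketch only shows the mechanics.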


In [8]:
df.dropna(inplace=True)

In [9]:
df.shape

(21609, 7)

In [10]:
# checking data types for the various columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21609 entries, 0 to 1276
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Classification  21609 non-null  object 
 1   Market          21609 non-null  object 
 2   Wholesale       21609 non-null  float64
 3   Retail          21609 non-null  float64
 4   Supply Volume   21609 non-null  float64
 5   County          21609 non-null  object 
 6   Date            21609 non-null  object 
dtypes: float64(3), object(4)
memory usage: 1.3+ MB


In [11]:
# Sort data by 'County', 'Market', 'Classification' and 'Date'
df.sort_values(by=['County', 'Market', 'Classification', 'Date'], inplace=True)

In [12]:
df.head(20)

Unnamed: 0,Classification,Market,Wholesale,Retail,Supply Volume,County,Date
0,Mixed-Traditional,Eldama Ravine,31.288,40.0,2650.0,Baringo,2021-05-24
541,Mixed-Traditional,Eldama Ravine,27.78,40.0,1000.0,Baringo,2021-05-24
1055,Mixed-Traditional,Eldama Ravine,22.22,40.0,900.0,Baringo,2021-05-24
1809,Mixed-Traditional,Eldama Ravine,31.11,45.0,900.0,Baringo,2021-06-28
368,Mixed-Traditional,Eldama Ravine,31.11,32.0,900.0,Baringo,2021-09-20
3,Mixed-Traditional,Eldama Ravine,30.0,35.86,900.0,Baringo,2021-10-04
2534,Mixed-Traditional,Eldama Ravine,30.0,36.0,900.0,Baringo,2021-10-25
1176,Mixed-Traditional,Eldama Ravine,30.0,36.0,600.0,Baringo,2022-01-03
639,Mixed-Traditional,Eldama Ravine,30.0,34.0,1800.0,Baringo,2022-01-31
500,Mixed-Traditional,Eldama Ravine,30.0,34.0,1500.0,Baringo,2022-02-07


I have noted that some markets have very little data, which may not be sufficient for training the model. I will drop markets with fewer than 10 records.

In [13]:
# removing markets with fewer than 10 records
threshold = 10

# Filter out markets with fewer than 'threshold' records
market_counts = df["Market"].value_counts()
markets_to_keep = market_counts[market_counts >= threshold].index

# Keep only the data for markets with enough records
data = df[df['Market'].isin(markets_to_keep)]

# Check how many markets remain
print(f"Number of remaining markets: {data['Market'].nunique()}")

Number of remaining markets: 201


We need to remove outliers from the data, as they may be erroneous and may skew the data, creating bias in our final model.

In [14]:
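num_columns = ["Retail", "Wholesale", "Supply Volume"]

# Boolean mask with one entry per row; True marks a row flagged as an outlier
outliers = np.zeros(data.shape[0], dtype=bool)

for col in num_columns:
    # Flag rows lying more than 3 standard deviations from the column mean
    z_scores = stats.zscore(data[col])
    outliers = outliers | (np.abs(z_scores) > 3)

# Keep the flagged rows aside for inspection
outlier_df = data[outliers].reset_index(drop=True)

# Remove the flagged rows from the working dataset
data = data[~outliers]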

In [15]:
len(outlier_df)

235
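
The rule applied above flags a value when its z-score, the distance from the column mean measured in standard deviations, exceeds 3 in absolute value. A tiny worked illustration with made-up prices follows; it uses NumPy's default population standard deviation, which matches the default (`ddof=0`) of `scipy.stats.zscore`.

```python
import numpy as np

# Made-up retail prices: 19 ordinary values and one extreme entry.
retail = np.array([40.0] * 10 + [42.0] * 9 + [400.0])

# z-score with the population standard deviation (ddof=0).
z = (retail - retail.mean()) / retail.std()

# Same threshold as the notebook cell above.
flagged = np.abs(z) > 3

print(retail[flagged])  # [400.]
```

Because the extreme value also inflates the mean and the standard deviation, this rule only catches points that stand far apart from the bulk of the data.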

In [16]:
data = data.drop_duplicates()

In [17]:
# saving the data to use in modelling
data.to_csv("clean_data2.csv", index = False)
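
The `Date` column is still stored as text (dtype `object` in the `df.info()` output above), and a CSV file does not record column types. When the saved file is loaded for modelling, the dates can be parsed at read time. The sketch below uses an in-memory stand-in for `clean_data2.csv` with two rows copied from earlier output, so it runs without the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the saved CSV, for illustration only.
csv_text = io.StringIO(
    "Classification,Market,Wholesale,Retail,Supply Volume,County,Date\n"
    "White Maize,Ahero,45.0,50.0,7000.0,Kisumu,2024-08-27\n"
    "White Maize,Kathonzweni,35.0,40.0,7200.0,Makueni,2024-08-27\n"
)

# parse_dates converts the listed columns to datetime while reading.
reloaded = pd.read_csv(csv_text, parse_dates=["Date"])

print(reloaded.dtypes)
```

With a real file, the same call would take the path `"clean_data2.csv"` in place of `csv_text`.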