![alt text](PaletteSkills_Banner.png "Banner")
# Final Assignment 5: Analyzing Rural Municipality Crop Yield Prediction

## Objective

The objective of this assignment is to provide hands-on experience in data science, including data cleaning, exploration, visualization, feature selection, and machine learning model deployment. The focus is on analyzing the Rural Municipality Yield Data from 1938 to 2021 to identify trends, patterns, and insights, and on using machine learning techniques to predict yields for the top 10 most consumed crops across Saskatchewan.

## Crop Yield Prediction

Agriculture is critical to the global economy, and understanding crop yield is essential for addressing food security challenges and reducing the impacts of climate change. Crop yield prediction is a crucial agricultural problem that depends on weather conditions, pesticide use, and accurate historical yield records. This assignment focuses on predicting yields for the top 10 most consumed crops in Saskatchewan using machine learning techniques.

The yield data covers 16 candidate crops:
- Winter Wheat
- Canola
- Spring Wheat
- Mustard
- Durum
- Sunflowers
- Oats
- Lentils
- Peas
- Barley
- Fall Rye
- Canary Seed
- Spring Rye
- Tame Hay
- Flax
- Chickpeas

## Data

1. **Yield Data**
The Rural Municipality Yield Data contains the yield data for crops grown in rural municipalities from 1938 to 2021. The data includes the following fields:

- **Year:** The year for which the yield data was collected.
- **Municipality:** The name of the rural municipality. 
- **Crops:** The type of crops grown.
- **Yield (bu/acre):** The yield of the crop in bushels per acre.

2. **GIS Data**

The GIS data contains the following columns:
- PPID
- EFFDT
- EXPDT
- FEATURECD
- RMNO
- RMNM
- SHAPE_AREA
- SHAPE_LEN
- geometry

### Work Plan
1. **Collection and Understanding**
- Import libraries
- Load data
- Create dataframes, variables

2. **Data Cleaning and Preparation**
- Check for and remove missing values
- Identify and handle outliers
- Transform the data if necessary

3. **Data Exploration**
- Generate summary statistics
- Create visualizations
- Identify trends, patterns, and outliers

4. **Statistical Analysis**
- Perform regression analysis, and apply one supervised and one unsupervised machine learning technique, to understand the relationship between the variables and crop yield.

5. **Insights and Conclusions**
- Identify which crops have the highest yields
- Determine which rural municipalities have the most productive farmland
- Identify which years had the best yields overall

6. **Crop Yield Prediction**
- Use machine learning techniques to predict the yield of the top 10 most consumed crops
- Evaluate the accuracy of the model

7. **Final Report**
- Summarize the findings and conclusions
- Include visualizations to support the analysis
- Provide recommendations for future research

By following this work plan, we will be able to analyze the Rural Municipality Yield Data and predict the yield of the top 10 most consumed crops. The final report will provide valuable insights for addressing food security challenges and reducing the impacts of climate change.
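
As a toy illustration of steps 4 and 6, the sketch below fits a straight-line trend to the first four Spring Wheat values shown for RM 1 in the data preview (1938 to 1941) and checks the prediction against the 1942 value. The use of `np.polyfit`, the single-RM sample, and all variable names are assumptions made for illustration; they are not the method this assignment prescribes.

```python
import numpy as np

# Spring Wheat yields for RM 1, copied from the first rows of the dataset preview.
years = np.array([1938, 1939, 1940, 1941, 1942])
spring_wheat = np.array([4.0, 9.0, 12.0, 18.0, 20.0])

# Hold out the last year to evaluate the fit (a toy train/test split).
# Years are shifted so the first year is 0, which keeps the fit well conditioned.
train_x, test_x = years[:-1] - years[0], years[-1] - years[0]
train_y, test_y = spring_wheat[:-1], spring_wheat[-1]

# Degree-1 least-squares fit: yield ~ slope * (year - 1938) + intercept.
slope, intercept = np.polyfit(train_x, train_y, deg=1)

# Predict the held-out year and measure the absolute error.
predicted = slope * test_x + intercept
abs_error = abs(predicted - test_y)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted={predicted:.1f}, actual={test_y:.1f}, abs error={abs_error:.1f}")
```

With these five values the fitted slope is 4.5 per year, so the model predicts 22 for 1942 against an observed 20. A real evaluation would use far more years and municipalities, and likely a richer model.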


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

## 1. Collection & Understanding Data
- Import libraries
- Load data
- Create dataframes, variables

In [4]:
# Read in data
rm_crop_yields = pd.read_csv('rm_crop_yields_1938_2021.csv', encoding='utf-8')
# A forward slash avoids the invalid "\R" escape sequence and works across platforms
geo_df = gpd.read_file('Rural Municipality/Rural Municipality.shp')

In [5]:
rm_crop_yields.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25017 entries, 0 to 25016
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          25017 non-null  int64  
 1   RM            25017 non-null  int64  
 2   Winter Wheat  3037 non-null   float64
 3   Canola        14008 non-null  float64
 4   Spring Wheat  24924 non-null  float64
 5   Mustard       4487 non-null   float64
 6   Durum         11581 non-null  float64
 7   Sunflowers    946 non-null    float64
 8   Oats          23913 non-null  float64
 9   Lentils       5515 non-null   float64
 10  Peas          8134 non-null   float64
 11  Barley        24703 non-null  float64
 12  Fall Rye      15847 non-null  float64
 13  Canary Seed   3819 non-null   float64
 14  Spring Rye    805 non-null    float64
 15  Tame Hay      4205 non-null   float64
 16  Flax          20934 non-null  float64
 17  Chickpeas     960 non-null    float64
dtypes: float64(16), int64(2)

In [6]:
rm_crop_yields.shape

(25017, 18)
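
The `info()` output above shows that coverage differs sharply between crops. The sketch below turns those non-null counts into a missing-value percentage per crop. The counts are copied from the output above; the variable names are illustrative. Ranking crops by coverage is only one possible proxy for choosing which ones to model, since the dataset itself carries no consumption figures.

```python
import pandas as pd

TOTAL_ROWS = 25017  # from rm_crop_yields.shape

# Non-null counts per crop column, copied from the info() output.
non_null = pd.Series({
    "Winter Wheat": 3037, "Canola": 14008, "Spring Wheat": 24924,
    "Mustard": 4487, "Durum": 11581, "Sunflowers": 946,
    "Oats": 23913, "Lentils": 5515, "Peas": 8134,
    "Barley": 24703, "Fall Rye": 15847, "Canary Seed": 3819,
    "Spring Rye": 805, "Tame Hay": 4205, "Flax": 20934,
    "Chickpeas": 960,
})

# Percentage of rows with no recorded yield, lowest first.
missing_pct = ((1 - non_null / TOTAL_ROWS) * 100).round(1).sort_values()

# Assumption: the ten best-covered crops as a stand-in shortlist.
best_covered = non_null.nlargest(10)

print(missing_pct)
print(list(best_covered.index))
```

On the loaded dataframe, the same percentages can be computed directly with `rm_crop_yields.isna().mean() * 100`.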

In [9]:
rm_crop_yields.head()

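|   | Year | RM | Winter Wheat | Canola | Spring Wheat | Mustard | Durum | Sunflowers | Oats | Lentils | Peas | Barley | Fall Rye | Canary Seed | Spring Rye | Tame Hay | Flax | Chickpeas |
|---|------|----|--------------|--------|--------------|---------|-------|------------|------|---------|------|--------|----------|-------------|------------|----------|------|-----------|
| 0 | 1938 | 1 | NaN | NaN | 4.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |
| 1 | 1939 | 1 | NaN | NaN | 9.0 | NaN | NaN | NaN | 16.0 | NaN | NaN | 16.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |
| 2 | 1940 | 1 | NaN | NaN | 12.0 | NaN | NaN | NaN | 23.0 | NaN | NaN | 19.0 | NaN | NaN | NaN | NaN | 8.0 | NaN |
| 3 | 1941 | 1 | NaN | NaN | 18.0 | NaN | NaN | NaN | 32.0 | NaN | NaN | 28.0 | NaN | NaN | NaN | NaN | 5.0 | NaN |
| 4 | 1942 | 1 | NaN | NaN | 20.0 | NaN | NaN | NaN | 35.0 | NaN | NaN | 28.0 | 14.0 | NaN | NaN | NaN | 5.0 | NaN |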
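
The loaded table is in wide form, with one column per crop, whereas the field description in the Data section reads like one row per year, municipality, and crop. A minimal sketch of reshaping to that long form with `pandas.melt` follows. It uses the five preview rows and only a subset of the crop columns; the names `wide`, `long_df`, `Crop`, and `Yield` are illustrative choices.

```python
import numpy as np
import pandas as pd

# The five preview rows for RM 1, restricted to the crops that have values.
wide = pd.DataFrame({
    "Year": [1938, 1939, 1940, 1941, 1942],
    "RM": [1, 1, 1, 1, 1],
    "Spring Wheat": [4.0, 9.0, 12.0, 18.0, 20.0],
    "Oats": [1.0, 16.0, 23.0, 32.0, 35.0],
    "Barley": [1.0, 16.0, 19.0, 28.0, 28.0],
    "Fall Rye": [np.nan, np.nan, np.nan, np.nan, 14.0],
    "Flax": [0.0, 0.0, 8.0, 5.0, 5.0],
})

# One row per (Year, RM, Crop); rows without a recorded yield are dropped.
long_df = (
    wide.melt(id_vars=["Year", "RM"], var_name="Crop", value_name="Yield")
        .dropna(subset=["Yield"])
        .reset_index(drop=True)
)

print(long_df.head())
print(long_df.shape)
```

The long form tends to be easier to group, plot, and feed to models with crop as a feature, while the wide form is convenient for per-crop correlations. Either can be kept depending on the analysis step.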


In [7]:
geo_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   PPID        298 non-null    object  
 1   EFFDT       298 non-null    object  
 2   EXPDT       0 non-null      float64 
 3   FEATURECD   298 non-null    object  
 4   RMNO        298 non-null    object  
 5   RMNM        298 non-null    object  
 6   SHAPE_AREA  298 non-null    float64 
 7   SHAPE_LEN   298 non-null    float64 
 8   geometry    298 non-null    geometry
dtypes: float64(3), geometry(1), object(5)
memory usage: 21.1+ KB


In [8]:
geo_df.shape 

(298, 9)

In [10]:
geo_df.head()

|   | PPID | EFFDT | EXPDT | FEATURECD | RMNO | RMNM | SHAPE_AREA | SHAPE_LEN | geometry |
|---|------|-------|-------|-----------|------|------|------------|-----------|----------|
| 0 | 101000095 | 2019-01-21 | NaN | RMPPID | 95 | GOLDEN WEST | 810143100.0 | 265851.388799 | POLYGON ((654081.000 5546088.320, 654885.320 5... |
| 1 | 101000378 | 2019-07-29 | NaN | RMPPID | 378 | ROSEMOUNT | 584470100.0 | 161271.937167 | POLYGON ((265258.740 5810148.180, 266062.740 5... |
| 2 | 101000288 | 2015-01-27 | NaN | RMPPID | 288 | PLEASANT VALLEY | 853200700.0 | 116895.097209 | POLYGON ((254141.490 5701256.420, 254179.510 5... |
| 3 | 101000106 | 2019-04-24 | NaN | RMPPID | 106 | WHISKA CREEK | 852628300.0 | 129288.281136 | POLYGON ((339874.810 5539057.770, 339849.430 5... |
| 4 | 101000132 | 2019-07-16 | NaN | RMPPID | 132 | HILLSBOROUGH | 634391300.0 | 103052.690196 | POLYGON ((445175.620 5573313.600, 445572.880 5... |
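
To attach municipality names (and later geometry) to the yield data, the two tables need a common key. The `info()` outputs show `RM` as `int64` in the yield data and `RMNO` as `object` in the GIS data, so one side has to be cast before joining. The sketch below shows one way to do this with plain pandas. The `RMNO` and `RMNM` values come from the preview above, while the yield rows are hypothetical placeholders, not real data.

```python
import pandas as pd

# Attribute columns from the GIS preview (geometry omitted for brevity).
geo_attrs = pd.DataFrame({
    "RMNO": ["95", "378", "288"],
    "RMNM": ["GOLDEN WEST", "ROSEMOUNT", "PLEASANT VALLEY"],
})

# Hypothetical yield rows, invented only to demonstrate the join.
yields = pd.DataFrame({
    "Year": [2020, 2020, 2020],
    "RM": [95, 378, 1],
    "Spring Wheat": [40.0, 45.0, 38.0],
})

# Cast the string key to the integer type used by the yield data.
geo_attrs["RM"] = geo_attrs["RMNO"].astype("int64")

# Left join keeps every yield row; the indicator column flags rows with no match.
merged = yields.merge(geo_attrs[["RM", "RMNM"]], on="RM", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "RM"].tolist()

print(merged)
print("RMs without a GIS match:", unmatched)
```

Checking the unmatched list before mapping is a cheap way to catch RM numbers that exist in one source but not the other. To keep the polygons for plotting, the same merge can be run with the GeoDataFrame on the left side.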


## 2. Data Cleaning and Preparation
- Check for and remove missing values
- Identify and handle outliers
- Transform the data if necessary
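
The first two steps above can be sketched on a single column. The example counts missing values and then flags outliers with the common 1.5 × IQR rule. The rule, the threshold, and the yield values are all assumptions made for illustration; the values are hypothetical and chosen so that one of them is clearly extreme.

```python
import numpy as np
import pandas as pd

# Hypothetical yields for one crop, with one gap and one extreme value.
yields = pd.Series([20.0, 22.0, 19.0, 21.0, 23.0, 95.0, np.nan], name="Yield")

# Step 1: count the missing values, then work on the observed ones only.
n_missing = int(yields.isna().sum())
valid = yields.dropna()

# Step 2: the 1.5 * IQR fences around the middle half of the data.
q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged, not silently dropped.
outliers = valid[~valid.between(lower, upper)]

print(f"missing: {n_missing}")
print(f"fences: [{lower:.2f}, {upper:.2f}]")
print("outliers:", outliers.tolist())
```

Flagged values are worth inspecting before removal, because an unusually high or low yield may reflect a real drought or bumper year and not a data error.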