# Innovative AI Challenge 2024

<img src="https://github.com/Sakib-Dalal/Crop_Recommendation_AI_App/blob/main/app/client/public/images/Logo.jpg?raw=true" />

#### Project By Sakib Dalal
- GitHub Project repo: <a href="www.github.com">Link</a>
#### Problem Statement
- AI in Agriculture: Develop AI models to enhance intensive agricultural practices and address the future global food crisis.

#### Agriculture Productivity Prediction
- **Objective**: Build an AI/ML model that predicts agricultural productivity based on crop type, weather conditions, soil properties, and other relevant factors.
- **Requirements**:
    - Ensure the model is accurate and farmer-friendly.
    - Provide a simple, accessible user interface for farmers to use effectively. Create an interface or a small website to showcase your AI/ML model.
    - Submit a problem analysis, solution overview, methodology, and implementation steps, together with a short video and the code, via the challenge website.
- **Scoring**:
    - Submissions will be evaluated based on *Mean Squared Error*.
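
For reference, Mean Squared Error is the average of the squared differences between actual and predicted yields, $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. A minimal sketch with made-up numbers (the values are illustrative and not taken from the challenge data):

```python
import numpy as np

# Illustrative values only (not taken from train.csv)
y_true = np.array([2000.0, 2500.0, 3000.0])  # actual yield in kg/ha
y_pred = np.array([2100.0, 2400.0, 3200.0])  # predicted yield in kg/ha

# Average of the squared errors: (100^2 + 100^2 + 200^2) / 3
mse = float(np.mean((y_true - y_pred) ** 2))
print(mse)  # 20000.0
```

Because the errors are squared, a few large misses raise the score much more than many small ones.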

### Challenge Overview

#### Files
- **train.csv** - The training dataset, which includes the features and target variable (crop_yield in kg/ha).
- **test.csv** - The test dataset, which you will use to generate predictions and submit your solutions.
- **sample_submission.csv** - A sample submission file that shows the correct format for submitting your predictions.
#### Columns
- 'id' - A unique identifier for each data point (e.g., 1, 2, 3,…).
- 'Year' - The year of the production (e.g., 2020, 2002).
- 'State' - The state where the data is collected (e.g., Punjab).
- 'Crop_Type' - The type of crop grown (e.g., Rice, Wheat, Bajra).
- 'Rainfall' - The amount of annual average state rainfall in mm (e.g., 1200 mm).
- 'Soil_Type' - The type of soil in the region (e.g., Loamy).
- 'Irrigation_Area' - The area of irrigated land, in thousand hectares.
- 'Crop_Yield' - The target variable representing the crop yield in kg/ha.
#### Notes:
- Data Format: The data is provided in CSV format. Ensure that all files are read correctly and that you handle any missing data appropriately.
- Feature Engineering: While the data is provided in a raw form, you may perform feature engineering and transformations to enhance your model.
- Prediction Goal: Your model should predict agricultural productivity based on the features in the data (for the agriculture problem statement).
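
One way to act on the missing-data note is sketched below. The toy frame, its values, and the choice of median/mode imputation are assumptions made for illustration; the challenge does not prescribe a strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the challenge schema
toy_df = pd.DataFrame({
    "Rainfall": [1200.0, np.nan, 800.0],
    "Soil_Type": ["Loamy", None, "Clay"],
})

# Count missing values per column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Fill numeric gaps with the median and categorical gaps with the most frequent value
filled_df = toy_df.copy()
filled_df["Rainfall"] = filled_df["Rainfall"].fillna(filled_df["Rainfall"].median())
filled_df["Soil_Type"] = filled_df["Soil_Type"].fillna(filled_df["Soil_Type"].mode()[0])
print(filled_df)
```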

<img src="https://images.javatpoint.com/tutorial/machine-learning/images/machine-learning-life-cycle.png" />

# Index

- **Step 1**: Loading Dataset
    - The dataset is provided by the Innovative AI Challenge 2024.
    - Link to the dataset on Kaggle: <a href="https://www.kaggle.com/competitions/innovative-ai-challenge-2024/data">dataset</a>
- **Step 2**: Data Preparation
    - We will prepare our data in this step.
    - The dataset is provided in *csv* format.
    - Using **pandas**, we will load it into dataframes (**Polars** is noted as a future alternative).
- **Step 3**: Data Wrangling
    - Look for any **null** values in dataset.
- **Step 4**: Analyse Data
    - In this step we will *Analyse* and *Visualise* our data.
    - Here we will select important features to train our *ML* model.
- **Step 5**: Training Model
    - We will use different *ML* models for our *regression problem*.
    - Based on the evaluation we will select a model for testing.
- **Step 6**: Testing Model
    - This is the phase where we finalize the model and predict on the test data.
- **Step 7**: Deployment
    - Save the best-performing model so that we can use it in our **Web App**.

# Step 1: Importing Data
- We will import the dataset from Kaggle.
- The dataset is in csv format.
- Here are the paths to the dataset files:
    - train.csv: "/kaggle/input/innovative-ai-challenge-2024/train.csv"
    - test.csv: "/kaggle/input/innovative-ai-challenge-2024/test.csv"
    - sample_submission.csv: "/kaggle/input/innovative-ai-challenge-2024/sample_submission.csv"

In [None]:
TRAIN_URL = "/kaggle/input/innovative-ai-challenge-2024/train.csv"
TEST_URL = "/kaggle/input/innovative-ai-challenge-2024/test.csv"
SUBMISSION_URL = "/kaggle/input/innovative-ai-challenge-2024/sample_submission.csv"

# Step 2: Data Preparation
- We will be using <a href="https://pandas.pydata.org/pandas-docs/stable/index.html">Pandas</a> to convert our dataset from *csv* to *dataframes*.
- **Note**: To future-proof the project, consider using **Polars** for data processing.
- We will use the **read_csv** function from Pandas to read each csv file into a dataframe.

In [None]:
import pandas as pd

In [None]:
train_df = pd.read_csv(TRAIN_URL)
test_df = pd.read_csv(TEST_URL)

submission_df = pd.read_csv(SUBMISSION_URL)

- Let's view the first five rows of each dataset.

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
submission_df.head()

- Let's get the following from the training data:
    - information
    - number of columns
    - shape and size
    - description

In [None]:
train_df.info()

In [None]:
train_df.columns

In [None]:
len(train_df.columns)

In [None]:
train_df.shape

- From this we get to know:
    - There are 8 columns in total.
    - **Crop_Yield (kg/ha)** is the target column.
    - There are 55 rows.

In [None]:
train_df.describe()

- Let's see the unique values and value counts in the Year column of `train_df`.

In [None]:
train_df["Year"].unique()

In [None]:
train_df["Year"].value_counts()

- Let's see the unique values and value counts in the State column of `train_df`.

In [None]:
train_df["State"].unique()

In [None]:
train_df["State"].value_counts()

- Let's see the unique values and value counts in the Crop_Type column of `train_df`.

In [None]:
train_df["Crop_Type"].unique()

In [None]:
train_df["Crop_Type"].value_counts()

- Let's see the unique values and value counts in the Soil_Type column of `train_df`.

In [None]:
train_df["Soil_Type"].unique()

In [None]:
train_df["Soil_Type"].value_counts()
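
The per-column checks above can also be collected in one pass. The sketch below uses a small made-up frame rather than `train_df`; the rows and the `summary` dictionary are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the challenge schema
toy_df = pd.DataFrame({
    "State": ["Punjab", "Punjab", "Haryana"],
    "Crop_Type": ["Rice", "Wheat", "Rice"],
    "Soil_Type": ["Loamy", "Loamy", "Loamy"],
    "Rainfall": [1200.0, 900.0, 700.0],
})

# Select the non-numeric columns and count the distinct values in each
categorical_cols = toy_df.select_dtypes(exclude="number").columns
summary = {col: toy_df[col].nunique() for col in categorical_cols}
print(summary)  # {'State': 2, 'Crop_Type': 2, 'Soil_Type': 1}
```

A column with a single distinct value, like `Soil_Type` in this toy frame, carries no information for a model.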

# Step 3: Data Wrangling
- In this step we will look for any null values in our *training* dataset.
- For this we will use the **Pandas** library.

In [None]:
train_df.isna().sum()

- Hence, there are no *null* values in our training dataset.
- Let's see the datatype of each column in `train_df`.

In [None]:
train_df.info()

In [None]:
for k, v in train_df.items():
    print(k,"column has datatype of:", v.dtype)

In [None]:
# print columns with datatype as object
for k, v in train_df.items():
    if v.dtype == "object":
        print(k,"column has datatype of:", v.dtype)

- There are 3 datatypes in the dataset: float64, int64, and object.
- We can train our model on float64 and int64 columns, but we get an error if we pass object columns to the model.
- **Solution**: To overcome this problem we can use **Sklearn Preprocessing's** *<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html"> LabelEncoder </a>*.

In [None]:
# Label Encoder to deal with object datatypes
from sklearn.preprocessing import LabelEncoder

- Before moving on, we create a copy of the `train_df` dataframe so we can reuse the original when needed.

In [None]:
train_cp_df = train_df.copy()
train_cp_df.head()

In [None]:
label_encoder = LabelEncoder()

# Encoding labels in columns
train_cp_df["State"] = label_encoder.fit_transform(train_cp_df["State"])
train_cp_df["Crop_Type"] = label_encoder.fit_transform(train_cp_df["Crop_Type"])
train_cp_df["Soil_Type"] = label_encoder.fit_transform(train_cp_df["Soil_Type"])
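
Because a single `LabelEncoder` is refitted on each column above, only the last column's mapping stays available on the object. A possible variation, sketched on toy data, keeps one encoder per column so the same mapping can later be applied to test rows. The `encoders` dictionary and the toy frames are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; column names follow the challenge schema
toy_train = pd.DataFrame({"State": ["Punjab", "Haryana", "Punjab"],
                          "Crop_Type": ["Rice", "Wheat", "Bajra"]})
toy_test = pd.DataFrame({"State": ["Haryana"], "Crop_Type": ["Rice"]})

# Fit one encoder per column and keep it for later reuse
encoders = {}
for col in ["State", "Crop_Type"]:
    enc = LabelEncoder()
    toy_train[col] = enc.fit_transform(toy_train[col])
    encoders[col] = enc

# Apply the stored mappings to the test rows (transform only, no refit)
for col, enc in encoders.items():
    toy_test[col] = enc.transform(toy_test[col])

print(encoders["State"].classes_)  # labels are stored in sorted order
print(toy_test)
```

Note that `transform` raises a `ValueError` for a label that was not seen during fitting, so test data with new categories needs separate handling.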

- `train_cp_df` dataset after *LabelEncoding*.

In [None]:
train_cp_df.head()

In [None]:
for k, v in train_cp_df.items():
    print(k,"column has datatype of:", v.dtype)

In [None]:
# old dataset info
for k, v in train_df.items():
    print(k,"column has datatype of:", v.dtype)

- We have successfully encoded the object datatypes in our dataset.

### Feature Scaling
- Our `train_cp_df` dataset contains features that vary widely in magnitude, units, and range.
- Here is a visualization of all the features.
- We will use matplotlib and seaborn for visualization.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

train_cp_df.plot(figsize=(16, 6))
plt.title("Before Feature Scaling", fontsize=16)
plt.xlabel("X Scale")
plt.ylabel("Y Scale")

plt.show()

- Here is the histogram view.

In [None]:
train_cp_df.hist(figsize=(16, 8), bins=20, color=["salmon"])
plt.show()

- Let's perform feature scaling on our `train_cp_df` dataset.
- We will use Sklearn's preprocessing class named **<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">StandardScaler</a>**.
- Formula to perform standard scaling is:<br> $z = (x - u) / s$
- where:
    - z is the scaled value
    - x is the value to be scaled
    - u is the mean of the training samples
    - s is the standard deviation of the training samples.
- Sklearn also provides other scaling methods; StandardScaler is the one used here.
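
As a quick check of the formula, the sketch below computes $z = (x - u) / s$ by hand and compares it with `StandardScaler`. The rainfall values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rainfall values in mm (one feature, three samples)
x = np.array([[800.0], [1000.0], [1200.0]])

u = x.mean(axis=0)   # mean of the samples
s = x.std(axis=0)    # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (x - u) / s

# The library result should match the manual computation
z_sklearn = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_sklearn))  # True
```

After scaling, the feature has mean 0 and unit variance, which is why the plots further down sit on a common scale.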

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

# we will apply feature scaling on train_cp_df dataset
train_cp_df = scaler.fit_transform(train_cp_df)

- After standard scaling, our data is no longer a dataframe; `fit_transform` returns a NumPy array.
- We will use pandas again to convert it back into a dataframe.

In [None]:
type(train_cp_df)

In [None]:
train_cp_df = pd.DataFrame(data=train_cp_df, columns=train_df.columns)
train_cp_df.head()
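
The cells above scale every column, including `id` and the target `Crop_Yield (kg/ha)`. One possible alternative, sketched here on toy data, scales only the feature columns and keeps the result as a dataframe. The `feature_cols` list and the values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; column names follow the challenge schema
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "Rainfall": [800.0, 1000.0, 1200.0],
    "Irrigation_Area": [10.0, 20.0, 30.0],
    "Crop_Yield (kg/ha)": [2000.0, 2500.0, 3000.0],
})

# Only these columns are scaled; id and the target are left untouched
feature_cols = ["Rainfall", "Irrigation_Area"]

scaled_df = toy_df.copy()
scaled_df[feature_cols] = StandardScaler().fit_transform(toy_df[feature_cols])
print(scaled_df)
```

Leaving the target unscaled keeps predictions, and therefore the Mean Squared Error, in the original kg/ha units.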

In [None]:
# plot after feature scaling
train_cp_df.plot(figsize=(16, 6))
plt.title("After Feature Scaling", fontsize=16)
plt.xlabel("X Scale")
plt.ylabel("Y Scale")

plt.show()

- Now it's better: each feature is scaled to roughly the range -2 to 2.
- Here is the histogram view.

In [None]:
train_cp_df.hist(figsize=(16, 8), bins=20, color=["salmon"])
plt.show()

- Now we can easily extract the important features from the dataset.

### Extracting Important Features 
- In this step we will select the best features so that our model will perform better.
- In first method of feature extraction we will be using C

In [None]:
import polars as pl

In [None]:
data = pd.read_csv("/kaggle/input/innovative-ai-challenge-2024/train.csv")
data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load data using pandas
train_df = pd.read_csv('/kaggle/input/innovative-ai-challenge-2024/train.csv')
test_df = pd.read_csv('/kaggle/input/innovative-ai-challenge-2024/test.csv')
submission_df= pd.read_csv('/kaggle/input/innovative-ai-challenge-2024/sample_submission.csv')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Correlation Matrix
plt.figure(figsize=(10, 6))
# Select only numeric columns for correlation calculation
numeric_data = train_df.select_dtypes(include=np.number)
correlation_matrix = numeric_data.corr()  
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
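
Besides reading the heatmap, the correlations with the target can be ranked directly. This is a sketch on made-up numbers, not results from `train_df`.

```python
import pandas as pd

# Hypothetical toy values; column names follow the challenge schema
toy_df = pd.DataFrame({
    "Rainfall": [800.0, 1000.0, 1200.0, 1400.0],
    "Irrigation_Area": [30.0, 10.0, 40.0, 20.0],
    "Crop_Yield (kg/ha)": [2000.0, 2400.0, 3100.0, 3500.0],
})

target = "Crop_Yield (kg/ha)"

# Absolute Pearson correlation of each feature with the target, strongest first
corr_with_target = (
    toy_df.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature that matters in a non-linear way.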

In [None]:
# Import the label encoder
from sklearn import preprocessing

# LabelEncoder maps each distinct string label to an integer code
label_encoder = preprocessing.LabelEncoder()

# Encode the categorical columns; the encoder is refitted for each column
train_df['Soil_Type'] = label_encoder.fit_transform(train_df['Soil_Type'])
train_df['Crop_Type'] = label_encoder.fit_transform(train_df['Crop_Type'])
train_df['State'] = label_encoder.fit_transform(train_df['State'])

In [None]:
from sklearn.ensemble import RandomForestRegressor

# X holds the features (target and identifier-like columns dropped); y is the target to predict
X = train_df.drop(["Crop_Yield (kg/ha)", "State", "id", "Year"], axis=1)
y = train_df["Crop_Yield (kg/ha)"]

# Applying RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Displaying feature importance
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': model.feature_importances_})
print(feature_importance.sort_values(by='Importance', ascending=False))
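
Feature importances from a model fitted on all rows do not show how well it generalizes. Since submissions are scored on Mean Squared Error, a cross-validated estimate is one way to check. The sketch below uses synthetic data with 55 rows as a stand-in for the training set; `X_toy`, `y_toy`, and `cv_model` are illustrative names, and 5 folds is an assumed choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 55 rows, 3 features, target driven by the first feature
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(55, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=55)

cv_model = RandomForestRegressor(n_estimators=100, random_state=42)

# scikit-learn reports negated MSE so that higher is better; flip the sign back
neg_mse = cross_val_score(cv_model, X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse
print(mse_per_fold.mean())
```

With only 55 rows, the per-fold scores can vary noticeably, so the mean and the spread are both worth looking at.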

In [None]:
test_df

In [None]:
train_df

In [None]:
submission_df