**panData**

# **Data Science for Multivariate Data Analysis**

**Data Science in Agribusiness
Yield Prediction and Irrigation Optimization**



## **Installing and Loading Packages**

In [5]:
!pip install -q -U watermark

https://www.tensorflow.org/

In [6]:
%env TF_CPP_MIN_LOG_LEVEL=3

env: TF_CPP_MIN_LOG_LEVEL=3


In [7]:
# Imports
import joblib
import sklearn
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import warnings
warnings.filterwarnings('ignore')

In [8]:
%reload_ext watermark
%watermark -a "panData"

Author: panData



## **Loading the Dataset**

In [9]:
# Load the dataset
df = pd.read_csv('dataset.csv')

In [10]:
df.shape

(124, 12)

In [11]:
df.head()

Unnamed: 0,date,veg_index,soil_capacity,co2_level,nutrient_level,fertilizer_index,root_depth,solar_radiation,precipitation,growth_stage,yield_history,humidity
0,2012-12-01,323,455,3102.61,423.45,844.0,468.0,578.0,28.67,207.70504,117.7,79.261905
1,2013-01-01,345,546,3100.45,415.85,799.0,485.0,557.0,24.49,228.94287,4.5,82.193548
2,2013-02-01,362,595,3199.41,410.77,718.0,466.0,552.0,22.06,238.41747,25.1,74.839286
3,2013-03-01,376,636,3281.67,414.82,614.0,442.0,574.0,21.64,218.47599,53.6,77.935484
4,2013-04-01,383,738,3261.65,451.04,619.0,429.0,595.0,22.3,226.1501,166.0,80.45


In [12]:
df.tail()

Unnamed: 0,date,veg_index,soil_capacity,co2_level,nutrient_level,fertilizer_index,root_depth,solar_radiation,precipitation,growth_stage,yield_history,humidity
119,2022-11-01,362,363,2626.91,1252.78,738.07,427.49,1430.48,60.18,186.68326,38.2,77.95
120,2022-12-01,310,322,2736.64,1287.68,749.57,385.09,1472.27,62.25,210.72987,33.7,76.177419
121,2023-01-01,277,307,2842.81,1289.12,761.6,373.03,1525.43,63.04,244.41912,4.6,74.774194
122,2023-02-01,323,330,2936.19,1303.59,759.59,390.69,1572.25,71.52,223.31732,6.9,66.910714
123,2023-03-01,360,339,2847.84,1234.88,771.62,396.87,1302.61,74.8,228.56676,41.5,69.0


## **Exploratory Analysis**

In [13]:
# Check the data types of the columns
df.dtypes

Unnamed: 0,0
date,object
veg_index,int64
soil_capacity,int64
co2_level,float64
nutrient_level,float64
fertilizer_index,float64
root_depth,float64
solar_radiation,float64
precipitation,float64
growth_stage,float64


In [14]:
# Display the columns of the dataset
df.columns

Index(['date', 'veg_index', 'soil_capacity', 'co2_level', 'nutrient_level',
       'fertilizer_index', 'root_depth', 'solar_radiation', 'precipitation',
       'growth_stage', 'yield_history', 'humidity'],
      dtype='object')

In [15]:
# Identify non-numeric columns
non_numeric_columns = df.select_dtypes(include=['object']).columns
print(f'Non-numeric columns: {non_numeric_columns}')

Non-numeric columns: Index(['date'], dtype='object')


## **Cleaning and Transformation**

In [16]:
# Remove non-numeric columns (if not needed)
df = df.drop(columns=non_numeric_columns)

In [17]:
# Check if the 'humidity' column contains numeric values
if df['humidity'].dtype == 'object':
    df['humidity'] = pd.to_numeric(df['humidity'], errors='coerce')

In [18]:
# Remove rows with missing values
df = df.dropna()

## **Data Standardization**

In [19]:
# Separate predictors and target variable
X = df.drop(columns='humidity')
y = df['humidity']

In [20]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
# Create the scaler
scaler = StandardScaler()

In [22]:
# Standardize the data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [23]:
# Save the scaler to disk
joblib.dump(scaler, 'scaler.joblib')

['scaler.joblib']
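The scaler is fitted on the training set only and then reused, unchanged, on the test set. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation: the helper names `fit_scaler` and `transform` are hypothetical, and the sample values are just the first `veg_index` entries shown earlier. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

# Illustrative values only (first veg_index entries from df.head())
train_col = [323, 345, 362, 376]
test_col = [383]

# Statistics come from the training data alone ...
mean, std = fit_scaler(train_col)
train_scaled = transform(train_col, mean, std)
# ... and are reused as-is on the test data, so no test information leaks in
test_scaled = transform(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1, while the test column generally does not, because it is expressed relative to the training statistics.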

## **Model Building**

In [24]:
# Define the model architecture
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1)
])



The code above defines the architecture of a sequential neural network model using the Keras library. Here is a detailed explanation of each line:

**`model = Sequential([ ... ])`**: Creates a sequential model, which is a linear stack of layers.

**`Dense(64, activation='relu', input_shape=(X_train.shape[1],))`**: Adds a dense (fully connected) layer with 64 neurons and a ReLU (Rectified Linear Unit) activation function. The `input_shape` specifies the shape of the input data, corresponding to the number of features (columns) in `X_train`.

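**`Dropout(0.3)`**: Adds a dropout layer with a 30% rate. Dropout is a regularization technique that, at each training step, randomly sets 30% of the outputs of the preceding layer to zero to reduce overfitting. The layer is only active during training; at inference time it passes its inputs through unchanged.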

**`Dense(32, activation='relu')`**: Adds another dense layer with 32 neurons and a ReLU activation function.

**`Dropout(0.3)`**: Adds another dropout layer with a 30% rate.

**`Dense(16, activation='relu')`**: Adds another dense layer with 16 neurons and a ReLU activation function.

**`Dense(1)`**: Adds the output layer with a single neuron. No activation function is specified, which is common in regression problems where the output is a continuous value. If it were a binary classification problem, a sigmoid activation could be used in this layer.

This architecture is typical for regression problems where the goal is to predict a single continuous value based on multiple input features.
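To make the dropout step more concrete, here is a small sketch of "inverted" dropout in plain Python. It is an illustration under stated assumptions, not the Keras source: the function name `dropout` and the seeded generator are hypothetical. It follows the documented Keras behavior that the values not set to zero are scaled up by `1 / (1 - rate)`, so the expected sum of the activations is unchanged.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not training:
        return list(values)            # inference: inputs pass through unchanged
    keep_scale = 1.0 / (1.0 - rate)    # compensates for the zeroed units
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000

dropped = dropout(activations, rate=0.3, training=True, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(f"fraction zeroed: {zero_fraction:.3f}")
print(f"mean after dropout: {mean_after:.3f}")
```

Roughly 30% of the values are zeroed, yet the mean stays close to the original value of 1 thanks to the rescaling.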

## **Model Compilation**

In [25]:
# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

The code above compiles the neural network model that was defined earlier. Compilation is a necessary step before training the model. Here is an explanation of each part of the code:

**`model.compile(...)`**: Compiles the model, configuring it for training with an optimizer, a loss function, and metrics.

**`optimizer='adam'`**: Sets the optimizer to 'adam'. Adam (Adaptive Moment Estimation) is an efficient and widely used optimizer for neural networks that combines ideas from gradient descent with momentum and from RMSProp. It keeps running estimates of the mean and of the squared magnitude of each parameter's gradient and uses them to adapt the effective step size of every parameter during training.

**`loss='mse'`**: Sets the loss function to 'mse' (Mean Squared Error). MSE is a common loss function in regression problems, which calculates the mean of the squared differences between predicted and actual values. It is used to measure model performance, with lower values indicating a better fit.

**`metrics=['mae']`**: Specifies that the metric to be monitored during training is 'mae' (Mean Absolute Error). MAE is the mean of the absolute values of the differences between predicted and actual values. Like MSE, lower MAE values indicate better model performance, but MAE is less sensitive to large errors than MSE.

Compiling the model is an essential step that defines how the model will be optimized and evaluated during training.
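The difference between the two quantities can be checked by hand. The sketch below uses made-up humidity values (they are not taken from the dataset) and hypothetical helper names `mse` and `mae` to show how a single large error inflates MSE far more than MAE.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values for illustration
y_true = [80.0, 75.0, 70.0, 78.0]
pred_close = [79.0, 76.0, 71.0, 77.0]     # every prediction is off by 1
pred_outlier = [79.0, 76.0, 71.0, 68.0]   # the last prediction is off by 10

print(mse(y_true, pred_close), mae(y_true, pred_close))
print(mse(y_true, pred_outlier), mae(y_true, pred_outlier))
```

With one error of 10 among errors of 1, MAE roughly triples while MSE grows by a factor of about 26, which is why the loss (MSE) penalizes occasional large misses much more heavily than the monitored metric (MAE).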

## **Defining Callbacks**

In [26]:
# Callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('model.keras', save_best_only=True)

This code snippet defines two callbacks for model training: EarlyStopping and ModelCheckpoint. Callbacks are tools that allow you to perform certain actions at specific points during training.

**`early_stopping = EarlyStopping(...)`**: Defines an early stopping callback.

**`monitor='val_loss'`**: Monitors the validation loss (`val_loss`) during training. The validation loss is computed after each epoch on data that is held out from the weight updates (here, the 20% of the training set reserved by `validation_split=0.2` in `model.fit`), which makes it a useful signal for detecting overfitting.

**`patience=10`**: Sets the patience to 10 epochs. This means that if the validation loss does not improve for 10 consecutive epochs, the training will be stopped.

**`restore_best_weights=True`**: Restores the model weights to the state with the best validation loss. This ensures that the final model weights are the best found during training.

**`model_checkpoint = ModelCheckpoint(...)`**: Defines a model checkpoint callback.

**`'model.keras'`**: Specifies the filename where the model will be saved.

**`save_best_only=True`**: Overwrites the saved file only when the monitored quantity improves on the best value seen so far. Because `monitor` is left at its default, `val_loss`, the file always holds the model with the lowest validation loss found during training rather than one copy per epoch.

In summary, these callbacks help to:

- Stop training early if the model is not improving, saving time and computational resources.

- Automatically save the best model found during training, ensuring that you have a version of the model with the best performance.
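The patience logic can be sketched without Keras. The function below is a simplified illustration, not the library's implementation: `simulate_early_stopping` is a hypothetical name, the loss values are invented, and options such as `min_delta` are ignored.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: this is also where a checkpoint would be written
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # patience exhausted: stop training
    return len(val_losses), best_epoch     # ran through every epoch

# Invented validation losses for illustration
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.8]

print(simulate_early_stopping(losses, patience=3))
print(simulate_early_stopping(losses, patience=10))
```

With `patience=3`, training stops at epoch 6 because epochs 4 to 6 never beat the loss from epoch 3; the weights from epoch 3 are the ones that `restore_best_weights=True` would bring back. With `patience=10`, the same sequence runs to the end.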



## **Model Training**

In [27]:
model.summary()
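The printed summary is not reproduced in this export, but the per-layer parameter counts it reports can be checked by hand. The sketch below assumes 10 input features (the 11 columns that remain after dropping `date`, minus the `humidity` target); `dense_params` is a hypothetical helper. Dropout layers have no trainable parameters, so only the `Dense` layers are counted.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 10                 # assumption: 11 columns minus the target
layer_units = [64, 32, 16, 1]   # Dense layers in the order defined above

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units            # this layer's output feeds the next layer

total = sum(counts)
print(counts, total)
```

Under that assumption the network has a few thousand trainable parameters, which is large relative to the roughly one hundred rows available and helps explain the use of dropout and early stopping.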

In [28]:
# Train the model
history = model.fit(
    X_train_scaled,
    y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping, model_checkpoint]
)

Epoch 1/100
3/3 ━━━━━━━━━━━━━━━━━━━━ 2s 136ms/step - loss: 5192.9194 - mae: 71.9399 - val_loss: 5202.6504 - val_mae: 72.0255
Epoch 2/100
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - loss: 5153.2324 - mae: 71.6530 - val_loss: 5189.6201 - val_mae: 71.9351
Epoch 3/100
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - loss: 5197.3320 - mae: 71.9554 - val_loss: 5174.9775 - val_mae: 71.8337
Epoch 4/100
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - loss: 5161.6426 - mae: 71.7099 - val_loss: 5158.3149 - val_mae: 71.7180
Epoch 5/100
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - loss: 5163.8237 - mae: 71.7093 - val_loss: 5139.7422 - val_mae: 71.5888
Epoch 6/100
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step - loss: 5093.6187 - mae: 71.2209 - val_loss: 5118.1499 - val_mae: 71.4382
Epoch 7/100
3/3 ━━━━━━━━━━━━━━━━━━━━ ...
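The `3/3` step counter in the training log can be reproduced from the sample counts. The sketch below is a back-of-the-envelope check, not library code. It assumes that `train_test_split` rounds the test count up and that Keras truncates the training count when applying `validation_split` (taking the validation rows from the end of the training data); treat those rounding rules as assumptions.

```python
import math

n_rows = 124            # rows in the dataset (none were dropped by dropna)
test_size = 0.2
validation_split = 0.2
batch_size = 32

# Hold-out test set created by train_test_split
n_test = math.ceil(n_rows * test_size)
n_train_total = n_rows - n_test

# Rows actually used for weight updates vs. rows reserved for validation
n_fit = int(n_train_total * (1 - validation_split))
n_val = n_train_total - n_fit

steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_test, n_train_total, n_fit, n_val, steps_per_epoch)
```

Under these assumptions only 79 rows drive the weight updates and 20 rows produce `val_loss`, so both the training curve and the early-stopping signal rest on very small samples.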

## **Model Evaluation**

In [29]:
# Evaluate the model on the test set
test_loss, test_mae = model.evaluate(X_test_scaled, y_test)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - loss: 153.7987 - mae: 10.2485


In [30]:
print(f'Test Loss: {test_loss}')
print(f'Test MAE: {test_mae}')

Test Loss: 153.7987060546875
Test MAE: 10.248539924621582
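Because the loss is a mean squared error, it is expressed in squared units of the target. Taking its square root gives the RMSE, which is on the same scale as `humidity` and can be read next to the MAE. A minimal sketch using the two values printed above:

```python
import math

test_loss = 153.7987060546875    # MSE reported by model.evaluate
test_mae = 10.248539924621582    # MAE reported by model.evaluate

rmse = math.sqrt(test_loss)      # back to the units of the target

print(f'Test RMSE: {rmse:.2f}')
print(f'Test MAE:  {test_mae:.2f}')
```

The RMSE is larger than the MAE, as it always is unless every error has the same magnitude. Both are sizeable compared with the humidity values shown in the sample rows (roughly 67 to 82), which suggests the model still has considerable room for improvement.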



Let's now deploy the model via API in the next chapter.




In [32]:
#%watermark -v -m

In [33]:
#%watermark --iversions

In [35]:
print(df.shape)
df.head()

(124, 11)


Unnamed: 0,veg_index,soil_capacity,co2_level,nutrient_level,fertilizer_index,root_depth,solar_radiation,precipitation,growth_stage,yield_history,humidity
0,323,455,3102.61,423.45,844.0,468.0,578.0,28.67,207.70504,117.7,79.261905
1,345,546,3100.45,415.85,799.0,485.0,557.0,24.49,228.94287,4.5,82.193548
2,362,595,3199.41,410.77,718.0,466.0,552.0,22.06,238.41747,25.1,74.839286
3,376,636,3281.67,414.82,614.0,442.0,574.0,21.64,218.47599,53.6,77.935484
4,383,738,3261.65,451.04,619.0,429.0,595.0,22.3,226.1501,166.0,80.45


# **The End**