# Sensor Based Aquaponics Fish Pond Datasets



### 1. Business Problem

#### 1.1 Introduction

<p>The aquaculture industry in Indonesia is a vital source of income for many small-scale farmers, but they often lack the knowledge and resources to manage their ponds well. What are the key factors that affect fish growth and health in a freshwater aquaculture system? Many factors play a role, including feed and water quality. Water quality affects fish growth rate, feed consumption, and general wellbeing.</p>
<p>Farmers' lack of knowledge about proper pond water management has resulted in fish deaths. Exploratory data analysis (EDA) of the aquaculture dataset is therefore an essential step toward improving the health and growth of fish in a freshwater aquaculture system. By identifying the key factors that affect fish growth and health, farmers can take steps to manage their ponds more effectively and improve the success of their aquaculture operations.</p>

#### 1.2 Sources/Useful Links

Some Useful Links:
- https://www.sciencedirect.com/science/article/pii/S2352340922005972
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256380
- Coming Soon

#### 1.3 Business Objectives and Constraints

- The data may be biased toward certain catfish
- Coming Soon

### 2. Machine Learning Problem

#### 2.1 Data Overview

The dataset contains the date and time each record was collected (`created_at`) and the entry id of each record (`entry_id`). The next six columns hold the IoT water quality parameters (Temperature, Turbidity, Dissolved Oxygen, pH, Ammonia, Nitrate), followed by the fish population and the last two columns (Length, Weight), which were measured manually on randomly sampled fish.

- <b>Temperature</b>:
<p></p>

- <b>pH</b>:
<p></p>

- <b>Dissolved Oxygen</b>:
<p></p>

- <b>Ammonia</b>:
Ammonia accumulates in fish ponds due to the breakdown of the protein-rich
fish feeds, ...

- <b>Turbidity</b>:
<p></p>

- <b>Nitrate</b>:
<p></p>

Read more about the dataset (https://www.sciencedirect.com/science/article/pii/S2352340922005972)

#### 2.2 Type of Machine Learning Task

<p>It is a Regression and Forecasting problem</p>

#### 2.3 Performance Metric

Metric(s):
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
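
As a minimal sketch of how these two metrics are computed, the example below implements them with NumPy. The helper names and the sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical fish weights in grams, made up for illustration only
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(f"MSE: {mse:.4f}, MAE: {mae:.4f}")
```

MSE squares each error, so it penalizes large misses more heavily than MAE, which weights every error in proportion to its size. Reporting both gives a sense of whether a few large errors dominate.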

### 3. Exploratory Data Analysis

#### 3.1 Basic Information About Dataset

In [20]:
"""
    Importing the necessary libraries
"""

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [21]:
dataset = pd.read_csv('../dataset/raw/dataset.csv')

dataset.head()

| | created_at | entry_id | Temperature (C) | Turbidity (NTU) | Dissolved Oxygen(g/ml) | PH | Ammonia(g/ml) | Nitrate(g/ml) | Population | Fish_Length (cm) | Fish_Weight (g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-19 00:00:05 CET | 1889 | 24.875 | 100 | 4.505 | 8.43365 | 0.38 | 193 | 50 | 6.96 | 3.36 |
| 1 | 2021-06-19 00:01:02 CET | 1890 | 24.9375 | 100 | 6.601 | 8.43818 | 0.38 | 194 | 50 | 6.96 | 3.36 |
| 2 | 2021-06-19 00:01:22 CET | 1891 | 24.875 | 100 | 15.797 | 8.42457 | 0.38 | 192 | 50 | 6.96 | 3.36 |
| 3 | 2021-06-19 00:01:44 CET | 1892 | 24.9375 | 100 | 5.046 | 8.43365 | 0.38 | 193 | 50 | 6.96 | 3.36 |
| 4 | 2021-06-19 00:02:07 CET | 1893 | 24.9375 | 100 | 38.407 | 8.40641 | 0.38 | 192 | 50 | 6.96 | 3.36 |


In [22]:
dataset.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| entry_id | 172249.0 | 148725.7 | 85418.78 | 1889.0 | 62717.0 | 147763.0 | 226310.0 | 269372.0 |
| Temperature (C) | 172249.0 | 24.98285 | 0.9018905 | -127.0 | 24.375 | 24.9375 | 25.5 | 27.8125 |
| Turbidity (NTU) | 172249.0 | 90.97466 | 21.09991 | 1.0 | 94.0 | 100.0 | 100.0 | 100.0 |
| Dissolved Oxygen(g/ml) | 172249.0 | 9.708503 | 10.97196 | 0.007 | 3.2 | 3.283 | 11.739 | 41.046 |
| PH | 172249.0 | 3.971857 | 3.960719 | -3.13745 | -0.17318 | 7.09904 | 7.51667 | 8.55167 |
| Ammonia(g/ml) | 172159.0 | 311211000.0 | 12578090000.0 | 0.00659 | 0.56935 | 8.47056 | 80.70516 | 996513000000.0 |
| Nitrate(g/ml) | 172249.0 | 719.8914 | 415.9798 | 45.0 | 189.0 | 890.0 | 1050.0 | 2224.0 |
| Population | 172249.0 | 50.0 | 0.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| Fish_Length (cm) | 172249.0 | 23.42878 | 9.609826 | 6.96 | 14.22 | 20.97 | 32.54 | 35.39 |
| Fish_Weight (g) | 172249.0 | 166.4705 | 145.7527 | 3.36 | 22.89 | 65.48 | 302.5 | 394.66 |
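
Some extremes in this summary look physically implausible, for example a minimum temperature of -127.0, negative pH values, and an ammonia maximum near 1e12. The sketch below shows one way such readings could be flagged. The `bounds` dictionary is a hypothetical assumption, not something defined by the dataset or its paper, and the `demo` frame holds made-up rows that stand in for the real `dataset`.

```python
import pandas as pd

# Made-up sample rows that mimic the extremes seen in the summary
demo = pd.DataFrame({
    "Temperature (C)": [24.875, -127.0, 25.5],
    "PH": [8.43, 7.10, -3.14],
})

# Hypothetical plausibility bounds (assumptions for illustration only)
bounds = {
    "Temperature (C)": (0.0, 40.0),
    "PH": (0.0, 14.0),
}

# True where a value lies inside its column's bounds (inclusive)
in_range = pd.DataFrame({
    col: demo[col].between(low, high) for col, (low, high) in bounds.items()
})

# Rows with at least one out-of-range reading
suspect_rows = demo[~in_range.all(axis=1)]
print(suspect_rows)
```

Whether flagged rows should be dropped, clipped, or kept is a separate decision that depends on how the sensors behave. The sketch only surfaces them for inspection.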


In [23]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172249 entries, 0 to 172248
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   created_at              172249 non-null  object 
 1   entry_id                172249 non-null  int64  
 2   Temperature (C)         172249 non-null  float64
 3   Turbidity (NTU)         172249 non-null  int64  
 4   Dissolved Oxygen(g/ml)  172249 non-null  float64
 5   PH                      172249 non-null  float64
 6   Ammonia(g/ml)           172159 non-null  float64
 7   Nitrate(g/ml)           172249 non-null  int64  
 8   Population              172249 non-null  int64  
 9   Fish_Length (cm)        172249 non-null  float64
 10  Fish_Weight (g)         172249 non-null  float64
dtypes: float64(6), int64(4), object(1)
memory usage: 14.5+ MB


<p>Based on the information above, the first two columns (<code>created_at</code> and <code>entry_id</code>) are not needed for this analysis.</p>

#### 3.2 Drop the First Two Columns

In [24]:
dataset = dataset.drop(columns=['created_at', 'entry_id'])

#### 3.3 Check for Missing (Null) Values

In [25]:
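# Boolean mask: True wherever a cell is missing
null_values = dataset.isnull()
# Row labels that contain at least one missing value
null_index = dataset.index[null_values.any(axis=1)].tolist()

# Missing-value count per column, then the affected row labels
print(null_values.sum())
print("\nIndex: ")
for index in null_index:
    print(index, end=", ")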

Temperature (C)            0
Turbidity (NTU)            0
Dissolved Oxygen(g/ml)     0
PH                         0
Ammonia(g/ml)             90
Nitrate(g/ml)              0
Population                 0
Fish_Length (cm)           0
Fish_Weight (g)            0
dtype: int64

Index: 
12388, 13842, 13861, 17088, 17108, 17253, 18202, 18277, 18386, 19802, 20005, 20037, 20090, 20127, 20196, 20201, 20241, 20276, 20718, 20721, 22877, 23321, 23341, 23560, 23649, 24030, 24056, 24115, 24116, 24165, 24364, 24660, 27883, 27966, 28121, 28186, 28598, 28747, 28750, 30285, 30286, 30393, 30737, 31031, 31062, 31079, 31993, 32527, 32809, 32813, 32822, 32910, 34865, 35267, 35775, 36919, 37092, 37143, 37208, 40240, 40513, 40670, 40674, 45373, 45414, 45642, 46037, 46292, 46342, 46447, 47266, 150702, 157385, 157397, 157516, 157680, 157833, 157871, 157955, 157963, 158020, 160425, 160635, 160636, 160729, 160772, 164077, 164398, 171754, 171756, 

There are 90 rows with missing values, all in the `Ammonia(g/ml)` column; their row indices are listed above. Because they are a very small share of the 172,249 rows, it is probably better to drop them altogether than to replace them with imputed values.
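
To back up the choice to drop rather than impute, it helps to check what share of the rows would be lost. The sketch below runs that check on a tiny made-up frame (`demo` is a hypothetical stand-in for `dataset`), then applies the same ratio to the counts reported above (90 of 172,249 rows, roughly 0.05%).

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with made-up values; the notebook would use `dataset`
demo = pd.DataFrame({
    "Ammonia(g/ml)": [0.38, np.nan, 0.57, 8.47],
    "PH": [8.43, 8.44, 8.42, 8.43],
})

# Count rows that have at least one missing cell
rows_with_nulls = int(demo.isnull().any(axis=1).sum())
missing_share = rows_with_nulls / len(demo)
print(f"{rows_with_nulls} of {len(demo)} rows ({missing_share:.2%}) contain a missing value")

# Same ratio for the counts reported in this notebook
reported_share = 90 / 172249
print(f"Reported share of rows with missing values: {reported_share:.4%}")
```

When the share is this small, dropping the rows is unlikely to change summary statistics noticeably. A larger share would make imputation worth considering.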

In [26]:
# Drop every row that has at least one missing value (the default behaviour)
dataset = dataset.dropna()

In [27]:
dataset.isnull().sum()

Temperature (C)           0
Turbidity (NTU)           0
Dissolved Oxygen(g/ml)    0
PH                        0
Ammonia(g/ml)             0
Nitrate(g/ml)             0
Population                0
Fish_Length (cm)          0
Fish_Weight (g)           0
dtype: int64