# 1. Dataset
This project uses a dataset taken from Kaggle with no copyright restrictions. It contains pH and temperature readings for freshwater fish, which serve as the input variables for building an automatic fish-feeding application. The dataset can be downloaded here: https://drive.google.com/drive/folders/1KjLpnUZJAXY8mJio7sBCXipj04bwiY_z?usp=drive_link

# 2. Data Loading
This stage prepares the required libraries and then loads the datasets.

**2.1 Preparing the Libraries**

In [8]:
!python --version

Python 3.10.12


In [9]:
import os
import numpy as np
import pandas as pd

**2.2 Loading the Realfish Dataset**

In [10]:
url = "https://drive.google.com/uc?export=download&id=1HB6vUxPcNhWPQYF2ik24cntxxDhPurIu"
realfish_df = pd.read_csv(url)
realfish_df.head()

Unnamed: 0,ph,temperature,turbidity,fish
0,6.0,27.0,4.0,katla
1,7.6,28.0,5.9,sing
2,7.8,27.0,5.5,sing
3,6.5,31.0,5.5,katla
4,8.2,27.0,8.5,prawn


**2.3 Loading the Realtime Dataset**

In [11]:
url = "https://drive.google.com/uc?export=download&id=1n77qT7IV25ciYPr4zccUzI_AewIpyweq"
realtime_df = pd.read_csv(url)
realtime_df.head()

Unnamed: 0,Date,Time,NITRATE(PPM),PH,AMMONIA(mg/l),TEMP,DO,TURBIDITY,MANGANESE(mg/l),pressure,tempC,humidity,windspeedKmph,label
0,01-02-2022,08:00:00,112.1,7.7,0.002,24.35,1.9,21.3,2.01,1012.0,27.0,21.0,4.0,0.0
1,01-02-2022,08:20:00,119.8,7.6,0.088,24.3,0.8,31.0,1.0,1012.0,30.0,19.0,4.0,0.0
2,01-02-2022,08:40:00,127.4,8.5,0.029,24.28,2.6,27.4,2.45,1012.0,33.0,16.0,4.0,0.0
3,01-02-2022,09:00:00,105.0,7.5,0.06,24.25,2.9,21.9,1.83,1011.0,37.0,13.0,4.0,0.0
4,01-02-2022,09:20:00,121.1,7.4,0.001,24.1,1.3,22.7,1.69,1010.0,37.0,12.0,4.0,0.0
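Both loading cells use the same Google Drive direct-download URL pattern and differ only in the file ID. As a minimal sketch, the pattern can be wrapped in a small helper. The function name `drive_csv_url` and the two constants are hypothetical and not part of the original notebook; the file IDs are copied from the cells above.

```python
def drive_csv_url(file_id: str) -> str:
    """Build a direct-download URL for a publicly shared Google Drive file."""
    return f"https://drive.google.com/uc?export=download&id={file_id}"


# File IDs copied from the two loading cells above.
REALFISH_ID = "1HB6vUxPcNhWPQYF2ik24cntxxDhPurIu"
REALTIME_ID = "1n77qT7IV25ciYPr4zccUzI_AewIpyweq"

realfish_url = drive_csv_url(REALFISH_ID)
realtime_url = drive_csv_url(REALTIME_ID)
print(realfish_url)
print(realtime_url)
```

Either URL can then be passed to `pd.read_csv` exactly as in the cells above.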


# 3. Merging the Datasets

In [12]:
realfish_df = realfish_df[['ph', 'temperature']].rename(columns={'temperature': 'temp'})
realtime_df = realtime_df[['PH', 'TEMP']].rename(columns={'PH': 'ph', 'TEMP': 'temp'})

In [13]:
data_df = pd.concat([realfish_df, realtime_df], ignore_index=True)
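The select, rename, and concatenate steps can be illustrated on two tiny stand-in frames. This is only a sketch: the frames `fish` and `sensor` and their values are made up for illustration, and only the column names mirror the two source datasets.

```python
import pandas as pd

# Toy stand-ins for the two sources; the values are illustrative only.
fish = pd.DataFrame(
    {"ph": [6.0, 7.6], "temperature": [27.0, 28.0], "fish": ["katla", "sing"]}
)
sensor = pd.DataFrame({"PH": [7.7, 7.6], "TEMP": [24.35, 24.3], "DO": [1.9, 0.8]})

# Keep only the two shared variables and give them identical column names.
fish = fish[["ph", "temperature"]].rename(columns={"temperature": "temp"})
sensor = sensor[["PH", "TEMP"]].rename(columns={"PH": "ph", "TEMP": "temp"})

# Stack the rows; ignore_index=True renumbers the index from 0.
combined = pd.concat([fish, sensor], ignore_index=True)
print(combined)
```

Renaming before concatenating matters: `pd.concat` aligns on column names, so mismatched names such as `PH` and `ph` would produce separate columns filled with NaN.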

Rounding the values to two decimal places

In [27]:
data_df = data_df.round(2)
print("DataFrame after rounding with round():")
print(data_df)

DataFrame after rounding with round():
        ph  temp
0      6.0  27.0
1      7.6  28.0
2      7.8  27.0
3      6.5  31.0
4      8.2  27.0
...    ...   ...
75382  NaN   NaN
75383  NaN   NaN
75384  NaN   NaN
75385  NaN   NaN
75386  NaN   NaN

[75387 rows x 2 columns]


# 4. Cleaning Data


In [28]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75387 entries, 0 to 75386
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ph      75349 non-null  float64
 1   temp    75349 non-null  float64
dtypes: float64(2)
memory usage: 1.2 MB
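The `info()` output reports 75349 non-null values out of 75387 entries in each column, so each column is missing 75387 - 75349 = 38 values. A minimal sketch of the same arithmetic follows, using a toy frame whose values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one fully missing row; the values are illustrative only.
toy = pd.DataFrame({"ph": [6.0, np.nan, 7.8], "temp": [27.0, np.nan, 27.0]})

# count() returns the non-null count per column, as shown by info().
missing_per_column = len(toy) - toy.count()
print(missing_per_column)

# Same arithmetic with the figures reported by info() above.
missing_in_source = 75387 - 75349
print(missing_in_source)
```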


**4.1 Checking for duplicate rows**

In [29]:
print("Number of duplicate rows: ", data_df.duplicated().sum())

Number of duplicate rows:  64501


**4.2 Removing duplicate rows**

In [31]:
# Keep the first occurrence of each (ph, temp) pair
data_df = data_df.drop_duplicates()

In [32]:
print("Number of duplicate rows: ", data_df.duplicated().sum())

Number of duplicate rows:  0
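Removing 64501 duplicates from 75387 rows leaves 10886 rows. pandas treats NaN values in the same positions as equal when detecting duplicates, so repeated all-NaN rows collapse into a single row. The sketch below shows this on a toy frame; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: one repeated reading and two all-NaN rows (illustrative values).
toy = pd.DataFrame(
    {
        "ph": [6.0, 6.0, 7.6, np.nan, np.nan],
        "temp": [27.0, 27.0, 28.0, np.nan, np.nan],
    }
)

# Rows 1 and 4 repeat rows 0 and 3 respectively.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

# One all-NaN row survives, contributing two NaN cells.
n_nan_cells = int(deduped.isna().sum().sum())
print(n_dupes, len(deduped), n_nan_cells)
```

This is consistent with the NaN count of 2 reported in section 4.3: a single all-NaN row remains after deduplication.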


**4.3 Checking for NaN values**

In [33]:
print(data_df.isna().sum().sum())

2


In [34]:
# Drop rows in which every value is NaN
data_df = data_df.dropna(how='all')
print("DataFrame after dropping rows whose values are all NaN:")
print(data_df)

# Drop columns in which every value is NaN
data_df = data_df.dropna(axis=1, how='all')
print("DataFrame after dropping columns whose values are all NaN:")
print(data_df)

DataFrame after dropping rows whose values are all NaN:
        ph  temp
0      6.0  27.0
1      7.6  28.0
2      7.8  27.0
3      6.5  31.0
4      8.2  27.0
...    ...   ...
75297  7.4  15.3
75334  6.2  15.7
75340  6.4  15.8
75347  6.8  15.6
75348  6.2  16.0

[10885 rows x 2 columns]
DataFrame after dropping columns whose values are all NaN:
        ph  temp
0      6.0  27.0
1      7.6  28.0
2      7.8  27.0
3      6.5  31.0
4      8.2  27.0
...    ...   ...
75297  7.4  15.3
75334  6.2  15.7
75340  6.4  15.8
75347  6.8  15.6
75348  6.2  16.0

[10885 rows x 2 columns]


In [35]:
print(data_df.isna().sum().sum())

0
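Note that `how='all'` only removes rows in which every value is missing; a row with just one missing value would be kept. Here the follow-up check returns 0, so no such partial rows exist in this dataset. The sketch below contrasts `how='all'` with `how='any'` on a toy frame whose values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: a complete row, a partially missing row, and an all-NaN row.
toy = pd.DataFrame({"ph": [6.0, np.nan, np.nan], "temp": [27.0, 28.0, np.nan]})

# how="all" drops only the row where every value is NaN.
drop_all = toy.dropna(how="all")

# how="any" drops every row that has at least one NaN.
drop_any = toy.dropna(how="any")

remaining_nan = int(drop_all.isna().sum().sum())
print(len(drop_all), len(drop_any), remaining_nan)
```

If partially missing rows could appear in future data, `how='any'` (the pandas default) would be the stricter choice.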


In [36]:
data_df.head()

Unnamed: 0,ph,temp
0,6.0,27.0
1,7.6,28.0
2,7.8,27.0
3,6.5,31.0
4,8.2,27.0


**4.4 Checking summary statistics**

In [37]:
data_df.describe()

Unnamed: 0,ph,temp
count,10885.0,10885.0
mean,6.43311,28.780588
std,1.203719,8.346725
min,4.5,0.0
25%,5.4,22.0
50%,6.4,27.9
75%,7.3,35.4
max,9.0,45.5


In [38]:
data_df.isna().sum()  # No missing values remain

ph      0
temp    0
dtype: int64

# 5. Saving the Dataset

In [39]:
save_dir = '/mnt/data'
# Create the output directory if it does not exist yet
os.makedirs(save_dir, exist_ok=True)

In [40]:
file_path = os.path.join(save_dir, 'data.csv')

In [41]:
data_df.to_csv(file_path, index=False)

print(f"The combined dataset has been saved to {file_path}")

The combined dataset has been saved to /mnt/data/data.csv
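One way to confirm that the file was written correctly is to read it back and compare it with the in-memory frame. The sketch below does this with a toy frame and a temporary directory so that it does not touch `/mnt/data`; the frame and its values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for data_df; the values are illustrative only.
toy = pd.DataFrame({"ph": [6.0, 7.6], "temp": [27.0, 28.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# equals() compares shape, column names, dtypes, and values.
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

Because the file is written with `index=False`, the gaps left in the index by the cleaning steps are not stored, and a reloaded frame gets a fresh index starting at 0.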
