# Analysis and Prediction of GMSL

## Water Dataset (https://www.kaggle.com/mathsian/water-temperature)

The data in this dataset are water samples collected from the ocean between 1959 and 2020. We consider only
measurements taken at depths of up to 200 m (the Photic zone).

The columns of interest for our work are:
* T_degC - water temperature in degrees Celsius
* O2ml_L - dissolved oxygen in the water (milliliters per liter, as the column name indicates)

In [1]:
import pandas as pd
import seaborn as sb

In [2]:
df = pd.read_csv('original_datasets/water.csv', delimiter=',')

In [3]:
df.head()

Unnamed: 0,Sta_ID,Date,Quarter,Lat_Deg,Depthm,Zone,T_degC,PO4uM,SiO3uM,NO2uM,NO3uM,Salnty,O2ml_L
0,130.0 050.0,08/16/1959,3,25,0,Photic,25.38,0.36,1.0,0.0,0.9,34.15,4.72
1,130.0 050.0,08/16/1959,3,25,1,Photic,25.38,0.36,1.0,0.0,0.9,34.15,4.72
2,130.0 050.0,08/16/1959,3,25,10,Photic,25.35,0.41,1.0,0.0,0.8,34.18,4.14
3,130.0 050.0,08/16/1959,3,25,20,Photic,23.5,0.42,1.0,0.0,1.2,34.146,4.26
4,130.0 050.0,08/16/1959,3,25,30,Photic,21.45,0.43,1.0,0.0,1.5,34.117,4.57


In [4]:
df.columns

Index(['Sta_ID', 'Date', 'Quarter', 'Lat_Deg', 'Depthm', 'Zone', 'T_degC',
       'PO4uM', 'SiO3uM', 'NO2uM', 'NO3uM', 'Salnty', 'O2ml_L'],
      dtype='object')

#### Dropping the columns that are not of interest. The 'Depthm' and 'Zone' columns are kept for later steps.

* Depthm - depth at which the measurement was taken
* Zone - zone in which the measurement was taken (Photic/Disphotic)

In [5]:
df.drop(['Sta_ID', 'Quarter', 'Lat_Deg', 'PO4uM', 'SiO3uM',
         'NO2uM', 'NO3uM', 'Salnty'], axis = 1, inplace = True)

In [6]:
df.rename(columns={'T_degC':'WaterTemp', 'O2ml_L':'O2ml'}, inplace = True)

In [7]:
df.describe()

Unnamed: 0,Depthm,WaterTemp,O2ml
count,337792.0,337792.0,337792.0
mean,170.392887,11.185802,3.728877
std,214.807837,3.820133,1.991083
min,0.0,1.48,-0.01
25%,40.0,8.2,1.99
50%,103.0,10.45,3.93
75%,250.0,14.17,5.68
max,5351.0,30.02,11.13


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337792 entries, 0 to 337791
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date       337792 non-null  object 
 1   Depthm     337792 non-null  int64  
 2   Zone       337792 non-null  object 
 3   WaterTemp  337792 non-null  float64
 4   O2ml       337792 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 12.9+ MB


#### Converting the date from string to datetime

In [9]:
df['Date'] = pd.to_datetime(df['Date'])
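
The conversion above lets pandas infer the date layout. As a minimal sketch, assuming every row follows the MM/DD/YYYY layout seen in the `df.head()` output (for example `08/16/1959`), the format can be passed explicitly. The `sample` frame and its second date are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample in the MM/DD/YYYY layout shown by df.head()
sample = pd.DataFrame({'Date': ['08/16/1959', '01/05/1960']})

# An explicit format removes any day/month ambiguity and skips format inference
sample['Date'] = pd.to_datetime(sample['Date'], format='%m/%d/%Y')

# The column is now datetime64, so the .dt accessor is available
months = sample['Date'].dt.month.tolist()
print(months)
```

If some rows used a different layout, this call would raise an error instead of silently misparsing them, which may be preferable when cleaning data.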

#### Only measurements taken at depths of up to 200 m (Photic) are kept, after which the 'Zone' column is dropped

In [10]:
# Keep only Photic-zone rows; .copy() avoids SettingWithCopyWarning on later in-place edits
df = df[df['Zone'] == 'Photic'].copy()

In [11]:
df['Zone'].values

array(['Photic', 'Photic', 'Photic', ..., 'Photic', 'Photic', 'Photic'],
      dtype=object)

In [12]:
df.drop('Zone', axis = 1, inplace = True)
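
The filter-then-drop step can be sketched on a small made-up frame. The values in `toy` are hypothetical and only mirror the column names used here. Taking a `.copy()` of the filtered rows is an assumption about good practice, not something the original cells do.

```python
import pandas as pd

# Hypothetical measurements; only the column names come from the dataset
toy = pd.DataFrame({
    'Depthm': [0, 150, 300],
    'Zone': ['Photic', 'Photic', 'Disphotic'],
    'WaterTemp': [25.38, 12.10, 6.50],
})

# Keep the Photic rows; .copy() makes the result independent of `toy`
photic = toy[toy['Zone'] == 'Photic'].copy()

# Every remaining row has the same Zone, so the column carries no information
assert photic['Zone'].nunique() == 1
photic = photic.drop(columns='Zone')

print(photic)
```

Returning a new frame from `drop` instead of using `inplace=True` keeps each step explicit and easy to re-run.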

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 236431 entries, 0 to 337791
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Date       236431 non-null  datetime64[ns]
 1   Depthm     236431 non-null  int64         
 2   WaterTemp  236431 non-null  float64       
 3   O2ml       236431 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 9.0 MB


In [14]:
df.head()

Unnamed: 0,Date,Depthm,WaterTemp,O2ml
0,1959-08-16,0,25.38,4.72
1,1959-08-16,1,25.38,4.72
2,1959-08-16,10,25.35,4.14
3,1959-08-16,20,23.5,4.26
4,1959-08-16,30,21.45,4.57


In [15]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16,8))
sb.boxplot(data=df, orient="h", palette="Set2", ax=ax)

In [None]:
df.plot(x_compat=True, rot=90, figsize=(6, 5))

In [None]:
df.shape

In [None]:
df.nunique()

#### Since the dataset has 236431 rows while nunique reports 3080 distinct Date values, we conclude that dates are repeated.

#### Extracting the duplicates shows that they are measurements taken on the same day at different depths.

In [None]:
duplicate_dates = df.duplicated(subset=['Date'], keep=False)
df = df.loc[duplicate_dates.values]
df.head(20)
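
A minimal sketch of how the row count, `nunique` and `duplicated(..., keep=False)` relate, on a made-up three-row frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample: two depths measured on one day, one depth on the next
toy = pd.DataFrame({
    'Date': pd.to_datetime(['1959-08-16', '1959-08-16', '1959-08-17']),
    'Depthm': [0, 10, 0],
})

# Fewer unique dates than rows means some dates repeat
n_rows, n_dates = len(toy), toy['Date'].nunique()
print(n_rows, n_dates)  # 3 rows, 2 unique dates

# keep=False flags every row whose Date occurs more than once
mask = toy.duplicated(subset=['Date'], keep=False)
print(mask.tolist())  # [True, True, False]

# Selecting with the mask keeps only the repeated dates
repeated = toy.loc[mask]
```

Note that selecting with this mask discards any date that has a single measurement (here 1959-08-17), so it is worth checking whether such dates exist before overwriting `df`.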

#### Grouping by date and taking the median gives one water temperature and one O2 concentration value per day.

In [None]:
#df = df.groupby('Date').mean().reset_index()
df = df.groupby('Date')[['WaterTemp', 'O2ml']].median().reset_index()
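
To illustrate what the grouping produces, here is a small sketch on made-up numbers (`toy` and its values are hypothetical). Each date collapses to one row holding the median of its depth readings.

```python
import pandas as pd

# Hypothetical readings: three depths on the first day, two on the second
toy = pd.DataFrame({
    'Date': pd.to_datetime(['1959-08-16'] * 3 + ['1959-08-17'] * 2),
    'Depthm': [0, 10, 20, 0, 10],
    'WaterTemp': [25.0, 24.0, 20.0, 22.0, 21.0],
    'O2ml': [4.7, 4.1, 4.3, 5.0, 4.0],
})

# Selecting the two columns before aggregating leaves Depthm out of the result
daily = toy.groupby('Date')[['WaterTemp', 'O2ml']].median().reset_index()

print(daily)
```

For the first day the median temperature is 24.0 while the mean would be 23.0, which shows how the median is less affected by a single colder, deeper reading.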

#### After grouping, the duplicate dates are gone and the redundant columns have been dropped

In [None]:
df.sort_values('Date').head(5)

#### To obtain monthly values, the day is stripped from each date and the data are grouped by date again

In [None]:
df['Date'] = df['Date'].dt.strftime('%Y-%m')

In [None]:
#df = df.groupby('Date').mean().reset_index()
df = df.groupby('Date')[['WaterTemp', 'O2ml']].median().reset_index()
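
The monthly step can be sketched the same way on made-up daily values (`daily` and its numbers are hypothetical). After `strftime('%Y-%m')` the `Date` column holds strings such as `'1959-08'`, and all days of a month fall into the same group.

```python
import pandas as pd

# Hypothetical daily medians: two days in August, one in September
daily = pd.DataFrame({
    'Date': pd.to_datetime(['1959-08-16', '1959-08-20', '1959-09-01']),
    'WaterTemp': [24.0, 22.0, 20.0],
    'O2ml': [4.0, 5.0, 5.5],
})

# Truncate each date to 'YYYY-MM'; the column becomes a plain string column
daily['Date'] = daily['Date'].dt.strftime('%Y-%m')

# One row per month, holding the median of that month's daily values
monthly = daily.groupby('Date')[['WaterTemp', 'O2ml']].median().reset_index()

print(monthly)
```

Because the result is a median of daily medians, months with few sampling days rest on very little data. If a date-like type is needed later, `dt.to_period('M')` is a possible alternative to `strftime`.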

In [None]:
df.head()

In [None]:
df.to_csv('processed_datasets/WaterTemp_O2ml.csv', index = False)