In [14]:
import pandas as pd
import numpy as np
pd.set_option("display.precision", 2)

#### Let's first import the dataset we created earlier

In [23]:
df = pd.read_csv("./Data/data.csv",sep=",")
df.drop(['Unnamed: 0'], axis=1, inplace=True) # There were some formatting issues while
                                              # writing the csv
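
An `Unnamed: 0` column usually appears when a DataFrame is written with its index and then read back without `index_col`. The sketch below assumes that is what happened here. It uses a tiny made-up frame and in-memory buffers rather than the real `./Data/data.csv`, and shows two ways to avoid the extra column.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the real dataset
sample = pd.DataFrame({"DISTRICT": ["Bandarban", "Bandarban"],
                       "RAIN_FALL(mm)": [0.0, 1.5]})

# Writing with the default index=True stores the index as an unnamed first column
buf_with_index = io.StringIO()
sample.to_csv(buf_with_index)
buf_with_index.seek(0)
reread = pd.read_csv(buf_with_index)  # gains an 'Unnamed: 0' column

# Option 1: tell read_csv that the first column is the index
buf_with_index.seek(0)
reread_indexed = pd.read_csv(buf_with_index, index_col=0)

# Option 2: skip the index when writing the file in the first place
buf_no_index = io.StringIO()
sample.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reread_clean = pd.read_csv(buf_no_index)
```

Either option removes the need for the explicit `drop` call, provided the stray column really is a saved index.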

In [24]:
df.head()

,DISTRICT,UPAZILA,STATION_ID,STATION_NAME,DATE,RAIN_FALL(mm),LATITUDE,LONGITUDE,WATER_LEVEL(m)
0,Bandarban,Lama,CL317,Lama,01-jan-2017,0.0,21.81,92.19,6.22
1,Bandarban,Lama,CL317,Lama,02-jan-2017,0.0,21.81,92.19,6.22
2,Bandarban,Lama,CL317,Lama,03-jan-2017,0.0,21.81,92.19,6.22
3,Bandarban,Lama,CL317,Lama,04-jan-2017,0.0,21.81,92.19,6.21
4,Bandarban,Lama,CL317,Lama,05-jan-2017,0.0,21.81,92.19,6.21


The shape of our dataset:
- We have 1826 samples with 9 columns (8 features plus the target).

In [25]:
df.shape

(1826, 9)

The columns are:

In [26]:
df.columns

Index(['DISTRICT', 'UPAZILA', 'STATION_ID', 'STATION_NAME', 'DATE',
       'RAIN_FALL(mm)', 'LATITUDE', 'LONGITUDE', 'WATER_LEVEL(m)'],
      dtype='object')

Let's look at the data types:

In [27]:
df.dtypes

DISTRICT           object
UPAZILA            object
STATION_ID         object
STATION_NAME       object
DATE               object
RAIN_FALL(mm)     float64
LATITUDE          float64
LONGITUDE         float64
WATER_LEVEL(m)    float64
dtype: object

- **DATE** is stored as `object` (plain strings), so we need to convert it to a datetime feature.

In [30]:
df['DATE'] = pd.to_datetime(df['DATE'])
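
The conversion above lets pandas infer the date layout. As a hedged alternative, the format can be stated explicitly so that a malformed entry raises an error instead of being guessed. This sketch assumes every value follows the day, three-letter month, four-digit year pattern seen in the first rows (for example `01-jan-2017`). The sample series is made up for illustration.

```python
import pandas as pd

# Hypothetical values in the same style as the DATE column
dates = pd.Series(["01-jan-2017", "02-jan-2017", "15-feb-2017"])

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%Y")
```

If some rows used a different layout, this call would fail loudly, which is often what you want during data cleaning.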

In [31]:
df.dtypes

DISTRICT                  object
UPAZILA                   object
STATION_ID                object
STATION_NAME              object
DATE              datetime64[ns]
RAIN_FALL(mm)            float64
LATITUDE                 float64
LONGITUDE                float64
WATER_LEVEL(m)           float64
dtype: object

Let's check the general info:

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1826 entries, 0 to 1825
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   DISTRICT        1826 non-null   object        
 1   UPAZILA         1826 non-null   object        
 2   STATION_ID      1826 non-null   object        
 3   STATION_NAME    1826 non-null   object        
 4   DATE            1826 non-null   datetime64[ns]
 5   RAIN_FALL(mm)   1826 non-null   float64       
 6   LATITUDE        1826 non-null   float64       
 7   LONGITUDE       1826 non-null   float64       
 8   WATER_LEVEL(m)  1826 non-null   float64       
dtypes: datetime64[ns](1), float64(4), object(4)
memory usage: 128.5+ KB


Things to notice:
- There are no missing values in our dataset: every column has 1826 non-null entries.
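
The same check can be made explicit rather than read off the `info()` output. This is a minimal sketch on a made-up frame that deliberately contains one missing value. On the real `df`, all counts would be expected to be zero.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing rainfall reading
sample = pd.DataFrame({"RAIN_FALL(mm)": [0.0, np.nan, 6.3],
                       "WATER_LEVEL(m)": [6.22, 6.21, 6.21]})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Single yes/no answer for the whole frame
has_missing = bool(sample.isna().any().any())
```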

Let's check the statistical properties of the numerical features:

In [None]:
df.describe()

,RAIN_FALL(mm),LATITUDE,LONGITUDE,WATER_LEVEL(m)
count,1826.0,1830.0,1830.0,1826.0
mean,10.0,21.8,92.2,6.81
std,26.09,6.04e-14,1.76e-12,0.97
min,0.0,21.8,92.2,5.86
25%,0.0,21.8,92.2,6.19
50%,0.0,21.8,92.2,6.5
75%,6.28,21.8,92.2,7.15
max,273.0,21.8,92.2,13.54


And for the non-numerical features:

In [34]:
df.describe(include=["object"])

,DISTRICT,UPAZILA,STATION_ID,STATION_NAME
count,1826,1826,1826,1826
unique,1,1,1,1
top,Bandarban,Lama,CL317,Lama
freq,1826,1826,1826,1826


Note:
- As each of these categorical features has only one unique value, they won't contribute anything to the model.
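
One possible way to act on this note is to find columns with a single unique value and drop them. The sketch below is not necessarily what this notebook does next. It uses a small made-up frame with the same kind of columns, and the names `constant_cols` and `reduced` are illustrative.

```python
import pandas as pd

# Hypothetical frame: three constant columns and two that vary
sample = pd.DataFrame({
    "DISTRICT": ["Bandarban"] * 3,
    "STATION_ID": ["CL317"] * 3,
    "LATITUDE": [21.81] * 3,
    "RAIN_FALL(mm)": [0.0, 0.0, 6.3],
    "WATER_LEVEL(m)": [6.22, 6.21, 6.25],
})

# Count distinct values per column and keep the names with at most one
n_unique = sample.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop the constant columns, leaving only features that carry information
reduced = sample.drop(columns=constant_cols)
```

Applied to the real data, a check like this would likely also flag `LATITUDE` and `LONGITUDE`, whose standard deviation in the `describe()` output is effectively zero.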