# Preprocessing of Dataset


TODO:
1. Import all required libraries
2. Getting a dataset
3. Importing datasets
4. Finding missing data, data in the wrong format, duplicates, and wrong data (can be identified by comparing it with other values)
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature Scaling (Standardization and Normalization)

### Importing all required libraries

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


### Data Cleaning & Pre-Processing

Import the Brazil dataset and load it into a DataFrame.

Then clean the data.
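As a minimal sketch of these cleaning steps on a toy frame, the snippet below drops duplicate rows, turns a placeholder value into `NaN`, and fills the gaps with each column's mean. The column names `temp` and `rain` and all values are made up for illustration; `-9999` is assumed to be the placeholder for wrong readings, as in the notebook cell. Note that `DataFrame.replace` returns a new frame, so its result has to be assigned back.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): -9999 marks a wrong or missing reading
toy = pd.DataFrame({
    "temp": [20.0, -9999.0, 24.0, 24.0],
    "rain": [0.0, 1.0, -9999.0, -9999.0],
})

# Remove exact duplicate rows (the last row repeats the third)
toy = toy.drop_duplicates()

# Turn the placeholder into NaN; replace() returns a new frame
toy = toy.replace(-9999, np.nan)

# Fill each column's NaN values with that column's mean
toy = toy.fillna(toy.mean())

print(toy)
```

Computing the mean only after the placeholder has become `NaN` matters: otherwise `-9999` would be averaged in and distort the fill value.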

In [36]:
dataset=r"C:\Users\kamat\Downloads\central_west.csv\central_west.csv"
# Note: wrong values were replaced with empty cells beforehand; they are filled with the column mean below
# Load the file into a pandas DataFrame
df=pd.read_csv(dataset)
df.drop('index',axis=1,inplace=True)

# Removing duplicates from the dataset

df.drop_duplicates(inplace=True)

# Replace the wrong data (-9999 placeholder values) with NaN so it can be corrected below
# replace() returns a new DataFrame unless inplace=True is passed

df.replace(-9999,np.nan,inplace=True)

# Converting data into correct format if there is any discrepancy

df.columns=['Date (YYYY-MM-DD)','Time (HH:00)','Amount of precipitation in millimetres (last hour)','Atmospheric pressure at station level (mb)','Maximum air pressure for the last hour (mb)','Minimum air pressure for the last hour (mb)','Solar radiation (KJ/m2)','Air temperature (instant) (°c)','Dew point temperature (instant) (°c)','Maximum temperature for the last hour (°c)','Minimum temperature for the last hour (°c)','Maximum dew point temperature for the last hour (°c)','Minimum dew point temperature for the last hour (°c)','Maximum relative humid temperature for the last hour (%)','Minimum relative humid temperature for the last hour (%)','Relative humid (% instant)','Wind direction (radius degrees (0-360))','Wind gust in metres per second','Wind speed in metres per second','Brazilian geopolitical regions','State (Province)','Station Name (usually city location or nickname)','Station code (INMET number)','Latitude','Longitude','Elevation']

df['Date (YYYY-MM-DD)']=pd.to_datetime(df['Date (YYYY-MM-DD)'])
df['Time (HH:00)'] = pd.to_datetime(df['Time (HH:00)'], format='%H:%M', errors='coerce')
df.dropna(subset=['Time (HH:00)'], inplace=True)

# Filling empty data

# Fill missing values in every measurement column with that column's mean.
# Assigning the result back avoids chained in-place fillna on a single column,
# which recent pandas versions warn about.
measurement_columns = [
    "Amount of precipitation in millimetres (last hour)",
    "Atmospheric pressure at station level (mb)",
    "Maximum air pressure for the last hour (mb)",
    "Minimum air pressure for the last hour (mb)",
    "Solar radiation (KJ/m2)",
    "Air temperature (instant) (°c)",
    "Dew point temperature (instant) (°c)",
    "Maximum temperature for the last hour (°c)",
    "Minimum temperature for the last hour (°c)",
    "Maximum dew point temperature for the last hour (°c)",
    "Minimum dew point temperature for the last hour (°c)",
    "Maximum relative humid temperature for the last hour (%)",
    "Minimum relative humid temperature for the last hour (%)",
    "Relative humid (% instant)",
    "Wind direction (radius degrees (0-360))",
    "Wind gust in metres per second",
    "Wind speed in metres per second",
]
df[measurement_columns] = df[measurement_columns].fillna(df[measurement_columns].mean())
print(df)

         Date (YYYY-MM-DD)        Time (HH:00)  \
0               2017-12-20 1900-01-01 14:00:00   
1               2017-12-20 1900-01-01 15:00:00   
2               2017-12-20 1900-01-01 16:00:00   
3               2017-12-20 1900-01-01 17:00:00   
4               2017-12-20 1900-01-01 18:00:00   
...                    ...                 ...   
11427115        2017-12-20 1900-01-01 09:00:00   
11427116        2017-12-20 1900-01-01 10:00:00   
11427117        2017-12-20 1900-01-01 11:00:00   
11427118        2017-12-20 1900-01-01 12:00:00   
11427119        2017-12-20 1900-01-01 13:00:00   

          Amount of precipitation in millimetres (last hour)  \
0                                                       0.0    
1                                                       0.0    
2                                                       0.0    
3                                                       0.0    
4                                                       0.0    
...            

#### Encoding Categorical Data

Machine learning models work entirely on numbers, so categorical variables in the dataset cause trouble when building a model. These variables therefore have to be encoded as numbers first.

scikit-learn provides the LabelEncoder and OneHotEncoder classes for this; here we use OneHotEncoder.

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Before encoding the categorical features, convert the timestamp objects to plain numbers, as they will not be processed properly by ColumnTransformer

# Date -> seconds since the Unix epoch
df['Date (YYYY-MM-DD)']=pd.to_datetime(df['Date (YYYY-MM-DD)']).apply(lambda x:x.timestamp())

# Time -> seconds since midnight

df['Time (HH:00)']=pd.to_datetime(df['Time (HH:00)'],format='%H:%M').apply(lambda x: x.hour * 3600 + x.minute * 60)
# If the column still holds 'HH:MM' strings, an alternative is df['Time (HH:00)'].apply(lambda x: int(x.split(':')[0]) * 3600 + int(x.split(':')[1]) * 60)

# Replace categorical_features with the actual column names or indices

categorical_columns = ['Brazilian geopolitical regions','State (Province)','Station Name (usually city location or nickname)','Station code (INMET number)']

#Initialize OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)

# Apply one-hot encoding to the categorical columns

one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Create a DataFrame with the one-hot encoded columns
# We use get_feature_names_out() to get the column names for the encoded data
# index=df.index keeps the rows aligned with df, whose index has gaps after rows were dropped

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns), index=df.index)

# Concatenate the one-hot encoded dataframe with the original dataframe

df_encoded = pd.concat([df, one_hot_df], axis=1)

# Drop the original categorical columns
df_encoded = df_encoded.drop(categorical_columns, axis=1)

### Feature Scaling (Standardization)

Feature scaling is the final step of data preprocessing in machine learning. It brings the independent variables of the dataset onto a common scale, so that no variable dominates the others simply because its values are larger.

#### Standardization

![alt text](https://static.javatpoint.com/tutorial/machine-learning/images/data-preprocessing-machine-learning-9.png)

Here, we will use the standardization method for our dataset.

For feature scaling, we import the StandardScaler class from the sklearn.preprocessing module.
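Standardization rescales a feature to zero mean and unit variance using z = (x - mean) / std. The sketch below applies the formula by hand with NumPy to a made-up array. It uses the population standard deviation (`ddof=0`, NumPy's default), which is also what StandardScaler uses.

```python
import numpy as np

# Toy feature values (hypothetical)
x = np.array([10.0, 20.0, 30.0, 40.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```

After the transformation the values are centred on zero and have a standard deviation of one, whatever the original units were.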

In [None]:
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler

scaler = StandardScaler()

# Define the numerical columns to be scaled, excluding the categorical columns

numerical_columns = [col for col in df.columns if col not in categorical_columns]

# Apply StandardScaler to the numerical columns of the encoded DataFrame,
# so the final frame holds both the scaled values and the one-hot columns
df_encoded[numerical_columns] = scaler.fit_transform(df_encoded[numerical_columns])
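The TODO list also calls for splitting the dataset into training and test sets, which the notebook does not yet do. A minimal sketch of how the split and the scaling could fit together is shown below. It runs on synthetic data, and the column names and the choice of `temperature` as the target are assumptions, since the notebook does not name a target. The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; columns and target are hypothetical
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "pressure": rng.normal(950.0, 5.0, size=100),
    "humidity": rng.uniform(20.0, 100.0, size=100),
    "temperature": rng.normal(25.0, 4.0, size=100),
})

X = toy[["pressure", "humidity"]]
y = toy["temperature"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The scaled test features will not have exactly zero mean or unit variance. That is expected, because they are transformed with the training set's mean and standard deviation.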