## Data Preprocessing
Data preprocessing is a critical step in machine learning that often determines the success of a model. In this notebook, we aim to improve the data preprocessing in our machine learning project.

### Objective
Produce a numerical representation of the categorical data so that it can be used to classify whether the 'Type of Location' of the AQI data is 'Industrial Area' or 'Residential, Rural and other Areas'.

### Tasks
- Improve the data preprocessing workflow.
- Data cleaning and transformation.
- Feature engineering.
- Encoding of categorical data, with the reasoning behind the chosen encoding technique.

In [None]:
# importing basic libraries
import numpy as np
import pandas as pd

In [None]:
# loading the data set into a pandas DataFrame
df = pd.read_csv('gujarat_aqi.csv')

In [None]:
# every entry in the 'State' column is "Gujarat", so the column carries no useful information and is dropped
df = df.drop(columns='State')

In [None]:
# every entry in the 'SPM' column is NaN, so the column carries no useful information and is dropped
df = df.drop(columns='SPM')
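
Before dropping columns like these, it can help to confirm that they really are constant or empty. The snippet below is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative, not taken from `gujarat_aqi.csv`). Note that a column holding one constant plus some missing values would also be flagged, so the result is worth inspecting before dropping anything.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the AQI data (values are made up for illustration)
toy = pd.DataFrame({
    'State': ['Gujarat', 'Gujarat', 'Gujarat'],
    'SPM': [np.nan, np.nan, np.nan],
    'SO2': [12.0, 15.5, np.nan],
})

# nunique(dropna=True) counts distinct non-missing values per column:
# 0 means the column is entirely NaN, 1 means it holds a single constant
distinct = toy.nunique(dropna=True)
uninformative = distinct[distinct <= 1].index.tolist()

cleaned = toy.drop(columns=uninformative)
print(uninformative)
print(cleaned.columns.tolist())
```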

In [None]:
# the 'Type of Location' column has only 2 distinct values, so we can encode them in binary as 1 and 0
# this column is the target when we train the classifier on this data

df['Type of Location'] = df['Type of Location'].replace({'Industrial Area': 1, 'Residential, Rural and other Areas': 0})
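
One caveat with `replace` is that any label not listed in the dictionary passes through unchanged, which can leave strings mixed in with the 0/1 values. As a hedged alternative, the sketch below uses `Series.map` on a toy column (the `toy` and `label_map` names are illustrative) and checks that every label was recognised before converting to integers.

```python
import pandas as pd

# Toy column using the two labels named in the objective
toy = pd.DataFrame({'Type of Location': [
    'Industrial Area',
    'Residential, Rural and other Areas',
    'Industrial Area',
]})

label_map = {'Industrial Area': 1, 'Residential, Rural and other Areas': 0}

# Series.map returns NaN for any value missing from the dictionary,
# so unexpected labels surface as missing values instead of passing through
encoded = toy['Type of Location'].map(label_map)
assert encoded.notna().all(), 'found a label that is not in label_map'

toy['Type of Location'] = encoded.astype(int)
print(toy['Type of Location'].tolist())
```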

In [None]:
# the readings should be the same regardless of which agency takes them, so the 'Agency' column gives the model no new information
# it is therefore dropped
df = df.drop(columns='Agency')

In [None]:
# to allow a more in-depth analysis of the data, the date column is split into three columns: 'Day', 'Month' and 'Year'
df[['Day', 'Month', 'Year']] = df['Sampling Date'].str.split('/', expand=True)

# converting the new columns to integers: str.split yields strings (dtype 'object'), which are slower to process and cannot be used as numeric features
df['Day'] = df['Day'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)
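
The same split can also be sketched with `pd.to_datetime`, which validates each date while parsing. This assumes the dates are written as day/month/year with a four-digit year, matching the column order used above; the sample dates below are made up.

```python
import pandas as pd

# Toy dates in the assumed day/month/year layout
toy = pd.DataFrame({'Sampling Date': ['01/02/2015', '15/11/2014']})

# An explicit format avoids ambiguity between day-first and month-first dates
# and raises an error if a value does not match the layout
parsed = pd.to_datetime(toy['Sampling Date'], format='%d/%m/%Y')

# The .dt accessor yields integer columns directly, so no astype(int) is needed
toy['Day'] = parsed.dt.day
toy['Month'] = parsed.dt.month
toy['Year'] = parsed.dt.year
print(toy[['Day', 'Month', 'Year']])
```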

In [None]:
# removing the sampling date column as the Day, Month and Year columns have already been made
df = df.drop(columns='Sampling Date')

In [None]:
# some columns contain missing values, so SimpleImputer is used to fill each gap with the mean of its column

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(df.iloc[: ,2:5])
df.iloc[: ,2:5] = imputer.transform(df.iloc[: ,2:5])
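
The positional slice `2:5` only targets the right columns while the column order stays as it is. A minimal sketch of the same mean imputation with columns selected by name is shown below; the frame and its values are made up, and the `pollutant_cols` list is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy pollutant readings with one gap per column (values are made up)
toy = pd.DataFrame({
    'SO2': [10.0, np.nan, 14.0],
    'NO2': [20.0, 30.0, np.nan],
})

# Selecting by name keeps the step correct even if columns are reordered
pollutant_cols = ['SO2', 'NO2']

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform learns each column mean and fills the gaps in one call
toy[pollutant_cols] = imputer.fit_transform(toy[pollutant_cols])
print(toy)
```

Here the missing `SO2` value becomes 12.0 (the mean of 10 and 14) and the missing `NO2` value becomes 25.0 (the mean of 20 and 30).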

In [None]:
'''
To encode the categorical variable in the 'City/Town/Village/Area' column, I've decided to use one-hot encoding to convert the city names into machine-readable data.
One-hot encoding is a technique that turns a column of categorical data into multiple columns of 1s and 0s, where 1 indicates that the row belongs to that category and 0 indicates that it doesn't.
I've chosen one-hot encoding over simply numbering the cities 1, 2, 3, ... because integers imply an order and a distance between cities that do not exist, which may mislead the model into learning false relationships and reduce accuracy.'''

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])],  remainder='passthrough')
df = ct.fit_transform(df)
df = pd.DataFrame(df)

In [None]:
# ColumnTransformer returns a NumPy array without column labels, so pd.DataFrame numbered the columns 0..13 (0-6 are the one-hot city columns)
# restoring the original names for the remaining columns; this could be done programmatically, but since the list is small it is done manually
df = df.rename(columns={7:'Type of location', 8: 'SO2', 9: 'NO2',10: 'RSPM/PM10',11: 'Day',12: 'Month', 13: 'Year'})

In [None]:
df = df.drop(columns='Year')