## Library Import
NumPy, pandas, matplotlib, seaborn

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Import Dataset

In [4]:
rain = pd.read_csv("weatherAUS.csv")
rain.head()

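|  | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |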


To do:  
Basic statistics  
Visualization  
Missing data  
Outliers  
Categorical vs Numerical  

In [5]:
print("Data type : ", type(rain))
print("Data dims : ", rain.shape)
print(rain.dtypes)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (145460, 23)
Date              object
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
dtype: object
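
The `object` columns in the listing above are the categorical candidates and the `float64` columns are the numerical ones. A minimal sketch of that split with `select_dtypes`, on a tiny made-up frame (`toy` and its values are illustrative assumptions, not rows loaded from `weatherAUS.csv`):

```python
import pandas as pd

# Hypothetical miniature frame mimicking a few weatherAUS columns
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney"],
    "MinTemp": [13.4, 7.4, 18.0],
    "Rainfall": [0.6, 0.0, 2.4],
    "RainToday": ["No", "No", "Yes"],
})

# Numeric dtypes cover both float64 and int64 columns
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical  :", numerical_cols)
```

On the real data the same two calls could be applied to `rain`, though `Date` would then land in the categorical list until it is converted to a datetime type.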


## Cleaning the dataset

* Remove duplicates
* Handle missing values
* Correct data types: one-hot encode the `object` columns and convert `Date` to a datetime type

In [6]:
# Removing duplicates (drop_duplicates returns a new DataFrame, so reassign it)
rain = rain.drop_duplicates()
# After removing
print("Data dims : ", rain.shape)

Data dims :  (145460, 23)


No duplicates were found
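
`DataFrame.drop_duplicates` returns a new frame by default and leaves the caller untouched, so the result has to be assigned (or `inplace=True` passed) for the removal to take effect. A small sketch on made-up data (`toy` is a hypothetical frame, not part of the dataset) that also counts the duplicates explicitly:

```python
import pandas as pd

# Hypothetical frame in which the first two rows are identical
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-01", "2008-12-02"],
    "Location": ["Albury", "Albury", "Albury"],
    "Rainfall": [0.6, 0.6, 0.0],
})

# Count fully identical rows before removing anything
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates returns a new DataFrame; the original is left as is
deduped = toy.drop_duplicates()

print("Duplicate rows:", n_duplicates)
print("Before:", toy.shape, "After:", deduped.shape)
```

Checking `duplicated().sum()` is a more direct way to confirm that no duplicates exist than comparing shapes.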

The extent of missing data is severe, so the missing values are imputed rather than dropped in order to preserve as much data as possible. We use KNN/MICE imputation for the numerical columns and mode imputation for the categorical ones.
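
One way to quantify how much data is missing is the fraction of null entries per column. The sketch below uses a made-up frame (`toy` and its values are assumptions for illustration); on the real data the same expression could be applied to `rain`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numerical and a categorical column
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "Evaporation": [np.nan, np.nan, np.nan, 4.0],
    "WindGustDir": ["W", None, "WSW", "NE"],
})

# Fraction of missing entries per column, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```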

In [None]:
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Work on a copy so the original data stays unchanged while imputing
new_data = rain.copy()

# Add indicator columns recording which values were missing before imputation
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

num_cols = rain.select_dtypes(include=['float64', 'int64']).columns
cat_cols = rain.select_dtypes(include=['object']).columns

# Categorical columns: mode (most frequent value) imputation
mode_imputer = SimpleImputer(strategy='most_frequent')
new_data[cat_cols] = mode_imputer.fit_transform(new_data[cat_cols])

# Numerical columns: scale to [0, 1] so every feature weighs equally in the KNN distance
scaler = MinMaxScaler()
rain_scaled = scaler.fit_transform(new_data[num_cols])

# Initialize the KNN Imputer
imputer = KNNImputer(n_neighbors=5, weights="uniform")

# Impute missing values (this can be slow on the full dataset)
rain_imputed = imputer.fit_transform(rain_scaled)

# Convert back to the original units and keep the result as a DataFrame
new_data[num_cols] = scaler.inverse_transform(rain_imputed)
rain = new_data
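
The text mentions MICE as an alternative to KNN, but the cell above only uses `KNNImputer`. A hedged sketch of the MICE-style option, assuming scikit-learn's `IterativeImputer` is an acceptable stand-in (its documentation describes it as inspired by the R MICE package, and it returns a single imputation rather than multiple ones). The frame `toy` and the chosen parameters are illustrative assumptions:

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numerical frame with two gaps
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5, 14.6],
    "MaxTemp": [22.9, 25.1, np.nan, 28.0, 32.3, 29.7],
    "Temp3pm": [21.8, 24.3, 23.2, np.nan, 29.7, 28.9],
})

# Each column with gaps is modelled from the other columns, round-robin
mice_like = IterativeImputer(max_iter=10, random_state=0)
toy_imputed = pd.DataFrame(
    mice_like.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_imputed.round(1))
```

Observed values are left as they are; only the gaps are filled with model predictions.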

In [None]:
# Categorical data


The target variable `RainTomorrow` is one-hot encoded, as are `RainToday`, the wind direction variables (`WindGustDir`, `WindDir9am` and `WindDir3pm`), and `Location`.

In [None]:
# Encode RainTomorrow, RainToday
# Encode WindGustDir, WindDir9am, WindDir3pm
# Encode location
# Date time conversion
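
The steps listed in the cell above could be sketched as follows, assuming `pd.get_dummies` for the one-hot encoding and `pd.to_datetime` for the date conversion. The frame `toy` is a made-up stand-in for `rain`, and the choice of `dtype=int` for the indicator columns is an assumption, not something the notebook specifies:

```python
import pandas as pd

# Hypothetical frame with a few of the categorical weatherAUS columns
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02", "2008-12-03"],
    "Location": ["Albury", "Albury", "Sydney"],
    "WindGustDir": ["W", "WNW", "W"],
    "RainToday": ["No", "Yes", "No"],
})

# Parse the date strings into datetime64 values
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# One indicator column per category level, named <column>_<level>
encoded = pd.get_dummies(
    toy, columns=["Location", "WindGustDir", "RainToday"], dtype=int
)

print(encoded.dtypes)
```

For a two-level column such as `RainToday`, passing `drop_first=True` would keep a single indicator column instead of two complementary ones.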

## Exploratory Data Analysis
* Basic Statistics
* Visualization
Choose variables like 

In [None]:
# using describe, boxplot, 
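
A minimal sketch of the two tools named in the cell above, `describe` and a boxplot, on a made-up numerical frame (`toy` and its values are assumptions for illustration). The `Agg` backend is selected only so the sketch also runs without a display; inside a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical columns
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5],
    "MaxTemp": [22.9, 25.1, 25.7, 28.0, 32.3],
    "Rainfall": [0.6, 0.0, 0.0, 0.0, 1.0],
})

# Count, mean, standard deviation, min, quartiles and max per column
summary = toy.describe()
print(summary)

# One box per column; points beyond the whiskers are drawn individually
fig, ax = plt.subplots(figsize=(6, 4))
toy.boxplot(ax=ax)
ax.set_ylabel("Value")
```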

## Further Data Cleaning
The cleaning-EDA cycle continues until the dataset is deemed clean enough for modelling.
* Outliers
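
The notebook does not yet say how outliers will be detected. One common option, shown here purely as an assumption, is the 1.5 × IQR rule, which is also the default whisker length in matplotlib boxplots. The series `rainfall` is made up for illustration:

```python
import pandas as pd

# Hypothetical rainfall readings with one extreme value
rainfall = pd.Series([0.0, 0.2, 0.0, 1.0, 0.6, 0.4, 45.0], name="Rainfall")

# Interquartile range
q1 = rainfall.quantile(0.25)
q3 = rainfall.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (rainfall < lower) | (rainfall > upper)

print(rainfall[is_outlier])
```

Whether flagged values should be removed, capped or kept is a separate decision; for a heavily skewed variable like rainfall, large values may be genuine observations.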