## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from bs4 import BeautifulSoup
from IPython.display import display


## Read data from HTML file 

In [2]:
# read_html returns a list of tables; take the first one
weatherdf = pd.read_html('Datasets/weather_df.html', skiprows=1)[0]

In [3]:
weatherdf.head()

Unnamed: 0,0,1,2015-03-14,27.9,0.540,?
0,1,2,2015-03-15,28.1,0.563,1
1,2,3,2015-03-16,29.2,0.604,1
2,3,4,2015-03-17,31.4,0.637,1
3,4,5,2015-03-18,31.4,0.658,1
4,5,6,2015-03-19,31.4,0.662,1


## Getting dataframe ready

Again, the table does not load as a clean dataframe: the column labels are actually the first data record, and the first column is a leftover row index. Let's first get the required dataframe.

In [4]:
# Drop the leftover row-index column (the first column)
weatherdf = weatherdf.drop(columns=weatherdf.columns[0])

In [5]:
weatherdf

Unnamed: 0,1,2015-03-14,27.9,0.540,?
0,2,2015-03-15,28.1,0.563,1
1,3,2015-03-16,29.2,0.604,1
2,4,2015-03-17,31.4,0.637,1
3,5,2015-03-18,31.4,0.658,1
4,6,2015-03-19,31.4,0.662,1
...,...,...,...,...,...
777,779,2017-04-30,32.2,0.541,1
778,780,2017-05-01,30.5,0.484,1
779,781,2017-05-02,29.6,0.539,1
780,782,2017-05-03,28.4,0.568,1


In [6]:
weatherdf.columns = ['Index', 'Date', 'Temperature', 'Humidity', 'Wind level']

In [7]:
weatherdf

Unnamed: 0,Index,Date,Temperature,Humidity,Wind level
0,2,2015-03-15,28.1,0.563,1
1,3,2015-03-16,29.2,0.604,1
2,4,2015-03-17,31.4,0.637,1
3,5,2015-03-18,31.4,0.658,1
4,6,2015-03-19,31.4,0.662,1
...,...,...,...,...,...
777,779,2017-04-30,32.2,0.541,1
778,780,2017-05-01,30.5,0.484,1
779,781,2017-05-02,29.6,0.539,1
780,782,2017-05-03,28.4,0.568,1
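
Note that the renamed frame starts at 2015-03-15: the 2015-03-14 record was read as the header and is overwritten by the new column names. If that record matters, one possible way to keep it is to turn the old labels back into a row before renaming. The snippet below is only a sketch on a toy frame (`raw`, `first_row`, `body`, and `recovered` are illustrative names, and the two data rows are copied from the output above); it is not what the notebook itself does.

```python
import pandas as pd

# Toy frame mimicking the situation above: the column labels are really
# the first data record (all labels arrive as strings).
raw = pd.DataFrame(
    [[2, "2015-03-15", 28.1, 0.563, "1"],
     [3, "2015-03-16", 29.2, 0.604, "1"]],
    columns=["1", "2015-03-14", "27.9", "0.540", "?"],
)

names = ["Index", "Date", "Temperature", "Humidity", "Wind level"]

# Turn the header labels back into a one-row frame, then rename and stack.
first_row = pd.DataFrame([list(raw.columns)], columns=names)
body = raw.set_axis(names, axis=1)
recovered = pd.concat([first_row, body], ignore_index=True)

# The recovered values are strings, so cast the numeric columns.
recovered["Index"] = recovered["Index"].astype(int)
recovered[["Temperature", "Humidity"]] = (
    recovered[["Temperature", "Humidity"]].astype(float)
)
print(recovered)
```

The recovered row keeps `"?"` in `"Wind level"`, so it would go through the same imputation as every other missing value.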


## Data Visualization

In [8]:
visualizable_feature_names_weather = weatherdf.columns[2:]
num_visualizable_features_weather = len(visualizable_feature_names_weather)
fig_hist_weather = []
for i, feature_name in enumerate(visualizable_feature_names_weather):
    fig_hist_weather.append(go.Figure(go.Histogram(x=weatherdf[feature_name])))
    fig_hist_weather[i].update_layout(height=400, width=800, title_text=feature_name)
    fig_hist_weather[i].show()

Observations:
- **Index** is just the row index of the data; it is uninformative and provides no discriminative power.
- **Wind level** looks like a numerical feature but is actually an ordinal feature, so a unary encoding might be the best bet.
- **Wind level** has missing values.
- The features **Temperature** and **Humidity** come in different ranges, so it's a good idea to normalize them.
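
The missing values in **Wind level** are written as `"?"`, so they are easy to overlook when checking for `NaN`. A minimal sketch of how one might count them, using a made-up toy column rather than the real data:

```python
import pandas as pd

# Hypothetical toy column; "?" marks a missing reading, as in the data above.
wind = pd.Series(["1", "?", "2", "0", "?"], name="Wind level")

# The "?" placeholder is an ordinary string, so isna() reports nothing
# until the placeholder is converted.
n_placeholder = (wind == "?").sum()
n_nan_before = wind.isna().sum()

# errors="coerce" turns anything non-numeric into NaN.
wind_numeric = pd.to_numeric(wind, errors="coerce")
n_nan_after = wind_numeric.isna().sum()

print(n_placeholder, n_nan_before, n_nan_after)
```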

### Pairwise Scatter Plot

In [9]:
fig_scatmat_weather = go.Figure(data=go.Splom(
                        dimensions=[dict(label=feature, values=weatherdf[feature]) \
                                    for feature in visualizable_feature_names_weather],
                        marker=dict(showscale=False, line_color='white', line_width=0.5)))

fig_scatmat_weather.update_layout(title='Pairwise feature scatter plots', \
                                  width=400 * num_visualizable_features_weather, \
                                  height=400 * num_visualizable_features_weather)

fig_scatmat_weather.show()

Everything seems fine. One interesting observation is that `"Temperature"` and `"Humidity"` seem to have a high correlation with each other.
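
A visual impression of correlation can be backed by a number. The sketch below computes a Pearson correlation coefficient on synthetic stand-in data (the values are generated, not taken from the weather file); the same `.corr()` call could be applied to the `"Temperature"` and `"Humidity"` columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the weather file): humidity is built to
# rise with temperature, plus a little noise.
rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=200)
humidity = 0.2 + 0.01 * temperature + rng.normal(0, 0.02, size=200)
toy = pd.DataFrame({"Temperature": temperature, "Humidity": humidity})

# Pearson correlation matrix; the off-diagonal entry quantifies the
# linear relationship that a scatter plot only suggests.
corr = toy.corr(method="pearson")
r = corr.loc["Temperature", "Humidity"]
print(round(r, 3))
```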

## Data Preprocessing

### 1. Dropping the uninformative feature

In [10]:
weatherdf = weatherdf.drop(columns = "Index")

### 2. Handling an ordinal feature with MCAR missing values

On to addressing the missing values in `"Wind level"`. First, from what the expert told us, these values are almost surely missing completely at random (MCAR), so there is no need to add a new feature that flags whether `"Wind level"` was missing in the original data; such a flag would almost surely be just noise. Second, this is an ordinal feature, so imputing values from neighbourhoods (found using proximity in the other features, specifically `"Temperature"` and `"Humidity"`) makes more sense than a strategy such as replacing them with the majority value.

In [11]:
is_wind_level_missing = (weatherdf["Wind level"] == "?")

In [12]:
display(np.where(is_wind_level_missing)[0])

array([  7,  99, 100, 103, 104, 166, 224, 229, 236, 266, 278, 345, 347,
       356, 379, 393, 394, 439, 480, 495, 504, 519, 530, 546, 575, 583,
       597, 612, 615, 621, 623, 631, 642, 657, 669, 719, 755, 766],
      dtype=int64)

In [13]:
from sklearn.neighbors import KNeighborsClassifier
knn_imputor = KNeighborsClassifier(n_neighbors=3)
X_train_knn = weatherdf[np.logical_not(is_wind_level_missing)][["Temperature" , "Humidity"]].astype("float")
y_train_knn = weatherdf[np.logical_not(is_wind_level_missing)]["Wind level"].astype("float")
knn_imputor.fit(X_train_knn, y_train_knn)

KNeighborsClassifier(n_neighbors=3)

In [14]:
X_production_knn = weatherdf[is_wind_level_missing][["Temperature", "Humidity"]].astype("float")
y_production_knn = knn_imputor.predict(X_production_knn)

In [15]:
display(y_production_knn)

array([0., 1., 2., 2., 2., 2., 2., 2., 2., 2., 2., 1., 1., 2., 1., 1., 2.,
       0., 1., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 1., 1., 2., 2., 2.,
       2., 0., 2., 1.])

Replace the portion of `weatherdf` at the intersection of `is_wind_level_missing` and `"Wind level"` (remember you have to use `.loc`) with the imputed labels, `y_production_knn`:

In [16]:
weatherdf.loc[is_wind_level_missing, "Wind level"] = y_production_knn
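
One thing worth checking with neighbourhood-based imputation is feature scale: Euclidean distances are dominated by whichever feature has the larger numeric range, and `"Temperature"` spans far more units than `"Humidity"`. The following is a sketch on made-up readings, not a change to the imputation above; it shows how standardizing inside a pipeline can alter which neighbours are found. All array values and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up readings: in this toy set the wind level depends on humidity only.
X_known = np.array([[20.0, 0.3], [23.0, 0.3], [32.0, 0.3],
                    [25.5, 0.8], [26.0, 0.8], [26.5, 0.8]])
y_known = np.array([0, 0, 0, 2, 2, 2])
X_missing = np.array([[26.0, 0.3]])  # low humidity, so level 0 is plausible

# Without scaling, temperature differences swamp humidity differences.
plain_knn = KNeighborsClassifier(n_neighbors=3).fit(X_known, y_known)
unscaled_pred = plain_knn.predict(X_missing)

# Standardizing first lets both features contribute comparably.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_known, y_known)
scaled_pred = scaled_knn.predict(X_missing)

print(unscaled_pred, scaled_pred)
```

Whether scaling helps on the real data depends on how wind level actually relates to the two features, so the imputed labels are best compared both ways.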

### 3. Converting an ordinal feature into unary encoding

Ordinal features are categorical features whose categories have a defined order: any category is greater than some categories, equal to itself, and smaller than the rest. Plain integers are not a suitable encoding, because they fix the distance between $0$ and $1$ to be the same as the distance between $1$ and $2$. That **is** important to numerical algorithms, since (at least in linear models) a single fixed weight multiplies the number. One-hot encoding is not ideal either, since it is too loose: each category gets its own independent weight, which may break the ordering of the categories.

So we resort to an encoding where the weights assigned by numerical algorithms work cumulatively: unary encoding. For integers between $0$ and $n$, each code has length $n$. The code of $0$ is $n$ zeros, the code of $1$ is a single one followed by $n-1$ zeros, and in general the code of $k$ ($0 \leq k \leq n$) is $k$ ones followed by $n-k$ zeros, which is the layout the code below produces. Each step from one category to the next therefore adds one more weight, so the distances between increasing categories are cumulative sums of different weights.
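
As a small worked example, here is a vectorized sketch of unary encoding. The helper name `unary_encode` is made up for illustration. It puts the ones first, so column $j$ answers "is the level greater than $j$?"; the mirrored layout (zeros first) carries the same information.

```python
import numpy as np

def unary_encode(levels, n_levels):
    """Encode integers in [0, n_levels] as rows of k ones followed by zeros."""
    levels = np.asarray(levels)
    # Column j is 1 exactly when level > j, so level k lights up columns 0..k-1.
    thresholds = np.arange(n_levels)
    return (levels[:, None] > thresholds[None, :]).astype(int)

encoded = unary_encode([0, 1, 2, 3], n_levels=3)
print(encoded)
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

With weights $w_0, \dots, w_{n-1}$ on these columns, a linear model assigns level $k$ the contribution $w_0 + \dots + w_{k-1}$, so each step up the scale can have its own size while the ordering is kept whenever the weights share a sign.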

In [17]:
num_days = weatherdf.shape[0]
wind_level_int = weatherdf["Wind level"].astype("int")
# np.unique already returns its values sorted
wind_level_uniques = np.unique(wind_level_int)
# The highest level sets the code length: levels 0..n need n columns
max_wind_levels = wind_level_uniques.max()
wind_level_encoded = np.zeros((num_days, max_wind_levels), dtype="int")
weather_df_2 = weatherdf.copy()

# Level k becomes k ones followed by zeros
for i, day_wind_level in enumerate(wind_level_int):
    wind_level_encoded[i, :day_wind_level] = 1
# Column j answers "is the wind level greater than j?"
for level in range(max_wind_levels):
    weather_df_2["Wind level > " + str(level)] = wind_level_encoded[:, level]
weather_df_3 = weather_df_2.drop(columns="Wind level")

In [18]:
display(weather_df_3)

Unnamed: 0,Date,Temperature,Humidity,Wind level > 0,Wind level > 1,Wind level > 2
0,2015-03-15,28.1,0.563,1,0,0
1,2015-03-16,29.2,0.604,1,0,0
2,2015-03-17,31.4,0.637,1,0,0
3,2015-03-18,31.4,0.658,1,0,0
4,2015-03-19,31.4,0.662,1,0,0
...,...,...,...,...,...,...
777,2017-04-30,32.2,0.541,1,0,0
778,2017-05-01,30.5,0.484,1,0,0
779,2017-05-02,29.6,0.539,1,0,0
780,2017-05-03,28.4,0.568,1,0,0

### 4. Standardizing the numerical features

As noted in the observations, `"Temperature"` and `"Humidity"` come in different ranges, so we standardize both to zero mean and unit variance.


In [19]:
from sklearn.preprocessing import StandardScaler
weather_numerical = weather_df_3[["Temperature", "Humidity"]]
weather_scaler = StandardScaler()
weather_numerical_standardized = weather_scaler.fit_transform(weather_numerical)
weather_df_3[["Temperature", "Humidity"]] = weather_numerical_standardized
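
For reference, standardization replaces each value $x$ with $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the column mean and standard deviation. `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` in pandas. A minimal sketch of the same arithmetic on made-up values:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"Temperature": [10.0, 20.0, 30.0],
                    "Humidity": [0.2, 0.5, 0.8]})

# z = (x - mean) / std, using the population standard deviation (ddof=0).
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(3))
```

After this step every column has mean 0 and population standard deviation 1, which puts the two features on a comparable footing.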

In [20]:
weather_df_3

Unnamed: 0,Date,Temperature,Humidity,Wind level > 0,Wind level > 1,Wind level > 2
0,2015-03-15,1.242275,1.283167,1,0,0
1,2015-03-16,1.306493,1.448742,1,0,0
2,2015-03-17,1.434929,1.582010,1,0,0
3,2015-03-18,1.434929,1.666817,1,0,0
4,2015-03-19,1.434929,1.682971,1,0,0
...,...,...,...,...,...,...
777,2017-04-30,1.481633,1.194321,1,0,0
778,2017-05-01,1.382387,0.964131,1,0,0
779,2017-05-02,1.329845,1.186244,1,0,0
780,2017-05-03,1.259789,1.303359,1,0,0


In [21]:
%store weather_df_3

Stored 'weather_df_3' (DataFrame)
