# Table of Contents
* ##  General Exploratory Data Analysis

* ##  Preprocessing of the Tasks
    - Sliding days as 1D Temperature Arrays to predict next day's Temperature (SDT)
    - Sliding windows with multiple columns to predict next day's Temperature [Done] (SWT)
    - Sliding windows with multiple columns to predict next day's Daily Summary (SWS) [Done]


* ##  Machine Learning Models
    - Random Forest Regression [Done] (RF)
    - Ridge Regression (LR)
    - XGBoost (XG) [Done-Bad]


* ##  Evaluation and understanding predictions with XAI tools
    - Lime
    - Lime for Time
    - SHAP [Done]

## General Exploratory Data Analysis

*Starting with importing the required libraries*

In [None]:
import numpy as np
import pandas as pd
import os
import shap
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
import lime
import lime.lime_tabular
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer
shap.initjs()

*Then continue by Exploring the dataset*

In [None]:
data = pd.read_csv("/kaggle/input/szeged-weather/weatherHistory.csv")
data.describe()
# Note: The column named "Loud Cover" carries no information because it is always 0, so I will drop it during preprocessing.

In [None]:
data.info()
# Note: The data does not require any imputing or interpolation as it has no null rows at all.
# Note: Most of the useful columns are numeric, so there is no need to overthink encoding; some tasks won't require any.

In [None]:
data.head(3)
# Note: The dataset is "Hourly". This is good in terms of detail, but I would rather work with something less complex, so I will change it to "Daily" in the next steps.

## Preprocessing of the Tasks

We need to simplify the summaries, as they contain too many details we don't need. I will do it with a custom function.

In [None]:
def simplify_summaries(base_summary):
    base_split = base_summary.split(" ")
    removals_list = ["Light","Dangerously","Partly","Mostly","and"]
    to_be_replaced_list = ["Breezy","Drizzle","Overcast"]
    replacement_list = ["Windy","Rain","Cloudy"]
    for removal in removals_list: 
        if removal in base_split:
            base_split.remove(removal)
            
    for i in range(len(to_be_replaced_list)):
        if to_be_replaced_list[i] in base_split:
            base_split.remove(to_be_replaced_list[i])
            base_split.append(replacement_list[i])
        
    base_split.sort()
    return " ".join(base_split)

In [None]:
data.Summary = data.Summary.apply(simplify_summaries)
data.head(3)
# much better now as we reduced complexity of it dramatically.
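
To make the effect of the function above concrete, here is a small self-contained check on a few example strings. The function body is copied from the cell above; the sample summaries are only illustrative and are assumed to resemble values found in the `Summary` column.

```python
def simplify_summaries(base_summary):
    # Same logic as the notebook cell: drop qualifiers, map synonyms, sort words.
    base_split = base_summary.split(" ")
    removals_list = ["Light", "Dangerously", "Partly", "Mostly", "and"]
    to_be_replaced_list = ["Breezy", "Drizzle", "Overcast"]
    replacement_list = ["Windy", "Rain", "Cloudy"]
    for removal in removals_list:
        if removal in base_split:
            base_split.remove(removal)
    for old, new in zip(to_be_replaced_list, replacement_list):
        if old in base_split:
            base_split.remove(old)
            base_split.append(new)
    base_split.sort()
    return " ".join(base_split)

# Illustrative inputs (assumed, not read from the dataset)
examples = ["Partly Cloudy", "Breezy and Mostly Cloudy", "Light Rain", "Breezy and Overcast", "Foggy"]
for text in examples:
    print(f"{text!r:30} -> {simplify_summaries(text)!r}")
# 'Partly Cloudy'            -> 'Cloudy'
# 'Breezy and Mostly Cloudy' -> 'Cloudy Windy'
# 'Light Rain'               -> 'Rain'
# 'Breezy and Overcast'      -> 'Cloudy Windy'
# 'Foggy'                    -> 'Foggy'
```

Sorting the remaining words is what makes "Breezy and Mostly Cloudy" and "Breezy and Overcast" collapse into the same class.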

In [None]:
# Dropping the column named "Loud Cover" on general dataset "data"
data.drop(columns=["Loud Cover"], inplace=True)

In [None]:
# Changing the original "Hourly" dataset to new and simpler "Daily"
# Fixing the Formatted Date for pandas usage.
data['Formatted Date'] = pd.to_datetime(data['Formatted Date'], utc=True)
data.sort_values(by=['Formatted Date'], inplace=True, ascending=True)

## Sliding windows with multiple columns to predict next day's Temperature (SWT)

In [None]:
data.head(4)

In [None]:
# Grouping by days to achieve "Daily" dataset on what's left as numerical columns for "Sliding Windows to predict Temp" task. 
swt_data = data.groupby([data['Formatted Date'].dt.date]).mean(numeric_only=True)  # numeric_only skips text/date columns (needed on pandas >= 2.0)
swt_data["Summary"] = data["Summary"].groupby([data['Formatted Date'].dt.date]).agg(lambda x:x.value_counts().index[0])
le = LabelEncoder()
swt_data.Summary = le.fit_transform(swt_data.Summary)

In [None]:
# Results are sorted and daily.
swt_data.head() 
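
As a rough illustration of the hourly-to-daily step, the sketch below applies the same idea (mean of the numeric columns, most frequent value of `Summary`) to a tiny made-up frame. The five rows are invented for the example and only the column names follow the dataset.

```python
import pandas as pd

# Toy "hourly" data (invented values)
hourly = pd.DataFrame({
    "Formatted Date": pd.to_datetime(
        ["2006-04-01 00:00", "2006-04-01 01:00", "2006-04-01 02:00",
         "2006-04-02 00:00", "2006-04-02 01:00"], utc=True),
    "Temperature (C)": [9.0, 10.0, 11.0, 4.0, 6.0],
    "Summary": ["Cloudy", "Cloudy", "Clear", "Rain", "Rain"],
})

day = hourly["Formatted Date"].dt.date

# Numeric columns: daily mean
daily = hourly.groupby(day).mean(numeric_only=True)

# Text column: most frequent summary of each day
daily["Summary"] = hourly["Summary"].groupby(day).agg(lambda s: s.value_counts().index[0])

print(daily)
```

The first day averages to 10.0 with "Cloudy" as its most frequent summary, and the second day averages to 5.0 with "Rain".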

In [None]:
# Checking the results; the aggregation clearly worked.
swt_data.describe()

In [None]:
# Plotting approx. 2 years to have an idea about what we are working with.
plt.figure(figsize=(24,8))
plt.plot(swt_data["Temperature (C)"][:740])
plt.grid()
plt.show()

We see that there are a lot of spikes everywhere on the plot. This would increase complexity. <br>
So, I decided to apply rolling mean to reduce spikes.

In [None]:
ROLLING_MEAN_PARAMETER = 3
numeric_cols = ["Temperature (C)", "Apparent Temperature (C)", "Humidity", "Wind Speed (km/h)",
                "Wind Bearing (degrees)", "Visibility (km)", "Pressure (millibars)"]
swt_data[numeric_cols] = np.round(swt_data[numeric_cols].rolling(ROLLING_MEAN_PARAMETER).mean(), 3)
swt_data.dropna(inplace=True) # dropping the null days that are created by rolling mean
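
A minimal sketch of what the rolling mean does, using a short invented series: with a window of 3, each value becomes the average of itself and the two values before it, so the first two positions have no complete window and turn into NaN (which is why `dropna` is needed afterwards).

```python
import pandas as pd

spiky = pd.Series([10.0, 20.0, 0.0, 10.0, 40.0])  # invented values with spikes

smoothed = spiky.rolling(3).mean().round(3)
print(smoothed.tolist())           # the first two entries are NaN

cleaned = smoothed.dropna()
print(cleaned.tolist())            # [10.0, 10.0, 16.667]
```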

In [None]:
# Plotting approx. 2 years to have an idea about what we are working with after rolling mean
plt.figure(figsize=(24,8))
plt.plot(swt_data["Temperature (C)"][:740])
plt.grid()
plt.show()

As we see above, the spikes are reduced dramatically, which should make the series much easier for a statistical model to learn.

In [None]:
# Now I will design the dataset into more trainable sliding windows format.
N_DAYS_BEFORE = 5
swt_train = pd.DataFrame()

for day in range(N_DAYS_BEFORE-1,len(swt_data)):
    for i in reversed(range(1,N_DAYS_BEFORE)):
        for j in swt_data.columns:
            col_name = str(j) + " - " + str(i)
            swt_train.loc[day, col_name] = swt_data[j].iloc[day - i]  # positional lookup, because the index holds dates

In [None]:
# Each row consists of the previous 4 days (N_DAYS_BEFORE - 1) with all of their columns.
swt_train.head()
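
The cell-by-cell loop above can be slow on a few thousand days. A possible vectorised alternative, sketched here under the assumption that the same column naming (`"<column> - <lag>"`) is wanted, builds each lag with `DataFrame.shift` and concatenates them. The helper name `make_windows` and the toy frame are hypothetical; unlike the loop, this version keeps the original index instead of integer positions.

```python
import pandas as pd

def make_windows(daily, n_days_before=5):
    """One row per day, holding the previous n_days_before - 1 days of every column."""
    lagged = []
    for lag in reversed(range(1, n_days_before)):
        shifted = daily.shift(lag)                      # values from `lag` days earlier
        shifted.columns = [f"{col} - {lag}" for col in daily.columns]
        lagged.append(shifted)
    # The first n_days_before - 1 rows have no full history, so drop them.
    return pd.concat(lagged, axis=1).iloc[n_days_before - 1:]

# Toy daily data (invented values)
toy = pd.DataFrame({
    "Temperature (C)": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Humidity": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

features = make_windows(toy, n_days_before=3)
labels = toy["Temperature (C)"].iloc[3 - 1:].values     # same offset as the features

print(features)
print(labels)
```

With `n_days_before=3`, the first feature row holds days 0 and 1 and its label is the temperature of day 2, which is the same alignment the notebook checks a few cells later.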

In [None]:
# The number of rows must match the number of labels.
print(swt_train.shape)

In [None]:
# Preparing the labels for the SWT task
# Ignoring the first 4 days to match the training data, and taking only the values so we won't have issues with the date index later on.
swt_labels = swt_data["Temperature (C)"][N_DAYS_BEFORE-1:].values
# The number of labels must match the number of training rows.
print(swt_labels.shape)

In [None]:
# "Temperature (C) - 1" in feature row 23 should equal label 22 (yesterday's temperature for one row is the target of the previous row)
print(" -- Features -- \n",swt_train.iloc[23])
print("\n -- Label -- \n", swt_labels[22])

In [None]:
# Splitting train and test to be able to evaluate properly with some train test ratio.
swt_train_x, swt_test_x, swt_train_y, swt_test_y = train_test_split(swt_train,swt_labels, test_size=0.1)

In [None]:
# Checking the shapes for safety
print("shape of training dataset features: ",swt_train_x.shape)
print("shape of training dataset labels: ",swt_train_y.shape)
print("shape of testing dataset features: ",swt_test_x.shape)
print("shape of testing dataset labels: ",swt_test_y.shape)

## Sliding windows with multiple columns to predict next day's Daily Summary (SWS)

In [None]:
# Preparing the labels for the SWS task
# Ignoring the first 4 days to match the training data, and taking only the values so we won't have issues with the date index later on.
sws_labels = swt_data["Summary"][N_DAYS_BEFORE-1:].values
# The number of labels must match the number of training rows.
print(sws_labels.shape)

In [None]:
# splitting (75/25) as usual
sws_train_x, sws_test_x, sws_train_y, sws_test_y = train_test_split(swt_train, sws_labels, random_state=41, test_size=0.25)

In [None]:
sws_train_x

In [None]:
# Checking the shapes for safety
print("shape of training dataset features: ",sws_train_x.shape)
print("shape of training dataset labels: ",sws_train_y.shape)
print("shape of testing dataset features: ",sws_test_x.shape)
print("shape of testing dataset labels: ",sws_test_y.shape)

## Sliding days as 1D Temperature Arrays to predict next day's Temperature (SDT)

In [None]:
# For this approach I will use only one column: "Temperature (C)"
all_temps = swt_data["Temperature (C)"].values
train_temps = []
label_temps = []
for i in range(len(all_temps)-30):
    label_temps.append(all_temps[i+30])
    train_temps.append(all_temps[i:i+30])
    
train_temps = np.array(train_temps)
label_temps = np.array(label_temps)

In [None]:
# The last value of window 45 should equal the label of window 44 (consecutive windows are shifted by one day)
print(train_temps[45])
print(label_temps[44]) 
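
The same 1D windows can also be produced without an explicit loop. The sketch below uses `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) on an invented array with a window of 3 instead of 30, so the result is easy to read.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

temps = np.arange(10, dtype=float)   # invented "temperatures": 0.0, 1.0, ..., 9.0
window = 3                           # the notebook uses 30

# Every run of `window` consecutive values; drop the last one because it has no "next day".
train = sliding_window_view(temps, window)[:-1]
# The value that follows each window.
label = temps[window:]

print(train.shape, label.shape)      # (7, 3) (7,)
print(train[0], label[0])            # [0. 1. 2.] 3.0
```

As in the check above, the last element of one window equals the label of the window before it.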

In [None]:
# Splitting the train and test 
sdt_train_x = train_temps[:-400]
sdt_test_x = train_temps[-400:]
sdt_train_y = label_temps[:-400]
sdt_test_y = label_temps[-400:]

In [None]:
# Checking the shapes for safety
print("shape of training dataset features: ",sdt_train_x.shape)
print("shape of training dataset labels: ",sdt_train_y.shape)
print("shape of testing dataset features: ",sdt_test_x.shape)
print("shape of testing dataset labels: ",sdt_test_y.shape)

## Machine Learning Models

* ### Random Forest Regressor

In [None]:
rf_model = RandomForestRegressor(max_depth=10)
rf_model.fit(swt_train_x,swt_train_y)

* ### XGBoost

In [None]:
my_imputer = SimpleImputer()
sws_train_x_imp = my_imputer.fit_transform(sws_train_x)
sws_test_x_imp = my_imputer.transform(sws_test_x)

my_model = xgb.XGBClassifier(n_estimators=1000, 
                            max_depth=4, 
                            eta=0.05, 
                            base_score=sws_train_y.mean())
hist = my_model.fit(sws_train_x_imp, sws_train_y, 
                    early_stopping_rounds=5, 
                    eval_set=[(sws_test_x_imp, sws_test_y)], eval_metric='mlogloss', 
                    verbose=10)

* ### Ridge Regression

In [None]:
lr_model = Ridge()
lr_model.fit(sdt_train_x,sdt_train_y)

## Evaluation and understanding predictions with XAI tools

* ### RF - SWT - SHAP

In this part of the notebook, the first explainable AI tool and its use case will be demonstrated.

The tool is **SHAP**. <br />
To apply it to our time series task, I will use my **RandomForestRegressor** model, which was trained on the sliding-window dataframe with the Temperature label. <br />
First, we check whether the model is worth explaining or whether it needs more development. <br />
For this, I will use the R² score, where a value closer to 1 means better performance. <br />
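
For reference, the score is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The short sketch below computes it by hand with NumPy on four invented values; `r2_score` from scikit-learn is expected to return the same number for the same inputs.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # invented ground truth
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # invented predictions

ss_res = np.sum((y_true - y_pred) ** 2)           # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # error of always predicting the mean
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 4))   # 0.9486
```

A score of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values mean it is worse than that.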


In [None]:
swt_pred_y = rf_model.predict(swt_test_x)
print("r_square score of the RandomForestRegressor model : ",r2_score(swt_test_y,swt_pred_y))

As we can see above, the model performs really well on the test data. <br /> 
So we shall continue with explaining.  <br />  <br /> <br /> 
To begin the explanation of this model and task, I want to use a simple bar chart of feature importance. <br />
Basically, this sorts the features by decreasing importance for our trained model and plots them.

### SHAP Feature Importance

In [None]:
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(swt_train_x)

shap.summary_plot(shap_values, swt_train_x, plot_type="bar");

The chart above shows the average impact of each column on the predictions (the mean absolute SHAP value). <br />
Looking at it, the biggest effect on the prediction (today's temperature) comes from yesterday's Temperature. <br />
That is followed by: Apparent Temperature of yesterday, Temperature of the day before yesterday, Apparent Temperature of the day before yesterday.
<br /> <br /> <br />
Now that we know which columns matter most for our model's predictions, we can look at how they actually affect the outputs. <br />
For this, I will use a typical summary plot, which combines feature importance with feature effects in a very visible way.
### SHAP Summary Plot

In [None]:
shap.summary_plot(shap_values, swt_train_x)

As we can see above, each point on the summary plot is a Shapley value for a feature and an instance. <br />
The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. <br />
The color represents the value of the feature from low to high. <br />
Just like the previous plot, here features are ordered according to their importance. <br /> <br />
### SHAP Force Plot

In [None]:
a = shap.force_plot(explainer.expected_value, shap_values[100,:], swt_train_x.iloc[100,:])
display(a)

b = shap.force_plot(explainer.expected_value, shap_values[80,:], swt_train_x.iloc[80,:])
display(b)

c = shap.force_plot(explainer.expected_value, shap_values[70,:], swt_train_x.iloc[70,:])
display(c)

d = shap.force_plot(explainer.expected_value, shap_values[90,:], swt_train_x.iloc[90,:])
display(d)

As we can see above, I displayed force plots of the predictions for 4 different days. <br />
The most important things to pay attention to here are:
* Model Output Value
* Base Value
* Forces that affect the Output value 
<br /> <br />
To begin with, the base value is the **average** output value over the entire dataset. As we can see, the forces (the effects of the values of some columns) move the Model Output Value away from the base value. <br />
Now, we can take a look at the prediction instances above. <br /> 
In the last prediction, the columns "Apparent Temperature (C) - 1" and "Temperature (C) - 1" pushed the output above the base value, whereas the same columns for the day before pushed it lower. In the end, the predicted value became 15.83. <br /> <br /> <br /> 
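
The relation between the base value, the forces, and the output can be checked numerically. The sketch below does not use the random forest or `TreeExplainer`; it assumes a hypothetical linear model with invented weights, because for a linear model with independent features the SHAP value of feature i has the closed form w_i * (x_i - mean(x_i)). The point is only to show that base value plus the sum of the forces gives the model output for that row.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # invented background data
w = np.array([2.0, -1.0, 0.5])           # hypothetical model weights
b = 10.0

def predict(data):
    return data @ w + b

base_value = predict(X).mean()           # average model output over the data
x = X[0]                                 # the row to explain
forces = w * (x - X.mean(axis=0))        # closed-form SHAP values for a linear model

print("base value  :", round(base_value, 3))
print("forces      :", np.round(forces, 3))
print("base + sum  :", round(base_value + forces.sum(), 3))
print("model output:", round(predict(x[None, :])[0], 3))
```

Positive forces push the prediction above the base value and negative ones pull it below, which is what the red and blue segments of a force plot represent.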

Up next is an interactive cluster of all predictions. The best part is that you can change the x-axis selection to get different force plots for different purposes. <br /> 

### Clustered SHAP Values

In [None]:
shap.force_plot(explainer.expected_value, shap_values, swt_train_x)

Simply put, what we see above is a cumulative force plot of all predictions. (This is also the reason for the slowness.) <br />
Go ahead and change the value in the dropdown on the left-hand side if you want to experience the interactivity of the graph.

In [None]:
print("prediction : ",rf_model.predict(swt_test_x.iloc[77].values.reshape(1,-1)))  # -1 infers the feature count (32 here)
print("ground truth : ",swt_test_y[77])
# very accurate prediction.

* ### XG - SWS - LIME

In this part of the notebook, the second explainable AI tool and its use case will be demonstrated.

The tool is **LIME**. The name stands for "Local Interpretable Model-agnostic Explanations". <br />
To apply it to our time series task, I will use my **XGBoost** model, which was trained on the sliding-window dataframe with the simplified Summary classes. <br />
First, we check whether the model is worth explaining or whether it needs more development. <br />
For this, I will use the accuracy score, because I framed this task as a classification problem. <br />


In [None]:
y_pred = my_model.predict(sws_test_x_imp)
accuracy_score(sws_test_y, y_pred)  # scikit-learn expects (y_true, y_pred)

As we can see above, the model performs well enough to be considered accurate. (Almost 90%.) <br /> <br />
Next, we need to define a lambda function called " *predict_fn* ", which will give us the prediction probabilities. <br />
Then, we will set up our explainer. For this, I am using LimeTabularExplainer, which explains predictions on tabular (i.e. matrix) data. <br /> 
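
To give an intuition for what the explainer does with `predict_fn`, here is a heavily simplified sketch of the local-surrogate idea in plain NumPy: perturb the instance, weight the perturbed samples by their distance to it, and fit a weighted linear model to the black-box outputs. The `black_box` function, the instance, and the kernel width are all invented for the illustration; the real LimeTabularExplainer also samples from the training data statistics and, by default, discretizes continuous features, so its numbers will differ.

```python
import numpy as np

def black_box(X):
    """Hypothetical non-linear classifier: returns P(class = 1) for each row."""
    logit = 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
instance = np.array([0.5, 1.0])                     # the row we want to explain

# 1) Perturb the instance to sample its neighbourhood.
neighbours = instance + rng.normal(scale=0.3, size=(500, 2))

# 2) Weight each sample by its proximity to the instance (RBF kernel, width 0.5).
distances = np.linalg.norm(neighbours - instance, axis=1)
weights = np.exp(-(distances ** 2) / (2 * 0.5 ** 2))

# 3) Fit a weighted linear model (weighted least squares) to the black-box outputs.
design = np.hstack([np.ones((len(neighbours), 1)), neighbours - instance])
root_w = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(design * root_w[:, None],
                           black_box(neighbours) * root_w, rcond=None)

local_effects = coef[1:]
print("local effect of feature 0:", round(local_effects[0], 3))
print("local effect of feature 1:", round(local_effects[1], 3))
```

Around this instance the surrogate assigns a positive effect to feature 0 and a negative one to feature 1, which is the kind of signed, per-feature bar that LIME draws for each class.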

In [None]:
predict_fn = lambda x: my_model.predict_proba(x).astype(float)
explainer = lime.lime_tabular.LimeTabularExplainer(sws_test_x_imp, feature_names=sws_test_x.columns, class_names=range(0,14), verbose=True, mode='classification')

Now, we need to choose some instances to explain later on. For variety, I will pick 3 instances with different predictions.

In [None]:
print(le.inverse_transform(my_model.predict(sws_test_x_imp)[60].ravel()))
print(le.inverse_transform(my_model.predict(sws_test_x_imp)[0].ravel()))
print(le.inverse_transform(my_model.predict(sws_test_x_imp)[124].ravel()))
# the indexes will be used later on.

To make things more clear, I will store the instances above with better variable names.

In [None]:
foggy_instance = sws_test_x.iloc[124].values
cloudy_instance = sws_test_x.iloc[0].values
clear_instance = sws_test_x.iloc[60].values

Now I will initialize the explainers for each instance we declared above. <br />
The important thing here is to provide how many "possible labels" and how many "features" you want to present. <br />
For the purposes of this notebook, I limited those to smaller numbers.

In [None]:
exp1 = explainer.explain_instance(foggy_instance, predict_fn, num_features=5, labels=range(0,6))
exp2 = explainer.explain_instance(cloudy_instance, predict_fn, num_features=5, labels=range(0,6))
exp3 = explainer.explain_instance(clear_instance, predict_fn, num_features=5, labels=range(0,6))

Now, we can start with the presentation of the explanations with LIME. <br /> <br />

### Showing In Notebook
This function is designed especially for IPython notebooks such as this one :) <br />
What it shows is basically:
* Prediction probabilities of the given instance for the related classes
* Class-by-class, opposite-sided horizontal bar charts for each feature (sorted by their effect on the prediction) 
* An impractical table that shows the values of the related features (color coded by their effect)

In [None]:
exp1.show_in_notebook()

The figures above may seem complex at first sight. I reckon the best way to read them is to start from the top left and follow these steps:
1. Check the class at the top of the progress-bar-styled graph named "Prediction probabilities". (it is 3 with 69% + yellow)
2. Find the bar chart for it ("NOT 3" and "3") to see the most influential features. (it is "yesterday's Visibility" which was lower than ?!?) <br />
    2.1 As you can tell, we can't see it properly :/ (Don't worry, we will see it soon :D)
3. You can check the rest of the figures to gain some information about the other classes. 

### As Pyplot Figure
Up next is the alternative for those who are happy with less detail. <br />
In my opinion, the label parameter is the most important part of this function, because it selects which label the graph is drawn for. 

In [None]:
exp1.as_pyplot_figure(label=3)   # for class of 3, which is foggy
plt.show()

Now that we can see the reasons better, we can tell that the prediction was foggy because the Visibility on the day before was lower than 9km. (Which is quite a good reason for such a prediction) 

If we don't prefer graphs or plots, we can also receive the numerical values without any visualization. <br />
The most common methods are `as_list` and `as_map`.
### As_list and As_map

In [None]:
exp1.as_list(label=3)

In [None]:
print(exp1.as_map())

As one can imagine, some implementations may need numerical values like the ones above as input (e.g. specialized plotting methods). <br />
Therefore, it is a nice addition that LIME can provide the output without any graph. <br /> <br />
Now that we have seen how LIME works, let's do a small comparison between different predictions to see the reasoning of the xgboost model.

In [None]:
# label 0 is "clear"
exp3.as_pyplot_figure(label=0)
plt.show()

As shown above, high pressure caused the xgboost model to predict "Clear". <br /> This is good, because low pressure would make the weather cloudy, whereas high pressure leads to "Dry" conditions.   <br /> The second most influential feature is the pressure of the day before yesterday. This is also nice, because it shows that predictions of "Clear" are mostly based on related fields. 

* ### LR - SDT - SHAP

In [None]:
sdt_pred_y = lr_model.predict(sdt_test_x)
print("r_square score of the Ridge Regression model : ",r2_score(sdt_test_y,sdt_pred_y))   # the model performs really well.

In [None]:
plt.figure(figsize=(20,6))
plt.plot(sdt_pred_y)
plt.plot(sdt_test_y)
plt.tight_layout()

In [None]:
# efe ergün