### What is this about

This is an experiment in implementing Gaussian Discriminant Analysis for multi-class classification.

### Scope

The goal of this study is to test how well Gaussian Discriminant Analysis predicts the weather type, using a predefined dataset (https://www.kaggle.com/datasets/nikhil7280/weather-type-classification).

### Approach

This study makes use of the following techniques and formulas, which the researcher derived.

### Formula

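$$
P(y=k|x) = \frac{1}{1 + \sum_{i=1,\, i\neq k}^{K}\exp(-\theta^T_i x + \theta_{0i})}
$$

where, when predicting for class $k$,

$\theta_i = \Sigma^{-1}(\mu_k - \mu_i)$ and $\theta_{0i} = \ln\frac{\phi_i}{\phi_k} + \frac{1}{2}(\mu_k^T\Sigma^{-1}\mu_k - \mu_i^T\Sigma^{-1}\mu_i)$

The leading 1 in the denominator is the $i = k$ term of the sum over all classes: for $i = k$ both $\theta_k$ and $\theta_{0k}$ are zero, so the term equals $\exp(0) = 1$.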
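As a sanity check on the formula above, the following sketch compares it against Bayes' rule applied directly to Gaussian class densities with a shared covariance. It uses NumPy rather than the TensorFlow used later in this notebook, purely for brevity. The function names and the toy values of `phi`, `mu`, `cov` and `x` are assumptions made up for illustration; they do not come from the weather dataset.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def posterior_bayes(x, phi, mu, cov):
    """P(y=k|x) for every k via Bayes' rule; mu has shape (D, K)."""
    joint = np.array([phi[k] * gaussian_density(x, mu[:, k], cov)
                      for k in range(len(phi))])
    return joint / joint.sum()

def posterior_theta(x, k, phi, mu, cov):
    """P(y=k|x) via the theta formulation with a shared covariance."""
    cov_inv = np.linalg.inv(cov)
    # quad[i] = mu_i^T Sigma^-1 mu_i
    quad = np.einsum("dk,de,ek->k", mu, cov_inv, mu)
    total = 1.0  # the i == k term, exp(0)
    for i in range(len(phi)):
        if i == k:
            continue
        theta_i = cov_inv @ (mu[:, k] - mu[:, i])
        theta_0i = np.log(phi[i] / phi[k]) + 0.5 * (quad[k] - quad[i])
        total += np.exp(-theta_i @ x + theta_0i)
    return 1.0 / total

# Toy parameters (hypothetical): K = 3 classes, D = 2 features
phi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 2.0, -1.0],
               [0.0, 1.0, 3.0]])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
x = np.array([0.5, 1.0])

direct = posterior_bayes(x, phi, mu, cov)
via_theta = np.array([posterior_theta(x, k, phi, mu, cov) for k in range(3)])
print(direct)
print(via_theta)
```

Both computations should agree up to floating-point error, and the probabilities should sum to 1 across classes.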

### Matrix shortcut for computing the parameters

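We can compute the parameters with the following matrix multiplications:

$$
\phi = \frac{1}{M}\sum_{i=1}^{M}Y_i
$$

$$
\mu = \frac{X^TY}{\sum_{i=1}^{M}Y_i}
$$

$$
\Sigma = \frac{1}{M}\sum_{i = 1}^{M}(x^{(i)} - \mu_{y^{(i)}})^T(x^{(i)} - \mu_{y^{(i)}})
$$

Where

$Y_i$ = the $i$-th row of $Y$, so $\sum_{i=1}^{M}Y_i$ counts the training examples in each class; the division in $\mu$ is applied column by column

$x^{(i)}$ and $\mu_{y^{(i)}}$ = the $i$-th row of $X$ and the mean of its class, both taken as row vectors so that each term of $\Sigma$ is a $D \times D$ matrix

$\phi\in\mathbb{R}^{1\times K}$

$\mu\in\mathbb{R}^{D \times K}$ = where the $i$-th column represents the mean of the $i$-th class

$X\in\mathbb{R}^{M \times D}$ = where the $i$-th row represents the $i$-th training example

$Y\in\mathbb{R}^{M \times K}$ = where the $i$-th row represents the class of the $i$-th training example as a one-hot row vector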

Using these, we can solve for $\theta_i$ and $\theta_{0i}$ for every class $k$ and arrange them in a matrix to be used below.
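The parameter formulas above can be sketched with NumPy on a tiny hand-made dataset. The function name `fit_gda` and the four toy rows are assumptions for illustration only. The per-example sum in $\Sigma$ is replaced by one matrix product: `Y @ mu.T` picks, for each row, the mean of that row's class.

```python
import numpy as np

def fit_gda(X, Y):
    """Estimate phi, mu, Sigma from X (M, D) and one-hot Y (M, K)."""
    M = X.shape[0]
    counts = Y.sum(axis=0)               # (K,) examples per class
    phi = counts / M                     # (K,) class priors
    mu = (X.T @ Y) / counts              # (D, K) column k = mean of class k
    centered = X - Y @ mu.T              # (M, D) each row minus its class mean
    sigma = centered.T @ centered / M    # (D, D) shared covariance
    return phi, mu, sigma

# Toy data (hypothetical): 4 examples, 2 features, 2 classes
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 0.0],
              [7.0, 4.0]])
labels = np.array([0, 0, 1, 1])
Y = np.eye(2)[labels]                    # one-hot encoding of the labels

phi, mu, sigma = fit_gda(X, Y)
print(phi)    # [0.5 0.5]
print(mu)     # [[2. 6.] [3. 2.]]
print(sigma)  # [[1.  1.5] [1.5 2.5]]
```

The printed values can be checked by hand: class 0 has mean $(2, 3)$, class 1 has mean $(6, 2)$, and the centered rows are $(\mp 1, \mp 1)$ and $(\mp 1, \mp 2)$.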

### Matrix Shortcut For Predictions in Multi-class

To compute $P(y = k|x)$ for every data point at once, we can apply the following matrix operation for a fixed class $k$:

$$
H(k) = \frac{1}{\text{sum}(\exp(X'W))}
$$

Where

$\exp$ is applied element-wise and $\text{sum}$ adds the $K$ entries of each row, so $H(k)$ holds one probability per row of $X'$

$X'\in\mathbb{R}^{M\times (D + 1)}$ = where the $i$-th row is the $i$-th data point followed by a bias of 1 for the intercept

$W\in\mathbb{R}^{(D + 1)\times K}$ = where the first $D$ rows of the $i$-th column hold $-\theta_{i}$ for class $k$ and the last row holds the intercepts, so the $j$-th column ends with $\theta_{0j}$; the column for $i = k$ is all zeros and contributes $\exp(0) = 1$ to the sum

$D$ = number of dimensions (features)

$M$ = number of training examples

$K$ = number of classes

We then repeat this for every $k$, rebuilding $W$ each time, to produce a prediction for every class.
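A minimal NumPy sketch of this shortcut follows. It assumes that column $i$ of $W$ stores $-\theta_i$ on top of $\theta_{0i}$, so that one product $X'W$ yields every exponent of the posterior formula. The helper names `build_weights` and `predict_proba` and the toy parameters are hypothetical, and no safeguard against overflow in `np.exp` is included.

```python
import numpy as np

def build_weights(k, phi, mu, sigma):
    """W for class k, shape (D+1, K): column i is [-theta_i ; theta_0i]."""
    sigma_inv = np.linalg.inv(sigma)
    theta = sigma_inv @ (mu[:, [k]] - mu)              # (D, K) column i = theta_i
    quad = np.sum(mu * (sigma_inv @ mu), axis=0)       # (K,) mu_i^T Sigma^-1 mu_i
    theta_0 = np.log(phi / phi[k]) + 0.5 * (quad[k] - quad)
    return np.vstack([-theta, theta_0])                # column k is all zeros

def predict_proba(X, phi, mu, sigma):
    """Return H of shape (M, K) with H[m, k] = P(y = k | x_m)."""
    M, K = X.shape[0], phi.shape[0]
    X_aug = np.hstack([X, np.ones((M, 1))])            # X' with trailing bias column
    H = np.empty((M, K))
    for k in range(K):
        W = build_weights(k, phi, mu, sigma)
        H[:, k] = 1.0 / np.exp(X_aug @ W).sum(axis=1)  # row-wise sum over classes
    return H

# Toy parameters (hypothetical): K = 3 classes, D = 2 features
phi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 2.0, -1.0],
               [0.0, 1.0, 3.0]])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
X = np.array([[0.5, 1.0],
              [2.0, 1.0],
              [-1.0, 3.0]])

H = predict_proba(X, phi, mu, sigma)
print(H)
print(H.argmax(axis=1))
```

Each row of `H` should sum to 1, and the predicted class for a data point is the column with the largest probability.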


### STAGE 1: DATA COLLECTION AND SANITIZATION

First we load the data (X and Y) and arrange it in our intended format.

We also have to account for the categorical values by assigning a number to each of them.

In [48]:
import pandas as pd
import tensorflow as tf

# assign numbers to the categorical values
categorical_mapping = {
    "Weather Type": {"Rainy": 0, "Cloudy": 1, "Sunny": 2, "Snowy": 3},
    "Cloud Cover" : {"overcast": 0.,"partly cloudy": 1., "clear": 2., "cloudy": 3.},
    "Season"     : {"Spring": 0., "Autumn": 1., "Winter": 2., "Summer": 3.},
    "Location"    : {"inland": 0., "mountain": 1., "coastal": 2.}
}

# configs
# forward slash avoids the invalid escape sequence '\w' in the path
file_path = 'dataset/weather_classification_data.csv'
features = ["Temperature", "Humidity", "Wind Speed", 
            "Precipitation (%)", "Cloud Cover", "Atmospheric Pressure", 
            "UV Index", "Season", "Visibility (km)", "Location"]

dependent_variable = "Weather Type"
K = len(categorical_mapping[dependent_variable])  # number of classes
D = len(features)                                  # number of features
M = None                                           # number of training examples, set after loading

def display_tensor(tensor, name):
    print(name + ": \n", tensor.numpy())

def encode_row_matrix_to_one_hot(row_matrix, depth=K):
    # flatten to 1-D, then expand every integer label into a one-hot row
    row_matrix = tf.reshape(row_matrix, [-1])
    one_hot_matrix = tf.one_hot(row_matrix, depth=depth)
    
    return one_hot_matrix

def read_csv_to_XYy(file_path, categorical_mapping, dependent_variable):
    df = pd.read_csv(file_path)

    for column, mapping in categorical_mapping.items():
        df[column] = df[column].map(mapping)

    # integer class labels, so they can be used as indices later
    y = df[dependent_variable].to_numpy(dtype="int64")
    # select the features explicitly so the column order matches `features`
    x = df[features].to_numpy(dtype="float32")

    # add a bias of 1 to take into account the intercept
    X = tf.concat([tf.convert_to_tensor(x), tf.ones([len(x), 1], dtype=tf.float32)], axis=1)
    Y = encode_row_matrix_to_one_hot(tf.convert_to_tensor(y, dtype=tf.int32))

    return [X, Y, y]

# parse the data to X and Y
X, Y, y = read_csv_to_XYy(file_path, categorical_mapping, dependent_variable)
X_T = tf.transpose(X)
M = int(X.shape[0])

display_tensor(X, "X")
display_tensor(Y, "Y")

### STAGE 2: COMPUTE THE PARAMETERS
We then compute our intended parameters, which are $\phi$, $\mu$ and $\Sigma$.

In [None]:

# drop the bias column: the parameters are estimated on the D features only
X_FEATURES = X[:, :D]

Y_SUM = tf.reduce_sum(Y, axis=0)                            # (K,) examples per class
PHI = Y_SUM / M                                             # (K,) class priors
MU  = tf.matmul(X_FEATURES, Y, transpose_a=True) / Y_SUM    # (D, K) column k = mean of class k

# Sigma: subtract from every row the mean of its own class,
# then average the outer products over all M examples
DIFF = X_FEATURES - tf.matmul(Y, MU, transpose_b=True)      # (M, D)
SIGMA = tf.matmul(DIFF, DIFF, transpose_a=True) / M         # (D, D)

display_tensor(PHI, "PHI")
display_tensor(MU, "MU")
display_tensor(SIGMA, "SIGMA")