# <center> AI Lab Exercises
## <center> Series 5: Naive Bayes & Logistic Regression

### Data
1. The data for this lab are in the files data_train.csv and data_test.csv.
    The task is to predict whether an individual buys a product from their gender, age, and salary.
2. The first 3 columns hold the covariates; the last column holds the labels.
3. The file data_test.csv is used to evaluate the models built from the data in data_train.csv.
4. Using pandas and/or NumPy to handle the data is recommended. In particular, the pandas function get_dummies converts the first covariate into two binary covariates.
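
As an illustration of item 4, here is a minimal sketch of one-hot encoding the gender column with `pd.get_dummies`. The rows are made up for the example, and the label column name `Purchased` is an assumption (only the three covariate names appear in this notebook's outputs).

```python
import pandas as pd

# Toy rows standing in for data_train.csv; "Purchased" is an assumed label name.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 47],
    "EstimatedSalary": [19000, 20000, 25000],
    "Purchased": [0, 0, 1],
})

# One-hot encode the categorical column into two binary covariates
# (Gender_Female and Gender_Male).
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)

# Separate covariates and labels as NumPy arrays.
X = encoded.drop(columns=["Purchased"]).to_numpy(dtype=float)
y = encoded["Purchased"].to_numpy()
print(encoded.columns.tolist())
```

The two dummy columns are perfectly correlated, so for logistic regression one of them can be dropped without losing information.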

### 1 Naive Bayes
The goal of this section is to implement Naive Bayes. The steps are:

1. Compute the empirical distribution of the labels.

In [1]:
import csv
import numpy as np

def convert_csv2array(name: str) -> tuple[np.ndarray, np.ndarray]:
    """
    Convert a csv file into numpy array

    Parameters
    ----------
    name: string
        Path to the file

    Return value
    ------------
    header and data of csv
    """
    # Read every row, then split the header row from the data rows;
    # the context manager closes the file once reading is done
    with open(name, 'r', newline='') as file:
        data = np.array(list(csv.reader(file)))
    return data[0, :], data[1:, :]

def ampiric_distribution(data: np.ndarray, target: int) -> float:
    # Fraction of rows whose binary label in column `target` equals 1
    return np.sum(data[:, target].astype(int))/len(data)

header, data = convert_csv2array('data_train.csv')

# P(Y == 1)
print(ampiric_distribution(data, 3))

0.3


2. For each label value, estimate the parameters of the covariate distributions.

In [2]:
def distrib_param(data: np.ndarray, header: np.ndarray, target: int):
    """
    Get parameters of all distributions of a set of label and covariables

    Parameters
    ----------
    data: np.ndarray
        Data to compute
    header: np.ndarray
        Header of the data
    target: int
        Column target for the label

    Return value
    ------------
    dict mapping each label to a pair (label probability, dict of distribution parameters for each covariate)
    """
    result = {}
    values = np.unique(data[:, target])
    for i in range(len(values)):
        newdata = data[data[:, target] == values[i]]
        tmp = {}
        for j in range(len(data[0])):
            if j != target:
                tmp_2 = {}
                newvalues, count = np.unique(newdata[:, j], return_counts = True)
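                if len(newvalues) == 2:
                    # Exactly two distinct values: treat the covariate as Bernoulli
                    # and store the empirical frequency of each value
                    for k in range(len(newvalues)):
                        tmp_2[newvalues[k]] = count[k] / len(newdata)
                else:
                    # Otherwise treat it as Gaussian (np.var is the population variance)
                    tmp_2["Mean"] = np.mean(newdata[:, j].astype(float))
                    tmp_2["Variance"] = np.var(newdata[:, j].astype(float))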
                tmp[header[j]] = tmp_2
        result[values[i]] = (len(newdata) / len(data), tmp)
    return result

print(distrib_param(data, header, 3))

{'0': (0.7, {'Gender': {'Female': 0.49159663865546216, 'Male': 0.5084033613445378}, 'Age': {'Mean': 32.26890756302521, 'Variance': 64.32264670574112}, 'EstimatedSalary': {'Mean': 60100.84033613445, 'Variance': 625023444.6719865}}), '1': (0.3, {'Gender': {'Female': 0.5490196078431373, 'Male': 0.45098039215686275}, 'Age': {'Mean': 45.0, 'Variance': 79.66666666666667}, 'EstimatedSalary': {'Mean': 96549.01960784313, 'Variance': 1617855440.2153018}})}


3. Implement the Gaussian density function and the Bernoulli probability function.

In [3]:
def gaussian_density_function(x: float, mean: float, var: float) -> float:
    return (1/np.sqrt(var * 2 * np.pi)) * np.exp((-1/2)* ((x - mean)/np.sqrt(var))**2)

def bernoulli_density_function(x: float, p: float) -> float:
    # Bernoulli probability mass: p if x == 1, (1 - p) if x == 0
    return (p ** x) * ((1 - p) ** (1 - x))

4. Given new covariates, predict the corresponding labels.

In [12]:
def naive_base(distrib: dict, header: np.ndarray, data: np.ndarray):
    values_keys = list(distrib.keys())
    maximum = 0
    value = values_keys[0]
    for i in values_keys:
        # Class-conditional likelihood: product over covariates (naive independence)
        likelihood = 1
        for j in range(len(header)):
            params = distrib[i][1][header[j]]
            if "Mean" in params:
                likelihood *= gaussian_density_function(float(data[j]), params["Mean"], params["Variance"])
            else:
                likelihood *= params[data[j]]

        # Unnormalised posterior: likelihood times the label prior
        posterior = likelihood * distrib[i][0]
        if posterior > maximum:
            maximum = posterior
            value = i
    return int(value)

FN, FP, TN, TP = range(4)

def test_naive_base(distrib: dict, header: np.ndarray, data: np.ndarray, target: int):
    target_values = data[:, target].astype(int)
    data = np.delete(data, target, 1)
    header = np.delete(header, target)

    FN_count = 0
    FP_count = 0
    TN_count = 0
    TP_count = 0
    for i in range(len(data)):
        result = naive_base(distrib, header, data[i])

        if result == 1 and target_values[i] == 1:
            TP_count += 1
        elif result == 0 and target_values[i] == 0:
            TN_count += 1
        elif result == 1 and target_values[i] == 0:
            FP_count += 1
        elif result == 0 and target_values[i] == 1:
            FN_count += 1
    return FN_count / len(data), FP_count / len(data), TN_count / len(data), TP_count / len(data)


_, data_test = convert_csv2array('data_test.csv')

print(test_naive_base(distrib_param(data, header, 3), header, data_test, 3))


(0.2833333333333333, 0.0, 0.31666666666666665, 0.4)
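
Multiplying many small densities can underflow to zero, which makes every class tie. A common remedy is to compare sums of log-probabilities instead. The sketch below is one possible variant, not the notebook's method: `log_gaussian` and `predict_log_space` are hypothetical names, and `toy` is a made-up parameter dictionary laid out like the output of `distrib_param`.

```python
import numpy as np

def log_gaussian(x: float, mean: float, var: float) -> float:
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_log_space(distrib: dict, header, row) -> int:
    """Pick the label maximising log prior + sum of log likelihoods."""
    best_label, best_score = None, -np.inf
    for label, (prior, params) in distrib.items():
        score = np.log(prior)
        for name, value in zip(header, row):
            p = params[name]
            if "Mean" in p:
                # Continuous covariate: Gaussian log-density
                score += log_gaussian(float(value), p["Mean"], p["Variance"])
            else:
                # Binary covariate: log of the stored frequency
                score += np.log(p[value])
        if score > best_score:
            best_label, best_score = label, score
    return int(best_label)

# Hypothetical parameters, same layout as the output of distrib_param
toy = {
    "0": (0.7, {"Gender": {"Female": 0.5, "Male": 0.5},
                "Age": {"Mean": 32.0, "Variance": 64.0}}),
    "1": (0.3, {"Gender": {"Female": 0.55, "Male": 0.45},
                "Age": {"Mean": 45.0, "Variance": 80.0}}),
}
print(predict_log_space(toy, ["Gender", "Age"], ["Male", "25"]))
print(predict_log_space(toy, ["Gender", "Age"], ["Female", "55"]))
```

Because the logarithm is monotonic, the argmax is the same as with the product of probabilities whenever the product does not underflow.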


### 2 Logistic Regression
The goal of this section is to implement logistic regression. We assume the following model:
$$y_i \overset{ind}{\sim} \mathrm{Bernoulli}(p_i), \quad p_i = \sigma(w^T x_i + b), \quad \sigma(z) = \frac{1}{1+e^{-z}}$$

1. On paper, derive the following quantities. Throughout, $z_i = w^T x_i + b$, so that $p_i = \sigma(z_i) = \frac{1}{1+e^{-z_i}}$.

(a) $p(y_i|x_i; w, b)$

$$p(y_i|x_i; w, b)$$
$$= p_i^{y_i}(1 - p_i)^{1 - y_i}$$
$$= \sigma(z_i)^{y_i}(1 - \sigma(z_i))^{1 - y_i}$$
$$= (\frac{1}{1+e^{-z_i}})^{y_i}(1 - \frac{1}{1+e^{-z_i}})^{1 - y_i}$$
$$= (\frac{1}{1+e^{-z_i}})^{y_i}(\frac{1+e^{-z_i} - 1}{1+e^{-z_i}})^{1 - y_i}$$
$$= \boxed{(\frac{1}{1+e^{-z_i}})^{y_i}(\frac{e^{-z_i}}{1+e^{-z_i}})^{1 - y_i}}$$
$$= \frac{1}{(1+e^{-z_i})^{y_i}}\frac{(e^{-z_i})^{1 - y_i}}{(1+e^{-z_i})^{1 - y_i}}$$
$$= \boxed{\frac{(e^{-z_i})^{1 - y_i}}{1+e^{-z_i}}}$$

(b) $log(p(y_i|x_i; w, b))$

$$log(p(y_i|x_i; w, b))$$
$$= log(\frac{(e^{-z_i})^{1 - y_i}}{1+e^{-z_i}})$$
$$= log((e^{-z_i})^{1 - y_i}) - log(1+e^{-z_i})$$
$$= (1 - y_i) log(e^{-z_i}) - log(1+e^{-z_i})$$
$$= \boxed{-(1 - y_i) z_i - log(1+e^{-z_i})}$$

The same result follows from the first boxed form in (a):

$$log((\frac{1}{1+e^{-z_i}})^{y_i}(\frac{e^{-z_i}}{1+e^{-z_i}})^{1 - y_i})$$
$$= y_i log(\frac{1}{1+e^{-z_i}}) + (1 - y_i) log(\frac{e^{-z_i}}{1+e^{-z_i}})$$
$$= -y_i log(1+e^{-z_i}) + (1 - y_i) (log(e^{-z_i}) - log(1+e^{-z_i}))$$
$$= -y_i log(1+e^{-z_i}) + (1 - y_i) log(e^{-z_i}) + (y_i - 1) log(1+e^{-z_i})$$
$$= (1 - y_i) log(e^{-z_i}) - log(1+e^{-z_i})$$

(c) $\frac{d\sigma(z)}{dz}$ as a function of $\sigma$

$$\frac{d\sigma(z)}{dz}$$
$$= ((1 + e^{-z})^{-1})'$$
$$= -1 \times (-e^{-z}) \times (1 + e^{-z})^{-2}$$
$$= \frac{e^{-z}}{(1 + e^{-z})^2}$$
$$= \frac{1}{1 + e^{-z}} \times \frac{e^{-z}}{1 + e^{-z}}$$
$$= \boxed{\sigma(z)(1 - \sigma(z))}$$

(d) $\frac{\partial log(p(y_i|x_i;w,b))}{\partial w_j}$

Write $z_i = \sum_{k = 1}^{d} w_k x_{ik} + b$, where $d$ is the number of covariates, so that $\frac{\partial z_i}{\partial w_j} = x_{ij}$.

$$\frac{\partial log(p(y_i|x_i;w,b))}{\partial w_j}$$
$$= \frac{\partial}{\partial w_j}\left[-(1 - y_i) z_i - log(1+e^{-z_i})\right]$$
$$= -x_{ij}(1 - y_i) - \frac{\frac{\partial}{\partial w_j}(1+e^{-z_i})}{1+e^{-z_i}}$$
$$= -x_{ij}(1 - y_i) + \frac{x_{ij} e^{-z_i}}{1+e^{-z_i}}$$
$$= -x_{ij}(1 - y_i) + x_{ij}(1 - \sigma(z_i))$$
$$= \boxed{x_{ij}(y_i - \sigma(z_i))}$$

(e) $\frac{\partial log(p(y_i|x_i;w,b))}{\partial b}$

Since $\frac{\partial z_i}{\partial b} = 1$:

$$\frac{\partial log(p(y_i|x_i;w,b))}{\partial b}$$
$$= \frac{\partial}{\partial b}\left[-(1 - y_i) z_i - log(1+e^{-z_i})\right]$$
$$= -(1 - y_i) - \frac{\frac{\partial}{\partial b}(1+e^{-z_i})}{1+e^{-z_i}}$$
$$= -(1 - y_i) + \frac{e^{-z_i}}{1+e^{-z_i}}$$
$$= -(1 - y_i) + (1 - \sigma(z_i))$$
$$= \boxed{y_i - \sigma(z_i)}$$
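
For the model stated at the start of this section, with $\sigma(z) = \frac{1}{1+e^{-z}}$, the gradient of the log-likelihood of one example is $x_{ij}(y_i - \sigma(z_i))$ with respect to $w_j$ and $y_i - \sigma(z_i)$ with respect to $b$. A quick way to sanity-check a derivation like this is to compare it against central finite differences. The sketch below does so on arbitrary illustrative values; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, b, x, y):
    """log p(y | x; w, b) for one example under the Bernoulli model."""
    p = sigmoid(w @ x + b)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def analytic_grad(w, b, x, y):
    """Closed form: (y - sigma(z)) * x for w, and (y - sigma(z)) for b."""
    r = y - sigmoid(w @ x + b)
    return r * x, r

# Arbitrary illustrative values (not taken from the data files)
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 2.0, -0.5])
y = 1.0

# Central finite differences, one coordinate of w at a time
eps = 1e-6
num_grad_w = np.array([
    (log_likelihood(w + eps * e, b, x, y) - log_likelihood(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])
num_grad_b = (log_likelihood(w, b + eps, x, y) - log_likelihood(w, b - eps, x, y)) / (2 * eps)

grad_w, grad_b = analytic_grad(w, b, x, y)
print(np.allclose(grad_w, num_grad_w, atol=1e-6), np.isclose(grad_b, num_grad_b, atol=1e-6))
```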

2. Implement a function train_logistic_regression that takes as arguments:

    (a) A covariate matrix X
  
    (b) A label vector y

    (c) An initial weight vector w

    (d) An initial bias value

    (e) A number of iterations num_iters
    
    (f) A learning rate learning_rate

    and that returns the weights and the bias trained by gradient descent to minimize
    $$-\sum_{i = 1}^N log(p(y_i|x_i; w, b))$$
    where $N$ is the number of training examples.
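
One possible sketch of such a function follows. It uses the fact that the gradient of the negative log-likelihood is $\sum_i (\sigma(z_i) - y_i) x_i$ for $w$ and $\sum_i (\sigma(z_i) - y_i)$ for $b$. The argument name `b` and the synthetic data are assumptions made for the example. On the real data the covariates would likely need standardising first, since salaries in the tens of thousands make plain gradient descent unstable.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, w, b, num_iters, learning_rate):
    """Gradient descent on the negative log-likelihood (a minimal sketch)."""
    w = np.asarray(w, dtype=float).copy()  # do not modify the caller's array
    b = float(b)
    for _ in range(num_iters):
        residual = sigmoid(X @ w + b) - y      # sigma(z_i) - y_i for every example
        w -= learning_rate * (X.T @ residual)  # gradient of the NLL w.r.t. w
        b -= learning_rate * residual.sum()    # gradient of the NLL w.r.t. b
    return w, b

# Synthetic, already standardised data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w, b = train_logistic_regression(X, y, np.zeros(2), 0.0, num_iters=500, learning_rate=0.01)
pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
print((pred == y).mean())
```

Because the loss is a sum rather than a mean over examples, a suitable learning rate shrinks as the number of examples grows.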


### 3 Evaluation
Compare the performance of the models by evaluating them on the data in data_train.csv
and data_test.csv with the following metrics: accuracy, precision, recall, F1 score.

In [None]:
def accuracy(distrib: dict, test_datas: np.ndarray, header: np.ndarray, target: int) -> float:
    """
    Return the accuracy of the Naive Bayes model for a given set of test data

    Parameters
    ----------
    distrib: dictionary
        The parameters returned by distrib_param
    test_datas: np.ndarray
        The set of test data
    header: np.ndarray
        Header of the test data
    target: int
        Column index of the label in the test data

    Return value
    ------------
    Return the proportion of correct predictions
    """
    # test_naive_base returns the rates in the order (FN, FP, TN, TP)
    fn, fp, tn, tp = test_naive_base(distrib, header, test_datas, target)
    return (tp + tn)/(tp + tn + fp + fn)

def precision(distrib: dict, test_datas: np.ndarray, header: np.ndarray, target: int) -> tuple[float, float]:
    """
    Return the precision and the recall of the Naive Bayes model for a given set of test data

    Parameters
    ----------
    distrib: dictionary
        The parameters returned by distrib_param
    test_datas: np.ndarray
        The set of test data
    header: np.ndarray
        Header of the test data
    target: int
        Column index of the label in the test data

    Return value
    ------------
    Return the pair (precision, recall) of the model
    """
    # test_naive_base returns the rates in the order (FN, FP, TN, TP)
    fn, fp, tn, tp = test_naive_base(distrib, header, test_datas, target)
    return tp / (tp + fp), tp / (tp + fn)

def f1_score(distrib: dict, test_datas: np.ndarray, header: np.ndarray, target: int) -> float:
    """
    Return the F1 score of the Naive Bayes model for a given set of test data

    Parameters
    ----------
    distrib: dictionary
        The parameters returned by distrib_param
    test_datas: np.ndarray
        The set of test data
    header: np.ndarray
        Header of the test data
    target: int
        Column index of the label in the test data
    
    Return value
    ------------
    Return the F1 score of the model (harmonic mean of precision and recall)
    """
    p, r = precision(distrib, test_datas, header, target)
    return (2 * p * r) / (p + r)
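
All four metrics depend only on the confusion rates, so they can also be computed in one pass from a (FN, FP, TN, TP) tuple. The sketch below uses a hypothetical helper `metrics_from_counts`, which guards against empty denominators, and applies it to the tuple printed for data_test.csv in the Naive Bayes section.

```python
def metrics_from_counts(fn: float, fp: float, tn: float, tp: float) -> dict:
    """Accuracy, precision, recall and F1 from confusion counts or rates."""
    total = fn + fp + tn + tp
    # Fall back to 0.0 when a denominator is empty (no predicted or no actual positives)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Rates printed by test_naive_base on data_test.csv, in the order (FN, FP, TN, TP)
m = metrics_from_counts(0.2833333333333333, 0.0, 0.31666666666666665, 0.4)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With no false positives the precision is 1, so recall alone limits the F1 score here.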

