
### LDA


\begin{align}
p (Y=1|X)=\frac{f(X|Y=1)p(Y=1)}{m(X)}
\end{align}


where $m(X)$ is the marginal distribution of $X$, i.e.

\begin{align}
m(X)=\int f(X|Y=y)p(Y=y)dy
\end{align}

Recall that there are only two states of nature, $y\in\{0,1\}$, so the integral reduces to a sum:

\begin{align}
m(X) &= f(X|Y=1)p(Y=1) + f(X|Y=0)p(Y=0) \\
     &= f(X|Y=1)p(Y=1) + f(X|Y=0)(1-p(Y=1))
\end{align}

We therefore need to estimate $f(X|Y=1)$, $f(X|Y=0)$, and $p(Y=1)$.


#### By Hand

Let's apply this to our unemployment problem. Unemployment prediction is a classic classification problem and a key application area for machine learning: we use observed employment outcomes (employed versus unemployed) to train a model that predicts the employment status of new individuals.

\begin{align}
Unemployment = f(x) + u
\end{align}

where $Unemployment$ is an indicator variable equal to 1 if the individual is unemployed and 0 otherwise.


In [1]:
# Load libraries
require("pacman")
p_load(tidyverse)
set.seed(1011)

Loading required package: pacman



In [2]:
# Read the data
db <- readRDS(url("https://github.com/ignaciomsarmiento/datasets/blob/main/desempelo_arg_2010.Rds?raw=true"))
head(db)

| desempleado | edad | mujer | parentesco | nivel_ed | estado_civil | total_miembros_hogar | miembros_hogar_menores10 | ing_tot_fam | tipo_vivienda | ciudad | trimestre | id_hogar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<chr>` |
| 0 | 49 | 0 | Hijo/a - Hijastro/a | Superior Universitaria Incompleta | Soltero/a | 2 | 0 | 4200 | Casa | Gran Rosario | 1 | 1250021 |
| 0 | 56 | 1 | Jefe/a | Primaria Completa | Viudo/a | 1 | 0 | 1380 | Departamento | Rio Cuarto | 1 | 1250241 |
| 0 | 31 | 0 | Jefe/a | Superior Universitaria Incompleta | Unido/a | 3 | 1 | 8400 | Casa | Partidos del GBA | 1 | 1251051 |
| 0 | 33 | 1 | Conyuge/Pareja | Superior Universitaria Completa | Unido/a | 3 | 1 | 8400 | Casa | Partidos del GBA | 1 | 1251051 |
| 0 | 40 | 1 | Hijo/a - Hijastro/a | Secundaria Completa | Unido/a | 5 | 2 | 8800 | Casa | Gran Cordoba | 1 | 1251161 |
| 0 | 49 | 0 | Yerno/Nuera | Secundaria Completa | Unido/a | 5 | 2 | 8800 | Casa | Gran Cordoba | 1 | 1251161 |



- Let's start by estimating $p(Y=1)$ with the sample share of unemployed individuals, as we have done before:

    \begin{align}
    p(Y=1) = \frac{\sum_{i=1}^N 1[Y_i=1]}{N}
    \end{align}


In [5]:
# Share of unemployed individuals in the sample
p1 <- sum(db$desempleado) / nrow(db)
p1


- Next, $f(X|Y=j)$ with $j=0,1$.

    - If we assume a single predictor and $X|Y=j\sim N(\mu_j,\sigma_j^2)$, the problem boils down to estimating $\mu_j$ and $\sigma_j$.

    - LDA simplifies this further by assuming a common variance: $\sigma_j=\sigma$ $\forall j$.

To do this, partition the sample into two groups, $Y=0$ and $Y=1$, estimate the moments within each group, and obtain $\hat{f}(X|Y=j)$.
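For reference, under the Gaussian assumption with a common standard deviation, the estimated class density can be written out as the usual normal density evaluated at the estimated moments. This is only a restatement of the assumption above; the hats denote sample estimates:

\begin{align}
\hat{f}(x|Y=j)=\frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\exp\left(-\frac{(x-\hat{\mu}_j)^2}{2\hat{\sigma}^2}\right), \quad j=0,1
\end{align}

In R, `dnorm(x, mean, sd)` evaluates this expression, so only $\hat{\mu}_0$, $\hat{\mu}_1$, and $\hat{\sigma}$ need to be estimated.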

**Means**

\begin{align}
\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i
\end{align}

where $n_k$ is the number of observations with $y_i=k$.

In [7]:
# Mean age (edad) within each class
mu1<-mean(db$edad[db$desempleado==1])
mu1

In [8]:
mu0<-mean(db$edad[db$desempleado==0])
mu0

**Variance**

\begin{align}
\hat{\sigma}^2 = \frac{1}{N-K} \sum_{k=1}^K \sum_{i:y_i=k} (x_i -\hat{\mu}_k)^2
\end{align}

where $N$ is the total number of observations and $K$ is the number of classes (here $K=2$).

In [9]:
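# Within-class sums of squared deviations
g1 <- sum((db$edad[db$desempleado == 1] - mu1)^2)
g0 <- sum((db$edad[db$desempleado == 0] - mu0)^2)

# Pooled standard deviation, using N - K degrees of freedom with K = 2
sigma <- sqrt((g1 + g0) / (nrow(db) - 2))
sigma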

With the moments in hand, we can now evaluate $\hat{f}(X|Y=j)$ for $j=0,1$.

In [10]:
f1<-dnorm(db$edad,mean=mu1,sd=sigma)
f0<-dnorm(db$edad,mean=mu0,sd=sigma)

- Finally, plug everything into Bayes' rule and we are done:
\begin{align}
p (Y=1|X)=\frac{f(X|Y=1)p(Y=1)}{f(X|Y=1)p(Y=1) + f(X|Y=0)(1-p(Y=1))}
\end{align}


In [11]:
post_hand<-f1*p1/(f1*p1+f0*(1-p1))
head(post_hand)
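As a side note, a short derivation shows why this classifier is called "linear". Taking the log of the posterior odds and substituting the Gaussian densities with a common $\sigma$, the quadratic terms in $x$ cancel. This is a sketch under the same assumptions as above (one predictor, common variance):

\begin{align}
\log\frac{p(Y=1|x)}{p(Y=0|x)} &= \log\frac{p(Y=1)}{1-p(Y=1)} + \log\frac{f(x|Y=1)}{f(x|Y=0)} \\
&= \log\frac{p(Y=1)}{1-p(Y=1)} - \frac{\mu_1^2-\mu_0^2}{2\sigma^2} + \frac{\mu_1-\mu_0}{\sigma^2}\,x
\end{align}

The log odds are therefore linear in $x$, with slope $(\mu_1-\mu_0)/\sigma^2$.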

In [12]:
# Compare with the lda() implementation in MASS
p_load("MASS")
lda_simple <- lda(desempleado ~ edad, data = db)
lda_simple_pred <- predict(lda_simple, db)
names(lda_simple_pred)


In [13]:
posteriors<-data.frame(lda_simple_pred$posterior)
posteriors$hand<-post_hand

head(posteriors)

| | X0 | X1 | hand |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 0.9625907 | 0.03740933 | 0.03740933 |
| 2 | 0.9732954 | 0.02670463 | 0.02670463 |
| 3 | 0.9131315 | 0.08686852 | 0.08686852 |
| 4 | 0.9207042 | 0.07929581 | 0.07929581 |
| 5 | 0.942681 | 0.05731898 | 0.05731898 |
| 6 | 0.9625907 | 0.03740933 | 0.03740933 |
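
The first rows agree, but the comparison can also be made over the whole sample. The sketch below is a minimal, self-contained illustration on simulated data, not the notebook's data: the helper `lda_posterior_1d()`, the sample size, and the class means are all assumptions made for the example. It repeats the by-hand steps and checks them against `MASS::lda()` with `all.equal()`.

```r
library(MASS)  # provides lda()

# Hypothetical helper: by-hand LDA posterior p(Y = 1 | x) for one predictor
lda_posterior_1d <- function(x, y) {
  p1  <- mean(y)                      # prior p(Y = 1)
  mu1 <- mean(x[y == 1])              # class means
  mu0 <- mean(x[y == 0])
  ss  <- sum((x[y == 1] - mu1)^2) + sum((x[y == 0] - mu0)^2)
  sigma <- sqrt(ss / (length(x) - 2)) # pooled sd, N - K with K = 2
  f1 <- dnorm(x, mean = mu1, sd = sigma)
  f0 <- dnorm(x, mean = mu0, sd = sigma)
  f1 * p1 / (f1 * p1 + f0 * (1 - p1)) # Bayes' rule
}

# Simulated data (assumed values, for illustration only)
set.seed(1011)
n   <- 500
y   <- rbinom(n, size = 1, prob = 0.1)
x   <- rnorm(n, mean = ifelse(y == 1, 30, 40), sd = 12)
sim <- data.frame(y = y, x = x)

post_hand <- lda_posterior_1d(sim$x, sim$y)

# Same model fitted with MASS::lda(); posterior column "1" is p(Y = 1 | x)
fit      <- lda(y ~ x, data = sim)
post_lda <- predict(fit, sim)$posterior[, "1"]

# TRUE if the two posteriors agree up to numerical tolerance
agree <- isTRUE(all.equal(unname(post_lda), post_hand))
agree
```

On the notebook's own data, the equivalent check would be `all.equal(posteriors$X1, posteriors$hand)`.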
