### Logistic Regression (Binary - Supervised)

We want to predict whether a student will be admitted to University A based on their GMAT score and their GPA. 

We will train our model to estimate the probability that a student will be admitted.

### Feed Forward (with a single neuron)

$(x^1, y^1), \ldots, (x^n, y^n)$

Feature Vector  
$$x^i = \begin{bmatrix} \text{Student}_i\ \text{GMAT score} \\ \text{Student}_i\ \text{GPA} \end{bmatrix}$$
 
Label  
$y^i \in \{0, 1\}$  
where 1 indicates that student $i$ was accepted and 0 that they were not.


Given a feature vector $x^i$, weights $\omega$, and a bias $b$, we form the linear combination $z^i = \omega^T x^i + b$ and apply an activation function of our choice, here the sigmoid $\sigma(z) = \frac{1}{1+e^{-z}}$, to obtain the output $\hat y^i = \sigma(z^i)$.
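As a small illustration of the forward pass, the sketch below computes $z$ and $\hat y$ for one student. The parameter values (all zeros) and the sample student are assumptions chosen for illustration, and `y_hat` is a hypothetical variable name.

```julia
# Sigmoid activation, as defined in the text
σ(z) = 1 / (1 + exp(-z))

# Assumed starting parameters (all zeros) and one student: GMAT 780, GPA 4.0
w = [0.0, 0.0]
b = 0.0
x = [780.0, 4.0]

z = w'x + b       # linear combination ωᵀx + b
y_hat = σ(z)      # predicted probability of admission

println("z = ", z, ", y_hat = ", y_hat)
```

With all-zero parameters, $z = 0$ for every student, so the model outputs a probability of 0.5 regardless of the features.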


---
#### Cross-Entropy Loss Function
We want $L(\hat y^i, y^i)$ to measure how close $\hat y^i$ is to $y^i$.  
First consider maximizing $P(y^i|x^i)$, the probability that the model assigns to the true label $y^i$. Since there are only two discrete outcomes, this probability follows the Bernoulli distribution. 

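Writing $\hat y$ and $y$ for $\hat y^i$ and $y^i$, we maximize  
$P(y^i|x^i) = \hat y^{y} (1-\hat y)^{1-y}$  
which is equivalent to maximizing its logarithm:  
$\log P(y^i|x^i) = \log\left(\hat y^{y} (1-\hat y)^{1-y}\right)$  
$\log P(y^i|x^i) = y\log \hat y + (1-y)\log(1- \hat y)$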

For gradient descent we want a minimization problem, so we negate the log-likelihood. This gives the cross-entropy loss $L_{CE}(\omega, b)$, which we minimize:  
$L_{CE}(\omega, b) = -\log P(y^i|x^i) = -[y\log \hat y + (1-y)\log(1- \hat y)]$

$L_{CE}(\omega, b) = -[y\log \sigma (z) + (1-y)\log(1- \sigma(z))]$

$L_{CE}(\omega, b) = -[y\log \sigma (\omega^Tx+b) + (1-y)\log(1- \sigma(\omega^Tx+b))]$
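To get a feel for how this loss behaves, the following sketch evaluates it for a few predictions when the true label is 1. The helper name `loss` and the probabilities used are assumptions for illustration only.

```julia
# Cross-entropy loss for one prediction y_hat in (0, 1) and one label y in {0, 1}
loss(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

println(loss(0.5, 1))    # uninformed prediction: equals log(2)
println(loss(0.99, 1))   # confident and correct: small loss
println(loss(0.01, 1))   # confident and wrong: large loss
```

An uninformed prediction of 0.5 costs $\log 2 \approx 0.693$, which is the loss a model with all-zero parameters incurs on every sample.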

---
#### Train (Stochastic Gradient Descent)

We want $\hat\theta = \arg\min_\theta L_{CE}(x^i, y^i; \theta)$, where $\theta = (\omega, b)$.


The partial derivatives of the loss for a single sample $(x, y)$ are:

$\frac{\partial L_{CE}}{\partial \omega_j}(\omega, b) = [\sigma(\omega^T x+b)-y]\cdot x_j$  

$\frac{\partial L_{CE}}{\partial b}(\omega, b)= \sigma(\omega^T x+b)-y$
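These gradient formulas follow from the chain rule together with the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$. One way to sanity-check them is to compare against central finite differences, as in the sketch below. The sample, the parameter values, the step size `h`, and the helper names `loss`, `grad_w`, and `grad_b` are all assumptions made for illustration.

```julia
σ(z) = 1 / (1 + exp(-z))
loss(x, y, w, b) = -y * log(σ(w'x + b)) - (1 - y) * log(1 - σ(w'x + b))

# Analytic gradients, following the formulas in the text
grad_w(x, y, w, b) = (σ(w'x + b) - y) * x
grad_b(x, y, w, b) = σ(w'x + b) - y

# Assumed sample and parameters, kept small so that σ does not saturate
x, y = [1.5, -0.5], 1
w, b = [0.2, -0.1], 0.3
h = 1e-6

# Central finite differences, one weight coordinate at a time
num_w = map(1:length(w)) do j
    e = zeros(length(w))
    e[j] = h
    (loss(x, y, w + e, b) - loss(x, y, w - e, b)) / (2h)
end
num_b = (loss(x, y, w, b + h) - loss(x, y, w, b - h)) / (2h)

# Both differences should be close to zero
println(maximum(abs.(num_w - grad_w(x, y, w, b))))
println(abs(num_b - grad_b(x, y, w, b)))
```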

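For a given sample $(x^i, y^i)$ and learning rate $\alpha$, step $k$ of stochastic gradient descent updates the parameters as follows. (The notebook cells further down sum the gradient over all samples before each update, which is batch gradient descent.)  
$\omega_j^{k+1} = \omega_j^k - \alpha \cdot \frac{\partial L_{CE}}{\partial \omega_j}(\omega^k, b^k)$ for $j = 1, 2$

$b^{k+1} = b^k - \alpha \cdot \frac{\partial L_{CE}}{\partial b}(\omega^k, b^k)$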
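As a small illustration of the update rule, one stochastic step on a single sample might look like the sketch below. The function name `sgd_step`, the sample student, and the learning rate are assumptions for illustration.

```julia
σ(z) = 1 / (1 + exp(-z))

# One stochastic gradient descent step on a single sample (x, y)
function sgd_step(x, y, w, b, α)
    err = σ(w'x + b) - y              # shared factor σ(ωᵀx + b) - y
    return w - α * err * x, b - α * err
end

# Assumed values: zero initial parameters, one admitted student (GMAT 780, GPA 4.0)
w, b = sgd_step([780.0, 4.0], 1, [0.0, 0.0], 0.0, 1e-7)
println(w, " ", b)
```

Because the student was admitted and the initial prediction is 0.5, the error term is negative and both weights and the bias move upward. The GMAT weight moves much further than the GPA weight simply because the GMAT feature is numerically larger.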



---


In [1]:
using CSV
using DataFrames
data = DataFrame(CSV.File("candidates_data.csv"))

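| Row | gmat | gpa | work_experience | admitted |
|---|---|---|---|---|
| | Int64 | Float64 | Int64 | Int64 |
| 1 | 780 | 4.0 | 3 | 1 |
| 2 | 750 | 3.9 | 4 | 1 |
| 3 | 690 | 3.3 | 3 | 0 |
| 4 | 710 | 3.7 | 5 | 1 |
| 5 | 680 | 3.9 | 4 | 0 |
| 6 | 730 | 3.7 | 6 | 1 |
| 7 | 690 | 2.3 | 1 | 0 |
| 8 | 720 | 3.3 | 4 | 1 |
| 9 | 740 | 3.3 | 5 | 1 |
| 10 | 690 | 1.7 | 1 | 0 |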


In [2]:
# Each feature vector holds [GMAT score, GPA]; labels are 1 (admitted) or 0 (not admitted)
x_data = [[gmat, gpa] for (gmat, gpa) in zip(data.gmat, data.gpa)]
y_data = collect(data.admitted);

In [3]:
σ(x) = 1/(1+exp(-x))

σ (generic function with 1 method)

In [4]:
function cross_entropy_loss(x, y, w, b)
    # Predicted probability of admission for a single sample
    y_hat = σ(w'x + b)
    return -y*log(y_hat) - (1-y)*log(1 - y_hat)
end

cross_entropy_loss (generic function with 1 method)

In [5]:
# Average cross-entropy loss over all samples: the quantity we want to minimize
function average_cost(features, labels, w, b)
    N = length(features)
    return (1/N) * sum([cross_entropy_loss(features[i], labels[i], w, b) for i = 1:N])
end

average_cost (generic function with 1 method)

In [6]:
function batch_gradient_descent(features, labels, w, b, α)
    # Accumulate the gradient of the loss over every sample
    del_w = zeros(length(w))
    del_b = 0.0

    N = length(features)

    for i = 1:N
        err = σ(w'features[i] + b) - labels[i]
        del_w += err * features[i]
        del_b += err
    end

    # One update step; the gradient is summed, not averaged, over the N samples
    w = w - α*del_w
    b = b - α*del_b

    return w, b
end

batch_gradient_descent (generic function with 1 method)

In [7]:
w = [0.0, 0.0]
b = 0.0

println("The initial cost is: ", average_cost(x_data, y_data, w, b))

w, b = batch_gradient_descent(x_data, y_data, w, b, 0.0000001)
println("The new cost is: ", average_cost(x_data, y_data, w, b))


The initial cost is: 0.6931471805599451
The new cost is: 0.6931188566349795


In [13]:
function train_batch_gradient_descent(features, labels, w, b, α, epochs)
    for i = 1:epochs
        # Update w and b with one full pass over the data
        w, b = batch_gradient_descent(features, labels, w, b, α)

        # Report the cost at selected epochs, using the data passed in
        # rather than the global x_data and y_data
        if i in (1, 100, 1000, 10000, 100000)
            println("Epoch ", i, " with cost: ", average_cost(features, labels, w, b))
        end
    end
    return w, b
end

train_batch_gradient_descent (generic function with 1 method)

In [22]:
w = [0.0, 0.0]
b = 0.0
w,b = train_batch_gradient_descent(x_data, y_data, w, b, 0.0000001, 1000000)

Epoch 1 with cost: 0.6931188566349795
Epoch 100 with cost: 0.6930977152288289
Epoch 1000 with cost: 0.6930282266294219
Epoch 10000 with cost: 0.6923351299473173
Epoch 100000 with cost: 0.6855799117618873


([-0.0020551903863979, 0.47622113690915635], -0.11626329950708125)

In [23]:
w,b = train_batch_gradient_descent(x_data, y_data, w, b, 0.0000001, 1000000)

Epoch 1 with cost: 0.632691827205463
Epoch 100 with cost: 0.6326872176867723
Epoch 1000 with cost: 0.6326453230745035
Epoch 10000 with cost: 0.6322273761510574
Epoch 100000 with cost: 0.6281458496177578


([-0.0036370472006028087, 0.8445273524582464], -0.22916112906791533)

In [24]:
w,b = train_batch_gradient_descent(x_data, y_data, w, b, 0.0000001, 1000000)

Epoch 1 with cost: 0.5954226371160484
Epoch 100 with cost: 0.5954197027866317
Epoch 1000 with cost: 0.5953930326541702
Epoch 10000 with cost: 0.5951268841889991
Epoch 100000 with cost: 0.592519647987565


([-0.004865616495539809, 1.13654419765688], -0.33931275004419825)

In [25]:
w,b = train_batch_gradient_descent(x_data, y_data, w, b, 0.0000001, 1000000)

Epoch 1 with cost: 0.5709852048107705
Epoch 100 with cost: 0.5709832149786357
Epoch 1000 with cost: 0.5709651288345847
Epoch 10000 with cost: 0.5707845878223877
Epoch 100000 with cost: 0.5690106767456337


([-0.005840241274094449, 1.3738471774117296], -0.44718732969308767)

In [29]:
w,b = train_batch_gradient_descent(x_data, y_data, w, b, 0.0000005, 1000000)

Epoch 1 with cost: 0.49662146605673574
Epoch 100 with cost: 0.4966198632622398
Epoch 1000 with cost: 0.49660529525590635
Epoch 10000 with cost: 0.4964598968477649
Epoch 100000 with cost: 0.4950330686603879


([-0.009966727824506375, 2.6569769985318787], -1.822723496544359)

In [50]:
function predict(x, y, w, b)
    # Print the model's prediction, followed by the true outcome
    if σ(w'x+b) >= 0.5
        println("Predicted to be Accepted")
    else
        println("Predicted NOT be Accepted")
    end
    y == 1 ? println("Accepted") : println("NOT Accepted")
end

predict (generic function with 1 method)

In [51]:
for i = 1:length(x_data)
    predict(x_data[i], y_data[i], w, b)
    println()
end

Predicted to be Accepted
Accepted

Predicted to be Accepted
Accepted

Predicted to be Accepted
NOT Accepted

Predicted to be Accepted
Accepted

Predicted to be Accepted
NOT Accepted

Predicted to be Accepted
Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted NOT be Accepted
Accepted

Predicted NOT be Accepted
Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted to be Accepted
Accepted

Predicted to be Accepted
Accepted

Predicted to be Accepted
NOT Accepted

Predicted NOT be Accepted
Accepted

Predicted to be Accepted
NOT Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted to be Accepted
Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted to be Accepted
Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted NOT be Accepted
NOT Accepted

Predicted to be Accepted
NOT Accepted

Predicted to be Accepted
Accepted

Predicted to be Accept

In [53]:
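# Redefine predict to return the predicted label (1 = accepted, 0 = not accepted);
# y is unused but kept so that calls with four arguments still work
function predict(x, y, w, b)
    return σ(w'x + b) >= 0.5 ? 1 : 0
end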

predict (generic function with 1 method)

In [54]:
# Each squared difference is 0 (correct) or 1 (wrong), so the mean is the misclassification rate
mean_error = sum((predict(x_data[i], y_data[i], w, b) - y_data[i])^2 for i = 1:length(x_data)) / length(x_data)

print(mean_error)

0.225
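Since predictions and labels are both 0 or 1, the value above is the fraction of misclassified samples. A single error rate does not say which kind of mistake the model makes, so one possible follow-up is to count the four outcomes separately. The sketch below uses made-up predictions and labels (not the notebook's data), and all variable names are hypothetical.

```julia
# Made-up predictions and labels, for illustration only (not the notebook's data)
predictions = [1, 1, 1, 0, 0, 1]
labels      = [1, 1, 0, 0, 1, 0]

# Count each combination of predicted and true label
true_pos  = count((predictions .== 1) .& (labels .== 1))
false_pos = count((predictions .== 1) .& (labels .== 0))
false_neg = count((predictions .== 0) .& (labels .== 1))
true_neg  = count((predictions .== 0) .& (labels .== 0))

# The misclassification rate is the share of false positives and false negatives
error_rate = (false_pos + false_neg) / length(labels)
println((true_pos, false_pos, false_neg, true_neg), " error rate: ", error_rate)
```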