# Exercise 2 - Julia: Logistic Regression

## *Part One*: Logistic regression without regularization

Predicting whether a student will be admitted to a university based on two exam scores.

We begin with package imports, data loading, and an initial visualization.

In [61]:
using DataFrames
using Plots
using GLM  # For comparing answers

df = readtable("ex2/ex2data1.txt", header=false)
names!(df, [:Exam1Score, :Exam2Score, :Admitted])

# Adding the intercept term
df[:x0] = ones(nrow(df))

X = df[[:x0, :Exam1Score, :Exam2Score]]
y = df[:Admitted]

# An array of 0s for starting values of theta to be used in many functions
initialTheta = zeros(3)

head(df)

| Row | Exam1Score | Exam2Score | Admitted | x0 |
|---|---|---|---|---|
| 1 | 34.62365962451697 | 78.0246928153624 | 0 | 1.0 |
| 2 | 30.28671076822607 | 43.89499752400101 | 0 | 1.0 |
| 3 | 35.84740876993872 | 72.90219802708364 | 0 | 1.0 |
| 4 | 60.18259938620976 | 86.30855209546826 | 1 | 1.0 |
| 5 | 79.0327360507101 | 75.3443764369103 | 1 | 1.0 |
| 6 | 45.08327747668339 | 56.3163717815305 | 0 | 1.0 |


In [62]:
# Plotting the data
# Subsetting to plot each group separately in order to get separate colors
admitted = df[(df[:Admitted].==1),:]
notAdmitted = df[(df[:Admitted].==0),:]

scatter(admitted[:Exam1Score], admitted[:Exam2Score],
    label="Admitted", xlab="Exam 1 Score", ylab="Exam 2 Score")
scatter!(notAdmitted[:Exam1Score], notAdmitted[:Exam2Score],
    label="Not Admitted")

### Sigmoid Function

$g(z) = \frac{1}{1+e^{-z}}$

Converts $z$ into a value between 0 and 1

In [69]:
function sigmoid(z)
    # Maps a real number to a value between 0 and 1
    return 1 / (1 + exp(-z))
end

# Plotting values to validate the function is working correctly
plot(collect(-10:10),
     sigmoid.(collect(-10:10)))  # sigmoid.(x) applies the function to each element of the array





### Logistic Regression Hypothesis

$h_\theta(x) = g(\theta^Tx)$

- Notation:

    - $g$: Sigmoid function

    - $\theta^T$: Transposed parameters
       
        - E.g.: $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$, so $\theta^T = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix}$

In [68]:
function logisticHypothesis(theta, X)
    # Calculates the hypothesis for X given values of
    # theta for logistic regression
    X = convert(Array, X)
    h = sigmoid.(X * theta)
    return h
end

logisticHypothesis(initialTheta, X)[1:5]



5-element Array{Float64,1}:
 0.5
 0.5
 0.5
 0.5
 0.5



### Cost Function

$J(\theta) = \frac{1}{m} \sum_{i=1}^m[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$

- Notation:

    - $m$: Number of records

    - $h_\theta$: Logistic hypothesis $(h)$ given specific values of $\theta$ for parameters
    
    - $i$: Index of the record (e.g. if $i = 46$, then the 46th row)

In [90]:
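function costFunction(theta, X, y)
    # Computes cost for logistic regression
    X = convert(Array, X)
    y = convert(Array, y)
    m = length(y)
    
    h = logisticHypothesis(theta, X)
    # Dotted calls apply log and the arithmetic to each element;
    # the name totalCost avoids shadowing Base's error function
    totalCost = sum(-y .* log.(h) .- (1 .- y) .* log.(1 .- h))
    J = (1/m)*totalCost
    return J
end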

costFunction(initialTheta, X, y)



0.6931471805599452
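
A quick check on the value above: with $\theta = 0$, every record has $h_\theta(x^{(i)}) = g(0) = 0.5$, so both log terms are equal and the labels drop out:

$J(0) = \frac{1}{m} \sum_{i=1}^m[-y^{(i)}\log(0.5) - (1-y^{(i)})\log(0.5)] = -\log(0.5) = \log(2) \approx 0.6931$

This matches the output, which suggests the cost function behaves as expected at the all-zero starting point.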



### Gradient

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$

- Notation:

    - $\partial$: Partial derivative
    
    - $J(\theta)$: Cost given $\theta$

    - $m$: Number of records
    
    - $h_\theta$: Logistic hypothesis $(h)$ given specific values of $\theta$ for parameters
    
    - $i$: Index of the record (e.g. if $i = 46$, then the 46th row)
    
This function is not used to find the optimal values of $\theta_j$; it only illustrates how the gradient is computed.

In [101]:
function logisticGradient(theta, X, y)
    # Computes the gradient for logistic regression
    X = convert(Array, X)
    y = convert(Array, y)
    m = length(y)
    
    h = logisticHypothesis(theta, X)
    gradient = (1/m) * (transpose(X) * (h - y))
    return gradient
end

logisticGradient(initialTheta, X, y)



3-element Array{Float64,1}:
  -0.1   
 -12.0092
 -11.2628
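
One way to gain confidence in the gradient formula is to compare it against central finite differences of the cost. The sketch below is self-contained and uses made-up toy data rather than the exam scores; the helper names (`toy_cost`, `analytic_gradient`, `numeric_gradient`) are hypothetical and plain arrays are assumed instead of a DataFrame.

```julia
# Gradient check on toy data (assumed values, not the exam data set)
sigmoid(z) = 1 / (1 + exp(-z))

function toy_cost(theta, X, y)
    h = sigmoid.(X * theta)
    return sum(-y .* log.(h) .- (1 .- y) .* log.(1 .- h)) / length(y)
end

# Analytic gradient: (1/m) * X' * (h - y)
function analytic_gradient(theta, X, y)
    h = sigmoid.(X * theta)
    return (X' * (h .- y)) / length(y)
end

# Central finite differences, one parameter at a time
function numeric_gradient(theta, X, y; step=1e-6)
    grad = zeros(length(theta))
    for j in 1:length(theta)
        shift = zeros(length(theta))
        shift[j] = step
        grad[j] = (toy_cost(theta .+ shift, X, y) - toy_cost(theta .- shift, X, y)) / (2 * step)
    end
    return grad
end

# Intercept column plus two features
X_toy = [1.0 0.5 1.5; 1.0 2.0 0.3; 1.0 1.2 2.2; 1.0 3.1 0.9]
y_toy = [0.0, 1.0, 0.0, 1.0]
theta_toy = [0.1, -0.2, 0.3]

# Largest disagreement between the two gradients
max_gap = maximum(abs.(analytic_gradient(theta_toy, X_toy, y_toy) .-
                       numeric_gradient(theta_toy, X_toy, y_toy)))
```

If the formula is implemented correctly, `max_gap` should be tiny compared with the gradient entries themselves.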



Finding the optimal values of $\theta_j$ for the cost function with a general-purpose optimizer, the role that the `optim` function plays in base R and `fminunc` plays in MATLAB.
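
The optimization call itself is not shown here, so the following is only a sketch of one package-free way to minimize the cost: Newton's method. This is a swapped-in technique, not the optimizer the text refers to. The function name `newton_logistic` and the toy data are assumptions, and the toy labels are deliberately not perfectly separable so that a finite optimum exists.

```julia
using LinearAlgebra  # for norm

sigmoid(z) = 1 / (1 + exp(-z))

# Newton's method for unregularized logistic regression (illustrative sketch)
function newton_logistic(X, y; iterations=25, tolerance=1e-10)
    theta = zeros(size(X, 2))
    m = length(y)
    for _ in 1:iterations
        h = sigmoid.(X * theta)
        gradient = (X' * (h .- y)) / m
        norm(gradient) < tolerance && break
        # Hessian of the cost: (1/m) * X' * diag(h .* (1 - h)) * X
        hessian = (X' * (X .* (h .* (1 .- h)))) / m
        theta -= hessian \ gradient
    end
    return theta
end

# Toy data: intercept column plus one feature, with overlapping classes
X_toy = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0; 1.0 5.0; 1.0 6.0]
y_toy = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

theta_hat = newton_logistic(X_toy, y_toy)

# At the optimum the gradient should be (numerically) zero
final_gradient = (X_toy' * (sigmoid.(X_toy * theta_hat) .- y_toy)) / length(y_toy)
```

Newton's method needs few iterations on small problems like this one, but it can diverge when the classes are perfectly separable, which is one reason a library optimizer is usually the safer choice.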

Comparing the obtained parameters to what Julia's **[insert function here]** function provides

**[Comment on if they are similar]**

Calculating the class probability and generating predictions of acceptance using values of $\theta_j$ obtained from the optimization function

The outputs from logistic regression are just the class probability, or $P(y = 1 \mid x; \theta)$, so we are predicting the classes (accepted or not) as follows:

$Prediction(y \mid x; \theta) = \begin{cases} 1, \quad\mbox{ if } P(y = 1 \mid x; \theta) > 0.50 \\ 0, \quad\mbox{ if } P(y = 1 \mid x; \theta) \leq 0.50 \end{cases} $
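
The thresholding rule above can be sketched in a few lines. The function name `predict_class` and the toy inputs are hypothetical; `X` is assumed to be a plain matrix whose first column is the intercept.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Returns 1 where the class probability exceeds the threshold, otherwise 0
function predict_class(theta, X; threshold=0.5)
    probabilities = sigmoid.(X * theta)
    return Int.(probabilities .> threshold)
end

# Toy example: the three rows give theta' * x = 2, -1, and 0
X_toy = [1.0 2.0; 1.0 -1.0; 1.0 0.0]
theta_toy = [0.0, 1.0]
predictions = predict_class(theta_toy, X_toy)
```

The last row has a probability of exactly 0.50, so it falls into the $\leq 0.50$ case and is predicted as 0.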

Plotting the decision boundary over the data
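
With only two features, the decision boundary is the line where $\theta^Tx = 0$, i.e. where the predicted probability is exactly 0.50. Solving $\theta_0 + \theta_1x_1 + \theta_2x_2 = 0$ for $x_2$ gives the Exam 2 score on the boundary for any Exam 1 score. A minimal sketch, with a hypothetical helper name and made-up parameter values rather than fitted ones:

```julia
# Exam 2 score on the decision boundary for a given Exam 1 score.
# theta is ordered as [intercept, Exam 1 weight, Exam 2 weight].
boundary_exam2(theta, exam1) = -(theta[1] + theta[2] * exam1) / theta[3]

theta_example = [-50.0, 0.5, 0.5]   # made-up values, not the fitted parameters
exam1_range = [30.0, 90.0]
exam2_on_boundary = [boundary_exam2(theta_example, score) for score in exam1_range]
```

Two points are enough to draw the line, so the pair `exam1_range`, `exam2_on_boundary` could be added on top of the earlier scatter plot.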

---

## *Part Two*: Logistic regression with regularization

Predicting whether a microchip passes QA based on the results of two tests.

In [None]:
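# Loading the microchip data
# Assumption: the file is ex2/ex2data2.txt; the column names below are illustrative
chips = readtable("ex2/ex2data2.txt", header=false)
names!(chips, [:Test1Score, :Test2Score, :Passed])

# Plotting the data
# Subsetting to plot each group separately in order to get separate colors
passed = chips[(chips[:Passed].==1),:]
failed = chips[(chips[:Passed].==0),:]

scatter(passed[:Test1Score], passed[:Test2Score],
    label="Passed", xlab="Microchip Test 1", ylab="Microchip Test 2")
scatter!(failed[:Test1Score], failed[:Test2Score],
    label="Failed")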

### Feature Mapping

Maps the features into all polynomial terms of $x_1$ and $x_2$ up to the sixth power.  This allows for a more complex and nonlinear decision boundary.  

The feature space prior to feature mapping (3-dimensional vector): 

$\hspace{1cm} Feature(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$ 

The feature space after feature mapping:

$\hspace{1cm} mapFeature(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1x_2 \\ x_2^2 \\ x_1^3 \\ \vdots \\ x_1x_2^5 \\ x_2^6 \end{bmatrix}$

**Note:** I made a few adjustments to the Octave/MATLAB code provided for this assignment in order to keep the names of the polynomial terms.

Octave/MATLAB code:
```
degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);
    end
end
```
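
A possible Julia translation of the Octave loop above is sketched below. It is an assumption about what keeping the names could look like, not the notebook's actual implementation: the function name `map_feature` and the label format are hypothetical, and it returns the mapped matrix together with one label per column.

```julia
# Maps two feature vectors to all polynomial terms up to `degree`,
# returning the feature matrix and a label for each column
function map_feature(x1, x2; degree=6)
    columns = [ones(length(x1))]      # intercept column
    labels = ["1"]
    for i in 1:degree
        for j in 0:i
            push!(columns, (x1 .^ (i - j)) .* (x2 .^ j))
            push!(labels, "x1^$(i - j)*x2^$j")
        end
    end
    return hcat(columns...), labels
end

# Two toy records
mapped, term_labels = map_feature([0.5, -1.0], [2.0, 3.0])
```

For degree 6 this produces 28 columns: the intercept plus $2 + 3 + \dots + 7 = 27$ polynomial terms.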

### Regularized Cost Function

$J(\theta) = \frac{1}{m} \sum_{i=1}^m[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$

The only change from the cost function used earlier is the added regularization term:

#### Regularization Term

$\frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$

- Notation:

    - $\lambda$: The regularization weight.  A small $\lambda$ has little effect on the parameters, while a large $\lambda$ (e.g. $\lambda = 1{,}000$) shrinks the parameters toward 0.
    - $m$: Number of records
    - $j$: The index of the parameter.  E.g. $\theta_1$ is the coefficient for the Microchip Test #1 score

**Note:** $\theta_0$ should not be regularized, which is why the summation in the regularization term starts at $j = 1$
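
A minimal sketch of the regularized cost, assuming plain arrays and a hypothetical function name `regularized_cost`. Julia arrays are 1-indexed, so `theta[1]` plays the role of $\theta_0$ and is left out of the penalty. The toy data is made up.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

function regularized_cost(theta, X, y, lambda)
    m = length(y)
    h = sigmoid.(X * theta)
    unregularized = sum(-y .* log.(h) .- (1 .- y) .* log.(1 .- h)) / m
    # theta[1] is the intercept (theta_0 in the formula) and is not penalized
    penalty = (lambda / (2 * m)) * sum(theta[2:end] .^ 2)
    return unregularized + penalty
end

# Toy data: intercept column plus two features
X_toy = [1.0 0.5 1.5; 1.0 2.0 0.3; 1.0 1.2 2.2; 1.0 3.1 0.9]
y_toy = [0.0, 1.0, 0.0, 1.0]
theta_toy = [0.0, 1.0, 2.0]

# The difference between the two costs is the penalty alone:
# (2 / (2 * 4)) * (1^2 + 2^2) = 1.25
penalty_only = regularized_cost(theta_toy, X_toy, y_toy, 2.0) -
               regularized_cost(theta_toy, X_toy, y_toy, 0.0)
```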

### Regularized Gradient

$\frac{\partial J(\theta)}{\partial \theta_j} = \begin{cases} 
\hspace{0.25cm} \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} & \text{for}\ j = 0 \\
\Big(\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\Big) + \frac{\lambda}{m}\theta_j & \text{for}\ j \geq 1
\end{cases}$

This is the same as the earlier gradient except for the added regularization term:

#### Regularization Term

$\frac{\lambda}{m}\theta_j \hspace{0.5cm}$for $j \geq 1$

- Notation:

    - $\lambda$: The regularization weight.  A small $\lambda$ has little effect on the parameters, while a large $\lambda$ (e.g. $\lambda = 1{,}000$) shrinks the parameters toward 0.
    - $m$: Number of records
    - $j$: The index of the parameter.  E.g. $\theta_1$ is the coefficient for the Microchip Test #1 score
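
The two cases of the regularized gradient can be sketched as below, again assuming plain arrays and a hypothetical function name. The unregularized gradient is computed for every parameter, and the $\frac{\lambda}{m}\theta_j$ term is then added to all entries except the intercept.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

function regularized_gradient(theta, X, y, lambda)
    m = length(y)
    h = sigmoid.(X * theta)
    gradient = (X' * (h .- y)) / m
    # Skip index 1 (theta_0 in the formula): the intercept is not regularized
    gradient[2:end] .+= (lambda / m) .* theta[2:end]
    return gradient
end

# Toy data: intercept column plus two features
X_toy = [1.0 0.5 1.5; 1.0 2.0 0.3; 1.0 1.2 2.2; 1.0 3.1 0.9]
y_toy = [0.0, 1.0, 0.0, 1.0]
theta_toy = [0.5, 1.0, 2.0]

with_penalty = regularized_gradient(theta_toy, X_toy, y_toy, 4.0)
without_penalty = regularized_gradient(theta_toy, X_toy, y_toy, 0.0)
```

With $\lambda = 4$ and $m = 4$, the two results should differ by exactly $\theta_j$ for $j \geq 1$ and not at all for the intercept.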

Finding the optimal values of $\theta$.  This step takes longer to run because feature mapping expands the two test scores into 28 features.

Checking against Julia's **[Insert function here]** logistic regression

**[Comment on the difference]**

Lastly, comparing the accuracy between the two models.  Classification accuracy is just the percentage of records correctly classified (precision, recall, F1 score, etc. offer more nuanced information on performance), so we will have to calculate the class probabilities and assign predictions like we did for part one:
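
As a rough sketch of these metrics, the helpers below compute accuracy, precision, recall, and F1 score from vectors of 0/1 predictions and labels. The function names and the toy vectors are hypothetical; no model output from this notebook is used.

```julia
# Share of records where the prediction matches the label
accuracy(predicted, actual) = sum(predicted .== actual) / length(actual)

function precision_recall_f1(predicted, actual)
    true_positives = sum((predicted .== 1) .& (actual .== 1))
    precision_value = true_positives / sum(predicted .== 1)
    recall_value = true_positives / sum(actual .== 1)
    f1_value = 2 * precision_value * recall_value / (precision_value + recall_value)
    return precision_value, recall_value, f1_value
end

# Toy example: 3 of 5 predictions are correct
predicted_toy = [1, 0, 1, 1, 0]
actual_toy = [1, 0, 0, 1, 1]

toy_accuracy = accuracy(predicted_toy, actual_toy)
toy_precision, toy_recall, toy_f1 = precision_recall_f1(predicted_toy, actual_toy)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision and recall are both 2/3 while accuracy is 3/5.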

**[Comment on the difference]**

Plotting the decision boundary
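
After feature mapping the boundary is no longer a straight line, so one common approach is to evaluate $\theta^Tx$ over a grid of the two test scores and draw the contour where the value is 0. The sketch below only builds the grid of values; the helper names are hypothetical, and the example $\theta$ is hand-picked so that the boundary is the unit circle $x_1^2 + x_2^2 = 1$, not a fitted result.

```julia
# Polynomial terms for a single point, in the same order as the Octave loop
function map_feature_row(x1, x2; degree=6)
    terms = [1.0]
    for i in 1:degree, j in 0:i
        push!(terms, x1^(i - j) * x2^j)
    end
    return terms
end

# Matrix of theta' * mapFeature(x): rows follow x2_values, columns follow x1_values
function boundary_grid(theta, x1_values, x2_values)
    return [sum(theta .* map_feature_row(a, b)) for b in x2_values, a in x1_values]
end

# Hand-picked parameters: -1 + x1^2 + x2^2
theta_circle = zeros(28)
theta_circle[1] = -1.0   # intercept
theta_circle[4] = 1.0    # x1^2 term
theta_circle[6] = 1.0    # x2^2 term

grid = boundary_grid(theta_circle, [0.0, 2.0], [0.0, 1.0, 2.0])
```

Negative entries of `grid` lie on the predicted-0 side of the boundary, positive entries on the predicted-1 side, and the zero level set is the boundary itself.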