## Logistic Regression Model from scratch for Divorce Prediction

### Logistic regression
Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights or coefficient values (referred to as w) to predict an output value (y). A key difference from linear regression is that the output being modeled is a binary value (0 or 1) rather than a continuous numeric value: the linear combination is passed through the sigmoid function, which gives the predicted probability of class 1.<br>

##### $\hat{y}(w, x) = \dfrac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_p x_p)}}$
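
To make the equation above concrete, here is a minimal sketch that evaluates it for a single sample. The weights and feature values are made up for illustration and the helper name `sigmoid` is an assumption, not part of the notebook's required functions.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: w_0 is the bias, w_1 and w_2 belong to two features.
w = np.array([-1.0, 0.5, 0.25])
# One sample with a leading 1, so that w_0 acts as the intercept.
x = np.array([1.0, 2.0, 4.0])

z = w @ x            # linear part: -1 + 0.5*2 + 0.25*4 = 1.0
y_hat = sigmoid(z)   # predicted probability of class 1
print(z, y_hat)
```

A probability above 0.5 would usually be read as class 1, although the threshold is a design choice.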

#### Dataset
The dataset is available at <strong>"data/divorce.csv"</strong> in the respective challenge's repo.<br>
<strong>Original Source:</strong> https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set. The dataset is based on ratings from a questionnaire filled in by people who are already divorced and people who are happily married.<br><br>

[//]: # "The dataset is available at http://archive.ics.uci.edu/ml/machine-learning-databases/00520/data.zip. Unzip the file and use either CSV or xlsx file.<br>"


#### Features (X)
1. Atr1 - If one of us apologizes when our discussion deteriorates, the discussion ends. (Numeric | Range: 0-4)
2. Atr2 - I know we can ignore our differences, even if things get hard sometimes. (Numeric | Range: 0-4)
3. Atr3 - When we need it, we can take our discussions with my spouse from the beginning and correct it. (Numeric | Range: 0-4)
4. Atr4 - When I discuss with my spouse, to contact him will eventually work. (Numeric | Range: 0-4)
5. Atr5 - The time I spent with my wife is special for us. (Numeric | Range: 0-4)
6. Atr6 - We don't have time at home as partners. (Numeric | Range: 0-4)
7. Atr7 - We are like two strangers who share the same environment at home rather than family. (Numeric | Range: 0-4)

&emsp;.<br>
&emsp;.<br>
&emsp;.<br>
54. Atr54 - I'm not afraid to tell my spouse about her/his incompetence. (Numeric | Range: 0-4)
<br><br>
Take a look above at the source of the original dataset for more details.

#### Target (y)
55. Class: (Binary | 1 => Divorced, 0 => Not divorced yet)

#### Objective
To gain an understanding of logistic regression by implementing the model from scratch.

#### Tasks
- Download and load the data (csv file contains ';' as delimiter)
- Add a column at position 0 with all values=1 (pandas.DataFrame.insert function). Its weight acts as the bias $w_0$
- Define X matrix (independent features) and y vector (target feature) as numpy arrays
- Print the shape and datatype of both X and y
[//]: # "- Dataset contains missing values, hence fill the missing values (NA) by performing missing value prediction"
[//]: # "- Since the all the features are in higher range, columns can be normalized into smaller scale (like 0 to 1) using different methods such as scaling, standardizing or any other suitable preprocessing technique (sklearn.preprocessing.StandardScaler)"
- Split the dataset into 85% for training and the remaining 15% for testing (sklearn.model_selection.train_test_split function)
- Follow the code cells to implement simple logistic regression from scratch
    - Write the hypothesis function to predict values
    - Write a function for calculating the cross entropy loss (or log loss)
    - Write a function to return the gradients for given weights
    - Perform gradient descent using the functions above
    - Write a function for calculating accuracy
- Train the model on the training set using your implementation
- Predict the output for testing set samples and compute accuracy
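
The data preparation steps in the list above can be sketched on a tiny in-memory CSV. This is only an illustration under stated assumptions: the three-column table, the column name `bias`, and all `toy`/`X_toy` names are invented here, and the real file has 54 attribute columns plus `Class`.

```python
import io
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the real file: ';' as delimiter, last column is the target.
csv_text = "Atr1;Atr2;Class\n" + "\n".join(
    f"{i % 5};{(i * 2) % 5};{i % 2}" for i in range(20)
)
toy = pd.read_csv(io.StringIO(csv_text), sep=";")

# Column of ones at position 0; its weight plays the role of the bias w_0.
toy.insert(0, "bias", 1)

# Features (bias column included) and target as numpy arrays.
X_toy = toy.drop(columns="Class").to_numpy()
y_toy = toy["Class"].to_numpy()

# 85% / 15% split; random_state only makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=0
)
print(X_toy.shape, y_toy.shape, X_tr.shape, X_te.shape)
```

Inserting the column of ones before converting to numpy keeps the bias handling in one place, which is why the insert comes first in this sketch.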

#### Further Fun (will not be evaluated)
- Play with learning rate and max_iterations
- Test whether a label encoder or a one hot encoder for categorical features gives better results.
- Running model with different feature scaling methods (i.e. scaling, normalization, standardization etc using sklearn)
- Training model with different sizes of dataset splitting such as 60-40, 50-50, 70-30, 80-20, 90-10, 95-5 etc.
- Shuffling of training samples with different random seed values in the train_test_split function. Check the model error for the testing data for each setup.
- Write functions for other classification metrics such as confusion matrix, precision, recall and F1 score.
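
For the last item in the list above, one possible starting point is sketched below. The function names and the demo labels are hypothetical, and the convention of returning 0.0 when a denominator is zero is an assumption (other conventions exist).

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0]
print(confusion_counts(y_true_demo, y_pred_demo))
print(precision_recall_f1(y_true_demo, y_pred_demo))
```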


#### Helpful links
- How Logistic Regression works: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- Feature Scaling: https://scikit-learn.org/stable/modules/preprocessing.html
- Training testing splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Use slack for doubts: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [None]:
# Download the dataset from the source
!wget _URL_

In [1]:
# Unzip the file to the local cloud directory
!unzip _filename_

In [None]:
# Read the data from local cloud directory
data = 

# Set delimiter to semicolon(;) in case of unexpected results

In [None]:
# Add column which has all 1s
# The idea is that weight corresponding to this column is equal to intercept
# This way it is efficient and easier to handle the bias/intercept term


In [None]:
# Print the dataframe rows just to see some samples


In [None]:
# Define X (input features) and y (output feature) 
X = 
y = 

In [None]:
X_shape = 
X_type  = 
y_shape = 
y_type  = 
print(f'X: Type-{X_type}, Shape-{X_shape}')
print(f'y: Type-{y_type}, Shape-{y_shape}')

<strong>Expected output: </strong><br><br>

X: Type-<class 'numpy.ndarray'>, Shape-(170, 55)<br>
y: Type-<class 'numpy.ndarray'>, Shape-(170,)

In [None]:
# Perform missing value prediction



In [None]:
# Perform feature scaling



In [None]:
# Split the dataset into training and testing here
X_train, X_test, y_train, y_test = 

In [None]:
# Print the shape of features and target of training and testing: X_train, X_test, y_train, y_test
X_train_shape = 
y_train_shape = 
X_test_shape  = 
y_test_shape  = 

print(f"X_train: {X_train_shape} , y_train: {y_train_shape}")
print(f"X_test: {X_test_shape} , y_test: {y_test_shape}")
assert (X_train.shape[0]==y_train.shape[0] and X_test.shape[0]==y_test.shape[0]), "Check your splitting carefully"

##### Let us start implementing logistic regression from scratch. Just follow code cells, see hints if required.
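
As a hint for the cells that follow, the standard formulas for logistic regression are summarized here. They are written under the assumption that the loss is averaged over the $m$ training samples (the notebook does not state whether a mean or a sum is expected), with $\alpha$ denoting the learning rate and $x_{ij}$ the value of column $j$ for sample $i$:

$$L(w) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

$$\frac{\partial L}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)\,x_{ij}, \qquad w_j \leftarrow w_j - \alpha\,\frac{\partial L}{\partial w_j}$$

Because the column of ones gives $x_{i0} = 1$ for every sample, the case $j = 0$ is the gradient for the bias $w_0$.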

In [None]:
def predict(X, weights):

    ### START CODE HERE ###
    y_pred = 
    ### END CODE HERE ###
    
    assert (y_pred.shape==(X.shape[0],)), 'Wrong implementation of predict function. Check carefully'
    
    return y_pred

In [None]:
def cross_entropy_loss(y_true, y_pred):
    '''
    y_true : (m,)
    y_pred : (m,)
    
    cross entropy loss (or log loss)
    '''
    ### START CODE HERE ###
    loss = 
    ### END CODE HERE ###
    
    return loss

In [None]:
def gradient(X_train, y_train, y_pred):

    # Initialize the gradient vector: one entry per column of X_train (bias column included)
    grad = np.zeros(X_train.shape[1])
    
    ### START CODE HERE ###
    
    grad[0] = 
    grad[1] = 
    
    ### END CODE HERE ###
    
    return grad

In [None]:
def gradient_descent(X_train, y_train, learning_rate=0.01, max_iterations=100):

    # Initialise the weights vector with random values, one per column of X_train
    weights = np.random.rand(X_train.shape[1])
    # Initialize a list to record all the losses 
    losses  = []
    
    ### START CODE HERE ###
    
    ### END CODE HERE ###
    
    return weights, losses

In [None]:
def accuracy(y_pred, y_test):
    
    ### START CODE HERE ###
    accuracy = 
    ### END CODE HERE ###
    
    return accuracy

##### Congratulations! You have implemented logistic regression from scratch. Let's see this in action.
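
Before running on the real data, it can help to know roughly what a working training loop looks like. The sketch below trains on synthetic, linearly separable data. Everything with a `_demo` suffix is hypothetical and independent of the functions written above, and the mean cross-entropy loss, the small `eps` inside the logarithms, and the 0.5 threshold are assumptions of this sketch rather than requirements of the exercise.

```python
import numpy as np

def sigmoid_demo(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_demo(X, y, learning_rate=0.1, max_iterations=500, seed=0):
    """Batch gradient descent on the mean cross-entropy loss (toy version)."""
    rng = np.random.default_rng(seed)
    weights = rng.random(X.shape[1])   # one weight per column, bias included
    losses = []
    eps = 1e-12                        # keeps log() away from 0
    for _ in range(max_iterations):
        y_hat = sigmoid_demo(X @ weights)
        loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
        losses.append(loss)
        grad = X.T @ (y_hat - y) / X.shape[0]   # gradient of the mean loss
        weights -= learning_rate * grad
    return weights, losses

# Synthetic data: the label is 1 when the single feature is positive.
rng = np.random.default_rng(42)
feature = rng.uniform(-3, 3, size=200)
X_demo = np.column_stack([np.ones_like(feature), feature])  # bias column first
y_demo = (feature > 0).astype(float)

w_demo, losses_demo = fit_demo(X_demo, y_demo)
pred_demo = (sigmoid_demo(X_demo @ w_demo) >= 0.5).astype(float)
acc_demo = np.mean(pred_demo == y_demo)
print(losses_demo[0], losses_demo[-1], acc_demo)
```

If your own loss curve on the divorce data does not decrease in a similar way, the learning rate or the sign of the gradient update is a good first place to look.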

In [None]:
# Perform gradient descent
optimal_weights, losses = gradient_descent(X_train, y_train)

In [None]:
# Print final loss
print("Cross Entropy loss:", losses[-1])

In [None]:
# Plot the loss curve
plt.plot(range(len(losses)), losses)
plt.title("Loss curve")
plt.xlabel("Iteration num")
plt.ylabel("Loss")
plt.show()

In [None]:
# Make predictions for testing set using trained weights
y_pred = predict(X_test, optimal_weights)

In [None]:
# Calculate accuracy score for the testing set
# (stored under a new name so the accuracy() function is not overwritten)
test_accuracy = 