## Logistic Regression Model from scratch for Divorce Prediction

### Logistic regression
Like linear regression, logistic regression represents the model as an equation.

Input values (x) are combined linearly using weights or coefficient values (referred to as w) to predict an output value (y). The key difference from linear regression is that the modeled output is a binary value (0 or 1) rather than a continuous numeric value, so the linear combination is passed through the sigmoid function to produce a prediction between 0 and 1.<br>

##### $\hat{y}(w, x) = \dfrac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_p x_p)}}$
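As a minimal sketch of the formula above, the prediction can be written with NumPy. The function names `sigmoid` and `hypothesis` and the two sample rows are illustrative assumptions, and `X` is assumed to already contain a leading column of ones so that `w[0]` plays the role of $w_0$.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def hypothesis(X, w):
    """Predicted probability of class 1 for each row of X.

    X is assumed to have a leading column of ones (the bias input).
    """
    return sigmoid(X @ w)


# Two made-up samples with one feature each, plus the bias column.
X = np.array([[1.0, 0.0],
              [1.0, 2.0]])
w = np.array([0.0, 1.0])

print(hypothesis(X, w))
```

The first sample has a linear combination of 0, so its predicted probability is exactly 0.5, the decision boundary.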

#### Dataset
The dataset is available at <strong>"data/divorce.csv"</strong> in the respective challenge's repo.<br>
<strong>Original Source:</strong> https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set. The dataset consists of questionnaire ratings given by people who are already divorced and by people who are happily married.<br><br>

[//]: # "The dataset is available at http://archive.ics.uci.edu/ml/machine-learning-databases/00520/data.zip. Unzip the file and use either CSV or xlsx file.<br>"


#### Features (X)
1. Atr1 - If one of us apologizes when our discussion deteriorates, the discussion ends. (Numeric | Range: 0-4)
2. Atr2 - I know we can ignore our differences, even if things get hard sometimes. (Numeric | Range: 0-4)
3. Atr3 - When we need it, we can take our discussions with my spouse from the beginning and correct it. (Numeric | Range: 0-4)
4. Atr4 - When I discuss with my spouse, to contact him will eventually work. (Numeric | Range: 0-4)
5. Atr5 - The time I spent with my wife is special for us. (Numeric | Range: 0-4)
6. Atr6 - We don't have time at home as partners. (Numeric | Range: 0-4)
7. Atr7 - We are like two strangers who share the same environment at home rather than family. (Numeric | Range: 0-4)

&emsp;.<br>
&emsp;.<br>
&emsp;.<br>

54. Atr54 - I'm not afraid to tell my spouse about her/his incompetence. (Numeric | Range: 0-4)
<br><br>
See the original dataset source linked above for more details.

#### Target (y)
55. Class: (Binary | 1 => Divorced, 0 => Not divorced yet)

#### Objective
To gain an understanding of logistic regression by implementing the model from scratch.

#### Tasks
- Download and load the data (the CSV file uses ';' as the delimiter)
- Define the X matrix (independent features) and the y vector (target feature) as numpy arrays
- Add a column of ones at position 0 (pandas.DataFrame.insert function). This column is the input for the bias $w_0$
- Print the shape and datatype of both X and y
[//]: # "- Dataset contains missing values, hence fill the missing values (NA) by performing missing value prediction"
[//]: # "- Since the all the features are in higher range, columns can be normalized into smaller scale (like 0 to 1) using different methods such as scaling, standardizing or any other suitable preprocessing technique (sklearn.preprocessing.StandardScaler)"
- Split the dataset into 85% for training and the remaining 15% for testing (sklearn.model_selection.train_test_split function)
- Follow the code cells to implement simple logistic regression from scratch
    - Write a hypothesis function to predict values
    - Write a function for calculating the cross-entropy loss (or log loss)
    - Write a function to return the gradients for the given weights
    - Perform gradient descent with the help of the above functions
    - Write a function for calculating accuracy
- Train the model on the training set using your implementation
- Predict the output for the testing set samples and compute the accuracy
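The data preparation tasks above can be sketched as follows. This is only an illustration: the CSV text is a tiny made-up stand-in for `data/divorce.csv` (same ';' delimiter, far fewer rows and columns), and the column name `bias` and the `random_state` value are arbitrary assumptions.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real file; only the ';' delimiter matches it.
csv_text = """Atr1;Atr2;Atr3;Class
2;2;4;1
4;4;4;1
0;1;0;0
1;0;1;0
3;3;2;1
0;0;1;0
4;3;3;1
1;1;0;0
"""

df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Separate the target, then add the column of ones at position 0.
y = df["Class"].to_numpy()
features = df.drop(columns="Class")
features.insert(0, "bias", 1)  # "bias" is a hypothetical column name
X = features.to_numpy()

# 85% training / 15% testing; random_state is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

print(f"X: {X.shape}, y: {y.shape}")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
```

With the real dataset, the same steps should yield the `(170, 55)` feature matrix mentioned in the expected output: 54 attributes plus the column of ones.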

#### Further Fun (will not be evaluated)
- Play with learning rate and max_iterations
- Test whether a label encoder or a one-hot encoder gives better results for categorical features.
- Run the model with different feature scaling methods (e.g. scaling, normalization, standardization using sklearn)
- Train the model with different dataset split sizes such as 60-40, 50-50, 70-30, 80-20, 90-10, 95-5 etc.
- Shuffle the training samples with different random seed values in the train_test_split function. Check the model error on the testing data for each setup.
- Write functions for other classification metrics such as confusion matrix, precision, recall and F1 score.
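For the last item, one possible NumPy sketch is shown below. The function names `confusion_counts` and `precision_recall_f1` are hypothetical, the labels are made up, and returning 0.0 when a denominator is zero is an assumed convention rather than the only reasonable one.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

print(confusion_counts(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
```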


#### Helpful links
- How Logistic Regression works: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- Feature Scaling: https://scikit-learn.org/stable/modules/preprocessing.html
- Training testing splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Use slack for doubts: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g


In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [46]:
# Download the dataset from the source
!wget _URL_

In [47]:
# Unzip the file to the local cloud directory
!unzip _filename_

In [59]:
# Read the data from local cloud directory

# Set the delimiter to semicolon (;) in case of unexpected results

In [87]:
# Print the dataframe rows just to see some samples


In [None]:
# Define X (input features) and y (output feature) 
X = 
y = 

In [None]:
X_shape = 
X_type  = 
y_shape = 
y_type  = 
print(f'X: Type-{X_type}, Shape-{X_shape}')
print(f'y: Type-{y_type}, Shape-{y_shape}')

<strong>Expected output: </strong><br><br>

X: Type-<class 'numpy.ndarray'>, Shape-(170, 55)<br>
y: Type-<class 'numpy.ndarray'>, Shape-(170,)

In [64]:
# Fill the missing values (IF ANY)

In [65]:
# Perform feature scaling


In [70]:
# Split the dataset into training and testing here
X_train, X_test, y_train, y_test = 

In [None]:
# Print the shape of features and target of training and testing: X_train, X_test, y_train, y_test
X_train_shape = 
y_train_shape = 
X_test_shape  = 
y_test_shape  = 

print(f"X_train: {X_train_shape} , y_train: {y_train_shape}")
print(f"X_test: {X_test_shape} , y_test: {y_test_shape}")
assert (X_train.shape[0]==y_train.shape[0] and X_test.shape[0]==y_test.shape[0]), "Check your splitting carefully"

##### Let us start implementing logistic regression from scratch. Just follow the code cells and see the hints if required.
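As a hint for the loss and gradient steps, here is a minimal standalone sketch of binary cross-entropy and gradient descent on made-up data. It is not the class you are asked to complete: the function names, the toy arrays, the learning rate and the `eps` clipping (an assumed safeguard against `log(0)`) are all illustrative choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y_pred, y, eps=1e-12):
    """Mean binary cross-entropy between predictions and labels."""
    # Clip so that log() never receives exactly 0 (assumed safeguard).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))


def gradient(X, y, theta):
    """Gradient of the mean cross-entropy loss with respect to theta."""
    y_pred = sigmoid(X @ theta)
    return X.T @ (y_pred - y) / y.size


# Toy data: a column of ones plus a single made-up feature.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
lr = 0.1  # arbitrary learning rate for this toy problem
losses = []
for _ in range(200):
    losses.append(log_loss(sigmoid(X @ theta), y))
    # Descent step: move against the gradient to reduce the loss.
    theta -= lr * gradient(X, y, theta)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Because the weights start at zero, every initial prediction is 0.5 and the first loss equals $\ln 2$. The loss should then decrease as the weights are updated.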

##### We will build a LogisticRegression class

In [79]:
class LogisticRegression:
    
    def __init__(self, lr=0.01, num_iter=100, fit_intercept=True):
        #Initialising all the parameters
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.losses= []
        
    
    #Make an add_intercept function which adds a column of ones to X
    def add_intercept(self, X):
        
        #put appropriate shape inside np.ones(())
        intercept = np.ones(())
        
        #now concatenate it with the original X along suitable axis
        return np.concatenate((intercept, X), axis=)
    
    #Define the sigmoid function
    def sigmoid(self, z):
        
        ## Code Starts here
       
        ## Code ends here
    
    
    # Define the loss function
    def loss(self, y_pred, y):
        # Use binary cross entropy
        
        return # Code here #
    
    
    # Define the fit function, which performs gradient descent
    def fit(self, X, y):
        
        # If fit_intercept is true, we add a column of ones to X
        if self.fit_intercept:
            X = self.add_intercept(X)
        
        ## Code starts here
        
        # weights initialization - pass the appropriate shape
        self.theta = np.zeros()
        
        
        # Gradient descent
        for i in range(self.num_iter):
            
            # z is the linear combination of the features and the weights
            # define z
            z = 
            
            # final predicted output will be found by taking sigmoid of z
            y_pred = self.sigmoid(z)
            
            # define the gradient
            gradient = 
            
            #Update the weights
            self.theta -= self.lr * gradient
            
           
            #Calculating loss and appending it to the losses array
            loss = self.loss(y_pred, y)
            self.losses.append(loss)
    
        ## Code ends here
    
    # Define the predict_prob function, which predicts the probability of class 1 by applying the sigmoid to the dot product of features and weights
    def predict_prob(self, X):
        
        #Adding intercept if fit_intercept is True
        if self.fit_intercept:
            X = self.add_intercept(X)
            
        ## Code starts here
    
        return self.sigmoid()
    
        ## Code ends here
    
    def predict(self, X):
        
        # Round the probability to get the predicted class (0 or 1)
        return self.predict_prob(X).round()

##### Congratulations! You have implemented logistic regression from scratch. Let's see this in action.

In [80]:
# Initialise the model
model=

In [81]:
#Fit the model to the training data
model.fit()

In [None]:
# Plot the loss curve
plt.plot(range(len(model.losses)), model.losses)
plt.title("Loss curve")
plt.xlabel("Iteration num")
plt.ylabel("Loss")
plt.show()

In [83]:
#Make predictions on test data
y_pred = model.predict()

In [85]:
# Write a function for calculating accuracy
def accuracy(y_true,y_pred):
    ## Code Starts here
    
    ## Code ends here
    

In [90]:
# Print test accuracy
