## Library
We use the Turi Create library to implement the Logistic Regression model.

In [1]:
import turicreate as tc

## Data
We will use the same provided [LendingClub](https://www.lendingclub.com/) dataset.

In [2]:
loans = tc.SFrame('../data/lending-club-data.sframe/')

## Target Column Definition

The target (label) column of the dataset is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

We reassign the target so that:
* **+1** means a safe loan,
* **-1** means a risky (bad) loan.

We store this in a new column called `safe_loans` and use it as the `target` column.

In [3]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')

target = 'safe_loans' # prediction target (y) (+1 means safe, -1 is risky)

## Features Selection
As in the previous assignment, we use a subset of the features (categorical and numeric). They are **described in the code comments** below:

In [4]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]
     
# Extract the feature columns and target column
loans = loans[features + [target]]

## Class Balancing
One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we undersample the larger class (safe loans) to balance the dataset, which means throwing away many data points. We use `seed=1` so everyone gets the same results.

We do this so that the algorithm learns both classes equally instead of favoring the majority class, which helps it make more reliable predictions.

In [5]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed = 1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)
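
The ratio logic above can be sketched without Turi Create. The following minimal example uses made-up class counts and the standard-library `random` module. It assumes each majority-class row is kept independently with the sampling probability, so the resulting class sizes are only approximately equal. The helper name `undersample` and the toy data are illustrative, not part of the notebook.

```python
import random

def undersample(majority, minority, seed=1):
    """Keep each majority row with probability len(minority) / len(majority)."""
    fraction = len(minority) / float(len(majority))
    rng = random.Random(seed)
    kept = [row for row in majority if rng.random() < fraction]
    return kept, fraction

# Hypothetical toy data: 8000 "safe" rows (+1) and 2000 "risky" rows (-1).
safe = [+1] * 8000
risky = [-1] * 2000

safe_sampled, fraction = undersample(safe, risky)
balanced = risky + safe_sampled
share_safe = len(safe_sampled) / float(len(balanced))

print("sampling fraction:", fraction)
print("share of safe loans after balancing: %.3f" % share_safe)
```

Because the sampling is random, the share of safe loans lands close to, but not exactly at, one half.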

## Data Splitting
About 80% of the balanced data (`loans_data`) is randomly assigned to the training set (`train_data`) and the remaining 20% to the test set (`test_data`). We use `seed=1` so everyone gets the same results.

In [6]:
train_data, test_data = loans_data.random_split(.8, seed = 1)

## Logistic Regression Model Building
We build the model with Turi Create's `logistic_classifier.create` function. The parameters are:

* `dataset`: the dataset input for training.

* `target`: name of the column containing the target variable, which is `safe_loans`.

* `features`: list of the names of the feature columns, which is `features`.

* `validation_set`: a dataset for monitoring the model’s generalization performance. Here it is set to `None`, so no validation set is used during training and `test_data` is kept for the final evaluation.

* `seed`: seed for random number generation. Set this value to ensure that the same model is created every time. 

* `max_iterations`: the maximum number of allowed passes through the data, set to 100 here.

* `solver`: name of the solver used to fit the model. The default value, `auto`, automatically chooses the best solver for the data and model parameters. In this case it is Newton's method, which works best for datasets with plenty of examples and few features (long datasets).


The remaining parameters are left at their default values:

* `l2_penalty`: the larger this weight, the more the model coefficients shrink toward 0. The default value is 0.01.

* `l1_penalty`: the larger this weight, the more the model coefficients shrink toward 0. The default weight of 0 prevents any features from being discarded.

* `lbfgs_memory_level`: because Newton's method is applied, this parameter is not needed.

* `feature_rescaling`: feature rescaling is an important pre-processing step that ensures all features are on the same scale. Categorical (string) features are rescaled through the dummy variables that are used to represent them. The coefficients are returned in the original scale of the problem. This process is particularly useful when features vary widely in their ranges.

* `convergence_threshold`: convergence is tested using the variation in the training objective, calculated as the difference between the objective values of two consecutive steps. Consider reducing this below the default value (0.01) for a more accurately trained model. Beware of overfitting (i.e., a model that works well only on the training data) if this parameter is set to a very low value.

* `step_size`: the default is 1.0, which is an aggressive setting; reducing this parameter may speed up model training.

* `class_weights`: weights the examples in the training data according to the given class weights. If set to `None`, all classes have weight one. The `auto` mode sets each class weight to be inversely proportional to the number of training examples in that class.
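
To make the effect of `l2_penalty` and `l1_penalty` more concrete, here is a minimal sketch of a penalized logistic loss. It assumes the common formulation in which the penalties are added to the negative log-likelihood and the intercept is not penalized. Turi Create's exact objective and scaling may differ, and the function name `penalized_loss` and the toy data are illustrative only.

```python
import math

def penalized_loss(coefs, rows, labels, l2_penalty=0.01, l1_penalty=0.0):
    """Negative log-likelihood plus L1/L2 penalties on the non-intercept coefficients.

    Assumption: labels are +1 / -1 and coefs[0] is the (unpenalized) intercept.
    """
    nll = 0.0
    for x, y in zip(rows, labels):
        score = coefs[0] + sum(w * v for w, v in zip(coefs[1:], x))
        nll += math.log(1.0 + math.exp(-y * score))
    l2 = sum(w * w for w in coefs[1:])
    l1 = sum(abs(w) for w in coefs[1:])
    return nll + l2_penalty * l2 + l1_penalty * l1

# Hypothetical toy data: two numeric features, labels in {+1, -1}.
rows = [(1.0, 2.0), (2.0, 0.5), (0.5, 1.5), (3.0, 1.0)]
labels = [+1, -1, +1, -1]
coefs = [0.1, -1.0, 1.5]  # intercept, then one weight per feature

for l2 in (0.0, 0.01, 10.0):
    loss = penalized_loss(coefs, rows, labels, l2_penalty=l2)
    print("l2_penalty=%s -> loss=%.4f" % (l2, loss))
```

For the same coefficients, a larger penalty gives a larger loss, so the optimizer is pushed toward smaller coefficients.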

In [7]:
logistic_regression_model = tc.logistic_classifier.create(dataset = train_data,
                                                            target = target,
                                                            features = features,
                                                            validation_set = None,
                                                            seed = 1,
                                                            solver = 'auto',
                                                            max_iterations=100
                                                            )

Here is the summary of the model after building:

In [8]:
logistic_regression_model.summary()

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 63
Number of examples             : 37224
Number of classes              : 2
Number of feature columns      : 12
Number of unpacked features    : 12

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 5
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 1.1528

Settings
--------
Log-likelihood                 : 23377.019

Highest Positive Coefficients
-----------------------------
sub_grade[A1]                  : 0.962
sub_grade[A2]                  : 0.7342
sub_grade[A3]                  : 0.6628
grade[A]                       : 0.5396
sub_grade[A4]                  : 0.4526

Lowest Negative Coefficients
----------------------------
purpose[small_business]        : -0.9446
sub_g

## Accuracy Comparison With The Decision Tree Model
Using the same training set (`train_data`), we also build a Decision Tree model with the same `target` and `features`.

In [9]:
decision_tree_model = tc.decision_tree_classifier.create(train_data,
                                                            validation_set = None,
                                                            target = target,
                                                            features = features)

Now, let's quickly compare the accuracy of the two models on the test set:

In [10]:
print("Decision Tree model's accuracy:", decision_tree_model.evaluate(test_data)['accuracy'])
print("Logistic Regression model's accuracy:", logistic_regression_model.evaluate(test_data)['accuracy'])

Decision Tree model's accuracy: 0.6367944851357173
Logistic Regression model's accuracy: 0.6427186557518311
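
The accuracy reported above is the fraction of test examples whose predicted label matches the true label. A minimal sketch of that calculation, using hypothetical predictions rather than the models' real output:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))

# Hypothetical predictions and true labels (+1 = safe, -1 = risky).
predicted = [+1, -1, +1, +1, -1]
actual = [+1, -1, -1, +1, +1]

print(accuracy(predicted, actual))  # 3 of the 5 predictions match
```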


We can see that **most of the time** (with `random_seed = None` when building the model), the accuracy of the **Logistic Regression model** is ***higher*** than that of the **Decision Tree model**.

If your result is the opposite, you can try **rebuilding** the **Logistic Regression model** by executing its code cell again, then running the accuracy test again (or just hit **Run All**).

The result can differ on every run (with `random_seed = None` when building the model). So how does this work?

## **Logistic Regression Model Algorithm Explanation**

### **Main Idea**

Logistic Regression is a classification algorithm used when the target variable is categorical. It applies when the output is binary, i.e., each example belongs to one of two classes, such as 0 or 1.

Logistic Regression is one of the simplest and most commonly used Machine Learning algorithms for two-class classification. It is easy to implement and can be used as a baseline for any binary classification problem. Its fundamental concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one binary dependent variable and one or more independent variables.

Logistic Regression is similar to Linear Regression, except that it predicts whether something is true or false. Also, instead of fitting a line to the data, Logistic Regression fits an S-shaped logistic function.

**Linear Regression Equation**

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$<br>

**Logistic function**

The logistic function is a type of sigmoid function that squashes any real value into the range between 0 and 1<br>
$\mathrm{logistic}(z) = \frac{1}{1 + e^{-z}}$
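
As a minimal sketch, the logistic function can be written in plain Python as follows. The two-branch form is an implementation choice (not something the notebook prescribes) that avoids overflow in `math.exp` for large negative inputs.

```python
import math

def logistic(z):
    """Numerically stable logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs map close to 0, large positive inputs close to 1.
for z in (-5.0, 0.0, 5.0):
    print(z, round(logistic(z), 4))
```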


### **The Learning Process**
**Input**: the training data set `train_data`

**Learning**

1. Calculate the Linear Regression Equation

2. Calculate the Logistic Sigmoid Function

3. Apply the Sigmoid function to the Linear Regression output:

$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$
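
The steps above give the predicted probability for fixed coefficients. Training then chooses the coefficients that make these probabilities fit the training labels as well as possible (maximum likelihood). The sketch below illustrates this with Newton's method on a single feature plus an intercept. It is a simplified illustration under stated assumptions, not Turi Create's implementation: labels are 0/1 instead of +1/-1, no penalty is applied, and the toy data and the name `fit_newton` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_newton(xs, ys, iterations=10):
    """Fit p = sigmoid(b0 + b1 * x) by Newton's method (labels are 0 or 1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood (g) and negated Hessian entries (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)   # steps 1-3: linear score, then sigmoid
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update with the 2x2 matrix inverse written out by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical, non-separable toy data: larger x tends to mean label 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_newton(xs, ys)
print("intercept:", b0, "slope:", b1)
print("P(y=1 | x=3.0):", sigmoid(b0 + b1 * 3.0))
```

Newton's method uses second-derivative information, so it typically converges in a handful of iterations, which matches the small number of solver iterations shown in the model summary.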


### **Pros and Cons**
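These points compare an off-the-shelf solver, such as the Newton's method solver chosen above, with hand-tuned gradient descent.

**Pros**
 
- No need to pick a learning rate
- Often runs faster (not always the case)
- Can numerically approximate the gradient for you (doesn’t always work out well)


**Cons**
- More complex
- More of a black box unless you learn the specifics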



**Visualized Reference From**
https://www.youtube.com/watch?v=yIYKR4sgzI8

