# Logistic Regression
---
### About The Dataset
We were provided a dataset from a bank that has seen a recent surge in customers leaving.\
They want a program that can predict whether a customer is likely to leave the bank.

For an in-depth explanation of data preprocessing, please review [Data Preprocessing](../../Data_Preprocessing/data_preprocessing.ipynb).

## Import Libraries

In [174]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

For our libraries, we'll just import them all now.\
We imported Pandas and NumPy to handle the data for us.\
I imported matplotlib, not that it's needed yet, but we will use it to visualize data later.\
Even though I know that our current dataset doesn't have any missing data, I planned for it anyway and imported SimpleImputer.\
We have a column with categorical data that contains more than two options, so we imported ColumnTransformer to convert one column into multiple.\
This also means we'll use OneHotEncoder to assign the values of our new columns based on the categorical information.\
For the categorical column that has only two options, we imported LabelEncoder.\
We will also need to scale our data before passing it to our model, so we grabbed StandardScaler.\
The last thing we need is train_test_split, so we can quickly and easily split our data into the 4 required sets.

## Import Dataset

In [175]:
dataset = pd.read_csv('../ANN/dataset.csv')
print(dataset.loc[0])

RowNumber                  1
CustomerId          15634602
Surname             Hargrave
CreditScore              619
Geography             France
Gender                Female
Age                       42
Tenure                     2
Balance                  0.0
NumOfProducts              1
HasCrCard                  1
IsActiveMember             1
EstimatedSalary    101348.88
Exited                     1
Name: 0, dtype: object


Here we simply import the dataset we used in our Artificial Neural Network, which is returned as a pandas DataFrame.\
We print the first row so we can see what the dataset consists of and decide what data we need.\
\
RowNumber, CustomerId, and Surname are features that are likely to have little to no effect on the dependent variable, so we won't need those columns.\
The rest of the data seems like it can be relevant.\
We also know that the "Exited" column is the dependent variable we are trying to predict.\
So we want to store the independent variables we determined useful in one variable and the dependent variable in another.

In [176]:
x = dataset.iloc[:,3:-1].values
y = dataset.iloc[:,-1].values
print(f'X\n{x[0:5]}')
print(f'Y\n{y[0:5]}')

X
[[619 'France' 'Female' 42 2 0.0 1 1 1 101348.88]
 [608 'Spain' 'Female' 41 1 83807.86 1 0 1 112542.58]
 [502 'France' 'Female' 42 8 159660.8 3 1 0 113931.57]
 [699 'France' 'Female' 39 1 0.0 2 0 0 93826.63]
 [850 'Spain' 'Female' 43 2 125510.82 1 1 1 79084.1]]
Y
[1 0 1 0 0]


Pandas provides an indexer called .iloc that lets us select from the DataFrame by integer position.\
As mentioned, we need to grab everything from the CreditScore column through the EstimatedSalary column.\
We also need to remember that we are working with a 2-dimensional structure, so we have rows and columns.\
\
We will pass .iloc two indices separated by a comma using conventional Python slicing.\
In other words, it breaks down like this  -  dataset.iloc[:,3:-1] (":" = all rows, "3:-1" = start at the 4th column and stop just before the last column)\
We assign the result to the variable x.\
Next we pass [:,-1], which says all rows / last column only, and assign that to y.\
Last thing to mention: we apply .values to the results to convert them into NumPy arrays.
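
The same slicing can be tried on a tiny frame first. This is only a sketch: the `demo` DataFrame and its values are made up for illustration, and only the column layout (ID-like columns, then features, then the target) mirrors the bank dataset.

```python
import pandas as pd

# Hypothetical miniature frame: 3 ID-like columns, 2 features, 1 target
demo = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["A", "B"],
    "CreditScore": [619, 608],
    "Geography": ["France", "Spain"],
    "Exited": [1, 0],
})

# All rows; columns from position 3 up to (not including) the last one
x_demo = demo.iloc[:, 3:-1].values
# All rows; last column only
y_demo = demo.iloc[:, -1].values

print(x_demo)  # 2 rows x 2 feature columns
print(y_demo)  # one label per row
```

Because the stop index of a slice is exclusive, `3:-1` keeps every feature column and leaves the target out.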


## Encode The Dataset
In our current dataset we have both binary (Gender) and non-binary (Geography, the customer's country) categorical data.\
The Gender column sits to the right of the Geography column.\
If we encode the Geography data first, we will alter the position of the Gender data.\
\
So we will encode the binary data first in this case.\
Of course, we could account for the additional columns, but we would have to know how many countries there are.\
It's just easier to do the binary data first.

In [177]:
le = LabelEncoder() # Encodes the binary data
x[:,2] = le.fit_transform(x[:,2]) # All rows, 3rd column (Gender)

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x[0:12:2])


[[1.0 0.0 0.0 619 0 42 2 0.0 1 1 1 101348.88]
 [1.0 0.0 0.0 502 0 42 8 159660.8 3 1 0 113931.57]
 [0.0 0.0 1.0 850 0 43 2 125510.82 1 1 1 79084.1]
 [1.0 0.0 0.0 822 1 50 7 0.0 2 1 1 10062.8]
 [1.0 0.0 0.0 501 1 44 4 142051.07 2 0 1 74940.5]
 [1.0 0.0 0.0 528 1 31 6 102016.72 2 0 0 80181.12]]


The first thing we did was create an instance of the LabelEncoder class and store it in the variable le.\
We index all rows of the 3rd column only and set every cell equal to its encoded version using le.fit_transform.\
We pass x with the same indexing as the argument to fit_transform.\
\
Next we tackle the non-binary categorical data using ColumnTransformer and OneHotEncoder.\
We create an instance of the ColumnTransformer class and store it in ct.\
For arguments:
- transformers is a list containing a tuple. Inside the tuple we have a name ("encoder"), an OneHotEncoder instance, and a list with the index of the Geography column.
- remainder is set to 'passthrough' so we don't lose the other columns.

The last step is to set x to the result of ct.fit_transform(x), converted to a NumPy array.\
The three one-hot columns are placed first, followed by the untouched columns in their original order.
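
To see how the two encoders reshape the data, here is a small sketch on three made-up rows. The names `demo_x`, `demo_ct`, and `demo_encoded` are hypothetical; the calls are the same ones used in the cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical rows: CreditScore, Geography, Gender
demo_x = np.array([
    [619, "France", "Female"],
    [608, "Spain", "Female"],
    [502, "Germany", "Male"],
], dtype=object)

# Binary column first: classes are sorted, so Female -> 0, Male -> 1
demo_x[:, 2] = LabelEncoder().fit_transform(demo_x[:, 2])

# One-hot encode column 1; pass the remaining columns through unchanged
demo_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
demo_encoded = np.array(demo_ct.fit_transform(demo_x))

# One-hot columns (France, Germany, Spain) come first, then CreditScore, Gender
print(demo_encoded)
```

Gender started at index 2 and ends up at index 4, which is the shift that makes encoding the binary column first the simpler order.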

## Split Data Set
Here we use train_test_split to make our train and test sets.\
The only two arguments we need to pass are the features (the independent variables) and the dependent variable (whether the person left the bank).\
There are other arguments that can be passed, and you'll see them used in other projects.\
\
train_test_split will return four new datasets:\
x and y train sets, and x and y test sets.\
We unpack them and assign all four at once as seen below.

In [178]:
x_train, x_test, y_train, y_test = train_test_split(x,y)
print(f"X Train:\n{x_train[0]}")
print(f"Y Train:\n{y_train[0]}")
print(f"X Test:\n{x_test[0]}")
print(f"Y Test:\n{y_test[0]}")

X Train:
[0.0 1.0 0.0 616 0 42 6 117899.95 2 0 0 150266.81]
Y Train:
0
X Test:
[0.0 1.0 0.0 594 0 23 4 104753.84 2 1 0 56756.52]
Y Test:
1
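
The optional arguments mentioned above can be sketched on made-up data. `test_size`, `random_state`, and `stratify` are real train_test_split parameters; the arrays and the values chosen for them are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, alternating 0/1 labels
demo_x = np.arange(40).reshape(20, 2)
demo_y = np.array([0, 1] * 10)

# Defaults: 25% of the rows go to the test set, shuffled differently each run
a_train, a_test, b_train, b_test = train_test_split(demo_x, demo_y)
print(len(a_train), len(a_test))

# test_size changes the split ratio; random_state makes the shuffle repeatable;
# stratify keeps the 0/1 proportions the same in both sets
s_train, s_test, t_train, t_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0, stratify=demo_y
)
print(len(s_train), len(s_test), t_test.mean())
```

Without `random_state`, the rows in each set change on every run, so the printed rows and the final score will differ between runs of this notebook.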


## Feature Scaling
The last step of the preprocessing part is scaling the data.\
We don't want to scale the "dummy data", or data we encoded.\
\
We are going to use the StandardScaler, so we will have to specify which columns we want to scale.\
This means we want to scale index 3 (CreditScore), indices 5-7 (Age, Tenure, Balance), and the last element (EstimatedSalary).\
This is achieved through indexing as we have been. (see below)\
\
We will scale both the x sets, but not the y sets.\
This is because the dependent variable is a class label that is only ever 0 or 1.\
StandardScaler rescales each column to zero mean and unit variance, so most scaled values land roughly between -3 and 3, and the 0/1 label already sits in a comparable range.

In [179]:
sc = StandardScaler()
# Scaling only needed cols for Train Set
x_train[:,3:4] = sc.fit_transform(x_train[:,3:4])
x_train[:,5:8] = sc.fit_transform(x_train[:,5:8])
x_train[:,-1:] = sc.fit_transform(x_train[:,-1:])
# Scaling only needed cols for Test Set (note: fit_transform refits the scaler on the test data)
x_test[:,3:4] = sc.fit_transform(x_test[:,3:4])
x_test[:,5:8] = sc.fit_transform(x_test[:,5:8])
x_test[:,-1:] = sc.fit_transform(x_test[:,-1:])

print(f'Train: {x_train[0]}')
print(f'Test: {x_test[0]}')

Train: [0.0 1.0 0.0 -0.3626714946204171 0 0.28258166022475556 0.3518452822884281
 0.6615470836797597 2 0 0 0.8745298796037806]
Test: [0.0 1.0 0.0 -0.5740432738697727 0 -1.4783920169613889 -0.3852735872835718
 0.4598337187240687 2 1 0 -0.7615015602981694]
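
StandardScaler computes z = (x - mean) / std for each column. The cell above refits the scaler on the test set, so the two sets are scaled with different statistics. A common alternative is to fit once on the training columns and reuse those statistics on the test set. The sketch below shows that alternative on made-up numbers; `demo_train`, `demo_test`, `num_cols`, and `sc_demo` are hypothetical names, and this is not what the notebook's recorded outputs were produced with.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is a dummy (0/1) column, columns 1-2 are numeric
demo_train = np.array([[1.0, 600.0, 30.0],
                       [0.0, 700.0, 40.0],
                       [1.0, 800.0, 50.0]])
demo_test = np.array([[0.0, 650.0, 45.0]])
num_cols = [1, 2]  # only these columns get scaled

sc_demo = StandardScaler()
# Learn mean and standard deviation from the training rows only
demo_train[:, num_cols] = sc_demo.fit_transform(demo_train[:, num_cols])
# Reuse the training statistics on the test rows
demo_test[:, num_cols] = sc_demo.transform(demo_test[:, num_cols])

print(demo_train)
print(demo_test)
```

Passing all numeric columns in one call also avoids reusing a single scaler object for several separate fits, since each column is still standardized independently.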


Now we are ready to start working on the model.

---

## Training The Model

In [180]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(x_train, y_train)

## Predict New Result

In [181]:
print(clf.predict(x_test[10:21]))
print(clf.predict_proba(x_test[10:21]))
print(y_test[10:21])

[0 0 0 0 0 0 0 0 0 0 0]
[[0.916206   0.083794  ]
 [0.92289564 0.07710436]
 [0.79922113 0.20077887]
 [0.71812542 0.28187458]
 [0.88246673 0.11753327]
 [0.92027467 0.07972533]
 [0.91897824 0.08102176]
 [0.92234901 0.07765099]
 [0.50659966 0.49340034]
 [0.9501379  0.0498621 ]
 [0.84492819 0.15507181]]
[0 0 1 0 0 0 0 0 0 0 0]
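
For a two-class model, predict effectively labels a row 1 when the second predict_proba column exceeds 0.5, which is why every row above is predicted 0. The sketch below reuses the probabilities printed above; the 0.2 cutoff is a hypothetical value chosen only to show the effect of a lower threshold.

```python
import numpy as np

# Class-1 ("will leave") probabilities copied from the predict_proba output above
p_leave = np.array([0.083794, 0.07710436, 0.20077887, 0.28187458, 0.11753327,
                    0.07972533, 0.08102176, 0.07765099, 0.49340034, 0.0498621,
                    0.15507181])
actual = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# 0.5 cutoff: equivalent to picking the more probable class
default_pred = (p_leave > 0.5).astype(int)
print(default_pred)

# Hypothetical lower cutoff of 0.2: flags more customers as likely to leave
lenient_pred = (p_leave > 0.2).astype(int)
print(lenient_pred)
```

With the lower cutoff, the one customer in this slice who actually left is flagged, at the cost of two false alarms.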


## Predicting The Test Set

In [185]:
y_pred = clf.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))
print(clf.score(x_test,y_test))

[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 1]
[0 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[1 0]
[1 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 0]
[0 1]
[0 0]
[1 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[1 1]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 1]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 1]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 1]
[0 0]
[0 0]
[0 1]
[0 0]
[0 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
...

## Making The Confusion Matrix
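
One way to build the matrix is scikit-learn's confusion_matrix, sketched here on the 11-row slice printed in "Predict New Result" rather than on the full test set. The names `actual`, `predicted`, and `cm` are hypothetical; for the full test set one would presumably pass y_test and y_pred instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# The 11 actual labels and predictions printed in "Predict New Result"
actual = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
predicted = np.zeros(11, dtype=int)

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(actual, predicted)
print(cm)

# Fraction of correct predictions
print(accuracy_score(actual, predicted))
```

On this slice the accuracy looks high even though the single customer who left was missed, which is the kind of detail the matrix exposes and a plain score hides.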

## Visualize The Test Results