# Logistic Regression
To estimate the class of a data point, we need a way to judge which class is most probable for that point. Logistic regression provides this.
Logistic regression fits an S-shaped (sigmoid) curve: it takes the linear regression function and transforms its numeric output into a probability between 0 and 1.
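
The transformation described above can be sketched in a few lines. This is a minimal illustration, not part of the notebook's pipeline; the weights `w`, bias `b`, and sample `x` are made-up values chosen only to show how a linear score becomes a probability.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias and sample (illustration only)
w = np.array([0.8, -1.2])
b = 0.1
x = np.array([1.0, 0.5])

score = w @ x + b      # linear regression part: w . x + b
prob = sigmoid(score)  # squashed into a probability of the positive class
print(f"score = {score:.3f}, probability = {prob:.3f}")
```

A score of 0 maps to a probability of exactly 0.5, which is why the sign of the linear score decides the predicted class.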

### Import Libraries
Import the following libraries:
- csv
- numpy (as np)
- LogisticRegression from sklearn.linear_model
- preprocessing from sklearn
- train_test_split from sklearn.model_selection
- classification_report from sklearn.metrics

In [1]:
import csv
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Read data
Now, ***read the data*** using *csv*.

The function **readData** reads the CSV file and returns all rows as a list of lists, preserving the dimensions of the file itself. The function **cleanData** then drops every row that contains a missing value, marked by `?`. <br>
In the next step, we prepare the data for pre-processing.


In [2]:
def readData(address):
    with open(address) as csvFile:
        reader = csv.reader(csvFile)
        data = [row for row in reader]
    return data


def cleanData(data):
    # keep only the rows that do not contain the '?' (missing value) marker
    return [row for row in data if '?' not in row]
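
As a quick check of the filtering rule, here is a small sketch on made-up rows. The rows and column names below are illustrative only and are not taken from the dataset file; the sketch also shows how the share of dropped rows can be computed with the header excluded from the count.

```python
# Toy rows (illustration only): one header row plus three data rows
rows = [
    ["class", "cap-shape", "stalk-root"],  # header row
    ["e", "x", "b"],
    ["p", "f", "?"],                       # contains a missing value
    ["e", "b", "c"],
]

# Same rule as cleanData: drop any row that contains '?'
kept = [row for row in rows if "?" not in row]
dropped = len(rows) - len(kept)

# The header is not a data row, so it is excluded from the denominator
share = round(dropped / (len(rows) - 1) * 100, 2)
print(kept)
print(f"{share}% of the data rows were dropped")
```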

First, we read the file and store the number of rows in a variable. After cleaning, we subtract the number of remaining rows from that initial count to get the number of rows containing *missing values*, and print the **percentage** of these incomplete rows.
<br>

We also remove the header row, since it holds the column names rather than data.

In [3]:
fileAddress = './train+dev+test.csv'
data = readData(fileAddress)
missingValues = len(data)
print(f"Number of rows before data cleaning: {len(data)}")
data = cleanData(data)
missingValues -= len(data)
data = np.array(data)[1:]  # remove headers
print(f"Number of rows after data cleaning: {len(data)}")
missingValues = round(missingValues/(len(data)+missingValues)*100, 2)
print(f"Percentage of missing values: {missingValues}%")

Number of rows before data cleaning: 8125
Number of rows after data cleaning: 5644
Percentage of missing values: 30.53%


### Indicator variables
As you may have noticed, all features in this dataset are categorical, such as **cap-shape** or **habitat**. Scikit-learn's logistic regression does not accept categorical strings directly, so we convert these features to numerical values with `dummyVariables`. Note that this function uses `LabelEncoder`, which replaces each category with an integer code; it does not create separate dummy/indicator (one-hot) columns.


In [4]:
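def dummyVariables(features):
    # encode each column (indices 0, 1, ..., 21) independently
    for column in range(features.shape[1]):
        featureStatus = set(features[:, column])  # distinct categories in this column
        transformer = preprocessing.LabelEncoder()
        transformer.fit(list(featureStatus))
        # features is a string array, so the integer codes are stored as strings
        features[:, column] = transformer.transform(features[:, column])
    return features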
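
The function above assigns one integer code per category, which implies an ordering between categories. A common alternative, not used in this notebook, is one-hot encoding, where each category gets its own 0/1 column. The sketch below shows it with scikit-learn's `OneHotEncoder` on a made-up two-column array; the array `toy` is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (illustration only): 3 samples, 2 features
toy = np.array([
    ["x", "n"],
    ["b", "y"],
    ["x", "w"],
])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing
onehot = encoder.fit_transform(toy).toarray()

print(encoder.categories_)  # sorted categories found in each column
print(onehot)               # one 0/1 column per (feature, category) pair
```

With 2 categories in the first column and 3 in the second, the result has 5 columns, and every row contains exactly one 1 per original feature.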

Now separate the labels of the samples from their features:
- **X** as the feature matrix (data)
- **Y** as the response vector (target)
<br>

Then we pass the feature matrix **X** to the encoding function `dummyVariables`.

In [5]:
X = data[:, 1:]
Y = data[:, 0]
print(f"Before indicator variables: \n{X}")
X = dummyVariables(X)
print(f"\nAfter indicator variables: \n{X}")

Before indicator variables: 
[['x' 's' 'n' ... 'k' 's' 'u']
 ['x' 's' 'y' ... 'n' 'n' 'g']
 ['b' 's' 'w' ... 'n' 'n' 'm']
 ...
 ['x' 'y' 'g' ... 'w' 'y' 'p']
 ['x' 'y' 'c' ... 'w' 'c' 'd']
 ['f' 'y' 'c' ... 'w' 'c' 'd']]

After indicator variables: 
[['5' '2' '4' ... '1' '3' '5']
 ['5' '2' '7' ... '2' '2' '1']
 ['0' '2' '6' ... '2' '2' '3']
 ...
 ['5' '3' '3' ... '5' '5' '4']
 ['5' '3' '1' ... '5' '1' '0']
 ['2' '3' '1' ... '5' '1' '0']]


### Normalize Data
Standardization rescales each feature to zero mean and unit variance. It is good practice, especially for algorithms such as KNN and logistic regression, which are sensitive to the scale of the features:

In [6]:
def standardization(data):
    scaler = preprocessing.StandardScaler()
    scaler.fit(data)
    STD_data = scaler.transform(data)
    return STD_data
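
To make explicit what the scaler computes, the sketch below compares `StandardScaler` with the formula `(x - mean) / std` applied column by column. The array `toy` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric data (illustration only): 3 samples, 2 features
toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# Manual standardization per column, using the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))  # both approaches agree
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
```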

In [7]:
X = standardization(X)
print(f"Data after Normalization: \n{X}")

Data after Normalization: 
[[ 0.95193532  0.27895188 -0.14846445 ... -0.20344067 -0.53662286
   2.35567928]
 [ 0.95193532  0.27895188  1.48387322 ...  0.7408185  -1.28928177
  -0.14770122]
 [-2.06103179  0.27895188  0.93976067 ...  0.7408185  -1.28928177
   1.10398903]
 ...
 [ 0.95193532  1.02724296 -0.692577   ...  3.573596    0.96869495
   1.72983415]
 [ 0.95193532  1.02724296 -1.78080212 ...  3.573596   -2.04194068
  -0.77354635]
 [-0.85584494  1.02724296 -1.78080212 ...  3.573596   -2.04194068
  -0.77354635]]


### Train - Test split
We use a train/test split to train and evaluate the logistic regression model.
`train_test_split` returns four arrays, which we name:
`X_train, X_test, y_train, y_test`.

`X` and `Y` are the arrays to be split, `test_size` is the fraction of the data reserved for testing, and `random_state` fixes the shuffling so that we obtain the same split on every run.

Here the data is split 70% for training and 30% for testing.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=24)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (3950, 22) (3950,)
Test set: (1694, 22) (1694,)
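
The role of `test_size` and `random_state` can be checked on a small made-up array. This is a sketch for illustration only; `X_toy` and `y_toy` are not part of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustration only): 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two calls with the same random_state produce the same split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=24)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=24)

print('Train set:', X_tr.shape, y_tr.shape)
print('Test set:', X_te.shape, y_te.shape)
print('Same split on both calls:', np.array_equal(X_te, X_te2))
```

With 10 samples and `test_size=0.3`, 3 samples go to the test set and 7 to the training set.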


### Modeling
Let's build our model using `LogisticRegression` from the scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the *‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’* solvers.

Scikit-learn's logistic regression supports regularization, a technique that reduces overfitting in machine learning models. Now let's fit our model on the training set:


In [9]:
logisticRegression = LogisticRegression(solver='liblinear')
logisticRegression.fit(X_train, y_train)
print(f"The Logistic Regression classifier is: {logisticRegression}")

The Logistic Regression classifier is: LogisticRegression(solver='liblinear')
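
The regularization mentioned above is controlled by the `C` parameter of `LogisticRegression`, which is the inverse of the regularization strength (the model above uses the default, `C=1.0`). The sketch below is an illustration on synthetic data generated with `make_classification`, not on the mushroom dataset; the two `C` values are arbitrary choices meant only to contrast a strong and a weak penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller C means a stronger penalty on large coefficients
strong = LogisticRegression(solver='liblinear', C=0.01).fit(X_toy, y_toy)
weak = LogisticRegression(solver='liblinear', C=100.0).fit(X_toy, y_toy)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(f"C=0.01 -> coefficient norm {norm_strong:.3f}")
print(f"C=100  -> coefficient norm {norm_weak:.3f}")
```

The strongly regularized model ends up with smaller coefficients, which is how the penalty limits overfitting.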


### Prediction
We can use the model to make predictions on the test set:

In [10]:
yhat = logisticRegression.predict(X_test)
print(f'Estimated class is: {yhat}')

Estimated class is: ['e' 'p' 'e' ... 'e' 'e' 'e']


### Predict Probability
`predict_proba` returns probability estimates for all classes, ordered as in `logisticRegression.classes_` (the sorted class labels). Here the first column is the probability of class `e`, P(Y=e|X), and the second column is the probability of class `p`, P(Y=p|X):

In [16]:
predictedProbability = logisticRegression.predict_proba(X_test)
predictedProbability = np.around(predictedProbability, decimals=3)
print(f'The Probability of classes: \n{predictedProbability}')

The Probability of classes: 
[[1.    0.   ]
 [0.001 0.999]
 [1.    0.   ]
 ...
 [1.    0.   ]
 [0.768 0.232]
 [1.    0.   ]]
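
The link between the probability columns and the class labels can be verified through the `classes_` attribute. The sketch below uses a tiny made-up dataset with the same two labels; it is an illustration, not a rerun of the model above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (illustration only): one feature, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(['e', 'e', 'p', 'p'])

model = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)

print(model.classes_)     # column order of predict_proba
print(proba.sum(axis=1))  # each row sums to 1

# predict() returns the class whose column has the highest probability
from_proba = model.classes_[proba.argmax(axis=1)]
print(np.array_equal(from_proba, model.predict(X_toy)))
```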


### Evaluation
`classification_report` summarizes the per-class precision, recall, and F1-score, together with the support (the number of true samples of each class in **y_test**). The accuracy row gives the fraction of test samples whose predicted label exactly matches the true label.

The macro average is the unweighted mean of the per-class scores, while the weighted average weights each class by its support.

In [12]:
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           e       0.98      0.99      0.99      1083
           p       0.98      0.97      0.98       611

    accuracy                           0.98      1694
   macro avg       0.98      0.98      0.98      1694
weighted avg       0.98      0.98      0.98      1694
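
To show where the numbers in such a report come from, the sketch below computes precision and recall for class `p` by hand from a confusion matrix and compares them with scikit-learn's own functions. The label lists are made up for illustration and are unrelated to the results above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels (illustration only)
y_true = ['e', 'e', 'e', 'p', 'p', 'p']
y_pred = ['e', 'e', 'p', 'p', 'p', 'e']

# Rows are true classes, columns are predicted classes, in the order ['e', 'p']
cm = confusion_matrix(y_true, y_pred, labels=['e', 'p'])
tp = cm[1, 1]  # true 'p' predicted as 'p'
fp = cm[0, 1]  # true 'e' predicted as 'p'
fn = cm[1, 0]  # true 'p' predicted as 'e'

precision_p = tp / (tp + fp)  # how many predicted 'p' are really 'p'
recall_p = tp / (tp + fn)     # how many real 'p' were found
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(precision_p, precision_score(y_true, y_pred, pos_label='p'))
print(recall_p, recall_score(y_true, y_pred, pos_label='p'))
print(accuracy)
```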

