# Logistic Regression
To estimate the class of a data point, we need a way to decide which class is most probable for it. For this, we use logistic regression.
Logistic regression fits an S-shaped (sigmoid) curve: it takes the linear regression function and transforms its numeric output into a probability between 0 and 1.
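
The transformation described above can be sketched as follows. This is a minimal illustration, not the model trained later in this notebook: the `weights` and `bias` values are hypothetical and chosen only to show how a linear score becomes a probability.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and bias for a two-feature model (illustrative values only).
weights = np.array([0.8, -1.2])
bias = 0.5


def predict_probability(x):
    # Linear score, as in linear regression, then squashed by the sigmoid.
    return sigmoid(np.dot(x, weights) + bias)
```

A score of 0 maps to a probability of exactly 0.5; large positive scores approach 1 and large negative scores approach 0.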

### Import Library
Import the following libraries:
- csv
- numpy (as np)
- LogisticRegression from sklearn.linear_model
- preprocessing from sklearn
- train_test_split from sklearn.model_selection
- classification_report from sklearn.metrics

In [None]:
import csv
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Read data
Now, ***read the data*** using *csv*

The function **readData** reads the data from a CSV file and returns all rows exactly as they appear in the file. The function **cleanData** then drops every row that contains a missing value (marked with `?`). <br>
In the next step, we prepare the data for pre-processing.


In [None]:
def readData(address):
    with open(address) as csvFile:
        reader = csv.reader(csvFile)
        data = [row for row in reader]
    return data


def cleanData(data):
    # Keep only the rows that contain no missing value ('?').
    return [row for row in data if '?' not in row]
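
To make the filtering step concrete, here is a small sketch that runs on an in-memory CSV string instead of a file. The sample rows and the `class` column name are hypothetical; only the `?` marker for missing values is taken from the code above.

```python
import csv
import io

# Hypothetical CSV content; '?' marks a missing value.
sample = "class,cap-shape,habitat\ne,x,u\np,?,g\ne,b,m\n"

# Read every row, header included, as a list of strings.
rows = list(csv.reader(io.StringIO(sample)))

# Drop any row in which some field is exactly '?'.
clean_rows = [row for row in rows if '?' not in row]

print(len(rows))        # 4 (header + 3 records)
print(len(clean_rows))  # 3 (the row containing '?' is dropped)
```

Note that `'?' not in row` tests for a field equal to `'?'`, not for a `?` character inside a longer field.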

First, we read the data and store the number of rows in a variable. After cleaning, we subtract the number of remaining rows from that initial count to get the number of rows containing *missing values*, and print the **percentage** of these incomplete rows.
<br>

We also remove the header row, since it contains column names rather than data.

In [None]:
fileAddress = './train+dev+test.csv'
data = readData(fileAddress)
missingValues = len(data)
print(f"Number of rows before data cleaning: {len(data)}")
data = cleanData(data)
missingValues -= len(data)
data = np.array(data)[1:]  # remove headers
print(f"Number of rows after data cleaning: {len(data)}")
missingValues = round(missingValues/(len(data)+missingValues)*100, 2)
print(f"Percentage of missing values: {missingValues}%")

### Indicator variables
As you may have noticed, all features in this dataset are categorical, such as **cap-shape** or **habitat**. Sklearn's Logistic Regression does not handle categorical variables directly. We can still convert these features to numerical values: the `dummyVariables` function below uses `LabelEncoder` to replace each category in a column with an integer code. Strictly speaking, this is label encoding rather than one dummy/indicator column per category.


In [None]:
def dummyVariables(features):
    # Integer output array: writing codes back into the string array would keep them as strings.
    encoded = np.empty(features.shape, dtype=int)
    for column in range(features.shape[1]):
        # column index: 0,1,2,3,...,21
        transformer = preprocessing.LabelEncoder()
        encoded[:, column] = transformer.fit_transform(features[:, column])
    return encoded
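
`LabelEncoder` assigns each category a single integer code, which a linear model may read as an ordering between categories. As a possible alternative, the sketch below builds true indicator (one-hot) columns with scikit-learn's `OneHotEncoder`. The small `features` array is hypothetical, and this is not the encoding used in the rest of this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-column categorical feature matrix (illustrative values only).
features = np.array([["x", "u"], ["b", "g"], ["x", "m"]])

# One 0/1 column per distinct category in each input column.
encoder = OneHotEncoder(handle_unknown="ignore")
indicators = encoder.fit_transform(features).toarray()

print(indicators.shape)  # (3, 5): 2 categories in column 0 + 3 in column 1
```

Each row then contains exactly one `1` per original feature, at the cost of a wider feature matrix.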

Now separate the labels of the samples from their features:
- **X** as the feature matrix (data)
- **Y** as the response vector (target)
<br>

Then we pass the feature matrix **X** to the encoding function `dummyVariables`.

In [None]:
X = data[:, 1:]
Y = data[:, 0]
print(f"Before indicator variables: \n{X}")
X = dummyVariables(X)
print(f"\nAfter indicator variables: \n{X}")