## Logistic Regression

Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes. It is a predictive analysis algorithm based on the concept of probability. In its basic form it predicts a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.

To understand the concept of Logistic Regression, we need to understand the sigmoid function. The sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. It is defined as:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

The sigmoid function is a special case of the logistic function, which was originally introduced to model population growth in ecology: it rises quickly and then levels off at the carrying capacity of the environment. Its inverse is the "logit" function, $\operatorname{logit}(p) = \log\frac{p}{1-p}$, which is the log of the odds of the outcome.

The sigmoid function is used in Logistic Regression because it can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. This is useful because we can interpret the output as a probability.
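As a small illustration of this mapping, the sketch below evaluates the sigmoid on a few inputs. The piecewise form is an assumption made here to avoid overflow in `exp()` for large negative inputs; it is algebraically the same function as the formula above. Note that in floating-point arithmetic very large inputs still round to exactly 0 or 1.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid for a NumPy array, computed piecewise to avoid overflow."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so this form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

values = sigmoid(np.array([-6.0, -1.0, 0.0, 1.0, 6.0]))
print(values)  # all strictly between 0 and 1, with sigmoid(0) = 0.5
```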

The sigmoid function is also differentiable everywhere, which means that we can find its slope at any point on the curve. This is useful because, via the chain rule, that slope enters the derivative of the cost function, which we will use to train our model.
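The slope of the sigmoid has a simple closed form. Writing $\sigma(x) = (1+e^{-x})^{-1}$ and applying the chain rule gives:

$$\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\,(1-\sigma(x))$$

So the derivative can be computed from the function value itself. It peaks at $x = 0$, where $\sigma'(0) = 0.25$, and tends to 0 as $|x|$ grows.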

The cost function for Logistic Regression is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$$

Where:

* $h_{\theta}(x^{(i)})$ is the hypothesis for the $i^{th}$ training example, equal to $\sigma(\theta^Tx^{(i)})$, where $\theta$ is the parameter vector
* $y^{(i)}$ is the actual output for the $i^{th}$ training example
* $m$ is the number of training examples
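The cost can be evaluated in a few lines of NumPy. The following is a minimal sketch, not the implementation used later in this notebook: the function name `cross_entropy_cost`, the clipping constant `eps`, and the toy arrays are all assumptions made for illustration. Both log terms are averaged over the $m$ examples.

```python
import numpy as np

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """Average binary cross-entropy J(theta) for labels y in {0, 1}."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # hypothesis sigma(theta^T x)
    h = np.clip(h, eps, 1.0 - eps)          # keep log() away from 0
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Toy data (made up); the first column of ones plays the role of x_0.
X_demo = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y_demo = np.array([1.0, 0.0, 1.0])

cost_at_zero = cross_entropy_cost(np.zeros(2), X_demo, y_demo)
cost_better = cross_entropy_cost(np.array([0.0, 2.0]), X_demo, y_demo)
print(cost_at_zero, cost_better)
```

With $\theta = 0$ every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$ regardless of the labels; a parameter vector that fits the data better gives a lower cost.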

The cost function is convex, which means that any local minimum is also a global minimum. With a suitably small learning rate, gradient descent therefore approaches the same minimum cost no matter where you initialize your parameters.
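To illustrate the effect of convexity, the sketch below runs plain gradient descent from two very different starting points and compares the results. The data set is made up, the helper `run_gradient_descent` is hypothetical, and the classes are chosen to overlap so that a finite minimizer exists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_gradient_descent(theta0, X, y, alpha=0.5, iterations=5000):
    """Plain batch gradient descent on the logistic regression cost."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iterations):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= alpha * grad
    return theta

# Made-up, non-separable data; the first column is x_0 = 1.
X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.0],
              [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 0, 0, 1, 1], dtype=float)

theta_a = run_gradient_descent([0.0, 0.0], X, y)
theta_b = run_gradient_descent([5.0, -5.0], X, y)
print(theta_a, theta_b)  # the two runs should agree closely
```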

The cost function is also differentiable, which means that we can use gradient descent to find the values of $\theta$ that minimize the cost function.

The gradient of the cost function is a vector of the same length as $\theta$ where the $j^{th}$ element (for $j = 0, 1, \dots, n$) is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$$
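One way to gain confidence in this formula is to compare it with a finite-difference approximation of the cost. The sketch below does that on random data; all function names and the data are assumptions made for illustration, and the column of ones stands in for $x_0$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vectorized form of (1/m) * sum_i (h_i - y_i) * x_i."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # x_0 = 1
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

analytic = gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```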

The gradient descent update rule is:

$$\theta_{j} := \theta_{j} - \alpha\frac{\partial J(\theta)}{\partial \theta_{j}}$$

Where:

* $\alpha$ is the learning rate
* $\theta_{j}$ is the $j^{th}$ parameter of $\theta$
* $\frac{\partial J(\theta)}{\partial \theta_{j}}$ is the partial derivative of the cost function with respect to $\theta_{j}$

The learning rate $\alpha$ determines how large a step we take on each parameter update. If the learning rate is too large we may "overshoot" the minimum; if it is too small, convergence becomes slow. It is also possible to use an adaptive learning rate that decreases over time.
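A decreasing learning rate can be sketched as follows. The inverse-time schedule, the helper name `decayed_rate`, its parameters, and the toy data are assumptions chosen for illustration; many other schedules exist.

```python
import numpy as np

def decayed_rate(alpha0, decay, step):
    """Inverse-time decay: alpha0 at step 0, shrinking as step grows."""
    return alpha0 / (1.0 + decay * step)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up dataset; the first column of ones plays the role of x_0.
X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

theta = np.zeros(2)
initial_cost = cost(theta, X, y)
for step in range(200):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
    alpha = decayed_rate(alpha0=1.0, decay=0.01, step=step)
    theta -= alpha * grad
final_cost = cost(theta, X, y)
print(initial_cost, final_cost)  # the cost should decrease
```

Early steps use a rate close to `alpha0` to make fast progress, while later steps are smaller, which reduces the risk of bouncing around the minimum.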

The following code implements a vectorized version of Logistic Regression from scratch in Python. In addition to the $\theta$ vector, the code keeps a separate bias (intercept) term $b$.

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings( "ignore" )

# to compare our model's accuracy with sklearn model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Logistic Regression
class LogitRegression() :
	def __init__(self, learning_rate, iterations):		
		self.learning_rate = learning_rate		
		self.iterations = iterations
		
	# Function for model training	
	def fit(self, X, Y) :		
		# no_of_training_examples, no_of_features		
		self.m, self.n = X.shape		
		# weight initialization		
		self.Theta = np.zeros((self.n, 1))		
		self.b = 0		
		self.X = X		
		self.Y = Y
		
		# gradient descent learning
		for i in range( self.iterations ):			
			self.update_weights()			
		return self
	
	# Helper function to update weights in gradient descent
	
	def update_weights(self):		
		Z = self.X.dot(self.Theta) + self.b
		h = self.sigmoid(Z)

		assert h.shape == (self.m, 1)
		
		# calculate gradients		
		dTheta = self.X.T.dot(h - self.Y) / self.m
		db = np.sum(h - self.Y) / self.m
		
		assert dTheta.shape == (self.n, 1)

		# update weights
		self.Theta = self.Theta - self.learning_rate * dTheta	
		self.b = self.b - self.learning_rate * db
		
		return self

	def sigmoid(self, x) :
		return 1 / (1 + np.exp(-x))	
	
	# Predict class labels by thresholding the hypothesis h(x) at 0.5
	def predict(self, X) :	
		Z =  X.dot(self.Theta) + self.b
		h = self.sigmoid(Z)
		predictions = np.where(h > 0.5, 1, 0)		
		return predictions

In [2]:
# Importing dataset	
df = pd.read_csv( "diabetes.csv" )
X = df.iloc[:,:-1].values
Y = df.iloc[:,-1:].values

In [3]:
df

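| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 103 | 1 | 81 | 72 | 18 | 40 | 26.6 | 0.283 | 24 | 0 |
| 104 | 2 | 85 | 65 | 0 | 0 | 39.6 | 0.930 | 27 | 0 |
| 105 | 1 | 126 | 56 | 29 | 152 | 28.7 | 0.801 | 21 | 0 |
| 106 | 1 | 96 | 122 | 0 | 0 | 22.4 | 0.207 | 27 | 0 |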


In [4]:
df['Outcome'].value_counts()

0    70
1    38
Name: Outcome, dtype: int64

In [5]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [6]:
# Splitting dataset into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

# Model training	
model = LogitRegression(learning_rate = 0.01, iterations = 10000)
model.fit(X_train, Y_train)	
Y_train_pred = model.predict(X_train)
Y_pred = model.predict(X_test)	

# Sklearn model training
model1 = LogisticRegression()	
model1.fit( X_train, Y_train)
Y_train_pred1 = model1.predict(X_train)
Y_pred1 = model1.predict(X_test)

In [7]:
print("Train accuracy of our model: " , accuracy_score(Y_train, Y_train_pred))
print("Test accuracy of our model: " , accuracy_score(Y_test, Y_pred), '\n')

print("Train accuracy of sklearn model: " , accuracy_score(Y_train, Y_train_pred1))
print("Test accuracy of sklearn model: " , accuracy_score(Y_test, Y_pred1))

Train accuracy of our model:  0.72
Test accuracy of our model:  0.5151515151515151 

Train accuracy of sklearn model:  0.7866666666666666
Test accuracy of sklearn model:  0.5454545454545454


In [8]:
print("Train balanced accuracy of our model: " , balanced_accuracy_score(Y_train, Y_train_pred))
print("Test balanced accuracy of our model: " , balanced_accuracy_score(Y_test, Y_pred), '\n')

print("Train balanced accuracy of sklearn model: " , balanced_accuracy_score(Y_train, Y_train_pred1))
print("Test balanced accuracy of sklearn model: " , balanced_accuracy_score(Y_test, Y_pred1))

Train balanced accuracy of our model:  0.5434782608695652
Test balanced accuracy of our model:  0.4722222222222222 

Train balanced accuracy of sklearn model:  0.7127926421404682
Test balanced accuracy of sklearn model:  0.5222222222222223


In [9]:
# Sklearn model with tuned regularization parameter
model1 = LogisticRegression(C=0.001)
model1.fit( X_train, Y_train)
Y_pred1 = model1.predict(X_test)
Y_train_pred1 = model1.predict(X_train)

print("Train accuracy of sklearn model: " , accuracy_score(Y_train, Y_train_pred1))

print("Accuracy of sklearn model: " , accuracy_score(Y_test, Y_pred1))
print("Balanced accuracy of sklearn model: " , balanced_accuracy_score(Y_test, Y_pred1))

Train accuracy of sklearn model:  0.76
Accuracy of sklearn model:  0.6060606060606061
Balanced accuracy of sklearn model:  0.5833333333333334


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, Normalizer, QuantileTransformer, PowerTransformer, SplineTransformer

In [11]:
# Hint: combine the quantile transform with polynomial features
preprocessor = Pipeline(steps=[
       #('normalizer', Normalizer()),
       #('scaler', StandardScaler()),
       ('quantile', QuantileTransformer(output_distribution='normal')),
       #('power', PowerTransformer(method='yeo-johnson')),
       ('poly', PolynomialFeatures(degree=4, include_bias=False)),
       #('spline', SplineTransformer(degree=4, include_bias=False)),
])

In [12]:
pipeline = Pipeline(steps = [
               ('preprocessor', preprocessor)
              ,('regressor', LogisticRegression(C=1.0))
           ])

In [13]:
rf_model = pipeline.fit(X_train, Y_train)
Y_pred_rf = pipeline.predict(X_test)
Y_train_pred1 = pipeline.predict(X_train)

print("Train accuracy of sklearn model: " , accuracy_score(Y_train, Y_train_pred1))

print("Accuracy of sklearn model: " , accuracy_score(Y_test, Y_pred_rf))
print("Balanced accuracy of sklearn model: " , balanced_accuracy_score(Y_test, Y_pred_rf))


Train accuracy of sklearn model:  1.0
Accuracy of sklearn model:  0.696969696969697
Balanced accuracy of sklearn model:  0.6888888888888889


#### Now let's cheat a little bit and use DecisionTreeClassifier from sklearn

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, Y_train)
Y_train_pred_tree = tree.predict(X_train)
print("Train accuracy of decision tree model: " , accuracy_score(Y_train, Y_train_pred_tree))

Y_pred_tree = tree.predict(X_test)
print("Test accuracy of decision tree model: " , accuracy_score(Y_test, Y_pred_tree))
print("Balanced accuracy of decision tree model: " , balanced_accuracy_score(Y_test, Y_pred_tree))

Train accuracy of decision tree model:  0.7733333333333333
Test accuracy of decision tree model:  0.7272727272727273
Balanced accuracy of decision tree model:  0.7333333333333334


In [15]:
boostTree = AdaBoostClassifier(n_estimators=10, random_state=0)
boostTree.fit(X_train, Y_train)
Y_train_pred_boostTree = boostTree.predict(X_train)
print("Train accuracy of Ada boost model: " , accuracy_score(Y_train, Y_train_pred_boostTree))

Y_pred_boostTree = boostTree.predict(X_test)
print("Test accuracy of Ada boost model: " , accuracy_score(Y_test, Y_pred_boostTree))
print("Balanced accuracy of Ada boost model: " , balanced_accuracy_score(Y_test, Y_pred_boostTree))

Train accuracy of Ada boost model:  0.92
Test accuracy of Ada boost model:  0.7878787878787878
Balanced accuracy of Ada boost model:  0.7777777777777777


## Data science contest

Investigate the data in the 'train.csv' and 'test.csv' files, which describe the Titanic passengers. The 'train.csv' file contains data for 891 passengers, including the 'Survived' column, which indicates whether each passenger survived. The 'test.csv' file contains data for 418 passengers and does not contain the 'Survived' column. Your task is to predict the 'Survived' column for the 'test.csv' file using a LogisticRegression model.

In [16]:
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [17]:
print("Shape of the train dataset: ", df_train.shape)
print("Shape of the test dataset: ", df_test.shape)

Shape of the train dataset:  (891, 12)
Shape of the test dataset:  (418, 11)


In [18]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [19]:
df_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |
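
One possible starting point for the contest is sketched below. It is a baseline under stated assumptions, not a reference solution: the choice of columns, the imputation strategies, and the helper name `make_baseline` are all illustrative, and the `toy_train` rows are made-up stand-ins so that the sketch runs without the contest files.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature split; Name, Ticket and Cabin are ignored in this baseline.
NUMERIC = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CATEGORICAL = ["Sex", "Embarked"]

def make_baseline():
    """Impute and scale numeric columns, one-hot encode categorical ones."""
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    prep = ColumnTransformer([("num", numeric, NUMERIC),
                              ("cat", categorical, CATEGORICAL)])
    return Pipeline([("prep", prep),
                     ("clf", LogisticRegression(max_iter=1000))])

# Made-up rows with the same column names as train.csv (illustration only).
toy_train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 3, 1],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, float("nan"), 54.0, 2.0, 27.0],
    "SibSp":    [1, 1, 0, 1, 0, 0, 3, 0],
    "Parch":    [0, 0, 0, 0, 0, 0, 1, 0],
    "Fare":     [7.25, 71.28, 7.93, 53.1, 8.05, 51.86, 21.08, 13.0],
    "Embarked": ["S", "C", "S", "S", "S", "S", "S", "C"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

features = NUMERIC + CATEGORICAL
baseline = make_baseline()
baseline.fit(toy_train[features], toy_train["Survived"])
toy_pred = baseline.predict(toy_train[features])
print(toy_pred)
```

With the real data, `toy_train` would be replaced by `df_train`, and predictions for the contest would come from `baseline.predict(df_test[features])`.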
