## Classification / Logistic Regression

Dataset used: https://www.kaggle.com/c/titanic/data

## What is the difference between Linear and Logistic Regression?


Logistic regression is a popular probabilistic, supervised learning algorithm.
The output (target variable) of a logistic regression model is a class, as opposed to a continuous variable. The model predicts the class (a categorical outcome) of future data, for example yes or no, pass or no pass, malignant or benign. When there are more than two classes, the model is called multinomial logistic regression.
While linear regression is suited for estimating continuous values (e.g. estimating house prices), it is not the best tool for predicting the class of an observed data point.

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<font size = 3><strong>Recall linear regression:</strong></font>
<br>
<br>
    <b>Linear regression</b> finds a function that relates a continuous dependent variable, <b>y</b>, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, linear regression with several predictors assumes a function of the form:
<br><br>
$$
y = \theta_0 + \theta_1  x_1 + \theta_2  x_2 + \cdots
$$
<br>
and finds the values of parameters $\theta_0, \theta_1, \theta_2$, etc, where the term $\theta_0$ is the "intercept". It can be generally shown as:
<br><br>
$$
h_\theta(x) = \theta^TX
$$
<p></p>

</div>
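
As a small illustration of the notation above, the sketch below evaluates $\theta^TX$ for a single observation with NumPy. The parameter and feature values are made up for illustration; they are not fitted to the Titanic data.

```python
import numpy as np

# Hypothetical parameters: theta_0 (intercept), theta_1, theta_2
theta = np.array([1.0, 2.0, -0.5])

# One observation with two predictors, x_1 = 3 and x_2 = 4
x = np.array([3.0, 4.0])

# Prepend a 1 so that theta_0 takes part in the dot product
x_with_intercept = np.insert(x, 0, 1.0)

y_hat = theta @ x_with_intercept
print(y_hat)  # 1 + 2*3 - 0.5*4 = 5.0
```

Prepending the constant 1 is the same trick the `prepare_data` and `predict` functions in this notebook use to handle the intercept.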

Logistic Regression is a variation of Linear Regression, used when the observed dependent variable, <b>y</b>, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.

Logistic regression fits a special S-shaped curve by taking the linear regression function and transforming the numeric estimate into a probability with the following function, which is called the sigmoid function $\sigma$:

$$
h_\theta(x) = \sigma(\theta^TX) = \frac{e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}{1 + e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}
$$
Or:
$$
\text{Probability of class 1} = P(Y=1|X) = \sigma(\theta^TX) = \frac{e^{\theta^TX}}{1+e^{\theta^TX}}
$$

In this equation, ${\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), $e$ is the base of the exponential function, and $\sigma(\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2021-01-01), also called the logistic curve. Its graph has the characteristic "S" shape (sigmoid curve).
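
To make the shape of the function concrete, here is a minimal NumPy sketch of the sigmoid. It uses the form $1/(1+e^{-z})$, which is algebraically the same as $e^{z}/(1+e^{z})$; the function name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, zero maps to 0.5, large positive approach 1
z = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z)
print(probs)
```

Note that `sigmoid(-z)` equals `1 - sigmoid(z)`, so the outputs for -4 and 4 sum to 1.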

So, briefly, logistic regression passes the input through the logistic (sigmoid) function and then treats the result as a probability.



The objective of the **Logistic Regression** algorithm is to find the best parameters $\theta$ for $h_\theta(x) = \sigma(\theta^TX)$, in such a way that the model best predicts the class of each case.
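
One common way to make "best" precise is maximum likelihood. The derivation below is a sketch under one assumption: the labels are recoded as $y_i \in \{-1, +1\}$, which is what the `prepare_data` function later in this notebook does. Here $x_i$ denotes the feature vector of observation $i$ (including the leading 1 for the intercept) and $\eta$ is a step size. Because $1 - \sigma(z) = \sigma(-z)$, both classes can be written with one expression:

$$
P(y_i \mid x_i) = \sigma(y_i \, \theta^T x_i)
$$

The log-likelihood over $n$ observations is then

$$
\ell(\theta) = \sum_{i=1}^{n} \log \sigma(y_i \, \theta^T x_i)
$$

and, using $\frac{d}{dz}\log\sigma(z) = 1 - \sigma(z)$, its gradient is

$$
\nabla_\theta \ell(\theta) = \sum_{i=1}^{n} \left(1 - \sigma(y_i \, \theta^T x_i)\right) y_i \, x_i
$$

Each term of this sum matches what `to_sum` computes, and stepping in the direction of the gradient gives the update used by `update_w`:

$$
\theta \leftarrow \theta + \eta \, \nabla_\theta \ell(\theta)
$$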



In [12]:
# Import the necessary modules

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)

# Load the data into a `pandas` DataFrame object
titanic_df = pd.read_csv('~/Downloads/train.csv')


In [13]:
titanic_df.head(5)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [14]:
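# Remove missing values: keep rows with a known Embarked value, then drop
# any column that has fewer than 450 non-null entries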
tdf = titanic_df[titanic_df.Embarked.notnull()].dropna(axis = "columns", thresh = 450)
sh = tdf.shape

rows = sh[0]
cols = sh[1]

In [None]:
titanic_df.drop(['Ticket','Cabin', 'PassengerId', 'Name'], axis=1, inplace=True)
titanic_df = titanic_df.loc[titanic_df['Embarked'].notnull(),:]

### Drop "Survived" for purposes of KNN imputation:
y_target = titanic_df.Survived
titanic_knn = titanic_df.drop(['Survived'], axis = 1)  
titanic_knn.head()

In [None]:
to_dummy = ['Sex','Embarked']
titanic_knn = pd.get_dummies(titanic_knn, prefix = to_dummy, columns = to_dummy, drop_first = True)

titanic_knn.head()

In [43]:
train = titanic_knn[titanic_knn.Age.notnull()]
X_train = train.drop(['Age'], axis = 1)
y_train = train.Age


# Data to impute: rows where Age is null, with the all-null "Age" column removed
impute = titanic_knn[titanic_knn.Age.isnull()].drop(['Age'], axis = 1)
print("Data to Impute")
print(impute.head(3))

# import algorithm
from sklearn.neighbors import KNeighborsRegressor

# Instantiate
knr = KNeighborsRegressor()

# Fit
knr.fit(X_train, y_train)

# Create Predictions
imputed_ages = knr.predict(impute)

# Add to Df
impute['Age'] = imputed_ages
print("\nImputed Ages")
print(impute.head(3))

# Re-combine data frames
titanic_imputed = pd.concat([train, impute], sort = False, axis = 0)

# Return to original order - to match back up with "Survived"
titanic_imputed.sort_index(inplace = True)
print("Shape before imputation:", titanic_knn.shape)
print("Shape with imputed values:", titanic_imputed.shape)
titanic_imputed.head(7)

Data to Impute
    Pclass  SibSp  Parch     Fare  Sex_male  Embarked_Q  Embarked_S
5        3      0      0   8.4583         1           1           0
17       2      0      0  13.0000         1           0           1
19       3      0      0   7.2250         0           0           0

Imputed Ages
    Pclass  SibSp  Parch     Fare  Sex_male  Embarked_Q  Embarked_S   Age
5        3      0      0   8.4583         1           1           0  47.2
17       2      0      0  13.0000         1           0           1  30.4
19       3      0      0   7.2250         0           0           0  24.0
Shape before imputation: (889, 8)
Shape with imputed values: (889, 8)


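|   | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 1 |
| 5 | 3 | 47.2 | 0 | 0 | 8.4583 | 1 | 1 | 0 |
| 6 | 1 | 54.0 | 0 | 0 | 51.8625 | 1 | 0 | 1 |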


In [24]:
def prepare_data(input_x, target_y):
    # Ensure shape of x-array
    if input_x.shape[0] < input_x.shape[1]:
        input_x = np.transpose(input_x)
    
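    # Check the shape of the y array; flatten it to 1-D if it is a single row or column
    if len(target_y.shape) > 1:
        if min(target_y.shape) == 1:
            # reshape returns a new array, so the result must be assigned
            target_y = target_y.reshape(-1)
        else:
            print("Bad Y")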
    
    
    # Create the column of ones
    ones = np.ones((input_x.shape[0],1), dtype = int)
    
    # prepend the column of ones
    prepared_x = np.concatenate((ones,input_x), axis = 1)
    
    # Ensure the target is all -1 and 1
    prepared_y = np.array([x if x ==1 else -1 for x in target_y])
    
    # Create the initial weights of 0s
    initial_w = np.zeros(prepared_x.shape[1])
    
    # Return the three numpy arrays

    return prepared_x, prepared_y, initial_w

In [45]:
import itertools
categorical = ['Pclass','Sex','Embarked']
numeric = ['Age','SibSp','Parch','Fare']

# Create all the pairs of categorical variables and look at the distributions
cat_combos = list(itertools.combinations(categorical, 2))

In [25]:
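def sigmoid_single(x, y, w):
    # Sigmoid of y * (w^T x) for one observation x with label y in {-1, 1}
    exponent = y*np.matmul(x.T,w)
    # np.exp overflows float64 above roughly 709.78; the sigmoid is effectively 1 there
    if exponent > 709.782:
        return 1
    else:
        exp = np.exp(exponent)
        
        return exp / (1+exp)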
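
The threshold 709.782 in `sigmoid_single` is close to the largest exponent `np.exp` can evaluate in 64-bit floating point. The sketch below checks that limit and shows one alternative way to avoid the overflow without a hard-coded constant. The name `stable_sigmoid` is introduced here for illustration and is not part of the notebook's code.

```python
import numpy as np

# Largest exponent np.exp can handle in float64 before overflowing to inf
limit = np.log(np.finfo(np.float64).max)
print(limit)  # approximately 709.78

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive number, so it cannot overflow."""
    if z >= 0:
        return 1.0 / (1.0 + np.exp(-z))
    exp_z = np.exp(z)  # z < 0, so exp_z lies in [0, 1)
    return exp_z / (1.0 + exp_z)

print(stable_sigmoid(1000.0))   # 1.0
print(stable_sigmoid(-1000.0))  # 0.0
```

Branching on the sign of `z` keeps every call to `np.exp` on a non-positive argument, where the worst case is a harmless underflow to 0.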

In [26]:
def to_sum(x,y,w):
    # Contribution of one observation (x, y) to the gradient of the log-likelihood
    return (1- sigmoid_single(x,y,w))*y*x


In [27]:
# 'sum_all' computes and returns the gradient of the log-likelihood by summing the per-observation terms.
def sum_all(x_input, y_target, w):
    grad = np.zeros(len(w))
    
    for x,y in zip(x_input, y_target):
        grad += to_sum(x,y,w)
    return grad

In [28]:
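# 'update_w' performs a single gradient step for the Logistic Regression weights.
# The gradient of the log-likelihood is added, so this is gradient ascent on the
# log-likelihood (equivalently, gradient descent on the negative log-likelihood).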
def update_w(x_input, y_target, w, eta):
    return w + (eta * sum_all(x_input, y_target, w))
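
The per-row loop in `sum_all` can also be expressed with matrix operations. The following is a sketch of an equivalent vectorized gradient and a single update step, assuming `X` already contains the leading column of ones and the labels are in {-1, +1}. The function name `gradient_vectorized`, the tiny data set, and the step size 0.1 are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_vectorized(X, y, w):
    """Gradient of the log-likelihood for labels in {-1, +1}.

    X: (n, d) array whose first column is all ones, y: (n,), w: (d,).
    """
    margins = y * (X @ w)                  # y_i * w^T x_i for every row
    coeffs = (1.0 - sigmoid(margins)) * y  # per-row scalar factor
    return X.T @ coeffs                    # sum over rows of coeffs_i * x_i

# Tiny made-up data set: intercept column plus one feature
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, -1, 1])
w = np.zeros(2)

grad = gradient_vectorized(X, y, w)
w_next = w + 0.1 * grad  # one ascent step with a hypothetical eta = 0.1
print(grad, w_next)
```

With all-zero weights every sigmoid evaluates to 0.5, so the gradient reduces to half the label-weighted column sums of `X`. This version does not include the overflow guard from `sigmoid_single`.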

In [29]:
# 'fixed_iteration' runs the gradient updates for a specified number of steps and returns the resulting Logistic Regression weights.
def fixed_iteration(x_input, y_target, eta, steps):
    # preprocess data
    x_input, y_target, w = prepare_data(x_input, y_target)
    
    #print(x_input, y_target, w)
    
    for i in range(steps):
        w = update_w(x_input, y_target, w, eta)
    
    return w


In [30]:
def predict(x_input, weights):
    # Add intercept term to x
    x_input = np.insert(x_input, 0, 1)
    
    prod = np.matmul(x_input,weights)
    
    if prod > 0:
        return 1
    else:
        return -1

In [33]:
Xs = np.array([[22,7.25],[38,71.2833],[26,7.925],[35,53.1]])
weights = np.array([0,1,-1])
for X in Xs:
    print(predict(X,weights))

1
-1
1
-1
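
The `predict` function above handles one observation at a time. As a sketch, the same decision rule can be applied to a whole matrix in one call; `predict_all` is a name introduced here for illustration, and the inputs are the same four rows and weights used in the cell above.

```python
import numpy as np

def predict_all(X, weights):
    """Return +1 or -1 for every row of X (X has no intercept column)."""
    ones = np.ones((X.shape[0], 1))
    X_with_intercept = np.hstack([ones, X])
    scores = X_with_intercept @ weights
    # Same rule as predict: strictly positive score -> 1, otherwise -1
    return np.where(scores > 0, 1, -1)

Xs = np.array([[22, 7.25], [38, 71.2833], [26, 7.925], [35, 53.1]])
weights = np.array([0, 1, -1])
preds = predict_all(Xs, weights)
print(preds.tolist())  # [1, -1, 1, -1]
```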


In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

lr = LogisticRegression()

lr.fit(titanic_imputed, y_target)

sk_pred = lr.predict(titanic_imputed)

In [48]:
print(lr.intercept_)
print(lr.coef_)

[5.18322711]
[[-1.15005160e+00 -4.16216078e-02 -3.17966468e-01 -7.76906221e-02
   1.96181395e-03 -2.50691199e+00  2.56761700e-01 -2.68337022e-01]]


In [49]:
wt = fixed_iteration(titanic_imputed.values, y_target.values, .05, 12000)

print(wt)

cust_preds = np.array([predict(x,wt) for x in titanic_imputed.values])
cust_preds[cust_preds == -1] = 0

[  6837.1092539    -812.92803212   -126.08068623  -2903.52700482
  -1263.22489187     61.24324856 -14298.35656276     97.02066657
    396.60947192]


In [50]:
print("sklearn:")
print(classification_report(y_target, sk_pred))

print("Custom:")
print(classification_report(y_target, cust_preds))

sklearn:
              precision    recall  f1-score   support

           0       0.83      0.87      0.85       549
           1       0.78      0.70      0.74       340

    accuracy                           0.81       889
   macro avg       0.80      0.79      0.79       889
weighted avg       0.81      0.81      0.81       889

Custom:
              precision    recall  f1-score   support

           0       0.77      0.93      0.84       549
           1       0.83      0.55      0.66       340

    accuracy                           0.78       889
   macro avg       0.80      0.74      0.75       889
weighted avg       0.79      0.78      0.77       889
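
To make the columns of these reports concrete, the sketch below computes precision, recall and F1 for one class directly from counts of true positives, false positives and false negatives. The helper name `precision_recall_f1` and the eight example labels are made up for illustration, and the function assumes each denominator is non-zero.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Read this way, the custom model's row for class 1 (precision 0.83, recall 0.55) says that its positive predictions are usually right but that it misses many actual survivors, while the sklearn model trades a little precision for higher recall.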

