## **Logistic Regression**

Logistic regression is a statistical model that, in its basic form, uses the logistic function to model a binary dependent variable; many more complex extensions exist. In regression analysis, fitting a logistic regression (or logit regression) means estimating the parameters of a logistic model, which is a form of binary regression.

For example: detecting spam emails, predicting whether a customer will default on a loan, etc.
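
As a minimal sketch of the idea, the logistic function squashes a linear score into a probability between 0 and 1, which is then thresholded to get a binary label. The `intercept` and `weight` values below are hypothetical, chosen only for illustration and not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (assumed for illustration, not learned from data).
intercept = -1.0
weight = 0.5

x = np.array([-4.0, 0.0, 2.0, 8.0])
probabilities = sigmoid(intercept + weight * x)   # estimated P(y = 1 | x)
labels = (probabilities >= 0.5).astype(int)       # threshold at 0.5

print(probabilities)
print(labels)
```

A score of exactly zero maps to a probability of 0.5, which is why 0.5 is the usual decision threshold.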


<img src = "https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-vs-logistic-regression.png">

## **Getting Started with Logistic Regression**

In [30]:
#Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

## **Data Framing**

Read the .csv data into a pandas DataFrame.

In [2]:
# Read the dataset into a DataFrame named data
data = pd.read_csv("titanic.csv")

# Preview the first 10 rows
data.head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1.0,1.0,"Anderson, Mr. Harry",male,48.0,0.0,0.0,19952,26.55,E12,S,3,,"New York, NY"
6,1.0,1.0,"Andrews, Miss. Kornelia Theodosia",female,63.0,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1.0,0.0,"Andrews, Mr. Thomas Jr",male,39.0,0.0,0.0,112050,0.0,A36,S,,,"Belfast, NI"
8,1.0,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


## **Exploring Dataset**

In [3]:
# Keep only the columns used in this walkthrough
data = data[["pclass","sex","age","survived"]]

In [4]:
data.shape

(1310, 4)

In [5]:
#Describing our dataset
data.describe()

Unnamed: 0,pclass,age,survived
count,1309.0,1046.0,1309.0
mean,2.294882,29.881135,0.381971
std,0.837836,14.4135,0.486055
min,1.0,0.1667,0.0
25%,2.0,21.0,0.0
50%,3.0,28.0,0.0
75%,3.0,39.0,1.0
max,3.0,80.0,1.0


In [6]:
#Printing top 5 rows
data.head()


Unnamed: 0,pclass,sex,age,survived
0,1.0,female,29.0,1.0
1,1.0,male,0.9167,1.0
2,1.0,female,2.0,0.0
3,1.0,male,30.0,0.0
4,1.0,female,25.0,0.0


In [7]:
#Printing last 5 rows
data.tail()

Unnamed: 0,pclass,sex,age,survived
1305,3.0,female,,0.0
1306,3.0,male,26.5,0.0
1307,3.0,male,27.0,0.0
1308,3.0,male,29.0,0.0
1309,,,,


In [8]:
# Count NaN (missing) values in each column
print(data.isna().sum(axis = 0))

# data.isna().sum(axis = 0)   # NaN count per column
# data.isna().sum(axis = 1)   # NaN count per row

pclass        1
sex           1
age         264
survived      1
dtype: int64
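
The difference between `axis = 0` and `axis = 1` in the cell above can be seen on a small example. This is a sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the Titanic file); it also shows `isna().mean()`, which gives the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "pclass": [1.0, 3.0, np.nan],
    "age": [29.0, np.nan, np.nan],
})

missing_per_column = toy.isna().sum(axis=0)   # one count per column
missing_per_row = toy.isna().sum(axis=1)      # one count per row
missing_share = toy.isna().mean()             # fraction missing per column

print(missing_per_column)
print(missing_per_row)
print(missing_share)
```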


## **Preprocessing: Dealing with Missing Values**

In [9]:
# Drop rows that have NaN in any of the listed columns (age is handled later by imputation)
data = data.dropna(subset=["sex","pclass","survived"])

In [10]:
# Count missing values again after dropping rows
print(data.isna().sum(axis = 0))

pclass        0
sex           0
age         263
survived      0
dtype: int64
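
To see why **age** keeps its missing values after this step, here is a small sketch of how `dropna` behaves with and without `subset`. The `toy` frame is hypothetical and exists only to illustrate the behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks only age, row 2 lacks everything.
toy = pd.DataFrame({
    "sex": ["female", "male", np.nan],
    "age": [29.0, np.nan, np.nan],
    "survived": [1.0, 0.0, np.nan],
})

# Only the listed columns are checked, so row 1 survives despite its missing age.
kept_subset = toy.dropna(subset=["sex", "survived"])

# Without subset, any NaN in any column drops the row.
kept_any = toy.dropna()

print(len(kept_subset), len(kept_any))
print(kept_subset["age"].isna().sum())
```

Restricting the check to `subset` keeps rows whose only gap is a value that can be imputed later.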


In [12]:
data.tail()

Unnamed: 0,pclass,sex,age,survived
1304,3.0,female,14.5,0.0
1305,3.0,female,,0.0
1306,3.0,male,26.5,0.0
1307,3.0,male,27.0,0.0
1308,3.0,male,29.0,0.0


## **Preprocessing: Label Encoding**

The **sex** column holds the text values male and female rather than numbers. Because the model does not accept text, we transform these values into binary labels.

In [13]:
data["sex"] = data["sex"].map({"male":1,"female":0})

data["sex"]

0       0
1       1
2       0
3       1
4       0
       ..
1304    0
1305    0
1306    1
1307    1
1308    1
Name: sex, Length: 1309, dtype: int64
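
One thing worth knowing about `Series.map` with a dictionary is that any value missing from the dictionary becomes NaN. The sketch below uses a made-up series with an inconsistent capitalisation ("Male") to show this, plus one possible way to normalise the text first; the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical values; "Male" is deliberately inconsistent.
sex = pd.Series(["male", "female", "Male", "female"])
mapping = {"male": 1, "female": 0}

encoded = sex.map(mapping)                    # "Male" is not a key, so it maps to NaN
cleaned = sex.str.lower().map(mapping)        # normalise case before mapping

print(encoded.isna().sum())
print(cleaned.tolist())
```

Checking `isna().sum()` after mapping is a cheap way to confirm that every category was covered.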

In [14]:
data.head()

Unnamed: 0,pclass,sex,age,survived
0,1.0,0,29.0,1.0
1,1.0,1,0.9167,1.0
2,1.0,0,2.0,0.0
3,1.0,1,30.0,0.0
4,1.0,0,25.0,0.0


## **Feature Extraction**

Separate the input features (sex, age, pclass) from the target (survived).

In [15]:
features = data[["sex","age","pclass"]]
target = data[["survived"]]
features.head()

Unnamed: 0,sex,age,pclass
0,0,29.0,1.0
1,1,0.9167,1.0
2,0,2.0,1.0
3,1,30.0,1.0
4,0,25.0,1.0


In [16]:
target.head()

Unnamed: 0,survived
0,1.0
1,1.0
2,0.0
3,0.0
4,0.0
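
Note that selecting with double brackets, as in `data[["survived"]]`, returns a one-column DataFrame, while single brackets return a 1-D Series. The sketch below shows the difference in shape on a hypothetical two-row frame; scikit-learn estimators generally expect a 1-D target, so the 2-D form may need flattening before fitting.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
toy = pd.DataFrame({"age": [29.0, 2.0], "survived": [1.0, 0.0]})

as_frame = toy[["survived"]]          # DataFrame, 2-D
as_series = toy["survived"]           # Series, 1-D
flat = as_frame.to_numpy().ravel()    # flatten the 2-D form into a 1-D array

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
print(flat.shape)
```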


## **Preprocessing: Imputation | Missing Values in Age**

The **age** column still has many missing values (263 after the previous step), so we impute them: each missing value is replaced with the column mean.

In [17]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
features = imputer.fit_transform(features)

In [18]:
features

array([[ 0.    , 29.    ,  1.    ],
       [ 1.    ,  0.9167,  1.    ],
       [ 0.    ,  2.    ,  1.    ],
       ...,
       [ 1.    , 26.5   ,  3.    ],
       [ 1.    , 27.    ,  3.    ],
       [ 1.    , 29.    ,  3.    ]])
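
As a small sketch of what mean imputation does, the example below runs `SimpleImputer` on a hypothetical 3x2 array with one missing entry. The numbers are made up so the result is easy to check by hand: the mean of 20 and 40 is 30.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array: the second column has one missing value.
toy = np.array([
    [0.0, 20.0],
    [1.0, np.nan],
    [1.0, 40.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

print(filled)
print(imputer.statistics_)   # per-column means learned during fit
```

Columns without missing values pass through unchanged, and the output is a NumPy array rather than a DataFrame, which is why the column names disappear.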

## **Splitting our dataset into Train & Test Set**

In [19]:
# Default split: 75% train, 25% test; no random_state is set, so the split differs between runs
feature_train, feature_test, target_train, target_test = train_test_split(features,target)
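
If a reproducible split is wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the class proportions the same in both parts. The sketch below uses hypothetical toy arrays `X` and `y`; the seed value 42 and the explicit `test_size=0.25` are arbitrary choices for illustration, not settings used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 12 of class 0 and 8 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# Fixed seed gives the same split on every run; stratify preserves the 60/40 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Repeating the call with the same seed reproduces the split exactly.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test2))
```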

## **Training the Model**

We use the LogisticRegression model imported from the sklearn library, train it on feature_train and target_train, and then predict labels for feature_test.
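
Before the actual training cell, here is a minimal sketch of the fit/predict pattern on a hypothetical one-feature dataset. It also shows `predict_proba`, which returns the estimated class probabilities behind each predicted label. The toy `X` and `y` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: small values belong to class 0, large values to class 1.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X, y)                       # y is 1-D, as scikit-learn expects

proba = toy_model.predict_proba(X)        # one row per sample, one column per class
preds = toy_model.predict(X)              # label of the more probable class

print(proba.round(3))
print(preds)
```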

In [20]:
model = LogisticRegression()
# target_train is a one-column DataFrame; flatten it to 1-D to avoid sklearn's DataConversionWarning
model.fit(feature_train, target_train.values.ravel())
predictions = model.predict(feature_test)


## **Printing an Error Matrix and Accuracy Score**

<img src ="https://glassboxmedicine.files.wordpress.com/2019/02/confusion-matrix.png?w=816" height =200>

In [21]:
print(confusion_matrix(target_test, predictions))
print(accuracy_score(target_test, predictions) * 100)  # accuracy in percent

[[181  29]
 [ 39  79]]
79.26829268292683
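
The accuracy above can be recomputed by hand from the printed matrix. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]` (this may differ from the layout in the illustration above). The sketch below also derives precision and recall for the survived class, which are not reported in the original output.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[181, 29],
               [39, 79]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all predictions that are correct
precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many were found

print(accuracy * 100)
print(round(precision, 4), round(recall, 4))
```

Here 260 of the 328 test passengers are classified correctly, which matches the accuracy score printed above.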


In [24]:
data

Unnamed: 0,pclass,sex,age,survived
0,1.0,0,29.0000,1.0
1,1.0,1,0.9167,1.0
2,1.0,0,2.0000,0.0
3,1.0,1,30.0000,0.0
4,1.0,0,25.0000,0.0
...,...,...,...,...
1304,3.0,0,14.5000,0.0
1305,3.0,0,,0.0
1306,3.0,1,26.5000,0.0
1307,3.0,1,27.0000,0.0
