## Logistic Regression

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#read the training and testing dataset
train=pd.read_csv('trainT.csv')
test=pd.read_csv('testT.csv')

In [3]:
train.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [4]:
#choose some of the variables

df=train[['Survived','Pclass','Sex','Age','Fare']]

df.head()


| | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 8.05 |


In [5]:
#Encoding data - converting male to 1 and female to 0
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning

df=df.assign(Sex=df['Sex'].apply(lambda sex:1 if sex=="male" else 0))


In [6]:
#to check for missing values

df.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         177
Fare          0
dtype: int64

In [7]:
#handling missing values - Data Imputation
#the median is used because it is robust to outliers
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning

df=df.assign(Age=df['Age'].fillna(df['Age'].median()))
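The choice of the median over the mean can be illustrated with a few made-up ages. The values below are hypothetical and are not taken from the dataset; this is only a small sketch of the idea. A single extreme value shifts the mean noticeably but leaves the median where it was:

```python
from statistics import mean, median

# Hypothetical ages, for illustration only (not from trainT.csv)
ages = [22, 26, 35, 35, 38]
ages_with_outlier = ages + [250]  # one bad entry, e.g. a data-entry typo

# The mean is pulled toward the outlier ...
mean_before = mean(ages)
mean_after = mean(ages_with_outlier)

# ... while the median stays at the same value
median_before = median(ages)
median_after = median(ages_with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

This is why filling the 177 missing Age values with the median is a reasonable default when a column may contain extreme values.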


In [8]:
#set the predictor and response variable
X=df.drop('Survived', axis=1)
Y=df['Survived']

In [9]:
#split the dataset into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.3, random_state=25)

In [10]:
#The Logistic Regression
from sklearn.linear_model import LogisticRegression
logit=LogisticRegression()
logit.fit(X_train,Y_train)


LogisticRegression()

In [11]:
Y_pred=logit.predict(X_test)

In [12]:
#Confusion Matrix

from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,Y_pred)
cm

array([[136,  29],
       [ 31,  72]], dtype=int64)

In [13]:
#Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(Y_test,Y_pred)

0.7761194029850746

In [14]:
#Classification Report

from sklearn.metrics import classification_report
report=classification_report(Y_test,Y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.81      0.82      0.82       165
           1       0.71      0.70      0.71       103

    accuracy                           0.78       268
   macro avg       0.76      0.76      0.76       268
weighted avg       0.78      0.78      0.78       268



#### Use another simple classifier - Naive Bayes - and compare the results
#### Try more arguments for train_test_split()

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

## Naive Bayes Classifier

In [15]:
from sklearn.naive_bayes import GaussianNB

gaussian_nb=GaussianNB()
gaussian_nb.fit(X_train,Y_train)

GaussianNB()

In [16]:
YNB_pred=gaussian_nb.predict(X_test)

In [17]:
#Confusion Matrix

from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,YNB_pred)
cm

array([[130,  35],
       [ 27,  76]], dtype=int64)

In [18]:
#Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(Y_test,YNB_pred)

0.7686567164179104

In [19]:
#Classification Report

from sklearn.metrics import classification_report
report=classification_report(Y_test,YNB_pred)
print(report)

              precision    recall  f1-score   support

           0       0.83      0.79      0.81       165
           1       0.68      0.74      0.71       103

    accuracy                           0.77       268
   macro avg       0.76      0.76      0.76       268
weighted avg       0.77      0.77      0.77       268



### Conclusion:

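Comparing the two classifiers, Logistic Regression comes out marginally ahead. Accuracy is close (Logistic Regression 77.6%, Naive Bayes 76.9%), and the per-class metrics are mixed: Logistic Regression has the higher recall for class 0 (0.82 vs 0.79) and the higher precision for class 1 (0.71 vs 0.68), while Naive Bayes has the higher precision for class 0 (0.83 vs 0.81) and the higher recall for class 1 (0.74 vs 0.70). The F1-score is slightly higher for Logistic Regression on class 0 (0.82 vs 0.81) and the same for both models on class 1 (0.71). Hence, on this split, the Logistic Regression model performs slightly better overall.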
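The figures in the two classification reports can be reproduced directly from the confusion matrices, which makes the comparison easier to check. A minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper written for this note, not part of scikit-learn:

```python
# Confusion matrices copied from the outputs above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm_logit = [[136, 29], [31, 72]]
cm_nb = [[130, 35], [27, 76]]

def per_class_metrics(cm):
    """Return accuracy and {class: (precision, recall)} for a 2x2 matrix."""
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    metrics = {}
    for k in (0, 1):
        predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
        actual_k = cm[k][0] + cm[k][1]     # row sum: everything that truly is k
        metrics[k] = (cm[k][k] / predicted_k, cm[k][k] / actual_k)
    return accuracy, metrics

for name, cm in [("Logistic Regression", cm_logit), ("Naive Bayes", cm_nb)]:
    accuracy, metrics = per_class_metrics(cm)
    print(name, "accuracy:", round(accuracy, 3))
    for k, (precision, recall) in metrics.items():
        print("  class", k, "precision:", round(precision, 2), "recall:", round(recall, 2))
```

Working from the raw counts avoids relying on the rounded two-decimal values when two models are this close.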

## Additional - Trying more arguments for train_test_split()

In [28]:
X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.3, random_state=25, shuffle=False, stratify=None, train_size=0.7)

### Logistic Regression

In [29]:
logit=LogisticRegression()
logit.fit(X_train,Y_train)
Y_pred=logit.predict(X_test)

In [30]:
cm=confusion_matrix(Y_test,Y_pred)
print("Confusion Matrix")
print(cm)
print()
print("Accuracy")
print(accuracy_score(Y_test,Y_pred))


Confusion Matrix
[[150  22]
 [ 34  62]]

Accuracy
0.7910447761194029


In [31]:
report=classification_report(Y_test,Y_pred)
print("Classification Report")
print(report)

Classification Report
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       172
           1       0.74      0.65      0.69        96

    accuracy                           0.79       268
   macro avg       0.78      0.76      0.77       268
weighted avg       0.79      0.79      0.79       268



### Naive Bayes Classifier

In [32]:
gaussian_nb=GaussianNB()
gaussian_nb.fit(X_train,Y_train)
YNB_pred=gaussian_nb.predict(X_test)

In [33]:
cmB=confusion_matrix(Y_test,YNB_pred)
print("Confusion Matrix")
print(cmB)
print()
print("Accuracy")
print(accuracy_score(Y_test,YNB_pred))

Confusion Matrix
[[147  25]
 [ 32  64]]

Accuracy
0.7873134328358209


In [34]:
reportB=classification_report(Y_test,YNB_pred)
print("Classification report")
print(reportB)

Classification report
              precision    recall  f1-score   support

           0       0.82      0.85      0.84       172
           1       0.72      0.67      0.69        96

    accuracy                           0.79       268
   macro avg       0.77      0.76      0.76       268
weighted avg       0.78      0.79      0.79       268



### Conclusion:

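The following parameters were passed explicitly to train_test_split():
1. shuffle=False : default value is True
2. stratify=None : default value is None
3. train_size=0.7 : default value is None, in which case it is set to the complement of test_size (0.7 here)

Only shuffle=False actually changes the split, since the other two match what the first run used implicitly. With shuffle=False the rows are split in their original order, so random_state has no effect, and the class balance of the test set shifts from 165/103 to 172/96.

After this change, the accuracy of both models increases to about 79% (Logistic Regression 79.1%, Naive Bayes 78.7%). The two models now perform almost equally well, and it is hard to say which is better because their precision and recall values are so close. From the confusion matrices, Logistic Regression correctly identifies more class 0 passengers (150 vs 147), while Naive Bayes correctly identifies more class 1 passengers (64 vs 62). Which of these matters more may be considered when choosing between the two models.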
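To see why an unshuffled split can change the results, the sketch below splits a small list in its original order. It is written in plain Python rather than calling scikit-learn: `ordered_split` is a hypothetical helper, the labels are made up, and rounding the test-set size up is an assumption about how the sizes are computed.

```python
import math

def ordered_split(rows, test_size=0.3):
    """Hypothetical helper: split rows in their original order, no shuffling."""
    n_test = math.ceil(test_size * len(rows))  # test-set size, rounded up (assumed)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

# Hypothetical labels, sorted so that all the 1s come last
labels = [0] * 6 + [1] * 4

train_labels, test_labels = ordered_split(labels, test_size=0.3)

print("train:", train_labels)  # the first 7 rows
print("test: ", test_labels)   # the last 3 rows
print("share of class 1 overall:", sum(labels) / len(labels))
print("share of class 1 in test:", sum(test_labels) / len(test_labels))
```

When the rows have any ordering, the last 30% can look quite different from the rest, so scores obtained without shuffling depend on how the file happens to be sorted.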