## Titanic Dataset

The dataset contains information about passengers on the Titanic, including features such as age, sex, ticket class, number of siblings/spouses aboard, number of parents/children aboard, fare, and more. The task is to build a predictive model that accurately classifies whether a passenger survived (1) or did not survive (0).

![titanic-sinking.jpg](attachment:titanic-sinking.jpg)

Below is a brief description of each column of the dataset:

1. **PassengerId:** A unique index for passenger rows. It starts at 1 for the first row and increments by 1 for every new row.

2. **Survived:** Shows if the passenger survived or not. 1 stands for survived and 0 stands for not survived.

3. **Pclass:** Ticket class. 1 stands for First class ticket. 2 stands for Second class ticket. 3 stands for Third class ticket.

4. **Name:** Passenger's name. The name also contains a title: "Mr" for a man, "Mrs" for a married woman, "Miss" for an unmarried woman or girl, and "Master" for a boy.

5. **Sex:** Passenger's sex. It's either Male or Female.

6. **Age:** Passenger's age. "NaN" values in this column indicate that the age of that particular passenger was not recorded.

7. **SibSp:** Number of siblings or spouses travelling with each passenger.
8. **Parch:** Number of parents or children travelling with each passenger.
9. **Ticket:** Ticket number.
10. **Fare:** How much the passenger paid for the journey.
11. **Cabin:** Cabin number of the passenger. "NaN" values in this column indicate that the cabin number of that particular passenger was not recorded.
12. **Embarked:** Port at which the passenger boarded.
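
The titles mentioned under **Name** can be pulled out with a regular expression. The following is a minimal sketch on a few hand-written names in the "Surname, Title. Given names" layout; the sample rows and the `Title` column name are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

# Hand-written sample names in "Surname, Title. Given names" form (illustrative only)
names = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ]
})

# Capture the text between the comma and the first period
names["Title"] = (
    names["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)
print(names["Title"].tolist())
```

A column like this could later be grouped against *Survived* in the same way as the other categorical features.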

### Step 1:- Importing Necessary Libraries

In [None]:
# loading the Basic libraries

import pandas as pd                                     # Data manipulation
import numpy as np                                      # Numerical computing
import seaborn as sns                                   # For Visualization
import matplotlib.pyplot as plt                         # For Visualization
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
# Importing the Sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score

### Step 2:- Loading Dataset

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


In [None]:
print(train.shape,test.shape)

#### Explore the Dataset for better understanding

In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
train.describe()

*describe(include='O')* shows the descriptive statistics of the columns with the object data type.

In [None]:
train.describe(include='O')

The *freq* row shows that some *Ticket* numbers and *Cabin* values are shared by several passengers. The most frequently repeated ticket number is "347082", which appears 7 times. Similarly, the most shared cabin is "B96 B98", which is used by 4 people.

We also see that 644 passengers embarked at port "S".

Among 891 rows, 577 were male and the rest were female.

These figures can be read directly from the *count*, *unique*, *top* and *freq* rows of the output.
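
To cross-check what *top* and *freq* report, the same information can be read from `value_counts()`. This is a small sketch on a made-up column standing in for `train['Embarked']`; the values are illustrative only.

```python
import pandas as pd

# Toy object column standing in for train['Embarked'] (values are made up)
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

summary = embarked.describe()      # count, unique, top, freq
counts = embarked.value_counts()   # full frequency table, most common first

print(summary)
print(counts)
```

`top` is the first label of the frequency table and `freq` is its count, so the two outputs should always agree.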

In [None]:
train.info()

We can see that *Age* value is missing for many rows.

Out of 891 rows, the *Age* value is present only in 714 rows.

Similarly, *Cabin* values are also missing in many rows. Only 204 out of 891 rows have *Cabin* values.

Similarly, *Embarked* values are missing in a few rows: only 889 out of 891 rows have an *Embarked* value.

In [None]:
null_vals = train.isnull().sum()
null_vals[null_vals>0]

There are 177 rows with missing *Age*, 687 rows with missing *Cabin* and 2 rows with missing *Embarked* information.
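
Counts are easier to compare as percentages (for example, 687 of 891 is about 77%). The sketch below builds such a report on a tiny made-up frame; the `toy` data and the `report` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps, standing in for the training data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100   # mean of booleans = fraction missing

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
report = report[report["missing"] > 0].sort_values("percent", ascending=False)
print(report)
```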

### Checking the Test Dataset

In [None]:
test.shape

*Survived* column is not present in Test data.
We have to train our classifier using the Train data and generate predictions (*Survived*) on Test data.

In [None]:
test.info()

There are missing entries for *Age* in Test dataset as well.

Out of 418 rows in Test dataset, only 332 rows have *Age* value.

*Cabin* values are also missing in many rows. Only 91 rows out of 418 have values for the *Cabin* column.

In [None]:
test_miss_vals = test.isnull().sum()
test_miss_vals[test_miss_vals>0]

There are 86 rows with missing *Age*, 327 rows with missing *Cabin* and 1 row with missing *Fare* information.

### Check for Duplicates

In [None]:
train[train.duplicated()]

In [None]:
test[test.duplicated()]

There are no duplicated records in the training or test set.

## Relationship between Features and Survival

In this section, we analyze the relationship between different features and *Survival*. We see how different feature values correspond to different survival chances, and we plot several kinds of diagrams to **visualize** the data and findings.

In [None]:
train.Survived.value_counts()

In [None]:
train.Survived.value_counts(normalize=True)*100

The dataset is slightly imbalanced, but that is acceptable and we can use it as it is. As a rough guide:
1. Balanced Dataset:- Each class has the same proportion of samples.
2. Slightly Imbalanced Dataset:- The class distribution is roughly 60-40, 70-30, or even 80-20, depending on the problem.
3. Imbalanced Dataset:- The class distribution is highly skewed, such as 90-10, 95-5, or even more extreme.
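
The rough guide above can be turned into a small helper. This is only a sketch: the function name `describe_balance` and the exact cut-offs (0.6 and 0.9 for the majority-class share) are assumptions chosen to match the ranges listed, not fixed rules.

```python
def describe_balance(counts):
    """Label a class distribution by the share of its largest class.

    counts: dict mapping class label -> number of samples.
    The 0.6 and 0.9 cut-offs are illustrative assumptions.
    """
    total = sum(counts.values())
    majority_share = max(counts.values()) / total
    if majority_share < 0.6:
        label = "balanced"
    elif majority_share < 0.9:
        label = "slightly imbalanced"
    else:
        label = "imbalanced"
    return majority_share, label


# Made-up class counts for illustration
share, label = describe_balance({0: 62, 1: 38})
print(round(share, 2), label)
```

In the notebook, the input could come from `train.Survived.value_counts().to_dict()`.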


In [None]:
sns.countplot(x = train.Survived)

### Pclass vs. Survival

Passengers in higher classes have a better survival chance.

In [None]:
train.Pclass.value_counts()

In [None]:
train.groupby('Pclass').Survived.value_counts()

In [None]:
train[['Pclass','Survived']].groupby(['Pclass']).mean()

In [None]:
train.groupby('Pclass')['Survived'].mean()*100

In [None]:
train.groupby('Pclass')[['Survived','Fare']].mean()

In [None]:
train.groupby('Pclass')['Survived'].mean().plot(kind='bar')
plt.xticks(rotation=0)
plt.show()

### Sex vs. Survival
Females have a better survival chance.

In [None]:
train.Sex.value_counts()

In [None]:
train.groupby('Sex').Survived.value_counts()

In [None]:
train[['Sex','Survived']].groupby(['Sex'],as_index=True).mean()

In [None]:
train.groupby('Sex')['Survived'].mean()*100

In [None]:
train.groupby('Sex').Survived.mean().plot(kind='bar')
plt.xticks(rotation=0)
plt.show()

#### Multivariate analysis using Pclass & Sex vs Survival

### Pclass & Sex vs Survival

Below, we count how many males and females there are in each *Pclass* and then plot a stacked bar diagram of those proportions. We find that there are more males among the 3rd Pclass passengers.

In [None]:
tab = pd.crosstab(train['Pclass'],train['Sex'])
tab

In [None]:
tab.sum(0)

In [None]:
tab.sum(1)

In [None]:
pd.crosstab(index = [train.Survived,train.Pclass], columns = [train.Sex])

In [None]:
94/216   # share of females among the 1st Pclass passengers, read from the crosstab above

In [None]:
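# axis=0 divides each row by its own total; the default would align on columns and return NaN
tab.div(tab.sum(1), axis=0)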

In [None]:
tab.div(tab.sum(1).astype(float),axis=0).plot(kind='bar',stacked=True)
plt.xticks(rotation=0)
plt.ylabel('Proportion')
plt.show()

In [None]:
train.head(2)

In [None]:
sns.catplot(x = 'Sex',y = 'Survived',hue='Pclass',kind = 'point',data=train)

From the above plot, it can be seen that:
- Women from 1st and 2nd Pclass have a very high survival chance (above 90%).
- Men from 2nd and 3rd Pclass have a low survival chance (roughly 10-15%).

In [None]:
sns.catplot(x = 'Pclass',y = 'Survived',hue='Sex',col = 'Embarked',kind = 'point',data=train)

In [None]:
train.columns

From the plot by *Embarked* above, it can be seen that:
- Almost all females from Pclass 1 and 2 survived.
- Females who died were mostly from 3rd Pclass.
- Males from Pclass 1 have only a slightly higher survival chance than males from Pclass 2 and 3.

### Embarked vs Survived

In [None]:
train.Embarked.value_counts()

In [None]:
train.groupby('Embarked').Survived.value_counts()

In [None]:
train[['Embarked','Survived']].groupby(['Embarked'],as_index=False).mean()

In [None]:
train.groupby('Embarked').Survived.mean().plot(kind='bar')
plt.ylabel('Mean(Survived)')
plt.xticks(rotation=0)
plt.show()

### Parch vs. Survival

In [None]:
train.Parch.value_counts()

In [None]:
train.groupby('Parch').Survived.value_counts()

In [None]:
train[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean()

In [None]:
train.groupby('Parch').Survived.mean().plot(kind='bar')
plt.ylabel('mean(Survived)')
plt.xticks(rotation=0)
plt.show()

### SibSp vs. Survival

In [None]:
train.SibSp.value_counts()

In [None]:
train.groupby('SibSp').Survived.value_counts()

In [None]:
train[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean()

In [None]:
train.groupby('SibSp').Survived.mean().plot(kind='bar')
plt.ylabel('mean(Survived)')
plt.xticks(rotation=0)
plt.show()

In [None]:
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)

sns.violinplot(x="Embarked", y="Age", hue="Survived", data=train, split=True, ax=ax1)
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train, split=True, ax=ax2)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train, split=True, ax=ax3)


In [None]:
train.Age[train.Age<0]

From the *Pclass* violinplot, we can see that:
- 1st Pclass has very few children compared to the other two classes.
- 1st Pclass has more old people compared to the other two classes.
- Almost all children (between age 0 and 10) of 2nd Pclass survived.
- Most children of 3rd Pclass survived.
- In 1st Pclass, younger people survived more often than older people.

From the *Sex* violinplot, we can see that:
- Most male children (between age 0 and 14) survived.
- Females between age 18 and 40 have a better survival chance.

From the *Embarked* violinplot, we can see that:
- Passengers older than 40 who embarked at Q have a low survival rate.

### Correlating Features

In [None]:
train.head(2)

In [None]:
plt.figure(figsize=(15,6))
sns.heatmap(train.drop(['PassengerId','Name','Sex','Ticket','Cabin','Embarked'],axis=1).corr(),vmax=0.6,square=True,annot=True)

In [None]:
train.head(2)

In [None]:
train_dup = train.copy()

In [None]:
test_dup = test.copy()

### Dropping Irrelevant Columns

In [None]:
train.drop(['PassengerId','Ticket'],axis=1,inplace=True)

In [None]:
test.drop(['PassengerId','Ticket'],axis=1,inplace=True)

In [None]:
null_counts = train.isnull().sum()
null_counts[null_counts>0]

In [None]:
train.shape

In [None]:
687/891

#### Cabin has more than 75% null values (687 of 891 rows, about 77%). A feature with that many missing values provides little information for prediction, so it is better to drop it.
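
The same rule can be applied to any frame without hard-coding the column name. The helper below is a sketch under that assumption; `drop_sparse_columns` and its `threshold` parameter are hypothetical names, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.75):
    """Return (copy of df without sparse columns, list of dropped names).

    A column is dropped when its fraction of missing values exceeds threshold.
    """
    missing_fraction = df.isnull().mean()
    to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Made-up data: Cabin is 80% missing, Age is 20% missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, np.nan, np.nan, np.nan, "E46"],
})
cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
```

The list of dropped names should be computed on the training data and then reused for the test data, so that both frames keep the same columns.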

In [None]:
train.drop('Cabin',axis =1,inplace=True)

In [None]:
test.drop('Cabin',axis =1,inplace=True)

### Imputing the null values present in the Age and Embarked columns

In [None]:
train.Age.plot(kind = 'hist')

### We can see that Age is approximately normally distributed, so we can go with mean imputation

In [None]:
round(train['Age'].mean(),0)

In [None]:
train['Age'].median()

In [None]:
train['Age'] = train.Age.fillna(round(train.Age.mean(),0))

In [None]:
print(train.Age.min(),train.Age.max())
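
The heading of this step also names *Embarked*, but the cells above fill only *Age*. For a categorical column, one common choice is to fill with the most frequent value. The sketch below shows that on made-up data; it is an assumed approach, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Made-up Embarked values with two gaps
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

most_common_port = toy["Embarked"].mode()[0]   # mode() ignores NaN by default
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

print(most_common_port, toy["Embarked"].isnull().sum())
```

Without this step, `pd.get_dummies(..., drop_first=True)` encodes a missing *Embarked* as all zeros, which is the same encoding as the dropped baseline category.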

### SibSp & Parch Feature

Combining *SibSp* & *Parch* feature, we create a new feature named *FamilySize*.

In [None]:
train['FamilySize'] = train['SibSp']+train.Parch + 1

In [None]:
train[['FamilySize','Survived']].groupby(['FamilySize'], as_index = False).mean()

The data shows that:

- A *FamilySize* of 2 to 4 has a better survival chance.
- *FamilySize = 1*, i.e. travelling alone, has a lower survival chance.
- A large *FamilySize* (5 and above) also has a lower survival chance.
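
The three groups described above can be made explicit with `pd.cut`. This is a sketch on made-up rows; the `FamilyGroup` column, its labels, and the upper bin edge of 20 are illustrative assumptions.

```python
import pandas as pd

# Made-up SibSp / Parch values
toy = pd.DataFrame({"SibSp": [0, 1, 3, 0, 4], "Parch": [0, 0, 1, 2, 2]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Bins are right-inclusive: (0, 1] alone, (1, 4] small, (4, 20] large
toy["FamilyGroup"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, 20],
    labels=["alone", "small", "large"],
)
print(toy)
```

Grouping *Survived* by such a column would summarize the pattern in three rows instead of one row per family size.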

### Encoding the Categorical Features

In [None]:
train.head(2)

In [None]:
cols = pd.get_dummies(train[['Sex','Embarked']],drop_first = True).astype('int')

In [None]:
final_train = pd.concat([train,cols],axis=1)

In [None]:
final_train.head(2)

In [None]:
test.head(2)

In [None]:
colstest = pd.get_dummies(test[['Sex','Embarked']],drop_first = True).astype('int')

In [None]:
final_test = pd.concat([test,colstest],axis=1)

In [None]:
final_test.head(2)
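
Encoding the two files separately works only as long as both contain every category. With `drop_first=True`, a file that lacks a category would silently drop a different baseline. One way to guard against that, sketched below on made-up frames, is to fix the category list from the training data before encoding; the frame names here are hypothetical.

```python
import pandas as pd

# Made-up frames: the test frame contains only one Embarked value
train_toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["S", "S"]})

# Use the categories seen in training for both frames
for col in ["Sex", "Embarked"]:
    cats = sorted(train_toy[col].dropna().unique())
    train_toy[col] = pd.Categorical(train_toy[col], categories=cats)
    test_toy[col] = pd.Categorical(test_toy[col], categories=cats)

train_enc = pd.get_dummies(train_toy, drop_first=True).astype("int")
test_enc = pd.get_dummies(test_toy, drop_first=True).astype("int")

print(list(train_enc.columns))
print(test_enc)
```

Because the categories are fixed up front, both encoded frames have the same columns in the same order, which is what the model expects at prediction time.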

### Keeping Only the Required Features for Building the Model

In [None]:
final_train.head(2)

In [None]:
X = final_train.drop(['Name','Sex','Embarked','FamilySize','Survived'],axis=1)

In [None]:
X.head()

In [None]:
y = final_train['Survived']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 1)

In [None]:
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

In [None]:
clf = LogisticRegression()
clf.fit(X_train,y_train)

In [None]:
X_train.head()

In [None]:
# Predicting for training and testing
ytrain_pred = clf.predict(X_train)
ytest_pred = clf.predict(X_test)

In [None]:
# Model Evaluation

train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)
print(f'Training Score {round(train_score,2)}, Testing Score {round(test_score,2)}')

In [None]:
y_train.shape

In [None]:
347+49+65+162   # sum of the four cells of the training confusion matrix; equals the number of training rows

In [None]:
(347+162)/623   # training accuracy: correct predictions (diagonal) / all training rows

In [None]:
confusion_matrix(y_train,ytrain_pred)
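
The hand calculations in the cells above can be written out in one place. The four counts are the ones used in those cells; which off-diagonal count is the false positives and which is the false negatives is an assumption here, following scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
# Assumed layout (rows = true class, columns = predicted class):
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = 347, 49, 65, 162   # placement of 49 and 65 is an assumption

total = tn + fp + fn + tp
accuracy = (tn + tp) / total        # correct predictions / all predictions
precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many were found

print(total, round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Accuracy does not depend on the assumed placement, since it uses only the diagonal; precision and recall would swap roles for class 1 if the two off-diagonal counts were the other way round.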

In [None]:
y_test.shape

In [None]:
(128+79)/268   # test accuracy: correct predictions (diagonal) / all test rows

In [None]:
confusion_matrix(y_test,ytest_pred)

Support:- The number of samples each metric was calculated on.

Macro Average:- The unweighted average of precision, recall and F1 score across classes. Macro avg doesn't take class imbalance into account, so every class counts equally; if you do have class imbalance, focus on the macro avg.

Weighted Average:- The weighted average of precision, recall and F1 score across classes. Weighted means each class's metric is weighted by how many samples there are in that class. This metric will favor the majority class.
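
A small worked example makes the difference concrete. The per-class precision values and supports below are hypothetical numbers chosen for illustration, not results from this model.

```python
# Hypothetical per-class precision and support for a two-class problem
precision = {0: 0.90, 1: 0.50}
support = {0: 80, 1: 20}

# Macro: every class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted: every class counts in proportion to its number of samples
weighted_avg = (
    sum(precision[c] * support[c] for c in precision) / sum(support.values())
)

print(round(macro_avg, 2), round(weighted_avg, 2))
```

The weighted average is pulled toward the majority class's score, while the macro average exposes the weaker minority class.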


In [None]:
print(classification_report(y_train,ytrain_pred))

In [None]:
print(classification_report(y_test,ytest_pred))

In [None]:
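# ROC needs continuous scores, so use the predicted probability of class 1 rather than hard 0/1 labels
fpr,tpr,thresholds = roc_curve(y_train,clf.predict_proba(X_train)[:,1])  # Calculate ROC curve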

In [None]:
roc_auc = auc(fpr,tpr)      # Calculate AUC

In [None]:
print("AUC = ",roc_auc)
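
An ROC curve is normally built from continuous scores such as the positive-class probabilities from `predict_proba`; hard 0/1 predictions leave only a single threshold. The sketch below illustrates the effect with a plain-Python AUC, computed as the fraction of positive/negative pairs ranked correctly (ties count half). The helper `pairwise_auc` and the data are hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
hard = [1 if p >= 0.5 else 0 for p in proba]   # what predict() would return

auc_proba = pairwise_auc(y_true, proba)
auc_hard = pairwise_auc(y_true, hard)
print(round(auc_proba, 3), round(auc_hard, 3))
```

The two values differ because thresholding discards the ranking information within each predicted class.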

In [None]:
(347/(347+94))

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()

### Predicting on the Test Dataset

In [None]:
X_train.head(2)

In [None]:
final_test.head(2)

In [None]:
final_test1 = final_test.drop(['Name','Sex','Embarked'],axis=1)

### Treating the null values present in the test set

In [None]:
final_test1.isnull().sum()

In [None]:
final_test1.Age.plot(kind = 'hist')

In [None]:
round(final_test1.Age.mean(),0)

In [None]:
final_test1['Age'] = final_test1.Age.fillna(round(final_test1.Age.mean(),0))

In [None]:
final_test1.shape

In [None]:
final_test1.Fare.median()

In [None]:
final_test1['Fare'] = final_test1.Fare.fillna(final_test1.Fare.median())

In [None]:
clf.predict(final_test1)

In [None]:
clf.predict_proba(final_test1)

In [None]:
pd.DataFrame(clf.predict_proba(final_test1))
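
To turn the probabilities into a labelled result, the two columns can be named and combined with the passenger ids. This is a sketch with stand-in values: in the notebook the ids would come from `test_dup['PassengerId']` and the probabilities from `clf.predict_proba(final_test1)`. The column labels and the 0.5 cut-off are assumptions.

```python
import pandas as pd

# Stand-ins for test_dup['PassengerId'] and clf.predict_proba(final_test1)
passenger_ids = [892, 893, 894]
proba = [[0.90, 0.10], [0.40, 0.60], [0.75, 0.25]]

# Column 0 is the probability of class 0, column 1 of class 1
proba_df = pd.DataFrame(proba, columns=["P(not survived)", "P(survived)"])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": (proba_df["P(survived)"] >= 0.5).astype(int),
})
print(submission)
```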