<a href="https://colab.research.google.com/github/AhaanB29/ML_models/blob/main/Titanic_pred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic: Who Survived?


---



The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# Importing Libraries

---



In [369]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Pre-Processing and Importing the Dataset

---



In [370]:
dataset = pd.read_csv('train.csv')
dataset = dataset.drop(["Name","Ticket","Cabin"],axis=1)
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values


In [371]:
datafram = pd.DataFrame(X)
print(datafram)

     0       1     2  3  4        5  6
0    3    male  22.0  1  0     7.25  S
1    1  female  38.0  1  0  71.2833  C
2    3  female  26.0  0  0    7.925  S
3    1  female  35.0  1  0     53.1  S
4    3    male  35.0  0  0     8.05  S
..  ..     ...   ... .. ..      ... ..
886  2    male  27.0  0  0     13.0  S
887  1  female  19.0  0  0     30.0  S
888  3  female   NaN  1  2    23.45  S
889  1    male  26.0  0  0     30.0  C
890  3    male  32.0  0  0     7.75  Q

[891 rows x 7 columns]


In [372]:
dataset.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         2
dtype: int64

In [373]:
from sklearn.impute import SimpleImputer
# Column 2 is Age: replace missing values with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:,2:3])
X[:,2:3] = imputer.transform(X[:, 2:3])
# Column 6 is Embarked: replace missing values with the most frequent port.
imputer2 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer2.fit(X[:,6:])
X[:,6:] = imputer2.transform(X[:,6:])

In [374]:
pd.DataFrame(X).isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64

In [375]:
print(pd.DataFrame(X))

     0       1          2  3  4        5  6
0    3    male       22.0  1  0     7.25  S
1    1  female       38.0  1  0  71.2833  C
2    3  female       26.0  0  0    7.925  S
3    1  female       35.0  1  0     53.1  S
4    3    male       35.0  0  0     8.05  S
..  ..     ...        ... .. ..      ... ..
886  2    male       27.0  0  0     13.0  S
887  1  female       19.0  0  0     30.0  S
888  3  female  29.699118  1  2    23.45  S
889  1    male       26.0  0  0     30.0  C
890  3    male       32.0  0  0     7.75  Q

[891 rows x 7 columns]


Missing data has now been handled and the non-contributing features removed.

The next step is to encode the categorical fields, Sex and Embarked. The cell below uses `LabelEncoder`, which maps each category to an integer code; note that this is label encoding, not one-hot encoding.



In [376]:
from sklearn import preprocessing
label = preprocessing.LabelEncoder()
# Codes follow alphabetical order: Sex becomes female=0, male=1.
X[:,1] = np.array(label.fit_transform(X[:,1]))
# Embarked becomes C=0, Q=1, S=2.
X[:,6] = np.array(label.fit_transform(X[:,6]))
print(pd.DataFrame(X))

     0  1          2  3  4        5  6
0    3  1       22.0  1  0     7.25  2
1    1  0       38.0  1  0  71.2833  0
2    3  0       26.0  0  0    7.925  2
3    1  0       35.0  1  0     53.1  2
4    3  1       35.0  0  0     8.05  2
..  .. ..        ... .. ..      ... ..
886  2  1       27.0  0  0     13.0  2
887  1  0       19.0  0  0     30.0  2
888  3  0  29.699118  1  2    23.45  2
889  1  1       26.0  0  0     30.0  0
890  3  1       32.0  0  0     7.75  1

[891 rows x 7 columns]
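Integer codes such as 0, 1, 2 for Embarked impose an ordering on the ports that has no real meaning, which can matter for linear and distance-based models. A minimal sketch of one-hot encoding as an alternative, assuming a small hand-made frame (the rows and the `toy` and `encoded` names are illustrative, not taken from `train.csv`):

```python
import pandas as pd

# Illustrative rows only; the notebook itself works on the array X.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Sex is binary, so a single 0/1 column is enough.
toy["Sex"] = (toy["Sex"] == "male").astype(int)

# Embarked has three unordered values: give each its own indicator column.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row then has exactly one of `Embarked_C`, `Embarked_Q`, `Embarked_S` set to 1, so no port is treated as "larger" than another.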


**Splitting into test and training set**

In [377]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

**Scaling the features**
---



In [378]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 2:3] = sc.fit_transform(X_train[:,2:3])
X_test[:,2:3] = sc.transform(X_test[:,2:3])
sc2 = StandardScaler()
X_train[:, 5:6] = sc2.fit_transform(X_train[:,5:6])
X_test[:,5:6] = sc2.transform(X_test[:,5:6])
print(X_train)

[[3 1 -0.028104986699835386 ... 0 -0.18801432489146527 1]
 [1 0 -0.005412181761448325 ... 0 0.5396904377513654 0]
 [2 0 0.29627124654946124 ... 0 -0.463502926868237 2]
 ...
 [2 1 -0.6841998954609948 ... 0 0.8977348711346351 2]
 [3 0 -0.028104986699835386 ... 0 -0.5272434269334508 2]
 [3 1 -0.6841998954609948 ... 0 -0.5164399523461265 2]]
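`StandardScaler` subtracts the mean and divides by the standard deviation learned from the training rows, then reuses those same two numbers on the test rows. A small sketch of that arithmetic with NumPy, using made-up ages (`train_age` and `test_age` are illustrative names, not variables from the notebook):

```python
import numpy as np

# Made-up values standing in for the Age column of each split.
train_age = np.array([22.0, 38.0, 26.0, 35.0])
test_age = np.array([30.0, 19.0])

# Statistics come from the training split only.
mu = train_age.mean()
sigma = train_age.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train_age - mu) / sigma
test_scaled = (test_age - mu) / sigma  # same mu and sigma, no refitting

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, because it is measured against the training statistics.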


# Building the model

---



In [379]:
from sklearn.linear_model import LogisticRegression
classifier1 = LogisticRegression(random_state = 0)
classifier1.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [380]:
y_pred1 = classifier1.predict(X_test)

In [381]:
from sklearn.neighbors import KNeighborsClassifier
classifier2 = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier2.fit(X_train, y_train)

KNeighborsClassifier()

In [382]:
y_pred2 = classifier2.predict(X_test)

In [383]:
from sklearn.ensemble import RandomForestClassifier
classifier3 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier3.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0)

In [384]:
y_pred3= classifier3.predict(X_test)

# Analysing the predictions from the test set
---



In [385]:
from sklearn.metrics import confusion_matrix, accuracy_score
# Logistic regression
cm = confusion_matrix(y_test, y_pred1)
print(cm)
print(accuracy_score(y_test, y_pred1))
# K-nearest neighbours
cm = confusion_matrix(y_test, y_pred2)
print(cm)
print(accuracy_score(y_test, y_pred2))

# Random forest
cm = confusion_matrix(y_test, y_pred3)
print(cm)
print(accuracy_score(y_test, y_pred3))

[[90 16]
 [20 53]]
0.7988826815642458
[[94 12]
 [32 41]]
0.7541899441340782
[[91 15]
 [30 43]]
0.7486033519553073
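Accuracy alone hides how each model treats survivors versus non-survivors. The sketch below recomputes accuracy, precision, and recall for the survived class from the three confusion matrices printed above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels). The `summarize` helper is a hypothetical name introduced here for illustration.

```python
def summarize(cm):
    """Return (accuracy, precision, recall) for a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]]: rows are true labels,
    columns are predicted labels.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)  # of predicted survivors, how many survived
    recall = tp / (tp + fn)     # of actual survivors, how many were found
    return accuracy, precision, recall


# Matrices copied from the output above.
results = {
    "logistic regression": [[90, 16], [20, 53]],
    "k-nearest neighbours": [[94, 12], [32, 41]],
    "random forest": [[91, 15], [30, 43]],
}

for name, cm in results.items():
    acc, prec, rec = summarize(cm)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

On this split, logistic regression identifies 53 of the 73 actual survivors, compared with 41 for k-nearest neighbours and 43 for the random forest.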


# Predicting values for the given set

---



In [386]:
dataset = pd.read_csv('test.csv')
dataset = dataset.drop(["Name","Ticket","Cabin"],axis=1)
X_=dataset.iloc[:, 1:].values
pid = dataset.iloc[:,0:1].values

In [387]:
dataset.isnull().sum()

PassengerId     0
Pclass          0
Sex             0
Age            86
SibSp           0
Parch           0
Fare            1
Embarked        0
dtype: int64

In [388]:
print(pd.DataFrame(X_))

     0       1     2  3  4        5  6
0    3    male  34.5  0  0   7.8292  Q
1    3  female  47.0  1  0      7.0  S
2    2    male  62.0  0  0   9.6875  Q
3    3    male  27.0  0  0   8.6625  S
4    3  female  22.0  1  1  12.2875  S
..  ..     ...   ... .. ..      ... ..
413  3    male   NaN  0  0     8.05  S
414  1  female  39.0  0  0    108.9  C
415  3    male  38.5  0  0     7.25  S
416  3    male   NaN  0  0     8.05  S
417  3    male   NaN  1  1  22.3583  C

[418 rows x 7 columns]


In [389]:
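# Age (column 2) and Fare (column 5): refit the mean imputer on this file's columns.
imputer.fit(X_[:,2:3])
X_[:,2:3] = imputer.transform(X_[:, 2:3])
imputer.fit(X_[:,5:6])
X_[:,5:6] = imputer.transform(X_[:, 5:6])
# Embarked (column 6): fit on X_ itself, since the training array X is already label-encoded.
imputer2.fit(X_[:,6:])
X_[:,6:] = imputer2.transform(X_[:,6:])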

In [390]:
print(pd.DataFrame(X_))

     0       1         2  3  4        5  6
0    3    male      34.5  0  0   7.8292  Q
1    3  female      47.0  1  0      7.0  S
2    2    male      62.0  0  0   9.6875  Q
3    3    male      27.0  0  0   8.6625  S
4    3  female      22.0  1  1  12.2875  S
..  ..     ...       ... .. ..      ... ..
413  3    male  30.27259  0  0     8.05  S
414  1  female      39.0  0  0    108.9  C
415  3    male      38.5  0  0     7.25  S
416  3    male  30.27259  0  0     8.05  S
417  3    male  30.27259  1  1  22.3583  C

[418 rows x 7 columns]


In [391]:

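# Refitting the encoder here yields the same codes as in training only because
# this file contains the same Sex and Embarked categories as train.csv.
X_[:,1] = np.array(label.fit_transform(X_[:,1]))
X_[:,6] = np.array(label.fit_transform(X_[:,6]))
print(pd.DataFrame(X_))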

     0  1         2  3  4        5  6
0    3  1      34.5  0  0   7.8292  1
1    3  0      47.0  1  0      7.0  2
2    2  1      62.0  0  0   9.6875  1
3    3  1      27.0  0  0   8.6625  2
4    3  0      22.0  1  1  12.2875  2
..  .. ..       ... .. ..      ... ..
413  3  1  30.27259  0  0     8.05  2
414  1  0      39.0  0  0    108.9  0
415  3  1      38.5  0  0     7.25  2
416  3  1  30.27259  0  0     8.05  2
417  3  1  30.27259  1  1  22.3583  0

[418 rows x 7 columns]


In [392]:
X_[:,2:3] = sc.transform(X_[:,2:3])
X_[:,5:6] = sc2.transform(X_[:,5:6])


In [393]:
print(X_)

[[3 1 0.33398167508832494 ... 0 -0.5212107667238889 1]
 [3 0 1.2767423885599174 ... 0 -0.5391272489795077 2]
 [2 1 2.408055244725828 ... 0 -0.4810585730726391 1]
 ...
 [3 1 0.6356651033992345 ... 0 -0.5337255116858455 2]
 [3 1 0.015146816929920107 ... 0 -0.5164399523461265 2]
 [3 1 0.015146816929920107 ... 1 -0.2072812414704996 0]]
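The imputing, encoding, and scaling steps above are repeated by hand for the new file, which makes it easy for the two code paths to drift apart. One possible way to keep them in sync is to bundle the steps with scikit-learn's `Pipeline` and `ColumnTransformer`, fit once on the training rows, and reuse the fitted object. This is a sketch under assumptions, not the notebook's method: the toy frames, the column grouping, and the use of one-hot encoding are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only (not taken from train.csv or test.csv).
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
    "Embarked": ["S", "C", "S", "S", "S", "C"],
    "Survived": [0, 1, 1, 1, 0, 0],
})
new = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [np.nan, 47.0],
    "Fare": [8.05, 108.90],
    "Embarked": ["Q", "C"],  # "Q" never appears in the toy training rows
})

numeric = ["Age", "Fare"]
categorical = ["Sex", "Embarked"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric columns: mean-impute, then standardize.
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())]), numeric),
        # Categorical columns: fill with the most frequent value, then one-hot encode.
        # Categories unseen during fit are encoded as all zeros.
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ],
    remainder="passthrough",  # Pclass is passed through unchanged
)

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(random_state=0))])

# All statistics (means, categories, scaling) are learned from the training rows only.
model.fit(train.drop(columns="Survived"), train["Survived"])
pred = model.predict(new)
print(pred)
```

Because every transformer is fitted inside `model.fit`, calling `model.predict` on new rows applies the training-time means, categories, and scaling without any refitting.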


In [394]:
final_prediction = classifier1.predict(X_)
print(final_prediction)

[0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 1
 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 1 0 1 0 0 0]


In [395]:
arr = np.concatenate((pid.reshape(len(pid),1), final_prediction.reshape(len(pid),1)),1)

# Creating Submission File

---



In [397]:
import csv
# newline='' prevents the csv module from writing blank rows on Windows.
with open("Submission_LC.csv", 'w', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerow(["PassengerId","Survived"])
    writer.writerows(arr)
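Since pandas is already imported, the same two-column file could also be written with `DataFrame.to_csv`. A minimal sketch, assuming placeholder IDs and predictions (the `passenger_ids` and `predictions` arrays and the output file name are illustrative stand-ins for `pid` and `final_prediction`):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Placeholder values; in the notebook these would be pid.ravel() and final_prediction.
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Write to a temporary directory so the sketch leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), "submission_sketch.csv")
submission.to_csv(path, index=False)  # index=False keeps the row index out of the file

# Read the file back to confirm its layout.
check = pd.read_csv(path)
print(check)
```

Reading the file back is a cheap way to confirm the header and row count before uploading.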