# Titanic Machine Learning from Disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Check out the dataset description [here](https://www.kaggle.com/c/titanic/data).

### Let's start the analysis.

## Importing the libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

### Importing the dataset

In [3]:
train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')
test_result = pd.read_csv('dataset/gender_submission.csv')
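
One thing worth checking about `gender_submission.csv`: on Kaggle it is a sample submission that marks every female passenger as a survivor and every male passenger as a non-survivor, rather than the recorded outcomes for the test passengers. A minimal sketch of how that could be verified, using a small made-up frame in place of the real files (the rows and the helper name `follows_gender_rule` are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

def follows_gender_rule(passengers, labels):
    """Return True if every label is 1 for 'female' and 0 for 'male'."""
    merged = passengers.merge(labels, on="PassengerId")
    expected = (merged["Sex"] == "female").astype(int)
    return bool((expected == merged["Survived"]).all())

# Hypothetical rows standing in for test.csv and gender_submission.csv.
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894],
                         "Sex": ["male", "female", "male"]})
toy_labels = pd.DataFrame({"PassengerId": [892, 893, 894],
                           "Survived": [0, 1, 0]})

print(follows_gender_rule(toy_test, toy_labels))
```

Right after loading, the same call on the real frames would be `follows_gender_rule(test, test_result)`.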

### Taking care of missing data

In [4]:
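# Drop training rows that have no Age, Embarked or Fare value.
train = train.dropna(subset=['Age','Embarked','Fare'])
# Attach the Survived column from gender_submission.csv to the test rows (aligned by row index),
# then drop incomplete test rows the same way.
test = test.join(test_result['Survived'])
test = test.dropna(subset=['Age','Embarked','Fare'])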
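
Before dropping rows, it can help to see how many values each column is missing and how many rows the drop removes. A small sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in several columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Embarked": ["S", "C", "S", None],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same drop as above: a row is removed if any of the three columns is missing.
cleaned = toy.dropna(subset=["Age", "Embarked", "Fare"])
print(len(toy), "->", len(cleaned), "rows")
```

Cabin is left out of the `subset` list, so rows with a missing cabin are kept.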

### Selecting desirable fields

In [5]:
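# Training features by column position: Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked.
x_train = train.iloc[:, [2, 4, 5, 6, 7, 9, 10, 11]].values
# Column 1 of train.csv is the Survived label.
y_train = train.iloc[:, 1].values

# test.csv has no Survived column of its own, so the same features sit one position earlier.
x_test = test.iloc[:, [1, 3, 4, 5, 6, 8, 9, 10]].values
# Column 11 is the Survived column joined on from gender_submission.csv.
y_test = test.iloc[:, 11].values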
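
Positional indices are easy to get wrong when two files have different column layouts. As an alternative sketch, the same fields can be selected by name; the `FEATURES` list and the two made-up rows below are assumptions for illustration, with columns in the order used by `train.csv`:

```python
import pandas as pd

FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"]

# Hypothetical rows laid out in the train.csv column order.
toy_train = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Ticket": ["T1", "T2"],
    "Fare": [7.25, 71.28],
    "Cabin": [None, "C85"],
    "Embarked": ["S", "C"],
})

x_by_name = toy_train[FEATURES].values
y = toy_train["Survived"].values

# The positional indices used above resolve to the same column names.
print(list(toy_train.columns[[2, 4, 5, 6, 7, 9, 10, 11]]) == FEATURES)
```

Selecting by name lets one list serve both the training and the test frame.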

## Data Preprocessing

### Taking care of missing data in Cabin Field

In [6]:
def count_cabins(features):
    # Replace the Cabin column (index 6) with the number of cabins listed;
    # a missing value (NaN) counts as 0.
    for row in features:
        cabin = row[6]
        row[6] = 0 if pd.isna(cabin) else len(str(cabin).split(" "))

count_cabins(x_train)
count_cabins(x_test)
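
The cabin count can also be computed on the DataFrame column before converting to arrays. A minimal sketch with pandas string methods, using a made-up `cabin` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Cabin values: missing, one cabin, and several cabins.
cabin = pd.Series([np.nan, "C85", "C23 C25 C27", "B57 B59 B63 B66"])

# Missing values become empty strings, which split into an empty list (count 0).
cabin_count = cabin.fillna("").str.split().str.len()
print(cabin_count.tolist())  # [0, 1, 3, 4]
```

This turns the free-text Cabin field into a single numeric feature without an explicit loop.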

### Taking care of Categorical Data

In [7]:
labelencoder_X_train_1 = LabelEncoder()
labelencoder_X_train_2 = LabelEncoder()
x_train[:, 1] = labelencoder_X_train_1.fit_transform(x_train[:, 1])
x_train[:, 7] = labelencoder_X_train_2.fit_transform(x_train[:, 7])

# Reuse the encoders fitted on the training set so that both sets share
# the same category-to-integer mapping.
x_test[:, 1] = labelencoder_X_train_1.transform(x_test[:, 1])
x_test[:, 7] = labelencoder_X_train_2.transform(x_test[:, 7])
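
Encoders fitted on different data can assign different integers to the same category. A small sketch of the effect, with made-up port lists in which the test set happens to lack one category:

```python
from sklearn.preprocessing import LabelEncoder

train_ports = ["S", "C", "Q", "S"]
test_ports = ["Q", "S", "S"]  # hypothetical test set with no "C"

# Encoder fitted on the training values and reused on the test values.
shared = LabelEncoder().fit(train_ports)
print(shared.classes_)                # ['C' 'Q' 'S']
print(shared.transform(test_ports))   # [1 2 2]

# Encoder fitted on the test values alone shifts every code down by one.
separate = LabelEncoder().fit(test_ports)
print(separate.transform(test_ports)) # [0 1 1]
```

With the second encoder, `"Q"` in the test set would carry the integer that means `"C"` to a model trained on the first mapping.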

### Feature Scaling

In [8]:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
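
`StandardScaler` learns each column's mean and standard deviation from the training set only and applies z = (x - mean) / std to both sets. A short sketch on made-up numbers that checks the scaler against the formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (Age, Fare) rows.
train_values = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
test_values = np.array([[30.0, 10.0]])

sc = StandardScaler().fit(train_values)

# Manual scaling of the test row using the training mean and standard deviation.
manual = (test_values - train_values.mean(axis=0)) / train_values.std(axis=0)
print(np.allclose(sc.transform(test_values), manual))
```

Calling `fit_transform` on the test set instead would scale it with its own statistics, which is why only `transform` is used there.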



## Classification

### 1. Decision Tree Classifier

### Fitting Decision Tree Classification to the Training set

In [9]:
classifier = DecisionTreeClassifier(criterion = 'entropy')
classifier.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Predicting the Test set results

In [11]:
y_pred = classifier.predict(x_test)

### Making the Confusion Matrix

In [25]:
cm = confusion_matrix(y_test, y_pred)
print("{:0.2f} % Accuracy".format(((cm[0,0]+cm[1,1])/y_test.shape[0])*100))

77.34 % Accuracy
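
The accuracy printed here is the sum of the confusion matrix diagonal divided by the number of test rows. A sketch on made-up labels showing that this matches scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_hat)
# Diagonal entries are the correctly classified rows of each class.
from_matrix = (cm[0, 0] + cm[1, 1]) / y_true.shape[0]
print(from_matrix, accuracy_score(y_true, y_hat))
```

The same expression is reused for every classifier below.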


### 2. Random Forest Classifier

### Fitting Random Forest Classification to the Training set

In [26]:
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy')
classifier.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Predicting the Test set results

In [27]:
y_pred = classifier.predict(x_test)

### Making the Confusion Matrix

In [28]:
cm = confusion_matrix(y_test, y_pred)
print("{:0.2f} % Accuracy".format(((cm[0,0]+cm[1,1])/y_test.shape[0])*100))

78.85 % Accuracy


### 3. Naive Bayes

### Fitting Naive Bayes to the Training set

In [32]:
classifier = GaussianNB()
classifier.fit(x_train, y_train)

GaussianNB(priors=None)

### Predicting the Test set results

In [33]:
y_pred = classifier.predict(x_test)

### Making the Confusion Matrix

In [34]:
cm = confusion_matrix(y_test, y_pred)
print("{:0.2f} % Accuracy".format(((cm[0,0]+cm[1,1])/y_test.shape[0])*100))

78.85 % Accuracy


### 4. Logistic Regression

### Fitting Logistic Regression to the Training set

In [35]:
classifier = LogisticRegression()
classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Predicting the Test set results

In [36]:
y_pred = classifier.predict(x_test)

### Making the Confusion Matrix

In [37]:
cm = confusion_matrix(y_test, y_pred)
print("{:0.2f} % Accuracy".format(((cm[0,0]+cm[1,1])/y_test.shape[0])*100))

91.54 % Accuracy


### 5. K-NN

### Fitting K-NN to the Training set

In [38]:
classifier = KNeighborsClassifier()
classifier.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

### Predicting the Test set results

In [39]:
y_pred = classifier.predict(x_test)

### Making the Confusion Matrix

In [40]:
cm = confusion_matrix(y_test, y_pred)
print("{:0.2f} % Accuracy".format(((cm[0,0]+cm[1,1])/y_test.shape[0])*100))

80.97 % Accuracy


### 6. SVM

### Fitting SVM to the Training set

In [41]:
classifier = SVC(kernel = 'linear')
classifier.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Predicting the Test set results

In [42]:
y_pred = classifier.predict(x_test)

### Making the Confusion Matrix

In [43]:
cm = confusion_matrix(y_test, y_pred)
print("{:0.2f} % Accuracy".format(((cm[0,0]+cm[1,1])/y_test.shape[0])*100))

100.00 % Accuracy


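## The Support Vector Machine (SVM) with a linear kernel gives the highest accuracy on these labels.

Note that the test labels come from `gender_submission.csv`, Kaggle's sample submission in which survival is determined by sex alone, so the 100 % score measures agreement with that rule rather than with the passengers' recorded outcomes.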
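
Because `train.csv` is the only file with recorded outcomes, one way to compare the six classifiers on true labels is to hold out part of the training rows. The sketch below does this in a single loop; it runs on synthetic data from `make_classification` as a stand-in for the eight preprocessed features, and the split size and random seeds are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic training features and labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scale with statistics learned from the fitting rows only.
sc = StandardScaler().fit(X_fit)
X_fit, X_val = sc.transform(X_fit), sc.transform(X_val)

models = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, criterion="entropy",
                                            random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "K-NN": KNeighborsClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {name: model.fit(X_fit, y_fit).score(X_val, y_val)
          for name, model in models.items()}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<20s} {:0.2f} % Accuracy".format(name, acc * 100))
```

With the real data, `X` and `y` would be the encoded training features and the `Survived` column from `train.csv`.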