# Kaggle_Titanic
## Goal
**Build a predictive model that answers the question: "What sorts of people were more likely to survive?" using passenger data (e.g. name, age, gender, socio-economic class).**   
## The dataset
- The dataset includes passenger information such as name, age, gender, and socio-economic class.  
- `train.csv` contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.  
- The `test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.  
- Using the patterns you find in the `train.csv` data, predict whether the other 418 passengers on board (found in `test.csv`) survived.  
## Submission File Format  
- **PassengerId (sorted in any order)**  
- **Survived (contains your binary predictions: 1 for survived, 0 for deceased)**  


In [1]:
import pandas as pd
import random
import numpy as np
from sklearn.model_selection import train_test_split

Read the data set using pandas

In [2]:
data_train = pd.read_csv("data/train.csv")
data_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


# Data cleaning

In [4]:
data_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### The Age column has some "NaN" values; we can fill them with a random_number drawn between nums_replace_age and the mean of Age

#### Calculate the mean of Age and nums_replace_age

In [5]:
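# Mean over the non-missing ages only (NaN values are skipped)
mean_age = round(data_train.Age.mean(), 2)
# Sum of the non-missing ages divided by all rows, so it comes out lower than mean_age
nums_replace_age = round(data_train.Age.sum()/data_train.shape[0], 2)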
print(mean_age, nums_replace_age)

29.7 23.8
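The two numbers differ because `Series.mean()` skips NaN values, while `sum() / shape[0]` divides the same sum by every row, missing or not. A minimal sketch on toy data (the `ages` values below are made up for illustration and are not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy Age column with one missing value (hypothetical data).
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# NaN is skipped: (20 + 30 + 40) / 3
mean_of_known = ages.mean()

# NaN is skipped in the sum but still counted in the row total: 90 / 4
sum_over_all_rows = ages.sum() / len(ages)

print(mean_of_known, sum_over_all_rows)  # 30.0 22.5
```

The more values are missing, the further the second number falls below the first.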


In [6]:
random_number = random.uniform(nums_replace_age, mean_age)
random_number

29.45133593580354

#### Replace the "NaN" values with random_number

In [7]:
Age_fixed = data_train.Age.replace(np.nan, round(random_number, 2)) 
Age_fixed

0      22.00
1      38.00
2      26.00
3      35.00
4      35.00
       ...  
886    27.00
887    19.00
888    29.45
889    26.00
890    32.00
Name: Age, Length: 891, dtype: float64
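Because `random_number` is drawn with `random.uniform` and no seed is set, rerunning the notebook gives a different fill value each time. One reproducible alternative, shown here only as a sketch on toy data and not as the method used for the results in this notebook, is to fill with the median of the known ages:

```python
import numpy as np
import pandas as pd

# Toy Age column (hypothetical values) with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# The median ignores NaN: the middle of 22, 35, 38 is 35.
median_age = ages.median()

# fillna returns a new Series; the original is left unchanged.
ages_filled = ages.fillna(median_age)
print(ages_filled.tolist())
```

Calling `random.seed(...)` before `random.uniform` would be another way to make the original approach repeatable.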

In [8]:
data_train.Age = Age_fixed
data_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.45,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.0000,C148,C


In [9]:
data_train = data_train.set_index("PassengerId")

In [10]:
df = data_train
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


#### Replace male with 0 and female with 1

In [12]:
df.Ticket

PassengerId
1             A/5 21171
2              PC 17599
3      STON/O2. 3101282
4                113803
5                373450
             ...       
887              211536
888              112053
889          W./C. 6607
890              111369
891              370376
Name: Ticket, Length: 891, dtype: object

In [13]:
gender = {"male": 0, "female": 1}
df.Sex = [gender[item] for item in df.Sex]
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",0,22.00,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.00,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",1,26.00,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.00,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",0,35.00,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",0,27.00,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",1,19.00,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,29.45,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",0,26.00,0,0,111369,30.0000,C148,C
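The same encoding can also be written with `Series.map`, which applies the dictionary element by element. This is a sketch on a toy column (the values are illustrative); unlike the list comprehension, `map` yields NaN for a label missing from the dictionary instead of raising a `KeyError`.

```python
import pandas as pd

# Toy Sex column (hypothetical values).
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

gender = {"male": 0, "female": 1}

# Look up each label in the dictionary.
sex_encoded = sex.map(gender)
print(sex_encoded.tolist())  # [0, 1, 1, 0]
```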


#### Convert the Ticket column from object to numeric

In [14]:
df.Ticket = df.Ticket.str.extract(r'(\d+)', expand = False).astype(float)
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",0,22.00,1,0,5.0,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.00,1,0,17599.0,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",1,26.00,0,0,2.0,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.00,1,0,113803.0,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",0,35.00,0,0,373450.0,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",0,27.00,0,0,211536.0,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",1,19.00,0,0,112053.0,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,29.45,1,2,6607.0,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",0,26.00,0,0,111369.0,30.0000,C148,C
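Note that `(\d+)` captures the first run of digits, so `A/5 21171` becomes 5 and `STON/O2. 3101282` becomes 2, as the table above shows. If the trailing ticket number is what is wanted, one possible variant anchors the pattern at the end of the string. This is a sketch of an alternative, not the transformation behind the results in this notebook; the `LINE` entry stands in for a ticket with no digits at all.

```python
import pandas as pd

# A few ticket strings; "LINE" is a stand-in for a ticket without digits.
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"])

# Pattern used in the notebook: the first run of digits anywhere in the string.
first_run = tickets.str.extract(r"(\d+)", expand=False).astype(float)

# Variant: the run of digits at the end of the string.
last_run = tickets.str.extract(r"(\d+)\s*$", expand=False).astype(float)

print(pd.DataFrame({"ticket": tickets, "first_run": first_run, "last_run": last_run}))
```

Tickets without any digits come out as NaN under both patterns, so they still need to be filled afterwards.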


#### Save file

In [15]:
df.to_csv("train_after_fixed.csv")

# Model Classification

## Create the data for training and validation

- Create X (After completing the exercise, you can return to modify this line!)  
- Select columns corresponding to features, and preview the data

In [16]:
features = ['Age', 'Sex', 'Ticket', 'Fare']
X = pd.get_dummies(df[features])
X.describe()

Unnamed: 0,Age,Sex,Ticket,Fare
count,891.0,891.0,887.0,891.0
mean,29.64963,0.352413,227846.4,32.204208
std,13.002396,0.47799,502450.6,49.693429
min,0.42,0.0,2.0,0.0
25%,22.0,0.0,11772.5,7.9104
50%,29.45,0.0,110413.0,14.4542
75%,35.0,1.0,347062.5,31.0
max,80.0,1.0,3101317.0,512.3292


In [17]:
y = df.Survived

In [18]:
X.Ticket = X.Ticket.fillna(X.Ticket.mean())
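`train_test_split` is imported in the first cell but not used in the cells that follow. A minimal sketch of how a validation split could look, assuming a feature frame and label shaped like `X` and `y`; the data here is synthetic and the names `X_demo`, `y_demo`, and `model_demo` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (not the Titanic data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, 200),
    "Sex": rng.integers(0, 2, 200),
})
y_demo = X_demo["Sex"]  # toy label fully determined by Sex

# Hold out 20% of the rows; stratify keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fit on the training part only, then score on the held-out part.
model_demo = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, model_demo.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```

A held-out score like this gives a local estimate of accuracy without needing the labels for `test.csv`.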

## Process data_test

In [19]:
data_test = pd.read_csv("data/test.csv")
data_test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [20]:
data_test.Age = data_test.Age.replace(np.nan, round(random_number, 2))

In [21]:
data_test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [22]:
data_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.50,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.00,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.00,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.00,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.00,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,29.45,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.00,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.50,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,29.45,0,0,359309,8.0500,,S


In [23]:
test_X = data_test[features]
# test_X = pd.get_dummies(data_test[features])

In [24]:
test_X

Unnamed: 0,Age,Sex,Ticket,Fare
0,34.50,male,330911,7.8292
1,47.00,female,363272,7.0000
2,62.00,male,240276,9.6875
3,27.00,male,315154,8.6625
4,22.00,female,3101298,12.2875
...,...,...,...,...
413,29.45,male,A.5. 3236,8.0500
414,39.00,female,PC 17758,108.9000
415,38.50,male,SOTON/O.Q. 3101262,7.2500
416,29.45,male,359309,8.0500


In [25]:
test_X.Ticket = test_X.Ticket.str.extract(r'(\d+)', expand = False).astype(float)
test_X.Fare = test_X.Fare.replace(np.nan, random.randint(0,100))
gender = {"male": 0, "female": 1}
test_X.Sex = [gender[item] for item in test_X.Sex]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_X.Ticket = test_X.Ticket.str.extract(r'(\d+)', expand = False).astype(float)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_X.Fare = test_X.Fare.replace(np.nan, random.randint(0,100))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_X.Sex = [gender[item] for item in test_X.Sex]
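The warning above appears because `data_test[features]` may be a view of `data_test`, so pandas cannot tell whether the assignments are meant to reach the original frame. Taking an explicit `.copy()` before assigning is one common way to avoid it. A sketch on a toy frame (`data_demo` and `test_demo` are hypothetical names, and filling Fare with the median is an assumption, not what the cell above does):

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the same column names (hypothetical values).
data_demo = pd.DataFrame({
    "Name": ["a", "b"],
    "Sex": ["male", "female"],
    "Ticket": ["A.5. 3236", "359309"],
    "Fare": [8.05, np.nan],
})
features = ["Sex", "Ticket", "Fare"]

# An explicit copy makes test_demo independent of data_demo.
test_demo = data_demo[features].copy()

# Assign with bracket syntax on the copy.
test_demo["Sex"] = test_demo["Sex"].map({"male": 0, "female": 1})
test_demo["Fare"] = test_demo["Fare"].fillna(test_demo["Fare"].median())

print(test_demo)
```

The original frame keeps its string labels and its missing Fare, which is the behaviour the warning is asking the author to make explicit.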


### Logistic regression


In [26]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state = 1)
model.fit(X, y)
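The features here sit on very different scales (Sex is 0 or 1, while Ticket runs into the millions), and logistic regression solvers tend to behave better on standardized inputs. Whether that matters for this notebook's score has not been checked; the sketch below only shows, on synthetic data, how a scaler could be chained in front of the model. The names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the Titanic data).
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.uniform(0, 80, 200),         # small-scale feature, like Age
    rng.uniform(0, 3_000_000, 200),  # large-scale feature, like Ticket
])
y_demo = (X_demo[:, 0] > 40).astype(int)  # toy label driven by the first feature

# StandardScaler rescales each column to zero mean and unit variance
# before the values reach LogisticRegression.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
print(f"training accuracy: {train_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, calling `predict` on new rows applies the same scaling that was learned during `fit`.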

In [27]:
y_pred = model.predict(test_X)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'], 'Survived': y_pred})

In [28]:
result.to_csv('Logistic_Regression.csv', index=False)

> **Accuracy: 0.66507**

### Random Forest 

In [29]:
from sklearn.ensemble import RandomForestClassifier
model_2 = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model_2.fit(X, y)

In [30]:
y_pred = model_2.predict(test_X)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'], 'Survived': y_pred})

In [31]:
result.to_csv('Random_Forest_Classifier.csv', index=False)

> **Accuracy: 0.7799**

### Decision Tree

In [32]:
from sklearn.tree import DecisionTreeClassifier
model_3 = DecisionTreeClassifier()
model_3.fit(X, y)

In [33]:
y_pred = model_3.predict(test_X)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'], 'Survived': y_pred})

In [34]:
result.to_csv('Decision_Tree_Classifier.csv', index=False)

> **Accuracy: 0.69377**