For Kaggle competitions, it is common practice to combine the test and train data before preprocessing (for example, imputing missing values). This gives you more data from the same pool to estimate fill values from, avoids repeating each preprocessing step for both sets, and takes care of categories that appear in the test data but not in the train data.
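
The point about unseen categories can be illustrated with a small sketch. It uses toy data with made-up values and `pd.get_dummies` rather than the encoder used later in this notebook; the names `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): category "Q" appears only in the test set.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "Q"]})

# Encoding each set separately yields mismatched columns.
separate_train = pd.get_dummies(toy_train, columns=["Embarked"])
separate_test = pd.get_dummies(toy_test, columns=["Embarked"])

# Encoding the combined frame yields one shared set of columns,
# which is then split back by position.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["Embarked"])
encoded_train = encoded.iloc[: len(toy_train)]
encoded_test = encoded.iloc[len(toy_train):]

print(list(separate_train.columns))
print(list(separate_test.columns))
print(list(encoded_train.columns))
```

Encoded separately, the train frame has no column for "Q" and the test frame has none for "C", so a model fitted on one could not score the other without extra alignment work.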

**Changes to further optimize model:**
1. Use test+train data to impute missing train Data
2. Linear Regression to impute missing Age data
3. Try more models
4. New Feature Family Size
5. Boosting and Ensemble Models
6. Cross Validation before submitting
7. The sample with missing Age has a lower survival rate than the sample with Age present, so we might want to create a new column with binary information about whether Age is missing.
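
Items 2 and 7 can be combined in one step. The following is a minimal sketch on toy data with made-up values, not the notebook's actual implementation: the choice of `Pclass` and `SibSp` as predictors is an assumption made for illustration, and the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values), not the Titanic set.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "SibSp":  [0, 1, 0, 1, 0, 2, 1, 0],
    "Age":    [40.0, 38.0, 30.0, np.nan, 22.0, 10.0, np.nan, np.nan],
})

# Record which rows lack Age BEFORE imputing; afterwards nothing is null.
toy["Age_NA"] = toy["Age"].isnull().astype(int)

# Assumed predictors, chosen only for this illustration.
predictors = ["Pclass", "SibSp"]
known = toy[toy["Age"].notnull()]
missing = toy[toy["Age"].isnull()]

# Fit on rows where Age is known, then predict Age for the other rows.
reg = LinearRegression().fit(known[predictors], known["Age"])
toy.loc[missing.index, "Age"] = reg.predict(missing[predictors])

print(toy)
```

The order matters: if the indicator were computed after the fill, it would be zero for every row.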

# Titanic - Data Cleaning and Exploration

## Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

raw_train=pd.read_csv('train.csv')
raw_test=pd.read_csv('test.csv')

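#Work on copies so the raw frames stay untouched
train=raw_train.copy()
test=raw_test.copy()

#Test rows come first; test has no Survived column, so Survived is NaN for those rows after concat
data=pd.concat([test,train],ignore_index=True)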
data

## Cleaning

In [None]:
#Missing Data
data.isnull().sum()

In [None]:
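#We assign the modal value to the missing Embarked data (the [0] after mode() because mode() returns a pandas Series)
#Assigning the result back avoids an inplace fill on a column selection, which newer pandas versions warn about
mode=data['Embarked'].mode()
data['Embarked']=data['Embarked'].fillna(mode[0])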

In [None]:
#Age still contains missing values at this point, so drop them before plotting
plt.hist(data['Age'].dropna())

In [None]:
#Create new column showing whether Age information is missing (1 = missing)
#This must run before Age is imputed, otherwise the column is all zeros
data['Age_NA'] =np.where(data.Age.isnull(), 1, 0)

In [None]:
#Replacing null values in Age by mean
data['Age']=data['Age'].fillna(data['Age'].mean())

In [None]:
#Replace Missing Values for Fare with Median
data['Fare'] = data['Fare'].fillna(data['Fare'].median())

In [None]:
#Design New Feature of Cabin_Class
data['Cabin_Class'] = data['Cabin'].str[0]

In [None]:
#Replace the Missing values in Cabin_Class with "U" for Unknown
data['Cabin_Class']=data['Cabin_Class'].fillna('U')

In [None]:
data.isnull().sum()

In [None]:
#Design New Feature Family Size

data['Family_Size']=data['SibSp']+data['Parch']+1

### One Hot Encoding Embarked, Cabin_Class and Sex

In [None]:
from feature_engine.encoding import OneHotEncoder

#Create Instance and Fit
#drop_last=True returns k-1 dummy columns per variable, False returns k
ohe = OneHotEncoder(top_categories=None, variables=['Cabin_Class', 'Embarked','Sex'], drop_last=True)
ohe.fit(data)


In [None]:
#Transform
data = ohe.transform(data)

In [None]:
#The Cabin feature carries little information: it has many distinct values and most of them are missing.
#Therefore, we drop it along with the other uninformative columns.

data.drop(columns=['Cabin', 'Ticket', 'Name', 'PassengerId'], inplace=True)
data

In [None]:
#splitting data back into test and train (the test rows were concatenated first)
n_test=len(raw_test)
test=data.iloc[:n_test,:]
train=data.iloc[n_test:,:].reset_index(drop=True)
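
The positional split relies on the test rows sitting first in the combined frame. A minimal sketch of an alternative that does not depend on row order is to split on the label instead. It assumes every train row has a `Survived` value and uses toy frames with made-up values; the names `toy_train`, `toy_test`, `split_train` and `split_test` are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical values); only the train frame carries a label.
toy_test = pd.DataFrame({"PassengerId": [10, 11], "Fare": [7.0, 8.0]})
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Fare": [5.0, 6.0, 9.0],
    "Survived": [0, 1, 1],
})

# After concat, Survived is NaN exactly for the rows that came from the test frame.
combined = pd.concat([toy_test, toy_train], ignore_index=True)
is_test = combined["Survived"].isnull()

# Split on the mask rather than on a hard-coded row count.
split_test = combined[is_test].drop(columns="Survived").reset_index(drop=True)
split_train = combined[~is_test].reset_index(drop=True)

print(len(split_test), len(split_train))
```

This variant also drops the all-NaN `Survived` column from the test part, so it cannot leak into the feature matrix by accident.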

In [None]:
train

## Logistic Regression

In [None]:
#split into x and y and converting to np arrays
y_train = train['Survived'].values
x_train = train.drop(columns='Survived').values

#The test rows have no labels: Survived is NaN for all of them, so y_test is only a placeholder
y_test = test['Survived'].values
x_test = test.drop(columns='Survived').values

In [None]:
pd.DataFrame(x_test)

In [None]:
#Import Class and create Instance
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter = 2000)

In [None]:
model.fit(x_train,y_train)
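
Since the test labels are not available, cross-validation on the train data (item 6 in the list of changes) is one way to estimate accuracy before submitting. The following is a minimal sketch using scikit-learn's `cross_val_score` on synthetic data; `X_demo`, `y_demo` and `cv_model` are hypothetical stand-ins, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic set): 200 rows, 4 features,
# with a label that depends on the first two features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Same estimator settings as in the notebook; 5 folds is an assumed choice.
cv_model = LogisticRegression(max_iter=2000)
scores = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring="accuracy")

# One accuracy value per fold; the mean is the usual summary.
print(scores.mean(), scores.std())
```

In this notebook the same call would take `x_train` and `y_train` in place of the synthetic arrays.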

In [None]:
prediction=pd.DataFrame(model.predict(x_test))

In [None]:
prediction = prediction.astype({0:'int'})

In [None]:
prediction

### Formatting Results for Submission

In [None]:
result=raw_test['PassengerId']

In [None]:
result=pd.DataFrame(result)
result

In [None]:
#Column 0 of prediction holds the predicted labels
result['Survived']=prediction[0]

In [None]:
result

In [None]:
#Make sure to set index=False when saving (result is already a DataFrame)
result.to_csv(r".\csv\2.submission.csv",index=False)