Load the train and test data as pandas DataFrames and keep both in a combine list so that changes can be applied to both quickly. Then check possible dependencies among features.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score




train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')
combine = [train_df, test_df]
train_df.head()



In [None]:
train_df.info()
print('--------------------------------')
test_df.info()

From the info and describe output we can see that:
    1. the train dataset holds 891 of the 2224 passengers
    2. 38% of the passengers in the train dataset survived, which is close to the overall survival rate of 32%
    3. some columns have missing values, so I need to fill in the missing ones in the Age and Embarked columns, or drop the column (maybe Cabin)
    4. there are 7 numerical, 2 categorical and 2 features with mixed data types
    5. there is definitely a correlation between Pclass and Survived
    6. the Ticket column could be interesting as a proxy for the price range, but it contains duplicate tickets (only 681 unique values). We need to check the average/median ticket price (min 0, max 512, median 14)
    7. there are three ports of embarkation (the most common is S, 644 times)
    8. there were 577 men on the ship, which is 65% of the passengers. But women survived more often (68% of the passengers who survived were women)
    9. the average age is 29 years, so it may be worth dividing Age into groups. The minimum age is 5 months, the maximum is 80 years
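
Observation 3 can be checked directly by counting the missing cells per column. The snippet below is only a sketch on a small made-up frame (toy_df and its values are assumptions, only the column names mirror the Titanic data); on the real data the same two calls would be applied to train_df.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the Titanic data.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Number of missing cells per column, largest first.
missing_count = toy_df.isnull().sum().sort_values(ascending=False)
# Share of missing cells per column (0.0 means the column is complete).
missing_share = toy_df.isnull().mean().round(2)

print(missing_count)
print(missing_share)
```

A column with a very high share of missing cells is a candidate for dropping, while a column with only a few gaps is usually easier to fill.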
    

In [None]:
train_df.describe()




In [None]:
train_df.describe(include=['O'])


In [None]:
survival_sex = train_df.groupby('Sex')['Survived'].mean()
survival_sex


Split the features into numerical and other data types (excluding the Survived and PassengerId columns).
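
As an alternative to hard-coded column positions, the split can be sketched with select_dtypes, which picks columns by data type. This is a minimal sketch on a toy frame (toy_df and its rows are illustrative assumptions), not the approach used in the cell that follows.

```python
import pandas as pd

# Toy frame; the rows are illustrative, the column names follow the Titanic data.
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", "S"],
})

# Leave out the identifier and the target before splitting by type.
features = toy_df.drop(columns=["PassengerId", "Survived"])

numeric_cols = features.select_dtypes(include="number").columns.tolist()
other_cols = features.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Selecting by type keeps working if the column order changes, which position-based indexing does not.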

In [None]:
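# Column positions in train.csv: 0 PassengerId, 1 Survived, 2 Pclass, 3 Name, 4 Sex,
# 5 Age, 6 SibSp, 7 Parch, 8 Ticket, 9 Fare, 10 Cabin, 11 Embarked
numeric_indices = [2, 5, 6, 7, 9]
categorical_indices = [3, 4, 8, 10, 11]
numeric_data = train_df[train_df.columns[numeric_indices]]
categorical_data = train_df[train_df.columns[categorical_indices]]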


Let's check and compare the correlations between the features and the Survived column.
Begin by checking the dependencies between the numerical features and the Survived column.

In [None]:
correlation_matrix = train_df.corr(numeric_only=True)['Survived']
correlation_matrix

In [None]:
corr_with_survived = train_df.corr(numeric_only=True)['Survived'].sort_values(ascending=False)
corr_with_survived


In [None]:
pd_matrix = corr_with_survived.to_frame()
plt.figure(figsize=(10, 8))
sns.heatmap(pd_matrix, 
            annot=True,      # show the numbers
            cmap='coolwarm', # blue-to-red palette
            center=0,        # 0 at the center of the color scale
            square=True,     # square cells
            fmt='.2f')       # 2 decimal places
plt.title('Correlation with Survived')
plt.show()

After these steps we can conclude that Fare and Pclass have the strongest correlation with Survived. This is not unexpected, because these two columns depend on each other.
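
The dependence between Fare and Pclass can be checked on its own. The sketch below uses made-up fares (toy_df and its values are assumptions for illustration); on the real data the same calls would run on train_df.

```python
import pandas as pd

# Hypothetical fares: higher classes (lower Pclass number) pay more.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 15.0, 8.0, 7.0],
})

# Pearson correlation between the two columns (negative: class 1 is the most expensive).
pclass_fare_corr = toy_df["Pclass"].corr(toy_df["Fare"])

# Median fare per class gives a more readable view of the same dependence.
fare_by_class = toy_df.groupby("Pclass")["Fare"].median()

print(round(pclass_fare_corr, 2))
print(fare_by_class)
```

When two features carry much of the same information, it is worth keeping this in mind while interpreting the coefficients of a linear model.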

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle('Comparison of Categorical Features with Survival', fontsize=16, y=1.02)
# Pclass and Survived
sns.countplot(x='Pclass', hue='Survived', data=train_df, ax=axes[0])
axes[0].set_title('Survival by class')
axes[0].set_ylabel('Count')
axes[0].legend(['Died', 'Survived'])

# Embarked and Survived
sns.countplot(x='Embarked', hue='Survived', data=train_df, ax=axes[1])
axes[1].set_title('Survival by Embarked')
axes[1].set_ylabel('Count')
axes[1].legend(['Died', 'Survived'])


In [None]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

In [None]:
r = sns.FacetGrid(train_df, col='Survived')
r.map(plt.hist, 'SibSp', bins=20)

Now we can transform the dataset: fix some columns (fill empty cells) and remove useless features. For this, look at the whole feature set again.

In [None]:
# train_df.head()
test_df.head()

In [None]:
train_df.drop(['PassengerId', 'Cabin'], axis=1, inplace=True)
test_df.drop(['PassengerId', 'Cabin'], axis=1, inplace=True)
train_df.head()

In [None]:
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] 
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 0, 'IsAlone'] = 1
    dataset['Sex'] = dataset['Sex'].map({'male' : 0, 'female' : 1})
    # Assign the result back: inplace=True on a selected column is deprecated in recent pandas
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].map({'S' : 1, 'C' : 2, 'Q' : 3})




train_df.head()

We remove the Ticket column because the Pclass column can show us the same dependence.

In [None]:
train_df.drop(['FamilySize', 'Parch', 'SibSp', 'Ticket'], axis=1, inplace=True)
test_df.drop(['FamilySize', 'Parch', 'SibSp', 'Ticket'], axis=1, inplace=True)
train_df.head()

If we look at the data in Name, we will see that every name contains a title (Mr, Ms, Miss or another).

In [None]:
for dataset in combine:
    dataset['Title'] = dataset['Name'].map(lambda x: x.split(',')[1].split('.')[0])
    dataset.drop(['Name'], axis=1, inplace=True)
train_df

Replace every title with one of Mr, Miss, Mrs or Master (Dr is mapped to Mr). As a result, the feature derived from Name has only four distinct values.
But first, check the proportion of Survived in each category.
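
The check of the survival proportion per title could look like the sketch below. It runs on a made-up frame (toy_df and its values are assumptions); on the real data the grouping column would be the extracted Title column.

```python
import pandas as pd

# Hypothetical titles and outcomes, for illustration only.
toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mrs", "Miss", "Master", "Miss", "Mr", "Mrs"],
    "Survived": [0, 0, 1, 1, 1, 0, 1, 1],
})

# Group size and mean of Survived (the survival rate) for every title.
title_summary = (
    toy_df.groupby("Title")["Survived"]
    .agg(passengers="count", survival_rate="mean")
    .sort_values("survival_rate", ascending=False)
)

print(title_summary)
```

Showing the group size next to the rate helps to spot rare titles whose survival rate rests on only a handful of passengers.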

In [None]:
was_array = ['Mr', 'Sir', 'Don', 'Jonkheer', 'Col', 'Major', 'Rev', 'Capt', 'Dr', 'Master', 'Miss', 'Mrs', 'Ms', 'Mme', 'Mlle', 'the Countess', 'Lady', 'Dona']
will_array = ['Mr', 'Mr', 'Mr', 'Mr', 'Mr', 'Mr', 'Mr', 'Mr', 'Mr', 'Master', 'Miss', 'Mrs', 'Mrs', 'Mrs', 'Miss', 'Mrs', 'Mrs', 'Mrs']
title_mapping = dict(zip(was_array, will_array))

At the moment, I do not know whether it is better to group by age or by marital status.

In [None]:
for i, dataset in enumerate(combine):
    dataset.reset_index(drop=True, inplace=True)
    dataset['Title_clean'] = dataset['Title'].astype(str).str.strip()
    dataset['Title_new'] = dataset['Title_clean'].replace(title_mapping)
    dataset.drop(['Title', 'Title_clean'], axis=1, inplace=True)


In [None]:
train_df.groupby('Title_new')['Age'].mean()

Replace the missing values in Age with the mean age for the corresponding title.
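
Instead of typing the group means by hand, the same imputation can be sketched with groupby and transform, which computes each title's mean age and aligns it with the rows. The frame below is a toy example (toy_df and its values are assumptions), not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps; Title_new mirrors the cleaned title column.
toy_df = pd.DataFrame({
    "Title_new": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, np.nan],
})

# Mean age of each title, repeated for every row of that title.
title_mean_age = toy_df.groupby("Title_new")["Age"].transform("mean")

# Fill only the missing ages; known ages stay unchanged.
toy_df["Age"] = toy_df["Age"].fillna(title_mean_age)

print(toy_df)
```

With separate train and test sets, one possible design choice is to compute the means on the train set only and map them onto the test set, so that both sets are filled with the same values.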


In [None]:
for dataset in combine:
    dataset.loc[(dataset.Age.isnull()) & (dataset.Title_new == 'Master'), 'Age']=5
    dataset.loc[(dataset.Age.isnull()) & (dataset.Title_new == 'Miss'), 'Age']=22
    dataset.loc[(dataset.Age.isnull()) & (dataset.Title_new == 'Mr'), 'Age']=33
    dataset.loc[(dataset.Age.isnull()) & (dataset.Title_new == 'Mrs'), 'Age']=36
    dataset['Fare'] = dataset['Fare'].astype(float)
    mean_fare = dataset['Fare'].mean()
    dataset.loc[dataset['Fare'].isnull(), 'Fare']=mean_fare
      

In [None]:
cat_data_train = train_df['Title_new']
cat_data_test = test_df['Title_new']
dummy_feature_train= pd.get_dummies(cat_data_train, drop_first=True, dtype=int)
dummy_feature_test= pd.get_dummies(cat_data_test, drop_first=True, dtype=int)
train_df = pd.concat([train_df, dummy_feature_train], axis=1)
test_df = pd.concat([test_df, dummy_feature_test], axis=1)
train_df.drop(['Title_new'], axis=1, inplace=True)
test_df.drop(['Title_new'], axis=1, inplace=True)

In [None]:
train_df['Fare'] = train_df['Fare'].round(2)
test_df['Fare'] = test_df['Fare'].round(2)
test_df

Before using StandardScaler, I want to look at the predictions and score of a first weak model, and then compare them with the results after normalizing Age and Fare.
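
One way to make such a comparison is to cross-validate the same model with and without scaling. The sketch below uses synthetic data from make_classification as a stand-in for the Titanic features (the data, the inflated first column and the variable names are all assumptions), so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is inflated to mimic a large-range column such as Fare.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=42)
X[:, 0] *= 100

# Same classifier, once on raw features and once behind a scaler.
raw_model = LogisticRegression(max_iter=1000, random_state=42)
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# 5-fold cross-validated accuracy; the pipeline refits the scaler inside every fold.
raw_scores = cross_val_score(raw_model, X, y, cv=5, scoring="accuracy")
scaled_scores = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy")

print("raw:   ", raw_scores.mean().round(3))
print("scaled:", scaled_scores.mean().round(3))
```

Putting the scaler inside a pipeline keeps the validation folds out of the scaling statistics, which makes the two scores comparable.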

In [None]:
scaler = StandardScaler()
y_train = train_df['Survived']
train_df = train_df.drop(['Survived'], axis=1)
# Fit the scaler on the training features only, then reuse it for the test set
x_train_scaled = scaler.fit_transform(train_df)
x_test_scaled = scaler.transform(test_df)
x_train_scaled


In [None]:
# y_train = train_df['Survived']
# x_train_scaled = x_train_scaled.drop(['Survived'],  axis=1)
first_model = LogisticRegression(max_iter=1000, random_state=42)
first_model.fit(x_train_scaled, y_train)
# Accuracy on the training data itself, so this is an optimistic estimate
pred = first_model.predict(x_train_scaled)
accuracy = accuracy_score(y_train, pred)
accuracy