### Import necessary modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Loading Datasets

In [None]:
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')

In [None]:
temp = train_df.groupby("Sex")["Age"].mean().to_frame().reset_index()
temp = temp.rename(columns={"Age": "mean age"})
temp

In [None]:
train_df.head(5)

In [None]:
print("The shape of the train data is (row, column): " + str(train_df.shape))
train_df.info()  # info() prints its report itself and returns None
print("\n", "*" * 40, "\n")
print("The shape of the test data is (row, column): " + str(test_df.shape))
test_df.info()
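
The `info()` report lists non-null counts, so the gaps have to be worked out by subtraction. A minimal sketch of a more direct summary follows. The helper `missing_summary` and the toy frame are hypothetical and the values are made up; in this notebook one would pass `train_df` or `test_df` instead.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train_df (made-up values, not real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

result = missing_summary(toy)
print(result)
```

Columns without gaps (here `Fare`) are dropped from the summary, which keeps the output short on wide tables.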

### A few words about the variables:

- ##### Categorical:
  - **Nominal**:
    - **Cabin**
    - **Embarked**(Port of Embarkation):
      - C(Cherbourg)
      - Q(Queenstown)
      - S(Southampton)
    - **Sex** (also **Dichotomous**) - "Female" or "Male"
  ---
  - **Ordinal** (like nominal variables, these have two or more categories, but the categories can also be ordered or ranked)
    - **Pclass** (a proxy for socio-economic status, SES):
      - 1 (Upper)
      - 2 (Middle) 
      - 3 (Lower)
  ---
- ##### Numeric:
  - **Discrete**:
    - **PassengerId** (unique identifying number for each passenger)
    - **SibSp** (number of siblings/spouses aboard)
    - **Parch** (number of parents/children aboard)
    - **Survived** (our outcome or dependent variable; binary, 0 or 1)
  - **Continuous**:
    - **Age**
    - **Fare**
  ---
- ##### Text Variable:
  - **Ticket** (ticket number of the passenger)
  - **Name** (name of the passenger)
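
The taxonomy above can also be encoded in the column dtypes, so that pandas knows which categories are ordered. The sketch below is only illustrative: the toy frame is made up, and treating class 1 as the highest rank is an assumption based on the Upper/Middle/Lower labels.

```python
import pandas as pd

# Toy frame standing in for train_df (made-up values)
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Nominal variables: unordered categories
for col in ["Sex", "Embarked"]:
    toy[col] = toy[col].astype("category")

# Ordinal variable: categories listed from lowest rank (3) to highest (1)
pclass_type = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
toy["Pclass"] = toy["Pclass"].astype(pclass_type)

print(toy.dtypes)
print(toy["Pclass"].max())  # highest-ranked class present in the toy data
```

With an ordered dtype, `min()`, `max()` and sorting follow the declared rank rather than the numeric value of the label.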

### Visualization of the Data

#### How many survived?

In [None]:
f, ax = plt.subplots(1, 2, figsize=(18, 8))  # two panels: pie chart and count plot
train_df['Survived'].value_counts().plot.pie(
  explode=[0,0.1],
  autopct='%1.1f%%',
  ax=ax[0],
  shadow=True
)

ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot(
  x=train_df["Survived"],
  ax=ax[1]
)
ax[1].set_title('Survived')
plt.show()

#### Survived by Sex

In [None]:
f, ax = plt.subplots(1, 2, figsize=(10, 5))

train_df[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survival rate by Sex')

sns.countplot(data=train_df, x="Sex", hue="Survived", ax=ax[1])
ax[1].set_title('Survived vs Dead by Sex')

Even though there were far more men than women on the ship, the survival rate for women is roughly three to four times higher than for men. In absolute numbers, about twice as many women as men were saved.
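
The two ratios mentioned above can be read off a single grouped table instead of being estimated from the bars. This is a sketch on a made-up toy frame, so its numbers are not the Titanic figures; replacing `toy` with `train_df` would give the real ones.

```python
import pandas as pd

# Toy frame standing in for train_df (made-up values)
toy = pd.DataFrame({
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Survived": [1, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

# Survival rate, number of survivors and group size per sex
by_sex = toy.groupby("Sex")["Survived"].agg(
    rate="mean", survivors="sum", total="count"
)

# How many times higher the female rate and survivor count are
rate_ratio = by_sex.loc["female", "rate"] / by_sex.loc["male", "rate"]
count_ratio = by_sex.loc["female", "survivors"] / by_sex.loc["male", "survivors"]

print(by_sex)
print("rate ratio:", round(rate_ratio, 2), "count ratio:", count_ratio)
```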

In [None]:
pd.crosstab(train_df["Pclass"], train_df.Survived,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
f, ax = plt.subplots(1, 2, figsize=(15, 5))

# train_df[['Pclass','Survived']].groupby(['Pclass']).mean().plot(ax=ax[0])
sns.countplot(data=train_df, x="Pclass", ax=ax[0])
ax[0].set_title('Number of passengers in each class')

train_df[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar(ax=ax[1])
ax[1].set_title('Survival rate by economic class')

In [None]:
sns.catplot(data=train_df, x="Pclass", y="Survived", hue="Sex", kind="point")
plt.title('Survival rate by class and sex')
plt.show()

#### Analyzing age

In [None]:
print('Oldest passenger:', train_df['Age'].max(), 'years')
print('Youngest passenger:', train_df['Age'].min(), 'years')
print('Average age on the ship:', round(train_df['Age'].mean(), 2), 'years')

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12, 5))

sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train_df, split=True, ax=ax[0])
ax[0].set_title("Distributions of Age for Pclass and survival status")
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train_df, split=True, ax=ax[1])
ax[1].set_title("Distributions of Age for Sex and survival status")
ax[1].set_yticks(range(0, 110, 10))
plt.show()

1) The number of children increases with Pclass, and the survival rate for passengers below age 10 (i.e. children) looks good irrespective of Pclass.

2) Survival chances for passengers aged 20-50 from Pclass 1 are high, and are even better for women.

3) For males, the survival chances decrease with increasing age.
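
The observations above are read off the violin plots by eye. One way to check them numerically is to bin the ages and compute a survival rate per band. The sketch below uses a made-up toy frame, and the bin edges and labels are an arbitrary choice, not something taken from the analysis above.

```python
import pandas as pd

# Toy frame standing in for train_df (made-up values)
toy = pd.DataFrame({
    "Age": [4, 8, 25, 35, 45, 60, 70, 30],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Right-inclusive bins: (0, 10], (10, 20], (20, 50], (50, 100]
bins = [0, 10, 20, 50, 100]
labels = ["0-10", "10-20", "20-50", "50+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Mean of the 0/1 outcome per band is the survival rate;
# observed=True drops bands with no passengers
rate_by_band = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

Grouping by `["Pclass", "AgeBand"]` or `["Sex", "AgeBand"]` in the same way would mirror the two violin plots.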

#### Dealing with NaN age values

The Name feature can help here. Each name contains a salutation such as Mr or Mrs, so we can fill a missing age with the mean age of the passengers who share that salutation.
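
To see how a salutation can be pulled out of a name, here is a small sketch on three sample names in the dataset's "Surname, Title. Given names" format. The pattern takes the first run of letters that is immediately followed by a period; that this always lands on the salutation is an assumption about the name format.

```python
import pandas as pd

# Sample names in "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the letters directly before the first period;
# expand=False returns a Series instead of a one-column DataFrame
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```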

In [None]:
# Extract the salutation: the run of letters immediately before a period
train_df['Salutation'] = train_df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

In [None]:
pd.crosstab(train_df.Salutation, train_df.Sex).T.style.background_gradient(cmap='summer_r')  # check the salutations against Sex

##### Mapping rare or foreign salutations such as "Mlle" or "Mme" onto the common ones

In [None]:
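# Assign the result back: inplace=True on a column selection
# may not update train_df in recent pandas versions
train_df['Salutation'] = train_df['Salutation'].replace(
  ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],
  ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr']
)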

In [None]:
# Check the average age by salutation
mean_age_by_salutations = train_df.groupby('Salutation')['Age'].mean()
mean_age_by_salutations

##### Filling NaN Ages

In [None]:
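# Fill each missing age with the mean age of the passenger's salutation group.
# A vectorized fillna avoids chained assignment (train_df['Age'][i] = ...),
# which may silently fail to modify train_df.
train_df['Age'] = train_df['Age'].fillna(train_df['Salutation'].map(mean_age_by_salutations))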
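
As a sanity check on the idea of filling each missing age with its group mean, the sketch below applies it to a made-up toy frame. It uses `groupby(...).transform("mean")`, one possible way of aligning the group mean with every row, and then confirms that no missing ages remain.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (made-up values)
toy = pd.DataFrame({
    "Salutation": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean age of each row's salutation group, aligned with the original index
# (NaN values are skipped when the mean is computed)
group_mean = toy.groupby("Salutation")["Age"].transform("mean")

# Only the missing entries are replaced; known ages stay untouched
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy)
print("remaining NaN ages:", toy["Age"].isna().sum())
```

If a salutation group has no known ages at all, its mean is NaN and those rows stay unfilled, so the final count is worth checking on the real data.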