# Individual Project - Titanic


## Table of Contents

[**Step 3: Data Preparation**](#Step-3:-Data-Preparation)
- [**Deal with Missing Data**](#Deal-with-Missing-Data)
- [**Feature Engineering**](#Feature-Engineering)

[**Step 4: Modeling**](#Step-4:-Modeling)


[Back to Top](#Table-of-Contents)


This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
#### Titanic Story
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers.

#### Objective
We will build a regression model to predict the ticket price (`Fare`).



[Back to Top](#Table-of-Contents)

## Step 3: Data Preparation
Create new features through feature engineering, deal with missing values, and clean up the data (i.e., strip extra white space from string values). We will focus on dealing with missing data in this phase.
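The clean-up step mentioned above (stripping extra white space) can be sketched as follows. This is a minimal illustration on made-up rows; the DataFrame `toy` and its values are assumptions and do not come from the Titanic file.

```python
import pandas as pd

# Toy data (hypothetical values, used only to illustrate the clean-up step)
toy = pd.DataFrame({
    "Name": ["  Doe, Mr. John ", "Roe, Mrs. Ann  "],
    "Embarked": [" S", "C "],
})

# Strip leading and trailing white space from each text column
for col in ["Name", "Embarked"]:
    toy[col] = toy[col].str.strip()

print(toy)
```

Listing the text columns explicitly keeps the sketch independent of how a given pandas version labels string dtypes.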

In [1]:
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Loading Data

df_titanic = pd.read_csv("titanic-231005-181053.csv")
df_titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,$7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,$71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,$7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,$53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,$8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,$13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,$30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,$23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,$30.0,C148,C


In [6]:
#check all missing data
df_titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

### Deal with Missing Data
We will demonstrate two approaches: filling with the mean or mode of a column, and estimating the missing value from other columns.

#### Fill with Mean/Mode
`Embarked` has only 2 missing values and there is no obvious way to estimate them, so we will simply fill them with the mode of the column, 'S'.

##### Task12: Fill missing Embarked with mode

In [4]:
# fillna(..., inplace=True) returns None, so assign the filled column back instead
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(df_titanic['Embarked'].mode()[0])
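As a small, self-contained illustration of mode filling, the sketch below uses a made-up Series (the values are assumptions, not rows from the dataset). It also shows that `fillna` without `inplace` returns a new Series and leaves the original untouched.

```python
import pandas as pd

# Toy port-of-embarkation column with two gaps (hypothetical values)
embarked = pd.Series(["S", "C", None, "S", "Q", None, "S"])

# mode() ignores missing values and returns a Series; take its first entry
most_common = embarked.mode()[0]

# fillna returns a new Series; the original still contains the gaps
filled = embarked.fillna(most_common)

print(most_common)
print(filled.isnull().sum(), embarked.isnull().sum())
```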

#### Fill with Estimated Value

A title is a word attached to a person's name in certain contexts. It may signify veneration, an official position, or a professional or academic qualification. It is also a good indication of age: for example, Mr is used for adult men and Master for young boys.

If we look at all names of Titanic passengers, we can see that the name is in format Last, Title. First. We can use this information to estimate missing ages.

- First, we will use a regular expression to extract the title from the name.
- Then we will convert the title to upper case.
- Then we fill each missing age with the mean age for that title.
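The three steps can be sketched end to end on a tiny example. The names and ages below are made up for illustration, and the column name `Title` is an assumption of this sketch rather than the notebook's own.

```python
import pandas as pd

# Toy passengers (hypothetical names and ages)
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mr. Sam", "Poe, Mr. Max",
             "Doe, Miss. Amy", "Roe, Miss. Eva", "Poe, Miss. Ida"],
    "Age": [20.0, 40.0, None, 10.0, 30.0, None],
})

# Step 1 and 2: extract the title (letters followed by a period), then upper-case it
toy["Title"] = toy["Name"].str.extract(r"([A-Za-z]+\.)", expand=False).str.upper()

# Step 3: transform('mean') returns one value per row, aligned with the original index,
# so it can be passed straight to fillna
title_mean = toy.groupby("Title")["Age"].transform("mean")
toy["Age"] = toy["Age"].fillna(title_mean)

print(toy)
```

Using `transform` rather than `groupby(...).mean()` matters here: the latter is indexed by title, so it would not line up with the row index that `fillna` aligns on.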

In [13]:
# 1) Extract the title (letters followed by a period) from the name
df_titanic['Titles'] = df_titanic['Name'].str.extract(r'([A-Za-z]+\.)', expand=False)
df_titanic[['Titles']]


Unnamed: 0,Titles
0,Mr.
1,Mrs.
2,Miss.
3,Mrs.
4,Mr.
...,...
886,Rev.
887,Miss.
888,Miss.
889,Mr.


##### Task13: convert title to upper case.
To ensure we get an accurate mean age for each title, we convert the titles to upper case.

In [14]:
# str.upper() returns an upper-cased copy for display; assign it back to
# df_titanic['Titles'] to make the change permanent
df_titanic['Titles'].str.upper()




0        MR.
1       MRS.
2      MISS.
3       MRS.
4        MR.
       ...  
886     REV.
887    MISS.
888    MISS.
889      MR.
890      MR.
Name: Titles, Length: 891, dtype: object

##### Task14: Fill missing age with mean age of the title

In [15]:

# Count how many passengers carry each title
df_titanic['Titles'].value_counts()


Titles
Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
Countess.      1
Capt.          1
Ms.            1
Sir.           1
Lady.          1
Mme.           1
Don.           1
Jonkheer.      1
Name: count, dtype: int64

In [35]:
#Grouping Titles
Mean_Age_Title= df_titanic.groupby('Titles').Age.mean()
Mean_Age_Title

Titles
Capt.        70.000000
Col.         58.000000
Countess.    33.000000
Don.         40.000000
Dr.          42.000000
Jonkheer.    38.000000
Lady.        48.000000
Major.       48.500000
Master.       4.574167
Miss.        21.773973
Mlle.        24.000000
Mme.         24.000000
Mr.          32.368090
Mrs.         35.898148
Ms.          28.000000
Rev.         43.166667
Sir.         49.000000
Name: Age, dtype: float64

In [36]:
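# Fill each missing age with the mean age of that passenger's title.
# Mean_Age_Title is indexed by title, so map it onto the Titles column first;
# passing it to fillna directly would align on the row index and fill nothing.
df_titanic['Age'].fillna(df_titanic['Titles'].map(Mean_Age_Title))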

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    21.773973
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64

In [37]:
df_titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,titles,Titles
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,$7.25,,S,Mr.,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,$71.2833,C85,C,Mrs.,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,$7.925,,S,Miss.,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,$53.1,C123,S,Mrs.,Mrs.
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,$8.05,,S,Mr.,Mr.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,$13.0,,S,Rev.,Rev.
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,$30.0,B42,S,Miss.,Miss.
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,21.773973,1,2,W./C. 6607,$23.45,,S,Miss.,Miss.
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,$30.0,C148,C,Mr.,Mr.


In [23]:
# Replace missing ages with the mean age of passengers who share the same title
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic.groupby('Titles')['Age'].transform('mean'))

df_titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
titles           0
Titles           0
dtype: int64

In [9]:
# Fallback: fill any ages that are still missing (for example, a title with no recorded ages) with the overall mean age
df_titanic['Age'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())


### Feature Engineering
We'll create a new column `FamilySize`. Two columns relate to family size: `Parch` is the number of parents or children aboard, and `SibSp` is the number of siblings or spouses aboard.

Take the name 'Asplund' as an example: the total family size is 7 (Parch + SibSp + 1), and each family member has the same Fare, which means the Fare is for the whole group. Family size should therefore be an important feature for predicting Fare. Only 4 of the 7 Asplunds appear in the dataset because the dataset is only a subset of all passengers.

In [14]:
df_titanic.Name.str.contains('Asplund')

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Name, Length: 891, dtype: bool
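The cell above returns a boolean mask rather than the matching rows. A minimal sketch of using such a mask to select one family is shown here; the rows are placeholders (assumed values, not copied from the dataset), chosen only so that each member's family size comes out as 7.

```python
import pandas as pd

# Placeholder rows (hypothetical values) standing in for the real data
toy = pd.DataFrame({
    "Name": ["Asplund, Mrs. A", "Doe, Mr. John", "Asplund, Master. B"],
    "SibSp": [1, 0, 4],
    "Parch": [5, 0, 2],
    "Fare": [30.0, 8.0, 30.0],
})

# Build the boolean mask, then index the DataFrame with it to keep matching rows only
mask = toy["Name"].str.contains("Asplund")
asplunds = toy[mask]

# Family size as defined in the text: Parch + SibSp + 1
family_size = asplunds["Parch"] + asplunds["SibSp"] + 1

print(asplunds)
print(family_size.tolist())
```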

##### Task15: Create column 'FamilySize'
FamilySize = Parch + SibSp + 1

In [18]:
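# Store the result as a new column; the standalone Series is kept for display
FamilySize = df_titanic['Parch'] + df_titanic['SibSp'] + 1
df_titanic['FamilySize'] = FamilySize
FamilySize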

0      2
1      2
2      1
3      2
4      1
      ..
886    1
887    1
888    4
889    1
890    1
Length: 891, dtype: int64

[Back to Top](#Table-of-Contents)

## Step 4: Modeling

Now we have a relatively clean dataset (except for the Cabin column, which has many missing values). We could do a classification on Survived to predict whether a passenger survived the disaster, or a regression on Fare to predict the ticket fare. This dataset is not well suited to regression, but since we don't cover classification in this workshop, we will construct a linear regression on Fare in this exercise.

##### Task16: Construct a regression on Fare
Construct regression model with statsmodels.

Pick Pclass, Embarked, FamilySize as independent variables.
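Pclass and Embarked are categorical, so in a formula they are wrapped in `C(...)`, which replaces each variable with indicator columns and treats one level as the reference. As a rough analogue (an assumption of this sketch, not the mechanism statsmodels uses internally), `pd.get_dummies` with `drop_first=True` produces the same kind of encoding on a made-up frame:

```python
import pandas as pd

# Toy rows (hypothetical) covering every level of both categorical variables
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Embarked": ["C", "Q", "S", "S"],
})

# drop_first=True drops the first level of each variable, which then acts as the reference
dummies = pd.get_dummies(toy, columns=["Pclass", "Embarked"], drop_first=True)

print(dummies.columns.tolist())
```

The dropped levels (Pclass 1 and Embarked C) are the reference categories against which the remaining indicator coefficients are measured.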

In [15]:
#result =smf.ols("Fare ~ C(Pclass) + C(Embarked) + FamilySize", data=df_titanic).fit()
#result.summary()
import statsmodels.formula.api as smf

In [16]:
df_titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    object 
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
 12  FamilySize   891 non-null    int64  
dtypes: float64(1), int64(6), object(6)
memory usage: 90.6+ KB


In [19]:
# FamilySize=df_titanic.Parch + df_titanic.SibSp + 1
df_titanic['FamilySize'] = df_titanic['Parch'] + df_titanic['SibSp'] + 1

FamilySize.head()

0    2
1    2
2    1
3    2
4    1
dtype: int64

In [21]:
# Strip the "$" so that Fare can be parsed as a number
df_titanic["Fare"] = df_titanic["Fare"].str.replace("$", "", regex=False).astype(float)

# Make sure FamilySize is defined as a numeric column in df_titanic
df_titanic['FamilySize'] = df_titanic['Parch'] + df_titanic['SibSp'] + 1

# Fit the OLS model
result = smf.ols("Fare ~ C(Pclass) + C(Embarked) + FamilySize", data=df_titanic).fit()

# Print the summary
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                   Fare   R-squared:                       0.427
Model:                            OLS   Adj. R-squared:                  0.424
Method:                 Least Squares   F-statistic:                     131.9
Date:                Tue, 23 Apr 2024   Prob (F-statistic):          1.92e-104
Time:                        12:25:37   Log-Likelihood:                -4495.8
No. Observations:                 891   AIC:                             9004.
Df Residuals:                     885   BIC:                             9032.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           79.2989      3.543  

In [22]:
result =smf.ols("Fare ~ C(Pclass) + C(Embarked) + FamilySize", data=df_titanic).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Fare | R-squared: | 0.427 |
| Model: | OLS | Adj. R-squared: | 0.424 |
| Method: | Least Squares | F-statistic: | 131.9 |
| Date: | Tue, 23 Apr 2024 | Prob (F-statistic): | 1.92e-104 |
| Time: | 12:25:42 | Log-Likelihood: | -4495.8 |
| No. Observations: | 891 | AIC: | 9004. |
| Df Residuals: | 885 | BIC: | 9032. |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 79.2989 | 3.543 | 22.381 | 0.000 | 72.345 | 86.253 |
| C(Pclass)[T.2] | -59.0955 | 3.921 | -15.073 | 0.000 | -66.790 | -51.401 |
| C(Pclass)[T.3] | -68.8790 | 3.253 | -21.174 | 0.000 | -75.264 | -62.494 |
| C(Embarked)[T.Q] | -11.8147 | 5.446 | -2.169 | 0.030 | -22.504 | -1.126 |
| C(Embarked)[T.S] | -14.9202 | 3.414 | -4.371 | 0.000 | -21.620 | -8.220 |
| FamilySize | 7.8256 | 0.789 | 9.919 | 0.000 | 6.277 | 9.374 |

| | | | |
|---|---|---|---|
| Omnibus: | 1043.506 | Durbin-Watson: | 2.04 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 118621.734 |
| Skew: | 5.718 | Prob(JB): | 0.0 |
| Kurtosis: | 58.357 | Cond. No. | 13.4 |


**Conclusion:**

**R-squared and Adjusted R-squared:**
In this model, the R-squared is 0.427, indicating that approximately 42.7% of the variance in 'Fare' is explained by the independent variables.

Adjusted R-squared corrects R-squared for the number of independent variables in the model, penalizing the inclusion of irrelevant variables. Here it is 0.424, only slightly below the R-squared, which suggests that the fit is not being inflated by superfluous predictors.
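The relationship between the two figures can be checked by hand with the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below plugs in the rounded values from the summary (R-squared, No. Observations, Df Model); the variable names are chosen for this illustration.

```python
# Values read from the regression summary (rounded as reported there)
r_squared = 0.427
n_obs = 891          # No. Observations
n_predictors = 5     # Df Model

# Adjusted R-squared penalizes each additional predictor
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))
```

With 891 observations and only 5 model terms the penalty is tiny, which is why the two values differ only in the third decimal place.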



**F-statistic and Prob (F-statistic):**
The F-statistic tests the overall significance of the regression model. Here, the F-statistic is 131.9, with a very low p-value (1.92e-104), indicating that the regression model is statistically significant.

**Coefficients (coef):**
Each coefficient represents the estimated change in the dependent variable ('Fare') for a one-unit change in the corresponding independent variable, holding other variables constant.



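**The intercept** is the expected value of 'Fare' when every term in the model is zero, that is, for a passenger in the reference categories (Pclass 1, Embarked C) with FamilySize equal to 0. In this case, it is 79.2989. Since FamilySize is at least 1 for every passenger, the intercept on its own does not describe a real passenger.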

The coefficients for the categorical variables ('Pclass' and 'Embarked') represent the differences in 'Fare' compared to the reference categories. For example, passengers in class 2 ('C(Pclass)[T.2]') are estimated to have fares that are $59.0955 lower than passengers in class 1.



The coefficient for 'FamilySize' suggests that for every one-unit increase in family size, the fare is expected to increase by 7.8256 units.
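To see how the coefficients combine, a predicted fare can be assembled by hand from the summary table. The helper `predict_fare` below is a hypothetical function written for this illustration; it only adds up the reported coefficients and is not part of statsmodels.

```python
# Coefficients copied from the regression summary
COEF = {
    "Intercept": 79.2989,
    "C(Pclass)[T.2]": -59.0955,
    "C(Pclass)[T.3]": -68.8790,
    "C(Embarked)[T.Q]": -11.8147,
    "C(Embarked)[T.S]": -14.9202,
    "FamilySize": 7.8256,
}

def predict_fare(pclass, embarked, family_size):
    """Sum the intercept, the matching category offsets, and the FamilySize term."""
    fare = COEF["Intercept"]
    # Reference categories (Pclass 1, Embarked C) have no entry, so they add 0
    fare += COEF.get(f"C(Pclass)[T.{pclass}]", 0.0)
    fare += COEF.get(f"C(Embarked)[T.{embarked}]", 0.0)
    fare += COEF["FamilySize"] * family_size
    return fare

print(round(predict_fare(1, "C", 2), 4))  # first class, Cherbourg, family of 2
print(round(predict_fare(3, "S", 1), 4))  # third class, Southampton, travelling alone
```

The same numbers would come from `result.predict(...)` on a fitted model, up to the rounding of the coefficients shown in the summary.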


**Standard Errors (std err) and t-values (t):**

Lower standard errors indicate more precise estimates.
The t-values are the coefficients divided by their standard errors. They measure how many standard errors each coefficient lies from zero; higher absolute t-values indicate stronger evidence that the coefficient is not zero.
As a rule of thumb, an absolute t-value greater than about 2 indicates that the coefficient is significantly different from zero at roughly the 5% level.
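The division can be verified directly from the summary table. The sketch below recomputes each t-value from the reported coefficient and standard error and compares it with the reported t; small gaps are expected because the standard errors are rounded to three decimals.

```python
# (coef, std err, reported t) copied from the regression summary
rows = {
    "Intercept": (79.2989, 3.543, 22.381),
    "C(Pclass)[T.2]": (-59.0955, 3.921, -15.073),
    "C(Pclass)[T.3]": (-68.8790, 3.253, -21.174),
    "C(Embarked)[T.Q]": (-11.8147, 5.446, -2.169),
    "C(Embarked)[T.S]": (-14.9202, 3.414, -4.371),
    "FamilySize": (7.8256, 0.789, 9.919),
}

# t = coefficient / standard error
t_values = {name: coef / se for name, (coef, se, _) in rows.items()}

# Largest difference between the recomputed and the reported t-values
max_gap = max(abs(t_values[name] - reported) for name, (_, _, reported) in rows.items())

for name, t in t_values.items():
    print(f"{name:18s} {t:8.3f}")
print(max_gap < 0.01)
```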


**P-values (P>|t|):**
Each p-value is the probability of observing a t-value at least as extreme as the one obtained if the null hypothesis (that the coefficient is equal to zero) were true. Lower p-values mean stronger evidence against the null hypothesis.
**In this output, all p-values are below 0.05 (the largest is 0.030, for C(Embarked)[T.Q]), indicating that all coefficients are statistically significant at the 5% level.**


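**Omnibus, Durbin-Watson, Jarque-Bera (JB), Skew, Kurtosis, Cond. No.:**
These statistics provide additional diagnostic information about the regression model. Omnibus and Jarque-Bera test the normality of the residuals, Durbin-Watson checks for autocorrelation, and the condition number flags multicollinearity. Here the Durbin-Watson value of 2.04 is close to 2, which suggests little autocorrelation, and the condition number of 13.4 is low. However, **the very large Omnibus (1043.506) and Jarque-Bera (118621.734) statistics, together with the strong skewness (5.718) and kurtosis (58.357), suggest non-normality of the residuals.**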
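The Jarque-Bera statistic in the summary is tied to the reported skewness and kurtosis through the standard formula JB = n/6 · (S² + (K − 3)²/4). A quick check with the rounded values from the summary (variable names chosen for this sketch) reproduces the reported figure closely:

```python
# Values read from the regression summary (rounded as reported there)
n_obs = 891
skew = 5.718
kurtosis = 58.357

# For normally distributed residuals, skew is about 0 and kurtosis about 3, so JB is near 0
jb = n_obs / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)

print(round(jb, 1))
```

The result differs from the reported 118621.734 only through rounding of the inputs, and its size shows how far the residuals are from the normal benchmark.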