Hi there! This is my first time working with Kaggle data sets and, in a way, with data science in general. I recently completed a [course by Hastie and Tibshirani](https://www.edx.org/learn/python/stanford-university-statistical-learning-with-python), and I plan to try out most of the techniques it covers on this and subsequent data sets.

What to try:
- [ ] Data processing
- [ ] Basic data analysis (descriptive, correlation matrices)
- [ ] Basic data visualisation
- [ ] Regression (SLR, MLR)
- [ ] Assessing model accuracy on train and test split 
- [ ] Classification
    - [ ] Logistic regression
    - [ ] Linear discriminant analysis
    - [ ] K-nearest neighbours
- [ ] Resampling
    - [ ] Cross-validation
    - [ ] Bootstrap
- [ ] Best subset selection
- [ ] Shrinkage
    - [ ] Ridge 
    - [ ] Lasso
    - [ ] PCR
- [ ] _maybe_ Smoothing Splines, GAMs
- [ ] Trees
    - [ ] Decision trees
    - [ ] Random forests
    - [ ] Boosting
- [ ] Support Vector Machines
- [ ] *Definitely not here* Deep Learning

In [44]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots  # the module is "pyplot" (singular); "matplotlib.pyplots" does not exist
# import seaborn
from sklearn.model_selection import train_test_split, cross_val_score

# Dataset Overview and Notes
For detailed information, visit the Kaggle competition page: [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

## Plan (Inspired by an Introductory Video)
1. **Exploratory Data Analysis (EDA)** - Understand the data, find patterns and outliers.
2. **Train and Tune Model** - Develop a model to predict survival, and optimize its parameters.

## Variable Definitions
Below is a summary of the variables included in the Titanic dataset, with details on their meaning and encoding.

| Variable  | Definition                                 | Key                                            |
|-----------|:------------------------------------------:|------------------------------------------------|
| survival  | Survival                                   | 0 = No, 1 = Yes                                |
| pclass    | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex       | Sex                                        |                                                |
| age       | Age in years                               |                                                |
| sibsp     | # of siblings / spouses aboard the Titanic |                                                |
| parch     | # of parents / children aboard the Titanic |                                                |
| ticket    | Ticket number                              |                                                |
| fare      | Passenger fare                             |                                                |
| cabin     | Cabin number                               |                                                |
| embarked  | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

### Detailed Variable Insights

- `pclass`: Serves as a proxy for socio-economic status (SES)
  - **1st = Upper**
  - **2nd = Middle**
  - **3rd = Lower**

- `age`: Age is fractional if less than 1. If the age is estimated, it is given in the form xx.5.

- `sibsp`: This variable defines family relations as follows:
  - **Sibling** = brother, sister, stepbrother, stepsister
  - **Spouse** = husband, wife (mistresses and fiancés were ignored)

- `parch`: This variable further defines family relations:
  - **Parent** = mother, father
  - **Child** = daughter, son, stepdaughter, stepson
  - Note: Some children traveled only with a nanny, hence `parch=0` for them.
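
The encoding notes for `age` can be turned into simple flags. The sketch below is only an illustration: the helper name `flag_age_encoding` and the toy values are made up, and it assumes that a fractional part of exactly 0.5 on an age of at least 1 marks an estimate. An infant recorded as 0.5 is ambiguous under that rule and is treated as an infant here.

```python
import pandas as pd


def flag_age_encoding(age: pd.Series) -> pd.DataFrame:
    """Flag infants (age < 1) and ages that look estimated (xx.5)."""
    # Comparisons with NaN evaluate to False, so missing ages get no flag.
    is_infant = age < 1
    # Assumption: xx.5 on an age of at least 1 means the age was estimated.
    is_estimated = (age >= 1) & ((age % 1) == 0.5)
    return pd.DataFrame(
        {"Age": age, "is_infant": is_infant, "is_estimated": is_estimated}
    )


# Toy values for illustration only, not rows from the data set.
toy_age = pd.Series([22.0, 0.42, 28.5, None, 0.5])
age_flags = flag_age_encoding(toy_age)
print(age_flags)
```

Once the data is loaded, the same helper could be applied as `flag_age_encoding(train["Age"])`.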


# Basic Data Exploration

In [43]:
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
gender_pred = pd.read_csv("./gender_submission.csv")
train.head()

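|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |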


In [35]:
print("Shape:", train.shape)
print("Columns:", train.dtypes)
train.describe()

Shape: (891, 12)
Columns: PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


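|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|-------|-------------|----------|--------|-----|-------|-------|------|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean  | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std   | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min   | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25%   | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50%   | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75%   | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max   | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |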


In [36]:
train.describe(include="object")

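|        | Name | Sex | Ticket | Cabin | Embarked |
|--------|------|-----|--------|-------|----------|
| count  | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top    | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
| freq   | 1 | 577 | 7 | 4 | 644 |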


In [38]:
train["Cabin"].value_counts(dropna=False)

NaN            687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: Cabin, Length: 148, dtype: int64
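
The counts above suggest that `Age` (714 of 891 values present), `Cabin` (687 `NaN`), and `Embarked` (889 present) have gaps. A single table of missing values per column makes this easier to see. The following is a minimal sketch: the helper name `missing_summary` and the toy frame are made up for illustration and are not part of the data set.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values for every column that has any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": n_missing, "share": n_missing / len(df)}
    )
    # Keep only columns with gaps, most affected first.
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy frame for illustration only.
toy_missing = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, 35.0],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.28, 7.93, 53.10],
    }
)
missing_table = missing_summary(toy_missing)
print(missing_table)
```

In the notebook, `missing_summary(train)` would give the same overview for the real data.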

In [42]:
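# numeric_only=True (pandas >= 1.5) skips the text columns; pandas 2.x raises an error without it
train.corr(numeric_only=True)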

|             | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|-------------|-------------|----------|--------|-----|-------|-------|------|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived    | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass      | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age         | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp       | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch       | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare        | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |
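
The matrix above covers only the numeric columns, so `Sex` is left out. One possible way to include it is to encode it as 0/1 first and then rank the features by the strength of their correlation with `Survived`. This is a sketch under assumptions: the helper name `survival_correlations`, the 1 = female encoding, and the toy rows are all illustrative choices.

```python
import pandas as pd


def survival_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each numeric column (plus encoded Sex) with Survived."""
    encoded = df.copy()
    # Assumption: encode Sex as 1 = female, 0 = male so it can enter the matrix.
    encoded["Sex"] = (encoded["Sex"] == "female").astype(int)
    corr = encoded.select_dtypes("number").corr()["Survived"].drop("Survived")
    # Order by absolute strength, strongest first, keeping the sign.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Toy rows for illustration only; here Sex matches Survived exactly.
toy_corr = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0, 0],
        "Pclass": [3, 1, 3, 1, 3, 2],
        "Sex": ["male", "female", "female", "female", "male", "male"],
        "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00],
    }
)
ranked = survival_correlations(toy_corr)
print(ranked)
```

Pearson correlation on a 0/1 column is only a rough guide, but it is enough to see which features deserve a closer look.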


# Basic sampling

In [52]:
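y = train["Survived"]  # target: 0 = did not survive, 1 = survived
X = train.drop(["Survived", "PassengerId"], axis=1)  # PassengerId is only a row identifier

# With no test_size given, scikit-learn holds out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)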
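
`X` still contains text columns and missing values, so a model cannot be fitted to it directly. Below is one possible baseline for the "assessing model accuracy" and "cross-validation" items on the list: a scikit-learn pipeline that imputes, scales, and one-hot encodes before a logistic regression. The feature lists, the helper name `make_baseline`, and the synthetic demo frame are assumptions made for illustration; the demo data is random and says nothing about the Titanic results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature choice; columns not listed here are dropped by the transformer.
NUMERIC = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CATEGORICAL = ["Sex"]


def make_baseline() -> Pipeline:
    """Median-impute and scale numeric columns, one-hot encode Sex, then fit a logistic regression."""
    numeric = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    prep = ColumnTransformer(
        [
            ("num", numeric, NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ]
    )
    return Pipeline([("prep", prep), ("model", LogisticRegression(max_iter=1000))])


# Synthetic stand-in for the training data so the sketch runs on its own.
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame(
    {
        "Pclass": rng.integers(1, 4, n),
        "Age": rng.uniform(1, 80, n),
        "SibSp": rng.integers(0, 3, n),
        "Parch": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n),
        "Sex": rng.choice(["male", "female"], n),
    }
)
demo.loc[demo.index % 10 == 0, "Age"] = np.nan  # inject some missing ages
demo_y = (demo["Sex"] == "female").astype(int)  # artificial label, for a smoke test only

# stratify keeps the class balance similar in both parts of the split.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, random_state=1, stratify=demo_y
)

baseline = make_baseline()
cv_scores = cross_val_score(baseline, Xd_train, yd_train, cv=5)  # 5-fold CV accuracy
baseline.fit(Xd_train, yd_train)
test_accuracy = baseline.score(Xd_test, yd_test)  # accuracy on the held-out part
print("CV accuracy per fold:", cv_scores)
print("Held-out accuracy:", test_accuracy)
```

With the real data, the same pipeline could be passed `X_train` and `y_train` from the split above. Keeping the preprocessing inside the pipeline means the imputation and scaling are re-fitted within each cross-validation fold, which avoids leaking information from the validation rows.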