# Titanic disaster model

https://www.kaggle.com/competitions/titanic

https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook

https://www.overleaf.com/read/fqxnygwqtnjs#fbe3c8

### Introduction
#### To do:
- briefly describe the model
- discuss recent methods used for problems like this
- discuss the methods we want to use
- state what results we expect

### Setting up the environment

In [71]:
import pandas as pd
import numpy as np
from scipy import optimize
from scipy.optimize import minimize, LinearConstraint
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score  # needed to score the hand-written SVM below


In [61]:
gh_path = 'https://raw.githubusercontent.com/dsindy/kaggle-titanic/master/data/'
df_train = pd.read_csv(gh_path + 'train.csv')
df_test = pd.read_csv(gh_path + 'test.csv')
df_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Feature engineering
#### To do:
- encode Sex as a binary value (e.g. male = 0, female = 1),
- reduce the number of features (drop columns: Name, Ticket, ?Fare?, Cabin, Embarked),
- drop rows with NaN values

A: Fare may carry some information about Cabin, so it is worth keeping if we want passenger placement in the model.

In [62]:
# Drop unnecessary columns
columns_to_drop = ['Name','Ticket', 'Cabin', 'Embarked']
df_train_processed = df_train.drop(columns=columns_to_drop)
df_test_processed = df_test.drop(columns=columns_to_drop)

# Drop rows with any NaN values
df_train_processed = df_train_processed.dropna()
df_test_processed = df_test_processed.dropna()

# Encode Sex as binary: male = 0, female = 1
df_train_processed['Sex'] = df_train_processed['Sex'].map({'male': 0, 'female': 1})
df_test_processed['Sex'] = df_test_processed['Sex'].map({'male': 0, 'female': 1})

print(df_train_processed)

     PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare
0              1         0       3    0  22.0      1      0   7.2500
1              2         1       1    1  38.0      1      0  71.2833
2              3         1       3    1  26.0      0      0   7.9250
3              4         1       1    1  35.0      1      0  53.1000
4              5         0       3    0  35.0      0      0   8.0500
..           ...       ...     ...  ...   ...    ...    ...      ...
885          886         0       3    1  39.0      0      5  29.1250
886          887         0       2    0  27.0      0      0  13.0000
887          888         1       1    1  19.0      0      0  30.0000
889          890         1       1    0  26.0      0      0  30.0000
890          891         0       3    0  32.0      0      0   7.7500

[714 rows x 8 columns]
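
Dropping rows with missing values also removes passengers from the test set, which matters when a prediction is needed for every `PassengerId`. A minimal sketch of one possible alternative, assuming median imputation is acceptable for `Age` and `Fare`. The toy frames and the helper name `impute_with_train_median` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

def impute_with_train_median(train, test, columns):
    """Fill NaNs in `columns` of both frames with medians computed on `train` only."""
    train, test = train.copy(), test.copy()
    for col in columns:
        median = train[col].median()          # statistic comes from the training data only
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)
    return train, test

# Hypothetical toy rows standing in for df_train / df_test
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0],
                          "Fare": [7.25, 71.28, 7.92, 53.10]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0], "Fare": [np.nan, 7.0]})

filled_train, filled_test = impute_with_train_median(toy_train, toy_test, ["Age", "Fare"])
print(filled_test)
```

Taking the median from the training frame only keeps information from the test set out of the preprocessing step.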


### Perceptron

#### To do:
- define the weight vector W (randomly initialized, in my opinion) and the training set X
- define a perceptron class (linear, binary [if Survived is recoded to -1 and 1], and/or logistic). It should include an activation function, a loss function, and fitting (gradient descent / Newton-Raphson); all of this can be copied and slightly modified from the Colab notebook "logistic_regression_and_svm",
- split the training dataset and look for bad predictions,
- compare with the scikit-learn method
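
The plan above could be sketched roughly as follows. This is a minimal illustration, not the Colab code: it assumes the logistic variant (sigmoid activation, cross-entropy loss, plain gradient descent) with labels in {0, 1}. The class name `LogisticPerceptron`, the hyperparameter values and the synthetic two-cluster data are all assumptions made for the example.

```python
import numpy as np

class LogisticPerceptron:
    """Single-layer perceptron with a sigmoid activation (i.e. logistic regression)."""

    def __init__(self, n_features, learning_rate=0.1, n_iter=2000, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=n_features)  # small random initial weights
        self.b = 0.0
        self.learning_rate = learning_rate
        self.n_iter = n_iter

    @staticmethod
    def activation(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(self, X):
        return self.activation(X @ self.W + self.b)

    def loss(self, X, y):
        # Binary cross-entropy; probabilities are clipped to avoid log(0)
        p = np.clip(self.predict_proba(X), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        for _ in range(self.n_iter):
            error = self.predict_proba(X) - y           # gradient of the loss w.r.t. z
            self.W -= self.learning_rate * (X.T @ error) / len(y)
            self.b -= self.learning_rate * error.mean()
        return self

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

# Synthetic, well-separated clusters used only to exercise the class
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

model = LogisticPerceptron(n_features=2).fit(X_toy, y_toy)
toy_accuracy = (model.predict(X_toy) == y_toy).mean()
print(f"Training accuracy on toy data: {toy_accuracy:.2f}")
```

For the Titanic features, the inputs would likely need standardizing first, since `Age` and `Fare` are on a much larger scale than `Sex` or `Pclass`.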

### Support Vector Machine
#### To do:
- define the 6D weight vector (Pclass, Sex, Age, SibSp, Parch, Fare) and the loss function (Lagrange function), calculate the width of the margin (constraint equation), optimize to get the values of $\vec{\alpha}$, w and the bias b, then try kernel transformation(s),
- soft margin (regularization parameter C) and hard margin,
- split the training dataset and look for bad predictions,
- compare with the scikit-learn method.

$$ L(\vec{\alpha}) = \sum _i \alpha _i - \frac{1}{2} \sum _{i,j} \alpha _i \alpha _j y_i y_j \vec{X}_i\cdot \vec{X} _j. $$
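This dual Lagrangian is maximized over $\vec{\alpha}$ subject to $\sum _i \alpha _i y_i = 0$ and $0 \le \alpha _i \le C$ (soft margin; the hard margin drops the upper bound), with labels $y_i = \pm 1$.

Kuhn-Tucker theorem: the solution found here is the same as the solution to the original (primal) problem.
##### [`REMARK`] Keep track of this equation and the dot product within it: it becomes highly significant later on, especially when dealing with data that is not linearly separable.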
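
Because the data enter the dual only through the dot products $\vec{X}_i\cdot \vec{X} _j$, the matrix of dot products can be swapped for a kernel matrix without changing anything else. A small sketch of that idea, assuming an RBF kernel as the nonlinear example. The function names, the `gamma` value and the three toy points are illustrative assumptions.

```python
import numpy as np

def linear_kernel(A, B):
    """Plain dot products: K_ij = A_i . B_j."""
    return A @ B.T

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel: K_ij = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def dual_objective(alphas, y, K):
    """L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij."""
    return alphas.sum() - 0.5 * alphas @ (np.outer(y, y) * K) @ alphas

X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y_toy = np.array([1.0, -1.0, 1.0])
alphas_toy = np.array([0.1, 0.2, 0.1])   # arbitrary values satisfying sum(alphas * y) = 0

K_lin = linear_kernel(X_toy, X_toy)       # reproduces the dot products in the equation above
K_rbf = rbf_kernel(X_toy, X_toy)          # drop-in replacement for nonlinear boundaries
print(dual_objective(alphas_toy, y_toy, K_lin), dual_objective(alphas_toy, y_toy, K_rbf))
```

Only the matrix `K` changes between the two calls; the objective itself stays the same.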

In [87]:
# The SVM dual assumes labels in {-1, +1}, so recode Survived from {0, 1}
Y = 2 * df_train_processed['Survived'].values - 1
X = df_train_processed[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=420)

# Gram matrix of dot products X_i . X_j and matrix of label products y_i * y_j
DOTS    = X_train @ X_train.T
Y_Y_T   = np.outer(Y_train, Y_train)
alphas  = np.ones(Y_train.shape)

def Lagrange(alphas):
    # Negative dual Lagrangian (negated because scipy minimizes)
    quadratic = alphas @ (Y_Y_T * DOTS) @ alphas
    return 0.5 * quadratic - np.sum(alphas)

# define the constraints
C           = 1.0
# equality constraint: sum_i alphas[i] * Y_train[i] = 0 (Y_train must hold -1/+1 labels)
# this is a single constraint; one constraint per sample would force every alpha to zero
constraint  = LinearConstraint(Y_train.reshape(1, -1), 0, 0)
# box constraints: 0 <= alphas[i] <= C
bounds      = [(0, C)] * len(Y_train)

# a       = optimize.minimize(Lagrange,
#                             x0 = 0.05 * np.random.random(len(Y_train)),
#                             bounds=bounds,
#                             constraints=[constraint])
# alphas  = a['x']
# alphas

# supports    =   []
# for i in range(len(alphas)):
#     continue

# W   = np.zeros((6))  # one weight per feature
# # try to avoid using this for loop
# for i in range(len(alphas)):
#     # W +=
# print("W =", W)

# for i, (alpha, y, x) in enumerate(supports):
#     # find b for any of the support vectors

# slackVars = []
# for i, x in enumerate(X_train):
#     slackVars.append(None)
# slackVars

# Find two support vectors with different values of Y  (±1) .
# Sone        = None
# Smone       = None

# Sone_close  = 1e13
# Smone_close = 1e13

# for i, s in enumerate(X_train):
#     continue

# Sone, Smone

# y_pred  = []
# for x in X_test:
#     if np.dot(W, x) + b < 0:
#         y_pred.append(-1)
#     else:
#         y_pred.append(1)
# accuracy_score(y_pred, Y_test)

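
The commented-out steps above (optimize the dual, recover W and b, predict) could be completed roughly as in the following sketch. It is an illustration under stated assumptions rather than the finished notebook code: labels are -1/+1, the solver is SciPy's SLSQP with an analytic gradient, and the four collinear toy points replace the Titanic data so the result is easy to check by hand. The helper name `fit_dual_svm` and the tolerance values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def fit_dual_svm(X, y, C=1.0):
    """Solve the soft-margin SVM dual numerically; y must contain -1/+1 labels."""
    n = len(y)
    Q = np.outer(y, y) * (X @ X.T)           # Q_ij = y_i y_j X_i . X_j

    def neg_dual(alphas):                     # negated because scipy minimizes
        return 0.5 * alphas @ Q @ alphas - alphas.sum()

    def neg_dual_grad(alphas):
        return Q @ alphas - np.ones(n)

    result = minimize(
        neg_dual, x0=np.zeros(n), jac=neg_dual_grad, method="SLSQP",
        bounds=[(0.0, C)] * n,                                       # 0 <= alpha_i <= C
        constraints=[LinearConstraint(y.reshape(1, -1), 0.0, 0.0)],  # sum alpha_i y_i = 0
        options={"ftol": 1e-10, "maxiter": 500},
    )
    alphas = result.x
    w = (alphas * y) @ X                      # W = sum_i alpha_i y_i X_i
    # Support vectors with 0 < alpha < C sit exactly on the margin: y_i (W . X_i + b) = 1
    tol = 1e-5
    on_margin = (alphas > tol) & (alphas < C - tol)
    if not on_margin.any():                   # fall back to all support vectors
        on_margin = alphas > tol
    b = np.mean(y[on_margin] - X[on_margin] @ w)
    return alphas, w, b

# Toy points on a line; the closest pair (2, 2) and (0, 0) defines the margin
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])

alphas_toy, w_toy, b_toy = fit_dual_svm(X_toy, y_toy, C=1.0)
y_pred_toy = np.where(X_toy @ w_toy + b_toy < 0, -1.0, 1.0)
print("alphas =", alphas_toy.round(3), "W =", w_toy.round(3), "b =", round(float(b_toy), 3))
```

On the full training split the dual has one variable per passenger, so a general-purpose solver like this may be slow; the vectorized objective and the analytic gradient are what keep it tractable.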

In [88]:
# scikit-learn for comparison
svm_model = SVC(kernel='linear')  
svm_model.fit(X_train, Y_train)

# Make predictions on the test set
predictions = svm_model.predict(X_test)

# Evaluate the model (optional)
accuracy = svm_model.score(X_test, Y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8333333333333334
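
`StandardScaler` is imported in the setup cell but not used yet, and the features live on very different scales (`Fare` versus `Sex`, for instance). A hedged sketch of how scaling and the "look for bad predictions" step could be combined, using synthetic data from `make_classification` instead of the Titanic frame. The variable names and the inflated sixth column are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the six Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_redundant=0, random_state=0)
X_demo[:, 5] *= 100.0   # put one feature on a much larger scale, loosely mimicking Fare

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=420)

# The scaler is fitted on the training split only, then applied to the test split
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scaled_svm.fit(Xd_train, yd_train)

demo_pred = scaled_svm.predict(Xd_test)
demo_accuracy = scaled_svm.score(Xd_test, yd_test)
misclassified = np.flatnonzero(demo_pred != yd_test)   # positions of bad predictions
print(f"Accuracy: {demo_accuracy:.3f}, misclassified rows: {len(misclassified)}")
```

Indexing `Xd_test[misclassified]` then gives the rows to inspect by hand.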


#### Random Forest Model
##### To do:
- arono

##### Random forest (kaggle example)

In [5]:
from sklearn.ensemble import RandomForestClassifier
test_data = df_test
y = df_train["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(df_train[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('RF_kaggle_submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
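
The Kaggle example fits on the whole training set, so it gives no accuracy estimate to compare against the SVM. A minimal sketch of one way to get such an estimate with cross-validation, keeping the same hyperparameters. The twelve toy rows are made up for the illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy rows with the same columns as the Kaggle example
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3, 2, 3, 1, 2],
    "Sex":      ["male", "female", "female", "female", "male", "male",
                 "female", "male", "female", "male", "male", "female"],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 0, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 1],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]
X_rf = pd.get_dummies(toy[features])      # one-hot encodes the string column Sex
y_rf = toy["Survived"]

model_cv = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
scores = cross_val_score(model_cv, X_rf, y_rf, cv=3)   # accuracy on 3 held-out folds
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

With the real data, `df_train[features]` and `df_train["Survived"]` would take the place of the toy frame.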


#### Analysis of results

If we use PCA, a library can be used for it.
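
A short sketch of what library-based PCA might look like, assuming scikit-learn's `PCA` and standardized inputs. The synthetic six-column matrix stands in for the processed Titanic features, and the choice of two components is an assumption made for plotting convenience.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six observed columns driven by two hidden factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_pca_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# PCA is scale-sensitive, so standardize each column first
X_std = StandardScaler().fit_transform(X_pca_demo)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)           # coordinates in the first two components
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The two columns of `X_2d` can be scattered and colored by `Survived` to see how well the classes separate.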


#### Summary
To do:
- did we reach our goals?
- what is interesting about our solutions?
- what could be done better? limits and possible improvements