# Titanic Dataset from Kaggle
- https://www.kaggle.com/datasets/yasserh/titanic-dataset/data

### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

#### Data:
 - PassengerId: Passenger Id 
 - Survived: O: No 1=Yes   
 - Pclass: 1: 1st 2: 2nd 3 = 3rd
 - Name: Name of the Passenger
 - Sex: male or female
 - Age: Age in Years
 - SibSp: No. of siblings / spouses aboard the Titanic
 - Parch: No. of parents / children aboard the Titanic
 - Ticket: Ticket number
 - Fare: Passenger fare
 - Cabin: Cabin (Removed from final dataset)
 - Embarked: Port of Embarkation. C = Cherbourg, Q = Queenstown, S = Southampton U = Unknown


In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
# run  make requirements first

# https://pypi.org/project/kagglehub/

import kagglehub

# Download latest version
path_titanic_dataset = kagglehub.dataset_download("yasserh/titanic-dataset")

print("Path to dataset files:", path_titanic_dataset)

  from .autonotebook import tqdm as notebook_tqdm


Path to dataset files: /Users/leonardoomarbolanosrivera/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1


In [3]:
# Reading data and removing index col
df = pd.read_csv(f"{path_titanic_dataset}/Titanic-Dataset.csv")

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
# 891 rows and 12 cols
df.shape

(891, 12)

In [7]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [8]:

df.loc[df["Cabin"].isnull()].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


#### Cabing tiene demasiados NaN . Seleccionamos todas la columnas excepto Cabin

In [9]:
df=df[["PassengerId",	"Survived",	"Pclass",	"Name",	"Sex",	"Age",	"SibSp",	"Parch",	"Ticket",	"Fare",	"Embarked"]]

In [10]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


# En el caso de Age podemos usar la media para los datos nulos. 

In [11]:
df.loc[df["Age"].isnull()].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,Q


In [12]:
age_mean = df['Age'].mean()
age_mean
print("El valor de la Media de la columna es: %.2f" % age_mean)

El valor de la Media de la columna es: 29.70


In [13]:
#Asignación de Valores de Media a los Valores Nulos
df.loc[:, "Age"] =df.loc[:, 'Age'].fillna(age_mean)




In [14]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

# En el caso de Embarked  Podemos marcarlo con U para desconocido

In [15]:
df.loc[df["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,


In [16]:
df.loc[df["Embarked"].isnull(),"Embarked"]='U'


In [17]:
df.loc[df["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked


In [18]:
df.Embarked.value_counts()

Embarked
S    644
C    168
Q     77
U      2
Name: count, dtype: int64

# Funcion sigmoide 

$$ S(x)=\sigma(x) =  \frac{1}{1 + e^{-x}} $$

# Derivada de la funcion sigmoide
$$ \sigma'(x) =  \frac{e^{-x}}{(1 + e^{-x})^2} $$

# Otra forma de escribir la Derivada de la funcion sigmoide

$$ \sigma'(x) =  (\sigma(x))(1-\sigma(x)) $$

####  Tomado de https://interactivechaos.com/es/manual/tutorial-de-deep-learning/derivada-de-la-funcion-sigmoide

# La derivama mostrada arriba es lo mismo que 
$$ y' =  y(1 - y) $$

#### Tomado de https://es.wikipedia.org/wiki/Funci%C3%B3n_sigmoide


#### Estimacion

$$\hat{y} = h_{\Theta}(x) = \frac{1}{1 + e^{-wx + b}} $$


#### Calculo del error usando Cross Entropy


$$ J(w,b) = J({\Theta}) = \frac{1}{N} \Sigma_{i=1}^{n}  [y^{i}log(h_{\Theta}(x^{i})) + (1-y^{i})log(1-h_{\Theta}(x^{i}))]  $$

#### Derivada de J(w,b)

$$ 
J'({\Theta})=
\begin{bmatrix}
\frac{dJ}{dw} \\
\\
\frac{dJ}{db} 
\end{bmatrix}
=
\begin{bmatrix}
...
\end{bmatrix}
=
\begin{bmatrix}
\frac{1}{N}  \Sigma 2x_i(\hat{y}-y_i) \\
\\
\frac{1}{N}  \Sigma 2(\hat{y}-y_i)
\end{bmatrix}
$$


#### Markdown simbols cheatsheet.md  https://gist.github.com/LKS90/252ac41bd4a173be35b0
#### https://es.overleaf.com/learn/latex/Matrices

In [19]:
class LinearReg:

    def __init__(self, lr = 0.01, epochs = 100):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None


    def fit(self, X, y):

        #This is going to look like (150,1) where m is 150 and n 1
        m, n = X.shape

        #Generates a random of n values in a list of 1 rows
        self.weights = np.random.rand(n, 1)
        self.bias = np.random.rand(1)

        # convert a value like (150,) to (150,1)
        y = y.reshape(m, 1)

        losses = list()
        b_list = list()
        w_list = list()

        for epoch in range(self.epochs):

            # calculate prediction
            # y ^ = wx + b
            y_hat = np.dot(X, self.weights) + self.bias

            # get loss - L - J
            loss = np.mean((y - y_hat)**2) # MSE
            losses.append(loss)

            # calculate gradient
            dw = (-2 / m) * np.dot(X.T, (y - y_hat))
            db = (-2 / m) * np.sum((y - y_hat))

            # update params
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

            w_list.append(self.weights)
            b_list.append(self.bias)

            print(f"epoch: {epoch}, loss: {loss}, w: {self.weights}, b: {self.bias}")

        return self.weights, self.bias, losses, b_list, w_list
    
    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

In [20]:
class RegresionLogistica():
    def __init__(self, lr = 0.001, epochs = 1000,show=True):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None
        self.show=show

    def sigmoid(self,x):
        return (1 / (1 + np.exp(-x)))

    def fit(self, X, y):

         #This is going to look like (150,1) (n_samples, n_features)
        n_samples, n_features = X.shape

        #Generates a random of n values in a list of 1 rows
        self.weights = np.random.rand(n_features)
        self.bias = np.random.rand(1)

        losses = list()
        b_list = list()
        w_list = list()
        


        for epoch in range(self.epochs):
            # calculate prediction
            # y ^ = wx + b
            y_hat_linear = np.dot(X, self.weights) + self.bias
            # y^ = sigmoid(wx + b)
            y_hat = self.sigmoid(y_hat_linear)

            #Agregado epsilon por que se rompe si tienes  log(0) con "divide by zero encountered"
            eps = 1e-12  # avoid log(0) or log(1) issues
            loss = (1/n_samples)*np.sum(y * np.log(y_hat + eps) + (1-y) * np.log(1-y_hat + eps))  # Cross entropy
            losses.append(loss)

            # calculate gradient
            ## Derivada de J(w,b) Cross entropy
            # Derivada de w
            dw = (2/n_samples) * np.dot(X.T,(y_hat - y))
            # Derivada de b
            db = (2/n_samples) * np.sum((y_hat - y))

             # update params
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

            w_list.append(self.weights)
            b_list.append(self.bias)

            if (self.show == True):
                print(f"epoch: {epoch}, loss: {loss} , w: {self.weights}, b: {self.bias}")

    
    
    def predict(self, X):
        # calculate prediction
        # y ^ = wx + b
        y_hat_linear = np.dot(X, self.weights) + self.bias
        # y^ = sigmoid(wx + b)
        y_hat = self.sigmoid(y_hat_linear)

    
        prediccion=[ 1 if y > 0.5 else 0 for y in y_hat ]
        return prediccion


In [21]:
## TEMP
from sklearn import datasets
bc = datasets.load_breast_cancer()
X,y = bc.data,bc.target



In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1234)
X_train.shape

(455, 30)

In [23]:
model_clasificador = RegresionLogistica(lr=0.01)
model_clasificador.fit(X_train,y_train)
y_hat = model_clasificador.predict(X_test)

epoch: 0, loss: -10.141495662329186 , w: [  0.44787562  -0.08982265  -0.45275895  -6.51179875   0.21854227
   0.41892602   0.89610955   0.84344932   0.16714998   0.17983122
   0.82367836   0.65928396   0.95549369   0.35444494   0.2963374
   0.12691309   0.23760437   0.23351926   0.82002334   0.52409131
   0.54398312   0.47879549  -0.73544412 -10.20054758   0.5103756
   0.80811866   0.75681023   0.55592599   0.31833428   0.60523614], b: [0.42785163]
epoch: 1, loss: -17.48952545359836 , w: [ 0.60250169  0.13841207  0.54095138 -0.58798776  0.21971049  0.41992893
  0.89669454  0.84377476  0.16936041  0.18062457  0.82729126  0.67504063
  0.98082404  0.62440551  0.29642838  0.12718182  0.23793669  0.23364296
  0.82028383  0.52413649  0.71430619  0.77952253  0.37209039 -3.04804648
  0.51195597  0.81042496  0.75894508  0.55686969  0.32176703  0.60623896], b: [0.44051097]
epoch: 2, loss: -17.48952545359836 , w: [0.75712776 0.3666468  1.53466171 5.33582323 0.2208787  0.42093184
 0.89727954 0.844

  return (1 / (1 + np.exp(-x)))


In [24]:
def  accuracy(y_pred,y_test):
    return np.sum(y_pred==y_test)/len(y_test)

acc = accuracy(y_hat,y_test)
print(acc)

0.9210526315789473


## Trabajando con el dataset de titatic

# Variable a predecir
- y = Survived: O: No 1=Yes 

## Posibles caracteristicas 
 - Pclass: 1: 1st 2: 2nd 3 = 3rd
 - Sex: male or female
 - Age: Age in Years
 - SibSp: No. of siblings / spouses aboard the Titanic
 - Parch: No. of parents / children aboard the Titanic
 - Fare: Passenger fare
 - Embarked: Port of Embarkation. C = Cherbourg, Q = Queenstown, S = Southampton U = Unknown


# No usables o excluidos
- PassengerId: Passenger Id 
- Name: Name of the Passenger
- Ticket: Ticket number
- Cabin: Cabin (Removed from final dataset)




In [25]:
df=df[["PassengerId",	"Survived",	"Pclass",	"Name",	"Sex",	"Age",	"SibSp",	"Parch",	"Ticket",	"Fare",	"Embarked"]]

In [26]:
# Split trian (find parameters to minimize the cost function) and test 20 % 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)
X_train.shape

(455, 30)