Machine Learning Project - 6: **Titanic Survival Prediction using Logistic Regression**

**Load Data into Colab**

In [1]:
import pandas as pd

df = pd.read_csv("titanic.csv")

print(df.head())

   PassengerId  Pclass  Name     Sex  Age  SibSp  Parch  Ticket   Fare Cabin  \
0            1       3  John    male   22      1      0     A/5   7.25   NaN   
1            2       1  Anna  female   38      1      0  PC 175  71.28   C85   

  Embarked  Survived  
0        S         0  
1        C         1  


**Handling Missing Values:**

In [2]:
# Check for missing values
print(df.isnull().sum())

# Fill missing Age values with the median
# (assign the result back instead of calling fillna in place on a column selection)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Drop Cabin column (too many missing values)
df.drop(columns = ["Cabin"], inplace = True)

# Fill missing Embarked values with the most common value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])


PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
Survived       0
dtype: int64


**Convert Categorical Data into Numbers**


In [12]:
from sklearn.preprocessing import LabelEncoder

# Convert 'Sex' column into numbers (Male=1, Female=0)
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])

# Check if 'Embarked' column still exists before applying get_dummies
if 'Embarked' in df.columns:
    # Convert 'Embarked' column using One-Hot Encoding
    df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)
else:
    print("Embarked column is not found. It might have been processed already.")


print(df.head())

Embarked column is not found. It might have been processed already.
   PassengerId  Pclass  Name  Sex  Age  SibSp  Parch  Ticket   Fare  Survived  \
0            1       3  John    1   22      1      0     A/5   7.25         0   
1            2       1  Anna    0   38      1      0  PC 175  71.28         1   

   Embarked_1  
0        True  
1       False  


**Define Features (X) & Target (Y):**

In [None]:
x = df.drop(columns = ["PassengerId", "Name", "Ticket", "Survived"])  # Features
y = df["Survived"]  # Target variable

print(x.head())
print(y.head())


**Split Data for Training & Testing:**

In [14]:
from sklearn.model_selection import train_test_split

# Split the data: 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

print("Training Data:", x_train.shape, "Testing Data:", x_test.shape)

Training Data: (1, 7) Testing Data: (1, 7)
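
The split above is purely random. When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below uses made-up data (the `features` and `labels` names and the 30/70 class ratio are assumptions) to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 30% of them in the positive class
features = pd.DataFrame({"f": range(100)})
labels = pd.Series([1] * 30 + [0] * 70)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both parts keep the 30% share of positives
print(f_train.shape, f_test.shape)
print(l_train.mean(), l_test.mean())
```

Whether stratification is worth adding here depends on how uneven the `Survived` column is in the actual file.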


**Train the Logistic Regression Model:**

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the model and fit it on the training data
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

print("Model Trained Successfully")
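
For intuition about what the fitted model computes: binary logistic regression turns a linear score z = w · x + b into a probability with the sigmoid function 1 / (1 + e^(-z)). The sketch below checks this against `predict_proba` on synthetic data. The names `X_demo`, `y_demo` and `demo_model` and the generated data are assumptions for illustration, not part of this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Label is 1 when a noisy linear score is positive
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo @ np.array([1.5, -2.0, 0.5]) + noise > 0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps z to a probability
z = X_demo @ demo_model.coef_[0] + demo_model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with the class-1 probability reported by scikit-learn
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

`model.predict` then labels a row as 1 when this probability exceeds 0.5.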

**Make Predictions:**

In [None]:
y_pred = model.predict(x_test)

# Compare actual vs. predicted values
results = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(results.head())


**Evaluate Model Performance:**

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Model Accuracy: ", accuracy*100)
print("Confusion Matrix:", conf_matrix)
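
As a reading aid for the confusion matrix, the following sketch uses a handful of made-up labels (the `actual` and `predicted` lists are hypothetical, not output from this model) to show the layout scikit-learn uses and how accuracy follows from it.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, not model output from this notebook
actual    = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

demo_cm = confusion_matrix(actual, predicted)
# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = demo_cm.ravel()

# Accuracy is the share of predictions that fall on the diagonal
manual_accuracy = (tn + tp) / demo_cm.sum()
print(demo_cm)
print(manual_accuracy, accuracy_score(actual, predicted))
```

The off-diagonal cells separate the two kinds of mistakes, which a single accuracy figure hides.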



**Predict if a New Passenger will Survive or Not**

In [None]:
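# One value per feature, in the same order as x.columns
# (Pclass, Sex, Age, SibSp, Parch, Fare, then the Embarked dummy columns);
# the number of values must match the number of columns in x
new_passenger = pd.DataFrame([[1, 1, 30, 1, 0, 50, 1, 0]], columns=x.columns)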

# Predict Survival
prediction = model.predict(new_passenger)

if prediction[0] == 1:
    print("Passenger is likely to SURVIVE")
else:
    print("Passenger is likely to DIE")