# Assignment 1:
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There weren’t enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. While luck played some part in survival, some groups of people appear to have been more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.subplots as sp
import plotly.graph_objects as go

In [2]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3


In [3]:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from category_encoders import BinaryEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [4]:
# Data Preprocessing Libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from category_encoders import BinaryEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

# Machine Learning (classification models) Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector, SelectKBest, f_regression, RFE, SelectFromModel
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report, roc_curve, roc_auc_score
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

## Exploring Data

In [5]:
df_train = pd.read_csv('/content/train.csv')
df_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [6]:
# check the dataset shape
print("Number of Columns in Train data",df_train.shape[1])
print("---------------------------------------")
print("Number of Rows in Train data",df_train.shape[0])

Number of Columns in Train data 12
---------------------------------------
Number of Rows in Train data 891


In [7]:
df_test = pd.read_csv('/content/test.csv')
df_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [8]:
# check the dataset shape
print("Number of Columns in Test data",df_test.shape[1])
print("---------------------------------------")
print("Number of Rows in Test data",df_test.shape[0])

Number of Columns in Test data 11
---------------------------------------
Number of Rows in Test data 418


In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [10]:
# Drop PassengerId and Ticket from the training data: they are identifiers and not useful for prediction.
df_train = df_train.drop(['PassengerId', 'Ticket'], axis=1)

# Keep PassengerId in the test data so each prediction can still be matched to a passenger.
df_test = df_test.drop('Ticket', axis=1)

In [11]:
# Count the number of unique values in each column of the data
df_train.nunique()

Survived      2
Pclass        3
Name        891
Sex           2
Age          88
SibSp         7
Parch         7
Fare        248
Cabin       147
Embarked      3
dtype: int64

In [12]:
df_train.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S


In [13]:
# Descriptive analysis for categorical data
df_train.describe(include='O')

Unnamed: 0,Name,Sex,Cabin,Embarked
count,891,891,204,889
unique,891,2,147,3
top,"Braund, Mr. Owen Harris",male,B96 B98,S
freq,1,577,4,644


In [14]:
# Descriptive analysis for numerical data
df_train.describe().style.background_gradient()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


## Data Visualization

In [15]:
fig = px.pie(df_train, names='Survived',
             title='Survived Distribution',
             color_discrete_sequence=px.colors.sequential.Mint_r,
             template='plotly_white'
            )

fig.update_traces(textposition='inside',textinfo='percent+label')

fig.show()

In [16]:
fig = px.histogram(df_train, x='Age',  title='Age Distribution',
                   marginal='box', color_discrete_sequence=['#429ea8'],
                   template='plotly_white'
                   )

# Customizing the layout of the histogram
fig.update_layout(
    xaxis=dict(tickmode='linear', dtick=5),  # Adjusting x-axis tick settings
    bargap=0.1  # Setting the gap between bars
)

fig.show()

In [17]:
fig = px.histogram(df_train, x='Sex', color='Survived',
             title='Survival by Sex',
             color_discrete_map={0: '#eb3134', 1: '#10c2de'},
             barmode='group', template='plotly_white', text_auto=True
)

fig.show()

In [18]:
fig = px.histogram(df_train, x='Pclass', color='Survived',
             title='Survival by Pclass',
             color_discrete_map={0: '#eb3134', 1: '#10c2de'},
             barmode='group', template='plotly_white', text_auto=True
)

fig.show()

## Data Preprocessing

In [19]:
df_train['title_name'] = df_train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

df_test['title_name'] = df_test['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

In [20]:
df_train['title_name'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)

In [21]:
df_test['title_name'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Ms', 'Col', 'Rev', 'Dr', 'Dona'],
      dtype=object)

In [22]:
def categorize_titles(title):
    if title in ['Mr', 'Mrs', 'Miss', 'Master']:
        return title
    else:
        return 'Other'

# Applying the function to df_train title_name column
df_train['title_name'] = df_train['title_name'].apply(categorize_titles)

df_test['title_name'] = df_test['title_name'].apply(categorize_titles)
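The title extraction and grouping in the cells above can also be written with vectorized string methods. The following is a minimal sketch on a few hypothetical sample names (the `names` Series and the `common` list are introduced here for illustration only); it assumes every name follows the "Surname, Title. Given names" pattern seen in the data.

```python
import pandas as pd

# Hypothetical sample in the same "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the four common titles and map everything else to 'Other'.
common = ["Mr", "Mrs", "Miss", "Master"]
titles_grouped = titles.where(titles.isin(common), "Other")

print(titles_grouped.tolist())
```

Because the pattern returns NaN for names that do not match, malformed rows surface as missing values instead of raising an `IndexError` as the `split`-based lambda would.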

In [23]:
# Removing the 'Name' column as full names aren't needed for building the model
df_train = df_train.drop('Name', axis=1)

df_test = df_test.drop('Name', axis=1)

In [24]:
df_train['family_size'] = df_train['SibSp'] + df_train['Parch']

df_test['family_size'] = df_test['SibSp'] + df_test['Parch']

In [25]:
df_train['family_size'].unique()

array([ 1,  0,  4,  2,  6,  5,  3,  7, 10])

In [26]:
df_test['family_size'].unique()

array([ 0,  1,  2,  4,  3,  5,  7,  6, 10])

In [27]:
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,title_name,family_size
0,0,3,male,22.0,1,0,7.25,,S,Mr,1
1,1,1,female,38.0,1,0,71.2833,C85,C,Mrs,1
2,1,3,female,26.0,0,0,7.925,,S,Miss,0
3,1,1,female,35.0,1,0,53.1,C123,S,Mrs,1
4,0,3,male,35.0,0,0,8.05,,S,Mr,0


In [28]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,title_name,family_size
0,892,3,male,34.5,0,0,7.8292,,Q,Mr,0
1,893,3,female,47.0,1,0,7.0,,S,Mrs,1
2,894,2,male,62.0,0,0,9.6875,,Q,Mr,0
3,895,3,male,27.0,0,0,8.6625,,S,Mr,0
4,896,3,female,22.0,1,1,12.2875,,S,Mrs,2


### Handling Missing Data

In [29]:
# checking for missing values in data
df_train.isna().sum() / df_train.shape[0]*100

Survived        0.000000
Pclass          0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
title_name      0.000000
family_size     0.000000
dtype: float64

In [30]:
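# Sort the DataFrame by 'title_name' (this only reorders rows; it does not affect the imputed values)
df_train.sort_values(by='title_name', inplace=True)

# Extract the 'Age' column as a 2D array for imputation
age_data = df_train['Age'].values.reshape(-1, 1)

# Initialize KNN imputer with k=5 (number of nearest neighbors).
# Note: Age is the only input column, so there is no other feature to measure
# distance on and each missing age is filled with the column mean.
imputer = KNNImputer(n_neighbors=5)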

# Perform imputation on the 'Age' column
df_train['Age'] = imputer.fit_transform(age_data)
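When `KNNImputer` receives a single column, a row with a missing value has no observed feature in common with any other row, so scikit-learn falls back to the column mean. The sketch below illustrates this on hypothetical toy data and shows one possible alternative, a per-title median, which is an assumption about what the sort by `title_name` was meant to achieve and not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with the same two columns as the notebook.
toy = pd.DataFrame({
    "title_name": ["Master", "Master", "Mr", "Mr", "Mr", "Mrs"],
    "Age": [4.0, np.nan, 30.0, 40.0, np.nan, 35.0],
})

# Single-column KNN imputation: every gap receives the mean of the observed ages.
knn_age = KNNImputer(n_neighbors=5).fit_transform(toy[["Age"]]).ravel()

# Alternative sketch: fill each gap with the median age of the passenger's title group.
group_median = toy.groupby("title_name")["Age"].transform("median")
group_age = toy["Age"].fillna(group_median)

print(knn_age)
print(group_age.tolist())
```

In the toy data the missing "Master" age becomes 27.25 under single-column KNN but 4.0 under the per-title median, which shows how much the two approaches can differ for children.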

In [31]:
# Dropping the 'Cabin' column due to its significant missing values of 77%.
df_train = df_train.drop('Cabin', axis=1)

In [32]:
# Initialize the SimpleImputer with most frequent strategy
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Reshape the column for imputation (required for 1D arrays)
imputed_column = imputer.fit_transform(df_train['Embarked'].values.reshape(-1, 1))

# Flatten the 2D imputed column to 1D
imputed_column = imputed_column.flatten()

# Replace the original column with the imputed values
df_train['Embarked'] = imputed_column

In [33]:
df_train.isna().sum() / df_train.shape[0]*100

Survived       0.0
Pclass         0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Fare           0.0
Embarked       0.0
title_name     0.0
family_size    0.0
dtype: float64

In [34]:
df_test.isna().sum() / df_test.shape[0]*100

PassengerId     0.000000
Pclass          0.000000
Sex             0.000000
Age            20.574163
SibSp           0.000000
Parch           0.000000
Fare            0.239234
Cabin          78.229665
Embarked        0.000000
title_name      0.000000
family_size     0.000000
dtype: float64

In [35]:
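# Sort the DataFrame by 'title_name' (this only reorders rows; it does not affect the imputed values)
df_test.sort_values(by='title_name', inplace=True)

# Extract the 'Age' column as a 2D array for imputation
age_data = df_test['Age'].values.reshape(-1, 1)

# Initialize KNN imputer with k=5 (number of nearest neighbors).
# Note: this imputer is fit on the test data itself, and with Age as the only
# input column each missing age is filled with the test-set mean.
imputer = KNNImputer(n_neighbors=5)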

# Perform imputation on the 'Age' column
df_test['Age'] = imputer.fit_transform(age_data)

In [36]:
# Dropping the 'Cabin' column due to its significant missing values of 78%.
df_test = df_test.drop('Cabin', axis=1)

In [37]:
# Initialize the SimpleImputer with median strategy
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Reshape the column for imputation (required for 1D arrays)
imputed_column = imputer.fit_transform(df_test['Fare'].values.reshape(-1, 1))

# Replace the original column with the imputed values
df_test['Fare'] = imputed_column
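The cells above fit separate imputers on the test data. A common alternative is to learn the fill value from the training data and reuse it on the test data, so both sets are treated identically. The following is a minimal sketch with hypothetical toy frames (`train_demo`, `test_demo`), not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for df_train and df_test.
train_demo = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10]})
test_demo = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Learn the median from the training data only.
fare_imputer = SimpleImputer(strategy="median").fit(train_demo[["Fare"]])

# Apply the learned median to the test data; ravel() flattens the (n, 1) result.
test_demo["Fare"] = fare_imputer.transform(test_demo[["Fare"]]).ravel()

print(test_demo)
```

Passing a one-column DataFrame (`[["Fare"]]`) gives the imputer the 2D input it expects without the manual `reshape(-1, 1)`.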

In [38]:
df_test.isna().sum() / df_test.shape[0]*100

PassengerId    0.0
Pclass         0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Fare           0.0
Embarked       0.0
title_name     0.0
family_size    0.0
dtype: float64

### Handling Categorical Data

For train.csv

In [39]:
# One-hot encode the nominal features with the pandas `get_dummies` function.
df_train = pd.get_dummies(df_train, columns=['Sex', 'title_name', 'Embarked'])

encoded = list(df_train.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

17 total features after one-hot encoding.


In [40]:
df_train.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,family_size,Sex_female,Sex_male,title_name_Master,title_name_Miss,title_name_Mr,title_name_Mrs,title_name_Other,Embarked_C,Embarked_Q,Embarked_S
445,1,1,4.0,0,2,81.8583,2,0,1,1,0,0,0,0,0,0,1
386,0,3,1.0,5,2,46.9,7,0,1,1,0,0,0,0,0,0,1
50,0,3,7.0,4,1,39.6875,5,0,1,1,0,0,0,0,0,0,1
59,0,3,11.0,5,2,46.9,7,0,1,1,0,0,0,0,0,0,1
348,1,3,3.0,1,1,15.9,2,0,1,1,0,0,0,0,0,0,1


For test.csv

In [41]:
# One-hot encode the nominal features with the pandas `get_dummies` function.
df_test = pd.get_dummies(df_test, columns=['Sex', 'title_name', 'Embarked'])

encoded = list(df_test.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

17 total features after one-hot encoding.
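Here both files happen to produce the same dummy columns, but encoding the two frames separately can yield mismatched columns whenever a category is missing from one of them. A minimal sketch of one way to guard against this, using hypothetical toy frames and `DataFrame.reindex`:

```python
import pandas as pd

# Hypothetical toy frames; the test frame has no passenger embarked at 'Q'.
train_demo = pd.DataFrame({"Fare": [7.25, 71.28, 7.75], "Embarked": ["S", "C", "Q"]})
test_demo = pd.DataFrame({"Fare": [7.83, 7.00], "Embarked": ["S", "C"]})

train_enc = pd.get_dummies(train_demo, columns=["Embarked"], dtype=int)
test_enc = pd.get_dummies(test_demo, columns=["Embarked"], dtype=int)

# Align the test columns to the training columns; absent dummies are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())
```

With the real data, the reindex target would be the training feature columns (without `Survived`), and `PassengerId` would need to be set aside before alignment.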


In [42]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,family_size,Sex_female,Sex_male,title_name_Master,title_name_Miss,title_name_Mr,title_name_Mrs,title_name_Other,Embarked_C,Embarked_Q,Embarked_S
417,1309,3,30.27259,1,1,22.3583,2,0,1,1,0,0,0,0,1,0,0
89,981,2,2.0,1,1,23.0,2,0,1,1,0,0,0,0,0,0,1
80,972,3,6.0,1,1,15.2458,2,0,1,1,0,0,0,0,1,0,0
154,1046,3,13.0,4,2,31.3875,6,0,1,1,0,0,0,0,0,0,1
64,956,1,13.0,2,2,262.375,4,0,1,1,0,0,0,0,1,0,0


### Data Training

In [43]:
# First we extract the features X and the label y
X = df_train.drop('Survived',axis=1)
y = df_train['Survived']

In [44]:
X.shape, y.shape

((891, 16), (891,))

In [45]:
# Then we split the data into training and testing sets (80% training, 20% testing), stratified on y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 712 samples.
Testing set has 179 samples.


### Handling Imbalanced Data

In [46]:
y_train.value_counts()

0    439
1    273
Name: Survived, dtype: int64

In [47]:
sm = SMOTETomek(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [48]:
y_train.value_counts()

1    412
0    412
Name: Survived, dtype: int64

In [50]:
numerical_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'family_size']

# Creating a RobustScaler instance
scaler = RobustScaler()

# Fitting the RobustScaler on the training data
scaler.fit(X_train[numerical_features])

# Transforming (scaling) the continuous features in the training and testing data
X_train_cont_scaled = scaler.transform(X_train[numerical_features])
X_test_cont_scaled = scaler.transform(X_test[numerical_features])

# Replacing the scaled continuous features in the original data
X_train[numerical_features] = X_train_cont_scaled
X_test[numerical_features] = X_test_cont_scaled

X_train

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,family_size,Sex_female,Sex_male,title_name_Master,title_name_Miss,title_name_Mr,title_name_Mrs,title_name_Other,Embarked_C,Embarked_Q,Embarked_S
0,-0.8,0.358407,0.0,0.0,-0.175882,0.0,1,0,0,0,0,1,0,0,0,1
1,0.0,1.316740,0.0,0.0,-0.320640,0.0,0,1,0,0,1,0,0,1,0,0
2,-1.6,0.858407,0.0,0.0,-0.639992,0.0,0,1,0,0,1,0,0,0,0,1
3,0.0,0.000000,0.0,0.0,-0.320455,0.0,1,0,0,0,0,1,0,1,0,0
4,0.0,0.525074,1.0,0.0,0.129104,1.0,1,0,0,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
819,0.0,0.000000,0.0,0.0,-0.297435,0.0,1,0,0,1,0,0,0,0,1,0
820,-1.6,0.578129,1.0,0.0,3.043954,1.0,1,0,0,0,0,0,0,0,0,0
821,0.0,-2.300635,0.0,0.0,-0.096620,0.0,1,0,0,1,0,0,0,0,0,1
822,0.0,-2.138481,0.0,0.0,-0.145675,1.0,0,0,0,0,0,0,0,0,0,1
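With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers such as very high fares. A small worked sketch on hypothetical values confirms the formula:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual computation: (x - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

Here the median is 3 and the interquartile range is 2, so the outlier 100 maps to 48.5 while the other values stay within [-1, 0.5].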


In [51]:
# Create a heatmap using Plotly
heatmap_data = df_train.corr().values.tolist()

fig = go.Figure(data=go.Heatmap(z=heatmap_data, x=df_train.columns, y=df_train.columns, colorscale='Viridis'))

# Update layout
fig.update_layout(title='Correlation Heatmap',
                  xaxis_title='Features',
                  yaxis_title='Features',
                  template='plotly_white')

# Show the figure
fig.show()

## Model Training and Evaluation

In [52]:
# List of classifiers to evaluate
classifiers = [
    ("Logistic Regression", LogisticRegression(random_state=42, max_iter= 1500, n_jobs=-1)),
    ("KNN", KNeighborsClassifier(n_neighbors=5, n_jobs=-1)),
    ("Gaussian Naive Bayes", GaussianNB()),
    ("SVC", SVC(random_state=42, probability=True)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42, n_jobs =-1)),
    ("AdaBoost", AdaBoostClassifier(random_state=42)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
    ("XGBoost", xgb.XGBClassifier(random_state=42, n_jobs =-1))
]

In [53]:
# Creating lists for classifier names, mean_test_f1_scores, cross_val_errors, and results.
results = []
mean_test_f1_scores = []
cross_val_errors = []
classifier_names = []

for model_name, model in classifiers:

    # 5-fold Stratified Cross-Validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Perform cross-validation with train scores
    cv_results = cross_validate(model, X_train, y_train, cv=cv, scoring='f1', n_jobs=-1, return_train_score=True)

    # Calculate cross-validation error
    cross_val_error = 1 - np.mean(cv_results['test_score'])

    # Append results to the list
    results.append({
        "Model Name": model_name,
        "Mean Train F1 Score": np.mean(cv_results['train_score']),
        "Mean Test F1 Score": np.mean(cv_results['test_score']),
        "Cross-Validation Error": cross_val_error
    })

    mean_test_f1_scores.append(np.mean(cv_results['test_score']))
    cross_val_errors.append(cross_val_error)
    classifier_names.append(model_name)

# Create a DataFrame from the results list
results_df = pd.DataFrame(results)

# Display the DataFrame
display(results_df)

Unnamed: 0,Model Name,Mean Train F1 Score,Mean Test F1 Score,Cross-Validation Error
0,Logistic Regression,0.857226,0.851671,0.148329
1,KNN,0.903198,0.87814,0.12186
2,Gaussian Naive Bayes,0.847636,0.841842,0.158158
3,SVC,0.861446,0.855307,0.144693
4,Decision Tree,0.988066,0.834203,0.165797
5,Random Forest,0.988121,0.869302,0.130698
6,AdaBoost,0.873123,0.855445,0.144555
7,Gradient Boosting,0.925864,0.87024,0.12976
8,XGBoost,0.983827,0.867469,0.132531
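To read the table more easily, the models can be ranked by mean test F1 and the gap between train and test scores can be added as a rough overfitting indicator. The sketch below re-enters the figures from the table above by hand; the `Train-Test Gap` column name is introduced here for illustration.

```python
import pandas as pd

# Scores copied from the results table above.
results_df = pd.DataFrame({
    "Model Name": ["Logistic Regression", "KNN", "Gaussian Naive Bayes", "SVC",
                   "Decision Tree", "Random Forest", "AdaBoost",
                   "Gradient Boosting", "XGBoost"],
    "Mean Train F1 Score": [0.857226, 0.903198, 0.847636, 0.861446, 0.988066,
                            0.988121, 0.873123, 0.925864, 0.983827],
    "Mean Test F1 Score": [0.851671, 0.87814, 0.841842, 0.855307, 0.834203,
                           0.869302, 0.855445, 0.87024, 0.867469],
})

# A large gap between train and test F1 hints at overfitting.
results_df["Train-Test Gap"] = (
    results_df["Mean Train F1 Score"] - results_df["Mean Test F1 Score"]
)

ranked = results_df.sort_values("Mean Test F1 Score", ascending=False).reset_index(drop=True)
print(ranked)
```

On these figures KNN ranks first on mean test F1, while the Decision Tree shows the widest train-test gap (about 0.154), which suggests it overfits the resampled training data.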


In [54]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the data
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')

# Perform data preprocessing and feature engineering
def preprocess_data(data):
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()

    # Fill missing values (assign the result back instead of using a chained
    # inplace fillna, which newer pandas versions warn about)
    data['Age'] = data['Age'].fillna(data['Age'].median())
    data['Fare'] = data['Fare'].fillna(data['Fare'].median())
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

    # Convert categorical variables
    data['Sex'] = data['Sex'].map({'female': 0, 'male': 1})
    data = pd.get_dummies(data, columns=['Embarked'])

    return data

train_data = preprocess_data(train_data)
test_data = preprocess_data(test_data)

# Select relevant features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data[features], train_data['Survived'], test_size=0.2, random_state=42)

# Build and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Accuracy on the validation set: {accuracy}')

# Make predictions on the test set
predictions = model.predict(test_data[features])

# Prepare predictions for submission
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})
submission.to_csv('submission.csv', index=False)


Accuracy on the validation set: 0.8044692737430168



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
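The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch that applies both through a scikit-learn pipeline is shown below. It uses synthetic data from `make_classification` as a hypothetical stand-in, so the accuracy it prints says nothing about the Titanic model.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is inflated to mimic unscaled columns such as Fare.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_demo[:, 0] *= 500.0

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale first, then fit logistic regression with a higher iteration limit.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Turn a convergence warning into an error so a failure to converge is not missed.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    clf.fit(X_tr, y_tr)

val_accuracy = clf.score(X_va, y_va)
print(f"Accuracy on the validation set: {val_accuracy}")
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically when calling `predict` on new data.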

