#### Data Acquisition and Exploration:

Download the Titanic dataset from Kaggle: Titanic - Machine Learning from Disaster.

Use pandas to import and explore the data.

Understand the data structure, identify missing values, and analyze the distribution of features such as age, sex, passenger class, and fare.

In [160]:
import pandas as pd

titanic_data = pd.read_csv("tested.csv")
titanic_data.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,0,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,1,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,0,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,1,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,0,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [161]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [162]:
titanic_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
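Raw counts are easier to judge as a share of the rows: with 418 rows, the 327 missing `Cabin` values are roughly 78% of the column, while the 86 missing `Age` values are roughly 21%. The sketch below shows one way to tabulate this. The helper name `missing_summary` and the small `demo` frame are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny stand-in frame with made-up gaps (hypothetical values)
demo = pd.DataFrame({
    "Age": [34.5, None, 62.0, None],
    "Fare": [7.8292, 7.0, None, 8.6625],
    "Sex": ["male", "female", "male", "male"],
})
print(missing_summary(demo))
```

Calling `missing_summary(titanic_data)` in the notebook should list only `Cabin`, `Age`, and `Fare`, since the other columns have no missing values.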

In [163]:
titanic_data.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,0.363636,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.481622,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,0.0,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,0.0,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,1.0,3.0,39.0,1.0,0.0,31.5
max,1309.0,1.0,3.0,76.0,8.0,9.0,512.3292
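`describe()` covers only the numeric columns, so the distribution of categorical features such as `Sex` needs a separate look. A minimal sketch, assuming a small hypothetical `demo` frame in place of `titanic_data`; the age bin edges and labels are arbitrary choices made for illustration.

```python
import pandas as pd

# Stand-in rows (hypothetical subset of columns)
demo = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Pclass": [3, 3, 2, 3, 1],
    "Age": [34.5, 47.0, 62.0, None, 22.0],
})

# Share of each category in a text column
sex_share = demo["Sex"].value_counts(normalize=True)

# Raw counts per passenger class, ordered by class number
class_counts = demo["Pclass"].value_counts().sort_index()

# Bucket ages into coarse groups; rows with a missing age stay NaN and are not counted
age_groups = pd.cut(demo["Age"], bins=[0, 18, 40, 80], labels=["child", "adult", "senior"])
age_counts = age_groups.value_counts().sort_index()

print(sex_share)
print(class_counts)
print(age_counts)
```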


#### Data Cleaning:

Handle missing values in a suitable manner (e.g., imputation, deletion).

Encode categorical variables (e.g., sex, embarked port) into numerical representations for machine learning models.

Consider feature engineering to create new features from existing ones (e.g., family size from the siblings/spouses and parents/children counts).

In [164]:
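# Handle missing values in a suitable manner (e.g., imputation, deletion)
# dropna() removes every row that has a missing value in any column.
# Because Cabin is mostly empty, only 87 of the 418 rows remain.

cleaned_titanic_data = titanic_data.dropna()
cleaned_titanic_data.isnull().sum()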

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
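Dropping rows is the simplest option, but here it discards most of the data. An alternative worth considering is to drop the sparse `Cabin` column and fill the remaining gaps instead. This is a sketch under that assumption, not the approach the notebook takes. The `demo` frame is a hypothetical stand-in, and the median is one reasonable fill value among several.

```python
import pandas as pd

# Stand-in rows with gaps similar in kind to the real data (hypothetical values)
demo = pd.DataFrame({
    "Age": [34.5, None, 62.0, 27.0],
    "Fare": [7.8292, 7.0, None, 8.6625],
    "Cabin": [None, None, "B45", None],
    "Embarked": ["Q", "S", "Q", "S"],
})

# Drop the column that is mostly empty rather than the rows that lack it
imputed = demo.drop(columns=["Cabin"])

# Fill the numeric gaps with each column's median
imputed = imputed.fillna({
    "Age": imputed["Age"].median(),
    "Fare": imputed["Fare"].median(),
})

print(imputed)
print(imputed.isnull().sum())
```

If this route is taken, computing the medians on the training split only keeps information from the test rows out of the fill values.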

In [165]:
# Encode categorical variables (e.g., sex, embarked port) into numerical representations for machine learning models.

encode_sex = {'male': 0, 'female': 1}
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
cleaned_titanic_data = cleaned_titanic_data.assign(Sex=cleaned_titanic_data['Sex'].map(encode_sex))
cleaned_titanic_data['Sex']


12     1
14     1
24     1
26     1
28     0
      ..
404    0
405    0
407    0
411    1
414    1
Name: Sex, Length: 87, dtype: int64

In [166]:
# The integer codes are arbitrary labels for the three ports
encode_port = {'S': 1, 'Q': 2, 'C': 3}
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
cleaned_titanic_data = cleaned_titanic_data.assign(Embarked=cleaned_titanic_data['Embarked'].map(encode_port))
cleaned_titanic_data['Embarked']


12     1
14     1
24     3
26     3
28     1
      ..
404    3
405    3
407    3
411    2
414    3
Name: Embarked, Length: 87, dtype: int64
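Mapping ports to 1, 2, 3 implies an ordering (S < Q < C) that the ports do not have, and linear models such as logistic regression treat those codes as magnitudes. One-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a hypothetical `demo` frame and is not what the notebook does.

```python
import pandas as pd

# Stand-in column with the three port codes (hypothetical rows)
demo = pd.DataFrame({"Embarked": ["S", "Q", "C", "S"]})

# One indicator column per port; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(demo, columns=["Embarked"], prefix="Embarked", dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so no port is treated as larger than another.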

In [167]:
# Consider feature engineering to create new features from existing ones (e.g., family size based on siblings/spouses).
# Use the assign method to create a new feature (Family_size)
cleaned_titanic_data = cleaned_titanic_data.assign(Family_size=cleaned_titanic_data['SibSp'] + cleaned_titanic_data['Parch'])
cleaned_titanic_data.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_size
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",1,23.0,1,0,21228,82.2667,B45,1,1
14,906,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",1,47.0,1,0,W.E.P. 5734,61.175,E31,1,1
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",1,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,3,4
26,918,1,1,"Ostby, Miss. Helene Ragnhild",1,22.0,0,1,113509,61.9792,B36,3,1
28,920,0,1,"Brady, Mr. John Bertram",0,41.0,0,0,113054,30.5,A21,1,0
34,926,0,1,"Mock, Mr. Philipp Edmund",0,30.0,1,0,13236,57.75,C78,3,1
44,936,1,1,"Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)",1,45.0,1,0,11753,52.5542,D19,1,1
46,938,0,1,"Chevre, Mr. Paul Romaine",0,45.0,0,0,PC 17594,29.7,A9,3,0
48,940,1,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",1,60.0,0,0,11813,76.2917,D15,3,0
50,942,0,1,"Smith, Mr. Lucien Philip",0,24.0,1,0,13695,60.0,C31,1,1
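Other engineered features follow the same `assign` pattern. As an illustration only, the sketch below derives a title from the `Name` column and a flag for passengers travelling alone. The column names `Title` and `Is_alone` are hypothetical and do not appear in the notebook, and the regular expression assumes names follow the "Surname, Title. Given names" layout seen in the sample rows.

```python
import pandas as pd

# Stand-in rows copied in shape from the sample output
demo = pd.DataFrame({
    "Name": [
        "Kelly, Mr. James",
        "Wilkes, Mrs. James (Ellen Needs)",
        "Connolly, Miss. Kate",
    ],
    "SibSp": [0, 1, 0],
    "Parch": [0, 0, 0],
})

demo = demo.assign(
    # Text between the comma and the first period, e.g. "Mr", "Mrs", "Miss"
    Title=demo["Name"].str.extract(r",\s*([^.]+)\.", expand=False),
    # 1 when the passenger has no relatives aboard, otherwise 0
    Is_alone=((demo["SibSp"] + demo["Parch"]) == 0).astype(int),
)

print(demo[["Title", "Is_alone"]])
```

A text feature like `Title` would still need to be encoded numerically before being passed to a model.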


#### Data Splitting:

Split the cleaned data into training and testing sets using train_test_split from scikit-learn.

The training set will be used to train the models, and the testing set will be used for unbiased evaluation.

In [184]:
from sklearn.model_selection import train_test_split


x = cleaned_titanic_data.drop(['Survived', 'Name', 'Ticket', 'Cabin'], axis = 1)
y = cleaned_titanic_data['Survived']

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.4, random_state = 42)
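With only 87 rows left after cleaning, a purely random split can leave the two sets with different shares of survivors. Passing `stratify` to `train_test_split` keeps the class proportions the same in both sets. A minimal sketch on hypothetical stand-in data (`demo_x`, `demo_y`), using the same `test_size` and `random_state` as the split in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 20 rows, 25% positive class (hypothetical)
demo_x = pd.DataFrame({"feature": range(20)})
demo_y = pd.Series([0] * 15 + [1] * 5, name="Survived")

# stratify=demo_y preserves the 75/25 class ratio in both splits
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.4, random_state=42, stratify=demo_y
)

print("train positive share:", demo_y_train.mean())
print("test positive share:", demo_y_test.mean())
```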

#### Model Building and Comparison:

Implement and train the following classification models from scikit-learn:

Logistic Regression

Decision Tree Classifier

Naive Bayes Classifier

Support Vector Machine (SVM)

In [169]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression

reg = LogisticRegression()
reg.fit(x_train, y_train)





In [170]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(x_train, y_train)




In [171]:
# Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(x_train, y_train)



In [172]:
# Support Vector Machine (SVM)

from sklearn.svm import SVC
svm = SVC()
svm.fit(x_train, y_train)



#### Model Evaluation:

Evaluate the performance of each model on the testing set using metrics like:

Accuracy: Proportion of correct predictions.

F1-score: Harmonic mean of precision and recall.

Precision: Ratio of true positives to all predicted positives.

Recall: Ratio of true positives to all actual positives.

Create a pandas DataFrame to display the performance metrics for each model side-by-side for easy comparison.
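The four definitions can be checked by hand from confusion-matrix counts. The sketch below is a plain-Python illustration. The function name `classification_metrics` and the counts are hypothetical, chosen to describe a classifier that labels every one of 35 passengers as a survivor when only 14 survived, which gives the same pattern as the SVC figures reported in this notebook (accuracy 0.4, precision 0.4, recall 1.0).

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted or actually positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: every passenger predicted as a survivor
m = classification_metrics(tp=14, fp=21, fn=0, tn=0)
print({name: round(value, 3) for name, value in m.items()})
```

Recall is perfect because no survivor is missed, but precision and accuracy fall to the share of actual survivors, and F1 = 2 · 0.4 · 1.0 / (0.4 + 1.0) ≈ 0.571.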

In [190]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [191]:
# Evaluate logistic regression on the test set
y_pred = reg.predict(x_test)

# Accuracy
Accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', Accuracy)

# F1
F1 = f1_score(y_test, y_pred)
print('F1 Score: ', F1)

# Precision
Precision = precision_score(y_test, y_pred)
print('Precision: ', Precision)

# Recall
Recall = recall_score(y_test, y_pred)
print('Recall: ', Recall)

Accuracy:  1.0
F1 Score:  1.0
Precision:  1.0
Recall:  1.0


In [192]:
# Evaluate the decision tree on the test set
y_pred = dtree.predict(x_test)

# Accuracy
Accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', Accuracy)

# F1
F1 = f1_score(y_test, y_pred)
print('F1 Score: ', F1)

# Precision
Precision = precision_score(y_test, y_pred)
print('Precision: ', Precision)

# Recall
Recall = recall_score(y_test, y_pred)
print('Recall: ', Recall)

Accuracy:  1.0
F1 Score:  1.0
Precision:  1.0
Recall:  1.0


In [193]:
# Evaluate Gaussian naive Bayes on the test set
y_pred = gnb.predict(x_test)

# Accuracy
Accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', Accuracy)

# F1
F1 = f1_score(y_test, y_pred)
print('F1 Score: ', F1)

# Precision
Precision = precision_score(y_test, y_pred)
print('Precision: ', Precision)

# Recall
Recall = recall_score(y_test, y_pred)
print('Recall: ', Recall)

Accuracy:  1.0
F1 Score:  1.0
Precision:  1.0
Recall:  1.0


In [194]:
# Evaluate the SVM on the test set
y_pred = svm.predict(x_test)

# Accuracy
Accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', Accuracy)

# F1 (the built-in round() works whether f1_score returns a Python or NumPy float)
F1 = f1_score(y_test, y_pred)
print('F1 Score: ', round(F1, 3))

# Precision
Precision = precision_score(y_test, y_pred)
print('Precision: ', Precision)

# Recall
Recall = recall_score(y_test, y_pred)
print('Recall: ', Recall)

Accuracy:  0.4
F1 Score:  0.571
Precision:  0.4
Recall:  1.0
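A recall of 1.0 with precision equal to accuracy suggests the SVM is predicting the positive class for every test row. One possible cause, offered here as an assumption rather than a diagnosis, is that `SVC` with its default RBF kernel is sensitive to feature scale, and the feature matrix mixes large values such as `PassengerId` and `Fare` with 0/1 codes. The sketch below shows how scaling could be added with a pipeline. It runs on synthetic data from `make_classification`, not on the Titanic features, and makes no claim about the score it would reach in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical); inflate one column to mimic an ID-like feature
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X[:, 0] = X[:, 0] * 1000 + 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# The scaler is fitted on the training data only, then applied before the SVM
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

score = scaled_svm.score(X_te, y_te)
print("Test accuracy with scaling:", round(score, 3))
```

In the notebook, the equivalent would be fitting the pipeline on `x_train` and `y_train` in place of the bare `SVC()`.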


In [208]:
# Collect the metrics for every model in one table
model_df = {
    'Model': [],
    'Accuracy': [],
    'F1 Score': [],
    'Precision': [],
    'Recall': []
}
for model in [reg, dtree, gnb, svm]:
    y_pred = model.predict(x_test)
    model_df['Model'].append(model.__class__.__name__)
    model_df['Accuracy'].append(accuracy_score(y_test, y_pred))
    # Round to 3 decimals so the table matches the per-model output
    model_df['F1 Score'].append(round(f1_score(y_test, y_pred), 3))
    model_df['Precision'].append(precision_score(y_test, y_pred))
    model_df['Recall'].append(recall_score(y_test, y_pred))

metrics = pd.DataFrame(model_df)
print(metrics)



                    Model  Accuracy  F1 Score  Precision  Recall
0      LogisticRegression       1.0     1.000        1.0     1.0
1  DecisionTreeClassifier       1.0     1.000        1.0     1.0
2              GaussianNB       1.0     1.000        1.0     1.0
3                     SVC       0.4     0.571        0.4     1.0
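Perfect scores from three different models are unusual and worth a sanity check. In the ten rows printed at the start of this notebook, every female passenger has `Survived` = 1 and every male passenger has `Survived` = 0. If that holds for the whole file, `Sex` alone determines the label and the scores say little about model quality. Whether it does hold is not shown here. The sketch below is one way to check, using a hypothetical helper `label_determined_by` and a stand-in frame built from the first five sample rows.

```python
import pandas as pd


def label_determined_by(df: pd.DataFrame, feature: str, target: str) -> bool:
    """Return True if every value of `feature` maps to exactly one value of `target`."""
    return bool((df.groupby(feature)[target].nunique() == 1).all())


# First five sample rows, with Sex already encoded (male=0, female=1)
demo = pd.DataFrame({
    "Sex": [0, 1, 0, 0, 1],
    "Pclass": [3, 3, 2, 3, 3],
    "Survived": [0, 1, 0, 0, 1],
})

# Cross-tabulation: off-diagonal zeros indicate a one-to-one relationship
print(pd.crosstab(demo["Sex"], demo["Survived"]))
print("Sex determines Survived:", label_determined_by(demo, "Sex", "Survived"))
print("Pclass determines Survived:", label_determined_by(demo, "Pclass", "Survived"))
```

Running the same check as `label_determined_by(cleaned_titanic_data, 'Sex', 'Survived')` would show whether the pattern extends to all 87 cleaned rows.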
