#Toy Project: Titanic Survival Prediction
**Objective:** Predict whether a passenger survived, based on features such as age, sex, and ticket class.


##1. Load the Titanic Dataset:

In [1]:
import pandas as pd

# Load Titanic dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
print(df.head())  # Inspect the first few rows


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


##2. Exploratory Data Analysis (EDA):

In [2]:
# Check for missing values
print(df.isnull().sum())

print("-------------------------")
# Number of duplicated PassengerId values
print(df['PassengerId'].duplicated().sum())

print("-------------------------")
# Descriptive statistics
print(df.describe())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
-------------------------
0
-------------------------
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0
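
To put the missing-value counts above in proportion, here is a small sketch that divides them by the 891 rows reported by `describe()`. The counts are copied from the output; the names `missing_counts` and `missing_pct` are only illustrative.

```python
# Missing-value counts copied from the isnull().sum() output above
missing_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}
n_rows = 891  # row count reported by describe()

# Share of rows with a missing value, as a percentage
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}
print(missing_pct)
```

On the live DataFrame, `df.isnull().mean() * 100` should give the same percentages. A column that is missing in roughly three quarters of the rows, like Cabin, is a natural candidate for dropping, whereas Age and Embarked can reasonably be imputed.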

##3. Data Cleaning & Preprocessing:

In [3]:
# Drop unnecessary columns
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# Fill missing values
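# (assign the result back; calling fillna(..., inplace=True) on a selected column
# triggers a chained-assignment warning in recent pandas and may not update df)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])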

# Encode categorical variables
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
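
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with the same two categorical columns (values are illustrative)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so 'female' and 'C' become the implicit baseline
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))
```

Dropping one level per column avoids perfectly collinear dummy columns, which mainly matters for linear models such as logistic regression.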


##4. Feature Engineering & Scaling:

In [4]:
from sklearn.preprocessing import StandardScaler

# Scale 'Age' and 'Fare' columns
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])


We are not scaling 'SibSp' and 'Parch' because their values are already small. If your data has discrete numerical features that need to be normalized, you can use MinMaxScaler instead of StandardScaler:

```
from sklearn.preprocessing import MinMaxScaler

# Assume 'SibSp' and 'Parch' are discrete numerical features
scaler = MinMaxScaler()
df[['SibSp', 'Parch']] = scaler.fit_transform(df[['SibSp', 'Parch']])

# Scaled 'SibSp' and 'Parch' will now be between 0 and 1
```
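
As a rough comparison of the two scalers, the sketch below applies both to the same small column. The `counts` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative sibling/spouse counts (made-up values, one column)
counts = np.array([[0.0], [1.0], [2.0], [8.0]])

minmax = MinMaxScaler().fit_transform(counts)      # (x - min) / (max - min)
standard = StandardScaler().fit_transform(counts)  # (x - mean) / std

print(minmax.ravel())
print(standard.ravel())
```

MinMaxScaler maps the column onto the range 0 to 1, while StandardScaler centers it at zero with unit variance and leaves the range unbounded.
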
##General Guidance:

- If you are using algorithms that are sensitive to the scale of input data (e.g., logistic regression, SVM, KNN, neural networks), scaling with StandardScaler or MinMaxScaler is usually a good idea, even for non-continuous numerical values.
- For models like tree-based algorithms (e.g., decision trees, random forests, gradient boosting), you can often leave non-continuous numerical features as they are because these models are not sensitive to the scale of the data.
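
The two points above can be illustrated with a small experiment. This is a sketch on synthetic data, not on the Titanic features: one informative feature on a unit scale, one pure-noise feature on a scale of about 1000, and a KNN classifier and a decision tree fit with and without standardization. All names (`X_demo`, `holdout_accuracy`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
informative = rng.normal(size=n)           # unit scale, determines the label
noise = rng.normal(scale=1000.0, size=n)   # large scale, unrelated to the label
X_demo = np.column_stack([informative, noise])
y_demo = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

def holdout_accuracy(model, train_X, test_X):
    """Fit on the training split and return accuracy on the held-out split."""
    return model.fit(train_X, y_tr).score(test_X, y_te)

knn_raw = holdout_accuracy(KNeighborsClassifier(), X_tr, X_te)
knn_scaled = holdout_accuracy(KNeighborsClassifier(), X_tr_s, X_te_s)
tree_raw = holdout_accuracy(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_tr, X_te
)
tree_scaled = holdout_accuracy(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_tr_s, X_te_s
)

print(f"KNN  raw={knn_raw:.2f} scaled={knn_scaled:.2f}")
print(f"Tree raw={tree_raw:.2f} scaled={tree_scaled:.2f}")
```

KNN relies on distances, so the large-scale noise feature dominates until the features are standardized. The tree only compares each feature against a threshold, so rescaling should leave its splits effectively unchanged.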

##5. Train-Test Split:

In [5]:
from sklearn.model_selection import train_test_split

# Define features and target variable
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
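
The split above is purely random. With roughly 38% survivors (the mean of `Survived` in the descriptive statistics), an optional variation is to pass `stratify=y` so both splits keep that class ratio. This is not what the notebook does, and using it would change the results reported below. The sketch uses a synthetic target (`y_toy`) rather than the real column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 1000 labels, 38% positive (illustrative)
y_toy = np.array([1] * 380 + [0] * 620)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```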


##6. Modeling:
Logistic Regression:

In [6]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred_log = log_reg.predict(X_test)


Decision Tree:

In [8]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train the model
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_tree = tree_clf.predict(X_test)
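
A shallow tree like this one is small enough to read directly. The sketch below shows one way to print the learned rules with scikit-learn's `export_text`. It uses synthetic data and placeholder feature names (`f0` to `f3`), so the printed rules are not the Titanic model's.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical placeholders
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=42)
feature_names = ["f0", "f1", "f2", "f3"]

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_syn, y_syn)

# export_text renders the learned splits as indented if/else rules
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

On the notebook's own model, `export_text(tree_clf, feature_names=list(X_train.columns))` should print the corresponding rules.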


##7. Model Evaluation:

In [9]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Logistic Regression performance
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(confusion_matrix(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

# Decision Tree performance
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))
print(confusion_matrix(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))


Logistic Regression Accuracy: 0.8100558659217877
[[90 15]
 [19 55]]
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179

Decision Tree Accuracy: 0.7988826815642458
[[92 13]
 [23 51]]
              precision    recall  f1-score   support

           0       0.80      0.88      0.84       105
           1       0.80      0.69      0.74        74

    accuracy                           0.80       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.80      0.80      0.80       179
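
To see where the numbers in the report come from, the sketch below recomputes accuracy, precision, recall, and F1 for class 1 (survived) from the logistic regression confusion matrix printed above. The matrix values are copied from that output; the variable names are illustrative.

```python
import numpy as np

# Logistic regression confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[90, 15],
               [19, 55]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)      # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

These match the 0.81 accuracy and the 0.79 / 0.74 / 0.76 row for class 1 in the classification report.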



##8. Cross-Validation for Model Evaluation:

In [10]:
from sklearn.model_selection import cross_val_score

# Cross-validation for Logistic Regression
cv_scores_log = cross_val_score(log_reg, X, y, cv=5)
print("Logistic Regression CV Scores:", cv_scores_log)

# Cross-validation for Decision Tree
cv_scores_tree = cross_val_score(tree_clf, X, y, cv=5)
print("Decision Tree CV Scores:", cv_scores_tree)


Logistic Regression CV Scores: [0.78212291 0.78089888 0.78651685 0.76966292 0.8258427 ]
Decision Tree CV Scores: [0.81564246 0.81460674 0.81460674 0.78089888 0.82022472]
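
One caveat about the scores above: the imputation and scaling in steps 3 and 4 were fit on the full dataset, so each validation fold has already influenced the preprocessing. A common way to avoid this is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, which `cross_val_score` then refits inside every fold. The sketch below runs on synthetic stand-in data with Titanic-like column names so that it works offline. The data, the names `preprocess` and `pipe`, and `max_iter=1000` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with Titanic-like columns (not the real dataset)
rng = np.random.default_rng(0)
n = 300
sex = rng.choice(["male", "female"], size=n)
age = rng.normal(30, 14, size=n)
age[rng.random(n) < 0.2] = np.nan            # mimic missing ages
X_syn = pd.DataFrame({
    "Age": age,
    "Fare": rng.exponential(30, size=n),
    "Sex": sex,
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
})
# Label mostly follows Sex, with about 20% of labels flipped as noise
y_syn = ((sex == "female") ^ (rng.random(n) < 0.2)).astype(int)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", OneHotEncoder(drop="first"), ["Sex", "Embarked"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Imputer, scaler, and encoder are refit on the training part of each fold
cv_scores = cross_val_score(pipe, X_syn, y_syn, cv=5)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean and standard deviation of the fold scores, as in the last line, also makes it easier to compare models than reading the five raw values.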
