<div style="text-align: center;">
    <img src="OneDrive\Bureau\interships tasks\CodSoft.png" style="width: 300px;">
</div>


# Introduction
In this project, we build a predictive model on the Titanic dataset to determine whether a passenger survived the disaster. It is a classic introductory project because the dataset records detailed information about each individual passenger.

# Key Points 
- The dataset provides detailed information about each passenger, including the attributes PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

- By leveraging this dataset, we will develop a machine learning model capable of predicting survival outcomes based on the given features.

- This project not only demonstrates fundamental data handling and analysis skills but also introduces basic concepts of machine learning classification, making it an ideal starting point for aspiring data scientists and analysts.

# 1. Load the Data

In [14]:
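# Import pandas and load the data
import pandas as pd
# Raw string so the backslashes in the Windows path are not treated as escape sequences
path = r"OneDrive\Bureau\interships tasks\Datasets Encryptix intership\Titanic-Dataset.csv"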
data = pd.read_csv(path)
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# 2. Exploratory Data Analysis (EDA)

In [19]:
data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [22]:
# Count the missing (NaN) values in each column
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
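
The raw counts are easier to judge as a share of the 891 rows. The sketch below recomputes those shares from the counts printed above, so it runs without the CSV file; the names `missing_counts`, `n_rows`, and `missing_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the isnull() output above; 891 is the row count reported by describe().
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891

# Share of missing values per column, in percent.
missing_share = (missing_counts / n_rows * 100).round(1)
print(missing_share)
```

On the loaded DataFrame, `data.isnull().mean() * 100` gives the same percentages directly. Roughly three quarters of `Cabin` is missing, which is consistent with dropping that column from the features later on.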

# 3. Data Preprocessing


In [27]:
# Fill the missing values in the Age column with the median age
data["Age"] = data["Age"].fillna(data["Age"].median())

# Fill the missing values in the Cabin column with the placeholder "Unknown"
data["Cabin"] = data["Cabin"].fillna("Unknown")

# Fill the missing values in the Embarked column with the most frequent value
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])
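
Filling the whole DataFrame up front means the median and the mode are computed from every row, including rows that later end up in the test set. A possible alternative, which is not what this notebook does, is scikit-learn's `SimpleImputer`, which learns the fill values in `fit` and can sit inside a pipeline. A minimal sketch on a made-up four-row frame (`toy` and the imputer variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    # Plain Python strings stored as object dtype.
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

# Median for the numerical column, most frequent value for the categorical one.
age_imputer = SimpleImputer(strategy="median")
embarked_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the fill value and applies it in one step.
age_filled = age_imputer.fit_transform(toy[["Age"]])
embarked_filled = embarked_imputer.fit_transform(toy[["Embarked"]])

print(age_filled.ravel())
print(embarked_filled.ravel())
```

Placed inside a pipeline, such an imputer would be fitted on the training rows only, so the fill values never depend on the test set.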



In [30]:
# Separate the target values and the features
X = data.drop(columns=["PassengerId", "Name", "Ticket", "Cabin", "Survived"])
y = data["Survived"]
data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | Unknown | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | Unknown | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | Unknown | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | Unknown | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 28.0 | 1 | 2 | W./C. 6607 | 23.4500 | Unknown | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [33]:
# Identify categorical and numerical columns
categorical_cols = ['Sex', 'Embarked']
numerical_cols = ['Age', 'Pclass', 'SibSp', 'Parch', 'Fare']

In [36]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing for numerical data: standardize each column
numerical_transformer = Pipeline(steps=[("scaler", StandardScaler())])

In [40]:
# Preprocessing for categorical data: one-hot encode, ignoring categories not seen during fitting
categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])


In [43]:
# Apply the numerical and categorical preprocessing steps to their respective columns
from sklearn.compose import ColumnTransformer

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

# 4. Implementing the Logistic Regression Model


In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


In [55]:
# Define the model
model = LogisticRegression(max_iter=1000)
# Create the pipeline: preprocessing followed by the model
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
# Split the data into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the preprocessing steps and the model on the training data only
pipeline.fit(X_train, y_train)

# Predict survival for the held-out test set
y_pred = pipeline.predict(X_test)
y_pred

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1], dtype=int64)
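
`predict` returns hard 0/1 labels. For a binary logistic regression these labels correspond to thresholding the predicted probability of class 1 at 0.5, and `predict_proba` exposes those probabilities. A small sketch on synthetic data from `make_classification` (a stand-in, not the Titanic data; all variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_class1 = demo_pipeline.predict_proba(X_te)[:, 1]
labels = demo_pipeline.predict(X_te)

# Thresholding the probability at 0.5 reproduces the hard labels.
labels_from_proba = (proba_class1 > 0.5).astype(int)
print(np.array_equal(labels, labels_from_proba))
```

With the notebook's own pipeline, the equivalent call would be `pipeline.predict_proba(X_test)`.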

# 5. Model Evaluation 

In [60]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [63]:
# Fraction of test passengers whose outcome was predicted correctly
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.8100558659217877

In [66]:
report = classification_report(y_test, y_pred)
# print() renders the line breaks instead of showing the raw string with \n
print(report)

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179

In [69]:
# Rows are the true classes, columns are the predicted classes
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

array([[90, 15],
       [19, 55]], dtype=int64)
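
The accuracy and the precision and recall reported for class 1 can be recovered from the confusion matrix, where rows are true classes and columns are predicted classes. A short sketch using the matrix printed above (variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = np.array([[90, 15],
               [19, 55]])
tn, fp, fn, tp = cm.ravel()

acc_from_cm = (tp + tn) / cm.sum()  # correct predictions / all predictions
precision_1 = tp / (tp + fp)        # of those predicted to survive, the share who did
recall_1 = tp / (tp + fn)           # of those who survived, the share the model found

print(round(acc_from_cm, 4), round(precision_1, 2), round(recall_1, 2))
```

These come out to about 0.81, 0.79, and 0.74, matching the accuracy score and the class 1 row of the classification report.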