# Khipus.ai
## Introduction to Machine Learning
### Supervised Learning - Linear Regression
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

### Assignment 2
### Name: (Please Enter Your Name Before Submitting)


## Assignment Instructions
Using the Titanic dataset provided:
1. Import the Titanic.csv file into a pandas DataFrame.
2. Perform a detailed data exploration.
3. Clean the dataset by addressing missing values.
4. Select relevant features for predicting survival.
5. Split the dataset into training and testing sets (80%-20% split).
6. Use a linear regression model to predict the survivors.

## 1. Import the Titanic dataset

In [38]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

In [39]:


# Your code here

# Load the Titanic dataset
titanic_df = pd.read_csv('Titanic.csv')

## 2. Data Exploration
Before working with the dataset, it's important to understand its structure, data types, and summary statistics. Below are examples of how to explore the data.


In [40]:

# Your code here

# Display the first few rows of the dataset
titanic_df.head()



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [41]:

# Display summary statistics
titanic_df.describe()



Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,0.363636,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.481622,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,0.0,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,0.0,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,1.0,3.0,39.0,1.0,0.0,31.5
max,1309.0,1.0,3.0,76.0,8.0,9.0,512.3292


In [42]:
# Check for missing values
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
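
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch that reports the percentage of missing values per column. The small DataFrame `sample_df` is a hypothetical stand-in for `titanic_df` with illustrative values, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_df (values are illustrative only)
sample_df = pd.DataFrame({
    'Age': [34.5, np.nan, 62.0, 27.0],
    'Fare': [7.8292, 7.0, np.nan, 8.6625],
    'Cabin': [np.nan, np.nan, np.nan, 'B45'],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = sample_df.isnull().mean() * 100
print(missing_pct.round(1))
```

Applied to the counts above, 'Cabin' is missing in 327 of 418 rows (about 78%) and 'Age' in 86 of 418 (about 21%), which is why the two columns are treated differently in the next section.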

## 3. Data Cleaning (addressing missing values)
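Real-world datasets often contain missing values, duplicate rows, or incorrect data, and cleaning them improves the quality and usability of the data for analysis. Here, the check above shows 86 missing values in 'Age', 1 in 'Fare', and 327 in 'Cabin' out of 418 rows; 'Embarked' has none.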


In [43]:

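# Fill missing 'Age' with the median. Assign the result back instead of calling
# fillna(inplace=True) on a column selection, which newer pandas versions warn about.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Fill missing 'Embarked' with the mode (a no-op on this file, where 'Embarked' has no missing values)
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

# Drop 'Cabin' as it has too many missing values
titanic_df.drop(columns=['Cabin'], inplace=True)

# Check the remaining missing values. 'Fare' still has one,
# which the imputer in Section 6 fills.
titanic_df.isnull().sum()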


PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           1
Embarked       0
dtype: int64

## 4. Feature Selection
Selecting relevant features is crucial for building an effective machine learning model. Here we one-hot encode the categorical columns 'Sex' and 'Embarked', then select the features used to predict survival.


In [44]:


# Perform one-hot encoding on 'Sex' and 'Embarked' columns
titanic_df = pd.get_dummies(titanic_df, columns=['Sex', 'Embarked'], drop_first=True)

# Define the features to use for prediction
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']

# Your code here
# Select the features from the DataFrame
X = titanic_df[features]  # Features

# Select the target variable from the DataFrame
y = titanic_df['Survived']  # Target variable
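
To see where the names `Sex_male`, `Embarked_Q`, and `Embarked_S` come from, here is a small sketch of `pd.get_dummies` with `drop_first=True`. The three-row frame `demo_df` is hypothetical. For each encoded column, the first category in sorted order ('female' and 'C' here) is dropped and acts as the baseline.

```python
import pandas as pd

# Hypothetical mini-frame with the same two categorical columns
demo_df = pd.DataFrame({
    'Sex': ['male', 'female', 'male'],
    'Embarked': ['Q', 'S', 'C'],
})

# drop_first=True removes one dummy per column to avoid redundant features
encoded = pd.get_dummies(demo_df, columns=['Sex', 'Embarked'], drop_first=True)
print(list(encoded.columns))
```

A row with `Embarked_Q` and `Embarked_S` both false therefore corresponds to the baseline category 'C'.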




## 5. Splitting Training and Test Data

In [45]:
# Your code here
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
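
With 418 rows and `test_size=0.2`, the split should yield 334 training rows and 84 test rows, since scikit-learn rounds the test size up. The sketch below checks this with placeholder arrays `X_demo` and `y_demo`, which are assumptions standing in for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Titanic file
X_demo = np.arange(418).reshape(-1, 1)
y_demo = np.zeros(418)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate the model on the same test rows.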

 

## 6. Use a linear regression model to predict the survivors

Many machine learning algorithms, including scikit-learn's LinearRegression, require complete data and raise an error when the input contains missing values. After the cleaning step, 'Fare' still contains one missing value, so we use SimpleImputer from the sklearn.impute module to replace missing values with the mean of the respective column. The imputer is fitted on the training data only and then applied unchanged to the test data, so that no information from the test set influences training.

In [46]:
# Instantiate the imputer with the strategy to replace missing values with the mean of the column
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training data and transform it, replacing missing values with the mean
X_train_imputed = imputer.fit_transform(X_train)

# Transform the test data using the same imputer, ensuring consistency in handling missing values
X_test_imputed = imputer.transform(X_test)
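
The difference between `fit_transform` and `transform` can be seen on a tiny example. In this sketch the arrays `train` and `test` are made up for illustration: the imputer learns the column mean from `train` and reuses that value for `test`, regardless of what the test values are.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with one missing value in each set
train = np.array([[10.0], [20.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

imputer_demo = SimpleImputer(strategy='mean')
train_filled = imputer_demo.fit_transform(train)  # learns the mean of the observed train values
test_filled = imputer_demo.transform(test)        # fills with the train mean, not the test mean

print(imputer_demo.statistics_)
print(test_filled.ravel())
```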

### Train the Model
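We will train a linear regression model using the training data. Because 'Survived' is binary (0 or 1), the model's predictions are continuous values rather than class labels.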

In [47]:
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train_imputed, y_train)
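
Linear regression outputs continuous values, so obtaining a survived/did-not-survive prediction needs one more step. One simple option, shown here as a sketch and not as the assignment's required method, is to threshold the output at 0.5. The toy data `X_toy` and `y_toy` and the 0.5 cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature and a binary target
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_model = LinearRegression().fit(X_toy, y_toy)

scores = toy_model.predict(X_toy)     # continuous values, can fall outside [0, 1]
labels = (scores >= 0.5).astype(int)  # assumed cutoff of 0.5 to get 0/1 labels
print(scores.round(2), labels)
```

The same thresholding could be applied to the predictions on `X_test_imputed` if class labels are needed.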

### Evaluate the Model
We'll evaluate the model on the test set using the mean squared error (MSE), the root mean squared error (RMSE), and the R² score.

In [48]:

# Make predictions
y_pred = model.predict(X_test_imputed)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R² Score: {r2}')

Mean Squared Error: 1.710959478213727e-30
Root Mean Squared Error: 1.308036497278928e-15
R² Score: 1.0
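
An R² of exactly 1.0 with an error near machine precision is unusual for real data and is worth investigating before trusting the model. One possible cause is a feature that encodes the target exactly. In the five rows displayed by `head()` above, every male passenger has `Survived = 0` and every female passenger has `Survived = 1`. Whether that pattern holds for the whole file is an assumption to verify on `titanic_df`. The sketch below rebuilds those five rows in a frame named `head_df` (a name introduced here) and cross-tabulates the two columns.

```python
import pandas as pd

# The five rows shown by titanic_df.head() above
head_df = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'male', 'female'],
    'Survived': [0, 1, 0, 0, 1],
})

# If all rows fall into one cell per 'Sex' value, 'Sex' determines 'Survived'
table = pd.crosstab(head_df['Sex'], head_df['Survived'])
print(table)

# True when being male coincides exactly with Survived == 0
sex_male = (head_df['Sex'] == 'male').astype(int)
perfectly_separated = bool((sex_male == 1 - head_df['Survived']).all())
print(perfectly_separated)
```

On the full dataset, `pd.crosstab(titanic_df['Sex_male'], titanic_df['Survived'])` performs the same check after encoding. If the table is perfectly diagonal there as well, the perfect score reflects how the labels in this file were constructed, not the predictive power of the model.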
