# Day 17 – Introduction to Machine Learning

### Objective
- Understand the basic concepts of Machine Learning  
- Learn the supervised learning workflow  
- Prepare the dataset for model training  
- Identify the features and the target variable

## What is Machine Learning?

Machine Learning is the practice of training a computer system to learn patterns from data, so that it can make predictions without being explicitly programmed with task-specific rules.

### Types of Machine Learning
1. Supervised Learning  
2. Unsupervised Learning  
3. Reinforcement Learning  

In this internship, we are focusing on **Supervised Learning**, because the Titanic dataset has a target variable (Survived).

## Machine Learning Workflow

The general workflow of a machine learning project:

1. Data Collection  
2. Data Cleaning  
3. Exploratory Data Analysis  
4. Feature Engineering  
5. Splitting Data (Train-Test Split)  
6. Model Training  
7. Model Evaluation  
8. Prediction  

This day focuses on step 5 and on preparation for step 6.
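As a preview, steps 5 to 8 of the workflow can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: the data comes from scikit-learn's `make_classification` rather than the Titanic file, the names `X_toy`, `y_toy`, and `model` are hypothetical, and Logistic Regression is used simply because it is one of the algorithms planned for later days.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for steps 1-4: synthetic, already-clean numeric data (NOT the Titanic set)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 5: hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Step 6: train a classifier on the training rows only
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

# Step 7: evaluate on rows the model has not seen
acc = accuracy_score(y_te, model.predict(X_te))

# Step 8: predict labels for a few rows
new_pred = model.predict(X_te[:3])

print("Train/test rows:", X_tr.shape[0], X_te.shape[0])
print("Test accuracy:", round(acc, 3))
print("Sample predictions:", new_pred)
```

The model is fitted only on the training rows, so the accuracy measured on the test rows reflects data the model has not seen.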

In [3]:
# Import required libraries
import pandas as pd

## Load Prepared Dataset

We load the cleaned and feature-engineered Titanic dataset prepared on previous days.

In [4]:
# Load dataset
df = pd.read_csv("./data/clean_data.csv")

# Display first few rows
df.head()

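| | Survived | Pclass | Age | Fare | Family_size | Family_type | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.25 | 2 | small | 0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 1 | 38.0 | 71.2833 | 2 | small | 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 3 | 26.0 | 7.925 | 1 | alone | 1 | 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1 | 1 | 35.0 | 53.1 | 2 | small | 1 | 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0 | 3 | 35.0 | 8.05 | 1 | alone | 0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |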


In [5]:
# Basic information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Age           714 non-null    float64
 3   Fare          891 non-null    float64
 4   Family_size   891 non-null    int64  
 5   Family_type   891 non-null    object 
 6   Sex_female    891 non-null    int64  
 7   Sex_male      891 non-null    int64  
 8   Embarked_C    891 non-null    float64
 9   Embarked_Q    891 non-null    float64
 10  Embarked_S    891 non-null    float64
 11  Embarked_nan  891 non-null    float64
dtypes: float64(6), int64(5), object(1)
memory usage: 83.7+ KB


## Identify Features and Target Variable

In supervised learning, we separate:

- Input features (X)
- Target variable (y)

For the Titanic dataset:
- Target variable = Survived  
- Features = all remaining columns

In [6]:
# Define target variable
y = df["Survived"]

# Define input features
X = df.drop("Survived", axis=1)

print("Shape of Features (X):", X.shape)
print("Shape of Target (y):", y.shape)

Shape of Features (X): (891, 11)
Shape of Target (y): (891,)
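The `df.info()` output earlier shows that `Age` has 714 non-null values out of 891 and that `Family_type` has dtype `object`. Many scikit-learn estimators expect numeric input without missing values, so it may be worth checking the feature matrix before training. The sketch below shows one way to do that on a hypothetical miniature frame, `X_demo`, which mirrors a few of the columns. It is not the real dataset, and how such columns should be handled is left open here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mirroring a few feature columns (illustration only)
X_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Family_type": ["small", "small", "alone", "small"],
})

# Columns whose dtype is not numeric (for example, text categories)
non_numeric = X_demo.select_dtypes(exclude="number").columns.tolist()

# Columns that still contain missing values
missing = X_demo.isna().sum()
cols_with_missing = missing[missing > 0].index.tolist()

print("Non-numeric columns:", non_numeric)
print("Columns with missing values:", cols_with_missing)
```

Running the same two checks on `X` would list the columns that may need encoding or imputation before a model is fitted.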


## Understanding Supervised Learning in Context

This Titanic problem is a **binary classification problem** because the output (Survived) takes only two values:

- 0 → Did not survive
- 1 → Survived

Therefore, classification algorithms such as the following will be used in upcoming days:

- Logistic Regression
- KNN
- Decision Tree
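The binary nature of the target can be checked directly by counting its distinct values. The sketch below uses a small series, `y_demo`, holding only the five `Survived` values visible in the `df.head()` preview, so the counts are illustrative and say nothing about the full dataset.

```python
import pandas as pd

# The five Survived values from the df.head() preview (illustration only)
y_demo = pd.Series([0, 1, 1, 1, 0], name="Survived")

# Number of distinct classes; 2 means a binary classification target
n_classes = y_demo.nunique()

# Rows per class, and the same as proportions
counts = y_demo.value_counts()
proportions = y_demo.value_counts(normalize=True)

print("Distinct classes:", n_classes)
print(counts)
print(proportions)
```

Applying `value_counts()` to the full `y` would show how balanced the two classes are, which is useful to know before choosing an evaluation metric.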

## Train-Test Split Concept

Before training a model, data must be divided into:

- Training data → to train the model  
- Testing data → to evaluate the model  

Evaluating on the held-out testing data shows how well the model performs on data it has not seen during training.

In [7]:
from sklearn.model_selection import train_test_split

# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Target Shape:", y_train.shape)
print("Testing Target Shape:", y_test.shape)

Training Features Shape: (712, 11)
Testing Features Shape: (179, 11)
Training Target Shape: (712,)
Testing Target Shape: (179,)
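The split sizes follow from the row count: with `test_size=0.2`, scikit-learn rounds the test set up, so 891 rows give 179 test rows and 712 training rows. The sketch below reproduces that arithmetic and also shows the optional `stratify` argument of `train_test_split`, which keeps the class proportions similar in both parts. The arrays `X_syn` and `y_syn` are synthetic placeholders with a made-up class ratio, not the Titanic data, and whether stratification is wanted here is a design choice rather than something the original split uses.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891
test_size = 0.2

# Test rows are rounded up; the remaining rows go to training
n_test = math.ceil(n_rows * test_size)   # 179
n_train = n_rows - n_test                # 712
print("Expected sizes:", n_train, n_test)

# Synthetic placeholders: one feature column and an arbitrary class ratio
X_syn = np.arange(n_rows).reshape(-1, 1)
y_syn = np.array([0] * 600 + [1] * 291)

# stratify=y_syn keeps the share of each class close to that of the full data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=test_size, random_state=42, stratify=y_syn
)

print("Actual sizes:", len(X_tr), len(X_te))
print("Share of class 1 overall:", round(y_syn.mean(), 3))
print("Share of class 1 in test:", round(y_te.mean(), 3))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the training and testing sets on every run.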


## Conclusion of Day 17

Today the following tasks were completed:

- Understood basic machine learning concepts  
- Learned supervised learning workflow  
- Loaded prepared dataset  
- Identified features and target variable  
- Performed train-test split  

This preparation will be used for model building from Day 18 onwards.