# An Introduction to Machine Learning
## Session 1a: Introduction to Machine Learning and Data Exploration

Welcome to Session 1a of our Introduction to Machine Learning course! In this session, we’ll begin by exploring what machine learning is and how it can help us make predictions and uncover insights from data. Machine learning is all about building models that learn from data, helping us answer questions like “What factors impact wine quality?” or “Which characteristics are most common among people who survived the Titanic sinking?”

We’ll be working with the Titanic dataset, which provides information on passengers, including details like age, class, and whether they survived the sinking. Our aim is to explore this data and understand the factors that might influence survival. This process, called Exploratory Data Analysis (EDA), is key to understanding which information is useful in our machine learning models.

By the end of this part of the session, you’ll have a better understanding of what machine learning can do, and you’ll be equipped to start thinking about how to build and evaluate models.

### 1. Importing relevant packages and data

In [None]:
# Basic imports for data handling and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the Titanic dataset
titanic_train = pd.read_csv("../data/titanic_train.csv")

### 2. Exploratory Data Analysis (EDA)

In [None]:
# Display the first few rows
titanic_train.head()

In [None]:
# Check dataset structure
titanic_train.info()

In [None]:
# Summary statistics
titanic_train.describe()

It is always important to check for missing data points within a dataset. This applies to all data analysis, not just machine learning! Have a go at understanding the missingness below.

In [None]:
# EXERCISE: Check for missing values in each column.
# Hint: Use .isnull().sum() to get the count of missing values.
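
To see what the hint produces before trying it on the real data, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its column names (`Age`, `Cabin`) are assumptions for illustration only; they are not read from `titanic_train.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_train; values and column names are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Survived": [0, 1, 1, 0],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

The fraction is often easier to interpret than the raw count, because it shows at a glance whether a column is mostly complete or mostly empty.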

In [None]:
# Distribution of survival outcomes (passenger counts per outcome)
plt.figure(figsize=(6, 4))
sns.countplot(x='Survived', data=titanic_train)
plt.title('Distribution of Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()
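
The count plot shows raw counts; the same information can also be expressed as proportions. The sketch below uses a made-up `toy` table rather than the real dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in with the same 'Survived' coding as above (0 = No, 1 = Yes)
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# Share of passengers in each outcome, rather than the raw count
proportions = toy["Survived"].value_counts(normalize=True)

print(proportions)
```

On the real data, the same call on `titanic_train["Survived"]` would give the overall survival rate as the share for the value 1.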

We may be interested in understanding how other factors, such as passenger class and sex, relate to survival. Try to visualise each of these.

In [None]:
# EXERCISE: Plot survival by passenger class using sns.countplot with hue='Survived'.
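
A plot of this kind can be cross-checked with a table of counts. The sketch below is one possible way to do that, using `pd.crosstab` on made-up data; the column name `Pclass` is an assumption based on the commonly distributed version of the Titanic dataset and may differ in your file.

```python
import pandas as pd

# Toy stand-in; 'Pclass' is an assumed column name, values are made up
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Rows are passenger classes, columns are survival outcomes, cells are counts
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

print(counts)
```

Each cell of the table corresponds to one bar in a count plot grouped by `hue='Survived'`, so the two views should agree.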

In [None]:
# EXERCISE: Plot survival by sex to see how it affects survival rates.
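
Because `Survived` is coded as 0 and 1, its mean within a group is that group's survival rate. The sketch below illustrates this on made-up data; the column name `Sex` and its values are assumptions, so check them against your own file.

```python
import pandas as pd

# Toy stand-in; 'Sex' is an assumed column name, values are made up
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_sex)
```

Rates are useful alongside counts when the groups differ in size, since a larger group can have more survivors in absolute terms while still having a lower survival rate.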

Based on these visualisations, we can now start to hypothesise which features seem important for predicting survival.

EXERCISE: In this cell, write down the features you think are important for predicting survival, and explain why you chose each one.

- Insert features here.