# **Lesson_3.2**

## In this lecture

* Fork repository, update or recreate codespace

* **ML**
* **Supervised** vs. **Unsupervised** ML
* **Features** and **Labels**
* **Classification** vs. **Regression**
* Model selection workflow
* **Plotly Express** Python library
* **Pandas** Python library
* In-class exercise

---

## What is Machine Learning (ML)

* Machine Learning (ML) is a branch of **artificial intelligence** where computers learn patterns from data and improve their performance on a task without being explicitly programmed step by step

* ML algorithms build mathematical models from historical or observed data to make predictions, classifications, or decisions on new, unseen data
* The core idea of ML is learning from experience: the more relevant data the model sees, the better it can *generalise* and perform in real-world applications

### **Supervised** vs. **Unsupervised** ML

* **Supervised Machine Learning** learns from examples where the correct outcome is already known, so the model learns to predict or recognise similar outcomes for new data

* **Unsupervised Machine Learning** works with data where no outcomes are given in advance, and the model’s task is to discover hidden patterns or natural groupings on its own

*So far, we said that machine learning learns from data. But now the key question is: how do we present this data to a computer in a form it can understand?*

### Why do we need features and labels?

* Machine learning models cannot learn directly from raw objects or concepts (like “a student”, “a customer”, or “a picture”) — everything must be represented numerically

* **Features** describe the data in a measurable way, giving the model information it can actually process and compare

* **Labels** tell the model what it should aim for in supervised learning, allowing it to learn what a “correct” outcome looks like

---

### **Features** vs. **labels** — one concrete, intuitive example
**Example**: predicting house prices

* We want a model to predict the price of a house
* To do this, we describe each house using measurable information
* Features are the descriptions of the house (e.g. size, number of rooms, distance to city centre)
* Label is what we want the model to predict (the house price)

* Features describe the situation; the label is the answer we want the model to learn
* In supervised machine learning, we typically have many features and one label

* One way to visualise this:

House A  ──► [ size, rooms, location ] ──► £250,000
House B  ──► [ size, rooms, location ] ──► £320,000
House C  ──► [ size, rooms, location ] ──► £180,000

* Left: real-world objects

* Middle: numerical descriptions (features)

* Right: known outcome (label)

**Supervised machine learning** is about learning the relationship between the middle (features) and the right (label)
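
The house-price picture above can be sketched as a small table, assuming Pandas is available. The feature values and column names (`size_m2`, `rooms`, `distance_km`, `price_gbp`) are invented for illustration; only the three prices come from the diagram.

```python
import pandas as pd

# Hypothetical houses: the feature values are invented for illustration;
# only the prices echo the diagram above.
houses = pd.DataFrame(
    {
        "size_m2": [95, 120, 70],
        "rooms": [3, 4, 2],
        "distance_km": [5.0, 2.5, 8.0],
        "price_gbp": [250_000, 320_000, 180_000],
    },
    index=["House A", "House B", "House C"],
)

# Features (X): every column that describes the house.
X = houses.drop(columns="price_gbp")

# Label (y): the single column we want the model to predict.
y = houses["price_gbp"]

print(X.shape)  # (3, 3): three houses, three features
print(y.shape)  # (3,): one label per house
```

Splitting the table into `X` and `y` like this is a common convention rather than a requirement; the important point is that each row of features is paired with exactly one label.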

#### Link to **classification** vs **regression**

* If the label is a **number**, the problem is called **regression** (e.g. predicting price, temperature, demand)
* If the label is a **category**, the problem is called **classification** (e.g. spam/not spam, pass/fail, healthy/diseased)
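
As a rough illustration of the rule above, the sketch below guesses the problem type from the label values alone. The function name `problem_type` is hypothetical, and the check is only a rule of thumb: categories that happen to be stored as numbers (for example class codes 0 and 1) would be misread as regression, so the final decision always rests on what the label means.

```python
from numbers import Number


def problem_type(labels):
    """Guess the problem type from the label values (a rule of thumb only)."""
    # bool is a subclass of int in Python, so exclude it explicitly:
    # True/False labels are categories, not quantities.
    all_numeric = all(
        isinstance(value, Number) and not isinstance(value, bool)
        for value in labels
    )
    return "regression" if all_numeric else "classification"


print(problem_type([250_000, 320_000, 180_000]))   # regression
print(problem_type(["spam", "not spam", "spam"]))  # classification
print(problem_type([True, False, True]))           # classification
```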

---

## Model selection workflow

<p align="center">
	<img src="../assets/img/model_selection.jpg" width="800" alt="Model Selection Workflow">
</p>

#### Time series analysis
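* Time series analysis does not fit neatly into the workflow above because time series observations are **auto-correlated** (each value tends to be correlated with earlier values). It will be studied separately.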

---

## Plotly Express

<p align="center">
	<img src="../assets/img/plotly_logo.png" width="200" alt="Plotly logo">
</p>

#### [Plotly Express](https://plotly.com/python/plotly-express/)


In [None]:
# Import modules
import plotly.express as px
from skimage import io

#### Example 1

In [None]:
df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color="smoker", barmode="group")
fig.show()

#### Example 2

In [None]:
df = px.data.tips()
fig = px.parallel_categories(df, color="size", color_continuous_scale=px.colors.sequential.Inferno)
fig.show()

#### Example 3

In [None]:
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
# Group countries with fewer than 2 million people under 'Other countries'
df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries'
fig = px.pie(df, values='pop', names='country', title='Population of European continent')
fig.show()

#### Example 4

In [None]:
df = px.data.gapminder().query("year == 2007")
fig = px.sunburst(df, path=['continent', 'country'], values='pop',
                  color='lifeExp', hover_data=['iso_alpha'])
fig.show()

#### Example 5

In [None]:
df = px.data.tips()
fig = px.box(df, x="day", y="total_bill", color="smoker", notched=False)
fig.show()

#### Example 6

In [None]:
img = io.imread('https://user-images.githubusercontent.com/72614349/179115668-2630e3e4-3a9f-4c88-9494-3412e606450a.jpg')
fig = px.imshow(img)
fig.show()

#### Example 7

In [None]:
img = io.imread('https://thumbs.dreamstime.com/b/ocean-sunset-landscape-bird-high-resolution-image-flying-towards-colorful-romantic-sky-246711534.jpg')
fig = px.imshow(img)
fig.show()


---

## [Pandas](https://pandas.pydata.org/)

<p align="center">
	<img src="../assets/img/pandas_logo.jpg" width="200" alt="Pandas library logo">
</p>

In [None]:
# %pip install pandas  # Install Pandas from within the jupyter notebook cell

In [None]:
import pandas as pd  # Import pandas (aliased as pd by convention)

#### Load dataset
* The `.csv` format is commonly used, but Pandas can import many other formats
* The `read_csv()` function accepts both a local file path and a URL

In [None]:
# df = pd.read_csv('../datasets/mall_customers_k-means.csv')
df = pd.read_csv('https://raw.githubusercontent.com/DrSYakovlev/m32895-public-tb2-2026/refs/heads/main/datasets/mall_customers_k-means.csv')
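
Besides a path or a URL, `read_csv()` also accepts a file-like object, which makes it easy to try out without a file or a network connection. The sketch below uses invented CSV text; its column names are placeholders and are not taken from the mall customers dataset.

```python
import io

import pandas as pd

# Invented CSV text; the column names are placeholders, not the real dataset's.
csv_text = """customer_id,age,annual_income
1,19,15
2,21,15
3,20,16
"""

# read_csv accepts a local path, a URL, or a file-like object such as StringIO.
demo_df = pd.read_csv(io.StringIO(csv_text))

print(type(demo_df).__name__)  # DataFrame
print(demo_df.shape)           # (3, 3)
```

The result is stored in `demo_df` rather than `df` so that the dataset loaded above is not overwritten.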


In [None]:
type(df)  # df is a pandas DataFrame

#### Getting information about your dataset

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()  # df.head(15)

In [None]:
df.tail()  # df.tail(15)

In [None]:
df.shape  # A tuple: (number of rows, number of columns)

In [None]:
df.dtypes

In [None]:
df.columns

In [None]:
df.value_counts()

In [None]:
df['Gender'].value_counts()

In [None]:
df.nunique()

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

In [None]:
df.isna()  # isnull() is an alias of isna(); both give the same result

In [None]:
df.isna().sum()

#### Essential attributes and methods for initial data inspection
```
df.info()
df.describe()
df.head()
df.tail()
df.shape
df.dtypes
df.columns
df.value_counts()
df['Gender'].value_counts()
df.nunique()
df.isnull()
df.isnull().sum()
df.isna()
df.isna().sum()
```
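
To see what a few of these calls return, the sketch below applies them to a tiny, invented DataFrame with one missing value. The data and the name `toy` are made up for illustration; the same calls work on the dataset loaded earlier.

```python
import pandas as pd

# Invented toy data with one missing value in the Gender column.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", None],
        "Age": [23, 35, 23, 41],
    }
)

# .shape is a tuple, so it can be unpacked into two variables.
n_rows, n_cols = toy.shape

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Frequency of each category; missing values are excluded by default.
gender_counts = toy["Gender"].value_counts()

# Number of distinct non-missing values in each column.
unique_per_column = toy.nunique()

print(f"{n_rows} rows, {n_cols} columns")  # 4 rows, 2 columns
print(missing_per_column["Gender"])        # 1
print(gender_counts["Female"])             # 2
print(unique_per_column["Age"])            # 3
```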

[Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)

---

### In-class exercise

Write code that retrieves the names of the Seaborn demo datasets and, for each dataset, prints its name, its shape, and its number of rows and columns (separately). Finally, print the total number of datasets.

Use:
* Loop
* Tuple unpacking
* f-strings
* DataFrame .shape
* Function calls
* Working with external library

*Remember: Seaborn datasets are loaded as Pandas DataFrames*

In [None]:
import seaborn as sns

In [None]:
# sns.get_dataset_names()

In [None]:
# type(sns.get_dataset_names())
    

In [None]:
# Type your solution code here ...

---

### End of lesson routine