# Python
Python is a general-purpose programming language. In these workshops, we will use Python 3.8 or higher. Python, like JavaScript, is an interpreted language, as opposed to Java or C, which need to be compiled before running. As a consequence, some errors that a compiler would detect ahead of time (for example, type mismatches) go unnoticed until the code runs, where they raise exceptions during execution.

In this notebook we are going to see some very basic examples of how to use Python and some of its libraries. For a detailed view of the language, check https://docs.python.org/3/tutorial/

## Variables and types
Python is a strongly and dynamically typed language. It has four basic types: integer number `int`, decimal (floating-point) number `float`, boolean `bool` and text string `str`. When we define a variable we do not specify its type; the type is taken from the assigned value.

In [None]:
# This is a comment
# Comments start with the character #

# This frame is a cell which can be executed individually.
# To execute a cell press Ctrl+Enter or press the run button in the tools menu

print("Hello World!")

In [None]:
# An integer
myInt = 10

# A float value
myFloat = 3.14

# A boolean
myBool = True

# A string; you can use single quotes ' or double quotes "
myStr = "example1"
myStr2 = 'example2'


To display output, we use the `print()` function.

In [None]:
print(myInt)
print(myFloat)
print(myBool)
print(myStr)
print(myStr2)

# As Python is strongly typed, values are not converted to another type automatically. For example, to concatenate
# an int with a str, we have to convert the int into a str first
print(myStr + " " + str(myInt))
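
As a small illustration of these typing rules, the sketch below inspects types with the built-in `type()` function. It shows that a name can be rebound to a value of another type (dynamic typing), while values are never converted implicitly (strong typing). The variable names are only illustrative; the f-string at the end is an alternative to calling `str()` by hand.

```python
# Illustrative names; they are not used elsewhere in the notebook
value = 10
print(type(value))        # <class 'int'>

# Dynamic typing: the same name can be rebound to a value of another type
value = "ten"
print(type(value))        # <class 'str'>

# Strong typing: mixing str and int requires an explicit conversion
count = 3
message = "count = " + str(count)
print(message)

# f-strings convert the values placed inside the braces
fmessage = f"count = {count}"
print(fmessage)
```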

## Basic operations

In [None]:
# Add
print(3 + 5)

# Subtract
print(3 - 5)

# Multiply
print(3 * 5)

# Divide
print(3 / 5)

# Power 
print(3 ** 5)
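
Two more operators are worth knowing, since the remainder operator `%` appears later in this notebook. The following sketch shows floor division and remainder; the numbers are arbitrary.

```python
# / always returns a float, even when the division is exact
print(6 / 3)      # 2.0

# // is floor division: the result is rounded down to an integer
print(7 // 2)     # 3

# % is the remainder of the division
print(7 % 2)      # 1

# divmod() returns both results at once as a tuple
quotient, remainder = divmod(7, 2)
print(quotient, remainder)   # 3 1
```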

## Lists

In [None]:
myList = [1, 2, 3, True, "APSV"]

# Access an element in certain position
print(myList[0])

# We can access elements from the end by using negative indexes: -1 is the last value, -2 the previous one, etc.
print(myList[-1])

# We can obtain a sublist using the syntax myList[start:end]; the element at position end is not included
# If start is not specified, the beginning of the list is used by default; similarly, if end is not specified, the list's end is used.
print(myList[0:2])
print(myList[-3:-1])
print(myList[:2])
print(myList[2:])

In [None]:
# To modify a value
myList[0] = 4
print(myList)

# To add elements at the end of the list
myList.append("Exam")
print(myList)

# To add elements at a specific position
myList.insert(1,23)
print(myList)

# To remove the last element
lastElement = myList.pop()
print(myList)

# To remove a certain element
del myList[1]
print(myList)

# To remove an element by its value (the first occurrence)
myList.remove(3)
print(myList)

# To obtain the length of any collection
print(len(myList))

In [None]:
# Extra: Python has another type of collection called tuples. Tuples are similar to lists, but the main difference is that tuples are immutable, meaning their elements cannot be modified after creation.
# They are defined using parentheses instead of brackets. We can access elements in the same way as with lists

myTuple = (1, 2, 3, True, "APSV")

print("First element: " + str(myTuple[0]))
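
To make the immutability point concrete, here is a minimal sketch with an illustrative tuple. It shows the `TypeError` raised on assignment and the usual workaround of building a new tuple.

```python
point = (1, 2, 3)   # illustrative tuple

try:
    point[0] = 4    # tuples do not support item assignment
except TypeError as error:
    print("Cannot modify a tuple:", error)

# To "change" a tuple we build a new one from the pieces we want
new_point = (4,) + point[1:]
print(new_point)    # (4, 2, 3)

# Tuples can be unpacked into several variables at once
first, second, third = new_point
print(first, second, third)
```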

## Dictionaries
Dictionaries are collections of key-value pairs. Dictionaries can be accessed like lists, but they use keys instead of indexes.

The values can have different types. It is even common to have dictionaries inside dictionaries (e.g. when working with JSON data).

In [None]:
# Dictionaries are marked with {} characters
# The syntax to set a key-value pair is "key": "value"
myDict = { "key": "value", "APSV": 23, "arr" : [1,2,3] }
print(myDict)

# Access by key
print(myDict["arr"])
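
The sketch below extends the idea to a nested dictionary shaped like a small JSON document. The keys and values are invented for illustration.

```python
# Illustrative data, shaped like a small JSON document
order = {"id": 7, "customer": {"name": "Ana", "province": "Madrid"}}

# Nested access: chain the keys
print(order["customer"]["province"])

# Add a new key (or update an existing one)
order["status"] = "sent"

# get() returns a default value instead of raising KeyError when the key is missing
print(order.get("weight", "unknown"))

# Iterate over the key-value pairs
for key, value in order.items():
    print(key, "->", value)
```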

## Boolean operations

In [None]:
# AND
print(myInt < 5 and myBool)

# OR
print(myInt > 5 or myBool)

# Negate (NOT)
print(not myBool)

# Check element in list
print(4 in myList)

## Flow control
In Python, code blocks are defined by their indentation level, unlike other languages that use {} to enclose blocks.

In [None]:
# Conditional
if myInt > 5:
    print("myInt is greater than 5")

In [None]:
# Conditional with else
if myInt < 5:
    print("myInt is less than 5")
else:
    print("myInt is greater than or equal to 5")

In [None]:
# Chained conditionals with elif
if myInt < 5:
    print("myInt is less than 5")
elif myInt < 15:
    print("myInt is less than 15")
else:
    print("myInt is greater than or equal to 15")

In [None]:
# Using a for loop to iterate over the elements of a list.
for i in myList:
    print(i)

In [None]:
# Using a for loop to iterate over a range of values.
for i in range(len(myList)):
    print(str(i) + " - " + str(myList[i]))

In [None]:
# While loop
a = 0
while a < 10:
    print(a)
    a += 1

In [None]:
# Break and continue
a = 0
while True:
    a += 1
    if a > 10:
        break
    if a % 2 == 0:
        continue
    print(a, "is odd")

### Advanced (optional) — List comprehensions

List comprehensions are a concise way to create lists in Python. The general form is:

[expression for item in iterable if condition]

- expression: any valid Python expression that uses ``item`` (for example ``item*2``)
- item: loop variable
- iterable: any iterable (``range()``, list, generator, etc.)
- if condition: an optional filter (only include items that satisfy the condition)

Notes:
- List comprehensions are often more readable and faster than equivalent for-loops for simple transformations.
- For very large numeric data, NumPy vectorized operations (arrays) are usually much faster than either approach.
- If you only need to iterate once without creating a list, use a generator expression (parentheses) to save memory, e.g. ``(x*x for x in range(10))``.
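
The note about generator expressions can be sketched as follows. The sizes printed by `sys.getsizeof` depend on the Python version, so no specific values are claimed here.

```python
import sys

# A list comprehension builds the whole list in memory
squares_list = [x * x for x in range(1000)]

# A generator expression produces the values one at a time
squares_gen = (x * x for x in range(1000))

# Both can be consumed by functions such as sum()
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)

# The generator object stays small, whatever the length of the range
print(sys.getsizeof(squares_list), sys.getsizeof(squares_gen))

# A generator can be consumed only once: a second pass finds it empty
print(sum(squares_gen))   # 0
```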

In [None]:
# Squares from 0 to 9:
print([x*x for x in range(10)])

# Filter even numbers:
print([x for x in range(20) if x % 2 == 0])

# Nested comprehension (pairs):
print([(i, j) for i in range(3) for j in range(2)])


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]


In [4]:
# Equivalent for-loop versions (for comparison):
# Squares from 0 to 9:
squares = []
for x in range(10):
    squares.append(x*x)
print(squares)

# Filter even numbers:

evens = []
for x in range(20):
    if x % 2 == 0:
        evens.append(x)
print(evens)

# Nested loop (pairs):
pairs = []
for i in range(3):
    for j in range(2):
        pairs.append((i, j))
print(pairs)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]


This syntax is often complemented with functional-programming-style helpers that reduce a collection to a single value. In Python, these are built-in functions that receive an iterable as a parameter. For example, `sum([1,2,3])` computes the sum of all elements in the list. See Python's built-in functions: https://docs.python.org/3/library/functions.html (e.g., `sum`, `max`, `min`, `any`, `all`) and the introductory guide to functional programming in Python: https://docs.python.org/3/howto/functional.html.

In [3]:
# Sum of squares from 0 to 9:
print(sum([x*x for x in range(10)]))

# Max even number in a collection:
print(max([x for x in range(20) if x % 2 == 0]))

# Create a dictionary:
print(dict([(i, j) for i in range(3) for j in range(2)]))

285
18
{0: 1, 1: 1, 2: 1}


## Methods
To define a method (a function) in Python we use the following syntax: `def methodName(arg1, arg2 = "default value"):`.

In [None]:
def sumValues(a, b=1):
    return a + b

print(sumValues(2,3))
print(sumValues(2))
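
Arguments can also be passed by name, which is handy when a function has several default values. The helper `describe_package` below is hypothetical and exists only for this illustration.

```python
def describe_package(width, height=10, unit="cm"):
    # Hypothetical helper used only for this illustration
    return f"{width}x{height} {unit}"

# Only the required argument: height and unit take their defaults
print(describe_package(5))

# Skip height and override unit by name
print(describe_package(5, unit="in"))

# Keyword arguments can be given in any order
print(describe_package(height=2, width=3))
```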

## Import modules
In Python, external dependencies are usually called modules. To import them we use `import moduleName`.

Modules that are not part of the standard library have to be installed first. You can install them with the `pip` command (similar to `npm` in JavaScript).

In [None]:
import math
print(math.sqrt(2))
import random
print(random.random())

## Errors
Until now, all the code that we have executed has run correctly, but when we do more complex things, we can get execution errors. Now, we will run some code with errors to see the types of errors we might encounter and how they appear.

In [None]:
# NameError
# This error occurs when a variable or name is not defined.
print(var33)

In [None]:
# TypeError
# Occurs when attempting invalid operations between types at runtime (e.g., concatenating str and int).
print("hello " + 3)

In [None]:
# ZeroDivisionError
print(1 / 0)

In [None]:
# IndexError
# This happens when we try to access an index that is out of bounds or undefined.
a = [1, 2]
print(a[4])
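
These errors are exceptions, and they can be handled with `try`/`except` so that the program keeps running. The following is a minimal sketch; `safe_divide` is a hypothetical helper, not something used later in the workshops.

```python
values = [1, 2]

try:
    print(values[4])
except IndexError as error:
    # Execution continues here instead of stopping the program
    print("Handled:", error)

def safe_divide(a, b):
    # Hypothetical helper: returns None instead of raising on division by zero
    try:
        return a / b
    except ZeroDivisionError:
        return None

print(safe_divide(1, 0))
print(safe_divide(1, 2))
```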

## Problems
### Problem 1
Design a method `cos_distance` to calculate the cosine distance between two lists of floats ``a`` and ``b``. The cosine distance is one minus the cosine similarity, which measures the cosine of the angle between two vectors: https://en.wikipedia.org/wiki/Cosine_similarity.

$D =1-\cos(\theta) = 1-{\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = 1-\frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} } $

To check your solution you can use the following tests (results should match approximately, since floating-point rounding may introduce tiny differences):
 - ``cos_distance([1, 0, -1], [1, 0, -1])`` ≈ ``0.0``
 - ``cos_distance([1, 0, -1], [-1, 0, 1])`` ≈ ``2.0``
 - ``cos_distance([1, 0, -1], [0, 1, 0])`` ≈ ``1.0``
 - ``cos_distance([1, 2, 3], [3, 2, 1])`` ≈ ``0.2857``
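
As a worked check of the last test, take $\mathbf{A} = (1, 2, 3)$ and $\mathbf{B} = (3, 2, 1)$ and substitute them into the formula above:

$\mathbf{A} \cdot \mathbf{B} = 1 \cdot 3 + 2 \cdot 2 + 3 \cdot 1 = 10, \quad \|\mathbf{A}\| = \|\mathbf{B}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}, \quad D = 1 - \frac{10}{\sqrt{14}\,\sqrt{14}} = 1 - \frac{10}{14} \approx 0.2857$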

In [None]:
# Use this cell to write your code
def cos_distance( a, b ):
    return

In [None]:
# Try it
print(cos_distance([1,2,3], [3,2,1]))
print( "Obtained: ",cos_distance([1,0,-1], [ 1,0,-1]), ", Expected: ", 0)
print( "Obtained: ",cos_distance([1,0,-1], [-1,0, 1]), ", Expected: ", 2)
print( "Obtained: ",cos_distance([1,0,-1], [ 0,1, 0]), ", Expected: ", 1)


### Problem 2
Design a function that, given a list of lists (a matrix) of numbers, returns the position of the maximum value as a tuple. For example, given the following matrix:

```
matrix = [
    [7, 5, 3],
    [2, 4, 9],
    [1, 6, 8]
]
```

The method should return the tuple ``(1, 2)``.

In [None]:
# Use this cell to write your code
def get_maximum_position( matrix ):
    return

In [None]:
# Try it
matrix = [[7, 5, 3], [2, 4, 9], [1, 6, 8]]
print(get_maximum_position(matrix)) # Should print (1,2)

matrix = [[1,2,3],[4,5,6],[7,8,9]]
print(get_maximum_position(matrix)) # Should print (2,2)

## Optional

Try to solve these problems using list comprehension syntax or calling functions from NumPy. Compare the performance of each approach with the `%timeit` magic (Colab/Jupyter). To make the measurement meaningful, use large inputs and run several iterations or repeated calls.
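
Outside a notebook, the standard-library `timeit` module offers measurements comparable to `%timeit`. The sketch below times two hypothetical functions that are deliberately unrelated to the problems above; the sizes and the number of repetitions are arbitrary choices.

```python
import timeit

def squares_loop(n):
    # Explicit loop with append
    result = []
    for x in range(n):
        result.append(x * x)
    return result

def squares_comprehension(n):
    # Same result with a list comprehension
    return [x * x for x in range(n)]

n = 10_000
# number=100 runs each call 100 times; the result is the total time in seconds
t_loop = timeit.timeit(lambda: squares_loop(n), number=100)
t_comp = timeit.timeit(lambda: squares_comprehension(n), number=100)

print(f"loop: {t_loop:.4f}s | comprehension: {t_comp:.4f}s")
```

Timings vary between machines and runs, so compare the two numbers with each other rather than reading them as absolute values.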

___
# Pandas
## Python Data Analysis Library
Pandas is an open-source library that provides high-performance data structures and data analysis tools for Python.
https://pandas.pydata.org/

The basic data structure of Pandas is the dataframe. A dataframe is a collection of tabular data, similar to a SQL table. Every dataframe has an index that labels its rows.

The introductory guide to pandas covers more content than we are going to use here, but it is worth a quick look: https://pandas.pydata.org/docs/getting_started/overview.html

__NumPy__ (http://www.numpy.org/) is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
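
To give an idea of what the N-dimensional array and broadcasting mean in practice, here is a minimal sketch with invented package sizes.

```python
import numpy as np

# Illustrative data: 3 packages x (width, height, length)
sizes_cm = np.array([[10, 20, 30],
                     [15, 25, 35],
                     [50, 60, 70]])

# A scalar is broadcast to every element of the array
sizes_in = sizes_cm * 0.393701

# A 1-D array of length 3 is broadcast across every row
scale = np.array([1, 2, 3])
scaled = sizes_cm * scale

# Reduction along axis 1: the product of each row gives the volume
volumes = sizes_cm.prod(axis=1)

print(sizes_in.shape)
print(scaled[0])
print(volumes)
```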

__Important functions__
Some cells contain methods or functions that are going to be used in the next workshops. Those cells are marked with a comment like this 💥💥💥

In [None]:
import pandas as pd
import numpy as np

The next cell downloads the required data set to carry out the workshop. This kind of code executes a shell command, in this case `curl` to download the file from the internet. The `!` at the beginning of the line indicates that the command is not Python code, but a shell command.

If you are not following the recommended environment setup, you may need to download the file manually from the repository and put it in the same folder as the notebook.

In [None]:
!curl https://raw.githubusercontent.com/APSV-UPM/BusinessIntelligence/main/data/orders.csv > orders.csv
!curl https://raw.githubusercontent.com/APSV-UPM/BusinessIntelligence/main/data/customers.json > customers.json
!curl https://raw.githubusercontent.com/APSV-UPM/BusinessIntelligence/main/data/packages.zip > packages.zip

### Load data
We can load data into dataframes from different sources
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [None]:
# 💥💥💥
trucks_data = [['9073 YGP', 130, 150, 200,  300],
    ['3881 KCC', 130, 200, 245,  400],
    ['1845 GDS', 130, 245, 250,  600],
    ['8725 MHH', 130, 270, 245, 1300]]
trucks = pd.DataFrame(trucks_data, columns = ['plate', 'max_cargo', 'height', 'width', 'length'])
customers = pd.read_json("customers.json")
orders = pd.read_csv("orders.csv")
packages = pd.read_csv("packages.zip")

### Accessing the dataframe
Evaluating a DataFrame variable displays its content in a table format.

In [None]:
trucks

In [None]:
customers

In [None]:
orders

In [None]:
packages

### Index
By default, a dataframe's index is a column of incrementing integers.

In [None]:
# Accessing the rows with index within the specified range
orders[1:3]

In [None]:
# To get a single row by its position we use the indexer iloc
orders.iloc[1]

In [None]:
# Getting the index of the dataframe (an Index object, not a plain list)
orders.index

### Access to data column
We can retrieve a column's data as if the dataframe were a dictionary.

In [None]:
# 💥💥💥
# Get a single column
customers["province"]
# Or
customers.province

In [None]:
# Get multiple columns (this returns a dataframe with the selected columns)
customers[["lat","lng"]]

In [None]:
# 💥💥💥
# Get unique values from a column
customers["province"].unique()

### Describe data
When we evaluate a dataframe, pandas will show a table with its contents. If there are many rows, it only shows the beginning and the end.

In [None]:
packages

We can use the method `describe` to know more about one column or the whole dataset. If the column contains numerical values, it calculates statistical measures like the average, count, max, and min. On the other hand, if the column contains strings, it counts the unique values and returns the most frequent one.

In [None]:
# 💥💥💥
packages.describe()

In [None]:
packages["height"].describe()

In [None]:
# 💥💥💥
# List columns and their types
customers.dtypes

In [None]:
# 💥💥💥
# info() prints information about the dataframe including the index dtype and columns, non-null values and memory usage 
orders.info()

### Searching data
We will need to get the rows that match certain conditions; we do this using the `loc` indexer.

In [None]:
# 💥💥💥
# Searching all packages smaller than 100cm
packages.loc[packages["height"]<100]

In [None]:
# Searching for customers whose name contains the string 'Ruiz'
customers.loc[customers["name"].str.contains("Ruiz")]

In [None]:
# 💥💥💥
# If we want to retrieve a row and we know its position, we can use iloc (with the default integer index, position and index label coincide)
customers.iloc[23]

### Filtering
Sometimes we will need to filter our data: discard empty values, keep only variables in a range, etc.

In [None]:
# Boolean operations on a column return a column of boolean values
# These boolean columns are called masks
packages["width"]<120

In [None]:
# We will use those columns to filter our data
packages_filtered = packages[packages["width"]<120]

In [None]:
# To filter by the contents of a string column
# na=False in the contains call avoids errors if the column contains NaN values
customers[customers.province.str.contains("ia", na=False)]
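
Several masks can be combined into one filter. The sketch below uses a small invented dataframe rather than the workshop data, so it can be run on its own.

```python
import pandas as pd

# Small illustrative dataframe (not the workshop data)
df = pd.DataFrame({
    "width":  [50, 130, 90, 200],
    "height": [40, 80, 160, 30],
    "city":   ["Madrid", None, "Murcia", "Sevilla"],
})

# & is AND, | is OR, ~ is NOT; each condition needs its own parentheses
small = df[(df["width"] < 120) & (df["height"] < 100)]
wide_or_tall = df[(df["width"] > 150) | (df["height"] > 150)]
not_wide = df[~(df["width"] > 120)]

# na=False treats missing values as "no match"
contains_ur = df[df["city"].str.contains("ur", na=False)]

print(small)
print(wide_or_tall)
print(not_wide)
print(contains_ur)
```

Python's `and`, `or` and `not` keywords do not work element by element on columns, which is why the symbols `&`, `|` and `~` are used instead.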

### Modifying data in a column

In [None]:
# 💥💥💥
# We can apply a method to each value in a column
def to_inches(x):
    return x * 0.393701
print("Original")
print(packages["width"])
print("Modified")
# We call apply with the name of the method we want to execute over each value
print(packages["width"].apply(to_inches))

In [None]:
# ADVANCED
# We can do the same by using lambdas
print(packages["width"].apply(lambda x: x * 0.393701))

### Modifying single values

In [None]:
# 💥💥💥
# We can modify the content of a single cell with the accessor at. It works similarly to loc, but for a single value
# It receives the index label of the row and the name of the column we want to access. Watch out, we pass these parameters between brackets, not parentheses
# dataframe.at[row_index, "column_name"]
print("Previous value:",packages.at[0,"width"])
packages.at[0,"width"] = 100
print("New value:",packages.at[0,"width"])

### Creating new columns

In [None]:
def is_big(x):
    if x > 150:
        return True
    else:
        return False
packages["big"] = packages.height.apply(is_big)

In [None]:
# 💥💥💥
# We can apply a method to each row by passing the parameter "axis=1"
def volume(x):
    return x.height * x.length * x.width
packages["volume"] = packages.apply(volume, axis=1)

### Advanced (optional) — Vectorization, apply, and method chaining

In Pandas, vectorized operations (working on whole columns/Series) are usually much faster than row-wise `apply` or Python loops. Prefer:

- Vectorized arithmetic and boolean operations on Series/DataFrames
- Method chaining to express multi-step transformations clearly: `.assign(...).query(...).groupby(...).agg(...).sort_values(...)`

Key ideas:
- Avoid iterating rows with `for`, `iterrows`, or row-wise `apply` for simple arithmetic — use column operations instead.
- Use `.assign()` to create new columns within a pipeline, `.query()` for expressive filtering, and `.agg()` with named aggregation (`new_name=("column", "function")`) for clear aggregates.
- For large datasets, consider using categorical dtypes for string columns with fixed values from a limited set to reduce memory and speed up groupbys.
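
The key ideas above can be sketched as a single pipeline. The dataframe, its values and the threshold are hypothetical and chosen only to keep the example self-contained.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values, not the workshop data)
df = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3],
    "width":    [10, 20, 30, 40, 50],
    "height":   [10, 10, 10, 10, 10],
    "length":   [10, 10, 10, 10, 10],
})

summary = (
    df
    # New column computed with vectorized arithmetic
    .assign(volume=lambda d: d["width"] * d["height"] * d["length"])
    # Keep only the rows matching the expression
    .query("volume >= 2000")
    # Named aggregation: output_name=(input_column, function)
    .groupby("order_id")
    .agg(n_packages=("volume", "count"), mean_volume=("volume", "mean"))
    # Largest groups first
    .sort_values("n_packages", ascending=False)
)
print(summary)
```

Each step returns a new dataframe, so `df` itself is left untouched and the pipeline can be re-run safely.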

In [None]:
# Demo: vectorized vs apply vs loop for computing volume
import numpy as np
import pandas as pd
import time

# Ensure required columns exist
assert all(col in packages.columns for col in ["width","height","length"]), "packages must have width, height, length"

# 1) Vectorized computation
start = time.perf_counter()
vol_vec = packages["width"] * packages["height"] * packages["length"]
t_vec = time.perf_counter() - start

# 2) Row-wise apply
start = time.perf_counter()
vol_apply = packages.apply(lambda r: r["width"]*r["height"]*r["length"], axis=1)
t_apply = time.perf_counter() - start

# 3) Python loop with iterrows (slowest)
start = time.perf_counter()
vol_loop = []
for _, r in packages.iterrows():
    vol_loop.append(r["width"]*r["height"]*r["length"])
t_loop = time.perf_counter() - start

# Sanity check
ok = np.allclose(vol_vec.values, vol_apply.values) and np.allclose(vol_vec.values, np.array(vol_loop))
print(f"Equal results across methods? {ok}")
print(f"Vectorized: {t_vec:.4f}s | apply(axis=1): {t_apply:.4f}s | iterrows loop: {t_loop:.4f}s")

# Save vectorized result into the dataframe (preferred)
packages["volume_vec"] = vol_vec
packages.head()

### Rename columns

In [None]:
# The parameter columns is a dictionary with the old names as keys and new names as values

# Most of the methods that modify a dataframe do not change the dataframe itself;
# instead they return a copy of it with the modification applied
orders.rename(columns={"VAT_number":"client_VAT"})
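
The copy-versus-original behaviour can be checked with a small invented dataframe:

```python
import pandas as pd

# Illustrative dataframe (not the workshop data)
df = pd.DataFrame({"VAT_number": ["A1", "B2"], "total": [10, 20]})

renamed = df.rename(columns={"VAT_number": "client_VAT"})

print(df.columns.tolist())       # the original keeps the old name
print(renamed.columns.tolist())  # the copy has the new name

# To keep the change, assign the result back to the variable
df = df.rename(columns={"VAT_number": "client_VAT"})
print(df.columns.tolist())
```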

### Group by
We can execute group-by operations similar to SQL's `GROUP BY`. They create groups of rows with the same value in a column (or several columns), and then we can apply functions to the groups, like count, sum, average, etc.

https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

The basic syntax is `df.groupby("column").function()` or `df.groupby(["column1", "column2"]).function()`

In [None]:
# 💥💥💥
# Group by order and count the number of rows in each group
packages.groupby("order_id").count()
# There are many predefined aggregating functions like: first, last, median, sum, mean, max, min, etc
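
One detail worth knowing is that `count()` ignores missing values while `size()` counts rows. The sketch below uses invented data with one missing weight to show the difference.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe with one missing weight
df = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "weight":   [5.0, np.nan, 2.0, 3.0, 4.0],
})

# count() counts the non-null values of every column in each group
counts = df.groupby("order_id").count()
print(counts)

# size() counts the rows in each group, including rows with missing values
sizes = df.groupby("order_id").size()
print(sizes)

# Selecting a column first applies the function only to that column
means = df.groupby("order_id")["weight"].mean()
print(means)
```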

### Join dataframes
There are methods to perform joins of two or more dataframes. The basic one, and the only one we need in these workshops, is the method `merge`.

If you are curious, the guide to all join types is available here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
# 💥💥💥
# Merge both dataframes
data = orders.merge(packages, on="order_id").merge(customers, on="VAT_number")

data
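
By default `merge` performs an inner join, so rows without a match are dropped. The sketch below contrasts it with a left join on two small invented dataframes (the `_demo` names are illustrative).

```python
import pandas as pd

# Illustrative dataframes (not the workshop data)
orders_demo = pd.DataFrame({"order_id": [1, 2, 3], "VAT_number": ["A", "B", "C"]})
packages_demo = pd.DataFrame({"order_id": [1, 1, 2], "weight": [5, 7, 2]})

# Default how="inner": only order_ids present in both dataframes are kept
inner = orders_demo.merge(packages_demo, on="order_id")

# how="left": every row of the left dataframe is kept; missing matches become NaN
left = orders_demo.merge(packages_demo, on="order_id", how="left")

print(inner)
print(left)
```

Comparing `len()` before and after a merge is a quick way to notice rows that were dropped or duplicated.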

### Save data
Saving a dataframe is quite similar to loading data: we use the methods ``dataframe.to_csv(filename)`` or ``dataframe.to_json(filename)``

In [None]:
data.to_json("data.json")

## Exercises

Display your solutions using the `print` function. For example `print("The solution is " + str(42))` (do not forget to convert numbers to strings with `str()`)

### Exercise 1
Calculate the number of trucks and customers in our datasets

### Exercise 2
Count the number of different provinces that appear in the customers dataframe

### Exercise 3
Show the 10 heaviest packages (tip: you can use the method `sort_values` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html). In this case, instead of using the `print` function, just evaluate the dataframe to show it.

#### Optional advanced exercise — Method chaining and performance

- Build a pipeline that:

  1) starts from `packages` (or from the merged dataframe `data` if `province` is not a column of `packages`)
  2) creates `volume = width*height*length` and `weight_kg = weight/1000` using `.assign(...)`
  3) filters large packages with `.query("volume > 150000")`
  4) groups by `province` and `big`, aggregates `count` and average `weight_kg`
  5) sorts by the count descending

- Implement it twice:

  A) using vectorized operations inside a chaining pipeline;
  B) using `.apply(axis=1)` to compute `volume` and `weight_kg`.

- Compare timings with `%timeit` (Colab/Jupyter) and comment on readability.

___
# Matplotlib and Seaborn

__Matplotlib__ (https://matplotlib.org/) is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

__Seaborn__ (https://seaborn.pydata.org/) is a Python library built on top of Matplotlib which offers an easy interface to create plots using dataframes.

In [None]:
# To import these libraries we use these lines
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# 💥💥💥
# To create a new figure, we will always follow these steps
# Note that the function returns two values, fig and ax; returning several values at once is common in Python
fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(7,5))
# The three parameters nrows, ncols and figsize are optional
# We can define a matrix of subplots by changing the values of nrows and ncols
# We can modify the size of the figure (width and height, in inches) with figsize
# If we don't specify this parameter, matplotlib uses its default figure size

# Here we will define the charts, labels, grid, etc

# Finally we should tell matplotlib to show the figure we have defined
plt.show()

In [None]:
# The variable ax is a numpy array with all the subplots that we have defined
# We can access each subplot by its coordinates
ax[0,0] # This is the same as ax[0][0]

In [None]:
fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(7,5))

# Sample function
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

# To plot a line in a certain subplot we can use
ax[0,0].plot(t,s) # t are the horizontal values and s the vertical ones
ax[1,0].plot(s,'g') # If we define only one series it will be taken as the vertical values
# We can define the color of a plot after the values of the line

plt.show()

## Plots with Matplotlib
There are a huge number of different plots available in matplotlib; we are going to see the most basic ones:
https://matplotlib.org/api/axes_api.html#plotting

### Line plot
https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html?highlight=plot#matplotlib.axes.Axes.plot

We can define a string after the data to modify the style of the line which will be plotted. These strings are defined in this way `'[color][marker][line style]'`, for example `'b--'` is a dashed blue line or `'ro'` is a sequence of red circles.

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Plot
ax.plot(t,s) 
ax.plot(t,2*s,'g--') # We can plot several lines in the same chart
ax.plot(t,4*s,'r,')

plt.show()

### Bar plot
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Sample data about programming language popularity
x = ["Python", "Java", "Javascript", "C#", "PHP"]
y = [25.13,21.98,8.35,7.5,7.36]

# Plot
ax.bar(x,y) 

plt.show()

### Scatter plot
https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html#matplotlib.axes.Axes.scatter

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Sample data
x = np.arange(0, 10, .1)
y = (x+np.random.rand(100))**2

# Plot
ax.scatter(x,y) 
ax.scatter(x,-y)

plt.show()

### Histogram plot
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

# Sample data
marks = [5.9, 7.9, 7.4, 6.2, 5.7, 8.3, 6  , 6.4, 8.1, 7.1, 5.7, 6.8, 6.2, 6.9, 7.2, 6.8, 7.3, 8.3, 5.3, 7.2, 5.8, 
         5.3, 6.8, 6.9, 5.5, 5.6, 7.7, 8.4, 6.5, 4.4, 6.8, 9.3, 5.7, 9.5, 7.1, 5.9, 8.2, 9  , 6.6, 6.8, 4.9, 6.9, 
         6.2, 6.8, 9.1, 5.8, 7.3, 4.7, 7.4, 3.1, 8.5, 7.9, 5.8, 7.9, 5.1, 5.2, 7.8, 6.3, 6.5, 5.3, 7.5, 6.8, 6.6, 
         6.7, 7.8, 7.6, 10 , 5.8, 8.1, 7.8, 8.5, 5.4, 8.1, 3.6, 6  , 8  , 6.1, 4.9, 6.3, 5.2, 7.3, 7  , 6.7, 5.9, 
         4.2, 5.2, 8.5, 9.2, 7.1, 8.7, 6.6, 8  , 6.9, 5  , 5.9, 8.1, 7  , 8.2, 7.7, 4.2]

# Plot
ax.hist(marks)

plt.show()

## Axes modifications and labels

If we do not configure anything, matplotlib will generate axis scales just big enough to fit the data. Sometimes we may want to define a certain range (e.g., if we are plotting exam marks, we may want to plot from 0 to 10 on the horizontal axis) or we may want to use a different precision.

In addition, it is good practice to include labels on both axes and a title above the chart.

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

ax.hist(marks)

# We can specify the ranges of both axes
ax.set_xlim(0, 10)
ax.set_ylim(0, 30)

# We can also specify the divisions of the axes
ax.set_yticks(np.arange(0,32,2))

# Activate the grid
ax.grid()

# Set a title
ax.set_title("Histogram of last exam's marks")

# Set axes labels
ax.set_xlabel("Grade")
ax.set_ylabel("Number of students")

plt.show()

## Exporting images
We can export our images to use them anywhere. Whenever possible, prefer vector formats (like svg, eps or pdf).

In [None]:
# We specify the file name and the format
# bbox_inches='tight' trims the empty margins around the figure
fig.savefig("marksHistogram.pdf", bbox_inches='tight')

## Plotting with Seaborn
Seaborn can be used in a similar way to matplotlib, by creating a figure and then plotting something on it.

In [None]:
# 💥💥💥
fig, ax = plt.subplots(figsize=(7,5))

# The parameter 'data' defines the dataframe used to generate the chart
# We need to indicate which columns will be used in each axis. 'x' and 'y' in this case
# To use the figure we will set the parameter 'ax' equal to the axis we have just defined
sns.scatterplot(data=customers, x="lat",y="lng", ax=ax)

## Categorical plots

In [None]:
fig, ax = plt.subplots(figsize=(15,7))

# countplot displays a bar chart with the sample's count of each category
sns.countplot(data=orders, y='VAT_number', ax=ax)

## Distribution plots

In [None]:
packages

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(14,12))

# histplot creates a histogram of the indicated column
sns.histplot(data=data, x="volume", ax=ax[0], bins=50)

# boxplots are used to display the distribution of values
# With the 'hue' parameter we can define a column to group the data
sns.boxplot(data=data, x="province", y="weight", ax=ax[1], hue="big")
# Rotating the labels
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=60)

## Relation plots

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(14,6))

# Regplot 
sns.regplot(x="width",y="volume",data=packages, ax=ax[0])

# Scatterplot
sns.scatterplot(data=packages, x="width", y="height", size="volume", ax=ax[1])

### Advanced (optional) — Styling, axes-level vs figure-level, and faceting

Seaborn offers two styles of APIs:
- Axes-level functions (e.g., `sns.scatterplot`, `sns.boxplot`) which draw into a Matplotlib Axes you provide.
- Figure-level functions (e.g., `sns.relplot`, `sns.catplot`, `sns.displot`) which create a figure and return a FacetGrid, making it easy to create multi-plot "small multiples" (facets).

Tips:
- Use `sns.set_theme(style="whitegrid", context="notebook", palette="deep")` to establish a style across all plots.
- Prefer figure-level functions (`relplot`, `catplot`, `displot`) when you need faceting by a categorical variable.
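
A minimal sketch of a figure-level function with faceting is shown below. The dataframe is invented for illustration; with the workshop data you would pass `packages` or `data` instead.

```python
import pandas as pd
import seaborn as sns

# Small illustrative dataframe (hypothetical values, not the workshop data)
df = pd.DataFrame({
    "width":  [10, 20, 30, 40, 50, 60],
    "height": [15, 25, 20, 45, 40, 70],
    "big":    [False, False, False, True, True, True],
})

# Apply one theme to every plot created afterwards
sns.set_theme(style="whitegrid")

# Figure-level function: it creates its own figure, with one panel per value of "big"
grid = sns.relplot(data=df, x="width", y="height", col="big", kind="scatter")
grid.set_axis_labels("Width", "Height")

# The returned FacetGrid exposes the underlying Matplotlib axes as a 2-D array
print(grid.axes.shape)
```

Note that figure-level functions do not take an `ax` parameter, since they manage the figure themselves.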

## Problems

### Problem 1
Plot in a bar chart the number of customers in each province

In [None]:
# Use this cell to write your code

### Problem 2
Combine two kinds of charts (line, bars, scatter, etc) in one plot. You can use the data from previous cells or generate new data.

In [None]:
# Use this cell to write your code

#### Optional advanced exercise — Faceted plots and aesthetics

- Create a figure-level faceted plot (`sns.relplot` or `sns.catplot`) showing two variables of interest, faceted by a categorical column (e.g., `province` or `big`).
- Compare the same information using axes-level functions in a 1×2 layout with `plt.subplots`.
- Experiment with theme settings (`sns.set_theme`) and annotate axes titles/labels. Use `%timeit` (optional) to compare rendering times for small vs large subsets (e.g., `packages.sample(n=2000, random_state=0)`).

<!-- Instructor-only (optional): Uncomment the cell below if you want to show a NumPy vectorized solution and quick assertions in class. -->