# Example 1

## Reading CSV Files in Python

CSV is a file format used to store tabular data (e.g. spreadsheets).  
Each line represents a row, and each value is separated by a delimiter (usually a comma or semicolon).

To read the file row by row, we need a CSV reader object, which Python's built-in `csv` module provides.
The `delimiter` argument tells Python how values are separated (`,` or `;` depending on the file).

In [17]:
import csv

with open("youtube_data.csv") as csv_file:
    reader = csv.reader(csv_file, delimiter=";")
    for row in reader:
        print(row)

['Video', 'Date', 'Views']
['Informatik 1: Session 1 - Installation of Python and Setup', 'Oct 9, 2020', '547']
['Informatik 1: Session 2 - Basics', 'Oct 13, 2020', '425']
['Informatik 1: Session 3 - Lists, For, While, Range', 'Oct 15, 2020', '250']
['Informatik 1: Session 4 - Functions, Tuples and Sets', 'Oct 21, 2020', '416']
['Informatik 1: Session 5 - Dictionaries, File I/O and Codingstandard', 'Oct 21, 2020', '338']
['Informatik 1: Session 6 - error messages, how to read testreports, Ass1 Q&A', 'Oct 25, 2020', '165']
['Informatik 1: Session 7 - Names, Variables, Scope, Namespace and Pythontutor', 'Oct 28, 2020', '139']
['Informatik 1: Session 8 - Error handling, Built-Ins, Import, PIP', 'Oct 29, 2020', '159']
['Informatik 1: Session 9 - Imports, PIP, How to approach tasks', 'Oct 31, 2020', '147']
['Informatik 1: Session 10 - numpy', 'Nov 4, 2020', '122']
['Informatik 1: Lecture 3', 'Oct 20, 2020', '178']
['Informatik 1: Lecture 4', 'Oct 27, 2020', '136']
['Informatik 1: Lecture 5'
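
Note that `csv.reader` returns every value as a string, including the header row. The following is a minimal sketch of how the header could be separated and the view counts converted to integers. The in-memory `sample` text and the variable names are assumptions for illustration; it only mimics the `;`-separated layout of `youtube_data.csv`.

```python
import csv
import io

# Hypothetical in-memory sample with the same ";"-separated layout as youtube_data.csv
sample = io.StringIO(
    "Video;Date;Views\n"
    "Informatik 1: Session 2 - Basics;Oct 13, 2020;425\n"
    "Informatik 1: Session 10 - numpy;Nov 4, 2020;122\n"
)

reader = csv.reader(sample, delimiter=";")
header = next(reader)            # the first row holds the column names
total_views = 0
for row in reader:
    total_views += int(row[2])   # values arrive as strings, so convert explicitly

print(header)        # ['Video', 'Date', 'Views']
print(total_views)   # 547
```

Because the delimiter is `;`, the comma inside a date such as `Oct 13, 2020` does not split the value.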


# Example 2

## Working with Pandas

Most Python data work uses the [**Pandas**](https://pandas.pydata.org/) library — it makes data handling *much* easier.

By convention, `pandas` is imported as `pd`.

Pandas is designed for data analysis and data science, so it’s the go-to tool once you start working with real datasets. It allows for easy data preview, filtering and sorting, handling missing values, statistical operations, and more.

In [18]:
# import pandas
# one usually imports pandas as pd:
import pandas as pd

### Read the CSV file

`pd.read_csv()` reads the CSV and automatically converts it into a **DataFrame** (a table-like structure).

In [19]:
# data extraction
data = pd.read_csv("youtube_data.csv", delimiter=";")


`data` is now a full DataFrame object — you can inspect, analyze, and visualize it.

When you type the variable name in a notebook, Jupyter shows a formatted table view of your data.
Each row is one record (e.g., one YouTube video), and each column is a property like Video, Date, or Views.

In [20]:
data

|   | Video | Date | Views |
|---|-------|------|-------|
| 0 | Informatik 1: Session 1 - Installation of Pyth... | Oct 9, 2020 | 547 |
| 1 | Informatik 1: Session 2 - Basics | Oct 13, 2020 | 425 |
| 2 | Informatik 1: Session 3 - Lists, For, While, R... | Oct 15, 2020 | 250 |
| 3 | Informatik 1: Session 4 - Functions, Tuples an... | Oct 21, 2020 | 416 |
| 4 | Informatik 1: Session 5 - Dictionaries, File I... | Oct 21, 2020 | 338 |
| 5 | Informatik 1: Session 6 - error messages, how ... | Oct 25, 2020 | 165 |
| 6 | Informatik 1: Session 7 - Names, Variables, Sc... | Oct 28, 2020 | 139 |
| 7 | Informatik 1: Session 8 - Error handling, Buil... | Oct 29, 2020 | 159 |
| 8 | Informatik 1: Session 9 - Imports, PIP, How to... | Oct 31, 2020 | 147 |
| 9 | Informatik 1: Session 10 - numpy | Nov 4, 2020 | 122 |
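
To try `pd.read_csv()` without the original file, the sketch below reads a small in-memory text instead. The `sample` content and the name `df` are assumptions for illustration; only the `;` delimiter and the column names are taken from the example above.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for youtube_data.csv
sample = io.StringIO(
    "Video;Date;Views\n"
    "Informatik 1: Session 2 - Basics;Oct 13, 2020;425\n"
    "Informatik 1: Session 10 - numpy;Nov 4, 2020;122\n"
)

df = pd.read_csv(sample, delimiter=";")

print(df.shape)           # (rows, columns) -> (2, 3)
print(list(df.columns))   # column names taken from the first line
print(df["Views"].sum())  # Views was parsed as numbers, so it can be summed -> 547
```

Unlike `csv.reader`, `read_csv` treats the first line as column names and infers a numeric type for `Views`.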


### Inspect the data

Pandas includes built-in inspection methods:

- `data.head()` - displays the first 5 rows of your dataset
- `data.info()` - displays summary information about the dataset
- `data.describe()` - displays quick statistics (mean, min, max, etc.) on numeric columns

In [21]:
# data inspection
data.head()

|   | Video | Date | Views |
|---|-------|------|-------|
| 0 | Informatik 1: Session 1 - Installation of Pyth... | Oct 9, 2020 | 547 |
| 1 | Informatik 1: Session 2 - Basics | Oct 13, 2020 | 425 |
| 2 | Informatik 1: Session 3 - Lists, For, While, R... | Oct 15, 2020 | 250 |
| 3 | Informatik 1: Session 4 - Functions, Tuples an... | Oct 21, 2020 | 416 |
| 4 | Informatik 1: Session 5 - Dictionaries, File I... | Oct 21, 2020 | 338 |


In [22]:
# data inspection
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Video   13 non-null     object
 1   Date    13 non-null     object
 2   Views   13 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 444.0+ bytes


In [23]:
# data inspection
data.describe()

|       | Views |
|-------|-------|
| count | 13.0 |
| mean  | 239.538462 |
| std   | 144.532358 |
| min   | 92.0 |
| 25%   | 139.0 |
| 50%   | 165.0 |
| 75%   | 338.0 |
| max   | 547.0 |


### Selecting rows and columns

You can select specific parts of your data using **conditions** and **labels**.

In `data.loc[data.Views > 200, ["Video", "Date"]]`:

- `data.Views > 200` - creates a condition (a Boolean Series with one True/False value per row)
- `.loc[...]` - keeps only the rows where that condition is True
- `["Video", "Date"]` - selects only these two columns to display

Result: A smaller table containing only videos with more than 200 views.

In [24]:
# data inspection / data cleaning
data.loc[data.Views > 200, ["Video", "Date"]]

|   | Video | Date |
|---|-------|------|
| 0 | Informatik 1: Session 1 - Installation of Pyth... | Oct 9, 2020 |
| 1 | Informatik 1: Session 2 - Basics | Oct 13, 2020 |
| 2 | Informatik 1: Session 3 - Lists, For, While, R... | Oct 15, 2020 |
| 3 | Informatik 1: Session 4 - Functions, Tuples an... | Oct 21, 2020 |
| 4 | Informatik 1: Session 5 - Dictionaries, File I... | Oct 21, 2020 |
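
The row condition and the column list are two separate arguments to `.loc`. Here is a small sketch that builds them step by step. The three-row DataFrame and the names `df`, `mask` and `popular` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature version of the YouTube table
df = pd.DataFrame({
    "Video": ["Session 1", "Session 2", "Session 10"],
    "Date": ["Oct 9, 2020", "Oct 13, 2020", "Nov 4, 2020"],
    "Views": [547, 425, 122],
})

mask = df["Views"] > 200                   # one True/False value per row
popular = df.loc[mask, ["Video", "Date"]]  # rows where mask is True, two columns only

print(mask.tolist())           # [True, True, False]
print(list(popular.columns))   # ['Video', 'Date']
print(len(popular))            # 2
```

`df["Views"]` and `df.Views` refer to the same column. The bracket form also works for column names that contain spaces.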


Each column in a DataFrame is actually a Pandas **Series**: basically, a one-dimensional array of data with labels.

In [25]:
print(type(data.Views))

<class 'pandas.core.series.Series'>


In [26]:
print(data.Views)

0     547
1     425
2     250
3     416
4     338
5     165
6     139
7     159
8     147
9     122
10    178
11    136
12     92
Name: Views, dtype: int64


**Boolean filtering** 

- The comparison `data.Views > 200` creates a Series of Boolean values, which you can use directly inside `.loc` to filter rows.

In [27]:
print(list(data.Views > 200))

[True, True, True, True, True, False, False, False, False, False, False, False, False]
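
A Boolean Series can also be counted and combined. The sketch below uses a hypothetical five-value `views` Series. Summing a mask counts the `True` entries, and two conditions can be joined with `&`, each wrapped in parentheses.

```python
import pandas as pd

# Hypothetical subset of view counts
views = pd.Series([547, 425, 250, 165, 139], name="Views")

mask = views > 200
print(mask.sum())   # True counts as 1, so this is the number of matches -> 3

# Combine two conditions; the parentheses around each comparison are required
between = (views > 150) & (views < 300)
print(views[between].tolist())   # [250, 165]
```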


`sum()` adds up all values in a column, e.g. `Views`.

In [28]:
print(data.Views.sum())

3114


**Normalizing values**

Let’s compute each video’s share of the total views, which is useful for comparing relative performance.

In [29]:
print(data.Views / data.Views.sum())

0     0.175658
1     0.136480
2     0.080283
3     0.133590
4     0.108542
5     0.052987
6     0.044637
7     0.051060
8     0.047206
9     0.039178
10    0.057161
11    0.043674
12    0.029544
Name: Views, dtype: float64
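
As a quick sanity check, the shares of all videos should add up to 1. The sketch below rebuilds the `Views` column from the values printed earlier. The names `views` and `share` are only illustrative.

```python
import pandas as pd

# View counts copied from the printed Views column
views = pd.Series(
    [547, 425, 250, 416, 338, 165, 139, 159, 147, 122, 178, 136, 92],
    name="Views",
)

share = views / views.sum()   # element-wise division by a single number

print(views.sum())                    # 3114
print(abs(share.sum() - 1) < 1e-12)   # shares add up to 1 (within rounding) -> True
print((share * 100).round(1).head(3).tolist())   # first three shares in percent
```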


When you access a column in a DataFrame (e.g. `data.Views`),  
you get a **Pandas Series** — a one-dimensional structure that stores labeled data.

But sometimes, you only want the **raw numerical data**, without labels.  
That’s what `.values` gives you.

In [30]:
data.Views.values

array([547, 425, 250, 416, 338, 165, 139, 159, 147, 122, 178, 136,  92],
      dtype=int64)

In [31]:
type(data.Views.values)

numpy.ndarray

`.values` converts a Pandas Series or DataFrame into a `NumPy` array.

`NumPy` arrays are the foundation of numerical work in Python.
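
The pandas documentation recommends the `.to_numpy()` method as the preferred way to get a NumPy array. A minimal sketch with a hypothetical three-value Series shows that both give the same result here:

```python
import numpy as np
import pandas as pd

# Hypothetical short Series for illustration
views = pd.Series([547, 425, 250], name="Views")

arr = views.to_numpy()   # same data as views.values, without index labels

print(type(arr))                           # <class 'numpy.ndarray'>
print(arr.tolist())                        # [547, 425, 250]
print(np.array_equal(arr, views.values))   # True
```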

In general: ALWAYS read the documentation of a function / method / property if you use external modules / packages / libraries!


# Example 3

## NumPy

NumPy is a core library for scientific computing.  
It’s used for fast calculations on large numerical datasets, typically much faster than looping over Python lists.

- `np.array()` creates a **NumPy array**.
- Arrays can have any number of dimensions; this example uses **1D (vector)** and **2D (matrix)** arrays.
- NumPy arrays store data more compactly and allow fast math operations.

In [32]:
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

Checking array properties:
- `.shape` - size of the array along each dimension; for a matrix this is `(rows, columns)`, e.g. `(2, 3)`: 2 rows, 3 columns
- `.ndim` - number of dimensions; e.g. `1` for a vector, `2` for a matrix

In [33]:
print(matrix.shape)
print(matrix.ndim)

(2, 3)
2
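
For comparison, here is a short sketch of the same properties for the 1D `vector`, plus `.size`, which gives the total number of elements:

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(vector.shape)   # (3,)  a one-element tuple: 3 entries along the only axis
print(vector.ndim)    # 1
print(matrix.size)    # 6     total number of elements (2 * 3)
```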


**Slicing**

Slicing in NumPy works just like slicing lists, but it is more powerful: you can slice along several dimensions at once.

In [34]:
print(vector[0])     # first element
print(vector[0:1])   # first element as array
print(vector[0:2])   # first two elements
print(vector[::-1])  # reverse the array

1
[1]
[1 2]
[3 2 1]


In [35]:
print(matrix[0, :])  # first row
print(matrix[:, 0])  # first column
print(matrix[1, :])  # second row
print(matrix[0, 0])  # single element

[1 2 3]
[1 4]
[4 5 6]
1


**Remember:**  
- First index → row  
- Second index → column
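
Row and column slices can be combined in a single index expression. A few more slices on the same `matrix`, as a sketch:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

sub_block = matrix[:, 1:]        # all rows, columns 1 and 2
reversed_row = matrix[1, ::-1]   # second row, reversed
last = matrix[-1, -1]            # last row, last column

print(sub_block)      # [[2 3]
                      #  [5 6]]
print(reversed_row)   # [6 5 4]
print(last)           # 6
```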

One of NumPy’s biggest strengths is that it applies operations **element-wise**.

Example: calculating **BMI** (Body Mass Index):

In [36]:
participants_weight = np.array([68, 61, 75, 70])
participants_height = np.array([1.60, 1.50, 1.73, 1.80])

participants_bmi = participants_weight / participants_height ** 2

In [37]:
print(participants_bmi)

[26.5625     27.11111111 25.05930703 21.60493827]


**No loops needed** — NumPy automatically performs the math for each element.
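
To see what NumPy does for you, the sketch below computes the same BMI values once with an explicit Python loop and once with the vectorized expression. The names `bmi_loop` and `bmi_vectorized` are only illustrative.

```python
import numpy as np

weights = [68, 61, 75, 70]
heights = [1.60, 1.50, 1.73, 1.80]

# Plain Python: handle one participant at a time
bmi_loop = []
for w, h in zip(weights, heights):
    bmi_loop.append(w / h ** 2)

# NumPy: one expression for all participants
bmi_vectorized = np.array(weights) / np.array(heights) ** 2

print(np.allclose(bmi_loop, bmi_vectorized))   # True: both give the same values
```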

NumPy includes many built-in functions for quick statistics.

| Function | Meaning |
|-----------|----------|
| `.mean()` | Average |
| `.max()` | Maximum |
| `.min()` | Minimum |
| `.sum()` | Sum of all elements |
| `.std()` | Standard deviation (spread of values) |

In [38]:
participants_weight = np.array([50, 61, 75, 70])
print(participants_weight.mean())
print(participants_weight.max())
print(participants_weight.min())

64.0
75
50
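
The remaining two functions from the table work the same way. One detail worth knowing: NumPy's `.std()` divides by `n` by default (`ddof=0`), whereas pandas' `.std()` and `describe()` divide by `n - 1`. Passing `ddof=1` gives the pandas-style value. A short sketch:

```python
import numpy as np

participants_weight = np.array([50, 61, 75, 70])

total = participants_weight.sum()
std_population = participants_weight.std()     # divides by n (NumPy default, ddof=0)
std_sample = participants_weight.std(ddof=1)   # divides by n - 1 (pandas default)

print(total)            # 256
print(std_population)   # square root of 362 / 4
print(std_sample)       # square root of 362 / 3
```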


**Store data for men and women in separate rows**

You can store different groups of participants in separate rows of a 2D array.

Here:
- **Row 0** → Men  
- **Row 1** → Women

In [39]:
participants_weight = np.array([[68, 61, 75, 70],
                                [50, 53, 61, 59]])

participants_height = np.array([[1.60, 1.50, 1.73, 1.80],
                                [1.55, 1.65, 1.61, 1.73]])

In [40]:
print(participants_weight.shape)
print(participants_height.shape)

(2, 4)
(2, 4)


**Overall mean**

In [41]:
print(participants_weight.mean())
print(participants_height.mean())

62.125
1.6462500000000002


**Mean by gender**

Each row holds one gender, so averaging across each row (`axis=1`) gives one mean per gender.

In [42]:
print(participants_weight.mean(axis=1))
print(participants_height.mean(axis=1))

[68.5  55.75]
[1.6575 1.635 ]


**Understanding `axis`:**
- `axis=0`: go **down** the columns (aggregate vertically); the result has one value per column.
- `axis=1`: go **across** the rows (aggregate horizontally); the result has one value per row, here one per gender.

[For a detailed explanation on NumPy axes follow this link.](https://www.sharpsightlabs.com/blog/numpy-axes-explained/)
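
A quick way to check which axis you need is to look at the shape of the result. The sketch below uses the weight data from above. The names `per_column` and `per_row` are only illustrative.

```python
import numpy as np

participants_weight = np.array([[68, 61, 75, 70],    # row 0: men
                                [50, 53, 61, 59]])   # row 1: women

per_column = participants_weight.mean(axis=0)   # collapses the 2 rows
per_row = participants_weight.mean(axis=1)      # collapses the 4 columns

print(per_column.shape)   # (4,)  one value per column
print(per_row.shape)      # (2,)  one value per row, i.e. per gender
print(per_row.tolist())   # [68.5, 55.75]
```

The axis you pass is the one that disappears from the shape: `(2, 4)` becomes `(4,)` for `axis=0` and `(2,)` for `axis=1`.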

**Vectorized BMI Calculation (2D)**

- NumPy automatically matches elements by position (same row & column) 
- Works on entire matrices at once — no for-loops required.

In [43]:
participants_bmi = participants_weight / participants_height ** 2

In [44]:
print(participants_bmi)

[[26.5625     27.11111111 25.05930703 21.60493827]
 [20.81165453 19.46740129 23.53304271 19.71332153]]
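
Since the result keeps the `(2, 4)` layout, it can be aggregated per row just like the input arrays. Here is a sketch of the mean BMI per gender. The name `mean_bmi_by_gender` is only illustrative.

```python
import numpy as np

participants_weight = np.array([[68, 61, 75, 70],
                                [50, 53, 61, 59]])
participants_height = np.array([[1.60, 1.50, 1.73, 1.80],
                                [1.55, 1.65, 1.61, 1.73]])

participants_bmi = participants_weight / participants_height ** 2

# Row 0 holds the men, row 1 the women, so aggregate across each row
mean_bmi_by_gender = participants_bmi.mean(axis=1)

print(participants_bmi.shape)        # (2, 4): same layout as the inputs
print(mean_bmi_by_gender.round(2))   # one mean BMI per row
```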
