### Our goal in **Session03**

- about the basics of Control Flow in Python (e.g. how do we tell the computer what to do with data in this language);
- a bit more about functions in Python;
- a bit about defensive programming in Python;
- several new things that can be done with `pd.DataFrame`, such as
   - Pandas I/O operations
   - `apply` a function to a `pd.DataFrame` column
   - use `filter`, `groupby`, and `agg` to filter and produce data aggregates from `pd.DataFrame`

In [None]:
import os 
work_dir = os.getcwd()
print(work_dir) # Fine, where I should be
data_dir = os.path.join(work_dir, "_data")  # the data live in the _data subfolder of work_dir
display(data_dir)

### 2. Data: The Boston Housing Data Set

In [None]:
import pandas as pd

filename='BostonHousingData.csv'
data_set = pd.read_csv(os.path.join(data_dir, filename))  # the most technically correct way: build the path with os.path.join
display(data_set)

###### The same can be done with a hard-coded relative path

`data_set = pd.read_csv('_data/BostonHousingData.csv')`

In [None]:
display(data_set.head(5))

List column data types on data_set. This is how Pandas parses our data. Pay attention, this may not always give the best result in regard to recognized data types.

In [None]:
data_set.dtypes

Let's get more comprehensive information about our data set: .info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB

In [None]:
data_set.info()

Let's now invest some effort to **understand** the data set at hand before we proceed with the `pd.DataFrame` class:

- **crim**: per capita crime rate by town.

- **zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

- **indus**: proportion of non-retail business acres per town.

- **chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- **nox**: nitrogen oxides concentration (parts per 10 million).

- **rm**: average number of rooms per dwelling.

- **age**: proportion of owner-occupied units built prior to 1940.

- **dis**: weighted mean of distances to five Boston employment centres.

- **rad**: index of accessibility to radial highways.

- **tax**: full-value property-tax rate per \$10,000.

- **ptratio**: pupil-teacher ratio by town.

- **b** (named `black` in some versions of the data set): 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.

- **lstat**: lower status of the population (percent).

- **medv**: median value of owner-occupied homes in \$1000s.

### 3. Control Flow A: Iterating in Python

Iteration refers to the process of repeatedly executing a block of code to traverse the elements of a collection (like lists, tuples, strings, etc.) or to perform repetitive tasks. Python supports iteration using constructs like loops (`for` and `while`) or comprehensions.

**Iterable objects**

An iterable is an object capable of returning its members one at a time. Examples of iterable objects include:

- Lists: `[1, 2, 3]`
- Tuples: `(4, 5, 6)`
- Strings: `"abc"`
- Dictionaries: `{'a': 1, 'b': 2}`
- Sets: `{7, 8, 9}`
- Range objects: `range(5)`

All of these objects implement the `__iter__()` method, which returns an iterator. An iterator is an object that keeps track of where it is during iteration and provides the `__next__()` method to get the next element.
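
To make the iterator idea concrete, here is a minimal sketch of what a `for` loop does behind the scenes. The variable names (`numbers`, `iterator`, `exhausted`) are made up for illustration; `iter()` and `next()` are the built-in functions that call `__iter__()` and `__next__()`.

```python
# Manual iteration over a list, step by step.
numbers = [1, 2, 3]          # a list is an iterable
iterator = iter(numbers)     # calls numbers.__iter__() and returns an iterator

first = next(iterator)       # calls iterator.__next__()
second = next(iterator)
third = next(iterator)
print(first, second, third)

# Once the iterator is exhausted, next() raises StopIteration;
# a for loop treats that as the signal to stop.
try:
    next(iterator)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```
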
1. We will first grab some values from `data_set` and turn them into a list.

In [None]:
my_data = data_set['medv'][0:20]
print(list(my_data))

Task: check if the rounded values in `my_data` are even or not; print the result for each member of `my_data`.

In [None]:
for number in my_data:
    if round(number) % 2 == 0:
        print("Rounded " + str(number) + " is even")
    else:
        print("Rounded " + str(number) + " is odd")

**Sequences** in Python are iterables: **lists**, **strings**, and **tuples**.

In [None]:
my_data = (1, 2, '3', 4, 5, '7', 8, 9, '10')
for d in my_data:
    print(type(d))

my_data = "Belgrader"
for letter in my_data:
    print(letter) 


**Dictionaries** are **iterables** but **not sequences**.

In [None]:
my_data = {'a':1,
           'b':2,
           'c':3,
           'd':4,
           'e':5,
           'f':6}
for item in my_data:
    print(item)

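# Iterating over a dictionary yields only its keys. What about the values?
for item in my_data.values():
    print(item)

# my_data.values() is a view of the values only; to get keys and values together,
# iterate over the keys and look each value up:
print(my_data.values())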

for key in my_data:
    print('When ' + key + ' then ' + str(my_data[key]))

Alternatively, unpack each `(key, value)` pair from `my_data.items()` in the loop header:

In [None]:
for key, value in my_data.items():
    print('When ' + key + ' then ' + str(value))

Let's apply a 20% discount to all prices in `data_set['medv']`!

In [None]:
print("Original prices: ")
print(list(data_set['medv'][0:20]))
medv_discount = list(data_set['medv'])

for i in range(len(medv_discount)):  # i is an index, not a price
    medv_discount[i] = round(medv_discount[i] - medv_discount[i]*0.2, 2)

print('Discount prices: ')
print(medv_discount[0:20])

What is this: `range(len(medv_discount))`?

In [None]:
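print(len(medv_discount[0:20]))   # number of elements in the slice
print(list(range(5, 15)))         # integers from 5 up to, but not including, 15

print(list(range(len(medv_discount))))  # every valid index of medv_discount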
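
To see what `range(len(...))` produces, here is a small sketch on a made-up list. `prices` is a hypothetical stand-in for `medv_discount`, and its values are not taken from the data set.

```python
prices = [20.0, 30.0, 50.0]           # hypothetical values

indices = list(range(len(prices)))    # len(prices) is 3, so range(3) yields 0, 1, 2
print(indices)

# Looping over the indices lets us overwrite each element in place.
for i in range(len(prices)):
    prices[i] = round(prices[i] * 0.8, 2)   # apply a 20% discount
print(prices)
```

Looping over indices is only needed when the list is modified in place; when a new list is acceptable, iterating over the elements directly is simpler.
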

### 3. Control Flow B: list comprehension
Now, **this is interesting:**

What is a list comprehension?

* A list comprehension is a concise way to create a new list by applying an expression to each item in an existing iterable (like a list). It replaces a longer loop with a single line of code, improving readability and compactness.

`[expression for item in iterable]`

In [None]:
medv = list(data_set['medv'])
display(medv)

# - list comprehension:
medv_discount = [round(x - .2*x, 2) for x in medv]
print("Original prices: ")
print(list(data_set['medv'][0:20]))
print("Discount prices: ")
print(medv_discount[0:20])

The list comprehension in the cell above:

- iterates over each price `x` in the `medv` list;
- calculates the discounted price (20% off);
- rounds the discounted price to 2 decimal places;
- stores all discounted prices in a new list, `medv_discount`.
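
As a rough illustration of the steps listed above, the sketch below writes the same computation twice: once as an explicit loop with `append`, once as a list comprehension. `medv_sample` holds hypothetical prices, not values read from the data set.

```python
medv_sample = [24.0, 21.6, 34.7]   # hypothetical sample of prices

# Explicit loop: create an empty list and append one discounted price at a time.
discount_loop = []
for x in medv_sample:
    discount_loop.append(round(x - .2 * x, 2))

# List comprehension: the same four steps in a single expression.
discount_comprehension = [round(x - .2 * x, 2) for x in medv_sample]

print(discount_loop == discount_comprehension)
```
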

In [None]:
my_list = ['Belgrade', 'New York', 'Moscow', 'London', 'New Delhi', 'Tokyo']
print([x[0] for x in my_list])    # first letter of each city

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print([x**2 for x in my_list])    # squares

l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
print([element1 + ':' + element2 for element1 in l_1 for element2 in l_2])
# The same pairs with a classic nested loop:
for element1 in l_1:
    for element2 in l_2:
        result = element1 + ':' + element2
        print(result)

    

Create a list of tuples with a list comprehension:


In [None]:
l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
[(el1,el2) for el1 in l_1 for el2 in l_2]

And now for a bit more complicated expression...

In [None]:
l_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[str(x) + ' is even' if x % 2 ==0 else str(x) + ' is odd' for x in l_1]
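
The position of `if` inside a comprehension changes its meaning. The sketch below contrasts the two forms on a made-up list (`numbers`, `labels`, and `evens` are illustrative names).

```python
numbers = [1, 2, 3, 4, 5, 6]

# Conditional expression before `for`: every item is kept,
# and the test only decides which value is produced.
labels = ['even' if x % 2 == 0 else 'odd' for x in numbers]

# `if` after the iterable: items that fail the test are dropped.
evens = [x for x in numbers if x % 2 == 0]

print(labels)
print(evens)
```
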

### 3. Control Flow C: `while`, `continue`, and `break`

In [None]:
evens = list()
x = 0
while x < 100:
    if x % 2 == 0:
        evens.append(x)
    x += 1
print(evens)  # print once, after the loop has finished

`break` and `continue` in Python loops:

1) `break`

Purpose: immediately exits the loop, skipping the rest of the current iteration and all subsequent iterations.

Use case: a condition is met and you want to stop the loop entirely.

    for num in range(1, 10):
        if num == 5:
            break  # Exits the loop when num is 5
        print(num)
    # Output: 1, 2, 3, 4

2) `continue`

Purpose: skips the rest of the current iteration of the loop and moves on to the next one.

Use case: a condition is met, but you want the loop to keep running for the other iterations.

    for num in range(1, 10):
        if num == 5:
            continue  # Skips the iteration when num is 5
        print(num)
    # Output: 1, 2, 3, 4, 6, 7, 8, 9

In [None]:

# Break the loop when the iterator reaches a str value in the list
l_1 = [1, 2, 3, 4, 5, '6', 7, 8, 9]
for i in range(len(l_1)):
    if isinstance(l_1[i], str):  # Check if the element is of type str
        break
    else:
        print(l_1[i])

print(isinstance(10, int))    # True
print(isinstance(10, float))  # False

# The same check, iterating over the elements directly instead of over indices
for element in l_1:
    if isinstance(element, str):
        print(element)  # the first str element found
        break



In [None]:
l_1 = [1, 2, 3, 4, 5, '6' , 7, 8, 9]
i = 0
while i < len(l_1):
    if isinstance(l_1[i], str):
        break
    else:
        print(l_1[i])
        i +=1


In [None]:
l_1 = [1, 2, 3, 4, 5, '6', 7, 8, 9]
for i in range(len(l_1)):
    if isinstance(l_1[i],str):
        break
    else:
        print(l_1[i])
       

`continue` skips the rest of the current iteration: use it when a condition is met but you still want the loop to keep running for the remaining items.

In [None]:
l_1 = [1, 2, '3', 4, 5, '6', 7, 8, '9']
i = 0
while i < len(l_1):
    if isinstance(l_1[i],str):
        i+=1 
        continue
    print(l_1[i])
    i+=1 
    


In [None]:
l_1 = [1, 2, '3', 4, 5, '6', 7, 8, '9']
for i in range(len(l_1)):
    if isinstance(l_1[i], str):
        continue  # a for loop advances i by itself, no manual increment needed
    print(l_1[i])

### 3. Control Flow D: dictionary comprehension

In [None]:
squares = {num: num*num for num in range(1, 11)}
print(squares)
squares[5]  # look up key 5: the dictionary holds the pair 5: 25, so this returns 25



Apply a 20% discount to the first 20 elements in `data_set['medv']`, represented by a dictionary.

Step 1. Represent `data_set['medv'][0:20]` by a dictionary, introducing property names:

In [None]:
medv = data_set['medv'][0:20]
properties = ['p_' + str(i) for i in range(0, 20)]
medv_dict = dict(zip(properties,medv))
medv_dict


###### To refresh: `zip()`

The `zip()` function combines two or more iterables (e.g., lists, tuples) into tuples of corresponding elements. When the result is passed to `dict()`, it creates a dictionary where the first iterable provides the keys and the second iterable provides the values:

`dict(zip(keys_iterable, values_iterable))`

    keys = ['a', 'b', 'c']
    values = [1, 2, 3]
    result = dict(zip(keys, values))  # {'a': 1, 'b': 2, 'c': 3}

In [None]:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
result = dict(zip(keys, values))  
display(result) # {'a': 1, 'b': 2, 'c': 3}
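
One detail worth knowing about `zip()`: it stops at the end of the shortest iterable. The sketch below uses a deliberately mismatched, made-up pair of lists to show that the extra key is silently dropped.

```python
keys = ['a', 'b', 'c']
values = [1, 2]                 # one value short, on purpose

pairs = list(zip(keys, values))        # zip stops after the shorter list ends
short_dict = dict(zip(keys, values))   # so the key 'c' never makes it in

print(pairs)
print(short_dict)
```

When building a dictionary this way, it can be worth checking that both lists have the same length first.
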

Step 2. Dictionary comprehension:


In [None]:
medv_dict_discount = {key: round(value - .2*value, 2) for (key, value) in medv_dict.items()}
medv_dict_discount

# A comprehension can transform the keys as well (values are left unchanged here):
medv_dict_changed = {key + '_changed': value for (key, value) in medv_dict.items()}
medv_dict_changed

### 3. Control Flow E: Decisions in Python

We have been using `if` and `else` without formally introducing them. It is really simple:

In [None]:
x = 10
if x**2 == 100:
    print("x is definitely 10.")
else:
    print("It is  definitely not 10.")

In [None]:
def is_something_ten(x):
    if x**2 == 100:
        return True
    else:
        return False
l_1 = [1, 10, 20, 4, 10]
[is_something_ten(x) for x in l_1]

In [None]:
def is_something_ten(x):
    if x**2 == 100:
        return(True)
    else:
        return(False)
l_1 = [1, 10, 20, 4, 10]
[str(x) + ' is 10!' if is_something_ten(x) else str(x) + ' is not 10!' for x in l_1]

We can also branch our `if` statements with `elif`:

In [None]:
x = 50
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

`if` statements can be nested of course:

In [None]:
x = 19
if x < 20:
    print('Ok it is less than 20, now... ')
    if x < 18:
        print('And it is less than 18 too... ')
    else:
        print('But not less than 18... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')
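
The same `if` / `elif` / `else` chain can be wrapped in a function, which makes each branch easy to check. This is a minimal sketch; `describe` is a hypothetical helper and the thresholds 20 and 30 simply mirror the cells above.

```python
def describe(x):
    """Return a short description of where x falls relative to 20 and 30."""
    if x < 20:
        return 'less than 20'
    elif x > 30:
        return 'larger than 30'
    else:
        # Reached only when 20 <= x <= 30
        return 'between 20 and 30'

for value in [19, 25, 50]:
    print(value, '->', describe(value))
```

Only the first branch whose condition is true runs; the remaining branches are skipped.
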

### 4. Pandas I/O operations

In [None]:
import os
work_dir = os.getcwd()
print(work_dir)
filename = os.path.join(work_dir, '_data', 'world_indicators.csv')

data_set = pd.read_csv(filename)
display(data_set.head(10).isna())


In [None]:
display(data_set.isna().sum())  # number of missing values per column
display(data_set.head(5))

`NaN` is, by convention, the way to represent missing data in `pd.DataFrame`.
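
For readers without the CSV file at hand, here is a small sketch of how `NaN` values are counted. The data frame `toy` and all of its values are invented for illustration; `isna`, `sum`, and `dropna` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'beds': [0.5, np.nan, 2.9],
                    'doctors': [np.nan, np.nan, 1.2]})

missing_per_column = toy.isna().sum()   # True counts as 1, so sum() counts the NaNs
print(missing_per_column)

complete_rows = toy.dropna()            # keep only rows without any NaN
print(complete_rows)
```
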

Use the first column of the file as the row index:

In [None]:
data_set = pd.read_csv(os.path.join(work_dir, '_data', 'world_indicators.csv'), index_col=0)
data_set.head(20)

In [None]:
data_set.loc['Afghanistan', 'hospital_beds_per_1000']

In [None]:
data_set.iloc[0:25, 3]
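
The two cells above use `.loc` (selection by label) and `.iloc` (selection by integer position). The sketch below shows the difference on a tiny, invented data frame; the labels `'X'`, `'Y'`, `'Z'` and all values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'population': [10, 20, 30],
                    'area': [1.5, 2.5, 3.5]},
                   index=['X', 'Y', 'Z'])

by_label = toy.loc['Y', 'area']       # row labelled 'Y', column labelled 'area'
by_position = toy.iloc[1, 1]          # second row, second column: the same cell

first_two = toy.iloc[0:2, 0]                    # positional slice: the end is excluded
label_slice = toy.loc['X':'Y', 'population']    # label slice: the end is included

print(by_label, by_position)
print(first_two)
print(label_slice)
```
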

In [None]:
my_data = {'a':[1, 2, 3], 
           'b':[4, 5, 6],
           'c':[7, 8 , 9]}
my_data = pd.DataFrame(my_data)
display(my_data)

Write `my_data` as a `.csv` file to `work_dir`:

In [None]:
filename = os.path.join(work_dir, 'my_data.csv')
my_data.to_csv(filename)
os.listdir(work_dir)

In [None]:
filename = os.path.join(work_dir, 'my_data.csv')
my_data.to_csv(filename)
display(pd.read_csv(filename, index_col=0))
print(filename)
os.remove(filename)
os.listdir(work_dir)
                        

### 5. Pandas transformations and aggregations: `apply`, `filter`, `groupby`, `agg`

In [None]:
my_data = {'a':[1, 2, 3, 6, 2], 
           'b':[4, 5, 6, 2, 3],
           'c':[7, 8 , 9, 1, 1],
           'd':[3, 4, 1, 4, 2]}
my_data = pd.DataFrame(my_data)
display(my_data)

Remember how we defined a function for testing whether a number is equal to 10? Let's define another function that tests whether a number is even instead!

In [None]:
def is_even(x):
    if x % 2 == 0:
        return True
    else:
        return False

def is_even_list(lst):
    return [is_even(x) for x in lst]

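A Pandas DataFrame provides the `apply` method, which applies a function along an axis. Remember: with `axis=0` (the default) the function receives each column in turn, so it runs down the rows; with `axis=1` it receives each row in turn, so it runs across the columns. Here is how we do it: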

In [None]:
my_data.apply(is_even_list, axis=0)

Here, it doesn't make much difference whether we call it on rows or columns, since our function handles each cell's value separately, disregarding the other values in the row or column.

But what can make a difference between calling a function on rows and on columns? Let's define a function that includes all of a row's or column's values in the calculation:

In [None]:
def distance_from_the_sum(lst):
    s = sum(lst)
    return [x-s for x in lst]
my_data.apply(distance_from_the_sum , axis=0)

**The `distance_from_the_sum` function**

This function takes a list `lst`, calculates its sum, and returns a new list where each element is the original value minus the sum of the list.
For example, if `lst = [1, 2, 3]`, the sum is `s = 6`, and the returned list is `[-5, -4, -3]`, that is `[1-6, 2-6, 3-6]`.

**Using `distance_from_the_sum` with `apply`**

When you call `my_data.apply(distance_from_the_sum, axis=0)`, Pandas applies the `distance_from_the_sum` function to each column of `my_data`, because `axis=0` means "apply the function to each column."
If the DataFrame has multiple columns, each column is passed to the function in turn (as `lst`).
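
To see the effect of the axis, the sketch below applies one function both ways to a small, invented data frame. `minus_sum` is a hypothetical variant of `distance_from_the_sum` that returns a pandas Series instead of a list, so that both calls return a data frame of the same shape.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def minus_sum(values):
    # `values` is a pandas Series: one column (axis=0) or one row (axis=1)
    return values - values.sum()

by_column = toy.apply(minus_sum, axis=0)   # subtract each column's sum (6 and 15)
by_row = toy.apply(minus_sum, axis=1)      # subtract each row's sum (5, 7 and 9)

print(by_column)
print(by_row)
```

The same cell value ends up with a different result depending on whether its column or its row supplies the sum.
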

The `filter` method selects columns or rows based on their labels.

In [None]:
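display(my_data.filter(['a', 'b']))          # columns by default
display(my_data.filter(['a', 'b'], axis=1))  # the same, stated explicitly
display(my_data.filter(['a', 'b'], axis=0))  # rows labelled 'a' and 'b': none exist, so the result is empty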

When we filter by rows we are essentially filtering by index values. Look at the data frame now. Index values are numbers. Let's do it correctly now:

In [None]:
my_data.filter([0, 2], axis=0)

For the sake of our next example, let's redefine the data set.

In [None]:
my_data = {'age':[20, 34, 30, 25, 20, 34], 
           'town':['Chicago', 'LA', 'SF', 'Chicago', 'SF', 'WA'],
           'name':['Jake', 'Fin', 'Maria', 'Timmy', 'Eric', 'Sarah'],
           'income_in_k':[100, 150, 300, 50, 60, 300]}
my_data = pd.DataFrame(my_data)
display(my_data)

This is all good, but we want the sum of incomes. There is always more than one way to do it. Something like this:

In [None]:
my_data['income_in_k'].sum()
# But what about the mean income?


In [None]:
my_data['income_in_k'].mean()

You must be wondering if there is a method to get both results at the same time? Well, pay close attention to the next `agg` method:

In [None]:
my_data['income_in_k'].agg(['mean', 'sum'])

Now both results are in one place; try it out.

Let's approach our data set with a different example. Say we need the sum of incomes per age. How can we do it?

In [None]:
my_data.loc[my_data['age']==20, 'income_in_k'].sum()

But this is just for one value of age. Should we go and do it for all ages? NO! There is a much better and faster way. It is by using `groupby` data frame method.

In [None]:
my_data.groupby('age')['income_in_k'].agg(['mean','sum'])

###### Why use `groupby`?

It is used for grouping data and performing aggregate computations, such as calculating sums, means, counts, or custom functions on grouped data.

###### Why is `'age'` in parentheses, not brackets?

In the syntax `my_data.groupby('age')`, parentheses are used because you're passing a string (`'age'`) as an argument to the `groupby()` method. Parentheses enclose the arguments in function calls.

If `'age'` were in square brackets (`['age']`), it would be a list of column names, which is how you group by (or select) several columns at once.

We can have all sorts of aggregations, some of which are built in:

In [None]:
my_data.groupby('age')['income_in_k'].agg(['mean','sum', 'min', 'max'])
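
As a sanity check on what `groupby` computes, the sketch below rebuilds the age and income columns from the example above (under the made-up name `people`) and compares one group's result with the manual `.loc` filter used earlier.

```python
import pandas as pd

people = pd.DataFrame({'age': [20, 34, 30, 25, 20, 34],
                       'income_in_k': [100, 150, 300, 50, 60, 300]})

# One row per distinct age, one column per aggregate
per_age = people.groupby('age')['income_in_k'].agg(['mean', 'sum'])
print(per_age)

# The manual approach for a single age gives the same sum
manual_sum_20 = people.loc[people['age'] == 20, 'income_in_k'].sum()
print(manual_sum_20 == per_age.loc[20, 'sum'])
```

The result is indexed by the group key, so a single group can be read back with `.loc`.
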

In [None]:
numbers_list = [12, 15, 22, 44, 51]  # avoid the name `list`: it would shadow the built-in

def fact_check(numbers):
    return [num*(num-1) for num in numbers]

result = fact_check(numbers_list)
print(result)

In [None]:
# Filtering even numbers: write a function filter_even(lst) that takes a list of integers and returns a new list containing only the even numbers.
number_list = [12, 124, 19, 285, 222, 224]

# Function to check if a single number is even
def if_even(x):
    return x % 2 == 0  # Return True if the number is even

# Function to filter even numbers from a list
def filter_even(lst):
    return [x for x in lst if if_even(x)]  # Use list comprehension to filter even numbers

# Call the function and store the result
result = filter_even(number_list)

# Print the result
print(result)
