# Advanced questions

In [1]:
import pandas as pd
import numpy as np

def get_test_df1():
    return pd.DataFrame({'A' : [1, 2, 3, 4], 'B' : [5, 6, 7, 8]})

def get_test_df2():
    return pd.DataFrame({'A' : np.arange(1, 10000), 'B' : np.arange(1, 10000) + 100000})

## Question 1 

Referring to the `taxis` DataFrame from the exercise "D02_02_data_wrangling.ipynb": how do you define a multi-level index on the `pickup` and `dropoff` fields?

### Answer

You can pass a list of column names to `set_index`.

In [2]:
df_multi_index = pd.DataFrame({'A' : [1, 1, 3, 4], 'B' : [5, 6, 7, 8], 'C' : [9, 10, 11, 12]})
df_multi_index.set_index(['A', 'B'], inplace=True)
df_multi_index

      C
A B
1 5   9
  6  10
3 7  11
4 8  12


Then, when using `loc`, you can pass either a single value (which is looked up in the first index level) or a tuple (one value per level).

In [3]:
df_multi_index.loc[1]

    C
B
5   9
6  10


In [4]:
df_multi_index.loc[(1, 5)]

C    9
Name: (1, 5), dtype: int64
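Applied to the question, a minimal sketch could look like the following. The small `taxis_sample` frame is made up for illustration; only the column names `pickup` and `dropoff` come from the question, and the real taxis data may need extra care (for example, converting the columns to datetimes first).

```python
import pandas as pd

# Hypothetical stand-in for the taxis data: only the column names
# pickup and dropoff are taken from the question
taxis_sample = pd.DataFrame({
    'pickup': pd.to_datetime(['2019-03-01 08:00', '2019-03-01 08:00', '2019-03-02 09:30']),
    'dropoff': pd.to_datetime(['2019-03-01 08:10', '2019-03-01 08:25', '2019-03-02 09:45']),
    'fare': [7.0, 12.5, 9.0],
})

taxis_indexed = taxis_sample.set_index(['pickup', 'dropoff'])

# A single value is matched against the first level (pickup)
same_pickup = taxis_indexed.loc[pd.Timestamp('2019-03-01 08:00')]

# A tuple is matched against both levels (pickup, dropoff)
one_ride = taxis_indexed.loc[
    (pd.Timestamp('2019-03-01 08:00'), pd.Timestamp('2019-03-01 08:25'))
]

# reset_index turns the index levels back into ordinary columns
taxis_flat = taxis_indexed.reset_index()
```

`same_pickup` is a DataFrame indexed by `dropoff` only, while `one_ride` is a single row returned as a Series.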

## Question 2

Considering the data TSA & Holidays in "D02_02_data_wrangling.ipynb" => is it possible to do « rolling joins », ie that Holidays is joined to TSA on date_Holiday <= date_TSA ?

### Answer

`DataFrame.merge` only joins on equal keys, so a simple general solution is to add a column to make the join on. (For sorted keys, `pd.merge_asof` can also match each row to the nearest earlier or later key.)
Let's say you have some developments, each with a target finish date, and some deliveries, and you want to merge every development with the next delivery (so development_time <= delivery_time).

In [5]:
development_time = pd.to_datetime([f"2022-10-{i:02d}" for i in range(1, 31, 3)])
delivery_time = pd.to_datetime(['2022-10-01', '2022-10-08', '2022-11-01'])

dev_df = pd.DataFrame({'development_time' : development_time})
delivery_df = pd.DataFrame({'delivery_time' : delivery_time, 'version': ['1.0', '1.1', '1.2']})

print(dev_df)
print(delivery_df)

  development_time
0       2022-10-01
1       2022-10-04
2       2022-10-07
3       2022-10-10
4       2022-10-13
5       2022-10-16
6       2022-10-19
7       2022-10-22
8       2022-10-25
9       2022-10-28
  delivery_time version
0    2022-10-01     1.0
1    2022-10-08     1.1
2    2022-11-01     1.2


In [6]:
# From an algorithmic point of view, there are far better solutions, but this one 
#  is simpler
def find_matching_delivery(dev_time):
    return delivery_df.delivery_time[delivery_df.delivery_time >= dev_time].min()

# apply means that we get, for every row, the result of find_matching_delivery applied
#  to dev_df['development_time']
dev_df['delivery_time'] = dev_df['development_time'].apply(find_matching_delivery)

dev_and_delivery_df = dev_df.merge(delivery_df, on='delivery_time')
dev_and_delivery_df

  development_time delivery_time version
0       2022-10-01    2022-10-01     1.0
1       2022-10-04    2022-10-08     1.1
2       2022-10-07    2022-10-08     1.1
3       2022-10-10    2022-11-01     1.2
4       2022-10-13    2022-11-01     1.2
5       2022-10-16    2022-11-01     1.2
6       2022-10-19    2022-11-01     1.2
7       2022-10-22    2022-11-01     1.2
8       2022-10-25    2022-11-01     1.2
9       2022-10-28    2022-11-01     1.2
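For reference, `pd.merge_asof` performs this kind of nearest-key join directly, provided both frames are sorted on their key. The sketch below rebuilds the two example frames so that it can run on its own; the name `rolling_df` is only illustrative.

```python
import pandas as pd

dev_df = pd.DataFrame({
    'development_time': pd.to_datetime([f"2022-10-{i:02d}" for i in range(1, 31, 3)])
})
delivery_df = pd.DataFrame({
    'delivery_time': pd.to_datetime(['2022-10-01', '2022-10-08', '2022-11-01']),
    'version': ['1.0', '1.1', '1.2'],
})

# Both frames must be sorted on their key column.
# direction='forward' picks the first delivery_time >= development_time
# (exact matches are allowed by default).
rolling_df = pd.merge_asof(
    dev_df.sort_values('development_time'),
    delivery_df.sort_values('delivery_time'),
    left_on='development_time',
    right_on='delivery_time',
    direction='forward',
)
print(rolling_df)
```

For the opposite condition from the question (date_Holiday <= date_TSA), the default `direction='backward'` should be the one to use, with the TSA frame on the left.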


## Question 3

Regarding indexes: I have the feeling that once a variable (column) is set as the index, we can no longer select that column in a view or do operations on it. If so, how do you perform such operations?

### Answer

Just use `df.index` instead of the column name; that should cover most cases. Otherwise, consider changing the index.

In [7]:
df_multi_index = pd.DataFrame({'A' : [1, 2, 3, 4], 'B' : [5, 6, 7, 8], 'C' : [9, 10, 11, 12]})
df_multi_index.set_index('A', inplace=True)

In [8]:
df_multi_index['D'] = df_multi_index['B'] + 1
df_multi_index

   B   C  D
A
1  5   9  6
2  6  10  7
3  7  11  8
4  8  12  9


In [9]:
# Setting df_multi_index['E'] = df_multi_index['A'] + 1, but with 'A' as index
df_multi_index['E'] = df_multi_index.index + 1
df_multi_index

   B   C  D  E
A
1  5   9  6  2
2  6  10  7  3
3  7  11  8  4
4  8  12  9  5
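A few other common operations on an index column, as a minimal sketch (the frame and variable names are only illustrative):

```python
import pandas as pd

df_indexed = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}).set_index('A')

# Filtering rows on the index values, as one would with df['A'] > 2
filtered = df_indexed[df_indexed.index > 2]

# Selecting a row by its index label
row = df_indexed.loc[3]

# reset_index gives 'A' back as an ordinary column
df_flat = df_indexed.reset_index()
print(df_flat)
```

`reset_index` is one way of "changing the index" when `df.index` is not enough.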


## Question 4

I don't understand what lambda functions are, or what the difference is between `df.assign` and `df.apply`.

### Answer

First, these are two different questions.

### Answer 4.1 : what are lambdas ?

Lambdas are also called anonymous functions. This is an advanced syntax, outside the scope of the training, that you use when you need to define a function that you will use only once.

In [10]:
# These two codes are equivalent
def f(x):
    return x + 1
print(f(3))

f = lambda x: x + 1
print(f(3))

4
4


They are mainly used with functions like `map` or `apply`, which are meant to apply a function to every element of a collection.

In [11]:
list1 = [1, 2, 3, 4]

# These two codes are equivalent
# ***** First code *****
def f(x):
    return x + 1
# Important: we are not putting parentheses after the f here,
#  because we are not calling the function, but referencing it
print(list(map(f, list1)))

# ***** Second code *****
# Here, we use a lambda function to directly declare the function
#  inside the map
print(list(map(lambda x: x + 1, list1)))

[2, 3, 4, 5]
[2, 3, 4, 5]
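Another place where a one-off function is handy is the `key` argument of the built-in `sorted`. A small sketch (the list `words` is just an example):

```python
words = ['pandas', 'np', 'python']

# The lambda tells sorted to compare the words by their length
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['np', 'pandas', 'python']
```

Words of equal length keep their original relative order, because `sorted` is stable.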


### Answer 4.2 : difference between apply and assign

Well, they are two different functions: `assign` is used on a DataFrame to add new columns or replace existing ones, while `apply` applies a function to every element of a column.

For example, these three snippets give the same result:

In [12]:
# Assign with []
df = get_test_df1()
df['C'] = df['A'] + 1
print("===== case 1 =====")
print(df)

# Assign with assign
df = get_test_df1()
df2 = df.assign(C=df['A'] + 1)
print("===== case 2 =====")
print(df2)

# Assign with [] and apply
df = get_test_df1()
df['C'] = df['A'].apply(lambda x : x + 1)
print("===== case 3 =====")
print(df)

===== case 1 =====
   A  B  C
0  1  5  2
1  2  6  3
2  3  7  4
3  4  8  5
===== case 2 =====
   A  B  C
0  1  5  2
1  2  6  3
2  3  7  4
3  4  8  5
===== case 3 =====
   A  B  C
0  1  5  2
1  2  6  3
2  3  7  4
3  4  8  5


There are some differences:

    df.assign(C=df['A'] + 1)

does not modify the original DataFrame `df`; it returns a new DataFrame (a copy, not a view) with the new column. So it is the equivalent of

    df['C'] = df['A'] + 1

with `inplace=False`, which means you can use `assign` to chain operations.
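A minimal sketch of such a chain follows. Passing a lambda to `assign` lets each step refer to the DataFrame as it is at that point of the chain; the column `D` and the filter are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

result = (
    df
    .assign(C=lambda d: d['A'] + 1)   # d is the DataFrame at this step
    .assign(D=lambda d: d['C'] * 2)   # column C already exists here
    .query('D > 4')                   # keep only the rows where D > 4
)

print(result)
print(df.columns.tolist())  # df itself is unchanged: ['A', 'B']
```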

`apply` applies a function to every element of a column / Series and returns the result. **But you should not use it if you can do the job with simpler syntax**. `apply` should be used only if the operation cannot be expressed with numpy arithmetic, because it is slower than `df['A'] + 1`: the function is executed as Python code for each element.

In [13]:
df = get_test_df2()
%timeit df['C'] = df['A'] + 1
%timeit df['C'] = df['A'].apply(lambda x : x + 1)

108 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
2.42 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


`apply` is mainly used for very specific processing on the elements of a column, for example date conversion when the format is not recognized by `pd.to_datetime`.
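As an illustration, here is a sketch with a made-up text format, "Q<quarter>/<year>", converted to the first day of the quarter. Both the format and the helper `quarter_to_timestamp` are assumptions for this example, not something from the training data.

```python
import pandas as pd

# Hypothetical format "Q<quarter>/<year>", made up for this example
quarters = pd.Series(['Q1/2022', 'Q3/2022', 'Q4/2023'])

def quarter_to_timestamp(text):
    # 'Q3/2022' -> quarter '3', year '2022'
    quarter, year = text[1:].split('/')
    first_month = 3 * (int(quarter) - 1) + 1
    return pd.Timestamp(year=int(year), month=first_month, day=1)

# The function is called once per element of the Series
quarter_start = quarters.apply(quarter_to_timestamp)
print(quarter_start)
```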

## Question 5 

Naming question: what is the convention for naming every "object" that is instantiated in Python, whatever its type? Do we say "variable", "item", "object"...? This is mainly useful for Google searches and forums.

### Answer

Short answer: `object`

But all three terms `object`, `variable` and `item` are used in Python, with different meanings:

- an object is anything created by Python in the memory of the process
- a variable is a name that refers to an object (`None` is itself an object)
- an item is a key/value pair inside a dict

In [14]:
# item example
# This dict has two items
dict1 = {1: 3, 2: 4}

# they can be accessed with the items method :
# k, v stand for key, value
for k, v in dict1.items():
    print((k, v))

(1, 3)
(2, 4)


In [15]:
# name and object example
a = [1, 2, 3]
b = a
c = a

# Here we have three variables a, b, c and only one object, the list
#  declared on the same line as a. The three variables are pointing 
#  toward the same object
# You can check whether two variables point to the same object by comparing
#  the results of the built-in id function (or with the `is` operator)
print(id(a) == id(b)) # prints True

True
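One practical consequence, sketched below: since `a` and `b` are two names for the same list, a change made through one name is visible through the other, whereas a copy is a separate object.

```python
a = [1, 2, 3]
b = a          # second name, same object
c = list(a)    # new object with the same content

b.append(4)

print(a)       # [1, 2, 3, 4] : modified through b
print(c)       # [1, 2, 3]    : the copy is untouched
print(a is b)  # True
print(a is c)  # False
```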


## Question 6

How do you test two DataFrame outputs for equality? Conversely, how do you identify the differences between two objects, especially two DataFrames?

### Answer

This question was answered during the training (after it was asked); please have a look at Day 4 : Questions and extras => Comparing Dataframes.

In short, be careful: two missing values never compare as equal. Apart from that, you can use the boolean comparison operators `==` and `!=`, implement your own logic to deal with NA values, and choose which part of the DataFrame (or the whole of it) you want to compare.
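A minimal sketch of that logic, on two small made-up frames (this is one possible approach, not necessarily the one shown on Day 4). `DataFrame.equals` is also shown because it treats missing values at the same position as equal.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['x', 'y', 'z']})
df2 = df1.copy()
df2.loc[2, 'A'] = 30.0

# Element-wise comparison: the NaN in row 1 is reported as different
naive = df1 == df2

# Own logic: two missing values at the same position count as equal
same = (df1 == df2) | (df1.isna() & df2.isna())

# Rows containing at least one real difference
rows_with_diff = (~same).any(axis=1)
print(df1[rows_with_diff])
print(df2[rows_with_diff])

# Whole-frame test where NaN in the same place counts as equal
print(df1.equals(df1.copy()))  # True
print(df1.equals(df2))         # False
```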

## Question 7

How do you check whether an "object" (to be defined) exists among everything that has been instantiated?

### Answer

I am not completely sure what you mean by that, but you probably want to first ask yourself why you need to do that (it is not standard practice). Second, if you are sure, just use `globals()`, which returns a dict containing all the global variables that exist at that point.

In [16]:
weird_variable = pd.DataFrame({'Weird_column' : np.arange(200)})

# Example 1 : checking the variable weird_variable exists
print('weird_variable' in globals()) # print True

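# Example 2 : listing every global variable that is a DataFrame
# (list() takes a snapshot, because the loop variables k and v are
#  themselves stored in globals() and could change it while iterating)
for k, v in list(globals().items()):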
    if isinstance(v, pd.DataFrame):
        print(f"{k} is a DataFrame")
        
# Example 3 : looking for a DataFrame with a column named Weird_column
def look_for_dataframe():
    for k, v in globals().items():
        if isinstance(v, pd.DataFrame):
            if 'Weird_column' in v.columns:
                return v

weird_df = look_for_dataframe()
# weird_df is pointing to the same DataFrame as weird_variable
assert weird_variable is weird_df

True
_ is a DataFrame
__ is a DataFrame
___ is a DataFrame
df_multi_index is a DataFrame
_2 is a DataFrame
_3 is a DataFrame
dev_df is a DataFrame
delivery_df is a DataFrame
dev_and_delivery_df is a DataFrame
_6 is a DataFrame
_8 is a DataFrame
_9 is a DataFrame
df is a DataFrame
df2 is a DataFrame
weird_variable is a DataFrame
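If the real need is to keep track of a set of DataFrames, a more standard approach is to store them explicitly in a dict instead of searching `globals()`. A sketch, where the dict `frames` and its keys are hypothetical names:

```python
import numpy as np
import pandas as pd

# Hypothetical registry: the DataFrames are stored explicitly by name
frames = {
    'weird': pd.DataFrame({'Weird_column': np.arange(200)}),
    'plain': pd.DataFrame({'A': [1, 2, 3]}),
}

# Checking that a given DataFrame exists
print('weird' in frames)  # True

# Looking for the DataFrames that have a column named Weird_column
with_weird_column = [
    name for name, frame in frames.items()
    if 'Weird_column' in frame.columns
]
print(with_weird_column)  # ['weird']
```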
