## Contents of this notebook:
* Selection of data with Pandas (reference: https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.02-Data-Indexing-and-Selection.ipynb)
* Using map and apply

# Selection of data with Pandas

## Selecting a column
`sr_of_column = df[<column_name>]`

The result is a pandas Series.

## Selecting multiple columns
`df_with_selected_columns = df[[<column_name1>,<column_name2>]]`
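As a minimal sketch of the difference between the two forms above: single brackets return a Series, a list of names returns a DataFrame. The toy frame and its column names are assumptions made up for illustration.

```python
import pandas as pd

# Toy frame; the column names are hypothetical.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "sales_price": [10.0, 25.5, 40.0],
    "used_promo_code": [True, False, True],
})

sr_of_column = df["sales_price"]                                # one name -> Series
df_with_selected_columns = df[["customer_id", "sales_price"]]   # list of names -> DataFrame

print(type(sr_of_column).__name__)              # Series
print(type(df_with_selected_columns).__name__)  # DataFrame
print(df[["sales_price"]].shape)                # (3, 1): still a DataFrame
```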

## Selecting some rows by boolean series
`df_selection = df[<boolean_series>]`

You can select rows from a DataFrame with a Series of dtype boolean. Only those rows are selected where the value of the Series is `True`. The result is a DataFrame containing the selected rows.
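A small sketch of boolean selection, using a made-up frame (the column names and values are assumptions): a comparison on a column yields the boolean Series, which is then used to filter the rows.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({
    "sales_price": [10.0, 25.5, 40.0, 5.0],
    "used_promo_code": [True, False, True, False],
})

mask = df["sales_price"] > 20   # boolean Series with the same index as df
df_selection = df[mask]         # keeps only the rows where mask is True

print(mask.tolist())                 # [False, True, True, False]
print(df_selection.index.tolist())   # [1, 2]
```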

## Selecting rows and columns
`df_selection = df.loc[<boolean_series>, [<column_name1>,<column_name2>]]`

or

`df_selection = df[<boolean_series>][[<column_name1>,<column_name2>]]`
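The two forms above return the same result when you only read data. The sketch below, on invented toy data, compares them and shows why the single `.loc` step is usually the safer choice when you also want to assign values.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "sales_price": [10.0, 25.5, 40.0, 5.0],
    "used_promo_code": [True, False, True, False],
})
mask = df["used_promo_code"]
cols = ["customer_id", "sales_price"]

via_loc = df.loc[mask, cols]   # one indexing step
via_chain = df[mask][cols]     # two steps: filter rows, then pick columns

print(via_loc.equals(via_chain))   # True

# Assigning through a single .loc step updates the original frame.
df.loc[mask, "sales_price"] = 0.0
print(df["sales_price"].tolist())  # [0.0, 25.5, 0.0, 5.0]
```

Assigning through the chained form (`df[mask][col] = ...`) writes to an intermediate object and is not guaranteed to change `df`.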

## Selecting rows by index
`df_selection = df.loc[<index_name>]`

## Selecting rows by position
`df_selection = df.iloc[<position_idx>]`
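To illustrate the difference between label-based and position-based selection, here is a short sketch. The index labels and prices are hypothetical values chosen so that labels and positions clearly differ.

```python
import pandas as pd

# Hypothetical ids used as index labels.
df = pd.DataFrame(
    {"sales_price": [10.0, 25.5, 40.0]},
    index=[2708, 86, 1005],
)

by_label = df.loc[86]      # the row whose index label is 86
by_position = df.iloc[0]   # the first row, whatever its label is

print(by_label["sales_price"])      # 25.5
print(by_position["sales_price"])   # 10.0
print(df.loc[[2708, 1005]].shape)   # (2, 1): a list of labels returns a DataFrame
```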

In [0]:
import pandas as pd

### Load the data from ReDI - Resellers Analysis

In [0]:
# Link to the online location
url = 'https://raw.githubusercontent.com/ReDI-School/python-data-science/master/datasets/reseller/orders.csv'

In [0]:
# Load the data as a Pandas Data Frame and see the content
data = pd.read_csv(url, parse_dates=['datetime_ordered'])
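The `parse_dates` argument matters for the date-based exercises further down. The sketch below uses a tiny in-memory CSV (its contents are made up; only the column name `datetime_ordered` is taken from the cell above) to show that the column becomes a datetime column only when it is listed in `parse_dates`.

```python
import io
import pandas as pd

# Made-up CSV content; only the column name datetime_ordered comes from the notebook.
csv_text = """order_id,datetime_ordered,sales_price
1,2018-06-14 10:30:00,19.99
2,2018-07-05 08:00:00,49.50
"""

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime_ordered"])

is_dt = pd.api.types.is_datetime64_any_dtype
print(is_dt(plain["datetime_ordered"]))    # False: still text
print(is_dt(parsed["datetime_ordered"]))   # True: real datetimes

# Datetime columns support the .dt accessor and date comparisons.
print(parsed["datetime_ordered"].dt.month.tolist())  # [6, 7]
```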

#### How much in sales was generated by orders with a promo code?

In [0]:
# Exercise 1: Why might the CEO be interested in knowing how much in sales was generated by promo codes?

In [0]:
# Exercise 2: Calculate the amount.
# Hint: data.loc[<boolean_series>, [<column_name>]]

In [0]:
# Exercise 2b: How much without promo code?
# Select the orders without a promo code and sum their sales price
data.loc[data['used_promo_code'] == False, 'sales_price'].sum()

### An ad hoc request

---

Dear BI Team,

I just heard from the Finance Team that the customers with the IDs
2708, 86, 1005, and 1661 are most likely fraudsters and did not pay
their bills. Can you quickly check how much money we lost?

Thanks!

---

In [0]:
# Exercise 3: Answer the request. (Tip: have a look at pd.Series.isin())
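As a hint for how `pd.Series.isin()` behaves, here is a minimal sketch on invented data (the frame `orders`, its ids and its prices are assumptions, not the exercise dataset): `isin` returns a boolean Series that is `True` wherever the value appears in the given list.

```python
import pandas as pd

# Invented toy orders; not the exercise data.
orders = pd.DataFrame({
    "customer_id": [11, 22, 33, 22, 44],
    "sales_price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

wanted = [22, 44]
mask = orders["customer_id"].isin(wanted)

print(mask.tolist())                           # [False, True, False, True, True]
print(orders.loc[mask, "sales_price"].sum())   # 110.0
```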

#### Calculate the total sales after the 5th of July


In [0]:
# Exercise 4: Calculate the total sales after the given date.

#### Calculate the total sales of two periods. Which is higher?

1.   First period: from 2018-06-14 to 2018-06-27
2.   Second period: from 2018-07-14 to 2018-07-27

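Use the `&` operator (the element-wise "and") to combine the two conditions, and wrap each condition in parentheses. Python's plain `and` keyword does not work on a Series.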
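A sketch of combining two date conditions on toy data follows. The dates and prices are invented; only the column names mirror the ones used in this notebook. Each comparison produces a boolean Series, and `&` combines them element by element.

```python
import pandas as pd

# Invented toy data; only the column names follow the notebook.
df = pd.DataFrame({
    "datetime_ordered": pd.to_datetime(
        ["2018-06-10", "2018-06-15", "2018-06-20", "2018-07-01"]
    ),
    "sales_price": [10.0, 20.0, 30.0, 40.0],
})

start = pd.Timestamp("2018-06-14")
end = pd.Timestamp("2018-06-27")

# Parentheses are required because & binds more tightly than >= and <=.
in_period = (df["datetime_ordered"] >= start) & (df["datetime_ordered"] <= end)

print(in_period.tolist())                       # [False, True, True, False]
print(df.loc[in_period, "sales_price"].sum())   # 50.0
```

Note that `pd.Timestamp("2018-06-27")` means midnight at the start of that day, so `<= end` excludes orders placed later on the 27th. Whether the end day should be included is a decision the exercise leaves to you.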


In [0]:
# Exercise 5: Calculate the total sales for the first period.

In [0]:
# Exercise 6: Calculate the total sales for the second period.

### Advanced Questions

In [0]:
# Answer Exercise 3 by creating and using an index on the customer_id column

# Map and apply

`map` and `apply` work very similarly. We will focus on the three most common use cases of both.

### Creating a new Series from another one by matching a dict


In [4]:
d = {1: 'Bob', 2: 'Anna'}
sr = pd.Series([1, 2, 2, 1, 3])
sr

0    1
1    2
2    2
3    1
4    3
dtype: int64

In [5]:
sr_str = sr.map(d)
sr_str

0     Bob
1    Anna
2    Anna
3     Bob
4     NaN
dtype: object

### Creating a new Series from another one by applying a function


In [0]:
def invert_string(string):
    return string[::-1]

In [8]:
sr_str[~sr_str.isnull()].map(invert_string)

0     boB
1    annA
2    annA
3     boB
dtype: object

In [11]:
sr_str[~sr_str.isnull()].apply(invert_string)

0     boB
1    annA
2    annA
3     boB
dtype: object
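The cells above filter out the missing values first, because `invert_string` would fail on `NaN` (a float cannot be sliced). As an alternative sketch, `Series.map` accepts `na_action="ignore"`, which skips missing values and keeps them in the result. The Series below is a stand-in for `sr_str`.

```python
import pandas as pd

def invert_string(string):
    return string[::-1]

# Stand-in for sr_str, with one missing value.
sr_str = pd.Series(["Bob", "Anna", None])

# Missing values are passed through untouched instead of being sent to the function.
inverted = sr_str.map(invert_string, na_action="ignore")

print(inverted.tolist()[:2])     # ['boB', 'annA']
print(inverted.isna().tolist())  # [False, False, True]
```

Unlike the filtered version, the result keeps the same length and index as the input.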

## And now it is your turn again

---

Dear BI Team,

We would like a better overview of how many different articles we are selling in each product category.

The categories are:

* clothing: T-shirts, Jackets, Trousers
* accessories: Bag, Hat
* footwear: Shoes, Socks

Thanks!

PS: This table should be helpful:
https://raw.githubusercontent.com/ReDI-School/python-data-science/master/datasets/reseller/product_details.csv

---

In [17]:
# Exercise 7: Load the data, create a new column "category_name", visualize the "category".
url='https://raw.githubusercontent.com/ReDI-School/python-data-science/master/datasets/reseller/product_details.csv'
data2 = pd.read_csv(url)
data2['product_name'].str.lower().unique()


array(['t-shirt', 'shoes', 'socks', 'jacket', 'hat', 'trousers', 'bag'],
      dtype=object)

In [0]:
data2['hallo'] = 'bye'  # scratch column with a constant value
data2['product_id_2'] = data2['product_id'] * 100

In [23]:
data2.head()

| Unnamed: 0 | product_id | product_name | product_brand | hallo | product_id_2 |
|---|---|---|---|---|---|
| 0 | 2448 | T-shirt | Reebok | bye | 244800 |
| 1 | 5425 | Shoes | Adidas | bye | 542500 |
| 2 | 1254 | T-shirt | Jack Wolfskin | bye | 125400 |
| 3 | 1254 | T-shirt | Jack Wolfskin | bye | 125400 |
| 4 | 7787 | T-shirt | Adidas | bye | 778700 |


### Advanced

As you know, the discount from the promo code was not included in the sales price. A former colleague on your team wrote a function that applies a discount of 20 euros whenever a promo code was used.



In [0]:
def apply_discount(row):
    if row.used_promo_code:
        return row.sales_price - 20
    else:
        return row.sales_price

PS: Here we use `apply` a bit differently than in the example above. When you call `apply` on a whole DataFrame, the function you pass receives an entire column (`axis=0`, the default) or an entire row (`axis=1`) at a time.
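To make the role of `axis` concrete, here is a minimal sketch on a made-up frame (the columns `a` and `b` are assumptions and unrelated to the orders data). With `axis=0` the function is called once per column, with `axis=1` once per row.

```python
import pandas as pd

# Made-up frame, unrelated to the orders data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series, so values are accessed by column name.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_ranges.to_dict())  # {'a': 2, 'b': 20}
print(row_sums.tolist())     # [11, 22, 33]
```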

In [0]:
# Exercise 8: Create a new column "sales_price_after_discount" by applying "apply_discount" to the rows of the dataframe

### Bonus

After revising the data, the finance department realized that there is a mistake in the logic:

1.   Promo codes can only be applied when the sales price is greater than 50 euros.
2.   From July on, a discount of 20% was given instead of a fixed discount of 20 euros.

Fix the function accordingly.
Calculate the sum of sales with the old and with the new logic. How big is the difference?