<a href="https://colab.research.google.com/github/MatteoZancanaro-5758278/M_Zancanaro-Programming-BigDataAnalytics/blob/main/Basic_Assignments/2_03_Calculated_Fields%2C_Indexing_and_Subsetting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2.03 Pandas: Calculated fields, indexing and subsetting dataframes
### Calculated Fields
In data analysis we often want to generate new fields (columns) from existing data, either by transforming a single column or by combining several. In other words, this is data that we calculate rather than read in.

We will create some calculated fields based on the dictionary input from the last session. If you have started a new notebook, run the code below to recreate the dataframe:

In [3]:
import pandas as pd
import numpy as np


# create a dictionary of orders
orders = {'o10001':{'date':'2025/01/10', 'product':'Blockchain database', 'quantity':'1'},
            'o10002':{'date':'2025/01/13', 'product':'Stock market prediction engine', 'quantity':'2'},
            'o10003':{'date':'2025/01/14', 'product':'Portfolio optimisation tool', 'quantity':'10'},
            'o10004':{'date':'2025/01/15', 'product':'Man\'s suit', 'quantity':'2'}
}

# convert to a dataframe
orders_df = pd.DataFrame(orders)

# create a dictionary of products
products = {'123':{'name':'Blockchain database', 'cost_price':12.12, 'sale_price':15.00},
            '124':{'name':'Stock market prediction engine', 'cost_price':2.15, 'sale_price':9.99},
            '125':{'name':'Portfolio optimisation tool', 'cost_price':22.45, 'sale_price':49.99},
            '126':{'name':'Financial services chatbot', 'cost_price':0.45, 'sale_price':2.99},
            '127':{'name':'Man\'s suit', 'cost_price':0.78, 'sale_price':1.49}
}

# convert to a dataframe
products_df = pd.DataFrame(products)

# transpose so that each order/product becomes a row
product_df = products_df.transpose()
orders_df = orders_df.transpose()

# left join: keep every order and attach the matching product details
joined_df = orders_df.merge(product_df, how="left", left_on="product", right_on="name")

# drop the repeated column and display on screen
joined_df = joined_df.drop(["name"], axis=1)
joined_df

Unnamed: 0,date,product,quantity,cost_price,sale_price
0,2025/01/10,Blockchain database,1,12.12,15.0
1,2025/01/13,Stock market prediction engine,2,2.15,9.99
2,2025/01/14,Portfolio optimisation tool,10,22.45,49.99
3,2025/01/15,Man's suit,2,0.78,1.49


Expanding our example, we now want to know the total cost price and total sale price for each order, calculated by multiplying the unit prices by the quantity. Again, pandas makes this relatively easy to do … in theory at least. The following code attempts the calculation, and reads much like a normal calculation on a single variable (as opposed to a whole column):

In [4]:
joined_df["total_cost_price"] = joined_df["quantity"] * joined_df["cost_price"]
joined_df["total_sale_price"] = joined_df["quantity"] * joined_df["sale_price"]

joined_df

TypeError: can't multiply sequence by non-int of type 'float'

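We cannot multiply these columns because they do not hold numeric types: the quantities were entered as strings, and Python cannot multiply a string by a float. We can confirm this by checking the dtypes: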

In [5]:
joined_df.dtypes

Unnamed: 0,0
date,object
product,object
quantity,object
cost_price,object
sale_price,object


We want quantity, cost_price and sale_price to be of a numeric type (float or integer), but every column is listed as "object". Object is pandas' most flexible data type (dtype), designed to hold "text or mixed numeric and non-numeric values". pandas tries to infer the appropriate type for each column and falls back to object when it cannot, which is what has happened here. We can fix this fairly easily with a for loop that converts the relevant fields to floats. An alternative approach, particularly when you have many columns of data, is to ask pandas to convert them as a set using convert_dtypes (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html). Note that convert_dtypes infers each type from the values as stored, so numbers held as strings (like our quantities) stay as text and would still need an explicit conversion, for example with pd.to_numeric.

In [6]:
field_list=["quantity", "cost_price","sale_price"]
for field in field_list:
  joined_df[field]=joined_df[field].astype(float)

joined_df['total_cost_price'] = joined_df['quantity'] * joined_df['cost_price']
joined_df['total_sale_price'] = joined_df['quantity'] * joined_df['sale_price']
joined_df

Unnamed: 0,date,product,quantity,cost_price,sale_price,total_cost_price,total_sale_price
0,2025/01/10,Blockchain database,1.0,12.12,15.0,12.12,15.0
1,2025/01/13,Stock market prediction engine,2.0,2.15,9.99,4.3,19.98
2,2025/01/14,Portfolio optimisation tool,10.0,22.45,49.99,224.5,499.9
3,2025/01/15,Man's suit,2.0,0.78,1.49,1.56,2.98
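
One side effect of converting everything with astype(float) is that quantity is shown as 1.0, 2.0 and so on. As a possible alternative, the sketch below uses pd.to_numeric, which picks an integer or float type per column. The small `demo` dataframe is a hypothetical stand-in for joined_df, not the notebook's own data structure:

```python
import pandas as pd

# Hypothetical stand-in for joined_df: quantities as strings, prices as objects
demo = pd.DataFrame({
    "quantity": ["1", "2", "10"],
    "cost_price": [12.12, 2.15, 22.45],
}).astype(object)

# Let pandas choose the numeric type for each column
for col in ["quantity", "cost_price"]:
    demo[col] = pd.to_numeric(demo[col])

# The calculated field now works, and quantity stays an integer
demo["total_cost_price"] = demo["quantity"] * demo["cost_price"]

print(demo.dtypes)
print(demo)
```

Whether integer quantities are worth the extra step depends on how the column is used later; for the multiplication itself either approach gives the same totals.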


### Indexing and Subsetting Dataframes
As with strings and lists, we can index and slice pandas dataframes based on a range of criteria. Note that this is an area where the recommended pandas syntax has changed significantly over the years. In the past .ix (index) was the preferred syntax, and you will still often see it in older libraries and tutorials; it has since been removed from pandas. Today, _.loc_ (label-based location) and _.iloc_ (integer-position location) are the preferred options and the ones we will use here. We will start by slicing the first two rows (positions 0 and 1) of our dataframe:

In [8]:
joined_df.iloc[0:2]

Unnamed: 0,date,product,quantity,cost_price,sale_price,total_cost_price,total_sale_price
0,2025/01/10,Blockchain database,1.0,12.12,15.0,12.12,15.0
1,2025/01/13,Stock market prediction engine,2.0,2.15,9.99,4.3,19.98


In [9]:
joined_df["product"]

Unnamed: 0,product
0,Blockchain database
1,Stock market prediction engine
2,Portfolio optimisation tool
3,Man's suit


In [10]:
df_subset_one = joined_df.iloc[0:2]
df_subset_one

Unnamed: 0,date,product,quantity,cost_price,sale_price,total_cost_price,total_sale_price
0,2025/01/10,Blockchain database,1.0,12.12,15.0,12.12,15.0
1,2025/01/13,Stock market prediction engine,2.0,2.15,9.99,4.3,19.98


In [11]:
df_subset_two = joined_df[["product", "quantity", "total_sale_price"]]
df_subset_two

Unnamed: 0,product,quantity,total_sale_price
0,Blockchain database,1.0,15.0
1,Stock market prediction engine,2.0,19.98
2,Portfolio optimisation tool,10.0,499.9
3,Man's suit,2.0,2.98
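
The cells above use _.iloc_ and plain column selection, but not _.loc_. The following is a minimal sketch of how the two differ, using a hypothetical `demo` dataframe with made-up row labels (o1, o2, o3) rather than the notebook's data. The main point to notice is that _.iloc_ slices exclude the end position, while _.loc_ slices include the end label:

```python
import pandas as pd

# Hypothetical dataframe with string row labels
demo = pd.DataFrame(
    {"product": ["A", "B", "C"], "quantity": [1.0, 2.0, 10.0]},
    index=["o1", "o2", "o3"],
)

# Position-based: rows at positions 0 and 1 (end position excluded)
by_position = demo.iloc[0:2]

# Label-based: rows labelled o1 through o2 (end label included)
by_label = demo.loc["o1":"o2"]

# .loc can select rows and columns by label in one step
rows_and_cols = demo.loc[["o1", "o3"], ["product"]]

# With a default integer index the two slices differ in length
demo_reset = demo.reset_index(drop=True)
n_iloc = len(demo_reset.iloc[0:2])  # positions 0 and 1
n_loc = len(demo_reset.loc[0:2])    # labels 0, 1 and 2

print(by_position)
print(rows_and_cols)
print(n_iloc, n_loc)
```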


### Exporting Dataframes
Finally, we can use pandas to efficiently export our dataframe to a file. There are multiple export options available, including Excel, many major databases, and even HTML; see the documentation for more details: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#conversion. However, typically the most useful is CSV (for reusability):

In [12]:
joined_df.to_csv("joined_dataframe.csv", sep=",")
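
By default to_csv also writes the row index as an unnamed first column. The sketch below, which uses a hypothetical two-row `demo` dataframe and an in-memory buffer instead of a real file, shows what that looks like when the CSV is read back, and how index=False avoids it:

```python
import io
import pandas as pd

# Hypothetical two-row dataframe standing in for joined_df
demo = pd.DataFrame({
    "product": ["Blockchain database", "Man's suit"],
    "total_sale_price": [15.0, 2.98],
})

# Calling to_csv() without a path returns the CSV text instead of writing a file
csv_with_index = demo.to_csv()
csv_without_index = demo.to_csv(index=False)

# Read both versions back in to compare the resulting columns
with_index = pd.read_csv(io.StringIO(csv_with_index))
without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index carries no meaning (as with the default 0, 1, 2, ... index), passing index=False usually gives a cleaner file to reuse.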

### EXERCISES
1. Using _product\_df_, can you create a calculated field for the amount of markup associated with each product? The formula for this should be:
$\text{markup} = \frac{\text{sale price} - \text{cost price}}{\text{cost price}}$

2. Can you also create a calculated field for the percentage markup?

3. Working this time with _joined\_df_, can you create a subset of the dataframe where the total sale price is less than £20.00?

In [13]:
#1
product_df["Markup"] = (product_df["sale_price"]-product_df["cost_price"]) / product_df["cost_price"]
product_df

Unnamed: 0,name,cost_price,sale_price,Markup
123,Blockchain database,12.12,15.0,0.237624
124,Stock market prediction engine,2.15,9.99,3.646512
125,Portfolio optimisation tool,22.45,49.99,1.226726
126,Financial services chatbot,0.45,2.99,5.644444
127,Man's suit,0.78,1.49,0.910256


In [17]:
#2
product_df["Markup Percentage"]=product_df["Markup"] * 100
#changing name too
product_df.rename(columns = {"Markup Percentage": "Markup Percentage %"}, inplace=True)
product_df

Unnamed: 0,name,cost_price,sale_price,Markup,Markup Percentage %
123,Blockchain database,12.12,15.0,0.237624,23.762376
124,Stock market prediction engine,2.15,9.99,3.646512,364.651163
125,Portfolio optimisation tool,22.45,49.99,1.226726,122.672606
126,Financial services chatbot,0.45,2.99,5.644444,564.444444
127,Man's suit,0.78,1.49,0.910256,91.025641


In [20]:
#3
joined_df_subset = joined_df[joined_df["total_sale_price"]<20]
joined_df_subset

Unnamed: 0,date,product,quantity,cost_price,sale_price,total_cost_price,total_sale_price
0,2025/01/10,Blockchain database,1.0,12.12,15.0,12.12,15.0
1,2025/01/13,Stock market prediction engine,2.0,2.15,9.99,4.3,19.98
3,2025/01/15,Man's suit,2.0,0.78,1.49,1.56,2.98
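
The filter in the last answer works by first building a boolean Series (a "mask") and then using it to pick rows. The sketch below separates those two steps on a hypothetical `demo` dataframe that reuses the totals shown above, and also shows one way to combine two conditions; the lower bound of 5 is an arbitrary value chosen for illustration:

```python
import pandas as pd

# Hypothetical dataframe reusing the totals from the output above
demo = pd.DataFrame({
    "product": ["Blockchain database", "Stock market prediction engine",
                "Portfolio optimisation tool", "Man's suit"],
    "total_sale_price": [15.0, 19.98, 499.9, 2.98],
})

# Step 1: the comparison returns one True/False value per row
mask = demo["total_sale_price"] < 20

# Step 2: the mask selects the rows where it is True
subset = demo.loc[mask]

# Conditions are combined with & (and) or | (or), each wrapped in parentheses
mid_range = demo.loc[mask & (demo["total_sale_price"] > 5), ["product", "total_sale_price"]]

print(mask.tolist())
print(subset)
print(mid_range)
```

Note that the original row labels (0, 1, 3) are kept in the subset, which is why the index in the output above skips 2.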
