Mounting Google Drive in Notebook

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Importing the pandas library and loading the "Bake Sale with Profit Per Item" data saved in the previous exercise

In [None]:
#import pandas
import pandas as pd

#save filepath
file_path = "/content/drive/MyDrive/Coding_Dojo_Data_Science_Excercises/02_Pandas_for_Data_Manipulation/data/bake-sale-data-with-profit-per-item.csv"

# load in the data as a pandas dataframe
df = pd.read_csv(file_path)

Checking if dataframe loaded correctly using .head() and .info()

In [None]:
df.head()

,Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
0,brownie,Hugo,Individually Wrapped (Plastic),2.25,0.25,2.0,17,19,25,25
1,cookie,Sally,Individually Wrapped (Plastic),1.25,0.5,0.75,40,32,38,38
2,cake,Martina,Boxed (Clear Plastic),9.5,5.0,4.5,1,2,0,0
3,cupcake,Joe,Boxed (Cardboard),3.5,0.75,2.75,10,14,12,12
4,fudge,Hugo,Individually Wrapped (Foil),3.0,1.0,2.0,0,20,22,22


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Item                   12 non-null     object 
 1   Student                12 non-null     object 
 2   Packaging              12 non-null     object 
 3   Price                  12 non-null     float64
 4   Expense                12 non-null     float64
 5   Profit per Item        12 non-null     float64
 6   Quantity Sold (Day 1)  12 non-null     int64  
 7   Quantity Sold (Day 2)  12 non-null     int64  
 8   Quantity Sold (Day 3)  12 non-null     int64  
 9   Quantity Sold (Day 4)  12 non-null     int64  
dtypes: float64(3), int64(4), object(3)
memory usage: 1.1+ KB



Set index for DataFrame as Item

In [None]:
df = df.set_index("Item")


## I) Filtering

Compare the `Price` column against 5 -> returns a boolean Series with one True/False value per item, known as a "boolean index" (or mask)

In [None]:
df["Price"]>5

Item
brownie                 False
cookie                  False
cake                     True
cupcake                 False
fudge                   False
banana bread            False
torte                    True
scone                   False
muffin                  False
rice krispies treats    False
apple pie                True
key lime pie             True
Name: Price, dtype: bool

Apply the boolean index to filter the data. Logic: `.loc` keeps only the rows where the index is True (price greater than 5) -> returns a DataFrame


In [None]:
df.loc[ df['Price'] > 5 ]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cake,Martina,Boxed (Clear Plastic),9.5,5.0,4.5,1,2,0,0
torte,Jade,Boxed (Clear Plastic),10.5,5.5,5.0,0,5,7,7
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
key lime pie,Martina,Box (Cardboard),9.5,4.5,5.0,0,0,3,3


Save the boolean index for _price of Item greater than 5_ in a variable and use that variable to apply the filter

In [None]:
# Save the boolean index as a separate variable
filter_price = df["Price"] > 5

# Apply the filter to create a dataframe of selected values using square brackets
df_filtered = df[filter_price]

# Check results
df_filtered

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cake,Martina,Boxed (Clear Plastic),9.5,5.0,4.5,1,2,0,0
torte,Jade,Boxed (Clear Plastic),10.5,5.5,5.0,0,5,7,7
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
key lime pie,Martina,Box (Cardboard),9.5,4.5,5.0,0,0,3,3


Check how many baked goods met our price criterion by summing the filter: True counts as 1 and False as 0, so the sum is the number of True values

In [None]:
filter_price.sum()

4

Get the length of the filtered dataframe. Note that this is the same as the sum of the True values in the filter.

In [None]:
len(df_filtered)

4
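The equivalence between the two counts can be checked on a small sample. This is a minimal sketch: `demo` is a hypothetical four-row frame built from prices shown above, not the full bake sale data.

```python
import pandas as pd

# Hypothetical four-row sample; the prices are copied from the data above
demo = pd.DataFrame(
    {"Price": [2.25, 9.5, 10.5, 2.0]},
    index=pd.Index(["brownie", "cake", "torte", "scone"], name="Item"),
)

mask = demo["Price"] > 5       # boolean Series: False, True, True, False
n_true = int(mask.sum())       # True counts as 1, False as 0
n_rows = len(demo[mask])       # number of rows kept by the filter

print(n_true, n_rows)          # 2 2
print(mask.mean())             # share of rows that match: 0.5
```

Taking the mean of the filter instead of the sum gives the fraction of rows that satisfy the condition.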

Let's say we want to get the items baked by "Sally". `df.loc["Sally"]` will not work, because `.loc` looks up labels in the index and `Student` is not the index of the DataFrame.

In [None]:
df.loc["Sally"]

KeyError: ignored

Instead, we need to filter on the `Student` column with a condition like `df['Student'] == 'Sally'`

In [None]:
filter_sally = df['Student'] == 'Sally'
df.loc[filter_sally]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cookie,Sally,Individually Wrapped (Plastic),1.25,0.5,0.75,40,32,38,38
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3


## II) Filtering for Parts of Strings

Create a filter for packaging values that contain the string "Individual" (the match is case-sensitive) and apply the filter

In [None]:
filter_individual = df['Packaging'].str.contains('Individual')
df[filter_individual]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
brownie,Hugo,Individually Wrapped (Plastic),2.25,0.25,2.0,17,19,25,25
cookie,Sally,Individually Wrapped (Plastic),1.25,0.5,0.75,40,32,38,38
fudge,Hugo,Individually Wrapped (Foil),3.0,1.0,2.0,0,20,22,22
banana bread,Mark,Individually Wrapped (Foil),2.75,0.5,2.25,0,10,13,13
scone,Jade,Individually Wrapped (Plastic),2.0,1.25,0.75,0,8,10,10
muffin,Anne,Individually Wrapped (Foil),1.5,0.75,0.75,0,3,4,4
rice krispies treats,Martina,Individually Wrapped (Plastic),1.25,0.25,1.0,0,0,10,30
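Because `str.contains` is case-sensitive by default, searching for lowercase "individual" would match nothing here. The sketch below uses a hypothetical three-value Series named `packaging` to show the `case=False` option, plus `regex=False` for matching text such as parentheses literally.

```python
import pandas as pd

# Hypothetical sample of packaging values taken from the table above
packaging = pd.Series(
    ["Individually Wrapped (Plastic)", "Boxed (Cardboard)", "Individually Wrapped (Foil)"],
    index=pd.Index(["brownie", "cupcake", "fudge"], name="Item"),
)

# Default behaviour: case-sensitive, so lowercase "individual" matches nothing
exact_case = packaging.str.contains("individual")

# case=False ignores upper/lower case differences
ignore_case = packaging.str.contains("individual", case=False)

# regex=False treats the pattern as plain text, so "(" and ")" are literal
literal_foil = packaging.str.contains("(Foil)", regex=False)

print(exact_case.tolist())     # [False, False, False]
print(ignore_case.tolist())    # [True, False, True]
print(literal_foil.tolist())   # [False, False, True]
```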


## III) Using the Inversion Operator (~)

Use the same filter to get the opposite rows with `~` (inversion operator)

In [None]:
df[~filter_individual]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cake,Martina,Boxed (Clear Plastic),9.5,5.0,4.5,1,2,0,0
cupcake,Joe,Boxed (Cardboard),3.5,0.75,2.75,10,14,12,12
torte,Jade,Boxed (Clear Plastic),10.5,5.5,5.0,0,5,7,7
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
key lime pie,Martina,Box (Cardboard),9.5,4.5,5.0,0,0,3,3
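The `~` operator flips each value of a boolean Series individually, which Python's `not` keyword cannot do. A minimal sketch with a hypothetical three-value mask:

```python
import pandas as pd

# Hypothetical boolean mask standing in for filter_individual
mask = pd.Series([True, False, True], index=["brownie", "cupcake", "fudge"])

# ~ inverts element by element
inverted = ~mask
print(inverted.tolist())   # [False, True, False]

# `not` asks for a single truth value for the whole Series, which is ambiguous
try:
    not mask
    raised = False
except ValueError:
    raised = True
print(raised)              # True
```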


Check the index labels of the DataFrame. Any one of them, such as 'cake', can be passed to `.loc` to slice out 1 entire row -> a single label returns a Series

In [None]:
df.index

Index(['brownie', 'cookie', 'cake', 'cupcake', 'fudge', 'banana bread',
       'torte', 'scone', 'muffin', 'rice krispies treats', 'apple pie',
       'key lime pie'],
      dtype='object', name='Item')

Slice out multiple entire rows using a list of items as index -> return DataFrame

In [None]:
df.loc[['cake', 'brownie']]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cake,Martina,Boxed (Clear Plastic),9.5,5.0,4.5,1,2,0,0
brownie,Hugo,Individually Wrapped (Plastic),2.25,0.25,2.0,17,19,25,25


Slice 1 row and 1 column at the same time by passing an index label (row) and a specific column label -> returns the value located in that cell instead of a Series
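The three return types of label-based selection can be compared side by side. This is a sketch on a hypothetical two-row frame named `demo`, using values from the data above.

```python
import pandas as pd

# Hypothetical two-row sample indexed by Item
demo = pd.DataFrame(
    {"Student": ["Hugo", "Martina"], "Price": [2.25, 9.5]},
    index=pd.Index(["brownie", "cake"], name="Item"),
)

cell = demo.loc["cake", "Price"]   # row label + column label -> single value
row = demo.loc["cake"]             # single row label -> Series
frame = demo.loc[["cake"]]         # list of row labels -> DataFrame

print(cell)                        # 9.5
print(type(row).__name__)          # Series
print(type(frame).__name__)        # DataFrame
```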

## IV) Using str.contains on the index column

Let's now check the index column and look for the items that contain the string `'pie'`

In [None]:
# Create a filter for any index value that contains "pie"
filter_contains_pie = df.index.str.contains("pie")
# Apply the filter
df[filter_contains_pie]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
rice krispies treats,Martina,Individually Wrapped (Plastic),1.25,0.25,1.0,0,0,10,30
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
key lime pie,Martina,Box (Cardboard),9.5,4.5,5.0,0,0,3,3


Item `rice krispies treats` appears because "krispies" contains the substring `'pie'`. Let's create a filter with `' pie'`, adding a space at the beginning, to exclude that item

In [None]:
# Create a filter for any index value that contains " pie" (with a leading space)
filter_contains_pie = df.index.str.contains(' pie')
# Apply the filter
df[filter_contains_pie]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
key lime pie,Martina,Box (Cardboard),9.5,4.5,5.0,0,0,3,3
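The leading-space trick would miss an item whose name starts with "pie". One possible alternative, not used in this notebook, is a regular expression with word boundaries (`\b`), since `str.contains` treats its pattern as a regex by default. In this sketch `items` is a hypothetical index and "pie crust" is an invented item added only to show the difference.

```python
import pandas as pd

# Hypothetical index; "pie crust" is NOT in the bake sale data
items = pd.Index(
    ["rice krispies treats", "apple pie", "key lime pie", "pie crust"],
    name="Item",
)

# Leading space: excludes "krispies" but also misses names starting with "pie"
space_prefix = items.str.contains(" pie")

# Word boundaries: matches "pie" only as a whole word, wherever it appears
whole_word = items.str.contains(r"\bpie\b")

print(space_prefix.tolist())   # [False, True, True, False]
print(whole_word.tolist())     # [False, True, True, True]
```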


## V) Combining Filters

Filters can be combined using `&` and `|` operators.

Example 1: filter for the baked goods that were baked by Jade **AND** had a price > $5

In [None]:
# Create a filter for Jade
filter_jade = df['Student']=='Jade'
# Create a filter for Price
filter_price = df['Price'] > 5
# Combine the filters for AND
filters_combined_and = filter_jade & filter_price
# Apply the combined filter
df.loc[filters_combined_and]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
torte,Jade,Boxed (Clear Plastic),10.5,5.5,5.0,0,5,7,7


Example 2: filter for the baked goods that were baked by Jade **OR** had a price > $5

In [None]:
# Combine the filters for OR
filters_combined_or = filter_jade | filter_price
# Apply the combined filter
df.loc[filters_combined_or]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cake,Martina,Boxed (Clear Plastic),9.5,5.0,4.5,1,2,0,0
torte,Jade,Boxed (Clear Plastic),10.5,5.5,5.0,0,5,7,7
scone,Jade,Individually Wrapped (Plastic),2.0,1.25,0.75,0,8,10,10
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
key lime pie,Martina,Box (Cardboard),9.5,4.5,5.0,0,0,3,3
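When the conditions are written inline instead of being saved in variables first, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`. A minimal sketch on a hypothetical three-row frame named `demo`:

```python
import pandas as pd

# Hypothetical three-row sample with values from the data above
demo = pd.DataFrame(
    {"Student": ["Martina", "Jade", "Jade"], "Price": [9.5, 10.5, 2.0]},
    index=pd.Index(["cake", "torte", "scone"], name="Item"),
)

# Parentheses make each comparison run before & or | combines the results
both = demo.loc[(demo["Student"] == "Jade") & (demo["Price"] > 5)]
either = demo.loc[(demo["Student"] == "Jade") | (demo["Price"] > 5)]

print(both.index.tolist())     # ['torte']
print(either.index.tolist())   # ['cake', 'torte', 'scone']
```

Without the parentheses, Python would try to evaluate `"Jade" & demo["Price"]` first, which usually ends in an error rather than the intended filter.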


## VI) No Matches

It is possible to get no matches when applying filters; the result is then an empty DataFrame that still has all of its columns. For example, look for items that cost less than $2 and were made by Jade.

In [None]:
# Create a new filter for less than 2
filter_price_less = df['Price'] < 2
# Combine the filters
df.loc[filter_price_less & filter_jade]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
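An empty result can be detected in code instead of by eye. The sketch below, on a hypothetical two-row frame named `demo`, uses the `DataFrame.empty` attribute and `len()`.

```python
import pandas as pd

# Hypothetical sample: Jade's two items from the data above
demo = pd.DataFrame(
    {"Student": ["Jade", "Jade"], "Price": [10.5, 2.0]},
    index=pd.Index(["torte", "scone"], name="Item"),
)

# Neither item costs strictly less than 2, so nothing matches
result = demo.loc[(demo["Student"] == "Jade") & (demo["Price"] < 2)]

print(result.empty)            # True
print(len(result))             # 0
print(list(result.columns))    # ['Student', 'Price'] (columns are kept)
```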


## VII) Using isin() method

To check a column against multiple values at once, it is possible to chain several conditions with the `|` operator, but that is not recommended because it quickly becomes long and error-prone. Instead, we can use the `isin()` method.

Let's say we have a list of students and want to retrieve the items baked by any student on that list

In [None]:
# List of students
students = ["Tom","Anne",'Martin', "Jose",'Sally']
# Creating a filter for any student in the list
filter_students_that_bake = df['Student'].isin(students)
# Apply filter
df.loc[filter_students_that_bake]

Item,Student,Packaging,Price,Expense,Profit per Item,Quantity Sold (Day 1),Quantity Sold (Day 2),Quantity Sold (Day 3),Quantity Sold (Day 4)
cookie,Sally,Individually Wrapped (Plastic),1.25,0.5,0.75,40,32,38,38
muffin,Anne,Individually Wrapped (Foil),1.5,0.75,0.75,0,3,4,4
apple pie,Sally,Box (Cardboard),10.0,4.0,6.0,0,0,2,3
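To see that `isin()` gives the same result as a chain of `|` conditions, both can be built for a small sample. In this sketch `bakers` is a hypothetical Series of student names taken from the data above; note that `isin()` requires an exact match, so "Martin" in the list does not match "Martina".

```python
import pandas as pd

# Hypothetical sample of the Student column
bakers = pd.Series(
    ["Sally", "Martina", "Anne", "Joe"],
    index=pd.Index(["cookie", "cake", "muffin", "cupcake"], name="Item"),
)
wanted = ["Tom", "Anne", "Martin", "Jose", "Sally"]

# One call checks every value in the list
with_isin = bakers.isin(wanted)

# Equivalent chain of comparisons joined with |
with_or = (
    (bakers == "Tom")
    | (bakers == "Anne")
    | (bakers == "Martin")
    | (bakers == "Jose")
    | (bakers == "Sally")
)

print(with_isin.tolist())          # [True, False, True, False]
print(with_isin.equals(with_or))   # True
```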


## VIII) Changing Values with .loc

Locate the price value of an item and change it

In [None]:
# Locate the Price of a torte and change the value
df.loc['torte',"Price"] = 11.5

In [None]:
# Inspect torte
df.loc['torte']

Student                                   Jade
Packaging                Boxed (Clear Plastic)
Price                                     11.5
Expense                                    5.5
Profit per Item                            5.0
Quantity Sold (Day 1)                        0
Quantity Sold (Day 2)                        5
Quantity Sold (Day 3)                        7
Quantity Sold (Day 4)                        7
Name: torte, dtype: object
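Note that `Profit per Item` is still 5.0 after the price change: assigning with `.loc` updates only the targeted cell, not columns derived from it. The sketch below assumes that profit was computed as `Price` minus `Expense`, which the values above suggest but this notebook does not state. `demo` is a hypothetical two-row sample.

```python
import pandas as pd

# Hypothetical sample with values from the data above
demo = pd.DataFrame(
    {"Price": [10.5, 2.0], "Expense": [5.5, 1.25], "Profit per Item": [5.0, 0.75]},
    index=pd.Index(["torte", "scone"], name="Item"),
)

# Change a single cell
demo.loc["torte", "Price"] = 11.5
stale = demo.loc["torte", "Profit per Item"]    # still the old value

# Assumption: profit = Price - Expense; recompute it for the changed row
demo.loc["torte", "Profit per Item"] = (
    demo.loc["torte", "Price"] - demo.loc["torte", "Expense"]
)
fresh = demo.loc["torte", "Profit per Item"]

print(stale, fresh)    # 5.0 6.0
```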