# Scale Up With Python

## Part 1: Let's Get Coding

**Contents:**

* **This is a Jupyter Notebook**
* **Basic Python data types**
    * Strings
    * Integers/Float
    * Boolean
    * Sequences (Lists)
    * Mappings (Dictionaries)
    
    
* **What can we do with this data?**
    * Variables
    * Methods
    * Loops
    * Conditional Logic
    * Functions
    
    
* **Other data objects?**
    * Install/Import
    * A few more types
    * Pandas DataFrames walkthrough

### Notebooks

Kernel: underlying environment/files for a given session

Here is an empty cell:

Here is a cell with a comment

In [None]:
# generic comment


Please create another cell:

### Types

###### Basics

[Strings](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) - *a proxy for raw text, signified by quote enclosure*

In [1]:
'Howdy!'

'Howdy!'

In [2]:
# Can be empty
''

''

In [3]:
# Triple quotation allows for linebreaks
"""Well here's a lengthy piece of



text
"""

"Well here's a lengthy piece of\n\n\n\ntext\n"

In [4]:
# Can use print 'function' to view text cleanly
print("""Well here's a lengthy piece of



text
""")

Well here's a lengthy piece of



text



In [5]:
# Even non-'text' items are strings if enclosed by quotation marks
'20'

'20'

In [6]:
# Function to check data type
type('20')

str
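
A quoted number stays a string until it is explicitly converted. Here is a minimal sketch of what that means in practice; `quantity_text` is a made-up variable name used only for illustration:

```python
# Hypothetical value, e.g. as read from a text file or a form
quantity_text = '20'

# '+' on two strings concatenates rather than adds
doubled_text = quantity_text + quantity_text   # '2020'

# int() converts a numeric-looking string into a whole number
quantity = int(quantity_text)
doubled_number = quantity + quantity           # 40

print(doubled_text, doubled_number)
```

`float('20')` works the same way when a decimal value is wanted.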

[Integers](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) - *whole numbers*

In [7]:
100

100

In [8]:
# Can perform operations on integers
100+20

120

In [9]:
# Can perform operations on integers
100*20

2000

In [10]:
# Can perform operations on integers -> division will automatically become decimalised 'float'
100/20

5.0

In [11]:
# Can perform operations on integers -> can easily go between float/integer for whole numbers
int(100/20)

5

In [12]:
# Function to check data type
type(20)

int
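
Besides `/`, Python has two integer-friendly division operators that avoid the round trip through `float`. A small sketch with made-up numbers (`total_items` and `box_size` are illustrative names only):

```python
total_items = 17      # hypothetical numbers, for illustration only
box_size = 5

true_division = total_items / box_size     # 3.4, always a float
full_boxes = total_items // box_size       # 3, floor division stays an int
left_over = total_items % box_size         # 2, the remainder

print(true_division, full_boxes, left_over)
```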

[Float](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) - *real numbers (stored approximately), signified with decimal point*

In [13]:
100.0

100.0

In [14]:
# Can perform operations on floats
100.0+20.0

120.0

In [15]:
# Can perform operations on floats
100.0/20.0

5.0

In [16]:
# Can perform operations on floats and integers -> mixed arithmetic 
100.0**2

10000.0

In [17]:
# Function to check data type
type(20.0)

float

Boolean - *logical type, either yes or no (True or False), no quotation marks!*

In [18]:
True

True

In [19]:
False

False

In [20]:
# Can go between integer/boolean
bool(1)

True

In [21]:
# Can go between integer/boolean
bool(0)

False

In [22]:
# Boolean does not equal string!
True=='True'

False

In [23]:
# Can't easily convert from string! Any non-empty string converts to True
False==bool('False')

False

In [24]:
# Function to check data type
type(True)

bool
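
Because `bool()` only checks whether a string is empty, turning text such as `'False'` into a real boolean needs an explicit rule. One possible sketch follows; `parse_bool` is a hypothetical helper, not a built-in, and the accepted words are an assumption:

```python
def parse_bool(text: str) -> bool:
    """Hypothetical helper: interpret common yes/no style strings."""
    cleaned = text.strip().lower()
    if cleaned in ('true', 'yes', '1'):
        return True
    if cleaned in ('false', 'no', '0', ''):
        return False
    raise ValueError(f'Cannot interpret {text!r} as a boolean')

print(bool('False'))        # True: the string is not empty
print(bool(''))             # False: the empty string is the only falsy string
print(parse_bool('False'))  # False
```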

[None](https://docs.python.org/3/reference/datamodel.html#none) - *Nada, Zilch, Nothing at All*

In [25]:
None

In [26]:
# Function to check data type
type(None)

NoneType

###### [Sequences ](https://docs.python.org/3/reference/datamodel.html#sequences)
* Excluding Tuples/Ranges

Lists - *flexible, mutable, ordered group of data objects*

In [27]:
# Can consist of multiple data types
[1,3,'10',None,15]

[1, 3, '10', None, 15]

In [28]:
# List can contain list
[1,3,[10,5]]

[1, 3, [10, 5]]

In [29]:
# Can be combined
[1,3,'10',None,15]+['17']

[1, 3, '10', None, 15, '17']

In [30]:
# Ordered: can subselect items - FROM ZERO
[1,3,'10',None,15][0]

1

In [31]:
# Index 3 is actually the fourth entry in the sequence
[1,3,'10',None,15][3]

In [32]:
# Negative indices select in reverse
[1,3,'10',None,15][-1]

15

In [33]:
# Index 10 doesn't exist
[1,3,'10',None,15][10]

IndexError: list index out of range

In [34]:
# We can select a range of entries, does not include endpoint
[1,3,'10',None,15][0:2]

[1, 3]

In [35]:
# Function to check data type
type([])

list
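
The range selection shown earlier has a few shortcuts worth knowing. A brief sketch on a made-up list (`letters` is an illustrative name only):

```python
letters = ['a', 'b', 'c', 'd', 'e']   # hypothetical example list

first_two = letters[:2]        # start defaults to 0
from_third = letters[2:]       # end defaults to the length of the list
last_two = letters[-2:]        # negative start counts from the end
every_other = letters[::2]     # a third number sets the step
out_of_range = letters[3:10]   # unlike a single index, a slice never raises IndexError

print(first_two, from_third, last_two, every_other, out_of_range)
```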

###### [Mapping ](https://docs.python.org/3/reference/datamodel.html#mappings)

Dictionaries - *set of mappings based on key,value pairs*

In [36]:
# Value type flexible
{'integer':10,
'float':8.5,
'string':'numbers'}

{'integer': 10, 'float': 8.5, 'string': 'numbers'}

In [37]:
# Keys must be unique: a repeated key keeps only its last value -> a single key can still hold several values (list/tuple)
{'entry1':5,
'entry2':[10,20],
'entry1':(5.0,5.0)}

{'entry1': (5.0, 5.0), 'entry2': [10, 20]}

In [38]:
# Can 'lookup' dictionary value
{'integer':10,
'float':8.5,
'string':'numbers'}['integer']

10

In [39]:
# Function to check data type
type({'integer':10,
'float':8.5,
'string':'numbers'})

dict
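
Looking up a key that does not exist raises a `KeyError`. The sketch below shows two common ways around that, using the same small dictionary as above stored in a variable called `lookup` (a name chosen here for illustration):

```python
lookup = {'integer': 10, 'float': 8.5, 'string': 'numbers'}

# Square brackets raise KeyError for a key that isn't there
try:
    lookup['boolean']
except KeyError as error:
    print('Missing key:', error)

# 'in' tests for a key; .get() returns a default instead of raising
has_boolean = 'boolean' in lookup            # False
boolean_value = lookup.get('boolean')        # None
boolean_or_zero = lookup.get('boolean', 0)   # 0
```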

### [Variables](https://realpython.com/python-variables/)

Some basic examples

In [40]:
# Set
fruit='apple'
print(fruit)

apple


In [41]:
# Compare
building_1=105.5
building_2=200.6

building_2-building_1

95.1

In [42]:
# Combine
software_list_mac=['Safari','Garageband','Apple TV']
software_list_windows=['Microsoft Excel', 'Microsoft Word','Outlook']

software_list_mac+software_list_windows

['Safari',
 'Garageband',
 'Apple TV',
 'Microsoft Excel',
 'Microsoft Word',
 'Outlook']

Dictionary Operations - *too verbose to define and subselect all in one cell, let's use a variable to simplify*

In [43]:
# Set
shopping_list={'apple':6,
'banana':4,
'hobnob (pack)':1,
'salmon':2,
'crab':1}

In [44]:
# Lookup
shopping_list['apple']

6

In [45]:
# Lookup
shopping_list['hobnob (pack)']

1

In [46]:
# Update
shopping_list['salmon']=1

In [47]:
# View
shopping_list

{'apple': 6, 'banana': 4, 'hobnob (pack)': 1, 'salmon': 1, 'crab': 1}

In [48]:
# Method - view keys
shopping_list.keys()

dict_keys(['apple', 'banana', 'hobnob (pack)', 'salmon', 'crab'])

In [49]:
# Method - lookup values
shopping_list.values()

dict_values([6, 4, 1, 1, 1])

List Operations - *too verbose to define and subselect all in one cell, let's use a variable to simplify*

In [50]:
# Set/View
inventory=['apple','banana','pear','jackfruit','digestives (pack)','penguins (pack)',
           'hobnob (pack)','rice (pack)','potato','cod','prawns','seabass','salmon','tuna']

inventory

['apple',
 'banana',
 'pear',
 'jackfruit',
 'digestives (pack)',
 'penguins (pack)',
 'hobnob (pack)',
 'rice (pack)',
 'potato',
 'cod',
 'prawns',
 'seabass',
 'salmon',
 'tuna']

In [51]:
# Subselect
inventory[0]

'apple'

In [52]:
# Indexed by position only -> no lookup by value
inventory['apple']

TypeError: list indices must be integers or slices, not str

In [53]:
# Method - add item
inventory.append('snickers')
inventory

['apple',
 'banana',
 'pear',
 'jackfruit',
 'digestives (pack)',
 'penguins (pack)',
 'hobnob (pack)',
 'rice (pack)',
 'potato',
 'cod',
 'prawns',
 'seabass',
 'salmon',
 'tuna',
 'snickers']

In [54]:
# Method - add many items (list)
inventory.extend(['galaxy'])
inventory

['apple',
 'banana',
 'pear',
 'jackfruit',
 'digestives (pack)',
 'penguins (pack)',
 'hobnob (pack)',
 'rice (pack)',
 'potato',
 'cod',
 'prawns',
 'seabass',
 'salmon',
 'tuna',
 'snickers',
 'galaxy']

Assignment vs Logic - *how do we compare variables?*

In [55]:
# Single '=' signifies assignment
inventory[-1]='haddock'
inventory

['apple',
 'banana',
 'pear',
 'jackfruit',
 'digestives (pack)',
 'penguins (pack)',
 'hobnob (pack)',
 'rice (pack)',
 'potato',
 'cod',
 'prawns',
 'seabass',
 'salmon',
 'tuna',
 'snickers',
 'haddock']

In [56]:
# Double '==' signifies comparison -> returns boolean
inventory[-1]=='haddock'

True

In [57]:
# Double '==' signifies comparison -> returns boolean
inventory[-1]=='tuna'

False

In [58]:
# Can use 'in' to check if value exists in sequence -> returns boolean
'haddock' in inventory

True

In [59]:
# Can use 'in' to check if value exists in sequence -> returns boolean
'Haddock' in inventory

False

In [60]:
# Can use '&' to combine multiple pieces of logic (the keyword 'and' also works on plain booleans) -> returns boolean
('haddock' in inventory) & ('wine' in inventory)

False

In [61]:
# Can use '|' to combine multiple pieces of logic (the keyword 'or' also works on plain booleans) -> returns boolean
('haddock' in inventory) | ('wine' in inventory)

True
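
For plain `True`/`False` values the keywords `and`/`or` and the symbols `&`/`|` give the same answer, but they are not identical. A short sketch of the differences, using a made-up list called `stock` in place of the inventory:

```python
stock = ['cod', 'haddock']   # hypothetical stand-in for the inventory list

# 'and' / 'or' are the logical keywords and bind loosely, so no brackets are needed
both_words = 'haddock' in stock and 'wine' in stock      # False
either_words = 'haddock' in stock or 'wine' in stock     # True

# '&' / '|' are bitwise operators; they work on booleans but bind tightly,
# which is why each comparison is wrapped in brackets
both_symbols = ('haddock' in stock) & ('wine' in stock)  # False

# 'or' short-circuits: the right-hand side is never evaluated here,
# so the division by zero does not happen
safe = True or (1 / 0)

print(both_words, either_words, both_symbols, safe)
```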

### What can we do with these variables?

[Loops](https://docs.python.org/3/tutorial/datastructures.html#looping-techniques) - *iterate through sequence*

In [62]:
# For loop
for item in inventory:
    print(item)

apple
banana
pear
jackfruit
digestives (pack)
penguins (pack)
hobnob (pack)
rice (pack)
potato
cod
prawns
seabass
salmon
tuna
snickers
haddock


In [63]:
# While loop - be careful of infinite loop
i=0
while i<12:
    print(inventory[i])
    i+=1

apple
banana
pear
jackfruit
digestives (pack)
penguins (pack)
hobnob (pack)
rice (pack)
potato
cod
prawns
seabass


[Conditional Statements](https://docs.python.org/3/tutorial/controlflow.html) - *performs action depending on logic*

In [64]:
item='apple'

if item in inventory:
    print('In Inventory')
else:
    print('Unavailable')

In Inventory


In [65]:
item='grapefruit'

if item in inventory:
    print('In Inventory')
else:
    print('Unavailable')

Unavailable


In [66]:
# Can combine loops and conditional statements
all_items_shopping_list={}

    # Loop through each item in store's inventory
for item in inventory:

    # Check if item is in shopping list
    if item in shopping_list.keys():
        # If item in shopping list, add to all_items_shopping_list with desired purchase volume
        all_items_shopping_list[item]=shopping_list[item] 
        
    else:
        # If item is not in shopping list, add to all_items_shopping_list with purchase volume equal to zero
        all_items_shopping_list[item]=0      

In [67]:
all_items_shopping_list

{'apple': 6,
 'banana': 4,
 'pear': 0,
 'jackfruit': 0,
 'digestives (pack)': 0,
 'penguins (pack)': 0,
 'hobnob (pack)': 1,
 'rice (pack)': 0,
 'potato': 0,
 'cod': 0,
 'prawns': 0,
 'seabass': 0,
 'salmon': 1,
 'tuna': 0,
 'snickers': 0,
 'haddock': 0}
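
The loop-plus-condition pattern above can also be written in one line with a dictionary comprehension and `.get()`. This is an alternative sketch rather than a replacement, shown on cut-down, made-up data (`stock` and `wanted` are illustrative names):

```python
# Shortened, hypothetical versions of the inventory and shopping list
stock = ['apple', 'banana', 'pear']
wanted = {'apple': 6, 'banana': 4, 'crab': 1}

# .get(item, 0) returns the wanted quantity, or 0 when the item isn't on the list
all_items = {item: wanted.get(item, 0) for item in stock}
print(all_items)
```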

###### Functions

Native

In [68]:
# Universal: View object below cell
print(all_items_shopping_list)

{'apple': 6, 'banana': 4, 'pear': 0, 'jackfruit': 0, 'digestives (pack)': 0, 'penguins (pack)': 0, 'hobnob (pack)': 1, 'rice (pack)': 0, 'potato': 0, 'cod': 0, 'prawns': 0, 'seabass': 0, 'salmon': 1, 'tuna': 0, 'snickers': 0, 'haddock': 0}


In [69]:
# Methods: tied to data type
all_items_shopping_list.keys()

dict_keys(['apple', 'banana', 'pear', 'jackfruit', 'digestives (pack)', 'penguins (pack)', 'hobnob (pack)', 'rice (pack)', 'potato', 'cod', 'prawns', 'seabass', 'salmon', 'tuna', 'snickers', 'haddock'])

In [70]:
# Methods: tied to data type
all_items_shopping_list.items()

dict_items([('apple', 6), ('banana', 4), ('pear', 0), ('jackfruit', 0), ('digestives (pack)', 0), ('penguins (pack)', 0), ('hobnob (pack)', 1), ('rice (pack)', 0), ('potato', 0), ('cod', 0), ('prawns', 0), ('seabass', 0), ('salmon', 1), ('tuna', 0), ('snickers', 0), ('haddock', 0)])

[Custom](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)

In [71]:
# Create price dictionary
inventory_price_dict={
'apple':0.25,
'banana':0.25,
'pear':0.3,
'jackfruit':0.6,
'digestives (pack)':1.5,
'penguins (pack)':2,
'hobnob (pack)':2,
'rice (pack)':2.5,
'potato':0.5,
'cod':3.5,
'prawns':4,
'seabass':5,
'salmon':4,
'tuna':5,
'snickers':0.75,
'haddock':3    
}

In [72]:
# Define function to calculate total price of shop based on input dictionary 'shopping_list'

def shopping_spend(shopping_list:dict):
    
    # Spend starts at zero
    spend=0

    # Loop through each item in your shopping list
    for item in shopping_list.keys():
        
        # Check if item is available in inventory
        if item in inventory_price_dict.keys():
            # If item available, multiply cost by purchase volume and add to existing spend total
            spend+=inventory_price_dict[item]*shopping_list[item]
        
        # If item unavailable, add zero to spend and move on to the next item on the shopping list
        else:
            spend+=0
    
    # After looping through complete shopping list, return total spend
    return spend


In [73]:
# Run shopping_spend function over shopping list
shopping_spend(shopping_list)

8.5
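
Once the long version makes sense, the same calculation can be compressed. The sketch below is one possible variant, not the notebook's function: `shopping_spend_compact` is a hypothetical name, it takes the price dictionary as a second argument instead of reading it from outside the function, and the data is a cut-down example.

```python
def shopping_spend_compact(shopping_list: dict, prices: dict) -> float:
    """Total cost of a shopping list; items missing from prices cost nothing."""
    return sum(prices.get(item, 0) * quantity
               for item, quantity in shopping_list.items())

# Hypothetical cut-down data
prices = {'apple': 0.25, 'salmon': 4}
basket = {'apple': 6, 'salmon': 1, 'crab': 1}

total = shopping_spend_compact(basket, prices)
print(total)
```

Passing the prices in as an argument means the function can be reused with a different price list without editing it.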

### [Packages](https://docs.python.org/3/tutorial/modules.html#packages)


pip install - *standard method of installing python packages/libraries*

In [75]:
# '!' sends command to Terminal
!pip install numpy



In [76]:
# includes installation of dependencies 
!pip install pandas



In [77]:
# includes installation of dependencies 
!pip install seaborn



Let's Import A Couple

In [78]:
# Import each whole package under a short alias
import numpy as np
import pandas as pd
import seaborn as sns

###### [Numpy](https://numpy.org/) Types

* NumPy is a broader mathematics package for Python
* Many uses, notably quick vector calculations

1D Numpy Arrays - *vector*

In [79]:
vector=np.array([0.0,1.5,4.5])
vector

array([0. , 1.5, 4.5])

2D Numpy Arrays - *matrix*

In [80]:
matrix=np.array([[0.5,1.0,10.0],
        [0.0,3.5,5.5]])
matrix

array([[ 0.5,  1. , 10. ],
       [ 0. ,  3.5,  5.5]])

In [81]:
# Quick and built in linear algebra methods/functions
np.matmul(matrix,vector.T)

array([46.5, 30. ])
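
A few details about the multiplication above, sketched with the same numbers: the shapes involved, the `@` operator as shorthand for `np.matmul`, and the fact that transposing a 1D array changes nothing.

```python
import numpy as np

matrix = np.array([[0.5, 1.0, 10.0],
                   [0.0, 3.5, 5.5]])
vector = np.array([0.0, 1.5, 4.5])

print(matrix.shape, vector.shape)   # 2 rows x 3 columns, and a length-3 vector

# '@' is the operator form of np.matmul
product = matrix @ vector

# Transposing a 1D array returns it unchanged, so .T is optional here
same = np.array_equal(vector, vector.T)

# Arithmetic applies element by element, no loop required
scaled = vector * 2

print(product, same, scaled)
```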

### [Pandas DataFrames](https://pandas.pydata.org/docs/user_guide/dsintro.html)

* Pandas is a common package used for data analytics
* It is dependent on NumPy and several other libraries
* The main benefit is its easy-to-use 'DataFrame' object

Introduction - *what does an unseen dataframe look like?*

In [82]:
# Import ready-made pandas dataframe
sample_df = sns.load_dataset('iris')
sample_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [83]:
# Let's look at every row
pd.set_option('display.max_rows', 150)
sample_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [84]:
# Let's just look at the first few
sample_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [85]:
# Let's look at the column types
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [86]:
# Let's look at the (numeric) column values
sample_df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [87]:
# Any missing (None/NaN) values?
sample_df.isna().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Under The Hood - *you've already seen DataFrames (sort of)*

In [88]:
# A DataFrame can be viewed as a dictionary of dictionaries (column -> {row index -> value})!
sample_df.to_dict()

{'sepal_length': {0: 5.1,
  1: 4.9,
  2: 4.7,
  3: 4.6,
  4: 5.0,
  5: 5.4,
  6: 4.6,
  7: 5.0,
  8: 4.4,
  9: 4.9,
  10: 5.4,
  11: 4.8,
  12: 4.8,
  13: 4.3,
  14: 5.8,
  15: 5.7,
  16: 5.4,
  17: 5.1,
  18: 5.7,
  19: 5.1,
  20: 5.4,
  21: 5.1,
  22: 4.6,
  23: 5.1,
  24: 4.8,
  25: 5.0,
  26: 5.0,
  27: 5.2,
  28: 5.2,
  29: 4.7,
  30: 4.8,
  31: 5.4,
  32: 5.2,
  33: 5.5,
  34: 4.9,
  35: 5.0,
  36: 5.5,
  37: 4.9,
  38: 4.4,
  39: 5.1,
  40: 5.0,
  41: 4.5,
  42: 4.4,
  43: 5.0,
  44: 5.1,
  45: 4.8,
  46: 5.1,
  47: 4.6,
  48: 5.3,
  49: 5.0,
  50: 7.0,
  51: 6.4,
  52: 6.9,
  53: 5.5,
  54: 6.5,
  55: 5.7,
  56: 6.3,
  57: 4.9,
  58: 6.6,
  59: 5.2,
  60: 5.0,
  61: 5.9,
  62: 6.0,
  63: 6.1,
  64: 5.6,
  65: 6.7,
  66: 5.6,
  67: 5.8,
  68: 6.2,
  69: 5.6,
  70: 5.9,
  71: 6.1,
  72: 6.3,
  73: 6.1,
  74: 6.4,
  75: 6.6,
  76: 6.8,
  77: 6.7,
  78: 6.0,
  79: 5.7,
  80: 5.5,
  81: 5.5,
  82: 5.8,
  83: 6.0,
  84: 5.4,
  85: 6.0,
  86: 6.7,
  87: 6.3,
  88: 5.6,
  89: 5.5,
  90

In [89]:
# Let's look at this again
sample_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [90]:
# Column selection is the same as dictionary value lookup
sample_df['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
5      5.4
6      4.6
7      5.0
8      4.4
9      4.9
10     5.4
11     4.8
12     4.8
13     4.3
14     5.8
15     5.7
16     5.4
17     5.1
18     5.7
19     5.1
20     5.4
21     5.1
22     4.6
23     5.1
24     4.8
25     5.0
26     5.0
27     5.2
28     5.2
29     4.7
30     4.8
31     5.4
32     5.2
33     5.5
34     4.9
35     5.0
36     5.5
37     4.9
38     4.4
39     5.1
40     5.0
41     4.5
42     4.4
43     5.0
44     5.1
45     4.8
46     5.1
47     4.6
48     5.3
49     5.0
50     7.0
51     6.4
52     6.9
53     5.5
54     6.5
55     5.7
56     6.3
57     4.9
58     6.6
59     5.2
60     5.0
61     5.9
62     6.0
63     6.1
64     5.6
65     6.7
66     5.6
67     5.8
68     6.2
69     5.6
70     5.9
71     6.1
72     6.3
73     6.1
74     6.4
75     6.6
76     6.8
77     6.7
78     6.0
79     5.7
80     5.5
81     5.5
82     5.8
83     6.0
84     5.4
85     6.0
86     6.7
87     6.3
88     5.6
89     5.5
90     5.5

In [91]:
# Instead of an outright dictionary this is a special Pandas data type called a 'Series'
type(sample_df['sepal_length'])

pandas.core.series.Series

In [92]:
# Let's lookup the first value of this 'Series'
sample_df['sepal_length'][0]

5.1

In [93]:
# Its type is one we (almost) recognise! NumPy's version of a float
type(sample_df['sepal_length'][0])

numpy.float64
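
Note that `[0]` on a Series looks up the index *label* 0. It matches the first row here only because `sample_df` uses the default 0, 1, 2, ... index. A small sketch with a made-up frame (`df` and its index values are assumptions for illustration) shows the explicit alternatives:

```python
import pandas as pd

# Small hypothetical frame with a non-default index
df = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7]}, index=[10, 11, 12])

by_position = df['sepal_length'].iloc[0]   # first row, whatever its label
by_label = df['sepal_length'].loc[10]      # row whose index label is 10
as_python_float = float(by_position)       # numpy.float64 -> plain Python float

print(by_position, by_label, type(as_python_float))
```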

Let's Create Our Own From Scratch!

In [94]:
# Starting with the price dictionary
inventory_price_dict

{'apple': 0.25,
 'banana': 0.25,
 'pear': 0.3,
 'jackfruit': 0.6,
 'digestives (pack)': 1.5,
 'penguins (pack)': 2,
 'hobnob (pack)': 2,
 'rice (pack)': 2.5,
 'potato': 0.5,
 'cod': 3.5,
 'prawns': 4,
 'seabass': 5,
 'salmon': 4,
 'tuna': 5,
 'snickers': 0.75,
 'haddock': 3}

In [95]:
# We can reformat into a DataFrame (don't need to understand syntax unless you're interested)
food_price_df=pd.DataFrame.from_dict(inventory_price_dict,orient='index',columns=['Price']).reset_index().rename(columns={'index':'Item'})
food_price_df

Unnamed: 0,Item,Price
0,apple,0.25
1,banana,0.25
2,pear,0.3
3,jackfruit,0.6
4,digestives (pack),1.5
5,penguins (pack),2.0
6,hobnob (pack),2.0
7,rice (pack),2.5
8,potato,0.5
9,cod,3.5


Let's Import Data as a DataFrame!

In [96]:
# Import csv using inbuilt pandas (pd) function [read_csv]
food_guide_df=pd.read_csv('Food_Guide.csv')
food_guide_df

Unnamed: 0,Item,Health_Rating,Allergens
0,apple,A,
1,banana,A,
2,pear,A,
3,jackfruit,A,
4,digestives (pack),E,"[""Gluten""]"
5,penguins (pack),F,"[""Gluten"",""Dairy""]"
6,hobnob (pack),E,"[""Gluten"",""Dairy""]"
7,rice (pack),C,
8,potato,C,
9,cod,B,


In [97]:
# Let's take a look at types...Allergens is stored as text (object), not as lists
food_guide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Item           16 non-null     object
 1   Health_Rating  16 non-null     object
 2   Allergens      5 non-null      object
dtypes: object(3)
memory usage: 512.0+ bytes


In [98]:
# Let's convert Allergens to list via list comprehension (one-line loop) and 'ast' package

import ast

food_guide_df['Allergens']=[ast.literal_eval(x) for x in food_guide_df['Allergens'].fillna("[]")]
food_guide_df

Unnamed: 0,Item,Health_Rating,Allergens
0,apple,A,[]
1,banana,A,[]
2,pear,A,[]
3,jackfruit,A,[]
4,digestives (pack),E,[Gluten]
5,penguins (pack),F,"[Gluten, Dairy]"
6,hobnob (pack),E,"[Gluten, Dairy]"
7,rice (pack),C,[]
8,potato,C,[]
9,cod,B,[]


In [99]:
type(food_guide_df['Allergens'][0])

list
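
To see what `ast.literal_eval` does on its own, here is a minimal sketch applied to a single string of the kind stored in the CSV (the variable names are illustrative):

```python
import ast

raw = '["Gluten","Dairy"]'          # text, as stored in the CSV
parsed = ast.literal_eval(raw)      # a real Python list

print(type(raw), type(parsed))
print(parsed)

# literal_eval only accepts literals (strings, numbers, lists, dicts, ...);
# anything else raises an error instead of being executed
try:
    ast.literal_eval('print("hello")')
    rejected = False
except (ValueError, SyntaxError) as error:
    rejected = True
    print('Rejected:', type(error).__name__)
```

This is why it is generally preferred over `eval` for data read from files.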

Data Wrangling Basics

In [100]:
# Subselect Row For Stacking
food_guide_df[15:]

Unnamed: 0,Item,Health_Rating,Allergens
15,haddock,B,[]


In [101]:
# Stack data using inbuilt pandas (pd) function [concat]
food_guide_df_duped=pd.concat([food_guide_df,food_guide_df[15:]]).reset_index(drop=True)
food_guide_df_duped

Unnamed: 0,Item,Health_Rating,Allergens
0,apple,A,[]
1,banana,A,[]
2,pear,A,[]
3,jackfruit,A,[]
4,digestives (pack),E,[Gluten]
5,penguins (pack),F,"[Gluten, Dairy]"
6,hobnob (pack),E,"[Gluten, Dairy]"
7,rice (pack),C,[]
8,potato,C,[]
9,cod,B,[]


In [102]:
# Dedupe data using inbuilt pandas (pd) DataFrame method [drop_duplicates]
food_guide_df_deduped=food_guide_df_duped.drop_duplicates(subset='Item')
food_guide_df_deduped

Unnamed: 0,Item,Health_Rating,Allergens
0,apple,A,[]
1,banana,A,[]
2,pear,A,[]
3,jackfruit,A,[]
4,digestives (pack),E,[Gluten]
5,penguins (pack),F,"[Gluten, Dairy]"
6,hobnob (pack),E,"[Gluten, Dairy]"
7,rice (pack),C,[]
8,potato,C,[]
9,cod,B,[]


In [103]:
# Merge data using inbuilt pandas (pd) DataFrame method [merge]
food_guide_price_df=food_guide_df_deduped.merge(food_price_df)
food_guide_price_df

Unnamed: 0,Item,Health_Rating,Allergens,Price
0,apple,A,[],0.25
1,banana,A,[],0.25
2,pear,A,[],0.3
3,jackfruit,A,[],0.6
4,digestives (pack),E,[Gluten],1.5
5,penguins (pack),F,"[Gluten, Dairy]",2.0
6,hobnob (pack),E,"[Gluten, Dairy]",2.0
7,rice (pack),C,[],2.5
8,potato,C,[],0.5
9,cod,B,[],3.5
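
With no extra arguments, `merge` joins on the columns the two DataFrames share (here `Item`) and keeps only rows found in both. The sketch below, on two made-up miniature tables, shows how the `on`, `how` and `indicator` arguments make those choices explicit:

```python
import pandas as pd

# Hypothetical miniature tables sharing an 'Item' column
guide = pd.DataFrame({'Item': ['apple', 'cod', 'crab'],
                      'Health_Rating': ['A', 'B', 'B']})
prices = pd.DataFrame({'Item': ['apple', 'cod'],
                       'Price': [0.25, 3.5]})

# Default: inner join on the shared column -> 'crab' is dropped
inner = guide.merge(prices)

# Explicit key, keep every row of 'guide', and flag where each row came from
left = guide.merge(prices, on='Item', how='left', indicator=True)

print(inner)
print(left)
```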


In [104]:
# Explode list column using inbuilt pandas (pd) DataFrame method [explode]
food_guide_price_df_long=food_guide_price_df.explode('Allergens').fillna('').reset_index(drop=True)
food_guide_price_df_long

Unnamed: 0,Item,Health_Rating,Allergens,Price
0,apple,A,,0.25
1,banana,A,,0.25
2,pear,A,,0.3
3,jackfruit,A,,0.6
4,digestives (pack),E,Gluten,1.5
5,penguins (pack),F,Gluten,2.0
6,penguins (pack),F,Dairy,2.0
7,hobnob (pack),E,Gluten,2.0
8,hobnob (pack),E,Dairy,2.0
9,rice (pack),C,,2.5


In [105]:
# Reformat dataframe using inbuilt pandas (pd) DataFrame methods [groupby/apply]
food_guide_price_df_wide=food_guide_price_df_long.groupby(['Item','Health_Rating','Price'])['Allergens'].apply(list).reset_index()
food_guide_price_df_wide

Unnamed: 0,Item,Health_Rating,Price,Allergens
0,apple,A,0.25,[]
1,banana,A,0.25,[]
2,cod,B,3.5,[]
3,digestives (pack),E,1.5,[Gluten]
4,haddock,B,3.0,[]
5,hobnob (pack),E,2.0,"[Gluten, Dairy]"
6,jackfruit,A,0.6,[]
7,pear,A,0.3,[]
8,penguins (pack),F,2.0,"[Gluten, Dairy]"
9,potato,C,0.5,[]


In [106]:
# Add to dataframe
food_guide_price_df_long['Price_Inflated']=food_guide_price_df_long['Price']*1.05
food_guide_price_df_long

Unnamed: 0,Item,Health_Rating,Allergens,Price,Price_Inflated
0,apple,A,,0.25,0.2625
1,banana,A,,0.25,0.2625
2,pear,A,,0.3,0.315
3,jackfruit,A,,0.6,0.63
4,digestives (pack),E,Gluten,1.5,1.575
5,penguins (pack),F,Gluten,2.0,2.1
6,penguins (pack),F,Dairy,2.0,2.1
7,hobnob (pack),E,Gluten,2.0,2.1
8,hobnob (pack),E,Dairy,2.0,2.1
9,rice (pack),C,,2.5,2.625


[Basic Data Analysis](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

In [107]:
# Subselect Rows
food_guide_price_df_long[10:15]


Unnamed: 0,Item,Health_Rating,Allergens,Price,Price_Inflated
10,potato,C,,0.5,0.525
11,cod,B,,3.5,3.675
12,prawns,B,Shellfish,4.0,4.2
13,seabass,B,,5.0,5.25
14,salmon,A,,4.0,4.2


In [108]:
# Subselect Columns
food_guide_price_df_long[['Item','Health_Rating']]


Unnamed: 0,Item,Health_Rating
0,apple,A
1,banana,A
2,pear,A
3,jackfruit,A
4,digestives (pack),E
5,penguins (pack),F
6,penguins (pack),F
7,hobnob (pack),E
8,hobnob (pack),E
9,rice (pack),C


In [109]:
# Subselect both Rows and Columns using inbuilt pandas DataFrame indexer (loc) -> label-based, so the endpoint IS included
food_guide_price_df_long.loc[10:15,['Item','Health_Rating']]


Unnamed: 0,Item,Health_Rating
10,potato,C
11,cod,B
12,prawns,B
13,seabass,B
14,salmon,A
15,tuna,B


In [110]:
# Boolean comparison returns a True/False value for every row
food_guide_price_df_long['Price']>2.0


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11     True
12     True
13     True
14     True
15     True
16    False
17     True
Name: Price, dtype: bool

In [111]:
# Can filter rows using boolean comparison
spenny_food=food_guide_price_df_long[food_guide_price_df_long['Price']>2.0]
spenny_food

Unnamed: 0,Item,Health_Rating,Allergens,Price,Price_Inflated
9,rice (pack),C,,2.5,2.625
11,cod,B,,3.5,3.675
12,prawns,B,Shellfish,4.0,4.2
13,seabass,B,,5.0,5.25
14,salmon,A,,4.0,4.2
15,tuna,B,,5.0,5.25
17,haddock,B,,3.0,3.15


In [112]:
# Let's sort resultant DataFrame
spenny_food.sort_values('Price',ascending=False)


Unnamed: 0,Item,Health_Rating,Allergens,Price,Price_Inflated
13,seabass,B,,5.0,5.25
15,tuna,B,,5.0,5.25
12,prawns,B,Shellfish,4.0,4.2
14,salmon,A,,4.0,4.2
11,cod,B,,3.5,3.675
17,haddock,B,,3.0,3.15
9,rice (pack),C,,2.5,2.625


In [113]:
# We can count rows per rating using inbuilt pandas (pd) Series method [value_counts]
spenny_food.sort_values('Health_Rating')['Health_Rating'].value_counts(sort=False)

A    1
B    5
C    1
Name: Health_Rating, dtype: int64

In [114]:
# We can use inbuilt pandas (pd) DataFrame methods [groupby/mean] to query dataframe
spenny_food.sort_values('Health_Rating').groupby('Health_Rating')['Price'].mean().reset_index()


Unnamed: 0,Health_Rating,Price
0,A,4.0
1,B,4.1
2,C,2.5
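
`groupby` can return several summaries at once via `agg`, and it sorts the group keys by default, so a separate `sort_values` beforehand is usually not needed. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical subset of the expensive-food table
food = pd.DataFrame({'Health_Rating': ['B', 'B', 'A', 'C'],
                     'Price': [3.5, 5.0, 4.0, 2.5]})

summary = (food.groupby('Health_Rating')['Price']
               .agg(['count', 'mean', 'max'])
               .reset_index())
print(summary)
```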


## Part 2: Okay, Let's *Really* Get Programming
