# Scale Up With Python

## Part 1: Let's Get Coding

**Contents:**

* **This is a Jupyter Notebook**
* **Basic Python data types**
    * Strings
    * Integers/Float
    * Boolean
    * Sequences (Lists)
    * Mappings (Dictionaries)
    
    
* **What can we do with this data?**
    * Variables
    * Methods
    * Loops
    * Conditional Logic
    * Functions
    
    
* **Other data objects?**
    * Install/Import
    * A few more types
    * Pandas DataFrames walkthrough

### Notebooks

Kernel: underlying environment/files for a given session

Here is an empty cell:

Here is a cell with a comment

In [None]:
# generic comment


Please create another cell:

### Types

###### Basics

[Strings](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) - *a proxy for raw text, signified by quote enclosure*

In [None]:
'Howdy!'

In [None]:
# Can be empty
''

In [None]:
# Triple quotation allows for linebreaks
"""Well here's a lengthy piece of



text
"""

In [None]:
# Even non-'text' items are strings if enclosed by quotation marks
'20'

In [None]:
# Function to check data type
type('20')

[Integers](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) - *whole numbers*

In [None]:
100

In [None]:
# Can perform operations on integers
100+20

In [None]:
# Can perform operations on integers
100*20

In [None]:
# Function to check data type
type(20)

[Float](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) - *all real numbers, signified with decimal point*

In [None]:
100.0

In [None]:
# Can perform operations on floats
100.0+20.0

In [None]:
# Can perform operations on floats and integers -> mixed arithmetic 
100.0**2

In [None]:
# Function to check data type
type(20.0)
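
The comment above mentions mixed arithmetic, so here is a minimal sketch of what happens when integers and floats meet. The variable names are made up for illustration.

```python
# Mixing an integer with a float gives a float
mixed_sum = 100 + 20.0

# '/' always returns a float, even when both numbers are integers
true_division = 100 / 20

# '//' (floor division) keeps integers as integers
floor_division = 100 // 20

print(mixed_sum, type(mixed_sum))
print(true_division, type(true_division))
print(floor_division, type(floor_division))
```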

Boolean - *logical type, either yes or no (True or False), no quotation marks!*

In [None]:
True

In [None]:
False

In [None]:
# Can go between integer/boolean
bool(1)

In [None]:
# Can go between integer/boolean
bool(0)

In [None]:
# Boolean does not equal string!
True=='True'

In [None]:
# Can't easily convert from string: any non-empty string counts as True, even 'False'
bool('False')
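
Why does `bool('False')` come back as `True`? Python only checks whether the string is empty. The sketch below shows this, plus one possible way to turn text into a boolean by comparing it; the `answer` variable is just an example.

```python
# Any non-empty string is 'truthy', whatever it says
print(bool('False'))  # True
print(bool(''))       # False

# To interpret the text itself, compare it against the word we expect
answer = 'False'
is_true = answer == 'True'
print(is_true)        # False
```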

In [None]:
# Function to check data type
type(True)

[None](https://docs.python.org/3/reference/datamodel.html#none) - *Nada, Zilch, Nothing at All*

In [None]:
None

In [None]:
# Function to check data type
type(None)

###### [Sequences](https://docs.python.org/3/reference/datamodel.html#sequences)
* Excluding Tuples/Ranges

Lists - *flexible, mutable, ordered group of data objects*

In [None]:
# Can consist of multiple data types
[1,3,'10',None,15]

In [None]:
# List can contain list
[1,3,[10,5]]

In [None]:
# Can be combined
[1,3,'10',None,15]+['17']

In [None]:
# Ordered: can subselect items - FROM ZERO
[1,3,'10',None,15][0]

In [None]:
# Index 3 is actually the fourth entry in the sequence
[1,3,'10',None,15][3]

In [None]:
# Negative indices select in reverse
[1,3,'10',None,15][-1]

In [None]:
# Index 10 doesn't exist -> raises an IndexError
[1,3,'10',None,15][10]

In [None]:
# We can select a range of entries, does not include endpoint
[1,3,'10',None,15][0:2]
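
Slices have a few more tricks than the one shown above. Here is a short sketch using the same list, stored in an illustrative variable called `items`.

```python
items = [1, 3, '10', None, 15]

first_two = items[:2]     # leaving out the start means 'from the beginning'
from_third = items[2:]    # leaving out the end means 'to the end'
last_two = items[-2:]     # negative numbers count from the end
every_other = items[::2]  # a third number sets the step size

print(first_two)    # [1, 3]
print(from_third)   # ['10', None, 15]
print(last_two)     # [None, 15]
print(every_other)  # [1, '10', 15]
```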

In [None]:
# Function to check data type
type([])

###### [Mapping](https://docs.python.org/3/reference/datamodel.html#mappings)

Dictionaries - *set of mappings based on key,value pairs*

In [None]:
# Value type flexible
{'integer':10,
'float':8.5,
'string':'numbers'}

In [None]:
# Keys must be unique: if a key is repeated, the last value silently replaces the earlier one. To hold several values under one key, use a list
{'entry1':5,
'entry2':[10,20],
'entry1':(5.0,5.0)}
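
To see what actually happens to the repeated key, here is the same dictionary stored in a variable (the name `entries` is only for illustration).

```python
entries = {'entry1': 5,
           'entry2': [10, 20],
           'entry1': (5.0, 5.0)}

# Only two keys survive, and 'entry1' holds the value that was written last
print(entries)       # {'entry1': (5.0, 5.0), 'entry2': [10, 20]}
print(len(entries))  # 2
```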

In [None]:
# Can 'lookup' dictionary value
{'integer':10,
'float':8.5,
'string':'numbers'}['integer']

In [None]:
# Function to check data type
type({'integer':10,
'float':8.5,
'string':'numbers'})

### [Variables](https://realpython.com/python-variables/)

Some basic examples

In [None]:
# Set
fruit='apple'
print(fruit)

In [None]:
# Compare
building_1=105.5
building_2=200.6

building_2-building_1

Dictionary Operations - *too verbose to define and subselect all in one cell, let's use a variable to simplify*

In [None]:
# Set
shopping_list={'apple':6,
'banana':4,
'hobnob (pack)':1,
'salmon':2,
'crab':1}

In [None]:
# Lookup
shopping_list['apple']

In [None]:
# Updating
shopping_list['salmon']=1

In [None]:
# View
shopping_list

In [None]:
# Method - view keys
shopping_list.keys()

List Operations - *too verbose to define and subselect all in one cell, let's use a variable to simplify*

In [None]:
# Set/View
inventory=['apple','banana','pear','jackfruit','digestives (pack)','penguins (pack)',
           'hobnob (pack)','rice (pack)','potato','cod','prawns','seabass','salmon','tuna']

inventory

In [None]:
# Subselect
inventory[0]

In [None]:
# Method - add item
inventory.append('snickers')
inventory

In [None]:
# Method - add many items (list)
inventory.extend(['galaxy'])
inventory

Assignment vs Logic - *how do we compare variables?*

In [None]:
# Single '=' signifies assignment
inventory[-1]='haddock'
inventory

In [None]:
# Double '==' signifies comparison -> returns boolean
inventory[-1]=='haddock'

In [None]:
# '!=' signifies a 'not equal' comparison -> returns boolean
inventory[-1]!='tuna'

In [None]:
# Can use 'in' to check if value exists in sequence -> returns boolean
'haddock' in inventory

In [None]:
# Can use 'in' to check if value exists in sequence -> returns boolean
'Haddock' in inventory

In [None]:
# Can use '&' (AND) to combine multiple pieces of logic -> returns boolean (the keyword 'and' also works here)
('haddock' in inventory) & ('wine' in inventory)

In [None]:
# Can use '|' (OR) to combine multiple pieces of logic -> returns boolean (the keyword 'or' also works here)
('haddock' in inventory) | ('wine' in inventory)
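
In everyday Python the keywords `and` and `or` do the same job as `&` and `|` on booleans. The sketch below uses a small made-up list called `stock`, and also shows why the brackets matter when using `&`.

```python
stock = ['haddock', 'tuna']

# The keywords read like plain English
both = ('haddock' in stock) and ('wine' in stock)
either = ('haddock' in stock) or ('wine' in stock)
print(both, either)  # False True

# '&' is evaluated before the comparisons around it, so without brackets
# 1 == 1 & 2 == 2 is read as 1 == (1 & 2) == 2
no_brackets = 1 == 1 & 2 == 2
with_brackets = (1 == 1) & (2 == 2)
print(no_brackets, with_brackets)  # False True
```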

### What can we do with these variables?

[Conditional Statements](https://docs.python.org/3/tutorial/controlflow.html) - *performs action depending on logic*

In [None]:
item='apple'

if item in inventory:
    print('In Inventory')
else:
    print('Unavailable')

In [None]:
item='grapefruit'

if item in inventory:
    print('In Inventory')
else:
    print('Unavailable')

[Loops](https://docs.python.org/3/tutorial/datastructures.html#looping-techniques) - *iterate through sequence*

In [None]:
# For loop
for item in inventory:
    print(item)

In [None]:
# While loop - be careful of infinite loop
i=0

while i<12:
    print(inventory[i])
    i+=1

In [None]:
# Can combine loops and conditional statements
all_items_shopping_list={}

# Loop through each item in the store's inventory
for item in inventory:

    # Check if item is in shopping list
    if item in shopping_list.keys():
        # If item in shopping list, add to all_items_shopping_list with desired purchase volume
        all_items_shopping_list[item]=shopping_list[item] 
        
    else:
        # If item is not in shopping list, add to all_items_shopping_list with purchase volume equal to zero
        all_items_shopping_list[item]=0      

In [None]:
all_items_shopping_list
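
For the curious, the loop above can also be written in one line. This is a sketch of the same idea using a dictionary comprehension and the dictionary method `.get()`, which returns a fallback value when a key is missing. The small `stock` and `wanted` variables are invented for the example.

```python
stock = ['apple', 'banana', 'pear', 'cod']
wanted = {'apple': 6, 'cod': 1}

# For every item in stock, look up the wanted quantity, falling back to 0
all_items = {item: wanted.get(item, 0) for item in stock}

print(all_items)  # {'apple': 6, 'banana': 0, 'pear': 0, 'cod': 1}
```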

###### Functions

Native

In [None]:
# Universal: View object below cell
print(all_items_shopping_list)

[Custom](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)

In [None]:
# Create price dictionary
inventory_price_dict={
'apple':0.25,
'banana':0.25,
'pear':0.3,
'jackfruit':0.6,
'digestives (pack)':1.5,
'penguins (pack)':2,
'hobnob (pack)':2,
'rice (pack)':2.5,
'potato':0.5,
'cod':3.5,
'prawns':4,
'seabass':5,
'salmon':4,
'tuna':5,
'snickers':0.75,
'haddock':3    
}

In [None]:
# Define function to calculate total price of shop based on input dictionary 'shopping_list'

def shopping_spend(shopping_list:dict,inventory_price_dict:dict):
    
    # Spend starts at zero
    spend=0

    # Loop through each item in your shopping list
    for item in shopping_list.keys():
        
        # Check if item is available in inventory
        if item in inventory_price_dict.keys():
            # If item available, multiply cost by purchase volume and add to existing spend total
            spend+=inventory_price_dict[item]*shopping_list[item]
        
        # If item unavailable, add zero to spend and move on to the next item on the shopping list
        else:
            spend+=0
    
    # After looping through complete shopping list, return total spend
    return spend


In [None]:
# Run shopping_spend function over shopping list
shopping_spend(shopping_list,inventory_price_dict)
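
Once loops and `.get()` feel comfortable, the same calculation can be squeezed into a single expression. This is only a sketch: `shopping_spend_compact`, `basket` and `prices` are hypothetical names, and the longer version above is perfectly good code.

```python
def shopping_spend_compact(basket: dict, prices: dict) -> float:
    # Items without a price count as costing zero, as in the longer version
    return sum(prices.get(item, 0) * quantity for item, quantity in basket.items())

basket = {'apple': 6, 'banana': 4, 'crab': 1}
prices = {'apple': 0.25, 'banana': 0.25}

total = shopping_spend_compact(basket, prices)
print(total)  # 6*0.25 + 4*0.25 + 0 = 2.5
```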

### [Packages](https://docs.python.org/3/tutorial/modules.html#packages)


pip install - *standard method of installing python packages/libraries*

In [None]:
# '!' sends command to Terminal
!pip install numpy

In [None]:
# includes installation of dependencies 
!pip install pandas

In [None]:
# includes installation of dependencies 
!pip install seaborn

Let's Import A Couple

In [None]:
# Import whole package as variable
import numpy as np
import pandas as pd
import seaborn as sns

### [Pandas DataFrames](https://pandas.pydata.org/docs/user_guide/dsintro.html)

* Pandas is a common package used for data analytics
* It is dependent on NumPy and several other libraries
* The main benefit is its easy-to-use 'DataFrame' object

Introduction - *what does an unseen dataframe look like?*

In [None]:
# Import ready-made pandas dataframe
sample_df = sns.load_dataset('iris')
sample_df

In [None]:
# Let's look at every row
pd.set_option('display.max_rows', 150)
sample_df

In [None]:
# Let's just look at the first few
sample_df.head()

In [None]:
# Let's look at the column types
sample_df.info()

In [None]:
# Let's look at the (numeric) column values
sample_df.describe()

In [None]:
# Any None values?
sample_df.isna().sum()

Under The Hood - *you've already seen DataFrames (sort of)*

In [None]:
# A DataFrame is just a dictionary of dictionaries!
sample_df.to_dict()

In [None]:
# Let's look at this again
sample_df

In [None]:
# Column selection is the same as dictionary value lookup
sample_df['sepal_length']

In [None]:
# Instead of an outright dictionary this is a special Pandas data type called a 'Series'
type(sample_df['sepal_length'])

In [None]:
# Let's lookup the first value of this 'Series'
sample_df['sepal_length'][0]

In [None]:
# It's a type we (almost) recognise! A NumPy float (numpy.float64), which behaves like the float we met earlier
type(sample_df['sepal_length'][0])
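
Values pulled out of a numeric column are typically NumPy numbers rather than plain Python ones. A small sketch, using a made-up value, of how the two relate:

```python
import numpy as np

value = np.float64(5.1)

# numpy.float64 is built on top of Python's float, so it passes this check
print(isinstance(value, float))

# It can be converted to a plain float explicitly if ever needed
plain = float(value)
print(type(value), type(plain))
```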

Let's Create Our Own From Scratch!

In [None]:
# Starting with the price dictionary
inventory_price_dict

In [None]:
# We can reformat into a DataFrame (don't need to understand syntax unless you're interested)
food_price_df=pd.DataFrame.from_dict(inventory_price_dict,orient='index',columns=['Price']).reset_index().rename(columns={'index':'Item'})
food_price_df

Let's Import Data as a DataFrame!

In [None]:
# Import csv using inbuilt pandas (pd) function [read_csv]
food_guide_df=pd.read_csv('Food_Guide.csv')
food_guide_df

In [None]:
# Let's take a look at types...no list specificity
food_guide_df.info()

In [None]:
# Let's convert Allergens to list via one-line loop and 'ast' package
import ast

food_guide_df['Allergens']=[ast.literal_eval(x) for x in food_guide_df['Allergens'].fillna("[]")]
food_guide_df
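
What is `ast.literal_eval` doing here? It reads a piece of text that looks like a Python value and turns it into the real thing. The sketch below assumes the CSV stores allergens as text such as `"['nuts', 'milk']"`; the exact contents of the file are an assumption.

```python
import ast

raw = "['nuts', 'milk']"         # text that merely looks like a list
parsed = ast.literal_eval(raw)   # an actual list

print(type(raw), type(parsed))
print(parsed[0])                 # nuts

# Missing values were first filled with the text "[]", which becomes an empty list
empty = ast.literal_eval("[]")
print(empty)                     # []
```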

In [None]:
type(food_guide_df['Allergens'][0])

Data Wrangling Basics

In [None]:
# Subselect Row For Stacking
food_guide_df[15:] 

In [None]:
# Stack data using inbuilt pandas (pd) function [concat]
food_guide_df_duped=pd.concat([food_guide_df,food_guide_df[15:]]).reset_index(drop=True)
food_guide_df_duped

In [None]:
# Dedupe data using inbuilt pandas (pd) DataFrame method [drop_duplicates]
food_guide_df_deduped=food_guide_df_duped.drop_duplicates(subset='Item')
food_guide_df_deduped

In [None]:
# Merge data using inbuilt pandas (pd) DataFrame method [merge]
food_guide_price_df=food_guide_df_deduped.merge(food_price_df)
food_guide_price_df

In [None]:
# Explode list column using inbuilt pandas (pd) DataFrame method [explode]
food_guide_price_df_long=food_guide_price_df.explode('Allergens').fillna('').reset_index(drop=True)
food_guide_price_df_long

In [None]:
# Reformat dataframe using inbuilt pandas (pd) DataFrame methods [groupby/apply]
food_guide_price_df_wide=food_guide_price_df_long.groupby(['Item','Health_Rating','Price'])['Allergens'].apply(list).reset_index()
food_guide_price_df_wide

In [None]:
# Add to dataframe
food_guide_price_df_long['Price_Inflated']=food_guide_price_df_long['Price']*1.05
food_guide_price_df_long

[Basic Data Analysis](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

In [None]:
# Subselect Rows
food_guide_price_df_long[10:15]


In [None]:
# Subselect Columns
food_guide_price_df_long[['Item','Health_Rating']]


In [None]:
# Subselect both Rows and Columns using the pandas DataFrame indexer (loc) - note that loc slices by label and includes the endpoint (row 15)
food_guide_price_df_long.loc[10:15,['Item','Health_Rating']]
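
Plain square brackets and `.loc` treat the end of a slice differently, which is an easy thing to trip over. A minimal sketch with a made-up `toy_df`:

```python
import pandas as pd

toy_df = pd.DataFrame({'Item': list('abcdef'), 'Price': [1, 2, 3, 4, 5, 6]})

by_position = toy_df[1:3]      # positions 1 and 2, endpoint excluded (like a list)
by_label = toy_df.loc[1:3]     # labels 1, 2 and 3, endpoint included
by_iloc = toy_df.iloc[1:3]     # .iloc always works by position, endpoint excluded

print(len(by_position), len(by_label), len(by_iloc))  # 2 3 2
```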


In [None]:
# A boolean comparison on a column returns a True/False value for every row
food_guide_price_df_long['Price']>2.0


In [None]:
# Can filter rows by passing that boolean comparison into the square brackets
spenny_food=food_guide_price_df_long[food_guide_price_df_long['Price']>2.0]
spenny_food

In [None]:
# Let's sort resultant DataFrame
spenny_food.sort_values('Price',ascending=False)


In [None]:
# We can use inbuilt pandas (pd) Series method [value_counts]
spenny_food.sort_values('Health_Rating')['Health_Rating'].value_counts(sort=False)

In [None]:
# We can use inbuilt pandas (pd) DataFrame methods [groupby/mean] to query dataframe
spenny_food.sort_values('Health_Rating').groupby('Health_Rating')['Price'].mean().reset_index()


## Part 2: Okay, Let's *Really* Get Programming

#### Contents:

* **Querying of Dataframes**
    * AND statements
    * OR statements
    * Crazy statements
    
* **Plotting with Pandas and Matplotlib**
    * Histograms and KDEs
    * Box plots
    * Grouped box plots
    * Heat maps
* **Handling missing data**
    * Finding missing data in dataframes
    * Simple imputation
    * Multiple imputation
* **Building some simple machine learning models** (Linear Regression and Random Forests)
    * `food_guide_price_df`
    * `sample_df`
    * `diabetes_df`

## Querying of Dataframes
We have seen how to do some fairly simple filtering of dataframes based on certain values, let's look at some more complex filtering based on multiple criteria, and using AND and OR statements.

First off we will see that we can also filter dataframes using the ```.loc[]``` method. As in the example we have seen previously:

In [None]:
food_guide_price_df.loc[food_guide_price_df['Price']>2.0]

and we can see that our Jupyter output is filtered in the same way as before. If we want to filter on multiple conditions, we can use the ```&``` operator. Suppose we want to get all groceries costing more than £2 with a health rating of B:

In [None]:
food_guide_price_df.loc[(food_guide_price_df['Price']>2.0) & (food_guide_price_df['Health_Rating']=='B')]

<div class="alert alert-block alert-warning">
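<b>Important:</b> Note in the code snippet above that there are brackets around each filter. This is necessary because <code>&amp;</code> is evaluated before comparisons such as <code>&gt;</code> and <code>==</code>, so without brackets Python would try to combine the wrong pieces first.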
</div>

When applying multiple filters, the code should appear like

```df.loc[(filter_1) & (filter_2) & (filter_3) & ...]```
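
If the brackets start to pile up, one option is to give each filter a name first. The sketch below uses a tiny made-up dataframe (`toy_df`, with invented prices and health ratings) rather than the real data.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Item': ['salmon', 'cod', 'apple', 'rice (pack)'],
    'Price': [4.0, 3.5, 0.25, 2.5],
    'Health_Rating': ['A', 'B', 'A', 'B'],   # invented ratings for illustration
})

# Each filter is a True/False Series, stored under a readable name
is_pricey = toy_df['Price'] > 2.0
is_b_rated = toy_df['Health_Rating'] == 'B'

# The named filters combine exactly like the bracketed ones
result = toy_df.loc[is_pricey & is_b_rated]
print(result)
```

Because each comparison is finished before it is stored, no extra brackets are needed when combining the named filters.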

We can look for records in our dataframe where the price is less than £2 and the health rating is a B:

In [None]:
food_guide_price_df.loc[(food_guide_price_df['Price']<2.0) & (food_guide_price_df['Health_Rating']=='B')]

The empty dataframe response tells us that there are indeed no records satisfying both of those filter conditions.

In the cell below, write some code to filter the ```food_guide_price_df``` where the price is more than £2.50 and the health rating is A.

If the record for salmon was the only record to come back, that's the correct result! If not, have another look at the examples and see if there's an issue somewhere.

Now let's have a look at OR statements, indicated by the ```|``` operator - yes, you finally get to use this key on your keyboard! Suppose we want to see all food items with a health rating of A or C:

In [None]:
food_guide_price_df.loc[(food_guide_price_df['Health_Rating']=='A') | (food_guide_price_df['Health_Rating']=='C')]

Now it is a little annoying that we can't see the whole piece of code on one line. Fortunately, we can insert new lines in the middle of these statements (inside the square brackets) without it causing an issue:

In [None]:
food_guide_price_df.loc[(food_guide_price_df['Health_Rating']=='A') | 
                             (food_guide_price_df['Health_Rating']=='C')]
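
When several OR conditions all test the same column, pandas offers a shortcut: the `isin` method. This is an alternative to the `|` approach above, not a replacement for it, and `toy_df` below is made-up data.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Item': ['apple', 'cod', 'snickers', 'salmon'],
    'Health_Rating': ['A', 'B', 'C', 'A'],   # invented ratings for illustration
})

# The OR version, as in the cell above
with_or = toy_df.loc[(toy_df['Health_Rating'] == 'A') | (toy_df['Health_Rating'] == 'C')]

# The isin version: True wherever the value appears in the list
with_isin = toy_df.loc[toy_df['Health_Rating'].isin(['A', 'C'])]

print(with_or.equals(with_isin))  # True
```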

Now try listing out both the cheap and the expensive items yourself: that is, those with a price lower than £0.50 or higher than £4.

In [None]:
food_guide_price_df.loc[(food_guide_price_df['Price']<0.5) | (food_guide_price_df['Price']>4)]

If you are getting back the data for apples, bananas, pears, seabass and tuna, that's correct! Anything else and you'll need to check your code and tweak it a little.

Finally, the holy grail where the flexibility of Python extends the filtering you can do in Excel: combinations of AND and OR statements! We can really combine these things in as many ways as we like. Say I want to see either foods with a health rating of A and less than £1, or a health rating of E and a price of more than £1, we can do the following:

In [None]:
food_guide_price_df.loc[
    (
        (food_guide_price_df['Price']<1) & (food_guide_price_df['Health_Rating']=='A')
    ) 
    | 
    (
        (food_guide_price_df['Price']>1) & (food_guide_price_df['Health_Rating']=='E')
    )
]

Now you may think we have formatted the code very strangely in the code cell above, but that's because we wanted to emphasise the format of the brackets. Remember that **each condition needs to have brackets around it** when running multiple filters. This is also true when we are nesting logic in this way. Simplifying the code above, what we have written has the following structure:
                    
```df.loc[( (filter1)&(filter2) ) | ( (filter3)&(filter4) )]```

See how the conditions either side of the OR clause are also bracketed. Now do some practice and try constructing all different types of filters. We can combine AND and OR statements in *any way*, so think of some different combinations you might want to try out yourself. A couple of ideas if you need some inspiration are:

```df.loc[( (filter1)&(filter2) ) | (filter3) | (filter4)]```

```df.loc[( (filter1)|(filter2) ) & (filter3) ]```

If you wanted to practice on a dataset with some more variables, you can always use ```sample_df``` instead of ```food_guide_price_df```. We have left a few blank cells below for you to use, but feel free to add in some more if you need them!

I am sure you can see how this flexibility of querying data, alongside functionality like `explode`, `merge` and `groupby` that you have seen earlier, is incredibly powerful in allowing users to understand the data they are looking at from different sources. Take a look at the next section to see how plots can also help you with your exploratory data analysis.

## Plotting with Pandas and Matplotlib

We can do some simple plots with `pandas` which make life really easy. The code has the following structure:

```df['col'].plot.plot_type()```

where you will swap out ```plot_type()``` for the plot that you want. For example, we can easily plot continuous data like prices in a histogram:

In [None]:
food_guide_price_df['Price'].plot.hist()

There is a type of plot called a __[Kernel Density Estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation)__ (KDE) which does a similar job to a histogram but avoids categorising the data - the tails that run below 0 and above 5 are a feature of the KDEs, and they are far more useful with larger amounts of data. Also note that the y-axis has a different scale, this is because KDEs actually return __[probability densities](https://en.wikipedia.org/wiki/Probability_density_function)__, but this is not something to worry about for now!

In [None]:
food_guide_price_df['Price'].plot.kde()

We can also plot counts of categorical variables using the pandas ```value_counts()``` method.

In [None]:
food_guide_price_df['Health_Rating'].value_counts().plot.bar()

Now we can see that the order of the axis is a little annoying, we really would like them to appear in logical order, which we can control by ordering the series with ```sort_values()``` before we plot, and adding the ```sort=False``` argument to the ```value_counts``` method, as in the cell below.

In [None]:
food_guide_price_df['Health_Rating'].sort_values().value_counts(sort=False).plot.bar()
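
Another way to get the bars into a logical order is to count first and then sort the result by its labels with `sort_index()`. A small sketch with a made-up Series of ratings:

```python
import pandas as pd

ratings = pd.Series(['B', 'A', 'C', 'A', 'B', 'A'])

# value_counts() orders by frequency; sort_index() reorders by the rating label
counts = ratings.value_counts().sort_index()
print(counts)

# counts.plot.bar() would then draw the bars in the order A, B, C
```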

Now try constructing a couple of plots for the ```sample_df```:
* A histogram of the ```petal_length``` column
* A bar chart showing counts for the ```species``` column
* A KDE for the ```sepal_length``` column

Here is a list of all the different plots offered to us in Pandas:
* ‘line’ : line plot (default)
* ‘bar’ : vertical bar plot
* ‘barh’ : horizontal bar plot
* ‘hist’ : histogram 
* ‘box’ : boxplot
* ‘kde’ : Kernel Density Estimation plot
* ‘density’ : same as ‘kde’
* ‘area’ : area plot
* ‘pie’ : pie plot
* ‘scatter’ : scatter plot (DataFrame only)
* ‘hexbin’ : hexbin plot (DataFrame only)

Have a look at the __[Pandas .plot documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)__ to see some other arguments that can be added to plotting methods to customise your plots, and you can see more examples of this in the __[Pandas visualisation guide](https://pandas.pydata.org/docs/user_guide/visualization.html)__.

The Pandas plotting functionality is all brought to us via `matplotlib` - and those who already have some Python experience might find this __[interactive Binder notebook](https://mybinder.org/v2/gh/matplotlib/mpl-brochure-binder/main?labpath=MatplotlibExample.ipynb)__ an interesting example of some of the more complex visualisation tools that `matplotlib` provides. But let's look at some more interesting plots of our data using `matplotlib`. First we import the `pyplot` module, which actually manages all the plotting, from `matplotlib` under the alias `plt`.

In [None]:
from matplotlib import pyplot as plt

Have you heard of a box-plot before? It tells us about the shape of our data by presenting the minimum and maximum points, along with the quartiles - that is the 25%, 50% and 75% positions of the data points.
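
To make the quartiles concrete, here is a rough sketch that calculates them by hand for a handful of made-up prices, using Python's built-in `statistics` module. The interpolation rule chosen here (`method='inclusive'`) is an assumption; plotting libraries may place the quartiles slightly differently on small datasets.

```python
import statistics

# Nine made-up prices, already sorted
prices = [0.25, 0.25, 0.3, 0.5, 2, 2.5, 3.5, 4, 5]

# n=4 splits the data into four equal parts, giving the three quartile points
q1, median, q3 = statistics.quantiles(prices, n=4, method='inclusive')

print('Minimum:', min(prices))
print('25% (lower quartile):', q1)
print('50% (median):', median)
print('75% (upper quartile):', q3)
print('Maximum:', max(prices))
```

The box in a box plot spans the lower to upper quartile, with a line at the median. By default `matplotlib` draws the whiskers out to the furthest points within 1.5 box-lengths of the box and marks anything beyond that as an individual outlier.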

In [None]:
plt.boxplot(food_guide_price_df['Price'])

Now this is not the neatest boxplot. We can tidy this up in a few ways: we will
* Remove the label on the x-axis
* Add a label to the y-axis
* Add a title to our plot
* Remove the redundant cell output above the image

We will do this by defining a space for the plot which we have full control over, called a ```subplot```

In [None]:
# Define the subplot we will plot against
fig, ax = plt.subplots()

# Add the boxplot to the figure. We use labels = [''] to replace the x-label with an empty string
ax.boxplot(food_guide_price_df[['Price']], labels = [''])

# We can use set_xlabel(), set_ylabel() and set_title() to add these to our plots
ax.set_ylabel('Price (£)')
ax.set_title('Box plot of grocery price'); 
# The use of the semicolon at the end of the cell removes the redundant output

Boxplots are commonly presented horizontally, which makes them a little easier to read, and they take up less space on the page. We do this by setting the ```vert``` argument of the boxplot function to ```False```. Note that in swapping the orientation, we are now using ```set_xlabel``` instead of ```set_ylabel```.

In [None]:
fig, ax = plt.subplots()
ax.boxplot(food_guide_price_df[['Price']], labels = [''], vert=False)
ax.set_xlabel('Price (£)')
ax.set_title('Box plot of grocery price'); 

We can actually break this price information down by grouping it by the health rating. We can do this by creating a dictionary, which you have seen earlier. The keys of our dictionary will be the different health ratings, and the values will be lists of the prices for that health rating. It is important when programming to be able to move between different types of objects so that you have all options at your disposal!

In [None]:
grocery_dict = {}

# We sort the values to get them in the order we want to present them in
health_ratings = food_guide_price_df['Health_Rating'].sort_values().unique()

# We will iterate through each health rating and add the results to the dictionary as we go
for rat in health_ratings:
    # Obtain the prices for each health rating
    prices = food_guide_price_df.loc[food_guide_price_df['Health_Rating'] == rat, 'Price']
    
    # We need this as a list, so we use the .to_list() method
    prices_list = prices.to_list()
    
    # Add the results to the dictionary
    grocery_dict[rat] = prices_list
    
grocery_dict
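
The loop above spells out every step, which is great for learning. For reference, `groupby` (seen in Part 1) can build the same kind of dictionary in one line. A sketch on made-up data, where `toy_df` and its values are invented:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Health_Rating': ['B', 'A', 'B', 'A'],
    'Price': [3.5, 0.25, 2.5, 4.0],
})

# Group the prices by rating, collect each group into a list, then convert to a dictionary
grouped = toy_df.groupby('Health_Rating')['Price'].apply(list).to_dict()

print(grouped)  # {'A': [0.25, 4.0], 'B': [3.5, 2.5]}
```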

Now we create the boxplot by inputting the values of the dictionary, and setting the x-axis as the keys of the dictionary.

In [None]:
fig, ax = plt.subplots()
# Convert the dictionary views to lists so matplotlib receives plain sequences
ax.boxplot(list(grocery_dict.values()), labels=list(grocery_dict.keys()), vert=False)
ax.set_xlabel('Price')
ax.set_ylabel('Health Rating')
ax.set_title('Grocery Price by Health Rating');

The boxes look a little funny because there are very small amounts of data, and in some cases, not enough to get different values for the quartiles, but we can still see some interesting things. The price varies considerably by health rating - foods with a health rating of A are cheaper than all other health ratings, and foods with a health rating of B are considerably higher in price than all other health ratings.

Give this a try yourself now! Using the `sample_df`, create box plots for `petal_width` and `petal_length`. Then create box plots for these variables grouped by the species type, and see if any interesting patterns appear. You will need to write the code to create dictionaries analogous to the one in the example above, and then use the dictionaries to plot the data.

We can also construct interesting scatter graphs with `matplotlib`. I will demonstrate this using `sample_df`

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = sample_df['sepal_width'], y = sample_df['petal_width'])
ax.set_xlabel('Sepal Width')
ax.set_ylabel('Petal Width');

It almost looks like it breaks into two different clusters...perhaps this is due to the different species we have collected data from! Let's colour the points by species to see if this reveals anything interesting. *Note the American spelling of the argument* `color`*, it gets me every time!*

In [None]:
fig, ax = plt.subplots()
plt.scatter(x = sample_df['sepal_width'], y = sample_df['petal_width'], color = sample_df['species'])
ax.set_xlabel('Sepal Width')
ax.set_ylabel('Petal Width');

Ah, an error! This is because life isn't always as easy as we might like it to be. We cannot simply insert the column into the `color` argument, because the species names are not colours that `matplotlib` recognises. Instead, we will have to plot the points for each species in stages, which will give us the different colours. And yes, error messages and tracebacks in Python can sometimes be very long and complicated to understand. Often the last line is the only bit you really need, as it points you to the issue.

In [None]:
fig, ax = plt.subplots()
species = sample_df['species'].unique()
for s in species:
    species_df = sample_df.loc[sample_df['species'] == s]
    ax.scatter(x = species_df['sepal_width'], y = species_df['petal_width'])
ax.set_xlabel('Sepal Width')
ax.set_ylabel('Petal Width');

So there is some underlying pattern based on the species! Though, we do not know which species corresponds to which colour. We can sort this out by applying a `label` to each scatter plot, and then adding a `legend`, also known as a key, to our plot.

In [None]:
fig, ax = plt.subplots()
species = sample_df['species'].unique()
for s in species:
    species_df = sample_df.loc[sample_df['species'] == s]
    ax.scatter(x = species_df['sepal_width'], y = species_df['petal_width'], label = s)
ax.set_xlabel('Sepal Width')
ax.set_ylabel('Petal Width')
ax.legend();

This plot tells us that we could probably build a pretty good classification model for species based on both the sepal width and petal width. Using all of the things you have learned about plotting so far, see what other interesting combinations of variables you can plot - do other combinations of variables appear to discriminate between species of iris flower?

Now we will quickly look at one other type of plot - a heatmap. This can be very useful when exploring relationships between variables, like *correlation*. Correlation is a measure between -1 and 1 that describes how closely two variables move together in a straight-line fashion. A correlation of 1 means that as one increases, the other certainly increases, and a correlation of -1 means that as one increases, the other certainly decreases. Note that it **does not** measure how much the increase or decrease is. We can very easily obtain correlations between variables in pandas! Let's remove the `species` column from `sample_df` and then calculate the correlation between the other variables.
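
To back up the point that correlation ignores the size of the change, here is a rough sketch that calculates the standard (Pearson) correlation by hand in plain Python. The function name `pearson` and the numbers are made up for illustration; pandas does all of this for us with `.corr()`.

```python
from math import sqrt

def pearson(xs, ys):
    # Compare how far each value sits from its own average
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
gentle = [2 * value for value in x]     # rises slowly with x
steep = [100 * value for value in x]    # rises quickly with x
falling = [-value for value in x]       # falls as x rises

print(pearson(x, gentle), pearson(x, steep), pearson(x, falling))
```

The gentle and the steep line both score 1, even though one climbs fifty times faster than the other, while the falling line scores -1.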

In [None]:
cor_df = sample_df.drop('species', axis=1).corr()
cor_df

We can plot this very nicely on a heat map! Now, `matplotlib` does not have any easy implementation of heat map plots, so we turn to another plotting library called `seaborn`. We import it under the alias `sns`

In [None]:
import seaborn as sns
sns.heatmap(cor_df)

One thing we may want to implement to improve this plot is to have the scale start at -1 instead of -0.4. We might also want stronger colours at -1 and 1, with a very light colour at 0. We can make these changes! Firstly, by adding the arguments `vmin` and `vmax` to the `heatmap` function, we can change the colour scale. Seaborn also has a whole load of built-in __[colour maps](https://seaborn.pydata.org/tutorial/color_palettes.html)__ which we can take advantage of here. We will use the `vlag` colour palette.

In [None]:
sns.heatmap(cor_df, vmin = -1, vmax = 1, cmap = 'vlag')

We can now very easily see the very strong correlation structure in the iris dataset.

## Handling missing data
Missing data appears almost everywhere! It's not something we have to deal with at Acin very often - unless we think of prospect data as missing... ;) - but in other applications, for example medical data, it's an inevitability. In this section we will have a look at how we identify missing data, some simple methods for dealing with the issue, and touch on some more gold-standard methods.

I will be demonstrating methods with the `food_guide_price_df` dataset having deleted a set of values, but you will be practicing with the `diabetes_df` dataset taken from __[Kaggle](https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset)__. This is a real dataset from which I have omitted some values. Below you can see the first 5 rows of the dataframe.

In [None]:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()

### Finding missing data in dataframes
Firstly we will add in some missing values to our dataframe

In [None]:
missing_indices = [2,5,10,14,12,14]
food_missing = food_guide_price_df.copy()

# we replace the price value at these indices with the missing type None (pandas displays it as NaN)
food_missing.loc[missing_indices, 'Price'] = None

food_missing

There are useful methods to help us locate missing data in a dataset

In [None]:
food_missing.isna()

We can see that the values that are missing are identified as True in the table above. For large datasets, it can be useful to see a heatmap to identify where values are missing. In the section on plotting, we covered generating heatmaps with `seaborn`.

In [None]:
import seaborn as sns
sns.heatmap(food_missing.isna())

We can see that when we do heatmaps of boolean variables, 0 corresponds to `False` (in our case, a value that is not missing), and 1 corresponds to True (a missing value). For small datasets, this is perhaps a little excessive, but try this out with the diabetes dataset, `diabetes_df`.

<div class="alert alert-block alert-success">
<b>Advanced Exercise:</b> For those who are a little more confident working with Python, take a look at <a href="https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009">this example</a> which uses the <i>missingno</i> package to construct some more advanced summaries of the missing data.
</div>

The process of filling in missing data is called *imputation*. It is possible to do a complete case analysis (CCA) instead, using only the rows with no missing values, but this throws away data that we might want to use. Moreover, it is not always statistically valid to simply drop this data, as there may be some mechanism causing it to be missing [i.e. the data is 'missing not at random' (MNAR) instead of 'missing completely at random' (MCAR)]. For example, in a study, older people may be more reluctant to provide their age. Dropping their records means that any results are biased or not relevant to the older population. The field of missing data management in statistics is vast, with new methods for minimising the bias caused by missing data, and for handling different types of missing data, appearing all the time!
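
To make the MCAR/MNAR distinction concrete, here is a small simulation of the age example. Everything in it is an assumption made for illustration: the ages are randomly generated, and the missingness probabilities (a flat 30% for MCAR, rising from 0% to 60% with age for MNAR) are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical study: 10,000 participants with ages between 18 and 90
ages = rng.uniform(18, 90, size=10_000)

# MCAR: every participant has the same 30% chance of not reporting their age
mcar_missing = rng.random(ages.size) < 0.3

# MNAR: the chance of not reporting rises with age itself (0% at 18, 60% at 90)
mnar_prob = 0.6 * (ages - 18) / (90 - 18)
mnar_missing = rng.random(ages.size) < mnar_prob

true_mean = ages.mean()
mcar_cca_mean = ages[~mcar_missing].mean()  # complete cases only
mnar_cca_mean = ages[~mnar_missing].mean()  # complete cases only

print('True mean age:', round(true_mean, 1))
print('Complete-case mean under MCAR:', round(mcar_cca_mean, 1))
print('Complete-case mean under MNAR:', round(mnar_cca_mean, 1))
```

Under MCAR the complete-case mean should stay close to the true mean, whereas under this MNAR mechanism it is pulled towards younger ages, because older participants are under-represented among the complete cases.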

We can filter dataframes using the `isna()` or `notna()` methods.

In [None]:
food_missing.loc[food_missing['Price'].isna()]

In [None]:
food_missing.loc[food_missing['Price'].notna()]
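
Filtering with `notna()` is one way to carry out the complete case analysis mentioned earlier. As a sketch on made-up data (`toy_df` is hypothetical), pandas also offers `dropna()`, which gives the same rows when restricted to the column of interest:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing prices
toy_df = pd.DataFrame({
    'Item': ['Apple', 'Bread', 'Crisps', 'Milk'],
    'Price': [0.5, np.nan, 1.2, np.nan],
})

# Keep rows where Price is present, using a boolean filter
filtered = toy_df.loc[toy_df['Price'].notna()]

# Same result: drop rows where Price is missing
complete_cases = toy_df.dropna(subset=['Price'])

print(complete_cases)
```

Without the `subset` argument, `dropna()` removes every row that has a missing value in any column, which can discard far more data than intended.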

### Simple Imputation
There are some easy methods of filling in the gaps, for example, we can replace all the missing values with the mean or median values. Let's start by calculating the mean and median prices.

In [None]:
mean_price = food_missing['Price'].mean()
median_price = food_missing['Price'].median()

print('Mean price: ' + str(mean_price))
print('Median price: ' + str(median_price))

In [None]:
food_missing 

Now we can fill in the missing values. We start by defining two dataframes as copies of `food_missing`, one for the mean imputed values and another for the median imputed values, and then fill them in using the `fillna()` method. Note that the `.copy()` method is required: a plain assignment such as `food_mean = food_missing` would only create a second name for the same dataframe, meaning a change to `food_mean` below would also change `food_missing`.
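
To see why the copy matters, the sketch below contrasts plain assignment with `.copy()` on a tiny made-up dataframe (the names `original`, `alias` and `independent` are illustrative only).

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'Price': [1.0, np.nan, 3.0]})

alias = original               # a second name for the same dataframe
independent = original.copy()  # a separate dataframe with the same contents

# Fill the gap through the alias: the original changes too
alias.loc[1, 'Price'] = 2.0

# Change the copy: the original is unaffected
independent.loc[0, 'Price'] = 100.0

print(alias is original)
print(original)
print(independent)
```

After running this, `original` has its gap filled (the edit made through `alias`) but still holds 1.0 in the first row, because the edit to `independent` never touched it.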

In [None]:
food_mean = food_missing.copy()
food_median = food_missing.copy()

In [None]:
food_mean['Price'] = food_mean['Price'].fillna(mean_price, inplace = False)
food_median['Price'] = food_median['Price'].fillna(median_price, inplace = False)

# We can check whether the values are still missing 
food_mean

Now from our imputed datasets, say we are interested in the average price:

In [None]:
print('Mean Imputation average price: ' + str(food_mean['Price'].mean()))
print('Median Imputation average price: ' + str(food_median['Price'].mean()))

So we have two different values for the average food price in a grocery store... which one is correct, or the better average? We now have some **uncertainty** in our estimate as a direct result of the missing data. Multiple imputation aims to give us better estimates from data with missing values and allows us to quantify the uncertainty of those estimates.
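
One reason a single imputed value is unsatisfying is that it hides this uncertainty. The sketch below uses invented, right-skewed prices (not the course data) to show that filling with the mean leaves the average unchanged, filling with the median shifts it, and both make the data look less spread out than the observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed prices with two gaps
prices = pd.Series([1.0, 1.5, 2.0, np.nan, 2.5, 9.0, np.nan])

mean_filled = prices.fillna(prices.mean())
median_filled = prices.fillna(prices.median())

# pandas skips missing values by default when computing mean and std
print('Observed:       mean', prices.mean(), 'std', prices.std())
print('Mean imputed:   mean', mean_filled.mean(), 'std', mean_filled.std())
print('Median imputed: mean', median_filled.mean(), 'std', median_filled.std())
```

The smaller standard deviation after imputation is an artefact of inserting the same value repeatedly, not evidence that prices really vary less.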

Before moving on to multiple imputation, try out some of these imputation methods on `diabetes_df`.

### Multiple Imputation
This is where we repeat the imputation multiple times, with different values each time. We can then calculate the statistic we are interested in - the average price - across these different imputed datasets to understand how much the statistic varies.

One strategy we could try is randomly sampling prices within the range of observed prices of the other groceries. The `numpy` package has a function `np.random.uniform(low, high, size)` which randomly generates `size` numbers between `low` and `high`.

In [None]:
import numpy as np

min_price = food_missing['Price'].min()
max_price = food_missing['Price'].max()

np.random.uniform(min_price, max_price)

We can count how many missing values we need to impute, and generate that number of imputed values.

In [None]:
num_missing_prices = sum(food_missing['Price'].isna())
print(str(num_missing_prices) + ' missing values to impute')

imputed_values = np.random.uniform(min_price, max_price, num_missing_prices)
imputed_values

If you run the previous two cells multiple times, you will see that they generate new numbers every time.

Say we want to run the imputation procedure 100 times. We will loop over the values 0-99 (via the `range()` function), complete an imputation round in each iteration of the loop, and store the mean value from that round in a list which we can analyse afterwards.

In [None]:
average_price = []

for i in range(100):
    # Create a new copy of the dataframe to work with
    imputed_data = food_missing.copy()
    
    # Create a list of imputed values that we will insert into imputed_data
    imputed_values = np.random.uniform(min_price, max_price, num_missing_prices)
    
    # Insert the imputed values into the dataframe
    imputed_data.loc[imputed_data['Price'].isna(), 'Price'] = imputed_values
    
    # Calculate the new average price
    imputed_mean = imputed_data['Price'].mean()
    
    # Add the imputed mean to the average_price list
    average_price.append(imputed_mean)
    
average_price

We can use some of the plotting methods from earlier to visualise these averages.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.boxplot(average_price, vert=False, labels = [''])
ax.set_xlabel('Average grocery price');

We now have a *distribution* for the average price of groceries under the uncertainty caused by the missing values. Because the imputed values are random, the exact numbers change from run to run, but in the run shown here 50% of the time the average fell between £2.20 and £2.40, and we could say with reasonable certainty that the true mean falls between £1.90 and £2.60.
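
The ranges read off a boxplot can also be computed directly. In the sketch below, `average_price` is a randomly generated stand-in for the 100 imputed averages (an assumption so that the example runs on its own); with the real list from the loop above, only the percentile lines are needed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the 100 imputed averages collected in the loop
average_price = rng.normal(loc=2.3, scale=0.15, size=100)

# The box of a boxplot spans the 25th to 75th percentiles (the middle 50%)
q25, q75 = np.percentile(average_price, [25, 75])

# A wider interval: the middle 95% of the imputed averages
low, high = np.percentile(average_price, [2.5, 97.5])

print(f'Middle 50% of averages: {q25:.2f} to {q75:.2f}')
print(f'Middle 95% of averages: {low:.2f} to {high:.2f}')
```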

But we could probably narrow down this range of values by using other information. If you have completed the section on Plotting, you'll know that there does seem to be some relationship between `Health_Rating` and `Price` - information we could be using! This is where multiple imputation gets very interesting.

Try using some of the methods you have seen on the `diabetes_df` dataframe to quantify how variable the average `SkinThickness` is due to the missing data.

For those that want more of a challenge, try and impute values for the `food_missing` dataframe using information about the `Health_Rating`. Some tips:
* For each rating, store the max and min values in a dictionary.
    * What might you do to handle the lack of data with a `Health_Rating` of F?
* In the imputation process, refer to that dictionary to retrieve the `min` and `max` values for the random selection.
* Because rows with missing values have different `Health_Rating` values, you might need to iterate through each `Health_Rating`.

## Building some simple models

Here we will not cover in great depth how the algorithms we are using work; only some brief details will be provided. For more information, you can head to __[this discussion of Regression](https://www.codecademy.com/article/introduction-regression-analysis)__ or __[this discussion of Random Forests](https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/)__. You can also head over to the '*Statistical Modelling Demo*' directory to see some more details on statistical modelling. Our objective will be to build some 'models' that can predict the outcome in certain scenarios. That is, we will not look in much detail at inference or at understanding how these predictions are made.

### Example 1: `food_guide_price_df`
Our first objective here will be to predict `Price` given a `Health_Rating`.
#### Set Up
We will construct a 'training' set which will be used to fit the model and a 'test' set to evaluate the model. This split will be generated by randomly selecting 25% of the dataset to include in the test set. The `sklearn` package offers useful classes and methods to split the data, train the different models we will be using, and evaluate them. Note that we set a 'random seed' to ensure that we get the same split consistently.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

import numpy as np
np.random.seed(9001)

split = train_test_split(food_guide_price_df)
split

Here we can see that the output of the `train_test_split` function is a list of the form `[train, test]`.

In [None]:
train = split[0]
test = split[1]
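
As an alternative sketch (on a made-up 16-row dataframe, `toy_df`, which is not part of the course data), the two pieces can be unpacked in one line, and the seed can be passed straight to `train_test_split` through its `random_state` argument rather than set globally with `np.random.seed`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 16-row dataframe standing in for the course data
toy_df = pd.DataFrame({'x': range(16), 'y': [2 * v for v in range(16)]})

# test_size=0.25 is what sklearn uses when no sizes are given; stated explicitly here.
# random_state fixes the shuffle for this call only.
train, test = train_test_split(toy_df, test_size=0.25, random_state=9001)

print(len(train), len(test))
print(test)
```

Passing `random_state` keeps the split reproducible even if other code draws random numbers in between.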

Now we pick out the columns we are using to make predictions, and the targets for both the train and test sets. We refer to the columns we use to make predictions as `X` and the targets as `y`, that is, we are using `X` to predict `y`

In [None]:
X_train = train['Health_Rating']
y_train = train['Price']

X_test = test['Health_Rating']
y_test = test['Price']

X_train

To be able to construct a model which uses this categorical data, we need to *one hot encode* our data. That is, for each `Health_Rating`, we will have a column of 0/1 binary indicators. `sklearn` can once again help us out.

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
X_train_encoded = encoder.fit_transform(np.array(X_train).reshape(-1,1))
X_test_encoded = encoder.transform(np.array(X_test).reshape(-1,1))

X_train_encoded.toarray()
# the 12x5 matrix output is precisely what we wanted

We can see which feature is which with the `get_feature_names_out` method.

In [None]:
encoder.get_feature_names_out()

That is, the columns of our matrix above are in the order A, B, X, E, F.
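
To see the encoding on a scale small enough to check by hand, here is a sketch with invented ratings (the letters are illustrative and may not match the real column). It also shows one possible way to cope with a rating that appears in the test set but not in the training set: by default the encoder raises an error, whereas `handle_unknown='ignore'` encodes it as a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical ratings; the real column may contain different letters
train_ratings = np.array(['A', 'C', 'B', 'A']).reshape(-1, 1)
test_ratings = np.array(['B', 'D']).reshape(-1, 1)  # 'D' never appears in training

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error (the default behaviour)
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train_ratings).toarray()
test_encoded = encoder.transform(test_ratings).toarray()

print(encoder.categories_[0])  # column order: sorted unique training values
print(train_encoded)
print(test_encoded)
```

Each training row has exactly one 1, in the column for its rating, while the unseen `'D'` row is all zeros.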

#### Regression

We instantiate the `LinearRegression` class, and then fit the model to the training data with the `.fit(X, y)` method.

In [None]:
regression = LinearRegression()
regression.fit(X_train_encoded, y_train)

Congratulations, technically you have just trained your first machine learning model! Now we can predict our test set values

In [None]:
regression_predicted_values = regression.predict(X_test_encoded)
regression_predicted_values
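
It can help to know what such a model is doing. With a single one hot encoded predictor, a linear regression has nothing to work with except which category a row belongs to, so its prediction for a category seen in training works out to the average training price of that category. The sketch below checks this on invented data (`toy_train` is hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data
toy_train = pd.DataFrame({
    'Health_Rating': ['A', 'A', 'B', 'B', 'C'],
    'Price': [1.0, 3.0, 2.0, 4.0, 5.0],
})

encoder = OneHotEncoder()
X = encoder.fit_transform(toy_train[['Health_Rating']]).toarray()

model = LinearRegression().fit(X, toy_train['Price'])

# Predict once per category and compare with the mean training price per category
categories = pd.DataFrame({'Health_Rating': ['A', 'B', 'C']})
predictions = model.predict(encoder.transform(categories).toarray())
group_means = toy_train.groupby('Health_Rating')['Price'].mean()

print(predictions)
print(group_means.to_numpy())
```

This is one way to interpret the predicted values: items with the same rating always receive the same predicted price.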

Let's set up a results dataframe where we can start to see how well we are doing

In [None]:
test_results_df = split[1].copy()
test_results_df['Regression'] = regression_predicted_values
test_results_df

#### Random Forest

In [None]:
forest = RandomForestRegressor()
forest.fit(X_train_encoded, y_train)

In [None]:
forest_predicted_values = forest.predict(X_test_encoded)
forest_predicted_values

In [None]:
test_results_df['Forest'] = forest_predicted_values
test_results_df

The linear regression model has the better predictions for two of the items and the random forest for the other two; however, neither set of predictions is very good. This is because we are using a very small dataset, for which the training set is not representative of the test set. Let's look at an example where we are using some more predictors.

### Example 2: `sample_df`

#### Set Up
We split the data as before

In [None]:
np.random.seed(9001)

split_iris = train_test_split(sample_df)
train_iris = split_iris[0]
test_iris = split_iris[1]

Now we will use a few more predictors - this time we will predict `petal_width` from `petal_length`, `sepal_width`, and `sepal_length`

In [None]:
X_train_iris = train_iris[['petal_length', 'sepal_width', 'sepal_length']]
y_train_iris = train_iris['petal_width']

X_test_iris = test_iris[['petal_length', 'sepal_width', 'sepal_length']]
y_test_iris = test_iris['petal_width']


#### Regression

In [None]:
regression_iris = LinearRegression()
regression_iris.fit(X_train_iris, y_train_iris)

In [None]:
X_train_iris

In [None]:
regression_predicted_values_iris = regression_iris.predict(X_test_iris)
regression_predicted_values_iris

In [None]:
test_results_df_iris = split_iris[1].copy()
test_results_df_iris['Regression'] = regression_predicted_values_iris
test_results_df_iris.head()

#### Random Forest

In [None]:
forest_iris = RandomForestRegressor()
forest_iris.fit(X_train_iris, y_train_iris)

In [None]:
forest_predicted_values_iris = forest_iris.predict(X_test_iris)
forest_predicted_values_iris

In [None]:
test_results_df_iris['Forest'] = forest_predicted_values_iris
test_results_df_iris.head()

#### Which method performed best? 

A typical measure of accuracy for these types of regression tasks would be the mean squared error. That is, we calculate the difference between the true and predicted values, square them all, and take the average. 

In [None]:
regression_errors = test_results_df_iris['Regression'] - test_results_df_iris['petal_width']
squared_reg_errors = regression_errors**2
regression_mse = np.mean(squared_reg_errors)

forest_errors = test_results_df_iris['Forest'] - test_results_df_iris['petal_width']
squared_forest_errors = forest_errors**2
forest_mse = np.mean(squared_forest_errors)

print('Linear Regression mean squared error: ' + str(regression_mse))
print('Random Forest mean squared error: ' + str(forest_mse))

In this case, we see that the linear regression model has a lower mean squared error, and therefore would be considered the better model.
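
The same calculation is available ready-made in `sklearn.metrics`. As a sketch on invented values (`y_true` and `y_pred` are hypothetical), `mean_squared_error` should agree with the manual computation, and taking the square root gives an error in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values
y_true = np.array([0.2, 1.3, 1.8, 2.1])
y_pred = np.array([0.3, 1.1, 1.8, 2.5])

# Manual version: difference, square, average
manual_mse = np.mean((y_pred - y_true) ** 2)

# Library version
sklearn_mse = mean_squared_error(y_true, y_pred)

# Root mean squared error, in the same units as y
rmse = np.sqrt(sklearn_mse)

print(manual_mse, sklearn_mse, rmse)
```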

### Exercise: `diabetes_df`
You have seen some examples of how we use linear regression models and random forests to predict values, and assess which method is best. Have a play with `diabetes_df`: set yourself a prediction task, try out these two models and see how well they perform. Which one was better? Explore the predictions, maybe do some plots, and try to understand the results. If you have time, maybe check out some other __[supervised learning methods](https://scikit-learn.org/stable/supervised_learning.html)__, in particular, LASSO (1.1.3), Elastic Net (1.1.5), Support Vector Machines for Regression (1.4.2) and Gradient Boosting (1.11.4) might work well.

In [None]:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()