# Exercise 00: Initialising Python 

`5 mins`

You can use Google Colab or a local Python file on your computer, whichever you prefer. 

Check you can import the libraries and there aren't any problems. 

## Demonstrators! Do not help people with Advanced/Super Advanced; help the other students.

**Advanced** If you are working locally, set up your project directories. 

**Super Advanced** Initialise a python environment and initialise Git. 
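
For the Advanced task, the folders can be created from Python instead of by hand. This is only a minimal sketch: `raw_data` is the subdirectory used in Exercise 01, while the function name `make_project_dirs` and the other folder names are hypothetical choices, not a required layout.

```python
from pathlib import Path


def make_project_dirs(root, subdirs=("raw_data", "clean_data", "figures")):
    """Create the project root and its subdirectories; return the folder paths."""
    root = Path(root)
    created = []
    for name in subdirs:
        folder = root / name
        # parents=True also creates the root folder if it is missing;
        # exist_ok=True makes the call safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `make_project_dirs("penguin_project")` would create the three folders inside a `penguin_project` directory in the current working directory.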

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

# Pip install for Google Colab
!pip install palmerpenguins

# On local computers, they need to pip install some of these. 
#     pip install package-name

# If they're using an environment it will be something like 
#     conda install package-name
#     poetry add package-name
#     pipenv install package-name
# If they get stuck, make them use Google Colab instead. 

from palmerpenguins import load_penguins


# Set figure styles
sns.set_theme(style="white")
sns.set_style("white")

# high resolution plots
%config InlineBackend.figure_format = 'retina'

penguins = load_penguins()


# Exercise 01: Looking at a Dataset with Pandas

`5 mins`

- Load the penguins dataset
- Find out what these methods do:
  - `.describe()`
  - `.head()`
  - `.info()`
  - `.tail()`
- Find out what these attributes do:
  - `.columns`
  - `.ndim`
  - `.size`
- Create a simple plot of two variables
- Create a plot of just the Gentoo penguins. 

**Advanced** Save a copy of your raw data as `.csv` and put it in the `raw_data/` subdirectory. 

**Super Advanced** Find the raw penguins dataset online (it has 17 columns) and explore it. Investigate what methods you could use to clean it up. Put your pipeline in `cleaning.py`. Use `git add`, `git commit` and `git push` for version control. 
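
For the Advanced task, saving a data frame as `.csv` could look something like the sketch below. The three-row table is a made-up stand-in so that the example runs on its own, and the file name `penguins_raw.csv` is a hypothetical choice; in the exercise you would save the `penguins` data frame instead.

```python
from pathlib import Path

import pandas as pd

# A tiny made-up stand-in table; in the exercise this would be `penguins`.
example_data = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
})

# Create the raw_data/ subdirectory if it does not exist yet.
raw_data_dir = Path("raw_data")
raw_data_dir.mkdir(exist_ok=True)

csv_path = raw_data_dir / "penguins_raw.csv"
# index=False leaves out the row numbers, so the file holds only the data columns.
example_data.to_csv(csv_path, index=False)

# Read the file back in to check that the shape is unchanged.
reloaded = pd.read_csv(csv_path)
print(reloaded.shape)
```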





In [None]:
penguins.info()
penguins.describe()
penguins.head()
penguins.tail()

penguins.shape
penguins.columns
penguins.ndim
penguins.size

penguins.plot(kind="scatter", x="body_mass_g", y="bill_length_mm")

# The exercise asks for a plot of just the Gentoo penguins:
gentoo_only = penguins[penguins["species"] == "Gentoo"]

gentoo_only.plot(kind="scatter", x="body_mass_g", 
                                 y="bill_length_mm", 
                                 title="Gentoo penguins only")


# Exercise 02: Using Matplotlib to Plot

`5 mins`

Create a scatter plot with all three penguin species in different colours. 

*I will be recapping for loops in the next section. If this is very familiar to you, try the following instead of listening to the next section.*

**Advanced** Create a for loop for this. 

**Super Advanced** Create a function for this. 

In [None]:
# No Loops:

# Subset the data
adelie_only = penguins[penguins["species"] == "Adelie"]
gentoo_only = penguins[penguins["species"] == "Gentoo"]
chinstrap_only = penguins[penguins["species"] == "Chinstrap"]

# Then we plot all three species on the same plot:
plt.scatter(gentoo_only["body_mass_g"], gentoo_only["bill_length_mm"], color="lightseagreen")
plt.scatter(adelie_only["body_mass_g"], adelie_only["bill_length_mm"], color="coral")
plt.scatter(chinstrap_only["body_mass_g"], chinstrap_only["bill_length_mm"], color="mediumorchid")

# Figure Aesthetics
plt.title("All Penguins")
plt.xlabel("Body mass (g)")
plt.ylabel("Bill length (mm)")

# We can also add a legend to the plot 
# (Important! This list of penguin names is in the same order as they were plotted):
plt.legend(["Gentoo", "Adelie", "Chinstrap"])



In [None]:
# Advanced, with a loop:

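# Get the species names from the data so that we can loop over them:
species_list = penguins["species"].unique().tolist()

plt.figure()

# Then we can use a for loop to plot the data for each species:
for species in species_list: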
    subset_species_only = penguins[penguins["species"] == species]
    plt.scatter(subset_species_only["body_mass_g"], 
                subset_species_only["bill_length_mm"], label=species)

plt.legend()

# And the rest of the plot
plt.title("All penguins")
plt.xlabel("Body mass (g)")
plt.ylabel("Bill length (mm)")


# Exercise 03: Another Kind of Plot

`5 mins`

Now try plotting histograms of one variable in the data frame, with the species overlaid on top of each other, using a for loop. 

Hint: You can make your histogram slightly transparent by adding the argument `alpha=0.5`. 

**Advanced** Create a function for your plotting for loop. 

**Super Advanced** Continue creating a function for your plotting loop. When done, save it as a function in a separate file, and try to import it into your main script. 



In [None]:
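# The species names to loop over, and a colour for each one:
species_list = penguins["species"].unique().tolist()
colour_dict = {"Adelie": "coral", 
               "Gentoo": "lightseagreen", 
               "Chinstrap": "mediumorchid"}

plt.figure()

# Then we can use a for loop to plot the data for each species:
for species in species_list:
    subset_species_only = penguins[penguins["species"] == species]

    plt.hist(subset_species_only["body_mass_g"], 
             bins=20, 
             alpha = 0.5, 
             label = species, 
             color = colour_dict[species])

# And the rest of the plot
# (a histogram has the counts on the y-axis, not a second variable)
plt.xlabel("Body mass (g)")
plt.ylabel("Number of penguins")
plt.legend()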


# 10 MIN BREAK

- Check you have a subplot figure and a loop within a loop.
- Finish any of the exercises you are doing. 
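
If the loop within a loop is hard to picture, the sketch below runs the same two loops without any plotting and prints which subplot each histogram would land in. It assumes the 2 x 2 grid from this section; the hard-coded species order and the `positions` dictionary are only there for illustration. `enumerate(..., start=1)` is one alternative to updating a counter by hand.

```python
# Column names and axis labels, as in the subplot code for this section.
data_variable_dict = {"body_mass_g": "Body mass (g)",
                      "bill_length_mm": "Bill length (mm)",
                      "bill_depth_mm": "Bill depth (mm)"}

n_cols = 2  # the figure is a 2 x 2 grid

positions = {}

# Outer loop: one pass per species. Inner loop: one pass per variable.
for species in ["Adelie", "Gentoo", "Chinstrap"]:
    # start=1 skips subplot 0, which holds the scatter plot.
    for counter, column_name in enumerate(data_variable_dict, start=1):
        # Flattened subplots are numbered row by row,
        # so divmod turns the flat number into (row, column).
        row, col = divmod(counter, n_cols)
        positions[column_name] = (row, col)
        print(species, column_name, "-> subplot", counter, "at", (row, col))
```

Every species visits the same three subplots, which is why the histograms for the three species end up on top of each other.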



In [None]:
# Final Subplot Code

data_variable_dict = {"body_mass_g"     :   "Body mass (g)", 
                      "bill_length_mm"  :   "Bill length (mm)", 
                      "bill_depth_mm"   :   "Bill depth (mm)"}

fig, axes = plt.subplots(2, 2, figsize=(7, 7),constrained_layout=True)

for species in species_list:
  
    subset_species_only = penguins[penguins["species"] == species]

    # The top left subplot is different to the others. 

    axes[0, 0].scatter(subset_species_only["body_mass_g"], 
                subset_species_only["bill_length_mm"], 
                label = species, 
                color = colour_dict[species])
    
    axes[0, 0].set_title("Body mass vs bill length")
    axes[0, 0].set_xlabel("Body mass (g)")
    axes[0, 0].set_ylabel("Bill length (mm)")
    axes[0, 0].legend()
    
    counter = 1 # Need to start from the second subplot. 

    # Plotting subplots 2-4. 
    for column_name in data_variable_dict.keys():

        current_ax = axes.flatten()[counter] # Makes the subplots 0,1,2,3 rather than [0,0] etc

        current_ax.hist(subset_species_only[column_name], 
                        bins=20, 
                        alpha = 0.5, 
                        label = species, 
                        color = colour_dict[species])

        # A histogram has the variable on the x-axis and the counts on the y-axis.
        current_ax.set_xlabel(data_variable_dict[column_name])

        counter = counter + 1
plt.show()


# Exercise 04: Printing P Value Results

- Make sure you have this part of the lesson working.
- Try a p_value of exactly 0.05. Can you fix the code?

**Advanced** Make this into a function. 


In [None]:
# The complete code

p_value = 0.05

if p_value > 0.05:
    print("The p-value is larger than the critical threshold. We cannot reject the null hypothesis.")
    print("The p-value = " + str(p_value))

elif p_value < 0.001:
    print("The p-value is smaller than the critical threshold. We can reject the null hypothesis.")
    print("The p-value is < 0.001.")

elif p_value < 0.05:
    print("The p-value is smaller than the critical threshold. We can reject the null hypothesis.")
    print("The p-value = " + str(p_value))

else:
    print("Your p-value defies maths. I don't know what to do with it.")

In [3]:
# To fix the 0.05 bug, change the first comparison at the top to >=:
#     if p_value >= 0.05:
# With p_value = 0.05 this comparison is now true:

p_value >= 0.05

True

# Exercise 05: Function P Value Results

`5 mins`

- Make sure you have a function working
- Try calling it with different values


In [5]:
# Function version

def p_value_to_words(p_value):

    p_value_rounded = str(round(p_value, 3))

    if p_value >= 0.05:
        print(":( We cannot reject the null hypothesis. p-value = " + p_value_rounded + " :(")
    elif p_value < 0.001:
        print("*** The p-value is smaller than the critical threshold. p-value is < 0.001." + " ***")
    elif p_value < 0.05:
        print("* The p-value is smaller than the critical threshold. p-value = " + p_value_rounded + ". *")
    else:
        print("Your p-value defies maths. I don't know what to do with it.")


# Calling the function
p_value_to_words(0.05)

:( We cannot reject the null hypothesis. p-value = 0.05 :(
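
To try several values in one go, it can help if the function returns its message instead of printing it. The sketch below assumes a hypothetical variant called `p_value_summary` with a hypothetical `alpha` argument and shortened messages; it is not the function from the lesson, only one way to check the behaviour at and around each threshold.

```python
def p_value_summary(p_value, alpha=0.05):
    """Return a one-line description of a p-value (hypothetical helper)."""
    if p_value >= alpha:
        return "Cannot reject the null hypothesis. p-value = " + str(round(p_value, 3))
    elif p_value < 0.001:
        return "Reject the null hypothesis. p-value < 0.001"
    else:
        return "Reject the null hypothesis. p-value = " + str(round(p_value, 3))


# Values on either side of each threshold, plus the thresholds themselves.
for value in [0.5, 0.05, 0.049, 0.001, 0.0001]:
    print(value, "->", p_value_summary(value))
```

Because the first comparison uses `>=`, a p-value of exactly 0.05 is handled by the first branch and nothing falls through unhandled.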


# Exercise 06: Linear Model Function

`10 mins`

- Put the linear model in a function
- Call the p-value printing function from inside it

**Advanced People** -- Make the function generalisable for both linear regression and multiple linear regression.

**Super Advanced People** -- Find out how to run blocks of R code, or an R script, within Python. 
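
For the Advanced task, one possible starting point is to build the model formula from a list of predictors, so the same code covers one predictor or several. This is only a sketch: `build_formula` is a hypothetical helper name, and the column names are the ones used elsewhere in this notebook.

```python
def build_formula(y_variable, x_variables):
    """Build a formula string such as 'y ~ x1 + x2' (hypothetical helper)."""
    # Accept a single column name as well as a list of column names.
    if isinstance(x_variables, str):
        x_variables = [x_variables]
    return y_variable + " ~ " + " + ".join(x_variables)


# One predictor gives a simple linear regression formula...
print(build_formula("body_mass_g", "bill_length_mm"))

# ...and several predictors give a multiple regression formula.
print(build_formula("body_mass_g", ["bill_length_mm", "flipper_length_mm"]))
```

The resulting string could then be passed to `ols` in place of a hand-written formula.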

In [None]:
from statsmodels.formula.api import ols

def linear_model_species(penguins_clean, species_name, x_variable, y_variable):

    model_string = y_variable + " ~ " + x_variable

    linear_model = ols(model_string, 
                       data=penguins_clean[penguins_clean["species"] == species_name]).fit()

    r_squared = round(linear_model.rsquared,2)
    # Use the species_name argument here, not a variable from outside the function.
    print(species_name + " (" + x_variable + " vs " + y_variable + ") :  R^2 value = " + str(r_squared))

    # We can call our previous function from inside this one.
    p_value_to_words(linear_model.f_pvalue)
    print("\n")

    return linear_model

# ---
species_list = penguins_clean["species"].unique().tolist()

coords = {"Adelie": (4900, 195), 
          "Gentoo": (6100, 225), 
          "Chinstrap": (4800, 204)}

colour_dict = {"Adelie": "coral", 
               "Gentoo": "lightseagreen", 
               "Chinstrap": "mediumorchid"}

print("Linear model results for bill length vs body mass for each species...")

for species in species_list:

    # Running the function on each species.
    model = linear_model_species(penguins_clean,species, "body_mass_g", "bill_length_mm")
    
    # Now plot an annotation for each loop. 
    text = "R^2 = " + str(round(model.rsquared,2))

    # Now we're using the coords dictionary and the colour dictionary. 
    plt.annotate(text, xy = coords[species], color = colour_dict[species])



---

# OTHER CODE

## Dictionaries

In [None]:
colour_dict = {"Adelie": "coral", 
               "Gentoo": "lightseagreen", 
               "Chinstrap": "mediumorchid"}

data_variable_dict = {"body_mass_g"     :   "Body mass (g)", 
                      "bill_length_mm"  :   "Bill length (mm)", 
                      "bill_depth_mm"   :   "Bill depth (mm)"}

coords = {"Adelie": (4900, 195), 
          "Gentoo": (6100, 225), 
          "Chinstrap": (4800, 204)}

## Statistics

In [None]:
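from statsmodels.formula.api import ols
import statsmodels.stats.multicomp as multi

# T-Test
adelie_body_mass = penguins_clean[penguins_clean["species"] == "Adelie"]["body_mass_g"]
gentoo_body_mass = penguins_clean[penguins_clean["species"] == "Gentoo"]["body_mass_g"]
ttest_results = stats.ttest_ind(adelie_body_mass, gentoo_body_mass)

# ANOVA (fit the model, then build the ANOVA table from it)
anova = ols('body_mass_g ~ C(species)', data=penguins_clean).fit()
anova_table = sm.stats.anova_lm(anova, typ=2)

# Tukey Posthoc Test
multi_comparison = multi.MultiComparison(penguins_clean["body_mass_g"], 
                                         penguins_clean["species"])
multi_comparison_results = multi_comparison.tukeyhsd()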

# Linear Regression Model
linear_model = ols('body_mass_g ~ bill_length_mm', 
                   data=penguins_clean).fit()
linear_model.summary()

# Multiple Regression Model 
multi_linear_model = ols("body_mass_g ~ bill_length_mm + flipper_length_mm", 
                         data=penguins_clean).fit()

# Linear Mixed Effects Model -- Same slope, different species intercept
lmm_model = sm.MixedLM.from_formula("body_mass_g ~ bill_length_mm", 
                                    data=penguins_clean, groups="species").fit()

# Linear Mixed Effects Model -- Different species intercept and different species slope
lmm_model = sm.MixedLM.from_formula("body_mass_g ~ bill_length_mm", 
                                    data=penguins_clean, groups="species", 
                                    re_formula="~bill_length_mm").fit()



## How to compare each species with each other in a loop

In [None]:
from itertools import combinations

species_list = penguins_clean["species"].unique().tolist()

# combinations() gives each pair of species exactly once. Removing items from
# a list while looping over it can skip pairs once there are more than three species.
for species, species_compare in combinations(species_list, 2):

    mass_current_species = penguins_clean[penguins_clean["species"] == species]["body_mass_g"]
    mass_compare_species = penguins_clean[penguins_clean["species"] == species_compare]["body_mass_g"]

    t_test = stats.ttest_ind(mass_current_species, mass_compare_species)
    p_value = t_test[1]

    print("T test for " + species + " vs " + species_compare)
    p_value_to_words(p_value)
    print("\n")

## Making Variables Categories



In [None]:
penguins_clean["species"] = penguins_clean["species"].astype("category")

## Pair-plot

In [None]:
figure = sns.pairplot(penguins_clean, 
                      hue="species", 
                      diag_kind="hist", 
                      corner=True, 
                      palette=colour_dict)
figure.fig.set_figheight(7)
figure.fig.set_figwidth(7)


## Plotting Linear Regression in Seaborn

In [None]:
# Entire dataset
sns.lmplot(x="body_mass_g", y="bill_length_mm", data=penguins_clean)

# Grouped by species
sns.lmplot(x="body_mass_g", y="bill_length_mm", data=penguins_clean, hue="species", palette = colour_dict)
