## Machine Learning for Neuroscience, <br>Department of Brain Sciences, Faculty of Medicine, <br> Imperial College London
### Contributors: Francesca Palermo, Nan Fletcher-Lloyd, Alex Capstick, Yu Chen
**Winter 2022**

# Python for Beginners

This tutorial is adapted from the tutorials provided by DataCamp (https://www.datacamp.com), where further examples and practice questions can be found.

## Basic Python Syntax

### Print Statements

In [None]:
print("Hello, World!")

### Operators & Booleans

Operators perform simple calculations on values.
+ The sum of values is found using +
+ The difference between values is found using -
+ The product of values is found using *
+ The quotient of values is found using /

In [None]:
5 + 5

In [None]:
10 - 5

In [None]:
5 * 5

In [None]:
5 / 5
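
As a quick sketch of how these operators behave together, the cell below stores each result in a variable. The variable names are ours, chosen for illustration. Note that / always returns a float in Python 3, even when the result is a whole number.

```python
# Arithmetic operators applied to two example values
a = 10
b = 5

total = a + b        # sum
difference = a - b   # difference
product = a * b      # product
quotient = a / b     # quotient; / always returns a float in Python 3

print(total, difference, product, quotient)
print(type(quotient))
```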

Comparison operators compare two values and return a Boolean (True or False).
+ less than is <
+ greater than is >
+ less than or equal to is <=
+ greater than or equal to is >=

In [None]:
5 < 10

In [None]:
5 > 10

In [None]:
5 <= 10

In [None]:
5 >= 10
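
The result of a comparison is itself a value of type bool, so it can be stored in a variable like any other value. A minimal sketch follows (the variable names are illustrative). The == (equal to) and != (not equal to) operators are not listed above but work in the same way.

```python
# A comparison evaluates to a Boolean, which can be stored in a variable
is_smaller = 5 < 10
is_larger = 5 > 10
is_equal = 5 == 5        # == tests whether two values are equal
is_different = 5 != 5    # != tests whether two values differ

print(is_smaller, is_larger, is_equal, is_different)
print(type(is_smaller))
```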

### Creating and Assigning Variables

A value can be assigned to a name (the variable), and that value is returned whenever the variable is called.

In [None]:
variable_1 = 5

In [None]:
variable_1

### Data Types

Values in Python can take several different forms (data types). Some of the most common are listed below.
+ int represents integers
+ float represents decimal numbers
+ str or "strings" represents text.
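
The three data types above can be checked with the type() function and, where it makes sense, converted into one another. The sketch below uses example values of our own choosing.

```python
whole_number = 5        # int
decimal_number = 5.0    # float
text = "five"           # str

print(type(whole_number))
print(type(decimal_number))
print(type(text))

# Values can be converted between types where the conversion makes sense
as_float = float(whole_number)   # 5 becomes 5.0
as_text = str(whole_number)      # 5 becomes "5"
as_int = int("5")                # "5" becomes 5
```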

## Lists

Lists are a type of data structure in which values can be stored and accessed. An advantage of lists is that the data types do not need to be the same (see below, where we have an integer, float, and string, respectively). 

A list can be built by assigning a name to a set of values in square brackets (see below). We use the name my_list here, because naming a variable list would overwrite Python's built-in list type.

In [None]:
my_list = [5, 5.0, 'five']

Each item within the list can be accessed by calling the name assigned to the list alongside the position of that item within the list in square brackets. Note, the first value in a list is always given the positional value 0, the second the value 1, and so on, so that in a list of N data points, the final data point will have the positional value N-1.

In [None]:
my_list[0]

Lists can also be built inside of lists. 

In the example below, each value is placed in its own list alongside the name of its data type.

In [None]:
my_list = [["int", 5], ["float", 5.5], ["str", 'five']]

In [None]:
my_list
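
Items in a nested list are reached by chaining positions, one pair of square brackets per level. The sketch below uses a hypothetical list called records. It also shows negative positions, which count from the end of the list.

```python
records = [["int", 5], ["float", 5.5], ["str", "five"]]

first_record = records[0]      # the first inner list
last_record = records[-1]      # negative positions count from the end
float_value = records[1][1]    # second inner list, then its second item

print(first_record)
print(last_record)
print(float_value)
```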


Other types of data structures such as dataframes, series, and arrays will be discussed in further detail later in this tutorial.

## Functions

### Built-in Functions & Methods

Python has many built-in functions and methods that can be called on an object. A method is a function that belongs to an object and is called with dot notation (for example, heights.sort()), whereas a regular function simply takes objects as inputs (for example, max(heights)). So all methods are functions, but not all functions are methods. Below we explore a non-exhaustive list of the functions and methods available in Python.

First, let's create a simple list of numerical values.

In [None]:
heights = [160, 165, 185, 155, 175, 180]

In [None]:
heights

Great! But these heights are out of order. To order them, we apply the sort() method, which arranges the values in ascending order. Note that sort() changes the list itself rather than returning a new list.

In [None]:
heights.sort()

In [None]:
heights

Sorting is one way of finding the minimum and maximum values within a list, as they become the first and last items. An alternative is to call the min() and max() functions.

In [None]:
max(heights)

In [None]:
min(heights)

Let's now create a second list of names that correspond to these heights.

In [None]:
names = ['Charlie','Brooke','Taylor']

In [None]:
names

We can check the length of this list to make sure it matches the previous list. We do this using the len() function.

In [None]:
len(names)

Oops! Looks like we're missing a few names. How do we add these on? There are two methods:
* The append() method adds a single element.
* The extend() method adds multiple elements.

In [None]:
names.append('Morgan')

In [None]:
names

In [None]:
names.extend(['Riley','Jamie'])

In [None]:
names
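
The difference between the two methods is easiest to see when a list is passed to both. In this sketch (the list names are illustrative), append() adds the whole list as one element, while extend() adds each item separately.

```python
appended = ['Charlie', 'Brooke']
appended.append(['Riley', 'Jamie'])   # the whole list is added as ONE element

extended = ['Charlie', 'Brooke']
extended.extend(['Riley', 'Jamie'])   # each name is added as its own element

print(appended)
print(extended)
print(len(appended), len(extended))
```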

Perfect! Now, we can use the index() method to find out the names of the shortest and tallest people.

First, let's double check the type of each list using the type() function. We wrap each call in print(), because a notebook cell only displays the result of its last line.

In [None]:
print(type(heights))
print(type(names))

Now, let's find the index number of the minimum and maximum heights.

In [None]:
min_height = min(heights)
max_height = max(heights)
min_index = heights.index(min_height)
max_index = heights.index(max_height)

With the index positions of the minimum and maximum heights, we can then find the names of the shortest and tallest people in our group of six.

In [None]:
smallest_name = names[min_index]
tallest_name = names[max_index]

In [None]:
smallest_name

In [None]:
tallest_name

It worked! Charlie is the shortest and Jamie is the tallest.
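
The index approach relies on both lists being in the same order, and sort() reorders a list in place. One alternative, sketched below, is to pair each name with its height using the built-in zip() function, so the two values always travel together. The pairing of names to heights here is assumed for illustration.

```python
names = ['Charlie', 'Brooke', 'Taylor', 'Morgan', 'Riley', 'Jamie']
heights = [155, 160, 165, 175, 180, 185]

# zip pairs each name with the height at the same position
pairs = list(zip(names, heights))

# key tells min/max to compare the pairs by height (the second item)
shortest = min(pairs, key=lambda pair: pair[1])
tallest = max(pairs, key=lambda pair: pair[1])

print(shortest)
print(tallest)
```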

### User-Defined Functions

There are four main steps to defining your own function:
1. Use the keyword def, followed by the function name.
2. Add the function's parameters in parentheses, ending the line with a colon.
3. Add the statements that the function should execute, indented beneath the def line.
4. End the function with a return statement if it should output a value; otherwise, return can be omitted.

In [None]:
def hello():
    print("Hello World!")
    return
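
As a sketch of all four steps together, the function below takes a parameter and returns a value. The name mean_height is hypothetical, chosen to match the heights used earlier.

```python
def mean_height(values):
    """Return the mean of a list of numbers."""
    total = sum(values)          # add up every value in the list
    return total / len(values)   # divide by how many values there are

heights = [160, 165, 185, 155, 175, 180]
average = mean_height(heights)
print(average)
```

Because the function returns its result, the value can be stored in a variable (here, average) and reused later.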

## Packages

### Installing packages

Packages are collections of modules, which are themselves collections of classes, functions, variables, and so on.

To install a package from within Jupyter Notebook, run the following:

!pip install --user package_name_here -U

### Importing packages with an alias

Before we can begin to use a package, we must import it into our environment, and we can do so in a way that assigns an alias to the package (usually a recognisable short-hand). 

For example: 

import package_name_here as alias
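
A minimal sketch using math, a module from Python's standard library that needs no installation. The alias m is our own choice; a single function can also be imported by name.

```python
import math as m         # import the whole module under the alias m
from math import sqrt    # import a single function by name

root_with_alias = m.sqrt(16)   # called through the alias
root_direct = sqrt(16)         # called directly

print(m.pi)
print(root_with_alias, root_direct)
```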

## Pandas for Beginners

Pandas is a Python library that allows easy handling of tabular data. 

Pandas can take data from a wide range of sources, including Excel and CSV files.

Pandas allows data to be stored and accessed as series or dataframes. Series objects are like one-dimensional arrays, and a dataframe is built from a set of series objects, one per column.

First, we want to install pandas. Depending on how Jupyter Notebook was installed (for example, through Anaconda), pandas may already be installed; running the following code installs it if it is missing and upgrades it to the latest version: 

In [None]:
!pip install --user pandas -U

Next, we want to import pandas with the alias pd.

In [None]:
import pandas as pd
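
To make the relationship between series and dataframes concrete, here is a small sketch that builds a dataframe from a dictionary of made-up values. Selecting one column returns a series.

```python
import pandas as pd

# Each key becomes a column name; each list becomes that column's values
people = pd.DataFrame({
    "name": ["Charlie", "Brooke", "Taylor"],
    "height": [155, 160, 165],
})

height_column = people["height"]   # selecting one column returns a Series

print(type(people))
print(type(height_column))
print(people.shape)                # (number of rows, number of columns)
```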

Now, let's look at importing a csv file. 

We do this with the pd.read_csv() function, which accepts either a path to a file or a URL.

See below:

In [None]:
tips = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
print(tips)

In the above example, we assigned the name tips to the DataFrame created from the csv file.

To read the csv file using path_to_file, the same code is used, replacing the URL with either file.csv if the file is in your working directory (the relative path) or the absolute path.
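
As a self-contained sketch of reading from a file path, the cell below first writes a tiny csv file to a temporary folder and then reads it back. The file name example.csv and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Write a tiny csv file to a temporary folder so the example is self-contained
folder = tempfile.mkdtemp()
path_to_file = os.path.join(folder, "example.csv")   # an absolute path
with open(path_to_file, "w") as f:
    f.write("total_bill,tip\n16.99,1.01\n10.34,1.66\n")

# Reading from a file path works exactly like reading from a URL
example = pd.read_csv(path_to_file)
print(example)
```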

Since there are many rows in the DataFrame, we see that most of the data is truncated.

To see the first or last few entries, we use the head() or tail() methods, respectively.

In [None]:
tips.head()

In [None]:
tips.tail()

In the above examples, we only see the first and last five entries. 

Pass a number to head() or tail() to change how many entries are returned (with pandas' default display settings, outputs longer than 60 rows are truncated). For example:

In [None]:
tips.head(10)

In [None]:
tips.tail(10)

You can change the shape of the DataFrame by selecting specific columns or rows.

In [None]:
tips_1 = tips[['tip','day']]
print(tips_1)

In [None]:
tips_1 = tips_1.iloc[0:50, :]
print(tips_1)

You can also rename columns to make them easier to remember and/or call.

In [None]:
tips_1.columns = ['Tips','Day of Week']
print(tips_1)

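Pandas can be used to calculate descriptive statistics of a row, column, or group, such as in the examples below:

In [None]:
tips_mean = tips_1.mean(axis=0, numeric_only=True) # numeric_only skips text columns such as 'Day of Week'
tips_mean

In [None]:
tips_median = tips_1.median(axis=0, numeric_only=True)
tips_median

Note, axis=0 calculates the statistic down each column. To calculate it across the columns of each row instead, use axis=1.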
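
The effect of the axis argument can be sketched on a small dataframe of made-up scores with two numeric columns:

```python
import pandas as pd

scores = pd.DataFrame({"test_1": [1.0, 2.0, 3.0], "test_2": [3.0, 4.0, 5.0]})

column_means = scores.mean(axis=0)   # one value per column
row_means = scores.mean(axis=1)      # one value per row

print(column_means)
print(row_means)
```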

But what if we wanted to find the mean tip for each day?

In [None]:
tips_day_mean = tips_1.groupby(['Day of Week'])['Tips'].mean()
tips_day_mean
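
Several statistics can be calculated per group in one step with the agg() method. The sketch below uses a small made-up dataframe with the same column names as tips_1.

```python
import pandas as pd

example = pd.DataFrame({
    "Day of Week": ["Sat", "Sat", "Sun", "Sun"],
    "Tips": [1.0, 3.0, 2.0, 6.0],
})

# One row per day, one column per requested statistic
summary = example.groupby("Day of Week")["Tips"].agg(["mean", "max", "count"])
print(summary)
```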

Pandas dataframes can be further modified by combining and filtering data.

Let's do this in two stages. First, say we want to find the sum of the total bill and tip for each customer. 

In [None]:
tips #print unmodified df

In [None]:
tips['amount'] = tips['total_bill'] + tips['tip'] 
#creates new column 'amount' in df and enters the sum of total_bill and tip by row 
tips

Now, let's say we only want these values for those customers who visited on a Saturday.

In [None]:
tips_day = tips.set_index('day').loc['Sat'].reset_index()
tips_day
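
An alternative way to select rows, sketched here on a small made-up dataframe, is a Boolean mask: a comparison on a column gives one True/False value per row, and indexing with that mask keeps only the rows marked True.

```python
import pandas as pd

example = pd.DataFrame({
    "day": ["Sat", "Sun", "Sat", "Thur"],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

is_saturday = example["day"] == "Sat"   # a Series of True/False values, one per row
saturday_only = example[is_saturday]    # keeps only the rows marked True

print(saturday_only)
```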

And what if we only want these values for parties of exactly 3 people, or of 3 or more people?

In [None]:
tips_size1 = tips[tips['size'] == 3] # == keeps rows where size is exactly 3
tips_size2 = tips[tips['size'] >= 3] # >= keeps rows where size is greater than or equal to 3 (similarly: > greater than, < less than, <= less than or equal to)

In [None]:
tips_size1

In [None]:
tips_size2

Another asset of pandas is that it can convert strings to datetime objects, using pd.to_datetime(). From there, datetime objects can be modified to display only the year, month, day, and so on. This is also a particularly useful function for dealing with transitions.
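
A minimal sketch of this conversion, using made-up dates. After conversion, the .dt accessor gives access to the individual parts of each datetime.

```python
import pandas as pd

visits = pd.DataFrame({"date": ["2022-01-15", "2022-02-20", "2022-12-01"]})
visits["date"] = pd.to_datetime(visits["date"])   # strings become datetime objects

# The .dt accessor exposes the parts of each datetime
visits["year"] = visits["date"].dt.year
visits["month"] = visits["date"].dt.month
visits["weekday"] = visits["date"].dt.day_name()

print(visits)
```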

## NumPy for Beginners

NumPy stands for Numerical Python and is another Python library, providing tools for solving mathematical models of problems on a computer.

One of these tools is the NumPy array.

A NumPy array is a data structure in which data can be stored and accessed as a multi-dimensional array object. 

Arrays have several advantages over Python lists, as they are a more compact data structure allowing for efficient computation of matrices and arrays.
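
One practical difference is sketched below with made-up heights: arithmetic on an array is applied to every element at once, whereas the same operator on a list does something else entirely.

```python
import numpy as np

heights_cm = [160, 165, 185]
heights_array = np.array(heights_cm)

heights_m = heights_array / 100   # the division is applied to every element
doubled_list = heights_cm * 2     # for a list, * repeats the list instead

print(heights_m)
print(doubled_list)
```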

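At a structural level, an array is a combination of four components:
+ data (a pointer to the memory holding the values)
+ dtype (the data type of the values)
+ shape (the size of the array along each dimension)
+ strides (the number of bytes to step along each dimension to reach the next value)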
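
Each of these four components is available as an attribute of the array. A small sketch follows, assuming a 3 by 4 array of 8-byte floats (the array name is illustrative).

```python
import numpy as np

array = np.zeros((3, 4), dtype=np.float64)

print(array.data)      # the memory buffer holding the values
print(array.dtype)     # float64: each value takes 8 bytes
print(array.shape)     # (3, 4): 3 rows and 4 columns
print(array.strides)   # (32, 8): 32 bytes to the next row, 8 to the next column
```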

In [None]:
import numpy as np

We will now show you how to make arrays of zeros, ones, and random values.

In [None]:
np.zeros((3,3))

In [None]:
np.ones((3,3))

In [None]:
np.random.random((3,3))

Here, the two numbers in parentheses indicate the number of rows and columns in the matrix.

To create an array filled with a single chosen number, use the following approach:

In [None]:
np.full((3,3),7) # where the last number is the fill-in value

To create an array of evenly-spaced values, we use the following function:

In [None]:
np.linspace(0, 2, 9) 
# where 0 and 2 indicate the start and endpoint values of the array (the range) and the final number 9 indicates the total number of values.
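
For comparison, NumPy also offers np.arange, which takes a step size rather than a total number of values. The sketch below shows that np.linspace includes the endpoint, while np.arange stops before it.

```python
import numpy as np

evenly_spaced = np.linspace(0, 2, 9)          # 9 values, endpoint 2 included
step = evenly_spaced[1] - evenly_spaced[0]    # the spacing between neighbours

stepped = np.arange(0, 2, 0.25)               # fixed step of 0.25, endpoint 2 excluded

print(evenly_spaced)
print(step)
print(stepped)
```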

## Visualisations in Python

Two of the main data visualisation packages used with Python are Matplotlib and Seaborn.

While the two are complementary, Seaborn specifically targets statistical data visualisation and is built on top of Matplotlib, providing higher-level plotting functions and styling options beyond the Matplotlib defaults. Here we will focus on the use of Seaborn extending Matplotlib.

### Matplotlib

In [None]:
import matplotlib.pyplot as plt

### Seaborn

In [None]:
import seaborn as sns

Now, let's load one of the built-in data sets in the Seaborn library.

In [None]:
tips = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

In [None]:
tips

Seaborn allows us to visualise data using different types of plots. Here, we will demonstrate the differences between swarmplots, barplots, violinplots, and boxplots. This can best be visualised by plotting multiple subplots in one figure using plt.subplots. We use figsize to control the size of the overall figure.
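
As a sketch of how plt.subplots organises the subplots: it returns the figure and an array of axes, and each subplot is addressed by its row and column position. The titles below are only placeholders.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

print(axes.shape)   # one entry per subplot, arranged as rows and columns

# Each subplot is addressed as axes[row, column], counting from 0
axes[0, 0].set_title("top left")
axes[1, 1].set_title("bottom right")
```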

In [None]:
fig,axes = plt.subplots(nrows=2, ncols=2, figsize=(10,8))

sns.swarmplot(x="day", y="tip", data=tips, ax=axes[0,0])
sns.barplot(x="day", y="tip", data=tips, ax=axes[0,1])
sns.violinplot(x="day", y="tip", data=tips, ax=axes[1,0])
sns.boxplot(x="day", y="tip", data=tips, ax=axes[1,1])


As can be seen in the plot above, each plot type provides a slightly different perspective.

The first four plots (swarmplot, barplot, violinplot, and boxplot) are good for visualising the difference between groups (i.e. where one variable is categorical). To differing degrees, each of these plots shows the distribution of values in each group (here, the distribution of tips over different days of the week).

The swarmplot shows the distribution of tip size across the different days by drawing every data point.

The barplot represents an estimate of central tendency (mean, median etc. - here it is the mean, the Seaborn default) by the height of each bar, with some indication of the uncertainty around that estimate given by the error bars.

The violinplot shows the distribution of tip size as a kernel density estimate.

The boxplot summarises the distribution of tip size using the lower quartile, median, and upper quartile (the box) and whiskers that extend towards the minimum and maximum, with each outlier shown individually.


Another plot that is good for visualising data is the scatterplot. This plot allows you to visualise whether there is any relationship between two sets of continuous data (here, the relationship between the size of the tip and the total bill).

In [None]:
sns.scatterplot(y="tip", x="total_bill", data=tips)

One of the major motivations for using Seaborn with Matplotlib is that it allows us more control over the parameters of the plot, such as adding another variable, changing the color palette, setting the limits of the x and y axes, and providing labels and titles etc. Such changes can help us to make our plots more visually appealing and easier to explain.

Using the plot above, we will now add another categorical variable to this plot, using the hue and style parameters to distinguish between the different groups.

In [None]:
sns.scatterplot(y="tip", x="total_bill", data=tips, hue="day", style="day")

We can also change the color palette using the palette parameter and the limits of the x and y axes using the plt.xlim and plt.ylim functions.

Here, we've chosen a color palette known as viridis. 

In [None]:
sns.scatterplot(y="tip", x="total_bill", data=tips, hue="day", style="day", palette='viridis')
plt.xlim(0, 60)
plt.ylim(0, 12)

Learn more about choosing and creating color palettes in seaborn using the links below: 

+ https://seaborn.pydata.org/tutorial/color_palettes.html - how to choose/create a color palette
+ https://www.dropbox.com/s/8autfrvx6dpll96/pal.pdf - in-built seaborn color palettes by name
+ https://medium.com/swlh/how-to-create-a-seaborn-palette-that-highlights-maximum-value-f614aecd706b - matplotlib colors by name

Now create your own color palette!

Finally, we can tidy the labels of the x and y axes and add a title. We can also edit the legend and move it outside of the plot.

In [None]:
sns.scatterplot(y="tip", x="total_bill", data=tips, hue="day", style="day", palette='viridis')
plt.xlim(0, 60)
plt.ylim(0, 12)
plt.xlabel('Total Bill, £') # the x axis shows total_bill
plt.ylabel('Tip, £') # the y axis shows tip
plt.title('Categorical Scatterplot of Tips')
plt.legend(title='Day of Week', ncol=4, bbox_to_anchor=[0.89, -0.2])

It might be helpful to play around with the legend coordinates so you can see how this works.

Try answering the following:

Which coordinate determines the x and y position of the legend, respectively?

In which direction does a negative value move the x and y position of the legend, respectively?

In which direction does a positive value move the x and y position of the legend, respectively?