# Welcome to the Beginner Python Workshop 

**Topic: Scripting and plotting with a dataset**

This notebook will give you a basic introduction to the Python world. Some of the topics mentioned below are also covered in the [tutorials and tutorial videos](https://github.com/GuckLab/Python-Workshops/tree/main/tutorials).

Eoghan O'Connell, Guck Division, MPL, 2021

In [106]:
# notebook metadata you can ignore!
info = {"workshop": "05",
        "topic": ["scripting", "plotting", "pandas",
                  "matplotlib", "csv", "iris", "data",
                  "curve fitting"],
        "version" : "0.0.2"}

### How to use this notebook

- Click on a cell (each box is called a cell) and hit "shift+enter" to run it!
- You can run the cells in any order, but later cells rely on variables defined in earlier ones, so running from top to bottom is safest.
- The output of runnable code is printed below the cell.
- Check out this [Jupyter Notebook Tutorial video](https://www.youtube.com/watch?v=HW29067qVWk).

See the help tab above for more information!


# What is in this Workshop?
In this notebook we cover:
- How to open a `.csv` file (excel/csv/tsv spreadsheet) with pandas
- How to work with pandas dataframes
   - Looking at columns, rows, slicing, indexing, concat, changing cells
   - How to convert between pandas dataframes and numpy arrays
- How to plot with pandas and matplotlib
   - Simple figures
   - Subfigures, 3D plots
- Curve fitting 

Check out the tutorial video series by Corey Schafer on pandas [here](https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS).

In [1]:
# import necessary modules
%matplotlib nbagg
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit

### Opening a `.csv` file (excel) 

We will look at the iris dataset. It includes data on three species of the iris flower genus. For each species, it records the petal width and length and the sepal width and length.

In [2]:
df = pd.read_csv(r"../data/iris.csv")

In [9]:
# print out the first rows of the dataframe

df.head()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [10]:
df.tail()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [11]:
# you can see the documentation for more parameter options:
#  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

# For example, we can only use some columns to create the DataFrame

df = pd.read_csv(r"../data/iris.csv", usecols=["sepallength", "petallength"])
print(df.head())

   sepallength  petallength
0          5.1          1.4
1          4.9          1.4
2          4.7          1.3
3          4.6          1.5
4          5.0          1.4


In [27]:
df = pd.read_csv(r"../data/iris.csv")
df.head()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


### How to work with pandas dataframes

In [28]:
# the left column holds the row numbers, accessible with df.index

print(df.index)

print(df.index[0])

print(df.index[0:2])

RangeIndex(start=0, stop=150, step=1)
0
RangeIndex(start=0, stop=2, step=1)


In [29]:
# columns can be accessed with df.columns

print(df.columns)

print(df.columns[0])

print(df.columns[0:2])

Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')
sepallength
Index(['sepallength', 'sepalwidth'], dtype='object')


In [39]:
# we can access column data with df["column title"]

print(df["sepallength"])
print("\n")
print(df["sepallength"][0])
print("\n")
print(df["sepallength"][0:2])

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepallength, Length: 150, dtype: float64


5.1


0    5.1
1    4.9
Name: sepallength, dtype: float64
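Plain square brackets on a Series mix label lookup (`[0]`) and positional slicing (`[0:2]`), which works here because the rows are numbered from 0. As a minimal sketch, the explicit `.iloc` (position) and `.loc` (label) indexers make the intent clearer. The small `demo` dataframe is a stand-in built from the first rows shown above, not the full dataset.

```python
import pandas as pd

# small stand-in for the iris dataframe (values copied from the first rows above)
demo = pd.DataFrame({
    "sepallength": [5.1, 4.9, 4.7, 4.6, 5.0],
    "petallength": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# .iloc selects by integer position: the stop value is excluded, like Python slicing
first_two = demo["sepallength"].iloc[0:2]

# .loc selects by index label: the stop label is included
first_three = demo["sepallength"].loc[0:2]

# a whole row comes back as a Series indexed by the column names
row_zero = demo.iloc[0]

print(first_two.tolist())
print(first_three.tolist())
print(row_zero["petallength"])
```

Note that the same `0:2` slice returns two rows with `.iloc` but three rows with `.loc`.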


In [47]:
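# a single column is a Series; a list of columns gives a DataFrame
# note: a notebook cell only displays the value of its last expression
type(df["sepallength"])
type(df[["sepallength", "petallength"]])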

pandas.core.frame.DataFrame

In [43]:
# we can access multiple column data too by listing the column titles

df[["sepallength", "petallength"]]

Unnamed: 0,sepallength,petallength
0,5.1,1.4
1,4.9,1.4
2,4.7,1.3
3,4.6,1.5
4,5.0,1.4
...,...,...
145,6.7,5.2
146,6.3,5.0
147,6.5,5.2
148,6.2,5.4


#### Dataframe methods

In [49]:
df.head(10)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [55]:
# we can easily get column values that match other column values with the df.loc indexer

petallength_setosa = df.loc[df["class"] == "Iris-setosa", "petallength"]

print(petallength_setosa[0:10])

# let's break that up
# df["class"]
# df["class"] == "Iris-setosa"

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
5    1.7
6    1.4
7    1.5
8    1.4
9    1.5
Name: petallength, dtype: float64
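The `df.loc` selection above can be broken into its steps. The sketch below does this on a tiny, made-up dataframe (`demo` and its values are illustrative only), so each intermediate result is small enough to read.

```python
import pandas as pd

# toy data: four rows, hypothetical values
demo = pd.DataFrame({
    "petallength": [1.4, 4.7, 1.3, 6.0],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"],
})

# step 1: the column of class labels
labels = demo["class"]

# step 2: compare every label, giving a boolean Series (the "mask")
mask = labels == "Iris-setosa"
print(mask.tolist())

# step 3: .loc keeps the rows where the mask is True and returns the requested column
selected = demo.loc[mask, "petallength"]
print(selected.tolist())
```

The selected values keep their original row labels (0 and 2 here), which is why the setosa rows in the full dataset are still numbered from 0.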


In [65]:
# we can easily get statistics of our data

print(df.mean(numeric_only=True))

# print(df.std(numeric_only=True))
# print(df.median(numeric_only=True))
# print(df.mode(numeric_only=True))

sepallength    5.843333
sepalwidth     3.054000
petallength    3.758667
petalwidth     1.198667
dtype: float64
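The means above are taken over all three species together. One possible way to get the same statistic per species is `groupby`, sketched here on a small made-up dataframe (the `demo` values are hypothetical, chosen so the means are easy to check by hand).

```python
import pandas as pd

# toy data with two classes and round numbers
demo = pd.DataFrame({
    "petallength": [1.0, 2.0, 4.0, 6.0],
    "petalwidth": [0.2, 0.4, 1.0, 2.0],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
})

# group the rows by their class label, then average each numeric column per group
per_class_mean = demo.groupby("class").mean(numeric_only=True)
print(per_class_mean)

# a single value can be read out with .loc[row label, column label]
setosa_mean = per_class_mean.loc["Iris-setosa", "petallength"]
print(setosa_mean)
```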


In [66]:
# let's add a new column that holds the mean of the numeric columns in each row

df["mean along row"] = df.mean(numeric_only=True, axis=1)

df.head()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class,mean along row
0,5.1,3.5,1.4,0.2,Iris-setosa,2.55
1,4.9,3.0,1.4,0.2,Iris-setosa,2.375
2,4.7,3.2,1.3,0.2,Iris-setosa,2.35
3,4.6,3.1,1.5,0.2,Iris-setosa,2.35
4,5.0,3.6,1.4,0.2,Iris-setosa,2.55


In [69]:
# you can assign a new value to a specific cell with df.at

print(df.at[0, "mean along row"])

df.at[0, "mean along row"] = 42

print(df.at[0, "mean along row"])

2.55
42.0


In [70]:
df.head()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class,mean along row
0,5.1,3.5,1.4,0.2,Iris-setosa,42.0
1,4.9,3.0,1.4,0.2,Iris-setosa,2.375
2,4.7,3.2,1.3,0.2,Iris-setosa,2.35
3,4.6,3.1,1.5,0.2,Iris-setosa,2.35
4,5.0,3.6,1.4,0.2,Iris-setosa,2.55


In [84]:
# we can concatenate (join together) several dataframes with pd.concat

df2 = pd.concat([df, df])

print(len(df))
print(len(df2))

150
300


In [85]:
df2

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class,mean along row
0,5.1,3.5,1.4,0.2,Iris-setosa,42.000
1,4.9,3.0,1.4,0.2,Iris-setosa,2.375
2,4.7,3.2,1.3,0.2,Iris-setosa,2.350
3,4.6,3.1,1.5,0.2,Iris-setosa,2.350
4,5.0,3.6,1.4,0.2,Iris-setosa,2.550
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,4.300
146,6.3,2.5,5.0,1.9,Iris-virginica,3.925
147,6.5,3.0,5.2,2.0,Iris-virginica,4.175
148,6.2,3.4,5.4,2.3,Iris-virginica,4.325


In [86]:
df2 = pd.concat([df, df], ignore_index=True)
df2

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class,mean along row
0,5.1,3.5,1.4,0.2,Iris-setosa,42.000
1,4.9,3.0,1.4,0.2,Iris-setosa,2.375
2,4.7,3.2,1.3,0.2,Iris-setosa,2.350
3,4.6,3.1,1.5,0.2,Iris-setosa,2.350
4,5.0,3.6,1.4,0.2,Iris-setosa,2.550
...,...,...,...,...,...,...
295,6.7,3.0,5.2,2.3,Iris-virginica,4.300
296,6.3,2.5,5.0,1.9,Iris-virginica,3.925
297,6.5,3.0,5.2,2.0,Iris-virginica,4.175
298,6.2,3.4,5.4,2.3,Iris-virginica,4.325


In [95]:
# we can concatenate along columns too

df3 = pd.concat([df, df], axis=1)

print(len(df.columns))
print(len(df3.columns))

6
12


In [96]:
df3

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class,mean along row,sepallength.1,sepalwidth.1,petallength.1,petalwidth.1,class.1,mean along row.1
0,5.1,3.5,1.4,0.2,Iris-setosa,42.000,5.1,3.5,1.4,0.2,Iris-setosa,42.000
1,4.9,3.0,1.4,0.2,Iris-setosa,2.375,4.9,3.0,1.4,0.2,Iris-setosa,2.375
2,4.7,3.2,1.3,0.2,Iris-setosa,2.350,4.7,3.2,1.3,0.2,Iris-setosa,2.350
3,4.6,3.1,1.5,0.2,Iris-setosa,2.350,4.6,3.1,1.5,0.2,Iris-setosa,2.350
4,5.0,3.6,1.4,0.2,Iris-setosa,2.550,5.0,3.6,1.4,0.2,Iris-setosa,2.550
...,...,...,...,...,...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,4.300,6.7,3.0,5.2,2.3,Iris-virginica,4.300
146,6.3,2.5,5.0,1.9,Iris-virginica,3.925,6.3,2.5,5.0,1.9,Iris-virginica,3.925
147,6.5,3.0,5.2,2.0,Iris-virginica,4.175,6.5,3.0,5.2,2.0,Iris-virginica,4.175
148,6.2,3.4,5.4,2.3,Iris-virginica,4.325,6.2,3.4,5.4,2.3,Iris-virginica,4.325
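When dataframes with the same column names are joined along `axis=1`, `pd.concat` keeps both copies of each label. A minimal sketch of one way to tell them apart is the `keys` argument; the `left` and `right` dataframes and the key names are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"petallength": [1.4, 1.3]})
right = pd.DataFrame({"petallength": [1.5, 1.6]})

# plain column-wise concat: the column label appears twice
plain = pd.concat([left, right], axis=1)
print(plain.columns.tolist())

# keys= adds an outer column level, so each block can be selected by its key
keyed = pd.concat([left, right], axis=1, keys=["first", "second"])
print(keyed.columns.tolist())
print(keyed["second"]["petallength"].tolist())
```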


#### Dataframes and Arrays

Sometimes we want to convert numeric data between dataframes and numpy arrays.
Neither is "better", but each has its own strengths.

In [97]:
# converting numeric data

print(type(df))
print(type(df["petalwidth"]))

arr1 = np.array(df["petalwidth"])

print(type(arr1))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
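The cell above converts a single column to an array. As a sketch of the round trip for several numeric columns, `DataFrame.to_numpy()` gives a 2D array and `pd.DataFrame(...)` rebuilds a dataframe from it. The `demo` dataframe is a small stand-in, and the column names have to be passed back in by hand because a numpy array does not store them.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "petallength": [1.4, 1.4, 1.3],
    "petalwidth": [0.2, 0.2, 0.2],
})

# dataframe -> 2D numpy array (rows x columns); the column names are dropped
arr = demo.to_numpy()
print(type(arr), arr.shape)

# numpy array -> dataframe; supply the column names again
back = pd.DataFrame(arr, columns=demo.columns)
print(back.equals(demo))
```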


In [98]:
# let's compare the time to compute the mean
# remember that you usually don't have to worry about speed, only when you have very big datasets

series1 = df["petalwidth"]

print("Time taken for pandas series:")
%timeit -r 3 -n 100 series1.mean()

print("\nTime taken for numpy array:")
%timeit -r 3 -n 100 arr1.mean()

Time taken for pandas series:
75.5 µs ± 18.1 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

Time taken for numpy array:
5.94 µs ± 229 ns per loop (mean ± std. dev. of 3 runs, 100 loops each)


In [99]:
df.head()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class,mean along row
0,5.1,3.5,1.4,0.2,Iris-setosa,42.0
1,4.9,3.0,1.4,0.2,Iris-setosa,2.375
2,4.7,3.2,1.3,0.2,Iris-setosa,2.35
3,4.6,3.1,1.5,0.2,Iris-setosa,2.35
4,5.0,3.6,1.4,0.2,Iris-setosa,2.55


In [19]:
# numpy isn't really designed for strings! Use dataframes for strings and non-numeric data

arr2 = np.array(df["class"])
print(arr2.dtype)
print(arr2[45:55])

object
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor']


### How to plot with pandas and matplotlib

We will plot first with pandas, and see how it uses matplotlib in the background!

#### Simple figures

In [100]:
df = pd.read_csv(r"../data/iris.csv")

# plotting with pandas is really easy!

df.plot.scatter(x="petallength", y="petalwidth", alpha=0.6)


<AxesSubplot: xlabel='petallength', ylabel='petalwidth'>

In [101]:
# we can look at boxplots

df.boxplot(by="class")


array([[<AxesSubplot: title={'center': 'petallength'}, xlabel='[class]'>,
        <AxesSubplot: title={'center': 'petalwidth'}, xlabel='[class]'>],
       [<AxesSubplot: title={'center': 'sepallength'}, xlabel='[class]'>,
        <AxesSubplot: title={'center': 'sepalwidth'}, xlabel='[class]'>]],
      dtype=object)

In [102]:
# using matplotlib only, basic plot

x = df["petallength"]
y = df["petalwidth"]

plt.figure()
plt.plot(x, y, linestyle="", marker="o", alpha=0.6)
plt.show()



#### Advanced figures and subfigures

In [103]:
# annotating a plot with some stats and info

# first get the mean of setosa petalwidth and petallength

setosa_petallength_mean = df[0:50]["petallength"].mean()
setosa_petalwidth_mean = df[0:50]["petalwidth"].mean()

fig, axs = plt.subplots(figsize=(6, 3))
df[0:50].plot.scatter(x="petallength", y="petalwidth", ax=axs, alpha=0.6)

axs.axvline(setosa_petallength_mean, color="black", alpha=0.6)
axs.axhline(setosa_petalwidth_mean, color="red", alpha=0.6)

axs.text(setosa_petallength_mean-0.03, 0.32, "Mean Petal Length", rotation=90)
axs.text(0.97, setosa_petalwidth_mean+0.01, "Mean Petal Width", color="red")

axs.set_ylim((0, 0.9))

axs.set_title("Setosa Petal Width vs. Length")
plt.tight_layout()
plt.show()


In [104]:
# we can use matplotlib to edit our figure

fig, axs = plt.subplots(figsize=(6, 3))

df[0:50].plot(ax=axs)

axs.set_xlabel("Setosa Measurement")
axs.set_ylabel("Length (cm)")

axs.set_ylim((-0.5, 8))
axs.legend(ncol=2, loc="upper right")

plt.tight_layout()
# save before plt.show(): with some backends the figure is closed once it has been shown
fig.savefig(r"../data/Iris Setosa measurements.png")
plt.show()
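The resolution of a saved raster image can be set with the `dpi` argument of `savefig`. The sketch below is a standalone script rather than a notebook cell: it selects the non-interactive `Agg` backend (drop that line inside the notebook) and writes to a temporary folder. The file name and the value `dpi=300` are assumptions for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([1.4, 1.3, 1.5], [0.2, 0.2, 0.2], linestyle="", marker="o")
ax.set_xlabel("petallength")
ax.set_ylabel("petalwidth")
fig.tight_layout()

# hypothetical output location; the notebook itself saves into ../data/
out_path = os.path.join(tempfile.mkdtemp(), "demo_figure.png")

# dpi sets the raster resolution; 300 is an assumed, commonly requested value
fig.savefig(out_path, dpi=300)
plt.close(fig)

print(os.path.getsize(out_path) > 0)
```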


In [26]:
# fig, axs = plt.subplots(2, 2, figsize=(5, 5))
# ax1, ax2, ax3, ax4 = axs.flatten()

fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(234)
ax3 = fig.add_subplot(235)
ax4 = fig.add_subplot(236)

linestyle = ["-", "--", "-.", "-"]

df.plot(ax=ax1, style=linestyle)
df[0:50].plot(ax=ax2, style=linestyle, legend=None)
df[50:100].plot(ax=ax3, style=linestyle, legend=None)
df[100:150].plot(ax=ax4, style=linestyle, legend=None)

ax1.set_xlabel("All Measurements")
ax2.set_xlabel("Setosa Measurement")
ax3.set_xlabel("Versicolor Measurement")
ax4.set_xlabel("Virginica Measurement")

ax1.set_ylabel("Length (cm)")
ax2.set_ylabel("Length (cm)")

ax1.set_ylim((-0.5, 10.5))
ax1.legend(ncol=2, loc="upper left")

plt.tight_layout()
plt.show()


#### 3D plots

In [105]:
# There is a third dimension???

xlab = "petallength"
ylab = "petalwidth"
zlab = "sepallength"

x = df[xlab]
y = df[ylab]
z = df[zlab]

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(projection='3d')

for (start, stop), color in zip([(0, 50), (50, 100), (100, 150)], ["k", "blue", "red"]):
    ax.plot(xs=x[start:stop], ys=y[start:stop], zs=z[start:stop], color=color, linestyle="", marker="^")

ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
ax.set_zlabel(zlab)
ax.set_title("Wow, am I wearing 3D glasses?")

plt.tight_layout()
plt.show()


### Curve fitting

We can do some curve fitting with scipy or other packages.
See the scipy documentation on non-linear least-squares curve fitting: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

In [28]:
# least-squares fitting of 2D data

# create a linear fitting function (n=1)
def linear_fit(x, m, c):
    y = (m * x) + c
    return y

In [29]:
# get some data from our dataframe

xlab = "petallength"
ylab = "petalwidth"

x = df[xlab]
y = df[ylab]

popt, pcov = curve_fit(linear_fit, x, y)
popt2, pcov2 = curve_fit(linear_fit, x[0:50], y[0:50])
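`curve_fit` returns the best-fit parameters (`popt`) and their estimated covariance matrix (`pcov`). A common way to turn `pcov` into one-standard-deviation parameter errors is to take the square root of its diagonal. The sketch below shows this on synthetic data; the "true" slope and intercept, the noise level, and the `_demo` names are all made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def linear_fit(x, m, c):
    return (m * x) + c


# synthetic data: a known line plus a little random noise (assumed values)
rng = np.random.default_rng(seed=0)
x_demo = np.linspace(1.0, 7.0, 50)
y_demo = linear_fit(x_demo, 0.4, -0.3) + rng.normal(scale=0.05, size=x_demo.size)

popt_demo, pcov_demo = curve_fit(linear_fit, x_demo, y_demo)

# the diagonal of pcov holds the parameter variances; sqrt gives standard errors
perr_demo = np.sqrt(np.diag(pcov_demo))

for name, value, err in zip(["m", "c"], popt_demo, perr_demo):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```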

In [30]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, linestyle="", marker="o", alpha=0.6, label='Data')
ax1.plot(x, linear_fit(x, *popt), 'k-', label='Our fit')
ax1.plot(x[0:50], linear_fit(x[0:50], *popt2), 'r-', label='Our fit 2')

ax1.set_xlabel(xlab)
ax1.set_ylabel(ylab)
ax1.set_title("Linear fit of Petal Width vs. Length")
ax1.legend()
plt.show()


In [31]:
# what about other polynomial degrees?

def n2_fit(x, m, c, a):
    y = (a * x**2) + (m*x) + c
    return y

def n3_fit(x, m, c, a, b):
    y = (b * x**3) + (a * x**2) + (m*x) + c
    return y

xlab = "sepallength"
ylab = "petalwidth"
x = df[xlab]
y = df[ylab]

x2 = x[0:150]
y2 = y[0:150]
popt2, pcov = curve_fit(n2_fit, x2, y2)
popt3, pcov = curve_fit(n3_fit, x2, y2)

In [32]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, linestyle="", marker="o", alpha=0.6, label='Data')
# we have to arrange the data from lowest to highest so the line displays correctly
x2_arranged = x2.sort_values()
ax1.plot(x2_arranged, n2_fit(x2_arranged, *popt2), 'k-', label='Our fit n=2')
ax1.plot(x2_arranged, n3_fit(x2_arranged, *popt3), 'r--', label='Our fit n=3')

ax1.set_xlabel(xlab)
ax1.set_ylabel(ylab)
ax1.set_title("Poly. fit of Petal Width vs. Sepal Length")
ax1.legend()
plt.show()
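For plain polynomials, `np.polyfit` is an alternative to writing a fit function by hand. As a sketch, the comparison below fits the same noise-free quadratic with both approaches; the coefficients `0.5`, `-2.0`, `1.0` and the `_demo` names are assumed values for illustration. Note the different parameter order: `n2_fit` takes `(m, c, a)`, while `np.polyfit` returns the highest power first.

```python
import numpy as np
from scipy.optimize import curve_fit


def n2_fit(x, m, c, a):
    return (a * x**2) + (m * x) + c


# noise-free quadratic with known (assumed) coefficients
x_demo = np.linspace(4.0, 8.0, 30)
y_demo = 0.5 * x_demo**2 - 2.0 * x_demo + 1.0

# curve_fit returns the parameters in the order of the function signature: m, c, a
popt_demo, _ = curve_fit(n2_fit, x_demo, y_demo)
m_fit, c_fit, a_fit = popt_demo

# np.polyfit returns coefficients from the highest power to the lowest: a, m, c
coeffs = np.polyfit(x_demo, y_demo, deg=2)

print("curve_fit (a, m, c):", a_fit, m_fit, c_fit)
print("polyfit   (a, m, c):", coeffs)
```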


In [33]:
# and some exponentials???

xlab = "sepallength"
ylab = "petalwidth"
x = df[xlab]
y = df[ylab]
x2 = x[0:150]
y2 = y[0:150]

def exp_fit(x, a, b, c):
    return a * np.exp(-b * x) + c

popt, pcov = curve_fit(exp_fit, x2, y2)

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, linestyle="", marker="o", alpha=0.6, label='Data')
# we have to arrange the data from lowest to highest so the line displays correctly
x2_arranged = x2.sort_values()
ax1.plot(x2_arranged, exp_fit(x2_arranged, *popt), 'k-', label='Our fit Exp')

ax1.set_xlabel(xlab)
ax1.set_ylabel(ylab)
ax1.set_title("Exp fit of Petal Width vs. Sepal Length")
ax1.legend()
plt.show()
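Non-linear fits such as exponentials can be sensitive to where the optimiser starts. `curve_fit` accepts an initial guess through its `p0` argument; without it, every parameter starts at 1. The sketch below uses synthetic data with made-up "true" parameters and a rough guess, so the names and numbers are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp_fit(x, a, b, c):
    return a * np.exp(-b * x) + c


# synthetic decaying data with known (assumed) parameters plus a little noise
rng = np.random.default_rng(seed=1)
x_demo = np.linspace(0.0, 4.0, 60)
y_demo = exp_fit(x_demo, 2.5, 1.3, 0.5) + rng.normal(scale=0.02, size=x_demo.size)

# rough starting values for a, b, c (read off by eye in a real analysis)
p0 = [2.0, 1.0, 0.0]
popt_demo, pcov_demo = curve_fit(exp_fit, x_demo, y_demo, p0=p0)

print(popt_demo)
```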


### Exercises

- Scripting
  - You have an Excel spreadsheet and want to fit your data with a polynomial curve
  - You need to create a plot displaying this curve fit, along with the following
     - Subfigures describing the data, axis labels, error bars
     - The saved figure needs to have a publication-ready resolution
