What's the name(s) of the person(s) sitting next to you? And if they woke up tomorrow magically knowing all programming, what would they do with their new powers?

# Scientific Computing II: Pandas, Matplotlib, SciPy, Scikit Learn

_Last Lecture:_
- `numpy` - for lists of numbers

_This lecture:_
- `pandas` - for heterogeneous data (i.e. not just numbers)
- `matplotlib` - for drawing plots
- `scipy` - for statistical analysis
- `sklearn` - for machine learning

Most of this lecture will focus on `pandas`, which is what you are most likely to use in real life. We will also look at the others _briefly_, so you have a basic idea of _when_ to reach for them.

## Pandas is a Python library for spreadsheet-like data

Last lecture we learned that NumPy is for lots of _numbers_ (1-D, 2-D, 3-D, or N-D arrays of numbers...or booleans).

Today we learn Pandas. **Pandas is for _tables_ of data.** Stuff like this:

| subject_id | score   | group | condition    |
|------------|---------|-------|--------------|
| '001'      | 16.5    | 2     | 'cognition'  |
| '002'      | 22.0    | 1     | 'perception' |
| '003'      | 18.1    | 1     | 'perception' |

Unlike homogeneous arrays of numbers, table data...

- may be both continuous (numeric) and categorical (discrete) in the same dataset
- has labeled columns (and sometimes labeled rows)
- may need different data types to be stored (_heterogeneous_ data)

A table in pandas is called a "dataframe".

In [None]:
# Everyone imports Pandas as pd
import pandas as pd

In [None]:
# Create some example heterogeneous data
row1 = {'Subj_ID': '001', 'score': 16.5, 'group' : 2, 'condition': 'cognition'}
row2 = {'Subj_ID': '002', 'score': 22.0, 'group' : 1, 'condition': 'perception'}
row3 = {'Subj_ID': '003', 'score': 18.1, 'group' : 1, 'condition': 'perception'}

In [None]:
# Create a dataframe
df = pd.DataFrame([row1, row2, row3], index=[0, 1, 2]) # [0, 1, 2] are the row labels and can be omitted

In [None]:
# Check out the dataframe
# Notice the column names
# And the row labels 0 1 2
df

In [None]:
# You can index in pandas, by column
df['condition']

(This new kind of object here is a _series_, which is like a list but with index labels on every item. Here, the labels are `0` `1` `2` just like a list or array. But the labels could be anything.)
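
To see that the labels really can be anything, here is a minimal sketch of a series built by hand. The names used as labels are made up for illustration and are not part of the lecture data:

```python
import pandas as pd

# Hypothetical labels chosen for illustration
scores = pd.Series([16.5, 22.0, 18.1], index=['alice', 'bob', 'carol'])

print(scores['bob'])       # look up one item by its label
print(list(scores.index))  # the labels themselves
```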

In [None]:
# Like NumPy, you can index using a list, here we grab two columns
# This gives us a new dataframe
df[['score','group']]

In [None]:
df # original df unchanged

Indexing a dataframe with square brackets grabs columns. To get rows, we have to use `.loc` (by label) or `.iloc` (by position).

In [None]:
df[0] # does not work: raises a KeyError because there is no column named 0

In [None]:
# .loc takes a row label, then a column label
df.loc[0, :]

In [None]:
# Or
df.loc[0]

In [None]:
# To grab multiple rows, use a list of row labels (a label may be repeated)
df.loc[[0,0,1]]

In [None]:
# Same
df.loc[[0,0,1], :]
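
`.loc` looks rows up by label, while `.iloc` looks them up by integer position. With the default labels `0` `1` `2` the two look identical, so this sketch uses made-up string row labels (`'a'`, `'b'`, `'c'`, which are not in the lecture data) to make the difference visible:

```python
import pandas as pd

scores = pd.DataFrame(
    {'score': [16.5, 22.0, 18.1]},
    index=['a', 'b', 'c'],  # hypothetical row labels, for illustration only
)

by_label = scores.loc['b', 'score']  # the row labeled 'b'
by_position = scores.iloc[1, 0]      # the second row, first column

print(by_label, by_position)  # both refer to the same cell
```

In this sketch `scores.loc[1]` would fail, because no row is labeled `1`.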

In [None]:
# Shape, in (rows, columns) like NumPy
df.shape

In [None]:
# How many rows there are in a series/df
len(df) # df.shape[0] would also work

As we are seeing, **dataframes**:

- are a data structure for labeled rows and columns of data
- have associated methods for working with data.
- are arranged by columns, each of which is a Pandas **Series**

#### Class Question #1

```python
row1 = {'Subj_ID': '001', 'score': 16.5, 'group' : 2, 'condition': 'cognition'}
row2 = {'Subj_ID': '002', 'score': 22.0, 'group' : 1, 'condition': 'perception'}
row3 = {'Subj_ID': '003', 'score': 18.1, 'group' : 1, 'condition': 'perception'}

df = pd.DataFrame([row1, row2, row3])
df
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Subj_ID</th>
      <th>score</th>
      <th>group</th>
      <th>condition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>001</td>
      <td>16.5</td>
      <td>2</td>
      <td>cognition</td>
    </tr>
    <tr>
      <th>1</th>
      <td>002</td>
      <td>22.0</td>
      <td>1</td>
      <td>perception</td>
    </tr>
    <tr>
      <th>2</th>
      <td>003</td>
      <td>18.1</td>
      <td>1</td>
      <td>perception</td>
    </tr>
  </tbody>
</table>

How do we extract just the `score` column?

#### Class Question #2

How do we extract both the `Subj_ID` and `score` columns together (all rows)?

#### Class Question #3

How do we extract the first row (all columns)?

#### Class Question #4

How do we extract the first row as a 1-row _dataframe_ (all columns)?

#### Class Question #5

How do we extract the second and third row together (all columns)?

#### Class Question #6

How do we extract the second and third row, but only the `Subj_ID` and `score` columns?

#### Class Question #7

How do we get the third row's `score` value?

### Descriptive statistics

There are *a lot* of functions and methods within pandas. The general syntax is `df.method()` where the `method()` computes something about the dataframe `df`. These methods are usually _not_ in-place (they do not modify the original dataframe).

In [None]:
df

In [None]:
# calculate summary statistics
df.describe()
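
Because these methods return a new object instead of changing `df`, you can store the result and index into it like any other dataframe. A minimal sketch, using a small stand-in dataframe (`df_demo` is a name made up for this example):

```python
import pandas as pd

# Stand-in data with the same numeric columns as the example above
df_demo = pd.DataFrame({
    'score': [16.5, 22.0, 18.1],
    'group': [2, 1, 1],
})

summary = df_demo.describe()               # a new dataframe of statistics
mean_score = summary.loc['mean', 'score']  # row label 'mean', column 'score'

print(mean_score)
print(df_demo.shape)  # the original still has 3 rows and 2 columns
```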

In [None]:
# Get the max of a column
df['score'].max()

In [None]:
# Take the average of the two numeric columns
df[['score','group']].mean()

Break down how many of each category there are by using `.value_counts()`:

In [None]:
val_counts = df['condition'].value_counts()
val_counts

This is a series so we can index by the labels:

In [None]:
val_counts['perception']

In [None]:
val_counts['cognition']

In [None]:
# .unique() says which unique values are there
df['condition'].unique()

In [None]:
# .nunique() says how many unique values are there
df['condition'].nunique()

In [None]:
# What's the category that shows up the most?
val_counts = df['condition'].value_counts()
val_counts.idxmax()

In [None]:
# What's the count of the value that shows up the most?
val_counts.max()

#### Class Question #8

Compute the _mean_ age of the following five people.

In [None]:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age':  [25, 20, 22, 25, 18],
    'city': ['San Diego', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
})
df

In [None]:
# answer here

#### Class Question #9

Compute the _most common_ age of the five people.

In [None]:
df # remind ourselves of the data

In [None]:
# answer here

#### Class Question #10

What ordinary Python type is the best analogy for a dataframe?

- A) A dataframe is like a list
- B) A dataframe is like a dictionary
- C) A dataframe is like a tuple
- D) A dataframe is like a list of tuples
- E) A dataframe is like a list of dictionaries
- F) A dataframe is like a dictionary of lists

### Three ways to initialize a dataframe

List of dictionaries:

In [None]:
# Each dictionary is a row.
row1 = {'Subj_ID': '001', 'score': 16.5, 'group': 2, 'condition': 'cognition'}
row2 = {'Subj_ID': '002', 'score': 22.0, 'group': 1, 'condition': 'perception'}
row3 = {'Subj_ID': '003', 'score': 18.1, 'group': 1, 'condition': 'perception'}

df = pd.DataFrame([row1, row2, row3])
df

Dictionary of lists:

In [None]:
# Each list is a column.
col1 = ['001', '002', '003']
col2 = [16.5, 22.0, 18.1]
col3 = [2, 1, 1]
col4 = ['cognition', 'perception', 'perception']

df = pd.DataFrame({
    'Subj_ID':   col1,
    'score':     col2,
    'group':     col3,
    'condition': col4
})
df
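
The row-wise and column-wise constructions should describe the same table. A minimal sketch that checks this with `.equals()`, using a shortened two-column version of the data (the variable names are made up for this example):

```python
import pandas as pd

# Row-wise: a list of dictionaries
from_rows = pd.DataFrame([
    {'Subj_ID': '001', 'score': 16.5},
    {'Subj_ID': '002', 'score': 22.0},
    {'Subj_ID': '003', 'score': 18.1},
])

# Column-wise: a dictionary of lists
from_columns = pd.DataFrame({
    'Subj_ID': ['001', '002', '003'],
    'score':   [16.5, 22.0, 18.1],
})

# .equals() compares values, labels, and column types
print(from_rows.equals(from_columns))
```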

From a comma-separated value (CSV) file:

In [None]:
# There is a '14.2-data.csv' file in the LectureNotes folder.
!cat '14.2-data.csv'

In [None]:
df = pd.read_csv('14.2-data.csv')
df

In [None]:
# But note it inferred that our subject IDs were integers and we lost the leading zeros.
#
# Let's force Subj_ID to be strings:

df = pd.read_csv('14.2-data.csv', dtype={'Subj_ID': str})
df
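
If you do not have the CSV file handy, the same type-inference behavior can be reproduced from a string. This is a sketch: the CSV text below is made up and is not the contents of the lecture's file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "Subj_ID,score\n001,16.5\n002,22.0\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'Subj_ID': str})

print(inferred['Subj_ID'].tolist())  # parsed as integers, leading zeros gone
print(as_text['Subj_ID'].tolist())   # kept as strings, leading zeros intact
```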

### Modifying columns

In [None]:
# Remind ourselves what df is
df

To edit values within a column and replace original values, the general form is:

```python
df[column] = some_list_or_array_of_the_same_length
```

String replacement...

In [None]:
df['Subj_ID'] = df['Subj_ID'].replace('00', '000', regex=True)
df['Subj_ID']

Changing the type of a variable in a column...

In [None]:
df['group']

In [None]:
df['group'].astype(float)

In [None]:
df['group'] # original unchanged

In [None]:
# To change the original, assign to that column
df['group'] = df['group'].astype(float)
df['group']

In [None]:
df # yep it changed now.

To add a new column, just assign to a new name. The new column can be whatever, as long as it's the right length.

In [None]:
# Here we add the subject id as an integer:

df['my_new_column'] = df['Subj_ID'].astype(int)
df

#### Class Question #11

To the dataframe, add a new column named `age` with the values 22, 19, and 21 for rows 0, 1, and 2 respectively.

### Computing new columns from existing columns

Let's mix it up and use some different data now.

In [None]:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age': [25, 20, 22, 25, 18],
    'city': ['San Diego', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
})

df

The columns behave somewhat like NumPy arrays, which makes it convenient to operate on them.

In [None]:
df['age']

In [None]:
df['age'] + 20

In [None]:
df['age'] + df['age']

In [None]:
df

In [None]:
df['double_age'] = df['age'] + df['age']

In [None]:
df['years_until_40'] = 40 - df['age']

In [None]:
df

#### Class Question #12

To the dataframe, add a column called `estimated_birth_year` that is the estimated year of the person's birth (based on simple subtraction from the current year).

### Filtering rows

Sometimes you want to select only some of the rows, based on some criteria. Let's filter down to only the people over 20 years old.

In [None]:
df

In [None]:
df['age'] > 20

Like NumPy arrays, we can index with a list (or series) of booleans.

This lets us filter down the data.

In [None]:
df.loc[df['age'] > 20]

In [None]:
df.loc[df['age'] > 20,  'name']
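
Filters can also be combined. A minimal sketch, assuming you want more than one condition at once: `&` means "and", `|` means "or", and each condition needs its own parentheses. The dataframe is rebuilt here under the made-up name `people` so the example runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age':  [25, 20, 22, 25, 18],
    'city': ['San Diego', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
})

# Over 20 AND not in Chicago
over_20_not_chicago = people.loc[(people['age'] > 20) & (people['city'] != 'Chicago')]
print(over_20_not_chicago['name'].tolist())

# Under 20 OR living in San Diego
under_20_or_sd = people.loc[(people['age'] < 20) | (people['city'] == 'San Diego')]
print(under_20_or_sd['name'].tolist())
```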

#### Class Question #13

Let's put it all together...

**Get the mean age of all the people whose name begins with 'C' or later in the alphabet.**

- Hint: you can compare against a string e.g. `something >= 'C'`

**If you want to learn more about Pandas** or want to see a different presentation of all of the above, check out the Pandas chapter of the _Learning Data Science_ online textbook by UCSD's very own Sam Lau (and coauthors): [https://learningds.org/ch/06/pandas_intro.html](https://learningds.org/ch/06/pandas_intro.html)

## Matplotlib is for scientific plotting

Matplotlib is almost universally regarded as clumsy to use.

Nevertheless, it's really flexible, which is why I use it anyway.

It's also built in to pandas with `df.plot()`

**You may, and in fact are encouraged to, use ChatGPT to generate your matplotlib code. The matplotlib API is so unlearnable there is no pedagogical value in trying to teach it.**

Here's the briefest teaser of matplotlib:

In [None]:
# The standard import incantation:

import matplotlib.pyplot as plt
import matplotlib as mpl

In [None]:
# Create some data
import numpy as np
dat = np.array([1, 2, 4, 8, 16, 32])

In [None]:
# Plot the data
plt.plot(dat)

In [None]:
# Or, as a bar chart.
#
# The bar function wants X-Positions, Heights

x_positions = np.arange(len(dat)) # [0, 1, 2, 3, 4, 5]

plt.bar(x_positions, dat)
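
Since plotting is also built in to pandas, the same bar chart can be drawn straight from a dataframe. A rough sketch: the dataframe `growth`, its column names, and the title are made up for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up dataframe holding the same numbers as dat
growth = pd.DataFrame({
    'step':  [0, 1, 2, 3, 4, 5],
    'value': [1, 2, 4, 8, 16, 32],
})

# df.plot() draws with matplotlib and returns the matplotlib Axes object
ax = growth.plot(x='step', y='value', kind='bar')
ax.set_xlabel('step')
ax.set_ylabel('value')
ax.set_title('Doubling at every step')
```

The returned `ax` is an ordinary matplotlib object, so any matplotlib customization still applies.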

_Many_ plot types are available.

_Lots_ of customization is possible. Otherwise, matplotlib would have been forgotten long ago.

Again, try using ChatGPT for generating matplotlib plots. (Show it some of your code and ask it to plot.)

## SciPy is for statistical analysis

For normal distributions, hypothesis testing, etc.

Use SciPy if you need probability distributions or need to calculate statistical significance.

A _brief_ demo:

In [None]:
import scipy as sp
from scipy import stats

In [None]:
# Simulate some data
d1 = stats.norm.rvs(loc=0, size=1000)
d2 = stats.norm.rvs(loc=0.5, size=1000)

In [None]:
# Plot the data
plt.hist(d1, 25, alpha=0.6);
plt.hist(d2, 25, alpha=0.6);

In [None]:
# Statistically compare the two distributions with a t-test
stats.ttest_ind(d1, d2)

In [None]:
# Wow, that's a reeeaaaally low p-value! Must not be the same distribution.
stats.ttest_ind(d1, d2).pvalue
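
The simulated data above is different on every run, so the exact p-value changes too. A minimal sketch, assuming you want repeatable numbers: pass a seed through `random_state` (the seeds `0` and `1` are arbitrary choices, and the variable names are made up for this example).

```python
from scipy import stats

# Same simulation as above, but seeded so every run draws the same samples
sample_a = stats.norm.rvs(loc=0, size=1000, random_state=0)
sample_b = stats.norm.rvs(loc=0.5, size=1000, random_state=1)

result = stats.ttest_ind(sample_a, sample_b)

print(result.statistic)  # negative here because sample_a has the smaller mean
print(result.pvalue)     # far below the usual 0.05 threshold
```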

## Scikit-Learn is for machine learning

Machine learning (ML) is trying to predict some target property from other input properties.

ML does not have to be fancy deep neural networks.

Again, just a brief demo. Let's try a quick linear regression.

Our data from before:

In [None]:
import pandas as pd

row1 = {'Subj_ID': '001', 'score': 16.5, 'group' : 2, 'condition': 'cognition'}
row2 = {'Subj_ID': '002', 'score': 22.0, 'group' : 1, 'condition': 'perception'}
row3 = {'Subj_ID': '003', 'score': 18.1, 'group' : 1, 'condition': 'perception'}

df = pd.DataFrame([row1, row2, row3])
df

Let's try to predict `score`, based on `group` and `condition`.

The inputs and outputs need to be numeric, so for `condition` we will code `'cognition'` as `0` and `'perception'` as `1`.

In [None]:
# Remember this sort of thing?
df['condition'] == 'perception'

In [None]:
(df['condition'] == 'perception').astype(int)

In [None]:
# Remember we add a new column by assigning a list/array to a new column name.

df['condition_as_indicator'] = (df['condition'] == 'perception').astype(int)
df

In [None]:
# Traditionally, the input for ML is called X

X = df[['group', 'condition_as_indicator']]
X

In [None]:
# And the target is called y

y = df['score']
y

In [None]:
df

In [None]:
# Train a simple linear model to predict y from X

from sklearn import linear_model

reg = linear_model.LinearRegression()

reg.fit(X, y) # this does the training

# The trained model is stored in reg

print('The regression coefficients: ', reg.coef_)
print('The regression intercept: ', reg.intercept_)
print()
print('That is, we predict:')
print('score = ', reg.coef_[0], '* group + ', reg.coef_[1], '* condition_as_indicator + ', reg.intercept_)

In [None]:
# Put the predictions back in the dataframe as "predicted_score" so we can see it all together.
#
# The predict() method predicts using the trained model

X = df[['group', 'condition_as_indicator']]

df['predicted_score'] = reg.predict(X)
df
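
The same `predict()` method also works on inputs the model has never seen, which is usually the point of training one. A minimal sketch on made-up toy data that follows `y = 2*x + 1` exactly (the names `toy`, `model`, and `new_inputs` are invented for this example):

```python
import pandas as pd
from sklearn import linear_model

# Made-up toy data: y is exactly 2*x + 1
toy = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [3, 5, 7, 9]})

model = linear_model.LinearRegression()
model.fit(toy[['x']], toy['y'])  # double brackets keep X as a dataframe

# Inputs that were not in the training data
new_inputs = pd.DataFrame({'x': [5, 10]})
predictions = model.predict(new_inputs)

print(model.coef_, model.intercept_)  # close to 2 and 1
print(predictions)                    # close to 11 and 21
```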

### COGS 108: Data Science in Practice

<div class="alert alert-info">
If you are interested in data science and scientific computing in Python, consider taking <b>COGS 108</b>: <a href="https://github.com/COGS108/">https://github.com/COGS108/</a>.
</div>

## Scientific Computing Recap

We looked at these _briefly_, so you have an idea of _when_ to reach for them.

- `numpy` - for lists of numbers
- `pandas` - for tables of heterogeneous data (i.e. not just numbers)
- `matplotlib` - for drawing plots
- `scipy` - for probability distributions and statistical analysis
- `sklearn` - for machine learning

To actually use these you will have to Google your questions and/or read the docs. Honestly, that's what real-world programming is. You cannot memorize everything.