# Pandas Basics — Teaching Demonstration

In today's class, we're going to introduce some of the basics of [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html), a powerful Python library for working with tabular data like CSV files.

We will cover how to:

* Import Pandas
* Read in a CSV file
* Calculate summary statistics and frequencies
* Explore and filter data
* Make simple plots and data visualizations

___

## Dataset
### The Bellevue Almshouse Dataset

> Nineteenth-century immigration data was produced with the express purpose of reducing people to bodies; bodies to easily quantifiable aspects; and assigning value to those aspects which proved that the marginalized people to who they belonged were worth less than their elite counterparts.

> — Anelise Shrout, ["(Re)Humanizing Data"](https://crdh.rrchnm.org/essays/v01-10-(re)-humanizing-data/)


The dataset that we're working with in this lesson is the [Bellevue Almshouse Dataset](https://www.nyuirish.net/almshouse/the-almshouse-records/), created by historian and DH scholar Anelise Shrout. It includes information about Irish-born immigrants who were admitted to New York City's Bellevue Almshouse in the 1840s.

The Bellevue Almshouse was part of New York City's public health system, a place where poor, sick, homeless, and otherwise marginalized people were sent — sometimes voluntarily and sometimes forcibly. Devastated by widespread famine in Ireland, many Irish people fled their homes for New York City in the 1840s, and many of them ended up in the Bellevue Almshouse.

We're using the [Bellevue Almshouse Dataset](https://www.nyuirish.net/almshouse/the-almshouse-records/) to practice data analysis with Pandas because we want to think deeply about the consequences of reducing human life to data. As Shrout argues in [her essay](https://crdh.rrchnm.org/essays/v01-10-(re)-humanizing-data/), this data purposely reduced people to bodies and "easily quantifiable aspects" in order to devalue their lives, potentially enacting "both epistemic and physical violence" on them.

___

## Import Pandas


> If you installed Python with Anaconda, you should already have Pandas installed. If you did not install Python with Anaconda, see [Pandas Installation](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html).


To use the Pandas library, we first need to `import` it.

In [None]:
import pandas as pd

The above `import` statement not only imports the Pandas library but also gives it an alias, or nickname: `pd`. This alias saves us from having to type out the entire word `pandas` each time we need to use it. Many Python libraries have commonly used aliases like `pd`.

By default, Pandas will display 60 rows and 20 columns. I often change [Pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or columns.

In [None]:
pd.options.display.max_rows = 200
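
As a small sketch of how these display settings work, the cell below also raises the column limit and then reads the row limit back. The value `50` is only an illustrative choice, not a recommendation from this lesson; `pd.set_option()` and `pd.get_option()` are an equivalent way to change and check the same settings.

```python
import pandas as pd

# Show up to 200 rows and 50 columns before Pandas truncates the display.
# (50 is an arbitrary value chosen for illustration.)
pd.options.display.max_rows = 200
pd.options.display.max_columns = 50

# pd.set_option() changes the same setting using its name as a string.
pd.set_option("display.max_columns", 50)

# pd.get_option() reads a setting back so we can confirm the change.
current_max_rows = pd.get_option("display.max_rows")
print(current_max_rows)  # 200
```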

## Read in CSV File

To read in a CSV file, we will use the function `pd.read_csv()` and pass it the path to our file.

In [None]:
pd.read_csv('Bellevue_Almshouse_Dataset.csv', delimiter=',', encoding='utf-8')

### ❓🧐  Students: What do you notice about the data already?

This creates a Pandas [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe), one of the two main data structures in Pandas. A DataFrame looks and acts a lot like a spreadsheet, but it has special powers and functions that we will discuss below and in the next few lessons.

| Pandas objects | Explanation                         |
|----------|-------------------------------------|
| `DataFrame`    | Like a spreadsheet, 2-dimensional    |
| `Series`      | Like a column, 1-dimensional                     |

When reading in the CSV file, we also specified a `delimiter` and an `encoding`. The `delimiter` specifies the character that separates or "delimits" the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for *Comma-Separated Values*.) Sometimes, however, the delimiter of a CSV file might be a tab (`\t`) or, more rarely, another character. The `encoding` specifies how the text characters in the file are stored; `utf-8` is a widely used encoding that can represent characters from almost any language.
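
To see what the `delimiter` does, here is a minimal sketch that reads a tiny tab-separated table. The rows are made up for illustration and are not taken from the Bellevue data. The text lives in memory (via `io.StringIO`) rather than in a file on disk, so no `encoding` is needed here.

```python
import io
import pandas as pd

# Made-up, tab-separated sample data (not from the Bellevue dataset).
tsv_text = "age\tprofession\tdisease\n28\tlaborer\tfever\n35\tweaver\tsickness\n"

# delimiter='\t' tells Pandas that tabs, not commas, separate the columns.
sample_df = pd.read_csv(io.StringIO(tsv_text), delimiter='\t')

print(sample_df.shape)          # (2, 3): two rows, three columns
print(list(sample_df.columns))  # ['age', 'profession', 'disease']
```

If we left out `delimiter='\t'`, Pandas would look for commas, find none, and treat each line as a single column.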

We assign the DataFrame to a variable called `bellevue_df`. It is a common convention to name DataFrame variables `df`, but we want to be a bit more specific.

In [None]:
bellevue_df = pd.read_csv('Bellevue_Almshouse_Dataset.csv', delimiter=',', encoding='utf-8')

## Begin to Examine Patterns

### Select Columns

To select a column from the DataFrame, we will type the name of the DataFrame followed by square brackets and a column name in quotation marks.

In [None]:
bellevue_df['age']

Technically, a single column in a DataFrame is a [*Series* object](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro).

In [None]:
type(bellevue_df['age'])
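
The difference between a Series and a DataFrame can be sketched with a small made-up table (the values below are invented for illustration). Single square brackets return one column as a Series, while a list of column names inside the brackets returns a DataFrame, even if the list holds only one name.

```python
import pandas as pd

# A tiny, made-up DataFrame for illustration only.
toy_df = pd.DataFrame({
    "age": [28, 35, 19],
    "profession": ["laborer", "weaver", "laborer"],
})

one_column = toy_df["age"]    # single brackets: a Series
as_frame = toy_df[["age"]]    # a list of column names: a DataFrame

print(type(one_column).__name__)  # Series
print(type(as_frame).__name__)    # DataFrame
```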

| Pandas method | Explanation                         |
|----------|-------------------------------------|
| `.sum()`      | Sum of values                       |
| `.mean()`     | Mean of values                      |
| `.median()`   | Median of values         |
| `.min()`      | Minimum                             |
| `.max()`      | Maximum                             |
| `.mode()`     | Mode                                |
| `.std()`      | Sample standard deviation (normalized by N-1) |
| `.count()`    | Total number of non-blank values    |
| `.value_counts()` | Frequency of unique values |
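
Here is a quick sketch of several of these methods on a small, made-up Series of ages (not the real data). One value is deliberately missing to show that these methods skip blank values by default.

```python
import pandas as pd

# Made-up ages for illustration; None becomes a missing value (NaN).
ages = pd.Series([20, 30, 30, 40, None], name="age")

print(ages.count())          # 4 (the missing value is not counted)
print(ages.sum())            # 120.0
print(ages.mean())           # 30.0
print(ages.median())         # 30.0
print(ages.min())            # 20.0
print(ages.max())            # 40.0
print(ages.mode().tolist())  # [30.0]
print(ages.std())            # sample standard deviation of the four values
print(ages.value_counts())   # how many times each age appears
```

Note that `.mode()` returns a Series rather than a single number, because more than one value can be tied for most frequent.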

### ❓🧐  How old (on average) were the people admitted to the Bellevue Almshouse?

In [None]:
bellevue_df['age'].mean()

### ❓🧐  How old was the oldest person admitted to Bellevue?

In [None]:
bellevue_df['age'].max()

### ❓🧐  How young was the youngest person?

In [None]:
bellevue_df['age'].min()

### ❓🧐  What are the most common professions among these Irish immigrants?

To count the values in a column, we can use the `.value_counts()` method. What patterns or inconsistencies do you notice in this list? What, if anything, seems strange to you?

In [None]:
bellevue_df['profession'].value_counts()

### 🛑   What stands out to you about this list? What kind of patterns do you notice?

*Jot down your thoughts here (double-click cell to type)*

### ❓🧐  What are the most common diseases?

In [None]:
bellevue_df['disease'].value_counts()

### 🛑   What stands out to you about this list? What kind of patterns do you notice?

*Jot down your thoughts here (double-click cell to type)*

### ❓🧐  Where were most people sent?

In [None]:
bellevue_df['sent_to'].value_counts()[:10]
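
The `[:10]` at the end works because `.value_counts()` sorts its results from most to least frequent, so slicing off the first ten rows gives the ten most common values. A minimal sketch with made-up destinations (the ward names below are hypothetical, and we take the top two instead of the top ten):

```python
import pandas as pd

# Made-up destinations for illustration only.
destinations = pd.Series(
    ["Ward A", "Ward B", "Ward A", "Ward C", "Ward A", "Ward B"]
)

counts = destinations.value_counts()  # sorted from most to least frequent
top_two = counts[:2]                  # slice off the first two rows

print(top_two.index.tolist())  # ['Ward A', 'Ward B']
print(top_two.tolist())        # [3, 2]

# .head(2) selects the same first two rows.
print(top_two.equals(counts.head(2)))  # True
```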

## Examine Subsets

### ❓🧐  Why were people being sent to Hospital Ward 38?

To explore this question, we can filter rows with a condition.

In [None]:
bellevue_df['sent_to'] == 'Hospital Ward 38'

In [None]:
bellevue_df[bellevue_df['sent_to'] == 'Hospital Ward 38']

In [None]:
bellevue_df[bellevue_df['sent_to'] == 'Hospital Ward 38']['disease'].value_counts()
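
The three cells above build on each other: the condition produces a Series of `True`/`False` values (one per row), placing that Series inside square brackets keeps only the `True` rows, and then we select a column from the filtered rows and count its values. The same steps can be sketched on a tiny, made-up table (the ward names and diseases below are invented for illustration):

```python
import pandas as pd

# A tiny, made-up DataFrame for illustration only.
toy_df = pd.DataFrame({
    "sent_to": ["Ward A", "Ward B", "Ward A"],
    "disease": ["fever", "sickness", "fever"],
})

# Step 1: the condition gives one True/False value per row.
mask = toy_df["sent_to"] == "Ward A"
print(mask.tolist())  # [True, False, True]

# Step 2: using the mask inside square brackets keeps only the True rows.
subset = toy_df[mask]
print(len(subset))  # 2

# Step 3: select a column from the filtered rows and count its values.
print(subset["disease"].value_counts().to_dict())  # {'fever': 2}
```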

## ❓🧐  What data is missing? What data do you wish we had?