# Introduction to Pandas



---

### Table of Contents


1 - [Pandas and DataFrames](#section1)<br>


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Importing Data](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Statistics](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Histograms and Standard Deviation](#subsection3)<br>



---
## 1. Pandas and DataFrames <a name='section1'></a>

First, let's import our libraries. Just as `np` is the usual alias for `numpy`, most people use `pd` for `pandas`:

In [None]:
import numpy as np
import pandas as pd

### 1.1 Importing Data  <a name='subsection1'></a>

We will use the function `read_csv()` in the **Pandas** library to import and read our data.
The _csv_ in the function name stands for "comma-separated values": a file where the values in each row are separated by commas. Other delimiters, such as tabs, semicolons, or pipes, are also common.
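
As a small illustration of delimiters, here is a sketch that reads semicolon-separated text with the `sep` argument of `read_csv()`. The text, the column names, and the variable `toy_data` are made up for this example and are not part of the survey data.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk.
raw_text = "name;hours_slept\nAda;7\nGrace;8\n"

# sep tells read_csv which character separates the values (the default is ',').
toy_data = pd.read_csv(io.StringIO(raw_text), sep=';')
print(toy_data)
```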

Let's revisit the data from our survey! We will read in `BLDAP2025DataSurvey.csv` as a **DataFrame** and store it in a variable called `bldap_data`. When we read in the data file, we must include the fact that the file is saved in the folder `data`, so the computer knows where to look for the data file! We add the folder name before the filename and add a slash (/) between.

In [None]:
# EXAMPLE
# Read in results from survey and store it in a variable called bldap_data.

bldap_data = pd.read_csv('data/BLDAP2025DataSurvey.csv')

Great! Now let's explore our data set.

We will begin by using the method (or function) `.head()`. By default, it shows the first 5 rows of our data set, but you can tell it to display the first `n` rows by passing `n` as an argument to `.head()`.

In [None]:
# EXAMPLE

bldap_data.head()

You can also see the last `n` rows of our data using the method `.tail()`.

In [None]:
# EXAMPLE

bldap_data.tail()
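
To see how the argument `n` works without relying on the survey file, here is a minimal sketch on a made-up DataFrame (`toy_data` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the survey file is not needed for this sketch.
toy_data = pd.DataFrame({
    'student': list('ABCDEFG'),
    'hours_slept': [7, 8, 6, 9, 5, 7, 8],
})

first_three = toy_data.head(3)  # rows with index 0, 1, 2
last_two = toy_data.tail(2)     # rows with index 5, 6
print(first_three)
print(last_two)
```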

DataFrames contain rows and columns. You can think of them as Google or Excel sheets. If you want to understand the structure of your DataFrame, there are a few functions and attributes that might come in handy.

These include
* `shape`: outputs number of rows and number of columns
* `columns`: outputs names of columns
* `index`: outputs the row labels; for a default index these are shown as a range in the form `(start, stop, step)`
* `info()`: outputs information about each column. It is very useful for finding the position of each column and for checking the data type of each column (sometimes numbers are stored as strings, which prevents your calculations from running properly). It also shows the number of non-null (non-missing) values in each column.
* `len()`: just like with other data structures, we can use `len()` with DataFrames.
* `describe()`: outputs basic statistics for each numeric column, such as count, mean, standard deviation, minimum, quartiles (including the median), and maximum.
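
As a quick sketch of what these return, here they are on a tiny made-up DataFrame (`toy_data` is hypothetical, so the numbers differ from the survey data):

```python
import pandas as pd

# Hypothetical toy data with 3 rows and 2 columns.
toy_data = pd.DataFrame({'student': ['A', 'B', 'C'], 'hours_slept': [7, 8, 6]})

print(toy_data.shape)          # (3, 2)
print(list(toy_data.columns))  # ['student', 'hours_slept']
print(toy_data.index)          # RangeIndex(start=0, stop=3, step=1)
print(len(toy_data))           # 3
```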

In [None]:
# EXAMPLE

bldap_data.shape

The first number is how many rows the DataFrame has and the second number is how many columns it has.

In [None]:
# EXAMPLE

bldap_data.columns

In [None]:
# EXAMPLE

bldap_data.index

In [None]:
# EXAMPLE

bldap_data.info()
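
The sketch below illustrates the case where numbers are stored as strings. The column and its values are made up, and `pd.to_numeric` is one possible way to convert them, not necessarily what the survey data needs.

```python
import pandas as pd

# Hypothetical column where numbers were stored as text, with one missing value.
toy_data = pd.DataFrame({
    'hours_slept': pd.Series(['7', '8', None, '6'], dtype=object),
})

toy_data.info()  # reports dtype "object" and 3 non-null values out of 4 entries

# pd.to_numeric converts the strings to numbers; the missing value becomes NaN.
toy_data['hours_slept'] = pd.to_numeric(toy_data['hours_slept'])
print(toy_data['hours_slept'].dtype)   # float64
print(toy_data['hours_slept'].mean())  # 7.0 (NaN is skipped)
```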

As with lists and arrays, you can also use the function `len()` to see how many entries (in this case rows) our data set contains.

In [None]:
# EXAMPLE

len(bldap_data)

### 1.2 Statistics  <a name='subsection2'></a>

A useful method is `.describe()`. It provides some basic statistics about each of the variables in your DataFrame, including measures of central tendency, dispersion, and the shape of the distribution, excluding **NaN** values. (NaN, or "Not a Number", values represent missing data. In a later lesson, we'll explore how to deal with missing data.)

By default, it returns the summary statistics of the numeric columns only. If you use the argument `include='all'`, it also works with mixed data. For a column that is not numeric, it returns measures such as the count, the number of unique values, and the most frequent value (`top` is the most frequent value, while `freq` is how many times it appears).
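
Here is a minimal sketch of the difference on a made-up DataFrame with one text column and one numeric column (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column.
toy_data = pd.DataFrame({
    'favorite_color': ['blue', 'green', 'blue', 'red'],
    'hours_slept': [7, 8, 6, 9],
})

numeric_summary = toy_data.describe()            # numeric columns only
full_summary = toy_data.describe(include='all')  # every column

print(list(numeric_summary.columns))               # ['hours_slept']
print(full_summary.loc['top', 'favorite_color'])   # blue
print(full_summary.loc['freq', 'favorite_color'])  # 2
print(full_summary.loc['mean', 'hours_slept'])     # 7.5
```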

*Extra note:* What if there are multiple top values with the same frequency (as is the case with our 'Birth Month and Day' column)? It turns out that pandas arbitrarily chooses a top value. If you restart the kernel or restart the session of the whole notebook, you might get another value for top.

In [None]:
# EXAMPLE

bldap_data.describe(include='all')

Pandas also has methods to calculate the mean, median, and mode separately. However, you do need to include the parameter `numeric_only=True` to ensure that it will only try to compute the mean/median/mode for columns that contain numerical data.

In the cells below, see if you can find the mean, median, and mode using `.mean()`, `.median()`, `.mode()`. The first one has been done for you.

In [None]:
#Find the mean

bldap_data.mean(numeric_only=True)

In [None]:
#Find the median

bldap_data....

In [None]:
#Find the mode

bldap_data....

When finding the mode, you can see that it returns a dataframe! If a column of data has more than one mode, you will see them all.
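
A small sketch on made-up data shows why the result is a DataFrame: one column below has two modes and the other has one, so the shorter column is padded with NaN. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy_data = pd.DataFrame({
    'hours_slept': [6, 7, 7, 8, 8],  # two modes: 7 and 8
    'cups_of_tea': [0, 1, 1, 1, 2],  # one mode: 1
})

modes = toy_data.mode(numeric_only=True)
print(modes)  # two rows; the second row of 'cups_of_tea' is NaN
```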

### 1.3 Histograms and Standard Deviation  <a name='subsection3'></a>

As you may imagine, pandas has a method that easily returns the standard deviation:

In [None]:
#Find the standard deviation

bldap_data.std(numeric_only=True)
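
One detail worth knowing: by default, pandas computes the sample standard deviation (dividing by `n - 1`, i.e. `ddof=1`), while NumPy's `np.std` divides by `n` by default. The sketch below compares the two on made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical hours-slept values.
hours = pd.Series([6, 7, 8, 9])

sample_std = hours.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = hours.std(ddof=0)  # divide by n, matching np.std's default

print(sample_std)
print(population_std)
print(np.isclose(population_std, np.std(hours.to_numpy())))  # True
```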

Aside from looking at various statistics for our dataset, a histogram is one of the quickest ways to inspect your data and get a feel for what you are dealing with. You can create a histogram by calling `dataframe.hist()`. By default, pandas splits the data into 10 equal-width bins.

The example below just looks at the column of 'Hours Slept'. (We will learn how to grab columns later!)

_Quick note: Put a semicolon ; at the end of the last line in a Jupyter notebook cell to stop the notebook from printing the return value of that line. You can try removing the semicolon in the cell below and see how the output changes._

In [None]:
bldap_data_hours = bldap_data['Hours Slept']
bldap_data_hours.hist();

Question: What does the x-axis of the histogram represent? And the y-axis?

You can change the number of bins by using the argument `bins=` within the parentheses of `hist()`. Below, try different numbers of bins and see how your histogram changes.
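
To see what a change in `bins` does without drawing anything, the sketch below uses `np.histogram` on made-up values. This assumes the plot uses the same equal-width binning as `np.histogram`, which is our understanding of how the underlying Matplotlib histogram works.

```python
import numpy as np

# Hypothetical hours-slept values.
hours = np.array([5, 6, 6, 7, 7, 7, 8, 8, 9])

# Each call returns the count in each bin and the edges of the bins.
counts_2, edges_2 = np.histogram(hours, bins=2)
counts_4, edges_4 = np.histogram(hours, bins=4)

print(edges_2, counts_2)  # edges 5, 7, 9 -> counts 3 and 6
print(edges_4, counts_4)  # edges 5, 6, 7, 8, 9 -> counts 1, 2, 3, 3
```

The same nine values look quite different depending on how many bins they are grouped into.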

In [None]:
bldap_data_hours.hist(bins=...);

---
Notebook developed by: Kseniya Usovich, Baishakhi Bose, Alisa Bettale, Arianna Formenti