This notebook will walk through the basics of using [the Python library pandas](https://pandas.pydata.org/) to read, explore, manipulate, and save tabular data. Today, we'll be working with [a dataset of gene expression levels across different cells](https://drive.google.com/file/d/1dcqjPNvCaAGbrbgyJsszw_8HUwDJz6pu/view?usp=sharing).

First, let's import the necessary libraries. We'll be using `pandas` to handle our data.

In [None]:
import pandas as pd

# Loading data

Next, we'll load our dataset. The data consists of measurements of the expression level of 20 different genes within each of the 20 barcoded cells under study.

**Note:** Make sure to change the file path referenced below to reflect where the dataset is saved on your computer!

In [None]:
# Load the dataset
df = pd.read_csv('drive/MyDrive/Vanderbilt/Teaching/IGP 8001-06/gene_expression.csv')

# Show the first few rows to get a sense of what the data looks like
df.head()

# Exploring data

After inspecting the first few rows of the dataset to make sure that everything looks right, let's also take a look at some basic statistics and check for any missing values.

In [None]:
# Get an overview of the dataset structure
df.info()

In [None]:
# Get basic statistics about the dataset
df.describe()

In [None]:
# Check for any missing values
df.isnull().sum()
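
As a small illustration of what these exploration calls report, the sketch below builds a made-up three-cell table with one missing measurement. The barcodes, gene names, and values are invented for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not from the real dataset): three cells, two genes,
# with one missing measurement for GENE1 in cell BC2
toy = pd.DataFrame(
    {"GENE1": [12.0, np.nan, 48.5], "GENE2": [7.1, 33.0, 90.2]},
    index=["BC1", "BC2", "BC3"],
)

# (number of rows, number of columns)
print(toy.shape)

# Number of missing values in each column
missing_per_gene = toy.isnull().sum()
print(missing_per_gene)
```

A nonzero count in the last output flags a column that may need cleaning before analysis.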

Now, let's sort the data by the expression level of a particular gene. For example, we'll sort the cells by the expression of `GENE1` in descending order to see which cells have the highest expression for this gene.

In [None]:
# Sort the data by GENE1 expression
df_sorted = df.sort_values(by='GENE1', ascending=False)

# Display the top results
df_sorted.head()
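
One detail worth seeing in isolation is that `sort_values` returns a new DataFrame and leaves the original untouched. The following sketch uses invented toy values (not the real dataset) to show this.

```python
import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [12, 85, 48], "GENE2": [7, 33, 90]},
    index=["BC1", "BC2", "BC3"],
)

# Sort by GENE1, highest expression first; the result is a new DataFrame
toy_sorted = toy.sort_values(by="GENE1", ascending=False)

print(toy_sorted.index.tolist())  # order after sorting
print(toy.index.tolist())         # original order is unchanged
```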

# Accessing specific cells and ranges

Now that we've worked with basic data manipulation, let's explore two important pandas methods for accessing data in a DataFrame: `loc` and `iloc`.

We'll start with `loc`, which is used to access rows and columns by their **labels** (such as row names or column names). `loc` is very flexible, allowing us to retrieve data using specific labels or ranges of labels.

In [None]:
# Let's revisit our DataFrame and look at it again.
df.head()

Let's use `loc` to select specific rows and columns by their labels. For example, we can try to retrieve the value of `GENE2` for the second cell, `BC2`.

In [None]:
# Selecting a specific row (the second) and column ('GENE2') using loc
df.loc['BC2', 'GENE2']

Why didn't that work the way we expected it to? Right now, data from the first column isn't actually being used to label the rows. Here's how we could get the expected result from our current DataFrame.

In [None]:
# Selecting a specific row (the second) and column ('GENE2') using loc
df.loc[1, 'GENE2']
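
The behavior described above can be reproduced on a tiny table. This is only a sketch: the column name `barcode` and all values are assumptions made for the example, since the real file's header is not shown here.

```python
import pandas as pd

# Made-up toy data; "barcode" is a hypothetical column name
toy = pd.DataFrame(
    {"barcode": ["BC1", "BC2", "BC3"], "GENE2": [7, 33, 90]}
)

# Without an explicit index, the row labels are the integers 0, 1, 2
print(toy.index.tolist())

# Using a barcode as a row label raises a KeyError,
# because "BC2" is a value in a column, not a row label
try:
    toy.loc["BC2", "GENE2"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# The integer row label does work
value = toy.loc[1, "GENE2"]
print(value)
```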

We can also load our dataset in a way that treats the barcodes as row labels (index), making it easier to access data for specific cells by their barcode.

In [None]:
# Load the dataset using barcodes as row labels (index)
df = pd.read_csv('drive/MyDrive/Vanderbilt/Teaching/IGP 8001-06/gene_expression.csv', index_col=0)

# Show the first few rows with barcodes as the index
df.head()
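
If the data has already been loaded, re-reading the file is not the only option. As a sketch, `set_index` can promote an existing column to the index. The CSV text and the column name `barcode` below are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "barcode,GENE1,GENE2\nBC1,12,7\nBC2,85,33\n"

# Option 1: choose the index column while reading
from_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read first, then turn the column into the index
from_set = pd.read_csv(io.StringIO(csv_text)).set_index("barcode")

# Check whether both routes give the same table
same_table = from_read.equals(from_set)
print(same_table)
```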

Now, let's try using `loc` again.

In [None]:
# Selecting a specific row (the second) and column ('GENE2') using loc
df.loc['BC2', 'GENE2']

You can also use `loc` to select multiple rows and columns by specifying a range of labels. For instance, we can retrieve the data for cells 'BC11' through 'BC15' and for genes 'GENE6' through 'GENE12'.

In [None]:
# Selecting a range of rows and columns using loc
df.loc['BC11':'BC15', 'GENE6':'GENE12']
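
Two properties of label ranges are easy to miss: both endpoints are included, and the range follows the order of the rows in the table rather than alphabetical order. A minimal sketch with invented toy data:

```python
import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [1, 2, 3, 4], "GENE2": [5, 6, 7, 8]},
    index=["BC9", "BC10", "BC11", "BC12"],
)

# Rows from BC9 through BC11 (inclusive), in the order they appear in the table
subset = toy.loc["BC9":"BC11", "GENE1":"GENE2"]

print(subset.index.tolist())
print(subset.columns.tolist())
```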

Next, let's discuss `iloc`, which is used for position-based indexing. Unlike `loc`, which relies on labels, `iloc` uses **integer positions** for rows and columns.

This can be useful when you don't know the specific labels, or if you're working with positional data in the DataFrame. The first row is indexed as `0`, the second row as `1`, and so on. Similarly, the first column is indexed as `0`, the second as `1`, etc.

In [None]:
# Selecting the second row (index 1) and second column (index 1) using iloc
df.iloc[1, 1]

Just like with `loc`, you can use ranges with `iloc`. The difference is that a `loc` range includes its end label, whereas an `iloc` range is **exclusive** of its end position, meaning the row or column at the end position is not included. For example, let's select the first two rows and first two columns using `iloc`.

In [None]:
# Selecting a range of rows and columns using iloc (first two rows and columns)
df.iloc[0:2, 0:2]
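
To make the exclusive end position concrete, here is a sketch on an invented 3×3 table. It also shows that negative positions count from the end, as with Python lists.

```python
import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [1, 2, 3], "GENE2": [4, 5, 6], "GENE3": [7, 8, 9]},
    index=["BC1", "BC2", "BC3"],
)

# 0:2 covers positions 0 and 1 only, so this is a 2 x 2 block
first_two = toy.iloc[0:2, 0:2]
print(first_two.shape)

# Negative positions count from the end: -1 is the last row
last_row = toy.iloc[-1]
print(last_row.tolist())
```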

Let's now compare how `loc` and `iloc` work side by side.

Here, we'll use `loc` to select data based on labels, and `iloc` to select data based on positions. Even though the methods differ in how they access data, they can return the same results.

In [None]:
# Using loc to select rows 'BC1' through 'BC2' and columns 'GENE1' through 'GENE2'
df_loc = df.loc['BC1':'BC2', 'GENE1':'GENE2']

# Using iloc to select the same data using positions (0 and 1 for rows, 0 and 1 for columns)
df_iloc = df.iloc[0:2, 0:2]

# Display both results
df_loc, df_iloc

As you can see, both methods return the same result, but the way we specify the rows and columns differs. Depending on your needs, you might prefer one method over the other.

Use `loc` when you want to access data by **labels** (row names or column names). For example, you know the name of a specific gene or cell and want to retrieve its data.

Use `iloc` when you want to access data by **integer positions**. This is useful when the exact labels are unknown, or you are working with positional data (like retrieving the first or last few rows).

# Filtering data

Let's check which cells express a specific gene, say `GENE8`, above a given threshold. We'll filter the data to show only those cells where the expression of `GENE8` is greater than 50.

In [None]:
# Define the threshold
gene_threshold = 50

# Find cells where GENE8 expression is above the threshold
cells_above_threshold = df[df['GENE8'] > gene_threshold]

# Show the result
cells_above_threshold

Next, we want to find cells that have at least 5 genes expressed above a certain threshold. This is a useful way to identify highly active cells in our dataset.

In [None]:
# Define the threshold
threshold = 60

# Find cells with at least 5 genes above the threshold
genes_above_threshold = (df > threshold).sum(axis=1)
cells_with_at_least_5 = df[genes_above_threshold >= 5]

# Show the result
cells_with_at_least_5


Let's break down how this works in a little more detail. First, we define a threshold that will be used to identify cells where gene expression levels are above this value.

In [None]:
threshold = 60

Next, we create a **boolean DataFrame** where each value is `True` if it is greater than the threshold and `False` otherwise. The resulting DataFrame will have the same shape as `df`, but instead of gene expression values, it will have values of either `True` or `False` depending on whether the corresponding value in the original DataFrame is above the threshold.

In [None]:
genes_above_threshold = (df > threshold)

genes_above_threshold

The `.sum(axis=1)` method adds up the values within each row, that is, across all of the columns (genes) for each cell. Since the boolean `True` is treated as `1` and `False` as `0`, this sum effectively counts how many genes in each cell have expression levels above the threshold.

The result of this line is a **Series** where the index corresponds to the cells, and the values are the number of genes that have expression levels above the threshold.

In [None]:
genes_above_threshold = (df > threshold).sum(axis=1)

genes_above_threshold
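
The counting trick can be checked by hand on a small table. In this sketch the values are invented, and `astype(int)` is used only to make the `True`/`False` to `1`/`0` correspondence visible.

```python
import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [80, 10, 95], "GENE2": [75, 20, 30], "GENE3": [90, 65, 15]},
    index=["BC1", "BC2", "BC3"],
)

# Boolean table: True where expression is above 70
above = toy > 70

# The same table written as 1s and 0s
print(above.astype(int))

# Summing along each row counts the True values per cell
per_cell = above.sum(axis=1)
print(per_cell.to_dict())
```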

A **Series** in pandas is a one-dimensional array-like object that holds data along with labels, known as the index. You can think of a Series as a single column of data, where each element is associated with a corresponding index value (similar to row labels in a DataFrame). A Series can store various data types such as integers, floats, strings, or even other objects.

In [None]:
# Example Series
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
s

A **DataFrame**, on the other hand, is a two-dimensional table (like a spreadsheet) that consists of **multiple Series**. Each column in a DataFrame is a Series, and the DataFrame lets you organize and manipulate those Series together. A DataFrame has both row and column labels, so its data is arranged along two dimensions rather than one.

In [None]:
# Example DataFrame
data = {'Column1': [10, 20, 30], 'Column2': [40, 50, 60]}
manual_df = pd.DataFrame(data)
manual_df

Note that these examples also illustrate how to create Series and DataFrames from dictionaries and/or lists, which can be useful!
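
To see the relationship between the two structures directly, the sketch below pulls a single column out of a small DataFrame and confirms that it is a Series sharing the DataFrame's row labels. The labels `A`, `B`, `C` are arbitrary choices for the example.

```python
import pandas as pd

# Small example table with explicit row labels
table = pd.DataFrame(
    {"Column1": [10, 20, 30], "Column2": [40, 50, 60]},
    index=["A", "B", "C"],
)

# Selecting one column gives back a Series
column = table["Column1"]
print(type(column).__name__)

# The Series keeps the DataFrame's row labels as its index
print(column.index.tolist())
print(column["B"])
```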

Returning to our gene expression question, there's one final step to build the filtered DataFrame that you saw above. `genes_above_threshold >= 5` creates a boolean condition that checks whether each cell (from the `genes_above_threshold` Series) has at least 5 genes with expression above the threshold. This returns a boolean Series where each cell that meets the condition is `True`.

In [None]:
genes_above_threshold >= 5

Using this expression as a row reference filters the original DataFrame `df` to include only the rows (cells) where the condition is `True`, i.e., where the number of genes expressed above the threshold is 5 or more.

In [None]:
cells_with_at_least_5 = df[genes_above_threshold >= 5]

cells_with_at_least_5
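
The same masking pattern works with any boolean Series that shares the DataFrame's row labels. The following sketch uses invented values and also shows one way to combine two conditions; note that each condition needs its own parentheses when using `&`.

```python
import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [80, 10, 95], "GENE2": [75, 20, 30]},
    index=["BC1", "BC2", "BC3"],
)

# Boolean Series with one True/False per cell
mask = toy["GENE1"] > 50
print(mask.tolist())

# Square brackets and .loc both keep only the rows where mask is True
kept = toy[mask]
kept_loc = toy.loc[mask]
print(kept.index.tolist())

# Combine conditions with & (and) or | (or)
both = toy[(toy["GENE1"] > 50) & (toy["GENE2"] > 50)]
print(both.index.tolist())
```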

We used a similar, albeit less complex, approach in the first example above.

In [None]:
# Define the threshold
gene_threshold = 50

# Find cells where GENE8 expression is above the threshold
cells_above_threshold = df[df['GENE8'] > gene_threshold]

# Show the result
cells_above_threshold

Why do we need to include `df` *inside* our filtering expression here? What would happen if we left it out?

# Answering a more complex question

When multiple genes are highly expressed, are they always the same ones? Let's investigate by first checking which genes are commonly expressed above a given threshold across all cells.

In [None]:
# Define the threshold
threshold = 60

# Count how many times each gene is expressed above the threshold
highly_expressed_genes = (df > threshold).sum()

# Show the result
highly_expressed_genes
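
Calling `.sum()` without an argument adds up each column, so here it counts cells per gene, whereas `.sum(axis=1)` counts genes per cell. A sketch contrasting the two on invented toy data:

```python
import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [80, 10, 95], "GENE2": [75, 20, 30], "GENE3": [90, 65, 15]},
    index=["BC1", "BC2", "BC3"],
)

above = toy > 60

# Default (axis=0): one count per gene, i.e. how many cells exceed the threshold
per_gene = above.sum()
print(per_gene.to_dict())

# axis=1: one count per cell, i.e. how many genes exceed the threshold
per_cell = above.sum(axis=1)
print(per_cell.to_dict())
```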


Next, we can use the same code as we did in the more complex filtering example above to see which cells have an unusually high number of highly expressed genes.

In [None]:
# Define the threshold
threshold = 60

# Find cells with at least 5 genes above the threshold
genes_above_threshold = (df > threshold).sum(axis=1)
cells_with_at_least_5 = df[genes_above_threshold >= 5]

# Show the result
cells_with_at_least_5

We might be able to draw some conclusions as to the answer to our question by visually inspecting this DataFrame. But what if we wanted to perform additional analyses or data visualization? Is there a way that we could transform this DataFrame into a different Python data structure?

First, let's create a DataFrame where each entry is `True` if the corresponding gene expression level is greater than the threshold, and `False` otherwise.

In [None]:
# Define the threshold
threshold = 50

# Create a boolean DataFrame where True means the gene is expressed above the threshold
genes_above_threshold_df = df > threshold

# Display the boolean DataFrame
genes_above_threshold_df.head()

Next, let's visualize which **genes** are expressed above the threshold for each cell in a slightly different way.

In [None]:
# For each cell, get the names of the genes that are expressed above the threshold
for index, row in genes_above_threshold_df.iterrows():
    highly_expressed_genes = row[row].index.tolist()  # Names of genes whose value in this row is True
    print(f"{index}: {highly_expressed_genes}")

This code loops over each cell (row) in the boolean DataFrame and collects the names of genes where the expression is above the threshold. It then prints each cell's identifier along with its corresponding list of highly expressed genes.

Finally, we could store the results (which genes are expressed above the threshold for each cell) in a dictionary for further analysis or visualization.

In [None]:
# Create a dictionary to store the results
cell_gene_dict = {}

# Populate the dictionary with cell names as keys and lists of highly expressed genes as values
for index, row in genes_above_threshold_df.iterrows():
    cell_gene_dict[index] = row[row].index.tolist()

# Display the dictionary
cell_gene_dict

Instead of printing the results, this code builds a dictionary where each key is a cell name and its value is the list of gene names expressed above the threshold. Finally, it displays the dictionary with all the cells and their associated highly expressed genes.
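
The same dictionary can also be written more compactly as a dictionary comprehension. This is one possible alternative rather than a required change, shown here on a made-up boolean table. Once the results are in a dictionary, ordinary Python tools apply, such as counting the genes per cell with `len`.

```python
import pandas as pd

# Made-up boolean table (not from the real dataset)
flags = pd.DataFrame(
    {"GENE1": [True, False], "GENE2": [True, True], "GENE3": [False, False]},
    index=["BC1", "BC2"],
)

# One entry per cell: barcode -> list of genes flagged True in that row
cell_genes = {
    barcode: row[row].index.tolist()
    for barcode, row in flags.iterrows()
}
print(cell_genes)

# Follow-up analysis with plain Python: number of flagged genes per cell
genes_per_cell = {barcode: len(genes) for barcode, genes in cell_genes.items()}
print(genes_per_cell)
```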

# Saving data

Finally, let's save a DataFrame to a CSV file using `to_csv`. Note that in this case we set the `index` parameter to `True` since the row labels contain meaningful information (the cell barcodes).

In [None]:
# Save the filtered DataFrame to a new CSV file
cells_with_at_least_5.to_csv('filtered_data.csv', index=True)
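
A quick way to confirm that a file was written as intended is to read it back in. The sketch below does this with an invented toy table and a temporary directory, so it does not touch any real files. The file name `toy_expression.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up toy data (not from the real dataset)
toy = pd.DataFrame(
    {"GENE1": [12, 85], "GENE2": [7, 33]},
    index=["BC1", "BC2"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory
    path = os.path.join(tmp_dir, "toy_expression.csv")

    # Write the table, keeping the barcodes as the first column
    toy.to_csv(path, index=True)

    # Read it back, using that first column as the index again
    reloaded = pd.read_csv(path, index_col=0)

# Check whether the round trip preserved the table
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

If `index=False` had been used when saving, the barcodes would not be written to the file at all.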

# Conclusion

Congratulations on completing this exercise! Throughout this notebook, you've gained hands-on experience with key pandas operations, including loading data, exploring and manipulating DataFrames, and filtering data using conditions. You've also learned how to access data by labels with `loc` and by positions with `iloc`, and seen how to save your processed DataFrame as a CSV file.

These skills are foundational for working with data in Python, and they will serve as a solid starting point for more advanced data analysis tasks. Keep practicing and exploring pandas' powerful functionality—there's much more to discover!