# Data Visualization with Pandas, Matplotlib, and Seaborn in Biology

Welcome to this workshop! In this session, we'll explore how to use **Pandas** for data manipulation and **Matplotlib**/**Seaborn** for plotting biological data. We'll cover:

- Loading and exploring biological datasets
- Data manipulation with Pandas
- Creating visualizations with Matplotlib
- Enhancing plots with Seaborn



## Table of Contents

1. [Loading Biological Data](#loading_data)
2. [Data Exploration with Pandas](#data_exploration)
3. [Visualizations with Matplotlib](#matplotlib_visualizations)
4. [Advanced Visualizations with Seaborn](#seaborn_visualizations)
5. [Case Study: Gene Expression Data](#case_study)
6. [Conclusion](#conclusion)

---

<a id='loading_data'></a>
## 1. Loading Biological Data

We'll begin by loading some example biological data into Pandas DataFrames.

### Dataset: Iris Flower Dataset

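- A classic dataset, introduced by R. A. Fisher in 1936, that is widely used to demonstrate classification.
- Contains 150 samples with four measurements in centimeters (sepal length, sepal width, petal length, petal width) for three Iris species: *setosa*, *versicolor*, and *virginica*.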

In [None]:
# Import necessary libraries
import pandas as pd

# Load the Iris dataset from a URL
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
iris = pd.read_csv(url)

# Display the first few rows
iris.head()
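
Loading from a URL needs network access, and the later cells assume specific column names. The sketch below shows one way to check a table's layout before plotting. It reads a few illustrative rows from an in-memory string rather than the URL, and the names `csv_text`, `sample`, and `expected_columns` are introduced here only for the example.

```python
import io

import pandas as pd

# Illustrative rows in the same column layout as iris.csv (not the full dataset)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts any file-like object, not only a path or URL
sample = pd.read_csv(io.StringIO(csv_text))

# Columns that the later plotting cells rely on
expected_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
missing_columns = [col for col in expected_columns if col not in sample.columns]

# Total number of missing cells across the whole table
n_missing_values = int(sample.isna().sum().sum())

print(sample.dtypes)
print("Missing columns:", missing_columns)
print("Missing values:", n_missing_values)
```

The same checks can be run on `iris` itself once it has loaded.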

<a id='data_exploration'></a>
## 2. Data Exploration with Pandas

Let's explore the dataset using Pandas.

In [None]:
# Get basic information about the dataset
iris.info()

In [None]:
# Get statistical summary
iris.describe()
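
`describe()` pools all three species, which can hide differences between them. A per-species summary is a common follow-up. The sketch below uses a small hypothetical table (`measurements`) with illustrative values so that it runs on its own; with the real data, `iris` would take its place.

```python
import pandas as pd

# Hypothetical mini-table with the same column names as the Iris data
measurements = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

# Mean and standard deviation of each measurement, computed separately per species
per_species = measurements.groupby("species").agg(["mean", "std"])

print(per_species.round(2))
```

The result has one row per species and a two-level column index (measurement, statistic).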

### Distribution of Species

```python
iris['species'].value_counts()
```
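
Raw counts are harder to compare across datasets of different sizes than proportions are. A minimal sketch, assuming a hypothetical `labels` Series that stands in for `iris['species']`:

```python
import pandas as pd

# Hypothetical species labels; the real column has many more rows
labels = pd.Series(["setosa", "setosa", "versicolor", "virginica"], name="species")

# normalize=True returns fractions that sum to 1 instead of raw counts
proportions = labels.value_counts(normalize=True)

print(proportions)
```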

<a id='matplotlib_visualizations'></a>
## 3. Visualizations with Matplotlib

We'll create basic plots using Matplotlib.

In [None]:
# Import Matplotlib and render figures inline in the notebook
import matplotlib.pyplot as plt
%matplotlib inline

### Histogram of Sepal Length

```python
plt.hist(iris['sepal_length'], bins=15, color='green')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()
```
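
A single histogram mixes all species together. One possible refinement is to overlay one histogram per group on shared bin edges, so that bar positions are directly comparable. The sketch below uses simulated values for two hypothetical groups (`demo`, groups `A` and `B`), not the real Iris measurements, and Matplotlib's object-oriented interface (`fig, ax`).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated sepal lengths for two hypothetical groups (not the real Iris values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "species": ["A"] * 50 + ["B"] * 50,
    "sepal_length": np.concatenate([rng.normal(5.0, 0.4, 50), rng.normal(6.5, 0.5, 50)]),
})

# Compute one set of bin edges from all values so the groups are comparable
bin_edges = np.histogram_bin_edges(demo["sepal_length"], bins=15)

fig, ax = plt.subplots()
for name, group in demo.groupby("species"):
    # alpha < 1 keeps overlapping bars visible
    ax.hist(group["sepal_length"], bins=bin_edges, alpha=0.6, label=name)

ax.set_xlabel("Sepal Length (cm)")
ax.set_ylabel("Frequency")
ax.set_title("Sepal Length by Group")
ax.legend(title="Group")
plt.show()
```

With the real data, `demo` would be replaced by `iris`, which gives three overlaid histograms.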

### Scatter Plot of Sepal Length vs. Petal Length

```python
plt.scatter(iris['sepal_length'], iris['petal_length'], color='blue')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.title('Sepal Length vs. Petal Length')
plt.show()
```
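
Drawing every point in one color hides which species each point belongs to. A minimal sketch of one way to color by group in plain Matplotlib: call `scatter` once per group and let the default color cycle assign colors. The table `points` and its values are hypothetical stand-ins for `iris`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements standing in for the Iris table
points = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

fig, ax = plt.subplots()
for name, group in points.groupby("species"):
    # Each call picks the next color in Matplotlib's default cycle
    ax.scatter(group["sepal_length"], group["petal_length"], label=name)

ax.set_xlabel("Sepal Length (cm)")
ax.set_ylabel("Petal Length (cm)")
ax.set_title("Sepal Length vs. Petal Length by Species")
ax.legend(title="Species")
plt.show()
```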

<a id='seaborn_visualizations'></a>
## 4. Advanced Visualizations with Seaborn

Seaborn builds on Matplotlib and provides a high-level interface for statistical graphics.

In [None]:
# Import Seaborn
import seaborn as sns
sns.set_theme(style='whitegrid')

### Boxplot of Sepal Length by Species

```python
sns.boxplot(x='species', y='sepal_length', data=iris)
plt.title('Sepal Length by Species')
plt.show()
```
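
To read a boxplot it helps to know what it draws. The box spans the first to third quartile, the line inside is the median, and with Seaborn's default setting (`whis=1.5`) the whiskers reach the most extreme points within 1.5 times the interquartile range (IQR) of the box. Points beyond that are drawn individually. The sketch below computes those quantities by hand for a hypothetical Series, `values`.

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles define the box; the median is the line inside it
q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```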

### Pairplot of All Variables

```python
sns.pairplot(iris, hue='species')
plt.show()
```
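
A pair plot shows every pairwise relationship visually. A correlation matrix is a compact numeric companion to it. The sketch below uses a hypothetical table (`traits`) whose columns are constructed to be perfectly correlated or anti-correlated, so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

x = np.arange(10, dtype=float)

# Hypothetical traits with known relationships to trait_a
traits = pd.DataFrame({
    "trait_a": x,
    "trait_b": 2 * x + 1,   # increases linearly with trait_a
    "trait_c": -x,          # decreases linearly with trait_a
    "species": ["s1"] * 5 + ["s2"] * 5,
})

# Pearson correlation between the numeric columns only
corr = traits.select_dtypes("number").corr()

print(corr)
```

With the real data, `traits` would be replaced by `iris`. The non-numeric `species` column is dropped by `select_dtypes`.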

### Violin Plot of Petal Length by Species

```python
sns.violinplot(x='species', y='petal_length', data=iris, inner='quartile')
plt.title('Petal Length Distribution by Species')
plt.show()
```

<a id='case_study'></a>
## 5. Case Study: Gene Expression Data

We'll apply these techniques to a gene expression dataset.

### Dataset: Differential Gene Expression

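- Contains one expression value per gene for each of two conditions (Control and Treatment).
- The data are simulated for demonstration: 100 genes, with every value drawn from a normal distribution with mean 5 and standard deviation 1, so no real treatment effect is built in.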

In [None]:
# Generate sample gene expression data
import numpy as np

np.random.seed(0)
genes = [f'Gene_{i}' for i in range(1, 101)]
conditions = ['Control', 'Treatment']

data = {
    'gene': np.repeat(genes, len(conditions)),
    'condition': conditions * len(genes),
    'expression': np.random.randn(len(genes) * len(conditions)) + 5
}

gene_expression = pd.DataFrame(data)

# Display the first few rows
gene_expression.head()
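
The table above is in long format, with one row per gene and condition, which suits Seaborn's `x`/`y`/`hue` arguments. For per-gene comparisons, a wide layout with one column per condition is often more convenient. The sketch below regenerates the same simulated data so that it runs on its own. The names `wide` and `difference` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Recreate the simulated long-format table from the cell above
np.random.seed(0)
genes = [f"Gene_{i}" for i in range(1, 101)]
conditions = ["Control", "Treatment"]
gene_expression = pd.DataFrame({
    "gene": np.repeat(genes, len(conditions)),
    "condition": conditions * len(genes),
    "expression": np.random.randn(len(genes) * len(conditions)) + 5,
})

# One row per gene, one column per condition
wide = gene_expression.pivot(index="gene", columns="condition", values="expression")

# Per-gene difference between the two conditions
wide["difference"] = wide["Treatment"] - wide["Control"]

print(wide.head())
```

Because the values are random draws from the same distribution, the differences here scatter around zero.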

### Visualize Expression Levels

```python
# Boxplot of Expression Levels
sns.boxplot(x='condition', y='expression', data=gene_expression)
plt.title('Gene Expression Levels')
plt.show()
```

```python
# Violin Plot
sns.violinplot(x='condition', y='expression', data=gene_expression, inner='quartile')
plt.title('Gene Expression Distribution')
plt.show()
```

```python
# Swarm Plot for a Subset of Genes
subset = gene_expression[gene_expression['gene'].isin(['Gene_1', 'Gene_2', 'Gene_3'])]
sns.swarmplot(x='gene', y='expression', hue='condition', data=subset)
plt.title('Expression Levels of Selected Genes')
plt.show()
```

<a id='conclusion'></a>
## 6. Conclusion

We've:

- Learned how to load and explore biological datasets using Pandas.
- Created basic visualizations using Matplotlib.
- Enhanced our plots with Seaborn for statistical graphics.
- Applied these techniques to a gene expression dataset.

**Next Steps:**

- Practice with your own biological datasets.
- Explore more advanced plotting techniques in Seaborn and Matplotlib.
- Combine multiple plots to create comprehensive figures for publications.
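
As a starting point for combining plots, the sketch below places two Seaborn plots side by side in one Matplotlib figure. It relies on the `ax=` argument that Seaborn's axes-level functions such as `boxplot` and `violinplot` accept. The data (`demo`) are simulated, and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated expression values for two hypothetical conditions
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "condition": ["Control"] * 40 + ["Treatment"] * 40,
    "expression": rng.normal(5.0, 1.0, 80),
})

# One figure with two side-by-side panels sharing the y-axis
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Axes-level Seaborn functions draw into the Axes passed via ax=
sns.boxplot(x="condition", y="expression", data=demo, ax=axes[0])
axes[0].set_title("Boxplot")

sns.violinplot(x="condition", y="expression", data=demo, ax=axes[1])
axes[1].set_title("Violin plot")

fig.tight_layout()
plt.show()
```

Figure-level functions such as `pairplot` create their own figure and do not take `ax=`, so they cannot be embedded this way.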

---

Feel free to ask any questions or share your experiences with data visualization in biology!