# A Primer on Pandas DataFrames
pandas DataFrames are a fundamental tool in Python for data manipulation and analysis. Understanding these core concepts is essential for effective data handling in scientific computing, especially for large multivariate datasets.

## DataFrame Basics
- **Definition**: A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
- **Creation**: DataFrames can be created from various data sources like lists, dictionaries, or reading from files (e.g., CSV, Excel).

## Key Components
- **Rows and Columns**: DataFrames are composed of rows and columns, each row representing a record and each column a feature.
- **Index**: Each row has an index, which can be numeric, string, or datetime. It's crucial for data alignment and manipulation.
- **Data Types**: Columns can hold different types of data (integers, strings, floats, etc.).
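
The role of the index and of per-column data types can be seen in a small sketch. The table below is hypothetical example data (the `CarbonAtoms` column and the `extra_mass` Series are invented for illustration); the point is only that each column has its own dtype and that arithmetic lines rows up by index label rather than by position.

```python
import pandas as pd

# Hypothetical example data: a few amino acids with illustrative values
df = pd.DataFrame(
    {
        "AminoAcid": ["Alanine", "Cysteine", "Glutamine"],
        "MolecularWeight": [89.1, 121.2, 146.1],
        "CarbonAtoms": [3, 3, 5],
    }
)

# Each column carries its own data type
print(df.dtypes)

# Use the amino acid name as the row index instead of the default 0..n-1
indexed = df.set_index("AminoAcid")
print(indexed.index.tolist())

# Arithmetic aligns on index labels, not on row order.
# "Cysteine" has no partner in extra_mass, so its result is NaN.
extra_mass = pd.Series({"Glutamine": 1.0, "Alanine": 2.0})
adjusted = indexed["MolecularWeight"] + extra_mass
print(adjusted)
```

Labels that appear in only one of the two operands produce NaN, which is one common way missing values enter a DataFrame.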

## Data Manipulation
- **Selection**: Accessing specific rows, columns, or cells using loc, iloc, or column names.
- **Filtering**: Extracting a subset of rows based on a condition.
- **Adding Columns**: Enhancing DataFrames by calculating new columns or adding data.
- **Aggregation**: Performing statistical operations like mean, median, sum, etc., on columns.
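
As a minimal sketch of these operations, the snippet below contrasts label-based `loc` with position-based `iloc`, combines two filter conditions, and computes several aggregates in one call. It assumes a small table indexed by amino acid name; the variable names are illustrative only.

```python
import pandas as pd

# Small example table, indexed by amino acid name
df = pd.DataFrame(
    {
        "MolecularWeight": [89.1, 121.2, 133.1, 146.1],
        "pKa": [2.34, 1.96, 3.9, 2.17],
    },
    index=["Alanine", "Cysteine", "Aspartic Acid", "Glutamine"],
)

# loc selects by label: row "Cysteine", column "MolecularWeight"
cys_weight = df.loc["Cysteine", "MolecularWeight"]

# iloc selects by integer position: first row, first column
first_weight = df.iloc[0, 0]

# Combine conditions with & (and) or | (or); each condition needs parentheses
heavy_low_pka = df[(df["MolecularWeight"] > 120) & (df["pKa"] < 2.5)]

# Several aggregations on one column in a single call
summary = df["MolecularWeight"].agg(["mean", "median", "sum"])

print(cys_weight, first_weight)
print(heavy_low_pka)
print(summary)
```

With a non-numeric index like this one, `loc` and `iloc` are clearly different; with the default integer index they can look interchangeable, which is a frequent source of confusion.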

## Merging and Joining
- **Merging**: Combining two DataFrames based on common columns.
- **Joining**: Linking DataFrames using their indexes.
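
Index-based joining can be sketched as follows. The two tables are hypothetical and share the amino acid name as their index; `join()` matches rows on index labels, whereas `pd.merge()` matches on column values by default.

```python
import pandas as pd

# Hypothetical tables that both use the amino acid name as index
weights = pd.DataFrame(
    {"MolecularWeight": [89.1, 121.2, 146.1]},
    index=["Alanine", "Cysteine", "Glutamine"],
)
codes = pd.DataFrame(
    {"OneLetterCode": ["A", "Q", "D"]},
    index=["Alanine", "Glutamine", "Aspartic Acid"],
)

# Default how="left": keep every row of `weights`, fill unmatched rows with NaN
left_joined = weights.join(codes)

# how="inner": keep only labels present in both indexes
inner_joined = weights.join(codes, how="inner")

print(left_joined)
print(inner_joined)
```

The choice of `how` decides which rows survive: a left join preserves the first table and may introduce missing values, while an inner join drops anything without a match.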

## Handling Missing Data
- **Detection**: Identifying missing or NaN values in the DataFrame.
- **Filling**: Replacing missing values with specific values or statistical measures (mean, median).
- **Dropping**: Removing rows or columns with missing values.
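
The three steps above can be sketched on a small table with gaps. The `Notes` column is invented for illustration and is entirely empty, so that column-wise dropping has something to remove; the median is used for filling here, but any value or statistic could be substituted.

```python
import numpy as np
import pandas as pd

# Illustrative table with missing entries marked as NaN
df = pd.DataFrame(
    {
        "AminoAcid": ["Alanine", "Cysteine", "Aspartic Acid", "Glutamine"],
        "Solubility": [167.0, 200.0, np.nan, 180.0],
        "Notes": [np.nan, np.nan, np.nan, np.nan],  # hypothetical empty column
    }
)

# Detection: count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Dropping: remove columns in which every value is missing
no_empty_columns = df.dropna(axis="columns", how="all")

# Dropping: remove rows that lack a Solubility value
complete_rows = no_empty_columns.dropna(subset=["Solubility"])

# Filling: replace missing Solubility values with the column median
median_solubility = no_empty_columns["Solubility"].median()
filled = no_empty_columns.fillna({"Solubility": median_solubility})

print(complete_rows)
print(filled)
```

None of these calls modifies `df` itself; each returns a new DataFrame, so the original data stays available for comparison.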

## Input/Output
- **Reading Data**: Loading data from various file formats into DataFrames.
- **Writing Data**: Exporting DataFrames to different file formats for storage or further analysis.
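
A minimal round trip through CSV might look like the sketch below. To keep it self-contained it writes to an in-memory text buffer rather than to disk; with a real file, a path (for example a hypothetical `"amino_acids.csv"`) would be passed in place of the buffer.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "AminoAcid": ["Alanine", "Cysteine"],
        "MolecularWeight": [89.1, 121.2],
    }
)

# Write the DataFrame as CSV text; index=False leaves out the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same CSV text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Other formats follow the same pattern with matching reader and writer pairs, such as `pd.read_excel()` and `DataFrame.to_excel()`, which need an additional Excel engine package installed.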

## Conclusion
- pandas DataFrames provide a powerful, flexible, and efficient tool for handling and analyzing large datasets in Python.
- Their wide range of functionalities makes them indispensable for data preprocessing, exploration, and transformation in scientific computing, especially in fields like biochemistry.


## Code Examples

In [None]:
# Python pandas DataFrames for biochemistry applications
# -----------------------------------------

# Importing necessary libraries
import pandas as pd
import numpy as np

# Creating a DataFrame
# --------------------
# Example: Creating a DataFrame to store amino acid properties
# Start by creating a dictionary in which each key is a column name and each value is the list of entries for that column
amino_acids = {
    'AminoAcid': ['Alanine', 'Cysteine', 'Aspartic Acid', 'Glutamine'],
    'MolecularWeight': [89.1, 121.2, 133.1, 146.1],
    'pKa': [2.34, 1.96, 3.9, 2.17]
}
# Convert the dictionary to a dataframe
amino_acids_df = pd.DataFrame(amino_acids)
print("Amino Acids DataFrame:\n", amino_acids_df)

# Accessing DataFrame Elements
# ----------------------------
# Accessing specific columns
weights = amino_acids_df['MolecularWeight']
print("Molecular Weights:\n", weights)

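# Accessing rows using loc (by index label) and iloc (by integer position)
alanine_data = amino_acids_df.loc[0]
print("Data for Alanine:\n", alanine_data)
last_row = amino_acids_df.iloc[-1]
print("Data for the last row (Glutamine):\n", last_row)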

# DataFrame Operations
# ---------------------
# Filtering: Selecting amino acids with a molecular weight greater than 120
heavy_amino_acids = amino_acids_df[amino_acids_df['MolecularWeight'] > 120]
print("Heavy Amino Acids:\n", heavy_amino_acids)

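# Adding a new column: mass of a single molecule in yoctograms (1 yg = 1e-24 g)
# 1 Da is about 1.66054e-24 g; the molar mass in g/mol is numerically equal to the molecular weight
amino_acids_df['MoleculeMass_yg'] = amino_acids_df['MolecularWeight'] * 1.66054
print("Amino Acids with Single-Molecule Mass:\n", amino_acids_df)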

# Aggregation: Calculating average molecular weight
average_weight = amino_acids_df['MolecularWeight'].mean()
print("Average Molecular Weight:", average_weight)

# DataFrame Merging and Joining
# ------------------------------
# Example: Merging with another DataFrame
# Assume we have another DataFrame with solubility data
solubility_data = pd.DataFrame({
    'AminoAcid': ['Alanine', 'Cysteine', 'Glutamine'],
    'Solubility': [167, 200, 180]  # Solubility in g/L
})
merged_df = pd.merge(amino_acids_df, solubility_data, on='AminoAcid', how='left')
print("Merged DataFrame with Solubility:\n", merged_df)

# Handling Missing Data
# ----------------------
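# Filling missing values with the average solubility
# Assign the result back to the column; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
merged_df['Solubility'] = merged_df['Solubility'].fillna(merged_df['Solubility'].mean())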
print("DataFrame with Missing Values Handled:\n", merged_df)

## Some additional pandas functions
These are particularly useful when dealing with very large datasets (which is where DataFrames really shine).

In [None]:
# Check the first five rows and column headers (useful for large DataFrames)
amino_acids_df.head()
# Also try:
# amino_acids_df.tail()
# What does this do?

In [None]:
# Descriptive statistics
amino_acids_df.describe()

# Also try amino_acids_df.info()

# Exercises

## Exercise 1: DataFrame Creation
- **Task**: Create a DataFrame for storing data about different enzymes, including their names, optimal pH, and temperature.
- **Hint**: Define a dictionary with enzyme information and convert it into a DataFrame using `pd.DataFrame()`.

## Exercise 2: Data Access and Manipulation
- **Task**: From the enzyme DataFrame, select and print only the names and optimal temperatures of enzymes.
- **Hint**: Use DataFrame indexing to select specific columns.

## Exercise 3: Filtering Data
- **Task**: Filter and display enzymes that operate at a pH greater than 7.
- **Hint**: Use a boolean condition to filter rows based on pH values.

## Exercise 4: Adding and Computing a New Column
- **Task**: Add a new column to the enzyme DataFrame that indicates whether the enzyme is thermophilic (optimal temperature > 60°C).
- **Hint**: Use a lambda function and `apply()` method to create the new column based on temperature.

## Exercise 5: Merging DataFrames
- **Task**: Assume you have another DataFrame with enzyme inhibition data. Merge it with the enzyme DataFrame based on enzyme names.
- **Hint**: Use the `pd.merge()` function and specify the enzyme name column as the key with the `on` parameter.

## Exercise 6: Handling Missing Data
- **Task**: In the merged DataFrame, fill any missing inhibition data with the average inhibition value.
- **Hint**: Use the `fillna()` method on the inhibition column, replacing NaN values with the column mean. Good luck!


In [None]:
# Your Answer Here

## Solutions

In [None]:
# Importing necessary libraries
import pandas as pd

# Exercise 1: DataFrame Creation
enzymes_dict = {
    'Name': ['Lipase', 'Amylase', 'Protease'],
    'Optimal_pH': [8.0, 7.0, 6.5],
    'Optimal_Temperature': [37, 67, 50]  # in Celsius
}
enzymes_df = pd.DataFrame(enzymes_dict)
print("Enzymes DataFrame:\n", enzymes_df)

# Exercise 2: Data Access and Manipulation
enzyme_temp = enzymes_df[['Name', 'Optimal_Temperature']]
print("Enzymes and their Optimal Temperatures:\n", enzyme_temp)

# Exercise 3: Filtering Data
alkaline_enzymes = enzymes_df[enzymes_df['Optimal_pH'] > 7]
print("Alkaline Enzymes:\n", alkaline_enzymes)

# Exercise 4: Adding and Computing a New Column
enzymes_df['Thermophilic'] = enzymes_df['Optimal_Temperature'].apply(lambda x: x > 60)
print("Enzymes DataFrame with Thermophilic Column:\n", enzymes_df)

# Exercise 5: Merging DataFrames
# Assuming another DataFrame with inhibition data
inhibition_data = pd.DataFrame({
    'Name': ['Lipase', 'Amylase'],
    'Inhibition': [50, 70]  # Inhibition in percentage
})
merged_df = pd.merge(enzymes_df, inhibition_data, on='Name', how='left')
print("Merged DataFrame with Inhibition Data:\n", merged_df)

# Exercise 6: Handling Missing Data
# Assign the filled column back instead of using inplace=True on a selected column
merged_df['Inhibition'] = merged_df['Inhibition'].fillna(merged_df['Inhibition'].mean())
print("Merged DataFrame with Missing Data Handled:\n", merged_df)
