# A Primer on File I/O in Python

pandas provides robust tools for file input/output (I/O), making it a powerful library for data analysis and manipulation in Python. Understanding these core concepts is essential for efficient data handling.

## Reading Data
- **Versatility**: pandas can read data from many sources, including CSV, Excel, JSON, and HTML files as well as SQL databases.
- **Function Usage**: Common functions include `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()`, etc.
- **Customization**: These functions offer numerous parameters to handle different data formats, missing values, and file-specific settings.
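
As a minimal sketch of the customization parameters mentioned above, the example below reads a small semicolon-separated table and treats a custom marker as missing data. The sample text and the `?` marker are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data that uses '?' for missing values.
raw_text = """Compound;MolecularWeight;Role
Glucose;180.16;Energy Source
ATP;507.18;Energy Carrier
Unknown;?;?
"""

df = pd.read_csv(
    io.StringIO(raw_text),  # stands in for a path such as 'compounds.csv'
    sep=";",                # column delimiter used by this file
    na_values=["?"],        # treat '?' as a missing value
)

print(df)
print(df.isna().sum())  # number of missing values per column
```

Because `?` is declared as a missing-value marker, the `MolecularWeight` column can still be parsed as numbers rather than text.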

## Writing Data
- **Data Export**: pandas allows you to export DataFrames to various file formats.
- **Function Usage**: Each read function has a matching DataFrame method for writing, such as `to_csv()`, `to_excel()`, `to_json()`, etc.
- **Flexibility**: You can specify index inclusion, header information, and file encoding options during the export.
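
The export options listed above can be sketched as follows. The file name `compounds_export.csv`, the temporary folder, and the column aliases are assumptions chosen for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "Compound": ["Glucose", "ATP"],
    "MolecularWeight": [180.16, 507.18],
})

# Write into a temporary folder so the example leaves no files behind
# in the working directory (hypothetical file name).
out_path = Path(tempfile.mkdtemp()) / "compounds_export.csv"

df.to_csv(
    out_path,
    index=False,            # do not write the row index as a column
    header=["Name", "MW"],  # write these aliases instead of the column names
    encoding="utf-8",       # text encoding of the output file
)

text = out_path.read_text(encoding="utf-8")
print(text)
```

Passing `header=False` instead would omit the header row entirely.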

## Handling Large Data
- **Chunking**: For large datasets, pandas can read and write data in chunks to avoid memory overload.
- **Example**: Use the `chunksize` parameter of `read_csv()` to process a large file in manageable portions, each returned as its own DataFrame.
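
A minimal sketch of chunked reading is shown below. The ten-row in-memory table is a stand-in for a large file, and the chunk size of 4 is arbitrary.

```python
import io

import pandas as pd

# Build a tiny CSV in memory; a real use case would pass the path of a large file.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_chunks = 0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()  # aggregate one chunk at a time
    n_chunks += 1

print("chunks:", n_chunks, "sum:", total)
```

Only one chunk is held in memory at a time, so the same pattern scales to files that do not fit in memory, as long as the per-chunk work can be combined afterwards.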

## Data Transformation
- **Pre-Processing**: Before analysis, data often requires cleaning and transformation, which can be done during the read phase.
- **Example**: The `converters` and `dtype` parameters of `read_csv()` can clean values and set column types while the file is being read.
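
The sketch below shows one way to combine the two parameters. The sample rows, the `Code` column, and the cleaning rule are assumptions made up for this example.

```python
import io

import pandas as pd

# Hypothetical data: names with stray spaces and codes with leading zeros.
raw_text = """Name,Optimal_pH,Code
 catalase ,7.0,001
 amylase ,6.8,002
"""

df = pd.read_csv(
    io.StringIO(raw_text),
    dtype={"Code": str},  # read codes as text so leading zeros survive
    converters={
        # Each raw cell arrives as a string; trim it and capitalize it.
        "Name": lambda s: s.strip().capitalize(),
    },
)

print(df)
```

A converter runs on every cell of its column, so it suits small cleanup rules; heavier transformations are usually easier to apply after the DataFrame has been loaded.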

## Efficient Data Storage
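- **Formats**: pandas supports efficient binary storage formats such as HDF5 (`to_hdf()`, `read_hdf()`) and Parquet (`to_parquet()`, `read_parquet()`). Both rely on optional dependencies: PyTables for HDF5, and pyarrow or fastparquet for Parquet.
- **Benefits**: These formats preserve column data types and are often smaller and faster to read and write than CSV, which is especially beneficial for large datasets.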
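
As a rough sketch, a Parquet round trip might look like the following. It assumes an optional Parquet engine (pyarrow or fastparquet) may or may not be installed, so the call is guarded; the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "Compound": ["Glucose", "ATP"],
    "MolecularWeight": [180.16, 507.18],
})

# Hypothetical file name inside a temporary folder.
parquet_path = Path(tempfile.mkdtemp()) / "compounds.parquet"

try:
    df.to_parquet(parquet_path, index=False)  # needs pyarrow or fastparquet
    restored = pd.read_parquet(parquet_path)
    print(restored.dtypes)                    # column types come back unchanged
except ImportError:
    restored = None
    print("No Parquet engine installed (pyarrow or fastparquet); skipping.")
```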

## Conclusion
- pandas' I/O capabilities are integral for data analysis workflows in Python.
- The ability to seamlessly read from and write to various data formats simplifies the process of data preparation and sharing.
- Mastery of pandas I/O functions is crucial for anyone looking to perform data analysis in scientific computing, including biochemistry.


## Code Examples
- **NOTE**: By default, pandas reads from and writes to your current working directory (folder). To use a different location, pass a relative or absolute path in the function call. Google Colaboratory works differently: the files must be in your Google Drive, and you must mount the drive first. That operation is outside the scope of this primer.
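
The short sketch below shows how to check the current working directory and how to pass an explicit path instead. The temporary folder and the file name `example.csv` are stand-ins for a real project folder and file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Where pandas looks for bare file names such as 'data.csv'.
print("Current working directory:", Path.cwd())

# Hypothetical alternative location; a temporary folder is used here
# so the example does not touch your own files.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "example.csv"

pd.DataFrame({"x": [1, 2, 3]}).to_csv(csv_path, index=False)
loaded = pd.read_csv(csv_path)  # an explicit path works from any directory
print(loaded)
```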

In [1]:
# Python File I/O with pandas in Biochemistry
# --------------------------------------------

# Importing necessary libraries
import pandas as pd

# Step 1: Creating a pandas DataFrame
# -----------------------------------
# Example: Creating a DataFrame to store biochemical compound data
compounds = {
    'Compound': ['Glucose', 'ATP', 'Hemoglobin', 'Insulin'],
    'MolecularWeight': [180.16, 507.18, 64500, 5808],
    'Role': ['Energy Source', 'Energy Carrier', 'Oxygen Transport', 'Hormone']
}
compounds_df = pd.DataFrame(compounds)
print("Biochemical Compounds DataFrame:\n", compounds_df)

# Step 2: Saving the DataFrame to a CSV file
# ------------------------------------------
# Saving the DataFrame to a file named 'biochemical_compounds.csv'
compounds_df.to_csv('biochemical_compounds.csv', index=False)
print("DataFrame saved as 'biochemical_compounds.csv'")

# Step 3: Opening and Reading the CSV file
# -----------------------------------------
# Reading the saved CSV file into a new DataFrame
read_compounds_df = pd.read_csv('biochemical_compounds.csv')
print("Data read from 'biochemical_compounds.csv':\n", read_compounds_df)

# Step 4: Modifying the DataFrame
# -------------------------------
# Example Modification: Adding a new column for solubility
solubility = [91, 13.3, 'Low', 'High']  # Illustrative example values; numbers and text labels are mixed in one column
read_compounds_df['Solubility'] = solubility
print("Modified DataFrame with Solubility:\n", read_compounds_df)

# Step 5: Saving the Modified DataFrame
# -------------------------------------
# Saving the modified DataFrame back to the CSV file
read_compounds_df.to_csv('biochemical_compounds.csv', index=False)
print("Modified DataFrame saved back to 'biochemical_compounds.csv'")

# Conclusion
# ----------
# This code block demonstrates the use of file I/O with pandas in Python,
# showing how it can be applied in scientific computing for biochemistry.
# The process involves creating, saving, reading, modifying, and saving DataFrames,
# which is essential for handling and analyzing biochemical data.


Biochemical Compounds DataFrame:
      Compound  MolecularWeight              Role
0     Glucose           180.16     Energy Source
1         ATP           507.18    Energy Carrier
2  Hemoglobin         64500.00  Oxygen Transport
3     Insulin          5808.00           Hormone
DataFrame saved as 'biochemical_compounds.csv'
Data read from 'biochemical_compounds.csv':
      Compound  MolecularWeight              Role
0     Glucose           180.16     Energy Source
1         ATP           507.18    Energy Carrier
2  Hemoglobin         64500.00  Oxygen Transport
3     Insulin          5808.00           Hormone
Modified DataFrame with Solubility:
      Compound  MolecularWeight              Role Solubility
0     Glucose           180.16     Energy Source         91
1         ATP           507.18    Energy Carrier       13.3
2  Hemoglobin         64500.00  Oxygen Transport        Low
3     Insulin          5808.00           Hormone       High
Modified DataFrame saved back to 'biochemical_compounds.csv'
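
One detail of the round trip above is worth a short sketch: CSV stores only text, so a column that mixes numbers and labels, like `Solubility`, comes back with every entry as a string. The example below reuses the same illustrative values with an in-memory buffer instead of a file; `pd.to_numeric` with `errors="coerce"` is one possible way to recover the numeric entries.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Compound": ["Glucose", "ATP", "Hemoglobin", "Insulin"],
    "Solubility": [91, 13.3, "Low", "High"],  # mixed numbers and labels
})

# Round trip through CSV text held in memory.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Compare the Python types of the values before and after the round trip.
print([type(v).__name__ for v in df["Solubility"]])
print([type(v).__name__ for v in reloaded["Solubility"]])

# Convert what can be converted; text labels become NaN.
numeric = pd.to_numeric(reloaded["Solubility"], errors="coerce")
print(numeric)
```

Keeping numeric values and qualitative labels in separate columns avoids this ambiguity altogether.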

# Exercises

## Exercise 1: Create and Save DataFrame
- **Task**: Create a DataFrame containing data about various enzymes (Name, Function, Optimal pH). Then, save this DataFrame to a CSV file named 'enzymes.csv'.
- **Hint**: Use `pd.DataFrame()` to create the DataFrame and `to_csv()` method to save it.

## Exercise 2: Read and Display CSV File
- **Task**: Read the 'enzymes.csv' file into a DataFrame and display the first 5 rows.
- **Hint**: Use `pd.read_csv()` to read the file and `.head()` method to display the rows.

## Exercise 3: Add Column to DataFrame
- **Task**: Add a new column to the DataFrame indicating whether the enzyme is active in humans (Boolean values). Save the updated DataFrame back to 'enzymes.csv'.
- **Hint**: Assign a list of Boolean values to a new column and use `to_csv()` to save the changes.

## Exercise 4: Modify and Filter Data
- **Task**: Read the 'enzymes.csv' file, then filter and display enzymes with an optimal pH greater than 7.
- **Hint**: Use a boolean condition to filter the DataFrame and display the result.

## Exercise 5: Merge DataFrames
- **Task**: Assume you have another CSV file 'inhibitors.csv' with enzyme inhibitors data (Name, Inhibitor). Merge this with the 'enzymes.csv' DataFrame and display the merged DataFrame.
- **Hint**: Use `pd.merge()` to merge the DataFrames on their shared `Name` column. Good luck!


In [1]:
# Your Answers Here

## Solutions

In [2]:
# Importing necessary libraries
import pandas as pd

# Exercise 1: Create and Save DataFrame
enzymes_data = {
    'Name': ['Catalase', 'Amylase', 'Lipase'],
    'Function': ['Break down hydrogen peroxide', 'Starch digestion', 'Fat digestion'],
    'Optimal_pH': [7.0, 6.8, 8.0]
}
enzymes_df = pd.DataFrame(enzymes_data)
enzymes_df.to_csv('enzymes.csv', index=False)
print("Enzymes DataFrame saved to 'enzymes.csv'")

# Exercise 2: Read and Display CSV File
enzymes_df = pd.read_csv('enzymes.csv')
print("First 5 rows of the enzymes DataFrame:\n", enzymes_df.head())

# Exercise 3: Add Column to DataFrame
enzymes_df['ActiveInHumans'] = [True, True, False]
enzymes_df.to_csv('enzymes.csv', index=False)
print("Updated enzymes DataFrame saved to 'enzymes.csv'")

# Exercise 4: Modify and Filter Data
enzymes_df = pd.read_csv('enzymes.csv')
alkaline_enzymes = enzymes_df[enzymes_df['Optimal_pH'] > 7]
print("Alkaline enzymes:\n", alkaline_enzymes)

# Exercise 5: Merge DataFrames
# 'inhibitors.csv' is not provided here, so an equivalent DataFrame is built in memory.
# With the real file, use: inhibitors_df = pd.read_csv('inhibitors.csv')
inhibitors_df = pd.DataFrame({
    'Name': ['Catalase', 'Amylase'],
    'Inhibitor': ['Sodium azide', 'Alpha-amylase inhibitor']
})
# A left merge keeps every enzyme; enzymes without a listed inhibitor get NaN
merged_df = pd.merge(enzymes_df, inhibitors_df, on='Name', how='left')
print("Merged DataFrame with inhibitors:\n", merged_df)


Enzymes DataFrame saved to 'enzymes.csv'
First 5 rows of the enzymes DataFrame:
        Name                      Function  Optimal_pH
0  Catalase  Break down hydrogen peroxide         7.0
1   Amylase              Starch digestion         6.8
2    Lipase                 Fat digestion         8.0
Updated enzymes DataFrame saved to 'enzymes.csv'
Alkaline enzymes:
      Name       Function  Optimal_pH  ActiveInHumans
2  Lipase  Fat digestion         8.0           False
Merged DataFrame with inhibitors:
        Name                      Function  Optimal_pH  ActiveInHumans  \
0  Catalase  Break down hydrogen peroxide         7.0            True   
1   Amylase              Starch digestion         6.8            True   
2    Lipase                 Fat digestion         8.0           False   

                 Inhibitor  
0             Sodium azide  
1  Alpha-amylase inhibitor  
2                      NaN  
