# A Primer on Python Data Structures
- Simple variables store one value at a time.
- Data structures allow us to store many values in a single variable.
- The different types (`lists`, `dictionaries`, `tuples`, `numpy arrays`, and `pandas dataframes`) differ in the kind of information they store, their memory efficiency for certain types of data, and how the data can be accessed or changed.

## Lists
- **Mutable**: Lists can be modified after creation (add, remove, or change items).
- **Ordered**: The order of items is maintained, and items can be accessed by their position.
- **Syntax**: Created using square brackets `[]`.
- **Use Case**: Ideal for collections of items where the order matters and contents might change.

In [None]:
# Lists: Dynamic arrays, useful for storing collections of items.
# --------------------------------------------------------------
# Creating a list of common enzymes
enzymes = ["Ligase", "Helicase", "Polymerase", "Nuclease"]
print("List of Enzymes:", enzymes)

# Adding an enzyme to the list
enzymes.append("Transferase")
print("Updated List of Enzymes:", enzymes)

# Accessing a specific enzyme by index
print("Second Enzyme in the List:", enzymes[1])

### Some Useful Functions for Using Lists
There are two types of syntax to consider:
- **Methods that belong to the list**: These are written as `my_list.method()`, for example `my_list.append(item)`. Note, most of these change the list in place (a few, such as `index()`, `count()`, and `copy()`, only return a value).
- **Built-in functions that take a list (or any iterable) as input**: These are written as `function(my_list)`, for example `sorted(my_list)`. Note, these leave the original list unchanged and return a new object, so assign the result to a new variable.

1. **Appending and Extending:**
   - `append(item)`: Adds an item to the end of the list.
   - `extend(iterable)`: Extends the list by appending all the items from the iterable (e.g., another list).

2. **Inserting and Removing Elements:**
   - `insert(index, item)`: Inserts an item at a specified index.
   - `remove(item)`: Removes the first occurrence of an item.
   - `pop([index])`: Removes and returns an item at a given index (or the last item if index is not specified).

3. **Finding Elements:**
   - `index(item)`: Returns the index of the first occurrence of an item.
   - `count(item)`: Returns the number of times an item appears in the list.

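4. **Sorting:**
   - `sorted(iterable)`: Returns a new sorted list built from the items of the iterable; the original list is unchanged.
   - `sort()`: Sorts the list in place (called as `my_list.sort()`).

5. **Copying:**
   - `copy()`: Returns a shallow copy of the list (called as `my_list.copy()`).

6. **Clearing:**
   - `clear()`: Removes all items from the list (called as `my_list.clear()`).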

7. **List Comprehensions:**
   - A concise way to create lists. Uses a `for` clause, optionally followed by an `if` clause, to build the list iteratively.
   - `my_list = [x**2 for x in range(1,6)]` creates a list called my_list and fills it with the squares of 1 through 5 (`range(1, 6)` stops before 6).
   - `my_list = [x**2 for x in range(1, 6) if x % 2 == 0]` creates a list called my_list and fills it with the squares of only the even numbers from 1 through 5, giving `[4, 16]`. The conditional `x % 2 == 0` uses the modulo operator `%`, which takes a number and a divisor and returns the remainder. Here, it means: if the remainder of `x` divided by 2 is equal to 0, then `x` must be even, so we keep its square in the list.

8. **Slicing:**
   - Used for accessing a subset of list elements. Syntax: `list[start:stop:step]`, where the element at `stop` is not included and `step` is the interval between selected elements.

9. **Conversion to/from Other Data Types:**
   - `list(iterable)`: Converts an iterable (like a tuple, string, set, or dictionary) to a list.
   - `str.join(iterable)`: Concatenates a list of strings into a single string, with elements separated by the specified separator.

10. **Length:**
    - `len(list)`: Returns the number of items in the list.
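
The two kinds of syntax described above, along with comprehensions, slicing, and `join`, can be sketched in one short example. This is only an illustration; the variable names (`removed`, `sorted_enzymes`, `even_squares`, and so on) are made up for this example.

```python
# Methods are called on the list itself and usually change it in place
enzymes = ["Ligase", "Helicase", "Polymerase"]
enzymes.insert(1, "Nuclease")          # insert at index 1
enzymes.extend(["Kinase", "Ligase"])   # append every item from another list
removed = enzymes.pop()                # remove and return the last item
print(enzymes, "| removed:", removed)

# Built-in functions take the list as an argument and return a new object
sorted_enzymes = sorted(enzymes)       # enzymes itself is left unchanged
print(sorted_enzymes)

# List comprehension: squares of the even numbers from 1 through 5
even_squares = [x**2 for x in range(1, 6) if x % 2 == 0]
print(even_squares)

# Slicing: list[start:stop:step]; the element at stop is not included
first_two = enzymes[0:2]
every_other = enzymes[::2]
print(first_two, every_other)

# join() is a string method: the separator comes first, the list second
joined = ", ".join(enzymes)
print(joined)
```

Notice that `sorted(enzymes)` had to be assigned to a new variable, while `insert`, `extend`, and `pop` changed `enzymes` directly.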


In [None]:
# A few examples
my_list = [2,3,5,3,1,4,6,2,8] # Create the list
print(my_list) # print the list

print(len(my_list)) #print the length of the list

print(my_list.count(2)) # return the number of times '2' appears in the list

print(my_list.index(5)) # returns the index of 5; Note python starts counting at 0

sorted_list = sorted(my_list, reverse = False) # sort the list; Also try this: sorted(my_list, reverse = True)
print(sorted_list)

### Exercise 1: Lists
- **Task 1**: Create a list, and try some of these functions out
- **Task 2**: Create a list using a list comprehension.

In [None]:
# Your Answers Here; Create Additional Cells as Needed

## Tuples
- **Immutable**: Once a tuple is created, it cannot be modified.
- **Ordered**: Like lists, tuples maintain the order of items.
- **Syntax**: Created using parentheses `()`.
- **Use Case**: Suitable for fixed data sets, like coordinates or RGB color values.
- We won't focus too much on these here (b/c they're not very exciting), but wanted you to know about them.
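
Even so, a quick sketch may help show what "immutable" means in practice. The RGB value and variable names below are made up for illustration.

```python
# Tuples: fixed collections, created with parentheses
rgb_red = (255, 0, 0)

# Items are accessed by position, just like a list
print("Red channel:", rgb_red[0])

# Tuples can be unpacked into separate variables
r, g, b = rgb_red
print(r, g, b)

# Trying to change an item raises a TypeError
try:
    rgb_red[0] = 128
except TypeError as err:
    error_message = str(err)
    print("Cannot modify a tuple:", error_message)
```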

## Dictionaries
- **Mutable**: Can change, add, or delete key-value pairs.
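- **Accessed by key, not position**: Items are looked up via their keys rather than a numeric index. Since Python 3.7, dictionaries also remember the order in which items were inserted.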
- **Syntax**: Created using curly braces `{}` with key-value pairs.
- **Use Case**: Perfect for associating keys with values, like mapping names to phone numbers.

In [None]:
# Dictionaries: Key-value pairs, great for mapping relationships.
# ----------------------------------------------------------------
# Creating a dictionary to map enzymes to their functions
enzyme_functions = {
    "Ligase": "Joining of DNA strands",
    "Helicase": "Unwinding DNA helix",
    "Polymerase": "Polymerizing nucleotides",
    "Nuclease": "Cutting DNA strands"
}
print("Enzyme Functions:", enzyme_functions)

# Accessing a function by enzyme name
print("Function of Helicase:", enzyme_functions["Helicase"])

# Adding a new key-value pair
enzyme_functions["Transferase"] = "Transfer of functional groups"
print("Updated Enzyme Functions:", enzyme_functions)

### Some Useful Functions for Using Dictionaries

1. **Creating a Dictionary:**
   - `{}` or `dict()`: Create an empty dictionary.
   - `dict.fromkeys(sequence, value)`: Create a new dictionary with keys from `sequence` and values set to `value`.

2. **Accessing Elements:**
   - `dict[key]`: Access the item with key `key`. Raises a `KeyError` if the key is not found.
   - `get(key, default=None)`: Returns the value for `key` if `key` is in the dictionary, else `default`.

3. **Adding and Updating Elements:**
   - `dict[key] = value`: Sets `dict[key]` to `value`, overwriting any existing value.
   - `update([other])`: Updates the dictionary with the key/value pairs from `other`, overwriting existing keys.

4. **Removing Elements:**
   - `pop(key[, default])`: Remove the item with key `key` and return its value, or `default` if `key` is not found.
   - `popitem()`: Removes and returns a `(key, value)` pair as a 2-tuple.
   - `del dict[key]`: Removes `dict[key]` from the dictionary.
   - `clear()`: Removes all items from the dictionary.

5. **Keys, Values, and Items:**
   - `keys()`: Returns a new view of the dictionary's keys.
   - `values()`: Returns a new view of the dictionary's values.
   - `items()`: Returns a new view of the dictionary’s items (`(key, value)` pairs).

6. **Copying:**
   - `copy()`: Returns a shallow copy of the dictionary.

7. **Merging Dictionaries (Python 3.5+):**
   - `{**d1, **d2}`: Creates a new dictionary by merging `d1` and `d2`. If a key appears in both, the value from `d2` is kept.

8. **Dictionary Comprehensions:**
   - `{key: value for (key, value) in iterable}`: Similar to list comprehensions, but for dictionaries.
   - `zip()`: Pairs up separate lists of keys and values, so it is commonly used in dictionary comprehensions or passed directly to `dict()`.
   - Example: `my_dict = dict(zip(list_of_keys, list_of_values))`

9. **Length:**
   - `len(dict)`: Returns the number of items in the dictionary.

10. **Membership Test:**
    - `key in dict`: Returns `True` if `dict` has a key `key`, else `False`.

11. **Nested Dictionaries:**
    - Used for storing hierarchical or structured data.

12. **Sorting:**
    - `sorted(dict)`: Returns a sorted list of the dictionary's keys.
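
A few of the functions above (`get`, `in`, `items`, `pop`, nested dictionaries, and `sorted`) can be sketched as follows. The nested `enzyme_info` dictionary and the other variable names are made up for illustration and reuse the enzyme descriptions from the earlier example.

```python
enzyme_functions = {
    "Ligase": "Joining of DNA strands",
    "Helicase": "Unwinding DNA helix",
}

# get() returns a default value instead of raising a KeyError
missing = enzyme_functions.get("Kinase", "Unknown")
print("Kinase:", missing)

# Membership tests check the keys
has_ligase = "Ligase" in enzyme_functions
print("Ligase present?", has_ligase)

# items() lets us loop over key/value pairs together
for name, function in enzyme_functions.items():
    print(name, "->", function)

# pop() removes a key and returns its value
removed = enzyme_functions.pop("Helicase")
print("Removed:", removed, "| Remaining:", len(enzyme_functions))

# Nested dictionary: each value is itself a dictionary
enzyme_info = {
    "Nuclease": {"function": "Cutting DNA strands", "substrate": "DNA"},
    "Ligase": {"function": "Joining of DNA strands", "substrate": "DNA"},
}
print(enzyme_info["Ligase"]["function"])

# sorted() on a dictionary returns a sorted list of its keys
sorted_names = sorted(enzyme_info)
print(sorted_names)
```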


In [1]:
# Some useful examples

# Create a dictionary where the keys all have the same value.
keys = ['Alanine', 'Tyrosine', 'Methionine']
values = 'amino acid'
my_dict = dict.fromkeys(keys, values)
print(my_dict)

# Create another dictionary where the keys all have the same value.
keys = ['Adenosine', 'Cytosine', 'Guanosine']
values = 'nucleic acid'
new_dict = dict.fromkeys(keys, values)
print(new_dict)

# Merge the two dictionaries into one
combined_dict = {**my_dict, **new_dict}
print(combined_dict)

# Creating a dictionary from lists of keys and values using a dictionary comprehension
keys = ['alanine', 'adenosine', 'pyruvate']
values = ['amino acid', 'nucleic acid', 'keto acid']

# Create a dictionary with different values for each key
created_dict = {key: value for key, value in zip(keys, values)}
print(created_dict) 

{'Alanine': 'amino acid', 'Tyrosine': 'amino acid', 'Methionine': 'amino acid'}
{'Adenosine': 'nucleic acid', 'Cytosine': 'nucleic acid', 'Guanosine': 'nucleic acid'}
{'Alanine': 'amino acid', 'Tyrosine': 'amino acid', 'Methionine': 'amino acid', 'Adenosine': 'nucleic acid', 'Cytosine': 'nucleic acid', 'Guanosine': 'nucleic acid'}
{'alanine': 'amino acid', 'adenosine': 'nucleic acid', 'pyruvate': 'keto acid'}


### Exercise 2: Dictionaries
- **Task 1**: Create two dictionaries, combine them into one, and print a specific key:value.
- **Task 2**: Create a dictionary from a list of keys and a list of values using a dictionary comprehension

In [None]:
# Your Answers Here

## NumPy Arrays
- **Mutable**: Elements can be modified, but the array's size is fixed.
- **Ordered**: Elements are stored in a specific order.
- **Syntax**: Created using `numpy.array()` function.
- **Use Case**: Ideal for numerical operations, especially in scientific computing, due to its efficiency and the availability of vectorized operations. Perfect for linear algebra. Also great for encoding images (which are arrays of pixel values). 
- **Note**: Data type must be the same for every element in the array
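
To see what the single-data-type rule means in practice, here is a small sketch (variable names are made up for illustration). When the inputs are mixed, NumPy converts them all to one common type.

```python
import numpy as np

# All integers: NumPy picks an integer dtype (exact size depends on platform)
whole_numbers = np.array([1, 2, 3])
print(whole_numbers.dtype)

# Mixing integers and floats: everything becomes a float
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers, mixed_numbers.dtype)

# Mixing numbers and text: everything becomes a string
mixed_text = np.array([1, "Ligase"])
print(mixed_text, mixed_text.dtype)
```

This is one reason to keep labels and numbers in separate structures (or use a DataFrame) rather than mixing them in one array.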

In [None]:
# NumPy Arrays: Efficient arrays for numerical data.
# import the library
import numpy as np
# Creating a NumPy array of pH values
ph_values = np.array([7.2, 7.4, 6.8, 7.0, 7.3])
print("pH Values:", ph_values)

# Performing calculations on the entire array
average_ph = np.mean(ph_values)
print("Average pH:", average_ph)

### Some Useful Functions for Using NumPy Arrays

NumPy is a fundamental package for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.

1. **Creating Arrays:**
   - `np.array(list)`: Creates a NumPy array from a list or list of lists.
   - `np.zeros(shape)`: Creates an array filled with zeros.
   - `np.ones(shape)`: Creates an array filled with ones.
   - `np.arange(start, stop, step)`: Creates an array with values from `start` up to (but not including) `stop` in `step` increments.
   - `np.linspace(start, stop, num)`: Creates an array with `num` evenly spaced values over the specified interval.

2. **Array Attributes:**
   - `ndarray.shape`: Tuple of array dimensions.
   - `ndarray.size`: Number of elements in the array.
   - `ndarray.dtype`: Data type of the array elements.
   - `ndarray.ndim`: Number of array dimensions.

3. **Indexing and Slicing:**
   - `array[index]`: Accesses an element.
   - `array[start:stop:step]`: Slices the array.

4. **Reshaping and Transposing:**
   - `reshape(shape)`: Gives a new shape to an array without changing its data.
   - `transpose()`: Permute the dimensions of an array.

5. **Mathematical Operations:**
   - `np.add()`, `np.subtract()`, `np.multiply()`, `np.divide()`: Basic arithmetic operations.
   - `np.sqrt()`, `np.log()`, `np.exp()`: Square root, logarithm, and exponential.
   - `np.sin()`, `np.cos()`, `np.tan()`: Trigonometric functions.

6. **Aggregation Functions:**
   - `np.sum()`, `np.mean()`, `np.median()`: Summation, mean, and median.
   - `np.min()`, `np.max()`: Minimum and maximum.
   - `np.std()`: Standard deviation.

7. **Linear Algebra:**
   - `np.dot()`: Dot product of two arrays.
   - `np.linalg.inv()`: Inverse of a matrix.
   - `np.linalg.eig()`: Eigenvalues and eigenvectors of a matrix.

8. **Random Module:**
   - `np.random.rand()`: Random values in a given shape.
   - `np.random.randint()`: Random integers from a low to a high range.

9. **Comparisons:**
   - `np.equal()`, `np.greater()`, `np.less()`: Element-wise comparisons.

10. **Combining/Splitting:**
    - `np.concatenate()`, `np.vstack()`, `np.hstack()`: Combine arrays.
    - `np.split()`, `np.vsplit()`, `np.hsplit()`: Split arrays.

11. **Saving and Loading:**
    - `np.save()`, `np.load()`: Save and load arrays to and from disk.
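
A handful of these functions are sketched below on a small made-up array. The variable names (`steps`, `points`, `grid`, and so on) are only for illustration.

```python
import numpy as np

# arange excludes the stop value; linspace includes it
steps = np.arange(0, 10, 2)         # 0, 2, 4, 6, 8
points = np.linspace(0.0, 1.0, 5)   # 0, 0.25, 0.5, 0.75, 1
print(steps, points)

# reshape turns a 1D array of 6 values into 2 rows x 3 columns
grid = np.arange(6).reshape(2, 3)
print(grid)
print("shape:", grid.shape, "ndim:", grid.ndim, "size:", grid.size)

# transpose swaps rows and columns
flipped = grid.transpose()
print("transposed shape:", flipped.shape)

# Aggregations can run over the whole array or along one axis
total = grid.sum()
column_means = grid.mean(axis=0)    # one mean per column
print(total, column_means)

# Comparisons are element-wise and can be used to filter values
above_two = grid[grid > 2]
print(above_two)

# Dot product of a (2, 3) array with its (3, 2) transpose gives a (2, 2) array
product = np.dot(grid, flipped)
print(product)
```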


In [None]:
# Some useful examples

# Create a 1D array
my_list = [1,2,3,4,5]
array_1D = np.array(my_list)
print(array_1D)

# Do math operations on the array; This is a very useful property of arrays
print(array_1D * 2)

# Create a 2D array from two lists
first_list = [1,2,3,4,5]
second_list = [5,4,3,2,1]
array_2D = np.array([first_list, second_list]) # notice the use of []
print(array_2D)

### Exercise 3: Numpy Arrays
- **Task 1**: Create a 2D array from two lists
- **Task 2**: Create a second 2D array with the same dimensions, and subtract the first array from the second

In [None]:
# Your Answer Here

## Pandas DataFrames

Pandas DataFrames are a fundamental tool in Python for data manipulation and analysis. DataFrames are particularly useful for large multivariate datasets, such as gene expression fold change data, or single cell flow cytometry data. Key features and uses include:

- **Tabular Data Representation**: DataFrames provide a 2D table structure with rows and columns, similar to Excel spreadsheets or SQL tables.
- **Heterogeneous Data Types**: Each column can hold different data types (e.g., int, float, string), making it ideal for real-world data.
- **Data Manipulation**: Offers extensive functionality for modifying and transforming data, including adding/removing columns, filtering rows, and handling missing values.
- **Data Analysis**: Facilitates data analysis with built-in functions for descriptive statistics, aggregation, and handling time-series data.
- **Integration with Other Libraries**: Seamlessly works with other libraries like NumPy, Matplotlib, and Scikit-learn for numerical computing, plotting, and machine learning tasks.
- **Note**: DataFrames can be easily created from a dictionary, and the dictionary keys become the column headers. The more practical way to create dataframes is to read them from a .csv or excel file. We'll cover that process in the File IO primer.


In [None]:
# Some useful examples
import pandas as pd

# Simulating gene expression data using a dictionary
gene_expression_data = {
    'gene': ['gene1', 'gene2', 'gene3', 'gene4', 'gene5'],
    'raw_expression_level': [20, 55, 35, 80, 45],
    'fold_change': [1, 2.5, -1, 3, .75]
}
gene_expression = pd.DataFrame(gene_expression_data)

# Displaying the first few rows of the DataFrame
print("print the first five rows of the dataframe:\n", gene_expression.head()) # note: the '\n' adds a new line to keep the output nice and neat

# Basic summary statistics for each numerical column
print("print a table of summary statistics\n", gene_expression.describe())

# Filtering genes with expression level above a threshold
high_expression = gene_expression[gene_expression['raw_expression_level'] > 50]
print(high_expression)


### Some Useful Functions for Using Pandas DataFrames 

Pandas DataFrames offer a wide range of functions for data manipulation and analysis. Here are some commonly used functions:

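- **head()**: Displays the first few rows of the DataFrame (five by default).
  `df.head()`

- **describe()**: Provides a summary of statistics for numerical columns.
  `df.describe()`

- **groupby()**: Groups data by specified columns, useful for aggregation operations. If the DataFrame also contains text columns, pass `numeric_only=True` (or select the columns you want first) so that only numerical columns are averaged.
  `df.groupby('column_name').mean(numeric_only=True)`

- **merge()**: Merges two DataFrames based on a common column.
  `pd.merge(df1, df2, on='common_column')`

- **pivot_table()**: Creates a pivot table for data summarization.
  `df.pivot_table(values='value_column', index='row_column', columns='column_column')`

- **fillna()**: Fills NA/NaN values with a specified value. To carry the last valid value forward instead, use `df.ffill()`; the older `df.fillna(method='ffill')` form is deprecated in recent pandas versions.
  `df.fillna(0)`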
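
The functions listed above can be sketched together on a tiny made-up dataset. All values, column names, and pathway labels below are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical measurements; None becomes a missing value (NaN)
measurements = pd.DataFrame({
    "gene": ["gene1", "gene1", "gene2", "gene2"],
    "condition": ["control", "treated", "control", "treated"],
    "expression": [20.0, 30.0, 55.0, None],
})

# Hypothetical annotation table sharing the 'gene' column
annotations = pd.DataFrame({
    "gene": ["gene1", "gene2"],
    "pathway": ["pathway_A", "pathway_B"],
})

# Forward-fill: replace the missing value with the last valid one above it
filled = measurements.ffill()
print(filled)

# groupby: average expression per gene (select the numerical column first)
mean_by_gene = filled.groupby("gene")["expression"].mean()
print(mean_by_gene)

# merge: attach the pathway annotation to every measurement row
merged = pd.merge(filled, annotations, on="gene")
print(merged)

# pivot_table: one row per gene, one column per condition
wide = filled.pivot_table(values="expression", index="gene", columns="condition")
print(wide)
```

Forward-filling is used here only to demonstrate the call; whether it is a sensible way to handle missing data depends on the experiment.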


### Exercise 4: DataFrames
- **Task 1**: Create a dataframe that includes 5 molecules (rows) and two properties (columns) for each molecule
- **Task 2**: Print the first  

In [None]:
# Your answer here

# Advanced Data Type Exercises

## Exercise 1: Protein List Manipulation
Given a list of proteins `["Hemoglobin", "Insulin", "Keratin", "Collagen", "Myosin"]`, write a function named `add_protein` that adds a new protein to this list. The function should take two arguments: the list of proteins and the new protein to be added. After defining the function, add "Actin" to the list and print the updated list.

## Exercise 2: Vitamin Tuple Iteration
Create a tuple named `vitamins` containing the following vitamins: "Vitamin A", "Vitamin B", "Vitamin C", "Vitamin D". Write a loop that iterates through this tuple and prints each vitamin. Note: this works the same way as it would for a list.

## Exercise 3: Amino Acid Molecular Weight Lookup
Create a dictionary named `amino_acid_weights` with the following amino acids and their molecular weights: Alanine (89.1), Cysteine (121.2), Aspartic Acid (133.1). Write a function named `get_molecular_weight` that takes an amino acid name as an argument and returns its molecular weight. Test the function by finding the molecular weight of "Alanine".

## Exercise 4: Analyzing Enzyme Activities with NumPy
Using NumPy, generate an array of 10 random enzyme activity values. Name this array `enzyme_activities`. Calculate and print the mean and standard deviation of these activities.

## Exercise 5: Protein Concentration Data Analysis with Pandas
Create a Pandas DataFrame named `protein_data` with the following columns and values:
- 'Protein': ['Protein A', 'Protein B', 'Protein A', 'Protein B', 'Protein A', 'Protein B']
- 'Concentration': [20, 35, 25, 40, 30, 45]
- 'Cell_Type': ['Eukaryotic', 'Prokaryotic', 'Eukaryotic', 'Prokaryotic', 'Eukaryotic', 'Prokaryotic']

Perform the following tasks:
1. Display the first 5 rows of the DataFrame.
2. Calculate and print the average concentration for each protein.
3. Calculate and print the total concentration for each cell type.


In [None]:
# Your answers here; Create new cells as needed    

## Solutions

In [3]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Exercise 1: List Manipulation
def add_protein(proteins, new_protein):
    proteins.append(new_protein)
    return proteins

proteins = ["Hemoglobin", "Insulin", "Keratin", "Collagen", "Myosin"]
new_protein = "Actin"
updated_proteins = add_protein(proteins, new_protein)
print("Updated Protein List:", updated_proteins)

# Exercise 2: Tuple Operations
vitamins = ("Vitamin A", "Vitamin B", "Vitamin C", "Vitamin D")
for vitamin in vitamins:
    print("Vitamin:", vitamin)

# Exercise 3: Dictionary Handling
amino_acid_weights = {
    "Alanine": 89.1,
    "Cysteine": 121.2,
    "Aspartic Acid": 133.1
}

def get_molecular_weight(amino_acid):
    return amino_acid_weights.get(amino_acid, "Unknown")

print("Molecular Weight of Alanine:", get_molecular_weight("Alanine"))

# Exercise 4: NumPy Array Calculations
enzyme_activities = np.random.rand(10)
print("Enzyme Activities:", enzyme_activities)
print("Mean Activity:", np.mean(enzyme_activities))
print("Standard Deviation:", np.std(enzyme_activities))

# Exercise 5: Dataframe challenge
# Create a DataFrame
protein_data = pd.DataFrame({
    'Protein': ['Protein A', 'Protein B', 'Protein A', 'Protein B', 'Protein A', 'Protein B'],
    'Concentration': [20, 35, 25, 40, 30, 45],
    'Cell_Type': ['Eukaryotic', 'Prokaryotic', 'Eukaryotic', 'Prokaryotic', 'Eukaryotic', 'Prokaryotic']
})

# Display the first 5 rows
print(protein_data.head())

# Average concentration for each protein
# Select the numerical column first so the text columns are not aggregated
average_concentration = protein_data.groupby('Protein')[['Concentration']].mean()
print("The average concentrations of each protein are\n", average_concentration)

# Total concentration for each cell type
total_concentration = protein_data.groupby('Cell_Type')[['Concentration']].sum()
print("The total concentration for each cell type is\n", total_concentration)

Updated Protein List: ['Hemoglobin', 'Insulin', 'Keratin', 'Collagen', 'Myosin', 'Actin']
Vitamin: Vitamin A
Vitamin: Vitamin B
Vitamin: Vitamin C
Vitamin: Vitamin D
Molecular Weight of Alanine: 89.1
Enzyme Activities: [0.95287396 0.52038228 0.55417223 0.99046294 0.93752784 0.98944538
 0.12776423 0.23568532 0.94138047 0.92527361]
Mean Activity: 0.7174968270913544
Standard Deviation: 0.31480685546430076
     Protein  Concentration    Cell_Type
0  Protein A             20   Eukaryotic
1  Protein B             35  Prokaryotic
2  Protein A             25   Eukaryotic
3  Protein B             40  Prokaryotic
4  Protein A             30   Eukaryotic
The average concentrations of each protein are
            Concentration
Protein                 
Protein A             25
Protein B             40
The total concentration for each cell type is
              Concentration
Cell_Type                 
Eukaryotic              75
Prokaryotic            120
