## Position-Specific Scoring Matrix (PSSM)

Reference: https://www.nature.com/articles/srep46237

In [None]:
import pandas as pd
import numpy as np

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import entropy

In [None]:
sequences_df = pd.read_csv("ExampleData.csv")

In [None]:
sequences = sequences_df['Sequence'].tolist()

In [None]:
amino_acids = 'ACDEFGHIKLMNPQRSTVWY'

In [None]:
sequence_length = max(len(seq) for seq in sequences)
pssm = np.zeros((len(amino_acids), sequence_length))

In [None]:
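# Fill each cell with the fraction of sequences that have residue `aa` at position `j`.
# Sequences shorter than `j` are padded with '-', so padding counts towards the
# denominator and a column can sum to less than 1.
for j in range(sequence_length):
    # Build the column once per position instead of once per amino acid
    column = [seq[j] if j < len(seq) else '-' for seq in sequences]
    for i, aa in enumerate(amino_acids):
        pssm[i, j] = column.count(aa) / len(column)

# Convert the matrix to a pandas DataFrame (rows: amino acids, columns: positions)
pssm_df = pd.DataFrame(pssm, index=list(amino_acids))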
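
The matrix built above holds raw frequencies rather than scores. One common way to turn per-position counts into log-odds scores is sketched below, under stated assumptions: the uniform background of 1/20 per residue and the pseudocount of 1 are illustrative choices, the toy sequences are made up, and `count_matrix` and `log_odds_scores` are hypothetical helper names that appear neither in this notebook nor in the referenced paper.

```python
import numpy as np
import pandas as pd

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_matrix(sequences, alphabet=AMINO_ACIDS):
    """Count residues per position; gaps and non-standard letters are ignored."""
    length = max(len(seq) for seq in sequences)
    row = {aa: i for i, aa in enumerate(alphabet)}
    counts = np.zeros((len(alphabet), length))
    for seq in sequences:
        for j, aa in enumerate(seq):
            if aa in row:
                counts[row[aa], j] += 1
    return pd.DataFrame(counts, index=list(alphabet))

def log_odds_scores(counts, pseudocount=1.0, background=None):
    """Return log2(p / background), with p smoothed by a pseudocount (assumed scheme)."""
    if background is None:
        # Assumption: every residue is equally likely in the background
        background = np.full(len(counts), 1.0 / len(counts))
    bg = np.asarray(background, dtype=float).reshape(-1, 1)
    observed = counts.sum(axis=0).to_numpy()  # residues actually seen per position
    smoothed = (counts.to_numpy() + pseudocount * bg) / (observed + pseudocount)
    return pd.DataFrame(np.log2(smoothed / bg),
                        index=counts.index, columns=counts.columns)

toy_sequences = ["ACD", "ACE", "AC"]  # made-up data for illustration
toy_counts = count_matrix(toy_sequences)
toy_scores = log_odds_scores(toy_counts)
```

Positive scores mark residues seen more often than the background predicts, and negative scores mark residues seen less often. A real analysis would probably replace the uniform background with measured amino acid frequencies.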

In [None]:
print(pssm_df)
pssm_df.to_csv('PSSM_Matrix.csv', index=True)

In [None]:
# Calculate entropy for each position
position_entropies = pssm_df.apply(lambda col: entropy(col, base=2), axis=0)
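
For reference, the Shannon entropy of a column with frequencies p_a is H = -sum(p_a * log2(p_a)), so it ranges from 0 bits (a single residue) to log2(20), about 4.32 bits (all twenty residues equally frequent). The sketch below recomputes it by hand on made-up columns and compares the result with `scipy.stats.entropy`, which normalizes its input to sum to 1. That normalization matters here because padded columns sum to less than 1. `shannon_entropy_bits` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from scipy.stats import entropy

def shannon_entropy_bits(freqs):
    """Shannon entropy in bits, after normalizing the frequencies to sum to 1."""
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())

# Made-up example columns over the 20 amino acids
conserved = [1.0] + [0.0] * 19       # one residue in every sequence
uniform = [0.05] * 20                # all residues equally frequent
gapped = [0.25, 0.25] + [0.0] * 18   # half of the sequences are padded here

h_conserved = shannon_entropy_bits(conserved)
h_uniform = shannon_entropy_bits(uniform)
h_gapped = shannon_entropy_bits(gapped)
h_gapped_scipy = float(entropy(gapped, base=2))
```

Because of the normalization, the gapped column is scored as if its two residues each had frequency 0.5, so padding does not lower the entropy by itself.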

In [None]:
# Visualization of the PSSM heatmap
plt.figure(figsize=(15, 8))
sns.heatmap(pssm_df, cmap='viridis', cbar=True, xticklabels=10, yticklabels=True)
plt.xlabel('Position in Sequence')
plt.ylabel('Amino Acid')
plt.title('Position Specific Scoring Matrix (PSSM) Heatmap')
plt.tight_layout()
plt.savefig("pssm_matrix.png")
plt.show()

In [None]:
# Visualization of entropy across sequence positions
plt.figure(figsize=(10, 6))
plt.plot(position_entropies, marker='o')
plt.xlabel('Position in Sequence')
plt.ylabel('Entropy (bits)')
plt.title('Entropy of Amino Acid Frequencies at Each Position')
plt.grid(True)
plt.tight_layout()
plt.show()

### Explanation of the above graph (from ChatGPT)

#### High Entropy Regions:

Peaks in the graph mark positions with high entropy, that is, positions where many different amino acids occur across the sequences.
These positions are less conserved, which suggests that substitutions there are more likely to be tolerated without severely impairing protein function.
They are therefore natural targets for mutagenesis in directed evolution experiments, where changes may lead to novel or improved functions.

Directed evolution (DE): use saturation mutagenesis or random mutagenesis at these positions to explore new functionality.
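
As a rough illustration of how such candidates might be shortlisted, the sketch below ranks positions by entropy and keeps the most variable ones. The entropy values are made up, and both the cutoff of 1 bit and the helper name `most_variable_positions` are assumptions for illustration; in the notebook the input would be `position_entropies`.

```python
import pandas as pd

def most_variable_positions(entropies, top_n=5, min_entropy=1.0):
    """Return the top_n positions whose entropy is at least min_entropy (in bits)."""
    ranked = entropies.sort_values(ascending=False)
    return ranked[ranked >= min_entropy].head(top_n)

# Made-up per-position entropies standing in for position_entropies
toy_entropies = pd.Series([0.0, 2.1, 0.3, 3.0, 1.2])
candidates = most_variable_positions(toy_entropies, top_n=2)
candidate_positions = list(candidates.index)
```

Entropy only says that a position varies in the input sequences. Whether a given substitution is actually tolerated still has to be checked experimentally.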

#### Low Entropy Regions:

Troughs, or regions with near-zero entropy, mark highly conserved positions.
These positions are likely to be functionally or structurally important, and mutations here may disrupt protein stability or function.
They are usually left unchanged in protein engineering unless the goal is specifically to improve stability or fine-tune an existing property.

DE: Apply conservative mutations, or avoid changes altogether, unless you are specifically targeting stability or a particular structural feature.

#### Flat Regions with Zero Entropy:

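If the graph is flat at zero, a single amino acid occupies each of those positions in every sequence that reaches them.
Such fully conserved motifs often coincide with active sites or binding domains that are crucial for the protein’s activity.
Note that positions covered by only a few longer sequences can also show zero entropy, so check how many sequences contribute before interpreting a flat region.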
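
A minimal sketch of how fully conserved positions could be listed together with their residue is shown below. It assumes a frequency table laid out like `pssm_df` (rows are amino acids, columns are positions). The three-letter alphabet, the toy frequencies, the coverage cutoff of 0.9, and the helper name `conserved_positions` are all illustrative assumptions.

```python
import pandas as pd
from scipy.stats import entropy

def conserved_positions(freq_df, min_coverage=0.9, tol=1e-9):
    """Map each zero-entropy, well-covered position to its single residue."""
    ent = freq_df.apply(lambda col: entropy(col.to_numpy(), base=2), axis=0)
    coverage = freq_df.sum(axis=0)  # fraction of sequences with a residue here
    mask = (ent <= tol) & (coverage >= min_coverage)
    return {pos: freq_df[pos].idxmax() for pos in ent[mask].index}

# Made-up frequencies over a reduced alphabet, for illustration only
toy_freqs = pd.DataFrame(
    {0: [1.0, 0.0, 0.0],    # 'A' in every sequence
     1: [0.5, 0.5, 0.0],    # two residues, so entropy is 1 bit
     2: [0.0, 0.0, 0.25]},  # zero entropy, but only 25% of sequences reach it
    index=list("ACD"),
)
conserved = conserved_positions(toy_freqs)
```

The coverage filter drops position 2, which has zero entropy only because a single residue is seen in the few sequences long enough to reach it.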