# **FASTA Protein Data: Length and Sequence**

---

## 🚀 Brief Pipeline Explanation

This pipeline is designed to perform essential **Exploratory Data Analysis (EDA)** on a large FASTA file, transforming raw sequence data into a structured format suitable for machine learning or detailed statistical study.

### 1. Data Parsing and Length Analysis (Visualization)

This initial stage uses the **Biopython** library (`SeqIO`) to efficiently parse the raw FASTA file containing over 82,000 protein sequences.
* **Length Extraction:** The length of every sequence is computed and stored.
* **Statistical Summary:** Basic statistics (count, mean, standard deviation, min, quartiles, and max) are generated using Pandas `describe()` to quickly understand the size range of the proteins.
* **Distribution Visualization:** **Histograms** are plotted to visualize the highly **right-skewed** distribution of sequence lengths (most proteins are short, but a few are extremely long). A second, filtered plot focuses on the main body of the distribution (e.g., sequences at or below the 99th percentile) for clearer insight.
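
As a rough illustration of why the second, trimmed plot helps, the sketch below uses a handful of made-up lengths (not values from the dataset) and only the Python standard library. The helper `trim_at_percentile` is a hypothetical name, and it uses a simple nearest-rank percentile, whereas Pandas `quantile()` interpolates linearly by default, so the two cutoffs can differ slightly.

```python
import math
import statistics

# Hypothetical toy lengths: mostly short proteins plus one very long outlier.
toy_lengths = [120, 150, 180, 200, 220, 250, 300, 350, 400, 35000]

mean_len = statistics.mean(toy_lengths)
median_len = statistics.median(toy_lengths)

# In a right-skewed sample the long tail pulls the mean above the median.
is_right_skewed = mean_len > median_len


def trim_at_percentile(values, pct):
    """Return (cutoff, kept_values) using a nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    cutoff = ordered[rank - 1]
    return cutoff, [v for v in values if v <= cutoff]


cutoff, trimmed = trim_at_percentile(toy_lengths, 90)
print(f"mean={mean_len}, median={median_len}, right-skewed={is_right_skewed}")
print(f"cutoff={cutoff}, kept {len(trimmed)} of {len(toy_lengths)} lengths")
```

Dropping the single outlier shrinks the x-axis range enough that the bulk of the distribution becomes visible, which is the same motivation as the 99th-percentile filter in the pipeline.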

### 2. Metadata Structuring and ID Parsing

The second stage structures the sequence metadata into a clean Pandas DataFrame, a process often critical for aligning protein data with annotation files.
* **EntryID Splitting:** The pipeline iterates through all entries, taking the **Full EntryID** (e.g., `sp|A0A0C5B5G6|MOTSC_HUMAN`) and splitting it by the pipe (`|`) delimiter into three informative columns:
    * **Accession\_Type:** The source database (e.g., `sp` for Swiss-Prot, the reviewed section of UniProtKB).
    * **Accession\_ID:** The unique identifier (e.g., `A0A0C5B5G6`).
    * **Protein\_Name:** The UniProt entry name, a protein mnemonic followed by a species code (e.g., `MOTSC_HUMAN`).
* **Final DataFrame:** The output is a comprehensive DataFrame containing the parsed IDs, the sequence length, and a preview of the sequence, preparing the data for feature engineering or linkage with functional annotation datasets.
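
The splitting rule can be sketched as a small standalone function. `split_entry_id` is a hypothetical helper, not part of Biopython or Pandas, and the fallback values for IDs that do not have exactly three pipe-separated fields are an assumption chosen to match the column names above.

```python
def split_entry_id(entry_id):
    """Split a 'db|accession|name' identifier into named fields.

    IDs without exactly three pipe-separated parts keep the full ID
    as the accession and mark the other fields as 'N/A' (an assumed
    fallback, not a UniProt convention).
    """
    parts = entry_id.split("|")
    if len(parts) == 3:
        accession_type, accession_id, protein_name = parts
    else:
        accession_type, accession_id, protein_name = "N/A", entry_id, "N/A"
    return {
        "Accession_Type": accession_type,
        "Accession_ID": accession_id,
        "Protein_Name": protein_name,
    }


parsed = split_entry_id("sp|A0A0C5B5G6|MOTSC_HUMAN")
fallback = split_entry_id("A0A0C5B5G6")
print(parsed)
print(fallback)
```

Returning a dictionary per record makes it straightforward to collect the rows in a list and pass that list to `pd.DataFrame`.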


In [None]:
!pip install biopython

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO
import os

# -----------------------------------------------------------
# 1. Data Loading and Length Extraction
# -----------------------------------------------------------

# Specify the file path (assuming the CAFA 6 dataset structure)
file_path = '/kaggle/input/cafa-6-protein-function-prediction/Train/train_sequences.fasta'

# Check if the file exists
if not os.path.exists(file_path):
    print(f"Error: File not found at {file_path}")
else:
    # Use SeqIO to parse the FASTA file
    sequences = list(SeqIO.parse(file_path, 'fasta'))
    n_sequences = len(sequences)
    print(f"✅ Total number of sequences loaded: {n_sequences}")

    # Extract the length of all sequences into a list
    lengths = [len(record.seq) for record in sequences]

    # Create a Pandas DataFrame
    df_lengths = pd.DataFrame(lengths, columns=['Length'])

    # -----------------------------------------------------------
    # 2. Display Statistical Summary
    # -----------------------------------------------------------

    print("\n--- Statistical Summary (Sequence Length Statistics) ---")
    # Display stats formatted as integers for readability
    print(df_lengths['Length'].describe().apply(lambda x: f'{x:,.0f}'))

    # -----------------------------------------------------------
    # 3. Visualization of Distribution
    # -----------------------------------------------------------

    # Since sequence length distributions are often heavily skewed by long outliers,
    # we'll create two plots for better visualization.
    
    plt.style.use('ggplot')
    
    fig, axes = plt.subplots(2, 1, figsize=(12, 10))
    
    # === Plot 1: Histogram of Full Range ===
    
    # Check the overall shape and the presence of extreme outliers
    sns.histplot(df_lengths['Length'], bins=100, kde=True, ax=axes[0], color='skyblue')
    axes[0].set_title('① Distribution of All Sequence Lengths (Full Range)', fontsize=14)
    axes[0].set_xlabel('Sequence Length (Amino Acids)', fontsize=12)
    axes[0].set_ylabel('Frequency (Count)', fontsize=12)

    # === Plot 2: Histogram Focusing on the Main Distribution ===
    
    # Focus on the most concentrated region (e.g., sequences shorter than the 99th percentile)
    # The actual cutoff might need adjustment based on the data.
    max_visible_length = df_lengths['Length'].quantile(0.99) 
    df_filtered = df_lengths[df_lengths['Length'] <= max_visible_length]

    sns.histplot(df_filtered['Length'], bins=50, kde=True, ax=axes[1], color='coral')
    axes[1].set_title(f'② Distribution of Sequence Lengths (Focus on Length <= {int(max_visible_length):,} AA)', fontsize=14)
    axes[1].set_xlabel('Sequence Length (Amino Acids)', fontsize=12)
    axes[1].set_ylabel('Frequency (Count)', fontsize=12)
    
    plt.tight_layout()
    plt.show()

In [None]:


if 'sequences' not in locals() or not sequences:
    print("Error: The 'sequences' list is not found or is empty. Please ensure the FASTA file loading was successful.")
else:
    # 1. Data Extraction
    data = []
    for record in sequences:
        # Split the EntryID by the '|' (pipe) delimiter
        parts = record.id.split('|')
        
        # Expected format: ['sp', 'A0A0C5B5G6', 'MOTSC_HUMAN'] (3 elements)
        if len(parts) == 3:
            accession_type = parts[0]
            accession_id = parts[1]
            protein_name = parts[2]
        else:
            # Fallback for unexpected ID formats (e.g., store the full ID)
            accession_type = 'N/A'
            accession_id = record.id
            protein_name = 'N/A'

        # Append the row data to the list
        data.append({
            'Full_EntryID': record.id,
            'Length': len(record.seq),
            'Accession_Type': accession_type,
            'Accession_ID': accession_id,
            'Protein_Name': protein_name,
            'Sequence': str(record.seq)[:32]  # Preview only: the first 32 residues; the full sequence is omitted to keep the table compact
        })

    # 2. DataFrame Creation
    df = pd.DataFrame(data)

    # 3. Display Results
    print("✅ DataFrame successfully created.")
    print("\n--- First 5 rows of the DataFrame ---")
    print(df.head())
    print("\n--- DataFrame Information ---")
    df.info()  # info() prints its report directly and returns None, so wrapping it in print() would also emit "None"

    df.to_csv('fasta.csv', index=False)