# ProtVec Encoding for Drug-Protein Interaction Dataset

This notebook implements ProtVec encoding for amino acid sequences in the scope_onside_common_v3 dataset. ProtVec is a distributed representation method for amino acid sequences that converts protein sequences into fixed-length numerical vectors using pre-trained 3-gram embeddings.

## Dataset Overview
- **Total interactions**: 34,741 drug-protein interactions
- **Unique proteins**: 2,385 proteins with UniProt IDs
- **Encoding method**: ProtVec using pre-trained 100-dimensional 3-gram embeddings

## References
- Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), e0141287.
- Implementation based on PhageProtVec-master from GitHub

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import csv
import warnings
from tqdm import tqdm
import os
from collections import defaultdict

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("Required libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Required libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 1.26.4


## 2. Load Dataset and Explore Structure

In [2]:
# Load the drug-protein interaction dataset
dataset_path = "scope_onside_common_v3.parquet"
df = pd.read_parquet(dataset_path)

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nDataset info:")
print(df.info())
print(f"\nFirst 5 rows:")
df.head()

Dataset loaded successfully!
Dataset shape: (34741, 7)

Columns: ['drug_chembl_id', 'target_uniprot_id', 'label', 'smiles', 'sequence', 'molfile_3d', 'rxcui']

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34741 entries, 0 to 34740
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   drug_chembl_id     34741 non-null  object
 1   target_uniprot_id  34741 non-null  object
 2   label              34741 non-null  int64 
 3   smiles             34741 non-null  object
 4   sequence           34741 non-null  object
 5   molfile_3d         34741 non-null  object
 6   rxcui              34741 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.9+ MB
None

First 5 rows:


| | drug_chembl_id | target_uniprot_id | label | smiles | sequence | molfile_3d | rxcui |
|---|---|---|---|---|---|---|---|
| 0 | CHEMBL1000 | O15245 | 0 | O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1 | MPTVDDILEQVGESGWFQKQAFLILCLLSAAFAPICVGIVFLGFTP... | \n RDKit 3D\n\n 52 54 0 0 0 0... | 20610 |
| 1 | CHEMBL1000 | P08183 | 1 | O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1 | MDLEGDRNGGAKKKNFFKLNNKSEKDKKEKKPTVSVFSMFRYSNWL... | \n RDKit 3D\n\n 52 54 0 0 0 0... | 20610 |
| 2 | CHEMBL1000 | P35367 | 1 | O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1 | MSLPNSSCLLEDKMCEGNKTTMASPQLMPLVVVLSTICLVTVGLNL... | \n RDKit 3D\n\n 52 54 0 0 0 0... | 20610 |
| 3 | CHEMBL1000 | Q02763 | 0 | O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1 | MDSLASLVLCGVSLLLSGTVEGAMDLILINSLPLVSDAETSLTCIA... | \n RDKit 3D\n\n 52 54 0 0 0 0... | 20610 |
| 4 | CHEMBL1000 | Q12809 | 0 | O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1 | MPVRRGHVAPQNTFLDTIIRKFEGQSRKFIIANARVENCAVIYCND... | \n RDKit 3D\n\n 52 54 0 0 0 0... | 20610 |


In [3]:
# Analyze protein sequences
print("Protein Sequence Analysis:")
print(f"Total drug-protein interactions: {len(df)}")
print(f"Unique proteins (by target_uniprot_id): {df['target_uniprot_id'].nunique()}")
print(f"Unique sequences: {df['sequence'].nunique()}")

# Check sequence lengths
sequence_lengths = df['sequence'].str.len()
print(f"\nSequence length statistics:")
print(f"Min length: {sequence_lengths.min()}")
print(f"Max length: {sequence_lengths.max()}")
print(f"Mean length: {sequence_lengths.mean():.2f}")
print(f"Median length: {sequence_lengths.median():.2f}")

# Check for null sequences
null_sequences = df['sequence'].isnull().sum()
print(f"\nNull sequences: {null_sequences}")

# Sample sequence preview
print(f"\nSample protein sequence (first 100 characters):")
print(df['sequence'].iloc[0][:100] + "...")

Protein Sequence Analysis:
Total drug-protein interactions: 34741
Unique proteins (by target_uniprot_id): 2385
Unique sequences: 2381

Sequence length statistics:
Min length: 17
Max length: 2753
Mean length: 711.01
Median length: 555.00

Null sequences: 0

Sample protein sequence (first 100 characters):
MPTVDDILEQVGESGWFQKQAFLILCLLSAAFAPICVGIVFLGFTPDHHCQSPGVAELSQRCGWSPAEELNYTVPGLGPAGEAFLGQCRRYEVDWNQSAL...
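
Because ProtVec looks up 3-mers in a fixed vocabulary, it can be useful to check for residues outside the 20 standard amino acids before encoding, since any 3-mer containing one is likely to fall back to `<unk>`. The following is a minimal sketch on toy sequences; `nonstandard_residues` and `toy_sequences` are hypothetical names that do not appear in the notebook.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def nonstandard_residues(sequences):
    """Count characters that are not one of the 20 standard amino acids."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in STANDARD_AA)
    return counts

# Toy input (assumption): the second sequence contains X, U, B and Z
toy_sequences = ["MPTVDDILEQ", "MKXUBZ"]
print(nonstandard_residues(toy_sequences))
```

On the real data this would be called as `nonstandard_residues(df['sequence'].unique())`.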


## 3. Extract Unique Proteins and Sequences

In [4]:
# Extract unique proteins with their sequences
unique_proteins = df[['target_uniprot_id', 'sequence']].drop_duplicates()

print(f"Unique protein-sequence pairs extracted: {len(unique_proteins)}")

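# Check whether any UniProt ID maps to more than one sequence (unlikely but worth checking).
# The reverse can still happen: 2,381 unique sequences for 2,385 IDs means a few IDs share a sequence.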
protein_seq_counts = unique_proteins.groupby('target_uniprot_id').size()
proteins_with_multiple_seqs = protein_seq_counts[protein_seq_counts > 1]

if len(proteins_with_multiple_seqs) > 0:
    print(f"\nProteins with multiple sequences: {len(proteins_with_multiple_seqs)}")
    print("Top 5 proteins with multiple sequences:")
    print(proteins_with_multiple_seqs.head())
else:
    print("\nAll proteins have unique sequences - good!")

# Display sample of unique proteins
print(f"\nSample of unique proteins:")
unique_proteins.head()

Unique protein-sequence pairs extracted: 2385

All proteins have unique sequences - good!

Sample of unique proteins:


| | target_uniprot_id | sequence |
|---|---|---|
| 0 | O15245 | MPTVDDILEQVGESGWFQKQAFLILCLLSAAFAPICVGIVFLGFTP... |
| 1 | P08183 | MDLEGDRNGGAKKKNFFKLNNKSEKDKKEKKPTVSVFSMFRYSNWL... |
| 2 | P35367 | MSLPNSSCLLEDKMCEGNKTTMASPQLMPLVVVLSTICLVTVGLNL... |
| 3 | Q02763 | MDSLASLVLCGVSLLLSGTVEGAMDLILINSLPLVSDAETSLTCIA... |
| 4 | Q12809 | MPVRRGHVAPQNTFLDTIIRKFEGQSRKFIIANARVENCAVIYCND... |
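
The earlier analysis reported 2,385 UniProt IDs but only 2,381 unique sequences, so a few IDs must share a sequence and will receive identical ProtVec vectors. A minimal sketch of how such groups could be listed, shown on a toy DataFrame (the IDs and sequences below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for unique_proteins (assumption): P1 and P2 share a sequence
toy = pd.DataFrame({
    "target_uniprot_id": ["P1", "P2", "P3", "P4"],
    "sequence": ["MKTAY", "MKTAY", "MPTVD", "MSLPN"],
})

# Collect the IDs per sequence and keep only sequences used by several IDs
shared = (
    toy.groupby("sequence")["target_uniprot_id"]
    .agg(list)
    .loc[lambda s: s.map(len) > 1]
)
print(shared)
```

Replacing `toy` with `unique_proteins` should list the groups of IDs that share a sequence in the real dataset.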


## 4. Load ProtVec Model

In [5]:
# Load ProtVec embeddings - using the method from the PhageProtVec implementation
protvec_file = "protVec_100d_3grams.csv"

print("Loading ProtVec embeddings...")

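# Load embeddings using the same method as in the original protvec.py.
# Each line is read as a single field and then split on tabs: the first
# item is the 3-mer, the remaining items are its embedding values.
ehsanEmbed = []
with open(protvec_file) as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    for line in tsvreader:
        ehsanEmbed.append(line[0].split('\t'))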

# Extract 3-mers and embedding vectors
threemers = [vec[0] for vec in ehsanEmbed]
embeddingMat = [[float(n) for n in vec[1:]] for vec in ehsanEmbed]

# Create 3-mer to index mapping dictionary
threemersidx = {}
for i, kmer in enumerate(threemers):
    threemersidx[kmer] = i

print(f"ProtVec model loaded successfully!")
print(f"Number of 3-mers: {len(threemers)}")
print(f"Embedding dimension: {len(embeddingMat[0])}")
print(f"Sample 3-mers: {threemers[:10]}")

# Check if '<unk>' token exists for unknown 3-mers
has_unk = '<unk>' in threemersidx
print(f"Unknown token available: {has_unk}")
if not has_unk:
    print("Warning: No <unk> token found. Unknown 3-mers will be skipped.")

Loading ProtVec embeddings...
ProtVec model loaded successfully!
Number of 3-mers: 9048
Embedding dimension: 100
Sample 3-mers: ['AAA', 'ALA', 'LLL', 'LAA', 'AAL', 'ALL', 'LLA', 'LAL', 'SSS', 'EAL']
Unknown token available: True
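
The loading loop above only works if each line of the file arrives as one field that still contains tabs, which suggests the lines are wrapped in double quotes. The sketch below reproduces that parsing on an in-memory toy file and stores the result as a NumPy array, which may make later row lookups cheaper than a list of lists. The toy rows, the 3-value width, and the names `toy_file`, `vocab`, `embedding` and `kmer_to_index` are assumptions for illustration; the real file has 100 values per row.

```python
import csv
import io

import numpy as np

# Toy stand-in for the embedding file (assumption): quoted, tab-separated rows
toy_file = io.StringIO(
    '"AAA\t0.1\t-0.2\t0.3"\n'
    '"ALA\t0.0\t0.5\t-0.1"\n'
    '"<unk>\t0.2\t0.2\t0.2"\n'
)

vocab, rows = [], []
for fields in csv.reader(toy_file, delimiter="\t"):
    # A quoted line comes back as one field; an unquoted line is already split
    parts = fields[0].split("\t") if len(fields) == 1 else fields
    vocab.append(parts[0])
    rows.append([float(x) for x in parts[1:]])

embedding = np.asarray(rows, dtype=np.float64)          # shape: (n_kmers, dim)
kmer_to_index = {kmer: i for i, kmer in enumerate(vocab)}

print(embedding.shape, kmer_to_index)
```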


## 5. Implement ProtVec 3-gram Generation

In [6]:
# Implement the kmerlists function from the original protvec.py
def kmerlists(seq):
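    """
    Split a sequence into three lists of non-overlapping 3-mers, one per
    starting offset (0, 1, 2). Together the lists cover every overlapping
    3-mer, so a sequence of length n >= 3 yields n - 2 3-mers in total.
    Based on the original implementation from PhageProtVec-master
    """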
    kmer0 = []
    kmer1 = []
    kmer2 = []
    
    for i in range(0, len(seq) - 2, 3):
        if len(seq[i:i + 3]) == 3:
            kmer0.append(seq[i:i + 3])
        i += 1
        if i < len(seq) - 2 and len(seq[i:i + 3]) == 3:
            kmer1.append(seq[i:i + 3])
        i += 1
        if i < len(seq) - 2 and len(seq[i:i + 3]) == 3:
            kmer2.append(seq[i:i + 3])
    
    return [kmer0, kmer1, kmer2]

# Test the function with a sample sequence
test_sequence = unique_proteins['sequence'].iloc[0][:30]  # First 30 characters for testing
test_kmers = kmerlists(test_sequence)

print(f"Test sequence: {test_sequence}")
print(f"3-mer lists generated:")
for i, kmer_list in enumerate(test_kmers):
    print(f"  List {i}: {kmer_list}")
    
print(f"\nTotal 3-mers generated: {sum(len(kl) for kl in test_kmers)}")

Test sequence: MPTVDDILEQVGESGWFQKQAFLILCLLSA
3-mer lists generated:
  List 0: ['MPT', 'VDD', 'ILE', 'QVG', 'ESG', 'WFQ', 'KQA', 'FLI', 'LCL', 'LSA']
  List 1: ['PTV', 'DDI', 'LEQ', 'VGE', 'SGW', 'FQK', 'QAF', 'LIL', 'CLL']
  List 2: ['TVD', 'DIL', 'EQV', 'GES', 'GWF', 'QKQ', 'AFL', 'ILC', 'LLS']

Total 3-mers generated: 28
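
The same three lists can also be produced without reassigning the loop variable, by building one list per starting offset. This is a sketch of an equivalent formulation, not the PhageProtVec code; `kmerlists_by_offset` is a hypothetical name.

```python
def kmerlists_by_offset(seq, k=3):
    """Return k lists of non-overlapping k-mers, one per starting offset."""
    return [
        # Start at `offset` and step by k so the k-mers in one list never overlap
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]

frames = kmerlists_by_offset("MPTVDDILEQVGESGWFQKQAFLILCLLSA")
for offset, frame in enumerate(frames):
    print(f"Offset {offset}: {frame}")
```

For the 30-residue test sequence this gives lists of 10, 9 and 9 3-mers, matching the output of `kmerlists` above.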


## 6. Implement ProtVec Encoding Function

In [7]:
# Implement the protvec function from the original protvec.py
def protvec(kmersdict, seq, embeddingweights):
    """
    Convert protein sequence to ProtVec representation
    Based on the original implementation from PhageProtVec-master
    
    Args:
        kmersdict: Dictionary mapping 3-mers to indices
        seq: Amino acid sequence string
        embeddingweights: Matrix of embedding vectors
    
    Returns:
        Tuple (vector, valid_kmers): the summed embedding as a list of 100
        floats, and the number of 3-mers that contributed to the sum
        (3-mers mapped to <unk> are counted as well)
    """
    # Convert seq to three lists of kmers
    kmerlist = kmerlists(seq)
    # Flatten the list of lists
    kmerlist = [j for i in kmerlist for j in i]
    
    # Accumulate the sum in a NumPy array so that .tolist() below also works
    # when the sequence yields no kmers at all
    kmersvec = np.zeros(100)
    valid_kmers = 0
    
    for kmer in kmerlist:
        try:
            # Add the embedding vector for this kmer
            kmer_vector = embeddingweights[kmersdict[kmer]]
            kmersvec = np.add(kmersvec, kmer_vector)
            valid_kmers += 1
        except KeyError:
            # Unknown kmer: fall back to the <unk> embedding if available,
            # otherwise skip it
            if '<unk>' in kmersdict:
                kmer_vector = embeddingweights[kmersdict['<unk>']]
                kmersvec = np.add(kmersvec, kmer_vector)
                valid_kmers += 1
    
    return kmersvec.tolist(), valid_kmers

# Test the protvec function
test_sequence = unique_proteins['sequence'].iloc[0]
test_vector, valid_kmers = protvec(threemersidx, test_sequence, embeddingMat)

print(f"Test sequence length: {len(test_sequence)}")
print(f"Valid 3-mers processed: {valid_kmers}")
print(f"ProtVec vector dimension: {len(test_vector)}")
print(f"First 10 dimensions: {test_vector[:10]}")
print(f"Vector norm: {np.linalg.norm(test_vector):.4f}")

Test sequence length: 554
Valid 3-mers processed: 552
ProtVec vector dimension: 100
First 10 dimensions: [-42.14478700000004, -3.937756000000003, -3.767964999999999, -47.825093999999964, 0.5022699999999996, -1.566796, 14.418511000000008, -10.299479000000009, -12.616394000000009, 33.20704599999999]
Vector norm: 162.2363
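
Since the three offset lists together contain every overlapping 3-mer exactly once, the same sum can be computed with a single sliding window and one NumPy indexing step. The sketch below assumes the embeddings are held in a NumPy array and uses a toy 2-dimensional vocabulary; `encode_sum` and the toy values are hypothetical, and results on real data should agree with `protvec` only up to floating-point rounding.

```python
import numpy as np

def encode_sum(seq, kmer_to_index, embedding, k=3):
    """Sum the embeddings of all overlapping k-mers, mapping unknowns to <unk>."""
    unk = kmer_to_index.get("<unk>")
    indices = []
    for i in range(len(seq) - k + 1):
        idx = kmer_to_index.get(seq[i:i + k], unk)
        if idx is not None:  # no <unk> row available: skip the unknown k-mer
            indices.append(idx)
    if not indices:
        return np.zeros(embedding.shape[1]), 0
    # Fancy indexing gathers one row per k-mer; sum over rows gives the vector
    return embedding[indices].sum(axis=0), len(indices)

# Toy vocabulary and embedding matrix (assumptions for illustration)
toy_index = {"AAA": 0, "AAL": 1, "<unk>": 2}
toy_embedding = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])

vec, n = encode_sum("AAALK", toy_index, toy_embedding)
print(vec, n)
```

Here the 3-mers are `AAA`, `AAL` and `ALK`; the last one is not in the toy vocabulary and therefore contributes the `<unk>` row.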


## 7. Apply ProtVec Encoding to Dataset

In [8]:
# Apply ProtVec encoding to all unique protein sequences
print("Applying ProtVec encoding to all unique protein sequences...")
print(f"Total sequences to encode: {len(unique_proteins)}")

# Initialize lists to store results
encoded_proteins = []
encoding_stats = []

# Process each unique protein sequence
for idx, row in tqdm(unique_proteins.iterrows(), total=len(unique_proteins), desc="Encoding proteins"):
    uniprot_id = row['target_uniprot_id']
    sequence = row['sequence']
    
    # Skip if sequence is null or too short
    if pd.isna(sequence) or len(sequence) < 3:
        encoded_proteins.append({
            'target_uniprot_id': uniprot_id,
            'protvec': [0.0] * 100,  # Zero vector for invalid sequences
            'valid_kmers': 0,
            'sequence_length': 0 if pd.isna(sequence) else len(sequence),
            'encoding_status': 'invalid_sequence'
        })
        continue
    
    # Encode the sequence
    protvec_vector, valid_kmers = protvec(threemersidx, sequence, embeddingMat)
    
    # Store results
    encoded_proteins.append({
        'target_uniprot_id': uniprot_id,
        'protvec': protvec_vector,
        'valid_kmers': valid_kmers,
        'sequence_length': len(sequence),
        'encoding_status': 'success'
    })
    
    # Collect statistics
    encoding_stats.append({
        'sequence_length': len(sequence),
        'valid_kmers': valid_kmers,
        'vector_norm': np.linalg.norm(protvec_vector)
    })

print(f"\nEncoding completed!")
print(f"Successfully encoded: {len([p for p in encoded_proteins if p['encoding_status'] == 'success'])}")
print(f"Failed encodings: {len([p for p in encoded_proteins if p['encoding_status'] != 'success'])}")

Applying ProtVec encoding to all unique protein sequences...
Total sequences to encode: 2385


Encoding proteins: 100%|██████████| 2385/2385 [00:23<00:00, 99.85it/s] 


Encoding completed!
Successfully encoded: 2385
Failed encodings: 0





In [9]:
# Analyze encoding statistics
stats_df = pd.DataFrame(encoding_stats)

print("Encoding Statistics:")
print(f"Sequence length - Mean: {stats_df['sequence_length'].mean():.2f}, Std: {stats_df['sequence_length'].std():.2f}")
print(f"Valid k-mers - Mean: {stats_df['valid_kmers'].mean():.2f}, Std: {stats_df['valid_kmers'].std():.2f}")
print(f"Vector norm - Mean: {stats_df['vector_norm'].mean():.4f}, Std: {stats_df['vector_norm'].std():.4f}")

# Check for any unusual patterns
print(f"\nSequences with no valid k-mers: {len(stats_df[stats_df['valid_kmers'] == 0])}")
print(f"Sequences with zero vector norm: {len(stats_df[stats_df['vector_norm'] == 0])}")

# Distribution plots could be added here for a more complete analysis
print(f"\nSample of encoding results:")
for i in range(min(3, len(encoded_proteins))):
    protein = encoded_proteins[i]
    print(f"Protein {i+1}: {protein['target_uniprot_id']}")
    print(f"  Sequence length: {protein['sequence_length']}")
    print(f"  Valid k-mers: {protein['valid_kmers']}")
    print(f"  Vector norm: {np.linalg.norm(protein['protvec']):.4f}")
    print(f"  First 5 dimensions: {protein['protvec'][:5]}")
    print()

Encoding Statistics:
Sequence length - Mean: 598.54, Std: 395.57
Valid k-mers - Mean: 596.54, Std: 395.57
Vector norm - Mean: 173.3883, Std: 113.7266

Sequences with no valid k-mers: 0
Sequences with zero vector norm: 0

Sample of encoding results:
Protein 1: O15245
  Sequence length: 554
  Valid k-mers: 552
  Vector norm: 162.2363
  First 5 dimensions: [-42.14478700000004, -3.937756000000003, -3.767964999999999, -47.825093999999964, 0.5022699999999996]

Protein 2: P08183
  Sequence length: 1280
  Valid k-mers: 1278
  Vector norm: 357.2444
  First 5 dimensions: [-95.70522299999993, -12.298284000000002, 15.374387000000016, -93.36739299999994, -12.442629000000007]

Protein 3: P35367
  Sequence length: 487
  Valid k-mers: 485
  Vector norm: 139.4499
  First 5 dimensions: [-35.203913, -0.4923620000000024, -10.671149000000005, -38.49461599999999, 4.789974000000003]
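
The statistics above show the vector norm growing with sequence length (about 139 for 487 residues versus about 357 for 1,280), which is expected for an unnormalized sum. If a downstream model should not see sequence length through the feature scale, one option is to divide each summed vector by its 3-mer count. This is a sketch of that optional step, not part of the notebook's pipeline; `mean_pool` is a hypothetical helper and the inputs are toy values.

```python
import numpy as np

def mean_pool(summed_vectors, valid_kmers):
    """Divide each summed vector by its k-mer count, leaving zero rows as zeros."""
    summed = np.asarray(summed_vectors, dtype=np.float64)
    counts = np.asarray(valid_kmers, dtype=np.float64).reshape(-1, 1)
    # `where` skips rows with a zero count, so those rows keep the zeros from `out`
    return np.divide(summed, counts, out=np.zeros_like(summed), where=counts > 0)

# Toy input (assumption): two proteins, the second with no valid 3-mers
toy_sums = [[2.0, 4.0], [0.0, 0.0]]
toy_counts = [2, 0]
averaged = mean_pool(toy_sums, toy_counts)
print(averaged)
```

Whether to use the sum or the mean depends on the downstream task; the rest of this notebook keeps the summed vectors.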



## 8. Validate and Save Encoded Results

In [10]:
# Create a comprehensive dataframe with encoded results
protein_protvec_df = pd.DataFrame(encoded_proteins)

# Validate dimensions
print("Validation of encoded results:")
print(f"Number of encoded proteins: {len(protein_protvec_df)}")
print(f"Expected unique proteins: {len(unique_proteins)}")

# Check vector dimensions
vector_dimensions = protein_protvec_df['protvec'].apply(len)
print(f"Vector dimensions - Unique values: {vector_dimensions.unique()}")
print(f"All vectors have 100 dimensions: {all(vector_dimensions == 100)}")

# Check for successful encodings
successful_encodings = protein_protvec_df[protein_protvec_df['encoding_status'] == 'success']
print(f"Successful encodings: {len(successful_encodings)} ({len(successful_encodings)/len(protein_protvec_df)*100:.2f}%)")

# Identify any problematic encodings
failed_encodings = protein_protvec_df[protein_protvec_df['encoding_status'] != 'success']
if len(failed_encodings) > 0:
    print(f"Failed encodings: {len(failed_encodings)}")
    print("Failure reasons:")
    print(failed_encodings['encoding_status'].value_counts())

print("\nDataframe structure:")
print(protein_protvec_df.info())

Validation of encoded results:
Number of encoded proteins: 2385
Expected unique proteins: 2385
Vector dimensions - Unique values: [100]
All vectors have 100 dimensions: True
Successful encodings: 2385 (100.00%)

Dataframe structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2385 entries, 0 to 2384
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   target_uniprot_id  2385 non-null   object
 1   protvec            2385 non-null   object
 2   valid_kmers        2385 non-null   int64 
 3   sequence_length    2385 non-null   int64 
 4   encoding_status    2385 non-null   object
dtypes: int64(2), object(3)
memory usage: 93.3+ KB
None


In [11]:
# Save the encoded results in multiple formats

# 1. Save as pickle for Python use (preserves all data types)
pickle_file = "protein_protvec_encoded.pkl"
protein_protvec_df.to_pickle(pickle_file)
print(f"Saved encoded proteins to {pickle_file}")

# 2. Create a matrix format suitable for machine learning
# Extract vectors into a numpy array
protvec_matrix = np.array([protein['protvec'] for protein in encoded_proteins])
uniprot_ids = [protein['target_uniprot_id'] for protein in encoded_proteins]

print(f"ProtVec matrix shape: {protvec_matrix.shape}")

# Save as numpy array
np.save("protein_protvec_matrix.npy", protvec_matrix)
np.save("protein_uniprot_ids.npy", np.array(uniprot_ids))
print("Saved ProtVec matrix and UniProt IDs as numpy arrays")

# 3. Save as CSV for broader compatibility (vectors as separate columns)
# Create column names for the 100 dimensions
vector_columns = [f'protvec_dim_{i}' for i in range(100)]

# Create a dataframe with expanded vectors
expanded_df = pd.DataFrame(protvec_matrix, columns=vector_columns)
expanded_df['target_uniprot_id'] = uniprot_ids
expanded_df['valid_kmers'] = [protein['valid_kmers'] for protein in encoded_proteins]
expanded_df['sequence_length'] = [protein['sequence_length'] for protein in encoded_proteins]
expanded_df['encoding_status'] = [protein['encoding_status'] for protein in encoded_proteins]

# Reorder columns to put metadata first
metadata_cols = ['target_uniprot_id', 'sequence_length', 'valid_kmers', 'encoding_status']
expanded_df = expanded_df[metadata_cols + vector_columns]

csv_file = "protein_protvec_encoded.csv"
expanded_df.to_csv(csv_file, index=False)
print(f"Saved expanded format to {csv_file}")

print(f"\nFiles created:")
print(f"- {pickle_file}: Complete DataFrame with list-format vectors")
print(f"- protein_protvec_matrix.npy: NumPy matrix of vectors ({protvec_matrix.shape})")
print(f"- protein_uniprot_ids.npy: Corresponding UniProt IDs")
print(f"- {csv_file}: Expanded format with vectors as separate columns")

Saved encoded proteins to protein_protvec_encoded.pkl
ProtVec matrix shape: (2385, 100)
Saved ProtVec matrix and UniProt IDs as numpy arrays
Saved expanded format to protein_protvec_encoded.csv

Files created:
- protein_protvec_encoded.pkl: Complete DataFrame with list-format vectors
- protein_protvec_matrix.npy: NumPy matrix of vectors ((2385, 100))
- protein_uniprot_ids.npy: Corresponding UniProt IDs
- protein_protvec_encoded.csv: Expanded format with vectors as separate columns
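
The two `.npy` files are only useful together, because row `i` of the matrix belongs to entry `i` of the ID array. A minimal round-trip sketch, using a temporary directory and toy values rather than the files written above (the file names and numbers here are placeholders):

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for protvec_matrix and uniprot_ids (assumption)
matrix = np.array([[0.1, 0.2], [0.3, 0.4]])
ids = np.array(["O15245", "P08183"])

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "matrix.npy")
    ids_path = os.path.join(tmp, "ids.npy")
    np.save(matrix_path, matrix)
    np.save(ids_path, ids)
    loaded_matrix = np.load(matrix_path)
    loaded_ids = np.load(ids_path)

# Map each UniProt ID to its row so vectors can be looked up by ID
row_of = {uid: i for i, uid in enumerate(loaded_ids)}
print(loaded_matrix[row_of["P08183"]])
```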


In [12]:
# Create a mapping file for easy integration with the original dataset
# This allows you to merge the encoded vectors back with the drug-protein interactions

# Create a mapping from UniProt ID to encoded vector
uniprot_to_protvec = dict(zip(uniprot_ids, protvec_matrix))

print("Integration Example:")
print("To merge with original dataset, you can use:")
print("df_merged = df.merge(expanded_df[['target_uniprot_id'] + vector_columns], on='target_uniprot_id', how='left')")

# Show sample of how the data looks
print(f"\nSample of final encoded data:")
print(expanded_df.head()[['target_uniprot_id', 'sequence_length', 'valid_kmers', 'protvec_dim_0', 'protvec_dim_1', 'protvec_dim_2']])

print(f"\nEncoding Summary:")
print(f"- Original dataset: {len(df)} drug-protein interactions")
print(f"- Unique proteins: {len(unique_proteins)} proteins")
print(f"- Successfully encoded: {len(successful_encodings)} proteins")
print(f"- Vector dimension: 100 (as per ProtVec standard)")
print(f"- Average sequence length: {expanded_df['sequence_length'].mean():.1f} amino acids")
print(f"- Average valid k-mers per protein: {expanded_df['valid_kmers'].mean():.1f}")

print("\nEncoding process completed successfully!")
print("You can now use these ProtVec representations for machine learning tasks such as:")
print("- Drug-target interaction prediction")
print("- Protein function classification") 
print("- Protein similarity analysis")
print("- Feature extraction for downstream tasks")

Integration Example:
To merge with original dataset, you can use:
df_merged = df.merge(expanded_df[['target_uniprot_id'] + vector_columns], on='target_uniprot_id', how='left')

Sample of final encoded data:
  target_uniprot_id  sequence_length  valid_kmers  protvec_dim_0  \
0            O15245              554          552     -42.144787   
1            P08183             1280         1278     -95.705223   
2            P35367              487          485     -35.203913   
3            Q02763             1124         1122     -73.190838   
4            Q12809             1159         1157     -93.953584   

   protvec_dim_1  protvec_dim_2  
0      -3.937756      -3.767965  
1     -12.298284      15.374387  
2      -0.492362     -10.671149  
3       2.165375      -8.775661  
4     -19.250826     -26.050599  

Encoding Summary:
- Original dataset: 34741 drug-protein interactions
- Unique proteins: 2385 proteins
- Successfully encoded: 2385 proteins
- Vector dimension: 100 (as per ProtVec standard)
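
The merge suggested in the integration example can be made safer with two standard `DataFrame.merge` options: `validate="many_to_one"` raises an error if a UniProt ID appears more than once in the feature table, and `indicator=True` marks interaction rows that found no vector. A sketch on toy frames follows; the rows and values are made up, and only two of the 100 vector columns are shown.

```python
import pandas as pd

# Toy stand-ins for df and expanded_df (assumption)
interactions = pd.DataFrame({
    "drug_chembl_id": ["CHEMBL1000", "CHEMBL1000", "CHEMBL25"],
    "target_uniprot_id": ["O15245", "P08183", "O15245"],
    "label": [0, 1, 1],
})
features = pd.DataFrame({
    "target_uniprot_id": ["O15245", "P08183"],
    "protvec_dim_0": [-42.1, -95.7],
    "protvec_dim_1": [-3.9, -12.3],
})

merged = interactions.merge(
    features,
    on="target_uniprot_id",
    how="left",
    validate="many_to_one",  # each protein must occur once in `features`
    indicator=True,          # adds a `_merge` column describing each row's match
)

# Rows marked 'left_only' are interactions whose protein has no vector
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"Interactions without a vector: {len(unmatched)}")
```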

## Summary

This notebook applies ProtVec encoding to the drug-protein interaction dataset, following the implementation in PhageProtVec-master.

### Key Implementation Details:

1. **ProtVec Implementation**: The `kmerlists` and `protvec` functions are based on the PhageProtVec-master repository
2. **Pre-trained Embeddings**: Uses the provided `protVec_100d_3grams.csv` file, which contains 9,048 entries including the `<unk>` token
3. **3-gram Generation**: Each sequence is split into three lists of non-overlapping 3-grams, one per starting offset (0, 1, 2), as described in the original paper
4. **Vector Summation**: Each protein sequence is represented as a 100-dimensional vector by summing the embeddings of all its 3-grams. The sum is not length-normalized, so vector norms grow with sequence length

### Output Files:
- `protein_protvec_encoded.pkl`: Complete DataFrame with all results
- `protein_protvec_matrix.npy`: NumPy matrix of encoded vectors (2385 x 100)
- `protein_uniprot_ids.npy`: Corresponding UniProt IDs
- `protein_protvec_encoded.csv`: Expanded format with vectors as separate columns

### Dataset Statistics:
- **34,741** drug-protein interactions processed
- **2,385** unique proteins encoded
- **100-dimensional** vector representation per protein
- Successfully handles sequences ranging from 17 to 2,753 amino acids

The encoded protein representations can now be used for machine learning tasks such as drug-target interaction prediction, protein function classification, and similarity analysis.