<a href="https://colab.research.google.com/github/PDBeurope/pdbe-notebooks/blob/Genevieve/Finding_PDB_ids/Finding_PDB_ids_and_complex_ids_from_author_name.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src = "https://www.ebi.ac.uk/pdbe/sites/default/files/2019-04/PDBe-letterhead-charcoal-RGB_2013.png"
 height="80" align="right">

# Finding PDB & complex ids based on author name

<br>
<br>

The Protein Data Bank in Europe (PDBe) is a leading resource for the collection, organisation, and dissemination of macromolecular structures.


The APIs are a programmatic way to obtain information from the PDB and EMDB.

For more information, visit:

* http://www.ebi.ac.uk/pdbe/pdbe-rest-api

* https://www.ebi.ac.uk/pdbe/api/doc/search.html


<br>

---

# **Leveraging the power of PDBe's APIs**

### How to use Google Colab
1. To run a code cell, click on the cell to select it. You will notice a play button (▶️) on the left side of the cell. Click on the play button or press Shift+Enter to run the code in the selected cell.
2. The code will start executing, and you will see the output, if any, displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have executed all the desired code cells in sequence.
4. The currently running step is indicated by a circle with a stop sign next to it. If you need to stop or interrupt the execution of a code cell, you can click on the stop button (■) located next to the play button.
5. Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.

Notebook prepared by Genevieve Evans.

## (1) Installing Python packages

First, we import some packages that we will use, and set some variables.

We will be using Python packages / modules:

*   [re](https://docs.python.org/3/library/re.html) - allows use of regular expression matching operations similar to those found in Perl.
*   [requests](https://requests.readthedocs.io/) - allows you to send HTTP/1.1 requests extremely easily.
*   [pprint](https://docs.python.org/3/library/pprint.html) - makes data look more readable / pretty
*   [pandas](https://pandas.pydata.org/docs/index.html) - used for turning results into mini databases

<br>



---


In [1]:
# @title Setting up the notebook with various codes and packages needed (part I)

import re # regular expressions
import requests # used for getting data from a URL
from pprint import pprint # pretty print
import pandas as pd # used for turning results into mini databases

#import ipywidgets as widgets
#[ipywidgets](https://ipywidgets.readthedocs.io/en/stable/) - interactive browser controls for Jupyter notebooks
#import csv
#[csv](https://docs.python.org/3/library/csv.html) - enables csv file input and output

In [2]:
# @title Setting up the notebook with various codes and packages needed (part II)

!pip install solrq
from solrq import Q # used to turn result queries into the right format

Collecting solrq
  Downloading solrq-1.1.2-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading solrq-1.1.2-py2.py3-none-any.whl (9.3 kB)
Installing collected packages: solrq
Successfully installed solrq-1.1.2


## (2) Setting up variables

In [3]:
# @title Specifying PDBe's advanced search / Solr API

search_url = "https://www.ebi.ac.uk/pdbe/search/pdb/select?" # base URL of PDBe's search (Solr) API

In [4]:
# @title Input entry author last name

last_name = 'Evans' #  @param {type:"string"}

In [5]:
# @title Input entry author first name initial

first_initial = 'G' #  @param {type:"string"}

## (3) Setting up code / new definitions



In [6]:
# @title Setting up functions to run information retrieval

def run_search(last, initial, number_of_rows=100):
    """
    input - author last name, first-name initial, maximum number of rows
    output - list of result dictionaries, or None if the request failed

    Only the last name is used in the query; the initial is applied later,
    in find_unique_authors.
    """

    # Convert to all lowercase for the query
    last_name_lower = last.lower()

    # Fields to return for each entry
    filters = ['pdb_id', 'status', 'release_year', 'resolution', 'title', 'entry_authors','q_complex_id']

    # Search term, returned fields, number of rows and response type (json)
    search_dict = {
        'q': f"q_entry_authors:{last_name_lower}*",
        'fl': filters,
        'rows': number_of_rows,
        'wt': 'json',
    }

    response = requests.post("https://www.ebi.ac.uk/pdbe/search/pdb/select/?", data=search_dict)

    # Check the status code before trying to parse the response body
    if response.status_code != 200:
        print("[No data retrieved - %s] %s" % (response.status_code, response.text))
        return None

    results = response.json().get('response', {}).get('docs', [])
    #print('Number of results: {}'.format(len(results)))
    return results
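
To make the request in `run_search` easier to inspect, here is a minimal sketch of how the same search parameters could be form-encoded. It uses only the standard library (`urllib.parse`) and sends nothing over the network. The helper name `build_search_params` is hypothetical and is not part of the notebook or of any PDBe client. The sketch assumes that list values such as `fl` are sent as repeated keys, which is what `urlencode(..., doseq=True)` produces.

```python
from urllib.parse import urlencode, parse_qs

def build_search_params(last, number_of_rows=100):
    """Hypothetical helper: assemble the parameters that run_search sends."""
    return {
        'q': f"q_entry_authors:{last.lower()}*",  # prefix match on the last name
        'fl': ['pdb_id', 'status', 'release_year', 'resolution',
               'title', 'entry_authors', 'q_complex_id'],
        'rows': number_of_rows,
        'wt': 'json',
    }

params = build_search_params('Evans', number_of_rows=5)

# doseq=True turns the 'fl' list into repeated fl=... pairs
encoded = urlencode(params, doseq=True)
print(encoded)

# Decode again to check what a server would receive
decoded = parse_qs(encoded)
print(decoded['q'])
print(decoded['fl'])
```

Printing the encoded string before sending a request can help when a query returns fewer results than expected.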

In [7]:
# @title Setting up functions to transform the information retrieved {"display-mode":"form"}

def change_lists_to_strings(results):
    """
    input - list of results from search
    output - list of results with lists changed into strings
    """
    for row in results:
        for data in row:
            if isinstance(row[data], list):
                # if there are any numbers in the list change them into strings
                row[data] = [str(a) for a in row[data]]
                # unique and sort the list and then change the list into a string
                row[data] = ','.join(sorted(list(set(row[data]))))

    return results

def pandas_dataset(list_of_results):
    results = change_lists_to_strings(list_of_results) # we have added our function to change lists to strings

    # Convert to pandas dataframe
    df = pd.DataFrame(results)

    # Rename column in pandas dataframe
    df = df.rename(columns={"q_complex_id": "PDBe_complex_id"}, errors="raise")

    # Drop duplicate rows
    df = df.drop_duplicates()

    return df

def find_unique_authors(df_result, last, initial):
    """
    input - dataframe of results, author last name, first-name initial
    output - list of unique author names that start with the last name and
             contain the initial (or consist of the last name only)
    """
    # Split the comma-separated author strings into one column per author
    df_author_names = df_result['entry_authors'].str.split(',', expand=True)

    # Stack the columns into a single Series and keep the unique names
    unique_list = pd.unique(df_author_names.stack()).tolist()

    # Keep names that start with the last name, e.g. 'Evans'
    surname = last.title()
    pattern = re.compile(rf'^{re.escape(surname)}')
    matching_values = [value for value in unique_list if pattern.search(value)]

    # Initial to search for, e.g. 'G'
    search_value = initial.capitalize()

    # Keep names whose text after the last name contains the initial,
    # plus names recorded as the last name only
    filtered_names = [name for name in matching_values
                      if search_value in name[len(surname):] or name == surname]

    return filtered_names

def update_df_w_matching_authors(df_result, unique_author_names):

    # Work on a copy so that the DataFrame passed in is left unchanged
    df_result = df_result.copy()

    # Create a new column 'matched' that is True when any selected name occurs in 'entry_authors'
    df_result['matched'] = df_result['entry_authors'].apply(
        lambda x: any(val in x for val in unique_author_names))

    # Function to find the first author in a row that exactly equals a selected name
    def find_match(x):
        author_list = x.split(',')
        for val in unique_author_names:
            for author in author_list:
                if val == author:
                    return author
        return None

    # Create a new column 'author' with the actual matched value
    df_result['author'] = df_result['entry_authors'].apply(find_match)

    # Keep rows where column 'matched' is True, then remove the helper column
    df_filtered = df_result.loc[df_result['matched']].drop(columns=['matched'])

    # Store 'release_year' as an integer in a new column 'year_released'
    df_filtered = df_filtered.assign(year_released=df_filtered['release_year'].astype(int))
    df_filtered = df_filtered.drop(columns=['release_year'])

    # Sort by 'year_released' in descending order
    df_sorted = df_filtered.sort_values(by='year_released', ascending=False)

    # Reset index to have continuous range starting from 0
    df_reindexed = df_sorted.reset_index(drop=True)

    # Reorder the columns
    column_names = ['year_released', 'author', 'entry_authors', 'pdb_id', 'PDBe_complex_id', 'title', 'resolution', 'status']
    df_reordered = df_reindexed[column_names]

    return df_reordered
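
The list-to-string step in `change_lists_to_strings` has one subtlety worth seeing on a small example: values are converted to strings before sorting, so numbers are ordered as text rather than numerically. The sketch below is a hypothetical, non-mutating variant for a single row. The name `flatten_list_fields` and the `toy_numbers` field are illustrative assumptions, and the duplicated complex id is toy data added to show de-duplication.

```python
def flatten_list_fields(row):
    """Hypothetical helper: return a copy of row with list values joined into strings."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, list):
            # convert to strings, de-duplicate, sort, then join with commas
            flat[key] = ','.join(sorted({str(item) for item in value}))
        else:
            flat[key] = value
    return flat

example_row = {
    'pdb_id': '1zfj',
    'q_complex_id': ['PDB-CPX-142883', 'PDB-CPX-142882', 'PDB-CPX-142883'],
    'toy_numbers': [10, 9, 2],  # illustrative field, not returned by the API
}

flat_row = flatten_list_fields(example_row)
print(flat_row['q_complex_id'])  # PDB-CPX-142882,PDB-CPX-142883
print(flat_row['toy_numbers'])   # 10,2,9 (sorted as text, not as numbers)
```

Returning a new dictionary, instead of editing the row in place, keeps the raw API results available if they are needed again later.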

## (4) Generating results

In [8]:
# @title Example {"display-mode":"form"}

# API query
example_results = run_search(last_name, first_initial, number_of_rows=50000)

# Convert to pandas dataframe
df_pdb_ids = pandas_dataset(example_results)

unique_author_names = find_unique_authors(df_pdb_ids,last_name,first_initial)
print(unique_author_names)
print(len(unique_author_names))

['Evans HG', 'Evans GB', 'Evans GL', 'Evans G']
4
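
The list printed above includes 'Evans HG' because `find_unique_authors` accepts the initial at any position, not only as the first initial. If only authors whose first initial matches are wanted, a stricter filter could look like the sketch below. `filter_by_first_initial` is a hypothetical helper, not part of the notebook. It assumes that names follow the 'Surname INITIALS' layout seen in the output, and it requires the surname to match exactly rather than as a prefix.

```python
def filter_by_first_initial(names, last, initial):
    """Hypothetical helper: keep 'Surname INITIALS' names whose first initial matches."""
    wanted_surname = last.lower()
    wanted_initial = initial.upper()
    kept = []
    for name in names:
        # split on the last space: everything before it is treated as the surname
        surname, _, initials = name.rpartition(' ')
        if not surname:
            # no space in the name, so it holds a surname without initials
            surname, initials = initials, ''
        same_surname = surname.lower() == wanted_surname
        if same_surname and (initials == '' or initials.startswith(wanted_initial)):
            kept.append(name)
    return kept

names = ['Evans HG', 'Evans GB', 'Evans GL', 'Evans G']
strict_names = filter_by_first_initial(names, 'Evans', 'G')
print(strict_names)  # ['Evans GB', 'Evans GL', 'Evans G']
```

The stricter list could then be passed to `update_df_w_matching_authors` in place of `unique_author_names`.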


In [9]:
# @title Example (continued) {"display-mode":"form"}

# generating table
df_matched = update_df_w_matching_authors(df_pdb_ids, unique_author_names)
display(df_matched)

| | year_released | author | entry_authors | pdb_id | PDBe_complex_id | title | resolution | status |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | Evans GL | Evans GL,Olasz B,Vrielink A | 8ucy | PDB-CPX-275473 | Sterile Alpha Motif (SAM) domain from Tric1, A... | 1.48 | REL |
| 1 | 2023 | Evans G | Crawshaw A,Evans G,Stuart D,Sutton G,Trincao J... | 8qph | PDB-CPX-235404 | Crystal structure of Lymantria dispar CPV14 po... | 1.34 | REL |
| 2 | 2023 | Evans G | Crawshaw A,Evans G,Stuart D,Sutton G,Trincao J... | 8qqc | PDB-CPX-187518 | Crystal structure of Lymantria dispar CPV14 po... | 1.30 | REL |
| 3 | 2021 | Evans G | Begley T,Bolton R,Ealick SE,Evans G,Giri N,Rod... | 7nhf | PDB-CPX-185267 | Crystal structure of Arabidopsis thaliana Pdx1... | 2.35 | REL |
| 4 | 2021 | Evans G | Begley T,Bolton R,Ealick SE,Evans G,Giri N,Rod... | 7nhe | PDB-CPX-185267 | Crystal structure of Arabidopsis thaliana Pdx1... | 2.23 | REL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 84 | 2002 | Evans G | Bricogne G,Evans G | 1gwg | PDB-CPX-136237 | Tri-iodide derivative of apoferritin | 2.01 | REL |
| 85 | 2002 | Evans G | Bricogne G,Evans G | 1gw9 | PDB-CPX-150204 | Tri-iodide derivative of Xylose Isomerase from... | 1.55 | REL |
| 86 | 2002 | Evans G | Bricogne G,Evans G | 1gwa | PDB-CPX-133451 | Triiodide derivative of porcine pancreas elastase | 1.85 | REL |
| 87 | 2000 | Evans G | Beno D,Collart FR,Evans G,Huberman E,Joachimia... | 1zfj | PDB-CPX-142882,PDB-CPX-142883 | INOSINE MONOPHOSPHATE DEHYDROGENASE (IMPDH; EC... | 1.90 | REL |
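
Since the aim is a list of PDB ids and complex ids, the table can be reduced to two sets of unique identifiers. The sketch below works on plain dictionaries, using a few rows copied from the output above, so it runs without pandas. `collect_unique_ids` is a hypothetical helper, and it assumes that every row has a string in `PDBe_complex_id`.

```python
rows = [
    {'pdb_id': '8ucy', 'PDBe_complex_id': 'PDB-CPX-275473'},
    {'pdb_id': '7nhf', 'PDBe_complex_id': 'PDB-CPX-185267'},
    {'pdb_id': '7nhe', 'PDBe_complex_id': 'PDB-CPX-185267'},
    {'pdb_id': '1zfj', 'PDBe_complex_id': 'PDB-CPX-142882,PDB-CPX-142883'},
]

def collect_unique_ids(rows):
    """Hypothetical helper: return sorted unique PDB ids and complex ids."""
    pdb_ids = set()
    complex_ids = set()
    for row in rows:
        pdb_ids.add(row['pdb_id'])
        # one entry can list several complex ids, joined by commas
        complex_ids.update(row['PDBe_complex_id'].split(','))
    return sorted(pdb_ids), sorted(complex_ids)

pdb_ids, complex_ids = collect_unique_ids(rows)
print(pdb_ids)      # ['1zfj', '7nhe', '7nhf', '8ucy']
print(complex_ids)  # ['PDB-CPX-142882', 'PDB-CPX-142883', 'PDB-CPX-185267', 'PDB-CPX-275473']
```

With the real table, the rows could be obtained with `df_matched[['pdb_id', 'PDBe_complex_id']].to_dict('records')`. Entries without a complex id may appear as missing values and would need to be skipped first.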


Copyright 2024 EMBL - European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.