<h1 style="text-align: center; color: #1f77b4;">NCBI GEO samples scraping</h1>

### Install Selenium

In [1]:
pip install selenium

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





### Import required libraries

- re: For regular expression operations.
- csv: For reading and writing CSV files.
- pandas as pd: For creating a DataFrame.
- requests: For sending HTTP requests.
- BeautifulSoup: From the bs4 package, used for parsing HTML and XML documents.
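- selenium: A tool for automating web browsers. The webdriver module controls the browser.
- By: The selenium class that names the strategy used to locate elements (for example By.XPATH).
- Service: The selenium class that manages the driver process. Here it points to the Firefox driver executable, geckodriver.exe.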

In [3]:
import re
import csv
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
service = Service('geckodriver.exe')

### Setting up the WebDriver to interact with the website

1. website: A variable containing the URL of the website to be accessed, which in this case is an NCBI GEO dataset page.
2. driver: An instance of the Firefox WebDriver created using the service object initialized earlier. This driver will control the Firefox browser.
3. driver.get(website): This command instructs the driver to navigate to the specified URL, opening the webpage in the Firefox browser.

In [4]:
website= 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE186458'

In [5]:
driver = webdriver.Firefox(service=service)
driver.get(website)  

### A. Finding Page elements using selenium

1. more_link: A variable that stores the web element found using Selenium's find_element method. The method locates the element using an XPath expression, which in this case targets an anchor (\<a>) tag within a (\<div>) element whose ID contains the string "divhidden".

2. more_link.click(): This command simulates a click on the found element, a "more" link that reveals additional content on the page.

In [6]:
more_link = driver.find_element(By.XPATH, '//div[contains(@id, "divhidden")]/a')
more_link.click()

### Access the website using requests

1. response: A variable that stores the result of the requests.get(website) call, which sends a GET request to the URL stored in the website variable and returns the server's response.

2. response: Evaluating the variable on the last line of the cell displays the Response object together with its HTTP status code; 200 means the request succeeded.

In [7]:
response = requests.get(website)
response

<Response [200]>
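
A request can also fail or return an error status, which matters later when hundreds of sample pages are fetched one after another. The following is a minimal sketch of a retrying fetch helper, not part of the original notebook. The name `fetch_text` and its parameters are hypothetical, and the `get` argument is assumed to behave like `requests.get`.

```python
import time


def fetch_text(get, url, retries=3, delay=1.0, timeout=30):
    """Return the body text of `url`, retrying on failure.

    `get` is any callable shaped like requests.get: it accepts
    (url, timeout=...) and returns an object that offers
    raise_for_status() and a .text attribute.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx statuses into exceptions
            return response.text
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    raise RuntimeError(f"could not fetch {url} after {retries} attempts") from last_error
```

With requests imported, a call would look like `html = fetch_text(requests.get, website)`. Passing the getter in as an argument keeps the helper easy to try out with a stand-in function instead of a live connection.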

### Parsing the HTML content of a webpage

- soup: A variable that stores the parsed HTML content. It uses BeautifulSoup from the bs4 library to convert the HTML text into a structured format that can be easily navigated and searched.

- BeautifulSoup(response.text, 'html.parser'): This creates a BeautifulSoup object by passing the response.text (the raw HTML content obtained from the HTTP response) and specifying 'html.parser' as the parser to use. This allows for easy extraction of data from the HTML structure.

In [8]:
soup = BeautifulSoup(response.text, 'html.parser')

### B. Finding the webpage elements

tds: A variable that stores a list of all \<td> elements found on the webpage that match the specified attributes.

In [9]:
tds = soup.find_all('td', attrs={'onmouseout': "onLinkOut('HelpMessage' , geo_empty_help)",'onmouseover': "onLinkOver('HelpMessage' , geoaxema_recenter)"})

samples_container: A variable that stores the second \<td> element of the tds list (index 1), which is the container of the samples we want to scrape.

In [10]:
samples_container= tds[1]

tables: A variable that stores a list of all \<table> elements found within the samples_container element. These tables contain the sample links.

In [11]:
tables= samples_container.find_all('table')

In [12]:
len(tables)

2

samples_links: A variable that stores the combined list of all \<a> elements found in the first and second \<table> elements of the tables list. Each \<a> element carries the href of one sample page.

In [13]:
samples_links= tables[0].find_all('a') + tables[1].find_all('a')

In [17]:
print(f'The number of samples= {len(samples_links)}')

The number of samples= 253
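
Indexing tables[0] and tables[1] works because the container holds exactly two tables. As a hedged alternative, the sketch below collects the links from however many tables are present. The HTML snippet, the `samples` id, and the accession numbers are made up for illustration and only imitate the structure described above.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the samples container;
# the accession numbers below are made up.
html = """
<div id="samples">
  <table><tr><td><a href="/geo/query/acc.cgi?acc=GSM0000001">GSM0000001</a></td></tr></table>
  <table>
    <tr><td><a href="/geo/query/acc.cgi?acc=GSM0000002">GSM0000002</a></td></tr>
    <tr><td><a href="/geo/query/acc.cgi?acc=GSM0000003">GSM0000003</a></td></tr>
  </table>
</div>
"""

container = BeautifulSoup(html, "html.parser").find("div", id="samples")

# Gather the links from every table instead of indexing each table by hand.
# This assumes the tables are not nested inside one another.
links = [a for table in container.find_all("table") for a in table.find_all("a")]
accessions = [a.get_text(strip=True) for a in links]
print(accessions)
```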


### Initializing data structures to store information of the samples

- samples_dataset: An empty dictionary that will later map each column name (GEO accession, Age, Sex) to the corresponding list of values.

- samples_GEO_accession: An empty list that will store the GEO accession of each sample, the unique identifier assigned to it in the Gene Expression Omnibus (GEO) database.

- samples_age: An empty list that will store the age recorded for each sample.

- samples_sex: An empty list that will store the sex recorded for each sample (male or female).

In [18]:
samples_dataset={}
samples_GEO_accession=[]
samples_age= []
samples_sex= []

### Extracting samples information

1. Iterate through samples_links: For each \<a> tag (link) in the samples_links list:

    - name: Extracts the sample name from the link and appends it to samples_GEO_accession.
    - link: Retrieves and constructs the full URL for the sample page.
    - sample_page: Sends a GET request to the sample page URL.
    - sample_soup: Parses the HTML content of the sample page using BeautifulSoup.
    - table: Finds the specific table on the sample page using given attributes.
    - tds: Retrieves all \<td> elements from the table.
    - sample_info: Selects the \<td> element at index 17 (the 18th cell), which contains the sample characteristics, for further processing.
    - text_content: Extracts and cleans the text from sample_info.
    - age_match: Searches for an age pattern in the text and extracts the age if found.
    - sex_match: Searches for a sex or gender pattern in the text and extracts the sex if found.
    - age: Appends the extracted age or "NaN" if not found.
    - sex: Appends the extracted sex or "NaN" if not found.

In [19]:
sample_number = 0
for a_tag in samples_links:
    # The link text is the sample's GEO accession.
    name = a_tag.text
    samples_GEO_accession.append(name)
    # The href is site-relative, so prefix it with the NCBI host.
    link = a_tag.get('href')
    link = f"https://www.ncbi.nlm.nih.gov{link}"
    sample_page = requests.get(link)
    sample_soup = BeautifulSoup(sample_page.text, 'html.parser')
    table = sample_soup.find('table', attrs={'cellpadding': '2', 'cellspacing': '0', 'width': '600'})
    tds = table.find_all('td')
    # The cell at index 17 holds the sample characteristics.
    sample_info = tds[17]
    text_content = sample_info.get_text(separator=' ').strip()
    age_match = re.search(r'age:\s*(\d+)', text_content)
    # Alternation needs a group; square brackets would match a single character only.
    sex_match = re.search(r'(?:Sex|gender):\s*(female|male|F|M)', text_content, re.IGNORECASE)
    age = age_match.group(1) if age_match else "NaN"
    # Keep the first letter, upper-cased, so "female" and "F" both become "F".
    sex = sex_match.group(1)[0].upper() if sex_match else "NaN"
    samples_age.append(age)
    samples_sex.append(sex)
    sample_number += 1
    print(f'Sample {sample_number} done')

Sample 1 done
Sample 2 done
Sample 3 done
Sample 4 done
Sample 5 done
Sample 6 done
Sample 7 done
Sample 8 done
Sample 9 done
Sample 10 done
Sample 11 done
Sample 12 done
Sample 13 done
Sample 14 done
Sample 15 done
Sample 16 done
Sample 17 done
Sample 18 done
Sample 19 done
Sample 20 done
Sample 21 done
Sample 22 done
Sample 23 done
Sample 24 done
Sample 25 done
Sample 26 done
Sample 27 done
Sample 28 done
Sample 29 done
Sample 30 done
Sample 31 done
Sample 32 done
Sample 33 done
Sample 34 done
Sample 35 done
Sample 36 done
Sample 37 done
Sample 38 done
Sample 39 done
Sample 40 done
Sample 41 done
Sample 42 done
Sample 43 done
Sample 44 done
Sample 45 done
Sample 46 done
Sample 47 done
Sample 48 done
Sample 49 done
Sample 50 done
Sample 51 done
Sample 52 done
Sample 53 done
Sample 54 done
Sample 55 done
Sample 56 done
Sample 57 done
Sample 58 done
Sample 59 done
Sample 60 done
Sample 61 done
Sample 62 done
Sample 63 done
Sample 64 done
Sample 65 done
Sample 66 done
Sample 67 done
Samp
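
The loop above builds each sample URL by prefixing the href with the site root, which works as long as every href is site-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. The href values below are hypothetical and use a made-up accession number.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ncbi.nlm.nih.gov"

# Hypothetical href values; the accession number is made up.
relative_href = "/geo/query/acc.cgi?acc=GSM0000001"
absolute_href = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM0000001"

# urljoin resolves a relative href against the base URL
# and leaves an already absolute href unchanged.
from_relative = urljoin(BASE_URL, relative_href)
from_absolute = urljoin(BASE_URL, absolute_href)
print(from_relative)
print(from_absolute)
```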

In [20]:
print(f'length of samples_age list= {len(samples_age)}')
print(f'length of samples_sex list= {len(samples_sex)}')
print(f'length of samples_GEO_accession list= {len(samples_GEO_accession)}')

length of samples_age list= 253
length of samples_sex list= 253
length of samples_GEO_accession list= 253


### Populating the samples_dataset dictionary with the extracted sample information

In [21]:
samples_dataset['GEO accession']= samples_GEO_accession
samples_dataset['Age']= samples_age
samples_dataset['Sex']= samples_sex

### Creating a DataFrame for the samples 

In [23]:
df = pd.DataFrame(samples_dataset)
df.set_index('GEO accession', inplace=True)
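
The scraped ages are stored as strings, and missing values are stored as the string "NaN", so numeric operations on the Age column would not work as they stand. The sketch below shows one way to convert the columns after building the DataFrame. The rows are made up and only mirror the shape of the scraped lists.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped lists, including missing values.
df = pd.DataFrame(
    {
        "GEO accession": ["GSM0000001", "GSM0000002", "GSM0000003"],
        "Age": ["53", "NaN", "35"],
        "Sex": ["F", "M", "NaN"],
    }
).set_index("GEO accession")

# Parse the ages as numbers; anything that cannot be parsed becomes a real NaN.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
# Blank out the "NaN" placeholder string in the Sex column.
df["Sex"] = df["Sex"].mask(df["Sex"] == "NaN")

mean_age = df["Age"].mean()         # NaN entries are skipped by default
missing_sex = df["Sex"].isna().sum()
print(mean_age, missing_sex)
```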

In [27]:
df.head(25)

| GEO accession | Age | Sex |
|---|---|---|
| GSM5652176 | 53 | F |
| GSM5652177 | 35 | F |
| GSM5652178 | 37 | F |
| GSM5652179 | 33 | F |
| GSM5652180 | 62 | F |
| GSM5652181 | 54 | F |
| GSM5652182 | 52 | F |
| GSM5652183 | 34 | M |
| GSM5652184 | 67 | M |
| GSM5652185 | 60 | M |


### Saving the dataset as a CSV file

In [25]:
df.to_csv('Samples.csv')
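
The same kind of file can be produced without pandas by using the standard-library csv module. The sketch below writes made-up records to an in-memory buffer instead of a real file and reads them back to check the round trip. The record values are hypothetical.

```python
import csv
import io

# Made-up records in the same shape as the scraped data.
rows = [
    {"GEO accession": "GSM0000001", "Age": "53", "Sex": "F"},
    {"GEO accession": "GSM0000002", "Age": "NaN", "Sex": "M"},
]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["GEO accession", "Age", "Sex"])
writer.writeheader()
writer.writerows(rows)

# Read the text back to confirm that every field survives the round trip.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```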