# About this notebook:

- We collect the IDs of disease-related articles from [PubMed](https://pubmed.ncbi.nlm.nih.gov/).
- We use the [NCBI E-utilities API](https://www.ncbi.nlm.nih.gov/books/NBK25497/), the programmatic interface to Entrez, to obtain the data.
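
As a rough illustration of what an E-utilities request for article IDs can look like, the sketch below builds an ESearch URL without sending it. The search term, the date range, and the helper name `build_esearch_url` are assumptions made for illustration; this notebook does not show the query that produced its ID list.

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities ESearch service
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, mindate, maxdate, retmax=100):
    """Build (but do not send) an ESearch URL that would return PubMed IDs.

    `term` and the date range are hypothetical inputs, not the author's query.
    """
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # free-text query
        "datetype": "pdat",  # filter on publication date
        "mindate": mindate,  # YYYY/MM/DD
        "maxdate": maxdate,  # YYYY/MM/DD
        "retmax": retmax,    # maximum number of IDs returned
        "retmode": "json",   # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("disease", "2010/01/01", "2010/01/31")
print(url)
```

When such a URL is actually requested, NCBI asks clients to throttle their calls (roughly three requests per second without an API key), so a real download loop would need a delay between requests.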

**About PubMed**<br>
PubMed is a database that indexes and provides access to biomedical and life sciences literature, including scientific research articles, clinical studies, and reviews. Scientists are among its primary users: they rely on it to find relevant literature and to stay up to date with developments in their field of study.

PubMed plays a crucial role in the scientific community by providing access to a vast amount of research literature.

# Import Libraries & Data

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [2]:
df = pd.read_csv('/kaggle/input/id-df-downld/id_df.csv')
df.head()
len(df)

Unnamed: 0,Date,ID
0,2010/01/01,33897310
1,2010/01/01,33467806
2,2010/01/01,32809304
3,2010/01/01,32688721
4,2010/01/01,32288940


2699268

# Randomly selecting 3000 article IDs per month

> We cap the sample at 360K article IDs (up to 3,000 per month × 12 months × 10 years) whose content will be scraped later.
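
The 360K figure follows from the per-month cap and the date range of the dataset. The short check below spells out that arithmetic; the constant names are illustrative, and the year range assumes the data runs from 2010 through 2019, as the `Date` column suggests.

```python
# Upper bound on the sample size: per-month cap times the number of months.
PER_MONTH_CAP = 3000
FIRST_YEAR, LAST_YEAR = 2010, 2019  # assumed span of the 'Date' column
TOTAL_ROWS = 2699268                # len(df) reported earlier in the notebook

n_months = (LAST_YEAR - FIRST_YEAR + 1) * 12
max_sample_size = PER_MONTH_CAP * n_months
max_fraction = max_sample_size / TOTAL_ROWS

print(n_months)         # 120
print(max_sample_size)  # 360000
print(f"{max_fraction:.1%} of all rows at most")
```

The result is an upper bound: any month with fewer than 3,000 articles contributes all of its rows and nothing more.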

In [3]:
df.rename(columns={'Date': 'Date_str'},inplace=True)
df['Date'] = pd.to_datetime(df['Date_str'])
df

Unnamed: 0,Date_str,ID,Date
0,2010/01/01,33897310,2010-01-01
1,2010/01/01,33467806,2010-01-01
2,2010/01/01,32809304,2010-01-01
3,2010/01/01,32688721,2010-01-01
4,2010/01/01,32288940,2010-01-01
...,...,...,...
2699263,2019/12/31,31883521,2019-12-31
2699264,2019/12/31,31883520,2019-12-31
2699265,2019/12/31,31883515,2019-12-31
2699266,2019/12/31,31883514,2019-12-31


In [4]:
# Extract month and year from the 'Date' column into new 'Month' and 'Year' columns
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

# Group by 'Year' and 'Month' columns
grouped = df.groupby(['Year', 'Month'])

# Collect the randomly selected articles of each group
sampled_groups = []

# Loop through each year-month group
for (year, month), group in grouped:
    # Randomly select up to 3000 articles using the 'sample()' function
    sampled_group = group.sample(n=min(3000, len(group)))
    # Keep the sampled group for concatenation
    sampled_groups.append(sampled_group)

# DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
sampled_df = pd.concat(sampled_groups)

# 'sampled_df' contains up to 3000 randomly selected articles for each month of each year
sampled_df

Unnamed: 0,Date_str,ID,Date,Month,Year
3141,2010/01/01,20042657,2010-01-01,1,2010
4701,2010/01/06,20048405,2010-01-06,1,2010
172,2010/01/01,26966632,2010-01-01,1,2010
17630,2010/01/30,20111061,2010-01-30,1,2010
14175,2010/01/23,20092734,2010-01-23,1,2010
...,...,...,...,...,...
2686078,2019/12/15,31835505,2019-12-15,12,2019
2691434,2019/12/21,31859835,2019-12-21,12,2019
2689531,2019/12/19,31849011,2019-12-19,12,2019
2684909,2019/12/14,31832766,2019-12-14,12,2019
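
The per-month sampling can also be written without an explicit loop, and with a fixed seed so that reruns select the same rows. The sketch below runs on a small synthetic frame; the values, the cap of 2, and the seed are assumptions for illustration and are not taken from the notebook's data.

```python
import pandas as pd

# Tiny synthetic stand-in for the real frame (values are made up)
toy = pd.DataFrame({
    "ID": range(1, 11),
    "Year": [2010] * 6 + [2011] * 4,
    "Month": [1, 1, 1, 1, 2, 2, 1, 1, 1, 3],
})

CAP = 2  # per-month cap for this toy example; the notebook uses 3000

# Shuffle all rows once with a fixed seed, then keep the first CAP rows
# of each (Year, Month) group. Groups smaller than CAP are kept whole.
toy_sampled = (
    toy.sample(frac=1, random_state=0)
       .groupby(["Year", "Month"])
       .head(CAP)
)

# Check that no month exceeds the cap
group_sizes = toy_sampled.groupby(["Year", "Month"]).size()
print(group_sizes)
```

Because the rows are uniformly shuffled first, taking the head of each group yields a uniform random subset of that group, which matches what the per-group `sample()` call does.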


In [5]:
# Group the dataframe by 'Year'
grouped = sampled_df.groupby('Year')

<font size=5>Saving the article IDs into individual CSV files, one per publication year.</font>

In [6]:
# Loop through each year group
for year, group in grouped:
    # Export the group (all columns, including 'ID') to a CSV file named 'IDs_<year>.csv'
    filename = f'IDs_{year}.csv'
    group.to_csv(filename, index=False)
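
The loop above writes every column of each yearly group. If only the IDs are needed downstream, a variant might select the `ID` column before writing. The sketch below is a minimal example on made-up rows; the temporary output folder and the toy values are assumptions, not part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for 'sampled_df'
toy_sampled = pd.DataFrame({
    "ID": [101, 102, 201],
    "Year": [2010, 2010, 2011],
})

out_dir = tempfile.mkdtemp()  # hypothetical output folder for this sketch
written = []

for year, group in toy_sampled.groupby("Year"):
    path = os.path.join(out_dir, f"IDs_{year}.csv")
    # Write only the 'ID' column, without the index
    group[["ID"]].to_csv(path, index=False)
    written.append(path)

# Read one file back to confirm its contents
ids_2010 = pd.read_csv(written[0])["ID"].tolist()
print(ids_2010)
```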