# Downloading papers using PyAlex 

Some suggestions from the first checkpoint: 
- Use synonyms to broaden the search on OpenAlex
- Wikipedia one is not the best
- Uncased is the best
- Around 1000

In [1]:
pip install pyalex pandas

Collecting pyalex
  Downloading pyalex-0.18-py3-none-any.whl (13 kB)
Installing collected packages: pyalex
Successfully installed pyalex-0.18
Note: you may need to restart the kernel to use updated packages.


In [15]:
import pandas as pd
from pyalex import Works, Authors, Sources, Institutions, Topics

In [None]:
# Test query: fetch 5 papers related to AI
# The page size is passed by keyword (per_page), not as a positional argument
sample_papers = Works().search("artificial intelligence").get(per_page=5)

# Print one sample paper
# print(sample_papers[0])

# Test query worked


Let's define our search criteria now that our test query worked. 

Topic: AI for pricing and promotion for GDO 

Date Range: Last 10 years (2014-2024)

Fields to extract:
- id: ArXiv ID
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info (e.g., number of pages and figures)
- Journ-ref: Information about the journal
- doi: Digital Object Identifier
- abstract: The abstract of the paper
- versions: A version history

Useful links: https://docs.openalex.org/api-entities/entities-overview and https://github.com/J535D165/pyalex

In [None]:
# First search parameters 
# Define search parameters
# query = "((artificial intelligence OR AI) AND (pricing OR promotion)) AND (retail OR supermarkets OR 'large-scale distribution' OR GDO)"
# Got 25 articles 

# # Define broader search parameters
# query = """
# (
#   (artificial intelligence OR AI OR "machine learning" OR "deep learning" OR "natural language processing" OR "predictive analytics")
#   AND
#   (pricing OR promotion OR discount OR "price optimization" OR "sales forecasting" OR marketing)
# )
# AND
# (
#   retail OR supermarket OR "large-scale distribution" OR GDO OR "supply chain" OR ecommerce OR "FMCG" OR grocery OR commerce
# )
# """
# Still got 25 articles? 

# Even broader query
query = """
(
  "artificial intelligence" OR AI OR "machine learning" OR ML OR "deep learning" OR DL OR 
  "natural language processing" OR NLP OR "predictive modeling" OR "data mining" OR 
  "data science" OR "neural networks" OR "transformer models" OR "language models" OR 
  "recommendation systems" OR "generative AI" OR "unsupervised learning" OR "supervised learning"
)
AND
(
  pricing OR promotion OR discount OR "price optimization" OR "dynamic pricing" OR 
  "sales prediction" OR "revenue management" OR "price elasticity" OR "demand forecasting" OR 
  marketing OR "campaign optimization" OR "consumer behavior" OR "targeting" OR "personalization"
)
AND
(
  retail OR supermarket OR "large-scale distribution" OR GDO OR "grocery stores" OR 
  "e-commerce" OR ecommerce OR "supply chain" OR "consumer goods" OR "FMCG" OR 
  "wholesale" OR "shopping behavior" OR "omnichannel" OR "online retail" OR "brick and mortar"
)
"""


# Search parameters 
year_start = 2014
year_end = 2024

# Fetch papers from OpenAlex
# Note: .get() returns only the first page of results (25 by default)
papers = (
    Works()
    .search(query)
    .filter(
        from_publication_date=f"{year_start}-01-01",
        to_publication_date=f"{year_end}-12-31",
    )
    .get()
)

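# Reconstruct the abstract from the abstract_inverted_index format,
# which maps each word to the list of positions where it occurs
def reconstruct_abstract(abstract_index):
    if not abstract_index:
        return None
    # Place every occurrence of every word at its position, so repeated words are kept
    positioned = [
        (position, word)
        for word, positions in abstract_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))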

# Extract relevant information safely
data = []
for paper in papers:
    primary_location = paper.get("primary_location") or {}  # Key may be present but None
    source = primary_location.get("source")  # Might be None
    source_name = source["display_name"] if source else "Unknown"
    
    data.append({
        "id": paper.get("id"),  # OpenAlex ID (a URL such as https://openalex.org/W...)
        "submitter": paper.get("submitter"),  # Who submitted the paper (if available)
        "authors": ", ".join([author["author"]["display_name"] 
                              for author in paper.get("authorships", []) 
                              if author.get("author")]),
        "title": paper.get("title"),
        "comments": paper.get("comments"),  # Additional info (e.g., pages, figures)
        "journ_ref": paper.get("journal_reference"),  # Journal reference info, if available
        "doi": paper.get("doi"),  # Digital Object Identifier
        "abstract": reconstruct_abstract(paper.get("abstract_inverted_index")),
        "versions": paper.get("versions"),  # Version history
        "year": paper.get("publication_year"),
        "journal": source_name,  # Journal name extracted from the source
        "keywords": paper.get("keywords", []),
        "topics": [topic["display_name"] for topic in paper.get("topics", [])],
    })

# Convert to DataFrame
df = pd.DataFrame(data)


Something is going wrong with the code, because we're getting very few articles. 
- With query "AI AND (pricing OR promotion) AND GDO" we get 21 articles. 
- With query "AI AND pricing AND promotion AND GDO" we get 5 articles. 
- With query "((artificial intelligence OR AI) AND (pricing OR promotion)) AND (retail OR supermarkets OR 'large-scale distribution' OR GDO)" we get 25 articles. 

Roughly speaking, how many articles should we be able to retrieve? 
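
One way to answer this without downloading anything is to ask OpenAlex for the total number of matches. The sketch below assumes the installed PyAlex version provides the `count()` method described in its README. The helper names `date_filters` and `count_matches` are illustrative and not part of PyAlex.

```python
def date_filters(year_start, year_end):
    """Build the publication-date filters used throughout this notebook."""
    return {
        "from_publication_date": f"{year_start}-01-01",
        "to_publication_date": f"{year_end}-12-31",
    }


def count_matches(query, year_start, year_end):
    """Return how many works OpenAlex reports for the query (needs network access)."""
    # Imported here so date_filters can be used without PyAlex installed
    from pyalex import Works

    return Works().search(query).filter(**date_filters(year_start, year_end)).count()


# Example call (hypothetical query, requires network access):
# count_matches('"dynamic pricing" AND retail', 2014, 2024)
```

If the count is well above 25 while the DataFrame holds exactly 25 rows, the results are being truncated rather than the query being too narrow.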

What is GDO? 
GDO stands for "Grande Distribuzione Organizzata." In English, it is often translated as "organized large-scale distribution" or "organized retail," referring to large retail chains such as supermarkets and hypermarkets. 


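The previous code was wrong because it never moved past the first page: `.get()` returns a single page of results (25 by default), which is why the broader queries kept coming back with exactly 25 articles. 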

In [None]:
# run this in your terminal 
# pip install pyalex


In [16]:
!pip install pyalex




In [None]:
# from pyalex import Works
# import pandas as pd
# import os

# # Set polite OpenAlex usage email
# os.environ["OPENALEX_EMAIL"] = "your.email@example.com"

# # Search parameters
# query = "((artificial intelligence OR AI) AND (pricing OR promotion)) AND (retail OR supermarkets OR 'large-scale distribution' OR GDO)"
# year_start = 2014
# year_end = 2024

# # Paginate manually
# cursor = "*"
# per_page = 50
# max_results = 500  # optional limit
# retrieved = 0
# data = []

# while cursor and retrieved < max_results:
#     response = Works().filter(
#         search=query,
#         from_publication_date=f"{year_start}-01-01",
#         to_publication_date=f"{year_end}-12-31"
#     ).paginate(per_page=per_page, cursor=cursor)

#     results = response['results']
#     cursor = response.get('meta', {}).get('next_cursor')
    
#     for paper in results:
#         def reconstruct_abstract(abstract_index):
#             if not abstract_index:
#                 return None
#             words = []
#             for key, positions in sorted(abstract_index.items(), key=lambda x: min(x[1])):
#                 words.append(key)
#             return " ".join(words)

#         venue = paper.get("host_venue", {})
#         source_name = venue.get("display_name", "Unknown")

#         data.append({
#             "id": paper.get("id"),
#             "title": paper.get("title"),
#             "abstract": reconstruct_abstract(paper.get("abstract_inverted_index")),
#             "authors": ", ".join([a["author"]["display_name"] for a in paper.get("authorships", []) if "author" in a]),
#             "year": paper.get("publication_year"),
#             "journal": source_name,
#             "doi": paper.get("doi"),
#             "topics": [t["display_name"] for t in paper.get("topics", [])],
#         })

#         retrieved += 1
#         if retrieved >= max_results:
#             break

# # Convert to DataFrame
# df = pd.DataFrame(data)
# print(f"✅ Total articles retrieved: {len(df)}")


TypeError: 'Paginator' object is not subscriptable
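
The error arises because `paginate()` returns an iterator of pages rather than a dictionary, so `response['results']` cannot work. Below is a minimal sketch of the loop, assuming `paginate(per_page=..., n_max=...)` behaves as described in the PyAlex README, where each iteration yields one page as a list of works. `collect_pages` and `fetch_works` are hypothetical helper names.

```python
def collect_pages(pages, max_results=None):
    """Flatten an iterable of pages (lists of works) into one list, up to max_results."""
    works = []
    for page in pages:
        for work in page:
            if max_results is not None and len(works) >= max_results:
                return works
            works.append(work)
    return works


def fetch_works(query, year_start, year_end, max_results=1000):
    """Page through the matches with cursor pagination (needs network access)."""
    from pyalex import Works

    pager = (
        Works()
        .search(query)
        .filter(
            from_publication_date=f"{year_start}-01-01",
            to_publication_date=f"{year_end}-12-31",
        )
        .paginate(per_page=200, n_max=max_results)
    )
    return collect_pages(pager, max_results)
```

Calling `papers = fetch_works(query, year_start, year_end)` would then feed the same extraction loop as before. For the contact email, the PyAlex README sets `pyalex.config.email` rather than an environment variable.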

In [13]:
# Display the number of articles in the dataframe
num_articles = len(df)
num_articles

25

In [9]:
display(df)

Unnamed: 0,id,submitter,authors,title,comments,journ_ref,doi,abstract,versions,year,journal,keywords,topics
0,https://openalex.org/W1901616594,,"Michael I. Jordan, Tom M. Mitchell","Machine learning: Trends, perspectives, and pr...",,,https://doi.org/10.1126/science.aaa8415,Machine learning addresses the question of how...,[],2015,Science,"[{'id': 'https://openalex.org/keywords/lying',...",[Anomaly Detection Techniques and Applications...
1,https://openalex.org/W3089252064,,"Reza Toorajipour, Vahid Sohrabpour, Ali Nazarp...",Artificial intelligence in supply chain manage...,,,https://doi.org/10.1016/j.jbusres.2020.09.009,This paper seeks to identify the contributions...,[],2020,Journal of Business Research,[{'id': 'https://openalex.org/keywords/scienti...,"[Quality and Supply Management, Management and..."
2,https://openalex.org/W4220820301,,Jeffrey Dastin,Amazon Scraps Secret AI Recruiting Tool that S...,,,https://doi.org/10.1201/9781003278290-44,Automation has been key to Amazon's e-commerce...,[],2022,Auerbach Publications eBooks,[],[Digital Economy and Work Transformation]
3,https://openalex.org/W4387379065,,"Keng‐Boon Ooi, Garry Wei‐Han Tan, Mostafa Al‐E...",The Potential of Generative Artificial Intelli...,,,https://doi.org/10.1080/08874417.2023.2261010,ABSTRACTIn a short span of time since its intr...,[],2023,Journal of Computer Information Systems,[{'id': 'https://openalex.org/keywords/generat...,"[AI in Service Interactions, Artificial Intell..."
4,https://openalex.org/W2964362517,,"Rupa Dash, Mark E. McMurtrey, Carl Rebman, Upe...",Application of Artificial Intelligence in Auto...,,,https://doi.org/10.33423/jsis.v14i3.2105,A well-functioning supply chain is a key to su...,[],2019,Journal of Strategic Innovation and Sustainabi...,[],"[Internet of Things and AI, Blockchain Technol..."
5,https://openalex.org/W4386077567,,"Hanadi A. Salhab, Mahmoud Allahham, Ibrahim A....","Inventory competition, artificial intelligence...",,,https://doi.org/10.5267/j.uscm.2023.8.009,This research examines the synergistic influen...,[],2023,Uncertain Supply Chain Management,[],"[Quality and Supply Management, Digital Transf..."
6,https://openalex.org/W3007552940,,"Anandakumar Haldorai, Suriya Murugan, Arulmuru...","Evolution, challenges, and application of inte...",,,https://doi.org/10.1002/cae.22217,Abstract Artificial intelligence (AI) aims at ...,[],2020,Computer Applications in Engineering Education,[{'id': 'https://openalex.org/keywords/investm...,"[Smart Systems and Machine Learning, Big Data ..."
7,https://openalex.org/W3109960748,,"Vahid Sohrabpour, Pejvak Oghazi, Reza Toorajip...",Export sales forecasting using artificial inte...,,,https://doi.org/10.1016/j.techfore.2020.120480,Sales forecasting is important in production a...,[],2020,Technological Forecasting and Social Change,[{'id': 'https://openalex.org/keywords/sales-f...,"[Evolutionary Algorithms and Applications, Mar..."
8,https://openalex.org/W2913100501,,"K.H. Leung, Chris Luk, K.L. Choy, H.Y. Lam, C....",A B2B flexible pricing decision support system...,,,https://doi.org/10.1080/00207543.2019.1566674,"In the era of digitalisation, e-commerce retai...",[],2019,International Journal of Production Research,[{'id': 'https://openalex.org/keywords/dynamic...,"[Big Data and Business Intelligence, Consumer ..."
9,https://openalex.org/W3202393140,,"Saurabh Sharma, Vijay Kumar Gahlawat, Kumar Ra...",Sustainable Innovations in the Food Industry t...,,,https://doi.org/10.3390/logistics5040066,The agri-food sector is an endless source of e...,[],2021,Logistics,[],"[Smart Agriculture and AI, Food Supply Chain T..."


In [11]:
# Save to CSV
# You might need to change the path 
dionne_path = "/Users/dionnespaltman/Desktop/Luiss /Data Science in Action/Project/openalex_papers.csv"
df.to_csv(dionne_path, index=False)

print("Extracted", len(df), "papers and saved them to 'openalex_papers.csv'.")

Extracted 25 papers and saved them to 'openalex_papers.csv'.
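
The `keywords` column holds lists of dictionaries, which `to_csv` writes out as their Python string representation. A possible way to make those cells easier to read back is sketched below. It assumes each keyword dictionary carries a `display_name` key, as OpenAlex keyword objects do at the time of writing. `keyword_names` and `join_names` are hypothetical helpers.

```python
def keyword_names(keywords):
    """Return the display names from a list of OpenAlex keyword dicts."""
    return [kw["display_name"] for kw in (keywords or []) if kw.get("display_name")]


def join_names(names, sep="; "):
    """Join a list of strings into a single CSV-friendly cell."""
    return sep.join(names or [])


# Possible use before saving (df is the DataFrame built earlier):
# df["keywords"] = df["keywords"].apply(keyword_names).apply(join_names)
# df["topics"] = df["topics"].apply(join_names)
```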
