# What Causes More Scientific Discoveries in Short Time

## Data Scrape

- **Creating Author:** Balam, Yanheng Liu, Foster
- **Latest Modification:** 20-03-2025  
- **Modification Author:** Yanheng Liu  
- **E-mail:** [yanheng.liu@etu.sorbonne-universite.fr](mailto:yanheng.liu@etu.sorbonne-universite.fr)  
- **Version:** 1.1  

---

This notebook contains the data-scraping code for the project in the *DALAS* course.


Check whether the required packages are installed in the environment, and install any that are missing.

In [5]:
import re
import subprocess
import sys

import pkg_resources

# Read the package list from requirements.txt, skipping blank lines and comments
with open("../../requirements.txt", "r") as file:
    packages = [line.strip() for line in file if line.strip() and not line.strip().startswith("#")]

# Collect the names of the currently installed packages (lowercase, as reported by pkg_resources)
installed_packages = {pkg.key for pkg in pkg_resources.working_set}

# Check each requirement and install the missing ones
for package in packages:
    # Drop any version specifier ("==", ">=", "~=", ...), extras, or marker to get the bare name
    pkg_name = re.split(r"[<>=!~;\[\s]", package, maxsplit=1)[0].lower().replace("_", "-")
    if pkg_name not in installed_packages:
        print(f"Installing missing package: {package}")
        try:
            # Call pip through the running interpreter so the package lands in this environment
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        except subprocess.CalledProcessError as e:
            print(f"Failed to install {package}. Error: {e}")
    else:
        print(f"Already installed: {package}")


Already installed: requests
Already installed: beautifulsoup4
Already installed: pandas
Already installed: tabulate
Already installed: pdfplumber
Already installed: lxml
Already installed: pandas
Already installed: rapidfuzz
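
`pkg_resources` is deprecated in recent setuptools releases. As a possible alternative, here is a minimal sketch that uses only the standard library. The helper names `requirement_name` and `is_installed` are hypothetical and not part of the project, and the sample requirement lines are illustrative.

```python
import re
from importlib import metadata


def requirement_name(line):
    """Return the normalized distribution name from a requirements line, or None."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", line)
    if match is None:  # blank line, comment, or an option such as "-r other.txt"
        return None
    # Normalize as package indexes do (PEP 503): runs of "-", "_", "." become "-"
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def is_installed(name):
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


# Illustrative requirement lines (assumed, not read from the project's file)
lines = ["requests", "beautifulsoup4==4.12.3", "pandas>=2.0", "# a comment", ""]
names = [n for n in map(requirement_name, lines) if n]
print(names)
```

Each name in `names` can then be passed to `is_installed` before deciding whether to call pip.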


---

## Balam's Scrape Task

Below is Balam's part of the web scraping for the project.

---


In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tabulate import tabulate
import pdfplumber

url = "https://unacademy.com/content/railway-exam/study-material/general-awareness/inventions-discoveries/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    
    table = soup.find("table")
    
    if table:
        # Use <th> cells as column names when the table has them
        header_tags = table.find_all("th")
        if header_tags:
            columns = [th.text.strip() for th in header_tags]
        else:
            # Otherwise fall back to generic labels ("Columna" is Spanish for "Column"),
            # one per cell of the first data row
            columns = [f"Columna {i+1}" for i in range(len(table.find_all("tr")[1].find_all("td")))]
        
        # Skip the first row (the header row) and keep every non-empty data row
        rows = []
        for tr in table.find_all("tr")[1:]:
            cells = [td.text.strip() for td in tr.find_all("td")]
            if cells:
                rows.append(cells)
        
        df = pd.DataFrame(rows, columns=columns)
        
        print(tabulate(df, headers='keys', tablefmt='grid'))
    else:
        print("Table not found on the page.")
else:
    print("Error accessing the page:", response.status_code)



+----+-----------------------+--------------------------------------------------+-------------+
|    | Columna 1             | Columna 2                                        | Columna 3   |
|  0 | Automatic Calculator  | Wilhelm Schickard                                | 1623        |
+----+-----------------------+--------------------------------------------------+-------------+
|  1 | Air Conditioner       | Willis Carrier                                   | 1902        |
+----+-----------------------+--------------------------------------------------+-------------+
|  2 | Anemometer            | Leon Battista Alberti                            | 1450        |
+----+-----------------------+--------------------------------------------------+-------------+
|  3 | Animation             | J. Stuart Blackton                               | —           |
+----+-----------------------+--------------------------------------------------+-------------+
|  4 | Atom Bomb             | Julius Ro
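
The scraping cells in this section repeat the same table-parsing steps. A minimal sketch of how they could be factored into one helper follows. It assumes that the first row of the table holds the column titles, whether they are `<th>` or `<td>` cells. The name `table_to_rows` is hypothetical, and the inline HTML is a small stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Split a <table> tag into (header, rows), treating the first row as the header."""
    rows = []
    for tr in table.find_all("tr"):
        # Join nested text fragments with a space so words are not glued together
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Stand-in markup; a real run would parse response.text instead
html = """
<table>
  <tr><td>Invention/Discovery</td><td>Name of the Inventor</td><td>Year of Invention</td></tr>
  <tr><td>Automatic Calculator</td><td>Wilhelm Schickard</td><td>1623</td></tr>
  <tr><td>Air Conditioner</td><td>Willis Carrier</td><td>1902</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
header, rows = table_to_rows(soup.find("table"))
print(header)
print(rows)
```

The returned pair can be handed to `pd.DataFrame(rows, columns=header)`, which would give real column titles in place of generic labels.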

In [3]:
url = "https://www.ipoi.gov.ie/en/student-teacher-zone/inventions-a-z/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    
    content_wrapper = soup.find("div", id="ContentWrapper")
    
    if content_wrapper:
        tables = content_wrapper.find_all("table")

        all_data = []

        for table in tables:
            # Use <th> cells as column names when the table has them
            header_tags = table.find_all("th")
            if header_tags:
                columns = [th.text.strip() for th in header_tags]
            else:
                # Otherwise fall back to generic labels ("Columna" is Spanish for "Column")
                columns = [f"Columna {i+1}" for i in range(len(table.find_all("tr")[1].find_all("td")))]
            
            # Skip the first row (the header row) and keep every non-empty data row
            rows = []
            for tr in table.find_all("tr")[1:]:
                cells = [td.text.strip() for td in tr.find_all("td")]
                if cells:
                    rows.append(cells)

            df = pd.DataFrame(rows, columns=columns)
            all_data.append(df)

        if all_data:
            # Stack the per-table frames into a single table
            final_df = pd.concat(all_data, ignore_index=True)
            
            print(tabulate(final_df, headers='keys', tablefmt='grid'))
        else:
            print("No tables found on the page.")
    else:
        print("Main container not found.")
else:
    print("Error accessing the page:", response.status_code)



+----+---------------------------+--------------------------+-------------+
|    | Columna 1                 | Columna 2                | Columna 3   |
|  0 | AdhesiveTape              | RichardDrew              | 1930        |
+----+---------------------------+--------------------------+-------------+
|  1 | Airplane                  | Orville&WilburWright     | 1930        |
+----+---------------------------+--------------------------+-------------+
|  2 | AlcoholThermometer        | GabrielFahrenheit        | 1709        |
+----+---------------------------+--------------------------+-------------+
|  3 | AstroTurf                 | J.Faria&R.Wright         | 1965        |
+----+---------------------------+--------------------------+-------------+
|  4 | BallpointPen              | LaszloBiro               | 1938        |
+----+---------------------------+--------------------------+-------------+
|  5 | Barometer(mercury)        | EvangelistaTorricelli    | 1643        |
+----+------
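
Several cells in the output above have lost the spaces between words, for example `AdhesiveTape` and `Orville&WilburWright`. One possible cleanup is sketched below. It is a heuristic built on regular expressions, not the project's method, and the name `restore_spaces` is hypothetical. It would also split legitimate mixed-case names such as "McDonald", so the results need a manual check.

```python
import re


def restore_spaces(text):
    """Heuristically re-insert spaces that were lost between words."""
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # "AdhesiveTape" -> "Adhesive Tape"
    text = re.sub(r"(?<=\.)(?=[A-Z])", " ", text)     # "J.Faria" -> "J. Faria"
    text = re.sub(r"\s*&\s*", " & ", text)            # "A&B" -> "A & B"
    text = re.sub(r"(?<=\w)\(", " (", text)           # "Barometer(mercury)" -> "Barometer (mercury)"
    return text


# Values copied from the scraped output above
samples = ["AdhesiveTape", "Orville&WilburWright", "J.Faria&R.Wright", "Barometer(mercury)"]
for sample in samples:
    print(sample, "->", restore_spaces(sample))
```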

In [4]:
url = "https://www.adda247.com/defence-jobs/important-inventions-and-their-inventors/"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    
    tables = soup.find_all("table")

    all_data = []

    # Keep only the second and third tables when the page has more than two; otherwise keep all
    selected_tables = tables[1:3] if len(tables) > 2 else tables

    for table in selected_tables:
        rows = table.find_all("tr")  
        table_data = []

        for row in rows:
            cols = row.find_all(["td", "th"]) 
            cols = [col.text.strip() for col in cols]  
            if cols:  
                table_data.append(cols)

        if table_data:
            all_data.extend(table_data)

    df = pd.DataFrame(all_data)

    print(tabulate(df, headers="firstrow", tablefmt="grid"))

else:
    print("Error retrieving the page:", response.status_code)


+-----+-----------------------+-----------------------------+---------------------+
|   0 | Invention/Discovery   | Name of the Inventor        | Year of Invention   |
|   1 | Automatic Calculator  | Wilhelm Schickard           | 1623                |
+-----+-----------------------+-----------------------------+---------------------+
|   2 | Air Conditioner       | Willis Carrier              | 1902                |
+-----+-----------------------+-----------------------------+---------------------+
|   3 | Anemometer            | Leon Battista Alberti       | 1450                |
+-----+-----------------------+-----------------------------+---------------------+
|   4 | Animation             | J. Stuart Blackton          | —                   |
+-----+-----------------------+-----------------------------+---------------------+
|   5 | Atom Bomb             | Julius Robert Oppenheimer   | 1945                |
+-----+-----------------------+-----------------------------+---------------
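
The year column arrives as text, and missing years show up as placeholders such as `—` or `-`. Any analysis over time needs numeric years first, so a minimal conversion sketch follows. The name `parse_year` is hypothetical, and the sample values are taken from the outputs in this notebook.

```python
import re


def parse_year(cell):
    """Return the year as an int, or None for placeholders such as "—" or "-"."""
    text = (cell or "").strip()
    # Accept plain 1 to 4 digit years only; anything else counts as missing
    return int(text) if re.fullmatch(r"[0-9]{1,4}", text) else None


raw_years = ["1623", "1902", "1450", "—", "1945", "-"]
years = [parse_year(value) for value in raw_years]
print(years)
```

On a DataFrame column, `pd.to_numeric(column, errors="coerce")` gives a similar result, with `NaN` in place of `None`.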

In [5]:
pdf_url = "https://cdn1.byjus.com/wp-content/uploads/2020/06/List-of-Important-Inventions-Discoveries.pdf"

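pdf_response = requests.get(pdf_url)
pdf_response.raise_for_status()  # Stop here if the download failed
pdf_path = "inventions_discoveries.pdf"

# Save the PDF locally so pdfplumber can open it
with open(pdf_path, "wb") as file:
    file.write(pdf_response.content)

table_data = []

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                # Keep three-column rows and drop header rows; an empty cell
                # can come back as None, hence the "or" fallback
                if row and len(row) == 3 and "Invention/Discovery" not in (row[0] or ""):
                    table_data.append(row)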

headers = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]
print(tabulate(table_data, headers=headers, tablefmt="grid"))



+----------------------------------+-----------------------------------------+---------------------+
| Invention/Discovery              | Name of the Inventor                    | Year of Invention   |
| List of Inventions & Discoveries |                                         |                     |
+----------------------------------+-----------------------------------------+---------------------+
| Automatic Calculator             | Wilhelm Schickard                       | 1623                |
+----------------------------------+-----------------------------------------+---------------------+
| Air Conditioner                  | Willis Carrier                          | 1914                |
+----------------------------------+-----------------------------------------+---------------------+
| Amplitude Modulation             | Reginald Fessenden                      | -                   |
+----------------------------------+-----------------------------------------+-------------
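
The sources do not always agree: the outputs above list the Air Conditioner under 1902 in the web table and under 1914 in the PDF. A minimal sketch for flagging such disagreements before merging is shown below. It assumes each source is a list of three-cell rows and matches invention names exactly after lowercasing and collapsing whitespace. The name `find_year_conflicts` and the source labels are hypothetical.

```python
from collections import defaultdict


def find_year_conflicts(sources):
    """Return the inventions whose year differs between sources."""
    seen = defaultdict(dict)
    for source_name, rows in sources.items():
        for invention, _inventor, year in rows:
            key = " ".join(invention.lower().split())  # normalize case and spacing
            seen[key][source_name] = year.strip()
    # Keep only the inventions for which more than one distinct year was recorded
    return {name: years for name, years in seen.items() if len(set(years.values())) > 1}


# Rows copied from the outputs above; the labels are illustrative
sources = {
    "adda247": [
        ["Automatic Calculator", "Wilhelm Schickard", "1623"],
        ["Air Conditioner", "Willis Carrier", "1902"],
    ],
    "byjus_pdf": [
        ["Automatic Calculator", "Wilhelm Schickard", "1623"],
        ["Air Conditioner", "Willis Carrier", "1914"],
    ],
}
conflicts = find_year_conflicts(sources)
print(conflicts)
```

Exact matching misses spelling variants of the same invention. Fuzzy matching could narrow that gap but is left out here to keep the sketch dependency-free.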

---

## Yanheng Liu's Scrape Task

Below is Yanheng Liu's part of the web scraping for the project.

---
