# What Causes More Scientific Discoveries in a Short Time

## Data Scrape

- **Creating Author:** Balam, Yanheng Liu, Foster
- **Latest Modification:** 20-03-2025  
- **Modification Author:** Yanheng Liu  
- **E-mail:** [yanheng.liu@etu.sorbonne-universite.fr](mailto:yanheng.liu@etu.sorbonne-universite.fr)  
- **Version:** 1.1  

---

This notebook scrapes the data used for the project in the *DALAS* course.


Check whether the required packages are installed in the environment, and install any that are missing.

In [69]:
import re
import subprocess
import sys
from importlib.metadata import PackageNotFoundError, version

# Read the package list from requirements.txt (the path is relative to the notebook's working directory)
with open("../../requirements.txt", "r") as file:
    packages = [line.strip() for line in file if line.strip() and not line.strip().startswith("#")]

# Check each package and install the missing ones
for package in packages:
    # Keep only the distribution name, dropping version specifiers, extras, and markers
    pkg_name = re.split(r"[<>=!~;\[\s]", package, maxsplit=1)[0]
    try:
        version(pkg_name)
        print(f"Already installed: {package}")
    except PackageNotFoundError:
        print(f"Installing missing package: {package}")
        try:
            # Run pip through the current interpreter so it targets this environment
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        except subprocess.CalledProcessError as e:
            print(f"Failed to install {package}. Error: {e}")


FileNotFoundError: [Errno 2] No such file or directory: '../../requirements.txt'
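
The `FileNotFoundError` above indicates that the relative path `../../requirements.txt` did not resolve from the notebook's working directory. One possible workaround, sketched below, is to search the current directory and its parents for the file instead of hard-coding the depth. The helper name `find_requirements` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def find_requirements(start: Path, filename: str = "requirements.txt") -> Optional[Path]:
    """Return the nearest `filename` in `start` or one of its parents, else None."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


# Hypothetical usage: search upward from the current working directory.
requirements_path = find_requirements(Path.cwd())
if requirements_path is None:
    print("requirements.txt not found in this directory or any parent")
else:
    print(f"Using {requirements_path}")
```

The returned path could then be passed to `open()` in place of the hard-coded string, assuming the file lives somewhere above the notebook in the repository tree.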

---

## Balam's Scrape Task

Below is the web scraping process for Balam's part of the project.

---


In [70]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tabulate import tabulate
import pdfplumber

url = "https://unacademy.com/content/railway-exam/study-material/general-awareness/inventions-discoveries/"

# Browser-like User-Agent for the request
request_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=request_headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")

    # Take the first table on the page
    table = soup.find("table")

    if table:
        # Use the table's own header cells as column names, or fall back to defaults
        header_tags = table.find_all("th")
        if header_tags:
            column_names = [th.text.strip() for th in header_tags]
        else:
            column_names = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]

        # Collect the data rows, skipping the first (header) row
        rows = []
        for tr in table.find_all("tr")[1:]:
            cells = [td.text.strip() for td in tr.find_all("td")]
            if cells:
                rows.append(cells)

        df = pd.DataFrame(rows, columns=column_names)

        print(tabulate(df, headers="keys", tablefmt="grid"))
    else:
        print("Table not found")
else:
    print("Error accessing the website:", response.status_code)



+----+-----------------------+--------------------------------------------------+---------------------+
|    | Invention/Discovery   | Name of the Inventor                             | Year of Invention   |
|  0 | Automatic Calculator  | Wilhelm Schickard                                | 1623                |
+----+-----------------------+--------------------------------------------------+---------------------+
|  1 | Air Conditioner       | Willis Carrier                                   | 1902                |
+----+-----------------------+--------------------------------------------------+---------------------+
|  2 | Anemometer            | Leon Battista Alberti                            | 1450                |
+----+-----------------------+--------------------------------------------------+---------------------+
|  3 | Animation             | J. Stuart Blackton                               | —                   |
+----+-----------------------+----------------------------------
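
The cell above passes the scraped rows straight to `pd.DataFrame(rows, columns=...)`, which can fail with a `ValueError` when a row holds more cells than there are column names (for example because of merged or extra cells). A minimal sketch of one way to guard against that, in plain Python so it can be tried without pandas, is to pad or truncate each row first. The helper `normalize_rows` and its `fill` parameter are hypothetical names introduced for illustration.

```python
from typing import List, Optional


def normalize_rows(
    rows: List[List[Optional[str]]],
    n_columns: int,
    fill: Optional[str] = None,
) -> List[List[Optional[str]]]:
    """Pad short rows with `fill` and cut long rows so every row has `n_columns` cells."""
    normalized = []
    for row in rows:
        if len(row) < n_columns:
            row = row + [fill] * (n_columns - len(row))
        normalized.append(row[:n_columns])
    return normalized


sample = [
    ["Automatic Calculator", "Wilhelm Schickard", "1623"],
    ["Animation", "J. Stuart Blackton"],           # one cell missing
    ["Anemometer", "Leon Battista Alberti", "1450", "extra"],  # one cell too many
]
for row in normalize_rows(sample, 3):
    print(row)
```

Truncating silently discards data, so in practice it may be preferable to log the rows that were changed before building the DataFrame.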

In [71]:
url = "https://www.ipoi.gov.ie/en/student-teacher-zone/inventions-a-z/"

# Browser-like User-Agent for the request
request_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=request_headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")

    # All tables of interest sit inside the div with id "ContentWrapper"
    content_wrapper = soup.find("div", id="ContentWrapper")

    if content_wrapper:
        tables = content_wrapper.find_all("table")

        all_data = []

        for table in tables:
            # Use the table's own header cells as column names, or fall back to defaults
            header_tags = table.find_all("th")
            if header_tags:
                column_names = [th.text.strip() for th in header_tags]
            else:
                column_names = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]

            # Collect the data rows, skipping the first (header) row
            rows = []
            for tr in table.find_all("tr")[1:]:
                cells = [td.text.strip() for td in tr.find_all("td")]
                if cells:
                    rows.append(cells)

            all_data.append(pd.DataFrame(rows, columns=column_names))

        if all_data:
            # Stack the per-table frames into one DataFrame
            df1 = pd.concat(all_data, ignore_index=True)

            print(tabulate(df1, headers="keys", tablefmt="grid"))
        else:
            print("No tables found")
    else:
        print("Container not found")
else:
    print("Error accessing the website:", response.status_code)



+----+---------------------------+--------------------------+---------------------+
|    | Invention/Discovery       | Name of the Inventor     | Year of Invention   |
|  0 | AdhesiveTape              | RichardDrew              | 1930                |
+----+---------------------------+--------------------------+---------------------+
|  1 | Airplane                  | Orville&WilburWright     | 1930                |
+----+---------------------------+--------------------------+---------------------+
|  2 | AlcoholThermometer        | GabrielFahrenheit        | 1709                |
+----+---------------------------+--------------------------+---------------------+
|  3 | AstroTurf                 | J.Faria&R.Wright         | 1965                |
+----+---------------------------+--------------------------+---------------------+
|  4 | BallpointPen              | LaszloBiro               | 1938                |
+----+---------------------------+--------------------------+---------------
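
In the output above, multi-word values have lost their spaces (`AdhesiveTape`, `Orville&WilburWright`, `J.Faria&R.Wright`). A rough heuristic for restoring them is sketched below. It assumes that a lowercase letter followed by an uppercase letter, a period followed by an uppercase letter, and an ampersand all mark word boundaries. That assumption will split legitimate names such as `AstroTurf` or `McDonald`, so the result should be reviewed rather than trusted. The function name `restore_spaces` is hypothetical.

```python
import re


def restore_spaces(text: str) -> str:
    """Re-insert spaces at likely word boundaries in a run-together string."""
    # Put exactly one space on each side of an ampersand
    text = re.sub(r"\s*&\s*", " & ", text)
    # Split where a lowercase letter is directly followed by an uppercase letter
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Split where an initial's period is directly followed by an uppercase letter
    text = re.sub(r"(?<=\.)(?=[A-Z])", " ", text)
    return text.strip()


for value in ["AdhesiveTape", "Orville&WilburWright", "J.Faria&R.Wright", "LaszloBiro"]:
    print(value, "->", restore_spaces(value))
```

With pandas, the same function could be applied to a column through `Series.map`, for instance on the invention and inventor columns of `df1`.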

In [72]:
url = "https://www.adda247.com/defence-jobs/important-inventions-and-their-inventors/"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")

    tables = soup.find_all("table")

    all_data = []

    # Use the second and third tables when the page has more than two; otherwise use all of them
    selected_tables = tables[1:3] if len(tables) > 2 else tables

    for table in selected_tables:
        rows = table.find_all("tr")
        table_data = []

        for row in rows:
            # Header cells (th) are read too, so each table's header row is kept as data here;
            # those rows are filtered out later, when the datasets are combined
            cols = row.find_all(["td", "th"])
            cols = [col.text.strip() for col in cols]
            if cols:
                table_data.append(cols)

        if table_data:
            all_data.extend(table_data)

    df2 = pd.DataFrame(all_data)
    df2.columns = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]
    print(tabulate(df2, headers="keys", tablefmt="grid"))

else:
    print("Error accessing the website:", response.status_code)


+----+-----------------------+-----------------------------+---------------------+
|    | Invention/Discovery   | Name of the Inventor        | Year of Invention   |
|  0 | Invention/Discovery   | Name of the Inventor        | Year of Invention   |
+----+-----------------------+-----------------------------+---------------------+
|  1 | Automatic Calculator  | Wilhelm Schickard           | 1623                |
+----+-----------------------+-----------------------------+---------------------+
|  2 | Air Conditioner       | Willis Carrier              | 1902                |
+----+-----------------------+-----------------------------+---------------------+
|  3 | Anemometer            | Leon Battista Alberti       | 1450                |
+----+-----------------------+-----------------------------+---------------------+
|  4 | Animation             | J. Stuart Blackton          | —                   |
+----+-----------------------+-----------------------------+---------------------+
|  5
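
Row 0 in the output above is the table's header scraped as data. A minimal sketch of filtering such rows out before building the DataFrame, written in plain Python, could look like this. The helper name `drop_header_rows` is hypothetical, and the comparison ignores case and surrounding whitespace as an assumption about how headers may vary between tables.

```python
from typing import List, Sequence


def drop_header_rows(rows: List[Sequence[str]], header: Sequence[str]) -> List[Sequence[str]]:
    """Return `rows` without any row that repeats `header` (case and padding ignored)."""
    header_key = [name.strip().lower() for name in header]
    return [
        row for row in rows
        if [str(cell).strip().lower() for cell in row] != header_key
    ]


columns = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]
scraped = [
    ["Invention/Discovery", "Name of the Inventor", "Year of Invention"],
    ["Automatic Calculator", "Wilhelm Schickard", "1623"],
    ["Air Conditioner", "Willis Carrier", "1902"],
]
cleaned = drop_header_rows(scraped, columns)
print(cleaned)
```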

In [86]:
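pdf_url = "https://cdn1.byjus.com/wp-content/uploads/2020/06/List-of-Important-Inventions-Discoveries.pdf"

pdf_response = requests.get(pdf_url)
pdf_response.raise_for_status()  # Stop early if the download failed
pdf_path = "inventions_discoveries.pdf"

# Save the PDF locally; this needs write access to the working directory
with open(pdf_path, "wb") as file:
    file.write(pdf_response.content)

# Collect three-column rows from every table, skipping repeated header rows
table_data = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                # Extracted cells can be None, so fall back to an empty string for the check
                if row and len(row) == 3 and "Invention/Discovery" not in (row[0] or ""):
                    table_data.append(row)

# Build df3 from the PDF rows only (not from the previous cell's all_data),
# and define it even when no rows were extracted
df3 = pd.DataFrame(table_data, columns=["Invention/Discovery", "Name of the Inventor", "Year of Invention"])
print(tabulate(df3, headers="keys", tablefmt="grid"))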

PermissionError: [Errno 13] Permission denied: 'inventions_discoveries.pdf'
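
A `PermissionError` like the one above often means that the target file is locked by another program (for example an open PDF viewer) or that the working directory is not writable. One possible way around it is sketched below: write the download into a fresh temporary directory instead. The helper name `save_to_temp_dir` and the `scrape_` prefix are hypothetical.

```python
import tempfile
from pathlib import Path


def save_to_temp_dir(content: bytes, filename: str) -> Path:
    """Write `content` to `filename` inside a new temporary directory and return the path."""
    target_dir = Path(tempfile.mkdtemp(prefix="scrape_"))
    target = target_dir / filename
    target.write_bytes(content)
    return target


# Placeholder bytes stand in for the downloaded PDF body in this example.
saved_path = save_to_temp_dir(b"%PDF-1.4 placeholder", "inventions_discoveries.pdf")
print(saved_path)

# Hypothetical use in the cell above:
# pdf_path = save_to_temp_dir(pdf_response.content, "inventions_discoveries.pdf")
```

The directory created by `tempfile.mkdtemp` is not removed automatically, so it should be deleted once the tables have been extracted.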

In [29]:
url = "https://www.studyiq.com/articles/inventions-and-discoveries/"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "lxml")

    entry_content = soup.find("div", class_="entry-content")

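    if not entry_content:
        print("Container 'entry-content' not found")
    else:
        tables = entry_content.find_all("table")

        # The second and third tables hold the data, so at least three tables are required
        if len(tables) < 3:
            print("Expected at least three tables, found", len(tables))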
        else:
            second_table = tables[1]  
            third_table = tables[2]  

            combined_data = []

            for row in second_table.find_all("tr"):
                cols = row.find_all(["th", "td"])
                cols = [col.text.strip() for col in cols]
                if cols and "Invention" not in cols:  
                    combined_data.append(cols)

            first_row = True  
            for row in third_table.find_all("tr"):
                cols = row.find_all(["th", "td"])
                cols = [col.text.strip() for col in cols]
                if cols:
                    if first_row:
                        first_row = False  
                        continue
                    combined_data.append(cols)

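            # The first column appears to hold the source table's serial number, so it gets an empty name
            df4 = pd.DataFrame(combined_data, columns=["", "Invention/Discovery", "Name of the Inventor", "Year of Invention"])

            print(tabulate(df4, headers="keys", tablefmt="grid"))
else:
    print("Error accessing the website:", response.status_code)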



+----+----+---------------------------------------+------------------------------------------------------------------------------+------------------------------------+
|    |    | Invention/Discovery                   | Name of the Inventor                                                         | Year of Invention                  |
|  0 |  1 | Telephone                             | Alexander Graham Bell                                                        | 1876                               |
+----+----+---------------------------------------+------------------------------------------------------------------------------+------------------------------------+
|  1 |  2 | Radio                                 | Guglielmo Marconi                                                            | 1895                               |
+----+----+---------------------------------------+------------------------------------------------------------------------------+------------------------------
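
The `Year of Invention` column is scraped as text, and the earlier outputs show that some cells contain a dash placeholder instead of a year. Since the project question is about timing, the values will probably need to be numeric at some point. The following is a minimal sketch of such a conversion, assuming that the first standalone three- or four-digit number in a cell is the year. Dates before the common era and ranges are not handled beyond taking the first number, and the name `parse_year` is hypothetical.

```python
import re
from typing import Optional

# A run of exactly three or four digits that is not part of a longer number
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{3,4})(?!\d)")


def parse_year(text: Optional[str]) -> Optional[int]:
    """Return the first 3- or 4-digit number in `text` as an int, or None if there is none."""
    if not text:
        return None
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


for cell in ["1876", "c. 1450", "1895-1901", "unknown"]:
    print(repr(cell), "->", parse_year(cell))
```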

In [33]:
dataframes = {
    "df": df,
    "df1": df1,
    "df2": df2,
    "df3": df3,
    "df4": df4
}

# Columns to export from every source
selected_cols = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]

# `frame` is used as the loop variable so that the global `df` is not overwritten
for key, frame in dataframes.items():
    # Export only the selected columns when all of them are present; otherwise export everything
    filtered_df = frame[selected_cols] if all(col in frame.columns for col in selected_cols) else frame

    csv_filename = f"{key}.csv"

    filtered_df.to_csv(csv_filename, index=False)

    print(f"CSV file '{csv_filename}' saved")


CSV file 'df.csv' saved
CSV file 'df1.csv' saved
CSV file 'df2.csv' saved
CSV file 'df3.csv' saved
CSV file 'df4.csv' saved


In [None]:
columns = ["Invention/Discovery", "Name of the Inventor", "Year of Invention"]

# Keep only the three shared columns in each source
df = df[columns]
df1 = df1[columns]
df2 = df2[columns]
df3 = df3[columns]
df4 = df4[columns]

combined_df = pd.concat([df, df1, df2, df3, df4], ignore_index=True)

# Drop header rows that were scraped as data (first cell equals the column name)
combined_df = combined_df[combined_df.iloc[:, 0] != combined_df.columns[0]]

combined_df.to_csv("combined_dataset.csv", index=False)
print("✅ CSV saved as 'combined_dataset.csv'")


PermissionError: [Errno 13] Permission denied: 'combined_dataset.csv'
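
The outputs of `In [70]` and `In [72]` list the same first rows (Automatic Calculator, Air Conditioner, Anemometer), so the combined dataset is likely to contain duplicates across sources. A minimal sketch of removing them, in plain Python, is shown below. It assumes that two records are the same when the invention and inventor match after lowercasing and collapsing whitespace. The function name `deduplicate` is hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (invention, inventor, year)


def deduplicate(records: List[Record]) -> List[Record]:
    """Keep the first occurrence of each (invention, inventor) pair, ignoring case and spacing."""
    seen = set()
    unique = []
    for invention, inventor, year in records:
        key = (
            " ".join(invention.lower().split()),
            " ".join(inventor.lower().split()),
        )
        if key not in seen:
            seen.add(key)
            unique.append((invention, inventor, year))
    return unique


records = [
    ("Automatic Calculator", "Wilhelm Schickard", "1623"),
    ("Air Conditioner", "Willis Carrier", "1902"),
    ("automatic  calculator", "Wilhelm Schickard", "1623"),  # same entry from another source
]
print(deduplicate(records))
```

On the DataFrame itself, `combined_df.drop_duplicates(subset=[...])` is the usual pandas route, though it compares values exactly, so normalizing case and spacing first may still be needed.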

---

## Yanheng Liu's Scrape Task

Below is the web scraping process for Yanheng Liu's part of the project.

---
