# What Causes More Scientific Discoveries in Short Time

## Data Cleaning

- **Original Author:** Yanheng Liu
- **Latest Modification:** 13-04-2025  
- **Modification Author:** Yanheng Liu  
- **E-mail:** [yanheng.liu@etu.sorbonne-universite.fr](mailto:yanheng.liu@etu.sorbonne-universite.fr)  
- **Version:** 1.2  

---

This notebook performs the data cleaning for the project in the *DALAS* course.


## Data Cleaning and Deduplication Workflow

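### Step 0: Check library installation
Verifies that the packages listed in `requirements.txt` are installed and installs any that are missing.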

### Step 1: Define Common Functions
Includes text cleaning, normalization, and fuzzy matching functions.

### Step 2: Load and Merge Data
Reads three data files and merges them into a single DataFrame.

### Step 3: Clean Punctuation
Removes unnecessary punctuation from all columns (e.g., quotes), but keeps semicolons.

### Step 4: Preview Country Names
Displays original country names to check for inconsistencies manually.

### Step 5: Standardize Country Names
Uses a predefined mapping dictionary to unify country name variations.

### Step 6: Preview Invention Categories
Shows all original invention categories.

### Step 7: Generalize Categories
Maps specific invention categories to broader scientific fields, with detailed explanation.

### Step 8: Normalize Key Text Fields
Cleans and normalizes 'Name of Invention' and 'Name of Inventor' by lowercasing them and stripping punctuation; stopwords are removed from invention names only.

### Step 9: Deduplicate Using Fuzzy Matching
Groups similar invention names using fuzzy matching and merges records by combining field values.

### Step 10: Output Final Dataset
Displays the final cleaned dataset and saves it to a new CSV file.


Check whether the required packages are installed in the environment.

In [54]:
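import re
import subprocess
import sys
from importlib import metadata

def canonical(name):
    """Lower-case a package name and treat '-', '_' and '.' as equivalent."""
    return re.sub(r"[-_.]+", "-", name).lower()

# Read package list from requirements.txt, skipping blank lines and comments
with open("../../requirements.txt", "r") as file:
    packages = [line.strip() for line in file
                if line.strip() and not line.strip().startswith("#")]

# Get the names of currently installed packages
# (importlib.metadata replaces the deprecated pkg_resources API)
installed_packages = set()
for dist in metadata.distributions():
    name = dist.metadata["Name"]
    if name:
        installed_packages.add(canonical(name))

# Check and install missing packages
for package in packages:
    # Drop any version specifier (==, >=, ~=, ...) to get the bare package name
    pkg_name = canonical(re.split(r"[=<>!~\s\[;]", package, maxsplit=1)[0])
    if pkg_name not in installed_packages:
        print(f"Installing missing package: {package}")
        try:
            # Call pip through the running interpreter so it targets this environment
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        except subprocess.CalledProcessError as e:
            print(f"Failed to install {package}. Error: {e}")
    else:
        print(f"Already installed: {package}")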


Already installed: requests
Already installed: beautifulsoup4
Already installed: pandas
Already installed: tabulate
Already installed: pdfplumber
Already installed: lxml
Already installed: pandas
Installing missing package: fuzzywuzzy


### Common Functions
This cell imports pandas, re, string, and the fuzzywuzzy matching functions, and defines the shared helpers used in later steps: punctuation cleaning, text normalisation, country standardisation, invention category generalisation, and value merging for deduplication.

In [None]:
import pandas as pd
import re
import string
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Define a translation table to remove punctuation except semicolon
punctuations = string.punctuation.replace(";", "")  # keep semicolon
trans_table = str.maketrans('', '', punctuations)

def remove_punctuation(text):
    """Remove punctuation from text except semicolon."""
    if isinstance(text, str):
        # Remove punctuation using the translation table
        return text.translate(trans_table).strip()
    else:
        return text

def normalize_text(text, stopwords=None):
    """Normalize text: lower-case, remove punctuation and stopwords, and extra spaces."""
    if not isinstance(text, str):
        return text
    # lower-case
    text = text.lower()
    # remove punctuation (except semicolon)
    text = text.translate(trans_table)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    if stopwords:
        # remove stopwords provided as a set
        tokens = text.split()
        tokens = [t for t in tokens if t not in stopwords]
        text = ' '.join(tokens)
    return text

def standardize_country(country_str, mapping):
    """
    Standardize country names.
    Split multiple countries by semicolon, map each to official name, then return unique sorted values.
    """
    if not isinstance(country_str, str):
        return country_str
    # Split on semicolons and on commas (ASCII or full-width) to handle multi-country entries
    parts = re.split(r'[;，,]', country_str)
    standardized = []
    for part in parts:
        # Remove leading and trailing spaces and convert to lower case
        part_clean = part.strip().lower()
        if part_clean in mapping:
            standardized.append(mapping[part_clean])
        else:
            # If not in the mapping, title-case the name (capitalise each word)
            standardized.append(part_clean.title())
    # Deduplicate and sort before joining with semicolons
    standardized = sorted(set(standardized))
    return "; ".join(standardized)

def generalize_category(cat):
    """
    Generalize specific invention categories into broader scientific fields.
    Uses regex patterns to match keywords (case-insensitive) in each category.
    """
    if not isinstance(cat, str):
        return cat
    cat_lower = cat.lower()
    # Mapping rules: key is the matching pattern, value is the generic category after categorisation
    mapping_rules = [
        (r'\b(physics|quantum|nuclear|astronomy|cosmology|astrophysics|geophysics|optics)\b', 'Physical Sciences'),
        (r'\b(chemistry|chemical|biochemistry|electrochemistry|materials|metallurgical|industrial)\b', 'Chemistry and Materials Science'),
        (r'\b(medicine|medical|genetics|virology|pharmaceutical|immunology|biotechnology)\b', 'Life Sciences & Medicine'),
        (r'\b(engineering|computer|electronic|telecommunications|software|hardware|artificial intelligence|robotics|network|cryptography|tech|aerospace)\b', 'Engineering and Technology'),
        (r'\b(biology|bio)\b', 'Life Sciences & Medicine')
    ]
    for pattern, general in mapping_rules:
        if re.search(pattern, cat_lower, re.IGNORECASE):
            return general
    return 'Other'

def merge_dict_values(rows):
    """
    Merge a list of values (e.g., country, inventor, etc.) by taking unique values and joining them with semicolon.
    """
    merged = set()
    for item in rows:
        if isinstance(item, str):
            # Splitting possible composite entries
            parts = re.split(r'[;，,]', item)
            for part in parts:
                merged.add(part.strip())
        else:
            merged.add(str(item).strip())
    return "; ".join(sorted(merged))

# Define a set of stopwords to remove in text cleaning for invention names
stopwords = set(['the', 'for', 'of', 'and', 'in', 'on', 'origin', 'proposed', 'idea', 'first'])


### Read CSV files and merge data
This cell reads three separate CSV files from the `raw_data/clean_data` directory and merges them into a single DataFrame for unified processing.


In [56]:
df1 = pd.read_csv('../../raw_data/clean_data/clean_data_yann_1.csv')
df2 = pd.read_csv('../../raw_data/clean_data/clean_data_yann_2.csv')
df3 = pd.read_csv('../../raw_data/clean_data/clean_data_yann_3.csv')

# Concatenate all data into one DataFrame
df = pd.concat([df1, df2, df3], ignore_index=True)
print("Number of original records:", len(df))


Number of original records: 445


Clean the data by removing punctuation (except semicolons) and surrounding whitespace from every field.


In [57]:
df_clean = df.copy()

# Punctuation cleaning for all string type fields
for col in df_clean.columns:
    df_clean[col] = df_clean[col].apply(remove_punctuation)

df_clean.head()


| Unnamed: 0 | Year | Country | Name of Invention | Name of Inventor | Category |
|---|---|---|---|---|---|
| 0 | 1900 | Germany | Quantum theory proposed | Planck | Physics |
| 1 | 1901 | AustrianAmerican | Discovery of human blood groups | Landsteiner | Medicine |
| 2 | 1905 | Germany | Waveparticle duality of light | Einstein | Physics |
| 3 | 1905 | Germany | Special theory of relativity | Einstein | Physics |
| 4 | 1906 | United Kingdom | Existence of vitamins proposed | Hopkins | Biochemistry |
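
The preview shows a side effect of deleting every punctuation character: values such as `AustrianAmerican` and `Waveparticle` look like hyphenated terms whose two halves were fused (the raw values are not shown here, so the hyphen is an assumption). A minimal sketch of one possible alternative is given below. The function name `remove_punctuation_keep_words` is hypothetical and not part of the notebook; it turns hyphens into spaces before stripping the remaining punctuation.

```python
import re
import string

# Punctuation to delete outright: everything except semicolons and hyphens
_DELETE = string.punctuation.replace(";", "").replace("-", "")
_DELETE_TABLE = str.maketrans("", "", _DELETE)

def remove_punctuation_keep_words(text):
    """Hypothetical variant of remove_punctuation: hyphens become spaces."""
    if not isinstance(text, str):
        return text
    text = text.replace("-", " ")          # split hyphenated compounds
    text = text.translate(_DELETE_TABLE)   # drop the remaining punctuation
    return re.sub(r"\s+", " ", text).strip()

print(remove_punctuation_keep_words("Austrian-American"))                 # Austrian American
print(remove_punctuation_keep_words('"Wave-particle" duality of light'))  # Wave particle duality of light
print(remove_punctuation_keep_words("Germany; Austria"))                  # Germany; Austria
```

Swapping this in would change the cleaned values, so every later output in this notebook would need to be regenerated.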


### Export all country names for inspection
List the unique values of the `Country` column so that inconsistencies can be checked manually.

In [58]:
all_countries = df_clean['Country'].dropna().unique()
print("List of original country names:")
for country in sorted(all_countries):
    print(country)


List of original country names:
Australia
Australia; Switzerland
Austria
Austria; Netherlands
Austria; Sweden
AustrianAmerican
Belgium
Belgium; United States
Britain
Bulgaria
Canada
Croatia
Czech
Denmark
Egyptian; Korean
England
Finland
France
France; Germany; United Kingdom
Germany
Germany; Austria
Germany; Canada
Germany; Denmark; Germany; Germany; Austria
Germany; France
Germany; USA
Germany; United Kingdom
Hungarian; British
Hungary
India
Ireland
Italy
Italy; Germany
Japan
Japan; Dutch
Japan; USA
Mexico; USA
Multiple
Netherlands
Netherlands; Japan
New Zealand
Norway
Poland
Poland; France
Russia
Russia; USA
Scotland
Serbia
South Korea
Soviet Union
Sweden
Switzerland
USA
USA; England
USA; France
USA; Germany
USA; International
USA; Japan
USA; Pakistan
USA; USA; Canada
USA; United Kingdom
United Kingdom
United Kingdom; Australia
United Kingdom; France
United Kingdom; Germany
United Kingdom; USA
United Kingdom; United States
United Kingdom; United States; Israel
United States
United St

### Standardised Country Names
Define a mapping from country name variants to a single standard name (all keys are lowercase).

In [59]:
country_mapping = {
    "british": "United Kingdom",
    "britain": "United Kingdom",
    "england": "United Kingdom",
    "wales": "United Kingdom",
    "scotland": "United Kingdom",
    "usa": "United States",
    "us": "United States",
    "soviet union": "Russia",
}

# Standardise the Country field, using the standardize_country function defined previously
df_clean['Country'] = df_clean['Country'].apply(lambda x: standardize_country(x, country_mapping))

# Output standardised list of country names
std_countries = df_clean['Country'].dropna().unique()
print("Standardised list of country names:")
for c in sorted(std_countries):
    print(c)


Standardised list of country names:
Australia
Australia; Switzerland
Australia; United Kingdom
Austria
Austria; Denmark; Germany
Austria; Germany
Austria; Netherlands
Austria; Sweden
Austrianamerican
Belgium
Belgium; United States
Bulgaria
Canada
Canada; Germany
Canada; United States
China; United States
Croatia
Czech
Denmark
Dutch; Japan
Egyptian; Korean
Finland
France
France; Germany
France; Germany; United Kingdom
France; Japan; United States
France; Poland
France; United Kingdom
France; United States
Germany
Germany; Italy
Germany; United Kingdom
Germany; United States
Hungarian; United Kingdom
Hungary
India
International; United States
Ireland
Israel; United Kingdom; United States
Italy
Italy; United States
Japan
Japan; Netherlands
Japan; United States
Mexico; United States
Multiple
Netherlands
New Zealand
Norway
Pakistan; United States
Poland
Russia
Russia; United States
Serbia
South Korea
Sweden
Switzerland
United Kingdom
United Kingdom; United States
United States
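
The standardised list still contains demonyms such as `Dutch; Japan` and `Hungarian; United Kingdom`, because those words are not keys in `country_mapping`. The sketch below shows how extra entries would resolve them. The three added keys (`dutch`, `hungarian`, `egyptian`) are assumptions made for illustration and should be checked against the data. The helper repeats the notebook's `standardize_country` logic so that the example runs on its own.

```python
import re

def standardize_country(country_str, mapping):
    """Same logic as the notebook's helper, repeated so the example is self-contained."""
    if not isinstance(country_str, str):
        return country_str
    parts = re.split(r"[;，,]", country_str)
    standardized = []
    for part in parts:
        key = part.strip().lower()
        # Use the mapped name when available, otherwise title-case the input
        standardized.append(mapping.get(key, key.title()))
    return "; ".join(sorted(set(standardized)))

# Entries taken from the notebook's country_mapping
country_mapping = {
    "british": "United Kingdom", "britain": "United Kingdom",
    "england": "United Kingdom", "wales": "United Kingdom",
    "scotland": "United Kingdom", "usa": "United States",
    "us": "United States", "soviet union": "Russia",
}

# Hypothetical additions for demonyms seen in the output above
extended_mapping = {
    **country_mapping,
    "dutch": "Netherlands",
    "hungarian": "Hungary",
    "egyptian": "Egypt",
}

for raw in ["Japan; Dutch", "Hungarian; British", "USA; USA; Canada"]:
    print(raw, "->", standardize_country(raw, extended_mapping))
# Japan; Dutch -> Japan; Netherlands
# Hungarian; British -> Hungary; United Kingdom
# USA; USA; Canada -> Canada; United States
```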


### Export all original invention categories

In [60]:
all_categories = df_clean['Category'].dropna().unique()
print("List of original invention categories:")
for cat in sorted(all_categories):
    print(cat)

List of original invention categories:
3D Printing
Acoustics
Aerospace
Aerospace Engineering
Agricultural Chemistry
Agricultural Engineering
Agricultural Machinery
Agricultural Technology
Agriculture
Anthropology
Artificial Intelligence
Assistive Technology
Astronomy
Astrophysics
Audio Technology
Automotive Technology
Aviation
Aviation Technology
Battery Technology
Biochemistry
Biology
Biotechnology
Blockchain Technology
Chaos Theory
Chemical Engineering
Chemistry
Civil Engineering
Climate Science
Climatology
Communication Technology
Communications
Computer Engineering
Computer Hardware
Computer Interface
Computer Networking
Computer Science
Computing
Construction Engineering
Construction Materials
Consumer Electronics
Cosmology
Cryptography
Data Storage
Diving Technology
Domestic Technology
Electrical Engineering
Electricity
Electrochemistry
Electronic Engineering
Electronics
Energy Technology
Environmental Science
Ethology
Evolutionary Biology
Explosives
Film Technology
Financial Tec

### Aggregate invention categories into general categories
Apply the generalize_category function to the Category column and create a new column General_Category

In [61]:
df_clean['General_Category'] = df_clean['Category'].apply(generalize_category)

# Print the statistics of the transformed categories
print("General Category Statistics:")
print(df_clean['General_Category'].value_counts())

# Explanation:
# We use a keyword-based regex matching method to aggregate detailed categories into four major groups:
# 1. Physical Sciences: includes keywords such as physics, quantum, nuclear, astronomy, cosmology, astrophysics, geophysics, optics, etc.
# 2. Chemistry and Materials Science: includes keywords such as chemistry, chemical, biochemistry, electrochemistry, materials, metallurgical, etc.
# 3. Life Sciences & Medicine: includes keywords such as medicine, medical, genetics, virology, pharmaceutical, immunology, biotechnology, biology, etc.
# 4. Engineering and Technology: includes keywords such as engineering, computer, electronic, telecommunications, software, hardware, artificial intelligence, robotics, network, cryptography, aerospace, etc.
# Categories that do not match any keywords are categorized as "Other".


General Category Statistics:
General_Category
Other                              177
Engineering and Technology          87
Physical Sciences                   71
Chemistry and Materials Science     56
Life Sciences & Medicine            54
Name: count, dtype: int64
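
`Other` is the largest group, with 177 of the 445 records. One likely contributor (an observation about the regex, not something measured in this notebook) is that the keyword `tech` sits between `\b` word boundaries in `mapping_rules`, so it only matches `tech` as a whole word and not category names such as `Battery Technology`. The sketch below compares that behaviour with a hypothetical prefix pattern, `\btech\w*`.

```python
import re

whole_word = re.compile(r"\b(tech)\b")   # how 'tech' is matched in mapping_rules
prefix = re.compile(r"\btech\w*")        # hypothetical alternative: any word starting with 'tech'

for cat in ["Battery Technology", "Audio Technology", "Tech"]:
    name = cat.lower()
    print(f"{cat}: whole-word={bool(whole_word.search(name))}, "
          f"prefix={bool(prefix.search(name))}")
# Battery Technology: whole-word=False, prefix=True
# Audio Technology: whole-word=False, prefix=True
# Tech: whole-word=True, prefix=True
```

Changing the pattern would alter the counts shown above, so the statistics cell would need to be re-run.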


### Data preprocessing and normalization (unifying case, removing stopwords and whitespace, etc.)
Normalize 'Name of Invention' as a basis for later deduplication

In [62]:

df_clean['Invention_Norm'] = df_clean['Name of Invention'].apply(lambda x: normalize_text(x, stopwords=stopwords))
# Similarly, normalize 'Name of Inventor' if needed
df_clean['Inventor_Norm'] = df_clean['Name of Inventor'].apply(lambda x: normalize_text(x))
df_clean.head()


| Unnamed: 0 | Year | Country | Name of Invention | Name of Inventor | Category | General_Category | Invention_Norm | Inventor_Norm |
|---|---|---|---|---|---|---|---|---|
| 0 | 1900 | Germany | Quantum theory proposed | Planck | Physics | Physical Sciences | quantum theory | planck |
| 1 | 1901 | Austrianamerican | Discovery of human blood groups | Landsteiner | Medicine | Life Sciences & Medicine | discovery human blood groups | landsteiner |
| 2 | 1905 | Germany | Waveparticle duality of light | Einstein | Physics | Physical Sciences | waveparticle duality light | einstein |
| 3 | 1905 | Germany | Special theory of relativity | Einstein | Physics | Physical Sciences | special theory relativity | einstein |
| 4 | 1906 | United Kingdom | Existence of vitamins proposed | Hopkins | Biochemistry | Chemistry and Materials Science | existence vitamins | hopkins |
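
To make the effect of the stopword list concrete, the sketch below repeats the notebook's `normalize_text` logic in compact form and applies it to one invention name from the preview and to a made-up variant (`"The Quantum Theory"` is a hypothetical input, not a record from the dataset). Both reduce to the same normalized string, which is what lets the later deduplication step treat them as the same invention.

```python
import re
import string

# Same punctuation table and stopword set as in the notebook
punctuations = string.punctuation.replace(";", "")
trans_table = str.maketrans("", "", punctuations)
stopwords = {"the", "for", "of", "and", "in", "on", "origin", "proposed", "idea", "first"}

def normalize_text(text, stopwords=None):
    """Lower-case, strip punctuation (except ';'), collapse spaces, drop stopwords."""
    if not isinstance(text, str):
        return text
    text = text.lower().translate(trans_table)
    text = re.sub(r"\s+", " ", text).strip()
    if stopwords:
        text = " ".join(t for t in text.split() if t not in stopwords)
    return text

a = normalize_text("Quantum theory proposed", stopwords)  # from the preview
b = normalize_text("The Quantum Theory", stopwords)       # hypothetical variant
print(a)        # quantum theory
print(a == b)   # True
print(normalize_text("Planck"))  # planck (no stopword removal for inventors)
```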


### Perform data deduplication using a fuzzy matching algorithm

In [63]:
def deduplicate_df(df, threshold=90):
    """
    Use fuzzy matching to group similar invention names and merge rows.
    The merging is performed by taking the union of all other column values.
    """
    groups = []  # list to store groups, each group is a list of row indices
    # List of normalized invention names
    names = df['Invention_Norm'].tolist()
    used_idx = set()
    
    for idx, name in enumerate(names):
        if idx in used_idx:
            continue
        # Create a new group with the current index
        group = [idx]
        used_idx.add(idx)
        for jdx in range(idx+1, len(names)):
            if jdx in used_idx:
                continue
            # Compute fuzzy ratio between two normalized strings
            ratio = fuzz.ratio(name, names[jdx])
            if ratio >= threshold:
                group.append(jdx)
                used_idx.add(jdx)
        groups.append(group)
    
    # Merge groups: for each group, merge the corresponding rows
    merged_records = []
    for group in groups:
        # If the group has only one record, keep it as is
        if len(group) == 1:
            merged_records.append(df.iloc[group[0]])
        else:
            # Merge each listed column by joining its unique values with semicolons.
            # 'Year' is merged the same way, so differing years are kept as a
            # semicolon-separated string rather than reduced (e.g., with min).
            merged = {}
            for col in ['Year', 'Country', 'Name of Invention', 'Name of Inventor', 'Category', 'General_Category']:
                values = df.iloc[group][col].dropna().astype(str).tolist()
                merged[col] = merge_dict_values(values)
            # For the normalized invention name, keep the first item.
            # Columns not set here ('Unnamed: 0', 'Inventor_Norm') are left empty for merged groups.
            merged['Invention_Norm'] = df.iloc[group[0]]['Invention_Norm']
            merged_records.append(pd.Series(merged))
    
    return pd.DataFrame(merged_records)

# Apply the deduplication function
df_dedup = deduplicate_df(df_clean, threshold=90)
print("Number of records before deduplication:", len(df_clean))
print("Number of records after deduplication:", len(df_dedup))


Number of records before deduplication: 445
Number of records after deduplication: 435
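
The grouping logic in `deduplicate_df` is greedy: each unused name seeds a group and absorbs every later name whose score against the seed reaches the threshold. The sketch below illustrates that on a small, made-up list. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` so that it runs without fuzzywuzzy; the scores can differ from fuzzywuzzy's, and `similarity` and `group_similar` are hypothetical names.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for fuzz.ratio: similarity score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def group_similar(names, threshold=90):
    """Greedy grouping: each unused name seeds a group and absorbs later matches."""
    groups, used = [], set()
    for i, name in enumerate(names):
        if i in used:
            continue
        group = [i]
        used.add(i)
        for j in range(i + 1, len(names)):
            # Compare against the seed name only, as deduplicate_df does
            if j not in used and similarity(name, names[j]) >= threshold:
                group.append(j)
                used.add(j)
        groups.append(group)
    return groups

names = [
    "special theory relativity",
    "special theory relativty",   # made-up misspelling
    "quantum theory",
    "special theory relativity",  # exact duplicate
]
print(group_similar(names))  # [[0, 1, 3], [2]]
```

Because candidates are compared only with the seed of each group, the result can depend on row order: two names that are close to each other but not to the seed end up in different groups.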


### Output the final cleaned and deduplicated data, then save it

In [64]:
print("Preview of the cleaned and deduplicated data:")
display(df_dedup.head())

# Save to a new CSV file to avoid overwriting the original file
df_dedup.to_csv('../../raw_data/clean_data/clean_data_dd.csv', index=False)
print("The final cleaned data has been saved to 'clean_data_dd.csv'")


Preview of the cleaned and deduplicated data:


| Unnamed: 0 | Year | Country | Name of Invention | Name of Inventor | Category | General_Category | Invention_Norm | Inventor_Norm |
|---|---|---|---|---|---|---|---|---|
| 0 | 1900 | Germany | Quantum theory proposed | Planck | Physics | Physical Sciences | quantum theory | planck |
| 1 | 1901 | Austrianamerican | Discovery of human blood groups | Landsteiner | Medicine | Life Sciences & Medicine | discovery human blood groups | landsteiner |
| 2 | 1905 | Germany | Waveparticle duality of light | Einstein | Physics | Physical Sciences | waveparticle duality light | einstein |
| 3 | 1905 | Germany | Special theory of relativity | Einstein | Physics | Physical Sciences | special theory relativity | einstein |
| 4 | 1906 | United Kingdom | Existence of vitamins proposed | Hopkins | Biochemistry | Chemistry and Materials Science | existence vitamins | hopkins |


The final cleaned data has been saved to 'clean_data_dd.csv'
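
For records that were merged during deduplication, `Year` passes through `merge_dict_values` and is therefore stored as a semicolon-separated string of unique values when the source rows disagree. If a later analysis needs one numeric year per record, taking the earliest is one option. The sketch below assumes years are written as plain integers; `earliest_year` is a hypothetical helper, not part of the notebook.

```python
def earliest_year(value):
    """Return the smallest year in a value such as '1905; 1906', or None if none is found."""
    # str() also covers rows that were never merged and still hold a number.
    # Only plain integer tokens are recognised; '1905.0' would be skipped.
    years = [int(p) for p in str(value).split(";") if p.strip().isdigit()]
    return min(years) if years else None

print(earliest_year("1905; 1906"))  # 1905
print(earliest_year(1900))          # 1900
print(earliest_year("unknown"))     # None
```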
