# Data file description 

This project uses three cleaned data files stored in the `../../raw_data/clean_data/` directory.  
Each file is in CSV format and contains five columns: 
- **Year**: the year of the invention or discovery 
- **Country**: the country where the invention or discovery was made 
- **Name of Invention**: the name of the invention or discovery 
- **Name of Inventor**: the name of the inventor or discoverer 
- **Category**: the field the invention or discovery belongs to 

The list of files is as follows: 
- `clean_data_yann_1.csv` 
- `clean_data_yann_2.csv` 
- `clean_data_yann_3.csv`
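
As a quick sanity check on the column layout described above, a minimal sketch along these lines could be run on each file after loading. The helper name `missing_columns` and the in-memory sample are assumptions made for illustration; they are not part of the project code.

```python
import io
import pandas as pd

# Column names as listed in the description above.
EXPECTED_COLUMNS = ["Year", "Country", "Name of Invention", "Name of Inventor", "Category"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# In-memory stand-in for one CSV file, so the sketch runs without the real data.
sample_csv = io.StringIO(
    "Year,Country,Name of Invention,Name of Inventor,Category\n"
    "1752,United States,Lightning Rod,Benjamin Franklin,Electricity\n"
)
sample = pd.read_csv(sample_csv)

# An empty list means every expected column is present.
print(missing_columns(sample))
```

For the real files, `sample` would be replaced by the result of `pd.read_csv(...)` on each path.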


In [6]:
# Cell 2: Importing libraries and reading data
import pandas as pd

# path1 = '../../raw_data/clean_data/clean_data_yann_1.csv'
# path2 = '../../raw_data/clean_data/clean_data_yann_2.csv'
# path3 = '../../raw_data/clean_data/clean_data_yann_3.csv'
path4 = '../../raw_data/clean_data/clean_data_balam.csv'

# df1 = pd.read_csv(path1)
# df2 = pd.read_csv(path2)
# df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)




In [7]:
# Cell 3: Merge the data, then extract and count the unique Category values.
df_all = pd.concat([df4], ignore_index=True)
categories = df_all['Category'].drop_duplicates().tolist()
count = len(categories)

print(f"There are {count} of different Categories after de-duplication:")
for cat in categories:
    print(f"－ {cat}")


There are 151 of different Categories after de-duplication:
－ Music
－ Agricultural Machinery
－ Measurement Instrument
－ Metallurgical Engineering
－ Mechanical Engineering
－ Horology
－ Textile Machinery
－ Domestic Technology
－ Materials Technology
－ Refrigeration Technology
－ Electrical Engineering
－ Food Technology
－ Marine Navigation
－ Recreational Technology
－ Chemistry
－ Military Technology
－ Marine Technology
－ Firearms Technology
－ Aviation Technology
－ Optics
－ Safety Technology
－ Communication Technology
－ Immunology
－ Printing Technology
－ Battery Technology
－ Machine Tools
－ Transportation Engineering
－ Rocketry
－ Civil Engineering
－ Transportation
－ Construction Engineering
－ Medical Equipment
－ Assistive Technology
－ Photography
－ Renewable Energy
－ Biology
－ Computing
－ Electrochemistry
－ Communication Device
－ Polymer Material
－ Timekeeping Mechanism
－ Agricultural Implement
－ Communication System
－ Photographic Process
－ Chemical Discovery
－ Vehicle
－ Industrial Machinery

In [5]:
# Cell 4: Extract and print unique countries
# Assuming df_all is already defined from the previous cell
# Get the 'Country' column, drop duplicate entries, and convert to a list
unique_countries = df_all['Country'].drop_duplicates().tolist()

# Count how many unique countries there are
country_count = len(unique_countries)

# Print the total number of unique countries
print(f"There are {country_count} unique countries in the dataset:")

# Print each country on its own line
for country in unique_countries:
    print(f"- {country}")


There are 75 unique countries in the dataset:
- Germany
- Austrian-American
- United Kingdom
- Croatia
- USA
- New Zealand
- Netherlands
- Austria
- United Kingdom; Germany
- Serbia
- United Kingdom; USA
- Canada
- Australia
- Germany; Austria
- Japan
- Austria; Netherlands
- USA; Germany
- Soviet Union
- Austria; Sweden
- Italy; Germany
- Russia; USA
- USA; Japan
- USA; United Kingdom
- Japan; USA
- USA; Pakistan
- Mexico; USA
- Australia; Switzerland
- USA; France
- Germany; United Kingdom
- France
- Switzerland
- USA; International
- England
- Sweden
- Italy
- Denmark
- Germany; USA
- France; Germany; United Kingdom
- Czech
- Russia
- Scotland
- Germany; Denmark; Germany; Germany; Austria
- Belgium
- Britain
- USA; USA; Canada
- Poland
- USA; England
- India
- Ireland
- Hungary
- Wales
- Germany; Canada
- Germany; France
- United States
- United Kingdom; United States
- Bulgaria
- United States; France
- Poland; France
- Norway
- Belgium; United States
- United Kingdom; France
- Uni
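
The list above counts combined labels such as `USA; USA; Canada` and spelling variants such as `USA` and `United States` as separate countries. One possible way to count individual countries instead is sketched below. The alias table and the helper `normalize_countries` are hypothetical, and the decision to merge `United States` into `USA` is an assumption; other merges (for example England or Britain into United Kingdom) would need a deliberate decision after reviewing the full list.

```python
import pandas as pd

# Hypothetical alias table; extend it after reviewing the full list of labels.
COUNTRY_ALIASES = {
    "United States": "USA",
}

def normalize_countries(value):
    """Split a ';'-separated Country string, apply aliases, and drop repeats in order."""
    if pd.isna(value):
        return []
    seen = []
    for part in str(value).split(";"):
        name = part.strip()
        name = COUNTRY_ALIASES.get(name, name)
        if name and name not in seen:
            seen.append(name)
    return seen

# Small stand-in for df_all, using labels that appear in the output above.
demo = pd.DataFrame(
    {"Country": ["USA; USA; Canada", "United Kingdom; United States", "Germany"]}
)

# One row per individual country, then the distinct values.
exploded = demo["Country"].apply(normalize_countries).explode()
print(sorted(exploded.unique()))
# ['Canada', 'Germany', 'USA', 'United Kingdom']
```

Applied to `df_all['Country']`, the same steps would give a count of individual countries rather than of distinct label strings.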

In [9]:
# 1. Load the final cleaned CSV
df_final = pd.read_csv('../../raw_data/clean_data/clean_data_dd.csv')

# 2. Filter rows where General_Category == 'Other'
others = df_final[df_final['General_Category'] == 'Other']

# 3. Report and inspect
print(f"Found {len(others)} rows with General_Category == 'Other':")

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

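# Explicit import so the cell also runs outside a notebook session
from IPython.display import display

display(others)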


Found 69 rows with General_Category == 'Other':


Unnamed: 0,Year,Country,Name of Invention,Name of Inventor,Category,General_Category
7,1725,China,Clock Mechanical,Hsing and LingTsan,Horology,Other
20,1745,Netherlands,Leyden Jar; The Leyden Jar,Ewald Georg von Kleist; Pieter van Musschenbroek,Electricity,Other
24,1752,United States,Lightning Rod,Benjamin Franklin,Electricity,Other
40,1769,France,steamdriven gun carriage,NicolasJoseph Cugnot,Military Technology,Other
58,1783,France,Manned Hot Air Balloon,JosephMichel Montgolfier; JacquesÉtienne Montg...,Aerospace,Other
61,1783,France,Modern Parachute,LouisSébastien Lenormand,Aerospace,Other
84,1799,Italy,The Battery,Alessandro Volta,Electricity,Other
88,1800,Italy,Voltaic Pile,Alessandro Volta,Electricity,Other
91,1802,United Kingdom,Arc Lamp,Humphry Davy,Electricity,Other
106,1810,France,First Wristwatch,AbrahamLouis Breguet,Horology,Other


In [10]:
# Extract, dedupe, and sort the original Category values
unique_categories = sorted(others['Category'].dropna().unique())

# Print them out
print("Unique original Category values among 'Other' rows:")
for cat in unique_categories:
    print(f"- {cat}")

Unique original Category values among 'Other' rows:
- 3D Printing
- Aerospace
- Communications
- Construction Materials
- Diving Technology
- Electric Motor
- Electricity
- Electronics
- Electronics; Semiconductor Technology
- Energy Storage Device
- Horology
- Household Appliance
- Industrial Design
- Laser Technology
- Lighting Device
- Medical Equipment
- Metallurgical Furnace
- Military Technology
- Mining Technology
- Nanotechnology
- Navigation Technology
- Optical Storage
- Optoelectronics
- Scientific Model
- Semiconductor Technology
- Space Technology
- Technical Drawing
- Textile Fiber
- Timekeeping Mechanism
- Timekeeping Technology
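
To see which original categories contribute most to the 'Other' group, the rows could also be counted per category rather than only listed. The sketch below is a minimal illustration: `others_demo` is a stand-in built from the ten rows displayed earlier, not the full set of 69 rows.

```python
import pandas as pd

# Stand-in for `others`, using the Category values of the ten rows displayed earlier.
others_demo = pd.DataFrame({
    "Category": [
        "Horology", "Electricity", "Electricity", "Military Technology",
        "Aerospace", "Aerospace", "Electricity", "Electricity",
        "Electricity", "Horology",
    ],
    "General_Category": ["Other"] * 10,
})

# Number of 'Other' rows per original Category, most frequent first.
counts = others_demo["Category"].value_counts()
print(counts)
```

On the real data, replacing `others_demo` with `others` would show which categories are the best candidates for a dedicated `General_Category` value.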
