# Analysis notebook

This notebook analyses patent data scraped from WIPO PATENTSCOPE. The search was set up as follows: <br> On the main page (https://patentscope.wipo.int/search/en/search.jsf), set *"Field"* to **Publication Date**, then enter the first year of interest in the *"Search terms..."* box. <br> In the section directly under the search bar, set *"Sort"* to **Relevance** and *"Per page"* to **200**.

## Importing libraries

In [2]:
import pandas as pd 
import plotly.express as px
import country_converter as coco
import plotly.io as pio
from IPython.display import IFrame
pio.renderers.default = "plotly_mimetype"
cc = coco.CountryConverter()

## Importing data 

In [3]:
import glob
import os

# Get all CSV files from the patents directory
patents_dir = "/Users/kahina/Montréal/Cours/GBM6330E_emerging_biotech/patents"
csv_files = glob.glob(os.path.join(patents_dir, "*.csv"))

# Read and concatenate all CSV files with latin1 encoding
pat = pd.concat([pd.read_csv(f, encoding='latin1') for f in csv_files], ignore_index=True)


Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
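
The mixed-types message above is pandas' `DtypeWarning`: with the default chunked parsing, a column can be inferred as different types in different chunks of a file. A minimal sketch of one way to avoid it, assuming it is acceptable to parse each file in a single pass with `low_memory=False`. The helper name `load_patents` and the toy files are hypothetical and not part of the notebook:

```python
import glob
import os
import tempfile

import pandas as pd


def load_patents(directory, encoding="latin1"):
    """Read every CSV in `directory` and stack them into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    # low_memory=False parses each file in one pass, so each column gets one inferred dtype
    frames = [pd.read_csv(p, encoding=encoding, low_memory=False) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two small files (illustrative data only)
with tempfile.TemporaryDirectory() as tmp:
    for name, rows in [("a.csv", "Office Country,Inventor\nFR,Marie\n"),
                       ("b.csv", "Office Country,Inventor\nUS,John\nCA,\n")]:
        with open(os.path.join(tmp, name), "w", encoding="latin1") as fh:
            fh.write(rows)
    demo = load_patents(tmp)

print(demo.shape)
```

Passing an explicit `dtype` for the affected column is another option when the intended type is known.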



## Data Processing

Adding full country names from the office country codes

In [4]:
pat["Full Country"] = cc.pandas_convert(series=pat["Office Country"], to='name_short', not_found = "NaN")

SU not found in ISO2
DD not found in ISO2
CS not found in ISO2
EP not found in ISO2
EA not found in ISO2
AP not found in ISO2
WO not found in ISO2
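
The codes reported as not found are not current ISO 3166-1 alpha-2 country codes: they designate former states and regional or international patent offices, so the converter returns the `not_found` value for them. A sketch of one way to label them by hand, assuming the WIPO ST.3 meanings listed in the dictionary below. `OFFICE_LABELS` and `label_office` are hypothetical names, and the `converted` series is a stand-in for the converter output:

```python
import pandas as pd

# Office codes that are not current ISO 3166-1 alpha-2 countries (WIPO ST.3 meanings)
OFFICE_LABELS = {
    "SU": "Soviet Union",
    "DD": "German Democratic Republic",
    "CS": "Czechoslovakia",
    "EP": "European Patent Office",
    "EA": "Eurasian Patent Organization",
    "AP": "African Regional Intellectual Property Organization",
    "WO": "World Intellectual Property Organization (PCT)",
}


def label_office(codes, converted):
    """Replace converter results with a manual label wherever the code is in OFFICE_LABELS."""
    manual = codes.map(OFFICE_LABELS)  # NaN where the code is not in the dictionary
    # keep the converter result where no manual label exists, otherwise use the manual label
    return converted.where(manual.isna(), manual)


codes = pd.Series(["FR", "EP", "SU", "WO"])
converted = pd.Series(["France", "NaN", "NaN", "NaN"])  # stand-in for the converter output
print(label_office(codes, converted).tolist())
```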


Function to get patent type

In [5]:
# Classify each patent as "Solo Inventor", "Research/Company" or "default"
def determine_status(row):
    # Keywords are matched against the lowercased applicant name, so they must be lowercase
    applicant_types = {'research', 'group', 'society', 'foundation', 'inc', 'compania', 'lab', 'industries', 'societe', 'manufacturing', 'machine', 'co ',
                       'corp', 'association', 'university', 'institute', 'company', 'llc', 'ltd', 'lfp', 'industria', 'industrie', 'firm', '+', 'co.',
                       'pharmaceuticals', 'roche', "l'oreal", 'campos', 'technologies', 'corp', 'inst', 'pharma', 'electronics', 'volvo', 'corporation',
                      'ltda', 'communications', 'ifp', 'technik', 'siemens','s.a', 'operations', 'limited', 'gmbh', 'novartis', 'agency',
                      'elektronik', 's.p.a', 'uniwersytet', 's.l', 's.r.l', 'a.s','urs', 'ag ', 'universiteit', 'hospital', 'silverphase',
                      'sanofi', 'science', 'medicament', 'recherche', 'tech', 'international', 'networks', 'france', 'nucleix', 'cosmetique', 
                       'astrazeneca', 'universite', 'les ', 'igt', 'service', 'services', 'univ', 'products', 'product', 'bank', 'compan', 
                      'cotton', '& co', '&co', 'comp', 'constructions', 'meca', 'sciences', 'tech', 'consulting', ' spa', 'management', 'associates', 
                       'holdings', 'systems', ' as', ' co', 'electric', 'printing', 'steel', ' ind', 'chemicals', ' ag', 'a.g', "johnson & johnson"
                      ,'gm. b. h', 'informazioni', 'g. m. b. h.', 'anonyme', 'limitada', 'sociedad', 'solex', 's. a', 'eleuterio', 'societr',
                      'commissariat', 's.a', 'interlight', 's. l', 'electronique', 'moebius & ruppert', 'g m b h', 'elektro', 'società', 'energía',
                      'philips', '&', 's.c.i.', 'société', 'sté', 'g.m.b.h', 'energy', 'a. k','investigación', 'fabrica', 'limited'}
    inventor, applicant = row['Inventor'], row['Applicant']
    if pd.notna(inventor) and pd.notna(applicant):  # both inventor and applicant are present
        if isinstance(inventor, str) and isinstance(applicant, str):
            if applicant in inventor:  # the applicant name appears among the inventors
                return "Solo Inventor"
            elif any(word in applicant.lower() for word in applicant_types):
                return "Research/Company"
        # unmatched applicants and non-string values fall back to "default"
        return "default"
    elif pd.isna(inventor) and pd.notna(applicant):  # inventor is missing but not applicant
        if any(word in str(applicant).lower() for word in applicant_types):
            return "Research/Company"
        elif applicant == 'applicant name missing':
            return "default"
        else:
            return "Solo Inventor"
    elif pd.notna(inventor) and pd.isna(applicant):  # applicant is missing but not inventor
        # a year cutoff (Year < 1920) was tried here and is left disabled
        return "Solo Inventor"
    else:  # both are missing
        return "default"
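
Because `determine_status` tests keywords by substring containment, short keywords can match inside unrelated words (for example `'inc'` inside "Vincent"). A minimal sketch of a stricter alternative based on word-boundary regular expressions. This is not the notebook's method; the short keyword list and the helper name `looks_like_organisation` are illustrative assumptions:

```python
import re

# Small illustrative subset of the keyword list
KEYWORDS = ["inc", "lab", "corp", "gmbh", "university", "s.a"]

# (?<!\w) and (?!\w) require that the keyword is not glued to other word characters;
# re.escape keeps characters such as "." literal
PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")(?!\w)",
    flags=re.IGNORECASE,
)


def looks_like_organisation(applicant):
    """Return True when a whole-word keyword occurs in the applicant string."""
    return bool(PATTERN.search(applicant))


# A plain substring test would match 'inc' and 'lab' in the first name
for name in ["Vincent Labelle", "Acme Inc.", "Bosch GmbH"]:
    print(name, "->", looks_like_organisation(name))
```

The trade-off is that whole-word matching misses abbreviations written without separators, so the keyword list would likely need review before switching.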

Converting the publication date to a datetime, extracting the year, applying the function to get the patent type, then counting patents by year, patent type and country

In [6]:
#convert date string to actual date
pat["Publication date"] = pd.to_datetime(pat["Publication date"], format="%d.%m.%Y")
#converting date to year
pat["Year"] = pat["Publication date"].dt.year
#apply method to get patent type
pat['Type'] = pat.apply(determine_status, axis=1)
pat = pat.groupby(['Year', 'Type', 'Full Country']).size().reset_index(name='Patents')
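
The conversion above assumes every date string matches `%d.%m.%Y`; a malformed value would raise an error. A sketch, on made-up rows, of parsing with `errors="coerce"` so that unparseable values become `NaT`, followed by the same kind of count per year and type:

```python
import pandas as pd

toy = pd.DataFrame({
    "Publication date": ["03.01.1995", "17.06.1995", "not a date", "21.11.1996"],
    "Type": ["Solo Inventor", "Research/Company", "Solo Inventor", "Solo Inventor"],
})

# Unparseable strings become NaT instead of raising
toy["Publication date"] = pd.to_datetime(
    toy["Publication date"], format="%d.%m.%Y", errors="coerce"
)
# Drop the rows without a usable date, then extract the year
toy = toy.dropna(subset=["Publication date"]).copy()
toy["Year"] = toy["Publication date"].dt.year

# One row per (Year, Type) with the number of patents
counts = toy.groupby(["Year", "Type"]).size().reset_index(name="Patents")
print(counts)
```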

Getting proportions

In [7]:
#dropping default patents
pat_proportions = pat[pat["Type"] != "default"]
#getting sum of patents per year
total_per_year = pat_proportions.groupby('Year')['Patents'].transform('sum')
#getting proportions of patent types
pat_proportions['Proportion'] = pat_proportions['Patents'] / total_per_year
#summing counts and proportions over countries (numeric columns only)
pat_proportions = pat_proportions.groupby(["Year", "Type"])[["Patents", "Proportion"]].sum().reset_index()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
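
The message above is pandas' `SettingWithCopyWarning`: `pat_proportions` was created by filtering `pat`, and pandas warns that the new column may be written to a temporary copy. A minimal sketch, on made-up counts, of taking an explicit `.copy()` before adding the column, and of checking that the proportions within each year sum to 1:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001],
    "Type": ["Solo Inventor", "Research/Company", "default", "Solo Inventor", "Research/Company"],
    "Patents": [10, 30, 5, 20, 20],
})

# .copy() makes the filtered table independent, so the assignment below is unambiguous
kept = toy[toy["Type"] != "default"].copy()
# transform("sum") returns the yearly total aligned with each row
kept["Proportion"] = kept["Patents"] / kept.groupby("Year")["Patents"].transform("sum")

print(kept)
print(kept.groupby("Year")["Proportion"].sum())
```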



Inferring gender from each inventor's name

In [14]:
# Using the popular_names.csv file created from https://www.ssa.gov/oact/babynames/decades/century.html
names = pd.read_csv("/Users/kahina/Montréal/Cours/GBM6330E_emerging_biotech/popular_names.csv")

# Build sets of male and female names for quick lookup
male_names = set(names['male'].str.lower())
female_names = set(names['female'].str.lower())

# Function to determine gender from inventor name
def determine_gender(inventor_name):
    if pd.isna(inventor_name) or inventor_name == '':
        return 'Unknown'
    
    # Split the inventor name and check each word
    words = str(inventor_name).split()
    for word in words:
        word_lower = word.lower().strip('.,;')
        if word_lower in male_names:
            return 'Male'
        elif word_lower in female_names:
            return 'Female'
    
    return 'Ungendered'

# Apply gender determination to the original patent data (before grouping)
# We need to re-read and process the data to maintain inventor information
pat_gender = pd.concat([pd.read_csv(f, encoding='latin1') for f in csv_files], ignore_index=True)
pat_gender["Full Country"] = cc.pandas_convert(series=pat_gender["Office Country"], to='name_short', not_found = "NaN")
pat_gender["Publication date"] = pd.to_datetime(pat_gender["Publication date"], format="%d.%m.%Y")
pat_gender["Year"] = pat_gender["Publication date"].dt.year
pat_gender['Type'] = pat_gender.apply(determine_status, axis=1)
pat_gender['Gender'] = pat_gender['Inventor'].apply(determine_gender)

# Create CSV file with ungendered names
ungendered = pat_gender[pat_gender['Gender'] == 'Ungendered'][['Office Country', 'Inventor']].drop_duplicates()
ungendered.to_csv("/Users/kahina/Montréal/Cours/GBM6330E_emerging_biotech/ungendered_names.csv", index=False)

# Create CSV file with all inventors, their office country, and gender
inventors_gender = pat_gender[['Office Country', 'Inventor', 'Gender']].drop_duplicates()
# Replace 'Unknown' and 'Ungendered' with blank values
inventors_gender['Gender'] = inventors_gender['Gender'].replace({'Unknown': '', 'Ungendered': ''})
inventors_gender.to_csv("/Users/kahina/Montréal/Cours/GBM6330E_emerging_biotech/inventors_names_gender.csv", index=False)

print(f"Total patents: {len(pat_gender)}")
print("Gender distribution:")
print(pat_gender['Gender'].value_counts())
print(f"\nUngendered names saved to ungendered_names.csv ({len(ungendered)} unique entries)")
print(f"All inventors saved to inventors_names_gender.csv ({len(inventors_gender)} unique entries)")

# Group by Year, Type, and Gender for visualization
pat_gender_grouped = pat_gender.groupby(['Year', 'Type', 'Gender']).size().reset_index(name='Patents')
# Remove 'default' and 'Unknown' categories for cleaner visualization
pat_gender_filtered = pat_gender_grouped[(pat_gender_grouped['Type'] != 'default') & 
                                          (pat_gender_grouped['Gender'] != 'Unknown')]


Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.

SU not found in ISO2
DD not found in ISO2
CS not found in ISO2
EP not found in ISO2
EA not found in ISO2
AP not found in ISO2
WO not found in ISO2


Total patents: 2642012
Gender distribution:
Gender
Male          1003272
Ungendered     944522
Unknown        562848
Female         131370
Name: count, dtype: int64

Ungendered names saved to ungendered_names.csv (656198 unique entries)
All inventors saved to inventors_names_gender.csv (1388839 unique entries)
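
`determine_gender` returns on the first word found in either list and checks the male list first, so a name that happened to appear in both lists would always be labelled Male. A sketch of a variant that reports such names separately. The tiny name sets and the label `'Ambiguous'` are illustrative assumptions, not the notebook's data:

```python
# Hypothetical name sets; "jordan" is deliberately placed in both
MALE = {"john", "jordan", "pierre"}
FEMALE = {"marie", "jordan", "anna"}


def classify_name(inventor_name):
    """Label a name using the first word that appears in either set."""
    # Only None and empty strings are treated as missing here;
    # pandas missing values would need pd.isna, as in the notebook
    if not inventor_name:
        return "Unknown"
    for word in str(inventor_name).split():
        token = word.lower().strip(".,;")
        in_male, in_female = token in MALE, token in FEMALE
        if in_male and in_female:
            return "Ambiguous"
        if in_male:
            return "Male"
        if in_female:
            return "Female"
    return "Ungendered"


for name in ["SMITH, John", "Jordan Lee", "Curie Marie;", "X. Y. Zhang", ""]:
    print(repr(name), "->", classify_name(name))
```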


### Data Visualization: Line Chart and Stacked Bars

In [26]:
#| label: fig1nb

# Create a combined Type-Gender column for plotting
pat_gender_filtered['Type_Gender'] = pat_gender_filtered['Type'] + ' - ' + pat_gender_filtered['Gender']

# Define color mapping: Purple for Male, Orange for Female
color_map = {
    'Research/Company - Male': '#D7BDE2',      # Light Purple
    'Research/Company - Female': '#FAD7A0',    # Light Orange
    'Solo Inventor - Male': '#5B2C6F',         # Dark Purple
    'Solo Inventor - Female': '#D35400'        # Dark Orange
}

# Define dash pattern: Research/Company dashed, Solo Inventor solid
dash_map = {
    'Research/Company - Male': 'dash',
    'Research/Company - Female': 'dash',
    'Solo Inventor - Male': 'solid',
    'Solo Inventor - Female': 'solid'
}

fig1 = px.line(pat_gender_filtered,
              x="Year",
              y="Patents",
              color='Type_Gender',
              line_dash='Type_Gender',
              template="plotly_white",
              color_discrete_map=color_map,
              line_dash_map=dash_map,
              title="<b>Worldwide Patent Applications by Type and Gender</b>",
              height=500,
              width=1100,
              hover_data=["Patents"])

fig1.update_xaxes(title=None, dtick=5, ticks="outside", ticklen=4, range=[1910, 2023])
fig1.update_yaxes(title="Number of Patents")
fig1.update_layout(legend_title=None,
                   font_family="Calibri",
                   font_color="black",
                   title_font_family="Calibri",
                   font=dict(size=14),
                   title_font_color="black",
                   legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="left", x=0.02))

annotations = [
    {'text': "Source: WIPO Patentscope", 'showarrow': False, 'x': 0.99, 'y': -0.15, 
     'xref': 'paper', 'yref': 'paper', 'font': {'size': 11, 'color': "grey"}}
]
for annotation in annotations:
    fig1.add_annotation(annotation)
    
fig1.show()

# Second figure - Histogram/Bar chart version with ungendered categories
# Extended color map including ungendered categories
color_map_extended = {
    'Research/Company - Male': '#D7BDE2',      # Light Purple
    'Research/Company - Female': '#FAD7A0',    # Light Orange
    'Solo Inventor - Male': '#9B59D0',         # Medium Purple
    'Solo Inventor - Female': '#E67E22',       # Medium Orange
    'Solo Inventor - Ungendered': '#95A5A6',   # Medium Grey
    'Research/Company - Ungendered': '#D3D3D3' # Light Grey
}

fig2 = px.bar(pat_gender_filtered,
              x="Year",
              y="Patents",
              color='Type_Gender',
              template="plotly_white",
              color_discrete_map=color_map_extended,
              title="<b>Worldwide Patent Applications by Type and Gender (Bar Chart)</b>",
              height=500,
              width=1100,
              hover_data=["Patents"],
              barmode='stack')

fig2.update_xaxes(title=None, dtick=5, ticks="outside", ticklen=4, range=[1910, 2023])
fig2.update_yaxes(title="Number of Patents")
fig2.update_layout(legend_title=None,
                   font_family="Calibri",
                   font_color="black",
                   title_font_family="Calibri",
                   font=dict(size=14),
                   title_font_color="black",
                   legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="left", x=0.02))

annotations2 = [
    {'text': "Source: WIPO Patentscope", 'showarrow': False, 'x': 0.99, 'y': -0.15, 
     'xref': 'paper', 'yref': 'paper', 'font': {'size': 11, 'color': "grey"}}
]
for annotation in annotations2:
    fig2.add_annotation(annotation)
    
fig2.show()

# Third figure - Percentage/Proportion version
# Calculate proportions per year
pat_gender_proportions = pat_gender_filtered.copy()
total_per_year = pat_gender_proportions.groupby('Year')['Patents'].transform('sum')
pat_gender_proportions['Proportion'] = (pat_gender_proportions['Patents'] / total_per_year) * 100

fig3 = px.bar(pat_gender_proportions,
              x="Year",
              y="Proportion",
              color='Type_Gender',
              template="plotly_white",
              color_discrete_map=color_map_extended,
              category_orders={'Type_Gender': [
                  'Research/Company - Male',
                  'Research/Company - Ungendered',
                  'Research/Company - Female',
                  'Solo Inventor - Male',
                  'Solo Inventor - Ungendered',
                  'Solo Inventor - Female'
              ]},
              title="<b>Worldwide Patent Applications by Type and Gender (Percentage)</b>",
              height=500,
              width=1100,
              hover_data=["Proportion", "Patents"],
              barmode='stack')

fig3.update_xaxes(title=None, dtick=5, ticks="outside", ticklen=4, range=[1910, 2023])
fig3.update_yaxes(title="Percentage (%)")
fig3.update_layout(legend_title=None,
                   font_family="Calibri",
                   font_color="black",
                   title_font_family="Calibri",
                   font=dict(size=14),
                   title_font_color="black",
                   legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="left", x=0.02))

annotations3 = [
    {'text': "Source: WIPO Patentscope", 'showarrow': False, 'x': 0.99, 'y': -0.15, 
     'xref': 'paper', 'yref': 'paper', 'font': {'size': 11, 'color': "grey"}}
]
for annotation in annotations3:
    fig3.add_annotation(annotation)
    
fig3.show()

# Fourth figure - Research/Company only (Percentage)
pat_company_proportions = pat_gender_proportions[pat_gender_proportions['Type'] == 'Research/Company'].copy()
# Recalculate proportions for companies only
total_company_per_year = pat_company_proportions.groupby('Year')['Patents'].transform('sum')
pat_company_proportions['Proportion'] = (pat_company_proportions['Patents'] / total_company_per_year) * 100

fig4 = px.bar(pat_company_proportions,
              x="Year",
              y="Proportion",
              color='Type_Gender',
              template="plotly_white",
              color_discrete_map=color_map_extended,
              category_orders={'Type_Gender': [
                  'Research/Company - Male',
                  'Research/Company - Ungendered',
                  'Research/Company - Female'
              ]},
              title="<b>Research/Company Patent Applications by Gender (Percentage)</b>",
              height=500,
              width=1100,
              hover_data=["Proportion", "Patents"],
              barmode='stack')

fig4.update_xaxes(title=None, dtick=5, ticks="outside", ticklen=4, range=[1910, 2023])
fig4.update_yaxes(title="Percentage (%)")
fig4.update_layout(legend_title=None,
                   font_family="Calibri",
                   font_color="black",
                   title_font_family="Calibri",
                   font=dict(size=14),
                   title_font_color="black",
                   legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="left", x=0.02))

annotations4 = [
    {'text': "Source: WIPO Patentscope", 'showarrow': False, 'x': 0.99, 'y': -0.15, 
     'xref': 'paper', 'yref': 'paper', 'font': {'size': 11, 'color': "grey"}}
]
for annotation in annotations4:
    fig4.add_annotation(annotation)
    
fig4.show()

# Fifth figure - Solo Inventor only (Percentage)
pat_solo_proportions = pat_gender_proportions[pat_gender_proportions['Type'] == 'Solo Inventor'].copy()
# Recalculate proportions for solo inventors only
total_solo_per_year = pat_solo_proportions.groupby('Year')['Patents'].transform('sum')
pat_solo_proportions['Proportion'] = (pat_solo_proportions['Patents'] / total_solo_per_year) * 100

fig5 = px.bar(pat_solo_proportions,
              x="Year",
              y="Proportion",
              color='Type_Gender',
              template="plotly_white",
              color_discrete_map=color_map_extended,
              category_orders={'Type_Gender': [
                  'Solo Inventor - Male',
                  'Solo Inventor - Ungendered',
                  'Solo Inventor - Female'
              ]},
              title="<b>Solo Inventor Patent Applications by Gender (Percentage)</b>",
              height=500,
              width=1100,
              hover_data=["Proportion", "Patents"],
              barmode='stack')

fig5.update_xaxes(title=None, dtick=5, ticks="outside", ticklen=4, range=[1910, 2023])
fig5.update_yaxes(title="Percentage (%)")
fig5.update_layout(legend_title=None,
                   font_family="Calibri",
                   font_color="black",
                   title_font_family="Calibri",
                   font=dict(size=14),
                   title_font_color="black",
                   legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="left", x=0.02))

annotations5 = [
    {'text': "Source: WIPO Patentscope", 'showarrow': False, 'x': 0.99, 'y': -0.15, 
     'xref': 'paper', 'yref': 'paper', 'font': {'size': 11, 'color': "grey"}}
]
for annotation in annotations5:
    fig5.add_annotation(annotation)
    
fig5.show()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
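
The five figures above repeat the same axis, layout and annotation settings. A sketch of collecting those settings in one place so they are defined once. The function names `common_layout`, `source_annotation` and `style_figure` are hypothetical, while the values are copied from the cells above:

```python
def common_layout():
    """Layout settings shared by all figures in this notebook."""
    return dict(
        legend_title=None,
        font_family="Calibri",
        font_color="black",
        title_font_family="Calibri",
        font=dict(size=14),
        title_font_color="black",
        legend=dict(orientation="v", yanchor="top", y=0.98, xanchor="left", x=0.02),
    )


def source_annotation(text="Source: WIPO Patentscope"):
    """Grey source note placed below the plot area, in paper coordinates."""
    return dict(text=text, showarrow=False, x=0.99, y=-0.15,
                xref="paper", yref="paper", font=dict(size=11, color="grey"))


def style_figure(fig, y_title):
    """Apply the shared axes, layout and source note to a Plotly figure and return it."""
    fig.update_xaxes(title=None, dtick=5, ticks="outside", ticklen=4, range=[1910, 2023])
    fig.update_yaxes(title=y_title)
    fig.update_layout(**common_layout())
    fig.add_annotation(source_annotation())
    return fig
```

With this in place, each figure could be finished with a single call such as `style_figure(fig3, "Percentage (%)").show()`.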



In [16]:
#| label: fig2nb

IFrame(src='https://janeabdo.github.io/carousel/', width='800', height='700')