4. Assignee Type and Geographic Concentration
We hypothesize that patents assigned to organizations (companies, universities, government bodies) have a more concentrated geographic footprint than patents assigned to individuals or lacking a clear assignee. Specifically, we expect corporate-assigned patents to cluster around business hubs such as California and New York, while individually filed patents should be more dispersed.

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import json
import requests

In [2]:
loc_df = pd.read_csv("data/g_location_disambiguated.tsv", sep="\t", header=0)
loc_df[:5]

Unnamed: 0,location_id,disambig_city,disambig_state,disambig_country,latitude,longitude,county,state_fips,county_fips
0,00235947-16c8-11ed-9b5f-1234bde3cd05,Westfield,PA,US,41.919237,-77.538874,Tioga,42.0,117.0
1,00236a27-16c8-11ed-9b5f-1234bde3cd05,Helfenstein,PA,US,40.750499,-76.447334,Schuylkill County,42.0,107.0
2,00236f47-16c8-11ed-9b5f-1234bde3cd05,Pine Forge,PA,US,40.28192,-75.692236,Berks County,42.0,11.0
3,00237418-16c8-11ed-9b5f-1234bde3cd05,Partlow,VA,US,38.038748,-77.638876,Spotsylvania County,51.0,177.0
4,002378d7-16c8-11ed-9b5f-1234bde3cd05,Stumpy Point,NC,US,35.698506,-75.740453,Dare,37.0,55.0


In [3]:
app_df = pd.read_csv("data/g_application.tsv", sep="\t", header=0)
app_df[:5]

Unnamed: 0,application_id,patent_id,patent_application_type,filing_date,series_code,rule_47_flag
0,5497504,3963197,5,1074-08-14,5,0.0
1,5508062,3933359,5,1074-09-23,5,0.0
2,5518254,3941467,5,1074-10-29,5,0.0
3,5518570,3936670,5,1074-10-29,5,0.0
4,5555245,4003574,5,1075-03-04,5,0.0


In [4]:
assign_df = pd.read_csv("data/g_assignee_disambiguated.tsv", sep="\t", header=0)
assign_df[:5]

Unnamed: 0,patent_id,assignee_sequence,assignee_id,disambig_assignee_individual_name_first,disambig_assignee_individual_name_last,disambig_assignee_organization,assignee_type,location_id
0,4488683,0,7f675c65-8447-40ca-8147-b9c093a37237,,,Metal Works Ramat David,3.0,50dc5d46-16c8-11ed-9b5f-1234bde3cd05
1,11872626,0,225f1f9f-3540-4c39-9ae7-7621dd54ac76,,,"DIVERGENT TECHNOLOGIES, INC.",2.0,15c69712-16c8-11ed-9b5f-1234bde3cd05
2,5856666,0,94fe09ed-98d1-416b-83b5-98ee41249b5c,,,U.S. Philips Corporation,2.0,92237ca2-16c8-11ed-9b5f-1234bde3cd05
3,5204210,0,d1a6baec-354d-4ab2-b952-0dc79e430a4b,,,Xerox Corporation,2.0,0cd1998f-16c8-11ed-9b5f-1234bde3cd05
4,5302149,1,9fe92432-4ade-44f4-9be2-d4fc8e92054b,,,Commonwealth Scientific and Industrial Researc...,3.0,4d36742f-16c8-11ed-9b5f-1234bde3cd05


In [5]:
#assign_df['assignee_type'].value_counts()

In [6]:
patent_df = app_df.merge(
    assign_df[['patent_id', 'location_id', 'assignee_type']],
    on="patent_id",
    how="left")
patent_df = patent_df.merge(
    loc_df[['location_id', 'disambig_state', 'disambig_country', 'latitude', 'longitude', 'state_fips', 'county_fips']],
    on="location_id",
    how="left")
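
A left merge keeps one row per (application, assignee) pair, so a patent with more than one assignee (note the `assignee_sequence` column in `assign_df`) appears more than once in `patent_df`, and applications without an assignee record come through with missing values. A minimal sketch of how both effects could be checked, using toy frames whose values are illustrative and not taken from the data:

```python
import pandas as pd

# Toy stand-ins for app_df and assign_df (values are illustrative only)
toy_app = pd.DataFrame({'patent_id': ['1', '2', '3']})
toy_assign = pd.DataFrame({
    'patent_id': ['1', '1', '3'],       # patent '1' has two assignees
    'assignee_type': [2.0, 4.0, 2.0],
})

# indicator=True adds a '_merge' column recording where each row came from
merged = toy_app.merge(toy_assign, on='patent_id', how='left', indicator=True)

rows_added = len(merged) - len(toy_app)                  # duplicates from multi-assignee patents
unmatched = int((merged['_merge'] == 'left_only').sum())  # applications with no assignee row

print(f"rows added by the merge: {rows_added}")
print(f"applications without an assignee: {unmatched}")
```

If the same check on the real frames shows many added rows, the later counts are counts of patent-assignee pairs rather than of patents.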

In [7]:
patent_df[:5]

Unnamed: 0,application_id,patent_id,patent_application_type,filing_date,series_code,rule_47_flag,location_id,assignee_type,disambig_state,disambig_country,latitude,longitude,state_fips,county_fips
0,5497504,3963197,5,1074-08-14,5,0.0,,3.0,,,,,,
1,5508062,3933359,5,1074-09-23,5,0.0,,,,,,,,
2,5518254,3941467,5,1074-10-29,5,0.0,a26e22db-16c8-11ed-9b5f-1234bde3cd05,2.0,CA,US,37.444329,-122.159847,6.0,85.0
3,5518570,3936670,5,1074-10-29,5,0.0,,,,,,,,
4,5555245,4003574,5,1075-03-04,5,0.0,a05a9b40-16c8-11ed-9b5f-1234bde3cd05,2.0,NJ,US,39.486278,-75.025426,34.0,11.0


In [8]:
us_patent_df = patent_df[patent_df['disambig_country'] == 'US']

In [9]:
len(us_patent_df)

4195067

In [10]:
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
cleaned_patent_df = us_patent_df.dropna(subset=['assignee_type', 'state_fips', 'county_fips']).copy()
cleaned_patent_df['assignee_type'] = cleaned_patent_df['assignee_type'].astype(int)
# Zero-pad to a 2-digit state code and a 3-digit county code
cleaned_patent_df['state_fips'] = cleaned_patent_df['state_fips'].astype(int).astype(str).str.zfill(2)
cleaned_patent_df['county_fips'] = cleaned_patent_df['county_fips'].astype(int).astype(str).str.zfill(3)
# Join them into the 5-digit county FIPS used by the counties GeoJSON
cleaned_patent_df['county_fips'] = cleaned_patent_df['state_fips'] + cleaned_patent_df['county_fips']

In [11]:
# Collapse the numeric assignee_type codes into three groups; codes not listed map to NaN
cleaned_patent_df = cleaned_patent_df.assign(type_name=cleaned_patent_df['assignee_type'].map({
    1:'individual', 2:'company', 3:'company', 4:'individual', 5:'individual', 6:'government',
    7:'government', 8:'government', 9:'government'
}))


In [12]:
cleaned_patent_df['type_name'].value_counts()

type_name
company       4070917
government      49378
individual      45038
Name: count, dtype: int64

In [13]:
len(cleaned_patent_df)

4168865
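
The two outputs above do not add up: the three `value_counts()` rows sum to 4,165,333, while `len(cleaned_patent_df)` is 4,168,865, so 3,532 rows have no `type_name`. `value_counts()` skips NaN by default and `.map()` returns NaN for any key missing from the dictionary, so these are presumably rows whose `assignee_type` is not one of 1 to 9. A minimal sketch of how such rows could be inspected; the toy frame and the codes 12 and 15 are hypothetical stand-ins, not values confirmed from the data:

```python
import pandas as pd

type_map = {
    1: 'individual', 2: 'company', 3: 'company', 4: 'individual', 5: 'individual',
    6: 'government', 7: 'government', 8: 'government', 9: 'government',
}

# Toy frame; 12 and 15 are hypothetical codes that the map does not cover
toy = pd.DataFrame({'assignee_type': [2, 3, 4, 6, 12, 15]})
toy['type_name'] = toy['assignee_type'].map(type_map)

# Rows the map left unlabeled, and which codes they carry
unmapped = toy[toy['type_name'].isna()]
unmapped_codes = unmapped['assignee_type'].value_counts()

print(f"unmapped rows: {len(unmapped)}")
print(unmapped_codes)
```

Running the same two lines on `cleaned_patent_df` would show which codes account for the gap, after which they can be either added to the map or dropped explicitly.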

In [14]:
#cleaned_sample_df = cleaned_patent_df.sample(n=500000, random_state=42)

company_sample_n = 50000
# Keep all 'individual' and 'government' records
filtered_df = cleaned_patent_df[cleaned_patent_df['type_name'].isin(['individual', 'government'])]

# Randomly sample 'company' records
company_sample = cleaned_patent_df[cleaned_patent_df['type_name'] == 'company'].sample(n=company_sample_n, random_state=42)

# Combine them
cleaned_sample_df = pd.concat([filtered_df, company_sample]).reset_index(drop=True)

In [15]:
cleaned_sample_df['type_name'].value_counts()

type_name
company       50000
government    49378
individual    45038
Name: count, dtype: int64

In [16]:
# Download U.S. counties GeoJSON (simplified)
geojson_url = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"
response = requests.get(geojson_url, timeout=60)
response.raise_for_status()  # stop here if the download failed
counties_geo = response.json()

In [17]:
# Scatter map of sampled patent locations, colored by assignee type
fig = px.scatter_geo(
    cleaned_sample_df,
    lat='latitude',
    lon='longitude',
    color='type_name',  # color by assignee type
    scope='usa',
    hover_name='type_name',  # or another field like 'assignee_name'
    title='Patent Applications by Assignee Type in the U.S.',
    color_discrete_map={
        'company': 'blue',
        'government': 'green',
        'individual': 'red'
    }
)

# Update layout for map styling
#fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})

pio.renderers.default = 'browser'
fig.show()

In [18]:
density_by_county_type = (cleaned_patent_df.groupby(['county_fips', 'type_name']).size().reset_index(name='count'))
density_by_county_type_sub = (cleaned_sample_df.groupby(['county_fips', 'type_name']).size().reset_index(name='count'))
# For each county, keep the row with the highest count
dominant_type = density_by_county_type.sort_values('count', ascending=False).drop_duplicates('county_fips')
dominant_type_sub = density_by_county_type_sub.sort_values('count', ascending=False).drop_duplicates('county_fips')
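
Because companies outnumber the other two groups by roughly 80 to 1 in the full data, the raw-count winner in a county will almost always be `company`, which is why the downsampled frame is also computed. An alternative that needs no sampling is to compare, for each county, the share of each type's national total that falls there. A minimal sketch on toy counts; the `type_share` column and the numbers are illustrative assumptions, not part of the original analysis:

```python
import pandas as pd

# Toy counts shaped like density_by_county_type (numbers are illustrative)
toy = pd.DataFrame({
    'county_fips': ['06085', '06085', '42117', '42117'],
    'type_name': ['company', 'individual', 'company', 'individual'],
    'count': [900, 5, 100, 5],
})

# Share of each type's national total that falls in each county
toy['type_share'] = toy['count'] / toy.groupby('type_name')['count'].transform('sum')

# Winner per county by raw count (what the notebook computes)
dominant_raw = (
    toy.loc[toy.groupby('county_fips')['count'].idxmax()]
    .set_index('county_fips')['type_name']
)

# Winner per county by share of the type's national total
dominant_share = (
    toy.loc[toy.groupby('county_fips')['type_share'].idxmax()]
    .set_index('county_fips')['type_name']
)

print(dominant_raw)
print(dominant_share)
```

In the toy data both counties are `company` by raw count, but the second county is `individual` by share, since it holds half of all individual patents and only a tenth of the company ones.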

In [19]:
# Loop over assignee types and create maps
for assignee in ['individual', 'company', 'government']:
    sub_df = density_by_county_type[density_by_county_type['type_name'] == assignee]

    fig = px.choropleth(
        sub_df,
        geojson=counties_geo,
        locations='county_fips',
        color='count',
        color_continuous_scale='Blues',
        scope='usa',
        title=f'Patent Density by County — {assignee.capitalize()} Assignee',
    )
    # fig.update_geos(fitbounds="locations", visible=False)
    fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
    pio.renderers.default = 'browser'
    fig.show()

In [20]:
# Dominant assignee type per county in the sample (companies downsampled to 50,000)
fig = px.choropleth(
    dominant_type_sub,
    geojson=counties_geo,
    locations='county_fips',
    color='type_name',
    color_discrete_map={
        'individual': 'red',
        'company': 'blue',
        'government': 'green'
    },
    scope='usa',
    title='Dominant Patent Assignee Type by County (Companies Downsampled)'
)

#fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})

pio.renderers.default = 'browser'
fig.show()

In [21]:
# Dominant assignee type per county in the full dataset
fig = px.choropleth(
    dominant_type,
    geojson=counties_geo,
    locations='county_fips',
    color='type_name',
    color_discrete_map={
        'individual': 'red',
        'company': 'blue',
        'government': 'green'
    },
    scope='usa',
    title='Dominant Patent Assignee Type by County (Full Dataset)'
)

#fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})

pio.renderers.default = 'browser'
fig.show()

In [22]:
# Group and count patents per state per assignee type
state_counts = (cleaned_patent_df.groupby(['disambig_state', 'type_name']).size().reset_index(name='count'))
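
The bar charts that follow show counts, but the hypothesis is about concentration. One way to put a single number on it is a Herfindahl-Hirschman index over each type's state shares: the sum of squared shares, which runs from 1/n for an even spread over n states up to 1.0 when one state holds everything. A minimal sketch on toy counts shaped like `state_counts`; the `hhi_by_type` helper and the numbers are illustrative assumptions, not results from the data:

```python
import pandas as pd

def hhi_by_type(counts: pd.DataFrame) -> pd.Series:
    """Herfindahl-Hirschman index of state shares for each assignee type.

    Expects the columns 'disambig_state', 'type_name' and 'count'.
    """
    totals = counts.groupby('type_name')['count'].transform('sum')
    shares = counts['count'] / totals
    return (shares ** 2).groupby(counts['type_name']).sum()

# Toy counts (illustrative only)
toy_counts = pd.DataFrame({
    'disambig_state': ['CA', 'NY', 'TX', 'CA', 'NY', 'TX'],
    'type_name': ['company'] * 3 + ['individual'] * 3,
    'count': [80, 10, 10, 40, 30, 30],
})

toy_hhi = hhi_by_type(toy_counts)
print(toy_hhi)
```

Calling `hhi_by_type(state_counts)` on the real frame would give one value per type; a higher value for `company` than for `individual` would be consistent with the hypothesis.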

In [23]:
fig = px.bar(
    state_counts,
    x='disambig_state',
    y='count',
    color='type_name',
    title='Patent Count by State and Assignee Type',
    labels={'count': 'Number of Patents', 'disambig_state': 'State'},
    barmode='stack',  # or 'group' for side-by-side bars
    category_orders={'disambig_state': sorted(state_counts['disambig_state'].unique())}
)

fig.update_layout(
    xaxis_tickangle=-45,
    xaxis=dict(
        tickmode='linear',   # force evenly spaced ticks
        dtick=1              # one tick per category
    )
)
pio.renderers.default = 'browser'
fig.show()

In [24]:
# Get unique assignee types
assignee_types = state_counts['type_name'].unique()
states = sorted(state_counts['disambig_state'].unique())
# Create a bar trace for each assignee type
fig = go.Figure()
for atype in assignee_types:
    df_sub = state_counts[state_counts['type_name'] == atype]
    fig.add_trace(go.Bar(
        x=df_sub['disambig_state'],
        y=df_sub['count'],
        name=atype,
        visible=(atype == 'individual')  # Show only one at a time initially
    ))

In [25]:
# Create dropdown buttons
buttons = []
for i, atype in enumerate(assignee_types):
    visibility = [False] * len(assignee_types)
    visibility[i] = True
    buttons.append(dict(label=atype, method="update", args=[{"visible": visibility}, {"title": f"Patents by State — {atype.capitalize()}"}]))

In [26]:
# Add dropdown to layout
fig.update_layout(
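    updatemenus=[{
        "buttons": buttons,
        # Start the dropdown on the trace that is initially visible
        "active": list(assignee_types).index('individual'),
        "direction": "down",
        "showactive": True,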
        "x": 0.0,
        "xanchor": "left",
        "y": 1.15,
        "yanchor": "top"
    }],
    xaxis=dict(
        tickmode='linear',
        dtick=1
    ),
    xaxis_tickangle=-45,
    barmode='stack',  # still allows stacked view per state if you switch back
    title="Patents by State — Individual"
)
pio.renderers.default = 'browser'
fig.show()