# Hypothesis 4: Assignee Type and Concentration
- We hypothesize that patents assigned to organizations (companies, universities, government) show a more concentrated geographic footprint compared to patents with individual or no clear assignee.
- Specifically, we expect corporate-assigned patents to cluster around business hubs (e.g. California and New York), while individually filed patents may appear more dispersed.
- We tested this by comparing county-level patent counts for individual and company assignees.
- We prepared three visualizations and one statistical test:
1. Scatter map for a brief overview
2. Density map
3. Bar plot for clearer comparison
4. Mann–Whitney U test comparing patent counts in urban and rural counties for each assignee type

## Package import

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import json
import requests
from scipy.stats import mannwhitneyu

# This is for importing custom functions
from assist_scripts import merge, fips_merger

## Import files
- Patent application dataset, "g_application.tsv":"https://s3.amazonaws.com/data.patentsview.org/download/g_application.tsv.zip"
- Patent applicant assignee (individual, corporation, government) dataset, "g_assignee_disambiguated.tsv":"https://s3.amazonaws.com/data.patentsview.org/download/g_assignee_disambiguated.tsv.zip"
- Patent application location dataset, "g_location_disambiguated.tsv":"https://s3.amazonaws.com/data.patentsview.org/download/g_location_disambiguated.tsv.zip"
- Urban information filter on counties, "data/2020_UA_COUNTY.xlsx":"https://www2.census.gov/geo/docs/reference/ua/2020_UA_COUNTY.xlsx"


In [2]:
# Patent application dataset: one row per patent application (it also holds the filing date needed for year-wise grouping, which is not loaded here)
# It will be joined with the assignee dataset, which carries the location ID, since an assignee may have more than one application
application_df = pd.read_csv("data/g_application.tsv", sep="\t", header=0,
                             usecols=['application_id', 'patent_id'],
                             dtype={'application_id': str, 'patent_id': str},
                             encoding='unicode_escape'
                             )

In [3]:
# Assignee dataset: contains the location ID, which links each assignee to the latitude and longitude in the location data below
assignee_df = pd.read_csv("data/g_assignee_disambiguated.tsv", sep="\t", header=0,
                          usecols=['patent_id', 'location_id', 'assignee_type'],
                          dtype={'patent_id': str, 'location_id': str}
                          )

In [4]:
# Location dataset: includes the location ID, latitude, and longitude, which are needed for the mapping
location_df = pd.read_csv("data/g_location_disambiguated.tsv", sep="\t", header=0,
                          usecols=['location_id', 'disambig_state', 'disambig_country', 'latitude', 'longitude', 'state_fips', 'county_fips'],
                          dtype={'location_id': str, 'disambig_state': str, 'disambig_country': str, 'latitude': float, 'longitude': float}
                          )

In [5]:
# https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html
urban_county_df = pd.read_excel("data/2020_UA_COUNTY.xlsx", sheet_name="2020_UA_COUNTY",
                            usecols=['STATE', 'COUNTY', 'ALAND_PCT_URB'],
                            dtype={'STATE': str, 'COUNTY': str, 'ALAND_PCT_URB': float})

In [6]:
# Download U.S. counties GeoJSON (simplified)
geojson_url = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"
counties_geo = requests.get(geojson_url).json()

## 1. Scatter map

### 1-1. Merging dataframes for combining required data for analysis

In [7]:
# We merge the application, assignee, and location dataframes to get the location of each patent application

# Merge the application and assignee dataframes: application_df contains the actual applications, and assignee_df contains the assignee type (individual, company, government) and location ID for each application
patent_df = merge(merge_on_df=application_df, merge_from_df=assignee_df,
                  merge_from_keep=['patent_id', 'location_id', 'assignee_type'],
                  foreign_key_left="patent_id")

# Merge patent_df with location_df to get the latitude, longitude, and FIPS codes for each patent application
patent_df = merge(merge_on_df=patent_df, merge_from_df=location_df,
                  merge_from_keep=['location_id', 'disambig_state', 'disambig_country', 'latitude', 'longitude', 'state_fips', 'county_fips'],
                  foreign_key_left="location_id")
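The `merge` helper comes from the project's own `assist_scripts` module and is not shown in this notebook. As a rough sketch only, assuming it performs a left join of selected columns onto the base dataframe, it could look like the following. The name `merge_sketch`, the join type, and the toy data are assumptions for illustration, not the actual implementation.

```python
import pandas as pd


def merge_sketch(merge_on_df, merge_from_df, merge_from_keep, foreign_key_left):
    """Hypothetical stand-in for assist_scripts.merge (assumed to be a left join)."""
    # Keep only the requested columns from the right-hand frame, then join on the key
    return merge_on_df.merge(merge_from_df[merge_from_keep], how="left", on=foreign_key_left)


# Toy data: patent p1 has two assignee rows, patent p3 has none
applications = pd.DataFrame({"application_id": ["a1", "a2", "a3"],
                             "patent_id": ["p1", "p2", "p3"]})
assignees = pd.DataFrame({"patent_id": ["p1", "p1", "p2"],
                          "location_id": ["l1", "l2", "l3"],
                          "assignee_type": [2, 4, 2],
                          "unused": [0, 0, 0]})

merged = merge_sketch(applications, assignees,
                      ["patent_id", "location_id", "assignee_type"], "patent_id")
print(merged)
```

Under this assumption, a patent with several assignees yields several rows, and a patent without an assignee keeps a row with missing values, which is why rows with missing fields are dropped later.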

In [8]:
# Keep only patents whose assignee location is in the US
us_patent_df = patent_df[patent_df['disambig_country'] == 'US']

In [9]:
# Drop rows that lack an assignee type or FIPS codes; .copy() creates an independent DataFrame,
# which avoids pandas' SettingWithCopyWarning on the column assignments below
cleaned_patent_df = us_patent_df.dropna(subset=['assignee_type', 'state_fips', 'county_fips']).copy()
cleaned_patent_df['assignee_type'] = cleaned_patent_df['assignee_type'].astype(int)
# Create the 5-digit county FIPS code: state_fips and county_fips are floats (with a trailing .0), so they are first converted to integers and then to strings
cleaned_patent_df['county_fips'] = fips_merger(cleaned_patent_df['state_fips'], cleaned_patent_df['county_fips'])

In [10]:
# Map the numeric assignee_type codes to readable names for the visualizations
# (codes not listed in the dictionary become NaN)
cleaned_patent_df['type_name'] = cleaned_patent_df['assignee_type'].map({
    1:'individual', 2:'company', 3:'company', 4:'individual', 5:'individual', 6:'government',
    7:'government', 8:'government', 9:'government'
})
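The `fips_merger` helper is also imported from `assist_scripts` and is not shown here. A minimal sketch of the float-to-string conversion described in the comment above might look like this, assuming `state_fips` is the 2-digit state code and `county_fips` the 3-digit county code. The function name `fips_merger_sketch` and the zero-padding widths are assumptions.

```python
import pandas as pd


def fips_merger_sketch(state_fips: pd.Series, county_fips: pd.Series) -> pd.Series:
    """Hypothetical stand-in for assist_scripts.fips_merger."""
    # Floats such as 6.0 become integers, then zero-padded strings ('06')
    state = state_fips.astype(int).astype(str).str.zfill(2)
    # County codes are assumed to be 3 digits wide ('085')
    county = county_fips.astype(int).astype(str).str.zfill(3)
    return state + county


demo = pd.DataFrame({"state_fips": [6.0, 36.0, 1.0],
                     "county_fips": [85.0, 61.0, 1.0]})
demo["fips"] = fips_merger_sketch(demo["state_fips"], demo["county_fips"])
print(demo["fips"].tolist())  # ['06085', '36061', '01001']
```

The zero padding matters because the county GeoJSON and the Census urban file both identify counties by 5-character strings, so `'6085'` would not match `'06085'`.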


In [11]:
# Keep all 'individual' and 'government' records
filtered_df = cleaned_patent_df[cleaned_patent_df['type_name'].isin(['individual', 'government'])]

# Randomly sample 50000 'company' records
company_sample = cleaned_patent_df[cleaned_patent_df['type_name'] == 'company'].sample(n=50000, random_state=42)

# Combine them
cleaned_sample_df = pd.concat([filtered_df, company_sample]).reset_index(drop=True)

### Function for visualization plot
- The function covers the scatter map, density map, and bar plot used in Hypothesis 4
- The interactive bar plot with a dropdown menu is built separately, because it was hard to fit into this function

In [12]:
def vizualization(geojson, dataset, viz_type: str, lon=None, lat=None, hover_txt=None,
                  location_layout: str = None, color_by: str = None, title: str = None,
                  color_disc_map: list[str] = None, label: str = None, add_trace: bool = False,
                  bar_x=None, bar_y=None, bar_stack: bool = True):
    '''
    Visualize patent application data on a U.S. map or as a bar chart

    geojson: GeoJSON with the boundaries used for the U.S. map
    dataset: DataFrame that holds the data to plot
    viz_type: str, the type of visualization, can be 'choropleth', 'scattergeo', or 'bar'
    lon: column name or list, longitude of the patent application
    lat: column name or list, latitude of the patent application
    hover_txt: the text that will be displayed when hovering over the patent application
    location_layout: str, the column name that holds the location code, can be 'state_fips' or 'county_fips'
    color_by: str, the column name used for coloring, can be 'type_name' or 'count'
    title: str, the title of the visualization
    color_disc_map: list, three category names that are colored red, blue, and green in that order, e.g. ['company', 'government', 'individual']
    label: str, the display label for the 'count' column
    add_trace: boolean, whether to overlay star markers at lon/lat on the choropleth
    bar_x: str, the column name of the x axis of the bar chart
    bar_y: str, the column name of the y axis of the bar chart
    bar_stack: boolean, whether to stack the bar chart or not
    return: None
    '''
    if viz_type == 'choropleth':
        fig = px.choropleth(
            dataset,
            geojson=geojson,
            locations=location_layout,
            color=color_by,
            color_continuous_scale="Blues",
            color_discrete_map={color_disc_map[0]: 'red', color_disc_map[1]: 'blue',
                                color_disc_map[2]: 'green'} if color_disc_map else None,
            scope="usa",
            labels={'count': label} if label else None,
            title=title)
        if add_trace is True:
            fig.add_trace(go.Scattergeo(
                lon=lon,
                lat=lat,
                hovertext=hover_txt,
                mode='markers',
                marker=dict(
                    size=7,
                    color='red',
                    symbol='star'
                ),
                name='Universities'
            ))
        fig.update_layout(margin={"r": 0, "t": 40, "l": 0, "b": 0})
        # pio.renderers.default = 'browser'
        fig.show()

    elif viz_type == 'scattergeo':
        fig = px.scatter_geo(
            dataset,
            lon=lon,
            lat=lat,
            color=color_by,  # color by assignee type
            scope='usa',
            hover_name=hover_txt,  # or another field like 'assignee_name'
            title=title,
            color_discrete_map={color_disc_map[0]: 'red', color_disc_map[1]: 'blue',
                                color_disc_map[2]: 'green'} if color_disc_map else None,
            labels={'count': label} if label else None)
        fig.update_layout(margin={"r": 0, "t": 40, "l": 0, "b": 0})
        # pio.renderers.default = 'browser'
        fig.show()

    elif viz_type == 'bar':
        fig = px.bar(
            dataset,
            x=bar_x,
            y=bar_y,
            color=color_by,
            title=title,
            labels={'count': 'Number of Patents', 'disambig_state': 'State'},
            barmode='stack' if bar_stack else 'group',  # stacked bars, or side-by-side bars when bar_stack is False
            category_orders={bar_x: sorted(dataset[bar_x].unique())}
        )

        fig.update_layout(
            xaxis_tickangle=-45,
            xaxis=dict(
                tickmode='linear',  # force evenly spaced ticks
                dtick=1  # one tick per category
            ))
        # pio.renderers.default = 'browser'
        fig.show()

In [13]:
# Count patents per county and assignee type, for the full dataset and for the subsample used on the maps
density_by_county_type = (cleaned_patent_df.groupby(['county_fips', 'type_name']).size().reset_index(name='count'))
density_by_county_type_sub = (cleaned_sample_df.groupby(['county_fips', 'type_name']).size().reset_index(name='count'))
# For each county, keep the assignee type with the highest count
dominant_type = density_by_county_type.sort_values('count', ascending=False).drop_duplicates('county_fips')
dominant_type_sub = density_by_county_type_sub.sort_values('count', ascending=False).drop_duplicates('county_fips')
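The dominant-type step relies on `drop_duplicates` keeping the first row it sees for each county, which is the largest count after the descending sort. The toy example below (made-up counts, same column names as `density_by_county_type`) shows this and checks it against an equivalent `groupby(...).idxmax()` formulation. It is an illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up county-level counts, using the same column names as density_by_county_type
counts = pd.DataFrame({
    "county_fips": ["06085", "06085", "36061", "36061", "01001"],
    "type_name": ["company", "individual", "company", "individual", "individual"],
    "count": [120, 30, 40, 55, 3],
})

# Approach used in the notebook: sort descending, keep the first row per county
dominant_a = counts.sort_values("count", ascending=False).drop_duplicates("county_fips")

# Equivalent formulation: select the row with the largest count within each county
dominant_b = counts.loc[counts.groupby("county_fips")["count"].idxmax()]

as_dict_a = dominant_a.set_index("county_fips")["type_name"].to_dict()
as_dict_b = dominant_b.set_index("county_fips")["type_name"].to_dict()
print(as_dict_a)
```

Both approaches pick one type per county. When two types tie within a county, which one wins depends on row order, so ties may deserve a separate check on the real data.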

## 1. Scatter map
- This visualization plots each patent on a U.S. map, colored by whether the assignee is an individual, a government, or a company.
- As in Hypothesis 2, the map uses a subsampled dataset for performance reasons.
- The dataset was reduced to around 1/20 of its size because of the large sample (n > 10M)
- The plot was created with assistance from ChatGPT

In [14]:
# Scatter map of patent applications, colored by assignee type
vizualization(geojson=counties_geo, dataset=cleaned_sample_df, viz_type='scattergeo',
              lon='longitude', lat='latitude',
              color_by='type_name', color_disc_map=['company', 'government', 'individual'], title='Patent Applications by Assignee Type in the U.S.', hover_txt='type_name')

### Result from scatter map
- It is unclear from the scatter map whether organizations are more concentrated in urban areas than individuals

## 2. Density map
- These maps show the density of patent applications by county
- The first visualization shows three heatmaps, one each for 'individual', 'company', and 'government', to show how each type concentrates across counties
- The second visualization colors each county by its dominant assignee type ('individual', 'company', or 'government'). Companies are subsampled to 50,000 records, a size similar to the individual group, to check dominance across counties at comparable sample sizes.
- The third is similar to the second but uses the full dataset, without subsampling companies to 50,000.
- The plots were created with assistance from ChatGPT

In [15]:
# Loop over assignee types and create maps
for assignee in ['individual', 'company', 'government']:
    sub_df = density_by_county_type[density_by_county_type['type_name'] == assignee]
    vizualization(geojson=counties_geo, dataset=sub_df, viz_type='choropleth',
              location_layout='county_fips',
              color_by='count', title=f'Patent Density by County — {assignee.capitalize()} Assignee')

In [16]:
# Dominant assignee type per county, with companies subsampled to 50,000 records
vizualization(geojson=counties_geo, dataset=dominant_type_sub, viz_type='choropleth', location_layout='county_fips',
              color_by='type_name', color_disc_map=['company', 'government', 'individual'], title='Dominant Patent Assignee Type by County')

In [17]:
# Dominant assignee type per county, using the full dataset (companies not subsampled)
vizualization(geojson=counties_geo, dataset=dominant_type, viz_type='choropleth', location_layout='county_fips',
              color_by='type_name', color_disc_map=['company', 'government', 'individual'], title='Dominant Patent Assignee Type by County')

### Result from density maps
- The density maps for individuals and companies show that companies are more widely spread
- The dominant-assignee maps show that
  - Individuals are the majority when companies are subsampled to a size similar to the individual patents
  - Companies are the majority when the full dataset is used
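Reading "more spread" off a map is subjective, so one optional way to put a number on it is a concentration index per assignee type. The sketch below computes a Herfindahl-Hirschman index (sum of squared county shares) and the share held by the top counties. It is not part of the original analysis: the function name `concentration_by_type` and the toy counts are assumptions, and only the column names follow `density_by_county_type`.

```python
import pandas as pd


def concentration_by_type(county_counts: pd.DataFrame, top_k: int = 10) -> pd.DataFrame:
    """Illustrative helper: concentration of patents across counties, per assignee type."""
    rows = []
    for type_name, group in county_counts.groupby("type_name"):
        # Share of this type's patents that falls in each county
        shares = group["count"] / group["count"].sum()
        rows.append({
            "type_name": type_name,
            "n_counties": len(group),                      # counties with at least one patent
            "hhi": float((shares ** 2).sum()),             # 1/n (even spread) up to 1 (one county)
            "top_k_share": float(shares.nlargest(top_k).sum()),
        })
    return pd.DataFrame(rows).set_index("type_name")


# Made-up counts: companies dominated by one county, individuals evenly spread
toy = pd.DataFrame({
    "county_fips": ["06085", "36061", "48201", "06085", "36061", "48201", "17031"],
    "type_name": ["company"] * 3 + ["individual"] * 4,
    "count": [80, 10, 10, 25, 25, 25, 25],
})
summary = concentration_by_type(toy, top_k=1)
print(summary)
```

A higher index means patents of that type sit in fewer counties. Because the index does not depend on the total number of patents, it allows a comparison between companies and individuals without subsampling.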

## 3. Bar plot
- The previous visualizations did not clearly address the main hypothesis: are company-assigned patents more concentrated in certain areas, and individual patents more dispersed?
- We compare this at the state level (there are too many counties to show in a bar plot)
- We prepared 2 bar plots:
  - a stacked bar plot showing the overall patent count per state, split by assignee type
  - an interactive bar plot with a dropdown menu, so that patent counts by state can be checked separately for 'individual', 'company', and 'government'
- The plots were created with assistance from ChatGPT

In [18]:
# Group and count patents per state per assignee type
state_counts = (cleaned_patent_df.groupby(['disambig_state', 'type_name']).size().reset_index(name='count'))

In [19]:
# Plot for bar chart stacked with types of assignee ('company', 'government', 'individual')
vizualization(geojson=counties_geo, dataset=state_counts, viz_type='bar', bar_x='disambig_state', bar_y='count',
              color_by='type_name', title='Patent Count by State and Assignee Type')

In [20]:
# Interactive bar plot with a dropdown menu for the assignee types ('company', 'government', 'individual')
# ChatGPT also helped with creating the buttons and applying them to the visualization
# Get unique assignee types
assignee_types = state_counts['type_name'].unique()
states = sorted(state_counts['disambig_state'].unique())

# Create dropdown buttons
buttons = []
for i, atype in enumerate(assignee_types):
    visibility = [False] * len(assignee_types)
    visibility[i] = True
    buttons.append(dict(label=atype, method="update", args=[{"visible": visibility}, {"title": f"Patents by State — {atype.capitalize()}"}]))

# Create a bar trace for each assignee type
fig = go.Figure()
for atype in assignee_types:
    df_sub = state_counts[state_counts['type_name'] == atype]
    fig.add_trace(go.Bar(
        x=df_sub['disambig_state'],
        y=df_sub['count'],
        name=atype,
        visible=(atype == 'individual')  # Show only one at a time initially
    ))

# Add dropdown to layout
fig.update_layout(
    updatemenus=[{
        "buttons": buttons,
        "direction": "down",
        "showactive": True,
        "x": 0.0,
        "xanchor": "left",
        "y": 1.15,
        "yanchor": "top"
    }],
    xaxis=dict(
        tickmode='linear',
        dtick=1
    ),
    xaxis_tickangle=-45,
    barmode='stack',  # still allows stacked view per state if you switch back
    title="Patents by State — Individual"
)
#pio.renderers.default = 'browser'
fig.show()

### Result from the bar plots
- Company patent activity is indeed concentrated in states such as California and New York.
- However, the same is true for individual patent activity
- From these plots it is hard to say that organizations are more concentrated
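Because companies file far more patents than individuals, raw counts are hard to compare between the two groups. One possible follow-up, sketched here with made-up numbers and the same column names as `state_counts`, is to convert counts into each state's share within its assignee type, so the types can be compared on the same scale. This is an illustrative assumption and not part of the original notebook.

```python
import pandas as pd

# Made-up state-level counts, using the same column names as state_counts
state_counts_demo = pd.DataFrame({
    "disambig_state": ["CA", "NY", "TX", "CA", "NY", "TX"],
    "type_name": ["company"] * 3 + ["individual"] * 3,
    "count": [600, 300, 100, 30, 20, 50],
})

# Total patents per assignee type, aligned to every row
totals = state_counts_demo.groupby("type_name")["count"].transform("sum")
state_counts_demo["share"] = state_counts_demo["count"] / totals

# One row per state, one column per assignee type
shares = state_counts_demo.pivot(index="disambig_state", columns="type_name", values="share")
print(shares.round(2))
```

If the real shares for companies and individuals look alike across states, that would support the reading above that both groups concentrate in the same states.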

## 4. Mann–Whitney U Test
- One final question remains: are companies more clustered around business hubs while individuals are more dispersed?
- We compared county-level patent counts for company assignees in urban vs. rural counties and for individual assignees in urban vs. rural counties.
- We then applied the Mann–Whitney U test to check, for each assignee type, whether patent counts differ between urban and rural counties. If both companies and individuals show a difference, the hypothesis is likely wrong, because individual patents would then be unevenly distributed across counties as well, rather than dispersed.
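To make the test concrete, here is a small worked example with made-up county counts. It also derives an effect size from the U statistic (the fraction of urban/rural pairs in which the urban county has the higher count), since a p-value alone says that the groups differ but not by how much. The effect size is an optional addition and not part of the original analysis.

```python
from scipy.stats import mannwhitneyu

# Made-up patent counts per county (no ties between the two groups)
urban = [12, 40, 7, 95, 33, 18]
rural = [1, 3, 2, 8, 5]

result = mannwhitneyu(urban, rural, alternative="two-sided")

# U counts the (urban, rural) pairs in which the urban county has more patents;
# dividing by the number of pairs gives the probability that a random urban
# county outranks a random rural one
pairs = len(urban) * len(rural)
effect_size = result.statistic / pairs

print(f"U = {result.statistic}, p = {result.pvalue:.4f}, effect size = {effect_size:.3f}")
```

Here 29 of the 30 pairs favor the urban county, so the effect size is close to 1. Computing the same quantity for companies and for individuals would show which group has the stronger urban skew.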

In [21]:
# Keep only counties whose urban land percentage is above 0%
urban_county_df = urban_county_df[urban_county_df['ALAND_PCT_URB'] > 0].copy()

In [22]:
# create combined fips code for county
urban_county_df['FIPS'] = urban_county_df['STATE'] + urban_county_df['COUNTY']

In [23]:
# create column for urban or not for each county based on fips code from urban_county_df
density_by_county_type['is_urban'] = density_by_county_type['county_fips'].isin(urban_county_df['FIPS'])

In [24]:
def filter_urban(dataframe:pd.DataFrame, assign_type:str, county_type:bool) -> pd.DataFrame:
    '''
    :param dataframe: Pandas dataframe, county-level patent counts with 'is_urban' and 'type_name' columns
    :param assign_type: string, assignee type to keep, e.g. 'individual' or 'company'
    :param county_type: boolean, True to keep urban counties, False to keep rural counties
    :return: Pandas dataframe, rows for the given assignee type in only urban or only rural counties
    '''
    return dataframe[(dataframe['is_urban'] == county_type) & (dataframe['type_name'] == assign_type)]

In [25]:
ind_urban_density_df = filter_urban(density_by_county_type, 'individual', True)
ind_rural_density_df = filter_urban(density_by_county_type, 'individual', False)
comp_urban_density_df = filter_urban(density_by_county_type, 'company', True)
comp_rural_density_df = filter_urban(density_by_county_type, 'company', False)

In [26]:
# Mann–Whitney U test for individuals: patent counts in urban vs. rural counties
# The p-value is below 0.05, so the distribution of patent counts for individuals differs significantly between urban and rural counties.
mannwhitneyu(ind_urban_density_df['count'], ind_rural_density_df['count'], alternative='two-sided')

MannwhitneyuResult(statistic=np.float64(379920.5), pvalue=np.float64(1.3803444705679366e-51))

In [27]:
# Mann–Whitney U test for companies: the p-value is below 0.05, so the distribution of patent counts for companies differs significantly between urban and rural counties.
mannwhitneyu(comp_urban_density_df['count'], comp_rural_density_df['count'], alternative='two-sided')

MannwhitneyuResult(statistic=np.float64(1308956.0), pvalue=np.float64(3.1239562045013912e-174))

### Result from the Mann–Whitney U Test
- According to the Mann–Whitney U test, the distributions in urban and rural counties differ significantly (p-value < 0.05) for both company and individual assignees
- We can conclude that patent counts for companies, and likewise for individuals, are distributed differently between urban and rural counties.
- Overall, the bar plots and the Mann–Whitney U test indicate that individual patents are also concentrated in urban counties, so Hypothesis 4 is likely not supported. The hypothesis stated that patents assigned to organizations (companies, universities, government) show a more concentrated geographic footprint than patents with an individual or no clear assignee, with corporate-assigned patents clustering around business hubs (e.g. California and New York) and individually filed patents appearing more dispersed.

## AI usage
- The visualization and statistics code was created with the assistance of ChatGPT