<h1>Patent Graph</h1>

We're going to walk through what it looks like to explore an external dataset (US Patents) and load it into TigerGraph. We'll cover everything you need to set up your first graph: understanding your data, building a schema based on it, and processing it so that it conforms to our loading standards.

# The dataset

We'll be using a dataset from the US Patent Office available [here](https://www.uspto.gov/ip-policy/economic-research/research-datasets/patent-examination-research-dataset-public-pair). 

<img src="images/data.png" width="500px"/>

Select one of the above data files in **.csv format**. All of the years follow the same format but differ in the amount of data they contain. Once downloaded, the data will decompress to about **6x** the size listed on the site. Keep this in mind, because you will have to upload this data to your TigerGraph server at some point, and if you have a slower network connection, a smaller file might be preferable. Additionally, the Free tier TigerGraph instance is limited to 50 GB of disk space, so you may run into that limitation with the 2020 and 2019 data sets.

I'm not including the full dataset in this repo, but will include the top 10K lines of each file so that you can follow along while your full dataset downloads.

## Making Data Manageable

It's a lot easier to work with smaller files when we're setting up our schema. Depending on how thorough you are with your initial data investigation or how well you know your dataset, you most likely won't nail your schema on the first shot. Sometimes this will involve having to change the format or re-structure the data files. That also means re-uploading the new file to your TigerGraph server. It's a lot easier and faster to do this with a ~10MB file than a ~11GB one.

The below function will take the top `numLines` lines from each data file in the patent dataset and create a new file following the same naming convention as the original file, but with `10K_` prepended to the name of the file.

If you are using the `10K_` files from the `processed_data` folder while you wait for your full files to download, then you **do not** need to run this cell. It is worth understanding why paring down your data during the exploration phase matters, but you can use the pre-prepared files provided.

In [2]:
import os

# specify data folders
data_folder = "./raw_data"
output_folder = "./processed_data"

# How many lines do we want in the stripped down files
numLines = 10000

In [33]:
from itertools import islice

# Make the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Go through all the files in the data folder
for root, dirs, files in os.walk(data_folder):
    for fi in files:
        filepath = os.path.join(root, fi)
        # Create an output file with '10K_' before the file name (change this if you're using a different # than 10,000)
        outputPath = os.path.join(output_folder, "10K_" + fi)
        # Both files are closed automatically when the with block exits
        with open(filepath) as dataFile, open(outputPath, 'w') as outputFile:
            # Copy the first numLines lines; islice simply stops early if the file is shorter
            outputFile.writelines(islice(dataFile, numLines))

# How TigerGraph works under the Hood

One key step in the thought process of how we want to structure our data is knowing how TigerGraph handles the data that has been loaded into it. There are three key components in a Graph Database. 

- **Vertex** (aka node) - represent an individual entry of a particular data concept (ex. Person, Receipt, House)
- **Edge** - link together vertices via relationships (ex. owns_vehicle, is_friend, purchased_item)
- **Attribute** - describes additional information about a **Vertex** or **Edge** (ex. firstName, streetNumber, transactionDate)

The data is stored on the TigerGraph server as follows:

**Vertex and Edge** - The `primary_id` of all vertices is loaded in **memory** and an edge is stored pointing between memory locations. This makes traversing edges *extremely* fast.

**Attributes** - All other details of **Vertices and Edges** are read from disk and accessing them will incur slight performance overhead above just traversing edges. You can add one **Attribute** as an **Index** per vertex type to increase performance accessing that **attribute**, but this is advanced and we won't be covering that here.

# Dataset Exploration

Now that we have pared-down versions of our files, let's take a look at what we're actually working with. Below we'll print out each file name as well as its header. As you look through each file, try to figure out what will be our **Vertices**, **Edges**, and **Attributes**.

In [3]:
# go through the files in the output folder
for root, dirs, files in os.walk(output_folder):
    for fi in files:
        filepath = os.path.join(root,fi)
        with open(filepath) as dataFile:
            # Print filename and first 2 lines (header + 1st row of data)
            print(fi)
            print(dataFile.readline(),dataFile.readline())

10K_all_inventors.csv
application_number,inventor_name_first,inventor_name_middle,inventor_name_last,inventor_rank,inventor_city_name,inventor_region_code,inventor_country_code
 04840815,WILLIAM,D.,SCHAEFFER,1,POMONA,CA,US

10K_transactions.csv
application_number,event_code,recorded_date
 02549302,SETS,2001-09-20

10K_application_data.csv
application_number,filing_date,application_invention_type,examiner_full_name,examiner_art_unit,uspc_class,uspc_subclass,confirm_number,atty_docket_number,appl_status_desc,appl_status_date,file_location,file_location_date,earliest_pgpub_number,earliest_pgpub_date,wipo_pub_number,wipo_pub_date,patent_number,patent_issue_date,invention_title,small_entity_indicator,aia_first_to_file
 04453098,,,"LATEEF, MARVIN M",2106,338,254000,6933,,Patented File - (Old Case Added for File Tracking Purposes),1983-12-28,FILE REPOSITORY (FRANCONIA),1986-04-23,,,,,,,,UNDISCOUNTED,

10K_event_codes.csv
event_code,description
 102P,102P

10K_pte_summary.csv
application_numbe

## all_inventors.csv

`all_inventors`, as the name implies, contains information about the inventors who are mentioned in the patent applications in the dataset.

This is ONE TABLE but I have split it to better fit most screens. Any future entries will also represent ONE TABLE unless otherwise specified.

|application_number|inventor_name_first|inventor_name_middle|inventor_name_last|inventor_rank|
|---|---|---|---|---|
|04840815|WILLIAM|D.|SCHAEFFER|1|

<br>

|inventor_city_name|inventor_region_code|inventor_country_code|
|---|---|---|
|POMONA|CA|US|

Already, we can tell that this list ties inventors back to their patent applications. `application_number` is the primary id that the patent applications use, and that number ties each one of these inventors back to a patent application.
Additionally, we see that the rest of the fields describe an Inventor. From this first file, we can gather that `application_number` is something that we want to look for in the other files, and that there is a list of attributes that describe an Inventor.

Let's start building our schema based on what we know.

First we need to identify the objects that this file talks about. The first immediate one is **Inventor**. Additionally, we know that **Applications** exist due to the reference to `application_number`. In addition to just knowing that **Inventor** exists, we also know a little bit about our **Inventor**s, such as their first, middle, and last names as well as the region that they live in. One thing that we do not have for our **Inventor**s is a unique identifier. There's no `inventor_id` or other field that could be used to ensure unique inventors. This is frustrating, but real-life data isn't perfect, so we'll need to generate one ourselves. We'll get into that later on though; let's start with the basics first.
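To see why the name columns alone cannot serve as an identifier, here is a minimal sketch that groups application numbers by name. Only the first row comes from the real file; the two `JANE DOE` rows and the variable names are hypothetical, added purely for illustration.

```python
from collections import defaultdict

# Sample rows shaped like all_inventors.csv (a subset of its columns).
# Only the first row is from the real file; the JANE DOE rows are hypothetical.
rows = [
    ("04840815", "WILLIAM", "D.", "SCHAEFFER", "POMONA"),
    ("00000001", "JANE", "", "DOE", "BOSTON"),
    ("00000002", "JANE", "", "DOE", "AUSTIN"),
]

# Group application numbers by the (first, middle, last) name combination.
applications_by_name = defaultdict(set)
for application_number, first, middle, last, city in rows:
    applications_by_name[(first, middle, last)].add(application_number)

# Names that appear on more than one application: one person or several?
# Nothing in the file tells us, which is why we need a generated id.
ambiguous_names = {
    name: sorted(apps)
    for name, apps in applications_by_name.items()
    if len(apps) > 1
}
print(ambiguous_names)
```

Running the same grouping over a real `10K_all_inventors.csv` would show how often a name repeats, but it still could not tell you whether two matching names are the same person.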

We can use the column names to define our first Vertices in the schema.

-**Inventor**
 - id (we have to generate this)
 - name_first
 - name_middle
 - name_last
 - inventor_rank
 - inventor_city
 - inventor_region
 - inventor_country

-**Application**
 - application_number
 (That's all we know for now about applications)

Looking at what we have above we can see that our **Inventor** is actually describing 3 things. The **Inventor** themselves (names), their **Rank** on the patent, and the **Location** that they used at the time of the filing.

This is where our domain knowledge will come in a little bit. When filing a patent, the inventors can be ordered by how much they contributed to the patent. The position of a name in the patent listing is that inventor's **Rank**. An **Inventor**'s **Rank** can be different on each **Application** that they are on. Because of that, it does not make sense to store **Rank** inside of **Inventor**, because that would only reflect one particular **Application**.

So what do we do here? Let's walk through the possibilities. 

### The Wrong Way

The first one is that we break off **Rank** as its own Vertex. This seems logical, because an **Inventor** `has_rank` **Rank**. But now let's run this through a theoretical example.

<img src="images/rank_schema.png" width="500"/>

*Inventor 1* has filed two **Application**s, *Application 1* and *Application 2*. *Inventor 1* is *Rank 1* on the first application and *Rank 2* on the second application. Following the solution outlined above, *Inventor 1* would have two **Rank** vertices attached to them, *Rank 1* and *Rank 2*. However, there's nothing that would tie either *Rank 1* or *Rank 2* to a particular **Application**. So finding the **Rank** of *Inventor 1* on *Application 1* would return both *Rank 1* and *Rank 2*. 

<img src="images/rank_1.png" width="300"/>
<img src="images/rank_1_1.png" width="300"/>

Okay then, so let's also tie our **Rank** to **Application**. Now *Inventor 1* `has_application` *Application 1*, *Application 1* `has_inventor_rank` *Rank 1*, and *Rank 1* also ties back to *Inventor 1*.

<img src="images/rank_2_1.png" width="500"/>

This seems like it would work (it won't), so let's see how messy this gets when we consider multiple **Application**s. Every **Application** will have a `has_inventor_rank` edge to *Rank 1*, because there has to be at least one **Inventor** on an **Application**. So if we wanted to find out what **Rank** an **Inventor** was in any given **Application**, our traversal would still leave us with an open answer. Let's take a look.

We would start at an **Inventor**, then follow the `filed_application` edge to **Application 1**. From there, we would do a multi-hop traversal from **Application 1** via `has_inventor_rank` to a **Rank** vertex, then through `has_rank` back to our source **Inventor**. 

As you might already see, we run into an issue if **Application 1** has a second inventor of **Rank 2** and the source **Inventor** happens to be **Rank 2** on *any* other application. Because those **Rank** vertices are connected to multiple **Applications** and **Inventors** it is impossible to distinguish which instance of `has_rank` corresponds to any given instance of `has_inventor_rank`.

<img src="images/rank_2_2.png" width="500"/>
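The ambiguity described above can be reproduced with a few lines of plain Python. This is only a sketch of the traversal, not TigerGraph code: the edges are stored as sets of pairs, and *Inventor 2* and *Inventor 3* are hypothetical co-inventors added so that every application has a *Rank 1* and a *Rank 2*.

```python
# Edges of the "Rank as a vertex" schema, stored as (source, target) pairs.
# Inventor 1 is Rank 1 on Application 1 and Rank 2 on Application 2.
# Inventor 2 and Inventor 3 are hypothetical co-inventors.
has_rank = {
    ("Inventor 1", "Rank 1"),
    ("Inventor 1", "Rank 2"),
    ("Inventor 2", "Rank 2"),
    ("Inventor 3", "Rank 1"),
}
has_inventor_rank = {
    ("Application 1", "Rank 1"),
    ("Application 1", "Rank 2"),
    ("Application 2", "Rank 1"),
    ("Application 2", "Rank 2"),
}

def ranks_on(inventor, application):
    """Follow has_inventor_rank from the application, then keep the ranks
    that connect back to the inventor through has_rank."""
    application_ranks = {r for a, r in has_inventor_rank if a == application}
    inventor_ranks = {r for i, r in has_rank if i == inventor}
    return sorted(application_ranks & inventor_ranks)

print(ranks_on("Inventor 1", "Application 1"))  # ['Rank 1', 'Rank 2']
```

Both ranks come back even though *Inventor 1* is only *Rank 1* on *Application 1*, because the shared **Rank** vertices carry no memory of which inventor and application they were paired with.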

### The Right Way

Luckily, there's a much easier way than all of this. We don't have to limit our information to only our Vertices, we can also store additional data along edges.

<img src="images/rank_3_1.png" width="500"/>

Instead of making a Vertex for **Rank**, we can include it as an attribute of the `filed_application` edge. Now, all we need to do is traverse one edge in order to find out not only which **Application** an **Inventor** filed, but also their **Rank** on that application.

<img src="images/rank_3_2.png" width="500"/>
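In the same plain-Python terms, storing **Rank** on the edge looks like the sketch below. The dictionary stands in for the `filed_application` edge type, and *Inventor 2* is a hypothetical co-inventor.

```python
# Each filed_application edge is keyed by (inventor, application) and
# carries its own attributes, here just the rank.
filed_application = {
    ("Inventor 1", "Application 1"): {"rank": 1},
    ("Inventor 1", "Application 2"): {"rank": 2},
    ("Inventor 2", "Application 1"): {"rank": 2},  # hypothetical co-inventor
}

def rank_on(inventor, application):
    """One edge lookup answers both questions: did they file it, and at what rank?"""
    return filed_application[(inventor, application)]["rank"]

print(rank_on("Inventor 1", "Application 1"))  # 1
print(rank_on("Inventor 1", "Application 2"))  # 2
```

Because the rank lives on the one edge that joins a specific inventor to a specific application, there is nothing left to disambiguate.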

The **Location** can and should be separated into its own **Vertex**. Unlike **Rank**, **Location** directly relates to just the **Inventor**. Because of that, a single edge can be used to connect **Inventor** -> `from_location` -> **Location**.

It is also important to note that we may want to **filter** on location later. Say, select all **Inventors** from a specific **Location**. When you anticipate wanting to filter on a concept like that, it's a good indicator that that concept should be a **Vertex**.

But we can go deeper here. **Location** contains three pieces of information, **Country**, **Region**, and **City**. Each of those sound like something we might want to filter on down the line... see where I'm going here?

What was once a single **Location** vertex can now be described as 3 vertices: **Country**, **Region**, and **City**. We further know from our sample data that **Region** is a US state, that **Region** is inside **Country**, and that **City** is inside **Region**.

I want to take a second to state why it's so advantageous to separate anything that you want to filter on into its own **Vertex**. To do this we need to think about how we query data.

Here's the theoretical example: We want to select all Inventors who filed in the **City** of **Boston**

If *city* is an *attribute* on an **Inventor**, then in order to find all **Inventors** whose *city* = `Boston` we need to:
- check EVERY SINGLE **Inventor**
- read the *city* attribute
- compare that value to `Boston`
- return **Inventors** with matching value

If **City** is a **Vertex** connected to an **Inventor**, then we just need to find all **Inventors** connected via a `from_city` edge to the **Boston** vertex.
- select **Boston**
- traverse `from_city` (pointer in memory)
- return **Inventors** at resulting memory locations

The **TL;DR** is you have to touch EVERY **Inventor** vertex to filter on an *attribute*, and you ONLY touch the **Inventor** vertices you are interested in when you filter on a **Vertex**. It is much more performant to grab only `758` **Inventors** plus the **Boston** vertex than to check all `21,617,363` **Inventors**.
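The two query strategies can be compared with a small Python sketch that counts how many **Inventor** records each one touches. The inventor ids and the `DOE` and `ROE` entries are hypothetical, and the dictionaries are only stand-ins for how a graph engine might store attributes and edges.

```python
# City stored as an attribute on each inventor (hypothetical ids and rows).
inventors = {
    "inv-1": {"name_last": "SCHAEFFER", "city": "POMONA"},
    "inv-2": {"name_last": "DOE", "city": "BOSTON"},
    "inv-3": {"name_last": "ROE", "city": "BOSTON"},
}

def by_attribute(city):
    """Scan every inventor and compare its city attribute."""
    touched = 0
    matches = []
    for inventor_id, attributes in inventors.items():
        touched += 1
        if attributes["city"] == city:
            matches.append(inventor_id)
    return matches, touched

# City stored as a vertex: from_city edges indexed by the city they point to.
from_city = {"BOSTON": ["inv-2", "inv-3"], "POMONA": ["inv-1"]}

def by_vertex(city):
    """Start at the city vertex and follow its edges to the inventors."""
    matches = from_city.get(city, [])
    return matches, len(matches)

print(by_attribute("BOSTON"))  # (['inv-2', 'inv-3'], 3)
print(by_vertex("BOSTON"))     # (['inv-2', 'inv-3'], 2)
```

Both calls return the same inventors, but the attribute scan touches all three records while the vertex lookup touches only the two that match.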

### Schema so Far

-**Inventor**
 - id (we have to generate this)
 - name_first
 - name_middle
 - name_last

-**Country**

-**Region**

-**City**

-**Application**
 - application_number

<img src="images/inventors_schema.png" width="500px" />

## transactions.csv

|application_number|event_code|recorded_date|
|---|---|---|
|02549302|SETS|2001-09-20|

**Transactions** are described in [appendix B](https://www.uspto.gov/sites/default/files/documents/Appendix%20B.pdf) for our dataset. At its most basic level, the `transactions` file just lists each event in the **Application**'s lifecycle. Each event, or **Transaction** is tied to an **Event Code** which describes the event and a date describing when the **Transaction** took place.

Here we have to think about what a **Transaction** really is and what additional information is required to describe it.

Here are two ways that we can represent a **Transaction**:

<img src="images/transactions_schema.png" width="700px" />

In the example on the left, the **Transaction** vertex serves only to hold the *date* of the **Transaction**. To further complicate things, each **Vertex** needs a unique id. We don't have a unique id in the `transactions.csv` file. We could generate some, sure, but this still isn't the best answer.

For example, if we want to find what event codes an application has, we need to traverse the `has_transaction` edge, then traverse the `has_code` edge. That is two hops per **Transaction**, or 2n hops in total, where n = the number of **Event Codes** related to the **Application**.

Looking at the example on the right, `date` is stored on an edge between **Application** and **Event Code** negating the need for the **Transaction** vertex. Since a **Transaction** in this sense is a relationship between an **Application** and an **Event Code**, it makes sense for it to be represented by an edge. 

Additionally, we only need to traverse n edges in this example to get the **Event Codes** related to an **Application**.
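As a rough illustration of the edge-based design, each row of `transactions.csv` can be read as one edge from an **Application** to an **Event Code**, with the date stored on the edge. The sketch below uses the sample row shown earlier; the tuple layout and the `event_codes_for` helper are hypothetical, and this is not the loading job itself.

```python
import csv
import io

# The header and sample row of transactions.csv shown above.
sample = "application_number,event_code,recorded_date\n02549302,SETS,2001-09-20\n"

# One edge per row: (source Application, target Event Code, edge attributes).
edges = [
    (row["application_number"], row["event_code"], {"date": row["recorded_date"]})
    for row in csv.DictReader(io.StringIO(sample))
]

def event_codes_for(application_number):
    """One hop per edge: return each event code with the date stored on its edge."""
    return [
        (event_code, attributes["date"])
        for source, event_code, attributes in edges
        if source == application_number
    ]

print(event_codes_for("02549302"))  # [('SETS', '2001-09-20')]
```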

### event_codes.csv

|event_code|description|
|---|---|
|102P|102P|

This one is pretty simple. It relates an **Event Code** to its corresponding, human readable *description*.

With this and what we learned from `transactions.csv`, let's see what our schema looks like now.

<img src="images/event_schema.png" width="500px" />

### attorney_agent.csv

|atty_name_first|atty_name_last|atty_name_middle|atty_name_suffix|atty_phone_number|
|---|---|---|---|---|
|James|Wetzel||||

<br>

|atty_registration_number|atty_practice_category|application_number|
|---|---|---|
|17686|Attorney|03831599|

Hopefully this should start making sense by now. This file has a lot of columns, but is relatively simple.

-**Attorney**
 - atty_number (primary_id)
 - first_name
 - last_name
 - middle_name
 - suffix
 - phone

-**Practice Category**

-**Application**

**Practice Category** is broken off into its own vertex because it is a separate concept from **Attorney** and we might want to filter on it later.

<img src="images/attorney_schema.png" width="500px" />

### application_data.csv

|application_number|filing_date|application_invention_type|examiner_full_name|examiner_art_unit|uspc_class|
|---|---|---|---|---|---|
|04453098|||"LATEEF, MARVIN M"|2106|338|

<br>

|uspc_subclass|confirm_number|atty_docket_number|appl_status_desc|appl_status_date|file_location|
|---|---|---|---|---|---|
|254000|6933||Patented File - (Old Case Added for File Tracking Purposes)|1983-12-28|FILE REPOSITORY (FRANCONIA)|

<br>

|file_location_date|earliest_pgpub_number|earliest_pgpub_date|wipo_pub_number|wipo_pub_date|patent_number|
|---|---|---|---|---|---|
|1986-04-23||||||

<br>

|patent_issue_date|invention_title|small_entity_indicator|aia_first_to_file|
|---|---|---|---|
|||UNDISCOUNTED||

This is the big one. It was so big I had to break down the table so it would fit on the screen. Luckily we've been practicing for this. It's time to put everything we've learned so far to use and make easy work of converting this file to schema.

All the nuances of `application_data.csv` are outlined in [Appendix A](https://www.uspto.gov/sites/default/files/documents/Appendix%20A.pdf) of the dataset. But just like everything else, we'll walk through it here.

-**Application**
- application_number (primary id)
- filing_date
- confirm_number
- docket_number
- invention_title

-**Invention Type**

-**Examiner**
- full_name

-**Art Unit**

-**USPC Class**

-**USPC Subclass**

-**Application Status**

-**File Location**

-**Pgpub Number**

-**WIPO Pub Number**

-**Patent Number**

-**Small Entity**

-**First to File**

Yeah, that's a lot. And that's just the Nodes...

A few things to talk about here. *confirm_number* and *docket_number* are just additional reference numbers for the patent and therefore don't need their own vertices.

We will probably want to filter on **Invention Type**, **Application Status**, and **File Location** so those will be vertices.

**Pgpub**, **WIPO**, and **Patent Number** are all concepts that could be attributes. However, each of these concepts has a date attached to it telling us when it happened. Because these don't necessarily happen on the same date as our **Application**, they can be considered separate concepts. We'll use an edge with the relevant date to tie them back to the **Application**.

**Small Entity** and **First to File** are tags letting us know whether this **Application** comes from a company with fewer than 500 people, and whether this **Application** follows the AIA First to File rules, respectively. Because we might want to filter on these, they're vertices.

**Examiner** is someone who looks at a patent from a domain standpoint to make sure that it is unique and works with **Inventors** to assess if a **Patent** can be granted.

**Art Unit** describes what unit the **Examiner** is in and we'll want to make that a vertex for filtering purposes.

**USPC Class** and **USPC Subclass** both describe the classification of an **Application** and that's definitely something we want to be able to filter on, so vertex.

<img src="images/application_schema.png" width="500px" />

^^^ That's just this file. Our schema is starting to get complex. This is good, the more detail we capture in our schema, the more insights we will be able to extract from our data.

### foreign_priority.csv

|application_number|foreign_parent_id|foreign_parent_date|parent_country|
|---|---|---|---|
|08030312|2297/90|1990-09-24|DENMARK|

Let's do an easy one after that last one. [Appendix D](https://www.uspto.gov/sites/default/files/documents/Appendix%20D.pdf) describes **Foreign Parent** and the fields within.

-**Foreign Parent**

-**Parent Country**

Here we're really only describing two concepts: the **Foreign Parent**, which is the application from another country that is being referenced by our **Application**, and the **Country** that the **Foreign Parent** is from. The *foreign_parent_date* can be stored along the edge linking **Application** to **Foreign Parent**.

Let's add this into our overall schema. (sorry, I ran out of unique colors)

<img src="images/foreign_schema.png" width="700px" />

The keen-eyed will have noticed that I added an edge from **Foreign Parent** to our existing **Country** vertex. Even though the **Country** that the **Inventor** is from is a different concept than the **Country** that a **Foreign Parent** corresponds to, a **Country** is still a **Country** and our graph modeling should represent that.

### pat_term_adj.csv

|application_number|pta_sequence_number|pta_event_date|pta_event_desc|applicant_delay_duration|
|---|---|---|---|---|
|09068213||2002-12-02|Mail Notice of Allowance|0|

<br>

|uspto_delay_duration|start_pta_sequence_number|term_extension_indicator|
|---|---|---|
|0||1|

As described in [Appendix E](https://www.uspto.gov/sites/default/files/documents/Appendix%20E.pdf), term adjustments account for delays in the patent process. We can further break these down like so.

-**PTA Event**
- primary id (will need to generate)
- pta_event_desc
- applicant_delay_duration
- uspto_delay_duration

-**Extension Indicator**

<img src="images/pta_schema.png" width="500px" />

There are two main things to talk about here. The first will be pretty apparent: there's an edge pointing from **PTA Event** back to **PTA Event**. How does that work?

Remember that this is our schema, and that these Nodes don't represent actual Vertices, but rather types of Vertices and the relationships that they CAN have to each other. 

You CAN NOT have an edge that points from **PTA Event 1** back to **PTA Event 1**. Remember that edges are pointers between memory locations and that we need a unique ID for both the vertex at the source and target of the edge.

But you CAN have an edge that points from **PTA Event 1** to **PTA Event 2**. Even though **PTA Event 1** and **PTA Event 2** are both of type **PTA Event**, they each have a unique ID and are therefore capable of having an edge connect them.

Within the context of our dataset, the `has_start` edge references *start_pta_sequence_number* which ties a particular **PTA Event** to the **PTA Event** whose deadline was missed causing the delay.
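A minimal sketch of how `has_start` edges could be derived is shown below. It assumes that *start_pta_sequence_number* refers to the *pta_sequence_number* of another event on the same **Application**; the two rows and the `evt-a` and `evt-b` ids are hypothetical.

```python
# Hypothetical PTA Event rows; ids and sequence numbers are made up.
events = [
    {"id": "evt-a", "application_number": "09068213",
     "pta_sequence_number": "1", "start_pta_sequence_number": ""},
    {"id": "evt-b", "application_number": "09068213",
     "pta_sequence_number": "2", "start_pta_sequence_number": "1"},
]

# Index every event by (application, sequence number) so it can be looked up.
by_sequence = {
    (e["application_number"], e["pta_sequence_number"]): e["id"] for e in events
}

# has_start links one PTA Event vertex to a different PTA Event vertex.
has_start = [
    (e["id"], by_sequence[(e["application_number"], e["start_pta_sequence_number"])])
    for e in events
    if e["start_pta_sequence_number"]
]
print(has_start)  # [('evt-b', 'evt-a')]
```

Both ends of the edge are of type **PTA Event**, but they are two distinct vertices with their own ids.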

The last thing that we need to talk about with this file is the *Primary ID* for **PTA Event**. We don't have one in the dataset, and we need one to make each **PTA Event** unique within the graph.

You might look at this and think: "Why can't we just use *pta_event_description* as a unique identifier?" We've used a similar method for **Application Status**, so why won't it work here? The thing stopping us is the set of attributes on **PTA Event**. If we knew that *applicant_delay_duration* and *uspto_delay_duration* were the same for every instance of a given *pta_event_description*, then we could do this. But because those delays vary from **Application** to **Application** and are NOT fixed to a given *pta_event_description*, we need to maintain a unique copy of each **PTA Event**.

There are two ways to do this and each has scenarios where it works better than the alternative.

#### Method 1 - Unique by Combination

Something we haven't talked about yet are called token functions. Token functions are functions that run over our data while it's being loaded into the graph. These functions allow us to do things like convert Epoch time into Datetime, change the case of a string, and many other helpful data manipulation tasks that we would normally have to do before bringing our data into the database.

The token function that we'll be talking about for Method 1 is `gsql_concat`. As the name implies, this allows you to concatenate multiple fields of your data into one long string.

We can use this to take multiple non-unique fields from our data and combine them into a unique field. For example, we could use a concatenation of *application_number* and *pta_event_date* to create a unique identifier for **PTA Event**. Or at least we could if it was impossible to have multiple **PTA Events** for the same **Application** on the same day.

If we start looking through our data, we can see that this is not the case and there can be multiple **PTA Events** on the same *date* for a given **Application**. Fair enough, let's add in more information to ensure our ID is unique. The *pta_event_description* seems to NOT occur multiple times on the same *date* for the same **Application**. We could have our Primary ID be a concatenation of *application_number*, *pta_event_date*, and *pta_event_description* to result in a truly unique ID for this dataset.

For the first line of our data file, that would give us the Primary ID of: `090682132002-12-02Mail Notice of Allowance`. What a mouthful! 
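The effect of the concatenation can be previewed in plain Python before writing any loading job. The `concat_id` helper below is a hypothetical stand-in for what `gsql_concat` does to the fields; it is not GSQL syntax.

```python
def concat_id(*fields):
    """Join the given fields into one string, in order, with no separator."""
    return "".join(fields)

# First row of pat_term_adj.csv, limited to the fields used for the id.
row = {
    "application_number": "09068213",
    "pta_event_date": "2002-12-02",
    "pta_event_desc": "Mail Notice of Allowance",
}

pta_event_id = concat_id(
    row["application_number"], row["pta_event_date"], row["pta_event_desc"]
)
print(pta_event_id)  # 090682132002-12-02Mail Notice of Allowance
```

Running the same helper over every row and comparing the number of distinct ids to the number of rows is a quick way to check whether a chosen combination of fields really is unique in your file.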

This works and will function fine in our graph. However, it's not very pretty. This method usually works best when you only need to concatenate 2 fields and the resulting concatenation will be something useful for a user to identify the resulting vertex.

Say you had a dataset for playing cards and the only two columns of data you had were the **Symbol** on the card and the **Value** on the card. There's 13 different **Values** for each **Symbol**, so we can't use that as a Primary ID. And there's 4 different **Symbols** for each **Value**, so that's not unique either. However, there is only one **Card** per combination of **Symbol** and **Value**. Because of this, a concatenation of **Symbol** and **Value** will provide both a Unique ID and a human readable value that can easily describe a given **Card**. A sample Primary ID would be something like `Hearts3` or `SpadesQ`
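The playing card example is small enough to verify exhaustively. The sketch below builds every **Symbol** and **Value** combination and confirms that the concatenated ids never collide; the spellings of the symbols and values are assumptions chosen to match `Hearts3` and `SpadesQ`.

```python
# Assumed spellings for the 4 symbols and 13 values.
symbols = ["Hearts", "Diamonds", "Clubs", "Spades"]
values = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

# Concatenate Symbol and Value to form the Primary ID of each Card.
card_ids = [symbol + value for symbol in symbols for value in values]

print(len(card_ids))       # 52
print(len(set(card_ids)))  # 52, so every id is unique
```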

#### Method 2 - Unique by Design

We don't always need a human readable Primary ID for our data, and in the case of **PTA Event** we have plenty of attributes to help us understand the **PTA Event** so the ID doesn't need to be too descriptive. You're saying, "If we don't need it to be human readable, then why didn't we just use the `gsql_concat` token function mentioned above?" The answer is because there's an easier way.

`gsql_uuid_v4` will generate a unique ID. That's all it does: it spits out an ID that is unique among all other vertices in the graph. It outputs a long string of numbers and letters resembling this: `4493d5ce-a69b-4c90-88e4-b41e9f576169`. It's not pretty, but it's guaranteed unique.

This method works for scenarios like this where `concat` might not guarantee a Unique ID across a huge dataset and you don't necessarily care about the value of the Primary ID.
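If you want to see what version 4 UUIDs look like without loading anything, Python's standard `uuid` module generates the same kind of value. This is only an analogue for illustration; it is not how the token function is invoked in a loading job.

```python
import uuid

# Generate one random (version 4) UUID per hypothetical PTA Event row.
pta_event_ids = [str(uuid.uuid4()) for _ in range(3)]

for pta_event_id in pta_event_ids:
    print(pta_event_id)
```

Each run prints different values, which is the trade-off of this method: the ids carry no meaning and cannot be recomputed from the row they were assigned to.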

### pta_summary.csv

|application_number|pto_delay_a|pto_delay_b|pto_delay_c|overlap_pto_delay|
|---|---|---|---|---|
|09743549|0|0|0|0|

<br>

|nonoverlap_pto_delay|pto_manual_adjustment|applicant_delay|patent_term_adjustment|
|---|---|---|---|
|0|0|0|0|

-**PTA Summary**
- primary id (need to generate)
- pto_delay_a
- pto_delay_b
- pto_delay_c
- overlap_pto_delay
- nonoverlap_pto_delay
- pto_manual_adjustment
- applicant_delay
- patent_term_adjustment

**PTA Summary** builds off of **PTA Event** and provides us with the total delay time of multiple types for the duration of the **Application**. Where **PTA Event** shows the delay caused by each **PTA Event**, **PTA Summary** shows the summation of all delays for a given **Application**. 

Technically, we don't need this. We could gather the same information by selecting every **PTA Event** tied to an **Application** and summing the respective delay attributes. That however requires traversals equal to the number of **PTA Events** for an **Application** AND we need to access attributes of each **PTA Event** that we hit, which will cost performance. Conversely, we only need to access ONE **PTA Summary** to gather that same information.
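The trade-off can be sketched in Python as follows. All of the numbers are hypothetical, and the sketch assumes, purely for illustration, that the summary's *applicant_delay* equals the sum of *applicant_delay_duration* across an application's events.

```python
# Hypothetical PTA Events for one application.
pta_events = [
    {"application_number": "09068213", "applicant_delay_duration": 0},
    {"application_number": "09068213", "applicant_delay_duration": 12},
    {"application_number": "09068213", "applicant_delay_duration": 5},
]

# Hypothetical PTA Summary for the same application.
pta_summary = {"09068213": {"applicant_delay": 17}}

def applicant_delay_from_events(application_number):
    """Visit every event of the application and add up its delay attribute."""
    delays = [
        e["applicant_delay_duration"]
        for e in pta_events
        if e["application_number"] == application_number
    ]
    return sum(delays), len(delays)

def applicant_delay_from_summary(application_number):
    """Read the precomputed total from the single summary record."""
    return pta_summary[application_number]["applicant_delay"]

print(applicant_delay_from_events("09068213"))   # (17, 3)
print(applicant_delay_from_summary("09068213"))  # 17
```

The first function has to read an attribute from three vertices to get the answer that the second reads from one.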

You can learn what each delay type represents in [Appendix E](https://www.uspto.gov/sites/default/files/documents/Appendix%20E.pdf) of the dataset.

### pte_summary.csv

|application_number|pto_adjustment|pto_delay|applicant_delay|patent_term_extension|
|---|---|---|---|---|
|09068213|0|0|0|0|

-**PTE Summary**
- primary ID (need to generate)
- pto_adjustment
- pto_delay
- applicant_delay
- patent_term_extension

This file isn't mentioned in our Appendix for some reason, so we have to infer some info from it. First, we can guess that PTE stands for Patent Term Extension and due to the similarity of this file to `pta_summary.csv` we can assume that this file describes the cumulative extensions to a given **Application**.

We've added quite a few vertices and edges to our schema, let's see what the whole thing looks like so far.

<img src="images/pte_schema.png" width="800" />

### continuity_children.csv and continuity_parents.csv

`continuity_children.csv`

|application_number|child_application_number|child_filing_date|
|---|---|---|
|02262037|59997901|2018-01-01|

<br>

`continuity_parents.csv`

|application_number|parent_application_number|parent_filing_date|continuation_type|
|---|---|---|---|
|05354590|05101449|1970-12-28|CON|


-**Continuation Type**

-**`has_child`**
- filing_date

-**`has_parent`**
- filing_date

-**`is_continuation_type`**

[Appendix C](https://www.uspto.gov/sites/default/files/documents/Appendix%20C.pdf) for this one. Essentially, continuing a patent allows you to file child applications that add to, augment, or reissue your initial **Patent**, among other things. I'm no patent expert, so you can read up more about the different *continuation_types* somewhere like [Wikipedia](https://en.wikipedia.org/wiki/Continuing_patent_application) or [Appendix C](https://www.uspto.gov/sites/default/files/documents/Appendix%20C.pdf).

<img src="images/continuation_schema.png" width="400px" />

The `has_parent` - `has_child` relationship is the epitome of a directed edge use case. **Continuation Type** is broken out so that we can use it as a filter, but you knew that by now.
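To make the direction of the edge concrete, here is a small Python sketch built from the `continuity_parents.csv` sample row above. The dictionary layout and the `parents_of` helper are hypothetical; the point is only that the edge can be followed from the child to its parent and not the other way around.

```python
# Sample row of continuity_parents.csv:
# (application_number, parent_application_number, parent_filing_date, continuation_type)
parent_rows = [("05354590", "05101449", "1970-12-28", "CON")]

# has_parent is directed: it is stored under its source Application only.
has_parent = {}
for application_number, parent_number, parent_filing_date, continuation_type in parent_rows:
    has_parent.setdefault(application_number, []).append(
        (parent_number, {"filing_date": parent_filing_date})
    )

def parents_of(application_number):
    """Follow has_parent edges outward from the given Application."""
    return [target for target, attributes in has_parent.get(application_number, [])]

print(parents_of("05354590"))  # ['05101449']
print(parents_of("05101449"))  # []
```

Asking the parent for its own parents returns nothing, which is why a separate `has_child` edge is needed to walk the relationship in the opposite direction.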

### correspondence_address.csv

|application_number|correspondence_name|correspondence_street_line_1|correspondence_street_line_2|
|---|---|---|---|
|04526546|WILLIAMS D. HALL|200 SEMMES BUILDING|10220 RIVER ROAD|

<br>

|correspondence_street_line_3|correspondence_city|correspondence_postal_code|correspondence_region_code|
|---|---|---|---|
||POTOMAC|20854|MD|

<br>

|correspondence_country_code|customer_number|
|---|---|
|US||

-**Correspondence**
- name
- customer_number

-**Address**
- street_line_1
- street_line_2
- street_line_3
- city
- postal_code
- region
- country

-**Postal Code**

[Appendix F](https://www.uspto.gov/sites/default/files/documents/Appendix%20F.pdf)

This might not seem like the most valuable information, but we can actually do something cool here and use this data alongside our existing **City**, **Country**, and **Region** data.

Before:

<img src="images/location_1.png" width="500" />

With `correspondence_address.csv`:

<img src="images/location_2.png" width="500" />

Fully Connected:

<img src="images/location_3.png" width="500" />

Okay, that's a little messy, but essentially we're able to connect **Address** and **Postal Code** to their corresponding **City**, **Region**, and **Country** vertices. The richer the connections in our graph, the more information we can infer about the relationships between different entities.

## Putting it all Together

This is it! The moment we've been looking forward to, maybe even fearing this whole time. What does the complete schema look like?

<img src="images/full_schema.png" width="100%" />