In [None]:
!head -10 ./raw_data/all_inventors.csv

It's a lot easier to work with smaller files when we're setting up our schema. Depending on how thorough you are with your initial data investigation and how well you know your dataset, you most likely won't nail your schema on the first shot. Getting it right will sometimes involve changing the format or restructuring the data files, which is a lot easier and faster to do with a ~10MB file than with a ~11GB one.

In [33]:
import os

# specify data folders
data_folder = "./raw_data"
output_folder = "./processed_data"

# How many lines do we want in the stripped down files
numLines = 10000

# Make the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Go through all the files in the data folder
for root, dirs, files in os.walk(data_folder):
    for fi in files:
        filepath = os.path.join(root, fi)
        # Create an output file with '10K_' before the file name (change this if you're using a different # than 10,000)
        outputPath = os.path.join(output_folder, "10K_" + fi)
        # Both files are closed automatically when the with block ends
        with open(filepath) as dataFile, open(outputPath, 'w') as outputFile:
            # Copy lines until we hit numLines; a file shorter than numLines is copied whole
            for i, line in enumerate(dataFile):
                if i >= numLines:
                    break
                outputFile.write(line)

Now that we have pared-down versions of our files, let's take a look at what we're actually working with. Below we'll print out each file name as well as its header and first row of data.

In [None]:
# go through the files in the output folder
for root, dirs, files in os.walk(output_folder):
    for fi in files:
        filepath = os.path.join(root,fi)
        with open(filepath) as dataFile:
            # Print filename and first 2 lines (header + 1st row of data)
            print(fi)
            print(dataFile.readline(),dataFile.readline())

This next step is the most important. Now that we know what our data looks like, we need to understand the concepts it describes. This typically requires some domain knowledge about your data, so make sure you have Google ready for any terms that you aren't familiar with.

Let's walk through each of the files from our patent dataset and try to figure out what they're referencing and what the Entities are that we want to represent with our graph.

`all_inventors` As the name implies, we can assume that this file contains information about the inventors who will be mentioned in the patents.

application_number,inventor_name_first,inventor_name_middle,inventor_name_last,inventor_rank,inventor_city_name,inventor_region_code,inventor_country_code
 04840815,WILLIAM,D.,SCHAEFFER,1,POMONA,CA,US

Already, we can tell that this list ties inventors back to their patent applications. `application_number` is the primary id that the patent applications use, and that number ties each of these inventors back to a patent application.
Additionally, we see that the rest of the fields describe an Inventor. From this first file, we can gather that `application_number` is something we want to look for in the other files, and that there is a list of attributes that describe an Inventor.
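To make that split concrete, here is a minimal sketch that parses the header and sample row shown above with Python's `csv` module and separates the application reference from the inventor attributes. The variable names (`sample`, `row`, `inventor_fields`) are illustrative and not part of the dataset.

```python
import csv
import io

# Header and first data row copied from all_inventors (leading space on the data row removed)
sample = (
    "application_number,inventor_name_first,inventor_name_middle,inventor_name_last,"
    "inventor_rank,inventor_city_name,inventor_region_code,inventor_country_code\n"
    "04840815,WILLIAM,D.,SCHAEFFER,1,POMONA,CA,US\n"
)

# DictReader maps each value to its column name from the header
row = next(csv.DictReader(io.StringIO(sample)))

# application_number points at the Application; everything else describes the Inventor
application_number = row["application_number"]
inventor_fields = {k: v for k, v in row.items() if k != "application_number"}

print(application_number)
print(inventor_fields)
```

Because `csv` reads every value as a string, the leading zero in `04840815` is preserved, which matters if the number is later used to match rows across files.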

Let's start building our schema based on what we know.

First we need to identify the objects that this file talks about. The first immediate one is **Inventor**. Additionally, we know that **Applications** exist due to the reference to `application_number`. Beyond knowing that **Inventor** exists, we also know a little bit about our **Inventor**s, such as their first, middle, and last names as well as the region that they live in. One thing that we do not have for our **Inventor**s is a unique identifier. There's no `inventor_id` or other field that could be used to ensure unique inventors. This is frustrating, and something we'll need to generate ourselves.
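One possible way to generate such an identifier is to hash the fields that describe an inventor. The sketch below is an assumption, not the method this walkthrough commits to: it treats two rows with the same name and location as the same person, so it will merge distinct inventors who share both and split one inventor who moved. The names `ID_FIELDS` and `make_inventor_id` are hypothetical.

```python
import hashlib

# Fields assumed to identify an inventor (an assumption for illustration).
# inventor_rank is left out because it describes a position on one application.
ID_FIELDS = [
    "inventor_name_first", "inventor_name_middle", "inventor_name_last",
    "inventor_city_name", "inventor_region_code", "inventor_country_code",
]

def make_inventor_id(row):
    """Return a deterministic id built from the inventor's name and location fields."""
    # Normalize case and whitespace so trivial formatting differences do not create new ids
    key = "|".join((row.get(field) or "").strip().upper() for field in ID_FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]

row = {
    "inventor_name_first": "WILLIAM",
    "inventor_name_middle": "D.",
    "inventor_name_last": "SCHAEFFER",
    "inventor_city_name": "POMONA",
    "inventor_region_code": "CA",
    "inventor_country_code": "US",
}
print(make_inventor_id(row))
```

The same input always produces the same id, so the id can be regenerated when the files are reprocessed instead of being stored in a lookup table.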

We can use this info to define our first Vertices in the schema.

-**Inventor**
 - id (we have to generate this)
 - name_first
 - name_middle
 - name_last
 - inventor_rank
 - inventor_city
 - inventor_region
 - inventor_country

-**Application**
 - application_number
 That's all we know for now about applications.
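As a rough illustration, this first draft of the schema can be written down as plain Python dataclasses. This is only a sketch for reasoning about the shape of the data, not a schema definition for any particular graph database, and the id value `"inv-1"` is a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Inventor:
    id: str                # generated, since the file has no inventor id
    name_first: str
    name_middle: str
    name_last: str
    inventor_rank: int
    inventor_city: str
    inventor_region: str
    inventor_country: str

@dataclass
class Application:
    application_number: str  # kept as a string to preserve leading zeros

# The sample row from all_inventors expressed as two vertices
inventor = Inventor("inv-1", "WILLIAM", "D.", "SCHAEFFER", 1, "POMONA", "CA", "US")
application = Application("04840815")

print(inventor)
print(application)
```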

Looking at what we have above we can see that our **Inventor** is actually describing 3 things. The **Inventor** themselves (names), their **Rank** on the patent, and the **Location** that they used at the time of the filing.

This is where our domain knowledge will come in a little bit. When filing a patent, the inventors are ordered by how much they contributed to the patent. This is their **Rank**. An **Inventor**'s **Rank** is specific to each **Application** that they are on. Because of that, it does not make sense to store **Rank** inside of **Inventor**, because the stored value would only reflect one particular **Application**.

So what do we do here? Let's walk through the possibilities.

The first one is that we break off **Rank** as its own Vertex. This seems logical, because an **Inventor** `has_rank` **Rank**. But now let's run this through a theoretical example. 
![rank schema](images/rank_schema.png)
*Inventor 1* has filed two **Application**s, *Application 1* and *Application 2*. *Inventor 1* is *Rank 1* on the first application and *Rank 2* on the second application. Following the solution outlined above, *Inventor 1* would have two **Rank** vertices attached to them, *Rank 1* and *Rank 2*. However, there's nothing that would tie either *Rank 1* or *Rank 2* to a particular **Application**. So finding the **Rank** of *Inventor 1* on *Application 1* would return both *Rank 1* and *Rank 2*.
![](images/rank_1.png)
![](images/rank_1_1.png)
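The ambiguity described above can be reproduced with a toy in-memory graph. This is a sketch using plain Python lists of edges rather than a real graph database; the edge name `filed` and the helper `ranks_of` are made up for the example.

```python
# Toy graph: has_rank edges go from an inventor to a Rank vertex
has_rank = [
    ("Inventor 1", "Rank 1"),
    ("Inventor 1", "Rank 2"),
]

# Separate edges record which applications the inventor filed
filed = [
    ("Inventor 1", "Application 1"),
    ("Inventor 1", "Application 2"),
]

def ranks_of(inventor):
    """Follow every has_rank edge out of an inventor."""
    return [rank for src, rank in has_rank if src == inventor]

# No has_rank edge mentions an application, so asking for
# Inventor 1's rank on Application 1 cannot be narrowed down
print(ranks_of("Inventor 1"))  # ['Rank 1', 'Rank 2']
```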
Okay then, so let's also tie our **Rank** to **Application**. Now *Inventor 1* `has_application` *Application 1*. *Application 1* `has_inventor_rank` *Rank 1*, and *Rank 1* also ties back to *Inventor 1*.
![](images/rank_2_1.png)
This seems like it would work, and it will, but let's see how messy this gets when we consider multiple **Application**s. Every **Application** will `has_inventor_rank` *Rank 1*, because there has to be at least one **Inventor** on an **Application**. So if we wanted to find out which **Rank** and **Inventor** belong to any given **Application**, our traversal would look like this:
![](images/rank_2_2.png)

Luckily, there's a much easier way than all of this. We don't have to limit our information to only our Vertices; we can also store additional data along edges.
![](images/rank_3_1.png)
Instead of making a Vertex for **Rank**, we can include it as an attribute of the `filed_application` edge. Now, all we need to do is traverse one edge in order to find out not only which **Application** an **Inventor** filed, but also their **Rank** on that application.
![](images/rank_3_2.png)
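Here is the same toy example with **Rank** stored on the edge instead of in its own Vertex. As before, this is a sketch in plain Python, not the API of any particular graph database, and the helper `rank_on` is hypothetical.

```python
# Each filed_application edge carries the inventor's rank as an edge attribute
filed_application = [
    {"inventor": "Inventor 1", "application": "Application 1", "rank": 1},
    {"inventor": "Inventor 1", "application": "Application 2", "rank": 2},
]

def rank_on(inventor, application):
    """Return the rank stored on the edge between an inventor and an application, or None."""
    for edge in filed_application:
        if edge["inventor"] == inventor and edge["application"] == application:
            return edge["rank"]
    return None

print(rank_on("Inventor 1", "Application 1"))  # 1
print(rank_on("Inventor 1", "Application 2"))  # 2
```

A single edge lookup now answers both questions at once: which **Application** the **Inventor** filed and their **Rank** on it.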