# Acceptance Criteria

The Secretary of State of North Dakota provides a business search web app that allows users to search for businesses by name. Your task:
- Download information for all active companies whose names start with the letter "X" (e.g., Xtreme Xteriors LLC) including their Commercial Registered Agent, Registered Agent, and/or Owners. Save the crawled data in the file format of your choice.
- Create and plot a graph of the companies, registered agents, and owners. You may consider names as sufficiently unique to identify each node in the graph.
- Full Task: [here](https://gist.github.com/jvani/57200744e1567f33041130840326d488)
- Website First Stop: [here](https://firststop.sos.nd.gov/search/business)

Discuss:
- Volume, Velocity, Variety

# Install, format, and lint

In [None]:
!make install  # Install the required modules
%conda install --channel conda-forge pygraphviz
!make format
!make lint  # Run the Flake8 linter

# Run the crawler

This cell runs the Scrapy crawler against the FirstStop website. Scrapy is a framework rather than a library and comes with batteries included:
- automatic deduplication of requests
- configurable recursion depth
- cookie handling, configurable user agents, and proxy support
- asynchronous requests (built on Twisted)
- data pipeline support, error handling, and logging
- auto throttling
- concurrent requests, which generally make it faster and more scalable than a sequential scraper built on BeautifulSoup
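
Most of the features listed above are switched on or tuned through Scrapy settings. The fragment below is a sketch of a `settings.py` that uses real Scrapy setting names; the values are illustrative assumptions, not the configuration used by this project's crawler.

```python
# settings.py (illustrative values; assumptions, not this project's configuration)

# Auto throttling: adapt the delay between requests to observed server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0  # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0   # upper bound on the delay in seconds

# Recursion depth: 0 means no limit
DEPTH_LIMIT = 2

# Cookies are tracked by the built-in cookies middleware
COOKIES_ENABLED = True

# User agent sent with every request (hypothetical value)
USER_AGENT = "firststop-crawler (+https://example.invalid/contact)"

# Asynchronous requests: maximum number of requests in flight
CONCURRENT_REQUESTS = 8

# Logging verbosity
LOG_LEVEL = "INFO"
```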

In [None]:
# Run the crawler
!make crawl

# Read the output file

Read the output written by the FirstStop crawler for further processing.

In [None]:
import json
with open('scrapy/output.json', 'r') as file:
    # Loads the JSON data from the file created from 'scrapy'
    companies_data = json.load(file)

companies_data

# Create the Graph

Parse the crawled data and build a graph of the relationships between companies and their agents and owners.

In [4]:
import networkx as nx
G = nx.Graph()  # Undirected graph: nodes are companies and their agents/owners

def create_graphs():
    """
    Creates the graphs, with two different types of nodes: company and person

    The edges are defined through defining relation between
    owner/agent and company
    """
    for data in companies_data:
        company_name = data["Company"]
        G.add_node(company_name, type="Company")
        # TODO: All to Uppercase, and strip with regex or str join
        if "Commercial Registered Agent" in data:
            cr_agent = data["Commercial Registered Agent"].split("\n")[0]
            G.add_node(cr_agent, type="Person")
            G.add_edges_from([(company_name, cr_agent),])
        if "Registered Agent" in data:
            r_agent = data["Registered Agent"].split("\n")[0]
            G.add_node(r_agent, type="Person")
            G.add_edges_from([(company_name, r_agent),])
        if "Owner Name" in data:
            owner = data["Owner Name"]
            G.add_node(owner, type="Person")
            G.add_edges_from([(company_name, owner),])
        elif "Owners" in data:
            # The second owner is read from an empty-string key;
            # guard against that key being absent
            owners = [data["Owners"].split("\n")[0]]
            if data.get(""):
                owners.append(data[""].split("\n")[0])
            for owner in owners:
                G.add_node(owner, type="Person")
                G.add_edges_from([(owner, company_name),])
        # TODO: Throw an exception if none are satisfied

create_graphs()
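
The TODO in the cell above mentions upper-casing and stripping names before they become node keys. A minimal sketch of such a helper follows; `normalize_name` is a hypothetical name, not part of the notebook, and the sample address is made up for illustration.

```python
import re


def normalize_name(raw: str) -> str:
    """Return the first line of a scraped field, upper-cased with whitespace collapsed."""
    lines = raw.strip().splitlines()
    first_line = lines[0] if lines else ""
    # Collapse runs of whitespace so "A  B" and "A B" map to the same node
    return re.sub(r"\s+", " ", first_line).strip().upper()


# Hypothetical scraped value: a name followed by an address line
print(normalize_name("  Xtreme   Xteriors LLC\n123 Main St"))  # XTREME XTERIORS LLC
```

Since names are treated as unique node identifiers, normalizing them this way would merge nodes that differ only in case or spacing.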

# Visualize the Graph

In [59]:
# Set this flag to False if labels on graph are too distracting
graph_show_labels = True

In [None]:
import matplotlib.pyplot as plt

def visualize_graph(type_colors):
    """
    Visualizes graphs representing relationship between owner/agent and companies
    """
    plt.figure(1, figsize=(16, 16))
    plt.title('Company and Owner/Agent Network')
    pos = nx.nx_agraph.graphviz_layout(G, prog="neato")
    components = (G.subgraph(component) for component in nx.connected_components(G))
    for sub_graph in components:
        subgraph_colors = [type_colors[G.nodes[node]['type']] for node in sub_graph.nodes()]
        nx.draw(sub_graph, pos, node_size=40, node_color=subgraph_colors, with_labels=graph_show_labels)

colors = {'Company': 'lightblue', 'Person': 'lightgreen'}

visualize_graph(colors)

In [None]:
def print_connected_component_data():
    "Prints out connected component data"
    for component in nx.connected_components(G):
        print(G.subgraph(component).nodes(data=True))

print_connected_component_data()
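
A complementary view of the same data is how many companies share each agent, which can help explain the larger connected components. The sketch below counts this directly from records shaped like `companies_data`; the function name and the sample records are hypothetical.

```python
from collections import Counter

AGENT_KEYS = ("Commercial Registered Agent", "Registered Agent")


def count_companies_per_agent(records):
    """Count how many company records name each agent (first line of the field)."""
    counts = Counter()
    for record in records:
        for key in AGENT_KEYS:
            if key in record:
                counts[record[key].split("\n")[0]] += 1
    return counts


# Hypothetical records, shaped like the entries in companies_data
sample = [
    {"Company": "X ONE LLC", "Registered Agent": "JANE DOE\n1 Example St"},
    {"Company": "X TWO LLC", "Registered Agent": "JANE DOE\n1 Example St"},
    {"Company": "X THREE INC", "Commercial Registered Agent": "ACME AGENTS\nPO Box 0"},
]
print(count_companies_per_agent(sample).most_common(1))  # [('JANE DOE', 2)]
```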

# Push to Neo4J

In [None]:
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase

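def init_neo4j_driver():
    """Builds a Neo4j driver from the NEO4J_* environment variables."""
    uri = os.getenv("NEO4J_URI")
    username = os.getenv("NEO4J_USERNAME")
    password = os.getenv("NEO4J_PASSWORD")
    if not all((uri, username, password)):
        # Without this check, str(None) would yield the literal string "None"
        raise RuntimeError("NEO4J_URI, NEO4J_USERNAME and NEO4J_PASSWORD must be set")
    print(uri)

    return GraphDatabase.driver(uri, auth=(username, password))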

load_dotenv(override=True)
driver = init_neo4j_driver()

In [6]:
import nxneo4j as nxneo
network = nxneo.Graph(driver)
network.delete_all() # Clear all

In [7]:
import nxneo4j as nxneo
G = nxneo.Graph(driver)  # Neo4j-backed graph; rebinds G, replacing the in-memory NetworkX graph

def create_network_neo4j():
    """
    Creates the network and writes it to Neo4j using the nxneo4j library
    """
    for data in companies_data:
        company_name = data["Company"]
        G.add_node(company_name, type="Company")
        if "Commercial Registered Agent" in data:
            cr_agent = data["Commercial Registered Agent"].split("\n")[0]
            G.add_node(cr_agent, type="Person")
            G.add_edges_from([(company_name, cr_agent),])
        if "Registered Agent" in data:
            r_agent = data["Registered Agent"].split("\n")[0]
            G.add_node(r_agent, type="Person")
            G.add_edges_from([(company_name, r_agent),])
        if "Owner Name" in data:
            owner = data["Owner Name"]
            G.add_node(owner, type="Person")
            G.add_edges_from([(company_name, owner),])
        elif "Owners" in data:
            # The second owner is read from an empty-string key;
            # guard against that key being absent
            owners = [data["Owners"].split("\n")[0]]
            if data.get(""):
                owners.append(data[""].split("\n")[0])
            for owner in owners:
                G.add_node(owner, type="Person")
                G.add_edges_from([(owner, company_name),])

create_network_neo4j()

### Anomalies:
- x4i limited
- xanadu
- tanner collette

### Cypher:
- `:config initialNodeDisplay: 1000`


Order of Steps:

- Work out the business search API and parse its response
  - An `Authorization: undefined` header was necessary to resolve issues with the request
- Work out the per-business API and parse its response
- Port the requests to Scrapy
- Set up a Jupyter notebook for visualization
- Add documentation
- Push to Neo4j
  - Tried the NeonX connector library
  - Used the nxneo4j (NetworkX-Neo4j) library instead
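
The header workaround from the first step can be sketched with the standard library. This only builds the request and does not send it. The URL and payload below are hypothetical placeholders, since these notes do not record the real endpoint or request body, and `build_search_request` is an illustrative name.

```python
import json
import urllib.request


def build_search_request(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the literal header value."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The notes above report that this literal value was needed
            "Authorization": "undefined",
        },
        method="POST",
    )


# Hypothetical URL and payload, for illustration only
request = build_search_request("https://example.invalid/api/search", {"query": "X"})
print(request.get_method(), request.get_header("Authorization"))  # POST undefined
```

In Scrapy the same header would be passed through the `headers` argument of the request object.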

# Process

Collect, Clean, Label, Validate, Visualize, Store
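
As an illustration of the Validate step, the sketch below rejects records that have no company name or no related party. The key names mirror the ones read when building the graph; `validate_record` and the sample records are hypothetical.

```python
RELATION_KEYS = (
    "Commercial Registered Agent",
    "Registered Agent",
    "Owner Name",
    "Owners",
)


def validate_record(record: dict) -> None:
    """Raise ValueError if a record lacks a company name or any agent/owner field."""
    if not record.get("Company"):
        raise ValueError(f"record has no company name: {record!r}")
    if not any(record.get(key) for key in RELATION_KEYS):
        raise ValueError(f"no agent or owner for {record['Company']!r}")


# Hypothetical records: the first passes, the second is rejected
validate_record({"Company": "X ONE LLC", "Owner Name": "JANE DOE"})
try:
    validate_record({"Company": "X TWO LLC"})
except ValueError as error:
    print(error)
```

Running a check like this before building the graph surfaces incomplete records early instead of leaving isolated company nodes.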
