## 9.1 Stock Data
1. Scope:
    - have data on stock prices (open, close, high, low)
    - have 1000 client apps accessing this data
2. Assumptions:
    - data is going to be read-only. don't want client apps to change the prices
    - clients probably only request the data every couple of hours, since they are asking for end-of-day price info (open, close, high, low), which only changes once per trading day
3. Major Components:
    - database to contain all of this info
        - probably going to use a SQL-based DBMS like MySQL for the job. since it's mostly going to be read-only queries involved, don't need the DBMS to have complicated queries or actions
    - going to have some sort of API to handle all client requests and to query database
        - going to use node.js to create it since it is simple enough for what we're doing, and it can be run as a cluster of worker processes if traffic gets heavy
4. Key Issues:
    - if we just run one instance of the node.js api then it can be a big bottleneck if all of the clients make the requests within a small time period
    - if we only have one database with all the info, then it could also be a source of a bottleneck if there are multiple queries going on
5. Redesign:
    - we can scale the api with the node.js cluster module, which forks multiple worker processes that share the incoming traffic and handle requests in parallel. decent solution on a multi-core machine, since each core can run its own worker
    - we could have read-only replicas (clones) of the database that the workers could query, which also divides up the query load
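
A minimal sketch of the replica idea from the redesign, written in Python just to keep it short. The `ReplicaRouter` class and the replica names are made up for illustration; in a real setup the entries would be database connections, and the routing would likely live in a load balancer or connection pool rather than in hand-written code.

```python
from itertools import cycle


class ReplicaRouter:
    """Hands out read replicas in round-robin order (hypothetical helper)."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("need at least one replica")
        self._replicas = cycle(replicas)

    def next_replica(self):
        # each call returns the next replica, wrapping around at the end
        return next(self._replicas)


# placeholder names standing in for real database connections
router = ReplicaRouter(["replica-a", "replica-b", "replica-c"])
picked = [router.next_replica() for _ in range(6)]
```

Since the data is read-only for clients, any replica can answer any query, which is what makes plain round-robin a reasonable starting point.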

### Book Solutions:

## 9.2 Social Network

1. Scope:
    - what features should this social network have? 
        - profiles, connections, messages, etc
    - who is the target audience?
        - linkedin and facebook cater to different user bases.
        - one is for professionals looking for jobs or posting up job offers
        - the other is a way for families, friends, acquaintances to connect
        - identifying the user audience is important to determine what features it should have
2. Assumptions:
    - going to assume a generic facebook-like social network
    - going to have a user profile, friends list, etc
    - able to message between user and their friends either 1-1 or group chat
    - going to have some sort of a timeline for users to create posts and share them
3. Major components:
    - user profile
    - friends list
    - messaging system
    - timeline
    - data structure to use: a graph
        - a network itself is like a big graph so it's a no-brainer for the social network to be one as well
        - the social network itself is going to be a large graph with each of its nodes acting as a feature
            - so one node for profiles
            - one node for messaging system
            - one node for timelines etc
        - the user profiles themselves are graphs with each node being assigned a different function
            - one node for mutuals
            - one node for user's information
            - another node for posts
            - and one for any messages
4. Key issues:
    - might have redundant information in each graph and will be pretty tedious to update them
    - so we have to update the messages in the user profile graph and in the messaging system node in the social network graph
5. Redesign:
    - should just keep that information in one place to make it more efficient to update
    - so all messages should be kept in the messaging system node and the user's messaging node has a reference to all their conversations
***
* for the shortest path between two users based on their mutuals:
    1. should be using a breadth-first search to find the shortest path
    2. essentially you run a double bfs until you land on a mutual user between the two
        - so you run bfs from user1 --> user2
        - and you simultaneously run one from user2 --> user1
        - if the algorithm lands on the same mutual or on one that the other already looked through, then you have found the shortest path
    3. the way the data structure is set up for the social network as a big graph will help tremendously in this b/c you will have a list of friends/mutuals to work with for the double-bfs method
* a possible optimization would be to first look through all the connections of user1 and check if they are also user 2's connections
    - if there is at least 1 match (and the two users are not already directly connected), then the path through that mutual is the shortest path between the two users and the double-bfs method is not necessary
    - it would probably be faster to do this than jumping straight into double-bfs b/c if both of the users have huge amounts of connections, like 5k+, then that is a lot of mutuals to go through
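
The double-bfs idea and the mutual-friend pre-check can be sketched roughly like this. It assumes the friends graph is a plain dict mapping each user to a set of friends, which is a simplification of the graph described above; the function names and the sample users are made up. Each side expands one full level at a time, and the search stops as soon as one side discovers a user the other side has already visited.

```python
from collections import deque


def _expand_level(friends, frontier, parents, other_parents):
    """Advance one BFS level; return the meeting user if the two searches touch."""
    for _ in range(len(frontier)):
        user = frontier.popleft()
        for friend in friends.get(user, ()):
            if friend in parents:
                continue  # already visited from this side
            parents[friend] = user
            if friend in other_parents:
                return friend
            frontier.append(friend)
    return None


def _walk_back(parents, user):
    """Follow parent links from user back to the side's starting user."""
    path = []
    while user is not None:
        path.append(user)
        user = parents[user]
    return path


def shortest_path(friends, source, target):
    if source == target:
        return [source]
    # the parent maps double as visited sets
    parents_s, parents_t = {source: None}, {target: None}
    frontier_s, frontier_t = deque([source]), deque([target])
    while frontier_s and frontier_t:
        meet = _expand_level(friends, frontier_s, parents_s, parents_t)
        if meet is None:
            meet = _expand_level(friends, frontier_t, parents_t, parents_s)
        if meet is not None:
            left = _walk_back(parents_s, meet)[::-1]  # source ... meet
            right = _walk_back(parents_t, meet)[1:]   # users after meet ... target
            return left + right
    return None  # the two users are not connected


def mutual_shortcut(friends, a, b):
    """Optional pre-check: a 2-hop path through a shared friend, if one exists."""
    if b in friends.get(a, set()):
        return [a, b]  # already direct friends
    common = friends.get(a, set()) & friends.get(b, set())
    return [a, min(common), b] if common else None


# made-up sample graph
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
    "zed": set(),
}
path = shortest_path(friends, "ann", "eve")
```

The pre-check is just a set intersection, so it costs roughly the size of the smaller friends list, while the bfs has to keep parent maps for everything it visits.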

### Book Solutions:

## 9.3 Web Crawler

1. Scope:
    - what are the infinite loops in regards to?
    - is there a possibility for infinite loops when crawling within a domain's contents or are we talking about infinite loops when crawling through the entire web?
        - the answer to this could change the way the algorithm is designed
2. Assumptions:
    - assume infinite looping in regards to pages within a site so mypage/about vs mypage/faq, etc
    - assume site is a multi-page application rather than a single-page one
3. Major Components:
    - could have a hash table that takes in the URL of the page currently being crawled
    - if the crawler enters a site that it's already seen before, it will abort the operation and move onto other pages or sites
    - within that hash table, each site will also have info on the contents of the site
    - should have another hash table with some info on common html elements like head or body tags to make a comparison in case the URL check fails
4. Key issues:
    - for URLs, there could be pages that have different URLs but pretty much the same content, and those could accidentally be crawled infinitely
    - it also doesn't take into account the same domains/subdomains for a page
5. Redesign:
    - the hash table should also add in info on domains/subdomains that have similar extensions, etc
    - especially for sites that use QUERIES at the end of the URLs which would have different URLs but pretty much the same content
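
A rough sketch of the two checks described above: a set of already-seen URLs plus a set of already-seen page contents. The `normalize` rules (lowercase host, sorted query params, no fragment, no trailing slash) and the fake `SITE` dictionary are assumptions made for the example, not a complete canonicalization scheme. The content check here uses an exact hash, so it only catches identical pages; "pretty much the same" content would need some kind of fuzzy signature instead.

```python
import hashlib
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url):
    """Collapse trivially different spellings of the same URL."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))  # order-independent query
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def crawl(start, fetch):
    """fetch(url) -> (content, links). Returns the pages actually kept."""
    seen_urls, seen_content, kept = set(), set(), []
    queue = deque([start])
    while queue:
        url = normalize(queue.popleft())
        if url in seen_urls:
            continue  # URL check: already visited
        seen_urls.add(url)
        content, links = fetch(url)
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_content:
            continue  # content check: same page under a different URL
        seen_content.add(digest)
        kept.append(url)
        queue.extend(links)
    return kept


# made-up site: the faq pages link to each other forever via ?page=N
SITE = {
    "http://mypage.com/": ("home", ["http://mypage.com/about", "http://mypage.com/faq?page=1"]),
    "http://mypage.com/about": ("about us", ["http://mypage.com/", "http://mypage.com/about/#team"]),
    "http://mypage.com/faq?page=1": ("faq", ["http://mypage.com/faq?page=2"]),
    "http://mypage.com/faq?page=2": ("faq", ["http://mypage.com/faq?page=3"]),
}
pages = crawl("http://mypage.com", lambda url: SITE.get(url, ("", [])))
```

In this run the crawler stops at `faq?page=2` because its content matches a page it has already kept, so the query-string loop never gets followed.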

### Book Solutions:

## 9.4 Duplicate URLs

1. Scope:
    - what should we do with the duplicate URLs? should we just notify someone that there is a duplicate URL or should it be removed?
    - how are these URLs gathered and stored in the first place? is there a list of these URLs or do we have to crawl for them?
    - is there a distinction between domains and subdomains?
        - do maps.google.com and google.com count as duplicates?
2. Assumptions:
    - assume that the whole URL has to be unique
        - so google.com/results, google.com, and maps.google.com are three unique results
    - also assume that the URLs are stored in a list of some sort, like an array
3. Algorithm:
    1. want to iterate through the entire list of URLs and add them to a hash table
    2. while iterating, check to see if the URL is already in the hash table. if it is, then we can notify that it is a duplicate and move on. 
        - this will essentially take O(n) space where n = # of URLs. in this case, n = 10 billion URLs
        - it should only take 1 pass through this list of URLs to notify and find any duplicates b/c as you pass through the list, you only keep track of unique URLs in the hash table and any duplicates will immediately be found
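        - it should only take 1 machine to do, as long as the hash table fits in that machine's memory. that is a big assumption: at roughly 100 bytes per URL, 10 billion URLs is about 1 TB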
4. Key issues:
    - if we wanted to scale this further to more than 10 billion URLs, we might not have enough space to fit all the URLs onto one machine
5. Redesign:
    - we could split up the list of URLs into multiple parts and do the same algorithm on multiple machines
    - then once we are done, we can compare all the URLs in each hash table to see if there are duplicates between the hash tables
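
A small sketch of the single-pass check, plus one possible variation on the multi-machine redesign. Instead of splitting the list arbitrarily and comparing hash tables afterwards, this version assigns each URL to a machine by hashing it, so identical URLs always land on the same machine and no cross-machine comparison is needed. That is a swapped-in technique, not what the notes above describe, and the function names and the use of `zlib.crc32` are just choices made for the example.

```python
import zlib


def find_duplicates(urls):
    """Single pass: report every URL that has already been seen."""
    seen, dupes = set(), []
    for url in urls:
        if url in seen:
            dupes.append(url)
        else:
            seen.add(url)
    return dupes


def partition(urls, machines):
    """Assign each URL to a bucket (machine) based on a hash of the URL."""
    buckets = [[] for _ in range(machines)]
    for url in urls:
        index = zlib.crc32(url.encode("utf-8")) % machines
        buckets[index].append(url)
    return buckets


urls = ["google.com", "maps.google.com", "google.com/results", "google.com", "maps.google.com"]
dupes_single = find_duplicates(urls)

# each bucket could be checked on its own machine, independently of the others
dupes_split = sorted(d for bucket in partition(urls, 3) for d in find_duplicates(bucket))
```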

### Book Solutions:

## 9.5 Cache

1. Scope:
    - does processSearch(query) get results from a subset of the 100 machines available or is it from an entirely different cluster of machines?
    - is processSearch(query) expensive in time or in space?
    - how recent should the search results be? should it only be the most recent 1k, 100k, 1 million, etc results?
2. Assumptions:
    - assume a different set of machines handles the search than the current 100 machines
    - assume processSearch() is a time-expensive thing
    - assume data from machines are sent back to the web server
    - assume most recent 100k results
3. Major components:
    - web server that assigns a query to a random machine
    - 100 machines that are randomly selected for the results
    - the separate cluster of machines that handle the actual search
    - the client that sends the query to the web server
    - a hash table containing the most recent results
        - key = the query which is a string
        - value = the result sent back from processSearch()
        - on the web server, it will receive the query from the client and check it against this hash table
        - if it is present in the table, then return the result. else, call on one of the machines to call processSearch()
        - and if we have reached the threshold of 100k results in the table, then remove the oldest ~10k or so results in the table to make room for more
4. Key Issues:
    - will require some space for the hash table if there are lots of results to be cached
    - removal of the oldest 10k results in a hash table might be difficult b/c results are not ordered in any way
    - and it might be costly to remove that many results
5. Redesign:
    - use of a linked list with each node being a hash table
    - linked list will have a head pointer and a tail pointer and the hash table has a capacity of ~10k results
    - so once we reach the 10k results, the linked list will add another node at the head that contains another hash table
    - and the node at the tail will be removed
    - searching for the result will not be expensive b/c there will be at most 10 nodes for the most recent 100k results
    - and removing the oldest 10k results is as simple as updating the tail pointer to reference the node prior to the last one
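
The redesign can be sketched as below. A `collections.deque` stands in for the linked list with head and tail pointers, and each element is a dict acting as one hash table. The class name, the `search` helper and the tiny sizes in the demo are assumptions for illustration; the notes above use ~10k results per table and 10 tables.

```python
from collections import deque


class BucketedCache:
    """Most-recent-results cache: a short chain of fixed-size hash tables."""

    def __init__(self, bucket_size=10_000, max_buckets=10):
        self.bucket_size = bucket_size
        # index 0 is the newest table; a full deque drops the oldest on appendleft
        self.buckets = deque([{}], maxlen=max_buckets)

    def get(self, query):
        for bucket in self.buckets:  # at most max_buckets lookups
            if query in bucket:
                return bucket[query]
        return None

    def put(self, query, result):
        if len(self.buckets[0]) >= self.bucket_size:
            self.buckets.appendleft({})  # new head; tail falls off if at capacity
        self.buckets[0][query] = result


def search(cache, query, process_search):
    """Check the cache first; only call the expensive search on a miss."""
    result = cache.get(query)
    if result is None:
        result = process_search(query)
        cache.put(query, result)
    return result


# demo with tiny sizes and a fake processSearch that records its calls
calls = []


def fake_process_search(query):
    calls.append(query)
    return query.upper()


cache = BucketedCache(bucket_size=2, max_buckets=2)
first = search(cache, "a", fake_process_search)
second = search(cache, "a", fake_process_search)  # served from the cache
for q in ["b", "c", "d", "e"]:
    search(cache, q, fake_process_search)  # "e" pushes out the table holding a and b
```

Dropping the oldest table is a constant-time operation here, which is the main thing the redesign was after.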

### Book Solutions:

## 9.6 Sales Rank

1. Scope:
    - how are these products stored? is there just one general table called 'Products' or are they already stored by categories such as Sports, Home, Kitchen, etc
    - does each product have a list of categories that describe what it is meant for? like if a product was part of Kitchen, Home, Utensil or something
    - is there a master list of all available categories in the eCommerce site?
    - are the categories static or do they add new categories?
2. Assumptions:
    - assume that all products are stored in one table called 'Products' with a list of categories describing what the product is meant for
    - assume that there is some sort of master list with all the categories of their products available
    - assume that the master list is static and no new categories are added into it
3. Major Components:
    - list of all products on the site
    - list of all categories of products
    - the system would work like this:
        - it would iterate through every product in the Products table
        - and it would then create an array for each category that the products are attached to, i.e. create a Sports array and a Home array
        - then it would add the product to this array
        - once all products are accounted for, each of these arrays will then be sorted by # of products sold to determine the rankings
        - this would also be done for the Overall category
        - and the system would only be doing this every hour or so. could be more frequent if the eCommerce site usually has a high amount of traffic. so it could be done every 15 minutes or so
4. Key issues:
    - if there is a huge amount of products available, then iterating through each one then sorting them by categories will be an expensive task time-wise
    - not to mention it would also require a lot of space too to store all the products in arrays for each category
5. Redesign:
    - would be able to cache some of the results of the previous rankings and make adjustments if necessary.
    - this would be good if some products are not as frequently sold as others so their rankings are quite stagnant whereas the more frequently sold items would need to be constantly sorted
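
The basic ranking pass from the major components can be sketched like this. The product fields (`name`, `categories`, `sold`) and the sample data are made up; a real 'Products' table would have its own schema. This is the full recompute, not the cached version from the redesign.

```python
from collections import defaultdict


def rank_by_category(products):
    """Group products by category, then sort each group by units sold."""
    by_category = defaultdict(list)
    for product in products:
        by_category["Overall"].append(product)
        for category in product["categories"]:
            by_category[category].append(product)
    return {
        category: [p["name"] for p in sorted(items, key=lambda p: p["sold"], reverse=True)]
        for category, items in by_category.items()
    }


# made-up sample rows
products = [
    {"name": "whisk", "categories": ["Kitchen", "Home"], "sold": 40},
    {"name": "football", "categories": ["Sports"], "sold": 75},
    {"name": "lamp", "categories": ["Home"], "sold": 55},
]
rankings = rank_by_category(products)
```

A product listed under several categories shows up in each of those lists, which is where the extra space mentioned in the key issues comes from.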

### Book Solutions:

## 9.7 Personal Financial Manager

1. Scope:
    - what types of features exactly should be available?
        - when they say 'Make recommendations' do they mean make recommendations on spending habits? on stocks? what exactly is it?
    - how do we connect to the bank accounts of these users and how often should their data be requested? what kind of data do we need from them?
    - what types of habits are we looking for in terms of spending and how exactly do we categorize it?
2. Assumptions:
    - assume that it will track purchases that are provided by bank statements and make recommendations on how to reduce spending and budget for other things
    - should probably categorize by essential and non-essential purchases. should probably give the user the ability to categorize their purchases as well since some users prioritize some purchases over others, e.g. streamers prioritize faster internet speeds whereas the avg person will just get an average plan
3. Major Components:
    - some sort of API that will help connect us to people's bank account information in regards to purchases. Chase bank has its own API for that sort of thing so might have to use multiple APIs to get access to all the popular banks
    - will have a database of some kind to store information on purchases for up to a year or some other predefined amount of time and each purchase will be categorized by essential or non-essential or some other categories as well
    - will need some sort of analytics implemented to show spending in general to the user
    - the data should be re-fetched from the API every day or every couple of days to show trends in spending habits so far
4. Key issues:
    - not a lot of the data that we fetch needs to be put into a database if we can already request it from bank APIs so having it would just be a waste of space
    - not to mention users will have variable purchasing habits so it is hard to tell whether to request the data frequently or infrequently from these APIs
5. Redesign:
    - the user should get to decide how often their purchases should be fetched. some users might like to look at this info on a monthly basis while others might want to look at it daily. this will help reduce the amount of calls to these APIs
    - purchases should not be kept in a database at all and instead their spending habits should be saved. any time we want to look at purchases, we ask the API. and any time users want to look at prior spending habits, we already have it saved in our database.
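
One way to picture "save the spending habits, not the purchases": boil each batch of fetched purchases down to per-month totals and store only those. The purchase fields, the sample data and the `summarize` function are all hypothetical, and amounts are kept in cents to avoid floating-point rounding. The set of essential categories is passed in, since the notes above let each user decide that for themselves.

```python
from collections import defaultdict


def summarize(purchases, essential_categories):
    """Reduce raw purchases to totals per (month, essential / non-essential)."""
    totals = defaultdict(int)
    for purchase in purchases:
        month = purchase["date"][:7]  # "YYYY-MM" from an ISO date string
        if purchase["category"] in essential_categories:
            kind = "essential"
        else:
            kind = "non-essential"
        totals[(month, kind)] += purchase["amount_cents"]
    return dict(totals)


# made-up purchases, as if just fetched from a bank API
purchases = [
    {"date": "2024-01-03", "category": "rent", "amount_cents": 120000},
    {"date": "2024-01-10", "category": "games", "amount_cents": 6000},
    {"date": "2024-01-15", "category": "internet", "amount_cents": 8000},
    {"date": "2024-02-01", "category": "rent", "amount_cents": 120000},
]
# this user treats internet as essential
habits = summarize(purchases, essential_categories={"rent", "internet"})
```

Only the `habits` dict would need to go into the database; the raw purchases can be discarded and re-requested from the API when needed.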

### Book Solutions:

## 9.8 Pastebin

### Book Solutions: