# How can we build a company database to handle product sales end-to-end?

## Goals (2 min)

In this case, we will move away from using SQL queries to extract data and talk about how to design a database. This will be put into practice by creating a database in the cloud using Amazon AWS's ```RDS``` service.

The case involves interacting with a database in the cloud, so we will make ample use of the terminal to communicate with the server. Because the focus is on designing and creating databases, there will be more discussion than usual, but we will also run some simple queries to practice querying via the terminal.

## Introduction (5 min)

**Business Context.** You are a data analyst for the same large financial services firm as in the previous case. The firm was pleased with your analysis and now they see the value of having databases that can easily be queried using SQL. It would therefore like to move its data, which is currently stored as CSV files, onto a proper database.

**Business Problem.** The business would like you to **create a database to house its existing data and add a few more data tables to track additional information they are interested in**.

**Analytical Context.** The data is split across three tables: "Agent", "Call", and "Customer", which sit on CSV files. We will be creating a database and then loading the data from these CSV files into that database.

The case is sequenced as follows: you will (1) learn the fundamentals of database management systems; (2) set up an RDS instance on Amazon AWS; (3) use SQL to set up a new database with tables in PostgreSQL; (4) query your newly created database to answer business questions; and finally (5) design enhancements to your database to fit expanded business requirements.

## Database management systems (10 min)

So far, you have been using SQL via `SQLAlchemy` to interact with databases. But a database by itself is quite useless when more than one person is involved in writing data to it. Rather, we use **database management systems** (DBMS) to manage databases properly. All the examples mentioned in the previous case (SQL Server, Oracle, PostgreSQL) are in fact DBMS and not databases.

But why did these DBMS come into existence? Why wasn’t a database enough? To answer that, imagine our previous example where we mentioned the phone book record, where you keep track of all your friends and their phone numbers, but now on a text file on your computer. So you have the following file:

<img src="images/m1_1.jpg" width="400">

That looks fine and it works as you'd expect. But now imagine you upload a YouTube video of yourself dancing to a new song and, due to your amazing dancing moves, that video goes viral. As a result, you become the most popular person in your neighborhood, and everybody wants to be your friend. Since you want to keep track of all of your new friends' phone numbers, you decide to add all these new contacts to your “phones.txt” file. But given the number of new friends you are making, you ask your sister to help you with adding your new friends to the list. 

Being a tech-savvy guy, you share the folder where the file is located on your home network and create a shortcut to it on your sister’s laptop. She opens the file and you both start adding new contacts to it at the same time. Each of you saves the file when you are finished and goes to sleep. But when you wake up the next morning, you notice that half of all the contacts are gone!

### Exercise 1: (5 min)

What do you think went wrong here?

**Answer.** Both save operations worked properly, but each one wrote out the entire file. Whoever saved second overwrote the version saved first, so the file ended up containing only the contacts entered by the second person. Preventing this (and other even more complex and dangerous scenarios) is the reason why DBMS were born.
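The lost update described in this answer can be reproduced with a few lines of Python. This is only a sketch: the temporary file stands in for `phones.txt`, and the names and phone numbers are hypothetical.

```python
import os
import tempfile

def save(path, contacts):
    """Overwrite the whole file with one editor's in-memory copy."""
    with open(path, "w") as f:
        f.write("\n".join(contacts) + "\n")

def load(path):
    """Read the file into a list of contact lines."""
    with open(path) as f:
        return f.read().splitlines()

# A temporary file standing in for phones.txt (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "phones.txt")
save(path, ["John 555-0101"])

# Both editors open the same starting version of the file.
your_copy = load(path)
sister_copy = load(path)

# Each adds a different contact to their own in-memory copy.
your_copy.append("Rita 555-0102")
sister_copy.append("Paul 555-0103")

# You save first; your sister saves second and overwrites your version.
save(path, your_copy)
save(path, sister_copy)

final_contacts = load(path)
print(final_contacts)  # ['John 555-0101', 'Paul 555-0103']
```

Rita's entry is gone even though no save operation failed, which is exactly the kind of silent data loss a DBMS is designed to prevent.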

## Setting up a cloud database using RDS and importing data (45 min)

Let's set up a real database so that we can see the different design considerations in practice. To do this, we'll be using Amazon AWS's ```RDS``` product. Once we have the database created, we will connect to it using the `psql` command, which should have been installed in the previous case.

1. Log into your AWS account and select "RDS" from the service list. You should see a screen like the one below, where you can hit the "Create database" button:

![Create Database](images/create_db.png)

2. The next option you'll see asks you if you want to use "standard create" or "easy create". Easy might sound tempting, but **choose "standard"** as we'll have to set up our database for public use so we can connect to it locally.

3. Choose "PostgreSQL" as the database type, leave the version at the default AWS has chosen for you (11.6-R1 at the time of writing), and choose "Free Tier".

4. Under "Storage" turn off "Storage autoscaling". This will prevent any unexpected future charges.

![Turn off autoscaling](images/autoscale.png)

5. Under the next section, choose a name for your database instance. Remember this is the machine that is hosting the database software, not the database itself (one RDS instance can host many databases), so I'm calling mine `ds4a-demo-instance` to reflect this, although we'll only be creating a single database for now. 

6. You can leave the master username as `postgres` and ask RDS to autogenerate a password (we'll be able to see this password at the next step):

![Set DB password](images/set_db_password.png)

7. You can leave the next settings as their defaults until you get to the "Connectivity" section. Usually, you'll set up an RDS instance to play with other infrastructure within your AWS account, such as EC2 servers. In our case, we want to push data in and out of the database directly from our local machine as the client, so we'll have to set our database up for "public access". This is generally less secure, but we'll add some firewall rules in a bit to make sure that only we can access it:

      * Expand the "Additional connectivity configuration" section

      * Set "publicly accessible" to "Yes"

      * Under "VPC security group", choose to "Create new", and give it a name like `allow-local-access`. This will create a firewall rule that will allow you to connect to your database on port 5432 (the default for PostgreSQL) using your current IP address. If you are using public WiFi, a hotspot, or if you think your IP address is likely to change soon for any reason, note that you'll have to modify this security group any time your IP address changes:

![Create Security Group](images/create-sec-group.png)

8. Press the "Create database" button in the bottom right, and you'll be taken back to the overview page where you can see your database being created. At the top, there'll be a notification where you can press "View credential details" to access your master password that was automatically generated. **Take note of this as you can only see it once.** Note: this creates a database in the default VPC. If your default VPC is not configured for DNS connections, you will need to create a new VPC. Please see 'Appendix: Troubleshooting RDS creation' for instructions on how to achieve this.

![View credentials](images/view_creds.png)

9. Once your database becomes "available" (you might need to press the "refresh" button indicated below to see the change), you can connect to it. Click on the name of the database (`ds4a-demo-instance` in our example), to find out the connection details:

![DB available](images/db-available.png)

10. Once you click on the database, you should see the endpoint that you need on a screen similar to the one shown below. You need this endpoint to connect to the database from your local machine.

![DB Endpoint](images/db-endpoint.png)

11. Locally, open a terminal and run the following command, substituting [endpoint] with the one that you noted from the RDS console above.

```bash
psql -h [endpoint] -U postgres
```

This will connect to our instance's default database using the master username. It will prompt you for the password and you can enter the autogenerated password from above. You should now see a SQL prompt, similar to the image below:

![PSQL prompt](images/psql-prompt.png)

We've successfully created a cloud database and connected to it!

### Setting up our database (10 min)

Let's proceed by setting up our database in Amazon RDS:

1. In the SQL shell, run the following commands to create a database, create a user to manage our database, and give privileges on our new database to our new user. Replace [password] with your own choice of password:

```SQL
create database ds4a_demo_db;
create user ds4a_demo_user with login encrypted password '[password]';
grant all privileges on database ds4a_demo_db to ds4a_demo_user;
\q
```

Here, `\q` closes the connection so you can re-open it under a different user.

2. Run the following command. It is similar to the one we used before to connect but now specifies both our custom user and our custom database. Once again, substitute [endpoint] with the one you see in the RDS console.

```bash
psql -h [endpoint] -U ds4a_demo_user -d ds4a_demo_db
```

3. Put in the new password that you entered in the SQL statement in step 1 instead of the master password that AWS automatically generated for us.  You'll see a very similar prompt, but with the `ds4a_demo_db=>` prompt instead of `postgres=>`:

![demo prompt](images/demo-prompt.png)

The next thing we need to do is to create tables to house our data.

### Data Definition Language (DDL) statements in SQL (5 min)

**Data Definition Language (DDL)** statements are used to create, modify, and remove database objects (such as tables) themselves, as opposed to the data within them. The most important statement in the DDL space is the `CREATE TABLE` command. To create a table, you need to provide the table's name, its columns, and each column's type. For example, the SQL command below creates a table called `Product`, with an `INTEGER` field called `ProductID` and a `VARCHAR(20)` (string with up to 20 characters) field called `ProductName`:

```SQL
CREATE TABLE Product(ProductID INT, ProductName VARCHAR(20));
```

Once you have created a table, you can use **Data Manipulation Language (DML)** statements to manipulate the data stored in the tables themselves (rather than merely the outputs of queries, as you did in the previous case). These commands are:

1. `INSERT`: to insert data into a table
2. `UPDATE`: to update existing data within a table
3. `DELETE`: to delete records from a database table

Below is additional information on each:

1. The `INSERT INTO` statement is used to add rows to a table. Its syntax is as follows:

    ```SQL
    INSERT INTO table_name (column1, column2, column3,...)
    VALUES (value1, value2, value3,...);
    ```

    Alternatively, if you are adding values for all the columns of the table, you do not need to specify the column names.


2. The `UPDATE` statement is used to modify existing records in a table. You indicate which table you are updating, and then give the columns you want modified followed by a condition (a `WHERE` clause) that specifies which record(s) should be updated. If you omit the `WHERE` clause, all records in the table will be updated!

```SQL
UPDATE table_name
SET column1 = value1, column2 = value2,...
WHERE condition;
```

3. The `DELETE` statement is used to delete existing records in a table:

```SQL
DELETE FROM table_name
WHERE condition;
```

Similarly to the syntax for `UPDATE`, the `WHERE` clause specifies which record(s) should be deleted. **If you omit the `WHERE` clause, all records in the table will be deleted, so be VERY careful with this statement!**

The below table summarizes the differences between DDL and DML statements:

![DDL Statements](./images/ddl.png)
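To see the DDL and DML statements working together without touching the cloud instance, here is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in for PostgreSQL. The product names are hypothetical, and SQLite's type handling is looser than PostgreSQL's, so treat this as an illustration of the statement syntax only.

```python
import sqlite3

# In-memory SQLite database: a local stand-in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")

# DDL: define the table from the example above.
conn.execute("CREATE TABLE Product(ProductID INT, ProductName VARCHAR(20))")

# DML: INSERT three rows (hypothetical product names).
conn.executemany(
    "INSERT INTO Product (ProductID, ProductName) VALUES (?, ?)",
    [(1, "Checking"), (2, "Savings"), (3, "Loan")],
)

# DML: UPDATE only the row matched by the WHERE clause.
conn.execute("UPDATE Product SET ProductName = 'Mortgage' WHERE ProductID = 3")

# DML: DELETE only the row matched by the WHERE clause.
conn.execute("DELETE FROM Product WHERE ProductID = 1")

rows = conn.execute(
    "SELECT ProductID, ProductName FROM Product ORDER BY ProductID"
).fetchall()
print(rows)  # [(2, 'Savings'), (3, 'Mortgage')]
```

Dropping either `WHERE` clause would change or remove every row, which is why it is worth rehearsing such statements on a throwaway database first.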

### Exercise 2: (5 min)

Set up the following tables, with the column details given, in your new database. You should work on the statements in a text editor and copy and paste them into your `psql` connection:

1. Table Name: `Customer` 
   Columns:
      * **CustomerID** INT, this will be the primary key of the table
      * **Name** VARCHAR(50)
      * **Occupation** VARCHAR(50)
      * **Email** VARCHAR(50)
      * **Company** VARCHAR(50)
      * **PhoneNumber** VARCHAR(20)
      * **Age** INT

2. Table Name: `Agent`
   Columns:
      * **AgentID** INT, this will be the primary key of the table
      * **Name** VARCHAR(50)

3. Table Name: `Call`
   Columns:
      * **CallID** INT, this will be the primary key of the table
      * **AgentID** INT
      * **CustomerID** INT
      * **PickedUp** SMALLINT
      * **Duration** INT
      * **ProductSold** SMALLINT

**Answer.** One possible solution is given below:

```SQL
-- Each statement must end with a semicolon when pasted into psql.
CREATE TABLE Customer(
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(50),
    Occupation VARCHAR(50),
    Email VARCHAR(50),
    Company VARCHAR(50),
    PhoneNumber VARCHAR(20),
    Age INT
);

CREATE TABLE Agent(
    AgentID INT PRIMARY KEY,
    Name VARCHAR(50)
);

CREATE TABLE Call(
    CallID INT PRIMARY KEY,
    AgentID INT,
    CustomerID INT,
    PickedUp SMALLINT,
    Duration INT,
    ProductSold SMALLINT
);
```
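Declaring a column as the primary key is more than documentation: the DBMS refuses a second row with the same key. The sketch below illustrates this on the `Agent` table, again using Python's `sqlite3` module as a local stand-in for PostgreSQL; the agent names are hypothetical and the exact error message differs between the two systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Agent(AgentID INT PRIMARY KEY, Name VARCHAR(50))")
conn.execute("INSERT INTO Agent (AgentID, Name) VALUES (1, 'Agent A')")

try:
    # Same AgentID as the existing row, so the insert should be rejected.
    conn.execute("INSERT INTO Agent (AgentID, Name) VALUES (1, 'Agent B')")
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print("Rejected:", err)

agent_count = conn.execute("SELECT COUNT(*) FROM Agent").fetchone()[0]
print(agent_count)  # 1
```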

### Pushing sample data into RDS (10 min)

Let's now push our data onto RDS. Run the command below, again substituting [endpoint] with the actual endpoint you used above. Make sure that the `Customer.csv` file is located in the same directory that you run the `psql` command from:

```bash
psql -h [endpoint] -U ds4a_demo_user -d ds4a_demo_db -c "\copy Customer from 'Customer.csv' with (format csv, header true, delimiter ',');"
```

The first part of the command is the same one we used before to open a SQL shell. Here we also pass the `-c` flag, which tells `psql` to run a single command and then exit. The command we run, `\copy`, is a `psql` meta-command: it reads the file on the client side (your machine) and streams its contents to the server. The server-side SQL `COPY` statement would instead look for the file on the database server's own file system, which we cannot access on RDS, so using `\copy` means we won't have problems with permissions. In the `\copy` command, we specify which table we want to populate (`Customer`), where the local file is (`Customer.csv`), that our file is in CSV format, that it has a header, and that we are using a comma as a delimiter.

This should prompt you for the password (again, use the one that you created for the `ds4a_demo_user`). It will then let you know how many rows it has successfully imported, similar to the image below:

![Copy successful](images/copy-successful.png)
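Conceptually, `\copy` reads the file on the client, skips the header row, and inserts one table row per CSV line. The sketch below mimics that with Python's `csv` and `sqlite3` modules. It is not how `psql` is implemented, the CSV contents are hypothetical, and the column layout is assumed to match the `Agent` table from Exercise 2.

```python
import csv
import io
import sqlite3

# Hypothetical contents standing in for Agent.csv (header row plus data).
agent_csv = io.StringIO("AgentID,Name\n1,Agent A\n2,Agent B\n3,Agent C\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Agent(AgentID INT PRIMARY KEY, Name VARCHAR(50))")

reader = csv.reader(agent_csv, delimiter=",")
next(reader)  # skip the header row, like "header true" in \copy
rows = [(int(agent_id), name) for agent_id, name in reader]

conn.executemany("INSERT INTO Agent (AgentID, Name) VALUES (?, ?)", rows)
conn.commit()

imported = conn.execute("SELECT COUNT(*) FROM Agent").fetchone()[0]
print(f"COPY {imported}")  # COPY 3
```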

### Exercise 3: (5 min)

Repeat the above steps for the `Agent` and `Call` tables (the `Customer` table was already loaded above).

**Answer.** One possible solution is given below:

```bash
psql -h [endpoint] -U ds4a_demo_user -d ds4a_demo_db -c "\copy Agent from 'Agent.csv' with (format csv, header true, delimiter ',');"

psql -h [endpoint] -U ds4a_demo_user -d ds4a_demo_db -c "\copy Call from 'Call.csv' with (format csv, header true, delimiter ',');"

# Only needed if you have not loaded Customer yet; running it a second time
# fails because the CustomerID primary keys already exist.
psql -h [endpoint] -U ds4a_demo_user -d ds4a_demo_db -c "\copy Customer from 'Customer.csv' with (format csv, header true, delimiter ',');"
```

If the CSV files are not in the directory you run `psql` from, replace each file name with the path to wherever you are storing the file locally on your computer.

## Using our data tables to answer business questions (30 min)

Now that we have set up the exact tables we want, we can reap the rewards of our labor and write queries to extract information and answer relevant business questions.

### Exercise 4: (15 min)

Two metrics of sales agent performance that your firm is interested in are: 1) for each agent, how many seconds on average does it take them to sell a product when successful; and 2) for each agent, how many seconds on average do they stay on the phone before giving up when unsuccessful. Write a query which computes this.

**Answer.** One possible solution is given below:

```SQL
-- AVG ignores NULLs, so each CASE keeps only the calls of interest.
-- Compared with dividing one SUM by another, this avoids integer division
-- and division by zero: an agent with no calls in a group gets NULL.
SELECT a.Name,
       AVG(CASE WHEN c.ProductSold = 0 THEN c.Duration END) AS avgWhenNotSold,
       AVG(CASE WHEN c.ProductSold = 1 THEN c.Duration END) AS avgWhenSold
FROM Call c
JOIN Agent a ON c.AgentID = a.AgentID
GROUP BY a.Name
ORDER BY 1;
```
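A useful habit is to check a conditional aggregation like this on a handful of rows whose averages you can compute by hand. The sketch below does so with Python's `sqlite3` module as a local stand-in for PostgreSQL. All rows are hypothetical, and the query uses `AVG` over a `CASE` expression, which averages only the rows where the condition holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Agent(AgentID INT PRIMARY KEY, Name VARCHAR(50))")
conn.execute(
    "CREATE TABLE Call(CallID INT PRIMARY KEY, AgentID INT, CustomerID INT, "
    "PickedUp SMALLINT, Duration INT, ProductSold SMALLINT)"
)
conn.executemany("INSERT INTO Agent VALUES (?, ?)", [(1, "Agent A"), (2, "Agent B")])
# Hypothetical calls: (CallID, AgentID, CustomerID, PickedUp, Duration, ProductSold)
conn.executemany(
    "INSERT INTO Call VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 100, 1),
        (2, 1, 2, 1, 200, 1),
        (3, 1, 3, 1, 50, 0),
        (4, 1, 4, 1, 150, 0),
        (5, 2, 1, 1, 80, 1),
        (6, 2, 2, 1, 40, 0),
    ],
)

averages = conn.execute(
    """
    SELECT a.Name,
           AVG(CASE WHEN c.ProductSold = 0 THEN c.Duration END) AS avgWhenNotSold,
           AVG(CASE WHEN c.ProductSold = 1 THEN c.Duration END) AS avgWhenSold
    FROM Call c
    JOIN Agent a ON c.AgentID = a.AgentID
    GROUP BY a.Name
    ORDER BY a.Name
    """
).fetchall()
print(averages)  # [('Agent A', 100.0, 150.0), ('Agent B', 40.0, 80.0)]
```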

### Exercise 5: (15 min)

In order to incentivize its sales agents, the firm is offering a bonus for the agents who manage to close a sale the fastest. Write a query which gives, for each agent, the duration of that agent's quickest sale and the customer name it was sold to. If there are ties (i.e. for the same agent, two sales with the same duration), pick the one with the highest `CustomerID` value to be part of your query results.

**Answer.** One possible solution is given below:

```SQL
SELECT a.Name AS AgentName, cu.Name AS CustomerName, x.Duration
FROM
(
   -- Among each agent's fastest sales, keep the highest CustomerID (tie-break)
   SELECT ca.AgentID, ca.Duration, MAX(ca.CustomerID) AS cid
   FROM
   (
       -- Duration of each agent's quickest successful sale
       SELECT AgentID, MIN(Duration) AS fastestcall
       FROM Call
       WHERE ProductSold = 1
       GROUP BY AgentID
   ) fastest
   JOIN Call ca ON ca.AgentID = fastest.AgentID AND ca.Duration = fastest.fastestcall
   WHERE ca.ProductSold = 1
   GROUP BY ca.AgentID, ca.Duration
) x
JOIN Agent a ON x.AgentID = a.AgentID
JOIN Customer cu ON cu.CustomerID = x.cid;
```
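The tie-breaking rule is the easiest part of this exercise to get wrong, so it is worth testing on data that contains a tie. The sketch below runs the same nested-query approach on hypothetical rows using Python's `sqlite3` module as a local stand-in for PostgreSQL. The `Customer` table is trimmed to two columns, and the inner subquery alias `fastest` is just an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Agent(AgentID INT PRIMARY KEY, Name VARCHAR(50))")
conn.execute("CREATE TABLE Customer(CustomerID INT PRIMARY KEY, Name VARCHAR(50))")
conn.execute(
    "CREATE TABLE Call(CallID INT PRIMARY KEY, AgentID INT, CustomerID INT, "
    "PickedUp SMALLINT, Duration INT, ProductSold SMALLINT)"
)
conn.executemany("INSERT INTO Agent VALUES (?, ?)", [(1, "Agent A"), (2, "Agent B")])
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [(1, "Customer 1"), (2, "Customer 2"), (3, "Customer 3")],
)
# Agent A has two 30-second sales (customers 1 and 2): a tie to be broken.
conn.executemany(
    "INSERT INTO Call VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 30, 1),
        (2, 1, 2, 1, 30, 1),
        (3, 1, 3, 1, 50, 1),
        (4, 2, 3, 1, 40, 1),
        (5, 2, 1, 1, 10, 0),  # shorter, but no sale, so it must be ignored
    ],
)

fastest_sales = conn.execute(
    """
    SELECT a.Name AS AgentName, cu.Name AS CustomerName, x.Duration
    FROM (
        SELECT ca.AgentID, ca.Duration, MAX(ca.CustomerID) AS cid
        FROM (
            SELECT AgentID, MIN(Duration) AS fastestcall
            FROM Call
            WHERE ProductSold = 1
            GROUP BY AgentID
        ) fastest
        JOIN Call ca
          ON ca.AgentID = fastest.AgentID AND ca.Duration = fastest.fastestcall
        WHERE ca.ProductSold = 1
        GROUP BY ca.AgentID, ca.Duration
    ) x
    JOIN Agent a ON x.AgentID = a.AgentID
    JOIN Customer cu ON cu.CustomerID = x.cid
    ORDER BY a.AgentID
    """
).fetchall()
print(fastest_sales)
# [('Agent A', 'Customer 2', 30), ('Agent B', 'Customer 3', 40)]
```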

## A word about DBMS properties (5 min)

A DBMS must exhibit certain properties in order to guarantee that the data inside it is reliable. The **ACID properties** refer to four fundamental transactional properties of DBMS, and stand for “Atomicity, Consistency, Isolation, and Durability”. Relational DBMS such as PostgreSQL are expected to exhibit all four; if a tool claims to be a transactional DBMS and does not exhibit all of these properties, then it cannot be relied on as one.

Here is a visual explanation of the ACID properties:

![Acid Properties](./images/acid.png)

The properties can be quickly explained as follows:

1. **Atomicity**: Means “all or nothing”. Let's take the bank transfer example again; remember that transferring money from Account A to Account B is two separate operations. If the bank system fails right after the money leaves Account A but before it enters Account B, then that's not "all or nothing" (and it's bad). Atomicity guarantees that a unit of work is fully executed or not executed at all.

2. **Consistency**: Means that the DBMS prevents database corruption by guaranteeing that the database is always valid according to the rules on which it was defined (essentially, no "rogue" databases not following instructions).

3. **Isolation**: Ensures that concurrent execution of transactions leaves the database in the same state as would have been obtained if the transactions had been executed sequentially. This is especially important in distributed database systems.

4. **Durability**: This property guarantees that once a transaction is committed to the database, its changes are permanent, even if something bad (like a power failure) happens immediately afterwards. Basically, this means that you should never receive an “OK” message for a transaction whose changes exist only in the memory buffer and have not yet been safely written to the disk.
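Atomicity is the easiest of these properties to observe directly. The sketch below simulates the bank transfer with Python's `sqlite3` module (a local stand-in, not the RDS instance): the `with conn:` block commits if it finishes and rolls back if an exception escapes. The `Account` table, the balances, and the `transfer` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account(Name TEXT PRIMARY KEY, Balance INT)")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

def transfer(amount, fail_midway=False):
    """Move money from A to B as one unit of work (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE Name = 'A'", (amount,)
            )
            if fail_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE Name = 'B'", (amount,)
            )
    except RuntimeError:
        pass  # the partial withdrawal has already been rolled back

def balances():
    return dict(conn.execute("SELECT Name, Balance FROM Account").fetchall())

transfer(30, fail_midway=True)
balances_after_failure = balances()
print(balances_after_failure)  # {'A': 100, 'B': 0}

transfer(30)
balances_after_success = balances()
print(balances_after_success)  # {'A': 70, 'B': 30}
```

After the simulated crash neither account has changed, so the money never "leaves A without arriving at B".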

NoSQL databases, on the other hand, support a different set of properties called BASE, which stands for “Basically Available, Soft state, Eventually consistent”. We will not get into this here, except to mention that "Eventually consistent" means that consumers may temporarily not see the latest version of the data, although all copies converge to it eventually; this may or may not be a problem depending on the context. If you would like to learn more, feel free to look up the CAP Theorem for distributed database systems.

## Database design (40 min)

The business is happy with what you've done so far, and would like to incorporate more elements of its end-to-end sales process into the database you've created. However, this added complexity is likely to serve up some database design challenges. What are some important principles we should keep in mind as we help the business build this out?

Let's revisit the phones and cities example from earlier, where you wanted to record the city where each one of your contacts lives. This seems straightforward enough - just add a new column to your `phones` table called `city` and record the name of the correct city in each cell.

### Exercise 6: (5 min)

What sorts of problems and/or inefficiencies do you think this method might cause?

**Answer.** There are several, but some examples include:

1. It wastes a lot of space. For example, suppose that you have 1000 friends in this database who all live in "San Francisco". The database would have to store the same exact piece of information 1000 times, with each instance consuming 13 characters of space.
2. If you decide to change "San Francisco" to "San Francisco, CA", the database would have to change it in 1000 different locations.
3. If you accidentally write "san francisco" for one of the cells, then even though you know that it is the same as "San Francisco", the database treats them differently. A similar situation happens with typos.
4. If you write a query with a `WHERE` clause saying `WHERE city = 'San Francisco'`, you may miss rows if (2) was only partially applied or if (3) is true.

To fix these inefficiencies and problems, we use a concept called **normal forms**. Designing your database based on normal forms rules will make it much easier to scale and maintain. There are a total of 6 normal forms, plus some intermediaries and alternatives in between. You don't need to know these now, except First Normal Form (1NF), which we will discuss below. We suggest you spend some time on Google reading about these forms, but you’ll only really understand them with practice and experience, so there is no point memorizing them now. The process of representing a database in terms of relations in normal forms is known as **database normalization**.

1NF says that:

*“A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain.”*

It's really unclear if that's even English, so we'll break it down:

1. Individual tables should not contain repeated information (i.e. non-ID info)
2. If there was originally repeated information, create a separate table to group that info
3. Identify each set of repeated information with a primary key

For instance, if we had information about vendors and the banks they were using, this process would look like the below:

![Normal forms](./images/normal_forms.gif)

Going back to our phones and cities example, instead of writing "San Francisco" in each row of the `phones` table, we can create a `cities` table, which has an ID and name for each city. We can then store the city ID instead of the name in the `phones` table, using it as a foreign key:

<img src="images/m1_2.jpg" width="600">

Notice that John and Rita live in New York, while Paul lives in Boston, but the `phones` table only has the ID of the city each of them live in, rather than information about the city itself. Also note that we added an ID to the `phones` table to uniquely identify each friend (which is also consistent with the 1NF definition).

This reduces the necessary storage space (e.g. only a number will be stored in each row of `phones` instead of the entire city name), and also makes updates to the data much easier (e.g. only one row in the `cities` table has to be updated if you want to change "San Francisco" to "San Francisco, CA").
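Here is a minimal sketch of that normalized design, using Python's `sqlite3` module as a local stand-in for PostgreSQL. The column names (`id`, `name`, `phone`, `city_id`) and the phone numbers are assumptions made for illustration; only the friends and their cities come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities(id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE phones(id INTEGER PRIMARY KEY, name TEXT, phone TEXT, "
    "city_id INTEGER REFERENCES cities(id))"
)
conn.executemany("INSERT INTO cities VALUES (?, ?)", [(1, "New York"), (2, "Boston")])
conn.executemany(
    "INSERT INTO phones VALUES (?, ?, ?, ?)",
    [
        (1, "John", "555-0101", 1),
        (2, "Rita", "555-0102", 1),
        (3, "Paul", "555-0103", 2),
    ],
)

# Renaming a city touches exactly one row, however many friends live there.
cursor = conn.execute("UPDATE cities SET name = 'New York, NY' WHERE id = 1")
rows_updated = cursor.rowcount

# A JOIN on the foreign key recovers each friend's city name.
friends = conn.execute(
    """
    SELECT p.name, c.name
    FROM phones p
    JOIN cities c ON c.id = p.city_id
    ORDER BY p.id
    """
).fetchall()
print(rows_updated)  # 1
print(friends)
# [('John', 'New York, NY'), ('Rita', 'New York, NY'), ('Paul', 'Boston')]
```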

### Question: (5 min)

What are your thoughts on the name of the `phones` table? Is it really a table used to store phone numbers? What if a friend has more than one phone number? How would you proceed? Discuss this with people around you.

### Exercise 7: (7 min)

In order to accommodate the increased complexity of its end-to-end product sales efforts, your firm would like you to set up new tables and/or modify existing tables in your newly-created database. The additional features are as follows:

1. Each product has a name, and customers can buy multiple products
2. Some products are upgrades to others; i.e. you must purchase the baseline version before you can purchase upgrades. For the sake of simplicity, assume each product can be upgraded from at most one other product
3. You need to keep track of customers' purchases of the various products

How would you set up new tables and/or modify existing tables to accommodate these requirements? Write SQL statements that will accomplish this task. You will need the following additional DDL statements:

The `ALTER TABLE` statement is used to add, delete, or modify columns in an existing table or to add and drop constraints on an existing table.

1. Adding a column:

```SQL
ALTER TABLE table_name
ADD COLUMN column_name datatype;
```

2. Dropping a column:

```SQL
ALTER TABLE table_name
DROP COLUMN column_name;
```

**Answer.** An ideal design would be as follows:

1. A `Product` table:
    - `ProductID` as a unique primary key (reasonable to assume it is an integer)
    - `ProductName` as the name of the product (string, reasonable to assume VARCHAR(50))
    - `UpgradedFromProductID` as the ID of the product from which this is an upgrade


2. Modify the `Call` table:
    - Add `ProductID` as a foreign key
    
An appropriate set of SQL queries for this is as follows:

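```SQL
CREATE TABLE Product(
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(50),
    -- NULL for baseline products; otherwise the product this one upgrades
    UpgradedFromProductID INT REFERENCES Product(ProductID)
);

-- REFERENCES makes ProductID a foreign key into the Product table
ALTER TABLE Call
ADD COLUMN ProductID INT REFERENCES Product(ProductID);
```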
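To check that this design really tracks purchases and upgrades, the sketch below builds it in Python's `sqlite3` module (a local stand-in for PostgreSQL) and lists each sale next to the product it upgrades from. The product names and call rows are hypothetical, and the column is added to `Call` without a foreign key constraint to keep the SQLite version simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product(ProductID INT PRIMARY KEY, ProductName VARCHAR(50), "
    "UpgradedFromProductID INT REFERENCES Product(ProductID))"
)
conn.execute(
    "CREATE TABLE Call(CallID INT PRIMARY KEY, AgentID INT, CustomerID INT, "
    "PickedUp SMALLINT, Duration INT, ProductSold SMALLINT)"
)

# Extend the existing Call table with the product that was sold on the call.
conn.execute("ALTER TABLE Call ADD COLUMN ProductID INT")
call_columns = [row[1] for row in conn.execute("PRAGMA table_info(Call)")]

# Hypothetical products: "Gold Card" is an upgrade of "Basic Card".
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?)",
    [(1, "Basic Card", None), (2, "Gold Card", 1)],
)
# Hypothetical sales to customer 10: the baseline first, then the upgrade.
conn.executemany(
    "INSERT INTO Call VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, 1, 10, 1, 120, 1, 1), (2, 1, 10, 1, 90, 1, 2)],
)

# Self-join Product to show which product each purchase upgrades from.
purchases = conn.execute(
    """
    SELECT c.CustomerID, p.ProductName, base.ProductName
    FROM Call c
    JOIN Product p ON p.ProductID = c.ProductID
    LEFT JOIN Product base ON base.ProductID = p.UpgradedFromProductID
    WHERE c.ProductSold = 1
    ORDER BY c.CallID
    """
).fetchall()
print(purchases)  # [(10, 'Basic Card', None), (10, 'Gold Card', 'Basic Card')]
```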

### Question: (5 min)

Discuss this with your classmates. How would you modify your design in Exercise 7 if:

1. a product could be upgraded from multiple previous products
2. certain products can only be sold at certain periods of time (i.e. product availability is seasonal). You can assume this set of periods is not too large
3. you had to keep track of the times when the products are available

### Exercise 8: (15 min)

The business would like you to extend your database design work into a brand-new data warehouse that will be used for real-time reporting. The data warehouse will be populated by a daily Extract-Transform-Load (ETL) process that your team will also write. The main source of data comes from the current end-to-end sales process. In addition to the tables in your previous design, here is a list of additional tables you think you will need:

1. `Order`: contains information about the sales date, the expected delivery date, the customer that bought the product, the total value of the sale, the delivery address, the cargo company, and the driver that will perform the delivery
2. `OrderItems`: contains information about each item in the order such as: the order item ID, the ID of the order, the product ID, the item quantity, and the product price (note that the same product can be sold for different prices across different orders)
3. `ProductCategory`: contains the description of the category

How would you design your data warehouse to maximize read performance? Work on this with a partner. Note that in order to optimize read performance, it is NOT necessarily best to follow normal form protocol.

**Answer.** One possible solution is given below:

1. Have one `Order` table that contains information about orders and items, where each row is indexed by `OrderID` and `OrderItemID`. Although this does not conform with normal form protocol, a reporting system that performs a lot of reads will almost certainly want to extract information about both orders and items so this avoids a lot of expensive Cartesian products and JOINs of two tables that will be tied at the hip.
2. Add a derived column `ItemTotal` to each row of the `Order` table, which is defined as `ItemQuantity` times `ProductPrice`, as this field will probably be important for reporting, which is part of read performance.
3. Add `ProductCategoryID` as a foreign key to the `Product` table
4. Break out the delivery address field of `Order` into separate components such as `State`, `City`, `Zipcode`, etc. Aggregation based on location is a very common sort of reporting function and having to parse addresses every time is a pain unless the groupings are already well-defined.
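The payoff of this denormalized design is that a typical report needs no `JOIN` at all. The sketch below illustrates this with Python's `sqlite3` module as a local stand-in. Everything in it is an assumption made for illustration: the rows are hypothetical, only a subset of columns is shown, and the table is called `Orders` because `ORDER` is a reserved word in SQL and would otherwise need to be quoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One wide table: order-level and item-level fields live in the same row.
conn.execute(
    """
    CREATE TABLE Orders(
        OrderID INT,
        OrderItemID INT,
        ProductID INT,
        ItemQuantity INT,
        ProductPrice REAL,
        ItemTotal REAL,
        State TEXT,
        City TEXT,
        PRIMARY KEY (OrderID, OrderItemID)
    )
    """
)

# Hypothetical source rows, as the daily ETL job might extract them:
# (OrderID, OrderItemID, ProductID, ItemQuantity, ProductPrice, State, City)
source_rows = [
    (1, 1, 1, 2, 10.0, "CA", "San Francisco"),
    (1, 2, 2, 1, 5.0, "CA", "San Francisco"),
    (2, 1, 1, 3, 10.0, "NY", "New York"),
]

# "Transform" step: precompute ItemTotal once, at load time.
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (oid, item, pid, qty, price, qty * price, state, city)
        for oid, item, pid, qty, price, state, city in source_rows
    ],
)

# The report reads a single table: no JOINs and no address parsing.
sales_by_state = conn.execute(
    "SELECT State, SUM(ItemTotal) FROM Orders GROUP BY State ORDER BY State"
).fetchall()
print(sales_by_state)  # [('CA', 25.0), ('NY', 30.0)]
```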

## Basic properties of data warehouses (5 min)

We briefly mentioned the term **data warehouse** in the previous exercise, but what is it really? Data warehousing is a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources and provide meaningful reports on the same to the end users:

![Data Warehouse](./images/datawarehouse.jpg)

Unlike databases, data is often de-normalized in a data warehouse. In professional settings, sometimes you may be required to read data from a data warehouse instead of a database. We'll not go much deeper into it right now, but we’ll drop in a few bullet points so that you are familiar with the concept if it comes up in a professional setting:

1. Data warehouses are mainly used for reporting as opposed to day-to-day transactions
    * As such, they are optimized for read rather than write operations
    * If you want to go deeper on this, read about OLTP versus OLAP systems
    
    
2. Data warehouses run on different hardware (servers) than the database does
    * Usually, that hardware is a lot more robust than the hardware the database runs on
    
    
3. Data warehouses are usually updated via Extract-Transform-Load (ETL) "batch" jobs
    * For example, once a day all changes from the database (compared to the previous day) are propagated to the data warehouse
    
    
4. Data warehouses don’t throw away all the normalization that was done at the database level; they just relax it a bit to reduce the number of `JOIN`s necessary to run the queries
    * In other words, they sacrifice storage efficiency in favour of performance

## Conclusions (2 min)

In this case, you ventured outside of SQL in a Python environment and used your newfound knowledge to expand the scope of databases within a financial products firm. First, you learned about database management systems. You then set up an initial PostgreSQL database in AWS RDS and made upgrades to it based on more complex business requirements. You also performed queries on your new database to answer relevant business questions.

## Takeaways (5 min)

In this case, we learned the basics of setting up a database management system in the cloud using Amazon ```RDS```. We also built a foundation of basic DDL ```SQL``` commands to build databases and their tables. We finished the case by talking about how databases are structured in real-world applications. Specifically, we:

1. Created an ```RDS``` instance on Amazon AWS
2. Connected to a database using ```psql```
3. Performed ```CREATE TABLE``` queries
4. Learned about the ```ALTER TABLE``` and ```DROP``` queries
5. Discussed database normal forms and data warehouses

Databases are a core technology when building data science applications and putting them in production. More importantly, a bad database design can lead to convoluted structures that make your data extraction queries very slow. While you probably will not be designing the database as a data scientist, it is helpful to know how databases are typically structured so that you may pick up a new database structure quickly.

To further expand on the concepts taught in this case, you should go through the ```RDS``` creation step again and investigate all the configuration options. You should also familiarize yourself with ```RDS``` [pricing](https://aws.amazon.com/rds/pricing/) so that you don't get surprised by large invoices once you actually begin using these in production projects.

## Appendix: Troubleshooting RDS creation

If you cannot create your database using the RDS service and instead see the error below, you will need to create a new VPC instead of using the default one. 

![vpc dns error](images/vpc-rds-error.png)


To do this, scroll back up to the "Connectivity" section, and choose "Create new VPC" from the dropdown as shown in the image below:

![create new vpc](images/create-new-vpc.png)

At the bottom of the page, press "Create Database" again, and you should see a notification briefly at the top of the page that confirms a new VPC has been created, as in the image below. Take a note of the ID.

![view vpc](images/view-vpc-id.png)

You might now see another error, as follows. This is because the VPC created from the RDS console has no name.

![vpc no name error](images/vpc-no-name-error.png)


If this is the case, you need to name your VPC. From the "Services" dropdown at the top of the page, search for "VPC" and open the VPC page in a new tab.


![view VPCs](images/services-select-vpc.png)

Find the VPC that was recently created (it will have the same ID as the one you noted above). Mouse over the "Name" field to see the pencil edit option appear, click on this, and give the VPC a name.


![name VPC](images/name-vpc.png)

Now that your VPC has a name, go back to the tab where you are creating the RDS instance, scroll back up to the "Connectivity" section, and choose the newly created VPC (you will see the name you chose displayed) from the dropdown.

Now you can finally press "Create Database" again (at the bottom of the page) and all should work.