<h1>How can we optimize our sales of financial products?</h1>

<h2>Goals</h2>
<p>By the end of this case, you will be familiar with databases. Specifically, you will learn the differences among the major types of databases and the different database management systems available. Basic SQL queries will also be introduced.</p>
<p>You will also be exposed to the technical jargon of databases. While you probably will not use these terms on a daily basis, they will give you a more holistic understanding of the data engineering discipline and facilitate conversations between yourself and other data engineers.</p>

<h2>Introduction</h2>
<p><strong>Business Context.</strong> You are a data analyst at a large financial services firm that sells a diverse portfolio of products. In order to make these sales, the firm relies on a call center where sales agents make calls to current as well as prospective customers. The company would like you to dive into their data to devise strategies to increase their revenue or reduce their costs. Specifically, they would like to double down on their most reliable customers, and to cut out sales agents who are not producing outcomes.</p>
<p><strong>Business Problem.</strong> The business would like to answer the following questions: <strong>"What types of customers are most likely to buy our product? And which of our sales agents are the most/least productive?"</strong></p>
<p><strong>Analytical Context.</strong> The data is split across 3 tables: "Agents", "Calls", and "Customers", which are stored in CSV files. Unlike in previous cases, though, we will first read these CSV files into a SQLite database created within Python. You will learn how this database differs from CSV files and how to interact with it using SQL to extract useful insights.</p>
<p>The case is sequenced as follows: you will (1) learn the fundamentals of databases and SQL; (2) use SQL <code>SELECT</code> statements to identify potentially interesting customers; and (3) use SQL aggregation functions to compute summary statistics on your agents and identify the most/least productive ones.</p>

<h2>Why databases?</h2>

<p>While we have been dealing with data sitting in CSV files so far, no serious data organization runs their operations off of CSV files on a single person's computer. This practice presents all sorts of hazards, including but not limited to:</p>
<ol>
<li>Destruction of that single device</li>
<li>Destruction of the files on that device</li>
<li>Inability to connect to that person's device from another device that requires the data</li>
<li>Inability to store more than a limited amount of data (since a single device doesn't have that much memory)</li>
</ol>
<p>Therefore, our data should be stored elsewhere if we want to reliably access it in the future and, more importantly, share it and work on it with others. The <strong>database</strong> is the classic location where modern organizations have chosen to store their data for professional use. Databases have been a topic of research since the late 1960s. Many technology vendors picked up on this and developed database software for companies to use. Some of these vendors and products are:</p>
<ol>
<li>Microsoft, initially with Microsoft Access and more recently with Microsoft SQL Server</li>
<li>Oracle, with their Oracle database and MySQL (a popular open source database)</li>
<li>The “PostgreSQL Global Development Group”, with the open-source PostgreSQL</li>
</ol>
<p>These databases all implement the standard SQL language and are thus fairly similar to each other in terms of features. However, there are some key differences. Comparing <code>MySQL</code> and <code>PostgreSQL</code>, two of the most popular open-source database systems, MySQL does not implement <code>FULL JOIN</code> (you will learn about <code>JOIN</code>s later on). PostgreSQL also supports some more advanced aggregation functions for statistics (you will learn about these soon). For example, in PostgreSQL you can fit a linear regression directly on the data before retrieving it, whereas MySQL only supports basic statistical operations. This additional functionality can come with some overhead, and MySQL is often regarded as faster for simple retrieval tasks.</p>
<p>For the purposes of this case study, we will be using SQLite, a lightweight relational database that runs inside the Python process and requires no separate server.</p>

<h2>Types of databases</h2>
<p>At this point, you might believe that databases can be thought of as a collection of data. This is true, but unfortunately it is not that simple. Data cannot simply be thrown in a database the same way you throw your socks in your sock drawer. Depending on your needs for the data, you will choose between two main types of databases.</p>

<h3>Relational databases</h3>
<p>The most common database type is called a <strong>relational database</strong>, and the systems that manage these kinds of databases are called <strong>Relational Database Management Systems (RDBMS)</strong>. Relational databases date back to the early 1970s and are among the oldest database designs still in widespread use. Continuing with our sock example, a relational database is like a drawer with many identical slots, each for one pair of socks. The socks may be of different materials, colors, brands, etc., but they need to fit into a slot.</p>
<p>Relational databases deal with “relational data”, which is a fancy way of saying “tabular” data. This kind of dataset consists of rows and columns (i.e. tables) where each row corresponds to an observation and each column corresponds to an attribute of that observation. So, for example, if we go back to the example where we were keeping track of our friends and their phones, each row on the file (or table) represents one friend and each column represents the information we want to track about that friend (name and phone number). The cell on the intersection of the row and column contains the actual data. Relational data is manipulated using a specific language called <strong>SQL (Structured Query Language)</strong>, which we will learn about soon.</p>
<p>A simple way to conceptualize a table inside a relational database is as a CSV file “copied” to the database. In fact, many databases offer that possibility (assuming your file is correctly formatted, of course).</p>
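<p>To make the "CSV copied to the database" idea concrete, here is a minimal sketch using Python's built-in <code>sqlite3</code> and <code>csv</code> modules. The CSV content and the <code>friend</code> table are hypothetical and exist only for illustration; the case itself loads its real files with pandas later on.</p>

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a small file on disk.
csv_text = """Name,PhoneNumber
Ana,555-0101
Ben,555-0102
Chloe,555-0103
"""

# Parse the CSV: the first row is the header, the rest are observations.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

# Create an in-memory SQLite database and a table with matching columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friend (Name TEXT, PhoneNumber TEXT)")

# Each CSV row becomes one table row; each CSV column becomes a table column.
conn.executemany("INSERT INTO friend (Name, PhoneNumber) VALUES (?, ?)", records)

friend_rows = conn.execute("SELECT Name, PhoneNumber FROM friend").fetchall()
print(header)
print(friend_rows)
```

<p>Unlike the CSV file, the table has a declared set of columns, so every row stored in it must fit that structure.</p>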

<h3>NoSQL databases</h3>
<p>Around 20 years ago, with the advent of the Internet and the necessity to store and process unstructured data (i.e. data that does not fit well in the row-by-column paradigm), developers started to discuss another type of database, which eventually ended up being referred to as a <strong>NoSQL database</strong>. These databases are not relational and are also built with more “relaxed” rules compared to their predecessors. NoSQL databases are more like a big drawer without slots and not exclusively for socks. You may choose to use this drawer primarily for socks of all sizes - small ones, big ones, maybe even a loose sock by itself - but it could also contain other items like sweaters or pants.</p>
<p>As the name implies, NoSQL databases do not rely on SQL, although many of them do allow you to use SQL to interface with them. At its simplest, a NoSQL database can be thought of as a key-value store. That is, everything you store (sometimes called a <strong>document</strong>) in this database has a key associated with it. The database's job is simply to help you retrieve your desired document as quickly as possible. Nothing pre-determines what a document contains (i.e. it does not have a concept of "tables"); however, this flexibility comes at a price. When you retrieve a document, you have to perform extra checks on it to ensure its validity, as the database will not automatically do this for you as it would with a relational database. This may or may not be desirable depending on your particular application.</p>
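<p>The trade-off described above can be sketched with a plain Python dictionary standing in for a key-value store. This is only an illustration of the idea, not how any particular NoSQL product works; the keys, documents, and the <code>get_customer_age</code> helper are all hypothetical.</p>

```python
# A plain dict stands in for a key-value (document) store.
store = {}

# Documents stored under different keys need not share a structure.
store["customer:1"] = {"name": "Ana", "age": 34}
store["customer:2"] = {"name": "Ben"}        # no "age" field
store["note:7"] = "call back on Monday"      # not even a dict


def get_customer_age(key):
    """Fetch a document by key and validate it before using it."""
    doc = store.get(key)
    # The store does not enforce a schema, so the caller must check the shape.
    if not isinstance(doc, dict) or not isinstance(doc.get("age"), int):
        return None
    return doc["age"]


ages = {key: get_customer_age(key) for key in store}
print(ages)
```

<p>Retrieval by key is trivial, but every piece of code that reads a document has to carry its own validity checks.</p>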

<h3>When to pick one over the other</h3>
<p>Picking between an RDBMS and a NoSQL database really comes down to the requirements of your project. The two kinds of systems prioritize different parts of the <a href="https://en.wikipedia.org/wiki/CAP_theorem"><strong>CAP Theorem</strong></a>. Simply put, the CAP Theorem says that a distributed database system cannot guarantee all three of the following at the same time:</p>
<ul>
<li><strong>Consistency</strong>: Every read of the database will return the most up-to-date write version or an error</li>
<li><strong>Availability</strong>: Every request receives a (non-error) response, without the guarantee that it contains the most up-to-date write version</li>
<li><strong>Partition Tolerance</strong>: The system continues to operate no matter the network quality between nodes</li>
</ul>
<p>NoSQL systems typically favor partition tolerance over consistency, whereas RDBMSs typically favor consistency. In certain applications, consistency is imperative, which often forces you into using an RDBMS. For example, if you are a bank and you query a customer's balance, you want to guarantee that the number you get is the most recent one and not the one from yesterday.</p>
<p>For the remainder of the case, we shall only consider relational databases.</p>

<h2>What is this "SQL" thing?</h2>

<p>So we've been dropping in references to SQL throughout, yet we haven't explained what it is. Now we will! Just like data can't really survive without a database, a database can't be utilized without SQL. SQL is used for a wide variety of tasks, including but not limited to extracting data, creating the internal structure of a database (in the form of tables), and reading and writing data to these tables. SQL is an international standard published by the <a href="https://www.iso.org/standard/63555.html">ISO</a> and so it is the de facto language that relational database systems adhere to.</p>
<p>In this case, we will be writing SQL queries using the <a href="https://www.sqlalchemy.org/"><code>SQLAlchemy</code></a> package in Python. This allows you to directly interface with relational databases without exiting the Python environment, while using syntax that is identical to what you would write outside of Python. Run the code below to set up this framework:</p>

In [39]:
import pandas as pd
from sqlalchemy import create_engine, text

# Maximum number of rows to display
pd.options.display.max_rows = 10

# Create an in-memory SQLite database
engine = create_engine('sqlite://')

# Load each CSV file into a table of the same name
pd.read_csv('customer.csv').to_sql('customer', engine, if_exists='replace', index=False)
pd.read_csv('agent.csv').to_sql('agent', engine, if_exists='replace', index=False)
pd.read_csv('call.csv').to_sql('call', engine, if_exists='replace', index=False)

def runQuery(sql):
    """Run a SQL query and return the results as a pandas DataFrame."""
    with engine.connect() as connection:
        result = connection.execute(text(sql))
        return pd.DataFrame(result.fetchall(), columns=result.keys())

In [10]:
query = """SELECT COUNT(*) 
FROM Customer"""
runQuery(query)

Unnamed: 0,COUNT(*)
0,1000


<p>The columns in each of the tables are as follows:</p>
<p><strong>agent.csv</strong>:</p>
<ul>
<li><strong>AgentID</strong>: the primary key of the table (more on this below)</li>
<li><strong>Name</strong>: the name of the agent</li>
</ul>
<p><strong>call.csv</strong>:</p>
<ul>
<li><strong>CallID</strong>: the primary key of the table</li>
<li><strong>AgentID</strong>: a foreign key (more on this below) to the agents table of the agent who made the call</li>
<li><strong>CustomerID</strong>: a foreign key to the customers table of the customer who is being called</li>
<li><strong>PickedUp</strong>: a Boolean that is 1 if the customer picked up and 0 if they did not</li>
<li><strong>Duration</strong>: the duration of the call, as an integer</li>
<li><strong>ProductSold</strong>: a Boolean that is 1 if the agent made a sale and 0 if they did not</li>
</ul>
<p><strong>customer.csv</strong>:</p>
<ul>
<li><strong>CustomerID</strong>: the primary key of the table</li>
<li><strong>Name</strong>: the name of the customer</li>
<li><strong>Occupation</strong>: the occupation of the customer ('Unemployed' if no occupation)</li>
<li><strong>Email</strong>: the email of the customer</li>
<li><strong>Company</strong>: the company that the customer works for</li>
<li><strong>PhoneNumber</strong>: the phone number of the customer</li>
<li><strong>Age</strong>: the age of the customer</li>
</ul>
<p>The above database structure can be visualized as below. This is called an <strong>Entity Relationship (ER) Diagram</strong>, denoting the tables present in the database, the columns in the tables, and the relations among the tables:</p>
<p><img alt="ER Diagram" src="images/database_schema.png" /></p>
<p>The above diagram gives a good overview of how the schema is structured and how the data is interconnected.</p>

<h2>Finding potentially interesting customer cohorts</h2>
<p>The most important thing you will ever do in SQL is extract a subset of the data from a SQL table based on a set of rules. This is accomplished using the <strong><code>SELECT</code></strong> statement and the following syntax:</p>
<p><img alt="Select Anatomy" src="./images/select_anatomy.png" /></p>
<p>To translate the above diagram into words:</p>
<ol>
<li>Start with the keyword <code>SELECT</code></li>
<li>Follow with the names of the columns you want to select, separated by commas (alternatively, you can use the <code>*</code> symbol to indicate you wish to select all columns)</li>
<li>Follow with the keyword <code>FROM</code></li>
<li>Finish with the name of the table you wish to select data from</li>
<li>Optionally, you can use the <code>WHERE</code> clause to only return results which satisfy certain conditions (similar to how code within a Python <code>if</code> block only executes if the associated condition is true)</li>
</ol>
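<p>The steps above can be sketched on a tiny example. To stay self-contained, this sketch uses Python's built-in <code>sqlite3</code> module rather than the <code>runQuery()</code> helper, and the <code>pet</code> table and its contents are made up for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (PetID INTEGER, Name TEXT, Species TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO pet VALUES (?, ?, ?, ?)",
    [(1, "Rex", "dog", 5), (2, "Tom", "cat", 3), (3, "Fido", "dog", 2)],
)

# SELECT <columns> FROM <table> WHERE <condition>
some_columns = conn.execute(
    "SELECT Name, Age FROM pet WHERE Species = 'dog'"
).fetchall()

# The * symbol selects every column; with no WHERE clause, every row is returned.
all_columns = conn.execute("SELECT * FROM pet").fetchall()

print(some_columns)
print(all_columns)
```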
<p>Since the firm wants to dig deeper into its customers, let's start by pulling some of their data out of our files; namely, information about customers who are not unemployed (and therefore are more likely to buy from us).</p>

<h3>Exercise 1:</h3>
<p>Write a query that selects the customer ID and name from the <code>Customer</code> table, only showing results for customers who are not unemployed. Remember to write your query as a multi-line string (enclosed within a pair of triple quotes <code>"""</code>) and pass it to the <code>runQuery()</code> function defined in the framework above to check your work!</p>

In [36]:
query1 = """
SELECT CustomerID, Name AS CustomerName
FROM customer
WHERE Occupation != 'Unemployed'
ORDER BY Name ASC
"""
type(runQuery(query1))

pandas.core.frame.DataFrame

<p>Of course, for names, it's sensible to list them in alphabetical order. SQL allows us to do this rather easily with the <code>ORDER BY</code> clause, which is followed by a comma-separated list of the columns on which you want to order your results (columns that come first take priority in the ordering). Optionally, you can append the keyword <code>ASC</code> or <code>DESC</code> (short for ascending and descending, respectively) after each column to determine the ordering type (e.g. alphabetical or reverse-alphabetical for a string column). The query above also uses the <code>AS</code> keyword to give the <code>Name</code> column the alias <code>CustomerName</code> in the results.</p>
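<p>As a small illustration of ordering by more than one column, the sketch below sorts a hypothetical <code>person</code> table by age from highest to lowest and breaks ties alphabetically by name. It uses the built-in <code>sqlite3</code> module so that it runs on its own.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Cara", 41), ("Abe", 29), ("Bea", 41), ("Dan", 29)],
)

# Age takes priority because it is listed first; Name only breaks ties.
ordered = conn.execute(
    "SELECT Name, Age FROM person ORDER BY Age DESC, Name ASC"
).fetchall()

print(ordered)
# [('Bea', 41), ('Cara', 41), ('Abe', 29), ('Dan', 29)]
```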

-------

<p>This is a great first step; however, while producing the list of customers that are not unemployed, you inevitably spend a lot of time looking at the different professions your customers have and realize how often engineers appear in your database. You know that engineering jobs tend to command higher salaries these days, so you decide to try to extract a list of all the unique types of engineering jobs that are represented in your database. To ensure that you don't get duplicate job titles in your query results, you'll need to write the keyword <code>DISTINCT</code> immediately after <code>SELECT</code> in your query.</p>

<h3>Exercise 2:</h3>
<p>Write a query which produces a list, in alphabetical order, of all the distinct occupations in the <code>Customer</code> table that contain the word "Engineer".</p>
<p><strong>Hint:</strong> The <code>LIKE</code> operator can be used when you want to look for similar values. It is included as part of a <code>WHERE</code> clause. It needs to be complemented with the <code>%</code> symbol, which is a wild card that represents zero, one, or multiple characters. For example, one valid <code>WHERE</code> clause utilizing the <code>LIKE</code> operator is <code>WHERE Name LIKE 'Matt%'</code>, which would return any results where the person's name starts with the word "Matt"; e.g. "Matt" or "Matteo" or "Matthew", etc.</p>
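<p>Here is a minimal sketch of <code>DISTINCT</code> and <code>LIKE</code> working together on a made-up <code>person</code> table (again using the built-in <code>sqlite3</code> module). Note that in SQLite, <code>LIKE</code> ignores case for ASCII letters by default; other database systems may behave differently.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (Name TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?)",
    [("Matt",), ("Matteo",), ("Matthew",), ("Matt",), ("Mark",), ("Patty",)],
)

# 'Matt%' matches names that START with "Matt"; DISTINCT drops the repeated "Matt".
starts_with = conn.execute(
    "SELECT DISTINCT Name FROM person WHERE Name LIKE 'Matt%' ORDER BY Name"
).fetchall()

# '%att%' matches names that CONTAIN "att" anywhere.
contains = conn.execute(
    "SELECT DISTINCT Name FROM person WHERE Name LIKE '%att%' ORDER BY Name"
).fetchall()

print(starts_with)
print(contains)
```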

In [26]:
runQuery("""
SELECT DISTINCT Occupation
FROM Customer
WHERE Occupation LIKE '%engineer%'
ORDER BY Occupation
""")

Unnamed: 0,Occupation
0,Chemical engineer
1,Electrical engineer
2,"Engineer, aeronautical"
3,"Engineer, agricultural"
4,"Engineer, automotive"
...,...
24,"Engineer, production"
25,"Engineer, site"
26,"Engineer, structural"
27,"Engineer, technical sales"


-------

<p>Now, one of your marketing colleagues tells you that people who are 30 or older will have a higher probability of buying your product (presumably because by that point they have more disposable income and savings). You don't want to take your colleague's word for it without evidence, so you decide not to completely ignore people under 30, and instead add that information to the report regarding the person’s age, so that the agent making the subsequent call can decide how they want to use that information. However, due to privacy concerns, you also cannot share the person's exact age.</p>

<h3>Exercise 3:</h3>
<p>Write a query that returns the customer ID, their name, and a column <code>Over30</code> containing "Yes" if the customer is 30 years of age or older and "No" if not.</p>
<p><strong>Hint:</strong> You will need to use the <code>CASE-END</code> clause. The <code>CASE-END</code> clause can be used to evaluate conditional statements and returns a value once a condition is met (similar to an if-then-else clause in Python). If no conditions are true, it returns the value in the ELSE clause (or NULL if there is no ELSE statement). For example:</p>
<div class="codehilite"><pre><span></span><code><span class="k">CASE</span>
    <span class="k">WHEN</span> <span class="n">Name</span> <span class="o">=</span> <span class="s1">&#39;Matt&#39;</span> <span class="k">THEN</span> <span class="s1">&#39;Yes&#39;</span>
    <span class="k">WHEN</span> <span class="n">Name</span> <span class="o">=</span> <span class="s1">&#39;Matteo&#39;</span> <span class="k">THEN</span> <span class="s1">&#39;Maybe&#39;</span>
    <span class="k">ELSE</span> <span class="s1">&#39;No&#39;</span>
<span class="k">END</span>
</code></pre></div>
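<p>To see how this expression behaves inside a full query, the sketch below runs it against a small made-up <code>person</code> table using the built-in <code>sqlite3</code> module. The second query drops the <code>ELSE</code> branch to show that unmatched rows then come back as NULL (<code>None</code> in Python).</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (Name TEXT)")
conn.executemany("INSERT INTO person VALUES (?)", [("Matt",), ("Matteo",), ("Zoe",)])

# Conditions are checked top to bottom; the first one that is true wins.
with_else = conn.execute(
    """
    SELECT Name,
        CASE
            WHEN Name = 'Matt' THEN 'Yes'
            WHEN Name = 'Matteo' THEN 'Maybe'
            ELSE 'No'
        END AS Label
    FROM person
    ORDER BY Name
    """
).fetchall()

# Without an ELSE branch, rows matching no condition get NULL.
without_else = conn.execute(
    """
    SELECT Name,
        CASE WHEN Name = 'Matt' THEN 'Yes' END AS Label
    FROM person
    ORDER BY Name
    """
).fetchall()

print(with_else)
print(without_else)
```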

In [37]:
runQuery("""
SELECT CustomerID, Name,
    CASE
        WHEN Age >= 30 THEN 'Yes'
        WHEN Age < 30 THEN 'No'
        ELSE 'Missing Data'
    END AS Over30
FROM Customer
ORDER BY Name ASC
""")

Unnamed: 0,CustomerID,Name,Over30
0,900,Aaron Gutierrez,Yes
1,461,Aaron Hendrix,No
2,145,Aaron Mcintyre,No
3,622,Aaron Rose,No
4,65,Adam Jimenez,No
...,...,...,...
995,883,Zachary Anderson,No
996,18,Zachary Howe,No
997,421,Zachary Ruiz,Yes
998,986,Zachary Stevenson,No


In [34]:
runQuery("""
SELECT CustomerID, Name, Occupation,
    CASE
        WHEN Age >= 30 THEN 'Yes'
        WHEN Age < 30 THEN 'No'
        ELSE 'Missing Data'
    END AS Over30
FROM Customer
WHERE Occupation LIKE '%engineer%'
ORDER BY Name ASC
""")

Unnamed: 0,CustomerID,Name,Occupation,Over30
0,622,Aaron Rose,"Engineer, production",No
1,985,Alan Mitchell,"Engineer, electrical",Yes
2,432,Alexis Riddle,"Engineer, mining",No
3,568,Alice Lee,"Engineer, civil (consulting)",No
4,918,Alison Vaughan,"Engineer, water",Yes
...,...,...,...,...
356,966,William Garcia,"Engineer, broadcasting (operations)",No
357,973,William Jackson,"Engineer, communications",Yes
358,699,Willie Greene,"Engineer, electronics",Yes
359,952,Yolanda White,Chemical engineer,No


-------

<h2>Investigating customer conversion rates</h2>

<p>In order to validate whether our hypotheses about engineers and age are true (for example, engineers exhibit higher product sales conversion rates, and perhaps engineers over 30 tend to exhibit an even higher conversion rate), we will need to use two tables: <code>Call</code> and <code>Customer</code>. This is because the column <code>ProductSold</code> lies only in the <code>Call</code> table, yet information about customer professions and age lies only in the <code>Customer</code> table.</p>
<p><code>SELECT</code> commands are not restricted to a single table. In fact, theoretically, there is no limit to the number of tables that you can extract data from in a single SQL query. Let's introduce some new concepts that are relevant once we go beyond a single table.</p>
<p><strong>Primary and foreign keys</strong> are very important concepts that need to be understood by any database professional. Primary keys:</p>
<ol>
<li>Uniquely identify a record in the table. Their name usually includes the word "ID"<ul>
<li>For example, <code>CustomerID</code> is the primary key of the <code>Customer</code> table, <code>AgentID</code> is the primary key of the <code>Agent</code> table, and <code>CallID</code> is the primary key of the <code>Call</code> table    </li>
</ul>
</li>
<li>Do not accept null values. And they shouldn't, because they are being used to identify the record</li>
<li>Are limited to one per table</li>
</ol>
<p>On the other hand, foreign keys:</p>
<ol>
<li>Are a field in the table that is the primary key in another table</li>
<li>Can accept null values</li>
<li>Are not limited in number per table<ul>
<li>For example, the <code>Call</code> table has 2 foreign keys: <code>AgentID</code> and <code>CustomerID</code> pointing to the <code>Agent</code> and <code>Customer</code> tables, respectively</li>
</ul>
</li>
</ol>
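<p>The properties listed above can be observed directly in a database that declares its keys. The sketch below is an illustration only: it builds simplified, hypothetical versions of the <code>agent</code> and <code>call</code> tables with the built-in <code>sqlite3</code> module (the tables loaded from CSV earlier in this case do not declare any keys). Note that SQLite only enforces foreign keys once the <code>foreign_keys</code> setting is switched on.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE agent (AgentID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    """
    CREATE TABLE call (
        CallID  INTEGER PRIMARY KEY,
        AgentID INTEGER REFERENCES agent (AgentID)
    )
    """
)

conn.execute("INSERT INTO agent VALUES (1, 'Agent A')")
conn.execute("INSERT INTO call VALUES (100, 1)")     # valid: agent 1 exists
conn.execute("INSERT INTO call VALUES (101, NULL)")  # valid: a foreign key may be NULL


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


# A second agent with AgentID 1 would break primary-key uniqueness.
duplicate_key = try_insert("INSERT INTO agent VALUES (1, 'Agent B')")
# A call pointing at a non-existent agent would break the foreign key.
dangling_ref = try_insert("INSERT INTO call VALUES (102, 99)")

print(duplicate_key)
print(dangling_ref)
```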

<h3>Extracting call data for customers working in engineering professions</h3>

<p>Let's first extract the relevant data so we can perform this analysis. Here, a <strong><code>JOIN</code></strong> clause will come in handy. </p>
<p><code>JOIN</code> clauses are used to combine data from two or more tables in the same query. For example, in the current scenario, we need to get the name of the agent involved in a call. The <code>Call</code> table contains only the <code>AgentID</code> and not the name of the agent. <code>JOIN</code> becomes useful here so we can match up the <code>Call</code> table with the <code>Agent</code> table, which does contain the name information.</p>
<p>Here's a diagram showing how <code>JOIN</code> (specifically, the <strong><code>INNER JOIN</code></strong>, which is the default version and the only one you will need to worry about in this case) works. Notice that only the rows with <code>id</code> of 1 and 4 are extracted because those are the only two <code>id</code>s which show up in both tables:</p>
<p><img alt="Join" src="./images/join.png" /></p>
<p>A <code>JOIN</code> clause consists of two parts:</p>
<ol>
<li>The base <code>JOIN</code> statement, which is of the form <code>[Table 1] JOIN [Table 2]</code>. This performs a Cartesian product on the 2 tables being joined. For example, if we have Table A with 5 rows and Table B with 3 rows, their Cartesian product will return 15 rows (5 x 3)</li>
<li>A <code>JOIN</code> condition, beginning with the <code>ON</code> keyword, which filters the Cartesian product's results</li>
</ol>
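<p>The two parts can be seen separately in a small sketch. The tables <code>table_a</code> (5 rows) and <code>table_b</code> (3 rows) are hypothetical, and the explicit <code>CROSS JOIN</code> keyword is used here to request the unfiltered Cartesian product. Conceptually the <code>ON</code> condition filters that product, although a real database engine is free to compute the result more efficiently.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER)")
conn.execute("CREATE TABLE table_b (id INTEGER)")
conn.executemany("INSERT INTO table_a VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])
conn.executemany("INSERT INTO table_b VALUES (?)", [(1,), (4,), (9,)])

# Part 1: every row of table_a paired with every row of table_b (5 x 3 = 15).
cartesian_count = conn.execute(
    "SELECT COUNT(*) FROM table_a CROSS JOIN table_b"
).fetchone()[0]

# Part 2: the ON condition keeps only the pairs whose ids match.
matched = conn.execute(
    "SELECT table_a.id FROM table_a JOIN table_b ON table_a.id = table_b.id "
    "ORDER BY table_a.id"
).fetchall()

print(cartesian_count)  # 15
print(matched)          # [(1,), (4,)]
```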
<p>Here is an example of a <code>JOIN</code> condition in action, which tells the database to return only the combinations of rows where the agent ID matches in both tables:</p>
<div class="codehilite"><pre><span></span><code><span class="k">SELECT</span> <span class="n">CallID</span><span class="p">,</span> <span class="n">A</span><span class="p">.</span><span class="n">AgentID</span><span class="p">,</span> <span class="n">name</span>
<span class="k">FROM</span> <span class="k">Call</span> <span class="k">C</span>
<span class="k">JOIN</span> <span class="n">Agent</span> <span class="n">A</span> <span class="k">ON</span> <span class="k">C</span><span class="p">.</span><span class="n">AgentID</span> <span class="o">=</span> <span class="n">A</span><span class="p">.</span><span class="n">AgentID</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">Name</span> <span class="k">DESC</span>
</code></pre></div>

In [41]:
query6 = """SELECT CallID, A.AgentID, name
FROM Call C
JOIN Agent A ON C.AgentID = A.AgentID
ORDER BY Name ASC"""
runQuery(query6)

Unnamed: 0,CallID,AgentID,Name
0,0,10,Agent X
1,2,10,Agent X
2,6,10,Agent X
3,15,10,Agent X
4,17,10,Agent X
...,...,...,...
9934,9959,3,Todd Morrow
9935,9970,3,Todd Morrow
9936,9979,3,Todd Morrow
9937,9984,3,Todd Morrow


<p>Note that:</p>
<ol>
<li><code>C</code> and <code>A</code> are aliases for the <code>Call</code> and <code>Agent</code> tables, which save us from typing the full table name every time. The <code>AS</code> keyword is optional for table aliases, and we have omitted it here</li>
<li>We write <code>A.AgentID</code> instead of <code>AgentID</code> in the SELECT statement – this is because the <code>AgentID</code> column exists in both tables, so we have to tell the database which one to get the result from</li>
</ol>
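<p>The second point is easy to verify. In the sketch below, which uses simplified, hypothetical <code>agent</code> and <code>call</code> tables and the built-in <code>sqlite3</code> module, the unqualified <code>AgentID</code> is rejected as ambiguous, while <code>A.AgentID</code> works. The exact error message is specific to SQLite.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent (AgentID INTEGER, Name TEXT)")
conn.execute("CREATE TABLE call (CallID INTEGER, AgentID INTEGER)")
conn.executemany("INSERT INTO agent VALUES (?, ?)", [(1, "Agent A"), (2, "Agent B")])
conn.executemany("INSERT INTO call VALUES (?, ?)", [(100, 1), (101, 2), (102, 1)])

# AgentID exists in both tables, so the unqualified name cannot be resolved.
ambiguous_error = None
try:
    conn.execute(
        "SELECT CallID, AgentID, Name "
        "FROM call C JOIN agent A ON C.AgentID = A.AgentID"
    )
except sqlite3.OperationalError as err:
    ambiguous_error = str(err)

# Prefixing the column with a table alias removes the ambiguity.
qualified = conn.execute(
    "SELECT CallID, A.AgentID, Name "
    "FROM call C JOIN agent A ON C.AgentID = A.AgentID "
    "ORDER BY CallID"
).fetchall()

print(ambiguous_error)
print(qualified)
```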

<h3>Exercise 4:</h3>
<p>Write a query which returns all calls made out to customers in the engineering profession, and shows whether they are over or under 30 as well as whether they ended up purchasing the product from that call.</p>

In [54]:
runQuery("""
SELECT ca.CallID, cu.Name,
    CASE
        WHEN Age >= 30 THEN 'Yes'
        WHEN Age < 30 THEN 'No'
        ELSE 'Missing Data'
    END AS Over30,
    ca.ProductSold
FROM Call ca
JOIN Customer cu ON ca.CustomerID = cu.CustomerID
WHERE cu.Occupation LIKE '%Engineer%'
ORDER BY Name DESC
""")

Unnamed: 0,CallID,Name,Over30,ProductSold
0,2049,Zachary Ruiz,Yes,0
1,2960,Zachary Ruiz,Yes,0
2,3365,Zachary Ruiz,Yes,0
3,3386,Zachary Ruiz,Yes,1
4,4332,Zachary Ruiz,Yes,0
...,...,...,...,...
3614,6444,Aaron Rose,No,1
3615,7994,Aaron Rose,No,0
3616,8811,Aaron Rose,No,0
3617,9524,Aaron Rose,No,1


-------

<h2>Analyzing the call conversion data</h2>
<p>Now, we want to determine whether or not customers in our desired cohort exhibit a higher sales conversion rate compared to the overall population of customers. A reasonable way to do this is to count the total number of calls to this cohort which resulted in a sale, and divide that by the total number of calls to this cohort (whether or not they resulted in a sale) to get a percentage, and then compare that with the percentage we compute from the <code>Call</code> table overall.</p>
<p>However, to compute these figures, we'll need to learn a bit about <strong>aggregation functions</strong>. An aggregation function allows you to perform a calculation on a set of values to return a single value, essentially computing some sort of summary statistic.</p>
<p>Aggregation queries usually look like this:</p>
<p><img alt="Aggregation Queries" src="./images/aggregation.png" /></p>
<p>The following are the most commonly used SQL aggregate functions:</p>
<ol>
<li><code>AVG()</code> – calculates the average of a set of values</li>
<li><code>COUNT()</code> – counts rows in a specified table or view</li>
<li><code>MIN()</code> – gets the minimum value in a set of values</li>
<li><code>MAX()</code> – gets the maximum value in a set of values</li>
<li><code>SUM()</code> – calculates the sum of values</li>
</ol>
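<p>As a quick illustration, the sketch below applies all five functions to a made-up, four-row <code>call</code> table using the built-in <code>sqlite3</code> module. Without a <code>GROUP BY</code> clause, the whole table is collapsed into a single summary row.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call (Duration INTEGER, ProductSold INTEGER)")
conn.executemany(
    "INSERT INTO call VALUES (?, ?)",
    [(120, 1), (60, 0), (180, 1), (0, 0)],
)

# One row comes back, holding one summary value per aggregate function.
summary = conn.execute(
    """
    SELECT COUNT(*), MIN(Duration), MAX(Duration), AVG(Duration), SUM(ProductSold)
    FROM call
    """
).fetchone()

print(summary)
# (4, 0, 180, 90.0, 2)
```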
<p>As mentioned before, PostgreSQL has some more advanced <a href="https://www.postgresql.org/docs/9.5/functions-aggregate.html">aggregate functions</a>. Specifically, it has some nice ones for statistics. For example,</p>
<ol>
<li><code>regr_intercept(Y, X)</code> - Returns the intercept for the line of best fit</li>
<li><code>regr_slope(Y, X)</code> - Returns the slope of the line of best fit</li>
<li><code>corr(Y, X)</code> - Returns the correlation between two columns</li>
</ol>

<h3>Exercise 5:</h3>
<p>Write two queries - one that computes the total sales and total calls made to customers in the engineering profession, and one that computes the same metrics for the entire customer base. What can you conclude regarding the conversion rate within the engineering customers vs. the overall customer base?</p>

In [56]:
query8 = """SELECT SUM(ProductSold) AS TotalSales, COUNT(*) AS NCalls
FROM Customer Cu
JOIN Call Ca ON Ca.CustomerID = Cu.CustomerID
WHERE Occupation LIKE '%Engineer%'"""
runQuery(query8)

Unnamed: 0,TotalSales,NCalls
0,760,3619


In [57]:
query9 = """SELECT SUM(ProductSold) AS TotalSales, COUNT(*) AS NCalls
FROM Customer Cu
JOIN Call Ca ON Ca.CustomerID = Cu.CustomerID"""
runQuery(query9)

Unnamed: 0,TotalSales,NCalls
0,2084,9925


<p>The conversion rate for both groups is approximately 21.0%, indicating that engineers are not more likely to purchase our products than the overall population.</p>

-------

<h3>Exercise 6:</h3>
<p>Write a query that computes the total sales and total calls made to customers over the age of 30. Is there a notable difference between the conversion ratio here and that of the overall customer base?</p>

In [61]:
runQuery("""
SELECT SUM(ProductSold) AS TotalSales, COUNT(*) AS NCalls
FROM Customer Cu
JOIN Call Ca ON Ca.CustomerID = Cu.CustomerID
WHERE Age >= 30 
""")

Unnamed: 0,TotalSales,NCalls
0,659,3096


-------

<h3>Exercise 7:</h3>
<p>How about if you look at the sales conversion rate for engineers over the age of 30?</p>

In [62]:
runQuery("""
SELECT SUM(ProductSold) AS TotalSales, COUNT(*) AS NCalls
FROM Customer Cu
JOIN Call Ca ON Ca.CustomerID = Cu.CustomerID
WHERE Age >= 30
AND Occupation LIKE '%Engineer%'
""")

Unnamed: 0,TotalSales,NCalls
0,376,1816
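<p>The counts returned in Exercises 5 through 7 can be turned into conversion rates with a few lines of arithmetic. The sketch below simply copies the totals from the query outputs above and divides sales by calls for each cohort; the cohort labels are ours.</p>

```python
# (total sales, total calls) as reported in the query outputs for Exercises 5 to 7.
cohorts = {
    "Engineers": (760, 3619),
    "All customers": (2084, 9925),
    "Age 30 or older": (659, 3096),
    "Engineers aged 30 or older": (376, 1816),
}

# Conversion rate = calls that resulted in a sale / all calls to the cohort.
rates = {name: sales / calls for name, (sales, calls) in cohorts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

<p>All four rates land between roughly 20.7% and 21.3%, so none of the cohorts stands out on this measure. Whether gaps this small are meaningful would need a proper statistical test, which is beyond what these totals alone can show.</p>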


-------

<h2>Evaluating our agents' performance</h2>

<p>Recall the second part of our business question: we need to figure out which of our agents are the most and least productive. To do this, it makes sense to determine which metrics could be related to productivity. Looking at the features present, the following seem to be reasonable:</p>
<ol>
<li>The number of calls an agent made</li>
<li>The lengths of calls an agent made</li>
<li>The total number of products an agent sold</li>
</ol>

<h3>Question:</h3>
<p>For any given agent, would extracting this info be a good way of quickly analyzing their productivity? Why or why not?</p>

<p>While the above metrics are useful, some of them are too numerous to be easily analyzed. Specifically, the lengths of the calls an agent made form a dataset as large as the number of calls that agent made. If the agent made many calls, looking at the entire set of call lengths will not be informative. Instead, we ought to compute some summary statistics of this metric; namely, the minimum, maximum, and mean lengths seem reasonable.</p>

<h3>Exercise 8:</h3>
<p>Write a query that returns, for each agent, the agent's name, number of calls, longest and shortest call lengths, average call length, and total number of products sold. Name the columns <code>AgentName</code>, <code>NCalls</code>, <code>Shortest</code>, <code>Longest</code>, <code>AvgDuration</code>, and <code>TotalSales</code>, and order the table by <code>AgentName</code> alphabetically. Make sure to include the <code>WHERE PickedUp = 1</code> clause so that the statistics are calculated only across calls that were picked up; otherwise, all the minimum durations will be 0.</p>

In [66]:
query12 = """
SELECT Name AS AgentName, COUNT(*) AS NCalls,
    MIN(Duration) AS Shortest, MAX(Duration) AS Longest,
    AVG(Duration) AS AvgDuration, SUM(ProductSold) AS TotalSales
FROM Call C
    JOIN Agent A ON C.AgentID = A.AgentID
WHERE PickedUp = 1
GROUP BY Name
ORDER BY Name
"""
runQuery(query12)

Unnamed: 0,AgentName,NCalls,Shortest,Longest,AvgDuration,TotalSales
0,Agent X,640,22,334,180.975000,194
1,Angel Briggs,591,12,362,181.081218,157
2,Christopher Moreno,649,47,363,177.979969,189
3,Dana Hardy,554,49,356,177.203971,182
4,Gloria Singh,662,36,349,182.175227,209
...,...,...,...,...,...,...
6,Lisa Cordova,639,46,344,179.214397,201
7,Michele Williams,685,22,306,177.880292,198
8,Paul Nunez,648,-5,323,181.070988,194
9,Randy Moore,600,16,326,178.595000,177
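<p>A natural next step is to compute a conversion rate per agent in the same grouped query. One pitfall worth knowing about, at least in SQLite: dividing one integer by another truncates the result, so <code>SUM(ProductSold) / COUNT(*)</code> comes out as 0 whenever sales are fewer than calls. The sketch below, on a made-up <code>call</code> table that already contains the agent's name, shows the problem and one common workaround (multiplying by <code>1.0</code> to force floating-point division).</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call (AgentName TEXT, PickedUp INTEGER, ProductSold INTEGER)")
conn.executemany(
    "INSERT INTO call VALUES (?, ?, ?)",
    [
        ("Ana", 1, 1), ("Ana", 1, 0), ("Ana", 0, 0),
        ("Ben", 1, 0), ("Ben", 1, 0), ("Ben", 1, 0), ("Ben", 1, 1),
    ],
)

# GROUP BY produces one output row per agent; the aggregates run within each group.
per_agent = conn.execute(
    """
    SELECT AgentName,
        COUNT(*) AS NCalls,
        SUM(ProductSold) AS TotalSales,
        SUM(ProductSold) / COUNT(*) AS TruncatedRate,       -- integer division
        1.0 * SUM(ProductSold) / COUNT(*) AS ConversionRate  -- floating-point division
    FROM call
    WHERE PickedUp = 1
    GROUP BY AgentName
    ORDER BY AgentName
    """
).fetchall()

for row in per_agent:
    print(row)
```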


-------

<h3>Question:</h3>
<p>Throughout this case, we have defined sales conversion rate as the number of products sold divided by the number of calls made. What are the strengths and weaknesses of this choice of definition? Is there a way you can adjust the definition to correct for some of those weaknesses while retaining all the strengths?</p>

<h2>A word about SQL statement types</h2>

<p>In this case, you have used SQL's <a href="https://en.wikipedia.org/wiki/Data_manipulation_language"><strong>Data Manipulation Language (DML)</strong></a> statements; that is, statements that are used to read or write (manipulate) data in the database. However, SQL can also create, modify, and remove the database objects themselves, not just the data within them. It does this using <a href="https://en.wikipedia.org/wiki/Data_definition_language"><strong>Data Definition Language (DDL)</strong></a> statements, which are commands that define the different structures in a database. You will learn more about these statements in future cases.</p>
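<p>To make the DML/DDL distinction concrete, here is a minimal sketch using Python's built-in <code>sqlite3</code> module and a throwaway in-memory database. The <code>Product</code> table and its columns are hypothetical and are not part of the case data.</p>

```python
import sqlite3

# In-memory database, so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# DDL: define a structure. The Product table is a hypothetical example.
conn.execute("CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT)")

# DML: write rows into that structure, then read them back.
conn.execute("INSERT INTO Product (Name) VALUES ('Savings Plan'), ('Term Loan')")
ddl_dml_rows = conn.execute(
    "SELECT ProductID, Name FROM Product ORDER BY ProductID"
).fetchall()
print(ddl_dml_rows)

# DDL again: change the structure, then remove it entirely.
conn.execute("ALTER TABLE Product ADD COLUMN Price REAL")
product_columns = [row[1] for row in conn.execute("PRAGMA table_info(Product)")]
conn.execute("DROP TABLE Product")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()
```

<p>The <code>CREATE</code>, <code>ALTER</code>, and <code>DROP</code> statements act on the table itself, while <code>INSERT</code> and <code>SELECT</code> act only on the rows stored in it.</p>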
<p>There are two other types of SQL statements that are important, but less likely to be used by someone who is focused solely on analyzing data. We will not dig into these, as it is unlikely that you will have to deal with them anytime soon, but you are free to read up on them elsewhere if you are interested. They are:</p>
<ol>
<li>
<p><a href="https://en.wikipedia.org/wiki/Data_control_language"><strong>Data Control Language (DCL)</strong></a>: These statements determine who has permission to do what in the database. Every time you log in to a database, you do so using your database user account. By default, a newly created user does not have permission to do anything, so someone (normally a <strong>database administrator (DBA)</strong>) needs to grant that user permission to perform certain operations on the database.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/SQL#Transaction_controls"><strong>Transaction Control Language (TCL)</strong></a>: These commands are used to guarantee that a full unit of work is either completed as a whole or not at all. An example is a bank transfer: you need to ensure that if money has been withdrawn from account A, then it has also been deposited in account B, which requires wrapping these two commands into a transaction.</p>
</li>
</ol>
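
<p>The bank transfer example can be sketched with Python's built-in <code>sqlite3</code> module. This is an illustrative sketch only: the <code>Account</code> table, the <code>transfer</code> helper, and the balances are all hypothetical. Using the connection as a context manager commits the transaction on success and rolls it back if an exception is raised.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE Account (Name TEXT PRIMARY KEY, Balance INTEGER CHECK (Balance >= 0))"
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE Name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE Name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The withdrawal violated the CHECK constraint, so the deposit was undone too.
        return False

first_ok = transfer(conn, "A", "B", 30)    # valid transfer
second_ok = transfer(conn, "A", "B", 500)  # would overdraw account A
balances = dict(conn.execute("SELECT Name, Balance FROM Account"))
print(first_ok, second_ok, balances)
```

<p>In the second call the deposit into account B runs first and succeeds, but because the withdrawal from account A then fails, the rollback undoes the deposit as well and the balances are left unchanged.</p>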

<h2>Conclusions</h2>

<p>In this case, you learned the basics of SQL and used it to optimize the sales operations of a financial services firm. We narrowed down our set of potentially interesting customer cohorts and were able to compute summary statistics on the sales conversion rates of those cohorts, particularly versus the mean. In particular, we learned that some of our "no-brainer" hypotheses did not pan out, which illustrates the importance of always investigating the data to validate our thoughts. We also looked at sales agent performance and were able to find the ones that were most/least productive on particular metrics.</p>

<h2>Takeaways</h2>

<p>In this case, we learned the basics of relational database management systems (RDBMSs) and their terminology. We also built a foundation of basic SQL commands to extract data from a database. Specifically, we:</p>
<ol>
<li>Learned what an RDBMS is</li>
<li>Connected to a database using <code>SQLAlchemy</code></li>
<li>Performed <code>SELECT ... FROM</code> queries</li>
<li>Learned the <code>WHERE</code>, <code>ORDER BY</code>, <code>AS</code>, <code>DISTINCT</code>, <code>LIKE</code>, <code>CASE-END</code>, and <code>JOIN</code> keywords</li>
<li>Performed basic aggregation methods</li>
</ol>
<p>When working with large datasets, SQL is a powerful tool that can help us navigate and understand data in ways that would be cumbersome in Python alone. Sometimes, it can even serve as the first stage of an exploratory data analysis and can help us answer questions all by itself. Furthermore, SQL is the means through which we can create and persist data in databases for future, large-scale use.</p>
<p>As alluded to at the end, we only touched on one subset of SQL's capabilities and syntax - namely, performing queries that read the data. No data scientist's toolkit is complete without an understanding of how to interface with and store the raw data that they work with. SQL is less of an everyday staple than Python, but you should still be familiar with its different capabilities and refer back to this case when you have to use SQL in the future. For your convenience, we've attached a cheat sheet at the end.</p>

## Appendix: SQL Cheat Sheet

**SELECT**

```SQL
SELECT * FROM table_name;                         -- Select all columns from a table
SELECT column_name(s) FROM table_name;            -- Select some columns from a table
SELECT DISTINCT column_name(s) FROM table_name;   -- Select only the distinct values
SELECT column_name(s) FROM table_name             -- Select data filtered with the WHERE clause
  WHERE column operator value
        AND column operator value
        OR column operator value;
SELECT column_name(s) FROM table_name             -- Order data by multiple columns. DESC for descending
  ORDER BY column_1, column_2 DESC, column_3 ASC; -- and ASC (optional) for ascending order
```

**Operators**
- `<` - Less than
- `>` - Greater than
- `<=` - Less than or equal
- `>=` - Greater than or equal
- `<>` - Not equal
- `=` - Equal
- `BETWEEN v1 AND v2` - Within the range from v1 to v2 (both endpoints included)
- `LIKE` - Search pattern. Use '%' as a wildcard. E.g., '%o%' matches o, bob, blob, etc.
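
As a quick illustration of how some of these operators behave, here is a minimal sketch using Python's built-in `sqlite3` module. The `Customer` table and its rows are made up for the example and are not the case data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Name TEXT, Age INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [("Bob", 25), ("Alice", 30), ("Tom", 40), ("Eve", 41)],
)

# BETWEEN includes both endpoints, so ages 30 and 40 both qualify.
between_names = [r[0] for r in conn.execute(
    "SELECT Name FROM Customer WHERE Age BETWEEN 30 AND 40 ORDER BY Name")]

# '%o%' matches any name that contains the letter o.
like_names = [r[0] for r in conn.execute(
    "SELECT Name FROM Customer WHERE Name LIKE '%o%' ORDER BY Name")]

# <> and >= combined with AND.
not_equal_names = [r[0] for r in conn.execute(
    "SELECT Name FROM Customer WHERE Age <> 30 AND Age >= 40 ORDER BY Name")]

print(between_names, like_names, not_equal_names)
conn.close()
```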

**Aggregate Functions**
- `AVG(column)` - Returns the average value of a column
- `COUNT(column)` - Returns the number of non-NULL values in a column
- `MAX(column)` - Returns the maximum value of a column
- `MIN(column)` - Returns the minimum value of a column
- `SUM(column)` - Returns the sum of the values in a column
```SQL
SELECT AVG(column_name), MIN(column_name), MAX(column_name) FROM table_name
```
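
The way aggregate functions treat `NULL` values is easy to get wrong, so here is a small sketch using Python's built-in `sqlite3` module. The rows are made up for the example; only the column names are borrowed from the case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Call (AgentID INTEGER, Duration INTEGER)")
# Made-up rows; one call has a missing (NULL) duration.
conn.executemany(
    "INSERT INTO Call VALUES (?, ?)",
    [(1, 100), (1, 200), (1, None), (2, 60)],
)

aggregate_row = conn.execute("""
    SELECT COUNT(*), COUNT(Duration), SUM(Duration),
           AVG(Duration), MIN(Duration), MAX(Duration)
    FROM Call
""").fetchone()
print(aggregate_row)
conn.close()
```

`COUNT(*)` counts every row, whereas `COUNT(Duration)`, `SUM`, `AVG`, `MIN`, and `MAX` all skip the row whose `Duration` is `NULL`, so the average is taken over three values rather than four.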
 
**Misc.**
- `CASE-END` - Used in `SELECT` queries to return a value that depends on conditions evaluated for each row. E.g.
```SQL
SELECT column_name,
    CASE
        WHEN column_name >= 0 THEN 'non-negative'
        ELSE 'negative'
    END
FROM table_name
```
- `AS` - Used to rename a column (give it an alias) in the query result. E.g.
```SQL
SELECT SUM(column_name) AS total_column_name FROM table_name
```
- `GROUP BY` - Used to group rows that share the same value(s) in particular column(s). It is mostly used along with aggregate functions.
- `ORDER BY` - Determines the order in which the rows are returned by an SQL query
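
To show how `GROUP BY`, `ORDER BY`, `AS`, and `CASE-END` fit together, here is a minimal sketch using Python's built-in `sqlite3` module. The rows and the 100-second threshold for a "long" call are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Call (AgentID INTEGER, Duration INTEGER, ProductSold INTEGER)"
)
# Made-up rows: (AgentID, Duration in seconds, ProductSold as 0 or 1).
conn.executemany(
    "INSERT INTO Call VALUES (?, ?, ?)",
    [(1, 120, 1), (1, 0, 0), (1, 240, 0), (2, 90, 1), (2, 30, 1)],
)

grouped_query = """
SELECT AgentID,
       COUNT(*) AS NCalls,
       SUM(ProductSold) AS TotalSales,
       SUM(CASE WHEN Duration >= 100 THEN 1 ELSE 0 END) AS LongCalls
FROM Call
GROUP BY AgentID
ORDER BY TotalSales DESC, AgentID
"""
grouped_rows = conn.execute(grouped_query).fetchall()
print(grouped_rows)
conn.close()
```

Each output row summarizes one agent: the aggregates are computed separately within each `AgentID` group, and the groups are then sorted by `TotalSales` from highest to lowest.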