<a href="https://colab.research.google.com/github/Lawrence-Krukrubo/SQL_for_Data_Science/blob/main/sql_for_data_analysis1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b><h1>Welcome To SQL Basics...</h1></b>

In this notebook we connect to a MySQL server (managed through MySQL Workbench) and analyse the parch-and-posey database.<br>
It contains the practical exercises for the Udacity course **SQL for Data Analysis**.

In [None]:
# First we install mysql-connector for python
!pip install mysql-connector-python

In [None]:
# we import some required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

## Parch-and-Posey DataBase Entity Relationship Diagram Schema

<img src='https://video.udacity-data.com/topher/2017/October/59e946e7_erd/erd.png' height=400 width=400>

**Next, we create a connection to the parch-and-posey DataBase in MySQL Work-Bench**

In [None]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='parch_and_posey',
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

**Let's see the tables in parch-and-posey database**

In [None]:
# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

**Defining a function that converts the result of each query to a DataFrame**

In [None]:
def query_to_df(query):
    """Run a query on the open cursor and return the result as a DataFrame."""
    st = time.time()
    # Every query must end with a semicolon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Process the query
    cursor.execute(query)
    # cursor.description has one tuple per column; its first item is the column name
    columns = [column[0] for column in cursor.description]

    # Build the DataFrame in one step from all fetched rows
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    print(f'Query ran for {time.time()-st} secs!')
    return df

**Let's see the first 3 rows of each table in the parch-and-posey database**

In [None]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

In [None]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

In [None]:
# 3. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

In [None]:
# 4. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

In [None]:
# 5. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

In [None]:
# let's close the cursor connection after running these queries
cursor.close()

The SQL language has a few different elements, the most basic of which is the `statement`. Think of a statement as a piece of correctly written SQL code. Statements tell the database what you'd like to do with the data.<br>Examples include `drop`, `create` and `alter`. The most common, though, is the `select` statement, which reads data and displays it. `Select` statements are commonly referred to as queries.<br>
The `select` and `from` clauses are both mandatory in any query that reads from a table.

The `SELECT` clause is where you list the columns whose data you would like to show. The `FROM` clause is where you name the tables from which you would like to pull data.

In [None]:
# We closed the cursor connection above, so we reopen it first
cursor = connection.cursor()

Now let's select 10 rows from just a few columns from the orders table

In [None]:
query = 'SELECT id, account_id, occurred_at FROM orders LIMIT 10;'
query_to_df(query)

<h3>Formatting Best Practices..</h3>

1. **Using Upper and Lower Case in SQL:**<br>
SQL queries run successfully whether keywords are written in upper or lower case. In other words, SQL keywords are not case-sensitive.

2. **Capitalizing SQL Clauses:**<br>It is common and best practice to capitalize all SQL commands, like `SELECT` and `FROM`, and keep everything else in your query lower case. Capitalizing command words makes queries easier to read, which will matter more as you write more complex queries.

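3. **One other note:**<br> The text data stored in SQL tables can be either upper or lower case. Whether comparisons against this text are case-sensitive depends on the database: Postgres compares text case-sensitively, while MySQL's default collations ignore case.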

4. **Avoid Spaces in Table and Variable Names:**<br>
It is common to use underscores and avoid spaces in column names. It is a bit annoying to work with spaces in SQL. In Postgres if you have spaces in column or table names, you need to refer to these columns/tables with double quotes around them (Ex: `FROM "Table Name"` as opposed to `FROM table_name`). In other environments, you might see this as square brackets instead (Ex: `FROM [Table Name]`).

5. **Use White Space in Queries:**<br>
SQL ignores extra whitespace, so you can add as many spaces and blank lines between the parts of a query as you want and the query stays the same. Use this freedom sparingly, to make queries easier to read.

6. **Semicolons:**<br>
Depending on your SQL environment, your query may need a semicolon at the end to execute. Other environments are more flexible in terms of this being a "requirement." It is considered best practice to put a semicolon at the end of each statement, which also allows you to run multiple queries at once if your environment allows this.
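
As a minimal sketch of points 1 and 5, the snippet below runs the same query written two different ways and checks that the results match. It assumes Python's built-in `sqlite3` module and a made-up three-row `orders` table as a stand-in for the MySQL database used in this notebook, so the rows are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for parch_and_posey.orders
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, account_id INTEGER);')
cur.executemany('INSERT INTO orders VALUES (?, ?);',
                [(1, 1001), (2, 1001), (3, 1011)])

# Lower-case keywords, everything on one line
compact = cur.execute('select id, account_id from orders;').fetchall()

# Upper-case keywords, spread over several lines with extra spaces
spaced = cur.execute('''
    SELECT id,
           account_id
    FROM   orders;
''').fetchall()

print(compact == spaced)  # True: both spellings are the same query
conn.close()
```

The second form is the one the best practices above recommend: capitalized keywords and one clause per line.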

**The LIMIT clause**

* The `LIMIT` command is always the very last part of a query.

**The ORDER BY Clause**

* The `ORDER BY` statement allows us to sort our results using the data in any column. 
* Using `ORDER BY` in a SQL query only has temporary effects, for the results of that query, unlike sorting a sheet by column in Excel or Sheets.
* The `ORDER BY` statement always comes in a query after the `SELECT` and `FROM` statements, but before the `LIMIT` statement. If you are using the `LIMIT` statement, it will always appear last.
* Remember **DESC** can be added after the column in your `ORDER BY` statement to sort in descending order, as the default is to sort in ascending order.
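
The points above can be sketched on a tiny example. This is only an illustration: it assumes Python's built-in `sqlite3` module and a hypothetical four-row `orders` table rather than the real parch-and-posey data.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, total_amt_usd REAL);')
cur.executemany('INSERT INTO orders VALUES (?, ?);',
                [(1, 250.0), (2, 975.5), (3, 40.0), (4, 610.25)])

# Default sort is ascending: the two smallest totals come first
smallest = cur.execute(
    'SELECT id FROM orders ORDER BY total_amt_usd LIMIT 2;').fetchall()

# DESC flips the order: the two largest totals come first
largest = cur.execute(
    'SELECT id FROM orders ORDER BY total_amt_usd DESC LIMIT 2;').fetchall()

print(smallest)  # [(3,), (1,)]
print(largest)   # [(2,), (4,)]
conn.close()
```

Note that `LIMIT` comes last in both queries, after `ORDER BY`.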

Write a query to return the 10 earliest orders in the orders table. Include the `id`, `occurred_at`, and `total_amt_usd` columns.

In [None]:
query = 'SELECT id, occurred_at, total_amt_usd FROM orders \
        ORDER BY occurred_at LIMIT 10;'
query_to_df(query)

Write a query to return the top 5 orders in terms of largest `total_amt_usd`. Include the `id`, `account_id`, and `total_amt_usd`.

In [None]:
query = 'SELECT id, account_id, total_amt_usd FROM orders \
        ORDER BY total_amt_usd DESC LIMIT 5;'
query_to_df(query)

Write a query to return the lowest 20 orders in terms of smallest `total_amt_usd`. <br>Include the `id`, `account_id`, and `total_amt_usd`.

In [None]:
query = 'SELECT id, account_id, total_amt_usd FROM orders ORDER \
        BY total_amt_usd LIMIT 20;'
query_to_df(query)

**The ORDER BY Clause, Part 2**

* We can `ORDER BY` more than one column at a time
* When you provide a list of columns in an `ORDER BY` command, the sorting occurs using the leftmost column in your list first, then the next column from the left, and so on.
* We still have the ability to flip the way we order using `DESC`.
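
A small sketch of sorting on two columns, assuming Python's built-in `sqlite3` module and four hypothetical orders (not rows from the real database). Swapping the order of the columns in `ORDER BY` changes which column drives the sort.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE orders (id INTEGER, account_id INTEGER, total_amt_usd REAL);')
cur.executemany('INSERT INTO orders VALUES (?, ?, ?);',
                [(1, 1001, 100.0), (2, 1011, 300.0),
                 (3, 1001, 300.0), (4, 1011, 50.0)])

# account_id decides the order first; total_amt_usd DESC sorts within each account
by_account = [row[0] for row in cur.execute(
    'SELECT id FROM orders ORDER BY account_id, total_amt_usd DESC;')]

# total_amt_usd DESC decides the order first; account_id only breaks ties
by_total = [row[0] for row in cur.execute(
    'SELECT id FROM orders ORDER BY total_amt_usd DESC, account_id;')]

print(by_account)  # [3, 1, 2, 4]
print(by_total)    # [3, 2, 1, 4]
conn.close()
```

In the second result, orders 3 and 2 tie on 300.0, so the secondary sort on `account_id` puts account 1001 ahead of 1011.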

Write a query that displays the order `ID`, `account ID`, and `total dollar amount` for all the orders, <br>sorted first by the `account ID` (in ascending order), and then by the `total dollar amount` (in descending order).


In [None]:
query = 'SELECT id, account_id, total_amt_usd FROM orders \
        ORDER BY account_id, total_amt_usd DESC;'
query_to_df(query)

Now write a query that again displays `order ID`, `account ID`, and `total dollar amount` for each order, but this time sorted first by `total dollar amount` (in descending order), and then by `account ID` (in ascending order).

In [None]:
query = 'SELECT id, account_id, total_amt_usd FROM orders \
        ORDER BY total_amt_usd DESC, account_id;'
query_to_df(query)

Compare the results of these two queries above. How are the results different when you switch the column you sort on first?

**Ans:** 
* In query #1, sorting by account_id first groups all of the orders for each account ID together, with the account IDs ascending from lowest to highest. Within each of those groups, the orders appear from the greatest total_amt_usd to the least.
* In query #2, sorting by total_amt_usd DESC first means the orders appear from the greatest total to the least, regardless of which account ID they came from. The secondary sort by account_id only matters when two orders have exactly the same total_amt_usd, so it is difficult to see here.



**The WHERE Clause:**

* Using the `WHERE` statement, we can display subsets of tables based on conditions that must be met. You can also think of the `WHERE` command as filtering the data.
* The `WHERE` clause goes after `FROM`, but before `ORDER BY` or `LIMIT`
* Common symbols used in `WHERE` statements include:
  * `>` (greater than)
  * `<` (less than)
  * `>=` (greater than or equal to)
  * `<=` (less than or equal to)
  * `=` (equal to)
  * `!=` (not equal to)
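
To see how a few of these comparison operators filter rows, here is a minimal sketch. It assumes Python's built-in `sqlite3` module and three hypothetical orders, not the real parch-and-posey data.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, gloss_amt_usd REAL);')
cur.executemany('INSERT INTO orders VALUES (?, ?);',
                [(1, 200.0), (2, 1000.0), (3, 1500.0)])

def matching_ids(condition):
    """Return the ids of the orders that satisfy the given WHERE condition."""
    query = f'SELECT id FROM orders WHERE {condition} ORDER BY id;'
    return [row[0] for row in cur.execute(query)]

at_least = matching_ids('gloss_amt_usd >= 1000')  # boundary value is kept
more_than = matching_ids('gloss_amt_usd > 1000')  # boundary value is dropped
not_equal = matching_ids('gloss_amt_usd != 1000')

print(at_least, more_than, not_equal)  # [2, 3] [3] [1, 3]
conn.close()
```

`matching_ids` is a hypothetical helper written for this example only. It pastes the condition straight into the query text, which is fine for fixed strings like these but should not be done with user input.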

Write a query that:

Pulls the first 5 rows and all columns from the orders table that have a dollar amount of gloss_amt_usd greater than or equal to 1000.

In [None]:
query = 'SELECT * FROM orders WHERE gloss_amt_usd >= 1000 LIMIT 5;'
query_to_df(query)

Pulls the first 10 rows and all columns from the orders table that have a total_amt_usd less than 500.

In [None]:
query = 'SELECT * FROM orders WHERE total_amt_usd < 500  LIMIT 10;'
query_to_df(query)

**WHERE Clause contd...**

The `WHERE` statement can also be used with non-numeric data. We can use the = and != operators here. You need to be sure to use single quotes (just be careful if you have quotes in the original text) with the text data, not double quotes.

Commonly when we are using `WHERE` with non-numeric data fields, we use the `LIKE`, `NOT`, or `IN` operators. 

Filter the accounts table to include the company name, website, and the primary point of contact (primary_poc) just for the Exxon Mobil company in the accounts table.

In [None]:
query = "SELECT name, website, primary_poc FROM accounts WHERE name = 'Exxon Mobil';"
query_to_df(query)

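Note: If you received an error message when executing your query, remember that standard SQL requires single quotes, not double quotes, around text values like 'Exxon Mobil'. MySQL also accepts double quotes by default, which is why some later queries in this notebook use them.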

**Derived Columns:**

* Creating a new column that is a combination of existing columns is known as a derived column (or "calculated" or "computed" column). Usually you want to give a name, or "alias," to your new column using the `AS` keyword.
* This derived column, and its alias, are generally only temporary, existing just for the duration of your query. The next time you run a query and access this table, the new column will not be there.
* **Order of Operations**<br>
Remember **PEMDAS** from math class to help remember the order of operations? The same order of operations applies when using arithmetic operators in SQL.

The following two statements have very different end results:

* `standard_qty / standard_qty + gloss_qty + poster_qty`
* `standard_qty / (standard_qty + gloss_qty + poster_qty)`
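
The difference can be checked on one hypothetical order with `standard_qty = 100`, `gloss_qty = 50` and `poster_qty = 50`. The sketch below assumes Python's built-in `sqlite3` module and a one-row made-up table. It also shows that an alias created with `AS` names the derived column in the result without changing the stored table.

```python
import sqlite3

# Assumption: a single made-up order in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE orders (standard_qty REAL, gloss_qty REAL, poster_qty REAL);')
cur.execute('INSERT INTO orders VALUES (100.0, 50.0, 50.0);')

cur.execute('''
    SELECT standard_qty / standard_qty + gloss_qty + poster_qty AS no_parens,
           standard_qty / (standard_qty + gloss_qty + poster_qty) AS with_parens
    FROM orders;
''')
derived_names = [col[0] for col in cur.description]
no_parens, with_parens = cur.fetchone()

print(derived_names)  # ['no_parens', 'with_parens']
print(no_parens)      # 101.0 -> (100 / 100) + 50 + 50
print(with_parens)    # 0.5   -> 100 / (100 + 50 + 50)

# The derived columns existed only for that query
cur.execute('SELECT * FROM orders;')
stored_names = [col[0] for col in cur.description]
print(stored_names)   # ['standard_qty', 'gloss_qty', 'poster_qty']
conn.close()
```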

Select id, account_id, gloss_amt_usd, total_amt_usd and add a derived column called pct_gloss_amt by dividing gloss_amt_usd by total_amt_usd. Show only 10 rows

In [None]:
query = 'SELECT id, account_id, gloss_amt_usd, total_amt_usd, \
        (gloss_amt_usd/total_amt_usd) AS pct_gloss_amt FROM orders LIMIT 10;'
query_to_df(query)

Using the orders table, create a column that divides the standard_amt_usd by the standard_qty to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the id and account_id fields.


In [None]:
query = 'SELECT id, account_id, (standard_amt_usd / standard_qty) \
        AS unit_price FROM orders LIMIT 10;'

query_to_df(query)

Using the orders table, write a query that finds the percentage of revenue that comes from poster paper for each order. You will need to use only the columns that end with _usd. (Try to do this without using the total column.) Display the id and account_id fields also.<br>NOTE: you will receive an error with the correct solution to this question. This occurs because at least one of the values in the data creates a division by zero in your formula. You will learn later in the course how to fully handle this issue. For now, you can just limit your calculations to the first 10 orders, as we did in question #1, and you'll avoid the set of data that causes the problem.

In [None]:
query = 'SELECT id, account_id, \
        (poster_amt_usd / (poster_amt_usd + gloss_amt_usd + standard_amt_usd)) \
        AS pct_revenue_poster FROM orders LIMIT 10;'
query_to_df(query)

********************************************************************************

<h3>Introduction to Logical Operators</h3>

In the next concepts, you will be learning about Logical Operators. Logical Operators include:

* **LIKE**<br>
This allows you to perform operations similar to using `WHERE` and `=`, but for cases when you might not know exactly what you are looking for.

* **IN**<br>
This allows you to perform operations similar to using `WHERE` and `=`, but for more than one condition.

* **NOT**<br>
This is used with `IN` and `LIKE` to select all of the rows `NOT LIKE` or `NOT IN` a certain condition.

* **AND & BETWEEN**<br>
These allow you to combine operations where all combined conditions must be true.

* **OR**<br>
This allows you to combine operations where at least one of the combined conditions must be true.

**The LIKE Operator:**

The `LIKE` operator is extremely useful for working with text. You will use `LIKE` within a `WHERE` clause. The `LIKE` operator is frequently used with `%`. The `%` tells us that we might want any number of characters leading up to a particular set of characters or following a certain set of characters.

Remember to use single quotes for the text you pass to the `LIKE` operator. In case-sensitive environments such as Postgres, lower- and uppercase letters are not the same within the string: searching for 'T' is not the same as searching for 't'. MySQL's default collations compare text without regard to case.
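
Here is a small sketch of the three common `%` patterns: starts with, contains, and ends with. It assumes Python's built-in `sqlite3` module and four hypothetical company names, chosen so that the results do not depend on case sensitivity.

```python
import sqlite3

# Assumption: made-up company names in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE accounts (id INTEGER, name TEXT);')
cur.executemany('INSERT INTO accounts VALUES (?, ?);',
                [(1, 'CVS Health'), (2, 'Honeywell'),
                 (3, 'Lowes'), (4, 'Comcast')])

def names_like(pattern):
    """Return the names that match a LIKE pattern, in id order."""
    rows = cur.execute(
        'SELECT name FROM accounts WHERE name LIKE ? ORDER BY id;', (pattern,))
    return [row[0] for row in rows]

starts_with_c = names_like('C%')   # 'C' followed by anything
contains_one = names_like('%one%')  # 'one' anywhere in the name
ends_with_s = names_like('%s')     # anything followed by 's'

print(starts_with_c)  # ['CVS Health', 'Comcast']
print(contains_one)   # ['Honeywell']
print(ends_with_s)    # ['Lowes']
conn.close()
```

`names_like` is a hypothetical helper for this example. It passes the pattern as a query parameter (`?`), so no quoting of the pattern text is needed.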

**The IN Operator:**

The `IN` operator is useful for working with both numeric and text columns. This operator allows you to use an `=`, but for more than one item of that particular column. We can check one, two or many column values for which we want to pull data, but all within the same query.

**The NOT Operator:**

The `NOT` operator is an extremely useful operator for working with the previous two operators we introduced: `IN` and `LIKE`. By specifying `NOT` `LIKE` or `NOT` `IN`, we can grab all of the rows that do not meet particular criteria.
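
As a minimal sketch, `IN` and `NOT IN` split a table into two complementary sets of rows. The example assumes Python's built-in `sqlite3` module and four hypothetical `web_events` rows.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE web_events (id INTEGER, channel TEXT);')
cur.executemany('INSERT INTO web_events VALUES (?, ?);',
                [(1, 'direct'), (2, 'organic'),
                 (3, 'adwords'), (4, 'facebook')])

# Rows whose channel equals any value in the list
in_ids = [row[0] for row in cur.execute(
    "SELECT id FROM web_events "
    "WHERE channel IN ('organic', 'adwords') ORDER BY id;")]

# Rows whose channel equals none of the values in the list
not_in_ids = [row[0] for row in cur.execute(
    "SELECT id FROM web_events "
    "WHERE channel NOT IN ('organic', 'adwords') ORDER BY id;")]

print(in_ids)      # [2, 3]
print(not_in_ids)  # [1, 4]
conn.close()
```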

**Expert Tip**

In most SQL environments, although not in Udacity's classroom, you can use single or double quotation marks, and you may NEED to use double quotation marks if you have an apostrophe within the text you are attempting to pull.

Questions using LIKE Operator

Using the accounts table, find all the companies whose names start with 'C'.

In [None]:
query = "SELECT * FROM accounts WHERE name LIKE 'C%';"
query_to_df(query)

All companies whose names contain the string 'one' somewhere in the name.

In [None]:
query = "SELECT * FROM accounts WHERE name LIKE '%one%';"
query_to_df(query)

All companies whose names end with 's'

In [None]:
query = "SELECT * FROM accounts WHERE name LIKE '%s';"
query_to_df(query)

Questions using IN operator<br>
Use the accounts table to find the account name, primary_poc, and sales_rep_id for Walmart, Target, and Nordstrom.

In [None]:
query = "SELECT name, primary_poc, sales_rep_id FROM accounts \
        WHERE name IN ('Walmart', 'Target', 'Nordstrom');"
query_to_df(query)

Use the web_events table to find all information regarding individuals who were contacted via the channel of organic or adwords.

In [None]:
query = 'SELECT * FROM web_events WHERE channel IN ("organic", "adwords");'
query_to_df(query)

Questions using the NOT operator:

Use the accounts table to find the account name, primary poc, and sales rep id for all stores except Walmart, Target, and Nordstrom.


In [None]:
query = "SELECT name, primary_poc, sales_rep_id FROM accounts \
        WHERE name NOT IN ('Walmart', 'Target', 'Nordstrom');"
query_to_df(query)

Use the web_events table to find all information regarding individuals who were contacted via any method except using organic or adwords methods.

In [None]:
query = 'SELECT * FROM web_events WHERE channel NOT IN ("organic", "adwords");'
query_to_df(query)

Use the accounts table to find all the companies whose names do not start with 'C'.

In [None]:
query = 'SELECT name FROM accounts WHERE name NOT LIKE "C%";'
query_to_df(query)

All companies whose names do not contain the string 'one' somewhere in the name.

In [None]:
query = 'SELECT name FROM accounts WHERE name NOT LIKE "%one%";'
query_to_df(query)

All companies whose names do not end with 's'.

In [None]:
query = 'SELECT name FROM accounts WHERE name NOT LIKE "%s";'
query_to_df(query)

**The AND Operator:**

* The `AND` operator is used within a `WHERE` statement to consider more than one logical clause at a time.
* Each time you link a new statement with an `AND`, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time.
* This operator works with all of the operations we have seen so far, including arithmetic operators `(+, *, -, /)`. `LIKE`, `IN`, and `NOT` logic can also be linked together using the `AND` operator.


**The BETWEEN Operator:**

* Sometimes we can make a cleaner statement using `BETWEEN` than we can using `AND`. This is particularly true when we are using the same column for different parts of our `AND` statement.
* Note that the endpoints of a `BETWEEN` operator query are inclusive: both the start and end values are included in the output.

For example, statement 1 below is much better written as statement 2 below.

1. `SELECT * FROM table WHERE column >= 6 AND column <= 10`
2. `SELECT * FROM table WHERE column BETWEEN 6 AND 10`
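
A quick check that the two forms return the same rows and that both endpoints are included. This is a sketch that assumes Python's built-in `sqlite3` module and five hypothetical `gloss_qty` values.

```python
import sqlite3

# Assumption: made-up quantities in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (gloss_qty INTEGER);')
cur.executemany('INSERT INTO orders VALUES (?);',
                [(5,), (6,), (8,), (10,), (11,)])

# Form 1: two comparisons joined with AND
with_and = [row[0] for row in cur.execute(
    'SELECT gloss_qty FROM orders '
    'WHERE gloss_qty >= 6 AND gloss_qty <= 10 ORDER BY gloss_qty;')]

# Form 2: the same range written with BETWEEN
with_between = [row[0] for row in cur.execute(
    'SELECT gloss_qty FROM orders '
    'WHERE gloss_qty BETWEEN 6 AND 10 ORDER BY gloss_qty;')]

print(with_and)      # [6, 8, 10]
print(with_between)  # [6, 8, 10] -> the endpoints 6 and 10 are included
conn.close()
```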

<h2>Questions using AND and BETWEEN operators</h2>

Write a query that returns all the orders where the standard_qty is over 1000, the poster_qty is 0, and the gloss_qty is 0.

In [None]:
query = 'SELECT * FROM orders WHERE standard_qty > 1000 \
        AND poster_qty = 0 AND gloss_qty = 0;'
query_to_df(query)

Using the accounts table, find all the companies whose names do not start with 'C' and end with 's'.

In [None]:
query = 'SELECT name FROM accounts WHERE name NOT LIKE "C%" AND name LIKE "%s";'
query_to_df(query)

When you use the BETWEEN operator in SQL, do the results include the values of your endpoints, or not? Figure out the answer to this important question by writing a query that displays the order date and gloss_qty data for all orders where gloss_qty is between 24 and 29. Then look at your output to see if the BETWEEN operator included the begin and end values or not.


In [None]:
query = 'SELECT occurred_at, gloss_qty FROM orders WHERE gloss_qty BETWEEN 24 AND 29;'
query_to_df(query)

Use the web_events table to find all information regarding individuals who were contacted via the organic or adwords channels, and started their account at any point in 2016, sorted from newest to oldest.

In [None]:
query = 'SELECT * FROM web_events WHERE channel IN ("organic", "adwords") \
        AND occurred_at BETWEEN "2016-01-01" AND "2017-01-01" ORDER BY occurred_at DESC;'

query_to_df(query)

You will notice that using BETWEEN is tricky for dates! While BETWEEN is generally inclusive of endpoints, a date given without a time is assumed to be at 00:00:00 (i.e. midnight). An endpoint of '2016-12-31' would therefore miss every event later on that final day, which is why we set the right-side endpoint of the period at '2017-01-01'.
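
The midnight issue can be illustrated in plain Python, without a database. The sketch below uses the standard `datetime` module, and the event timestamp is hypothetical. A date with no time part becomes midnight at the start of that day, just like a bare date used as a `BETWEEN` endpoint.

```python
from datetime import datetime

# A bare date is read as midnight at the start of that day
start = datetime(2016, 1, 1)           # 2016-01-01 00:00:00
last_day_of_2016 = datetime(2016, 12, 31)  # 2016-12-31 00:00:00
first_day_of_2017 = datetime(2017, 1, 1)   # 2017-01-01 00:00:00

# Hypothetical event on the afternoon of the last day of 2016
event = datetime(2016, 12, 31, 15, 30)

# Inclusive range check, the same idea as BETWEEN start AND end
missed = start <= event <= last_day_of_2016
caught = start <= event <= first_day_of_2017

print(missed)  # False: 15:30 is after midnight on 2016-12-31
print(caught)  # True: the later endpoint covers the whole last day
```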


**The OR Operator**

* Similar to the `AND` operator, the `OR` operator can combine multiple statements.
* Each time you link a new statement with an `OR`, you will need to specify the column you are interested in looking at, just like with `AND`.
* You may link as many statements as you would like to consider at the same time.
* This operator works with all of the operations we have seen so far, including arithmetic operators `(+, *, -, /)`. `LIKE`, `IN`, `NOT`, `AND`, and `BETWEEN` logic can all be linked together using the `OR` operator.
* When combining several of these operations, we frequently need to use parentheses to ensure that the logic we want to perform is being executed correctly.
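
The last point matters because `AND` is evaluated before `OR`. Here is a small sketch of the difference, assuming Python's built-in `sqlite3` module and three hypothetical orders.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE orders (id INTEGER, standard_qty INTEGER, '
            'gloss_qty INTEGER, poster_qty INTEGER);')
cur.executemany('INSERT INTO orders VALUES (?, ?, ?, ?);',
                [(1, 0, 1500, 0), (2, 500, 0, 2000), (3, 0, 10, 20)])

# Parentheses: standard_qty must be 0, and one of the other two must be large
with_parens = [row[0] for row in cur.execute(
    'SELECT id FROM orders WHERE standard_qty = 0 '
    'AND (gloss_qty > 1000 OR poster_qty > 1000) ORDER BY id;')]

# No parentheses: read as (standard_qty = 0 AND gloss_qty > 1000) OR poster_qty > 1000
without_parens = [row[0] for row in cur.execute(
    'SELECT id FROM orders WHERE standard_qty = 0 '
    'AND gloss_qty > 1000 OR poster_qty > 1000 ORDER BY id;')]

print(with_parens)     # [1]
print(without_parens)  # [1, 2] -> order 2 slips in despite standard_qty = 500
conn.close()
```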

<h3>Questions using the OR operator</h3>

Find the list of order ids where either gloss_qty or poster_qty is greater than 4000. Only include the id field in the resulting table.

In [None]:
query = 'SELECT id FROM orders WHERE gloss_qty > 4000 OR poster_qty > 4000;'

query_to_df(query)

Write a query that returns a list of orders where the standard_qty is zero and either the gloss_qty or poster_qty is over 1000.

In [None]:
query = 'SELECT * FROM orders WHERE standard_qty = 0 AND \
        (gloss_qty > 1000 OR poster_qty > 1000);'
query_to_df(query)

Find all the company names that start with a 'C' or 'W', and the primary contact contains 'ana' or 'Ana', but it doesn't contain 'eana'.


In [None]:
query = 'SELECT * FROM accounts WHERE \
        (name LIKE "C%" OR name LIKE "W%") AND \
        (primary_poc LIKE "%ana%" OR primary_poc LIKE "%Ana%") AND \
        primary_poc NOT LIKE "%eana%";'

query_to_df(query)

In [None]:
# closing connection and cursor for the day

if connection.is_connected():
    cursor.close()
    connection.close()
    print("MySQL connection is closed")