# Import Dependencies

In [1]:
from sqlalchemy import create_engine
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
# Create MySQL Database Connection
# ----------------------------------
# Replace user, password, host and database with your own connection details
engine = create_engine('mysql+pymysql://user:password@host/database', pool_recycle=3600)
conn = engine.connect()

In [4]:
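# Confirm connection by listing the tables in the database
# (Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.0)
from sqlalchemy import inspect
inspect(engine).get_table_names()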

['portfolio']

# Dividing Data into Logical Sets using <font color="red">GROUP BY</font>

<br>
<strong>SQL Syntax</strong><br>
SELECT column, aggregation(*) AS some_column<br>
FROM table<br>
<strong><font color="red">GROUP BY</font></strong> column/alias;

# Examine Data

In [6]:
sql_view = "SELECT * FROM portfolio LIMIT 5;"

In [7]:
# Run query
view_data = pd.read_sql(sql_view, conn)
# Displaying subset of data
view_data

Unnamed: 0,MyUnknownColumn,mean_return,variance,pf_weights,bm_weights,Security,GICS Sector,GICS Sub Industry
0,A,0.146146,0.035194,0.0,0.0,Agilent Technologies Inc,Health Care,Health Care Equipment
1,AAL,0.444411,0.094328,0.214,0.0,American Airlines Group,Industrials,Airlines
2,AAP,0.242189,0.029633,0.0,0.0,Advance Auto Parts,Consumer Discretionary,Automotive Retail
3,AAPL,0.225074,0.027283,0.0,0.0,Apple Inc.,Information Technology,Computer Hardware
4,ABBV,0.182541,0.029926,0.0,0.0,AbbVie,Health Care,Pharmaceuticals


# Create Groups

### Example 1: GROUP BY GICS Sector and COUNT the number of stocks in that sector

In [6]:
sql_view1 = """SELECT `GICS Sector`, COUNT(MyUnknownColumn) AS Total_Stocks 
                FROM portfolio
                GROUP BY `GICS Sector`;"""

In [7]:
# Run query
view_data1 = pd.read_sql(sql_view1, conn)
# Displaying subset of data
view_data1

Unnamed: 0,GICS Sector,Total_Stocks
0,Health Care,59
1,Industrials,69
2,Consumer Discretionary,84
3,Information Technology,68
4,Consumer Staples,36
5,Utilities,28
6,Financials,62
7,Real Estate,29
8,Materials,25
9,Energy,36


### <strong><font color="blue">Explanation:</font></strong><br>
The <code>SELECT</code> statement specified two columns<br>

1. <code>`GICS Sector`</code><br>
2. <code>COUNT(MyUnknownColumn) AS Total_Stocks</code><br>
<br>

I aliased the result of <code>COUNT(MyUnknownColumn)</code> as <code>Total_Stocks</code> to give it a meaningful name.<br>
I used backticks around <code>`GICS Sector`</code> because the column name contains a space.<br>
<br>
The second column, <code>COUNT(MyUnknownColumn) AS Total_Stocks</code>, is an aggregate function that counts the non-<code>NULL</code> values in that field.<br>
<br>
The <code>GROUP BY</code> clause instructs the Database Management System to organize the data into groups by the column <code>`GICS Sector`</code>. This causes <code>COUNT(MyUnknownColumn) AS Total_Stocks</code> to be calculated once for each group.<br>

### Why is this a powerful clause?
<font color="red">The <code>GROUP BY</code> clause enables us to group data by category and perform some aggregate on each group without having to specify each category!</font>
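
As a minimal sketch of this point, the snippet below counts rows per category with a single query and never names a category. It assumes an in-memory SQLite database as a stand-in for the MySQL connection, and the table `portfolio_demo`, its tickers and its sectors are invented for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT)")
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?)",
    [
        ("AAA", "Energy"), ("BBB", "Energy"),
        ("CCC", "Utilities"),
        ("DDD", "Materials"), ("EEE", "Materials"), ("FFF", "Materials"),
    ],
)

# One query produces a count for every sector found in the table.
grouped = conn_demo.execute(
    """SELECT sector, COUNT(ticker) AS total_stocks
       FROM portfolio_demo
       GROUP BY sector
       ORDER BY sector"""
).fetchall()
print(grouped)  # [('Energy', 2), ('Materials', 3), ('Utilities', 1)]
```

If a new sector were added to the table, the same query would pick it up without any change to the SQL.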


# Important <code>GROUP BY</code> Rules

 1. Can contain many columns, allowing you to have nested groups.
 2. For nested groups (many columns), the data is grouped by each distinct combination of the columns specified, and aggregates are calculated per combination.
 3. Every column listed in the clause must be a retrieved column or a valid expression, not an aggregate function.
 4. If you group by an expression, you need to use the same expression in the <code>SELECT</code> statement that you use in the <code>GROUP BY</code> clause.
 5. Most Relational Database Management Systems (RDMS) do not allow variable-length datatypes, such as free-form text that is not categorical, in a <code>GROUP BY</code> clause.
 6. Rows with <code>NULL</code> in the grouping column are returned together as one group.
 7. <font color="red"><code>GROUP BY</code> comes <strong>AFTER</strong> <code>WHERE</code> clause and <strong>BEFORE</strong> <code>ORDER BY</code> clause.</font>
 8. Some RDMS allow you to specify columns by relative position, but this is not recommended because it is prone to errors when the SQL statement is edited.
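
Rules 1, 2 and 6 can be sketched together. The example assumes an in-memory SQLite database in place of MySQL; the table `portfolio_demo` and its rows are hypothetical. Grouping by two columns yields one row per distinct (sector, sub-industry) pair, and the rows with `NULL` in both columns form their own group.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT, sub_industry TEXT)"
)
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?, ?)",
    [
        ("AAA", "Energy", "Oil"),
        ("BBB", "Energy", "Oil"),
        ("CCC", "Energy", "Gas"),
        ("DDD", None, None),  # sector not recorded
        ("EEE", None, None),  # sector not recorded
    ],
)

# Nested grouping: one output row per distinct (sector, sub_industry) pair.
nested = conn_demo.execute(
    """SELECT sector, sub_industry, COUNT(*) AS total_stocks
       FROM portfolio_demo
       GROUP BY sector, sub_industry
       ORDER BY sector, sub_industry"""
).fetchall()

for row in nested:
    print(row)
```

The result has three groups: Energy/Gas with 1 row, Energy/Oil with 2 rows, and a `NULL` group with 2 rows.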

# Filtering Groups with the <font color="red">HAVING</font> Clause

 - The <code>WHERE</code> clause is a powerful tool for filtering data in specific rows in a table, but DOES NOT WORK for groups.
 - The <code>HAVING</code> clause filters data by groups!
 - All operators available in <code>WHERE</code>, including the wildcard operators used with <code>LIKE</code>, can be used with the <code>HAVING</code> clause.
 - <code>WHERE</code> clause filters data <strong>BEFORE</strong> the data is grouped, and the <code>HAVING</code> clause filters data <strong>AFTER</strong> the data is grouped.
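
The difference can be checked with a small experiment. This is a sketch that assumes an in-memory SQLite database rather than MySQL, with an invented `portfolio_demo` table. An aggregate condition placed in `WHERE` is rejected because no groups exist yet when rows are filtered, while the same condition in `HAVING` is accepted.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT)")
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?)",
    [
        ("AAA", "Energy"), ("BBB", "Energy"),
        ("CCC", "Utilities"),
        ("DDD", "Materials"), ("EEE", "Materials"), ("FFF", "Materials"),
    ],
)

# WHERE runs before grouping, so an aggregate condition there is an error.
try:
    conn_demo.execute(
        """SELECT sector, COUNT(*) FROM portfolio_demo
           WHERE COUNT(*) > 1
           GROUP BY sector"""
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

# HAVING runs after grouping, so the same condition is valid.
having_rows = conn_demo.execute(
    """SELECT sector, COUNT(*) AS total_stocks
       FROM portfolio_demo
       GROUP BY sector
       HAVING COUNT(*) > 1
       ORDER BY sector"""
).fetchall()

print("WHERE attempt failed with:", where_error)
print(having_rows)
```

The exact error text depends on the database engine, so the sketch only records that an error occurred.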

### Example 2: Filter GICS Sector Groups by a specific group category

<br>
<strong>SQL Syntax</strong><br>
SELECT column, COUNT(*) AS some_column<br>
FROM table<br>
<strong><font color="red">GROUP BY</font></strong> column/alias<br>
HAVING column = SOME_VALUE;

In [8]:
sql_view2 = """SELECT `GICS Sector`, COUNT(MyUnknownColumn) AS Total_Stocks 
                FROM portfolio
                GROUP BY `GICS Sector`
                HAVING `GICS Sector` = "Information Technology";"""

In [9]:
# Run query
view_data2 = pd.read_sql(sql_view2, conn)
# Displaying subset of data
view_data2

Unnamed: 0,GICS Sector,Total_Stocks
0,Information Technology,68


### <strong><font color="blue">Explanation:</font></strong><br>
1. The <code>GROUP BY `GICS Sector`</code> statement instructs the RDMS to group the data by `GICS Sector`.<br>
    
2. The <code>HAVING `GICS Sector` = "Information Technology"</code> statement instructs the RDMS to keep only the `GICS Sector` group that matches the string <strong>"Information Technology"</strong>.<br>

3. The <code>COUNT(MyUnknownColumn) AS Total_Stocks</code> in the <code>SELECT</code> statement returns the number of records in the group that remains after the <code>GROUP BY</code> and <code>HAVING</code> clauses, i.e. the number of records that matched "Information Technology".
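
Because the condition in this example tests the grouping column itself rather than an aggregate, the same result can also be obtained by filtering rows with `WHERE` before grouping. The sketch below compares both forms. It assumes an in-memory SQLite database in place of MySQL, and the `portfolio_demo` table and its rows are invented.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT)")
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?)",
    [
        ("AAA", "Energy"), ("BBB", "Energy"),
        ("CCC", "Utilities"),
        ("DDD", "Materials"), ("EEE", "Materials"), ("FFF", "Materials"),
    ],
)

# Group everything first, then keep one group.
having_sql = """SELECT sector, COUNT(ticker) AS total_stocks
                FROM portfolio_demo
                GROUP BY sector
                HAVING sector = 'Materials'"""

# Keep only the matching rows first, then group what is left.
where_sql = """SELECT sector, COUNT(ticker) AS total_stocks
               FROM portfolio_demo
               WHERE sector = 'Materials'
               GROUP BY sector"""

having_rows = conn_demo.execute(having_sql).fetchall()
where_rows = conn_demo.execute(where_sql).fetchall()
print(having_rows, where_rows)  # [('Materials', 3)] [('Materials', 3)]
```

The two forms agree only because the condition does not involve an aggregate; a condition such as `COUNT(ticker) > 2` can only go in `HAVING`.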

### Example 3: Filter GICS Sector Groups based on a group's aggregation.

<br>
<strong>SQL Syntax</strong><br>
SELECT column, aggregation(*) AS some_column<br>
FROM table<br>
<strong><font color="red">GROUP BY</font></strong> column/alias<br>
HAVING aggregation(*) logical condition(s);

In [10]:
sql_view3 = """SELECT `GICS Sector`, COUNT(MyUnknownColumn) AS Total_Stocks 
                FROM portfolio
                GROUP BY `GICS Sector`
                HAVING COUNT(MyUnknownColumn) > 20;"""

In [11]:
# Run query
view_data3 = pd.read_sql(sql_view3, conn)
# Displaying subset of data
view_data3

Unnamed: 0,GICS Sector,Total_Stocks
0,Health Care,59
1,Industrials,69
2,Consumer Discretionary,84
3,Information Technology,68
4,Consumer Staples,36
5,Utilities,28
6,Financials,62
7,Real Estate,29
8,Materials,25
9,Energy,36


### <strong><font color="blue">Explanation:</font></strong><br>
1. The <code>HAVING COUNT(MyUnknownColumn)</code> statement instructs the RDMS to filter the groups by the aggregate function <code>COUNT()</code> applied to MyUnknownColumn.<br>
2. The <code>COUNT(MyUnknownColumn) > 20</code> condition keeps only the groups that have more than <strong>20</strong> records. Every sector in this table has more than 20 stocks, so all ten groups are returned.

### Example 4: Filter grouped data using multiple conditions with the <code>AND</code> operator.

<br>
<strong>SQL Syntax</strong><br>
SELECT column, aggregation(*) AS some_column<br>
FROM table<br>
<strong><font color="red">GROUP BY</font></strong> column/alias<br>
HAVING aggregation(*) logical condition(s) <font color="red">AND</font> aggregation(*) logical condition(s);

In [12]:
sql_view4 = """SELECT `GICS Sector`, COUNT(MyUnknownColumn) AS Total_Stocks 
                FROM portfolio
                GROUP BY `GICS Sector`
                HAVING COUNT(MyUnknownColumn) > 20 AND COUNT(MyUnknownColumn) <= 40;"""

In [13]:
# Run query
view_data4 = pd.read_sql(sql_view4, conn)
# Displaying subset of data
view_data4

Unnamed: 0,GICS Sector,Total_Stocks
0,Consumer Staples,36
1,Utilities,28
2,Real Estate,29
3,Materials,25
4,Energy,36


### <strong><font color="blue">Explanation:</font></strong><br>
1. The <code>HAVING COUNT(MyUnknownColumn)</code> statement instructs the RDMS to filter the groups by the aggregate function <code>COUNT()</code> applied to MyUnknownColumn.<br>
2. The <code>COUNT(MyUnknownColumn) > 20</code> condition keeps only the groups that have more than <strong>20</strong> records.
3. The <code>AND COUNT(MyUnknownColumn) <= 40;</code> condition that follows instructs the RDMS to also require that a group has 40 records or fewer. Both conditions must be true for a group to be returned.

# Filtering data with the <font color="red">WHERE</font> clause and the  <font color="red">HAVING</font> Clause

<br>
<strong>SQL Syntax</strong><br>
SELECT column, aggregation(*) AS some_column<br>
FROM table<br>
WHERE some_column operator condition<br>
<strong><font color="red">GROUP BY</font></strong> column/alias<br>
HAVING aggregation(*)/alias logical condition(s);

### Example 5: Filter data by first retrieving only rows <code>WHERE</code> the mean return is at least some value, then <code>GROUP BY</code> sector and keep only the sectors <code>HAVING</code> more than 2 stocks.

In [14]:
sql_view5 = """SELECT `GICS Sector`, COUNT(MyUnknownColumn) AS Total_Stocks 
                FROM portfolio
                WHERE mean_return >= 0.3
                GROUP BY `GICS Sector`
                HAVING Total_Stocks > 2;"""

In [15]:
# Run query
view_data5 = pd.read_sql(sql_view5, conn)
# Displaying subset of data
view_data5

Unnamed: 0,GICS Sector,Total_Stocks
0,Industrials,5
1,Information Technology,6
2,Consumer Discretionary,5
3,Health Care,4


### <strong><font color="blue">Explanation:</font></strong><br>
1. The <code>WHERE mean_return >= 0.3</code> statement instructs the RDMS to keep only the rows where the mean return of the stock is at least 0.3.<br>
2. The <code>GROUP BY `GICS Sector`</code> statement instructs the RDMS to then group the retrieved data by `GICS Sector`.
3. The <code>HAVING Total_Stocks > 2;</code> statement instructs the RDMS to return only the groups with a count greater than 2 in the aggregated column "Total_Stocks". Here we could have used <code>HAVING COUNT(MyUnknownColumn) > 2;</code> and it would have referred to the same aggregated field. MySQL accepts the alias in <code>HAVING</code>; some other RDMS require the aggregate expression to be repeated.

### Example 6: <font color="red">Comparison</font> with Example 5: the same query <font color="red">without</font> the <code>WHERE</code> clause before the <code>HAVING</code> clause.

In [16]:
sql_view6 = """SELECT `GICS Sector`, COUNT(MyUnknownColumn) AS Total_Stocks 
                FROM portfolio
                GROUP BY `GICS Sector`
                HAVING Total_Stocks > 2;"""

In [17]:
# Run query
view_data6 = pd.read_sql(sql_view6, conn)
# Displaying subset of data
view_data6

Unnamed: 0,GICS Sector,Total_Stocks
0,Health Care,59
1,Industrials,69
2,Consumer Discretionary,84
3,Information Technology,68
4,Consumer Staples,36
5,Utilities,28
6,Financials,62
7,Real Estate,29
8,Materials,25
9,Energy,36


### <strong><font color="blue">Explanation:</font></strong><br>
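In this example, the <code>WHERE</code> clause was omitted. As a result, all rows were retrieved and grouped by GICS sector without first filtering on mean_return.<br>
This is a very different outcome! The counts are now per-sector totals for the whole table rather than counts of stocks with a mean return of at least 0.3.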
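
The contrast between Examples 5 and 6 can be reproduced on a small scale. This sketch assumes an in-memory SQLite database in place of MySQL; the `portfolio_demo` table, its tickers and its mean returns are invented. The `HAVING` condition repeats the aggregate expression rather than the alias, which is the more portable form.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT, mean_return REAL)"
)
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?, ?)",
    [
        ("AAA", "Energy", 0.35), ("BBB", "Energy", 0.40), ("CCC", "Energy", 0.15),
        ("DDD", "Utilities", 0.05), ("EEE", "Utilities", 0.08),
        ("FFF", "Materials", 0.31), ("GGG", "Materials", 0.12),
    ],
)

# Rows are filtered on mean_return first, then grouped, then groups are filtered.
with_where = conn_demo.execute(
    """SELECT sector, COUNT(ticker) AS total_stocks
       FROM portfolio_demo
       WHERE mean_return >= 0.3
       GROUP BY sector
       HAVING COUNT(ticker) > 1
       ORDER BY sector"""
).fetchall()

# No row filter: every row is grouped before the group filter applies.
without_where = conn_demo.execute(
    """SELECT sector, COUNT(ticker) AS total_stocks
       FROM portfolio_demo
       GROUP BY sector
       HAVING COUNT(ticker) > 1
       ORDER BY sector"""
).fetchall()

print(with_where)     # [('Energy', 2)]
print(without_where)  # [('Energy', 3), ('Materials', 2), ('Utilities', 2)]
```

With the `WHERE` clause only Energy keeps more than one qualifying stock; without it, all three sectors pass the group filter.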

# Grouping and Sorting Explained!

 - <code>GROUP BY</code> and <code>ORDER BY</code> appear to accomplish the same task, but there are major differences.
 - Grouped data may sometimes be returned in an ordered fashion, but that is not guaranteed and DOES NOT mean the two clauses work the same way.


<table>
    <tr>
        <th><code>GROUP BY</code></th>
        <th><code>ORDER BY</code></th>
    </tr>
    <tr>
        <td>Returns data that is organized by groups</td>
        <td>Sorts the returned data</td>
    </tr>
    <tr>
        <td>Can only use selected columns or expressions, and every selected column that is not aggregated must be listed</td>
        <td>May be performed on any column (even columns that are not selected)</td>
    </tr>
    <tr>
        <td>Required if selecting columns (or expressions) together with aggregate functions</td>
        <td>Not required</td>
    </tr>    
</table>
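
To make the distinction concrete, the sketch below groups rows and then sorts the groups explicitly. It assumes an in-memory SQLite database in place of MySQL, and the `portfolio_demo` table and its rows are invented. `GROUP BY` builds the groups; only the `ORDER BY` clause determines the order in which they come back.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT)")
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?)",
    [
        ("AAA", "Energy"), ("BBB", "Energy"),
        ("CCC", "Utilities"),
        ("DDD", "Materials"), ("EEE", "Materials"), ("FFF", "Materials"),
    ],
)

# GROUP BY creates one row per sector; ORDER BY sorts those rows by their count.
largest_first = conn_demo.execute(
    """SELECT sector, COUNT(ticker) AS total_stocks
       FROM portfolio_demo
       GROUP BY sector
       ORDER BY total_stocks DESC"""
).fetchall()
print(largest_first)  # [('Materials', 3), ('Energy', 2), ('Utilities', 1)]
```

Whenever the order of the groups matters, stating it with `ORDER BY` avoids relying on whatever order the database happens to produce.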

# Guideline for SELECT clause ordering

 - The following guide lists the clauses in the order in which they must appear in a query, from top to bottom.

<table>
    <tr>
        <th>Clause</th>
        <th>Description</th>
        <th>Importance</th>
    </tr>
    <tr>
        <td><code>SELECT</code></td>
        <td>Retrieves specific columns or expressions</td>
        <td>Required</td>
    </tr>
    <tr>
        <td><code>FROM</code></td>
        <td>Instructs the RDMS which table to retrieve data from</td>
        <td>Required</td>
    </tr>
    <tr>
        <td><code>WHERE</code></td>
        <td>Instructs the RDMS to filter rows from the table according to a specific condition</td>
        <td>Optional</td>
    </tr>
    <tr>
        <td><code>GROUP BY</code></td>
        <td>Instructs the RDMS to organize the data into groups</td>
        <td>Optional</td>
    </tr>
    <tr>
        <td><code>HAVING</code></td>
        <td>Instructs the RDMS to keep only the groups that meet specific criteria</td>
        <td>Optional</td>
    </tr>
    <tr>
        <td><code>ORDER BY</code></td>
        <td>Instructs the RDMS to sort the retrieved data by alphabetic or numeric criteria.</td>
        <td>Optional</td>
    </tr>    
</table>
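
As an illustration of the ordering in the table, here is a single query that uses all six clauses in sequence. It is a sketch that assumes an in-memory SQLite database in place of MySQL; the `portfolio_demo` table, its tickers and its mean returns are invented.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and rows are invented.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE portfolio_demo (ticker TEXT, sector TEXT, mean_return REAL)"
)
conn_demo.executemany(
    "INSERT INTO portfolio_demo VALUES (?, ?, ?)",
    [
        ("AAA", "Energy", 0.35), ("BBB", "Energy", 0.40), ("CCC", "Energy", 0.15),
        ("DDD", "Utilities", 0.05), ("EEE", "Utilities", 0.08),
        ("FFF", "Materials", 0.31), ("GGG", "Materials", 0.12),
    ],
)

full_query = """
    SELECT sector, COUNT(ticker) AS total_stocks  -- columns and aggregates to return
    FROM portfolio_demo                           -- source table
    WHERE mean_return >= 0.1                      -- filter rows before grouping
    GROUP BY sector                               -- build one group per sector
    HAVING COUNT(ticker) >= 2                     -- filter groups after grouping
    ORDER BY total_stocks DESC, sector            -- sort the final result
"""
ordered_groups = conn_demo.execute(full_query).fetchall()
print(ordered_groups)  # [('Energy', 3), ('Materials', 2)]
```

Utilities drops out at the `WHERE` step because none of its rows reach 0.1, so it never forms a group for `HAVING` to consider.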