### *“Success is stumbling from failure to failure with no loss of enthusiasm.”* -- Winston Churchill

We heart SQL!

----

### YOUR NAME HERE:  Brandon Rank

#### List the names of anyone you helped here: Alex Rodriguez, Luke Faro

#### List the names of anyone who helped you here: Luke Faro
----

# HW SQL 1: Selecting Columns and Filtering Rows

**Be sure to use your SQL cheat sheet to help you in this homework!**

The goal of this homework is to practice selecting columns and filtering rows using these concepts:

* Count(\*) vs Count(column_name)
* LIMIT
* DISTINCT
* LIKE
* AND
* OR
* Mixing ANDs/ORs
* IN
* BETWEEN
* IS NULL



## PostgreSQL vs. SQLite

First, let me give you a little more background about PostgreSQL and SQLite.

There are numerous Relational Database Management Systems (RDBMS) available in the market, including MySQL, PostgreSQL, SQLite, etc.  You have been using PostgreSQL in the DataCamp tutorials.  

Most Relational DB engines like PostgreSQL rely on Servers, i.e., they send requests to a Host Server and receive a response with the desired data. However, SQLite is a Serverless and self-contained RDBMS. This means that a SQLite Database engine can operate from within the software that is accessing the data (so inside of this notebook file in fact).

**The point is:**  SQLite suits our needs well because it lets us practice SQL easily in a Jupyter Notebook, with no additional software, using basic csv files.

So while you'll be using PostgreSQL in DataCamp, you'll be using SQLite when using notebook files.  The good news is that both PostgreSQL and SQLite follow much of the standard SQL defined by ISO/ANSI. Hence, for our needs, they look basically the same.
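
To make the "serverless" point concrete, here is a minimal sketch (not part of the assignment) that creates a throwaway database entirely in memory using only Python's built-in `sqlite3` module. The table name `demo`, the variable names, and the two rows are made up for illustration.

```python
import sqlite3

# ':memory:' asks SQLite for a temporary database that lives in RAM.
# No server process is started and no file is written.
demo_conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, purely for illustration.
demo_conn.execute("CREATE TABLE demo (state TEXT, donation REAL)")
demo_conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("PA", 50.0), ("NJ", 25.0)],
)

demo_sql = """SELECT state, donation
              FROM demo
              ORDER BY state;"""
rows = demo_conn.execute(demo_sql).fetchall()
print(rows)

demo_conn.close()
```

Everything happens inside the Python process that runs the cell, which is why no separate database software is needed for this homework.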

## How to Create a SQLite database from a csv file

To use SQLite, we need to import:

* The sqlite3 package - A Python package we need to use SQLite.
* The pandas and numpy packages - Python packages that allow us to process a dataframe, which is essentially the table returned from our queries.

### !!!!!!!!!!!!!!!!!!!!!!So please run the cell below!!!!!!!!!!!!!!!!!!!!

In [None]:
import pandas as pd
import numpy as np
import sqlite3 as sql

Now we will read in a csv file as usual, create/connect to a database, and transform the csv file into a database.

**Read through the code comments below so you have an idea of what's going on but don't panic. By no means do I expect you to memorize this code.  I will always provide it.**

**When done reading the cell, run the code in the cell.**

In [None]:
#Read in a csv with Hearts On Fire data
hearts_on_fire = pd.read_csv('HoF_2020_2021.csv')

#Make a connection to a database file ('hearts_on_fire.db').
#If the file does not yet exist, it will be created.
conn = sql.connect('hearts_on_fire.db')

#The to_sql method writes records stored in a DataFrame (table) to a SQLite database.
#It requires 2 inputs to run:
#     the name of TABLE inside the database file
#     and a connection to the database where the table lives.
#So essentially, the code below dumps the contents of the hearts_on_fire dataframe
#     into a table called hearts_on_fire in the hearts_on_fire.db file.

try:
    hearts_on_fire.to_sql('hearts_on_fire', conn, index = False)
except ValueError:
    print("""Dr. R Note: A ValueError occurred. That's probably fine and likely just means
           that you've run this cell twice and you're getting an error
           because the DB is already created.  :) """)
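
Conceptually, the cell above reads rows from a csv and writes them into a database table. The following is a rough, standard-library-only sketch of that idea. It is not how pandas implements `to_sql`, and the inline csv text, the table name `tiny`, and the variable names are made up so the example runs without the homework file.

```python
import csv
import io
import sqlite3

# Made-up csv text standing in for a file on disk.
csv_text = "donation,state\n50.0,CA\n37.0,PA\n"

# Read each csv row into a (donation, state) tuple.
reader = csv.DictReader(io.StringIO(csv_text))
records = [(float(r["donation"]), r["state"]) for r in reader]

# Create a temporary in-memory database and load the rows into a table.
conn_sketch = sqlite3.connect(":memory:")
conn_sketch.execute("CREATE TABLE tiny (donation REAL, state TEXT)")
conn_sketch.executemany("INSERT INTO tiny VALUES (?, ?)", records)

row_count = conn_sketch.execute("SELECT COUNT(*) FROM tiny").fetchone()[0]
print(row_count)

conn_sketch.close()
```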

## How to run SQL queries

Here's how you run a SQL query against the SQLite database we just created.

* Create a string variable containing the SQL statement.  *Use triple quotes to do this.*  We like triple quotes because they allow the SQL statement to span multiple lines.
* Call ```pd.read_sql_query``` and pass in the sql statement as well as the connection (conn) to the database.
* Store the results into a dataframe.

**Remember:  By convention we capitalize SQL keywords and we line up SELECT/FROM/WHERE, putting the keywords on different lines for ease of reading, like below. Points will be taken off for not doing this.**

    """SELECT column1_name, column2_name
      FROM table_name
      WHERE condition;"""

Let's practice.  The code below selects all rows from the hearts on fire table.  Run the cell.

In [None]:
#Create the sql statement between the triple quotes.
sql_statement = """SELECT *
                   FROM hearts_on_fire;"""

#Run the query, passing it and the connection (conn) into the needed method.
#You get a dataframe with the results of the query.
query_results = pd.read_sql_query(sql_statement, conn)

#Print the dataframe results.
query_results

Unnamed: 0,donation,state,affiliation,class_year,preferred_division,date_submitted,donation_year
0,50.0,CA,Alumni,1997.0,Performing Arts,1/29/2020 10:57,2020
1,37.0,PA,Student,2020.0,Business,2/12/2020 21:31,2020
2,25.0,PA,Alumni,,,2/10/2020 15:10,2020
3,10.0,PA,Student,2022.0,Science/Math,2/12/2020 17:53,2020
4,50.0,PA,Family,,,2/12/2020 15:47,2020
...,...,...,...,...,...,...,...
1779,100.0,NY,Alumni,2019.0,Liberal Arts/Social,2/8/2021 11:53,2021
1780,20.0,FL,Family,2023.0,,2/9/2021 11:44,2021
1781,100.0,WV,Alumni,1970.0,Science/Math,2/10/2021 15:01,2021
1782,250.0,DE,Alumni,2007.0,Liberal Arts/Social,2/10/2021 18:23,2021
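
When you only want to peek at a table, `LIMIT` (one of the concepts listed at the top) avoids pulling back every row. Here is a minimal sketch on a made-up table called `gifts`; the table, its rows, and the variable names are assumptions for illustration and use plain `sqlite3` rather than pandas.

```python
import sqlite3

preview_conn = sqlite3.connect(":memory:")

# Hypothetical table with four made-up rows.
preview_conn.execute("CREATE TABLE gifts (donation REAL, state TEXT)")
preview_conn.executemany(
    "INSERT INTO gifts VALUES (?, ?)",
    [(50.0, "CA"), (37.0, "PA"), (25.0, "PA"), (10.0, "PA")],
)

# LIMIT 2 returns at most two rows, no matter how large the table is.
preview_sql = """SELECT *
                 FROM gifts
                 LIMIT 2;"""
preview_rows = preview_conn.execute(preview_sql).fetchall()
print(len(preview_rows))

preview_conn.close()
```

Without an `ORDER BY`, SQL does not promise which rows come back, only how many.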


## Understanding the Data

As you can see, our ```hearts_on_fire``` table has 1784 rows.  Hearts on Fire is a campaign that DSU runs every year to raise donations.

Here's what each of the columns means.

* donation:  the amount the contributor contributed
* state:  where the contributor lives
* affiliation: Alumni, (current) student, family member, etc.
* class_year:  the year the person graduated.
  * **Notice that some years are NaN - this is the Python way to say NULL.**
* preferred_division:  where the contributor wishes to send their funds.
  * **Notice that some divisions are None - this is another Python way to say NULL.**
* date_submitted: the date and time the contributor made the donation
* donation_year: the year of the Hearts on Fire campaign
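
Because missing values are stored as NULL in the database, they need `IS NULL` rather than `=` to be found. The sketch below illustrates this on a made-up table called `people`; the table, its rows, and the variable names are assumptions for illustration.

```python
import sqlite3

null_conn = sqlite3.connect(":memory:")

# Hypothetical table; Python's None is stored as SQL NULL.
null_conn.execute("CREATE TABLE people (name TEXT, class_year REAL)")
null_conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 1997.0), ("Bob", None), ("Cy", None)],
)

is_null_sql = """SELECT COUNT(*)
                 FROM people
                 WHERE class_year IS NULL;"""
equals_null_sql = """SELECT COUNT(*)
                     FROM people
                     WHERE class_year = NULL;"""

# IS NULL finds the missing values; '= NULL' is never true, so it matches nothing.
is_null_count = null_conn.execute(is_null_sql).fetchone()[0]
equals_null_count = null_conn.execute(equals_null_sql).fetchone()[0]
print(is_null_count, equals_null_count)

null_conn.close()
```

Here `IS NULL` counts the two rows with a missing year, while `= NULL` counts none.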

# QUESTION 1

What affiliations are possible?  List each exactly once.

**Notice:**  None/Null will not be listed in the options and so from this we can deduce that all records/contributors must have an affiliation.

In [None]:
# CodeGrade Tag: question_1

sql_statement1 = """SELECT DISTINCT(affiliation)
                    FROM hearts_on_fire;"""

query_results1 = pd.read_sql_query(sql_statement1, conn)
query_results1

Unnamed: 0,affiliation
0,Alumni
1,Student
2,Family
3,Employee
4,Friend


# QUESTION 2

For the 2021 campaign, how many contributors were there?

In [None]:
# CodeGrade Tag: question_2

sql_statement2 = """SELECT COUNT(donation)
                    FROM hearts_on_fire
                    WHERE donation_year = 2021; """

query_results2 = pd.read_sql_query(sql_statement2, conn)
query_results2

Unnamed: 0,COUNT(donation)
0,929


# QUESTION 3

Select all contributors who were family members of someone affiliated with DSU.  

Include only the state, donation, and affiliation of these contributors.


In [None]:
# CodeGrade Tag: question_3

sql_statement3 =  """SELECT state, donation, affiliation
                     FROM hearts_on_fire
                     WHERE affiliation = 'Family';"""

query_results3 = pd.read_sql_query(sql_statement3, conn)
query_results3

Unnamed: 0,state,donation,affiliation
0,PA,50.0,Family
1,NJ,100.0,Family
2,PA,20.0,Family
3,NJ,25.0,Family
4,NJ,25.0,Family
...,...,...,...
273,DE,50.0,Family
274,PA,40.0,Family
275,PA,50.0,Family
276,PA,50.0,Family


# QUESTION 4

You want to know if all 50 states are represented.

Write a query to calculate how many states are represented in the data set.

You can count DC as a state.

In [None]:
# CodeGrade Tag: question_4

sql_statement4 = """ SELECT COUNT(DISTINCT(state))
                     FROM hearts_on_fire;"""

query_results4 = pd.read_sql_query(sql_statement4, conn)
query_results4

Unnamed: 0,COUNT(DISTINCT(state))
0,39


# QUESTION 5

Select just the donation amounts and class years of all alumni from the 70s.

In [None]:
# CodeGrade Tag: question_5

sql_statement5 = """ SELECT donation, class_year
                     FROM hearts_on_fire
                     WHERE class_year BETWEEN 1970 AND 1979; """

query_results5 = pd.read_sql_query(sql_statement5, conn)
query_results5

Unnamed: 0,donation,class_year
0,50.0,1975.0
1,100.0,1975.0
2,100.0,1971.0
3,100.0,1971.0
4,75.0,1970.0
5,3000.0,1970.0
6,250.0,1975.0
7,100.0,1973.0
8,50.0,1974.0
9,50.0,1974.0


# QUESTION 6

How many current students from out of state contributed?

In [None]:
# CodeGrade Tag: question_6

sql_statement6 = """ SELECT COUNT(*)
                     FROM hearts_on_fire
                     WHERE (state IS NOT 'PA')
                     AND (affiliation = 'Student'); """

query_results6 = pd.read_sql_query(sql_statement6, conn)
query_results6

Unnamed: 0,COUNT(*)
0,49


# QUESTION 7

How many alumni are from the tri-state area, which consists of NY, NJ, and PA?

In [None]:
# CodeGrade Tag: question_7

sql_statement7 = """ SELECT COUNT(*)
                     FROM hearts_on_fire
                     WHERE (affiliation = 'Alumni')
                     AND (state IN ('NY', 'NJ', 'PA')); """

query_results7 = pd.read_sql_query(sql_statement7, conn)
query_results7

Unnamed: 0,COUNT(*)
0,741


# QUESTION 8

How many Employees chose to send their donations specifically to the Healthcare division?

In [None]:
# CodeGrade Tag: question_8

sql_statement8 = """ SELECT COUNT(*)
                     FROM hearts_on_fire
                     WHERE (affiliation = 'Employee')
                     AND (preferred_division = 'Healthcare');"""

query_results8 = pd.read_sql_query(sql_statement8, conn)
query_results8

Unnamed: 0,COUNT(*)
0,4


# QUESTION 9

List all donations that fall between 100-1000 INCLUSIVE.  Include all columns.

In [None]:
# CodeGrade Tag: question_9

sql_statement9 = """ SELECT *
                     FROM hearts_on_fire
                     WHERE donation BETWEEN 100 AND 1000;"""

query_results9 = pd.read_sql_query(sql_statement9, conn)

query_results9

Unnamed: 0,donation,state,affiliation,class_year,preferred_division,date_submitted,donation_year
0,250.0,PA,Alumni,,,2/11/2020 23:06,2020
1,100.0,DE,Alumni,2018.0,,2/11/2020 12:47,2020
2,100.0,NJ,Family,,,2/7/2020 16:29,2020
3,100.0,PA,Friend,,,2/12/2020 9:48,2020
4,250.0,PA,Alumni,1993.0,,1/16/2020 15:02,2020
...,...,...,...,...,...,...,...
702,100.0,PA,Employee,,,2/8/2021 11:54,2021
703,100.0,PA,Alumni,2004.0,Science/Math,2/9/2021 21:04,2021
704,100.0,NY,Alumni,2019.0,Liberal Arts/Social,2/8/2021 11:53,2021
705,100.0,WV,Alumni,1970.0,Science/Math,2/10/2021 15:01,2021


# QUESTION 10

List all donations that fall between 100-1000 EXCLUSIVE.  Include all columns.

Notice how many fewer there are than in the INCLUSIVE case.  This is because 100 dollars and 1000 dollars are popular whole number amounts to give.

In [None]:
# CodeGrade Tag: question_10

sql_statement10 = """SELECT *
                     FROM hearts_on_fire
                     WHERE (donation > 100) AND (donation < 1000);"""

query_results10 = pd.read_sql_query(sql_statement10, conn)
query_results10

Unnamed: 0,donation,state,affiliation,class_year,preferred_division,date_submitted,donation_year
0,250.0,PA,Alumni,,,2/11/2020 23:06,2020
1,250.0,PA,Alumni,1993.0,,1/16/2020 15:02,2020
2,500.0,IL,Alumni,2004.0,Business,2/12/2020 15:56,2020
3,400.0,PA,Alumni,1994.0,Business,2/10/2020 14:52,2020
4,250.0,NY,Alumni,2016.0,Science/Math,2/11/2020 12:19,2020
...,...,...,...,...,...,...,...
290,250.0,PA,Employee,,,2/10/2021 19:26,2021
291,500.0,ME,Alumni,1969.0,,2/1/2021 13:44,2021
292,250.0,PA,Alumni,1998.0,Science/Math,2/10/2021 18:19,2021
293,200.0,PA,Employee,,,2/9/2021 11:24,2021
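
The difference between the inclusive and exclusive versions comes down to how the endpoints are treated. Here is a small sketch on a made-up table called `gifts` with five invented donation amounts; none of these names or values come from the homework data.

```python
import sqlite3

range_conn = sqlite3.connect(":memory:")

# Hypothetical donations, including both endpoints 100 and 1000.
range_conn.execute("CREATE TABLE gifts (donation REAL)")
range_conn.executemany(
    "INSERT INTO gifts VALUES (?)",
    [(50.0,), (100.0,), (250.0,), (1000.0,), (3000.0,)],
)

inclusive_sql = """SELECT donation
                   FROM gifts
                   WHERE donation BETWEEN 100 AND 1000
                   ORDER BY donation;"""
exclusive_sql = """SELECT donation
                   FROM gifts
                   WHERE (donation > 100) AND (donation < 1000)
                   ORDER BY donation;"""

# BETWEEN keeps the endpoints; the strict comparisons drop them.
inclusive = [row[0] for row in range_conn.execute(inclusive_sql)]
exclusive = [row[0] for row in range_conn.execute(exclusive_sql)]
print(inclusive)
print(exclusive)

range_conn.close()
```

In this toy table, `BETWEEN` returns 100, 250, and 1000, while the strict comparisons return only 250.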


# QUESTION 11

List the states and amounts of all 2020 contributors who donated less than $100.

In [None]:
# CodeGrade Tag: question_11

sql_statement11 = """SELECT state, donation
                     FROM hearts_on_fire
                     WHERE (donation_year = 2020) AND (donation < 100);"""

query_results11 = pd.read_sql_query(sql_statement11, conn)
query_results11

Unnamed: 0,state,donation
0,CA,50.0
1,PA,37.0
2,PA,25.0
3,PA,10.0
4,PA,50.0
...,...,...
545,NY,50.0
546,PA,3.0
547,FL,25.0
548,PA,50.0


# QUESTION 12

How many contributors come from a state that starts with a vowel?

In [None]:
# CodeGrade Tag: question_12

sql_statement12 = """ SELECT COUNT(*)
                      FROM hearts_on_fire
                      WHERE (state LIKE 'A%')
                         OR (state LIKE 'E%')
                         OR (state LIKE 'I%')
                         OR (state LIKE 'O%')
                         OR (state LIKE 'U%');"""

query_results12 = pd.read_sql_query(sql_statement12, conn)
query_results12

Unnamed: 0,COUNT(*)
0,34
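
One thing worth knowing when moving between DataCamp and this notebook: in SQLite, `LIKE` is case-insensitive for ASCII letters by default, while plain `=` and `IN` comparisons are case-sensitive. The sketch below shows the difference on a made-up table called `places` that deliberately contains one lowercase state code; the table, rows, and variable names are assumptions for illustration.

```python
import sqlite3

vowel_conn = sqlite3.connect(":memory:")

# Hypothetical state codes; 'ut' is lowercase on purpose.
vowel_conn.execute("CREATE TABLE places (state TEXT)")
vowel_conn.executemany(
    "INSERT INTO places VALUES (?)",
    [("AL",), ("OH",), ("PA",), ("IL",), ("NY",), ("ut",)],
)

like_sql = """SELECT COUNT(*)
              FROM places
              WHERE state LIKE 'A%'
                 OR state LIKE 'E%'
                 OR state LIKE 'I%'
                 OR state LIKE 'O%'
                 OR state LIKE 'U%';"""
substr_sql = """SELECT COUNT(*)
                FROM places
                WHERE substr(state, 1, 1) IN ('A', 'E', 'I', 'O', 'U');"""

# LIKE matches 'ut' as well; the IN comparison on the first letter does not.
like_count = vowel_conn.execute(like_sql).fetchone()[0]
substr_count = vowel_conn.execute(substr_sql).fetchone()[0]
print(like_count, substr_count)

vowel_conn.close()
```

If every state code in the data is uppercase, the two approaches give the same count.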


# QUESTION 13

How many contributors did NOT specify a division to contribute to?

In [None]:
# CodeGrade Tag: question_13

sql_statement13 = """ SELECT COUNT(*)
                      FROM hearts_on_fire
                      WHERE preferred_division IS NULL; """

query_results13 = pd.read_sql_query(sql_statement13, conn)
query_results13

Unnamed: 0,COUNT(*)
0,936


# QUESTION 14

The US Census lists the following as the 4 main regions of the US.  

* Northeast Region
  * CT, ME, MA, NH, RI, VT, NJ, NY, PA
* Midwest Region
  * IL, IN, MI, OH, WI, IA, KS, MN, MO, NE, ND, SD
* South Region
  * DE, DC, FL, GA, MD, NC, SC, VA, WV, AL, KY, MS, TN, AR, LA, OK, TX
* West Region
  * AZ, CO, ID, MT, NV, NM, UT, WY, AK, CA, HI, OR, WA

(ASIDE: There are also 9 official divisions.  For example, New England is a subset of the Northeast region.)

Write 4 queries to tell us what % of contributors came from each of these 4 regions.   You will of course see that the NE region has the highest percentage because most of our students are local.

In [None]:
#Put code for Northeast here

# CodeGrade Tag: question_14A

sql_statement14A = """ SELECT COUNT(state) *100.0/1784
                       FROM hearts_on_fire
                       WHERE state IN ('CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA');"""

query_results14A = pd.read_sql_query(sql_statement14A, conn)
query_results14A

Unnamed: 0,COUNT(state) *100.0/1784
0,85.762332


In [None]:
#Put code for Midwest here
# CodeGrade Tag: question_14B

sql_statement14B = """ SELECT COUNT(state) *100.0/1784
                       FROM hearts_on_fire
                       WHERE state IN ('IL', 'IN', 'MI', 'OH', 'WI', 'IA', 'KS', 'MN', 'MO', 'NE', 'ND', 'SD');"""

query_results14B = pd.read_sql_query(sql_statement14B, conn)
query_results14B

Unnamed: 0,COUNT(state) *100.0/1784
0,1.961883


In [None]:
# Put code for West region Here
# CodeGrade Tag: question_14C

sql_statement14C = """ SELECT COUNT(state) *100.0/1784
                       FROM hearts_on_fire
                       WHERE state IN ('AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA');"""

query_results14C = pd.read_sql_query(sql_statement14C, conn)
query_results14C

Unnamed: 0,COUNT(state) *100.0/1784
0,3.08296


In [None]:
# Put code for South region here

# CodeGrade Tag: question_14D

sql_statement14D = """ SELECT COUNT(state) *100.0/1784
                       FROM hearts_on_fire
                       WHERE state IN ('DE', 'DC', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX');"""

query_results14D = pd.read_sql_query(sql_statement14D, conn)
query_results14D

Unnamed: 0,COUNT(state) *100.0/1784
0,9.024664
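
The queries above divide by the known row count, 1784. As a possible alternative, here is a sketch that lets a subquery compute the total instead, so the percentage stays correct if rows are added. Subqueries may be beyond what this homework covers, and the table `gifts`, its four rows, and the variable names are made up for illustration.

```python
import sqlite3

pct_conn = sqlite3.connect(":memory:")

# Hypothetical table: three Northeast contributors and one from the Midwest.
pct_conn.execute("CREATE TABLE gifts (state TEXT)")
pct_conn.executemany(
    "INSERT INTO gifts VALUES (?)",
    [("PA",), ("PA",), ("NJ",), ("OH",)],
)

# The subquery in parentheses counts every row in the table.
# Multiplying by 100.0 (not 100) forces decimal division.
pct_sql = """SELECT COUNT(*) * 100.0 / (SELECT COUNT(*) FROM gifts)
             FROM gifts
             WHERE state IN ('NJ', 'NY', 'PA');"""
northeast_pct = pct_conn.execute(pct_sql).fetchone()[0]
print(northeast_pct)

pct_conn.close()
```

With three of the four made-up rows in the Northeast list, the result is 75.0.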


# QUESTION 15

Below are 2 DataCamp queries.   Even though ```COUNT(*)``` and ```COUNT(column_name)``` behave differently, the following queries return the same value.   If we thought hard enough about these queries, we could have guessed that the return values HAD to be the same.

Explain why it is sensible that they do not return different values.

| |
|-------|
| ```SELECT COUNT(*)``` <br/> ```FROM films ``` <br/> ```WHERE release_year = 2000;``` |
| ```SELECT COUNT(release_year) ``` <br/> ```FROM films ``` <br/> ```WHERE release_year = 2000;``` |

It makes sense that these two queries return the same value because both look only at rows where release_year is 2000. COUNT(*) counts every row that passes the filter, including rows that contain NULLs, whereas COUNT(release_year) counts only the rows where release_year is not NULL. However, in this instance, the WHERE clause already eliminates every row with a NULL release_year, because a NULL can never equal 2000. As a result, both queries count exactly the same rows.
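
This reasoning can be checked on a toy example. The sketch below uses a made-up four-row `films` table (not the DataCamp data) in which one film has no release year; the rows and variable names are assumptions for illustration.

```python
import sqlite3

films_conn = sqlite3.connect(":memory:")

# Hypothetical films; film 'C' has a NULL release_year.
films_conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
films_conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 2000), ("B", 2000), ("C", None), ("D", 1999)],
)

def run_count(sql_statement):
    """Run a single-value COUNT query and return the number."""
    return films_conn.execute(sql_statement).fetchone()[0]

# Without a WHERE clause the two counts differ because of the NULL.
star_all = run_count("""SELECT COUNT(*)
                        FROM films;""")
column_all = run_count("""SELECT COUNT(release_year)
                          FROM films;""")

# With the WHERE clause the NULL row is already filtered out.
star_2000 = run_count("""SELECT COUNT(*)
                         FROM films
                         WHERE release_year = 2000;""")
column_2000 = run_count("""SELECT COUNT(release_year)
                           FROM films
                           WHERE release_year = 2000;""")

print(star_all, column_all, star_2000, column_2000)
films_conn.close()
```

The unfiltered counts are 4 and 3, while both filtered counts are 2.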



# QUESTION 16

Did you follow the conventions expected?  For this question I will give/take away points for doing so.

----

## Closing the DB connection.

Throughout this whole file, we've kept our connection to the Database (the hearts_on_fire.db file) open.  

We must close the connection when done.

Run the following cell when you are done working on this HW.


In [None]:
#Close the connection to the database
conn.close()
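
As an aside, Python can also close a connection automatically. The sketch below, which is optional and not required for this homework, wraps a throwaway in-memory connection in `contextlib.closing` so that it is closed as soon as the `with` block ends. The variable names are made up for illustration.

```python
import sqlite3
from contextlib import closing

# closing(...) calls .close() on the connection when the with-block ends.
with closing(sqlite3.connect(":memory:")) as tmp_conn:
    answer = tmp_conn.execute("SELECT 1 + 1").fetchone()[0]

# Using a closed connection raises sqlite3.ProgrammingError.
try:
    tmp_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)
```

Note that using a sqlite3 connection directly in a `with` statement manages transactions but does not close the connection, which is why `closing` is used here.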

---
All done! Good job!