# 1. Why SQL is Important to Learn

**Here are a few key reasons why learning SQL benefits anyone interested in working with data:**

**SQL is everywhere**

Almost all of the biggest names in tech use SQL — which is pronounced either “sequel” or “S.Q.L.”, by the way. Companies like Facebook, Google, and Amazon have built their own high-performance database systems, but even their data teams use SQL to query data and perform data analysis. And it’s not just tech companies: companies big and small around the globe use SQL.

**SQL enables us to pull data from many sources**

In many real-world situations, data is spread across many sources. Using SQL allows us to select specific data and transform it to fit our needs. For example, working with spreadsheets can be difficult if the data we need to answer our question is spread across many files. SQL allows us to structure our data in a way that makes all of it accessible from one place.

**`SQL data is structured into multiple, connected tables.`**

**SQL is here to stay**

As mentioned above, SQL is one of the most popular tools used by data professionals. The Stack Overflow annual Developer Survey, which is the largest and most comprehensive survey of programmers around the world, consistently reveals that SQL is one of the most popular technologies used today.

# 2. Introduction to Databases

When we work with data stored on our computers, we load the data from files like spreadsheets (and text files in several different formats). Working with files solely on our computer is fine for most occasions, but we run into problems when we consider a few questions:

* What if the data is **too big** to fit into a single spreadsheet file?
* What if you need to share the data with team members and **keep it updated**?
* What if there's sensitive information in your data that **needs protection**?

**A database** structures data much as a spreadsheet file does: data is organized into tables, which in turn are organized into rows and columns.

**A database is capable of storing much more data, and in a more secure way, than a spreadsheet or text file. Unlike opening a spreadsheet, we actually have to "ask" the database for data before we can view it.**

* We primarily interact with a database using a [database management system (DBMS)](https://en.wikipedia.org/wiki/Database), a computer program that helps users interact with data. Users do this by giving the computer instructions through the DBMS.

* We'll start learning SQL with the DBMS [SQLite](https://www.sqlite.org/index.html). SQLite is a lightweight DBMS and is often described as the most widely deployed database engine in the world.
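To make the idea of "asking" a DBMS for data concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes an in-memory database and a hypothetical table named `majors_demo` (neither is part of the course material); the two rows are copied from the output shown in these notes.

```python
import sqlite3

# Assumption: an in-memory database stands in for a real file such as jobs.db.
conn = sqlite3.connect(":memory:")

# Hypothetical two-column table, used only for illustration.
conn.execute("CREATE TABLE majors_demo (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO majors_demo VALUES (?, ?)",
    [("PETROLEUM ENGINEERING", 110000), ("CHEMICAL ENGINEERING", 65000)],
)

# Nothing is displayed until we ask the DBMS for it with a query.
rows = conn.execute("SELECT Major, Median FROM majors_demo;").fetchall()
print(rows)

conn.close()
```

The notebook cells in these notes do the same thing through the `%sql` magic, which hands the query text to SQLite and displays whatever comes back.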

# 3. Your First Query

In this course, we'll explore data from the American Community Survey on job outcome statistics based on college majors that we loaded into a SQLite database.

Head to the [dataset page](https://github.com/fivethirtyeight/data/tree/master/college-majors) and spend some time getting familiar with what each column represents.

We provide a database, jobs.db, loaded with these data into a single table named recent_grads. (In the next course, we'll learn how to work with a database containing multiple tables.)

In [1]:
%load_ext sql

In [2]:
%sql sqlite:///jobs.db

In [3]:
%%sql

SELECT *
FROM recent_grads;

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313196,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341630502,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972
5,6,2418,NUCLEAR ENGINEERING,Engineering,2573,17,2200,373,0.144966965,1857,2038,264,1449,400,0.177226407,65000,50000,102000,1142,657,244
6,7,6202,ACTUARIAL SCIENCE,Business,3777,51,832,960,0.535714286,2912,2924,296,2482,308,0.095652174,62000,53000,72000,1768,314,259
7,8,5001,ASTRONOMY AND ASTROPHYSICS,Physical Sciences,1792,10,2110,1667,0.4413555729999999,1526,1085,553,827,33,0.021167415,62000,31500,109000,972,500,220
8,9,2414,MECHANICAL ENGINEERING,Engineering,91227,1029,12953,2105,0.139792801,76442,71298,13101,54639,4650,0.0573422779999999,60000,48000,70000,52844,16384,3253
9,10,2408,ELECTRICAL ENGINEERING,Engineering,81527,631,8407,6548,0.437846874,61928,55450,12695,41413,3895,0.059173845,60000,45000,72000,45829,10874,3170


# 4. Understanding your First Query

The process by which you were able to view recent_grads can be broken down into two steps:

* Write a SQL query that expresses the request "fetch all the data in the table".
* Ask the SQLite DBMS software to run the code and display the results.

In [4]:
from IPython.display import Image
Image(url='https://s3.amazonaws.com/dq-content/252/select_breakdown.svg')

* SELECT and FROM are written in uppercase letters. This isn't required by the syntax, but it helps with making your code easier to read.
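As a rough illustration of the point about capitalization, the sketch below runs the same query twice, once with uppercase and once with lowercase keywords, and compares the results. The in-memory database and the `grads_demo` table are assumptions made for this example, not part of jobs.db.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads_demo (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO grads_demo VALUES (?, ?)",
    [("PETROLEUM ENGINEERING", 110000), ("MINING AND MINERAL ENGINEERING", 75000)],
)

# SELECT * asks for every column; FROM names the table to read.
upper_rows = conn.execute("SELECT * FROM grads_demo;").fetchall()

# The same request with lowercase keywords is accepted as well.
lower_rows = conn.execute("select * from grads_demo;").fetchall()

same_result = upper_rows == lower_rows
print(same_result)

conn.close()
```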

# 5. Previewing a Table

You may also have noticed that this table has 173 rows and over 20 columns. For a computer, these aren't that many rows, nor that many columns. For a human, however, it's hard to make sense of this much data.

In the following missions and in the next course we'll learn how to make sense of large amounts of data using SQL. For now, we'll focus on how we can preview a table without displaying it completely on the screen.

In practice, you will often find yourself having access to a database with no documentation. In this situation, you'll have to rely on the surrounding context of the database and on your exploration of it.
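One way to start exploring an undocumented SQLite database is to ask it which tables and columns it contains. The sketch below assumes an in-memory database with a hypothetical `grads_demo` table. It relies on two SQLite-specific features, the `sqlite_master` catalog table and `PRAGMA table_info`, so it would not carry over unchanged to other DBMSs.

```python
import sqlite3

# Assumption: a throwaway in-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads_demo (Major TEXT, Median INTEGER)")

# sqlite_master is SQLite's built-in catalog of tables, indexes, and views.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
]

# PRAGMA table_info returns one row per column:
# index 1 holds the column name and index 2 its declared type.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(grads_demo);")]

print(tables)
print(columns)

conn.close()
```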

# 6. The LIMIT Clause

* SQL is a database-dependent language: the SQL you're learning here (for SQLite) is not exactly the same as the SQL used by other DBMSs like PostgreSQL or Oracle.
* The main takeaway is that different SQL dialects exist, and we should watch out for differences when switching between DBMSs.

**Write a SQL query that returns the first five rows from recent_grads.**

In [5]:
%%sql

SELECT *
FROM recent_grads
LIMIT 5;

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313196,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341630502,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972
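The same preview can be sketched outside the notebook with Python's `sqlite3` module. The in-memory database and the `grads_demo` table are assumptions for this example. Without an ORDER BY clause, SQL does not promise which rows come back first, so the sketch only checks how many rows are returned.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads_demo (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO grads_demo VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000),
        ("MINING AND MINERAL ENGINEERING", 75000),
        ("METALLURGICAL ENGINEERING", 73000),
    ],
)

# LIMIT caps the number of rows the DBMS hands back.
preview = conn.execute("SELECT * FROM grads_demo LIMIT 2;").fetchall()
all_rows = conn.execute("SELECT * FROM grads_demo;").fetchall()

print(len(preview), "of", len(all_rows), "rows")

conn.close()
```

LIMIT is also a place where dialects differ: SQL Server, for example, writes `SELECT TOP 2 ...` instead.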


# 7. Selecting Specific Columns

Notice that we use commas to separate column names, and that after the last column name there is no comma. When SQL finds a comma, it expects a column to follow, so we need to make sure we don't include a comma after the last column. 
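To see what happens when a comma is left after the last column, the sketch below runs a correct query and then one with a trailing comma. It assumes an in-memory database and a hypothetical `grads_demo` table; the row is copied from the output in these notes.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads_demo (Major TEXT, ShareWomen REAL)")
conn.execute("INSERT INTO grads_demo VALUES ('PETROLEUM ENGINEERING', 0.120564344)")

# Correct: commas only between column names.
ok_rows = conn.execute("SELECT Major, ShareWomen FROM grads_demo;").fetchall()

# Incorrect: the comma after ShareWomen makes SQLite expect another column,
# but it finds FROM instead and rejects the query.
error_message = None
try:
    conn.execute("SELECT Major, ShareWomen, FROM grads_demo;")
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(ok_rows)
print(error_message)

conn.close()
```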

**Write a SQL query that returns only the Major and ShareWomen columns (in that order).**

In [6]:
%%sql

SELECT Major,ShareWomen
FROM recent_grads;

 * sqlite:///jobs.db
Done.


Major,ShareWomen
PETROLEUM ENGINEERING,0.120564344
MINING AND MINERAL ENGINEERING,0.1018518519999999
METALLURGICAL ENGINEERING,0.153037383
NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313196
CHEMICAL ENGINEERING,0.341630502
NUCLEAR ENGINEERING,0.144966965
ACTUARIAL SCIENCE,0.535714286
ASTRONOMY AND ASTROPHYSICS,0.4413555729999999
MECHANICAL ENGINEERING,0.139792801
ELECTRICAL ENGINEERING,0.437846874


# 8. Filtering Rows Using WHERE


**The SQL workflow revolves around translating the question we want to answer to the subset of data we want from the database. To determine which majors had mostly female students, we want the following subset:**

* only the Major column
* only the rows where ShareWomen is greater than 0.5 (corresponding to 50%)

Here are the comparison operators we can use:

* Less than: <
* Less than or equal to: <=
* Greater than: >
* Greater than or equal to: >=
* Equal to: =
* Not equal to: != or <>
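As a small sketch of these operators in a WHERE clause, the example below filters a hypothetical three-row table (`grads_demo`, in an in-memory database; both are assumptions). The values are copied from the output shown in these notes. It also checks that `!=` and `<>` select the same rows.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads_demo (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads_demo VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 0.120564344),
        ("ACTUARIAL SCIENCE", 0.535714286),
        ("CHEMICAL ENGINEERING", 0.341630502),
    ],
)

# Only the Major column, only rows where ShareWomen is greater than 0.5.
mostly_women = conn.execute(
    "SELECT Major FROM grads_demo WHERE ShareWomen > 0.5;"
).fetchall()

# The two spellings of "not equal to" select the same rows.
not_equal_a = conn.execute(
    "SELECT Major FROM grads_demo WHERE Major != 'ACTUARIAL SCIENCE';"
).fetchall()
not_equal_b = conn.execute(
    "SELECT Major FROM grads_demo WHERE Major <> 'ACTUARIAL SCIENCE';"
).fetchall()

print(mostly_women)
print(sorted(not_equal_a) == sorted(not_equal_b))

conn.close()
```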

## TODO:
* Write a SQL query that returns the majors in the recent_grads table where women were a minority.
  * Only return the Major and ShareWomen columns (in that order).
  * Return only the values where ShareWomen is less than 0.5.
  * Don't limit the number of rows returned.

In [7]:
%%sql

SELECT Major,ShareWomen
FROM recent_grads
WHERE ShareWomen<0.5;

 * sqlite:///jobs.db
Done.


Major,ShareWomen
PETROLEUM ENGINEERING,0.120564344
MINING AND MINERAL ENGINEERING,0.1018518519999999
METALLURGICAL ENGINEERING,0.153037383
NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313196
CHEMICAL ENGINEERING,0.341630502
NUCLEAR ENGINEERING,0.144966965
ASTRONOMY AND ASTROPHYSICS,0.4413555729999999
MECHANICAL ENGINEERING,0.139792801
ELECTRICAL ENGINEERING,0.437846874
COMPUTER ENGINEERING,0.199412643


# 9. Expressing Multiple Filter Criteria Using 'AND'

We can also use the AND operator to combine multiple filter criteria. A row is returned only if it satisfies every condition joined by AND.

## TODO:
* Write a SQL query that returns all majors that had a majority of female students and a median salary greater than 50000.
  * Only include the following columns in the results and in this order:
    * Major
    * Major_category
    * Median
    * ShareWomen
  * Return only the values where ShareWomen is greater than 0.5 and Median is greater than 50000.

In [8]:
%%sql

SELECT Major,Major_category,Median,ShareWomen
FROM recent_grads
WHERE ShareWomen>0.5
AND Median>50000;

 * sqlite:///jobs.db
Done.


Major,Major_category,Median,ShareWomen
ACTUARIAL SCIENCE,Business,62000,0.535714286
COMPUTER SCIENCE,Computers & Mathematics,53000,0.578766338


# 10. Returning One of Several Conditions With OR

## TODO:
* Write a SQL query that returns the first 20 majors that either:

  * Have a Median salary greater than or equal to 10,000, or
  * Have more men than women

* Only include the following columns in the results and in this order:

  * Major
  * Median
  * Unemployed

In [9]:
%%sql

SELECT Major,Median,Unemployed
FROM recent_grads
WHERE Median>=10000
OR Men>Women
LIMIT 20;

 * sqlite:///jobs.db
Done.


Major,Median,Unemployed
PETROLEUM ENGINEERING,110000,37
MINING AND MINERAL ENGINEERING,75000,85
METALLURGICAL ENGINEERING,73000,16
NAVAL ARCHITECTURE AND MARINE ENGINEERING,70000,40
CHEMICAL ENGINEERING,65000,1672
NUCLEAR ENGINEERING,65000,400
ACTUARIAL SCIENCE,62000,308
ASTRONOMY AND ASTROPHYSICS,62000,33
MECHANICAL ENGINEERING,60000,4650
ELECTRICAL ENGINEERING,60000,3895


# 11. Grouping Operators With Parentheses

## TODO:
* Run the query we explored above, which returns all majors that:

  * Fell under the category of Engineering, and either
    * Had mostly women graduates
    * Or had an unemployment rate below 5.1%, which was the rate in August 2015

* Only include the following columns in the results and in this order:

  * Major
  * Major_category
  * ShareWomen
  * Unemployment_rate

In [10]:
%%sql

SELECT Major,Major_category,ShareWomen,Unemployment_rate
FROM recent_grads
WHERE (Major_category='Engineering')
AND (ShareWomen>0.5 OR Unemployment_rate<0.051);

 * sqlite:///jobs.db
Done.


Major,Major_category,ShareWomen,Unemployment_rate
PETROLEUM ENGINEERING,Engineering,0.120564344,0.018380527
METALLURGICAL ENGINEERING,Engineering,0.153037383,0.024096386
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313196,0.050125313
MATERIALS SCIENCE,Engineering,0.310820285,0.023042836
ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985189,0.006334343
INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.3434732179999999,0.042875544
MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607004,0.027788805
ENVIRONMENTAL ENGINEERING,Engineering,0.558548009,0.093588575
INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.75047259,0.028308097
ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174122505,0.03365166
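Parentheses matter here because SQL evaluates AND before OR when no grouping is given. The sketch below compares the grouped and ungrouped forms of the filter on a hypothetical four-row table (`grads_demo`, in an in-memory database; both are assumptions). The values are copied from the output shown in these notes.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grads_demo "
    "(Major TEXT, Major_category TEXT, ShareWomen REAL, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO grads_demo VALUES (?, ?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering", 0.120564344, 0.018380527),
        ("NUCLEAR ENGINEERING", "Engineering", 0.144966965, 0.177226407),
        ("ACTUARIAL SCIENCE", "Business", 0.535714286, 0.095652174),
        ("ASTRONOMY AND ASTROPHYSICS", "Physical Sciences", 0.4413555729999999, 0.021167415),
    ],
)

# Grouped: Engineering AND (mostly women OR low unemployment).
grouped = {
    row[0]
    for row in conn.execute(
        "SELECT Major FROM grads_demo "
        "WHERE Major_category = 'Engineering' "
        "AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);"
    )
}

# Ungrouped: read as (Engineering AND mostly women) OR low unemployment,
# so low-unemployment majors from any category slip through.
ungrouped = {
    row[0]
    for row in conn.execute(
        "SELECT Major FROM grads_demo "
        "WHERE Major_category = 'Engineering' "
        "AND ShareWomen > 0.5 OR Unemployment_rate < 0.051;"
    )
}

print(sorted(grouped))
print(sorted(ungrouped))

conn.close()
```

In this small table, only the grouped form restricts the result to Engineering majors.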


# 12. Ordering Results Using ORDER BY

## TODO:
* Write a query that returns all majors where:

  * ShareWomen is greater than 0.3
  * And Unemployment_rate is less than 0.1

* Only include the following columns in the results and in this order:

  * Major
  * ShareWomen
  * Unemployment_rate

* Order the results in descending order by the ShareWomen column.

In [11]:
%%sql


SELECT Major,ShareWomen,Unemployment_rate
FROM recent_grads
WHERE ShareWomen>0.3
AND Unemployment_rate<0.1
ORDER BY ShareWomen DESC;

 * sqlite:///jobs.db
Done.


Major,ShareWomen,Unemployment_rate
EARLY CHILDHOOD EDUCATION,0.967998119,0.040104981
MATHEMATICS AND COMPUTER SCIENCE,0.927807246,0.0
ELEMENTARY EDUCATION,0.923745479,0.046585715
ANIMAL SCIENCES,0.91093257,0.050862499
PHYSIOLOGY,0.906677337,0.0691628
MISCELLANEOUS PSYCHOLOGY,0.90558993,0.05190783
HUMAN SERVICES AND COMMUNITY ORGANIZATION,0.904074544,0.037819026
NURSING,0.896018988,0.044862724
GEOSCIENCES,0.881293889,0.024373731
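ORDER BY sorts in ascending order unless DESC is given. The sketch below shows both directions on a hypothetical three-row table (`grads_demo`, in an in-memory database; both are assumptions). The values are copied from the output shown in these notes.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grads_demo (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO grads_demo VALUES (?, ?)",
    [
        ("NURSING", 0.896018988),
        ("GEOSCIENCES", 0.881293889),
        ("PHYSIOLOGY", 0.906677337),
    ],
)

# Ascending is the default when no direction is written.
ascending = [
    row[0]
    for row in conn.execute("SELECT Major FROM grads_demo ORDER BY ShareWomen;")
]

# DESC reverses the order, putting the largest value first.
descending = [
    row[0]
    for row in conn.execute("SELECT Major FROM grads_demo ORDER BY ShareWomen DESC;")
]

print(ascending)
print(descending)

conn.close()
```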


# 13. Practice Writing a Query

**Which Engineering and Physical Sciences majors had the lowest unemployment rates?**

In [13]:
%%sql

SELECT Major_category, Major, Unemployment_rate 
FROM recent_grads 
WHERE Major_category='Engineering' 
OR Major_category='Physical Sciences' 
ORDER BY Unemployment_rate;

 * sqlite:///jobs.db
Done.


Major_category,Major,Unemployment_rate
Engineering,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334343
Engineering,PETROLEUM ENGINEERING,0.018380527
Physical Sciences,ASTRONOMY AND ASTROPHYSICS,0.021167415
Physical Sciences,ATMOSPHERIC SCIENCES AND METEOROLOGY,0.022228555
Engineering,MATERIALS SCIENCE,0.023042836
Engineering,METALLURGICAL ENGINEERING,0.024096386
Physical Sciences,GEOSCIENCES,0.024373731
Engineering,MATERIALS ENGINEERING AND MATERIALS SCIENCE,0.027788805
Engineering,INDUSTRIAL PRODUCTION TECHNOLOGIES,0.028308097
Engineering,ENGINEERING AND INDUSTRIAL MANAGEMENT,0.03365166


In this mission, we became familiar with a dataset stored in a SQLite table by learning how to craft basic SQL queries. The queries we learned here, and that we'll keep studying for the next few courses, belong to one of the four types of SQL commands:

* [Data query language](https://en.wikipedia.org/wiki/Data_query_language)
* [Data definition language](https://en.wikipedia.org/wiki/Data_definition_language)
* [Data control language](https://en.wikipedia.org/wiki/Data_control_language)
* [Data manipulation language](https://en.wikipedia.org/wiki/Data_manipulation_language)
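As a rough sketch of how the four categories differ, the example below issues one statement of each kind through Python's `sqlite3` module. The in-memory database, the `majors_demo` table, and the `some_user` name are all hypothetical. SQLite does not implement data control statements such as GRANT, so the sketch expects that last statement to be rejected.

```python
import sqlite3

# Assumption: a throwaway in-memory database.
conn = sqlite3.connect(":memory:")

# Data definition language: defines the structure that holds the data.
conn.execute("CREATE TABLE majors_demo (Major TEXT, Median INTEGER)")

# Data manipulation language: adds or changes rows.
conn.execute("INSERT INTO majors_demo VALUES ('NUCLEAR ENGINEERING', 65000)")

# Data query language: reads rows back, as in every query in these notes.
rows = conn.execute("SELECT Major, Median FROM majors_demo;").fetchall()

# Data control language: manages permissions. SQLite has no GRANT statement,
# so this is expected to fail here; a server DBMS would typically accept it.
try:
    conn.execute("GRANT SELECT ON majors_demo TO some_user;")
    dcl_supported = True
except sqlite3.OperationalError:
    dcl_supported = False

print(rows)
print(dcl_supported)

conn.close()
```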