# Benefits of SQL for Data Analysis
* SQL is a powerful query language that lets data analysts interact with data stored in relational databases. Many companies have built proprietary tools on top of SQL to fetch information from their databases quickly.
* This data-driven approach lets organizations base critical business decisions on meaningful information. Analyzing the historical data held in their databases helps them identify trends and patterns and forecast business goals.

### Importing required libraries

In [1]:
import pandas as pd
import sqlite3 as sql

### Connecting to the Database
* A connection to an SQLite database is established with the connect() function, which takes the path of the database file as a parameter. If the file does not exist, SQLite creates a new, empty database at that path.

In [2]:
connection = sql.connect('jobs.db')
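
As a minimal sketch of the same connect() call, the example below opens a throwaway in-memory database (the special name `:memory:`) instead of jobs.db and closes the connection when it is done. The `demo` table and its single row are made up for illustration.

```python
import sqlite3
from contextlib import closing

# ":memory:" creates a temporary in-memory database, so this sketch
# does not touch jobs.db or any other file on disk.
with closing(sqlite3.connect(":memory:")) as demo_connection:
    demo_connection.execute("CREATE TABLE demo (major TEXT, total INTEGER)")
    demo_connection.execute("INSERT INTO demo VALUES ('ZOOLOGY', 8409)")
    rows = demo_connection.execute("SELECT major, total FROM demo").fetchall()

print(rows)
```

`contextlib.closing` is one way to make sure the connection is closed even if a query fails part-way through.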

In [3]:
connection

<sqlite3.Connection at 0x1afa3635b70>

# SQLite Queries

### Defining a function to read queries

In [4]:
def read(query):
    return pd.read_sql_query(query, connection)
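
When a query depends on a value supplied at run time, it is usually safer to bind that value with a `?` placeholder than to format it into the SQL string. The sketch below shows the idea with the standard `sqlite3` module only; `read_rows` and the `grads` table are hypothetical stand-ins, not part of this notebook. `pd.read_sql_query` accepts a `params` argument that can serve the same purpose.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major TEXT, major_category TEXT)")
demo.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("ZOOLOGY", "Biology & Life Science"), ("NURSING", "Health")],
)

def read_rows(query, params=()):
    """Hypothetical helper: run a query and return all rows as tuples."""
    return demo.execute(query, params).fetchall()

# The ? placeholder is filled in by sqlite3, not by string formatting.
health = read_rows(
    "SELECT major FROM grads WHERE major_category = ?", ("Health",)
)
print(health)
```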

### The Schema Table
* Every SQLite database contains a single "schema table" that stores the schema for that database. The schema is a description of all the other tables, indexes, triggers, and views contained in the database.
* The schema table holds one row for each table, index, view, and trigger (collectively "objects") in the schema, except that there is no entry for the sqlite_schema table itself.
* The schema table can be referenced by the name "sqlite_schema" in SQLite 3.33.0 and later. The older name "sqlite_master", used in the query below, is still recognized for compatibility.
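
To make the schema table concrete, here is a small sketch that creates one table and one index in an in-memory database and then lists the resulting schema rows. The `grads` table and `ix_grads_total` index are invented for the example.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major TEXT, total INTEGER)")
demo.execute("CREATE INDEX ix_grads_total ON grads (total)")

# One row per object; tbl_name tells which table an index belongs to.
objects = demo.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
).fetchall()

for obj_type, name, tbl_name in objects:
    print(obj_type, name, tbl_name)
```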

### Select Data from Table
* The SELECT statement retrieves data from an SQLite table and returns the rows it contains. The first query reads the schema table itself.

In [5]:
read("""
SELECT * FROM sqlite_master
""")

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,recent_grads,recent_grads,2,"CREATE TABLE ""recent_grads"" (\n""index"" INTEGER..."
1,index,ix_recent_grads_index,recent_grads,3,"CREATE INDEX ""ix_recent_grads_index""ON ""recent..."


### Selecting single column from table

In [6]:
read('SELECT Major_category FROM recent_grads')

Unnamed: 0,Major_category
0,Engineering
1,Engineering
2,Engineering
3,Engineering
4,Engineering
...,...
168,Biology & Life Science
169,Psychology & Social Work
170,Psychology & Social Work
171,Psychology & Social Work


### Selecting multiple columns from table

In [7]:
read('SELECT Major_category,Major FROM recent_grads')

Unnamed: 0,Major_category,Major
0,Engineering,PETROLEUM ENGINEERING
1,Engineering,MINING AND MINERAL ENGINEERING
2,Engineering,METALLURGICAL ENGINEERING
3,Engineering,NAVAL ARCHITECTURE AND MARINE ENGINEERING
4,Engineering,CHEMICAL ENGINEERING
...,...,...
168,Biology & Life Science,ZOOLOGY
169,Psychology & Social Work,EDUCATIONAL PSYCHOLOGY
170,Psychology & Social Work,CLINICAL PSYCHOLOGY
171,Psychology & Social Work,COUNSELING PSYCHOLOGY


### Selecting all the columns from table

In [8]:
read('SELECT * FROM recent_grads').head()

Unnamed: 0,index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.101852,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341631,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


### Column Alias
* An alias temporarily gives a table or a column another name.
* A table alias renames a table within a particular SQLite statement.
* The renaming is temporary: the actual table and column names in the database do not change.
* A column alias renames a column in the result of a particular SQLite query.
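
The points above cover both table and column aliases. As a sketch on a made-up `grads` table, the query below uses the table alias `g` and two column aliases, then checks that the stored column names are unchanged.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major TEXT, major_category TEXT)")
demo.execute("INSERT INTO grads VALUES ('ZOOLOGY', 'Biology & Life Science')")

# "g" is a table alias; "major_name" and "category" are column aliases.
cursor = demo.execute(
    "SELECT g.major AS major_name, g.major_category AS category "
    "FROM grads AS g"
)
column_names = [column[0] for column in cursor.description]

# The table definition itself still uses the original column names.
table_columns = [row[1] for row in demo.execute("PRAGMA table_info(grads)")]

print(column_names, table_columns)
```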

In [9]:
read('SELECT Major_category AS "MAJOR CATEGORY" FROM recent_grads')

Unnamed: 0,MAJOR CATEGORY
0,Engineering
1,Engineering
2,Engineering
3,Engineering
4,Engineering
...,...
168,Biology & Life Science
169,Psychology & Social Work
170,Psychology & Social Work
171,Psychology & Social Work


### WHERE Clause
* The WHERE clause makes search results more specific by stating conditions that rows must meet to be retrieved from the database.
* To retrieve, update, or delete a particular set of rows, add a WHERE clause to the statement. If no rows in the table match the condition, the query returns an empty result.
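
The following sketch illustrates both points on a tiny invented table: a condition that matches nothing returns an empty result, and WHERE restricts UPDATE and DELETE to the matching rows. The table name, column names, and values are assumptions made for the example.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major TEXT, share_women REAL)")
demo.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("NURSING", 0.9), ("ZOOLOGY", 0.6), ("MINING", 0.1)],
)

# No row satisfies this condition, so the result is an empty list.
no_match = demo.execute(
    "SELECT major FROM grads WHERE share_women > 1.0"
).fetchall()

# rowcount reports how many rows each statement changed.
updated = demo.execute(
    "UPDATE grads SET share_women = 0.0 WHERE major = 'MINING'"
).rowcount
deleted = demo.execute("DELETE FROM grads WHERE share_women < 0.5").rowcount
remaining = demo.execute("SELECT major FROM grads ORDER BY major").fetchall()

print(no_match, updated, deleted, remaining)
```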

In [10]:
read('SELECT ShareWomen, Major FROM recent_grads WHERE ShareWomen > 0.5')

Unnamed: 0,ShareWomen,Major
0,0.535714,ACTUARIAL SCIENCE
1,0.578766,COMPUTER SCIENCE
2,0.558548,ENVIRONMENTAL ENGINEERING
3,0.896019,NURSING
4,0.750473,INDUSTRIAL PRODUCTION TECHNOLOGIES
...,...,...
92,0.637293,ZOOLOGY
93,0.817099,EDUCATIONAL PSYCHOLOGY
94,0.799859,CLINICAL PSYCHOLOGY
95,0.798746,COUNSELING PSYCHOLOGY


### ORDER BY Clause
* The ORDER BY clause sorts the result in ascending or descending order according to one or more columns. By default, ORDER BY sorts in ascending order.
* DESC sorts the data in descending order.
* ASC sorts the data in ascending order.
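
Sorting by more than one column can be sketched as follows: rows are ordered by the first column, and the second column only breaks ties within it. The `grads` table and its values are invented for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE grads (major_category TEXT, major TEXT, share_women REAL)"
)
demo.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [
        ("Health", "NURSING", 0.9),
        ("Engineering", "MINING", 0.1),
        ("Health", "PHARMACY", 0.5),
        ("Engineering", "CIVIL", 0.2),
    ],
)

# Categories ascending; within each category, highest share first.
ordered = demo.execute(
    "SELECT major_category, major FROM grads "
    "ORDER BY major_category ASC, share_women DESC"
).fetchall()

print(ordered)
```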

In [11]:
read("""
SELECT ShareWomen, Major FROM recent_grads
WHERE ShareWomen > 0.5 AND ShareWomen < 0.8
ORDER BY ShareWomen DESC
""")

Unnamed: 0,ShareWomen,Major
0,0.799859,CLINICAL PSYCHOLOGY
1,0.798920,TEACHER EDUCATION: MULTIPLE LEVELS
2,0.798746,COUNSELING PSYCHOLOGY
3,0.792095,MATHEMATICS TEACHER EDUCATION
4,0.779933,PSYCHOLOGY
...,...,...
74,0.515406,BIOCHEMICAL SCIENCES
75,0.507377,INTERCULTURAL AND INTERNATIONAL STUDIES
76,0.506721,PHYSICAL AND HEALTH EDUCATION TEACHING
77,0.505141,CHEMISTRY


### LIMIT Clause
* The LIMIT keyword limits the number of rows returned by the SELECT statement.
* If many rows satisfy the query conditions, it can be useful to view only a handful of them at a time.
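
One way to view a handful of rows at a time is to combine LIMIT with OFFSET, which skips a number of rows before the limit applies. The sketch below pages through a made-up five-row table two rows at a time; the table and values are assumptions.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major TEXT, share_women REAL)")
demo.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6), ("E", 0.5)],
)

query = "SELECT major FROM grads ORDER BY share_women DESC LIMIT ? OFFSET ?"

# Page size 2: the first page skips nothing, the second skips two rows.
first_page = demo.execute(query, (2, 0)).fetchall()
second_page = demo.execute(query, (2, 2)).fetchall()

print(first_page, second_page)
```

An ORDER BY clause matters here: without it, SQLite does not guarantee which rows land on which page.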

In [12]:
read("""
SELECT ShareWomen, Major FROM recent_grads
WHERE ShareWomen > 0.5 AND ShareWomen < 0.8
ORDER BY ShareWomen DESC
LIMIT 10
""")

Unnamed: 0,ShareWomen,Major
0,0.799859,CLINICAL PSYCHOLOGY
1,0.79892,TEACHER EDUCATION: MULTIPLE LEVELS
2,0.798746,COUNSELING PSYCHOLOGY
3,0.792095,MATHEMATICS TEACHER EDUCATION
4,0.779933,PSYCHOLOGY
5,0.774577,GENERAL MEDICAL AND HEALTH SERVICES
6,0.770901,HEALTH AND MEDICAL ADMINISTRATIVE SERVICES
7,0.764427,SOIL SCIENCE
8,0.76432,LINGUISTICS AND COMPARATIVE LANGUAGE AND LITER...
9,0.75806,AREA ETHNIC AND CIVILIZATION STUDIES


### GROUP BY clause
* The GROUP BY clause is an optional clause of the SELECT statement. It groups the selected rows into summary rows by the values of one or more columns.
* The GROUP BY clause returns one row for each group. For each group, you can apply an aggregate function such as MIN, MAX, SUM, COUNT, or AVG to provide more information about that group.
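
Several aggregates can be computed per group in a single query. The sketch below, on an invented three-row table, returns one summary row per category with its row count and its smallest and largest share.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE grads (major_category TEXT, major TEXT, share_women REAL)"
)
demo.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [
        ("Health", "NURSING", 0.9),
        ("Health", "PHARMACY", 0.5),
        ("Engineering", "MINING", 0.1),
    ],
)

# One output row per distinct major_category value.
summary = demo.execute(
    "SELECT major_category, COUNT(*), MIN(share_women), MAX(share_women) "
    "FROM grads GROUP BY major_category ORDER BY major_category"
).fetchall()

print(summary)
```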

In [13]:
read("""
SELECT AVG(ShareWomen) AS Avg_Share_Women, Major_category FROM recent_grads
GROUP BY Major_category 
""")

Unnamed: 0,Avg_Share_Women,Major_category
0,0.617938,Agriculture & Natural Resources
1,0.561851,Arts
2,0.584518,Biology & Life Science
3,0.405063,Business
4,0.643835,Communications & Journalism
5,0.512752,Computers & Mathematics
6,0.674986,Education
7,0.257158,Engineering
8,0.616857,Health
9,0.676193,Humanities & Liberal Arts


### HAVING clause
* The HAVING clause is an optional clause of the SELECT statement. It specifies a search condition for a group.
* HAVING is usually used together with GROUP BY. The GROUP BY clause groups a set of rows into summary rows, and the HAVING clause then filters those groups based on the specified condition.
* In SQLite versions before 3.39.0, HAVING without GROUP BY raises an error. Later versions also accept HAVING on an aggregate query that has no GROUP BY clause.
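
The difference between WHERE and HAVING is easiest to see when both appear in one query: WHERE filters individual rows before grouping, and HAVING filters the groups afterwards. The table and values in this sketch are invented.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major_category TEXT, share_women REAL)")
demo.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [
        ("Health", 0.9),
        ("Health", 0.5),
        ("Engineering", 0.1),
        ("Engineering", 0.2),
        ("Arts", 0.6),
    ],
)

# WHERE drops the 0.1 row first; HAVING then keeps only categories
# that still have at least two rows left.
groups = demo.execute(
    "SELECT major_category, COUNT(*) FROM grads "
    "WHERE share_women > 0.15 "
    "GROUP BY major_category "
    "HAVING COUNT(*) >= 2 "
    "ORDER BY major_category"
).fetchall()

print(groups)
```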

In [14]:
read("""
SELECT SUM(ShareWomen) AS "Total ShareWomen", Major_category FROM recent_grads
GROUP BY Major_category
HAVING "Total ShareWomen" > 5
""")

Unnamed: 0,Total ShareWomen,Major_category
0,6.179384,Agriculture & Natural Resources
1,8.183259,Biology & Life Science
2,5.265821,Business
3,5.640272,Computers & Mathematics
4,10.799768,Education
5,7.457579,Engineering
6,7.402279,Health
7,10.142901,Humanities & Liberal Arts
8,5.087494,Physical Sciences
9,6.999868,Psychology & Social Work


### Comparison Operators
* == or = - Checks whether the values of two operands are equal; if they are, the condition is true
* != - Checks whether the values of two operands are different; if they are not equal, the condition is true
* <> - An alternative spelling of != with the same meaning
* <, >, <=, >= - Check whether the left operand is less than, greater than, less than or equal to, or greater than or equal to the right operand
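
As a quick check of how these operators behave, the sketch below evaluates a few comparisons on literal values. SQLite reports the outcome of a comparison as the integer 1 (true) or 0 (false).

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# = and == are two spellings of the same equality test.
equal_forms = demo.execute("SELECT 1 = 1, 1 == 1").fetchone()

# != and <> are two spellings of the same inequality test.
not_equal_forms = demo.execute("SELECT 1 != 2, 1 <> 2").fetchone()

# Ordering comparisons; the last one (2 >= 3) is false.
ordering = demo.execute("SELECT 1 < 2, 2 <= 2, 3 > 2, 2 >= 3").fetchone()

print(equal_forms, not_equal_forms, ordering)
```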

In [15]:
read("""
SELECT ShareWomen, Major, Major_category FROM recent_grads
WHERE ShareWomen < 0.5
LIMIT 10
""")

Unnamed: 0,ShareWomen,Major,Major_category
0,0.120564,PETROLEUM ENGINEERING,Engineering
1,0.101852,MINING AND MINERAL ENGINEERING,Engineering
2,0.153037,METALLURGICAL ENGINEERING,Engineering
3,0.107313,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,0.341631,CHEMICAL ENGINEERING,Engineering
5,0.144967,NUCLEAR ENGINEERING,Engineering
6,0.441356,ASTRONOMY AND ASTROPHYSICS,Physical Sciences
7,0.139793,MECHANICAL ENGINEERING,Engineering
8,0.437847,ELECTRICAL ENGINEERING,Engineering
9,0.199413,COMPUTER ENGINEERING,Engineering


In [16]:
read("""
SELECT ShareWomen, Major, Major_category FROM recent_grads
WHERE Major_category == 'Engineering'
LIMIT 10
""")

Unnamed: 0,ShareWomen,Major,Major_category
0,0.120564,PETROLEUM ENGINEERING,Engineering
1,0.101852,MINING AND MINERAL ENGINEERING,Engineering
2,0.153037,METALLURGICAL ENGINEERING,Engineering
3,0.107313,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,0.341631,CHEMICAL ENGINEERING,Engineering
5,0.144967,NUCLEAR ENGINEERING,Engineering
6,0.139793,MECHANICAL ENGINEERING,Engineering
7,0.437847,ELECTRICAL ENGINEERING,Engineering
8,0.199413,COMPUTER ENGINEERING,Engineering
9,0.19645,AEROSPACE ENGINEERING,Engineering


### Logical Operators
* AND - The AND operator combines multiple conditions in a WHERE clause; a row matches only if all of them are true
* BETWEEN - The BETWEEN operator searches for values within an inclusive range, given the minimum value and the maximum value
* IN - The IN operator compares a value to a list of specified literal values
* OR - The OR operator combines multiple conditions in a WHERE clause; a row matches if at least one of them is true
* UNIQUE - In SQLite, UNIQUE is a column or index constraint rather than a query operator; use SELECT DISTINCT to remove duplicate rows from a result
* etc.
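
BETWEEN and IN can be sketched on a small invented table as shown below. The last query uses SELECT DISTINCT, which is the keyword SQLite provides for removing duplicate rows from a result.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE grads (major TEXT, major_category TEXT, share_women REAL)"
)
demo.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [
        ("NURSING", "Health", 0.9),
        ("ZOOLOGY", "Biology", 0.6),
        ("MINING", "Engineering", 0.1),
        ("CIVIL", "Engineering", 0.2),
    ],
)

# BETWEEN includes both end points of the range.
in_range = demo.execute(
    "SELECT major FROM grads "
    "WHERE share_women BETWEEN 0.2 AND 0.6 ORDER BY major"
).fetchall()

# IN matches any value from the list.
in_list = demo.execute(
    "SELECT major FROM grads "
    "WHERE major_category IN ('Health', 'Biology') ORDER BY major"
).fetchall()

# DISTINCT collapses the two Engineering rows into one.
categories = demo.execute(
    "SELECT DISTINCT major_category FROM grads ORDER BY major_category"
).fetchall()

print(in_range, in_list, categories)
```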

In [17]:
read("""
SELECT ShareWomen, Major, Major_category FROM recent_grads
WHERE ShareWomen > 0.5 OR (ShareWomen <= 0.5 AND Major_category == 'Engineering')
""")

Unnamed: 0,ShareWomen,Major,Major_category
0,0.120564,PETROLEUM ENGINEERING,Engineering
1,0.101852,MINING AND MINERAL ENGINEERING,Engineering
2,0.153037,METALLURGICAL ENGINEERING,Engineering
3,0.107313,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,0.341631,CHEMICAL ENGINEERING,Engineering
...,...,...,...
119,0.637293,ZOOLOGY,Biology & Life Science
120,0.817099,EDUCATIONAL PSYCHOLOGY,Psychology & Social Work
121,0.799859,CLINICAL PSYCHOLOGY,Psychology & Social Work
122,0.798746,COUNSELING PSYCHOLOGY,Psychology & Social Work


### Arithmetic Operators
* "+" - Adds the values on both sides of the operator.
* "-" - Subtracts the right-hand operand from the left-hand operand.
* "*" - Multiplies the values on both sides of the operator.
* "/" - Divides the left-hand operand by the right-hand operand. When both operands are integers, SQLite performs integer division, which is why the query below casts one operand to a float.
* "%" - Divides the left-hand operand by the right-hand operand and returns the remainder.
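
A short sketch on literal values shows each operator in action. It also shows that dividing two integers in SQLite drops the fractional part unless one operand is cast to a floating-point type first.

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# Addition, subtraction, multiplication, and remainder on integers.
basic = demo.execute("SELECT 7 + 2, 7 - 2, 7 * 2, 7 % 2").fetchone()

# Integer division truncates; casting one operand keeps the fraction.
integer_division, float_division = demo.execute(
    "SELECT 7 / 2, CAST(7 AS FLOAT) / 2"
).fetchone()

print(basic, integer_division, float_division)
```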

In [18]:
read("""
SELECT sharewomen, men, women, total, round((cast(women as float)/total)*100,2) wo_perc, 
round((cast(men as float)/total)*100, 3) men_perc FROM recent_grads
""")

Unnamed: 0,ShareWomen,Men,Women,Total,wo_perc,men_perc
0,0.120564,2057,282,2339,12.06,87.944
1,0.101852,679,77,756,10.19,89.815
2,0.153037,725,131,856,15.30,84.696
3,0.107313,1123,135,1258,10.73,89.269
4,0.341631,21239,11021,32260,34.16,65.837
...,...,...,...,...,...,...
168,0.637293,3050,5359,8409,63.73,36.271
169,0.817099,522,2332,2854,81.71,18.290
170,0.799859,568,2270,2838,79.99,20.014
171,0.798746,931,3695,4626,79.87,20.125


### Aggregate functions
* An aggregate function combines the values of multiple rows into a single summary value.
* SQLite provides several aggregate functions that are useful for statistical analysis.
* AVG (the arithmetic mean), SUM, MAX, MIN, and COUNT are common aggregate functions.
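
One detail worth knowing when using these functions is how they treat NULL. In the sketch below, on an invented table with one missing value, COUNT(*) counts every row, while COUNT(column), AVG, and SUM skip the NULL.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grads (major TEXT, share_women REAL)")
demo.executemany(
    "INSERT INTO grads VALUES (?, ?)",
    [("A", 0.75), ("B", 0.25), ("C", None)],  # None is stored as NULL
)

all_rows, non_null, average, total = demo.execute(
    "SELECT COUNT(*), COUNT(share_women), AVG(share_women), SUM(share_women) "
    "FROM grads"
).fetchone()

print(all_rows, non_null, average, total)
```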

In [19]:
read("""
SELECT AVG(sharewomen),major FROM recent_grads
GROUP BY major
HAVING AVG(sharewomen) > 0.5
ORDER BY AVG(sharewomen) DESC 
LIMIT 10
""")

Unnamed: 0,AVG(sharewomen),Major
0,0.968954,ANTHROPOLOGY AND ARCHEOLOGY
1,0.967998,EARLY CHILDHOOD EDUCATION
2,0.927807,MATHEMATICS AND COMPUTER SCIENCE
3,0.923745,ELEMENTARY EDUCATION
4,0.910933,ANIMAL SCIENCES
5,0.906677,PHYSIOLOGY
6,0.90559,MISCELLANEOUS PSYCHOLOGY
7,0.904075,HUMAN SERVICES AND COMMUNITY ORGANIZATION
8,0.896019,NURSING
9,0.881294,GEOSCIENCES


### Subquery
* A subquery is a SELECT statement nested inside another statement.
* A subquery must be enclosed in a pair of parentheses.
* Typically, a subquery returns a single value, though it may return multiple rows when its result is compared using the IN operator.
* A subquery can appear in the SELECT, FROM, WHERE, and JOIN clauses.
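
Two of the placements listed above can be sketched on a small invented table: a subquery in the FROM clause that acts as a temporary table of per-category averages, and a subquery that feeds several values to the IN operator.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE grads (major TEXT, major_category TEXT, share_women REAL)"
)
demo.executemany(
    "INSERT INTO grads VALUES (?, ?, ?)",
    [
        ("NURSING", "Health", 0.75),
        ("PHARMACY", "Health", 0.5),
        ("MINING", "Engineering", 0.25),
        ("CIVIL", "Engineering", 0.25),
    ],
)

# Subquery in FROM: filter on an average computed by the inner query.
high_categories = demo.execute(
    "SELECT major_category, avg_share FROM ("
    "  SELECT major_category, AVG(share_women) AS avg_share"
    "  FROM grads GROUP BY major_category"
    ") WHERE avg_share > 0.5"
).fetchall()

# Subquery with IN: majors in any category that has a share above 0.7.
majors = demo.execute(
    "SELECT major FROM grads WHERE major_category IN ("
    "  SELECT major_category FROM grads WHERE share_women > 0.7"
    ") ORDER BY major"
).fetchall()

print(high_categories, majors)
```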

In [20]:
read("""
    SELECT major, major_category, sharewomen FROM recent_grads
    WHERE sharewomen > (SELECT avg(sharewomen) FROM recent_grads)
""")

Unnamed: 0,Major,Major_category,ShareWomen
0,ACTUARIAL SCIENCE,Business,0.535714
1,COMPUTER SCIENCE,Computers & Mathematics,0.578766
2,ENVIRONMENTAL ENGINEERING,Engineering,0.558548
3,NURSING,Health,0.896019
4,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.750473
...,...,...,...
86,ZOOLOGY,Biology & Life Science,0.637293
87,EDUCATIONAL PSYCHOLOGY,Psychology & Social Work,0.817099
88,CLINICAL PSYCHOLOGY,Psychology & Social Work,0.799859
89,COUNSELING PSYCHOLOGY,Psychology & Social Work,0.798746


In [21]:
read("""
    SELECT 
        COUNT(sharewomen), (SELECT COUNT(*) FROM recent_grads),
        COUNT(sharewomen)/CAST((SELECT COUNT(*) FROM recent_grads) AS FLOAT)*100 FROM recent_grads
    WHERE 
        sharewomen > (SELECT AVG(sharewomen) FROM recent_grads)
""")

Unnamed: 0,COUNT(sharewomen),(SELECT COUNT(*) FROM recent_grads),COUNT(sharewomen)/CAST((SELECT COUNT(*) FROM recent_grads) AS FLOAT)*100
0,91,173,52.601156
