<a href="https://colab.research.google.com/github/JulTob/SQL/blob/master/Starting_with_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. What a database is (before SQL) 🧠

A database holds one or more collections of tables (called datasets or schemas, depending on the system). A table is a rectangle: its columns form the schema and its rows are the observations.

SQL is a language for asking structured questions of a database.
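To make the "rectangle" idea concrete, here is a minimal sketch. It assumes Python's built-in `sqlite3` rather than the DuckDB setup used in this notebook, so it runs without any install or download. The table `toy_incidents` and its three rows are made up for illustration and are not taken from the real dataset.

```python
import sqlite3

# Hypothetical toy table (made-up rows, not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, year INTEGER, state_name TEXT)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?, ?)",
    [(1, 1984, "Arizona"), (2, 1984, "Oregon"), (3, 2001, "Texas")],
)

# The schema: one (name, declared type) pair per column.
schema = [(row[1], row[2]) for row in demo.execute("PRAGMA table_info('toy_incidents')")]

# The observations: one tuple per row, holding one value per column.
rows = demo.execute("SELECT * FROM toy_incidents").fetchall()

print("schema:", schema)
print("rows:  ", rows)
```

Every row has exactly as many values as the schema has columns, which is what makes the table a rectangle.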

In [2]:
%pip install duckdb



In [3]:
# @title
import duckdb

csv_url = 'https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/US-shooting-incidents.csv'

# Connect to an in-memory DuckDB database
con = duckdb.connect(database=':memory:', read_only=False)

# Load the CSV file directly into DuckDB as a table
con.execute(f"CREATE OR REPLACE TABLE incidents AS SELECT * FROM read_csv_auto('{csv_url}');")

# Rename 'column00' to 'id'
con.execute("ALTER TABLE incidents RENAME COLUMN column00 TO id;")

print("In-memory DuckDB database created; table 'incidents' loaded, with 'column00' renamed to 'id'.")

In-memory DuckDB database created; table 'incidents' loaded, with 'column00' renamed to 'id'.


> What does this table look like?

# 2. Meeting the data 🔍

Every investigation starts small. To get a first look at our data, we use the fundamental SQL clauses: `SELECT`, `FROM`, and `LIMIT`.

-   `SELECT`: Specifies which columns you want to retrieve from the table. Using `*` selects all columns.
-   `FROM`: Indicates the table you are querying, in this case, `incidents`.
-   `LIMIT`: Restricts the number of rows returned, which is useful for quickly previewing a large dataset without fetching all of it.

Together, these clauses allow us to inspect a small sample of our data.


```SQL
SELECT *
FROM incidents
LIMIT 5;
```

This query answers two things at once:

- What columns exist?

- What kind of values do they hold?


SQL rewards curiosity. Always look first.
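The two questions above can be answered straight from a query result. The following is a small sketch under assumptions: it uses Python's built-in `sqlite3` and a hypothetical `toy_incidents` table with ten made-up rows, so it runs on its own. The same `SELECT ... LIMIT` query should behave the same way on the DuckDB connection.

```python
import sqlite3

# Hypothetical toy table with ten made-up rows (not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, year INTEGER, state_name TEXT)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?, ?)",
    [(i, 1984 + i, "Arizona") for i in range(1, 11)],
)

cursor = demo.execute("SELECT * FROM toy_incidents LIMIT 5;")

# What columns exist? The cursor describes each returned column.
column_names = [col[0] for col in cursor.description]

# What kind of values do they hold? LIMIT caps the preview at five rows.
preview = cursor.fetchall()

print(column_names)
for row in preview:
    print(row)
```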


In [4]:
# @title A first look 🔎
def SQL(query = "SELECT * FROM incidents LIMIT 5;"):

    # Execute the query and fetch results into a Pandas DataFrame
    result_df = con.execute(query).fetchdf()

    # Print the DataFrame
    print("Query Results:")
    print(result_df)

SQL()

Query Results:
   id                                     person  \
0   1                                  K9 Roscoe   
1   2            Police Officer Roy L. Leon, Jr.   
2   3                  Officer Stanley D. Pounds   
3   4  Enforcement Agent Ernest Joseph Gray, Jr.   
4   5       Police Officer James W. Carozza, Jr.   

                                         dept                            eow  \
0               Phoenix Police Department, AZ     EOW: Friday, July 13, 1984   
1          Cotton Plant Police Department, AR     EOW: Friday, July 13, 1984   
2                  Portland Police Bureau, OR  EOW: Wednesday, July 18, 1984   
3  Pennsylvania Public Utility Commission, PA     EOW: Friday, July 20, 1984   
4            Greenburgh Police Department, NY     EOW: Friday, July 20, 1984   

                                 cause          cause_short       date  year  \
0    Cause of Death: Struck by vehicle    Struck by vehicle 1984-07-13  1984   
1              Cause of Death: 

In [14]:
# @title What columns exist? (schema) 🧬
SQL("""
PRAGMA table_info('incidents');
""")


Query Results:
    cid         name     type  notnull dflt_value     pk
0     0           id   BIGINT    False       None  False
1     1       person  VARCHAR    False       None  False
2     2         dept  VARCHAR    False       None  False
3     3          eow  VARCHAR    False       None  False
4     4        cause  VARCHAR    False       None  False
5     5  cause_short  VARCHAR    False       None  False
6     6         date     DATE    False       None  False
7     7         year   BIGINT    False       None  False
8     8       canine  BOOLEAN    False       None  False
9     9    dept_name  VARCHAR    False       None  False
10   10        state  VARCHAR    False       None  False
11   11  description  VARCHAR    False       None  False
12   12     latitude   DOUBLE    False       None  False
13   13    longitude   DOUBLE    False       None  False
14   14   state_name  VARCHAR    False       None  False


# 3. Columns are variables, rows are observations 📐

A table row is a single event. A column is a property of that event.

When we want to focus on specific aspects of our data, we can select only the columns that are relevant. This process is often referred to as **filtering vertically**, as we are choosing a subset of the available columns.

For example, if we are only interested in the `year` of an incident and the `state_name` where it occurred, we can explicitly name these columns in our `SELECT` statement:





In [5]:
SQL("""
SELECT year, state_name
FROM incidents;
""")

Query Results:
      year    state_name
0     1984       Arizona
1     1984      Arkansas
2     1984        Oregon
3     1984  Pennsylvania
4     1984      New York
...    ...           ...
4994  2016     Louisiana
4995  2016       Indiana
4996  2016    California
4997  2016     Tennessee
4998  2016         Idaho

[4999 rows x 2 columns]


This query retrieves data only from the `year` and `state_name` columns for all incidents in the `incidents` table.

While `SELECT` clauses allow us to choose which columns to display (vertical filtering), the `WHERE` clause enables us to filter the rows based on specific conditions. This is often referred to as **filtering horizontally**, as we are choosing a subset of the available rows.

For instance, if we only want to see incidents that occurred in a particular year, such as `2001`, we can use the `WHERE` clause to specify this condition:





In [13]:
SQL("""
SELECT *
FROM incidents
WHERE year = 2001;
""")

Query Results:
       id                             person  \
0    2567          Trooper John Gregory Mann   
1    2568       Corporal James Brian Moulson   
2    2569  Corporal Phillip Charles Anderson   
3    2570              Corporal Ronnie Bogan   
4    2571     Trooper John Henry Duncan, Jr.   
..    ...                                ...   
221  2788            Detective Donald Miller   
222  2789      Police Officer Ron Jones, Jr.   
223  2790          Major Alister C. McGregor   
224  2791       Police Officer Michael Johns   
225  2792            Lieutenant Randy Gerald   

                                       dept  \
0              Tennessee Highway Patrol, TN   
1    Jerome County Sheriff's Department, ID   
2    Jerome County Sheriff's Department, ID   
3           Notasulga Police Department, AL   
4         North Carolina Highway Patrol, NC   
..                                      ...   
221          New Bern Police Department, NC   
222          Prentiss Police Dep

This query will return all columns (`*`) for only those rows where the `year` column has a value of `2001`.
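Vertical and horizontal filtering can be combined in a single query: name the columns in `SELECT` and state the row condition in `WHERE`. Here is a minimal sketch, assuming Python's built-in `sqlite3` and a hypothetical `toy_incidents` table with made-up rows so that it runs on its own.

```python
import sqlite3

# Hypothetical toy table (made-up rows, not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, year INTEGER, state_name TEXT)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?, ?)",
    [(1, 1984, "Arizona"), (2, 2001, "Tennessee"), (3, 2001, "Idaho"), (4, 2016, "Idaho")],
)

# Two columns out of three (vertical), only the rows from 2001 (horizontal).
filtered = demo.execute(
    """
    SELECT id, state_name
    FROM toy_incidents
    WHERE year = 2001;
    """
).fetchall()

print(filtered)
```

Note that `year` can appear in `WHERE` even though it is not in the `SELECT` list: the row filter is applied before the columns are picked.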

When you want to see all the unique values within a column, you use the `DISTINCT` keyword. This is especially useful when you suspect there might be duplicate entries and you only care about the unique occurrences.

For example, to find out all the unique years present in our `incidents` table, we can use:




In [6]:
SQL("""
SELECT DISTINCT year
FROM incidents
ORDER BY year DESC
;
""")

Query Results:
    year
0   2016
1   2015
2   2014
3   2013
4   2012
5   2011
6   2010
7   2009
8   2008
9   2007
10  2006
11  2005
12  2004
13  2003
14  2002
15  2001
16  2000
17  1999
18  1998
19  1997
20  1996
21  1995
22  1994
23  1993
24  1992
25  1991
26  1990
27  1989
28  1988
29  1987
30  1986
31  1985
32  1984



This query will return a list of each `year` only once, even if multiple incidents occurred in that year. We also add `ORDER BY year DESC` to sort the unique years in descending order for better readability.
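The effect of `DISTINCT` is easiest to see next to the same query without it. The following is a sketch, assuming Python's built-in `sqlite3` and a hypothetical `toy_incidents` table with five made-up rows.

```python
import sqlite3

# Hypothetical toy table (made-up rows, not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, year INTEGER)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?)",
    [(1, 1984), (2, 1984), (3, 2001), (4, 2001), (5, 2016)],
)

# Without DISTINCT: one output row per table row, repeats included.
all_years = demo.execute("SELECT year FROM toy_incidents ORDER BY year DESC").fetchall()

# With DISTINCT: each year appears once.
unique_years = demo.execute(
    "SELECT DISTINCT year FROM toy_incidents ORDER BY year DESC"
).fetchall()

print(len(all_years), "rows without DISTINCT")
print(unique_years)
```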

After understanding how to select specific columns and filter rows, we often need to perform aggregations on our data. This involves summarizing data, which is where aggregate functions and clauses like `COUNT()`, `AS`, `GROUP BY`, and `ORDER BY` become essential.

-   **`COUNT()`**: An aggregate function that counts rows. `COUNT(*)` counts every row, while `COUNT(column)` counts only the rows where that column is not NULL. For example, `COUNT(id)` counts the incidents whose `id` is present.
-   **`AS`**: The `AS` keyword is used to assign an alias (a temporary name) to a column or a table. This makes the output more readable and understandable. For instance, `COUNT(id) AS total_incidents` renames the result of `COUNT(id)` to `total_incidents`.
-   **`GROUP BY`**: This clause is used to group rows that have the same values in one or more specified columns into a summary row. It's almost always used with aggregate functions. When you `GROUP BY` a column, aggregate functions like `COUNT()` will operate on each group separately. For example, `GROUP BY year` will count incidents for each year individually.
-   **`ORDER BY`**: This clause sorts the result set of a query in ascending (ASC, default) or descending (DESC) order based on one or more columns. It is typically used at the end of a query to arrange the results for easier analysis.
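The four pieces above can be sketched on a tiny table. This example assumes Python's built-in `sqlite3` and a hypothetical `toy_incidents` table; the rows are made up, and two of them deliberately have a missing `state_name` to show the difference between `COUNT(*)` and `COUNT(column)`.

```python
import sqlite3

# Hypothetical toy table (made-up rows); None becomes NULL in SQL.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, year INTEGER, state_name TEXT)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?, ?)",
    [
        (1, 1984, "Arizona"),
        (2, 1984, None),
        (3, 2001, "Texas"),
        (4, 2001, "Texas"),
        (5, 2001, None),
    ],
)

per_year = demo.execute(
    """
    SELECT
      year,
      COUNT(*)          AS total_rows,    -- every row in the group
      COUNT(state_name) AS known_states   -- only rows with a non-NULL state
    FROM toy_incidents
    GROUP BY year
    ORDER BY year;
    """
).fetchall()

print(per_year)
```

Each output row summarizes one group: `GROUP BY year` forms the groups, the two `COUNT` calls run once per group, `AS` names the resulting columns, and `ORDER BY` fixes the order of the summary rows.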

Let's see how these are used together to answer questions about our data, such as counting incidents per year or per state.

In [7]:
SQL("""
SELECT COUNT(id) AS total_incidents, year
FROM incidents
GROUP BY year
ORDER BY year
;
""")

Query Results:
    total_incidents  year
0                82  1984
1               161  1985
2               162  1986
3               172  1987
4               180  1988
5               177  1989
6               144  1990
7               131  1991
8               152  1992
9               140  1993
10              160  1994
11              163  1995
12              126  1996
13              168  1997
14              146  1998
15              140  1999
16              162  2000
17              226  2001
18              143  2002
19              153  2003
20              153  2004
21              160  2005
22              152  2006
23              189  2007
24              158  2008
25              138  2009
26              165  2010
27              175  2011
28              141  2012
29              123  2013
30              148  2014
31              140  2015
32               69  2016


In [8]:
SQL("""
SELECT COUNT(id) AS total_incidents, state_name AS State
FROM incidents
GROUP BY state_name
ORDER BY total_incidents DESC
;
""")

Query Results:
    total_incidents           State
0               460           Texas
1               456      California
2               410        New York
3               318         Florida
4               213         Georgia
5               165       Louisiana
6               158        Illinois
7               157  North Carolina
8               148    Pennsylvania
9               141         Alabama
10              140            Ohio
11              135       Tennessee
12              130        Michigan
13              128  South Carolina
14              126        Virginia
15              123         Arizona
16              119        Missouri
17              109         Indiana
18              109     Mississippi
19              105      New Jersey
20              102        Maryland
21               83        Kentucky
22               79        Arkansas
23               77      Washington
24               75        Oklahoma
25               72   Massachusetts
26           

In [9]:
SQL("""
SELECT cause_short AS Cause, COUNT(id) AS Total_Incidents
FROM incidents
GROUP BY cause_short
ORDER BY Total_Incidents DESC
;
""")

Query Results:
                       Cause  Total_Incidents
0                    Gunfire             1768
1        Automobile accident              936
2               Heart attack              388
3          Vehicular assault              366
4          Struck by vehicle              285
5            Vehicle pursuit              169
6        Motorcycle accident              163
7          Aircraft accident              125
8       Gunfire (Accidental)              110
9       9/11 related illness              108
10                   Assault               92
11          Terrorist attack               70
12      Duty related illness               65
13                   Stabbed               62
14                   Drowned               60
15                      Fall               51
16           Heat exhaustion               46
17                Accidental               21
18           Struck by train               20
19         Training accident               19
20        Exposure 

Beyond simple aggregations and filtering, SQL supports conditional logic within queries through the `CASE WHEN` expression. It is a powerful tool for categorizing data or computing different values depending on a condition, much like an `if-then-else` in a programming language.

-   **`CASE`**: Initiates a conditional expression.
-   **`WHEN condition THEN result`**: Specifies a condition and the result to return if that condition is true. You can have multiple `WHEN` clauses.
-   **`ELSE result`**: (Optional) Specifies a default result to return if none of the `WHEN` conditions are met.
-   **`END`**: Concludes the `CASE` expression.
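As a minimal sketch of `CASE` used for categorizing, the query below labels each row based on its cause. It assumes Python's built-in `sqlite3` and a hypothetical `toy_incidents` table with three made-up rows; the category labels are also invented for illustration.

```python
import sqlite3

# Hypothetical toy table (made-up rows, not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, cause_short TEXT)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?)",
    [(1, "Gunfire"), (2, "Heart attack"), (3, "Fall")],
)

labelled = demo.execute(
    """
    SELECT
      id,
      CASE
        WHEN cause_short = 'Gunfire'      THEN 'gunfire'
        WHEN cause_short = 'Heart attack' THEN 'medical'
        ELSE 'other'                       -- default when no WHEN matches
      END AS category
    FROM toy_incidents
    ORDER BY id;
    """
).fetchall()

print(labelled)
```

The `WHEN` branches are tried in order and the first match wins, so the `ELSE` branch only applies to rows that matched none of them.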

In our context, `CASE WHEN cause_short = 'Gunfire' THEN id END` means: if the `cause_short` of an incident is 'Gunfire', return its `id`. Because there is no `ELSE`, every other row yields NULL. `COUNT()` skips NULLs, so applying it to this `CASE` expression counts only the rows where `cause_short = 'Gunfire'`, which gives the number of gunfire incidents.
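A small sketch of this conditional count follows, assuming Python's built-in `sqlite3` and a hypothetical `toy_incidents` table with four made-up rows. It also shows an equivalent spelling with `SUM` over 1/0 flags, which is a common alternative and not something the notebook itself uses.

```python
import sqlite3

# Hypothetical toy table (made-up rows, not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_incidents (id INTEGER, state_name TEXT, cause_short TEXT)")
demo.executemany(
    "INSERT INTO toy_incidents VALUES (?, ?, ?)",
    [
        (1, "Texas", "Gunfire"),
        (2, "Texas", "Heart attack"),
        (3, "Texas", "Gunfire"),
        (4, "Idaho", "Fall"),
    ],
)

per_state = demo.execute(
    """
    SELECT
      state_name,
      COUNT(id) AS total_incidents,
      -- CASE without ELSE gives NULL for non-matching rows; COUNT skips NULLs.
      COUNT(CASE WHEN cause_short = 'Gunfire' THEN id END) AS gunfire_by_count,
      -- Alternative spelling: add up 1 for matches and 0 otherwise.
      SUM(CASE WHEN cause_short = 'Gunfire' THEN 1 ELSE 0 END) AS gunfire_by_sum
    FROM toy_incidents
    GROUP BY state_name
    ORDER BY total_incidents DESC;
    """
).fetchall()

print(per_state)
```

Both conditional columns agree for every group, including a group with no matching rows, where each reports 0.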

This allows us to create new, derived metrics within our aggregated results, as shown in the example below where we count total incidents and also specifically gunfire incidents per state.

In [10]:
SQL("""
SELECT
   state_name AS State,
   COUNT(id) AS total_incidents,
   COUNT(CASE WHEN cause_short = 'Gunfire' THEN id END) AS gunfire_incidents
FROM incidents
GROUP BY state_name
ORDER BY total_incidents DESC
;
""")

Query Results:
             State  total_incidents  gunfire_incidents
0            Texas              460                151
1       California              456                174
2         New York              410                 90
3          Florida              318                108
4          Georgia              213                 77
5        Louisiana              165                 60
6         Illinois              158                 60
7   North Carolina              157                 58
8     Pennsylvania              148                 62
9          Alabama              141                 45
10            Ohio              140                 58
11       Tennessee              135                 46
12        Michigan              130                 60
13  South Carolina              128                 48
14        Virginia              126                 57
15         Arizona              123                 45
16        Missouri              119               

In [16]:
# @title How big is our universe? 🌍
SQL("""
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT state_name) AS total_states,
  MIN(year) AS first_year,
  MAX(year) AS last_year
FROM incidents;
""")


Query Results:
   total_rows  total_states  first_year  last_year
0        4999            50        1984       2016


In [18]:
# @title Top causes 🧾
SQL("""
SELECT
  cause_short AS cause,
  COUNT(*) AS total_incidents
FROM incidents
GROUP BY cause_short
ORDER BY total_incidents DESC
LIMIT 12;
""")


Query Results:
                   cause  total_incidents
0                Gunfire             1768
1    Automobile accident              936
2           Heart attack              388
3      Vehicular assault              366
4      Struck by vehicle              285
5        Vehicle pursuit              169
6    Motorcycle accident              163
7      Aircraft accident              125
8   Gunfire (Accidental)              110
9   9/11 related illness              108
10               Assault               92
11      Terrorist attack               70
