# Table of Contents

1. [Table of Contents](#Table-of-Contents)
2. [SQL](#SQL)
3. [Terminology](#Terminology)
4. [SQL Basics](#SQL-Basics)
5. [SQL commands](#SQL-commands)
    1. [SHOW](#SHOW)
    2. [SELECT](#SELECT)
    3. [DISTINCT](#DISTINCT)
    4. [WHERE](#WHERE)
    5. [Excursion: Searching for column names](#Excursion:-Searching-for-column-names)
    6. [ORDER BY](#ORDER-BY)
    7. [NULL values](#NULL-values)
    8. [LIMIT](#LIMIT)
    9. [Aggregate functions](#Aggregate-functions)
         1. [MIN and MAX](#MIN-and-MAX)
         2. [COUNT](#COUNT)
         3. [AVG](#AVG)
         4. [SUM](#SUM)
         5. [ROUND](#ROUND)
         6. [UNION](#UNION)
    10. [LIKE](#LIKE)
    11. [IN](#IN)
    12. [BETWEEN](#BETWEEN)
    13. [GROUP BY](#GROUP-BY)
    14. [HAVING](#HAVING)
6. [Exercises](#Exercises)
    1. [Exercise 1 - Simple SELECT](#Exercise-1---Simple-SELECT)
    2. [Exercise 2 - Different values](#Exercise-2---Different-values)
    3. [Exercise 3 - Filtering Selections](#Exercise-3---Filtering-Selections)
    4. [Exercise 4 - Logical Operators](#Exercise-4---Logical-Operators)
    5. [Exercise 5 - Ordering selections](#Exercise-5---Ordering-selections)
    6. [Exercise 6 - NULL values](#Exercise-6---NULL-values)
    7. [Exercise 7 - Limits](#Exercise-7---Limits)
    8. [Exercise 8 - MIN and MAX](#Exercise-8---MIN-and-MAX)
    9. [Exercise 9 - Mathematical Operations](#Exercise-9---Mathematical-Operations)
    10. [Exercise 10 - Text patterns](#Exercise-10---Text-patterns)
    11. [Exercise 11 - Using IN](#Exercise-11---Using-IN)
    12. [Exercise 12 - Ranges](#Exercise-12---Ranges)
    13. [Exercise 13 - Grouping Selections](#Exercise-13---Grouping-Selections)

# SQL

SQL = Structured Query Language

SQL lets us access and manipulate databases. SQL is an ANSI/ISO standard, but there are different dialects. The major commands are supported by all of them.

We use MySQL Workbench as GUI, but MySQL is only one of many RDBMS (Relational Database Management Systems). There are different server types, and I need the client that matches the type of my server.

When working with Workbench we are always connected to one server. One server can hold several databases; the databases have to match the server type. If we want to work on several databases they all have to be on the same server. It is possible to connect to different servers, just not at the same time.

I can write scripts in SQL that I can load into Workbench to run. If another database has the same server type I can reuse these scripts (with tweaking to allow for the other database structure).

# Terminology

- A field is the column definition, i.e. the column name, data type, rules etc. (We will call this the column header and use "field" for a single row/column combination, i.e. one cell.)
- A record is a row (a horizontal entity in a table).
- A column is a vertical entity in a table.
- PK stands for primary key, FK for foreign key. In the output of `SHOW COLUMNS`, MySQL marks primary key columns with `PRI`; `MUL` marks a column that starts a non-unique index, which is typically what foreign key columns show.
- A complete command is called a statement; the parts within the statement are clauses (e.g. `SELECT * FROM table WHERE condition` is a `SELECT` statement with a `WHERE` clause).

# SQL Basics

SQL keywords are case insensitive. It is convention to write SQL keywords in all capital letters, but it is not a requirement for the system. Table and database names can be case sensitive, depending on the server's operating system and configuration.

Statements are closed with a `;`. Everything before the `;`, even across several lines, is recognized as one statement. Indentation and line breaks don't matter.

We add single-line comments with `-- comment` (MySQL requires a space after the `--`) or multi-line comments with `/* comment */`.

# SQL commands 

## SHOW

Very handy when working without a GUI (e.g. with MariaDB in a terminal), as it displays the content in the terminal.

- `SHOW DATABASES;` shows all databases in the system
- `SHOW TABLES;` shows all tables in the active database
- `SHOW TABLES FROM dbs;` shows all tables in the database called dbs
- `SHOW COLUMNS FROM table_name;` shows all columns of the table called table_name

## SELECT

`SELECT` is used to retrieve data from a database. The returned data is stored in a temporary result table, the result set. `SELECT` works row by row.

`SELECT` requires

- the table we want to get data from
- the columns we want included in the result

Optionally:

- more clauses, e.g. conditions, can be added to `SELECT`
- multiple `SELECT` statements can be combined (subqueries) to make use of the relation between tables

General Syntax:

    SELECT column1, column2, ..., columnLast
    FROM table_name;

To look at all columns, use `*` to stand for all columns:

    SELECT * FROM table_name;

Note: we can't use the wildcard together with text, `SELECT Protein* FROM proteins` *doesn't* work.

Order of additions to `SELECT`

- `SELECT`
- `DISTINCT` / `COUNT` / `AVG` etc. 
- `FROM`
- `WHERE`
- `GROUP BY`
- `HAVING`
- `ORDER BY`
- `LIMIT`

## DISTINCT

To show only the unique values in a column, use `DISTINCT`.

Syntax:

    SELECT DISTINCT column1 FROM table1;

Also works for distinct combinations:

    SELECT DISTINCT column1, column2 FROM table1;
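
A small sketch of the difference between unique values and unique combinations. It assumes Python's built-in `sqlite3` module with an in-memory database as a stand-in for our MySQL server (`DISTINCT` behaves the same way there); the table contents are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE proteins (ID INTEGER PRIMARY KEY, Organism_ID INTEGER, Annotation INTEGER)"
)
con.executemany(
    "INSERT INTO proteins (Organism_ID, Annotation) VALUES (?, ?)",
    [(1, 5), (1, 5), (1, 3), (2, 5)],
)

# Unique values of a single column
single = con.execute(
    "SELECT DISTINCT Organism_ID FROM proteins ORDER BY Organism_ID"
).fetchall()

# Unique combinations of two columns
combos = con.execute(
    "SELECT DISTINCT Organism_ID, Annotation FROM proteins "
    "ORDER BY Organism_ID, Annotation"
).fetchall()

print(single)  # [(1,), (2,)]
print(combos)  # [(1, 3), (1, 5), (2, 5)]
```

The duplicate row `(1, 5)` shows up only once, but `(1, 3)` and `(1, 5)` are both kept because the combination differs.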

## WHERE

`WHERE` is used to filter records/rows according to specific conditions. 

- Multiple conditions can be combined with `AND`/`OR` or negated with `NOT`

Syntax:

    SELECT column1, column2, ...
    FROM table_name
    WHERE condition;

Operators: 

- Equal `=`
- Greater than `>`
- Less than `<`
- Greater than or equal `>=`
- Less than or equal `<=`
- Not equal `<>` (MySQL and many other dialects also accept `!=`)

These operators work for numbers *and strings*. Strings are compared alphabetically, character by character, until a mismatch is found; with MySQL's default collation this comparison is case insensitive. Text has to be in `' '`; MySQL also accepts `" "`, but standard SQL reserves double quotes for identifiers. Numbers can be in quotation marks but don't have to be.

Condition Syntax:

    column_name operator value

Example:

    SELECT * FROM secondary_structure
    WHERE Structure_Name = 'Helix';

Note that the column we filter against in the `WHERE` clause doesn't have to be displayed (selected with `SELECT`), but does have to be part of the table we select `FROM`.

## Excursion: Searching for column names

There are server-internal tables that contain all of the table information in searchable form. `INFORMATION_SCHEMA` is a hidden system database of our server that we can access using dot notation. To find e.g. all columns named `Mass` we can use the following command:

```
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = 'Mass';
```

This will show the database (`TABLE_SCHEMA`), the table therein (`TABLE_NAME`) and the column therein (`COLUMN_NAME`) for every row of the table `COLUMNS` of the internal database `INFORMATION_SCHEMA` in which the `COLUMN_NAME` is `Mass`.

## ORDER BY

`ORDER BY` is used to sort the result-set in ascending or descending order. 

- Default is ascending; for descending add `DESC` (`ASC`/`DESC` applies only to the column it directly follows)
- Careful if we have numbers in a column formatted as string: sorting will go character by character (so the order would be 1, 10, 2, etc.)
- Same if sorting by date: this only works if the column is actually formatted as date!
- It is possible to order by multiple columns; the later columns decide the order where the first column contains duplicate values
- The column we sort by has to be part of the table we work on (but doesn't have to be selected) or of the result table we create (e.g. if we add an aggregate function like `AVG`, we create a new column and can sort by that as well)

Syntax

```
SELECT column_name(s)
FROM table_name
ORDER BY column1, column2 DESC;
```
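
To make the string-sorting pitfall concrete, here is a minimal sketch. It assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the `samples` table and its columns `Label` and `Batch` are hypothetical. Note that the cast syntax differs between the two systems, as marked in the comment.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (Label TEXT, Batch INTEGER)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("2", 1), ("10", 1), ("1", 2)],
)

# Text column: compared character by character, so '10' sorts before '2'
as_text = [row[0] for row in con.execute("SELECT Label FROM samples ORDER BY Label")]

# Casting to a number restores numeric order
# (SQLite syntax; in MySQL this would be CAST(Label AS SIGNED))
as_number = [
    row[0]
    for row in con.execute("SELECT Label FROM samples ORDER BY CAST(Label AS INTEGER)")
]

# DESC only applies to the column it follows: Batch ascending, Label descending
mixed = con.execute(
    "SELECT Batch, Label FROM samples ORDER BY Batch, Label DESC"
).fetchall()

print(as_text)    # ['1', '10', '2']
print(as_number)  # ['1', '2', '10']
print(mixed)      # [(1, '2'), (1, '10'), (2, '1')]
```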

## NULL values

A field with `NULL` is a field without a value. It is **not** possible to test for this using the comparison operators (`=`, `<>`): a comparison with `NULL` never evaluates to true. We have to use `IS NULL` and `IS NOT NULL`.

Syntax

```
SELECT column_name(s)
FROM table_name
WHERE column_name IS NULL;
```
 or

```
SELECT column_name(s)
FROM table_name
WHERE column_name IS NOT NULL;
```
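
The following sketch shows why `=` and `<>` are no use for `NULL`. It assumes Python's built-in `sqlite3` module as a stand-in for MySQL (both treat `NULL` comparisons the same way); the rows in `structural_data` are invented.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE structural_data (Identifier TEXT, R_Free REAL)")
con.executemany(
    "INSERT INTO structural_data VALUES (?, ?)",
    [("1ABC", 0.21), ("2DEF", None), ("3GHI", None)],  # None is stored as NULL
)

def count(condition):
    """Number of rows for which the WHERE condition is true."""
    sql = f"SELECT COUNT(*) FROM structural_data WHERE {condition}"
    return con.execute(sql).fetchone()[0]

equals_null = count("R_Free = NULL")        # comparison yields NULL, never true
not_equals_null = count("R_Free <> NULL")   # same here
is_null = count("R_Free IS NULL")
is_not_null = count("R_Free IS NOT NULL")

print(equals_null, not_equals_null)  # 0 0
print(is_null, is_not_null)          # 2 1
```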

## LIMIT

`LIMIT` specifies how many rows are shown in the result set. Especially useful for trying out code

- easier overview
- faster return time

Syntax

```
SELECT column(s)
FROM table
LIMIT NumberOfRows;
```

To get several rows from a fixed starting point, use two numbers: `SkippedRows,NumberOfRows`. Be aware of how this is computed in the background:

1. `WHERE` filters the rows of the table
2. The remaining rows are sorted according to `ORDER BY` (if there are duplicate values this can lead to confusing results later on; if in doubt add the primary key as second sort criterion)
3. The number of rows specified in `LIMIT` is displayed. If we use two numbers, the first is the number of rows skipped in the result set.

```
SELECT column(s) FROM table LIMIT SkippedRows,NumberOfRows;
```

We can access the list from the back by sorting in descending order:

```
-- The first x rows:
SELECT column(s) FROM table WHERE condition ORDER BY column LIMIT x;

-- The last x rows (returned in reverse order):
SELECT column(s) FROM table WHERE condition ORDER BY column DESC LIMIT x;
```
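
A minimal sketch of skipping rows and of the primary key as tie-breaker. It assumes Python's built-in `sqlite3` module as a stand-in for MySQL (SQLite accepts the same `LIMIT SkippedRows,NumberOfRows` form); the masses are made up, with a deliberate duplicate.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (ID INTEGER PRIMARY KEY, Mass INTEGER)")
con.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [(1, 300), (2, 100), (3, 200), (4, 100), (5, 400)],  # IDs 2 and 4 share a mass
)

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in con.execute(sql)]

# Sorting by Mass alone leaves the order of IDs 2 and 4 undefined,
# so the primary key is added as second sort criterion.
first_two = ids("SELECT ID FROM proteins ORDER BY Mass, ID LIMIT 2")
second_and_third = ids("SELECT ID FROM proteins ORDER BY Mass, ID LIMIT 1, 2")
last_two = ids("SELECT ID FROM proteins ORDER BY Mass DESC, ID DESC LIMIT 2")

print(first_two)         # [2, 4]
print(second_and_third)  # [4, 3]
print(last_two)          # [5, 1]
```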

 

## Aggregate functions

General syntax note: when using the aggregate functions, make sure there is *no space* between the function name and the opening bracket: `MIN(Column)`, *not* `MIN (Column)`.


### MIN and MAX

`MIN` and `MAX` return the smallest and largest value of a selected column respectively

Syntax

```
SELECT MIN(Column_name)
FROM table
WHERE condition;

-- or
SELECT MAX(column_name) FROM table WHERE condition;
```

Order:

1. Filter
2. Lowest/Highest value picked

To pick the whole row we create a subquery:

- the subquery has to be encapsulated in `()`
- the subquery is run first and the returned value is used in the main select statement

```
SELECT * FROM table
WHERE column = (SELECT MIN(column) FROM table);
```
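
As a quick check of the subquery pattern, here is a sketch that assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the protein names and lengths are hypothetical. It also shows that the outer query can return several rows when the minimum occurs more than once.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE proteins (ID INTEGER PRIMARY KEY, Protein_Name TEXT, Protein_Length INTEGER)"
)
con.executemany(
    "INSERT INTO proteins VALUES (?, ?, ?)",
    [(1, "Protein A", 120), (2, "Protein B", 80), (3, "Protein C", 80)],
)

# The aggregate alone only returns the value, not the row it came from
smallest = con.execute("SELECT MIN(Protein_Length) FROM proteins").fetchone()[0]

# The subquery runs first; its result is then used as the filter value
shortest_rows = con.execute(
    "SELECT * FROM proteins "
    "WHERE Protein_Length = (SELECT MIN(Protein_Length) FROM proteins) "
    "ORDER BY ID"
).fetchall()

print(smallest)       # 80
print(shortest_rows)  # [(2, 'Protein B', 80), (3, 'Protein C', 80)]
```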

### COUNT 

- `COUNT` returns the number of rows matching a criterion
- `COUNT(*)` without a condition returns the number of rows in the table; `COUNT(column_name)` only counts the rows in which that column is not `NULL`
- very useful in combination with `DISTINCT` to show how many unique values there are

Syntax:

```
SELECT COUNT(column_name)
FROM table_name
WHERE condition;
```

with `DISTINCT`:  

```
SELECT COUNT(DISTINCT column_name) FROM table_name;
```
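
The three variants of `COUNT` side by side, as a sketch that assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the rows are made up and contain one `NULL`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (ID INTEGER PRIMARY KEY, Organism_ID INTEGER)")
con.executemany(
    "INSERT INTO proteins (Organism_ID) VALUES (?)",
    [(1,), (1,), (None,), (2,)],  # None is stored as NULL
)

def scalar(sql):
    """Run a query that returns a single value."""
    return con.execute(sql).fetchone()[0]

all_rows = scalar("SELECT COUNT(*) FROM proteins")                     # every row
non_null = scalar("SELECT COUNT(Organism_ID) FROM proteins")           # NULL skipped
unique = scalar("SELECT COUNT(DISTINCT Organism_ID) FROM proteins")    # unique non-NULL values

print(all_rows, non_null, unique)  # 4 3 2
```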



### AVG

- the average function `AVG` returns the mean of a numeric column
- `NULL` fields are ignored
- returns a floating point number
- ensure that the column is formatted as a number!!!

Syntax:

```
SELECT AVG(column_name)
FROM table_name
WHERE condition;
```

### SUM

- `SUM` returns the sum of a numeric column
- ignores `NULL` fields
- returns an integer value if all numbers are integers, or a floating point value if at least one of them is a float
- ensure that the column is formatted as a number!!!

Syntax

```
SELECT SUM(column_name)
FROM table_name
WHERE condition;
```

### ROUND

- rounds a number
- used e.g. in combination with `AVG`

Example syntax where the average is rounded to two decimals:

```
SELECT ROUND(AVG(Mass),2) FROM proteins;
```
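
A short sketch of how `SUM`, `AVG` and `ROUND` deal with `NULL` fields. It assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the lengths are invented.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (ID INTEGER PRIMARY KEY, Protein_Length INTEGER)")
con.executemany(
    "INSERT INTO proteins (Protein_Length) VALUES (?)",
    [(10,), (20,), (None,), (25,)],  # one NULL field
)

def scalar(sql):
    """Run a query that returns a single value."""
    return con.execute(sql).fetchone()[0]

total = scalar("SELECT SUM(Protein_Length) FROM proteins")
mean = scalar("SELECT AVG(Protein_Length) FROM proteins")
rounded = scalar("SELECT ROUND(AVG(Protein_Length), 2) FROM proteins")
row_count = scalar("SELECT COUNT(*) FROM proteins")

print(total)              # 55
print(rounded)            # 18.33
print(total / row_count)  # 13.75, what the mean would be if NULL counted as a row
```

`AVG` divides by the three non-`NULL` values, not by the four rows of the table.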

### UNION

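- `UNION` places the results of two `SELECT` statements beneath each other, e.g. to display two aggregate functions together
- the number of columns and their types have to match for `UNION` to work
- In this example we just place two numbers underneath each other; in more complex statements be aware of which columns end up beneath each other. `UNION` also removes duplicate rows, `UNION ALL` keeps them.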

Syntax:

```
SELECT AVG(Mass) FROM proteins
UNION
SELECT SUM(Mass) FROM proteins;
```

The column name is taken from the first `SELECT` statement (here `AVG(Mass)`). This can be changed, e.g. to `Value1`, by defining an alias in the first `SELECT` statement:

```
SELECT AVG(Mass) as Value1 FROM proteins
UNION
SELECT SUM(Mass) FROM proteins;
```
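
One thing to watch out for: `UNION` drops duplicate rows, so two identical results collapse into one, while `UNION ALL` keeps both. The sketch below assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the masses are made up so that minimum and maximum coincide.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (Mass INTEGER)")
con.executemany("INSERT INTO proteins VALUES (?)", [(50,), (50,)])

# MIN and MAX are both 50 here, so UNION keeps only one of the two rows
union_rows = con.execute(
    "SELECT MIN(Mass) AS Value1 FROM proteins "
    "UNION "
    "SELECT MAX(Mass) FROM proteins"
).fetchall()

# UNION ALL keeps duplicates
cursor = con.execute(
    "SELECT MIN(Mass) AS Value1 FROM proteins "
    "UNION ALL "
    "SELECT MAX(Mass) FROM proteins"
)
column_name = cursor.description[0][0]  # name comes from the first SELECT
union_all_rows = cursor.fetchall()

print(union_rows)      # [(50,)]
print(union_all_rows)  # [(50,), (50,)]
print(column_name)     # Value1
```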




## LIKE

- Used inside a `WHERE` clause to look for a specific pattern in text
- two wildcards:
  - `%` = 0, 1 or multiple characters
  - `_` = one single character
- the pattern is case insensitive (with MySQL's default collation)
- spaces are recognized as part of the pattern
- multiple patterns each need a complete condition, i.e. `WHERE column LIKE 'xy' OR column LIKE 'wz'`

Syntax:

```
SELECT column
FROM table_name
WHERE column LIKE pattern;
```

**Examples:**

- Protein Name starts with an A: `WHERE Protein_Name LIKE 'A%'`
- Protein Name contains the word RNA with spaces before and after: `WHERE Protein_Name LIKE '% RNA %'`
- Protein Name containing the word RNA: `WHERE Protein_Name LIKE '%RNA%'`
- Protein Name has an i at the third position: `WHERE Protein_Name LIKE '__i%'`
- Protein Name starts with A or ends with S: `WHERE Protein_Name LIKE 'A%' OR Protein_Name LIKE '%s';`

Syntax for escape characters, e.g. if searching for % anywhere within a word:

```
SELECT columns FROM table
WHERE column LIKE '%\%%';

-- or define your own escape character:
SELECT columns FROM table
WHERE column LIKE '%/%%' ESCAPE '/';
```
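
To try the wildcards on actual data, here is a sketch that assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the protein names are made up. Only the explicit `ESCAPE` form is used, because the backslash as default escape character is a MySQL feature that SQLite does not share.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; names are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (Protein_Name TEXT)")
con.executemany(
    "INSERT INTO proteins VALUES (?)",
    [("Actin",), ("Kinase",), ("RNA polymerase",), ("tRNA ligase",), ("100% pure",)],
)

def names(condition):
    """Return all protein names for which the WHERE condition is true."""
    sql = f"SELECT Protein_Name FROM proteins WHERE {condition} ORDER BY Protein_Name"
    return [row[0] for row in con.execute(sql)]

starts_with_a = names("Protein_Name LIKE 'a%'")    # lower-case pattern still matches
contains_rna = names("Protein_Name LIKE '%RNA%'")
second_char_i = names("Protein_Name LIKE '_i%'")   # exactly one character, then i
literal_percent = names("Protein_Name LIKE '%/%%' ESCAPE '/'")

print(starts_with_a)    # ['Actin']
print(contains_rna)     # ['RNA polymerase', 'tRNA ligase']
print(second_char_i)    # ['Kinase']
print(literal_percent)  # ['100% pure']
```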


More information:

https://dev.mysql.com/doc/refman/8.0/en/string-literals.html



## IN

In a `WHERE` clause, `IN` lets us specify multiple values connected by `OR` (equivalent to `%in%` in R).

Syntax:

```
SELECT column_name(s) FROM table_name
WHERE column_name IN (value1, value2, ...);
```

`IN` can also be used in a subquery:

```
SELECT column_name(s) FROM table_name
WHERE column_name IN (SELECT STATEMENT);

-- Example: look only at those entries in Kingdoms that have an entry in Organisms:
SELECT * FROM kingdoms
WHERE ID IN (SELECT Kingdom_ID FROM organisms);
```
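
Both forms of `IN` in one runnable sketch. It assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the column `Kingdom_Name` and all rows are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE kingdoms (ID INTEGER PRIMARY KEY, Kingdom_Name TEXT)")
con.execute(
    "CREATE TABLE organisms (ID INTEGER PRIMARY KEY, Organism_Name TEXT, Kingdom_ID INTEGER)"
)
con.executemany(
    "INSERT INTO kingdoms VALUES (?, ?)",
    [(1, "Bacteria"), (2, "Archaea"), (3, "Fungi")],
)
con.executemany(
    "INSERT INTO organisms VALUES (?, ?, ?)",
    [(1, "E. coli", 1), (2, "V. cholerae", 1), (3, "S. cerevisiae", 3)],
)

def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in con.execute(sql)]

# Fixed list of values
in_list = names("SELECT Kingdom_Name FROM kingdoms WHERE ID IN (1, 2) ORDER BY ID")

# List of values produced by a subquery: kingdoms that have at least one organism
in_subquery = names(
    "SELECT Kingdom_Name FROM kingdoms "
    "WHERE ID IN (SELECT Kingdom_ID FROM organisms) ORDER BY ID"
)

print(in_list)      # ['Bacteria', 'Archaea']
print(in_subquery)  # ['Bacteria', 'Fungi']
```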

Note on Python:

Python has the same idea as the membership operator `in`. The longer form `any(organism_ID == elem for elem in (1,2,3,4))` is equivalent:

```
if organism_ID in (1, 2, 3, 4):
    print(organism_ID)
```





## BETWEEN

`BETWEEN` selects values within a given range.

- `BETWEEN` is inclusive (i.e. start and end are included)
- works on numbers, strings, dates
- the start value *always* has to be lower than the end value!
  - numbers are self-explanatory
  - dates: the earlier the lower
  - strings: a < z. Careful, if you choose a range until M, this will end at the single letter M; anything longer like MA is already > M and is excluded.
- When combined with `NOT`, the range is excluded

Syntax:

```
SELECT column_name(s)
FROM table_name
WHERE column_name BETWEEN startvalue AND endvalue;
```
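
The rules above can be checked with a small sketch. It assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the table `entries` with its columns `Name` and `Score` is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries (Name TEXT, Score INTEGER)")
con.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("A", 1), ("L", 2), ("M", 3), ("MA", 4), ("N", 5)],
)

def names(condition):
    """Return the names of all rows for which the WHERE condition is true."""
    sql = f"SELECT Name FROM entries WHERE {condition} ORDER BY Score"
    return [row[0] for row in con.execute(sql)]

inclusive = names("Score BETWEEN 2 AND 4")       # both ends are included
swapped = names("Score BETWEEN 4 AND 2")         # start > end: nothing matches
excluded = names("Score NOT BETWEEN 2 AND 4")
up_to_m = names("Name BETWEEN 'A' AND 'M'")      # 'MA' is already > 'M'

print(inclusive)  # ['L', 'M', 'MA']
print(swapped)    # []
print(excluded)   # ['A', 'N']
print(up_to_m)    # ['A', 'L', 'M']
```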

## GROUP BY

`GROUP BY` groups rows that have the same value in a particular column into a 'summary' row.

Most often used together with the aggregate functions `COUNT`, `MAX`, `MIN`, `SUM`, `AVG`.

Syntax:

```
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s);
```

For example, we can count how many rows in proteins share the same value in Organism_ID.

Since we now have a new column in our result set (here `COUNT(*)`) we can also order by this new column:

```
SELECT COUNT(*), Organism_ID FROM proteins
GROUP BY Organism_ID ORDER BY COUNT(*) DESC;
```




## HAVING

It is not possible to use `WHERE` on aggregate functions, because `WHERE` filters rows before they are grouped. `HAVING` fills this gap, allowing us to filter the grouped results.

Syntax:

```
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition;
```
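
How `WHERE`, `GROUP BY` and `HAVING` work together, as a sketch that assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the organism IDs and masses are made up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL server; rows are made up.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE proteins (ID INTEGER PRIMARY KEY, Organism_ID INTEGER, Mass INTEGER)"
)
con.executemany(
    "INSERT INTO proteins (Organism_ID, Mass) VALUES (?, ?)",
    [(1, 50), (1, 150), (1, 200), (2, 120), (2, 130), (3, 500)],
)

# One summary row per Organism_ID, sorted by the new COUNT(*) column
all_groups = con.execute(
    "SELECT Organism_ID, COUNT(*) FROM proteins "
    "GROUP BY Organism_ID ORDER BY COUNT(*) DESC, Organism_ID"
).fetchall()

# WHERE removes single rows before grouping, HAVING removes whole groups afterwards
filtered = con.execute(
    "SELECT Organism_ID, COUNT(*) FROM proteins "
    "WHERE Mass > 100 "
    "GROUP BY Organism_ID "
    "HAVING COUNT(*) >= 2 "
    "ORDER BY Organism_ID"
).fetchall()

print(all_groups)  # [(1, 3), (2, 2), (3, 1)]
print(filtered)    # [(1, 2), (2, 2)]
```

Organism 1 drops from three to two proteins because of the `WHERE` filter, and organism 3 disappears because its single remaining protein does not satisfy the `HAVING` condition.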




# Exercises

## Exercise 1 - Simple SELECT

Write SELECT statements that fulfill the following criteria from the following *tables*:

1. The Name and taxonomic name of all *organisms*
2. All information about *kingdoms*
3. The resolution, the R-free value and the Clashscore for *structural_data*
4. All information about *secondary_structure*

Optional:

5. Charge, Mass and simplified formula for small molecules
6. Weight, Melting Point and IUPAC-Name for non-canonical amino acids

```
-- Exercise 1
SELECT Organism_name, taxonomy FROM organisms;
SELECT * FROM kingdoms;
SELECT Resolution, R_Free, Clashscore FROM structural_data;
SELECT * FROM secondary_structure;
SELECT Charge, Mass, SMILES FROM atom_information;
SELECT Molecular_Weight, Melting_Point, IUPAC_Name FROM modification_data;
```

## Exercise 2 - Different values

Write SELECT statements that fulfill the following criteria:

- For which Kingdom_IDs do you have organisms in the *database*?
- For which Structure_IDs do you have data in *secondary_protein*?
- For which Method_IDs do you have *structures* in your database?
- For which Protein_IDs do you have data in *modifications_proteins*?

Optional:

- For which proteins can you find structures in the database?
- For which hetero atoms do you have IUPAC-Names in the database?
- What are the different maximal repeats you can find for domains in the database?

      -- Exercise 2
      SELECT DISTINCT Kingdom_ID FROM organisms;
      SELECT DISTINCT Structure_ID FROM secondary_protein;
      SELECT DISTINCT Method_ID FROM structures;
      SELECT DISTINCT Protein_ID FROM modifications_proteins;

      -- Optional Exercises
      SELECT DISTINCT Protein_ID FROM structures;
      SELECT DISTINCT Hetero_ID FROM IUPAC_names;
      SELECT DISTINCT Max_Repeats FROM domain_data;

## Exercise 3 - Filtering Selections

Write SELECT statements that fulfill the following criteria:

- All *proteins* with more than 1000 amino acids
- All *structures* with a Source_ID of 1
- All *structural_data* with a Resolution smaller than 2.0
- All *organisms* that have a Kingdom_ID of 1
- All *proteins* with a mass smaller than 25000
- All *proteins* named ‘Cytochrome c oxidase subunit 1’

Optional:

- Atoms with a positive charge
- Atoms with a mass between 50 and 150

        -- Exercise 3
        SELECT * FROM proteins
        WHERE Protein_Length > 1000;

        SELECT * FROM structures
        WHERE Source_ID = 1;

        SELECT * FROM structural_data
        WHERE Resolution < 2.0;

        SELECT * FROM organisms
        WHERE Kingdom_ID = 1;

        SELECT * FROM proteins
        WHERE Mass < 25000;

        SELECT * FROM proteins
        WHERE Protein_Name = 'Cytochrome c oxidase subunit 1';

        -- Optional Exercises
        SELECT * FROM atom_information
        WHERE Charge > 0;

        SELECT * FROM atom_information
        WHERE Mass >= 50 AND Mass <= 150;

## Exercise 4 - Logical Operators

Write SELECT statements that fulfill the following criteria:

- All *proteins* with more than 1000 amino acids and a Mass greater than 100000
- All *structural_data* with a Resolution less than 2.0 or an R-free value smaller than 0.25
- All *proteins* where the Organism_ID is not 4
- All *organisms* with a Kingdom_ID of 1 or 2
- All *proteins* with an Organism_ID of 3 or 28

Optional:

- Atoms with a positive charge, a mass higher than 100 and a CHEBI-ID over 20000
- Structures generated with the Methods 1, 2 or 3
- Modifications with a mass over 125, more than 4 Hydrogenbond-donors and -acceptors

```
-- Exercise 4
SELECT * FROM proteins
WHERE Protein_Length > 1000 AND Mass > 100000;

SELECT * FROM structural_data
WHERE Resolution < 2.0 OR R_Free < 0.25;

SELECT * FROM proteins
WHERE NOT Organism_ID = 4;

SELECT * FROM organisms
WHERE Kingdom_ID = 1 OR Kingdom_ID = 2;

SELECT * FROM proteins
WHERE Organism_ID = 3 OR Organism_ID = 28;


-- Optional Exercises
SELECT * FROM atom_information
WHERE Charge > 0 AND Mass > 100 AND CHEBI_ID > 20000;

SELECT * FROM structures
WHERE Method_ID = 1 OR Method_ID = 2 OR Method_ID = 3;

SELECT * FROM modification_data
WHERE Molecular_Weight > 125 AND Hydrogenbond_acceptors > 4 AND Hydrogenbond_donors > 4;
```

## Exercise 5 - Ordering selections

Write SELECT statements that fulfill the following criteria:

- All *proteins* ordered by length
- All *proteins* ordered by Mass in desc. order
- All *proteins* with Organism_ID 4 ordered by Annotation
- All *structures* with Source_ID 2 ordered by their Identifier
- All *organisms* ordered by taxonomic name in desc. order
- All *structural_data* ordered by asc. Ramachandran outliers and desc. sidechain outliers

Optional:

- Non-canonic amino acids sorted by hydrogen bond donors and acceptors
- All domains with at least two repeats sorted by their minimum length

```
-- Exercise 5
SELECT * FROM proteins
ORDER BY Protein_Length;

SELECT * FROM proteins
ORDER BY Mass DESC;

SELECT * FROM proteins
WHERE Organism_ID = 4
ORDER BY Annotation;

SELECT * FROM structures
WHERE Source_ID = 2
ORDER BY Identifier;

SELECT * FROM organisms
ORDER BY Taxonomy DESC;

SELECT * FROM structural_data
ORDER BY Ramachandran_outl, Sidechain_outl DESC;

-- Optional Exercises
SELECT * FROM modification_data
ORDER BY Hydrogenbond_donors, Hydrogenbond_acceptors;

SELECT * FROM domain_data
WHERE Min_Repeats >= 2
ORDER BY Min_Size;
```

## Exercise 6 - NULL values

Write SELECT statements that fulfill the following criteria:

- All *structural_data* without an R-free value
- All *structural_data* with a Resolution
- All *domain_data* that don’t have a maximum repeat
- All *cellular_locations* that have a description
- All *modification_data* that don’t have an EC-Number

Optional:

- hetero atoms with a SMILES formula and a positive charge
- modifications without a melting point and at least 3 hydrogen bond donors sorted by name
- structures with R-free value and Resolution sorted by release date

```
-- Exercise 6
SELECT * FROM structural_data
WHERE R_Free IS NULL;

SELECT * FROM structural_data
WHERE Resolution IS NOT NULL;

SELECT * FROM domain_data
WHERE Max_Repeats IS NULL;

SELECT * FROM cellular_location
WHERE Location_Description IS NOT NULL;

SELECT * FROM modification_data
WHERE EC_Number IS NULL;

-- Optional Exercises
SELECT * FROM atom_information
WHERE SMILES IS NOT NULL AND Charge > 0;

SELECT * FROM modification_data
WHERE Melting_Point IS NULL AND Hydrogenbond_donors >= 3
ORDER BY IUPAC_Name;

SELECT * FROM structural_data
WHERE R_Free IS NOT NULL AND Resolution IS NOT NULL
ORDER BY Released;
```

## Exercise 7 - Limits

Write SELECT statements that fulfill the following criteria:

- The first ten *proteins*
- The first four *organisms* that have a Kingdom_ID of 2
- The first six *structures* with a Method_ID of 2
- The first 12 *proteins* ordered by length
- The first 5 *proteins* with Organism_ID of 1 ordered by descending Mass

Optional:

- The first 3 locations with a description and GO-Term higher than 10000
- The fourth, fifth, sixth and seventh protein with Organism_ID of 4 ordered by Mass
- The 10th to 20th from Alphafold ordered by identifier

```
-- Exercise 7
SELECT * FROM proteins LIMIT 10;

SELECT * FROM organisms
WHERE Kingdom_ID = 2
LIMIT 4;

SELECT * FROM structures
WHERE Method_ID = 2
LIMIT 6;

SELECT * FROM proteins
ORDER BY Protein_Length 
LIMIT 12;

SELECT * FROM proteins
WHERE Organism_ID = 1
ORDER BY Mass DESC
LIMIT 5;

-- Optional
SELECT * FROM cellular_location
WHERE Location_Description IS NOT NULL AND Gene_Ontology > 10000
LIMIT 3;

SELECT * FROM proteins
WHERE Organism_ID = 4 ORDER BY Mass LIMIT 3,4;

SELECT * FROM structures
WHERE Source_ID = 2
ORDER BY Identifier
LIMIT 9,11;
```


## Exercise 8 - MIN and MAX

Write SELECT statements that hand you back the following:

- The biggest Protein_Length in proteins
- The smallest Resolution in structural_data
- The alphabetically first taxonomic name of an organism
- The biggest Mass for proteins with less than 400 amino acids
- The alphabetically last name of a modification
- The complete row for the shortest protein
- The complete row for the protein with the biggest Mass

Optional:

- Rows of domains with the lowest maximum repeats ordered by Annotation

```
-- Exercise 8
SELECT MAX(protein_length) FROM proteins;
SELECT MIN(resolution) FROM structural_data;
SELECT MIN(Taxonomy) FROM organisms;
SELECT MAX(Mass) FROM proteins WHERE Protein_Length < 400;
SELECT MAX(Modification_Name) FROM modifications;
SELECT * FROM proteins
WHERE Protein_Length = (SELECT MIN(Protein_Length) FROM proteins);
SELECT * FROM proteins
WHERE Mass = (SELECT MAX(Mass) FROM proteins);

-- Optional
SELECT * FROM domain_data
WHERE Max_Repeats =(SELECT MIN(Max_Repeats) FROM domain_data)
ORDER BY Prosite_Annotation;
```



## Exercise 9 - Mathematical Operations

Write SELECT statements that answer the following questions:

- How many *proteins* have an Organism_ID of 1?
- What is the average length of *proteins*?
- How many *organisms* have a Kingdom_ID of 1?
- What is the complete length of all *proteins*?
- What is the average length of *proteins* with a mass smaller than 15000?
- What is the average resolution of all *structural_data* that have an R-free value?

Optional:

- Proteins with a Mass higher than the average Mass
- Proteins with a length smaller than the average Length of proteins with a mass > 15000

```
-- Exercise 9
SELECT COUNT(*) FROM proteins WHERE Organism_ID = 1;
SELECT AVG(Protein_Length) FROM proteins;
SELECT COUNT(ID) FROM organisms WHERE Kingdom_ID = 1;
SELECT SUM(Protein_Length) FROM proteins;
SELECT AVG(Protein_Length) FROM proteins WHERE Mass < 15000;
SELECT ROUND(AVG(Resolution),5) FROM structural_data WHERE R_Free IS NOT NULL;

-- Optional Exercises
SELECT * FROM proteins
WHERE Mass > (SELECT AVG(Mass) FROM proteins);

SELECT * FROM proteins
WHERE Protein_Length < (SELECT AVG(Protein_Length) FROM proteins WHERE Mass > 15000);
```


## Exercise 10 - Text patterns

Write SELECT statements that fulfill the following criteria:

- All *proteins* whose name starts with an “S”
- All *organisms* whose taxonomic names contain “ano”
- All *structures* with Identifiers that end with a “7”
- All *mol_functions* that include “RNA” or “DNA”
- All *modifications* that include “lysine”
- All *biol_processes* that include “ATP”, “GTP” or “UTP”

Optional:

- Which protein whose names start with “H” has the biggest Mass?
- Average length of proteins whose names start with a “C” and end with a “1”

```
-- Exercise 10
SELECT * FROM proteins WHERE Protein_Name LIKE 'S%';
SELECT * FROM organisms WHERE Taxonomy LIKE '%ano%';
SELECT * FROM structures WHERE Identifier LIKE '%7';
SELECT * FROM mol_functions WHERE Function_Name LIKE '%RNA%' OR Function_Name LIKE '%DNA%';
SELECT * FROM modifications WHERE Modification_Name LIKE '%lysine%';
SELECT * FROM biol_processes WHERE 
Process_Name LIKE '%ATP%' OR Process_Name LIKE '%GTP%' OR Process_Name LIKE '%UTP%';

-- Optional
SELECT * FROM proteins 
WHERE Protein_Name LIKE 'H%' AND
Mass = (SELECT MAX(Mass) FROM proteins WHERE Protein_Name LIKE 'H%');

SELECT AVG(Protein_Length) FROM proteins
WHERE Protein_Name LIKE 'C%1';
```


## Exercise 11 - Using IN

Write SELECT statements that fulfill the following criteria:

- All *proteins* that have Organism_IDs of 1, 3, 28, 21 or 22
- All *organisms* that have Kingdom_IDs of 1, 2, 3, 4 or 8
- For which *organisms* do we have proteins in our database
- All *IUPAC Names* for the Hetero_IDs 1, 6, 7 and 14
- All *proteins* for which structures with a Source_ID of 1 exist

Optional: 

- Which proteins have structures with a Resolution?
- Which proteins are in locations with descriptions? (Many-to-many Relationship)
- Which proteins have modifications with a melting point?

```
-- Exercise 11
SELECT * FROM proteins WHERE Organism_ID IN (1,3,28,21,22);
SELECT * FROM organisms WHERE Kingdom_ID IN (1,2,3,4,8);
SELECT * FROM organisms WHERE ID IN (SELECT Organism_ID FROM proteins);
SELECT * FROM IUPAC_names WHERE Hetero_ID IN (1,6,7,14);
SELECT * FROM proteins WHERE ID IN (SELECT Protein_ID FROM structures WHERE Source_ID = 1);

-- Optional:
SELECT * FROM proteins WHERE ID IN 
(SELECT Protein_ID FROM structures WHERE Identifier IN
(SELECT Identifier FROM structural_data WHERE Resolution IS NOT NULL));

SELECT * FROM proteins WHERE ID IN
(SELECT Protein_ID FROM protein_location WHERE Location_ID IN
(SELECT ID FROM cellular_location WHERE Location_Description IS NOT NULL));

SELECT * FROM proteins WHERE ID IN
(SELECT Protein_ID FROM modifications_proteins WHERE Modification_ID IN
(SELECT ID FROM modification_data WHERE Melting_Point IS NOT NULL));
```



## Exercise 12 - Ranges

Write SELECT statements that fulfill the following criteria:

- All *proteins* whose mass is between 25000 and 50000
- All *structural_data* whose resolution is between 2.1 and 2.8
- All *organisms* between V. cholerae and S. flexneri
- All *domains* between MG binding site and FE binding site
- All *proteins* whose length is between 100 and 200 ordered by Name

Optional:

- proteins with a length between 300 and 600 where the mass is higher than the average mass
- structures with sidechain outliers between 0 and 1 with a resolution smaller than the average resolution

```
-- Exercise 12
SELECT * FROM proteins WHERE Mass BETWEEN 25000 AND 50000;
SELECT * FROM structural_data WHERE Resolution BETWEEN 2.1 AND 2.8;
SELECT * FROM organisms WHERE Organism_name BETWEEN 'S. flexneri' AND 'V. cholerae';
SELECT * FROM domains WHERE Domain_Name BETWEEN 'FE binding site' AND 'MG binding site';
SELECT * FROM proteins WHERE Protein_Length BETWEEN 100 AND 200 ORDER BY Protein_Name;

-- Optional:
SELECT * FROM proteins WHERE Protein_Length BETWEEN 300 AND 600 AND
Mass > (SELECT AVG(Mass) FROM proteins);

SELECT * FROM structural_data 
WHERE Sidechain_outl BETWEEN 0 AND 1
AND Resolution < (SELECT AVG(Resolution) FROM structural_data);
```


## Exercise 13 - Grouping Selections

Write SELECT statements that fulfill the following criteria:

- Show the number of Protein_IDs for the different structure_IDs in *secondary_protein*
- Show the number of *organisms* for the different Kingdom_IDs
- Show the number of *structures* for the different Method_IDs
- Show the number of Protein_IDs of the different Domain_IDs in *domains_proteins*

Optional:

- Overall mass for proteins from different organisms
- Average length for proteins from different organisms with an annotation bigger than 3
- Average resolution for different clashscores of structures with an R-free value
- Overall weight of modifications grouped by the amount of hydrogen bond donors

```
-- Exercise 13
SELECT COUNT(Protein_ID), Structure_ID FROM secondary_protein GROUP BY Structure_ID;
SELECT COUNT(*), Kingdom_ID FROM organisms GROUP BY Kingdom_ID;
SELECT COUNT(Identifier), Method_ID FROM structures GROUP BY Method_ID;
SELECT COUNT(*), Domain_ID FROM domains_proteins GROUP BY Domain_ID;

-- Optional
SELECT SUM(Mass), organism_ID FROM proteins GROUP BY organism_ID;

SELECT AVG(Protein_Length), organism_ID FROM proteins
WHERE annotation > 3 GROUP BY organism_ID;

SELECT AVG(Resolution), clashscore FROM structural_data
WHERE R_Free IS NOT NULL GROUP BY Clashscore;

SELECT SUM(Molecular_Weight), Hydrogenbond_donors FROM modification_data
GROUP BY Hydrogenbond_donors;
```
