# Categorical Variables in SQL

Categorical variables take one of a finite number of categories, such as a medal type or a country code. Working with categorical data is an essential skill for a data analyst or data scientist. This template will teach you how to inspect, filter, create, and aggregate categorical variables.

This tutorial will use Summer Olympics data. You are free to create an integration to your data set or use another existing integration. You can learn more about integrations [here](https://workspace-docs.datacamp.com/integrations/what-is-an-integration).

_Note: The databases from different PostgreSQL courses are available in the Course Databases database. You can click the "Browse tables" button in the upper right-hand corner of the cell below to view the available schemas and tables. The data used for this workspace is contained in the `medals` schema. To access each table, you need to specify this schema in your queries (e.g., `medals.summer_medals` for the `summer_medals` table)._

## Inspecting categorical variables

Often, you will want to know the unique values in a categorical column. In the query below, we use the `DISTINCT` keyword to display the unique values of the `medal` variable.

👇&nbsp;&nbsp;**To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.**

In [1]:
SELECT DISTINCT medal 
FROM medals.summer_medals

Unnamed: 0,medal
0,Bronze
1,Silver
2,Gold


### Inspect unique combinations of categorical variables
You can also pass multiple columns to `DISTINCT` to inspect the unique combinations of two or more variables. In the query below, we use `DISTINCT` on `year` and `city` to return the year and host city of each Summer Olympics.

In [2]:
SELECT DISTINCT year, city
FROM medals.summer_medals
ORDER BY year

Unnamed: 0,year,city
0,1896,Athens
1,1900,Paris
2,1904,St Louis
3,1908,London
4,1912,Stockholm
5,1920,Antwerp
6,1924,Paris
7,1928,Amsterdam
8,1932,Los Angeles
9,1936,Berlin


### Inspect the counts of categorical variables
You can also use the `COUNT()` function combined with `DISTINCT` to return the number of distinct categories within a column. Here, we inspect the number of unique sports in the `sport` column.

In [3]:
SELECT COUNT(DISTINCT sport) AS sport_count
FROM medals.summer_medals

Unnamed: 0,sport_count
0,43
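
To make the effect of `DISTINCT` inside `COUNT()` concrete, here is a minimal sketch that runs on a handful of inline rows instead of the real table. The `sample_medals` rows and the athlete names are hypothetical and exist only for illustration.

```sql
-- Hypothetical sample rows standing in for medals.summer_medals
WITH sample_medals (athlete, sport) AS (
  VALUES
    ('Athlete A', 'Aquatics'),
    ('Athlete B', 'Aquatics'),
    ('Athlete C', 'Athletics'),
    ('Athlete D', 'Gymnastics'),
    ('Athlete E', 'Gymnastics')
)
SELECT
  COUNT(*)              AS row_count,   -- counts every row: 5
  COUNT(DISTINCT sport) AS sport_count  -- counts each sport once: 3
FROM sample_medals;
```

Without `DISTINCT`, `COUNT(sport)` would count every non-null value, duplicates included.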


## Filtering categorical variables
Categorical variables can also be filtered, much like quantitative variables. If you know the specific category you want to search for, you can use the `WHERE` clause to specify the category of interest.

In the query below, we return all athletes who have won a medal in the "High Jump" event.

In [4]:
SELECT athlete, country, event
FROM medals.summer_medals
WHERE event = 'High Jump'

Unnamed: 0,athlete,country,event
0,CLARK Ellery,USA,High Jump
1,CONNOLLY James,USA,High Jump
2,GARRETT Robert,USA,High Jump
3,GÖNCZY Lajos,HUN,High Jump
4,BAXTER Irving,USA,High Jump
...,...,...,...
146,DROUIN Derek,CAN,High Jump
147,GRABARZ Robert,GBR,High Jump
148,CHICHEROVA Anna,RUS,High Jump
149,BARRETT Brigetta,USA,High Jump


### Filtering for multiple values
If you want to filter for multiple values in a categorical column, you can use the `IN` operator in your `WHERE` clause. Here, we use `IN` to filter for all athletes who have won a medal in either "High Jump" _or_ "Long Jump".

You can learn more about filtering data using the `IN` operator in the second chapter of [Introduction to SQL](https://app.datacamp.com/learn/courses/introduction-to-sql).

In [5]:
SELECT athlete, country, event
FROM medals.summer_medals
WHERE event IN ('High Jump', 'Long Jump')

Unnamed: 0,athlete,country,event
0,CLARK Ellery,USA,High Jump
1,CONNOLLY James,USA,High Jump
2,GARRETT Robert,USA,High Jump
3,CONNOLLY James,USA,Long Jump
4,CLARK Ellery,USA,Long Jump
...,...,...,...
278,WATT Mitchell,AUS,Long Jump
279,CLAYE Will,USA,Long Jump
280,REESE Brittney,USA,Long Jump
281,SOKOLOVA Elena,RUS,Long Jump
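
`IN` is shorthand for a chain of equality checks joined by `OR`. The sketch below shows the long form on a few hypothetical inline rows (the `sample_medals` data is made up for illustration).

```sql
-- Hypothetical sample rows standing in for medals.summer_medals
WITH sample_medals (athlete, event) AS (
  VALUES
    ('Athlete A', 'High Jump'),
    ('Athlete B', 'Long Jump'),
    ('Athlete C', 'Triple Jump')
)
SELECT athlete, event
FROM sample_medals
-- Returns the same rows as: WHERE event IN ('High Jump', 'Long Jump')
WHERE event = 'High Jump'
   OR event = 'Long Jump';
```

Only Athlete A and Athlete B are returned. The `IN` form becomes easier to read as the list of categories grows.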


### Filtering using string matching
If you want to search for a general pattern, you can combine the `LIKE` operator with the `WHERE` clause. There are two wildcards you can use when defining a pattern.

- The `%` wildcard matches zero, one, or many characters.
- The `_` wildcard matches exactly one character.

In the query below, we enclose "Team" with two `%` wildcards to return all rows where the `event` contains "Team" (whether it is at the beginning, middle, or end). You can learn more about pattern matching in [this exercise](https://campus.datacamp.com/courses/introduction-to-sql/filtering-rows?ex=12) of [Introduction to SQL](https://app.datacamp.com/learn/courses/introduction-to-sql).

_Note: This string matching is case-sensitive. If you want to make your search case-insensitive in PostgreSQL, you can use `ILIKE` instead of `LIKE`._

In [6]:
SELECT *
FROM medals.summer_medals
WHERE event LIKE '%Team%'

Unnamed: 0,year,city,sport,discipline,athlete,country,gender,event,medal
0,1896,Athens,Gymnastics,Artistic G.,BÖCKER Konrad,GER,Men,Team Horizontal Bar,Gold
1,1896,Athens,Gymnastics,Artistic G.,FLATOW Alfred,GER,Men,Team Horizontal Bar,Gold
2,1896,Athens,Gymnastics,Artistic G.,FLATOW Gustav Felix,GER,Men,Team Horizontal Bar,Gold
3,1896,Athens,Gymnastics,Artistic G.,HILMAR Georg,GER,Men,Team Horizontal Bar,Gold
4,1896,Athens,Gymnastics,Artistic G.,HOFMANN Fritz,GER,Men,Team Horizontal Bar,Gold
...,...,...,...,...,...,...,...,...,...
4978,2012,London,Table Tennis,Table Tennis,HIRANO Sayaka,JPN,Women,Team,Silver
4979,2012,London,Table Tennis,Table Tennis,ISHIKAWA Kasumi,JPN,Women,Team,Silver
4980,2012,London,Table Tennis,Table Tennis,FENG Tian Wei,SGP,Women,Team,Bronze
4981,2012,London,Table Tennis,Table Tennis,LI Jia Wei,SGP,Women,Team,Bronze
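
The query above only exercises the `%` wildcard. As a rough sketch of how the `_` wildcard and `ILIKE` behave, the following compares three patterns on a few hypothetical event names (the `sample_events` rows are invented for illustration). Note that a `LIKE` pattern must match the whole string.

```sql
-- Hypothetical event names, used only to illustrate pattern matching
WITH sample_events (event) AS (
  VALUES
    ('Team'),
    ('team sprint'),
    ('4X100M Relay'),
    ('4X400M Relay'),
    ('4X100M Medley Relay')
)
SELECT
  event,
  event LIKE '%Team%'       AS like_team,       -- case-sensitive: true only for 'Team'
  event ILIKE '%team%'      AS ilike_team,      -- case-insensitive: also true for 'team sprint'
  event LIKE '4X_00M Relay' AS like_underscore  -- _ stands for exactly one character
FROM sample_events;
```

Here `like_underscore` is true for "4X100M Relay" and "4X400M Relay" but false for "4X100M Medley Relay", because the pattern has no wildcard to absorb the extra word.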


## Creating categorical variables
Sometimes, you will want to create categorical variables based on existing columns. You can create a categorical variable using a `CASE` statement.

A `CASE` statement has the following structure:

    CASE WHEN <condition> THEN <value> ELSE <alternative_value> END AS <column_name>

- `WHEN` specifies a condition.
- `THEN` specifies the value to return if the condition is met.
- `ELSE` specifies the value to return if the condition is not met.
- `END` finishes the `CASE` statement, after which you can use `AS` to alias the column.

In the query below, we create a new categorical variable called `event_type` based on whether the `event` column contains individual or team events. You can learn more about creating categorical variables [here](https://campus.datacamp.com/courses/intermediate-sql/well-take-the-case?ex=1).

In [7]:
SELECT 
  event,
  CASE WHEN event LIKE '%Team%' OR event LIKE '%Relay%' THEN 'Team'
       ELSE 'Individual' END AS event_type 
FROM medals.summer_medals 
WHERE discipline = 'Swimming' 
LIMIT 100

Unnamed: 0,event,event_type
0,100M Freestyle,Individual
1,100M Freestyle,Individual
2,100M Freestyle For Sailors,Individual
3,100M Freestyle For Sailors,Individual
4,100M Freestyle For Sailors,Individual
...,...,...
95,4X200M Freestyle Relay,Team
96,4X200M Freestyle Relay,Team
97,4X200M Freestyle Relay,Team
98,4X200M Freestyle Relay,Team


You can also use multiple `WHEN` clauses to provide additional conditions. In the query below, we use multiple `WHEN` clauses to create a `century` column based on the year of the Olympics.

In [8]:
SELECT DISTINCT
       year,
       city,
       CASE WHEN (year > 1800 AND year <= 1900) THEN '19th Century'
            WHEN (year > 1900 AND year <= 2000) THEN '20th Century'
            WHEN (year > 2000 AND year <= 2100) THEN '21st Century'
            ELSE 'Before 19th Century' END AS century
FROM medals.summer_medals
ORDER BY year

Unnamed: 0,year,city,century
0,1896,Athens,19th Century
1,1900,Paris,19th Century
2,1904,St Louis,20th Century
3,1908,London,20th Century
4,1912,Stockholm,20th Century
5,1920,Antwerp,20th Century
6,1924,Paris,20th Century
7,1928,Amsterdam,20th Century
8,1932,Los Angeles,20th Century
9,1936,Berlin,20th Century
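
`WHEN` clauses are evaluated in order, and the first condition that is true determines the result. The sketch below relies on that ordering to shorten the conditions. It uses hypothetical inline years rather than the real table, and its labels are simplified for illustration, so it is not a drop-in replacement for the query above.

```sql
-- Hypothetical years, used only to illustrate evaluation order
WITH sample_years (year) AS (
  VALUES (1896), (1900), (1904), (2000), (2012)
)
SELECT
  year,
  CASE WHEN year <= 1900 THEN '19th Century or earlier'
       WHEN year <= 2000 THEN '20th Century'  -- only reached when year > 1900
       ELSE '21st Century or later' END AS century
FROM sample_years
ORDER BY year;
```

If you leave out `ELSE`, rows that match no `WHEN` condition get `NULL` in the new column.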


## Aggregating categorical variables
Categorical variables can be aggregated using a `GROUP BY` clause. You group the rows by the categorical variable and apply an aggregation function to the column or columns of interest within each group. Example aggregation functions include `SUM()`, `AVG()`, `COUNT()`, `MIN()`, and `MAX()`.

The following query returns the total count of medals by the `medal` type. You can learn more about aggregations [here](https://campus.datacamp.com/courses/introduction-to-sql/aggregate-functions?ex=1).

In [9]:
SELECT medal, COUNT(*) AS medal_count
FROM medals.summer_medals
GROUP BY medal

Unnamed: 0,medal,medal_count
0,Bronze,10369
1,Silver,10310
2,Gold,10486
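
You can also group by more than one categorical column, which returns one row per combination of categories. Here is a minimal sketch on hypothetical inline rows (the `sample_medals` data is made up for illustration).

```sql
-- Hypothetical sample rows standing in for medals.summer_medals
WITH sample_medals (country, medal) AS (
  VALUES
    ('USA', 'Gold'),
    ('USA', 'Gold'),
    ('USA', 'Silver'),
    ('CAN', 'Gold'),
    ('CAN', 'Bronze')
)
SELECT country, medal, COUNT(*) AS medal_count
FROM sample_medals
GROUP BY country, medal  -- one group per (country, medal) pair
ORDER BY country, medal;
```

Every non-aggregated column in the `SELECT` list also appears in the `GROUP BY` clause.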


### Aggregating using CASE statements
Apart from categorizing data, the `CASE` statement can also be used in combination with an aggregation function to perform more complex aggregations.

The following query uses a `CASE` statement inside `AVG()` to calculate, for each `country`, the fraction of its medals that are gold where the `discipline` is "Swimming". A second `CASE` statement inside `SUM()` calculates the number of gold medals that country has won in swimming. You can learn more about aggregating categorical variables using `CASE` statements [here](https://campus.datacamp.com/courses/intermediate-sql/well-take-the-case?ex=8).

In [10]:
SELECT
    country,
    AVG(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) AS pct_gold,
    SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) AS gold_count
FROM medals.summer_medals
WHERE discipline = 'Swimming'
GROUP BY country
ORDER BY pct_gold DESC

Unnamed: 0,country,pct_gold,gold_count
0,LTU,1.0,1
1,IRL,0.75,3
2,UKR,0.571429,4
3,USA,0.558659,500
4,SUR,0.5,1
5,MEX,0.5,1
6,YUG,0.5,1
7,TUN,0.5,1
8,ANZ,0.454545,5
9,GDR,0.386861,53
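
To see why averaging a 1/0 flag yields a fraction, the sketch below computes the same value two ways on hypothetical inline rows. The `sample_medals` data, the country codes, and the `gold_fraction` alias are invented for illustration, and the `::numeric` cast is PostgreSQL syntax.

```sql
-- Hypothetical sample rows standing in for medals.summer_medals
WITH sample_medals (country, medal) AS (
  VALUES
    ('AAA', 'Gold'),
    ('AAA', 'Gold'),
    ('AAA', 'Silver'),
    ('AAA', 'Bronze'),
    ('BBB', 'Silver')
)
SELECT
  country,
  -- Mean of a 1/0 flag = share of rows where the flag is 1
  AVG(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END) AS pct_gold,
  -- Same fraction written out; the cast avoids integer division
  SUM(CASE WHEN medal = 'Gold' THEN 1 ELSE 0 END)::numeric / COUNT(*) AS gold_fraction
FROM sample_medals
GROUP BY country
ORDER BY country;
```

Both columns come out to 0.5 for "AAA" (two golds out of four medals) and 0 for "BBB". As the LTU row in the output above shows, a country with a single medal can top the ranking, so it helps to read the fraction alongside the count.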


## Next Steps
Well done on completing this tutorial! If you are interested in learning more about working with categorical data in SQL, check out [Intermediate SQL](https://app.datacamp.com/learn/courses/intermediate-sql). You can also continue to hone your SQL skills by taking the courses in the [SQL Fundamentals Skill Track](https://app.datacamp.com/learn/skill-tracks/sql-fundamentals).

If you are interested in applying these new skills to other SQL databases, check out our sample integrations [here](https://app.datacamp.com/workspace/integrations)!