# Cleaning SQL

**LEFT** pulls a specified number of characters from the start (the left side) of each value in a column. For example, **LEFT(phone_number, 3)** returns the first three characters of a phone number.

**RIGHT** pulls a specified number of characters from the end (the right side) of each value in a column. For example, **RIGHT(phone_number, 8)** returns the last eight characters of a phone number.

**LENGTH** returns the number of characters in each value of a column. For example, **LENGTH(phone_number)** gives the length of each phone number.

Example:
```sql
SELECT first_name,
       last_name,
       phone_number,
       LEFT(phone_number, 3) AS area_code,
       RIGHT(phone_number, 8) AS phone_number_only,
       RIGHT(phone_number, LENGTH(phone_number) - 4) AS phone_number_alt
FROM customer_data;
```
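To check the arithmetic in the query above, the following Python sketch mimics LEFT, RIGHT, and LENGTH with string slicing. The helper names `sql_left` and `sql_right` and the sample number are made up for illustration. It assumes phone numbers are 12 characters in the form `XXX-XXX-XXXX` and only handles non-negative counts.

```python
def sql_left(s: str, n: int) -> str:
    """Mimic SQL LEFT(s, n) for n >= 0: the first n characters."""
    return s[:n]


def sql_right(s: str, n: int) -> str:
    """Mimic SQL RIGHT(s, n) for n >= 0: the last n characters."""
    return s[-n:] if n > 0 else ""


phone_number = "301-555-0142"  # hypothetical sample value

area_code = sql_left(phone_number, 3)
phone_number_only = sql_right(phone_number, 8)
# Same result without hard-coding 8: drop the area code and its dash.
phone_number_alt = sql_right(phone_number, len(phone_number) - 4)

print(area_code, phone_number_only, phone_number_alt)
```

The `LENGTH(phone_number) - 4` form is the safer choice when the part after the area code can vary in length.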

Test Questions: 
1. In the accounts table, there is a column holding the website for each company. The last three characters specify what type of web address they are using. A list of extensions (and pricing) is provided here. Pull these extensions and provide how many of each website type exist in the accounts table.
```sql
SELECT RIGHT(website, 3) AS extension_code, COUNT(*) AS count
FROM accounts
GROUP BY extension_code
ORDER BY count DESC;
```
2. There is much debate about how much the name (or even the first letter of a company name) matters. Use the accounts table to pull the first letter of each company name to see the distribution of company names that begin with each letter (or number).
```sql
SELECT LEFT(a.name, 1) AS first_letter, COUNT(*) AS num_companies
FROM accounts a
GROUP BY 1
ORDER BY 2 DESC;
```
3. Use the accounts table and a CASE statement to create two groups: one group of company names that start with a number and a second group of those company names that start with a letter. What proportion of company names start with a letter?
```sql
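WITH first_letter_table AS (
    SELECT LEFT(a.name, 1) AS first_letter, COUNT(*) AS num_companies
    FROM accounts a
    GROUP BY 1
)

-- Sum the per-letter counts so each group counts companies, not distinct first characters.
SELECT CASE WHEN first_letter ~ '^\d$' THEN 'Number' ELSE 'Letter' END AS letter_type,
       SUM(num_companies) AS type_count,
       ROUND(SUM(num_companies) / SUM(SUM(num_companies)) OVER (), 3) AS proportion
FROM first_letter_table
GROUP BY 1;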
```
4. Consider vowels as a, e, i, o, and u. What proportion of company names start with a vowel, and what percent start with anything else?
```sql
WITH first_letter_table AS (
    SELECT LEFT(a.name, 1) AS first_letter, COUNT(*) AS num_companies
    FROM accounts a
    GROUP BY 1
),
vowel AS (
    -- [aeiou] without commas: a comma inside the brackets would itself be matched.
    -- Carry the company count so the totals count companies, not distinct letters.
    SELECT
        CASE WHEN first_letter ~* '^[aeiou]$' THEN num_companies ELSE 0 END AS is_vowel,
        CASE WHEN first_letter ~* '^[aeiou]$' THEN 0 ELSE num_companies END AS not_vowel
    FROM first_letter_table
)

SELECT SUM(is_vowel) AS vowels, SUM(not_vowel) AS other_letters
FROM vowel;
```


## POSITION, STRPOS, LOWER, UPPER

- POSITION takes a character (or substring) and a column, and returns the index where it first appears in each row's value, or 0 if it does not appear. The index of the first position is 1 in SQL, unlike the many programming languages that begin indexing at 0. For example, POSITION(',' IN city_state) returns the index of the comma.
- STRPOS returns the same result as POSITION, but the syntax is a bit different: STRPOS(city_state, ',').
- LOWER makes the whole string lowercase: LOWER(city_state).
- UPPER makes the whole string uppercase: UPPER(city_state).
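
The 1-based indexing is easy to get wrong when moving between SQL and a 0-based language. As a rough illustration, this Python sketch mimics STRPOS with `str.find`. The helper name `strpos` and the sample `city_state` value are assumptions made for the example.

```python
def strpos(s: str, sub: str) -> int:
    """Mimic SQL STRPOS(s, sub): 1-based index of sub in s, or 0 if absent."""
    # str.find is 0-based and returns -1 when sub is missing,
    # so adding 1 gives SQL's 1-based index and 0 for "not found".
    return s.find(sub) + 1


city_state = "Boston, MA"  # hypothetical sample value

comma_at = strpos(city_state, ",")
# Everything before the comma, like LEFT(city_state, comma_at - 1).
city = city_state[: comma_at - 1]

print(comma_at, city)
print(city_state.lower(), "|", city_state.upper())  # LOWER and UPPER
print(strpos(city_state, ";"))  # no semicolon in the string
```

Subtracting 1 from the position is what keeps the comma itself out of the extracted city.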

Test Questions:
1. Use the accounts table to create first and last name columns that hold the first and last names for the primary_poc.
```sql
SELECT primary_poc,
       LEFT(primary_poc, POSITION(' ' IN primary_poc) - 1) AS first_name,
       RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) AS last_name
FROM accounts;
```
2. Now see if you can do the same thing for every rep name in the sales_reps table. Again provide first and last name columns.
```sql
SELECT s.name,
       -- Subtract 1 so the first name does not include the trailing space.
       LEFT(s.name, STRPOS(s.name, ' ') - 1) AS first_name,
       RIGHT(s.name, LENGTH(s.name) - STRPOS(s.name, ' ')) AS last_name
FROM sales_reps s;
```
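
The split logic in these answers can be traced step by step in Python. This is only a sketch: `split_name` and the sample names are hypothetical, and it assumes every name contains at least one space.

```python
def split_name(full_name: str) -> tuple:
    """Split 'First Last' at the first space, mirroring the LEFT/RIGHT logic."""
    space_at = full_name.find(" ") + 1      # 1-based, like STRPOS(name, ' ')
    first_name = full_name[: space_at - 1]  # LEFT(name, space_at - 1)
    n_right = len(full_name) - space_at     # LENGTH(name) - space_at
    # RIGHT(name, n_right): everything after the first space.
    last_name = full_name[-n_right:] if n_right > 0 else ""
    return first_name, last_name


print(split_name("Ada Lovelace"))
print(split_name("Ludwig van Beethoven"))
```

Because the split happens at the first space, any middle names or particles end up in the last name column.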


## CONCAT

Used to combine values from several columns into a single string within each row
```sql
CONCAT(first_name, ' ', last_name) AS full_name
```
An alternative is to use || to pipe them together
```sql
first_name || ' ' || last_name AS full_name
```
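
One practical difference between the two forms is how they treat NULL. In PostgreSQL, CONCAT skips NULL arguments, while `||` returns NULL if any operand is NULL. The Python sketch below mimics that behavior using `None` for NULL. The helper names `concat` and `pipe` are made up for illustration, and other databases may behave differently.

```python
from typing import Optional


def concat(*parts: Optional[str]) -> str:
    """Mimic PostgreSQL CONCAT: NULL (None) arguments are skipped."""
    return "".join(p for p in parts if p is not None)


def pipe(*parts: Optional[str]) -> Optional[str]:
    """Mimic chained ||: any NULL (None) operand makes the result NULL."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


print(concat("Ada", " ", "Lovelace"))  # both names present
print(pipe("Ada", " ", "Lovelace"))    # same result as concat
print(repr(concat("Ada", " ", None)))  # missing last name is skipped
print(repr(pipe("Ada", " ", None)))    # missing last name nulls the result
```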
Test Questions:
1. Each company in the accounts table wants to create an email address for each primary_poc. The email address should be the first name of the primary_poc, a period, the last name of the primary_poc, then @, the company name, and .com (that is, `first_name.last_name@company_name.com`).
```sql
WITH pos_data AS (
    SELECT
    LEFT(a.primary_poc, STRPOS(a.primary_poc, ' ')-1) AS first_name,
    RIGHT(a.primary_poc, LENGTH(a.primary_poc) - STRPOS(a.primary_poc, ' ')) AS last_name,
    a.name AS company_name
    FROM accounts a
)

SELECT first_name || '.' || last_name || '@' || company_name || '.com' AS email
FROM pos_data
```
2. You may have noticed that in the previous solution some of the company names include spaces, which will certainly not work in an email address. See if you can create an email address that will work by removing all of the spaces in the account name, but otherwise your solution should be just as in question 1.
```sql
WITH pos_data AS (
    SELECT
    LEFT(a.primary_poc, STRPOS(a.primary_poc, ' ')-1) AS first_name,
    RIGHT(a.primary_poc, LENGTH(a.primary_poc) - STRPOS(a.primary_poc, ' ')) AS last_name,
    a.name AS company_name
    FROM accounts a
)

SELECT first_name || '.' || last_name || '@' || REPLACE(company_name, ' ', '') || '.com' AS email
FROM pos_data
```
3. We would also like to create an initial password, which they will change after their first log in. The first password will be the first letter of the primary_poc's first name (lowercase), then the last letter of their first name (lowercase), the first letter of their last name (lowercase), the last letter of their last name (lowercase), the number of letters in their first name, the number of letters in their last name, and then the name of the company they are working with, all capitalized with no spaces.
```sql
WITH pos_data AS (
    SELECT
    LEFT(a.primary_poc, STRPOS(a.primary_poc, ' ')-1) AS first_name,
    RIGHT(a.primary_poc, LENGTH(a.primary_poc) - STRPOS(a.primary_poc, ' ')) AS last_name,
    a.name AS company_name
    FROM accounts a
)

SELECT first_name || '.' || last_name || '@' || REPLACE(company_name, ' ', '') || '.com' AS email,
       LOWER(LEFT(first_name, 1)) || LOWER(RIGHT(first_name, 1))
       || LOWER(LEFT(last_name, 1)) || LOWER(RIGHT(last_name, 1))
       || LENGTH(first_name) || LENGTH(last_name)
       || UPPER(REPLACE(company_name, ' ', '')) AS password
FROM pos_data;
```



## CAST

Allows us to change columns from one data type to another.

For example, `DATE_PART('month', TO_DATE(month, 'month'))` converts a month name into the number associated with that month.

```sql 
CAST(date_column AS DATE) 
```
Shorthand:
```sql
date_column::DATE
```

Extracting and cleaning a date:

- Original: `01/31/2014 08:00:00 AM +0000`
- After: `2014-01-31T00:00:00.000Z`

```sql
WITH date_full AS (
    SELECT SUBSTR(date, 1, STRPOS(date, ' ') - 1) AS date_pull
    FROM sf_crime_data
)

SELECT LEFT(date_pull, 2) AS month,
       SUBSTR(date_pull, 4, 2) AS day,
       RIGHT(date_pull, 4) AS year,
       (RIGHT(date_pull, 4) || '-' || LEFT(date_pull, 2) || '-' || SUBSTR(date_pull, 4, 2))::DATE AS date_cleaned
FROM date_full
LIMIT 10;
```
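
To see what each piece of that query extracts, here is a rough Python walk-through of the same steps on the sample value from these notes. It is a sketch of the string handling only, with slices standing in for SUBSTR, LEFT, and RIGHT, and `date.fromisoformat` standing in for the `::DATE` cast.

```python
from datetime import date

raw = "01/31/2014 08:00:00 AM +0000"  # sample value from the notes above

# Keep everything before the first space,
# like SUBSTR(date, 1, STRPOS(date, ' ') - 1).
date_pull = raw[: raw.find(" ")]

month = date_pull[:2]   # LEFT(date_pull, 2)
day = date_pull[3:5]    # SUBSTR(date_pull, 4, 2): 1-based start 4, length 2
year = date_pull[-4:]   # RIGHT(date_pull, 4)

# Reassemble as YYYY-MM-DD, then convert the string to a real date.
date_cleaned = date.fromisoformat(f"{year}-{month}-{day}")
print(date_pull, month, day, year, date_cleaned)
```

The pieces have to be reordered to year-month-day before the cast because that is the format the conversion expects here.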

## COALESCE function

Returns the first non-NULL value among its arguments, evaluated for each row

```sql
SELECT COALESCE(initial, alt_if_null, cont)
```
COALESCE goes in the SELECT column list, and it accepts as many fallback values as needed; it tries them in order until one is non-NULL.

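```sql
-- COALESCE(*, 0) is not valid SQL; each column must be wrapped individually.
SELECT a.id, a.name, COALESCE(o.total, 0) AS total
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
```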
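
The effect of COALESCE on a LEFT JOIN can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch: SQLite stands in for the course database, and the two tiny tables and their rows are invented for the example, with only the columns needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250);  -- Globex has no orders
""")

rows = conn.execute("""
    SELECT a.name,
           o.total              AS raw_total,
           COALESCE(o.total, 0) AS filled_total
    FROM accounts a
    LEFT JOIN orders o ON a.id = o.account_id
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

The account without orders comes back with a NULL `raw_total` (shown as `None` in Python), while `filled_total` falls back to 0.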