# SQL Data Cleaning 1


**In this lesson, you will learn a number of techniques to**

* Clean and restructure messy data.
* Convert columns to different data types.
* Manipulate NULLs.

This will give you a robust toolkit to get from raw data to clean data that's useful for analysis.


We connect to a MySQL server (alongside MySQL Workbench) and run the analysis against the parch-and-posey database. This notebook is the practical companion to the Udacity course SQL for Data Analysis.

In [None]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

In [None]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

db_name = 'parch_and_posey'
try:
    connection = mysql.connector.connect(host='localhost',
                                         database=db_name,
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

In [None]:
# Let's see the tables in the database

# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

**Define a function that converts the result of a query to a DataFrame**

In [None]:
def query_to_df(query):
    """Run `query` on the open cursor and return its result set as a DataFrame."""
    st = time.time()
    # Every query must end with a semicolon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Process the query
    cursor.execute(query)
    if cursor.description is None:
        # Statements such as USE produce no result set, so there is nothing to fetch
        print(f'Query ran for {time.time()-st} secs!')
        return None

    # Build the DataFrame in one step; column names come from cursor.description
    columns = [col[0] for col in cursor.description]
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [None]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

In [None]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

In [None]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

In [None]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

In [None]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

### LEFT
### RIGHT
### LENGTH

**LEFT** pulls a specified number of characters for each row in a specified column, starting at the beginning (from the left). For example, you can pull the first three digits of a phone number using `LEFT(phone_number, 3)`.


**RIGHT** pulls a specified number of characters for each row in a specified column, starting at the end (from the right). For example, you can pull the last eight characters of a phone number using `RIGHT(phone_number, 8)`.


**LENGTH** provides the length of the value in each row of a specified column, for example `LENGTH(phone_number)`. Note that in MySQL, LENGTH counts bytes; for text that may contain multi-byte characters, `CHAR_LENGTH` counts characters instead.
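
To make the semantics of the three functions concrete, here is a minimal Python sketch that mimics them with string slicing. The helper names `sql_left` and `sql_right` and the phone number are hypothetical, chosen only for illustration; this is an analogy, not how the database implements the functions.

```python
def sql_left(text: str, n: int) -> str:
    """Mimic SQL LEFT: the first n characters (empty string if n <= 0)."""
    return text[:n] if n > 0 else ""


def sql_right(text: str, n: int) -> str:
    """Mimic SQL RIGHT: the last n characters (empty string if n <= 0)."""
    return text[-n:] if n > 0 else ""


phone_number = "415-555-0123"  # hypothetical value, not taken from the database

area_code = sql_left(phone_number, 3)    # like LEFT(phone_number, 3)
local_part = sql_right(phone_number, 8)  # like RIGHT(phone_number, 8)
n_chars = len(phone_number)              # like LENGTH(phone_number) for ASCII text

print(area_code, local_part, n_chars)
```

Asking for more characters than the string holds simply returns the whole string, which matches how LEFT and RIGHT behave in SQL.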


**LEFT & RIGHT Quizzes**

* In the accounts table, there is a column holding the website for each company. The last three characters specify what type of web address they are using. Pull these extensions and provide how many of each website type exist in the accounts table.

In [None]:
query_to_df(
"SELECT RIGHT(website, 3) website_type, COUNT(*) counts FROM accounts GROUP BY 1;"
)

* There is much debate about how much the name (or even the first letter of a company name) matters. Use the accounts table to pull the first letter of each company name to see the distribution of company names that begin with each letter (or number).

In [None]:
query_to_df(
"SELECT LEFT(name, 1) first_char, COUNT(*) counts FROM accounts GROUP BY 1 ORDER BY 2 DESC;"
)

* Use the accounts table and a CASE statement to create two groups: one group of company names that start with a number and a second group of those company names that start with a letter. What proportion of company names start with a letter?

In [None]:
# Let's see how many company names start with a letter and how many start with a digit

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char BETWEEN '0' AND '9' THEN 'is_num' ELSE 'is_alpha' END char_type, COUNT(*) counts \
FROM table1 GROUP BY 1) \
SELECT * FROM table2;"
)

In [None]:
# Next, let's calculate the percentage of company names whose first character is a letter

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char BETWEEN '0' AND '9' THEN 'is_num' ELSE 'is_alpha' END new_col, COUNT(*) counts \
FROM table1 GROUP BY 1), \
table3 AS (SELECT counts FROM table2 WHERE new_col='is_alpha'), \
table4 AS (SELECT ((SELECT * FROM table3) / SUM(counts)) letter_prop FROM table2) \
SELECT letter_prop*100 letter_pct FROM table4;"
)
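
As a cross-check on the CTE approach, the share of names that start with a letter can also be computed in a single pass with conditional aggregation. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module with a handful of made-up company names instead of the MySQL database, and SQLite spells `LEFT(name, 1)` as `substr(name, 1, 1)`.

```python
import sqlite3

# In-memory stand-in for the accounts table; the rows are made up for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("3M",), ("Walmart",), ("Apple",), ("Exxon Mobil",)],
)

# Averaging a 0/1 flag gives the proportion of rows where the flag is 1
letter_prop = conn.execute(
    "SELECT AVG(CASE WHEN substr(name, 1, 1) BETWEEN '0' AND '9' "
    "THEN 0.0 ELSE 1.0 END) FROM accounts"
).fetchone()[0]
conn.close()

print(letter_prop)
```

With three of the four sample names starting with a letter, the query yields 0.75.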

* Consider vowels as a, e, i, o, and u. What proportion of company names start with a vowel, and what percent start with anything else?

In [None]:
# Let's first count the names whose first letter is a vowel and those whose first letter is not

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char IN ('A', 'E', 'I', 'O', 'U') THEN 'is_vowel' ELSE 'is_not' END vowel_or_not, \
COUNT(*) counts FROM table1 GROUP BY 1) \
SELECT * FROM table2;"
)

In [None]:
# Next, let's calculate the proportion of vowels in the first letters of company names

query_to_df(
"WITH \
table1 AS (SELECT LEFT(name, 1) first_char FROM accounts), \
table2 AS (SELECT CASE WHEN first_char IN ('A', 'E', 'I', 'O', 'U') THEN 'is_vowel' ELSE 'is_not' END vowel_or_not, \
COUNT(*) counts FROM table1 GROUP BY 1), \
table3 AS (SELECT counts FROM table2 WHERE vowel_or_not='is_vowel'), \
table4 AS (SELECT ((select * FROM table3)/SUM(counts)) vowel_prop FROM table2) \
SELECT vowel_prop*100 vowel_pct FROM table4;"
)

### POSITION
### STRPOS
### LOWER
### UPPER

**POSITION** takes a character and a column, and provides the index of that character for each row. SQL indexes start at 1, unlike many programming languages, which start at 0. For example, you can pull the index of a comma as **`POSITION(',' IN city_state)`**.


**STRPOS** provides the same result as POSITION, but the argument order differs: **`STRPOS(city_state, ',')`**. STRPOS is a PostgreSQL function; MySQL, which this notebook connects to, offers `INSTR(city_state, ',')` with the same argument order.


Note that in PostgreSQL both POSITION and STRPOS are case sensitive, so looking for A is different from looking for a. In MySQL, these searches are case sensitive only if at least one argument is a binary string.


Therefore, if you want to pull an index regardless of the case of a letter on any database, use **LOWER or UPPER** to make all of the characters lowercase or uppercase first.
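
The 1-based indexing and the effect of lowercasing can be seen in a small Python analogy. The helper `sql_position` and the sample value are hypothetical; Python's own `str.find` is 0-based and case sensitive, so the helper adds 1 to match SQL's convention.

```python
def sql_position(substring: str, text: str) -> int:
    """Mimic POSITION(substring IN text): 1-based index, 0 when absent."""
    return text.find(substring) + 1


city_state = "San Francisco, CA"  # hypothetical value

comma_at = sql_position(",", city_state)       # 1-based index of the comma
city = city_state[:comma_at - 1]               # like LEFT(city_state, comma_at - 1)
exact_case = sql_position("ca", city_state)    # case-sensitive search finds nothing
any_case = sql_position("ca", city_state.lower())  # like POSITION('ca' IN LOWER(city_state))

print(comma_at, city, exact_case, any_case)
```

Lowercasing the text before searching is what makes the last lookup succeed where the case-sensitive one returns 0.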

**Position Quizzes**

* Use the accounts table to create first and last name columns that hold the first and last names for the primary_poc.

In [None]:
query_to_df(
"SELECT LEFT(primary_poc, POSITION(' ' IN primary_poc)-1) first_name, \
RIGHT(primary_poc, LENGTH(primary_poc) - POSITION(' ' IN primary_poc)) last_name FROM accounts;"
)

* Now see if you can do the same thing for every rep name in the sales_reps table. Again provide first and last name columns.

In [None]:
query_to_df(
"SELECT LEFT(name, POSITION(' ' IN name)-1) first_name, \
RIGHT(name, LENGTH(name) - POSITION(' ' IN name)) last_name FROM sales_reps;"
)

### CONCAT or Piping ||
### REPLACE

**CONCAT** and **piping (`||`)** both combine values from several columns into a single value for each row. For example, first and last names stored in separate columns can be combined into a full name:
```
CONCAT(first_name, ' ', last_name)
```
or with piping as
```
first_name || ' ' || last_name
```
Note that MySQL treats `||` as a logical OR unless the `PIPES_AS_CONCAT` SQL mode is enabled, so the queries in this notebook use CONCAT.

**REPLACE** takes a column, the value to replace, and the new value to put in its place. For example,
```
REPLACE(name, ' ', '_')
```
replaces the spaces with underscores in each row of the name column.
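
One behavior worth knowing is that MySQL's CONCAT returns NULL as soon as any argument is NULL. The Python sketch below imitates that rule with `None` standing in for NULL; `sql_concat` and the sample names are hypothetical and exist only for this illustration.

```python
from typing import Optional


def sql_concat(*parts: Optional[str]) -> Optional[str]:
    """Mimic MySQL CONCAT: join the parts, or return None (NULL) if any part is None."""
    if any(part is None for part in parts):
        return None
    return "".join(parts)


full_name = sql_concat("Jane", " ", "Doe")   # like CONCAT(first_name, ' ', last_name)
with_null = sql_concat("Jane", " ", None)    # a NULL last name makes the whole result NULL
snake = "Acme Corp".replace(" ", "_")        # like REPLACE(name, ' ', '_')

print(full_name, with_null, snake)
```

If a column may contain NULLs, wrapping it in COALESCE before concatenating avoids losing the whole value.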

**CONCAT Quizzes**

* Each company in the accounts table wants to create an email address for each primary_poc. The email address should have the form first_name.last_name@company_name.com, built from the primary_poc's first name, the primary_poc's last name, and the company name.

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT LEFT(primary_poc, POSITION(' ' IN primary_poc)-1) first_name, \
RIGHT(primary_poc, LENGTH(primary_poc)-POSITION(' ' IN primary_poc)) last_name, \
REPLACE(name, ' ', '') company FROM accounts), \
table2 AS (SELECT *, CONCAT(LOWER(first_name), '.', LOWER(last_name), '@', LOWER(company), '.com') email FROM table1) \
SELECT * FROM table2 LIMIT 20;"
)

* We would also like to create an initial password, which they will change after their first login. The initial password will be made up of<br> 
a. The first letter of the primary_poc's first name (lowercase), then<br> 
b. The last letter of their first name (lowercase), <br>
c. The first letter of their last name (lowercase), <br>
d. The last letter of their last name (lowercase), <br>
e. The number of letters in their first name, <br>
f. The number of letters in their last name, and then <br> 
g. The name of the company they are working with, all capitalized with no spaces.

In [None]:
query_to_df(
"WITH \
table1 AS (SELECT LOWER(LEFT(primary_poc, POSITION(' ' IN primary_poc)-1)) first_name, \
LOWER(RIGHT(primary_poc, LENGTH(primary_poc)-POSITION(' ' IN primary_poc))) last_name, \
UPPER(REPLACE(name, ' ', '')) company FROM accounts), \
\
table2 AS (SELECT *, CONCAT(LEFT(first_name, 1), RIGHT(first_name, 1), \
LEFT(last_name, 1), RIGHT(last_name, 1), LENGTH(first_name), LENGTH(last_name), company) signature FROM table1) \
\
SELECT * FROM table2;"
)
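
The password recipe can be cross-checked outside the database with a few lines of Python. This is a sketch under the same assumption the query makes, namely that the contact name consists of a first and last name separated by the first space; the function name `initial_password` and the sample values are hypothetical.

```python
def initial_password(primary_poc: str, company: str) -> str:
    """Build the initial password from steps a-g of the quiz."""
    # Split at the first space, mirroring the POSITION(' ' IN primary_poc) logic
    first, last = primary_poc.lower().split(" ", 1)
    return (
        first[0] + first[-1]                    # a, b: first and last letter of first name
        + last[0] + last[-1]                    # c, d: first and last letter of last name
        + str(len(first)) + str(len(last))      # e, f: lengths of both names
        + company.replace(" ", "").upper()      # g: company name, upper case, no spaces
    )


password = initial_password("Jane Doe", "Acme Corp")  # made-up contact and company
print(password)
```

For the made-up contact above the result is `jede43ACMECORP`, which can be compared by eye against each step of the recipe.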

## Cast
## Casting with ::

You can change a string to a date using CAST. CAST is useful for changing many column types: commonly you might use it to change a string to a date, a datetime, or a number. It works in the reverse direction too, for example to change a number to a string. The `::` shorthand named in the heading (as in `new_date::date`) is PostgreSQL syntax; MySQL supports only the CAST form.

**Expert Tip**
Most of the functions presented in this lesson are specific to strings. They are not defined for dates, integers or floating-point numbers; when you apply one of them to such a value, the value is automatically converted to a string first.

LEFT, RIGHT, and TRIM are all used to select only certain elements of strings, but using them to select elements of a number or date will treat them as strings for the purpose of the function. Though we didn't cover TRIM in this lesson explicitly, it can be used to remove characters from the beginning and end of a string. This can remove unwanted spaces at the beginning or end of a value, which often appear when data is moved from Excel or other storage systems.

There are a number of variations of these functions, as well as several other string functions not covered here. Different databases use subtle variations on these functions, so be sure to look up the appropriate database's syntax if you're connected to a private database. The Postgres documentation covers many of the related functions.

**CAST Quizzes:**<br>
For this set of quiz questions, you are going to be working with a single table from a database other than Parch & Posey. We shall use the San Francisco crime data for this exercise.

In [None]:
# Let's tell MySQL that we want to use a different DataBase, the crime_data Database

# USE returns no result set, so we run it directly on the cursor
cursor.execute("USE crime_data;")

In [None]:
# Let's see the tables in the crime_data database

# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

In [None]:
# Let's see the first few rows of the sf_crime_data table

query_to_df(
"SELECT * FROM sf_crime_data LIMIT 5;"
)

**Let's convert the dates column to the `YYYY-MM-DD` format that SQL expects for dates**

In [None]:
query_to_df(
"WITH \
t1 AS (SELECT dates, SUBSTR(dates, 1, 10) AS date1 FROM sf_crime_data), \
t2 AS (SELECT dates, CONCAT(RIGHT(date1, 4), '-', LEFT(date1, 2), '-', SUBSTR(date1, 4, 2)) new_date FROM t1), \
t3 AS (SELECT dates, CAST(new_date AS date) new_date FROM t2) \
SELECT * FROM t3 LIMIT 5;"
)
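
The string reshuffling in the query above can be traced step by step in plain Python. The sketch assumes the raw values look like `MM/DD/YYYY` followed by a time, which is what the `SUBSTR`, `LEFT` and `RIGHT` offsets in the query imply; the sample string and the function name `clean_date` are made up for illustration.

```python
from datetime import date


def clean_date(raw: str) -> date:
    """Reorder an 'MM/DD/YYYY ...' string into a proper date."""
    date_part = raw[:10]                 # like SUBSTR(dates, 1, 10)
    iso = (
        date_part[-4:] + "-"             # like RIGHT(date1, 4): the year
        + date_part[:2] + "-"            # like LEFT(date1, 2): the month
        + date_part[3:5]                 # like SUBSTR(date1, 4, 2): the day
    )
    return date.fromisoformat(iso)       # plays the role of CAST(new_date AS date)


sample = "01/31/2014 08:00:00 AM +0000"  # assumed format, not a row from the table
cleaned = clean_date(sample)
print(cleaned.isoformat())
```

Parsing the reordered string fails loudly on malformed input, which is a useful property when checking a cleaning rule before running it over a whole table.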

## COALESCE:

COALESCE returns the first non-NULL value among its arguments, evaluated separately for each row. It is most often used to replace NULL values with a default, as in `COALESCE(o.total, 0)`.
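
A small self-contained illustration of this behavior follows. It is a sketch that uses Python's built-in `sqlite3` module and two tiny made-up tables rather than the MySQL database; the table and column names imitate `accounts` and `orders` but the rows are hypothetical.

```python
import sqlite3

# Tiny in-memory stand-ins for accounts and orders; all rows are made up
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250);
    """
)

# The LEFT JOIN leaves o.total NULL for accounts without orders;
# COALESCE swaps that NULL for 0
rows = conn.execute(
    "SELECT a.name, o.total, COALESCE(o.total, 0) "
    "FROM accounts a LEFT JOIN orders o ON a.id = o.account_id "
    "ORDER BY a.id"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The account without an order keeps `None` in the raw `total` column and gets `0` in the COALESCE column, while the account with an order is unchanged.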

In [None]:
# Let's change back to Parch-and-Posey

# USE returns no result set, so we run it directly on the cursor
cursor.execute("USE parch_and_posey;")

In [None]:
# Let's see the tables again

query_to_df(
"SHOW TABLES;"
)

In [None]:
query_to_df(
"SELECT * FROM accounts a LEFT JOIN orders o ON a.id = o.account_id WHERE o.total IS NULL;"
)

In [None]:
query_to_df(
"SELECT COALESCE(o.id, a.id) filled_id, a.name, a.website, a.lat, a.longs, a.primary_poc, a.sales_rep_id, o.* \
FROM accounts a LEFT JOIN orders o ON a.id = o.account_id WHERE o.total IS NULL;"
)

In [None]:
query_to_df(
"SELECT COALESCE(o.id, a.id) id, a.name, a.website, a.lat, a.longs, a.primary_poc, a.sales_rep_id, \
COALESCE(o.account_id, a.id) account_id, o.occurred_at, o.standard_qty, o.gloss_qty, o.poster_qty, o.total, \
o.standard_amt_usd, o.gloss_amt_usd, o.poster_amt_usd, o.total_amt_usd FROM accounts a LEFT JOIN orders o \
ON a.id = o.account_id WHERE o.total IS NULL;"
)

In [None]:
query_to_df(
"SELECT COALESCE(o.id, a.id) filled_id, a.name, a.website, a.lat, a.longs, a.primary_poc, a.sales_rep_id, \
COALESCE(o.account_id, a.id) account_id, o.occurred_at, COALESCE(o.standard_qty, 0) standard_qty, \
COALESCE(o.gloss_qty,0) gloss_qty, COALESCE(o.poster_qty,0) poster_qty, COALESCE(o.total,0) total, \
COALESCE(o.standard_amt_usd,0) standard_amt_usd, COALESCE(o.gloss_amt_usd,0) gloss_amt_usd, \
COALESCE(o.poster_amt_usd,0) poster_amt_usd, COALESCE(o.total_amt_usd,0) total_amt_usd FROM accounts a \
LEFT JOIN orders o ON a.id = o.account_id WHERE o.total IS NULL;"
)

In [None]:
query_to_df(
"SELECT COUNT(*) FROM accounts a LEFT JOIN orders o ON a.id = o.account_id;"
)

In [None]:
query_to_df(
"SELECT COALESCE(o.id, a.id) filled_id, a.name, a.website, a.lat, a.longs, a.primary_poc, a.sales_rep_id, \
COALESCE(o.account_id, a.id) account_id, o.occurred_at, COALESCE(o.standard_qty, 0) standard_qty, \
COALESCE(o.gloss_qty,0) gloss_qty, COALESCE(o.poster_qty,0) poster_qty, COALESCE(o.total,0) total, \
COALESCE(o.standard_amt_usd,0) standard_amt_usd, COALESCE(o.gloss_amt_usd,0) gloss_amt_usd, \
COALESCE(o.poster_amt_usd,0) poster_amt_usd, COALESCE(o.total_amt_usd,0) total_amt_usd \
FROM accounts a LEFT JOIN orders o ON a.id = o.account_id;"
)

In [None]:
# Close the cursor and the connection when done (change True to False to keep the connection open).

if True and connection.is_connected():
    cursor.close()
    connection.close()
    print(f'Connection to Database: {record} Closed!')