# SQL Data Cleaning 1


**In this lesson, you will learn techniques to**

* Clean and restructure messy data.
* Convert columns to different data types.
* Manipulate NULLs.

This will give you a robust toolkit to get from raw data to clean data that's useful for analysis.


We connect to a MySQL server (with MySQL Workbench) and run our analysis against the parch-and-posey database. This notebook contains the practical exercises for the Udacity course SQL for Data Analysis.

In [1]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

Done!


In [2]:
import mysql.connector
from mysql.connector import Error
from getpass import getpass

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='parch_and_posey',
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

Enter UserName:root
Enter Password:········
Connected to MySQL Server version  8.0.24
You're connected to database:  ('parch_and_posey',)


In [3]:
# Let's see the tables in the database

# let's run the show tables command 

cursor.execute('show tables')
out = cursor.fetchall()
out

[('accounts',), ('orders',), ('region',), ('sales_reps',), ('web_events',)]

**Defining a function that converts the result of a query to a DataFrame**

In [4]:
def query_to_df(query):
    """Run `query` on the open cursor and return the result as a DataFrame."""
    st = time.time()
    # Require every query to end with a semicolon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # Never display more than 20 rows
    pd.set_option('display.max_rows', 20)

    # Run the query and read the column names from the cursor metadata
    cursor.execute(query)
    columns = [col[0] for col in cursor.description]

    # Build the DataFrame in one step instead of concatenating row by row
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    print(f'Query ran for {time.time()-st} secs!')
    return df
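
The helper relies on `cursor.description`, which the Python DB-API defines as one sequence per result column with the column name first. As a minimal sketch of that mechanism (not the notebook's MySQL setup), the block below uses the standard-library `sqlite3` module and an in-memory table so it runs without a server. The `demo_accounts` table and the `rows_as_dicts` helper are hypothetical names introduced only for illustration.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair each fetched row with the column names from cursor.description."""
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Hypothetical in-memory table standing in for the MySQL `accounts` table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo_accounts (id INTEGER, name TEXT)")
cur.executemany(
    "INSERT INTO demo_accounts VALUES (?, ?)",
    [(1001, "Walmart"), (1011, "Exxon Mobil")],
)

cur.execute("SELECT id, name FROM demo_accounts ORDER BY id;")
records = rows_as_dicts(cur)
print(records)
conn.close()
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` if a DataFrame is needed.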

In [53]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

Query ran for 0.005228519439697266 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.23849561,-75.10329704,Tamara Tuma,321500
1,1011,Exxon Mobil,www.exxonmobil.com,41.1691563,-73.84937379,Sung Shields,321510
2,1021,Apple,www.apple.com,42.29049481,-76.08400942,Jodee Lupo,321520


In [6]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

Query ran for 0.010972738265991211 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18


In [7]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

Query ran for 0.008974552154541016 secs!


Unnamed: 0,id,name,region_id
0,321500,Samuel Racine,1
1,321510,Eugena Esser,1
2,321520,Michel Averette,1


In [8]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

Query ran for 0.0069849491119384766 secs!


Unnamed: 0,id,account_id,occurred_at,channel
0,1,1001,2015-10-06 17:13:58,direct
1,2,1001,2015-11-05 03:08:26,direct
2,3,1001,2015-12-04 03:57:24,direct


In [9]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

Query ran for 0.006981849670410156 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


### LEFT
### RIGHT
### LENGTH

**LEFT** pulls a specified number of characters for each row in a specified column starting at the beginning (or from the left). As you saw here, you can pull the first three digits of a phone number using LEFT(phone_number, 3).


**RIGHT** pulls a specified number of characters for each row in a specified column starting at the end (or from the right). As you saw here, you can pull the last eight digits of a phone number using RIGHT(phone_number, 8).


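**LENGTH** provides the number of characters for each row of a specified column. Here, you saw that we could use this to get the length of each phone number as LENGTH(phone_number). Note that in MySQL, LENGTH counts bytes, so for text with multibyte characters CHAR_LENGTH gives the character count.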
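
For readers who think in Python, the three functions map onto string slicing and `len`. The sketch below is only an analogy: `left` and `right` are hypothetical helpers, the phone number is a made-up sample value, and both helpers assume a non-negative count.

```python
def left(text, n):
    """First n characters, like SQL LEFT(text, n) for n >= 0."""
    return text[:n]


def right(text, n):
    """Last n characters, like SQL RIGHT(text, n) for n >= 0."""
    return text[-n:] if n > 0 else ""


phone_number = "301-555-0142"  # made-up sample value

area_code = left(phone_number, 3)    # LEFT(phone_number, 3)
local_part = right(phone_number, 8)  # RIGHT(phone_number, 8)
n_chars = len(phone_number)          # LENGTH(phone_number)

print(area_code, local_part, n_chars)
```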


**LEFT & RIGHT Quizzes**

* In the accounts table, there is a column holding the website for each company. The last three characters specify what type of web address they are using. Pull these extensions and provide how many of each website type exist in the accounts table.

In [11]:
query_to_df(
"SELECT RIGHT(a.website, 3) website_type, COUNT(*) counts FROM accounts a GROUP BY 1;"
)

Query ran for 0.005982637405395508 secs!


Unnamed: 0,website_type,counts
0,com,349
1,org,1
2,net,1
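
The same grouping idea can be sketched in plain Python with `collections.Counter`. This is an illustration only: the three websites are the ones from the sample `accounts` rows shown earlier, not the full table, so the counts differ from the query output.

```python
from collections import Counter

# Websites copied from the three sample `accounts` rows shown earlier.
websites = ["www.walmart.com", "www.exxonmobil.com", "www.apple.com"]

# site[-3:] plays the role of RIGHT(website, 3); Counter plays GROUP BY + COUNT(*).
extension_counts = Counter(site[-3:] for site in websites)
print(extension_counts)
```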


* There is much debate about how much the name (or even the first letter of a company name) matters. Use the accounts table to pull the first letter of each company name to see the distribution of company names that begin with each letter (or number).

In [100]:
query_to_df(
"SELECT LEFT(a.name, 1) first_char, COUNT(*) counts FROM accounts a GROUP BY 1 ORDER BY 2 DESC;"
)

Query ran for 0.0649268627166748 secs!


Unnamed: 0,first_char,counts
0,A,37
1,C,37
2,P,27
3,M,22
4,S,17
...,...,...
21,O,7
22,X,2
23,3,1
24,Q,1


* Use the accounts table and a CASE statement to create two groups: one group of company names that start with a number and a second group of those company names that start with a letter. What proportion of company names start with a letter?

In [103]:
# Let's see how many company names start with a letter versus a number

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char IN ('0','1','2','3','4','5','6','7','8','9') THEN 'is_num' ELSE 'is_alpha' END first_char, COUNT(*) counts \
FROM table1 GROUP BY 1) \
SELECT * FROM table2;"
)

Query ran for 0.007170200347900391 secs!


Unnamed: 0,first_char,counts
0,is_alpha,350
1,is_num,1


In [109]:
# Next, let's calculate the percentage of company names that start with a number;
# the share that start with a letter is the remainder

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char IN ('0','1','2','3','4','5','6','7','8','9') THEN 'is_num' ELSE 'is_alpha' END new_col, COUNT(*) counts \
FROM table1 GROUP BY 1), \
table3 AS (SELECT counts FROM table2 WHERE new_col='is_num'), \
table4 AS (SELECT ((SELECT * FROM table3) / SUM(counts)) num_prop FROM table2) \
SELECT num_prop*100 num_pct FROM table4;"
)

Query ran for 0.004545927047729492 secs!


Unnamed: 0,num_pct
0,0.28
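
The quiz asks for the share of names that start with a letter, while the 0.28 above is the percentage that start with a number. Using the counts from the grouped output (350 and 1), both percentages can be checked with a few lines of arithmetic:

```python
# Counts taken from the grouped output above.
letter_count = 350
number_count = 1
total = letter_count + number_count

letter_pct = round(100 * letter_count / total, 2)
number_pct = round(100 * number_count / total, 2)

print(letter_pct, number_pct)  # 99.72 0.28
```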


* Consider vowels as a, e, i, o, and u. What proportion of company names start with a vowel, and what percent start with anything else?

In [106]:
# Let's first see the number of names whose first letter is a vowel or not

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char IN ('A', 'E', 'I', 'O', 'U') THEN 'is_vowel' ELSE 'is_not' END vowel_or_not, \
COUNT(*) counts FROM table1 GROUP BY 1) \
SELECT * FROM table2;"
)

Query ran for 0.0064067840576171875 secs!


Unnamed: 0,vowel_or_not,counts
0,is_not,271
1,is_vowel,80


In [108]:
# Next, let's calculate the proportion of vowels in the first letters of company names

query_to_df(
"WITH \
table1 AS (SELECT LEFT(a.name, 1) first_char FROM accounts a), \
table2 AS (SELECT CASE WHEN first_char IN ('A', 'E', 'I', 'O', 'U') THEN 'is_vowel' ELSE 'is_not' END vowel_or_not, \
COUNT(*) counts FROM table1 GROUP BY 1), \
table3 AS (SELECT counts FROM table2 WHERE vowel_or_not='is_vowel'), \
table4 AS (SELECT ((select * FROM table3)/SUM(counts)) vowel_prop FROM table2) \
SELECT vowel_prop*100 vowel_pct FROM table4;"
)

Query ran for 0.003350973129272461 secs!


Unnamed: 0,vowel_pct
0,22.79


### POSITION
### STRPOS
### LOWER
### UPPER

**POSITION** takes a character and a column and returns the index of that character in each row. SQL indexes start at 1, unlike many programming languages, which start at 0. Here, you saw that you can pull the index of a comma as **`POSITION(',' IN city_state)`**.


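**STRPOS** provides the same result as POSITION, but with a slightly different syntax: **`STRPOS(city_state, ',')`**. STRPOS is a PostgreSQL function; MySQL, which this notebook uses, offers `INSTR(city_state, ',')` and `LOCATE(',', city_state)` instead.


Note that in PostgreSQL both POSITION and STRPOS are case sensitive, so looking for A is different from looking for a. In MySQL, the search is case sensitive only when at least one argument is a binary string.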


Therefore, if you want to pull an index regardless of the case of a letter, you might want to use **LOWER or UPPER** to make all of the characters lower or uppercase.
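
The 1-based indexing and the case-folding trick can be sketched in Python, where `str.find` is 0-based and returns -1 when nothing matches. The `position` helper and the `city_state` value below are hypothetical and exist only to illustrate the idea.

```python
def position(substring, text):
    """1-based index of the first match, or 0 if absent (like SQL POSITION)."""
    return text.find(substring) + 1


city_state = "Boston, MA"  # made-up sample value

comma_at = position(",", city_state)                 # POSITION(',' IN city_state)
exact_case = position("b", city_state)               # case-sensitive search: no match
folded_case = position("b", city_state.lower())      # POSITION('b' IN LOWER(city_state))

print(comma_at, exact_case, folded_case)  # 7 0 1
```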

**Position Quizzes**

* Use the accounts table to create first and last name columns that hold the first and last names for the primary_poc.

In [136]:
query_to_df(
"SELECT LEFT(primary_poc, POSITION(' ' IN primary_poc)-1) first_name, \
RIGHT(primary_poc, LENGTH(primary_poc) - POSITION(' ' IN primary_poc)) last_name FROM accounts;"
)

Query ran for 0.22927117347717285 secs!


Unnamed: 0,first_name,last_name
0,Tamara,Tuma
1,Sung,Shields
2,Jodee,Lupo
3,Serafina,Banda
4,Angeles,Crusoe
...,...,...
346,Buffy,Azure
347,Esta,Engelhardt
348,Khadijah,Riemann
349,Deanne,Hertlein
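
The index arithmetic behind the split is easy to get off by one, so here is a small Python sketch of the same steps. `split_name` is a hypothetical helper; returning an empty last name when there is no space is an assumption made for this sketch, not something the query defines.

```python
def split_name(full_name):
    """Split on the first space using the same arithmetic as the SQL query."""
    pos = full_name.find(" ") + 1  # 1-based, like POSITION(' ' IN full_name)
    if pos == 0:
        # Assumption for this sketch: no space means there is no last name.
        return full_name, ""
    first = full_name[:pos - 1]                        # LEFT(full_name, pos - 1)
    n_last = len(full_name) - pos                      # LENGTH(full_name) - pos
    last = full_name[-n_last:] if n_last > 0 else ""   # RIGHT(full_name, n_last)
    return first, last


print(split_name("Tamara Tuma"))  # ('Tamara', 'Tuma')
```

Subtracting 1 from the position drops the space from the first name, and subtracting the position from the length keeps only the characters after it.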


* Now see if you can do the same thing for every rep name in the sales_reps table. Again provide first and last name columns.

In [146]:
query_to_df(
"SELECT LEFT(name, POSITION(' ' IN name)-1) first_name, \
RIGHT(name, LENGTH(name) - POSITION(' ' IN name)) last_name FROM sales_reps;"
)

Query ran for 0.047872304916381836 secs!


Unnamed: 0,first_name,last_name
0,Samuel,Racine
1,Eugena,Esser
2,Michel,Averette
3,Renetta,Carew
4,Cara,Clarke
...,...,...
45,Elwood,Shutt
46,Maryanna,Fiorentino
47,Georgianna,Chisholm
48,Micha,Woodford


### CONCAT or Piping ||
### REPLACE

**CONCAT and piping (`||`)** both let you combine values from several columns within each row. In this video, you saw how first and last names stored in separate columns could be combined to create a full name:
```
CONCAT(first_name, ' ', last_name)
```
or with piping as
```
first_name || ' ' || last_name
```
In MySQL, `||` is a logical OR by default and concatenates only when the `PIPES_AS_CONCAT` SQL mode is enabled. The queries below use CONCAT.

**REPLACE** takes a column, the value to replace, and the new value to put in its place. For example:
```
REPLACE(name, ' ', '_')
```
Here, name is the column of interest, and we're replacing the spaces with underscores.
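
As a rough Python analogy, string concatenation and `str.replace` do the same two jobs. The names below come from the first `accounts` row shown earlier.

```python
first_name, last_name = "Tamara", "Tuma"

# CONCAT(first_name, ' ', last_name)
full_name = first_name + " " + last_name

# REPLACE(full_name, ' ', '_')
underscored = full_name.replace(" ", "_")

print(full_name, underscored)  # Tamara Tuma Tamara_Tuma
```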

**Quizzes CONCAT**

* Each company in the accounts table wants to create an email address for each primary_poc. The email address should be the first name of the primary_poc, a period, the last name of the primary_poc, the @ symbol, and the company name followed by .com (first_name.last_name@company_name.com).

In [151]:
query_to_df(
"WITH \
table1 AS (SELECT LEFT(primary_poc, POSITION(' ' IN primary_poc)-1) first_name, \
RIGHT(primary_poc, LENGTH(primary_poc)-POSITION(' ' IN primary_poc)) last_name, \
REPLACE(name, ' ', '') company FROM accounts), \
table2 AS (SELECT *, CONCAT(LOWER(first_name), '.', LOWER(last_name), '@', LOWER(company), '.com') email FROM table1) \
SELECT * FROM table2 LIMIT 20;"
)

Query ran for 0.027922391891479492 secs!


Unnamed: 0,first_name,last_name,company,email
0,Tamara,Tuma,Walmart,tamara.tuma@walmart.com
1,Sung,Shields,ExxonMobil,sung.shields@exxonmobil.com
2,Jodee,Lupo,Apple,jodee.lupo@apple.com
3,Serafina,Banda,BerkshireHathaway,serafina.banda@berkshirehathaway.com
4,Angeles,Crusoe,McKesson,angeles.crusoe@mckesson.com
5,Savanna,Gayman,UnitedHealthGroup,savanna.gayman@unitedhealthgroup.com
6,Anabel,Haskell,CVSHealth,anabel.haskell@cvshealth.com
7,Barrie,Omeara,GeneralMotors,barrie.omeara@generalmotors.com
8,Kym,Hagerman,FordMotor,kym.hagerman@fordmotor.com
9,Jamel,Mosqueda,AT&T,jamel.mosqueda@at&t.com
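
Row 9 shows that removing spaces alone can leave characters such as `&` in the address. One possible follow-up, sketched here in Python, is to keep only letters and digits from the company name. `make_email` is a hypothetical helper, and stripping every other character is an assumption of this sketch rather than part of the quiz.

```python
import re


def make_email(primary_poc, company):
    """Build first.last@company.com from a 'First Last' contact name."""
    first, _, last = primary_poc.partition(" ")
    # Assumption: keep only letters and digits in the company part.
    domain = re.sub(r"[^a-z0-9]", "", company.lower())
    return f"{first.lower()}.{last.lower()}@{domain}.com"


print(make_email("Sung Shields", "Exxon Mobil"))  # sung.shields@exxonmobil.com
print(make_email("Jamel Mosqueda", "AT&T"))       # jamel.mosqueda@att.com
```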


* We would also like to create an initial password, which they will change after their first login. The first password will consist of:<br> 
a. The first letter of the primary_poc's first name (lowercase), then<br> 
b. The last letter of their first name (lowercase), <br>
c. The first letter of their last name (lowercase), <br>
d. The last letter of their last name (lowercase), <br>
e. The number of letters in their first name, <br>
f. The number of letters in their last name, and then <br> 
g. The name of the company they are working with, all capitalized with no spaces.

In [159]:
query_to_df(
"WITH \
table1 AS (SELECT LOWER(LEFT(primary_poc, POSITION(' ' IN primary_poc)-1)) first_name, \
LOWER(RIGHT(primary_poc, LENGTH(primary_poc)-POSITION(' ' IN primary_poc))) last_name, \
UPPER(REPLACE(name, ' ', '')) company FROM accounts), \
\
table2 AS (SELECT *, CONCAT(LEFT(first_name, 1), RIGHT(first_name, 1), \
LEFT(last_name, 1), RIGHT(last_name, 1), LENGTH(first_name), LENGTH(last_name), company) signature FROM table1) \
\
SELECT * FROM table2;"
)

Query ran for 0.27861499786376953 secs!


Unnamed: 0,first_name,last_name,company,signature
0,tamara,tuma,WALMART,tata64WALMART
1,sung,shields,EXXONMOBIL,sgss47EXXONMOBIL
2,jodee,lupo,APPLE,jelo54APPLE
3,serafina,banda,BERKSHIREHATHAWAY,saba85BERKSHIREHATHAWAY
4,angeles,crusoe,MCKESSON,asce76MCKESSON
...,...,...,...,...
346,buffy,azure,KKR,byae55KKR
347,esta,engelhardt,ONEOK,eaet410ONEOK
348,khadijah,riemann,NEWMONTMINING,khrn87NEWMONTMINING
349,deanne,hertlein,PPL,dehn68PPL
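
To cross-check the rule against the output above, the seven steps can be sketched as a small Python function. `initial_password` is a hypothetical helper that assumes every contact name has the form "First Last".

```python
def initial_password(primary_poc, company):
    """Apply steps a-g to a 'First Last' contact name and a company name."""
    first, _, last = primary_poc.lower().partition(" ")
    return (
        first[0] + first[-1]                   # a, b: first and last letter of first name
        + last[0] + last[-1]                   # c, d: first and last letter of last name
        + str(len(first)) + str(len(last))     # e, f: lengths of both names
        + company.replace(" ", "").upper()     # g: company, uppercase, no spaces
    )


print(initial_password("Tamara Tuma", "Walmart"))     # tata64WALMART
print(initial_password("Esta Engelhardt", "ONEOK"))   # eaet410ONEOK
```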


In [161]:
# Change False to True to end the connection when done.

if True and connection.is_connected():
    cursor.close()
    connection.close()
    print(f'Connection to Database: {record} Closed!')

Connection to Database: ('parch_and_posey',) Closed!
