## **DATA WRANGLING**

What is Data Wrangling?

1. Data wrangling, or data munging, is the process of **transforming** and **mapping** data from a "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

2. Put simply, data wrangling is a **method of data cleaning and data preparation**: converting data from one form into a more understandable form, mainly for preliminary data analytics.

3. The transformation process includes steps such as:
  * Data Exploration
  * Data Preparation
  * Data Cleaning
  * Data Validation
  * Data Enrichment
  * etc.

4. This might mean modifying all of the values in a given column in a certain way, or merging multiple columns together.

5. The necessity for data wrangling is often a byproduct of poorly collected or presented data. Data that is entered manually by humans is typically fraught with errors; data collected from websites is often optimized to be displayed on websites, not to be sorted and aggregated.

6. You can think of data wrangling as similar to preprocessing in machine learning, except that here we use SQL to clean the data rather than Python.

### **DATASET**


#### Table `crunchbase_companies_clean_data`

We will use this table for the Date Format Wrangling session. Data Definition Language (DDL):
```sql
CREATE TABLE crunchbase_companies_clean_data (
    permalink VARCHAR(50),
    name VARCHAR(50),
    homepage_url VARCHAR(50),
    category_code VARCHAR(50),
    funding_total_usd BIGINT,
    status VARCHAR(20),
    country_code VARCHAR(5),
    state_code VARCHAR(5),
    region VARCHAR(50),
    city VARCHAR(50),
    funding_rounds INT,
    founded_at VARCHAR(20),
    founded_at_clean VARCHAR(20),
    id SERIAL PRIMARY KEY
);
```

#### Table `dc_bikeshare_q1_2012`

For the String/Varchar Format Wrangling session, we will use this table. Data Definition Language (DDL):
```sql
CREATE TABLE dc_bikeshare_q1_2012 (
    id SERIAL PRIMARY KEY,
    duration VARCHAR(20),
    duration_seconds INT,
    start_time TIMESTAMP,
    start_station VARCHAR(70),
    start_terminal INT,
    end_time TIMESTAMP,
    end_station VARCHAR(70),
    end_terminal INT,
    bike_number VARCHAR(10),
    rider_type VARCHAR(20)
);
```

<h2><b> NOTE: </b></h2>

To make this session easier to follow, you can use the SQL file to run the DDL and DML that create the two tables above.

You can access the script [here](https://github.com/FTDS-learning-materials/phase-0/blob/main/src/w2d3pm.sql).

You can copy and paste it into the Query Tool in pgAdmin 4, or download the script and open it from pgAdmin 4.

#### **Data Exploration**

First, you need to know about your dataset. You learned that certain functions work on some data types, but not others.

For example, COUNT works with any data type, but SUM only works for numerical data. To use SUM, it is not enough for the data to look numeric; it must also be stored in the database in a numeric type.

You might run into this problem, for example, if you have a column that appears to be entirely numeric, but happens to contain spaces or commas. If you load a column full of numbers that contain commas into some SQL databases, the database will treat that column as non-numeric.

Generally, numeric column types in various SQL databases do not support commas or currency symbols. To make things more complicated, SQL databases can store data in many different formats with different levels of precision.
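
A minimal sketch of the problem, using hypothetical inline values rather than either table in this session: the amounts are stored as text with commas, a currency symbol, and stray spaces, so they have to be cleaned and cast before `SUM` will accept them.

```sql
-- Hypothetical sample values; not taken from the session's datasets.
WITH raw_amounts(amount_text) AS (
    VALUES ('1,200'), ('$350'), (' 45 ')
)
SELECT SUM(
           -- strip spaces, commas, and the currency symbol, then cast to a numeric type
           CAST(REPLACE(REPLACE(TRIM(amount_text), ',', ''), '$', '') AS NUMERIC)
       ) AS total_amount  -- 1200 + 350 + 45 = 1595
FROM raw_amounts;
```

Calling `SUM(amount_text)` directly on the text column would raise an error in PostgreSQL, because `SUM` is not defined for text.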

To see a list of data types, you can visit the website of each SQL database software, or at a glance, you can visit [this](https://www.w3schools.com/sql/sql_datatypes.asp).
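
To check how the columns of a table are actually stored, one option in PostgreSQL is to query the standard `information_schema.columns` view. The sketch below assumes the table was created in the current database with the DDL above.

```sql
-- List each column of the table with its stored data type.
SELECT column_name,
       data_type
FROM information_schema.columns
WHERE table_name = 'crunchbase_companies_clean_data'
ORDER BY ordinal_position;
```

Here `founded_at` should show up as `character varying`, which is why it needs converting before it can be treated as a date.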

### **DATE FORMAT**


#### **Converting Datatype**

In the table `crunchbase_companies_clean_data`, there are two columns named `founded_at` and `founded_at_clean`. Let's check the difference between those two with this query:
```sql
SELECT founded_at, founded_at_clean
FROM crunchbase_companies_clean_data;
```
It looks like these two columns contain the same information but in different formats: `founded_at` uses the US date format, while `founded_at_clean` uses the PostgreSQL default date format.

Let's check further which date format we should use:
```sql
SELECT founded_at, founded_at_clean
FROM crunchbase_companies_clean_data
ORDER BY founded_at;
```
As you can see, the result is not in chronological order: `founded_at` is stored as text, so it is sorted character by character rather than by date. So we can conclude that it's better to make sure our date/datetime format follows SQL defaults. While we're at it, let's practice changing the `founded_at` date format.

We can convert the data type at the time of querying so that it doesn't change the original dataset, using:
```sql
CAST(value AS type)
```
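
Applied to our table, the template could look like the sketch below (the alias `founded_at_cast` is just an illustrative name). Depending on the stored values and the session's `DateStyle` setting, this may raise an error or interpret the dates differently than intended.

```sql
-- Try a plain cast of the text column to DATE at query time.
SELECT founded_at,
       CAST(founded_at AS DATE) AS founded_at_cast
FROM crunchbase_companies_clean_data;
```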
Oops, it seems that `CAST(founded_at AS DATE)` didn't return the result we expected. `CAST` gives us no way to state the input format, so it relies on the database's default interpretation of dates. Instead, we can use the `TO_DATE` function to specify the format of our value and convert it to the DATE type.
```sql
SELECT founded_at, TO_DATE(founded_at, 'MM/DD/YY') AS founded_at_date
FROM crunchbase_companies_clean_data;
```

You can also change a column's data type with `ALTER TABLE ... ALTER COLUMN ... TYPE ... USING ... ::...`. This way, the change is saved in the database rather than applied only to a query's result. Normally, we can use:
```sql
ALTER TABLE crunchbase_companies_clean_data
ALTER COLUMN founded_at TYPE DATE USING founded_at::date;
```
But since our dates are stored in the US format, this direct cast can fail with an error message. To overcome this, we need a little workaround:
```sql
UPDATE crunchbase_companies_clean_data
SET founded_at = TO_DATE(founded_at, 'MM/DD/YY');

ALTER TABLE crunchbase_companies_clean_data
ALTER COLUMN founded_at TYPE DATE USING founded_at::date;
```
Now, our `founded_at` column has the SQL DATE format and DATE type.
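
As an alternative sketch, the `USING` clause accepts any expression, so the conversion can also be done in a single statement. This assumes `founded_at` is still the original VARCHAR column in `MM/DD/YY` format, so run it instead of the two-step workaround, not after it.

```sql
-- One-step variant: parse the US-format text while changing the column type.
ALTER TABLE crunchbase_companies_clean_data
ALTER COLUMN founded_at TYPE DATE
USING TO_DATE(founded_at, 'MM/DD/YY');
```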

---
#### **Deconstruct DATE/DATETIME/TIMESTAMP Format**

You've learned how to construct a date field, but what if you want to deconstruct one? You can use EXTRACT to pull the pieces apart one-by-one:

```sql
SELECT founded_at,
       EXTRACT(YEAR FROM founded_at) AS year,
       EXTRACT(MONTH FROM founded_at) AS month,
       EXTRACT(DAY FROM founded_at) AS day,
       EXTRACT(QUARTER FROM founded_at) AS quarter
FROM crunchbase_companies_clean_data;
```
You can also use `HOUR`, `MINUTE` and `SECOND` if your data type is TIMESTAMP or TIME.
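
For instance, the `start_time` column of `dc_bikeshare_q1_2012` is a TIMESTAMP, so the time parts can be pulled out as in this short sketch:

```sql
-- Extract the time components from a TIMESTAMP column.
SELECT start_time,
       EXTRACT(HOUR FROM start_time) AS hour,
       EXTRACT(MINUTE FROM start_time) AS minute,
       EXTRACT(SECOND FROM start_time) AS second
FROM dc_bikeshare_q1_2012;
```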

What if you want to include today's date or time? You can instruct your query to pull the local date and time at the time the query is run using any number of functions. Interestingly, you can run them without a `FROM` clause:

```sql
SELECT CURRENT_DATE AS date,
       CURRENT_TIME AS time,
       CURRENT_TIMESTAMP AS timestamp,
       LOCALTIME AS local_time,
       LOCALTIMESTAMP AS local_timestamp,
       NOW() AS now;
```

As you can see, the different options vary in precision. You might notice that these times probably aren't actually your local time. If you run a current time function against a connected database, you might get a result in a different time zone.
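
To see which time zone your session is using and to view the current time in another zone, a sketch like the following can help. The zone name `Asia/Jakarta` is only an assumed example; substitute your own.

```sql
-- Show the time zone of the current session.
SHOW TIMEZONE;

-- Compare the session's current time with the same instant in other zones.
SELECT NOW() AS session_time,
       NOW() AT TIME ZONE 'UTC' AS utc_time,
       NOW() AT TIME ZONE 'Asia/Jakarta' AS jakarta_time;  -- assumed example zone
```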

We can also do date arithmetic with the `-` and `+` operators. Just make sure the values involved are of type `DATE`, `TIMESTAMP`, or `TIMESTAMPTZ` (plus an `INTERVAL` for the amount being added or subtracted).


```sql
SELECT founded_at,
       CURRENT_DATE AS today,
       CURRENT_DATE - founded_at AS founded_time_ago,
       founded_at + INTERVAL '10 DAY' AS plus_10_days
FROM crunchbase_companies_clean_data;
```
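
Subtracting one DATE from another gives a plain number of days. If a breakdown in years, months, and days is more useful, PostgreSQL's `AGE` function is one option. The sketch below assumes `founded_at` has already been converted to the DATE type; the aliases are illustrative.

```sql
SELECT founded_at,
       CURRENT_DATE - founded_at AS days_since_founded,    -- integer number of days
       AGE(CURRENT_DATE, founded_at) AS age_since_founded  -- interval in years, months, days
FROM crunchbase_companies_clean_data
WHERE founded_at IS NOT NULL;
```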

There are many more functions related to date and time. The [PostgreSQL documentation](https://www.postgresql.org/docs/current/functions-datetime.html) lists them with examples.

---
#### **Handling Missing Value**

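Before we move on to our next dataset, notice that the `founded_at` column has several missing values. We can handle these with `COALESCE`, which returns its first non-NULL argument, so a missing value is replaced with the fallback we pass in. Casting `founded_at` to text keeps both arguments the same type, which matters once the column has been converted to DATE:
```sql
SELECT founded_at, COALESCE(founded_at::TEXT, 'No Date') AS founded_at_filled
FROM crunchbase_companies_clean_data;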
```
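
Before imputing, it can help to know how many values are actually missing. A minimal sketch: `COUNT(*)` counts every row, while `COUNT(founded_at)` skips NULLs, so the difference is the number of missing dates.

```sql
SELECT COUNT(*) AS total_rows,
       COUNT(founded_at) AS rows_with_date,
       COUNT(*) - COUNT(founded_at) AS rows_missing_date
FROM crunchbase_companies_clean_data;
```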

### **STRING FORMAT**



#### **LEFT, RIGHT, SUBSTR**
You can use `LEFT` to pull a certain number of characters from the left side of a string and present them as a separate string. The syntax is `LEFT(string, number of characters)`.

`RIGHT` does the same thing, but from the right side.

`LEFT` and `RIGHT` both create substrings of a specified length, but only starting from the ends of an existing string. If you want to start in the middle of a string, you can use `SUBSTRING`. The syntax is `SUBSTRING(string FROM starting character position FOR # of characters)`.

As a practical example, the `start_time` field in this dataset begins with a 10-character date, followed by the time to the right of it. We can pull the date, the time, or just the minute using this query.

```sql
SELECT start_time,
    LEFT(start_time::TEXT, 10) AS selected_date,
    RIGHT(start_time::TEXT, 8) AS selected_time,
    SUBSTRING(start_time::TEXT FROM 15 FOR 2) AS selected_minute
FROM dc_bikeshare_q1_2012;
```

The `LENGTH` function returns the length of a string. `LENGTH(start_time::TEXT)` will always return 19 in this dataset. Since we know that the first 10 characters are the date, followed by a space (11 characters in total), we could write the `RIGHT` call like this:

```sql
SELECT start_time,
       RIGHT(start_time::TEXT, LENGTH(start_time::TEXT) - 11) AS selected_time
FROM dc_bikeshare_q1_2012;
```

When using functions within other functions, it's important to remember that **the innermost functions will be evaluated first, followed by the functions that encapsulate them**.
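
To make the evaluation order concrete, here is a small sketch on a hypothetical literal in the same format as `start_time`. `LENGTH` is evaluated first, the subtraction second, and `RIGHT` last.

```sql
SELECT LENGTH('2012-01-01 00:04:00') AS total_length,        -- 19
       LENGTH('2012-01-01 00:04:00') - 11 AS time_length,    -- 8
       RIGHT('2012-01-01 00:04:00',
             LENGTH('2012-01-01 00:04:00') - 11) AS selected_time;  -- '00:04:00'
```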

#### **TRIM**
The `TRIM` function is used to remove characters from the beginning and end of a string. Here's an example:
```sql
SELECT bike_number,
       TRIM(LEADING 'W0' FROM bike_number) AS trimmed
FROM dc_bikeshare_q1_2012;
```
The `TRIM` function takes 3 arguments. First, specify whether to remove characters from the beginning (`LEADING`), the end (`TRAILING`), or both (`BOTH`). Next, specify the characters to be trimmed: any character listed inside the single quotes is removed from the chosen side(s) of the string, in whatever order it appears. Finally, specify the text you want to trim using `FROM`.
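
Note that the quoted characters act as a set, not as a fixed prefix. A quick sketch on a hypothetical bike number shows the effect:

```sql
SELECT TRIM(LEADING 'W0' FROM 'W00123') AS leading_trimmed,   -- '123': every leading 'W' or '0' is removed
       TRIM(TRAILING '3' FROM 'W00123') AS trailing_trimmed,  -- 'W0012'
       TRIM(BOTH ' ' FROM '  W00123  ') AS spaces_trimmed;    -- 'W00123'
```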

#### **POSITION**
`POSITION` allows you to specify a substring, then returns a numerical value equal to the character number (counting from the left) where that substring first appears in the target string. The match is case-sensitive. For example, the following query returns the position where the character '1' first appears in the `bike_number` field:
```sql
SELECT bike_number,
       POSITION('1' IN bike_number) AS pos
FROM dc_bikeshare_q1_2012;
```
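
A short sketch on a hypothetical literal shows what comes back when the substring is found and when it is not. `STRPOS` is PostgreSQL's function-style equivalent, with the arguments in the opposite order.

```sql
SELECT POSITION('1' IN 'W00123') AS found_pos,    -- 4
       POSITION('9' IN 'W00123') AS missing_pos,  -- 0, the substring does not occur
       STRPOS('W00123', '1') AS strpos_pos;       -- 4
```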

#### **UPPER AND LOWER**
Sometimes, you just don't want your data to look like it's screaming at you.
* You can use **`LOWER` to force every character in a string to become lower-case**.
* Similarly, you can use **`UPPER` to make all the letters appear in upper-case**:

```sql
SELECT start_station,
    LOWER(start_station) AS lowered,
    UPPER(start_station) AS uppered
FROM dc_bikeshare_q1_2012;
```
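
Beyond display, normalizing case is useful for case-insensitive matching. In the sketch below, the search term `massachusetts` is only a hypothetical example; any lower-case term works the same way.

```sql
-- Find stations whose name contains the term, regardless of letter case.
SELECT DISTINCT start_station
FROM dc_bikeshare_q1_2012
WHERE LOWER(start_station) LIKE '%massachusetts%';  -- hypothetical search term
```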

#### **CONCAT**
You can combine strings from several columns together (and with hard-coded values) using `CONCAT`. Simply order the values you want to concatenate and separate them with commas. If you want to hard-code values, enclose them in single quotes. Here's an example:
```sql
SELECT start_station,
       start_terminal,
       CONCAT(start_terminal, ' - ', start_station) AS station_id_name
FROM dc_bikeshare_q1_2012;
```
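
PostgreSQL also offers the `||` operator for concatenation. The two differ in how they treat NULL, as this sketch with hypothetical literal values shows:

```sql
SELECT CONCAT('31000', ' - ', NULL) AS with_concat,    -- '31000 - ' (NULL is ignored)
       '31000' || ' - ' || NULL AS with_operator;      -- NULL (any NULL operand makes the result NULL)
```

`CONCAT` is therefore the safer choice when some of the columns may contain missing values.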

### SQL CASE

The CASE statement is SQL's way of handling if/then logic. The CASE statement is followed by at least one pair of WHEN and THEN statements—SQL's equivalent of IF/THEN in Excel. Because of this pairing, you might be tempted to call this SQL CASE WHEN, but CASE is the accepted term.

Every CASE statement must end with the END statement. The ELSE statement is optional, and provides a way to capture values not specified in the WHEN/THEN statements. CASE is easiest to understand in the context of an example:

```sql
SELECT name, category_code,
    CASE
        WHEN funding_total_usd > 1000000 THEN 'High Funding'
        WHEN funding_total_usd > 100000 THEN 'Medium Funding'
        ELSE 'Low Funding'
    END AS funding_category
FROM crunchbase_companies_clean_data;
```
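
`WHEN` conditions are checked in order and the first match wins, so a row with a NULL `funding_total_usd` matches neither comparison and falls through to `ELSE`. The sketch below adds an explicit branch for that case (the label `'Unknown'` is an assumption, not part of the original example) and counts the companies in each category.

```sql
SELECT CASE
           WHEN funding_total_usd IS NULL THEN 'Unknown'  -- assumed label for missing funding
           WHEN funding_total_usd > 1000000 THEN 'High Funding'
           WHEN funding_total_usd > 100000 THEN 'Medium Funding'
           ELSE 'Low Funding'
       END AS funding_category,
       COUNT(*) AS company_count
FROM crunchbase_companies_clean_data
GROUP BY 1  -- group by the first output column, the CASE expression
ORDER BY company_count DESC;
```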

## Intermediate SQL: Subquery

Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you'd need to do each aggregation in a distinct step.
Subqueries can be used in several places within a query, but it's easiest to start with the FROM clause. Here are some examples of subqueries:


1. Retrieve a list of start stations along with their respective trip counts, sorted by trip count in descending order:

```sql
SELECT subquery.start_station,
       subquery.trip_count
FROM (
    SELECT start_station,
           COUNT(*) AS trip_count
    FROM dc_bikeshare_q1_2012
    GROUP BY start_station
) AS subquery
ORDER BY subquery.trip_count DESC;
```
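
The same inner query can feed a second aggregation, which is the multi-step pattern described above. As a sketch, this computes the average number of trips per start station: the inner query counts trips per station, and the outer query averages those counts.

```sql
SELECT AVG(subquery.trip_count) AS avg_trips_per_station
FROM (
    -- Step 1: count trips for each start station.
    SELECT start_station,
           COUNT(*) AS trip_count
    FROM dc_bikeshare_q1_2012
    GROUP BY start_station
) AS subquery;  -- Step 2: average the per-station counts.
```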

2. Retrieve the top-funded company in each region whose total funding is more than USD 1,000,000:

```sql
SELECT name,
       status,
       region,
       founded_at
FROM crunchbase_companies_clean_data
-- Match on region as well, so a company is only compared with its own region's maximum.
WHERE (region, funding_total_usd) IN
        (SELECT region, MAX(funding_total_usd)
         FROM crunchbase_companies_clean_data
         GROUP BY region
         HAVING MAX(funding_total_usd) > 1000000)
  AND founded_at IS NOT NULL;
```
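
A subquery in the FROM clause can also be joined back to the original table. The sketch below is one possible way to express the top-funded-per-region question; the aliases `c`, `top_by_region`, and `max_funding` are illustrative names.

```sql
SELECT c.name,
       c.region,
       c.funding_total_usd
FROM crunchbase_companies_clean_data AS c
JOIN (
    -- Highest funding per region, keeping only regions above USD 1,000,000.
    SELECT region,
           MAX(funding_total_usd) AS max_funding
    FROM crunchbase_companies_clean_data
    GROUP BY region
    HAVING MAX(funding_total_usd) > 1000000
) AS top_by_region
  ON c.region = top_by_region.region
 AND c.funding_total_usd = top_by_region.max_funding
ORDER BY c.funding_total_usd DESC;
```

Because the join matches on both `region` and the funding amount, each company is compared only with the maximum of its own region.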
