## DATA WRANGLING with SQL: Transforming Raw Data into Actionable Insights

### What is DATA WRANGLING?

In the era of big data and data-driven decision-making, **data wrangling** has become a crucial step in turning raw data into valuable insights. It involves cleaning, transforming, and preparing data for analysis so that the data ends up in a structured, usable format. SQL (Structured Query Language) plays a central role in this work because it offers powerful tools for manipulating and managing data stored in relational databases. This chapter explores **data wrangling using SQL** and its importance in **data analytics**.

#### Understanding Data Wrangling

**Data wrangling**, also known as **data munging or data preprocessing**, encompasses a series of tasks aimed at improving data quality and making it more suitable for analysis. The process involves multiple steps, each contributing to the overall transformation of raw data into actionable insights:

1. **Data Extraction**: The first step in data wrangling is extracting data from various sources, such as databases, spreadsheets, or external APIs. SQL's SELECT statement enables data analysts to retrieve specific data columns or entire datasets from relational databases.

Example:
```sql
SELECT customer_id, order_date, order_total
FROM orders;
```

2. **Data Cleaning**: Raw data often contains errors, missing values, duplicates, or inconsistencies that can adversely impact analysis. SQL functions and expressions such as NULLIF, COALESCE, and CASE help handle missing values and standardize inconsistent entries. Choose replacement values carefully: substituting 0 for a missing price, as in the example below, changes any averages and totals computed later.

Example:
```sql
SELECT product_id, product_name, COALESCE(unit_price, 0) AS unit_price
FROM products;
```
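NULLIF and CASE can be combined with COALESCE in the same query. The following is a minimal sketch that runs the SQL through Python's built-in `sqlite3` module; the `products` rows are hypothetical sample data, and the empty string standing in for a missing name is an assumption made for illustration.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER, product_name TEXT, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Widget", 9.5), (2, "", 20.0), (3, "Gadget", None)],
)

cleaned = conn.execute(
    """
    SELECT product_id,
           -- NULLIF turns the empty-string placeholder into a real NULL,
           -- then COALESCE substitutes a readable default.
           COALESCE(NULLIF(product_name, ''), 'Unknown') AS product_name,
           -- CASE flags missing prices instead of silently replacing them.
           CASE WHEN unit_price IS NULL THEN 'missing' ELSE 'ok' END AS price_status
    FROM products
    ORDER BY product_id
    """
).fetchall()

print(cleaned)
```

Flagging a missing price rather than overwriting it keeps the decision about how to treat it visible to whoever analyzes the data next.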

3. **Data Transformation**: Data transformation involves converting data into a format suitable for analysis or combining data from different sources. SQL provides numerous string functions, mathematical functions, and date functions to perform data transformation tasks.

Example:
```sql
-- CONCAT works in MySQL, PostgreSQL, and SQL Server; other dialects may need the || operator
SELECT customer_id, UPPER(first_name) AS first_name,
       CONCAT(last_name, ', ', first_name) AS full_name
FROM customers;
```

4. **Data Aggregation**: Data aggregation involves summarizing data to gain insights at a higher level. SQL's GROUP BY clause is essential for aggregating data based on one or more columns.

Example:
```sql
SELECT department, COUNT(*) AS num_employees,
       AVG(salary) AS average_salary
FROM employees
GROUP BY department;
```

5. **Data Joins**: Data wrangling often requires combining data from multiple tables or datasets. SQL's JOIN operation allows analysts to merge data based on common columns between tables.

Example:
```sql
SELECT orders.order_id, customers.customer_name, orders.order_date
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;
```

6. **Data Filtering**: Data filtering is the process of extracting relevant data for analysis based on specific conditions. SQL's WHERE clause is used to filter data based on these conditions.

Example:
```sql
SELECT product_id, product_name, unit_price
FROM products
WHERE unit_price > 100;
```

7. **Data Sorting**: SQL's ORDER BY clause is used to sort data in ascending or descending order based on specific columns, facilitating easier analysis.

Example:
```sql
SELECT product_id, product_name, unit_price
FROM products
ORDER BY unit_price DESC;
```

8. **Data Deduplication**: Removing duplicate rows from the dataset is crucial for maintaining data integrity. SQL's DISTINCT keyword or GROUP BY clause can be used for deduplication. The example below returns each category once, no matter how many products share it.

Example:
```sql
SELECT DISTINCT category
FROM products;
```
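DISTINCT removes rows that are identical in every selected column. When duplicates share a key but differ elsewhere, one common approach is to number the rows per key and keep only the first. The sketch below assumes a hypothetical `customers` table with an `updated_at` column and uses `ROW_NUMBER`, which needs SQLite 3.25 or later when run through Python's `sqlite3` module.

```python
import sqlite3

# Hypothetical table in which customer 1 appears twice with different emails.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "2023-01-01"),
        (1, "a.new@example.com", "2023-06-01"),
        (2, "b@example.com", "2023-02-15"),
    ],
)

# Number each customer's rows from newest to oldest and keep only the newest.
deduped = conn.execute(
    """
    SELECT customer_id, email
    FROM (
        SELECT customer_id, email,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY updated_at DESC
               ) AS rn
        FROM customers
    )
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

print(deduped)
```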

#### Importance of Data Wrangling in SQL

**Data wrangling** using SQL is essential for several reasons:

1. **Improving Data Quality**: Data wrangling processes help identify and correct data errors, ensuring high-quality data for accurate analysis and decision-making.

2. **Enabling Data Integration**: SQL's JOIN operation allows data from different sources to be combined, enabling comprehensive analysis from multiple datasets.

3. **Enhancing Data Usability**: By transforming data into a structured format, data wrangling enhances data usability, making it more accessible and interpretable for analysts.

4. **Supporting Data Analysis**: Aggregating and summarizing data using SQL allows analysts to draw valuable insights and patterns from large datasets.

5. **Boosting Business Decisions**: Data wrangling prepares data for analysis, empowering organizations to make data-driven decisions and gain a competitive edge.

### How to Perform DATA WRANGLING Using SQL?

The tasks below show which SQL features to reach for at each stage of **data wrangling**:

1. **Data Cleaning**:
   - Remove duplicates: Use the `DISTINCT` keyword or `GROUP BY` to eliminate duplicate rows.
   - Filter data: Use the `WHERE` clause to exclude or include specific rows based on conditions.
   - Handle missing values: SQL represents missing data as `NULL`. Find such rows with `IS NULL` or `IS NOT NULL` (a comparison such as `= NULL` never matches), and substitute defaults with `COALESCE`.
   - Convert data types: Use `CAST` (standard SQL) or the dialect-specific `CONVERT` to change data types.
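As a rough illustration of the cleaning steps above, the sketch below filters out missing values and converts text columns to numeric types. It runs through Python's built-in `sqlite3` module, and the `raw_orders` table with its text-typed columns is a hypothetical stand-in for data loaded from a file.

```python
import sqlite3

# Hypothetical staging table where every column arrived as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, order_total TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("1", "19.5"), ("2", None), ("3", "5")],
)

# Count rows with a missing total before deciding how to treat them.
missing = conn.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE order_total IS NULL"
).fetchone()[0]

# Keep only complete rows and convert the text columns to numeric types.
typed = conn.execute(
    """
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(order_total AS REAL) AS order_total
    FROM raw_orders
    WHERE order_total IS NOT NULL
    ORDER BY order_id
    """
).fetchall()

print(missing, typed)
```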

2. **Data Transformation**:
   - Rename columns: Use `AS` to rename columns for better readability.
   - Concatenate strings: Use the `CONCAT` function to combine strings from different columns.
   - Split strings: Use `SUBSTRING`, or `SPLIT_PART` where the dialect provides it (for example PostgreSQL), to break a string into parts.
   - Aggregate data: Use `GROUP BY` with aggregate functions like `SUM`, `COUNT`, `AVG`, etc., to summarize data.
   - Pivot data: Use conditional aggregation with `CASE` statements to pivot data from rows to columns.
   - Unpivot data: Use `UNION ALL` to stack multiple columns into rows (plain `UNION` would also remove duplicate rows).
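Pivoting and unpivoting are easier to follow with concrete rows. The sketch below pivots a hypothetical `sales` table with conditional aggregation and then reverses the operation. It uses `UNION ALL` for the unpivot so that equal values in different rows are all kept, and it runs through Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical long-format table: one row per region and quarter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Q1", 100), ("East", "Q2", 150), ("West", "Q1", 80), ("West", "Q2", 120)],
)

# Pivot: one row per region, one column per quarter.
pivoted = conn.execute(
    """
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()

# Store the wide result so it can be unpivoted again.
conn.execute("CREATE TABLE sales_wide (region TEXT, q1 INTEGER, q2 INTEGER)")
conn.executemany("INSERT INTO sales_wide VALUES (?, ?, ?)", pivoted)

# Unpivot: turn each quarter column back into its own row.
unpivoted = conn.execute(
    """
    SELECT region, 'Q1' AS quarter, q1 AS amount FROM sales_wide
    UNION ALL
    SELECT region, 'Q2' AS quarter, q2 AS amount FROM sales_wide
    ORDER BY region, quarter
    """
).fetchall()

print(pivoted)
print(unpivoted)
```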

3. **Data Preparation**:
   - Sort data: Use `ORDER BY` to sort data based on one or more columns.
   - Create calculated columns: Use arithmetic or logical operations to create new columns.
   - Apply conditional logic: Use `CASE` statements for conditional operations.
   - Create views: Use `CREATE VIEW` to store a query as a virtual table for easy access.
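A view can bundle a calculated column and conditional logic so that later queries stay short. The sketch below is one possible arrangement, run through Python's built-in `sqlite3` module; the `order_items` table, the `order_summary` view, and the 100 threshold are all hypothetical choices.

```python
import sqlite3

# Hypothetical order lines used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, quantity INTEGER, unit_price REAL)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(1, 2, 10.0), (2, 1, 250.0), (3, 5, 6.0)],
)

# The view stores the query, not the data: it is re-evaluated on each SELECT.
conn.execute(
    """
    CREATE VIEW order_summary AS
    SELECT order_id,
           quantity * unit_price AS line_total,
           CASE WHEN quantity * unit_price >= 100 THEN 'large'
                ELSE 'small'
           END AS size_band
    FROM order_items
    """
)

summary = conn.execute(
    "SELECT order_id, line_total, size_band FROM order_summary ORDER BY line_total DESC"
).fetchall()

print(summary)
```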

4. **Data Joins**:
   - Combine data from multiple tables: Use `JOIN` to merge data from related tables based on common columns.
   - Inner join: Retrieve only the matching rows from both tables.
   - Left join: Retrieve all rows from the left table and matching rows from the right table.
   - Right join: Retrieve all rows from the right table and matching rows from the left table.
   - Full outer join: Retrieve all rows from both tables and match them where possible.
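The difference between an inner join and a left join shows up as soon as one table has rows without a match. The sketch below uses hypothetical `customers` and `orders` tables in which one customer has no orders. It runs through Python's built-in `sqlite3` module and sticks to `INNER JOIN` and `LEFT JOIN`, because older SQLite versions do not support right and full outer joins.

```python
import sqlite3

# Hypothetical tables: customer 3 has placed no orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cy")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])

# Inner join: only customers with at least one matching order.
inner_rows = conn.execute(
    """
    SELECT c.customer_name, o.order_id
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
    """
).fetchall()

# Left join: every customer, with NULL (None in Python) where no order matches.
left_rows = conn.execute(
    """
    SELECT c.customer_name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
    """
).fetchall()

print(inner_rows)
print(left_rows)
```

Filtering the left join with `WHERE o.order_id IS NULL` would return only the customers without orders, which is a handy check for gaps between tables.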

5. **Data Analysis**:
   - Use window functions: Analyze data within partitions or frames using window functions like `ROW_NUMBER`, `RANK`, `SUM`, etc.
   - Calculate aggregates: Use `GROUP BY` with aggregate functions to calculate statistics for subsets of data.
   - Apply time series analysis: Use date functions to analyze data trends over time.
   - Use subqueries: Nest queries to perform complex analyses by breaking them down into smaller steps.
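Several of these analysis techniques can appear in a single query. The sketch below groups hypothetical `orders` rows by month with a date function inside a subquery, then applies the window functions `SUM` and `RANK` to the monthly totals. It runs through Python's built-in `sqlite3` module, so `strftime` is the SQLite date function (other dialects offer different ones), and window functions require SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical orders spread over three months.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-01-05", 100),
        (2, "2023-01-20", 50),
        (3, "2023-02-03", 70),
        (4, "2023-03-11", 30),
    ],
)

monthly = conn.execute(
    """
    SELECT month,
           monthly_total,
           -- Running total of revenue in calendar order.
           SUM(monthly_total) OVER (ORDER BY month) AS running_total,
           -- Rank 1 is the month with the highest total.
           RANK() OVER (ORDER BY monthly_total DESC) AS month_rank
    FROM (
        -- Subquery: aggregate individual orders to one row per month.
        SELECT strftime('%Y-%m', order_date) AS month,
               SUM(order_total) AS monthly_total
        FROM orders
        GROUP BY strftime('%Y-%m', order_date)
    )
    ORDER BY month
    """
).fetchall()

print(monthly)
```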

6. **Data Validation**:
   - Check for data integrity: Use constraints and validations to ensure data quality.
   - Handle outliers: Identify and handle outliers that may affect analysis results.
   - Validate results: Cross-check results using different methods to verify accuracy.
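Constraints and outlier checks can both be expressed directly in SQL. The sketch below is one possible approach under stated assumptions: the `measurements` table is hypothetical, and "more than three times the average" is an arbitrary threshold picked for illustration, not a recommended statistical rule. It runs through Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical table whose constraints reject missing or negative readings.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,
        reading REAL NOT NULL CHECK (reading >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO measurements (reading) VALUES (?)",
    [(10.0,), (12.0,), (11.0,), (9.0,), (100.0,)],
)

# Integrity check: the CHECK constraint blocks an invalid row at insert time.
try:
    conn.execute("INSERT INTO measurements (reading) VALUES (-5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Outlier check (illustrative threshold): readings above 3x the average.
outliers = conn.execute(
    """
    SELECT id, reading
    FROM measurements
    WHERE reading > 3 * (SELECT AVG(reading) FROM measurements)
    """
).fetchall()

# Cross-check: the row count should match the number of valid inserts.
row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]

print(rejected, outliers, row_count)
```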

7. **Data Export**:
   - Save results: Use `CREATE TABLE ... AS SELECT` to store the final prepared data in a new table, or `INSERT INTO ... SELECT` to append it to an existing table.
   - Export data: Use SQL commands or tools to export data in different formats like CSV, Excel, etc.
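Saving and exporting can be sketched as follows. The example materializes a filtered result with `CREATE TABLE ... AS SELECT` and then writes it out as CSV text using Python's standard `csv` module. The `products` data and the `premium_products` table name are hypothetical, and an in-memory buffer stands in for a real output file.

```python
import csv
import io
import sqlite3

# Hypothetical source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT, unit_price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Widget", 9.5), (2, "Gadget", 120.0), (3, "Gizmo", 250.0)],
)

# Save the prepared result as a new table.
conn.execute(
    """
    CREATE TABLE premium_products AS
    SELECT product_id, product_name, unit_price
    FROM products
    WHERE unit_price > 100
    """
)

# Export the new table as CSV: a header row from the cursor, then the data rows.
cursor = conn.execute("SELECT * FROM premium_products ORDER BY product_id")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)
csv_text = buffer.getvalue()

print(csv_text)
```

Most database systems also ship their own export commands or client tools, which are usually the better choice for large tables.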

**Note:** Create backups of your data before performing data wrangling operations, especially if you are modifying the original dataset. SQL offers a wide range of functions and capabilities for data wrangling, making it a powerful tool for cleaning, transforming, and preparing data for analysis and decision-making.

### Conclusion

**Data wrangling with SQL** is a fundamental process in data analytics. It involves extracting, cleaning, transforming, and preparing data for analysis, enabling organizations to gain valuable insights from their raw data. SQL's capabilities for data manipulation, aggregation, and filtering make it an indispensable tool for **data wrangling** tasks. By turning raw data into a form that analysts can trust and query, data wrangling with SQL lays the groundwork for data-driven decision-making.

### Theory Questions

1. What is data wrangling and why is it important in the realm of data analytics?

2. Differentiate between the terms "data wrangling", "data munging", and "data preprocessing". Are they synonymous?

3. Describe the role of the SQL SELECT statement in data extraction. Provide an example.

4. Explain the importance of data cleaning in the data wrangling process. How can SQL help in data cleaning?

5. What is the purpose of data transformation in SQL? Provide an example of a SQL query that transforms data.

6. How does the GROUP BY clause in SQL assist with data aggregation? Give an example of its application.

7. Discuss the various types of JOIN operations in SQL. How do they aid in data wrangling?

8. What is the significance of data filtering in SQL? Illustrate with a SQL query example.

9. Why is data deduplication essential in data wrangling? How can SQL help in achieving data deduplication?

10. Enumerate the reasons why data wrangling using SQL is vital.

11. Describe the process of data transformation using SQL. Provide examples of SQL functions that can be used for this purpose.

12. What precautions should be taken before initiating the data wrangling process in SQL?

13. Explain the concept of data validation in the context of data wrangling using SQL. Why is it necessary?

14. How can SQL be used to prepare and analyze time series data?

15. Discuss the role of window functions in SQL data analysis. Provide an example of its use.

16. In the conclusion, it's mentioned that SQL is fundamental for data-driven decision-making. Elaborate on this statement.

17. Describe the benefits of creating views in SQL during the data wrangling process.

18. How can data integrity be maintained during the data wrangling process using SQL?

19. Why is it important to handle outliers during data validation, and how might SQL be used to identify and manage them?

### Table: students

Use the `student.csv` file for the following questions.

#### Easy Level

Q. Retrieve all student names from the students table.

Q. Count the number of students enrolled in the "Computer Science" course.

Q. Find all students older than 25.

Q. Retrieve the first 10 students in the table.

Q. Identify the unique courses in the students table.

Q. Count the number of male students in the table.

Q. Get the youngest student's name and age.

Q. List all students enrolled after '2023-01-01'.

Q. Retrieve students with a GPA greater than 3.5.

Q. Find the average GPA of students.

#### Medium Level

Q. Identify the course with the highest average GPA.

Q. List the number of students enrolled in each course.

Q. Retrieve students who have the same GPA.

Q. Get the median age of students (assume an even number of students).

Q. Identify the gender distribution (percentage) in the "Computer Science" course.

Q. Get the average GPA of male and female students.

Q. List all students who have a GPA greater than the average GPA of the table.

Q. Retrieve the course that has the most female students.

Q. Find the difference between the highest and lowest GPA in the table.


#### Hard Level

Q. For each course, calculate the GPA gap between male and female students.

Q. Identify students whose GPA is above the average GPA of their course.

Q. Retrieve the month and year with the most student enrollments.

Q. Rank students based on their GPA within their respective courses.

Q. Identify courses where the average GPA of female students is higher than male students.

Q. Calculate the year-over-year growth in student enrollments.

Q. For each course, get the percentage of students with a GPA above 3.5.

