# FIT5196 Week 6: Data Structuring

---

## 1. Introduction to Data Structuring

### 1.1. Role in Data Wrangling
Data wrangling is the process of acquiring, cleaning, structuring, and enriching raw data to make it usable for analysis. **Data structuring** is a critical part of this process. It involves organizing data into a systematic format so it can be accessed and manipulated efficiently. The main goal is to improve processing speed and accessibility while minimizing resource use.

### 1.2. Importance of Good Structure
-   **Efficiency**: Proper structuring significantly reduces the time required for data manipulation tasks.
-   **Data Integrity**: A well-structured format helps maintain the accuracy and consistency of data, which is crucial for reliable analysis.
-   **Scalability**: Effective data structures can handle large increases in data volume without a major drop in performance, making them essential for big data.

---

## 2. Types of Data Structures

Data structures are generally divided into two main categories:

-   **Primitive Data Structures**: These are the basic building blocks, such as integers, floats, Booleans, and characters.
-   **Non-Primitive Data Structures**: These are more complex and are built from primitive types. They can be:
    -   **Linear**: Data elements are arranged sequentially. Examples include arrays, lists, stacks, and queues.
    -   **Non-Linear**: Data elements have a hierarchical or networked relationship. Examples include trees and graphs.

### 2.1. Primitive Data Types
Primitive types are directly supported by machine hardware, making them fast and memory-efficient.

-   **Integer**: Represents whole numbers (e.g., -10, 0, 100). Used for counting and indexing.
-   **Floating Point**: Represents real numbers with fractional parts (e.g., 3.14, -0.01). Used for scientific and decimal calculations.
-   **Character**: Represents a single character (e.g., 'a', '$', '7'). Used for text processing.
-   **Boolean**: Represents truth values, either `true` or `false`. Crucial for control flow and logic.
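
As a rough illustration, the four primitive types above map onto Python built-ins as follows. This is only a sketch: Python has no separate character type (a one-character `str` stands in for it), and its `int` is arbitrary-precision rather than a fixed-width hardware integer. The variable names are made up for the example.

```python
# Hypothetical values, one per primitive type described above.
count = 42          # integer: whole number, used for counting and indexing
ratio = 3.14        # floating point: real number with a fractional part
initial = "a"       # character: Python uses a length-1 string instead of a char type
is_valid = True     # Boolean: truth value used in control flow

# Collect the type name of each value for inspection.
type_names = [type(v).__name__ for v in (count, ratio, initial, is_valid)]
print(type_names)
```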

### 2.2. Non-Primitive Data Types

-   **Array**: A collection of elements of the same type stored in contiguous memory. Ideal for quick access to elements via an index.
-   **String**: A sequence of characters used to store text. Often represented internally as an array of characters.
-   **List (Linked List)**: A collection where elements are not stored contiguously but are linked by pointers. Excellent for dynamic data where frequent insertions and deletions are needed.
-   **Queue**: A collection that follows the **First In, First Out (FIFO)** principle. Used in task scheduling and print queues.
-   **Dictionary (Map/Hash Table)**: Stores data in **key-value pairs**. Essential for fast data retrieval using a unique key.
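
A minimal sketch of three of these structures using Python built-ins. Note the assumptions: a Python `list` is a dynamic array of references rather than a fixed-type contiguous array, and `collections.deque` is used as the queue. All variable names and values are illustrative.

```python
from collections import deque

# Array-like access: a Python list supports fast lookup by index.
scores = [70, 85, 90]
second_score = scores[1]

# Queue (FIFO): items leave in the same order they arrived.
tasks = deque()
tasks.append("print report")
tasks.append("send email")
first_task = tasks.popleft()   # removes the oldest item

# Dictionary: values are retrieved by a unique key.
ages = {"alice": 30, "bob": 25}
bob_age = ages["bob"]
```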

---

## 3. Complex Data Structures

These are sophisticated structures built from simpler types to solve specific problems and optimize operations like search, insert, and delete.

### 3.1. Graphs 🕸️
A graph is a set of **nodes** (or vertices) connected by **edges**. Graphs are used to model networks such as social connections or transportation routes.

-   **Directed Graph**: Edges have a direction (A → B is not the same as B → A).
-   **Undirected Graph**: Edges are bidirectional (A — B).
-   **Weighted Graph**: Edges have an associated weight or cost.
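
One common way to store these graph variants is an adjacency list. The sketch below uses a dictionary of dictionaries, which is an assumption made for illustration; the helper `add_edge` and the sample node names are hypothetical.

```python
def add_edge(graph, u, v, weight=1, directed=False):
    """Add an edge u -> v with a weight; mirror it unless the graph is directed."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})          # make sure v exists as a node
    if not directed:
        graph[v][u] = weight

# Undirected, weighted graph: road distances between towns.
roads = {}
add_edge(roads, "A", "B", weight=5)
add_edge(roads, "B", "C", weight=2)

# Directed graph: "ann follows ben" does not imply the reverse.
follows = {}
add_edge(follows, "ann", "ben", directed=True)
```

In `roads`, the edge between A and B can be read from either end; in `follows`, only `ann` has an outgoing edge.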

### 3.2. Trees 🌳
A tree is a hierarchical data structure with a **root** node and child nodes, forming a parent-child relationship with **no cycles**.

#### A. Binary Tree
A tree where each node has at most two children. Key types include:
-   **Full Binary Tree**: Every node has either 0 or 2 children.
-   **Complete Binary Tree**: All levels are filled except possibly the last, which is filled from left to right.
-   **Perfect Binary Tree**: All internal nodes have 2 children, and all leaf nodes are at the same level.
-   **Balanced Binary Tree**: The height of the left and right subtrees of any node differs by at most 1.
-   **Degenerate Binary Tree**: Each parent node has only one child, resembling a linked list.
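
The definitions above can be checked in code. The following is a simple sketch, not an optimized implementation: the `Node` class and the helper functions are hypothetical, and `is_balanced` recomputes heights at every node for clarity.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def height(node):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def is_full(node):
    """True if every node has either 0 or 2 children."""
    if node is None:
        return True
    if (node.left is None) != (node.right is None):
        return False                 # exactly one child
    return is_full(node.left) and is_full(node.right)

def is_balanced(node):
    """True if subtree heights differ by at most 1 at every node."""
    if node is None:
        return True
    if abs(height(node.left) - height(node.right)) > 1:
        return False
    return is_balanced(node.left) and is_balanced(node.right)

full_tree = Node(1, Node(2), Node(3))    # full and balanced
chain = Node(1, Node(2, Node(3)))        # degenerate: one child per parent
```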

#### B. Binary Search Tree (BST)
A special type of binary tree used for efficient searching. For any given node:
-   All keys in its **left subtree** are **less than** the node's key.
-   All keys in its **right subtree** are **greater than** the node's key.

-   **Advantages**: Provides efficient searching, insertion, and deletion, typically with a time complexity of O(log n) if the tree is balanced.
-   **Disadvantages**: Performance degrades to O(n) if the tree becomes unbalanced.
-   **Operations**: Insertion, Searching, Deletion, and Traversal (visiting nodes in a specific order: inorder, preorder, postorder).
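
A minimal sketch of BST insertion and search that follows the ordering rule above. It assumes duplicate keys are ignored and performs no rebalancing; the class and function names are illustrative.

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the subtree rooted at root and return the subtree root."""
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root                      # equal keys are ignored (assumption)

def search(root, key):
    """Walk down from the root, going left or right by comparison."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [50, 30, 70, 20, 40]:
    root = insert(root, k)
```

Inserting already-sorted keys (for example 10, 20, 30, 40) with this code produces a degenerate tree, which is the unbalanced case where search degrades to O(n).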

#### C. B-Tree
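A self-balancing generalization of a BST in which a node can hold **several keys** and have **more than two children**. B-trees are commonly used in databases and filesystems because they are optimized for systems that read and write large blocks of data: wide nodes keep the tree shallow, so fewer block reads are needed per lookup.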

### 3.3. Hash Tables (Hash Maps) 🗺️
A hash table maps **keys to values** for highly efficient lookup.

-   **How it works**: A **hash function** computes a hash code from a key. That code is reduced to an index (for example, by taking it modulo the array size), and the index selects the "bucket" or "slot" in an array where the corresponding value is stored.
-   **Collision Resolution**: Since different keys can produce the same hash, strategies like **chaining** (storing multiple items at one index using a linked list) are needed to handle these "collisions".
-   **Applications**: Widely used for database indexing, caching, and implementing dictionaries in programming languages.
-   **Advantage**: Offers near-constant time, O(1) on average, for lookup, insertion, and deletion. In the worst case, when many keys collide, these operations degrade to O(n).
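
The hashing and chaining steps above can be sketched as a small class. This is an illustration only: it uses Python's built-in `hash` as the hash function, stores each chain as a Python list rather than a linked list, and never resizes. The class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    def __init__(self, num_buckets=8):
        # One (initially empty) chain per bucket.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Reduce the hash code to a valid bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # new key: extend the chain

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

# With 2 buckets and 3 keys, at least two keys must share a bucket.
table = ChainedHashTable(num_buckets=2)
table.put("apple", 3)
table.put("banana", 7)
table.put("cherry", 5)
table.put("apple", 4)                      # updates the existing entry
```

Lookups stay correct despite the forced collision because `get` compares keys within the chain, not just bucket indices.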

### 3.4. Heaps
A heap is a specialized tree-based structure that satisfies the **heap property**: a parent node is always ordered in a specific way relative to its children.

-   **Max-Heap**: The value of each parent node is greater than or equal to its children's values. The **largest** element is at the root.
-   **Min-Heap**: The value of each parent node is less than or equal to its children's values. The **smallest** element is at the root.
-   **Applications**: Commonly used to implement **priority queues** and for the **Heapsort** sorting algorithm (which has O(n log n) time complexity).
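
Python's standard `heapq` module implements a min-heap on top of a list. The sketch below shows a priority queue, a heap-based sort, and a max-heap obtained by negating values (a common workaround, since `heapq` has no dedicated public max-heap type). The job names and numbers are made up.

```python
import heapq

# Priority queue: the tuple with the smallest priority number comes out first.
jobs = []
heapq.heappush(jobs, (2, "write report"))
heapq.heappush(jobs, (1, "fix bug"))
heapq.heappush(jobs, (3, "tidy desk"))
next_job = heapq.heappop(jobs)

def heapsort(values):
    """Sort by building a min-heap, then popping the smallest item repeatedly."""
    heap = list(values)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

# Max-heap behaviour via negation: the most negative value is the largest original.
numbers = [5, 1, 8, 3]
max_heap = [-n for n in numbers]
heapq.heapify(max_heap)
largest = -heapq.heappop(max_heap)
```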

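---

## 4. Review Questions

**Q1: Which data type is most suitable for the attribute "age" when recording demographic information?**

A datetime object holding the date of birth, from which the age can be computed whenever it is needed. If a datetime cannot be used, store the age as an integer.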
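
One way to read the answer above is that the date of birth is stored and the age is derived from it. Under that assumption, a sketch of the calculation might look like this; the function name `age_on` and the sample dates are hypothetical.

```python
from datetime import date

def age_on(date_of_birth, on_date):
    """Age in whole years on a given date."""
    # Has this year's birthday already happened by on_date?
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)

dob = date(2000, 6, 15)
age_before_birthday = age_on(dob, date(2024, 6, 14))
age_on_birthday = age_on(dob, date(2024, 6, 15))
```

Storing the date of birth keeps the record valid over time, whereas a stored integer age goes stale after the next birthday.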

**Q2: Which tree traversal is most efficient?**

All three depth-first traversals visit every node exactly once, so each takes O(n) time. The better choice depends on whether the root or the leaves must be processed first:

-   **Inorder**: On a BST, visits keys in ascending order, so it suits producing sorted output.
-   **Preorder**: Visits the root before its subtrees, so it suits copying or reconstructing a tree.
-   **Postorder**: Visits the children before their parent, so it suits deleting nodes or any processing that depends on the children's results.
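
The three traversal orders can be compared on a small BST. This is a sketch with a hypothetical `Node` class; each function returns the visited values as a list.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def inorder(node):
    """Left subtree, then node, then right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

def preorder(node):
    """Node first, then left and right subtrees."""
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)

def postorder(node):
    """Left and right subtrees first, then the node."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]

#         4
#       /   \
#      2     6
#     / \   / \
#    1   3 5   7
tree = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))

print(inorder(tree))    # [1, 2, 3, 4, 5, 6, 7]
print(preorder(tree))   # [4, 2, 1, 3, 6, 5, 7]
print(postorder(tree))  # [1, 3, 2, 5, 7, 6, 4]
```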

![image.png](attachment:image.png)