#### Understanding Data – Definitions & Types

##### Data

This is one of the **most important lessons** in data analysis: if you apply the wrong tool to your data, you’ll likely get poor results.  
In data science, the first step to choosing the right tool is **understanding the data** you’re working with.

It’s important to build a clear way of **thinking about different types of data**, so you can pick a suitable analysis for each type.


---

##### Tools You’ll Use:
- **NumPy** – for working with numerical data and arrays  
- **Pandas** – for handling structured data using DataFrames


### Part 1: The Great Divide - Structured vs. Unstructured Data

#### Structured Data

* Structured data is data that is **highly organized**, often in a **tabular format** (think rows and columns).
* It's easy for machines to **read and process**.

---

##### **Characteristics:**
- Conforms to a **pre-defined data model**
- Values have **clear labels**
- **Easily searchable**

---

##### **Examples:**
- An Excel spreadsheet  
- A SQL database table  
- A CSV file  
- A Python list of dictionaries that all share the same keys


In [2]:
# A Python example of structured data: a list of dictionaries.
# Each dictionary is a "row", and the keys are "columns".
student_data = [
    {'student_id': 101, 'name': 'Alice', 'major': 'Physics'},
    {'student_id': 102, 'name': 'Bob', 'major': 'Art History'},
    {'student_id': 103, 'name': 'Charlie', 'major': 'Physics'}
]

print(student_data[0]['name'])

Alice
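The same idea applies to CSV files, another structured format from the list above. The following is a minimal sketch using only the standard library; the CSV text is a made-up copy of the student records, held in a string so no file is needed.

```python
import csv
import io

# Hypothetical CSV content mirroring the student records above.
csv_text = """student_id,name,major
101,Alice,Physics
102,Bob,Art History
103,Charlie,Physics
"""

# DictReader uses the header row as keys, so each row becomes a dictionary.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[1]["name"])        # Bob
print(rows[1]["student_id"])  # 102 (a string: csv does not convert types)

# Because every row shares the same labels, filtering by column is simple.
physics_students = [row["name"] for row in rows if row["major"] == "Physics"]
print(physics_students)       # ['Alice', 'Charlie']
```

Note that every value comes back as a string, so numeric columns would need an explicit conversion before any arithmetic.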


#### Unstructured Data

- This is data that has **no pre-defined structure**. It's often **text-heavy**, but can include many kinds of files.

---

##### **Characteristics:**
- No formal data model  
- Difficult to process without advanced tools (e.g., Natural Language Processing)  


---

##### **Examples:**
- The body of an email  
- A PDF document  
- A tweet  
- A JPG image  
- An MP3 audio file


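In [3]:
# A Python example of unstructured data: a block of text.
unstructured_email = """
Hello Team,
Please remember that the project deadline is this Friday, Nov 10th.
Bob from QA has reported a bug in the login module.
Thanks,
Alice
"""

# Split on any whitespace (spaces and newlines) to get a list of words.
# The words carry no labels: nothing marks "Friday" as part of a deadline.
words = unstructured_email.split()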

In [None]:

print(unstructured_email)
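To get a structured value out of text like this, we have to impose a pattern ourselves. The sketch below pulls the deadline date out of the email with a regular expression; the pattern and the `deadline` dictionary are assumptions made for illustration, and a real email could phrase the date in many ways this pattern would miss.

```python
import re

unstructured_email = """
Hello Team,
Please remember that the project deadline is this Friday, Nov 10th.
Bob from QA has reported a bug in the login module.
Thanks,
Alice
"""

# Hypothetical pattern: a three-letter month, a day number, and an ordinal suffix.
date_pattern = (
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"(\d{1,2})(?:st|nd|rd|th)\b"
)

match = re.search(date_pattern, unstructured_email)

# Turn the match into a small labeled record, or None if nothing matched.
deadline = {"month": match.group(1), "day": int(match.group(2))} if match else None
print(deadline)  # {'month': 'Nov', 'day': 10}
```

This fragility is why unstructured data usually calls for more advanced tools than simple string handling.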

## Structured vs. Unstructured Data



| **Feature**   | **Structured Data**                        | **Unstructured Data**                  |
|---------------|--------------------------------------------|----------------------------------------|
| **Format**    | Tabular (Rows/Columns)                     | Varied (Text, Media)                   |
| **Source**    | Databases, Spreadsheets                    | Emails, Social Media, Docs             |
| **Analysis**  | Straightforward (SQL, Pandas)              | Complex (NLP, CV)                      |


> **Key Takeaway**:  
> For the majority of this course, especially when we get to **NumPy** and **Pandas**, we will be focusing on **structured data**.


## Qualitative vs. Quantitative Data

All data—whether **structured** or **unstructured**—can be broken down into **two main types**:

- **Qualitative Data**: Descriptive, non-numerical information (e.g., color, opinion, category)
- **Quantitative Data**: Numerical information that represents a measurable quantity (e.g., height, age, temperature)


#### Quantitative Data (Quantity)

Quantitative data is **measured and expressed numerically**. It represents a **quantity or an amount**, and you can perform mathematical operations on it, such as adding, subtracting, or averaging.

---

##### Examples:

- A person's height: **175 cm**  
- The temperature outside: **21.5°C**  
- The number of students in this class: **30**  
- The price of a stock: **$150.25**


#### Qualitative Data (Quality)

Qualitative data describes a **quality or characteristic**. It's **categorical** and **non-numerical** in nature.  
You can't perform meaningful math on it (e.g., `"blue"` + `"green"` doesn't make sense).

---

#### **Examples:**
- A person's eye color: `"Blue"`  
- The model of a car: `"Toyota"`  
- A student's major: `"Computer Science"`  
- A 'yes' or 'no' answer
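The difference shows up as soon as we try to summarize each kind of data. Here is a small sketch with made-up sample values: we average the quantitative list, but can only count categories in the qualitative one.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values, made up for illustration.
heights_cm = [175, 162, 180, 171]                 # quantitative
eye_colors = ["Blue", "Brown", "Brown", "Green"]  # qualitative

# Arithmetic is meaningful for quantitative data.
print(mean(heights_cm))  # 172

# For qualitative data we count categories instead.
color_counts = Counter(eye_colors)
print(color_counts.most_common(1))  # [('Brown', 2)]

# "Adding" two labels only concatenates the strings.
print("Blue" + "Green")  # BlueGreen
```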


#### Deeper Dive: Sub-types of Qualitative Data

Qualitative data itself has **two important sub-types**:

---

#### **1. Nominal Data**
- Categories that **do not have a natural order or ranking**.
- Simply labels or names used to identify a type.

**Examples:**
- "Country of birth" (USA, Canada, India)  
- "Gender" (Male, Female, Other)  
- "Car Brand" (Ford, Tesla)

---

#### **2. Ordinal Data**
- Categories that **do have a meaningful, logical order or rank**,  
  but the **distance between categories is not measurable**.

**Examples:**
- "T-shirt size" (Small, Medium, Large)  
- "Customer satisfaction" (Poor, Good, Excellent)  
- "Education level" (High School, Bachelor's, Master's)
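In code, the order of ordinal categories has to be stated explicitly, because Python would otherwise sort the labels alphabetically. A minimal sketch, assuming a hand-written rank mapping called `SIZE_RANK`:

```python
# Hypothetical rank mapping: the order is meaningful, the numbers themselves are not.
SIZE_RANK = {"Small": 0, "Medium": 1, "Large": 2}

shirts = ["Large", "Small", "Medium", "Small"]

# Alphabetical sorting ignores the real order of the categories.
print(sorted(shirts))  # ['Large', 'Medium', 'Small', 'Small']

# Sorting by rank respects the order.
print(sorted(shirts, key=SIZE_RANK.get))  # ['Small', 'Small', 'Medium', 'Large']

# Comparisons are valid for ordinal data...
print(SIZE_RANK["Medium"] > SIZE_RANK["Small"])  # True
# ...but nominal labels such as car brands have no rank to compare.
```

The rank numbers only encode position: "Large" is not "twice" a "Medium", which is why the distance between categories is not measurable.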


#### Discrete vs. Continuous Data

Quantitative (numerical) data can also be broken down further.  
This distinction is **crucial** for statistics and data visualization.

---

#### **Discrete Data (Countable)**

Discrete data can only take **specific, distinct values**.  
It's often (but not always) an **integer**, and you can **count** it.

---

#### **Examples:**
- Number of children in a family (you can't have 2.5 children)  
- The number of cars in a parking lot  
- The score on a standard die (can only be 1, 2, 3, 4, 5, or 6)  
- Shoe size (e.g., 8, 8.5, 9, 9.5 – there are no values in between these sizes)


#### **Continuous Data (Measurable)**

Continuous data can take **any value within a given range**.  
You **measure** continuous data.

---

#### **Examples:**
- A person's height (could be 180 cm, 180.1 cm, 180.11 cm, ...)  
- The temperature in a room  
- The exact time taken to run a race
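One practical consequence: discrete values can be counted directly, while continuous values usually need to be grouped into ranges first. The sketch below uses made-up sample values and an arbitrary bin width of 10 cm.

```python
from collections import Counter

# Hypothetical sample values, made up for illustration.
dice_rolls = [3, 6, 3, 1, 6, 3]                    # discrete
heights_cm = [180.0, 180.1, 171.4, 165.25, 178.9]  # continuous

# Discrete: count how often each exact value occurs.
roll_counts = Counter(dice_rolls)
print(roll_counts[3])  # 3

# Continuous: exact repeats are rare, so group values into 10 cm bins.
bin_counts = Counter(int(h // 10) * 10 for h in heights_cm)
print(sorted(bin_counts.items()))  # [(160, 1), (170, 2), (180, 2)]
```

This is the same reason bar charts suit discrete data and histograms suit continuous data.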

---

## Our Data Classification Hierarchy

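So, we have a clear hierarchy we can use to classify any data point:

- **Qualitative**
  - Nominal (no natural order)
  - Ordinal (meaningful order)
- **Quantitative**
  - Discrete (countable)
  - Continuous (measurable)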


### In-Class Activity 1: Let's Classify!

For each data point below, classify it using the full hierarchy. Is it Qualitative or Quantitative?  
If Qualitative, is it Nominal or Ordinal?  
If Quantitative, is it Discrete or Continuous?

| Data Point                          | Qual/Quan?  | Sub-Type (Nom/Ord/Disc/Cont) |
|-------------------------------------|-------------|------------------------------|
| A person's name                     | Your Answer | Your Answer                  |
| The number of siblings a person has | Your Answer | Your Answer                  |
| Temperature in Celsius              | Your Answer | Your Answer                  |
| A movie rating (1 to 5 stars)       | Your Answer | Your Answer                  |
| A person's postal code              | Your Answer | Your Answer                  |
| The weight of a banana in grams     | Your Answer | Your Answer                  |
| The brand of a smartphone           | Your Answer | Your Answer                  |

#### Part 4: The Bridge - Mapping Data Types to Python Structures

This theory is only useful if we can apply it.  
So, how do we represent these data types in Python using the structures we already know?

| Data Type    | Python Representation                   | Example                      |
|--------------|----------------------------------------|------------------------------|
| Structured   | List of Dictionaries (good), Pandas DataFrame (best!) | `[{'id': 1}, {'id': 2}]`     |
| Unstructured | String                                 | `"The quick brown fox..."`    |
| Nominal      | String                                 | `major = "Physics"`           |
| Ordinal      | String (with known order), or Integer  | `size = "Medium"`, `rating = 4` |
| Discrete     | Integer                                | `num_children = 3`            |
| Continuous   | Float                                  | `temperature = 21.57`         |
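The mapping in this table can be read in reverse as a rough heuristic: given a Python value, guess its data type. The function `guess_kind` below is a hypothetical helper written for this illustration, not part of any library, and the last call shows where the heuristic fails.

```python
def guess_kind(value):
    """Hypothetical heuristic: guess a data type from the Python type alone."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "qualitative"
    if isinstance(value, int):
        return "quantitative (discrete)"
    if isinstance(value, float):
        return "quantitative (continuous)"
    if isinstance(value, str):
        return "qualitative"
    return "unknown"


print(guess_kind(3))          # quantitative (discrete)
print(guess_kind(21.57))      # quantitative (continuous)
print(guess_kind("Physics"))  # qualitative

# The heuristic is fooled when a label happens to be stored as a number.
postal_code = 90210
print(guess_kind(postal_code))  # quantitative (discrete), although a postal code is nominal
```

The Python type is a useful hint, but the classification depends on what the value means, not on how it is stored.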


### In-Class Activity 2: Design a Python Representation

Imagine you are building a system for an e-commerce website. You need to store information about a single product.  
How would you represent the following product information in a single Python dictionary?

---

### Product Information:
- Product ID: `"a4e-11b-89c"`  
- Product Name: `"Wireless Noise-Cancelling Headphones"`  
- Price: `199.99`  
- Average Customer Rating: `4.6` (out of 5)  
- Number in Stock: `250`  
- Available Colors: `["Black", "White", "Silver"]`  
- Category: `"Electronics"`

---

### Task:
Create a dictionary called `product` to hold this structured data.  
Think carefully about the Python data type for each value.

In [None]:
product = {
    "product_id": "a4e-11b-89c",    # Qualitative (Nominal) -> String
    "product_name": "Wireless Noise-Cancelling Headphones", # Qualitative (Nominal) -> String
    "price": 199.99,                # Quantitative (Continuous) -> Float
    "average_customer_rating": 4.6, # Quantitative (Continuous) -> Float
    "number_in_stock": 250,         # Quantitative (Discrete) -> Integer
    "available_colors": ["Black", "White", "Silver"], # A collection of Qualitative (Nominal) data -> List of Strings
    "category": "Electronics"       # Qualitative (Nominal) -> String
}


import json
print(json.dumps(product, indent=4))