# Types of Data

## 1 Structured versus unstructured data

* **Structured (organized) data**: Data that can be thought of as observations and characteristics. It is usually organized in a table (rows and columns), such as a spreadsheet or a relational database.
* **Unstructured (unorganized) data**: Data that exists as a free entity and does not follow any standard organizational hierarchy. Examples include images, text, and videos.
* **Semi-structured data**: Data that does not conform entirely to the formal structure of the data models associated with relational databases or other data tables, yet contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields. JSON and XML are common examples.


* Most estimates place unstructured data at **80-90%** of the world’s data.
* This implies that up to about 90% of the world’s information is trapped in a difficult format.
* The remedy is **pre-processing**: applying transformations that convert unstructured data into a structured counterpart.
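
As a minimal sketch of what such pre-processing can look like, the cell below turns free-text reviews into structured rows with a fixed set of columns. The review texts, the `to_row` helper, and the chosen features are all invented for illustration; real pipelines typically use much richer transformations.

```python
import re
from collections import Counter

# Hypothetical free-text reviews (unstructured); invented for illustration.
reviews = [
    "Great coffee, friendly staff!",
    "Coffee was cold. Staff was slow.",
]

def to_row(text):
    """Turn one free-text review into a structured row of simple features."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "n_words": len(words),
        "mentions_coffee": counts["coffee"],
        "mentions_staff": counts["staff"],
    }

# Each review becomes one row with the same columns: a small table.
rows = [to_row(review) for review in reviews]
for row in rows:
    print(row)
```

Every row now has the same columns, so the result could be loaded into a spreadsheet or a relational table.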




In [None]:
import json

# Semi-structured JSON string
data_json = '{"name": "Alice", "age": 30, "skills": ["Python", "Data Science"]}'

# Parse JSON
parsed_data = json.loads(data_json)
print(parsed_data)
print(f"Name: {parsed_data['name']}")
print(f"Skills: {parsed_data['skills']}")

{'name': 'Alice', 'age': 30, 'skills': ['Python', 'Data Science']}
Name: Alice
Skills: ['Python', 'Data Science']


## 2 Quantitative versus qualitative data

### Definition
* **Quantitative data**: Data that can be described using numbers and on which basic mathematical operations, including addition, are meaningful.
* **Qualitative data**: Data that cannot be described using numbers and basic mathematics. It is generally described using natural categories and language.

### Example – coffee shop data

Say that we were processing observations of coffee shops in a major city, using five descriptors (characteristics) for each shop:
* Name of a coffee shop
  * Qualitative: The name of a coffee shop is not expressed as a number and we cannot perform mathematical operations on the name of the shop

* Revenue (in thousands of dollars)
  * Quantitative: How much money a coffee shop brings in can be described using a number. Also, we can do basic operations, such as adding up the revenue for 12 months to get a year’s worth of revenue.

* Zip code
  * Qualitative: This one is tricky. A zip code is always represented using numbers, but it is qualitative because it does not fit the second part of the definition of quantitative: **we cannot perform basic mathematical operations on a zip code**. Adding two zip codes together gives a nonsensical measurement. We don’t necessarily get a new zip code, and we don’t get “double the zip code.”

* Average monthly customers
  * Quantitative: Again, describing this factor using numbers and addition makes sense. Add up all of your monthly customers and you get your yearly customers.


* Country of coffee origin
  * Qualitative: We will assume this is a very small coffee shop with coffee from a single origin. This country is described using a name (Ethiopia, Colombia), not numbers.

### How to tell them apart
* Can you describe the value with numbers?
  * No? It is most likely qualitative.
  * Yes? Move on to the next question.
* Do the values make numerical sense if you add them together?
  * No? They are most likely qualitative.
  * Yes? You probably have quantitative data on your hands.
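
The two questions above can be sketched as a small function. The answers to each question remain human judgments; the `classify` helper and the `answers` dictionary are hypothetical names, filled in here from the coffee shop example.

```python
def classify(described_by_numbers, addition_makes_sense):
    """Apply the two-question checklist to one column."""
    if not described_by_numbers:
        return "qualitative"
    return "quantitative" if addition_makes_sense else "qualitative"

# (can be described with numbers?, does adding values make sense?)
answers = {
    "name": (False, False),
    "revenue": (True, True),
    "zip_code": (True, False),
    "avg_monthly_customers": (True, True),
    "country_of_origin": (False, False),
}

labels = {column: classify(*answer) for column, answer in answers.items()}
for column, label in labels.items():
    print(f"{column}: {label}")
```

The zip code is the instructive case: it passes the first question and fails the second, so it lands on the qualitative side.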

### Types of questions you may ask
* For a quantitative column
  * What is the average value?
  * Does this quantity increase or decrease over time (if time is a factor)?
  * Is there a threshold where if this number became too high or too low, it would signal trouble for the company?
* For a qualitative column
  * Which value occurs the most and the least?
  * How many unique values are there?
  * What are these unique values?
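
A rough sketch of how these questions translate into code, using only the standard library. The revenue figures, the origins, and the threshold of 13.0 are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: monthly revenue (thousands of dollars) and coffee origins.
revenue = [12.5, 14.0, 13.2, 15.1]
origins = ["Ethiopia", "Colombia", "Ethiopia", "Brazil", "Ethiopia", "Colombia"]

# Quantitative column
average = mean(revenue)
always_increasing = all(later >= earlier for earlier, later in zip(revenue, revenue[1:]))
below_threshold = [value for value in revenue if value < 13.0]  # assumed threshold

# Qualitative column
counts = Counter(origins)
most_common = counts.most_common(1)[0][0]
least_common = min(counts, key=counts.get)
unique_values = sorted(counts)

print(average, always_increasing, below_threshold)
print(most_common, least_common, len(unique_values), unique_values)
```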


### Example – inspecting world alcohol consumption data

In [None]:
import pandas as pd
drinks = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/pds/0_pds_data/0_raw_data/drinks.csv')
# examine the first five rows of the data
drinks.head()


# country, continent: qualitative
# beer_servings, spirit_servings, wine_servings, total_litres_of_pure_alcohol: quantitative



Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [None]:
drinks['continent'].describe()
# AF = Africa

Unnamed: 0,continent
count,170
unique,5
top,AF
freq,53
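
The `count` of 170 for `continent` is lower than the 193 reported for `beer_servings` further down, which suggests 23 missing continent values. One plausible cause, assuming the raw file codes North America as `NA`, is that `pd.read_csv` treats the string `NA` as a missing value by default. The sketch below reproduces the effect on a small inline sample (the three rows are invented for illustration) and shows `keep_default_na=False` as one possible way to keep the code as text.

```python
import io
import pandas as pd

# Tiny invented sample; "NA" is meant to stand for North America.
csv_text = "country,continent\nCanada,NA\nKenya,AF\nFrance,EU\n"

# Default parsing: the string "NA" is read as a missing value (NaN).
default = pd.read_csv(io.StringIO(csv_text))
missing_default = int(default["continent"].isna().sum())

# With keep_default_na=False, "NA" stays an ordinary string.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
missing_kept = int(kept["continent"].isna().sum())

print(missing_default, missing_kept)
print(kept["continent"].tolist())
```

Note that `keep_default_na=False` also stops empty cells from being read as missing, so it is worth checking the other columns before relying on it.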


In [None]:
drinks['beer_servings'].describe()

Unnamed: 0,beer_servings
count,193.0
mean,106.160622
std,101.143103
min,0.0
25%,20.0
50%,76.0
75%,188.0
max,376.0


### Discrete vs continuous
* Discrete data: Data that is counted. It can only take on certain values.
  * `beer_servings`, `spirit_servings`, `wine_servings`: discrete
* Continuous data: Data that is measured. It can take any value within a range, so infinitely many values are possible.
  * `total_litres_of_pure_alcohol`: continuous


## 3 The four levels of data

| **Data Level** | **Can be Ordered?** | **Can Perform Addition/Subtraction?** | **Meaningful Ratios?** | **Examples**                |
|----------------|---------------------|---------------------------------------|-------------------------|------------------------------|
| **Nominal**   | ✖ No                | ✖ No                                  | ✖ No                   | Country, Gender, Color       |
| **Ordinal**   | ✔ Yes               | ✖ No                                  | ✖ No                   | Ranking, Survey Score, Education Level |
| **Interval**  | ✔ Yes               | ✔ Yes                                 | ✖ No                   | Temperature (°C/°F), Year    |
| **Ratio**     | ✔ Yes               | ✔ Yes                                 | ✔ Yes                  | Height, Weight, Age, Income  |

### 3.1 Nominal level
* Description: data that is described purely by name or category.
* Feature: mostly categorical
* e.g., a type of animal
* Type: qualitative
* Mathematical operations
  * Equality
  * Set Membership
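
A short sketch of the two operations listed above on nominal data. The `animals` list is invented for illustration. The last step computes the mode (the most frequent category), which is not discussed above but requires nothing beyond equality and counting.

```python
from collections import Counter

# Hypothetical nominal observations.
animals = ["cat", "dog", "cat", "bird", "dog", "cat"]

# Equality: are two observations the same category?
same_category = animals[0] == animals[2]

# Set membership: does a category occur in the data?
has_bird = "bird" in set(animals)
has_fish = "fish" in set(animals)

# Counting equal values gives the most frequent category (the mode).
mode, frequency = Counter(animals).most_common(1)[0]

print(same_category, has_bird, has_fish, mode, frequency)
```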




### 3.2 Ordinal level
* Description: Data at the ordinal level provides a rank order, that is, a means to place one observation before another. However, it does not provide relative differences between observations: we can order the observations from first to last, but adding or subtracting them yields nothing meaningful.
* e.g., a Likert-style question: rate your satisfaction on a scale from 1 to 10.
  * Note: 8 is better than 7, while 3 is worse than 9.
* Mathematical operations
  * Equality
  * Set membership
  * Ordering: the natural order provided by the data (for example, red, orange, yellow, green, blue, indigo, violet). An artist may impose another order, such as sorting the colors by the cost of the material needed to make each color.
  * Comparison: we can say that putting a 7 on a survey is worse than putting a 10.
* Measures of center

In [None]:
# "How happy are you to be working here on a scale from 1-5?"

import numpy
results = [5, 4, 3, 4, 5, 3, 2, 5, 3, 2, 1, 4, 5, 3, 4, 4, 5, 4, 2, 1,
4, 5, 4, 3, 2, 4, 4, 5, 4, 3, 2, 1]
sorted_results = sorted(results)
print(sorted_results)

print(numpy.mean(results)) # == 3.4375, but this value is not meaningful at the ordinal level

# If addition/subtraction among the scores doesn't make sense, the mean won't make sense either

print(numpy.median(results)) # == 4.0

[1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
3.4375
4.0
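
The median needs only ordering, so it also works when ordinal responses are labels rather than numbers. The sketch below assumes a hypothetical four-step scale and invented responses; the `rank` dictionary exists only to sort the labels in their natural order.

```python
# Hypothetical ordered scale, from worst to best.
levels = ["poor", "fair", "good", "excellent"]
rank = {label: position for position, label in enumerate(levels)}

# Invented survey responses (an odd number, so the middle item is unique).
responses = ["good", "fair", "excellent", "good", "poor", "good", "fair"]

# Sort by position on the scale, then take the middle observation.
ordered = sorted(responses, key=rank.__getitem__)
median_label = ordered[len(ordered) // 2]

print(ordered)
print(median_label)
```

No addition or subtraction is involved at any step, which is why the median remains a valid measure of center at this level.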


### 3.3 Interval level
* Description: Data at the interval level can be expressed through quantifiable means and permits more involved mathematical operations. The basic difference between the ordinal level and the interval level is exactly that: difference. Data at the interval level allows meaningful subtraction between data points.
* e.g., Temperature
* Mathematical operations
  * equality
  * set membership
  * Ordering
  * Comparison
  * Addition
  * Subtraction
* Measures of center

In [None]:
'''
Suppose we’re looking at the temperature of a fridge containing a pharmaceutical company’s new
 vaccine. We measure the temperature every hour with the following data points (in Fahrenheit):
'''

import numpy
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
print(numpy.mean(temps))
print(numpy.median(temps))


# The mean and median are quite close to each other, and both are around 31 degrees. The
# question "How cold is the fridge, on average?" has an answer of about 31.

30.733333333333334
31.0


In [None]:
import numpy
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
mean = numpy.mean(temps)  # == 30.73

# empty list of squared differences
squared_differences = []
for temperature in temps:
    # how far is the point from the mean
    difference = temperature - mean
    # square the difference
    squared_difference = difference**2
    # add it to our list
    squared_differences.append(squared_difference)

# the average squared difference is also called the "variance";
# it is computed once, after the loop has collected every squared difference
average_squared_difference = numpy.mean(squared_differences)
standard_deviation = numpy.sqrt(average_squared_difference)
print(standard_deviation) # == 2.5157

2.5157283018817607
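
Because the manual calculation divides by the number of readings, it yields the population standard deviation. As a cross-check, the standard library's `statistics.pstdev` computes the same quantity, while `statistics.stdev` divides by n - 1 (the sample version) and therefore comes out slightly larger.

```python
from statistics import pstdev, stdev

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Population standard deviation: divides by n, like the loop above.
population_sd = pstdev(temps)

# Sample standard deviation: divides by n - 1.
sample_sd = stdev(temps)

print(round(population_sd, 4))  # 2.5157
print(sample_sd > population_sd)
```

`numpy.std` with its default `ddof=0` also returns the population version.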


### 3.4 Ratio level
* Description: After three levels with progressively more permitted mathematical operations, the ratio level proves to be the strongest of the four.
* Mathematical operations
  * equality
  * set membership
  * Ordering
  * Comparison
  * Addition
  * Subtraction
  * Multiplication and division: these are meaningful because the scale has a true zero, meaning that the value zero reflects a real, measurable absence of the quantity.
* e.g.,
  * Celsius and Fahrenheit are interval-level data: their “zero point” is arbitrarily defined and does not represent the absolute absence of temperature.
  * Kelvin is ratio-level data: a measurement of 0 K indicates absolute zero, meaning no thermal energy exists. Therefore, 200 K can meaningfully be described as twice as hot as 100 K.
* Measures of center

In [None]:
# The arithmetic mean still holds meaning at this level, as does a new type of mean called the geometric
# mean, which is the n-th root of the product of all n values.
# (The Fahrenheit readings are reused here only to demonstrate the computation.)

import numpy
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
num_items = len(temps)
product = 1.
for temperature in temps:
    product *= temperature
geometric_mean = product**(1.0/num_items)
print(geometric_mean)

30.63473484374659
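
To make the interval-versus-ratio distinction for temperature concrete, the sketch below compares the same two readings on both scales. The readings (10 °C and 20 °C) and the `celsius_to_kelvin` helper are chosen for illustration only.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0

# On the Celsius scale the ratio is 2, but the zero point is arbitrary,
# so "twice as hot" is not a meaningful statement.
ratio_celsius = warm_c / cold_c

# On the Kelvin scale zero is a true zero, so the ratio is meaningful.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(ratio_celsius)
print(round(ratio_kelvin, 4))
```

The Celsius ratio of 2 shrinks to roughly 1.04 in Kelvin: 20 °C is only about 4% hotter than 10 °C in absolute terms.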
