# 3.1: Intro to Tabular Data

* Tabular data is **data that is organized into a table structure** (rows x columns)
* Each **row** is called an *instance*, *record*, or *object*
* Each **column** is called an *attribute*, *variable*, *field*, or *feature*
* A **dataset** is a set of instances (a *sample* or a whole *population*) organized into 1 OR MORE tables
* A **key** is one or more attributes that *uniquely identify instances*
    * You should ALWAYS have a key for your data, so if there isn't one then MAKE YOUR OWN (e.g., a sequential row index)
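
To make the idea of a key concrete, here is a minimal sketch in plain Python. The table, the attribute names, and the helper functions `is_key` and `add_index_key` are all made up for illustration; they are not part of any library.

```python
# A small hypothetical table: each dict is one instance (row).
rows = [
    {"name": "Ana", "city": "Spokane"},
    {"name": "Ben", "city": "Seattle"},
    {"name": "Ana", "city": "Seattle"},
]


def is_key(table, attributes):
    """Return True if the given attributes uniquely identify every row."""
    seen = set()
    for row in table:
        value = tuple(row[a] for a in attributes)
        if value in seen:
            return False  # two rows share the same value, so this is not a key
        seen.add(value)
    return True


def add_index_key(table, key_name="id"):
    """Return a copy of the table with a sequential integer key added."""
    return [{key_name: i, **row} for i, row in enumerate(table)]


name_is_key = is_key(rows, ["name"])               # "Ana" appears twice
name_city_is_key = is_key(rows, ["name", "city"])  # the pair is unique here
indexed = add_index_key(rows)                      # adds id = 0, 1, 2
```

In this toy table `name` alone is not a key, but the pair (`name`, `city`) is. When no natural key exists, the added `id` column works as one.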

## Joins

We can combine two tables on any attributes they share. Typically we join on a key in one table and the matching foreign key in the other.

There are two main types of joins:

1. **INNER JOIN**: only includes rows that match on attributes in both tables
2. **OUTER JOIN**: includes non-matching rows as well
    * We fill the missing attribute values of non-matching rows with N/A, which we represent in Python with `NaN`
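
The two join types can be sketched in plain Python as follows. The tables, the `student_id` key, and the helpers `inner_join` and `full_outer_join` are hypothetical names chosen for this example. The outer join shown is a full outer join (non-matching rows from both tables are kept), and it assumes each table is non-empty and that the key is the only shared column.

```python
import math

# Two hypothetical tables that share the key attribute "student_id".
students = [
    {"student_id": 1, "name": "Ana"},
    {"student_id": 2, "name": "Ben"},
    {"student_id": 3, "name": "Cal"},
]
grades = [
    {"student_id": 2, "grade": "A-"},
    {"student_id": 3, "grade": "B+"},
    {"student_id": 4, "grade": "A"},
]

NAN = float("nan")  # stands in for N/A


def inner_join(left, right, key):
    """Keep only the rows whose key value appears in both tables."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[key] == right_row[key]
    ]


def full_outer_join(left, right, key):
    """Keep matching rows plus non-matching rows from both tables."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}

    result = inner_join(left, right, key)
    for row in left:
        if row[key] not in right_keys:
            # No match on the right: fill the right-hand attributes with NaN.
            result.append({**row, **{c: NAN for c in right_cols}})
    for row in right:
        if row[key] not in left_keys:
            # No match on the left: fill the left-hand attributes with NaN.
            result.append({**{c: NAN for c in left_cols}, **row})
    return result


inner = inner_join(students, grades, "student_id")
outer = full_outer_join(students, grades, "student_id")
```

Here `inner` contains only students 2 and 3, while `outer` also keeps student 1 (with a `NaN` grade) and student 4 (with a `NaN` name). In practice a library such as pandas would do this for you; the loops above are only meant to show the logic.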

## Aspects of Attributes

* data (storage) type: int, float, strings, ...
* measurement scales: categorical, continuous
* semantic type - what do the values represent?
    * people, ages, hometown, ...
* noisy vs invalid values?
* labeled or unlabeled?
    * labeled data: trying to make predictions
    * unlabeled data: pattern mining

## Measurement Scales

1. **NOMINAL:** discrete data/values *without an inherent ordering*
    * Ex: occupation- accountant, lawyer, programmer
    * Ex: colors- red, green, blue
    * stored with any data type
    
2. **ORDINAL:** discrete data/values *with an inherent ordering*
    * Ex: letter grades- A, A-, B+, ...
    * Ex: t-shirt sizes- small, medium, large, ...
    * no guarantee on the distance between two values
    * stored with any data type
    
3. **INTERVAL:** values measured on a scale of equal-sized units with no inherent zero point
    * Ex: temperature in Celsius or Fahrenheit
    * stored in ints/floats
    
4. **RATIO:** interval values but with an inherent zero point
    * Ex: temperature in Kelvin (0 K is absolute zero, a true zero point)
    * Ex: weight/height
    
> nominal and ordinal scales go with categorical/discrete attributes, interval and ratio go with continuous/numerical attributes
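
One way to see why the scale matters is to look at which operations make sense on each one. The sketch below uses made-up values and a hypothetical `SIZE_ORDER` list; it is an illustration of the idea, not a complete treatment.

```python
from collections import Counter

# NOMINAL: only equality and counting are meaningful (e.g., the most common value).
colors = ["red", "blue", "red", "green"]
most_common_color = Counter(colors).most_common(1)[0][0]

# ORDINAL: ordering is meaningful, but it has to be stated explicitly,
# because sorting the strings alphabetically gives the wrong order.
SIZE_ORDER = ["small", "medium", "large"]  # hypothetical ordering for this example


def size_rank(size):
    """Position of a size in the declared ordering."""
    return SIZE_ORDER.index(size)


sorted_sizes = sorted(["large", "small", "medium"], key=size_rank)

# INTERVAL: differences are meaningful, ratios are not.
celsius_diff = 20.0 - 10.0   # a 10 degree difference is meaningful
naive_ratio = 20.0 / 10.0    # 2.0, but 20 C is NOT "twice as hot" as 10 C


# RATIO: with a true zero point, ratios become meaningful.
def celsius_to_kelvin(celsius):
    return celsius + 273.15


kelvin_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)  # about 1.035
```

The misleading `naive_ratio` of 2.0 versus the Kelvin ratio of roughly 1.035 is the practical difference between an interval scale and a ratio scale.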

## Noisy VS Invalid

* **NOISY:** valid on the appropriate measurement scale, but *recorded incorrectly*
    * Ex: "fat fingering"
    * Ex: a Spokane resident, when asked for the state they live in, says ID instead of WA
* **INVALID:** not valid on the appropriate measurement scale
    * Ex: when asked for the state that you live in, you say "1"
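
A practical consequence of this distinction is that a simple validity check can catch invalid values but not noisy ones. The sketch below assumes a small, hypothetical set of allowed state codes (`VALID_STATES` is deliberately incomplete) and a made-up helper `is_valid_state`.

```python
# Hypothetical subset of allowed values for a "state" attribute.
VALID_STATES = {"WA", "ID", "OR", "MT"}


def is_valid_state(value):
    """Return True if the value is on the state scale at all."""
    return value in VALID_STATES


# Responses from Spokane residents (the correct answer is "WA").
responses = ["WA", "ID", "1"]

invalid = [v for v in responses if not is_valid_state(v)]

# "ID" passes the check even though it is wrong for a Spokane resident:
# it is noisy, not invalid, so this kind of check cannot detect it.
noisy_value_passes = is_valid_state("ID")
```

Only `"1"` is flagged here. Detecting the noisy `"ID"` would need outside information, such as cross-checking against the respondent's city.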

## Labeled VS Unlabeled

* **LABELED:** the dataset has an attribute that represents a "class"
    * "class" is a general term for something that can be classified
    * used for classification in supervised machine learning
    * the goal of supervised machine learning is to predict a class value
* **UNLABELED:** the dataset doesn't have a class attribute
    * unsupervised machine learning - looking/mining for patterns, groups, relationships, etc.
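
For labeled data, a common first step is to separate the class attribute from the remaining attributes. The sketch below uses a made-up table, a hypothetical class attribute `plays_basketball`, and an illustrative helper `split_features_and_labels`.

```python
# Hypothetical labeled table: "plays_basketball" is the class attribute.
table = [
    {"height_cm": 198, "weight_kg": 95, "plays_basketball": "yes"},
    {"height_cm": 165, "weight_kg": 60, "plays_basketball": "no"},
    {"height_cm": 185, "weight_kg": 82, "plays_basketball": "yes"},
]

CLASS_ATTRIBUTE = "plays_basketball"


def split_features_and_labels(rows, class_attribute):
    """Separate each row into its non-class attributes and its class value."""
    features = [
        {name: value for name, value in row.items() if name != class_attribute}
        for row in rows
    ]
    labels = [row[class_attribute] for row in rows]
    return features, labels


features, labels = split_features_and_labels(table, CLASS_ATTRIBUTE)
```

A supervised method would learn to predict `labels` from `features`. An unlabeled dataset has only the `features` part, so the task shifts to finding patterns or groups within it.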