# Dataframes and Datatypes

It may seem obvious what data is: any grouping of information or observations. It could be a list of your friends' birthdays and phone numbers. Or a catalog of galaxies and images of them. Or quarterly earnings data from the companies in the S&P 500. But there's a huge difference between data scrawled in a notebook and a well-organized, logically structured dataset that a computer can read and analyze. Sometimes coercing data into this format will be difficult, and other times it will be easy.

Check out the cereal dataset on GitHub, which I took from [this Kaggle page](https://www.kaggle.com/crawford/80-cereals). You don't need to download it.

## 1. Rows and Columns

The data itself is organized into **rows** (horizontal groups) and **columns** (vertical ones). 

Another name for rows is **entries**, because each contains all of the information about one observation. In this case, each entry is one type of cereal.

Columns are also called **attributes** and describe the same piece of information for each observation. For example, the first column is the name of each cereal. The fifth column is the number of grams of protein per serving, and so on. There is no deviation from this pattern, and this is important.

Data are not always neatly arranged into rows and columns like these, but this is the most common standard, and the one we will study in this course.
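To make the row and column vocabulary concrete, here is a minimal sketch in R. The data frame `cereal` below is a tiny stand-in built by hand; its cereal names and numbers are made up for illustration and are not values from the real dataset.

```r
# A tiny, hand-made data frame in the spirit of the cereal dataset.
# The names and numbers are illustrative assumptions, not real values.
cereal <- data.frame(
  name     = c("Cereal A", "Cereal B", "Cereal C"),
  calories = c(110, 70, 120),
  protein  = c(2, 4, 3)
)

nrow(cereal)      # number of rows (entries)
ncol(cereal)      # number of columns (attributes)

cereal[2, ]       # one row: everything we know about the second cereal
cereal$protein    # one column: the same attribute for every cereal
```

Notice that a row mixes different kinds of information (a name and two numbers), while a column always holds the same kind of information for every entry.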

## 2. Data and Metadata

Almost all of the spreadsheet is data, but not quite all of it. The very first row is not: it's header information. It tells us that this data will contain a name, mfr, type, calories and so on. Header information is **metadata**, or data about the dataset.
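One way to see the difference between the header and the data is to read a small file into R. The sketch below reads CSV text from a string rather than from the real file; `csv_text` and its two rows are made up for illustration.

```r
# Hypothetical CSV text: one header row followed by two made-up data rows.
csv_text <- "name,mfr,type,calories
Cereal A,X,C,110
Cereal B,Y,H,70"

# header = TRUE tells R that the first row is metadata, not an observation.
cereal <- read.csv(text = csv_text, header = TRUE)

names(cereal)   # the header row became the column names
nrow(cereal)    # only the two data rows are counted as entries
```

The header is still available through `names()`, but it is kept separate from the entries themselves.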

In addition, not all of the metadata is even *in the spreadsheet*. For example, you may guess that the *mfr* column stands for the manufacturer of the cereal, but what is manufacturer G? Or Q? It's not clear without this additional metadata. The same is true for calories. It is likely that this is calories per serving, but it might be per ounce of cereal, per cup of cereal, or per box (those tiny single-serving boxes).

Metadata that explains what each of the columns means can be found on the [website](https://www.kaggle.com/crawford/80-cereals) that the data was downloaded from, including what the manufacturer labels mean. One of the columns does not have a clear origin or meaning. Which is it?

## 3. Data Types

A data type is a description of what is and is not allowed in a particular column of your dataset. For example, is a column a number, or is it text? Data types vary from programming language to programming language, and R has many more data types than we will use. Below are the generic names of most of the data types we will use in this course. In parentheses is the particular name that R gives that data type.

### Numbers (`numeric`)
Any sort of number: positive or negative, integers or numbers with decimal places.<br>
*Numeric* data is useful because we can compare numbers (3 is larger than 2) and do math with them (2 + 2 is 4).<br><br>
*Examples:* 2, -1, 3.1415, 8675309
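
As a quick sketch of what this looks like in R (the variable name `x` is just for illustration), you can ask R for the type of a value with `class()` and then compare or do math with it:

```r
x <- 3.1415
class(x)    # "numeric"
class(2)    # also "numeric": R stores plain numbers as decimals by default

3 > 2       # TRUE, numbers can be compared
2 + 2       # 4, and we can do math with them
```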

### Text (`character`)
Text, from a single letter, to a novel in length.<br>
Manipulating and studying character data is a more advanced technique called [*text mining*](https://www.tidytextmining.com/) or Natural Language Processing, which we will not be covering in this course.<br>
<br>
*Examples:* 'hello', "a", "fifteen", "15" <br>

Whenever you type character data into R, it must be enclosed in single or double quotation marks as shown above (`'` or `"`).<br>
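
The quotation marks matter: `"15"` is text, while `15` is a number. Here is a small sketch showing the difference, using illustrative variable names `a` and `b`:

```r
a <- "15"   # character: note the quotation marks
b <- 15     # numeric: no quotation marks

class(a)    # "character"
class(b)    # "numeric"

nchar("fifteen")     # 7, the number of letters in the text
as.numeric(a) + 1    # 16, text must be converted before doing math

# Single and double quotes produce the same character data.
identical('hello', "hello")   # TRUE
```

Trying `a + 1` without the conversion produces an error, because R will not do arithmetic on character data.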

### Categorical Data (`factor`)
Text that can take on a specific list of values. Each different possibility is called a *level.* The difference between a factor and character data is that each entry of character data can be unique, but there is a finite list of options for factors. *Factors* are used to organize or categorize individual observations into groups. Are two things in the same category, or not?<br><br>
*Example:* 'Monday', 'Tuesday', 'Wednesday', ...<br>
*Another:* 'CIS', 'GWS', 'PSY', '', ...

In the cereal dataset, the first column (*name*) is character data, the second (*mfr*) and third (*type*) are factors, and the rest are numeric.
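
As a minimal sketch, here is how character data can be turned into a factor in R. The vector `days` is made up for illustration.

```r
# Five observations, but only three distinct values.
days <- c("Monday", "Tuesday", "Monday", "Wednesday", "Tuesday")

days_f <- factor(days)

class(days_f)     # "factor"
levels(days_f)    # "Monday" "Tuesday" "Wednesday" (sorted alphabetically by default)
table(days_f)     # how many observations fall into each level
```

The levels are the finite list of allowed categories, while the factor itself records which category each observation belongs to.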

### Booleans (`logical`)
A variable that can take the values `TRUE` and `FALSE` and no others. Under many circumstances, R will treat logical variables like factors (which take only those two values), and under others R will treat them as numeric variables (as 1 and 0 for `TRUE` and `FALSE`, respectively).
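
The sketch below shows logical values in action. The vectors `is_sweet` and `calories` are hypothetical examples, not columns from the cereal dataset.

```r
is_sweet <- c(TRUE, FALSE, TRUE, TRUE)   # made-up values

class(is_sweet)        # "logical"
as.numeric(is_sweet)   # 1 0 1 1, TRUE becomes 1 and FALSE becomes 0
sum(is_sweet)          # 3, so summing counts the TRUE values

# Comparisons produce logical values.
calories <- c(110, 70, 120)
calories > 100         # TRUE FALSE TRUE
```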

### Special: Keys
A **key** is a special type of column and not a data type. A key will uniquely identify a particular row or entry in the dataset. Sometimes this will be a number and sometimes it will be characters.

*Example:* Student ID numbers, social security numbers (SSN), and patient or subject IDs are all numeric keys.<br>
*Another:* If your dataset contains 50 entries, one for each US state, then the name of the state is also a key.

In the cereal dataset, the first column (*name*) is a key. Each cereal has a unique name.
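
One way to check whether a column could serve as a key is to test whether any of its values repeat. The sketch below uses a small made-up data frame; `anyDuplicated()` returns 0 when every value is unique.

```r
# Hypothetical entries: every name is different, but manufacturers repeat.
cereal <- data.frame(
  name = c("Cereal A", "Cereal B", "Cereal C"),
  mfr  = c("X", "X", "Y")
)

anyDuplicated(cereal$name) == 0   # TRUE: name identifies each row uniquely
anyDuplicated(cereal$mfr) == 0    # FALSE: mfr repeats, so it cannot be a key
```

A check like this only tells you about the entries you currently have. Whether a column is truly a key depends on what the data represents.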

## 4. Relationships to Other Languages (Optional)

If you're familiar with other programming languages, you've seen data types before, but they may have gone by different names. Categorical data (factors) are used widely in R, but are much less frequently used in other languages, and Python has no built-in equivalent of R's factor at all (although the pandas library provides a `Categorical` type).

Here's a handy table, for reference:

Generic Name | R | Python | Java | SQL
:-------------|---:|--------:|-----:|--:
Integers | numeric | int | int | INT
Decimal Numbers | numeric | float | float | FLOAT
Text | character | str | String | TEXT
Categorical Data | factor |  | enum | ENUM
Booleans | logical | bool | boolean | BOOLEAN
             


