# Reviewing Statistics and R

This is your first lab for the Introduction to Statistical and Mathematical Foundations of Data Science course. 
You can refer to chapters 1 to 3 of the [intro to statistics textbook](http://onlinestatbook.com/2/index.html) for reference. 

**Make sure to run the STATMATH CONTAINER.** 

We will begin with statistics mainly for univariate data analysis, 
covering some basic concepts like descriptive and inferential statistics and distributions.

Some of the concepts may have been covered in the Intro course or boot camp; we will refresh those concepts here a little bit.


### Loading data

Load the data `auto-mpg` into R and view the first few rows 
(including the column names (called the header)) using the following code. 
This data is about city-cycle fuel consumption based on different types of cars.

In [1]:
auto_mpg <- read.csv("/dsa/data/all_datasets/auto-mpg/auto-mpg.csv", header = TRUE)

# how is the following different? 

#auto_mpg <- read.csv("/dsa/data/all_datasets/auto-mpg/auto-mpg.csv", header = TRUE, stringsAsFactors=TRUE)

## Currently, setting stringsAsFactors=FALSE is the default and loads strings from the CSV as characters.
## Setting stringsAsFactors=TRUE turns strings into factors/categorical values, which is useful for
## statistics and ML, but can make data carpentry more difficult in other applications.

We can get a quick view of the data set `auto_mpg` using the `head()` function. 
**`head()`** shows the first six rows by default. 
Just typing the variable `auto_mpg` returns all the rows in the table, which is time-consuming if you are dealing with a very big dataset.

In [2]:
head(auto_mpg)

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model.year | origin | car.name |
|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<chr>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<chr>` |
| 1 | 18 | 8 | 307 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 2 | 15 | 8 | 350 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 3 | 18 | 8 | 318 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 4 | 16 | 8 | 304 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 5 | 17 | 8 | 302 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |
| 6 | 15 | 8 | 429 | 198.0 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |


**`names()`** : The auto_mpg dataset has the column names already set. 
But in many cases, a dataset will not have headers for the columns or the names will need some formatting. 
The `names()` function is helpful for reading and changing the names of dataframe columns. 
Its usage is illustrated below. 


`Usage: names(x) <- value`


`names(x)`, where `x` is the input dataframe, shows the variable names of the dataframe. 
You can also modify column names by assigning a character vector of names to `names(x)`.
If the character vector is shorter than the number of variables in the dataframe, 
it is extended with character NAs to the length of `x`.

In [3]:
names(auto_mpg)

In [4]:
#Assign the current names of auto_mpg columns to the variable 'column_names', which we 
#will use later.
column_names=names(auto_mpg)
column_names

In [5]:
#Modify the names of columns of auto_mpg dataset by assigning new names.
names(auto_mpg)=c('a','b','c','d','e','f','g','h','i')
names(auto_mpg)

### Changing a specific column name 

In [6]:
#Modify the name of first column of auto_mpg dataset.
names(auto_mpg)[1]='z'


In [7]:
#Display the names of auto_mpg columns
names(auto_mpg)

In [8]:
#Assigning the original column names back to auto_mpg dataset variables, using the variable
#column_names we created before.
names(auto_mpg)=column_names
names(auto_mpg)


----
**`summary()`** 

The `summary()` command gives a summary of each variable in the dataframe, showing some descriptive statistics. 
As shown below, the command is very informative. 
It calculates the minimum, 1st quartile, 2nd quartile (median), mean, 
3rd quartile, and maximum of each numeric variable. 
If a variable has NA values, the number of NA values is displayed too. 

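You can use this information as a first hint about whether a variable is qualitative or quantitative, and whether a quantitative variable is discrete or continuous. 
For example, `summary()` reports numeric statistics for every variable in the auto_mpg dataset except horsepower and car.name. `car.name` is a qualitative (nominal) variable. Horsepower holds numeric values but was read as text, as explained below. Among the numeric variables, mpg, displacement, weight, and acceleration are continuous, while cylinders, model.year, and origin take only a few integer values and are discrete.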

In [9]:
summary(auto_mpg)

      mpg          cylinders      displacement    horsepower       
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:398        
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   Class :character  
 Median :23.00   Median :4.000   Median :148.5   Mode  :character  
 Mean   :23.51   Mean   :5.455   Mean   :193.4                     
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0                     
 Max.   :46.60   Max.   :8.000   Max.   :455.0                     
     weight      acceleration     model.year        origin     
 Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
 Median :2804   Median :15.50   Median :76.00   Median :1.000  
 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
   car.name        
 Length:398        
 Class :character  
 Mode  :characte

**Note**: Do you wonder why there is no min, max, mean, or other values given for the **horsepower variable**? 
The same is true for car.name, but there it makes sense because car names are text. 
The horsepower variable has no minimum or mean because it was read as strings rather than numbers. 
Run the `str()` function to understand the difference between `horsepower` and other numeric variables.

The `str()` function displays **the internal structure of an R object** and is an alternative to `summary()`. It tells you the datatype of variables, the dimensions of the dataframe, 
and also gives an overview of the kind of values each variable contains.

In [10]:
str(auto_mpg)

'data.frame':	398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : chr  "130.0" "165.0" "150.0" "150.0" ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car.name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...


Horsepower is a character datatype. Character fields in the csv file are read as characters by default, but we can change that behavior by setting the `stringsAsFactors` parameter to TRUE. This will read character fields as **factors**.

In [11]:
auto_mpg_fct <- read.csv("/dsa/data/all_datasets/auto-mpg/auto-mpg.csv", header = TRUE, stringsAsFactors=TRUE)
str(auto_mpg_fct)

'data.frame':	398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : Factor w/ 94 levels "?    ","100.0",..: 17 35 29 29 24 42 47 46 48 40 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car.name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...


Now, horsepower and car.name are of `factor` datatype. 
Factor is a categorical (nominal/discrete) datatype. 
It is used for variables that take discrete values from a limited, specific set of values, so they are not continuous.

The summary function does not calculate mean, median, min etc. for horsepower or car.name because they are either factor or character variables, but for a factor variable it gives a count for each of the most frequent **levels**. 

Each value a factor variable can take is called a "**level**".
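
As a minimal sketch of how levels work, the toy vector below (hypothetical values, not taken from auto_mpg) is converted to a factor. `factor()`, `levels()`, `nlevels()`, and `table()` are base R functions.

```R
# Hypothetical region labels for five cars (illustrative only)
region <- c("usa", "japan", "usa", "europe", "usa")
region_fct <- factor(region)

levels(region_fct)    # distinct values, sorted alphabetically
nlevels(region_fct)   # number of levels
table(region_fct)     # count of observations per level
```

Note that the levels are stored once, and each observation only records which level it belongs to.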


In [12]:
summary(auto_mpg_fct)

      mpg          cylinders      displacement     horsepower      weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150.0  : 22   Min.   :1613  
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90.00  : 20   1st Qu.:2224  
 Median :23.00   Median :4.000   Median :148.5   88.00  : 19   Median :2804  
 Mean   :23.51   Mean   :5.455   Mean   :193.4   110.0  : 18   Mean   :2970  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100.0  : 17   3rd Qu.:3608  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   75.00  : 14   Max.   :5140  
                                                 (Other):288                 
  acceleration     model.year        origin                car.name  
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   ford pinto    :  6  
 1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000   amc matador   :  5  
 Median :15.50   Median :76.00   Median :1.000   ford maverick :  5  
 Mean   :15.57   Mean   :76.01   Mean   :1.573   toyota corolla:  5  
 3rd Qu.:17.18   3rd Qu.:7

In [13]:
levels(auto_mpg_fct$horsepower)

**Horsepower is not a good example of a factor variable because there is no reason for it to be a categorical type.** Since it has missing values represented by a '?' character in the csv file, it is read into the data frame as either character or factor instead of numeric. 

Similarly, car.name is NOT a good example of a categorical variable either, because almost every data item has its own name, creating 305 levels for 398 rows. 
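
One way to handle a placeholder such as '?' is sketched below on a small inline CSV. The values are made up for illustration, and the sketch assumes the placeholder is a bare `?` with no padding spaces, which may not match the real file. It shows two base R options: converting after loading with `as.numeric()`, or declaring the placeholder up front with the `na.strings` argument of `read.csv()`.

```R
# Made-up CSV content standing in for the real file
csv_text <- "mpg,horsepower\n18,130.0\n25,?\n15,165.0\n"

# Option 1: load as character, then convert; "?" becomes NA (with a warning)
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
hp_converted <- suppressWarnings(as.numeric(raw$horsepower))

# Option 2: tell read.csv that "?" means missing, so the column is numeric from the start
clean <- read.csv(text = csv_text, na.strings = "?")

str(raw$horsepower)
hp_converted
str(clean$horsepower)
```

If the column had been loaded as a factor, `as.numeric()` would return the internal level codes rather than the numbers, so the usual idiom in that case is `as.numeric(as.character(x))`.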

---

## Column Access in R

Recall that column access in R is accomplished by using the `$` operator.  
```R
dataframe$columnName
```

**`summary(auto_mpg$weight)`** shows a summary of the `weight` variable of the `auto_mpg` data.

In [14]:
summary(auto_mpg$weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1613    2224    2804    2970    3608    5140 
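
For comparison, here is a small sketch of three equivalent ways to pull one column out of a data frame. It uses a hypothetical two-column data frame (`cars_df`, made up for this example) rather than auto_mpg.

```R
# Hypothetical toy data frame
cars_df <- data.frame(weight = c(3504, 3693, 3436), mpg = c(18, 15, 18))

by_dollar <- cars_df$weight        # $ operator
by_name   <- cars_df[["weight"]]   # double brackets with the column name
by_index  <- cars_df[, 1]          # all rows, first column

identical(by_dollar, by_name)
identical(by_dollar, by_index)
```

The `$` form is the most compact for interactive work; the bracket forms are handy when the column name or position is held in a variable.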

----
## Descriptive Statistics

Descriptive statistics are used to summarize and describe data.
If we are analyzing miles per gallon data, for example, 
a descriptive statistic might be the percentage of cars with different numbers of 
cylinders or the average miles per gallon for all cars. 
Many descriptive statistics are often used at one time to give a full picture of the data. 

**There are mainly two categories of descriptive statistics:**
measures of central tendency (or averages) and measures of dispersion 
(which summarize how spread out or dispersed the data points are). 


A variable can have many observations (data points or values). A small set of numbers that describes those observations, such as the values shown by the `summary()` command, is a set of descriptive statistics. 

There are **three important measures** of central tendency used to summarize data: the mean, the median, and the mode. 


When we talk about the **mean**, we'll be referring to the arithmetic mean as contrasted to some other means, 
such as the geometric mean or the harmonic mean, which are not used as frequently as arithmetic mean. 
The mean of a set of data is simply the sum of data observations divided by the total number of observations. 

The **median** of a set of ordered observations is a middle number that divides the data into two parts, 
where half of the data points are in one part and the other half in the second part. 

**The mean is influenced to a greater extent by extreme observations.**

**So, if you notice extreme observations in your data, then perhaps a median is a better summary of data than a mean.**

Income and price data generally follow this pattern, which is why census organizations report **median incomes and median prices.**
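
The effect of an extreme observation can be seen in a small sketch. The income figures below are hypothetical (in thousands) and chosen only to make the arithmetic easy to follow.

```R
incomes <- c(30, 35, 40, 45, 50)            # hypothetical incomes
incomes_with_outlier <- c(incomes, 1000)    # add one extreme value

sum(incomes) / length(incomes)   # the mean computed by hand
mean(incomes)                    # same result with the built-in function
median(incomes)

mean(incomes_with_outlier)       # pulled far upward by the single extreme value
median(incomes_with_outlier)     # barely moves
```

With the extreme value added, the mean jumps from 40 to 200 while the median only moves from 40 to 42.5.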

Descriptive statistics are just descriptive. 
They cannot generalize anything beyond the data at hand. 
Generalizing from our data to another set of cases is dealt with in **inferential statistics**.

R has built-in functions to calculate mean, median, standard deviation (a measure of the spread of data points around the mean), and other descriptive statistics.

These terms should be familiar from the previous intro course or boot camp.

In [15]:
#Let's run the head() command on auto_mpg to peek into the dataset.
head(auto_mpg)

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model.year | origin | car.name |
|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<chr>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<chr>` |
| 1 | 18 | 8 | 307 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 2 | 15 | 8 | 350 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 3 | 18 | 8 | 318 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 4 | 16 | 8 | 304 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 5 | 17 | 8 | 302 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |
| 6 | 15 | 8 | 429 | 198.0 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |


----
#### Mean
The `mean()` function gives the average value of the column. 
The mean is the most basic statistic to help you understand the distribution of observations (data points) of a variable.
    
    Mean = Sum of all Observations / No. of Observations

In [16]:
#The paste function below is used to concatenate strings to make things readable. 
#paste concatenates strings in the order they are given as input. 
paste('Average auto displacement is:',mean(auto_mpg$displacement))

# Note, that paste adds a space character between the parts when concatenating

In [17]:
#Without the paste function
mean(auto_mpg$displacement)

These commands tell us that, on average, a vehicle in the dataset has a displacement of about 193.4. 
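
A short sketch of how `paste()` joins its arguments may help here. The number stored in `avg` is an illustrative value typed in by hand, not something computed in this block.

```R
avg <- 193.4259                        # illustrative number only

paste("Average:", avg)                 # default separator is a single space
paste("Average:", avg, sep = "")       # no separator
paste0("Average:", avg)                # shorthand for sep = ""
paste("Average:", round(avg, 1))       # round before pasting to control the digits
```

Rounding before pasting is often the simplest way to keep printed messages readable.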

#### Mode
    
    Mode = Most frequently occurring value in a set of values, i.e., in a column of a dataframe.
    
The mode is the value that is repeated most frequently in a set of values and is especially useful when dealing with discrete variables.
R does not have a built-in function that computes the statistical mode, as you might expect. 
Instead, the `mode()` function returns the type or **storage mode** of the object. 
For example:

`mode(dataframe$columnName)`

In [19]:
paste('Datatype of mpg is',mode(auto_mpg$mpg))


In [20]:
#Without the paste function
mode(auto_mpg$mpg)

However, you can use a command like the one below to find the most frequently occurring value. 
The `table()` command tells us the count of each distinct value of a variable. 
So, using the command below, we can print how often each value occurred in the dataset. 

In [21]:
table(auto_mpg$mpg)


   9   10   11   12   13   14 14.5   15 15.5   16 16.2 16.5 16.9   17 17.5 17.6 
   1    2    4    6   20   19    1   16    5   13    1    3    1    7    5    2 
17.7   18 18.1 18.2 18.5 18.6   19 19.1 19.2 19.4 19.8 19.9   20 20.2 20.3 20.5 
   1   17    2    1    3    1   12    1    3    2    1    1    9    4    1    3 
20.6 20.8   21 21.1 21.5 21.6   22 22.3 22.4 22.5   23 23.2 23.5 23.6 23.7 23.8 
   2    1    8    1    3    1   10    1    1    1   10    1    1    1    1    1 
23.9   24 24.2 24.3 24.5   25 25.1 25.4 25.5 25.8   26 26.4 26.5 26.6 26.8   27 
   2   11    1    1    2   11    1    2    2    1   14    1    1    2    1    9 
27.2 27.4 27.5 27.9   28 28.1 28.4 28.8   29 29.5 29.8 29.9   30 30.5 30.7 30.9 
   3    1    1    1   10    1    1    1    8    2    2    1    7    2    1    1 
  31 31.3 31.5 31.6 31.8 31.9   32 32.1 32.2 32.3 32.4 32.7 32.8 32.9   33 33.5 
   7    1    2    1    1    1    6    1    1    1    2    1    1    1    3    3 
33.7 33.8   34 34.1 34.2 34

**Note:** In the output above, the table is line-wrapped in the display. Each pair of lines reads as values over counts:
```
Value->   9   10   11   12   13   14 14.5   15 15.5   16 16.2 16.5 16.9   17 17.5 17.6 
Count->   1    2    4    6   20   19    1   16    5   13    1    3    1    7    5    2 

Value->   17.7   18 18.1 18.2 18.5 18.6   19 19.1 19.2 19.4 19.8 19.9   20 20.2 20.3 20.5
Count->      1   17    2    1    3    1   12    1    3    2    1    1    9    4    1    3

...
```

We see that 13 MPG occurs 20 times (13 is at position 5).

**Note:** What `table()` is actually doing is counting how often each distinct value occurs in the given set / column, much like a histogram with one bin per value.

This tells us that 13 is the most commonly occurring mileage (miles per gallon) of the vehicles. Of the 398 vehicles, 20 vehicles get 13 miles per gallon. 

The full table is more than we need, so with a little more code we can extract just the mode.
In this case we will use the `which.max()` function to ask:  
**Which index holds the greatest value?**


In some other languages and mathematical notation, this is often referred to as "argmax" for:  
Which argument (aka index) has the maximum value?

In [22]:
#Let's use which.max with the table function to find out which value 
#occurred the most frequently in the dataset
which.max(table(auto_mpg$mpg))

In [23]:
#And now we can combine the previous code with the paste function...
paste("The mode using which.max():",names(which.max(table(auto_mpg$mpg))))

#Note: which.max() returns the position of the largest count (here 5), labeled with the value's name.
#Wrapping it in names() extracts that label, "13", which is the mode itself.
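
The same idea can be wrapped in a small helper. `stat_mode` below is a hypothetical function written for this sketch (it is not part of base R), and the input values are made up. Unlike `which.max()`, which reports only the first maximum, this version returns every value tied for the highest count.

```R
# Hypothetical helper: returns every value tied for most frequent, as character strings
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

mpg_sample <- c(13, 14, 13, 15, 14, 13)   # made-up values

which.max(table(mpg_sample))   # position of the largest count, labeled with the value
stat_mode(mpg_sample)          # the value itself
stat_mode(c(1, 1, 2, 2, 3))    # ties: both values are returned
```

Because the result comes from the table's names, it is a character vector; wrap it in `as.numeric()` if a number is needed.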

#### Median
    
    Median = Mid point of all values. 

The median divides the sorted data set into two equal halves. 
One half lies to the left of the median and the other to the right. 
The median is less affected than the mean by outliers (extreme values). 
Therefore, the median is often a better choice for measuring central tendency when the data is skewed 
or has outliers. 

R has a built-in function to calculate the median. 
Let's calculate the median for the acceleration variable in the auto_mpg dataset.

In [24]:
median(auto_mpg$acceleration)

In [26]:
paste("median:",median(auto_mpg$acceleration))

Of the 398 observations in the dataset, 199 observations have acceleration less than or equal to 15.5 and the other 199 observations have acceleration greater than or equal to 15.5.

#### Range
    
    Range = (lowest value, highest value)
    
The range is a simple measure of spread: it reports the extreme values of a variable. 
R also has a built-in function to calculate the range. 
Let's calculate the range of the variable `model.year` in the auto_mpg dataset. 

In [27]:
range(auto_mpg$model.year)

So, the model years of the cars in the dataset range from 70 to 82 (that is, 1970 to 1982).
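
As a quick sketch on made-up values, `range()` returns a vector of two numbers, and `diff()` turns that into a single width:

```R
years <- c(76, 70, 82, 79)   # made-up model years
r <- range(years)            # c(minimum, maximum)
r
diff(r)                      # width of the range: maximum minus minimum
```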

#### Quantile

By default, the `quantile()` function returns the cut points that divide the sorted data into **4 equal parts** (quartiles), based on the number of measurements. 
It reports the minimum (0%), the first quartile Q1 (25%), the second quartile Q2 (50%, the median), the third quartile Q3 (75%), and the maximum (100%).

Quantiles are well understood when used with box plots. 
Box plots summarize and identify the range (minimum and maximum), Q1, Q2, and Q3 of a variable.

In [28]:
quantile(auto_mpg$displacement)

In [29]:
quantile(auto_mpg$model.year)

This command is very informative as it gives the minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum values of the variable. 
Quantiles are useful for describing the spread of a variable because they are less sensitive to outliers than measures such as the variance. 
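
A minimal sketch on toy data (the integers 1 to 9, chosen so the quartiles are easy to check by eye) shows the default output, custom cut points through the `probs` argument, and the interquartile range via `IQR()`. All three are base R functions; the results rely on R's default quantile method.

```R
x <- 1:9                            # toy data

quantile(x)                         # 0%, 25%, 50%, 75%, 100%
quantile(x, probs = c(0.1, 0.9))    # any other cut points can be requested
IQR(x)                              # Q3 - Q1, the width of the middle half of the data
```

The interquartile range is the length of the box in a box plot.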

#### Variance

    The average of the squared differences from the mean.
    
Variance measures how widely the values in a variable are spread around the mean. 
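If the observations vary greatly from the variable mean, the variance will be large, and vice versa. 
R has a built-in function, `var()`, to calculate the variance so that we don't have to do the math by hand. 
Note that `var()` computes the sample variance, which divides the sum of squared differences by n - 1 rather than n. 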

In [30]:
var(auto_mpg$displacement)

In [32]:
paste('variance:', var(auto_mpg$displacement))

The above value is the average squared deviation of the displacement values from their mean. 
Variance is often hard to interpret when trying to understand the spread of the data, because its units are the square of the units of the original data.
The standard deviation, however, is in the original units and gives a clearer idea of how the data is spread. 
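
To connect the definition to the built-in functions, the sketch below computes the variance by hand on a small made-up vector and compares it with `var()` and `sd()`.

```R
x <- c(2, 4, 4, 4, 5, 5, 7, 9)      # toy data, mean is 5
n <- length(x)
squared_dev <- (x - mean(x))^2      # squared differences from the mean

sum(squared_dev) / (n - 1)   # sample variance, matches var(x)
var(x)
sum(squared_dev) / n         # dividing by n instead gives the population variance
sqrt(var(x))                 # square root of the variance, matches sd(x)
sd(x)
```

The difference between dividing by n and n - 1 shrinks as the number of observations grows.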

#### Standard deviation

    SQRT(variance) = Standard Deviation

In [33]:
sd(auto_mpg$displacement)

In [34]:
paste("standard deviation:", sd(auto_mpg$displacement))

The values in the displacement variable have a standard deviation of 104.27. 
Recall that the mean was about 193.4. 
In later lessons, the mean and standard deviation will be combined to model a data population.

**Maximum, Minimum, Median, and Median Absolute Deviation (a robust measure of spread, similar to standard deviation):**

We will do this using a for-loop, which is a slow process in R.  
Inside the loop, `c()` is a generic R function that **combines** its arguments into a vector.

`print()` is a generic R command that prints the contents of an object.

Minimum, Maximum etc. cannot be calculated for character or factor data. 
So let's create a new dataframe called numeric_data with horsepower as a numeric type variable.
We will also exclude car.name.

In [35]:
#ncol is a function that returns the number of columns
numeric_data=auto_mpg[-ncol(auto_mpg)] # take all columns except the last column
numeric_data$horsepower = as.numeric(numeric_data$horsepower) # Convert horsepower to a numeric class

“NAs introduced by coercion”
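
A brief aside on `mad()`, which the summary loop further down relies on: in base R it computes the median absolute deviation, multiplied by a constant (1.4826 by default) so that it is comparable to the standard deviation for normally distributed data. The toy vector below is made up to show how little one extreme value affects it.

```R
x <- c(1, 2, 3, 4, 100)           # toy data with one extreme value

abs_dev <- abs(x - median(x))     # distances from the median
median(abs_dev)                   # raw median absolute deviation
mad(x)                            # the same value times the default constant 1.4826
mad(x, constant = 1)              # unscaled version
sd(x)                             # for contrast: strongly affected by the 100
```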


See this link for an explanation of the sprintf() function and its syntax: https://www.gastonsanchez.com/r4strings/c-style-formatting.html

In [36]:
print(sprintf("%15s %10s %10s %10s %30s","Column", "Maximum", "Minimum", "Median","Median Absolute Deviation"))
for(i in 1:ncol(numeric_data))
{
    print(sprintf("%15s %10.1f %10.1f %10.1f %10.1f", 
                  names(numeric_data[i]), 
                  max(numeric_data[,i]), 
                  min(numeric_data[,i]), 
                  median(numeric_data[,i]), 
                  mad(numeric_data[,i])   # mad() is the median absolute deviation
                 )) 
}

#sprintf() is a wrapper for the C function sprintf, that returns a character vector containing a formatted 
#combination of text and variable values.  
# "%15s" formats a string right-justified in a field 15 characters wide (padded with spaces)
# "%10.1f" formats a floating point number in a field 10 characters wide with 1 digit after the decimal point

[1] "         Column    Maximum    Minimum     Median      Median Absolute Deviation"
[1] "            mpg       46.6        9.0       23.0        8.9"
[1] "      cylinders        8.0        3.0        4.0        0.0"
[1] "   displacement      455.0       68.0      148.5       86.7"
[1] "     horsepower         NA         NA         NA         NA"
[1] "         weight     5140.0     1613.0     2803.5      945.2"
[1] "   acceleration       24.8        8.0       15.5        2.5"
[1] "     model.year       82.0       70.0       76.0        4.4"
[1] "         origin        3.0        1.0        1.0        0.0"
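
The format specifiers used in the loop can be checked in isolation. This is a small sketch with arbitrary inputs; `sprintf()` and `nchar()` are base R functions.

```R
sprintf("%15s", "mpg")          # right-justified in a field 15 characters wide
sprintf("%-15s|", "mpg")        # a minus sign left-justifies instead
sprintf("%10.1f", 23.456)       # width 10, one digit after the decimal point
nchar(sprintf("%15s", "mpg"))   # 15
```

The padding is made of spaces, which is what lines the columns up in the printed table.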


The loop in the cell above is inefficient but intuitive. 
We will do the same job in a more efficient fashion using the apply() command.  

Notice the "**NA**" values for horsepower; its missing values were converted to NAs, and functions such as `max()` return NA when any input is NA. We need to exclude the NAs from the computation. 

Below, the "2" refers to the columns of the input array; a "1" would refer to rows.
This is because rows are the _first index_ and columns are the _second index_.
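
To see what the margin argument and `na.rm` do, here is a small sketch on a made-up 2 x 3 matrix with one missing value. The variable names are chosen for this example only.

```R
# Filled column by column: columns are (1, 2), (NA, 4), (5, 6)
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)

col_max_raw <- apply(m, 2, max)                 # column 2 is NA because it contains an NA
col_max     <- apply(m, 2, max, na.rm = TRUE)   # NA ignored
row_sum     <- apply(m, 1, sum, na.rm = TRUE)   # margin 1 applies the function to each row

col_max_raw
col_max
row_sum
```

Extra arguments such as `na.rm = TRUE` are passed through `apply()` to the function being applied.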

In [37]:
cbind(Max=apply(numeric_data, 2, max, na.rm=TRUE), 
      Min=apply(numeric_data, 2, min, na.rm=TRUE), 
      Median=apply(numeric_data, 2, median, na.rm=TRUE), 
      Median_Absolute_Deviation=apply(numeric_data, 2, mad, na.rm=TRUE))

# cbind() takes a sequence of vector, matrix or data-frame arguments and combines them by columns.

# apply() returns a vector / array / list of values obtained by applying a function to margins of an array or matrix.
# The second argument in the apply function is the MARGIN argument, which, here, indicates the function is to be 
# applied to the columns (as indicated by the number "2".)  The third argument is the FUN argument, which specifies 
# the name of the function to be applied.

# na.rm=TRUE excludes the NA values so that the functions can compute numerical values. 

| | Max | Min | Median | Median_Absolute_Deviation |
|---|---|---|---|---|
| mpg | 46.6 | 9 | 23.0 | 8.8956 |
| cylinders | 8.0 | 3 | 4.0 | 0.0 |
| displacement | 455.0 | 68 | 148.5 | 86.7321 |
| horsepower | 230.0 | 46 | 93.5 | 28.9107 |
| weight | 5140.0 | 1613 | 2803.5 | 945.1575 |
| acceleration | 24.8 | 8 | 15.5 | 2.52042 |
| model.year | 82.0 | 70 | 76.0 | 4.4478 |
| origin | 3.0 | 1 | 1.0 | 0.0 |


What does the `cbind()` function do?
When you see functions used and you want more information, you can use the `help()` function to ask the R environment.

In a similar fashion you can read the documentation of the `apply()` function as well.

In [38]:
help(cbind)

---

## Variables in R

In statistical inference, there are basically two treatments of variables in a dataset: **dependent** and **independent** variables. 
The independent variables are used to predict the dependent variable's outcome. 
For example, in our auto-mpg dataset the variable 'mpg' can be dependent, i.e., it can be predicted by other variables in the dataset.  

Variables are often correlated with one another. 
For example, the displacement of a vehicle might be correlated with mpg. 
Small vehicles tend to get more miles per gallon than big vehicles. 
A correlation between a dependent variable and an independent variable doesn't mean that the independent variable is causing the changes in the dependent variable. 
The dependent variable changes along with the independent variables, but the changes aren't necessarily caused by them. 
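
As a hedged illustration of correlation, the sketch below uses made-up numbers for five hypothetical cars (not rows from auto_mpg) and the base R function `cor()`, which returns the Pearson correlation coefficient by default.

```R
# Made-up values for five hypothetical cars
displacement <- c(100, 150, 200, 300, 400)
mpg          <- c(35, 30, 25, 18, 12)

r <- cor(displacement, mpg)   # Pearson correlation, always between -1 and 1
r
```

A value of `r` close to -1 says that mpg tends to fall as displacement rises in this toy data. It says nothing about why, which is the point made above about correlation and causation.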

### Types of Variables:

The most important distinction between variables is whether they are qualitative or quantitative. 

**Qualitative variables:** Variables that express a qualitative attribute such as religion, favorite movie, gender, and so on fall into this category. They are sometimes referred to as categorical or nominal variables. 

**Quantitative variables:** Variables that are measured in terms of numbers. 
Some examples are height, weight, and shoe size.

Generally speaking, when the variable has a numeric value, we call it quantitative data. When you classify something, we call it qualitative data.


##### Flavors of Quantitative data:

<b>Discrete variables: </b> Some measures in data are discrete and cannot be made more precise. 
For example, number of children in a family is discrete, because you are counting indivisible entities. 
You can't have 2.5 kids or 1.3 pets.

<b>Continuous data: </b> Data that can be reduced to finer levels is said to be continuous in nature. 

### Levels of measurement

Both qualitative and quantitative variables follow levels of measurement. 
There are 4 levels: nominal, ordinal, interval and ratio scaled. 

<b>nominal:</b> Example : car.name - since it just has car names as its values. 
More examples could be marital status, gender, religious affiliation, etc.

<b>ordinal:</b> Example : cylinders - the number of cylinders in a car has a natural order, and more cylinders generally go with higher horsepower. 
More examples could be ranking of soldiers, grade a student belongs to, etc.

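<b>interval: </b> We can classify model.year as an interval scaled variable: differences between years are meaningful, but there is no true zero. 
More examples could be temperature in Celsius or Fahrenheit, I.Q., etc. 

<b>ratio:</b> Variables such as weight and displacement are ratio scaled: they have a true zero, so it makes sense to say that one car is twice as heavy as another. 
More examples could be daily calorie intake or height. 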

These levels are explained in further detail in **Chapter 1 in the [statistics online text book](http://onlinestatbook.com/2/introduction/levels_of_measurement.html)**. 
We will explore the concepts with the  _auto_mpg_ dataset columns and classify them into different kinds of variables.

In [39]:
summary(auto_mpg)

      mpg          cylinders      displacement    horsepower       
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:398        
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   Class :character  
 Median :23.00   Median :4.000   Median :148.5   Mode  :character  
 Mean   :23.51   Mean   :5.455   Mean   :193.4                     
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0                     
 Max.   :46.60   Max.   :8.000   Max.   :455.0                     
     weight      acceleration     model.year        origin     
 Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
 Median :2804   Median :15.50   Median :76.00   Median :1.000  
 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
   car.name        
 Length:398        
 Class :character  
 Mode  :characte

In [40]:
str(auto_mpg)

'data.frame':	398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : chr  "130.0" "165.0" "150.0" "150.0" ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car.name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...


From the above summary and structure of data, we can identify what kind of variable each one is!

mpg - quantitative[continuous] : 

      mpg data is numeric and continuous in nature.

cylinders - quantitative[discrete] : 

      cylinders data is integer type and is discrete in nature.

displacement - quantitative[continuous] : 

      Displacement data is numeric and continuous in nature.

horsepower - quantitative[discrete] : 

      Horsepower data is stored as character type because of the '?' placeholders, but its values are numeric and are treated as discrete here.

weight - quantitative[continuous] : 

      Weight data is stored as integers but is continuous in nature.

acceleration - quantitative[continuous] : 

      Acceleration data is numeric and continuous in nature.

model.year - quantitative[discrete] : 

      model.year data is numeric and is discrete with values ranging from 70 through 82.

origin - quantitative[discrete] : 

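      origin data is integer type and it's discrete in nature with levels 1, 2, 3. Because these integers are codes for a car's region of origin rather than measured quantities, origin can also be treated as a qualitative (nominal) variable.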

car.name - qualitative[nominal] : 

      Car names are nominal. So, it is qualitative data. 

For variables origin, model.year, and cylinders, the summary() doesn't give enough information to classify them as continuous or discrete. 
**You should explore the data and see the values in the dataset to characterize their level of measurement.**  
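
One way to explore the values is sketched below on a hypothetical toy data frame (`toy`, with made-up numbers). Counting distinct values per column with `unique()` and `table()` gives a quick hint about whether a numeric column behaves as discrete or continuous.

```R
# Hypothetical toy data frame with made-up values
toy <- data.frame(cylinders = c(4, 8, 4, 6, 8, 4),
                  weight    = c(2130, 3504, 2264, 2789, 4341, 1985))

n_distinct <- sapply(toy, function(col) length(unique(col)))
n_distinct                     # number of distinct values per column
sort(unique(toy$cylinders))    # the distinct values themselves
table(toy$cylinders)           # how often each value occurs
```

A column with only a handful of repeated values, like `cylinders` here, is a good candidate for a discrete or categorical treatment, while a column where nearly every value is different is more likely continuous. This is a heuristic, not a rule.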

## This concludes part A of the introductory statistics lab.  

### Please take a break, then continue with [Part B](./1_lab1_intro_to_stats_ptB.ipynb)

