# 1.2 Variables and Scales of Measurement
A **variable** is a feature of interest that (often) differs between observations. We can classify variables as **categorical** (qualitative) or **numerical** (quantitative), and numerical variables divide further into two kinds. **Discrete variables** take countable values, such as the number of dependents in a household. **Continuous variables** can take any value within an interval, such as a time of 0.002 ms or a weight of 100.4312 pounds.
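As a rough illustration of these categories, the sketch below builds a tiny table with one variable of each kind. The households, column names, and values are invented for this example (only the weight of 100.4312 pounds echoes the text above), and checking whether pandas treats a column as numeric is only a loose proxy for the categorical/numerical distinction.

```python
import pandas as pd

# Hypothetical households, invented purely for illustration.
households = pd.DataFrame({
    'region': ['North', 'South', 'South'],     # categorical (qualitative)
    'dependents': [0, 2, 3],                   # discrete numerical (a count)
    'weight_lbs': [100.4312, 154.2, 187.95],   # continuous numerical (a measurement)
})

# Ask pandas which columns it stores as numbers.
is_numeric = {
    col: pd.api.types.is_numeric_dtype(households[col])
    for col in households.columns
}
print(is_numeric)
```

Counts naturally land in an integer column and measurements in a float column, but the storage type alone does not settle the question: the *is_dem* column created later in this section is stored as integers yet describes a category.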

## The Measurement Scales
To summarize and analyze variables, we need to understand the measurement scale each one uses. There are four scales: nominal, ordinal, interval, and ratio. The **nominal scale** is the least sophisticated of the four: it only categorizes or groups observations. In other words, the scale differentiates observations by label. Let's consider the following pandas DataFrame.

In [48]:
import pandas as pd

# Make dataframe.
politicians = pd.DataFrame({
    'name': ['Donald Trump', 'Hillary Clinton', 'Barack Obama', 'John Thune', 'Matt Gaetz', 'Alexandria Ocasio-Cortez'],
    'party': ['R', 'D', 'D', 'R', 'R', 'D']
})
politicians

Unnamed: 0,name,party
0,Donald Trump,R
1,Hillary Clinton,D
2,Barack Obama,D
3,John Thune,R
4,Matt Gaetz,R
5,Alexandria Ocasio-Cortez,D


We have six observations of US politicians. Each politician can be categorized or grouped by the political party they belong to. Thus, the *party* variable is a categorical variable on a nominal scale. Often we can substitute numbers for these categories. There are many reasons to do this, and one of them is simplicity. Let's create a numerical representation of the party variable by flagging which politicians are Democrats.

In [49]:
# Which politicians are Democrats?
dem_mask = politicians['party'] == 'D'
politicians['is_dem'] = dem_mask.astype(int)

politicians

Unnamed: 0,name,party,is_dem
0,Donald Trump,R,0
1,Hillary Clinton,D,1
2,Barack Obama,D,1
3,John Thune,R,0
4,Matt Gaetz,R,0
5,Alexandria Ocasio-Cortez,D,1


Here we used the practice of creating a boolean mask for our data. As you can see, each Democrat now has a value of $1$ in the column *is_dem*. Notice that while we can categorize data with a nominal scale, we can't use it to rank our observations. Compared to the nominal scale, the **ordinal scale** gives a stronger level of measurement. With ordinal scales we can both categorize and rank our data. The weakness of ordinal observations is that we can't interpret the differences between ranks, because the actual numbers are arbitrary. Let's apply an ordinal scale to our politicians, based on the highest level of office they've achieved:
1. President = 1
2. Senator = 2
3. Representative = 3

In [50]:
# Ordinal scale for candidate office.
politicians['office'] = [1, 2, 1, 2, 3, 3]

politicians

Unnamed: 0,name,party,is_dem,office
0,Donald Trump,R,0,1
1,Hillary Clinton,D,1,2
2,Barack Obama,D,1,1
3,John Thune,R,0,2
4,Matt Gaetz,R,0,3
5,Alexandria Ocasio-Cortez,D,1,3


Notice that we can order (hence, ordinal) the variable *office*, but we can't necessarily interpret the differences between the ranks or the numbers used. We know, for example, that Obama has a higher office ranking than John Thune, but since the numbers are arbitrary, a difference of 1 doesn't have much interpretation. We could just as easily have ranked the political offices in the inverse order: President = 3, Senator = 2, Representative = 1. Or we could have started at any numerical value, for example President = 0.

Let's sort by *office* to give a sense of how ordinal scales could be used.

In [51]:
politicians.sort_values('office')

Unnamed: 0,name,party,is_dem,office
0,Donald Trump,R,0,1
2,Barack Obama,D,1,1
1,Hillary Clinton,D,1,2
3,John Thune,R,0,2
4,Matt Gaetz,R,0,3
5,Alexandria Ocasio-Cortez,D,1,3
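One way to make the "order matters, numbers don't" idea explicit is an ordered categorical. This is an alternative to the integer coding used above, not what the notebook itself does. The sketch rebuilds the office labels from the table above, and the category order from lowest to highest office is the only ranking information supplied.

```python
import pandas as pd

# Office labels in the same row order as the politicians table above.
office_labels = ['President', 'Senator', 'President', 'Senator',
                 'Representative', 'Representative']

office = pd.Series(pd.Categorical(
    office_labels,
    categories=['Representative', 'Senator', 'President'],  # lowest to highest
    ordered=True,
))

# Rank comparisons are allowed on an ordered categorical.
outranks_senator = office > 'Senator'
print(outranks_senator.tolist())

# The integer codes are just one arbitrary, order-preserving labeling.
print(office.cat.codes.tolist())
```

Here the codes happen to run 0, 1, 2 from Representative up to President, the reverse of the numbering chosen earlier, yet every rank comparison comes out the same. That is the sense in which the specific numbers on an ordinal scale carry no meaning beyond their order.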


The next scale is the **interval scale**, which is a step up in complexity from the ordinal scale. We can now categorize, rank, and find meaningful differences. One example of an interval scale is the Fahrenheit scale for temperatures. An interval scale that we could apply to our politicians is their placement on the left-right political spectrum. The main drawback of an interval scale is that the value of zero is arbitrary. For Fahrenheit, 0 degrees does not mean an absence of temperature. It only means 10 degrees colder than 10 degrees Fahrenheit. Similarly, a political spectrum value of 0 on a scale of $[-3, -2, -1, 0, 1, 2, 3]$ simply implies that a candidate is in the political center, one step away from the center left or the center right.

For this scale, consider $-3$ to be far left, $-2$ to be left, $-1$ to be center left, and $0$ to be center. Positive values correspond to the same positions on the political right.

In [52]:
# Left-right scale for candidates.
politicians['left_right'] = [2, -1, 0, 2, 3, -3]

politicians

Unnamed: 0,name,party,is_dem,office,left_right
0,Donald Trump,R,0,1,2
1,Hillary Clinton,D,1,2,-1
2,Barack Obama,D,1,1,0
3,John Thune,R,0,2,2
4,Matt Gaetz,R,0,3,3
5,Alexandria Ocasio-Cortez,D,1,3,-3


As we can see, this scale is fairly arbitrary. We can identify how far apart politicians are from one another on this interval, but any evenly spaced values could be used. In political science and sociology, it's common to see seven-point scales for measures like this, for example values from $1$ to $7$, where $4$ represents the center, or perhaps a neutral opinion on some topic, while $7$ could represent strong agreement or strong disagreement.
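To see why an arbitrary zero matters, it can help to express the same temperatures in two units. The sketch below uses three made-up Fahrenheit readings and the standard Fahrenheit-to-Celsius conversion. The helper name `fahrenheit_to_celsius` is introduced here for illustration only.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (temp_f - 32) * 5 / 9

# Three hypothetical readings in degrees Fahrenheit.
a_f, b_f, c_f = 80.0, 40.0, 20.0
a_c, b_c, c_c = (fahrenheit_to_celsius(t) for t in (a_f, b_f, c_f))

# Ratios of raw values depend on where zero happens to sit...
ratio_f = a_f / b_f   # 2 in Fahrenheit
ratio_c = a_c / b_c   # about 6 in Celsius

# ...but ratios of differences are the same in both units.
gap_ratio_f = (a_f - b_f) / (b_f - c_f)
gap_ratio_c = (a_c - b_c) / (b_c - c_c)

print(round(ratio_f, 3), round(ratio_c, 3))
print(round(gap_ratio_f, 3), round(gap_ratio_c, 3))
```

Saying that 80 degrees is "twice as hot" as 40 degrees therefore has no unit-independent meaning, while saying that the first gap is twice the second gap does. The same reasoning applies to the left-right scores: a score of 2 is not "twice as conservative" as a score of 1.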

Our final scale is the **ratio scale**, which is our strongest level of measurement. It has all the advantages of the previous scales: we can categorize, rank, and find meaningful differences, and the scale also has a true zero point. Because zero means a genuine absence of the quantity, ratios between values are meaningful as well. An example of a ratio scale for our politicians would be the number of votes they got in their last election. For simplicity, we'll consider the "last election" to be the most recent election for the highest office level they've achieved. So for Hillary Clinton, we'll consider her $2006$ US Senate election in New York, rather than her $2016$ presidential run.

In [53]:
# Number of votes, includes corresponding election year for understanding
politicians['last_election_vote'] = [74223975, 3008428, 65915795, 242316, 197349, 82453]
politicians['election_year'] = [2020, 2006, 2012, 2022, 2022, 2022]

politicians

Unnamed: 0,name,party,is_dem,office,left_right,last_election_vote,election_year
0,Donald Trump,R,0,1,2,74223975,2020
1,Hillary Clinton,D,1,2,-1,3008428,2006
2,Barack Obama,D,1,1,0,65915795,2012
3,John Thune,R,0,2,2,242316,2022
4,Matt Gaetz,R,0,3,3,197349,2022
5,Alexandria Ocasio-Cortez,D,1,3,-3,82453,2022
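Because vote counts have a true zero, "how many times as many" is a question we can actually answer. The sketch below reuses the vote totals from the table above. Comparing a Senate race with a House race is done only to illustrate the arithmetic, not as a fair political comparison.

```python
import pandas as pd

# Vote totals copied from the last_election_vote column above.
votes = pd.Series(
    [74223975, 3008428, 65915795, 242316, 197349, 82453],
    index=['Donald Trump', 'Hillary Clinton', 'Barack Obama',
           'John Thune', 'Matt Gaetz', 'Alexandria Ocasio-Cortez'],
)

# A ratio of two values is meaningful on a ratio scale.
thune_vs_aoc = votes['John Thune'] / votes['Alexandria Ocasio-Cortez']
print(round(thune_vs_aoc, 2))

# Zero votes would mean a genuine absence of votes, unlike 0 on the left-right scale.
print((votes > 0).all())
```

The first number says Thune received a little under three times as many votes as Ocasio-Cortez. No comparable statement could be made with the *office* or *left_right* columns.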


In [54]:
""" EXAMPLE 1.2

We want to understand the needs of the "tween" population (kids between the ages of 8 and 12) for a ski resort.

The owner of the resort believes that tween spending power has grown, and he wants us to see if we can provide
insights to increase their return to the resort and thus revenue. The following survey questions were given
to the tweens:

-> Q1: "On your car drive to the resort, which music streaming service was playing?"
-> Q2: "On a scale of 1 to 4, rate the quality of the food at the resort."
    -> 1 = poor, 2 = fair, 3 = good, 4 = excellent
-> Q3: "Presently, the main dining area closes at 3:00 pm. What time do you think it should close?"
-> Q4: "How much of your own money did you spend at the lodge today?"
"""
# Read the data.
data = pd.read_csv('Tween_Survey.csv', index_col=False)
data.head(5)

Unnamed: 0,Tween,Question 1,Question 2,Question 3,Question 4
0,1,Apple Music,4,5:00pm,20
1,2,Pandora,2,5:00pm,10
2,3,Spotify,2,4:30pm,10
3,4,Apple Music,3,4:00pm,0
4,5,Spotify,1,3:30pm,0


In [55]:
# Question 1 has nominal data, we can only categorize the observations.
music_counts = data['Question 1'].value_counts()
n = len(data)

q1_summary = pd.DataFrame({'Service': music_counts.index,
                           'Count': music_counts, 'Percent': music_counts / n})
q1_summary

Question 1,Service,Count,Percent
Spotify,Spotify,12,0.6
Apple Music,Apple Music,6,0.3
Pandora,Pandora,2,0.1


Given that $60\%$ of tweens reported Spotify playing on the drive to the resort, the owner may want to direct advertisements to Spotify!
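For nominal data like this, counts, shares, and the most common category (the mode) are about the only summaries that make sense. As a small sketch, the answers below are rebuilt from the counts reported above (12, 6, and 2). The order of the individual responses is an assumption and does not affect the tally.

```python
import pandas as pd

# Rebuild the Question 1 answers from the reported counts.
answers = pd.Series(['Spotify'] * 12 + ['Apple Music'] * 6 + ['Pandora'] * 2)

# normalize=True returns shares instead of raw counts.
shares = answers.value_counts(normalize=True)
print(shares)

# The mode is the natural "typical value" for a nominal variable.
print(answers.mode().tolist())
```

Passing `normalize=True` to `value_counts` gives the same shares as dividing the counts by `n`, which is what the cell above does by hand.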

In [56]:
# Question 2 has ordinal data, we can categorize and rank observations.
quality_counts = data['Question 2'].value_counts().sort_index()

fair_at_best = quality_counts[1] + quality_counts[2]
print(fair_at_best,'  ',fair_at_best/n)

11    0.55


From the survey, 11 tweens, or $55\%$, thought the food was fair at best. Perhaps the owner should conduct a follow-up survey on food satisfaction to understand why.

In [57]:
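# Question 3 has an interval scale.
# Turn '4:30pm' into 4.30 so the times can be compared. This is hour.minute
# notation, not decimal hours (4:30pm becomes 4.30, not 4.5), so use it for ordering only.
time_prefs = data['Question 3'].str.strip('pm').replace(':', '.', regex=True).astype(float)

# Count respondents who want a closing time later than the current 3:00 pm.
past_close = (time_prefs > 3.0).sum()
past_close / n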

0.95

19 out of 20 survey respondents, or $95\%$, would prefer the dining area to stay open past 3:00 pm.
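Since clock times are on an interval scale, differences between them are meaningful, so we can also ask *how much* later tweens would like the dining area to stay open. The sketch below is one possible approach, not the notebook's own: it uses only the five answers shown in `data.head(5)` above, assumes every answer follows the `h:mmpm` pattern, and converts each time to minutes after midnight.

```python
import pandas as pd

# The five Question 3 answers visible in data.head(5) above.
times = pd.Series(['5:00pm', '5:00pm', '4:30pm', '4:00pm', '3:30pm'])

# Split each answer into hour, minute, and am/pm parts.
parts = times.str.extract(r'(?P<hour>\d+):(?P<minute>\d+)(?P<ampm>am|pm)')

# Convert to a 24-hour clock, then to minutes after midnight.
hours_24 = parts['hour'].astype(int) % 12 + (parts['ampm'] == 'pm') * 12
minutes = hours_24 * 60 + parts['minute'].astype(int)

# Difference from the current 3:00 pm close (15 * 60 minutes after midnight).
minutes_past_close = minutes - 15 * 60
print(minutes_past_close.tolist())
print(minutes_past_close.mean())
```

For these five respondents the preferred closing time averages 84 minutes past the current close. The full survey would need the same calculation on all 20 rows.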

In [58]:
# Question 4 has a ratio scale.
spend_money = sum(data['Question 4'] > 0)
avg_spent = data['Question 4'].mean()

print('Tweens who spent money: ', spend_money, ' ', spend_money/n)
print('Average spent:', avg_spent)

Tweens who spent money:  17   0.85
Average spent: 11.5


17 of 20 tweens, or $85\%$, spent their own money at the resort, with an average of $\$11.50$ spent across all 20 respondents.