## Variables in Statistics

Previously, we discussed the details around collecting data for our analysis. In this lesson, we'll focus on understanding the structural parts of a dataset, and how they're measured.

Whether a sample or a population, a dataset is generally an attempt to correctly describe a relatively small part of the world. The dataset we worked with in the previous lesson describes basketball players and their performance in the 2016-2017 season.

Other datasets might attempt to describe the stock market, patient symptoms, stars from galaxies other than ours, movie ratings, customer purchases, and all sorts of other things.

The things we want to describe usually have a myriad of properties. A human, for instance, besides the property of being a human, can also have properties like height, weight, age, name, hair color, gender, nationality, whether they're married or not, whether they have a job or not, etc.

In practice, we limit ourselves to the properties relevant to the questions we want to answer, and to the properties that we can actually measure. Let's consider three rows at random from the basketball dataset we've previously worked with:

![image.png](attachment:56ff062e-81ae-41d0-b757-1ba772a06a30.png)

Each row describes an individual having a series of properties: name, team, position on the court, height, etc. For most properties, the values vary from row to row. All players have a height, for example, but the height values vary from player to player.

The properties with varying values we call variables. The height property in our dataset is an example of a variable. In fact, all the properties described in our dataset are variables.

A row in our dataset describes the actual values that each variable takes for a given individual.

Notice that this particular meaning of the "variable" concept is restricted to the domain of statistics. A variable in statistics is not the same as a [variable in programming](https://en.wikipedia.org/wiki/Variable_(computer_science)), or in [other domains](https://en.wikipedia.org/wiki/Variable).

## Quantitative and Qualitative Variables

Variables in statistics can describe either quantities, or qualities.

For instance, the Height variable in our dataset describes how tall each player is. The Age variable describes how much time has passed since each player was born. The MIN variable describes how many minutes each player played in the 2016-2017 WNBA season.

Generally, a variable that describes how much there is of something describes a quantity, and, for this reason, it's called a quantitative variable.

Usually, quantitative variables describe a quantity using real numbers, but there are also cases when words are used instead. Height, for example, can be described using real numbers, like in our dataset, but it can also be described using labels like "tall" or "short".

A few variables in our dataset clearly don't describe quantities. The Name variable, for instance, describes the name of each player. The Team variable describes what team each player belongs to. The College variable describes what college each player goes or went to.

The Name, Team, and College variables describe for each individual a quality, that is, a property that is not quantitative. Variables that describe qualities are called qualitative variables or categorical variables. Generally, qualitative variables describe what or how something is.

Usually, qualitative variables describe qualities using words, but numbers can also be used. For instance, the number on a player's shirt or the number of a racing car are described using numbers. The numbers don't bear any quantitative meaning, though: they are just names, not quantities.

In the diagram below we do a head-to-head comparison between qualitative and quantitative variables:

![image.png](attachment:fb4d65e7-8fb3-44a1-82c2-40dc90669058.png)

We've selected a few variables from our dataset. For each of the variables selected, indicate whether it's quantitative or qualitative.

In [1]:
import pandas as pd
import numpy as np

In [2]:
wnba = pd.read_csv('../../Datasets/wnba.csv')

In [3]:
wnba.head(2)

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,...,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,...,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,...,19,82,101,72,63,13,40,217,0,0


In [4]:
wnba['3PA'].unique()

array([ 32,  18,  64,  68,  20,   9,  15, 150, 103,  11,  62,   3,   8,
        10,   0,   1, 123,  93,  50,  57,  87,  52, 114,   2,  45, 100,
        60,  30,  91,   5, 225,  66,  33,  79, 129,  35,   6,  13, 116,
        49,  29,  21,  55,  23,  83, 147,  78, 194,  53,  92,  34, 132,
        28,  51,   4,  48,  40,  89,  74,  94,   7,  41,  46,  26, 149,
       119,  38, 134, 163,  70, 101, 112,  69,  56,  47], dtype=int64)

## Scales of Measurement

The amount of information a variable provides depends on its nature (whether it's quantitative or qualitative), and on the way it's measured.

For instance, if we analyze the Team variable for any two individuals:

- We can tell whether or not the two individuals are different from each other with respect to the team they play for.
- But if there's a difference:
    - We can't tell the size of the difference.
    - We can't tell the direction of the difference: we can't say that team A is greater or less than team B.

On the other hand, if we analyze the Height variable:

- We can tell whether or not two individuals are different.
- If there's a difference:
    - We can tell the size of the difference. If player A is 190 cm tall and player B is 192 cm tall, then the difference between the two is 2 cm.
    - We can tell the direction of the difference from each perspective: player A is 2 cm shorter than player B, and player B is 2 cm taller than player A.

![image.png](attachment:186fc04f-f8c3-4da2-a815-b2ada11ec571.png)

The Team and Height variables provide different amounts of information because they have a different nature (one is qualitative, the other quantitative), and because they are measured differently.

The system of rules that defines how each variable is measured is called a scale of measurement or, less often, a level of measurement.

In the next screens, we'll learn about a system of measurement made up of four different scales of measurement: nominal, ordinal, interval, and ratio. As we'll see, the characteristics of each scale pivot around three main questions:

- Can we tell whether two individuals are different?
- Can we tell the direction of the difference?
- Can we tell the size of the difference?

## The Nominal Scale

In the previous screen, we discussed the Team variable and said that by examining its values we can tell whether two individuals are different or not, but we can't indicate the size and the direction of the difference.

The Team variable is an example of a variable measured on a nominal scale. For any variable measured on a nominal scale:

- We can tell whether two individuals are different or not (with respect to that variable).
- We can't say anything about the direction and the size of the difference.
- We know that it can only describe qualities.

![image.png](attachment:1b44175c-079d-402a-99d2-3b0159d28a1c.png)

When a qualitative variable is described with numbers, the principles of the nominal scale still hold. We can tell whether there's a difference or not between individuals, but we still can't say anything about the size and the direction of the difference.

If basketball player A has the number 5 on her shirt, and player B has 8, we can tell they're different with respect to shirt numbers, but it doesn't make any sense to subtract the two values and quantify the difference as 3. Nor does it make sense to say that B is greater than A. The numbers on the shirts are just identifiers here; they don't quantify anything.
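
One way to see the nominal scale in code is with an unordered pandas categorical. The sketch below uses a small made-up sample (the team codes and shirt numbers are illustrative, not taken from the dataset) and shows that equality checks and counts work, while an ordering comparison is rejected.

```python
import pandas as pd

# Hypothetical mini-sample: these values are for illustration only.
players = pd.DataFrame({
    "Team": ["DAL", "LA", "DAL"],
    "Shirt": [5, 8, 23],
})

# Store both variables as unordered categories so pandas treats them as labels.
team = players["Team"].astype("category")
shirt = players["Shirt"].astype("category")

# Telling whether two individuals differ is meaningful on a nominal scale.
same_team = team.iloc[0] == team.iloc[2]

# Direction is not: pandas refuses "<" on an unordered categorical.
try:
    shirt < 8
    ordering_allowed = True
except TypeError:
    ordering_allowed = False

# Counting how often each label occurs is still fine.
team_counts = team.value_counts()

print(same_team, ordering_allowed)
print(team_counts)
```

Converting the shirt numbers to a category is a design choice that protects against accidentally averaging or subtracting identifiers.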

## The Ordinal Scale

In our last exercise, we saw that the new Height_labels variable was showing labels like "short", "medium", or "tall". By examining the values of this new variable, we can tell whether two individuals are different or not. But, unlike in the case of a nominal scale, we can also tell the direction of the difference. Someone who is assigned the label "tall" is taller than someone assigned the label "short".

However, we still can't determine the size of the difference. This is an example of a variable measured on an ordinal scale.

![image.png](attachment:1748c3f1-3832-4f9c-be0a-cfd7ce1965af.png)

Generally, for any variable measured on an ordinal scale, we can tell whether individuals are different or not and we can tell the direction of the difference, but we still can't determine the size of the difference.
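
As a rough illustration, an ordered pandas categorical behaves much like a variable on an ordinal scale. The labels below are made up for the example, and the category order (`short` < `medium` < `tall`) is stated explicitly as an assumption.

```python
import pandas as pd

# Hypothetical labels; the order of the categories is declared explicitly.
height_labels = pd.Series(
    pd.Categorical(
        ["tall", "short", "medium", "tall"],
        categories=["short", "medium", "tall"],
        ordered=True,
    )
)

# Direction is meaningful: we can compare against a label and find the maximum.
taller_than_short = height_labels > "short"
tallest_label = height_labels.max()

# Size is not: the values are labels, so subtracting them is undefined.
try:
    height_labels.iloc[0] - height_labels.iloc[1]
    size_available = True
except TypeError:
    size_available = False

print(taller_than_short.tolist())
print(tallest_label, size_available)
```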

Variables measured on an ordinal scale are generally qualitative, but can be quantitative if the intervals between the categories are treated as equal and continuous. Quantitative variables, however, can be measured in other ways too, as we'll see next in this lesson.

![image.png](attachment:715f4866-59dd-4444-b70f-fc6b7fd1cfad.png)

Common examples of variables measured on ordinal scales include ranks: ranks of athletes, of horses in a race, of people in various competitions, etc.

For example, let's say we only know that athlete A finished second in a marathon, and athlete B finished third in the same race. We can immediately tell their performance is different, we know that athlete A finished faster, but we don't know how much faster. The difference between the two could be half a second, 12 minutes, half an hour, etc.
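
The marathon example can be sketched with two hypothetical races in which the ranks are identical but the time gaps are very different. All finishing times below are invented for illustration, and `ranks` is a helper written for this sketch.

```python
# Hypothetical finishing times in seconds for two different races.
race_1 = {"W": 8990.0, "A": 9000.0, "B": 9000.5}
race_2 = {"W": 8990.0, "A": 9000.0, "B": 10800.0}


def ranks(times):
    """Return 1-based ranks, fastest first (assumes no ties)."""
    ordered = sorted(times, key=times.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


ranks_1 = ranks(race_1)
ranks_2 = ranks(race_2)

# The ranks are the same in both races...
print(ranks_1, ranks_2)

# ...but the gap between athlete A and athlete B is not.
gap_1 = race_1["B"] - race_1["A"]
gap_2 = race_2["B"] - race_2["A"]
print(gap_1, gap_2)
```

Knowing only the ranks, there is no way to recover `gap_1` or `gap_2`.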

Other common examples include measurements of subjective evaluations that are generally difficult or nearly impossible to quantify with precision. For instance, when answering a survey about how much they like a new product, people may have to choose a value from 1 to 5 for the following:

1. "I hate it"
2. "I don't like it"
3. "I don't like or dislike it"
4. "I like it"
5. "I love it"

This is an example of a quantitative variable measured on an ordinal scale because the intervals between categories can be considered equal, and mathematical operations such as finding the mean and median can be used meaningfully.

The values of the variables measured on an ordinal scale can be both words and numbers. When the values are numbers, they are usually ranks, but we still can't use the numbers to compute the size of the difference. We can't say how much faster an athlete was than another by simply comparing their ranks.

In [5]:
wnba['Experience']

0       2
1      12
2       4
3       6
4       R
       ..
138     6
139     9
140     2
141     8
142     2
Name: Experience, Length: 143, dtype: object

## The Interval and Ratio Scales

We've seen in the case of the Height variable that the values have direction when measured on an ordinal scale. The downside is that we don't know the size of each interval between values, and because of this we can't determine the size of the difference.

![image.png](attachment:97a99c4c-f638-439e-a0e6-f587e3acadd1.png)

An alternative here is to measure the Height variable using real numbers, which will result in having well-defined intervals, which in turn will allow us to determine the size of the difference between any two values.

![image.png](attachment:56814a2a-ba16-425b-8f1c-b6beccad9a21.png)

A variable measured on a scale that preserves the order between values and has well-defined intervals using real numbers is an example of a variable measured either on an interval scale, or on a ratio scale.

In practice, variables measured on interval or ratio scales are very common, if not the most common. Examples include:

- Height measured with a numerical unit of measurement (like inches or centimeters).
- Weight measured with a numerical unit of measurement (multiples and submultiples of grams, for instance).
- Time measured with a numerical unit of measurement (multiples and submultiples of seconds, for example).
- The price of various products measured with a numerical unit of measurement (like dollars, pounds, etc.).

![image.png](attachment:8c27f7f2-bf0d-4ed1-b3b1-c63f0341b11f.png)

## The Difference Between Ratio and Interval Scales

What sets apart ratio scales from interval scales is the nature of the zero point.

On a ratio scale, the zero point means no quantity. For example, the Weight variable is measured on a ratio scale, which means that 0 grams indicates the absence of weight.

On an interval scale, however, the zero point doesn't indicate the absence of a quantity. It actually indicates the presence of a quantity.

To illustrate this case using our dataset, we've used the Weight variable (measured on a ratio scale), and created a new variable that is measured on an interval scale. The new variable describes by how many kilograms the weight of a player differs from the average weight of the players in our dataset. Here's a random sample that includes values from the new variable named Weight_deviation:

![image.png](attachment:37cc1c32-8b63-4b08-9289-469225b10139.png)

If a player had a value of 0 for our Weight_deviation variable (which is measured on an interval scale), that wouldn't mean the player has no weight. Rather, it'd mean that her weight is exactly the same as the mean. The mean of the Weight variable is roughly 78.98 kg, which means that the zero point in the Weight_deviation variable is equivalent to 78.98 kg.
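
A variable like Weight_deviation can be built by subtracting the mean from every value. The sketch below is one plausible way to do it, using only four weights so that it runs on its own; the mean of this mini-sample is therefore not the 78.98 kg of the full dataset.

```python
import pandas as pd

# Four weights in kg used as a stand-in for the full Weight column.
weights = pd.Series([71.0, 73.0, 84.0, 89.0], name="Weight")

# Shift the zero point from "no weight" to "the mean weight".
mean_weight = weights.mean()
weight_deviation = (weights - mean_weight).rename("Weight_deviation")

print(mean_weight)
print(weight_deviation.tolist())
```

With the real data, the same idea would read `wnba['Weight'] - wnba['Weight'].mean()`.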

On the other hand, a value of 0 for the Weight variable, which is measured on a ratio scale, indicates the absolute absence of weight.

Another important difference between the two scales is given by the way we can measure the size of the differences.

On a ratio scale, we can quantify the difference in two ways. One way is to measure a distance between any two points by simply subtracting one from another. The other way is to measure the difference in terms of ratios.

For example, by doing a simple subtraction using the data in the table above, we can tell that the difference (the distance) in weight between Clarissa dos Santos and Alex Montgomery is 5 kg. In terms of ratios, however, Clarissa dos Santos is roughly 1.06 times as heavy as Alex Montgomery (the result of 89 kg divided by 84 kg). To give a more straightforward example, if player A weighed 90 kg and player B weighed 45 kg, we could say that player A is twice as heavy as player B (90 kg divided by 45 kg).

On an interval scale, however, we can meaningfully measure the difference between any two points only by finding the distance between them (by subtracting one point from another). If we look at the Weight_deviation variable, we can say there's a difference of 5 kg between Clarissa dos Santos and Alex Montgomery. However, if we took ratios, we'd have to say that Clarissa dos Santos is two times heavier than Alex Montgomery, which is not true.
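
The arithmetic behind this comparison can be checked with the numbers quoted in this lesson (89 kg, 84 kg, and a mean of roughly 78.98 kg). The variable names below are chosen for this sketch only.

```python
# Values quoted in the lesson text.
weight_clarissa = 89.0
weight_alex = 84.0
mean_weight = 78.98

# Interval-scale versions of the same measurements.
dev_clarissa = weight_clarissa - mean_weight
dev_alex = weight_alex - mean_weight

# Distances agree on both scales.
distance_ratio_scale = weight_clarissa - weight_alex
distance_interval_scale = dev_clarissa - dev_alex

# Ratios do not: only the ratio-scale one is meaningful.
ratio_of_weights = weight_clarissa / weight_alex
ratio_of_deviations = dev_clarissa / dev_alex

print(distance_ratio_scale, round(distance_interval_scale, 2))
print(round(ratio_of_weights, 2), round(ratio_of_deviations, 2))
```

Shifting the zero point leaves differences untouched but changes every ratio, which is why ratios are only trusted when zero means "no quantity".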

![image.png](attachment:7cee6961-b27d-4659-8bf3-479a299c21c7.png)

In [6]:
wnba.head(2)

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,...,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,...,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,...,19,82,101,72,63,13,40,217,0,0


In [7]:
wnba.columns

Index(['Name', 'Team', 'Pos', 'Height', 'Weight', 'BMI', 'Birth_Place',
       'Birthdate', 'Age', 'College', 'Experience', 'Games Played', 'MIN',
       'FGM', 'FGA', 'FG%', '15:00', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB',
       'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'DD2', 'TD3'],
      dtype='object')

## Common Examples of Interval Scales

In practice, variables measured on an interval scale are relatively rare. Below we discuss two examples that are more common.

Generally, points in time are indicated by variables measured on an interval scale. Let's say we want to indicate the point in time of the first manned mission on the Moon. If we want to use a ratio scale, our zero point must be meaningful and denote the absence of time. For this reason, we'd basically have to begin the counting at the very beginning of time.

There are many problems with this approach. One of them is that we don't know with precision when time began (assuming time actually has a beginning), which means we don't know how far away in time we are from that zero point.

To overcome this, we can set an arbitrary zero point, and measure the distance in time from there. Customarily, we use the Anno Domini system where the zero point is arbitrarily set at the moment Jesus was born. Using this system, we can say that the first manned mission on the Moon happened in 1969. This means that the event happened 1968 years after Jesus' birth (1968 because there's no year 0 in the Anno Domini system).

![image.png](attachment:da5d30ff-2e16-4bc7-9722-94ba8fecf94b.png)

Another common example has to do with measuring temperature. In day-to-day life, we usually measure temperature on a Celsius or a Fahrenheit scale. These scales are examples of interval scales.

Because temperature is measured on an interval scale, we need to avoid quantifying the difference in terms of ratios. For example, 0°C and 0°F are arbitrarily set zero points and don't indicate the absence of temperature. If 0°C or 0°F were meaningful zero points, temperatures below 0°C or 0°F wouldn't be possible. But we know that we can go way below 0°C or 0°F.

If yesterday was 10°C, and today is 20°C, we can't say that today is twice as hot as yesterday. We can say, however, that today's temperature is 10°C more compared to yesterday.

Temperature can be measured on a ratio scale too, and this is done using the Kelvin scale. 0 K (0 Kelvin) is not set arbitrarily, and it indicates the lack of temperature. The temperature can't possibly drop below 0 K.
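
A short calculation makes the point concrete. The sketch below converts the two Celsius readings from the example to Kelvin (using the standard offset of 273.15) and compares the ratios on each scale; `celsius_to_kelvin` is a helper defined for this illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15


yesterday_c, today_c = 10.0, 20.0

# The difference is the same on both scales.
difference_c = today_c - yesterday_c
difference_k = celsius_to_kelvin(today_c) - celsius_to_kelvin(yesterday_c)

# The Celsius ratio suggests "twice as hot", which is not meaningful.
naive_ratio = today_c / yesterday_c

# The Kelvin ratio is the meaningful one, and it is far from 2.
kelvin_ratio = celsius_to_kelvin(today_c) / celsius_to_kelvin(yesterday_c)

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(kelvin_ratio, 3))
```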

![image.png](attachment:9b9d0e41-613a-4ac0-b8af-65d233921436.png)

## Discrete and Continuous Variables

Previously in this lesson we divided variables into two big categories: quantitative and qualitative. We've seen that quantitative variables can be measured on ordinal, interval, or ratio scales. In this screen, we zoom in on variables measured on interval and ratio scales.

We've learned that variables measured on interval and ratio scales can only take real numbers as values. Let's consider a small random sample of our dataset and focus on the Weight and PTS (total points) variables, which are both measured on a ratio scale.

![image.png](attachment:cc4dbacf-ed08-43c5-8032-6ca01f1848ec.png)

The first two players scored 32 and 31 points, respectively. Between 32 and 31 points there's no possible intermediate value. Provided the measurements are correct, it's impossible to find a player having scored 31.5 or 31.2 points. In basketball, players can only score 1, 2, or 3 points at a time, so the points variable can only be expressed in integers when measured on an interval or ratio scale.

Generally, if there's no possible intermediate value between any two adjacent values of a variable, we call that variable discrete.

Common examples of discrete variables include counts of people in a class, a room, an office, a country, a house, etc. For instance, if we counted the number of people living in each house of a given street, the results of our counting could only be integers. For any given house, we could count 1, 3, 7, or 0 people, but we could not count 2.3 people, or 4.1378921521 people.

In the table above, we can also see that the first player weighs 86 kg, and the second 76 kg. Between 86 kg and 76 kg, there's an infinity of possible values. In fact, between any two values of the Weight variable, there's an infinity of values.

This is strongly counter-intuitive, so let's consider an example of two values that are relatively close together: 86 kg and 87 kg. Between these values we can have an infinity of values: 86.2 kg, 86.6 kg, 86.40 kg, 86.400001 kg, 86.400000000000001 kg, 86.400000000000000000000000000000000000000000001 kg, and so on.

In the diagram below we consider values between 86 and 87 kg, and break down the interval into five equal parts. Then we take two values (86.2 and 86.8) from the interval 86 - 87, and break down the interval between these values (86.2 and 86.8) into five equal parts. Then we repeat the process for the interval 86.2 - 86.8. In fact, we could repeat the process infinitely.

![image.png](attachment:f7217895-0b27-4f41-9f73-97cdfddfc665.png)

In practice, we limit ourselves to rounding the weights to a couple of decimal places either for practical purposes or because the instruments we use to measure weight are imperfect.

Generally, if there's an infinity of values between any two values of a variable, we call that variable continuous.
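
To make "an infinity of values" a little more tangible, the sketch below keeps halving the interval between 86 and 87 and always finds a new value strictly inside it. It uses `fractions.Fraction` so the arithmetic is exact; the `midpoints` helper is written for this illustration, and the loop could run for as many steps as we like.

```python
from fractions import Fraction


def midpoints(low, high, steps):
    """Repeatedly replace `high` with the midpoint of `low` and `high`."""
    values = []
    for _ in range(steps):
        high = (low + high) / 2
        values.append(high)
    return values


between = midpoints(Fraction(86), Fraction(87), steps=5)

# Each new value is distinct and still lies strictly between 86 and 87.
for value in between:
    print(value, float(value))
```

No matter how many steps are requested, the process never runs out of intermediate values, which is what separates a continuous variable from a discrete one.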

Whether a variable is discrete or continuous is determined by the underlying nature of the variable being considered, and not by the values obtained from the measurement. For instance, we can see in our dataset that height only takes integer values:

In [17]:
wnba['DD2']

0       0
1       0
2       0
3       2
4       0
       ..
138     0
139     0
140     0
141    11
142     0
Name: DD2, Length: 143, dtype: int64

This doesn't make the Height variable discrete. It just tells us that the height is not measured with a great degree of precision.

## Real Limits

Let's consider these ten rows where players are recorded as having the same weight:

Do all these players really have the exact same weight? Most likely, they don't. If the values were measured with a precision of one decimal, we'd probably see that the players have different weights. One player may weigh 76.7 kg, another 77.2 kg, another 77.1 kg.

As an important aside, the weight values in the table above are all 77.0, and the trailing zero suggests a precision of one decimal place, but this is not the case. pandas automatically converts the values to float64 because of one NaN value in the Weight column, so they end up with a trailing zero that gives the false impression of one-decimal precision. So a player was recorded as weighing 77 kg (zero-decimal precision), not 77.0 kg (one-decimal precision).
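
This conversion is easy to reproduce on a tiny made-up Series. The values below are hypothetical; the nullable `Int64` dtype shown at the end is one option pandas offers for keeping integers alongside missing values, and is not something the lesson's dataset uses.

```python
import pandas as pd

# Integer weights with no missing values keep an integer dtype.
complete = pd.Series([77, 80, 64])

# A single missing value makes pandas fall back to float64.
with_missing = pd.Series([77, 80, None])

# Assumption for illustration: the nullable integer dtype avoids the conversion.
nullable = pd.Series([77, 80, None], dtype="Int64")

print(complete.dtype, with_missing.dtype, nullable.dtype)
print(with_missing.tolist())
```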

Returning to our discussion, if we measure the weight with zero decimals precision (which we do in our dataset), a player weighing 77.4 kg will be assigned the same weight (77 kg) as a player weighing 76.6 kg. So if a player's recorded weight is 77 kg, we can only tell that her actual weight is somewhere between 76.5 kg and 77.5 kg. The value of 77 is not really a distinct value here. Rather, it's an interval of values.

This principle applies to any possible numerical weight value. If a player is measured to weigh 76.5 kg, we can only tell that her weight is somewhere between 76.45 kg and 76.55 kg. If a player has 77.50 kg, we can only tell that her weight is somewhere between 77.495 kg and 77.505 kg. Because there can be an infinite number of decimals, we could continue this breakdown infinitely.

![image.png](attachment:4bc5b071-a3d6-44a7-a17a-d18dd989bc9c.png)

Generally, every value of a continuous variable is an interval, no matter how precise the value is. The boundaries of an interval are sometimes called real limits. The lower boundary of the interval is called lower real limit, and the upper boundary is called upper real limit.
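
The real limits of a recorded value follow directly from its precision: they sit half a unit of the last recorded decimal place below and above the value. A minimal sketch, assuming the value is passed as a string so that trailing zeros (and therefore the precision) are preserved; `real_limits` is a hypothetical helper, not a pandas or standard-library function.

```python
from decimal import Decimal


def real_limits(recorded):
    """Return (lower, upper) real limits for a value given as a string.

    The number of decimals in the string is taken as the precision.
    """
    value = Decimal(recorded)
    exponent = value.as_tuple().exponent  # 0 for "77", -1 for "76.5"
    half_unit = Decimal(5).scaleb(exponent - 1)  # half of the last place
    return value - half_unit, value + half_unit


for recorded in ["77", "76.5", "77.50"]:
    lower, upper = real_limits(recorded)
    print(recorded, "->", lower, "to", upper)
```

`Decimal` is used instead of `float` so that values such as 76.45 are represented exactly.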

![image.png](attachment:6730d850-132a-40ef-a0ed-a9276ffceab1.png)

In the figure above, we can see, for example, that 88.5 is halfway between 88 and 89. If we got a measurement of 88.5 kg in practice but wanted only integers in our dataset (hence zero decimals precision), we might wonder whether to assign the value to 88 or 89 kg. The answer is that 88.5 kg is exactly halfway between 88 and 89 kg, and it doesn't necessarily belong to either of those two values. The assignment depends only on how we choose to round numbers: if we round up, 88.5 kg will be assigned to 89 kg; if we round down, it will be assigned to 88 kg.
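
The choice of rounding rule can be made explicit in code. The sketch below uses Python's `decimal` module to round the halfway value both ways, and also shows that the built-in `round` follows a third convention (round half to even), which is worth knowing before relying on it.

```python
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_UP

measured = Decimal("88.5")

# Explicit rounding rules for a value exactly halfway between two integers.
rounded_up = measured.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
rounded_down = measured.quantize(Decimal("1"), rounding=ROUND_HALF_DOWN)

# Python's built-in round() sends halfway values to the nearest even integer.
builtin_88_5 = round(88.5)
builtin_89_5 = round(89.5)

print(rounded_up, rounded_down)
print(builtin_88_5, builtin_89_5)
```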

## Next Steps

In this lesson, our focus was on understanding variables, the structural parts of a dataset. We've seen that variables can be either quantitative or qualitative, and, depending on that, they can be measured on different scales.

We completed one more step in the workflow we follow throughout this course.

![image.png](attachment:39fa63a4-f7c5-4ec5-a91e-d63a09d66dcf.png)

Next in this course, we'll learn how to organize data in comprehensible forms to find patterns.