In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 

# Preview
Tables and graphs of frequency distributions are important points of departure when attempting to describe data. More precise summaries, such as averages, provide additional valuable information. Long-term investors in the stock market are able to ignore, with only an occasional sleepless night, daily fluctuations in their stocks by remembering that, on average, the annual growth rate of stocks during the past 50 years has exceeded by several percentage points that of more conservative investments in bonds. You might stop smoking because, on average, nonsmokers can expect to live longer than heavy smokers (by as much as 10 years, according to some researchers). You might strengthen your resolve to graduate from college upon hearing that, on average, the lifetime earnings of college graduates are almost double those of high school graduates.<br>
Averages consist of numbers (or words) about which the data are, in some sense, centered. They are often referred to as measures of central tendency, since the several types of average yield numbers or words that attempt to describe, most generally, the middle or typical value for a distribution. This chapter focuses on three different measures of central tendency: the mode, median, and mean. Each of these has its special uses, but the mean is the most important average in both descriptive and inferential statistics.

## Measures of Central Tendency
Numbers or words that attempt to describe, most generally, the middle or typical value for a distribution.

<div style="display: flex;">

<div style="flex: 1; padding-right: 10px;">
    
![image.png](attachment:525e6f12-1bf8-47ce-a489-510548ef03cc.png) 

</div>

<div style="flex: 1; padding-left: 10px;">

![image.png](attachment:160bebf3-21ac-4f7c-937f-95fcb810f9d3.png)
</div>

</div>



## 1. Mode
### The mode reflects the value of the most frequently occurring score.
Table 3.1 shows the number of years served by 20 recent U.S. presidents, beginning with Benjamin Harrison (4 years) and ending with Bill Clinton (8 years). Four years is the modal term, since the greatest number of presidents, 7, served this term. Note that the mode equals 4 years, the value of the most frequently occurring term, not 7, the frequency with which that term occurred.<br>
It is easy to assign a value to the mode. If the data are organized, as in Figure 3.1, a glance will often be enough. However, if the data are not organized, as in Table 3.1, some counting may be required. The mode is readily understood as the most prevalent or typical value.<br>
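
Counting frequencies by hand can be sketched in a few lines of Python. Table 3.1 is an image and is not transcribed here, so the `terms` list below is an assumption: an illustrative list chosen to agree with the figures quoted in the text (20 terms, starting with 4 and ending with 8, with 4 years occurring 7 times).

```python
from collections import Counter

# Assumed, illustrative terms (years served) for 20 presidents.
# Chosen to match the summary figures quoted in the text;
# not a transcription of Table 3.1.
terms = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]

# Count how often each term occurs, then take the most frequent one.
counts = Counter(terms)
modal_term, modal_frequency = counts.most_common(1)[0]

# The mode is the value (4), not the frequency with which it occurs (7).
print(f"mode = {modal_term} years (occurs {modal_frequency} times)")
```

Note that the code returns two numbers, which mirrors the distinction made above: the mode is the value, while the count is only its frequency.
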
## More Than One Mode
### Bimodal describes any distribution with two obvious peaks.
<b><u><i>Distributions can have more than one mode (or no mode at all). Distributions with two obvious peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions with more than two peaks are referred to as multimodal.</i></u></b><br>
The presence of more than one mode might reflect important differences among subsets of data. For instance, the distribution of weights for both male and female statistics students would most likely be bimodal, reflecting the combination of two separate weight distributions—a heavier one for males and a lighter one for females. Notice that even the distribution of presidential terms in Figure 3.1 tends to be bimodal, with a major peak at 4 years and a minor peak at 8 years, reflecting the two most typical terms of office.
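
A quick way to look for more than one peak is to print the whole frequency distribution rather than only the single most frequent value. The sketch below uses an assumed, illustrative `terms` list that agrees with the figures quoted in the text; it is not a transcription of Table 3.1.

```python
from collections import Counter

# Assumed, illustrative terms (years served) for 20 presidents;
# not a transcription of Table 3.1.
terms = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]

# Frequency distribution ordered by term length.
frequency = dict(sorted(Counter(terms).items()))

# A crude text histogram: one asterisk per president.
for value, count in frequency.items():
    print(f"{value:>2} | {'*' * count}")

# The two most frequent values, as (value, frequency) pairs.
top_two = Counter(terms).most_common(2)
print("two highest peaks:", top_two)
```

With this list the two highest counts fall at 4 and 8 years, which is the major peak and minor peak described above.
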

## 2. Median
### The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves. In other words, the median has a percentile rank of 50, since observations with equal or smaller values constitute 50 percent of the entire distribution.<br>
## Finding the Median
Table 3.2 shows how to find the median for two different sets of scores. The numbers in shaded squares cross-reference instructions in the top panel with examples in the bottom panel.<br>
![image.png](attachment:dac00d81-1d3b-4b83-82ca-a6c66bc3d5b2.png)<br>

To find the median, scores always must be ordered from least to most (or vice versa). This task is straightforward with small sets of data but becomes increasingly cumbersome with larger sets of data that must be ordered manually.<br>
When the total number of scores is odd, as in the lower left-hand panel of Table 3.2, there is a single middle-ranked score, and the value of the median equals the value of this score. When the total number of scores is even, as in the lower right-hand panel of Table 3.2, the value of the median equals a value midway between the values of the two middlemost scores. In either case, the value of the median always reflects the value of middle-ranked scores, not the position of these scores among the set of ordered scores.<br>
The median term can be found for the 20 presidents. First, rank the terms from longest (12 for Franklin Roosevelt) to shortest (2 for Harding and Kennedy), as shown in the left-hand column of Table 3.3. Then, following the instructions in Table 3.2, verify that the median term for the 20 presidents equals 4.5 years, since 4.5 is the value midway between the values (4 and 5) of the two middlemost (10th- and 11th-ranked) terms in Table 3.3.<br>
Notice that although the values for median and modal presidential terms are quite similar, they have different interpretations. The median term (4.5 years) describes the middle-ranked term; the modal term (4 years) describes the most frequent term in the distribution.<br>
![image.png](attachment:0e918535-5ad6-44d6-9f35-4f0bb968701b.png)
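
The procedure for finding the median can be sketched as a small function: order the scores, then take the middle score (odd count) or the value midway between the two middlemost scores (even count). The function name `median` and the `terms` list are assumptions made for illustration; the list is chosen to agree with the figures quoted in the text and is not a transcription of Table 3.1.

```python
import statistics


def median(scores):
    """Return the median of a list of numeric scores."""
    ordered = sorted(scores)          # scores must be ordered first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: single middle-ranked score
        return ordered[middle]
    # even count: value midway between the two middlemost scores
    return (ordered[middle - 1] + ordered[middle]) / 2


# Assumed, illustrative terms (years served) for 20 presidents.
terms = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]

median_term = median(terms)
print("median term:", median_term)

# The standard library gives the same answer.
print("statistics.median:", statistics.median(terms))
```

In practice `statistics.median` (or `numpy.median`) would normally be used directly; the hand-written version is only there to make the odd and even cases explicit.
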

## 3. Mean
The mean is the most common average, one you have doubtless calculated many times.
### The mean is found by adding all scores and then dividing by the number of scores.
$$\text{Mean} = \frac{\text{sum of all scores}}{\text{number of scores}}$$
To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .+ 4 + 8) to obtain a sum of 112 years, and then divide this sum by 20, the number of presidents, to obtain a mean of 5.60 years.<br>
There is no requirement that presidential terms be ranked before calculating the mean. Even when large sets of unorganized data are involved, the calculation of the mean is usually straightforward, particularly with the aid of a calculator or computer.<br>
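
The same arithmetic takes a couple of lines of Python. The `terms` list is an assumption: an illustrative, unordered list that agrees with the sum of 112 and the count of 20 quoted above, not a transcription of Table 3.1.

```python
# Assumed, illustrative terms (years served) for 20 presidents;
# deliberately left unordered, since the mean does not require ranking.
terms = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]

total = sum(terms)        # sum of all scores
n = len(terms)            # number of scores
mean_term = total / n     # mean = sum / number of scores

print(f"sum = {total}, n = {n}, mean = {mean_term:.2f}")
```
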
### Sample or Population?
Statisticians distinguish between two types of means: <b><u><i>the population mean and the sample mean, depending on whether the data are viewed as a population (a complete set of scores) or as a sample (a subset of scores).</i></u></b> For example, if the terms of the 20 U.S. presidents are viewed as a population, then 5.60 years qualifies as a population mean. On the other hand, if the terms of the 20 U.S. presidents are viewed as a sample from the terms of all U.S. presidents, then 5.60 years qualifies as a sample mean. Not only is the present distinction entirely a matter of perspective, but it also produces exactly the same numerical value of 5.60 for both means. This distinction is introduced here because of its importance in later chapters, where the population mean usually is unknown but fixed as a constant, while the sample mean is known but varies from sample to sample. <b><u><i>Until then, unless noted otherwise, you can assume that we are dealing with the sample mean.</i></u></b>
### Sample Mean ($ \overline{X} $)
The balance point for a sample, found by dividing the sum of the values of all scores in the sample by the number of scores in the sample.
$$ \overline{X} = \frac{\Sigma X}{n}$$
### Sample Size (n)
The total number of scores in the sample.
### Population Mean ($\mu$)
The balance point for a population, found by dividing the sum of all scores in the population by the number of scores in the population.
$$ \mu = \frac{\Sigma X}{N}$$
### Population Size (N)
The total number of scores in the population.

## Mean as Balance Point
### The mean serves as the balance point for its frequency distribution.
Imagine that the histogram for the terms of the 20 presidents in Figure 3.1 has been constructed out of some rigid material such as wood. Also imagine that, while using only one finger placed under its base, you wish to lift the histogram without disturbing its horizontal balance. To accomplish this, your finger should be at 5.60, the value of the mean, shown as a dot in Figure 3.1. If your finger were to the right of this point, the entire histogram would seesaw down to the left; if your finger were to the left of this point, the histogram would seesaw down to the right.<br>
The mean serves as the balance point for its distribution because of a special property: <b><u><i>The sum of all scores, expressed as positive and negative deviations from the mean, always equals zero.</i></u></b> In the right-hand column of Table 3.3, each presidential term reappears as a deviation from the mean term, obtained by taking each term (including duplications) one at a time and subtracting the mean. Terms above the mean of 5.60 reappear as positive deviations (for example, 12 reappears as a positive deviation of 6.40 from the mean, since 12 − 5.60 = 6.40). Terms below the mean of 5.60 reappear as negative deviations (for example, 2 reappears as a negative deviation of –3.60 from the mean, since 2 − 5.60 = –3.60). As suggested in Table 3.3, when the sum of all positive deviations, 21.6, is combined with the sum of all negative deviations, –21.6, the resulting sum equals zero.<br>
In its role as balance point, the mean describes the single point of equilibrium at which, once all scores have been expressed as deviations from the mean, those above the mean counterbalance those below the mean. You can appreciate, therefore, why a change in the value of a single score produces a change in the value of the mean for the entire distribution. <b><u><i>The mean reflects the values of all scores, not just those that are middle ranked (as with the median), or those that occur most frequently (as with the mode).</i></u></b>
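
The zero-sum property of deviations can be checked numerically. The sketch below uses an assumed, illustrative `terms` list that agrees with the figures quoted in the text (it is not a transcription of Table 3.1). Because floating-point subtraction is not exact, the sums are rounded before being displayed.

```python
# Assumed, illustrative terms (years served) for 20 presidents.
terms = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]

mean_term = sum(terms) / len(terms)

# Express every score as a deviation from the mean.
deviations = [term - mean_term for term in terms]

positive_sum = sum(d for d in deviations if d > 0)   # terms above the mean
negative_sum = sum(d for d in deviations if d < 0)   # terms below the mean

print("sum of positive deviations:", round(positive_sum, 2))
print("sum of negative deviations:", round(negative_sum, 2))
print("sum of all deviations:", round(sum(deviations), 10))
```

Changing any single value in `terms` shifts `mean_term`, and with it every deviation, which is why the mean reflects all scores.
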

![image.png](attachment:4e5c3c46-6792-42e1-ac23-b3423337e683.png)

# WHICH AVERAGE?
## If Distribution Is Not Skewed
### When a distribution of scores is not too skewed, the values of the mode, median, and mean are similar, and any of them can be used to describe the central tendency of the distribution.
This tends to be the case in Figure 3.1, where the mode of 4 describes the typical term; the median of 4.5 describes the middle-ranked term; and the mean of 5.60 describes the balance point for terms. The slightly larger mean term is caused by a shift upward in the balance point to compensate for the large positive deviation of 6.40 years for Roosevelt’s lengthy 12-year term.
## If Distribution Is Skewed
When extreme scores cause a distribution to be skewed, as for the infant death rates for selected countries listed in Table 3.4, the values of the three averages can differ appreciably. The modal infant death rate of 4 describes the most typical rate (since it occurs most frequently, five times, in Table 3.4). The median infant death rate of 7 describes the middle-ranked rate (since the United States, with a death rate of 7, occupies the middle-ranked, or 10th, position among the 19 ranked countries). Finally, the mean infant death rate of 30.00 describes the balance point for all rates (since the sum of all rates, 570, divided by the number of countries, 19, equals 30.00).<br>
Unlike the mode and median, the mean is very sensitive to extreme scores, or outliers. Any extreme score, such as the high infant death rate of 182 for Sierra Leone in Table 3.4, contributes directly to the calculation of the mean and, with arithmetic inevitability, sways the value of the mean—the balance point for the entire distribution—in its direction. In extreme cases, the mean describes the central tendency of a distribution only in the more abstract sense of being the balance point of the distribution.<br>
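
The pull of a single outlier on the mean can be seen with a small experiment. Table 3.4 is not reproduced here, so the `rates` list below is hypothetical; only the extreme value of 182 is taken from the text.

```python
import statistics

# Hypothetical infant death rates for a handful of countries (not Table 3.4).
rates = [3, 4, 4, 5, 6, 7, 9]

# The same rates with one extreme score added (182, the value quoted for Sierra Leone).
with_outlier = rates + [182]

for label, data in [("without outlier", rates), ("with outlier", with_outlier)]:
    print(
        f"{label:>16}: "
        f"mean = {statistics.mean(data):.2f}, "
        f"median = {statistics.median(data)}"
    )
```

In this hypothetical sample the median barely moves when the extreme score is added, while the mean is dragged far toward it.
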
## Interpreting Differences between Mean and Median
### Ideally, when a distribution is skewed, report both the mean and the median.
Appreciable differences between the values of the mean and median signal the presence of a skewed distribution. If the mean exceeds the median, as it does for the infant death rates, the underlying distribution is positively skewed because of one or more scores with relatively large values, such as the very high infant death rates for a number of countries, especially Sierra Leone. On the other hand, if the median exceeds the mean, the underlying distribution is negatively skewed because of one or more scores with relatively small values. Figure 3.2 summarizes the relationship between the various averages and the two types of skewed distributions (shown as smoothed curves).<br>
![image.png](attachment:52ad8a05-4a6a-4dce-bb57-cfff9ff08432.png)
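
The rule of thumb above can be written as a tiny helper. The function name `skew_direction` is hypothetical, and the helper only reports the direction suggested by the two averages; deciding whether the difference is appreciable remains a judgment call.

```python
def skew_direction(mean, median):
    """Return the direction of skew suggested by comparing mean and median."""
    if mean > median:
        return "positive"   # a few relatively large scores pull the mean up
    if mean < median:
        return "negative"   # a few relatively small scores pull the mean down
    return "none indicated"


# Infant death rates quoted in the text: mean 30.00, median 7.
print(skew_direction(30.00, 7))
```
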

## Using The Word Average
### Strictly speaking, an average can refer to the mode, median, or mean—or even to some more exotic average, such as the geometric mean or the harmonic mean.
Conventional usage prescribes that average usually signifies mean, and this connotation is often reinforced by the context. For instance, grade point average is virtually synonymous with mean grade point. To our knowledge, even the most enterprising grade-point-impoverished student has never attempted to satisfy graduation requirements by exchanging a more favorable modal or median grade point for the customary mean grade point. Unless context and usage make it clear, however, it’s a good policy to specify the particular average being used, even if it requires a short explanation. When dealing with controversial topics, it is always wise to insist that the exact type of the average be identified.

# AVERAGES FOR QUALITATIVE AND RANKED DATA

## Mode Always Appropriate for Qualitative Data
So far, we have been talking about quantitative data for which, in principle, all three averages can be used. But when the data are qualitative, your choice among averages is restricted. <b><u><i>The mode always can be used with qualitative data.</i></u></b>
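
Finding the mode of qualitative data needs only counting, so it works on labels just as it does on numbers. The `responses` list below is hypothetical and stands in for any set of unordered categories.

```python
from collections import Counter

# Hypothetical qualitative (nominal) observations.
responses = ["German", "Irish", "German", "English", "Italian", "German", "Irish"]

# The modal category is simply the label that occurs most often.
modal_category, modal_frequency = Counter(responses).most_common(1)[0]

print(f"modal category: {modal_category} ({modal_frequency} of {len(responses)})")
```
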
## Medians Sometimes Appropriate
<b><u><i>The median can be used whenever it is possible to order qualitative data from least to most because the level of measurement is ordinal.</i></u></b><br>
It’s easiest to determine the median class for ordered qualitative data by using relative frequencies, as in Table 3.5. (Otherwise, first convert regular frequencies to relative frequencies.) Cumulate the relative frequencies, working up from the bottom of the distribution, until the cumulative percentage first equals or exceeds 50 percent. Since the corresponding class includes the median and, roughly speaking, splits the distribution into an upper and a lower half, it is designated as the median or middle-ranked class. For instance, the qualitative data in Table 3.5 can be ordered from lieutenant to general. Starting at the bottom of Table 3.5 and cumulating upward, we have a percent of 25.5 for the class of lieutenant and a cumulative percent of 62.5 for the class of captain. Accordingly, since it includes the cumulative percent of 50, captain is the median rank of officers in the U.S. Army.<br>
One word of caution when you are finding the median for ordered qualitative data. Avoid a common error that identifies the median simply with the middle or two middle-most classes, such as “between captain and major,” without regard to the cumulative relative frequencies and the location of the 50th percentile. In other words, do not treat the various classes as though they have the same frequencies when they actually have different frequencies.<br>
![image.png](attachment:63abbd09-3fef-496a-bca2-1c269fb64988.png)<br>
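
The cumulating procedure can be sketched as a short function. Table 3.5 is an image, so only two figures below come from the text: 25.5 percent for lieutenant and a cumulative 62.5 percent at captain (implying 37.0 percent for captain). The remaining class names and percentages are placeholders invented so that the list sums to 100, and the function name `median_class` is hypothetical.

```python
# Classes ordered from lowest to highest rank, with relative frequencies in percent.
rank_percents = [
    ("lieutenant", 25.5),          # quoted in the text
    ("captain", 37.0),             # implied by the text: 62.5 - 25.5
    ("major", 19.0),               # placeholder value
    ("lieutenant colonel", 12.0),  # placeholder class and value
    ("colonel", 6.0),              # placeholder class and value
    ("general", 0.5),              # placeholder value
]


def median_class(ordered_classes):
    """Cumulate percentages from the bottom until 50 percent is reached."""
    cumulative = 0.0
    for label, percent in ordered_classes:
        cumulative += percent
        if cumulative >= 50:       # first class to equal or exceed 50 percent
            return label, cumulative
    raise ValueError("relative frequencies sum to less than 50 percent")


label, cumulative = median_class(rank_percents)
print(f"median class: {label} (cumulative {cumulative} percent)")
```

Because the loop stops at the first class that reaches 50 percent, the placeholder values above captain do not affect the result. The function also avoids the error of simply picking the middle of the class list, since it weighs each class by its relative frequency.
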
## Inappropriate Averages
It would not be appropriate to report a median for unordered qualitative data with nominal measurement, such as the ancestries of Americans. Nor would it be appropriate to report a mean for any qualitative data, such as the ranks of officers in the U.S. Army. After all, words cannot be added and then divided, as required by the formula for the mean.<br>
## NOTE: The mean cannot be used with qualitative data.

## Averages for Ranked Data
When the data consist of a series of ranks, with their ordinal level of measurement, the median rank can always be obtained. It is simply the middlemost rank or the average of the two middlemost ranks.
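
As a minimal illustration, the median rank for two hypothetical sets of ranks (one with an odd count, one with an even count) can be obtained with the standard library:

```python
import statistics

# Hypothetical ranks, for illustration only.
odd_ranks = [1, 2, 3, 4, 5]        # middlemost rank is 3
even_ranks = [1, 2, 3, 4, 5, 6]    # two middlemost ranks are 3 and 4

median_odd = statistics.median(odd_ranks)
median_even = statistics.median(even_ranks)

print("median rank (odd count):", median_odd)
print("median rank (even count):", median_even)
```
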