# Numerical Summary
Sometimes, data can be large and difficult to understand. A numerical summary condenses such data into a single value that is easy to interpret. There are two types of numerical summaries:
* Measures of Central Tendency
* Measures of Dispersion

This notebook explores Measures of Central Tendency.

## 1. Measures of Central Tendency
Given a large set of data, a measure of central tendency estimates the middle ground of the data as a way of summarizing it. There are three major measures of central tendency:
- The Mean
- The Median
- The Mode

### i. The Mean
The mean is simply the average of the data. Strictly speaking, this is a type of mean called the **arithmetic mean**. There are other types of mean as well, such as the weighted mean and the geometric mean. Our main focus will be the arithmetic mean.

Given a set of observations, the arithmetic mean is derived by dividing the sum of the observations by the number of observations (N).
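In symbols, writing the observations as $x_1, x_2, \dots, x_N$ (notation introduced here for illustration; it does not appear elsewhere in this notebook), the definition above reads:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$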

Observe:

In [3]:
# The following data shows the annual salaries of 7 employees in two companies:
company_a_salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
company_b_salaries = [34900, 27500, 31600, 39700, 35300, 33800, 31700]

# Find the mean salary in both companies
def get_mean(data: list):
    
    """
    When called, this function gets the mean of a list of numbers.
    The mean is obtained by dividing the sum of the data by the 
    number of datapoints (number of observations).
    """
    
    mean = (sum(data))/len(data)
    return mean

company_a_mean_salary = get_mean(company_a_salaries)
company_b_mean_salary = get_mean(company_b_salaries)

print(f"The average salary of Company A employees is ${company_a_mean_salary} and that of Company B employees is ${company_b_mean_salary}")

The average salary of Company A employees is $33500.0 and that of Company B employees is $33500.0


The arithmetic means of both data sets turned out to be identical. 

Above is an example of a population mean, that is, the mean of all the observations. Sometimes, when the population is too large to measure in full, we may opt to calculate the mean of a sample (a subset of the population) instead. The sample mean is calculated in the same way as the population mean.
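As a rough illustration of the sample mean, the sketch below treats Company A's seven salaries as the whole population and draws a small random sample from it. The sample size of 4 and the seed of 42 are arbitrary assumptions made for this example only.

```python
import random

# Treat Company A's seven salaries as the whole population
population = [34500, 30700, 32900, 36000, 34100, 33800, 32500]
population_mean = sum(population) / len(population)

# Draw a sample of 4 salaries without replacement.
# The seed (42) and the sample size (4) are arbitrary choices for illustration.
rng = random.Random(42)
sample = rng.sample(population, 4)

# The sample mean uses exactly the same formula as the population mean
sample_mean = sum(sample) / len(sample)

print(f"Population mean: {population_mean}")
print(f"Sample: {sample}")
print(f"Sample mean: {sample_mean}")
```

The sample mean will usually differ a little from the population mean, since it only sees part of the data; it serves as an estimate of it.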

### ii. The Median
The median is simply the middle observation when a set of observations is arranged in ascending order.
When calculating the median, one may run into one of these two cases:
1. The number of observations (N) is odd.
2. The number of observations (N) is even.
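Before writing our own function, the two cases can be checked on small made-up lists with `statistics.median` from Python's standard library. The lists `odd_data` and `even_data` are hypothetical values chosen for this sketch.

```python
import statistics

odd_data = [7, 3, 9, 1, 5]       # N = 5 (odd)
even_data = [7, 3, 9, 1, 5, 11]  # N = 6 (even)

# Odd N: the sorted data is [1, 3, 5, 7, 9], so the middle (3rd) value is the median
odd_median = statistics.median(odd_data)
print(odd_median)   # 5

# Even N: the sorted data is [1, 3, 5, 7, 9, 11], so the median is
# the mean of the 3rd and 4th values: (5 + 7) / 2
even_median = statistics.median(even_data)
print(even_median)  # 6.0
```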

In [19]:
# Find the median of both companies:
def get_median(data: list):
    """
    Returns the median of a list of numbers: the middle value once
    the data is arranged in ascending order.
    """
    ordered = sorted(data) # A sorted copy, so the caller's list is left unchanged
    n = len(ordered)

    if n % 2 != 0:
        # Odd N: take the {(n+1)/2}th value.
        # Subtract 1 because indices start from 0, so the 4th item sits at index 3.
        median = ordered[(n + 1) // 2 - 1]

    else:
        # Even N: take the mean of the {n/2}th value and the {(n+2)/2}th value.
        # Again, subtract 1 from each position because of zero indexing.
        median = (ordered[n // 2 - 1] + ordered[(n + 2) // 2 - 1]) / 2

    return median

company_a_median_salary = get_median(company_a_salaries)
company_b_median_salary = get_median(company_b_salaries)

print(f"The median salary of Company A is {company_a_median_salary} and that of Company B is {company_b_median_salary}")

The median salary of Company A is 33800 and that of Company B is 33800


Coincidentally, just like the arithmetic means, the medians of these data sets are identical.

The if-else statement checks whether N is even or odd and calculates the median accordingly. Both our data sets had the same N, 7, which is odd. What if we had a data set where N is even?

In [20]:
'''
A sample of 8 US corporations showed the following percentage changes in earnings per share in the current year
compared with the previous year. Find the mean and the median of the percentage change in earnings per share.
'''

# percentage changes in earnings per share
percentage_changes = ['13.6%', '25.5%', '43.6%', '-19.8%', '-13.8%', '12.0%', '36.3%', '14.3%']

# Strip the % sign and convert each string to a number so the mean and the median can be computed
changes = [] # the percentage changes as plain numbers

for change in percentage_changes:
    remove_percentage = change[:-1]
    change_to_number = float(remove_percentage)
    changes.append(change_to_number)

# Compute the Mean
sample_mean = get_mean(changes)

# Compute the Median
median_change = get_median(changes)

print(f"The mean percentage change in earnings per share is {sample_mean}% while the median percentage change in earnings per share is {median_change}%")

The mean percentage change in earnings per share is 13.9625% while the median percentage change in earnings per share is 13.95%


This time, the mean and the median aren't identical, but they sure are close. 

While the mean is the most popular measure of central tendency, there are times when it is more appropriate to use the median, for example when there are **outliers**. Outliers are extreme values, that is, values that are abnormally low or high compared with the rest of the data. Because every observation feeds into the sum, a single outlier can pull the mean noticeably toward it. The median depends only on the middle of the ordered data, so it is barely affected by outliers.
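To see the effect of an outlier, the sketch below adds one very large salary to Company A's data and compares the mean and the median before and after. The extra salary of 250000 is a made-up value for illustration only; it is not part of the original data.

```python
import statistics

salaries = [34500, 30700, 32900, 36000, 34100, 33800, 32500]

# Add one hypothetical outlier (an invented value, for illustration only)
with_outlier = salaries + [250000]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"Without outlier: mean = {mean_before}, median = {median_before}")
print(f"With outlier:    mean = {mean_after}, median = {median_after}")
```

The mean jumps from 33500 to 60562.5, while the median only moves from 33800 to 33950.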

### iii. The Mode
The mode is simply the observation that occurs most often.

Observe:

In [21]:
"""
A radio manufacturer decided to test radio models for defects. 
He selected 5 radio models and tested 20 radios from each model.
These were the results: 
"""

results = {
    "Model A": 3,
    "Model B": 2,
    "Model C": 15,
    "Model D": 0,
    "Model E": 2
    }

# Find the highest number of defects recorded for any one model
def get_mode_dict(data: dict):

    # The dictionary is a frequency table (model -> number of defective radios),
    # so the mode is the model with the highest count.
    return max(data.values()) # Returns the highest count in the table

modal_number_of_defects = get_mode_dict(results)
print(modal_number_of_defects)
        

15


The highest number of defects is 15. And the modal model, the one that displayed these defects, is...

In [22]:
for key, value in results.items():
    if value == modal_number_of_defects:
        print(key)


Model C


Model C! This was the most defective model with 15 out of the 20 radios having defects!
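The two steps above (finding the highest count, then looking up its model) can also be combined. A minimal sketch, assuming the same `results` dictionary, uses the built-in `max` with `key=results.get`; the name `modal_model` is introduced here for illustration.

```python
results = {
    "Model A": 3,
    "Model B": 2,
    "Model C": 15,
    "Model D": 0,
    "Model E": 2
    }

# max() iterates over the keys and compares them by their defect counts
modal_model = max(results, key=results.get)

print(modal_model, results[modal_model])  # Model C 15
```

Note that if two models tied for the highest count, `max` would return only the first one it encountered.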

Data may not always present itself as a dictionary (or a JSON file). Sometimes, it may just be a simple list. How do we deal with that?

In [33]:
# Getting the mode of a random list

random_list = [
    2, 4, 2, 5, 3, 2, 4, 6, 2, 6, 2, 5, 4, 2, 4, 5, 8, 2, 8, 6, 3, 4, 4, 3, 2, 7, 7, 5, 1, 6, 8, 6, 7, 7, 8, 4, 7, 4
    ]

def get_mode(data: list):
    list_of_modes = [] # A list is used in case more than one element occurs the highest number of times
    mode_count = 0 # The highest number of occurrences found so far

    for i in data:
        count = data.count(i) # The number of times this element occurs

        if count > mode_count:
            # A new highest count was found, so reset the list to hold only this element
            mode_count = count
            list_of_modes = [i]

        elif count == mode_count and i not in list_of_modes:
            # Another element occurs just as often as the current mode (repeats are skipped)
            list_of_modes.append(i)

    return list_of_modes

print(get_mode(random_list))

[2, 4]
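As a cross-check, the standard library offers similar tools. The sketch below uses `collections.Counter` to tally the occurrences and `statistics.multimode` (available from Python 3.8) to list every mode. It is an alternative shown for comparison, not a replacement for the function above.

```python
import statistics
from collections import Counter

random_list = [
    2, 4, 2, 5, 3, 2, 4, 6, 2, 6, 2, 5, 4, 2, 4, 5, 8, 2, 8, 6, 3, 4, 4, 3, 2, 7, 7, 5, 1, 6, 8, 6, 7, 7, 8, 4, 7, 4
    ]

# Tally how many times each value occurs
counts = Counter(random_list)
top_two = counts.most_common(2)
print(top_two)  # [(2, 8), (4, 8)]

# multimode returns all values that share the highest count,
# in the order they first appear in the data
modes = statistics.multimode(random_list)
print(modes)  # [2, 4]
```

Both 2 and 4 occur 8 times, which is why the list has two modes.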


The mode is less popular than the mean and the median, especially among economists and business people. However, there are still scenarios where it comes in handy, such as determining which product out of a range of products performed the best.