# Central Tendency Measures

This notebook is intended to provide a brief introduction to the central tendency measures (mean, mode, median).

Central tendency measures summarize a whole set of data points in a single value. They indicate the center around which those points are located, that center being the central tendency measure itself.

## Mean

### Definition
The mean (also called the expected value or average) is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values.

- Sample mean: $\bar {x}$ (Mean of sample data taken from the whole population). <br>
- Population mean: $\mu$ (Mean of the whole population).
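
As a minimal sketch of the difference between the two, the cell below builds a small hypothetical population and draws a random sample from it. The population values, the sample size of 50 and the seed are assumptions made only for illustration; they are not part of the data set used later in this notebook.

```python
import random
from statistics import mean

# Hypothetical population: the integers 0..10, each repeated 100 times
population = [value for value in range(11) for _ in range(100)]
mu = mean(population)  # population mean

# Draw 50 points without replacement (seed fixed so the run is reproducible)
rng = random.Random(0)
sample = rng.sample(population, 50)
x_bar = mean(sample)  # sample mean, an estimate of mu

print(f"Population mean: {mu}")
print(f"Sample mean:     {x_bar}")
```

The sample mean usually differs slightly from the population mean, and it tends to get closer as the sample grows.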

### Formula / Procedure to find it

The mean is computed as follows: $\displaystyle \mu = \frac{\sum_{i=1}^{n} x_{i}}{n}$ <br>
where $x_{i}$: Data points <br>
&emsp;&emsp;&ensp; $n$: Number of data points
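
The formula can be applied by hand in a few lines. This is only a sketch on a hypothetical five-point sample (not the notebook's data set); `np.mean` gives the same result.

```python
# Hypothetical sample, chosen only for illustration
sample = [2, 4, 4, 5, 10]

# Sum of the values divided by the number of values
n = len(sample)
total = sum(sample)
mean_value = total / n

print(f"n = {n}, sum = {total}, mean = {mean_value}")
```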

### Visualization

In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Creating the data set

In [None]:
data_set = [5,  2,  6, 10,  9,  3,  1,  9,  6,  2,  1,  4,  6,  5,  0,  4,  6,
            5,  4,  6,  3,  0,  2,  8,  6,  7,  8,  0,  4,  7,  0,  1,  0,  8,
            2,  0, 10,  2,  6,  6,  0,  5,  2,  0, 10,  3,  9,  8,  4,  7,  6,
            1, 10,  7, 10,  3,  0,  6,  5,  8,  4,  3,  7,  3,  1,  5,  3,  0,
            3,  3,  1,  2,  1,  5,  0,  5,  8,  1, 10,  7,  8,  6,  9,  3,  3,
            7,  3,  4, 10,  8,  2,  0,  0,  2,  9,  0,  5,  5,  5,  6]

The mean can be visualized in different ways: distribution plots, two-dimensional plots, one-dimensional plots, ... <br>
All of them show a value that represents the center around which the other values in the distribution are spread.

In [None]:
fig = go.Figure(data=go.Scatter(y=data_set, mode='markers+lines'))
fig.update_layout(height=600, width=1000, title_text=f"Distribution of data set")
fig.show()

In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(
    go.Scatter(y=data_set,
               name="2D data_set"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=[0, 99], y=[np.mean(data_set), np.mean(data_set)],
               name="2D mean"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=data_set, y=np.zeros(len(data_set)),
               mode='markers',
               name="1D data_set"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=[np.mean(data_set)], y=[0],
               mode='markers',
               name="1D mean"),
    row=2, col=1
)

fig.update_layout(height=600, width=1000, title_text=f"Distribution of data set - Mean: {np.mean(data_set)}")
fig.show()

Note: Good representations of the MEAN are 1D, 2D and distribution plots.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(16,8))
plt.subplot(2,1,1)
sns.distplot(data_set)
plt.vlines(x=np.mean(data_set), ymin=0, ymax=0.12, colors='r')
plt.ylim([0,0.12])
plt.legend(["Mean", "Distribution"])
plt.title(f"Distribution of data set - Mean: {np.mean(data_set)}")

plt.subplot(2,1,2)
plt.boxplot(data_set, 'h', vert=False)
plt.vlines(x=np.mean(data_set), ymin=0.9, ymax=1.1, colors='r')
plt.ylim([0.9,1.1])
plt.title(f"Boxplot of data set - Mean: {np.mean(data_set)}")
plt.show()

## Median

### Definition
The median is the value lying at the midpoint of a frequency distribution of observed values, such that an observation is equally likely to fall above or below it.

### Formula / Procedure to find it

To find the median:<br>
- Arrange the data points from smallest to largest.
- If the number of data points is odd, the median is the middle data point in the list.
- If the number of data points is even, the median is the average of the two middle data points in the list.
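
The steps above can be sketched as a small function. `find_median` is a hypothetical helper name used only for illustration; in practice `np.median` does the same job.

```python
def find_median(values):
    """Return the median of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("find_median() needs at least one data point")

    # Step 1: arrange the data points from smallest to largest
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2

    # Step 2: odd count, so the median is the middle data point
    if n % 2 == 1:
        return ordered[middle]

    # Step 3: even count, so the median is the average of the two middle points
    return (ordered[middle - 1] + ordered[middle]) / 2


print(find_median([7, 1, 3]))     # odd number of points
print(find_median([7, 1, 3, 5]))  # even number of points
```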

### Creating the data set

In [None]:
data_set_median = np.sort(data_set)

Following the procedure to find the median, we sorted the data set.<br>
In this case the number of data points is even: 100 <br>
So we take the two middle data points (zero-based indices 49 and 50) and average them to get the median.

In [None]:
print(f"Index 49: {data_set_median[49]}\nIndex 50: {data_set_median[50]}\nAverage: {np.mean(data_set_median[49:51])}")

In [None]:
fig = go.Figure(data=go.Scatter(y=data_set_median, mode='markers+lines'))
fig.update_layout(height=600, width=1000, title_text=f"Distribution of data set")
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=1)

fig.add_trace(
    go.Scatter(y=data_set_median,
               name="data_set"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=[0, 99], y=[np.median(data_set_median), np.median(data_set_median)],
               name="median"),
    row=1, col=1
)

fig.update_layout(height=600, width=1000, title_text=f"Distribution of data set - Median: {np.median(data_set_median)}")
fig.show()

In [None]:
plt.figure(figsize=(16,8))
plt.subplot(2,1,1)
sns.distplot(data_set_median)
plt.vlines(x=np.median(data_set_median), ymin=0, ymax=0.12, colors='r')
plt.ylim([0,0.12])
plt.legend(["Median", "Distribution"])
plt.title(f"Distribution of data set - Median: {np.median(data_set_median)}")

plt.subplot(2,1,2)
plt.boxplot(data_set_median, 'h', vert=False)
plt.vlines(x=np.median(data_set_median), ymin=0.9, ymax=1.1, colors='r')
plt.ylim([0.9,1.1])
plt.title(f"Boxplot of data set - Median: {np.median(data_set_median)}")
plt.show()

Note: Boxplots are a good way to observe the MEDIAN.

## Mode

### Definition
The mode is the most commonly occurring data point in a dataset. The mode is useful when there are a lot of repeated values in a dataset.

### Formula / Procedure to find it

To find the mode you just have to:
- Determine the unique values in a data set.
- Count the number of occurrences of each unique value in the data set.
- Pick the value with the highest count.
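
The procedure above can be sketched with `collections.Counter`. `find_modes` is a hypothetical helper name used only for illustration; it returns a list because several values can share the highest count.

```python
from collections import Counter


def find_modes(values):
    """Return every value that shares the highest count, in ascending order."""
    if len(values) == 0:
        raise ValueError("find_modes() needs at least one data point")

    # Unique values and the number of occurrences of each one
    counts = Counter(values)

    # Keep the value(s) whose count equals the highest count
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(find_modes([1, 2, 2, 3, 3, 3]))  # a single mode
print(find_modes([1, 1, 2, 2, 3]))     # two values tie for the highest count
```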

We will be using the first data set (the one used in the MEAN section).<br>
We will be using scipy and collections, as numpy does not provide a dedicated function to get the mode.

In [None]:
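from scipy import stats

mode_result = stats.mode(data_set)
# Depending on the SciPy version, `mode_result.mode` is a scalar or a
# one-element array; np.atleast_1d handles both cases.
mode = np.atleast_1d(mode_result.mode)[0]
print(f"{mode_result}\nMode: {mode}")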

In [None]:
from collections import Counter

mode_counts = Counter(data_set)
mode_counts

In [None]:
plt.figure(figsize=(16,8))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=list(mode_counts.keys()), y=list(mode_counts.values()))
plt.title(f"Counts of unique values - Mode: {mode}")
plt.show()

Note: Bar plots make it easier to observe the MODE.

In [None]:
plt.figure(figsize=(16,8))
plt.subplot(2,1,1)
sns.distplot(data_set_median)
plt.vlines(x=mode, ymin=0, ymax=0.12, colors='r')
plt.ylim([0,0.12])
plt.legend(["Mode", "Distribution"])
plt.title(f"Distribution of data set - Mode: {mode}")

plt.subplot(2,1,2)
plt.boxplot(data_set_median, 'h', vert=False)
plt.vlines(x=mode, ymin=0.9, ymax=1.1, colors='r')
plt.ylim([0.9,1.1])
plt.title(f"Boxplot of data set - Mode: {mode}")
plt.show()

## Exercise

You can test what you have learned about central tendency measures with the following quiz:

In [None]:
class test:
    def __init__(self):
        self.questions = list()
        self.answers = list()
        self.correct_answers = 0
        self.score = 0

    def add_element(self, q, a):
        self.questions.append(q)
        self.answers.append(a)

    def remove_element(self, index):
        self.questions.pop(index)
        self.answers.pop(index)
        
    def show_answer(self, index):
        print(f"Q{index}: {self.questions[index-1]} - Ans_{index}: {self.answers[index-1]}")
    
    def show_answers(self):
        for index, (q, a) in enumerate(zip(self.questions, self.answers)):
            print(f"Q{index+1}: {q} - Ans_{index+1}: {a}")
    
    def build_from_csv(self, filename):
        df = pd.read_csv(filename)
        for index in range(df.shape[0]):
            self.add_element(df['Questions'][index], df['Answers'][index])
    
    def visualize_score(self):
        fig = go.Figure(data=[go.Pie(labels=["Correct", "Incorrect"],
                                     values=[self.score, 100-self.score],
                                     marker_colors=['rgb(10,100,10)', 'rgb(230,70,70)'],
                                     hole=.3)])
        fig.show()

    def test(self):
        self.correct_answers = 0
        for index, (q, a) in enumerate(zip(self.questions, self.answers)):
            current_answer = None
            # Keep asking until the input can be parsed as a number
            while current_answer is None:
                raw_answer = input(f"Q{index+1}: " + q)
                try:
                    current_answer = float(raw_answer)
                except ValueError:
                    print("Please enter a numeric answer.")
            if current_answer == a:
                self.correct_answers += 1
                print("Correct")
            else:
                print("Incorrect")
        self.score = 100*self.correct_answers/len(self.questions)

        print(f"Your score: {self.score}")
        self.visualize_score()
        # Return the score so that `score = exam.test()` holds a value
        return self.score

In [None]:
exam = test()
exam.build_from_csv("https://raw.githubusercontent.com/Ricardo-DG/data_analytics_training/main/central_tendency_test.csv")

In [None]:
# If you would like to see the answers uncomment and run the following line

# exam.show_answers()

In [None]:
# If you would like to see a specific answer uncomment and run the following line
# (make sure to replace "index" with the number of the question you want to know the answer).

# exam.show_answer(index)

In [None]:
score = exam.test()