## Measures of Central Tendency
Measures of central tendency describe the center of the data, and are often represented by the mean, the median, and the mode.

### Mean
The mean represents the arithmetic average of the data. The cells below load the data, inspect it, and print the mean of the numerical variables. From the output, we can infer that the average age of the patients is about 55 years and the average temperature is about 37.8. Adding the argument axis = 0 gives the same output, because column-wise calculation is the default.

In [1]:
import pandas as pd
import numpy as np

# read in the data
df = pd.read_excel("Comorbid.xlsx")  

In [18]:
print(df.shape)

(597, 7)


In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 597 entries, 0 to 596
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Age               597 non-null    int64  
 1   Grade             597 non-null    object 
 2   Logistics         597 non-null    object 
 3   Temperature       597 non-null    float64
 4   Pneumonia         597 non-null    object 
 5   Hypertension      597 non-null    object 
 6   DiabetesMellitus  597 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 32.8+ KB
None


In [5]:
# numeric_only=True limits the calculation to the numeric columns (Age, Temperature)
df.mean(numeric_only=True)

Age            55.103853
Temperature    37.825796
dtype: float64

In [6]:
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Temperature'].mean())

55.10385259631491
37.82579564489117
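
As a quick check of what .mean() computes, the sketch below builds a small made-up DataFrame and compares the pandas result with the sum divided by the count. The values and the name `toy` are assumptions for illustration, not rows from Comorbid.xlsx. Passing numeric_only=True keeps the text column out of the calculation.

```python
import pandas as pd

# Hypothetical toy data; these values are NOT taken from Comorbid.xlsx.
toy = pd.DataFrame({
    "Age": [30, 45, 60, 85],
    "Temperature": [36.5, 37.0, 38.5, 39.0],
    "Grade": ["A", "B", "A", "A"],
})

# Column-wise mean of the numeric columns only.
col_means = toy.mean(numeric_only=True)
print(col_means)

# The same quantity by hand: sum of the values divided by their count.
age_mean_by_hand = toy["Age"].sum() / len(toy)
print(age_mean_by_hand)
```

The hand-computed value should agree with the Age entry of `col_means`, and the Grade column should not appear in the result.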


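In the cells above, we computed the column-wise mean. It is also possible to calculate the mean of each row by specifying the axis = 1 argument. The code below calculates the row-wise mean and displays it for the first five rows. Note that this averages Age and Temperature together, so it illustrates the mechanics rather than a meaningful quantity.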

In [6]:
df.mean(axis = 1, numeric_only = True)[0:5]

0    57.20
1    51.00
2    62.75
3    54.10
4    61.55
dtype: float64
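
To make the axis argument concrete, the sketch below uses a made-up two-row DataFrame (the values are assumptions for illustration). With axis=0 the rows are collapsed and one value per column is returned; with axis=1 the columns are collapsed and one value per row is returned.

```python
import pandas as pd

# Hypothetical toy data, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({"Age": [40, 60], "Temperature": [36.0, 38.0]})

per_column = toy.mean(axis=0)  # one mean per column: Age, Temperature
per_row = toy.mean(axis=1)     # one mean per row: index 0, index 1

print(per_column)
print(per_row)
```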

### Median
In simple terms, the median represents the 50th percentile, or the middle value of the data, which separates the distribution into two halves. The line of code below prints the median of the numerical variables in the data. Adding the argument axis = 0 gives the same output.

In [7]:
df.median(numeric_only=True)

Age            56.0
Temperature    37.8
dtype: float64

From the output, we can infer that the median age of the patients is 56 years and the median temperature is 37.8.

It is also possible to calculate the median of a particular variable in the data, as shown in the first two lines of code below. We can also calculate the median of each row by specifying the axis = 1 argument. The last line below calculates the row-wise median and shows the first five rows.

In [8]:
#to calculate a median of a particular column
print(df.loc[:,'Age'].median())
print(df.loc[:,'Temperature'].median())

df.median(axis = 1, numeric_only = True)[0:5]

56.0
37.8

0    57.20
1    51.00
2    62.75
3    54.10
4    61.55
dtype: float64
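
The median is less sensitive to extreme values than the mean. The sketch below uses Python's built-in statistics module on made-up ages (assumed values, not from the dataset) to show the odd and even cases and the effect of a single extreme value.

```python
from statistics import mean, median

ages_odd = [22, 35, 47, 58, 63]   # odd count: the middle value is the median
ages_even = [22, 35, 47, 58]      # even count: average of the two middle values

median_odd = median(ages_odd)
median_even = median(ages_even)

# Replace the largest age with an implausible value to mimic a data-entry error.
ages_with_outlier = [22, 35, 47, 58, 630]

mean_without = mean(ages_odd)
mean_with = mean(ages_with_outlier)
median_with = median(ages_with_outlier)

print(median_odd, median_even)
print(mean_without, mean_with)   # the mean shifts strongly
print(median_odd, median_with)   # the median does not move
```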

### Mode
The mode represents the most frequent value of a variable in the data. It is the only measure of central tendency that can be used with nominal categorical variables: the mean requires quantitative data, and the median requires values that can at least be ordered.

The line of code below prints the mode of all the variables in the data. The .mode() method returns the most frequent value of each variable; when several values tie for most frequent, it returns all of them, one per row. The command df.mode(axis = 0) will also give the same output.

In [9]:
df.mode()

|   | Age | Grade | Logistics | Temperature | Pneumonia | Hypertension | DiabetesMellitus |
|---|---|---|---|---|---|---|---|
| 0 | 30 | A | Admit to Ward for Care | 39.1 | No | No | No |
| 1 | 87 | NaN | NaN | NaN | NaN | NaN | NaN |


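The second row appears because Age has two modes: 30 and 87 occur equally often. The other variables each have a single mode, so their entries in that row are empty (NaN).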
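
The sketch below reproduces this behaviour on a made-up DataFrame (the rows are assumptions for illustration): a column with two tied values yields two rows in the result, and a column with a single mode is padded with NaN.

```python
import pandas as pd

# Hypothetical toy data: Age has a tie, Grade has a single most frequent value.
toy = pd.DataFrame({
    "Age": [30, 30, 87, 87, 50],          # 30 and 87 each occur twice
    "Grade": ["A", "A", "B", "A", "B"],   # "A" occurs three times
})

modes = toy.mode()
print(modes)

# Frequencies behind the result, for one column.
age_counts = toy["Age"].value_counts()
print(age_counts)
```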

## Measures of Dispersion
In the previous sections, we discussed the various measures of central tendency. However, these measures only locate the center of a variable; they do not show how far the values spread around it. The extent to which a distribution is stretched or squeezed is measured by dispersion, also referred to as variability, scatter, or spread. The most popular measures of dispersion are the standard deviation, the variance, and the interquartile range.

### Standard Deviation
Standard deviation is a measure that is used to quantify the amount of variation of a set of data values from its mean. A low standard deviation for a variable indicates that the data points tend to be close to its mean, and vice versa. The line of code below prints the standard deviation of all the numerical variables in the data.

In [10]:
df.std(numeric_only=True)

Age            21.712566
Temperature     1.198883
dtype: float64

In [11]:
print(df.loc[:,'Age'].std())
print(df.loc[:,'Temperature'].std())

#calculate the standard deviation of the first five rows 
df.std(axis = 1, numeric_only = True)[0:5]

21.71256560765737
1.198882832230591

0    28.001429
1    18.384776
2    34.294679
3    22.485996
4    33.163308
dtype: float64
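
By default, pandas' .std() divides by n − 1 (ddof=1), which gives the sample standard deviation. The sketch below spells out that formula on made-up values (an assumption for illustration) and cross-checks it against the standard library's statistics.stdev and statistics.pstdev.

```python
import math
from statistics import pstdev, stdev

# Hypothetical values, chosen so the population result is a round number.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
center = sum(values) / n
squared_deviations = [(x - center) ** 2 for x in values]

# Sample standard deviation: divide by n - 1 (the pandas default, ddof=1).
sample_std = math.sqrt(sum(squared_deviations) / (n - 1))

# Population standard deviation: divide by n (ddof=0 in pandas).
population_std = math.sqrt(sum(squared_deviations) / n)

print(sample_std, stdev(values))
print(population_std, pstdev(values))
```

The sample version is always slightly larger than the population version; the difference shrinks as n grows.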

### Variance
Variance is another measure of dispersion. It is the square of the standard deviation, and equals the covariance of a random variable with itself. The line of code below prints the variance of all the numerical variables in the dataset. The interpretation of the variance is similar to that of the standard deviation, except that it is expressed in squared units.

In [12]:
df.var(numeric_only=True)

Age            471.435505
Temperature      1.437320
dtype: float64
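
Because the variance is the square of the standard deviation, the two outputs can be cross-checked. The sketch below squares the standard deviations printed in the previous section and compares them with the variances shown above; all four numbers are copied from those outputs.

```python
# Standard deviations as printed by df.loc[:, ...].std() in the previous section.
std_age = 21.71256560765737
std_temperature = 1.198882832230591

# Variances as printed by df.var() above.
reported_var_age = 471.435505
reported_var_temperature = 1.437320

var_age = std_age ** 2
var_temperature = std_temperature ** 2

print(var_age, reported_var_age)
print(var_temperature, reported_var_temperature)
```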

### Interquartile Range (IQR)
The Interquartile Range (IQR) is a measure of statistical dispersion, and is calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). The IQR is also a very important measure for identifying outliers and could be visualized using a boxplot.

IQR can be calculated using the iqr() function. The first line of code below imports the 'iqr' function from the scipy.stats module, while the second line prints the IQR for the variable 'Age'.

In [13]:
from scipy.stats import iqr
iqr(df['Age'])

40.0
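
The same number follows directly from the quartiles that df.describe() reports for Age in this document (35 at 25% and 75 at 75%). The sketch below recomputes the IQR from them and then applies the common 1.5 × IQR boxplot rule to look for outliers. The choice of that rule is an assumption here, since the text does not name a specific one.

```python
# Quartiles and extremes for Age, copied from the df.describe() output.
q1_age, q3_age = 35.0, 75.0
min_age, max_age = 18.0, 90.0

iqr_age = q3_age - q1_age

# Assumed rule: values beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence = q1_age - 1.5 * iqr_age
upper_fence = q3_age + 1.5 * iqr_age

has_outliers = min_age < lower_fence or max_age > upper_fence

print(iqr_age)
print(lower_fence, upper_fence)
print(has_outliers)
```

Under this rule, the smallest and largest ages both lie inside the fences, so Age would show no outliers in a boxplot.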

### Skewness
Another useful statistic is skewness, which measures the symmetry, or lack of it, of a real-valued random variable about its mean. The skewness value can be positive, negative, zero, or undefined. In a perfectly symmetrical unimodal distribution, the mean, the median, and the mode all have the same value. The variables in our data are not perfectly symmetrical, which is one reason these measures take slightly different values.

We can calculate the skewness of the numerical variables using the skew() function, as shown below.

In [14]:
print(df.skew(numeric_only=True))

Age           -0.104796
Temperature    0.003143
dtype: float64

The skewness values can be interpreted in the following manner:

Highly skewed distribution: If the skewness value is less than −1 or greater than +1.

Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.

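Approximately symmetric distribution: If the skewness value is between −½ and +½.

With skewness values of about −0.10 for Age and 0.003 for Temperature, both variables in our data are approximately symmetric.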
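
These rules of thumb can be written as a small helper. The function name `describe_skewness` is hypothetical, and treating values of exactly ½ or 1 in magnitude as moderately skewed is an assumption, since the text leaves the boundaries open.

```python
def describe_skewness(value: float) -> str:
    """Label a skewness value using the rule-of-thumb ranges above."""
    magnitude = abs(value)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:  # assumption: boundary values count as moderate
        return "moderately skewed"
    return "approximately symmetric"


# Skewness values copied from the print(df.skew()) output.
age_label = describe_skewness(-0.104796)
temperature_label = describe_skewness(0.003143)

print(age_label)
print(temperature_label)
```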

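### Kurtosis
Kurtosis is a measure of how pointy or flat the peak of a distribution curve is. The kurtosis of any univariate normal distribution is 3. The kurt() method in pandas reports excess kurtosis (Fisher's definition, kurtosis minus 3), so a normal distribution scores 0 and the values are interpreted as follows:

Excess kurtosis equal to 0: the distribution is as peaked as a normal distribution.

Excess kurtosis greater than 0: the distribution is more sharply peaked than a normal distribution.

Excess kurtosis less than 0: the distribution is flatter than a normal distribution.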
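
As a rough illustration of the sign of this statistic, the sketch below computes the population excess kurtosis, m4 / m2² − 3, for two made-up samples (both are assumptions for illustration): evenly spread values, and values piled up in the middle with a few far out. pandas' kurt() applies an additional sample-size correction, so its results differ slightly from this simplified formula.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance**2 - 3."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((x - center) ** 2 for x in values) / n  # variance (divide by n)
    m4 = sum((x - center) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical flat sample: every value from 1 to 1000 occurs once.
flat = list(range(1, 1001))

# Hypothetical peaked sample: most values at 0, a few far from the center.
peaked = [0] * 96 + [-10, -10, 10, 10]

flat_kurtosis = excess_kurtosis(flat)
peaked_kurtosis = excess_kurtosis(peaked)

print(flat_kurtosis)
print(peaked_kurtosis)
```

The flat sample gives a value close to −1.2, while the peaked sample gives a clearly positive value.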

In [15]:
print(df.kurt(numeric_only=True))

Age           -1.278724
Temperature   -1.181043
dtype: float64

Both values are negative, so Age and Temperature are flatter than a normal distribution.

## Putting Everything Together
We have covered the measures of central tendency and dispersion in the previous sections. It is important to analyse these individually; however, Python also provides functions that compute several of them at once. One such function is .describe(), which prints summary statistics of the numerical variables. The first cell below performs this operation on the data. The second passes include='all' to add the categorical variables, for which it reports the number of unique values, the most frequent value (top), and its frequency (freq).

In [16]:
df.describe()

|   | Age | Temperature |
|---|---|---|
| count | 597.0 | 597.0 |
| mean | 55.103853 | 37.825796 |
| std | 21.712566 | 1.198883 |
| min | 18.0 | 35.8 |
| 25% | 35.0 | 36.8 |
| 50% | 56.0 | 37.8 |
| 75% | 75.0 | 38.9 |
| max | 90.0 | 39.9 |
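
Each row of the .describe() output corresponds to one of the individual methods used earlier. The sketch below checks this on a small made-up DataFrame; the values and the name `toy` are assumptions for illustration, not rows from Comorbid.xlsx.

```python
import pandas as pd

# Hypothetical toy data; these values are NOT taken from Comorbid.xlsx.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Temperature": [36.0, 36.5, 37.0, 38.0, 39.5],
})

summary = toy.describe()
print(summary)

# The same statistics obtained one at a time for a single column.
age = toy["Age"]
print(age.mean(), age.std(), age.median())

# The IQR can be read off the summary as the 75% row minus the 25% row.
iqr_age = summary.loc["75%", "Age"] - summary.loc["25%", "Age"]
print(iqr_age)
```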


In [17]:
df.describe(include='all')

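|   | Age | Grade | Logistics | Temperature | Pneumonia | Hypertension | DiabetesMellitus |
|---|---|---|---|---|---|---|---|
| count | 597.0 | 597 | 597 | 597.0 | 597 | 597 | 597 |
| unique | NaN | 2 | 4 | NaN | 2 | 2 | 2 |
| top | NaN | A | Admit to Ward for Care | NaN | No | No | No |
| freq | NaN | 336 | 362 | NaN | 544 | 352 | 532 |
| mean | 55.103853 | NaN | NaN | 37.825796 | NaN | NaN | NaN |
| std | 21.712566 | NaN | NaN | 1.198883 | NaN | NaN | NaN |
| min | 18.0 | NaN | NaN | 35.8 | NaN | NaN | NaN |
| 25% | 35.0 | NaN | NaN | 36.8 | NaN | NaN | NaN |
| 50% | 56.0 | NaN | NaN | 37.8 | NaN | NaN | NaN |
| 75% | 75.0 | NaN | NaN | 38.9 | NaN | NaN | NaN |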
