### Packt >
# Python Data Analysis - Third Edition

## Statistics

**Exploratory data analysis (EDA)** is the first step toward data analysis and building a machine learning model. Statistics provide fundamental knowledge and a set of tools for exploratory or descriptive data analysis.

Statistics is a primary and essential skill for data professionals, helping them gain initial insights into the data and an understanding of it.

### Types of attributes

The data type of attributes helps analysts select the correct method for data analysis and visualization plots. 
___
1. **Nominal (categorical) attributes**: Nominal refers to names or labels of categorized variables. The values are categorical, qualitative, and unordered in nature such as product name, brand name, zip code, state, gender, and marital status. Data analysts can calculate the mode, which is the most commonly occurring value. 
2. **Ordinal attributes**: Ordinal refers to names or labels with a meaningful order or ranking. These attributes typically measure subjective qualities, which is why they are used in surveys for customer satisfaction ratings, product ratings, and movie ratings. Data analysts can calculate the median in addition to the mode.
3. **Numeric attributes**: A numeric attribute is expressed quantitatively as an integer or real value. Numeric attributes can be of two types: **interval-scaled** or **ratio-scaled**.
- **interval-scaled** attributes are measured on an ordered scale of equal-sized units. Their main limitation is that they have no "true zero": for example, a temperature of 0 °C does not mean that there is no temperature. Interval-scaled values can be added and subtracted, but ratios between them are not meaningful because there is no true zero. We can calculate the mean of an interval-scaled attribute, in addition to the median and mode.
- **ratio-scaled** attributes are measured on an ordered scale of equal-sized units, similar to an interval scale but with an inherent zero point. Examples of ratio-scaled attributes are height, weight, years of experience, and the number of words in a document. We can perform multiplication and division, and calculate the difference between ratio-scaled values. We can also compute central tendency measures such as mean, median, and mode. The Celsius and Fahrenheit temperature scales are interval scales, while the Kelvin temperature scale is a ratio scale because it has a true zero point.
___
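
The attribute types above can be sketched in pandas. This is only an illustration: the `survey` DataFrame, its column names, and its values are invented for this example and are not part of the chapter's dataset.

```python
import pandas as pd

# Hypothetical survey data, invented purely for illustration
survey = pd.DataFrame({
    "brand": ["A", "B", "A", "C"],                      # nominal
    "satisfaction": ["low", "high", "medium", "high"],  # ordinal
    "temperature_c": [21.5, 19.0, 23.5, 20.0],          # interval-scaled
    "weight_kg": [61.0, 72.5, 58.0, 80.0],              # ratio-scaled
})

# Nominal: only the mode is a meaningful summary
brand_mode = survey["brand"].mode()[0]

# Ordinal: declare the order explicitly so that min/max and sorting respect it
levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
survey["satisfaction"] = survey["satisfaction"].astype(levels)
highest_rating = survey["satisfaction"].max()

# Interval: differences and means are meaningful, ratios are not
temperature_mean = survey["temperature_c"].mean()

# Ratio: a true zero makes ratios meaningful as well
weight_ratio = survey["weight_kg"].max() / survey["weight_kg"].min()

print(brand_mode, highest_rating, temperature_mean, weight_ratio)
```

Declaring the ordinal column as an ordered categorical is one possible design choice; it lets pandas compare `low < medium < high` instead of sorting the labels alphabetically.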

Discrete and continuous attributes

- a **discrete** variable takes a countable number of distinct values, such as how many students are present in a class, how many cars are sold, or how many books are published. Its values are obtained by counting.
- a **continuous** variable can take any of an infinite number of values within a range, such as the weight or height of students. Its values are obtained by measuring.



### Mean

The mean is the sum of the observations divided by the number of observations. It is sensitive to outliers and noise: whenever unusual values are added to a group, the mean shifts away from the typical central value.

In [1]:
import pandas as pd

In [9]:
sample_data = {'name': ['John', 'Alia', 'Ananya', 'Steve', 'Ben'],
              'gender': ['M', 'F', 'F', 'M', 'M'],
              'communication_skill_score': [40, 45, 23, 39, 39],
              'quantitative_skill_score': [38, 41, 42, 48, 32]}
df = pd.DataFrame(sample_data, columns=['name', 'gender', 'communication_skill_score', 'quantitative_skill_score'])
df['communication_skill_score'].mean()

37.2

### Mode

The mode is the most frequently occurring item in a group of observations. It is mostly used for categorical values. If all the values in a group are unique or non-repeated, then there is no mode. It is also possible that more than one value has the same highest frequency. In such cases, there are multiple modes.

In [10]:
df['communication_skill_score'].mode()

0    39
dtype: int64
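
The tie cases described above can be checked with a small sketch. The two Series here are made up for illustration. Note that when every value is unique, pandas returns all of the values rather than signalling that there is no mode.

```python
import pandas as pd

# Illustrative values only: 39 and 45 tie for the highest frequency
tied_scores = pd.Series([39, 39, 45, 45, 23])
# Illustrative values only: every value occurs exactly once
unique_scores = pd.Series([40, 45, 23])

modes = tied_scores.mode()       # contains both tied values, sorted
all_tied = unique_scores.mode()  # pandas returns every value, sorted

print(modes.tolist())
print(all_tied.tolist())
```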

### Median

The median is the midpoint or middle value in a group of observations. It is also called the 50th percentile. The median is less affected by outliers and noise than the mean, which is why it is often considered a more suitable statistical measure for reporting. It usually stays much nearer to the typical central value.

In [11]:
df['communication_skill_score'].median()

39.0
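
To make the difference in outlier sensitivity concrete, the following sketch appends one artificial extreme score (400, chosen arbitrarily for this example) to the communication scores used above and compares how far the mean and the median move.

```python
import pandas as pd

scores = pd.Series([40, 45, 23, 39, 39])

# Append a single artificial outlier (400 is an arbitrary illustrative value)
with_outlier = pd.concat([scores, pd.Series([400])], ignore_index=True)

mean_shift = with_outlier.mean() - scores.mean()
median_shift = with_outlier.median() - scores.median()

print("Shift in mean:", mean_shift)
print("Shift in median:", median_shift)
```

With this particular outlier, the mean moves by roughly 60 points while the median moves by only 0.5.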

### Dispersion

Dispersion metrics measure how spread out the observations are.

In [12]:
# the range is the difference between the maximum and minimum values of the observations
communication_skill_score_range = df['communication_skill_score'].max() - df['communication_skill_score'].min()
communication_skill_score_range

22

In [13]:
# IQR (interquartile range) is the difference between the third and first quartiles
# it measures the spread of the middle 50% of the observations
q1 = df['communication_skill_score'].quantile(.25)
q3 = df['communication_skill_score'].quantile(.75)
iqr = q3 - q1
iqr

1.0

In [16]:
# the variance measures the deviation from the mean
# it is the average of the squared differences between the observed values and the mean
# (pandas computes the sample variance, dividing by n - 1 rather than n)
# the main problem with the variance is its unit of measurement:
# squaring the differences means it is expressed in squared units
df['communication_skill_score'].var()

69.2

In [18]:
# the standard deviation is the square root of the variance,
# so its unit is the same as for the original observations
# this makes it easier for an analyst to evaluate the typical deviation from the mean
df['communication_skill_score'].std()

8.318653737234168
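
The values above are sample statistics. As a short sketch of the distinction, the `ddof` ("delta degrees of freedom") argument controls whether the sum of squared deviations is divided by `n - 1` (sample) or by `n` (population). pandas defaults to `ddof=1`, while NumPy's `np.var` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

scores = pd.Series([40, 45, 23, 39, 39])

sample_var = scores.var()                # pandas default ddof=1: divide by n - 1
population_var = scores.var(ddof=0)      # divide by n
numpy_var = np.var(scores.to_numpy())    # NumPy default ddof=0: divide by n

print(sample_var, population_var, numpy_var)
```

For these five scores, the sample variance is 69.2 and the population variance is 55.36, so it is worth checking which definition a given library uses before comparing results.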

In [20]:
df.describe()

| | communication_skill_score | quantitative_skill_score |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 37.2 | 40.2 |
| std | 8.318654 | 5.848077 |
| min | 23.0 | 32.0 |
| 25% | 39.0 | 38.0 |
| 50% | 39.0 | 41.0 |
| 75% | 40.0 | 42.0 |
| max | 45.0 | 48.0 |


### Skewness and kurtosis
___
**Skewness** measures the asymmetry of a distribution. Its value can be zero, positive, or negative. A value of zero indicates a symmetric distribution, such as the normal distribution. Positive skewness means the tail points toward the right: outliers lie to the right and the data is stacked up on the left. Negative skewness means the tail points toward the left: outliers lie to the left and the data is stacked up on the right. In a positively skewed distribution, the mean is typically greater than the median and the mode; in a negatively skewed distribution, the mean is typically less than the median and the mode.
___
**Kurtosis** measures the tailedness (thickness of the tails) of a distribution compared to a normal distribution. High kurtosis indicates heavy tails, which means more outliers are present in the observations, and low kurtosis indicates light tails, which means fewer outliers are present. The values below refer to excess kurtosis (kurtosis minus 3), which is what the pandas `kurtosis()` method reports, so a normal distribution scores zero. There are three types of kurtosis shapes: mesokurtic, platykurtic, and leptokurtic.

- A normal distribution, which has zero excess kurtosis (a kurtosis of 3), is known as a **mesokurtic** distribution.
- A **platykurtic** distribution has a negative excess kurtosis value (a kurtosis less than 3) and is thin-tailed compared to a normal distribution.
- A **leptokurtic** distribution has a positive excess kurtosis value (a kurtosis greater than 3) and is fat-tailed compared to a normal distribution.

<img src='img/kurtosis.png'>

In [21]:
df['communication_skill_score'].skew()

-1.704679180800373

In [22]:
df['communication_skill_score'].kurtosis()

3.6010641852384015
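
The two kurtosis conventions (excess kurtosis versus "plain" kurtosis) can be compared with SciPy, whose `kurtosis` function exposes a `fisher` flag. This is a sketch on simulated data; the seed and sample size are arbitrary choices. pandas applies a small-sample bias correction, so its results on small samples need not match SciPy's defaults exactly.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Simulated normal data; seed and size are arbitrary illustrative choices
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)

excess = kurtosis(normal_sample, fisher=True)    # Fisher definition: normal is about 0
plain = kurtosis(normal_sample, fisher=False)    # Pearson definition: normal is about 3
asymmetry = skew(normal_sample)                  # symmetric data: about 0

print("Excess kurtosis:", excess)
print("Pearson kurtosis:", plain)
print("Skewness:", asymmetry)
```

The two kurtosis values always differ by exactly 3, which is why it matters to know which definition a reported number uses.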

### Covariance and correlation coefficients

**Covariance** measures the relationship between a pair of variables. It shows how the variables change together, that is, whether an increase in one variable tends to accompany an increase or a decrease in the other. Its value ranges from -infinity to +infinity. The problem with covariance is that it is not normalized: its magnitude depends on the units of the variables, so it is hard to interpret.

**Correlation** shows how strongly variables are related to each other. It is a normalized version of covariance and is therefore easier to interpret. Correlation ranges from -1 to 1. A negative value means that an increase in one variable is associated with a decrease in the other, that is, the variables move in opposite directions. A positive value means that the variables move in the same direction: an increase in one is associated with an increase in the other, and a decrease with a decrease. A zero value means that there is no linear relationship between the variables; it does not by itself imply that the variables are independent of each other.

The pandas `corr()` method accepts the following values for its `method` parameter:

- **pearson**: Standard correlation coefficient
- **kendall**: Kendall's tau correlation coefficient
- **spearman**: Spearman's rank correlation coefficient

**Spearman's rank correlation coefficient** is Pearson's correlation coefficient computed on the ranks of the observations. It is a non-parametric measure of rank correlation that assesses the strength of the association between two ranked variables. Ranked variables are ordinal numbers, arranged in order. First, we rank the observations, and then we compute the correlation of the ranks. It can be applied to both continuous and discrete ordinal variables. When the distribution of the data is skewed or affected by outliers, Spearman's rank correlation is used instead of Pearson's correlation because it makes no assumptions about the data distribution.

**Kendall's rank correlation coefficient**, or Kendall's tau coefficient, is a non-parametric statistic used to measure the association between two ordinal variables. It is a type of rank correlation that measures how similar the orderings of the two variables are. If both variables are binary, then Pearson's = Spearman's = Kendall's tau.
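
The description of Spearman's coefficient as "Pearson's correlation on the ranks" can be verified directly. The sketch below rebuilds the two score columns used in this chapter so that it is self-contained; `rank()` assigns tied values their average rank by default.

```python
import pandas as pd

scores = pd.DataFrame({
    "communication_skill_score": [40, 45, 23, 39, 39],
    "quantitative_skill_score": [38, 41, 42, 48, 32],
})

# Step 1: rank the observations (ties receive the average rank)
ranks = scores.rank()

# Step 2: compute Pearson's correlation on the ranks
pearson_on_ranks = ranks["communication_skill_score"].corr(
    ranks["quantitative_skill_score"]
)

# Compare with the built-in Spearman method
spearman = scores["communication_skill_score"].corr(
    scores["quantitative_skill_score"], method="spearman"
)

print(pearson_on_ranks, spearman)
```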

In [23]:
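# numeric_only=True skips the non-numeric 'name' and 'gender' columns
# (recent pandas versions raise an error on them otherwise)
df.cov(numeric_only=True)

| | communication_skill_score | quantitative_skill_score |
|---|---|---|
| communication_skill_score | 69.2 | -6.55 |
| quantitative_skill_score | -6.55 | 34.2 |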


In [25]:
df.corr(method='pearson', numeric_only=True)

| | communication_skill_score | quantitative_skill_score |
|---|---|---|
| communication_skill_score | 1.0 | -0.13464 |
| quantitative_skill_score | -0.13464 | 1.0 |
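
As a sketch of what "normalized" means here, Pearson's correlation is the covariance divided by the product of the two standard deviations. The example also rescales one column (the factor of 10 is arbitrary) to show that covariance depends on units while correlation does not.

```python
import numpy as np
import pandas as pd

x = pd.Series([40, 45, 23, 39, 39], name="communication_skill_score")
y = pd.Series([38, 41, 42, 48, 32], name="quantitative_skill_score")

cov_xy = x.cov(y)
pearson_manual = cov_xy / (x.std() * y.std())   # normalize by both standard deviations
pearson_pandas = x.corr(y)

# Rescaling x (arbitrary factor of 10) changes the covariance but not the correlation
cov_rescaled = (x * 10).cov(y)
corr_rescaled = (x * 10).corr(y)

print(cov_xy, pearson_manual, pearson_pandas)
print(cov_rescaled, corr_rescaled)
```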


In [26]:
df.corr(method='spearman', numeric_only=True)

| | communication_skill_score | quantitative_skill_score |
|---|---|---|
| communication_skill_score | 1.0 | -0.307794 |
| quantitative_skill_score | -0.307794 | 1.0 |


In [27]:
df.corr(method='kendall', numeric_only=True)

| | communication_skill_score | quantitative_skill_score |
|---|---|---|
| communication_skill_score | 1.0 | -0.105409 |
| quantitative_skill_score | -0.105409 | 1.0 |


### Central limit theorem

Data analysis methods involve hypothesis testing and constructing confidence intervals. Many statistical tests assume that the population is normally distributed. The central limit theorem is at the core of hypothesis testing. According to this theorem, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided it has a finite variance). Also, the sample mean gets closer to the population mean, and the standard deviation of the sample mean (the standard error) decreases. This theorem is essential for working with inferential statistics, helping data analysts figure out how samples can be useful in getting insights about the population.

It also helps answer questions such as what sample size should be taken, or which sample size gives an accurate representation of the population. You can understand this with the help of the following diagram:

<img src='img/sample_sizes.png'>

As the sample size increases, the histogram approaches a normal curve.
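
The behaviour described above can be simulated. This is a minimal sketch under stated assumptions: the exponential population, the seed, the sample sizes, and the `sample_means` helper are all arbitrary choices made for this illustration, not part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)

# A deliberately skewed (non-normal) population, chosen for illustration
population = rng.exponential(scale=2.0, size=100_000)

def sample_means(n, draws=2_000):
    """Draw `draws` samples of size n (with replacement) and return their means."""
    samples = rng.choice(population, size=(draws, n))
    return samples.mean(axis=1)

results = {}
for n in (2, 30, 500):
    means = sample_means(n)
    observed_spread = means.std(ddof=1)                # spread of the sample means
    expected_spread = population.std() / np.sqrt(n)    # theoretical standard error
    results[n] = (means.mean(), observed_spread, expected_spread)
    print(n, results[n])
```

As `n` grows, the spread of the sample means shrinks roughly in line with the population standard deviation divided by the square root of `n`, and a histogram of `means` would look increasingly bell-shaped.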

### Collecting samples

Sampling is a method or process of collecting sample data from various sources. It is the most crucial part of data collection. The success of an experiment depends upon how well the data is collected. If anything goes wrong with sampling, it will hugely affect the final interpretations. 

Sampling helps researchers draw inferences about the population from the sample, and reduces the cost and workload of collecting and managing survey data.

1. **Probability sampling**: With this technique, respondents are selected at random, so every member of the population has a known, non-zero chance of being selected. Such sampling techniques are more time-consuming and expensive, and include the following:
- *Simple random sampling*: each respondent is selected by chance, meaning that each respondent has an equal chance of being selected.
- *Stratified sampling*: the whole population is divided into small groups known as strata that are based on some similarity criteria, and respondents are then selected at random from each stratum. These strata can be of unequal size. This technique improves accuracy by reducing selection bias.
- *Systematic sampling*: respondents are selected at regular intervals from the target population, such as every nth respondent.
- *Cluster sampling*: the entire population is divided into clusters or sections. Clusters are formed based on gender, location, occupation, and so on. Entire clusters, rather than individual respondents, are then selected for the sample.

2. **Non-probability sampling**: With this technique, respondents are selected non-randomly, so members of the population have unequal chances of being selected. Its outcome might be biased. Such sampling techniques are cheaper and more convenient, and include the following:
- *Convenience sampling*: selects respondents based on their availability and willingness to participate. Statisticians prefer this technique for an initial survey because it is cheap and collects data quickly, but the results are more prone to bias.
- *Purposive sampling*: This is also known as judgmental sampling because it depends upon the statistician's judgment. Statisticians decide who will participate in the survey based on certain predefined characteristics. News reporters use this technique to select people whose opinions they wish to obtain.
- *Quota sampling*: This technique predefines the strata and the proportion of each stratum in the sample. Respondents are selected until each quota is met. It differs from stratified sampling in its selection strategy: stratified sampling selects items within strata using random sampling, whereas quota sampling selects them non-randomly.
- *Snowball sampling*: This technique is used when respondents are rare and difficult to trace, in areas such as illegal immigration or HIV. Statisticians contact volunteers, who help reach other members of the target group. It is also known as referral sampling because the initial person taking part in the survey refers another person who fits the sample description.
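
The four probability sampling techniques listed above can be sketched with pandas. Everything in this example is hypothetical: the `population` DataFrame, its `region` column, the sample sizes, and the seeds are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical population of 100 respondents in four equally sized regions
population = pd.DataFrame({
    "respondent_id": range(100),
    "region": np.repeat(["north", "south", "east", "west"], 25),
})

# Simple random sampling: every respondent has the same chance of selection
simple = population.sample(n=20, random_state=1)

# Systematic sampling: every 5th respondent after a random starting point
start = int(rng.integers(0, 5))
systematic = population.iloc[start::5]

# Stratified sampling: draw 20% at random from each region (stratum)
stratified = population.groupby("region").sample(frac=0.2, random_state=1)

# Cluster sampling: pick two whole regions at random and keep everyone in them
chosen_regions = rng.choice(population["region"].unique().tolist(), size=2, replace=False)
cluster = population[population["region"].isin(chosen_regions)]

print(len(simple), len(systematic), len(stratified), len(cluster))
```

The non-probability techniques depend on human judgment or availability, so they are not easily expressed as a random draw and are left out of this sketch.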

### Performing parametric tests

A *t-test* is a kind of parametric test used for checking whether there is a significant difference between the means of the groups concerned. It is one of the most commonly used inferential statistics and assumes that the data is approximately normally distributed. This section covers two types of t-test: a one-sample t-test and a two-sample t-test.

A **one-sample t-test** is used for checking whether there is a significant difference between the sample mean and a hypothesized population mean.

A **two-sample t-test** is used for checking whether the means of two independent groups differ significantly. This test is also known as an independent samples t-test.

In [34]:
import numpy as np
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind

In [30]:
data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
mean_value = np.mean(data)
mean_value

70.5

In [32]:
t_test_value, p_value = ttest_1samp(data, 68)
print("P Value:",p_value)
print("t-test Value:",t_test_value)
# 0.05 or 5% is the significance level (alpha)
if p_value < 0.05:
    print("Null hypothesis rejected")
else:
    print("Failed to reject the null hypothesis")
# the results show that we fail to reject the null hypothesis at the 5% significance level,
# which means the data is consistent with an average weight of 68 kg for the 10 students
# (this does not prove that the average is exactly 68 kg)

P Value: 0.5986851106160134
t-test Value: 0.5454725779039431
Failed to reject the null hypothesis
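
To show what the one-sample test computes, the following sketch rebuilds the statistic by hand as the difference between the sample mean and the hypothesized mean, divided by the standard error, and compares it with SciPy's result. The variable names (`mu0`, `t_manual`, and so on) are introduced only for this illustration.

```python
import numpy as np
from scipy import stats

data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
mu0 = 68                      # hypothesized population mean
n = data.size

# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = data.std(ddof=1) / np.sqrt(n)
t_manual = (data.mean() - mu0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(data, mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```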


In [33]:
data2=np.array([53, 43, 31, 113, 33, 57, 27, 23, 24, 43])

In [35]:
# Compare samples
stat, p = ttest_ind(data, data2)
print("p-values:",p)
print("t-test:",stat)

# 0.05 or 5% is the significance level (alpha)
if p < 0.05:
    print("Null hypothesis rejected")
else:
    print("Failed to reject the null hypothesis")
# we have tested the hypothesis that the two groups have the same average weight
# using the ttest_ind() method, and the results show that the null hypothesis
# is rejected at the 5% significance level, which suggests that the group means are different.

p-values: 0.015170931362451255
t-test: 2.6835879913819185
Null hypothesis rejected
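
By default, `ttest_ind()` assumes that both groups have equal variances. When that assumption is doubtful, one commonly used alternative is Welch's t-test, which SciPy exposes through the `equal_var=False` argument. The following is a sketch of that option, not a claim about which test the analysis above should have used.

```python
import numpy as np
from scipy.stats import ttest_ind

data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
data2 = np.array([53, 43, 31, 113, 33, 57, 27, 23, 24, 43])

# Ratio of sample variances: a quick, informal check of the equal-variance assumption
variance_ratio = data2.var(ddof=1) / data.var(ddof=1)

student = ttest_ind(data, data2)                  # assumes equal variances
welch = ttest_ind(data, data2, equal_var=False)   # Welch's t-test

print("Variance ratio:", variance_ratio)
print("Student:", student.statistic, student.pvalue)
print("Welch:", welch.statistic, welch.pvalue)
```

Because both groups have the same number of observations, the two test statistics coincide; only the degrees of freedom, and therefore the p-value, differ.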
