
## Chapter 3
- **Exploratory Data Analysis (EDA)**  
    - First step in data analysis and machine learning model building.  
    - Statistics provide tools for exploratory/descriptive data analysis.  
    - Essential for understanding real-world data (noisy, missing values, diverse sources).  

- **Importance of Statistics**  
    - Helps familiarize with data before preprocessing and analysis.  
    - Necessary skill for data professionals to gain initial insights.  
    - Examples:  
        - Arithmetic mean: Understand workload (e.g., employee working hours).  
        - Standard deviation: Gauge how widely values spread around the mean.  
        - Correlation: Understand relationships (e.g., blood pressure vs. age).  
        - Sampling: Useful for primary data collection.  
        - Hypothesis tests: Infer population facts (parametric/non-parametric).  

- **Topics Covered in Chapter**  
    - Attributes and their types.  
    - Measuring central tendency.  
    - Measuring dispersion.  
    - Skewness and kurtosis.  
    - Covariance and correlation coefficients.  
    - Central limit theorem.  
    - Sampling methods.  
    - Parametric tests.  
    - Non-parametric tests.  

### Understanding Attributes and Their Types

**What is Data and Attributes?**  
- Data: Collection of raw facts, statistics, numbers, words, or observations.  
- Attribute: A column, data field, or series representing the characteristics of an object.  
    - Also known as variable, feature, or dimension.  
    - Terminology varies:  
        - Statisticians: Variable  
        - Machine Learning Engineers: Feature  
        - Data Warehousing: Dimension  
        - Database Professionals: Attribute  

**Types of Attributes**  
Attributes are categorized based on their data types, which are crucial for selecting appropriate analysis methods and visualization techniques.

1. **Nominal Attributes**  
     - Represent names or labels of categorized variables.  
     - Values are categorical, qualitative, and unordered (e.g., product name, gender, zip code).  
     - Central tendency: Mode (most frequent value).  
     - Mean and median are not meaningful for nominal data.

2. **Ordinal Attributes**  
     - Represent names or labels with a meaningful order or ranking.  
     - The order is known, but the magnitude of the difference between successive values is not (e.g., customer satisfaction ratings, drink sizes).  
     - Examples:  
         - Satisfaction ratings: 1 (Very dissatisfied) to 5 (Very satisfied).  
         - Drink sizes: Small, Medium, Large.  
     - Central tendency: Mode and median.  
     - Mean is not meaningful due to qualitative nature.  
     - Can be created by discretizing quantitative variables into finite ranges.

3. **Numeric Attributes**  
     - Quantitative attributes presented as integers or real values.  
     - Two types:  
         - **Interval-Scaled Attributes**  
             - Measured on an ordered scale with equal-sized units.  
             - No "true zero" (e.g., temperature in °C, birth year).  
             - Operations: Addition, subtraction.  
             - Central tendency: Mean, median, mode.  
         - **Ratio-Scaled Attributes**  
             - Similar to interval scale but with a "true zero" (e.g., height, weight, years of experience).  
             - Operations: Addition, subtraction, multiplication, division.  
             - Central tendency: Mean, median, mode.  
             - Example: Kelvin temperature scale (true zero) vs. Celsius/Fahrenheit (interval scale).  
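
The attribute types above map fairly directly onto pandas data types. The following is a minimal sketch, assuming made-up drink-size data; the names `sizes`, `mode_label`, and `median_label` are illustrative and do not come from the chapter's dataset. It stores an ordinal attribute as an ordered categorical so that the mode, the ordering, and a rank-based median are all well defined.

```python
import pandas as pd

# Hypothetical ordinal attribute: drink sizes with a meaningful order
sizes = pd.Series(
    pd.Categorical(
        ["Small", "Large", "Medium", "Small", "Medium", "Medium", "Large"],
        categories=["Small", "Medium", "Large"],
        ordered=True,
    )
)

# Mode is valid for nominal and ordinal data
mode_label = sizes.mode()[0]

# The order is defined, so min and max are meaningful
smallest, largest = sizes.min(), sizes.max()

# Median via the integer rank codes (0=Small, 1=Medium, 2=Large);
# the sample has an odd length, so the median code is a whole number
median_code = int(sizes.cat.codes.median())
median_label = sizes.cat.categories[median_code]

print(mode_label, smallest, largest, median_label)
```

A mean is deliberately not computed here: the gaps between "Small", "Medium", and "Large" are not known to be equal, so averaging the codes would not be meaningful.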

### Discrete and Continuous Attributes

**Classification of Attributes**  
In addition to nominal, ordinal, and numeric attributes, attributes can also be classified as discrete or continuous.

1. **Discrete Attributes**  
	- Take a countable number of distinct values (finite or countably infinite).  
	- Examples:  
		- Number of students in a class.  
		- Number of cars sold.  
		- Number of books published.  
	- Obtained by counting.  
	- Accept integral values.  
	- Fractions do not make sense.  
	- Use a limited number of values.

2. **Continuous Attributes**  
	- Accept an infinite number of possible values.  
	- Examples:  
		- Weight of students.  
		- Height of students.  
	- Obtained by measuring.  
	- Accept real values.  
	- Fractions make sense.  
	- Use an unlimited number of values.

**Key Difference**  
- Discrete attributes: Values are obtained by counting, and the possible values can be listed one by one.  
- Continuous attributes: Values are obtained by measuring, and any real value within a range is possible.

**Next Steps**  
After understanding the types of attributes, the focus shifts to basic statistical descriptions, such as measures of central tendency.

### Measuring Central Tendency

Central tendency describes where the values of a dataset cluster, summarized by averages such as the mean, median, and mode. A measure of central tendency computes a single central value that represents the whole set of observations, giving a compact quantitative summary of the group. The following sections cover each measure in turn.

#### Mean
The mean (arithmetic mean or average) is the sum of the observations divided by the number of observations. It is sensitive to outliers and noise: adding even one unusually large or small value to a group can pull the mean away from the typical central value.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here, $x_i$ is the $i$-th observation and $n$ is the number of observations.

In [8]:
# Import pandas library
import pandas as pd

# Create dataframe
sample_data = {'name': ['John', 'Alia', 'Ananya', 'Steve', 'Ben'],
'gender': ['M', 'F', 'F', 'M', 'M'],
'communication_skill_score': [40, 45, 23, 39, 39],
'quantitative_skill_score': [38, 41, 42, 48, 32]}

data = pd.DataFrame(sample_data, columns = ['name', 'gender',
'communication_skill_score', 'quantitative_skill_score'])

# Mean of the communication_skill_score column
data['communication_skill_score'].mean(axis=0)


np.float64(37.2)

#### Mode
The mode is the most frequently occurring value in a group of observations, and it is the measure most often used for categorical data. If every value in a group is unique, there is no meaningful mode. If several values share the highest frequency, the data has multiple modes.

In [10]:
# find mode of communication_skill_score column

data['communication_skill_score'].mode()

0    39
Name: communication_skill_score, dtype: int64

#### Median
The median is the midpoint or middle value in a sorted group of observations. It is also called the 50th percentile. The median is less affected by outliers and noise than the mean, which is why it is often preferred for summarizing skewed data: it stays closer to a typical central value.

In [13]:
# find median of communication_skill_score column
data['communication_skill_score'].median()

np.float64(39.0)
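
To make the outlier sensitivity of the mean concrete, here is a small sketch that reuses the communication scores from above and appends one extreme value. The value 400 is a hypothetical data-entry error added purely for illustration.

```python
import pandas as pd

scores = pd.Series([40, 45, 23, 39, 39])

# Hypothetical outlier appended for illustration (e.g., a data-entry error)
scores_with_outlier = pd.concat([scores, pd.Series([400])], ignore_index=True)

# Compare how far each measure moves once the outlier is present
print("mean:  ", scores.mean(), "->", scores_with_outlier.mean())
print("median:", scores.median(), "->", scores_with_outlier.median())
```

The mean jumps from 37.2 to roughly 97.7, while the median only moves from 39.0 to 39.5.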

### Measuring Dispersion

While central tendency provides the middle value of a group of observations, it does not offer a complete picture of the data. Dispersion metrics are used to measure the deviation or variability in observations. These metrics help understand the spread of data. The most commonly used dispersion metrics are:

1. **Range**  
    - The range is the difference between the maximum and minimum values in a dataset.  
    - It is simple to compute and easy to understand.  
    - The unit of the range is the same as the unit of the observations.  

Other dispersion metrics, such as interquartile range (IQR), variance, and standard deviation, also provide valuable insights into the spread of data.

In [16]:
column_range = (
    data['communication_skill_score'].max() - 
    data['communication_skill_score'].min()
)
print(column_range)

22


#### Interquartile Range (IQR)

**Definition**  
The Interquartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). It is a measure of statistical dispersion and represents the range where the middle 50% of the observations lie.

**Key Points**  
- **Computation**: IQR = Q3 - Q1.  
- **Ease of Use**: It is simple to compute and easy to understand.  
- **Unit**: The unit of IQR is the same as the unit of the observations.  
- **Purpose**:  
    - Measures the spread of the middle 50% of the data.  
    - Represents the range where most of the observations are concentrated.  
- **Other Names**:  
    - Midspread  
    - Middle 50%  
    - H-spread  

IQR is a robust measure of variability and is less affected by outliers compared to other metrics like range or standard deviation.

In [20]:
# First Quartile
q1 = data['communication_skill_score'].quantile(.25)

# Third Quartile
q3 = data['communication_skill_score'].quantile(.75)

# Interquartile range
iqr=q3-q1
print(iqr)


1.0


#### Variance

**Definition**  
The variance measures the deviation from the mean. It is the average of the squared differences between the observed values and the mean. Variance provides insights into the spread of data, but its unit of measurement is squared, which can make interpretation challenging. Note that pandas' `var()` computes the sample variance by default, dividing by n − 1 rather than n.

**Key Points**  
- **Purpose**: Measures how far each observation in the dataset is from the mean.  
- **Formula**:  
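$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Here, $x_i$ is the $i$-th observation, $\bar{x}$ is the mean, and $n$ is the number of observations.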

- **Unit**: The unit of variance is the square of the unit of the observations.  
- **Limitation**: The squared unit makes it less interpretable compared to other metrics like standard deviation.  

Variance is a fundamental statistical measure used in various fields, including data analysis, machine learning, and finance, to understand data variability.

In [21]:
# Variance of communication_skill_score
data['communication_skill_score'].var()

np.float64(69.2)

#### Standard Deviation

**Definition**  
The standard deviation is the square root of the variance. Its unit is the same as the unit of the original observations, making it easier for analysts to evaluate the exact deviation from the mean.

**Key Points**  
- **Interpretation**:  
    - A lower standard deviation indicates that observations are closer to the mean, meaning they are less widely spread.  
    - A higher standard deviation indicates that observations are farther from the mean, meaning they are more widely spread.  
- **Symbol**: Represented by the lowercase Greek letter sigma (σ).  
- **Formula**:  
   
   $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Standard deviation is a widely used measure of dispersion in data analysis and statistics.

In [22]:
# Standard deviation of communication_skill_score
data['communication_skill_score'].std()

np.float64(8.318653737234168)
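
The values 69.2 and 8.3187 returned above are sample statistics, because pandas divides by n − 1 by default. The sketch below contrasts this with the population version (divide by n) using the `ddof` parameter; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.Series([40, 45, 23, 39, 39])

# Population variance: divide by n (ddof=0)
pop_var = scores.var(ddof=0)

# Sample variance: divide by n - 1 (pandas default, ddof=1)
sample_var = scores.var()

# Standard deviation is the square root of the matching variance
sample_std = scores.std()

print("population variance:", pop_var)
print("sample variance:    ", sample_var)
print("sample std:         ", sample_std, "=", np.sqrt(sample_var))
```

NumPy's `np.var` and `np.std` default to `ddof=0`, so the two libraries can disagree on the same data unless `ddof` is set explicitly.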

### Using the `describe()` Function

The `describe()` function in pandas provides a summary of statistics for each numeric column in a DataFrame. It includes the following metrics:

- **Count**: The number of non-null observations.
- **Mean**: The average value.
- **Standard Deviation**: The measure of spread or dispersion of the data.
- **First Quartile (25%)**: The value below which 25% of the data falls.
- **Median (50%)**: The middle value or 50th percentile.
- **Third Quartile (75%)**: The value below which 75% of the data falls.
- **Minimum**: The smallest value.
- **Maximum**: The largest value.

This function is useful for quickly understanding the distribution and summary statistics of numeric data in a DataFrame.


In [23]:
# Describe dataframe
data.describe()

| | communication_skill_score | quantitative_skill_score |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 37.2 | 40.2 |
| std | 8.318654 | 5.848077 |
| min | 23.0 | 32.0 |
| 25% | 39.0 | 38.0 |
| 50% | 39.0 | 41.0 |
| 75% | 40.0 | 42.0 |
| max | 45.0 | 48.0 |


### Skewness and Kurtosis

#### Skewness
Skewness measures the asymmetry of a distribution, that is, how far its shape departs from a symmetric distribution such as the normal distribution. It can take the following values:
- **Zero Skewness**: Represents a symmetric distribution (the normal distribution is one example).
- **Positive Skewness**: 
    - The tail points toward the right (outliers lie on the right).
    - Data is stacked up on the left.
    - Typically, **mean > median > mode**.
- **Negative Skewness**: 
    - The tail points toward the left (outliers lie on the left).
    - Data is stacked up on the right.
    - Typically, **mean < median < mode**.

#### Kurtosis
Kurtosis measures the "tailedness" (thickness of tails) of a distribution compared to a normal distribution. It provides insights into the presence of outliers:
- **High Kurtosis**: Heavy-tailed distribution with more outliers.
- **Low Kurtosis**: Light-tailed distribution with fewer outliers.

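Types of kurtosis (the values below are excess kurtosis, that is, kurtosis minus 3, which is what pandas reports):
1. **Mesokurtic**: 
     - Zero excess kurtosis (kurtosis of 3), as in the normal distribution.
2. **Platykurtic**: 
     - Negative excess kurtosis (kurtosis less than 3).
     - Thin-tailed compared to a normal distribution.
3. **Leptokurtic**: 
     - Positive excess kurtosis (kurtosis greater than 3).
     - Fat-tailed compared to a normal distribution.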

In [25]:
# skewness of communication_skill_score column
data['communication_skill_score'].skew()

np.float64(-1.704679180800373)
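
Kurtosis can be computed the same way with `kurt()`. The sketch below uses two hypothetical samples chosen only to make the shapes obvious; pandas reports excess kurtosis (Fisher's definition), so a normal distribution would score about 0.

```python
import pandas as pd

# Hypothetical samples chosen to make the shapes obvious
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])  # long tail to the right
flat = pd.Series(range(1, 21))                 # evenly spread, no heavy tails

# Positive skewness signals a right tail
print("skewness of right_skewed:", right_skewed.skew())

# A symmetric sample has zero skewness
print("skewness of flat:", flat.skew())

# Negative excess kurtosis signals a flat, thin-tailed (platykurtic) shape
print("excess kurtosis of flat:", flat.kurt())
```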

In [30]:
# Ensure only numeric columns are used for correlation
numeric_data = data.select_dtypes(include=['number'])

# Calculate Pearson correlation
correlation_matrix = numeric_data.corr(method='pearson')
print(correlation_matrix)

                           communication_skill_score  quantitative_skill_score
communication_skill_score                    1.00000                  -0.13464
quantitative_skill_score                    -0.13464                   1.00000
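
Pearson's coefficient is the covariance of two variables divided by the product of their standard deviations. As a quick check, the sketch below recomputes the value of -0.13464 from the covariance, rebuilding the two score columns so that it runs on its own; the names `scores`, `cov_xy`, and `r_manual` are illustrative.

```python
import pandas as pd

scores = pd.DataFrame({
    "communication_skill_score": [40, 45, 23, 39, 39],
    "quantitative_skill_score": [38, 41, 42, 48, 32],
})
x = scores["communication_skill_score"]
y = scores["quantitative_skill_score"]

# Sample covariance (divides by n - 1, like Series.var())
cov_xy = x.cov(y)

# Pearson's r = covariance / (std_x * std_y)
r_manual = cov_xy / (x.std() * y.std())

print("covariance:", cov_xy)
print("r (manual):", r_manual)
print("r (pandas):", x.corr(y))
```

Unlike the covariance, whose size depends on the units of both variables, the correlation is unitless and always lies between -1 and 1.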


### Spearman's Rank Correlation Coefficient
- Spearman's rank correlation coefficient is Pearson's correlation coefficient applied to the ranks of observations.
- It is a **non-parametric measure** for rank correlation.
- Assesses the strength of association between two ranked variables (ordinal numbers arranged in order).
- Steps:
    1. Rank the observations.
    2. Compute the correlation of ranks.
- Applicable to both **continuous** and **discrete ordinal variables**.
- Preferred over Pearson's correlation when:
    - Data distribution is skewed.
    - Outliers are present.
- Does not assume any specific data distribution.

### Kendall's Rank Correlation Coefficient
- Also known as **Kendall's tau coefficient**.
- A **non-parametric statistic** used to measure the association between two ordinal variables.
- Measures the agreement between two rankings by comparing the number of concordant and discordant pairs of observations.
- For binary variables: **Pearson's = Spearman's = Kendall's tau**.
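
A short sketch of the three coefficients side by side, assuming hypothetical data in which `y` rises monotonically but not linearly with `x`. The rank-based coefficients only look at the ordering, so they report a perfect association, while Pearson's coefficient is pulled below 1 by the curvature.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

pearson = df["x"].corr(df["y"], method="pearson")

# The rank-based methods in Series.corr rely on SciPy being installed
spearman = df["x"].corr(df["y"], method="spearman")
kendall = df["x"].corr(df["y"], method="kendall")

print("Pearson: ", pearson)
print("Spearman:", spearman)
print("Kendall: ", kendall)
```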


### Transition to Inferential Statistics
- So far, we have covered descriptive statistics topics:
    - Central measures.
    - Dispersion measures.
    - Distribution measures.
    - Variable relationship measures.
- Next, we move to inferential statistics topics:
    - Central limit theorem.
    - Sampling techniques.
    - Parametric and non-parametric tests.



### Central Limit Theorem
- Core of hypothesis testing and confidence interval estimation.
- Does not require the population to be normally distributed; it requires independent observations drawn from a population with finite variance.
- Key points:
    1. The **sampling distribution of the sample mean** approaches a normal distribution as the sample size increases.
    2. The mean of that sampling distribution equals the population mean.
    3. Its standard deviation (the **standard error**) shrinks as the sample size grows, in proportion to 1/√n.
- Essential for inferential statistics:
    - Helps analysts use samples to gain insights about the population.
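
The theorem can be checked with a small simulation. This is a sketch under stated assumptions: the population is exponential with mean 1 and standard deviation 1 (a deliberately skewed, non-normal choice), and the helper `sample_means`, the seed, and the number of repeats are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_means(n, repeats=10_000):
    """Return the means of `repeats` samples of size n from an exponential population."""
    # Exponential(scale=1) is strongly right-skewed; its mean and std are both 1
    return rng.exponential(scale=1.0, size=(repeats, n)).mean(axis=1)

means_5 = sample_means(5)
means_50 = sample_means(50)

# Compare the spread of the sample means with the theoretical value sigma / sqrt(n)
for n, means in [(5, means_5), (50, means_50)]:
    print(f"n={n}: mean={means.mean():.3f}, "
          f"std={means.std():.3f}, theory={1 / np.sqrt(n):.3f}")
```

The average of the sample means stays near the population mean of 1, and their spread tracks 1/√n, even though the population itself is far from normal.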

### Collecting Samples

#### What is Sampling?
- **Definition**: Sampling is the process of collecting a small set of data (sample) from a larger population for analysis.
- **Importance**:
    - Reduces cost and workload.
    - Helps infer population characteristics from the sample.
    - Essential for successful experiments and accurate interpretations.
- **Challenges**: Poor sampling can lead to biased results and affect final interpretations.

#### Categories of Sampling Techniques
Sampling techniques are broadly categorized into two types:

---

### 1. **Probability Sampling**
- **Definition**: Random selection of respondents, giving every member of the population a known, non-zero chance of being selected (an equal chance in simple random sampling).
- **Advantages**:
    - Reduces selection bias.
    - Provides more accurate and reliable results.
- **Disadvantages**:
    - Time-consuming and expensive.
- **Techniques**:
    1. **Simple Random Sampling**:
         - Each respondent is selected by chance.
         - Example: Randomly selecting 20 products from 500 for quality testing.
    2. **Stratified Sampling**:
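         - Population is divided into smaller groups (strata) based on similarity criteria, and respondents are then selected randomly from each stratum.
         - Improves accuracy and reduces selection bias by ensuring that every stratum is represented.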
    3. **Systematic Sampling**:
         - Respondents are selected at regular intervals (e.g., every nth respondent).
    4. **Cluster Sampling**:
         - Population is divided into clusters (e.g., based on gender, location, or occupation).
         - Entire clusters are sampled instead of individual respondents.
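
The four probability sampling techniques above can be sketched with pandas. Everything here is an assumption for illustration: the 500-product population, the `line` column used as both stratum and cluster, the sampling fractions, and the seeds.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: 500 products from three production lines
population = pd.DataFrame({
    "product_id": range(500),
    "line": ["A"] * 250 + ["B"] * 150 + ["C"] * 100,
})

# Simple random sampling: 20 products, each equally likely
simple = population.sample(n=20, random_state=0)

# Stratified sampling: 4% drawn at random from every production line
stratified = population.groupby("line").sample(frac=0.04, random_state=0)

# Systematic sampling: random start, then every 25th product
start = int(rng.integers(0, 25))
systematic = population.iloc[start::25]

# Cluster sampling: pick one whole line at random and keep all its products
chosen_line = rng.choice(population["line"].unique())
cluster = population[population["line"] == chosen_line]

print(len(simple), len(stratified), len(systematic), len(cluster))
```

The stratified sample keeps the 250:150:100 proportions of the lines (10, 6, and 4 products), which a simple random sample of 20 only achieves on average.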

---

### 2. **Non-Probability Sampling**
- **Definition**: Non-random selection of respondents, giving unequal chances of being selected.
- **Advantages**:
    - Cheaper and more convenient.
- **Disadvantages**:
    - Results may be biased.
- **Techniques**:
    1. **Convenience Sampling**:
         - Respondents are selected based on availability and willingness.
         - Common for initial surveys due to low cost and fast data collection.
         - Results are prone to bias.
    2. **Purposive Sampling**:
         - Also known as judgmental sampling.
         - Respondents are selected based on predefined characteristics and the statistician's judgment.
         - Example: News reporters selecting people for opinions.
    3. **Quota Sampling**:
         - Predefines strata properties and proportions.
         - Respondents are selected until the required proportion is met.
         - Differs from stratified sampling as it uses non-random selection within strata.
    4. **Snowball Sampling**:
         - Used when respondents are rare or difficult to trace (e.g., illegal immigration, HIV cases).
         - Initial participants refer others who fit the sample description.
         - Also known as referral sampling.

---

### Summary
- **Probability Sampling**: Random, accurate, but time-consuming and costly.
- **Non-Probability Sampling**: Non-random, convenient, but prone to bias.
- Choosing the right sampling technique depends on the research goals, resources, and population characteristics.

### Performing Parametric Tests

#### Overview
- **Hypothesis Testing**: Core topic of inferential statistics.
- **Parametric Tests**: Statistical tests that assume an underlying statistical distribution.
    - Used for **quantitative** and **continuous data**.
    - Based on **parameters** (numeric quantities representing the population).
    - Generally more **powerful** than non-parametric tests when their assumptions hold.

#### Key Points
- Parametric tests rely on the assumption that the data follows a specific distribution (e.g., normal distribution).
- Hypotheses are developed based on the parameters of the population distribution.

#### Examples of Parametric Tests
1. **T-Test**:
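    - Used to check whether there is a significant difference between two means (a sample mean and a hypothesized value, or the means of two groups).
    - Assumes the data is approximately normally distributed; the test statistic follows Student's t-distribution.
    - One of the most commonly used inferential statistics.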
    - Types:
        - **One-Sample T-Test**:
            - Compares a sample mean to a hypothesized population mean.
        - **Two-Sample T-Test**:
            - Compares the means of two independent groups.

In [31]:
import numpy as np
from scipy.stats import ttest_1samp
# Create data
data=np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])

# Find mean
mean_value = np.mean(data)

print("Mean:",mean_value)


Mean: 70.5


In [33]:
import numpy as np
from scipy.stats import ttest_1samp

# Create data (ensure it's a numeric array)
data = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])

# Perform one-sample t-test
t_test_value, p_value = ttest_1samp(data, 68)

# Print results
print("P Value:", p_value)
print("t-test Value:", t_test_value)

# 0.05 or 5% is significance level or alpha
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


P Value: 0.5986851106160134
t-test Value: 0.545472577903943
Fail to reject the null hypothesis
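
The t statistic reported above can be reproduced by hand as t = (x̄ − μ₀) / (s / √n), where x̄ is the sample mean, μ₀ the hypothesized mean, s the sample standard deviation, and n the sample size. The sketch below does this for the same ten values; the names `sample`, `mu0`, and `t_manual` are illustrative.

```python
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
mu0 = 68  # hypothesized population mean

# Standard error uses the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

# t statistic: distance of the sample mean from mu0 in standard-error units
t_manual = (sample.mean() - mu0) / standard_error

t_scipy, p_value = ttest_1samp(sample, mu0)
print("manual:", t_manual)
print("scipy: ", t_scipy)
```

A sample mean of 70.5 is only about half a standard error away from 68, which is why the p-value is large.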


### Two-Sample T-Test

#### Definition
A two-sample t-test checks whether the means of two independent groups differ significantly. It is also known as an independent samples t-test.

#### Example
To compare the average weight of two independent student groups, the hypotheses are defined as follows:

- **Null Hypothesis (H₀)**: The population means of the two groups are equal (μ₁ = μ₂).
- **Alternative Hypothesis (Hₐ)**: The population means of the two groups are not equal (μ₁ ≠ μ₂).

This test helps determine whether the difference in means is statistically significant.

In [36]:
from scipy.stats import ttest_ind
import numpy as np

# Create numpy arrays
data1 = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
data2 = np.array([53, 43, 31, 113, 33, 57, 27, 23, 24, 43])

# Compare samples
stat, p = ttest_ind(data1, data2)

# Print results
print("p-values:", p)
print("t-test:", stat)

# 0.05 or 5% is significance level or alpha
if p < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

p-values: 0.015170931362451255
t-test: 2.683587991381918
Reject the null hypothesis
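
By default, `ttest_ind` assumes that both groups have the same variance. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs both on the same two samples; it is an illustration of the option, not a claim about which variant the chapter intends.

```python
import numpy as np
from scipy.stats import ttest_ind

data1 = np.array([63, 75, 84, 58, 52, 96, 63, 55, 76, 83])
data2 = np.array([53, 43, 31, 113, 33, 57, 27, 23, 24, 43])

# Sample variances differ noticeably between the groups
print("variances:", data1.var(ddof=1), data2.var(ddof=1))

# Student's t-test (default) pools the two variances
stat_pooled, p_pooled = ttest_ind(data1, data2)

# Welch's t-test drops the equal-variance assumption
stat_welch, p_welch = ttest_ind(data1, data2, equal_var=False)

print("pooled:", stat_pooled, p_pooled)
print("welch: ", stat_welch, p_welch)
```

Because the two groups have the same size, the t statistic is identical in both variants; Welch's version uses fewer degrees of freedom, so its p-value is somewhat larger.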


### Paired Sample T-Test

#### Definition
A paired sample t-test, also known as a dependent sample t-test, is used to determine whether the mean difference between two observations of the same group is zero. It is equivalent to a one-sample t-test performed on the pairwise differences.

#### Example
A common use case is comparing the difference in measurements before and after a treatment for the same group of individuals. For instance, assessing the difference in blood pressure levels for a group of patients before and after drug treatment.

#### Hypotheses
- **Null Hypothesis (H₀)**: The mean difference between the two dependent samples is 0.
- **Alternative Hypothesis (Hₐ)**: The mean difference between the two dependent samples is not 0.

#### Use Case
To assess the impact of a weight loss treatment, we can perform a paired t-test by comparing the weights of patients before and after the treatment.

In [39]:
from scipy.stats import ttest_rel
import numpy as np

# Weights before treatment
data1 = np.array([63, 75, 84, 58, 52, 96, 63, 65, 76, 83])

# Weights after treatment
data2 = np.array([53, 43, 67, 59, 48, 57, 65, 58, 64, 72])

# Compare weights
stat, p = ttest_rel(data1, data2)

# Print results
print("p-values:", p)
print("t-test:", stat)

# 0.05 or 5% is the significance level or alpha
if p < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


p-values: 0.013685575312467715
t-test: 3.0548295044306903
Reject the null hypothesis
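
The link between the paired test and the one-sample test can be verified directly: running `ttest_1samp` on the before-minus-after differences against a mean of 0 gives the same result as `ttest_rel`. The names `before` and `after` are illustrative labels for the two weight arrays used above.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

before = np.array([63, 75, 84, 58, 52, 96, 63, 65, 76, 83])
after = np.array([53, 43, 67, 59, 48, 57, 65, 58, 64, 72])

# Paired t-test on the two dependent samples
stat_paired, p_paired = ttest_rel(before, after)

# One-sample t-test on the pairwise differences against a mean of 0
stat_diff, p_diff = ttest_1samp(before - after, 0)

print("paired:     ", stat_paired, p_paired)
print("differences:", stat_diff, p_diff)
```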


### ANOVA (Analysis of Variance)

#### Overview
- ANOVA is a statistical inference test used to compare multiple groups.
- It compares the variance **between** groups with the variance **within** groups to test the null hypothesis that all group means are equal.
- Typically used to compare more than two sets of data and check for statistical significance.

#### Types of ANOVA
1. **One-Way ANOVA**: Compares multiple groups based on a single independent variable.
2. **Two-Way ANOVA**: Compares groups based on two independent variables.
3. **N-Way ANOVA**: Compares groups based on N (more than two) independent variables.

#### Example: One-Way ANOVA
- **Scenario**: An IT company wants to compare the productivity of employees in three locations: Mumbai, Chicago, and London.
- **Objective**: Check for a significant difference in performance scores across locations.

#### Hypotheses
- **Null Hypothesis (H₀)**: There is no difference between the mean performance scores of the locations.
- **Alternative Hypothesis (Hₐ)**: There is a difference between the mean performance scores of the locations.

In [41]:
from scipy.stats import f_oneway
# Performance scores of Mumbai location
mumbai=[0.14730927, 0.59168541, 0.85677052, 0.27315387,
0.78591207,0.52426114, 0.05007655, 0.64405363, 0.9825853 ,
0.62667439]

# Performance scores of Chicago location
chicago=[0.99140754, 0.76960782, 0.51370154, 0.85041028,
0.19485391,0.25269917, 0.19925735, 0.80048387, 0.98381235,
0.5864963 ]

# Performance scores of London location
london=[0.40382226, 0.51613408, 0.39374473, 0.0689976 ,
0.28035865,0.56326686, 0.66735357, 0.06786065, 0.21013306,
0.86503358]

# Compare results using one-way ANOVA
stat, p = f_oneway(mumbai, chicago, london)

print("p-values:", p)
print("ANOVA:", stat)
if p < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

p-values: 0.27667556390705783
ANOVA: 1.3480446381965452
Fail to reject the null hypothesis
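
The F statistic returned by `f_oneway` is the ratio of between-group variance to within-group variance. The sketch below computes it by hand for three small hypothetical groups (not the location data above) and compares it with SciPy's result.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical scores for three small groups
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([5.0, 5.0, 6.0, 4.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group sum of squares: how far group means sit from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_scipy, p_value = f_oneway(*groups)
print("manual F:", f_manual)
print("scipy F: ", f_scipy)
```

A large F means the group means differ by more than the scatter inside the groups would explain.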


### Two-Way ANOVA and N-Way ANOVA

#### Two-Way ANOVA
- Compares multiple groups based on **two independent variables**.
- Example:
    - An IT company compares the productivity of employee groups or teams based on:
        1. **Working hours**.
        2. **Project complexity**.

#### N-Way ANOVA
- Compares multiple groups based on **N independent variables**.
- Example:
    - An IT company compares the productivity of employee groups or teams based on:
        1. **Working hours**.
        2. **Project complexity**.
        3. **Employee training**.
        4. **Other employee perks and facilities**.

### Performing Non-Parametric Tests

#### Overview
- Non-parametric tests do not assume that the data follows a particular statistical distribution, which is why they are called "distribution-free" hypothesis tests.
- These tests do not estimate population parameters; they typically work on the order and ranks of observations or on category counts.
- They rely on ranking and counting methods rather than on means and variances.

#### Examples of Non-Parametric Tests
1. **Chi-Square Test**
    - Determines whether there is a significant relationship between two categorical variables from a single population.
    - Assesses whether distributions of categorical variables differ from each other.
    - Common variants:
        - Chi-Square Goodness of Fit Test (compares one variable's observed counts with expected counts).
        - Chi-Square Test for Independence (checks whether two variables are related).
    - Interpretation:
        - **Small Chi-Square Statistic**: Observed data fits well with expected data.
        - **Large Chi-Square Statistic**: Observed data does not fit well with expected data.
    - Use Cases:
        - Assessing the impact of gender on voting preference.
        - Evaluating the impact of company size on health insurance coverage.

        $$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

        Here, $O_i$ is the observed value and $E_i$ is the expected value at the $i$-th position in the contingency table.

#### Example: Chi-Square Test
- **Scenario**: A survey of 200 employees in a company was conducted to compare their highest qualification levels (e.g., High School, Higher Secondary, Graduate, Post-Graduate) with performance levels (e.g., Average, Outstanding).
- **Hypotheses**:
    - **Null Hypothesis (H₀)**: The two categorical variables are independent (employee performance is independent of the highest qualification level).
    - **Alternative Hypothesis (Hₐ)**: The two categorical variables are not independent (employee performance is not independent of the highest qualification level).

In [43]:
from scipy.stats import chi2_contingency

# Average performing employees
average=[20, 16, 13, 7]

# Outstanding performing employees
outstanding=[31, 40, 60, 13]

# contingency table
contingency_table= [average, outstanding]

# Apply Test
stat, p, dof, expected = chi2_contingency(contingency_table)
print("p-values:",p)
if p < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


p-values: 0.059155602774381255
Fail to reject the null hypothesis
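
The chi-square statistic behind this p-value is the sum of (observed − expected)² / expected over all cells of the contingency table. The sketch below computes the expected counts and the statistic by hand for the same table and compares them with `chi2_contingency`; the names `observed`, `expected_manual`, and `chi2_manual` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [20, 16, 13, 7],    # average performers
    [31, 40, 60, 13],   # outstanding performers
])

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p, dof, expected = chi2_contingency(observed)
print("manual:", chi2_manual)
print("scipy: ", chi2_scipy, "dof:", dof)
```

The degrees of freedom are (rows − 1) × (columns − 1) = 3 for this 2 × 4 table.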


### Wilcoxon Signed-Rank Test

#### Overview
- The Wilcoxon signed-rank test is a **non-parametric test** used to compare two paired samples.
- It is the **non-parametric counterpart** of the paired t-test.
- The test evaluates whether the two paired samples come from the same distribution.

#### Purpose
- Used to test the **null hypothesis** that the two paired samples belong to the same distribution.
- Commonly applied to compare two treatment observations (for example, before and after) taken on the same group of subjects.

#### Hypotheses
1. **Null Hypothesis (H₀)**: There is no difference between the dependent sample distributions.
2. **Alternative Hypothesis (Hₐ)**: There is a difference between the dependent sample distributions.

#### Example Use Case
- Comparing the difference between two treatment observations (e.g., before and after treatment) for a group of individuals.

In [47]:
from scipy.stats import wilcoxon

# Sample-1
data1 = [1, 3, 5, 7, 9]

# Sample-2 after treatment
data2 = [2, 4, 6, 8, 10]

# Apply Wilcoxon signed-rank test
stat, p = wilcoxon(data1, data2)

# Print results
print("p-values:", p)

# 0.01 or 1% is significance level or alpha
if p < 0.01:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


p-values: 0.0625
Fail to reject the null hypothesis


### Chapter 3 Summary: Fundamentals of Statistics

#### Key Concepts Covered:
1. **Attributes and Their Types**:
    - Nominal, Ordinal, and Numeric attributes.

2. **Measures of Central Tendency**:
    - Mean, Median, and Mode.

3. **Measures of Variability**:
    - Range, Interquartile Range (IQR), Variance, and Standard Deviation.

4. **Data Distribution**:
    - Skewness and Kurtosis.

5. **Relationships Between Variables**:
    - Covariance and Correlation.

6. **Inferential Statistics**:
    - Central Limit Theorem.
    - Sampling techniques.
    - Parametric and Non-Parametric Tests.

#### Hands-On Practice:
- Coding statistical concepts using Python libraries:
  - **pandas**: For data manipulation and descriptive statistics.
  - **scipy.stats**: For performing statistical tests.

