The describe() method in pandas generates descriptive statistics for DataFrame columns. It gives a quick summary of key statistical metrics such as the mean, standard deviation, and percentiles. By default, describe() summarizes only numeric columns, but it can also handle categorical data, reporting a different set of statistics depending on the data type.

Syntax: DataFrame.describe(percentiles=None, include=None, exclude=None)

Parameters:


percentiles: A list of numbers between 0 and 1, specifying which percentiles to return. The default is None, which returns the 25th, 50th, and 75th percentiles.

include: A list of data types to include in the summary. You can specify data types such as int, float, object (for strings), etc. The default is None, meaning only numeric columns are included.

exclude: A list of data types to exclude from the summary. This parameter is also None by default, meaning no types are excluded.
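The parameters above can be tried on a small frame before moving to the real dataset. The sketch below uses a made-up three-column DataFrame (`toy`, with hypothetical values) rather than the NBA data, and only shows how `percentiles` and `exclude` change which rows and columns appear.

```python
import pandas as pd

# Hypothetical toy data, not the NBA dataset used in this notebook
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "age": [22, 25, 31, 28],
    "salary": [1.5e6, 2.0e6, None, 4.0e6],
})

# Custom percentiles replace the default 25%/75% rows
custom = toy.describe(percentiles=[0.1, 0.9])

# exclude drops columns by dtype; every remaining column is summarized
no_float = toy.describe(exclude=["float"])

print(custom)
print(no_float)
```

Note that once `exclude` is set, the remaining non-numeric column (`player`) is summarized too, so the output mixes numeric and categorical rows.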

dataset: "https://media.geeksforgeeks.org/wp-content/uploads/20241129175106825136/nba.csv"

In [2]:
import pandas as pd

In [3]:
# Reading the CSV file
data = pd.read_csv('C:\\Users\\kasin\\Downloads\\nba.csv')

# Displaying the first few rows of the dataset
print("NBA Dataset:")
display(data.head())

print("\n Summary Table Generated by .describe() Method:")
display(data.describe())

NBA Dataset:


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0



 Summary Table Generated by .describe() Method:


Unnamed: 0,Number,Age,Weight,Salary
count,457.0,457.0,457.0,446.0
mean,17.678337,26.938731,221.522976,4842684.0
std,15.96609,4.404016,26.368343,5229238.0
min,0.0,19.0,161.0,30888.0
25%,5.0,24.0,200.0,1044792.0
50%,13.0,26.0,220.0,2839073.0
75%,25.0,30.0,240.0,6500000.0
max,99.0,40.0,307.0,25000000.0


Descriptive Statistics for Numerical Columns generated using .describe() Method

count: Total number of non-null values in the column.
mean: Average value of the column.
std: Standard deviation, showing how spread out the values are.
min: Minimum value in the column.
25%: 25th percentile (Q1).
50%: Median value (50th percentile).
75%: 75th percentile (Q3).
max: Maximum value in the column.
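Each row of the summary corresponds to an ordinary Series method, which can be checked directly. This is a minimal sketch on a hypothetical five-value Series (`values`); it is meant only to show which calculation sits behind each label.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 5.0, 10.0])  # hypothetical sample
summary = values.describe()

# describe() reports the sample standard deviation (ddof=1)
print(summary["std"], values.std(ddof=1))

# The 50% row is the median; the other percentiles match Series.quantile
print(summary["50%"], values.median())
print(summary["25%"], values.quantile(0.25))
```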

In [5]:
percentiles = [.20, .40, .60, .80]
include = ['object', 'float', 'int']

desc = data.describe(percentiles=percentiles, include=include)

print(desc)

                 Name                  Team      Number Position         Age  \
count             457                   457  457.000000      457  457.000000   
unique            457                    30         NaN        5         NaN   
top     Avery Bradley  New Orleans Pelicans         NaN       SG         NaN   
freq                1                    19         NaN      102         NaN   
mean              NaN                   NaN   17.678337      NaN   26.938731   
std               NaN                   NaN   15.966090      NaN    4.404016   
min               NaN                   NaN    0.000000      NaN   19.000000   
20%               NaN                   NaN    4.000000      NaN   23.000000   
40%               NaN                   NaN   10.000000      NaN   25.000000   
50%               NaN                   NaN   13.000000      NaN   26.000000   
60%               NaN                   NaN   18.600000      NaN   27.000000   
80%               NaN                   

In [6]:
desc = data["Name"].describe()

print(desc)


count               457
unique              457
top       Avery Bradley
freq                  1
Name: Name, dtype: object


count: Total number of non-null values.
unique: The number of unique values.
top: The most frequent value.
freq: The frequency of the most common value.
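The four categorical statistics can also be reproduced by hand. The sketch below assumes a short hypothetical Series of position labels (`positions`) with one missing value and rebuilds each field with `count`, `nunique` and `value_counts`.

```python
import pandas as pd

# Hypothetical position labels with one missing entry
positions = pd.Series(["PG", "SG", "SG", "C", None, "SG", "PG"])
counts = positions.value_counts()  # missing values are excluded by default

manual = {
    "count": positions.count(),    # non-null values
    "unique": positions.nunique(), # distinct non-null values
    "top": counts.idxmax(),        # most frequent value
    "freq": counts.max(),          # how often it occurs
}
desc = positions.describe()

print(manual)
print(desc)
```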

How to Calculate Skewness and Kurtosis

Skewness: 
Skewness measures how asymmetric a distribution is around its center. A distribution is one of two types:

Symmetrical: A distribution is symmetric if it looks the same to the left and right of the center point.
Asymmetrical: A distribution is asymmetric if it does not look the same to the left and right of the center point.

Distribution on the basis of skewness value:

Skewness = 0: The distribution is symmetric (a normal distribution is one example).
Skewness > 0: More weight in the right tail of the distribution (positively skewed).
Skewness < 0: More weight in the left tail of the distribution (negatively skewed).
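The sign convention is easy to confirm with `scipy.stats.skew`. The three lists below are hypothetical samples built by hand: one symmetric, one with a long right tail, and its mirror image.

```python
from scipy.stats import skew

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]         # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]   # mirror image: long left tail

skew_sym = skew(symmetric)
skew_right = skew(right_tailed)
skew_left = skew(left_tailed)

print(skew_sym, skew_right, skew_left)
```

Mirroring the data flips the sign of the skewness and leaves its magnitude unchanged.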

Kurtosis:
Kurtosis is another characteristic of the shape of a frequency distribution. It describes whether a distribution is heavy-tailed or light-tailed relative to the normal distribution.

***(A frequency distribution is a table or graph that shows how often a value occurs in a data set. It's a way to organize data so that it's easier to understand. 
How it works 
Frequency: The number of times a value occurs in a data set
Distribution: The pattern of frequencies of a variable)***

Under Pearson's definition, the kurtosis of a normal distribution is equal to 3.
For a distribution having kurtosis < 3: It is called platykurtic, and it has lighter tails than the normal distribution.
For a distribution having kurtosis > 3: It is called leptokurtic, and it tends to produce more outliers than the normal distribution.
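As a rough illustration of the two cases, the sketch below computes Pearson kurtosis (`fisher=False`, so the normal reference is 3) for two hypothetical samples: evenly spread values, and values that are almost all identical apart from two extremes.

```python
from scipy.stats import kurtosis

flat = list(range(1, 101))       # evenly spread values, light tails
spiky = [0] * 98 + [-10, 10]     # mostly identical values plus two extremes

k_flat = kurtosis(flat, fisher=False)    # platykurtic: below 3
k_spiky = kurtosis(spiky, fisher=False)  # leptokurtic: above 3

print(k_flat, k_spiky)
```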


In [9]:
# Importing scipy 
from scipy.stats import skew , kurtosis

In [10]:
# Creating a dataset 
dataset = [88, 85, 82, 97, 67, 77, 74, 86,  
           81, 95, 77, 88, 85, 76, 81] 
# Calculate the skewness 
print(skew(dataset, axis=0, bias=True).round(3))

"""It signifies that the distribution is very slightly positively skewed (close to symmetric)."""

0.029


'It signifies that the distribution is very slightly positively skewed (close to symmetric).'

scipy.stats.kurtosis(array, axis=0, fisher=True, bias=True)
Parameters:
fisher = True: Fisher's definition is used (normal distribution gives 0.0).
fisher = False: Pearson's definition is used (normal distribution gives 3.0).
bias = False: Calculations are corrected for statistical bias. With bias = True (the default), no correction is applied.
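The effect of both parameters can be seen on the same values used for `dataset` in this notebook (copied here as `sample` so the block runs on its own). This is only a sketch of how the options relate to each other.

```python
from scipy.stats import kurtosis

# Same values as `dataset` in this notebook
sample = [88, 85, 82, 97, 67, 77, 74, 86,
          81, 95, 77, 88, 85, 76, 81]

fisher_k = kurtosis(sample, fisher=True)                # excess kurtosis, normal = 0
pearson_k = kurtosis(sample, fisher=False)              # normal = 3
unbiased_k = kurtosis(sample, fisher=True, bias=False)  # bias-corrected estimate

print(fisher_k, pearson_k, unbiased_k)
```

The Pearson value is always the Fisher value plus 3, so the two definitions lead to the same interpretation once the reference point is kept in mind.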

In [12]:
# Calculate the kurtosis 
print(kurtosis(dataset, axis=0, bias=True))

"""Under Fisher's definition (normal = 0), a negative value signifies lighter tails, i.e. fewer extreme values than a normal distribution."""

-0.29271198374234686


"Under Fisher's definition (normal = 0), a negative value signifies lighter tails, i.e. fewer extreme values than a normal distribution."

Difference Between Skewness and Kurtosis:
"https://www.geeksforgeeks.org/difference-between-skewness-and-kurtosis/?ref=next_article"

# Pandas DataFrame corr() Method
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the Pandas Dataframe in Python. Any NaN values are automatically excluded. To ignore any non-numeric values, use the parameter numeric_only = True. 

DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)


Parameters:  
method:
    pearson: standard correlation coefficient 
    kendall: Kendall Tau correlation coefficient 
    spearman: Spearman rank correlation
min_periods: Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
numeric_only: Whether only the numeric values are to be operated upon or not. It is set to False by default.

What counts as a good correlation depends on the use case, but as a rough rule of thumb an absolute value of at least 0.6 (that is, 0.6 or higher, or -0.6 or lower) is often treated as a good correlation.

+1 indicates a perfect positive linear relationship,
-1 indicates a perfect negative linear relationship,
0 indicates no linear relationship.
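These three reference values can be reproduced with a tiny hand-made frame. The columns below are hypothetical and constructed so that the relationship to `x` is exactly linear (positive or negative) or has no linear trend at all.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["up"] = 2 * toy["x"] + 1       # perfect positive linear relationship
toy["down"] = -3 * toy["x"]        # perfect negative linear relationship
toy["flat"] = [1, -1, 0, -1, 1]    # hypothetical values with no linear trend

corr = toy.corr(method="pearson")
print(corr.round(3))
```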



In [22]:
# To find the correlation among
# the columns using pearson method
data.corr(method='pearson',numeric_only = True)


Unnamed: 0,Number,Age,Weight,Salary
Number,1.0,0.028724,0.206921,-0.112386
Age,0.028724,1.0,0.087183,0.213459
Weight,0.206921,0.087183,1.0,0.138321
Salary,-0.112386,0.213459,0.138321,1.0


In [24]:
# To find the correlation among the columns using kendall method
data.corr(method='kendall', numeric_only = True)

Unnamed: 0,Number,Age,Weight,Salary
Number,1.0,0.005536,0.15585,-0.075301
Age,0.005536,1.0,0.06613,0.172616
Weight,0.15585,0.06613,1.0,0.087165
Salary,-0.075301,0.172616,0.087165,1.0


In [26]:
# To find the correlation among the columns using spearman method
data.corr(method='spearman',numeric_only = True)

Unnamed: 0,Number,Age,Weight,Salary
Number,1.0,0.00849,0.226084,-0.114683
Age,0.00849,1.0,0.096163,0.266299
Weight,0.226084,0.096163,1.0,0.127628
Salary,-0.114683,0.266299,0.127628,1.0


Understanding Hypothesis Testing

Hypothesis testing compares two opposing statements about a population and uses sample data to decide which one is better supported.
To test a claim, we first take a sample from the population, analyze it, and use the results of the analysis to decide whether the claim is plausible.

Null hypothesis (H0): The null hypothesis is the starting assumption in statistics. It says there is no effect, no difference, or no relationship between groups.
e.g. A company claims its average production is 50 units per day. Then:
Null Hypothesis: H₀: The mean number of daily production (μ) = 50.

Alternative hypothesis (H1): The alternative hypothesis is the opposite of the null hypothesis. It suggests there is a difference between groups.
e.g. If the company's production is not equal to 50 units per day, the alternative hypothesis would be:
H₁: The mean number of daily production (μ) ≠ 50.

Key Terms of Hypothesis Testing
1. Level of significance: 
The significance level (α) is the probability threshold at which we reject the null hypothesis. For example, α = 0.05 corresponds to a 95% confidence level: we accept a 5% risk of rejecting a null hypothesis that is actually true.

2. P-value: 
The p-value is the probability of seeing a result at least as extreme as yours if the null hypothesis is true. If the p-value is less than or equal to the chosen significance level, you reject the null hypothesis; otherwise you fail to reject it.

3. Test Statistic: 
The test statistic is the number that helps you decide whether your result is significant. It is calculated from the sample data you collect. For example, it could be used to test whether a machine learning model performs better than a random guess.

4. Critical value: 
The critical value is a boundary or threshold that helps you decide whether your test statistic is extreme enough to reject the null hypothesis.

5. Degrees of freedom: Degrees of freedom matter when we conduct statistical tests because they describe how many values in the data are free to vary.

Types of Hypothesis Testing

1. One-Tailed Test
A one-tailed test is used when we expect a change in only one direction—either an increase or a decrease but not both.

For example, if we are analyzing data to see whether a new algorithm improves accuracy, we would only focus on whether the accuracy goes up, not down.

The test looks at just one side of the data to decide if the result is enough to reject the null hypothesis. If the data falls in the critical region on that side then we reject the null hypothesis.

There are two types of one-tailed test:

Left-Tailed (Left-Sided) Test: If the alternative hypothesis says that the true parameter value is less than the value in the null hypothesis, it is a left-tailed test. Example: H0: μ ≥ 50 and H1: μ < 50
Right-Tailed (Right-Sided) Test: If the alternative hypothesis says that the true parameter value is greater than the value in the null hypothesis, it is a right-tailed test. Example: H0: μ ≤ 50 and H1: μ > 50

2. Two-Tailed Test
A two-tailed test is used when we want to check for a significant difference in both directions—whether the result is greater than or less than a specific value. We use this test when we don’t have a specific expectation about the direction of change.

If we are testing whether a new marketing strategy affects sales we want to know if sales increase or decrease so we look at both possibilities.

Example: H0: μ = 50 and H1: μ ≠ 50
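The three kinds of test map onto the `alternative` argument of `scipy.stats.ttest_1samp` (available in recent SciPy versions; treat the version requirement as an assumption). The sketch below uses hypothetical daily production counts and tests them against μ = 50 in each direction.

```python
import numpy as np
from scipy import stats

# Hypothetical daily production counts; H0: mu = 50
production = np.array([52, 49, 53, 51, 50, 54, 48, 52, 53, 51])

p_two = stats.ttest_1samp(production, popmean=50, alternative="two-sided").pvalue
p_left = stats.ttest_1samp(production, popmean=50, alternative="less").pvalue      # H1: mu < 50
p_right = stats.ttest_1samp(production, popmean=50, alternative="greater").pvalue  # H1: mu > 50

print(p_two, p_left, p_right)
```

Because the sample mean is above 50, the right-tailed p-value is the smaller of the two one-tailed values, and the two-tailed p-value is twice that.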

For more on the differences between the two types of test, see:
"https://www.geeksforgeeks.org/difference-between-one-tailed-and-two-tailed-tests/"

Type I error: When we reject the null hypothesis although it is true. The probability of a Type I error is denoted by alpha (α). (False Positive)

Type II error: When we fail to reject the null hypothesis although it is false. The probability of a Type II error is denoted by beta (β). (False Negative)
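The link between α and the Type I error rate can be checked with a small simulation. This is a sketch under assumed settings (normal data, 2000 trials of size 30, an arbitrary seed): every sample is drawn with the null hypothesis true, so each rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
alpha = 0.05
n_trials, n = 2000, 30

# Every sample has true mean 50, so H0: mu = 50 holds in all trials
samples = rng.normal(loc=50, scale=5, size=(n_trials, n))
p_values = stats.ttest_1samp(samples, popmean=50, axis=1).pvalue

type_i_rate = np.mean(p_values <= alpha)   # fraction of false rejections
print(type_i_rate)
```

The observed rate should land close to α, which is what "α is the probability of a Type I error" means in practice.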

Example:
Imagine we want to test if a new recommendation algorithm increases user engagement.
1. Define the hypotheses:
(H₀): The new algorithm has no effect on user engagement.
(H₁): The new algorithm increases user engagement.

2. Choose a significance level (α):
It is commonly set at 0.05. This level defines the threshold for deciding whether the results are statistically significant.
It is also the probability of making a Type I error, that is, rejecting a true null hypothesis.

3. Collect and analyze data:
We gather data, which could come from user observations or an experiment.
Once collected, we analyze the data using appropriate statistical methods to calculate the test statistic.
Example: We collect data on user engagement before and after implementing the algorithm. We can also find the mean engagement scores for each group.

4. Calculate Test Statistic
The test statistic is a measure used to determine whether the sample data support rejecting the null hypothesis. The choice of test statistic depends on the type of hypothesis test being conducted; it could be a Z-test, Chi-square test, t-test, and so on. For our example we use a t-test because:
We have a smaller sample size.
The population standard deviation is unknown.
T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.
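The verbal definition above (difference in means divided by the standard error of that difference) can be written out directly. The sketch below assumes two hypothetical, independent groups of engagement scores and uses the unequal-variance (Welch) form of the standard error, then compares the result with `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Hypothetical engagement scores for two independent groups of users
old_algo = np.array([12.1, 11.4, 13.0, 12.6, 11.9, 12.3])
new_algo = np.array([13.2, 12.8, 14.1, 13.5, 12.9, 13.7])

diff_means = new_algo.mean() - old_algo.mean()

# Standard error of the difference, without assuming equal variances (Welch)
se_diff = np.sqrt(new_algo.var(ddof=1) / len(new_algo)
                  + old_algo.var(ddof=1) / len(old_algo))
t_manual = diff_means / se_diff

t_scipy = stats.ttest_ind(new_algo, old_algo, equal_var=False).statistic
print(t_manual, t_scipy)
```

If the same users were measured before and after the change, a paired test such as `scipy.stats.ttest_rel` would be the more appropriate choice.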


5. Comparing Test Statistic
Now we compare the test statistic to either the critical value or the p-value to decide whether to reject the null hypothesis or not.

Method A: 
Using critical values: 
We refer to a statistical distribution table (the t-distribution in this case) to find the critical value for the chosen significance level (α).

For a right-tailed test (for a two-tailed test, compare the absolute value of the test statistic):
If Test Statistic > Critical Value then we reject the null hypothesis.
If Test Statistic ≤ Critical Value then we fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values used to make a decision in hypothesis testing. To determine them we typically refer to a statistical distribution table, such as the normal distribution or t-distribution table, for the chosen significance level.


Method B:
Using p-values:
We can also reach a conclusion using the p-value.
If the p-value ≤ significance level (p ≤ α) then we reject the null hypothesis.
If the p-value > significance level (p > α) then we fail to reject the null hypothesis.
Example: If the p-value is 0.03 and α is 0.05 then we reject the null hypothesis because the p-value is smaller than the significance level.
Note: To determine the p-value we typically refer to a statistical distribution table, such as the normal distribution or t-distribution table, or let statistical software compute it from the test statistic.
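Instead of a printed table, both quantities can be read off `scipy.stats.t`. The sketch below assumes a right-tailed test with a hypothetical test statistic of 2.5 and 9 degrees of freedom, and shows that the critical-value rule and the p-value rule give the same decision.

```python
from scipy import stats

alpha = 0.05
t_stat, df = 2.5, 9    # hypothetical test statistic and degrees of freedom

# Right-tailed test: the critical value is the (1 - alpha) quantile of the t-distribution
critical_value = stats.t.ppf(1 - alpha, df)

# p-value: probability of a statistic at least this large under H0
p_value = stats.t.sf(t_stat, df)

reject_by_critical = bool(t_stat > critical_value)
reject_by_p = bool(p_value <= alpha)

print(critical_value, p_value, reject_by_critical, reject_by_p)
```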
 

6. Interpret the Results
Based on the comparison of the test statistic to the critical value or p-value we can conclude whether there is enough evidence to reject the null hypothesis or not.

1. Z-statistics:
It is used when the population mean and standard deviation are known. The formula of the z-statistic is given by:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

where
x̄ is the sample mean,
μ represents the population mean, 
σ is the population standard deviation
and n is the size of the sample.
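A quick worked instance of the formula, using made-up numbers (sample mean 52, population mean 50, population standard deviation 5, sample size 25), with the two-sided p-value taken from the standard normal distribution:

```python
import math
from scipy import stats

# Hypothetical inputs
sample_mean, pop_mean, pop_std, n = 52.0, 50.0, 5.0, 25

# z = (x̄ - μ) / (σ / √n)
z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p_two_sided = 2 * stats.norm.sf(abs(z))

print(z, p_two_sided)
```

Here the standard error is 5 / √25 = 1, so z = 2.0.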


2. T-Statistics
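A t-test is used to assess whether a difference in means is statistically significant, for example between two datasets (experimental vs. control groups) or between a sample and a hypothesized population mean. It is typically used when the sample is small (n < 30) and the population standard deviation is unknown. For a single sample, the t-statistic is given by:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$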

where
t = t-score,
x̄ = sample mean
μ = population mean,
s = standard deviation of the sample,
n = sample size
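The one-sample formula can be checked against `scipy.stats.ttest_1samp`. The sample below is hypothetical; note that `s` is the sample standard deviation, hence `ddof=1`.

```python
import numpy as np
from scipy import stats

sample = np.array([48, 52, 51, 47, 53, 50, 49, 54])   # hypothetical measurements
mu = 50                                               # hypothesized population mean

# t = (x̄ - μ) / (s / √n), with s the sample standard deviation
t_manual = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(len(sample)))

t_scipy = stats.ttest_1samp(sample, popmean=mu).statistic
print(t_manual, t_scipy)
```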

3. Chi-Square Test
Chi-Square Test compares the observed frequency distribution of data to the expected distribution to see if the differences are just random or meaningful using:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where:
Oij is the observed frequency in cell ij,
i, j are the row and column indices respectively,
Eij is the expected frequency in cell ij, calculated as: 
    (Row i total × Column j total) / Total observations
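The expected counts and the statistic can be computed by hand and compared with `scipy.stats.chi2_contingency`. The 2x2 table below is hypothetical, and Yates' continuity correction is switched off so that SciPy's value matches the plain formula.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed counts
observed = np.array([[20, 30],
                     [30, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# E_ij = (row i total * column j total) / total observations
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' continuity correction
chi2_scipy, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)

print(chi2_manual, chi2_scipy, p_value, dof)
```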

Real life Examples of Hypothesis Testing:

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market they need to conduct a study to see its impact on blood pressure.

Data:
Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

(H0): The new drug has no effect on blood pressure.
(H1): The new drug has an effect on blood pressure.
Significance level (α): 0.05
   


In [28]:
import numpy as np
from scipy import stats

In [30]:
before_treatment = np.array([120, 122, 118, 130, 125, 128, 115, 121, 123, 119])
after_treatment = np.array([115, 120, 112, 128, 122, 125, 110, 117, 119, 114])

In [32]:
null_hypothesis = "The new drug has no effect on blood pressure."
alternate_hypothesis = "The new drug has an effect on blood pressure."
alpha = 0.05

In [44]:
t_statistic, p_value = stats.ttest_rel(after_treatment, before_treatment)
print("t_statistic:",t_statistic)
print("p_value:", p_value)

t_statistic: -9.0
p_value: 8.538051223166285e-06


In [96]:
# m = np.subtract(after_treatment,before_treatment).mean()

In [98]:
m = np.mean(after_treatment - before_treatment)
s = np.std(after_treatment - before_treatment, ddof=1)  # using ddof=1 for sample standard deviation
n = len(before_treatment)
t_statistic_manual = m / (s / np.sqrt(n))

if p_value <= alpha:
    decision = "Reject"
else:
    decision = "Fail to reject"

# Rejecting H0 means the data show a significant difference
if decision == "Reject":
    conclusion = "There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different."
else:
    conclusion = "There is insufficient evidence to claim a significant difference in average blood pressure before and after treatment with the new drug."
print("T-statistic (from scipy):", t_statistic)
print("P-value (from scipy):", p_value)
print("T-statistic (calculated manually):", t_statistic_manual)
print(f"Decision: {decision} the null hypothesis at alpha={alpha}.")
print("Conclusion:", conclusion)

T-statistic (from scipy): -9.0
P-value (from scipy): 8.538051223166285e-06
T-statistic (calculated manually): -9.0
Decision: Reject the null hypothesis at alpha=0.05.
Conclusion: There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.


With a t-statistic of approximately -9 and an extremely small p-value, the results give strong grounds to reject the null hypothesis at a significance level of 0.05.

The results suggest that the new drug has a significant effect, lowering blood pressure.
The negative t-statistic indicates that the mean blood pressure after treatment is lower than the mean blood pressure before treatment in this paired sample.
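Since the interest here is specifically in a decrease, a one-sided paired test is a natural follow-up. The sketch below reuses the same before/after values and assumes a SciPy version in which `ttest_rel` accepts the `alternative` argument; `alternative="less"` tests whether the mean of (after - before) is below zero.

```python
import numpy as np
from scipy import stats

before_treatment = np.array([120, 122, 118, 130, 125, 128, 115, 121, 123, 119])
after_treatment = np.array([115, 120, 112, 128, 122, 125, 110, 117, 119, 114])

# Two-sided: is there any difference?  One-sided: is "after" lower than "before"?
two_sided = stats.ttest_rel(after_treatment, before_treatment)
one_sided = stats.ttest_rel(after_treatment, before_treatment, alternative="less")

print(two_sided.statistic, two_sided.pvalue)
print(one_sided.statistic, one_sided.pvalue)
```

The test statistic is the same in both cases; because it is negative, the one-sided p-value is half the two-sided one.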

Limitations of Hypothesis Testing
Although hypothesis testing is a useful technique, it has some limitations as well:

Limited Scope: Hypothesis testing focuses on specific questions or assumptions and may not capture the complexity of the problem being studied.

Data Quality Dependence: The accuracy of the results depends on the quality of the data. Poor-quality or inaccurate data can lead to incorrect conclusions.

Missed Patterns: By focusing only on testing specific hypotheses, important patterns or relationships in the data might be missed.

Context Limitations: It doesn't always consider the bigger picture, which can oversimplify results and lead to incomplete insights.

Need for Additional Methods: To get a better understanding of the data hypothesis testing should be combined with other analytical methods such as data visualization or machine learning techniques.