# 7. Statistical Functions
- `df.mean()`: Calculate the mean of each column.
- `df.median()`: Calculate the median of each column.
- `df.mode()`: Find the most frequent value(s) in each column.
- `df.min(), df.max()`: Get the minimum and maximum values.
- `df.std(), df.var()`: Sample standard deviation and variance.
- `df.corr()`: Compute the pairwise correlation matrix of the columns.
- `df.cov()`: Compute the pairwise covariance matrix of the columns.

In [1]:
import pandas as pd
import numpy as np


## `pandas.DataFrame.mean`

```python
DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs)
```

Return the mean of the values over the requested axis.

### Parameters:

- **axis** {index (0), columns (1)}  
  Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.  
  For DataFrames, specifying `axis=None` will apply the aggregation across both axes.  
  _Added in version 2.0.0._

- **skipna** `bool`, default `True`  
  Exclude NA/null values when computing the result.

- **numeric_only** `bool`, default `False`  
  Include only float, int, boolean columns. Not implemented for Series.

- **kwargs**  
  Additional keyword arguments to be passed to the function.

In [2]:
np.random.seed(42)

columns = ['EmployeeID', 'Name', 'Age', 'Department', 'YearsExperience', 'Salary']

data = {
    'EmployeeID': np.arange(1, 21),
    'Name': [f'Employee_{i}' for i in range(1, 21)],
    'Age': np.random.randint(20, 60, size=20),
    'Department': np.random.choice(['HR', 'Engineering', 'Marketing', 'Sales'], size=20),
    'YearsExperience': np.random.randint(1, 20, size=20),
    'Salary': np.random.randint(40000, 120000, size=20)
}

df = pd.DataFrame(data, columns=columns)

print(df)


    EmployeeID         Name  Age   Department  YearsExperience  Salary
0            1   Employee_1   58  Engineering               15   75920
1            2   Employee_2   48        Sales               15  107121
2            3   Employee_3   34        Sales               19  109479
3            4   Employee_4   27           HR               12   59457
4            5   Employee_5   40           HR                3  106557
5            6   Employee_6   58        Sales                5  117189
6            7   Employee_7   38  Engineering               19  118953
7            8   Employee_8   42  Engineering                7   92995
8            9   Employee_9   30           HR                9   80757
9           10  Employee_10   30        Sales                7   49692
10          11  Employee_11   43           HR               18   85758
11          12  Employee_12   55           HR                4  112409
12          13  Employee_13   59    Marketing               14  111211
13    

In [3]:
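# Cast to float first so these integer columns can hold NaN without an implicit upcast.
df = df.astype({'YearsExperience': float, 'Salary': float})
# Inject missing values to demonstrate skipna in the cells below.
df.at[5, 'Salary'] = np.nan
df.at[15, 'Salary'] = np.nan
df.at[2, 'Salary'] = np.nan
df.at[5, 'YearsExperience'] = np.nan
df.at[9, 'YearsExperience'] = np.nan
df.at[18, 'YearsExperience'] = np.nan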

In [4]:
df.select_dtypes(include=np.number).mean(skipna=False)

EmployeeID         10.5
Age                41.9
YearsExperience     NaN
Salary              NaN
dtype: float64

In [5]:
df.mean(skipna=True, numeric_only=True, axis=1)

0     18998.500000
1     26796.500000
2        18.666667
3     14875.000000
4     26651.250000
5        32.000000
6     29754.250000
7     23263.000000
8     20201.250000
9     16577.333333
10    21457.500000
11    28120.000000
12    27824.250000
13    26443.000000
14    19277.750000
15       19.666667
16    12896.750000
17    20116.250000
18    13694.666667
19    23919.000000
dtype: float64
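
The effect of `skipna` and `axis` is easier to see on a tiny frame. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its column names are illustrative assumptions, not part of the employee table above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Default skipna=True: the NaN is ignored, so "a" averages its two valid values.
col_means_skip = toy.mean()

# skipna=False: any missing value makes the whole column mean NaN.
col_means_strict = toy.mean(skipna=False)

# axis=1: one mean per row, computed over the non-missing cells of that row.
row_means = toy.mean(axis=1)

print(col_means_skip)
print(col_means_strict)
print(row_means)
```

Note that a row-wise mean only makes sense when the columns share a unit; averaging an ID, an age, and a salary, as in cell 5, is purely a demonstration of the `axis` argument.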

## `pandas.DataFrame.median`

`DataFrame.median(axis=0, skipna=True, numeric_only=False, **kwargs)`

This function returns the median of the values over the requested axis in a DataFrame.

### Parameters:

- **axis**: `{index (0), columns (1)}`
    - The axis to apply the function on.
    - For Series, this parameter is unused and defaults to 0.
    - For DataFrames, specifying `axis=None` will apply the aggregation across both axes.
    - *Added in version 2.0.0.*

- **skipna**: `bool, default True`
    - Excludes NA/null values when computing the result.

- **numeric_only**: `bool, default False`
    - Includes only float, int, and boolean columns.
    - Not implemented for Series.

- **kwargs**: `Additional keyword arguments`
    - Additional keyword arguments to pass to the function.


In [6]:
df.median(numeric_only=True)

EmployeeID            10.5
Age                   42.5
YearsExperience       12.0
Salary             85758.0
dtype: float64
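
One reason to report the median alongside the mean is its robustness to outliers. A small sketch, assuming a hypothetical `salaries` frame with one extreme value:

```python
import pandas as pd

# Hypothetical salaries: four typical values and one extreme outlier.
salaries = pd.DataFrame({"salary": [40_000, 42_000, 45_000, 47_000, 1_000_000]})

mean_salary = salaries["salary"].mean()      # pulled upward by the outlier
median_salary = salaries["salary"].median()  # middle value of the sorted data

# With an even number of values, the median is the mean of the two middle ones.
even_median = pd.Series([1, 2, 3, 4]).median()

print(mean_salary, median_salary, even_median)
```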

## `pandas.DataFrame.mode`

`DataFrame.mode(axis=0, numeric_only=False, dropna=True)`

This function calculates the mode(s) of each element along the specified axis in a DataFrame. The mode is the value that appears most frequently, and there may be multiple modes if values appear with the same frequency.

### Parameters:

- **axis**: `{0 or 'index', 1 or 'columns'}, default 0`
    - Determines the axis to iterate over while searching for the mode:
        - `0` or `'index'`: Calculates the mode of each column.
        - `1` or `'columns'`: Calculates the mode of each row.

- **numeric_only**: `bool, default False`
    - If `True`, the function is applied only to numeric columns.

- **dropna**: `bool, default True`
    - If `True`, NaN/NaT values are not considered in the mode counts.

### Returns:

- **DataFrame**: Returns a DataFrame with the modes of each column or row.

In [7]:
df['YearsExperience'].mode()

0    15.0
Name: YearsExperience, dtype: float64

In [8]:
df.mode(dropna=True, numeric_only=True)

   EmployeeID   Age  YearsExperience   Salary
0           1  43.0             15.0  41016.0
1           2   NaN              NaN  49692.0
2           3   NaN              NaN  51534.0
3           4   NaN              NaN  59457.0
4           5   NaN              NaN  75920.0
5           6   NaN              NaN  77065.0
6           7   NaN              NaN  80397.0
7           8   NaN              NaN  80757.0
8           9   NaN              NaN  85758.0
9          10   NaN              NaN  92995.0
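
The NaN padding in the output above appears because columns can have different numbers of modes: a column whose values are all unique contributes every value, while a column with a single mode fills the remaining rows with NaN. A minimal sketch on hypothetical toy data (`toy` and `s` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],            # two modes: 1 and 2
    "y": ["a", "b", "b", "b", "c"],  # one mode: "b"
})

# One row per mode; shorter columns are padded with NaN.
modes = toy.mode()

# dropna controls whether missing values may themselves be the mode.
s = pd.Series([np.nan, np.nan, np.nan, 7.0, 8.0])
modes_without_nan = s.mode()           # NaN ignored: 7.0 and 8.0 tie
modes_with_nan = s.mode(dropna=False)  # NaN is the most frequent entry

print(modes)
print(modes_without_nan)
print(modes_with_nan)
```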


## `pandas.DataFrame.max`
`DataFrame.max(axis=0, skipna=True, numeric_only=False, **kwargs)`  
Return the maximum of the values over the requested axis.

If you want the index of the maximum, use `idxmax`. This is the equivalent of the `numpy.ndarray` method `argmax`.

---

## `pandas.DataFrame.min`
`DataFrame.min(axis=0, skipna=True, numeric_only=False, **kwargs)`  
Return the minimum of the values over the requested axis.

If you want the index of the minimum, use `idxmin`. This is the equivalent of the `numpy.ndarray` method `argmin`.

### Parameters:
- **axis** `{index (0), columns (1)}`  
  Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

  For DataFrames, specifying `axis=None` will apply the aggregation across both axes.

  *Added in version 2.0.0.*

- **skipna** `bool, default True`  
  Exclude NA/null values when computing the result.

- **numeric_only** `bool, default False`  
  Include only float, int, boolean columns. Not implemented for Series.

- **kwargs**  
  Additional keyword arguments to be passed to the function.


In [9]:
df.max(numeric_only=True)

EmployeeID             20.0
Age                    59.0
YearsExperience        19.0
Salary             118953.0
dtype: float64

In [10]:
df.min()

EmployeeID                   1
Name                Employee_1
Age                         21
Department         Engineering
YearsExperience            2.0
Salary                 41016.0
dtype: object
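
Two details of the outputs above are worth a short illustration: `idxmax`/`idxmin` return index labels rather than values, and string columns are compared lexicographically, which is why `Employee_1` is the minimum name. A sketch on hypothetical toy data (the `toy` frame and its labels are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [34, 58, 27], "salary": [52_000.0, 91_000.0, 48_000.0]},
    index=["ann", "bob", "cy"],
)

col_max = toy.max()          # largest value in each column
label_of_max = toy.idxmax()  # index label where each column peaks
label_of_min = toy.idxmin()  # index label where each column bottoms out

# Strings compare character by character, so "Employee_10" sorts before "Employee_2".
smallest_name = pd.Series(["Employee_2", "Employee_10"]).min()

print(col_max)
print(label_of_max)
print(label_of_min)
print(smallest_name)
```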

## `pandas.DataFrame.std`

`DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)`  
Return sample standard deviation over requested axis.  
Normalized by N-1 by default. This can be changed using the `ddof` argument.

### Parameters:
- **axis** `{index (0), columns (1)}`  
  For Series, this parameter is unused and defaults to 0.

  **Warning:** The behavior of `DataFrame.std` with `axis=None` is deprecated. In a future version, this will reduce over both axes and return a scalar. To retain the old behavior, pass `axis=0` (or do not pass `axis`).

- **skipna** `bool`, default True  
  Exclude NA/null values. If an entire row/column is NA, the result will be NA.

- **ddof** `int`, default 1  
  Delta Degrees of Freedom. The divisor used in calculations is `N - ddof`, where N represents the number of elements.

- **numeric_only** `bool`, default False  
  Include only float, int, and boolean columns. Not implemented for Series.

### Returns:
- **Series** or **DataFrame** (if level specified)


In [11]:
df.std(numeric_only=True)

EmployeeID             5.916080
Age                   11.968819
YearsExperience        5.723404
Salary             24090.226989
dtype: float64
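
The `ddof` argument decides whether the divisor is N - 1 (sample estimate, the pandas default) or N (population formula, the NumPy default). A minimal sketch with a hypothetical `values` series chosen so the population result is a round number:

```python
import numpy as np
import pandas as pd

# Hypothetical data: mean is 5 and the squared deviations sum to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()               # ddof=1: sqrt(32 / 7)
population_std = values.std(ddof=0)     # ddof=0: sqrt(32 / 8)
numpy_std = np.std(values.to_numpy())   # NumPy uses ddof=0 unless told otherwise

print(sample_std, population_std, numpy_std)
```

When comparing pandas results with NumPy, passing the same `ddof` to both avoids small but confusing discrepancies.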

## `pandas.DataFrame.var`

`DataFrame.var(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)`  
Return unbiased variance over the requested axis.  
Normalized by N-1 by default. This can be changed using the `ddof` argument.

### Parameters:
- **axis** `{index (0), columns (1)}`  
  For Series, this parameter is unused and defaults to 0.

  **Warning:** The behavior of `DataFrame.var` with `axis=None` is deprecated. In a future version, this will reduce over both axes and return a scalar. To retain the old behavior, pass `axis=0` (or do not pass `axis`).

- **skipna** `bool`, default True  
  Exclude NA/null values. If an entire row/column is NA, the result will be NA.

- **ddof** `int`, default 1  
  Delta Degrees of Freedom. The divisor used in calculations is `N - ddof`, where N represents the number of elements.

- **numeric_only** `bool`, default False  
  Include only float, int, and boolean columns. Not implemented for Series.

### Returns:
- **Series** or **DataFrame** (if level specified)


In [12]:
df.var(numeric_only=True)

EmployeeID         3.500000e+01
Age                1.432526e+02
YearsExperience    3.275735e+01
Salary             5.803390e+08
dtype: float64
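
The variance is the square of the standard deviation, and with `skipna=True` the count N only includes non-missing values. The sketch below checks both facts by hand on hypothetical toy data (`toy`, `x`, and `manual_x` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, np.nan, 30.0, 50.0],  # one missing value
})

variances = toy.var()  # ddof=1, NaN skipped

# Manual sample variance for "x": squared deviations divided by N - 1.
x = toy["x"]
manual_x = ((x - x.mean()) ** 2).sum() / (x.count() - 1)

# For "y" only three values remain, so the divisor is 3 - 1 = 2.
y = toy["y"]
manual_y = ((y - y.mean()) ** 2).sum() / (y.count() - 1)

print(variances)
print(manual_x, manual_y)
```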

## `pandas.DataFrame.corr`

`DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)`  
Compute pairwise correlation of columns, excluding NA/null values.

### Parameters:

- **`method`**: `{'pearson', 'kendall', 'spearman'}` or callable
  Method of correlation:
  - `pearson`: standard correlation coefficient.
  - `kendall`: Kendall Tau correlation coefficient.
  - `spearman`: Spearman rank correlation.
  - `callable`: callable with input two 1D ndarrays and returning a float. Note that the returned matrix from `corr` will have 1 along the diagonals and will be symmetric regardless of the callable's behavior.

- **`min_periods`**: int, optional
  Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

- **`numeric_only`**: bool, default False
  Include only float, int, or boolean data.
  - **Added in version 1.5.0.**
  - **Changed in version 2.0.0**: The default value of `numeric_only` is now False.

### Returns:

- **DataFrame**
  Correlation matrix.


In [24]:
a,b=pd.factorize(df['Department'])

In [27]:
df['Department_ID']=a

In [29]:
df[['Age','YearsExperience','Salary','Department_ID']].corr()

                      Age  YearsExperience    Salary  Department_ID
Age              1.000000        -0.080478  0.466668      -0.130397
YearsExperience -0.080478         1.000000 -0.026241       0.052518
Salary           0.466668        -0.026241  1.000000       0.158810
Department_ID   -0.130397         0.052518  0.158810       1.000000


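The second value returned by `pd.factorize`, `b`, holds the department labels in order of first appearance, so code 0 is Engineering, 1 is Sales, 2 is HR, and 3 is Marketing. These codes are arbitrary labels rather than ordered quantities, so the `Department_ID` correlations above are not meaningful as measures of association.

Index(['Engineering', 'Sales', 'HR', 'Marketing'], dtype='object')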
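
The overview lists `df.cov()` next to `df.corr()`; the two are linked, since the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks that relation and contrasts Pearson with Spearman on hypothetical toy data where `y` is a nonlinear but monotonic function of `x` (all names here are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data: y = x squared, monotonic but not linear.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0],
})

pearson = toy.corr()                    # linear association, below 1 here
spearman = toy.corr(method="spearman")  # rank association, exactly 1 for monotonic data
cov = toy.cov()                         # pairwise sample covariance (ddof=1)

# Pearson r recovered from the covariance and the sample standard deviations.
manual_r = cov.loc["x", "y"] / (toy["x"].std() * toy["y"].std())

print(pearson)
print(spearman)
print(manual_r)
```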