# <p style='text-align: center;'> Descriptive Statistics using Pandas </p>

<b> Performing descriptive statistics operations using Pandas : </b>
    
    
- Here we use the "tips.csv" file for the statistics operations.

In [2]:
# Import pandas with the alias 'pd'.
import pandas as pd

# Load the "tips.csv" file.
df = pd.read_csv("tips.csv")

# Display the first 5 records of the DataFrame.
df.head()

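|   | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |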


## df.info() : -
- The info() method prints information about the DataFrame.


- The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.


- Note: the info() method prints its output directly, so you do not need to wrap it in print().

In [8]:
# print the information of df dataframe.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


- <b> In the above example, the output shows 244 entries (RangeIndex 0 to 243) and 7 columns with no missing values, the data type of each column (2 float64, 1 int64 and 4 object), and a memory usage of about 13.5 KB. </b>

## sum() : -
- The sum() method adds all values in each column and returns the sum for each column.


- By specifying the column axis (axis='columns'), the sum() method adds the values across the columns and returns the sum of each row.

In [14]:
# Calculating the sum() of a single column.
df["total_bill"].sum()

4827.77

- <b> In the above example the sum of the single column "total_bill" is 4827.77. </b>

In [16]:
# Calculating the sum() of multiple columns.
df[["total_bill","tip"]].sum()

total_bill    4827.77
tip            731.58
dtype: float64

- <b> In the above example the sums of the two columns are 4827.77 for "total_bill" and 731.58 for "tip". </b>
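To make the effect of the axis argument concrete, here is a minimal sketch on a small made-up DataFrame. The `bills` frame and its values are illustrative assumptions and are not taken from tips.csv.

```python
import pandas as pd

# Hypothetical data for illustration only (not from tips.csv).
bills = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0], "tip": [1.0, 2.0, 3.0]})

# Default axis: one sum per column.
column_sums = bills.sum()

# axis='columns': add across the columns, giving one sum per row.
row_sums = bills.sum(axis="columns")

print(column_sums)
print(row_sums)
```

With axis='columns', each result is the bill plus the tip for that row, which is a different quantity from the per-column totals.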

## mean() : -
- The mean is the average of all values in a sample: the sum of the values divided by the number of values.

In [17]:
# Calculating the mean() of the "total_bill" column.
df["total_bill"].mean()

19.785942622950824

- <b> In the above example the mean() of the "total_bill" column is approximately 19.786. </b>

## median() : -
- The median is the data value positioned in the middle of an ordered data set.


- The data set is ordered from the lowest to the highest value, and the median is the exact middle value (or the average of the two middle values when the number of values is even).

In [18]:
# Calculating the median() of the "total_bill" column.
df["total_bill"].median()

17.795

- <b> In the above example the median() of the "total_bill" column is 17.795
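The mean and the median can differ noticeably when the data contain an extreme value. The following minimal sketch uses a made-up Series (an assumption for illustration, not the tips data) to show this.

```python
import pandas as pd

# Hypothetical bills with one unusually large value.
bills = pd.Series([10.0, 12.0, 14.0, 16.0, 100.0])

mean_bill = bills.mean()      # pulled upward by the 100.0 value
median_bill = bills.median()  # middle value of the sorted data

print(mean_bill, median_bill)
```

This is one reason to report both statistics: in the tips data the mean of "total_bill" is also larger than its median.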

## mode() : -
- The mode is the data value that occurs most frequently. The mode() method returns a Series, because several values can be tied for the highest frequency.

In [20]:
# Calculating the mode() of the "sex" column.
df["sex"].mode()

0    Male
Name: sex, dtype: object

- <b> In the above example the mode() of the "sex" column is "Male".
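Because mode() returns a Series, it may hold more than one value. The sketch below uses a made-up Series of day names (an assumption for illustration) in which two values are tied.

```python
import pandas as pd

# Hypothetical data: "Sat" and "Sun" both occur twice.
days = pd.Series(["Sat", "Sun", "Sat", "Sun", "Fri"])

# mode() returns every value tied for the highest frequency, sorted.
modes = days.mode()

# Take the first entry when a single representative value is needed.
first_mode = modes.iloc[0]

print(modes.tolist(), first_mode)
```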

## min() : -
- When called on a DataFrame, the min() method returns a Series with the minimum value of each column.


- By specifying the column axis (axis='columns'), the min() method compares the values across the columns and returns the minimum value for each row.

In [21]:
# Calculating the min() of the "total_bill" column.
df["total_bill"].min()

3.07

- <b> In the above example the min() of the "total_bill" column is 3.07.

## max() : -
- The max() method returns a Series with the maximum value of each column.


- By specifying the column axis (axis='columns'), the max() method compares the values across the columns and returns the maximum value for each row.

In [22]:
# Calculating the max() of the "total_bill" column.
df["total_bill"].max()

50.81

- <b> In the above example the max() of the "total_bill" column is 50.81.

## Range : -
- The range is the difference between the maximum value and the minimum value of a column. Here it is computed as max() minus min().

In [23]:
# Calculating the range of the "total_bill" column.
df["total_bill"].max() - df["total_bill"].min()

47.74

- <b> In the above example the range of the "total_bill" column is 47.74.
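The same min(), max() and range calculations can be applied to several numeric columns at once. This is a minimal sketch on a made-up DataFrame; the `data` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only (not from tips.csv).
data = pd.DataFrame({
    "total_bill": [10.5, 20.0, 35.5],
    "tip": [1.5, 3.0, 5.0],
    "day": ["Sat", "Sun", "Sat"],
})

# Keep only the numeric columns, since a range is not defined for text.
numeric = data.select_dtypes(include="number")

# Range of every numeric column: max minus min.
column_range = numeric.max() - numeric.min()

# axis='columns' compares across the columns, one result per row.
row_min = numeric.min(axis="columns")
row_max = numeric.max(axis="columns")

print(column_range)
```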

## variance or var() : -
- The var() method calculates the variance for each column.


- By specifying the column axis (axis='columns'), the var() method works across the columns and returns the variance for each row.

In [24]:
# Calculating the var() of the "total_bill" column.
df["total_bill"].var()

79.25293861397826

- <b> In the above example the var() of the "total_bill" column is 79.2529.
    
    
- <b> By default, pandas calculates the sample variance (it divides by N-1, i.e. ddof=1). To get the population variance (dividing by N), pass ddof=0 : </b>

In [25]:
# Calculating the population variance of the "total_bill" column.
# ddof=0 divides by N; this equals var() * (len(df) - 1) / len(df).
round(df["total_bill"].var(ddof=0), 4)

78.9281

- <b> In the above example the population var() of the "total_bill" column is 78.9281. </b>
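The difference between the sample and population conventions can be checked on a small example. The sketch below uses a made-up list of values (an assumption for illustration) and cross-checks the result against Python's `statistics.pvariance`.

```python
import statistics

import pandas as pd

# Hypothetical values for illustration only (not from tips.csv).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = pd.Series(values)

sample_var = s.var()            # default ddof=1: divides by N-1
population_var = s.var(ddof=0)  # ddof=0: divides by N

# Rescaling the sample variance by (N-1)/N gives the same population value.
# Note that N is the number of rows, len(s), not the length of a column name.
rescaled = sample_var * (len(s) - 1) / len(s)

print(sample_var, population_var, rescaled, statistics.pvariance(values))
```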

## Standard deviation or std() : -
- The std() method returns the sample standard deviation over the requested axis.


- The standard deviation quantifies the amount of variation or dispersion in a set of data values. By default, pandas normalizes it by N-1 (ddof=1).

In [32]:
# Calculating the std() of the "total_bill" column.
df["total_bill"].std()

8.902411954856856

- <b> In the above example the std() of the "total_bill" column is 8.9024.


- <b> By default, pandas calculates the sample standard deviation (ddof=1). To get the population standard deviation, take the square root of the population variance, as follows : </b>

In [45]:
# import the math module for calculating sqrt.
import math

# Calculating the population variance of the "total_bill" column (ddof=0 divides by N).
var = df["total_bill"].var(ddof=0)

# Calculate the population standard deviation without using std():
# apply the square root to the population variance.
# (df["total_bill"].std(ddof=0) gives the same result.)
population_std = math.sqrt(var)

# Display the population standard deviation, rounded to 4 decimals.
round(population_std, 4)

8.8842

- <b> In the above example the population std() of the "total_bill" column is 8.8842. </b>
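As a quick check that the two routes to the population standard deviation agree, here is a minimal sketch on made-up values (an assumption for illustration, not the tips data).

```python
import math

import pandas as pd

# Hypothetical values for illustration only.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # default ddof=1
population_std = values.std(ddof=0)  # population standard deviation

# Square root of the population variance gives the same number.
from_variance = math.sqrt(values.var(ddof=0))

print(sample_std, population_std, from_variance)
```

The sample value is always slightly larger than the population value, and the gap shrinks as the number of rows grows.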

## percentile or quantile() : -
- The quantile() method calculates the quantile of the values along a given axis. By default (axis=0), it returns one quantile per column.


- By specifying the column axis (axis='columns'), the quantile() method works across the columns and returns the quantile for each row.

In [46]:
# Calculating the 25th percentile or (Q1) of the "total_bill" column.
df["total_bill"].quantile(0.25)

13.3475

- <b> In the above example the 25th percentile or (Q1) of the "total_bill" column is 13.3475.

In [47]:
# Calculating the 50th percentile or (Q2) or median of the "total_bill" column.
df["total_bill"].quantile(0.50)

17.795

- <b> In the above example the 50th percentile or (Q2) or median of the "total_bill" column is 17.795. </b>

In [48]:
# Calculating the 75th percentile or (Q3) of the "total_bill" column.
df["total_bill"].quantile(0.75)

24.127499999999998

- <b> In the above example the 75th percentile or (Q3) of the "total_bill" column is 24.1275. </b>
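Several quantiles can also be requested in one call by passing a list. The sketch below uses a made-up Series (an assumption for illustration) and relies on the default linear interpolation of quantile().

```python
import pandas as pd

# Hypothetical values for illustration only.
bills = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Passing a list returns a Series indexed by the requested quantiles.
quartiles = bills.quantile([0.25, 0.50, 0.75])

# With an even number of values, the median falls between two data points.
median_even = pd.Series([1.0, 2.0, 3.0, 4.0]).quantile(0.50)

print(quartiles)
print(median_even)
```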

## interquartile range (IQR) : -
- The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).


- The IQR can help to determine potential outliers.

In [49]:
# Calculating the interquartile range (IQR) of the "total_bill" column.
# IQR = Q3 - Q1
df["total_bill"].quantile(0.75) - df["total_bill"].quantile(0.25)

10.779999999999998

- <b> In the above example the interquartile range (IQR) of the "total_bill" column is 10.78. </b>
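One common way to use the IQR for outlier detection is the 1.5 × IQR rule, which is a convention rather than something defined in this notebook. The following minimal sketch applies it to made-up values (an assumption for illustration).

```python
import pandas as pd

# Hypothetical bills with one unusually large value.
bills = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 60.0])

q1 = bills.quantile(0.25)
q3 = bills.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; values outside them are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(iqr, lower_fence, upper_fence)
print(outliers)
```

The multiplier 1.5 is a widely used default; a larger multiplier flags fewer points.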

## skewness or skew() : -
- The skew() method calculates the skew for each column.


- By specifying the column axis (axis='columns'), the skew() method works across the columns and returns the skew of each row.

In [50]:
# Calculating the skewness or skew() of the "total_bill" column.
df["total_bill"].skew()

1.1332130376158205

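- <b> If skewness = 0, the distribution is perfectly symmetric (a normal distribution has zero skewness, but zero skewness alone does not prove normality). </b>


- <b> If skewness lies between -1 and +1, then by a common rule of thumb the distribution is approximately symmetric to moderately skewed. </b>


- <b> If skewness < -1, then it is strongly left skewed. </b>


- <b> If skewness > +1, then it is strongly right skewed. </b>

- <b> In the above example the "total_bill" column skewness is about 1.13, which is greater than 1, so it is right skewed. </b>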
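The sign of skew() can be seen on small made-up Series (assumptions for illustration): a symmetric one, one with a long right tail, and its mirror image.

```python
import pandas as pd

# Hypothetical values for illustration only.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 20.0])
left_skewed = -right_skewed  # mirror image: long tail on the left

symmetric_skew = symmetric.skew()  # approximately 0
right_skew = right_skewed.skew()   # positive: tail on the right
left_skew = left_skewed.skew()     # negative: tail on the left

print(symmetric_skew, right_skew, left_skew)
```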

## kurtosis or kurt() : -
- Kurtosis is a measure of the tailedness of a distribution, that is, how often outliers occur. It is often loosely described as peakedness, although it is driven mainly by the tails.


- There are 3 types of kurtosis:
     1) leptokurtic (high & thin) : Distributions with high kurtosis (fat tails) are leptokurtic.


     2) mesokurtic (normal in shape) : Distributions with medium kurtosis (medium tails) are mesokurtic.


     3) platykurtic (flat & spread out) : Distributions with low kurtosis (thin tails) are platykurtic.
     
     
- The sign of the excess kurtosis determines the type. Note that pandas kurt() already returns excess kurtosis (Fisher's definition, where a normal distribution scores 0), so its result can be compared with 0 directly :

    1) Positive excess kurtosis – when excess kurtosis, given by (kurtosis – 3), is positive, then the distribution has a sharp peak and is called a leptokurtic distribution.


    2) Negative excess kurtosis – when excess kurtosis, given by (kurtosis – 3), is negative, then the distribution has a flat peak and is called a platykurtic distribution.


    3) Zero excess kurtosis – when excess kurtosis, given by (kurtosis – 3), is zero, then the distribution has the same kurtosis as a normal distribution and is called a mesokurtic distribution.

In [53]:
# Calculating the kurtosis or kurt() of the "total_bill" column.
df["total_bill"].kurt()

1.2184840156638854

- <b> In the above example the kurt() of the "total_bill" column is about 1.218. Because kurt() returns excess kurtosis and the value is positive, the distribution is leptokurtic. </b>
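To illustrate the sign of the value returned by kurt(), the sketch below compares two made-up Series (assumptions for illustration): evenly spread values with thin tails, and values concentrated at zero with two extreme points.

```python
import pandas as pd

# Hypothetical data: evenly spread values, no extreme points.
flat = pd.Series(range(1, 11), dtype="float64")

# Hypothetical data: mostly zeros plus two extreme values (heavy tails).
heavy_tailed = pd.Series([0.0] * 18 + [-10.0, 10.0])

# kurt() returns excess kurtosis, so 0 is the reference point.
flat_kurt = flat.kurt()           # negative: platykurtic
heavy_kurt = heavy_tailed.kurt()  # positive: leptokurtic

print(flat_kurt, heavy_kurt)
```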

## covariance or cov() : -
- The cov() method computes the pairwise covariance of columns, excluding NA/null values.


- The returned DataFrame is the covariance matrix of the columns: the diagonal holds the variance of each column, and the off-diagonal entries hold the covariance of each pair of columns.

In [3]:
# Calculating the covariance or cov() of the "total_bill" and "tip" columns.
df[["total_bill","tip"]].cov()

|            | total_bill | tip |
|------------|------------|-----|
| total_bill | 79.252939 | 8.323502 |
| tip        | 8.323502 | 1.914455 |


- <b> The above matrix shows that the covariance of the "total_bill" and "tip" columns is 8.3235. The diagonal entries are the variances of "total_bill" (79.2529) and "tip" (1.9145). </b>
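The structure of a covariance matrix can be verified on a small made-up DataFrame (the `data` frame and its values are assumptions for illustration): the diagonal matches var(), and the off-diagonal entry matches the pairwise Series.cov().

```python
import pandas as pd

# Hypothetical data for illustration only (not from tips.csv).
data = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 4.0, 8.0],
})

cov_matrix = data.cov()

# Diagonal entry: the sample variance of the column.
bill_variance = data["total_bill"].var()

# Off-diagonal entry: the sample covariance of the pair.
pair_covariance = data["total_bill"].cov(data["tip"])

print(cov_matrix)
```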

## correlation or corr() : -
- The corr() method computes the pairwise correlation of the columns in a DataFrame (Pearson correlation by default), excluding NA/null values.


- The returned DataFrame is the correlation matrix of the columns. Each value lies between -1 and +1, and the diagonal is always 1 because every column is perfectly correlated with itself.

In [4]:
# Calculating the correlation or corr() of the "total_bill" and "tip" columns.
df[["total_bill","tip"]].corr()

|            | total_bill | tip |
|------------|------------|-----|
| total_bill | 1.0 | 0.675734 |
| tip        | 0.675734 | 1.0 |


- <b> The above matrix shows that the correlation of the "total_bill" and "tip" columns is 0.6757, a moderately strong positive linear relationship. </b>
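Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this relationship on a made-up DataFrame (an assumption for illustration, not the tips data).

```python
import pandas as pd

# Hypothetical data for illustration only.
data = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 4.0, 8.0],
})

corr_matrix = data.corr()

# Manual Pearson correlation: cov(x, y) / (std(x) * std(y)).
covariance = data["total_bill"].cov(data["tip"])
manual_corr = covariance / (data["total_bill"].std() * data["tip"].std())

print(corr_matrix)
print(manual_corr)
```

Unlike covariance, the correlation has no units, which makes values from different column pairs directly comparable.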

## describe or df.describe() : -
- The describe() method returns a statistical summary of the data in the DataFrame.


- If the DataFrame contains numerical data, the summary contains the following information for each column :


            1) count - The number of non-null values.
            2) mean  - The average (mean) value.
            3) std   - The sample standard deviation.
            4) min   - The minimum value.
            5) 25%   - The 25th percentile (Q1).
            6) 50%   - The 50th percentile (median).
            7) 75%   - The 75th percentile (Q3).
            8) max   - The maximum value.

In [5]:
# Apply the describe() on df dataframe.
df.describe()

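|       | total_bill | tip | size |
|-------|------------|-----|------|
| count | 244.0 | 244.0 | 244.0 |
| mean  | 19.785943 | 2.998279 | 2.569672 |
| std   | 8.902412 | 1.383638 | 0.9511 |
| min   | 3.07 | 1.0 | 1.0 |
| 25%   | 13.3475 | 2.0 | 2.0 |
| 50%   | 17.795 | 2.9 | 2.0 |
| 75%   | 24.1275 | 3.5625 | 3.0 |
| max   | 50.81 | 10.0 | 6.0 |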


- <b> In the above example the "total_bill" and "tip" columns are continuous variables and the "size" column is a discrete count variable; the remaining columns hold categorical data. By default, the describe() function summarizes only the numeric columns. </b>

- <b> When describe() is applied to categorical columns, it returns count, unique, top (mode) and freq, and the remaining statistics are NaN. </b>
    
    
- <b> To include the categorical columns alongside the numeric ones, pass "include='all'". </b>

In [6]:
# Apply the describe() on df dataframe with "include='all'".
df.describe(include = "all")

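|        | total_bill | tip | sex | smoker | day | time | size |
|--------|------------|-----|-----|--------|-----|------|------|
| count  | 244.0 | 244.0 | 244 | 244 | 244 | 244 | 244.0 |
| unique | NaN | NaN | 2 | 2 | 4 | 2 | NaN |
| top    | NaN | NaN | Male | No | Sat | Dinner | NaN |
| freq   | NaN | NaN | 157 | 151 | 87 | 176 | NaN |
| mean   | 19.785943 | 2.998279 | NaN | NaN | NaN | NaN | 2.569672 |
| std    | 8.902412 | 1.383638 | NaN | NaN | NaN | NaN | 0.9511 |
| min    | 3.07 | 1.0 | NaN | NaN | NaN | NaN | 1.0 |
| 25%    | 13.3475 | 2.0 | NaN | NaN | NaN | NaN | 2.0 |
| 50%    | 17.795 | 2.9 | NaN | NaN | NaN | NaN | 2.0 |
| 75%    | 24.1275 | 3.5625 | NaN | NaN | NaN | NaN | 3.0 |
| max    | 50.81 | 10.0 | NaN | NaN | NaN | NaN | 6.0 |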


- <b> In the above example, the categorical columns ("sex", "smoker", "day", "time") report count, unique, top (mode) and freq, while their numeric statistics are NaN. Conversely, unique, top and freq are NaN for the numeric columns. </b>
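describe() can also be applied to a single column, and the reported percentiles can be changed with the `percentiles` parameter. This is a minimal sketch on a made-up DataFrame (the `data` frame and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data for illustration only (not from tips.csv).
data = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "day": ["Sat", "Sun", "Sat", "Fri", "Sat"],
})

# Summary of one categorical column: count, unique, top (mode) and freq.
day_summary = data["day"].describe()

# Numeric summary with the 10th and 90th percentiles requested explicitly.
bill_summary = data["total_bill"].describe(percentiles=[0.1, 0.9])

print(day_summary)
print(bill_summary)
```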