# Basic Stats Computation in Python

The first step towards data science is understanding the basics of statistics, so that data can be analysed efficiently. Here we use simple Python commands and libraries to analyse data and compute common statistical parameters: mean, median, mode, variance and standard deviation.

## Computation on an arbitrary list using Numpy

In [1]:
list1 = [1,2,3,4,7,6,9,8,6,3,2,9,0,1,6,7,8,9,4,4,5,2,10,4,1,5,2,6,5]

In [2]:
import numpy as np
from scipy import stats

### 1. Mean

In [3]:
#mean
mean_list1 = np.mean(list1)
mean_list1

4.793103448275862

### 2. Median

In [4]:
#median
median_list1 = np.median(list1)
median_list1

5.0
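As a cross-check on the two NumPy results above, the mean and median can also be computed by hand with plain Python. This is only a sketch of the textbook definitions; the names `manual_mean`, `manual_median`, `ordered` and `mid` are illustrative and not part of the original notebook.

```python
list1 = [1, 2, 3, 4, 7, 6, 9, 8, 6, 3, 2, 9, 0, 1, 6, 7, 8, 9, 4, 4, 5, 2, 10, 4, 1, 5, 2, 6, 5]

# Mean: sum of the values divided by the number of values.
manual_mean = sum(list1) / len(list1)

# Median: middle element of the sorted data
# (average of the two middle elements when the length is even).
ordered = sorted(list1)
n = len(ordered)
mid = n // 2
if n % 2 == 1:
    manual_median = ordered[mid]
else:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2

print(manual_mean)
print(manual_median)
```

Both values should agree with `np.mean(list1)` and `np.median(list1)`.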

### 3. Mode

In [5]:
#mode
mode_list1 = stats.mode(list1)
print(mode_list1)

ModeResult(mode=array([2]), count=array([4]))
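Note that `list1` is multimodal: the values 2, 4 and 6 each occur four times, and `stats.mode` reports only the smallest of the tied values. Depending on the installed SciPy version, the result may also print with scalars instead of arrays. A minimal sketch for listing every mode, using only the standard library (the names `counts`, `top_count` and `all_modes` are illustrative):

```python
from collections import Counter
from statistics import multimode

list1 = [1, 2, 3, 4, 7, 6, 9, 8, 6, 3, 2, 9, 0, 1, 6, 7, 8, 9, 4, 4, 5, 2, 10, 4, 1, 5, 2, 6, 5]

# Count how often each value occurs, then keep every value that reaches the top count.
counts = Counter(list1)
top_count = max(counts.values())
all_modes = sorted(value for value, count in counts.items() if count == top_count)

print(all_modes, top_count)

# statistics.multimode returns the tied values in order of first appearance.
print(multimode(list1))
```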


### 4. Variance

In [6]:
#variance (population variance: np.var divides by n, i.e. ddof=0 by default)
var_list1 = np.var(list1)
var_list1

7.681331747919144

### 5. Standard Deviation

In [7]:
#standard deviation (population SD: np.std uses ddof=0 by default)
sd_list1 = np.std(list1)
sd_list1

2.7715215582634647
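The variance and standard deviation above are the population versions, which divide by n. If the list is treated as a sample, the divisor becomes n - 1, which NumPy exposes through the `ddof` parameter. The sketch below compares the two and cross-checks them against the standard-library `statistics` module; the variable names are illustrative.

```python
import statistics

import numpy as np

list1 = [1, 2, 3, 4, 7, 6, 9, 8, 6, 3, 2, 9, 0, 1, 6, 7, 8, 9, 4, 4, 5, 2, 10, 4, 1, 5, 2, 6, 5]

pop_var = np.var(list1)             # divides by n (ddof=0)
sample_var = np.var(list1, ddof=1)  # divides by n - 1
pop_sd = np.std(list1)              # square root of the population variance

print(pop_var, sample_var, pop_sd)

# The standard library makes the same distinction with separate functions.
print(statistics.pvariance(list1), statistics.variance(list1))
```

The sample variance is always slightly larger than the population variance for the same data, and the difference shrinks as n grows.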

## Computation on an arbitrary dataframe

### 1. Create DataFrame using Pandas library

In [8]:
buffTail = [10, 1, 37, 5, 12]
GarderBee = [8, 3, 19, 6, 4]
RedTail = [18, 9, 1, 2, 4]
HoneyBee = [12, 13, 16, 9, 10]
CarderBee = [8, 27, 6, 32, 23]

In [9]:
data = {
    'buffTail' : buffTail,
    'GarderBee' : GarderBee,
    'RedTail' : RedTail,
    'HoneyBee' : HoneyBee, 
    'CarderBee' : CarderBee}
data

{'buffTail': [10, 1, 37, 5, 12],
 'GarderBee': [8, 3, 19, 6, 4],
 'RedTail': [18, 9, 1, 2, 4],
 'HoneyBee': [12, 13, 16, 9, 10],
 'CarderBee': [8, 27, 6, 32, 23]}

In [10]:
import pandas as pd 
df = pd.DataFrame(data, index = ['Thistle', 
                                 'Viper Bugloss', 
                                 'Golden Rain', 
                                 'Yellow Alfalfa', 
                                 'Blackberry'])
df

| | buffTail | GarderBee | RedTail | HoneyBee | CarderBee |
|---|---|---|---|---|---|
| Thistle | 10 | 8 | 18 | 12 | 8 |
| Viper Bugloss | 1 | 3 | 9 | 13 | 27 |
| Golden Rain | 37 | 19 | 1 | 16 | 6 |
| Yellow Alfalfa | 5 | 6 | 2 | 9 | 32 |
| Blackberry | 12 | 4 | 4 | 10 | 23 |


In [11]:
import numpy as np
display(df.to_numpy())

array([[10,  8, 18, 12,  8],
       [ 1,  3,  9, 13, 27],
       [37, 19,  1, 16,  6],
       [ 5,  6,  2,  9, 32],
       [12,  4,  4, 10, 23]], dtype=int64)

In [12]:
display(df)

| | buffTail | GarderBee | RedTail | HoneyBee | CarderBee |
|---|---|---|---|---|---|
| Thistle | 10 | 8 | 18 | 12 | 8 |
| Viper Bugloss | 1 | 3 | 9 | 13 | 27 |
| Golden Rain | 37 | 19 | 1 | 16 | 6 |
| Yellow Alfalfa | 5 | 6 | 2 | 9 | 32 |
| Blackberry | 12 | 4 | 4 | 10 | 23 |


### 2. Accessing particular row of DataFrame

In [13]:
display(df.loc[['Blackberry'],:])

| | buffTail | GarderBee | RedTail | HoneyBee | CarderBee |
|---|---|---|---|---|---|
| Blackberry | 12 | 4 | 4 | 10 | 23 |


In [14]:
display(df.loc[['Golden Rain', 'Yellow Alfalfa'],:])

| | buffTail | GarderBee | RedTail | HoneyBee | CarderBee |
|---|---|---|---|---|---|
| Golden Rain | 37 | 19 | 1 | 16 | 6 |
| Yellow Alfalfa | 5 | 6 | 2 | 9 | 32 |


### 3. Accessing particular column of DataFrame

In [15]:
display(df.loc[:,['RedTail']])

| | RedTail |
|---|---|
| Thistle | 18 |
| Viper Bugloss | 9 |
| Golden Rain | 1 |
| Yellow Alfalfa | 2 |
| Blackberry | 4 |
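In the row and column lookups above, the labels are passed as lists, which is why each result is still a DataFrame. Passing a single label instead returns a Series, and passing one row label plus one column label returns a scalar. A short sketch of the difference, rebuilding the same DataFrame so it runs on its own (the result variable names are illustrative):

```python
import pandas as pd

data = {
    'buffTail': [10, 1, 37, 5, 12],
    'GarderBee': [8, 3, 19, 6, 4],
    'RedTail': [18, 9, 1, 2, 4],
    'HoneyBee': [12, 13, 16, 9, 10],
    'CarderBee': [8, 27, 6, 32, 23],
}
df = pd.DataFrame(data, index=['Thistle', 'Viper Bugloss', 'Golden Rain',
                               'Yellow Alfalfa', 'Blackberry'])

col_frame = df.loc[:, ['RedTail']]        # list of labels -> DataFrame with one column
col_series = df['RedTail']                # single label -> Series
row_series = df.loc['Blackberry']         # single row label -> Series indexed by column name
cell = df.loc['Golden Rain', 'buffTail']  # row label and column label -> scalar

print(type(col_frame).__name__, col_frame.shape)
print(type(col_series).__name__, col_series.shape)
print(row_series['CarderBee'], cell)
```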


### 4. Sort w.r.t. a particular Column

In [16]:
df.sort_values(by=['buffTail'], ascending=False)

| | buffTail | GarderBee | RedTail | HoneyBee | CarderBee |
|---|---|---|---|---|---|
| Golden Rain | 37 | 19 | 1 | 16 | 6 |
| Blackberry | 12 | 4 | 4 | 10 | 23 |
| Thistle | 10 | 8 | 18 | 12 | 8 |
| Yellow Alfalfa | 5 | 6 | 2 | 9 | 32 |
| Viper Bugloss | 1 | 3 | 9 | 13 | 27 |
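`sort_values` returns a new, sorted DataFrame and leaves `df` itself unchanged, so the result has to be assigned if it is needed later. The sketch below illustrates this and also shows `nlargest` as a shortcut when only the top few rows are wanted; `sorted_df` and `top2` are illustrative names.

```python
import pandas as pd

data = {
    'buffTail': [10, 1, 37, 5, 12],
    'GarderBee': [8, 3, 19, 6, 4],
    'RedTail': [18, 9, 1, 2, 4],
    'HoneyBee': [12, 13, 16, 9, 10],
    'CarderBee': [8, 27, 6, 32, 23],
}
df = pd.DataFrame(data, index=['Thistle', 'Viper Bugloss', 'Golden Rain',
                               'Yellow Alfalfa', 'Blackberry'])

# Sorting produces a new DataFrame; the original row order is preserved in df.
sorted_df = df.sort_values(by='buffTail', ascending=False)
print(list(sorted_df.index))
print(list(df.index))

# The two rows with the largest buffTail counts.
top2 = df.nlargest(2, 'buffTail')
print(top2)
```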


### 5. Create a sample out of the DataFrame

In [17]:
df1 = df.sample(n = 3, random_state = 1)
df1

| | buffTail | GarderBee | RedTail | HoneyBee | CarderBee |
|---|---|---|---|---|---|
| Golden Rain | 37 | 19 | 1 | 16 | 6 |
| Viper Bugloss | 1 | 3 | 9 | 13 | 27 |
| Blackberry | 12 | 4 | 4 | 10 | 23 |
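The `random_state` argument fixes the seed of the random draw, so the same rows come back every time the cell is run. Without it, each run may pick different rows. A small sketch of that behaviour, which also shows sampling by fraction with `frac` (the names `s1`, `s2` and `two_fifths` are illustrative):

```python
import pandas as pd

data = {
    'buffTail': [10, 1, 37, 5, 12],
    'GarderBee': [8, 3, 19, 6, 4],
    'RedTail': [18, 9, 1, 2, 4],
    'HoneyBee': [12, 13, 16, 9, 10],
    'CarderBee': [8, 27, 6, 32, 23],
}
df = pd.DataFrame(data, index=['Thistle', 'Viper Bugloss', 'Golden Rain',
                               'Yellow Alfalfa', 'Blackberry'])

# Same seed, same rows in the same order.
s1 = df.sample(n=3, random_state=1)
s2 = df.sample(n=3, random_state=1)
print(s1.equals(s2))

# Sampling a fraction of the rows instead of a fixed count (40% of 5 rows = 2 rows).
two_fifths = df.sample(frac=0.4, random_state=1)
print(len(two_fifths))
```

By default rows are drawn without replacement, so no row appears twice in a sample.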


### 6. Mean of each Attribute (Column) using Numpy

In [18]:
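#Avg of each column in the sample
#axis=0 asks for one mean per column explicitly; on newer pandas versions
#np.mean(df1) without an axis reduces over the whole DataFrame to a single number
meanData = np.mean(df1, axis=0)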
display(meanData)

buffTail     16.666667
GarderBee     8.666667
RedTail       4.666667
HoneyBee     13.000000
CarderBee    18.666667
dtype: float64

### 7. Mean of each Attribute using Pandas

In [19]:
display(df1.mean())

buffTail     16.666667
GarderBee     8.666667
RedTail       4.666667
HoneyBee     13.000000
CarderBee    18.666667
dtype: float64

### 8. Max value using Pandas

In [20]:
#Largest value in sample
maxData = df1.max()
display(maxData)

buffTail     37
GarderBee    19
RedTail       9
HoneyBee     16
CarderBee    27
dtype: int64

### 9. Min value using Pandas

In [21]:
#Smallest value in sample
minData = df1.min()
display(minData)

buffTail      1
GarderBee     3
RedTail       1
HoneyBee     10
CarderBee     6
dtype: int64

### 10. Counting the number of items (rows)

In [22]:
#list number of items in the sample
item_count = len(df1)
print(item_count)

3
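The mean, minimum, maximum and row count computed one at a time above can also be collected in a single table with `DataFrame.agg`. The sketch below rebuilds the sample by selecting the three rows shown in the sample output explicitly, which is an assumption made so that it runs on its own without depending on the random draw; `summary` is an illustrative name.

```python
import pandas as pd

data = {
    'buffTail': [10, 1, 37, 5, 12],
    'GarderBee': [8, 3, 19, 6, 4],
    'RedTail': [18, 9, 1, 2, 4],
    'HoneyBee': [12, 13, 16, 9, 10],
    'CarderBee': [8, 27, 6, 32, 23],
}
df = pd.DataFrame(data, index=['Thistle', 'Viper Bugloss', 'Golden Rain',
                               'Yellow Alfalfa', 'Blackberry'])

# The same three rows that appear in the sample output.
df1 = df.loc[['Golden Rain', 'Viper Bugloss', 'Blackberry']]

# One row per statistic, one column per attribute.
summary = df1.agg(['mean', 'min', 'max', 'count'])
print(summary)
```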


### 11. Standard Deviation using Numpy

In [23]:
#Standard Deviation of the sample
#using numpy (default ddof=0; axis=0 asks for one value per column explicitly)
stdDataN = np.std(df1, axis=0)
display(stdDataN)

buffTail     15.062831
GarderBee     7.318166
RedTail       3.299832
HoneyBee      2.449490
CarderBee     9.104334
dtype: float64

### 12. Standard Deviation using Pandas

In [24]:
#Standard Deviation of the sample
#using pandas (default ddof=1)
stdDataP = df1.std()
display(stdDataP)

buffTail     18.448125
GarderBee     8.962886
RedTail       4.041452
HoneyBee      3.000000
CarderBee    11.150486
dtype: float64

As we can see above, the two methods yield different values for the standard deviation. The reason for this mismatch is the default delta degrees of freedom (`ddof`) in each library: NumPy's `std()` uses `ddof = 0` and divides by n (population standard deviation), while pandas' `std()` uses `ddof = 1` and divides by n - 1 (sample standard deviation). We can make the two agree by setting the `ddof` parameter of either `std()` function explicitly.

In [25]:
#Standard Deviation of the sample
#using numpy like pandas (ddof=1 divides by n - 1; axis=0 gives one value per column)
sdn= np.std(df1, ddof = 1, axis = 0)
display(sdn)

buffTail     18.448125
GarderBee     8.962886
RedTail       4.041452
HoneyBee      3.000000
CarderBee    11.150486
dtype: float64

In [26]:
#Standard Deviation of the sample
#using pandas like numpy
sdp = df1.std(ddof = 0)
display(sdp)

buffTail     15.062831
GarderBee     7.318166
RedTail       3.299832
HoneyBee      2.449490
CarderBee     9.104334
dtype: float64
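To see where the two sets of numbers come from, the calculation can be written out by hand for a single column. The sketch below uses the three `buffTail` values from the sample output (37, 1, 12) and changes only the divisor; the variable names are illustrative.

```python
import numpy as np

# buffTail values of the three sampled rows.
values = np.array([37, 1, 12])
n = len(values)

# Sum of squared deviations from the mean.
squared_dev = ((values - values.mean()) ** 2).sum()

pop_sd = np.sqrt(squared_dev / n)           # ddof=0: NumPy's default
sample_sd = np.sqrt(squared_dev / (n - 1))  # ddof=1: pandas' default

print(pop_sd, sample_sd)

# The two differ by a constant factor of sqrt(n / (n - 1)).
print(sample_sd / pop_sd, np.sqrt(n / (n - 1)))
```

With only three rows the factor is large (about 1.22), which is why the mismatch is so visible here; for large n the two values are nearly identical.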

### 13. Std calculation by removing na values

In [27]:
#SD calculation by removing NA if any
#using pandas (skipna=True is already the default; the sample has no NA values, so the result is unchanged)
df1.std(skipna = True)

buffTail     18.448125
GarderBee     8.962886
RedTail       4.041452
HoneyBee      3.000000
CarderBee    11.150486
dtype: float64

In [28]:
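#SD calculation by removing NA if any
#using numpy
#np.nanstd converts df1 to a plain array and, with its defaults (axis=None, ddof=0),
#returns a single population SD over all 15 values instead of one value per column
sdn= np.nanstd(df1)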
display(sdn)

10.117092248050106

In [29]:
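#same calculation after replacing any zeros with NaN so that they are ignored;
#the sample contains no zeros, so the result is unchanged
np.nanstd(np.where(np.isclose(df1,0), np.nan, df1))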

10.117092248050106
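Because the sample contains no missing values, the cells above cannot show what the NA handling actually does. The sketch below uses a small made-up DataFrame (`demo`, not part of the original data) with one `NaN`, and passes `axis=0` and `ddof=1` to `np.nanstd` so that it returns one value per column and matches the pandas defaults.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column 'a'.
demo = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                     'b': [2.0, 4.0, 6.0, 8.0]})

# pandas skips NaN by default and uses ddof=1.
pandas_sd = demo.std()

# np.nanstd also skips NaN; axis=0 and ddof=1 make it match pandas column by column.
numpy_sd = np.nanstd(demo.to_numpy(), axis=0, ddof=1)

# Plain np.std does not skip NaN, so column 'a' comes back as nan.
plain_sd = np.std(demo.to_numpy(), axis=0, ddof=1)

print(pandas_sd.to_numpy())
print(numpy_sd)
print(plain_sd)
```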