<a href="https://colab.research.google.com/github/MK316/statistics/blob/main/Descriptive_stat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Descriptive statistics [wikipedia](https://en.wikipedia.org/wiki/Descriptive_statistics)
- Descriptive statistics provide simple summaries about the sample and about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may either form the basis of the initial description of the data as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a particular investigation.

### Sample data to read (csv file) & Data shape (rows, columns)

In [1]:
import pandas as pd

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/MK316/statistics/main/iris.csv')
data.head()

- Summarize the data with data.describe()
- The summary includes count, mean, std (standard deviation), min, max, and the quartiles (25%, 50%, 75%)

In [91]:
data.describe()

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
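
The summary above has four columns although the data has five, because describe() summarizes only numeric columns by default. A minimal sketch on a made-up toy table (the `toy` dataframe and its column names are assumptions for illustration, not part of the iris file) shows this and how to pull one value out of the summary:

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column
toy = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "a", "b"],
})

summary = toy.describe()  # the text column "label" is left out by default

# Rows of the summary are indexed by statistic name
mean_height = summary.loc["mean", "height"]
median_height = summary.loc["50%", "height"]  # the 50% quantile is the median

print(summary)
print(mean_height, median_height)
```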


In [None]:
# Number of rows
data.shape[0]

In [None]:
# Number of columns
data.shape[1]

In [92]:
# Number of rows and columns
data.shape

(150, 5)

- Accessing individual columns

In [93]:
# Mean of a single column
data["petal_length"].mean()

3.7586666666666693

In [94]:
# Mean of multiple columns
data[["petal_length","petal_width"]].mean()

petal_length    3.758667
petal_width     1.198667
dtype: float64

In [95]:
# Mean of all numeric columns (species holds text, so it is excluded)
data.mean(axis=0, numeric_only=True)

sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64

In [None]:
# skipna=False: a column that contains a missing value yields NaN
data.mean(axis=0, skipna=False, numeric_only=True)
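
The iris data has no missing values, so skipna makes no visible difference there. The following sketch uses a made-up toy table with one missing entry (the `toy` dataframe is an assumption for illustration) to show what the option changes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column x has one missing value
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [4.0, 5.0, 6.0]})

mean_skip = toy.mean(axis=0)                  # default skipna=True ignores NaN
mean_strict = toy.mean(axis=0, skipna=False)  # NaN propagates into the result

print(mean_skip)
print(mean_strict)
```

With skipna=False a NaN result acts as a warning that the column is incomplete, which can be preferable to silently averaging fewer rows.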

- Standard Deviation [data.std( )] of individual columns

In [97]:
data.std(axis=0, numeric_only=True)

sepal_length    0.828066
sepal_width     0.433594
petal_length    1.764420
petal_width     0.763161
dtype: float64

# Python statistics module: import statistics
Note: Colab used Python 3.6 when this notebook was written; the link below points to the matching documentation.
https://docs.python.org/3.6/library/statistics.html

- Keywords: mean, median, standard deviation

In [23]:
import statistics as stat

- mean: stat.mean( )

In [98]:
a = stat.mean([1, 2, 3, 4, 4]); print(a)
b = stat.mean(data.petal_length); print(b) # data column

2.8
3.7586666666666666
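
The pandas mean earlier printed 3.7586666666666693, while stat.mean prints 3.7586666666666666. A likely reason is floating-point rounding: stat.mean adds the values exactly and rounds once at the end, whereas a plain running sum rounds at every step. The sketch below illustrates the effect on a small made-up list (the list is an assumption for illustration, not the iris data):

```python
import math
import statistics as stat

values = [0.1] * 10  # hypothetical data: ten copies of 0.1

naive_mean = sum(values) / len(values)       # plain float accumulation
exact_mean = stat.mean(values)               # exact summation, rounded once
fsum_mean = math.fsum(values) / len(values)  # math.fsum also avoids drift

print(naive_mean)
print(exact_mean)
print(fsum_mean)
```

Differences of this size are harmless for descriptive summaries, but they explain why two correct tools may disagree in the last printed digits.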


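- What is median? "The median is a robust measure of central location, and is less affected by the presence of outliers in your data. When the number of data points is odd, the middle data point is returned." When the number of data points is even, the two middle values are averaged, as in the second example below.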
- stat.median( )

In [99]:
a = stat.median([1, 2, 3, 4, 4]); print(a)
b = stat.median([1,2,3,4]); print(b)
d = stat.median(data.petal_length); print(d)

3
2.5
4.35
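
To make the odd and even cases concrete, here is a minimal sketch of the median computed by hand and compared with stat.median. The helper name `median_by_hand` and the `with_outlier` list are assumptions introduced for illustration:

```python
import statistics as stat

def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of two

odd = [1, 2, 3, 4, 4]
even = [1, 2, 3, 4]
print(median_by_hand(odd), stat.median(odd))
print(median_by_hand(even), stat.median(even))

# One extreme value moves the mean a lot but leaves the median unchanged
with_outlier = [1, 2, 3, 4, 400]
print(stat.mean(with_outlier), stat.median(with_outlier))
```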


- What is mode? A single mode (most common value) of discrete or nominal data.
- stat.mode( )

In [100]:
sample = [1,1,2,2,2,3,4,5,4,6,2,3]
a = stat.mode(sample); print(a)
b = stat.mode(data.petal_length); print(b)

2
1.5
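
stat.mode( ) reports a single value, which hides ties. In Python 3.6 it raises StatisticsError when two values are equally common; from Python 3.8 it returns the first one encountered, and statistics.multimode( ) is available. A version-independent sketch with collections.Counter follows (the helper name `all_modes` and the `tied` list are assumptions for illustration):

```python
from collections import Counter

def all_modes(values):
    """Return every value that shares the highest count."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

sample = [1, 1, 2, 2, 2, 3, 4, 5, 4, 6, 2, 3]
tied = [1, 1, 2, 2, 3]  # hypothetical data where 1 and 2 are equally common

print(all_modes(sample))
print(all_modes(tied))
```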


- What is variance? "Variance, or second moment about the mean, is a measure of the variability (spread or dispersion) of data. A large variance indicates that the data is spread out; a small variance indicates it is clustered closely around the mean."
- stat.variance( ): the sample variance, which divides by n - 1

In [101]:
sample = [1,1,2,2,2,3,4,5,4,6,2,3]
a = stat.variance(sample); print(a)
b = stat.variance(data.petal_length); print(b)

2.446969696969697
3.113179418344519
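
As a check on what stat.variance( ) computes, the sketch below rebuilds the value step by step: subtract the mean, square, sum, and divide by n - 1. Dividing by n instead gives the population variance, which the module exposes as stat.pvariance( ). The variable names are illustrative only:

```python
import math
import statistics as stat

sample = [1, 1, 2, 2, 2, 3, 4, 5, 4, 6, 2, 3]

n = len(sample)
mean = sum(sample) / n
squared_devs = [(x - mean) ** 2 for x in sample]  # squared distance from the mean

sample_var = sum(squared_devs) / (n - 1)  # sample variance (n - 1 in the denominator)
population_var = sum(squared_devs) / n    # population variance (n in the denominator)

print(sample_var, stat.variance(sample))
print(population_var, stat.pvariance(sample))
```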


- What is standard deviation? "It is the square root of the sample variance"
- stat.stdev()

In [102]:
sample = [1,1,2,2,2,3,4,5,4,6,2,3]
a = stat.stdev(sample); print(a)
b = stat.stdev(data.petal_length); print(b)

1.5642792899510294
1.7644204199522626
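
The quoted definition can be verified directly: taking the square root of stat.variance( ) should reproduce stat.stdev( ). The sketch also prints stat.pstdev( ), the population counterpart, for comparison. Variable names are illustrative only:

```python
import math
import statistics as stat

sample = [1, 1, 2, 2, 2, 3, 4, 5, 4, 6, 2, 3]

sd = stat.stdev(sample)                              # sample standard deviation
root_of_variance = math.sqrt(stat.variance(sample))  # square root of sample variance
pop_sd = stat.pstdev(sample)                         # population standard deviation

print(sd, root_of_variance)
print(pop_sd)
```

Unlike the variance, the standard deviation is expressed in the same units as the data, which makes it easier to compare with the mean.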


- Count

In [45]:
import pandas as pd

In [103]:
# count() for list data
sample = ['A','B','A','C','A']
sample.count('A')

3

In [104]:
# unique() for dataframe column
data.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [105]:
# Convert the dataframe column to a list, then use count()
a = list(data.species); print(a[:5])  # show only the first five entries
a.count('setosa')

['setosa', 'setosa', 'setosa', 'setosa', 'setosa']

50
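
Calling count() once per label works, but it scans the list again for each label. When all labels are needed, collections.Counter tallies them in one pass. A minimal sketch on the small list used earlier (pandas offers value_counts() on a column for the same purpose):

```python
from collections import Counter

sample = ['A', 'B', 'A', 'C', 'A']

counts = Counter(sample)      # maps each distinct label to its frequency
print(counts['A'])            # frequency of a single label
print(counts.most_common())   # all labels, most frequent first
```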

In [106]:
# Getting column names
print(data.columns)

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')


- What is rounding? "Rounding numbers means adjusting the digits (up or down) to make rough calculations easier. The result will be an estimated answer rather than a precise one."
- np.round(value, 1) rounds value to one decimal place, e.g. np.round(data['petal_length'].mean(), 1)

In [83]:
pel = data.petal_length

In [107]:
import numpy as np

# print("Mean petal length: ", np.round(data['petal_length'].mean(), 1))
np.round(pel.mean(),1)

3.8
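
One detail worth knowing: both Python's built-in round() and np.round() send exact halves to the nearest even value rather than always rounding up. The sketch below uses only the built-in round(); the mean is copied from the output earlier in this notebook, and the variable names are illustrative:

```python
# Mean petal length as reported earlier in the notebook
mean_petal_length = 3.7586666666666693

rounded = round(mean_petal_length, 1)  # one decimal place
print(rounded)

# Exact halves go to the nearest even integer ("round half to even")
halves = [round(0.5), round(1.5), round(2.5)]
print(halves)
```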

In [108]:
# np.round(pel.max(),1)
pel.max()

6.9

In [109]:
pel.min()

1.0