# <span style="color:#1a7796"> Basics of Statistics for Data Science </span>
---

## Outline

1. Descriptive Statistics

2. Data Visualisation

3. Probability Distribution

4. Hypothesis Testing

5. Regression Analysis

## Important information before jumping into Statistics

Python packages and libraries for Data Science

- For <span style="color:#fca503">**Scientific Computing**</span>, we use <span style="color:#aa6da3">**NumPy**</span> (arrays and matrices), <span style="color:#aa6da3">**Pandas**</span> (data structures and 2D DataFrames), and <span style="color:#aa6da3">**SciPy**</span> (optimisation and solving differential equations)

- For <span style="color:#fca503">**Data Visualisation**</span>, we use <span style="color:#aa6da3">**Seaborn**</span> (heat maps, time series and other plots) and <span style="color:#aa6da3">**Matplotlib**</span> (plots, graphs, figures)

- For <span style="color:#fca503">**Machine Learning (ML)**</span> algorithm development, we use <span style="color:#aa6da3">**Scikit-learn**</span> (machine learning: linear regression, classification, clustering analysis, and so on) and <span style="color:#aa6da3">**Statsmodels**</span> (exploring data, estimating statistical models, and performing statistical analysis)

---
---

## <span style="color:#1a7796"> Statistics </span>

Statistics is a collection of methods for collecting, displaying, analyzing and drawing conclusions from data.
Statistics is everywhere.

- Will it rain? A 55% - 70% chance of rain? (Weather forecast)

- What will the USD exchange rate be?

- Will housing material prices increase?

- Has the unemployment rate increased or decreased?

- Who gets paid how much?

- What is the average salary of a Data Analyst or Data Scientist?

- Any comparison in research


### <span style="color:#1a7796"> Language of Statistics</span>

- Average or mean

- Highest, Maximum, Lowest, Minimum

- Percentages, ratios

- Probability, likelihood

- Variance, Standard Deviation

- T-Test

- ANOVA 

### <span style="color:#1a7796"> Types of Data </span>

1. Cross Sectional Data - Data collected at a single point in time. For example, how many views did this video get on 13 April, 2022?

2. Time Series Data - Data collected over different points in time. For example, how many views has this video gotten since 13 April, 2022?

3. Univariate Data - Data containing a single variable that measures an entity. For example, meals taken that result in weight gain.

4. Multi-variate Data - Data containing more than one variable that measures something. For example, meals and coke taken that result in weight gain.
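
The difference between the first two types can be sketched in pandas. This is only an illustration: the daily view counts are made-up numbers, and the names `views`, `snapshot` and `cumulative` are hypothetical.

```python
import pandas as pd

# Hypothetical daily view counts for one video (illustrative numbers only)
dates = pd.date_range("2022-04-13", periods=5, freq="D")
views = pd.Series([120, 95, 80, 60, 45], index=dates, name="views")

# Cross-sectional: a single observation taken at one point in time
snapshot = views.loc[pd.Timestamp("2022-04-13")]

# Time series: the same quantity followed over several points in time
cumulative = views.cumsum()

print(snapshot)
print(cumulative.iloc[-1])
```

The snapshot answers "how many views on 13 April, 2022?", while the running total answers "how many views since 13 April, 2022?".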

### <span style="color:#1a7796"> Variable Types </span>

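Categorical

1. Binomial (Yes/No) or (True/False) - Nominal with two categories. No quantitative relationship is given

2. Multinomial - Nominal with more than two categories and no inherent order

3. Ordinal Variable - Data ranked or ordered. Categories can be compared. No fixed unit of measurement for statistics.

Numerical

4. Ratio Data - Numeric values with a true zero, so ratios are meaningful (for example, weight)

5. Interval Variable/Data - Numeric values with equal spacing but no true zero (for example, temperature in °C)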
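
As a rough sketch of how the categorical types behave in practice, pandas can store both unordered and ordered categories. The values below are illustrative, and `answer` and `size` are hypothetical names.

```python
import pandas as pd

# Nominal: categories with no inherent order (illustrative values)
answer = pd.Series(["Yes", "No", "Yes", "Yes"], dtype="category")

# Ordinal: categories that can be ranked, but with no fixed unit between them
size = pd.Series(
    pd.Categorical(
        ["small", "large", "medium", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

print(answer.value_counts())   # counting is the main operation on nominal data
print(size.min(), size.max())  # ordering makes min/max meaningful
print((size > "small").sum())  # comparisons against a category also work
```

Declaring the order explicitly is a design choice: without `ordered=True`, pandas treats the categories as nominal and refuses comparisons such as `size > "small"`.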

### <span style="color:#1a7796"> Measure of central tendency </span>

- Mean, Median, Mode
- N = Size of population
- n = Size of sample
- Σ (sigma) = summation sign
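
The three measures can be computed directly from their definitions. The sketch below uses Python's standard `statistics` module and a small made-up sample. The sample mean is the sum of the values divided by n.

```python
import statistics

sample = [2, 3, 3, 5, 7]   # illustrative values
n = len(sample)            # n = size of the sample

mean = sum(sample) / n               # sum of the values divided by n
median = statistics.median(sample)   # middle value of the sorted data
mode = statistics.mode(sample)       # most frequent value

print(mean, median, mode)
```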


In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
phool = sns.load_dataset("iris")
phool.head()
phool.to_csv("Iris.csv")  # also writes the row index, which reappears as "Unnamed: 0" when the file is read back

In [6]:
df = pd.read_csv("Iris.csv")
print(df.describe())

       Unnamed: 0  sepal_length  sepal_width  petal_length  petal_width
count  150.000000    150.000000   150.000000    150.000000   150.000000
mean    74.500000      5.843333     3.057333      3.758000     1.199333
std     43.445368      0.828066     0.435866      1.765298     0.762238
min      0.000000      4.300000     2.000000      1.000000     0.100000
25%     37.250000      5.100000     2.800000      1.600000     0.300000
50%     74.500000      5.800000     3.000000      4.350000     1.300000
75%    111.750000      6.400000     3.300000      5.100000     1.800000
max    149.000000      7.900000     4.400000      6.900000     2.500000


In [7]:
df.mean(numeric_only=True)  # skip the non-numeric species column

Unnamed: 0      74.500000
sepal_length     5.843333
sepal_width      3.057333
petal_length     3.758000
petal_width      1.199333
dtype: float64

In [8]:
df.mode()

     Unnamed: 0  sepal_length  sepal_width  petal_length  petal_width     species
0             0           5.0          3.0           1.4          0.2      setosa
1             1           NaN          NaN           1.5          NaN  versicolor
2             2           NaN          NaN           NaN          NaN   virginica
3             3           NaN          NaN           NaN          NaN         NaN
4             4           NaN          NaN           NaN          NaN         NaN
...         ...           ...          ...           ...          ...         ...
145         145           NaN          NaN           NaN          NaN         NaN
146         146           NaN          NaN           NaN          NaN         NaN
147         147           NaN          NaN           NaN          NaN         NaN
148         148           NaN          NaN           NaN          NaN         NaN
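
The long output of `df.mode()` can be puzzling. `mode()` returns every value tied for most frequent, so a column whose values are all unique (such as a saved index) contributes one row per value, and the shorter columns are padded with `NaN`. The sketch below reproduces the effect on a small made-up frame. `toy` and its columns are hypothetical, not the iris data.

```python
import pandas as pd

# Small illustrative frame: "row_id" is unique per row, like a saved index column
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "width": [3.0, 3.0, 2.8, 3.1],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

modes = toy.mode()
print(modes)

# Asking for one column at a time avoids the NaN padding
print(toy["width"].mode().tolist())
print(toy["species"].mode().tolist())
```

Calling `mode()` per column is usually easier to read than calling it on the whole DataFrame.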


### Measure of Dispersion

- Variability
- Scatter
- Spread
- Variance
- Covariance

How much is the data spread around the mean?

- Standard deviation (std)
- Standard error (se)
- Variance
- Bell Curve

The simplest measure of dispersion is the range: the difference between the maximum and the minimum.




In [16]:
samosachai = pd.Series([5,10,15,25,30,35,45,55,65,75,85,95,105,115])
samosachai.mean()

54.285714285714285

In [18]:
samosachai.median()

50.0

In [19]:
samosachai.mode()

0       5
1      10
2      15
3      25
4      30
5      35
6      45
7      55
8      65
9      75
10     85
11     95
12    105
13    115
dtype: int64

Every value occurs exactly once, so `mode()` returns all of them: this series has no single mode.

In [20]:
samosachai.std()

36.41941276810523

In [21]:
samosachai.var()

1326.3736263736262

In [23]:
samosachai.max()

115

In [24]:
samosachai.min()

5
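
The cells above cover the standard deviation and variance but not the range or the standard error. A minimal sketch of both, using the same values as `samosachai`, could look like this. It assumes the standard error of the mean is meant, computed as SD / sqrt(n).

```python
import numpy as np
import pandas as pd

samosachai = pd.Series([5, 10, 15, 25, 30, 35, 45, 55, 65, 75, 85, 95, 105, 115])

n = samosachai.count()
value_range = samosachai.max() - samosachai.min()   # simplest measure of spread
std = samosachai.std()                              # sample SD (ddof=1 by default)
se = std / np.sqrt(n)                               # standard error of the mean

print(value_range)
print(se, samosachai.sem())   # pandas' built-in sem() should agree
```

Note that pandas uses the sample formulas (dividing by n - 1) for `std()`, `var()` and `sem()` by default.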

The mean alone gives only a partial picture and has very little meaning without a measure of dispersion such as the standard deviation (SD). The mean reported with the SD is more useful than the mean by itself.
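
To illustrate why the mean needs a measure of spread alongside it, the sketch below compares two made-up samples (`tight` and `wide` are hypothetical names) that share the same mean but differ widely in SD.

```python
import pandas as pd

# Two illustrative samples with the same mean but very different spread
tight = pd.Series([48, 49, 50, 51, 52])
wide = pd.Series([10, 30, 50, 70, 90])

for name, s in [("tight", tight), ("wide", wide)]:
    print(f"{name}: mean = {s.mean():.1f}, SD = {s.std():.2f}")
```

Reporting only the mean would make the two samples look identical.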

### Fundamentals of Data Visualisation
The type of visualisation depends on the type of data available.

- Categorical Variable
    
    - Qualitative in nature
    - No numerical meaning 
    - Counts (plot type)
    - Male vs Female
    - True vs False
    - 0 vs 1
    - Yes vs No

- Continuous Variable
    
    - Quantitative in nature
    - Have numerical meanings
    - Scatter plot
    - Statistical proportions (mean and their comparisons)
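
The two cases above can be sketched side by side: a count plot for a categorical variable and a scatter plot for two continuous ones. The data frame is made up for illustration, and its column names are hypothetical. Plain Matplotlib is used here to keep the sketch small.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: one categorical and two continuous variables
data = pd.DataFrame({
    "answer": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "height_cm": [150, 160, 165, 170, 175, 180],
    "weight_kg": [50, 58, 63, 68, 74, 80],
})

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(8, 3))

# Categorical variable: plot the count of each category
counts = data["answer"].value_counts()
ax_cat.bar(counts.index, counts.values)
ax_cat.set_title("Counts of a categorical variable")

# Continuous variables: a scatter plot shows how two of them relate
ax_num.scatter(data["height_cm"], data["weight_kg"])
ax_num.set_xlabel("height_cm")
ax_num.set_ylabel("weight_kg")
ax_num.set_title("Scatter of two continuous variables")

fig.tight_layout()
```

In a notebook the figure renders after the cell runs. In a script, `plt.show()` or `fig.savefig(...)` would be needed.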


### Choice of Charts

![Choice of Chart](https://apandre.files.wordpress.com/2011/02/chartchooserincolor.jpg)

![Data Visualisation](https://apandre.files.wordpress.com/2011/02/visualthinkingcodex.jpg)

![image](https://cdn2.slidemodel.com/wp-content/uploads/FF00101-01-free-abelas-charts-16x9-1.jpg)

![image](https://biuwer.com/static/e5700b9017bb1ec94651589e455a7774/e40ed/biuwer-how-to-choose-the-right-chart-for-your-data.png)