# Statistics Basic Foundation

### What is statistics?

#### Statistics in data science is a mathematical framework for collecting, organizing, analyzing, interpreting and presenting data in order to uncover patterns, build models and make decisions. It acts as a backbone for understanding data and validating ML models. It provides several measures that describe data in different ways and help analyze it faster.

## Mean, Median and Mode

#### Mean - It is the average value of a dataset: the sum of all the values divided by the total number of values.

#### Median - It is the middle value of a sorted dataset. If the total number of values is even, the median is the mean of the two middle values.

#### Mode - It is the most frequently repeated value in a dataset. A dataset can have one mode or several, and if no value repeats there is no mode.

In [None]:
#Example with a dataset...

data = [2, 3, 5, 5, 8, 9, 12]

import numpy as np
import statistics as stats

print(np.mean(data))        #44/7 ≈ 6.2857, close to 2π (≈ 6.2832)

print(np.median(data))      #Out of 7 values only one middle value stands out

print(stats.mode(data))     #Most repeated value in a dataset
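
The rule that a dataset can have one mode, several modes, or none can be checked with `statistics.multimode` (available in Python 3.8+). This is only a small sketch; the three sample lists are made up for illustration.

```python
import statistics as stats

one_mode = [2, 3, 5, 5, 8, 9, 12]   # 5 appears twice
two_modes = [1, 1, 2, 3, 3]         # 1 and 3 both appear twice
no_repeats = [4, 7, 9]              # every value appears exactly once

# multimode returns a list of all values tied for the highest count
print(stats.multimode(one_mode))
print(stats.multimode(two_modes))
print(stats.multimode(no_repeats))
```

When no value repeats, `multimode` returns every value because all of them are tied; under the definition above such a dataset is treated as having no mode.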

#### *Mean is strongly affected by outliers, median is much less so*...
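
To see the effect of an outlier, the sketch below appends one extreme value to the dataset used above and recomputes both statistics. The value 500 is a hypothetical outlier chosen only for illustration.

```python
import numpy as np

data = [2, 3, 5, 5, 8, 9, 12]
data_with_outlier = data + [500]   # hypothetical extreme value

mean_before = np.mean(data)
median_before = np.median(data)
mean_after = np.mean(data_with_outlier)
median_after = np.median(data_with_outlier)

# The mean is pulled far towards the outlier, the median moves only slightly
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The median still shifts a little because the dataset now has one more value, but it stays within the range of the typical values.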

## Variance and Standard Deviation

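Variance measures how far values are spread from the mean: it is the average of the squared deviations from the mean.
Standard deviation is the square root of variance and is easier to interpret
because it is in the same units as the data.
Note that np.var and np.std compute the population versions by default (they divide by N, not N - 1).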
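
The definitions above can be written out step by step and compared with NumPy's built-in functions. This is a minimal sketch; the variable names are chosen here for illustration, and `ddof=1` is the NumPy argument that switches to the sample (N - 1) version.

```python
import numpy as np

data = np.array([2, 3, 5, 5, 8, 9, 12])

mean = data.mean()
squared_deviations = (data - mean) ** 2

# Population variance divides by N, sample variance divides by N - 1
population_variance = squared_deviations.sum() / len(data)
sample_variance = squared_deviations.sum() / (len(data) - 1)

print(population_variance, np.var(data))           # same value twice
print(sample_variance, np.var(data, ddof=1))       # same value twice
print(np.sqrt(population_variance), np.std(data))  # std is the square root of variance
```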

In [None]:
data = [2, 3, 5, 5, 8, 9, 12]

import numpy as np
import statistics as stats

print(np.var(data))

print(np.std(data))

In [None]:
data1 = [1, 2, 2, 3, 3, 4]

print(np.mean(data1))

print(np.var(data1))

print(np.std(data1))

*Comparing the two datasets above, the variance of the first set is larger than that of the second, meaning the values in the first set are spread further from their mean than the values in the second set.*

In [None]:
#Comparing two datasets...

data_large_spread = [1, 10, 5, 12, 0]
data_small_spread = [5, 5, 6, 6, 5]

print(np.std(data_large_spread))

print(np.std(data_small_spread))

#### *Larger standard deviation means more variability*...

## Probability

Probability measures how likely an event is to occur.
Values range from 0 (impossible) to 1 (certain).

In [None]:
# Event : An outcome or a set of outcomes of a random experiment.
# Example :
# Event A : Randomly selecting a terrestrial planet.
# Event B : Randomly selecting a planet with rings.

planets = ["Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune"]

terrestrial = ["Mercury","Venus","Earth","Mars"]

rings = ["Jupiter","Saturn","Uranus","Neptune"]

P_A = len(terrestrial) / len(planets)

P_B = len(rings) / len(planets)

P_A,P_B

The example above calculates the probabilities of two events from a random experiment.
The first event is randomly picking a terrestrial planet out of all 8 planets, and P_A gives its probability. Similarly, the second event is randomly picking a planet with rings out of all 8 planets, and P_B gives its probability. Both values must lie between 0 and 1; here each is 4/8 = 0.5.

In [None]:
# Intersection (AND) : The event containing only the outcomes that are common to two or more events.
# For example, the event that a randomly chosen planet is both terrestrial and has rings.

intersection = set(terrestrial).intersection(rings)

I_A_and_B = len(intersection) / len(planets)

I_A_and_B

In [None]:
# Union (OR) : The event containing all outcomes that belong to either or both of the events.
# For example, the event that a randomly chosen planet is terrestrial, has rings,
# or both.

union = set(terrestrial).union(rings)

U_A_and_B = len(union) / len(planets)

U_A_and_B

From the two cells above, I_A_and_B is the probability of choosing a planet that is both terrestrial and ringed; no planet in these lists is both, so the result is 0. U_A_and_B, despite the "and" in its name, is the probability of choosing a planet that is terrestrial or ringed (or both); every planet falls into one of the two groups, so the result is 1.
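
The intersection and union are linked by the addition rule, P(A or B) = P(A) + P(B) - P(A and B). The sketch below checks it on the same planet lists; the helper `prob` is a hypothetical function introduced here only for illustration.

```python
planets = ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"]
terrestrial = {"Mercury", "Venus", "Earth", "Mars"}
rings = {"Jupiter", "Saturn", "Uranus", "Neptune"}

def prob(event):
    """Probability of an event when every planet is equally likely (illustrative helper)."""
    return len(event) / len(planets)

p_a = prob(terrestrial)
p_b = prob(rings)
p_a_and_b = prob(terrestrial & rings)   # intersection (AND)
p_a_or_b = prob(terrestrial | rings)    # union (OR)

# Both sides of the addition rule should agree
print(p_a_or_b, p_a + p_b - p_a_and_b)
```

Because the two events share no planets, the subtracted term is 0 and the union is simply P(A) + P(B).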

**In Astronomy, probability is used to quantify uncertainty in observations, for example whether a detected signal is a planet or just noise.**

### Conditional Probability

Conditional Probability is the probability of an event occurring given that
another event has already occurred.

It is written as 
P(A | B)
and is calculated as P(A | B) = P(A and B) / P(B).

In [None]:
planets = ["Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune"]

P_with_atm = ["Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune"]

P_within_2AU = ["Mercury","Venus","Earth","Mars"]

intersection1 = set(P_with_atm).intersection(P_within_2AU)

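# Conditional probability: restrict the sample space to event B (planets within 2 AU)
P_A_within_B = len(intersection1) / len(P_within_2AU)

P_A_within_B

In the example above, 7 of the 8 planets are listed as having an atmosphere (event A: P_with_atm) and 4 lie within 2 astronomical units of the Sun (event B: P_within_2AU). Three planets (Venus, Earth and Mars) belong to both events. Dividing by all 8 planets would give the joint probability P(A and B) = 3/8 = 0.375; the conditional probability restricts the sample space to event B, so P(A | B) = 3/4 = 0.75...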

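### Independence

Two events are independent if knowing that one has occurred does not change the probability of the other. Formally, A and B are independent when P(A and B) = P(A) × P(B), or equivalently P(A | B) = P(A).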
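
Independence can be tested by comparing P(A and B) with P(A) × P(B). The sketch below does this for the terrestrial and ringed planet events used earlier, and for a fair six-sided die, which is a hypothetical example added here for contrast. The helpers `prob` and `independent` are illustrative names, and `Fraction` is used to avoid floating-point rounding in the comparison.

```python
from fractions import Fraction

def prob(event, sample_space):
    """Exact probability of an event with equally likely outcomes (illustrative helper)."""
    return Fraction(len(event), len(sample_space))

def independent(a, b, sample_space):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(a & b, sample_space) == prob(a, sample_space) * prob(b, sample_space)

planets = {"Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"}
terrestrial = {"Mercury", "Venus", "Earth", "Mars"}
rings = {"Jupiter", "Saturn", "Uranus", "Neptune"}

# Mutually exclusive events: knowing a planet is terrestrial rules out rings
planets_independent = independent(terrestrial, rings, planets)
print(planets_independent)

# Hypothetical fair die: "even number" and "at most 2"
die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
at_most_two = {1, 2}
die_independent = independent(even, at_most_two, die)
print(die_independent)
```

The planet events are not independent: they never occur together, so learning that one happened changes the probability of the other to 0.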

### Astroinformatic Reflection

Note: In exoplanet detection, conditional probability is used to estimate the chance that a signal is a real planet given the observed data.

### Normal Distribution

A Normal Distribution is a symmetric, bell-shaped distribution where most values cluster around the mean 
and fewer values occur far from the mean.

Measurement errors in Astronomy often follow the normal distribution.

In [None]:
import numpy as np

np.random.seed(0)

errors = np.random.normal(loc = 0, scale = 1, size = 1000)
# Here loc = mean, scale = std dev, size = number of outcomes...

np.mean(errors), np.std(errors)

In the example above we generate 1000 random numbers. np.random.seed(0) fixes the random number generator so that the same numbers are produced on every run. loc is set to 0 so the values are centered on 0, and scale is set to 1 so the standard deviation is 1, which means roughly 68% of the values fall between -1 and 1. The sample mean and standard deviation printed above should therefore be close to 0 and 1.
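
One way to check that the generated values behave like a normal distribution is the 68-95-99.7 rule: about 68%, 95% and 99.7% of values should lie within 1, 2 and 3 standard deviations of the mean. The sketch below regenerates the same sample and measures these fractions; the dictionary name `fraction_within` is chosen here for illustration.

```python
import numpy as np

np.random.seed(0)
errors = np.random.normal(loc=0, scale=1, size=1000)

mean = errors.mean()
std = errors.std()

fraction_within = {}
for k in (1, 2, 3):
    # Boolean mask of values within k standard deviations of the mean
    inside = np.abs(errors - mean) <= k * std
    fraction_within[k] = inside.mean()
    print(f"within {k} std: {fraction_within[k]:.3f}")
```

With only 1000 samples the fractions will not match the theoretical percentages exactly, but they should be close.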

In [None]:
import matplotlib.pyplot as plt

plt.hist(errors, bins = 30, edgecolor = 'black')   # 30 equal-width bins covering the negative and positive errors
plt.xlabel("Error value")
plt.ylabel("Frequency")
plt.title("Normal Distribution of measurement errors")
plt.show()

### Binomial Distribution

A Binomial Distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.

Example: counting how many planet signals are detected in a fixed number of observations.

In [None]:
import numpy as np

np.random.seed(0)
detections = np.random.binomial(n = 10, p = 0.3, size = 1000)
# Here n = number of trials in each experiment, p = probability of success in each trial, size = number of experiments...

np.mean(detections), np.std(detections)

In the example above we simulate 1000 experiments; np.random.seed(0) makes the simulated results identical on every run. Each experiment consists of 10 trials with a success probability of 0.3 (30%), and each entry of detections is the number of successes in one experiment. The expected number of successes per experiment is n × p = 3.
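
The simulated mean and standard deviation can be compared with the theoretical values of a binomial distribution, mean = n × p and standard deviation = sqrt(n × p × (1 - p)). A minimal sketch using the same parameters as above:

```python
import numpy as np

n, p = 10, 0.3

# Theoretical values for a binomial distribution
theoretical_mean = n * p
theoretical_std = np.sqrt(n * p * (1 - p))

# Simulated values from 1000 experiments
np.random.seed(0)
detections = np.random.binomial(n=n, p=p, size=1000)

print("mean:", theoretical_mean, "vs simulated", detections.mean())
print("std: ", theoretical_std, "vs simulated", detections.std())
```

The simulated values should land near 3 and about 1.45, with small differences caused by the finite number of experiments.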

In [None]:
import matplotlib.pyplot as plt

plt.hist(detections, bins = range(12), edgecolor = 'black')
plt.xlabel("Number of detections out of 10 trials")
plt.ylabel("Frequency")
plt.title("Binomial Distribution (n=10, p=0.3)")
plt.show()

### Astroinformatic Reflection

Normal distributions model measurement noise, while binomial distributions model detection outcomes
in astronomical surveys.