# Basic Foundations of Statistics

### What is statistics?

#### Statistics in data science is a mathematical framework for planning, collecting, analyzing, interpreting, and presenting data in order to uncover patterns, build models, and make decisions. It is the backbone for understanding data and validating ML models, and it provides a set of summary measures that make a dataset faster to analyze.

## Mean, Median and Mode

#### Mean - The average value of a dataset: the sum of all values divided by the number of values.

#### Median - The middle value of a sorted dataset. If the number of values is even, the median is the mean of the two middle values.

#### Mode - The most frequently occurring value in a dataset. A dataset can have one mode or several, and if no value repeats there is no mode.

In [None]:
#Example with a dataset...

data = [2, 3, 5, 5, 8, 9, 12]

import numpy as np
import statistics as stats

print(np.mean(data))        # 44 / 7 ≈ 6.29 (close to, but not equal to, 2π)

print(np.median(data))      # 7 values, so the median is the single middle value (5)

print(stats.mode(data))     # Most frequently repeated value in the dataset (5)

#### *The mean is sensitive to outliers; the median is much less so.*
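
To make the effect of an outlier concrete, the sketch below appends one extreme value to the dataset used above and compares how far each statistic moves. The value 100 is an arbitrary choice for illustration, not part of the original data.

```python
import numpy as np

data = [2, 3, 5, 5, 8, 9, 12]
data_with_outlier = data + [100]   # 100 is a made-up extreme value

# How much each statistic changes when the outlier is added
mean_shift = np.mean(data_with_outlier) - np.mean(data)
median_shift = np.median(data_with_outlier) - np.median(data)

print(f"Mean moves by:   {mean_shift:.2f}")
print(f"Median moves by: {median_shift:.2f}")
```

The mean jumps by more than 11, while the median moves only from 5 to 6.5.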

## Variance and Standard Deviation

Variance measures how far values are spread from the mean: it is the average of the squared deviations from the mean.
Standard deviation is the square root of the variance and is easier to interpret
because it is in the same units as the data.
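
As a check on these definitions, the following sketch computes the variance by hand and compares it with NumPy. It assumes the population convention (divide by N), which is what np.var and np.std use by default; the last line shows the sample convention (divide by N - 1) for comparison.

```python
import numpy as np

data = np.array([2, 3, 5, 5, 8, 9, 12])

deviations = data - data.mean()            # distance of each value from the mean
pop_variance = np.mean(deviations ** 2)    # average squared deviation (divide by N)
pop_std = np.sqrt(pop_variance)            # back in the same units as the data

# np.var and np.std use the same divide-by-N convention by default (ddof=0)
print(pop_variance, np.var(data))
print(pop_std, np.std(data))

# For a sample drawn from a larger population, dividing by N - 1 is common
print(np.var(data, ddof=1))
```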

In [None]:
data = [2, 3, 5, 5, 8, 9, 12]

import numpy as np
import statistics as stats

print(np.var(data))

print(np.std(data))

In [None]:
data1 = [1, 2, 2, 3, 3, 4]

print(np.mean(data1))

print(np.var(data1))

print(np.std(data1))

*In the two examples above, the variance of the first dataset is larger than that of the second, meaning the values in the first dataset are spread farther from their mean than the values in the second.*

In [None]:
#Comparing two datasets...

data_large_spread = [1, 10, 5, 12, 0]
data_small_spread = [5, 5, 6, 6, 5]

print(np.std(data_large_spread))

print(np.std(data_small_spread))

#### *A larger standard deviation means more variability.*

## Probability

Probability measures how likely an event is to occur.
Values range from 0 (impossible) to 1 (certain).

In [None]:
# Event: an outcome, or a set of outcomes, of a random experiment.
# Example:
# Event A: randomly selecting a terrestrial planet.
# Event B: randomly selecting a planet with rings.

planets = ["Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune"]

terrestrial = ["Mercury","Venus","Earth","Mars"]

rings = ["Jupiter","Saturn","Uranus","Neptune"]

P_A = len(terrestrial) / len(planets)

P_B = len(rings) / len(planets)

P_A,P_B

The example above calculates the probabilities of two events in one random experiment: picking a planet at random from the 8 planets.
Event A is picking a terrestrial planet, and P_A is its probability. Event B is picking a planet with rings, and P_B is its probability. Both values must lie between 0 and 1; here each is 4/8 = 0.5.

In [None]:
# Intersection (AND): the event that two or more events all occur, i.e. the outcomes they have in common.
# For example, the probability that a randomly chosen planet is both terrestrial and ringed.

intersection = set(terrestrial).intersection(rings)

I_A_and_B = len(intersection) / len(planets)

I_A_and_B

In [None]:
# Union (OR): the event that at least one of the events occurs, i.e. all outcomes in either or both.
# For example, the probability that a randomly chosen planet is either terrestrial
# or has rings.

union = set(terrestrial).union(rings)

U_A_and_B = len(union) / len(planets)

U_A_and_B

From the two cells above, I_A_and_B is the probability of choosing a planet that is both terrestrial and ringed; no such planet exists, so the result is 0. U_A_and_B is the probability of choosing a planet that is terrestrial or ringed (despite its name, it holds the union, A or B); every planet is one or the other, so the result is 1.
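
These two results are linked by the addition rule, P(A or B) = P(A) + P(B) - P(A and B). The sketch below checks the rule on the same planet sets; the lowercase variable names are illustrative and not taken from the cells above.

```python
import math

planets = ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"]
terrestrial = {"Mercury", "Venus", "Earth", "Mars"}
rings = {"Jupiter", "Saturn", "Uranus", "Neptune"}

n = len(planets)
p_a = len(terrestrial) / n
p_b = len(rings) / n
p_a_and_b = len(terrestrial & rings) / n   # intersection: both events occur
p_a_or_b = len(terrestrial | rings) / n    # union: at least one event occurs

# Addition rule: subtract the overlap so shared outcomes are not counted twice
rule_holds = math.isclose(p_a_or_b, p_a + p_b - p_a_and_b)
print(p_a_or_b, rule_holds)
```

Because the two sets do not overlap here, the subtraction term is 0 and the union is simply P(A) + P(B).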

**In astronomy, probability is used to quantify uncertainty in observations, such as whether a detected signal is a planet or just noise.**

### Conditional Probability

Conditional probability is the probability of an event occurring given that
another event has already occurred.

It is written as 
P(A | B)

In [None]:
planets = ["Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune"]

P_with_atm = ["Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune"]

P_within_2AU = ["Mercury","Venus","Earth","Mars"]

intersection1 = set(P_with_atm).intersection(P_within_2AU)

# P(A | B) = P(A and B) / P(B): restrict to the planets in B, then count those also in A.
P_A_given_B = len(intersection1) / len(P_within_2AU)

P_A_given_B

In the example above, event A is "the planet has an atmosphere" (P_with_atm) and event B is "the planet is within 2 astronomical units" (P_within_2AU). Three planets (Venus, Earth, Mars) belong to both events, so the joint probability P(A and B) is 3/8 = 0.375. Conditioning on B restricts attention to the 4 planets within 2 AU, giving the conditional probability P(A | B) = 3/4 = 0.75.

### Independence

Two events are independent if knowing that one occurred does not change the probability of the other.
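
An equivalent test is the product rule: A and B are independent exactly when P(A and B) = P(A) × P(B). The sketch below applies it to the atmosphere and distance events from the planet example; the set names are illustrative.

```python
import math

planets = ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"]
with_atmosphere = {"Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"}
within_2au = {"Mercury", "Venus", "Earth", "Mars"}

n = len(planets)
p_a = len(with_atmosphere) / n
p_b = len(within_2au) / n
p_a_and_b = len(with_atmosphere & within_2au) / n

# Independent events satisfy P(A and B) == P(A) * P(B)
independent = math.isclose(p_a_and_b, p_a * p_b)
print(p_a_and_b, p_a * p_b, independent)
```

Here the two numbers differ, so in this small example having an atmosphere and lying within 2 AU are not independent events.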

### Astroinformatic Reflection

Note: In exoplanet detection, conditional probability is used to estimate the chance that a signal is a real planet given the observed data.

### Normal Distribution

A normal distribution is a symmetric, bell-shaped distribution where most values cluster around the mean
and fewer values occur far from the mean.

Measurement errors in astronomy often follow a normal distribution.

In [None]:
import numpy as np

np.random.seed(0)

errors = np.random.normal(loc = 0, scale = 1, size = 1000)
# Here loc = mean, scale = std dev, size = number of outcomes...

np.mean(errors), np.std(errors)

In the example above, we generate 1000 random numbers. Calling np.random.seed() first makes the sequence reproducible, so the same numbers are produced on every run. loc is set to 0 so the values are centered on 0, and scale is set to 1 so the standard deviation is 1, which means most values fall within a few units of 0. The sample mean and standard deviation computed at the end should come out close to 0 and 1.
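
One way to see what "most values cluster around the mean" means in practice is the 68-95-99.7 rule: for a normal distribution, roughly 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean. The sketch below checks this empirically; it uses a larger sample than the cell above (an arbitrary choice) so the fractions are more stable.

```python
import numpy as np

np.random.seed(0)
errors = np.random.normal(loc=0, scale=1, size=100_000)

# Fraction of samples within k standard deviations of the mean (mean 0, std 1 here)
fractions = {}
for k in (1, 2, 3):
    fractions[k] = np.mean(np.abs(errors) <= k)
    print(f"Within {k} std dev: {fractions[k]:.3f}")
```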

In [None]:
import matplotlib.pyplot as plt

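# bins = 30 gives 30 equal-width bins over the data; range(30) would span 0 to 29 and drop the negative errors
plt.hist(errors, bins = 30, edgecolor = 'black')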
plt.xlabel("Error value")
plt.ylabel("Frequency")
plt.title("Normal Distribution of measurement errors")
plt.show()

### Binomial Distribution

A binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.

Example: counting how many planet signals are detected in a fixed number of observations.

In [None]:
import numpy as np

np.random.seed(0)
detections = np.random.binomial(n = 10, p = 0.3, size = 1000)
# Here n = number of trials in each experiment, p = probability of success, size = number of experiments...

np.mean(detections), np.std(detections)

In the example above, we simulate 1000 experiments; np.random.seed() makes the results reproducible across runs. Each experiment consists of 10 trials, and each trial succeeds with probability 0.3 (30%). Each entry of detections is the number of successes in one experiment.
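
The simulated mean and standard deviation can be compared with the theoretical values for a binomial distribution, which are n × p and sqrt(n × p × (1 - p)). The sketch below repeats the simulation with the same parameters and prints both.

```python
import numpy as np

n, p = 10, 0.3

np.random.seed(0)
detections = np.random.binomial(n=n, p=p, size=1000)

expected_mean = n * p                      # 10 * 0.3 = 3.0
expected_std = np.sqrt(n * p * (1 - p))    # sqrt(2.1), about 1.45

print(f"Simulated mean: {detections.mean():.3f}, theoretical: {expected_mean:.3f}")
print(f"Simulated std:  {detections.std():.3f}, theoretical: {expected_std:.3f}")
```

With 1000 experiments the simulated values should land close to, but not exactly on, the theoretical ones.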

In [None]:
import matplotlib.pyplot as plt

plt.hist(detections, bins = range(12), edgecolor = 'black')
plt.xlabel("Number of detections out of 10 trials")
plt.ylabel("Frequency")
plt.title("Binomial Distribution (n=10, p=0.3)")
plt.show()

### Astroinformatic Reflection

Normal distributions model measurement noise, while binomial distributions model detection outcomes
in astronomical surveys.

### Hypothesis Testing

Hypothesis testing is a method for deciding whether an observed result is likely due to chance or
represents a real effect.

It is useful in astronomy for answering questions like:
Is this signal real or noise?
Is this difference meaningful or random?
Should we trust this result?

### Hypothesis

1] Null Hypothesis (H0): There is no real effect.
2] Alternative Hypothesis (H1): There is a real effect.

Example:
H0: The detected signal is just noise.
H1: The signal is caused by a real planet.

In [None]:
import numpy as np

data = np.array([10, 12, 9, 11, 10, 13, 12])
mean_observed = np.mean(data)
print(f"Observed result: {mean_observed}")

# Take 10 as the expected mean under the null hypothesis (noise-only variation)
print(f"Difference from expected result: {mean_observed - 10}")

Observed result: 11.0
Difference from expected result: 1.0


If the difference between the observed result and the value expected from noise alone is large compared with the typical scatter of the data, the effect might be real.
If the difference is small compared with that scatter, the result is probably just noise.

### P-value

A p-value is the probability of observing data at least as extreme as what was observed,
assuming the null hypothesis is true.
A small p-value means such data would be unlikely under the null hypothesis alone, which counts as evidence against it.
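
As a minimal sketch, a two-sided p-value can be computed for the seven measurements from the hypothesis example, assuming a noise-only mean of 10 under H0. It uses a normal approximation, which is only rough for a sample this small; a t-test (for example scipy.stats.ttest_1samp) accounts for the small sample and would give a somewhat larger p-value. The names null_mean and sem are illustrative.

```python
import numpy as np
from statistics import NormalDist

data = np.array([10, 12, 9, 11, 10, 13, 12])
null_mean = 10   # assumed noise-only mean under H0

# Standard error of the mean, using the sample standard deviation (ddof=1)
sem = data.std(ddof=1) / np.sqrt(len(data))

# How many standard errors the observed mean sits from the null mean
z = (data.mean() - null_mean) / sem

# Two-sided p-value: probability of a deviation at least this large in either direction
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A common convention is to compare the p-value with a threshold chosen in advance, such as 0.05, before deciding whether to reject H0.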

#### Astroinformatic Reflection

In astronomy, hypothesis testing helps determine whether a detected signal is statistically significant 
or just random noise.

### Confidence Intervals

A confidence interval gives a range of values that is likely to contain the true value of a parameter, at a stated confidence level (such as 95%).

***The wider the interval, the greater the uncertainty.***

In [5]:
import numpy as np

data2 = np.array([10, 12, 9, 11, 10, 13, 12])

mean = np.mean(data2)
std = np.std(data2)

n = len(data2)

print(f"Mean: {mean}")
print(f"Standard Deviation: {std}")
print(f"Number of values: {n}")

Mean: 11.0
Standard Deviation: 1.3093073414159542
Number of values: 7


In [7]:
# A rough one-standard-deviation range around the mean (not a formal confidence interval)...

lower = mean - std
upper = mean + std

print(lower, upper)

9.690692658584046 12.309307341415954


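This range is not a formal confidence interval: it describes the spread of the individual measurements, most of which lie within one
standard deviation of the mean. A confidence interval for the mean would instead be built from the standard error, std / sqrt(n), which shrinks as more measurements are added.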
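
For comparison, here is a sketch of an approximate 95% confidence interval for the mean of the same data, built from the standard error rather than the standard deviation. It assumes a normal approximation with a critical value of about 1.96; for only 7 values a t-distribution critical value would be more appropriate and would give a slightly wider interval.

```python
import numpy as np
from statistics import NormalDist

data2 = np.array([10, 12, 9, 11, 10, 13, 12])

n = len(data2)
mean = data2.mean()
sem = data2.std(ddof=1) / np.sqrt(n)     # standard error of the mean

z_crit = NormalDist().inv_cdf(0.975)     # about 1.96 for a two-sided 95% interval

lower = mean - z_crit * sem
upper = mean + z_crit * sem

print(f"Approximate 95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```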

#### Astroinformatic Reflection

In astronomy, confidence intervals are used to express uncertainty in measurements such as
distance, mass, density, volume, and brightness.

### Z-Score

A z-score measures how far a value is from the mean in terms of standard deviations.

A large absolute z-score indicates that a value is unusual. A z-score of zero means the value
equals the mean.

In [3]:
import numpy as np

data123 = np.array([10, 12, 9, 11, 10, 13, 12])

mean = np.mean(data123)
std = np.std(data123)

print(f"Mean: {mean}")
print(f"Standard Deviation: {std}")

Mean: 11.0
Standard Deviation: 1.3093073414159542


In [None]:
#Calculating the Z-Score...

x1 = 13    # Consider the observed value is 13.
z1 = (x1 - mean) / std

print(f"Z-Score of 13: {z1}")

Z-Score of 13: 1.5275252316519468


As a rule of thumb:
|z| < 1 → common variation
|z| ≈ 2 → somewhat unusual
|z| ≥ 3 → rare event

The program above gives a z-score of approximately 1.5 for 13, meaning it is 1.5 standard deviations above the mean.

So 13 is mildly above average but still within expected variation; it is not a rare or extreme value.

In [8]:
x2 = 9
z2 = (x2 - mean) / std

print(f"Z-Score of 9: {z2}")

Z-Score of 9: -1.5275252316519468


The program above gives a z-score of approximately -1.5 for 9.

The z-scores of 13 and 9 have the same magnitude, about 1.53, so both values lie the same distance from the mean: roughly 1.5 standard deviations. The negative sign for 9 indicates that it lies below the mean.

Thus 13 is mildly above average and 9 is mildly below average, and both are within expected variation.

#### Astroinformatic Reflection

Z-scores help identify unusually strong signals in astronomical data,
such as potential exoplanet transits or rare cosmic events.

In [2]:
# Simulating noisy measurements

import numpy as np

np.random.seed(42) 

# simulating 1000 brightness measurements with noise
brightness = np.random.normal(loc = 100, scale = 2, size = 1000)

mean = np.mean(brightness)
std = np.std(brightness)

print(f"Mean: {mean}")
print(f"Standard deviation: {std}")

Mean: 100.03866411164464
Standard deviation: 1.9574524154947086


In [None]:
# Computing the Z-Score for potential event

brightness_with_event = np.append(brightness, 110)

obv_val1 = 110

z_score = (obv_val1 - mean) / std

print(z_score)

5.088928757350062


According to the result, the z-score of the observed value 110 is approximately 5, which is far outside the range expected from normal noise. Such a value is better treated as a potential signal of something unusual, such as a transient event in the observed star, than as ordinary noise. (If the values are read as astronomical magnitudes, a larger number means a fainter object, so 110 would indicate a dimming rather than a brightening.) It is well worth investigating further.

Points to be noted: 

The computed z-score for the value 110 is approximately 5,
meaning it is about five standard deviations away from the mean.

Such a deviation is extremely unlikely under the assumed normal noise model.
Therefore, this value warrants further investigation.

However, a large z-score does not automatically confirm a real astrophysical event.
Instrumental error, data artifacts, or incorrect noise assumptions must be ruled out first.


In [5]:
# Computing the Z-Score for mild event

obv_val2 = 103

z_score2 = (obv_val2 - mean) / std

print(z_score2)

1.5128520442766131


Comparing the z-score of 103 with that of 110...

The z-score of 110 is ~5, while the z-score of 103 is ~1.5. A value of 103 is therefore consistent with ordinary noise and is not considered a potential event. The value that deserves attention is 110, whose z-score of about 5 is inconsistent with the regular brightness measurements.
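
The same comparison can be done for several candidate values at once. The sketch below computes z-scores against the noise-only baseline and flags any value with |z| ≥ 3, following the rule of thumb given earlier; the threshold and the names candidates and flagged are illustrative choices.

```python
import numpy as np

np.random.seed(42)
brightness = np.random.normal(loc=100, scale=2, size=1000)

# Baseline statistics estimated from the noise-only measurements
mean = brightness.mean()
std = brightness.std()

candidates = np.array([103, 110])
z_scores = (candidates - mean) / std

threshold = 3   # rule-of-thumb cutoff for a rare value
flagged = candidates[np.abs(z_scores) >= threshold]

print(z_scores)
print(flagged)
```

Only 110 passes the threshold, so it is the only candidate flagged for follow-up.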

#### Astroinformatic Reflection

In astronomical surveys, z-scores help flag unusual brightness changes that may indicate transient events or exoplanet transits. However, extreme values must be verified carefully to avoid false positives.