# Statistics for Data Science
This notebook provides a detailed overview of essential statistical concepts for data science. We will cover descriptive statistics, probability distributions, hypothesis testing, regression analysis, and data visualization, with practical examples and exercises to reinforce learning.

The following video reinforces how statistics is used in data science: [The Vital Role of Statistics](https://www.youtube.com/watch?v=RjayA3jyXyo)

## Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. Common metrics include:
- **Mean:** The average value.
- **Median:** The middle value when data is sorted.
- **Mode:** The most frequently occurring value.
- **Standard Deviation:** A measure of data dispersion around the mean.

### Example:
Given a dataset: [10, 20, 20, 30, 40]
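- Mean = (10 + 20 + 20 + 30 + 40) / 5 = 24
- Median = 20 (the third of the five sorted values)
- Mode = 20 (it appears twice; every other value appears once)
- Standard Deviation = sqrt(((10-mean)^2 + (20-mean)^2 + ... + (40-mean)^2)/N) = sqrt(520 / 5) ≈ 10.20, where N = 5 is the number of values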
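
The formula above divides by N, which gives the population standard deviation. When the data is a sample drawn from a larger population, it is common to divide by N - 1 instead. The sketch below compares the two on the same dataset using NumPy's `ddof` argument; the variable names are illustrative only.

```python
import numpy as np

data = [10, 20, 20, 30, 40]

# Population standard deviation: divide the summed squared deviations by N
population_std = np.std(data, ddof=0)

# Sample standard deviation: divide by N - 1 instead of N
sample_std = np.std(data, ddof=1)

print(f'Population standard deviation: {population_std:.4f}')
print(f'Sample standard deviation: {sample_std:.4f}')
```

The sample version is always slightly larger, and the gap shrinks as the dataset grows.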

In [None]:
import numpy as np
from scipy import stats

data = [10, 20, 20, 30, 40]
mean = np.mean(data)
median = np.median(data)
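# stats.mode returns a result object; its .mode attribute holds the most frequent value
mode = stats.mode(data).mode
# np.std defaults to the population standard deviation (divides by N)
std_dev = np.std(data)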

print(f'Mean: {mean}')
print(f'Median: {median}')
print(f'Mode: {mode}')
print(f'Standard Deviation: {std_dev}')

### Practice Exercise
Calculate the mean, median, mode, and standard deviation for the dataset: [12, 15, 14, 10, 18, 20, 25].

## Probability Distributions
A probability distribution describes how likely each possible value of a random variable is.

### Common Distributions:
1. **Normal Distribution:** A symmetric, bell-shaped curve.
2. **Binomial Distribution:** Represents the number of successes in a fixed number of independent trials, each with the same probability of success.

### Example:
- The number of heads in a fixed number of coin tosses follows a binomial distribution.
- Heights of people often follow a normal distribution.
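
To make the coin-toss case concrete, the following sketch tabulates the binomial probabilities for the number of heads. The choice of 10 tosses and a fair coin is an assumption made for illustration, not something fixed by the text.

```python
import numpy as np
from scipy import stats

n_tosses = 10   # assumed number of tosses (illustrative)
p_heads = 0.5   # assumed fair coin

# Possible outcomes: 0, 1, ..., n_tosses heads
k = np.arange(n_tosses + 1)

# Probability of observing exactly k heads in n_tosses tosses
pmf = stats.binom.pmf(k, n_tosses, p_heads)

for heads, prob in zip(k, pmf):
    print(f'P(heads = {heads:2d}) = {prob:.4f}')
```

The probabilities sum to 1 and peak at 5 heads, the expected count for a fair coin.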

In [None]:
import matplotlib.pyplot as plt
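# Evaluate the standard normal density (mean 0, standard deviation 1) on a grid
x = np.linspace(-5, 5, 100)
normal_dist = stats.norm.pdf(x, loc=0, scale=1)

plt.plot(x, normal_dist, label='Normal Distribution')
plt.title('Probability Distribution')
plt.xlabel('x')
plt.ylabel('Probability density')
plt.legend()
plt.show()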

### Practice Exercise
Simulate 1000 coin flips using a binomial distribution and plot the results.

## Hypothesis Testing
Hypothesis testing uses sample data to assess whether there is enough evidence to reject an assumption about a population.

### Steps:
1. Define null (H0) and alternative (H1) hypotheses.
2. Select a significance level (e.g., α=0.05).
3. Calculate the test statistic and p-value.
4. Reject H0 if the p-value is below the significance level; otherwise, fail to reject H0.
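
The steps above can be traced by hand for a one-sample t-test. The sketch below computes the test statistic as (sample mean - hypothesized mean) / (s / sqrt(n)) and a two-sided p-value from the t distribution with n - 1 degrees of freedom. The sample values and the hypothesized mean of 2.5 are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Step 1: H0 says the population mean is 2.5; H1 says it is not
sample = np.array([2.3, 2.5, 2.7, 2.6, 2.8])  # illustrative sample
population_mean = 2.5

# Step 2: choose the significance level
alpha = 0.05

# Step 3: test statistic and two-sided p-value
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)  # uses the sample standard deviation
t_stat = (sample.mean() - population_mean) / standard_error
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Step 4: compare the p-value with alpha
decision = 'Reject H0' if p_value < alpha else 'Fail to reject H0'

print(f'T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}')
print(decision)
```

These values should agree with what `stats.ttest_1samp(sample, population_mean)` returns, which is a convenient check on the manual calculation.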

In [None]:
# Example: One-sample t-test
sample = [2.3, 2.5, 2.7, 2.6, 2.8]
population_mean = 2.5
t_stat, p_value = stats.ttest_1samp(sample, population_mean)

print(f'T-Statistic: {t_stat}, P-Value: {p_value}')
if p_value < 0.05:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

### Practice Exercise
Perform a two-sample t-test for the datasets: [1.2, 1.4, 1.6, 1.5] and [1.8, 1.9, 2.0, 1.7].

## Regression Analysis
Regression analysis models how a response variable changes with one or more predictor variables.

### Types:
1. **Linear Regression:** Models a straight-line relationship between one predictor and one response variable.
2. **Multiple Regression:** Models the response variable using two or more predictors.
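
As a minimal sketch of the multiple-regression case, the example below fits scikit-learn's `LinearRegression` to two predictors. The data is synthetic and noise-free, generated from y = 1 + 2*x1 + 3*x2, an assumption chosen so that the fitted coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one observation with two predictors (x1, x2); values are synthetic
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])

# Response generated without noise from y = 1 + 2*x1 + 3*x2
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

model = LinearRegression().fit(X, y)

print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

# Predict the response for a new observation with x1 = 6, x2 = 2
prediction = model.predict(np.array([[6, 2]]))
print(f'Prediction: {prediction[0]}')
```

With real, noisy data the coefficients would only approximate the underlying relationship.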

In [None]:
from sklearn.linear_model import LinearRegression
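# scikit-learn expects a 2D feature array, so reshape X into a single column
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])
# Fit a straight line y = coefficient * X + intercept by least squares
model = LinearRegression().fit(X, y)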
print(f'Coefficient: {model.coef_[0]}')
print(f'Intercept: {model.intercept_}')

### Practice Exercise
Fit a linear regression model to the dataset: X = [10, 20, 30], y = [15, 25, 35].

## Data Visualization
Data visualization involves representing data graphically to uncover patterns and insights.

### Common Visualizations:
- Line Plot
- Bar Chart
- Scatter Plot
- Histogram

In [None]:
# Example: Scatter Plot
# Generate 50 random points scattered around the line y = 2x
x = np.random.rand(50)
y = 2 * x + np.random.normal(0, 0.1, 50)
plt.scatter(x, y, label='Data')
plt.title('Scatter Plot Example')
plt.legend()
plt.show()

### Practice Exercise
Create a bar chart for the data: Categories = ["A", "B", "C"], Values = [10, 15, 7].

Read more about statistics at [W3Schools](https://www.w3schools.com/statistics/).