### Today, we will:

1. Understand statistics in data science
2. Calculate and explain the mean, median, mode, variance, standard deviation, and other statistical measures
3. Work through examples of statistical measures

Statistics in data science involves the collection, analysis, interpretation, and presentation of data. It provides the foundational tools and techniques to extract meaningful insights and make informed decisions from data. In data science, statistics are used to:

- Describe Data: Summarize the characteristics and properties of datasets.
- Make Predictions: Utilize data to forecast trends and outcomes.
- Validate Models: Assess the performance and accuracy of machine learning models.
- Inform Decision Making: Support business strategies and actions based on data-driven insights.

## Overview of the `statistics` Module in Python

In Python, the `statistics` module is part of the standard library and provides functions for basic statistical operations on numerical data. Here's an overview of what the module includes and its purpose:

### Purpose of the `statistics` Module

The `statistics` module is designed to facilitate statistical computations and analysis in Python programs. It offers a set of functions that are essential for analyzing data distributions, making data-driven decisions, and performing statistical calculations in various domains, including data science, research, finance, and more.

### Functions Provided by the `statistics` Module

The `statistics` module includes functions for calculating common statistical measures:

- **mean**: Calculates the arithmetic mean (average) of data.
- **harmonic_mean**: Calculates the harmonic mean, which is suited to averaging rates and ratios such as speeds (e.g. miles per hour).
- **median**: Computes the median (middle value) of data.
- **mode**: Finds the mode (most common value) in data.
- **variance**: Computes the variance of a sample. A sample is a subset of the population used to gather data and make inferences about that population; the population is the whole group you are interested in studying.
- **stdev**: Computes the standard deviation of a sample.
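
To see why the harmonic mean suits rates, here is a minimal sketch using made-up speeds (the 40 and 60 km/h figures and the 120 km leg length are illustrative assumptions, not data from this lesson). When the same distance is covered at each speed, the harmonic mean matches total distance divided by total time, while the arithmetic mean overstates it.

```python
from statistics import mean, harmonic_mean

# Hypothetical speeds (km/h) for two legs of equal distance
speeds = [40, 60]

arithmetic_avg = mean(speeds)          # treats both legs as taking equal time
harmonic_avg = harmonic_mean(speeds)   # accounts for the slower leg taking longer

# Cross-check by hand: total distance / total time for two 120 km legs
leg_distance = 120
total_time = leg_distance / 40 + leg_distance / 60   # 3 h + 2 h
true_avg = (2 * leg_distance) / total_time           # 240 km / 5 h

print("Arithmetic mean:", arithmetic_avg)
print("Harmonic mean:", harmonic_avg)
print("Distance / time:", true_avg)
```

The harmonic mean and the hand calculation should agree at about 48 km/h, below the arithmetic mean of 50.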

## Examples

In [None]:
# Importing necessary functions from the statistics module
from statistics import mean, harmonic_mean, median, mode, variance, stdev

# Example age data
age_data = [10, 15, 12, 15, 20, 18, 15, 12, 10]

# Calculating mean (average age)
mean_age = mean(age_data)

# Calculating harmonic mean
h_mean_age = harmonic_mean(age_data)

# Calculating median (middle age)
median_age = median(age_data)

# Calculating mode (most common age)
mode_age = mode(age_data)

# Calculating variance
variance_value = variance(age_data)

# Calculating standard deviation
stdev_value = stdev(age_data)

# Calculating sum of all values
sum_value = sum(age_data)

# Finding minimum value in data
min_value = min(age_data)

# Finding maximum value in data
max_value = max(age_data)

# Printing results with explanations
print("Data:", age_data)
print("--------------------------------------")
print("Mean (Average):", mean_age)           # Print the mean (average) of data
print("Harmonic Mean:", h_mean_age)          # Print the harmonic mean of data
print("Median:", median_age)                 # Print the median (middle value) of data
print("Mode:", mode_age)                     # Print the mode (most common value) of data
print("Variance:", variance_value)           # Print the variance of data
print("Standard Deviation:", stdev_value)    # Print the standard deviation of data
print("Sum of all ages:", sum_value)         # Print the sum of all values in data
print("Minimum age:", min_value)             # Print the minimum value in data
print("Maximum age:", max_value)             # Print the maximum value in data

Data: [10, 15, 12, 15, 20, 18, 15, 12, 10]
--------------------------------------
Mean (Average): 14.11111111111111
Harmonic Mean: 13.388429752066116
Median: 15
Mode: 15
Variance: 11.86111111111111
Standard Deviation: 3.4439963866286374
Sum of all ages: 127
Minimum age: 10
Maximum age: 20
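
As a check on what `variance` and `stdev` compute, the sketch below applies the sample-variance formula by hand to the same age data: sum the squared deviations from the mean and divide by n - 1. It also calls `pvariance`, the module's population version, which divides by n instead. The `manual_*` variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance, stdev, pvariance

age_data = [10, 15, 12, 15, 20, 18, 15, 12, 10]

n = len(age_data)
avg = mean(age_data)

# Sum of squared deviations from the mean
squared_deviations = sum((x - avg) ** 2 for x in age_data)

# Sample variance divides by n - 1; population variance divides by n
manual_sample_variance = squared_deviations / (n - 1)
manual_population_variance = squared_deviations / n
manual_stdev = sqrt(manual_sample_variance)

print("Manual sample variance:", manual_sample_variance)
print("statistics.variance:   ", variance(age_data))
print("Manual standard deviation:", manual_stdev)
print("statistics.stdev:         ", stdev(age_data))
print("Manual population variance:", manual_population_variance)
print("statistics.pvariance:      ", pvariance(age_data))
```

The manual and library values should agree up to floating-point rounding. The population variance is always the smaller of the two because it divides by the larger number.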


## Exercise

In [None]:
# Task 1: Import necessary functions from the statistics module
from statistics import mean, median, mode, variance, stdev

# Task 2: Define a new dataset (e.g., heights of students in a class)
height_data = [160, 165, 170, 155, 175, 168, 162, 172, 160, 165]

# Task 3: Calculate the mean (average) height using the mean function
mean_height = mean(height_data)
print("Mean (Average) Height:", mean_height)

# Task 4: Calculate the median (middle value) height using the median function
median_height = median(height_data)
print("Median Height:", median_height)

# Task 5: Calculate the mode (most common value) height using the mode function

# Task 6: Calculate the variance of heights using the variance function

# Task 7: Calculate the standard deviation of heights using the stdev function

# Task 8: Print the calculated statistical measures with explanations

# Task 9: If you drive from point A to point B at 60 miles per hour and return from point B to point A at 30 miles per hour, what is your average speed for the entire trip?

print("Data:", height_data)
print("--------------------------------------")
# Print the mean (average) height
print("Mean (Average) Height:" + str(mean_height))
# Print the median (middle value) height
print("Median Height:")
# Print the mode (most common value) height
print("Mode Height:")
# Print the variance of heights
print("Variance of Heights:")
# Print the standard deviation of heights
print("Standard Deviation of Heights:")
# Print the average speed for the entire trip
print("Average Speed for the Trip:")



Mean (Average) Height: 165.2
Median Height: 165.0
Data: [160, 165, 170, 155, 175, 168, 162, 172, 160, 165]
--------------------------------------
Mean (Average) Height:165.2
Median Height:
Mode Height:
Variance of Heights:
Standard Deviation of Heights:
Average Speed for the Trip:


## Discuss your findings below:


Discuss here...

# Statistical significance analysis

In this part, we will explore statistical significance analysis in the context of student performance. This includes performing a t-test, calculating p-values, and understanding z-scores.

We will evaluate whether a new teaching method significantly improves students' test scores compared to the old method.

Students' scores before: [70, 75, 60, 55, 45, 85, 88, 77, 58, 88]

Students' scores after: [75, 80, 85, 90, 65, 95, 89, 85, 60, 90]

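# T-Test
The t-test is a statistical test used to compare the means of two groups to see whether they differ significantly. It uses the t-distribution, which accounts for the extra uncertainty that comes from estimating variability with a small sample. The test produces a t-statistic, and the significance of that statistic is judged by its p-value.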
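
For paired data such as the before-and-after scores listed above, the t-statistic can be sketched by hand: take each student's difference, then divide the mean difference by its standard error. The snippet below uses only the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

scores_before = [70, 75, 60, 55, 45, 85, 88, 77, 58, 88]
scores_after = [75, 80, 85, 90, 65, 95, 89, 85, 60, 90]

# Per-student improvement
differences = [after - before for before, after in zip(scores_before, scores_after)]

n = len(differences)
mean_diff = mean(differences)

# Standard error of the mean difference (sample stdev / sqrt(n))
standard_error = stdev(differences) / sqrt(n)

# Paired t-statistic, compared against a t-distribution with n - 1 degrees of freedom
t_statistic = mean_diff / standard_error

print("Mean difference:", mean_diff)
print("t-statistic:", t_statistic)
print("Degrees of freedom:", n - 1)
```

A larger t-statistic means the average improvement is large relative to how much the individual improvements vary.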


### p-value

The p-value tells us how likely it would be to see a difference at least as large as the one in our test results if the new method actually made no difference.

If the p-value is very small (by convention, less than 0.05), a difference this large would rarely occur by chance alone, so we treat the result as statistically significant.

If the p-value is larger than 0.05, the difference could plausibly be due to random variation, and we cannot conclude that the new teaching method is responsible.

In our case with the teaching method, the p-value measures how surprising the observed improvement in scores would be if the new method had no effect. A small p-value indicates that luck is an unlikely explanation; it does not by itself prove that the new method caused the improvement.
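
One way to make the idea of "by chance" concrete is a sign-flip permutation test, sketched below. This is a different technique from the t-test and is used here only as an illustration: if the new method made no difference, each student's change would be equally likely to be positive or negative, so we try every combination of signs and count how often the total change is at least as large as the observed one.

```python
from itertools import product

scores_before = [70, 75, 60, 55, 45, 85, 88, 77, 58, 88]
scores_after = [75, 80, 85, 90, 65, 95, 89, 85, 60, 90]

differences = [after - before for before, after in zip(scores_before, scores_after)]
observed_total = abs(sum(differences))

# Enumerate every way of flipping the sign of each difference (2**10 = 1024 cases)
extreme_count = 0
total_count = 0
for signs in product([1, -1], repeat=len(differences)):
    flipped_total = sum(sign * diff for sign, diff in zip(signs, differences))
    total_count += 1
    # Two-sided: count totals at least as far from zero as the observed one
    if abs(flipped_total) >= observed_total:
        extreme_count += 1

permutation_p_value = extreme_count / total_count
print("Sign patterns at least as extreme:", extreme_count, "of", total_count)
print("Permutation p-value:", permutation_p_value)
```

Because every student's score went up, only the all-positive and all-negative patterns are this extreme, giving 2 out of 1024 (about 0.002). This number will not match a t-test's p-value exactly, since the two tests rest on different assumptions.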

In [None]:
import statistics as stats
from scipy import stats as scipy_stats

# Students' scores before and after using the new method
scores_before = [70, 75, 60, 55, 45, 85, 88, 77, 58, 88]
scores_after = [75, 80, 85, 90, 65, 95, 89, 85, 60, 90]

# Perform a paired t-test
t_statistic, p_value = scipy_stats.ttest_rel(scores_after, scores_before)

# Calculate means and standard deviations
mean_before = stats.mean(scores_before)
std_dev_before = stats.stdev(scores_before)
mean_after = stats.mean(scores_after)
std_dev_after = stats.stdev(scores_after)

# Calculate z-scores (how many standard deviations each score lies from its group's mean)
z_scores_before = [(score - mean_before) / std_dev_before for score in scores_before]
z_scores_after = [(score - mean_after) / std_dev_after for score in scores_after]

# Results
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)
print("Mean Scores Before:", mean_before)
print("Standard Deviation Before:", std_dev_before)
print("Z-Scores Before:", z_scores_before)
print("Mean Scores After:", mean_after)
print("Standard Deviation After:", std_dev_after)
print("Z-Scores After:", z_scores_after)


T-Statistic: 3.0996735185734767
P-Value: 0.01272916034943612
Mean Scores Before: 70.1
Standard Deviation Before: 15.058773743790251
Z-Scores Before: [-0.006640646954485989, 0.32539170076983237, -0.6707053424031227, -1.002737690127441, -1.6668023855760778, 0.989456396218469, 1.1886758048530601, 0.4582046398595597, -0.8035182814928501, 1.1886758048530601]
Mean Scores After: 81.4
Standard Deviation After: 11.481385901633226
Z-Scores After: [-0.5574239952242704, -0.12193649895530954, 0.3135509973136514, 0.7490384935826123, -1.4283989877621923, 1.1845259898515732, 0.66194099432882, 0.3135509973136514, -1.863886484031153, 0.7490384935826123]
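
A z-score expresses a value as the number of standard deviations it lies from the mean. A quick sanity check, sketched here with a hypothetical `z_scores` helper (not part of any library), is that the z-scores of a dataset have a mean of 0 and a sample standard deviation of 1.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return each value's distance from the mean in sample standard deviations."""
    avg = mean(values)
    spread = stdev(values)
    return [(value - avg) / spread for value in values]

scores_before = [70, 75, 60, 55, 45, 85, 88, 77, 58, 88]
z_before = z_scores(scores_before)

# The lowest score (45) lies furthest below the mean
lowest_z = min(z_before)

print("Mean of z-scores:", round(mean(z_before), 10))
print("Stdev of z-scores:", round(stdev(z_before), 10))
print("Lowest z-score:", round(lowest_z, 3))
```

Standardizing this way makes scores from groups with different means and spreads comparable on a common scale.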


# Why is the T-Test Important in Data Analysis?