### **Statistics**

- Statistics is the science of **collecting**, **organizing**, **analyzing**, **interpreting**, and **presenting data** to make **informed decisions**. 

- Think of it as a toolkit that helps us make sense of numbers and discover patterns in the world around us.

**There are two main branches**

1. **Descriptive Statistics**

- Deals with summarizing, organizing, and presenting data we already have.

- Uses measures like mean (average), median, mode, range, variance, and standard deviation to describe a dataset.

- Also involves graphs, charts, and tables for easy visualization.

- For example, calculating the average weekly coding hours of AI engineering students.

2. **Inferential Statistics**

- Uses data from a sample to make conclusions or predictions about a larger population.

- Involves methods like correlation, regression, hypothesis testing, chi-square tests, t-tests, and ANOVA.

- Supports data-driven decision making by testing hypotheses and estimating outcomes with a stated level of confidence.

- Forms the foundation of predictive analysis, which guides policymakers and organizations in making strategic decisions.
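
The two branches can be contrasted on a tiny example. The sketch below is illustrative only: the `hours` values are made up, and the 95% confidence interval is just one of many inferential tools, chosen here because it is short to show.

```python
import numpy as np
from scipy import stats

# Hypothetical weekly coding hours for ten students (made-up values).
hours = np.array([18, 22, 25, 25, 27, 30, 31, 25, 40, 22])

# Descriptive statistics: summarize the data we already have.
mean = hours.mean()
median = np.median(hours)
values, counts = np.unique(hours, return_counts=True)
mode = values[counts.argmax()]           # most frequent value
data_range = hours.max() - hours.min()
variance = hours.var(ddof=1)             # sample variance
std_dev = hours.std(ddof=1)              # sample standard deviation

print(f"mean={mean:.2f}, median={median}, mode={mode}, range={data_range}")
print(f"variance={variance:.2f}, std={std_dev:.2f}")

# Inferential statistics: use the sample to estimate the population mean,
# here with a 95% confidence interval based on the t distribution.
sem = stats.sem(hours)                   # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(hours) - 1, loc=mean, scale=sem)
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The descriptive part only talks about these ten students; the interval is a statement about the wider population they were (hypothetically) sampled from.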


**Why Statistics?**
- Beyond the confusing formulas and outrageous numbers, statistics will help us answer questions like:

   - What's typical? (measures of center)
   - How much variation is there? (measures of spread)
   - Is this pattern real or just coincidence? (hypothesis testing)
   - Can we predict future outcomes? (regression)
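
As a rough sketch of the last two questions in the list, the snippet below runs a two-sample test and fits a simple linear regression with `scipy.stats`. All of the data is simulated for illustration; the group means, the slope of 1.0, and the seed are assumptions, not values from a real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

# Hypothetical scores for two groups of 100 students each.
group_a = rng.normal(75, 12, 100)
group_b = rng.normal(82, 15, 100)

# "Is this pattern real or just coincidence?"
# Welch's t-test compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# "Can we predict future outcomes?"
# Simulate scores that depend linearly on study hours, then fit a line.
hours = rng.normal(30, 6, 100)
scores = 50 + 1.0 * hours + rng.normal(0, 8, 100)
fit = stats.linregress(hours, scores)

# Predicted score for a student who studies 35 hours per week.
predicted = fit.intercept + fit.slope * 35
print(f"slope = {fit.slope:.2f}, predicted score at 35 h = {predicted:.1f}")
```

A small p-value suggests the difference in means is unlikely to be due to chance alone, and the fitted slope estimates how much the score changes per additional study hour.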

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

In [22]:
# Set the random seed for reproducibility
np.random.seed(40)
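
To see what the seed buys us, here is a minimal check that resetting it replays the same random draws. The second half uses NumPy's newer `Generator` API as an alternative; note that it produces a different stream from the legacy `np.random.seed` functions, so it would not reproduce this notebook's numbers.

```python
import numpy as np

# Legacy global seeding, as used in this notebook.
np.random.seed(40)
first = np.random.normal(25, 5, 3)

np.random.seed(40)  # resetting the seed restarts the same sequence
second = np.random.normal(25, 5, 3)

print(np.array_equal(first, second))  # True

# Alternative: independent generator objects with their own state.
rng_a = np.random.default_rng(40)
rng_b = np.random.default_rng(40)
same = np.array_equal(rng_a.normal(25, 5, 3), rng_b.normal(25, 5, 3))
print(same)  # True
```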

In [23]:
# Simulate a dataset for AI Engineering students

# 1. Traditional learning (classroom-based):
#    mean of 25 hours/week, standard deviation of 5 hours
traditional_study_hours = np.random.normal(25, 5, 100)

# 2. Accelerated learning (project-based, hands-on):
#    mean of 35 hours/week, standard deviation of 8 hours
accelerated_study_hours = np.random.normal(35, 8, 100)

# Generate performance scores on a nominal 0-100 scale.
# Note: these are drawn independently of study hours, so only the track
# averages differ; normal draws can also fall outside 0-100.
traditional_scores = np.random.normal(75, 12, 100)  # mean 75, SD 12
accelerated_scores = np.random.normal(82, 15, 100)  # mean 82, SD 15

# Generate project completion counts
traditional_projects = np.random.poisson(8, 100)   # average of 8 projects
accelerated_projects = np.random.poisson(12, 100)  # average of 12 projects

traditional_scores

array([ 87.15923732,  70.43250214,  78.66763734,  89.27973866,
        80.77793708,  85.02291695,  89.7131645 ,  95.66031395,
        72.46346869,  60.40820454,  82.67671681,  60.02987887,
        75.37652351,  81.66922793,  93.67561015,  90.40899235,
        68.69690819, 104.42824559,  54.52407514,  80.40072209,
        62.4943163 ,  57.2790668 ,  68.77558302,  74.78937089,
        98.41225467,  79.41717259,  63.28440871,  79.25170238,
        67.33258968,  72.03937915,  68.86553843,  71.0382253 ,
        57.59787308,  74.7122058 ,  40.05315747,  70.25776064,
        82.16099481,  53.05976933,  40.63266623,  83.30965498,
        89.51163619,  82.98475059,  67.61919259,  73.5372882 ,
        79.96729413,  61.25685047,  74.49746265,  79.22350236,
        44.87987472,  73.56476563,  80.20139073,  69.8442207 ,
        71.9513029 ,  65.26711906,  55.41536415,  64.49358776,
        79.22291681,  87.90829148,  80.48711319,  84.87389833,
        69.17282845,  63.2433886 ,  73.19657846,  98.46
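
The scores above are drawn independently of the study hours, so the simulated data carries no within-track relationship between the two. If such a relationship is wanted, one option is to build each score from the student's hours plus noise. The sketch below is an assumption-laden alternative, not the notebook's method: `make_scores` is a hypothetical helper, and the slope of 1.5 and noise SD of 9 were picked only so the overall spread stays near 12.

```python
import numpy as np

# Separate generator for this sketch; it does not reproduce the notebook's values.
rng = np.random.default_rng(40)

study_hours = rng.normal(25, 5, 100)

def make_scores(hours, mean_score, slope, noise_sd, rng):
    """Scores centred on mean_score that rise with study hours, plus random noise."""
    noise = rng.normal(0, noise_sd, hours.size)
    return mean_score + slope * (hours - hours.mean()) + noise

# Assumed parameters: 1.5 points per extra study hour, noise SD of 9 points.
scores = make_scores(study_hours, mean_score=75, slope=1.5, noise_sd=9, rng=rng)

# Pearson correlation between hours and scores should now be clearly positive.
r = np.corrcoef(study_hours, scores)[0, 1]
print(f"correlation between hours and scores: {r:.2f}")
```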

In [24]:
# Now, let's create the DataFrame

data = pd.DataFrame({
    'study_Hours_Per_Week': np.concatenate([traditional_study_hours, accelerated_study_hours]),
    'Performance_Score': np.concatenate([traditional_scores, accelerated_scores]),
    # Despite its name, this column holds the number of projects completed
    'Performance_Completed': np.concatenate([traditional_projects, accelerated_projects]),
    'Learning_Track': ['Traditional'] * 100 + ['Accelerated'] * 100
})

In [None]:
data.head()  # the first 5 rows

| | study_Hours_Per_Week | Performance_Score | Performance_Completed | Learning_Track |
|---|---|---|---|---|
| 0 | 21.962262 | 87.159237 | 9 | Traditional |
| 1 | 24.369318 | 70.432502 | 8 | Traditional |
| 2 | 21.576968 | 78.667637 | 2 | Traditional |
| 3 | 29.643574 | 89.279739 | 13 | Traditional |
| 4 | 15.777995 | 80.777937 | 8 | Traditional |


In [None]:
data.tail()  # the last 5 rows

| | study_Hours_Per_Week | Performance_Score | Performance_Completed | Learning_Track |
|---|---|---|---|---|
| 195 | 35.650585 | 87.690171 | 10 | Accelerated |
| 196 | 42.576661 | 83.825481 | 11 | Accelerated |
| 197 | 30.730266 | 101.763224 | 5 | Accelerated |
| 198 | 35.108772 | 89.349736 | 15 | Accelerated |
| 199 | 23.492957 | 72.949309 | 14 | Accelerated |


In [None]:
# Clean the data: clip values to keep them within realistic bounds
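
One possible way to do the clipping is pandas' `Series.clip`. The sketch below works on a small stand-in frame named `demo` (hypothetical rows, using the notebook's column names) so it runs on its own. The bounds are assumptions: scores limited to 0-100 as the simulation comments suggest, hours limited to 0-168 (the number of hours in a week), and project counts kept non-negative.

```python
import pandas as pd

# Stand-in for the notebook's `data`: illustrative rows, not actual rows of
# the dataset. Two scores exceed 100, as can happen with normal draws.
demo = pd.DataFrame({
    'study_Hours_Per_Week': [21.962262, 42.576661, 30.730266],
    'Performance_Score': [87.159237, 104.428246, 101.763224],
    'Performance_Completed': [9, 11, 5],
})

# Assumed bounds (see the lead-in); adjust them to whatever counts as realistic.
demo['Performance_Score'] = demo['Performance_Score'].clip(lower=0, upper=100)
demo['study_Hours_Per_Week'] = demo['study_Hours_Per_Week'].clip(lower=0, upper=168)
demo['Performance_Completed'] = demo['Performance_Completed'].clip(lower=0)

print(demo)
```

Applied to the real `data` frame, the same three `clip` calls would cap values such as the 104.43 and 101.76 scores at 100 while leaving in-range values unchanged.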

