# **Maths in Data Science [Statistical Inference and Basic Probability]**

**Instructors:** Jhun Brian M. Andam | Timothy Jonah E. Borromeo

**Course:** Introduction to Data Science

**Objectives:**

- Understand the necessary requirements for a data science task.
- Utilize and demonstrate the various data science tools.

**Statistical inference** is the process of using data analysis to infer properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

<u>Statistical inference makes propositions about a population</u>, using data drawn from the population with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the data and (second) deducing propositions from the model.

### **Key Concepts in Statistical Inference:**

**1. Population vs. Sample**

A population includes all individuals or data points relevant to a study, while a sample is a subset of that population selected for analysis. Since studying an entire population is often impractical, researchers use samples to draw conclusions about the population.

<center><img src="https://www.scribbr.com/wp-content/uploads/2019/09/population-vs-sample-1.png" width="300px"></center>
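To make the distinction concrete, here is a minimal sketch that builds a synthetic population and draws a simple random sample from it. The column names (`unit_id`, `value`) and all numbers are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Synthetic population of 1,000 units (values are invented for illustration).
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "unit_id": np.arange(1000),
    "value": rng.normal(loc=50, scale=10, size=1000),
})

# Simple random sample without replacement: 50 of the 1,000 units.
sample = population.sample(n=50, random_state=42)

print("Population size:", len(population))
print("Sample size:", len(sample))
print("Population mean:", round(population["value"].mean(), 2))
print("Sample mean:", round(sample["value"].mean(), 2))
```

The sample mean will usually be close to, but not equal to, the population mean. That gap is what statistical inference tries to quantify.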

**Example study:** *Understanding the Decline of Route-1 (R1) Jeepneys in Cagayan de Oro: An Exploratory Study on Contributing Factors*

**Potential Variables to Explore**

1. Route Length & Travel Time – Is R1 too long or time-consuming compared to other routes?
2. Profitability – Are R1 drivers earning less due to fewer passengers, longer routes, or higher expenses?
3. Operational Costs – Are fuel prices, maintenance costs, or boundary fees discouraging R1 drivers?
4. Passenger Demand – Have commuters shifted to alternative transport modes (e.g., modern jeepneys, tricycles, or ride-hailing services)?
5. Traffic & Road Conditions – Are road congestion, rerouting policies, or construction projects making R1 less viable?
6. Government Policies & Modernization – Has the PUV modernization program affected R1 jeepney numbers?

**2. Parameter vs. Statistic**

A parameter is a numerical summary that describes a characteristic of an entire population, such as the mean or standard deviation. A statistic, on the other hand, is a numerical summary computed from a sample and used to estimate the corresponding population parameter.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np

In [3]:
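# Load the iris data and treat the 50 setosa rows as the "population" for illustration
df = sns.load_dataset('iris')
setosa_pop = df[df['species'] == 'setosa']
# Draw a random sample of 30 rows without replacement (seeded for reproducibility)
setosa_sam = setosa_pop.sample(n=30, random_state=42)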

# population mean [parameter]
print(setosa_pop['sepal_width'].mean())

# sample mean [statistic]
print(setosa_sam['sepal_width'].mean())

3.428
3.436666666666667
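A parameter is fixed, while a statistic changes from sample to sample. The sketch below illustrates this with a synthetic population; the mean of 3.4 and spread of 0.4 are assumptions chosen only to resemble the scale of the example above, not values taken from the iris data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population" of 50 widths (values are made up for illustration).
population = rng.normal(loc=3.4, scale=0.4, size=50)
parameter = population.mean()  # fixed: computed from the whole population

# Draw 1,000 samples of size 30 without replacement and record each sample mean.
sample_means = np.array([
    rng.choice(population, size=30, replace=False).mean()
    for _ in range(1000)
])

print(f"Population mean (parameter): {parameter:.3f}")
print(f"Average of sample means:     {sample_means.mean():.3f}")
print(f"Spread of sample means (SD): {sample_means.std(ddof=1):.3f}")
```

Each sample mean is a different estimate of the same parameter. Their average sits close to the population mean, and their spread shows how much a single sample can miss by.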


**3. Point Estimation vs. Interval Estimation**

Point estimation provides a single best guess of a population parameter (e.g., using a sample mean to estimate the population mean). Interval estimation, such as a confidence interval, provides a range of plausible values for the parameter. A 95% confidence interval is built by a procedure that captures the true parameter in about 95% of repeated samples; the half-width of the interval is the margin of error.
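For a population mean, the t-based interval has a standard form. The symbols here are notation assumed for this note: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $c$ the confidence level (e.g., 0.95).

$$
\bar{x} \;\pm\; t_{(1+c)/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
$$

The quantity $s/\sqrt{n}$ is the standard error of the mean, and the whole term after $\pm$ is the margin of error.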

In [4]:
# Compute the sample mean [point estimate]
sepal_length = df['sepal_length']
point_estimate = np.mean(sepal_length)
print("Point Estimate (Mean):", point_estimate)

Point Estimate (Mean): 5.843333333333334


In [5]:
# Compute the 95% confidence interval for the mean
confidence_level = 0.95
n = len(sepal_length)
sample_std = np.std(sepal_length, ddof=1)  # Sample standard deviation
standard_error = sample_std / np.sqrt(n)

# Compute the margin of error using the t-distribution
t_critical = stats.t.ppf((1 + confidence_level) / 2, df=n-1)
margin_of_error = t_critical * standard_error

# Compute confidence interval
lower_bound = point_estimate - margin_of_error
upper_bound = point_estimate + margin_of_error

print(f"95% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")

95% Confidence Interval: (5.71, 5.98)
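One way to see what "95% confidence" means is to simulate many samples from a population whose mean is known and count how often the interval contains it. This is a sketch under stated assumptions: the data are drawn from a normal distribution, and the true mean (5.8), standard deviation (0.8), and sample size (30) are made-up values.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
true_mean, true_sd, n, n_trials = 5.8, 0.8, 30, 2000  # made-up values
t_critical = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, true_sd, size=n)
    margin = t_critical * sample.std(ddof=1) / np.sqrt(n)
    # Count the trial if the interval contains the true mean.
    if sample.mean() - margin <= true_mean <= sample.mean() + margin:
        covered += 1

coverage = covered / n_trials
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land near 0.95. The confidence level describes the long-run behavior of the procedure, not the probability that one particular interval is correct.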


**4. Hypothesis Testing**

Hypothesis testing is a statistical method used to determine whether there is enough evidence to support a specific claim about a population parameter. It involves setting up a null hypothesis ($H_0$) and an alternative hypothesis ($H_1$), then using sample data to decide whether to reject $H_0$ in favor of $H_1$. For the iris data, the hypotheses are:

- Null Hypothesis ($H_0$): The mean sepal length is the same across all species.
- Alternative Hypothesis ($H_1$): At least one species has a different mean sepal length.

In [6]:
setosa = df[df['species'] == 'setosa']['sepal_length']
versicolor = df[df['species'] == 'versicolor']['sepal_length']
virginica = df[df['species'] == 'virginica']['sepal_length']

# One-way ANOVA
f_stat, p_value = stats.f_oneway(setosa, versicolor, virginica)

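# Output results (a p-value printed as 0.0000 has been rounded to four decimals; it is below 0.00005, not exactly zero)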
print(f'F-statistic: {f_stat:.4f}')
print(f'P-value: {p_value:.4f}')

F-statistic: 119.2645
P-value: 0.0000
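The F-statistic is the ratio of between-group variance to within-group variance. The sketch below computes it by hand and compares the result with `stats.f_oneway`. It uses three synthetic groups whose means and spread are made-up values, so the numbers will not match the iris output above.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
# Three synthetic groups (made-up means and spread, 50 values each).
groups = [rng.normal(loc=m, scale=0.5, size=50) for m in (5.0, 5.9, 6.6)]

k = len(groups)                               # number of groups
n_total = sum(len(g) for g in groups)         # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}")
print(f"Manual p: {p_manual:.3e}, SciPy p: {p_scipy:.3e}")
```

A large F means the group means differ by more than the scatter inside each group would explain. ANOVA only says that some difference exists, not which groups differ.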


In [7]:
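# Requires statsmodels, which is not installed alongside pandas or SciPy:
#   pip install statsmodels
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test to find which pairs of species differ in mean sepal length
tukey_result = pairwise_tukeyhsd(endog=df['sepal_length'], groups=df['species'], alpha=0.05)

# Display the table of pairwise comparisons
print(tukey_result)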
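Tukey's HSD is also available in SciPy itself (version 1.8 and later) as `scipy.stats.tukey_hsd`, which may be convenient when statsmodels is not installed. The sketch below runs it on three synthetic groups; the group means and spread are made-up values standing in for the species, so the output is illustrative only.

```python
import numpy as np
from scipy.stats import tukey_hsd  # available in SciPy 1.8 and later

rng = np.random.default_rng(3)
# Three synthetic groups standing in for the species (made-up values).
group_a = rng.normal(5.0, 0.5, size=50)
group_b = rng.normal(5.9, 0.5, size=50)
group_c = rng.normal(6.6, 0.5, size=50)

result = tukey_hsd(group_a, group_b, group_c)

# result.pvalue[i, j] is the adjusted p-value for comparing group i with group j.
print(result.pvalue.round(4))

# Simultaneous 95% confidence intervals for each pairwise mean difference.
ci = result.confidence_interval(confidence_level=0.95)
print(ci.low.round(3))
print(ci.high.round(3))
```

With the real data, the three sepal-length Series (one per species) would be passed in place of the synthetic arrays.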
