# 1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.

Qualitative Data: Descriptive data that cannot be measured but can be categorized based on traits and characteristics.
Example: Eye color (brown, blue, green).

Quantitative Data: Numerical data that can be measured or counted.
Example: Number of students in a class (45).

Nominal Scale: Data classified into distinct categories with no inherent order.
Example: Types of fruits (apple, banana, orange).

Ordinal Scale: Data categorized with a meaningful order, but the distances between categories are not equal or measurable.
Example: Satisfaction rating (satisfied, neutral, unsatisfied).

Interval Scale: Data with meaningful intervals between values but no true zero point.
Example: Temperature in Celsius (0°C does not indicate the absence of temperature).

Ratio Scale: Data with meaningful intervals and a true zero point, allowing for comparison of absolute magnitudes.
Example: Weight (0 kg means no weight).

# 2. What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.

Mean: The arithmetic average of a dataset (the sum of the values divided by their count), useful when data is roughly symmetrically distributed without extreme outliers.
Example: Average age of students in a class.

Median: The middle value in an ordered dataset, appropriate when data is skewed or has outliers.
Example: Median income in a population where some individuals are extremely wealthy.

Mode: The most frequent value in the dataset, useful when identifying the most common category.
Example: Mode of shoe sizes in a group of people.
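A small sketch of how the three measures react to an outlier. The income figures are hypothetical values chosen for illustration; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical incomes (in thousands); the last value is an extreme outlier.
incomes = [30, 32, 35, 35, 38, 40, 500]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value of the sorted data
mode_income = statistics.mode(incomes)      # most frequent value

print(mean_income, median_income, mode_income)
```

Here the median and mode are both 35, while the mean is above 100 because of the single value of 500. This is why the median is usually preferred for skewed data.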

# 3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

Dispersion refers to the extent to which values in a dataset vary around the central tendency.

Variance: The average of the squared deviations from the mean, giving an idea of how far data points spread out. Because the deviations are squared, variance is expressed in squared units of the data.

Standard Deviation: The square root of variance, providing a more intuitive measure of spread in the same units as the data.

Example: For a dataset of exam scores, a high standard deviation means the scores vary greatly, while a low standard deviation means most students scored similarly.
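The exam-score example can be sketched as follows. Both score lists are hypothetical and share the same mean of 70; the sketch uses the population formulas (dividing by n) from the standard `statistics` module.

```python
import statistics

scores_spread = [40, 55, 70, 85, 100]  # hypothetical exam scores, widely spread
scores_tight = [68, 69, 70, 71, 72]    # same mean, tightly clustered

for name, scores in [("spread", scores_spread), ("tight", scores_tight)]:
    var = statistics.pvariance(scores)  # population variance: mean squared deviation
    sd = statistics.pstdev(scores)      # population standard deviation: sqrt(variance)
    print(name, var, round(sd, 2))
```

The widely spread scores have a variance of 450 (standard deviation about 21.21), while the tightly clustered scores have a variance of 2 (standard deviation about 1.41).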



# 4. What is a box plot, and what can it tell you about the distribution of data?

A box plot (or box-and-whisker plot) visually represents the distribution of data based on five summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It shows the spread, central value, and potential outliers in the dataset.

Interpretation:

The box represents the interquartile range (IQR), showing the middle 50% of the data.
The line inside the box is the median.
Whiskers extend to the smallest and largest values that lie within 1.5 times the IQR below Q1 and above Q3; any points beyond the whiskers are plotted individually as potential outliers.
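The quantities a box plot draws can be computed directly. This is a minimal sketch on hypothetical data; it assumes the "inclusive" quartile method from the standard `statistics` module, and other quartile conventions can give slightly different values.

```python
import statistics

data = sorted([7, 2, 9, 4, 30, 12, 5, 11, 8])  # hypothetical values; 30 is unusually large

# Quartiles via the "inclusive" method (one of several common conventions).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the most extreme data points that are still inside the fences.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

print("five-number summary:", data[0], q1, median, q3, data[-1])
print("whiskers:", whisker_low, whisker_high)
```

For this data the box spans 5 to 11 with the median at 8, the whiskers end at 2 and 12, and the value 30 would be drawn as a separate point.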

# 5. Discuss the role of random sampling in making inferences about populations.

Random sampling ensures that each member of a population has an equal chance of being selected for the sample. This randomness helps avoid bias and allows for the generalization of findings from the sample to the entire population, enabling more accurate and reliable inferences about the population's characteristics.

Example: Surveying 500 randomly selected voters to infer the voting preferences of an entire country's population.
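The voter example can be simulated as a rough sketch. The population below is entirely hypothetical (100,000 voters, 54% supporting candidate A), and `random.sample` draws a simple random sample without replacement.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 voters; 1 = supports candidate A, 0 = does not.
population = [1] * 54_000 + [0] * 46_000

sample = random.sample(population, k=500)  # simple random sample without replacement
sample_share = sum(sample) / len(sample)
population_share = sum(population) / len(population)

print(f"population share: {population_share:.3f}, sample estimate: {sample_share:.3f}")
```

The sample estimate will typically land within a few percentage points of the true share, which is what makes inference from a random sample possible.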

# 6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

Skewness measures the asymmetry of a data distribution. It tells us whether the data tail extends more to the right (positive skew) or left (negative skew).

Positive Skew (Right Skew): The tail is on the right side, and the mean is greater than the median.
Example: Income distribution, where a few individuals have very high incomes.

Negative Skew (Left Skew): The tail is on the left side, and the mean is less than the median.
Example: Age at retirement, where most people retire around 65, but a few retire much earlier.

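Skewness affects interpretation because the mean is pulled toward the tail and away from the median. In a skewed dataset, reporting only the mean can misrepresent the typical value, so the median is usually the more reliable measure of central tendency.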
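A minimal sketch of the sign of skewness and the mean/median relationship. It assumes the population (Fisher-Pearson) definition of skewness, and both datasets are hypothetical.

```python
import statistics

def skewness(data):
    """Population (Fisher-Pearson) skewness: third central moment / sd cubed."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)

right_skewed = [20, 22, 25, 27, 30, 35, 150]  # hypothetical incomes, one very high value
left_skewed = [40, 60, 62, 63, 64, 65, 66]    # hypothetical retirement ages, one early retiree

for data in (right_skewed, left_skewed):
    print(round(skewness(data), 2), statistics.fmean(data), statistics.median(data))
```

The first dataset has positive skewness with the mean above the median; the second has negative skewness with the mean (60) below the median (63).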

# 7. What is the interquartile range (IQR), and how is it used to detect outliers?

The Interquartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), capturing the spread of the middle 50% of the data.

IQR = Q3 - Q1
To detect outliers, any data point below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) is considered an outlier.
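The rule above can be sketched as a small helper. The function name `iqr_outliers` and the data are hypothetical, and the quartiles use the "inclusive" method from the standard `statistics` module (other conventions may shift the fences slightly).

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return the values lying below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

values = [10, 12, 13, 14, 15, 16, 18, 19, 45]  # hypothetical data
print(iqr_outliers(values))  # [45]
```

Here Q1 = 13 and Q3 = 18, so IQR = 5 and the fences are 5.5 and 25.5; only 45 falls outside them.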

# 8. Discuss the conditions under which the binomial distribution is used.

The binomial distribution applies when:

There is a fixed number, n, of independent trials.
Each trial has only two possible outcomes (success or failure).
The probability of success (p) remains constant across trials.
Example: Flipping a coin 10 times and counting the number of heads.
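For the coin example, the binomial probability P(X = k) = C(n, k) p^k (1 - p)^(n - k) can be evaluated directly. The helper name `binomial_pmf` and the choice of k = 6 are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 flips of a fair coin
prob_6_heads = binomial_pmf(6, n, p)
print(prob_6_heads)  # 210 / 1024 = 0.205078125
```

Summing `binomial_pmf(k, n, p)` over k = 0, ..., n gives 1, as it must for a probability distribution.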

# 9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).

The normal distribution is a continuous probability distribution that is symmetric and bell-shaped.

Mean = Median = Mode.
The curve is symmetric around the mean.
Most of the data lies close to the mean.
The Empirical Rule states that:

About 68% of data falls within 1 standard deviation of the mean.
About 95% falls within 2 standard deviations.
About 99.7% falls within 3 standard deviations.
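The three percentages can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / √2); the helper name `prob_within` is an illustrative assumption.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))  # 0.6827, 0.9545, 0.9973
```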


# 10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

Poisson Process Example: The number of emails received per hour.

Suppose on average you receive 5 emails per hour, and you want to find the probability of receiving exactly 3 emails in one hour. Using the Poisson formula:

In [3]:
import math

# Given values
lambd = 5  # average number of events (λ)
k = 3  # number of occurrences

# Poisson formula: P(X = k) = (e^(-λ) * λ^k) / k!
probability = (math.exp(-lambd) * lambd**k) / math.factorial(k)

probability

0.14037389581428056

There is a 14.04% chance of receiving exactly 3 emails in an hour.

# 11. Explain what a random variable is and differentiate between discrete and continuous random variables.

A random variable is a variable that takes on different numerical outcomes, determined by a random process.

Discrete Random Variable: Takes on a countable number of values.
Example: Number of students in a classroom.

Continuous Random Variable: Can take any value within a range, so its possible values are uncountably infinite.
Example: Height of students.

# 12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

Example Dataset:

In [5]:
import numpy as np

# Example dataset
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

# Mean of X and Y
mean_X = np.mean(X)
mean_Y = np.mean(Y)

# Covariance formula: Cov(X, Y) = sum((X_i - mean_X) * (Y_i - mean_Y)) / n
covariance = sum((x - mean_X) * (y - mean_Y) for x, y in zip(X, Y)) / len(X)

# Standard deviations of X and Y
std_X = np.std(X, ddof=0)
std_Y = np.std(Y, ddof=0)

# Correlation formula: Corr(X, Y) = Cov(X, Y) / (std_X * std_Y)
correlation = covariance / (std_X * std_Y)

covariance, correlation

(4.0, 0.9999999999999998)

In [6]:
# Output by particular names
covariance_answer = f"Covariance = {covariance}"
correlation_answer = f"Correlation = {correlation}"

covariance_answer, correlation_answer


('Covariance = 4.0', 'Correlation = 0.9999999999999998')

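Interpretation: A covariance of 4.0 (the population covariance, dividing by n) is positive, which means that as X increases, Y tends to increase as well. A correlation of 1 (printed as 0.9999999999999998 only because of floating-point rounding) indicates a perfect positive linear relationship between the two variables; here Y = 2X exactly.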
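As a cross-check, NumPy's built-in functions give the same results. Note that `np.cov` divides by n - 1 by default (the sample covariance), which for this dataset is 5.0, whereas dividing by n (`ddof=0`) gives the population covariance of 4.0. The correlation does not depend on that choice.

```python
import numpy as np

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

cov_population = np.cov(X, Y, ddof=0)[0, 1]  # divides by n
cov_sample = np.cov(X, Y)[0, 1]              # default divides by n - 1
corr = np.corrcoef(X, Y)[0, 1]               # Pearson correlation; the divisor cancels out

print(cov_population, cov_sample, corr)
```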