Question 1. Generate a list of 100 integers with values between 90 and 130 and store it in the variable `int_list`.
After generating the list, do the following:

  (i) Write a Python function to calculate the mean of a given list of numbers.

  (ii) Create a function to find the median of a list of numbers.

  (iii) Develop a program to compute the mode of a list of integers.

  (iv) Implement a function to calculate the weighted mean of a list of values and their corresponding weights.

  (v) Write a Python function to find the geometric mean of a list of positive numbers.

  (vi) Create a program to calculate the harmonic mean of a list of values.

  (vii) Build a function to determine the midrange of a list of numbers (average of the minimum and maximum).

  (viii) Implement a Python program to find the trimmed mean of a list, excluding a certain percentage of outliers.


### *Solution:*

The Python program below generates the list of integers and then implements each of the requested functions.

```python
import random
import statistics
import math

# Generate a list of 100 integers between 90 and 130
int_list = [random.randint(90, 130) for _ in range(100)]

# (i) Function to calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / len(numbers)

# (ii) Function to find the median of a list of numbers
def median(numbers):
    sorted_numbers = sorted(numbers)
    n = len(sorted_numbers)
    if n % 2 == 1:
        return sorted_numbers[n // 2]
    else:
        return (sorted_numbers[n // 2 - 1] + sorted_numbers[n // 2]) / 2

# (iii) Function to compute the mode of a list of integers
def mode(numbers):
    return statistics.mode(numbers)

# (iv) Function to calculate the weighted mean of a list of values and their corresponding weights
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

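# (v) Function to calculate the geometric mean of a list of positive numbers
def geometric_mean(numbers):
    # Averaging the logarithms is equivalent to taking the nth root of the product,
    # but avoids an overflow when the product of a long list becomes very large.
    log_sum = sum(math.log(num) for num in numbers)
    return math.exp(log_sum / len(numbers))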

# (vi) Function to calculate the harmonic mean of a list of values
def harmonic_mean(numbers):
    return len(numbers) / sum(1 / num for num in numbers)

# (vii) Function to determine the midrange (average of the min and max) of a list of numbers
def midrange(numbers):
    return (min(numbers) + max(numbers)) / 2

# (viii) Function to calculate the trimmed mean (excluding a certain percentage of outliers)
def trimmed_mean(numbers, percentage):
    sorted_numbers = sorted(numbers)
    n = len(numbers)
    trim_count = int(n * percentage / 100)
    trimmed_numbers = sorted_numbers[trim_count: n - trim_count]
    return sum(trimmed_numbers) / len(trimmed_numbers)

# Example use of the functions:
print(f"Generated list: {int_list[:10]}...")  # Just showing a portion of the list for brevity

# Calculate Mean
mean_value = mean(int_list)
print(f"Mean: {mean_value}")

# Calculate Median
median_value = median(int_list)
print(f"Median: {median_value}")

# Calculate Mode
mode_value = mode(int_list)
print(f"Mode: {mode_value}")

# Example values and weights for weighted mean
values = [100, 110, 120]
weights = [1, 2, 3]
weighted_mean_value = weighted_mean(values, weights)
print(f"Weighted Mean: {weighted_mean_value}")

# Calculate Geometric Mean
geometric_mean_value = geometric_mean([x for x in int_list if x > 0])  # Ensure positive values
print(f"Geometric Mean: {geometric_mean_value}")

# Calculate Harmonic Mean
harmonic_mean_value = harmonic_mean([x for x in int_list if x > 0])  # Ensure positive values
print(f"Harmonic Mean: {harmonic_mean_value}")

# Calculate Midrange
midrange_value = midrange(int_list)
print(f"Midrange: {midrange_value}")

# Calculate Trimmed Mean (excluding 10% of the lowest and highest values)
trimmed_mean_value = trimmed_mean(int_list, 10)
print(f"Trimmed Mean (10% outliers): {trimmed_mean_value}")
```

### Explanation of Each Function:

1. **Mean**: The mean is calculated by summing all the values in the list and dividing by the number of elements.
2. **Median**: The list is sorted first, and if the number of elements is odd, the middle element is returned. If it's even, the average of the two middle elements is returned.
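3. **Mode**: The mode is the value that appears most frequently in the list. The `statistics.mode` function handles this; when several values tie for the highest count, Python 3.8+ returns the first one encountered (older versions raise `StatisticsError`), and `statistics.multimode` returns all of them.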
4. **Weighted Mean**: Each value in the list is multiplied by its corresponding weight, then summed and divided by the total sum of weights.
5. **Geometric Mean**: This is calculated by multiplying all the numbers together and then taking the nth root of the product, where n is the length of the list.
6. **Harmonic Mean**: The harmonic mean is calculated as the reciprocal of the arithmetic mean of the reciprocals of the numbers.
7. **Midrange**: The midrange is the average of the minimum and maximum values in the list.
8. **Trimmed Mean**: The list is sorted, and a percentage of the lowest and highest values are discarded before calculating the mean of the remaining values.
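
As a sanity check on the definitions above, the hand-written formulas can be compared with their counterparts in Python's `statistics` module. This is a minimal sketch, assuming Python 3.8 or newer (for `fmean`, `geometric_mean`, and `multimode`); the seed and the small `tied` list are arbitrary choices made for illustration, not part of the original task.

```python
import math
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed so the check is repeatable
sample = [rng.randint(90, 130) for _ in range(100)]

# Hand-written versions, following the definitions described above.
hand_mean = sum(sample) / len(sample)
hand_geometric = math.exp(sum(math.log(x) for x in sample) / len(sample))
hand_harmonic = len(sample) / sum(1 / x for x in sample)

# Standard-library equivalents.
lib_mean = statistics.fmean(sample)
lib_geometric = statistics.geometric_mean(sample)
lib_harmonic = statistics.harmonic_mean(sample)

for name, hand, lib in [
    ("mean", hand_mean, lib_mean),
    ("geometric", hand_geometric, lib_geometric),
    ("harmonic", hand_harmonic, lib_harmonic),
]:
    agrees = math.isclose(hand, lib, rel_tol=1e-9)
    print(f"{name:>9}: hand={hand:.6f}  library={lib:.6f}  agree={agrees}")

# With ties, mode() returns the first mode seen; multimode() returns all of them.
tied = [3, 3, 7, 7, 5]
print(statistics.mode(tied))       # 3
print(statistics.multimode(tied))  # [3, 7]
```

For positive data the three means are always ordered as harmonic ≤ geometric ≤ arithmetic, which is a quick way to spot a mistake in any of the three functions.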

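### Sample Output (Example):

The list is random, so the figures below are illustrative and will differ on every run; only the weighted mean, which uses fixed inputs, is reproducible.

```text
Generated list: [109, 121, 98, 120, 102, 110, 107, 113, 125, 107]...
Mean: 109.84
Median: 110.0
Mode: 106
Weighted Mean: 113.33333333333333
Geometric Mean: 107.655
Harmonic Mean: 107.27116666749595
Midrange: 111.5
Trimmed Mean (10% outliers): 109.72
```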

Each of these functions should provide a good foundation for statistical analysis of the `int_list` you generated.


Question 2. Generate a list of 500 integers with values between 200 and 300 and store it in the variable `int_list2`.
After generating the list, do the following:

  (i) Compare the following visualizations of the data:
    1. Frequency & Gaussian distribution
    2. Frequency & smoothened KDE plot
    3. Gaussian distribution & smoothened KDE plot

  (ii) Write a Python function to calculate the range of a given list of numbers.

  (iii) Create a program to find the variance and standard deviation of a list of numbers.

  (iv) Implement a function to compute the interquartile range (IQR) of a list of values.

  (v) Build a program to calculate the coefficient of variation for a dataset.

  (vi) Write a Python function to find the mean absolute deviation (MAD) of a list of numbers.

  (vii) Create a program to calculate the quartile deviation of a list of values.

  (viii) Implement a function to find the range-based coefficient of dispersion for a dataset.

### *Solution:*



### Step 1: Generate the list of 500 integers

```python
import random

# Generate a list of 500 integers between 200 and 300
int_list2 = [random.randint(200, 300) for _ in range(500)]
```

This will give us a list `int_list2` with 500 random integers between 200 and 300.

### (i) Visualization of Data

To visualize the data, we can use **matplotlib** and **seaborn** to plot the frequency distribution, Gaussian distribution, and smoothened KDE plot.

#### 1. Frequency & Gaussian Distribution

We'll plot the histogram of the data to show the frequency distribution and overlay a Gaussian (normal) distribution on top of it.

#### 2. Frequency Smoothened KDE Plot

We'll plot a Kernel Density Estimation (KDE) plot to smooth out the frequency distribution.

#### 3. Gaussian Distribution & Smoothened KDE Plot

We'll compare both the Gaussian distribution and the KDE plot.

```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm

# Plotting the frequency and Gaussian distribution
plt.figure(figsize=(12, 8))

# 1. Frequency plot (histogram)
sns.histplot(int_list2, kde=False, bins=20, color='blue', stat='density', label='Frequency Distribution')

# Overlay a Gaussian distribution (normal distribution curve)
# Named mu rather than mean so it does not shadow a mean() function defined earlier
mu = np.mean(int_list2)
std_dev = np.std(int_list2)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std_dev)
plt.plot(x, p, 'k', linewidth=2, label='Gaussian Distribution')

plt.title('Frequency Distribution with Gaussian Overlay')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()

# 2. Frequency Smoothened KDE Plot
plt.figure(figsize=(12, 8))
sns.kdeplot(int_list2, color='red', fill=True, label='Smoothened KDE Plot')  # fill replaces the deprecated shade argument
plt.title('Smoothened KDE Plot')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()

# 3. Gaussian Distribution and Smoothened KDE Plot
plt.figure(figsize=(12, 8))

# Plot Gaussian distribution
plt.plot(x, p, 'k', linewidth=2, label='Gaussian Distribution')

# Plot KDE
sns.kdeplot(int_list2, color='red', fill=True, label='Smoothened KDE Plot')  # fill replaces the deprecated shade argument

plt.title('Gaussian Distribution and Smoothened KDE Plot')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()
```
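
To make the comparison between the Gaussian curve and the KDE concrete without a plot, both densities can be evaluated at a few points. The sketch below uses only the standard library and a hand-written Gaussian kernel estimate; the seed and the bandwidth rule (the normal-reference rule of thumb, h = 1.06·σ·n^(−1/5)) are assumptions made for illustration, and seaborn's own bandwidth choice may differ.

```python
import math
import random
import statistics

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x: the average of one small bump per data point."""
    return sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / len(data)

rng = random.Random(0)  # arbitrary fixed seed, for repeatability only
data = [rng.randint(200, 300) for _ in range(500)]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)
# Normal-reference rule of thumb for the bandwidth (an assumption of this sketch).
bandwidth = 1.06 * sigma * len(data) ** (-1 / 5)

for x in (200, 225, 250, 275, 300):
    fitted = gaussian_pdf(x, mu, sigma)
    smoothed = kde(x, data, bandwidth)
    print(f"x={x}: gaussian={fitted:.5f}  kde={smoothed:.5f}")
```

Because the integers are drawn uniformly, one would expect the KDE values to stay flatter across the range than the bell-shaped Gaussian values, which is the difference the third plot is meant to show.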

### (ii) Python Function to Calculate the Range of a List

The range is the difference between the maximum and minimum values in a dataset.

```python
def calculate_range(numbers):
    return max(numbers) - min(numbers)

range_value = calculate_range(int_list2)
print(f"Range: {range_value}")
```

### (iii) Variance and Standard Deviation

Variance is the average of the squared differences from the mean (here the population variance, which divides by the number of values), and standard deviation is the square root of the variance.

```python
def calculate_variance(numbers):
    mean_value = np.mean(numbers)
    return sum((x - mean_value) ** 2 for x in numbers) / len(numbers)

def calculate_std_dev(variance):
    return variance ** 0.5

variance_value = calculate_variance(int_list2)
std_dev_value = calculate_std_dev(variance_value)

print(f"Variance: {variance_value}")
print(f"Standard Deviation: {std_dev_value}")
```

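Alternatively, you can use `numpy` to calculate these values. `np.var` and `np.std` default to `ddof=0`, the population formulas that divide by n; pass `ddof=1` for the sample estimates: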

```python
variance_np = np.var(int_list2)
std_dev_np = np.std(int_list2)

print(f"Variance (numpy): {variance_np}")
print(f"Standard Deviation (numpy): {std_dev_np}")
```
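
The difference between the population formula (divide by n) and the sample formula (divide by n − 1) is easiest to see on a tiny dataset. The following sketch uses the standard library's `statistics` module; the eight values are made up for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset with mean 5

pop_var = statistics.pvariance(data)   # sum of squared deviations / n
samp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)

print(float(pop_var), math.sqrt(pop_var))  # 4.0 2.0
print(round(samp_var, 4))                  # 4.5714
```

The squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7. With 500 values the two differ only slightly, but the distinction matters for small samples.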

### (iv) Interquartile Range (IQR)

IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset.

```python
def calculate_iqr(numbers):
    Q1 = np.percentile(numbers, 25)
    Q3 = np.percentile(numbers, 75)
    return Q3 - Q1

iqr_value = calculate_iqr(int_list2)
print(f"Interquartile Range (IQR): {iqr_value}")
```
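
Quartiles can be computed by more than one convention, so IQR values from different tools need not agree exactly. As a hedged illustration, the standard library's `statistics.quantiles` (Python 3.8+) offers two methods; its `"inclusive"` method uses the same linear interpolation as the default of `np.percentile`, while `"exclusive"` is its own default. The data below are made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up data

def iqr(values, method):
    """Interquartile range using the given statistics.quantiles method."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method=method)
    return q3 - q1

print(iqr(data, "inclusive"))  # 3.5
print(iqr(data, "exclusive"))  # 4.5
```

For a large list such as `int_list2` the two conventions give nearly the same result; the gap is visible mainly on short lists like this one.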

### (v) Coefficient of Variation

The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage.

```python
def coefficient_of_variation(numbers):
    mean_value = np.mean(numbers)
    std_dev_value = np.std(numbers)
    return (std_dev_value / mean_value) * 100

cv_value = coefficient_of_variation(int_list2)
print(f"Coefficient of Variation: {cv_value}%")
```

### (vi) Mean Absolute Deviation (MAD)

MAD is the average of the absolute deviations from the mean.

```python
def mean_absolute_deviation(numbers):
    mean_value = np.mean(numbers)
    return np.mean([abs(x - mean_value) for x in numbers])

mad_value = mean_absolute_deviation(int_list2)
print(f"Mean Absolute Deviation (MAD): {mad_value}")
```

### (vii) Quartile Deviation

Quartile Deviation (QD) is half of the IQR.

```python
def quartile_deviation(numbers):
    return calculate_iqr(numbers) / 2

qd_value = quartile_deviation(int_list2)
print(f"Quartile Deviation: {qd_value}")
```

### (viii) Range-Based Coefficient of Dispersion

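The range-based coefficient of dispersion is calculated here as the ratio of the range to the mean. Many textbooks instead define the coefficient of range as (max − min) / (max + min), so check which definition is expected.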

```python
def range_based_dispersion(numbers):
    range_value = calculate_range(numbers)
    mean_value = np.mean(numbers)
    return range_value / mean_value

dispersion_value = range_based_dispersion(int_list2)
print(f"Range-Based Coefficient of Dispersion: {dispersion_value}")
```
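
Since the term is not used uniformly, a short sketch comparing the two common definitions may help. The function names and the sample values below are hypothetical and chosen only for illustration.

```python
def coefficient_of_range(numbers):
    """(max - min) / (max + min): unit-free, and at most 1 for positive data."""
    low, high = min(numbers), max(numbers)
    return (high - low) / (high + low)

def range_over_mean(numbers):
    """Range divided by the arithmetic mean, as used in this solution."""
    return (max(numbers) - min(numbers)) / (sum(numbers) / len(numbers))

data = [200, 220, 250, 280, 300]  # made-up values with mean 250
print(coefficient_of_range(data))  # 0.2
print(range_over_mean(data))       # 0.4
```

For data that are roughly symmetric around the midpoint, the mean is close to (max + min) / 2, so the second measure comes out at about twice the first.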

### Summary of Functions

- **Visualization**: We used histograms, Gaussian overlays, and KDE plots for data visualization.
- **Range**: Calculated as `max - min`.
- **Variance and Standard Deviation**: Basic statistics measures of spread.
- **Interquartile Range (IQR)**: Measures the spread of the middle 50% of the data.
- **Coefficient of Variation**: Measures the relative variability as a percentage of the mean.
- **Mean Absolute Deviation (MAD)**: The average of absolute deviations from the mean.
- **Quartile Deviation (QD)**: Half of the IQR.
- **Range-Based Coefficient of Dispersion**: The ratio of range to mean.

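### Example Output:

The exact values change on every run. For 500 integers drawn uniformly from 200 to 300, the results should land close to the theoretical values of that distribution, shown approximately below (these are not the output of an actual run):

```text
Range: 100
Variance: ≈ 850
Standard Deviation: ≈ 29.2
Interquartile Range (IQR): ≈ 50
Coefficient of Variation: ≈ 11.7%
Mean Absolute Deviation (MAD): ≈ 25
Quartile Deviation: ≈ 25
Range-Based Coefficient of Dispersion: ≈ 0.40
```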

This program covers all the required tasks, generating statistics and visualizations for the dataset `int_list2`.

Question 3. Write a Python class representing a discrete random variable with methods to calculate its expected value and variance.

### *Solution:*
To create a Python class representing a **discrete random variable** and to include methods for calculating its **expected value** and **variance**, we need to follow these steps:

1. **Class Definition**: Define a class that represents a discrete random variable.
2. **Initialization (`__init__`)**: The class will need two primary attributes:
   - A list of possible outcomes.
   - A list of probabilities corresponding to each outcome.
3. **Expected Value**: Implement a method to compute the expected value, which is calculated by the formula:
   
   \[
   E[X] = \sum_{i} p(x_i) \cdot x_i
   \]
   
   Where \( p(x_i) \) is the probability of outcome \( x_i \).
   
4. **Variance**: Implement a method to compute the variance, which is calculated by the formula:

   \[
   \text{Var}(X) = \sum_{i} p(x_i) \cdot (x_i - E[X])^2
   \]

   This represents how spread out the values of the random variable are around the expected value.

### Python Code:

```python
class DiscreteRandomVariable:
    def __init__(self, outcomes, probabilities):
        """
        Initializes the discrete random variable with a list of outcomes and their corresponding probabilities.
        
        :param outcomes: A list of possible outcomes (values of the random variable).
        :param probabilities: A list of probabilities corresponding to each outcome.
        """
        if len(outcomes) != len(probabilities):
            raise ValueError("Outcomes and probabilities must have the same length.")
        
        if not all(0 <= p <= 1 for p in probabilities):
            raise ValueError("Probabilities must be between 0 and 1.")
        
        if not abs(sum(probabilities) - 1) < 1e-6:
            raise ValueError("The sum of probabilities must be 1.")
        
        self.outcomes = outcomes
        self.probabilities = probabilities

    def expected_value(self):
        """
        Calculates the expected value (mean) of the discrete random variable.
        
        :return: The expected value of the random variable.
        """
        return sum(x * p for x, p in zip(self.outcomes, self.probabilities))

    def variance(self):
        """
        Calculates the variance of the discrete random variable.
        
        :return: The variance of the random variable.
        """
        mean = self.expected_value()
        return sum(p * (x - mean) ** 2 for x, p in zip(self.outcomes, self.probabilities))

# Example usage:

# Define the outcomes and corresponding probabilities
outcomes = [1, 2, 3, 4, 5]
probabilities = [0.1, 0.2, 0.3, 0.2, 0.2]

# Create an instance of DiscreteRandomVariable
random_var = DiscreteRandomVariable(outcomes, probabilities)

# Calculate the expected value
expected_val = random_var.expected_value()
print(f"Expected Value: {expected_val}")

# Calculate the variance
variance_val = random_var.variance()
print(f"Variance: {variance_val}")
```

### Explanation of the Code:

1. **Initialization (`__init__`)**:
   - Takes two parameters: `outcomes` and `probabilities`.
   - It checks that the lengths of the `outcomes` and `probabilities` lists are the same.
   - It ensures that all probabilities are between 0 and 1, and that their sum is equal to 1 (as required by probability theory).

2. **Expected Value (`expected_value`)**:
   - This method calculates the expected value \( E[X] \) using the formula:
     \[
     E[X] = \sum_{i} p(x_i) \cdot x_i
     \]
   - It uses Python's `zip` function to pair the `outcomes` and `probabilities`, and computes the sum of the products.

3. **Variance (`variance`)**:
   - This method calculates the variance \( \text{Var}(X) \) using the formula:
     \[
     \text{Var}(X) = \sum_{i} p(x_i) \cdot (x_i - E[X])^2
     \]
   - First, it computes the expected value (mean) of the random variable.
   - Then, for each outcome, it computes the squared deviation from the mean, weighted by the probability, and sums these values.

### Example Output:

For the provided example where:
- The outcomes are \([1, 2, 3, 4, 5]\),
- The probabilities are \([0.1, 0.2, 0.3, 0.2, 0.2]\),

The expected value and variance will be computed as follows:

1. **Expected Value**:
   \[
   E[X] = (1 \cdot 0.1) + (2 \cdot 0.2) + (3 \cdot 0.3) + (4 \cdot 0.2) + (5 \cdot 0.2)
   \]
   \[
   E[X] = 0.1 + 0.4 + 0.9 + 0.8 + 1.0 = 3.2
   \]

2. **Variance**:
   \[
   \text{Var}(X) = 0.1 \cdot (1 - 3.2)^2 + 0.2 \cdot (2 - 3.2)^2 + 0.3 \cdot (3 - 3.2)^2 + 0.2 \cdot (4 - 3.2)^2 + 0.2 \cdot (5 - 3.2)^2
   \]
   \[
   \text{Var}(X) = 0.1 \cdot (-2.2)^2 + 0.2 \cdot (-1.2)^2 + 0.3 \cdot (-0.2)^2 + 0.2 \cdot (0.8)^2 + 0.2 \cdot (1.8)^2
   \]
   \[
   \text{Var}(X) = 0.1 \cdot 4.84 + 0.2 \cdot 1.44 + 0.3 \cdot 0.04 + 0.2 \cdot 0.64 + 0.2 \cdot 3.24
   \]
   \[
   \text{Var}(X) = 0.484 + 0.288 + 0.012 + 0.128 + 0.648 = 1.56
   \]
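
The hand calculation can be double-checked with exact rational arithmetic, which avoids floating-point rounding. This sketch uses `fractions.Fraction` from the standard library and also evaluates the equivalent shortcut Var(X) = E[X²] − (E[X])², which is not used in the class but gives the same result.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5]
probabilities = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10),
                 Fraction(2, 10), Fraction(2, 10)]

mean = sum(p * x for x, p in zip(outcomes, probabilities))

# Variance from the definition: probability-weighted squared deviations.
var_definition = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probabilities))

# Variance from the shortcut: E[X^2] - (E[X])^2.
second_moment = sum(p * x ** 2 for x, p in zip(outcomes, probabilities))
var_shortcut = second_moment - mean ** 2

print(mean, var_definition, var_shortcut)  # 16/5 39/25 39/25
```

Here 16/5 = 3.2 and 39/25 = 1.56, matching the worked values.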

### Output:

Up to floating-point rounding in the last digits, the script prints:

```text
Expected Value: 3.2
Variance: 1.56
```

### Summary:

This Python class represents a **discrete random variable**. It allows you to:
- **Compute the expected value** (mean) of the random variable.
- **Compute the variance** of the random variable.

The class performs basic validation to ensure that the outcomes and probabilities are valid and consistent. This can be further extended to include more methods for other statistical properties, like standard deviation, skewness, etc.
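
As one possible shape for such an extension, the sketch below adds standard deviation and skewness on top of a general central-moment helper. `SimpleDiscreteRV` is a hypothetical, stripped-down stand-in for the class above (input validation omitted for brevity), not a replacement for it.

```python
import math

class SimpleDiscreteRV:
    """Hypothetical minimal discrete random variable with extra moment methods."""

    def __init__(self, outcomes, probabilities):
        self.outcomes = list(outcomes)
        self.probabilities = list(probabilities)

    def expected_value(self):
        return sum(x * p for x, p in zip(self.outcomes, self.probabilities))

    def central_moment(self, k):
        """E[(X - mu)^k], the k-th moment about the mean."""
        mu = self.expected_value()
        return sum(p * (x - mu) ** k for x, p in zip(self.outcomes, self.probabilities))

    def variance(self):
        return self.central_moment(2)

    def std_dev(self):
        return math.sqrt(self.variance())

    def skewness(self):
        """Third central moment divided by the cube of the standard deviation."""
        return self.central_moment(3) / self.std_dev() ** 3

rv = SimpleDiscreteRV([1, 2, 3, 4, 5], [0.1, 0.2, 0.3, 0.2, 0.2])
print(f"Standard deviation: {rv.std_dev():.4f}")
print(f"Skewness: {rv.skewness():.4f}")

fair_die = SimpleDiscreteRV(range(1, 7), [1 / 6] * 6)
print(f"Fair die skewness: {fair_die.skewness():.4f}")
```

A symmetric distribution such as the fair die has skewness zero (up to rounding), which makes it a convenient test case for the method.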

Question 4. Implement a program to simulate the rolling of a fair six-sided die and calculate the expected value and variance of the outcomes.

### *Solution:*
To simulate the rolling of a fair six-sided die and calculate the **expected value** and **variance** of the outcomes, we can follow these steps:

### Steps:
1. **Simulate the die roll**: Each roll of the die can yield an integer value between 1 and 6, with each outcome having an equal probability (1/6).
2. **Expected Value**: The expected value for a fair die roll can be computed as:
   
   \[
   E[X] = \sum_{i=1}^{6} p(x_i) \cdot x_i
   \]
   
   Where each outcome \( x_i \) (1, 2, 3, 4, 5, 6) has an equal probability of \( p(x_i) = \frac{1}{6} \).

   This can be calculated directly or estimated by simulating a large number of rolls.
   
3. **Variance**: The variance can be computed as:

   \[
   \text{Var}(X) = \sum_{i=1}^{6} p(x_i) \cdot (x_i - E[X])^2
   \]
   
4. **Simulate multiple rolls**: We can simulate the rolling process by generating random numbers between 1 and 6. The more rolls we simulate, the more accurate the expected value and variance will be.

Here’s the implementation:

```python
import random

# Function to simulate a fair six-sided die roll
def roll_die():
    return random.randint(1, 6)

# Function to calculate the expected value of the outcomes
def expected_value(outcomes, probabilities):
    return sum(x * p for x, p in zip(outcomes, probabilities))

# Function to calculate the variance of the outcomes
def variance(outcomes, probabilities, expected_val):
    return sum(p * (x - expected_val) ** 2 for x, p in zip(outcomes, probabilities))

# Simulate a large number of die rolls
def simulate_rolls(num_rolls):
    outcomes = [1, 2, 3, 4, 5, 6]
    probabilities = [1/6] * 6  # Fair die, so all outcomes have equal probability
    
    # Simulate the rolls
    rolls = [roll_die() for _ in range(num_rolls)]
    
    # Calculate the expected value and variance from theoretical values
    expected_val_theoretical = expected_value(outcomes, probabilities)
    variance_theoretical = variance(outcomes, probabilities, expected_val_theoretical)
    
    # Calculate the expected value from the simulation (sample mean)
    expected_val_simulation = sum(rolls) / num_rolls
    
    # Calculate the variance of the simulated rolls (dividing by n, the population form)
    variance_simulation = sum((x - expected_val_simulation) ** 2 for x in rolls) / num_rolls
    
    return expected_val_simulation, variance_simulation, expected_val_theoretical, variance_theoretical

# Simulate the rolls and print the results
num_rolls = 10000  # Number of die rolls to simulate
expected_val_sim, variance_sim, expected_val_theo, variance_theo = simulate_rolls(num_rolls)

# Output the results
print(f"Simulated Expected Value: {expected_val_sim:.2f}")
print(f"Simulated Variance: {variance_sim:.2f}")
print(f"Theoretical Expected Value: {expected_val_theo:.2f}")
print(f"Theoretical Variance: {variance_theo:.2f}")
```

### Explanation of the Code:

1. **`roll_die()`**: Simulates a single roll of a fair six-sided die using `random.randint(1, 6)`.
2. **`expected_value()`**: Calculates the theoretical expected value of a discrete random variable, given the outcomes and probabilities. For a fair six-sided die, the probabilities are all \( \frac{1}{6} \).
3. **`variance()`**: Calculates the theoretical variance of the outcomes given the expected value.
4. **`simulate_rolls()`**:
   - Simulates `num_rolls` die rolls.
   - Computes the **simulated expected value** as the mean of the outcomes.
   - Computes the **simulated variance** based on the differences between each roll and the simulated expected value.
   - It also calculates the **theoretical expected value** and **theoretical variance** based on the known outcomes and probabilities.

### Theoretical Calculations:
For a fair six-sided die, the **theoretical expected value** \( E[X] \) and **variance** \( \text{Var}(X) \) are as follows:

- **Expected Value**:

\[
E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5
\]

- **Variance**:

\[
\text{Var}(X) = \frac{1}{6} \left( (1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2 \right)
\]
\[
= \frac{1}{6} \left( 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 \right) = \frac{17.5}{6} \approx 2.92
\]
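
The theoretical values can also be confirmed with exact fractions, which shows that the variance is exactly 35/12 rather than a rounded decimal. This is a small sketch using only the standard library.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # each face of a fair die is equally likely

mean = sum(p * x for x in faces)
variance = sum(p * (x - mean) ** 2 for x in faces)

print(mean, variance)              # 7/2 35/12
print(round(float(variance), 4))   # 2.9167
```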

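### Example Output:

Simulated values vary from run to run; with 10,000 rolls they typically agree with the theoretical values to within a few hundredths. An illustrative run:

```text
Simulated Expected Value: 3.50
Simulated Variance: 2.92
Theoretical Expected Value: 3.50
Theoretical Variance: 2.92
```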

### Key Points:
- **Simulated Expected Value**: This value gets closer to the theoretical value as the number of rolls increases.
- **Simulated Variance**: Similarly, the simulated variance becomes close to the theoretical variance with more rolls.
- The **theoretical values** are based on the known outcomes of a fair six-sided die and can be computed directly.
  
By simulating a large number of rolls (e.g., 10,000), the values from the simulation will converge to the theoretical values, demonstrating the law of large numbers.
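
The convergence can be observed directly by repeating the simulation for increasing numbers of rolls. The sketch below is self-contained; the helper name and the seed are arbitrary choices for illustration, and the printed numbers will depend on the seed.

```python
import random

def simulate_mean_and_variance(num_rolls, rng):
    """Roll a fair die num_rolls times and return the mean and variance of the rolls."""
    rolls = [rng.randint(1, 6) for _ in range(num_rolls)]
    mean = sum(rolls) / num_rolls
    variance = sum((x - mean) ** 2 for x in rolls) / num_rolls
    return mean, variance

rng = random.Random(2024)  # arbitrary fixed seed so the run is repeatable
for n in (10, 100, 1_000, 10_000, 100_000):
    m, v = simulate_mean_and_variance(n, rng)
    print(f"n={n:>7}: mean={m:.4f} (error {abs(m - 3.5):.4f})  "
          f"variance={v:.4f} (error {abs(v - 35 / 12):.4f})")
```

The errors generally shrink as n grows, roughly in proportion to 1/√n, although individual small runs can land close to the theoretical values by chance.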

Question 5. Create a Python function to generate random samples from a given probability distribution (e.g., binomial, Poisson) and calculate their mean and variance.

### *Solution:*

To generate random samples from a given probability distribution (such as **binomial** or **Poisson**) and calculate their **mean** and **variance**, we can use the `numpy` library, which has built-in functions for generating samples from various probability distributions. We'll write a Python function to handle this for both distributions.

### Approach:
1. **Generate random samples**: We'll use the functions `numpy.random.binomial()` for the binomial distribution and `numpy.random.poisson()` for the Poisson distribution.
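2. **Calculate the mean** and **variance**: We'll calculate the mean and variance of the samples using `numpy.mean()` and `numpy.var()` respectively. By default `numpy.var()` divides by the number of samples (`ddof=0`); pass `ddof=1` for the unbiased sample variance.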

### Steps:
- For the **Binomial Distribution**: The binomial distribution is defined by two parameters:
  - \( n \): The number of trials.
  - \( p \): The probability of success on each trial.

  The random variable \( X \sim \text{Binomial}(n, p) \) represents the number of successes in \( n \) trials.

  - **Mean** of Binomial: \( \mu = n \cdot p \)
  - **Variance** of Binomial: \( \sigma^2 = n \cdot p \cdot (1 - p) \)

- For the **Poisson Distribution**: The Poisson distribution is defined by one parameter:
  - \( \lambda \) (lambda): The rate or expected number of events in a fixed interval of time or space.

  The random variable \( X \sim \text{Poisson}(\lambda) \) represents the number of events in a fixed interval.

  - **Mean** of Poisson: \( \mu = \lambda \)
  - **Variance** of Poisson: \( \sigma^2 = \lambda \)
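
These closed-form results can be checked numerically by summing over the probability mass functions themselves. The sketch below assumes Python 3.8+ (for `math.comb`); the helper names are hypothetical, and the Poisson sum is truncated at 60 terms because the remaining tail is negligible for λ = 3.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def mean_and_variance(support, pmf):
    """Mean and variance of a discrete distribution from its pmf over the given support."""
    mean = sum(k * pmf(k) for k in support)
    variance = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, variance

n, p, lam = 10, 0.5, 3.0
b_mean, b_var = mean_and_variance(range(n + 1), lambda k: binomial_pmf(k, n, p))
p_mean, p_var = mean_and_variance(range(60), lambda k: poisson_pmf(k, lam))

print(f"Binomial: mean={b_mean:.4f} (n*p={n * p}), variance={b_var:.4f} (n*p*(1-p)={n * p * (1 - p)})")
print(f"Poisson:  mean={p_mean:.4f}, variance={p_var:.4f} (lambda={lam})")
```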

### Python Code:

```python
import numpy as np

def generate_samples_and_calculate_stats(distribution, **params):
    """
    Generate random samples from the specified probability distribution and calculate their mean and variance.

    :param distribution: Type of the distribution ('binomial' or 'poisson').
    :param params: Parameters for the distribution (n and p for binomial, lam for poisson,
                   plus an optional size). The rate is named lam because lambda is a
                   reserved word in Python and cannot be passed as a keyword argument.

    :return: A tuple (mean, variance) of the generated samples.
    """
    size = params.get('size', 1000)  # Default sample size if not provided

    # Generate samples based on the distribution type
    if distribution == 'binomial':
        # Binomial distribution requires parameters n (trials) and p (probability of success)
        samples = np.random.binomial(params['n'], params['p'], size)

    elif distribution == 'poisson':
        # Poisson distribution requires the rate parameter lam (lambda)
        samples = np.random.poisson(params['lam'], size)

    else:
        raise ValueError("Unsupported distribution type. Choose 'binomial' or 'poisson'.")

    # Calculate the mean and variance of the samples (np.var divides by n by default)
    mean = np.mean(samples)
    variance = np.var(samples)

    return mean, variance

# Example usage:

# Binomial distribution: n=10 trials, p=0.5 probability of success, size=10000 samples
binomial_mean, binomial_variance = generate_samples_and_calculate_stats(
    'binomial', n=10, p=0.5, size=10000
)
print(f"Binomial Distribution: Mean = {binomial_mean:.2f}, Variance = {binomial_variance:.2f}")

# Poisson distribution: lam=3 (the rate lambda), size=10000 samples
poisson_mean, poisson_variance = generate_samples_and_calculate_stats(
    'poisson', lam=3, size=10000
)
print(f"Poisson Distribution: Mean = {poisson_mean:.2f}, Variance = {poisson_variance:.2f}")
```

### Explanation:

1. **`generate_samples_and_calculate_stats` function**:
   - This function takes two arguments:
     - `distribution`: Specifies the type of distribution (`'binomial'` or `'poisson'`).
     - `**params`: Additional parameters depending on the distribution type.
       - For the **binomial** distribution, it expects `n` (number of trials), `p` (probability of success), and `size` (sample size).
       - For the **Poisson** distribution, it expects `lam` (rate of occurrence, \( \lambda \)) and `size` (sample size). The name `lam` is used because `lambda` is a reserved word in Python.
   - Based on the distribution type, the function generates random samples using `numpy.random.binomial()` or `numpy.random.poisson()`.
   - It then calculates the **mean** and **variance** of the generated samples using `numpy.mean()` and `numpy.var()`.

2. **Example Usage**:
   - **Binomial Distribution**: We generate 10,000 samples with 10 trials and a probability of success of 0.5. The theoretical mean of a binomial distribution with parameters \( n = 10 \) and \( p = 0.5 \) is \( E[X] = n \cdot p = 10 \cdot 0.5 = 5 \), and the theoretical variance is \( \text{Var}(X) = n \cdot p \cdot (1 - p) = 10 \cdot 0.5 \cdot 0.5 = 2.5 \).
   
   - **Poisson Distribution**: We generate 10,000 samples with \( \lambda = 3 \). The theoretical mean and variance of a Poisson distribution are both equal to \( \lambda \), so both the mean and variance should be approximately 3.

### Example Output (illustrative; values vary by run):

```text
Binomial Distribution: Mean = 5.00, Variance = 2.50
Poisson Distribution: Mean = 3.00, Variance = 3.00
```

### Key Points:
- **Sample Mean and Variance**: The computed sample mean and variance will approach the theoretical mean and variance as the number of samples increases. This is consistent with the law of large numbers.
- **Parameter Customization**: You can easily customize the number of trials, probability of success, or rate parameter to experiment with different distributions.
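
To see what such samplers do conceptually, the same experiment can be reproduced with only the standard library. This is an illustrative sketch, not how `numpy` implements its generators: a binomial draw is built by counting successes in n Bernoulli trials, and a Poisson draw uses Knuth's multiplication method, which is suitable for small rates only. The function names and seed are hypothetical.

```python
import math
import random

def sample_binomial(n, p, rng):
    """Count successes in n independent trials, each succeeding with probability p."""
    return sum(rng.random() < p for _ in range(n))

def sample_poisson(lam, rng):
    """Knuth's method: multiply uniforms until the product drops to exp(-lam) or below."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def mean_and_variance(samples):
    """Mean and population variance (dividing by n) of a list of numbers."""
    m = sum(samples) / len(samples)
    v = sum((x - m) ** 2 for x in samples) / len(samples)
    return m, v

rng = random.Random(7)  # arbitrary fixed seed so the run is repeatable
binomial_samples = [sample_binomial(10, 0.5, rng) for _ in range(20_000)]
poisson_samples = [sample_poisson(3, rng) for _ in range(20_000)]

b_mean, b_var = mean_and_variance(binomial_samples)
p_mean, p_var = mean_and_variance(poisson_samples)
print(f"Binomial: mean={b_mean:.3f}, variance={b_var:.3f} (theory: 5, 2.5)")
print(f"Poisson:  mean={p_mean:.3f}, variance={p_var:.3f} (theory: 3, 3)")
```

Changing the sample count in the two list comprehensions is an easy way to watch the estimates tighten around the theoretical values.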


Question 6. Write a Python script to generate random numbers from a Gaussian (normal) distribution and compute the mean, variance, and standard deviation of the samples.

### *Solution:*

To generate random numbers from a **Gaussian (normal) distribution** and compute their **mean**, **variance**, and **standard deviation**, you can use the `numpy` library in Python. The function `numpy.random.normal()` generates random samples from a normal distribution.

Here is a Python script to:
1. Generate random numbers from a Gaussian distribution.
2. Compute the **mean**, **variance**, and **standard deviation** of the generated samples.

### Key Points:
- The **Gaussian distribution** is characterized by two parameters:
  - \( \mu \) (mean): The mean or center of the distribution.
  - \( \sigma \) (standard deviation): The spread or width of the distribution.
  
- The **mean**, **variance**, and **standard deviation** of a normal distribution are:
  - **Mean** (\( \mu \)): The average value of the data.
  - **Variance** (\( \sigma^2 \)): A measure of how much the data is spread out around the mean.
  - **Standard deviation** (\( \sigma \)): The square root of the variance, representing a typical (root-mean-square) distance from the mean.

### Python Script:

```python
import numpy as np

def generate_normal_samples(mu, sigma, size):
    """
    Generate random samples from a Gaussian (normal) distribution and compute mean, variance, and standard deviation.
    
    :param mu: Mean of the distribution.
    :param sigma: Standard deviation of the distribution.
    :param size: Number of samples to generate.
    
    :return: Mean, Variance, Standard Deviation of the generated samples.
    """
    # Generate random samples from a normal distribution
    samples = np.random.normal(mu, sigma, size)
    
    # Calculate mean, variance, and standard deviation
    sample_mean = np.mean(samples)
    sample_variance = np.var(samples)
    sample_std_dev = np.std(samples)
    
    return sample_mean, sample_variance, sample_std_dev

# Example usage:

# Parameters for the normal distribution
mu = 0        # Mean (center) of the distribution
sigma = 1     # Standard deviation (spread) of the distribution
size = 10000  # Number of samples to generate

# Generate samples and calculate statistics
mean, variance, std_dev = generate_normal_samples(mu, sigma, size)

# Print results
print("Generated samples statistics:")
print(f"Mean: {mean:.4f}")
print(f"Variance: {variance:.4f}")
print(f"Standard Deviation: {std_dev:.4f}")
```

### Explanation of the Code:

1. **`generate_normal_samples(mu, sigma, size)`**:
   - This function generates `size` random samples from a normal distribution with mean `mu` and standard deviation `sigma` using `numpy.random.normal(mu, sigma, size)`.
   - It then computes the **mean**, **variance**, and **standard deviation** of the generated samples using `numpy.mean()`, `numpy.var()`, and `numpy.std()` respectively.
   
2. **Parameters**:
   - `mu` (mean): The central value of the normal distribution.
   - `sigma` (standard deviation): Controls the spread of the distribution.
   - `size`: The number of random samples to generate.

3. **Example Usage**:
   - The script generates 10,000 random samples from a normal distribution with a mean of 0 and a standard deviation of 1 (i.e., a standard normal distribution).
   - It then computes the **mean**, **variance**, and **standard deviation** of the generated samples and prints the results.

### Example Output:

```text
Generated samples statistics:
Mean: 0.0032
Variance: 1.0012
Standard Deviation: 1.0006
```

### Notes:
- The **mean** should be close to 0, as we specified \( \mu = 0 \), and the **standard deviation** should be close to 1, as we specified \( \sigma = 1 \). Due to random sampling, the results will not be exactly 0 and 1, but they should be very close, especially with a large sample size.
- The **variance** is the square of the standard deviation, so it should be approximately \( 1^2 = 1 \) for a standard normal distribution.
- The larger the number of samples, the more accurate the estimates for the mean, variance, and standard deviation will be, converging to the true values.
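
The last note can be checked empirically. The following is a minimal sketch using only the standard library; the helper name `estimation_errors`, the seed, and the chosen sample sizes are illustrative assumptions, not part of the script above.

```python
import random
import statistics

def estimation_errors(mu, sigma, sizes, seed=0):
    """Return {size: (abs error of mean, abs error of std dev)} for each sample size."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    errors = {}
    for size in sizes:
        samples = [rng.gauss(mu, sigma) for _ in range(size)]
        mean_error = abs(statistics.fmean(samples) - mu)
        std_error = abs(statistics.pstdev(samples) - sigma)  # population form, like np.std
        errors[size] = (mean_error, std_error)
    return errors

# Errors should generally shrink as the sample size grows
for size, (mean_err, std_err) in estimation_errors(0, 1, [100, 10_000, 100_000]).items():
    print(f"n={size:>7,}: |mean - mu| = {mean_err:.4f}, |std - sigma| = {std_err:.4f}")
```

Any single run can still show a smaller error at a smaller sample size by chance; the trend only holds on average.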

### Extending the Script:
You can adjust the values of `mu`, `sigma`, and `size` to simulate different Gaussian distributions and experiment with different sample sizes. This script can be adapted for more advanced use cases such as plotting histograms, comparing distributions, etc.

Questions-7 Use seaborn library to load tips dataset. Find the following from the dataset for the columns `total_bill` and `tip`:

  
  (i) Write a Python function that calculates their skewness.

  (ii) Create a program that determines whether each column is positively skewed, negatively skewed, or
approximately symmetric.

  (iii) Write a function that calculates the covariance between two columns.

  (iv) Implement a Python program that calculates the Pearson correlation coefficient between two columns.

  (v) Write a script to visualize the correlation between two specific columns in a Pandas DataFrame using
scatter plots.

### *Solution:*

To tackle the tasks you've requested, we'll use the **`seaborn`** library to load the **tips dataset** and then perform the required analysis and visualization. The **`seaborn`** library comes with several datasets, including the **tips dataset**, which contains information about restaurant bills and tips.

We’ll need the following libraries:
- `seaborn` for loading and inspecting the dataset.
- `scipy.stats` for skewness calculation.
- `pandas` for data manipulation and covariance/correlation calculations.
- `matplotlib` for plotting.

Let's go through each of the tasks one by one.

### 1. Load the Tips Dataset
```python
import seaborn as sns

# Load the tips dataset
tips = sns.load_dataset('tips')

# Show the first few rows of the dataset
print(tips.head())
```

The **tips dataset** contains columns:
- `total_bill`: The total bill amount.
- `tip`: The tip amount.
- `sex`: The gender of the person paying the bill.
- `smoker`: Whether the person is a smoker.
- `day`: The day of the week.
- `time`: Whether it is lunch or dinner.
- `size`: The size of the party.

### (i) Calculate Skewness of `total_bill` and `tip`

We will calculate the **skewness** of the `total_bill` and `tip` columns using `scipy.stats.skew`.

```python
import scipy.stats as stats

def calculate_skewness(column):
    """
    Calculate the skewness of a given column.
    
    :param column: A Pandas Series (column of a DataFrame)
    :return: Skewness of the column
    """
    return stats.skew(column)

# Calculate skewness for total_bill and tip columns
total_bill_skewness = calculate_skewness(tips['total_bill'])
tip_skewness = calculate_skewness(tips['tip'])

print(f"Skewness of 'total_bill': {total_bill_skewness:.3f}")
print(f"Skewness of 'tip': {tip_skewness:.3f}")
```
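
To see what the library call computes, here is a minimal sketch of the same quantity written by hand. By default `scipy.stats.skew` returns the biased Fisher-Pearson coefficient \( g_1 = m_3 / m_2^{3/2} \), where \( m_2 \) and \( m_3 \) are the second and third central moments. The function name `manual_skewness` and the two small lists are illustrative assumptions.

```python
def manual_skewness(values):
    """Biased Fisher-Pearson skewness: third central moment over second central moment ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 15]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

print(f"Right-tailed sample: {manual_skewness(right_tailed):.3f}")
print(f"Symmetric sample: {manual_skewness(symmetric):.3f}")
```

Applied to `tips['total_bill']`, this should agree with `calculate_skewness` up to floating-point rounding.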

### (ii) Determine the Type of Skewness (Positive, Negative, or Symmetric)

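We can determine whether a distribution is positively skewed, negatively skewed, or approximately symmetric by examining the skewness value. A sample skewness is almost never exactly 0, so the function treats values within a small tolerance of 0 as symmetric. The default tolerance of 0.5 is a common rule of thumb, not a fixed standard:

- **Positive skewness**: Skewness > tolerance (tail on the right side).
- **Negative skewness**: Skewness < -tolerance (tail on the left side).
- **Approximately symmetric**: |Skewness| ≤ tolerance.

```python
def skewness_type(skewness, tolerance=0.5):
    # tolerance is a rule-of-thumb cutoff for "approximately symmetric"
    if skewness > tolerance:
        return 'Positive skewness'
    elif skewness < -tolerance:
        return 'Negative skewness'
    else:
        return 'Approximately symmetric'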

# Determine skewness type for total_bill and tip
total_bill_skewness_type = skewness_type(total_bill_skewness)
tip_skewness_type = skewness_type(tip_skewness)

print(f"Skewness type of 'total_bill': {total_bill_skewness_type}")
print(f"Skewness type of 'tip': {tip_skewness_type}")
```

### (iii) Calculate the Covariance between `total_bill` and `tip`

Covariance measures how two variables change together. We will use `pandas.Series.cov()` to compute the sample covariance (with an \( n - 1 \) denominator) between the `total_bill` and `tip` columns.

```python
def calculate_covariance(df, col1, col2):
    """
    Calculate the covariance between two columns in a DataFrame.
    
    :param df: Pandas DataFrame
    :param col1: Name of the first column
    :param col2: Name of the second column
    :return: Covariance between the two columns
    """
    return df[col1].cov(df[col2])

# Calculate covariance between total_bill and tip
covariance = calculate_covariance(tips, 'total_bill', 'tip')
print(f"Covariance between 'total_bill' and 'tip': {covariance:.3f}")
```
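
For reference, the sample covariance can also be written out by hand. This is a minimal sketch assuming the \( n - 1 \) denominator that pandas uses by default; the function name `sample_covariance` and the small example lists are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by (n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("Both sequences must have the same length.")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical bills and tips, used only to exercise the function
bills = [10.0, 20.0, 30.0, 40.0]
tips_paid = [1.5, 3.0, 4.0, 6.5]
print(f"Sample covariance: {sample_covariance(bills, tips_paid):.3f}")
```

A positive value means larger bills tend to come with larger tips; the magnitude depends on the units of both variables.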

### (iv) Calculate the Pearson Correlation Coefficient between `total_bill` and `tip`

The **Pearson correlation coefficient** measures the linear relationship between two variables. We can calculate it using the `.corr()` method in pandas, or we can use `scipy.stats.pearsonr()` for a more detailed result.

```python
def calculate_pearson_correlation(df, col1, col2):
    """
    Calculate the Pearson correlation coefficient between two columns.
    
    :param df: Pandas DataFrame
    :param col1: Name of the first column
    :param col2: Name of the second column
    :return: Pearson correlation coefficient
    """
    return df[col1].corr(df[col2])

# Calculate Pearson correlation coefficient between total_bill and tip
pearson_correlation = calculate_pearson_correlation(tips, 'total_bill', 'tip')
print(f"Pearson correlation coefficient between 'total_bill' and 'tip': {pearson_correlation:.3f}")
```
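
The coefficient itself is the covariance divided by the product of the two standard deviations, which removes the units and bounds the result between -1 and 1. A minimal standard-library sketch follows; `pearson_r` is an illustrative name, not a pandas or SciPy function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation sum over the square root of the squared-deviation sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfectly linear increasing relationship gives r = 1 (up to rounding)
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
```

The \( n - 1 \) factors in the covariance and the standard deviations cancel, which is why they do not appear in the function.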

### (v) Visualize the Correlation with a Scatter Plot

To visualize the correlation between `total_bill` and `tip`, we can use **seaborn's scatterplot** function.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_scatter(df, col1, col2):
    """
    Create a scatter plot to visualize the correlation between two columns.
    
    :param df: Pandas DataFrame
    :param col1: Name of the first column
    :param col2: Name of the second column
    """
    sns.scatterplot(x=df[col1], y=df[col2])
    plt.title(f"Scatter Plot between {col1} and {col2}")
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()

# Plot scatter plot between total_bill and tip
plot_scatter(tips, 'total_bill', 'tip')
```

### Full Script:

Here’s the full code that includes all the tasks you requested:

```python
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

# Load the tips dataset
tips = sns.load_dataset('tips')

# Function to calculate skewness
def calculate_skewness(column):
    return stats.skew(column)

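# Function to determine skewness type
# (tolerance is a rule-of-thumb cutoff for "approximately symmetric")
def skewness_type(skewness, tolerance=0.5):
    if skewness > tolerance:
        return 'Positive skewness'
    elif skewness < -tolerance:
        return 'Negative skewness'
    else:
        return 'Approximately symmetric'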

# Function to calculate covariance
def calculate_covariance(df, col1, col2):
    return df[col1].cov(df[col2])

# Function to calculate Pearson correlation coefficient
def calculate_pearson_correlation(df, col1, col2):
    return df[col1].corr(df[col2])

# Function to plot scatter plot
def plot_scatter(df, col1, col2):
    sns.scatterplot(x=df[col1], y=df[col2])
    plt.title(f"Scatter Plot between {col1} and {col2}")
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()

# Calculate skewness for total_bill and tip columns
total_bill_skewness = calculate_skewness(tips['total_bill'])
tip_skewness = calculate_skewness(tips['tip'])

print(f"Skewness of 'total_bill': {total_bill_skewness:.3f}")
print(f"Skewness of 'tip': {tip_skewness:.3f}")

# Determine skewness type
total_bill_skewness_type = skewness_type(total_bill_skewness)
tip_skewness_type = skewness_type(tip_skewness)

print(f"Skewness type of 'total_bill': {total_bill_skewness_type}")
print(f"Skewness type of 'tip': {tip_skewness_type}")

# Calculate covariance between total_bill and tip
covariance = calculate_covariance(tips, 'total_bill', 'tip')
print(f"Covariance between 'total_bill' and 'tip': {covariance:.3f}")

# Calculate Pearson correlation coefficient between total_bill and tip
pearson_correlation = calculate_pearson_correlation(tips, 'total_bill', 'tip')
print(f"Pearson correlation coefficient between 'total_bill' and 'tip': {pearson_correlation:.3f}")

# Plot scatter plot between total_bill and tip
plot_scatter(tips, 'total_bill', 'tip')
```

### Summary of Results:
- **Skewness** will indicate whether the distributions of `total_bill` and `tip` are positively skewed, negatively skewed, or symmetric.
- **Covariance** will tell you how the two variables (`total_bill` and `tip`) vary together.
- **Pearson correlation** will indicate the strength and direction of the linear relationship between `total_bill` and `tip`.
- **Scatter plot** will provide a visual representation of the correlation between the two columns.

### Expected Outputs:
1. **Skewness and skewness type** will tell you if the distribution is skewed to the right or left or symmetric.
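2. **Covariance** will show, through its sign, whether the two variables tend to move in the same or opposite directions; its magnitude depends on the units of both columns.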
3. **Pearson correlation** will provide a value between -1 and 1, showing the degree of correlation (positive, negative, or no correlation).
4. **Scatter plot** will provide a graphical representation of the relationship.



Questions-8  Write a Python function to calculate the probability density function (PDF) of a continuous random variable for a given normal distribution.


### *Solution:*

To calculate the **Probability Density Function (PDF)** of a continuous random variable for a given **normal distribution**, we can use the formula for the PDF of the normal distribution:

\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]

Where:
- \( \mu \) is the **mean** of the distribution.
- \( \sigma \) is the **standard deviation** of the distribution.
- \( x \) is the value at which we want to evaluate the PDF.
- \( \exp \) represents the exponential function.

Alternatively, we can use **`scipy.stats.norm.pdf()`** to compute the PDF of a normal distribution directly, but I'll first show how to implement it manually using the formula above.

### Python Function to Calculate the PDF:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """
    Calculate the Probability Density Function (PDF) for a normal distribution.
    
    :param x: The value at which to evaluate the PDF.
    :param mu: The mean (μ) of the normal distribution.
    :param sigma: The standard deviation (σ) of the normal distribution.
    :return: The value of the PDF at x.
    """
    # Calculate the PDF using the normal distribution formula
    pdf_value = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return pdf_value

# Example usage:
mu = 0       # Mean of the distribution
sigma = 1    # Standard deviation of the distribution
x = 1        # The value at which to evaluate the PDF

# Calculate the PDF at x = 1 for a standard normal distribution
pdf_value = normal_pdf(x, mu, sigma)
print(f"The PDF of the normal distribution at x = {x} is: {pdf_value:.4f}")
```

### Explanation of the Code:
1. **Function Definition**: `normal_pdf(x, mu, sigma)`
   - `x`: The point at which we want to calculate the PDF.
   - `mu`: The mean (μ) of the normal distribution.
   - `sigma`: The standard deviation (σ) of the normal distribution.
   
2. **PDF Calculation**: 
   - We use the formula for the normal distribution PDF to calculate the value at \( x \). The formula includes the constant \( \frac{1}{\sigma \sqrt{2\pi}} \), and the exponential term \( \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \).
   - `np.sqrt(2 * np.pi)` computes \( \sqrt{2\pi} \).
   - `np.exp()` computes the exponential function.

3. **Example**: 
   - The example calculates the PDF at \( x = 1 \) for a standard normal distribution (i.e., with \( \mu = 0 \) and \( \sigma = 1 \)).
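
One way to sanity-check a hand-written PDF is to confirm that the area under it is 1. The sketch below re-implements the formula with the standard library and integrates it with the trapezoidal rule; the helper `integrate_pdf`, the integration bounds, and the step count are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal PDF evaluated at x (same formula as above, using math instead of numpy)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def integrate_pdf(mu, sigma, lower, upper, steps=10_000):
    """Trapezoidal approximation of the area under the PDF on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.5 * (normal_pdf(lower, mu, sigma) + normal_pdf(upper, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lower + i * width, mu, sigma)
    return total * width

area_total = integrate_pdf(0, 1, -8, 8)      # nearly all of the probability mass
area_one_sigma = integrate_pdf(0, 1, -1, 1)  # mass within one standard deviation

print(f"Area on [-8, 8]: {area_total:.6f}")
print(f"Area on [-1, 1]: {area_one_sigma:.4f}")
```

The second value illustrates that a PDF value such as `normal_pdf(1, 0, 1)` is a density, not a probability; probabilities come from areas under the curve.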

### Output Example:
For a **standard normal distribution** (mean = 0, standard deviation = 1), the PDF at \( x = 1 \) will be:

```text
The PDF of the normal distribution at x = 1 is: 0.2420
```

### Using `scipy.stats.norm.pdf()` (Alternative)

You can also use **`scipy.stats.norm.pdf()`** to compute the PDF of a normal distribution more directly. It evaluates the same formula, takes the mean as `loc` and the standard deviation as `scale`, and also accepts arrays of `x` values.

```python
from scipy.stats import norm

def scipy_normal_pdf(x, mu, sigma):
    """
    Calculate the Probability Density Function (PDF) for a normal distribution using scipy.
    
    :param x: The value at which to evaluate the PDF.
    :param mu: The mean (μ) of the normal distribution.
    :param sigma: The standard deviation (σ) of the normal distribution.
    :return: The value of the PDF at x.
    """
    return norm.pdf(x, loc=mu, scale=sigma)

# Example usage:
pdf_value_scipy = scipy_normal_pdf(x, mu, sigma)
print(f"The PDF of the normal distribution at x = {x} (using scipy) is: {pdf_value_scipy:.4f}")
```

### Output Example (with `scipy.stats.norm.pdf()`):

```text
The PDF of the normal distribution at x = 1 (using scipy) is: 0.2420
```

### Summary:
- The **manual calculation** of the PDF uses the normal distribution formula.
- The **`scipy.stats.norm.pdf()`** function provides a more convenient and efficient way to calculate the PDF of a normal distribution.

Both approaches should give you the same result for the normal distribution PDF.

Questions-9 Create a program to calculate the cumulative distribution function (CDF) of exponential distribution.

### *Solution:*

To calculate the **Cumulative Distribution Function (CDF)** of the **Exponential Distribution**, we can use the formula for the CDF of the exponential distribution:

\[
F(x) = 1 - \exp\left(-\lambda x\right), \quad x \ge 0
\]

Where:
- \( x \) is the value at which we want to evaluate the CDF.
- \( \lambda \) (lambda) is the **rate parameter** (inverse of the mean), where \( \lambda = 1 / \text{mean} \).
- \( \exp \) is the exponential function.

Alternatively, you can use **`scipy.stats.expon.cdf()`** to calculate the CDF directly, but I will first show how to implement it manually using the formula above.

### Python Code to Calculate CDF of Exponential Distribution

```python
import numpy as np

def exponential_cdf(x, lambd):
    """
    Calculate the CDF of an exponential distribution.
    
    :param x: The value at which to evaluate the CDF.
    :param lambd: The rate parameter (λ = 1 / mean).
    :return: The value of the CDF at x.
    """
    # Calculate the CDF using the exponential distribution formula
    return 1 - np.exp(-lambd * x)

# Example usage:
lambd = 1 / 5  # Rate parameter (λ), for a mean of 5
x = 3          # The value at which to evaluate the CDF

# Calculate the CDF at x = 3 for an exponential distribution with mean = 5
cdf_value = exponential_cdf(x, lambd)
print(f"The CDF of the exponential distribution at x = {x} is: {cdf_value:.4f}")
```

### Explanation of the Code:
1. **Function Definition**: `exponential_cdf(x, lambd)`
   - `x`: The point at which we want to evaluate the CDF.
   - `lambd`: The rate parameter \( \lambda \) of the exponential distribution. If the mean of the distribution is known, we can calculate \( \lambda \) as \( \lambda = \frac{1}{\text{mean}} \).
   
2. **CDF Calculation**:
   - The formula for the CDF of the exponential distribution is \( F(x) = 1 - \exp(-\lambda x) \), where \( \lambda \) is the rate parameter and \( x \) is the value at which the CDF is evaluated.
   - `np.exp(-lambd * x)` computes the exponential term.

3. **Example**: 
   - In the example, the rate parameter \( \lambda \) is set to \( \frac{1}{5} \), which means the mean of the distribution is 5.
   - The CDF is calculated at \( x = 3 \).
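
A common use of the CDF is the probability that the variable falls in an interval, \( P(a < X \le b) = F(b) - F(a) \). The sketch below is a standard-library variant that also returns 0 for negative `x`, since the exponential distribution has no mass below zero; the helper name `interval_probability` is an assumption for illustration.

```python
import math

def exponential_cdf(x, lambd):
    """CDF of the exponential distribution with rate lambd; 0 for x < 0."""
    if lambd <= 0:
        raise ValueError("The rate parameter must be positive.")
    if x < 0:
        return 0.0
    return 1 - math.exp(-lambd * x)

def interval_probability(a, b, lambd):
    """P(a < X <= b) as a difference of two CDF values."""
    return exponential_cdf(b, lambd) - exponential_cdf(a, lambd)

lambd = 1 / 5  # mean of 5, as in the example above
print(f"P(X <= 3)     = {exponential_cdf(3, lambd):.4f}")
print(f"P(3 < X <= 8) = {interval_probability(3, 8, lambd):.4f}")
```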

### Output Example:
For an exponential distribution with a mean of 5 (rate \( \lambda = 1/5 \)), the CDF at \( x = 3 \) will be:

```text
The CDF of the exponential distribution at x = 3 is: 0.4512
```

### Using `scipy.stats.expon.cdf()` (Alternative)

Alternatively, you can use **`scipy.stats.expon.cdf()`** to calculate the CDF more directly. This built-in function accepts arrays of `x` values and returns 0 for negative `x`. Note that it is parameterized by `scale`, which is the mean \( 1/\lambda \), rather than by the rate \( \lambda \).

```python
from scipy.stats import expon

def scipy_exponential_cdf(x, lambd):
    """
    Calculate the CDF of an exponential distribution using scipy.
    
    :param x: The value at which to evaluate the CDF.
    :param lambd: The rate parameter (λ = 1 / mean).
    :return: The value of the CDF at x.
    """
    return expon.cdf(x, scale=1/lambd)  # `scale` is the mean, which is 1/λ

# Example usage:
cdf_value_scipy = scipy_exponential_cdf(x, lambd)
print(f"The CDF of the exponential distribution at x = {x} (using scipy) is: {cdf_value_scipy:.4f}")
```

### Output Example (with `scipy.stats.expon.cdf()`):

```text
The CDF of the exponential distribution at x = 3 (using scipy) is: 0.4512
```

### Summary:

- The **manual calculation** of the CDF uses the formula \( F(x) = 1 - \exp(-\lambda x) \).
- The **`scipy.stats.expon.cdf()`** function provides a direct and optimized way to compute the CDF of an exponential distribution.

Both methods give the same result. The manual method is good for understanding how the CDF is calculated, while `scipy.stats.expon.cdf()` is more practical and efficient in real-world applications.

Questions-10 Write a Python function to calculate the probability mass function (PMF) of Poisson distribution.

### *Solution:*

The **Probability Mass Function (PMF)** of the **Poisson distribution** is given by the following formula:

\[
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}
\]

Where:
- \( X \) is the random variable (the number of events),
- \( \lambda \) (lambda) is the average rate of occurrence (mean number of events per interval),
- \( k \) is the number of occurrences (integer value),
- \( e \) is Euler's number (approximately 2.71828), and
- \( k! \) is the factorial of \( k \).

To calculate the PMF of a Poisson distribution for a specific value \( k \), we will use this formula directly.

### Python Code to Calculate PMF of Poisson Distribution

```python
import math

def poisson_pmf(k, lambd):
    """
    Calculate the Probability Mass Function (PMF) for a Poisson distribution.
    
    :param k: The number of occurrences (k must be a non-negative integer).
    :param lambd: The average rate (λ) of occurrence.
    :return: The value of the PMF at k.
    """
    # Calculate the PMF using the Poisson distribution formula
    return (lambd ** k * math.exp(-lambd)) / math.factorial(k)

# Example usage:
lambd = 4  # Mean number of occurrences (λ)
k = 3      # The number of occurrences for which to calculate the PMF

# Calculate the PMF at k = 3 for a Poisson distribution with λ = 4
pmf_value = poisson_pmf(k, lambd)
print(f"The PMF of the Poisson distribution at k = {k} (λ = {lambd}) is: {pmf_value:.4f}")
```

### Explanation of the Code:
1. **Function Definition**: `poisson_pmf(k, lambd)`
   - `k`: The number of occurrences for which you want to calculate the PMF (must be a non-negative integer).
   - `lambd`: The average rate \( \lambda \) (mean number of events).
   
2. **PMF Calculation**:
   - The formula used is \( P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \), where:
     - `lambd ** k` computes \( \lambda^k \),
     - `math.exp(-lambd)` computes \( e^{-\lambda} \),
     - `math.factorial(k)` computes the factorial of \( k \).

3. **Example**:
   - The example calculates the PMF for \( k = 3 \) with \( \lambda = 4 \) (average of 4 events).
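
For large \( k \), both \( \lambda^k \) and \( k! \) become too large to convert to a float, and the direct formula can raise an `OverflowError`. One common workaround is to evaluate the logarithm of the PMF and exponentiate at the end. The sketch below assumes this log-space approach with `math.lgamma`; the name `poisson_pmf_log` is illustrative.

```python
import math

def poisson_pmf_log(k, lambd):
    """Poisson PMF computed in log space: exp(k*ln(lambda) - lambda - ln(k!))."""
    if k < 0 or lambd <= 0:
        raise ValueError("k must be non-negative and lambd must be positive.")
    # math.lgamma(k + 1) equals ln(k!) without ever forming the factorial itself
    log_p = k * math.log(lambd) - lambd - math.lgamma(k + 1)
    return math.exp(log_p)

print(f"k = 3, lambda = 4: {poisson_pmf_log(3, 4):.4f}")
print(f"k = 1000, lambda = 900: {poisson_pmf_log(1000, 900):.3e}")
```

For small inputs this agrees with the direct formula up to floating-point rounding.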

### Output Example:
For a Poisson distribution with \( \lambda = 4 \) and \( k = 3 \), the PMF will be:

```text
The PMF of the Poisson distribution at k = 3 (λ = 4) is: 0.1954
```

### Using `scipy.stats.poisson.pmf()` (Alternative)

You can also use **`scipy.stats.poisson.pmf()`** to calculate the PMF more directly, which will provide a more efficient and optimized method.

```python
from scipy.stats import poisson

def scipy_poisson_pmf(k, lambd):
    """
    Calculate the PMF of a Poisson distribution using scipy.
    
    :param k: The number of occurrences (k must be a non-negative integer).
    :param lambd: The average rate (λ) of occurrence.
    :return: The value of the PMF at k.
    """
    return poisson.pmf(k, lambd)

# Example usage:
pmf_value_scipy = scipy_poisson_pmf(k, lambd)
print(f"The PMF of the Poisson distribution at k = {k} (λ = {lambd}) (using scipy) is: {pmf_value_scipy:.4f}")
```

### Output Example (with `scipy.stats.poisson.pmf()`):

```text
The PMF of the Poisson distribution at k = 3 (λ = 4) (using scipy) is: 0.1954
```

### Summary:

- The **manual calculation** of the PMF uses the formula \( P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \).
- The **`scipy.stats.poisson.pmf()`** function is a direct, optimized method to compute the PMF of the Poisson distribution.

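Both methods will give you the same result for moderate values of \( k \). The scipy function is usually preferred in practice because it accepts arrays of `k` values and remains usable for large \( k \), where `lambd ** k` and `math.factorial(k)` in the manual version can become too large to convert to a float.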

Questions-11  A company wants to test if a new website layout leads to a higher conversion rate (percentage of visitors who make a purchase). They collect data from the old and new layouts to compare.


To generate the data use the following command:

```python

import numpy as np

# 50 purchases out of 1000 visitors

old_layout = np.array([1] * 50 + [0] * 950)

# 70 purchases out of 1000 visitors  

new_layout = np.array([1] * 70 + [0] * 930)

  ```

Apply z-test to find which layout is successful.

### *Solution:*

To test if the new layout leads to a significantly higher conversion rate compared to the old layout, we can perform a **two-proportion z-test**. This test helps us compare the success rates (conversion rates) of two independent groups (old layout vs new layout) and determine if the difference is statistically significant.

### Hypotheses:
- **Null Hypothesis (H₀)**: There is no difference in conversion rates between the old layout and the new layout, i.e., the conversion rates are the same.
- **Alternative Hypothesis (H₁)**: The new layout leads to a higher conversion rate than the old layout.

### Z-Test for Proportions:
The z-test for comparing two proportions is based on the following formula:

\[
z = \frac{p_1 - p_2}{\sqrt{p(1 - p) \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
\]

Where:
- \( p_1 \) and \( p_2 \) are the sample proportions of successes (purchases) in the old and new layout groups, respectively.
- \( p \) is the pooled sample proportion: 
  \[
  p = \frac{x_1 + x_2}{n_1 + n_2}
  \]
  Where \( x_1 \) and \( x_2 \) are the number of successes (purchases) in the old and new layout, and \( n_1 \) and \( n_2 \) are the number of observations (visitors) in the old and new layout groups.
  
- \( n_1 \) and \( n_2 \) are the sample sizes (number of visitors).

### Steps:
1. Calculate the sample proportions \( p_1 \) and \( p_2 \) for the old and new layouts.
2. Calculate the pooled proportion \( p \).
3. Compute the z-score using the formula above.
4. Compare the z-score to the critical z-value (from the standard normal distribution) for a significance level (e.g., 0.05).
5. Decide whether to reject the null hypothesis based on the z-score.
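
Steps 4 and 5 can be carried out with a critical value instead of a p-value. With \( z \) defined as above (old minus new) and an alternative hypothesis that the new layout converts better, the rejection region is the lower tail. The following is a minimal standard-library sketch; the function name `two_proportion_z` is an assumption for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic, computed as (p1 - p2) / standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

alpha = 0.05
z = two_proportion_z(50, 1000, 70, 1000)  # old layout first, new layout second
critical = NormalDist().inv_cdf(alpha)    # lower-tail critical value, about -1.645
reject = z < critical

print(f"z = {z:.4f}, critical value = {critical:.4f}, reject H0: {reject}")
```

Comparing `z` with the critical value and comparing the one-tailed p-value with `alpha` always lead to the same decision.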

### Python Code Implementation:

```python
import numpy as np
from scipy import stats

# Given data
old_layout = np.array([1] * 50 + [0] * 950)  # 50 purchases out of 1000 visitors
new_layout = np.array([1] * 70 + [0] * 930)  # 70 purchases out of 1000 visitors

# Number of successes (purchases)
x1 = np.sum(old_layout)
x2 = np.sum(new_layout)

# Sample sizes
n1 = len(old_layout)
n2 = len(new_layout)

# Sample proportions
p1 = x1 / n1
p2 = x2 / n2

# Pooled sample proportion
p = (x1 + x2) / (n1 + n2)

# Z-test statistic calculation
z = (p1 - p2) / np.sqrt(p * (1 - p) * (1/n1 + 1/n2))

# Calculate the p-value for a one-tailed test.
# z is computed as (old - new), so evidence that the new layout is better lies in the lower tail.
p_value = stats.norm.cdf(z)

# Output the results
print(f"Old Layout Conversion Rate (p1): {p1:.4f}")
print(f"New Layout Conversion Rate (p2): {p2:.4f}")
print(f"Z-score: {z:.4f}")
print(f"P-value: {p_value:.4f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The new layout leads to a higher conversion rate.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in conversion rates.")
```

### Explanation of the Code:
1. **Data Setup**: The arrays `old_layout` and `new_layout` represent the conversion data for the old and new layouts. Each 1 in the array corresponds to a successful purchase, and each 0 corresponds to a visitor who did not make a purchase.
   
2. **Calculate Proportions**:
   - `p1` is the proportion of successful purchases for the old layout.
   - `p2` is the proportion of successful purchases for the new layout.
   
3. **Pooled Proportion**: We calculate the pooled proportion `p`, which is the total number of successes (purchases) divided by the total number of visitors across both layouts.

4. **Z-score Calculation**: The z-score is calculated using the formula for the difference in proportions. Because the formula subtracts the new layout's rate from the old layout's rate, a negative z-score means the new layout converted better.

5. **P-value Calculation**: Using the z-score, we compute the **p-value** for a one-tailed test. The one-tailed test is appropriate because we are testing if the new layout has a higher conversion rate than the old layout. With \( z \) defined as old minus new, the relevant tail is the lower one, so the p-value is `stats.norm.cdf(z)`.

6. **Decision Rule**: We compare the p-value to the significance level \( \alpha = 0.05 \). If the p-value is less than \( \alpha \), we reject the null hypothesis and conclude that the new layout is statistically significantly better than the old layout.

### Output Example:

```text
Old Layout Conversion Rate (p1): 0.0500
New Layout Conversion Rate (p2): 0.0700
Z-score: -1.8831
P-value: 0.0298
Reject the null hypothesis: The new layout leads to a higher conversion rate.
```

### Interpretation:
- **Conversion rates**: The old layout has a conversion rate of 5%, and the new layout has a conversion rate of 7%.
- **Z-score**: The z-score is approximately -1.88. It is negative because the old layout's rate is lower than the new layout's rate.
- **P-value**: The one-tailed p-value is 0.0298, which is less than the typical significance level of 0.05.
- **Conclusion**: Since the p-value is less than 0.05, we **reject** the null hypothesis. The data provide statistically significant evidence that the new layout has a higher conversion rate than the old layout.

If the p-value had been greater than 0.05, we would have failed to reject the null hypothesis and could not have concluded that the new layout increased the conversion rate. Note that a two-tailed test on the same data would give a p-value of about 0.06, so the conclusion depends on the one-sided alternative chosen above.

Questions-12  A tutoring service claims that its program improves students' exam scores. A sample of students who participated in the program was taken, and their scores before and after the program were recorded.


Use the below code to generate samples of respective arrays of marks:

```python

before_program = np.array([75, 80, 85, 70, 90, 78, 92, 88, 82, 87])

after_program = np.array([80, 85, 90, 80, 92, 80, 95, 90, 85, 88])

```

Use z-test to find if the claims made by tutor are true or false.

### *Solution:*

To test whether the tutoring program significantly improves students' exam scores, we can perform a **paired sample z-test** or a **paired t-test**. Since we're comparing **before** and **after** scores for the same group of students, this is a dependent sample test. A z-test is strictly appropriate only when the sample is large or the population variance is known; with 10 students and an estimated standard deviation it is an approximation, and we use it here because the question asks for it.

### Hypotheses:
- **Null Hypothesis (H₀)**: There is no significant difference in students' scores before and after the program. The tutoring program has no effect.
- **Alternative Hypothesis (H₁)**: The tutoring program improves students' exam scores, i.e., the after-program scores are higher than before-program scores.

### Steps for the **Z-test**:
1. Calculate the **mean** and **standard deviation** of the differences between the scores before and after the program.
2. Calculate the **z-statistic** using the formula:
   
   \[
   z = \frac{\bar{d}}{\frac{\sigma_d}{\sqrt{n}}}
   \]
   Where:
   - \( \bar{d} \) is the mean of the differences between paired observations.
   - \( \sigma_d \) is the standard deviation of the differences.
   - \( n \) is the number of pairs (sample size).

3. Compare the z-statistic to the critical z-value from the standard normal distribution (for a one-tailed test) to determine if the difference is statistically significant.
4. If the z-statistic is greater than the critical z-value, reject the null hypothesis.

### Python Code Implementation:

```python
import numpy as np
from scipy import stats

# Sample data
before_program = np.array([75, 80, 85, 70, 90, 78, 92, 88, 82, 87])
after_program = np.array([80, 85, 90, 80, 92, 80, 95, 90, 85, 88])

# Calculate the differences between after and before scores
differences = after_program - before_program

# Sample size
n = len(before_program)

# Calculate mean and standard deviation of differences
mean_d = np.mean(differences)
std_d = np.std(differences, ddof=1)  # Sample standard deviation (use ddof=1 for sample std)

# Calculate the z-statistic
z = mean_d / (std_d / np.sqrt(n))

# Calculate the p-value for a one-tailed test
p_value = 1 - stats.norm.cdf(z)

# Output the results
print(f"Mean of the differences: {mean_d:.4f}")
print(f"Standard deviation of the differences: {std_d:.4f}")
print(f"Z-score: {z:.4f}")
print(f"P-value: {p_value:.4f}")

# Decision rule (significance level = 0.05)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The tutoring program significantly improved students' scores.")
else:
    print("Fail to reject the null hypothesis: The tutoring program did not significantly improve students' scores.")
```

### Explanation of the Code:
1. **Data Setup**: The `before_program` and `after_program` arrays represent the scores before and after the tutoring program for 10 students.
   
2. **Differences**: We calculate the differences between the `after_program` and `before_program` scores for each student. These differences will give us the change in scores due to the program.

3. **Mean and Standard Deviation**: We calculate the mean (`mean_d`) and standard deviation (`std_d`) of the differences. The standard deviation is computed using the sample standard deviation formula (with `ddof=1` for sample standard deviation).

4. **Z-Statistic**: The z-statistic is calculated by dividing the mean of the differences by the standard error of the mean of differences.

5. **P-Value**: The p-value is calculated based on the z-statistic using the cumulative distribution function (`stats.norm.cdf()`). We subtract from 1 since this is a one-tailed test (we expect the after scores to be higher).

6. **Decision**: We compare the p-value to the significance level (0.05). If the p-value is less than 0.05, we reject the null hypothesis, meaning the tutoring program significantly improved scores. Otherwise, we fail to reject the null hypothesis.

### Example Output:

```text
Mean of the differences: 3.8000
Standard deviation of the differences: 2.6162
Z-score: 4.5932
P-value: 0.0000
Reject the null hypothesis: The tutoring program significantly improved students' scores.
```

### Interpretation:
- **Mean of the differences**: The average change in scores is 3.8 points, and every student's score increased (by between 1 and 10 points).
- **Standard deviation**: The sample standard deviation of these changes is 2.62.
- **Z-score**: The calculated z-score is 4.59.
- **P-value**: The p-value is about 0.000002, which prints as 0.0000 and is far below the significance level of 0.05.

Since the p-value is less than 0.05, we **reject** the null hypothesis. Based on the data, there is statistically significant evidence that students' exam scores were higher after the tutoring program, which supports the tutor's claim.

### When to Use Z-Test vs T-Test:
- **Z-Test**: You would typically use a z-test when the sample size is large (\(n > 30\)) or if the population standard deviation is known. For smaller sample sizes, a **t-test** is usually more appropriate, especially if the population standard deviation is unknown.

If you decide to use a t-test instead (which is more common with small sample sizes), you would use the `stats.ttest_1samp()` function for a **one-sample t-test** (since we are comparing the differences from 0).
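A minimal sketch of that t-test alternative, assuming SciPy 1.6 or later (needed for the `alternative` keyword). The helper name `paired_t_upper` and the two demo arrays are hypothetical illustration values, not the scores analysed above.

```python
import numpy as np
from scipy import stats

def paired_t_upper(before, after):
    """One-sided paired t-test of H1: mean(after - before) > 0."""
    diffs = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    # A one-sample t-test on the differences against 0 is a paired t-test
    result = stats.ttest_1samp(diffs, popmean=0.0, alternative="greater")
    return result.statistic, result.pvalue

# Hypothetical scores, for illustration only
before_demo = np.array([70, 65, 80, 75, 68, 72])
after_demo = np.array([74, 66, 85, 75, 73, 78])

t_stat, p_value = paired_t_upper(before_demo, after_demo)
print(f"t = {t_stat:.4f}, one-sided p = {p_value:.4f}")
```

Calling `stats.ttest_rel(after, before, alternative="greater")` should give the same statistic and p-value, since both run the same test on the paired differences.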

Questions-13   A pharmaceutical company wants to determine if a new drug is effective in reducing blood pressure. They
conduct a study and record blood pressure measurements before and after administering the drug.


Use the below code to generate samples of respective arrays of blood pressure:


```python
before_drug = np.array([145, 150, 140, 135, 155, 160, 152, 148, 130, 138])
after_drug = np.array([130, 140, 132, 128, 145, 148, 138, 136, 125, 130])
```


Implement z-test to find if the drug really works or not.


### *Solution:*

To test if the new drug is effective in reducing blood pressure, we will perform a **paired sample z-test**. Since we have two related samples (blood pressure before and after taking the drug, measured on the same subjects), this is a **dependent sample** test. Strictly, a z-test assumes the population standard deviation of the differences is known. When it is unknown and has to be estimated from a small sample (which is the more common situation, and the case here with \(n = 10\)), the **t-test** is preferred.

Because the question asks for a **z-test**, we use it here with the sample standard deviation as an estimate, and proceed with the following steps:

### Hypotheses:
- **Null Hypothesis (H₀)**: The drug has no effect on blood pressure, i.e., the difference in blood pressure before and after taking the drug is zero.
- **Alternative Hypothesis (H₁)**: The drug reduces blood pressure, i.e., the average difference is negative (after-drug blood pressure is lower than before-drug).

### Z-Test for Paired Samples:
1. **Calculate the differences** between each pair of measurements (before and after).
2. **Compute the mean** and **standard deviation** of the differences.
3. **Calculate the z-statistic**:
   
   \[
   z = \frac{\bar{d}}{\frac{\sigma_d}{\sqrt{n}}}
   \]
   Where:
   - \( \bar{d} \) is the mean of the differences between before and after blood pressure values.
   - \( \sigma_d \) is the standard deviation of the differences.
   - \( n \) is the number of paired samples (in this case, the number of subjects).

4. **Calculate the p-value** for a one-tailed test (since we are testing if the drug lowers blood pressure).

5. **Compare the p-value** with the significance level (e.g., 0.05) to decide whether to reject the null hypothesis.

### Python Code Implementation:

```python
import numpy as np
from scipy import stats

# Sample data for before and after drug administration
before_drug = np.array([145, 150, 140, 135, 155, 160, 152, 148, 130, 138])
after_drug = np.array([130, 140, 132, 128, 145, 148, 138, 136, 125, 130])

# Step 1: Calculate the differences (after - before)
differences = after_drug - before_drug

# Step 2: Calculate mean and standard deviation of the differences
mean_d = np.mean(differences)
std_d = np.std(differences, ddof=1)  # Sample standard deviation (ddof=1)

# Step 3: Calculate the z-statistic
n = len(before_drug)  # Sample size
z = mean_d / (std_d / np.sqrt(n))

# Step 4: Calculate the p-value for a one-tailed (left-tailed) test.
# differences = after - before, so a drop in blood pressure gives a negative z
# and the p-value is the area to the left of z.
p_value = stats.norm.cdf(z)

# Step 5: Output the results
print(f"Mean of the differences: {mean_d:.4f}")
print(f"Standard deviation of the differences: {std_d:.4f}")
print(f"Z-score: {z:.4f}")
print(f"P-value: {p_value:.4f}")

# Step 6: Decision based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The drug significantly reduces blood pressure.")
else:
    print("Fail to reject the null hypothesis: The drug does not significantly reduce blood pressure.")
```

### Explanation of the Code:
1. **Data Setup**: The `before_drug` and `after_drug` arrays represent the blood pressure measurements of 10 subjects before and after taking the drug.

2. **Calculate Differences**: We subtract the `before_drug` values from the `after_drug` values to get the difference for each subject.

3. **Mean and Standard Deviation of Differences**:
   - We calculate the mean (`mean_d`) and the sample standard deviation (`std_d`) of these differences. The sample standard deviation is used because we are dealing with a sample and not the entire population.

4. **Z-Statistic**: The z-statistic is calculated as the mean difference divided by the standard error of the mean difference. This follows the z-test formula for paired samples.

5. **P-Value**: The p-value is calculated using the cumulative distribution function (CDF) of the standard normal distribution (`stats.norm.cdf`). Since we are testing if the drug **reduces** blood pressure and the differences are computed as after minus before, this is a **left-tailed test**: the p-value is the CDF value at z itself. Subtracting it from 1 would give the upper tail and wrongly produce a p-value close to 1.

6. **Decision**: We compare the p-value to the significance level (0.05). If the p-value is less than 0.05, we reject the null hypothesis and conclude that the drug significantly reduces blood pressure.

### Example Output:

```text
Mean of the differences: -10.1000
Standard deviation of the differences: 3.1780
Z-score: -10.0499
P-value: 0.0000
Reject the null hypothesis: The drug significantly reduces blood pressure.
```

### Interpretation:
- **Mean of the differences**: On average, the blood pressure dropped by 10.1 units.
- **Standard deviation**: The standard deviation of these differences is 3.18.
- **Z-score**: The z-score is -10.05, which is very far below 0, indicating a significant reduction.
- **P-value**: The p-value is effectively 0 (it prints as 0.0000), which is much smaller than 0.05.

Since the p-value is less than 0.05, we **reject the null hypothesis** and conclude that the drug **significantly reduces blood pressure**.

### Conclusion:
Based on the z-test, the new drug is statistically effective in reducing blood pressure.
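Since the solution notes that a t-test is preferred when the population variance is unknown, here is a short sketch of the paired t-test on the same data. It assumes SciPy 1.6 or later for the `alternative` keyword. The t-statistic uses the same formula as the z-statistic; only the reference distribution changes (Student's t with \(n - 1 = 9\) degrees of freedom instead of the standard normal).

```python
import numpy as np
from scipy import stats

before_drug = np.array([145, 150, 140, 135, 155, 160, 152, 148, 130, 138])
after_drug = np.array([130, 140, 132, 128, 145, 148, 138, 136, 125, 130])

# H1: mean(after - before) < 0, i.e. the drug lowers blood pressure
result = stats.ttest_rel(after_drug, before_drug, alternative="less")

print(f"t = {result.statistic:.4f}")
print(f"one-sided p = {result.pvalue:.6f}")
```

For this data the conclusion should not change, because the mean drop is large relative to its standard error.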

Questions-14 A customer service department claims that their average response time is less than 5 minutes. A sample
of recent customer interactions was taken, and the response times were recorded.


Implement the below code to generate the array of response time:

```python

response_times = np.array([4.3, 3.8, 5.1, 4.9, 4.7, 4.2, 5.2, 4.5, 4.6, 4.4])

```

Implement a z-test to determine whether the claim made by the customer service department is true or false.

### *Solution:*
To test the claim that the average response time is less than 5 minutes, we can perform a **one-sample z-test**. 

### Steps for conducting a one-sample z-test:

1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): The population mean response time is 5 minutes (i.e., \(\mu = 5\)).
   - Alternative Hypothesis (\(H_a\)): The population mean response time is less than 5 minutes (i.e., \(\mu < 5\)).

2. **Set the significance level** (\(\alpha\)):
   - Typically, \(\alpha = 0.05\), which is a 95% confidence level.

3. **Calculate the sample mean** and **sample standard deviation** from the data.

4. **Calculate the z-statistic** using the formula:
   \[
   z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}
   \]
   Where:
   - \(\bar{x}\) = sample mean
   - \(\mu_0\) = hypothesized population mean (5 minutes in this case)
   - \(\sigma\) = population standard deviation (if known) or sample standard deviation (if population standard deviation is unknown)
   - \(n\) = sample size

5. **Compare the calculated z-value with the critical z-value** for a one-tailed test.

6. **Make a decision**:
   - If the calculated z-value is less than the critical z-value (for a left-tailed test), reject the null hypothesis.

### Python Implementation:

```python
import numpy as np
import scipy.stats as stats

# Sample data: response times
response_times = np.array([4.3, 3.8, 5.1, 4.9, 4.7, 4.2, 5.2, 4.5, 4.6, 4.4])

# Hypothesized population mean
mu_0 = 5

# Sample statistics
n = len(response_times)  # sample size
sample_mean = np.mean(response_times)  # sample mean
sample_std = np.std(response_times, ddof=1)  # sample standard deviation

# Calculate the z-statistic
z = (sample_mean - mu_0) / (sample_std / np.sqrt(n))

# Find the critical z-value for a left-tailed test at alpha = 0.05
alpha = 0.05
z_critical = stats.norm.ppf(alpha)

# Print results
print(f"Sample Mean: {sample_mean}")
print(f"Sample Standard Deviation: {sample_std}")
print(f"Z-statistic: {z}")
print(f"Critical Z-value (alpha = 0.05): {z_critical}")

# Decision based on comparison
if z < z_critical:
    print("Reject the null hypothesis: The average response time is less than 5 minutes.")
else:
    print("Fail to reject the null hypothesis: The average response time is not less than 5 minutes.")
```

### Explanation of Code:

1. **Sample Data**: The `response_times` array contains the sample of response times in minutes.
2. **Sample Mean and Standard Deviation**: We compute the sample mean and sample standard deviation using `np.mean` and `np.std` (with `ddof=1` to get the sample standard deviation).
3. **Z-Statistic**: The formula for the z-statistic is applied, where we subtract the hypothesized population mean from the sample mean and divide by the standard error of the mean.
4. **Critical Z-Value**: Using `scipy.stats.norm.ppf(alpha)`, we get the critical z-value for a left-tailed test at the 5% significance level (about -1.645).
5. **Decision**: If the calculated z-statistic is smaller than the critical z-value, we reject the null hypothesis.

### Output:
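For the given data, the sample mean is 4.57 minutes and the sample standard deviation is about 0.427, which gives a z-statistic of about -3.18. This is below the critical value of about -1.645, so we reject the null hypothesis: the data support the customer service department's claim that the average response time is less than 5 minutes.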
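The same decision can be reached with a p-value instead of a critical value. The sketch below is an equivalent formulation of the test above, not a different method: for a left-tailed test, `z < z_critical` holds exactly when `stats.norm.cdf(z) < alpha`.

```python
import numpy as np
from scipy import stats

response_times = np.array([4.3, 3.8, 5.1, 4.9, 4.7, 4.2, 5.2, 4.5, 4.6, 4.4])
mu_0 = 5.0    # hypothesized mean response time in minutes
alpha = 0.05  # significance level

n = response_times.size
standard_error = response_times.std(ddof=1) / np.sqrt(n)
z = (response_times.mean() - mu_0) / standard_error

# Left-tailed p-value: probability of a z at least this small under H0
p_value = stats.norm.cdf(z)
reject_h0 = p_value < alpha

print(f"z = {z:.4f}, p = {p_value:.6f}, reject H0: {reject_h0}")
```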

Questions-15 A company is testing two different website layouts to see which one leads to higher click-through rates.
Write a Python function to perform an A/B test analysis, including calculating the t-statistic, degrees of
freedom, and p-value.


Use the following data:

```python
layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]
```

### *Solution:*

To compare the click-through rates of two website layouts (A and B), we can perform a **two-sample t-test**. This test is used to determine whether there is a significant difference between the means of two independent groups (in this case, the click-through rates for layouts A and B).

### Steps for A/B test (two-sample t-test):

1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): The means of the two layouts are equal, i.e., \(\mu_A = \mu_B\).
   - Alternative Hypothesis (\(H_a\)): The means of the two layouts are different, i.e., \(\mu_A \neq \mu_B\).

2. **Set the significance level** (\(\alpha\)):
   - Typically, \(\alpha = 0.05\) (95% confidence level).

3. **Calculate the sample means** and **sample standard deviations** for each layout.

4. **Calculate the t-statistic** using the formula for the two-sample t-test:
   \[
   t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
   \]
   Where:
   - \(\bar{x}_A, \bar{x}_B\) = sample means for layouts A and B
   - \(s_A, s_B\) = sample standard deviations for layouts A and B
   - \(n_A, n_B\) = sample sizes for layouts A and B

5. **Calculate the degrees of freedom** (\(df\)) using the formula:
   \[
   df = \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{\left(\frac{s_A^2}{n_A}\right)^2}{n_A - 1} + \frac{\left(\frac{s_B^2}{n_B}\right)^2}{n_B - 1}}
   \]
   This formula accounts for unequal sample variances (Welch-Satterthwaite equation).

6. **Find the p-value**: Using the t-distribution, calculate the p-value for the computed t-statistic.

7. **Decision**: If the p-value is less than \(\alpha\), reject the null hypothesis. This indicates that there is a significant difference in click-through rates between the two layouts.

### Python Implementation:

```python
import numpy as np
from scipy import stats

def perform_ab_test(layout_a_clicks, layout_b_clicks, alpha=0.05):
    # Calculate sample statistics for Layout A
    n_a = len(layout_a_clicks)
    mean_a = np.mean(layout_a_clicks)
    std_a = np.std(layout_a_clicks, ddof=1)
    
    # Calculate sample statistics for Layout B
    n_b = len(layout_b_clicks)
    mean_b = np.mean(layout_b_clicks)
    std_b = np.std(layout_b_clicks, ddof=1)
    
    # Calculate the t-statistic
    pooled_se = np.sqrt((std_a**2 / n_a) + (std_b**2 / n_b))  # Standard error of the difference in means
    t_stat = (mean_a - mean_b) / pooled_se
    
    # Calculate degrees of freedom (Welch-Satterthwaite equation)
    numerator = (std_a**2 / n_a + std_b**2 / n_b)**2
    denominator = ((std_a**2 / n_a)**2 / (n_a - 1)) + ((std_b**2 / n_b)**2 / (n_b - 1))
    df = numerator / denominator
    
    # Calculate the p-value for a two-tailed test
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))  # Two-tailed p-value
    
    # Print the results
    print(f"Layout A: Mean = {mean_a:.2f}, Std Dev = {std_a:.2f}, Sample size = {n_a}")
    print(f"Layout B: Mean = {mean_b:.2f}, Std Dev = {std_b:.2f}, Sample size = {n_b}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"Degrees of freedom: {df:.2f}")
    print(f"P-value: {p_value:.4f}")
    
    # Decision based on p-value
    if p_value < alpha:
        print(f"Reject the null hypothesis: The click-through rates are significantly different.")
    else:
        print(f"Fail to reject the null hypothesis: The click-through rates are not significantly different.")

# Sample data
layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

# Perform the A/B test
perform_ab_test(layout_a_clicks, layout_b_clicks)
```

### Explanation of the Code:

1. **Data Input**: The lists `layout_a_clicks` and `layout_b_clicks` contain the click data for each layout.
2. **Sample Statistics**: We calculate the sample mean and standard deviation for each layout.
3. **t-Statistic Calculation**: The formula for the t-statistic for two independent samples is applied.
4. **Degrees of Freedom**: The degrees of freedom are computed using the Welch-Satterthwaite equation, which is more robust for unequal variances.
5. **p-Value Calculation**: Using `scipy.stats.t.cdf`, we calculate the two-tailed p-value for the t-statistic.
6. **Decision**: If the p-value is less than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis.

### Output Example:

Running the function with the provided click data produces:

```
Layout A: Mean = 32.50, Std Dev = 3.03, Sample size = 10
Layout B: Mean = 42.00, Std Dev = 2.79, Sample size = 10
T-statistic: -7.2981
Degrees of freedom: 17.88
P-value: 0.0000
Reject the null hypothesis: The click-through rates are significantly different.
```

In this case, the p-value is very small (less than 0.05), so we reject the null hypothesis and conclude that there is a significant difference in the click-through rates between the two layouts.
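As a cross-check, SciPy's `stats.ttest_ind` with `equal_var=False` performs the same Welch two-sample t-test, so its statistic and p-value should agree with the manual calculation. This is only a verification sketch, not a replacement for the function above.

```python
from scipy import stats

layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

# equal_var=False selects Welch's t-test (unequal variances allowed)
welch = stats.ttest_ind(layout_a_clicks, layout_b_clicks, equal_var=False)

print(f"t = {welch.statistic:.4f}")
print(f"two-sided p = {welch.pvalue:.2e}")
```

Printing the p-value in scientific notation avoids the uninformative `0.0000` that fixed four-decimal formatting gives for very small values.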

Questions-16 A pharmaceutical company wants to determine if a new drug is more effective than an existing drug in
reducing cholesterol levels. Create a program to analyze the clinical trial data and calculate the t-statistic and p-value for the treatment effect.


Use the following cholesterol-level data:

```python
existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]
```

### *Solution:*
To determine whether the new drug is more effective than the existing drug in reducing cholesterol levels, we can perform a **two-sample t-test**. This will allow us to compare the means of two independent samples (cholesterol levels after treatment with the existing drug vs. the new drug) to see if there is a statistically significant difference.

### Steps for Analysis:
1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): The mean cholesterol levels are the same for both drugs, i.e., \(\mu_{\text{existing}} = \mu_{\text{new}}\).
   - Alternative Hypothesis (\(H_a\)): The mean cholesterol level is lower for the new drug than the existing drug, i.e., \(\mu_{\text{new}} < \mu_{\text{existing}}\) (since we expect the new drug to lower cholesterol more).

2. **Set the significance level** (\(\alpha\)):
   - Typically, we use \(\alpha = 0.05\) (95% confidence level).

3. **Calculate the sample means** and **sample standard deviations** for both groups.

4. **Calculate the t-statistic** using the formula for the two-sample t-test:
   \[
   t = \frac{\bar{x}_{\text{new}} - \bar{x}_{\text{existing}}}{\sqrt{\frac{s_{\text{new}}^2}{n_{\text{new}}} + \frac{s_{\text{existing}}^2}{n_{\text{existing}}}}}
   \]
   Where:
   - \(\bar{x}_{\text{new}}, \bar{x}_{\text{existing}}\) = sample means for new and existing drug
   - \(s_{\text{new}}, s_{\text{existing}}\) = sample standard deviations for new and existing drug
   - \(n_{\text{new}}, n_{\text{existing}}\) = sample sizes for new and existing drug

5. **Calculate the degrees of freedom** using the Welch-Satterthwaite equation (since the variances might not be equal):
   \[
   df = \frac{\left(\frac{s_{\text{new}}^2}{n_{\text{new}}} + \frac{s_{\text{existing}}^2}{n_{\text{existing}}}\right)^2}{\frac{\left(\frac{s_{\text{new}}^2}{n_{\text{new}}}\right)^2}{n_{\text{new}} - 1} + \frac{\left(\frac{s_{\text{existing}}^2}{n_{\text{existing}}}\right)^2}{n_{\text{existing}} - 1}}
   \]

6. **Find the p-value** from the t-distribution for the calculated t-statistic.

7. **Decision**: If the p-value is less than \(\alpha\), reject the null hypothesis, meaning the mean cholesterol level under the new drug is significantly lower than under the existing drug.

### Python Code:

```python
import numpy as np
from scipy import stats

def perform_t_test(existing_drug_levels, new_drug_levels, alpha=0.05):
    # Calculate sample statistics for the existing drug group
    n_existing = len(existing_drug_levels)
    mean_existing = np.mean(existing_drug_levels)
    std_existing = np.std(existing_drug_levels, ddof=1)
    
    # Calculate sample statistics for the new drug group
    n_new = len(new_drug_levels)
    mean_new = np.mean(new_drug_levels)
    std_new = np.std(new_drug_levels, ddof=1)
    
    # Calculate the t-statistic
    pooled_se = np.sqrt((std_new**2 / n_new) + (std_existing**2 / n_existing))  # Standard error of the difference in means
    t_stat = (mean_new - mean_existing) / pooled_se
    
    # Calculate degrees of freedom (Welch-Satterthwaite equation)
    numerator = (std_new**2 / n_new + std_existing**2 / n_existing)**2
    denominator = ((std_new**2 / n_new)**2 / (n_new - 1)) + ((std_existing**2 / n_existing)**2 / (n_existing - 1))
    df = numerator / denominator
    
    # Calculate the p-value for a one-tailed test (since we hypothesize new drug is better)
    p_value = stats.t.cdf(t_stat, df)
    
    # Print the results
    print(f"Existing Drug: Mean = {mean_existing:.2f}, Std Dev = {std_existing:.2f}, Sample size = {n_existing}")
    print(f"New Drug: Mean = {mean_new:.2f}, Std Dev = {std_new:.2f}, Sample size = {n_new}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"Degrees of freedom: {df:.2f}")
    print(f"P-value: {p_value:.4f}")
    
    # Decision based on p-value
    if p_value < alpha:
        print(f"Reject the null hypothesis: The new drug is significantly more effective than the existing drug.")
    else:
        print(f"Fail to reject the null hypothesis: There is no significant difference between the two drugs.")

# Sample data
existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]

# Perform the t-test
perform_t_test(existing_drug_levels, new_drug_levels)
```

### Explanation of Code:

1. **Data Input**: The cholesterol levels after treatment with the existing drug (`existing_drug_levels`) and the new drug (`new_drug_levels`) are provided as lists.
2. **Sample Statistics**: The code calculates the sample mean and standard deviation for both the existing and new drugs.
3. **t-Statistic Calculation**: The formula for the t-statistic for two independent samples is used.
4. **Degrees of Freedom**: The Welch-Satterthwaite equation is used to calculate the degrees of freedom, which accounts for unequal variances.
5. **p-Value Calculation**: The `scipy.stats.t.cdf` function is used to calculate the cumulative distribution function (CDF) for the calculated t-statistic, which gives the one-tailed p-value.
6. **Decision**: If the p-value is smaller than the significance level (typically 0.05), we reject the null hypothesis.

### Example Output:

For the given data, running the code gives the following output:

```
Existing Drug: Mean = 179.40, Std Dev = 4.22, Sample size = 10
New Drug: Mean = 171.90, Std Dev = 3.87, Sample size = 10
T-statistic: -4.1405
Degrees of freedom: 17.87
P-value: 0.0003
Reject the null hypothesis: The new drug is significantly more effective than the existing drug.
```

### Interpretation:

- **T-statistic**: The t-statistic of approximately -4.14 is negative because the new-drug mean (171.90) is below the existing-drug mean (179.40), which is the direction the alternative hypothesis predicts.
- **Degrees of Freedom**: The degrees of freedom for the test is approximately 17.87.
- **P-value**: The one-tailed p-value is very small (about 0.0003), which is less than the significance level of 0.05. Therefore, we reject the null hypothesis.

This suggests that the new drug is significantly more effective than the existing drug in reducing cholesterol levels based on the clinical trial data.
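The manual result can be cross-checked against SciPy's built-in Welch test. The sketch assumes SciPy 1.6 or later, where `stats.ttest_ind` accepts `alternative="less"` for the one-tailed hypothesis \(\mu_{\text{new}} < \mu_{\text{existing}}\).

```python
from scipy import stats

existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]

# First sample is the new drug, so alternative="less" tests mean_new < mean_existing
welch = stats.ttest_ind(
    new_drug_levels,
    existing_drug_levels,
    equal_var=False,      # Welch's t-test, no equal-variance assumption
    alternative="less",
)

print(f"t = {welch.statistic:.4f}, one-sided p = {welch.pvalue:.4f}")
```

The order of the two samples matters here: swapping them flips the sign of the t-statistic and would require `alternative="greater"` instead.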

Questions-17 A school district introduces an educational intervention program to improve math scores. Write a Python
function to analyze pre- and post-intervention test scores, calculating the t-statistic and p-value to
determine if the intervention had a significant impact.


Use the following data of test score:


```python
pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
```

### *Solution:*
To determine whether the educational intervention program had a significant impact on students' math scores, we can perform a **paired t-test**. The paired t-test is used when the same subjects are measured before and after an intervention, and it tests whether the mean difference between the paired observations (i.e., the pre- and post-intervention scores) is significantly different from zero.

### Steps for Paired t-Test:

1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): There is no significant difference in test scores before and after the intervention, i.e., the mean difference is zero.
   - Alternative Hypothesis (\(H_a\)): There is a significant difference in test scores before and after the intervention, i.e., the mean difference is not zero.

2. **Set the significance level** (\(\alpha\)):
   - Typically, we use \(\alpha = 0.05\) (95% confidence level).

3. **Calculate the differences** between the pre- and post-intervention scores for each student.

4. **Calculate the mean and standard deviation** of the differences.

5. **Calculate the t-statistic** using the formula:
   \[
   t = \frac{\bar{d}}{s_d / \sqrt{n}}
   \]
   Where:
   - \(\bar{d}\) = mean of the differences (post - pre)
   - \(s_d\) = standard deviation of the differences
   - \(n\) = number of paired samples

6. **Calculate the degrees of freedom** (\(df\)): 
   \[
   df = n - 1
   \]

7. **Find the p-value**: Using the t-distribution, calculate the p-value for the computed t-statistic.

8. **Decision**: If the p-value is less than \(\alpha\), reject the null hypothesis. This indicates that the intervention had a significant impact on the math scores.

### Python Code:

```python
import numpy as np
from scipy import stats

def perform_paired_t_test(pre_scores, post_scores, alpha=0.05):
    # Calculate the differences between post and pre-intervention scores
    differences = np.array(post_scores) - np.array(pre_scores)
    
    # Calculate mean and standard deviation of the differences
    mean_diff = np.mean(differences)
    std_diff = np.std(differences, ddof=1)
    
    # Number of samples (pairs)
    n = len(differences)
    
    # Calculate the t-statistic
    t_stat = mean_diff / (std_diff / np.sqrt(n))
    
    # Calculate degrees of freedom
    df = n - 1
    
    # Calculate the p-value for a two-tailed test
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))  # Two-tailed test
    
    # Print the results
    print(f"Mean of differences: {mean_diff:.2f}")
    print(f"Standard deviation of differences: {std_diff:.2f}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"Degrees of freedom: {df}")
    print(f"P-value: {p_value:.4f}")
    
    # Decision based on p-value
    if p_value < alpha:
        print("Reject the null hypothesis: The intervention had a significant impact on math scores.")
    else:
        print("Fail to reject the null hypothesis: There is no significant impact of the intervention on math scores.")

# Sample data
pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]

# Perform the paired t-test
perform_paired_t_test(pre_intervention_scores, post_intervention_scores)
```

### Explanation of Code:

1. **Data Input**: The test scores before and after the intervention are provided as two lists: `pre_intervention_scores` and `post_intervention_scores`.
2. **Differences Calculation**: We calculate the difference between the post- and pre-intervention scores for each student.
3. **Mean and Standard Deviation**: The mean and standard deviation of the differences are calculated using `np.mean` and `np.std` with `ddof=1` (sample standard deviation).
4. **t-Statistic Calculation**: The t-statistic is calculated using the formula mentioned earlier.
5. **Degrees of Freedom**: The degrees of freedom for the paired t-test is \(n - 1\), where \(n\) is the number of pairs (students).
6. **p-Value Calculation**: The p-value is calculated using `scipy.stats.t.cdf` for a two-tailed test.
7. **Decision**: If the p-value is less than the significance level \(\alpha = 0.05\), we reject the null hypothesis and conclude that the intervention had a significant impact.

### Example Output:

Running the code with the provided data produces:

```
Mean of differences: 7.70
Standard deviation of differences: 5.50
T-statistic: 4.4284
Degrees of freedom: 9
P-value: 0.0017
Reject the null hypothesis: The intervention had a significant impact on math scores.
```

### Interpretation:

- **Mean of Differences**: The mean difference between post- and pre-intervention scores is 7.7. This suggests that, on average, students' scores increased after the intervention.
- **Standard Deviation**: The standard deviation of the differences is 5.50, showing the variability in the changes across students.
- **T-statistic**: The t-statistic of 4.4284 is large relative to the critical value for 9 degrees of freedom (about 2.26 for a two-tailed test at \(\alpha = 0.05\)), so the mean difference is significantly different from zero.
- **P-value**: The p-value of 0.0017 is very small, which is less than the typical significance level of 0.05. Therefore, we reject the null hypothesis.

### Conclusion:

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the intervention had a statistically significant impact on math scores.
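For a quick verification, `stats.ttest_rel` runs the same two-tailed paired t-test in one call. This is a cross-check of the manual function rather than a different analysis.

```python
from scipy import stats

pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]

# Paired (dependent-samples) t-test; post listed first so t has the sign of post - pre
paired = stats.ttest_rel(post_intervention_scores, pre_intervention_scores)

print(f"t = {paired.statistic:.4f}, two-sided p = {paired.pvalue:.4f}")
```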

Questions-18 An HR department wants to investigate if there's a gender-based salary gap within the company. Develop
a program to analyze salary data, calculate the t-statistic, and determine if there's a statistically
significant difference between the average salaries of male and female employees.


Use the below code to generate synthetic data:


```python
# Generate synthetic salary data for male and female employees
np.random.seed(0)  # For reproducibility

male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)
```

### *Solution:*
To investigate if there is a statistically significant salary gap between male and female employees, we can perform an **independent two-sample t-test**. This will allow us to compare the average salaries of male and female employees and determine if there is a significant difference between the two groups.

### Steps for Independent Two-Sample t-Test:

1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): There is no significant difference in the average salaries of male and female employees, i.e., \(\mu_{\text{male}} = \mu_{\text{female}}\).
   - Alternative Hypothesis (\(H_a\)): There is a significant difference in the average salaries of male and female employees, i.e., \(\mu_{\text{male}} \neq \mu_{\text{female}}\).

2. **Set the significance level** (\(\alpha\)):
   - Typically, \(\alpha = 0.05\) (95% confidence level).

3. **Calculate the sample means** and **standard deviations** for both male and female salaries.

4. **Calculate the t-statistic** using the formula for the independent two-sample t-test:
   \[
   t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   Where:
   - \(\bar{x}_1, \bar{x}_2\) = sample means for male and female salaries
   - \(s_1, s_2\) = sample standard deviations for male and female salaries
   - \(n_1, n_2\) = sample sizes for male and female groups

5. **Calculate the degrees of freedom** using the formula:
   \[
   df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

6. **Find the p-value** using the t-distribution for the calculated t-statistic.

7. **Decision**: If the p-value is less than \(\alpha\), reject the null hypothesis, indicating that there is a significant difference between the average salaries of male and female employees.

### Python Code Implementation:

```python
import numpy as np
from scipy import stats

def perform_salary_t_test(male_salaries, female_salaries, alpha=0.05):
    # Calculate sample statistics for male salaries
    n_male = len(male_salaries)
    mean_male = np.mean(male_salaries)
    std_male = np.std(male_salaries, ddof=1)
    
    # Calculate sample statistics for female salaries
    n_female = len(female_salaries)
    mean_female = np.mean(female_salaries)
    std_female = np.std(female_salaries, ddof=1)
    
    # Calculate the t-statistic
    pooled_se = np.sqrt((std_male**2 / n_male) + (std_female**2 / n_female))  # Standard error of the difference in means
    t_stat = (mean_male - mean_female) / pooled_se
    
    # Calculate degrees of freedom (Welch-Satterthwaite equation)
    numerator = (std_male**2 / n_male + std_female**2 / n_female)**2
    denominator = ((std_male**2 / n_male)**2 / (n_male - 1)) + ((std_female**2 / n_female)**2 / (n_female - 1))
    df = numerator / denominator
    
    # Calculate the p-value for a two-tailed test
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    
    # Print the results
    print(f"Male Salaries: Mean = {mean_male:.2f}, Std Dev = {std_male:.2f}, Sample size = {n_male}")
    print(f"Female Salaries: Mean = {mean_female:.2f}, Std Dev = {std_female:.2f}, Sample size = {n_female}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"Degrees of freedom: {df:.2f}")
    print(f"P-value: {p_value:.4f}")
    
    # Decision based on p-value
    if p_value < alpha:
        print("Reject the null hypothesis: There is a significant salary difference between male and female employees.")
    else:
        print("Fail to reject the null hypothesis: There is no significant salary difference between male and female employees.")

# Generate synthetic salary data for male and female employees
np.random.seed(0)  # For reproducibility

male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)

# Perform the salary t-test
perform_salary_t_test(male_salaries, female_salaries)
```

### Explanation of the Code:

1. **Data Generation**: The synthetic data for male and female salaries is generated using `np.random.normal()`. The mean and standard deviation of the salaries for each group are specified, and 20 data points are generated for each group.
   
2. **Sample Statistics Calculation**: For both male and female groups, the sample mean and standard deviation are calculated.

3. **t-Statistic Calculation**: The t-statistic is computed using the formula for the two-sample t-test.

4. **Degrees of Freedom**: The degrees of freedom are calculated using the Welch-Satterthwaite equation, which is more robust when the variances of the two groups may be unequal.

5. **p-Value Calculation**: The p-value is computed for a two-tailed test using the cumulative distribution function (`stats.t.cdf`).

6. **Decision**: If the p-value is smaller than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis and conclude that there is a statistically significant difference in salaries between male and female employees.
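
As a cross-check on the manual steps above, SciPy's built-in `stats.ttest_ind` with `equal_var=False` runs the same Welch's t-test. The sketch below is only an illustration: `group_a` and `group_b` are hypothetical synthetic samples, not the salary data from the solution, and the manual part simply repeats the formulas used above.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples, used only to illustrate the cross-check
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50000, scale=10000, size=20)
group_b = rng.normal(loc=55000, scale=9000, size=20)

# Manual Welch t-statistic, degrees of freedom and two-tailed p-value
var_a = np.var(group_a, ddof=1) / len(group_a)
var_b = np.var(group_b, ddof=1) / len(group_b)
t_manual = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(var_a + var_b)
df_manual = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# SciPy's Welch t-test (equal_var=False disables the pooled-variance assumption)
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

If the two lines agree, the hand-written formulas are implemented consistently with the library routine.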

### Example Output:

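The output has the following layout. Treat the figures as illustrative: with the seed fixed at 0 the code prints its own deterministic values, which can differ from those shown and can lead to a different decision.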

```
Male Salaries: Mean = 49996.43, Std Dev = 10790.85, Sample size = 20
Female Salaries: Mean = 55056.53, Std Dev = 8847.17, Sample size = 20
T-statistic: -2.2322
Degrees of freedom: 36.40
P-value: 0.0320
Reject the null hypothesis: There is a significant salary difference between male and female employees.
```

### Interpretation:

- **Mean and Standard Deviation**: 
  - The average salary for males is approximately $50,000, and for females, it's approximately $55,000.
  - The standard deviations for male and female salaries are quite different, indicating varying salary dispersion in the two groups.
  
- **T-statistic**: The t-statistic of -2.2322 suggests a moderate difference in means between the two groups.
  
- **Degrees of Freedom**: The degrees of freedom for this test is approximately 36.4, calculated using the Welch-Satterthwaite equation.

- **P-value**: The p-value is 0.0320, which is less than the significance level (\(\alpha = 0.05\)), so we reject the null hypothesis.

### Conclusion:

Since the p-value is smaller than 0.05, we reject the null hypothesis and conclude that there is a statistically significant salary difference between male and female employees.

Questions-19 A manufacturer produces two different versions of a product and wants to compare their quality scores.
Create a Python function to analyze quality assessment data, calculate the t-statistic, and decide
whether there's a significant difference in quality between the two versions.


Use the following data:


```python
version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]

version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]
```

### *Solution:*

To compare the quality scores between two versions of a product, we can use an **independent two-sample t-test**. This statistical test will help determine whether there is a significant difference in the quality scores between **Version 1** and **Version 2**.

### Steps for Independent Two-Sample t-Test:

1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): There is no significant difference in the average quality scores between Version 1 and Version 2, i.e., \(\mu_{\text{version 1}} = \mu_{\text{version 2}}\).
   - Alternative Hypothesis (\(H_a\)): There is a significant difference in the average quality scores between Version 1 and Version 2, i.e., \(\mu_{\text{version 1}} \neq \mu_{\text{version 2}}\).

2. **Set the significance level** (\(\alpha\)):
   - Typically, we use \(\alpha = 0.05\) (95% confidence level).

3. **Calculate the sample means** and **standard deviations** for both versions of the product.

4. **Calculate the t-statistic** using the formula for the independent two-sample t-test:
   \[
   t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   Where:
   - \(\bar{x}_1, \bar{x}_2\) = sample means for Version 1 and Version 2
   - \(s_1, s_2\) = sample standard deviations for Version 1 and Version 2
   - \(n_1, n_2\) = sample sizes for Version 1 and Version 2

5. **Calculate the degrees of freedom** using the Welch-Satterthwaite equation (since the variances may not be equal):
   \[
   df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

6. **Find the p-value** using the t-distribution for the calculated t-statistic.

7. **Decision**: If the p-value is less than \(\alpha\), reject the null hypothesis, indicating that there is a significant difference in quality scores between the two versions.

### Python Code Implementation:

```python
import numpy as np
from scipy import stats

def perform_quality_comparison(version1_scores, version2_scores, alpha=0.05):
    # Calculate sample statistics for Version 1
    n1 = len(version1_scores)
    mean1 = np.mean(version1_scores)
    std1 = np.std(version1_scores, ddof=1)
    
    # Calculate sample statistics for Version 2
    n2 = len(version2_scores)
    mean2 = np.mean(version2_scores)
    std2 = np.std(version2_scores, ddof=1)
    
    # Calculate the t-statistic
    se_diff = np.sqrt((std1**2 / n1) + (std2**2 / n2))  # Unpooled (Welch) standard error of the difference in means
    t_stat = (mean1 - mean2) / se_diff
    
    # Calculate degrees of freedom (Welch-Satterthwaite equation)
    numerator = (std1**2 / n1 + std2**2 / n2)**2
    denominator = ((std1**2 / n1)**2 / (n1 - 1)) + ((std2**2 / n2)**2 / (n2 - 1))
    df = numerator / denominator
    
    # Calculate the p-value for a two-tailed test
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    
    # Print the results
    print(f"Version 1: Mean = {mean1:.2f}, Std Dev = {std1:.2f}, Sample size = {n1}")
    print(f"Version 2: Mean = {mean2:.2f}, Std Dev = {std2:.2f}, Sample size = {n2}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"Degrees of freedom: {df:.2f}")
    print(f"P-value: {p_value:.4f}")
    
    # Decision based on p-value
    if p_value < alpha:
        print("Reject the null hypothesis: There is a significant difference in quality between Version 1 and Version 2.")
    else:
        print("Fail to reject the null hypothesis: There is no significant difference in quality between Version 1 and Version 2.")

# Given data for quality scores of both versions
version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]
version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]

# Perform the t-test for quality comparison
perform_quality_comparison(version1_scores, version2_scores)
```

### Explanation of the Code:

1. **Data Input**: The quality scores for Version 1 (`version1_scores`) and Version 2 (`version2_scores`) are given as lists.

2. **Sample Statistics Calculation**: The sample mean and standard deviation are calculated for both Version 1 and Version 2 using `np.mean` and `np.std`.

3. **t-Statistic Calculation**: The t-statistic is computed using the formula for the independent two-sample t-test.

4. **Degrees of Freedom**: The degrees of freedom are calculated using the Welch-Satterthwaite equation, which is more robust when the variances of the two groups are unequal.

5. **p-Value Calculation**: The p-value is calculated for a two-tailed test using the cumulative distribution function (`stats.t.cdf`).

6. **Decision**: If the p-value is less than the significance level \(\alpha = 0.05\), we reject the null hypothesis and conclude that there is a significant difference in quality scores between the two versions.

### Example Output:

Running the code produces output like this:

```
Version 1: Mean = 86.64, Std Dev = 2.36, Sample size = 25
Version 2: Mean = 79.96, Std Dev = 1.77, Sample size = 25
T-statistic: 11.3258
Degrees of freedom: 44.47
P-value: 0.0000
Reject the null hypothesis: There is a significant difference in quality between Version 1 and Version 2.
```

### Interpretation:

- **Mean and Standard Deviation**: 
  - The average quality score for Version 1 is approximately 86.64, while for Version 2, it is 79.96.
  - The standard deviation for Version 1 (2.36) is higher than for Version 2 (1.77), indicating more variability in the scores for Version 1.
  
- **T-statistic**: The t-statistic of 11.3258 indicates a large difference between the means relative to the variation within each group.

- **Degrees of Freedom**: The degrees of freedom for this test is approximately 44.47, which is based on the Welch-Satterthwaite approximation.

- **P-value**: The p-value is extremely small (less than 0.0001), which is much smaller than the significance level of 0.05. Thus, we reject the null hypothesis.
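
A p-value printed as `0.0000` is a rounding effect of the `:.4f` format, not an exact zero. The sketch below, which assumes the same Welch test is run through `scipy.stats.ttest_ind` rather than the hand-written function, prints the p-value in scientific notation so its order of magnitude is visible.

```python
from scipy import stats

version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]
version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]

# Welch's t-test (unequal variances allowed)
result = stats.ttest_ind(version1_scores, version2_scores, equal_var=False)

print(f"T-statistic: {result.statistic:.4f}")
print(f"P-value (scientific notation): {result.pvalue:.3e}")
```

For very large t-statistics, the survival function `stats.t.sf(abs(t), df)` is also numerically safer than `1 - stats.t.cdf(abs(t), df)`, because it avoids subtracting two numbers that are almost equal.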

### Conclusion:

Since the p-value is significantly smaller than 0.05, we reject the null hypothesis and conclude that there is a **statistically significant difference** in the quality scores between Version 1 and Version 2.

Questions-20   A restaurant chain collects customer satisfaction scores for two different branches. Write a program to
analyze the scores, calculate the t-statistic, and determine if there's a statistically significant difference in
customer satisfaction between the branches.


Use the below data of scores:

```python
branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]

branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]
```

### *Solution:*

To analyze whether there is a statistically significant difference in customer satisfaction between two branches of a restaurant chain, we can use an **independent two-sample t-test**. This test compares the means of two independent groups (Branch A and Branch B) to determine if there is a statistically significant difference between them.

### Steps for Independent Two-Sample t-Test:

1. **State the Hypotheses**:
   - Null Hypothesis (\(H_0\)): There is no significant difference in the average customer satisfaction scores between Branch A and Branch B, i.e., \(\mu_{\text{A}} = \mu_{\text{B}}\).
   - Alternative Hypothesis (\(H_a\)): There is a significant difference in the average customer satisfaction scores between Branch A and Branch B, i.e., \(\mu_{\text{A}} \neq \mu_{\text{B}}\).

2. **Set the significance level** (\(\alpha\)):
   - Typically, \(\alpha = 0.05\) (95% confidence level).

3. **Calculate the sample means** and **standard deviations** for both branches.

4. **Calculate the t-statistic** using the formula for the independent two-sample t-test:
   \[
   t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   Where:
   - \(\bar{x}_1, \bar{x}_2\) = sample means for Branch A and Branch B
   - \(s_1, s_2\) = sample standard deviations for Branch A and Branch B
   - \(n_1, n_2\) = sample sizes for Branch A and Branch B

5. **Calculate the degrees of freedom** using the Welch-Satterthwaite equation (for unequal variances):
   \[
   df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

6. **Find the p-value** using the t-distribution for the calculated t-statistic.

7. **Decision**: If the p-value is less than \(\alpha\), reject the null hypothesis, indicating that there is a significant difference between the customer satisfaction scores of the two branches.

### Python Code Implementation:

```python
import numpy as np
from scipy import stats

def perform_customer_satisfaction_test(branch_a_scores, branch_b_scores, alpha=0.05):
    # Calculate sample statistics for Branch A
    n_a = len(branch_a_scores)
    mean_a = np.mean(branch_a_scores)
    std_a = np.std(branch_a_scores, ddof=1)
    
    # Calculate sample statistics for Branch B
    n_b = len(branch_b_scores)
    mean_b = np.mean(branch_b_scores)
    std_b = np.std(branch_b_scores, ddof=1)
    
    # Calculate the t-statistic
    se_diff = np.sqrt((std_a**2 / n_a) + (std_b**2 / n_b))  # Unpooled (Welch) standard error of the difference in means
    t_stat = (mean_a - mean_b) / se_diff
    
    # Calculate degrees of freedom (Welch-Satterthwaite equation)
    numerator = (std_a**2 / n_a + std_b**2 / n_b)**2
    denominator = ((std_a**2 / n_a)**2 / (n_a - 1)) + ((std_b**2 / n_b)**2 / (n_b - 1))
    df = numerator / denominator
    
    # Calculate the p-value for a two-tailed test
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    
    # Print the results
    print(f"Branch A: Mean = {mean_a:.2f}, Std Dev = {std_a:.2f}, Sample size = {n_a}")
    print(f"Branch B: Mean = {mean_b:.2f}, Std Dev = {std_b:.2f}, Sample size = {n_b}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"Degrees of freedom: {df:.2f}")
    print(f"P-value: {p_value:.4f}")
    
    # Decision based on p-value
    if p_value < alpha:
        print("Reject the null hypothesis: There is a significant difference in customer satisfaction between Branch A and Branch B.")
    else:
        print("Fail to reject the null hypothesis: There is no significant difference in customer satisfaction between Branch A and Branch B.")

# Given data for customer satisfaction scores
branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]
branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]

# Perform the t-test for customer satisfaction comparison
perform_customer_satisfaction_test(branch_a_scores, branch_b_scores)
```

### Explanation of the Code:

1. **Data Input**: The customer satisfaction scores for **Branch A** and **Branch B** are provided as lists.

2. **Sample Statistics Calculation**: The sample mean and standard deviation for each branch are computed using `np.mean` and `np.std` with `ddof=1` to get the sample standard deviation.

3. **t-Statistic Calculation**: The t-statistic is calculated using the formula for the independent two-sample t-test.

4. **Degrees of Freedom**: The degrees of freedom are calculated using the **Welch-Satterthwaite equation** to handle the possibility of unequal variances.

5. **p-Value Calculation**: The p-value is computed for a two-tailed test using the cumulative distribution function (`stats.t.cdf`).

6. **Decision**: If the p-value is smaller than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis and conclude that there is a significant difference in customer satisfaction between Branch A and Branch B.
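
To see how much the choice of Welch's test matters for this data, the following sketch runs `scipy.stats.ttest_ind` twice, once with `equal_var=False` (Welch) and once with `equal_var=True` (pooled variance). The pooled variant is not part of the solution above; it is shown only for comparison. When the two samples have the same size, both versions give the same t-statistic, and only the degrees of freedom can differ.

```python
from scipy import stats

branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]
branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(branch_a_scores, branch_b_scores, equal_var=False)

# Pooled (Student's) t-test: assumes equal variances
pooled = stats.ttest_ind(branch_a_scores, branch_b_scores, equal_var=True)

print(f"Welch:  t = {welch.statistic:.4f}, p = {welch.pvalue:.3e}")
print(f"Pooled: t = {pooled.statistic:.4f}, p = {pooled.pvalue:.3e}")
```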

### Example Output:

Running the code produces the following output:

```
Branch A: Mean = 4.13, Std Dev = 0.72, Sample size = 31
Branch B: Mean = 3.13, Std Dev = 0.72, Sample size = 31
T-statistic: 5.4801
Degrees of freedom: 60.00
P-value: 0.0000
Reject the null hypothesis: There is a significant difference in customer satisfaction between Branch A and Branch B.
```

### Interpretation:

- **Mean and Standard Deviation**:
  - The average customer satisfaction score for Branch A is 4.13, while for Branch B it is 3.13.
  - The standard deviation is the same for both branches (0.72), indicating equal variability in the scores.

- **T-statistic**: The t-statistic of 5.4801 indicates a large difference between the two groups' means relative to the variability within each group.

- **Degrees of Freedom**: The degrees of freedom are 60.00, calculated using the Welch-Satterthwaite equation. Because the sample sizes and variances are equal, this matches \(n_a + n_b - 2\).

- **P-value**: The p-value is extremely small (less than 0.0001), which is much smaller than the significance level of 0.05. Therefore, we reject the null hypothesis.

### Conclusion:

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a **statistically significant difference** in customer satisfaction between Branch A and Branch B. Branch A appears to have higher customer satisfaction compared to Branch B.

Questions-21  A political analyst wants to determine if there is a significant association between age groups and voter
preferences (Candidate A or Candidate B). They collect data from a sample of 500 voters and classify
them into different age groups and candidate preferences. Perform a Chi-Square test to determine if
there is a significant association between age groups and voter preferences.


Use the below code to generate data:

```python
import numpy as np

np.random.seed(0)

age_groups = np.random.choice(['18-30', '31-50', '51+', '51+'], size=30)

voter_preferences = np.random.choice(['Candidate A', 'Candidate B'], size=30)
```

### *Solution:*
To perform a **Chi-Square Test for Independence** to determine if there is a significant association between age groups and voter preferences, we need to follow these steps:

### Steps for the Chi-Square Test for Independence:

1. **State the Hypotheses**:
   - **Null Hypothesis (\(H_0\))**: There is no significant association between age groups and voter preferences.
   - **Alternative Hypothesis (\(H_a\))**: There is a significant association between age groups and voter preferences.

2. **Set the Significance Level (\(\alpha\))**:
   - We typically use \(\alpha = 0.05\), which corresponds to a 95% confidence level.

3. **Create a Contingency Table**:
   - The contingency table will show the frequency distribution of voter preferences across age groups.

4. **Calculate the Chi-Square Statistic**:
   - The Chi-Square statistic is calculated using the observed and expected frequencies:
     \[
     \chi^2 = \sum \frac{(O - E)^2}{E}
     \]
     Where \(O\) is the observed frequency and \(E\) is the expected frequency under the assumption of independence.

5. **Degrees of Freedom**:
   - The degrees of freedom (\(df\)) for the Chi-Square test are calculated as:
     \[
     df = (r - 1) \times (c - 1)
     \]
     Where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table.

6. **Find the p-value**:
   - The p-value is found by comparing the Chi-Square statistic to the Chi-Square distribution with the appropriate degrees of freedom.

7. **Decision**:
   - If the p-value is less than \(\alpha\), reject the null hypothesis, indicating that there is a significant association between age groups and voter preferences.
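
The steps above can be sketched by hand with NumPy and then compared against `scipy.stats.chi2_contingency`. The 3x2 table `observed` below is a hypothetical example, not the survey data; expected counts are computed as (row total × column total) / grand total.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts (rows: age groups, columns: candidates)
observed = np.array([[30, 20],
                     [25, 25],
                     [15, 35]])

# Expected counts under independence
row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()
expected = row_totals @ col_totals / grand_total

# Chi-Square statistic, degrees of freedom and p-value
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof_manual)

# Library result for comparison
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)

print(f"Manual: chi2 = {chi2_manual:.4f}, dof = {dof_manual}, p = {p_manual:.4f}")
print(f"SciPy:  chi2 = {chi2_scipy:.4f}, dof = {dof_scipy}, p = {p_scipy:.4f}")
```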

### Python Code Implementation:

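Here's the Python code to perform the Chi-Square Test for Independence. It follows the data-generation code given in the question, but draws 500 values to match the stated sample of 500 voters and lists each age group once.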

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Generate synthetic data
np.random.seed(0)

# Randomly generating age groups and voter preferences
age_groups = np.random.choice(['18-30', '31-50', '51+'], size=500)
voter_preferences = np.random.choice(['Candidate A', 'Candidate B'], size=500)

# Create a contingency table of age groups vs voter preferences
contingency_table = pd.crosstab(age_groups, voter_preferences,
                                rownames=['age_groups'], colnames=['voter_preferences'])

# Display the contingency table
print("Contingency Table:")
print(contingency_table)

# Perform the Chi-Square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Display the results
print("\nChi-Square Test Results:")
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis: There is a significant association between age groups and voter preferences.")
else:
    print("\nFail to reject the null hypothesis: There is no significant association between age groups and voter preferences.")
```

### Explanation of the Code:

1. **Data Generation**:
   - The `np.random.choice` function is used to randomly generate 500 entries for age groups and voter preferences, simulating a dataset of 500 voters.

2. **Contingency Table**:
   - The `pd.crosstab` function creates a contingency table of the observed frequencies of age groups and voter preferences.

3. **Chi-Square Test**:
   - `chi2_contingency` is used to perform the Chi-Square test on the contingency table. It returns:
     - `chi2_stat`: The Chi-Square statistic.
     - `p_value`: The p-value for the test.
     - `dof`: The degrees of freedom.
     - `expected`: The expected frequencies for each cell in the table under the assumption of independence.

4. **Decision**:
   - Based on the p-value, we either reject or fail to reject the null hypothesis.

### Example Output:

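After running the code, the output has the following layout. The figures are illustrative and do not come from an actual run (the table below totals 600 voters, not 500). Because both variables are drawn independently at random, an actual run will usually give a small Chi-Square statistic and fail to reject the null hypothesis: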

```
Contingency Table:
voter_preferences  Candidate A  Candidate B
age_groups                                 
18-30                    88           92
31-50                   106           94
51+                      82          138

Chi-Square Test Results:
Chi-Square Statistic: 38.9917
P-value: 0.0000
Degrees of Freedom: 2
Expected Frequencies:
[[ 89.6  90.4]
 [102.  98. ]
 [ 84.4  85.6]]

Reject the null hypothesis: There is a significant association between age groups and voter preferences.
```

### Interpretation of Results:

- **Chi-Square Statistic**: The Chi-Square statistic of 38.9917 represents the overall difference between the observed and expected frequencies.
- **p-value**: The p-value is very small (less than 0.0001), which is much smaller than the significance level (\(\alpha = 0.05\)).
- **Decision**: Since the p-value is smaller than 0.05, we reject the null hypothesis.

### Conclusion:

Based on the example output above, there is a **statistically significant association** between age groups and voter preferences. For such data, the choice of candidate (Candidate A or Candidate B) is associated with the age group of the voter.

### Notes:
- The synthetic data is randomly generated in this example, so in practice, you would replace it with actual survey data collected from voters.
- The Chi-Square test assumes that the expected frequency in each cell of the contingency table should be 5 or more. If this assumption is violated, you might need to use Fisher's exact test for small sample sizes.
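
The expected-frequency assumption can be checked directly from the `expected` array that `chi2_contingency` returns. A minimal sketch, using a hypothetical small table (`observed` is invented for illustration) in which the rule of thumb fails:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical small sample: some expected counts fall below 5
observed = np.array([[12, 3],
                     [9, 4],
                     [2, 6]])

_, _, _, expected = chi2_contingency(observed)

min_expected = expected.min()
assumption_ok = bool((expected >= 5).all())

print("Expected frequencies:")
print(np.round(expected, 2))
print(f"Smallest expected count: {min_expected:.2f}")
print(f"All expected counts >= 5: {assumption_ok}")
```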


Questions-22  A company conducted a customer satisfaction survey to determine if there is a significant relationship
between product satisfaction levels (Satisfied, Neutral, Dissatisfied) and the region where customers are
located (East, West, North, South). The survey data is summarized in a contingency table. Conduct a Chi-Square test to determine if there is a significant relationship between product satisfaction levels and
customer regions.


Sample data:

```python
import numpy as np

# Sample data: Product satisfaction levels (rows) vs. Customer regions (columns)

data = np.array([[50, 30, 40, 20], [30, 40, 30, 50], [20, 30, 40, 30]])
```


### *Solution:*

To determine if there is a significant relationship between product satisfaction levels and customer regions using a Chi-Square test, we'll use the given data. The data is structured in a contingency table format, where the rows represent product satisfaction levels (Satisfied, Neutral, Dissatisfied), and the columns represent the regions (East, West, North, South).

The steps to perform a **Chi-Square Test for Independence** are:

1. **State the Hypotheses**:
   - **Null Hypothesis (\(H_0\))**: There is no significant relationship between product satisfaction levels and customer regions (the two variables are independent).
   - **Alternative Hypothesis (\(H_a\))**: There is a significant relationship between product satisfaction levels and customer regions (the two variables are dependent).

2. **Significance Level (\(\alpha\))**:
   - We typically use \(\alpha = 0.05\), which corresponds to a 95% confidence level.

3. **Create a Contingency Table**:
   - The given data is already in the form of a contingency table, where the rows correspond to satisfaction levels and the columns to customer regions.

4. **Perform the Chi-Square Test**:
   - Calculate the observed frequencies, expected frequencies, Chi-Square statistic, and the p-value.

5. **Degrees of Freedom**:
   - The degrees of freedom for the Chi-Square test are calculated as:
     \[
     df = (r - 1) \times (c - 1)
     \]
     Where \(r\) is the number of rows (product satisfaction levels) and \(c\) is the number of columns (customer regions).

6. **Decision**:
   - If the p-value is less than \(\alpha\), reject the null hypothesis, indicating a significant relationship.

### Python Code Implementation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Given data: Product satisfaction levels (rows) vs. Customer regions (columns)
# Rows: Satisfied, Neutral, Dissatisfied
# Columns: East, West, North, South
data = np.array([[50, 30, 40, 20],  # Satisfied
                 [30, 40, 30, 50],  # Neutral
                 [20, 30, 40, 30]])  # Dissatisfied

# Perform the Chi-Square test for independence
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Display the results
print("Chi-Square Test Results:")
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis: There is a significant relationship between product satisfaction levels and customer regions.")
else:
    print("\nFail to reject the null hypothesis: There is no significant relationship between product satisfaction levels and customer regions.")
```

### Explanation of the Code:

1. **Data Input**:
   - The data is provided as a 2D NumPy array, where each row corresponds to a satisfaction level, and each column corresponds to a customer region.

2. **Chi-Square Test**:
   - We use `chi2_contingency` from the `scipy.stats` module to perform the Chi-Square test. This function calculates the Chi-Square statistic, the p-value, the degrees of freedom, and the expected frequencies under the assumption of independence.

3. **Decision**:
   - Based on the p-value, we either reject or fail to reject the null hypothesis.

### Example Output:

Running the above code will give you an output like this:

```
Chi-Square Test Results:
Chi-Square Statistic: 27.7771
P-value: 0.0001
Degrees of Freedom: 6
Expected Frequencies:
[[34.14634146 34.14634146 37.56097561 34.14634146]
 [36.58536585 36.58536585 40.24390244 36.58536585]
 [29.26829268 29.26829268 32.19512195 29.26829268]]

Reject the null hypothesis: There is a significant relationship between product satisfaction levels and customer regions.
```

### Interpretation of Results:

- **Chi-Square Statistic**: The Chi-Square statistic of 27.7771 indicates how much the observed frequencies deviate from the expected frequencies.
- **p-value**: The p-value is approximately 0.0001, which is much smaller than the significance level (\(\alpha = 0.05\)).
- **Degrees of Freedom**: The degrees of freedom are 6, which are calculated as:
  \[
  df = (3 - 1) \times (4 - 1) = 6
  \]
  This corresponds to 3 product satisfaction levels and 4 customer regions.
- **Expected Frequencies**: These are the expected counts for each combination of satisfaction levels and regions, assuming there is no relationship between the variables.
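
To see which cells contribute most to the statistic, one option is to look at Pearson residuals, \((O - E) / \sqrt{E}\). This is an addition to the solution above, offered as a sketch: the squared residuals sum to the Chi-Square statistic, and a positive residual marks a cell with more customers than independence would predict.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Satisfied, Neutral, Dissatisfied; Columns: East, West, North, South
data = np.array([[50, 30, 40, 20],
                 [30, 40, 30, 50],
                 [20, 30, 40, 30]])

chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Pearson residual for each cell
residuals = (data - expected) / np.sqrt(expected)

print("Pearson residuals:")
print(np.round(residuals, 2))
print(f"Sum of squared residuals: {(residuals ** 2).sum():.4f}")
print(f"Chi-Square statistic:     {chi2_stat:.4f}")
```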

### Conclusion:

Since the **p-value** is less than 0.05, we **reject the null hypothesis** and conclude that there **is a significant relationship** between product satisfaction levels and customer regions. This means the satisfaction levels are not independent of the region where the customers are located. The test establishes an association only; it does not show that region causes the differences in satisfaction.

### Notes:
- This analysis assumes that the sample size is large enough for the Chi-Square test to be valid (i.e., expected frequencies should generally be greater than 5). If the sample size is small or expected frequencies are too low, you might need to use Fisher's Exact Test.


Questions-23   A company implemented an employee training program to improve job performance (Effective, Neutral,
Ineffective). After the training, they collected data from a sample of employees and classified them based
on their job performance before and after the training. Perform a Chi-Square test to determine if there is a
significant difference between job performance levels before and after the training.


Sample data:

```python
import numpy as np

# Sample data: Job performance levels before (rows) and after (columns) training

data = np.array([[50, 30, 20], [30, 40, 30], [20, 30, 40]])
```

### *Solution:*

To determine if there is a significant difference in job performance levels before and after training, we can perform a **Chi-Square Test for Independence**. This test will help us analyze whether there is a significant relationship between the job performance levels before the training (rows) and after the training (columns).

### Steps to Perform the Chi-Square Test:

1. **State the Hypotheses**:
   - **Null Hypothesis (\(H_0\))**: There is no significant difference in job performance levels before and after the training (the two variables are independent).
   - **Alternative Hypothesis (\(H_a\))**: There is a significant difference in job performance levels before and after the training (the two variables are dependent).

2. **Set the Significance Level (\(\alpha\))**:
   - We typically use \(\alpha = 0.05\), which corresponds to a 95% confidence level.

3. **Create a Contingency Table**:
   - The rows represent the job performance levels **before** the training (Effective, Neutral, Ineffective).
   - The columns represent the job performance levels **after** the training (Effective, Neutral, Ineffective).

4. **Perform the Chi-Square Test**:
   - We calculate the observed frequencies (the given data), the expected frequencies (under the assumption of no relationship), and then compute the Chi-Square statistic and p-value.

5. **Degrees of Freedom**:
   - The degrees of freedom for the Chi-Square test are calculated as:
     \[
     df = (r - 1) \times (c - 1)
     \]
     Where \(r\) is the number of rows (before the training) and \(c\) is the number of columns (after the training).

6. **Decision**:
   - If the p-value is less than \(\alpha\), we reject the null hypothesis, indicating a significant difference in job performance before and after the training.

### Python Code Implementation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Sample data: Job performance levels before (rows) and after (columns) training
# Rows: Effective, Neutral, Ineffective (Before)
# Columns: Effective, Neutral, Ineffective (After)
data = np.array([[50, 30, 20],  # Before: Effective
                 [30, 40, 30],  # Before: Neutral
                 [20, 30, 40]])  # Before: Ineffective

# Perform the Chi-Square test for independence
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Display the results
print("Chi-Square Test Results:")
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis: There is a significant difference in job performance levels before and after the training.")
else:
    print("\nFail to reject the null hypothesis: There is no significant difference in job performance levels before and after the training.")
```

### Explanation of the Code:

1. **Data Input**:
   - The data is represented as a 2D NumPy array, where the rows correspond to the job performance levels **before** the training (Effective, Neutral, Ineffective), and the columns correspond to the job performance levels **after** the training (Effective, Neutral, Ineffective).

2. **Chi-Square Test**:
   - The `chi2_contingency` function from the `scipy.stats` module is used to calculate the Chi-Square statistic, the p-value, the degrees of freedom, and the expected frequencies under the null hypothesis of independence.

3. **Decision**:
   - Based on the p-value, we either reject or fail to reject the null hypothesis.
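
As a rough cross-check on what `chi2_contingency` computes, the sketch below rebuilds the expected frequencies, the statistic, the degrees of freedom, and the p-value by hand from the same table. The names `observed`, `expected_manual`, `chi2_manual`, `dof_manual`, and `p_manual` are illustrative and are not part of the solution above.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Same contingency table as above (rows: before, columns: after)
observed = np.array([[50, 30, 20],
                     [30, 40, 30],
                     [20, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# Expected count for each cell: (row total * column total) / grand total
expected_manual = row_totals * col_totals / grand_total

# Chi-Square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (r - 1) * (c - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper-tail probability of the Chi-Square distribution
p_manual = chi2.sf(chi2_manual, dof_manual)

# Cross-check against SciPy's result
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)
print("Statistic matches:", np.isclose(chi2_manual, chi2_scipy))
print("P-value matches:", np.isclose(p_manual, p_scipy))
print("Degrees of freedom match:", dof_manual == dof_scipy)
```

The manual and SciPy values should agree here because the table is larger than 2×2, so no continuity correction is applied.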

### Example Output:

Running the code should produce output close to the following (the array formatting may vary with the NumPy version):

```
Chi-Square Test Results:
Chi-Square Statistic: 22.1617
P-value: 0.0002
Degrees of Freedom: 4
Expected Frequencies:
[[34.48275862 34.48275862 31.03448276]
 [34.48275862 34.48275862 31.03448276]
 [31.03448276 31.03448276 27.93103448]]

Reject the null hypothesis: There is a significant difference in job performance levels before and after the training.
```

### Interpretation of Results:

- **Chi-Square Statistic**: The Chi-Square statistic of 22.1617 measures the difference between the observed and expected frequencies.
- **p-value**: The p-value is about 0.0002, which is much smaller than the significance level \(\alpha = 0.05\).
- **Degrees of Freedom**: The degrees of freedom are 4, calculated as:
  \[
  df = (3 - 1) \times (3 - 1) = 4
  \]
  Where 3 is the number of categories for job performance levels (before and after the training).
- **Expected Frequencies**: These are the frequencies we would expect in each cell if there were no relationship between job performance before and after the training.

### Conclusion:

Since the **p-value** is less than 0.05, we **reject the null hypothesis** and conclude that there is a **significant difference** in job performance levels before and after the training. This suggests that the training program may have had an effect on employees' job performance levels.

### Notes:
- If the expected frequencies in any cell are too low (typically, if any expected frequency is less than 5), the Chi-Square test may not be valid. In such cases, Fisher's Exact Test might be more appropriate, especially for small sample sizes.
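
A minimal sketch of how the expected-frequency rule of thumb could be checked for this table; the variable names `min_expected` and `cells_below_5` are illustrative, and the threshold of 5 is the convention mentioned in the note above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same contingency table as in the solution
data = np.array([[50, 30, 20],
                 [30, 40, 30],
                 [20, 30, 40]])

# Only the expected frequencies are needed for this check
_, _, _, expected = chi2_contingency(data)

min_expected = expected.min()
cells_below_5 = int((expected < 5).sum())

print(f"Smallest expected frequency: {min_expected:.2f}")
print(f"Cells with expected frequency below 5: {cells_below_5}")
```

For this data every expected count is well above 5, so the rule of thumb is satisfied.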


Questions-24  A company produces three different versions of a product: Standard, Premium, and Deluxe. The
company wants to determine if there is a significant difference in customer satisfaction scores among the
three product versions. They conducted a survey and collected customer satisfaction scores for each
version from a random sample of customers. Perform an ANOVA test to determine if there is a significant
difference in customer satisfaction scores.


  Use the following data:

  ```python
  # Sample data: Customer satisfaction scores for each product version
  standard_scores = [80, 85, 90, 78, 88, 82, 92, 78, 85, 87]
  premium_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
  deluxe_scores = [95, 98, 92, 97, 96, 94, 98, 97, 92, 99]
  ```

### *Solution:*
To determine if there is a significant difference in customer satisfaction scores among the three product versions (Standard, Premium, and Deluxe), we can perform a **One-Way ANOVA (Analysis of Variance)** test. The ANOVA test helps us assess whether the means of the three groups (Standard, Premium, and Deluxe) are significantly different from one another.

### Steps to Perform a One-Way ANOVA:

1. **State the Hypotheses**:
   - **Null Hypothesis (\(H_0\))**: The mean customer satisfaction scores of the three product versions are equal.
   - **Alternative Hypothesis (\(H_a\))**: At least one product version has a mean customer satisfaction score that differs from the others.

2. **Set the Significance Level (\(\alpha\))**:
   - Typically, we use \(\alpha = 0.05\), which corresponds to a 95% confidence level.

3. **Perform the ANOVA**:
   - The ANOVA test compares the means of the three groups and computes the F-statistic, which is used to determine whether there is a significant difference among the means.

4. **Decision**:
   - If the p-value is less than \(\alpha\), we reject the null hypothesis and conclude that there is a significant difference in customer satisfaction scores among the three product versions.

### Python Code Implementation:

```python
from scipy.stats import f_oneway

# Sample data: Customer satisfaction scores for each product version
standard_scores = [80, 85, 90, 78, 88, 82, 92, 78, 85, 87]
premium_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
deluxe_scores = [95, 98, 92, 97, 96, 94, 98, 97, 92, 99]

# Perform the One-Way ANOVA
f_stat, p_value = f_oneway(standard_scores, premium_scores, deluxe_scores)

# Display the results
print("ANOVA Test Results:")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis: There is a significant difference in customer satisfaction scores among the three product versions.")
else:
    print("\nFail to reject the null hypothesis: There is no significant difference in customer satisfaction scores among the three product versions.")
```

### Explanation of the Code:

1. **Data Input**:
   - The `standard_scores`, `premium_scores`, and `deluxe_scores` arrays represent the customer satisfaction scores for each of the three product versions.

2. **ANOVA Test**:
   - We use the `f_oneway` function from `scipy.stats` to perform a one-way ANOVA. This function takes the scores of the three groups as input and returns the F-statistic and the p-value.

3. **Decision**:
   - Based on the p-value, we either reject or fail to reject the null hypothesis.
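
To show what `f_oneway` is doing, here is a sketch that computes the F-statistic by hand from the between-group and within-group sums of squares and compares it with SciPy's result. The names `groups`, `ss_between`, `ss_within`, `f_manual`, and `p_manual` are illustrative and not part of the solution above.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

groups = [
    np.array([80, 85, 90, 78, 88, 82, 92, 78, 85, 87]),  # Standard
    np.array([90, 92, 88, 92, 95, 91, 96, 93, 89, 93]),  # Premium
    np.array([95, 98, 92, 97, 96, 94, 98, 97, 92, 99]),  # Deluxe
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group variation: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = f_dist.sf(f_manual, df_between, df_within)

# Cross-check against SciPy
f_scipy, p_scipy = f_oneway(*groups)
print("F matches:", np.isclose(f_manual, f_scipy))
print("P-value matches:", np.isclose(p_manual, p_scipy))
```

With three groups of ten scores each, the degrees of freedom are 2 between groups and 27 within groups.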

### Example Output:

Running the code gives output close to the following:

```
ANOVA Test Results:
F-statistic: 27.0356
P-value: 0.0000

Reject the null hypothesis: There is a significant difference in customer satisfaction scores among the three product versions.
```

### Interpretation of Results:

- **F-statistic**: The F-statistic of 27.0356 is the ratio of the variance between the groups (product versions) to the variance within the groups. A larger value means the group means differ by more than the spread within each group would explain.

- **p-value**: The p-value prints as 0.0000 at four decimal places (it is roughly \(3.6 \times 10^{-7}\)), which is far smaller than the significance level (\(\alpha = 0.05\)).

- **Decision**: Since the p-value is less than 0.05, we **reject the null hypothesis**.

### Conclusion:

There is a **significant difference** in customer satisfaction scores among the three product versions (Standard, Premium, and Deluxe). This suggests that at least one of the product versions has a different average satisfaction score compared to the others.
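
The ANOVA result alone does not say which version differs. As a simple descriptive follow-up (not a formal post-hoc test), the sketch below prints each group's mean and sample standard deviation; the `scores` and `summary` dictionaries are illustrative names.

```python
from statistics import mean, stdev

scores = {
    "Standard": [80, 85, 90, 78, 88, 82, 92, 78, 85, 87],
    "Premium": [90, 92, 88, 92, 95, 91, 96, 93, 89, 93],
    "Deluxe": [95, 98, 92, 97, 96, 94, 98, 97, 92, 99],
}

# Mean and sample standard deviation for each product version
summary = {name: (mean(vals), stdev(vals)) for name, vals in scores.items()}

for name, (m, s) in summary.items():
    print(f"{name:<8} mean = {m:.2f}, sample std = {s:.2f}")
```

For this data the sample means are 84.5 (Standard), 91.9 (Premium), and 95.8 (Deluxe), so the Standard version sits noticeably below the other two.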
