**Question 1:** What is the difference between descriptive statistics and inferential statistics? Explain with examples.

**Answer:**
Descriptive Statistics summarize and describe the main features of a dataset. They provide simple summaries about the sample data and present it in a meaningful way without making conclusions beyond the data analyzed.

Key characteristics:

- Describes what the data shows

- Uses measures like mean, median, mode, standard deviation

- Uses visualizations like histograms, bar charts, pie charts

- No generalizations beyond the observed data

Examples:

1. Average height of students in a class: 5.6 feet

2. 70% of survey respondents prefer online shopping

3. The distribution of ages in a company ranges from 22 to 65 years

Inferential Statistics use sample data to make inferences, predictions, or generalizations about a larger population. They rely on hypothesis testing, confidence intervals, and probability.

Key characteristics:

- Makes predictions about populations based on samples

- Uses techniques like t-tests, ANOVA, regression analysis

- Involves uncertainty and probability

- Helps in decision-making and predictions

Examples:

1. Based on a sample of 1000 voters, we estimate that about 55% of the entire population will vote for candidate A, with a margin of error of roughly ±3 percentage points at 95% confidence

2. A drug trial on 500 patients suggests the medication is effective for the entire patient population

3. Quality control testing of a sample of products to judge whether the entire batch meets standards
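The voter example can be made concrete with a normal-approximation (Wald) confidence interval for a proportion. The following is a minimal sketch, assuming a simple random sample; the function name `proportion_confidence_interval` and the count of 550 supporters are illustrative and do not come from a real poll.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Wald (normal-approximation) interval for a population proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n                              # sample proportion
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error
    return p_hat, p_hat - margin, p_hat + margin

# Hypothetical poll: 550 of 1000 sampled voters favor candidate A
p_hat, low, high = proportion_confidence_interval(550, 1000)
print(f"Sample proportion: {p_hat:.3f}")
print(f"Approximate 95% CI: ({low:.3f}, {high:.3f})")
```

The sample proportion is the descriptive part; the interval around it is the inferential part, since it makes a statement about voters who were never surveyed.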



**Question 2:** What is sampling in statistics? Explain the differences between random and stratified sampling.

**Answer:**
Sampling is the process of selecting a subset of individuals, items, or observations from a larger population to make inferences about that population. Since studying entire populations is often impractical due to cost, time, and accessibility constraints, sampling allows statisticians to draw conclusions efficiently.

##### Random Sampling:

- Every member of the population has an equal probability of being selected

- Selection is completely by chance with no bias

- Simple to implement and understand

- Provides unbiased estimates of population parameters

Advantages:

- Reduces selection bias, provided the population list is complete

- Simple statistical analysis

- Representative if sample size is adequate

Disadvantages:

- May not capture all subgroups adequately

- Requires complete population list

- Can be inefficient for heterogeneous populations

Example: Selecting 100 students from a university of 10,000 by randomly picking student ID numbers.

##### Stratified Sampling:

- Population is divided into distinct subgroups (strata) based on specific characteristics

- Random samples are then taken from each stratum

- Ensures representation from all important subgroups

- Sample size from each stratum can be proportional or equal

Advantages:

- Ensures representation of all subgroups

- More precise estimates for subpopulations

- Reduces sampling error

- Better for heterogeneous populations

Disadvantages:

- Requires prior knowledge of population characteristics

- More complex to implement

- Higher administrative costs

Example: Dividing university students by year (freshman, sophomore, junior, senior) and randomly selecting 25 students from each year to ensure all academic levels are represented.
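The two university examples can be sketched with the standard library's `random.sample`. This is an illustrative sketch, not a production sampling routine: the population of 10,000 student IDs and the way years are assigned to them are made up for the demonstration.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 student IDs, each tagged with a year
years = ["freshman", "sophomore", "junior", "senior"]
population = [(student_id, years[student_id % 4]) for student_id in range(10_000)]

# Simple random sampling: every student has the same chance of selection
simple_sample = random.sample(population, 100)

# Stratified sampling: group by year, then sample 25 from each stratum
strata = {}
for student_id, year in population:
    strata.setdefault(year, []).append((student_id, year))

stratified_sample = []
for year in years:
    stratified_sample.extend(random.sample(strata[year], 25))

def count_by_year(sample):
    """Count how many sampled students fall in each year."""
    counts = {year: 0 for year in years}
    for _, year in sample:
        counts[year] += 1
    return counts

print("Simple random sample:", count_by_year(simple_sample))
print("Stratified sample:   ", count_by_year(stratified_sample))
```

The simple random sample usually has uneven counts per year, while the stratified sample has exactly 25 students from each year by construction.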

 

**Question 3:** Define mean, median, and mode. Explain why these measures of central tendency are important.


**Answer:**
##### Mean (Arithmetic Average):

- Sum of all values divided by the number of values

- Formula: x̄ = Σx / n for a sample of n values (μ = Σx / N for a population of N values)

- Most commonly used measure of central tendency

- Sensitive to outliers

##### Median:

- Middle value when data is arranged in ascending or descending order

- For even number of values: average of two middle values

- Less affected by outliers than mean

- Better for skewed distributions

##### Mode:

- Most frequently occurring value in the dataset

- Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)

- The only measure of central tendency that applies to nominal (categorical) data

- Useful for understanding most common occurrences

##### Importance of Measures of Central Tendency:

Data Summarization: Provide a single representative value for the entire dataset

Comparison: Enable comparison between different datasets or groups

Decision Making: Help in making informed business and research decisions

Pattern Recognition: Identify typical values and understand data distribution

Quality Control: Set benchmarks and standards in manufacturing and services

Research Analysis: Essential for hypothesis testing and statistical inference

Communication: Simplify complex data for stakeholders and general audience

##### When to Use Which:

Mean: For normally distributed data without significant outliers

Median: For skewed data or when outliers are present

Mode: For categorical data or when identifying the most common value is important
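A small sketch of why the median is preferred when outliers are present. The salary figures are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical monthly salaries (in thousands); 400 is added as an outlier
salaries = [30, 32, 35, 35, 38, 40, 42]
salaries_with_outlier = salaries + [400]

for label, values in [("Without outlier", salaries),
                      ("With outlier", salaries_with_outlier)]:
    print(f"{label}: mean = {statistics.mean(values):.2f}, "
          f"median = {statistics.median(values)}, "
          f"mode = {statistics.mode(values)}")
```

Adding one extreme value moves the mean from 36 to 81.5, while the median only moves from 35 to 36.5 and the mode stays at 35.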

**Question 4:** Explain skewness and kurtosis. What does a positive skew imply about the data?


**Answer:**
Skewness measures the asymmetry of a probability distribution around its mean. It indicates whether data points are more spread out on one side of the mean than the other.

##### Types of Skewness:

Positive Skew (Right Skew):

- Tail extends toward higher values

- Typically Mean > Median > Mode

- Skewness value > 0

Negative Skew (Left Skew):

- Tail extends toward lower values

- Typically Mode > Median > Mean

- Skewness value < 0

Zero Skew (Symmetric):

- Data is symmetrically distributed

- Mean = Median = Mode (for a unimodal distribution)

- Skewness value = 0

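Kurtosis measures the "tailedness" of a distribution compared to a normal distribution, that is, how much weight sits in the extreme values. It is often loosely described as "peakedness", but it is driven mainly by the tails rather than by the height of the peak.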

##### Types of Kurtosis:

1. Mesokurtic: Normal distribution (kurtosis = 3, excess kurtosis = 0)

2. Leptokurtic: Heavier tails than normal, often with a sharper peak (kurtosis > 3)

3. Platykurtic: Lighter tails than normal, often with a flatter peak (kurtosis < 3)
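Both measures can be computed directly from central moments. The sketch below uses the population (1/n) convention, under which a normal distribution has kurtosis 3; the helper names and the two small datasets are illustrative assumptions.

```python
def central_moment(values, k):
    """k-th central moment, using the population (1/n) convention."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def skewness(values):
    """Third central moment divided by variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Pearson kurtosis; a normal distribution gives 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 4, 15]   # hypothetical data with a long right tail

for label, data in [("Symmetric", symmetric), ("Right-skewed", right_skewed)]:
    print(f"{label}: skewness = {skewness(data):.3f}, kurtosis = {kurtosis(data):.3f}")
```

Statistical libraries differ in their conventions: some report excess kurtosis (kurtosis minus 3) by default and some apply small-sample bias corrections, so values from other tools may not match this sketch exactly.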

##### What Positive Skew Implies:

When data has a positive skew, it indicates:

- Distribution Shape: The majority of data points are concentrated on the lower end of the scale, with a long tail extending toward higher values

- Mean vs. Median: The mean is pulled toward the tail and is greater than the median

Common Examples:

- Income distribution (most people earn modest wages, few earn very high incomes)

- House prices (many affordable homes, few luxury properties)

- Response times (most tasks completed quickly, some take much longer)

Practical Implications:

- Extreme values are more likely on the high end

- The average may not represent the typical value well

- Median might be a better measure of central tendency

- Important for risk assessment and resource planning

Statistical Considerations:

- May violate assumptions of some statistical tests

- Might require data transformation for analysis

- Affects confidence intervals and hypothesis testing
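The points about the mean being pulled toward the tail and about data transformation can be illustrated with a log transform. This is a sketch on made-up income figures; the `skewness` helper uses the population (1/n) formula and is defined here only for the demonstration.

```python
import math
import statistics

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

# Hypothetical annual incomes (in thousands): most modest, a few very high
incomes = [28, 31, 35, 38, 40, 42, 45, 50, 60, 75, 120, 400]
log_incomes = [math.log(v) for v in incomes]

print(f"Mean:   {statistics.fmean(incomes):.1f}")
print(f"Median: {statistics.median(incomes)}")
print(f"Skewness before log transform: {skewness(incomes):.2f}")
print(f"Skewness after log transform:  {skewness(log_incomes):.2f}")
```

The mean lands well above the median, and the log transform reduces the skewness without removing it entirely. Whether a transformation is appropriate depends on the analysis being run.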




**Question 5:** Implement a Python program to compute the mean, median, and mode of a given list of numbers.


```
import statistics
from collections import Counter

# Given list of numbers
numbers = [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]

print("Dataset:", numbers)
print("Number of elements:", len(numbers))
print("-" * 40)

# Calculate Mean
mean_value = statistics.mean(numbers)
print(f"Mean: {mean_value}")

# Alternative calculation for mean
mean_manual = sum(numbers) / len(numbers)
print(f"Mean (manual calculation): {mean_manual}")

print("-" * 40)

# Calculate Median
median_value = statistics.median(numbers)
print(f"Median: {median_value}")

# Show sorted list for understanding
sorted_numbers = sorted(numbers)
print(f"Sorted dataset: {sorted_numbers}")
mid = len(sorted_numbers) // 2
if len(sorted_numbers) % 2 == 1:
    # Odd length: a single middle element
    print(f"Middle position (0-based index): {mid}")
else:
    # Even length: the median averages the two middle elements
    print(f"Middle positions (0-based indices): {mid - 1} and {mid}")

print("-" * 40)

# Calculate Mode (in Python 3.8+, statistics.mode returns the first mode encountered)
mode_value = statistics.mode(numbers)
print(f"Mode: {mode_value}")

# Alternative: Find all modes (in case of multiple modes)
counter = Counter(numbers)
max_count = max(counter.values())
modes = [num for num, count in counter.items() if count == max_count]
print(f"All modes: {modes}")
print(f"Frequency of mode(s): {max_count}")

print("-" * 40)

# Additional statistics
print("SUMMARY STATISTICS:")
print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"Range: {max(numbers) - min(numbers)}")
print(f"Standard Deviation: {statistics.stdev(numbers):.2f}")
```

**Output:**

```
Dataset: [12, 15, 12, 18, 19, 12, 20, 22, 19, 19, 24, 24, 24, 26, 28]
Number of elements: 15
----------------------------------------
Mean: 19.6
Mean (manual calculation): 19.6
----------------------------------------
Median: 19
Sorted dataset: [12, 12, 12, 15, 18, 19, 19, 19, 20, 22, 24, 24, 24, 26, 28]
Middle position (0-based index): 7
----------------------------------------
Mode: 12
All modes: [12, 19, 24]
Frequency of mode(s): 3
----------------------------------------
SUMMARY STATISTICS:
Mean: 19.60
Median: 19
Mode: 12
Range: 16
Standard Deviation: 5.17
```


**Question 6:** Compute the covariance and correlation coefficient between the following two datasets.
```
import numpy as np
import statistics

# Given datasets
list_x = [10, 20, 30, 40, 50]
list_y = [15, 25, 35, 45, 60]

print("Dataset X:", list_x)
print("Dataset Y:", list_y)
print("Number of data points:", len(list_x))
print("-" * 50)

# Calculate means
mean_x = statistics.mean(list_x)
mean_y = statistics.mean(list_y)
print(f"Mean of X: {mean_x}")
print(f"Mean of Y: {mean_y}")

print("-" * 50)

# Calculate Covariance manually
n = len(list_x)
covariance_manual = sum((x - mean_x) * (y - mean_y) for x, y in zip(list_x, list_y)) / (n - 1)
print(f"Covariance (manual calculation): {covariance_manual}")

# Calculate Covariance using NumPy
covariance_numpy = np.cov(list_x, list_y)[0][1]
print(f"Covariance (NumPy): {covariance_numpy}")

print("-" * 50)

# Calculate standard deviations (sample standard deviation, n - 1 denominator)
std_x = statistics.stdev(list_x)
std_y = statistics.stdev(list_y)
print(f"Standard deviation of X: {std_x:.4f}")
print(f"Standard deviation of Y: {std_y:.4f}")

# Calculate Correlation Coefficient manually
correlation_manual = covariance_manual / (std_x * std_y)
print(f"Correlation coefficient (manual): {correlation_manual:.4f}")

# Calculate Correlation Coefficient using NumPy
correlation_numpy = np.corrcoef(list_x, list_y)[0][1]
print(f"Correlation coefficient (NumPy): {correlation_numpy:.4f}")

print("-" * 50)

# Detailed step-by-step calculation
print("STEP-BY-STEP COVARIANCE CALCULATION:")
print("i\tX\tY\t(X-X̄)\t(Y-Ȳ)\t(X-X̄)(Y-Ȳ)")
print("-" * 60)
total_cross_product = 0
for i, (x, y) in enumerate(zip(list_x, list_y)):
    diff_x = x - mean_x
    diff_y = y - mean_y
    cross_product = diff_x * diff_y
    total_cross_product += cross_product
    print(f"{i+1}\t{x}\t{y}\t{diff_x}\t{diff_y}\t{cross_product}")

print("-" * 60)
print(f"Sum of cross products: {total_cross_product}")
print(f"Covariance = {total_cross_product} / {n-1} = {total_cross_product/(n-1)}")

print("-" * 50)

# Interpretation
print("INTERPRETATION:")
if correlation_numpy > 0.8:
    strength = "strong positive"
elif correlation_numpy > 0.5:
    strength = "moderate positive"
elif correlation_numpy > 0:
    strength = "weak positive"
elif correlation_numpy == 0:
    strength = "no linear"
else:
    strength = "negative"

print(f"The correlation coefficient of {correlation_numpy:.4f} indicates a {strength} linear relationship.")
print(f"As X increases, Y tends to {'increase' if correlation_numpy > 0 else 'decrease'}.")

```

**Output:**

```
Dataset X: [10, 20, 30, 40, 50]
Dataset Y: [15, 25, 35, 45, 60]
Number of data points: 5
--------------------------------------------------
Mean of X: 30
Mean of Y: 36
--------------------------------------------------
Covariance (manual calculation): 275.0
Covariance (NumPy): 275.0
--------------------------------------------------
Standard deviation of X: 15.8114
Standard deviation of Y: 17.4642
Correlation coefficient (manual): 0.9959
Correlation coefficient (NumPy): 0.9959
--------------------------------------------------
STEP-BY-STEP COVARIANCE CALCULATION:
i	X	Y	(X-X̄)	(Y-Ȳ)	(X-X̄)(Y-Ȳ)
------------------------------------------------------------
1	10	15	-20	-21	420
2	20	25	-10	-11	110
3	30	35	0	-1	0
4	40	45	10	9	90
5	50	60	20	24	480
------------------------------------------------------------
Sum of cross products: 1100
Covariance = 1100 / 4 = 275.0
--------------------------------------------------
INTERPRETATION:
The correlation coefficient of 0.9959 indicates a strong positive linear relationship.
As X increases, Y tends to increase.
```

**Question 7:** Write a Python script to draw a boxplot for the following numeric list and identify its outliers.
```
import matplotlib.pyplot as plt
import numpy as np
import statistics

# Given data
data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

print("Dataset:", data)
print("Number of elements:", len(data))
print("-" * 50)

# Calculate quartiles and IQR
Q1 = np.percentile(data, 25)
Q2 = np.percentile(data, 50)  # Median
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

print(f"Q1 (25th percentile): {Q1}")
print(f"Q2 (50th percentile/Median): {Q2}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR (Interquartile Range): {IQR}")

print("-" * 50)

# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower bound for outliers: {lower_bound}")
print(f"Upper bound for outliers: {upper_bound}")

# Identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
non_outliers = [x for x in data if lower_bound <= x <= upper_bound]

print(f"Outliers: {outliers}")
print(f"Number of outliers: {len(outliers)}")
print(f"Non-outliers: {non_outliers}")

print("-" * 50)

# Create boxplot
plt.figure(figsize=(10, 6))

# Create the boxplot
box_plot = plt.boxplot(data, patch_artist=True, labels=['Data'])

# Customize the boxplot
box_plot['boxes'][0].set_facecolor('lightblue')
box_plot['boxes'][0].set_alpha(0.7)

# Add title and labels
plt.title('Boxplot Analysis of Dataset', fontsize=14, fontweight='bold')
plt.ylabel('Values', fontsize=12)
plt.grid(True, alpha=0.3)

# Add statistical information as text
textstr = f'''Statistical Summary:
Q1: {Q1}
Median: {Q2}
Q3: {Q3}
IQR: {IQR}
Outliers: {len(outliers)}'''

plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes, fontsize=10,
         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Highlight outliers
if outliers:
    for outlier in outliers:
        plt.annotate(f'{outlier}', xy=(1, outlier), xytext=(1.1, outlier),
                    arrowprops=dict(arrowstyle='->', color='red'),
                    fontsize=10, color='red')

plt.tight_layout()
plt.show()

print("-" * 50)

# Additional analysis
print("DETAILED ANALYSIS:")
print(f"Mean: {statistics.mean(data):.2f}")
print(f"Standard Deviation: {statistics.stdev(data):.2f}")
print(f"Range: {max(data) - min(data)}")
print(f"Minimum: {min(data)}")
print(f"Maximum: {max(data)}")

# Outlier analysis
if outliers:
    print(f"\nOutlier Analysis:")
    print(f"- {len(outliers)} outlier(s) detected: {outliers}")
    print(f"- Outliers represent {len(outliers)/len(data)*100:.1f}% of the data")
    for outlier in outliers:
        if outlier > upper_bound:
            print(f"- {outlier} is {outlier - upper_bound:.1f} units above the upper bound")
        if outlier < lower_bound:
            print(f"- {outlier} is {lower_bound - outlier:.1f} units below the lower bound")
else:
    print("\nNo outliers detected in the dataset.")

```
**Output:**

```
Dataset: [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]
Number of elements: 16
--------------------------------------------------
Q1 (25th percentile): 17.25
Q2 (50th percentile/Median): 21.5
Q3 (75th percentile): 23.25
IQR (Interquartile Range): 6.0
--------------------------------------------------
Lower bound for outliers: 8.25
Upper bound for outliers: 32.25
Outliers: [35]
Number of outliers: 1
Non-outliers: [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29]
--------------------------------------------------
DETAILED ANALYSIS:
Mean: 21.00
Standard Deviation: 5.98
Range: 23
Minimum: 12
Maximum: 35

Outlier Analysis:
- 1 outlier(s) detected: [35]
- Outliers represent 6.2% of the data
- 35 is 2.8 units above the upper bound
```

**Explanation of Results:**

The boxplot reveals several key insights about the dataset:

Distribution Shape: The median (21.5) sits closer to Q3 (23.25) than to Q1 (17.25), so the lower half of the box is the longer one, while the single high value of 35 stretches the upper tail. The mean (21.0) and the median are close, so the data is not strongly skewed overall.

Central Tendency: The median (Q2) is 21.5, indicating that half the values are below this point.

Spread: The IQR of 6.0 shows moderate variability in the middle 50% of the data.

Outliers: The value 35 is identified as an outlier because it lies above the upper bound of 32.25, that is, more than 1.5 × IQR above Q3.

Data Quality: With only one outlier out of 16 data points (6.25%), the dataset is relatively clean.

The boxplot effectively visualizes the five-number summary (minimum, Q1, median, Q3, maximum) and clearly identifies the outlier, making it a useful tool for exploratory data analysis.
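The five-number summary and the whisker positions can also be computed without NumPy. This is a sketch using `statistics.quantiles` (Python 3.8+); the assumption is that `method="inclusive"` reproduces the linear interpolation used by `np.percentile` by default, which holds for this dataset.

```python
import statistics

data = [12, 14, 14, 15, 18, 19, 19, 21, 22, 22, 23, 23, 24, 26, 29, 35]

# method="inclusive" interpolates linearly between neighboring data points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Five-number summary: min={min(data)}, Q1={q1}, median={q2}, Q3={q3}, max={max(data)}")
print(f"Fences: ({lower_fence}, {upper_fence})")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

Note that when an outlier is present, the upper whisker stops at 29 (the largest value inside the fence) rather than at the maximum of 35, which is drawn as a separate point.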

**Question 8:** E-commerce Analysis - Relationship between advertising spend and daily sales.
**Answer:**

##### How to Use Covariance and Correlation to Explore the Relationship:

Covariance Analysis:

- Measures how two variables change together

- Positive covariance indicates variables tend to increase together

- Negative covariance indicates one increases as the other decreases

- Magnitude is difficult to interpret due to scale dependency

Correlation Analysis:

- Standardized measure of linear relationship (-1 to +1)

- Values close to +1 indicate strong positive relationship

- Values close to -1 indicate strong negative relationship

- Values near 0 indicate weak linear relationship

- Scale-independent, making it easier to interpret
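The difference between scale-dependent covariance and scale-independent correlation can be seen by changing units. In this sketch the helper functions `sample_covariance` and `pearson_r` are illustrative names, and the ad spend is re-expressed in cents purely to demonstrate the effect.

```python
import statistics

def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

ad_dollars = [200, 250, 300, 400, 500]
sales_dollars = [2200, 2450, 2750, 3200, 4000]

# Same spend expressed in cents: the relationship is unchanged, only the unit differs
ad_cents = [100 * x for x in ad_dollars]

print(f"Covariance (dollars): {sample_covariance(ad_dollars, sales_dollars):,.2f}")
print(f"Covariance (cents):   {sample_covariance(ad_cents, sales_dollars):,.2f}")
print(f"Correlation (dollars): {pearson_r(ad_dollars, sales_dollars):.4f}")
print(f"Correlation (cents):   {pearson_r(ad_cents, sales_dollars):.4f}")
```

The covariance grows by a factor of 100 when the unit changes, while the correlation coefficient stays the same.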

Business Application:

- Investment Decision: A strong positive correlation is consistent with ad spend driving sales, but correlation alone does not prove causation, so confirm with a controlled test before increasing spend

- Budget Allocation: Helps determine optimal advertising budget

- ROI Analysis: Quantifies return on advertising investment

- Forecasting: Enables prediction of sales based on ad spend

- Strategic Planning: Guides marketing strategy and resource allocation
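The forecasting point can be sketched with a least-squares line fitted by hand. This is a minimal illustration on the five data points from the question; `predict_sales` and the $350 query value are hypothetical, and a line fitted to so few points should not be trusted far outside the observed range of $200 to $500.

```python
import statistics

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

mean_x = statistics.fmean(advertising_spend)
mean_y = statistics.fmean(daily_sales)

# Least-squares slope = sum of cross products / sum of squared x deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(advertising_spend, daily_sales))
sxx = sum((x - mean_x) ** 2 for x in advertising_spend)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict_sales(spend):
    """Predicted daily sales for a given ad spend under the fitted line."""
    return intercept + slope * spend

print(f"Fitted line: sales = {intercept:.2f} + {slope:.2f} * spend")
print(f"Predicted sales at $350 spend: ${predict_sales(350):,.2f}")
```

The slope is the covariance divided by the variance of ad spend, which is how covariance feeds directly into a sales forecast.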

```
import numpy as np
import matplotlib.pyplot as plt
import statistics

# Given datasets
advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

print("E-COMMERCE ADVERTISING ANALYSIS")
print("=" * 50)
print("Advertising Spend ($):", advertising_spend)
print("Daily Sales ($):", daily_sales)
print("Number of data points:", len(advertising_spend))

print("\n" + "="*50)

# Calculate basic statistics
mean_ad = statistics.mean(advertising_spend)
mean_sales = statistics.mean(daily_sales)
print(f"Average Advertising Spend: ${mean_ad:,.2f}")
print(f"Average Daily Sales: ${mean_sales:,.2f}")

print("\n" + "-"*50)

# Calculate covariance
n = len(advertising_spend)
covariance = sum((ad - mean_ad) * (sales - mean_sales)
                for ad, sales in zip(advertising_spend, daily_sales)) / (n - 1)

print(f"Covariance: {covariance:,.2f}")

# Calculate standard deviations
std_ad = statistics.stdev(advertising_spend)
std_sales = statistics.stdev(daily_sales)

print(f"Standard Deviation - Ad Spend: ${std_ad:.2f}")
print(f"Standard Deviation - Sales: ${std_sales:.2f}")

# Calculate correlation coefficient
correlation = covariance / (std_ad * std_sales)
print(f"Correlation Coefficient: {correlation:.4f}")

print("\n" + "-"*50)

# Using NumPy for verification
correlation_numpy = np.corrcoef(advertising_spend, daily_sales)[0][1]
print(f"Correlation (NumPy verification): {correlation_numpy:.4f}")

print("\n" + "="*50)

# Business Interpretation
print("BUSINESS INTERPRETATION:")
print("-" * 30)

if correlation >= 0.9:
    relationship = "Very Strong Positive"
    business_implication = "Excellent ROI - significantly increase ad spend"
elif correlation >= 0.7:
    relationship = "Strong Positive"
    business_implication = "Good ROI - consider increasing ad spend"
elif correlation >= 0.5:
    relationship = "Moderate Positive"
    business_implication = "Moderate ROI - optimize ad targeting"
elif correlation >= 0.3:
    relationship = "Weak Positive"
    business_implication = "Low ROI - review ad strategy"
else:
    relationship = "Very Weak/No Linear Relationship"
    business_implication = "Poor ROI - reconsider advertising approach"

print(f"Relationship Type: {relationship}")
print(f"Business Implication: {business_implication}")

# Calculate ROI
print(f"\nROI ANALYSIS:")
print("-" * 20)
for i, (ad, sales) in enumerate(zip(advertising_spend, daily_sales)):
    roi = ((sales - ad) / ad) * 100
    print(f"Day {i+1}: Ad Spend ${ad}, Sales ${sales:,}, ROI: {roi:.1f}%")

average_roi = sum(((sales - ad) / ad) * 100
                 for ad, sales in zip(advertising_spend, daily_sales)) / len(advertising_spend)
print(f"Average ROI: {average_roi:.1f}%")

print("\n" + "="*50)

# Detailed step-by-step correlation calculation
print("STEP-BY-STEP CORRELATION CALCULATION:")
print("-" * 50)
print("Day\tAd Spend\tSales\t(Ad-Avg)\t(Sales-Avg)\tCross Product")
print("-" * 70)

total_cross_product = 0
for i, (ad, sales) in enumerate(zip(advertising_spend, daily_sales)):
    diff_ad = ad - mean_ad
    diff_sales = sales - mean_sales
    cross_product = diff_ad * diff_sales
    total_cross_product += cross_product
    print(f"{i+1}\t${ad}\t\t${sales}\t{diff_ad}\t\t{diff_sales}\t\t{cross_product}")

print("-" * 70)
print(f"Sum of cross products: {total_cross_product}")
print(f"Covariance = {total_cross_product} / {n-1} = {covariance:.2f}")
print(f"Correlation = {covariance:.2f} / ({std_ad:.2f} × {std_sales:.2f}) = {correlation:.4f}")

print("\n" + "="*50)

# Create visualization
plt.figure(figsize=(12, 5))

# Scatter plot with trend line
plt.subplot(1, 2, 1)
plt.scatter(advertising_spend, daily_sales, color='blue', s=100, alpha=0.7)
z = np.polyfit(advertising_spend, daily_sales, 1)
p = np.poly1d(z)
plt.plot(advertising_spend, p(advertising_spend), "r--", alpha=0.8)

plt.xlabel('Advertising Spend ($)')
plt.ylabel('Daily Sales ($)')
plt.title(f'Ad Spend vs Daily Sales\n(Correlation: {correlation:.4f})')
plt.grid(True, alpha=0.3)

# Add correlation text
plt.text(0.05, 0.95, f'r = {correlation:.4f}\n{relationship}',
         transform=plt.gca().transAxes, fontsize=10,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# Bar chart showing ROI
plt.subplot(1, 2, 2)
roi_values = [((sales - ad) / ad) * 100 for ad, sales in zip(advertising_spend, daily_sales)]
days = [f'Day {i+1}' for i in range(len(advertising_spend))]
bars = plt.bar(days, roi_values, color=['green' if roi > 500 else 'orange' for roi in roi_values])

plt.xlabel('Days')
plt.ylabel('ROI (%)')
plt.title('Return on Investment by Day')
plt.xticks(rotation=45)

# Add value labels on bars
for bar, roi in zip(bars, roi_values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
             f'{roi:.0f}%', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

print("\nKEY INSIGHTS:")
print(f"• Strong correlation ({correlation:.4f}) suggests advertising is effective")
print(f"• For every $1 spent on ads, average return is ${average_roi/100 + 1:.2f}")
print(f"• {relationship.lower()} relationship justifies continued investment")
print(f"• Consider scaling advertising budget based on this strong performance")
```
**Output:**

```
E-COMMERCE ADVERTISING ANALYSIS
==================================================
Advertising Spend ($): [200, 250, 300, 400, 500]
Daily Sales ($): [2200, 2450, 2750, 3200, 4000]
Number of data points: 5

==================================================
Average Advertising Spend: $330.00
Average Daily Sales: $2,920.00

--------------------------------------------------
Covariance: 84,875.00
Standard Deviation - Ad Spend: $120.42
Standard Deviation - Sales: $709.40
Correlation Coefficient: 0.9936

--------------------------------------------------
Correlation (NumPy verification): 0.9936

==================================================
BUSINESS INTERPRETATION:
------------------------------
Relationship Type: Very Strong Positive
Business Implication: Excellent ROI - significantly increase ad spend

ROI ANALYSIS:
--------------------
Day 1: Ad Spend $200, Sales $2,200, ROI: 1000.0%
Day 2: Ad Spend $250, Sales $2,450, ROI: 880.0%
Day 3: Ad Spend $300, Sales $2,750, ROI: 816.7%
Day 4: Ad Spend $400, Sales $3,200, ROI: 700.0%
Day 5: Ad Spend $500, Sales $4,000, ROI: 700.0%
Average ROI: 819.3%

==================================================
STEP-BY-STEP CORRELATION CALCULATION:
--------------------------------------------------
Day	Ad Spend	Sales	(Ad-Avg)	(Sales-Avg)	Cross Product
----------------------------------------------------------------------
1	$200		$2200	-130		-720		93600
2	$250		$2450	-80		-470		37600
3	$300		$2750	-30		-170		5100
4	$400		$3200	70		280		19600
5	$500		$4000	170		1080		183600
----------------------------------------------------------------------
Sum of cross products: 339500
Covariance = 339500 / 4 = 84875.00
Correlation = 84875.00 / (120.42 × 709.40) = 0.9936

==================================================
```

**Question 9:** Customer Satisfaction Survey Analysis
**Answer:**

#### Summary Statistics and Visualizations for Customer Satisfaction Analysis:

##### Essential Summary Statistics:

- Mean: Average satisfaction level to understand overall performance

- Median: Middle value to assess typical customer experience

- Standard Deviation: Measure of variability in customer opinions

- Range: Spread between lowest and highest scores

- Percentiles: Understanding distribution across satisfaction levels

- Mode: Most common satisfaction rating

##### Key Visualizations:

- Histogram: Shows distribution shape and frequency of ratings

- Box Plot: Identifies outliers and quartile distribution

- Bar Chart: Displays frequency of each rating level

- Summary Statistics Table: Quick reference for key metrics

##### Business Applications:

- Product Launch Decision: High satisfaction (>7 average) supports launch

- Risk Assessment: High variability indicates inconsistent experience

- Improvement Areas: Low scores highlight areas needing attention

- Benchmarking: Compare against industry standards and previous surveys


```
import matplotlib.pyplot as plt
import numpy as np
import statistics
from collections import Counter

# Customer satisfaction survey data
survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

print("CUSTOMER SATISFACTION SURVEY ANALYSIS")
print("=" * 55)
print(f"Survey Scores: {survey_scores}")
print(f"Total Responses: {len(survey_scores)}")
print(f"Rating Scale: 1-10 (1 = Very Dissatisfied, 10 = Very Satisfied)")

print("\n" + "="*55)

# Calculate comprehensive statistics
mean_score = statistics.mean(survey_scores)
median_score = statistics.median(survey_scores)
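try:
    mode_score = statistics.mode(survey_scores)
except statistics.StatisticsError:
    # Before Python 3.8, mode() raised this error when there was no unique mode
    mode_score = "Multiple modes"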

std_dev = statistics.stdev(survey_scores)
variance = statistics.variance(survey_scores)
min_score = min(survey_scores)
max_score = max(survey_scores)
range_score = max_score - min_score

print("SUMMARY STATISTICS:")
print("-" * 30)
print(f"Mean: {mean_score:.2f}")
print(f"Median: {median_score}")
print(f"Mode: {mode_score}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Range: {range_score} (Min: {min_score}, Max: {max_score})")

# Percentiles
p25 = np.percentile(survey_scores, 25)
p75 = np.percentile(survey_scores, 75)
iqr = p75 - p25

print(f"25th Percentile (Q1): {p25}")
print(f"75th Percentile (Q3): {p75}")
print(f"Interquartile Range: {iqr}")

print("\n" + "="*55)

# Frequency analysis
counter = Counter(survey_scores)
print("FREQUENCY DISTRIBUTION:")
print("-" * 25)
for score in sorted(counter.keys()):
    frequency = counter[score]
    percentage = (frequency / len(survey_scores)) * 100
    print(f"Score {score}: {frequency} responses ({percentage:.1f}%)")

print("\n" + "="*55)

# Business interpretation
print("BUSINESS INSIGHTS:")
print("-" * 20)

# Overall satisfaction level
if mean_score >= 8:
    satisfaction_level = "Excellent"
    launch_recommendation = "Strong recommendation to launch"
    color_code = "green"
elif mean_score >= 7:
    satisfaction_level = "Good"
    launch_recommendation = "Proceed with launch, monitor closely"
    color_code = "lightgreen"
elif mean_score >= 6:
    satisfaction_level = "Moderate"
    launch_recommendation = "Consider improvements before launch"
    color_code = "yellow"
elif mean_score >= 5:
    satisfaction_level = "Below Average"
    launch_recommendation = "Address issues before launch"
    color_code = "orange"
else:
    satisfaction_level = "Poor"
    launch_recommendation = "Do not launch - major improvements needed"
    color_code = "red"

print(f"Overall Satisfaction Level: {satisfaction_level}")
print(f"Average Score: {mean_score:.2f}/10")
print(f"Launch Recommendation: {launch_recommendation}")

# Risk assessment based on standard deviation
if std_dev <= 1:
    consistency = "Very Consistent"
    risk_level = "Low"
elif std_dev <= 1.5:
    consistency = "Consistent"
    risk_level = "Low-Medium"
elif std_dev <= 2:
    consistency = "Moderate Variability"
    risk_level = "Medium"
else:
    consistency = "High Variability"
    risk_level = "High"

print(f"Response Consistency: {consistency} (SD: {std_dev:.2f})")
print(f"Risk Level: {risk_level}")

# Satisfaction categories
high_satisfaction = len([s for s in survey_scores if s >= 8])
medium_satisfaction = len([s for s in survey_scores if 6 <= s <= 7])
low_satisfaction = len([s for s in survey_scores if s <= 5])

print(f"\nSatisfaction Breakdown:")
print(f"• High Satisfaction (8-10): {high_satisfaction} customers ({high_satisfaction/len(survey_scores)*100:.1f}%)")
print(f"• Medium Satisfaction (6-7): {medium_satisfaction} customers ({medium_satisfaction/len(survey_scores)*100:.1f}%)")
print(f"• Low Satisfaction (1-5): {low_satisfaction} customers ({low_satisfaction/len(survey_scores)*100:.1f}%)")

print("\n" + "="*55)

# Create comprehensive visualization
plt.figure(figsize=(15, 10))

# 1. Histogram
plt.subplot(2, 3, 1)
plt.hist(survey_scores, bins=range(1, 12), alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(mean_score, color='red', linestyle='--', label=f'Mean: {mean_score:.2f}')
plt.axvline(median_score, color='green', linestyle='--', label=f'Median: {median_score}')
plt.xlabel('Satisfaction Score')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Satisfaction Scores')
plt.legend()
plt.grid(True, alpha=0.3)

# 2. Box Plot
plt.subplot(2, 3, 2)
box_plot = plt.boxplot(survey_scores, patch_artist=True)
box_plot['boxes'][0].set_facecolor('lightblue')
plt.ylabel('Satisfaction Score')
plt.title('Box Plot of Satisfaction Scores')
plt.grid(True, alpha=0.3)

# 3. Bar Chart of Frequencies
plt.subplot(2, 3, 3)
scores = sorted(counter.keys())
frequencies = [counter[score] for score in scores]
colors = ['red' if s <= 5 else 'yellow' if s <= 7 else 'green' for s in scores]
bars = plt.bar(scores, frequencies, color=colors, alpha=0.7, edgecolor='black')
plt.xlabel('Satisfaction Score')
plt.ylabel('Number of Customers')
plt.title('Customer Count by Satisfaction Score')
plt.xticks(scores)

# Add value labels on bars
for bar, freq in zip(bars, frequencies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
             str(freq), ha='center', fontsize=10)

# 4. Cumulative Distribution
plt.subplot(2, 3, 4)
sorted_scores = np.sort(survey_scores)
cumulative_percent = np.arange(1, len(sorted_scores) + 1) / len(sorted_scores) * 100
plt.plot(sorted_scores, cumulative_percent, marker='o', markersize=4)
plt.xlabel('Satisfaction Score')
plt.ylabel('Cumulative Percentage (%)')
plt.title('Cumulative Distribution')
plt.grid(True, alpha=0.3)

# 5. Satisfaction Categories Pie Chart
plt.subplot(2, 3, 5)
categories = ['High\n(8-10)', 'Medium\n(6-7)', 'Low\n(1-5)']
values = [high_satisfaction, medium_satisfaction, low_satisfaction]
colors_pie = ['green', 'yellow', 'red']
plt.pie(values, labels=categories, colors=colors_pie, autopct='%1.1f%%', startangle=90)
plt.title('Satisfaction Level Distribution')

# 6. Summary Statistics Table
plt.subplot(2, 3, 6)
plt.axis('off')
stats_data = [
    ['Metric', 'Value'],
    ['Mean', f'{mean_score:.2f}'],
    ['Median', f'{median_score}'],
    ['Mode', f'{mode_score}'],
    ['Std Dev', f'{std_dev:.2f}'],
    ['Range', f'{range_score}'],
    ['Min Score', f'{min_score}'],
    ['Max Score', f'{max_score}'],
    ['Sample Size', f'{len(survey_scores)}']
]

table = plt.table(cellText=stats_data[1:], colLabels=stats_data[0],
                 cellLoc='center', loc='center', bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)
plt.title('Summary Statistics', pad=20)

plt.tight_layout()
plt.show()

print("DETAILED RECOMMENDATIONS:")
print("-" * 30)

if mean_score >= 7:
    print("✅ PROCEED WITH LAUNCH:")
    print(f"• Customer satisfaction is {satisfaction_level.lower()} ({mean_score:.2f}/10)")
    print(f"• {high_satisfaction}/{len(survey_scores)} customers highly satisfied")
    print("• Monitor post-launch satisfaction to maintain quality")
else:
    print("⚠️  DELAY LAUNCH:")
    print(f"• Address satisfaction concerns (current: {mean_score:.2f}/10)")
    print(f"• Focus on {low_satisfaction} dissatisfied customers")
    print("• Conduct follow-up surveys after improvements")

print(f"\n• Response variability: {consistency.lower()} (SD: {std_dev:.2f})")
print(f"• Risk assessment: {risk_level.lower()} risk for launch")

if low_satisfaction > 0:
    print(f"• Priority: Address concerns of {low_satisfaction} customers with low scores")

print(f"• Target: Aim for mean score >8.0 (currently {mean_score:.2f})")
print(f"• Benchmark: {(high_satisfaction/len(survey_scores)*100):.1f}% customers highly satisfied")
```
**Output:**

```
CUSTOMER SATISFACTION SURVEY ANALYSIS
=======================================================
Survey Scores: [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]
Total Responses: 15
Rating Scale: 1-10 (1 = Very Dissatisfied, 10 = Very Satisfied)

=======================================================
SUMMARY STATISTICS:
------------------------------
Mean: 7.33
Median: 7
Mode: 7
Standard Deviation: 1.63
Variance: 2.67
Range: 6 (Min: 4, Max: 10)
25th Percentile (Q1): 6.5
75th Percentile (Q3): 8.5
Interquartile Range: 2.0

=======================================================
FREQUENCY DISTRIBUTION:
-------------------------
Score 4: 1 responses (6.7%)
Score 5: 1 responses (6.7%)
Score 6: 2 responses (13.3%)
Score 7: 4 responses (26.7%)
Score 8: 3 responses (20.0%)
Score 9: 3 responses (20.0%)
Score 10: 1 responses (6.7%)

=======================================================
BUSINESS INSIGHTS:
--------------------
Overall Satisfaction Level: Good
Average Score: 7.33/10
Launch Recommendation: Proceed with launch, monitor closely
Response Consistency: Moderate Variability (SD: 1.63)
Risk Level: Medium

Satisfaction Breakdown:
• High Satisfaction (8-10): 7 customers (46.7%)
• Medium Satisfaction (6-7): 6 customers (40.0%)
• Low Satisfaction (1-5): 2 customers (13.3%)

=======================================================
```

This analysis combines summary statistics with a practical business interpretation to support the product launch decision. With only 15 responses, the recommendation should be treated as provisional until a larger sample confirms it.