# DS_C8_S6

# Refining Computer Sales Strategy through Statistical Analysis Part-2

In Project - Part 1, we resolved several business questions: enhancing product strategy and sales performance through specification-based analysis, assessing price disparities between premium and non-premium computers, and understanding computer price trends. This part addresses the next set of business questions: analyzing the advertising budget for premium computers, evaluating price differences between computers with certain specifications, and analyzing the premium computer pricing strategy. For this sprint, continue using the same cleaned data obtained in Project - Part 1.

In [1]:
import pandas as pd
# Step 1: Load the data (if not already in DataFrame)
data = pd.read_csv('DS1_C8_S5_Computers_Data_Project.csv')

# Step 2: Impute missing values (if any)
# For categorical columns, we impute with the mode (most frequent value)
for column in data.select_dtypes(include=['object']).columns:
    # fillna returns a new Series, so assign the result back to the column
    data[column] = data[column].fillna(data[column].mode()[0])

# Step 3: Drop duplicates
data.drop_duplicates(inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6259 entries, 0 to 6258
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   index     6259 non-null   int64 
 1   price     6259 non-null   int64 
 2   speed     6259 non-null   int64 
 3   hd        6259 non-null   int64 
 4   ram       6259 non-null   int64 
 5   screen    6259 non-null   int64 
 6   cd        6259 non-null   object
 7   multi     6259 non-null   object
 8   premium   6259 non-null   object
 9   ads_2022  6259 non-null   int64 
 10  ads_2023  6259 non-null   int64 
 11  trend     6259 non-null   int64 
dtypes: int64(9), object(3)
memory usage: 586.9+ KB


# Task 1
The advertising budget spent on promoting premium computers increased in 2023 compared with 2022. The mean advertising budget was 221.3 billion dollars in 2022 and 222.2 billion dollars in 2023. A promoter in this company believes that the average advertising budget in 2023 is higher than in 2022. Priya, a data analyst, randomly selected 40 premium computers to check this notion.
Use a 5% level of significance to test this hypothesis. Assume the population is normally distributed with a standard deviation of 74.83.
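
Written out formally, the task above is a right-tailed one-sample z-test. The symbols $\mu$ (the 2023 population mean), $\mu_0$, $\bar{x}$, $\sigma$, and $n$ are notation introduced here for illustration; the numbers come from the task statement.

$$
H_0: \mu = 221.3 \qquad H_1: \mu > 221.3
$$

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{222.2 - 221.3}{74.83 / \sqrt{40}} \approx \frac{0.9}{11.83} \approx 0.08
$$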

In [1]:
from scipy.stats import norm
import math
sample_mean = 222.2
population_mean = 221.3
pop_std_dev = 74.83
sample_size = 40
alpha = 0.05

In [29]:
# Z-test calculation
z = (sample_mean - population_mean) / (pop_std_dev / math.sqrt(sample_size))
z_critical = norm.ppf(1 - alpha)  # Right-tailed critical value at the 5% significance level

print(f"Z-statistic: {z:.2f}")
print(f"Critical Z-value: {z_critical:.2f}")

Z-statistic: 0.08
Critical Z-value: 1.64


In [30]:
if z >= z_critical:
    print("Reject the null hypothesis: The average advertising budget in 2023 is significantly higher than 2022.")
else:
    print("Fail to reject the null hypothesis: There is no significant evidence that the 2023 average advertising budget is higher.")

Fail to reject the null hypothesis: There is no significant evidence that the 2023 average advertising budget is higher.


- Test statistic (z): The z-statistic of 0.08 means the sample mean lies only 0.08 standard errors above the null-hypothesis mean.
- Decision rule: For this right-tailed test, the null hypothesis is rejected when z is at least the critical value of 1.64, which is equivalent to a one-tailed p-value below 0.05.
- Conclusion: Because 0.08 is less than 1.64, we fail to reject the null hypothesis. There is no significant evidence that the 2023 average advertising budget is higher.
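
The same decision can be reached through a one-tailed p-value, which the cells above do not compute. The following is a minimal sketch that assumes the figures from the task statement and uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` (`norm.sf(z)` should give the same value); the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Figures taken from the task statement
sample_mean = 222.2
population_mean = 221.3
pop_std_dev = 74.83
sample_size = 40
alpha = 0.05

# z-statistic: distance of the sample mean from the null mean, in standard errors
standard_error = pop_std_dev / math.sqrt(sample_size)
z = (sample_mean - population_mean) / standard_error

# Right-tailed p-value: P(Z >= z) under the standard normal distribution
p_value = 1 - NormalDist().cdf(z)

print(f"Z-statistic: {z:.2f}")
print(f"One-tailed p-value: {p_value:.2f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With these inputs the p-value comes out near 0.47, far above 0.05, which agrees with the critical-value comparison.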

# Task 2
Is there a statistically significant difference between the average prices of computers with and without CD players? Use a 5% significance level for the test.

In [35]:
from scipy.stats import ttest_ind

# Separate the data into two groups
with_cd = data[data['cd'] == 'yes']['price']
without_cd = data[data['cd'] == 'no']['price']

# Perform an independent t-test
# Student's t-test: equal_var=True is the default (pass equal_var=False for Welch's t-test)
t_stat, p_value = ttest_ind(with_cd, without_cd)

alpha = 0.05
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.6f}")

if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in the average prices of computers with and without CD players.")
else:
    print("Fail to reject the null hypothesis: No significant difference in the average prices of computers with and without CD players.")

T-statistic: 15.92
P-value: 0.000000
Reject the null hypothesis: There is a significant difference in the average prices of computers with and without CD players.
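
`ttest_ind` pools the two variances by default (`equal_var=True`); Welch's version is requested with `equal_var=False`. The sketch below illustrates the difference on synthetic data, since the project file is not reproduced here. The group sizes, means, and spreads are invented purely for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic, hypothetical price samples (NOT the project data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=2300, scale=600, size=300)
group_b = rng.normal(loc=2200, scale=300, size=80)

# Student's t-test: assumes equal population variances (scipy's default)
student = ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: does not assume equal population variances
welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

When group sizes and spreads differ noticeably, the two statistics diverge, so it is worth checking which assumption fits the data before choosing between them.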


# Task 3
#### a) Identify Price Discrepancy for Premium Computers
Determine whether the mean price of premium computers differs significantly from $2200.
To examine this, randomly select a sample of 25 premium computers from the data. Assume that prices are normally distributed in the population. Use a 5% significance level to test this hypothesis.


In [37]:
import numpy as np
from scipy.stats import t

# Step 1: Filter premium computers and select 25 random samples
premium_computers = data[data['premium'] == 'yes']['price']
sample_premium = premium_computers.sample(n=25, random_state=42)

# Step 2: Calculate sample mean and standard deviation
sample_mean = np.mean(sample_premium)
sample_std = np.std(sample_premium, ddof=1)  # Sample standard deviation
sample_size = len(sample_premium)

In [40]:
# Step 3: Perform one-sample t-test
population_mean = 2200
t_stat = (sample_mean - population_mean) / (sample_std / np.sqrt(sample_size))

# Step 4: Calculate the two-tailed p-value
df = sample_size - 1  # Degrees of freedom
p_value = 2 * t.sf(np.abs(t_stat), df)

In [42]:
# Step 5: Print results
alpha = 0.05
print(f"Sample Mean: {sample_mean:.2f}")
print(f"Sample Standard Deviation: {sample_std:.2f}")
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis: The mean price of premium computers differs significantly from $2200.")
else:
    print("Fail to reject the null hypothesis: The mean price of premium computers does not differ significantly from $2200.")

Sample Mean: 2295.12
Sample Standard Deviation: 882.69
T-statistic: 0.54
P-value: 0.5950
Fail to reject the null hypothesis: The mean price of premium computers does not differ significantly from $2200.
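
The manual one-sample calculation above can be cross-checked against `scipy.stats.ttest_1samp`, which is two-sided by default. This is a sketch on a synthetic sample, not the project data; the sample values and the name `hypothesized_mean` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic, hypothetical sample of 25 prices (NOT the project data)
rng = np.random.default_rng(42)
sample = rng.normal(loc=2300, scale=900, size=25)
hypothesized_mean = 2200

# Manual calculation, mirroring the steps above
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)
t_manual = (sample.mean() - hypothesized_mean) / standard_error
p_manual = 2 * t.sf(abs(t_manual), df=sample.size - 1)

# Library calculation (two-sided by default)
result = ttest_1samp(sample, popmean=hypothesized_mean)

print(f"Manual:  t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"Library: t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

Both lines should print the same statistic and p-value, which confirms that the hand-written formula matches the library routine.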


#### b) Analyze price disparity between premium and non-premium computers.
Is there a significant difference in the mean prices of premium and non-premium computers?
Assume that the prices are normally distributed and that the population variances are approximately equal. Use a 5% significance level to test this hypothesis.


In [43]:
import numpy as np
from scipy.stats import t

# Step 1: Separate the data into two groups: Premium and Non-Premium
premium_computers = data[data['premium'] == 'yes']['price']
non_premium_computers = data[data['premium'] == 'no']['price']

# Step 2: Calculate sample means and standard deviations for both groups
mean_premium = np.mean(premium_computers)
mean_non_premium = np.mean(non_premium_computers)
std_premium = np.std(premium_computers, ddof=1)
std_non_premium = np.std(non_premium_computers, ddof=1)
n_premium = len(premium_computers)
n_non_premium = len(non_premium_computers)

# Step 3: Calculate the pooled standard deviation
pooled_std = np.sqrt(((n_premium - 1) * std_premium**2 + (n_non_premium - 1) * std_non_premium**2) /
                     (n_premium + n_non_premium - 2))

# Step 4: Calculate the t-statistic
t_stat = (mean_premium - mean_non_premium) / (pooled_std * np.sqrt(1 / n_premium + 1 / n_non_premium))

# Step 5: Calculate degrees of freedom
df = n_premium + n_non_premium - 2

# Step 6: Calculate the p-value (two-tailed test)
p_value = 2 * t.sf(np.abs(t_stat), df)

# Step 7: Print results
alpha = 0.05
print(f"Premium Mean: {mean_premium:.2f}, Non-Premium Mean: {mean_non_premium:.2f}")
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Decision
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in the mean prices of premium and non-premium computers.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the mean prices of premium and non-premium computers.")

Premium Mean: 2204.15, Non-Premium Mean: 2361.93
T-statistic: -6.40
P-value: 0.0000
Reject the null hypothesis: There is a significant difference in the mean prices of premium and non-premium computers.
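
The pooled-variance steps above can be wrapped in a small function and compared with `ttest_ind`, whose default (`equal_var=True`) uses the same pooling. The helper name `pooled_t_test` and the synthetic samples below are hypothetical and exist only for illustration; they are not part of the project data.

```python
import numpy as np
from scipy.stats import t, ttest_ind

def pooled_t_test(a, b):
    """Two-sample, two-tailed t-test with a pooled variance (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df
    t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    p_value = 2 * t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Synthetic, hypothetical price samples (NOT the project data)
rng = np.random.default_rng(7)
premium = rng.normal(2200, 550, size=500)
non_premium = rng.normal(2350, 600, size=60)

t_stat, df, p_value = pooled_t_test(premium, non_premium)
reference = ttest_ind(premium, non_premium)  # equal_var=True by default

print(f"Manual:  t = {t_stat:.4f}, df = {df}, p = {p_value:.4f}")
print(f"Library: t = {reference.statistic:.4f}, p = {reference.pvalue:.4f}")
```

A negative t-statistic here simply means the first group's mean is below the second's, as in the output above where the premium mean is lower than the non-premium mean.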
