# Problem Statement:

**Objective**: Sun Pharma needs to test 80,000 new painkiller drugs for two key parameters:
1. **Time of Effect**: The duration it takes for the drug to completely cure the pain.
2. **Quality Assurance**: Whether the drug performs satisfactorily in curing the pain.








### Steps to Address the Problem

#### 1. Data Collection
- **Parameters to Measure**:
  - Time of Effect: This is a continuous variable measured in minutes/hours.
  - Quality Assurance: This is a binary variable (Satisfactory/Not Satisfactory).
- **Sample Size**: 80,000 drugs need to be tested.

#### 2. Exploratory Data Analysis (EDA)
- **Descriptive Statistics**:
  - Summary statistics (mean, median, mode, standard deviation, etc.) for the Time of Effect.
  - Distribution of Quality Assurance results.
- **Visualizations**:
  - Histograms/Box plots for Time of Effect.
  - Bar charts/Pie charts for Quality Assurance results.
  - Box plots of Time of Effect grouped by Quality Assurance result to identify any potential relationship between the two (Quality Assurance is binary, so a plain scatter plot is less informative).

#### 3. Data Preprocessing
- **Handling Missing Values**: Identify and handle missing or inconsistent data.
- **Data Transformation**: Normalize/standardize the Time of Effect if needed.
- **Encoding Categorical Data**: Convert the Quality Assurance results into numerical values (e.g., 0 for Not Satisfactory, 1 for Satisfactory).

#### 4. Statistical Analysis
- **Hypothesis Testing**:
  - Test whether the mean Time of Effect meets a specified standard.
  - Compare the proportions of satisfactory results across different batches if applicable.
- **Confidence Intervals**:
  - Calculate confidence intervals for the mean Time of Effect.
  - Calculate confidence intervals for the proportion of satisfactory Quality Assurance results.

#### 5. Predictive Modeling
- **Regression Analysis**:
  - Use linear regression to predict the Time of Effect based on various features.
- **Classification Models**:
  - Use logistic regression, decision trees, or other classification algorithms to predict Quality Assurance results.
  - Evaluate model performance using metrics like accuracy, precision, recall, and F1-score.

#### 6. Reporting and Insights
- **Summary Report**:
  - Summarize key findings from EDA, statistical analysis, and predictive modeling.
  - Provide actionable insights and recommendations.
- **Visualizations**:
  - Create visualizations to communicate the results effectively.


### Example Code Snippets

#### 1. Data Preprocessing

In [None]:

import pandas as pd

# Load the dataset
data = pd.read_csv('painkiller_test_data.csv')

# Handle missing values
data = data.dropna()

# Encode categorical variables
data['quality_assurance'] = data['quality_assurance'].map({'Satisfactory': 1, 'Not Satisfactory': 0})

#### 2. Exploratory Data Analysis

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics
print(data.describe())

# Histograms
sns.histplot(data['time_of_effect'], kde=True)
plt.title('Distribution of Time of Effect')
plt.show()

# Bar chart for Quality Assurance
sns.countplot(x='quality_assurance', data=data)
plt.title('Quality Assurance Results')
plt.show()



#### 3. Hypothesis Testing



In [None]:


from scipy import stats

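# Hypothesis test for mean Time of Effect
# specified_standard_time is a placeholder benchmark (200 is assumed here, matching the
# 200-second claim tested later); set it to the actual target in the column's units
specified_standard_time = 200
t_stat, p_value = stats.ttest_1samp(data['time_of_effect'], popmean=specified_standard_time)
print(f'T-statistic: {t_stat}, P-value: {p_value}')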


#### 4. Predictive Modeling

In [None]:



from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data
X = data[['time_of_effect']]
y = data['quality_assurance']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions and Evaluation
y_pred = model.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
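Step 5 of the plan also lists precision, recall, and F1-score, which the snippet above does not report. The following is a minimal sketch of how those metrics derive from the confusion counts. The label lists are made-up illustrative values, not results from the painkiller data, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = Satisfactory)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for ten drugs (illustration only)
example_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
example_pred = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]

metrics = classification_metrics(example_true, example_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With scikit-learn available, `sklearn.metrics.classification_report(y_test, y_pred)` reports the same quantities for the fitted model.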



### Conclusion

By following these steps, Sun Pharma can comprehensively test the new batch of painkiller drugs, ensuring they meet the required standards for both Time of Effect and Quality Assurance. The analysis will help in identifying any issues early and ensure that only the best products reach the market.

## Problem #1

Question 1:
The quality assurance checks on the previous batches of drugs found that — it is 4 times more likely
that a drug is able to produce a satisfactory result than not.
Given a small sample of 10 drugs, you are required to find the theoretical probability that at most, 3
drugs are not able to do a satisfactory job.
a.) Propose the type of probability distribution that would accurately portray the above scenario,
and list out the three conditions that this distribution follows.

### Solution

#### 1. Type of Probability Distribution

The scenario described can be accurately portrayed by a **Binomial Distribution**. 

The Binomial Distribution is used to model the number of successes in a fixed number of independent Bernoulli trials (i.e., trials with two possible outcomes, such as success and failure).

#### 2. Conditions of the Binomial Distribution

The Binomial Distribution follows these three conditions:

1. **Fixed Number of Trials (n)**:
   - The experiment consists of a fixed number of trials. In this case, there are 10 drugs being tested.
   
2. **Independent Trials**:
   - Each trial is independent of the others. The outcome of one drug's effectiveness does not affect the outcome of another.
   
3. **Constant Probability of Success (p)**:
   - The probability of success (producing a satisfactory result) is constant for each trial. According to the problem, it is 4 times more likely for a drug to produce a satisfactory result than not.

#### Probability Calculation

To find the theoretical probability that at most 3 drugs are not able to do a satisfactory job, we can use the Binomial Distribution formula:

\[ P(X \leq k) = \sum_{i=0}^{k} \binom{n}{i} q^i (1-q)^{n-i} \]

Where:
- \( n \) is the number of trials (10 in this case).
- \( q \) is the probability of failure (not satisfactory).
- \( 1-q = p \) is the probability of success (satisfactory).
- \( X \) is the random variable representing the number of failures.
- \( k \) is the maximum number of failures considered (3 in this case).

Given that it is 4 times more likely for a drug to be satisfactory than not, let \( q \) be the probability of failure (not satisfactory). Then, the probability of success (satisfactory) \( p \) can be expressed as:

\[ p = 4q \]

Since \( p + q = 1 \):

\[ 4q + q = 1 \]
\[ 5q = 1 \]
\[ q = \frac{1}{5} = 0.2 \]
\[ p = 1 - q = 0.8 \]

So, the probability of failure \( q \) (not satisfactory) is 0.2, and the probability of success \( p \) (satisfactory) is 0.8.

Now we can calculate the probability that at most 3 drugs are not satisfactory:

\[ P(X \leq 3) = \sum_{i=0}^{3} \binom{10}{i} (0.2)^i (0.8)^{10-i} \]



In [3]:
from scipy.stats import binom

# Number of trials
n = 10

# Probability of failure
q = 0.2

# Probability of success
p = 0.8

# Calculate cumulative probability of at most 3 failures
prob_at_most_3_failures = binom.cdf(3, n, q)

prob_at_most_3_failures



np.float64(0.8791261183999999)


Executing the above code gives the required probability: \( P(X \leq 3) \approx 0.8791 \).

In summary, the appropriate probability distribution is the **Binomial Distribution**, and the conditions it follows are:
1. Fixed number of trials.
2. Independent trials.
3. Constant probability of success.

By calculating the cumulative probability \( P(X \leq 3) \), we can determine the theoretical probability that at most 3 drugs are not able to produce a satisfactory result.
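As a cross-check on the cumulative probability, the sum can be expanded term by term using only the standard library. This is a sketch that mirrors the formula for \( P(X \leq 3) \) with \( n = 10 \) and \( q = 0.2 \); the variable names `terms` and `prob_at_most_k` are chosen here for illustration.

```python
from math import comb

n = 10   # number of drugs in the sample
q = 0.2  # probability that a drug is NOT satisfactory
k = 3    # maximum number of unsatisfactory drugs allowed

# One binomial term per possible failure count i = 0, 1, ..., k
terms = [comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k + 1)]

for i, term in enumerate(terms):
    print(f"P(X = {i}) = {term:.6f}")

prob_at_most_k = sum(terms)
print(f"P(X <= {k}) = {prob_at_most_k:.6f}")
```

The total should agree with the value returned by `binom.cdf(3, n, q)` up to floating-point rounding.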

## Problem #2

Question 2:
For the effectiveness test, a sample of 100 drugs was taken. The mean time of effect was 207
seconds, with the standard deviation coming to 65 seconds. Using this information, you are required
to estimate the range in which the population mean might lie — with a 95% confidence level.
a.) Discuss the main methodology using which you will approach this problem. State all the
properties of the required method. Limit your answer to 150 words.

### Solution

#### Methodology: Confidence Interval for Population Mean

To estimate the range in which the population mean might lie with a 95% confidence level, we will use the concept of **Confidence Interval** (CI) for the mean.

#### Properties of the Confidence Interval Method:

1. **Sample Mean (\(\bar{x}\))**:
   - The average time of effect in the sample, which is 207 seconds.
   
2. **Sample Standard Deviation (s)**:
   - The standard deviation of the sample, which is 65 seconds.

3. **Sample Size (n)**:
   - The number of observations in the sample, which is 100 drugs.

4. **Confidence Level**:
   - The desired confidence level is 95%.

5. **Standard Error of the Mean (SEM)**:
   - SEM is calculated as \( \frac{s}{\sqrt{n}} \).
   - In this case, SEM = \( \frac{65}{\sqrt{100}} = \frac{65}{10} = 6.5 \).

6. **Critical Value (Z\(_{\alpha/2}\))**:
   - For a 95% confidence level, the critical value from the standard normal distribution (Z-distribution) is approximately 1.96.

#### Confidence Interval Calculation:
The 95% confidence interval for the population mean is given by:

\[ \bar{x} \pm Z_{\alpha/2} \times \text{SEM} \]

Substituting the values:

\[ 207 \pm 1.96 \times 6.5 \]

#### Computation:

\[ 207 \pm 12.74 \]

So, the 95% confidence interval for the population mean is:

\[ (194.26, 219.74) \]

This means that we are 95% confident that the true population mean time of effect lies between 194.26 seconds and 219.74 seconds.

In summary, the method involves calculating the standard error of the mean, determining the critical value for the desired confidence level, and then using these to compute the confidence interval for the population mean.

The code below computes and prints the 95% confidence interval for the population mean from the provided sample statistics. It works as follows:


1. **Imports `scipy.stats`**:
   - This library is used for statistical calculations, such as finding the critical value.

2. **Define the Given Data**:
   - Sample mean, sample standard deviation, sample size, and confidence level are defined based on the problem statement.

3. **Calculate the Standard Error of the Mean (SEM)**:
   - SEM is computed using the formula \( \text{SEM} = \frac{\text{sample\_std\_dev}}{\sqrt{\text{sample\_size}}} \).

4. **Calculate the Critical Value (Z\(_{\alpha/2}\))**:
   - The critical value for a 95% confidence level is found using `stats.norm.ppf(0.975)`, the standard normal quantile that leaves \( \alpha/2 = 0.025 \) in the upper tail.

5. **Calculate the Margin of Error**:
   - The margin of error is calculated as the product of the critical value and the SEM.

6. **Calculate the Confidence Interval**:
   - The lower and upper bounds of the confidence interval are computed by subtracting and adding the margin of error from/to the sample mean.

7. **Print the Confidence Interval**:
   - The code prints the confidence interval for the population mean.

This code will give you the range in which the population mean might lie with 95% confidence.

In [5]:
import scipy.stats as stats

# Given data
sample_mean = 207
sample_std_dev = 65
sample_size = 100
confidence_level = 0.95

# Calculate the standard error of the mean (SEM)
sem = sample_std_dev / (sample_size ** 0.5)

# Calculate the critical value (Z_alpha/2): the standard normal quantile at
# cumulative probability 1 - alpha/2 (0.975 for a 95% confidence level)
z_alpha_half = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the margin of error
margin_of_error = z_alpha_half * sem

# Calculate the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Print the confidence interval
print(f"The 95% confidence interval for the population mean is ({lower_bound:.2f}, {upper_bound:.2f})")


The 95% confidence interval for the population mean is (194.26, 219.74)


## Problem #3

To test the claim that the newer batch of painkiller drugs has a time of effect of at most 200 seconds with two hypothesis testing methods, we can use:

1. **Z-test** for the population mean (since the sample size is large, \( n = 100 \))
2. **T-test** for the population mean (strictly the more appropriate choice, because the population standard deviation is unknown and estimated by \( s \); with \( n = 100 \) it gives nearly the same result as the Z-test)

We'll start by defining the hypotheses and then proceed with the calculations for both tests.

### Hypotheses
- **Null Hypothesis (\(H_0\))**: The mean time of effect is at most 200 seconds (\(\mu \leq 200\)).
- **Alternative Hypothesis (\(H_1\))**: The mean time of effect is greater than 200 seconds (\(\mu > 200\)).

Since we are dealing with a "greater than" test, this is a one-tailed test.

### Given Data
- Sample mean (\(\bar{x}\)) = 207 seconds
- Sample standard deviation (\(s\)) = 65 seconds
- Sample size (\(n\)) = 100
- Significance level (\(\alpha\)) = 0.05

### Z-test

1. **Calculate the Standard Error of the Mean (SEM)**:
\[ \text{SEM} = \frac{s}{\sqrt{n}} = \frac{65}{\sqrt{100}} = \frac{65}{10} = 6.5 \]

2. **Calculate the Z-score**:
\[ Z = \frac{\bar{x} - \mu_0}{\text{SEM}} = \frac{207 - 200}{6.5} = \frac{7}{6.5} \approx 1.077 \]

3. **Find the critical Z-value for a 5% significance level (one-tailed)**:
\[ Z_{critical} = 1.645 \]

4. **Decision rule**:
   - If \( Z > Z_{critical} \), reject the null hypothesis.
   - If \( Z \leq Z_{critical} \), fail to reject the null hypothesis.

5. **Compare the test statistic with the critical value**:
\[ 1.077 \leq 1.645 \]

Therefore, we fail to reject the null hypothesis using the Z-test.

### T-test

1. **Calculate the Standard Error of the Mean (SEM)** (same as Z-test):
\[ \text{SEM} = 6.5 \]

2. **Calculate the T-score**:
\[ T = \frac{\bar{x} - \mu_0}{\text{SEM}} = \frac{207 - 200}{6.5} = 1.077 \]

3. **Degrees of freedom (df)**:
\[ \text{df} = n - 1 = 100 - 1 = 99 \]

4. **Find the critical T-value for a 5% significance level (one-tailed)** using a T-distribution table or a statistical software:
\[ T_{critical} \approx 1.660 \]

5. **Decision rule**:
   - If \( T > T_{critical} \), reject the null hypothesis.
   - If \( T \leq T_{critical} \), fail to reject the null hypothesis.

6. **Compare the test statistic with the critical value**:
\[ 1.077 \leq 1.660 \]

Therefore, we fail to reject the null hypothesis using the T-test.

### Final Decision
Based on both the Z-test and the T-test, we fail to reject the null hypothesis at the 5% significance level. The sample does not provide enough evidence that the mean time of effect for the newer batch of painkiller drugs is greater than 200 seconds. Failing to reject \(H_0\) does not prove that the mean is at most 200 seconds; it only means the data are consistent with that claim. On this evidence, the newer batch is not flagged as failing the requirement of a mean time of effect of at most 200 seconds.

### Python Code for Both Tests

```python
import scipy.stats as stats
import numpy as np

# Given data
sample_mean = 207
sample_std_dev = 65
sample_size = 100
population_mean = 200
alpha = 0.05

# Calculate the Standard Error of the Mean (SEM)
sem = sample_std_dev / np.sqrt(sample_size)

# Z-test
z_score = (sample_mean - population_mean) / sem
z_critical = stats.norm.ppf(1 - alpha)

# T-test
t_score = (sample_mean - population_mean) / sem
df = sample_size - 1
t_critical = stats.t.ppf(1 - alpha, df)

# Output the results
print(f"Z-test: Z-score = {z_score:.3f}, Z-critical = {z_critical:.3f}")
if z_score > z_critical:
    print("Reject the null hypothesis using Z-test.")
else:
    print("Fail to reject the null hypothesis using Z-test.")

print(f"T-test: T-score = {t_score:.3f}, T-critical = {t_critical:.3f}")
if t_score > t_critical:
    print("Reject the null hypothesis using T-test.")
else:
    print("Fail to reject the null hypothesis using T-test.")
```

Running this code will provide the Z-score, T-score, and their respective critical values, allowing you to determine whether to reject or fail to reject the null hypothesis.


Z-test: Z-score = 1.077, Z-critical = 1.645
Fail to reject the null hypothesis using Z-test.
T-test: T-score = 1.077, T-critical = 1.660
Fail to reject the null hypothesis using T-test.
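The critical-value comparison above can also be expressed as a p-value. The sketch below recomputes the one-tailed Z-test with the standard library's `statistics.NormalDist` instead of SciPy; the names `hypothesized_mean`, `p_value_one_tailed`, and `reject_null` are introduced here for illustration.

```python
from statistics import NormalDist

# Sample statistics from the problem statement
sample_mean = 207
sample_std_dev = 65
sample_size = 100
hypothesized_mean = 200  # H0: mu <= 200 seconds
alpha = 0.05

# Standard error and test statistic (same as the Z-test above)
sem = sample_std_dev / sample_size ** 0.5
z_score = (sample_mean - hypothesized_mean) / sem

# Upper-tail p-value, because H1 is mu > 200
p_value_one_tailed = 1 - NormalDist().cdf(z_score)
reject_null = p_value_one_tailed < alpha

print(f"Z-score = {z_score:.3f}, one-tailed p-value = {p_value_one_tailed:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The p-value comes out at roughly 0.14, well above 0.05, so it leads to the same decision as the critical-value method.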


## Problem #4

Question 4:
Now, once the batch has passed all the quality tests and is ready to be launched in the market, the marketing team needs to plan an effective online ad campaign to attract new customers. Two taglines were proposed for the campaign, and the team is currently divided on which option to use.
Explain why and how A/B testing can be used to decide which option is more effective. Give a stepwise procedure for the test that needs to be conducted.

### A/B Testing for Tagline Effectiveness

**A/B testing** (or split testing) is a method used to compare two versions of a webpage, ad, or any other marketing material to determine which one performs better. In this case, we will use A/B testing to compare the effectiveness of two taglines for an online ad campaign.

### Why Use A/B Testing?

- **Data-Driven Decision Making:** It allows decisions to be made based on data rather than assumptions or opinions.
- **Objective Comparison:** Provides a scientific way to compare the performance of two different taglines.
- **User-Centric:** Measures actual user behavior and reactions to the taglines.

### Step-by-Step Procedure for Conducting A/B Testing

#### Step 1: Define the Goal
The primary goal is to determine which tagline leads to more conversions (e.g., click-throughs, sign-ups, or purchases).

#### Step 2: Formulate Hypotheses
- **Null Hypothesis (\(H_0\))**: There is no difference in effectiveness between Tagline A and Tagline B.
- **Alternative Hypothesis (\(H_1\))**: There is a difference in effectiveness between Tagline A and Tagline B.

#### Step 3: Identify Key Metrics
Decide on the metrics to measure the effectiveness. Common metrics include:
- Click-through rate (CTR)
- Conversion rate
- Engagement rate

#### Step 4: Select the Sample Size
Determine the sample size needed to achieve statistical significance. Tools like online sample size calculators can be used to calculate the required number of users.
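For a rough idea of what such a calculator does, the sketch below applies the usual normal-approximation formula for comparing two proportions. The baseline and target conversion rates, the 5% significance level, and the 80% power are all assumptions chosen for illustration, and `sample_size_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# Assumed rates: 7.5% baseline conversion, hoping to detect a lift to 8.5%
baseline_rate = 0.075
target_rate = 0.085

n_per_group = sample_size_per_group(baseline_rate, target_rate)
print(f"Users needed per group: {n_per_group}")
```

Smaller expected differences between the taglines require substantially more users, since the required size grows with the inverse square of the difference.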

#### Step 5: Randomly Assign Users
Randomly divide the users into two groups:
- **Group A**: Sees Tagline A
- **Group B**: Sees Tagline B
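One way to implement the split, sketched here as an assumption rather than a prescribed method, is to hash a stable user identifier so that the same user always sees the same tagline. The `user_id` format and the `experiment` label are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment="tagline_test"):
    """Deterministically map a user to Group A or Group B with roughly equal odds."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Even hash value -> Group A, odd hash value -> Group B
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Hypothetical user identifiers, used only to check the balance of the split
user_ids = [f"user_{i}" for i in range(10_000)]
groups = [assign_group(uid) for uid in user_ids]

share_a = groups.count("A") / len(groups)
print(f"Share of users in Group A: {share_a:.3f}")
```

Hashing avoids storing the assignment separately; a seeded random draw per user works as well if the result is persisted.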

#### Step 6: Run the Test
- **Duration**: Ensure the test runs for a sufficient period to collect enough data (e.g., a few days to a couple of weeks).
- **Consistent Conditions**: Ensure both groups experience the same conditions except for the tagline.

#### Step 7: Collect Data
Monitor and record the performance of both taglines based on the defined metrics.

#### Step 8: Analyze Results
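Use statistical analysis to compare the performance of the two taglines. A common approach for comparing conversion rates is a two-proportion z-test, which for a two-sided comparison is equivalent to a chi-square test on the 2×2 table of conversions and non-conversions.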

#### Step 9: Make a Decision
Based on the statistical analysis:
- **If the null hypothesis is rejected**: Choose the tagline with the higher conversion rate.
- **If the null hypothesis is not rejected**: Either tagline can be used as there is no significant difference in performance.

#### Step 10: Implement the Winning Tagline
Deploy the more effective tagline across the entire campaign to maximize conversions.

### Conclusion

A/B testing is a robust method for making data-driven decisions about which tagline to use in the online ad campaign. By following the outlined steps, the marketing team can ensure that they choose the tagline that is most likely to attract new customers and achieve the desired business outcomes.

In [7]:
import scipy.stats as stats

# Example data
conversions_A = 150
visitors_A = 2000
conversions_B = 170
visitors_B = 2000

# Conversion rates
conversion_rate_A = conversions_A / visitors_A
conversion_rate_B = conversions_B / visitors_B

# Pooled conversion rate
pooled_conversion_rate = (conversions_A + conversions_B) / (visitors_A + visitors_B)

# Standard error
std_error = ((pooled_conversion_rate * (1 - pooled_conversion_rate)) * (1 / visitors_A + 1 / visitors_B)) ** 0.5

# Z-score
z_score = (conversion_rate_B - conversion_rate_A) / std_error

# P-value (two-sided, matching the "there is a difference" alternative hypothesis)
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

print(f"Z-score: {z_score:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference between Tagline A and Tagline B.")
else:
    print("Fail to reject the null hypothesis: No significant difference between Tagline A and Tagline B.")


Z-score: 1.166
P-value: 0.244
Fail to reject the null hypothesis: No significant difference between Tagline A and Tagline B.
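As a complement to the significance test, a confidence interval for the difference in conversion rates shows how large the effect could plausibly be. The sketch below reuses the example counts from the cell above and the unpooled standard error; it relies only on the standard library, and the names `diff`, `se_diff`, `ci_low`, and `ci_high` are introduced here.

```python
from statistics import NormalDist

# Same example data as above
conversions_A, visitors_A = 150, 2000
conversions_B, visitors_B = 170, 2000

rate_A = conversions_A / visitors_A
rate_B = conversions_B / visitors_B
diff = rate_B - rate_A

# Unpooled standard error of the difference between two proportions
se_diff = (rate_A * (1 - rate_A) / visitors_A
           + rate_B * (1 - rate_B) / visitors_B) ** 0.5

# 95% confidence interval for (rate_B - rate_A)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff

print(f"Difference in conversion rate (B - A): {diff:.4f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

For these example counts the interval spans zero, which is consistent with failing to reject the null hypothesis.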
