
# Formulating and Testing a Hypothesis With a Real Dataset

In this exercise, you will independently choose a dataset, explore its variables, and formulate a research question that can be answered using hypothesis testing. You'll then carry out the entire statistical process—from question to conclusion—including verifying assumptions and interpreting results:

a) Specify your research question and translate it into a hypothesis.  
b) Specify and justify the choice of test.  
c) Compute the suitable test score and draw conclusions.  
d) Interpret your results and discuss the limitations of your conclusions.


# About this File
Food servers' tips in restaurants may be influenced by many
factors, including the nature of the restaurant, size of the party, and table
locations in the restaurant. Restaurant managers need to know which factors
matter when they assign tables to food servers. For the sake of staff morale,
they usually want to avoid either the substance or the appearance of unfair
treatment of the servers, for whom tips (at least in restaurants in the United
States) are a major component of pay.

In one restaurant, a food server recorded the following data on all customers
they served during an interval of two and a half months in early 1990.
The restaurant, located in a suburban shopping mall, was part of a national
chain and served a varied menu. In observance of local law, the restaurant
offered seating in a non-smoking section to patrons who requested it. Each
record includes a day and time, and taken together, they show the server's
work schedule.

# **PART A:**

---



---



### **1) Research Question**  
**Do smokers tip a different percentage of their bill compared to non-smokers?**

### **2) Hypothesis:**

**Null hypothesis:** The mean tip percentage is the same for smokers and non-smokers.  
$H_0: \mu_{\text{smokers}} = \mu_{\text{non-smokers}}$  
**Alternative hypothesis:** The mean tip percentage is different between smokers and non-smokers.  
$H_A: \mu_{\text{smokers}} \neq \mu_{\text{non-smokers}}$  

# **PART B:**

---



---


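### **Choice of Test:**  
**Data type:** Numeric outcome (tip percentage), two independent groups (smoker vs. non-smoker).

**Appropriate test:** Two-sided, two-sample (independent) t-test in its pooled-variance form. It compares the means of a numeric outcome between two independent groups and assumes roughly normal data within each group and similar variances in the two groups.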
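
The assumptions behind the pooled-variance t-test can be checked before running it. The sketch below is one possible way to do so and is not part of the original analysis. It runs on synthetic stand-in samples, not the restaurant data, and the names `check_assumptions`, `group_a`, and `group_b` are hypothetical. With the real data, the two tip-percentage series would be passed in instead.

```python
import numpy as np
from scipy import stats


def check_assumptions(a, b, alpha=0.05):
    """Illustrative helper: Shapiro-Wilk normality and Levene equal-variance checks."""
    _, shapiro_a_p = stats.shapiro(a)                   # normality check, first group
    _, shapiro_b_p = stats.shapiro(b)                   # normality check, second group
    _, levene_p = stats.levene(a, b, center="median")   # equal-variance check
    return {
        "shapiro_a_p": float(shapiro_a_p),
        "shapiro_b_p": float(shapiro_b_p),
        "levene_p": float(levene_p),
        # A large Levene p-value gives no evidence against equal variances
        "equal_var_plausible": bool(levene_p > alpha),
    }


# Synthetic stand-in samples (made-up mean and spread), NOT the tips data
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.16, scale=0.06, size=93)
group_b = rng.normal(loc=0.16, scale=0.06, size=151)

report = check_assumptions(group_a, group_b)
for name, value in report.items():
    print(name, value)
```

If the Levene check suggested clearly unequal variances, Welch's version of the test (`stats.ttest_ind(..., equal_var=False)`) would be a common alternative to the pooled form.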

# **PART C:**

---



---


### **Compute the suitable test score and draw conclusions.**
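
For reference, the quantities computed in the cells below can be written as follows, where $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and sample size of group $i$ (1 = smokers, 2 = non-smokers), and $T_{\nu}$ denotes a t-distributed variable with $\nu = n_1 + n_2 - 2$ degrees of freedom:

$$
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad
t_{\text{obs}} = \frac{\bar{x}_1 - \bar{x}_2}{SE}
$$

$$
p = 2\,P\left(T_{\nu} \ge |t_{\text{obs}}|\right),
\qquad
\text{CI}_{1-\alpha}:\ (\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2,\ \nu}\; SE
$$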

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [4]:
df = pd.read_csv('tips.csv')   # Load Kaggle Dataset
df.head()                      # check whether the Data set is loaded

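| Unnamed: 0 | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |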


In [12]:
df['tip_percentage'] = df['tip']/df['total_bill']        # tip as a fraction of the bill (0.15 means 15%)
smokers = df[df['smoker']=='Yes']['tip_percentage']       # tip fractions for smokers
non_smokers = df[df['smoker']=='No']['tip_percentage']    # tip fractions for non-smokers


In [17]:
# The two samples are ready; compute the pooled two-sample t-test step by step
n1 = len(smokers)                                            # smokers sample size
n2 = len(non_smokers)                                        # non-smokers sample size

x1 = smokers.mean()                                          # sample mean, smokers
x2 = non_smokers.mean()                                      # sample mean, non-smokers

s1_sq = smokers.var(ddof=1)                                  # sample variance, smokers
s2_sq = non_smokers.var(ddof=1)                              # sample variance, non-smokers

sp_sq = ((n1-1)*s1_sq + (n2-1)*s2_sq) / (n1 + n2 - 2)        # pooled variance
sp = np.sqrt(sp_sq)                                          # pooled standard deviation

se = sp * np.sqrt(1/n1 + 1/n2)                               # standard error of the difference in means
t_obs = (x1 - x2) / se                                       # t-statistic

dof = n1 + n2 - 2                                            # degrees of freedom (named dof so the DataFrame df is not overwritten)

abs_t = np.abs(t_obs)                                        # absolute value of t-statistic

p_one_sided = 1 - stats.t.cdf(abs_t, dof)                    # upper-tail p-value
p_two_sided = 2.0 * p_one_sided                              # two-sided p-value



In [18]:
alpha = 0.05                                # significance level (95% confidence interval)
tcrit = stats.t.ppf(1 - alpha/2, dof)       # critical t-value
margin = tcrit * se                         # margin of error
diff_means = x1 - x2                        # observed mean difference
ci_lower = diff_means - margin
ci_upper = diff_means + margin

In [19]:
print("Sample sizes:", n1, n2)
print("Means:", x1, x2)
print("Difference in means (smokers - non-smokers):", diff_means)
print("t-statistic:", t_obs)
print("Degrees of freedom:", dof)
print("Two-sided p-value:", p_two_sided)
print("95% CI for (mu1 - mu2):", (ci_lower, ci_upper))

Sample sizes: 93 151
Means: 0.16319604463687792 0.15932846217921523
Difference in means (smokers - non-smokers): 0.003867582457662694
t-statistic: 0.4796693002669869
Degrees of freedom: 242
Two-sided p-value: 0.631895777687852
95% CI for (mu1 - mu2): (np.float64(-0.012015073475032632), np.float64(0.01975023839035802))
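
One way to sanity-check a hand computation like the one above is to compare it with `scipy.stats.ttest_ind` called with `equal_var=True`, which performs the same pooled-variance test. The sketch below does this on synthetic stand-in samples, not the restaurant data. The helper `pooled_t_test` and the arrays `sample_a` and `sample_b` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from scipy import stats


def pooled_t_test(a, b):
    """Illustrative helper: two-sided pooled-variance t-test computed step by step."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    sp_sq = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    se = np.sqrt(sp_sq * (1 / n1 + 1 / n2))
    t_obs = (a.mean() - b.mean()) / se
    p_two_sided = 2 * stats.t.sf(abs(t_obs), dof)
    return t_obs, p_two_sided


# Synthetic stand-in samples (made-up mean and spread), NOT the tips data
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=0.16, scale=0.06, size=30)
sample_b = rng.normal(loc=0.16, scale=0.06, size=40)

t_manual, p_manual = pooled_t_test(sample_a, sample_b)
t_scipy, p_scipy = stats.ttest_ind(sample_a, sample_b, equal_var=True)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two pairs of numbers should agree up to floating-point rounding. Applied to the smokers and non-smokers series, the same call would be expected to match the t-statistic and p-value printed above.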


# **PART D:**

---



---




**Interpretation**  
The mean tip percentage was about 16.3% for smokers and 15.9% for non-smokers, a difference of roughly 0.4 percentage points. The t-test gave t = 0.48 with a two-sided p-value of 0.63. A p-value this large means that a difference of this size could easily arise by chance if there were no true difference. The 95% confidence interval for the difference runs from about -1.2 to +2.0 percentage points, so the true difference could be slightly negative, slightly positive, or zero. The data therefore provide no strong evidence that smoking status changes tipping behavior.
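
How small the difference is relative to the spread of the data can be expressed as a standardized effect size. The sketch below is an addition, not part of the original analysis. It recovers the pooled standard deviation from the values printed in Part C and computes Cohen's d as the mean difference divided by that pooled standard deviation.

```python
import math

# Values copied from the printed output in Part C
n1, n2 = 93, 151
diff_means = 0.003867582457662694
t_obs = 0.4796693002669869

se = diff_means / t_obs                    # because t = diff_means / se
sp = se / math.sqrt(1 / n1 + 1 / n2)       # because se = sp * sqrt(1/n1 + 1/n2)
cohens_d = diff_means / sp                 # standardized mean difference

print(f"pooled SD: {sp:.4f}")
print(f"Cohen's d: {cohens_d:.3f}")
```

The resulting d of about 0.06 is well below the 0.2 that is conventionally labeled a small effect, which is consistent with the non-significant test result.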

**Conclusion**  
Based on this test, we fail to reject the null hypothesis at the 5% significance level. The data do not show a statistically significant difference in tip percentage between smokers and non-smokers. Although smokers had a slightly higher average, the gap is consistent with random variation. In simple terms, this sample gives no indication that the two groups tip differently in this restaurant.

**Limitations**  
The data have some limits to keep in mind. They were collected in one restaurant by a single server, so the result cannot be generalized to all restaurants or countries. The dataset is also observational rather than experimental, so other factors such as the day of the week, the size of the party, or the quality of service may affect tips, and this analysis does not control for them.

AI usage:  
1) Some LaTeX equations, for example " $H_A$ : $\mu_{\text{smokers}} \neq \mu_{\text{non-smokers}}$ ".

2) The critical-value and margin-of-error lines:  
tcrit = stats.t.ppf(1 - alpha/2, df) # critical t-value  
margin = tcrit * se                         # margin of error

3) Some of the result analysis.