-------------------------------------------------------
# **Hypothesis Testing: Ice Cream Daily Revenue**
-------------------------------------------------------
---------------------
## **Context**
---------------------

An ice cream vendor claims that he earns about 522 dollars a day, with a standard deviation of 175 dollars.

To test this claim, the vendor provided some data that he collected. The data was obtained from: "https://www.kaggle.com/datasets/vinicius150987/ice-cream-revenue"

--------------------------
## **Key Question**
--------------------------

Is there enough statistical evidence to conclude that the mean daily revenue  is different from 522 dollars? 

**Note:** We assume that the samples are randomly selected, independent, and come from a normally distributed population.

## **Importing the necessary libraries**

In [1]:
# Import the important packages
import pandas as pd  # Library used for data manipulation and analysis

import numpy as np  # Library used for working with arrays

import matplotlib.pyplot as plt  # Library for visualization

import seaborn as sns  # Library for visualization

%matplotlib inline

import scipy.stats as stats  # This library contains a large number of probability distributions as well as a growing library of statistical functions

## **Loading the  Data**

In [10]:
mydata = pd.read_csv('IceCreamData.csv')
mydata.drop('Temperature', axis = 1, inplace = True)
mydata.head()

|   | Revenue |
|---|---|
| 0 | 534.799028 |
| 1 | 625.190122 |
| 2 | 660.632289 |
| 3 | 487.70696 |
| 4 | 316.240194 |


In [11]:
mydata.shape

(500, 1)

In [12]:
mydata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Revenue  500 non-null    float64
dtypes: float64(1)
memory usage: 4.0 KB


In [16]:
mydata.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Revenue | 500.0 | 521.570777 | 175.404751 | 10.0 | 405.558681 | 529.368565 | 642.257922 | 1000.0 |


## **Steps of Hypothesis Testing**

### **Step 1: Define the null and the alternate hypotheses**

**The null hypothesis states that the mean daily revenue, $\mu$, is equal to 522.**

**The alternative hypothesis states that the mean daily revenue, $\mu$, is not equal to 522.**

* $H_0$: $\mu$ = 522
* $H_a$: $\mu$ $\neq$ 522

### **Step 2: Decide the significance level**

Here, $\alpha$ = 0.05.

In [17]:
print("The sample size for this problem is", len(mydata))

The sample size for this problem is 500


### **Step 3: Identify the test statistic**

The population is assumed to be normally distributed, and the population standard deviation is assumed to be known and equal to 175 (the value used in the calculation below). So, we can use the Z-test statistic.

### **Step 4: Calculate the p-value using z-statistic**

In [20]:
sample_mean = mydata["Revenue"].mean()

In [32]:
# Calculating the z-stat

n = 500
mu = 522
sigma = 175

test_stat =  (sample_mean - mu) / (sigma / np.sqrt(n)) 

In [33]:
test_stat

-0.05484414258538087
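
As a check on the value above, the z-statistic can be worked by hand from the sample mean reported by `describe()` (521.570777), the sample size $n = 500$, and the value `sigma = 175` used in the code. Here $\mu_0$ denotes the hypothesized mean of 522 and $\bar{x}$ the sample mean; both symbols are introduced only for this worked example.

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{521.570777 - 522}{175 / \sqrt{500}} \approx \frac{-0.429223}{7.826238} \approx -0.0548
$$

The sample mean lies only about 0.05 standard errors below the hypothesized mean, which is why the statistic is so close to zero.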

In [34]:
from scipy.stats import norm

# The test is two-tailed, so we take the area in both tails beyond |test_stat|.
# norm.sf is the survival function (1 - cdf); using abs() keeps the p-value in [0, 1]
p_value_ztest = 2 * norm.sf(abs(test_stat))

In [35]:
print('The p-value is: {:.4f}'.format(p_value_ztest))

The p-value is: 0.9563


### **Step 5: Decide to reject or fail to reject the null hypothesis based on the z-statistic**

In [40]:
alpha_value = 0.05 # Level of significance

print('Level of significance: %.2f' %alpha_value)

if p_value_ztest < alpha_value: 
    print('We have evidence to reject the null hypothesis as the p-value ({:.2f}) is less than the level of significance'.format(p_value_ztest))
else:
    print('We do not have sufficient evidence to reject the null hypothesis as the p-value ({:.2f}) is greater than the level of significance'.format(p_value_ztest)) 



Level of significance: 0.05
We do not have sufficient evidence to reject the null hypothesis as the p-value (0.96) is greater than the level of significance
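
A two-tailed p-value is the probability of seeing a statistic at least as extreme as the observed one in either direction, so it must always fall between 0 and 1. The following is a minimal sketch of one way to compute it for any z-statistic, positive or negative. The helper name `two_sided_z_pvalue` is hypothetical and not part of the original notebook; `norm.sf` is SciPy's survival function (1 - cdf).

```python
from scipy.stats import norm


def two_sided_z_pvalue(z):
    """Two-sided p-value for a z-statistic: P(|Z| >= |z|) under H0."""
    # norm.sf(x) is the upper-tail area 1 - cdf(x).
    # Taking abs(z) gives the same result for +z and -z and keeps p in [0, 1].
    return 2 * norm.sf(abs(z))


# z-statistic computed earlier in this notebook
p = two_sided_z_pvalue(-0.05484414258538087)
print(round(p, 4))  # 0.9563
```

Working with `abs(z)` avoids having to decide which tail the statistic falls in before doubling the one-tailed area.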


The z-statistic was computed on the premise that the population standard deviation is known. However, this assumption is unlikely to hold in practice. The alternative is the **t-test**, which is similar to the z-test but treats the population standard deviation as unknown and uses the sample standard deviation to compute the test statistic.

We will use **scipy.stats.ttest_1samp** which calculates the t-test for the mean of one sample given the sample observations. This function returns the t statistic and the p-value for a two-tailed t-test.
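
To make clear what the function computes, here is a minimal sketch that calculates the one-sample t-statistic by hand and compares it with `scipy.stats.ttest_1samp`. The helper `one_sample_t` and the synthetic array `demo` are hypothetical and serve only as an illustration; they are not the ice cream data.

```python
import numpy as np
from scipy import stats


def one_sample_t(sample, popmean):
    """Return (t statistic, two-sided p-value) for H0: mean == popmean."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    # Sample standard deviation with the n - 1 denominator (ddof=1)
    s = x.std(ddof=1)
    t = (x.mean() - popmean) / (s / np.sqrt(n))
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    return t, p


# Synthetic stand-in data (hypothetical; not the ice cream dataset)
rng = np.random.default_rng(0)
demo = rng.normal(loc=522, scale=175, size=500)

t_manual, p_manual = one_sample_t(demo, popmean=522)
t_scipy, p_scipy = stats.ttest_1samp(demo, popmean=522)

# Both approaches should agree up to floating-point error
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The only differences from the z-test are the use of the sample standard deviation in place of a known $\sigma$ and the t distribution with $n - 1$ degrees of freedom in place of the standard normal.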

### **Step 6: Calculate the p-value using t-statistic**

In [43]:
# Pass the Revenue column (a Series) so that scalar values are returned
t_statistic, p_value_ttest = stats.ttest_1samp(mydata["Revenue"], popmean = 522)
print('One sample t-test \nt statistic: {0:.4f} p value: {1:.4f} '.format(t_statistic, p_value_ttest))

One sample t-test 
t statistic: -0.0547 p value: 0.9564 


### **Step 7: Decide to reject or not to reject the null hypothesis based on t-statistic**

In [44]:
alpha_value = 0.05 # Level of significance

print('Level of significance: %.2f' %alpha_value)

if p_value_ttest < alpha_value: 
    print('We have evidence to reject the null hypothesis as the p-value is less than the level of significance')
else:
    print('We do not have sufficient evidence to reject the null hypothesis as the p-value is greater than the level of significance') 



Level of significance: 0.05
We do not have sufficient evidence to reject the null hypothesis as the p-value is greater than the level of significance


**Observation:** 

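- At a 5% significance level, we do not have enough statistical evidence to conclude that the mean daily revenue is different from 522 dollars. This does not prove that the mean is exactly 522 dollars; it only means the data are consistent with the vendor's claim.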