# Hypothesis Testing for Superstore Dataset

In [2]:
import pandas as pd
from scipy.stats import ttest_ind


In [3]:
# Load the data with "ISO-8859-1" encoding
data = pd.read_csv('Sample - Superstore.csv', encoding="ISO-8859-1")
data

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2,0.00,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3,0.00,219.5820
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2,0.00,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.0310
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,1/21/2014,1/23/2014,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,33180,South,FUR-FU-10001889,Furniture,Furnishings,Ultra Door Pull Handle,25.2480,3,0.20,4.1028
9990,9991,CA-2017-121258,2/26/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,FUR-FU-10000747,Furniture,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.9600,2,0.00,15.6332
9991,9992,CA-2017-121258,2/26/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,TEC-PH-10003645,Technology,Phones,Aastra 57i VoIP phone,258.5760,2,0.20,19.3932
9992,9993,CA-2017-121258,2/26/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,OFF-PA-10004041,Office Supplies,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.6000,4,0.00,13.3200


In [6]:
discount = data['Discount']
discount

0       0.00
1       0.00
2       0.00
3       0.45
4       0.20
        ... 
9989    0.20
9990    0.00
9991    0.20
9992    0.00
9993    0.00
Name: Discount, Length: 9994, dtype: float64

In [16]:
unique_discount_levels = data['Discount'].unique()
unique_discount_levels.sort()
print(unique_discount_levels)

[0.   0.1  0.15 0.2  0.3  0.32 0.4  0.45 0.5  0.6  0.7  0.8 ]


# Goals
#### Test whether discounting has a significant impact on sales or profit, by comparing orders that received any discount with orders that received none.

# For Sales:
#### Null Hypothesis ($H_0$): The average sales for orders with a discount is equal to the average sales for orders without a discount.
#### Alternate Hypothesis ($H_1$): The average sales for orders with a discount is not equal to the average sales for orders without a discount.

# For Profit:
#### Null Hypothesis ($H_0$): The average profit for orders with a discount is equal to the average profit for orders without a discount.
#### Alternate Hypothesis ($H_1$): The average profit for orders with a discount is not equal to the average profit for orders without a discount.




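##### We will use the independent two-sample t-test for this, since we are comparing means between two unrelated groups (orders with discounts vs. orders without discounts). Passing `equal_var=False` to `ttest_ind` selects Welch's variant of the test, which does not assume that the two groups have equal variances.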
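
For reference, the statistic that `ttest_ind` computes with `equal_var=False` can be written out as below. The notation is an assumption introduced here, not taken from the notebook: subscript 1 denotes the discounted group and subscript 2 the undiscounted group, with sample means $\bar{x}_i$, sample variances $s_i^2$, and group sizes $n_i$. The degrees of freedom $\nu$ use the Welch-Satterthwaite approximation.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
\]
```

With this ordering of the groups, a negative $t$ means the discounted group has the lower sample mean.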

In [22]:
# Splitting the data into groups: with discount and without discount
with_discount = data[data['Discount'] > 0]
without_discount = data[data['Discount'] == 0]

# T-test for Sales
t_stat_sales, p_value_sales = ttest_ind(with_discount['Sales'], without_discount['Sales'], equal_var=False)

# T-test for Profit
t_stat_profit, p_value_profit = ttest_ind(with_discount['Profit'], without_discount['Profit'], equal_var=False)

print(t_stat_sales, p_value_sales, t_stat_profit, p_value_profit)

0.4786374783542724 0.6322073069401478 -15.737992941015493 4.356930371141414e-55
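
To make the `equal_var=False` call less of a black box, here is a minimal sketch that computes Welch's t-statistic, its approximate degrees of freedom, and the two-sided p-value by hand, then compares them with `ttest_ind`. The helper `welch_t` and the synthetic arrays `group_a` and `group_b` are assumptions made up for illustration; they are not the Superstore data.

```python
import numpy as np
from scipy.stats import ttest_ind, t as t_dist


def welch_t(a, b):
    """Hypothetical helper: Welch's t-test computed step by step."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean (sample variance / n)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution
    p_value = 2 * t_dist.sf(abs(t_stat), df)
    return t_stat, df, p_value


# Synthetic groups with different means, spreads, and sizes (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=5.0, size=200)
group_b = rng.normal(loc=12.0, scale=9.0, size=150)

manual_t, manual_df, manual_p = welch_t(group_a, group_b)
scipy_t, scipy_p = ttest_ind(group_a, group_b, equal_var=False)

print("manual:", manual_t, manual_p)
print("scipy: ", scipy_t, scipy_p)
```

The two printed lines should agree up to floating-point error. The same helper could be applied to `with_discount['Sales']` and `without_discount['Sales']` from the cell above.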


# Results

# For Sales:
#### T-Statistic: 0.4786
#### P-Value: 0.6322
##### Decision: Since the p-value (0.6322) is greater than $\alpha = 0.05$, we fail to reject the null hypothesis. This suggests that the average sales for orders with a discount are not significantly different from the average sales for orders without a discount.

# For Profit:
#### T-Statistic: -15.7380
#### P-Value: $4.36 \times 10^{-55}$
##### Decision: The p-value is extremely close to 0, far below $\alpha = 0.05$. Therefore, we reject the null hypothesis. This indicates that there is a significant difference in average profit between orders with a discount and those without.

# Summary

#### Average sales do not differ significantly between orders with and without a discount. Average profit does: the negative t-statistic (-15.74, with the discounted group listed first in `ttest_ind`) shows that orders with a discount earn lower profit on average than orders without one.
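
A p-value says whether the group means differ, not in which direction or by how much, so it helps to look at the means themselves. The sketch below shows one way to do that. The helper `mean_by_discount_flag` is a hypothetical name, and `sample` holds only the `Discount` and `Profit` values of the first five rows from the data preview, as a tiny stand-in for the full `data` frame.

```python
import pandas as pd

# Discount and Profit from the first five rows of the preview above.
# This is a stand-in; the real analysis would pass the full `data` frame.
sample = pd.DataFrame({
    "Discount": [0.00, 0.00, 0.00, 0.45, 0.20],
    "Profit": [41.9136, 219.5820, 6.8714, -383.0310, 2.5164],
})


def mean_by_discount_flag(df, value_col):
    """Hypothetical helper: mean of value_col for rows with and without a discount."""
    labels = df["Discount"].gt(0).map(
        {True: "with_discount", False: "without_discount"}
    )
    return df.groupby(labels)[value_col].mean()


profit_means = mean_by_discount_flag(sample, "Profit")
print(profit_means)
```

Five rows are far too few to draw conclusions from. On the full dataset, `mean_by_discount_flag(data, 'Profit')` and `mean_by_discount_flag(data, 'Sales')` would give the group means that the two t-tests compare.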