# **SHETH L.U.J. & SIR M.V. COLLEGE**
**Swati Mahajan | T093**
## **Practical No. 4**

**Aim:** Hypothesis Testing
*  Formulate null and alternative hypotheses for a given problem.
*  Conduct a hypothesis test using an appropriate statistical test (e.g., t-test, chi-square test).
*  Interpret the results and draw conclusions based on the test outcomes.
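
Every test in this practical ends with the same decision rule: compare the p-value with a significance level (0.05 throughout) and either reject or fail to reject the null hypothesis. The sketch below shows that rule as a small reusable helper. The function name `conclude` and its `alpha` parameter are illustrative assumptions and do not appear in the original notebook.

```python
def conclude(p_value, alpha=0.05):
    """Return a conclusion string for a test at significance level alpha.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if p_value < alpha:
        return f"Reject the null hypothesis (p = {p_value:.4f} < {alpha})."
    # A large p-value is not proof that H0 is true, hence "fail to reject".
    return f"Fail to reject the null hypothesis (p = {p_value:.4f} >= {alpha})."


print(conclude(0.103))  # a p-value above 0.05
print(conclude(0.027))  # a p-value below 0.05
```

The wording "fail to reject" is deliberate: a non-significant result means the data did not provide enough evidence against H0, not that H0 has been shown to be true.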


# **Statistical Tests on the Global Superstore Dataset**

## 1. Data Loading and Setup

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statsmodels.stats import weightstats as stests
import statsmodels.api as sm
from statsmodels.formula.api import ols

try:
    df = pd.read_csv('Datasets/Global_Superstore2.csv', encoding='latin1') # Added encoding for compatibility
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'Datasets/Global_Superstore2.csv' not found.")
    df = pd.DataFrame() # Create an empty DataFrame to avoid errors

df.head()

Dataset loaded successfully.


| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | City | State | ... | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Shipping Cost | Order Priority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32298 | CA-2012-124891 | 31-07-2012 | 31-07-2012 | Same Day | RH-19495 | Rick Hansen | Consumer | New York City | New York | ... | TEC-AC-10003033 | Technology | Accessories | Plantronics CS510 - Over-the-Head monaural Wir... | 2309.65 | 7 | 0.0 | 762.1845 | 933.57 | Critical |
| 1 | 26341 | IN-2013-77878 | 05-02-2013 | 07-02-2013 | Second Class | JR-16210 | Justin Ritter | Corporate | Wollongong | New South Wales | ... | FUR-CH-10003950 | Furniture | Chairs | Novimex Executive Leather Armchair, Black | 3709.395 | 9 | 0.1 | -288.765 | 923.63 | Critical |
| 2 | 25330 | IN-2013-71249 | 17-10-2013 | 18-10-2013 | First Class | CR-12730 | Craig Reiter | Consumer | Brisbane | Queensland | ... | TEC-PH-10004664 | Technology | Phones | Nokia Smart Phone, with Caller ID | 5175.171 | 9 | 0.1 | 919.971 | 915.49 | Medium |
| 3 | 13524 | ES-2013-1579342 | 28-01-2013 | 30-01-2013 | First Class | KM-16375 | Katherine Murray | Home Office | Berlin | Berlin | ... | TEC-PH-10004583 | Technology | Phones | Motorola Smart Phone, Cordless | 2892.51 | 5 | 0.1 | -96.54 | 910.16 | Medium |
| 4 | 47221 | SG-2013-4320 | 05-11-2013 | 06-11-2013 | Same Day | RH-9495 | Rick Hansen | Consumer | Dakar | Dakar | ... | TEC-SHA-10000501 | Technology | Copiers | Sharp Wireless Fax, High-Speed | 2832.96 | 8 | 0.0 | 311.52 | 903.04 | Critical |


## 2. One-Sample t-test

**Hypothesis:** Is the average Sales per order significantly different from a hypothesized value of 250?

In [3]:
if not df.empty:
    # Define the null and alternative hypotheses
    H0 = "The average Sales is 250."
    H1 = "The average Sales is not 250."

    # Extract the data for the test
    sales_data = df['Sales']

    # Calculate the test statistic
    t_stat, p_value = stats.ttest_1samp(sales_data, 250)

    # Print the results
    print("Hypothesis: Is the average Sales different from 250?")
    print("Test statistic:", t_stat)
    print("p-value:", p_value)

    # Conclusion
    if p_value < 0.05:
        print("Conclusion: Reject the null hypothesis. The average Sales is significantly different from 250.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

Hypothesis: Is the average Sales different from 250?
Test statistic: -1.630116728496896
p-value: 0.10308296939821836
Conclusion: Fail to reject the null hypothesis.
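
To make the test statistic above less of a black box, the sketch below computes a one-sample t statistic by hand using only the standard library: the sample mean minus the hypothesized mean, divided by the standard error. The function name `one_sample_t` and the toy numbers are made up for illustration; they are not the Superstore data.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error, n - 1


# Toy data, invented for this example only.
toy_sales = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat, dof = one_sample_t(toy_sales, mu0=4)
print("t statistic:", t_stat)
print("degrees of freedom:", dof)
```

`stats.ttest_1samp` performs this calculation and then converts the statistic into a p-value using the t distribution with n - 1 degrees of freedom.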


## 3. Two-Sample (Independent) t-test

**Hypothesis:** Is there a significant difference in the average Sales between the 'Consumer' and 'Corporate' segments?

In [4]:
if not df.empty and 'Consumer' in df['Segment'].unique() and 'Corporate' in df['Segment'].unique():
    # Separate the data into two groups
    consumer_sales = df[df['Segment'] == 'Consumer']['Sales']
    corporate_sales = df[df['Segment'] == 'Corporate']['Sales']

    print("Consumer Sales mean:", consumer_sales.mean())
    print("Corporate Sales mean:", corporate_sales.mean())

    # Perform the independent t-test
    ttest, pval = stats.ttest_ind(consumer_sales, corporate_sales, nan_policy='omit')

    print("\nHypothesis: Is there a difference in Sales between Consumer and Corporate segments?")
    print("p-value:", pval)

    if pval < 0.05:
        print("Conclusion: We reject the null hypothesis. There is a significant difference.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

Consumer Sales mean: 245.41629903688062
Corporate Sales mean: 247.89017573789613

Hypothesis: Is there a difference in Sales between Consumer and Corporate segments?
p-value: 0.6110671988608942
Conclusion: Fail to reject the null hypothesis.
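
By default, `stats.ttest_ind` assumes that both groups share the same variance. When that assumption is doubtful, Welch's t-test is a common alternative (in SciPy it is selected with `equal_var=False`). The sketch below shows the Welch statistic and its approximate degrees of freedom computed by hand; the function name `welch_t` and the two toy groups are illustrative assumptions, not values from the dataset.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two independent samples."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    se_squared = var_a / na + var_b / nb                            # no pooling of variances
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_squared)
    dof = se_squared ** 2 / (
        (var_a / na) ** 2 / (na - 1) + (var_b / nb) ** 2 / (nb - 1)
    )
    return t, dof


# Toy groups, invented for this example only.
group_a = [10, 12, 14]
group_b = [20, 21, 22, 23, 24]
t_welch, dof_welch = welch_t(group_a, group_b)
print("Welch t statistic:", t_welch)
print("approximate degrees of freedom:", dof_welch)
```

With segment sizes as large as those in this dataset, the pooled and Welch versions are likely to give very similar results, so this is mainly a robustness check.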


## 4. Paired Sample t-test

**Hypothesis:** Is there a significant difference between Sales and Profit for each order? (Note: this is an illustrative example; it tests whether the mean per-order difference between the two columns is zero.)

*  H0: The mean difference between Sales and Profit is 0.
*  H1: The mean difference between Sales and Profit is not 0.

In [5]:
if not df.empty:
    print(df[['Sales', 'Profit']].describe())

    # Perform the paired t-test
    ttest, pval = stats.ttest_rel(df['Sales'], df['Profit'])

    print("\nHypothesis: Is there a difference between Sales and Profit?")
    print("p-value:", pval)

    if pval < 0.05:
        print("Conclusion: Reject the null hypothesis; there is a significant difference.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

              Sales        Profit
count  51290.000000  51290.000000
mean     246.490581     28.610982
std      487.565361    174.340972
min        0.444000  -6599.978000
25%       30.758625      0.000000
50%       85.053000      9.240000
75%      251.053200     36.810000
max    22638.480000   8399.976000

Hypothesis: Is there a difference between Sales and Profit?
p-value: 0.0
Conclusion: Reject the null hypothesis; there is a significant difference.
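
A paired t-test is equivalent to a one-sample t-test on the row-by-row differences, tested against a mean of 0. The sketch below shows that equivalence by hand. The function name `paired_t` and the two toy lists are illustrative assumptions, not values from the dataset.

```python
import math
import statistics


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for H0: mean of (x - y) == 0."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [xi - yi for xi, yi in zip(x, y)]        # one difference per pair
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Toy paired measurements, invented for this example only.
first_measure = [10, 12, 9, 11]
second_measure = [8, 11, 7, 10]
t_paired, dof_paired = paired_t(first_measure, second_measure)
print("paired t statistic:", t_paired)
print("degrees of freedom:", dof_paired)
```

Because the test works on differences within each row, it only makes sense when the two columns are measured on the same units (here, the same orders).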


## 5. One-Sample Z-test

**Hypothesis:** Is the average Quantity of products ordered significantly different from a hypothesized value of 3?

In [7]:
if not df.empty and len(df) > 30:
    # Perform one-sample Z-test
    ztest, pval = stests.ztest(df['Quantity'], x2=None, value=3)

    print("Hypothesis: Is the average Quantity different from 3?")
    print("p-value:", float(pval))

    if pval < 0.05:
        print("Conclusion: Reject the null hypothesis.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")
else:
    print("Not enough data (n <= 30) for a Z-test. A t-test would be more appropriate.")

Hypothesis: Is the average Quantity different from 3?
p-value: 0.0
Conclusion: Reject the null hypothesis.
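
The sketch below shows the arithmetic behind a one-sample Z-test using only the standard library: the statistic has the same form as the t statistic, but the p-value comes from the standard normal distribution. It also illustrates why a p-value can print as exactly `0.0`: for a very large |z| the tail probability is smaller than a float can represent. The function name `one_sample_z` and the toy data are illustrative assumptions; the use of the sample standard deviation (n - 1 denominator) is also an assumption of this sketch.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_z(sample, mu0):
    """Return (z statistic, two-sided p-value) for H0: population mean == mu0."""
    n = len(sample)
    z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    # Two-sided p-value: probability mass in both tails beyond |z|.
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Toy data, invented for this example only (a real Z-test needs a large sample).
toy_quantities = [3, 4, 5, 4, 3, 5, 4, 4]
z_stat, p_val = one_sample_z(toy_quantities, mu0=3)
print("z statistic:", z_stat)
print("p-value:", p_val)

# With an extreme statistic the tail probability underflows to zero.
underflow_p = 2 * NormalDist().cdf(-40.0)
print("p-value for |z| = 40:", underflow_p)
```

A reported p-value of 0.0 is therefore best read as "smaller than floating-point precision can show", and printing the test statistic alongside it gives a more informative result.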


## 6. Two-Sample Z-test

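**Hypothesis:** Is there a significant difference between the mean of Sales and the mean of Profit? (Note: Sales and Profit are recorded for the same orders, so treating them as two independent samples is illustrative only; the paired test in Section 4 matches the data structure more closely.)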

* H0: The mean difference between the two groups is 0.
* H1: The mean difference between the two groups is not 0.

In [8]:
if not df.empty and len(df) > 30:
    ztest, pval1 = stests.ztest(df['Sales'], x2=df['Profit'],
                                value=0, alternative='two-sided')

    print("Hypothesis: Is the mean of Sales different from the mean of Profit?")
    print("p-value:", float(pval1))

    if pval1 < 0.05:
        print("Conclusion: Reject the null hypothesis.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

Hypothesis: Is the mean of Sales different from the mean of Profit?
p-value: 0.0
Conclusion: Reject the null hypothesis.
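
For reference, a two-sample Z statistic for independent groups can be sketched as the difference in sample means divided by its standard error. The version below assumes a pooled variance estimate; that choice, the function name `two_sample_z`, and the toy groups are all assumptions of this sketch rather than details taken from the notebook.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z(x, y, diff0=0.0):
    """Return (z statistic, two-sided p-value) for H0: mean(x) - mean(y) == diff0."""
    nx, ny = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    z = (mean(x) - mean(y) - diff0) / standard_error
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Toy groups, invented for this example only (a real Z-test needs large samples).
group_x = [10, 12, 14]
group_y = [20, 21, 22, 23, 24]
z_two, p_two = two_sample_z(group_x, group_y)
print("z statistic:", z_two)
print("p-value:", p_two)
```

The formula treats the two inputs as unrelated samples, which is why it suits comparisons such as one segment against another better than two columns of the same rows.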


## 7. Chi-Square Test

**Hypothesis:**  Is there a significant association between a customer's Segment and their Region?
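
The chi-square test of independence compares each observed count with the count expected if Segment and Region were unrelated, where expected = (row total × column total) / grand total. The sketch below works through that calculation with the standard library only. The counts are copied from the Segment by Region contingency table printed in this section's output; the function name `chi_square_independence` is an illustrative assumption.

```python
# Observed counts (rows: Consumer, Corporate, Home Office; columns: the 13 regions
# in the order Africa, Canada, Caribbean, Central, Central Asia, EMEA, East,
# North, North Asia, Oceania, South, Southeast Asia, West).
observed = [
    [2381, 202, 828, 5782, 1042, 2538, 1469, 2468, 1170, 1837, 3479, 1650, 1672],
    [1312, 110, 507, 3321, 613, 1574, 877, 1487, 708, 1053, 1998, 909, 960],
    [894, 72, 355, 2014, 393, 917, 502, 830, 460, 597, 1168, 570, 571],
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence for every cell.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


chi2_manual, dof_manual, expected_manual = chi_square_independence(observed)
print("Chi-square statistic:", chi2_manual)
print("Degrees of freedom:", dof_manual)
print("Expected count, Consumer x Africa:", expected_manual[0][0])
```

For a table with 3 rows and 13 columns the degrees of freedom are (3 - 1) × (13 - 1) = 24, which matches the value reported by `stats.chi2_contingency` in this section.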

In [9]:
if not df.empty:
    # Create a contingency table
    contingency_table = pd.crosstab(df['Segment'], df['Region'])

    print("Contingency Table:")
    print(contingency_table)
    print("\n")

    # Perform the chi-square test
    chi2_stat, p_value, dof, expected_freq = stats.chi2_contingency(contingency_table)

    # Print the results
    print('Chi-square statistic:', chi2_stat)
    print('P-value:', p_value)
    print('Degrees of freedom:', dof)
    print('Expected frequencies:\n', expected_freq)

    # Conclusion
    if p_value < 0.05:
        print("\nConclusion: Reject the null hypothesis. There is a significant association between Segment and Region.")
    else:
        print("\nConclusion: Fail to reject the null hypothesis. No significant association between Segment and Region was detected.")

Contingency Table:
Region       Africa  Canada  Caribbean  Central  Central Asia  EMEA  East  \
Segment                                                                     
Consumer       2381     202        828     5782          1042  2538  1469   
Corporate      1312     110        507     3321           613  1574   877   
Home Office     894      72        355     2014           393   917   502   

Region       North  North Asia  Oceania  South  Southeast Asia  West  
Segment                                                               
Consumer      2468        1170     1837   3479            1650  1672  
Corporate     1487         708     1053   1998             909   960  
Home Office    830         460      597   1168             570   571  


Chi-square statistic: 38.99310388434618
P-value: 0.027354738562463815
Degrees of freedom: 24
Expected frequencies:
 [[2371.57469292  198.53601092  873.76525639 5747.72092026 1058.8587249
  2600.09791382 1472.47541431 2473.94482355 1208.79