# 8. Hypothesis Testing (T-tests, ANOVA)
#### **Introduction**
This project applies hypothesis testing techniques, including T-tests and ANOVA, to analyze datasets for statistical relationships. The goal is to explore whether differences in group means are statistically significant, using two datasets: customer shopping data and synthetic transaction data.

#### **Objectives**
1. **T-Test**: Compare the means of two groups (e.g., gender, transaction volume) to identify significant differences.
2. **ANOVA**: Analyze variance among three or more groups (e.g., shopping malls, customer categories) to detect significant mean differences.
3. Summarize findings and evaluate their implications.

---

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Users\Zana\Desktop\portfolio_projects\project_1\cleaned_customer_shopping_data.csv")
print(df.head())

  invoice_no customer_id  gender  age  category  quantity    price  \
0    I138884     C241288  Female   28  Clothing         5  1500.40   
1    I317333     C111565    Male   21     Shoes         3  1800.51   
2    I127801     C266599    Male   20  Clothing         1   300.08   
3    I337046     C189076  Female   53     Books         4    60.60   
4    I227836     C657758  Female   28  Clothing         5  1500.40   

  payment_method invoice_date   shopping_mall  
0    Credit Card   2022-08-05          Kanyon  
1     Debit Card   2021-12-12  Forum Istanbul  
2           Cash   2021-11-09       Metrocity  
3           Cash   2021-10-24          Kanyon  
4    Credit Card   2022-05-24  Forum Istanbul  


- T-tests are used when comparing the means of two groups to see if they are significantly different from each other.

- ANOVA (Analysis of Variance) is used when you want to compare the means of three or more groups.
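
To make the two-group case concrete, here is a minimal sketch that computes the pooled-variance t statistic by hand and compares it with `scipy.stats.ttest_ind`. The samples `a` and `b` are made-up numbers for illustration only; they are not taken from the shopping dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for illustration only (not from the dataset)
a = np.array([5.0, 3.0, 4.0, 6.0, 2.0])
b = np.array([4.0, 4.0, 5.0, 3.0, 6.0, 5.0])

# Pooled-variance (Student) t statistic, the default in stats.ttest_ind
n_a, n_b = len(a), len(b)
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

# The library call should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The sign of the statistic only reflects which group is passed first; the two-sided p-value is unaffected by the order.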

In [2]:
from scipy import stats
# Step 1: T-test for Quantity between Male and Female
# Group data by gender
male_quantity = df[df['gender'] == 'Male']['quantity']
female_quantity = df[df['gender'] == 'Female']['quantity']

# Perform independent two-sample T-test (scipy assumes equal variances by default)
t_stat, p_value_ttest = stats.ttest_ind(male_quantity, female_quantity)

# Step 2: ANOVA for Price across Shopping Malls
# Group data by shopping mall
mall_groups = [group['price'].values for name, group in df.groupby('shopping_mall')]

# Perform one-way ANOVA
f_stat, p_value_anova = stats.f_oneway(*mall_groups)

# Output the results
print(f"T-test results: t-statistic = {t_stat}, p-value = {p_value_ttest}")
print(f"ANOVA results: F-statistic = {f_stat}, p-value = {p_value_anova}")

T-test results: t-statistic = -0.07644059594288087, p-value = 0.939068735114279
ANOVA results: F-statistic = 0.4822743311412417, p-value = 0.8876054043690432


# Hypothesis Testing Summary

## 1. T-Test Results: Comparing Quantity of Items Purchased by Gender
- **t-statistic**: -0.076
- **p-value**: 0.939

### Conclusion:
The p-value is much greater than the common significance level of 0.05, so we **fail to reject the null hypothesis**. This indicates that there is no statistically significant difference in the average quantity of items purchased between Male and Female customers. The means for both genders are quite similar.
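
As a possible robustness check, the sketch below runs Welch's t-test (which drops the equal-variance assumption) and computes Cohen's d as an effect size. This is not part of the original analysis. The two arrays are randomly generated stand-ins for `male_quantity` and `female_quantity`, so the printed numbers will not match the result reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in samples (assumption): in the notebook these would be
# male_quantity and female_quantity taken from df
male_quantity = rng.integers(1, 6, size=400)
female_quantity = rng.integers(1, 6, size=600)

# Welch's t-test: does not assume the two groups share a variance
t_welch, p_welch = stats.ttest_ind(male_quantity, female_quantity, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
n_m, n_f = len(male_quantity), len(female_quantity)
pooled_sd = np.sqrt(
    ((n_m - 1) * male_quantity.var(ddof=1) + (n_f - 1) * female_quantity.var(ddof=1))
    / (n_m + n_f - 2)
)
cohens_d = (male_quantity.mean() - female_quantity.mean()) / pooled_sd

print(f"Welch t = {t_welch:.3f}, p = {p_welch:.3f}, Cohen's d = {cohens_d:.3f}")
```

Reporting an effect size alongside the p-value helps separate "no detectable difference" from "a difference too small to matter".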

---

## 2. ANOVA Results: Comparing Prices Across Shopping Malls
- **F-statistic**: 0.482
- **p-value**: 0.888

### Conclusion:
The p-value is greater than 0.05, so we **fail to reject the null hypothesis**. This means there is no statistically significant difference in the average price of items purchased across the different shopping malls. The average prices are similar among the malls.
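
One-way ANOVA assumes roughly equal variances across groups and approximately normal residuals. The sketch below shows one way these assumptions could be probed: Levene's test for equal variances, followed by the rank-based Kruskal-Wallis test as an alternative that does not rely on normality. Neither check is part of the original analysis. The groups are randomly generated stand-ins for `mall_groups`, so the output is illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in groups (assumption): in the notebook this would be mall_groups
mall_groups = [rng.lognormal(mean=6.0, sigma=1.0, size=n) for n in (300, 250, 400)]

# Levene's test: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*mall_groups)

# Kruskal-Wallis H-test: rank-based alternative to one-way ANOVA
h_stat, kruskal_p = stats.kruskal(*mall_groups)

print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {kruskal_p:.3f}")
```

If the Kruskal-Wallis test and the ANOVA lead to the same decision, the conclusion is less likely to depend on the distributional assumptions.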

---

## Overall Findings:
- There is no significant difference in the quantity of items purchased by Male vs. Female customers.
- There is no significant difference in the average price of items purchased across the shopping malls.

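These results provide no evidence that gender affects the quantity of items purchased or that the shopping mall affects the price paid. Failing to reject the null hypothesis does not prove that the group means are equal; it only means the data show no detectable difference at the 0.05 level.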

In [7]:
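# Create a simple synthetic dataset with date, transactions, sales, profit, and customer count
# Note: no random seed is set, so the values (and the test results that follow) change on every run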
data = {
    'date': pd.date_range(start='2023-01-01', periods=100, freq='D'),  # Dates over 100 days
    'transactions': np.random.randint(500, 5000, size=100),  # Random number of transactions
    'sales': np.random.uniform(10000, 50000, size=100),  # Random sales amounts
    'profit': np.random.uniform(5000, 20000, size=100),  # Random profit amounts
    'customer_count': np.random.randint(100, 1000, size=100)  # Random customer count
}

# Create the DataFrame
df = pd.DataFrame(data)
df

|     | date       | transactions | sales        | profit       | customer_count |
|-----|------------|--------------|--------------|--------------|----------------|
| 0   | 2023-01-01 | 3387         | 28307.069479 | 16564.098541 | 784            |
| 1   | 2023-01-02 | 1696         | 47301.190844 | 17075.066212 | 195            |
| 2   | 2023-01-03 | 3367         | 29440.087996 | 12432.529834 | 496            |
| 3   | 2023-01-04 | 3489         | 32946.090802 | 10153.539943 | 126            |
| 4   | 2023-01-05 | 2288         | 31531.459852 | 10778.531431 | 422            |
| ... | ...        | ...          | ...          | ...          | ...            |
| 95  | 2023-04-06 | 1783         | 11623.229662 | 13040.175005 | 885            |
| 96  | 2023-04-07 | 3834         | 35789.475706 | 12244.994752 | 953            |
| 97  | 2023-04-08 | 508          | 12508.266962 | 9861.017531  | 908            |
| 98  | 2023-04-09 | 4014         | 11694.260086 | 14834.759668 | 586            |


### T-test

In [8]:
median_transactions = df['transactions'].median()
high_transactions = df[df['transactions'] > median_transactions]['profit']
low_transactions = df[df['transactions'] <= median_transactions]['profit']

# Perform T-test
t_stat, p_value = stats.ttest_ind(high_transactions, low_transactions)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

T-statistic: -0.23265664324157728, P-value: 0.8165132445251699


### ANOVA

In [9]:
# Categorize customer counts
bins = [0, 300, 600, 1000]
labels = ['Low', 'Medium', 'High']
df['customer_category'] = pd.cut(df['customer_count'], bins=bins, labels=labels)

# Perform ANOVA
anova_result = stats.f_oneway(
    df[df['customer_category'] == 'Low']['sales'],
    df[df['customer_category'] == 'Medium']['sales'],
    df[df['customer_category'] == 'High']['sales']
)

print(f"ANOVA F-statistic: {anova_result.statistic}, P-value: {anova_result.pvalue}")

ANOVA F-statistic: 2.488607260893683, P-value: 0.08831339314994303


### **Findings**

#### **1. T-Test on Customer Shopping Data**
- **Tested Hypothesis**: Are there significant differences in the quantity of items purchased between Male and Female customers?
- **Results**:
  - T-statistic: `-0.076`
  - P-value: `0.939`
- **Conclusion**:
  - With a p-value much greater than `0.05`, we fail to reject the null hypothesis.
  - No significant difference in item quantities purchased by Male and Female customers.

#### **2. ANOVA on Customer Shopping Data**
- **Tested Hypothesis**: Are there significant differences in average prices across different shopping malls?
- **Results**:
  - F-statistic: `0.482`
  - P-value: `0.888`
- **Conclusion**:
  - With a p-value greater than `0.05`, we fail to reject the null hypothesis.
  - No significant difference in the average prices of items across shopping malls.

#### **3. T-Test on Transaction Data**
- **Tested Hypothesis**: Is there a significant difference in profit between high-transaction and low-transaction days?
- **Results**:
  - T-statistic: `-0.233`
  - P-value: `0.817`
- **Conclusion**:
  - The p-value exceeds `0.05`, so we fail to reject the null hypothesis.
  - No significant difference in profit between high-transaction and low-transaction days.

#### **4. ANOVA on Transaction Data**
- **Tested Hypothesis**: Are there significant differences in average sales across different customer categories (Low, Medium, High)?
- **Results**:
  - F-statistic: `2.489`
  - P-value: `0.088`
- **Conclusion**:
  - The p-value is below `0.1` but above `0.05`, so we fail to reject the null hypothesis at the `0.05` level. No significant difference in sales across customer categories.
  - Because `sales` and `customer_count` are generated independently at random, the relatively low p-value reflects chance variation rather than a real effect.
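
To see why a p-value near `0.09` is unremarkable for this dataset, the sketch below repeats the same ANOVA on many freshly generated datasets in which `sales` and `customer_count` are independent by construction. It mirrors the generating process and the cut points of the notebook cells. The seed, the number of simulations, and the use of boolean masks instead of `pd.cut` are choices made here for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
n_sims, n_days = 2000, 100
p_values = np.empty(n_sims)

for i in range(n_sims):
    # Same generating process as the synthetic dataset: independent columns
    sales = rng.uniform(10000, 50000, size=n_days)
    customer_count = rng.integers(100, 1000, size=n_days)

    # Same cut points as bins = [0, 300, 600, 1000] (right-inclusive intervals)
    low = sales[customer_count <= 300]
    medium = sales[(customer_count > 300) & (customer_count <= 600)]
    high = sales[customer_count > 600]

    p_values[i] = stats.f_oneway(low, medium, high).pvalue

# Under a true null hypothesis, p-values are approximately uniform on [0, 1]
frac_below_10 = np.mean(p_values < 0.10)
frac_below_05 = np.mean(p_values < 0.05)
print(f"Share of runs with p < 0.10: {frac_below_10:.3f}")
print(f"Share of runs with p < 0.05: {frac_below_05:.3f}")
```

Roughly one run in ten should produce p < 0.10 even though no relationship exists, which is why a single value in that range should not be read as an emerging effect.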

---

### **Summary**
1. **Customer Shopping Data**:
   - No significant difference in quantities purchased by gender or in average prices across shopping malls.
2. **Transaction Data**:
   - No significant difference in profit between high and low transaction days.
   - No significant difference in sales across customer categories (p = 0.088, above the 0.05 threshold).

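At the 0.05 significance level, none of the four tests detected a difference in group means. For the synthetic transaction data this is the expected outcome, since every column is drawn independently at random. For the customer shopping data, further data or alternative methods may be required for deeper insights.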
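
Since four tests were run, one optional refinement is to adjust the significance threshold for multiple comparisons. The sketch below applies a Bonferroni correction to the four p-values reported in the findings. The choice of Bonferroni is an assumption made here for illustration; the original analysis does not apply any correction.

```python
# p-values as reported in the findings (rounded to three decimals)
p_values = {
    "T-test: quantity by gender": 0.939,
    "ANOVA: price by shopping mall": 0.888,
    "T-test: profit by transaction volume": 0.817,
    "ANOVA: sales by customer category": 0.088,
}

alpha = 0.05
# Bonferroni: divide the family-wise alpha by the number of tests
bonferroni_alpha = alpha / len(p_values)

# A test is significant only if its p-value falls below the adjusted threshold
rejected = {name: p < bonferroni_alpha for name, p in p_values.items()}

for name, p in p_values.items():
    verdict = "reject H0" if rejected[name] else "fail to reject H0"
    print(f"{name}: p = {p:.3f} -> {verdict} (threshold {bonferroni_alpha:.4f})")
```

The correction makes the threshold stricter, so it cannot turn a non-significant result into a significant one; here it leaves all four conclusions unchanged.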
