# Inferential Statistics & Hypothesis Testing #

## Formulate Hypotheses ##

**Hypothesis 1:**

**H₀:** There is no significant difference in average transaction revenue between new and returning users.

**H₁:** There is a significant difference in average transaction revenue between new and returning users.

**Hypothesis 2:**

**H₀:** There is no significant difference in average transaction revenue across device categories.

**H₁:** There is a significant difference in average transaction revenue across device categories.

**Hypothesis 3:**

**H₀:** Device category and user type are independent.

**H₁:** Device category and user type are associated.

## Data Preparation ##

In [4]:
import pandas as pd
import numpy as np

# Load data from the CSV file exported from BigQuery
path = "C:/Users/Jonat/DataProject/CLV and Customer Segmentation/hypothesis_testing.csv"
df = pd.read_csv(path)

# Display the first few rows to verify the data structure
print(df.head())

# Remove rows with missing values in the columns needed for analysis
# (transaction_revenue and time_on_site). dropna returns a new DataFrame,
# so the result must be assigned back.
df = df.dropna(subset=['transaction_revenue', 'time_on_site'])

# Ensure transaction_revenue is numeric, then apply a log(1 + x) transformation
df['transaction_revenue'] = pd.to_numeric(df['transaction_revenue'])
df['log_revenue'] = np.log1p(df['transaction_revenue'])

# Create a categorical variable 'user_type' based on visitNumber (1 = first visit)
df['user_type'] = df['visitNumber'].apply(lambda x: 'New' if x == 1 else 'Returning')

# Inspect the resulting DataFrame
print(df[['fullVisitorId', 'visitNumber', 'transaction_revenue', 'device_category', 'user_type']].head())

         fullVisitorId     visitId  visitNumber  total_transactions  \
0  3418334011779872055  1501591568            1                   0   
1  2474397855041322408  1501589647            2                   0   
2  5870462820713110108  1501616621            1                   0   
3  9397809171349480379  1501601200            1                   0   
4  6089902943184578335  1501615525            1                   0   

   transaction_revenue  time_on_site        traffic_source    medium  \
0                  0.0             0              (direct)    (none)   
1                  0.0             0  analytics.google.com  referral   
2                  0.0             0  analytics.google.com  referral   
3                  0.0             0  analytics.google.com  referral   
4                  0.0             0    adwords.google.com  referral   

  device_category  
0         desktop  
1         desktop  
2         desktop  
3         desktop  
4         desktop  
         fullVisitor
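
The preparation steps above can be checked on a small toy frame. This is a minimal sketch, not the notebook's actual data: the values in `toy` are made up, and `errors="coerce"` and `np.where` are assumptions added for illustration (the original cell uses plain `pd.to_numeric` and `apply`).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mirroring the columns used in the cell above
toy = pd.DataFrame({
    "visitNumber": [1, 2, 1, 3],
    "transaction_revenue": ["0", "25.5", None, "100"],
    "time_on_site": [0, 120, 30, 300],
})

# dropna returns a new DataFrame; without the assignment, toy would be unchanged
toy = toy.dropna(subset=["transaction_revenue", "time_on_site"])

# errors="coerce" (an assumption, not in the original cell) turns
# unparseable strings into NaN instead of raising an exception
toy["transaction_revenue"] = pd.to_numeric(toy["transaction_revenue"], errors="coerce")

# log1p computes log(1 + x), so zero revenue maps to exactly 0
toy["log_revenue"] = np.log1p(toy["transaction_revenue"])

# Vectorised alternative to apply(lambda ...): first visit is 'New'
toy["user_type"] = np.where(toy["visitNumber"] == 1, "New", "Returning")

print(toy)
```

The row with the missing revenue is dropped, and the zero-revenue row keeps a `log_revenue` of 0, which is why `log1p` is a common choice when most sessions have no revenue.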

## Conduct Statistical Tests ##

### A. T-Test: New vs. Returning Users ###

In [7]:
from scipy.stats import ttest_ind

# Separate log-transformed transaction revenue for new and returning users
new_users = df[df['user_type'] == 'New']['log_revenue']
returning_users = df[df['user_type'] == 'Returning']['log_revenue']

# Perform Welch's t-test (equal_var=False does not assume equal variances)
t_stat, p_value = ttest_ind(new_users, returning_users, equal_var=False)
print(f"T-test Statistic: {t_stat}, P-Value: {p_value}")

T-test Statistic: -4.939448852768216, P-Value: 9.735152981955357e-07


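Interpretation: The p-value (≈ 9.7e-07) is well below 0.05, so H₀ is rejected: average log-transformed transaction revenue differs significantly between new and returning users. The negative t-statistic indicates that the mean for new users is lower than the mean for returning users.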
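
To make clear what `equal_var=False` computes, the Welch statistic can be reproduced by hand. The sketch below uses synthetic samples (`a` and `b` are made-up arrays, not the notebook's revenue data) and compares the manual result with `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and variances (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.3, scale=2.0, size=80)

# Squared standard error of each sample mean, using the unbiased variance
se2_a = a.var(ddof=1) / a.size
se2_b = b.var(ddof=1) / b.size

# Welch's t-statistic: difference in means over the combined standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# Reference result from SciPy
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print(f"manual: t={t_manual:.6f}, p={p_manual:.6g}")
print(f"scipy:  t={t_scipy:.6f}, p={p_scipy:.6g}")
```

The sign of the statistic follows the argument order: a negative value means the first sample has the lower mean.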

### B. ANOVA: Transaction Revenue Across Device Categories ###

In [10]:
from scipy.stats import f_oneway

mobile_revenue = df[df['device_category'] == 'mobile']['log_revenue']
desktop_revenue = df[df['device_category'] == 'desktop']['log_revenue']
tablet_revenue = df[df['device_category'] == 'tablet']['log_revenue']

# Perform ANOVA
f_stat, p_value = f_oneway(mobile_revenue, desktop_revenue, tablet_revenue)
print(f"F-test Statistic: {f_stat}, P-Value: {p_value}")

F-test Statistic: 7.93782142946307, P-Value: 0.00036586619073722474


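Interpretation: The p-value (≈ 0.0004) is below 0.05, so H₀ is rejected: mean log-transformed transaction revenue differs for at least one device category. ANOVA alone does not identify which categories differ.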
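
Printing a `GroupBy` object directly shows only its type and memory address, so per-group summaries need an aggregation. The sketch below, on a made-up `toy` frame with the same column names, shows one way to summarise each device category and to build the ANOVA inputs from the grouping instead of filtering each category by hand.

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data (hypothetical values) with the column names used in this notebook
toy = pd.DataFrame({
    "device_category": ["desktop"] * 4 + ["mobile"] * 4 + ["tablet"] * 4,
    "log_revenue": [2.0, 2.5, 3.0, 2.2, 1.0, 1.4, 0.8, 1.1, 1.9, 2.1, 1.7, 2.4],
})

# Per-group size, mean, and standard deviation
summary = toy.groupby("device_category")["log_revenue"].agg(["count", "mean", "std"])
print(summary)

# One array per category, whatever categories happen to be present
groups = [values.to_numpy() for _, values in toy.groupby("device_category")["log_revenue"]]

# One-way ANOVA across all groups
f_stat, p_value = f_oneway(*groups)
print(f"F-test Statistic: {f_stat}, P-Value: {p_value}")
```

Building the groups from `groupby` avoids silently omitting a category that is present in the data but not listed explicitly.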

### C. Chi-Square Test: Device Category vs. User Type ###

In [13]:
from scipy.stats import chi2_contingency

# Create a contingency table of device category by user type
contingency_table = pd.crosstab(df['device_category'], df['user_type'])
print("Contingency table:")
print(contingency_table)

# Run the chi-square test of independence
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-Square Statistic: {chi2}, P-Value: {p}")

Contingency table:
user_type         New  Returning
device_category                 
desktop          1222        520
mobile            580        145
tablet             70         19
Chi-Square Statistic: 26.724478073626763, P-Value: 1.5734513538415369e-06


Interpretation: The p-value (≈ 1.6e-06) is below 0.05, so H₀ is rejected: device category and user type are associated rather than independent.
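
The test can be re-run directly from the counts printed above, which also makes it easy to check the expected frequencies and gauge the strength of the association. The sketch below is a supplementary check: the rule of thumb of at least 5 expected observations per cell and the Cramér's V effect size are additions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table printed above
observed = pd.DataFrame(
    {"New": [1222, 580, 70], "Returning": [520, 145, 19]},
    index=["desktop", "mobile", "tablet"],
)

# Chi-square test of independence on the observed counts
chi2, p, dof, expected = chi2_contingency(observed)
print(f"Chi-Square Statistic: {chi2}, P-Value: {p}, dof: {dof}")

# Common rule of thumb (an assumption here): every expected count is at least 5
min_expected = expected.min()
print(f"Smallest expected count: {min_expected:.2f}")

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
n = observed.to_numpy().sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"Cramér's V: {cramers_v:.3f}")
```

With these counts the statistic matches the value reported above, and Cramér's V comes out at about 0.10, which suggests the association is statistically significant but weak.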


In [14]:
df.to_csv('engineered_dataset.csv', index=False)