### Hypothesis Generation

1. Product attributes such as weight have a significant impact on sales.

2. Store location and size have a significant impact on sales.

3. Sales of a product in one store are correlated with its sales in other stores.

4. The sales of a product are affected by the price of the product.

5. The sales of a product are higher when it is displayed prominently in the store.

##### Significance level (alpha) for all tests will be 0.05
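The decision rule used in every test below can be written as a small helper. This is only a sketch; the name `decide` is hypothetical and does not appear in the notebook cells.

```python
ALPHA = 0.05  # significance level used for all tests

def decide(p_value, alpha=ALPHA):
    """Return the test decision for a given p-value (hypothetical helper)."""
    return "Reject H0" if p_value < alpha else "Fail to reject H0"

print(decide(0.286))    # Fail to reject H0
print(decide(1.6e-44))  # Reject H0
```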


In [1]:
# import pandas
import pandas as pd

# load the training data
train = pd.read_csv("BMPO_train.csv")

# summarize columns, non-null counts, and dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


#### 1st Hypothesis
H0: There is no significant impact of product attributes such as weight on sales.

H1: There is a significant impact of product attributes such as weight on sales.

Since we need to test the relationship between two numeric variables, a Pearson correlation test will be performed.


In [2]:
from scipy.stats import pearsonr

# impute missing Item_Weight values with the column mean
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())

# calculate the correlation coefficient and p-value between Item_Weight and sales
corr, p_value = pearsonr(train['Item_Weight'], train['Item_Outlet_Sales'])
print("Pearson correlation coefficient:", corr)
print("p-value:", p_value)


Pearson correlation coefficient: 0.011550000817703888
p-value: 0.28634393544046116


Since the p-value (0.286) is greater than 0.05, we conclude that there is no statistically significant linear association between Item_Weight and Item_Outlet_Sales.
Hence we fail to reject H0.
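The test above is run after filling missing weights with the column mean. As a rough sensitivity check, the same correlation can be computed using only rows where the weight was actually observed. The sketch below assumes a small synthetic frame (the values are made up, not taken from `BMPO_train.csv`), and the helper name `pearson_observed_only` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pearson_observed_only(df, x_col, y_col):
    """Pearson r and p-value using only rows where both columns are present."""
    complete = df[[x_col, y_col]].dropna()
    r, p = pearsonr(complete[x_col], complete[y_col])
    return r, p, len(complete)

# synthetic stand-in data (assumption: not the real training set)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Item_Weight": rng.uniform(5, 20, 200),
    "Item_Outlet_Sales": rng.uniform(100, 5000, 200),
})
demo.loc[demo.index[:40], "Item_Weight"] = np.nan  # simulate missing weights

r_obs, p_obs, n_obs = pearson_observed_only(demo, "Item_Weight", "Item_Outlet_Sales")
print("rows used:", n_obs)
print("r:", r_obs, "p-value:", p_obs)
```

If the observed-only result and the mean-imputed result lead to the same decision, the imputation choice is unlikely to be driving the conclusion.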


#### 2nd Hypothesis

H0: Store location and size have no significant impact on sales.

H1: Store location and size have a significant impact on sales.

Since both predictors are categorical, we can use an Analysis of Variance (ANOVA) test.

In [3]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# fit a linear model with Item_Outlet_Sales as the dependent variable and
# Outlet_Location_Type and Outlet_Size as categorical independent variables
# (rows with a missing Outlet_Size are dropped by the formula interface)
model = ols('Item_Outlet_Sales ~ C(Outlet_Location_Type) + C(Outlet_Size)', data=train).fit()

# type II ANOVA table
table = sm.stats.anova_lm(model, typ=2)
print(table)



                               sum_sq      df           F        PR(>F)
C(Outlet_Location_Type)  5.773847e+08     2.0  102.528957  1.597219e-44
C(Outlet_Size)           5.257626e+08     2.0   93.362181  1.150104e-40
Residual                 1.719839e+10  6108.0         NaN           NaN


Since the p-values for both Outlet_Location_Type and Outlet_Size are less than 0.05, we conclude that there is a statistically significant association between a store's location and size and its sales. Hence we reject H0.
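For a single factor, the same idea can be sketched with `scipy.stats.f_oneway`, which compares the mean of one sample per group. The example below is an illustration on synthetic data; the tier labels, group means, and sample sizes are assumptions, not values from the training set.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# synthetic stand-in data: three location tiers with different mean sales
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Outlet_Location_Type": np.repeat(["Tier 1", "Tier 2", "Tier 3"], 100),
    "Item_Outlet_Sales": np.concatenate([
        rng.normal(1800, 400, 100),
        rng.normal(2300, 400, 100),
        rng.normal(2200, 400, 100),
    ]),
})

# one array of sales per location tier
groups = [g["Item_Outlet_Sales"].to_numpy()
          for _, g in demo.groupby("Outlet_Location_Type")]

f_stat, p_value = f_oneway(*groups)
print("F:", f_stat, "p-value:", p_value)
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```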


#### 3rd Hypothesis 

H0: Sales of a product in one store are not correlated with its sales in other stores.

H1: Sales of a product in one store are correlated with its sales in other stores.

A correlation test will be performed.

In [4]:
# select the columns we need for the hypothesis test
sales_cols = ['Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales']
sales_df = train[sales_cols].copy()  # copy so the assignments below do not raise SettingWithCopyWarning

# encode the string identifiers as integer codes
sales_df['Item_Identifier'] = pd.factorize(sales_df['Item_Identifier'])[0]
sales_df['Outlet_Identifier'] = pd.factorize(sales_df['Outlet_Identifier'])[0]

# total sales per item across all stores, correlated with the same series
# shifted by one item (the first shifted value is filled with 0)
item_sales = sales_df.groupby('Item_Identifier')['Item_Outlet_Sales'].sum()
corr, p_value = pearsonr(item_sales.values, item_sales.shift().fillna(0).values)
print("Pearson correlation coefficient:", corr)
print("p-value:", p_value)


Pearson correlation coefficient: 0.0028044467900002483
p-value: 0.9118997125507567


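Since the p-value is greater than 0.05, this test finds no statistically significant correlation, so we fail to reject H0. Note that the test compares each item's total sales with the total of the neighbouring item, so it is only an indirect check of whether a product's sales in one store are correlated with its sales in other stores.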
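A more direct way to look at cross-store correlation, offered here as a sketch rather than as the notebook's method, is to pivot the data so that each row is an item and each column is an outlet, then correlate the outlet columns. The data below is synthetic (item names, outlet names, and sales values are all made up) and is generated so that outlets share an item-level demand component.

```python
import numpy as np
import pandas as pd

# synthetic stand-in data: 30 items sold in 3 outlets (values are made up)
rng = np.random.default_rng(2)
items = [f"ITEM{i:02d}" for i in range(30)]
outlets = ["OUT_A", "OUT_B", "OUT_C"]
base = rng.uniform(500, 4000, len(items))  # item-level demand shared by outlets
rows = [
    {"Item_Identifier": item,
     "Outlet_Identifier": outlet,
     "Item_Outlet_Sales": base[i] + rng.normal(0, 300)}
    for i, item in enumerate(items)
    for outlet in outlets
]
demo = pd.DataFrame(rows)

# one row per item, one column per outlet
wide = demo.pivot_table(index="Item_Identifier",
                        columns="Outlet_Identifier",
                        values="Item_Outlet_Sales")

# pairwise Pearson correlation between outlets, computed over shared items
outlet_corr = wide.corr(method="pearson")
print(outlet_corr.round(3))
```

On real data, items that an outlet does not stock become NaN in `wide`, and `DataFrame.corr` uses pairwise-complete observations, so each coefficient is based only on items sold by both outlets.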


#### 4th Hypothesis

H0: The sales of a product are not affected by the price of the product.

H1: The sales of a product are affected by the price of the product.

We will calculate the correlation between the price and sales columns, then perform a two-sample t-test comparing the sales of high-priced and low-priced products.

In [5]:
from scipy.stats import ttest_ind

# calculate the Pearson correlation coefficient between the price and sales columns
corr, _ = pearsonr(train['Item_MRP'], train['Item_Outlet_Sales'])

# split sales into high-priced and low-priced groups at the mean Item_MRP,
# then compare the two group means with a two-sample t-test
mean_mrp = train['Item_MRP'].mean()
price_high_sales = train.loc[train['Item_MRP'] >= mean_mrp, 'Item_Outlet_Sales']
price_low_sales = train.loc[train['Item_MRP'] < mean_mrp, 'Item_Outlet_Sales']
t, p = ttest_ind(price_high_sales, price_low_sales)


# checking results
alpha = 0.05
if p < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")

Reject H0


Since the p-value is less than 0.05, we conclude that there is a statistically significant difference in sales between high-priced and low-priced products. Hence we reject H0.
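The p-value returned by `pearsonr` is itself a test of the correlation: under H0 (no correlation), the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom. The sketch below checks this on synthetic data; the price range, slope, and noise level are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# synthetic stand-in data (values are made up, not the real training set)
rng = np.random.default_rng(3)
price = rng.uniform(30, 270, 60)
sales = 10 * price + rng.normal(0, 1500, 60)

r, p_scipy = stats.pearsonr(price, sales)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
n = len(price)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

print("r:", r)
print("t:", t_stat)
print("p-value (pearsonr):", p_scipy)
print("p-value (t distribution):", p_manual)
```

The two p-values should agree up to floating-point error, which means the correlation can be tested directly without splitting the data into two price groups.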

#### 5th Hypothesis

H0: The sales of a product are not affected when it is displayed prominently in the store.

H1: The sales of a product are higher when it is displayed prominently in the store.

In [6]:
# perform linear regression of sales on Item_Visibility
X = train['Item_Visibility']
y = train['Item_Outlet_Sales']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# two-sided p-value of the Item_Visibility coefficient
model.pvalues['Item_Visibility']


9.041287179920645e-33

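Since the p-value is less than 0.05, we conclude that there is a statistically significant linear relationship between Item_Visibility and sales. Hence we reject H0. Because H1 is directional, the Item_Visibility coefficient in `model.params` must also be positive before concluding that more prominent display goes with higher sales.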
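For a directional hypothesis, the decision depends on the sign of the slope as well as on the p-value. The sketch below uses `scipy.stats.linregress`, a swapped-in equivalent of one-predictor OLS rather than the statsmodels call in the cell above, on synthetic data that is deliberately generated with a negative slope (all values are made up).

```python
import numpy as np
from scipy.stats import linregress

# synthetic stand-in data with a negative slope (values are made up)
rng = np.random.default_rng(4)
visibility = rng.uniform(0.0, 0.3, 300)
sales = 2500 - 3000 * visibility + rng.normal(0, 400, 300)

fit = linregress(visibility, sales)  # two-sided p-value for slope == 0

# one-sided p-value for H1: slope > 0, derived from the two-sided value
if fit.slope > 0:
    p_one_sided = fit.pvalue / 2
else:
    p_one_sided = 1 - fit.pvalue / 2

print("slope:", fit.slope)
print("two-sided p-value:", fit.pvalue)
print("one-sided p-value (slope > 0):", p_one_sided)
```

In this synthetic case the two-sided p-value is small while the one-sided p-value is close to 1, so a significant result alone would not support "sales are higher".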