In today's pair programming session we will use the dataset you saved in the normalization and standardization pair programming.

So far you have been evaluating the characteristics of your dataset and have done a thorough exploration. It is time to run your first ANOVA!

In today's exercise you will run an ANOVA on your data and interpret the results.

📌 NOTE: Your data may not fit or meet all the assumptions. That is fine: run the ANOVA and interpret the results anyway.

In upcoming lessons we will learn what we can do in that situation.
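As a reminder of what "meeting the assumptions" means in practice, here is a minimal sketch for a simple one-way setting. It runs on synthetic data, and the group names `A`, `B`, `C` and all values are hypothetical, not taken from the supermarket dataset. It uses the Shapiro-Wilk test for normality within each group and Levene's test for equal variances across groups.

```python
import numpy as np
from scipy import stats

# Synthetic groups (hypothetical values, for illustration only).
rng = np.random.default_rng(42)
groups = {
    'A': rng.normal(loc=0.0, scale=1.0, size=100),
    'B': rng.normal(loc=0.2, scale=1.0, size=100),
    'C': rng.normal(loc=0.0, scale=1.0, size=100),
}

# Normality within each group: Shapiro-Wilk returns (statistic, p-value).
shapiro_p = {}
for name, values in groups.items():
    _, p_value = stats.shapiro(values)
    shapiro_p[name] = p_value

# Homogeneity of variances across groups: Levene's test.
levene_stat, levene_p = stats.levene(*groups.values())

print(shapiro_p)
print(levene_p)
```

A small p-value in either test suggests that the corresponding assumption may not hold for that data.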

In [2]:
import pandas as pd
import numpy as np
import datetime as dt

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_columns = None 

In [3]:
df = pd.read_csv('data/supermarket_estand.csv', index_col=0)
df.head(2)

Unnamed: 0,invoice_id,city,customer_type,gender,product_line,unit_price,total,date,time,payment,rating,quantity_boxcox
0,750-67-8428,Yangon,Member,Female,Health and beauty,0.431869,0.850677,1/5/2019,13:08,Ewallet,0.7,0.385456
1,226-31-3081,Naypyitaw,Normal,Female,Electronic accessories,-0.886596,-0.500473,3/8/2019,10:29,Cash,0.866667,0.0


In [4]:
# Before running the ANOVA, parse the date column and create a new column with the month name

df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['month'] = df['date'].dt.strftime('%B')

In [5]:
df.dtypes

invoice_id                 object
city                       object
customer_type              object
gender                     object
product_line               object
unit_price                float64
total                     float64
date               datetime64[ns]
time                       object
payment                    object
rating                    float64
quantity_boxcox           float64
month                      object
dtype: object

In [6]:
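# Define a function that groups the hours into morning or afternoon.
# Beforehand we checked that our supermarkets open from 10 to 20.
def agrupar_turno(time):
    # 'time' is a string such as '13:08'; keep only the hour as an integer
    hora = int(time.split(':')[0])

    # Hours 10 to 14 (up to 14:59) count as morning, the rest as afternoon
    if 10 <= hora <= 14:
        return 'Morning'
    else:
        return 'Afternoon'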

In [7]:
df['shift'] = df['time'].apply(agrupar_turno)

In [8]:
df.head(1)

Unnamed: 0,invoice_id,city,customer_type,gender,product_line,unit_price,total,date,time,payment,rating,quantity_boxcox,month,shift
0,750-67-8428,Yangon,Member,Female,Health and beauty,0.431869,0.850677,2019-01-05,13:08,Ewallet,0.7,0.385456,January,Morning


In [9]:
df['shift'].unique()

array(['Morning', 'Afternoon'], dtype=object)
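The same grouping can also be written without `apply`. The sketch below assumes `time` holds `HH:MM` strings as in the data above and uses a small hypothetical frame called `toy` with made-up times instead of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the 'time' column (illustrative values only).
toy = pd.DataFrame({'time': ['10:29', '13:08', '14:59', '15:00', '19:45']})

# Extract the hour as an integer from the 'HH:MM' string.
hour = toy['time'].str.split(':').str[0].astype(int)

# Hours 10 to 14 (inclusive) are labelled 'Morning', everything else 'Afternoon'.
toy['shift'] = np.where(hour.between(10, 14), 'Morning', 'Afternoon')

print(toy)
```

This follows the same rule as `agrupar_turno`, so `14:59` is still a morning sale and `15:00` an afternoon one.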

In [10]:
# Run the ANOVA
# Our response variable is 'quantity' (the number of products on each ticket), in its Box-Cox-transformed form

lm = ols('quantity_boxcox ~ city + customer_type + gender + product_line + unit_price + total + payment + rating + month + shift', data=df).fit()
sm.stats.anova_lm(lm)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
city,2.0,0.05164,0.02582,0.373085,0.6887042
customer_type,1.0,0.075237,0.075237,1.087123,0.2973662
gender,1.0,2.058056,2.058056,29.737661,6.259348e-08
product_line,5.0,2.662171,0.532434,7.69335,4.158294e-07
payment,2.0,0.01056,0.00528,0.076289,0.9265536
month,2.0,0.153887,0.076943,1.111785,0.3293849
shift,1.0,0.226816,0.226816,3.277352,0.07054774
unit_price,1.0,0.072439,0.072439,1.046697,0.3065209
total,1.0,288.439557,288.439557,4167.776183,0.0
rating,1.0,0.195144,0.195144,2.819712,0.09343123


In [11]:
lm.summary()

0,1,2,3
Dep. Variable:,quantity_boxcox,R-squared:,0.812
Model:,OLS,Adj. R-squared:,0.809
Method:,Least Squares,F-statistic:,249.8
Date:,"Tue, 05 Sep 2023",Prob (F-statistic):,0.0
Time:,16:07:10,Log-Likelihood:,-74.53
No. Observations:,1000,AIC:,185.1
Df Residuals:,982,BIC:,273.4
Df Model:,17,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0894,0.032,-2.810,0.005,-0.152,-0.027
city[T.Naypyitaw],-0.0229,0.021,-1.107,0.268,-0.063,0.018
city[T.Yangon],0.0025,0.020,0.124,0.901,-0.038,0.043
customer_type[T.Normal],-0.0102,0.017,-0.608,0.543,-0.043,0.023
gender[T.Male],-0.0125,0.017,-0.743,0.458,-0.046,0.021
product_line[T.Fashion accessories],-0.0394,0.028,-1.386,0.166,-0.095,0.016
product_line[T.Food and beverages],-0.0136,0.029,-0.475,0.635,-0.070,0.042
product_line[T.Health and beauty],-0.0037,0.029,-0.124,0.901,-0.062,0.054
product_line[T.Home and lifestyle],-0.0188,0.029,-0.647,0.518,-0.076,0.038

0,1,2,3
Omnibus:,56.248,Durbin-Watson:,1.991
Prob(Omnibus):,0.0,Jarque-Bera (JB):,81.275
Skew:,-0.475,Prob(JB):,2.25e-18
Kurtosis:,4.024,Cond. No.,10.6


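Looking at the p-values in the ANOVA table at a 0.05 significance level, gender, product_line and total are significantly associated with our response variable; the remaining predictors, including unit_price (p ≈ 0.31), are not. Note that anova_lm uses sequential (type I) sums of squares by default, so these p-values depend on the order of the terms: in the coefficient table from lm.summary(), gender and the product_line levels shown are not individually significant once the other predictors are in the model.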
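One way to read the ANOVA table systematically is to compare every `PR(>F)` value against a significance level. The sketch below copies the p-values from the table above into a Series and assumes `alpha = 0.05`, a conventional choice that the notebook does not state explicitly.

```python
import pandas as pd

# p-values copied from the PR(>F) column of the ANOVA table above.
p_values = pd.Series({
    'city': 0.6887042,
    'customer_type': 0.2973662,
    'gender': 6.259348e-08,
    'product_line': 4.158294e-07,
    'payment': 0.9265536,
    'month': 0.3293849,
    'shift': 0.07054774,
    'unit_price': 0.3065209,
    'total': 0.0,
    'rating': 0.09343123,
})

alpha = 0.05  # assumed significance level

# Keep only the terms whose p-value falls below alpha.
significant = p_values[p_values < alpha].index.tolist()
print(significant)
```

With the real model, the same filter could be applied directly to the `PR(>F)` column of the DataFrame returned by `anova_lm`.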

In [13]:
# Save the changes to a new pickle

df.to_pickle('data/supermarket_3.pkl')