# Marketing Modeling Mix in Python

![image.png](https://github.com/Chesar832/Marketing_modeling_mix/blob/main/img/marketing_banner.png?raw=true)

*El presente tiene por objetivo analizar la data proporcionada por Kaggle para el √°rea de Marketing; se busca encontrar "**Qu√© inversi√≥n publicitaria impulsa en mayor porporci√≥n a las ventas**"*

## üì§ Librerias

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
from pandas_profiling import ProfileReport
import altair as alt

## üíæ Carga de datos

In [2]:
mrk_data = pd.read_csv('data/new_dataset.csv', delimiter=';')

In [3]:
mrk_data.head(5)

Unnamed: 0,WeekDate,Influencer,TV,Radio,Social Media,Sales
0,30/12/2017,Macro,444,157.102423,23.13004,1575.835702
1,30/12/2017,Mega,195,71.755979,18.409851,704.004553
2,30/12/2017,Micro,271,86.655446,17.237855,975.461079
3,30/12/2017,Nano,354,123.827561,21.4,1257.769598
4,6/01/2018,Macro,268,90.588322,12.70975,943.382384


## üìä EDA

El [dataset](https://docs.google.com/spreadsheets/d/1d_XEzDSvhkfWHeNj3Ux5Y5KerjYGEDDB/edit?usp=sharing&ouid=100459174823708459699&rtpof=true&sd=true) con el que se va a trabajar contiene las siguientes variables:

|  **VARIABLE**  |                                          **DESCRIPCI√ìN**                                      |
| :---           |                                                                                               | 
| WeekDate       | Fecha de la semana en la que se culmin√≥ la promoci√≥n                                          |
| TV             | Presupuesto de promoci√≥n televisiva *(en millones)*                                           |
| Social Media   | Presupuesto de promoci√≥n de redes sociales *(en millones)*                                    |
| Radio          | Presupuesto de promoci√≥n radiof√≥nica *(en millones)*                                          |
| Influencer     | Tipo de influencer con el que colabora en la promoci√≥n (Mega, Macro, Nano o Micro influencer) |
| Sales          | Ventas obtenidas *(en millones)*                                                              |

In [4]:
mrk_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 655 entries, 0 to 654
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   WeekDate      655 non-null    object 
 1   Influencer    655 non-null    object 
 2   TV            655 non-null    int64  
 3   Radio         655 non-null    float64
 4   Social Media  655 non-null    float64
 5   Sales         655 non-null    float64
dtypes: float64(3), int64(1), object(2)
memory usage: 30.8+ KB


Observaciones:
1. No existen valores nulos en ning√∫n registro.
2. Se tienen √∫nicamente 655 registros.
3. Las variable WeekDate tiene un formato incorrecto para la realizaci√≥n del an√°lisis.

### Corrigiendo formatos

In [5]:
mrk_data['WeekDate'] = pd.to_datetime(mrk_data['WeekDate'])

In [6]:
mrk_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 655 entries, 0 to 654
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   WeekDate      655 non-null    datetime64[ns]
 1   Influencer    655 non-null    object        
 2   TV            655 non-null    int64         
 3   Radio         655 non-null    float64       
 4   Social Media  655 non-null    float64       
 5   Sales         655 non-null    float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 30.8+ KB


Ahora el formato de la variable Weekdate es el adecuado.

### Generando el reporte

In [7]:
report = ProfileReport(mrk_data, title="Marketing modeling mix", explorative=True)
report.to_file("marketing-report.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

Del reporte generado:
- Ninguna variable contiene valores faltantes.
- No existen filas que contengan valores duplicados.
- Las variables TV e influencer est√°n fuertemente correlacionadas(esto puede ser ocasionado por la naturaleza publicitaria de los influencers en los medios audiovisuales).
- La distribuci√≥n de los tipos de influencers es casi sim√©trica.
- La distribuci√≥n de los presupuestos de publicidad televisiva, radiof√≥nica, por redes sociales y las ventas es casi sim√©trica.
- Los presupuestos para marketing televisivo est√°n m√°s fuertemente relacionados a las ventas generadas que las otras variables.

In [8]:
mrk_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
TV,655.0,376.569466,70.095769,168.0,330.0,381.0,426.0,582.0
Radio,655.0,126.651154,25.916728,54.29091,109.738682,127.398413,145.510209,213.556056
Social Media,655.0,23.171277,5.787302,4.74029,19.029863,23.026443,27.066847,44.79938
Sales,655.0,1341.683214,248.858195,590.336016,1178.471164,1357.921625,1518.550852,2061.018042


## ‚úÇ Divisi√≥n del dataset

In [9]:
mrk_data_for_ML = mrk_data[['TV','Radio','Social Media','Influencer','Sales']]
mrk_data_for_ML['Year'] = mrk_data['WeekDate'].dt.year
mrk_data_for_ML['Month'] = mrk_data['WeekDate'].dt.month
mrk_data_for_ML['Day'] = mrk_data['WeekDate'].dt.day

In [10]:
mrk_data_for_ML.sample(5)

Unnamed: 0,TV,Radio,Social Media,Influencer,Sales,Year,Month,Day
498,353,95.808682,11.227599,Micro,1252.466076,2020,5,16
437,291,86.762552,17.465892,Mega,1032.859885,2020,1,2
365,390,138.738405,29.099736,Mega,1387.767088,2019,9,28
536,269,100.416021,23.174249,Macro,961.382921,2020,7,25
398,324,110.433432,16.396152,Micro,1152.071365,2019,11,23


In [11]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(mrk_data_for_ML, test_size=0.2, shuffle=True, random_state=41)

distributions= np.array([len(train), len(test)])
print(distributions)

[524 131]


## üõ† Feature Engineering

### Pipelines

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer

In [13]:
# Para el one hot encoding
one_hot_encode = ColumnTransformer([
        ('one_hot_encode',
        OneHotEncoder(sparse=False),
        ['Influencer'])
])

In [14]:
# Para el feature scaling y el impute
impute_and_scale = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),#Se elige la media por la distribuci√≥n sin sesgos de las variables continuas
    ('scale', StandardScaler())
])

standard_scaling = ColumnTransformer([
    ('standard_scaling', impute_and_scale, ['TV','Radio','Social Media'])
])

In [15]:
# Columnas sin tranformaciones
passthrough = ColumnTransformer([('passthrough', 'passthrough', ["Year",'Month','Day'])])

In [16]:
# Ensamble de todos los pipelines
pipe = Pipeline([
        ('features',
        FeatureUnion([
            ('one_hot_encode', one_hot_encode),
            ('standard_scaling', standard_scaling),
            ('just_pass', passthrough)
        ]))
])

In [17]:
from sklearn import set_config

set_config(display="diagram")
pipe

In [18]:
pipe.fit(train)

pd.DataFrame(pipe.transform(train)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,1.0,0.0,1.557084,1.369182,1.898006,2018.0,9.0,6.0
1,0.0,1.0,0.0,0.0,-0.218506,-1.086092,0.469709,2019.0,4.0,20.0
2,1.0,0.0,0.0,0.0,-0.937198,-0.501318,-0.261281,2018.0,1.0,27.0
3,0.0,0.0,1.0,0.0,-0.162139,-0.465966,0.290526,2020.0,9.0,26.0
4,1.0,0.0,0.0,0.0,1.514808,1.026597,0.028726,2018.0,1.0,12.0


In [19]:
pd.DataFrame(pipe.transform(test)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.0,0.0,0.0,1.261152,1.073237,1.692,2019.0,5.0,18.0
1,0.0,1.0,0.0,0.0,0.147885,1.953782,-0.117886,2020.0,9.0,5.0
2,0.0,0.0,0.0,1.0,0.260621,0.302235,-0.222939,2018.0,7.0,21.0
3,0.0,1.0,0.0,0.0,-0.401702,-0.729274,-0.858365,2020.0,6.0,27.0
4,0.0,0.0,0.0,1.0,-0.091679,0.171046,-0.449915,2019.0,4.0,5.0


## üìä Modelado

In [20]:
from sklearn.linear_model import LinearRegression

In [21]:
lr = LinearRegression()

In [22]:
predicting_pipeline = Pipeline([
    ('feature', pipe),
    ('estimator', lr)
])

In [23]:
predicting_pipeline.fit(train, train['Sales'])

In [24]:
train_pred = predicting_pipeline.predict(train)
test_pred  = predicting_pipeline.predict(test)

In [25]:
pd.DataFrame({'Original':test['Sales'], 'Predecido':test_pred}).sample(5)

Unnamed: 0,Original,Predecido
569,1038.048216,1048.150131
212,1364.791396,1358.777681
593,1070.694402,1076.947313
643,1201.253186,1206.25337
347,1438.831698,1442.717287


In [26]:
lr.coef_

array([ 2.37173209e+00, -1.01318793e+00,  1.39355370e-01, -1.49789953e+00,
        2.31809887e+02,  1.78424946e+01,  1.95921225e+00, -1.08029283e+00,
       -3.65743628e-01, -2.44490447e-01])

## üìãEvaluando el modelo

In [27]:
from sklearn import metrics

In [28]:
def print_metrics(true, predicted):
    print('-'*50)
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    print('MAE (Mean Absolute Error):', mae)
    print('MSE (Mean Squared Error):', mse)
    print('RMSE (Root Mean Squared Error):', rmse)

In [29]:
print('EVALUANDO EL SET DE TESTEO')
print_metrics(test['Sales'], test_pred)

EVALUANDO EL SET DE TESTEO
--------------------------------------------------
MAE (Mean Absolute Error): 9.337514590876587
MSE (Mean Squared Error): 134.63973111480027
RMSE (Root Mean Squared Error): 11.603436177046879


In [30]:
print('EVALUANDO EL SET DE ENTRENAMIENTO')
print_metrics(train['Sales'], train_pred)

EVALUANDO EL SET DE ENTRENAMIENTO
--------------------------------------------------
MAE (Mean Absolute Error): 16.131739847956098
MSE (Mean Squared Error): 1555.187243823233
RMSE (Root Mean Squared Error): 39.43586240749951


In [31]:
results_train = pd.DataFrame(train['Sales'])
results_train = results_train.assign(Sales_pred=train_pred)
results_train['Absolute_Dif'] = results_train.apply(lambda x: abs(x['Sales']-x['Sales_pred']),axis=1)
results_train

Unnamed: 0,Sales,Sales_pred,Absolute_Dif
94,1715.992132,1730.138858,14.146726
273,1291.359328,1268.105950,23.253378
16,1096.079007,1114.358884,18.279877
574,1289.100231,1288.662732,0.437499
192,1721.723788,1714.255368,7.468420
...,...,...,...
407,1188.777515,1180.202779,8.574736
601,906.979628,907.187478,0.207850
243,1566.771283,1559.649274,7.122009
321,1534.993964,1533.170547,1.823417


In [32]:
alt.Chart(results_train).mark_bar(color="#1A9873",stroke='#000000',strokeOpacity=0.5).encode(
    x=alt.X('Sales:Q', title='Ventas en Millones'),
    y=alt.Y('Absolute_Dif:Q', title='Error Absoluto en Millones', axis=alt.Axis(titlePadding=20)),
    tooltip=[alt.Tooltip('Sales:Q',title='Ventas originales'),
               alt.Tooltip('Sales_pred:Q',title='Ventas predecidas'),
            alt.Tooltip('Absolute_Dif:Q',title='Error absoluto')],
).properties(
    title='Histograma del error absoluto de las ventas en el set de entrenamiento',
    width = 800,
    height = 300
).configure_title(
    fontSize = 20,
    anchor = 'start',
    dy = -10,
    dx = 45
)

In [33]:
sales_original_chart = alt.Chart(results_train).mark_bar(color="#995DB3",stroke='#000000',strokeOpacity=0.2).encode(
    x=alt.X('Sales:Q', title='', bin=True),
    y=alt.Y('count():Q', title='', axis=alt.Axis(titlePadding=20)),
)

In [34]:
sales_pred_chart = alt.Chart(results_train).mark_bar(color="#73F65A",stroke='#000000',strokeOpacity=0.5,opacity=0.7).encode(
    x=alt.X('Sales_pred:Q', title='Ventas en Millones', bin=True),
    y=alt.Y('count():Q', title='Frecuencia Relativa', axis=alt.Axis(titlePadding=20)),
)

In [35]:
(sales_original_chart+sales_pred_chart).properties(
    title='Histograma de las ventas originales vs ventas predecidas',
    width=700
).configure_title(
    fontSize = 20,
    anchor = 'start',
    dy = -10,
    dx = 45
)

## üíæ Guardando el pipeline

In [36]:
from joblib import dump, load

dump(predicting_pipeline, 'mrk_modeling_mix.model') 

['mrk_modeling_mix.model']

## üîç Prediciendo las ventas con el modelo dise√±ado

In [37]:
saved_pipeline = load('mrk_modeling_mix.model')

In [38]:
tv = 400
radio = 150
social_media = 40
influencer = "Micro"
year = 2021
month = 3
day = 1

In [39]:
mi_presupuesto = pd.DataFrame({
    "TV": [tv], "Radio": [radio], "Social Media": [social_media], "Influencer": [influencer], 
    "Year": [year], "Month": [month], "Day": [day]
})

In [40]:
sale_pred = saved_pipeline.predict(mi_presupuesto).squeeze()

print(sale_pred)

1442.9756420000667
