<a href="https://colab.research.google.com/github/brook-miller/mbai-417-data/blob/main/enterprise-data-quality/homework/homework2-answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# K Console 

### Situation
It's summer 2020 and we want to optimize production for the Kellogg Console for the holiday season.  

The K Console marketing team recently acquired “social data.” The social data looks across social media sources such as Reddit, Twitter, blogs, forums and even podcast transcripts to find mentions of the K Console.  The volume metric reports the weekly volume of social posts from a specific set of highly influential sources.  The sentiment metrics count the number of positive, negative, neutral and unassigned posts for each week.

The marketing team has had success in using social data to understand & respond to customer concerns, as well as identifying content ideas that resonate with influencers, but it has not been used as an input to forecasting future sales.

### Analysis

Should we invest the resources (>$100k estimated costs) of our data science team to build a data pipeline, enhance our forecast with social data and use it to deliver better forecasts for production planning?

### Data
The data is in 3 files: 
* k-console-activations-final.csv: the number of new activations by consumers. We don't know exact sales dates because the product is primarily sold through retail, but activations are a solid, if lagging, indicator of sales.
* k-console-social-sentiment-final.csv: 4 different measures of sentiment (positive, negative, neutral, unassigned)
* k-console-social-volume-final.csv: the volume of social media posts mentioning the K Console


In [50]:
#@title standard imports - we'll use in most EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

from datetime import datetime, timedelta
from dateutil.parser import parse
from google.colab import data_table
data_table.enable_dataframe_formatter()

# Load the source data into individual data frames

In [51]:
activations_df = pd.read_csv('https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/enterprise-data-quality/homework/k-console-activations-final.csv')
sentiment_df = pd.read_csv('https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/enterprise-data-quality/homework/k-console-social-sentiment-final.csv')
volume_df = pd.read_csv('https://raw.githubusercontent.com/brook-miller/mbai-417-data/main/enterprise-data-quality/homework/k-console-social-volume-final.csv')

# View each data frame, verify the data types are appropriate

In addition to casting the date values, note that one of the columns uses "," as a thousands separator, so it loads as text.  After stripping the commas, the pd.to_numeric function can convert that data to a numeric type.

```
pd.to_numeric(activations_dataframe["activations"].str.replace(",",""))
```
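As an alternative sketch, the separator can also be handled at load time with the `thousands` parameter of `pd.read_csv`. The two-row sample below is a hypothetical stand-in for the activations file; the quoting of the numbers is an assumption about the raw layout, not something the notebook shows.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the activations file; the quoted "41,292" style is assumed.
sample_csv = io.StringIO(
    'country,week_end_date,activations\n'
    'United States,2017-10-01,"41,292"\n'
    'United States,2017-10-08,"38,545"\n'
)

# thousands="," tells the parser to drop commas inside numeric fields,
# so the column arrives as a number instead of text.
sample_df = pd.read_csv(sample_csv, thousands=",", parse_dates=["week_end_date"])
print(sample_df.dtypes)
```

Either approach should end with a numeric activations column; checking `dtypes` or `info()` afterwards confirms it.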



In [52]:
activations_df['week_end_date'] = pd.to_datetime(activations_df['week_end_date'], utc=True)
sentiment_df['week_start_date'] = pd.to_datetime(sentiment_df['week_start_date'], utc=True)
volume_df['week_start_date'] = pd.to_datetime(volume_df['week_start_date'], utc=True)

activations_df["activations"] = pd.to_numeric(activations_df["activations"].str.replace(",","")).astype(float)
activations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 540 entries, 0 to 539
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   country        540 non-null    object             
 1   week_end_date  540 non-null    datetime64[ns, UTC]
 2   activations    540 non-null    float64            
dtypes: datetime64[ns, UTC](1), float64(1), object(1)
memory usage: 12.8+ KB


# Filter each data frame to the United States data only

In [53]:
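# .copy() makes each filtered frame independent, so adding columns later does not raise SettingWithCopyWarning
activations_df = activations_df[activations_df.country == 'United States'].copy()
sentiment_df = sentiment_df[sentiment_df.country == 'United States'].copy()
volume_df = volume_df[volume_df.country == 'United States'].copy()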

# Filter, pivot and merge the data into a single dataframe by joining on dates

You should have columns for each metric: activations, volume, positive, negative, neutral & unassigned.

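This section pivots the sentiment data frame so that each sentiment becomes its own column.  The activations data is keyed by week end date while the social data is keyed by week start date, so you'll need to derive a matching join key on the activations data frame by shifting its dates.  Use the pandas merge function to combine the 3 datasets into one.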

You may want to drop any unnecessary columns and/or reorder them for convenience.

```
activations_dataframe["new_column_name"] = activations_dataframe["week_end_date"].apply(lambda x : x - timedelta(days=X))
```
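Before trusting the merged result, it can help to check which weeks fail to match. The sketch below uses small synthetic frames (the numbers are made up for illustration) and an outer merge with `indicator=True`, which labels each row as present in both frames or only one.

```python
import pandas as pd

# Synthetic stand-ins for the pivoted sentiment data and the activations data.
social = pd.DataFrame({
    "week_start_date": pd.to_datetime(["2017-11-20", "2017-11-27", "2017-12-04"]),
    "positive": [10, 20, 30],
})
activations = pd.DataFrame({
    "week_end_date": pd.to_datetime(["2017-11-19", "2017-11-26", "2017-12-03"]),
    "activations": [100, 200, 300],
})

# Shift each Sunday week-end date back 6 days to the Monday that starts the same week.
activations["join_date"] = activations["week_end_date"] - pd.Timedelta(days=6)

# An outer merge keeps unmatched rows; the _merge column says where each row came from.
audit = social.merge(
    activations, left_on="week_start_date", right_on="join_date",
    how="outer", indicator=True,
)
print(audit["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are weeks that an inner merge would silently drop, which is worth knowing when the date ranges of the files differ.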



In [54]:
piv_sentiment_df = sentiment_df.pivot(index=["week_start_date"], columns="sentiment", values="volume")
piv_sentiment_df 

week_start_date,negative,neutral,positive,unassigned
2017-10-30 00:00:00+00:00,6143.0,54020.0,11492.0,1200.0
2017-11-06 00:00:00+00:00,8222.0,75742.0,15450.0,1476.0
2017-11-13 00:00:00+00:00,8337.0,84234.0,14200.0,1344.0
2017-11-20 00:00:00+00:00,6373.0,,12979.0,1272.0
2017-11-27 00:00:00+00:00,3127.0,41008.0,7305.0,749.0
...,...,...,...,...
2019-12-02 00:00:00+00:00,,52135.0,11285.0,1634.0
2019-12-09 00:00:00+00:00,,57394.0,,1781.0
2019-12-16 00:00:00+00:00,3830.0,50988.0,7617.0,1365.0
2019-12-23 00:00:00+00:00,3189.0,38141.0,7019.0,1013.0


In [55]:
activations_df

Unnamed: 0,country,week_end_date,activations
3,United States,2017-10-01 00:00:00+00:00,41292.0
7,United States,2017-10-08 00:00:00+00:00,38545.0
11,United States,2017-10-15 00:00:00+00:00,36885.0
15,United States,2017-10-22 00:00:00+00:00,36749.0
19,United States,2017-10-29 00:00:00+00:00,34262.0
...,...,...,...
523,United States,2020-03-29 00:00:00+00:00,93696.0
527,United States,2020-04-05 00:00:00+00:00,92197.0
531,United States,2020-04-12 00:00:00+00:00,89822.0
535,United States,2020-04-19 00:00:00+00:00,124058.0


In [56]:
volume_df

Unnamed: 0,topic,week_start_date,country,volume
3,K Console,2017-11-20 00:00:00+00:00,United States,77280.0
5,K Console,2017-11-27 00:00:00+00:00,United States,123159.0
8,K Console,2017-12-04 00:00:00+00:00,United States,125452.0
14,K Console,2017-12-11 00:00:00+00:00,United States,
17,K Console,2017-12-18 00:00:00+00:00,United States,99206.0
...,...,...,...,...
440,K Console,2019-12-02 00:00:00+00:00,United States,70870.0
446,K Console,2019-12-09 00:00:00+00:00,United States,74647.0
451,K Console,2019-12-16 00:00:00+00:00,United States,63215.0
453,K Console,2019-12-23 00:00:00+00:00,United States,50667.0


In [57]:
activations_df["join_date"] = activations_df["week_end_date"].apply(lambda x : x - timedelta(days=6))
activations_df

Unnamed: 0,country,week_end_date,activations,join_date
3,United States,2017-10-01 00:00:00+00:00,41292.0,2017-09-25 00:00:00+00:00
7,United States,2017-10-08 00:00:00+00:00,38545.0,2017-10-02 00:00:00+00:00
11,United States,2017-10-15 00:00:00+00:00,36885.0,2017-10-09 00:00:00+00:00
15,United States,2017-10-22 00:00:00+00:00,36749.0,2017-10-16 00:00:00+00:00
19,United States,2017-10-29 00:00:00+00:00,34262.0,2017-10-23 00:00:00+00:00
...,...,...,...,...
523,United States,2020-03-29 00:00:00+00:00,93696.0,2020-03-23 00:00:00+00:00
527,United States,2020-04-05 00:00:00+00:00,92197.0,2020-03-30 00:00:00+00:00
531,United States,2020-04-12 00:00:00+00:00,89822.0,2020-04-06 00:00:00+00:00
535,United States,2020-04-19 00:00:00+00:00,124058.0,2020-04-13 00:00:00+00:00


In [58]:
final_df = piv_sentiment_df.merge(activations_df, left_on="week_start_date", right_on="join_date")
final_df = final_df.merge(volume_df, left_on="join_date", right_on="week_start_date")
final_df.drop(columns=["country_x", "week_end_date", "join_date", "topic", "country_y"], inplace=True)
cols = ["negative", "neutral", "positive", "unassigned", "activations", "volume"]
final_df = final_df[["week_start_date"] + cols]
final_df

Unnamed: 0,week_start_date,negative,neutral,positive,unassigned,activations,volume
0,2017-11-20 00:00:00+00:00,6373.0,,12979.0,1272.0,308652.0,77280.0
1,2017-11-27 00:00:00+00:00,3127.0,41008.0,7305.0,749.0,135496.0,123159.0
2,2017-12-04 00:00:00+00:00,9993.0,94338.0,20550.0,2422.0,79751.0,125452.0
3,2017-12-11 00:00:00+00:00,,85123.0,16888.0,1683.0,80229.0,
4,2017-12-18 00:00:00+00:00,,71803.0,17175.0,917.0,151110.0,99206.0
...,...,...,...,...,...,...,...
110,2019-12-02 00:00:00+00:00,,52135.0,11285.0,1634.0,95434.0,70870.0
111,2019-12-09 00:00:00+00:00,,57394.0,,1781.0,62924.0,74647.0
112,2019-12-16 00:00:00+00:00,3830.0,50988.0,7617.0,1365.0,77251.0,63215.0
113,2019-12-23 00:00:00+00:00,3189.0,38141.0,7019.0,1013.0,457024.0,50667.0


# Determine if there is missing data, fill in any missing data

In [59]:
print(final_df.isna().sum())


week_start_date     0
negative           12
neutral             9
positive           11
unassigned         11
activations         0
volume             10
dtype: int64


In [60]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 115 entries, 0 to 114
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   week_start_date  115 non-null    datetime64[ns, UTC]
 1   negative         103 non-null    float64            
 2   neutral          106 non-null    float64            
 3   positive         104 non-null    float64            
 4   unassigned       104 non-null    float64            
 5   activations      115 non-null    float64            
 6   volume           105 non-null    float64            
dtypes: datetime64[ns, UTC](1), float64(6)
memory usage: 7.2 KB


In [61]:
# interpolate has intermittently failed when run on the entire dataframe (see the link below),
# so we call it only on the list of columns that need interpolation
#https://stackoverflow.com/questions/58141908/pandas-interpolate-throws-an-invalid-fill-method-error-after-version-0-24

final_df[cols] = final_df[cols].interpolate(method='linear')
# linear interpolation leaves NaNs at the start of a column, so backfill those from the first valid value
# (DataFrame.bfill replaces fillna(method='bfill'), which is deprecated in recent pandas releases)
final_df.bfill(inplace=True)

print(final_df.isna().sum())
final_df

week_start_date    0
negative           0
neutral            0
positive           0
unassigned         0
activations        0
volume             0
dtype: int64


Unnamed: 0,week_start_date,negative,neutral,positive,unassigned,activations,volume
0,2017-11-20 00:00:00+00:00,6373.000000,41008.0,12979.0,1272.0,308652.0,77280.0
1,2017-11-27 00:00:00+00:00,3127.000000,41008.0,7305.0,749.0,135496.0,123159.0
2,2017-12-04 00:00:00+00:00,9993.000000,94338.0,20550.0,2422.0,79751.0,125452.0
3,2017-12-11 00:00:00+00:00,9269.333333,85123.0,16888.0,1683.0,80229.0,112329.0
4,2017-12-18 00:00:00+00:00,8545.666667,71803.0,17175.0,917.0,151110.0,99206.0
...,...,...,...,...,...,...,...
110,2019-12-02 00:00:00+00:00,1496.666667,52135.0,11285.0,1634.0,95434.0,70870.0
111,2019-12-09 00:00:00+00:00,2663.333333,57394.0,9451.0,1781.0,62924.0,74647.0
112,2019-12-16 00:00:00+00:00,3830.000000,50988.0,7617.0,1365.0,77251.0,63215.0
113,2019-12-23 00:00:00+00:00,3189.000000,38141.0,7019.0,1013.0,457024.0,50667.0
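The fill step above combines two methods, and a toy example makes the division of labor visible. This is a minimal sketch on a synthetic column; the `imputed_mask` name is introduced here for illustration and is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic column with a leading gap and an interior gap.
toy = pd.DataFrame({"neutral": [np.nan, 10.0, np.nan, 30.0]})

# Record which cells were missing before filling, so imputed values stay identifiable.
imputed_mask = toy.isna()

# Linear interpolation fills the interior gap but leaves the leading NaN untouched.
interpolated = toy.interpolate(method="linear")

# Backfill then copies the first valid value into the leading gap.
filled = interpolated.bfill()

print(interpolated["neutral"].tolist())
print(filled["neutral"].tolist())
```

Keeping the mask around lets later analysis exclude or flag imputed weeks rather than treating them as observed data.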


# Graph the data, analyze to determine if additional efforts are worthwhile

You can manually scale (normalize) each of the columns or use MinMaxScaler

```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

#cols contains the list of columns you want to scale activations, volume, etc...
normalized_dataframe = pd.DataFrame(scaler.fit_transform(final_dataframe[cols]), columns=cols)

#add the date column to the normalized_dataframe
normalized_dataframe["start_date"] = final_dataframe["week_start_date"]

```
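For the manual route, min-max scaling maps each column to the range 0 to 1 by subtracting the column minimum and dividing by the column range. A minimal sketch on synthetic values, using plain pandas arithmetic rather than scikit-learn:

```python
import pandas as pd

# Synthetic values on very different scales.
toy = pd.DataFrame({
    "activations": [100.0, 200.0, 300.0],
    "volume": [5.0, 10.0, 25.0],
})

# Column-wise min-max scaling: (x - min) / (max - min).
# A constant column would have max == min and produce NaN, so check for that first on real data.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Scaling only changes the units, not the shape of each series, which is what makes it suitable for overlaying metrics of different magnitudes on one chart.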



In [62]:
#for column in df_norm.columns:
#  df_norm[column] = (df_norm[column] - df_norm[column].min()) / (df_norm[column].max() - df_norm[column].min())

from sklearn.preprocessing import MinMaxScaler

# create a scaler object
scaler = MinMaxScaler()
# fit and transform the data
df_norm = pd.DataFrame(scaler.fit_transform(final_df[cols]), columns=cols)
df_norm["start_date"] = final_df["week_start_date"]
df_norm = df_norm[["start_date"] + cols]

plotdf = pd.melt(df_norm, id_vars=['start_date'], value_vars=cols,
                 var_name='var', value_name='val')

fig = px.line(plotdf, x="start_date", y="val", color="var", title='social metrics for United States and activations')
fig.show()


In [63]:
fig = px.scatter_matrix(df_norm, dimensions=cols)
fig.show()

In [69]:
week_start = final_df['week_start_date']
# holiday period: weeks starting November 15 or later, plus all of December
is_holiday = ((week_start.dt.month == 11) & (week_start.dt.day >= 15)) | (week_start.dt.month == 12)
holidayactivations = final_df.loc[is_holiday].activations.sum()
allactivations = final_df.activations.sum()

print(f"{allactivations}, {holidayactivations}, {holidayactivations / allactivations}")

8795725.0, 3980487.0, 0.4525479139013555


#  Final analysis
In the markdown cell below, summarize your findings. What method did you use to fill in the missing data? Should the team invest additional resources to build better models linking activations to social metrics?  Please limit your response to 250 words (+/-).

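The social volume and sentiment data are messy and low quality: each social column is missing 9 to 12 of the 115 weekly values.  I filled the gaps with linear interpolation, plus a backfill for gaps at the start of a column.  About 45% of activations occur in the holiday period (weeks starting November 15 through the end of December), and the social data clearly does not match that pattern.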

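I've added a scatter matrix to draw out one of the reasons I find sentiment challenging to operationalize.  Note the high correlation between positive and negative sentiment: the two rise and fall together, so they mostly track overall volume rather than how customers feel.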

The activations data does have additional peaks outside of the holiday time frames that correspond to the release of new titles.  Monitoring social data to understand release dates of new titles and amplify those efforts would be valuable to increase peaks or promote new titles in the slower months.
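The visual comparison between social metrics and activations could be backed with numbers. The sketch below is one possible check, not part of the original analysis: a hypothetical helper, `lagged_correlations`, that correlates a target column with a predictor shifted by 0 to `max_lag` weeks. The toy data is synthetic and built so that activations repeat volume one week later.

```python
import pandas as pd

def lagged_correlations(df, predictor, target, max_lag=4):
    """Pearson correlation of target with predictor delayed by 0..max_lag rows."""
    return pd.Series(
        {lag: df[target].corr(df[predictor].shift(lag)) for lag in range(max_lag + 1)}
    )

# Synthetic data: activations are simply last week's volume.
toy = pd.DataFrame({"volume": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]})
toy["activations"] = toy["volume"].shift(1)

result = lagged_correlations(toy, "volume", "activations", max_lag=2)
print(result)
```

On the notebook's data, a call such as `lagged_correlations(final_df, "volume", "activations")` would show whether any lag of social volume tracks activations, and `final_df[cols].corr()` would give the pairwise figures behind the scatter matrix. A weak correlation at every lag would support the conclusion that social data adds little to the forecast.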