#### The purpose of this EDA notebook is to:
- Better understand the relationships among the independent variables
- Create an initial model with reasonable economic assumptions, which may be dropped in later versions
- Explore imputation methods for missing values so that more data samples can be used
- Avoid linear combinations of variables, which may be harder to spot during the Bayesian modeling process
- Establish a reasonable measure of variable importance, which, along with correlation plots, may inform initial hierarchies
- Create visualizations of poor-quality data and establish probability distributions for the likelihood function
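
One way to look for linear combinations before modeling is to compare the rank of the numeric feature matrix with its number of columns. The sketch below runs on toy data only: the column names are borrowed from this dataset, but the values and the identity `Units = Base Units + Incr Units` are assumptions made for illustration and have not been checked against the real file.

```python
import numpy as np
import pandas as pd

# Toy data: names mirror the dataset, values are randomly generated.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Base Units": rng.uniform(10, 100, size=50),
    "Incr Units": rng.uniform(0, 20, size=50),
    "Avg Unit Price": rng.uniform(1, 5, size=50),
})
# Assumed identity, used here only to plant an exact linear combination.
toy["Units"] = toy["Base Units"] + toy["Incr Units"]

# A rank below the column count means at least one column can be written
# as a linear combination of the others.
rank = np.linalg.matrix_rank(toy.to_numpy())
n_redundant = toy.shape[1] - rank
print(f"{toy.shape[1]} columns, rank {rank}, {n_redundant} redundant")
```

The rank only says that a dependency exists, not which columns are involved, so it works best as a quick screen ahead of correlation plots.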

In [4]:
# sklearn, pandas, numpy, matplotlib, and seaborn are natural choices for EDA
# though we will add quite a few more
import pandas as pd
import numpy as np
# import category_encoders as ce
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
# from sklearn.ensemble import RandomForestClassifier

# will add more here later on

#### To keep this notebook reusable for future EDA, the data is loaded into a generically named DataFrame, 'df', which also makes debugging easier

In [5]:
df = pd.read_csv('C:/Users/norri/Desktop/cortex_Push.csv')
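# summarize column dtypes and non-null counts from the pandas DataFrame;
# df.info() prints directly and returns None, so it is not wrapped in print()
df.info()
print(df.describe())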

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54039 entries, 0 to 54038
Data columns (total 54 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   ConcatId                                      54039 non-null  object 
 1   ClientId                                      54039 non-null  int64  
 2   Program Id                                    54039 non-null  int64  
 3   Program Name                                  54039 non-null  object 
 4   Retailers                                     54039 non-null  object 
 5   TacticId                                      54039 non-null  int64  
 6   Tactic                                        54039 non-null  object 
 7   CategoryId                                    54039 non-null  int64  
 8   Tactic Category                               54039 non-null  object 
 9   VendorId                                      53135 non-null 

In [6]:
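##### While not always true, object columns hold strings, int64 columns hold IDs, and float64 columns hold numerical data. Two exceptions in this dataset: VendorId is stored as float64 because it has missing values, and Total Impressions for Tactic is an int64 count rather than an ID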
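
Where the dtype rule of thumb breaks down, columns can be converted explicitly. The sketch below uses a toy frame with invented values. The ISO date format is an assumption, since the actual format of `Tactic Start Date` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: column names come from the dataset, values are invented.
toy = pd.DataFrame({
    "VendorId": [1.0, 32.0, np.nan],
    "Tactic Start Date": ["2020-01-06", "2020-02-03", "2020-03-02"],
    "Total Impressions for Tactic": [0, 1200, 54000],
})

# An ID column with gaps is read as float64; the nullable Int64 dtype
# keeps whole numbers and marks the gaps as <NA>.
toy["VendorId"] = toy["VendorId"].astype("Int64")

# Dates read from CSV arrive as strings; the format here is an assumption.
toy["Tactic Start Date"] = pd.to_datetime(
    toy["Tactic Start Date"], format="%Y-%m-%d"
)

print(toy.dtypes)
```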

In [7]:
# TODO: add more early data exploration once the data is loaded into df

#### Dropping rows with missing values is informative for initial exploration and useful for testing code quickly, but it can render over 2/3 of our data useless, so a method of imputation is vital.
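
As a starting point, the difference between dropping and imputing can be sketched with scikit-learn's `SimpleImputer`. This is a minimal illustration on toy values, not the method chosen for this project. The strategies (most frequent for an ID-like column, median for a numeric one) are assumptions to revisit, and the missing `StoreCount` value is invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps; the real missingness pattern may differ.
toy = pd.DataFrame({
    "VendorId": [1.0, np.nan, 32.0, 1.0],
    "StoreCount": [120.0, 80.0, np.nan, 100.0],
})

# Assumed strategies: the mode for the ID-like column, the median for the
# numeric column.
id_imputer = SimpleImputer(strategy="most_frequent")
num_imputer = SimpleImputer(strategy="median")

imputed = toy.copy()
imputed[["VendorId"]] = id_imputer.fit_transform(toy[["VendorId"]])
imputed[["StoreCount"]] = num_imputer.fit_transform(toy[["StoreCount"]])

print(f"rows kept by dropna: {len(toy.dropna())} of {len(toy)}")
print(f"rows kept by imputation: {len(imputed)} of {len(toy)}")
```

Median and mode imputation shrink the variance of the imputed columns, which matters once probability distributions are fit to them, so model-based alternatives are worth comparing.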

In [8]:
# find the best method of imputation here instead of dropping

# df = df.dropna()
print(df.describe())
df.info()  # prints directly and returns None, so no print() wrapper is needed

          ClientId    Program Id      TacticId    CategoryId      VendorId  \
count  54039.00000  54039.000000  54039.000000  54039.000000  53135.000000   
mean       6.03875   2213.326579   7247.566646      3.962601     45.138270   
std        0.27568   1714.596822   5360.286096      3.854720     53.527778   
min        6.00000    876.000000   2028.000000      1.000000      1.000000   
25%        6.00000    994.000000   2867.000000      2.000000      1.000000   
50%        6.00000   1544.000000   5557.000000      2.000000     32.000000   
75%        6.00000   2373.000000   9281.000000      4.000000     63.000000   
max        8.00000   8210.000000  25783.000000     22.000000    256.000000   

       Total Impressions for Tactic  Total Tactic Spend  \
count                  5.403900e+04        54039.000000   
mean                   3.408857e+07        33927.000842   
std                    3.680617e+08        61813.987485   
min                    0.000000e+00            0.000000   
25

In [9]:
segment = [var for var in df.columns if df[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(segment)))
print('The categorical variables are :\n\n', segment)
print(df[segment].isnull().sum())

There are 12 categorical variables

The categorical variables are :

 ['ConcatId', 'Program Name', 'Retailers', 'Tactic', 'Tactic Category', 'Vendor', 'Tactic Start Date', 'Tactic End Date', 'Brand', 'Left_Right_ConcatId', 'RMN', 'Right_ConcatId']
ConcatId               0
Program Name           0
Retailers              0
Tactic                 0
Tactic Category        0
Vendor                 0
Tactic Start Date      0
Tactic End Date        0
Brand                  0
Left_Right_ConcatId    0
RMN                    0
Right_ConcatId         0
dtype: int64


In [10]:
integer = [var for var in df.columns if df[var].dtype=='int64']
print('There are {} integer variables\n'.format(len(integer)))
print('The integer variables are :\n\n', integer)
print(df[integer].isnull().sum())

There are 11 integer variables

The integer variables are :

 ['ClientId', 'Program Id', 'TacticId', 'CategoryId', 'Total Impressions for Tactic', 'BrandId', 'Nielsen_Week_Year', 'Right_TacticId', 'Right_BrandId', 'Right_Nielsen_Week_Year', 'Weeks']
ClientId                        0
Program Id                      0
TacticId                        0
CategoryId                      0
Total Impressions for Tactic    0
BrandId                         0
Nielsen_Week_Year               0
Right_TacticId                  0
Right_BrandId                   0
Right_Nielsen_Week_Year         0
Weeks                           0
dtype: int64


In [11]:
fp = [var for var in df.columns if df[var].dtype=='float64']
print('There are {} float variables\n'.format(len(fp)))
print('The float variables are :\n\n', fp)
print(df[fp].isnull().sum())

There are 31 float variables

The float variables are :

 ['VendorId', 'Total Tactic Spend', 'Total Tactic Insertion Cost', 'Total Tactic Redemption Cost', 'StoreCount', 'Impressions per Week', 'Brand Share of Program Budget', 'Brand Share of Total Tactic Spend', 'Brand Share of Tactic Insertion Cost', 'Brand Share of Tactic Redemption Cost', 'Weekly Brand Share of Total Tactic Spend', 'Weekly Brand Share of Tactic Insertion Cost', 'Weekly Brand Share of Tactic Redemption Cost', '$', 'Base $', 'Incr $', 'Units', 'Base Units', 'Incr Units', 'Avg Unit Price', 'Any Promo Units', '%ACV Distribution', 'Any Promo %ACV', 'Disp w/o Feat %ACV', 'Feat & Disp %ACV', 'Feat w/o Disp %ACV', 'Price Decr Only %ACV', 'Number of UPCs Selling', '$ Shr - Ty Subcategory', 'Units Shr - Ty Category', 'Units Shr - Ty Subcategory']
VendorId                                          904
Total Tactic Spend                                  0
Total Tactic Insertion Cost                         0
Total Tactic Redemp

#### As there are only 3 dtypes among the variables (object, int64, and float64), we can check that every column is accounted for: 12 + 11 + 31 = 54, the total number of columns

In [12]:
print('Missing variables check' + ' ' + str(len(integer) + len(fp) + len(segment) - len(df.columns)))

Missing variables check 0
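
A count of zero confirms coverage, but if a new dtype ever appears it helps to know which columns slipped through. The sketch below is one possible variant of the check. The helper `unaccounted_columns` and the `On Promo` column are hypothetical names introduced for illustration.

```python
import pandas as pd

EXPECTED_DTYPES = ["object", "int64", "float64"]


def unaccounted_columns(frame):
    """Return the columns whose dtype is outside the three expected groups."""
    covered = frame.select_dtypes(include=EXPECTED_DTYPES).columns
    return [col for col in frame.columns if col not in covered]


# Toy frame with one column per expected dtype (values are invented).
toy = pd.DataFrame({
    "Brand": pd.Series(["A", "B"], dtype="object"),
    "BrandId": pd.Series([1, 2], dtype="int64"),
    "Units": pd.Series([10.5, 3.0], dtype="float64"),
})
print(unaccounted_columns(toy))

# A hypothetical boolean column is not covered by the three groups.
toy["On Promo"] = [True, False]
print(unaccounted_columns(toy))
```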


In [13]:
unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns], columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])
print(unique_counts)

                                     Column_Name  Num_Unique
25                                           RMN           2
1                                       ClientId           2
8                                Tactic Category          20
7                                     CategoryId          20
26                                         Weeks          27
4                                      Retailers          45
50                        Number of UPCs Selling          78
22                                 Right_BrandId          97
18                                         Brand          97
17                                       BrandId          97
9                                       VendorId         104
10                                        Vendor         105
24                                    StoreCount         120
23                       Right_Nielsen_Week_Year         129
19                             Nielsen_Week_Year         129
6                       

In [14]:
# ConcatId columns are unique identifiers, so they need to be dropped before the categorical variables are encoded

# X = df
# y = df['$']
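
Identifier-like columns can be found from the unique counts rather than listed by hand. The sketch below assumes that a non-numeric column whose number of unique values equals the number of rows is an identifier. That rule is a heuristic, and the toy values are invented.

```python
import pandas as pd

# Toy frame: names come from the dataset, values are invented.
toy = pd.DataFrame({
    "ConcatId": ["a1", "b2", "c3", "d4"],
    "Retailers": ["X", "X", "Y", "Y"],
    "$": [10.0, 12.5, 7.0, 9.0],
})

unique_counts = toy.nunique()

# Heuristic: a non-numeric column with one distinct value per row carries
# no group structure and is treated as an identifier.
id_like = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
    and unique_counts[col] == len(toy)
]

X = toy.drop(columns=id_like + ["$"])
y = toy["$"]
print("dropped as identifiers:", id_like)
print("remaining features:", list(X.columns))
```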

In [15]:
# take a small sample here for quickly testing machine learning code

# sample = df.head(n=100)
# X = sample.drop(columns=['$'])
# y = sample['$']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
#                                                     random_state=12)
# print(X_train.describe())
# print(y_train.describe())

In [16]:
# encoder = ce.OneHotEncoder(cols=['ConcatId', 'Program Name', 'Retailers',
#                                  'Tactic', 'Tactic Category', 'Vendor',
#                                  'Tactic Start Date', 'Tactic End Date',
#                                  'Brand', 'Left_Right_ConcatId',
#                                  'RMN', 'Right_ConcatId'])

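# X_train = encoder.fit_transform(X_train)
# X_test = encoder.transform(X_test)
# the target is not passed through the encoder: the encoder is fit on the
# features only, and y is the numeric '$' column

# print(X_train.head())

# NOTE: with a continuous target such as '$', a regressor (for example
# sklearn's RandomForestRegressor) is likely a better fit than a classifier
# rfc = RandomForestClassifier(random_state=0)
# rfc.fit(X_train, y_train)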
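
A runnable version of the idea above, for a first look at variable importance, might look like the following. It is a sketch under stated assumptions: the data is synthetic, the relationship between spend and `$` is invented, `pd.get_dummies` stands in for the `category_encoders` encoder, and a `RandomForestRegressor` replaces the classifier because the target is continuous.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: names come from the dataset, values are generated.
rng = np.random.default_rng(12)
n = 200
toy = pd.DataFrame({
    "Retailers": rng.choice(["X", "Y", "Z"], size=n),
    "Total Tactic Spend": rng.uniform(0, 1000, size=n),
})
# Invented relationship so that one feature clearly drives the target.
toy["$"] = 3.0 * toy["Total Tactic Spend"] + rng.normal(0, 10, size=n)

# One-hot encode the categorical feature; the target is left untouched.
X = pd.get_dummies(toy.drop(columns="$"), columns=["Retailers"])
y = toy["$"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12
)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances sum to 1 across all encoded features.
importance = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importance)
```

Impurity-based importances tend to favor continuous and high-cardinality features, so they are better read as a rough ranking than as effect sizes.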
