# Objectives
- Understand 3 main groups of data:
    - Customer (Broker) Demographic data
    - Advertising Campaign data
    - Success Marker data

# Week 2 : Data Cleaning
Assigned DataSets:"Goal stats - web traffic", "General stats - web traffic"

0. Load dataset
1. Change column headings to names that are easier to reference
2. Explore the data.
    1. Create a new DataFrame.
    2. Sense-check the DataFrame.
    3. Determine if there are any missing values in the DataFrame.
    4. Create a summary of the descriptive statistics.
3. Remove redundant columns
4. Save a copy of the clean DataFrame as a CSV file. Import the file to sense-check.

## 0. Load file & create dataframes (GoalStats)


In [1]:
# Imports
import numpy as np
import pandas as pd

In [2]:
GoalStats_raw = pd.read_excel("Change 2022_GA writeback_091122.xlsx", sheet_name="Goal stats - web traffic")

GoalStats_raw.head()

Unnamed: 0,Date,Campaign,Audience,Creative - Family,Creative - Version,Platform,Ad Format,Goal,Completions,Campaign Traffic?,Days away from max date,Latest report?
0,2022-04-01,(not set),,(not set),(not set),,,Learn More (Community Mortgage),2.0,General traffic,213.0,0
1,2022-04-01,(not set),,(not set),(not set),,,Form Submission,2.0,General traffic,213.0,0
2,2022-04-01,(not set),,(not set),(not set),,,Home Page Video Play,1.0,General traffic,213.0,0
3,2022-04-01,(not set),,(not set),(not set),,,Learn More (Closer Twins Banner),1.0,General traffic,213.0,0
4,2022-04-01,(not set),,(not set),(not set),,,Learn More (Community Mortgage),18.0,General traffic,213.0,0


In [3]:
#Create new dataframe for cleaned data
GoalStats = GoalStats_raw.copy()

## 1. Rename Columns

In [4]:
# Rename the column headers.
GoalStats = GoalStats.rename(
    columns={
        "City, Country": "Location",
        "Creative - Family": "Creative_Family",
        "Creative - Version": "Creative_Version",
        "Ad Format": "Ad_Format",
        "Campaign Traffic?": "Campaign_Traffic",
        "Days away from max date": "Days_Max_Date",
        "Latest report?": "Latest_Report"})

GoalStats.columns

Index(['Date', 'Campaign', 'Audience', 'Creative_Family', 'Creative_Version',
       'Platform', 'Ad_Format', 'Goal', 'Completions', 'Campaign_Traffic',
       'Days_Max_Date', 'Latest_Report'],
      dtype='object')


## 1. Check for missing values
- Evaluate what to do with entries with missing values

In [5]:
GoalStats_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16509 entries, 0 to 16508
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Date                     16509 non-null  datetime64[ns]
 1   Campaign                 16509 non-null  object        
 2   Audience                 4242 non-null   object        
 3   Creative - Family        16509 non-null  object        
 4   Creative - Version       16509 non-null  object        
 5   Platform                 4549 non-null   object        
 6   Ad Format                4427 non-null   object        
 7   Goal                     16509 non-null  object        
 8   Completions              16509 non-null  float64       
 9   Campaign Traffic?        16509 non-null  object        
 10  Days away from max date  16509 non-null  float64       
 11  Latest report?           16509 non-null  int64         
dtypes: datetime64[ns](1), float64(2)

### Basic Overview:

Out of __16509 entries__

There are missing data for:
- Audience : **12267** missing values [74.3% missing]
- Platform :  **11960** missing values [72.4% missing]
- Ad Format :  **12082** missing values [73.2% missing]

> Significant amount of missing data for these columns <br>
> ⇒ Significantly reduce amount of available data if all rows with missing values removed <br>
> ⇒ Missing data renamed to 'NA' instead of deleting

In [6]:
# Create function to check unique values in 
def col_list(df, para):
    print("List of values in <", para ,"> : \n",df[para].unique(),"\n")
    return

### 1.1. Demographic Information
Columns: 'Audience' <br>
 2   Audience                 4242 non-null   object        
 
 Expected Values: <br>
Audience 1	:	Registered Loan Officers from Registered Brokerage, active (last 120 days). <br>
Audience 2	:	Registered Loan Officers from Registered Brokerage, inactive (last 120 days). <br>
Audience 3	:	Registered Loan Officers from Registered Brokerage, never registered a loan. <br>
Audience 4	:	Non-Registered Loan Officers from Registered Brokerages. <br>
Audience 5	:	Retargeted audience. Non-Registered Loan Officers who visited website (last 7 days). <br>
Audience 6	:	General Targeting. Brokers not registered and not from registered brokerages.

In [7]:
col_list(GoalStats, 'Audience')

List of values in < Audience > : 
 [nan 'General Targetting' 'test4' '4' '2' '5' '1' '3' '1to4'] 



### Cleaning Required:  
1) Rename "General Targetting" to "6" <br>
2) Rename "1to4" to "4"
> Based on preliminary check, Audience 4 is the most common


3) Rename all none 1-6 values to "NA" <br>

In [8]:
# Rename 'General Targetting' to '6'
GoalStats['Audience'] = GoalStats.Audience.str.replace('General Targetting', '6')
# Rename '1to4' to '46'
GoalStats['Audience'] = GoalStats.Audience.str.replace('1to4', '4')

# Replace all non-'1-6' values
allowed_vals = ['1','2','3','4','5','6']
GoalStats.loc[~GoalStats['Audience'].isin(allowed_vals), 'Audience'] = 'NA'


# Check remaining values 
col_list(GoalStats, 'Audience')

List of values in < Audience > : 
 ['NA' '6' '4' '2' '5' '1' '3'] 



In [9]:
GoalStats.groupby('Audience')['Audience'].count()

Audience
1        68
2        23
3        21
4       112
5       153
6      3863
NA    12269
Name: Audience, dtype: int64

In [10]:
GoalStats.groupby('Audience')\
        .sum('Completions')\
        .sort_values('Completions', ascending=False)

Unnamed: 0_level_0,Completions,Days_Max_Date,Latest_Report
Audience,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,38170.0,1372374.0,767
6.0,5489.0,412591.0,268
5.0,155.0,16288.0,30
4.0,112.0,8817.0,17
1.0,69.0,9498.0,3
2.0,23.0,3737.0,0
3.0,21.0,2205.0,2


### 1.2. Advertising Campaign
Columns:  <br>
 1   Campaign                 16509 non-null  object        
 3   Creative - Family        16509 non-null  object        
 4   Creative - Version       16509 non-null  object        
 5   Platform                 4549 non-null   object        
 6   Ad Format                4427 non-null   object    
 
 Expected Values: <br>

In [11]:
col_list(GoalStats, 'Campaign')
col_list(GoalStats, 'Creative_Family')
col_list(GoalStats, 'Creative_Version')
col_list(GoalStats, 'Platform')
col_list(GoalStats, 'Ad_Format')
col_list(GoalStats, 'Campaign_Traffic')

List of values in < Campaign > : 
 ['(not set)' 'Announcement' 'NBNurture' 'FY22_broker_campaign_ph2'
 'FY22_broker_campaign' 'Brand_Exact' 'Brand_Phrase' 'NB_Wholesale_Exact'
 'NB_Wholesale_Phrase' 'FY23_broker_campaign' 'FY23_change_digital_phase3'
 'FY22_Broker_Campaign'
 '5d7f312058-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01'
 'c9dcf05b32-EMAIL_CAMPAIGN_2022_03_21_12_37_COPY_01'
 'f27ee0be9c-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01' 'AE_Intro'
 'Change_Wholesale_plusup' 'adhocwhol' 'Active Broker Emails' 'closr'
 'e3bc604b28-EMAIL_CAMPAIGN_2022_08_02_05_34' 2022.0 'LORecruiting'
 'FY22_anti_inflation' 'August_Mesaage' 'Anti_Inflation' 'CloserTwins'
 'Q4_2022'] 

List of values in < Creative_Family > : 
 ['(not set)' 'SEM Ads' 'Trade Media Ads' 'CompetitiveOpportunity'
 'CloseFaster' 'CloserTwins' 'newsletter' 'One-Off' 'SnapdocsLive'
 'UnfairAdvantage' 'ComingSoon' 219526440.0 'domain' 'All3' '08-24-2022'
 '08-25-2022' '08-29-2022' 'August' 'crm'
 datetime.datetime(2022, 6, 9, 0, 0) da

### Cleaning Required:  
1) No cleaning required for 'Campaign Traffic?'  <br>
2) Rename missing values in 'Platform', 'Ad Format' to "NA" <br>
<br>
3) Rename "(not set)" value to "NA" for 'Campaign', 'Creative - Family' and 'Creative - Version'<br>
>    - Upon verification on Excel, it's found that most of the "(not set)" values are for entries that record "General Traffic" vs "Campaign" traffic <br>
>    - Hence, it makes sense that the 'Campaign' details of the activities are undefined

4) Varied values in 'Campaign', 'Creative - Family' and 'Creative - Version' <br>
>    - will need to evaluate which are actual values and which are erroneous <br>
>    - Maybe cross-refer to other datasets

In [12]:
# Replace missing values
GoalStats = GoalStats.fillna('NA')

col_list(GoalStats, 'Platform')
col_list(GoalStats, 'Ad_Format')

List of values in < Platform > : 
 ['NA' 'Google SEM' 'Trade Media' 'LinkedIn' 'Domain Display'
 'User ID Display' 'Facebook' 'OTT'] 

List of values in < Ad_Format > : 
 ['NA' 'CPC' 'National_mortgage_news' 'Single image' 'Video'
 'Inside_mortgage_finance_newsletter' 'Nmn' 'Scotsman' 'Housingwire'
 'Carousel' 'Chrisman' 'Animated'] 



### 1.2.1 "Campaign" Values

1) Rename to "Email_Campaign": <br>
5d7f312058-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01	<br>
c9dcf05b32-EMAIL_CAMPAIGN_2022_03_21_12_37_COPY_01	<br>
e3bc604b28-EMAIL_CAMPAIGN_2022_08_02_05_34	<br>
f27ee0be9c-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01	

2) Rename "(not set)" to "NA"

In [13]:
# Change all values to string
GoalStats['Campaign'] = GoalStats['Campaign'].astype(str)

In [14]:
# Rename 'Email_Campaign'
GoalStats['Campaign'] = GoalStats.Campaign.str.replace(r'(^.*EMAIL_CAMPAIGN.*$)', 'Email_Campaign')
GoalStats['Campaign'] = GoalStats.Campaign.str.replace(r'(^.*2022.0.*$)', '2022')

# Rename '(not set)' to 'NA'
GoalStats['Campaign'] = GoalStats.Campaign.replace('(not set)', 'NA')

col_list(GoalStats, 'Campaign')

List of values in < Campaign > : 
 ['NA' 'Announcement' 'NBNurture' 'FY22_broker_campaign_ph2'
 'FY22_broker_campaign' 'Brand_Exact' 'Brand_Phrase' 'NB_Wholesale_Exact'
 'NB_Wholesale_Phrase' 'FY23_broker_campaign' 'FY23_change_digital_phase3'
 'FY22_Broker_Campaign' 'Email_Campaign' 'AE_Intro'
 'Change_Wholesale_plusup' 'adhocwhol' 'Active Broker Emails' 'closr'
 '2022.0' 'LORecruiting' 'FY22_anti_inflation' 'August_Mesaage'
 'Anti_Inflation' 'CloserTwins' 'Q4_2022'] 



  GoalStats['Campaign'] = GoalStats.Campaign.str.replace(r'(^.*EMAIL_CAMPAIGN.*$)', 'Email_Campaign')


In [15]:
test = GoalStats.groupby('Campaign').count()

test

Unnamed: 0_level_0,Date,Audience,Creative_Family,Creative_Version,Platform,Ad_Format,Goal,Completions,Campaign_Traffic,Days_Max_Date,Latest_Report
Campaign,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2022.0,19,19,19,19,19,19,19,19,19,19,19
AE_Intro,7,7,7,7,7,7,7,7,7,7,7
Active Broker Emails,1,1,1,1,1,1,1,1,1,1,1
Announcement,16,16,16,16,16,16,16,16,16,16,16
Anti_Inflation,26,26,26,26,26,26,26,26,26,26,26
August_Mesaage,2,2,2,2,2,2,2,2,2,2,2
Brand_Exact,2630,2630,2630,2630,2630,2630,2630,2630,2630,2630,2630
Brand_Phrase,480,480,480,480,480,480,480,480,480,480,480
Change_Wholesale_plusup,33,33,33,33,33,33,33,33,33,33,33
CloserTwins,5,5,5,5,5,5,5,5,5,5,5


### 1.2.2 Creative_Family & Creative_Version
__For 'Creative - Family':__ <br>
After evaluating the values in the 'Creative_Family' column, we have found that most of the illogical values are for 'General Traffic' records. Since 'General Traffic' would not be associated with any campaign, we have opted only to keep the "Creative_Family" values that are associated with the campaign traffic or have >5 entries.

> __Accepted values:__
SEM Ads, 
UnfairAdvantage, 
CloserTwins, 
Trade Media Ads, 
CloseFaster, 
domain, 
August, 
CompetitiveOpportunity, 
newsletter, 
One-Off, 
crm
 <br>

__For 'Creative - Version':__<br>
We have opted to leave the values as is, since we cannot be sure which version names are correct
> Column left untouched so that the data can be used if needed <br>
> However, we currently have no plans to use this data column

__For both columns:__<br>
> All blanks & '(not set)' values to be renamed to 'NA' for consistency

In [16]:
CreFam_allowed = ['SEM Ads', 'UnfairAdvantage', 'CloserTwins', \
                  'Trade Media Ads', 'CloseFaster', 'domain', \
                  'August', 'CompetitiveOpportunity', 'newsletter', \
                  'One-Off', 'crm', 'ComingSoon']

In [17]:
# Original values for Creative_Family in GoalStats
col_list(GoalStats, 'Creative_Family')

List of values in < Creative_Family > : 
 ['(not set)' 'SEM Ads' 'Trade Media Ads' 'CompetitiveOpportunity'
 'CloseFaster' 'CloserTwins' 'newsletter' 'One-Off' 'SnapdocsLive'
 'UnfairAdvantage' 'ComingSoon' 219526440.0 'domain' 'All3' '08-24-2022'
 '08-25-2022' '08-29-2022' 'August' 'crm'
 datetime.datetime(2022, 6, 9, 0, 0) datetime.datetime(2022, 8, 9, 0, 0)] 



In [18]:
# Replace all non-accepted values with "NA"
GoalStats.loc[~GoalStats['Creative_Family'].isin(CreFam_allowed), \
             'Creative_Family'] = 'NA'

col_list(GoalStats, 'Creative_Family')

List of values in < Creative_Family > : 
 ['NA' 'SEM Ads' 'Trade Media Ads' 'CompetitiveOpportunity' 'CloseFaster'
 'CloserTwins' 'newsletter' 'One-Off' 'UnfairAdvantage' 'ComingSoon'
 'domain' 'August' 'crm'] 



In [19]:
# Rename '(not set)' to 'NA'
GoalStats['Creative_Version'] = GoalStats.Creative_Version.replace('(not set)', 'NA')

col_list(GoalStats, 'Creative_Version')

List of values in < Creative_Version > : 
 ['NA' '300x250' '728x90' 'Cancelingyourlock' 'Change Wholesale'
 'Close More. Close Faster.' '3steps' 'MoreAll' 'NoDTI' 'CloseMore'
 'Competitors' 'Cutdown1A' 'Cutdown1B' 'EarlyBird' 'Faceoff' 'Faceoff1'
 'Faceoff2' 'Faster' 'interactive' 'MoreLoans' 'MoreLoansAll' 'Namaste'
 'Paperwork' 'RTB' 'ShapeUp' 'OTT_15' 'OTT_30' 'OTT_30QR' 'One-Off'
 'Rate Lock' 'ROS1' 'ROS5' 'SnapdocsLive' 'Theycancelweclose' 1099.0
 '1page' 'All' 'OnePage' "We Are America's CDFI" 'ComingSoon' 'Animated'
 219526440.0 'FasterAll' 'MoreNoReg' 'ad1' 'ROP' 'All3' '08-24-2022'
 '08-25-2022' '08-29-2022' 'August' datetime.datetime(2022, 6, 9, 0, 0)
 datetime.datetime(2022, 8, 9, 0, 0)] 



### 1.3. Success Markers
Columns:  <br>        
 7   Goal                     16509 non-null  object        <br>
 8   Completions              16509 non-null  float64       <br>
 
 Expected Values: <br>
 
 > Since all values in 'Completions' are numbers, we can get a number sensing

In [20]:
col_list(GoalStats, 'Goal')

List of values in < Goal > : 
 ['Learn More (Community Mortgage)' 'Form Submission'
 'Home Page Video Play' 'Learn More (Closer Twins Banner)'
 'Learn More (Our Story)' 'Start Closing More' 'Get Approved'
 'Closer Twins Page Video Play' 'Broker Login' 'utm_audience' 'test4'] 



In [21]:
GoalStats.Completions.describe()

count    16509.000000
mean         2.667575
std          4.415329
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        116.000000
Name: Completions, dtype: float64

### 1.3.1 Goals
The goals can be mapped back to clicks on the Change Wholesale website that will lead to certain activities, which includes:
1) Visiting another page on the website
> 'Learn More (Community Mortgage)', 'Learn More (Closer Twins Banner)', 'Learn More (Our Story)', 'Start Closing More', 'Get Approved', 'Get Approved' <br>

2) Playing a video <br>
> 'Home Page Video Play', 'Closer Twins Page Video Play' <br>

3) Other actions <br>
> 'Form Submission' <br>

Other values that are not understood will be dropped since an unidentified goal provides no value to the analysis
> 'utm_audience', 'test4'

In [22]:
# drop rows with 'utm_audience', 'test4'
GoalStats = GoalStats.loc[GoalStats['Goal'] != 'utm_audience']
GoalStats = GoalStats.loc[GoalStats['Goal'] != 'test4']

col_list(GoalStats, 'Goal')

List of values in < Goal > : 
 ['Learn More (Community Mortgage)' 'Form Submission'
 'Home Page Video Play' 'Learn More (Closer Twins Banner)'
 'Learn More (Our Story)' 'Start Closing More' 'Get Approved'
 'Closer Twins Page Video Play' 'Broker Login'] 



### 1.3.2 Completions
No cleaning required

## 3. Remove Redundant Columns

Unused: 'Days_Max_Date','Latest_Report'

In [23]:
GoalStats_final = GoalStats.copy().drop(columns=['Days_Max_Date','Latest_Report'])

## 4. Create Target Group Column

In [None]:
GoalStats_final['Target_Groups'] = GoalStats_final.loc[:, 'Audience']

GoalStats_final.head()

In [None]:
# Replace Audience '1,2,3' as '1' in Target_Group
GoalStats_final['Target_Groups'] = GoalStats_final['Target_Groups'].replace(['1', '2', '3'], '1')

# Replace Audience '4,5' as '2' in Target_Group
GoalStats_final['Target_Groups'] = GoalStats_final['Target_Groups'].replace(['4', '5'], '2')

# Replace Audience '6' as '3' in Target_Group
GoalStats_final['Target_Groups'] = GoalStats_final['Target_Groups'].replace(['6'], '3')

# Check Target Group values
GoalStats_final['Target_Groups'].unique()

## 5. Save cleaned dataframe as csv

In [24]:
GoalStats_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16507 entries, 0 to 16508
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              16507 non-null  datetime64[ns]
 1   Campaign          16507 non-null  object        
 2   Audience          16507 non-null  object        
 3   Creative_Family   16507 non-null  object        
 4   Creative_Version  16507 non-null  object        
 5   Platform          16507 non-null  object        
 6   Ad_Format         16507 non-null  object        
 7   Goal              16507 non-null  object        
 8   Completions       16507 non-null  float64       
 9   Campaign_Traffic  16507 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(8)
memory usage: 1.4+ MB


In [25]:
# Create a CSV file as output.
GoalStats_final.to_csv(r'Goal-Stats_Cleaned.csv', index=False)