# Objectives
- Understand 3 main groups of data:
    - Customer (Broker) Demographic data
    - Advertising Campaign data
    - Success Marker data

# Week 2: Data Cleaning
Assigned datasets: "Goal stats - web traffic", "General stats - web traffic"

0. Load dataset
1. Change column headings to names that are easier to reference
2. Explore the data.
    1. Create a new DataFrame.
    2. Sense-check the DataFrame.
    3. Determine if there are any missing values in the DataFrame.
    4. Create a summary of the descriptive statistics.
3. Remove redundant columns
4. Save a copy of the clean DataFrame as a CSV file. Import the file to sense-check.

## 0. Load file & create dataframes (GoalStats)


In [1]:
# Imports
import numpy as np
import pandas as pd

In [2]:
GoalStats_raw = pd.read_excel("Change 2022_GA writeback_091122.xlsx", sheet_name="Goal stats - web traffic")

GoalStats_raw.head()

Unnamed: 0,Date,Campaign,Audience,Creative - Family,Creative - Version,Platform,Ad Format,Goal,Completions,Campaign Traffic?,Days away from max date,Latest report?,Unnamed: 12,Goal.1
0,2022-09-22,(not set),,(not set),(not set),,,Learn More (Community Mortgage),116,General traffic,39,0,,Broker Login
1,2022-09-27,(not set),,(not set),(not set),,,Learn More (Community Mortgage),56,General traffic,34,0,,Closer Twins Page Video Play
2,2022-06-15,(not set),,(not set),(not set),,,Learn More (Community Mortgage),54,General traffic,138,0,,Form Submission
3,2022-10-14,(not set),,(not set),(not set),,,Learn More (Community Mortgage),52,General traffic,17,0,,Get Approved
4,2022-09-14,(not set),,(not set),(not set),,,Learn More (Community Mortgage),49,General traffic,47,0,,Home Page Video Play


In [3]:
# Create a new DataFrame for the cleaned data
GoalStats = GoalStats_raw.copy()

## 1. Rename Columns

In [4]:
# Rename the column headers.
GoalStats = GoalStats.rename(
    columns={
        "City, Country": "Location",
        "Creative - Family": "Creative_Family",
        "Creative - Version": "Creative_Version",
        "Ad Format": "Ad_Format",
        "Campaign Traffic?": "Campaign_Traffic",
        "Days away from max date": "Days_Max_Date",
        "Latest report?": "Latest_Report"})

GoalStats.columns

Index(['Date', 'Campaign', 'Audience', 'Creative_Family', 'Creative_Version',
       'Platform', 'Ad_Format', 'Goal', 'Completions', 'Campaign_Traffic',
       'Days_Max_Date', 'Latest_Report', 'Unnamed: 12', 'Goal.1'],
      dtype='object')


## 1. Check for missing values
- Evaluate what to do with entries with missing values

In [5]:
GoalStats_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16509 entries, 0 to 16508
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Date                     16509 non-null  datetime64[ns]
 1   Campaign                 16509 non-null  object        
 2   Audience                 4242 non-null   object        
 3   Creative - Family        16509 non-null  object        
 4   Creative - Version       16509 non-null  object        
 5   Platform                 4549 non-null   object        
 6   Ad Format                4427 non-null   object        
 7   Goal                     16509 non-null  object        
 8   Completions              16509 non-null  int64         
 9   Campaign Traffic?        16509 non-null  object        
 10  Days away from max date  16509 non-null  int64         
 11  Latest report?           16509 non-null  int64         
 12  Unnamed: 12              0 non-n

### Basic Overview:

Out of __16509 entries__

There are missing data for:
- Audience : **12267** missing values [74.3% missing]
- Platform :  **11960** missing values [72.4% missing]
- Ad Format :  **12082** missing values [73.2% missing]

> A large share of the values in these columns is missing. <br>
> ⇒ Dropping every row with a missing value would discard about three quarters of the data. <br>
> ⇒ Missing values are therefore relabelled as 'NA' rather than deleted.
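
The counts and percentages above can also be computed rather than worked out by hand. This is a minimal sketch on a small made-up frame (`demo` and its values are illustrative assumptions, standing in for `GoalStats_raw`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for GoalStats_raw; the values are illustrative only.
demo = pd.DataFrame({
    "Audience": ["1", np.nan, np.nan, "5"],
    "Platform": [np.nan, "OTT", np.nan, np.nan],
    "Goal": ["Get Approved"] * 4,
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "pct_missing": (demo.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have missing values.
missing = missing[missing["n_missing"] > 0]
print(missing)
```

Running the same two expressions on the real sheet should give the figures listed in the overview, assuming the sheet is loaded as shown earlier.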

In [6]:
# Print the unique values in one column of a DataFrame
def col_list(df, para):
    print("List of values in <", para, "> : \n", df[para].unique(), "\n")

### 1.1. Demographic Information
Columns: 'Audience' <br>
 2   Audience                 4242 non-null   object        
 
 Expected Values: <br>
Audience 1	:	Registered Loan Officers from Registered Brokerage, active (last 120 days). <br>
Audience 2	:	Registered Loan Officers from Registered Brokerage, inactive (last 120 days). <br>
Audience 3	:	Registered Loan Officers from Registered Brokerage, never registered a loan. <br>
Audience 4	:	Non-Registered Loan Officers from Registered Brokerages. <br>
Audience 5	:	Retargeted audience. Non-Registered Loan Officers who visited website (last 7 days). <br>
Audience 6	:	General Targeting. Brokers not registered and not from registered brokerages.

In [7]:
col_list(GoalStats, 'Audience')

List of values in < Audience > : 
 [nan 'General Targetting' '5' '1' 'test4' '4' '2' '3' '1to4'] 



### Cleaning Required:  
1) Rename "General Targetting" to "6" <br>
2) Rename all values other than 1-6 and "1to4" to "NA" <br>

In [8]:
# Rename 'General Targetting' to '6'
GoalStats['Audience'] = GoalStats.Audience.str.replace('General Targetting', '6')

# Replace every value outside the allowed list (including NaN) with 'NA'
allowed_vals = ['1','2','3','4','5','6','1to4']
GoalStats.loc[~GoalStats['Audience'].isin(allowed_vals), 'Audience'] = 'NA'


# Check remaining values 
col_list(GoalStats, 'Audience')

List of values in < Audience > : 
 ['NA' '6' '5' '1' '4' '2' '3' '1to4'] 



In [9]:
GoalStats.groupby('Audience')['Audience'].count()

Audience
1          68
1to4        4
2          23
3          21
4         108
5         153
6        3863
NA      12269
Name: Audience, dtype: int64

In [10]:
# Sum the numeric columns per audience, largest Completions first
GoalStats.groupby('Audience')\
        .sum(numeric_only=True)\
        .sort_values('Completions', ascending=False)

Audience,Completions,Days_Max_Date,Latest_Report,Unnamed: 12
NA,38170,1372374,767,0.0
6,5489,412591,268,0.0
5,155,16288,30,0.0
4,108,8562,17,0.0
1,69,9498,3,0.0
2,23,3737,0,0.0
3,21,2205,2,0.0
1to4,4,255,0,0.0


### 1.2. Advertising Campaign
Columns:  <br>
 1   Campaign                 16509 non-null  object        
 3   Creative - Family        16509 non-null  object        
 4   Creative - Version       16509 non-null  object        
 5   Platform                 4549 non-null   object        
 6   Ad Format                4427 non-null   object    
 
 Expected Values: <br>

In [11]:
col_list(GoalStats, 'Campaign')
col_list(GoalStats, 'Creative_Family')
col_list(GoalStats, 'Creative_Version')
col_list(GoalStats, 'Platform')
col_list(GoalStats, 'Ad_Format')
col_list(GoalStats, 'Campaign_Traffic')

List of values in < Campaign > : 
 ['(not set)' 'Brand_Exact' 'adhocwhol' 'NBNurture' 'Brand_Phrase'
 'FY22_Broker_Campaign' 'FY23_change_digital_phase3'
 'Change_Wholesale_plusup' 'FY22_broker_campaign_ph2' 2022
 'FY23_broker_campaign' 'Anti_Inflation' 'CloserTwins'
 'NB_Wholesale_Phrase' 'Q4_2022' 'Announcement' 'FY22_broker_campaign'
 'NB_Wholesale_Exact' '5d7f312058-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01'
 'c9dcf05b32-EMAIL_CAMPAIGN_2022_03_21_12_37_COPY_01'
 'f27ee0be9c-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01' 'AE_Intro'
 'Active Broker Emails' 'closr'
 'e3bc604b28-EMAIL_CAMPAIGN_2022_08_02_05_34' 'LORecruiting'
 'FY22_anti_inflation' 'August_Mesaage'] 

List of values in < Creative_Family > : 
 ['(not set)' 'SEM Ads' 'CloserTwins' 'Trade Media Ads' 'UnfairAdvantage'
 'August' 'CloseFaster' 'CompetitiveOpportunity' 'newsletter' 'One-Off'
 'SnapdocsLive' 'ComingSoon' 219526440 'domain' 'All3' '08-24-2022'
 '08-25-2022' '08-29-2022' 'crm' datetime.datetime(2022, 6, 9, 0, 0)
 dateti

### Cleaning Required:  
1) No cleaning required for 'Campaign Traffic?'  <br>
2) Rename missing values in 'Platform' and 'Ad Format' to "NA" <br>
<br>
3) "(not set)" values in 'Campaign', 'Creative - Family' and 'Creative - Version' <br>
    - A check in Excel shows that most "(not set)" values belong to entries recorded as "General traffic" rather than "Campaign" traffic
    - Hence, it makes sense that the 'Campaign' details of these activities are undefined
<br>
4) 'Campaign', 'Creative - Family' and 'Creative - Version' contain many distinct values <br>
    - Need to evaluate which are genuine values and which are erroneous <br>
    - Possibly cross-reference against the other datasets
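
One way to start separating genuine values from erroneous ones is to flag entries that are not plain text, or that look like dates. The sketch below uses a small hypothetical Series (`demo`) that mimics the mixed types seen in 'Creative_Family'. The `%m-%d-%Y` date pattern is an assumption based on values such as '08-24-2022'.

```python
import datetime
import pandas as pd

# Hypothetical sample mimicking the mixed types seen in 'Creative_Family'.
demo = pd.Series(
    ["SEM Ads", "CloserTwins", 219526440,
     datetime.datetime(2022, 6, 9), "08-24-2022"],
    dtype="object",
)

# Entries that are not plain strings (numbers, datetimes) are likely erroneous.
is_text = demo.map(lambda v: isinstance(v, str))
suspect = demo[~is_text]

# Among the strings, flag those that parse as a date in the assumed pattern.
text_values = demo[is_text]
parsed = pd.to_datetime(text_values, format="%m-%d-%Y", errors="coerce")
date_like = text_values[parsed.notna()]

print(suspect.tolist())
print(date_like.tolist())
```

Flagged values still need a manual decision; the check only narrows down where to look.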

In [12]:
# Replace missing values in every column with 'NA'
GoalStats = GoalStats.fillna('NA')

col_list(GoalStats, 'Platform')
col_list(GoalStats, 'Ad_Format')

List of values in < Platform > : 
 ['NA' 'Google SEM' 'OTT' 'Trade Media' 'LinkedIn' 'Domain Display'
 'User ID Display' 'Facebook'] 

List of values in < Ad_Format > : 
 ['NA' 'CPC' 'Video' 'Scotsman' 'Housingwire' 'Chrisman' 'Single image'
 'Inside_mortgage_finance_newsletter' 'Nmn' 'Carousel'
 'National_mortgage_news' 'Animated'] 



In [13]:
test = GoalStats.groupby('Campaign').count()

test

Campaign,Date,Audience,Creative_Family,Creative_Version,Platform,Ad_Format,Goal,Completions,Campaign_Traffic,Days_Max_Date,Latest_Report,Unnamed: 12,Goal.1
2022,19,19,19,19,19,19,19,19,19,19,19,19,19
(not set),11686,11686,11686,11686,11686,11686,11686,11686,11686,11686,11686,11686,11686
5d7f312058-EMAIL_CAMPAIGN_2022_03_27_11_04_COPY_01,2,2,2,2,2,2,2,2,2,2,2,2,2
AE_Intro,7,7,7,7,7,7,7,7,7,7,7,7,7
Active Broker Emails,1,1,1,1,1,1,1,1,1,1,1,1,1
Announcement,16,16,16,16,16,16,16,16,16,16,16,16,16
Anti_Inflation,26,26,26,26,26,26,26,26,26,26,26,26,26
August_Mesaage,2,2,2,2,2,2,2,2,2,2,2,2,2
Brand_Exact,2630,2630,2630,2630,2630,2630,2630,2630,2630,2630,2630,2630,2630
Brand_Phrase,480,480,480,480,480,480,480,480,480,480,480,480,480


### Remove values from Creative_Family & Creative_Version that are not found in the other dataset
This assumes that the "Change_2022_Google Analytics Ma" dataset contains the full list of expected values.

__For 'Creative - Family':__
CloserTwins, 
UnfairAdvantage, 
SEM Ads, 
CloseFaster, 
Trade Media Ads, 
CompetitiveOpportunity, 
newsletter <br>

__For 'Creative - Version':__
RTB, 
0, 
1099, 
1page, 
300x250, 
3steps, 
728x90, 
All, 
Animated, 
Cancelingyourlock, 
Change Wholesale, 
Close More. Close Faster., 
CloseMore, 
Competitors, 
Cutdown1A, 
Cutdown1B, 
EarlyBird, 
Faceoff, 
Faceoff1, 
Faceoff2, 
Faster, 
FasterAll, 
FasterReg, 
interactive, 
MoreAll, 
MoreLoans, 
MoreLoansAll, 
MoreNoReg, 
Namaste, 
NoDTI, 
OnePage, 
Paperwork, 
Rate Lock, 
ROS1, 
ROS2, 
ROS5, 
ShapeUp, 
Theycancelweclose, 
wallpaper, 
We Are America's CDFI
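
The cross-check against a reference list can be sketched as below. The frames `reference` and `target` are hypothetical stand-ins for the reference sheet and the sheet being cleaned. Reporting the unexpected values before overwriting them is a suggested precaution, not something the original workflow does.

```python
import pandas as pd

# Hypothetical stand-ins for the reference sheet and the sheet being cleaned.
reference = pd.DataFrame(
    {"Creative - Family": ["CloserTwins", "SEM Ads", "newsletter"]}
)
target = pd.DataFrame(
    {"Creative_Family": ["CloserTwins", "crm", "SEM Ads", "domain", "crm"]}
)

# Treat every value present in the reference sheet as allowed.
allowed = set(reference["Creative - Family"].dropna().unique())

# Report what would be overwritten before changing anything.
not_allowed = ~target["Creative_Family"].isin(allowed)
unexpected = target.loc[not_allowed, "Creative_Family"]
print(unexpected.value_counts())

# Apply the replacement on a copy so the original stays available.
cleaned = target.copy()
cleaned.loc[not_allowed, "Creative_Family"] = "NA"
print(cleaned["Creative_Family"].tolist())
```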

In [14]:
# Load the other two sheets for reference
GA_main_raw = pd.read_excel("Change 2022_GA writeback_091122.xlsx", sheet_name="Change_2022_Google Analytics Ma")
GenStats_raw = pd.read_excel("Change 2022_GA writeback_091122.xlsx", sheet_name="General stats - web traffic")

GA_main_raw.head()


Unnamed: 0,"City, Country",Audience,Campaign,Date,Platform,Ad Format,Creative - Family,Creative - Version,Total Sessions,Days away from max date,Latest report?
0,"Ashburn, United States",General Targetting,FY23_change_digital_phase3,2022-08-08,Trade Media,Inside_mortgage_finance_newsletter,CloserTwins,RTB,1,84,0
1,"Mebane, United States",4,FY23_broker_campaign,2022-06-16,User ID Display,,UnfairAdvantage,OnePage,1,137,0
2,"Chicago, United States",2,FY23_broker_campaign,2022-06-05,Domain Display,Single image,UnfairAdvantage,NoDTI,2,148,0
3,"South Jordan, United States",1,FY23_broker_campaign,2022-09-02,User ID Display,,CloserTwins,MoreLoansAll,1,59,0
4,"Potsdam, United States",4,FY23_broker_campaign,2022-10-30,Domain Display,,UnfairAdvantage,1099,1,1,1


In [26]:
GoalStats_Xcheck = GoalStats.copy()

In [27]:
# Original values for Creative_Family in GoalStats
col_list(GoalStats_Xcheck, 'Creative_Family')

List of values in < Creative_Family > : 
 ['(not set)' 'SEM Ads' 'CloserTwins' 'Trade Media Ads' 'UnfairAdvantage'
 'August' 'CloseFaster' 'CompetitiveOpportunity' 'newsletter' 'One-Off'
 'SnapdocsLive' 'ComingSoon' 219526440 'domain' 'All3' '08-24-2022'
 '08-25-2022' '08-29-2022' 'crm' datetime.datetime(2022, 6, 9, 0, 0)
 datetime.datetime(2022, 8, 9, 0, 0)] 



In [28]:
CreFam_allowed = ['(not set)', 'SEM Ads', 'CloserTwins', 'Trade Media Ads', 'UnfairAdvantage',
 'August', 'CloseFaster', 'CompetitiveOpportunity', 'newsletter', 'One-Off',
 'SnapdocsLive', 'ComingSoon']
#CreFam_allowed = GA_main_raw['Creative - Family'].unique()

GoalStats_Xcheck.loc[~GoalStats_Xcheck['Creative_Family'].isin(CreFam_allowed),
                     'Creative_Family'] = 'NA'
col_list(GoalStats_Xcheck, 'Creative_Family')


List of values in < Creative_Family > : 
 ['(not set)' 'SEM Ads' 'CloserTwins' 'Trade Media Ads' 'UnfairAdvantage'
 'August' 'CloseFaster' 'CompetitiveOpportunity' 'newsletter' 'One-Off'
 'SnapdocsLive' 'ComingSoon' 'NA'] 



In [18]:
# Original values for Creative_Version in GoalStats
col_list(GoalStats_Xcheck, 'Creative_Version')

List of values in < Creative_Version > : 
 ['(not set)' 'Change Wholesale' "We Are America's CDFI" 'OTT_30'
 'Rate Lock' 'ROS1' 'NoDTI' 'Faster' 'August' 'Close More. Close Faster.'
 '3steps' 'Namaste' 'RTB' 'ROS5' 'MoreLoansAll' 'All' 'OnePage' '300x250'
 '728x90' 'Cancelingyourlock' 'MoreAll' 'CloseMore' 'Competitors'
 'Cutdown1A' 'Cutdown1B' 'EarlyBird' 'Faceoff' 'Faceoff1' 'Faceoff2'
 'interactive' 'MoreLoans' 'Paperwork' 'ShapeUp' 'OTT_15' 'OTT_30QR'
 'One-Off' 'SnapdocsLive' 'Theycancelweclose' 1099 '1page' 'ComingSoon'
 'Animated' 219526440 'FasterAll' 'MoreNoReg' 'ad1' 'ROP' 'All3'
 '08-24-2022' '08-25-2022' '08-29-2022'
 datetime.datetime(2022, 6, 9, 0, 0) datetime.datetime(2022, 8, 9, 0, 0)] 



In [19]:
# Keep only Creative_Version values that also appear in the reference sheet
CreVer_allowed = GA_main_raw['Creative - Version'].unique()

GoalStats_Xcheck.loc[~GoalStats_Xcheck['Creative_Version'].isin(CreVer_allowed),
                     'Creative_Version'] = 'NA'
col_list(GoalStats_Xcheck, 'Creative_Version')

List of values in < Creative_Version > : 
 ['NA' 'Change Wholesale' "We Are America's CDFI" 'Rate Lock' 'ROS1'
 'NoDTI' 'Faster' 'Close More. Close Faster.' '3steps' 'Namaste' 'RTB'
 'ROS5' 'MoreLoansAll' 'All' 'OnePage' '300x250' '728x90'
 'Cancelingyourlock' 'MoreAll' 'CloseMore' 'Competitors' 'Cutdown1A'
 'Cutdown1B' 'EarlyBird' 'Faceoff' 'Faceoff1' 'Faceoff2' 'interactive'
 'MoreLoans' 'Paperwork' 'ShapeUp' 'Theycancelweclose' 1099 '1page'
 'Animated' 'FasterAll' 'MoreNoReg'] 



### 1.3. Success Markers
Columns:  <br>        
 7   Goal                     16509 non-null  object        <br>
 8   Completions              16509 non-null  int64         <br>
 10  Days away from max date  16509 non-null  int64         <br>
 11  Latest report?           16509 non-null  int64         <br>
 
 Expected Values: <br>
 
 > Since most of these columns are numeric, descriptive statistics give a quick sense-check of their values

In [20]:
GoalStats.describe()

Unnamed: 0,Completions,Days_Max_Date,Latest_Report
count,16509.0,16509.0,16509.0
mean,2.667575,110.576655,0.065843
std,4.415329,62.341727,0.248015
min,1.0,0.0,0.0
25%,1.0,55.0,0.0
50%,1.0,115.0,0.0
75%,2.0,166.0,0.0
max,116.0,213.0,1.0


In [21]:
col_list(GoalStats, 'Goal')

List of values in < Goal > : 
 ['Learn More (Community Mortgage)' 'Start Closing More' 'Get Approved'
 'Form Submission' 'Learn More (Closer Twins Banner)'
 'Learn More (Our Story)' 'Broker Login' 'Closer Twins Page Video Play'
 'Home Page Video Play' 'utm_audience' 'test4'] 



### Cleaning Required:  
1) No missing values 

## 3. Remove Redundant Columns

Unused: 'Date' <br>
Unsure: 'Latest_Report'
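
Columns that hold no data at all are also candidates for removal. Below is a minimal sketch on a hypothetical frame (`demo`), assuming a column like 'Unnamed: 12' is entirely empty. The check has to run before missing values are filled with 'NA', otherwise no column looks empty.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column that contains no data.
demo = pd.DataFrame({
    "Goal": ["Get Approved", "Broker Login"],
    "Completions": [3, 1],
    "Unnamed: 12": [np.nan, np.nan],
})

# Columns where every value is missing carry no information.
empty_cols = demo.columns[demo.isna().all()].tolist()
print(empty_cols)

# drop() returns a new DataFrame, so demo itself is left unchanged.
trimmed = demo.drop(columns=empty_cols)
print(trimmed.columns.tolist())
```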

In [22]:
GoalStats_final = GoalStats.drop(columns=['Date', 'Latest_Report'])

## 4. Save cleaned DataFrame as CSV

In [23]:
GoalStats_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16509 entries, 0 to 16508
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Campaign          16509 non-null  object
 1   Audience          16509 non-null  object
 2   Creative_Family   16509 non-null  object
 3   Creative_Version  16509 non-null  object
 4   Platform          16509 non-null  object
 5   Ad_Format         16509 non-null  object
 6   Goal              16509 non-null  object
 7   Completions       16509 non-null  int64 
 8   Campaign_Traffic  16509 non-null  object
 9   Days_Max_Date     16509 non-null  int64 
 10  Unnamed: 12       16509 non-null  object
 11  Goal.1            16509 non-null  object
dtypes: int64(2), object(10)
memory usage: 1.5+ MB


In [24]:
# Create a CSV file as output.
GoalStats_final.to_csv('GoalStats_Cleaned.csv', index=False)
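
When re-importing the file to sense-check it, note that `pd.read_csv` treats the literal string 'NA' as a missing value by default, so the 'NA' labels written above would come back as NaN. The sketch below shows the difference on a small hypothetical frame (`demo`). It uses an in-memory buffer instead of the real CSV file so that it stays self-contained.

```python
import io
import pandas as pd

# Hypothetical frame using the same 'NA' label as the cleaned data.
demo = pd.DataFrame({"Audience": ["NA", "6"], "Completions": [2, 5]})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing converts the string 'NA' back into a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps 'NA' as ordinary text.
buffer.seek(0)
checked = pd.read_csv(buffer, keep_default_na=False)

print(default_read["Audience"].isna().sum())
print(checked["Audience"].tolist())
```

The same `keep_default_na=False` argument can be passed when reading 'GoalStats_Cleaned.csv', followed by `info()` to compare against the output above.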