# Objectives
- Understand 3 main groups of data:
    - Customer (Broker) Demographic data
    - Advertising Campaign data
    - Success Marker data

# Week 2 : Data Cleaning
Assigned datasets: "Goal stats - web traffic", "General stats - web traffic"

0. Load dataset
1. Change column headings to names that are easier to reference
2. Explore the data.
    1. Create a new DataFrame.
    2. Sense-check the DataFrame.
    3. Determine if there are any missing values in the DataFrame.
    4. Create a summary of the descriptive statistics.
3. Remove redundant columns
4. Save a copy of the clean DataFrame as a CSV file. Import the file to sense-check.

## 0. Load file & create dataframes (GenStats)


In [1]:
# Imports
import numpy as np
import pandas as pd

In [2]:
GenStats_raw = pd.read_excel("Change 2022_GA writeback_091122.xlsx", sheet_name="General stats - web traffic")

GenStats_raw.head()

,Date,Audience,Creative - Family,Creative - Version,Platform,Ad Format,Campaign Traffic?,Total Sessions,Total Bounces,Total Duration,Days away from max date,Latest report?
0,2022-08-16,,CloserTwins,Cutdown1A,Domain Display,Video,Campaign,1.0,0.0,73.0,76.0,0
1,2022-06-16,3.0,CloseFaster,NoDTI,Facebook,Single image,Campaign,1.0,0.0,0.0,137.0,0
2,2022-08-29,1.0,CloseFaster,MoreAll,User ID Display,,Campaign,2.0,0.0,0.0,63.0,0
3,2022-06-09,4.0,UnfairAdvantage,1page,LinkedIn,Single image,Campaign,2.0,0.0,50.0,144.0,0
4,2022-08-03,1.0,UnfairAdvantage,1099.0,Domain Display,Single image,Campaign,1.0,0.0,0.0,89.0,0


In [3]:
#Create new dataframe for cleaned data
GenStats = GenStats_raw.copy()

## 1. Rename Columns

In [4]:
# Rename the column headers.
GenStats = GenStats.rename(
    columns={
        "City, Country": "Location",
        "Creative - Family": "Creative_Family",
        "Creative - Version": "Creative_Version",
        "Ad Format": "Ad_Format",
        "Campaign Traffic?": "Campaign_Traffic",
        "Total Sessions": "Total_Sessions",
        "Total Bounces": "Total_Bounces",
        "Total Duration": "Total_Duration",
        "Days away from max date": "Days_Max_Date",
        "Latest report?": "Latest_Report"})

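Note that `rename` silently skips mapping keys that are not present in the DataFrame (for example, "City, Country" does not appear among the GenStats columns listed in the `info()` output). A minimal sketch of how this could be checked, using a small made-up frame; `frame`, `mapping` and `missing_keys` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical two-column frame standing in for GenStats.
frame = pd.DataFrame({"Total Sessions": [1.0], "Ad Format": ["Video"]})
mapping = {
    "City, Country": "Location",        # not a column of this frame
    "Total Sessions": "Total_Sessions",
    "Ad Format": "Ad_Format",
}

# Mapping keys that the frame does not contain.
missing_keys = [old for old in mapping if old not in frame.columns]

# Default behaviour (errors="ignore"): missing keys are skipped without warning.
renamed = frame.rename(columns=mapping)

# errors="raise" turns a missing key into a KeyError instead.
try:
    frame.rename(columns=mapping, errors="raise")
    strict_failed = False
except KeyError:
    strict_failed = True

print(missing_keys, list(renamed.columns), strict_failed)
```

Passing `errors="raise"` is one way to catch a stale mapping early when the same dictionary is reused across sheets.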


## 2. Check for missing values
- Evaluate what to do with entries with missing values

In [5]:
GenStats_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13194 entries, 0 to 13193
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Date                     13194 non-null  datetime64[ns]
 1   Audience                 12430 non-null  object        
 2   Creative - Family        13194 non-null  object        
 3   Creative - Version       13194 non-null  object        
 4   Platform                 12821 non-null  object        
 5   Ad Format                7956 non-null   object        
 6   Campaign Traffic?        13194 non-null  object        
 7   Total Sessions           13194 non-null  float64       
 8   Total Bounces            13194 non-null  float64       
 9   Total Duration           13194 non-null  float64       
 10  Days away from max date  13194 non-null  float64       
 11  Latest report?           13194 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(1), object(6)

## Basic Overview:

Out of __13194 entries__

There are missing data for:
- Audience : **764** missing values [**5.8%** missing]
    - Unable to determine which demographic group each entry falls under
    - Not useful for determining the effectiveness of ad campaigns within each demographic group<br>
>⇒ More useful to **remove entries** with a missing value in the 'Audience' column

- Platform :  **373** missing values [**2.8%** missing]
    - However, ad campaign family and version information is still available <br>
    - Entries can still be used ⇒ no need to remove the whole row <br>
>⇒ **Replace missing entries with "unidentified"** in the 'Platform' column

- Ad Format :  **5238** missing values [**39.7%** missing]
    - A significant share of entries have no 'Ad Format' value
    - Removing them all would significantly reduce the sample size
>⇒ **Replace missing entries with "unidentified"** in the 'Ad Format' column

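The counts and percentages above can be derived directly rather than read off `info()`. A minimal sketch on a made-up frame (the `sample` data and the `missing` table are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for GenStats with a few gaps.
sample = pd.DataFrame({
    "Audience": ["1", np.nan, "3", "4"],
    "Platform": ["Facebook", "LinkedIn", np.nan, np.nan],
    "Total_Sessions": [1.0, 2.0, 1.0, 5.0],
})

# isna().sum() counts missing cells per column; isna().mean() gives the share.
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps.
missing = missing[missing["missing"] > 0]
print(missing)
```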
In [6]:
# Helper to print the unique values in one column
def col_list(df, para):
    print("List of values in <", para, "> : \n", df[para].unique(), "\n")

### 2.1. Demographic Information
Columns: 'Audience' <br>

 1   Audience                 12430 non-null  object       
 
 Expected Values: <br>
Audience 1	:	Registered Loan Officers from Registered Brokerage, active (last 120 days). <br>
Audience 2	:	Registered Loan Officers from Registered Brokerage, inactive (last 120 days). <br>
Audience 3	:	Registered Loan Officers from Registered Brokerage, never registered a loan. <br>
Audience 4	:	Non-Registered Loan Officers from Registered Brokerages. <br>
Audience 5	:	Retargeted audience. Non-Registered Loan Officers who visited website (last 7 days). <br>
Audience 6	:	General Targeting. Brokers not registered and not from registered brokerages.

In [7]:
col_list(GenStats, 'Audience')

List of values in < Audience > : 
 [nan '3' '1' '4' '5' 'General Targetting' '2'
 '1to4https://changewholesale.com/anti-inflation-special/?utm_campaign=FY22_anti_inflation'
 '1to4'
 'fourhttps://changewholesale.com/broker-approval/?utm_campaign=FY23_broker_campaign'
 'fivehttps://changewholesale.com/broker-approval/?utm_campaign=FY23_broker_campaign'
 'fivehttps://changewholesale.com/closer-twins/' '44652' 'test2'
 'fivedisparate' 'test3'
 'fourhttps://changewholesale.com/?utm_campaign=FY23_broker_campaign'
 'test20th' 'five/broker-approval/' 'test6' 'one/' 'test5'] 



### Cleaning Required:  
1) Remove unwanted URLs and other text attached to the audience labels (1-6, 1to4)  <br>
2) Rename "General Targetting" to "6" <br>
3) Relabel all other values (including missing ones) as "unidentified" <br>

In [8]:
# Collapse values with an unwanted tail (URL or path) to the bare label.
# regex=True is passed explicitly because the default differs between pandas versions.
GenStats['Audience'] = GenStats['Audience'].str.replace(r'^.*one.*$', '1', regex=True)
GenStats['Audience'] = GenStats['Audience'].str.replace(r'^.*four.*$', '4', regex=True)
GenStats['Audience'] = GenStats['Audience'].str.replace(r'^.*five.*$', '5', regex=True)
GenStats['Audience'] = GenStats['Audience'].str.replace(r'^.*1to4.*$', '1to4', regex=True)

# Rename 'General Targetting' to '6'
GenStats['Audience'] = GenStats['Audience'].str.replace('General Targetting', '6', regex=False)

# Relabel every value outside the allowed list (including NaN) as 'unidentified'
allowed_vals = ['1','2','3','4','5','6','1to4']
GenStats.loc[~GenStats['Audience'].isin(allowed_vals), 'Audience'] = 'unidentified'


# Check remaining values 
col_list(GenStats, 'Audience')

List of values in < Audience > : 
 ['unidentified' '3' '1' '4' '5' '6' '2' '1to4'] 
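The same rules could also be written as a single function, which makes the order of the checks explicit. This is only a sketch on made-up values: `normalise_audience`, `ALLOWED`, `WORD_TO_DIGIT` and the sample Series are hypothetical and not part of the notebook.

```python
import pandas as pd

ALLOWED = {"1", "2", "3", "4", "5", "6", "1to4"}
WORD_TO_DIGIT = {"one": "1", "four": "4", "five": "5"}


def normalise_audience(value):
    """Map one raw Audience value to an allowed label or 'unidentified'."""
    if not isinstance(value, str):       # NaN, None and stray numbers
        return "unidentified"
    if value == "General Targetting":    # spelling as it appears in the data
        return "6"
    if "1to4" in value:                  # '1to4' with or without a URL tail
        return "1to4"
    for word, digit in WORD_TO_DIGIT.items():
        if word in value:                # same effect as the '.*one.*' style patterns
            return digit
    return value if value in ALLOWED else "unidentified"


# Made-up examples of each kind of raw value.
raw = pd.Series([
    None, "3", "General Targetting", "fourhttps://example.com/x",
    "one/", "1to4https://example.com/y", "test2", "fivedisparate",
])
cleaned = raw.map(normalise_audience)
print(cleaned.tolist())
```

A plain function like this is easier to unit-test than a chain of `str.replace` calls, at the cost of being slower on very large columns.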





In [29]:
(GenStats.groupby('Audience')
         .agg({'Total_Sessions': 'sum', 'Total_Bounces': 'sum', 'Total_Duration': 'sum'})
         .sort_values('Total_Sessions', ascending=False))

Audience,Total_Sessions,Total_Bounces,Total_Duration
unidentified,446497.0,52532.0,134542900.0
4,23098.0,13.0,249435.0
6,21026.0,23.0,2210070.0
5,16385.0,1559.0,428218.0
1,8349.0,3.0,37215.0
3,7484.0,0.0,19694.0
2,4598.0,0.0,11086.0
1to4,450.0,0.0,1483.0


In [32]:
# Column totals. sum() concatenates text columns, so only the numeric rows are meaningful.
GenStats.sum()


Audience            unidentified3141451unidentified6unidentified44...
Creative_Family     CloserTwinsCloseFasterCloseFasterUnfairAdvanta...
Platform            Domain DisplayFacebookUser ID DisplayLinkedInD...
Ad_Format           VideoSingle imageunidentifiedSingle imageSingl...
Campaign_Traffic    CampaignCampaignCampaignCampaignCampaignCampai...
Total_Sessions                                               527887.0
Total_Bounces                                                 54130.0
Total_Duration                                       137500063.999999
Days_Max_Date                                               1288877.0
Latest_Report                                                     841
dtype: object
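Summing every column concatenates the text columns, and recent pandas releases may raise an error instead of warning when a column (such as a date) cannot be summed. A minimal sketch of restricting the totals to numeric columns, using a made-up frame rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for GenStats with date, text and numeric columns.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-08-16", "2022-06-16"]),
    "Platform": ["Domain Display", "Facebook"],
    "Total_Sessions": [1.0, 2.0],
    "Total_Bounces": [0.0, 1.0],
})

# numeric_only=True skips the date and text columns entirely.
numeric_totals = sample.sum(numeric_only=True)
print(numeric_totals)
```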

In [30]:
GenStats.groupby('Audience')['Audience'].count()

Audience
1               2041
1to4              30
2               1174
3               1487
4               3678
5               2851
6               1162
unidentified     771
Name: Audience, dtype: int64

### Re-evaluating the need to remove missing data under 'Audience'

Only 5.8% of entries have an 'unidentified' audience type.
However, this group also accounts for 84.6% of total sessions and 97.8% of total duration.

> I have opted to keep these entries until further discussion. They can still be removed easily afterwards by dropping all rows labelled 'unidentified'.

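The percentages quoted in this section can be reproduced from the grouped totals and entry counts shown in the outputs above. A small sketch with those figures typed in by hand (the duration values are the rounded figures as displayed, so the result is approximate):

```python
import pandas as pd

# Figures copied from the groupby and count outputs earlier in this section.
totals = pd.DataFrame(
    {
        "entries": [771, 3678, 1162, 2851, 2041, 1487, 1174, 30],
        "Total_Sessions": [446497, 23098, 21026, 16385, 8349, 7484, 4598, 450],
        "Total_Duration": [134542900, 249435, 2210070, 428218,
                           37215, 19694, 11086, 1483],
    },
    index=["unidentified", "4", "6", "5", "1", "3", "2", "1to4"],
)

# Each group's share of the column total, in percent.
share = (totals / totals.sum() * 100).round(1)
unidentified_share = share.loc["unidentified"]
print(unidentified_share)
```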
### 2.2. Advertising Campaign
Columns:  <br>
 2   Creative - Family        13194 non-null  object         <br>
 3   Creative - Version       13194 non-null  object         <br>
 4   Platform                 12821 non-null  object         <br>
 5   Ad Format                7956 non-null   object         <br>
 6   Campaign Traffic?        13194 non-null  object         <br>
 
 Expected Values: <br>

In [10]:
col_list(GenStats, 'Creative_Family')
col_list(GenStats, 'Creative_Version')
col_list(GenStats, 'Platform')
col_list(GenStats, 'Ad_Format')
col_list(GenStats, 'Campaign_Traffic')

List of values in < Creative_Family > : 
 ['CloserTwins' 'CloseFaster' 'UnfairAdvantage' 'SEM Ads' 'Trade Media Ads'
 '(not set)' datetime.datetime(2022, 6, 9, 0, 0) 'domain' 'August'
 'August/' 'CompetitiveOpportunity' '08-29-2022' 'newsletter' 'NovDec'
 'DybffeGjvaf' 'One-Off' datetime.datetime(2022, 1, 3, 0, 0) 'crm'
 'ComingSoon' 'All3' 227112117.0 '08-25-2022' '08-24-2022' 'SnapdocsLive'
 'eml' '08-30-2022' datetime.datetime(2022, 8, 9, 0, 0) '10-28-2022'
 'ebgf' 'December' 44801.0 206306768.0 'nmls' 'Baf-Baa' 'DbzvatFbba'
 'unfair' 'Bhthfg' 'FabcebdfYvif' 219526440.0 'afjfyfggfe' '08-28-2022'] 

List of values in < Creative_Version > : 
 ['Cutdown1A' 'NoDTI' 'MoreAll' '1page' 1099.0 'MoreLoansAll' 'Faceoff'
 'All' "We Are America's CDFI" 'EarlyBird' 'OnePage' 'CloseMore' 'ROS5'
 'Competitors' 'Namaste' 'interactive' 'OTT_15' 'ROS1' 'Change Wholesale'
 'FasterAll' '3steps' 'Paperwork' '300x250' 'Faster' 'Cutdown1B'
 'Animated' '(not set)' 'Faceoff1' 'FasterReg' 'RTB'
 datetime.dat ... [output truncated]

### Cleaning Required:  
1) No cleaning required for 'Campaign Traffic?'  <br>
2) Replace missing values in 'Platform' and 'Ad Format' with "unidentified" <br>
<br>
3) Many distinct values in 'Creative - Family' and 'Creative - Version' <br>
    - Need to evaluate which are genuine values and which are erroneous <br>
    - Possibly cross-reference against the other datasets

In [11]:
# Replace remaining missing values (only 'Platform' and 'Ad_Format' still contain NaN)
GenStats = GenStats.fillna('unidentified')

col_list(GenStats, 'Platform')
col_list(GenStats, 'Ad_Format')

List of values in < Platform > : 
 ['Domain Display' 'Facebook' 'User ID Display' 'LinkedIn' 'Google SEM'
 'Trade Media' 'OTT' 'unidentified'] 

List of values in < Ad_Format > : 
 ['Video' 'Single image' 'unidentified' 'Carousel' 'CPC' 'Housingwire'
 'National_mortgage_news' 'Animated' 'Inside_mortgage_finance_newsletter'
 'Chrisman' 'Scotsman' 'Nmn' 'Nmn_partner_insight_1'
 'Nmn_partner_insight_2'] 



In [12]:
test = GenStats.groupby(['Creative_Family','Creative_Version'])

test

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028D5B9A33A0>

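Displaying a `groupby` object only prints its memory address. To see which family/version combinations actually occur, the groups need to be aggregated, for instance with `size()`. A minimal sketch on made-up rows (the `sample` frame and `combo_counts` are illustrative names):

```python
import pandas as pd

# Hypothetical rows standing in for GenStats.
sample = pd.DataFrame({
    "Creative_Family": ["CloserTwins", "CloserTwins", "CloseFaster", "CloserTwins"],
    "Creative_Version": ["Cutdown1A", "Cutdown1A", "NoDTI", "RTB"],
})

# size() counts the rows in each (family, version) pair.
combo_counts = (
    sample.groupby(["Creative_Family", "Creative_Version"])
          .size()
          .reset_index(name="rows")
          .sort_values("rows", ascending=False)
)
print(combo_counts)
```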
### Remove values from Creative_Family & Creative_Version that are not found in the other dataset
Assuming that the "Change_2022_Google Analytics Ma" dataset contains the full list of expected values:

__For 'Creative - Family':__
CloserTwins, 
UnfairAdvantage, 
SEM Ads, 
CloseFaster, 
Trade Media Ads, 
CompetitiveOpportunity, 
newsletter <br>

__For 'Creative - Version':__
RTB, 
0, 
1099, 
1page, 
300x250, 
3steps, 
728x90, 
All, 
Animated, 
Cancelingyourlock, 
Change Wholesale, 
Close More. Close Faster., 
CloseMore, 
Competitors, 
Cutdown1A, 
Cutdown1B, 
EarlyBird, 
Faceoff, 
Faceoff1, 
Faceoff2, 
Faster, 
FasterAll, 
FasterReg, 
interactive, 
MoreAll, 
MoreLoans, 
MoreLoansAll, 
MoreNoReg, 
Namaste, 
NoDTI, 
OnePage, 
Paperwork, 
Rate Lock, 
ROS1, 
ROS2, 
ROS5, 
ShapeUp, 
Theycancelweclose, 
wallpaper, 
We Are America's CDFI
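Before overwriting anything, it can help to list exactly which values fall outside the reference list and how often they occur. A sketch under the same assumption (the reference dataset is complete), using made-up Series; `reference`, `observed` and the derived names are hypothetical:

```python
import pandas as pd

# Hypothetical reference values and observed values (mixed types, as in the data).
reference = pd.Series(["CloserTwins", "UnfairAdvantage", "SEM Ads"])
observed = pd.Series(["CloserTwins", "August/", "unfair", "SEM Ads", 44801.0])

allowed = set(reference.dropna().unique())

# Values that would be relabelled, with their frequencies.
unexpected = observed[~observed.isin(allowed)]
dropped_counts = unexpected.astype(str).value_counts()

# where() keeps allowed values and substitutes the label elsewhere.
cleaned = observed.where(observed.isin(allowed), "unidentified")
print(dropped_counts)
print(cleaned.tolist())
```

Reviewing `dropped_counts` first makes it easier to spot near-misses (such as a truncated family name) that might deserve a manual mapping instead of being relabelled.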

In [14]:
# Load 3rd Dataset for reference
GA_main_raw = pd.read_excel("Change 2022_GA writeback_091122.xlsx", sheet_name="Change_2022_Google Analytics Ma")

GA_main_raw.head()


,"City, Country",Audience,Campaign,Date,Platform,Ad Format,Creative - Family,Creative - Version,Total Sessions,Days away from max date,Latest report?
0,"Ashburn, United States",General Targetting,FY23_change_digital_phase3,2022-08-08,Trade Media,Inside_mortgage_finance_newsletter,CloserTwins,RTB,1.0,84.0,0
1,"Mebane, United States",4,FY23_broker_campaign,2022-06-16,User ID Display,,UnfairAdvantage,OnePage,1.0,137.0,0
2,"Chicago, United States",2,FY23_broker_campaign,2022-06-05,Domain Display,Single image,UnfairAdvantage,NoDTI,2.0,148.0,0
3,"South Jordan, United States",1,FY23_broker_campaign,2022-09-02,User ID Display,,CloserTwins,MoreLoansAll,1.0,59.0,0
4,"Potsdam, United States",4,FY23_broker_campaign,2022-10-30,Domain Display,,UnfairAdvantage,1099.0,1.0,1.0,1


In [13]:
# Original values for Creative_Family in GenStats
col_list(GenStats, 'Creative_Family')

List of values in < Creative_Family > : 
 ['CloserTwins' 'CloseFaster' 'UnfairAdvantage' 'SEM Ads' 'Trade Media Ads'
 '(not set)' datetime.datetime(2022, 6, 9, 0, 0) 'domain' 'August'
 'August/' 'CompetitiveOpportunity' '08-29-2022' 'newsletter' 'NovDec'
 'DybffeGjvaf' 'One-Off' datetime.datetime(2022, 1, 3, 0, 0) 'crm'
 'ComingSoon' 'All3' 227112117.0 '08-25-2022' '08-24-2022' 'SnapdocsLive'
 'eml' '08-30-2022' datetime.datetime(2022, 8, 9, 0, 0) '10-28-2022'
 'ebgf' 'December' 44801.0 206306768.0 'nmls' 'Baf-Baa' 'DbzvatFbba'
 'unfair' 'Bhthfg' 'FabcebdfYvif' 219526440.0 'afjfyfggfe' '08-28-2022'] 



In [15]:
CreFam_allowed = GA_main_raw['Creative - Family'].unique()

# Relabel any family that does not appear in the reference dataset
GenStats.loc[~GenStats['Creative_Family'].isin(CreFam_allowed),
             'Creative_Family'] = 'unidentified'
col_list(GenStats, 'Creative_Family')


List of values in < Creative_Family > : 
 ['CloserTwins' 'CloseFaster' 'UnfairAdvantage' 'SEM Ads' 'Trade Media Ads'
 'unidentified' 'CompetitiveOpportunity' 'newsletter'] 



In [18]:
# Original values for Creative_Version in GenStats
col_list(GenStats, 'Creative_Version')

List of values in < Creative_Version > : 
 ['Cutdown1A' 'NoDTI' 'MoreAll' '1page' 1099.0 'MoreLoansAll' 'Faceoff'
 'All' "We Are America's CDFI" 'EarlyBird' 'OnePage' 'CloseMore' 'ROS5'
 'Competitors' 'Namaste' 'interactive' 'OTT_15' 'ROS1' 'Change Wholesale'
 'FasterAll' '3steps' 'Paperwork' '300x250' 'Faster' 'Cutdown1B'
 'Animated' '(not set)' 'Faceoff1' 'FasterReg' 'RTB'
 datetime.datetime(2022, 6, 9, 0, 0) 'Close More. Close Faster.'
 'MoreNoReg' 'ad1' 'Faceoff2' 'MoreLoans' 'August' 'ShapeUp' 'OTT_30QR'
 'August/' 'Cancelingyourlock' 'Rate Lock' '08-29-2022'
 'Theycancelweclose' 'OTT_30' 'NovDec' '728x90' 'EGC' 'One-Off'
 datetime.datetime(2022, 1, 3, 0, 0) 'ComingSoon' 'All3' 227112117.0
 '08-25-2022' '08-24-2022' 'OTT_15QR' 'ROS2' 'SnapdocsLive' '08-30-2022'
 datetime.datetime(2022, 8, 9, 0, 0) '10-28-2022' 'wallpaper' 'ybdx'
 'December' 44801.0 206306768.0 'Baf-Baa' 'ROS' 'DbzvatFbba' 'ROP' 'adv'
 'Bhthfg' 'FabcebdfYvif' 219526440.0 '633k583' '08-28-2022'] 



In [20]:
CreVer_allowed = GA_main_raw['Creative - Version'].unique()

# Relabel any version that does not appear in the reference dataset
GenStats.loc[~GenStats['Creative_Version'].isin(CreVer_allowed),
             'Creative_Version'] = 'unidentified'
col_list(GenStats, 'Creative_Version')

List of values in < Creative_Version > : 
 ['Cutdown1A' 'NoDTI' 'MoreAll' '1page' 1099.0 'MoreLoansAll' 'Faceoff'
 'All' "We Are America's CDFI" 'EarlyBird' 'OnePage' 'CloseMore' 'ROS5'
 'Competitors' 'Namaste' 'interactive' 'unidentified' 'ROS1'
 'Change Wholesale' 'FasterAll' '3steps' 'Paperwork' '300x250' 'Faster'
 'Cutdown1B' 'Animated' 'Faceoff1' 'FasterReg' 'RTB'
 'Close More. Close Faster.' 'MoreNoReg' 'Faceoff2' 'MoreLoans' 'ShapeUp'
 'Cancelingyourlock' 'Rate Lock' 'Theycancelweclose' '728x90' 'ROS2'
 'wallpaper'] 



### 2.3. Success Markers
Columns:  <br>        
 7   Total Sessions           13194 non-null  float64       <br>
 8   Total Bounces            13194 non-null  float64       <br>
 9   Total Duration           13194 non-null  float64       <br>
 10  Days away from max date  13194 non-null  float64       <br>
 11  Latest report?           13194 non-null  int64         <br>
 
 Expected Values: <br>
 
 > Since all values in these columns are numeric, summary statistics give a quick sense-check

In [16]:
GenStats.describe()

,Total_Sessions,Total_Bounces,Total_Duration,Days_Max_Date,Latest_Report
count,13194.0,13194.0,13194.0,13194.0,13194.0
mean,40.009626,4.102622,10421.41,97.6866,0.063741
std,317.764981,39.480841,104319.0,56.556894,0.2443
min,1.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,49.0,0.0
50%,2.0,0.0,0.0,94.0,0.0
75%,6.0,0.0,0.0,146.0,0.0
max,6505.0,608.0,1737539.0,213.0,1.0


### Cleaning Required:  
1) All measures are already of a numerical data type, no change required  <br>
2) No missing values 
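Beyond data types and missing values, a couple of range checks could be added as a further sense-check. The sketch below assumes that a bounce is a single-page session, so bounces should never exceed sessions, and that none of the measures should be negative. The `sample` frame is made up, and its last row is deliberately inconsistent:

```python
import pandas as pd

# Hypothetical rows; the last one violates the bounces <= sessions assumption.
sample = pd.DataFrame({
    "Total_Sessions": [1.0, 2.0, 6.0],
    "Total_Bounces": [0.0, 0.0, 7.0],
    "Total_Duration": [73.0, 0.0, 50.0],
})
metric_cols = ["Total_Sessions", "Total_Bounces", "Total_Duration"]

# Rows where any measure is negative.
negative_rows = sample[(sample[metric_cols] < 0).any(axis=1)]

# Rows where bounces exceed sessions.
bounce_excess = sample[sample["Total_Bounces"] > sample["Total_Sessions"]]

print(len(negative_rows), list(bounce_excess.index))
```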

## 3. Remove Redundant Columns

- Unused: 'Date'
- Unsure: 'Latest_Report'

In [33]:
# drop() returns a new DataFrame, so GenStats itself is left unchanged
GenStats_final = GenStats.drop(columns=['Date','Latest_Report'])

## 4. Save cleaned DataFrame as CSV

In [34]:
GenStats_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13194 entries, 0 to 13193
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Audience          13194 non-null  object 
 1   Creative_Family   13194 non-null  object 
 2   Creative_Version  13194 non-null  object 
 3   Platform          13194 non-null  object 
 4   Ad_Format         13194 non-null  object 
 5   Campaign_Traffic  13194 non-null  object 
 6   Total_Sessions    13194 non-null  float64
 7   Total_Bounces     13194 non-null  float64
 8   Total_Duration    13194 non-null  float64
 9   Days_Max_Date     13194 non-null  float64
dtypes: float64(4), object(6)
memory usage: 1.0+ MB


In [36]:
# Create a CSV file as output.
GenStats_final.to_csv(r'GenStats_Cleaned.csv', index=False)
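The plan for this step also calls for importing the saved file to sense-check it. One thing worth checking on re-import is the placeholder label: `pd.read_csv` treats the string "NA" as a missing value by default, whereas a label such as "unidentified" is read back as text. A minimal sketch using an in-memory buffer instead of a real file; the `cleaned` frame is made up:

```python
import io

import pandas as pd

# Hypothetical cleaned frame with two different placeholder labels.
cleaned = pd.DataFrame({
    "Platform": ["Facebook", "unidentified"],
    "Ad_Format": ["Video", "NA"],
    "Total_Sessions": [1.0, 2.0],
})

# Write to an in-memory CSV and read it straight back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# 'NA' comes back as NaN; 'unidentified' survives as text.
missing_after_reload = reloaded.isna().sum()
print(reloaded.shape == cleaned.shape)
print(missing_after_reload)
```

With the real file, the equivalent check would be to read `GenStats_Cleaned.csv` back with `pd.read_csv` and compare its shape and `info()` output against `GenStats_final`.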