# CRM Sales Dashboard Data Validation and Cleaning

Avinash Bisram (8/27/24)

**Description:** Data Validation and Cleaning to prepare the necessary data files for constructing the CRM Sales Dashboard.

In [1]:
# General Plan:

# Import packages
# Load raw tables
# Data Validation
    # Correct data types
    # Resolve typos and logical duplicates (also check for differently encoded blanks)
    # Check for duplicates
    # Report null values (and deal with them if necessary)
    # Report outliers (and deal with them if necessary)
# Data Profiling
    # Just understanding the data in general
    # This data has a data dictionary which is great
# EDA (?)
# Prepare data to be used in dashboarding
# Export clean data and final data

In [2]:
# Import necessary packages

import numpy as np
import pandas as pd
import os

# Removing row and column limits for pandas DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [3]:
# List the tables we have in the Data folder

os.listdir('./Data')

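# 5 raw files: 4 tables plus the data dictionary (the *_CLEAN.csv files are outputs of an earlier run of this notebook)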
# It looks like we have a data dictionary here as well so let's look at that first

['accounts.csv',
 'accounts_CLEAN.csv',
 'data_dictionary.csv',
 'products.csv',
 'products_CLEAN.csv',
 'sales_pipeline.csv',
 'sales_pipeline_CLEAN.csv',
 'sales_teams.csv',
 'sales_teams_CLEAN.csv']

## Data QA (data_dictionary.csv)

**Summary:** 
* data_dictionary.csv is simply a Data Dictionary telling us what information should be stored in each column of the tables we are looking at.

In [4]:
# Load the table

df_raw_data_dictionary = pd.read_csv('./Data/data_dictionary.csv')

# Display some general stats and the head of the table

df_raw_data_dictionary.info()
df_raw_data_dictionary.head(10)

# 21 rows, 3 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Table        21 non-null     object
 1   Field        21 non-null     object
 2   Description  21 non-null     object
dtypes: object(3)
memory usage: 632.0+ bytes


Unnamed: 0,Table,Field,Description
0,accounts,account,Company name
1,accounts,sector,Industry
2,accounts,year_established,Year Established
3,accounts,revenue,Annual revenue (in millions of USD)
4,accounts,employees,Number of employees
5,accounts,office_location,Headquarters
6,accounts,subsidiary_of,Parent company
7,products,product,Product name
8,products,series,Product series
9,products,sales_price,Suggested retail price


It looks like data_dictionary.csv is a description of each field/column in the various tables. It doesn't tell us what the data types of each field are supposed to be so we'll have to make our best guess based on the data within and how we'll be using it in our dashboard.

In [5]:
# Checking if every table is covered in this dictionary

df_raw_data_dictionary['Table'].unique()

# Looks like it because they are named the same as the CSV files. We can use this additional information going forward.

array(['accounts', 'products', 'sales_teams', 'sales_pipeline'],
      dtype=object)
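
The coverage check above can also be automated rather than read by eye. A minimal sketch, assuming each raw table lives in a file named `<table>.csv`; the helper name `uncovered_tables` and the toy inputs are illustrative and not part of the original notebook:

```python
import pandas as pd

def uncovered_tables(dictionary, filenames):
    """Return raw table names (taken from filenames) that the dictionary does not describe."""
    raw_tables = {
        name[:-len(".csv")]
        for name in filenames
        if name.endswith(".csv")
        and not name.endswith("_CLEAN.csv")      # skip exported outputs
        and name != "data_dictionary.csv"        # skip the dictionary itself
    }
    return raw_tables - set(dictionary["Table"].unique())

# Toy stand-ins for df_raw_data_dictionary and os.listdir('./Data')
sample_dictionary = pd.DataFrame({"Table": ["accounts", "products"]})
sample_files = ["accounts.csv", "accounts_CLEAN.csv", "data_dictionary.csv",
                "products.csv", "sales_teams.csv"]

missing = uncovered_tables(sample_dictionary, sample_files)
print(missing)
```

With the real inputs the call would be `uncovered_tables(df_raw_data_dictionary, os.listdir('./Data'))`, and an empty set would confirm that every raw table is described.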

## Data QA (accounts.csv)

**Summary:**

* sector column: A typo was found and resolved changing "technolgy" to "technology"
* office_location column: A typo was found and resolved changing "Philipines" to "Philippines"
* subsidiary_of column: 82% null values but we don't have to worry as nulls in this column are expected
* 0 duplicates found

In [6]:
# Displaying the relevant information from data_dictionary

df_raw_data_dictionary[df_raw_data_dictionary['Table'] == "accounts"]

Unnamed: 0,Table,Field,Description
0,accounts,account,Company name
1,accounts,sector,Industry
2,accounts,year_established,Year Established
3,accounts,revenue,Annual revenue (in millions of USD)
4,accounts,employees,Number of employees
5,accounts,office_location,Headquarters
6,accounts,subsidiary_of,Parent company


In [7]:
# Load the table

df_raw_accounts = pd.read_csv('./Data/accounts.csv')

# Display some general stats and the head of the table

df_raw_accounts.info()
df_raw_accounts.head(10)

# Shape: 85 rows, 7 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   account           85 non-null     object 
 1   sector            85 non-null     object 
 2   year_established  85 non-null     int64  
 3   revenue           85 non-null     float64
 4   employees         85 non-null     int64  
 5   office_location   85 non-null     object 
 6   subsidiary_of     15 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 4.8+ KB


Unnamed: 0,account,sector,year_established,revenue,employees,office_location,subsidiary_of
0,Acme Corporation,technolgy,1996,1100.04,2822,United States,
1,Betasoloin,medical,1999,251.41,495,United States,
2,Betatech,medical,1986,647.18,1185,Kenya,
3,Bioholding,medical,2012,587.34,1356,Philipines,
4,Bioplex,medical,1991,326.82,1016,United States,
5,Blackzim,retail,2009,497.11,1588,United States,
6,Bluth Company,technolgy,1993,1242.32,3027,United States,Acme Corporation
7,Bubba Gump,software,2002,987.39,2253,United States,
8,Cancity,retail,2001,718.62,2448,United States,
9,Cheers,entertainment,1993,4269.9,6472,United States,Massive Dynamic


In [8]:
# Step 1: Check if the data types of each column seem correct (based on the information we expect them to contain)

# COLUMN_NAME           EXPECTED DATA TYPE OF VALUES      CURRENT DATA TYPE

# account               string                            object (correct)
# sector                string                            object (correct)
# year_established      integer                           int64 (correct)
# revenue               float/decimal                     float64 (correct)
# employees             integer                           int64 (correct)
# office_location       string                            object (correct)
# subsidiary_of         string                            object (correct)

# All data types of columns look good!
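
The expected-vs-actual comparison in the table above could also be expressed as code, so it can be rerun whenever the raw files change. This is only a sketch: the `EXPECTED` mapping and the helper `dtype_mismatches` are hypothetical names, and the sample frame is a small stand-in for `df_raw_accounts`.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical expectations for a few accounts columns (illustration only)
EXPECTED = {
    "account": ptypes.is_string_dtype,
    "year_established": ptypes.is_integer_dtype,
    "revenue": ptypes.is_float_dtype,
}

def dtype_mismatches(df, expected):
    """Return the columns whose dtype fails its expected-type check."""
    return [col for col, check in expected.items() if not check(df[col])]

sample = pd.DataFrame({
    "account": ["Acme Corporation", "Betatech"],
    "year_established": [1996, 1986],
    "revenue": [1100.04, 647.18],
})
mismatches = dtype_mismatches(sample, EXPECTED)
print(mismatches)

# A year column that was read in as text gets flagged
broken = sample.assign(year_established=sample["year_established"].astype(str))
broken_mismatches = dtype_mismatches(broken, EXPECTED)
print(broken_mismatches)
```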

In [9]:
# Step 2: Resolve typos, logical duplicates, logical nulls
# We want to do this before actually checking for duplicates because some values might mean the same but are recorded
    # slightly different (lowercase vs. uppercase, null string values being '' or ' ', null int values being -1, etc.)
    
# Going through each column and looking at the unique values
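
The idea of a "logical null" described above can be sketched as a small helper that flags whitespace-only strings and common placeholder words. The placeholder list and the name `logical_null_mask` are assumptions made for illustration; the notebook itself performs this check by eye.

```python
import pandas as pd

# Hypothetical placeholder list; extend it to match the data at hand
PLACEHOLDERS = {"", "unknown", "blank", "n/a", "na", "none", "null"}

def logical_null_mask(series):
    """True where a non-null value is only whitespace or a known placeholder."""
    is_present = series.notna()
    # Real nulls become "" only for normalization; they are excluded again below
    normalized = series.where(is_present, "").astype(str).str.strip().str.lower()
    return is_present & normalized.isin(PLACEHOLDERS)

sample = pd.Series(["Acme", " ", "Unknown", None, "retail"])
mask = logical_null_mask(sample)
print(sample[mask])
```

Real nulls are deliberately left out of the mask so they can still be counted separately with `isnull().sum()`.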

In [10]:
# Step 2a: account (object)

# Quick check comparing current unique values and all lowercase values
df_raw_accounts['account'].nunique() # 85 current unique values
df_raw_accounts['account'].str.lower().nunique() # 85 lowercase unique values
    # Same number means no need to change any values

# Scanning through list of unique values for typos
df_raw_accounts['account'].sort_values()
    # Doesn't look like any typos besides one value "dambase" being lowercase while everything else is capitalized
    # Let's leave it for now (in case we need to merge on this value) but keep in mind for the end

# Logic check (based on data_dictionary)
    # Are all these values company names? I don't recognize them but they could be fabricated for this project
    
# Logical null values?
    # Since this is a string column, we would be checking for values like "Unknown", "Blank", etc.
    # Also checking for any empty string "" which could have varying spaces contained like " ", "  ", etc.
    # I don't see any so looks good

# Count of real null values
print(f"Null count for account column: {df_raw_accounts['account'].isnull().sum()}")
    # No null values

Null count for account column: 0


In [11]:
# Step 2b: sector (object)

# Compare current unique count and all lowercase count
df_raw_accounts['sector'].nunique() # 10 current unique values
df_raw_accounts['sector'].str.lower().nunique() # 10 lowercase unique values
    # Same number = good

# Look for typos in value list
df_raw_accounts['sector'].unique()
    # 'technolgy' is spelled wrong!

# Resolving the typo from above (and noting the change in case we need to do joins later)
df_raw_accounts.loc[df_raw_accounts['sector'] == 'technolgy', 'sector'] = 'technology'

# Checking the result
df_raw_accounts['sector'].unique()
    # Typo resolved

# Logic check (based on data_dictionary)
    # Are the values all Industry names? Yes they look correct

# Logical null values?
    # We didn't find any values that would otherwise indicate unknown or blank so NO logical null values here.
    
# Count of null values
print(f"Null count for sector column: {df_raw_accounts['sector'].isnull().sum()}")
    # No null values

Null count for sector column: 0


In [12]:
# Step 2c: year_established (int)

# Logic check (based on data_dictionary)
    # Are all the contained values valid years?
df_raw_accounts['year_established'].describe()
    # No glaring outliers in either direction so they look good
    
# Logical null values?
    # We don't see any negative values that would usually indicate unknown integer values so all good here.

# Count of null values
print(f"Null count for year_established column: {df_raw_accounts['year_established'].isnull().sum()}")
    # No null values

Null count for year_established column: 0


In [13]:
# Step 2d: revenue (float/decimal)

# Logic check (based on data_dictionary)
    # Are all the contained values plausible revenues (in millions of USD)?
df_raw_accounts['revenue'].describe()
    # Nothing seems extremely out of the ordinary but one company has revenue of 11 billion dollars.

# Logical null values?
    # Again we don't see any negative values (since this column would otherwise only hold positive values)

# Count of null values
print(f"Null count for revenue column: {df_raw_accounts['revenue'].isnull().sum()}")
    # No null values

Null count for revenue column: 0


In [14]:
# Step 2e: employees (int)

# Logic check (based on data_dictionary)
    # Are all the contained values valid employee counts?
df_raw_accounts['employees'].describe()
    # Values range from 9 to 34288 (seems believable if our company list contains a startup or small company)

# Logical null values
    # No negative values or otherwise strange values so all good.

# Count of null values
print(f"Null count for employees column: {df_raw_accounts['employees'].isnull().sum()}")
    # No null values

Null count for employees column: 0


In [15]:
# Step 2f: office_location (object)

# Compare current unique count and all lowercase count
df_raw_accounts['office_location'].nunique() # 15 current unique values
df_raw_accounts['office_location'].str.lower().nunique() # 15 lowercase unique values
    # Same number = good

# Look for typos in value list
df_raw_accounts['office_location'].unique()
    # 'Philipines' spelled wrong! Should be 'Philippines'

# Resolving the typo from above (and noting the change in case we need to do joins later)
df_raw_accounts.loc[df_raw_accounts['office_location'] == 'Philipines', 'office_location'] = 'Philippines'

# Checking the result
df_raw_accounts['office_location'].unique()
    # Typo resolved

# Logic check (based on data_dictionary)
    # Are the values all location names? Yes they look correct (they represent countries)

# Logical null values?
    # We didn't find any values that would otherwise indicate unknown or blank so NO logical null values here.
    
# Count of null values
print(f"Null count for office_location column: {df_raw_accounts['office_location'].isnull().sum()}")
    # No null values

Null count for office_location column: 0
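
Since the typo fixes are worth noting in case of later joins, one option is to keep them in a single mapping and log how many rows each one touched. The sketch below uses the two corrections found in this table; the names `CORRECTIONS` and `apply_corrections` are hypothetical, and the sample frame is a stand-in for `df_raw_accounts`.

```python
import pandas as pd

# Corrections found in the accounts table, keyed by column
CORRECTIONS = {
    "sector": {"technolgy": "technology"},
    "office_location": {"Philipines": "Philippines"},
}

def apply_corrections(df, corrections):
    """Return a corrected copy plus a log of (column, old, new, rows_changed)."""
    fixed = df.copy()
    log = []
    for column, mapping in corrections.items():
        for old, new in mapping.items():
            hits = int((fixed[column] == old).sum())
            fixed[column] = fixed[column].replace(old, new)
            log.append((column, old, new, hits))
    return fixed, log

sample = pd.DataFrame({
    "sector": ["technolgy", "medical", "technolgy"],
    "office_location": ["Philipines", "Kenya", "United States"],
})
fixed, change_log = apply_corrections(sample, CORRECTIONS)
print(change_log)
```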


In [16]:
# Step 2g: subsidiary_of (object)

# Compare current unique count and all lowercase unique count
df_raw_accounts['subsidiary_of'].nunique() # 7 current unique values
df_raw_accounts['subsidiary_of'].str.lower().nunique() # 7 lowercase unique values
    # Same number = good

# Look for typos in value list
df_raw_accounts['subsidiary_of'].unique()
    # No typos that we can see

# Logic check (based on data_dictionary)
    # Are all the values parent company names? (If so, every value should also appear in the account column)
parent_names = set(df_raw_accounts['subsidiary_of'].dropna().unique()) # dropping NaN explicitly

# Checking if all these values are present in the account column using sets
account_names = set(df_raw_accounts['account'].unique())

parent_names.difference(account_names) # Empty set means every parent company is a known account!

# Logical null values
df_raw_accounts['subsidiary_of'].unique()
    # No empty or filler values (Unknown, empty, blank, etc.)

# Count of null values
print(f"Null count for subsidiary_of column: {df_raw_accounts['subsidiary_of'].isnull().sum()}")
    # 70 null values (there are only 85 rows so this is 82%)
    # However this makes sense because we expect most companies to be standalone and not subsidiaries of others.
    # We don't have to do anything to these null values since our final deliverable is a dashboard

Null count for subsidiary_of column: 70
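
The parent-company check is a referential check: every non-null value in one column should exist in another. A reusable sketch follows, with the hypothetical helper name `orphan_values` and toy data in which one parent is deliberately missing (in the real table the result was empty):

```python
import pandas as pd

def orphan_values(child, parent):
    """Non-null values in `child` that never appear in `parent`, sorted."""
    present = child.dropna()
    return sorted(present[~present.isin(parent)].unique())

sample_accounts = pd.DataFrame({
    "account": ["Acme Corporation", "Bluth Company", "Cheers"],
    "subsidiary_of": [None, "Acme Corporation", "Massive Dynamic"],
})
orphans = orphan_values(sample_accounts["subsidiary_of"], sample_accounts["account"])
print(orphans)
```

The same helper would fit the later cross-table checks, for example products or sales agents in sales_pipeline against their lookup tables.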


In [17]:
# Step 3: Checking for duplicates

df_raw_accounts.duplicated().sum()

# No duplicates

0
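
`duplicated()` only catches rows that match exactly in every column. If `account` is treated as the table's key (an assumption, though the unique count of 85 equals the row count), a stricter check is to look for repeated keys after trimming and lowercasing. A minimal sketch with a hypothetical helper name and toy data:

```python
import pandas as pd

def duplicate_keys(df, key):
    """Rows whose key collides with another row after trimming and lowercasing."""
    normalized = df[key].str.strip().str.lower()
    return df[normalized.duplicated(keep=False)]

sample = pd.DataFrame({
    "account": ["Acme Corporation", "acme corporation ", "Betatech"],
    "sector": ["technology", "technology", "medical"],
})
exact_duplicates = int(sample.duplicated().sum())
key_collisions = duplicate_keys(sample, "account")
print(exact_duplicates, len(key_collisions))
```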

## Data QA (products.csv)

**Summary:**
* sales_price column: the data type MAY need to be changed from int to float/decimal, but that can easily be done in our dashboarding tool, so it is not an issue until we look at the other tables. Adding it to our notes for now.
* 0 duplicates found

In [18]:
# Displaying the relevant information from data_dictionary

df_raw_data_dictionary[df_raw_data_dictionary['Table'] == "products"]

Unnamed: 0,Table,Field,Description
7,products,product,Product name
8,products,series,Product series
9,products,sales_price,Suggested retail price


In [19]:
# Load the table

df_raw_products = pd.read_csv('./Data/products.csv')

# Display some general stats and the head of the table

df_raw_products.info()
df_raw_products.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   product      7 non-null      object
 1   series       7 non-null      object
 2   sales_price  7 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 296.0+ bytes


Unnamed: 0,product,series,sales_price
0,GTX Basic,GTX,550
1,GTX Pro,GTX,4821
2,MG Special,MG,55
3,MG Advanced,MG,3393
4,GTX Plus Pro,GTX,5482
5,GTX Plus Basic,GTX,1096
6,GTK 500,GTK,26768


In [20]:
# Step 1: Check if the data types of each column seem correct

# COLUMN_NAME   EXPECTED DATA TYPE OF VALUES      CURRENT DATA TYPE
# product       string                            object (correct)     
# series        string                            object (correct)
# sales_price   float/decimal                     int (not a huge issue honestly)

In [21]:
# Step 2: Resolve typos, logical duplicates, logical nulls

In [22]:
# Step 2a: product (object)

# Note: This table is so small we can do a lot of this work by eye

# Compare current unique count and all lowercase unique count
    # Good

# Look for typos in value list
    # No apparent typos (most products are of the GTX series and one GTK but the series of that value matches)

# Logic check (based on data_dictionary)
    # Are all of these values product names? Yes

# Logical null values
    # No logical null values that we can see

# Count of null values
    # No null values

In [23]:
# Step 2b: series (object) 

# Compare current unique count and all lowercase unique count
    # Good

# Look for typos in value list
    # No apparent typos

# Logic check (based on data_dictionary)
    # Are all of these values product series? Yes, they represent the prefix of each product

# Logical null values
    # No logical null values that we can see

# Count of null values
    # No null values

In [24]:
# Step 2c: sales_price (int)

# Logic check (based on data_dictionary)
    # Are all the contained values valid sales prices?
df_raw_products['sales_price'].describe()
    # Values range from 55 dollars to 26768. We can't really confirm if any of this is wrong without currencies listed.

# Logical null values
    # No negative values or otherwise strange values so all good.

# Count of null values
    # No null values

count        7.000000
mean      6023.571429
std       9388.428070
min         55.000000
25%        823.000000
50%       3393.000000
75%       5151.500000
max      26768.000000
Name: sales_price, dtype: float64

In [25]:
# Step 3: Check for duplicates

df_raw_products.duplicated().sum()

# No duplicates found

0

## Data QA (sales_pipeline.csv)

**Summary:**
* product column: resolved typo changing "GTXPro" to "GTX Pro"
* account column: 1425 missing values (16% of records), found among Engaging and Prospecting deals, but we won't be using the account field in our dashboard (thinking ahead) so we can ignore them for now
* engage_date column: Prospecting deals have no engage_date and all other deal stages do have a value for this column
* close_date column: Engaging/Prospecting deals have no close dates while Won/Lost all have close dates
* close_value column: Confirmed that all prospecting and engaging have no close values while lost deals have values of 0
* 0 duplicates

In [26]:
# Displaying the relevant information from data_dictionary

df_raw_data_dictionary[df_raw_data_dictionary['Table'] == "sales_pipeline"]

Unnamed: 0,Table,Field,Description
13,sales_pipeline,opportunity_id,Unique identifier
14,sales_pipeline,sales_agent,Sales agent
15,sales_pipeline,product,Product name
16,sales_pipeline,account,Company name
17,sales_pipeline,deal_stage,Sales pipeline stage (Prospecting > Engaging > Won / Lost)
18,sales_pipeline,engage_date,"Date in which the ""Engaging"" deal stage was initiated"
19,sales_pipeline,close_date,"Date in which the deal was ""Won"" or ""Lost"""
20,sales_pipeline,close_value,Revenue from the deal


In [27]:
# Load the table

df_raw_sales_pipeline = pd.read_csv('./Data/sales_pipeline.csv')

# Display some general stats and the head of the table

df_raw_sales_pipeline.info()
df_raw_sales_pipeline.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8800 entries, 0 to 8799
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   opportunity_id  8800 non-null   object 
 1   sales_agent     8800 non-null   object 
 2   product         8800 non-null   object 
 3   account         7375 non-null   object 
 4   deal_stage      8800 non-null   object 
 5   engage_date     8300 non-null   object 
 6   close_date      6711 non-null   object 
 7   close_value     6711 non-null   float64
dtypes: float64(1), object(7)
memory usage: 550.1+ KB


Unnamed: 0,opportunity_id,sales_agent,product,account,deal_stage,engage_date,close_date,close_value
0,1C1I7A6R,Moses Frase,GTX Plus Basic,Cancity,Won,2016-10-20,2017-03-01,1054.0
1,Z063OYW0,Darcel Schlecht,GTXPro,Isdom,Won,2016-10-25,2017-03-11,4514.0
2,EC4QE1BX,Darcel Schlecht,MG Special,Cancity,Won,2016-10-25,2017-03-07,50.0
3,MV1LWRNH,Moses Frase,GTX Basic,Codehow,Won,2016-10-25,2017-03-09,588.0
4,PE84CX4O,Zane Levy,GTX Basic,Hatfan,Won,2016-10-25,2017-03-02,517.0
5,ZNBS69V1,Anna Snelling,MG Special,Ron-tech,Won,2016-10-29,2017-03-01,49.0
6,9ME3374G,Vicki Laflamme,MG Special,J-Texon,Won,2016-10-30,2017-03-02,57.0
7,7GN8Q4LL,Markita Hansen,GTX Basic,Cheers,Won,2016-11-01,2017-03-07,601.0
8,OLK9LKZB,Niesha Huffines,GTX Plus Basic,Zumgoity,Won,2016-11-01,2017-03-03,1026.0
9,HAXMC4IX,James Ascencio,MG Advanced,,Engaging,2016-11-03,,


In [28]:
# Step 1: Check if the data types of each column seem correct

# COLUMN_NAME           EXPECTED DATA TYPE OF VALUES      CURRENT DATA TYPE

# opportunity_id        string or int?                    object (correct)
# sales_agent           string                            object (correct)
# product               string                            object (correct)
# account               string                            object (correct)
# deal_stage            string                            object (correct)
# engage_date           date                              object (WRONG)
# close_date            date                              object (WRONG)
# close_value           int or float                      float64 (correct)

# The two date columns are strings instead of dates
# Our dashboard software can convert them but we want to do it here so we can check the range and any strange values

In [29]:
# Casting the two date columns as dates

# Engage date (yyyy-mm-dd)
df_raw_sales_pipeline['engage_date'] = pd.to_datetime(df_raw_sales_pipeline['engage_date'])

# Close date (yyyy-mm-dd)
df_raw_sales_pipeline['close_date'] = pd.to_datetime(df_raw_sales_pipeline['close_date'])

# Inspecting the dtypes again
df_raw_sales_pipeline[['engage_date','close_date']].dtypes

# Looks better!

engage_date    datetime64[ns]
close_date     datetime64[ns]
dtype: object
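
Because the cast is done here partly to surface strange values, it can help to parse with an explicit format and collect anything that fails instead of letting the cast raise. This is a sketch under the assumption that both columns use `yyyy-mm-dd`; `parse_iso_dates` is a hypothetical helper and the sample series is made up.

```python
import pandas as pd

def parse_iso_dates(series):
    """Parse yyyy-mm-dd strings; return (parsed, raw values that failed to parse)."""
    parsed = pd.to_datetime(series, format="%Y-%m-%d", errors="coerce")
    # A value failed if it was present in the raw data but became NaT
    failed = series[series.notna() & parsed.isna()]
    return parsed, failed

raw = pd.Series(["2016-10-20", None, "2017-13-01", "not a date"])
parsed, failed = parse_iso_dates(raw)
print(failed)
```

An empty `failed` series means every non-null value was a valid calendar date in the expected format.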

In [30]:
# Step 2: Resolve typos, logical duplicates, logical nulls

In [31]:
# Step 2a: opportunity_id (object)

# Compare current unique count to unique count after lowercase and eliminating whitespace
df_raw_sales_pipeline['opportunity_id'].nunique() # 8800 unique values
df_raw_sales_pipeline['opportunity_id'].str.lower().str.strip().nunique() # 8800 unique values
    # Same number = good

# Look for typos in value list
    # If this is a unique identifier, we can't find typos the normal way
    # Looks like there isn't a clear pattern of letter+number combos but we can at least check the length
df_raw_sales_pipeline[df_raw_sales_pipeline['opportunity_id'].str.len() != 8]
    # no values with lengths other than 8
    
# Logic check (based on data dictionary)
    # We already confirmed that these values are unique identifiers (because nunique = num records)

# Logical null values
    # Every row has SOME value, so a logical null would have to be an eight-character placeholder; safe to assume
    # none are present (if they are, we can still safely treat them as unique identifiers)

# Count of null values
print(f"Null count for opportunity_id column: {df_raw_sales_pipeline['opportunity_id'].isnull().sum()}")
    # No null values

Null count for opportunity_id column: 0
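
Beyond the length check, the IDs in the sample rows all look like eight uppercase letters or digits. If that pattern holds (it is inferred from the displayed rows, not stated in the data dictionary), a regular expression gives a slightly stricter check:

```python
import pandas as pd

# Assumed from the sample rows, not a documented rule
ID_PATTERN = r"[A-Z0-9]{8}"

ids = pd.Series(["1C1I7A6R", "Z063OYW0", "ec4qe1bx", "MV1LWRN", "PE84 X4O"])
malformed = ids[~ids.str.fullmatch(ID_PATTERN)]
print(malformed)
```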


In [32]:
# Should have made this helper function a long time ago

# Compares the unique count of a series to the unique count after lowercase and trimming
def compareUniqueCounts(series) -> bool:
    return series.nunique() == series.str.lower().str.strip().nunique()    

In [33]:
# Making another helper function to report null counts in a sentence

def stateNullCount(dataframe, column) -> None:
    print(f"Null count for {column} column: {dataframe[column].isnull().sum()}")
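
A possible companion to `stateNullCount` is a report covering every column at once, which gives the same numbers as the per-column calls in one table. The function name `null_report` and the toy frame are illustrative only:

```python
import pandas as pd

def null_report(df):
    """Null count and percentage for every column that has at least one null."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(1),
    })
    return report[report["null_count"] > 0]

sample = pd.DataFrame({
    "account": ["Cancity", None, None, "Isdom"],
    "deal_stage": ["Won", "Engaging", "Prospecting", "Won"],
})
report = null_report(sample)
print(report)
```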

In [34]:
# Step 2b: sales_agent (object)

# Compare current unique count to unique count after lowercase and eliminating whitespace
compareUniqueCounts(df_raw_sales_pipeline['sales_agent']) # True = good

# Look for typos
sorted(df_raw_sales_pipeline['sales_agent'].unique())
    # No typos that I can tell (because these are names)

# Logical Check (based on data dictionary)
    # Are these all names of sales agents? Without more information they appear so

# Logical null values
    # None. Checked while looking for typos

# Count null values
stateNullCount(df_raw_sales_pipeline, 'sales_agent')
    # No null values

Null count for sales_agent column: 0


In [35]:
# Step 2c: product (object)

# Compare unique counts
compareUniqueCounts(df_raw_sales_pipeline['product']) # True = good

# Look for typos
df_raw_sales_pipeline['product'].unique()

# These should be the same values listed in the products table so let's compare the two lists
set(df_raw_sales_pipeline['product'].unique()).difference(set(df_raw_products['product']))
    # Found a typo! The product name should be "GTX Pro" not "GTXPro" (to match products table)

# Changing that typo
df_raw_sales_pipeline.loc[df_raw_sales_pipeline['product'] == "GTXPro", 'product'] = "GTX Pro"

# Check result
set(df_raw_sales_pipeline['product'].unique()).difference(set(df_raw_products['product']))
    # Success

# Logic Check
    # Done while computing the set difference
    
# Logical null values
    # None. Easy check while computing the set difference before

# Count null values
stateNullCount(df_raw_sales_pipeline, 'product')
    # No null values

Null count for product column: 0


In [36]:
# Step 2d: account (object)

# Compare unique counts
compareUniqueCounts(df_raw_sales_pipeline['account']) # True = good

# Look for typos + Logical check (based on data_dictionary)
    # Same process as above comparing this unique set to that of the account table
set(df_raw_sales_pipeline['account'].unique()).difference(set(df_raw_accounts['account']))
    # One value in set (nan) meaning there are null values in this column

# Logical null values
    # Handled above as well (other placeholder values would have been present in the returned set)

# Count null values
stateNullCount(df_raw_sales_pipeline, 'account')
    # 1425 null values (16% of records)

# Why would there be null values in the account column of the sales_pipeline? Let's investigate
#df_raw_sales_pipeline[df_raw_sales_pipeline['account'].isnull()]
    # A mix of engaging and prospecting deals.
    # Are these ALL the engaging and prospecting deals? (maybe the company is only added once the deal is closed?)

# After skimming through rows of either Engaging or Prospecting deal stages
    # Looks like MOST of the rows have missing accounts
    # That is a bit strange but thinking ahead, we won't be using account in our dashboard so we can ignore

Null count for account column: 1425
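
Rather than skimming rows, the share of missing accounts per deal stage can be computed directly. The sketch below uses a small made-up frame; on the real data the same expression would be applied to `df_raw_sales_pipeline`.

```python
import pandas as pd

pipeline = pd.DataFrame({
    "deal_stage": ["Won", "Lost", "Engaging", "Engaging", "Prospecting", "Prospecting"],
    "account": ["Cancity", "Isdom", None, "Codehow", None, None],
})

# Fraction of rows with a missing account, per deal stage
null_share = pipeline["account"].isnull().groupby(pipeline["deal_stage"]).mean()
print(null_share)
```

A share of 0 for Won and Lost together with a high share for the open stages would back up the observation made by skimming.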


In [37]:
# Step 2e: deal_stage (object)

# Compare unique counts
compareUniqueCounts(df_raw_sales_pipeline['deal_stage']) # True = good

# Look for typos
df_raw_sales_pipeline['deal_stage'].unique()
    # Only 4 values and no typos that I can see

# Logical null values
    # Nope. Already checked from above

# Logic check (based on data dictionary)
    # The only four values that should be here are here so all good.

# Count null values
stateNullCount(df_raw_sales_pipeline, 'deal_stage')
    # No null values

Null count for deal_stage column: 0


In [38]:
# Step 2f: engage_date (datetime)

# Checking the range of dates
df_raw_sales_pipeline['engage_date'].describe()
    # Range between 2016 and 2017

# Logic check (based on data dictionary)
    # They are definitely dates

# Logical nulls?
    # With dates, logical nulls may be encoded as extreme dates such as 1999-01-01 when the rest of dates are in 2016-2017.
    # We don't see any of that here it seems

# Typos?
    # Check range of months
df_raw_sales_pipeline['engage_date'].dt.month.agg(['min', 'max']) # Between 1 and 12 (good)
    # Check range of days
df_raw_sales_pipeline['engage_date'].dt.day.agg(['min', 'max']) # Between 1 and 31 (good)

# Count nulls
stateNullCount(df_raw_sales_pipeline, 'engage_date')
    # 500 nulls
    
# We expect this to be null for all rows that are prospecting and non-null otherwise
df_raw_sales_pipeline[df_raw_sales_pipeline['deal_stage'] == "Prospecting"]['engage_date'].describe()
    # Engage_date NULL for prospecting (good)
df_raw_sales_pipeline[df_raw_sales_pipeline['deal_stage'] != "Prospecting"]['engage_date'].isnull().sum()
    # Engage date NOT NULL for all other deal stages (good)

Null count for engage_date column: 500


0

In [39]:
# Step 2g: close_date (datetime)

# Checking the range of dates
df_raw_sales_pipeline['close_date'].describe()
    # No strange outliers that I can see

# Logic check (based on data dictionary)
    # Again, these are definitely dates
    
# Logical nulls?
    # No glaring ones from the range of values
    
# Typos?
    # Check range of months
df_raw_sales_pipeline['close_date'].dt.month.agg(['min', 'max']) # Between 1 and 12 (good)
    # Check range of days
df_raw_sales_pipeline['close_date'].dt.day.agg(['min', 'max']) # Between 1 and 31 (good)

# Count nulls
stateNullCount(df_raw_sales_pipeline, 'close_date')
    # 2089 missing values, let's investigate
    
# We expect nulls to be for Prospecting and Engaging deals while Won/Lost should all have dates
df_raw_sales_pipeline[(df_raw_sales_pipeline['deal_stage'] == "Prospecting") | (df_raw_sales_pipeline['deal_stage'] == "Engaging")]['close_date'].describe()
    # All nulls for engaging and prospecting deals
df_raw_sales_pipeline[(df_raw_sales_pipeline['deal_stage'] == "Won") | (df_raw_sales_pipeline['deal_stage'] == "Lost")]['close_date'].isnull().sum()
    # No null values for Won or Lost deals
    
# Last check (all close dates SHOULD be >= engage dates when applicable)
len(df_raw_sales_pipeline[df_raw_sales_pipeline['close_date'] < df_raw_sales_pipeline['engage_date']])
    # 0 rows returned for the above query meaning our assumption is true

Null count for close_date column: 2089


0

In [40]:
# Step 2h: close_value (float)

# Checking the range
df_raw_sales_pipeline['close_value'].describe()
    # Max of 30K which seems believable because the highest price of the products offered was 26K.
    # However a min of 0 dollars? Is the company giving away products for free? Let's investigate
    
#df_raw_sales_pipeline[df_raw_sales_pipeline['close_value'] == 0]
    # Ok so looks like Lost deals are indicated with a close value of 0 which makes more sense

# Let's double check that only happens for LOST deals
df_raw_sales_pipeline[df_raw_sales_pipeline['close_value'] == 0]['deal_stage'].unique()
df_raw_sales_pipeline[df_raw_sales_pipeline['deal_stage'] == 'Lost']['close_value'].describe()
    # Yes the above is correct (Lost deals = 0 close value)

# Checking for logical null values
    # No negative values and we've sorted out the records with 0 so no logical nulls.

# Logic check (based on data dictionary)
    # The range of values seem correct

# Count nulls
stateNullCount(df_raw_sales_pipeline, 'close_value')
    # 2089 nulls, let's investigate
    
df_raw_sales_pipeline[df_raw_sales_pipeline['close_value'].isnull()]
    # Appears that Engaging and Prospecting deals have no close values (which makes sense)

# Let's double check that this follows for ALL engaging and prospecting deals
df_raw_sales_pipeline[(df_raw_sales_pipeline['deal_stage'] == "Engaging") | (df_raw_sales_pipeline['deal_stage'] == "Prospecting")]['close_value'].describe()
    # Yup that follows for all engaging and prospecting deals

Null count for close_value column: 2089


count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: close_value, dtype: float64
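
The stage rules confirmed across the engage_date, close_date and close_value cells can be gathered into one check that returns any offending rows. This is a sketch of that idea, not code from the notebook; `stage_rule_violations` is a hypothetical name and the sample frame contains two deliberately bad rows.

```python
import pandas as pd

def stage_rule_violations(df):
    """Rows that break the deal-stage rules observed in this notebook."""
    prospecting = df["deal_stage"] == "Prospecting"
    closed = df["deal_stage"].isin(["Won", "Lost"])
    lost = df["deal_stage"] == "Lost"

    bad = (
        (prospecting != df["engage_date"].isna())   # engage_date is null only for Prospecting
        | (closed != df["close_date"].notna())      # close_date is set only for Won/Lost
        | (closed != df["close_value"].notna())     # close_value is set only for Won/Lost
        | (lost & (df["close_value"] != 0))         # Lost deals have a close value of 0
        | (df["close_date"] < df["engage_date"])    # closing cannot precede engaging
    )
    return df[bad]

sample = pd.DataFrame({
    "deal_stage": ["Won", "Lost", "Engaging", "Prospecting", "Lost", "Won"],
    "engage_date": pd.to_datetime(["2016-10-20", "2016-11-01", "2016-11-03",
                                   None, "2016-11-05", "2017-03-01"]),
    "close_date": pd.to_datetime(["2017-03-01", "2017-03-02", None,
                                  None, "2017-03-04", "2017-01-01"]),
    "close_value": [1054.0, 0.0, None, None, 25.0, 500.0],
})
violations = stage_rule_violations(sample)
print(violations)
```

On the real table an empty result would restate, in one place, what the individual cells above found.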

In [41]:
# Step 3: Check for duplicates

df_raw_sales_pipeline.duplicated().sum()
    # No duplicates

0

## Data QA (sales_teams.csv)

**Summary:**
* 0 duplicates

In [42]:
# Displaying the relevant information from data_dictionary

df_raw_data_dictionary[df_raw_data_dictionary['Table'] == "sales_teams"]

Unnamed: 0,Table,Field,Description
10,sales_teams,sales_agent,Sales agent
11,sales_teams,manager,Respective sales manager
12,sales_teams,regional_office,Regional office


In [43]:
# Load the table

df_raw_sales_teams = pd.read_csv('./Data/sales_teams.csv')

# Display some general stats and the head of the table

df_raw_sales_teams.info()
df_raw_sales_teams.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sales_agent      35 non-null     object
 1   manager          35 non-null     object
 2   regional_office  35 non-null     object
dtypes: object(3)
memory usage: 968.0+ bytes


Unnamed: 0,sales_agent,manager,regional_office
0,Anna Snelling,Dustin Brinkmann,Central
1,Cecily Lampkin,Dustin Brinkmann,Central
2,Versie Hillebrand,Dustin Brinkmann,Central
3,Lajuana Vencill,Dustin Brinkmann,Central
4,Moses Frase,Dustin Brinkmann,Central
5,Jonathan Berthelot,Melvin Marxen,Central
6,Marty Freudenburg,Melvin Marxen,Central
7,Gladys Colclough,Melvin Marxen,Central
8,Niesha Huffines,Melvin Marxen,Central
9,Darcel Schlecht,Melvin Marxen,Central


In [44]:
# Step 1: Check if the data types of each column seem correct

# COLUMN_NAME           EXPECTED DATA TYPE OF VALUES      CURRENT DATA TYPE
# sales_agent           string                            object (correct)
# manager               string                            object (correct)
# regional_office       string                            object (correct)

In [45]:
# Step 2: Resolve typos, logical duplicates, logical nulls

In [46]:
# Step 2a: sales_agent (object)

# Compare current unique count and all lowercase unique count
compareUniqueCounts(df_raw_sales_teams['sales_agent']) # True = good

# Look for typos in value list + Logical Check (based on data dictionary)
    # All unique values of agents from sales_pipeline should be present in this table
set(df_raw_sales_pipeline['sales_agent'].unique()).difference(set(df_raw_sales_teams['sales_agent'].unique()))
    # Empty set means all sales agents are present. Good!

# Logical null values
df_raw_sales_teams['sales_agent'].unique()
    # Looks like every value is a real name so no logical nulls

# Count of null values
stateNullCount(df_raw_sales_teams, 'sales_agent')
    # No null values

Null count for sales_agent column: 0


In [47]:
# Step 2b: manager (object)

# Compare current unique count and all lowercase unique count
compareUniqueCounts(df_raw_sales_teams['manager']) # True = good

# Look for typos in value list
df_raw_sales_teams['manager'].unique()
    # 6 names and no glaring typos

# Logic check (based on data_dictionary)
    # These values should represent the names of managers and they appear correct
    
# Logical null values
    # No logical nulls during check of unique values

# Count of null values
stateNullCount(df_raw_sales_teams, 'manager')
    # No null values

Null count for manager column: 0


In [48]:
# Step 2c: regional_office (object)

# Compare current unique count and all lowercase unique count
compareUniqueCounts(df_raw_sales_teams['regional_office']) # True = good

# Look for typos in value list
df_raw_sales_teams['regional_office'].unique()
    # Only three values and they are all correct

# Logic check (based on data_dictionary)
    # These are all region locations (East, West, Central)

# Logical null values
    # None seen when inspecting unique values
    
# Count of null values
stateNullCount(df_raw_sales_teams, 'regional_office')
    # No null values

Null count for regional_office column: 0


In [49]:
# Step 3: Check for duplicates

df_raw_sales_teams.duplicated().sum()

# No duplicates

0

## Exporting the Cleaned Tables

In [50]:
# We didn't change data_dictionary.csv so we can just save the others

df_raw_accounts.to_csv('./Data/accounts_CLEAN.csv', index = False)
df_raw_products.to_csv('./Data/products_CLEAN.csv', index = False)
df_raw_sales_pipeline.to_csv('./Data/sales_pipeline_CLEAN.csv', index = False)
df_raw_sales_teams.to_csv('./Data/sales_teams_CLEAN.csv', index = False)
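
One thing to keep in mind with these exports: CSV does not store data types, so the date columns cast earlier come back as plain strings unless they are parsed again on load. A minimal round-trip sketch, using an in-memory buffer in place of a file and a made-up two-row frame:

```python
import io
import pandas as pd

clean = pd.DataFrame({
    "opportunity_id": ["1C1I7A6R", "HAXMC4IX"],
    "engage_date": pd.to_datetime(["2016-10-20", "2016-11-03"]),
    "close_date": pd.to_datetime(["2017-03-01", None]),
})

buffer = io.StringIO()              # stands in for a file on disk
clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV itself cannot carry
reloaded = pd.read_csv(buffer, parse_dates=["engage_date", "close_date"])
print(reloaded.dtypes)
```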

In [51]:
# Outstanding Notes:
# accounts table has one lowercase value "dambase" which we want to change to capitalized for consistency's sake

# products: sales_price is currently int instead of float but we'll see if this is an issue later
    # Non-issue (we can convert in the dashboard software)
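
If the "dambase" value is capitalized later, the rename would need to be applied to every table that carries the account name so joins still match. The sketch below assumes the target spelling is "Dambase" and uses toy stand-ins for the accounts and sales_pipeline tables:

```python
import pandas as pd

RENAMES = {"dambase": "Dambase"}    # assumed target spelling

accounts = pd.DataFrame({"account": ["Cancity", "dambase"]})
pipeline = pd.DataFrame({"account": ["dambase", "Cancity", None]})

# Apply the same mapping to both tables so the join key stays consistent
for table in (accounts, pipeline):
    table["account"] = table["account"].replace(RENAMES)

# Any pipeline account missing from accounts would show up here
unmatched = set(pipeline["account"].dropna()) - set(accounts["account"])
print(unmatched)
```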