# Exploratory Data Analysis of Customers

In [1]:
import pandas as pd

In [2]:
customers = pd.read_csv('../data/raw/unclean_customers.csv')

customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  5200 non-null   int64  
 1   name         5200 non-null   object 
 2   age          4870 non-null   float64
 3   country      4999 non-null   object 
 4   is_active    4912 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 203.3+ KB


The CSV imported is the one that the customer has provided.  We will explore it and then create a cleaned data file that can be used in testing.

#### Missing Data

We can see that there is some missing data in the `age`, `country` and `is_active` columns.  We will need to clean this data before we can use it for analysis or further processing.  First, we can work out how many rows we expect to lose as a result of cleaning it.

> NOTE: in this instance, the customer isn't bothered that the `age` column has missing values, so we will be removing this column from the final cleaned data set.

Let's find out how many rows we will actually lose...

In [3]:
# Break down the rows that are missing country and/or is_active
# Count rows with just country missing
just_country_missing = (
    customers["country"].isnull() & customers["is_active"].notnull()
)
print(f"Just country missing: {just_country_missing.sum()}")

# Count rows with just is_active missing
just_is_active_missing = (
    customers["is_active"].isnull() & customers["country"].notnull()
)
print(f"Just is_active missing: {just_is_active_missing.sum()}")

# Count rows with both country and is_active missing
both_missing = customers["country"].isnull() & customers["is_active"].isnull()
print(f"Both missing: {both_missing.sum()}")

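# Count rows with either column missing (computed outside the f-string,
# as multi-line replacement fields need Python 3.12+)
total_missing = customers[["country", "is_active"]].isnull().any(axis=1).sum()
print(f"Total number of rows lost will be {total_missing}")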

Just country missing: 189
Just is_active missing: 276
Both missing: 12
Total number of rows lost will be 477


> ***477*** rows is around ***9%*** of the data, so we will be losing a significant amount of data.
> Our customer is happy with this as the majority are those with `is_active` missing, so they would be considered ***inactive*** anyway.
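
The two figures in the note above can be reproduced from the printed counts. This is a minimal sketch that uses the numbers from the cell output rather than the real file; the variable names are illustrative only.

```python
# Figures copied from the cell output above
total_rows = 5200
rows_lost = 477
is_active_missing = 276 + 12  # "just is_active missing" + "both missing"

share_lost = rows_lost / total_rows
share_is_active = is_active_missing / rows_lost

print(f"Share of rows lost: {share_lost:.1%}")
print(f"Share of lost rows with is_active missing: {share_is_active:.1%}")
```

This gives roughly 9.2% of rows lost, with about 60% of those having `is_active` missing.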

#### `is_active` Column

Let's take a look at the `is_active` column's values.  We'll get the first 20 unique values to see if there are any inconsistencies or unexpected values.

In [4]:
# Get the first 20 unique values from the is_active column

customers["is_active"].unique()[:20]

array(['0', '1', 'active', nan, 'False', 'True', 'inactive'], dtype=object)

In [5]:
# Check that every value other than nan is a string

for value in customers["is_active"].unique()[:20]:
    print(f"Value: {repr(value)}, Type: {type(value)}")

Value: '0', Type: <class 'str'>
Value: '1', Type: <class 'str'>
Value: 'active', Type: <class 'str'>
Value: nan, Type: <class 'float'>
Value: 'False', Type: <class 'str'>
Value: 'True', Type: <class 'str'>
Value: 'inactive', Type: <class 'str'>


We can see that this data is not standardised, although it is all `'str'` (we ignore the `nan` values as they will be removed with missing values).  We will use the following mapping for standardisation:

| Value    | Standardised Value |
|----------|--------------------|
| 1        | True               |
| 0        | False              |
| active   | True               |
| inactive | False              |
| False    | False              |
| True     | True               |

`nan` values will be removed prior to standardisation, as they are not valid values for this column.

#### Duplicates

How many duplicates are there in the data?  Let's get a ball-park figure.

In [6]:
# Duplicate rows in current data set
customers.duplicated().sum()

np.int64(149)

It seems as though we have ***149*** duplicate rows in the data set.  This is around ***3%*** of the data, so we will be losing a small amount of data.

> We will need to count the duplicates again once we have removed the missing values, as some of the current duplicates may be copies of rows that have missing values in the `is_active` column.
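
To illustrate why the duplicate count can change, here is a small sketch on toy data (not the customer's file). It assumes the same `dropna` subset that is used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 1 are exact duplicates, and so are rows 2 and 3,
# but the second pair has is_active missing.
toy = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 3],
        "country": ["UK", "UK", "FR", "FR", "DE"],
        "is_active": ["1", "1", np.nan, np.nan, "0"],
    }
)

# duplicated() flags every repeat after the first occurrence
before = toy.duplicated().sum()
after = toy.dropna(subset=["country", "is_active"]).duplicated().sum()

print(f"Duplicates before dropping missing values: {before}")
print(f"Duplicates after dropping missing values: {after}")
```

The second pair disappears with the missing values, so only one duplicate is left to drop afterwards.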

---
---

## Epic 3 - Story 4 - Task 2 - Missing Values

In [7]:
# Drop rows with missing values in country and is_active

customers = customers.dropna(subset=['country', 'is_active'])

customers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4723 entries, 1 to 5199
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  4723 non-null   int64  
 1   name         4723 non-null   object 
 2   age          4422 non-null   float64
 3   country      4723 non-null   object 
 4   is_active    4723 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 221.4+ KB


That's ***477*** rows removed as expected.

In [8]:
# Run a check - this operation will be tested in the pipeline!
just_country_missing = (
    customers["country"].isnull() & customers["is_active"].notnull()
)
print(f"Just country missing: {just_country_missing.sum()}")

# Count rows with just is_active missing
just_is_active_missing = (
    customers["is_active"].isnull() & customers["country"].notnull()
)
print(f"Just is_active missing: {just_is_active_missing.sum()}")

# Count rows with both country and is_active missing
both_missing = customers["country"].isnull() & customers["is_active"].isnull()
print(f"Both missing: {both_missing.sum()}")

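# Count rows with either column missing (computed outside the f-string,
# as multi-line replacement fields need Python 3.12+)
total_missing = customers[["country", "is_active"]].isnull().any(axis=1).sum()
print(f"Total number of rows lost will be {total_missing}")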

Just country missing: 0
Just is_active missing: 0
Both missing: 0
Total number of rows lost will be 0


---

## Epic 3 - Story 4 - Task 3 - Remove `age` Column

In [9]:
# Remove the `age` column from the dataset
customers = customers.drop(columns=['age'])

customers.shape

(4723, 4)

---

### Epic 3 - Story 4 - Task 4 - Standardise `is_active` Column

In [10]:
# standardise is_active
mapping = {
    "1": True,
    "0": False,
    "active": True,
    "inactive": False,
    "False": False,
    "True": True,
}

# Map the known values first; anything not in the mapping becomes nan,
# which is then filled with False before converting to bool
customers["is_active"] = (
    customers["is_active"]
    .map(mapping)
    .fillna(False)
    .infer_objects()
    .astype(bool)
)


print(customers.shape[0])

4723


In [11]:
# Check unique values in is_active

print(customers['is_active'].unique())

[ True False]


In [12]:
# Check that every remaining value is now a boolean

for value in customers["is_active"].unique()[:20]:
    print(f"Value: {repr(value)}, Type: {type(value)}")

Value: np.True_, Type: <class 'numpy.bool'>
Value: np.False_, Type: <class 'numpy.bool'>


Just boolean `True` or `False` values are now present in the `is_active` column.
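
One caveat with the standardisation cell above: `.fillna(False)` also turns any value that is missing from `mapping` into `False` without warning. Below is a sketch of a guard that could run before the conversion. The helper name `find_unmapped` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

MAPPING = {
    "1": True,
    "0": False,
    "active": True,
    "inactive": False,
    "False": False,
    "True": True,
}


def find_unmapped(series: pd.Series, mapping: dict) -> list:
    """Return the distinct non-null values that have no entry in mapping."""
    values = series.dropna().unique()
    return sorted(v for v in values if v not in mapping)


# Toy data: "yes" is not covered by the mapping
toy = pd.Series(["1", "active", "yes", "False"])
unmapped = find_unmapped(toy, MAPPING)
print(unmapped)
```

An empty list means every value is covered; anything else is worth raising with the customer before it is silently treated as inactive.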

---

### Drop Duplicates

In [13]:
# Drop duplicates

customers = customers.drop_duplicates()

print(customers.shape[0])

4587


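That's ***136*** duplicates removed, 13 fewer than the ***149*** counted earlier.  The difference is most likely down to duplicates that were among the rows already dropped for missing values.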

### Reset the indexes

> This was added after the previous task as COMPONENT tests failed due to index conflicts - the same will be added for the transaction data set.

The index still carries the original row labels, so it now has gaps where rows were removed.  We will reset it to be sequential again.

In [14]:
customers.reset_index(drop=True, inplace=True)
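
A small sketch on toy data shows what the reset does. The DataFrame here is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"customer_id": [10, 11, 12, 13]})

# Removing rows keeps the original labels, leaving gaps in the index
filtered = toy[toy["customer_id"] != 11]
print(list(filtered.index))

# drop=True discards the old labels instead of adding them as a column
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```

The first list is `[0, 2, 3]` and the second is `[0, 1, 2]`. Comparisons that align on the index can fail on the first form even when the values match.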

---
---

### Epic 2 - Story 4 - Task 6 - Save the Cleaned Data

For testing purposes in the pipeline, it makes sense for us to export the cleaned DataFrame to a CSV file.  This will allow us to use the cleaned data in the pipeline without having to run the cleaning steps again.

In [15]:
customers.to_csv('../tests/test_data/expected_customers_clean_results.csv', index=False)
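
As a rough idea of how the exported file might be used in a test, the sketch below round-trips a toy DataFrame through CSV and compares the result with `pd.testing.assert_frame_equal`. It writes to an in-memory buffer rather than the path above, and the data is made up for illustration.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned customers DataFrame
cleaned = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "name": ["Ada", "Grace"],
        "country": ["UK", "US"],
        "is_active": [True, False],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

expected = pd.read_csv(buffer)

# Raises AssertionError if the values, dtypes or index differ
pd.testing.assert_frame_equal(cleaned, expected)
print("Round trip matches")
```

Because the comparison includes the index, the expected file only matches a DataFrame whose index has been reset.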

---

### Epic 2 - Story 4 - Task 7 - Transfer the code from the Jupyter Notebook to a Python script, creating separate functions for each cleaning step

### Epic 2 - Story 4 - Task 8 - Write tests for each cleaning function to ensure they work correctly

### Epic 2 - Story 4 - Task 9 - Create a script to run the cleaning functions in sequence and log the process

### Epic 2 - Story 4 - Task 10 - Add the customer cleaning script to scripts/run and update any tests accordingly

Jupyter Notebooks do not play nicely with CI/CD pipelines, so we will need to transfer the code from the Jupyter Notebook to a Python script.  We will create separate functions for each cleaning step and then write tests for each function to ensure they work correctly.
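
As a starting point for that script, here is one possible shape for the functions. It is a sketch only: the function names (`drop_missing`, `drop_age`, `standardise_is_active`, `clean_customers`) are hypothetical, the steps mirror the cells in this notebook, and the toy data stands in for the customer's file.

```python
import pandas as pd

IS_ACTIVE_MAPPING = {
    "1": True,
    "0": False,
    "active": True,
    "inactive": False,
    "False": False,
    "True": True,
}


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows where country or is_active is missing."""
    return df.dropna(subset=["country", "is_active"])


def drop_age(df: pd.DataFrame) -> pd.DataFrame:
    """Remove the age column, which the customer does not need."""
    return df.drop(columns=["age"])


def standardise_is_active(df: pd.DataFrame) -> pd.DataFrame:
    """Map the mixed is_active values to booleans (unmapped become False)."""
    out = df.copy()
    out["is_active"] = (
        out["is_active"].map(IS_ACTIVE_MAPPING).fillna(False).astype(bool)
    )
    return out


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Run the cleaning steps in the same order as the notebook."""
    return (
        df.pipe(drop_missing)
        .pipe(drop_age)
        .pipe(standardise_is_active)
        .drop_duplicates()
        .reset_index(drop=True)
    )


# Toy data: one exact duplicate and one row with country missing
raw = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3],
        "name": ["Ada", "Ada", "Grace", "Alan"],
        "age": [36.0, 36.0, None, 41.0],
        "country": ["UK", "UK", "US", None],
        "is_active": ["1", "1", "inactive", "True"],
    }
)

cleaned = clean_customers(raw)
print(cleaned)
```

Keeping each step as a function that takes and returns a DataFrame means each one can be tested on its own, and `clean_customers` can be checked against the exported expected results.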