In [1]:
# Import the libraries used in this notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
customers_data = pd.read_csv(r'../Data/addresses.csv')
customers_data.head()

Unnamed: 0,company_id,address,total_spend
0,1,"APARTMENT 2,\n52 BEDFORD ROAD,\nLONDON,\nENGLA...",5700
1,2,"107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ",4700
2,3,"43 SUNNINGDALE,\nYATE,\nBRISTOL,\nENGLAND,\nBS...",5900
3,4,"HAWESWATER HOUSE,\nLINGLEY MERE BUSINESS PARK,...",7200
4,5,"AMBERFIELD BARN HOUSE AMBER LANE,\nCHART SUTTO...",4600


### Data Overview and Documentation

When working with new datasets, it's crucial to document the structure and content of the data for several reasons:
1. Ensures clarity for team members and stakeholders
2. Facilitates future data processing and analysis
3. Provides a foundation for database schema design
4. Aids in maintaining data governance standards

### Customer Data Dictionary

| Column Name | Data Type | Description |
|-------------|-----------|-------------|
| company_id  | INT64     | Unique identifier assigned to each customer company |
| address     | STRING    | Complete postal address of the customer |
| total_spend | INT64     | Cumulative spending amount per customer (in GBP) |

**Note**: Each row describes one business customer, and `total_spend` is that customer's cumulative spend in British Pounds Sterling (GBP).
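The type and completeness columns of a data dictionary can be drafted from the DataFrame itself, leaving only the descriptions to write by hand. The following is a minimal sketch: `sample` is a hypothetical three-row stand-in for `customers_data`, and `data_dictionary_skeleton` is an illustrative helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for customers_data: same columns, made-up rows.
sample = pd.DataFrame({
    "company_id": [1, 2, 3],
    "address": ["107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ", None, "1 EXAMPLE ROAD,\nLEEDS"],
    "total_spend": [5700, 4700, 5900],
})

def data_dictionary_skeleton(df):
    """Return one row per column with its dtype and non-null count."""
    return pd.DataFrame({
        "column": list(df.columns),
        "dtype": [str(t) for t in df.dtypes],
        "non_null": df.notna().sum().to_numpy(),
    })

skeleton = data_dictionary_skeleton(sample)
print(skeleton.to_string(index=False))
```

The description column still has to be written manually, since it records business meaning that the data alone does not carry.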

In [3]:
customers_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   company_id   100000 non-null  int64 
 1   address      99032 non-null   object
 2   total_spend  100000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 2.3+ MB


In [4]:
customers_data.shape

(100000, 3)

### Data Exploration Insights

Understanding the fundamental characteristics of our dataset (for example via `shape` and `info()`) provides several key advantages:

1. **Data Structure Analysis**
   - Identification of data types per column
   - Assessment of storage requirements
   - Overview of dataset dimensions and scale

2. **Query Optimization**
   - Informs decisions about data partitioning strategies
   - Helps optimize cloud computing costs, particularly in platforms like Google BigQuery
   - Enables efficient query planning and execution

3. **Performance Considerations**
   - Guides decisions on indexing strategies
   - Helps determine appropriate storage optimization techniques
   - Facilitates cost-effective query execution planning

This preliminary analysis is crucial for both data engineering decisions and cost optimization in cloud environments.
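For the storage assessment in particular, note that the `+` in the `memory usage: 2.3+ MB` line of `info()` marks a lower bound, because the contents of object (string) columns are not counted by default. A short sketch of getting a fuller figure with `memory_usage(deep=True)` follows; `sample` is a hypothetical stand-in for `customers_data`.

```python
import pandas as pd

# Hypothetical stand-in for customers_data: same columns, made-up rows.
sample = pd.DataFrame({
    "company_id": [1, 2, 3],
    "address": ["107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ", "1 EXAMPLE ROAD,\nLEEDS", None],
    "total_spend": [5700, 4700, 5900],
})

# deep=True inspects the string objects themselves instead of only the pointers.
per_column = sample.memory_usage(deep=True)
shallow = sample.memory_usage(deep=False)
total_bytes = int(per_column.sum())

print(per_column)
print(f"total: {total_bytes} bytes")
```

On the real dataset the deep figure for `address` would be expected to dominate, since it is the only free-text column.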

In [5]:
customers_data['total_spend'].describe()

count    100000.000000
mean       4951.662000
std        1500.983866
min           0.000000
25%        3900.000000
50%        5000.000000
75%        6000.000000
max       11700.000000
Name: total_spend, dtype: float64

In [6]:
# Now that we understand the data's structure, check for missing values and duplicates.
customers_data.isnull().sum()

company_id       0
address        968
total_spend      0
dtype: int64

We can see there are 968 missing addresses, which is just under 1% of our rows. Since we have no way of knowing the addresses of those missing customers just from the data provided, we can safely drop these rows:

In [7]:
customers_data = customers_data.dropna(subset=['address'])
customers_data.shape


(99032, 3)

Losing about 1% of rows because key information is missing is acceptable. If, say, 10% of our customers had missing addresses, we would want to investigate why. An alternative to dropping the rows is to keep them and label the missing addresses as "Other".
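Both options can be compared side by side. The sketch below uses a hypothetical four-row `sample` in place of `customers_data`; it measures the share of missing addresses first, then shows dropping the rows versus labelling them "Other".

```python
import pandas as pd

# Hypothetical stand-in for customers_data: same columns, made-up rows.
sample = pd.DataFrame({
    "company_id": [1, 2, 3, 4],
    "address": ["107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ", None, "1 EXAMPLE ROAD,\nLEEDS", None],
    "total_spend": [5700, 4700, 5900, 7200],
})

# Fraction of rows with no address; the mean of a boolean Series is the share of True values.
missing_share = sample["address"].isna().mean()
print(f"{missing_share:.1%} of addresses are missing")

# Option A: drop rows without an address (what this notebook does).
dropped = sample.dropna(subset=["address"])

# Option B: keep every row and label the missing addresses instead.
labelled = sample.assign(address=sample["address"].fillna("Other"))
```

Option B keeps the spend of those customers in any overall totals, at the cost of an "Other" bucket that cannot be mapped to a city.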

In [8]:
customers_data.duplicated().sum()

np.int64(0)

In [9]:
customers_data['total_spend'].describe()

count    99032.000000
mean      4951.673197
std       1500.642398
min          0.000000
25%       3900.000000
50%       5000.000000
75%       6000.000000
max      11700.000000
Name: total_spend, dtype: float64

### Spending Analysis Overview

Based on the descriptive statistics of the `total_spend` column, we can observe several key insights:

1. **Data Quality Assessment**
   - No negative values detected in the spending data
   - Values run from £0 to £11,700; the £0 minimum means at least one customer has no recorded spend
   - `total_spend` has no missing values in any of the 99,032 remaining rows

2. **Key Metrics**
   - Mean customer spend: £4,951.67
   - Median customer spend: £5,000.00
   - Standard deviation: £1,500.64
   - Range: £0 to £11,700

3. **Distribution Characteristics**
   - 25th percentile: £3,900
   - 75th percentile: £6,000
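   - The distribution looks roughly symmetric: the mean and median nearly coincide, and the quartiles sit about £1,000 either side of the median
   - The maximum (£11,700) lies further from the median than the minimum (£0), which hints at a longer right tail; the summary statistics alone cannot confirm skew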
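
The shape of the distribution can be checked directly rather than inferred from quartiles. A minimal sketch, using made-up spend values in place of `customers_data['total_spend']`, compares the mean with the median and computes the sample skewness with `Series.skew()`:

```python
import pandas as pd

# Made-up values for illustration only; in the notebook this would be
# customers_data['total_spend'].
spend = pd.Series([1000, 2000, 2000, 3000, 12000], name="total_spend")

mean_spend = spend.mean()
median_spend = spend.median()
skewness = spend.skew()  # > 0 suggests a longer right tail, < 0 a longer left tail

print(f"mean={mean_spend:.2f}, median={median_spend:.2f}, skew={skewness:.3f}")
```

A skewness near zero together with a mean close to the median would support treating the distribution as roughly symmetric; a histogram is a useful visual cross-check.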

This analysis provides a solid foundation for our geographic spending analysis. Now that we have seen our data, we need to decide on an approach to extract the information about cities.

In [13]:
#Before deciding on a method, we should look at some sample addresses.
for address in customers_data['address'].head(3):
    print(address,"\n")
for address in customers_data['address'].tail(3):
    print(address, "\n")

APARTMENT 2,
52 BEDFORD ROAD,
LONDON,
ENGLAND,
SW4 7HJ 

107 SHERINGHAM AVENUE,
LONDON,
N14 4UJ 

43 SUNNINGDALE,
YATE,
BRISTOL,
ENGLAND,
BS37 4HZ 

MARLAND HOUSE,
13 HUDDERSFIELD ROAD,
BARNSLEY,
SOUTH YORKSHIRE,
ENGLAND,
S70 2LW 

4 MOUNT SCAR VIEW,
SCHOLES,
HOLMFIRTH HUDDERSFIELD,
WEST YORKSHIRE,
HD9 1XH 

Manningham Mills Community Center, Silk Warehouse,
Lilycroft Road,
Bradford,
United Kingdom,
Bd9 5Be 



### Address Pattern Analysis

Initial examination of the sample addresses reveals several important patterns and considerations for data processing:

1. **Address Structure Patterns**
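   - Every sampled address ends with a postcode
   - A country line ("ENGLAND" or "United Kingdom") appears before the postcode in some addresses but not in others
   - Address components are separated by commas and newline characters
   - Most sampled addresses are uppercase, but the last one is in mixed case (`Bd9 5Be`)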

2. **Data Quality Considerations**
   - Cannot rely on fixed position for city extraction
   - Need to implement robust parsing logic
   - Must handle inconsistent formatting

3. **Data Processing Strategy**
   - Standardize all addresses to uppercase format
   - Preserve original data in a separate column
   - Implement flexible parsing to handle variations

4. **Best Practices**
   - Maintain raw data for reference and validation
   - Create new column for processed addresses
   - Document any data transformations

This analysis will guide our approach to geographic data extraction and standardization.
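
The standardization steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the column names `address_clean` and `components` and the helper `split_components` are hypothetical, the two addresses are taken from the samples printed earlier, and the sketch stops short of deciding which component is the city.

```python
import re
import pandas as pd

# Two of the sample addresses printed above, standing in for customers_data.
sample = pd.DataFrame({"address": [
    "107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ",
    "Manningham Mills Community Center, Silk Warehouse,\nLilycroft Road,\nBradford,\nUnited Kingdom,\nBd9 5Be",
]})

# Leave the raw 'address' column untouched; write the uppercase version to a new column.
sample["address_clean"] = sample["address"].str.upper().str.strip()

def split_components(address):
    """Split an address on commas and newlines, dropping empty pieces."""
    parts = re.split(r"[,\n]+", address)
    return [p.strip() for p in parts if p.strip()]

sample["components"] = sample["address_clean"].apply(split_components)

for parts in sample["components"]:
    print(parts)
```

Because the number of components varies per address, any later city lookup would need to search the component list rather than read a fixed position.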