**Assumption**

ID is assumed to be the primary key of the USERS table, because each user can have many rows in the TRANSACTIONS table (a one-to-many relationship).

In [1]:
# Data Quality Check for USER_TAKEHOME
import pandas as pd
from datetime import datetime

df_user = pd.read_csv("USER_TAKEHOME.csv")

In [2]:
df_user.head()

Unnamed: 0,ID,CREATED_DATE,BIRTH_DATE,STATE,LANGUAGE,GENDER
0,5ef3b4f17053ab141787697d,2020-06-24 20:17:54.000 Z,2000-08-11 00:00:00.000 Z,CA,es-419,female
1,5ff220d383fcfc12622b96bc,2021-01-03 19:53:55.000 Z,2001-09-24 04:00:00.000 Z,PA,en,female
2,6477950aa55bb77a0e27ee10,2023-05-31 18:42:18.000 Z,1994-10-28 00:00:00.000 Z,FL,es-419,female
3,658a306e99b40f103b63ccf8,2023-12-26 01:46:22.000 Z,,NC,en,
4,653cf5d6a225ea102b7ecdc2,2023-10-28 11:51:50.000 Z,1972-03-19 00:00:00.000 Z,PA,en,female


In [3]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   ID            100000 non-null  object
 1   CREATED_DATE  100000 non-null  object
 2   BIRTH_DATE    96325 non-null   object
 3   STATE         95188 non-null   object
 4   LANGUAGE      69492 non-null   object
 5   GENDER        94108 non-null   object
dtypes: object(6)
memory usage: 4.6+ MB


In [4]:
# Check for missing values
missing_values = df_user.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 ID                  0
CREATED_DATE        0
BIRTH_DATE       3675
STATE            4812
LANGUAGE        30508
GENDER           5892
dtype: int64


**Quality issues I found**:
1. The Dtype column shows that CREATED_DATE and BIRTH_DATE are stored as objects (strings) rather than datetimes. Both should be converted to datetime.

2. Four columns have missing values: BIRTH_DATE (3,675), STATE (4,812), LANGUAGE (30,508), and GENDER (5,892).
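
Raw counts are easier to compare as a share of all rows (for example, 30,508 of 100,000 rows means about 30.5% of LANGUAGE is missing). A minimal sketch of that summary, assuming a small toy frame named `sample` in place of `df_user`; the toy values are illustrative only:

```python
import pandas as pd

# Toy frame standing in for df_user; the values are illustrative only.
sample = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "STATE": ["CA", None, "PA", "FL"],
    "LANGUAGE": ["en", None, None, "es-419"],
})

# isnull().mean() gives the fraction of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Sorting by percentage puts the columns that need the most attention at the top.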

In [5]:
df_user['BIRTH_DATE']

0        2000-08-11 00:00:00.000 Z
1        2001-09-24 04:00:00.000 Z
2        1994-10-28 00:00:00.000 Z
3                              NaN
4        1972-03-19 00:00:00.000 Z
                   ...            
99995    1992-03-16 08:00:00.000 Z
99996    1993-09-23 05:00:00.000 Z
99997    1983-04-19 00:00:00.000 Z
99998    1995-06-09 04:00:00.000 Z
99999    1995-12-15 05:00:00.000 Z
Name: BIRTH_DATE, Length: 100000, dtype: object

In [6]:
# Data Cleaning
df_user['CREATED_DATE'] = pd.to_datetime(df_user['CREATED_DATE'], errors='coerce')
df_user['BIRTH_DATE'] = pd.to_datetime(df_user['BIRTH_DATE'], errors='coerce')
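
Because `errors='coerce'` turns unparseable strings into `NaT`, malformed dates become indistinguishable from values that were missing to begin with. A minimal sketch of how to count them, assuming the `YYYY-MM-DD HH:MM:SS.mmm Z` layout seen in the sample rows; the series `raw` and the malformed entry are made up for illustration:

```python
import pandas as pd

raw = pd.Series([
    "2020-06-24 20:17:54.000 Z",
    None,          # genuinely missing
    "not a date",  # malformed; errors='coerce' turns this into NaT
])

# Strip the trailing " Z" marker, then parse as UTC with an explicit format.
cleaned = raw.str.replace(r"\s*Z$", "", regex=True)
parsed = pd.to_datetime(
    cleaned, format="%Y-%m-%d %H:%M:%S.%f", errors="coerce", utc=True
)

# Present before parsing but NaT afterwards means the value failed to parse.
failed_to_parse = int((raw.notna() & parsed.isna()).sum())
print("Values that failed to parse:", failed_to_parse)
```

If this count is above zero on the real column, the offending strings are worth inspecting before they are treated as missing.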

In [7]:
# Check for duplicate user IDs
duplicate_rows = df_user.duplicated(subset=['ID']).sum()
print("Number of Duplicate User IDs:", duplicate_rows)

Number of Duplicate User IDs: 0
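
A primary key must be non-null as well as unique, so both conditions can be checked together. A minimal sketch, where `check_primary_key` is a hypothetical helper (not part of pandas) and `sample` is toy data:

```python
import pandas as pd

def check_primary_key(df: pd.DataFrame, key: str) -> dict:
    """Return counts of null and duplicated values in a candidate key column."""
    return {
        "null_keys": int(df[key].isnull().sum()),
        "duplicate_keys": int(df[key].duplicated(keep="first").sum()),
        "is_valid_key": bool(df[key].notnull().all() and df[key].is_unique),
    }

# Toy data: one repeated ID and one missing ID.
sample = pd.DataFrame({"ID": ["a1", "b2", "b2", None]})
result = check_primary_key(sample, "ID")
print(result)
```

On the real data this would be called as `check_primary_key(df_user, "ID")`.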


In [8]:
# Check for invalid birth dates
# Birth dates must not be in the future
df_user['BIRTH_DATE'] = pd.to_datetime(df_user['BIRTH_DATE'], errors='coerce').dt.date
future_birth_dates = df_user[df_user['BIRTH_DATE'] > pd.to_datetime('today').date()]
print("Invalid Birth Dates (Future Dates):", future_birth_dates.shape[0])

Invalid Birth Dates (Future Dates): 0
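
Future dates are only one kind of invalid birth date; very old dates can be just as suspicious. A minimal sketch that flags both, assuming a cutoff of 120 years (`MAX_AGE_YEARS` is my assumption, not a rule from the dataset) and a fixed reference date so the result is reproducible:

```python
import pandas as pd

# Toy birth dates; timezone-naive for simplicity.
birth = pd.to_datetime(
    pd.Series(["2000-08-11", "1890-01-01", "2100-01-01", None]),
    errors="coerce",
)

as_of = pd.Timestamp("2025-01-01")  # fixed reference date (assumption)
MAX_AGE_YEARS = 120                 # plausibility cutoff (assumption)

# Comparisons against NaT evaluate to False, so missing dates are not flagged.
in_future = birth > as_of
too_old = birth < as_of - pd.DateOffset(years=MAX_AGE_YEARS)

print("Future birth dates:", int(in_future.sum()))
print("Implausibly old birth dates:", int(too_old.sum()))
```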


In [9]:
# Validate categorical fields
categorical_fields = ['STATE', 'LANGUAGE', 'GENDER']
for col in categorical_fields:
    unique_values = df_user[col].nunique()
    print(f"Unique Values in {col}: {unique_values}")

Unique Values in STATE: 52
Unique Values in LANGUAGE: 2
Unique Values in GENDER: 11


**Thoughts**:
GENDER has 11 unique values. I want to check whether some of them are typos or different spellings of the same category.

In [10]:
# Unique values for gender
unique_genders = df_user['GENDER'].dropna().unique()
print("Unique Genders:", unique_genders)

# Unique values for language
unique_languages = df_user['LANGUAGE'].dropna().unique()
print("Unique Languages:", unique_languages)

Unique Genders: ['female' 'male' 'non_binary' 'transgender' 'prefer_not_to_say'
 'not_listed' 'Non-Binary' 'unknown' 'not_specified'
 "My gender isn't listed" 'Prefer not to say']
Unique Languages: ['es-419' 'en']


**Quality issues I found**:
1. 'prefer_not_to_say', 'Prefer not to say', 'not_specified', and 'unknown' should be standardized to a single category, e.g., 'prefer_not_to_say'.

2. 'not_listed' and "My gender isn't listed" should be merged into one category, e.g., 'not_listed'.

3. 'non_binary' and 'Non-Binary' differ only in case and punctuation, as do 'prefer_not_to_say' and 'Prefer not to say'. Converting all values to lowercase first keeps the mapping simple.

In [11]:
# Data Cleaning

# Lowercase and strip whitespace in the 'GENDER' column for consistency
df_user['GENDER'] = df_user['GENDER'].str.lower().str.strip()

# Standardization Mapping
gender_mapping = {
    "prefer not to say": "prefer_not_to_say",
    "not_specified": "prefer_not_to_say",
    "unknown": "prefer_not_to_say",
    "not listed": "not_listed",
    "my gender isn't listed": "not_listed",
    "non-binary": "non_binary"
}

# Apply mapping
df_user['GENDER'] = df_user['GENDER'].replace(gender_mapping)
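
After applying the mapping, it is worth confirming that no unmapped variants remain. A minimal sketch on toy values, assuming the six target categories listed in `expected` are the intended final set:

```python
import pandas as pd

gender = pd.Series([
    "female", "Non-Binary", "Prefer not to say",
    "My gender isn't listed", "unknown", None, " Male ",
])

gender_mapping = {
    "prefer not to say": "prefer_not_to_say",
    "not_specified": "prefer_not_to_say",
    "unknown": "prefer_not_to_say",
    "not listed": "not_listed",
    "my gender isn't listed": "not_listed",
    "non-binary": "non_binary",
}

# Same steps as the cleaning cell: lowercase, strip, then map.
cleaned = gender.str.lower().str.strip().replace(gender_mapping)

# Assumed final category set; anything outside it needs a new mapping entry.
expected = {
    "female", "male", "non_binary", "transgender",
    "prefer_not_to_say", "not_listed",
}
unexpected = set(cleaned.dropna().unique()) - expected

print(cleaned.value_counts(dropna=False))
print("Unexpected categories:", unexpected)
```

An empty `unexpected` set means every non-missing value landed in one of the target categories.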

In [12]:
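# Add an AGE column to the dataframe
# Note: this is a calendar-year difference; it does not check whether the birthday has already passed this year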
df_user['BIRTH_DATE'] = pd.to_datetime(df_user['BIRTH_DATE'], errors='coerce')
df_user['AGE'] = df_user['BIRTH_DATE'].apply(lambda x: datetime.now().year - x.year if pd.notnull(x) else None)
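
Subtracting calendar years overstates age by one for anyone whose birthday has not yet occurred in the current year. A minimal sketch of a birthday-aware alternative, where `age_on` is a hypothetical helper and the reference date `as_of` is fixed for reproducibility:

```python
import pandas as pd

def age_on(birth_dates: pd.Series, as_of: pd.Timestamp) -> pd.Series:
    """Completed years of age on `as_of`; missing birth dates stay missing."""
    years = as_of.year - birth_dates.dt.year
    # True when the birthday has already occurred in the as_of year.
    had_birthday = (birth_dates.dt.month < as_of.month) | (
        (birth_dates.dt.month == as_of.month)
        & (birth_dates.dt.day <= as_of.day)
    )
    # Subtract one year when the birthday is still ahead.
    ages = years - (~had_birthday).astype(int)
    return ages.where(birth_dates.notna())

birth = pd.to_datetime(pd.Series(["2000-08-11", "2000-02-01", None]))
ages = age_on(birth, pd.Timestamp("2025-06-01"))
print(ages)
```

Passing the reference date explicitly also keeps the AGE column stable when the notebook is re-run on a later day.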

In [13]:
df_user['AGE']

0        25.0
1        24.0
2        31.0
3         NaN
4        53.0
         ... 
99995    33.0
99996    32.0
99997    42.0
99998    30.0
99999    30.0
Name: AGE, Length: 100000, dtype: float64

In [14]:
# Summary statistics
print("Summary Statistics:\n", df_user.describe(include='all'))

Summary Statistics:
                               ID               CREATED_DATE  \
count                     100000                     100000   
unique                    100000                      99942   
top     5ef3b4f17053ab141787697d  2023-01-12 18:30:15+00:00   
freq                           1                          2   
first                        NaN  2014-04-18 23:14:55+00:00   
last                         NaN  2024-09-11 17:59:15+00:00   
mean                         NaN                        NaN   
std                          NaN                        NaN   
min                          NaN                        NaN   
25%                          NaN                        NaN   
50%                          NaN                        NaN   
75%                          NaN                        NaN   
max                          NaN                        NaN   

                 BIRTH_DATE  STATE LANGUAGE  GENDER           AGE  
count                 96325 



In [15]:
# Save the cleaned dataset
df_user.to_csv("cleaned_user_takehome.csv", index=False)