# DATA QUALITY
Data quality refers to the accuracy, completeness, consistency, reliability, and relevance of data for its intended purpose. Ensuring high-quality data is crucial for meaningful analysis and informed decision-making. Here's a detailed explanation of steps, tips, and things to consider while exploring data quality:

**1. Understanding the Data:**

- Data Types: Understand the types of data (numeric, categorical, text, etc.) in the dataset.
- Data Source: Know the source of the data and how it was collected.
- Metadata: Gather information about the dataset, including variable definitions and units of measurement.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
excel_path = 'DATA/KPMG_VI_New_raw_data_update_final (1).xlsx'
Transactions = pd.read_excel(excel_path, sheet_name='Transactions')
CustomerAddress = pd.read_excel(excel_path, sheet_name='CustomerAddress')
CustomerDemographic = pd.read_excel(excel_path, sheet_name='CustomerDemographic')
NewCustomerList = pd.read_excel(excel_path, sheet_name='NewCustomerList')

In [3]:
# Promote the first data row to the column header, then drop that row from the data

Transactions.columns=Transactions.iloc[0].to_list()
Transactions.drop(0,inplace=True)

CustomerDemographic.columns = CustomerDemographic.iloc[0].to_list()
CustomerDemographic.drop(0, inplace=True)


CustomerAddress.columns = CustomerAddress.iloc[0].to_list()
CustomerAddress.drop(0, inplace=True)


NewCustomerList.columns = NewCustomerList.iloc[0].to_list()
NewCustomerList.drop(0, inplace=True)
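
The header promotion above repeats the same two steps for every sheet. A minimal sketch of a reusable helper follows. The name `promote_first_row_to_header` is hypothetical (it is not part of pandas), and the small `raw` frame only stands in for a sheet whose real header ended up in the first data row.

```python
import pandas as pd


def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose first row has become the column labels."""
    out = df.copy()
    out.columns = out.iloc[0].to_list()  # the first row holds the real header
    return out.drop(out.index[0])        # remove that row from the data


# Illustrative stand-in for a sheet that was read with a junk header line.
raw = pd.DataFrame([["customer_id", "state"], [1, "NSW"], [2, "VIC"]])
clean = promote_first_row_to_header(raw)
print(clean)
```

If the position of the header row is known in advance, passing `header=1` to `pd.read_excel` should give the same result at load time, which would make the manual step unnecessary.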


# Analysis

In [4]:
Transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 1 to 20000
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   transaction_id           20000 non-null  object
 1   product_id               20000 non-null  object
 2   customer_id              20000 non-null  object
 3   transaction_date         20000 non-null  object
 4   online_order             19640 non-null  object
 5   order_status             20000 non-null  object
 6   brand                    19803 non-null  object
 7   product_line             19803 non-null  object
 8   product_class            19803 non-null  object
 9   product_size             19803 non-null  object
 10  list_price               20000 non-null  object
 11  standard_cost            19803 non-null  object
 12  product_first_sold_date  19803 non-null  object
dtypes: object(13)
memory usage: 2.0+ MB


In [9]:
CustomerDemographic.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 1 to 4000
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   customer_id                          4000 non-null   object
 1   first_name                           4000 non-null   object
 2   last_name                            3875 non-null   object
 3   gender                               4000 non-null   object
 4   past_3_years_bike_related_purchases  4000 non-null   object
 5   DOB                                  3913 non-null   object
 6   job_title                            3494 non-null   object
 7   job_industry_category                3344 non-null   object
 8   wealth_segment                       4000 non-null   object
 9   deceased_indicator                   4000 non-null   object
 10  default                              3698 non-null   object
 11  owns_car                             4000 n

In [10]:
CustomerAddress.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 1 to 3999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   object
 1   address             3999 non-null   object
 2   postcode            3999 non-null   object
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   object
dtypes: object(6)
memory usage: 187.6+ KB


In [11]:
NewCustomerList.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 1 to 1000
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   first_name                           1000 non-null   object 
 1   last_name                            971 non-null    object 
 2   gender                               1000 non-null   object 
 3   past_3_years_bike_related_purchases  1000 non-null   object 
 4   DOB                                  983 non-null    object 
 5   job_title                            894 non-null    object 
 6   job_industry_category                835 non-null    object 
 7   wealth_segment                       1000 non-null   object 
 8   deceased_indicator                   1000 non-null   object 
 9   owns_car                             1000 non-null   object 
 10  tenure                               1000 non-null   object 
 11  address                       

In [12]:
Transactions.head()

Unnamed: 0,transaction_id,product_id,customer_id,transaction_date,online_order,order_status,brand,product_line,product_class,product_size,list_price,standard_cost,product_first_sold_date
1,1,2,2950,2017-02-25 00:00:00,False,Approved,Solex,Standard,medium,medium,71.49,53.62,41245
2,2,3,3120,2017-05-21 00:00:00,True,Approved,Trek Bicycles,Standard,medium,large,2091.47,388.92,41701
3,3,37,402,2017-10-16 00:00:00,False,Approved,OHM Cycles,Standard,low,medium,1793.43,248.82,36361
4,4,88,3135,2017-08-31 00:00:00,False,Approved,Norco Bicycles,Standard,medium,medium,1198.46,381.1,36145
5,5,78,787,2017-10-01 00:00:00,True,Approved,Giant Bicycles,Standard,medium,large,1765.3,709.48,42226


In [45]:
Transactions.isnull().sum()  # Missing values per column; the zeros reflect the state after dropna() in cell [20]

transaction_id             0
product_id                 0
customer_id                0
transaction_date           0
online_order               0
order_status               0
brand                      0
product_line               0
product_class              0
product_size               0
list_price                 0
standard_cost              0
product_first_sold_date    0
dtype: int64

In [20]:
Transactions.dropna(inplace=True)

In [21]:
Transactions.select_dtypes(include='object')

Unnamed: 0,transaction_date,online_order,order_status,brand,product_line,product_class,product_size,standard_cost,product_first_sold_date
1,2017-02-25 00:00:00,False,Approved,Solex,Standard,medium,medium,53.62,41245
2,2017-05-21 00:00:00,True,Approved,Trek Bicycles,Standard,medium,large,388.92,41701
3,2017-10-16 00:00:00,False,Approved,OHM Cycles,Standard,low,medium,248.82,36361
4,2017-08-31 00:00:00,False,Approved,Norco Bicycles,Standard,medium,medium,381.1,36145
5,2017-10-01 00:00:00,True,Approved,Giant Bicycles,Standard,medium,large,709.48,42226
...,...,...,...,...,...,...,...,...,...
19996,2017-06-24 00:00:00,True,Approved,OHM Cycles,Standard,high,medium,1203.4,37823
19997,2017-11-09 00:00:00,True,Approved,Solex,Road,medium,medium,312.74,35560
19998,2017-04-14 00:00:00,True,Approved,OHM Cycles,Standard,medium,medium,44.71,40410
19999,2017-07-03 00:00:00,False,Approved,OHM Cycles,Standard,high,medium,136.73,38216


# The numerical columns are also stored as object dtype; we can handle this by type casting

In [23]:
numerical=['transaction_id','product_id','customer_id','list_price', 'standard_cost']

In [24]:
# Note: astype('int') truncates the decimal part of list_price and standard_cost
for col in numerical:
    Transactions[col]=Transactions[col].astype('int')
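
An integer cast discards the cents in the price columns (71.49 becomes 71). As a hedged alternative, the sketch below uses `pd.to_numeric`, which keeps decimals and, with `errors="coerce"`, turns unparseable entries into NaN instead of raising. The `sample` frame, its values, and the split into `id_columns` and `price_columns` are assumptions made for illustration.

```python
import pandas as pd

# Illustrative stand-in for Transactions; every column starts as object dtype.
sample = pd.DataFrame(
    {
        "transaction_id": ["1", "2", "3"],
        "list_price": ["71.49", "2091.47", "n/a"],
    },
    dtype="object",
)

id_columns = ["transaction_id"]   # whole numbers
price_columns = ["list_price"]    # decimals that should be preserved

for col in id_columns:
    # Identifiers must parse cleanly, so let bad values raise an error.
    sample[col] = pd.to_numeric(sample[col], errors="raise").astype("int64")

for col in price_columns:
    # errors="coerce" maps unparseable strings to NaN rather than failing.
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```

Coerced values show up as missing data, so they can be counted with `isna().sum()` and reviewed rather than silently altered.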

In [27]:
Transactions.select_dtypes(include='object').describe()

Unnamed: 0,transaction_date,online_order,order_status,brand,product_line,product_class,product_size
count,19445,19445,19445,19445,19445,19445,19445
unique,364,2,2,6,4,3,3
top,2017-08-18 00:00:00,True,Approved,Solex,Standard,medium,medium
freq,81,9739,19273,4169,13920,13587,12767


In [28]:
Transactions.select_dtypes(include='int').describe()

Unnamed: 0,transaction_id,product_id,customer_id,list_price,standard_cost,product_first_sold_date
count,19445.0,19445.0,19445.0,19445.0,19445.0,19445.0
mean,9989.257393,45.797737,1739.467267,1106.828799,555.037079,38201.758653
std,5779.669087,30.571996,1011.889153,582.625114,405.593133,2878.067854
min,1.0,0.0,1.0,12.0,7.0,33259.0
25%,4976.0,18.0,857.0,575.0,215.0,35667.0
50%,9985.0,45.0,1741.0,1163.0,507.0,38216.0
75%,14997.0,72.0,2615.0,1635.0,795.0,40672.0
max,20000.0,100.0,5034.0,2091.0,1759.0,42710.0


In [36]:
for col in Transactions.columns[Transactions.nunique()<5]:
    print(Transactions[col].value_counts())
    print("-"*15)

online_order
True     9739
False    9706
Name: count, dtype: int64
---------------
order_status
Approved     19273
Cancelled      172
Name: count, dtype: int64
---------------
product_line
Standard    13920
Road         3894
Touring      1213
Mountain      418
Name: count, dtype: int64
---------------
product_class
medium    13587
high       2952
low        2906
Name: count, dtype: int64
---------------
product_size
medium    12767
large      3900
small      2778
Name: count, dtype: int64
---------------


In [42]:
Transactions['transaction_date'] = pd.to_datetime(Transactions['transaction_date'])
Transactions['transaction_date'].describe()

count                            19445
mean     2017-07-01 16:21:18.189766144
min                2017-01-01 00:00:00
25%                2017-04-01 00:00:00
50%                2017-07-03 00:00:00
75%                2017-10-02 00:00:00
max                2017-12-30 00:00:00
Name: transaction_date, dtype: object

In [44]:
Transactions.duplicated().value_counts() 

False    19445
Name: count, dtype: int64

The key dimensions of data quality are:

- **Accuracy**: The degree to which the data values are correct. For example, the name and address of a customer should match the reality.
- **Completeness**: The extent to which the data fields are filled with valid values and do not have missing or contradictory information. For example, a customer record should have all the required fields such as name, phone number, email, etc.
- **Consistency**: The degree to which the data values are in agreement with each other and follow the same rules and formats. For example, the date format should be consistent across the data set and not vary from DD/MM/YYYY to MM/DD/YYYY.
- **Currency**: The extent to which the data values are up to date and reflect the current situation. For example, the stock price of a company should be updated frequently and not be outdated.
- **Relevancy**: The degree to which the data values are useful and meaningful for the intended purpose and audience. For example, the data set should contain the information that is needed to answer the business questions and not include irrelevant or redundant data.
- **Uniqueness**: The degree to which the data values are distinct and do not have duplicates. For example, each customer record should have a unique identifier and not be repeated in the data set.
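
Two of these dimensions, completeness and uniqueness, can be measured directly per column. The following is a minimal sketch under stated assumptions: `quality_profile` is a hypothetical helper name, and the `demo` frame is made-up data used only to show the shape of the result.

```python
import pandas as pd


def quality_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness and uniqueness, both as fractions of the row count."""
    n_rows = len(df)
    return pd.DataFrame(
        {
            # share of rows that hold a non-missing value
            "completeness": df.notna().sum() / n_rows,
            # distinct non-missing values relative to the row count
            "uniqueness": df.nunique(dropna=True) / n_rows,
        }
    )


# Made-up data for illustration only.
demo = pd.DataFrame(
    {"customer_id": [1, 2, 3, 4], "state": ["NSW", "NSW", None, "VIC"]}
)
profile = quality_profile(demo)
print(profile)
```

A uniqueness of 1.0 is what one would expect from an identifier column such as `customer_id`; lower values on a key column point to duplicates.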

# Analysis of CustomerAddress

In [49]:
CustomerAddress.head()

Unnamed: 0,customer_id,address,postcode,state,country,property_valuation
1,1,060 Morning Avenue,2016,New South Wales,Australia,10
2,2,6 Meadow Vale Court,2153,New South Wales,Australia,10
3,4,0 Holy Cross Court,4211,QLD,Australia,9
4,5,17979 Del Mar Point,2448,New South Wales,Australia,4
5,6,9 Oakridge Court,3216,VIC,Australia,9


In [47]:
CustomerAddress.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 1 to 3999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   object
 1   address             3999 non-null   object
 2   postcode            3999 non-null   object
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   object
dtypes: object(6)
memory usage: 187.6+ KB


In [48]:
CustomerAddress.describe()

Unnamed: 0,customer_id,address,postcode,state,country,property_valuation
count,3999,3999,3999,3999,3999,3999
unique,3999,3996,873,5,1,12
top,1,3 Mariners Cove Terrace,2170,NSW,Australia,9
freq,1,2,31,2054,3999,647


In [50]:
CustomerAddress.nunique()

customer_id           3999
address               3996
postcode               873
state                    5
country                  1
property_valuation      12
dtype: int64

In [51]:
CustomerAddress.isnull().sum()

customer_id           0
address               0
postcode              0
state                 0
country               0
property_valuation    0
dtype: int64

In [56]:
for col in CustomerAddress.columns:
    print(CustomerAddress[col].value_counts())
    print("-"*15)
    

customer_id
1       1
2676    1
2663    1
2664    1
2665    1
       ..
1343    1
1344    1
1345    1
1346    1
4003    1
Name: count, Length: 3999, dtype: int64
---------------
address
3 Mariners Cove Terrace      2
3 Talisman Place             2
64 Macpherson Junction       2
359 Briar Crest Road         1
4543 Service Terrace         1
                            ..
5063 Shopko Pass             1
09 Hagan Pass                1
87897 Lighthouse Bay Pass    1
294 Lawn Junction            1
320 Acker Drive              1
Name: count, Length: 3996, dtype: int64
---------------
postcode
2170    31
2155    30
2145    30
2153    29
3977    26
        ..
3808     1
3114     1
4721     1
4799     1
3089     1
Name: count, Length: 873, dtype: int64
---------------
state
NSW                2054
VIC                 939
QLD                 838
New South Wales      86
Victoria             82
Name: count, dtype: int64
---------------
country
Australia    3999
Name: count, dtype: int64
------------

In [64]:
CustomerAddress.duplicated().value_counts()

False    3999
Name: count, dtype: int64

# Analysis of CustomerDemographic



In [58]:
CustomerDemographic.head()

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,default,owns_car,tenure
1,1,Laraine,Medendorp,F,93,1953-10-12 00:00:00,Executive Secretary,Health,Mass Customer,N,"""'",Yes,11
2,2,Eli,Bockman,Male,81,1980-12-16 00:00:00,Administrative Officer,Financial Services,Mass Customer,N,<script>alert('hi')</script>,Yes,16
3,3,Arlin,Dearle,Male,61,1954-01-20 00:00:00,Recruiting Manager,Property,Mass Customer,N,2018-02-01 00:00:00,Yes,15
4,4,Talbot,,Male,33,1961-10-03 00:00:00,,IT,Mass Customer,N,() { _; } >_[$($())] { touch /tmp/blns.shellsh...,No,7
5,5,Sheila-kathryn,Calton,Female,56,1977-05-13 00:00:00,Senior Editor,,Affluent Customer,N,NIL,Yes,8


In [60]:
CustomerDemographic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 1 to 4000
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   customer_id                          4000 non-null   object
 1   first_name                           4000 non-null   object
 2   last_name                            3875 non-null   object
 3   gender                               4000 non-null   object
 4   past_3_years_bike_related_purchases  4000 non-null   object
 5   DOB                                  3913 non-null   object
 6   job_title                            3494 non-null   object
 7   job_industry_category                3344 non-null   object
 8   wealth_segment                       4000 non-null   object
 9   deceased_indicator                   4000 non-null   object
 10  default                              3698 non-null   object
 11  owns_car                             4000 n

In [59]:
CustomerDemographic.nunique()

customer_id                            4000
first_name                             3139
last_name                              3725
gender                                    6
past_3_years_bike_related_purchases     100
DOB                                    3448
job_title                               195
job_industry_category                     9
wealth_segment                            3
deceased_indicator                        2
default                                  90
owns_car                                  2
tenure                                   22
dtype: int64

In [62]:
CustomerDemographic.columns[CustomerDemographic.isnull().sum()>0]

Index(['last_name', 'DOB', 'job_title', 'job_industry_category', 'default',
       'tenure'],
      dtype='object')

In [65]:
CustomerDemographic['Full_name'] = CustomerDemographic['first_name']+' '+CustomerDemographic['last_name']
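
Plain `+` concatenation propagates missing values, so `Full_name` comes out empty whenever `last_name` is missing (customer 4, Talbot). A small sketch of one way around this follows, assuming an empty string is an acceptable stand-in for a missing name part. The `names` frame is a two-row illustration built from values shown in the head output.

```python
import pandas as pd

# Two rows taken from the head() output; the second has no last name.
names = pd.DataFrame(
    {"first_name": ["Laraine", "Talbot"], "last_name": ["Medendorp", None]}
)

full_name = (
    names["first_name"].fillna("")
    .str.cat(names["last_name"].fillna(""), sep=" ")  # join with a single space
    .str.strip()                                      # drop the dangling space
)
print(full_name.to_list())
```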

In [71]:
CustomerDemographic.head()

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,default,owns_car,tenure,Full_name
1,1,Laraine,Medendorp,F,93,1953-10-12 00:00:00,Executive Secretary,Health,Mass Customer,N,"""'",Yes,11,Laraine Medendorp
2,2,Eli,Bockman,Male,81,1980-12-16 00:00:00,Administrative Officer,Financial Services,Mass Customer,N,<script>alert('hi')</script>,Yes,16,Eli Bockman
3,3,Arlin,Dearle,Male,61,1954-01-20 00:00:00,Recruiting Manager,Property,Mass Customer,N,2018-02-01 00:00:00,Yes,15,Arlin Dearle
4,4,Talbot,,Male,33,1961-10-03 00:00:00,,IT,Mass Customer,N,() { _; } >_[$($())] { touch /tmp/blns.shellsh...,No,7,
5,5,Sheila-kathryn,Calton,Female,56,1977-05-13 00:00:00,Senior Editor,,Affluent Customer,N,NIL,Yes,8,Sheila-kathryn Calton


In [73]:
NewCustomerList.head()

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,NaN,NaN.1,NaN.2,NaN.3,NaN.4,Rank,Value
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,...,QLD,Australia,6,0.77,0.9625,1.203125,1.022656,1.0,1,1.71875
2,Morly,Genery,Male,69,1970-03-22,Structural Engineer,Property,Mass Customer,N,No,...,NSW,Australia,11,0.53,0.53,0.6625,0.563125,1.0,1,1.71875
3,Ardelis,Forrester,Female,10,1974-08-28 00:00:00,Senior Cost Accountant,Financial Services,Affluent Customer,N,No,...,VIC,Australia,5,0.89,0.89,0.89,0.89,1.0,1,1.71875
4,Lucine,Stutt,Female,64,1979-01-28,Account Representative III,Manufacturing,Affluent Customer,N,Yes,...,QLD,Australia,1,0.91,1.1375,1.1375,1.1375,4.0,4,1.703125
5,Melinda,Hadlee,Female,34,1965-09-21,Financial Analyst,Financial Services,Affluent Customer,N,No,...,NSW,Australia,9,0.66,0.66,0.825,0.825,4.0,4,1.703125


In [74]:
NewCustomerList.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 1 to 1000
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   first_name                           1000 non-null   object 
 1   last_name                            971 non-null    object 
 2   gender                               1000 non-null   object 
 3   past_3_years_bike_related_purchases  1000 non-null   object 
 4   DOB                                  983 non-null    object 
 5   job_title                            894 non-null    object 
 6   job_industry_category                835 non-null    object 
 7   wealth_segment                       1000 non-null   object 
 8   deceased_indicator                   1000 non-null   object 
 9   owns_car                             1000 non-null   object 
 10  tenure                               1000 non-null   object 
 11  address                       

In [75]:
NewCustomerList.nunique()

first_name                              940
last_name                               961
gender                                    3
past_3_years_bike_related_purchases     100
DOB                                     961
job_title                               184
job_industry_category                     9
wealth_segment                            3
deceased_indicator                        1
owns_car                                  2
tenure                                   23
address                                1000
postcode                                522
state                                     3
country                                   1
property_valuation                       16
NaN                                      71
NaN                                     129
NaN                                     183
NaN                                     327
NaN                                     319
Rank                                    324
Value                           

In [78]:
NewCustomerList.columns[NewCustomerList.isnull().sum()>0]

Index(['last_name', 'DOB', 'job_title', 'job_industry_category'], dtype='object')

In [79]:
NewCustomerList.head()

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,NaN,NaN.1,NaN.2,NaN.3,NaN.4,Rank,Value
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,...,QLD,Australia,6,0.77,0.9625,1.203125,1.022656,1.0,1,1.71875
2,Morly,Genery,Male,69,1970-03-22,Structural Engineer,Property,Mass Customer,N,No,...,NSW,Australia,11,0.53,0.53,0.6625,0.563125,1.0,1,1.71875
3,Ardelis,Forrester,Female,10,1974-08-28 00:00:00,Senior Cost Accountant,Financial Services,Affluent Customer,N,No,...,VIC,Australia,5,0.89,0.89,0.89,0.89,1.0,1,1.71875
4,Lucine,Stutt,Female,64,1979-01-28,Account Representative III,Manufacturing,Affluent Customer,N,Yes,...,QLD,Australia,1,0.91,1.1375,1.1375,1.1375,4.0,4,1.703125
5,Melinda,Hadlee,Female,34,1965-09-21,Financial Analyst,Financial Services,Affluent Customer,N,No,...,NSW,Australia,9,0.66,0.66,0.825,0.825,4.0,4,1.703125


In [83]:
pd.to_datetime(NewCustomerList['DOB']).describe()

count                              983
mean     1971-04-20 14:05:14.954221768
min                1938-06-08 00:00:00
25%                1957-10-09 00:00:00
50%                1972-03-24 00:00:00
75%                1983-04-12 12:00:00
max                2002-02-27 00:00:00
Name: DOB, dtype: object

# Summary of the Transactions data analysis

- product_first_sold_date has an invalid date format (values are stored as plain integers)
- [online_order, brand, product_line, product_class, product_size, standard_cost, product_first_sold_date] have missing values


# Summary of the CustomerAddress data analysis


- The country column has only one unique value
- The unit of measurement for property_valuation is missing
- The state column is imbalanced and mixes abbreviations with full names (NSW / New South Wales, VIC / Victoria)

# Summary of the CustomerDemographic data analysis


- ['last_name', 'DOB', 'job_title', 'job_industry_category', 'default', 'tenure'] have null values
- gender should have only two categories, but 6 distinct values are present
- The DOB column is imbalanced
- deceased_indicator is imbalanced
- The default column contains messy data

# Summary of the NewCustomerList data analysis



- NewCustomerList has NaN column names
- gender has unknown values
- deceased_indicator and country are constant throughout the data, so they carry no information
- ['last_name', 'DOB', 'job_title', 'job_industry_category'] have missing values


# Subject: Identification of Major Data Errors in Provided Datasets

DEAR AD,

Our analysis of the provided datasets revealed critical data discrepancies, including inconsistencies and formatting errors. These issues, representing major concerns, impact data reliability. We are actively documenting and addressing these errors, anticipating more to be identified. A comprehensive report outlining these findings and suggested rectification steps will be provided. Your collaboration in resolving these concerns is appreciated for ensuring data accuracy and usability.





| DATASETS | Invalid data | Column names missing | Constant columns | Outliers | Imbalanced data | Messy data | Missing data |
|---|---|---|---|---|---|---|---|
| Transaction data | YES | NO | NO | MAYBE | NO | NO | YES |
| Customer Address | NO | NO | NO | MAYBE | YES | NO | NO |
| Customer Demographic | YES | NO | NO | YES | YES | YES | YES |
| NewCustomer list | NO | YES | YES | MAYBE | NO | NO | YES |







Summary of Transactions Data Analysis:
1. Invalid Data Format in 'product_first_sold_date': The 'product_first_sold_date' column contains data in an invalid format, making it challenging to interpret or utilize effectively. 
2. Empty Data in Several Columns: Columns such as 'online_order', 'brand', 'product_line', 'product_class', 'product_size', 'standard_cost', and 'product_first_sold_date' contain empty or missing data, hindering analysis and impacting the completeness of the dataset.
Solutions:
1. Addressing 'product_first_sold_date' Format Issue: Convert the 'product_first_sold_date' column to a standardized date format (e.g., YYYY-MM-DD) to ensure uniformity and enable accurate date-related analyses. Utilize data parsing techniques or conversion functions to rectify the invalid format.
2. Handling Empty Data in Columns: 
   - Imputation: For numerical columns like 'standard_cost', consider imputing missing values using statistical measures (e.g., mean, median) to maintain data integrity. 
   - Categorical Data Handling: For categorical columns ('online_order', 'brand', 'product_line', 'product_class', 'product_size'), options include imputation with the mode or creating a separate category for missing values.
   - Validating 'product_first_sold_date': Cross-reference other date-related columns or data sources to validate and potentially fill in missing 'product_first_sold_date' values.
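
The solutions above can be sketched as follows. This is an illustration under stated assumptions, not a definitive fix: it assumes the integers in 'product_first_sold_date' are Excel serial dates (days counted from 1899-12-30), which should be confirmed with the data owner. The `tx` frame is a small made-up sample, and the choice of median and an "Unknown" category is just one of the options listed.

```python
import pandas as pd

# Made-up sample with the same kinds of gaps as the Transactions sheet.
tx = pd.DataFrame(
    {
        "product_first_sold_date": [42736, 41245, None],
        "standard_cost": [53.62, None, 248.82],
        "brand": ["Solex", None, "OHM Cycles"],
    }
)

# Assumption: the integers are Excel serial dates (days since 1899-12-30).
serial = pd.to_numeric(tx["product_first_sold_date"], errors="coerce")
tx["product_first_sold_date"] = pd.to_datetime(
    serial, unit="D", origin=pd.Timestamp("1899-12-30")
)

# Numerical column: fill gaps with the median of the observed values.
tx["standard_cost"] = tx["standard_cost"].fillna(tx["standard_cost"].median())

# Categorical column: keep missingness visible as its own category.
tx["brand"] = tx["brand"].fillna("Unknown")

print(tx)
```

Missing serial values stay missing (NaT) after the conversion, so they can still be validated against other sources as described in the last bullet.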


Summary of CustomerAddress Data Analysis:
1. Country Column Issue: The 'Country' column appears to contain a singular value throughout, indicating potential data uniformity or entry error. This uniformity could potentially limit its significance for analysis or segmentation purposes, requiring further validation or correction.
2. Missing Property Valuation Unit Measurements: The absence of unit measurements for property valuations poses a challenge in accurately interpreting and comparing these values. The lack of standard units (e.g., currency, square footage, etc.) impacts the comprehensibility and usefulness of this data.
3. Imbalanced Data in the 'State' Column: An imbalance in the distribution of data within the 'State' column suggests unequal representation among different states or regions. This imbalance might skew analyses or modeling processes that rely on balanced data distribution.
Recommended Actions:
1. Validation and Correction of 'Country' Data: Conduct a thorough review to confirm if the 'Country' column indeed contains only one value. If so, consider verifying the accuracy of this information or rectifying any potential data entry errors.
2. Adding Property Valuation Unit Measurements: Collaborate with relevant sources or stakeholders to incorporate unit measurements (e.g., currency symbols, square footage) for property valuations, ensuring clarity and consistency in the dataset.
3. Addressing Imbalanced Data in the 'State' Column: Explore strategies to mitigate the imbalance by collecting additional data from underrepresented states/regions or applying specialized techniques (e.g., oversampling, stratified sampling) during analysis to balance the representation of states.

Summary of CustomerDemographic Data Analysis:
1. Null Values in Multiple Columns: Columns 'last_name', 'DOB' (Date of Birth), 'job_title', 'job_industry_category', 'default', and 'tenure' contain null or missing values, potentially impacting the completeness and reliability of the dataset.
2. Gender Column Issue: Despite expecting only two types of values for gender, the presence of six values suggests inconsistencies or data entry errors, which may require validation and correction.
3. Imbalanced Dataset in DOB Column: An imbalanced dataset in the 'DOB' column might indicate irregularities in the distribution of birth dates, potentially influencing analyses involving age demographics or cohort-based segmentation.
4. Imbalanced 'deceased_indicator' Values: The 'deceased_indicator' column exhibits an imbalance in its indicators, suggesting unequal representation of deceased and non-deceased customers.
5. Messy Data in 'default' Column: The 'default' column contains messy or unstructured data, which could hinder its interpretation and utility for analysis.
Recommended Actions:
1. Handling Null Values: Conduct thorough data cleaning by addressing null values in columns such as 'last_name', 'DOB', 'job_title', 'job_industry_category', 'default', and 'tenure'. Strategies might involve imputation, deletion of rows, or acquiring missing data if feasible.
2. Data Validation for Gender: Verify and rectify inconsistencies in the 'gender' column, ensuring it contains only the expected two types of values (e.g., 'M' for Male, 'F' for Female), potentially by standardizing the entries.
3. Addressing Imbalanced Data: Investigate and mitigate imbalanced distributions in the 'DOB', 'deceased_indicator', and any other relevant columns, as balanced data representation is essential for unbiased analyses.
4. Cleaning 'default' Column: Process and clean the 'default' column to ensure consistency and clarity in data entries for improved analysis and interpretation.
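
The gender standardization recommended above might look like the sketch below. The spellings in `GENDER_MAP` are assumptions: only 'F', 'Male', and 'Female' are visible in the preview, so the real column should be inspected with `value_counts()` before any mapping is fixed. The helper name `normalize_gender` and the fallback label 'U' are likewise hypothetical.

```python
import pandas as pd

# Assumed spellings; verify against value_counts() on the real column first.
GENDER_MAP = {"f": "F", "female": "F", "m": "M", "male": "M"}


def normalize_gender(values: pd.Series) -> pd.Series:
    """Map known spellings to 'M' or 'F'; anything unrecognized becomes 'U'."""
    cleaned = values.str.strip().str.lower()
    # map() yields NaN for keys missing from the dictionary and for missing input
    return cleaned.map(GENDER_MAP).fillna("U")


gender = pd.Series(["F", "Male", "Female", "M", None])
print(normalize_gender(gender).to_list())
```

Routing unrecognized entries to a separate label keeps them countable instead of forcing them into one of the two expected categories.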
Summary of NewCustomerList Data Analysis:
1. NaN Column Names: The presence of columns with NaN (Not a Number) names in the NewCustomerList dataset indicates potential data structure issues or incorrect parsing during data import, requiring immediate attention and rectification.
2. 'Gender' Column with Unknown Values: The 'gender' column contains entries labeled as 'unknown,' which disrupts the expected binary classification, potentially impacting gender-based analyses and segmentation.
3. Constant 'deceased_indicator' and 'country' Columns: The columns 'deceased_indicator' and 'country' each hold a single value across the whole dataset, so they carry no information for analysis or segmentation.
4. Missing Values in ['last_name', 'DOB', 'job_title', 'job_industry_category'] Columns: Multiple columns, namely 'last_name', 'DOB' (Date of Birth), 'job_title', and 'job_industry_category', exhibit missing values, potentially impacting the completeness and accuracy of the dataset.
Recommended Actions:
1. Address NaN Column Names: Investigate and resolve the issue causing NaN column names, ensuring proper data structure and parsing during data loading or import.
2. Handling 'Unknown' Values in 'Gender': Standardize 'gender' values by validating and correcting entries labeled as 'unknown' to ensure the expected binary classification (e.g., 'M' for Male, 'F' for Female), promoting consistency in gender-related analyses.
3. Reviewing Constant Columns: Confirm that the single value in the 'deceased_indicator' and 'country' columns is expected rather than the result of an extraction issue; if it is expected, the columns can be excluded from analysis.
4. Handling Missing Values: Address missing values in columns ['last_name', 'DOB', 'job_title', 'job_industry_category'] through appropriate data imputation methods or acquiring the missing data if feasible to enhance the dataset's completeness and usability for analyses.
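
Actions 1 and 3 can be sketched together: drop columns whose label is missing, then list the columns that hold a single value. This is a minimal illustration, assuming the NaN-labelled columns are not needed. The `new_customers` frame is a made-up two-row sample, and whether to drop or rename the unlabeled columns is a decision for the data owner.

```python
import numpy as np
import pandas as pd

# Made-up sample with one unlabeled (NaN) column and one constant column.
new_customers = pd.DataFrame(
    [["Chickie", "Australia", 0.77], ["Morly", "Australia", 0.53]],
    columns=["first_name", "country", np.nan],
)

# Keep only columns that have a real (non-missing) label.
named = new_customers.loc[:, new_customers.columns.notna()]

# Collect columns with a single distinct value; they carry no signal.
constant_columns = [
    col for col in named.columns if named[col].nunique(dropna=False) <= 1
]

print(list(named.columns), constant_columns)
```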





Regards,
DARSHAN KUMAR R

