# Assessing Completeness

Completeness of data means that the main features/attributes required for the analysis do not have missing values. Missing values can skew the analysis and lead to misleading trends. 

The metric for completeness is the share of records that have a value for every feature the analysis requires (optional features are not counted). For example, if 30 out of 100 records are missing the industry feature that you need for your marketing campaign, the completeness of your data is 70%.
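
As a minimal sketch of this ratio, the example below computes completeness per column and per record. The `records` DataFrame, its values, and the `required` list are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 10 records, 3 of them missing the required `industry` feature
records = pd.DataFrame({
    "company": [f"c{i}" for i in range(10)],
    "industry": ["retail", np.nan, "tech", "tech", None,
                 "health", "retail", np.nan, "tech", "health"],
    "notes": [np.nan] * 10,  # optional feature, ignored below
})

# Assumption: only these columns are required for the analysis
required = ["company", "industry"]

# Share of non-missing values in each required column
completeness_per_column = records[required].notna().mean()

# Share of records in which every required column has a value
row_completeness = records[required].notna().all(axis=1).mean()

print(completeness_per_column)
print(f"Row-level completeness: {row_completeness:.0%}")
```

Reporting both numbers is useful: the per-column figure shows which feature causes the gap, while the row-level figure tells you how many records are usable as they are.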

> _important questions would be:_
>* _Are there any missing values in the data?_
>* _Find the earliest date(s) in our “recent” data file and confirm that they are before a specific date._

> _in case of two data sets:_
>* _Compare the file sizes and/or row counts of the two datasets to confirm that the more recent file is larger than the older file._
>* _Compare the data they contain to confirm that all the records in the earlier file already exist in the later file._
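
The two-dataset checks could be sketched as follows. `df_old` and `df_new` are hypothetical stand-ins for the older and the more recent file, and the sketch assumes both share the same columns.

```python
import pandas as pd

# Hypothetical stand-ins for the older and the more recent file
df_old = pd.DataFrame({"id": [1, 2, 3], "gender": ["f", "m", "f"]})
df_new = pd.DataFrame({"id": [1, 2, 3, 4], "gender": ["f", "m", "f", "m"]})

# Check 1: the more recent dataset should contain more rows
newer_is_larger = len(df_new) > len(df_old)
print("old rows:", len(df_old), "new rows:", len(df_new))
print("newer file is larger:", newer_is_larger)

# Check 2: every old record should also exist in the new data.
# A left merge on all shared columns adds a `_merge` column;
# "left_only" marks rows that were found only in the old data.
merged = df_old.merge(df_new.drop_duplicates(), how="left", indicator=True)
missing_from_new = merged[merged["_merge"] == "left_only"]
print("old records missing from the new file:", len(missing_from_new))
```

Dropping duplicates from the newer data before merging keeps the merge from multiplying rows, so the count of missing records stays easy to read.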

Keep in mind that not every kind of missing data is detected by the default pandas checks (`isna()`/`isnull()`). These only recognize values such as `NaN`, `None`, and `NaT`, so a missing value that has been encoded as an empty string or as an arbitrary placeholder value goes unnoticed.
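
A short sketch of how such disguised missing values might be surfaced is shown below. The `sample` DataFrame and both placeholder lists are assumptions made for illustration; which placeholders actually occur depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values hidden behind placeholders
sample = pd.DataFrame({
    "gender": ["f", "", "m", "unknown", None, "  "],
    "age": [34, -1, 27, 999, 45, 51],
})

# Only the real None is reported as missing here
print(sample.isna().sum())

# Assumed placeholder values; adjust these to your own data
text_placeholders = ["", "unknown"]
age_placeholders = [-1, 999]

cleaned = sample.copy()
# Strip whitespace first so that "  " is treated like an empty string
cleaned["gender"] = cleaned["gender"].str.strip().replace(text_placeholders, np.nan)
cleaned["age"] = cleaned["age"].replace(age_placeholders, np.nan)

# The disguised missing values are now counted as well
print(cleaned.isna().sum())
```

Converting placeholders to `NaN` on a copy leaves the raw data untouched, which makes it easy to compare the counts before and after.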

Let's see what strategies we can use to assess completeness. 


In [None]:
# import the libraries used in this notebook
import numpy as np
import pandas as pd

### 1. Are there any missing values in the data?

In [None]:
# for example, by checking for NaNs
check_nan = df_es.isnull().values.any()
print("Are there any missing values? " + str(check_nan))

print("Number of missing values in each column:")
print(df_es.isna().sum())

# You can replace NaN values with a string of your choice if you want
# df_es.fillna('N/A')

You can also count how often each value occurs in a column with `value_counts()`.
This function has a ```dropna``` argument that defaults to `True`, so missing values are left out of the result.
You have to explicitly pass `dropna=False` to make it count the missing values as well.

In [None]:
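# default: missing values are excluded from the counts
print(df_es['gender'].value_counts())
# with dropna=False: missing values are counted as well
print(df_es['gender'].value_counts(dropna=False))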

### 2. Find the earliest date(s) in the dataset file and confirm that they are recent.

In this case, find the earliest date(s) in our “recent” data file and confirm that they are before June 30, 2023.

In [None]:
# convert the values in the `entry_date` column to *actual* dates
df_es['entry_date'] = pd.to_datetime(df_es['entry_date'], format='%Y-%m-%d')

# print out the `min()` and `max()` values in the `entry_date` column
print(df_es['entry_date'].min())
print(df_es['entry_date'].max())
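
The comparison against the cutoff date can also be made explicit in code. The sketch below uses a small hypothetical `dates` DataFrame in place of `df_es`, and passes `errors="coerce"` so that unparseable entries become `NaT` (and can be counted as missing) instead of raising an error.

```python
import pandas as pd

# Hypothetical stand-in for the `entry_date` column of the dataset
dates = pd.DataFrame(
    {"entry_date": ["2023-01-15", "2023-03-02", "not a date", "2023-06-12"]}
)

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(dates["entry_date"], format="%Y-%m-%d", errors="coerce")
print("unparseable dates:", parsed.isna().sum())

# Cutoff date taken from the text above
cutoff = pd.Timestamp("2023-06-30")

# min() skips NaT values by default
earliest = parsed.min()
print("earliest date:", earliest.date())
print("earliest date is before the cutoff:", earliest < cutoff)
```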
