In [3]:
import numpy as np 
import pandas as pd

***Tasks***
*1. Data Quality*

a. Provide an example of poor quality structured data

b. How would you recognize this poor quality? Write 3 - 4 sentences that show how the data fails to include properties of good quality data.

c. Provide an example of poor quality unstructured data

d. Unstructured data can be more difficult to assess than structured data. Just as you did in part b, write 3 - 4 sentences that show how this unstructured data fails to check requirements of good quality data.

*Example of poor quality structured data*

In [11]:
# source: https://www.kaggle.com/datasets/suriyaganesh/resume-dataset-structured?select=01_people.csv

# Load dataset  
file_path = "01_people.csv"  
people = pd.read_csv(file_path)  

# Display dataset preview  
people.head()

Unnamed: 0,person_id,name,email,phone,linkedin
0,1,Database Administrator,,,
1,2,Database Administrator,,,
2,3,Oracle Database Administrator,,,
3,4,Amazon Redshift Administrator and ETL Develope...,,,
4,5,Scrum Master Scrum Master Scrum Master,,,


In [37]:
missing = people.isnull().sum()
missing

person_id        0
name           114
email        53340
phone        53100
linkedin     46395
dtype: int64

In [19]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
people.info()
print(people.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54933 entries, 0 to 54932
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  54933 non-null  int64 
 1   name       54819 non-null  object
 2   email      1593 non-null   object
 3   phone      1833 non-null   object
 4   linkedin   8538 non-null   object
dtypes: int64(1), object(4)
memory usage: 2.1+ MB
(54933, 5)


# Table Schemas    

| Column Name | Data Type     | Description                        | Constraints             | Example                     |
|-------------|--------------|------------------------------------|-------------------------|-----------------------------|
| person_id   | INTEGER      | Unique identifier for each person  | Primary Key, Not Null   | 1                           |
| name        | VARCHAR(255) | Full name of the person            | May be Null             | "Database Administrator"    |
| email       | VARCHAR(255) | Email address                      | May be Null             | "john.doe@email.com"        |
| phone       | VARCHAR(50)  | Contact number                     | May be Null             | "+1-555-0123"               |
| linkedin    | VARCHAR(255) | LinkedIn profile URL               | May be Null             | "linkedin.com/in/johndoe"   |
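
The constraints in the schema table can be checked in code rather than by eye. The sketch below is only an illustration: `check_people_schema` is a hypothetical helper, not part of the original notebook, and the toy rows are invented so the block runs without `01_people.csv`.

```python
import pandas as pd

EXPECTED_COLUMNS = {"person_id", "name", "email", "phone", "linkedin"}

def check_people_schema(df: pd.DataFrame) -> dict:
    """Return simple pass/fail checks for the constraints in the schema table."""
    return {
        # Primary key: no missing values and no repeated identifiers
        "person_id_not_null": bool(df["person_id"].notna().all()),
        "person_id_unique": bool(df["person_id"].is_unique),
        # All five documented columns are present
        "expected_columns": set(df.columns) >= EXPECTED_COLUMNS,
    }

# Hypothetical toy rows, not taken from 01_people.csv
toy = pd.DataFrame({
    "person_id": [1, 2, 2],
    "name": ["Database Administrator", None, "Jane Roe"],
    "email": [None, None, "jane.roe@example.com"],
    "phone": [None, None, None],
    "linkedin": [None, None, None],
})

schema_checks = check_people_schema(toy)
print(schema_checks)
```

On the real data the same call would be `check_people_schema(people)`. The toy frame repeats `person_id` 2 on purpose, so the uniqueness check fails there.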


**b. How would you recognize this poor quality?**

When we first examined the dataset, we observed that many of the key fields were blank: email (about 97% missing), phone (about 97% missing), and linkedin (about 84% missing), which immediately made the table feel *incomplete*. Looking closer, we noticed that the name column is inconsistent: it often contains job titles instead of individuals' names. It also contains inaccurate entries, such as one row that lists “Scrum Master” three times. These are exactly the kinds of problems that Domo (2023) and Freeman (2024) describe when discussing poor data quality: missing values, inconsistent records, and information that can’t really be trusted.

| Field      | Missing Values |
|------------|----------------|
| person_id  | 0              |
| name       | 114            |
| email      | 53,340         |
| phone      | 53,100         |
| linkedin   | 46,395         |
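
The missing-value percentages can be derived from the counts instead of being estimated. The following is a minimal sketch: `missing_report` is a hypothetical helper name, and the toy frame is invented so the block runs on its own.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent_missing": percent})

# Hypothetical toy rows, not taken from 01_people.csv
toy_people = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "name": ["Database Administrator", "Scrum Master", None, "Jane Roe"],
    "email": [None, None, None, "jane.roe@example.com"],
})

report = missing_report(toy_people)
print(report)
```

Applied to the counts in the table above with 54,933 rows, the same arithmetic gives roughly 97.1% for email, 96.7% for phone, and 84.5% for linkedin.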

*Example of poor quality unstructured data*

In [63]:
# source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

# Load dataset  
file_path_ = "spam.csv"  
spam = pd.read_csv(file_path_, encoding="latin1") 

# Display dataset preview  
spam.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


### v1 (class)
- Values: ham (87%), spam (13%)
- Total: 5572
- Unique values: 2
- Most common: ham

### v2 (sms)
- Total messages: 5572
- Unique messages: 5169
- Most common: "Sorry, I'll call later"

### Other columns (Unnamed: 2–4)
- Mostly empty or missing
- Can be ignored for analysis
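
The summary figures above can be reproduced with a few pandas calls. This is a sketch under assumptions: `summarize_labels_and_text` is a hypothetical helper, and the toy frame is a hand-made stand-in for `spam.csv`.

```python
import pandas as pd

def summarize_labels_and_text(df: pd.DataFrame,
                              label_col: str = "v1",
                              text_col: str = "v2") -> dict:
    """Class balance, uniqueness, and most frequent message for a labeled text table."""
    label_share = df[label_col].value_counts(normalize=True).mul(100).round(1)
    return {
        "total": len(df),
        "label_share_percent": label_share.to_dict(),
        "unique_messages": int(df[text_col].nunique()),
        # Rows whose text repeats an earlier row
        "duplicate_messages": int(df[text_col].duplicated().sum()),
        "most_common_message": df[text_col].mode().iloc[0],
    }

# Hypothetical toy rows standing in for spam.csv
toy_spam = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham"],
    "v2": ["Sorry, I'll call later", "Ok lar...", "Free entry!", "Sorry, I'll call later"],
})

summary = summarize_labels_and_text(toy_spam)
print(summary)
```

With the figures reported above, 5572 total messages and 5169 unique ones imply 403 repeated rows, which is another quality signal worth noting.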

In [65]:
missing_ = spam.isnull().sum()
missing_

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [67]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
spam.info()
print(spam.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
(5572, 5)


This dataset is a messy example of unstructured data. The useful content sits in the first two columns (v1 and v2), which hold the labels and the messages, while the remaining columns (Unnamed: 2–4) are almost entirely empty. That inconsistency makes the dataset harder to trust. MacDonald (2025) describes unstructured data as “messy” and often missing standardized labels, which fits this case well. IBM (2023) also points out that uneven data like this easily leads to errors.

If we were actually analyzing these text messages, we think the blank fields and the inconsistent formatting would produce inaccurate results: the software might misread the messages or return incomplete output. So even though the dataset is usable, its raw state is not good quality, and it shows the kinds of problems we encounter with unstructured information.

| #   | Column       | Non-Null Count | Dtype  |
|-----|-------------|----------------|--------|
| 0   | v1          | 5572 non-null  | object |
| 1   | v2          | 5572 non-null  | object |
| 2   | Unnamed: 2  | 50 non-null    | object |
| 3   | Unnamed: 3  | 12 non-null    | object |
| 4   | Unnamed: 4  | 6 non-null     | object |
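
One possible first cleaning step is to drop the nearly empty columns and give the rest descriptive names. The sketch below assumes a 95% missing-value threshold and the new names `label` and `message`; both are illustrative choices, not taken from the dataset's documentation, and `drop_sparse_columns` is a hypothetical helper.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.95) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most max_missing."""
    missing_share = df.isnull().mean()
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep]

# Hypothetical toy rows standing in for spam.csv
toy_raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["Ok lar...", "Free entry!", "See you soon", "Call me"],
    "Unnamed: 2": [None, None, None, None],
    "Unnamed: 3": [None, None, None, None],
})

cleaned = drop_sparse_columns(toy_raw).rename(columns={"v1": "label", "v2": "message"})
print(cleaned.columns.tolist())
```

Dropping a column also discards its few non-null entries (50, 12, and 6 in the table above), so it is worth inspecting those rows before removing them.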