# Data Inconsistencies/Abnormalities/Anomalies

1. Duplicate Records\
Multiple entries for the same entity, which can skew analysis and reporting.
2. Inconsistent Formats\
Variations in data formats, such as dates (e.g., MM/DD/YYYY vs. DD/MM/YYYY) or inconsistent use of uppercase and lowercase letters.
3. Contradictory Values\
Conflicting data points within the same record, such as a person's age being listed as 30 in one field and 40 in another.
4. Missing Values\
Absence of data in expected fields, which can lead to incomplete analyses.
5. Outliers\
Values that deviate significantly from the expected range, which may represent errors or genuine anomalies.
6. Inconsistent Units\
Different measurement units used for the same variable, such as height in centimeters in one record and inches in another.
7. Incorrect Data Types\
Data stored in the wrong format, such as numeric values stored as text, which complicates analysis.
8. Ambiguous Data\
Data that lacks clarity or definition, making it difficult to interpret (e.g., "N/A" used for multiple meanings).
9. Data Integration Issues\
Discrepancies arising from merging datasets from different sources, leading to inconsistencies in variable definitions or scales.
10. Historical Inconsistencies\
Changes in data collection methods or definitions over time, causing inconsistencies in longitudinal data.
11. Inconsistent Naming Conventions\
Varying formats or styles for naming variables or entities within a dataset (e.g., `U.S.A`, `USA`, and `United States` for the same country), leading to confusion and potential errors in data analysis.
12. Typographical Errors\
Misspellings or spelling variants of the same value (e.g., `Hamza` vs. `Hazma`) that create discrepancies and can hinder data analysis and interpretation.
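
Several of the issues listed above can be counted before any cleaning starts. The following is a minimal sketch, not a complete data-quality check: the `audit` helper and the `sample` DataFrame are hypothetical names introduced only for illustration, and the function covers just duplicate records, missing values, and numbers stored as text.

```python
import pandas as pd

def audit(frame: pd.DataFrame) -> dict:
    """Return simple counts for three of the issues listed above."""
    text_like = [
        col for col in frame.columns
        if pd.api.types.is_object_dtype(frame[col])
        or pd.api.types.is_string_dtype(frame[col])
    ]
    return {
        # Rows that exactly repeat an earlier row (issue 1)
        "duplicate_rows": int(frame.duplicated().sum()),
        # Number of missing values in each column (issue 4)
        "missing_per_column": frame.isna().sum().to_dict(),
        # Text columns in which every value parses as a number (issue 7)
        "numeric_stored_as_text": [
            col for col in text_like
            if pd.to_numeric(frame[col], errors="coerce").notna().all()
        ],
    }

# Hypothetical sample data
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "height": ["170", "165", "165", "180"],  # numbers stored as text
    "city": ["Lahore", None, None, "Karachi"],
})
report = audit(sample)
print(report)
```

A report like this only points at columns worth inspecting; deciding how to repair them still depends on what the data means.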



In [77]:
import pandas as pd

In [78]:
# Sample data with deliberately inconsistent dates, country names, and name spellings
data = {'Date': ['2021-12-01', '01-12-2022', '2022/12/01', '12-01-2021'],
        'Country': ['U.S.A', 'USA', 'America', 'United States'],
        'Name': ['Aammar', 'Hamza', 'ammar', 'Hazma'],
        'Sales_2020': [100, 200, None, 200],
        'Sales_2021': [150, None, 250, 300]}

# Convert to DataFrame
df = pd.DataFrame(data)

In [79]:
df.head()

         Date        Country    Name  Sales_2020  Sales_2021
0  2021-12-01          U.S.A  Aammar       100.0       150.0
1  01-12-2022            USA   Hamza       200.0         NaN
2  2022/12/01        America   ammar         NaN       250.0
3  12-01-2021  United States   Hazma       200.0       300.0


In [80]:
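# Standardizing the date format
# Note: errors='coerce' turns every value pandas cannot parse into NaT,
# so rows 1-3 lose their dates in the output below
df['Date'] = pd.to_datetime(df['Date'], errors='coerce').dt.strftime('%Y-%m-%d')
df.head()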

         Date        Country    Name  Sales_2020  Sales_2021
0  2021-12-01          U.S.A  Aammar       100.0       150.0
1         NaN            USA   Hamza       200.0         NaN
2         NaN        America   ammar         NaN       250.0
3         NaN  United States   Hazma       200.0       300.0
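
The output above shows that three of the four dates were coerced to missing values rather than standardized. One way to keep them is to try several explicit formats in turn. This is a sketch under assumptions: `parse_with_formats` and `CANDIDATE_FORMATS` are hypothetical names, and reading `01-12-2022` and `12-01-2021` as day-month-year is a guess, since the data does not say which order was intended.

```python
import pandas as pd

raw = pd.Series(['2021-12-01', '01-12-2022', '2022/12/01', '12-01-2021'])

# Assumed formats, tried in order; the day-first reading of dd-mm-yyyy is a guess
CANDIDATE_FORMATS = ['%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y']

def parse_with_formats(values: pd.Series, formats: list) -> pd.Series:
    """Parse each value with the first format that matches; leave the rest as NaT."""
    parsed = pd.to_datetime(values, format=formats[0], errors='coerce')
    for fmt in formats[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors='coerce')
        # Fill only the positions that earlier formats could not parse
        parsed = parsed.fillna(attempt)
    return parsed

dates = parse_with_formats(raw, CANDIDATE_FORMATS)
print(dates.dt.strftime('%Y-%m-%d').tolist())
```

Any value that matches none of the listed formats stays missing, which makes the remaining problem rows easy to find with `dates.isna()`.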


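To impute the missing values in the `Date` column, one option is `forward fill`, which copies the last valid value into each gap that follows it.\
<b>Best for:</b> Time series data where the last known date is likely to be a good estimate for subsequent missing dates.\
<b>Consideration:</b> Assumes that values do not change drastically over time. Here the three missing dates originally held different values, so forward fill replaces all of them with `2021-12-01`.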

In [81]:
# Forward fill: copy the last valid date into each missing row
df['Date'] = df['Date'].ffill()
print("\nAfter Forward Fill:")
print(df)


After Forward Fill:
         Date        Country    Name  Sales_2020  Sales_2021
0  2021-12-01          U.S.A  Aammar       100.0       150.0
1  2021-12-01            USA   Hamza       200.0         NaN
2  2021-12-01        America   ammar         NaN       250.0
3  2021-12-01  United States   Hazma       200.0       300.0
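
Forward fill propagates a value across a gap of any length unless it is capped. As a small illustration, assuming a hypothetical series of daily readings, the `limit` parameter of `ffill` restricts how many consecutive missing rows are filled:

```python
import pandas as pd

# Hypothetical daily readings with a three-day gap
readings = pd.Series(
    [10.0, None, None, None, 14.0],
    index=pd.date_range('2021-12-01', periods=5, freq='D'),
)

# limit=1 carries a value forward across at most one consecutive missing row
filled = readings.ffill(limit=1)
print(filled)
```

The rows that remain missing after a capped fill are the ones where the "values do not change drastically" assumption is weakest.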


In [82]:
# Harmonizing Country Names
country_mapping = {'U.S.A': 'USA', 'America': 'USA', 'United States': 'USA'}
df['Country'] = df['Country'].replace(country_mapping)
df.head()

         Date Country    Name  Sales_2020  Sales_2021
0  2021-12-01     USA  Aammar       100.0       150.0
1  2021-12-01     USA   Hamza       200.0         NaN
2  2021-12-01     USA   ammar         NaN       250.0
3  2021-12-01     USA   Hazma       200.0       300.0
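
A hand-written mapping has to list every spelling exactly. A possible refinement is to normalize case, whitespace, and punctuation first, so that fewer aliases are needed and unknown values can be reported. This is only a sketch: the `ALIASES` table and the extra values `'usa '` and `'Canada'` are assumptions added for illustration.

```python
import pandas as pd

countries = pd.Series(['U.S.A', 'usa ', 'America', 'United States', 'Canada'])

# Normalize: trim whitespace, drop periods, lowercase
normalized = (
    countries.str.strip()
             .str.replace('.', '', regex=False)
             .str.lower()
)

# Hypothetical alias table keyed by the normalized spelling
ALIASES = {'usa': 'USA', 'america': 'USA', 'united states': 'USA'}

harmonized = normalized.map(ALIASES)     # values without an alias become NaN
unmapped = countries[harmonized.isna()]  # original spellings still to be reviewed
print(harmonized.tolist())
print(unmapped.tolist())
```

Unlike `replace`, `map` leaves unknown values as missing, which makes gaps in the alias table visible instead of passing them through silently.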


In [83]:
# Correct the typographical errors in names
# Here 'ammar' is treated as a misspelling of 'Aammar', and 'Hazma' of 'Hamza'
df['Name'] = df['Name'].replace({'ammar': 'Aammar', 'Hazma': 'Hamza'})
df.head()

         Date Country    Name  Sales_2020  Sales_2021
0  2021-12-01     USA  Aammar       100.0       150.0
1  2021-12-01     USA   Hamza       200.0         NaN
2  2021-12-01     USA  Aammar         NaN       250.0
3  2021-12-01     USA   Hamza       200.0       300.0
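
When the misspellings are not known in advance, approximate string matching against a list of accepted spellings can suggest corrections. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `CANONICAL` list, the `closest` helper, and the `0.75` cutoff are assumptions chosen for this small example and would need tuning on real data.

```python
import difflib
import pandas as pd

names = pd.Series(['Aammar', 'Hamza', 'ammar', 'Hazma'])

# Assumed list of known-good spellings
CANONICAL = ['Aammar', 'Hamza']

def closest(name: str, choices: list, cutoff: float = 0.75) -> str:
    """Return the most similar accepted spelling, or the name unchanged if none is close."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

corrected = names.map(lambda n: closest(n, CANONICAL))
print(corrected.tolist())
```

Fuzzy matching can also merge names that are genuinely different, so the suggested corrections are worth reviewing before they are written back.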


In [84]:
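# Drop rows that repeat a Name, keeping the first occurrence
# Note: the dropped rows (index 2 and 3) carried different sales figures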
df = df.drop_duplicates(subset='Name')
df.head()

         Date Country    Name  Sales_2020  Sales_2021
0  2021-12-01     USA  Aammar       100.0       150.0
1  2021-12-01     USA   Hamza       200.0         NaN
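
Dropping on `Name` alone discards the sales figures held by the later rows. Before dropping, it can help to list every row that shares a key, and in some cases to merge the rows instead. The following sketch assumes a hypothetical `df_names` frame that mirrors the data at this step; the merge rule (first non-missing value per column) is one possible choice, not the only one.

```python
import pandas as pd

df_names = pd.DataFrame({
    'Name': ['Aammar', 'Hamza', 'Aammar', 'Hamza'],
    'Sales_2020': [100, 200, None, 200],
    'Sales_2021': [150, None, 250, 300],
})

# keep=False marks every row that shares a Name, so all of them can be reviewed
candidates = df_names[df_names.duplicated(subset='Name', keep=False)]
print(candidates)

# Assumed merge rule: keep the first non-missing value in each column per Name
merged = df_names.groupby('Name', as_index=False).first()
print(merged)
```

With this rule the merged `Hamza` row keeps the `Sales_2021` value of 300 that a plain `drop_duplicates` would lose.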


In [85]:
# Resolving contradictory data in Sales_2020 and Sales_2021
# Assumed rule: Sales_2021 must be greater than Sales_2020; drop rows that break it
# Comparisons with a missing value are False, so row 1 (Sales_2021 missing) is kept
df = df.drop(df[df['Sales_2021'] <= df['Sales_2020']].index)
df.head()

         Date Country    Name  Sales_2020  Sales_2021
0  2021-12-01     USA  Aammar       100.0       150.0
1  2021-12-01     USA   Hamza       200.0         NaN
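
Because a comparison against a missing value is always False, the row with no `Sales_2021` passes the check without actually being tested. An alternative to dropping is to label each row, which separates rule violations from rows that cannot be evaluated. This is a sketch: the `sales` frame, the `check` column, and the extra `Sara` row are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    'Name': ['Aammar', 'Hamza', 'Sara'],
    'Sales_2020': [100.0, 200.0, 300.0],
    'Sales_2021': [150.0, None, 250.0],
})

# The rule can only be tested where both years are present
both_present = sales['Sales_2020'].notna() & sales['Sales_2021'].notna()
violates = both_present & (sales['Sales_2021'] <= sales['Sales_2020'])

# Label rows instead of deleting them
sales['check'] = 'ok'
sales.loc[violates, 'check'] = 'violates rule'
sales.loc[~both_present, 'check'] = 'cannot evaluate'
print(sales)
```

Labelled rows can then be filtered, corrected, or dropped in a later step once the cause of each contradiction is understood.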


# Conclusion
Data inconsistencies, abnormalities, and anomalies can lead to poor decision-making, reduced data quality, and operational inefficiencies. They undermine customer trust and may result in regulatory non-compliance. Addressing these issues is crucial for maintaining data integrity and ensuring accurate analyses.