# Data Inconsistency / Anomalies in Data

Inconsistencies can appear in data in the following ways:
- Inconsistent formats (different date formats, e.g. DD-MM-YYYY or YY/MM/DD)
- Inconsistent naming conventions (such as USA, United States, U.S.A, United America)
- Typographical errors (such as writing Pakistan in three different ways, e.g. Pakistan, pakistan, paakistan)
- Duplicated rows
- Contradictory data (a logical error that cannot be true, e.g. a son's age being greater than his father's age)

In [1]:
import pandas as pd

In [53]:
df = pd.DataFrame({
    "Time" : ["25-04-2022","2022-04-25","15/6/2014","2014-6-15"],
    "Country" : ["U.S.A", "USA", "America","United States"],
    "Name" : ["John","Dave","John","Dave"],
    "sales_2014" : [5200,3000,None,1500],
    "sales_2022" : [3400,None,4000,9000]
})

In [54]:
df

Unnamed: 0,Time,Country,Name,sales_2014,sales_2022
0,25-04-2022,U.S.A,John,5200.0,3400.0
1,2022-04-25,USA,Dave,3000.0,
2,15/6/2014,America,John,,4000.0
3,2014-6-15,United States,Dave,1500.0,9000.0


### Different formats of Dates

In [55]:
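# errors='coerce' turns values that do not match the inferred format into NaT (pandas 2.x).
df["Time"] = pd.to_datetime(df["Time"], errors='coerce')
# Write the dates back as strings in day/year/month order.
df["Time"] = df["Time"].dt.strftime('%d/%Y/%m')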



In [56]:
# Fill each missing date with the last valid one above it (here, the date from row 0).
df["Time"] = df["Time"].ffill()

In [57]:
df

Unnamed: 0,Time,Country,Name,sales_2014,sales_2022
0,25/2022/04,U.S.A,John,5200.0,3400.0
1,25/2022/04,USA,Dave,3000.0,
2,25/2022/04,America,John,,4000.0
3,25/2022/04,United States,Dave,1500.0,9000.0
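
Note that in the output above every row ends up with the same date: the values that did not match the first format were coerced to missing and then forward-filled. The following is a minimal sketch of one way to keep each row's own date instead. It assumes pandas 2.0 or newer (where `format="mixed"` is available) and assumes the dates are written day-first whenever the order is ambiguous. The names `raw` and `parsed` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative copy of the four strings in the "Time" column above.
raw = pd.Series(["25-04-2022", "2022-04-25", "15/6/2014", "2014-6-15"])

# format="mixed" parses each value on its own instead of forcing one format;
# dayfirst=True reads values such as 25-04-2022 as day-month-year.
parsed = pd.to_datetime(raw, format="mixed", dayfirst=True, errors="coerce")

# Render all dates in one unambiguous year-month-day format.
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Mixed parsing guesses per value, so a string like 04-05-2022 remains ambiguous. Where the source of each format is known, parsing each group with an explicit `format` is likely the safer choice.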


### Different names
The same method also works for typographical errors.

In [58]:
country_names = {"U.S.A":"Unites States of America", "USA":"Unites States of America", "America":"Unites States of America", "United States":"Unites States of America"}

df["Country"] = df["Country"].replace(country_names)

In [59]:
df

Unnamed: 0,Time,Country,Name,sales_2014,sales_2022
0,25/2022/04,Unites States of America,John,5200.0,3400.0
1,25/2022/04,Unites States of America,Dave,3000.0,
2,25/2022/04,Unites States of America,John,,4000.0
3,25/2022/04,Unites States of America,Dave,1500.0,9000.0
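
A replacement dictionary has to list every variant by hand. For typographical errors, one possible alternative is fuzzy matching against a list of accepted spellings. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `CANONICAL` list, the `fix_spelling` helper, and the `0.8` cutoff are assumptions made for illustration and are not part of the original notebook.

```python
import difflib
import pandas as pd

# Hypothetical list of accepted spellings.
CANONICAL = ["Pakistan", "India"]

def fix_spelling(value, choices=CANONICAL, cutoff=0.8):
    """Return the closest accepted spelling, or the value unchanged if none is close."""
    # Compare in lowercase so that "pakistan" and "Pakistan" match exactly.
    lookup = {choice.lower(): choice for choice in choices}
    key = str(value).strip().lower()
    match = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[match[0]] if match else value

typos = pd.Series(["Pakistan", "pakistan", "paakistan", "Chile"])
cleaned = typos.apply(fix_spelling)
print(cleaned.tolist())
```

Values with no sufficiently close match (such as "Chile" here) are left as they are, so they can be reviewed manually rather than silently changed.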


### Duplicated names in Name column

In [32]:
# Keep the first row per Name; a new variable leaves df intact for the next section.
deduped = df.drop_duplicates(subset="Name")

In [33]:
deduped

Unnamed: 0,Time,Country,Name,sales_2014,sales_2022
0,25/2022/04,Unites States of America,John,5200.0,3400.0
1,25/2022/04,Unites States of America,Dave,3000.0,
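
Dropping duplicates on `Name` alone also discards rows whose other columns differ, so it can help to inspect the duplicated groups before removing anything. The sketch below shows `DataFrame.duplicated` and the `keep` parameter of `drop_duplicates`; the `people` frame is an illustrative copy of two columns from the example data, not part of the original notebook.

```python
import pandas as pd

# Illustrative copy of the Name and sales_2014 columns.
people = pd.DataFrame({
    "Name": ["John", "Dave", "John", "Dave"],
    "sales_2014": [5200, 3000, None, 1500],
})

# keep=False marks every member of a duplicated group, not only the repeats.
dup_mask = people.duplicated(subset="Name", keep=False)
print(people[dup_mask])

# keep="last" retains the final occurrence of each name instead of the first.
last_rows = people.drop_duplicates(subset="Name", keep="last")
print(last_rows)
```

Which occurrence to keep depends on the data; with `keep="last"` the rows at index 2 and 3 survive instead of 0 and 1.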


### Contradictory Data

Here `sales_2014` should be less than `sales_2022`; any row where it is not is treated as a logical error.

In [60]:
df

Unnamed: 0,Time,Country,Name,sales_2014,sales_2022
0,25/2022/04,Unites States of America,John,5200.0,3400.0
1,25/2022/04,Unites States of America,Dave,3000.0,
2,25/2022/04,Unites States of America,John,,4000.0
3,25/2022/04,Unites States of America,Dave,1500.0,9000.0


In [61]:
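# Drop rows where sales_2022 <= sales_2014; comparisons with NaN are False, so those rows are kept.
df = df.drop(df[df["sales_2022"] <= df["sales_2014"]].index)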

In [62]:
df

Unnamed: 0,Time,Country,Name,sales_2014,sales_2022
1,25/2022/04,Unites States of America,Dave,3000.0,
2,25/2022/04,Unites States of America,John,,4000.0
3,25/2022/04,Unites States of America,Dave,1500.0,9000.0
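
Rows with a missing value survive the drop because a comparison with NaN evaluates to False, which means the rule was never actually checked for them. A small sketch, under the same rule, that separates contradictory rows from rows that cannot be verified; the `sales` frame and the names `contradictory`, `unverifiable`, and `valid` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative copy of the sales columns from the example data.
sales = pd.DataFrame({
    "Name": ["John", "Dave", "John", "Dave"],
    "sales_2014": [5200, 3000, None, 1500],
    "sales_2022": [3400, None, 4000, 9000],
})

# True where 2022 sales fail to exceed 2014 sales (NaN comparisons are False).
contradictory = sales["sales_2022"] <= sales["sales_2014"]

# True where the rule cannot be checked because one of the values is missing.
unverifiable = sales[["sales_2014", "sales_2022"]].isna().any(axis=1)

# Keep the rows that do not break the rule; unverifiable rows are still included.
valid = sales[~contradictory]
print(contradictory.tolist())
print(unverifiable.tolist())
```

Keeping the two masks separate makes it possible to decide later whether unverifiable rows should be dropped, imputed, or reviewed by hand.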
