# Classifying values as anomalous

Sometimes ptype infers the right type for a column but treats as legitimate some values that we know to be anomalies. We can remedy this by extending the set of values that ptype treats as anomalous and then rerunning the analysis. We illustrate this with a toy example.

In [1]:
# Preamble to run notebook in context of source package.
# NBVAL_IGNORE_OUTPUT
import sys
sys.path.insert(0, '../')

### Toy Example

In [2]:
import pandas as pd

x = ['Jack', 'Joe', 'James', 'error']
column = 'name'

df = pd.DataFrame(x, dtype='str', columns=[column])
df

Unnamed: 0,name
0,Jack
1,Joe
2,James
3,error


In [3]:
from ptype.Ptype import Ptype

ptype = Ptype()
schema = ptype.schema_fit(df)
schema.show()

Unnamed: 0,name
type,string
normal values,"[Jack, James, Joe, error]"
missing values,[]
anomalous values,[]


In [4]:
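# Start from the additional values that ptype already treats as anomalous.
an_values = ptype.get_additional_an_values()

# Add our known anomaly and register the extended list with ptype.
an_values.extend(["error"])
ptype.set_an_values(an_values)

# Rerun the analysis so that the new anomaly list takes effect.
schema = ptype.schema_fit(df)
schema.show()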

Unnamed: 0,name
type,string
normal values,"[Jack, James, Joe]"
missing values,[]
anomalous values,[error]
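
As a rough mental model of the change in output above, the following sketch partitions the toy column using a user-supplied list of anomaly tokens. It is not how ptype works internally (ptype infers types with a probabilistic model); the helper `split_by_anomaly_list` is a hypothetical name introduced only for this illustration.

```python
def split_by_anomaly_list(values, anomaly_tokens):
    """Partition unique values into (normal, anomalous) by exact token match.

    Hypothetical helper for illustration; not part of ptype.
    """
    tokens = set(anomaly_tokens)
    normal = sorted({v for v in values if v not in tokens})
    anomalous = sorted({v for v in values if v in tokens})
    return normal, anomalous


values = ["Jack", "Joe", "James", "error"]

# With an empty anomaly list, every value counts as normal.
normal_before, anomalous_before = split_by_anomaly_list(values, [])

# After adding "error" to the list, it moves to the anomalous group.
normal_after, anomalous_after = split_by_anomaly_list(values, ["error"])

print(normal_after, anomalous_after)
```

The sketch only mirrors the before and after summaries shown in the two schema tables; the real classification is decided by ptype when `schema_fit` is rerun.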


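# Classifying "anomalous" values as normal

Conversely, ptype sometimes flags as anomalous values that we know to be legitimate. In the real-world example below, the entries "Alzheimer's disease" and "Parkinson's disease" in the CAUSE_NAME column are reported as anomalies. Adding the apostrophe to the alphabet that ptype accepts for strings, and then rerunning the analysis, classifies them as normal.

### Real-world Data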

In [5]:
import pandas as pd
from ptype.Ptype import Ptype

df = pd.read_csv("../data/gov_323_1.csv", encoding="ISO-8859-1", dtype=str, keep_default_na=False)
df.head()

Unnamed: 0,YEAR,113_CAUSE_NAME,CAUSE_NAME,STATE,DEATHS,AADR
0,1999,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional Injuries,Alabama,2313,52.17
1,1999,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional Injuries,Alaska,294,55.91
2,1999,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional Injuries,Arizona,2214,44.79
3,1999,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional Injuries,Arkansas,1287,47.56
4,1999,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional Injuries,California,9198,28.71


In [6]:
ptype = Ptype()
schema = ptype.schema_fit(df)
schema.show()

Unnamed: 0,YEAR,113_CAUSE_NAME,CAUSE_NAME,STATE,DEATHS,AADR
type,date-iso-8601,string,string,string,integer,float
normal values,"[1999, 2000, 2001, 2002, 2003, 2004, 2005, 200...",[All Causes],"[All Causes, CLRD, Cancer, Chronic liver disea...","[Alabama, Alaska, Arizona, Arkansas, Californi...","[10, 100, 1000, 100056, 10007, 1001, 10012, 10...","[1.29, 1.38, 1.49, 1.53, 1.54, 1.57, 1.58, 1.5..."
missing values,[],[],[],[],[],[*]
anomalous values,[],"[Accidents (unintentional injuries) (V01-X59,Y...","[Alzheimer's disease, Parkinson's disease]",[],[x],[x]
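
Both CAUSE_NAME values flagged above contain an apostrophe, which points to the set of characters accepted for strings. The sketch below lists the characters of a value that fall outside a given alphabet. The alphabet used here (ASCII letters, digits and the space) is an assumption made for illustration and may differ from ptype's actual default; `chars_outside` is a hypothetical helper, not a ptype function.

```python
import string

# Assumed alphabet for illustration only; ptype's real default may differ.
assumed_alphabet = set(string.ascii_letters + string.digits + " ")


def chars_outside(value, alphabet):
    """Return the sorted characters of `value` that are not in `alphabet`."""
    return sorted(set(value) - set(alphabet))


values = ["All Causes", "Alzheimer's disease", "Parkinson's disease"]

# Under the assumed alphabet, only the apostrophe is out of range.
before = {v: chars_outside(v, assumed_alphabet) for v in values}

# Extending the alphabet with the apostrophe leaves nothing out of range.
extended_alphabet = assumed_alphabet | {"'"}
after = {v: chars_outside(v, extended_alphabet) for v in values}

print(before)
print(after)
```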


In [7]:
# Start from the characters that ptype currently accepts in string values.
str_alphabet = ptype.get_string_alphabet()

# Allow apostrophes, so that names such as "Alzheimer's disease" are accepted.
str_alphabet.extend(["'"])
ptype.set_string_alphabet(str_alphabet)

# Rerun the analysis with the extended alphabet.
schema = ptype.schema_fit(df)
schema.show()
# TODO: consider making this list column-specific rather than global, similar to
# how pandas.read_csv accepts per-column missing values, e.g.
# keep_default_na=False, na_values={'species': ['']}

Unnamed: 0,YEAR,113_CAUSE_NAME,CAUSE_NAME,STATE,DEATHS,AADR
type,date-iso-8601,string,string,string,integer,float
normal values,"[1999, 2000, 2001, 2002, 2003, 2004, 2005, 200...",[All Causes],"[All Causes, Alzheimer's disease, CLRD, Cancer...","[Alabama, Alaska, Arizona, Arkansas, Californi...","[10, 100, 1000, 100056, 10007, 1001, 10012, 10...","[1.29, 1.38, 1.49, 1.53, 1.54, 1.57, 1.58, 1.5..."
missing values,[],[],[],[],[],[*]
anomalous values,[],"[Accidents (unintentional injuries) (V01-X59,Y...",[],[],[x],[x]
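
The to-do note in the last cell refers to the per-column form of `na_values` in `pandas.read_csv`. For reference, here is a small sketch of that pandas behaviour on made-up data (the `species` and `habitat` columns and their contents are invented for this example). It shows only the pandas convention the note alludes to, not an existing ptype feature.

```python
import io

import pandas as pd

# Made-up CSV: the second row is empty in both columns,
# the third row is empty in "habitat" only.
csv_text = "species,habitat\nfox,forest\n,\nowl,\n"

df_demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype=str,
    keep_default_na=False,        # disable pandas' built-in missing-value markers
    na_values={"species": [""]},  # treat '' as missing in "species" only
)

# Only the column named in the dict has its empty string read as missing.
species_missing = df_demo["species"].isna().tolist()
habitat_missing = df_demo["habitat"].isna().tolist()

print(species_missing, habitat_missing)
```

A column-specific alphabet or anomaly list in ptype could follow the same dictionary-keyed-by-column shape, but that design is only what the to-do proposes, not something the library is shown to support here.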
