# Data Extraction

This is a simple library to add type constraints to pandas Series. There are several 
reasons why you might want to do this:

- You want to ensure that the data in your Series is of a certain type.
- You want to ensure that the data in your Series is clean and consistent.
- You want to ensure that the data in your Series is correct and valid.

In the context of vantage6, this is useful for:

- Users can only select valid variables (types) when creating a task.
- Data extraction jobs can be validated against the expected types.
- Users have more context about the data they are working with.

-----------

TODO 
- [ ] Units
- [ ] Automatic conversion of types

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
# TODO now all pandas objects have a metadata attribute, maybe we need to be explicit
# for example `from vandas import VDataFrame` and then `df = VDataFrame(df)`
import v_types

In [2]:
df = pd.read_csv("data.csv")
df = v_types.VDataFrame(df)

## Assigning types to variables
By default, all series have a type of `VAbstractType`, which is the most basic type and
in essence does nothing.

In [3]:
type(df)

v_types.VDataFrame

In [4]:
type(df["age"])

v_types.VSeries

In [5]:
df["age"].validate()

(True, [])

In [6]:
df["age"]

0      69
1      32
2      89
3      78
4      38
       ..
995    27
996    51
997    72
998    49
999    67
Name: age, Length: 1000, dtype: int64
VType: No VType assigned

### VAbstractType
The `VAbstractType` type is the most basic type and does nothing. It is used to 
represent any type of data. It has the following optional parameters:

- `description`: A description of the variable.
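To make "does nothing" concrete, here is a minimal sketch of a base type whose validation always succeeds. The class name `BaseType` and its `validate` signature are assumptions modelled on the `(True, [])` result shown earlier in this notebook; this is not the actual `v_types` implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BaseType:
    """Hypothetical stand-in for a do-nothing base type."""

    description: Optional[str] = None

    def validate(self, values: list) -> Tuple[bool, List[str]]:
        # The base type imposes no constraints, so any data passes.
        return True, []


age_type = BaseType(description="Age in years")
result = age_type.validate([69, 32, 89])
print(age_type)
print(result)
```

More specific types would override `validate` and append messages to the error list.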

In [7]:
v_types.VAbstractType(description="Age in years")

VAbstractType(description='Age in years')

### VNumericType (VIntType, VFloatType)
The `VIntType` and `VFloatType` types are used to represent integer and float values.
They have the following optional parameters:

- `unit`: The unit of the integer or float values.
- `min`: The minimum value of the integer or float values.
- `max`: The maximum value of the integer or float values.
- `description`: A description of the integer or float values.

The `min` and `max` parameters are used to validate the values of the series. The `unit` can be used to convert the values to a different unit, and the `description` can be used to add a description to the series.
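The range check described above can be sketched in plain Python. The class `IntRange` is hypothetical and only illustrates the idea. The "below minimum" message follows the format that `validate()` prints in this notebook, while the "above maximum" message is an assumed counterpart.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IntRange:
    """Hypothetical numeric constraint; not the real VIntType."""

    min: Optional[int] = None
    max: Optional[int] = None
    unit: Optional[str] = None

    def validate(self, values: List[int]) -> Tuple[bool, List[str]]:
        errors = []
        unit = self.unit or "-"  # "-" stands in for a missing unit
        if self.min is not None and any(v < self.min for v in values):
            errors.append(f"Values below minimum ({self.min} [{unit}])")
        if self.max is not None and any(v > self.max for v in values):
            # Assumed wording, mirroring the minimum message.
            errors.append(f"Values above maximum ({self.max} [{unit}])")
        return (not errors, errors)


ages = [69, 32, 89, 78, 38]
check = IntRange(min=90, max=120, unit="years").validate(ages)
print(check)
```

Returning all messages at once, rather than raising on the first problem, lets a caller report every violated constraint in one pass.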



In [8]:
print(df["age"])

0      69
1      32
2      89
3      78
4      38
       ..
995    27
996    51
997    72
998    49
999    67
Name: age, Length: 1000, dtype: int64
VType: No VType assigned


In [9]:
# Integer type
df["age"].v_type = v_types.VIntType(description="Age in years")
print("type:", df["age"].v_type)
print("valid:", df["age"].validate())
print(df["age"])


DEBUG - Setting type: VIntType(description='Age in years')
DEBUG - Type set to: VIntType(description='Age in years')


type: VIntType(description='Age in years')
valid: (True, [])
0      69
1      32
2      89
3      78
4      38
       ..
995    27
996    51
997    72
998    49
999    67
Name: age, Length: 1000, dtype: int64
VType: VIntType(description='Age in years')


In [10]:
# Ranges can be assigned to the type
df["age"].v_type = v_types.VIntType(min=90, max=120, description="Age in years")
df["age"].v_type

DEBUG - Setting type: VIntType(description='Age in years')
DEBUG - Type set to: VIntType(description='Age in years')


VIntType(description='Age in years', unit=None, min=90, max=120)

In [11]:
# These ranges are used to validate the values of the series
df["age"].validate()

(False, ['Values below minimum (90 [-])'])

In [12]:
df["age"].v_type = v_types.VIntType(min=90, max=120, description="Age in years", unit="years")
df["age"].validate()

DEBUG - Setting type: VIntType(description='Age in years')
DEBUG - Type set to: VIntType(description='Age in years')


(False, ['Values below minimum (90 [years])'])

In [13]:
df["average_purchase"].v_type = v_types.VFloatType(min=0, max=1000, description="Average purchase amount", unit="USD")
df["average_purchase"].validate()

DEBUG - Setting type: VFloatType(description='Average purchase amount')
DEBUG - Type set to: VFloatType(description='Average purchase amount')


(True, [])

We could consider automatically converting values to match the assigned type. For example, integers can easily be converted to floats, but the reverse is not trivial.
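A small sketch of why the two directions differ, using plain Python lists rather than the library itself. The helper names `ints_to_floats` and `floats_to_ints` are made up for this illustration, and refusing to convert fractional values is one possible policy, not necessarily the one `v_types` will adopt.

```python
from typing import List


def ints_to_floats(values: List[int]) -> List[float]:
    # Widening: integers up to 2**53 in magnitude convert to floats exactly.
    return [float(v) for v in values]


def floats_to_ints(values: List[float]) -> List[int]:
    # Narrowing: refuse to silently drop a fractional part (or NaN/inf).
    bad = [v for v in values if not v.is_integer()]
    if bad:
        raise ValueError(
            f"{len(bad)} value(s) have a fractional part, e.g. {bad[0]}"
        )
    return [int(v) for v in values]


widened = ints_to_floats([69, 32, 89])
narrowed = floats_to_ints([1.0, 2.0, 3.0])

try:
    floats_to_ints([93.58, 125.61])
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```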


In [14]:
# Assigning an integer type to a float series
df["average_purchase"].v_type = v_types.VIntType(min=0, max=1000, description="Average purchase amount", unit="USD")
df["average_purchase"].validate()

DEBUG - Setting type: VIntType(description='Average purchase amount')
DEBUG - Type set to: VIntType(description='Average purchase amount')


(False, ['Series dtype float64 does not match required dtype Int64'])

### Logical types
There are several logical types that can be used to represent boolean values.

- `VLogicalType`: Represents boolean values.
- `VStringBinaryType`: Represents boolean values as strings.
- `VIntBinaryType`: Represents boolean values as integers.
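The three encodings carry the same information. As a rough sketch (not the library's code), each string or integer encoding can be mapped onto booleans with a lookup table. The helper `to_bool` is hypothetical, and the `"Yes"`/`"No"` pair is assumed as the default string pair.

```python
from typing import Dict, Hashable, List, Optional


def to_bool(
    values: List[Hashable], mapping: Dict[Hashable, bool]
) -> List[Optional[bool]]:
    # Unknown encodings map to None so they can be reported, not guessed.
    return [mapping.get(v) for v in values]


STRING_PAIR = {"Yes": True, "No": False}  # string encoding (assumed pair)
INT_PAIR = {1: True, 0: False}            # integer encoding

consent = to_bool(["Yes", "No", "Yes"], STRING_PAIR)
subscription = to_bool([1, 0, 0], INT_PAIR)
unknown = to_bool(["Yes", "maybe"], STRING_PAIR)
print(consent, subscription, unknown)
```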



In [15]:
# Logical type encoded as True/False
df["is_active"].v_type = v_types.VLogicalType()
df["is_active"].validate()

DEBUG - Setting type: VLogicalType()
DEBUG - Type set to: VLogicalType()


(True, [])

In [16]:
print("Related to VAbstractType:", issubclass(v_types.VLogicalType, v_types.VAbstractType))
print("Related to VBinaryType:", issubclass(v_types.VLogicalType, v_types.VBinaryType))

Related to VAbstractType: True
Related to VBinaryType: True


In [17]:
print("`is_active` is instance of VAbstractType:", isinstance(df["is_active"].v_type, v_types.VAbstractType))
print("`is_active` is instance of VBinaryType:", isinstance(df["is_active"].v_type, v_types.VBinaryType))
print("`is_active` is instance of VLogicalType:", isinstance(df["is_active"].v_type, v_types.VLogicalType))
print("`is_active` is instance of VStringBinaryType:", isinstance(df["is_active"].v_type, v_types.VStringBinaryType))

`is_active` is instance of VAbstractType: True
`is_active` is instance of VBinaryType: True
`is_active` is instance of VLogicalType: True
`is_active` is instance of VStringBinaryType: False


In [18]:
# Logical type encoded as Yes/No or any other string pair
df["is_active"].v_type = v_types.VStringBinaryType()
df["is_active"].validate()

DEBUG - Setting type: VStringBinaryType()
DEBUG - Type set to: VStringBinaryType()


(False, ['Series dtype bool does not match required dtype string'])

In [19]:
# Logical type encoded as 0/1
df["has_subscription"].v_type = v_types.VIntBinaryType()
df["has_subscription"].validate()

DEBUG - Setting type: VIntBinaryType()
DEBUG - Type set to: VIntBinaryType()


(True, [])

In [20]:
# Logical type encoded as Yes/No or any other string pair
df["marketing_consent"].v_type = v_types.VStringBinaryType()
df["marketing_consent"].validate()

DEBUG - Setting type: VStringBinaryType()
DEBUG - Type set to: VStringBinaryType()


(True, [])

### VCategoricalType (VOrdinalType)
The `VCategoricalType` and `VOrdinalType` types are used to represent categorical data.

- `VCategoricalType`: Represents categorical data.
- `VOrdinalType`: Represents ordinal categorical data.
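The difference between the two can be illustrated with plain pandas, independent of `v_types`. This sketch uses `pd.CategoricalDtype` directly and a made-up five-row series. It shows that values outside the declared categories become `NaN`, and that only an ordered categorical supports comparisons.

```python
import pandas as pd

levels = pd.Series(["Bronze", "Platinum", "Silver", "Gold", "Bronze"])

# Nominal: values outside the declared categories become NaN.
nominal = levels.astype(
    pd.CategoricalDtype(categories=["Platinum", "Silver", "Gold"])
)
n_missing = int(nominal.isna().sum())

# Ordinal: the order of the categories defines how values compare.
ordinal = levels.astype(
    pd.CategoricalDtype(
        categories=["Bronze", "Silver", "Gold", "Platinum"], ordered=True
    )
)
at_least_gold = (ordinal >= "Gold").tolist()

print(n_missing)
print(at_least_gold)
```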



In [21]:
# Auto conversion from string to categorical happens here, when the apply method is called
print(df["membership_level"].dtype)
df["membership_level"].v_type = v_types.VCategoricalType(categories=['Bronze', 'Platinum', 'Silver', 'Gold'], description="Membership level")
print(df["membership_level"].dtype)
df["membership_level"].apply()
print(df["membership_level"].dtype)
df["membership_level"].validate()


DEBUG - Setting type: VCategoricalType(description='Membership level')
DEBUG - Type set to: VCategoricalType(description='Membership level')
DEBUG - Setting categories: ['Bronze', 'Platinum', 'Silver', 'Gold']
DEBUG - Series categories: ['Bronze' 'Platinum' 'Silver' 'Gold']
DEBUG - The data has 0 categories that are not in the series definition
DEBUG - Converting to categorical


object
object
category


(True, [])

In [22]:
# After conversion the data still passes validation, but all values whose category is not
# defined are converted to NaN
df["membership_level"].v_type = v_types.VCategoricalType(categories=['Platinum', 'Silver', 'Gold'], description="Membership level")
df["membership_level"].apply()
print("validation:", df["membership_level"].validate())
print("levels:", df["membership_level"].unique())


DEBUG - Setting type: VCategoricalType(description='Membership level')
DEBUG - Type set to: VCategoricalType(description='Membership level')
DEBUG - Setting categories: ['Platinum', 'Silver', 'Gold']
DEBUG - Series categories: ['Bronze', 'Platinum', 'Silver', 'Gold']
Categories (4, object): ['Bronze', 'Platinum', 'Silver', 'Gold']
DEBUG - The data has 1 categories that are not in the series definition
DEBUG - Converting to categorical


validation: (True, [])
levels: [NaN, 'Platinum', 'Silver', 'Gold']
Categories (3, object): ['Platinum', 'Silver', 'Gold']


In [23]:
# In case you do not provide categories, any string is allowed.
df["preferred_color"].v_type = v_types.VCategoricalType(
    description="Membership level"
)
print("validation:", df["preferred_color"].validate())

df["preferred_color"].apply()
print("validation:", df["preferred_color"].validate())

DEBUG - Setting type: VCategoricalType(description='Membership level')
DEBUG - Type set to: VCategoricalType(description='Membership level')
DEBUG - Converting to categorical


validation: (False, ['Series dtype object does not match required dtype category'])
validation: (True, [])


In [24]:
# We need to be careful with the categories, as they need to be present in the data. Otherwise,
# the result is a column with all NaN values.
df["preferred_color"].v_type = v_types.VCategoricalType(description="Membership level", categories=['Yellow'])
df["preferred_color"].apply()
print("validation:", df["preferred_color"].validate())

DEBUG - Setting type: VCategoricalType(description='Membership level')
DEBUG - Type set to: VCategoricalType(description='Membership level')
DEBUG - Setting categories: ['Yellow']
DEBUG - Series categories: ['Green', 'Blue', 'Red', 'Black', 'Yellow']
Categories (5, object): ['Black', 'Blue', 'Green', 'Red', 'Yellow']
DEBUG - The data has 4 categories that are not in the series definition
DEBUG - Converting to categorical


validation: (True, [])


In [25]:
df["preferred_color"].unique()

[NaN, 'Yellow']
Categories (1, object): ['Yellow']

In [26]:
df["membership_level"].v_type = v_types.VOrdinalType(categories=["Bronze", "Silver", "Gold", "Platinum"], description="Membership level")
df["membership_level"].validate()

DEBUG - Setting type: VOrdinalType(description='Membership level')
DEBUG - Type set to: VOrdinalType(description='Membership level')


(True, [])

In [27]:
df["membership_level"].dtype

CategoricalDtype(categories=['Platinum', 'Silver', 'Gold'], ordered=False, categories_dtype=object)

In [28]:
df["membership_level"].apply()

DEBUG - Setting categories: ['Bronze', 'Silver', 'Gold', 'Platinum']
DEBUG - Series categories: ['Platinum', 'Silver', 'Gold']
Categories (3, object): ['Platinum', 'Silver', 'Gold']
DEBUG - The data has 0 categories that are not in the series definition
DEBUG - Converting to categorical


0           NaN
1      Platinum
2           NaN
3        Silver
4          Gold
         ...   
995         NaN
996      Silver
997         NaN
998      Silver
999      Silver
Length: 1000, dtype: category
Categories (4, object): ['Bronze' < 'Silver' < 'Gold' < 'Platinum']
VType: VOrdinalType(description='Membership level', categories=['Bronze', 'Silver', 'Gold', 'Platinum'], ordered=True)

In [29]:
df["membership_level"].dtype

CategoricalDtype(categories=['Bronze', 'Silver', 'Gold', 'Platinum'], ordered=True, categories_dtype=object)

### Timestamp and Duration types
The `VTimestampType` and `VDurationType` types are used to represent timestamp and duration data.

- `VTimestampType`: Represents timestamp data.
- `VDurationType`: Represents duration data.
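As a plain-pandas sketch of what a timestamp constraint involves, the snippet below parses strings into datetimes and flags values outside a date range. It uses two sample values and the same bounds as this notebook. How `v_types` performs the check internally is not shown in the source, so this is only an approximation of the idea.

```python
import pandas as pd

raw = pd.Series(["2024-05-10 09:44:26.081044", "2025-03-23 09:44:26.081059"])

# Strings are stored with dtype object until they are parsed.
dates = pd.to_datetime(raw)

min_date = pd.Timestamp("2020-01-01")
max_date = pd.Timestamp("2025-01-01")

# True where a value falls outside the allowed range.
out_of_range = ((dates < min_date) | (dates > max_date)).tolist()
print(out_of_range)
```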



In [30]:
df.head()

Unnamed: 0,customer_id,age,average_purchase,loyalty_score,preferred_color,membership_level,is_active,has_subscription,marketing_consent,last_feedback,registration_date,membership_duration_days,tags
0,1,69,93.58,1.9,Green,Bronze,True,1,Yes,Customer feedback 0: Need improvement,2024-05-10 09:44:26.081044,504.3,"{'interests': ['Technology', 'Food'], 'source'..."
1,2,32,125.61,6.1,Blue,Platinum,False,0,No,Customer feedback 1: Great service!,2025-03-23 09:44:26.081059,865.6,"{'interests': ['Technology', 'Fashion'], 'sour..."
2,3,89,59.44,4.4,Red,Bronze,True,0,No,Customer feedback 2: Great service!,2025-02-28 09:44:26.081063,241.3,"{'interests': ['Food', 'Sports'], 'source': np..."
3,4,78,108.52,5.8,Black,Silver,True,0,Yes,Customer feedback 3: Excellent products,2024-01-13 09:44:26.081065,78.5,"{'interests': ['Fashion', 'Fashion'], 'source'..."
4,5,38,100.64,7.4,Yellow,Gold,True,0,Yes,Customer feedback 4: Excellent products,2024-01-08 09:44:26.081068,356.4,"{'interests': ['Sports', 'Food'], 'source': np..."


In [31]:
df["registration_date"].v_type = v_types.VTimestampType(min_date=pd.Timestamp("2020-01-01"), max_date=pd.Timestamp("2025-01-01"), description="Registration date")
print("dtype:", df["registration_date"].dtype)
print("validation:", df["registration_date"].validate())
df["registration_date"].apply()
print("dtype:", df["registration_date"].dtype)
print("validation:", df["registration_date"].validate())

DEBUG - Setting type: VTimestampType(description='Registration date')
DEBUG - Type set to: VTimestampType(description='Registration date')


dtype: object
validation: (False, ['Series dtype object does not match required dtype datetime64[ns, UTC]'])
dtype: datetime64[ns]
validation: (True, [])


In [32]:
df["registration_date"]

0     2024-05-10 09:44:26.081044
1     2025-03-23 09:44:26.081059
2     2025-02-28 09:44:26.081063
3     2024-01-13 09:44:26.081065
4     2024-01-08 09:44:26.081068
                 ...            
995   2025-02-22 09:44:26.083285
996   2024-10-19 09:44:26.083287
997   2022-09-25 09:44:26.083289
998   2023-07-02 09:44:26.083291
999   2025-03-12 09:44:26.083294
Name: registration_date, Length: 1000, dtype: datetime64[ns]
VType: VTimestampType(description='Registration date', min_date='2020-01-01 00:00:00', max_date='2025-01-01 00:00:00')

In [33]:
df["registration_date"] > pd.Timestamp("2025-01-01")

0      False
1       True
2       True
3      False
4      False
       ...  
995     True
996    False
997    False
998    False
999     True
Name: registration_date, Length: 1000, dtype: bool
VType: VTimestampType(description='Registration date', min_date='2020-01-01 00:00:00', max_date='2025-01-01 00:00:00')

In [34]:
df["membership_duration_days"].v_type = v_types.VDurationType(unit="days", description="Membership duration")
print("validation:", df["membership_duration_days"].validate())
print("dtype:", df["membership_duration_days"].dtype)
df["membership_duration_days"].apply()
print("validation:", df["membership_duration_days"].validate())
print("dtype:", df["membership_duration_days"].dtype)
df["membership_duration_days"].v_type = v_types.VDurationType(unit="days", description="Membership duration")
print("validation:", df["membership_duration_days"].validate())

DEBUG - Setting type: VDurationType(description='Membership duration')
DEBUG - Type set to: VDurationType(description='Membership duration')
DEBUG - Setting type: VDurationType(description='Membership duration')
DEBUG - Type set to: VDurationType(description='Membership duration')


validation: (True, [])
dtype: float64
validation: (True, [])
dtype: Float64
validation: (True, [])


In [35]:
df["membership_duration_days"]

0      504.3
1      865.6
2      241.3
3       78.5
4      356.4
       ...  
995     39.7
996    839.4
997    448.2
998    636.7
999    221.2
Length: 1000, dtype: Float64
VType: VDurationType(description='Membership duration', unit='days')

In [36]:
df

Unnamed: 0,customer_id,age,average_purchase,loyalty_score,preferred_color,membership_level,is_active,has_subscription,marketing_consent,last_feedback,registration_date,membership_duration_days,tags
0,1,69,93.58,1.9,Green,Bronze,True,1,Yes,Customer feedback 0: Need improvement,2024-05-10 09:44:26.081044,504.3,"{'interests': ['Technology', 'Food'], 'source'..."
1,2,32,125.61,6.1,Blue,Platinum,False,0,No,Customer feedback 1: Great service!,2025-03-23 09:44:26.081059,865.6,"{'interests': ['Technology', 'Fashion'], 'sour..."
2,3,89,59.44,4.4,Red,Bronze,True,0,No,Customer feedback 2: Great service!,2025-02-28 09:44:26.081063,241.3,"{'interests': ['Food', 'Sports'], 'source': np..."
3,4,78,108.52,5.8,Black,Silver,True,0,Yes,Customer feedback 3: Excellent products,2024-01-13 09:44:26.081065,78.5,"{'interests': ['Fashion', 'Fashion'], 'source'..."
4,5,38,100.64,7.4,Yellow,Gold,True,0,Yes,Customer feedback 4: Excellent products,2024-01-08 09:44:26.081068,356.4,"{'interests': ['Sports', 'Food'], 'source': np..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,27,104.04,5.0,Blue,Bronze,True,1,Yes,Customer feedback 995: Great service!,2025-02-22 09:44:26.083285,39.7,"{'interests': ['Technology', 'Technology'], 's..."
996,997,51,85.04,5.4,Black,Silver,False,0,Yes,Customer feedback 996: Excellent products,2024-10-19 09:44:26.083287,839.4,"{'interests': ['Sports', 'Technology'], 'sourc..."
997,998,72,141.03,1.4,Black,Bronze,False,0,Yes,Customer feedback 997: Could be better,2022-09-25 09:44:26.083289,448.2,"{'interests': ['Food', 'Food'], 'source': np.s..."
998,999,49,106.23,7.8,Green,Silver,True,0,Yes,Customer feedback 998: Need improvement,2023-07-02 09:44:26.083291,636.7,"{'interests': ['Food', 'Food'], 'source': np.s..."


In [39]:
print(df.type_info())

                                                                      Type
Column                                                                    
customer_id                                                           None
age                                   VIntType(description='Age in years')
average_purchase           VIntType(description='Average purchase amount')
loyalty_score                                                         None
preferred_color           VCategoricalType(description='Membership level')
membership_level              VOrdinalType(description='Membership level')
is_active                                              VStringBinaryType()
has_subscription                                          VIntBinaryType()
marketing_consent                                      VStringBinaryType()
last_feedback                                                         None
registration_date          VTimestampType(description='Registration date')
membership_duration_days 

In [38]:
df.to_parquet("data_with_vtypes.parquet")
df_from_parquet = v_types.VDataFrame.read_parquet("data_with_vtypes.parquet")
print(df_from_parquet.type_info())


DEBUG - Setting type: VIntType(description='Age in years')
DEBUG - Type set to: VIntType(description='Age in years')
DEBUG - Setting type: VIntType(description='Average purchase amount')
DEBUG - Type set to: VIntType(description='Average purchase amount')
DEBUG - Setting type: VCategoricalType(description='Membership level')
DEBUG - Type set to: VCategoricalType(description='Membership level')
DEBUG - Setting type: VOrdinalType(description='Membership level')
DEBUG - Type set to: VOrdinalType(description='Membership level')
DEBUG - Setting type: VStringBinaryType()
DEBUG - Type set to: VStringBinaryType()
DEBUG - Setting type: VIntBinaryType()
DEBUG - Type set to: VIntBinaryType()
DEBUG - Setting type: VStringBinaryType()
DEBUG - Type set to: VStringBinaryType()
DEBUG - Setting type: VTimestampType(description='Registration date')
DEBUG - Type set to: VTimestampType(description='Registration date')
DEBUG - Setting type: VDurationType(description='Membership duration')
DEBUG - Type set 

b'{"age": {"type": "VIntType", "description": "Age in years", "unit": "years", "min": 90, "max": 120}, "average_purchase": {"type": "VIntType", "description": "Average purchase amount", "unit": "USD", "min": 0, "max": 1000}, "preferred_color": {"type": "VCategoricalType", "description": "Membership level", "categories": ["Yellow"], "ordered": false}, "membership_level": {"type": "VOrdinalType", "description": "Membership level", "categories": ["Bronze", "Silver", "Gold", "Platinum"], "ordered": true}, "is_active": {"type": "VStringBinaryType", "description": null, "true_value": "Yes", "false_value": "No"}, "has_subscription": {"type": "VIntBinaryType", "description": null}, "marketing_consent": {"type": "VStringBinaryType", "description": null, "true_value": "Yes", "false_value": "No"}, "registration_date": {"type": "VTimestampType", "description": "Registration date", "min_date": "2020-01-01 00:00:00", "max_date": "2025-01-01 00:00:00", "tz": "UTC", "format": null}, "membership_durati

Column,Type
customer_id,
age,VIntType(description='Age in years')
average_purchase,VIntType(description='Average purchase amount')
loyalty_score,
preferred_color,VCategoricalType(description='Membership level')
membership_level,VOrdinalType(description='Membership level')
is_active,VStringBinaryType()
has_subscription,VIntBinaryType()
marketing_consent,VStringBinaryType()
last_feedback,
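
The byte string printed above suggests that the type definitions are stored as JSON alongside the data. A minimal sketch of such a round trip, using only the standard `json` module and two entries copied from that output, could look like this. How `v_types` actually attaches the bytes to the Parquet file is not shown here and is left out of the sketch.

```python
import json

type_definitions = {
    "age": {
        "type": "VIntType",
        "description": "Age in years",
        "unit": "years",
        "min": 90,
        "max": 120,
    },
    "has_subscription": {"type": "VIntBinaryType", "description": None},
}

# Serialise to bytes, a form suitable for file-level metadata.
payload = json.dumps(type_definitions).encode("utf-8")

# Restore the definitions after reading the bytes back.
restored = json.loads(payload.decode("utf-8"))
print(restored["age"]["type"])
```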
