<img src="https://docs.actable.ai/_images/logo.png" style="object-fit: cover; max-width:100%; height:300px;" />

# AAIDataImputationTask

This notebook is an example of how to run automatic data imputation with
[Actable AI](https://actable.ai).

In this example we fill the np.nan values of a DataFrame\
by using machine learning techniques to infer them.

The dataset used here is the first 100 rows of the [apartments dataset](https://raw.githubusercontent.com/Actable-AI/public-datasets/master/apartments.csv) from the Actable AI public datasets.

### Imports

This part imports the Python modules.
The last line imports AAIDataImputationTask from actableai.

In [1]:
import pandas as pd
import numpy as np

from actableai.tasks.data_imputation import AAIDataImputationTask

### Importing the data

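This part loads the data.\
We then artificially set 5 randomly chosen cells of the DataFrame to np.nan\
so that the data imputation has values to infer.
Because each position is drawn independently, the same cell can be picked twice, so fewer than 5 values may end up missing.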

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/Actable-AI/public-datasets/master/apartments.csv").head(100)


In [3]:
# Randomly assign some values to np.nan
for i in range(5):
    df.iloc[np.random.randint(0, len(df)), np.random.randint(0, len(df.columns))] = np.nan
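
The loop above draws each position independently and without a seed, so the blanked cells change on every run. A minimal sketch of a reproducible variant follows. It assumes a small stand-in DataFrame (`demo`, with illustrative columns) instead of the downloaded data, and it samples distinct cells so that exactly the requested number of values is removed.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the columns are illustrative, not the full dataset
demo = pd.DataFrame({
    "sqft": [4848.0, 674.0, 554.0, 529.0, 1219.0],
    "location": ["great", "good", "poor", "great", "great"],
})

# Fixed seed so the same cells are blanked on every run
rng = np.random.default_rng(seed=0)

# Sample 3 distinct flat positions, then convert them to (row, column) pairs
n_cells = demo.shape[0] * demo.shape[1]
flat_positions = rng.choice(n_cells, size=3, replace=False)
rows, cols = np.unravel_index(flat_positions, demo.shape)

# Work on a copy so the original values stay available for later comparison
masked = demo.copy()
for r, c in zip(rows, cols):
    masked.iloc[r, c] = np.nan

n_missing = int(masked.isna().sum().sum())
print(n_missing)
```

Keeping the untouched `demo` frame around makes it possible to compare the inferred values with the true ones afterwards.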

### Calling Actable AI task

This part is the call to the Actable AI data imputation task.\
To learn more about the available parameters you can consult the [API Documentation](https://lib.actable.ai/actableai.tasks.html)

In [None]:
# Here df is the DataFrame containing our data, including the np.nan values
# impute_nulls=True enables imputation of the null (np.nan) values
result = AAIDataImputationTask().run(
    df=df,
    impute_nulls=True
)

### Inspecting the imputed data

In this part we rebuild a DataFrame from the records returned by the task.\
Each record stores one row under its "text" key, and the previously missing cells now contain inferred values.

In [11]:
pd.DataFrame.from_records([x["text"] for x in result['data']['records']])

| Unnamed: 0 | number_of_rooms | number_of_bathrooms | sqft | location | days_on_market | initial_price | neighborhood | rental_price |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 4848 | great | 10 | 2271 | south_side | 2271.000 |
| 1 | 1 | 1 | 674 | good | 1 | 2167 | downtown | 2167.000 |
| 2 | 1 | 1 | 554 | poor | 19 | 1883 | westbrae | 1883.000 |
| 3 | 0 | 1 | 529 | great | 3 | 2431 | south_side | 2431.000 |
| 4 | 3 | 2 | 1219 | great | 3 | 5510 | south_side | 5510.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1 | 1 | 588 | good | 14 | 1961 | downtown | 1961.000 |
| 96 | 0 | 1 | 334 | poor | 48 | 1243 | westbrae | 1173.392 |
| 97 | 2 | 1 | 736 | great | 2 | 3854 | south_side | 3854.000 |
| 98 | 3 | 2 | 1056 | poor | 54 | 4408 | westbrae | 4108.256 |
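
The conversion above relies on `result['data']['records']` being a list of dicts that each carry one row under a "text" key. The sketch below reproduces only that nesting with a hypothetical `mock_result`. The real task output may contain additional keys, and the values here are stand-ins rather than actual task results.

```python
import pandas as pd

# Hypothetical stand-in that mimics only the nesting used in the cell above
mock_result = {
    "data": {
        "records": [
            {"text": {"sqft": 674, "location": "good", "rental_price": 2167.0}},
            {"text": {"sqft": 554, "location": "poor", "rental_price": 1883.0}},
        ]
    }
}

# Each "text" dict becomes one row; the dict keys become the column names
imputed_df = pd.DataFrame.from_records(
    [record["text"] for record in mock_result["data"]["records"]]
)
print(imputed_df.shape)
```

Storing the converted frame in a variable such as `imputed_df` avoids rebuilding it for every later check.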


In [21]:
# Here we verify that the newly generated DataFrame has no np.nan values
pd.DataFrame.from_records([x["text"] for x in result['data']['records']]).isna().any().any()

False
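
Beyond checking that no np.nan remains, it can be useful to see which cells were filled and to confirm that the other cells were left alone. The following is a minimal sketch under stated assumptions: `before` and `after` are small made-up frames standing in for the DataFrame with missing values and the imputed DataFrame, and the filled values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame passed to the task (contains np.nan)
before = pd.DataFrame({
    "sqft": [674.0, np.nan, 529.0],
    "rental_price": [2167.0, 1883.0, np.nan],
})
# Stand-in for the imputed DataFrame (filled values are made up)
after = pd.DataFrame({
    "sqft": [674.0, 554.0, 529.0],
    "rental_price": [2167.0, 1883.0, 2431.0],
})

# Boolean mask of the cells that were missing before imputation
was_missing = before.isna()

# List each filled cell as (row label, column name, new value)
rows, cols = np.where(was_missing.to_numpy())
filled = [
    (before.index[r], before.columns[c], after.iloc[r, c])
    for r, c in zip(rows, cols)
]

# Cells that were present before should be identical afterwards
unchanged = before.where(~was_missing).equals(after.where(~was_missing))

print(filled)
print(unchanged)
```

Because the mask is computed from the input frame, the same pattern applies to any pair of frames that share the same index and columns.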