## Delete Rows with Missing Values:

In [None]:
print(data.isnull().sum(0))
print(data.shape)

In [None]:
# Drop every row that contains at least one missing value, using pandas' dropna
data.dropna(inplace=True)
print(data.isnull().sum(0))
print(data.shape)

## Impute missing values with Mean/Median:

In [None]:
# Create an imputer object that fills missing values with the column mean
import pandas as pd
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy='mean')

# Apply the imputer to the DataFrame; mean imputation requires every column of df to be numeric
df_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Mean Imputation:")
print(df_mean_imputed)
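
The heading also mentions the median, which the cell above does not show. A minimal sketch of the median variant follows, run on a small made-up DataFrame; the column names and values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 80.0],
    "Fare": [7.25, np.nan, 7.92, 53.10, 8.05],
})

# The median is less sensitive to outliers (such as Age = 80) than the mean
median_imputer = SimpleImputer(strategy="median")
toy_median_imputed = pd.DataFrame(
    median_imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_median_imputed)
```

Median imputation is usually the safer default when a column is skewed, because a few extreme values do not pull the fill value toward them.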

## Imputation method for categorical columns:

In [None]:
# Impute missing values with a constant value (e.g., 'Unknown')
constant_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')

# Apply the imputer to the DataFrame; df should hold only categorical columns here,
# since 'Unknown' is not a meaningful fill value for numeric data
df_constant_imputed = pd.DataFrame(constant_imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Imputation with Constant Value 'Unknown':")
print(df_constant_imputed)
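
Another common option for categorical columns is to fill gaps with the most frequent category instead of a placeholder. The sketch below assumes a single made-up `Embarked` column; the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column (hypothetical values) with two missing entries
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# strategy="most_frequent" works on string data as well as numeric data
mode_imputer = SimpleImputer(strategy="most_frequent")
toy_mode_imputed = pd.DataFrame(
    mode_imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_mode_imputed["Embarked"].tolist())
```

Unlike the constant fill, this keeps the set of categories unchanged, at the cost of inflating the share of the dominant category.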


## Last observation carried forward (LOCF) method

In [None]:
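# Forward-fill: each missing Age takes the last observed value above it.
# fillna(method='ffill') is deprecated in recent pandas releases; ffill() is the direct replacement.
data["Age"] = data["Age"].ffill()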
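
LOCF only makes sense when the rows have a meaningful order, such as time. A small sketch on a made-up series (the values are hypothetical) shows how gaps are filled and what happens when the first value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical time-ordered readings with gaps
readings = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan], name="value")

# Each gap takes the most recent non-missing value before it
locf = readings.ffill()
print(locf.tolist())

# A leading NaN has nothing to carry forward, so it stays missing
leading_gap = pd.Series([np.nan, 1.0, np.nan]).ffill()
print(leading_gap.tolist())
```

If leading gaps matter, they need a second strategy, for example a backward fill or one of the imputers shown earlier.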

## Using Algorithms that support missing values

Not every machine learning algorithm supports missing values, but some are robust to them. In principle, k-NN can leave a column out of the distance measure when its value is missing, and naive Bayes can ignore a missing feature when making a prediction. Such algorithms can be used when the dataset contains null or missing values.

However, the scikit-learn implementations of naive Bayes and k-nearest neighbors do not accept missing values.

Another algorithm that can be used here is random forest, which works well on non-linear and categorical data. It adapts to the structure of the data while balancing variance and bias, and tends to produce better results on large datasets.

**Pros:**

- No need to handle missing values column by column, since the algorithm deals with them itself.

**Cons:**

- The scikit-learn implementations of these algorithms (k-NN and naive Bayes) do not support missing values.

## Prediction of missing values

The earlier methods ignore the correlation between the variable containing missing values and the other variables. Features that have no nulls can instead be used to predict the missing values.

A regression or classification model can be used for the prediction, depending on whether the feature with missing values is continuous or categorical.

In [1]:
from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv("train.csv")
data = data[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]

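# Encode Sex as a number: 1 for male, 0 otherwise
data["Sex"] = [1 if x == "male" else 0 for x in data["Sex"]]

# Rows with a missing Age are the ones to predict; the complete rows form the training set
test_data = data[data["Age"].isnull()]
data.dropna(inplace=True)

y_train = data["Age"]
X_train = data.drop("Age", axis=1)
X_test = test_data.drop("Age", axis=1)

# Fit a linear regression on the complete rows
model = LinearRegression()
model.fit(X_train, y_train)

# Predicted ages for the rows where Age was missing
y_pred = model.predict(X_test)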
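
The predictions still have to be written back into the rows where the value was missing. One way to do that is sketched below on a small made-up frame that stands in for the Titanic columns; the values and the choice of `Pclass` and `Fare` as predictors are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the Titanic columns (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 26.0],
    "Age": [38.0, 26.0, np.nan, 35.0, np.nan, 27.0],
})

missing = toy["Age"].isnull()
features = ["Pclass", "Fare"]

# Train only on rows where Age is known
model = LinearRegression()
model.fit(toy.loc[~missing, features], toy.loc[~missing, "Age"])

# Write predictions back only into the rows where Age was missing
toy_imputed = toy.copy()
toy_imputed.loc[missing, "Age"] = model.predict(toy.loc[missing, features])

print(toy_imputed)
```

Using a boolean mask keeps the observed ages untouched and aligns each prediction with its original row.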



## Imputation using Deep Learning Library — Datawig

This method works well with categorical, continuous, and non-numerical features. Datawig is a library that trains deep neural networks to impute missing values in a DataFrame.

Install the datawig library:

`pip3 install datawig`

Datawig can take a data frame and fit an imputation model for each column with missing values, with all other columns as inputs.

In [2]:
import datawig
import pandas as pd

data = pd.read_csv("train.csv")

df_train, df_test = datawig.utils.random_split(data)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['Pclass','SibSp','Parch'], # column(s) containing information about the column we want to impute
    output_column= 'Age', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
