### Goal of the script:
Feature Engineering: Feature Handling (handling missing data)

### Three approaches were covered in the lecture:
1. Deleting the features with missing values
2. Imputation of the missing values

    a. Imputing all missing values in a feature with a fixed value - Univariate Feature Imputation

    b. Imputing all missing values in a feature with different values based on other features in the dataset - Multivariate Feature Imputation
    
3. Extended Imputation

References:
1. Jäger, Sebastian, Arndt Allhorn, and Felix Bießmann. "A benchmark for data imputation methods." Frontiers in big Data (2021): 48.
2. https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/

### Missing Mechanisms - [Reference Link](http://www.stat.columbia.edu/~gelman/arm/)

1. Missing Completely at Random (MCAR): The probability that a value is missing depends neither on the observed values nor on the missing value itself, i.e., missingness is unrelated to the value that would have been observed and to the values of other features.

2. Missing at Random (MAR): The probability that a feature value is missing depends only on available (observed) information. 
"When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA’s), as long as the regression controls for all the variables that affect the probability of missingness."
Example: In the case of a temperature sensor, whether a value is missing does not depend on the temperature, but might depend on some other observed factor, for example the battery charge of the thermometer. - [Reference Link](https://www.kdnuggets.com/2020/09/missing-value-imputation-review.html)

3. Not Missing at Random (NMAR, also written MNAR): The probability that a feature value is missing depends on the value of the feature itself.
Example: In the case of a temperature sensor, the sensor doesn’t work properly when it is colder than 5°C. - [Reference Link](https://www.kdnuggets.com/2020/09/missing-value-imputation-review.html)
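
The three mechanisms can be simulated on synthetic data. The sketch below is an illustration only: the `sensor` frame, its column names, the 20% loss rate, and the thresholds are assumptions chosen to mirror the temperature-sensor examples above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical sensor readings: temperature in °C and battery charge in %
sensor = pd.DataFrame({
    "temperature": rng.normal(loc=10, scale=8, size=n),
    "battery": rng.uniform(0, 100, size=n),
})

# MCAR: every reading has the same 20% chance of being lost
mcar = sensor["temperature"].mask(rng.random(n) < 0.2)

# MAR: readings are lost only when the (observed) battery charge is low
mar = sensor["temperature"].mask(sensor["battery"] < 20)

# NMAR: readings are lost when the temperature itself is below 5 °C
nmar = sensor["temperature"].mask(sensor["temperature"] < 5)

for name, col in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{name}: {col.isnull().mean():.1%} missing, "
          f"mean of observed values = {col.mean():.2f}")
```

In this simulation the battery charge is independent of the temperature, so the observed mean stays close to the true mean under MCAR and MAR, while under NMAR the observed mean is biased upwards because only the cold readings are lost.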

[Reference Link](https://medium.com/analytics-vidhya/different-types-of-missing-data-59c87c046bf7)

![alternative Text](https://miro.medium.com/max/1400/1*cPNnAnoOYArYyTDPNjJg3A.gif)

[Reference Link](https://medium.com/analytics-vidhya/ways-to-impute-missing-values-in-the-data-fc38e7d7e2c1)

![alternative Text](https://miro.medium.com/max/1400/1*vhwpR-qisCWFdpmAugxcIA.jpeg)

### Environment Setup for the Melbourne Housing Price Prediction

In [46]:
# Importing the libraries
import pandas as pd

# Libraries to train the model
from sklearn import tree
from sklearn.model_selection import train_test_split

# Library for Evaluating the model
from sklearn.metrics import mean_absolute_error

In [47]:
# Importing the dataset: Dataset is the Melbourne House Pricing Dataset
# It is a Regression problem where the goal is to predict the house price based on features of the house
data = pd.read_csv('melbourne_data.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,5,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [48]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18396 entries, 0 to 18395
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     18396 non-null  int64  
 1   Suburb         18396 non-null  object 
 2   Address        18396 non-null  object 
 3   Rooms          18396 non-null  int64  
 4   Type           18396 non-null  object 
 5   Price          18396 non-null  float64
 6   Method         18396 non-null  object 
 7   SellerG        18396 non-null  object 
 8   Date           18396 non-null  object 
 9   Distance       18395 non-null  float64
 10  Postcode       18395 non-null  float64
 11  Bedroom2       14927 non-null  float64
 12  Bathroom       14925 non-null  float64
 13  Car            14820 non-null  float64
 14  Landsize       13603 non-null  float64
 15  BuildingArea   7762 non-null   float64
 16  YearBuilt      8958 non-null   float64
 17  CouncilArea    12233 non-null  object 
 18  Lattit

In [None]:
!pip install pandas_profiling

In [None]:
import pandas_profiling

# Profile the full dataset; X_train is only defined later, after the train/validation split
house_price_report = pandas_profiling.ProfileReport(data)
house_price_report.to_file('house_report.html')

The `data.info()` output above shows that the dataset has 18,396 rows and 22 columns.

In [49]:
# Preparing the dataset to solve the Regression Problem

# Prepare the Dependent feature, i.e., Price of the house as the Dependent feature
y = data.Price

# Features other than Price feature are now considered as Independent features

# Assumption:
# To solve the regression problem, for simplicity only numerical independent features are considered
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])
X.head()

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,1,2,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
1,2,2,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,4,3,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,5,3,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
4,6,4,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0


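Only 13 numerical independent features are considered for training the regression model. Note that one of them, `Unnamed: 0`, appears to be a row identifier carried over from the CSV file rather than a property of the house.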

In [50]:
# Create a function that takes the training and validation subsets as input,
# trains a Decision Tree Regression model and evaluates it with the Mean Absolute Error (MAE)
def score_dataset(X_train, X_valid, y_train, y_valid):
    clf = tree.DecisionTreeRegressor()
    clf = clf.fit(X_train, y_train)
    pred = clf.predict(X_valid)
    mae = mean_absolute_error(y_valid, pred)
    return mae
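
A Decision Tree Regressor created without a `random_state` can produce slightly different trees, and therefore slightly different MAE values, from one run to the next. The sketch below shows a seeded variant; `score_dataset_seeded`, its `seed` parameter, and the synthetic data are hypothetical and are not part of the original script.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def score_dataset_seeded(X_train, X_valid, y_train, y_valid, seed=0):
    """Hypothetical variant of score_dataset with a fixed random_state."""
    model = tree.DecisionTreeRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))


# Synthetic data, used only to show that repeated calls agree
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          train_size=0.8, random_state=0)

first = score_dataset_seeded(X_tr, X_va, y_tr, y_va)
second = score_dataset_seeded(X_tr, X_va, y_tr, y_va)
print(first == second)
```

With a fixed seed, two calls on the same split return the same score, which makes the comparison between the approaches easier to reproduce.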

In [51]:
# Get features with missing values
cols_with_missing = [col for col in X.columns
                     if X[col].isnull().any()]

cols_with_missing

['Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

**Out of 13 independent features, 11 have missing values**
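
Besides listing the affected columns, it is often useful to see how much of each column is missing. The following is a minimal sketch on a toy frame; `toy` and its values are made up for illustration and stand in for `X`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X; the values are made up for illustration
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 0.0],
    "BuildingArea": [np.nan, 79.0, np.nan, np.nan],
})

# Number and percentage of missing values per column, worst column first
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

A column that is mostly empty (such as `BuildingArea` in the toy frame) is a more natural candidate for dropping than a column with only a handful of gaps.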

In [52]:
null_data = X[X.isnull().any(axis=1)]
null_data

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,1,2,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.79960,144.99840,4019.0
3,5,3,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.79690,144.99690,4019.0
5,10,2,2.5,3067.0,2.0,1.0,0.0,181.0,,,-37.80410,144.99530,4019.0
8,15,3,2.5,3067.0,,,,,,,,,4019.0
9,16,2,2.5,3067.0,,,,,,,,,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18388,23537,4,16.7,3150.0,4.0,2.0,2.0,652.0,,1981.0,-37.90562,145.16761,7392.0
18390,23539,3,6.8,3016.0,3.0,2.0,4.0,436.0,,1997.0,-37.85274,144.88738,6380.0
18391,23540,2,6.8,3016.0,2.0,2.0,1.0,,89.0,2010.0,-37.86393,144.90484,6380.0
18393,23544,4,12.7,3085.0,4.0,3.0,2.0,,,,-37.72006,145.10547,1369.0


In [53]:
X.loc[X['Unnamed: 0'] == 3349]

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
2573,3349,4,7.8,3058.0,4.0,2.0,1.0,381.0,,1938.0,-37.7337,144.9548,11204.0


## Score from Approach 1 (Drop Columns with Missing Values)
Since we are working with both training and validation sets, we must drop the same columns from both. Here the columns are dropped from X before the split, so both subsets end up with the same columns.

In [54]:
# Drop the columns with missing values (before splitting into training and validation data)
X = X.drop(cols_with_missing, axis=1)

After dropping the columns, we are left with only 2 columns (`Unnamed: 0` and `Rooms`) for training the model.
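
If the data has already been split, the same list of columns has to be removed from both subsets. The sketch below illustrates this on a toy frame; `toy`, `toy_train`, `toy_valid`, and `cols_to_drop` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2, 5],
    "Car": [1.0, np.nan, 2.0, 0.0, 1.0, 2.0],
    "Landsize": [202.0, 156.0, np.nan, 94.0, 120.0, 181.0],
})

toy_train, toy_valid = train_test_split(toy, train_size=0.5, random_state=0)

# Build the list once on the full data, so that a column whose gaps fall
# into only one of the subsets is still dropped from both
cols_to_drop = [col for col in toy.columns if toy[col].isnull().any()]

reduced_train = toy_train.drop(columns=cols_to_drop)
reduced_valid = toy_valid.drop(columns=cols_to_drop)
print(list(reduced_train.columns), list(reduced_valid.columns))
```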

In [55]:
# Split the data into 80:20 ratio for training and validation subsets
reduced_X_train, reduced_X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [57]:
reduced_X_train.loc[reduced_X_train['Unnamed: 0'] == 3349]

Unnamed: 0.1,Unnamed: 0,Rooms
2573,3349,4


In [56]:
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
390482.9366847826


**We can observe that the MAE of the trained Decision Tree Regressor is about 3.9 × 10^5, i.e., the predicted price is off by roughly 390,000 on average, so the model performs very poorly**

## Score from Approach 2 (Imputation)

#### a. Imputing all missing values in a feature with a fixed value - Univariate Feature Imputation

Use **SimpleImputer** to replace missing values with the **mean value** along each column.

Although it's simple, filling in the mean value generally performs quite well (but this varies by dataset). While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation, for instance), the complex strategies typically give no additional benefit once you plug the results into machine learning models.

SimpleImputer for Imputing Numerical Missing Data: 
For numerical missing data, the following strategies can be used:
* `mean`
* `median`
* `most_frequent`

SimpleImputer for Imputing Categorical Missing Data:
For categorical missing values, the following strategies can be used: 
* `most_frequent` (Recommended)
* `constant`
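
The strategies listed above can be tried on a small example. This is a sketch on made-up data: the `toy` frame and the `"Unknown"` placeholder label are assumptions for illustration, with column names borrowed from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 1.0],
    "CouncilArea": pd.Series(["Yarra", np.nan, "Yarra", "Moreland"], dtype=object),
})

# Numerical feature: replace NaN with the column median
num_imputer = SimpleImputer(strategy="median")
car_filled = num_imputer.fit_transform(toy[["Car"]])

# Categorical feature: replace NaN with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
area_filled = cat_imputer.fit_transform(toy[["CouncilArea"]])

# Categorical feature: replace NaN with a fixed placeholder label
const_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
area_const = const_imputer.fit_transform(toy[["CouncilArea"]])

print(car_filled.ravel(), area_filled.ravel(), area_const.ravel())
```

The `constant` strategy keeps "was missing" visible as its own category, whereas `most_frequent` blends the imputed rows into the largest existing category.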

In [58]:
# Preparing the dataset to solve the Regression Problem

# Prepare the Dependent feature, i.e., Price of the house as the Dependent feature
y = data.Price

# Features other than Price feature are now considered as Independent features

# Assumption:
# To solve the regression problem, for simplicity only numerical independent features are considered
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])
X.head()

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,1,2,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
1,2,2,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,4,3,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,5,3,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
4,6,4,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0


In [59]:
# Get features with missing values
cols_with_missing = [col for col in X.columns
                     if X[col].isnull().any()]

cols_with_missing

['Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

You can find the SimpleImputer class in the sklearn.impute module. 
You first initialize an instance of the SimpleImputer class by specifying the strategy (`mean`) and the placeholder that marks the missing values you want to locate (`np.nan`):


In [60]:
import numpy as np
from sklearn.impute import SimpleImputer

# Split the data in 80:20 ratio for training and validation dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing values encodings - [Reference link](https://scikit-learn.org/stable/modules/impute.html)

In [61]:
X_train.loc[X_train['Unnamed: 0'] == 3349]

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
2573,3349,4,7.8,3058.0,4.0,2.0,1.0,381.0,,1938.0,-37.7337,144.9548,11204.0


In [62]:
# Imputation on the independent features, i.e., replacing every missing value
# in a feature with a FIXED value (here the column mean)
my_imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

imputed_X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,3349.0,4.0,7.8,3058.0,4.0,2.0,1.0,381.0,152.120627,1938.0,-37.7337,144.9548,11204.0
1,2686.0,3.0,7.8,3124.0,3.0,1.0,1.0,544.0,160.0,1930.0,-37.8436,145.0581,8920.0
2,6065.0,2.0,5.6,3101.0,2.0,1.0,1.0,121.0,152.120627,1966.09975,-37.8126,145.0534,10331.0
3,11346.0,3.0,7.5,3123.0,3.0,2.0,2.0,200.0,152.120627,1966.09975,-37.8396,145.0514,6482.0
4,13474.0,2.0,4.5,3181.0,2.0,1.0,1.0,2842.0,84.0,1920.0,-37.8513,144.9943,7717.0


In [64]:
imputed_X_train.loc[imputed_X_train[0] == 3349]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,3349.0,4.0,7.8,3058.0,4.0,2.0,1.0,381.0,152.120627,1938.0,-37.7337,144.9548,11204.0


Take note that both the fit() and transform() functions expect a 2D array, so be sure to pass in a 2D array or dataframe. If you pass in a 1D array or a Pandas Series, you will get an error.
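
The difference between 1D and 2D input can be seen on a single column. The sketch below uses a made-up `toy` frame; selecting with double brackets or reshaping with `reshape(-1, 1)` are two common ways to obtain the 2D shape.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"BuildingArea": [79.0, np.nan, 150.0, 142.0]})
imputer = SimpleImputer(strategy="mean")

series_1d = toy["BuildingArea"]      # shape (4,): rejected by the imputer
frame_2d = toy[["BuildingArea"]]     # shape (4, 1): accepted

try:
    imputer.fit_transform(series_1d)
except ValueError as err:
    print("1D input failed:", type(err).__name__)

filled = imputer.fit_transform(frame_2d)

# A 1D NumPy array can be reshaped into a single column instead
filled_from_array = imputer.fit_transform(series_1d.to_numpy().reshape(-1, 1))
print(filled.ravel())
```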

**Why `fit_transform()` on the training set and only `transform()` on the validation set?**

`fit()` computes the parameters of a transformation from the data it is given and saves them as the internal state of the transformer object. For a standard scaler these parameters are the mean 𝜇 and standard deviation 𝜎 of each feature; for `SimpleImputer(strategy='mean')` they are the column means. The parameters are fitted on the training set only.

The same transformation must then be applied to the validation or test set (e.g. in cross-validation) and to newly obtained examples before prediction, using exactly the parameter values learned from the training set. Calling `transform()` does this: it applies the stored parameters to any set of examples without recomputing them.

`fit_transform()` joins these two steps and is used for the initial fitting of parameters on the training set 𝑥, while also returning the transformed 𝑥′. It is equivalent to calling `fit()` and then `transform()` on the same data.
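
The fit/transform split can be made concrete with a small sketch. The `train` and `valid` frames are made up; `statistics_` is the SimpleImputer attribute that holds the fitted fill value of each column. Wrapping the result with the original `columns` and `index` is an optional addition that keeps the DataFrame labels, which the plain `pd.DataFrame(...)` call discards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"Car": [1.0, 2.0, np.nan, 3.0]}, index=[10, 11, 12, 13])
valid = pd.DataFrame({"Car": [np.nan, 10.0]}, index=[20, 21])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                 # learns the mean of the training column only
print(imputer.statistics_)         # fitted fill value per column

# transform() reuses the stored training mean for both subsets
train_filled = pd.DataFrame(imputer.transform(train),
                            columns=train.columns, index=train.index)
valid_filled = pd.DataFrame(imputer.transform(valid),
                            columns=valid.columns, index=valid.index)
print(valid_filled)
```

The missing validation value is filled with the training mean (2.0), not with the mean of the validation column, so no information from the validation set leaks into the transformation.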

In [19]:
print("MAE from Approach 2 (Imputation Approach to replace missing value in a feature with fixed value):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation Approach to replace missing value in a feature with fixed value):
254106.82364130433


**We can observe that the imputation approach performs better than removing the features with missing values (MAE of about 2.5 × 10^5 versus 3.9 × 10^5). Still, the performance of the model is poor**

#### b. Imputing all missing values in a feature with different values based on other features present in the dataset - Multivariate Feature Imputation

This Approach is also known as fully conditional specification (FCS) or multivariate imputation by chained equations (MICE):

"This methodology is attractive if the multivariate distribution is a reasonable description of the data. FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10–20) is often sufficient." - [Reference Link](https://www.jstatsoft.org/article/view/v045i03)

Van Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate imputation by chained equations in R." Journal of statistical software 45 (2011): 1-67.

In [20]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Split the data in 80:20 ratio for training and validation dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

The IterativeImputer class models each feature with missing values as a function of the other features and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned. - [Reference Link](https://scikit-learn.org/stable/modules/impute.html)
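
To see why using the other features can help, the sketch below compares mean imputation with IterativeImputer on synthetic data in which one column is roughly twice the other. The data and the choice to hide every fifth value are assumptions made for illustration; on real data such as the housing dataset the benefit is not guaranteed.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Synthetic data: the second column is roughly twice the first
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
toy = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Hide the second value of every fifth row and remember the true values
true_values = toy[::5, 1].copy()
toy_missing = toy.copy()
toy_missing[::5, 1] = np.nan

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy_missing)
iter_filled = IterativeImputer(max_iter=20, random_state=0).fit_transform(toy_missing)

# Average absolute distance between the imputed and the hidden true values
mean_error = np.abs(mean_filled[::5, 1] - true_values).mean()
iter_error = np.abs(iter_filled[::5, 1] - true_values).mean()
print(f"mean imputation error: {mean_error:.2f}")
print(f"iterative imputation error: {iter_error:.2f}")
```

Because the hidden column is almost a linear function of the other one, the regression-based estimates land much closer to the true values than the single column mean does.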

In [36]:
X_train.loc[X_train['Unnamed: 0'] == 3349]

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
2573,3349,4,7.8,3058.0,4.0,2.0,1.0,381.0,,1938.0,-37.7337,144.9548,11204.0


In [26]:
# Imputation on the independent features, i.e., replacing every missing value
# in a feature with a different value predicted from the other features in the dataset
iterative_imputer = IterativeImputer(max_iter=20, random_state=0)
imputed_iterative_X_train = pd.DataFrame(iterative_imputer.fit_transform(X_train))
imputed_iterative_X_valid = pd.DataFrame(iterative_imputer.transform(X_valid))

imputed_iterative_X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,3349.0,4.0,7.8,3058.0,4.0,2.0,1.0,381.0,81.065889,1938.0,-37.7337,144.9548,11204.0
1,2686.0,3.0,7.8,3124.0,3.0,1.0,1.0,544.0,160.0,1930.0,-37.8436,145.0581,8920.0
2,6065.0,2.0,5.6,3101.0,2.0,1.0,1.0,121.0,6.52467,1951.387085,-37.8126,145.0534,10331.0
3,11346.0,3.0,7.5,3123.0,3.0,2.0,2.0,200.0,72.289192,1959.959619,-37.8396,145.0514,6482.0
4,13474.0,2.0,4.5,3181.0,2.0,1.0,1.0,2842.0,84.0,1920.0,-37.8513,144.9943,7717.0


In [42]:
imputed_iterative_X_train.loc[imputed_iterative_X_train[0] == 3349.0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,3349.0,4.0,7.8,3058.0,4.0,2.0,1.0,381.0,81.065889,1938.0,-37.7337,144.9548,11204.0


In [30]:
print("MAE from Approach 2 (Imputation Approach to replace missing value in a feature with different value):")
print(score_dataset(imputed_iterative_X_train, imputed_iterative_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation Approach to replace missing value in a feature with different value):
266983.33016304346


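**Multivariate imputation also performs better than removing the features with missing values (MAE of about 2.67 × 10^5 versus 3.9 × 10^5), but in this run it does not beat mean imputation (about 2.54 × 10^5). The performance of the model is still poor**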

**A third part of Approach 2 is to use KNNImputer, where KNN is used to impute the missing value. - [Reference Link](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer)**
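
A minimal KNNImputer sketch on made-up data is shown below; the `toy` array and `n_neighbors=2` are illustrative choices. Each missing value is replaced by the average of that feature over the nearest rows, where distance is measured on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

toy = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, np.nan],
    [4.0, 400.0],
    [50.0, 5000.0],
])

# Fill each gap with the mean of the 2 nearest rows that have the value
knn_imputer = KNNImputer(n_neighbors=2)
toy_filled = knn_imputer.fit_transform(toy)
print(toy_filled[2])
```

The two rows closest to `[3.0, nan]` are `[2.0, 200.0]` and `[4.0, 400.0]`, so the gap is filled with 300.0; the distant row `[50.0, 5000.0]` has no influence.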

## Score from Approach 3 (Extended Imputation)


Next, we impute the missing values while also keeping track of which values were imputed.

The problem with Approach 2 is that the model cannot tell whether a value came from the original data or was imputed. To give the model this information, an indicator column (conceptually `{column_name}_was_missing`) is added for each feature with missing values; it is True where the original value was null and False otherwise. In the code below this is done by passing `add_indicator=True` to SimpleImputer.
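
The effect of `add_indicator=True` is easiest to see on a small frame. In the sketch below the `toy` data and the `_was_missing` column names are assumptions for illustration; SimpleImputer itself returns an unlabeled array, and `indicator_.features_` gives the positions of the features that received an indicator column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "Rooms": [2.0, 3.0, 4.0],
    "Car": [1.0, np.nan, 2.0],
    "BuildingArea": [np.nan, 79.0, np.nan],
})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
filled = imputer.fit_transform(toy)

# Indicator columns are appended only for features that had NaN during fit
flagged = toy.columns[imputer.indicator_.features_]
columns = list(toy.columns) + [f"{col}_was_missing" for col in flagged]
toy_plus = pd.DataFrame(filled, columns=columns, index=toy.index)
print(toy_plus)
```

The result has the three imputed feature columns followed by one indicator column each for `Car` and `BuildingArea`; `Rooms` has no gaps and therefore gets no indicator.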

In [71]:
# Preparing the dataset to solve the Regression Problem

# Prepare the Dependent feature, i.e., Price of the house as the Dependent feature
y = data.Price

# Features other than Price feature are now considered as Independent features

# Assumption:
# To solve the regression problem, for simplicity only numerical independent features are considered
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])
X.head()

Unnamed: 0.1,Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,1,2,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
1,2,2,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,4,3,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,5,3,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
4,6,4,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0


In [72]:
# Get features with missing values
cols_with_missing = [col for col in X.columns
                     if X[col].isnull().any()]

cols_with_missing

['Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [73]:
# Make new columns indicating what will be imputed
# for col in cols_with_missing:
#     X[col + '_was_missing'] = X[col].isnull()

In [74]:
# Split the data in 80:20 ratio for training and validation dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [75]:
# Imputation on the independent features, i.e., replacing every missing value
# in a feature with a FIXED value, plus indicator columns marking which values were imputed
my_imputer = SimpleImputer(strategy='mean', missing_values=np.nan, add_indicator=True)
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid))

In [76]:
print("MAE from Approach 3 (An Extension to Imputation Technique to replace all missing values with a fixed value):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))


MAE from Approach 3 (An Extension to Imputation Technique to replace all missing values with a fixed value):
256769.8679347826


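**Adding the `_was_missing` indicator columns gives an MAE in the same range as plain mean imputation. The difference between the two is small and, because the Decision Tree Regressor is trained without a fixed random_state, it can vary from run to run, so it should not be read as a clear improvement**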

<table>
<tr>
<td>Measure</td>
<td>Drop columns</td>
<td>Imputation with Fixed Values</td>
<td>Imputation with Different Values</td>
<td>Extended Imputation</td>
</tr>
<tr>
<td>MAE</td>
<td>390482.94</td>
<td>254106.82</td>
<td>266983.33</td>
<td>256769.87</td>
</tr>
</table>

## In class assignment -

Titanic data analysis - https://www.kaggle.com/c/titanic/data?select=train.csv 
Using this dataset, find out which features contain missing data, apply the three approaches to it, and train ML models.
P.S. This is a classification problem, as opposed to the regression problem we just saw, so you have to think about which measures are suitable for evaluating the model performance.

P.S. Drop the Gender column (named `Sex` in the dataset), then add it back and compare the accuracy.