# Feature Engineering
Feature Engineering is the process of transforming data to create new, more relevant features for building effective machine learning models. It is commonly used to improve model performance and the overall quality of data analysis. The main goal of Feature Engineering is to extract valuable information from existing data by creating new variables or transforming existing ones. This can include combining features, removing redundant data, transforming and scaling numerical data, encoding categorical variables, and removing outliers.
Different Feature Engineering techniques are used depending on the nature of the data and the type of problem we are working on. Some popular techniques are:
- feature creation: involves creating new features from existing data. This can include extracting information from text, generating statistics from temporal data, creating interactions between variables, or aggregating data.
- feature removal: some features may be uninformative, contain a lot of missing values, or introduce noise into the data. In such cases, they can be removed to increase the effectiveness of the model.
- data transformation: many machine learning algorithms work best when the data meets certain distributional assumptions. This may include log transformation, standardization, normalization or reduction of skewness in the data.
- encoding of categorical variables: categorical variables cannot be used directly in many models. They must first be encoded, for example using one-hot encoding, ordinal encoding or binary encoding.


Applications of Feature Engineering techniques can vary depending on the dataset and problem. Correct use of Feature Engineering techniques can lead to better model performance, reduced overfitting, increased predictive accuracy and a better understanding of the data. However, excessive or inappropriate use can introduce noise or incorrect assumptions into the data, which can degrade model quality.

### Required libraries

In [1]:
import pandas as pd

from main import (
    one_hot_encoding,
    label_encoding,
    ordinal_encoding,
    create_interaction_features,
    create_polynomial_features,
    standardize_features,
    normalize_features,
    discretize_features,
    remove_highly_correlated_features,
    remove_features_with_high_missing_data,
)

### Loading data

In [2]:
data = pd.read_csv('heart_disease_risk.csv')
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
293,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
294,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,1
295,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,1


### One-hot encoding
One-hot encoding is a method of encoding categorical variables in binary form. Each category is encoded as a binary vector of length equal to the number of categories. In this vector, a value of 1 indicates category membership, and a value of 0 indicates no membership.
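
The implementation of `one_hot_encoding` in `main` is not shown here. As a minimal sketch of the same idea, assuming it behaves like pandas' `get_dummies`, the hypothetical helper `one_hot_sketch` below replaces a categorical column with one indicator column per category on a small made-up table:

```python
import pandas as pd


def one_hot_sketch(df, columns):
    """Hypothetical helper: replace each listed column with 0/1 indicator columns."""
    # get_dummies names the new columns "<column>_<category>".
    return pd.get_dummies(df, columns=columns, dtype=float)


# Toy data (made up for illustration, not the heart disease file).
toy = pd.DataFrame({"age": [63.0, 67.0, 37.0], "cp": [1.0, 4.0, 3.0]})
encoded = one_hot_sketch(toy, ["cp"])
print(encoded)
```

Each row has exactly one 1.0 among the indicator columns of `cp`, which is the property the description above relies on.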

In [3]:
data = pd.read_csv('heart_disease_risk.csv')
categorical_columns = ['cp', 'thal']

data = one_hot_encoding(data, categorical_columns)
data

Unnamed: 0,age,sex,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,decision,cp_1.0,cp_2.0,cp_3.0,cp_4.0,thal_3.0,thal_6.0,thal_7.0
0,63.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,67.0,1.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,1,0.0,0.0,0.0,1.0,1.0,0.0,0.0
2,67.0,1.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,1,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,37.0,1.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,41.0,0.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,1,0.0,0.0,0.0,1.0,0.0,0.0,1.0
293,45.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0
294,68.0,1.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,1,0.0,0.0,0.0,1.0,0.0,0.0,1.0
295,57.0,1.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,1,0.0,0.0,0.0,1.0,0.0,0.0,1.0


### Label encoding
Label encoding converts categorical variables to integers: each distinct category is assigned its own integer code. It is a compact way to pass categorical variables to models that require numeric input, but the codes imply an order that the categories may not actually have.
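
How `label_encoding` assigns its codes is not shown in this notebook. One plausible sketch, assuming the codes follow the sorted order of the distinct values, uses pandas' categorical dtype. The helper name `label_encode_sketch` and the toy data are illustrative only:

```python
import pandas as pd


def label_encode_sketch(df, columns):
    """Hypothetical helper: replace each listed column with integer codes."""
    out = df.copy()
    for col in columns:
        # Categories are sorted, so the smallest value receives code 0.
        out[col] = out[col].astype("category").cat.codes
    return out


# Toy data (made up for illustration).
toy = pd.DataFrame({"thal": [6.0, 3.0, 7.0, 3.0]})
encoded = label_encode_sketch(toy, ["thal"])
print(encoded["thal"].tolist())
```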

In [4]:
data = pd.read_csv('heart_disease_risk.csv')
categorical_columns = ['cp', 'thal']

data = label_encoding(data, categorical_columns)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,63.0,1.0,0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,1,0
1,67.0,1.0,3,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,0,1
2,67.0,1.0,3,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,2,1
3,37.0,1.0,2,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,0,0
4,41.0,0.0,1,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,3,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,2,1
293,45.0,1.0,0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,2,1
294,68.0,1.0,3,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,2,1
295,57.0,1.0,3,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,2,1


### Ordinal encoding
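Ordinal encoding converts categorical variables to integers according to an explicit order, so that the numeric codes reflect the ranking of the categories. The mapping between categories and values is supplied by the user. In the cells below, the first call maps the numeric codes of `cp` and `thal` to descriptive labels, and the second call maps the labels back to the original codes.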
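
A mapping-based encoder of this kind can be sketched with `Series.map`. This is an assumption about how such a function might work, not the code in `main`. The helper `ordinal_encode_sketch`, the `severity` column and its ordering are all made up for illustration:

```python
import pandas as pd


def ordinal_encode_sketch(df, columns, mapping):
    """Hypothetical helper: replace values using a per-column dictionary."""
    out = df.copy()
    for col in columns:
        # Values missing from the dictionary become NaN.
        out[col] = out[col].map(mapping[col])
    return out


# Toy data with a genuinely ordered category.
toy = pd.DataFrame({"severity": ["low", "high", "medium", "low"]})
order = {"severity": {"low": 0, "medium": 1, "high": 2}}
encoded = ordinal_encode_sketch(toy, ["severity"], order)
print(encoded["severity"].tolist())
```

Because the order is given explicitly, the resulting integers can be compared meaningfully (`low < medium < high`), which is what distinguishes this from plain label encoding.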

In [5]:
data = pd.read_csv('heart_disease_risk.csv')
categorical_columns = ['cp', 'thal']
ordinal_mapping = {
    'cp': {1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain', 4: 'asymptomatic'},
    'thal': {3: 'normal', 6: 'fixed defect', 7: 'reversible defect'}
}

data = ordinal_encoding(data, categorical_columns, ordinal_mapping)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,63.0,1.0,typical angina,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,fixed defect,0
1,67.0,1.0,asymptomatic,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,normal,1
2,67.0,1.0,asymptomatic,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,reversible defect,1
3,37.0,1.0,non-anginal pain,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,normal,0
4,41.0,0.0,atypical angina,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,normal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,asymptomatic,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,reversible defect,1
293,45.0,1.0,typical angina,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,reversible defect,1
294,68.0,1.0,asymptomatic,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,reversible defect,1
295,57.0,1.0,asymptomatic,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,reversible defect,1


In [6]:
categorical_columns = ['cp', 'thal']
ordinal_mapping = {
    'cp': {'typical angina': 1.0, 'atypical angina': 2.0, 'non-anginal pain': 3.0, 'asymptomatic': 4.0},
    'thal': {'normal': 3.0, 'fixed defect': 6.0, 'reversible defect': 7.0}
}

data = ordinal_encoding(data, categorical_columns, ordinal_mapping)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
293,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
294,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,1
295,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,1


### Polynomial transformation
Polynomial transformation generates new features from polynomial combinations of existing features, up to a chosen degree. It lets a model capture nonlinear relationships between features and the target variable, which can improve performance when the data has a complex nonlinear structure.
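
To make "polynomial combinations" concrete, here is a small sketch that appends every product of up to `degree` columns, including repeated ones such as squares. It is an illustration under assumptions and not the `create_polynomial_features` implementation, whose naming scheme and term selection may differ. The helper name and toy data are hypothetical:

```python
from itertools import combinations_with_replacement

import pandas as pd


def polynomial_features_sketch(df, degree=2):
    """Hypothetical helper: append every product of 2..degree columns."""
    out = df.copy()
    for d in range(2, degree + 1):
        # Combinations with replacement include repeated columns, e.g. a * a.
        for combo in combinations_with_replacement(df.columns, d):
            term = pd.Series(1.0, index=df.index)
            for col in combo:
                term = term * df[col]
            # Assumes string column names so they can be joined into a label.
            out[" * ".join(combo)] = term
    return out


# Toy data (made up for illustration).
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
poly = polynomial_features_sketch(toy, degree=2)
print(poly.columns.tolist())
```

The number of generated terms grows quickly with the number of columns and the degree, so in practice the transformation is usually restricted to a few selected features.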

In [7]:
data = pd.read_csv('heart_disease_risk.csv')
degree = 2

data = create_polynomial_features(data, degree)
data

Unnamed: 0,x0,x0 * x1,x0 * x2,x0 * x3,x0 * x4,x0 * x5,x0 * x6,x0 * x7,x0 * x8,x0 * x9,...,x0**2 * x10,x0**2 * x10 * x11,x0**2 * x10 * x12,x0**2 * x10 * x13,x0**2 * x11,x0**2 * x11 * x12,x0**2 * x11 * x13,x0**2 * x12,x0**2 * x12 * x13,x0**2 * x13
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,...,9.0,0.0,18.0,0.0,0.0,0.0,0.0,36.0,0.0,0.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,...,4.0,6.0,6.0,2.0,9.0,9.0,3.0,9.0,3.0,1.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,...,4.0,4.0,14.0,2.0,4.0,14.0,2.0,49.0,7.0,1.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,...,9.0,0.0,9.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,...,1.0,0.0,3.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,...,4.0,0.0,14.0,2.0,0.0,0.0,0.0,49.0,7.0,1.0
293,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,...,4.0,0.0,14.0,2.0,0.0,0.0,0.0,49.0,7.0,1.0
294,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,...,4.0,4.0,14.0,2.0,4.0,14.0,2.0,49.0,7.0,1.0
295,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,...,4.0,2.0,14.0,2.0,1.0,7.0,1.0,49.0,7.0,1.0


### Interaction features
Interaction features are new features created by combining existing features, here by multiplying them in pairs. They account for interactions between features that can have a significant impact on the prediction result, and can help capture non-linear relationships that the original feature set does not express.
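
Pairwise products can be sketched with `itertools.combinations`. The helper `interaction_features_sketch` and its optional `columns` argument are assumptions for illustration, not the signature of `create_interaction_features`. Restricting the columns makes it possible to leave out the target, since a product with the label would leak it into the features:

```python
from itertools import combinations

import pandas as pd


def interaction_features_sketch(df, columns=None):
    """Hypothetical helper: append the product of every pair of columns."""
    cols = list(df.columns) if columns is None else list(columns)
    out = df.copy()
    for left, right in combinations(cols, 2):
        out[f"{left}*{right}"] = df[left] * df[right]
    return out


# Toy data (made up for illustration).
toy = pd.DataFrame({"oldpeak": [2.3, 1.5], "slope": [3.0, 2.0], "ca": [0.0, 3.0]})
inter = interaction_features_sketch(toy)
print(inter.columns.tolist())
```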

In [8]:
data = pd.read_csv('heart_disease_risk.csv')

data = create_interaction_features(data)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,...,oldpeak*slope,oldpeak*ca,oldpeak*thal,oldpeak*decision,slope*ca,slope*thal,slope*decision,ca*thal,ca*decision,thal*decision
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,...,6.9,0.0,13.8,0.0,0.0,18.0,0.0,0.0,0.0,0.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,...,3.0,4.5,4.5,1.5,6.0,6.0,2.0,9.0,3.0,3.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,...,5.2,5.2,18.2,2.6,4.0,14.0,2.0,14.0,2.0,7.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,...,10.5,0.0,10.5,0.0,0.0,9.0,0.0,0.0,0.0,0.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,...,1.4,0.0,4.2,0.0,0.0,3.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,...,0.4,0.0,1.4,0.2,0.0,14.0,2.0,0.0,0.0,7.0
293,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,...,2.4,0.0,8.4,1.2,0.0,14.0,2.0,0.0,0.0,7.0
294,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,...,6.8,6.8,23.8,3.4,4.0,14.0,2.0,14.0,2.0,7.0
295,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,...,2.4,1.2,8.4,1.2,2.0,14.0,2.0,7.0,1.0,7.0


### Standardization of data features
Standardization transforms each feature so that it has a mean of zero and a standard deviation of one. It changes only the location and scale of a feature, not the shape of its distribution, so a skewed feature remains skewed after standardization.
First, the mean and standard deviation are calculated for each feature in the dataset. Then each value is transformed by subtracting the mean and dividing by the standard deviation.
Feature standardization has several applications, such as improving the convergence of gradient-based machine learning algorithms, reducing the influence of differing feature scales on model performance, and making features with different units and scales easier to compare.
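
The two steps (compute mean and standard deviation, then subtract and divide) fit in one line of pandas. The sketch below is illustrative and not the `standardize_features` implementation. It assumes the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` uses:

```python
import pandas as pd


def standardize_sketch(df):
    """Hypothetical helper: z = (x - mean) / std, column by column."""
    # ddof=0 selects the population standard deviation (an assumption here).
    return (df - df.mean()) / df.std(ddof=0)


# Toy data (made up for illustration).
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0], "chol": [200.0, 250.0, 300.0]})
scaled = standardize_sketch(toy)
print(scaled)
```

After the transformation both columns hold the same values, even though `age` and `chol` were on different scales originally.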

In [9]:
data = pd.read_csv('heart_disease_risk.csv')

data = standardize_features(data)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,0.936181,0.691095,-2.240629,0.750380,-0.276443,2.430427,1.010199,0.017494,-0.696419,1.068965,2.264145,-0.721976,0.655877,-0.925338
1,1.378929,0.691095,0.873880,1.596266,0.744555,-0.411450,1.010199,-1.816334,1.435916,0.381773,0.643781,2.478425,-0.894220,1.080686
2,1.378929,0.691095,0.873880,-0.659431,-0.353500,-0.411450,1.010199,-0.899420,1.435916,1.326662,0.643781,1.411625,1.172577,1.080686
3,-1.941680,0.691095,-0.164289,-0.095506,0.051047,-0.411450,-1.003419,1.633010,-0.696419,2.099753,2.264145,-0.721976,-0.894220,-0.925338
4,-1.498933,-1.446980,-1.202459,-0.095506,-0.835103,-0.411450,1.010199,0.978071,-0.696419,0.295874,-0.976583,-0.721976,-0.894220,-0.925338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,0.272059,-1.446980,0.873880,0.468418,-0.122330,-0.411450,-1.003419,-1.161395,1.435916,-0.734914,0.643781,-0.721976,1.172577,1.080686
293,-1.056185,0.691095,-2.240629,-1.223355,0.320744,-0.411450,-1.003419,-0.768432,-0.696419,0.124076,0.643781,-0.721976,1.172577,1.080686
294,1.489615,0.691095,0.873880,0.693988,-1.047008,2.430427,-1.003419,-0.375469,-0.696419,2.013854,0.643781,1.411625,1.172577,1.080686
295,0.272059,0.691095,0.873880,-0.095506,-2.241384,-0.411450,-1.003419,-1.510696,1.435916,0.124076,0.643781,0.344824,1.172577,1.080686


### Normalize data features

Normalization uses the MinMaxScaler object from the scikit-learn library, which rescales each feature so that its smallest value becomes 0 and its largest value becomes 1. In practice, feature normalization can be applied before modeling tasks such as classification or regression to make features comparable. Feature normalization has several important properties:
- it is a linear rescaling, so it preserves the relative spacing between the values of a feature while putting all features on a common range, which matters for distance-based algorithms (such as k-nearest neighbors)
- it does not change the shape of a distribution, and because the range is set by the minimum and maximum, it is sensitive to outliers
- it is particularly useful for algorithms that are sensitive to the scale of the data, such as some numerical optimization methods
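
The min-max formula that `MinMaxScaler` applies with its default range of 0 to 1 is `(x - min) / (max - min)`. The sketch below writes it out in plain pandas so the arithmetic is visible. The helper name and toy data are illustrative, and this is not the `normalize_features` code:

```python
import pandas as pd


def min_max_sketch(df):
    """Hypothetical helper: rescale every column to the range [0, 1]."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has a range of 0 and would produce NaN here.
    return (df - col_min) / col_range


# Toy data (made up for illustration).
toy = pd.DataFrame({"age": [29.0, 53.0, 77.0], "sex": [0.0, 1.0, 1.0]})
scaled = min_max_sketch(toy)
print(scaled)
```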

In [10]:
data = pd.read_csv('heart_disease_risk.csv')

data = normalize_features(data)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,0.708333,1.0,0.000000,0.481132,0.244292,1.0,1.0,0.603053,0.0,0.370968,1.0,0.000000,0.75,0.0
1,0.791667,1.0,1.000000,0.622642,0.365297,0.0,1.0,0.282443,1.0,0.241935,0.5,1.000000,0.00,1.0
2,0.791667,1.0,1.000000,0.245283,0.235160,0.0,1.0,0.442748,1.0,0.419355,0.5,0.666667,1.00,1.0
3,0.166667,1.0,0.666667,0.339623,0.283105,0.0,0.0,0.885496,0.0,0.564516,1.0,0.000000,0.00,0.0
4,0.250000,0.0,0.333333,0.339623,0.178082,0.0,1.0,0.770992,0.0,0.225806,0.0,0.000000,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,0.583333,0.0,1.000000,0.433962,0.262557,0.0,0.0,0.396947,1.0,0.032258,0.5,0.000000,1.00,1.0
293,0.333333,1.0,0.000000,0.150943,0.315068,0.0,0.0,0.465649,0.0,0.193548,0.5,0.000000,1.00,1.0
294,0.812500,1.0,1.000000,0.471698,0.152968,1.0,0.0,0.534351,0.0,0.548387,0.5,0.666667,1.00,1.0
295,0.583333,1.0,1.000000,0.339623,0.011416,0.0,0.0,0.335878,1.0,0.193548,0.5,0.333333,1.00,1.0


### Discretization of data features
Discretization transforms continuous features into a small number of discrete values (bins). It can be used in tasks such as classification, clustering or decision rule construction, either because the method expects discrete inputs or to simplify the problem. The binned values replace the original features in the dataset.
Feature discretization has several applications:
- it can reduce the impact of noise, since small fluctuations within a bin no longer change the feature value
- it is useful when a model requires data in the form of categories or ranges, such as for duration analysis, data clustering or building decision rules.
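
The binning strategy used by `discretize_features` is not stated in this notebook. As one possible sketch, the hypothetical helper below assumes quantile-based bins (roughly equal numbers of rows per bin) via `pandas.qcut`. Equal-width bins could be obtained the same way with `pandas.cut`:

```python
import pandas as pd


def discretize_sketch(df, columns, n_bins):
    """Hypothetical helper: replace each listed column with its bin index."""
    out = df.copy()
    for col in columns:
        # labels=False returns integer bin indices starting at 0;
        # duplicates="drop" merges bins whose edges coincide.
        out[col] = pd.qcut(out[col], q=n_bins, labels=False, duplicates="drop")
    return out


# Toy data (made up for illustration).
toy = pd.DataFrame({"age": [29.0, 37.0, 41.0, 45.0, 54.0, 57.0, 63.0, 67.0]})
binned = discretize_sketch(toy, ["age"], n_bins=4)
print(binned["age"].tolist())
```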


In [11]:
data = pd.read_csv('heart_disease_risk.csv')
continuous_columns = ['age', 'trestbps', 'chol', 'thalach']
n_bins = 10

data = discretize_features(data, continuous_columns, n_bins)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,8.0,1.0,1.0,8.0,4.0,1.0,2.0,4.0,0.0,2.3,3.0,0.0,6.0,0
1,9.0,1.0,4.0,9.0,7.0,0.0,2.0,0.0,1.0,1.5,2.0,3.0,3.0,1
2,9.0,1.0,4.0,3.0,3.0,0.0,2.0,1.0,1.0,2.6,2.0,2.0,7.0,1
3,0.0,1.0,3.0,5.0,5.0,0.0,0.0,9.0,0.0,3.5,3.0,0.0,3.0,0
4,0.0,0.0,2.0,5.0,1.0,0.0,2.0,8.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,5.0,0.0,4.0,7.0,4.0,0.0,0.0,1.0,1.0,0.2,2.0,0.0,7.0,1
293,2.0,1.0,1.0,1.0,6.0,0.0,0.0,2.0,0.0,1.2,2.0,0.0,7.0,1
294,9.0,1.0,4.0,7.0,1.0,1.0,0.0,3.0,0.0,3.4,2.0,2.0,7.0,1
295,5.0,1.0,4.0,5.0,0.0,0.0,0.0,0.0,1.0,1.2,2.0,1.0,7.0,1


### Remove highly correlated features
This function identifies and removes features that are highly correlated with each other. It computes a correlation matrix with the correlation coefficients between all pairs of features and identifies the pairs whose correlation exceeds a given threshold. Highly correlated features are largely redundant and can contribute to overfitting. Removing them reduces the complexity of the data and can improve model performance.
The correlation threshold, `threshold`, should be in the range from 0 to 1, inclusive. A correlation of 0 means no linear relationship between two features, and 1 means a perfect one, so a lower threshold removes more features. The right value depends on the specific problem, the data and the goals of the analysis. Values such as 0.5, 0.7 or 0.8 are often used, but no single value suits all cases. In the example below, a threshold of 0.5 removes `slope` and also the target column `decision`, so it is usually safer to set the target aside before applying the function.
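
A common way to implement this kind of filter is to look only at the upper triangle of the absolute correlation matrix and drop the later column of each offending pair. The sketch below follows that pattern. It is an assumption about the approach, not the code of `remove_highly_correlated_features`, and the helper name and toy data are made up:

```python
import numpy as np
import pandas as pd


def drop_correlated_sketch(df, threshold):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.1],  # almost exactly twice "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to "a"
})
reduced = drop_correlated_sketch(toy, threshold=0.9)
print(reduced.columns.tolist())
```

With this rule the column that appears first is kept, so the column order influences which member of a correlated pair survives.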

In [12]:
data = pd.read_csv('heart_disease_risk.csv')
threshold = 0.5

data = remove_highly_correlated_features(data, threshold)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,ca,thal
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,0.0,6.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,3.0,3.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,7.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,0.0,3.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,0.0,7.0
293,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,0.0,7.0
294,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,7.0
295,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,1.0,7.0


### Removing features with a high percentage of missing data
This function identifies and removes features whose share of missing data exceeds a given threshold. First, it calculates the proportion of missing values for each feature in the dataset. It then selects the features for which the condition missing_data > missing_threshold holds; the index labels of that result are the names of the columns to remove. Finally, it removes these columns by calling the drop method with the axis=1 parameter. This reduces the number of features that contain large amounts of missing data and could otherwise harm the quality and reliability of analysis or modeling. In the example below no column exceeds the threshold, so the output is identical to the original data.
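
The steps described above can be sketched as follows. The names `missing_data`, `missing_threshold` and `drop(..., axis=1)` come from the description. The helper name, the toy data and the assumption that the threshold is a fraction between 0 and 1 are illustrative:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns_sketch(df, missing_threshold):
    """Hypothetical helper: drop columns with too large a share of NaN values."""
    # Fraction of missing values per column (assumed scale: 0 to 1).
    missing_data = df.isnull().mean()
    # The index of the filtered Series holds the column names to remove.
    to_drop = missing_data[missing_data > missing_threshold].index
    return df.drop(to_drop, axis=1)


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "age": [63.0, 67.0, 37.0, 41.0],
    "ca": [0.0, np.nan, np.nan, 0.0],   # 50% missing
    "thal": [6.0, 3.0, 7.0, np.nan],    # 25% missing
})
reduced = drop_sparse_columns_sketch(toy, missing_threshold=0.3)
print(reduced.columns.tolist())
```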

In [13]:
data = pd.read_csv('heart_disease_risk.csv')
missing_threshold = 0.1

data = remove_features_with_high_missing_data(data, missing_threshold)
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,decision
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
293,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
294,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,1
295,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,1
