***
# Data Preprocessing: Data Quality Assessment, Preprocessing and Exploration for a Regression Modelling Problem

***
### John Pauline Pineda <br> <br> *November 7, 2023*
***

* [**1. Table of Contents**](#TOC)
    * [1.1 Data Background](#1.1)
    * [1.2 Data Description](#1.2)
    * [1.3 Data Quality Assessment](#1.3)
    * [1.4 Data Preprocessing](#1.4)
        * [1.4.1 Missing Data Imputation](#1.4.1)
        * [1.4.2 Outlier Treatment](#1.4.2)
        * [1.4.3 Zero and Near-Zero Variance](#1.4.3)
        * [1.4.4 Collinearity](#1.4.4)
        * [1.4.5 Linear Dependencies](#1.4.5)
        * [1.4.6 Centering and Scaling](#1.4.6)
        * [1.4.7 Shape Transformation](#1.4.7)
        * [1.4.8 Dummy Variables](#1.4.8)
        * [1.4.9 Preprocessed Data Description](#1.4.9)
    * [1.5 Data Exploration](#1.5)
* [**2. Summary**](#Summary)   
* [**3. References**](#References)

***

# 1. Table of Contents <a class="anchor" id="TOC"></a>

This project explores various methods for assessing **Data Quality**, implementing **Data Preprocessing** and conducting **Data Exploration** for prediction problems with numeric responses using various helpful packages in <mark style="background-color: #CCECFF">**Python**</mark>. A non-exhaustive list of methods to detect missing data, extreme outlying points, near-zero variance, multicollinearity, linear dependencies and skewed distributions was evaluated. Remedial procedures for addressing data quality issues, including missing data imputation, centering and scaling transformation, shape transformation and outlier treatment, were similarly considered, as applicable. All results were consolidated in a [<span style="color: #FF0000">**Summary**</span>](#Summary) presented at the end of the document.

[Data quality assessment](http://appliedpredictivemodeling.com/) involves profiling and assessing the data to understand its suitability for machine learning tasks. The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stages. Issues such as incorrect labels, synonymous categories in a categorical variable or heterogeneity in columns, among others, might go undetected by the standard preprocessing modules of machine learning frameworks and can lead to sub-optimal model performance, inaccurate analysis and unreliable decisions.
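
As a minimal sketch of one such check, the snippet below looks for synonymous categories in a categorical variable. The series `labels`, its values and the `synonyms` mapping are hypothetical and are not taken from the study data; they only illustrate how inconsistent spellings inflate the number of distinct levels.

```python
import pandas as pd

# Hypothetical labels: several spellings of the same category plus a missing value.
labels = pd.Series(["VH", "vh", " VH ", "H", "High", None], name="HDICAT_RAW")

# Normalize surrounding whitespace and case before counting distinct levels.
normalized = labels.str.strip().str.upper()

# Assumed mapping of a synonymous spelling onto one canonical level.
synonyms = {"HIGH": "H"}
cleaned = normalized.replace(synonyms)

print("Distinct raw levels:    ", labels.nunique(dropna=True))
print("Distinct cleaned levels:", cleaned.nunique(dropna=True))
print(cleaned.value_counts(dropna=False))
```

Comparing the number of distinct levels before and after normalization is a quick way to spot labels that differ only in spelling.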

[Data preprocessing](http://appliedpredictivemodeling.com/) involves changing the raw feature vectors into a representation that is more suitable for the downstream modelling and estimation processes, including data cleaning, integration, reduction and transformation. Data cleaning aims to identify and correct errors in the dataset that may negatively impact a predictive model, through steps such as removing outliers, replacing missing values, smoothing noisy data and correcting inconsistent data. Data integration addresses potential issues with redundant and inconsistent data obtained from multiple sources through approaches such as detection of tuple duplication and data conflict. The purpose of data reduction is to have a condensed representation of the data set that is smaller in volume, while maintaining the integrity of the original data set. Data transformation converts the data into the most appropriate form for data modeling.
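
The cleaning and transformation steps can be sketched as follows, assuming median imputation and z-score standardization as the chosen techniques. The frame `toy` and its columns `X1` and `X2` are hypothetical; this is an illustration, not the procedure applied later in the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric predictors with missing entries (not the study data).
toy = pd.DataFrame({
    "X1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "X2": [10.0, np.nan, 30.0, 40.0, 100.0],
})

# Data cleaning: replace each missing value with its column median.
imputed = toy.fillna(toy.median())

# Data transformation: center each column to mean 0 and
# scale it to unit sample standard deviation.
standardized = (imputed - imputed.mean()) / imputed.std()

print(imputed)
print(standardized.round(3))
```

The median is used here because it is less sensitive to extreme values than the mean; other imputation rules would fit the same pattern.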

[Data exploration](http://appliedpredictivemodeling.com/) involves analyzing and investigating data sets to summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to discover patterns, spot anomalies, test a hypothesis, or check assumptions. This process is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them.
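
One simple way to examine the relationships between variables is to rank predictors by their correlation with the response. The sketch below assumes Pearson correlation and uses a hypothetical frame `toy` with placeholder columns `Y`, `A` and `B`.

```python
import pandas as pd

# Hypothetical values; Y stands in for a numeric response, A and B for predictors.
toy = pd.DataFrame({
    "Y": [1.0, 2.0, 3.0, 4.0, 5.0],
    "A": [2.0, 4.0, 6.0, 8.0, 10.0],  # increases linearly with Y
    "B": [5.0, 3.0, 4.0, 1.0, 2.0],   # broadly decreases as Y increases
})

# Pearson correlation of every predictor with the response.
target_corr = toy.corr(method="pearson")["Y"].drop("Y")

# Order predictors by the absolute strength of the association.
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```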

## 1.1. Data Background <a class="anchor" id="1.1"></a>

The dataset used for the analysis was gathered and consolidated from various sources, including:
1. Cancer Rates from [World Population Review](https://worldpopulationreview.com/country-rankings/cancer-rates-by-country)
2. Social Protection and Labor Indicator from [World Bank](https://data.worldbank.org/topic/social-protection-and-labor?view=chart)
3. Education Indicator from [World Bank](https://data.worldbank.org/topic/education?view=chart)
4. Economy and Growth Indicator from [World Bank](https://data.worldbank.org/topic/economy-and-growth?view=chart)
5. Environment Indicator from [World Bank](https://data.worldbank.org/topic/environment?view=chart)
6. Climate Change Indicator from [World Bank](https://data.worldbank.org/topic/climate-change?view=chart)
7. Agricultural and Rural Development Indicator from [World Bank](https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart)
8. Social Development Indicator from [World Bank](https://data.worldbank.org/topic/social-development?view=chart)
9. Health Indicator from [World Bank](https://data.worldbank.org/topic/health?view=chart)
10. Science and Technology Indicator from [World Bank](https://data.worldbank.org/topic/science-and-technology?view=chart)
11. Urban Development Indicator from [World Bank](https://data.worldbank.org/topic/urban-development?view=chart)
12. Human Development Indices from [Human Development Reports](https://hdr.undp.org/data-center/human-development-index#/indicies/HDI)
13. Environmental Performance Indices from [Yale Center for Environmental Law and Policy](https://epi.yale.edu/epi-results/2022/component/epi)

This study hypothesized that various global development indicators and indices influence cancer rates across countries.

The target variable for the study is:
* **CANRAT** - Age-standardized cancer rates, per 100K population (2022)

The predictor variables for the study are:
* **GDPPER** - GDP per person employed, current US Dollars (2020)
* **URBPOP** - Urban population, % of total population (2020)
* **PATRES** - Patent applications by residents, total count (2020)
* **RNDGDP** - Research and development expenditure, % of GDP (2020)
* **POPGRO** - Population growth, annual % (2020)
* **LIFEXP** - Life expectancy at birth, total in years (2020)
* **TUBINC** - Incidence of tuberculosis, per 100K population (2020)
* **DTHCMD** - Cause of death by communicable diseases and maternal, prenatal and nutrition conditions, % of total (2019)
* **AGRLND** - Agricultural land, % of land area (2020)
* **GHGEMI** - Total greenhouse gas emissions, kt of CO2 equivalent (2020)
* **RELOUT** - Renewable electricity output, % of total electricity output (2015)
* **METEMI** - Methane emissions, kt of CO2 equivalent (2020)
* **FORARE** - Forest area, % of land area (2020)
* **CO2EMI** - CO2 emissions, metric tons per capita (2020)
* **PM2EXP** - PM2.5 air pollution, population exposed to levels exceeding WHO guideline value, % of total (2017)
* **POPDEN** - Population density, people per sq. km of land area (2020)
* **GDPCAP** - GDP per capita, current US Dollars (2020)
* **ENRTER** - Tertiary school enrollment, % gross (2020)
* **HDICAT** - Human development index, ordered category (2020)
* **EPISCO** - Environmental performance index, score (2022)


## 1.2. Data Description <a class="anchor" id="1.2"></a>

The dataset comprises:
* **177 rows** (observations)
* **22 columns** (variables)
    * **1/22 metadata** (categorical)
        * **COUNTRY**
    * **1/22 target** (numeric)
        * **CANRAT**
    * **19/22 predictor** (numeric)
        * **GDPPER**
        * **URBPOP**
        * **PATRES**
        * **RNDGDP**
        * **POPGRO**
        * **LIFEXP**
        * **TUBINC**
        * **DTHCMD**
        * **AGRLND**
        * **GHGEMI**
        * **RELOUT**
        * **METEMI**
        * **FORARE**
        * **CO2EMI**
        * **PM2EXP**
        * **POPDEN**
        * **GDPCAP**
        * **ENRTER**
        * **EPISCO**
    * **1/22 predictor** (categorical)
        * **HDICAT**

In [1]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from operator import truediv
%matplotlib inline

In [2]:
##################################
# Loading the dataset
##################################
cancer_rate = pd.read_csv('CancerRates.csv')

In [3]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ', cancer_rate.shape)

Dataset Dimensions:  (177, 22)


In [4]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_rate.dtypes)

Column Names and Data Types:


COUNTRY     object
CANRAT     float64
GDPPER     float64
URBPOP     float64
PATRES     float64
RNDGDP     float64
POPGRO     float64
LIFEXP     float64
TUBINC     float64
DTHCMD     float64
AGRLND     float64
GHGEMI     float64
RELOUT     float64
METEMI     float64
FORARE     float64
CO2EMI     float64
PM2EXP     float64
POPDEN     float64
ENRTER     float64
GDPCAP     float64
HDICAT      object
EPISCO     float64
dtype: object

In [5]:
##################################
# Taking a snapshot of the dataset
##################################
cancer_rate.head()

| | COUNTRY | CANRAT | GDPPER | URBPOP | PATRES | RNDGDP | POPGRO | LIFEXP | TUBINC | DTHCMD | ... | RELOUT | METEMI | FORARE | CO2EMI | PM2EXP | POPDEN | ENRTER | GDPCAP | HDICAT | EPISCO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australia | 452.4 | 98380.63601 | 86.241 | 2368.0 | NaN | 1.235701 | 83.2 | 7.2 | 4.941054 | ... | 13.637841 | 131484.7632 | 17.421315 | 14.772658 | 24.893584 | 3.335312 | 110.139221 | 51722.069 | VH | 60.1 |
| 1 | New Zealand | 422.9 | 77541.76438 | 86.699 | 348.0 | NaN | 2.204789 | 82.256098 | 7.2 | 4.35473 | ... | 80.081439 | 32241.937 | 37.570126 | 6.160799 | NaN | 19.331586 | 75.734833 | 41760.59478 | VH | 56.7 |
| 2 | Ireland | 372.8 | 198405.875 | 63.653 | 75.0 | 1.23244 | 1.029111 | 82.556098 | 5.3 | 5.684596 | ... | 27.965408 | 15252.82463 | 11.35172 | 6.768228 | 0.274092 | 72.367281 | 74.680313 | 85420.19086 | VH | 57.4 |
| 3 | United States | 362.2 | 130941.6369 | 82.664 | 269586.0 | 3.42287 | 0.964348 | 76.980488 | 2.3 | 5.30206 | ... | 13.228593 | 748241.4029 | 33.866926 | 13.032828 | 3.34317 | 36.240985 | 87.567657 | 63528.6343 | VH | 51.1 |
| 4 | Denmark | 351.1 | 113300.6011 | 88.116 | 1261.0 | 2.96873 | 0.291641 | 81.602439 | 4.1 | 6.82614 | ... | 65.505925 | 7778.773921 | 15.711 | 4.691237 | 56.914456 | 145.7851 | 82.66433 | 60915.4244 | VH | 77.9 |


In [6]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(cancer_rate.describe(include='number').transpose())

Numeric Variable Summary:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CANRAT | 177.0 | 183.829379 | 79.7434 | 78.4 | 118.1 | 155.3 | 240.4 | 452.4 |
| GDPPER | 165.0 | 45284.424283 | 39417.94 | 1718.804896 | 13545.25451 | 34024.90089 | 66778.41605 | 234646.9 |
| URBPOP | 174.0 | 59.788121 | 22.8064 | 13.345 | 42.43275 | 61.7015 | 79.1865 | 100.0 |
| PATRES | 108.0 | 20607.388889 | 134068.3 | 1.0 | 35.25 | 244.5 | 1297.75 | 1344817.0 |
| RNDGDP | 74.0 | 1.197474 | 1.189956 | 0.03977 | 0.256372 | 0.87366 | 1.608842 | 5.35451 |
| POPGRO | 174.0 | 1.127028 | 1.197718 | -2.079337 | 0.2369 | 1.179959 | 2.031154 | 3.727101 |
| LIFEXP | 174.0 | 71.746113 | 7.606209 | 52.777 | 65.9075 | 72.46461 | 77.5235 | 84.56 |
| TUBINC | 174.0 | 105.005862 | 136.7229 | 0.77 | 12.0 | 44.5 | 147.75 | 592.0 |
| DTHCMD | 170.0 | 21.260521 | 19.27333 | 1.283611 | 6.078009 | 12.456279 | 36.980457 | 65.20789 |
| AGRLND | 174.0 | 38.793456 | 21.71551 | 0.512821 | 20.130276 | 40.386649 | 54.013754 | 80.84112 |


In [7]:
##################################
# Performing a general exploration of the categorical variable
##################################
print('Categorical Variable Summary:')
display(cancer_rate.describe(include='object').transpose())

Categorical Variable Summary:


| | count | unique | top | freq |
|---|---|---|---|---|
| COUNTRY | 177 | 177 | Australia | 1 |
| HDICAT | 167 | 4 | VH | 59 |


## 1.3. Data Quality Assessment <a class="anchor" id="1.3"></a>
Details

In [8]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(cancer_rate.dtypes)

In [9]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(cancer_rate.columns)

In [10]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(cancer_rate)] * len(cancer_rate.columns))

In [11]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(cancer_rate.isna().sum())

In [12]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(cancer_rate.count())

In [13]:
##################################
# Gathering the fill rate (proportion of non-missing data) for each column
##################################
fill_rate_list = list(map(truediv, non_null_count_list, row_count_list))

In [14]:
##################################
# Formulating the summary
# for all columns
##################################
all_data_quality_summary = pd.DataFrame(zip(variable_name_list,
                                            data_type_list,
                                            row_count_list,
                                            non_null_count_list,
                                            null_count_list,                                            
                                            fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_data_quality_summary)

| | Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate |
|---|---|---|---|---|---|---|
| 0 | COUNTRY | object | 177 | 177 | 0 | 1.0 |
| 1 | CANRAT | float64 | 177 | 177 | 0 | 1.0 |
| 2 | GDPPER | float64 | 177 | 165 | 12 | 0.932203 |
| 3 | URBPOP | float64 | 177 | 174 | 3 | 0.983051 |
| 4 | PATRES | float64 | 177 | 108 | 69 | 0.610169 |
| 5 | RNDGDP | float64 | 177 | 74 | 103 | 0.418079 |
| 6 | POPGRO | float64 | 177 | 174 | 3 | 0.983051 |
| 7 | LIFEXP | float64 | 177 | 174 | 3 | 0.983051 |
| 8 | TUBINC | float64 | 177 | 174 | 3 | 0.983051 |
| 9 | DTHCMD | float64 | 177 | 170 | 7 | 0.960452 |
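
A summary of this kind can be filtered to list the columns whose fill rate falls below a chosen cut-off. The sketch below re-creates a few rows using the counts displayed above; the 0.90 threshold and the names `summary`, `FILL_RATE_LIMIT` and `low_fill` are assumptions made for illustration only.

```python
import pandas as pd

# Counts copied from the displayed summary rows; only a subset of columns is shown.
summary = pd.DataFrame({
    "Column.Name": ["GDPPER", "URBPOP", "PATRES", "RNDGDP", "DTHCMD"],
    "Row.Count": [177, 177, 177, 177, 177],
    "Non.Null.Count": [165, 174, 108, 74, 170],
})

# Fill rate is the proportion of non-missing observations per column.
summary["Fill.Rate"] = summary["Non.Null.Count"] / summary["Row.Count"]

# Assumed, illustrative cut-off; the appropriate value depends on the analysis.
FILL_RATE_LIMIT = 0.90
low_fill = summary[summary["Fill.Rate"] < FILL_RATE_LIMIT]
print(low_fill[["Column.Name", "Fill.Rate"]])
```

With these counts, the filter would single out **PATRES** and **RNDGDP**, which have the lowest fill rates among the rows shown.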


In [15]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
cancer_rate_numeric = cancer_rate.select_dtypes(include='number')

In [16]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = cancer_rate_numeric.columns

In [17]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = cancer_rate_numeric.min()

In [18]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = cancer_rate_numeric.mean()

In [19]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = cancer_rate_numeric.median()

In [20]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = cancer_rate_numeric.max()

In [21]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [cancer_rate[x].value_counts(dropna=True).index.tolist()[0] for x in cancer_rate_numeric]

In [22]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [cancer_rate[x].value_counts(dropna=True).index.tolist()[1] for x in cancer_rate_numeric]

In [23]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [cancer_rate_numeric[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_rate_numeric]

In [24]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [cancer_rate_numeric[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_rate_numeric]

In [25]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = list(map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list))

In [26]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = cancer_rate_numeric.nunique(dropna=True)

In [27]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(cancer_rate_numeric)] * len(cancer_rate_numeric.columns))

In [28]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = list(map(truediv, numeric_unique_count_list, numeric_row_count_list))

In [29]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_rate_numeric.skew()

In [30]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = cancer_rate_numeric.kurtosis()

In [31]:
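##################################
# Formulating the summary
# for all numeric columns
##################################
numeric_data_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,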
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_data_quality_summary)

| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CANRAT | 78.4 | 183.829379 | 155.3 | 452.4 | 135.3 | 106.7 | 3 | 2 | 1.5 | 167 | 177 | 0.943503 | 0.881825 | 0.063467 |
| 1 | GDPPER | 1718.804896 | 45284.424283 | 34024.90089 | 234646.9 | 98380.63601 | 42154.1781 | 1 | 1 | 1.0 | 165 | 177 | 0.932203 | 1.517574 | 3.471992 |
| 2 | URBPOP | 13.345 | 59.788121 | 61.7015 | 100.0 | 100.0 | 52.516 | 2 | 1 | 2.0 | 173 | 177 | 0.977401 | -0.210702 | -0.962847 |
| 3 | PATRES | 1.0 | 20607.388889 | 244.5 | 1344817.0 | 6.0 | 2.0 | 4 | 3 | 1.333333 | 97 | 177 | 0.548023 | 9.284436 | 91.187178 |
| 4 | RNDGDP | 0.03977 | 1.197474 | 0.87366 | 5.35451 | 1.23244 | 0.96218 | 1 | 1 | 1.0 | 74 | 177 | 0.418079 | 1.396742 | 1.695957 |
| 5 | POPGRO | -2.079337 | 1.127028 | 1.179959 | 3.727101 | 1.235701 | 1.483129 | 1 | 1 | 1.0 | 174 | 177 | 0.983051 | -0.195161 | -0.42358 |
| 6 | LIFEXP | 52.777 | 71.746113 | 72.46461 | 84.56 | 83.2 | 68.687 | 1 | 1 | 1.0 | 174 | 177 | 0.983051 | -0.357965 | -0.649601 |
| 7 | TUBINC | 0.77 | 105.005862 | 44.5 | 592.0 | 12.0 | 7.2 | 4 | 3 | 1.333333 | 131 | 177 | 0.740113 | 1.746333 | 2.429368 |
| 8 | DTHCMD | 1.283611 | 21.260521 | 12.456279 | 65.20789 | 4.941054 | 42.079403 | 1 | 1 | 1.0 | 170 | 177 | 0.960452 | 0.900509 | -0.691541 |
| 9 | AGRLND | 0.512821 | 38.793456 | 40.386649 | 80.84112 | 46.25248 | 72.006469 | 1 | 1 | 1.0 | 174 | 177 | 0.983051 | 0.074 | -0.926249 |
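
The skewness column can be screened for strongly asymmetric distributions. The sketch below uses a subset of the skewness values displayed above; the cut-off of 3 in absolute value and the names `skewness`, `SKEW_LIMIT` and `flagged` are assumptions for illustration, not a rule stated in this analysis.

```python
import pandas as pd

# Skewness values copied from the displayed summary rows (subset of columns).
skewness = pd.Series({
    "CANRAT": 0.881825,
    "GDPPER": 1.517574,
    "URBPOP": -0.210702,
    "PATRES": 9.284436,
    "RNDGDP": 1.396742,
    "TUBINC": 1.746333,
}, name="Skewness")

# Assumed, illustrative cut-off on the absolute skewness.
SKEW_LIMIT = 3.0
flagged = skewness[skewness.abs() > SKEW_LIMIT]
print(flagged)
```

Among the values shown, only **PATRES** exceeds this assumed limit, which is consistent with its mean (20607.39) lying far above its median (244.5).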


In [32]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
cancer_rate_categorical = cancer_rate.select_dtypes(include='object')

In [33]:
##################################
# Gathering the variable names for each categorical column
##################################
categorical_variable_name_list = cancer_rate_categorical.columns

In [34]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [cancer_rate[x].value_counts().index.tolist()[0] for x in cancer_rate_categorical]

In [35]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [cancer_rate[x].value_counts().index.tolist()[1] for x in cancer_rate_categorical]

In [36]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [cancer_rate_categorical[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_rate_categorical]

In [37]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [cancer_rate_categorical[x].isin([cancer_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_rate_categorical]

In [38]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = list(map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list))

In [39]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = cancer_rate_categorical.nunique(dropna=True)

In [40]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(cancer_rate_categorical)] * len(cancer_rate_categorical.columns))

In [41]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = list(map(truediv, categorical_unique_count_list, categorical_row_count_list))

In [42]:
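##################################
# Formulating the summary
# for all categorical columns
##################################
categorical_data_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,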
                                                    categorical_first_mode_list,
                                                    categorical_second_mode_list,
                                                    categorical_first_mode_count_list,
                                                    categorical_second_mode_count_list,
                                                    categorical_first_second_mode_ratio_list,
                                                    categorical_unique_count_list,
                                                    categorical_row_count_list,
                                                    categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_data_quality_summary)

| | Categorical.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | COUNTRY | Australia | Mauritius | 1 | 1 | 1.0 | 177 | 177 | 1.0 |
| 1 | HDICAT | VH | H | 59 | 39 | 1.512821 | 4 | 177 | 0.022599 |
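
The first-to-second mode ratio relies on every column having at least two distinct values, since the second mode is taken by position. A minimal sketch of a more defensive variant is given below; the helper `mode_ratio` and the frame `toy` are hypothetical and are not part of the cells above.

```python
import numpy as np
import pandas as pd

def mode_ratio(series: pd.Series) -> float:
    """Count of the most frequent value divided by the count of the second most frequent."""
    counts = series.value_counts(dropna=True)
    if len(counts) < 2:
        # A constant (or entirely missing) column has no second mode.
        return np.inf
    return counts.iloc[0] / counts.iloc[1]

# Hypothetical columns with different degrees of imbalance.
toy = pd.DataFrame({
    "CONSTANT": ["a"] * 6,
    "SKEWED": ["a"] * 5 + ["b"],
    "BALANCED": ["a", "a", "a", "b", "b", "c"],
})

# Apply the helper column by column.
ratios = toy.apply(mode_ratio)
print(ratios)
```

Returning infinity for a constant column is one possible convention; it keeps such columns at the top of any ranking by ratio instead of raising an indexing error.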


## 1.4. Data Preprocessing <a class="anchor" id="1.4"></a>
Details

### 1.4.1 Missing Data Imputation <a class="anchor" id="1.4.1"></a>
Details

### 1.4.2 Outlier Treatment <a class="anchor" id="1.4.2"></a>
Details

### 1.4.3 Zero and Near-Zero Variance <a class="anchor" id="1.4.3"></a>
Details

### 1.4.4 Collinearity <a class="anchor" id="1.4.4"></a>
Details

### 1.4.5 Linear Dependencies <a class="anchor" id="1.4.5"></a>
Details

### 1.4.6 Centering and Scaling <a class="anchor" id="1.4.6"></a>
Details

### 1.4.7 Shape Transformation <a class="anchor" id="1.4.7"></a>
Details

### 1.4.8 Dummy Variables <a class="anchor" id="1.4.8"></a>
Details

### 1.4.9 Preprocessed Data Description <a class="anchor" id="1.4.9"></a>
Details

## 1.5. Data Exploration <a class="anchor" id="1.5"></a>
Details

# 2. Summary <a class="anchor" id="Summary"></a>
Details

# 3. References <a class="anchor" id="References"></a>
* **[Book]** [Data Preparation for Machine Learning: Data Cleaning, Feature Selection, and Data Transforms in Python](https://machinelearningmastery.com/data-preparation-for-machine-learning/) by Jason Brownlee
* **[Book]** [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/) by Max Kuhn and Kjell Johnson
* **[Book]** [Feature Engineering for Machine Learning](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/) by Alice Zheng and Amanda Casari
* **[Book]** [Applied Predictive Modeling](https://link.springer.com/book/10.1007/978-1-4614-6849-3?page=1) by Max Kuhn and Kjell Johnson
* **[Book]** [Data Mining: Practical Machine Learning Tools and Techniques](https://www.sciencedirect.com/book/9780123748560/data-mining-practical-machine-learning-tools-and-techniques?via=ihub=) by Ian Witten, Eibe Frank, Mark Hall and Christopher Pal 
* **[Book]** [Data Cleaning](https://dl.acm.org/doi/book/10.1145/3310205) by Ihab Ilyas and Xu Chu
* **[Book]** [Data Wrangling with Python](https://www.oreilly.com/library/view/data-wrangling-with/9781491948804/) by Jacqueline Kazil and Katharine Jarmul
* **[Python Library API]** [sklearn.datasets.make_classification API](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) by Scikit-Learn Team
* **[Python Library API]** [sklearn.preprocessing.MinMaxScaler API](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) by Scikit-Learn Team
* **[Python Library API]** [sklearn.model_selection.train_test_split API](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) by Scikit-Learn Team
* **[Python Library API]** [sklearn.linear_model.LogisticRegression API](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) by Scikit-Learn Team
* **[Python Library API]** [sklearn.model_selection.RepeatedStratifiedKFold API](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) by Scikit-Learn Team
* **[Python Library API]** [sklearn.model_selection.cross_val_score API](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) by Scikit-Learn Team

***

In [43]:
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))