## 📖 Background
### OVERVIEW
The June edition of the 2022 Tabular Playground series is all about data imputation. The dataset has similarities to the May 2022 Tabular Playground, except that there are no targets. Rather, there are missing data values in the dataset, and your task is to predict what these values should be.
## 💾 The data
For this challenge, you are given (simulated) manufacturing control data that contains missing values due to electronic errors. Your task is to predict the values of all missing data in this dataset. (Note, while there are continuous and categorical features, only the continuous features have missing values.)


Good luck!

### Files

data.csv - the file includes normalized continuous data and categorical data; your task is to predict the values of the missing data.

sample_submission.csv - a sample submission file in the correct format; the row-col indicator corresponds to the row and column of each missing value in data.csv
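
To make the row-col indicator concrete, here is a minimal sketch that collects the imputed cells of a toy frame into a submission-style table. The `<row_id>-<column>` key layout, the `value` column name and the helper `build_submission` are assumptions made for illustration, so check sample_submission.csv for the exact format.

```python
import numpy as np
import pandas as pd


def build_submission(original: pd.DataFrame, imputed: pd.DataFrame) -> pd.DataFrame:
    """Collect the imputed value of every cell that is missing in `original`."""
    rows = []
    for col in original.columns:
        # Index labels (row ids) of the cells that are NaN in this column
        missing_idx = original.index[original[col].isna().to_numpy()]
        for idx in missing_idx:
            rows.append({"row-col": f"{idx}-{col}", "value": imputed.at[idx, col]})
    return pd.DataFrame(rows, columns=["row-col", "value"])


# Toy data standing in for data.csv (hypothetical values)
toy = pd.DataFrame({"F_1_0": [0.1, np.nan, 0.3], "F_1_1": [np.nan, 1.0, 2.0]})
toy.index.name = "row_id"

# Placeholder imputation: fill each column with its mean
toy_imputed = toy.fillna(toy.mean())

submission = build_submission(toy, toy_imputed)
print(submission)
```

Only the cells that were missing in the original frame end up in the table, one row per cell.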

<a id=0></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">

<center>CRISP-DM Methodology</center></h3>

* [Business Understanding](#1)
* [Data Understanding](#2)
* [Data Preparation](#3)
* [Data Modeling](#4)   
* [Data Evaluation](#5)
    

In this section we outline the methodology used to engineer our solution. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an open standard guide that describes common approaches used by data mining experts. CRISP-DM describes the typical phases of a project, including the tasks in each phase, and provides an overview of the data mining lifecycle. The lifecycle model consists of six phases, with arrows indicating the most important and frequent dependencies between phases. The sequence of the phases is not strict; in fact, most projects move back and forth between phases as necessary. It starts with business understanding and then moves to data understanding, data preparation, modeling, evaluation, and deployment. The CRISP-DM model is flexible and can be customized easily.
## I. Business Understanding

    Tasks:

Any good project starts with a deep understanding of the customer’s needs. Data mining projects are no exception and CRISP-DM recognizes this. 

The Business Understanding phase focuses on understanding the objectives and requirements of the project. Aside from the third task, the three other tasks in this phase are foundational project management activities that are universal to most projects:

1. Determine business objectives: You should first “thoroughly understand, from a business perspective, what the customer really wants to accomplish.” (CRISP-DM Guide) and then define business success criteria.

1. Assess situation: Determine resources availability, project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.

1. Determine data mining goals: In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.

1. Produce project plan: Select technologies and tools and define detailed plans for each project phase.

While many teams hurry through this phase, establishing a strong business understanding is like building the foundation of a house – absolutely essential.


## II.Data Understanding 
Next is the Data Understanding phase. Adding to the foundation of Business Understanding, it drives the focus to identify, collect, and analyze the data sets that can help you accomplish the project goals. This phase also has four tasks:

1. Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
1. Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.
1. Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
1. Verify data quality: How clean/dirty is the data? Document any quality issues.  

## III. Data Preparation
A common rule of thumb is that 80% of the project is data preparation.

This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks:

1. Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.
1. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.
1. Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.
1. Integrate data: Create new data sets by combining data from multiple sources.
1. Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
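
The "Format data" and "Construct data" tasks above can be sketched with pandas as follows. The column names and values are made up for illustration and are not part of the competition data.

```python
import pandas as pd

people = pd.DataFrame({
    "height_m": ["1.80", "1.65", "n/a"],   # numbers stored as strings
    "weight_kg": [81.0, 60.0, 70.0],
})

# Format data: convert strings to numbers; unparseable entries become NaN
people["height_m"] = pd.to_numeric(people["height_m"], errors="coerce")

# Construct data: derive body mass index = weight / height^2
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

print(people)
```

Using `errors="coerce"` turns entries that cannot be parsed into NaN, which hands them over to the "Clean data" task instead of raising an exception.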


  
    Order of Tasks:
    
    1.Data selection

    2.Data preprocessing

    3.Feature engineering

    4.Dimensionality reduction

            Steps:

            Data cleaning

            Data integration

            Data sampling

            Data dimensionality reduction

            Data formatting

            Data transformation

            Scaling

            Aggregation

            Decomposition

## IV.Modeling
What is widely regarded as data science’s most exciting work is also often the shortest phase of the project.

Here you’ll likely build and assess various models based on several different modeling techniques. This phase has four tasks:

1. Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
1. Generate test design: Pending your modeling approach, you might need to split the data into training, test, and validation sets.
1. Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg = LinearRegression().fit(X, y)”.
1. Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.

Although the CRISP-DM guide suggests to “iterate model building and assessment until you strongly believe that you have found the best model(s)”,  in practice teams should continue iterating until they find a “good enough” model, proceed through the CRISP-DM lifecycle, then further improve the model in future iterations. 
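
As a small illustration of the "Generate test design", "Build model" and "Assess model" tasks, the sketch below splits synthetic data, fits a linear regression and scores it on the hold-out part. The data, the 75/25 split and the use of RMSE as the score are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Test design: keep 25% of the rows aside for assessment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build model
reg = LinearRegression().fit(X_train, y_train)

# Assess model on data it has not seen
rmse = float(np.sqrt(mean_squared_error(y_test, reg.predict(X_test))))
print(f"hold-out RMSE: {rmse:.3f}")
```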

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in our data.

   

## V.Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business needs and what to do next. This phase has three tasks:

1. Evaluate results: Do the models meet the business success criteria? Which one(s) should we approve for the business?
1. Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed.
1. Determine next steps: Based on the previous two tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.
## VI.Deployment

“Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.”


A model is not particularly useful unless the customer can access its results. The complexity of this phase varies widely. This final phase has four tasks:

1. Plan deployment: Develop and document a plan for deploying the model.
1. Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model.
1. Produce final report: The project team documents a summary of the project which might include a final presentation of data mining results.
1. Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.
Your organization’s work might not end there. As a project framework, CRISP-DM does not outline what to do after the project (also known as “operations”). But if the model is going to production, be sure you maintain the model in production. Constant monitoring and occasional model tuning is often required.
    
 ref :
 https://www.youtube.com/watch?v=q_okDS2RtzY
 
# Complete Notebooks Guide: 


<a id=1></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Business Understanding</center></h3>


There may be two types of questions:

**A.Technical Questions:**
  
Can ML be a solution to the problem?

    
                Do we have THE data?
                Do we have all necessary related data?
                Is there enough data to develop an algorithm?
                Is data collected in the right way?
                Is data saved in the right format?
                Is the access to information guaranteed?

Can we satisfy all the Business Questions by means of ML?

**B.Business Questions:**
    
What are the organization's business goals?
    
                To reduce cost and increase revenue? 
                To increase efficiencies?
                To avoid risks? To improve quality?
    
Is it worth developing ML?
    
                In short term? In long term?
                What are the success metrics?
                Can we handle the risk if the project is unsuccessful?
    
Do we have the resources?
    
                Do we have enough time to develop ML?
                Do we have a team with the right talent?

The goal of this project is to build a model that imputes the missing values in the manufacturing control data.

## 💾 The data
For this challenge, you are given (simulated) manufacturing control data that contains missing values due to electronic errors. Your task is to predict the values of all missing data in this dataset. (Note, while there are continuous and categorical features, only the continuous features have missing values.)





 We are not looking for a winning solution but more for:

- how you approach the problem

- how do you look at the data

- what do you look at

- how do you structure your projects/prototypes.

- If you chose to build a simple classifier, which one did you choose and why?

    
**summary:**
we are expecting the following:
    
- 1.Methodology: method for engineering our solution 


        a.CRISP_DM
        
    
- 2. Model


        a. We are expecting a machine learning model that can correctly classify or impute missing values.


**What is the objective of the machine learning model?**

We aim to predict the missing values: for every NaN entry in the continuous features, the model should estimate the value that was lost.
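
Since the true values of the missing cells are unknown, one way to check an imputer locally is to hide some known cells and score the guesses on them. The sketch below does this on synthetic data with a column-mean baseline. The 5% masking rate, the RMSE score and the toy column names are our assumptions, not something stated in the challenge description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
complete = pd.DataFrame(
    rng.normal(size=(500, 4)), columns=[f"F_{i}" for i in range(4)]
)

# Hide about 5% of the known cells so that we have ground truth to score against
mask = rng.random(complete.shape) < 0.05
holed = complete.mask(mask)

# Baseline imputer: column means (any other imputer could be dropped in here)
filled = holed.fillna(holed.mean())

# RMSE over the artificially hidden cells only
errors = (filled.to_numpy() - complete.to_numpy())[mask]
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(f"masked-cell RMSE of mean imputation: {rmse:.3f}")
```

Any candidate imputer can be compared against this baseline by replacing the `fillna` line and keeping the same mask.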
    
    
## Step 1: Import helpful libraries

In [None]:
###############################################################################
#                         Import libraries                                    #
###############################################################################

# Load the libraries
import pandas as pd # To work with datasets
import numpy as np # Math library
import matplotlib.gridspec as gridspec
import seaborn as sns # Graph library that uses matplotlib in the background
import matplotlib.pyplot as plt # To plot some parameters in seaborn
import warnings
# Preparation  
from sklearn.preprocessing import( PowerTransformer, 
                                  StandardScaler,
                                  Normalizer,
                                  RobustScaler,
                                  MaxAbsScaler,
                                  FunctionTransformer,
                                  PolynomialFeatures,
                                  MinMaxScaler,
                                  QuantileTransformer,
                                  LabelEncoder, 
                                  OrdinalEncoder,
                                  KBinsDiscretizer,
                                  OneHotEncoder)
 
from sklearn.neighbors import KNeighborsClassifier

# Imputers (IterativeImputer is experimental and must be enabled first)

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer,IterativeImputer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import make_column_transformer,ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline,FeatureUnion
from sklearn.manifold import TSNE
# Import train_test_split()
# Metrics
from sklearn.metrics import (roc_auc_score, average_precision_score,
                            make_scorer,
                            mean_squared_error,
                            roc_curve,confusion_matrix)
# Date lib
from datetime import datetime, date
from sklearn.linear_model import (ElasticNet, 
                                Lasso,  BayesianRidge,
                                LassoLarsIC,LinearRegression, RidgeCV,
                                LogisticRegression)
# Deep learning with TensorFlow
import tensorflow as tf 
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import LearningRateScheduler
#import smogn
#from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
# Gradient boosting with LightGBM
import lightgbm as lgb
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 
# Model selection
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression,f_classif,chi2
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import mutual_info_classif,VarianceThreshold

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from scipy import stats, optimize, interpolate
from lightgbm import LGBMClassifier
import lightgbm as lgbm
from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from xgboost import XGBClassifier
from sklearn import set_config
from itertools import combinations
# Cluster :
from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import KElbowVisualizer
#import smong 
import category_encoders as ce
import warnings
import optuna 
from joblib import Parallel, delayed
import joblib 
from sklearn import set_config
from tqdm import tqdm 
# Plotly: interactive plotting library
import plotly.express as px
import plotly.offline as py 
py.init_notebook_mode(connected=True) # allows us to work with the offline version of plotly
import plotly.graph_objs as go # plays a role similar to "plt" in matplotlib
import plotly.tools as tls # provides some useful plotly tools
from typing import *
from warnings import simplefilter
from auto_viz_stat import *
simplefilter("ignore", category=RuntimeWarning)
set_config(display='diagram')
warnings.filterwarnings('ignore')

# Utils

In [None]:
###############################################################################
#                        Class and config Utils                               #
###############################################################################
class colors: # You may need to change color settings
    RED = '\033[31m'
    ENDC = '\033[m'
    GREEN = '\033[32m'
    YELLOW = '\033[33m'
    BLUE = '\033[34m'
class clr:
    S = '\033[1m' + '\033[96m'
    E = '\033[0m' 
    
    


## Step 2: Load the data
Complete guide to reading data: 
https://github.com/DeepSparkChaker/CRISPDM_ULTIME/blob/main/CRISPDM_0_StreamlinedDataIngestionWithPandas.ipynb

#### Optimize pandas data frame 
#####  Trick-1 : Faster dataframe processing using modin

I would like to thank Grandmaster @cpmpml for introducing this package. According to the Modin project, it can accelerate Pandas queries by up to 4x on an 8-core machine while only requiring users to change a single line of code in their notebooks. The system has been designed for existing Pandas users who would like their programs to run faster and scale better without significant code changes.
By default, Modin will use all of the CPU cores available on your machine. There may be some cases where you wish to limit the number of CPU cores that Modin can use, especially if you want to use that computing power elsewhere. We can limit the number of CPU cores Modin has access to through an initialization setting in Ray since Modin uses it on the backend.

    import ray
    ray.init(num_cpus=4)
    import modin.pandas as pd
    
When working with big data, it’s not uncommon for the size of the dataset to exceed the amount of memory (RAM) on your system. Modin has a specific flag that we can set to true, which will enable its out of core mode. Out of core basically means that Modin will use your disk as overflow storage for your memory, allowing you to work with datasets far bigger than your RAM size. We can set the following environment variable to enable this functionality:

    export MODIN_OUT_OF_CORE=true

**Conclusion**

So there you have it! Your guide to accelerating Pandas functions using Modin. Very easy to do by changing just the import statement. Hopefully, you find Modin useful in at least a few situations to accelerate your Pandas functions.
ref : 
1. https://modin.readthedocs.io/en/latest/

1. https://www.kaggle.com/general/117063


Next, we'll load the data.


In [None]:
#If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:
#!pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
#pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
#pip install "modin[all]" # Install all of the above

In [None]:
!pip install modin

In [None]:
%%time 
###############################################################################
#                         Load modin                                          #
###############################################################################

import modin.pandas as pdm

In [None]:
#export MODIN_OUT_OF_CORE=true

In [None]:
%%time 
###############################################################################
#                         Load data                                           #
###############################################################################
train_m = pdm.read_csv('../input/tabular-playground-series-jun-2022/data.csv')
#test = pd.read_csv('../input/booking/test.csv')
train_m.shape

In [None]:
type(train_m)

In [None]:
train_m.head(2)

In [None]:
%%time 
###############################################################################
#                        Load data (pandas)                                   #
###############################################################################
import pandas as pd
train = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')

<a id=2></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Data Understanding</center></h3>

# 📖 Exploratory Data Analysis
Exploratory data analysis consists of analyzing the main characteristics of a data set usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before performing further evaluations.


We will analyse the following:

    The target variable
    
    Variable types (categorical and numerical)
    
    Numerical variables
        Discrete
        Continuous
        Distributions
        Transformations

    Categorical variables
        Cardinality
        Rare Labels
        Special mappings

    Null Data

    Text data 
    
    Which columns will we use
    
    Are there outliers that can destroy our algorithm
    
    Are there different ranges of data
    
    Curse of dimm...

##  📖 General View :

Before diving deep into the EDA, I ran some automated visualizations to quickly draw conclusions from the data:
* https://www.kaggle.com/code/bannourchaker/crisp-dm-dataunderstanding-part1-auto-eda
* https://www.kaggle.com/code/bannourchaker/crisp-dm-dataunderstanding-part2-auto-eda-2

I frequently use these libraries when starting the EDA to uncover interesting trends and patterns quickly with minimum code. I hope you will find these libraries interesting and useful too! 

I like to start by asking the following questions:
1. What are the features?
1. What are the expected types (int, float, string, boolean)?
1. Is there obvious missing data (values that Pandas can detect)?
1. Are there other types of missing data that are not so obvious (can’t easily be detected with Pandas)?
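
The four questions above can be answered with a few pandas calls. The sketch below uses a small made-up frame. The `sensor` column and the list of placeholder tokens are hypothetical and only illustrate what "not so obvious" missing data can look like.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "F_1_0": [0.5, np.nan, -1.2, 0.3],
    "F_2_0": [1, 3, 2, 1],
    "sensor": ["ok", "?", "ok", "-999"],  # hypothetical column with disguised gaps
})

print(df.columns.tolist())   # 1. what are the features?
print(df.dtypes)             # 2. which types did pandas infer?
print(df.isna().sum())       # 3. obvious missing data (NaN)

# 4. less obvious missing data: placeholder tokens that pandas treats as valid values
placeholders = ["?", "-999", "", "NA", "none"]
hidden = df.isin(placeholders).sum()
print(hidden)
```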


In [None]:
%%time
###############################################################################
#                        Quick EDA                                            #
###############################################################################
print(clr.S+"Train data shape")
print(clr.S+"Number of Rows:"+clr.E, train_m.shape[0])
print(clr.S+"Number of Columns: "+clr.E, train_m.shape[1])
print(clr.S+"Columns names: "+clr.E, train_m.columns)

In [None]:
%%time
###############################################################################
#                        Quick EDA                                            #
###############################################################################
# info() prints its report directly and returns None, so it is not wrapped in print()
train_m.info()

In [None]:
%%time 
###############################################################################
#                        Quick Viz                                            #
###############################################################################
selected_float_cols = ["F_1_0",'F_1_1',"F_1_2"]
sns.pairplot(train[selected_float_cols+["F_2_1"]].sample(2000), hue="F_2_1", plot_kws=dict(linewidth=1, s=10,  alpha=0.9, edgecolors="face"));

In [None]:
%%time
###############################################################################
#                         General View                                        #
############################################################################### 
display(general_view(train))

In [None]:
%%time
###############################################################################
#                         Cardinality Part 1                                  #
###############################################################################
cardi_col= cardinality(train_m)

In [None]:
%%time
###############################################################################
#                         Cardinality as Data Frame                           #
###############################################################################
var_=pd.DataFrame(cardi_col.items(),  columns=['Variable','Nunique'])
var_.style.set_na_rep("OutofScope")\
                .highlight_null(null_color="orange")\
                .format({"Variable": lambda x:x.upper()})\
                .hide_index()\
                .highlight_max(color='lightgreen')\
                .highlight_min(color='#cd4f39')\
.background_gradient(cmap='Blues')

In [None]:
%%time
###############################################################################
#                         Cardinality Part 2                                  #
###############################################################################
disc_columns=list(train.select_dtypes(include=['int8', 'category', 'int64']))
value_counts_all(train_m, disc_columns)

In [None]:
%%time
###############################################################################
#                         Cardinality Part 2_3                                #
###############################################################################
category_count_pie_plotly(train_m,'F_2_1')

In [None]:
%%time
###############################################################################
#                         Rare   Values                                       #
###############################################################################
# print categories that are present in less than
# 0.01 % of the observations (threshold 0.0001)
for var in tqdm(disc_columns[0:5]):
    print(clr.S+'-'*80+clr.E)
    print(clr.S+ f'Count of rare value for {var} are {analyse_rare_labels(train_m, var, 0.0001).count()}'+clr.E)
    print(clr.S+'-'*80+clr.E)
    print(analyse_rare_labels(train_m, var, 0.0001))
    print()

In [None]:
%%time
###############################################################################
#                        Distribution                                         #
###############################################################################
features = list(train.columns)
features = features[1:len(features)]
F1_features = [feat for feat in features if feat[:3] == "F_1"]
F2_features = [feat for feat in features if feat[:3] == "F_2"]
F3_features = [feat for feat in features if feat[:3] == "F_3"]
F4_features = [feat for feat in features if feat[:3] == "F_4"]

In [None]:
%%time
###############################################################################
#                        Features F_1                                        #
##############################################################################
plt.figure(figsize=(15,15))
for i, f in enumerate(F1_features):
    plt.subplot(5, 5, i+1)
    axs = sns.kdeplot(x = train[f],color="#d90429")
plt.tight_layout()
plt.show()

In [None]:
%%time
###############################################################################
#                        Features F_2                                        #
##############################################################################
plt.figure(figsize=(15,15))
for i, f in enumerate(F2_features):
    plt.subplot(5, 5, i+1)
    axs = sns.countplot(x = train[f],palette="YlOrRd_r")
plt.tight_layout()
plt.show()

In [None]:
%%time
###############################################################################
#                        Features F_3                                         #
##############################################################################

plt.figure(figsize=(15,15))
for i, f in enumerate(F3_features):
    plt.subplot(5, 5, i+1)
    axs = sns.kdeplot(x = train[f],color="#d90429")
plt.tight_layout()
plt.show()

In [None]:
%%time
###############################################################################
#                        Features F_4                                         #
##############################################################################
plt.figure(figsize=(15,15))
for i, f in enumerate(F4_features):
    plt.subplot(5, 5, i+1)
    axs = sns.kdeplot(x = train[f],color="#d90429")
plt.tight_layout()
plt.show()

In [None]:
%%time
###############################################################################
#                        Correlation map                                      #
##############################################################################
plt.figure(figsize=(11,11))
corr=train.iloc[:,1:].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, robust=True, center=0,square=True, linewidths=.6,cmap='YlOrRd_r')
plt.title('Correlation')
plt.show()

In [None]:
%%time
###############################################################################
#                        Correlation Zoom                                     #
###############################################################################
data_raw = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv', index_col="row_id")
F_4 = []
for i in range(15):
    string = "F_4_" + str(i)
    F_4.append(string)
corr = data_raw.loc[:, F_4].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24
fig = plt.figure(tight_layout=True, figsize=(20,10))
spec = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax0 = fig.add_subplot(spec[0, 0])
sns.heatmap(corr, mask=mask, cmap='coolwarm', annot=True, fmt='.2f', cbar=False, ax=ax0)
ax0.tick_params(axis='x', colors='w', labelsize=12, rotation=90)
ax0.tick_params(axis='y', colors='w', labelsize=12, rotation=0)
ax0.set_title("Pearson correlation coefficient", fontsize=14, fontweight ='bold', y=1.01)
ax0 = fig.add_subplot(spec[0, 1])
null = pd.DataFrame(data_raw.loc[:, F_4].isnull().sum())
dtypes = pd.DataFrame(data_raw.loc[:, F_4].dtypes)
data_info = dtypes.merge(null, left_index=True, right_index=True)
data_info.columns = ["type", "nulls"]
data_info.sort_values(by=["nulls"], ascending=False, inplace=True)
data_info["% missing"] = 100*data_info["nulls"]/data_raw.shape[0]
sns.scatterplot(x=data_info.index, y="% missing", data=data_info, ax=ax0)
ax0.set_title("Percentage missing values", fontsize=14, fontweight ='bold', y=1.01)
ax0.tick_params(axis='x', colors='w', labelsize=12, rotation=90)   
plt.suptitle("F_4", ha="center", y=1.03, fontweight ='bold', fontsize=24) 
plt.show()

#### 💡 INSIGHTS

1. From the first plot, one can conclude that the F_2 features are correlated with each other, as are the F_4 features. There is no notable correlation among the other features.
1. In the second plot, I zoomed in on the F_4 feature group. Here we can see that F_4_11 correlates most strongly with F_4_8. Overall, there is some correlation between the F_4 features.
1. From the third plot, one can conclude that all F_4 features have missing values, with F_4_2 having the most.

[ref](https://www.kaggle.com/code/wti200/model-inspection-with-shap) 
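
Reading the strongest pairs off a heatmap by eye is error-prone, so here is a minimal sketch that ranks feature pairs by absolute Pearson correlation. The data is synthetic and the column names only mimic the F_4 naming; it does not reproduce the values of the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "F_4_0": base + rng.normal(scale=0.2, size=300),  # shares a signal with F_4_1
    "F_4_1": base + rng.normal(scale=0.2, size=300),
    "F_4_2": rng.normal(size=300),                    # independent noise
})

corr = demo.corr().abs()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

The same three lines after `demo.corr()` can be applied to the correlation matrix of any feature group.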

In [None]:
del data_raw

In [None]:
import gc 
gc.collect()

# Null Data

How sparse is my data? Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with Pandas you can easily check how many missing values exist in each column.
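
A minimal sketch of that per-column check, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "F_1_0": [0.1, np.nan, 0.3, np.nan],
    "F_2_0": [1, 2, 3, 4],
    "F_3_0": [np.nan, 0.2, 0.5, 0.9],
})

# Count and percentage of missing values per column, most affected first
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": 100 * df.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(summary)
```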

In [None]:
!pip install missingno 

## Visualization of missing values
### Matrix

In [None]:
###############################################################################
#                         Auto  eda                                           #
###############################################################################

import missingno as msno

msno.matrix(train)

In [None]:
%%time
###############################################################################
#                         Missing Value  Part 1                               #
###############################################################################
# summarize the number of rows with missing values for each column
missing_values(train_m)

In [None]:
%%time
###############################################################################
#                         Missing Value  Part 2                               #
###############################################################################
missing_value_reporter(data=train_m, threshold=0.03)

**MISSING VALUES & IMPUTATION**

Missing values might create errors in the analysis.
Our dataset contains a few variables with  missing values. 

This means that to train a machine learning model with this data set, we need to impute the missing data in these variables.
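
As a rough illustration of what imputation can look like, the sketch below fills gaps in a synthetic two-column array with scikit-learn's `SimpleImputer` and `IterativeImputer` and compares their errors on the hidden cells. The data, the 10% missing rate and the default imputer settings are assumptions for the example, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Two strongly related columns: the second is roughly twice the first
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X_full = np.column_stack([x0, 2 * x0 + rng.normal(scale=0.1, size=200)])

# Knock out about 10% of the second column
X = X_full.copy()
hole = rng.random(200) < 0.1
X[hole, 1] = np.nan

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
iter_filled = IterativeImputer(random_state=0).fit_transform(X)


def mean_abs_error(filled):
    """Average absolute error on the cells that were knocked out."""
    return float(np.mean(np.abs(filled[hole, 1] - X_full[hole, 1])))


print(f"mean imputer:      {mean_abs_error(mean_filled):.3f}")
print(f"iterative imputer: {mean_abs_error(iter_filled):.3f}")
```

When columns are correlated, an imputer that models one feature from the others can recover far more than a column mean, which is why the correlation structure found earlier matters.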

We can also visualize the percentage of missing columns  in the variables as follows:

In [None]:
%%time
###############################################################################
#                         Missing Value  Part 4                               #
###############################################################################
# plot
# make a list of the variables that contain missing values
vars_with_na = [var for var in train_m.columns if train_m[var].isnull().sum() > 0]
train_m[vars_with_na].isnull().mean().sort_values(ascending=False).plot.bar(figsize=(10, 4))
plt.ylabel('Percentage of missing data')
plt.axhline(y=0.03, color='r', linestyle='-')
plt.axhline(y=0.018, color='g', linestyle='-')
plt.show()
#vars_with_na

In [None]:
ncounts = pd.DataFrame([train.isna().mean()
                        #, test.isna().mean()
                       ]
                      ).T
ncounts = ncounts.rename(columns={0: "train_missing", 1: "test_missing"})

ncounts.query("train_missing > 0").plot(
    kind="barh", figsize=(15, 15), title="% of Values Missing"
)
plt.show()

 ### Understanding the number of missing values per observation
  [ref ](https://medium.com/mlearning-ai/best-known-techniques-for-data-scientist-to-handle-missing-null-values-in-any-tabular-dataset-3a9f71c9486)

In [None]:
tt = pd.concat([train,
                #test
               ] 
              ).reset_index(drop=True).copy()

tt["n_missing"] = tt.isna().sum(axis=1)
tt["n_missing"].value_counts().plot(
    kind="bar", title="Number of Missing Values per Sample"
)

### Checking for imbalance in missing values for categorical features

In [None]:
cat_features = ["F_2_0",'F_2_1']
tt.groupby("F_2_0")["n_missing"].mean()

In [None]:
tt.groupby("F_2_1")["n_missing"].agg(['mean','count'])

In [None]:
%%time
###############################################################################
#                         numerical Vs categorical  Part 5                    #
###############################################################################
# now we can determine which variables, from those with missing data,
# are numerical and which are categorical
cat_vars = [var for var in train_m.columns if var  in train_m.select_dtypes(include='category').columns]
bool_vars = [var for var in train_m.columns if train_m[var].dtype == 'bool']
num_vars = [var for var in train_m.columns if  var  in train_m.select_dtypes(exclude=['category','bool']).columns]
cat_na = [var for var in cat_vars if var in vars_with_na]
num_na = [var for var in num_vars if var in vars_with_na]
print(clr.S+'Categorical variables with na: ', cat_na)
print(clr.S+'Numerical variables with na: ', num_na)
print(clr.S+'Number of categorical variables with na: ', len(cat_na))
print(clr.S+'Number of numerical variables with na: ', len(num_na))

In [None]:
%%time
###############################################################################
#                      Relation between missing value and the target          #
###############################################################################
# make a list of the variables that contain missing values
vars_with_na = [var for var in train_m.columns if train_m[var].isnull().sum() > 0]
def analyse_na_value(df: pd.DataFrame, var: str, target: str) -> None:
    # work on a copy of the dataframe, so that we do not override the original data
    # see the link for more details about pandas.copy()
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
    df = df.copy()
    # let's make an interim variable that indicates 1 if the
    # observation was missing or 0 otherwise
    df[var] = np.where(df[var].isnull(), 1, 0)
    # compare the mean and standard deviation of the target between
    # the observations where data is missing (1) and where it is available (0),
    # and capture the results in a temporary dataset
    tmp = df.groupby(var)[target].agg(['mean', 'std'])
    print(tmp)
    # plot into a bar graph
    tmp.plot(kind="barh", y="mean", legend=True,
             xerr="std", title=target, color='green')
    plt.show()

# let's run the function on a few of the variables with missing data
target = 'F_2_1'
for var in vars_with_na[7:12]:
    # compare strings by value (!=), not by identity (is not)
    if var != target:
        analyse_na_value(train_m, var, target)

In this example we can see that F_2_1 is not influenced by the missing values.

##  Duplicates Data 

In [None]:
%%time
###############################################################################
#                      Duplicated data                                        #
###############################################################################
display(train_m.duplicated().sum())

In [None]:
gc.collect()


<a id=3></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Data Preparation</center></h3>

## Data preprocessing

Data preprocessing comes after you've cleaned up your data and after you've done some exploratory analysis to understand your dataset. Once you understand your dataset, you'll probably have some idea about how you want to model your data. Machine learning models in Python require numerical input, so if your dataset has categorical variables, you'll need to transform them. Think of data preprocessing as a prerequisite for modeling.


## Missing Values  :

- A Simple Option: Drop Columns with Missing Values

-  Replacing missing values with constants        
    
-  A Better Option: Imputation

Imputation fills in the missing value with some number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.

- An Extension To Imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. 
                    
A popular approach to missing data imputation is to use a model to predict the missing values. This requires a model to be created for each input variable that has missing values. Although any one among a range of different models can be used to predict the missing values, the k-nearest neighbor (KNN) algorithm has proven to be generally effective, often referred to as **“nearest neighbor imputation.”**

- Iterative imputation

One approach to imputing missing values is to use an iterative imputation model.
Iterative imputation refers to a process where each feature is modeled as a function of the other features, e.g. a regression problem where missing values are predicted. Each feature is imputed sequentially, one after the other, allowing prior imputed values to be used as part of a model in predicting subsequent features.

It is iterative because this process is repeated multiple times, allowing ever improved estimates of missing values to be calculated as missing values across all features are estimated.
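
The idea can be sketched with scikit-learn's `IterativeImputer`. This is only an illustration on a made-up array (`X_demo` is hypothetical), not the method used later in this notebook. The estimator is still marked experimental in scikit-learn, which is why the extra `enable_iterative_imputer` import is needed:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is twice the first
X_demo = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each column with missing values is regressed on the other column,
# and the round-robin is repeated for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```

Observed entries are left untouched; only the `NaN` cells are replaced by model predictions.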

## Numerical Features : Scaling 


## Cat Features: Encoding
   
    

## Outlier Handling


# Feature Engineering

Feature engineering is the act of taking raw data and extracting features from it that are suitable for tasks like machine learning. Most machine learning algorithms work with tabular data. When we talk about features, we are referring to the information stored in the columns of these tables 
 

# Features selection : 

**Feature Selection**
Feature selection is a method of selecting features from your feature set to be used for modeling. It draws from a set of existing features, so it's different from feature engineering because it doesn't create new features. The overarching goal of feature selection is to improve your model's performance. Perhaps your existing feature set is much too large, or some of the features you're working with are unnecessary. There are different ways you can perform feature selection. It's possible to do it in an automated way. Scikit-learn has several methods for automated feature selection, such as choosing a variance threshold and using univariate statistical tests.
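
A minimal sketch of those two scikit-learn options, run on synthetic data (`X_demo` and `y_demo` are made up for illustration; the threshold and `k` are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.3, size=n),  # informative feature
    rng.normal(size=n),                      # pure noise
    np.ones(n),                              # constant feature (zero variance)
])

# Step 1: drop features whose variance is not above the threshold
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X_demo)

# Step 2: keep the single feature with the highest ANOVA F-score
skb = SelectKBest(score_func=f_classif, k=1)
X_best = skb.fit_transform(X_var, y_demo)

print("kept by variance threshold:", vt.get_support())
print("kept by univariate test:   ", skb.get_support())
```

`get_support()` returns a boolean mask over the input columns, which makes it easy to map the result back to column names.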

# Data Wrangling, Curation
It’s the start of a new project and you’re excited to apply some machine learning models.
You take a look at the data and quickly realize it’s an absolute mess.
According to IBM Data Analytics you can expect to spend up to 80% of your time cleaning data.

![image.png](attachment:ef100a23-1bfc-4988-b0a9-1284fbc1c4ff.png)

## Sources of Missing/Wrong Values
Before we dive into code, it’s important to understand the sources of missing data. Here are some typical reasons why data is missing:

* User forgot to fill in a field.
* Data was lost while transferring manually from a legacy database.
* There was a programming error.
* Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.



As you can see, some of these sources are just simple random mistakes. Other times, there can be a deeper reason why data is missing.
It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.

Here I will just put the data in the right format; all preprocessing steps will be done in the next notebook.
I will show an example of how data can be wrong:

Fortunately, our data is in the right format and does not need a lot of work.

## Curation

In [None]:
%%time 
################################################################################
#                         Wrangling                                           #
###############################################################################
def data_prep(data:pd.DataFrame, type_data:str='train')->pd.DataFrame:
    train=data.copy()
    return train

#train=data_prep(train)
#test=data_prep(test,'test')


In [None]:
%%time 
###############################################################################
#                         Cardinality After curation & preparation            #
###############################################################################

In [None]:
%%time
###############################################################################
#                  Columns types  before reducing memory                      #
###############################################################################
int_features = list(train_m.select_dtypes(include='int').columns)
#int_features.remove('row_id')
float_features = list(train_m.select_dtypes(include='float').columns)
object_features = list(train_m.select_dtypes(include='object').columns)
print(clr.S+"Int features:"+clr.E, clr.S+f'{colors.RED} {int_features} {colors.ENDC}')
print(clr.S+"Float features:"+clr.E, clr.S+f'{colors.RED}{float_features}{colors.ENDC}')
print(clr.S+"Object features:"+clr.E, clr.S+f'{colors.RED}{object_features}{colors.ENDC}')

# Reduce Memory: 
Be careful: downcasting floats to `float16` loses precision, so do not use this on features that need high resolution.
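
To get a feel for how much precision is at stake, here is a small sketch with an arbitrary example value (not taken from the dataset) cast to `float16` and `float32`:

```python
import numpy as np

value = 0.123456789  # arbitrary example value
as_f16 = float(np.float16(value))
as_f32 = float(np.float32(value))

print(f"float16: {as_f16!r}, error {abs(as_f16 - value):.2e}")
print(f"float32: {as_f32!r}, error {abs(as_f32 - value):.2e}")
```

`float16` keeps only about three significant decimal digits, which may matter when the imputed values themselves are scored.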

In [None]:
%%time
###############################################################################
#                         Reduce Memory                                       #
###############################################################################
# Author : https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df:pd.DataFrame)->pd.DataFrame:
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in tqdm(df.columns):
        col_type = df[col].dtype
        name =df[col].dtype.name 
        
        if col_type != object and col_type.name != 'category':
        #if name != "category":    
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

#test= reduce_mem_usage(test)
train= reduce_mem_usage(train)
#train= reduce_mem_usage(train)
#test= reduce_mem_usage(test)

## Correct Dtype

In [None]:
#train['f_29']=train['f_29'].astype('category')


## Variable Types
Next, let's identify the categorical , boolean and numerical variables

In [None]:
%%time
###############################################################################
#                         Columns types  After  reducing memory               #
###############################################################################
# let's identify the categorical variables
# we will capture those of type *category*
cat_vars = [var for var in train_m.columns if var  in train_m.select_dtypes(include='category').columns]
bool_vars = [var for var in train_m.columns if train_m[var].dtype == 'bool']
num_vars = [var for var in train_m.columns if  var  in train_m.select_dtypes(exclude=['category','bool']).columns]
int_features = list(train_m.select_dtypes(include='int8').columns)
#int_features.remove('id')
float_features = list(train_m.select_dtypes(include='float16').columns)
object_features = list(train_m.select_dtypes(include='object').columns)
print(clr.S+"Int features:"+clr.E, clr.S+f'{colors.RED} {int_features} {colors.ENDC}')
print(clr.S+'*'*85+clr.E)
print(clr.S+"Float features:"+clr.E, clr.S+f'{colors.RED}{float_features}{colors.ENDC}')
print(clr.S+'*'*85+clr.E)
print(clr.S+"Object features:"+clr.E, clr.S+f'{colors.RED}{object_features}{colors.ENDC}')
# number of categorical variables
print(clr.S+'*'*85+clr.E)
print(clr.S+ f' Cat  features are :\n {colors.RED} {cat_vars} {colors.ENDC}\n')
print(clr.S+'*'*85+clr.E)
print(clr.S+f' Bool  features are :\n {colors.RED} {bool_vars} {colors.ENDC}\n')
print(clr.S+'*'*85+clr.E)
print(clr.S+ f' Num   features are :\n {colors.RED} {num_vars} {colors.ENDC}\n')

# Feature Engineer: 
## Basic  Features 

In [None]:
%%time 
###############################################################################
#                        Features engineer                                    #
###############################################################################

## Magic Features

In [None]:
%%time 
###############################################################################
#                         Magic Features                                      #
###############################################################################
# Because sometimes there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict.
#train['MISS'] = train.isna().sum(axis=1).apply(lambda x: 0 if x==0 else 1)
train['N_MISS'] = train.isnull().sum(axis=1) # For rows

In [None]:
train['N_MISS'].value_counts()

In [None]:
###############################################################################
#                         Check Cardinality :                                 #
###############################################################################
for i, var in tqdm(enumerate(train.columns)):
    if train[var].nunique()<=36 : 
        print(f'{colors.GREEN} {var} {colors.ENDC} HAS  {colors.RED}{train[var].nunique()} {colors.ENDC} UNIQUE values and they are :{colors.RED}{train[var].unique()} {colors.ENDC}' )
    else : 
        print(f'{colors.GREEN} {var} {colors.ENDC} HAS  {colors.RED}{train[var].nunique()} {colors.ENDC} UNIQUE values')

## Convert Dtypes :

In [None]:
###############################################################################
#                         Convert Dtypes :                                    #
###############################################################################
train[train.select_dtypes(['float16','int16','int8']).columns] = train[train.select_dtypes(['float16','int16','int8']).columns].apply(pd.to_numeric)
train[train.select_dtypes(['object','category']).columns] = train.select_dtypes(['object','category']).apply(lambda x: x.astype('category'))

## Define the model features and target

### Extract X and y 

In [None]:
###############################################################################
#                        Extract features and Target                          #
###############################################################################
le = LabelEncoder()
train['F_2_0'] = le.fit_transform(train['F_2_0'])
target= "F_2_0"
X = train.drop(['F_2_0'], axis='columns')# axis=1
y = train[target]
print(le.classes_.tolist())

## Num/Cat Features
We should extract them and see what we should do for each one.

In [None]:
###############################################################################
#                        Cat_columns                                          #
###############################################################################
cat_columns = X.select_dtypes(include=['category','object','bool']).columns
cat_columns

In [None]:
###############################################################################
#                        Num_columns                                          #
###############################################################################
num_columns = X.select_dtypes(exclude=['category','object','bool']).columns
num_columns

## Target Class distribution

In [None]:
###############################################################################
#                        Plot Target Distribution                             #
###############################################################################
labels = train['F_2_0'].astype('category').cat.categories.tolist()
counts = train['F_2_0'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True) #autopct is show the % on plot
ax1.axis('equal')
plt.show()


# Create Test and Train groups

Now that we’ve got our dataframe ready, we can split it into the train and test datasets for our model to use. We’ll use the Scikit-Learn train_test_split() function for this. By passing in the X dataframe of raw features, the y series containing the target, and the size of the test group (here 0.3 for 30%), we get back the X_train, X_test, y_train and y_test data to use in the model.


In [None]:
###############################################################################
#        Split the dataset and labels into training and test sets             #
###############################################################################
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0,stratify=y )
print(f" {colors.BLUE}{X_test.shape[0]} {colors.ENDC} rows in test set vs. {colors.BLUE}{ X_train.shape[0]} {colors.ENDC}in training set. {colors.BLUE}{X_test.shape[1]}{colors.ENDC} Features.")

# What should we do for each column

**Separate features by dtype**

Next we’ll separate the features in the dataframe by their datatype. There are a few different ways to achieve this. I’ve used the select_dtypes() function to obtain specific data types by passing in np.number to obtain the numeric data and exclude=[np.number] to return the categorical data. Appending .columns to the end returns an Index containing the column names. For the categorical features, we don’t want to include the target column, so I’ve dropped that.

In [None]:
###############################################################################
#                         Check that we handle all columns                    #
###############################################################################
all_columns = (num_columns.append(cat_columns))
print(set(X.columns.tolist()).difference(all_columns))
assert set(X.columns.tolist())==set(all_columns)
assert len(X.columns.tolist())==len(all_columns)

# cross_validation_design
## A quick explanation as follows:

**Cross Validation:** Splits the data into k "random" folds.

**Stratified Cross Validation:** Splits the data into k folds, making sure each fold is an appropriate representative of the original data (class distribution, mean, variance, etc.).

Cross validation is a very important concept, not only for choosing the best models but also for general evaluation.
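
The difference is easiest to see on imbalanced labels. The sketch below uses made-up labels (`y_demo` is hypothetical) and deliberately compares against an unshuffled `KFold`, which is the worst case for sorted data:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 zeros followed by 10 ones
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
strat_counts = [int(y_demo[test_idx].sum()) for _, test_idx in skf.split(X_demo, y_demo)]

kf = KFold(n_splits=5, shuffle=False)
plain_counts = [int(y_demo[test_idx].sum()) for _, test_idx in kf.split(X_demo)]

print("positives per stratified fold:", strat_counts)
print("positives per plain fold:     ", plain_counts)
```

With stratification every validation fold receives the same share of the minority class, whereas the plain split can leave some folds without any positive example.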

In [None]:
cross_validation_design = StratifiedKFold( n_splits=3,
                                           shuffle=True
                                           ,random_state=1)
cross_validation_design

 

## Trick-2: Speedup your model with scikit-learn-intelex
This is a nice trick to speed up your model without changing your sklearn-based model. All we need to do is add two lines of code to import the library and patch it. I would like to thank Devlikamov Vlad for introducing this library.

- https://github.com/intel/scikit-learn-intelex/blob/master/examples/notebooks/random_forest_yolanda.ipynb

- https://www.kaggle.com/code/lordozvlad/spaceship-titanic-fast-kernel-using-sklearnex

- https://www.kaggle.com/code/lordozvlad/fast-feature-importance-using-scikit-learn-intelex

- https://www.kaggle.com/code/lordozvlad/tps-jan-fast-pycaret-with-scikit-learn-intelex

## Experiment with some basic scalers/transformers/feature engineering: 
I usually try this if the data set is small, because this task is really time consuming.
This step gives me the best start for choosing the best preprocessing steps.


In [None]:
! pip install scikit-learn-intelex

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

## Preprocess Missing values :Best Techniques 


Before deleting the missing values, we should know the following concepts. There are three types of missing values:
1. Missing Completely at Random (MCAR)- ignorable

    It is the highest level of randomness. This means that the missing values in any features are not dependent on any other features values. This is the desirable scenario in case of missing data.
    
1. Missing at Random (MAR) - ignorable

    This means that the missing values in any feature are dependent on the values of other features.
    
1. Missing Not at Random (MNAR) - Not ignorable

    Missing not at random data is a more serious issue and in this case, it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?


With data Missing Completely at Random (MCAR), we can drop the missing values upon their occurrence, but with Missing at Random (MAR) and Missing Not at Random (MNAR) data, this could potentially introduce bias to the model. Moreover, dropping MCAR values may seem safe at first, but by dropping the samples we are still reducing the size of the dataset. It is always better to keep the values than to discard them; in the end, the amount of data plays a very important role in a data science project and its outcome.
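
The difference between MCAR and MAR can be illustrated with a small simulation. Everything below is synthetic (the frame `demo`, the columns `x` and `y`, and the missingness rates are assumptions chosen for illustration). The check compares the mean of the observed column `x` between rows where `y` is missing and rows where it is present:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({"x": rng.normal(size=n), "y": rng.normal(size=n)})

# MCAR: every y value has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10
# MAR: the chance that y is missing depends on the observed x (18% if x > 0, else 2%)
mar_mask = rng.random(n) < np.where(demo["x"] > 0, 0.18, 0.02)

def x_gap(mask):
    """Mean of x where y is missing minus mean of x where y is observed."""
    return demo.loc[mask, "x"].mean() - demo.loc[~mask, "x"].mean()

gap_mcar = x_gap(mcar_mask)
gap_mar = x_gap(mar_mask)
print(f"MCAR gap in x: {gap_mcar:+.3f}")
print(f"MAR  gap in x: {gap_mar:+.3f}")
```

Under MCAR the gap stays close to zero, while under MAR it is clearly positive. MNAR depends on the unobserved values themselves, so a check of this kind on observed columns cannot reveal it.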


Missing values should only be deleted or ignored if they are not of the last type, MNAR. To understand more about these, I recommend reading these interesting answers on [stackexchange](https://stats.stackexchange.com/questions/23090/distinguishing-missing-at-random-mar-from-missing-completely-at-random-mcar), especially the second answer by Mr. Wayne.

**What to do with the missing values?**
 
Now that we have identified the missing values in our data, next we should check the extent of the missing values to decide the further course of action.

**Ignore the missing values**

Missing data under 10% for an individual case or observation can generally be ignored, except when the missing data is MAR or MNAR.
The number of complete cases, i.e. observations with no missing data, must be sufficient for the selected analysis technique if the incomplete cases are not considered.

**Drop the missing values**

**Dropping a variable**

If the data is MCAR or MAR and the number of missing values in a feature is very high, then that feature should be left out of the analysis. If missing data for a certain feature or sample is more than 5% then you probably should leave that feature or sample out.
If the cases or observations have missing values for target variables(s), it is advisable to delete the dependent variable(s) to avoid any artificial increase in relationships with independent variables.
[ Great ref.](https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184)
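
A threshold rule of this kind can be sketched as follows. The frame `demo` and the cut-off of 50% are hypothetical and chosen only so that the toy example has something to drop; the 5% rule of thumb above would be applied the same way:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
    "few_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

threshold = 0.5  # hypothetical cut-off: drop columns with more than 50% missing
missing_share = demo.isna().mean()
kept = demo.loc[:, missing_share <= threshold]
print(kept.columns.tolist())
```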

### 1) Dropping the missing values
#### a) Dropping the row where there are missing values

This option should be used when other methods of handling the missing values are not useful. In our example, there was only one row with no missing values at all, so only that row was retained when we used the dropna() function.

**Pros:**

1. A model trained with the removal of all missing values creates a robust model.

**Cons:**

1. Loss of a lot of information.
1. Works poorly if the percentage of missing values is excessive in comparison to the complete dataset. 

In [None]:
non_missing_data = train.dropna()
non_missing_data.shape

#### b) Dropping the entire row/column only when there are multiple missing values in the row

As we have seen, the previous method of dropping the entire row even when there is only a single missing value is a little harsh. Instead, we can specify a threshold number of non-missing values that a row must have to be kept. Suppose we want to drop a row only if it has fewer than 2 non-missing values; then we use the following code:


In [None]:
row_missing_data = train.dropna(axis=0,thresh=2)
row_missing_data.shape

#### c) Dropping the entire column


In [None]:
cdata= train.drop(['row_id','F_1_0','F_1_1'], axis=1)
cdata.head(2)

### 2) Imputing the missing values

####  a) Replacing with a given value

#####  i) Replacing with a given number, let us say with 0.

In [None]:
cdata['F_1_2'] = cdata['F_1_2'].fillna(0)

##### ii) Replacing with a string, let us say with 'Tunis'.


In [None]:
cdata['F_1_3'] = cdata['F_1_3'].fillna(0)

Replacing the missing values with a string can be useful when we want to treat missing values as a separate level. Note that F_1_3 is a numeric column, so the cell above fills it with 0 rather than with a string.
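
For a genuinely categorical column, the idea could look like the sketch below. The `city` series and the level name `"Missing"` are made up for illustration. Note that a `category` column must have the new level registered before it can be used as a fill value:

```python
import pandas as pd

# Hypothetical categorical column with gaps
city = pd.Series(["Tunis", None, "Sfax", None], dtype="object")

# object dtype: any string can be used directly
filled_obj = city.fillna("Missing")

# category dtype: add the new level first, then fill
cat = city.astype("category")
filled_cat = cat.cat.add_categories(["Missing"]).fillna("Missing")

print(filled_obj.tolist())
print(filled_cat.cat.categories.tolist())
```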

#### b) Replacing with mean: 
This is a common method of imputing missing values. **However, in the presence of outliers**, this method may lead to erroneous imputations. In such cases, the median is a more appropriate measure of central tendency. If for some reason you have to use mean values for imputation, then treat the outliers before imputing.

In [None]:
cdata['F_1_3'] = cdata['F_1_3'].fillna(( cdata['F_1_3'].mean()))

#### c) Replacing with Median: 
As median is a position based measure of central tendency (middle most item), this method is not affected by presence of outliers.
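
A tiny made-up series shows how strongly a single outlier moves the mean while leaving the median alone (the values are arbitrary):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme outlier and one gap
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan])

mean_value = s.mean()      # NaN is skipped; pulled up by the outlier
median_value = s.median()  # NaN is skipped; unaffected by the outlier

filled_mean = s.fillna(mean_value)
filled_median = s.fillna(median_value)
print(mean_value, median_value)
```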

In [None]:
cdata['F_1_3'] = cdata['F_1_3'].fillna(( cdata['F_1_3'].median()))

#### d) Replacing with Mode:

Mode is the measure of central tendency for nominal scale data.

Replacing with the mode is a little bit trickier because, unlike mean and median, mode() returns a Series rather than a single value. Why? Because if there are two modal values, pandas returns both of them.

For example, let us say our data set is ['A', 'A', 'B', 'C', 'C'].
Here both 'A' and 'C' are the modes as they are repeated an equal number of times. Hence mode() returns a Series containing 'A' and 'C', not a single value.


When replacing with the mode, we therefore need to use mode()[0] to pick the first modal value, as shown in the code below.
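
The bimodal example from the text can be reproduced directly (the series `s` is the toy data set above with one missing entry appended):

```python
import pandas as pd

s = pd.Series(["A", "A", "B", "C", "C", None])

modes = s.mode()              # Series holding every modal value, sorted
filled = s.fillna(modes[0])   # use the first modal value as the fill value

print(modes.tolist())
print(filled.tolist())
```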

In [None]:
cdata['F_1_3'] = cdata['F_1_3'].fillna(( cdata['F_1_3'].mode()[0]))

#### e) Replacing with previous value - Forward fill

In time series data, replacing with nearby values will be more appropriate than replacing it with mean. Forward fill method fills the missing value with the previous value. For better understanding, I have shown the data column both before and after 'ffill'.

![image.png](attachment:2d861651-6b7f-4b92-9e57-8c93c52b5ff0.png)

In [None]:
#cdata['F_1_3'] = cdata['F_1_3'].ffill()  # fillna(method='ffill') is deprecated in recent pandas

#### f) Replacing with next value - Backward fill

Backward fill uses the next value to fill the missing value. You can see how it works in the following example.

![image.png](attachment:7233fb28-f255-4c64-9e93-1a980b19b87c.png)

In [None]:
#cdata['F_1_3'] = cdata['F_1_3'].bfill()  # fillna(method='bfill') is deprecated in recent pandas

#### g) Replacing with average of previous and next value

In time series data, the average of the previous and next values is often a better estimate of the missing value. Use the following code to achieve this. The following picture shows how this method works.

![image.png](attachment:facceac8-c921-4e46-bf96-a3d354af7caf.png)

In [None]:
#cdata['F_1_3'] = pd.concat([cdata['F_1_3'].ffill(), cdata['F_1_3'].bfill()]).groupby(level=0).mean()

#### h) Interpolation

Similar results can be achieved using interpolation. Interpolation is very flexible with different methods of interpolation such as the default 'linear' (average of ffill and bfill was similar to linear), 'quadratic', 'polynomial' methods ([more about this](https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html)).
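
The relationship between forward fill, backward fill, their average and linear interpolation can be checked on a short made-up series. The two coincide here because every gap is a single value on an evenly spaced index; for longer gaps they generally differ:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two isolated gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

forward = s.ffill()                         # previous value
backward = s.bfill()                        # next value
neighbour_mean = (forward + backward) / 2   # average of previous and next
linear = s.interpolate()                    # default method is 'linear'

print(pd.DataFrame({
    "ffill": forward,
    "bfill": backward,
    "mean": neighbour_mean,
    "linear": linear,
}))
```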



In [None]:
cdata['F_1_3']=cdata['F_1_3'].interpolate()


####  i) Model based imputation

We can impute the missing values using model-based imputation methods. A popular one is imputation using K-nearest neighbors (KNN) (see the Schmitt et al. paper on Comparison of Six Methods for Missing Data Imputation).

KNN is useful for predicting missing values in both continuous and categorical data (for categorical data, Hamming distance is used).

Even within nearest-neighbor methods, there are three approaches (Tim's answer on StackExchange):

- NN with one single neighbor (1NN)
- k neighbors without weighting (kNN)
- k neighbors with weighting (wkNN) (see the paper Nearest neighbor imputation algorithms: a critical evaluation by Beretta and Santaniello)

If you are interested in how to run KNN-based imputation, you can click [here](https://stackoverflow.com/questions/45321406/missing-value-imputation-in-python-using-knn) for examples in Python.
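
As a rough sketch, the three variants map onto scikit-learn's `KNNImputer` through its `n_neighbors` and `weights` parameters. The array `X_demo` is a toy example, and this mapping is an illustration rather than the exact algorithms evaluated in the cited paper:

```python
import numpy as np
from sklearn.impute import KNNImputer

X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

one_nn = KNNImputer(n_neighbors=1).fit_transform(X_demo)                   # 1NN
knn = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_demo)   # kNN
wknn = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_demo) # wkNN

print("1NN:\n", one_nn)
print("kNN:\n", knn)
print("wkNN:\n", wknn)
```

With `weights="distance"` closer neighbours contribute more to the imputed value, so the results of kNN and wkNN can differ for the same `n_neighbors`.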

In [None]:
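import numpy as np
from sklearn.impute import KNNImputer

# Small example matrix with two missing entries
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
# Each missing entry is replaced by the mean of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))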
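
The three variants listed above (1NN, kNN, wkNN) can be compared directly with scikit-learn's `KNNImputer`. This is only a sketch on the same small matrix: `n_neighbors` and `weights` are real `KNNImputer` parameters, but the matrix is illustrative and the choice of `k` is an assumption.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]], dtype=float)

# 1NN: copy the value from the single closest row
single = KNNImputer(n_neighbors=1).fit_transform(X)
# kNN: plain average over the k closest rows
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
# wkNN: average weighted by inverse distance, so closer rows count more
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Imputed value for the missing entry in the first row, third column
print(single[0, 2], uniform[0, 2], weighted[0, 2])
```

The weighted estimate lands between the 1NN and the unweighted kNN estimates, because the nearest donor row gets the larger weight.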

In [None]:
gc.collect()

#### Part 2: Imputation Using k-NN
k-nearest neighbours is a simple algorithm used for classification and regression. It uses "feature similarity" to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is useful for missing values, because we can find the k closest neighbours of the observation with missing data and then impute the gaps from the non-missing values in that neighbourhood. Let's see some example code using the Impyute library, which provides a simple way to use KNN for imputation:

In [None]:
#!pip install impyute 

In [None]:
import sys
#from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000)  # Increase the Python interpreter's recursion limit (not an OS setting)

# start the KNN training
#imputed_training=fast_knn(train.iloc[:2000,1:].values, k=30)

**Pros:**

* Can be much more accurate than mean, median, or most-frequent imputation (depending on the dataset).

**Cons:**

* Computationally expensive: KNN works by storing the whole training dataset in memory.
* K-NN is quite sensitive to outliers in the data (unlike SVM).

#### Imputation Using Multivariate Imputation by Chained Equation (MICE)


This type of imputation works by filling in the missing data multiple times. Multiple imputation (MI) is generally preferable to a single imputation because the variation between the imputed datasets reflects the uncertainty about the missing values. The chained equations approach is also very flexible and can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns. For more information on the algorithm mechanics, you can refer to [this article](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779).

In [None]:
#from impyute.imputation.cs import mice

# start the MICE training
#imputed_training=mice(train.iloc[:2000,1:].values)
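
As a rough illustration of "filling the missing data multiple times", the sketch below uses scikit-learn's `IterativeImputer` with `sample_posterior=True` as a stand-in for a MICE implementation. It is not the Impyute `mice` function, and the synthetic data, the number of imputations (5), and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer

# Synthetic data: the third column depends on the first two
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries at random
mask = rng.random(X_full.shape) < 0.1
X_missing = np.where(mask, np.nan, X_full)

# Draw several imputed datasets; each seed samples different plausible values
imputations = np.stack([
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
])

pooled = imputations.mean(axis=0)  # one combined estimate per cell
spread = imputations.std(axis=0)   # disagreement between imputations = uncertainty
print(spread[mask].mean())
```

The spread is zero for observed cells and positive for imputed ones, which is the uncertainty information a single imputation cannot provide.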



### Prediction of missing values:
The earlier methods do not take advantage of the correlation between the variable containing the missing value and the other variables. Instead, the features that have no nulls can be used as predictors for the missing values.

A regression or a classification model can be used for the prediction, depending on the nature (continuous or categorical) of the feature that has missing values.

In [None]:
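#from sklearn.linear_model import LinearRegression
#test_data = train[train["F_1_0"].isnull()]   # rows where the target column is missing
#train_data = train.dropna()                   # complete rows only
#y_train = train_data["F_1_0"]
#X_train = train_data.drop("F_1_0", axis=1)
#X_test = test_data.drop("F_1_0", axis=1).fillna(X_train.mean())  # predictors must not contain NaN
#model = LinearRegression()
#model.fit(X_train, y_train)
#y_pred = model.predict(X_test)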
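
The same idea as a runnable sketch on synthetic data. The column names `a`, `b`, and `target` are hypothetical, and the linear relationship is built in on purpose so that the regression has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic frame: 'target' is a noisy linear function of 'a' and 'b'
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
df["target"] = 2.0 * df["a"] - df["b"] + rng.normal(scale=0.05, size=100)
truth = df["target"].copy()

# Hide the first 10 target values to simulate missing data
df.loc[df.index[:10], "target"] = np.nan

# Fit on rows where the target is known, predict where it is missing
known = df[df["target"].notna()]
unknown = df[df["target"].isna()]
model = LinearRegression().fit(known[["a", "b"]], known["target"])
df.loc[unknown.index, "target"] = model.predict(unknown[["a", "b"]])

print((df["target"] - truth).abs().max())
```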

####  Pros:

* Often gives a better result than the earlier methods.
* Takes into account the covariance between the column with missing values and the other columns.

####  Cons:

* The predicted values are only a proxy for the true values.

# Imputation using Deep Learning Library — Datawig
This method works very well with categorical, continuous, and non-numerical features. Datawig is a library that trains deep neural network models to impute missing values in a dataframe.

[ref ](https://github.com/awslabs/datawig/blob/master/docs/source/userguide.rst)

In [None]:
!pip install datawig

In [None]:
import datawig
train = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')

# Split on the column we want to impute: rows where it is missing are the
# prediction set, rows where it is observed are the training set
df_test = train[train['F_1_5'].isnull()]
train_data = train[train['F_1_5'].notnull()]

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns = ['F_1_0', 'F_1_1'], # column(s) containing information about the column we want to impute
    output_column = 'F_1_5', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )
# Fit an imputer model on the train data
imputer.fit(train_df=train_data, num_epochs=2)
# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

In [None]:
gc.collect()

**Pros:**

* Quite accurate compared to other methods.
* It has some functions that can handle categorical data (Feature Encoder).
* It supports CPUs and GPUs.

**Cons:**

* Single-column imputation.
* Can be quite slow with large datasets.
* You have to specify the columns that contain information about the target column that will be imputed.

## Auto Impute
[autoimpute](https://kearnz.github.io/autoimpute-tutorials/) 

[missingpy](https://pypi.org/project/missingpy/#description)

In [None]:
#! pip install missingpy

In [None]:
#import sklearn.neighbors._base
#import sys
#sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

In [None]:
# Let X be an array containing missing values
#from missingpy import MissForest
#imputer = MissForest()
#X_imputed = imputer.fit_transform(train.iloc[:2000,1:].values)
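
If `missingpy` cannot be installed, a similar forest-based imputation can be approximated with scikit-learn alone. The sketch below is not MissForest itself: it plugs a `RandomForestRegressor` into `IterativeImputer`, and the synthetic data and hyperparameters (`n_estimators=20`, `max_iter=5`) are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer

# Synthetic data with a non-linear relation between columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] * X[:, 1]

# Remove roughly 10% of the entries
mask = rng.random(X.shape) < 0.1
X_missing = np.where(mask, np.nan, X)

# Each column with gaps is modelled in turn by a random forest on the other columns
forest_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = forest_imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())
```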

### Sklearn Iterative Imputer

    import numpy as np
    import pandas as pd
    # IterativeImputer is experimental: the enabling import must come first
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer

    data = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv', index_col = 'row_id')
    submission = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv', index_col = 'row-col')

    # Start from the column medians, then refine each column from the others
    iterative_imputer = IterativeImputer(initial_strategy='median')
    data[:] = iterative_imputer.fit_transform(data)

    # Each submission key is '<row_id>-<column name>'; look up the imputed value
    for i in submission.index:
        row = int(i.split('-')[0])
        col = i.split('-')[1]
        submission.loc[i,'value'] = data.loc[row,col]

    submission.to_csv('submission.csv')
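
The row-by-row loop can be slow when there are many missing cells. As a hedged alternative, the sketch below builds the `row-col` keys directly from the missing-value mask. It runs on a tiny hypothetical frame, and `fillna(data.mean())` is only a placeholder for whichever imputer is actually used.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame in the same layout as data.csv (row_id index, F_* columns)
data = pd.DataFrame(
    {"F_1_0": [0.1, np.nan, 0.3], "F_1_1": [np.nan, 0.5, 0.6]},
    index=pd.Index([0, 1, 2], name="row_id"),
)

# Remember where the gaps were before imputing
missing = data.isna().to_numpy()
imputed = data.fillna(data.mean())  # placeholder for any imputer

# Positions of the missing cells, in row-major order
rows, cols = np.where(missing)
keys = [f"{data.index[r]}-{data.columns[c]}" for r, c in zip(rows, cols)]

# One submission row per originally missing cell
submission = pd.DataFrame(
    {"value": imputed.to_numpy()[rows, cols]},
    index=pd.Index(keys, name="row-col"),
)
print(submission)
```

On the real data the keys should be aligned with `sample_submission.csv` (for example with `submission.reindex(sample.index)`, where `sample` is the loaded sample file) before writing the CSV.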

<a id=4></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Modeling</center></h3>



Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in it.

Tasks

* Select modeling technique
* Generate test design
* Build model
* Assess model


<a id=7></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Summary</center></h3> 

### Conclusion:
Most real-world datasets have missing values that need to be handled intelligently to create a robust model. I have discussed many ways to handle missing values, covering every type of column. There is no rule of thumb for handling missing values in one particular manner; the right method is the one that gives a robust model with the best performance. One can use different methods on different features depending on how the data was collected and what it is about. Having domain knowledge about the dataset is important, because it gives insight into how to preprocess the data and handle missing values.

**Some common FAQs**

**Q1: Should the EDA be performed before or after imputing the missing values? Which approach gives a better result?**

Answer: It is best to do it before, because the filling itself might change the distribution of the data.

**Q2: Should the missing values be imputed after normalizing or before normalizing the data?**

Answer: There is no single right answer to this. You can try both and check the results. Whenever you are in doubt, test both options in your K-fold cross-validation loop and compare the results.
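
Since this competition has no targets, one way to "check the results" is to hide some values that are actually observed, impute them, and measure the error on the hidden cells. The sketch below shows that idea on synthetic data; the two candidate imputers, the 10% hiding rate, and the RMSE metric are assumptions for the example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Synthetic complete data with one column that depends on two others
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] + X[:, 1]

# Hide about 10% of the known values so we can score the imputers
hide = rng.random(X.shape) < 0.1
X_missing = np.where(hide, np.nan, X)

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}

# RMSE on the hidden cells only
rmse = {}
for name, imputer in candidates.items():
    X_hat = imputer.fit_transform(X_missing)
    rmse[name] = float(np.sqrt(np.mean((X_hat[hide] - X[hide]) ** 2)))

print(rmse)
```

The same loop can be repeated with and without normalization, or with different random masks, to compare preprocessing choices.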



### References:

* https://www.kaggle.com/code/bannourchaker/credit-part2-datapreparation-1-selectpipe
* https://www.kaggle.com/code/bannourchaker/frauddetection-part2-preparation
* https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
* https://www.kaggle.com/code/azminetoushikwasi/intro-to-imputation-different-techniques/notebook
* https://www.kaggle.com/code/abdulravoofshaik/quick-eda-and-missing-values-tutorial
* https://www.kaggle.com/code/devsubhash/tps-june-eda-xgb-simpleimputer/notebook