### Feature Engineering

This notebook covers one part of feature engineering: imputing missing values with different methods.

1. Removing observations with missing data
2. Performing mean or median imputation
3. Implementing mode or frequent category imputation
4. Replacing missing values with an arbitrary number
5. Capturing missing values in a bespoke category
6. Replacing missing values with a value at the end of the distribution
7. Adding a missing value indicator variable
8. Performing multivariate imputation by chained equations


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('C:\\Users\\HP\\Documents\\EDA\\EDA-BySunny\\EDABySunny\\Dataset\\data3\\googleplaystore.csv')
df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [3]:
df.shape

(10841, 13)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [11]:
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

Rating has by far the most missing values (1,474). Type, Content Rating, Current Ver, and Android Ver
are each missing a handful of values as well. These missing values need to be imputed.

In [12]:
df.isnull()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,False,False,False,False,False,False,False,False,False,False,False,False,False
10837,False,False,False,False,False,False,False,False,False,False,False,False,False
10838,False,False,True,False,False,False,False,False,False,False,False,False,False
10839,False,False,False,False,False,False,False,False,False,False,False,False,False


In [19]:
df.isnull().mean().sort_values(ascending=False)

Rating            0.135965
Current Ver       0.000738
Android Ver       0.000277
Type              0.000092
Content Rating    0.000092
App               0.000000
Category          0.000000
Reviews           0.000000
Size              0.000000
Installs          0.000000
Price             0.000000
Genres            0.000000
Last Updated      0.000000
dtype: float64

In [22]:
df.isnull().any()

App               False
Category          False
Rating             True
Reviews           False
Size              False
Installs          False
Type               True
Price             False
Content Rating     True
Genres            False
Last Updated      False
Current Ver        True
Android Ver        True
dtype: bool

In [20]:
missing_col = [i for i in df.columns if df[i].isnull().any()]
missing_col

['Rating', 'Type', 'Content Rating', 'Current Ver', 'Android Ver']
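The separate checks above (counts, fractions, and column names) can be combined into a single summary table. The following is a minimal sketch on a made-up toy frame; the name `summary` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Play Store data (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan],
    "Type": pd.Series(["Free", None, "Paid", "Free"], dtype=object),
    "App": pd.Series(["a", "b", "c", "d"], dtype=object),
})

# One row per column: missing count, missing percentage, and dtype.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
    "dtype": toy.dtypes.astype(str),
})

# Keep only the columns that actually contain missing values.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```

Seeing the dtype next to the missing share helps when deciding between a numerical and a categorical imputation method for each column.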

### Removing observations with missing data
##### Complete case analysis (CCA)
CCA, also called list-wise deletion of cases, consists of discarding every observation in which
the value of any variable is missing. CCA can be applied to categorical and numerical
variables. It is quick and easy to implement and preserves the distribution of the variables,
provided the data is missing at random and only a small proportion of it is missing. However,
if data is missing across many variables, CCA may remove a large portion of the dataset.

In [23]:
import copy

In [24]:
df1 = copy.deepcopy(df)
df1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [25]:
df_cca = df1.dropna()

In [26]:
df.shape

(10841, 13)

In [27]:
df_cca.shape

(9360, 13)

In [28]:
df_cca.isnull().mean()

App               0.0
Category          0.0
Rating            0.0
Reviews           0.0
Size              0.0
Installs          0.0
Type              0.0
Price             0.0
Content Rating    0.0
Genres            0.0
Last Updated      0.0
Current Ver       0.0
Android Ver       0.0
dtype: float64
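Dropping rows on every column can be costly when one column dominates the missing values, as Rating does here. A possible variation, sketched on made-up toy data, restricts CCA to chosen columns with the `subset` argument of `dropna` and compares how many rows each choice removes. The toy frame and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Play Store data (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan, 4.3],
    "Type": pd.Series(["Free", "Free", None, "Paid", "Free"], dtype=object),
    "Genres": pd.Series(["Art", "Tools", "Art", "Game", "Game"], dtype=object),
})

# Full complete case analysis: drop a row if ANY column is missing.
cca_all = toy.dropna()

# Restricted CCA: drop a row only if 'Type' is missing.
cca_type = toy.dropna(subset=["Type"])

# Fraction of rows lost under each choice.
lost_all = 1 - len(cca_all) / len(toy)
lost_type = 1 - len(cca_type) / len(toy)
print(f"all columns: {lost_all:.0%} lost, Type only: {lost_type:.0%} lost")
```

The rows that keep a missing Rating after the restricted drop can then be handled with one of the imputation methods below.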

In [30]:
!pip install feature-engine

Collecting feature-engine
  Downloading feature_engine-1.5.1-py2.py3-none-any.whl (285 kB)
     -------------------------------------- 285.3/285.3 kB 2.2 MB/s eta 0:00:00
Installing collected packages: feature-engine
Successfully installed feature-engine-1.5.1





### Performing mean or median imputation
Mean or median imputation consists of replacing missing values with the variable's mean or
median. It can only be applied to numerical variables. The mean or median is calculated on
the train set, and that value is then used to impute missing data in the train and test sets,
as well as in any future data we intend to score with the machine learning model.
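The train/test discipline described above can be sketched as follows. This is a minimal illustration on made-up numbers, not the notebook's own workflow (the cells below fit on the full column); the names `X_train`, `X_test`, and `toy` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy numeric column with gaps (values are made up).
toy = pd.DataFrame({"Rating": [4.0, 4.5, np.nan, 3.5, 5.0, np.nan, 4.2, 3.8]})

X_train, X_test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the median from the training rows only.
median_imputer = SimpleImputer(strategy="median")
median_imputer.fit(X_train)

# Apply the same learned value to both splits.
X_train_imp = median_imputer.transform(X_train)
X_test_imp = median_imputer.transform(X_test)

print("median learned from train:", median_imputer.statistics_[0])
```

Fitting on the train split only keeps information from the test rows out of the imputation value.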

In [37]:
from sklearn.impute import SimpleImputer

#from feature_engine.imputation import MeanMedianImputer  # module path in feature-engine 1.x

In [38]:
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')

In [61]:
df2 = copy.deepcopy(df)

In [62]:
df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [63]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [64]:
mean_im = mean_imputer.fit_transform(np.array(df2['Rating']).reshape(-1,1))
mean_im

array([[4.1       ],
       [3.9       ],
       [4.7       ],
       ...,
       [4.19333832],
       [4.5       ],
       [4.5       ]])

#### Alternatively, impute directly with pandas

In [71]:
# median or mean imputation for the numerical feature
val = df2['Rating'].median()
#val = df2['Rating'].mean()
val

4.3

In [66]:
# use fillna to replace the missing values
df2['Rating']= df2['Rating'].fillna(val)

In [67]:
df2.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    1
Genres            0
Last Updated      0
Current Ver       8
Android Ver       3
dtype: int64

In [68]:
df2['Rating']

0        4.1
1        3.9
2        4.7
3        4.5
4        4.3
        ... 
10836    4.5
10837    5.0
10838    4.3
10839    4.5
10840    4.5
Name: Rating, Length: 10841, dtype: float64

In [69]:
df2.sample(100)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
3111,Hotels.com: Book Hotel Rooms & Find Vacation D...,TRAVEL_AND_LOCAL,4.5,260121,Varies with device,"10,000,000+",Free,0,Everyone,Travel & Local,"July 4, 2018",Varies with device,Varies with device
7387,CI Time,PRODUCTIVITY,4.3,4,2.5M,100+,Free,0,Everyone,Productivity,"March 17, 2017",1.2.392,4.0 and up
3937,Fortune City - A Finance App,FINANCE,4.6,49275,91M,"500,000+",Free,0,Everyone,Finance,"July 17, 2018",2.0.3.1,4.4 and up
5108,Lakeside AG Moultrie,LIFESTYLE,5.0,3,8.6M,50+,Free,0,Everyone,Lifestyle,"May 23, 2017",1.0,4.1 and up
6703,Loteria BR,FAMILY,4.3,0,1.8M,50+,Free,0,Everyone,Entertainment,"May 17, 2017",1.0.6,4.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1449,Realtor.com Real Estate: Homes for Sale and Rent,HOUSE_AND_HOME,4.5,162243,12M,"10,000,000+",Free,0,Everyone,House & Home,"July 26, 2018",8.18,4.0.3 and up
1681,Flow Free,GAME,4.3,1295557,11M,"100,000,000+",Free,0,Everyone,Puzzle,"April 11, 2018",4.0,4.1 and up
2164,Chess School for Beginners,FAMILY,4.3,879,Varies with device,"100,000+",Free,0,Everyone,Board;Brain Games,"May 22, 2018",1.1.0,4.1 and up
9481,"Period Tracker, Pregnancy Calculator & Calendar 🌸",HEALTH_AND_FITNESS,4.3,0,Varies with device,"10,000+",Free,0,Everyone,Health & Fitness,"August 1, 2018",Varies with device,Varies with device


In [70]:
missing_col

['Rating', 'Type', 'Content Rating', 'Current Ver', 'Android Ver']

### Implementing mode or frequent category imputation (categorical columns)

In [74]:
# compute the mode of a categorical column so it can be passed to fillna
cat_val = df2['Current Ver'].mode()[0]
cat_val

'Varies with device'

In [75]:
for i in ['Type', 'Content Rating', 'Current Ver', 'Android Ver']:
    cat_val = df2[i].mode()[0]
    df2[i] = df2[i].fillna(cat_val)

In [76]:
df2.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

In [78]:
# for sklearn we can use the following
imputer = SimpleImputer(strategy='most_frequent')
imputer

In [80]:
# imputer.fit_transform(df)
# imputer.statistics_
# Feature-engine alternative (X_train / X_test are not defined in this notebook):
# mode_imputer = FrequentCategoryImputer(variables=['A4', 'A5', 'A6', 'A7'])
# mode_imputer.fit(X_train)
# X_train = mode_imputer.transform(X_train)
# X_test = mode_imputer.transform(X_test)
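A runnable version of the scikit-learn route might look like the sketch below. The `train` and `test` frames are made-up stand-ins (the notebook never splits the data), and the string columns are created with `dtype=object` so that `np.nan` is recognised as the missing marker.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cat_cols = ["Type", "Content Rating"]

# Toy train/test frames (values are made up).
train = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", np.nan],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Everyone"],
}, dtype=object)
test = pd.DataFrame({
    "Type": [np.nan, "Paid"],
    "Content Rating": [np.nan, "Teen"],
}, dtype=object)

# Learn the most frequent category of each column from the train rows.
mode_imputer = SimpleImputer(strategy="most_frequent")
mode_imputer.fit(train[cat_cols])

# Fill both splits with the categories learned on train.
train[cat_cols] = mode_imputer.transform(train[cat_cols])
test[cat_cols] = mode_imputer.transform(test[cat_cols])

print(mode_imputer.statistics_)
```

`statistics_` holds one learned category per column, in the same order as `cat_cols`.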


### Replacing missing values with an arbitrary number
Arbitrary number imputation consists of replacing missing values with an arbitrary value.
Commonly used values include 999, 9999, or -1 for positive distributions. This method only works for numerical variables.

In [103]:
df3 = copy.deepcopy(df)
df3

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [90]:
df3.isnull().mean().sort_values(ascending=False)

Rating            0.135965
Current Ver       0.000738
Android Ver       0.000277
Type              0.000092
Content Rating    0.000092
App               0.000000
Category          0.000000
Reviews           0.000000
Size              0.000000
Installs          0.000000
Price             0.000000
Genres            0.000000
Last Updated      0.000000
dtype: float64

In [91]:
max_val = df3['Rating'].max()
max_val

19.0

In [92]:
# replace the missing values in Rating with the arbitrary number 99
df3['Rating'] = df3['Rating'].fillna(99)  # we can also replace 99 with max_val

In [93]:
df3.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    1
Genres            0
Last Updated      0
Current Ver       8
Android Ver       3
dtype: int64

In [94]:
df3.sample(100)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
6158,BGKontakti Bayern BG Kontakti,SOCIAL,99.0,32,7.8M,"1,000+",Free,0,Everyone,Social,"May 25, 2016",1.6,3.0 and up
7047,Dragon B.Z Wallpapers,PERSONALIZATION,99.0,0,3.8M,10+,Free,0,Everyone,Personalization,"May 6, 2018",2.1,4.0.3 and up
772,"play2prep: ACT, SAT prep",EDUCATION,4.2,3692,4.4M,"100,000+",Free,0,Everyone,Education,"July 23, 2015",4.1.3,2.3.3 and up
4043,Vector,GAME,4.4,3058687,89M,"100,000,000+",Free,0,Everyone 10+,Arcade,"July 18, 2016",1.2.0,4.0 and up
9137,quran-DZ,SOCIAL,99.0,0,6.2M,10+,Free,0,Teen,Social,"June 13, 2018",1.1,4.2 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4619,Don't Starve: Shipwrecked,GAME,4.1,1468,4.9M,"10,000+",Paid,$4.99,Teen,Adventure,"July 10, 2018",0.16,5.0 and up
10620,WFLA News Channel 8 - Tampa FL,NEWS_AND_MAGAZINES,3.8,133,14M,"10,000+",Free,0,Everyone,News & Magazines,"July 18, 2018",v4.30.0.8,5.0 and up
1081,İşCep,FINANCE,4.5,381788,32M,"10,000,000+",Free,0,Everyone,Finance,"August 2, 2018",3.22.0,4.1 and up
7369,Green Build - An unofficial Travis CI client,TOOLS,3.9,7,3.2M,100+,Free,0,Everyone,Tools,"June 6, 2018",1.2.1,5.0 and up


In [95]:
# Alternatively, use SimpleImputer from scikit-learn
imputer = SimpleImputer(strategy='constant', fill_value=99)

In [96]:
#imputer.fit(X_train)
#X_train = imputer.transform(X_train)
#X_test = imputer.transform(X_test)
#imputer = ArbitraryNumberImputer(arbitrary_number=99,variables=['A2','A3', 'A8', 'A11'])
#X_train = imputer.transform(X_train)
#X_test = imputer.transform(X_test)
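The constant-value imputer defined above can be exercised on a small example. In this sketch the toy column, the `ARBITRARY` constant, and the `Rating_imputed` column name are assumptions for illustration. The range check reflects the usual requirement that the placeholder sits outside the observed values, so imputed rows stay distinguishable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric column with gaps (values are made up).
toy = pd.DataFrame({"Rating": [4.1, np.nan, 3.9, 5.0, np.nan]})

ARBITRARY = 99  # placeholder chosen for this sketch

# Sanity check: the placeholder must lie outside the observed range.
assert ARBITRARY > toy["Rating"].max()

imputer = SimpleImputer(strategy="constant", fill_value=ARBITRARY)

# fit_transform returns a 2-D array; ravel() flattens it back into one column.
toy["Rating_imputed"] = imputer.fit_transform(toy[["Rating"]]).ravel()
print(toy)
```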

### Capturing missing values in a bespoke category
Missing data in categorical variables can be treated as a category of its own, so it is common
to replace missing values with the string 'Missing'. In this recipe, we will learn how to do so
using pandas, scikit-learn, and Feature-engine.

In [99]:
from sklearn.impute import SimpleImputer
#from feature_engine.missing_data_imputers import CategoricalVariableImputer

In [101]:
for var in ['Type', 'Content Rating', 'Current Ver', 'Android Ver']:
    df3[var] = df3[var].fillna('Missing')


In [102]:
df3.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

In [100]:
imputer = SimpleImputer(strategy='constant', fill_value='Missing')

In [104]:
imputer.fit_transform(df3)

array([['Photo Editor & Candy Camera & Grid & ScrapBook',
        'ART_AND_DESIGN', 4.1, ..., 'January 7, 2018', '1.0.0',
        '4.0.3 and up'],
       ['Coloring book moana', 'ART_AND_DESIGN', 3.9, ...,
        'January 15, 2018', '2.0.0', '4.0.3 and up'],
       ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
        'ART_AND_DESIGN', 4.7, ..., 'August 1, 2018', '1.2.4',
        '4.0.3 and up'],
       ...,
       ['Parkinson Exercices FR', 'MEDICAL', 'Missing', ...,
        'January 20, 2017', '1.0', '2.2 and up'],
       ['The SCP Foundation DB fr nn5n', 'BOOKS_AND_REFERENCE', 4.5, ...,
        'January 19, 2015', 'Varies with device', 'Varies with device'],
       ['iHoroscope - 2018 Daily Horoscope & Astrology', 'LIFESTYLE',
        4.5, ..., 'July 25, 2018', 'Varies with device',
        'Varies with device']], dtype=object)

In [108]:
im_besp = pd.DataFrame(imputer.fit_transform(df3),columns = df.columns)
im_besp

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,Missing,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [109]:
im_besp.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

In [111]:
im_besp.info()  # Rating is now object: the imputer wrote the string 'Missing' into its empty cells

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   App             10841 non-null  object
 1   Category        10841 non-null  object
 2   Rating          10841 non-null  object
 3   Reviews         10841 non-null  object
 4   Size            10841 non-null  object
 5   Installs        10841 non-null  object
 6   Type            10841 non-null  object
 7   Price           10841 non-null  object
 8   Content Rating  10841 non-null  object
 9   Genres          10841 non-null  object
 10  Last Updated    10841 non-null  object
 11  Current Ver     10841 non-null  object
 12  Android Ver     10841 non-null  object
dtypes: object(13)
memory usage: 1.1+ MB


In [112]:
#imputer = CategoricalVariableImputer(variables=['A4', 'A5', 'A6','A7'])
#X_train = imputer.transform(X_train)
#X_test = imputer.transform(X_test)
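Passing the whole frame to a string-valued `SimpleImputer` also fills numeric columns such as Rating with 'Missing', which turns them into object columns. One way to avoid this, sketched here on made-up data, is to hand only the categorical columns to the imputer. The `cat_cols` list and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and two categorical columns (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7],
    "Type": pd.Series(["Free", np.nan, "Paid"], dtype=object),
    "Android Ver": pd.Series([np.nan, "4.1 and up", "4.4 and up"], dtype=object),
})

# Only these columns receive the bespoke 'Missing' category.
cat_cols = ["Type", "Android Ver"]

imputer = SimpleImputer(strategy="constant", fill_value="Missing")
toy[cat_cols] = imputer.fit_transform(toy[cat_cols])

# Rating keeps its float dtype and its NaN, ready for a numerical method.
print(toy.dtypes)
print(toy)
```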


### Replacing missing values with a value at the end of the distribution
Replacing missing values with a value at the end of the variable distribution is equivalent
to replacing them with an arbitrary value, but instead of choosing the value manually, it is
selected automatically at the very end of the variable distribution. If the variable is
normally distributed, the replacement value is the mean plus or minus three times the standard
deviation; otherwise it is given by the inter-quartile range (IQR) proximity rule. According
to the IQR proximity rule, missing values are replaced with the 75th percentile + (IQR * 1.5)
at the right tail or with the 25th percentile - (IQR * 1.5) at the left tail, where the IQR is
the 75th percentile minus the 25th percentile.
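Both rules from the paragraph above fit into one small helper. This is a minimal sketch: the function name `end_of_tail_value`, its parameters, and the toy ratings are assumptions for illustration, not part of any library.

```python
import numpy as np
import pandas as pd

def end_of_tail_value(s: pd.Series, method: str = "iqr", tail: str = "right") -> float:
    """Return the replacement value at the chosen tail of s (NaNs are ignored)."""
    if method == "gaussian":
        # mean +/- 3 standard deviations, for roughly normal variables
        center, spread = s.mean(), 3 * s.std()
        return center + spread if tail == "right" else center - spread
    if method == "iqr":
        # IQR proximity rule, for skewed variables
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return q3 + 1.5 * iqr if tail == "right" else q1 - 1.5 * iqr
    raise ValueError(f"unknown method: {method!r}")

# Toy ratings with gaps (values are made up).
ratings = pd.Series([4.0, 4.5, np.nan, 3.5, 5.0, 4.0, np.nan, 4.5])

value = end_of_tail_value(ratings, method="iqr", tail="right")
filled = ratings.fillna(value)
print(value)  # 5.25 for this toy series (Q3 = 4.5, IQR = 0.5)
```

As with the other methods, the value would normally be computed on the train set and reused for the test set.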


In [113]:
df4 = copy.deepcopy(df)

In [114]:
missing_col

['Rating', 'Type', 'Content Rating', 'Current Ver', 'Android Ver']

In [115]:
for var in ['Rating']:
    IQR = df4[var].quantile(0.75) - df4[var].quantile(0.25)
    value = df4[var].quantile(0.75) + 1.5 * IQR
    print(value)
    df4[var]= df4[var].fillna(value)
 

5.25


In [116]:
df4.sample(100)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
7366,usgang.ch,LIFESTYLE,3.70,492,Varies with device,"100,000+",Free,0,Everyone 10+,Lifestyle,"June 11, 2018",3.0.7,4.1 and up
9683,Masha and The Bear Puzzle Game,FAMILY,4.30,13330,73M,"1,000,000+",Free,0,Everyone,Puzzle;Brain Games,"May 24, 2018",2.0,4.1 and up
3616,How do I care about my child?,PARENTING,5.25,34,4.9M,"10,000+",Free,0,Everyone,Parenting,"July 3, 2018",1.2,4.0 and up
9501,Racing Moto,GAME,4.30,697805,7.4M,"50,000,000+",Free,0,Everyone,Racing,"July 3, 2018",1.2.13,3.0 and up
9579,Live Hold’em Pro Poker - Free Casino Games,GAME,4.60,1123190,Varies with device,"10,000,000+",Free,0,Teen,Card,"August 7, 2018",Varies with device,Varies with device
...,...,...,...,...,...,...,...,...,...,...,...,...,...
688,Online Girls Chat,DATING,4.80,5323,3.5M,"50,000+",Free,0,Mature 17+,Dating,"April 21, 2018",8.2,4.0.3 and up
599,"Chatting - Free chat, random chat, boyfriend, ...",DATING,4.20,2506,6.1M,"500,000+",Free,0,Mature 17+,Dating,"June 15, 2017",3.0.0,4.0 and up
6546,Badoo - Free Chat & Dating App,SOCIAL,4.30,3781467,Varies with device,"100,000,000+",Free,0,Mature 17+,Social,"August 2, 2018",Varies with device,Varies with device
916,HISTORY: Watch TV Show Full Episodes & Specials,ENTERTAINMENT,4.10,33387,20M,"1,000,000+",Free,0,Teen,Entertainment,"July 16, 2018",3.1.4,4.4 and up


In [117]:
#from feature_engine.imputation import EndTailImputer  # module path in feature-engine 1.x
#imputer = EndTailImputer(imputation_method='iqr', tail='right', variables=['A2', 'A3', 'A8', 'A11', 'A15'])
#imputer.fit(X_train)
#imputer.imputer_dict_
#X_train = imputer.transform(X_train)
#X_test = imputer.transform(X_test)

### Adding a missing value indicator variable
A missing indicator is a binary variable that specifies whether a value was missing for an
observation (1) or not (0). It is common practice to replace missing observations with the
mean, median, or mode while also flagging them with a missing indicator. This covers two
cases: if the data was missing at random, the mean, median, or mode imputation handles it,
and if it was not, the missing indicator captures that information. In this recipe, we will
learn how to add missing indicators using NumPy, scikit-learn, and Feature-engine.

In [118]:
#from sklearn.impute import MissingIndicator
#from feature_engine.missing_data_imputers import AddNaNBinaryImputer

In [119]:
#for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:
 #X_train[var + '_NA'] = np.where(X_train[var].isnull(), 1, 0)
 #X_test[var + '_NA'] = np.where(X_test[var].isnull(), 1, 0)


In [120]:
#imputer = AddNaNBinaryImputer()
#imputer.fit_transform(X_train)

In [121]:
#indicator = MissingIndicator(features='missing-only')
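The commented cells above refer to `X_train` and `X_test`, which this notebook never defines. The sketch below shows the same two routes (NumPy and scikit-learn) on a made-up frame; the names `raw`, `toy`, and `flags` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Toy frame (values are made up); only Rating has gaps.
raw = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan],
    "Reviews": [159.0, 967.0, 87510.0, 12.0],
})

# NumPy / pandas route: add the flag first, then impute.
toy = raw.copy()
toy["Rating_NA"] = np.where(toy["Rating"].isnull(), 1, 0)
toy["Rating"] = toy["Rating"].fillna(toy["Rating"].median())

# scikit-learn route: one boolean column per variable that contains NaN.
indicator = MissingIndicator(features="missing-only")
flags = indicator.fit_transform(raw)

print(toy)
print(indicator.features_, flags.ravel())
```

The flag has to be created before the imputation step, because afterwards the missing positions are no longer visible in the column.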


### Performing multivariate imputation by chained equations
Multivariate imputation methods, as opposed to univariate imputation, use the entire set of
variables to estimate the missing values. In other words, the missing values of a variable are
modeled based on the other variables in the dataset. Multivariate imputation by chained
equations (MICE) is a multiple imputation technique that models each variable with
missing values as a function of the remaining variables and uses that estimate for
imputation.

In [122]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [123]:
#imputer = IterativeImputer(estimator = BayesianRidge(),max_iter=10, random_state=0)
#imputer.fit_transform(X_train)
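`IterativeImputer` expects numeric input, so the sketch below runs it on a synthetic two-column frame rather than on the Play Store data, where Rating is the only numeric column as loaded. The synthetic columns `a` and `b` and the 10% missing pattern are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: b is roughly 2 * a plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"a": x, "b": 2 * x + rng.normal(scale=0.1, size=200)})

# Knock out every tenth value of b.
toy.loc[::10, "b"] = np.nan

# Each variable with missing values is modelled from the remaining variables.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Compare the imputed b values with the relationship used to build the data.
print(imputed.loc[::10].head())
```

Because `b` depends strongly on `a`, the imputed values should land close to `2 * a`, which a mean or median fill could not achieve.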

# Thank you