# **Discriminative Feature Selection**

# FEATURE SELECTION

Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in. Irrelevant features in your data can decrease the accuracy of a model, because the model ends up learning from them.

We will work through it with a practical example. The steps are as follows:

>1) Import important libraries

>2) Importing data

>3) Data Preprocessing

>>i) Price

>>ii) Size

>>iii) Installs

>4) Discriminative Feature Check

>>i) Reviews

>>ii) Price

**1. Import Important Libraries**

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [7]:
from google.colab import drive
drive.mount('/content/drive')


**2. Importing Data**

Today we will be working on a Google Play Store apps dataset with ratings. Link to the dataset: https://www.kaggle.com/lava18/google-play-store-apps/data

In [None]:
df = pd.read_csv('/content/drive/My Drive/Academic/extras/Projects/Summer2k20/Colab Notebooks/Tutorials/googleplaystore.csv',encoding='unicode_escape')
df.head()

**3. Data Preprocessing**

Let us have a look at all the datatypes first :

In [None]:
df.dtypes

We see that all the columns except 'Rating' have the object datatype. Some of them ('Reviews', 'Size', 'Installs' and 'Price') hold numeric information, so we want them as numeric columns as well. Let us start with the 'Price' column.

**i) Price** 

When we looked at the head of the dataset, we only saw 0 values in the 'Price' column. Let us have a look at the rows with non-zero prices. As the 'Price' column is of object type, we compare the column with '0' instead of 0.

In [None]:
df[df['Price']!='0'].head()

We see that the 'Price' column has a dollar sign at the beginning for the apps which are not free. Hence we cannot directly convert it to a numeric type. We first have to remove the $ sign so that all the values are uniform and can be converted.

We use the replace function here to replace the dollar sign with an empty string. Notice that we access the column through its .str accessor, as this replace function works element-wise on string values.

In [None]:
# regex=False treats '$' as a literal character rather than a regex anchor
df['Price'] = df['Price'].str.replace('$', '', regex=False)
df[df['Price']!='0'].head()
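The same cleanup can be tried on a few made-up values before touching the real column. This is only a sketch: the `prices` Series is hypothetical and not taken from the dataset. Passing `regex=False` makes it explicit that `$` is meant literally, whatever the default of the installed pandas version is.

```python
import pandas as pd

# Hypothetical sample values, mimicking the format of the 'Price' column
prices = pd.Series(['0', '$4.99', '$0.99'])

# Treat '$' as a literal character, not as the regex end-of-string anchor
cleaned = prices.str.replace('$', '', regex=False)

# Convert the cleaned strings to floats
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric.tolist())
```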

**ii) Size**

Looking at the 'Size' column, we see that the values end with the letter 'M' for mega. We want to convert the size to a numeric value to use it in the dataset, hence we need to remove the letter 'M'.

For this, we access the column as strings, drop the last character of each value and save the result back in the 'Size' column.

Notice from the previous head that the 'Size' for row 427 is given as 'Varies with device'. We obviously cannot convert such data to numeric. We will see how to deal with it later.

In [None]:
df['Size'] = df['Size'].str[:-1]
df.head()
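Dropping the last character also discards the unit. If the raw column contains sizes in other units as well, for example values ending in 'k' for kilobytes (an assumption worth checking on the raw column with `df['Size'].str[-1].value_counts()` before slicing), those values would end up on a different scale. A possible sketch that converts everything to megabytes is shown here. The sample values, the helper name `size_to_mb` and the factor of 1024 kB per MB are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample values in the style of the 'Size' column
sizes = pd.Series(['19M', '201k', 'Varies with device'])

def size_to_mb(value):
    """Return the size in megabytes, or NaN when it cannot be parsed."""
    factors = {'M': 1.0, 'k': 1 / 1024}  # assumed: 1 MB = 1024 kB
    if not isinstance(value, str) or value[-1] not in factors:
        return float('nan')
    try:
        return float(value[:-1]) * factors[value[-1]]
    except ValueError:
        return float('nan')

size_mb = sizes.apply(size_to_mb)
print(size_mb)
```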

**iii) Installs**

Looking at the 'Installs' column, there are 2 changes we need to make before converting it to numeric: remove the '+' sign from the end of each value and remove the commas.

To remove the last character, we apply the same procedure as for the 'Size' column:

In [None]:
df['Installs'] = df['Installs'].str[:-1]
df.head()

To remove the commas, we use the replace function to replace each comma with an empty string.

The replace function only works on strings, hence we access the values of the series through .str before applying it:

In [None]:
df['Installs'] = df['Installs'].str.replace(',','')
df.head()
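As an alternative sketch, both characters can be removed by chained replacements. Unlike slicing off the last character, this leaves a value without a trailing '+' untouched. The `installs` Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample values in the style of the 'Installs' column
installs = pd.Series(['10,000+', '5,000,000+', '0'])

cleaned = (
    installs
    .str.replace('+', '', regex=False)   # drop the trailing plus sign
    .str.replace(',', '', regex=False)   # drop the thousands separators
)
print(cleaned.tolist())
```

With `str[:-1]`, the last sample value '0' would have become an empty string.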

Now we finally convert all these columns to a numeric type using the to_numeric function. Notice that we use the errors='coerce' parameter, which turns every value that cannot be parsed as a number into NaN. For example, the 'Size' in row 427 cannot be converted to a number, hence it becomes NaN. After that we take a look at the datatypes of the columns again.

In [None]:
df['Reviews'] = pd.to_numeric(df['Reviews'],errors='coerce')
df['Size'] = pd.to_numeric(df['Size'],errors='coerce')
df['Installs'] = pd.to_numeric(df['Installs'],errors='coerce')
df['Price'] = pd.to_numeric(df['Price'],errors='coerce')
df.dtypes
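A minimal sketch of what errors='coerce' does, on made-up strings (the `raw` Series is hypothetical):

```python
import pandas as pd

# Two parseable strings and one leftover text value
raw = pd.Series(['10000', '500', 'Varies with devic'])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
print(converted.isna().sum())
```

Because NaN is a floating-point value, the resulting column has a float dtype even though the parseable entries are whole numbers.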

Now we will look at the NaN values and deal with them. Let us first count the NaN values in each column:

In [None]:
df.isna().sum()

As 'Rating' is the output we want to predict, it cannot be NaN. Hence we remove all the rows where 'Rating' is NaN:

In [None]:
df = df[df['Rating'].notna()]
df.isna().sum()

This is the final preprocessed dataset that we obtained :

In [None]:
df.head()

**4. Discriminative Feature Check**

Now we move on to the discriminative feature check, to see which features are good and which are not. We will start with the 'Reviews' column. For our case, we take Rating > 4.3 as a good rating. We pick that value because, as the following stats show, it splits the ratings roughly 50:50.

Before we do that, let us have a look at the statistics of the whole table :

In [None]:
df.describe()
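The 50:50 split point is the median, which describe() reports in its 50% row. A minimal sketch of how such a cut could be checked, using made-up ratings rather than the real column:

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset
ratings = pd.Series([3.9, 4.1, 4.3, 4.5, 4.7, 4.8])

# The 50th percentile (median) is the candidate cut for a "good" rating
median = ratings.quantile(0.5)

# Share of rows labelled as a high rating with this cut
share_high = (ratings > median).mean()
print(median, share_high)
```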

**i) Reviews**

We have to try several pivot values to see which one separates high and low ratings best. We start by comparing with the mean of the 'Reviews' column, which is 514098.

We will use a new function here, known as crosstab. Crosstab gives us a frequency count across 2 columns or conditions.

We can also normalize the results by column to obtain the conditional probability P(Rating = HIGH | condition).

We also turn on the margins to see the total frequency under each condition.

In [None]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>514098,rownames=['Ratings>4.3'],colnames=['Reviews>514098'],margins= True)
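To see how such a table is read, here is a minimal sketch on a made-up six-row frame (the `toy` data and the pivot of 500 are hypothetical). Rows correspond to the rating condition, columns to the reviews condition, and the 'All' margins hold the totals.

```python
import pandas as pd

# Hypothetical data, small enough to count by hand
toy = pd.DataFrame({
    'Rating':  [4.5, 4.0, 4.6, 3.8, 4.4, 4.1],
    'Reviews': [900, 100, 800, 50, 700, 950],
})

# Frequency of each (rating condition, reviews condition) combination
counts = pd.crosstab(toy['Rating'] > 4.3, toy['Reviews'] > 500,
                     rownames=['Rating>4.3'], colnames=['Reviews>500'],
                     margins=True)
print(counts)
```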

We see that the number of apps with Reviews > 514098 is very small (close to 10% of the rows).

Hence it is preferable to take the 50th percentile (the median) rather than the mean as the pivot point. The 50th percentile is 5930 reviews in this case, so let us take a look at that:

In [None]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>5930,rownames=['Ratings>4.3'],colnames=['Reviews>5930'],margins= True)

Now we see that the number of apps is equal on both sides of the pivot. So from now on we start from the 50th percentile. Let us now look at the conditional probability:

In [None]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>5930,rownames=['Ratings>4.3'],colnames=['Reviews>5930'],margins= True,normalize='columns')

There is not much difference between P(Rating=HIGH | Reviews<=5930) and P(Rating=HIGH | Reviews>5930), so this is a bad feature.
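The conditional probabilities in the normalized table can also be computed directly from boolean masks, which may make the definition easier to see. This is a sketch on made-up data; the `toy` frame and the pivot of 500 are hypothetical.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
toy = pd.DataFrame({
    'Rating':  [4.5, 4.0, 4.6, 3.8, 4.4, 4.1],
    'Reviews': [900, 100, 800, 50, 700, 950],
})

high = toy['Rating'] > 4.3      # event: the rating is HIGH
above = toy['Reviews'] > 500    # condition: reviews above the pivot

# P(Rating=HIGH | condition) is the share of HIGH rows among the rows
# that satisfy the condition
p_high_above = high[above].mean()
p_high_below = high[~above].mean()

# A larger gap means the condition separates the ratings better
gap = abs(p_high_above - p_high_below)
print(p_high_above, p_high_below, gap)
```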

Let us increase the value of the pivot for reviews to 80000 and check again. We don't need to check whether the share of apps above the pivot is too low, as we are almost at the 75th percentile mark.

In [None]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>80000,rownames=['Ratings>4.3'],colnames=['Reviews>80000'],margins= True,normalize='columns')

Now we see that there is a good difference in the probabilities, hence Reviews>80000 is a good feature.
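Trying pivots one at a time can be turned into a small loop that reports, for each candidate, the share of rows above the pivot and the two conditional probabilities. The following is only a sketch under assumptions: the `toy` frame, the candidate pivots and the column names of `summary` are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
toy = pd.DataFrame({
    'Rating':  [4.5, 4.0, 4.6, 3.8, 4.4, 4.1, 4.7, 4.2],
    'Reviews': [900, 100, 800, 50, 700, 950, 1200, 300],
})
high = toy['Rating'] > 4.3

rows = []
for pivot in [200, 500, 850]:
    above = toy['Reviews'] > pivot
    rows.append({
        'pivot': pivot,
        'share_above': above.mean(),          # guards against tiny groups
        'p_high_above': high[above].mean(),   # P(HIGH | Reviews > pivot)
        'p_high_below': high[~above].mean(),  # P(HIGH | Reviews <= pivot)
    })

summary = pd.DataFrame(rows)
summary['gap'] = (summary['p_high_above'] - summary['p_high_below']).abs()
print(summary)
```

A pivot is only worth considering when `share_above` is neither very small nor very large, and among those the one with the largest `gap` discriminates best.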

**ii) Price**

We will do the same for the 'Price' column to find its most discriminative split. In this case, even the 75th percentile is 0. Hence we classify the apps as free or not:

In [None]:
pd.crosstab(df['Rating']>4.3,df['Price']==0,rownames=['Ratings>4.3'],colnames=['Price=$0'],margins= True)

This shows us that it is very difficult to use 'Price' as a feature, since the split between free and paid apps is heavily unbalanced. Hence it is a doubtful feature. If we still want to force it as a feature, let us see the conditional probability:

In [None]:
pd.crosstab(df['Rating']>4.3,df['Price']==0,rownames=['Ratings>4.3'],colnames=['Price=$0'],margins= True,normalize='columns')

We see that there is not much difference in the probabilities either, hence this would serve as a bad feature in any case.

This is the end of this tutorial. Now you can move on to assignment 7, in which you have to check the other 2 discriminative features.