
### **Data Discretization**
Data discretization is the process of converting continuous data into discrete bins or intervals.
It simplifies data analysis by reducing complexity, enhances interpretability, and can improve the performance of algorithms sensitive to numerical noise or scale.

#### **Data Binning:**
Data binning is a specific discretization technique that groups continuous values into intervals (bins) based on rules such as equal width, equal frequency, or custom thresholds. It reduces the impact of outliers, smooths the data, and makes patterns more evident in visualizations and analyses.
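
The three binning rules named above can be sketched on a small series. This is a minimal illustration, not part of the original notebook: the values are hypothetical, and `pd.qcut` is used as the equal-frequency counterpart of `pd.cut`.

```python
import pandas as pd

# Hypothetical toy values, not taken from the diabetes dataset.
values = pd.Series([44, 85, 89, 110, 137, 148, 183, 199])

# Equal width: 3 bins that each span the same range of values.
equal_width = pd.cut(values, bins=3)

# Equal frequency: 4 bins that each hold (roughly) the same number of rows.
equal_freq = pd.qcut(values, q=4)

# Custom thresholds: bin edges chosen by hand.
custom = pd.cut(values, bins=[0, 100, 140, 200])

# Count how many values land in each bin, listed in bin order.
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(custom.value_counts().sort_index())
```

Equal-width bins can end up unevenly filled when the data is skewed, while equal-frequency bins trade uniform widths for balanced counts.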

### **Feature Subset Selection:**
Feature selection is the process of identifying and retaining the most relevant features from a dataset while removing redundant or irrelevant ones. It improves model performance, reduces overfitting, lowers computational costs, and enhances interpretability by focusing on the most important variables.

## **Data binning can be done with the cut method: `pd.cut()` groups the data into bins and applies user-defined labels**

### **Syntax** - pd.cut(x, bins, labels = None)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
glc_level_chk_dta = pd.read_csv('/content/imputed_data_diabetes1.csv')
glc_level_chk_dta.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,6,148.0,72,35.0,125,33.6,0.627,50,1
1,1,85.0,66,29.0,125,26.6,0.351,31,0
2,8,183.0,64,29.15342,125,23.3,0.672,32,1
3,1,89.0,66,23.0,94,28.1,0.167,21,0
4,0,137.0,40,35.0,168,43.1,2.288,33,1


In [3]:
print("Maximum glucose: ", glc_level_chk_dta['glucose'].max())
print("Minimum glucose: ", glc_level_chk_dta['glucose'].min())

Maximum glucose:  199.0
Minimum glucose:  44.0


## **Create two buckets for glucose values: 0-140 and 140-199**

In [4]:
glc_level_chk_dta['bin'] = pd.cut(glc_level_chk_dta['glucose'], bins = [0,140,199])
glc_level_chk_dta.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic,bin
0,6,148.0,72,35.0,125,33.6,0.627,50,1,"(140, 199]"
1,1,85.0,66,29.0,125,26.6,0.351,31,0,"(0, 140]"
2,8,183.0,64,29.15342,125,23.3,0.672,32,1,"(140, 199]"
3,1,89.0,66,23.0,94,28.1,0.167,21,0,"(0, 140]"
4,0,137.0,40,35.0,168,43.1,2.288,33,1,"(0, 140]"


****
<b>The code and table above show each continuous glucose value (2nd column) mapped to its bin interval in the rightmost column, "bin".</b>
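
The interval notation `(0, 140]` means the bin excludes its left edge and includes its right edge, which is the default behaviour of `pd.cut` (`right=True`). The sketch below uses hypothetical readings placed on the edges to show where boundary values land. Values outside all bins become `NaN`, and `include_lowest=True` is one way to capture a value equal to the lowest edge.

```python
import pandas as pd

# Hypothetical readings chosen to sit on or near the bin edges.
readings = pd.Series([0, 44, 140, 141, 199, 200])

# Default right=True gives right-closed intervals: (0, 140] and (140, 199].
default_bins = pd.cut(readings, bins=[0, 140, 199])

# include_lowest=True also captures a value equal to the lowest edge (0).
with_lowest = pd.cut(readings, bins=[0, 140, 199], include_lowest=True)

print(pd.DataFrame({
    "reading": readings,
    "default": default_bins,
    "include_lowest": with_lowest,
}))
```

In this sketch 140 falls in the first bin and 141 in the second, while 200 lies beyond the last edge and stays unbinned in both columns.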

#### **After creating two buckets for the glucose ranges, assign the label "Normal" to the 0-140 range and "Prediabetic or Risky" to the 140-199 range, as in the code and tables below**

In [5]:
glc_level_chk_dta['bin'] = pd.cut(glc_level_chk_dta['glucose'], bins = [0,140,199], labels = ['Normal','Prediabetic or Risky'])
glc_level_chk_dta.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic,bin
0,6,148.0,72,35.0,125,33.6,0.627,50,1,Prediabetic or Risky
1,1,85.0,66,29.0,125,26.6,0.351,31,0,Normal
2,8,183.0,64,29.15342,125,23.3,0.672,32,1,Prediabetic or Risky
3,1,89.0,66,23.0,94,28.1,0.167,21,0,Normal
4,0,137.0,40,35.0,168,43.1,2.288,33,1,Normal


In [6]:
## Show the glucose values alongside the corresponding categorical label created in the bin column
dta_frm = glc_level_chk_dta[['glucose','bin']]
dta_frm.head()

Unnamed: 0,glucose,bin
0,148.0,Prediabetic or Risky
1,85.0,Normal
2,183.0,Prediabetic or Risky
3,89.0,Normal
4,137.0,Normal


#### **Replace each glucose value with its corresponding categorical bin value**

In [7]:
glc_level_chk_dta['glucose'] = glc_level_chk_dta['bin'].values
glc_level_chk_dta.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic,bin
0,6,Prediabetic or Risky,72,35.0,125,33.6,0.627,50,1,Prediabetic or Risky
1,1,Normal,66,29.0,125,26.6,0.351,31,0,Normal
2,8,Prediabetic or Risky,64,29.15342,125,23.3,0.672,32,1,Prediabetic or Risky
3,1,Normal,66,23.0,94,28.1,0.167,21,0,Normal
4,0,Normal,40,35.0,168,43.1,2.288,33,1,Normal


#### **Drop the bin column from the dataframe "glc_level_chk_dta": axis=1 selects columns, and inplace=True stores the modified content in the same dataframe**

In [8]:
glc_level_chk_dta.drop(['bin'], axis=1, inplace = True)
glc_level_chk_dta.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,6,Prediabetic or Risky,72,35.0,125,33.6,0.627,50,1
1,1,Normal,66,29.0,125,26.6,0.351,31,0
2,8,Prediabetic or Risky,64,29.15342,125,23.3,0.672,32,1
3,1,Normal,66,23.0,94,28.1,0.167,21,0
4,0,Normal,40,35.0,168,43.1,2.288,33,1
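
The replace-then-drop steps above can also be collapsed into one assignment. The following is a minimal sketch on a hypothetical two-column stand-in for the table, assuming the same bin edges and labels as above.

```python
import pandas as pd

# Hypothetical stand-in for the diabetes table (only two columns shown).
df = pd.DataFrame({
    "glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "Diabetic": [1, 0, 1, 0, 1],
})

# Overwrite the column with its labelled bin directly; no helper column is needed.
df["glucose"] = pd.cut(
    df["glucose"],
    bins=[0, 140, 199],
    labels=["Normal", "Prediabetic or Risky"],
)

print(df)
print(df["glucose"].dtype)
```

Keeping the intermediate "bin" column, as the notebook does, is still useful while checking that each value landed in the intended bin.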


### **Data Binarization**

### One of the binarization methods is ---> **One Hot Encoding**

#### Convert each category value into a new column and assign a 1 or 0 (True/False) value to that column

#### Use the pandas method pd.get_dummies( obj_df, columns=[ "   " ] ) to perform one hot encoding


In [9]:
glc_level_chk_dta.glucose.value_counts()

glucose,count
Normal,576
Prediabetic or Risky,192


***
<font color = green >The code above counts the normal and prediabetic patients in the glucose column based on the categorical label. The code below then binarizes the column: "Normal" becomes (1, 0) and "Prediabetic or Risky" becomes (0, 1), shown as True/False in the output.</font>

In [10]:
pd.get_dummies(glc_level_chk_dta.glucose)

Unnamed: 0,Normal,Prediabetic or Risky
0,False,True
1,True,False
2,False,True
3,True,False
4,True,False
...,...,...
763,True,False
764,True,False
765,True,False
766,True,False
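
When the encoded columns should stay attached to the rest of the table, the `columns=` form mentioned earlier can be used. A minimal sketch on a hypothetical four-row frame follows; `dtype=int` is an optional choice made here to get 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical miniature frame with one categorical and one numeric column.
df = pd.DataFrame({
    "glucose": ["Prediabetic or Risky", "Normal", "Prediabetic or Risky", "Normal"],
    "age": [50, 31, 32, 21],
})

# columns=[...] encodes only the named column and keeps the other columns;
# dtype=int yields 0/1 values instead of booleans.
encoded = pd.get_dummies(df, columns=["glucose"], dtype=int)

print(encoded)
```

The new column names are prefixed with the original column name, so "glucose" becomes "glucose_Normal" and "glucose_Prediabetic or Risky".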


## **Feature Subset Selection**
#### It applies to cases where most of the attributes/features in a dataset are redundant or irrelevant, so we don't need all of them.
>Filter approach: the features to include in the subset are selected before the learning algorithm is run, and the selection is independent of that algorithm. A mathematical criterion is used to score the features and pick the most promising ones. Alternatively, if we want variety in the reduced feature set, we select the features that are least related to each other, i.e. attributes whose pairwise correlation is as low as possible. For example, Age and DOB depend very strongly on each other, so don't select both of them; DOB and Medical History may have low correlation, so out of these three attributes you can choose that low-correlation pair.
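
The low-pairwise-correlation idea can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic, `birth_year` stands in for DOB, and the 0.9 cut-off is an arbitrary choice rather than a recommended value.

```python
import numpy as np
import pandas as pd

# Synthetic data: birth_year is fully determined by age, bmi is independent.
rng = np.random.default_rng(0)
age = rng.integers(20, 80, size=200)
df = pd.DataFrame({
    "age": age,
    "birth_year": 2024 - age,
    "bmi": rng.normal(30, 5, size=200),
})

# Absolute pairwise correlations between all features.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds the threshold.
threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

Only one column of each highly correlated pair is removed, so the information the pair carries is kept once.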

# **Filter Approach**
### The code below shows an example of the "Filter Approach" to attribute selection using the chi-square test.
****
The chi-square test scores how strongly each attribute is associated with the output attribute, and the attributes with the highest scores are selected. Here we select the first "m" attributes that are most strongly associated with the output variable.
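
To make the score concrete, here is a hand computation of a chi-square statistic per feature on hypothetical data. It is a sketch under the assumption that the score compares per-class feature totals (observed) with the totals expected if the feature were independent of the class. This is believed to match the statistic that scikit-learn's `chi2` reports, but treat that as an assumption rather than a verified reading of the library.

```python
import numpy as np

# Hypothetical non-negative feature matrix (rows = samples) and binary target.
X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [8.0, 5.0],
              [9.0, 5.0]])
y = np.array([0, 0, 1, 1])

# One-hot encode the classes: Y[i, c] is 1 when sample i belongs to class c.
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                  # per-class sum of each feature
class_prob = Y.mean(axis=0)         # share of samples in each class
feature_total = X.sum(axis=0)       # total of each feature over all samples
expected = np.outer(class_prob, feature_total)

# Chi-square statistic per feature: sum over classes of (O - E)^2 / E.
scores = ((observed - expected) ** 2 / expected).sum(axis=0)
print(scores)
```

The first feature differs sharply between the two classes and gets a high score, while the second is constant and scores zero.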


In [12]:
from sklearn.feature_selection import SelectKBest, chi2

dbts_new = pd.read_csv('/content/imputed_data_diabetes1.csv')

new_dtaset = dbts_new.values
# Split the dataset into input and output variables;
# the subset is built from the input (independent) variables only.
X = new_dtaset[:, 0:8]                     # the 8 input variables
Y = new_dtaset[:, 8]                       # the last column, the output variable
test = SelectKBest(score_func=chi2, k=5)   # keep the k = 5 inputs with the highest chi-square scores
fit = test.fit(X, Y)                       # run the score function on (X, Y)

# Show the chi-square score of each input attribute
for i, j in enumerate(fit.scores_):
    print('Input Feature: %d, Score: %.5f' % (i, j))

# Reduce X from 8 input features to the k = 5 with the highest chi-square scores
dbts_ftr_sbset = fit.transform(X)

# Show the first five rows of the selected input features (from the cleaned, mean-imputed table)
print(dbts_ftr_sbset[0:5, :])

Input Feature: 0, Score: 111.51969
Input Feature: 1, Score: 1418.70503
Input Feature: 2, Score: 42.58251
Input Feature: 3, Score: 94.24570
Input Feature: 4, Score: 1689.71107
Input Feature: 5, Score: 108.76637
Input Feature: 6, Score: 5.39268
Input Feature: 7, Score: 181.30369
[[  6.  148.  125.   33.6  50. ]
 [  1.   85.  125.   26.6  31. ]
 [  8.  183.  125.   23.3  32. ]
 [  1.   89.   94.   28.1  21. ]
 [  0.  137.  168.   43.1  33. ]]


### **Result Interpretation of the Chi-Square Score Values**
<u>From the chi-square test, we found that the following attributes are the least associated with the output variable "Diabetic".</u>
****
Feature 2, score 42.58251, is "bp"
****
Feature 3, score 94.24570, is "skin"
****
Feature 6, score 5.39268, is "pedigree"

<b><font color = red>  Hence we can reduce the 8-feature set {pregnant,glucose,bp,skin,insulin,bmi,pedigree,age} </font>
    -->

<font color = green>into the 5 features {pregnant,glucose,insulin,bmi,age}, which are suitable for the learning algorithm.</font></b>
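
The mapping from feature index to column name above was done by hand. As a sketch, it can also be read off with `get_support()`, which returns a boolean mask of the selected columns. The miniature table below is hypothetical (column names follow the dataset, values are made up) and `k=2` is chosen only to fit the toy data.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical miniature table; values are made up for illustration.
df = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137, 116],
    "bp": [72, 66, 64, 66, 40, 74],
    "age": [50, 31, 32, 21, 33, 30],
    "Diabetic": [1, 0, 1, 0, 1, 0],
})
X = df.drop(columns="Diabetic")
y = df["Diabetic"]

# Fit the selector and keep the 2 highest-scoring inputs.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Pair every score with its column name, highest first.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

# Names of the columns that survive the selection.
selected = X.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```

Working with column names instead of positions avoids mislabelling a feature when the column order changes.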
