# Practical - 2

Use any dataset and perform the following operations in a suitable programming language.
1. Find the standard deviation and variance of every numerical attribute.
2. Find the covariance and perform correlation analysis using the correlation coefficient.
3. How many independent features are present in the given dataset?
4. Can we identify unwanted features?
5. Perform data discretization using the equi-frequency binning method on any numeric attribute.
6. Normalize the numeric attributes using min-max normalization, Z-score normalization, and decimal scaling normalization.

#### Correlation Coefficient

The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is denoted by r and ranges from -1 to 1.
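For reference, the Pearson form of the coefficient (the default method of pandas `corr()`) can be written as below for two attributes x and y with n observations. The symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$, $s_x$ and $s_y$ are notation introduced here, not taken from the notebook.

```latex
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

A value of r near 1 means the two attributes rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.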

In [26]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [27]:
df = pd.read_csv("C:/Users/aadit/OneDrive/文档/Engineering TY/Sem-6th/Data_Mining/IRIS.csv")

In [28]:
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [29]:
df.tail()

,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [30]:
# Drop categorical column
iris_numeric = df.drop(columns=['species'])

In [32]:
# std() computes the sample standard deviation (ddof=1) of each numeric column.
std = iris_numeric.std()
std


sepal_length    0.828066
sepal_width     0.433594
petal_length    1.764420
petal_width     0.763161
dtype: float64

In [33]:
# var() computes the sample variance (ddof=1), which measures how much the values deviate from the mean.
var = iris_numeric.var()
var


sepal_length    0.685694
sepal_width     0.188004
petal_length    3.113179
petal_width     0.582414
dtype: float64
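
The two outputs above are linked: each standard deviation is the square root of the corresponding variance (for example, 0.828066 squared is about 0.685694). The sketch below illustrates that relationship, and the difference between the pandas default (`ddof=1`) and the NumPy default (`ddof=0`), on a small hypothetical column that is not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the Iris data.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()            # pandas default: ddof=1 (divides by n - 1)
population_var = values.var(ddof=0)  # divides by n, like np.var's default
sample_std = values.std()            # pandas default: ddof=1

# The standard deviation is the square root of the variance.
assert np.isclose(sample_std, np.sqrt(sample_var))
# NumPy's default variance matches pandas only when ddof=0 is requested.
assert np.isclose(population_var, np.var(values.to_numpy()))

print("sample variance:", sample_var)
print("population variance:", population_var)
```

With 150 rows the two conventions differ only slightly, but the choice matters when comparing pandas results with those of other libraries.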

In [34]:
# cov() computes the pairwise covariance matrix, which shows how two features vary together (the diagonal holds each feature's variance).
cov = iris_numeric.cov()
cov


,sepal_length,sepal_width,petal_length,petal_width
sepal_length,0.685694,-0.039268,1.273682,0.516904
sepal_width,-0.039268,0.188004,-0.321713,-0.117981
petal_length,1.273682,-0.321713,3.113179,1.296387
petal_width,0.516904,-0.117981,1.296387,0.582414


In [35]:
# corr() computes the Pearson correlation matrix, which shows how strongly each pair of features is linearly related.
# A correlation close to 1 or -1 indicates a strong relationship; a value near 0 indicates a weak one.
corr = iris_numeric.corr()
corr


,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.109369,0.871754,0.817954
sepal_width,-0.109369,1.0,-0.420516,-0.356544
petal_length,0.871754,-0.420516,1.0,0.962757
petal_width,0.817954,-0.356544,0.962757,1.0
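
The correlation matrix can be recovered from the covariance matrix by dividing each covariance by the product of the two standard deviations. Using the values reported above, 1.273682 / (0.828066 × 1.764420) ≈ 0.8718, which matches the sepal_length/petal_length entry. A minimal sketch of that conversion, on hypothetical toy data rather than the Iris file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Iris measurements.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

cov = toy.cov()   # sample covariance matrix (ddof=1)
std = toy.std()   # sample standard deviations (ddof=1)

# Divide each covariance by the product of the two standard deviations.
corr_from_cov = cov / np.outer(std, std)

# The result should agree with pandas' own Pearson correlation matrix.
assert np.allclose(corr_from_cov, toy.corr())
print(corr_from_cov)
```

Unlike covariance, the result is unit-free, which is why correlation is easier to compare across feature pairs.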


In [61]:
# Independent feature identification (assuming a correlation threshold of 0.8)

# An independent feature is a numerical variable that does not have a strong correlation
# with any other feature in the dataset, so it provides unique information and is not redundant.
# A feature is selected when its absolute correlation with every other feature is below 0.8.
independent_features = [column for column in iris_numeric.columns if all(abs(corr[column].drop(column)) < 0.8)]
print("Independent Features:\n", independent_features)

Independent Features:
 ['sepal_width']

In [59]:
# Dependent feature identification (a feature is listed when all of its absolute correlations exceed 0.3)
dependent_features = [column for column in iris_numeric.columns if all(abs(corr[column]) > 0.3)]
print("Dependent Features:\n", dependent_features)

Dependent Features:
 ['petal_length', 'petal_width']
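
Per-feature lists do not show which features are redundant with each other. One possible complement is to list the strongly correlated pairs directly from the upper triangle of the correlation matrix. The helper name `correlated_pairs`, the 0.8 default, and the toy columns below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1 so each pair is reported once and self-pairs are skipped.
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Hypothetical toy data: one column duplicates another up to a scale factor.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "unrelated": [3.0, -1.0, 2.0, 2.0, -1.0, 3.0],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Applied to `iris_numeric`, a helper like this would report the pairs whose entries in the correlation matrix above are at least 0.8 in absolute value.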


In [38]:
# Unwanted features (low variance threshold)
low_variance_features = [column for column in iris_numeric.columns if iris_numeric[column].var() < 0.1]
print("Unwanted Features:\n", low_variance_features)

# A feature with very low variance is nearly constant, so it contributes little information.

Unwanted Features:
 []
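
One caveat with a fixed variance threshold is that variance depends on the unit of measurement, so the same information can pass or fail the filter depending on its scale. The sketch below illustrates this with a hypothetical helper `low_variance_columns` and made-up columns; it is an illustration under those assumptions, not the notebook's method.

```python
import pandas as pd

def low_variance_columns(frame, threshold=0.1):
    """Return the names of columns whose sample variance is below threshold."""
    variances = frame.var()
    return list(variances[variances < threshold].index)

# Hypothetical toy data: the same lengths expressed in two units, plus a constant column.
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "length_m": [0.1, 0.2, 0.3, 0.4],
})
toy["length_cm"] = toy["length_m"] * 100  # same information, different unit

flagged = low_variance_columns(toy)
print(flagged)  # the constant column and length_m are flagged, length_cm is not
```

For that reason it can be safer to compare variances after the features have been brought to a common scale.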


In [42]:
# Data discretization using equi-frequency binning for column 'sepal_length'

# Data discretization is the process of converting continuous numerical data into discrete categories

num_bins = 4
df['sepal_length_binned'] = pd.qcut(df['sepal_length'], q=num_bins, labels=False)
print("Binned Data:\n", df[['sepal_length', 'sepal_length_binned']])

# Splits sepal_length into 4 quantile-based bins of roughly equal frequency (tied values can make the counts uneven).

Binned Data:
      sepal_length  sepal_length_binned
0             5.1                    0
1             4.9                    0
2             4.7                    0
3             4.6                    0
4             5.0                    0
..            ...                  ...
145           6.7                    3
146           6.3                    2
147           6.5                    3
148           6.2                    2
149           5.9                    2

[150 rows x 2 columns]
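
To check that the bins really are balanced, `pd.qcut` can also return the bin edges via `retbins=True`, and `value_counts()` gives the number of rows per bin. A minimal sketch on a hypothetical series with no tied values:

```python
import pandas as pd

# Hypothetical toy values with no ties, so the bins split evenly.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# labels=False returns integer bin codes; retbins=True also returns the bin edges.
codes, edges = pd.qcut(values, q=4, labels=False, retbins=True)

counts = codes.value_counts().sort_index()
print("bin codes:", codes.tolist())
print("bin edges:", edges)
print("rows per bin:", counts.tolist())
```

On real data with repeated values, such as `sepal_length`, the per-bin counts may differ from one another because identical values always land in the same bin.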


In [43]:
# Min-Max normalization
scaler_minmax = MinMaxScaler()
iris_minmax = pd.DataFrame(scaler_minmax.fit_transform(iris_numeric), columns=iris_numeric.columns)
print("Min-Max Normalized Data:\n", iris_minmax)

# Scales each column to the range [0, 1] using (x - min) / (max - min).

Min-Max Normalized Data:
      sepal_length  sepal_width  petal_length  petal_width
0        0.222222     0.625000      0.067797     0.041667
1        0.166667     0.416667      0.067797     0.041667
2        0.111111     0.500000      0.050847     0.041667
3        0.083333     0.458333      0.084746     0.041667
4        0.194444     0.666667      0.067797     0.041667
..            ...          ...           ...          ...
145      0.666667     0.416667      0.711864     0.916667
146      0.555556     0.208333      0.677966     0.750000
147      0.611111     0.416667      0.711864     0.791667
148      0.527778     0.583333      0.745763     0.916667
149      0.444444     0.416667      0.694915     0.708333

[150 rows x 4 columns]
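
The same transformation can be written directly in pandas, which makes the formula (x - min) / (max - min) explicit. This is a minimal sketch on hypothetical toy data; it assumes no column is constant, since a constant column would give a zero denominator.

```python
import pandas as pd

# Hypothetical toy data, not the Iris measurements.
toy = pd.DataFrame({
    "a": [2.0, 4.0, 6.0, 10.0],
    "b": [-1.0, 0.0, 1.0, 3.0],
})

# Column-wise min-max scaling: the minimum maps to 0 and the maximum to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```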


In [60]:
# Z-Score normalization
scaler_zscore = StandardScaler()
iris_zscore = pd.DataFrame(scaler_zscore.fit_transform(iris_numeric), columns=iris_numeric.columns)
print("Z-Score Normalized Data:\n", iris_zscore)

# Rescales each column to mean 0 and standard deviation 1 (StandardScaler uses the population standard deviation, ddof=0).

Z-Score Normalized Data:
      sepal_length  sepal_width  petal_length  petal_width
0       -0.900681     1.032057     -1.341272    -1.312977
1       -1.143017    -0.124958     -1.341272    -1.312977
2       -1.385353     0.337848     -1.398138    -1.312977
3       -1.506521     0.106445     -1.284407    -1.312977
4       -1.021849     1.263460     -1.341272    -1.312977
..            ...          ...           ...          ...
145      1.038005    -0.124958      0.819624     1.447956
146      0.553333    -1.281972      0.705893     0.922064
147      0.795669    -0.124958      0.819624     1.053537
148      0.432165     0.800654      0.933356     1.447956
149      0.068662    -0.124958      0.762759     0.790591

[150 rows x 4 columns]
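
`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `std()` defaults to the sample version (`ddof=1`), so a hand-written z-score matches the scaler only when `ddof=0` is passed. A minimal sketch of both variants on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with mean 5 and population standard deviation 2.
toy = pd.DataFrame({"a": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# z = (x - mean) / std, using the population std (the StandardScaler convention).
z_population = (toy - toy.mean()) / toy.std(ddof=0)

# The same formula with the pandas default (sample std, ddof=1).
z_sample = (toy - toy.mean()) / toy.std()

print(z_population["a"].tolist())
print(z_sample["a"].tolist())
```

The sample-based scores are slightly smaller in magnitude because the sample standard deviation is slightly larger.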


In [45]:
# Decimal Scaling Normalization

# Decimal scaling moves the decimal point of the values so that they fall within (-1, 1).
# Each value is divided by 10**j, where j is based on the maximum absolute value in the column.
def decimal_scaling(column):
    max_val = column.abs().max()  # Find the maximum absolute value
    # j is the number of digits before the decimal point; values already below 1 need no scaling
    j = len(str(int(max_val))) if max_val >= 1 else 0
    return column / (10 ** j)  # Apply scaling

iris_decimal_scaled = iris_numeric.apply(decimal_scaling)
print("Decimal Scaled Data:\n", iris_decimal_scaled)

Decimal Scaled Data:
      sepal_length  sepal_width  petal_length  petal_width
0            0.51         0.35          0.14         0.02
1            0.49         0.30          0.14         0.02
2            0.47         0.32          0.13         0.02
3            0.46         0.31          0.15         0.02
4            0.50         0.36          0.14         0.02
..            ...          ...           ...          ...
145          0.67         0.30          0.52         0.23
146          0.63         0.25          0.50         0.19
147          0.65         0.30          0.52         0.20
148          0.62         0.34          0.54         0.23
149          0.59         0.30          0.51         0.18

[150 rows x 4 columns]
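
Decimal scaling is usually defined as v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1. The sketch below finds j with a simple loop instead of counting digits. The function name `decimal_scale` and the sample values are illustrative assumptions, and the sketch restricts j to non-negative integers and assumes finite input.

```python
import pandas as pd

def decimal_scale(column):
    """Divide by 10**j, with j the smallest non-negative integer so that max|v| / 10**j < 1."""
    max_abs = column.abs().max()
    j = 0
    # Increase j until the largest magnitude drops below 1.
    while max_abs / (10 ** j) >= 1:
        j += 1
    return column / (10 ** j), j

# Hypothetical values: the largest magnitude is 986, so j should be 3.
scaled, j = decimal_scale(pd.Series([-986.0, 120.0, 917.0]))
print(j, scaled.tolist())

# Values that are already inside (-1, 1) are left unchanged (j = 0).
small, j_small = decimal_scale(pd.Series([0.2, -0.5]))
print(j_small, small.tolist())
```

For the Iris columns every maximum lies between 1 and 10, so j = 1 and each value is divided by 10, as in the output above.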


In [62]:
# max_val is local to decimal_scaling(), so referencing it at notebook scope raises a NameError.
max_val

NameError: name 'max_val' is not defined