
# **Chapter 7: Feature Selection**

## Understanding Correlation  

- **Definition**: Measures the strength and direction of the linear relationship between two variables (how closely they move together).  
- **Use**: Helps in understanding and predicting relationships in data.  

**Types of Correlation**  
- **Positive**: Both variables increase together (e.g., study hours ↑ → grades ↑).  
- **Negative**: One increases while the other decreases (e.g., extracurriculars ↑ → grades ↓).  
- **Zero**: No linear relationship (e.g., assignments completed ↔ grades unrelated).  

**Correlation Coefficients**  
- Range: **-1.0 to 1.0**  
  - `-1.0`: Perfect negative  
  - `0`: No correlation  
  - `1.0`: Perfect positive  
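
To make the -1.0 to 1.0 range concrete, the following is a minimal sketch of the Pearson coefficient computed from its definition (covariance divided by the product of the spreads). The helper name `pearson_r` and the small number lists are hypothetical, chosen only to produce the three cases described above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# hypothetical values, for illustration only
hours = [1, 2, 3, 4, 5]
r_pos = pearson_r(hours, [2, 4, 6, 8, 10])  # rises in lockstep with hours
r_neg = pearson_r(hours, [10, 8, 6, 4, 2])  # falls in lockstep with hours
r_zero = pearson_r(hours, [3, 1, 4, 1, 3])  # symmetric pattern, no linear trend

print(r_pos, r_neg, r_zero)
```

The result should agree with `np.corrcoef(x, y)[0, 1]` up to floating-point rounding.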

**Python Implementation**  
- Use `df.corr()` in **pandas** to compute pairwise correlations between numeric columns (Pearson by default).  

In [1]:
import pandas as pd

In [3]:
# example data
data = {
'Study Hours': [10, 15, 8, 9, 12, 14, 13],
'Assignments Completed': [5, 7, 4, 5, 6, 7, 6],
'Extracurricular Activities': [3, 1, 4, 2, 3, 1, 2],
'Final Grade': [85, 90, 76, 81, 87, 92, 88]
}

print(data)

{'Study Hours': [10, 15, 8, 9, 12, 14, 13], 'Assignments Completed': [5, 7, 4, 5, 6, 7, 6], 'Extracurricular Activities': [3, 1, 4, 2, 3, 1, 2], 'Final Grade': [85, 90, 76, 81, 87, 92, 88]}


In [4]:
df = pd.DataFrame(data)
df.head()

   Study Hours  Assignments Completed  Extracurricular Activities  Final Grade
0           10                      5                           3           85
1           15                      7                           1           90
2            8                      4                           4           76
3            9                      5                           2           81
4           12                      6                           3           87


In [5]:
# calculate correlations
correlations = df.corr()
print(correlations)

                            Study Hours  Assignments Completed  \
Study Hours                    1.000000               0.973841   
Assignments Completed          0.973841               1.000000   
Extracurricular Activities    -0.803419              -0.865385   
Final Grade                    0.938558               0.956511   

                            Extracurricular Activities  Final Grade  
Study Hours                                  -0.803419     0.938558  
Assignments Completed                        -0.865385     0.956511  
Extracurricular Activities                    1.000000    -0.793204  
Final Grade                                  -0.793204     1.000000  
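
The matrix shows that `Study Hours` and `Assignments Completed` are strongly correlated with each other (about 0.97), not only with the grade. As a rough sketch, such redundant feature pairs can be listed automatically. The helper `highly_correlated_pairs` and the 0.9 cutoff are assumptions made for illustration, not part of pandas.

```python
import pandas as pd

data = {
    'Study Hours': [10, 15, 8, 9, 12, 14, 13],
    'Assignments Completed': [5, 7, 4, 5, 6, 7, 6],
    'Extracurricular Activities': [3, 1, 4, 2, 3, 1, 2],
    'Final Grade': [85, 90, 76, 81, 87, 92, 88]
}
df = pd.DataFrame(data)

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (column_a, column_b, |r|) for each pair with |r| above threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# look at the input features only, leaving out the target column
features = df.drop(columns='Final Grade')
pairs = highly_correlated_pairs(features)
print(pairs)
```

A pair flagged here carries nearly the same information twice, which is one reason to consider dropping one of the two features.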


## Basics of Feature Selection  
- Selects the features that are most relevant for prediction.  
- Irrelevant or redundant features can reduce accuracy and make the model harder to train and interpret.  

**Main methods:**  
- **Filter methods**  
- **Wrapper methods**  
- **Embedded methods**  

---

## Filter Methods  
- Use statistical measures to score features.  
- Rank features → keep or remove based on scores.  
- Examples: Chi-square test, information gain, correlation coefficient.  
- **Analogy:** Like a sieve, filtering out less useful features.  
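
The score-and-rank idea can be sketched with scikit-learn's `SelectKBest`, which scores each feature with a statistical test and keeps the top `k`. This example assumes the F-test for regression (`f_regression`) as the scoring function and `k=2`; both are illustrative choices, and the data repeats the grades table from the correlation section.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

data = {
    'Study Hours': [10, 15, 8, 9, 12, 14, 13],
    'Assignments Completed': [5, 7, 4, 5, 6, 7, 6],
    'Extracurricular Activities': [3, 1, 4, 2, 3, 1, 2],
    'Final Grade': [85, 90, 76, 81, 87, 92, 88]
}
df = pd.DataFrame(data)

X = df.drop(columns='Final Grade')  # input features
y = df['Final Grade']               # target

# score every feature against the target, then keep the two best
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = X.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```

Because the filter never trains a predictive model, it is cheap to run, but it scores each feature on its own and cannot see interactions between features.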

📌 Example: Check correlation of features with `final grade` and drop those with low correlation.  

In [6]:
import pandas as pd

In [7]:
# hypothetical data
data_2 = {
'study hours': [10, 9, 8, 7, 10, 9, 8],
'assignments completed': [10, 9, 8, 7, 10, 9, 8],
'class participation': [8, 8, 7, 7, 8, 8, 7],
'extracurricular activities': [2, 3, 2, 3, 2, 3, 2],
'final grade': [90, 89, 78, 77, 90, 89, 78]
}

df_2 = pd.DataFrame(data_2)

df_2.head()

   study hours  assignments completed  class participation  extracurricular activities  final grade
0           10                     10                    8                           2           90
1            9                      9                    8                           3           89
2            8                      8                    7                           2           78
3            7                      7                    7                           3           77
4           10                     10                    8                           2           90


In [9]:
# calculate correlations with 'final grade'
correlations = df_2.corr()['final grade'].sort_values()

print(correlations)

extracurricular activities    0.084215
study hours                   0.916995
assignments completed         0.916995
class participation           0.996546
final grade                   1.000000
Name: final grade, dtype: float64


In [10]:
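# keep only entries with correlation above 0.5
# note: 'final grade' correlates perfectly with itself, so the target also passes the filter;
# a strongly negative correlation would be dropped unless absolute values are compared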
relevant_features = correlations[correlations > 0.5]

print(relevant_features)

study hours              0.916995
assignments completed    0.916995
class participation      0.996546
final grade              1.000000
Name: final grade, dtype: float64
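
The threshold cell above keeps the target itself in the result and would discard a feature with a strong negative correlation. A minimal variant that excludes the target and compares absolute values could look like the sketch below; the helper name `select_by_correlation` is hypothetical.

```python
import pandas as pd

data_2 = {
    'study hours': [10, 9, 8, 7, 10, 9, 8],
    'assignments completed': [10, 9, 8, 7, 10, 9, 8],
    'class participation': [8, 8, 7, 7, 8, 8, 7],
    'extracurricular activities': [2, 3, 2, 3, 2, 3, 2],
    'final grade': [90, 89, 78, 77, 90, 89, 78]
}
df_2 = pd.DataFrame(data_2)

def select_by_correlation(frame, target, threshold=0.5):
    """Names of features whose |correlation| with the target exceeds threshold."""
    corr = frame.corr()[target].drop(target)  # remove the target's self-correlation
    return corr[corr.abs() > threshold].index.tolist()

selected = select_by_correlation(df_2, 'final grade')
print(selected)
```

The returned names can be used directly to subset the frame, for example `df_2[selected]`.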


## Wrapper Methods  
- Treat feature selection as a **search problem**.  
- Different feature combinations → evaluated using a predictive model.  
- Score based on **model performance** (e.g., accuracy for classification, R² for regression).  

**Examples:**  
- Recursive Feature Elimination (RFE)  
- Step Forward Selection  
- Step Backward Selection  

**Analogy:** Like trying on different outfits to see which combination looks best.  
- Add/remove features → check performance → keep best set.  

📌 Common approach: **RFE with Cross-Validation (CV)**.

In [11]:
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

In [13]:
# create a model
estimator = SVR(kernel="linear")
print(estimator)

SVR(kernel='linear')


In [14]:
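# create the RFECV object; the cross-validated scores are computed later, when fit() is called
# note: with only 7 rows, cv=5 leaves some folds with a single test sample, so treat the result as illustrative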
selector = RFECV(estimator, step=1, cv=5)
print(selector)

RFECV(cv=5, estimator=SVR(kernel='linear'))


In [18]:
# fit the data
selector = selector.fit(df_2.drop('final grade', axis=1), df_2['final grade'])

# print out the features selected
print(df_2.drop('final grade', axis=1).columns[selector.support_])

Index(['assignments completed'], dtype='object')
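
Step forward selection, listed among the wrapper examples, can be sketched with scikit-learn's `SequentialFeatureSelector` (available in scikit-learn 0.24 and later). The estimator (`LinearRegression`), the target of two features, and `cv=3` are assumptions made for this illustration. With only seven rows, the selected set should not be read as a reliable result.

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

data_2 = {
    'study hours': [10, 9, 8, 7, 10, 9, 8],
    'assignments completed': [10, 9, 8, 7, 10, 9, 8],
    'class participation': [8, 8, 7, 7, 8, 8, 7],
    'extracurricular activities': [2, 3, 2, 3, 2, 3, 2],
    'final grade': [90, 89, 78, 77, 90, 89, 78]
}
df_2 = pd.DataFrame(data_2)

X = df_2.drop(columns='final grade')  # input features
y = df_2['final grade']               # target

# start from an empty set and add one feature at a time,
# keeping the addition that gives the best cross-validated score
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction='forward',
    cv=3,
)
sfs.fit(X, y)

selected = X.columns[sfs.get_support()].tolist()
print(selected)
```

Setting `direction='backward'` gives step backward selection: the search starts from all features and removes one at a time.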




## Embedded Methods  
- Select features **during model training**.  
- Bias the model toward **lower complexity** (fewer nonzero coefficients).  
- Common approach: **Regularization methods** (e.g., Lasso).  

**Analogy:** Like a sculptor removing unnecessary pieces to reveal the final statue.  

📌 Example: **Lasso regression** shrinks the coefficients of less useful features during training; a feature whose coefficient reaches exactly zero is effectively removed.

In [20]:
from sklearn.linear_model import LassoCV
import numpy as np
import pandas as pd

In [25]:
X = df_2.drop('final grade', axis=1) # input features
y = df_2['final grade'] # target

print(X)

   study hours  assignments completed  class participation  \
0           10                     10                    8   
1            9                      9                    8   
2            8                      8                    7   
3            7                      7                    7   
4           10                     10                    8   
5            9                      9                    8   
6            8                      8                    7   

   extracurricular activities  
0                           2  
1                           3  
2                           2  
3                           3  
4                           2  
5                           3  
6                           2  


In [26]:
print(y)

0    90
1    89
2    78
3    77
4    90
5    89
6    78
Name: final grade, dtype: int64


In [24]:
lasso = LassoCV(cv=5) # our "sculptor"
lasso.fit(X, y) # sculpting process
importance = np.abs(lasso.coef_) # absolute coefficient size, used as a rough importance score
important_features = X.columns[importance > 0] # features with nonzero coefficients: our final "statue"
print(important_features)

Index(['study hours', 'class participation', 'extracurricular activities'], dtype='object')


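In this code, `important_features` holds the features whose Lasso coefficients are nonzero, meaning Lasso kept them for predicting `final grade`. These features remain part of the "statue", while the others were "chipped away". Here `assignments completed` has exactly the same values as `study hours`, so it adds no new information and was dropped.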
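
How much gets "chipped away" depends on the regularization strength `alpha`, which `LassoCV` chooses by cross-validation. As a rough sketch, fitting plain `Lasso` models over a few hand-picked `alpha` values shows the number of surviving features shrinking as the penalty grows. The `alpha` values below are arbitrary, chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

data_2 = {
    'study hours': [10, 9, 8, 7, 10, 9, 8],
    'assignments completed': [10, 9, 8, 7, 10, 9, 8],
    'class participation': [8, 8, 7, 7, 8, 8, 7],
    'extracurricular activities': [2, 3, 2, 3, 2, 3, 2],
    'final grade': [90, 89, 78, 77, 90, 89, 78]
}
df_2 = pd.DataFrame(data_2)

X = df_2.drop(columns='final grade')  # input features
y = df_2['final grade']               # target

alphas = [0.01, 0.1, 1.0, 100.0]  # arbitrary penalty strengths, weakest to strongest
nonzero_counts = {}
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=100_000)
    model.fit(X, y)
    # count the features that still have a nonzero coefficient
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a large enough penalty every coefficient is driven to zero and the model predicts only the mean grade, which is why the penalty is usually tuned rather than set by hand.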