<div>
<img src="https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg" width="300">
</div>

## Lab 4.2.1: Feature Selection

In this lab we look at feature selection for a regression model. We use correlation analysis to examine the relationship between each feature and the target variable, and pick the features that appear most influential. We then fit a linear regression on those features and score it with R². Finally, we use cross validation to check that the model generalises to unseen data by measuring its performance across several validation folds, which also gives a fairer basis for comparing different feature sets.

### 1. Load & Explore Data

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score, mean_squared_error

%matplotlib inline

#### 1.1 Load Data

In [24]:
# Read CSV
wine_csv = 'winequality_merged.csv'
df = pd.read_csv(wine_csv)

#### 1.2 Explore Data (Exploratory Data Analysis)

In [9]:
# ANSWER
print(df.head())


   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  red_wine  
0      9.4        5         1  
1      9.8        5   

In [11]:
print(df.describe())


       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    6497.000000       6497.000000  6497.000000     6497.000000   
mean        7.215307          0.339666     0.318633        5.443235   
std         1.296434          0.164636     0.145318        4.757804   
min         3.800000          0.080000     0.000000        0.600000   
25%         6.400000          0.230000     0.250000        1.800000   
50%         7.000000          0.290000     0.310000        3.000000   
75%         7.700000          0.400000     0.390000        8.100000   
max        15.900000          1.580000     1.660000       65.800000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  6497.000000          6497.000000           6497.000000  6497.000000   
mean      0.056034            30.525319            115.744574     0.994697   
std       0.035034            17.749400             56.521855     0.002999   
min       0.009000             1.000000         

In [13]:
print(df.shape)

(6497, 13)


In [15]:
print(df.dtypes)


fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
red_wine                  int64
dtype: object


In [17]:
print(df.isnull().sum())


fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
red_wine                0
dtype: int64


### 2. Set Target Variable

Create a target variable for wine quality.

In [50]:
# Target Variable

target_variable = 'quality' 
wine_quality = df[target_variable]
print("Target variable 'wine_quality' created successfully.")
print(wine_quality.head())  

Target variable 'wine_quality' created successfully.
0    5
1    5
2    5
3    6
4    5
Name: quality, dtype: int64


### 3. Set Predictor Variables

Create a predictor matrix with variables of your choice. State your reasoning for the choices you make.

In [54]:
# ANSWER
predictor_variables = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'alcohol']
X = df[predictor_variables]
print("Predictor matrix 'X' created successfully with variables:")
print(X.head())


Predictor matrix 'X' created successfully with variables:
   fixed acidity  volatile acidity  citric acid  residual sugar  alcohol
0            7.4              0.70         0.00             1.9      9.4
1            7.8              0.88         0.00             2.6      9.8
2            7.8              0.76         0.04             2.3      9.8
3           11.2              0.28         0.56             1.9      9.8
4            7.4              0.70         0.00             1.9      9.4
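
One way to back up the choice of predictors is to rank each feature by its Pearson correlation with the target. The sketch below is illustrative only: it builds a small synthetic frame (`demo_df`, a hypothetical stand-in with made-up coefficients) so that it runs without the CSV. In the lab itself the same two lines at the end could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (assumption: values are invented for illustration)
rng = np.random.default_rng(42)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
volatile_acidity = rng.normal(0.34, 0.16, n)
residual_sugar = rng.normal(5.4, 4.7, n)  # unrelated to quality by construction
quality = 0.4 * alcohol - 1.5 * volatile_acidity + rng.normal(0, 0.7, n)

demo_df = pd.DataFrame({
    'alcohol': alcohol,
    'volatile acidity': volatile_acidity,
    'residual sugar': residual_sugar,
    'quality': quality,
})

# Pearson correlation of every feature with the target
target_corr = demo_df.corr()['quality'].drop('quality')

# Rank by absolute correlation so strong negative relationships are not overlooked
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a starting point for choosing predictors rather than a final answer.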


### 4. Create a Linear Regression Model and Score It on Test Data

In [57]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [59]:
# Train-Test Split
X = df[predictor_variables]
y = df[target_variable]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [61]:
# Create a model for Linear Regression
model = LinearRegression()

# Fit the model with the Training data
model.fit(X_train, y_train)


# Calculate the score (R^2 for Regression) for Training Data
train_r2 = model.score(X_train, y_train)
print(f"R² score for training data: {train_r2:.4f}")


# Calculate the score (R^2 for Regression) for Testing Data
test_r2 = model.score(X_test, y_test)
print(f"R² score for testing data: {test_r2:.4f}")

R² score for training data: 0.2702
R² score for testing data: 0.2526
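
For a regressor, `model.score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot, so a test score near 0.25 means the model explains roughly a quarter of the variance in `quality`. The sketch below computes R² by hand on a few made-up values (not taken from the wine data) and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only (assumption: not taken from the wine data)
y_true = np.array([5, 5, 6, 7, 5, 6])
y_pred = np.array([5.2, 5.4, 5.8, 6.1, 5.3, 5.9])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R²:  {r2_manual:.4f}")
print(f"sklearn R²: {r2_score(y_true, y_pred):.4f}")
```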


## BONUS: Cross validation

In [64]:
# Cross validation
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

In [70]:
 
df = df.dropna(subset=predictor_variables + [target_variable])

# Create predictor matrix X and target variable y
X = df[predictor_variables]
y = df[target_variable]

# Initialize KFold with 5 folds and shuffle the data
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

train_scores = []
train_rmse = []
test_scores = []
test_rmse = []

# Iterate through each fold
for k, (train_idx, test_idx) in enumerate(k_fold.split(X, y)):
    # Get training and test sets for X and y
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    # Create a linear regression model
    model = LinearRegression()
    
    # Fit the model with training set
    model.fit(X_train, y_train)
    
    # Make predictions with training and test set
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Calculate R² scores
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # Calculate RMSE (Root Mean Squared Error)
    train_rmse_score = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse_score = np.sqrt(mean_squared_error(y_test, y_test_pred))
    
    # Append scores to lists
    train_scores.append(train_r2)
    train_rmse.append(train_rmse_score)
    test_scores.append(test_r2)
    test_rmse.append(test_rmse_score)

# Create a metrics_df dataframe to display r2 and rmse scores
metrics_df = pd.DataFrame({
    'Fold': np.arange(1, k_fold.get_n_splits() + 1),  # fold numbers follow n_splits
    'Train R²': train_scores,
    'Test R²': test_scores,
    'Train RMSE': train_rmse,
    'Test RMSE': test_rmse
})

# Display the metrics dataframe
print("Metrics for each fold:")
print(metrics_df)










Metrics for each fold:
   Fold  Train R²   Test R²  Train RMSE  Test RMSE
0     1  0.270246  0.252595    0.748810   0.742963
1     2  0.267140  0.265673    0.747500   0.748179
2     3  0.264128  0.277286    0.748112   0.745627
3     4  0.269108  0.258547    0.744217   0.760963
4     5  0.265197  0.272763    0.748799   0.743326
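
The manual loop above can also be written with scikit-learn's `cross_validate`, which fits and scores the model on each fold in one call. This is a sketch under assumptions: it uses synthetic data from `make_regression` (named `X_demo` and `y_demo` here) so that it runs without the CSV. With the lab data, `X` and `y` would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (assumption: stands in for the wine predictors and target)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

cv_results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=k_fold,
    scoring=('r2', 'neg_root_mean_squared_error'),
    return_train_score=True,
)

# scikit-learn negates error metrics so that "higher is better"; flip the sign back
test_r2 = cv_results['test_r2']
test_rmse = -cv_results['test_neg_root_mean_squared_error']

print("Test R² per fold:  ", test_r2.round(4))
print("Test RMSE per fold:", test_rmse.round(4))
```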


In [None]:
# Describe the metrics
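
One possible way to describe the metrics is to summarise them across folds. The sketch below re-enters the values printed above into a frame and reports the mean and standard deviation of each column. On these numbers, the mean test R² is about 0.265 and the mean test RMSE about 0.748, with train and test scores close together in every fold.

```python
import pandas as pd

# Values copied from the per-fold output printed above
metrics_df = pd.DataFrame({
    'Fold': [1, 2, 3, 4, 5],
    'Train R²': [0.270246, 0.267140, 0.264128, 0.269108, 0.265197],
    'Test R²': [0.252595, 0.265673, 0.277286, 0.258547, 0.272763],
    'Train RMSE': [0.748810, 0.747500, 0.748112, 0.744217, 0.748799],
    'Test RMSE': [0.742963, 0.748179, 0.745627, 0.760963, 0.743326],
})

# Mean and standard deviation of each metric across the five folds
summary = metrics_df.drop(columns='Fold').agg(['mean', 'std'])
print(summary.round(4))
```

Similar train and test scores with little spread between folds suggest the model is not overfitting. The low R² points to underfitting instead: these five features leave most of the variance in quality unexplained.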

### 5. Feature Selection

What's your score (R^2 for Regression) for Testing Data?

How many features have you selected? Can you improve your score by selecting different features?
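
To compare feature sets fairly, each candidate can be scored with the same cross-validation folds. The following is a minimal sketch, assuming a synthetic frame (`demo_df`) and hypothetical candidate sets. In the lab, `df` and column lists drawn from the wine data would replace them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the wine data (assumption: coefficients are invented)
rng = np.random.default_rng(0)
n = 400
demo_df = pd.DataFrame({
    'alcohol': rng.normal(10.5, 1.2, n),
    'volatile acidity': rng.normal(0.34, 0.16, n),
    'density': rng.normal(0.9947, 0.003, n),
})
demo_df['quality'] = (
    0.4 * demo_df['alcohol']
    - 1.5 * demo_df['volatile acidity']
    + rng.normal(0, 0.7, n)
)

# Hypothetical candidate feature sets to compare
candidate_sets = {
    'alcohol only': ['alcohol'],
    'alcohol + volatile acidity': ['alcohol', 'volatile acidity'],
    'all three': ['alcohol', 'volatile acidity', 'density'],
}

# Reuse one KFold object so every candidate is scored on identical splits
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

mean_r2 = {}
for name, cols in candidate_sets.items():
    scores = cross_val_score(
        LinearRegression(), demo_df[cols], demo_df['quality'],
        cv=k_fold, scoring='r2',
    )
    mean_r2[name] = scores.mean()
    print(f"{name}: mean test R² = {scores.mean():.4f} (std {scores.std():.4f})")
```

Comparing mean cross-validated R² rather than a single train-test split reduces the chance of picking a feature set that only looks better because of one lucky split.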

**Please continue with Lab 4.2.2 with the same dataset.**



---



> > > > > > > > > © 2024 Institute of Data


---



