
# **MIT Academy of Engineering (MIT AOE)**
## **Department of Artificial Intelligence**
### **Assignment No. 1 – Feature Engineering on Wine Quality Dataset**
## Problem Type: Regression
## Source: UCI Machine Learning Repository.
Features: 11 physicochemical properties

---

| Name | Roll Number |
|------|--------------|
| Arjun Tate | 202401110061 |

---


1. Setup and Data Exploration


In [22]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
import seaborn as sns
import matplotlib.pyplot as plt

# --- Dataset Chosen: Wine Quality (Red Wine) from UCI ML Repository ---
# The target variable is the 'quality' score (3-8), a regression/classification problem.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(url, sep=';')

# Rename columns for easier access (replace spaces with underscores)
df.columns = df.columns.str.replace(' ', '_')

# Separate target variable (quality) for later use
target_col = 'quality'
X = df.drop(columns=[target_col])
y = df[target_col]

# 1. Data Exploration
print("--- 1. Data Exploration ---")
print("Initial Dataset Shape:", df.shape)
print("\nData Types:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nMissing Values Check (Should be 0 for this dataset):")
print(df.isnull().sum())
print("\nSummary Statistics:")
print(df.describe().T)

# Observation: The dataset is clean. All features are numeric (float64), and there are no missing values.
# However, for a robust pipeline, we will keep the imputation and scaling steps.

--- 1. Data Exploration ---
Initial Dataset Shape: (1599, 12)

Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         1599 non-null   float64
 1   volatile_acidity      1599 non-null   float64
 2   citric_acid           1599 non-null   float64
 3   residual_sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free_sulfur_dioxide   1599 non-null   float64
 6   total_sulfur_dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Missing Values Check (Should be 0 for this da

2. Handle Missing Data

In [None]:
# 2. Handle Missing Data (Using SimpleImputer with median/most_frequent)
print("\n--- 2. Handle Missing Data ---")
# The red wine dataset has no missing values, but we fit the imputer anyway.

# Identify numeric and categorical columns
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include='object').columns

# Initialize Imputers
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# Impute numeric columns
X[num_cols] = num_imputer.fit_transform(X[num_cols])

# Impute categorical columns (if any)
if len(cat_cols) > 0:
    X[cat_cols] = cat_imputer.fit_transform(X[cat_cols])

print("Missing Data Handled (No change for this dataset).")
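
Because the red-wine data has no gaps, the imputation cell above never changes a value. As a minimal sketch of what the median strategy does when a value is missing, the toy frame below (hypothetical values, not taken from the dataset) contains one NaN:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame (not the wine data): one numeric column with a gap.
toy = pd.DataFrame({"alcohol": [9.4, np.nan, 9.8, 10.0, 11.2]})

# The median strategy replaces each NaN with the median of the observed
# values in that column (here 9.4, 9.8, 10.0, 11.2).
toy_imputer = SimpleImputer(strategy="median")
toy_filled = pd.DataFrame(toy_imputer.fit_transform(toy), columns=toy.columns)

print(toy_filled)
print("Learned fill value:", toy_imputer.statistics_)
```

The `statistics_` attribute holds the per-column fill values learned during `fit`, which makes it easy to check what the imputer would insert.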

3. Encode Categorical Variables

In [23]:
# 3. Encode Categorical Variables
print("\n--- 3. Encode Categorical Variables ---")
# Wine Quality Red dataset is purely numeric, so no encoding is needed.
# We treat the current feature set as the encoded/prepared set for scaling.
X_encoded = X.copy()
print("No categorical variables to encode in the Wine Quality dataset.")


--- 3. Encode Categorical Variables ---
No categorical variables to encode in the Wine Quality dataset.
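
For reference, here is a sketch of what this step would look like if a categorical column were present. The `wine_type` column is an assumption made for illustration; it does not exist in the red-wine CSV.

```python
import pandas as pd

# Hypothetical frame: 'wine_type' is NOT a column of the red-wine dataset.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "wine_type": ["red", "white", "red"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["wine_type"], dtype=int)
print(toy_encoded)
```

Each category becomes its own 0/1 column, so the encoded frame can go straight into the scaling step.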


4. Feature Scaling

In [24]:
# 4. Feature Scaling (Standardization)
print("\n--- 4. Feature Scaling (Standardization) ---")
scaler = StandardScaler()
# Fit and transform the feature matrix
X_scaled_array = scaler.fit_transform(X_encoded)
X_scaled = pd.DataFrame(X_scaled_array, columns=X_encoded.columns)

print("Scaled Features (Head):")
print(X_scaled.head())
print("Features now have mean close to 0 and std dev close to 1.")


--- 4. Feature Scaling (Standardization) ---
Scaled Features (Head):
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0      -0.528360          0.961877    -1.391472       -0.453218  -0.243707   
1      -0.298547          1.967442    -1.391472        0.043416   0.223875   
2      -0.298547          1.297065    -1.186070       -0.169427   0.096353   
3       1.654856         -1.384443     1.484154       -0.453218  -0.264960   
4      -0.528360          0.961877    -1.391472       -0.453218  -0.243707   

   free_sulfur_dioxide  total_sulfur_dioxide   density        pH  sulphates  \
0            -0.466193             -0.379133  0.558274  1.288643  -0.579207   
1             0.872638              0.624363  0.028261 -0.719933   0.128950   
2            -0.083669              0.229047  0.134264 -0.331177  -0.048089   
3             0.107592              0.411500  0.664277 -0.979104  -0.461180   
4            -0.466193             -0.379133  0.558274  1.288643  
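
The claim that scaled features have mean close to 0 and standard deviation close to 1 can be checked directly. The sketch below uses synthetic data as a stand-in for two wine features on very different scales; the means and spreads are made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "density": rng.normal(0.997, 0.002, size=200),
    "total_sulfur_dioxide": rng.normal(46.0, 33.0, size=200),
})

toy_scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler divides by the population standard deviation (ddof=0),
# so the check uses ddof=0; pandas' default .std() uses ddof=1.
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

With the pandas default (`ddof=1`) the reported standard deviation is slightly above 1, which is expected and not a sign of a scaling error.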

5. PCA

In [25]:
# 5. PCA (Dimensionality Reduction)
print("\n--- 5. PCA (Dimensionality Reduction) ---")
# Check variance explained to decide on n_components
pca_full = PCA()
pca_full.fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Determine n_components to retain 95% variance
n_components_95 = np.where(cumulative_variance >= 0.95)[0][0] + 1
print(f"Components to retain 95% variance: {n_components_95} out of {X_scaled.shape[1]}")

# Apply PCA with a fixed number of components.
# 2 components are used here as a simple, easy-to-plot demonstration;
# use n_components_95 instead to retain 95% of the variance.
n_components_final = 2
pca = PCA(n_components=n_components_final)
pca_features = pca.fit_transform(X_scaled)

# Create a DataFrame for PCA features
pca_df = pd.DataFrame(pca_features, columns=[f'PC{i+1}' for i in range(n_components_final)])

print(f"\nPCA Features (Head with {n_components_final} components):")
print(pca_df.head())


--- 5. PCA (Dimensionality Reduction) ---
Components to retain 95% variance: 9 out of 11

PCA Features (Head with 2 components):
        PC1       PC2
0 -1.619530  0.450950
1 -0.799170  1.856553
2 -0.748479  0.882039
3  2.357673 -0.269976
4 -1.619530  0.450950
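
The 95% threshold computed by hand in the cell above can also be requested from `PCA` directly by passing a float between 0 and 1 as `n_components`. The sketch below compares both routes on synthetic data (a stand-in for `X_scaled`, with two deliberately redundant columns).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled: 6 features, two of them nearly redundant.
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 4))
noise = 0.05 * rng.normal(size=(300, 2))
toy = np.hstack([base, base[:, :2] + noise])
toy_scaled = StandardScaler().fit_transform(toy)

# Manual route: first index where the cumulative explained-variance
# ratio reaches 0.95, plus one.
cumulative = np.cumsum(PCA().fit(toy_scaled).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Shorthand: a float in (0, 1) asks PCA to choose the component count itself.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(toy_scaled)

print("manual:", n_manual, "| PCA(n_components=0.95):", pca_95.n_components_)
```

The float form keeps the threshold in one place, which is convenient when the same PCA step is later reused inside a pipeline.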


6. Feature Selection

In [27]:

print("\n--- 6. Feature Selection (SelectKBest) ---")

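# Choose k (e.g., top 6 features)
k = 6

# f_classif is the ANOVA F-test: it treats each integer quality score as a class label.
# If quality is modelled as a continuous target, f_regression is the matching score function.
selector = SelectKBest(score_func=f_classif, k=k)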
selector.fit(X_scaled, y)
selected_features_mask = selector.get_support()
selected_features = X_scaled.columns[selected_features_mask]

# Create a DataFrame with only the selected features
X_selected = X_scaled[selected_features]

print(f"Selected {k} Features using SelectKBest:")
print(selected_features.tolist())
print("\nSelected Features (Head):")
print(X_selected.head())




--- 6. Feature Selection (SelectKBest) ---
Selected 6 Features using SelectKBest:
['volatile_acidity', 'citric_acid', 'total_sulfur_dioxide', 'density', 'sulphates', 'alcohol']

Selected Features (Head):
   volatile_acidity  citric_acid  total_sulfur_dioxide   density  sulphates  \
0          0.961877    -1.391472             -0.379133  0.558274  -0.579207   
1          1.967442    -1.391472              0.624363  0.028261   0.128950   
2          1.297065    -1.186070              0.229047  0.134264  -0.048089   
3         -1.384443     1.484154              0.411500  0.664277  -0.461180   
4          0.961877    -1.391472             -0.379133  0.558274  -0.579207   

    alcohol  
0 -0.960246  
1 -0.584777  
2 -0.584777  
3 -0.584777  
4 -0.960246  
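
Since the assignment frames quality prediction as regression, it may be worth comparing against `f_regression`, which scores each feature with a univariate linear F-test against a numeric target. The sketch below uses synthetic data with illustrative column names (`a`, `b`, `c`), not the wine features, and is not the method used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: the target depends strongly on 'a', moderately on 'b',
# and not at all on 'c'. These names are illustrative only.
rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
toy_y = 3.0 * toy_X["a"] + 1.0 * toy_X["b"] + rng.normal(size=500)

# f_regression tests the linear relationship between each feature and a
# numeric target, instead of comparing class means as f_classif does.
toy_selector = SelectKBest(score_func=f_regression, k=2).fit(toy_X, toy_y)

toy_scores = pd.Series(toy_selector.scores_, index=toy_X.columns)
toy_kept = toy_X.columns[toy_selector.get_support()].tolist()
print(toy_scores.sort_values(ascending=False))
print("Kept:", toy_kept)
```

Printing `scores_` alongside the selected names shows how large the gap is between kept and dropped features, which helps when choosing `k`.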


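# Conclusion
Standardization put all 11 features on a common scale (mean close to 0, standard deviation close to 1). PCA showed that 9 of the 11 components are needed to retain 95% of the variance, so the features are only mildly redundant; the 2-component projection above is for demonstration only. SelectKBest with the ANOVA F-test (f_classif, k = 6) kept volatile_acidity, citric_acid, total_sulfur_dioxide, density, sulphates, and alcohol as the features most strongly associated with quality. Imputation and encoding made no changes because the dataset has no missing values and no categorical columns. The scaled, PCA-reduced, and selected feature sets are now ready for model training.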
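
As a possible next step, the imputation, scaling, and selection steps can be chained in a scikit-learn `Pipeline` so they are fitted together and reused on new data. This is a sketch on synthetic data; the column names and the two-level target are assumptions, not the wine dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and an integer quality score.
rng = np.random.default_rng(7)
toy_X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
toy_y = (toy_X["a"] + toy_X["b"] > 0).astype(int) + 5  # scores are 5 or 6
toy_X.iloc[3, 2] = np.nan  # one gap so the imputer has work to do

# Same order as the notebook: impute, then scale, then select.
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=2)),
])

toy_prepared = prep.fit_transform(toy_X, toy_y)
toy_kept = toy_X.columns[prep.named_steps["select"].get_support()].tolist()
print(toy_prepared.shape, toy_kept)
```

Calling `prep.transform` on held-out rows then applies the medians, scaling parameters, and feature mask learned during fitting, rather than re-estimating them.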