
# **MIT Academy of Engineering (MIT AOE)**
## **Department of Artificial Intelligence**
### **Assignment No. 1 – Boston Housing Dataset**

#### Problem Type: Regression

Objective: Predict the median value of owner-occupied homes (in $1000s).

---


| Name | Roll Number |
|------|--------------|
| Arjun Tate | 202401110061 |



---


1. Data Exploration


In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Load the dataset from an online source (Heart Disease dataset, Cleveland subset).
# The file has no header row and marks missing values with '?'. Left as-is, pandas
# would read the affected columns as object dtype, so '?' is mapped to NaN on load.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
df = pd.read_csv(url, header=None, na_values='?')

# Name each column by its positional index (as a string) and label the
# last column (index 13) as the target.
df.columns = [str(i) for i in range(df.shape[1])]
df = df.rename(columns={'13': 'target'})

print("Initial Dataset Head:")
print(df.head())
print("\nInitial Dataset Info:")
df.info()
print("\nSummary Statistics:")
print(df.describe().T)

Initial Dataset Head:
      0    1    2      3      4    5    6      7    8    9   10   11   12  \
0  63.0  1.0  1.0  145.0  233.0  1.0  2.0  150.0  0.0  2.3  3.0  0.0  6.0   
1  67.0  1.0  4.0  160.0  286.0  0.0  2.0  108.0  1.0  1.5  2.0  3.0  3.0   
2  67.0  1.0  4.0  120.0  229.0  0.0  2.0  129.0  1.0  2.6  2.0  2.0  7.0   
3  37.0  1.0  3.0  130.0  250.0  0.0  0.0  187.0  0.0  3.5  3.0  0.0  3.0   
4  41.0  0.0  2.0  130.0  204.0  0.0  2.0  172.0  0.0  1.4  1.0  0.0  3.0   

   target  
0       0  
1       2  
2       1  
3       0  
4       0  

Initial Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       303 non-null    float64
 1   1       303 non-null    float64
 2   2       303 non-null    float64
 3   3       303 non-null    float64
 4   4       303 non-null    float64
 5   5       303 non-null    float64
 6   6       303 no

2. Handle Missing Data

In [None]:

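# Count missing values per column before imputation
print("Missing values before imputation:")
print(df.isnull().sum())

# Split the feature columns by dtype; the target is excluded so it is never imputed
feature_cols = df.columns.drop('target')
num_cols = df[feature_cols].select_dtypes(include='number').columns
cat_cols = df[feature_cols].select_dtypes(include='object').columns

# Numeric columns: fill NaNs with the column median
num_imputer = SimpleImputer(strategy='median')
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Categorical columns (if any): fill NaNs with the most frequent value
if len(cat_cols) > 0:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

print("\nMissing values after imputation:")
print(df.isnull().sum())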
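
The imputation step has to run before encoding and scaling, otherwise NaNs carry over into the later sections. The following is a minimal sketch of median imputation on a small made-up frame; the `toy` data and the choice to leave `target` out of the imputed columns are assumptions for illustration, not values from the Cleveland file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({
    "11": [0.0, 3.0, np.nan, 1.0, 0.0],
    "12": [3.0, 7.0, 3.0, np.nan, 3.0],
    "target": [0, 2, 1, 0, 0],
})

# Impute only the feature columns; the target is left untouched
feature_cols = toy.columns.drop("target")
imputer = SimpleImputer(strategy="median")
toy[feature_cols] = imputer.fit_transform(toy[feature_cols])

# statistics_ holds the fill value learned for each imputed column
medians = dict(zip(feature_cols, imputer.statistics_))
print(medians)
print(toy.isnull().sum())
```

Inspecting `imputer.statistics_` is a quick way to confirm which value was written into each column.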

3. Encode Categorical Variables

In [2]:
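# Columns holding categorical codes; by position in the UCI attribute list these
# are cp (2), restecg (6), slope (10) and thal (12)
cols_to_encode = ['2', '6', '10', '12']
for col in cols_to_encode:
    df[col] = df[col].astype('category')

# One-hot encode; drop_first=True drops the first level of each column as the baseline
df_encoded = pd.get_dummies(df, columns=cols_to_encode, drop_first=True)

# Make sure every column name is a string
df_encoded.columns = df_encoded.columns.astype(str)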

print("\nEncoded Dataset Head:")
print(df_encoded.head())
print(f"Encoded Dataset Shape: {df_encoded.shape}")


Encoded Dataset Head:
      0    1      3      4    5      7    8    9   11  target  2_2.0  2_3.0  \
0  63.0  1.0  145.0  233.0  1.0  150.0  0.0  2.3  0.0       0  False  False   
1  67.0  1.0  160.0  286.0  0.0  108.0  1.0  1.5  3.0       2  False  False   
2  67.0  1.0  120.0  229.0  0.0  129.0  1.0  2.6  2.0       1  False  False   
3  37.0  1.0  130.0  250.0  0.0  187.0  0.0  3.5  0.0       0  False   True   
4  41.0  0.0  130.0  204.0  0.0  172.0  0.0  1.4  0.0       0   True  False   

   2_4.0  6_1.0  6_2.0  10_2.0  10_3.0  12_6.0  12_7.0  
0  False  False   True   False    True    True   False  
1   True  False   True    True   False   False   False  
2   True  False   True    True   False   False    True  
3  False  False  False   False    True   False   False  
4  False  False   True   False   False   False   False  
Encoded Dataset Shape: (303, 19)
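
One detail of `pd.get_dummies` is worth checking when missing values may still be present: with the default `dummy_na=False`, a NaN row gets all-zero indicators, which makes it look identical to the baseline level removed by `drop_first=True`. A small sketch on a made-up column (the `toy` values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical column of categorical codes with one missing entry
toy = pd.DataFrame({"2": [1.0, 4.0, 3.0, 2.0, np.nan]})

# Same call pattern as the notebook: the first level (1.0) becomes the baseline
encoded = pd.get_dummies(toy, columns=["2"], drop_first=True)
print(encoded)

# Row 0 (baseline level 1.0) and row 4 (NaN) both encode as all-False
baseline_row = encoded.iloc[0]
nan_row = encoded.iloc[4]
print(baseline_row.equals(nan_row))
```

If that ambiguity matters, imputing before encoding or passing `dummy_na=True` keeps the two cases apart.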


4. Feature Scaling

In [3]:
# Separate features (X) and target (y)
X = df_encoded.drop(columns=['target'], errors='ignore')
y = df_encoded['target']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the features
scaled_features = scaler.fit_transform(X)

# Convert back to DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=X.columns)

print("\nScaled Dataset Head:")
print(scaled_df.head())


Scaled Dataset Head:
          0         1         3         4         5         7         8  \
0  0.948726  0.686202  0.757525 -0.264900  2.394438  0.017197 -0.696631   
1  1.392002  0.686202  1.611220  0.760415 -0.417635 -1.821905  1.435481   
2  1.392002  0.686202 -0.665300 -0.342283 -0.417635 -0.902354  1.435481   
3 -1.932564  0.686202 -0.096170  0.063974 -0.417635  1.637359 -0.696631   
4 -1.489288 -1.457296 -0.096170 -0.825922 -0.417635  0.980537 -0.696631   

          9        11     2_2.0     2_3.0     2_4.0     6_1.0     6_2.0  \
0  1.087338 -0.718306 -0.444554 -0.629534 -0.951662 -0.115663  1.023375   
1  0.397182  2.487269 -0.444554 -0.629534  1.050793 -0.115663  1.023375   
2  1.346147  1.418744 -0.444554 -0.629534  1.050793 -0.115663  1.023375   
3  2.122573 -0.718306 -0.444554  1.588476 -0.951662 -0.115663 -0.977158   
4  0.310912 -0.718306  2.249444 -0.629534 -0.951662 -0.115663  1.023375   

     10_2.0    10_3.0    12_6.0    12_7.0  
0 -0.926766  3.664502  3.979112 
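
As a sanity check on what `StandardScaler` does, the sketch below compares its output with the manual formula (x - mean) / std on a small made-up frame. The `toy` values are assumptions; the point is that the scaler uses the population standard deviation (`ddof=0`), so each column ends up with zero mean and unit variance rather than a fixed range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column frame of numeric features
toy = pd.DataFrame({
    "0": [63.0, 67.0, 67.0, 37.0, 41.0],
    "3": [145.0, 160.0, 120.0, 130.0, 130.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Manual standardization with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled.values, manual.values))
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```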

5. PCA (Dimensionality Reduction)

In [6]:

if scaled_df.isnull().values.any():
    # If NaNs are found, we impute them using median *again* just for the PCA step
    # Note: If this happens, it points to a problem in steps 2 or 3.
    temp_imputer = SimpleImputer(strategy='median')
    scaled_df_clean = pd.DataFrame(temp_imputer.fit_transform(scaled_df), columns=scaled_df.columns)
else:
    scaled_df_clean = scaled_df


pca = PCA(n_components=2)
# Use the cleaned version of the scaled data
pca_features = pca.fit_transform(scaled_df_clean)

# Convert PCA features back to DataFrame
pca_df = pd.DataFrame(pca_features, columns=['PC1', 'PC2'])

print("\nPCA Transformed Dataset Head (Reduced to 2 Components):")
print(pca_df.head())


PCA Transformed Dataset Head (Reduced to 2 Components):
        PC1       PC2
0  0.663443  2.237916
1  3.487949  0.979003
2  3.341976 -0.716217
3 -1.683404  0.403007
4 -2.505285 -0.374885
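
How much information two components retain can be read from `explained_variance_ratio_`. The sketch below uses synthetic data (an assumption, generated with a fixed seed) rather than the heart disease features, so the printed ratios say nothing about the notebook's own result; the same attribute is available on the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: four columns, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X_toy = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=100),  # near-copy of the first column
    base[:, 1],
    rng.normal(size=100),
])

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_toy)

pca_toy = PCA(n_components=2)
components = pca_toy.fit_transform(X_scaled)

# Share of the total variance captured by each component, and their sum
ratio = pca_toy.explained_variance_ratio_
cumulative = ratio.sum()
print(ratio, cumulative)
```

A low cumulative value would suggest that two components discard a lot of structure and that `n_components` should be raised.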


6. Feature Selection

In [8]:
# --- Feature Selection (SelectKBest) ---
# Selection runs on the scaled features, not on the PCA components
X_select = scaled_df  # may still contain NaNs if the imputation step was skipped
y_select = y

# Guard: impute any remaining NaNs with the column median, because
# SelectKBest with f_classif cannot handle missing values
nan_imputer = SimpleImputer(strategy='median')

# Wrap the result in a DataFrame so the feature names are retained
X_select_clean = pd.DataFrame(
    nan_imputer.fit_transform(X_select),
    columns=X_select.columns
)


# Keep the 8 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=8)
# Fit on the cleaned data
selector.fit(X_select_clean, y_select)

# Get the indices and names of the selected features
selected_indices = selector.get_support(indices=True)
selected_features = X_select_clean.columns[selected_indices]

# Create the final selected dataset using the cleaned data
X_new = X_select_clean[selected_features]

print("\nSelected Features (k=8):")
print(list(selected_features))
print("\nSelected Dataset Head:")
print(X_new.head())


Selected Features (k=8):
['7', '8', '9', '11', '2_3.0', '2_4.0', '10_2.0', '12_7.0']

Selected Dataset Head:
          7         8         9        11     2_3.0     2_4.0    10_2.0  \
0  0.017197 -0.696631  1.087338 -0.718306 -0.629534 -0.951662 -0.926766   
1 -1.821905  1.435481  0.397182  2.487269 -0.629534  1.050793  1.079021   
2 -0.902354  1.435481  1.346147  1.418744 -0.629534  1.050793  1.079021   
3  1.637359 -0.696631  2.122573 -0.718306  1.588476 -0.951662 -0.926766   
4  0.980537 -0.696631  0.310912 -0.718306 -0.629534 -0.951662 -0.926766   

     12_7.0  
0 -0.793116  
1 -0.793116  
2  1.260850  
3 -0.793116  
4 -0.793116  
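
`SelectKBest` also exposes the score behind each decision, which helps explain why a feature was kept. A minimal sketch on synthetic data (the column names `informative`, `noise_a` and `noise_b` are hypothetical) ranks features by their ANOVA F-score and p-value:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one column tracks the class label, two are pure noise
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 50)
X_toy = pd.DataFrame({
    "informative": y_toy + 0.3 * rng.normal(size=100),
    "noise_a": rng.normal(size=100),
    "noise_b": rng.normal(size=100),
})

toy_selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)

# Rank all features by F-score; scores_ and pvalues_ follow the column order
ranking = pd.DataFrame({
    "feature": X_toy.columns,
    "f_score": toy_selector.scores_,
    "p_value": toy_selector.pvalues_,
}).sort_values("f_score", ascending=False)
print(ranking)

# Names of the columns that survive the selection
selected = list(X_toy.columns[toy_selector.get_support()])
print(selected)
```

Applied to the notebook's `selector`, the same table would show how close the ninth-ranked feature came to the cut-off at `k=8`.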


7. Summary of Transformations
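
The transformations from the previous sections (imputation, one-hot encoding, scaling, feature selection) can also be chained into a single scikit-learn `Pipeline`, which fixes the order of the steps. This is a hedged sketch on synthetic data, not the notebook's own code: the `toy` frame, the column split and `k=4` are assumptions, and unlike the cells above it standardizes only the numeric columns and leaves the indicator columns unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset: two numeric and two categorical columns
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "0": rng.normal(55, 9, n),
    "7": rng.normal(150, 20, n),
    "2": rng.choice([1.0, 2.0, 3.0, 4.0], n),
    "12": rng.choice([3.0, 6.0, 7.0], n),
})
toy.loc[[3, 17], "12"] = np.nan  # inject a few missing values
toy.loc[[5], "7"] = np.nan
y_toy = rng.integers(0, 2, n)

numeric_cols = ["0", "7"]
categorical_cols = ["2", "12"]

# Impute then scale numeric columns; impute then one-hot encode categorical ones
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(drop="first")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_classif, k=4)),
])

X_out = pipeline.fit_transform(toy, y_toy)
print(X_out.shape)
```

Because every step lives in one object, the same fitted pipeline can later be applied to held-out data with `pipeline.transform`, which avoids refitting imputers or scalers on test rows.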


## Conclusion
The feature engineering process transformed the raw Heart Disease dataset into a model-ready format. Missing values were imputed with the column median, and the categorical features were one-hot encoded into numeric indicator columns. The features were then standardized to zero mean and unit variance, which matters for distance-based models and for PCA. From the scaled data, two reduced views were derived: a 2-component PCA projection for dimensionality reduction, and a SelectKBest subset that keeps the 8 features with the highest ANOVA F-scores against the target. The resulting dataset is free of missing values and consistently scaled, which gives subsequent machine learning models a cleaner starting point; whether it improves predictive accuracy would need to be checked by training and evaluating a model.