# FEATURE SELECTION

## Missing Values Ratio

**1. Diabetes Dataset:** Identify and remove features in the diabetes dataset where the
percentage of missing values exceeds 30%, then analyze how the reduced feature set
affects model accuracy when predicting diabetes outcomes.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [5]:
diabetes_data = pd.read_csv(r"C:\Users\guntu\Downloads\diabetes.csv")

In [6]:
# Step 1: Count zero values in columns where zero is not a valid measurement (treated as missing)
invalid_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
zero_counts = (diabetes_data[invalid_columns] == 0).sum()

In [7]:
# Step 2: Calculate the percentage of zero values in these columns
missing_percentage = (zero_counts / len(diabetes_data)) * 100

In [8]:
# Display the number and percentage of zero values
print("Number of zero values:")
print(zero_counts)

Number of zero values:
Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64


In [9]:
print("\nPercentage of zero values:")
print(missing_percentage)


Percentage of zero values:
Glucose           0.651042
BloodPressure     4.557292
SkinThickness    29.557292
Insulin          48.697917
BMI               1.432292
dtype: float64
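
The zero-counting steps above can be folded into one reusable helper. The following is a minimal sketch, not part of the original notebook: the function name `drop_high_missing`, its parameters, and the toy DataFrame are all assumptions made for illustration. It treats zeros in the listed columns as missing, computes the missing percentage per column, and drops any column above the threshold.

```python
import numpy as np
import pandas as pd

def drop_high_missing(df, threshold_pct, zero_as_missing=()):
    """Drop columns whose missing percentage exceeds threshold_pct.

    Zeros in the columns listed in zero_as_missing are counted as missing.
    Returns the reduced frame and the per-column missing percentages.
    """
    marked = df.copy()
    cols = list(zero_as_missing)
    if cols:
        marked[cols] = marked[cols].replace(0, np.nan)
    missing_pct = marked.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > threshold_pct].index
    return df.drop(columns=to_drop), missing_pct

# Hypothetical toy data; Outcome legitimately contains zeros, so it is not listed.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "Outcome": [1, 0, 1, 0],
})
reduced, pct = drop_high_missing(toy, 30, zero_as_missing=["Glucose", "Insulin"])
print(pct)
print(list(reduced.columns))
```

Passing the zero-as-missing columns explicitly keeps a genuine zero, such as an `Outcome` of 0, from being counted as a gap.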


In [10]:
# Step 3: Model accuracy before removing any columns
X_full = diabetes_data.drop(columns=['Outcome']) 
y_full = diabetes_data['Outcome'] 

In [11]:
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, test_size=0.2, random_state=42)


In [12]:
# Train the RandomForest model on full data
model_full = RandomForestClassifier(random_state=42)
model_full.fit(X_train_full, y_train_full)

In [13]:
y_pred_full = model_full.predict(X_test_full)
accuracy_full = accuracy_score(y_test_full, y_pred_full)

In [14]:
print("\nModel Accuracy before removing columns: {:.2f}%".format(accuracy_full * 100))



Model Accuracy before removing columns: 72.08%


In [15]:
# Step 4: Drop columns where more than 30% of values are zero (treated as missing)
columns_to_remove = missing_percentage[missing_percentage > 30].index
reduced_data = diabetes_data.drop(columns=columns_to_remove)


In [16]:
print("\nReduced dataset columns:")
print(reduced_data.columns)


Reduced dataset columns:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'BMI',
       'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')


In [17]:
# Step 5: Model accuracy after removing columns with more than 30% missing values
X_reduced = reduced_data.drop(columns=['Outcome'])  # Features after reduction
y_reduced = reduced_data['Outcome'] 

In [18]:
X_train_reduced, X_test_reduced, y_train_reduced, y_test_reduced = train_test_split(X_reduced, y_reduced, test_size=0.2, random_state=42)

In [19]:
model_reduced = RandomForestClassifier(random_state=42)
model_reduced.fit(X_train_reduced, y_train_reduced)

In [20]:
y_pred_reduced = model_reduced.predict(X_test_reduced)
accuracy_reduced = accuracy_score(y_test_reduced, y_pred_reduced)

In [21]:
print("\nModel Accuracy after removing columns: {:.2f}%".format(accuracy_reduced * 100))


Model Accuracy after removing columns: 74.03%
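
Both accuracies come from a single 80/20 split with 154 test rows, so 72.08% and 74.03% correspond to 111 and 114 correct predictions. A gap of three rows is small enough that a different split could reverse it. A cross-validated comparison gives a steadier estimate. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features; the helper name `cv_accuracy` and the choice of which column to drop are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, cv=5, seed=42):
    """Return mean and standard deviation of k-fold accuracy for a random forest."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

mean_full, std_full = cv_accuracy(X, y)
# Drop one column (index 4, chosen arbitrarily) to mimic removing a feature.
X_reduced = np.delete(X, 4, axis=1)
mean_red, std_red = cv_accuracy(X_reduced, y)

print(f"All features:     {mean_full:.3f} +/- {std_full:.3f}")
print(f"Reduced features: {mean_red:.3f} +/- {std_red:.3f}")
```

If the two means differ by less than the fold-to-fold standard deviation, the feature removal has no clearly measurable effect.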


**2. Melbourne Housing Dataset:** Filter out columns in the Melbourne housing dataset
where more than 20% of values are missing, and determine the impact on a price
prediction model's performance.

In [222]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [224]:
melbourne_data = pd.read_csv(r"C:\Users\guntu\Downloads\melbourne_housing_raw.csv")

In [226]:
missing_counts = melbourne_data.isnull().sum()  # Count of missing values
missing_ratios = melbourne_data.isnull().mean() * 100  # Percentage of missing values

In [228]:
print("Missing Values:")
for col in melbourne_data.columns:
    print(f"{col}: {missing_counts[col]} ")

Missing Values:
Suburb: 0 
Rooms: 0 
Type: 0 
Price: 7610 
Method: 0 
SellerG: 0 
Date: 0 
Distance: 1 
Postcode: 1 
Bedroom2: 8217 
Bathroom: 8226 
Car: 8728 
Landsize: 11810 
BuildingArea: 21115 
YearBuilt: 19306 
CouncilArea: 3 
Lattitude: 7976 
Longtitude: 7976 
Regionname: 3 
Propertycount: 3 


In [230]:
print("\nPercentage of Missing Values:")
for col in melbourne_data.columns:
    print(f"{col}: {missing_ratios[col]:.2f}%")


Percentage of Missing Values:
Suburb: 0.00%
Rooms: 0.00%
Type: 0.00%
Price: 21.83%
Method: 0.00%
SellerG: 0.00%
Date: 0.00%
Distance: 0.00%
Postcode: 0.00%
Bedroom2: 23.57%
Bathroom: 23.60%
Car: 25.04%
Landsize: 33.88%
BuildingArea: 60.58%
YearBuilt: 55.39%
CouncilArea: 0.01%
Lattitude: 22.88%
Longtitude: 22.88%
Regionname: 0.01%
Propertycount: 0.01%


In [232]:
columns_to_drop = missing_ratios[missing_ratios > 20].index.tolist()

In [234]:
print("\nColumns to be removed (more than 20% missing values):")
for col in columns_to_drop:
    print(col)


Columns to be removed (more than 20% missing values):
Price
Bedroom2
Bathroom
Car
Landsize
BuildingArea
YearBuilt
Lattitude
Longtitude


In [236]:
if 'Price' in columns_to_drop:
    columns_to_drop.remove('Price')

In [238]:
filtered_data = melbourne_data.drop(columns=columns_to_drop).dropna(subset=['Price'])

In [240]:
# Define features (X) and target (y)
X = filtered_data.drop(columns=['Price'])
y = filtered_data['Price']

In [242]:
X_encoded = pd.get_dummies(X, drop_first=True)

In [244]:
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_encoded)

In [246]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

In [248]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [250]:
model = LinearRegression()
model.fit(X_train, y_train)

In [251]:
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [252]:
print(f'\nMean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R² Score: {r2:.2f}')


Mean Absolute Error (MAE): 1731993960724925184.00
Mean Squared Error (MSE): 898818058633529407361721075437200736256.00
R² Score: -2103434703322854677293301760.00
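
These metrics are far outside any plausible range for house prices, which suggests the linear fit is numerically unstable rather than merely inaccurate. A likely contributor (an assumption, not verified against this dataset) is that `pd.get_dummies` expands high-cardinality text columns such as `Suburb`, `SellerG`, and `Date` into many sparse, nearly collinear columns. The sketch below shows one possible alternative: restrict the features to numeric columns and keep imputation and scaling inside a pipeline that is fit on the training rows only. The toy data and its coefficients are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n).astype(float),
    "Distance": rng.uniform(1, 40, n),
    "Suburb": rng.choice(["A", "B", "C"], n),  # text column, excluded below
})
toy["Price"] = 200_000 * toy["Rooms"] - 5_000 * toy["Distance"] + rng.normal(0, 20_000, n)
toy.loc[toy.index[::10], "Distance"] = np.nan  # inject some missing values

X = toy.drop(columns=["Price"]).select_dtypes(include="number")
y = toy["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Imputer and scaler statistics are learned from the training rows only.
pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)
r2 = r2_score(y_test, pipeline.predict(X_test))
print(f"R^2 on held-out rows: {r2:.2f}")
```

Fitting the imputer and scaler inside the pipeline also avoids leaking test-set statistics into training, which happens when `fit_transform` is called before `train_test_split`.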


## High Correlation Filter

**3. Diabetes Dataset:** Identify pairs of highly correlated features (correlation > 0.8) in the
diabetes dataset, then remove one feature from each pair and assess how model
performance changes in diabetes classification.

In [25]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [26]:
data = pd.read_csv(r"C:\Users\guntu\Downloads\diabetes.csv")

In [27]:
#  Compute the correlation matrix
print("Computing the correlation matrix \n")
corr_matrix = data.corr()
print(corr_matrix, "\n")

Computing the correlation matrix 

                          Pregnancies   Glucose  BloodPressure  SkinThickness  \
Pregnancies                  1.000000  0.129459       0.141282      -0.081672   
Glucose                      0.129459  1.000000       0.152590       0.057328   
BloodPressure                0.141282  0.152590       1.000000       0.207371   
SkinThickness               -0.081672  0.057328       0.207371       1.000000   
Insulin                     -0.073535  0.331357       0.088933       0.436783   
BMI                          0.017683  0.221071       0.281805       0.392573   
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928   
Age                          0.544341  0.263514       0.239528      -0.113970   
Outcome                      0.221898  0.466581       0.065068       0.074752   

                           Insulin       BMI  DiabetesPedigreeFunction  \
Pregnancies              -0.073535  0.017683                 -0.033523   
...

In [28]:
corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        corr_value = corr_matrix.iloc[i, j]
        if abs(corr_value) > 0.8:
            corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_value))

if corr_pairs:
    print("Highly Correlated Pairs:")
    for pair in corr_pairs:
        print(f"{pair[0]} and {pair[1]}: Correlation = {pair[2]:.2f}")
else:
    print("No pairs with correlation > 0.8 found.\n")

No pairs with correlation > 0.8 found.



In [29]:
features_to_remove = set(pair[1] for pair in corr_pairs)  # Keep only one feature from each pair
data_reduced = data.drop(columns=features_to_remove)
print(f"Features removed: {features_to_remove}\n")

Features removed: set()
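
The nested loop over the correlation matrix can also be written with an upper-triangle mask, which visits each pair once and uses the absolute correlation so that strongly negative pairs are caught too. This is a minimal sketch on made-up data; the function name `correlated_pairs` and the toy columns are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold):
    """Return (feature_a, feature_b, correlation) for pairs with |corr| > threshold."""
    corr = df.corr()
    cols = corr.columns
    # k=1 keeps only entries strictly above the diagonal, so each pair appears once.
    strong = np.triu(corr.abs().to_numpy() > threshold, k=1)
    rows, columns = np.where(strong)
    return [(cols[i], cols[j], float(corr.iloc[i, j])) for i, j in zip(rows, columns)]

rng = np.random.default_rng(0)
base = rng.normal(size=100)
toy = pd.DataFrame({
    "x": base,
    "x_copy": base + rng.normal(scale=0.01, size=100),  # nearly identical to x
    "x_neg": -base + rng.normal(scale=0.01, size=100),  # strongly negative
    "noise": rng.normal(size=100),
})
pairs = correlated_pairs(toy, 0.8)
to_drop = {second for _, second, _ in pairs}  # drop the second member of each pair
print(pairs)
print(sorted(to_drop))
```

When the target column is part of the frame, it is safer to exclude it before calling the helper so that the target can never end up in the drop set.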



In [30]:
#  Split the data into train and test sets
X = data.drop(columns='Outcome')
X_reduced = data_reduced.drop(columns='Outcome')
y = data['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_reduced, X_test_reduced, _, _ = train_test_split(X_reduced, y, test_size=0.3, random_state=42)

In [31]:
#  Train models on both versions of the dataset
model = LogisticRegression(max_iter=500)
model_reduced = LogisticRegression(max_iter=500)

model.fit(X_train, y_train)
model_reduced.fit(X_train_reduced, y_train)

In [32]:
#  Evaluate model performance
y_pred = model.predict(X_test)
y_pred_reduced = model_reduced.predict(X_test_reduced)

In [33]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with all features: {accuracy:.4f}")

Accuracy with all features: 0.7359


In [34]:
accuracy_reduced = accuracy_score(y_test, y_pred_reduced)
print(f"Accuracy with reduced features: {accuracy_reduced:.4f}\n")

Accuracy with reduced features: 0.7359



**4. Melbourne Housing Dataset:** Remove highly correlated features (correlation > 0.85)
from the Melbourne housing dataset and evaluate the effect on the prediction of property
prices.

In [36]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
import seaborn as sns
import matplotlib.pyplot as plt

In [37]:
# Load the dataset
df = pd.read_csv(r"C:\Users\guntu\Downloads\melbourne_housing_raw.csv")

In [38]:
#  Handle missing values by dropping rows with missing values
df_cleaned = df.dropna()

In [39]:
# Verify there are no missing values
print(df_cleaned.isnull().sum())

Suburb           0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64


In [40]:
#  Remove non-numeric columns before calculating the correlation matrix
df_numeric = df_cleaned.select_dtypes(include=[np.number])

In [41]:
# Compute the correlation matrix for numeric columns only
correlation_matrix = df_numeric.corr()
print(correlation_matrix)

                  Rooms     Price  Distance  Postcode  Bedroom2  Bathroom  \
Rooms          1.000000  0.475074  0.276585  0.084236  0.964465  0.624070   
Price          0.475074  1.000000 -0.231212  0.046033  0.460880  0.463501   
Distance       0.276585 -0.231212  1.000000  0.489537  0.283460  0.122132   
Postcode       0.084236  0.046033  0.489537  1.000000  0.087286  0.111617   
Bedroom2       0.964465  0.460880  0.283460  0.087286  1.000000  0.626493   
Bathroom       0.624070  0.463501  0.122132  0.111617  0.626493  1.000000   
Car            0.401423  0.209464  0.259374  0.055531  0.405570  0.310962   
Landsize       0.101158  0.058375  0.138559  0.069623  0.101035  0.075939   
BuildingArea   0.606738  0.507284  0.135509  0.077091  0.595299  0.553855   
YearBuilt      0.006935 -0.313664  0.313383  0.089913  0.016310  0.192914   
Lattitude      0.018758 -0.224255 -0.055317 -0.195081  0.022745 -0.041859   
Longtitude     0.083016  0.212174  0.163941  0.358005  0.082671  0.109268   

In [42]:
#  Identify and remove highly correlated features (correlation > 0.85)
threshold = 0.85
high_corr_features = np.where(correlation_matrix > threshold)
high_corr_pairs = [(correlation_matrix.index[x], correlation_matrix.columns[y]) for x, y in zip(*high_corr_features) if x != y and x < y]

In [43]:
print("Highly correlated features (correlation > 0.85): \n " , high_corr_pairs)

Highly correlated features (correlation > 0.85): 
  [('Rooms', 'Bedroom2')]


In [44]:
# Remove one feature from each highly correlated pair
features_to_remove = set([pair[1] for pair in high_corr_pairs])
df_reduced = df_cleaned.drop(columns=features_to_remove)

In [45]:
# Convert categorical variables into dummy (one-hot) columns with pd.get_dummies
df_reduced = pd.get_dummies(df_reduced, drop_first=True)

In [46]:
X = df_reduced.drop(columns=['Price'])  
y = df_reduced['Price']

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [48]:
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

In [49]:
y_pred = model.predict(X_test)

In [50]:
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 6.605093373383016e+24


In [51]:
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")

R-squared: -17191301001261.773


## Low Variance Filter

**5. Diabetes Dataset:** Apply a low variance filter to remove features in the diabetes dataset
with very low variability, and observe how this affects the model's accuracy in predicting
diabetes.

In [54]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold

In [55]:
data = pd.read_csv(r"C:\Users\guntu\Downloads\diabetes.csv")


In [56]:
X = data.drop(columns='Outcome')
y = data['Outcome']


In [57]:
#  Calculate the variance of each feature
variances = X.var()
print("Variance of features:\n", variances, "\n")

Variance of features:
 Pregnancies                    11.354056
Glucose                      1022.248314
BloodPressure                 374.647271
SkinThickness                 254.473245
Insulin                     13281.180078
BMI                            62.159984
DiabetesPedigreeFunction        0.109779
Age                           138.303046
dtype: float64 



In [58]:
#  Apply a low variance filter with a threshold of 1.0
selector = VarianceThreshold(threshold=1.0)
X_reduced = selector.fit_transform(X)

In [59]:
# Get the names of features that remain after filtering
remaining_features = X.columns[selector.get_support()]
print(f"Remaining Features after Low Variance Filter: {list(remaining_features)}")
print(f"Reduced dataset shape: {X_reduced.shape}\n")

Remaining Features after Low Variance Filter: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']
Reduced dataset shape: (768, 7)
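
`DiabetesPedigreeFunction` is dropped here because its raw variance (0.109779) is below 1.0, but raw variance depends on the units of each column: `Insulin` has a variance above 13,000 mainly because its values are large. One common workaround, shown here as an alternative and not as the notebook's method, is to min-max scale every column to [0, 1] before comparing variances. The helper name `low_variance_columns`, the threshold of 0.15, and the toy data are assumptions for illustration.

```python
import pandas as pd

def low_variance_columns(df, threshold):
    """Return columns whose variance after min-max scaling is below threshold."""
    span = df.max() - df.min()
    # Constant columns have zero span; use 1 instead so they scale to all zeros.
    scaled = (df - df.min()) / span.replace(0, 1)
    variances = scaled.var(ddof=0)  # population variance, as VarianceThreshold computes it
    return variances[variances < threshold].index.tolist(), variances

toy = pd.DataFrame({
    # Small units, but the values swing across the whole range.
    "small_scale": [0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8, 0.1, 0.9],
    # Same pattern in larger units.
    "large_scale": [100, 900, 200, 800, 100, 900, 200, 800, 100, 900],
    # Almost constant: nine identical values and one different value.
    "near_constant": [5, 5, 5, 5, 5, 5, 5, 5, 5, 6],
})
dropped, scaled_var = low_variance_columns(toy, threshold=0.15)
print(scaled_var.round(4))
print(dropped)
```

After scaling, `small_scale` and `large_scale` receive the same variance, so only the column that really is nearly constant falls below the threshold. Note also that `X.var()` in pandas uses the sample variance (`ddof=1`), while `VarianceThreshold` uses the population variance; with 768 rows the difference is negligible.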



In [60]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_reduced, X_test_reduced, _, _ = train_test_split(X_reduced, y, test_size=0.3, random_state=42)

In [61]:
#  Train Logistic Regression models
model = LogisticRegression(max_iter=500)
model_reduced = LogisticRegression(max_iter=500)

In [62]:
model.fit(X_train, y_train)
model_reduced.fit(X_train_reduced, y_train)

In [63]:
y_pred = model.predict(X_test)
y_pred_reduced = model_reduced.predict(X_test_reduced)

In [64]:
accuracy = accuracy_score(y_test, y_pred)
accuracy_reduced = accuracy_score(y_test, y_pred_reduced)

In [65]:
print(f"Accuracy with all features: {accuracy:.4f}")
print(f"Accuracy with reduced features: {accuracy_reduced:.4f}\n")

Accuracy with all features: 0.7359
Accuracy with reduced features: 0.7186



**6. Melbourne Housing Dataset:** Filter out features in the Melbourne housing dataset with
low variance (e.g., those that are nearly constant across samples), and analyze its
impact on predicting housing prices.

In [67]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [68]:
# Load the dataset from the specified file path
df = pd.read_csv(r"C:\Users\guntu\Downloads\melbourne_housing_raw.csv")

In [69]:
# Handle missing values by dropping rows with missing values
df_cleaned = df.dropna()

In [70]:
# Convert categorical variables into dummy (one-hot) columns with pd.get_dummies
df_encoded = pd.get_dummies(df_cleaned, drop_first=True)

In [71]:
X = df_encoded.drop(columns=['Price'])  
y = df_encoded['Price']

In [72]:
# Calculate variance of all features
feature_variances = X.var()
print("Variance of features before filtering:")
print(feature_variances)

Variance of features before filtering:
Rooms                                        0.928884
Distance                                    46.422449
Postcode                                 12681.973288
Bedroom2                                     0.933676
Bathroom                                     0.520723
                                             ...     
Regionname_Northern Victoria                 0.006929
Regionname_South-Eastern Metropolitan        0.040008
Regionname_Southern Metropolitan             0.211844
Regionname_Western Metropolitan              0.178028
Regionname_Western Victoria                  0.004816
Length: 696, dtype: float64


In [73]:
# Apply Variance Threshold (Filter out features with variance below the threshold, e.g., 0.01)
variance_threshold = VarianceThreshold(threshold=0.01)
X_high_variance = variance_threshold.fit_transform(X)

In [74]:
# Get the names of the features that were kept and removed
features_kept = X.columns[variance_threshold.get_support()]
features_removed = X.columns[~variance_threshold.get_support()]

In [75]:
print("\nFeatures removed after applying variance threshold:")
print(features_removed)



Features removed after applying variance threshold:
Index(['Lattitude', 'Suburb_Aberfeldie', 'Suburb_Airport West',
       'Suburb_Albanvale', 'Suburb_Albert Park', 'Suburb_Albion',
       'Suburb_Alphington', 'Suburb_Altona', 'Suburb_Altona Meadows',
       'Suburb_Altona North',
       ...
       'CouncilArea_Frankston City Council',
       'CouncilArea_Greater Dandenong City Council',
       'CouncilArea_Macedon Ranges Shire Council',
       'CouncilArea_Mitchell Shire Council',
       'CouncilArea_Moorabool Shire Council',
       'CouncilArea_Nillumbik Shire Council',
       'CouncilArea_Yarra Ranges Shire Council', 'Regionname_Eastern Victoria',
       'Regionname_Northern Victoria', 'Regionname_Western Victoria'],
      dtype='object', length=552)


In [76]:
X_train, X_test, y_train, y_test = train_test_split(X_high_variance, y, test_size=0.2, random_state=42)

In [77]:
#  Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


In [78]:
y_pred = model.predict(X_test)

In [79]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [80]:
print(f"\nMean Squared Error after removing low variance features: {mse}")
print(f"R-squared after removing low variance features: {r2}")


Mean Squared Error after removing low variance features: 99023026194.04904
R-squared after removing low variance features: 0.7422694043633302


In [81]:
#  Compare with model performance before filtering low variance
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size=0.2, random_state=42)
model_full = LinearRegression()
model_full.fit(X_train_full, y_train_full)
y_pred_full = model_full.predict(X_test_full)

In [82]:
mse_full = mean_squared_error(y_test_full, y_pred_full)
r2_full = r2_score(y_test_full, y_pred_full)


In [83]:
print(f"\nMean Squared Error before removing low variance features: {mse_full}")
print(f"R-squared before removing low variance features: {r2_full}")


Mean Squared Error before removing low variance features: 5.57739244609372e+24
R-squared before removing low variance features: -14516468870727.008


## Forward Feature Selection

**7. Diabetes Dataset:** Use forward feature selection to iteratively select the best features
from the diabetes dataset for a logistic regression model, and determine how many
features are optimal for predicting diabetes outcomes.

In [86]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SequentialFeatureSelector

In [87]:
data = pd.read_csv(r"C:\Users\guntu\Downloads\diabetes.csv")

In [88]:
X = data.drop(columns='Outcome')
y = data['Outcome']

In [89]:
#  Initialize Logistic Regression model
logreg = LogisticRegression(max_iter=500)

In [90]:
# Apply Forward Feature Selection
# Using SequentialFeatureSelector for forward selection
sfs = SequentialFeatureSelector(
    logreg, direction='forward', scoring='accuracy', cv=5
)
sfs.fit(X, y)

In [91]:
# Get the selected features
selected_features = X.columns[sfs.get_support()]
print(f"Selected Features: {list(selected_features)}")
print(f"Number of selected features: {len(selected_features)}\n")

Selected Features: ['Glucose', 'Insulin', 'BMI', 'Age']
Number of selected features: 4
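
When `n_features_to_select` is left at its default and no `tol` is set, `SequentialFeatureSelector` keeps half of the features, which is why 4 of the 8 columns are selected here. That count is a default, not evidence that 4 is optimal. To see how the score changes with the number of features, the greedy search can be written out so that the cross-validated accuracy is recorded after every addition. This is a minimal sketch on synthetic data; the function name `forward_selection_path` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection_path(estimator, X, y, cv=5):
    """Greedy forward selection that records the CV score after each added feature."""
    remaining = list(X.columns)
    selected, path = [], []
    while remaining:
        # Score every remaining candidate when added to the features chosen so far.
        scores = {
            col: cross_val_score(
                estimator, X[selected + [col]], y, cv=cv, scoring="accuracy"
            ).mean()
            for col in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        path.append((len(selected), best, scores[best]))
    return path

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 4)), columns=["signal", "noise_a", "noise_b", "noise_c"]
)
y = (X["signal"] > 0).astype(int)  # only the first column carries information

path = forward_selection_path(LogisticRegression(max_iter=500), X, y)
for k, added, score in path:
    print(f"{k} feature(s), added {added!r}: CV accuracy = {score:.3f}")
best_k = max(path, key=lambda step: step[2])[0]
print("Feature count with the highest CV accuracy:", best_k)
```

The feature count where the accuracy stops improving is a more defensible answer to "how many features are optimal" than a fixed half.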



In [92]:
#  Train model with selected features
X_selected = X[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

logreg.fit(X_train, y_train)

In [93]:
#  Evaluate the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with selected features: {accuracy:.4f}\n")

Accuracy with selected features: 0.7359



**8. Melbourne Housing Dataset:** Implement forward feature selection on the Melbourne
housing dataset to find the optimal set of features for predicting housing prices using a
linear regression model.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.impute import SimpleImputer

In [5]:
melbourne_data = pd.read_csv(r"C:\Users\guntu\Downloads\melbourne_housing_raw.csv")

In [7]:
data_clean = melbourne_data.dropna(subset=['Price'])

In [9]:
X = data_clean.drop(columns=['Price'])
y = data_clean['Price']

In [11]:
numerical_features = X.select_dtypes(include=['float64', 'int64'])
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(numerical_features), columns=numerical_features.columns)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

In [15]:
model = LinearRegression()

In [17]:
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False) 

In [19]:
selector = SequentialFeatureSelector(
    model, 
    n_features_to_select="auto", 
    direction='forward', 
    scoring=mae_scorer, 
    cv=5
)
selector.fit(X_train, y_train)

In [21]:
selected_features = X_imputed.columns[selector.get_support()]
print("Selected Features:")
print(selected_features)

Selected Features:
Index(['Rooms', 'Distance', 'Postcode', 'YearBuilt', 'Lattitude',
       'Longtitude'],
      dtype='object')


In [23]:
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
model.fit(X_train_selected, y_train)

In [25]:
y_pred = model.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)

In [28]:
print(f"\nMean Absolute Error (MAE) with selected features: {mae:.2f}")


Mean Absolute Error (MAE) with selected features: 310442.66


## Backward Feature Elimination

**9. Diabetes Dataset:** Perform backward feature elimination on the diabetes dataset using
a decision tree classifier, removing the least important features one by one, and examine
the final set of features and its effect on model performance.

In [97]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE

In [98]:
data = pd.read_csv(r"C:\Users\guntu\Downloads\diabetes.csv")

In [99]:
X = data.drop(columns='Outcome')
y = data['Outcome']

In [100]:
#  Initialize Decision Tree Classifier
model = DecisionTreeClassifier(random_state=42)

In [101]:
#  Apply Recursive Feature Elimination (RFE) for backward feature elimination
rfe = RFE(estimator=model, n_features_to_select=1)  # Keep reducing until 1 feature remains
rfe.fit(X, y)

In [102]:
# Print ranking of features (1 = most important)
for feature, rank in zip(X.columns, rfe.ranking_):
    print(f"{feature}: Rank {rank}")

Pregnancies: Rank 6
Glucose: Rank 1
BloodPressure: Rank 5
SkinThickness: Rank 8
Insulin: Rank 7
BMI: Rank 2
DiabetesPedigreeFunction: Rank 3
Age: Rank 4


In [103]:
# Select the features RFE kept (rank 1). With n_features_to_select=1 this is only the single top-ranked feature.
selected_features = X.columns[rfe.support_]
print(f"\nFinal Selected Features: {list(selected_features)}")
print(f"Number of Selected Features: {len(selected_features)}\n")



Final Selected Features: ['Glucose']
Number of Selected Features: 1



In [104]:
# Train the model with selected features
X_selected = X[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)

In [105]:
#Evaluate the model performance
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with selected features: {accuracy:.4f}\n")

Accuracy with selected features: 0.6840
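
Setting `n_features_to_select=1` yields a complete ranking, but the final model is then trained on `Glucose` alone. If the goal is to let the data decide where elimination should stop, scikit-learn's `RFECV` runs the same recursive elimination and uses cross-validation to pick the number of features. The sketch below uses synthetic data from `make_classification` as a stand-in; the dataset parameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = RFECV(
    estimator=DecisionTreeClassifier(random_state=42),
    step=1,                     # eliminate one feature per round
    cv=5,
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

print("Number of features kept:", selector.n_features_)
print("Support mask:", selector.support_)
print("Ranking (1 = kept):", selector.ranking_)
```

Every feature kept by `RFECV` has rank 1, so `support_` can be used to index the columns in the same way as `rfe.support_` above.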



**10. Melbourne Housing Dataset:** Apply backward feature elimination on the Melbourne
housing dataset using a random forest model, and analyze how removing the least
important features one at a time impacts the accuracy of price predictions.

In [32]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer

In [34]:
melbourne_data = pd.read_csv(r"C:\Users\guntu\Downloads\melbourne_housing_raw.csv")

In [36]:
melbourne_data_cleaned = melbourne_data.dropna(subset=['Price'])

In [38]:
X = melbourne_data_cleaned.select_dtypes(include=['float64', 'int64']).drop(columns=['Price'])
y = melbourne_data_cleaned['Price']

In [40]:
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)


In [44]:
model = RandomForestRegressor(random_state=42)

In [46]:
rfe = RFE(estimator=model, n_features_to_select=1, step=1)
rfe.fit(X_train, y_train)

In [47]:
ranking = rfe.ranking_
feature_names = melbourne_data_cleaned.select_dtypes(include=['float64', 'int64']).drop(columns=['Price']).columns

In [48]:
feature_ranking = pd.DataFrame({'Feature': feature_names, 'Ranking': ranking})
feature_ranking_sorted = feature_ranking.sort_values(by='Ranking')

In [49]:
print("Feature Rankings:")
print(feature_ranking_sorted)

Feature Rankings:
          Feature  Ranking
2        Postcode        1
1        Distance        2
0           Rooms        3
6        Landsize        4
10     Longtitude        5
9       Lattitude        6
7    BuildingArea        7
11  Propertycount        8
8       YearBuilt        9
4        Bathroom       10
5             Car       11
3        Bedroom2       12


In [50]:
mae_list = []
features_left = list(feature_names)

for i in range(1, len(features_left)):
    # Keep the top (n - i) ranked features, dropping the i least important ones
    selected_features = feature_ranking_sorted['Feature'].head(len(features_left) - i)
    X_train_selected = pd.DataFrame(X_train, columns=feature_names)[selected_features]
    X_test_selected = pd.DataFrame(X_test, columns=feature_names)[selected_features]

    # Train and evaluate model with selected features
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    mae_list.append((len(selected_features), mae))

In [51]:
print("\n Performance after feature elimination:")
for num_features, mae in mae_list:
    print(f'Total features : {num_features}, MAE: {mae}')


 Performance after feature elimination:
Total features : 11, MAE: 179424.51396626508
Total features : 10, MAE: 180048.5765718176
Total features : 9, MAE: 183282.71177159972
Total features : 8, MAE: 184083.83523843126
Total features : 7, MAE: 185060.50616066912
Total features : 6, MAE: 192430.39868839944
Total features : 5, MAE: 198423.13374158964
Total features : 4, MAE: 207478.119795803
Total features : 3, MAE: 229791.0755896465
Total features : 2, MAE: 335183.6888627427
Total features : 1, MAE: 336796.30023891194
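
The MAE rises slowly down to about 7 features and sharply below 3. One simple way to turn this table into a decision is to pick the smallest feature count whose MAE stays within a chosen tolerance of the best MAE. The helper name `smallest_within_tolerance` and the 5% tolerance are assumptions for illustration; the numbers are copied from the output above.

```python
# (feature count, MAE) pairs copied from the printed results.
mae_by_count = [
    (11, 179424.51396626508),
    (10, 180048.5765718176),
    (9, 183282.71177159972),
    (8, 184083.83523843126),
    (7, 185060.50616066912),
    (6, 192430.39868839944),
    (5, 198423.13374158964),
    (4, 207478.119795803),
    (3, 229791.0755896465),
    (2, 335183.6888627427),
    (1, 336796.30023891194),
]

def smallest_within_tolerance(results, tolerance=0.05):
    """Return the smallest feature count whose MAE is within tolerance of the best MAE."""
    best_mae = min(mae for _, mae in results)
    limit = best_mae * (1 + tolerance)
    eligible = [count for count, mae in results if mae <= limit]
    return min(eligible)

chosen = smallest_within_tolerance(mae_by_count, tolerance=0.05)
print("Smallest feature count within 5% of the best MAE:", chosen)
```

With a 5% tolerance this selects 7 features; a tolerance of 0 returns 11, the count with the lowest MAE in the table.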


## Random Forest

**11. Diabetes Dataset:** Use the feature importance scores from a random forest model to
rank the features in the diabetes dataset, then keep only the top 5 most important
features and evaluate how well the reduced model predicts diabetes.

In [108]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [109]:
data = pd.read_csv(r"C:\Users\guntu\Downloads\diabetes.csv")

In [110]:
X = data.drop(columns='Outcome')
y = data['Outcome']

In [111]:
# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

In [112]:
# Fit the model
rf_model.fit(X, y)

In [113]:
# Get feature importance scores
importance_scores = rf_model.feature_importances_
feature_importance = pd.Series(importance_scores, index=X.columns).sort_values(ascending=False)

In [114]:
# Print feature importance scores
print("Feature Importance Scores:")
print(feature_importance, "\n")

Feature Importance Scores:
Glucose                     0.267142
BMI                         0.168769
Age                         0.131567
DiabetesPedigreeFunction    0.122695
BloodPressure               0.088660
Pregnancies                 0.085017
Insulin                     0.071547
SkinThickness               0.064604
dtype: float64 



In [115]:
# Select the top 5 most important features
top_features = feature_importance.head(5).index.tolist()
print(f"Top 5 Features: {top_features}\n")

Top 5 Features: ['Glucose', 'BMI', 'Age', 'DiabetesPedigreeFunction', 'BloodPressure']



In [116]:
X_top = X[top_features]
X_train, X_test, y_train, y_test = train_test_split(X_top, y, test_size=0.3, random_state=42)

In [117]:
rf_model.fit(X_train, y_train)

In [118]:
# Evaluate the model performance
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with top 5 features: {accuracy:.4f}\n")

Accuracy with top 5 features: 0.7489
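
In the cells above, the importance ranking is computed from all rows, including those that later form the test set, so the selection step has already seen the evaluation data. One way to avoid this is to place the selection inside a pipeline that is fit on the training split only. The sketch below uses `SelectFromModel` with `max_features=5`; the synthetic data and the estimator settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline = Pipeline([
    # threshold=-np.inf disables the importance cutoff, so max_features alone
    # decides how many features survive (the 5 with the highest importances).
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=42),
        max_features=5,
        threshold=-np.inf,
    )),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_train, y_train)  # importances are computed from training rows only

kept = pipeline.named_steps["select"].get_support()
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("Features kept:", int(kept.sum()))
print(f"Held-out accuracy: {accuracy:.4f}")
```

The same pipeline can be passed to `cross_val_score`, in which case the top-5 selection is repeated within each fold.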



**12. Melbourne Housing Dataset:** Train a random forest model on the Melbourne housing
dataset to determine the most important features for predicting housing prices, and
assess the model’s accuracy after removing the least important features.

In [59]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
import numpy as np

In [61]:
melbourne_data = pd.read_csv(r"C:\Users\guntu\Downloads\melbourne_housing_raw.csv")

In [63]:
melbourne_data_cleaned = melbourne_data.dropna(subset=['Price'])

In [65]:
X = melbourne_data_cleaned.select_dtypes(include=['float64', 'int64']).drop(columns=['Price'])
y = melbourne_data_cleaned['Price']

In [67]:
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

In [73]:
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

In [75]:
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort by importance descending
feature_names = melbourne_data_cleaned.select_dtypes(include=['float64', 'int64']).drop(columns=['Price']).columns

In [77]:
print("Feature Importance Ranking:")
for idx in indices:
    print(f'{feature_names[idx]}: {importances[idx]}')

Feature Importance Ranking:
Distance: 0.27642548087435304
Rooms: 0.21859591163938227
Postcode: 0.18913604092123223
Landsize: 0.07666168167616282
BuildingArea: 0.05089059548042885
Longtitude: 0.04396828931563525
Lattitude: 0.04358082967759251
Propertycount: 0.034685054759188455
YearBuilt: 0.02502637273662553
Bathroom: 0.019585440128347707
Car: 0.016252519882662745
Bedroom2: 0.005191782908388576


In [79]:
y_pred = model.predict(X_test)
mae_full = mean_absolute_error(y_test, y_pred)
print(f'\nModel accuracy with all features, MAE: {mae_full}')


Model accuracy with all features, MAE: 179476.4312564365


In [81]:
n_least_important = 3
least_important_features = feature_names[indices[-n_least_important:]]
selected_features_idx = indices[:-n_least_important]
X_train_reduced = X_train[:, selected_features_idx]
X_test_reduced = X_test[:, selected_features_idx]

In [83]:
model.fit(X_train_reduced, y_train)
y_pred_reduced = model.predict(X_test_reduced)
mae_reduced = mean_absolute_error(y_test, y_pred_reduced)

In [84]:
print(f'\nRemoved the {n_least_important} least important features: {list(least_important_features)}')
print(f'Model accuracy after removing these features, MAE: {mae_reduced}')



Removed the 3 least important features: ['Bathroom', 'Car', 'Bedroom2']
Model accuracy after removing these features, MAE: 183280.52848451494
