<a href="https://colab.research.google.com/github/Katkins178/KendallA_DTSC4050_Fall2025/blob/main/KA_0955_Assignment_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 9

Kendall Atkins

**Instructions**

* Save a copy to your drive
* Provide your name
* Run the first code cell
* Complete the rest of the cells given the prompt

**Note**: There are three target variables. Only one is used in a code cell at a time.

## Filter Methods

In [2]:
import pandas as pd
import numpy as np
import random
import time
from sklearn.feature_selection import mutual_info_classif, chi2
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor


def generate_user_seed():
    # Get current time in nanoseconds (more granular)
    nanoseconds = time.time_ns()

    # Add a small random component to further reduce collision chances
    random_component = random.randint(0, 1000)  # Adjust range as needed

    # Combine them (XOR is a good way to mix values)
    seed = nanoseconds ^ random_component

    # Ensure the seed is within the valid range for numpy's seed
    seed = seed % (2**32)  # Modulo to keep it within 32-bit range

    return seed

user_seed = generate_user_seed()
print(user_seed)
np.random.seed(user_seed)


# Generate dataset
n_samples = 500
n_features = 10

# Numerical features
numerical_data = np.random.randn(n_samples, n_features - 2)  # 8 numerical features

# Categorical features
categorical_data = np.random.choice(['A', 'B', 'C', 'D'], size=(n_samples, 2))  # 2 categorical features

# Create DataFrame
columns_numerical = [f'num_{i}' for i in range(n_features - 2)]
columns_categorical = ['cat_1', 'cat_2']
df_numerical = pd.DataFrame(numerical_data, columns=columns_numerical)
df_categorical = pd.DataFrame(categorical_data, columns=columns_categorical)

df = pd.concat([df_numerical, df_categorical], axis=1)

# Target variable (numerical and categorical for different methods)
df['target_numerical'] = df['num_0'] + 0.5 * df['num_1'] + np.random.randn(n_samples) * 0.5
df['target_categorical'] = np.random.choice(['Yes', 'No'], size=n_samples, p=[0.7, 0.3])

# Create highly correlated features
df['num_6'] = df['num_0'] + 0.8 * df['num_1'] + np.random.randn(n_samples) * 0.1  # Highly correlated with num_0 and num_1
df['num_7'] = 0.7 * df['num_2'] - 0.9 * df['num_3'] + np.random.randn(n_samples) * 0.1  # Highly correlated with num_2 and num_3

# Encode categorical target
le = LabelEncoder()
df['target_categorical_encoded'] = le.fit_transform(df['target_categorical'])

# Encode categorical features
encoder = OrdinalEncoder()
df[columns_categorical] = encoder.fit_transform(df[columns_categorical])

1538654281


In [3]:
# 1a. Print the Pearson correlation coefficients between the columns_numerical and the target_numerical only
pearson_corr = df[columns_numerical].corrwith(df["target_numerical"])
print(pearson_corr)

num_0    0.813522
num_1    0.407011
num_2    0.050437
num_3    0.026618
num_4    0.071344
num_5    0.041500
num_6    0.889647
num_7    0.013619
dtype: float64


1b. What is the correlation between the columns_numerical and the target_numerical telling us?

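num_0 (0.81) and num_6 (0.89) are strongly correlated with target_numerical, and num_1 (0.41) is moderately correlated. This matches how the data was built: the target is num_0 + 0.5 * num_1 plus noise, and num_6 is itself constructed from num_0 and num_1, so its high correlation reflects redundancy with those two features rather than a separate driver.

The remaining features (num_2 to num_5 and num_7) all have correlations below 0.08, indicating weak to no linear relationship with the target.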
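
As a rough check on the reasoning in 1b, the sketch below rebuilds the same recipe from the setup cell with a larger sample and a fixed seed. The seed, the sample size, and the `pearson` helper are assumptions made for this illustration; they are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed chosen for this sketch only
n = 5000                        # larger than the notebook's 500 to reduce noise

num_0 = rng.standard_normal(n)
num_1 = rng.standard_normal(n)
num_5 = rng.standard_normal(n)  # plays the role of a feature unrelated to the target

# Same construction as the setup cell
target = num_0 + 0.5 * num_1 + 0.5 * rng.standard_normal(n)
num_6 = num_0 + 0.8 * num_1 + 0.1 * rng.standard_normal(n)


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


corr_num_0 = pearson(num_0, target)
corr_num_6 = pearson(num_6, target)
corr_num_5 = pearson(num_5, target)

print(f"num_0: {corr_num_0:.3f}, num_6: {corr_num_6:.3f}, num_5: {corr_num_5:.3f}")
```

The derived feature num_6 correlates with the target at least as strongly as num_0 does, even though it carries no information beyond num_0 and num_1.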

In [4]:
# 2a. Print the feature and the Variance Inflation Factor (VIF) values for the numerical features only
vif_data = pd.DataFrame()
vif_data["feature"] = df_numerical.columns
vif_data["VIF"] = [variance_inflation_factor(df_numerical.values, i) for i in range(len(df_numerical.columns))]
print(vif_data.to_string(index=False))

feature      VIF
  num_0 1.014793
  num_1 1.014504
  num_2 1.013029
  num_3 1.005749
  num_4 1.015674
  num_5 1.007460
  num_6 1.003990
  num_7 1.016054



2b. What is VIF telling us?

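There is no multicollinearity among these features: every VIF is close to 1, meaning none of them is well explained by a linear combination of the others.

Note that the cell uses df_numerical, which was created before num_6 and num_7 were overwritten in df, so these values describe the original independent random columns. Running the same calculation on df[columns_numerical] would be expected to show much larger VIFs for num_0, num_1, and num_6 (and for num_2, num_3, and num_7), because num_6 and num_7 are built from those features.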
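
To illustrate how VIF reacts to a derived feature, here is a minimal NumPy sketch. It uses the identity that the VIF of each column equals the corresponding diagonal entry of the inverse correlation matrix, which corresponds to regressions that include an intercept. That differs slightly from calling `variance_inflation_factor` on a matrix without a constant column. The helper name `vif_from_corr`, the seed, and the sample size are assumptions for this example.

```python
import numpy as np


def vif_from_corr(X):
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)  # fixed seed chosen for this sketch only
n = 5000

a, b, c = rng.standard_normal((3, n))            # three independent features
d = a + 0.8 * b + 0.1 * rng.standard_normal(n)   # built like num_6 in the setup cell

vif_independent = vif_from_corr(np.column_stack([a, b, c]))
vif_with_derived = vif_from_corr(np.column_stack([a, b, c, d]))

print("independent columns:", np.round(vif_independent, 2))
print("with derived column:", np.round(vif_with_derived, 2))
```

In the second matrix the VIFs of `a`, `b`, and `d` become large while `c` stays near 1, which is the pattern one would look for when diagnosing multicollinearity.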

In [8]:
# 3a. Print the Mutual Information Classification values between all features that start with num or cat and the categorical target encoded
feature_cols = columns_numerical + columns_categorical
X = df[feature_cols]
y = df['target_categorical_encoded']

mi_scores = mutual_info_classif(X, y, random_state=user_seed)

mi_data = pd.DataFrame({
    'Feature': feature_cols,
    'MI_Score': mi_scores
})

print("\nMutual Information Classification Scores:")
print(mi_data.to_string(index=False))


Mutual Information Classification Scores:
Feature  MI_Score
  num_0  0.000000
  num_1  0.010987
  num_2  0.000000
  num_3  0.000000
  num_4  0.009639
  num_5  0.047573
  num_6  0.033834
  num_7  0.008853
  cat_1  0.000000
  cat_2  0.017127


3b. What are the mutual information values telling us?

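None of the features helps predict the target: every mutual information score is close to 0 (the largest is about 0.05, for num_5). This is expected, because target_categorical was drawn at random, independently of every feature. The small nonzero scores are estimation noise.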
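
For intuition about what a score of 0 means, the sketch below computes mutual information directly from its definition for discrete variables. This is not the estimator that `mutual_info_classif` uses for continuous features (that one is nearest-neighbor based, which is why its estimates are noisy); the function name `discrete_mutual_info` and the toy arrays are assumptions for this example.

```python
import numpy as np


def discrete_mutual_info(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            if p_xy > 0:
                p_x = np.mean(x == xv)             # marginal of x
                p_y = np.mean(y == yv)             # marginal of y
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return float(mi)


y = np.array([0, 1] * 250)              # balanced binary target
x_copy = y.copy()                       # fully determines the target
x_indep = np.array([0, 0, 1, 1] * 125)  # every (x, y) pair equally likely

mi_copy = discrete_mutual_info(x_copy, y)
mi_indep = discrete_mutual_info(x_indep, y)

print(f"copy of target: {mi_copy:.4f}, independent feature: {mi_indep:.4f}")
```

A feature that fully determines a balanced binary target reaches the maximum of ln(2) nats, while an independent feature scores 0, which is the situation the notebook's features are in.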

In [7]:
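# 4a. Chi-square scores and p-values between the categorical features and the encoded categorical target
chi2_scores, p_values = chi2(df[columns_categorical], df['target_categorical_encoded'])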
chi2_data = pd.DataFrame({
    'Feature': columns_categorical,
    'Chi2_Score': chi2_scores,
    'P_Value': p_values
})
print("\nChi-Square Test Results:")
print(chi2_data.to_string(index=False))


Chi-Square Test Results:
Feature  Chi2_Score  P_Value
  cat_1    0.144973 0.703387
  cat_2    0.071599 0.789023


4b. What are the Chi-Square values telling us?

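Both p-values (0.70 for cat_1 and 0.79 for cat_2) are much larger than 0.05, and the Chi-square statistics are small, so we fail to reject the null hypothesis of independence. There is no evidence of a statistical relationship between either categorical feature and the target.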
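
The sketch below shows the classic contingency-table version of the Chi-square test of independence, written with NumPy only. It is a different calculation from `sklearn.feature_selection.chi2`, which treats the feature values as non-negative frequencies, so the two are not expected to give the same numbers. The helper name `chi_square_stat` and the toy arrays are assumptions for this example.

```python
import numpy as np


def chi_square_stat(x, y):
    """Pearson Chi-square statistic and degrees of freedom for two categorical arrays."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)

    # Observed counts for every (x category, y category) pair
    observed = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(observed, (x_idx, y_idx), 1)

    # Counts expected if x and y were independent
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()

    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (len(x_vals) - 1) * (len(y_vals) - 1)
    return float(stat), dof


y = np.array([0, 1] * 250)
x_copy = y.copy()                       # perfectly associated with y
x_indep = np.array([0, 0, 1, 1] * 125)  # exactly independent of y

stat_copy, dof_copy = chi_square_stat(x_copy, y)
stat_indep, dof_indep = chi_square_stat(x_indep, y)

CRITICAL_DF1 = 3.841  # 0.05 critical value of the Chi-square distribution with 1 degree of freedom
print(stat_copy > CRITICAL_DF1, stat_indep > CRITICAL_DF1)
```

A statistic above the critical value corresponds to a p-value below 0.05 (reject independence); a statistic below it corresponds to a p-value above 0.05, as in the notebook's results.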

In [9]:
# 5a. Print the Variance Threshold values for the columns_numerical
variances = df_numerical.var()

variance_data = pd.DataFrame({
    'Feature': columns_numerical,
    'Variance': variances.values
})

print("\nVariance Threshold Values:")
print(variance_data.to_string(index=False))


Variance Threshold Values:
Feature  Variance
  num_0  1.037391
  num_1  0.926036
  num_2  0.986655
  num_3  1.092006
  num_4  0.959237
  num_5  1.042099
  num_6  0.971602
  num_7  0.931579


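5b. What are the Variance Threshold values telling us?

Every feature has a variance close to 1, which is what we expect from standard normal draws. No feature is constant or nearly constant, so a variance threshold would not remove any of them. As in 2a, these values come from df_numerical, so they describe num_6 and num_7 before they were overwritten in df.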
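
For contrast, here is a small sketch of a case where a variance threshold does remove something. The columns and the cutoff of 0.05 are made up for this illustration. Note that NumPy's `var` uses the population variance (`ddof=0`) by default, while pandas' `.var()` uses the sample variance (`ddof=1`), so the two can differ slightly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed chosen for this sketch only

X = np.column_stack([
    rng.standard_normal(500),         # variance near 1
    0.01 * rng.standard_normal(500),  # variance near 0.0001
    np.full(500, 3.0),                # constant column, variance exactly 0
])

threshold = 0.05               # hypothetical cutoff
variances = X.var(axis=0)      # population variance of each column
keep_mask = variances > threshold

print("variances:", np.round(variances, 4))
print("kept columns:", np.flatnonzero(keep_mask).tolist())
```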

**Run the following code cell**

In [10]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset for regression
X, y = make_regression(n_samples=200, n_features=10, n_informative=5, random_state=user_seed)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=user_seed)
print(user_seed)



1538654281


In [11]:
# 6a. Create a linear regression model and a RFE object, selecting the top 5 features.
# Fit (train) the RFE model with X_train and y_train
# Print the selected features and their rankings

# Create RFE object to select top 5 features
rfe = RFE(estimator= LinearRegression(), n_features_to_select=5)

# Fit the RFE model
rfe.fit(X_train, y_train)

# Get the selected features
selected_features = X_train.columns[rfe.support_].tolist()

# Get the feature rankings (1 = selected, higher numbers = eliminated earlier)
feature_rankings = pd.DataFrame({
    'Feature': X_train.columns,
    'Selected': rfe.support_,
    'Ranking': rfe.ranking_
}).sort_values('Ranking')

print("Selected Features (Top 5):")
print(selected_features)
print("\nFeature Rankings:")
print(feature_rankings)
print("\nDetailed Ranking (1 = selected):")
for idx, row in feature_rankings.iterrows():
    status = "✓ SELECTED" if row['Selected'] else "✗ Eliminated"
    print(f"{row['Feature']}: Rank {row['Ranking']} {status}")

Selected Features (Top 5):
['feature_0', 'feature_1', 'feature_3', 'feature_7', 'feature_8']

Feature Rankings:
     Feature  Selected  Ranking
0  feature_0      True        1
1  feature_1      True        1
3  feature_3      True        1
7  feature_7      True        1
8  feature_8      True        1
9  feature_9     False        2
2  feature_2     False        3
6  feature_6     False        4
4  feature_4     False        5
5  feature_5     False        6

Detailed Ranking (1 = selected):
feature_0: Rank 1 ✓ SELECTED
feature_1: Rank 1 ✓ SELECTED
feature_3: Rank 1 ✓ SELECTED
feature_7: Rank 1 ✓ SELECTED
feature_8: Rank 1 ✓ SELECTED
feature_9: Rank 2 ✗ Eliminated
feature_2: Rank 3 ✗ Eliminated
feature_6: Rank 4 ✗ Eliminated
feature_4: Rank 5 ✗ Eliminated
feature_5: Rank 6 ✗ Eliminated


6b. What is Recursive Feature Elimination doing for us?

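RFE repeatedly fits the linear regression, drops the feature with the smallest coefficient magnitude, and refits until only 5 of the 10 features remain. The result is a simpler model that keeps the features the estimator relies on most. Features ranked 2 through 6 were eliminated, with rank 6 (feature_5) dropped first.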
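
The elimination loop described in 6b can be sketched by hand with NumPy least squares. This is a simplified illustration of the idea, not scikit-learn's `RFE` implementation; the function name `manual_rfe` and the toy data are assumptions for this example.

```python
import numpy as np


def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # in order of elimination (first dropped comes first)
    while len(remaining) > n_keep:
        # Center the data so that no explicit intercept column is needed
        Xc = X[:, remaining] - X[:, remaining].mean(axis=0)
        yc = y - y.mean()
        coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated


rng = np.random.default_rng(0)  # fixed seed chosen for this sketch only
X = rng.standard_normal((200, 5))
# Only columns 0 and 3 drive the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)

kept, dropped = manual_rfe(X, y, n_keep=2)
print("kept:", kept, "dropped (in order):", dropped)
```

Comparing coefficient magnitudes like this only makes sense when the features are on similar scales, which holds here because every column is standard normal.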


**Run the following code cell**

In [13]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.feature_selection import SelectKBest, f_regression, SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Generate a synthetic dataset for regression
X, y = make_regression(n_samples=200, n_features=10, n_informative=5, random_state=user_seed)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=user_seed)
print(user_seed)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



1538654281


In [14]:
# 7a. Create a Lasso model with alpha = 0.1
# Fit (train) the model with X_train_scaled and y_train
# Print the selected features and coefficients (with their feature names)
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train_scaled, y_train)

# Print the selected features and coefficients
print("Lasso Coefficients:")
for feature, coef in zip(feature_names, lasso.coef_):
    print(f"{feature}: {coef:.4f}")

# Show which features were selected (non-zero coefficients)
selected_features = [feature for feature, coef in zip(feature_names, lasso.coef_) if coef != 0]
print(f"\nSelected features (non-zero coefficients): {selected_features}")
print(f"Number of features selected: {len(selected_features)}")

Lasso Coefficients:
feature_0: 23.6109
feature_1: 34.5022
feature_2: -0.0000
feature_3: 2.1830
feature_4: 0.0000
feature_5: -0.0000
feature_6: 0.0000
feature_7: 71.0828
feature_8: 13.5648
feature_9: -0.0000

Selected features (non-zero coefficients): ['feature_0', 'feature_1', 'feature_3', 'feature_7', 'feature_8']
Number of features selected: 5


7b. What is the Lasso model doing for us?

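Lasso adds an L1 penalty that shrinks coefficients and sets the least useful ones to exactly zero, so it performs feature selection and regularization at the same time. Here it kept 5 features (feature_0, feature_1, feature_3, feature_7, feature_8) and zeroed out the other 5, which helps the model generalize to unseen data.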
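
One way to see why Lasso produces exact zeros is the soft-thresholding rule. Under the simplifying assumption that the standardized features are uncorrelated, each Lasso coefficient is the ordinary least squares coefficient pulled toward zero by alpha and clipped at zero. The notebook's features are not exactly uncorrelated, so this is only a sketch of the mechanism, and the input values below are hypothetical.

```python
import numpy as np


def soft_threshold(ols_coef, alpha):
    """Shrink each coefficient toward zero by alpha; anything within alpha of zero becomes zero."""
    ols_coef = np.asarray(ols_coef, dtype=float)
    return np.sign(ols_coef) * np.maximum(np.abs(ols_coef) - alpha, 0.0)


# Hypothetical least squares coefficients (not taken from the notebook)
ols = np.array([23.7, -0.04, 0.03, 2.3])

lasso_like = soft_threshold(ols, alpha=0.1)
n_selected = int(np.count_nonzero(lasso_like))

print(lasso_like, n_selected)
```

Small negative coefficients end up as negative zero, which is why a Lasso output can print `-0.0000` for a feature that has been removed.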

7c. Why aren't we scaling y_train?

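The L1 penalty applies to the coefficients, whose sizes depend on the scale of each feature, so the features need a common scale to be penalized fairly. The target itself is not penalized. Scaling y_train would only rescale the coefficients (and change the effective strength of alpha), and leaving it unscaled keeps predictions in the original units.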

In [15]:
# 8a. Create a Ridge model with alpha = 0.5
# Fit (train) the model with X_train_scaled and y_train
# Print the selected features based on threshold and print the coefficients (with their feature names)
ridge_model = Ridge(alpha=0.5, random_state=42)

# Fit (train) the model with X_train_scaled and y_train
ridge_model.fit(X_train_scaled, y_train)

# Print the coefficients with their feature names
print("Ridge Model Coefficients:")
print("-" * 50)
coefficients_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': ridge_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print(coefficients_df.to_string(index=False))
print(f"\nIntercept: {ridge_model.intercept_:.4f}")

Ridge Model Coefficients:
--------------------------------------------------
  Feature  Coefficient
feature_7    70.913710
feature_1    34.463798
feature_0    23.633995
feature_8    13.626567
feature_3     2.297274
feature_5    -0.038610
feature_6     0.025717
feature_2    -0.012401
feature_9    -0.004525
feature_4    -0.000061

Intercept: -1.1897


8b. What does the ridge model do for us?

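Ridge adds an L2 penalty that shrinks large coefficients toward zero, which reduces overfitting and stabilizes the estimates. Unlike Lasso, it does not set coefficients to exactly zero: feature_2, feature_4, feature_5, feature_6, and feature_9 keep small nonzero coefficients, so Ridge does not select features on its own unless a threshold is applied.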
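
The shrinkage described in 8b can be seen in the closed-form Ridge solution, w = (XᵀX + αI)⁻¹Xᵀy, computed here on centered data. This is a minimal sketch; the function name `ridge_coef`, the toy data, and the deliberately large alpha of 50 are assumptions chosen to make the effect visible.

```python
import numpy as np


def ridge_coef(X, y, alpha):
    """Closed-form Ridge coefficients on centered data (intercept handled by centering)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    penalty = alpha * np.eye(X.shape[1])
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)


rng = np.random.default_rng(0)  # fixed seed chosen for this sketch only
X = rng.standard_normal((140, 4))
y = 5.0 * X[:, 0] + 0.5 * rng.standard_normal(140)  # only column 0 matters

ols = ridge_coef(X, y, alpha=0.0)     # alpha = 0 reduces to ordinary least squares
ridge = ridge_coef(X, y, alpha=50.0)  # strong penalty

print("OLS:  ", np.round(ols, 3))
print("Ridge:", np.round(ridge, 3))
```

The Ridge coefficient vector is shorter than the least squares one, yet every entry stays nonzero.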

In [18]:
# 9a. Print the top 5 Select K Best Selected Features using f_regression
selector = SelectKBest(score_func=f_regression, k=5)

# Fit the selector on the training data
selector.fit(X_train_scaled, y_train)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)

# Get the feature names and scores
selected_features = [feature_names[i] for i in selected_indices]
feature_scores = selector.scores_

# Create a DataFrame with all features and their scores
feature_scores_df = pd.DataFrame({
    'Feature': feature_names,
    'F-Score': feature_scores,
    'Selected': ['Yes' if i in selected_indices else 'No' for i in range(len(feature_names))]
}).sort_values('F-Score', ascending=False)

print("Top 5 Features:")
print(feature_scores_df.head(5).to_string(index=False))



Top 5 Features:
  Feature    F-Score Selected
feature_7 342.773887      Yes
feature_1  24.073844      Yes
feature_0  14.126981      Yes
feature_8   5.128623      Yes
feature_5   2.295739      Yes


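9b. What does Select K Best do for us?

It calculates an F-statistic for each feature and then selects the k features with the highest F-scores, that is, the strongest individual linear relationships with the target. Because each feature is scored on its own, the result can differ from multivariate methods: here feature_5 makes the top 5, while RFE and Lasso both kept feature_3 instead.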
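
For a single feature, the F-statistic of a simple linear regression can be written in terms of the Pearson correlation r as F = r² / (1 − r²) · (n − 2). The sketch below applies that formula to one informative and one uninformative feature; the function name `f_statistic` and the toy data are assumptions for this example.

```python
import numpy as np


def f_statistic(x, y):
    """Univariate regression F-statistic computed from the Pearson correlation."""
    r = np.corrcoef(x, y)[0, 1]
    dof = len(y) - 2  # residual degrees of freedom for one feature plus an intercept
    return float(r**2 / (1.0 - r**2) * dof)


rng = np.random.default_rng(0)  # fixed seed chosen for this sketch only
n = 140

x_signal = rng.standard_normal(n)
x_noise = rng.standard_normal(n)
y = 2.0 * x_signal + rng.standard_normal(n)  # depends on x_signal only

f_signal = f_statistic(x_signal, y)
f_noise = f_statistic(x_noise, y)

print(f"signal: {f_signal:.1f}, noise: {f_noise:.2f}")
```

Ranking features by this score and keeping the top k is the selection step; no interaction between features is taken into account.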

In [22]:
# 10a. Print the Selected Features with Select From Model (using RandomForestRegressor)

# Create a RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on training data
rf_model.fit(X_train_scaled, y_train)

# Create SelectFromModel using the trained RandomForest
selector_model = SelectFromModel(rf_model, prefit=True)

# Fit the SelectFromModel to set its internal attributes like threshold_
selector_model.fit(X_train_scaled, y_train)

# Get the selected feature indices
selected_indices = selector_model.get_support(indices=True)

# Get the feature names and importances
selected_features = [feature_names[i] for i in selected_indices]
feature_importances = rf_model.feature_importances_

# Create a DataFrame with all features and their importance scores
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances,
    'Selected': ['Yes' if i in selected_indices else 'No' for i in range(len(feature_names))]
}).sort_values('Importance', ascending=False)

print("\nSelected Features:")
print(feature_importance_df[feature_importance_df['Selected'] == 'Yes'].to_string(index=False))



Selected Features:
  Feature  Importance Selected
feature_7    0.742151      Yes
feature_1    0.134233      Yes


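10b. What does Select From Model do for us?

It selects features using the importance weights learned by a model, keeping only those whose importance meets a threshold. With the default threshold for a random forest (the mean importance, which is 0.1 for 10 features because the importances sum to 1), only feature_7 (0.74) and feature_1 (0.13) are kept.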
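
The mean-importance rule can be sketched in a few lines of NumPy. The importance values below are hypothetical and chosen only to sum to 1; they are not the random forest's actual importances from the cell above.

```python
import numpy as np

# Hypothetical importances for 10 features (illustrative values that sum to 1)
importances = np.array([0.05, 0.13, 0.02, 0.01, 0.01, 0.01, 0.01, 0.74, 0.01, 0.01])

threshold = importances.mean()  # mean importance, 1 / n_features when they sum to 1
selected_idx = np.flatnonzero(importances >= threshold)

print(f"threshold: {threshold:.3f}")
print("selected feature indices:", selected_idx.tolist())
```

Because one feature dominates, the mean sits well above most of the individual importances, so only a couple of features clear it. Passing a different `threshold` to `SelectFromModel` (for example `"median"`) would keep more.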