### Step 1: Import Libraries
In this step, we import the Python libraries used in the rest of the notebook:
1. `pandas` for data manipulation.
2. `mutual_info_classif` from `sklearn.feature_selection` to estimate mutual information.
3. `LabelEncoder` from `sklearn.preprocessing` to encode categorical variables as integers.


In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder


### Step 2: Load the Dataset
Here, we load the cleaned dataset into a Pandas DataFrame. Replace `cleaned_dataset.csv` with the path to your dataset file.


In [2]:
# Step 2: Load the dataset
file_path = 'cleaned_dataset.csv'  # Replace with the path to your dataset
df_cleaned = pd.read_csv(file_path)


### Step 3: Define Features and Target
We split the data into:
1. `X` (features): All columns except the target variable.
2. `y` (target): The column we want to predict (e.g., stress levels).


In [3]:
# Step 3: Define features (X) and target (y)
target_column = "How_often_do_you_feel_stressed"  # Replace with your target column
X = df_cleaned.drop(columns=[target_column])  # Features
y = df_cleaned[target_column]  # Target
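
The split above fails with a fairly terse error if the target column name is misspelled. As a minimal sketch, the same split can be wrapped in a small helper that checks for the column first. The helper name `split_features_target` and the toy column names below are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def split_features_target(df, target):
    """Hypothetical helper: return (X, y), failing early if the target is absent."""
    if target not in df.columns:
        raise KeyError(f"Target column {target!r} not found in DataFrame")
    return df.drop(columns=[target]), df[target]


# Toy frame with made-up column names and values (not the survey data).
toy_df = pd.DataFrame({"age": [21, 34, 29], "stress": ["Often", "Never", "Often"]})
X_small, y_small = split_features_target(toy_df, "stress")

print(list(X_small.columns))             # ['age']
print(y_small.value_counts().to_dict())  # {'Often': 2, 'Never': 1}
```

Looking at `value_counts()` on the target is also a quick way to see how many classes there are before computing any scores.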


### Step 4: Encode Categorical Variables
`mutual_info_classif` expects numeric input, so in this step we:
1. Convert the categorical (object-typed) columns in `X` into integer codes using `LabelEncoder`.
2. Encode the target variable `y` the same way for compatibility with the mutual information function.


In [4]:
# Step 4: Encode categorical variables in features
X_encoded = X.copy()
categorical_columns = X_encoded.select_dtypes(include=['object']).columns
for col in categorical_columns:
    le = LabelEncoder()
    X_encoded[col] = le.fit_transform(X_encoded[col].astype(str))

# Encode the target variable
y_encoded = LabelEncoder().fit_transform(y)
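
`LabelEncoder` assigns integer codes in the sorted order of the labels, so the resulting numbers do not reflect any natural ranking of the categories. The sketch below, on a made-up toy column, shows how the mapping of a fitted encoder could be inspected; the category values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the real dataset's categories will differ.
toy = pd.Series(["Often", "Never", "Sometimes", "Never"])

le = LabelEncoder()
codes = le.fit_transform(toy.astype(str))

# classes_ holds the sorted unique labels; the label at position i gets code i.
mapping = {label: code for code, label in enumerate(le.classes_)}

print(mapping)         # {'Never': 0, 'Often': 1, 'Sometimes': 2}
print(codes.tolist())  # [1, 0, 2, 0]
```

The scikit-learn documentation describes `LabelEncoder` as intended for target values; `OrdinalEncoder` from `sklearn.preprocessing` is the counterpart designed for feature matrices and may be worth considering as an alternative to the per-column loop.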


### Step 5: Calculate Mutual Information
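Use `mutual_info_classif` to estimate the mutual information between each feature and the target variable. The score is non-negative: a value of zero means the feature and the target are estimated to be independent, and higher values mean the feature carries more information about the target. Note that for a dense input such as a DataFrame, `discrete_features='auto'` treats every column as continuous, including the label-encoded ones, and the estimator adds a small amount of random noise to continuous features, so scores can differ slightly between runs unless `random_state` is set.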


In [5]:
# Step 5: Calculate mutual information
mutual_info = mutual_info_classif(X_encoded, y_encoded, discrete_features='auto')
mutual_info_series = pd.Series(mutual_info, index=X_encoded.columns).sort_values(ascending=False)
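
Since label-encoded columns hold category codes rather than measurements, one possible variation is to tell `mutual_info_classif` which columns are discrete and to fix `random_state` for repeatable scores. The following is a sketch on synthetic stand-in data, not the survey dataset; the column names and the mask are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (not from the survey dataset).
y_toy = rng.integers(0, 3, size=n)
X_toy = pd.DataFrame({
    "informative_cat": y_toy,                 # copies the target exactly
    "noise_cat": rng.integers(0, 4, size=n),  # category codes unrelated to the target
    "noise_num": rng.normal(size=n),          # continuous values unrelated to the target
})

# Boolean mask, one entry per column: True marks category codes.
discrete_mask = np.array([True, True, False])

scores = mutual_info_classif(
    X_toy, y_toy, discrete_features=discrete_mask, random_state=0
)
mi_toy = pd.Series(scores, index=X_toy.columns).sort_values(ascending=False)
print(mi_toy)
```

In the notebook's setting, such a mask could be built from `categorical_columns`, for example with `X_encoded.columns.isin(categorical_columns)`. Scores computed this way will generally differ from those obtained with `discrete_features='auto'`.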


### Step 6: Display Results
Print the mutual information scores for all features, sorted in descending order to highlight the most relevant features.


In [6]:
# Step 6: Display the results
print("Mutual Information Scores (Features vs Target):")
print(mutual_info_series)


Mutual Information Scores (Features vs Target):
I_see_myself_as_someone_who_is_anxious_easily_upset                                                                                                                           0.144296
I_see_myself_as_someone_who_is_reserved_quiet                                                                                                                                 0.119505
Do_you_have_close_friends                                                                                                                                                     0.113222
I_see_myself_as_someone_who_is_calm_emotionally_stable                                                                                                                        0.112324
Gender                                                                                                                                                                        0.087159
Do_you_find_it_difficult_to_refrain_f
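
Because these scores are estimates, small values near the bottom of the ranking may not be reliably different from zero. One way to get a feel for this, sketched below under the assumption that integer-coded features are passed with the default settings as in the cells above, is to repeat the calculation with several seeds and compare the mean and spread per feature. The helper name `mi_over_seeds` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def mi_over_seeds(X, y, seeds=(0, 1, 2, 3, 4)):
    """Hypothetical helper: mean and std of MI scores across random seeds."""
    runs = np.array([
        mutual_info_classif(X, y, random_state=seed) for seed in seeds
    ])
    return pd.DataFrame(
        {"mean": runs.mean(axis=0), "std": runs.std(axis=0)},
        index=X.columns,
    ).sort_values("mean", ascending=False)


# Synthetic stand-in data (not from the survey dataset).
rng = np.random.default_rng(1)
n = 400
y_demo = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1  # flip about 10% of the labels to blur the signal
X_demo = pd.DataFrame({
    "signal": np.where(flip, 1 - y_demo, y_demo),  # mostly follows the target
    "noise": rng.integers(0, 5, size=n),           # integer codes unrelated to the target
})

summary = mi_over_seeds(X_demo, y_demo)
print(summary)
```

A feature whose mean score is small relative to its spread across seeds is a weaker candidate for selection than one whose score is consistently well above zero.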

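### Step 7: Filter and Save Relevant Features
Keep only the features whose estimated mutual information with the target is greater than zero, then save them together with the target column to a new CSV file.


In [7]: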
# Filter columns with non-zero mutual information scores
relevant_features = mutual_info_series[mutual_info_series > 0].index

# Create a new dataset with only the relevant features and the target variable
df_relevant = df_cleaned[relevant_features.tolist() + [target_column]]

# Save the new dataset for future use
new_file_path = 'relevant_features_dataset.csv'  # Path to save the dataset
df_relevant.to_csv(new_file_path, index=False)

# Display the new dataset's columns
print("Relevant Features in the Dataset:")
print(df_relevant.columns)

print(f"\nNew dataset saved as: {new_file_path}")


Relevant Features in the Dataset:
Index(['I_see_myself_as_someone_who_is_anxious_easily_upset',
       'I_see_myself_as_someone_who_is_reserved_quiet',
       'Do_you_have_close_friends',
       'I_see_myself_as_someone_who_is_calm_emotionally_stable', 'Gender',
       'Do_you_find_it_difficult_to_refrain_from_smoking_where_it_is_forbidden_church_library_cinema_plane_etc',
       'How_old_were_you_the_first_time_you_smoked_a_full_cigarette_not_just_a_few_puffs',
       'What_is_your_favorite_or_preferred_cigarette_brands_if_you_were_able_to_access_it',
       'What_is_the_highest_level_of_education_you_have_attained',
       'How_often_do_you_exercise', 'How_old_are_you',
       'What_cigarette_brands_are_you_currently_using',
       'How_many_cigarettes_do_you_smoke_each_day',
       'I_see_myself_as_someone_who_is_disorganized_careless',
       'Do_you_smoke_if_you_are_so_ill_that_you_are_in_bed_most_of_the_day',
       'Including_yourself_how_many_people_currently_live_in_your_house