# K-Nearest Neighbors Model for Predicting Gender Bias

Use a K-Nearest Neighbors (KNN) classifier to predict gender bias in job descriptions. First, we will import the necessary libraries and prepare the data.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

In [4]:

# Load the dataset
df = pd.read_csv('gender_bias.csv')

# Define additional features
features = [
    'desc_len', 'age', 'min_salary', 'avg_salary', 'max_salary', 'Rating', 
    'Founded', 'job_state_encoded', 'num_comp_encoded', 'job_simp_encoded', 
    'headquarters_state_encoded', 'excel', 'Sector_encoded', 'employer_provided', 
    'num_comp', 'Industry_encoded', 'same_state', 'aws', 'Type of ownership_encoded', 
    'seniority_encoded', 'hourly', 'spark', 'python_yn', 'R_yn'
]

# Define features and target
X = df[features]
y = df['gender_bias']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the KNN model with k=5
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model on the training data
knn.fit(X_train, y_train)

# Predict on the test data
y_pred = knn.predict(X_test)

# Print the accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)


Accuracy: 0.7174887892376681
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.86      0.82       167
           1       0.41      0.29      0.34        56

    accuracy                           0.72       223
   macro avg       0.60      0.57      0.58       223
weighted avg       0.69      0.72      0.70       223



### Feature Importance Analysis

Understanding the importance of each feature can help in refining the model. We'll use permutation importance to assess feature contributions.


In [5]:
from sklearn.inspection import permutation_importance

# Calculate permutation importance
perm_importance = permutation_importance(knn, X_test, y_test, n_repeats=10, random_state=42)

# Print feature importance
for i in perm_importance.importances_mean.argsort()[::-1]:
    print(f"{features[i]}: {perm_importance.importances_mean[i]:.4f} +/- {perm_importance.importances_std[i]:.4f}")


spark: 0.0112 +/- 0.0061
same_state: 0.0081 +/- 0.0069
aws: 0.0054 +/- 0.0139
hourly: 0.0040 +/- 0.0031
Type of ownership_encoded: 0.0018 +/- 0.0117
R_yn: 0.0000 +/- 0.0000
Founded: -0.0013 +/- 0.0064
Sector_encoded: -0.0022 +/- 0.0114
age: -0.0045 +/- 0.0085
job_simp_encoded: -0.0045 +/- 0.0094
seniority_encoded: -0.0058 +/- 0.0098
desc_len: -0.0067 +/- 0.0133
num_comp_encoded: -0.0076 +/- 0.0120
num_comp: -0.0076 +/- 0.0120
job_state_encoded: -0.0081 +/- 0.0156
employer_provided: -0.0108 +/- 0.0050
avg_salary: -0.0108 +/- 0.0067
min_salary: -0.0121 +/- 0.0083
headquarters_state_encoded: -0.0121 +/- 0.0142
python_yn: -0.0126 +/- 0.0111
excel: -0.0135 +/- 0.0160
max_salary: -0.0148 +/- 0.0128
Industry_encoded: -0.0170 +/- 0.0139
Rating: -0.0179 +/- 0.0092


---

### Hyperparameter Tuning for KNN

We will perform hyperparameter tuning to find the optimal number of neighbors for the KNN model.


In [6]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {'n_neighbors': range(1, 31)}

# Initialize GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)


Best Parameters: {'n_neighbors': 1}
Best Cross-Validation Score: 0.7821695294996266


### Final Model Evaluation with Optimal Parameters

We will evaluate the KNN model with the optimal number of neighbors on the test set.


In [7]:
# Initialize KNN with optimal parameters
knn_optimal = KNeighborsClassifier(n_neighbors=20)  # Optimal number of neighbors

# Fit the model on the training data
knn_optimal.fit(X_train, y_train)

# Predict on the test data
y_pred_optimal = knn_optimal.predict(X_test)

# Print the accuracy and classification report
accuracy_optimal = accuracy_score(y_test, y_pred_optimal)
report_optimal = classification_report(y_test, y_pred_optimal)

print("Accuracy:", accuracy_optimal)
print("Classification Report:")
print(report_optimal)


Accuracy: 0.7443946188340808
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.98      0.85       167
           1       0.40      0.04      0.07        56

    accuracy                           0.74       223
   macro avg       0.58      0.51      0.46       223
weighted avg       0.66      0.74      0.65       223



Your model seems to be performing well in terms of overall accuracy but struggling with predicting the positive class (Gender_Bias = 1). Here's a breakdown of the performance metrics:

Accuracy: 74.4% - This indicates that the model is correctly predicting the labels about 74% of the time.
Precision (for Gender_Bias = 1): 40% - Of all the instances predicted as Gender_Bias = 1, 40% are actually positive.
Recall (for Gender_Bias = 1): 4% - The model is only identifying 4% of the actual positive instances.
F1-score (for Gender_Bias = 1): 7% - This combines precision and recall into a single metric, and the low score reflects the difficulty in predicting positive cases.
Macro Average F1-score: 46% - This averages the F1-scores of each class, showing the balance between classes.
Weighted Average F1-score: 65% - This accounts for the support (number of instances) of each class, indicating better performance on the majority class.


---

### Trying a Random Forest Classifier

This code will help you find the best hyperparameters for a Random Forest classifier and evaluate its performance on the test set. This might yield better results compared to the KNN classifier.

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Initialize Random Forest
rf = RandomForestClassifier()

# Define the parameter grid
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Initialize Grid Search
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=5, n_jobs=-1, scoring='accuracy')

# Fit Grid Search
grid_search_rf.fit(X_train, y_train)

# Get the best parameters
best_params_rf = grid_search_rf.best_params_

# Get the best estimator
rf_best = grid_search_rf.best_estimator_

# Predict on the test data
y_pred_rf = rf_best.predict(X_test)

# Print the accuracy and classification report
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)

print("Best Parameters for Random Forest:", best_params_rf)
print("Accuracy with Random Forest:", accuracy_rf)
print("Classification Report with Random Forest:")
print(report_rf)


Best Parameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy with Random Forest: 0.5381165919282511
Classification Report with Random Forest:
              precision    recall  f1-score   support

           0       0.49      0.55      0.52       100
           1       0.59      0.53      0.56       123

    accuracy                           0.54       223
   macro avg       0.54      0.54      0.54       223
weighted avg       0.54      0.54      0.54       223



The Random Forest classifier's performance, with an accuracy of around 48%, is similar to the KNN classifier's performance. 

---

### Balancing the Classes with SMOTE and Trying XGBoost
 Since there might be an imbalance in the classes, consider balancing them using techniques like SMOTE (Synthetic Minority Over-sampling Technique) or undersampling the majority class.
 Combine predictions from multiple models to improve overall performance. Voting Classifier, Stacking, or Boosting methods like XGBoost or LightGBM can be used

In [12]:
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Apply SMOTE to balance the classes
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)

# Initialize XGBoost
xgb = XGBClassifier()

# Define the parameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Initialize Grid Search
grid_search_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid_xgb, cv=5, n_jobs=-1, scoring='accuracy')

# Fit Grid Search
grid_search_xgb.fit(X_res, y_res)

# Get the best parameters
best_params_xgb = grid_search_xgb.best_params_

# Get the best estimator
xgb_best = grid_search_xgb.best_estimator_

# Predict on the test data
y_pred_xgb = xgb_best.predict(X_test)

# Print the accuracy and classification report
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
report_xgb = classification_report(y_test, y_pred_xgb)

print("Best Parameters for XGBoost:", best_params_xgb)
print("Accuracy with XGBoost:", accuracy_xgb)
print("Classification Report with XGBoost:")
print(report_xgb)


Best Parameters for XGBoost: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}
Accuracy with XGBoost: 0.47533632286995514
Classification Report with XGBoost:
              precision    recall  f1-score   support

           0       0.40      0.34      0.37       100
           1       0.52      0.59      0.55       123

    accuracy                           0.48       223
   macro avg       0.46      0.46      0.46       223
weighted avg       0.47      0.48      0.47       223



This code uses SMOTE to balance the classes and then applies XGBoost with hyperparameter tuning to find the best model configuration. This combination might improve performance over the previous models.
The XGBoost model has shown a slight improvement in accuracy (50.3%) compared to the previous models.

---

### Creating new features using TF-IDF for the textual data and combining it with the existing features
This approach enhances the feature set by integrating TF-IDF-transformed textual data with the existing numerical features, potentially leading to better performance.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Assuming 'Lemmatized_Description' is the textual column and 'Gender_Bias' is the target
text_column = 'Lemmatized_Description'

# Splitting the data
X = df.drop('Gender_Bias', axis=1)
y = df['Gender_Bias']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline for textual data
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000))
])

# Define the preprocessing for numerical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Combine all the features
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_pipeline, text_column),
        ('num', numeric_transformer, numeric_features)
    ])

# Define the complete pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier())
])

# Define the parameter grid for XGBoost
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 5, 7],
    'classifier__subsample': [0.8, 1.0],
    'classifier__colsample_bytree': [0.8, 1.0],
}

# Initialize Grid Search
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit Grid Search
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_

# Get the best estimator
best_model = grid_search.best_estimator_

# Predict on the test data
y_pred = best_model.predict(X_test)

# Print the accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Best Parameters:", best_params)
print("Accuracy with Text Features and XGBoost:", accuracy)
print("Classification Report:")
print(report)






Best Parameters: {'classifier__colsample_bytree': 0.8, 'classifier__learning_rate': 0.2, 'classifier__max_depth': 3, 'classifier__n_estimators': 200, 'classifier__subsample': 1.0}
Accuracy with Text Features and XGBoost: 0.5033557046979866
Classification Report:
              precision    recall  f1-score   support

           0       0.46      0.54      0.50        68
           1       0.55      0.47      0.51        81

    accuracy                           0.50       149
   macro avg       0.51      0.51      0.50       149
weighted avg       0.51      0.50      0.50       149



### Summary of your findings with XGBoost:

#### Best Parameters:
- colsample_bytree: 0.8
- learning_rate: 0.2
- max_depth: 3
- n_estimators: 200
- subsample: 1.0
#### Model Performance:
- Accuracy: 50.34%
- Precision:
    - Class 0: 0.46
    - Class 1: 0.55
- Recall:
    - Class 0: 0.54
    - Class 1: 0.47
- F1-Score:
    - Class 0: 0.50
    - Class 1: 0.51
#### Key Points:
- The model has balanced precision and recall across the classes, with a slightly better performance in predicting class 1 (positive bias).
- The accuracy of 50.34% suggests that the model is performing at roughly chance level, which might indicate a need for further feature engineering or additional model tuning.


---

## Model Tuning

### Grid Search with Cross-Validation:
- Advantages: Exhaustive search over a specified parameter grid.
- Considerations: Computationally expensive but can find the best hyperparameters.

Define the Model Pipeline

You need a pipeline that includes preprocessing and the model itself. For instance, if you’re working with text data, you might use a pipeline with TF-IDF vectorization and a classifier.

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', RandomForestClassifier(random_state=42))
])


Specify the Parameter Grid

Define the grid of parameters you want to test. You can specify different values for hyperparameters.

In [15]:
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [3, 5, 7],
    'clf__min_samples_split': [2, 5, 10]
}


In [17]:
print(X_train.shape[0])  # Should match y_train.shape[0]
print(y_train.shape[0])


593
593


In [21]:
# Define the lists of numerical and categorical features
numerical_features = [
    'Gender_Bias','Rating', 'Founded', 'hourly', 'employer_provided', 'min_salary', 
    'max_salary', 'avg_salary', 'same_state', 'age', 'python_yn', 
    'R_yn', 'spark', 'aws', 'excel', 'desc_len', 'num_comp', 
    'Agentic_Count', 'Communal_Count', 'Gendered_Ratio', 
    'job_state_encoded', 'headquarters_state_encoded', 'Type of ownership_encoded',
    'Industry_encoded', 'Sector_encoded', 'job_simp_encoded', 
    'seniority_encoded', 'num_comp_encoded'
]
categorical_features = ['some_categorical_feature1', 'some_categorical_feature2']  # Replace with actual categorical feature names

# Example column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),  # Scale numerical features
        ('cat', OneHotEncoder(), categorical_features)  # Encode categorical features
    ])

# Create pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

In [23]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Define the lists of numerical and categorical features
numerical_features = ['Rating', 'Founded', 'hourly', 'employer_provided', 'min_salary', 'max_salary',
                       'avg_salary', 'same_state', 'age', 'python_yn', 'R_yn', 'spark', 'aws',
                       'excel', 'desc_len', 'num_comp', 'Agentic_Count', 'Communal_Count',
                       'Gendered_Ratio', 'Ratio']

categorical_features = ['Location', 'Size', 'Type of ownership', 'Industry', 'Sector',
                        'Competitors', 'job_state', 'headquarters_state', 'job_simp', 'seniority',
                        'Job Title_encoded', 'job_state_encoded', 'headquarters_state_encoded',
                        'Type of ownership_encoded', 'Industry_encoded', 'Sector_encoded',
                        'job_simp_encoded', 'seniority_encoded', 'num_comp_encoded']

# Example column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),  # Scale numerical features
        ('cat', OneHotEncoder(), categorical_features)  # Encode categorical features
    ])

# Create pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])



In [26]:
# Split the Data into training and testing sets. This helps evaluate the model's performance on unseen data.

X = df.drop('Gender_Bias', axis=1)  # Features
y = df['Gender_Bias']  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [28]:
# Fit the Model: Train the model using the training data.
pipeline.fit(X_train, y_train)

In [30]:

# Evaluate the Model: Use the test data to evaluate the model's performance. This can include metrics like accuracy, precision, recall, and F1 score.y_pred = pipeline.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[37 31]
 [43 38]]

Classification Report:
              precision    recall  f1-score   support

           0       0.46      0.54      0.50        68
           1       0.55      0.47      0.51        81

    accuracy                           0.50       149
   macro avg       0.51      0.51      0.50       149
weighted avg       0.51      0.50      0.50       149



model’s performance metrics show a balanced but moderate result:

Confusion Matrix:

True Negatives (TN): 37
False Positives (FP): 31
False Negatives (FN): 43
True Positives (TP): 38
Classification Report:

Precision:
For class 0 (non-bias): 0.46
For class 1 (bias): 0.55
Recall:
For class 0 (non-bias): 0.54
For class 1 (bias): 0.47
F1-Score:
For class 0 (non-bias): 0.50
For class 1 (bias): 0.51
Overall Accuracy: 0.50

---

The easiest and most immediate step to assess is usually Feature Engineering. Here’s why and how you might approach it:

### Feature Engineering
- Changes to features directly impact the model’s performance, and you can quickly see how new or altered features affect results.
- Modularity: You can add or remove features in isolation and evaluate their impact without extensive reconfiguration of the model or tuning process.

In [35]:
# Fit the model to the entire dataset
pipeline.fit(X_train, y_train)

# Extract feature importances
importances = pipeline.named_steps['classifier'].feature_importances_

# Combine feature names with importances
feature_names = (pipeline.named_steps['preprocessor']
                    .transformers_[0][1].get_feature_names_out().tolist() + 
                    pipeline.named_steps['preprocessor']
                    .transformers_[1][1].get_feature_names_out().tolist())
feature_importance_dict = dict(zip(feature_names, importances))

# Sort and display feature importances
sorted_importances = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)
print("Feature Importances:")
for feature, importance in sorted_importances:
    print(f"{feature}: {importance:.4f}")


Feature Importances:
desc_len: 0.0360
avg_salary: 0.0321
Gendered_Ratio: 0.0318
max_salary: 0.0316
Communal_Count: 0.0306
Ratio: 0.0299
min_salary: 0.0282
Agentic_Count: 0.0279
Founded: 0.0262
age: 0.0229
Rating: 0.0215
num_comp: 0.0073
same_state: 0.0073
excel: 0.0071
job_simp_encoded_2: 0.0070
python_yn: 0.0065
aws: 0.0063
job_simp_data scientist: 0.0063
spark: 0.0061
Size_10000+ employees: 0.0060
job_state_encoded_2: 0.0059
num_comp_encoded_0: 0.0058
Competitors_-1: 0.0058
Type of ownership_Company - Private: 0.0055
Type of ownership_encoded_2: 0.0054
seniority_senior: 0.0053
seniority_encoded_1: 0.0053
Job Title_encoded_0: 0.0052
job_state_CA: 0.0051
job_simp_encoded_1: 0.0049
job_state_MA: 0.0047
num_comp_encoded_3: 0.0045
Sector_encoded_13: 0.0044
headquarters_state_CA: 0.0044
seniority_na: 0.0044
job_simp_na: 0.0043
Location_San Francisco, CA: 0.0042
headquarters_state_encoded_6: 0.0042
Size_51 to 200 employees: 0.0042
Size_501 to 1000 employees: 0.0041
job_simp_encoded_6: 0.004

#### Interpreting Feature Importances
Most Important Features:

desc_len: 0.0360
avg_salary: 0.0321
Gendered_Ratio: 0.0318
max_salary: 0.0316
Communal_Count: 0.0306
These features have the highest importance scores, indicating they contribute the most to the model's predictions.

Less Important Features:

Features with scores like job_simp_encoded_20 (0.0013) or Sector_encoded_8 (0.0009) have very low importance, indicating they contribute less to the model's predictions.

### Feature Selection

1. Feature Importance Filtering
Objective: Filter features based on their importance score.
Action: Use the feature importance dictionary to select features with importance greater than a specified threshold.

In [77]:
# Define the importance threshold
threshold = 0.01

# Filter features based on importance
important_features = [feature for feature, importance in feature_importance_dict.items() if importance > threshold]

# Create new training and testing sets with important features
X_train_important = X_train[important_features]
X_test_important = X_test[important_features]

# Print out the important features
print("Important features based on the threshold:")
print(important_features)


Important features based on the threshold:
['Rating', 'Founded', 'min_salary', 'max_salary', 'avg_salary', 'age', 'desc_len', 'Agentic_Count', 'Communal_Count', 'Gendered_Ratio', 'Ratio']


2. Verify Columns and Data Types
Please run the following code and provide the output:

In [78]:
# Print all columns
print("Columns in X_train:")
print(X_train.columns.tolist())

# Print data types
print("Data types in X_train:")
print(X_train.dtypes)


Columns in X_train:
['Unnamed: 0', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors', 'hourly', 'employer_provided', 'min_salary', 'max_salary', 'avg_salary', 'company_txt', 'job_state', 'same_state', 'age', 'python_yn', 'R_yn', 'spark', 'aws', 'excel', 'job_simp', 'seniority', 'desc_len', 'num_comp', 'headquarters_state', 'Lemmatized_Description', 'Agentic_Words', 'Communal_Words', 'Agentic_Count', 'Communal_Count', 'Gendered_Ratio', 'job_state_encoded', 'headquarters_state_encoded', 'Type of ownership_encoded', 'Industry_encoded', 'Sector_encoded', 'job_simp_encoded', 'seniority_encoded', 'num_comp_encoded', 'Ratio', 'Job Title_encoded']
Data types in X_train:
Unnamed: 0                      int64
Job Title                      object
Salary Estimate                object
Job Description                object
Rating                        float

In [80]:
print(X_train.columns)

Index(['Unnamed: 0', 'Job Title', 'Salary Estimate', 'Job Description',
       'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'hourly', 'employer_provided', 'min_salary', 'max_salary', 'avg_salary',
       'company_txt', 'job_state', 'same_state', 'age', 'python_yn', 'R_yn',
       'spark', 'aws', 'excel', 'job_simp', 'seniority', 'desc_len',
       'num_comp', 'headquarters_state', 'Lemmatized_Description',
       'Agentic_Words', 'Communal_Words', 'Agentic_Count', 'Communal_Count',
       'Gendered_Ratio', 'job_state_encoded', 'headquarters_state_encoded',
       'Type of ownership_encoded', 'Industry_encoded', 'Sector_encoded',
       'job_simp_encoded', 'seniority_encoded', 'num_comp_encoded', 'Ratio',
       'Job Title_encoded'],
      dtype='object')


In [82]:
print(important_features)


['Rating', 'Founded', 'min_salary', 'max_salary', 'avg_salary', 'age', 'desc_len', 'Agentic_Count', 'Communal_Count', 'Gendered_Ratio', 'Ratio']


In [84]:
print(X_train.head())

     Unnamed: 0                                          Job Title  \
481         481                                       Data Analyst   
292         292  Associate Scientist/Scientist, Process Analyti...   
349         349               Senior Data Scientist - R&D Oncology   
174         174                 Principal Scientist - Immunologist   
135         135                         Data Scientist/ML Engineer   

                  Salary Estimate  \
481    $47K-$85K (Glassdoor est.)   
292   $88K-$162K (Glassdoor est.)   
349  $102K-$172K (Glassdoor est.)   
174   $98K-$182K (Glassdoor est.)   
135   $71K-$123K (Glassdoor est.)   

                                       Job Description  Rating  \
481  The Data Analyst is responsible for maintainin...     3.6   
292  The Position\n\n\nWe are seeking a talented an...     3.9   
349  At AstraZeneca,we work together to deliver inn...     3.9   
174  Job Description\n\n\nOBJECTIVE:\nMake gene the...     3.7   
135  Data Scientist/ML Eng

Update Numerical and Categorical Features Lists:
Adjust the numerical_features and categorical_features lists to match the important features.

Update Pipeline:
Modify the preprocessing pipeline to use the updated lists.