# Model Testing

Here we will test different models to see which one performs best on our data. We will test the following models:
- K-Nearest Neighbors


We will work with an already prepared dataset.

In [11]:
import pandas as pd
import numpy as np
import data_preprocessing as dp

jobs_df = pd.read_csv('data/cleared/linkedin_data.csv')

# Encoding categorical data - job titles
jobs_df = dp.encode_job_ttls(jobs_df)
jobs_df = jobs_df.drop(['Job_Desc', 'Job_Ttl', 'max_sal', 'min_sal', 'Co_Nm', 'py_prd', 'loc', 'wrk_typ'], axis=1)
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13222 entries, 0 to 13221
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Flw_Cnt        13222 non-null  int64  
 1   Is_Supvsr      13222 non-null  bool   
 2   med_sal        13222 non-null  float64
 3   st_code        13222 non-null  object 
 4   is_remote      13222 non-null  int64  
 5   views          13222 non-null  int64  
 6   xp_lvl         13222 non-null  object 
 7   mean_year_sal  13222 non-null  float64
dtypes: bool(1), float64(2), int64(3), object(2)
memory usage: 839.3+ KB


In [12]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

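# Calculate mean 'med_sal' for each 'xp_lvl'
# NOTE: the means are computed on the full dataset, before the train/test split,
# so this feature carries target information from rows that later land in the test set.
mean_sal_by_xp_lvl = jobs_df.groupby('xp_lvl')['med_sal'].mean()

# Add a new column 'mean_sal_by_xp_lvl' to the DataFrame
jobs_df['mean_sal_by_xp_lvl'] = jobs_df['xp_lvl'].map(mean_sal_by_xp_lvl)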

# Make a copy of the DataFrame without the 'st_code' column
# (this copy is not used below; X and y are built from jobs_df)
jobs_df_copy = jobs_df.drop(['st_code'], axis=1)

# Define the target variable 'y' and the feature matrix 'X'
y = jobs_df['med_sal'].values
X = jobs_df[['Flw_Cnt', 'Is_Supvsr', 'is_remote', 'views', 'mean_sal_by_xp_lvl']].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Create and train a KNN regressor on the scaled data
regressor = KNeighborsRegressor(n_neighbors=5, metric='minkowski', p=2)
regressor.fit(X_train_scaled, y_train)

# Make predictions and print the mean squared error
y_pred = regressor.predict(X_test_scaled)
print('Mean squared error: ', mean_squared_error(y_test, y_pred))

# Perform 5-fold cross-validation
scores = cross_val_score(regressor, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')

# Convert scores to positive and take the square root (to get the root mean squared error)
rmse_scores = np.sqrt(-scores)

print('RMSE scores: ', rmse_scores)
print('Mean RMSE: ', rmse_scores.mean())

Mean squared error:  2193137615.069355
RMSE scores:  [47057.28078767 46135.23308833 47409.74622193 47645.44955165
 47745.13521829]
Mean RMSE:  47198.56897357495
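
The test-set figure printed above is an MSE, while the cross-validation figures are RMSEs, so they are not directly comparable. A minimal sketch of putting them on the same scale, using only the numbers copied from the output above (the variable names are illustrative):

```python
import math

# Values copied from the printed output above.
test_mse = 2193137615.069355
cv_rmse_scores = [
    47057.28078767,
    46135.23308833,
    47409.74622193,
    47645.44955165,
    47745.13521829,
]

# RMSE is the square root of MSE, expressed in the units of the target.
test_rmse = math.sqrt(test_mse)
cv_rmse_mean = sum(cv_rmse_scores) / len(cv_rmse_scores)

print(f"Test RMSE:    {test_rmse:.0f}")
print(f"Mean CV RMSE: {cv_rmse_mean:.0f}")
```

Once both are expressed as RMSE, the held-out test error and the cross-validated error can be compared directly.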


The code above trains a K-Nearest Neighbors (KNN) regressor to predict `med_sal` from five features: `Flw_Cnt`, `Is_Supvsr`, `is_remote`, `views`, and `mean_sal_by_xp_lvl`. It splits the data 75/25 into training and test sets, standardizes the features with a scaler fitted on the training set, and fits the model with `n_neighbors=5` and Euclidean distance (Minkowski with `p=2`). The model is evaluated with Mean Squared Error (MSE) on the test set and with 5-fold cross-validated RMSE on the training set.
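
One detail of the preprocessing is worth a sketch: `mean_sal_by_xp_lvl` is built from the target, so computing it before the split lets test-set salaries influence a training feature. A possible alternative, shown here on made-up toy data (the DataFrame `toy_df`, its values, and the `fallback` rule are assumptions for illustration, not part of the original notebook), is to learn the per-level means from the training rows only and then apply them to the test rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data (assumption): values are invented for illustration only.
toy_df = pd.DataFrame({
    "xp_lvl": ["Entry", "Entry", "Mid", "Mid", "Senior", "Senior", "Entry", "Mid"],
    "med_sal": [40.0, 50.0, 70.0, 80.0, 110.0, 130.0, 45.0, 75.0],
})

# Split first, so the encoding never sees test-set targets.
train_df, test_df = train_test_split(toy_df, test_size=0.25, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

# Learn the per-level mean salary from the training rows only.
level_means = train_df.groupby("xp_lvl")["med_sal"].mean()

# Fallback for levels that do not appear in the training rows (assumption).
fallback = train_df["med_sal"].mean()

train_df["mean_sal_by_xp_lvl"] = train_df["xp_lvl"].map(level_means)
test_df["mean_sal_by_xp_lvl"] = test_df["xp_lvl"].map(level_means).fillna(fallback)

print(test_df)
```

A stricter variant would recompute the means inside each cross-validation fold as well. This sketch only covers the train/test boundary.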

## Comments on the results

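The test-set Mean Squared Error (MSE) of 2193137615.069355 corresponds to a Root Mean Squared Error (RMSE) of about 46,831, in line with the cross-validated RMSE scores, which range from about 46,135 to 47,745 (mean about 47,199). The agreement between the test and cross-validation errors suggests the estimate is stable, but an error of this size, in the units of `med_sal`, indicates that the model's predictions are not very accurate. Because `mean_sal_by_xp_lvl` is derived from the target over the full dataset before the split, these figures may also be somewhat optimistic. The KNN model may need further tuning (for example, of `n_neighbors`), or a different modeling approach might be more effective.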
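
As a starting point for the tuning mentioned above, here is a hedged sketch of a grid search over `n_neighbors` and `weights`. It runs on synthetic stand-in data (an assumption, since the real dataset is not reproduced here), and the candidate values in `param_grid` are illustrative rather than recommended. The scaler sits inside a pipeline so that it is refit on each cross-validation training fold instead of on the whole training set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 rows, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling is part of the pipeline, so each CV fold fits its own scaler.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Step names come from make_pipeline (lowercased class names).
param_grid = {
    "kneighborsregressor__n_neighbors": [3, 5, 10, 20, 40],
    "kneighborsregressor__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_rmse = -search.best_score_              # scorer is negated RMSE
test_rmse = -search.score(X_test, y_test)  # same scorer on held-out data

print("Best parameters:", best_params)
print("CV RMSE:", cv_rmse)
print("Test RMSE:", test_rmse)
```

With the real data, `X` and `y` would be the feature matrix and target defined in the notebook. The unscaled `X_train` is passed to `search.fit`, since the pipeline handles scaling itself.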
