In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error as mape

In [3]:
data = pd.read_csv('data.csv')

# STAGE 3

Create a DataFrame of predictors and a Series with the target. To build the predictors, drop the target variable from the data; all other variables stay unchanged.

In [4]:
X = data.drop(columns=['salary'])
y = data['salary']

Split the data into training and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

Fit the model to predict salary from all other variables, rather than from rating alone

In [6]:
model = LinearRegression()
model.fit(X_train, y_train)
print(','.join(map(str, model.coef_)))

1187791.2641789436,246170.1790599436,430020.2213681231,182762.6127964016,-87689.5852029371


# STAGE 4

In a linear regression with many variables, some predictors may be correlated with one another. This multicollinearity makes the coefficient estimates unstable and can degrade the model's performance. A crucial step is therefore to check the model for multicollinearity and exclude variables that are strongly correlated with other variables. Carry out this check, find the best model by removing the highly correlated variables, and return its MAPE score.

Correlation matrix for variables

In [7]:
correlation_matrix = data.corr()
print(correlation_matrix)

               rating    salary  draft_round       age  experience       bmi
rating       1.000000  0.810271     0.008064  0.292463    0.416545  0.077345
salary       0.810271  1.000000     0.003875  0.442715    0.532065  0.036957
draft_round  0.008064  0.003875     1.000000 -0.081857   -0.055498  0.047742
age          0.292463  0.442715    -0.081857  1.000000    0.920067  0.086477
experience   0.416545  0.532065    -0.055498  0.920067    1.000000  0.070941
bmi          0.077345  0.036957     0.047742  0.086477    0.070941  1.000000


Find variables whose correlation coefficient with salary is greater than 0.2

In [8]:
high_correlation_vars = correlation_matrix.index[correlation_matrix['salary'] > 0.2].tolist()
print(high_correlation_vars)

['rating', 'salary', 'age', 'experience']
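
The filter above ranks variables by their correlation with the target, whereas multicollinearity concerns correlation between the predictors themselves. Below is a minimal sketch of that second check, run on synthetic data rather than data.csv. The helper name `correlated_pairs`, the toy columns, and the 0.8 threshold are illustrative assumptions, not part of the original task.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the predictors; the real columns come from data.csv.
rng = np.random.default_rng(0)
age = rng.normal(27, 4, size=200)
toy = pd.DataFrame({
    'age': age,
    'experience': age - 19 + rng.normal(0, 1, size=200),  # built to track age closely
    'bmi': rng.normal(24, 2, size=200),                   # independent of the others
})

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for each pair of columns with |r| above threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

pairs = correlated_pairs(toy)
print(pairs)
```

Applied to the correlation matrix printed earlier with salary excluded, the same threshold would flag only age and experience (r = 0.92).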


Create the predictors and target, and define an evaluate_model function that splits the data, fits a model, and returns the test-set MAPE

In [9]:
# Drop the target and the variables weakly correlated with it
X = data.drop(columns=['salary', 'draft_round', 'bmi'])
y = data['salary']

def evaluate_model(X, y):
    """Fit a linear regression on a 70/30 split and return the test-set MAPE."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return round(float(mape(y_test, y_pred)), 5)

Fit linear models for salary prediction on subsets of the other variables. The subsets are as follows:
1. Remove each of the three variables found in the correlation matrix, one at a time.
2. Remove each possible pair of these three variables.

For example, if you have found out that the highly correlated variables are a, b, and c, then first you fit a model where a is removed, then a model without b, and then the model without c. After this, you estimate the model without both a and b, then without both b and c, and last, without both a and c. As a result, you will have six models to choose the best from.
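
The six removal sets in the example can be enumerated rather than typed out by hand. This is a small sketch using `itertools.combinations` from the standard library; the names `a`, `b`, `c` are the placeholders from the example, and `removals` is a hypothetical variable name.

```python
from itertools import combinations

candidates = ['a', 'b', 'c']  # placeholder names from the example above

# Every way to remove one variable, followed by every way to remove two.
removals = [combo for size in (1, 2) for combo in combinations(candidates, size)]

for removed in removals:
    kept = [name for name in candidates if name not in removed]
    print('remove', removed, '-> keep', kept)
```

Three single removals plus three pairs give the six models. With a DataFrame of predictors, each subset could then be built with something like `X.drop(columns=list(removed))`.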

Creating required subsets of variable combinations

In [10]:
feature_sets = {
    'Full set': X,
    'No Age': X.drop(columns=['age']),
    'No Experience': X.drop(columns=['experience']),
    'No Rating': X.drop(columns=['rating']),
    'No Age and Experience': X.drop(columns=['age', 'experience']),
    'No Experience and Rating': X.drop(columns=['experience', 'rating']),
    'No Age and Rating': X.drop(columns=['age', 'rating'])
}

Make predictions and print the lowest MAPE. The MAPE is a floating-point number rounded to five decimal places.
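
For reference, MAPE is the mean of |y - y_pred| / |y| over all observations. The sketch below computes it by hand with NumPy on made-up numbers, assuming no target value is zero. scikit-learn's `mean_absolute_percentage_error` reports the same quantity as a fraction rather than a percentage, so a score above 1 means the average error exceeds 100% of the true value.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# Relative errors are 0.10, 0.10 and 0.00
relative_errors = np.abs(y_true - y_pred) / np.abs(y_true)
manual_mape = float(np.mean(relative_errors))

print(round(manual_mape, 5))
```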

In [11]:
# Dictionary to store the MAPE of each feature set
mape_results = {}
# Fit a model on each feature set and record its MAPE.
# Note: each model is fitted and scored on the full data set here;
# evaluate_model(X_features, y) would instead score on the held-out test split.
for model_name, X_features in feature_sets.items():
    model = LinearRegression().fit(X_features, y)  # Fit the model
    y_pred = model.predict(X_features)  # Predict the values
    mape_results[model_name] = mape(y, y_pred)  # Store MAPE result
# print(mape_results)
min_mape = min(mape_results.values())
print(round(min_mape, 5))

1.21816


# STAGE 5 - Dealing with negative predictions

As predictors, select the variables that gave the best metric in Stage 4. Make X a DataFrame of predictors and y a Series with the target. To build X, drop the target variable and the excluded predictors from the data.

In [12]:
X = data.drop(columns=['salary', 'age', 'experience'])
y = data['salary']
# print(X)
# print(y)

Split the predictors and the target into train and test parts. Use test_size=0.3 and random_state=100; these values guarantee that the results will be the same as the test system expects.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
linearModel = LinearRegression()
linearModel.fit(X_train, y_train)
predicted_salaries = linearModel.predict(X_test)
# print(predicted_salaries)

Try two techniques to deal with negative predictions:
1. Replace the negative values with 0.
2. Replace the negative values with the median of the training part of y.
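
A toy illustration of the two techniques, using made-up numbers rather than the notebook's data. The array names here are hypothetical and chosen only for this sketch.

```python
import numpy as np

y_train_toy = np.array([50.0, 80.0, 120.0, 300.0])   # made-up training targets
predictions = np.array([-20.0, 90.0, -5.0, 150.0])   # made-up model output

# Technique 1: replace negative predictions with zero.
zero_fixed = np.where(predictions < 0, 0, predictions)

# Technique 2: replace negative predictions with the median of the training targets.
train_median = np.median(y_train_toy)                 # (80 + 120) / 2 = 100.0
median_fixed = np.where(predictions < 0, train_median, predictions)

print(zero_fixed)
print(median_fixed)
```

Only the negative entries change in either case; non-negative predictions pass through untouched.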

In [22]:
zero_replace = np.where(predicted_salaries < 0, 0, predicted_salaries)
median_y_train = np.median(y_train)  # median of the training targets, not of the predictions
median_replace = np.where(predicted_salaries < 0, median_y_train, predicted_salaries)

Calculate the MAPE for both options and print the better (lower) one as a floating-point number rounded to five decimal places.

In [23]:
# round(float(...), 5) works whether mape returns a NumPy scalar or a plain float
mape_zero_replace = round(float(mape(y_test, zero_replace)), 5)
mape_median_replace = round(float(mape(y_test, median_replace)), 5)
print(min(mape_zero_replace, mape_median_replace))

0.94701
