In [82]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Define problem

- **Type of Problem:** Regression
- **Objective:** Predict the total compensation of an employee based on various job-related features.
- **Features and target:**
    1. `company`: The company the employee works for (categorical).
    2. `company_size`: The size of the company (numerical).
    3. `job_title`: The job title of the employee (categorical).
    4. `level`: The job level of the employee (categorical).
    5. `domain`: The domain of the company or job (categorical).
    6. `yoe_total`: Years of total work experience (numerical).
    7. `yoe_at_company`: Years of work experience at the current company (numerical).
    8. `base`: Base salary (numerical).
    9. `stock`: Stock-related compensation (numerical).
    10. `bonus`: Bonus amount (numerical).
    11. `total_compensation`: **The target variable** - total compensation (numerical).

- **Example Questions:**
    - What is the expected total compensation for an employee with a certain job title, at a specific company, with a given level of experience?
    - How much does the total compensation vary based on job level or company size?

- **Potential Use Cases:**
    - Helping HR departments and employees understand the factors influencing compensation.
    - Guiding salary negotiations by providing estimates based on relevant features.
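
The column roles listed above can be sketched in code as follows. This is only a minimal illustration: the three rows are invented placeholder values (not taken from the real dataset), and the choice of `pd.get_dummies` for one-hot encoding the categorical columns is an assumption rather than the method this notebook necessarily uses.

```python
import pandas as pd

# Hypothetical rows, invented purely for illustration (not real data).
df = pd.DataFrame({
    "company": ["A", "B", "A"],
    "company_size": [1000, 50, 1000],
    "job_title": ["Engineer", "Analyst", "Engineer"],
    "level": ["Junior", "Senior", "Senior"],
    "domain": ["Backend", "Data", "Backend"],
    "yoe_total": [1.0, 6.0, 8.0],
    "yoe_at_company": [1.0, 2.0, 5.0],
    "base": [50.0, 80.0, 100.0],
    "stock": [5.0, 0.0, 20.0],
    "bonus": [2.0, 8.0, 10.0],
    "total_compensation": [57.0, 88.0, 130.0],
})

# Column roles as described in the problem definition.
categorical_cols = ["company", "job_title", "level", "domain"]
target_col = "total_compensation"

# Separate the target from the inputs.
y = df[target_col]

# One-hot encode the categorical columns so every input is numeric
# (assumed encoding; other schemes such as ordinal encoding are possible).
X = pd.get_dummies(df.drop(columns=[target_col]), columns=categorical_cols, dtype=float)

print(X.shape)
print(list(X.columns))
```

Keeping the target in its own variable from the start makes it harder to accidentally feed `total_compensation` back into the model as an input.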

## Why do we treat this as a regression problem?

We frame this as a <font color='#F3E5AB'>regression problem</font> because the goal is to predict a continuous value: the total compensation of an employee. Regression is an appropriate choice for the following reasons:
 1. **Continuous Target Variable:** Total compensation is a continuous variable that does not fall into fixed categories. Regression is the standard approach for predicting such an amount.
 2. **Prediction of Specific Quantities:** In this context, we are interested in predicting a specific quantity: the total compensation an employee might receive in a given scenario.
 3. **Relationship between Features and Target:** Features such as job level, experience, domain, and base salary can significantly influence total compensation. Regression models are designed to estimate this kind of relationship.
 4. **Model Evaluation:** Regression models can be evaluated with metrics such as <font color='#F3E5AB'>Mean Squared Error (MSE)</font>, which averages the squared differences between predicted and actual values, and <font color='#F3E5AB'>R-squared</font>, which measures the share of variance in the target that the model explains.
 5. **Convenient for Model Interpretation:** Regression models, linear ones in particular, make it easy to interpret the impact of each feature on the target variable through their fitted coefficients. This is valuable for understanding why the model makes specific predictions.
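
Points 4 and 5 can be illustrated with a small sketch. It uses synthetic data generated under an assumed linear relationship (the coefficients 5 and 0.002, the intercept 40, and the noise level are arbitrary choices for this example, not findings from the real dataset), fits a `LinearRegression`, and reports MSE, R-squared, and the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: values are generated, not taken from the real dataset.
rng = np.random.default_rng(0)
n = 200
yoe_total = rng.uniform(0, 20, size=n)
company_size = rng.uniform(10, 10_000, size=n)

# Assumed ground truth for the illustration: linear signal plus Gaussian noise.
total_compensation = (
    40 + 5 * yoe_total + 0.002 * company_size + rng.normal(0, 5, size=n)
)

X = np.column_stack([yoe_total, company_size])
y = total_compensation

# Hold out 20% of the rows to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE: mean of squared differences between actual and predicted values.
mse = mean_squared_error(y_test, y_pred)
# R-squared: share of target variance explained by the model.
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R-squared: {r2:.3f}")

# Each coefficient is the estimated change in the target per unit change
# in that feature, holding the other feature fixed.
for name, coef in zip(["yoe_total", "company_size"], model.coef_):
    print(f"{name}: {coef:.4f}")
```

Because the data here is generated from a known linear rule, the fitted coefficients should land close to the assumed values; on real compensation data the fit and its interpretation would depend on how well a linear form actually holds.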