# Project: Salary Prediction

### Excel implementation in Python
In this part we reproduce in Python what we did in the Excel file "modelo_predicciones.xlsx", to compare how much easier it is with this language.

The first thing we need to do is import the necessary libraries we will use.

In [None]:
#libraries
import pandas as pd
import statsmodels.formula.api as smf


The second step is to read the data and store it in a variable.

In [2]:
# Read the data
data = pd.read_excel("Datos_Modelo.xlsx", sheet_name="Datos")

We need to make sure that the categorical variable is treated as such.

In [None]:
# make sure the column has the category dtype
data['Tamaño_Empresa'] = data['Tamaño_Empresa'].astype('category')
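
To see what this conversion does, here is a minimal sketch on a toy column. The size labels ("Small", "Medium", "Large") are hypothetical, since the actual values of Tamaño_Empresa are not shown in this notebook.

```python
import pandas as pd

# Toy column; the labels are hypothetical stand-ins for the real company sizes
toy = pd.DataFrame({"Tamaño_Empresa": ["Small", "Large", "Medium", "Small"]})
toy["Tamaño_Empresa"] = toy["Tamaño_Empresa"].astype("category")

# A category column stores integer codes that point into a fixed list of categories
categories = list(toy["Tamaño_Empresa"].cat.categories)
codes = list(toy["Tamaño_Empresa"].cat.codes)

print(toy["Tamaño_Empresa"].dtype)
print(categories)
print(codes)
```

By default pandas sorts the categories, so the order of the labels is alphabetical rather than the order in which they appear in the data.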

Now we create the models using statsmodels to fit a simple linear regression and two multiple linear regressions.

### Model 1: Salary ~ Experience

In [6]:
model_1 = smf.ols('Salario ~ Experiencia',data=data).fit()
print(model_1.summary())

                            OLS Regression Results                            
Dep. Variable:                Salario   R-squared:                       0.651
Model:                            OLS   Adj. R-squared:                  0.649
Method:                 Least Squares   F-statistic:                     368.7
Date:                Sun, 10 Aug 2025   Prob (F-statistic):           4.27e-47
Time:                        20:58:11   Log-Likelihood:                -2049.6
No. Observations:                 200   AIC:                             4103.
Df Residuals:                     198   BIC:                             4110.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    4.128e+04    888.850     46.447      
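
Besides printing the full summary, a fitted statsmodels result exposes individual values as attributes. The sketch below uses synthetic, noise-free data (not the project data), so the numbers 40000 and 2000 are made up for illustration only.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): Salario = 40000 + 2000 * Experiencia, with no noise
toy = pd.DataFrame({"Experiencia": [0, 1, 2, 3, 4, 5]})
toy["Salario"] = 40000 + 2000 * toy["Experiencia"]

toy_model = smf.ols("Salario ~ Experiencia", data=toy).fit()

# The coefficients are a pandas Series indexed by term name
intercept = toy_model.params["Intercept"]
slope = toy_model.params["Experiencia"]

# Predict the salary for a new observation with 10 years of experience
new_data = pd.DataFrame({"Experiencia": [10]})
predicted = toy_model.predict(new_data).iloc[0]

print(intercept, slope, toy_model.rsquared, predicted)
```

The same attributes (`params`, `rsquared`, `predict`) should be available on `model_1`, which makes it easy to reuse the fitted values in later cells.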

### Model 2: Salary ~ Experience + Education_Years

In [7]:
model_2 = smf.ols('Salario ~ Experiencia + Años_Educación',data=data).fit()
print(model_2.summary())

                            OLS Regression Results                            
Dep. Variable:                Salario   R-squared:                       0.781
Model:                            OLS   Adj. R-squared:                  0.779
Method:                 Least Squares   F-statistic:                     350.8
Date:                Sun, 10 Aug 2025   Prob (F-statistic):           1.20e-65
Time:                        21:00:54   Log-Likelihood:                -2003.0
No. Observations:                 200   AIC:                             4012.
Df Residuals:                     197   BIC:                             4022.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       2.667e+04   1524.955     17.

### Model 3: Salary ~ Experience + Education_Years + Company_size (as factor)

In [8]:
model_3 = smf.ols('Salario ~ Experiencia + Años_Educación + C(Tamaño_Empresa)',data=data).fit()
print(model_3.summary())

                            OLS Regression Results                            
Dep. Variable:                Salario   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.809
Method:                 Least Squares   F-statistic:                     211.8
Date:                Sun, 10 Aug 2025   Prob (F-statistic):           8.57e-70
Time:                        21:03:30   Log-Likelihood:                -1987.2
No. Observations:                 200   AIC:                             3984.
Df Residuals:                     195   BIC:                             4001.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept       
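
The three summaries can be compared through adjusted R-squared and AIC, both of which penalize extra parameters. As a check on how they relate to the other reported numbers, this sketch recomputes them from the R-squared, log-likelihood, and Df Model values printed above, using the standard formulas for OLS with an intercept. Because the inputs are rounded, the results only match to the printed precision.

```python
# Values copied from the three summaries: (R-squared, log-likelihood, Df Model)
reported = {
    "model_1": (0.651, -2049.6, 1),
    "model_2": (0.781, -2003.0, 2),
    "model_3": (0.813, -1987.2, 4),
}
n = 200  # No. Observations

recomputed = {}
for name, (r2, loglik, df_model) in reported.items():
    # Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), p = number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - df_model - 1)
    # AIC = 2k - 2 * log-likelihood, k = predictors + intercept
    aic = 2 * (df_model + 1) - 2 * loglik
    recomputed[name] = (round(adj_r2, 3), round(aic, 1))
    print(name, recomputed[name])
```

Model 3 has the highest adjusted R-squared and the lowest AIC of the three, so the extra company-size terms improve the fit even after the penalty for additional parameters.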

# Machine Learning Part: Regression Models in Python
- We will implement several machine learning models at once with little code
- We prioritize scikit-learn (sklearn) because it makes machine learning easy to use
- We split the data into training and test sets (something we didn't do in Excel)

In [3]:
# Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np


Now we need to read the Excel file (in this case "Datos_Modelo.xlsx"). pandas does this with its function read_excel(file_path, sheet_name="sheet_name"). We then convert the categorical variable to the category type.

In [4]:
# read the file
data = pd.read_excel('Datos_Modelo.xlsx',sheet_name='Datos')

# convert the variable to the category dtype
data['Tamaño_Empresa']=data['Tamaño_Empresa'].astype('category')

Then we define the predictor and target variables.

In [5]:
# predictor variables
X = data.drop(columns='Salario')
X = pd.get_dummies(X, drop_first=True)  # one-hot encoding for categorical variables

# target variable
y = data['Salario']
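
As a small illustration of what `pd.get_dummies(..., drop_first=True)` produces, here is a sketch on a toy frame. The size labels are hypothetical, since the real categories of Tamaño_Empresa are not shown here.

```python
import pandas as pd

# Toy frame; the size labels are hypothetical
toy = pd.DataFrame({
    "Experiencia": [1, 5, 10],
    "Tamaño_Empresa": pd.Categorical(["Small", "Large", "Medium"]),
})

# Numeric columns pass through; the categorical column becomes indicator columns
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded)
```

With `drop_first=True` the first category ("Large" in this toy case) gets no column of its own and acts as the baseline. This avoids indicator columns that always sum to one, which would be redundant with the intercept in a linear regression.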

We need to split the data into a training set and a test set

In [6]:
# split the data into training data and testing data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=123)
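
To make the effect of `test_size=0.2` and `random_state=123` concrete, here is a sketch with placeholder arrays of 200 rows (the number of observations in the dataset). The array contents are made up and only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 200 rows; the values themselves are arbitrary
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.arange(200)

Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)
Xc, Xd, yc, yd = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

print(Xa.shape, Xb.shape)       # training rows and test rows
print(np.array_equal(Xb, Xd))   # the same seed reproduces the same split
```

With 200 observations, 160 rows go to training and 40 to testing. Fixing `random_state` keeps the split the same between runs, so the metrics of different models are computed on the same test rows.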

Now we initialize the models

In [7]:
# initializing the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=123),
    "Random Forest": RandomForestRegressor(random_state=123),
    "KNN": KNeighborsRegressor()
}

Then we need to evaluate the results

In [8]:
# Evaluating the models
results=[]

for name, model in models.items():
    # training the model
    model.fit(X_train,y_train)
    # testing the model
    prediction = model.predict(X_test)
    
    # r2, mae and rmse values
    r2 = r2_score(y_test,prediction)
    mae = mean_absolute_error(y_test,prediction)
    rmse = np.sqrt(mean_squared_error(y_test,prediction))
    
    results.append(
        {
            "Model": name,
            "R2": r2,
            "MAE":mae,
            "RMSE":rmse
        }
    )
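
To show what the three metrics measure, this sketch computes them by hand with NumPy on three made-up values and compares them with the sklearn functions used in the loop.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Made-up values, only to illustrate the formulas
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred

# MAE: average absolute error
mae_manual = np.mean(np.abs(errors))
# RMSE: square root of the average squared error
rmse_manual = np.sqrt(np.mean(errors ** 2))
# R2: 1 - (residual sum of squares / total sum of squares around the mean)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The same quantities from sklearn
mae_sk = mean_absolute_error(y_true, y_pred)
rmse_sk = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sk = r2_score(y_true, y_pred)

print(mae_manual, rmse_manual, r2_manual)
print(mae_sk, rmse_sk, r2_sk)
```

MAE and RMSE are in the units of the target (salary here), and RMSE weights large errors more heavily than MAE. R2 is unitless and compares the model with always predicting the mean.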

At the end, we show the results

In [9]:
# printing the results
df_results = pd.DataFrame(results)
print(df_results)

               Model        R2          MAE         RMSE
0  Linear Regression  0.938911  2725.021351  3203.692836
1      Decision Tree  0.801479  4912.576000  5775.292152
2      Random Forest  0.903859  3279.324056  4019.077296
3                KNN  0.889586  3480.826550  4307.073523
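
To read the table more easily, the models can be ranked by test error. This sketch rebuilds the table from the values printed above (so it runs on its own) and sorts it by RMSE. The name `df_demo` is only used here for illustration.

```python
import pandas as pd

# Values copied from the results table above
df_demo = pd.DataFrame({
    "Model": ["Linear Regression", "Decision Tree", "Random Forest", "KNN"],
    "R2": [0.938911, 0.801479, 0.903859, 0.889586],
    "MAE": [2725.021351, 4912.576000, 3279.324056, 3480.826550],
    "RMSE": [3203.692836, 5775.292152, 4019.077296, 4307.073523],
})

# Lowest RMSE first
ranked = df_demo.sort_values("RMSE").reset_index(drop=True)
print(ranked.round(2))
```

On this particular train/test split, linear regression has the highest R2 and the lowest MAE and RMSE, followed by random forest, KNN, and the decision tree. A different split or tuned hyperparameters could change this ordering.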
