# AI/ML/Data Science Salary Trends: Time Series Forecasting

**Project Lead:** Kwaku Boateng - Modeling Lead  
**Date:** November 10, 2025  
**Objective:** Build time series models (Prophet or ARIMA) to forecast average salaries for 2026-2027 by experience level.  
**Data Source:** salary_data_cleaned.csv (cleaned dataset with salaries from 2020-2025).  
**Models:** Prophet (primary) or ARIMA (alternative).  
**Deliverables:** Model code, evaluation metrics, forecasts, and plots.

In [1]:
# === PREVENT HUGE OUTPUTS ===
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 10)

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # display() is called at the end of this cell
from sklearn.metrics import mean_absolute_error  # For evaluation

# Load the cleaned data
df = pd.read_csv(r'I:\fakenews_AI4All\notebooks\data\processed\salary_data_cleaned.csv')


# Aggregate average salary by year and experience level
agg_df = df.groupby(['work_year', 'experience_level_full'])['salary_in_usd'].mean().reset_index()
agg_df.rename(columns={'work_year': 'year', 'experience_level_full': 'experience', 'salary_in_usd': 'avg_salary'}, inplace=True)

# Overall aggregate for reference
overall_agg = df.groupby('work_year')['salary_in_usd'].mean().reset_index()
overall_agg.rename(columns={'work_year': 'year', 'salary_in_usd': 'avg_salary'}, inplace=True)
overall_agg['experience'] = 'Overall'

# Combine for easier looping
all_data = pd.concat([agg_df, overall_agg], ignore_index=True)
all_data.sort_values(['experience', 'year'], inplace=True)

# Display aggregated data
display(all_data)

## Data Preparation
- Aggregated average USD salaries by year and experience level (Entry, Mid, Senior, Executive, plus Overall).
- Time series: 6 points per group (2020-2025).
- For modeling, 'year' can be converted to a datetime index (end-of-year) for proper time indexing. The ARIMA cell that follows keeps the integer year as the index instead.
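
The end-of-year datetime conversion mentioned in the list above is not shown in the notebook's cells, so the following is only a minimal sketch of one way it could be done. The helper name `to_year_end_index` and the toy values are hypothetical; the `year` and `avg_salary` column names follow the aggregation cell.

```python
import pandas as pd

def to_year_end_index(years):
    """Map integer years to year-end timestamps, e.g. 2020 -> 2020-12-31.

    Hypothetical helper, not part of the original notebook.
    freq="infer" asks pandas to detect the annual spacing, which needs
    at least three dates.
    """
    stamps = [f"{int(y)}-12-31" for y in years]
    return pd.DatetimeIndex(stamps, freq="infer")

# Toy values for illustration only (not taken from salary_data_cleaned.csv)
toy = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023],
    "avg_salary": [100_000.0, 105_000.0, 112_000.0, 118_000.0],
})
toy_series = toy.set_index(to_year_end_index(toy["year"]))["avg_salary"]
print(toy_series.index)
```

A series indexed this way carries explicit dates, so forecast steps can be labelled by date rather than by manually adding integers to the last year.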

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error
from IPython.display import display
import warnings

warnings.filterwarnings('ignore')

# Dictionary to store forecasts
forecasts_arima = {}
plots_arima = []

# Loop through each experience level
for exp in all_data['experience'].unique():
    group = all_data[all_data['experience'] == exp].copy()
    group.set_index('year', inplace=True)
    series = group['avg_salary']
    
    if len(series) < 3:
        print(f"Skipping {exp}: too few data points ({len(series)}) for ARIMA.")
        continue
    
    # Fit ARIMA (1,1,1)
    model = ARIMA(series, order=(1,1,1))
    model_fit = model.fit()
    print(f'\n{exp} - ARIMA Summary:\n{model_fit.summary()}')
    
    # Forecast next 2 years
    forecast_steps = 2
    forecast = model_fit.forecast(steps=forecast_steps)
    forecast_index = [series.index.max() + i for i in range(1, forecast_steps+1)]
    forecast_df = pd.DataFrame({'year': forecast_index, 'forecast': forecast.values})
    forecasts_arima[exp] = forecast_df
    
    # Format forecast table nicely
    forecast_df_formatted = forecast_df.copy()
    forecast_df_formatted['forecast'] = forecast_df_formatted['forecast'].apply(lambda x: f"${x:,.0f}")
    
    print(f'\nForecasted Salaries for {exp}:')
    display(forecast_df_formatted)
    
    # Simple evaluation: Hold-out MAE (if enough data)
    train = series[series.index < series.index.max()]
    test = series[series.index >= series.index.max()]
    
    if len(train) > 1 and len(test) > 0:
        model_eval = ARIMA(train, order=(1,1,1))
        model_eval_fit = model_eval.fit()
        pred = model_eval_fit.forecast(steps=1).iloc[0]
        mae = mean_absolute_error([test.iloc[0]], [pred])
        print(f'{exp} - Hold-out MAE: {mae:.2f}')
    
    # Plot
    plt.figure(figsize=(8,5))
    plt.plot(series.index, series.values, label='Historical', marker='o')
    plt.plot(forecast_index, forecast.values, label='Forecast', marker='x', linestyle='--')
    
    # Annotate forecasted points
    for x, y in zip(forecast_index, forecast.values):
        plt.text(x, y, f'{y:,.0f}', ha='center', va='bottom', fontsize=9, color='red')
    
    plt.title(f'ARIMA Salary Forecast for {exp}')
    plt.xlabel('Year')
    plt.ylabel('Average Salary (USD)')
    plt.legend()
    plt.grid(True)
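    plt.savefig(f'forecast_{exp.lower().replace(" ", "_")}.png')
    # Keep a handle before plt.show(): in notebooks, show() closes the figure,
    # so a later plt.gcf() would return a new, empty one.
    plots_arima.append(plt.gcf())
    plt.show()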


## ARIMA Model Notes
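- Used order (1,1,1): one autoregressive lag (p=1), first-order differencing (d=1) to remove the trend, and one moving-average term (q=1).
- Evaluation: one-step hold-out (train on 2020-2024, test on 2025). With a single test point, the reported MAE is simply the absolute error for 2025. AIC appears in each model summary as a measure of fit quality.
- No uncertainty bounds are reported (add if needed via `model_fit.get_forecast(steps=2).conf_int()`).
- Insights: Salaries show upward trends overall, but growth varies by level (e.g., Executive shows the highest growth).
- Limitations: Few data points (6 per series, 5 after differencing) for a model that estimates three parameters; ARIMA assumes stationarity after differencing.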

## Final Deliverables
- **Models:** ARIMA(1,1,1) fitted per experience level and overall. No Prophet model is fitted in the cells above.
- **Forecasts:** For 2026-2027 (see tables above).
- **Plots:** Historical trends + forecasts (saved as PNGs).
- **Table of Actual vs. Predicted:** (Add if needed, e.g., compare hold-out predictions).
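
The actual-vs-predicted table listed above as optional could be assembled along the following lines. This is a sketch under stated assumptions: `holdout_row` and `naive_forecast` are hypothetical helpers, the group names and values are toy data, and the last-value forecaster is only a stand-in so the example runs without fitting a model.

```python
import pandas as pd

def naive_forecast(train):
    """Placeholder forecaster (assumption): repeat the last observed value."""
    return float(train.iloc[-1])

def holdout_row(name, series, forecaster):
    """Hold out the final year and compare it with a one-step forecast."""
    train, actual = series.iloc[:-1], float(series.iloc[-1])
    predicted = forecaster(train)
    return {
        "experience": name,
        "year": int(series.index[-1]),
        "actual": actual,
        "predicted": predicted,
        "abs_error": abs(actual - predicted),
    }

# Toy data for illustration only (not values from the dataset)
toy_groups = {
    "Group A": pd.Series([100.0, 110.0, 125.0], index=[2023, 2024, 2025]),
    "Group B": pd.Series([200.0, 190.0, 210.0], index=[2023, 2024, 2025]),
}
comparison = pd.DataFrame(
    [holdout_row(name, s, naive_forecast) for name, s in toy_groups.items()]
)
print(comparison)
```

To reproduce the notebook's hold-out check, the forecaster argument could instead wrap the ARIMA fit, for example a function returning `ARIMA(train, order=(1,1,1)).fit().forecast(steps=1).iloc[0]`. Keeping the naive row next to it gives a baseline for judging whether ARIMA adds anything on so short a series.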

## Discussion
- Trends: Salaries increased ~40% overall from 2020 to 2025, likely driven by demand in AI/ML.
- Forecasts: Expect stabilization or slight growth in 2026-2027.
- Biases: The data is US-heavy and self-reported, with limited coverage of freelance/contract roles.
- Next Steps: Incorporate more data (e.g., regions) or more advanced models (e.g., LSTM, if the time series grows longer).
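
To put the roughly 40% overall increase cited in the Trends bullet on a per-year footing, it can be converted to a compound annual growth rate over the five year-over-year intervals between 2020 and 2025. The sketch below uses hypothetical start and end values chosen only to give 40% total growth; the function names are illustrative.

```python
def total_growth(start, end):
    """Fractional change from start to end."""
    return end / start - 1.0

def cagr(start, end, periods):
    """Compound annual growth rate over the given number of yearly intervals."""
    return (end / start) ** (1.0 / periods) - 1.0

# Hypothetical values giving 40% total growth (not taken from the dataset)
start, end = 100.0, 140.0
print(f"total growth: {total_growth(start, end):.1%}")
print(f"annualized:   {cagr(start, end, 5):.2%}")
```

A 40% rise over five intervals works out to roughly 7% per year, which gives a reference point for judging whether the 2026-2027 forecasts imply a slowdown.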