<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
   
# Data Science Salary - Tableau Data Preparation </div>

Previously, data transformations were performed ad hoc during the analysis, as new hypotheses and insights emerged. Now I am moving data preparation into a separate notebook focused exclusively on creating the final datasets for the Tableau dashboard. This allows a systematic, sequential approach: adding the necessary calculated fields, applying targeted transformations, and shaping the data for the dashboard's interactive visualizations.

The goal is to create two CSV files, each tailored to a specific analytical task of the dashboard. Keeping all preparation steps in one place makes the pipeline reproducible, keeps the project structure clean, and moves label mapping and classification out of Tableau so that the dashboard works with ready-to-use fields.

**Target outputs**: 2 clean CSV files ready for Tableau import

<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
   
## Dataset Preparation Strategy</div>

I will create two specialized datasets for our Tableau dashboard:

| **Dataset** | **Purpose** | **Key Features** |
|-------------|-------------|-----------------| 
| `tableau_main_dataset.csv` | Core analysis with economic indicators | Salary data + GDP, population, economic metrics + AI classification |
| `tableau_wikipedia_trends.csv` | AI/ML search trends over time | Wikipedia page views aggregated by year with 2025 extrapolation |

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Importing required libraries</div>

In [1]:
import pandas as pd
import numpy as np
import warnings
import os

warnings.filterwarnings('ignore')

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 1: Data loading</div>

In [2]:
# Load the salary and World Bank datasets
salary_df = pd.read_csv('cleaned_salaries.csv')
worldbank_df = pd.read_csv('cleaned_worldbank.csv') 

print('\nDataset shapes:\n')
print(f'Salary data: {salary_df.shape}')
print(f'World Bank data: {worldbank_df.shape}')


Dataset shapes:

Salary data: (66056, 9)
World Bank data: (564, 7)


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 1: Data integration with economic indicators</div>

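I join the salary records to the World Bank indicators on country code (`company_location`) and year (`work_year`). A left join keeps every salary row, even when no matching indicators exist for that country and year.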

In [3]:
# Merge salary data with World Bank indicators
df = salary_df.merge(
    worldbank_df, 
    left_on=['company_location', 'work_year'], 
    right_on=['country_code', 'year'], 
    how='left'
)

# Remove duplicate columns
df = df.drop(['country_code', 'year'], axis=1)

print(f'\nIntegrated dataset shape: {df.shape}\n')
df.head()


Integrated dataset shape: (66056, 14)



| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | country_name | value_population | value_gdp_per_capita | value_education | value_internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | MI | FT | Data Scientist | 132600.0 | US | 100 | US | M | United States | 341454306.1 | 92481.07 | 39.16 | 94.22 |
| 1 | 2025 | MI | FT | Data Scientist | 102000.0 | US | 100 | US | M | United States | 341454306.1 | 92481.07 | 39.16 | 94.22 |
| 2 | 2025 | SE | FT | Data Product Manager | 260520.0 | US | 0 | US | M | United States | 341454306.1 | 92481.07 | 39.16 | 94.22 |
| 3 | 2025 | SE | FT | Data Product Manager | 140280.0 | US | 0 | US | M | United States | 341454306.1 | 92481.07 | 39.16 | 94.22 |
| 4 | 2025 | SE | FT | Machine Learning Engineer | 215000.0 | US | 0 | US | M | United States | 341454306.1 | 92481.07 | 39.16 | 94.22 |
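
A left merge keeps every salary row, but rows whose country and year have no World Bank match silently receive NaN indicators. The sketch below uses small made-up frames (not the project data) to show one way to count such rows with the `indicator` and `validate` parameters of `DataFrame.merge`. The names `salary_toy` and `worldbank_toy` are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins for salary_df and worldbank_df (values are illustrative only)
salary_toy = pd.DataFrame({
    'company_location': ['US', 'US', 'XX'],
    'work_year': [2024, 2025, 2025],
    'salary_in_usd': [100000, 120000, 90000],
})
worldbank_toy = pd.DataFrame({
    'country_code': ['US', 'US'],
    'year': [2024, 2025],
    'value_gdp_per_capita': [85000.0, 92000.0],
})

merged = salary_toy.merge(
    worldbank_toy,
    left_on=['company_location', 'work_year'],
    right_on=['country_code', 'year'],
    how='left',
    validate='many_to_one',  # raises MergeError if a country-year repeats on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Salary rows that found no World Bank counterpart
unmatched = merged[merged['_merge'] == 'left_only']
print(f'Rows without World Bank data: {len(unmatched)}')
print(unmatched[['company_location', 'work_year']].drop_duplicates())
```

If a check like this were applied to the real merge, the `_merge` column would need to be dropped before export.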


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">
    
### Dataset 1: Converting categorical codes to descriptive labels</div>

To ensure clear and professional Tableau visualizations, I replace abbreviated categorical codes with their full descriptive names directly in the dataset. This eliminates the need for repetitive mapping operations in Tableau and provides immediate data readability for dashboard creation.

This preprocessing approach creates self-documenting data that improves both Tableau workflow efficiency and dashboard presentation quality.

In [4]:
# Replace abbreviated codes with full descriptive names
df['company_size'] = df['company_size'].map({
    'S': 'Small', 'M': 'Medium', 'L': 'Large'
})

df['experience_level'] = df['experience_level'].map({
    'EN': 'Entry', 'MI': 'Mid', 'SE': 'Senior', 'EX': 'Executive'
})

df['employment_type'] = df['employment_type'].map({
    'FT': 'Full-time', 'PT': 'Part-time', 'CT': 'Contract', 'FL': 'Freelance'
})

df['remote_ratio'] = df['remote_ratio'].map({
    0: 'On-site', 50: 'Hybrid', 100: 'Fully Remote'
})

print('\nCategorical columns updated with full descriptive names!')
print(f'\nDataset shape: {df.shape}')


Categorical columns updated with full descriptive names!

Dataset shape: (66056, 14)
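
One caveat with `Series.map` and a dictionary: any code missing from the dictionary becomes NaN without warning. The following sketch, on a made-up column that includes a hypothetical extra code `'XL'`, shows one way to detect such codes before overwriting the column. It is illustrative only and makes no claim about codes present in the real data.

```python
import pandas as pd

# Toy column with one code ('XL') that the mapping does not cover
sizes = pd.Series(['S', 'M', 'L', 'XL'], name='company_size')
size_labels = {'S': 'Small', 'M': 'Medium', 'L': 'Large'}

mapped = sizes.map(size_labels)

# Codes that were present before mapping but became NaN afterwards
unmapped_codes = sizes[mapped.isna() & sizes.notna()].unique().tolist()
print(f'Unmapped codes: {unmapped_codes}')

# One possible fallback: keep the original code where no label exists
mapped_with_fallback = mapped.fillna(sizes)
print(mapped_with_fallback.tolist())
```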


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 1: AI role classification</div>

I add a `role_category` column that separates AI-specific positions from traditional Data Science roles. It enables filtering for the ChatGPT impact analysis and the dashboard visualizations.

In [5]:
# Classifies job titles as AI-related based on keywords
def is_ai_profession(job_title):
    ai_keywords = ['ai ', 'machine learning', 'deep learning', 'computer vision', 'nlp', 'llm', 'genai']
    if any(keyword in job_title.lower() for keyword in ai_keywords):
        return 'AI Role'
    else:
        return 'Traditional DS'

In [6]:
# Add AI role classification
df['role_category'] = df['job_title'].apply(is_ai_profession)

print(f'AI roles identified: {(df["role_category"] == "AI Role").sum()} ({(df["role_category"] == "AI Role").mean()*100:.1f}%)')

AI roles identified: 5955 (9.0%)
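
Because the keyword `'ai '` requires a trailing space, a title that ends in "AI" (for example "Head of AI") is not matched by the function above. A possible variant, shown here only as a sketch, uses a regular expression with word boundaries and also tolerates missing titles. The name `classify_role` is hypothetical. The multi-letter keywords are also matched as whole words here, so counts from this variant may differ from the figure reported above.

```python
import re
import pandas as pd

# Same keyword list as above, but every keyword is matched as a whole word
AI_PATTERN = re.compile(
    r'\b(?:ai|machine learning|deep learning|computer vision|nlp|llm|genai)\b',
    flags=re.IGNORECASE,
)

def classify_role(job_title):
    """Hypothetical variant of is_ai_profession using word-boundary matching."""
    if isinstance(job_title, str) and AI_PATTERN.search(job_title):
        return 'AI Role'
    return 'Traditional DS'

# Made-up titles, including one missing value
titles = pd.Series(['AI Engineer', 'Head of AI', 'Data Scientist', 'NLP Researcher', None])
labels = titles.apply(classify_role).tolist()
print(labels)
```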


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 1: Saving main analytical data with economic indicators</div>

I have assembled the main dataset combining salary data, World Bank economic indicators, and AI role classification. Now I am saving this file, which will serve as the foundation for geographic analysis, company size comparisons, and the Tableau visualizations.

In [7]:
# Save main dataset for Tableau
df.to_csv('tableau_main_dataset.csv', index=True)

print(f'\nMain dataset saved: {df.shape}')


Main dataset saved: (66056, 15)


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 2: Wikipedia AI trends data</div>

I prepare Wikipedia page view trends to compare AI interest with salary growth in our dual-axis Tableau visualizations. Daily page views are aggregated into yearly totals per keyword.

In [8]:
# Load Wikipedia trends data
wikipedia_df = pd.read_csv('wikipedia_data_complete.csv')

print(f'\nWikipedia data loaded: {wikipedia_df.shape}\n')
wikipedia_df.head()


Wikipedia data loaded: (10035, 5)



| | date | keyword | views | category | period |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | chatgpt | 0 | AI | before |
| 1 | 2020-01-01 | ai | 6572 | AI | before |
| 2 | 2020-01-01 | ml | 2867 | AI | before |
| 3 | 2020-01-01 | dl | 1672 | AI | before |
| 4 | 2020-01-01 | python | 4685 | Programming | before |


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 2: Extrapolating incomplete 2025 Wikipedia data</div>

Since our Wikipedia dataset contains only partial 2025 data, I extrapolate annual figures using daily averages to maintain statistical comparability with complete years for accurate before/after ChatGPT analysis.

In [9]:
# Convert date column and check 2025 completeness
wikipedia_df['date'] = pd.to_datetime(wikipedia_df['date'])
wiki_2025 = wikipedia_df[wikipedia_df['date'].dt.year == 2025]

print(f'2025 data: from {wiki_2025["date"].min()} to {wiki_2025["date"].max()}')
days_available = wiki_2025['date'].nunique()
print(f'Available days: {days_available}')

# Extrapolate to full year
daily_avg = wiki_2025.groupby('keyword')['views'].sum() / days_available
wiki_2025_adjusted = daily_avg * 365

print(f'\nExtrapolated 2025 annual views:')
print(wiki_2025_adjusted.round(0))

2025 data: from 2025-01-01 00:00:00 to 2025-06-29 00:00:00
Available days: 180

Extrapolated 2025 annual views:
keyword
ai          8678925.0
chatgpt    44811520.0
dl           824304.0
ml          1944116.0
python      4110748.0
Name: views, dtype: float64
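
The cell above divides every keyword's total by the same `days_available`, which is appropriate when all keywords cover the same days. If coverage could differ between keywords, the day count can be computed per keyword instead. The sketch below shows this on made-up values; it is an assumption-labeled alternative, not the calculation used in this notebook. Either way, the linear scaling assumes the rest of the year resembles the observed part.

```python
import pandas as pd

# Toy daily data: 'ai' has 4 days, 'chatgpt' only 2 (illustrative values)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04',
                            '2025-01-01', '2025-01-02']),
    'keyword': ['ai', 'ai', 'ai', 'ai', 'chatgpt', 'chatgpt'],
    'views': [100, 200, 300, 400, 1000, 3000],
})

# Total views and number of distinct days, computed separately per keyword
per_keyword = toy.groupby('keyword').agg(
    total_views=('views', 'sum'),
    days=('date', 'nunique'),
)

# Daily average per keyword, scaled to a 365-day year
per_keyword['annual_estimate'] = per_keyword['total_views'] / per_keyword['days'] * 365
print(per_keyword)
```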


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 2: Creating complete annual dataset</div>

Building comprehensive yearly data combining complete years (2020-2024) with extrapolated 2025 figures for accurate trend analysis.

In [13]:
# Exclude partial 2025 data and aggregate complete years
wiki_without_2025 = wikipedia_df[wikipedia_df['date'].dt.year != 2025]
wiki_tableau = wiki_without_2025.groupby([wiki_without_2025['date'].dt.year, 'keyword'])['views'].sum().reset_index()
wiki_tableau.columns = ['year', 'keyword', 'annual_views']

# Add extrapolated 2025 data
for term in ['ai', 'ml', 'dl', 'python', 'chatgpt']:
    if term in wiki_2025_adjusted.index:
        new_row = pd.DataFrame({
            'year': [2025],
            'keyword': [term], 
            'annual_views': [int(wiki_2025_adjusted[term])]
        })
        wiki_tableau = pd.concat([wiki_tableau, new_row], ignore_index=True)

print('\nSample of final data:\n')
wiki_tableau.head()


Sample of final data:



| | year | keyword | annual_views |
|---|---|---|---|
| 0 | 2020 | ai | 3429230 |
| 1 | 2020 | chatgpt | 0 |
| 2 | 2020 | dl | 972118 |
| 3 | 2020 | ml | 1799644 |
| 4 | 2020 | python | 3434807 |
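
The loop above appends one row per keyword with repeated `pd.concat` calls. An equivalent result can be built in a single step by turning the extrapolated Series into a DataFrame, as sketched below on toy stand-ins (`complete_years` and `extrapolated` are hypothetical names with made-up values). Note that this sketch rounds to the nearest integer, whereas `int()` in the loop truncates.

```python
import pandas as pd

# Toy stand-ins (illustrative values) for wiki_tableau and wiki_2025_adjusted
complete_years = pd.DataFrame({
    'year': [2024, 2024],
    'keyword': ['ai', 'ml'],
    'annual_views': [500, 300],
})
extrapolated = pd.Series({'ai': 620.7, 'ml': 310.2}, name='views')
extrapolated.index.name = 'keyword'

# Convert the Series to rows with the same three columns as complete_years
rows_2025 = (
    extrapolated.round()
    .astype(int)
    .rename('annual_views')
    .reset_index()
    .assign(year=2025)[['year', 'keyword', 'annual_views']]
)

# Single concat instead of one concat per keyword
combined = pd.concat([complete_years, rows_2025], ignore_index=True)
print(combined)
```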


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Dataset 2: Saving Wikipedia AI trends data</div>

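I have processed and aggregated Wikipedia page view data for AI-related keywords, including extrapolation of the incomplete 2025 data so that it is comparable with complete years. This dataset will enable dual-axis visualizations in Tableau that compare AI page view interest with salary trends over time. The same cell also re-saves the main dataset with an explicit `ID` column as its first field.

In [11]:
# Create the ID column explicitly
df['ID'] = range(len(df))

# Save the main dataset with ID as the first column
df = df[['ID'] + [col for col in df.columns if col != 'ID']]
df.to_csv('tableau_main_dataset.csv', index=False)

# Save the Wikipedia trends dataset prepared above
wiki_tableau.to_csv('tableau_wikipedia_trends.csv', index=False)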
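
Before importing a file into Tableau, it can help to read it back and confirm the column layout. The sketch below does this in memory on a made-up frame, writing to a `StringIO` buffer rather than a real file. It checks that `ID` comes first and is unique, and that no stray `Unnamed` index column was written.

```python
import io
import pandas as pd

# Toy frame standing in for the main dataset (illustrative values)
toy = pd.DataFrame({
    'job_title': ['Data Scientist', 'AI Engineer'],
    'salary_in_usd': [100000.0, 150000.0],
})
toy.insert(0, 'ID', range(len(toy)))

# Write to an in-memory buffer and read it back, as Tableau would see the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

assert reloaded.columns[0] == 'ID'
assert reloaded['ID'].is_unique
assert not any(col.startswith('Unnamed') for col in reloaded.columns)
print(reloaded.columns.tolist())
```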

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Data preparation summary</div>

Successfully prepared two specialized datasets for our comprehensive Tableau dashboard analyzing Data Science salary trends and AI revolution impact.

**Work completed:**
- Merged salary data with World Bank economic indicators by country and year
- Transformed categorical codes into descriptive labels for improved dashboard readability
- Added AI role classification to enable filtering between traditional Data Science and AI-specific positions
- Created a complete annual Wikipedia trends dataset with extrapolated 2025 data for time-series analysis
- Kept a year column in both datasets (`work_year` and `year`) so that salary and page view trends can be related in Tableau

**Results achieved:**
- **tableau_main_dataset.csv**: Comprehensive dataset with salary data, economic indicators, and AI classification ready for geographic analysis and company comparisons
- **tableau_wikipedia_trends.csv**: Annual page view trends for AI keywords enabling correlation analysis with salary growth

Both datasets are optimized for Tableau performance and ready for professional business intelligence analysis. The ChatGPT impact analysis will be performed directly in Tableau using calculated fields from the main dataset.