## EDA - Dataset 11 - jobs_and_salaries_in_data_science

### 1. Packages & Settings

In [7]:
# Core libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Advanced analysis
from scipy import stats
from ydata_profiling import ProfileReport

# Interactive tables (optional)
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)

# Configuration
pd.set_option('display.max_columns', 30)
sns.set_theme(style='whitegrid')
%config InlineBackend.figure_format = 'retina'
np.random.seed(42)  # Reproducibility

### 2. Importing Data

In [8]:
# Data Loading
df = pd.read_csv(r"E:\Work\DataSciencePortfolio\0_Data\0.11_jobs_and_salaries_in_data_science\0.11.1_Raw\jobs_in_data.csv")

### 3.1 First Look

In [9]:
# Sample data
show(df.sample(5))  # Random rows to avoid bias

# Summary of the DataFrame
df.info()

# Basic Statistics of the DataFrame
df.describe()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9355 entries, 0 to 9354
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           9355 non-null   int64 
 1   job_title           9355 non-null   object
 2   job_category        9355 non-null   object
 3   salary_currency     9355 non-null   object
 4   salary              9355 non-null   int64 
 5   salary_in_usd       9355 non-null   int64 
 6   employee_residence  9355 non-null   object
 7   experience_level    9355 non-null   object
 8   employment_type     9355 non-null   object
 9   work_setting        9355 non-null   object
 10  company_location    9355 non-null   object
 11  company_size        9355 non-null   object
dtypes: int64(3), object(9)
memory usage: 877.2+ KB




### 3.2 Results of First Look:
##### Observations:
- Small DataFrame (9,355 rows × 12 columns, about 877 KB in memory).
- No missing values: every column has 9,355 non-null entries, so no null handling is needed.
- Data types match the column contents (3 `int64`, 9 `object`), so no type corrections are needed.
##### Potential Problems:
- *None*
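
`df.info()` confirms non-null counts and dtypes, but it does not report duplicate rows or how many distinct values each column holds. A minimal sketch of those extra checks follows, run on a small made-up frame rather than the real `df`. The helper name `basic_quality_checks` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def basic_quality_checks(frame: pd.DataFrame) -> dict:
    """Collect a few sanity checks that df.info() does not report."""
    return {
        # Total number of missing cells across all columns
        "null_cells": int(frame.isna().sum().sum()),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Distinct values per column (useful for spotting categoricals)
        "unique_per_column": frame.nunique().to_dict(),
    }


# Toy frame with made-up values, standing in for the real df
toy = pd.DataFrame({
    "work_year": [2023, 2023, 2022],
    "experience_level": ["Senior", "Senior", "Entry-level"],
    "salary_in_usd": [150000, 150000, 60000],
})

report = basic_quality_checks(toy)
print(report)
```

On the real data, `basic_quality_checks(df)` would return the same three entries, with one cardinality value for each of the 12 columns.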

### 4.1 Automated Analysis with ydata_profiling

In [10]:
# profile = ProfileReport(df, title="Automated EDA", explorative=True)
# profile.to_notebook_iframe()
# Save to HTML for later review (optional)
# profile.to_file("5.1.11b_automated_eda_report.html")


### 4.2 Automated EDA Results
##### Observations
- Clean dataset

##### Hypotheses:
- `company_size` correlates with most other columns. **Observe.**
- Higher `experience_level` goes with higher salary, which is plausible.
- `salary_currency` correlates with `work_setting`. The reason is unclear. **Investigate.**
- `work_year` also correlates with `work_setting`.
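
Several of these hypotheses involve pairs of categorical columns (for example `salary_currency` and `work_setting`), where Pearson correlation does not apply. One common way to quantify such an association is Cramér's V, derived from the chi-squared statistic of the contingency table. The sketch below is an assumption about how this could be checked, not the method the profiling report used. The helper `cramers_v` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    k = min(table.shape) - 1
    if k == 0:
        # A column with a single category carries no association information
        return float("nan")
    # correction=False disables the Yates continuity correction for 2x2 tables
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * k)))


# Toy frame with made-up values, standing in for the real df
toy = pd.DataFrame({
    "salary_currency": ["USD", "USD", "EUR", "EUR"],
    "work_setting": ["Remote", "Remote", "In-person", "In-person"],
    "company_size": ["M", "L", "M", "L"],
})

# Currency fully determines work setting in the toy data
v_perfect = cramers_v(toy["salary_currency"], toy["work_setting"])
# Currency and company size are independent in the toy data
v_none = cramers_v(toy["salary_currency"], toy["company_size"])
print(v_perfect, v_none)
```

On the real data, a call such as `cramers_v(df["salary_currency"], df["work_setting"])` would give a single number to compare across column pairs. Note that Cramér's V measures strength of association only and says nothing about its cause.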

### 5. Necessary Cleaning & Transformation Steps for Python Scripts
##### Cleaning:
- *None*
##### Transformation:
- Investigate correlations.
- Observe specific, interesting job titles.

### 6. Hypothesis-Driven Analysis

In [11]:
# Check hypotheses in 4.2
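
As a possible starting point for the experience-versus-salary hypothesis, the sketch below compares median `salary_in_usd` per `experience_level` and runs a Kruskal-Wallis test, a rank-based test that does not assume normally distributed salaries. It uses a small made-up frame. The level labels and salary values are assumptions, since the actual category values are not shown above.

```python
import pandas as pd
from scipy import stats

# Toy frame with made-up values; the level labels are assumed, not taken from the data
toy = pd.DataFrame({
    "experience_level": ["Entry-level"] * 3 + ["Mid-level"] * 3 + ["Senior"] * 3,
    "salary_in_usd": [50000, 60000, 70000,
                      80000, 90000, 100000,
                      120000, 130000, 140000],
})

# Order the levels explicitly; alphabetical order is not seniority order in general
level_order = ["Entry-level", "Mid-level", "Senior"]
medians = (
    toy.groupby("experience_level")["salary_in_usd"]
    .median()
    .reindex(level_order)
)

# Kruskal-Wallis: do the salary distributions differ between levels?
groups = [
    toy.loc[toy["experience_level"] == level, "salary_in_usd"].to_numpy()
    for level in level_order
]
h_stat, p_value = stats.kruskal(*groups)

print(medians)
print(h_stat, p_value)
```

A small p-value only indicates that at least one level differs from the others. The ordered medians are what show whether salary actually rises with experience.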

### 7. Focused Investigation
*When to use*:  
- Drill into subgroups (e.g., "Why do users aged 30-40 have higher churn?")  
- Export specific slices for stakeholder reviews  
*Industry Standard*: Never explore blindly – start with hypotheses from Sections 4-5.

In [12]:
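# Check how the hypotheses from Sections 3-5 can be explained by the data
# (example kept disabled; the dataset has no "Income" column, so it filters on salary_in_usd)
"""
show(
    df.query("salary_in_usd > 70000"),
    column_filters="footer",
    buttons=["copy", "csv"],
    scrollY="300px",
    classes="compact"
)
"""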
