# 01. Data Loading

## Goal
Load the raw dataset and run an initial inspection to understand its structure and columns and to verify that it loaded correctly.

## Steps
1. Import necessary libraries (pandas, os).
2. Define the file path.
3. Load the CSV file into a DataFrame.
4. Display the first few rows (head).
5. Check data types and missing values (info).
6. Check dataset shape and basic statistics.

In [1]:
import pandas as pd
import os

### 1. Load Dataset
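We will load the dataset from the `data/raw/` directory (`../data/raw/` relative to this notebook).  
**Note:** Ensure you have downloaded the dataset from Kaggle and saved it as `data_science_salaries.csv` in the `data/raw` folder. The file name must match the one used in `file_path`.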

In [3]:
# Define file path
file_path = "../data/raw/data_science_salaries.csv"

# Stop here if the file is missing, so later cells do not fail with a NameError on `df`
if not os.path.exists(file_path):
    raise FileNotFoundError(
        f"❌ File not found at {file_path}. "
        "Please download the dataset from Kaggle and place it in the 'data/raw' folder."
    )

# Load data
df = pd.read_csv(file_path)
print("✅ Dataset loaded successfully!")

✅ Dataset loaded successfully!


### 2. Inspect Data Structure

In [4]:
# Display first 5 rows
df.head()

| | job_title | experience_level | employment_type | work_models | work_year | employee_residence | salary | salary_currency | salary_in_usd | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Engineer | Mid-level | Full-time | Remote | 2024 | United States | 148100 | USD | 148100 | United States | Medium |
| 1 | Data Engineer | Mid-level | Full-time | Remote | 2024 | United States | 98700 | USD | 98700 | United States | Medium |
| 2 | Data Scientist | Senior-level | Full-time | Remote | 2024 | United States | 140032 | USD | 140032 | United States | Medium |
| 3 | Data Scientist | Senior-level | Full-time | Remote | 2024 | United States | 100022 | USD | 100022 | United States | Medium |
| 4 | BI Developer | Mid-level | Full-time | On-site | 2024 | United States | 120000 | USD | 120000 | United States | Medium |


In [5]:
# Display dataset info (Columns, Types, Non-null counts)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6599 entries, 0 to 6598
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   job_title           6599 non-null   object
 1   experience_level    6599 non-null   object
 2   employment_type     6599 non-null   object
 3   work_models         6599 non-null   object
 4   work_year           6599 non-null   int64 
 5   employee_residence  6599 non-null   object
 6   salary              6599 non-null   int64 
 7   salary_currency     6599 non-null   object
 8   salary_in_usd       6599 non-null   int64 
 9   company_location    6599 non-null   object
 10  company_size        6599 non-null   object
dtypes: int64(3), object(8)
memory usage: 567.2+ KB
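
`df.info()` reports 6599 non-null entries in every column, so nothing is missing in this dataset. A more direct check, together with a count of fully duplicated rows (which `info()` does not report), might look like the sketch below. It runs on a small made-up `sample` frame so that it is self-contained; the planted missing value and duplicate row are assumptions for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical stand-in for `df`: one missing value and one duplicated row are planted on purpose.
sample = pd.DataFrame(
    {
        "job_title": ["Data Engineer", "Data Scientist", None, "Data Engineer"],
        "salary_in_usd": [148100, 140032, 120000, 148100],
    }
)

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

In the notebook itself, the same two calls would be applied to `df` instead of `sample`.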


In [6]:
# Check shape (Rows, Columns)
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

Rows: 6599
Columns: 11
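
Since one goal of this notebook is to verify that the data loaded correctly, the column list from `df.info()` can be turned into an automated check. The sketch below is one possible way to do that; `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here, and the `broken` frame is an artificial example with one column removed.

```python
import pandas as pd

# Column names as listed in the df.info() output
EXPECTED_COLUMNS = [
    "job_title", "experience_level", "employment_type", "work_models",
    "work_year", "employee_residence", "salary", "salary_currency",
    "salary_in_usd", "company_location", "company_size",
]


def check_schema(frame, expected_columns):
    """Return (missing, unexpected) column names relative to the expected list."""
    actual = set(frame.columns)
    expected = set(expected_columns)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    return missing, unexpected


# Artificial empty frame with one column dropped, to show what a failed check reports
broken = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "salary_in_usd"])
missing, unexpected = check_schema(broken, EXPECTED_COLUMNS)
print(missing, unexpected)
```

Calling `check_schema(df, EXPECTED_COLUMNS)` on the loaded dataset should return two empty lists if the file matches the structure shown above.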


### 3. Basic Statistics
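View summary statistics for numerical columns. Note that `salary` is expressed in each row's `salary_currency`, so its mean and maximum mix currencies; `salary_in_usd` is the comparable column.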

In [7]:
df.describe()

| | work_year | salary | salary_in_usd |
|---|---|---|---|
| count | 6599.0 | 6599.0 | 6599.0 |
| mean | 2022.818457 | 179283.3 | 145560.558569 |
| std | 0.674809 | 526372.2 | 70946.83807 |
| min | 2020.0 | 14000.0 | 15000.0 |
| 25% | 2023.0 | 96000.0 | 95000.0 |
| 50% | 2023.0 | 140000.0 | 138666.0 |
| 75% | 2023.0 | 187500.0 | 185000.0 |
| max | 2024.0 | 30400000.0 | 750000.0 |
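
`df.describe()` only summarizes the three integer columns, leaving the eight text columns uninspected. One way to get a quick overview of those is sketched below. It uses a small `sample` frame rebuilt from the five rows shown in the `head()` output (three columns only) so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# First five rows from the head() output, restricted to three columns
sample = pd.DataFrame(
    {
        "job_title": [
            "Data Engineer", "Data Engineer", "Data Scientist",
            "Data Scientist", "BI Developer",
        ],
        "experience_level": [
            "Mid-level", "Mid-level", "Senior-level", "Senior-level", "Mid-level",
        ],
        "salary_in_usd": [148100, 98700, 140032, 100022, 120000],
    }
)

# Keep only the non-numeric columns and count distinct values in each
text_columns = sample.select_dtypes(exclude="number")
unique_counts = text_columns.nunique()

# Frequency of each category in a single column
level_counts = sample["experience_level"].value_counts()

print(unique_counts)
print(level_counts)
```

On the full dataset, the same calls on `df` would show how many distinct job titles, locations, and so on are present, which helps decide which columns to treat as categories in later steps.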
