### 📌 Libraries Used

This project uses the following Python libraries:

- **ast**: converts skills stored as strings into Python lists for analysis  
- **pandas**: handles data cleaning, manipulation, and analysis  
- **seaborn / matplotlib**: create visualizations and charts  
- **datasets (HuggingFace)**: loads the dataset directly from HuggingFace using `load_dataset()`


In [1]:
# Importing Libraries
import ast
import pandas as pd
import seaborn as sns
from datasets import load_dataset
import matplotlib.pyplot as plt 



## 📥 Loading the Dataset

In this step, I load the **data_jobs** dataset directly from HuggingFace using `load_dataset()`.  
The dataset is then converted into a Pandas DataFrame to make it easier to clean, explore, and visualize throughout the analysis.


In [2]:
# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

## 📌 Dataset Inspection & Cleanup

Before starting the analysis, I first inspect the dataset structure using `df.info()`.  
This provides a quick overview of:

- Total number of rows and columns  
- Data types of each column  
- Missing values across the dataset  
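As a complement to `df.info()`, missing values can also be counted per column directly. The following is a minimal sketch on a small, made-up DataFrame: the sample values are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the real dataset
sample = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Scientist"],
    "salary_year_avg": [90000.0, None, None],
    "job_country": ["United States", None, "Germany"],
})

# Count missing values per column, largest first
missing_counts = sample.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Share of missing values per column, as a percentage
missing_pct = (sample.isna().mean() * 100).round(1)
print(missing_pct)
```

Running the same two expressions on `df` instead of `sample` should give a view that is easier to sort and filter than the printed `df.info()` summary.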

After understanding the dataset format, I apply essential cleanup steps:

1. **Convert `job_posted_date` to datetime format**  
   This makes it possible to perform time-based analysis such as monthly or yearly trends.

2. **Convert `job_skills` into Python list format**  
   The skills column is stored as a string representation of a list, so I use `ast.literal_eval()` to transform it into real lists for accurate skill analysis.
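To illustrate what these two cleanup steps do, here is a minimal sketch on a tiny, made-up DataFrame. The sample rows and the `parse_skills` helper are hypothetical and used only for illustration; the column names come from the dataset.

```python
import ast
import pandas as pd

# Hypothetical sample rows; only the column names mirror the real dataset
sample = pd.DataFrame({
    "job_posted_date": ["2023-06-16 13:44:15", "2023-01-14 13:18:07"],
    "job_skills": ["['sql', 'python']", None],
})

# Step 1: parse the date strings into datetime values
sample["job_posted_date"] = pd.to_datetime(sample["job_posted_date"])


# Step 2: parse stringified lists, leaving missing values untouched
def parse_skills(value):
    """Hypothetical helper: turn "['a', 'b']" into ['a', 'b']."""
    return ast.literal_eval(value) if isinstance(value, str) else value


sample["job_skills"] = sample["job_skills"].apply(parse_skills)

# Datetime values expose the .dt accessor for time-based grouping
print(sample["job_posted_date"].dt.month.tolist())  # [6, 1]

# The first row now holds a real Python list rather than a string
print(sample["job_skills"].iloc[0])  # ['sql', 'python']
```

`ast.literal_eval()` only evaluates Python literals, which makes it a safer choice than `eval()` for parsing text that comes from an external dataset.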


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785741 entries, 0 to 785740
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   job_title_short        785741 non-null  object 
 1   job_title              785740 non-null  object 
 2   job_location           784696 non-null  object 
 3   job_via                785733 non-null  object 
 4   job_schedule_type      773074 non-null  object 
 5   job_work_from_home     785741 non-null  bool   
 6   search_location        785741 non-null  object 
 7   job_posted_date        785741 non-null  object 
 8   job_no_degree_mention  785741 non-null  bool   
 9   job_health_insurance   785741 non-null  bool   
 10  job_country            785692 non-null  object 
 11  salary_rate            33067 non-null   object 
 12  salary_year_avg        22003 non-null   float64
 13  salary_hour_avg        10662 non-null   float64
 14  company_name           785723 non-nu

In [4]:
# Data cleanup: parse dates and convert stringified skill lists into real lists
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])
df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785741 entries, 0 to 785740
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   job_title_short        785741 non-null  object        
 1   job_title              785740 non-null  object        
 2   job_location           784696 non-null  object        
 3   job_via                785733 non-null  object        
 4   job_schedule_type      773074 non-null  object        
 5   job_work_from_home     785741 non-null  bool          
 6   search_location        785741 non-null  object        
 7   job_posted_date        785741 non-null  datetime64[ns]
 8   job_no_degree_mention  785741 non-null  bool          
 9   job_health_insurance   785741 non-null  bool          
 10  job_country            785692 non-null  object        
 11  salary_rate            33067 non-null   object        
 12  salary_year_avg        22003 non-null   floa