This dataset provides a comprehensive list of data science job postings sourced from Indeed. It includes details about job titles, companies, locations, salary ranges, short descriptions, and job links. The dataset is designed to help job seekers and researchers analyze current trends in the data science job market, understand salary expectations, and identify key hiring companies and locations.

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


Initial Data Exploration

In [2]:
import pandas as pd
import numpy as np

#Load data from a CSV file
df = pd.read_csv('../../Indeed-Data Science Jobs List 3.csv')

In [None]:
# Display the first five rows of the DataFrame
print(df.head())

In [None]:
# Remove the 'Position' column, which is only a row index / placeholder
df = df.drop(columns=['Position'])

In [None]:
# Show full URLs in the 'Job link' column by removing pandas' column-width limit
pd.set_option('display.max_colwidth', None)
print(df.head())  # Display the result

In [None]:
#Get a concise summary of the DataFrame
print(df.info())

In [None]:
# Generate descriptive statistics
print(df.describe())

# Include categorical data:
print(df.describe(include='all'))

# Numerical data only:
print(df.describe(include=[np.number]))

#Categorical data only:
print(df.describe(include=[object]))

In [None]:
# Check for missing values
print(df.isnull().sum())  # Count the number of missing values in each column

In [None]:
# Check the unique values in a column
print(df['Job Title'].unique())  # Display the unique values of 'Job Title'

In [None]:
# Check how often each value occurs
print(df['Job Title'].value_counts())

Data Cleaning

In [3]:
# Handling missing values
# First, identify missing values by printing the number of missing values in each column
print(df.isnull().sum())

Position               0
Job Title              0
Company                0
Location               0
Salary               138
Short Description      0
Posted At            184
Job link               0
dtype: int64


Handling Strategies

In [4]:
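# Remove rows or columns that have missing data
# Option 1: drop every row that has at least one missing value
df_rows_dropped = df.dropna()

# Option 2: drop every column that has at least one missing value
# (kept in a separate variable so it does not overwrite option 1)
df_cols_dropped = df.dropna(axis=1)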

Impute Missing Values:

- replace missing values with the mean, median, or mode for numerical columns

- use the mode or a specific value for categorical columns
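
The two strategies listed above can be sketched on a small toy DataFrame. The column names and values here are hypothetical and not taken from the Indeed dataset. The example also shows why the choice between mean and median matters when a column contains an outlier.

```python
import pandas as pd
import numpy as np

# Toy data (hypothetical values, not taken from the Indeed dataset)
toy = pd.DataFrame({
    "salary": [50000.0, 60000.0, np.nan, 70000.0, 300000.0],
    "city": ["Austin", None, "Austin", "Boston", "Austin"],
})

# Numerical column: fill with the mean or the median.
# The 300000 outlier pulls the mean up, while the median is unaffected by it.
mean_filled = toy["salary"].fillna(toy["salary"].mean())
median_filled = toy["salary"].fillna(toy["salary"].median())

# Categorical column: fill with the most frequent value (mode).
# mode() returns a Series because ties are possible; [0] takes the first entry.
city_filled = toy["city"].fillna(toy["city"].mode()[0])

print(mean_filled[2], median_filled[2], city_filled[1])
```

With skewed data such as salaries, the median is often the more conservative fill value, although which one is appropriate depends on the analysis.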


In [6]:
# First, check the dtype of the columns that have missing values
print(df['Salary'].dtype)
print(df['Posted At'].dtype)


object
object


In [7]:
# Both columns have dtype object, which means they hold strings or mixed types.
# Convert 'Salary' to numeric: strip '$' and ',' and keep the first run of digits.
# Note: for a salary range, this keeps only the first (lower) figure.
df['Salary'] = df['Salary'].str.replace(r'[\$,]', '', regex=True).str.extract(r'(\d+)', expand=False).astype(float)
print(df['Salary'].head())

0    120000.0
1         NaN
2         NaN
3         NaN
4    103409.0
Name: Salary, dtype: float64
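
The conversion above keeps only the first number found in each salary string. If the column contains ranges, one possible alternative is to extract every number and derive a lower bound, upper bound, and midpoint. The sketch below assumes strings shaped like "$100,000 - $140,000 a year"; these sample strings are hypothetical, and the real column may use other formats (hourly rates, for instance) that would need separate handling.

```python
import pandas as pd

# Hypothetical salary strings; the real column may use other formats.
raw = pd.Series([
    "$120,000 a year",
    "$100,000 - $140,000 a year",
    None,
])

# Remove "$" and thousands separators, then pull out every number.
cleaned = raw.str.replace(r"[\$,]", "", regex=True)
numbers = cleaned.str.extractall(r"(\d+(?:\.\d+)?)")[0].astype(float)

# extractall returns one row per match, indexed by (original row, match number).
# Grouping on the original row gives the lower and upper bound per posting.
bounds = numbers.groupby(level=0).agg(["min", "max"])
bounds["mid"] = bounds[["min", "max"]].mean(axis=1)

# Reindex so postings without any number come back as NaN rows.
bounds = bounds.reindex(raw.index)
print(bounds)
```

A single-figure salary yields identical lower bound, upper bound, and midpoint, so the result is compatible with the simpler first-number approach for those rows.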


In [11]:
# Fill missing values (NaN) with the mean
# ensuring the original DataFrame is being modified by assigning the 'fillna' back to 'df['Salary']'
df['Salary'] = df['Salary'].fillna(df['Salary'].mean()) 

# For categorical data, fill missing values with the most frequent value (mode)
# Ensuring the 'Posted At' column is directly modified in the original data frame
df['Posted At'] = df['Posted At'].fillna(df['Posted At'].mode()[0])


In [12]:
# Print the first few rows of the 'Salary' column to see the changes
print("Updated 'Salary' column:")
print(df['Salary'].head())

# Print the first few rows of the 'Posted At' column to see the changes
print("Updated 'Posted At' column:")
print(df['Posted At'].head())

Updated 'Salary' column:
0    120000.000000
1    111460.532258
2    111460.532258
3    111460.532258
4    103409.000000
Name: Salary, dtype: float64
Updated 'Posted At' column:
0    Employer\nActive 3 days ago
1    Employer\nActive 3 days ago
2    Employer\nActive 3 days ago
3    Employer\nActive 3 days ago
4    Employer\nActive 3 days ago
Name: Posted At, dtype: object


- The Salary column now holds numeric data (float64), and its missing values have been replaced with the column mean (111460.53), as rows 1 to 3 show.
- The missing values in the Posted At column have been filled with the mode, i.e. the column's most frequent entry.
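
Once missing salaries are replaced with the mean, imputed rows can no longer be told apart from real observations. One way to keep that information is to record an indicator column before filling. This is a minimal sketch on toy data; the `Salary_was_missing` column name is an assumption introduced for illustration and is not part of the dataset.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the 'Salary' column after numeric conversion
toy = pd.DataFrame({"Salary": [120000.0, np.nan, np.nan, 103409.0]})

# Record which rows were missing BEFORE imputing (hypothetical column name)
toy["Salary_was_missing"] = toy["Salary"].isna()

# Fill the gaps with the mean of the observed values
toy["Salary"] = toy["Salary"].fillna(toy["Salary"].mean())

# Later analysis can restrict itself to the salaries that were actually observed
observed_mean = toy.loc[~toy["Salary_was_missing"], "Salary"].mean()
print(toy)
print(observed_mean)
```

Mean imputation leaves the column mean unchanged but shrinks its variance, so statistics on spread are better computed on the observed rows only.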

The final step is to check for any remaining 'NaN' values with 'df.isna().sum()' and to inspect the first and last rows of the DataFrame.

In [13]:
# Check for remaining NaN values in the DataFrame
print("Remaining NaN values in each column:")
print(df.isna().sum())

# Inspect the first 10 rows of the DataFrame
print("First 10 rows of the DataFrame:")
print(df.head(10))

# Inspect the last 10 rows of the DataFrame
print("Last 10 rows of the DataFrame:")
print(df.tail(10))


Remaining NaN values in each column:
Position             0
Job Title            0
Company              0
Location             0
Salary               0
Short Description    0
Posted At            0
Job link             0
dtype: int64
First 10 rows of the DataFrame:
   Position                                          Job Title  \
0         1                                     Data Scientist   
1         2  Senior Artificial Intelligence Researcher for ...   
2         3                              Senior Data Scientist   
3         4                                     Data Scientist   
4         5                                     Data Scientist   
5         6                                     Data Scientist   
6         7                   Business Data Scientist/Engineer   
7         8                                   Data Scientist I   
8         9                                     Data Scientist   
9        10                           Associate Data Scientist   

       

In [14]:
import os

file_path = '../../Indeed-Data Science Jobs List 3.csv'
# Check that the source CSV exists (this only confirms the input file is present;
# nothing has been written to disk at this point)
if os.path.exists(file_path):
    print("File exists.")
else:
    print("File does not exist.")


File exists.
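
The cleaning steps in this notebook modify the DataFrame in memory only. To persist a cleaned DataFrame, it has to be written out explicitly, for example with `to_csv`. The sketch below uses a toy DataFrame and a temporary directory so that it runs anywhere; the output file name `jobs_cleaned.csv` is a hypothetical choice, not a file from the original project.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({"Job Title": ["Data Scientist"], "Salary": [120000.0]})

# Write to a temporary directory; the file name is a hypothetical example
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "jobs_cleaned.csv")

# index=False avoids writing the row index as an extra column
cleaned.to_csv(out_path, index=False)

# Verify the write: the file exists and reads back with the same shape
saved = os.path.exists(out_path)
reloaded = pd.read_csv(out_path)
print(saved, reloaded.shape)
```

Writing to a new file rather than overwriting the source CSV keeps the raw data available for re-running the cleaning steps.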
