# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:90%;text-align:center;border-radius:10px; border: 2px solid #FFA500; padding: 10px;">Data Science Salary Prediction (Glassdoor)</p>

<center><img src="https://github.com/FahadUrRehman07/ds_Salary_proj/blob/main/Data%20Sciene%20Salaries%20Prediction.gif?raw=true"></center>



In this project, I use machine learning models such as `Multiple Linear Regression, Lasso Regression and Random Forest` on my own dataset, `Data Science Jobs & Salaries`, to predict salaries. I use `GridSearchCV` to tune the models and then test `Ensembling Techniques`.

<a id='top'></a>
<p style="background-color:#6A5ACD;font-family:Tahoma, Geneva, sans-serif;color:#FFFFFF;font-size:150%;text-align:center;border-radius:10px;padding:10px;"> 📊 Learn About Data 📈 </p>

### `About Dataset`

The data in this dataset was extracted from Glassdoor, a job posting website. It covers data science jobs and salaries and offers a clear view of job opportunities. It includes essential details such as job titles, estimated salaries, job descriptions, company ratings, and key company information such as location, size, and industry. Whether you're job hunting or researching, this dataset helps you understand the job market easily. Start exploring now to make smart career choices!


<h4 style = "color:orange">Columns in Dataset:</h4>

1. `**Job Title:**`                           _Title of the Job_
2. `**Salary Estimate:**`	        _Estimated salary for the job that the company provides_
3. `**Job Description:**`	        _The description of the job_
4. `**Rating:**`                              _Rating of the company_
5. `**Company Name:**`               _Name of the Company_
6. `**Location:**`                           _Location of the job_
7. `**Headquarters:**`                   _Headquarters of the company_
8. `**Size:**`                                  _Number of employees in the company_
9.  `**Founded:**`                         _The year the company was founded_
10. `**Type of ownership:**`        _Ownership types like private, public, government, and non-profit organizations_
11. `**Industry:**`                         _Industry type like `Aerospace, Energy` where the company provides services_
12. `**Sector:**`                           _The type of services the company provides within its industry, e.g. industry (Energy), sector (Oil, Gas)_
13. `**Revenue:**`                       _Total revenue of the company_
14. `**Competitors:**`                  _Company competitors_


<!-- .......................................................................................................................... -->

<p style="background-color:#20B2AA;font-family:Arial, sans-serif;color:#FFFFFF;font-size:150%;text-align:center;border-radius:10px;padding:10px;border:2px solid #FFFFFF;"> 🔄 Life Cycle Of Machine Learning Project 🤖 </p>


   <ul style="font-size: 18px; font-family: 'Segoe UI';">
        <li><strong>Understanding the Problem Statement</strong></li>
        <li><strong>Exploratory Data Analysis</strong></li>
        <li><strong>Data Pre-Processing</strong></li>
        <li><strong>Model Training</strong></li>
        <li><strong>Choose Best Model</strong></li>
        <li><strong>Model Tuning</strong></li>
        <li><strong>Test Ensembling</strong></li>
        <li><strong>Putting Model into Production</strong></li>
    </ul>


<!-- >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -->





<p style="background-color: #4cbb17; font-family: Arial, sans-serif; color: #ffffff; font-size: 24px; text-align: center; padding: 10px; border-radius: 10px;">📋 TABLE OF CONTENTS 📋</p>
   
    
* [1. IMPORTING LIBRARIES](#1)
    
* [2. LOADING DATA](#2)

* [3. DATA CLEANING](#data_cleaning)

* [4. Exploratory Data Analysis](#EDA)
    
* [5. Model Building](#ModelBuilding)   
    
* [6. TUNING BY GRIDSEARCHCV](#Tuning)
      
* [7. TEST ENSEMBLING](#ensembles)
    
* [8. PUTTING MODEL INTO PRODUCTION](#production)










# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED; font-size:150%; text-align:center; border-radius: 15px 50px;"> ✨ Business Problem ✨</p>

<strong> Before diving into technical aspects of ML projects, the first step is always defining the business (or scientific) problem related to data. </strong>

Here, we work with data science salary data: the Glassdoor dataset with the 14 columns listed above, and a second dataset (`ds_salaries.csv`, introduced later) with 11 columns.

Our purpose is to predict the salary of an employee from the features in the dataset. To predict it well, we need to grasp the <b>domain knowledge</b>, which is provided in the next stages.


<a id='1'></a>

# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:80%;text-align:center;border-radius:10px; border: 2px solid #FFA500; padding: 10px;"><b>1|</b> 📚 IMPORTING LIBRARIES 📚</p>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

<a id='2'></a>

# <p style="background-color: #20B2AA; font-family: Arial, sans-serif; color: #FFFFFF; font-size: 80%; text-align: center; border-radius: 10px; padding: 15px; border: 4px solid yellow; border-style: dashed;"><b>2|</b> 🔄 LOADING DATASET 🤖 </p>



In [None]:
df=pd.read_csv('/content/glassdoor_jobs.csv')
df

In [None]:
df.head()

In [None]:
df.info()

In [None]:
def check_data(df):
    print(80 * "*")
    print('DIMENSION: ({}, {})'.format(df.shape[0], df.shape[1]))
    print(80 * "*")
    print("COLUMNS:\n")
    print(df.columns.values)
    print(80 * "*")
    print("DATA INFO:\n")
    print(df.dtypes)
    print(80 * "*")
    print("MISSING VALUES:\n")
    print(df.isnull().sum())
    print(80 * "*")
    print("NUMBER OF UNIQUE VALUES:\n")
    print(df.nunique())

check_data(df)

In [None]:
df_1=pd.read_csv("/content/train (1).csv")

In [None]:
def data_classification(dataframe,cat_th=7):
  cat_cols=[ col for col in  dataframe.columns if dataframe[col].dtype == "O" and dataframe[col].nunique() <= cat_th]
  num_cols=[col for col in  dataframe.columns if dataframe[col].dtype !="O" and dataframe[col].nunique() > cat_th]
  cat_but_car=[col for col in dataframe.columns if dataframe[col].dtype =="O" and dataframe[col].nunique() > cat_th]
  num_but_cat=[col for col in dataframe.columns if dataframe[col].dtype !="O" and dataframe[col].nunique()<=cat_th ]
  return {"cat_cols":cat_cols,"num_cols":num_cols,"cat_but_car":cat_but_car,"num_but_cat":num_but_cat}


In [None]:
import pandas as pd
df=pd.read_csv("/content/train (1).csv")

In [None]:
import pandas as pd

def data_classification(dataframe, cat_th=8):
    """
    Classifies columns in a DataFrame into:
    - Categorical columns (`cat_cols`)
    - Numerical columns (`num_cols`)
    - High cardinality categorical columns (`cat_but_car`)
    - Numerical columns that behave as categorical (`num_but_cat`)

    Parameters:
    - dataframe: pd.DataFrame, the DataFrame to classify
    - cat_th: int, threshold for unique values to consider as categorical

    Returns:
    - A dictionary with lists of columns for each category
    """
    # Categorical columns (object or category type with unique values <= cat_th)
    cat_cols = [
        col for col in dataframe.columns
        if dataframe[col].dtype in ["O", "category"] and dataframe[col].nunique() <= cat_th
    ]

    # Numerical columns (excluding object or category, with unique values > cat_th)
    num_cols = [
        col for col in dataframe.columns
        if pd.api.types.is_numeric_dtype(dataframe[col]) and dataframe[col].nunique() > cat_th
    ]

    # High cardinality categorical columns (object or category type with unique values > cat_th)
    cat_but_car = [
        col for col in dataframe.columns
        if dataframe[col].dtype in ["O", "category"] and dataframe[col].nunique() > cat_th
    ]

    # Numerical but categorical columns (numeric types with unique values <= cat_th)
    num_but_cat = [
        col for col in dataframe.columns
        if pd.api.types.is_numeric_dtype(dataframe[col]) and dataframe[col].nunique() <= cat_th
    ]

    return {
        "cat_cols": cat_cols,
        "num_cols": num_cols,
        "cat_but_car": cat_but_car,
        "num_but_cat": num_but_cat
    }

# Example usage:
# dataframe = pd.DataFrame({...})  # Define or load your DataFrame
# classification = data_classification(dataframe)
# print(classification)


In [None]:
data_classification(df)

In [None]:
df.describe().T

In [None]:
df

# Summary of the Dataset
- The dataset consists of 956 rows and 15 columns
- The target variable is `Salary Estimate`
- Many variables have high cardinality: they are technically categorical but have so many labels that encoding them can greatly increase the number of features
- There are technically no missing values, because missing entries are stored as `-1` rather than as nulls
- Descriptive statistics show that some features have outliers

# Data Pre-Processing



#### By looking into the scraped data we will do the following tasks.

<div style="color: #1E90FF; display: inline-block; border-radius: 10px; background-color: #F0F8FF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px groove #1E90FF; width:50%;">
    <p style="padding: 15px; color: #1E90FF; overflow: hidden; font-size: 24px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b> Tasks List:</b>
    </p>
</div>


<div style="border: 2px solid #1E90FF; border-radius: 10px; margin-top: 10px; width:50%">
    <ol style="list-style-type: none; padding: 10px;">
        <li>1. Renaming Columns.</li>
        <li>2. Salary Parsing.</li>
        <li>3. Company Name text only.</li>
        <li>4. State of Field.</li>
        <li>5. Age of Company.</li>
        <li>6. Parsing of job description (python, etc.)</li>
    </ol>
</div>


In [None]:
df.drop(columns=['Unnamed: 0'],inplace=True)

In [None]:
df["Job Title"].value_counts()


In [None]:
def title_simplifier(title):
    if 'data scientist' in title.lower():
        return 'data scientist'
    elif 'data engineer' in title.lower():
        return 'data engineer'
    elif 'analyst' in title.lower():
        return 'analyst'
    elif 'machine learning' in title.lower():
        return 'machine learning'
    elif 'manager' in title.lower():
        return 'manager'
    elif 'director' in title.lower():
        return 'director'
    else:
        return 'na'

def seniority(title):
    # Lower-case once; these are substring checks, so short keys can also match inside longer words
    title = title.lower()
    if any(key in title for key in ('sr', 'senior', 'lead', 'principal')):
        return 'senior'
    elif 'jr' in title:  # also covers 'jr.'
        return 'jr'
    else:
        return 'na'





In [None]:
df['Job_Title'] = df['Job Title'].apply(title_simplifier)

In [None]:
df['seniority'] = df['Job Title'].apply(seniority)
df

In [None]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}

data = pd.DataFrame(data)
display(data)

# axis=0 sums down each column (one total per column); axis=1 would give row sums
column_sums = data.sum(axis=0)

print(column_sums)


In [None]:
np.array([False,False,False,True]).any()

In [None]:
df.columns[(df == '-1').any() | (df == -1).any()].to_list()

In [None]:
columns_with_minus_one = df.columns[(df == '-1').any() | (df == -1).any()].tolist()
print(columns_with_minus_one)

In [None]:
data = {'A': [1, '-1', -1, -1],
        'B': [-1, -1, -1, 8],
        'C': [-1, -1, '-1', -1]}

data = pd.DataFrame(data)
print(data)
rows_all_minus_one = ((data == '-1') | (data == -1)).all(axis=1)
print(rows_all_minus_one)
print(data[rows_all_minus_one])


In [None]:
rows_all_minus_one = ((df == '-1') | (df == -1)).all(axis=1)
display(df[rows_all_minus_one])


In [None]:
df["Rating"].mode()

In [None]:
df[~df["Founded"].isin([-1, '-1'])]["Founded"].mode()[0]

In [None]:
def replace_with_mode(df,columns):
    for col in columns:
        mode_value = df[~df[col].isin([-1, '-1'])][col].mode()[0]
        df[col] = df[col].replace([-1, '-1'], mode_value)

    return df

In [None]:
replace_with_mode(df,columns_with_minus_one[1:])

In [None]:
columns_with_minus_one = df.columns[(df == '-1').any() | (df == -1).any()].tolist()
print(columns_with_minus_one)

In [None]:
df['Salary Estimate'].value_counts(normalize=True)

## `Removing -1 from Salary Estimate`

In [None]:
df.shape

In [None]:
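# .copy() avoids SettingWithCopyWarning when new columns are assigned to the filtered frame later
df = df[df['Salary Estimate'] != '-1'].copy()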
df.head()

## `Removing the glassdoor est text in the salary column:`

In [None]:
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
df.head(10)


## `Removing the k and $ sign from the Salary Estimate:`

> Remove the `K` and `$` characters from the salary column so that
the values can be treated as numbers for analysis and prediction

In [None]:
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.replace('K','').replace('$',''))
df.head(10)

- Employer Provided Salary: This is an official figure disclosed by the company for a particular job posting. It reflects what the company intends to pay for that role and tends to be more accurate and reliable since it comes from the employer.

- Glassdoor Salary Estimate: When an employer doesn’t provide salary details, Glassdoor often estimates the pay range based on employee reviews, industry averages, location, and similar roles.

In [None]:
df['PerHour'] = df['Salary Estimate'].str.contains('per hour', case=False)

In [None]:
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.lower().replace('per hour', ''))
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.lower().replace('employer provided salary:', ''))

In [None]:
df

## `Splitting the Salary estimated column:`

- The first will be the minimum salary.<br>
- The second will be the maximum salary.
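
The parsing steps in this section (dropping the Glassdoor note, removing `$` and `K`, flagging hourly and employer-provided figures, splitting into minimum and maximum) can be sketched as one helper. This is only an illustration: `parse_salary_range` is a hypothetical name, and the sample strings are assumed formats rather than rows copied from the dataset.

```python
import re

def parse_salary_range(raw):
    """Return (min_value, max_value, is_hourly, employer_provided) from a raw salary string."""
    text = raw.lower()
    is_hourly = "per hour" in text
    employer_provided = "employer provided salary" in text
    # Drop the trailing "(glassdoor est.)" note, then pull out the two numbers.
    numbers = re.findall(r"\d+", text.split("(")[0])
    if len(numbers) != 2:
        raise ValueError(f"expected a 'min-max' range, got: {raw!r}")
    low, high = (int(n) for n in numbers)
    return low, high, is_hourly, employer_provided

print(parse_salary_range("$53K-$91K (Glassdoor est.)"))
print(parse_salary_range("Employer Provided Salary:$80K-$90K"))
print(parse_salary_range("$17-$24 Per Hour(Glassdoor est.)"))
```

Raising on anything that is not a two-number range makes unexpected formats visible instead of silently producing a wrong salary.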


In [None]:
df['Min_Salary'] = df['Salary Estimate'].apply(lambda x: int(x.split('-')[0]))
df['Max_Salary'] = df['Salary Estimate'].apply(lambda x: int(x.split('-')[1]))

In [None]:
df['Salary Estimate']= (df['Min_Salary'] + df['Max_Salary'])/2
df.head()

In [None]:
df["Salary Estimate"]= df.apply(
    lambda x: (x["Salary Estimate"] * 40 * 52)/1000 if x["PerHour"] == True else x["Salary Estimate"],
    axis=1
)

In [None]:
df["Salary Estimate"]=df["Salary Estimate"]*1000

In [None]:
df.drop(columns=["Min_Salary","Max_Salary","PerHour","Job Title"],inplace=True)

In [None]:
df.head(10)

In [None]:
df['Company Name']= df['Company Name'].apply(lambda x: x.split('\n')[0])

In [None]:
df.head()

In [None]:
df["Job Description"]

<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 6. Parsing the Job Description (python etc):</b>
    </p>
</div>

<div style="border: 2px solid #4CAF50; border-radius: 8px; margin-top: 10px; width:40%">
    <ul style="list-style-type: none; padding: 10px;">
        <li>python.</li>
        <li>R Studio.</li>
        <li>Spark.</li>
        <li>AWS.</li>
        <li>Excel.</li>
    </ul>
</div>
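
Plain substring checks such as `'excel' in text` also match longer words like "excellent", and `'aws'` matches "laws". One possible alternative is a whole-word regular expression; the sketch below uses a hypothetical helper `mentions_skill` and a made-up description.

```python
import re

def mentions_skill(description, skill):
    """True if `skill` appears as a whole word (case-insensitive) in the text."""
    pattern = r"\b" + re.escape(skill) + r"\b"
    return re.search(pattern, description, flags=re.IGNORECASE) is not None

text = "Excellent communication skills; follows data privacy laws. Uses SQL and Python."

print("excel" in text.lower())          # substring check: matches inside "Excellent"
print(mentions_skill(text, "excel"))    # whole-word check
print(mentions_skill(text, "aws"))
print(mentions_skill(text, "python"))
```

Whether the stricter match is preferable depends on the data: it avoids false positives, but it would also miss variants such as "PySpark" when looking for "spark".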


In [None]:
df['Python_yn'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
df['R Studio'] = df['Job Description'].apply(lambda x: 1 if 'r studio' in x.lower() or 'r-studio' in x.lower() or 'r_studio' in x.lower() else 0)
df['Spark'] = df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0)
df['AWS_yn'] = df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)
df['Excel_yn'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
df["SQL_yn"] = df["Job Description"].apply(lambda x: 1 if "sql" in x.lower() else 0)
df["Tensorflow_yn"] = df["Job Description"].apply(lambda x: 1 if "tensorflow" in x.lower() else 0)

In [None]:
df.groupby("AWS_yn")["Salary Estimate"].agg("mean")

In [None]:
df.head()

<b>Now the Company Name and Job Description columns are cleaned and ready to use for EDA</b>

<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 4. State of Field Column Cleaning:</b>
    </p>
</div>

We can create the state column from the location column

In [None]:
df["Location"].value_counts()

In [None]:
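# df['City'] = df["Location"].apply(lambda x: x.split(',')[0])
# Take the last comma-separated part and strip the leading space (' CA' -> 'CA');
# this also avoids an IndexError for locations that contain no comma
df['State'] = df["Location"].apply(lambda x: x.split(',')[-1].strip())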

### `Let's see if the location & headquarters are the same`


In [None]:
df['Same State'] = df.apply(lambda x: 1 if x["Location"] == x["Headquarters"] else 0, axis=1)
df.head()


In [None]:
df.drop(columns=["Location"],inplace=True)

<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 5. Age of Company:</b>
    </p>
</div>

In [None]:
df['Age'] = 2024-df['Founded']


In [None]:
df["Age"].max()

In [None]:
# job description length
df['desc_len'] = df['Job Description'].apply(len)

## `- Competitor Count:`



In [None]:
df["Competitors"]

In [None]:
# competitors count
df['Num_comp'] = df['Competitors'].apply(lambda x: len(x.split(',')))

## `- Hourly Wage to annual`

Hourly wages were already converted to annual figures above, when `Salary Estimate` was computed.
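
As a worked example of the hourly-to-annual conversion applied to `Salary Estimate` (midpoint of the hourly range, times 40 hours per week, times 52 weeks per year), here is a small sketch. The function name and the 17 to 24 dollar range are illustrative assumptions.

```python
HOURS_PER_WEEK = 40   # full-time schedule, matching the factor used in the conversion cell
WEEKS_PER_YEAR = 52

def hourly_range_to_annual(min_hourly, max_hourly):
    """Annualise the midpoint of an hourly wage range."""
    midpoint = (min_hourly + max_hourly) / 2
    return midpoint * HOURS_PER_WEEK * WEEKS_PER_YEAR

annual = hourly_range_to_annual(17, 24)
print(annual)  # 20.5 * 40 * 52 = 42640.0
```

The 40-hour week and 52 paid weeks are simplifications; part-time or seasonal postings would need different factors.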

> ### Now Everything is ready for EDA.

<a id='EDA'></a>
# <p style="background-color:#FF6347;font-family:Verdana, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:10px;padding:10px;"><b>4|</b> 📊 Exploratory Data Analysis 📊 </p>



In [None]:
sns.set(style="whitegrid")
sns.set_palette("viridis")
numerical_columns = ['Rating', 'Founded', 'Salary Estimate']
for col in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
def plot_top_n_categories(df, column, n=20):
    top_n = df[column].value_counts().nlargest(n)
    plt.figure(figsize=(12, 8))
    sns.barplot(y=top_n.index, x=top_n.values, palette='viridis')
    plt.title(f'Top {n} most common {column}')
    plt.ylabel(column)
    plt.xlabel('Count')
    plt.show()

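# 'Job Title' and 'Location' were dropped during cleaning, so use the derived 'Job_Title' and 'State' columns
categorical_columns = ['Job_Title', 'State', 'Headquarters', 'Type of ownership', 'Industry', 'Sector']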
for col in categorical_columns:
    plot_top_n_categories(df, col, n=20)

In [None]:
from wordcloud import WordCloud
def plot_wordcloud(df, column):
    text = ' '.join(df[column].astype(str).values)
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.title(f'Word Cloud of {column}')
    plt.show()

for col in categorical_columns:
    plot_wordcloud(df, col)

In [None]:
for col in numerical_columns:
    if col != 'Salary Estimate':
        plt.figure(figsize=(10, 6))
        sns.scatterplot(x=df[col], y=df['Salary Estimate'])
        plt.title(f'Salary Estimate vs {col}')
        plt.show()

In [None]:
sns.pairplot(df[numerical_columns])
plt.show()

## Checking Outliers by boxplot

Columns:

- Salary Estimate
- Age
- desc_len
- Rating

In [None]:
def boxplot_numeric_columns(df, columns):
    num_cols = len(columns)
    plt.figure(figsize=(25, 5))
    for i, column in enumerate(columns):
        # One row with one panel per column
        plt.subplot(1, num_cols, i+1)
        sns.boxplot(x=df[column])
        plt.title(f'Box plot for {column}')
    plt.tight_layout()
    plt.show()

In [None]:
boxplot_numeric_columns(df,['Salary Estimate', 'Age', 'desc_len', 'Rating'])

# Correlation between Variables


In [None]:
df[['Age','Salary Estimate', 'Rating', 'desc_len']].corr()


In [None]:
sns.heatmap(df[['Age','Salary Estimate', 'Rating', 'desc_len']].corr(), annot=True)

# Second Dataset

In [None]:
df2=pd.read_csv('/content/ds_salaries.csv')
df2


<br>

# <b><span style='color:#00aee5'>|</span> Domain Knowledge</b>

<br>

1. **`work_year` [categorical] :** This represents the specific year in which the salary was disbursed.

2. **`experience_level` [categorical] :** The level of experience a person holds in a particular job. This is a key determinant in salary calculation, as more experienced employees typically earn higher salaries.

3. **`employment_type` [categorical] :** The nature of the employment contract such as full-time, part-time, or contractual can greatly influence the salary. Full-time employees often have higher annual salaries compared to their part-time or contractual counterparts.

4. **`job_title` [categorical] :** The role an individual holds within a company. Different roles have different salary scales based on the responsibilities and skills required. For example, managerial roles typically pay more than entry-level positions.

5. **`salary` [numerical] :** The total gross salary paid to the individual. This is directly influenced by factors such as experience level, job title, and employment type.

6. **`salary_currency` [categorical] :** The specific currency in which the salary is paid. Exchange rates could affect the value of the salary when converted into different currencies.

7. **`salary_in_usd` [numerical] :** The total gross salary amount converted to US dollars. This allows for a uniform comparison of salaries across different countries and currencies.

8. **`employee_residence` [categorical]:** The primary country of residence of the employee. The cost of living and prevailing wage rates in the employee's country of residence can impact salary levels.

9. **`remote_ratio` [ratio]:** The proportion of work done remotely. With the rise of remote work, companies may adjust salaries based on the cost of living in the employee's location and the proportion of remote work.

10. **`company_location` [categorical]:**  The location of the employer's main office or the branch that holds the contract. Companies in different locations may offer different salary scales due to varying economic conditions and cost of living.

11. **`company_size` [categorical]:** The median number of employees in the company during the work year. Larger companies often have structured salary scales and may offer higher salaries due to economies of scale and larger revenue streams.

<br>


In [None]:
def data_info(data):
    cols = data.columns
    unique_val = [data[col].value_counts().head(10).index.to_numpy() for col in cols]
    n_uniques = [data[col].nunique() for col in cols]
    dtypes = [data[col].dtype for col in cols]
    nulls = [data[col].isnull().sum() for col in cols]
    # Fully duplicated rows are a frame-level count: compute once and repeat it for every column
    n_dup_rows = data.duplicated().sum()
    dup = [n_dup_rows] * len(cols)
    return pd.DataFrame({'Col': cols, 'dtype': dtypes, 'n_uniques': n_uniques, 'n_nan': nulls, 'unique_val': unique_val, 'duplicated': dup})


In [None]:
data_info(df2)

### <b><span style='color:#00aee5'>|</span> Categorical</b>

*  work_year
*  experience_level
*  employment_type
*  job_title
*  salary_currency
*  employee_residence
*  remote_ratio
*  company_location
*  company_size


### <b><span style='color:#00aee5'>|</span> Numerical</b>
*  salary
*  salary_in_usd

# 📊 Exploratory Data Analysis 📊

## Numerical

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='darkgrid')

plt.figure(figsize=(10, 6))
sns.histplot(df2['salary_in_usd'], bins=30, kde=True, color='red')

plt.title('Salary in USD Distribution', fontsize=20, loc='center')
plt.xlabel('Salary', fontsize=16)
plt.ylabel('Count', fontsize=16)

plt.show()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='darkgrid')

plt.figure(figsize=(10, 6))
sns.boxplot(y='salary_in_usd', data=df2, color='red')

plt.title('Salary Box Plot', fontsize=20)
plt.ylabel('Salary (USD)', fontsize=16)
plt.xlabel('')  # a single vertical box has no meaningful x-axis label

plt.show()


Salary in USD has a **right skewed distribution**, which is a typical distribution happening in income data. This means that most people earn in the low/medium range of salaries with a few exceptions that are distributed along a large range of higher values.
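
A quick way to see what right skew does to summary statistics is to compare the mean and the median: a few very high values pull the mean upward while the median stays near the bulk of the data. The salaries below are made-up numbers for illustration only, not values from `df2`.

```python
import statistics

# Made-up salaries: most values sit in a low/medium range, two are very high.
salaries = [60_000, 75_000, 80_000, 90_000, 95_000, 110_000, 120_000, 300_000, 450_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(round(mean_salary), median_salary)
print(mean_salary > median_salary)  # the long right tail drags the mean above the median
```

On the real data, `df2['salary_in_usd'].skew()` gives a direct measure; a positive value indicates a right-skewed distribution.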

<a id='job-titles'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Top 10 Job Titles by Work Year</div></b>


In [None]:
sns.set(style='darkgrid')

# Sort the years so the subplots appear in chronological order
unique_years = sorted(df2['work_year'].unique())

plt.figure(figsize=(20, 20))

for i, year in enumerate(unique_years, start=1):
    top_jobs = df2[df2['work_year'] == year]['job_title'].value_counts().head(10)

    jobs_df = top_jobs.reset_index()
    jobs_df.columns = ['job_title', 'count']

    plt.subplot(len(unique_years), 1, i)
    sns.barplot(data=jobs_df, x='count', y='job_title', palette='Reds')

    plt.title(f'Top 10 Job Titles in {year}', fontsize=16)
    plt.xlabel('Counts', fontsize=14)
    plt.ylabel('Job Titles', fontsize=14)

    for index, value in enumerate(jobs_df['count']):
        plt.text(value, index, str(value))

plt.tight_layout()
plt.show()


- As you can see, Data Engineer is the most common job title, followed by Data Scientist.

<a id='experience-levels'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Experience Levels</div></b>

In [None]:
df2['experience_level'].unique()

- As you can see, there are 4 unique values: SE (Senior-level/Expert), MI (Mid-level/Intermediate), EN (Entry-level) and EX (Executive-level). Let's rename these values with the `replace` method.
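
The renaming can also be expressed as a single dictionary passed to `Series.replace`, which keeps the code-to-label mapping in one place. The sketch below runs on a small stand-in Series rather than on `df2`; `level_names` is an illustrative name.

```python
import pandas as pd

# Mapping from the short codes to the readable labels used in this notebook.
level_names = {
    "EN": "Entry-level/Junior",
    "MI": "Mid-level/Intermediate",
    "SE": "Senior-level/Expert",
    "EX": "Executive-level/Director",
}

levels = pd.Series(["SE", "MI", "EN", "EX", "SE"])  # stand-in for df2['experience_level']
renamed = levels.replace(level_names)
print(renamed.tolist())
```

Values that are not keys of the dictionary are left unchanged, so an unexpected code stays visible in the output.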

In [None]:
df2['experience_level'] = df2['experience_level'].replace('EN','Entry-level/Junior')
df2['experience_level'] = df2['experience_level'].replace('MI','Mid-level/Intermediate')
df2['experience_level'] = df2['experience_level'].replace('SE','Senior-level/Expert')
df2['experience_level'] = df2['experience_level'].replace('EX','Executive-level/Director')

In [None]:
sns.countplot(data=df2, x="experience_level", palette="Reds")
plt.show()

- As you can see, the senior-level positions have the highest count, followed by mid-level and junior positions. There are fewer director-level positions compared to other levels.

<a id='employment-types'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Employment Types</div></b>

In [None]:
df2['employment_type'].unique()

In [None]:
df2['employment_type'] = df2['employment_type'].replace('FT','Full-Time')
df2['employment_type'] = df2['employment_type'].replace('PT','Part-Time')
df2['employment_type'] = df2['employment_type'].replace('CT','Contract')
df2['employment_type'] = df2['employment_type'].replace('FL','Freelance')

In [None]:
employment_counts = df2['employment_type'].value_counts().reset_index()
employment_counts.columns = ['employment_type', 'count']

plt.figure(figsize=(10, 6))
ax = sns.barplot(data=employment_counts, x='count', y='employment_type', palette='viridis')

plt.title('Counts of Employment Types', fontsize=20)
plt.xlabel('Counts', fontsize=16)
plt.ylabel('Employment Type', fontsize=16)

for index, value in enumerate(employment_counts['count']):
    plt.text(value, index, str(value), va='center')

plt.show()


- As you can see, a considerable number of people are employed on a full-time basis, while freelance roles are comparatively rare in this dataset. Note that this plot shows counts only, so it says nothing about seniority within each employment type or about trends over time.

<a id='salary-job-title'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Salaries by Job Titles</div></b>

In [None]:
job_title_salary = df2.groupby(df2['job_title'])["salary_in_usd"].mean().round(0).sort_values(ascending=False).head(15).reset_index()
job_title_salary.columns = ['job_title', 'average_salary']

sns.set(style='darkgrid')

plt.figure(figsize=(12, 8))
ax = sns.barplot(data=job_title_salary, x='average_salary', y='job_title', palette='viridis')

plt.title('Top 15 Job Titles by Average Salary in USD', fontsize=20)
plt.xlabel('Average Salary (USD)', fontsize=16)
plt.ylabel('Job Titles', fontsize=16)
for index, value in enumerate(job_title_salary['average_salary']):
        plt.text(value, index, str(value), va='center')

plt.show()


<a id='average'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Salaries by Employment Types</div></b>

In [None]:
avg_salaries = df2.groupby('employment_type')['salary_in_usd'].mean().round(0).sort_values(ascending = False).reset_index()
avg_salaries.columns = ['employment_type', 'avg_salary']
sns.barplot(data = avg_salaries, x = 'avg_salary', y = 'employment_type', palette = 'viridis')
plt.show()

As you can see, average salaries differ by employment type, and the second-highest average on the plot belongs to freelancers. Because this plot aggregates over all work years, it does not by itself show whether salaries or freelance work are growing over time.

<a id='salaries'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Salaries by Work Years</div></b>

In [None]:
year_based_salary=df2['salary_in_usd'].groupby(df2['work_year']).mean()
plt.title("Average Salaries based on Work Year")
plt.xlabel('Work Year')
plt.ylabel('Salary')
# Use the years present in the data instead of a hard-coded list
sns.lineplot(x=year_based_salary.index.astype(str), y=year_based_salary.values)
plt.show()

As you can see, the average salary for data-driven jobs is increasing every year, with a particularly significant jump observed between 2021 and 2022. This trend underscores the growing demand for skilled professionals in this field.

In [None]:
!pip install country_converter -q
import country_converter as coco

<a id='remote'></a>
# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Remote Jobs Locations</div></b>

In [None]:
df2['company_location'] = coco.convert(df2['company_location'], to='name_short')
df2['employee_residence'] = coco.convert(df2['employee_residence'], to='name_short')

In [None]:
rr = df2.groupby('company_location')['remote_ratio'].mean().reset_index()
rr['company_location'] =  coco.convert(names = rr['company_location'], to = "ISO3")
rr.head()

In [None]:
import plotly.express as px
fig = px.choropleth(rr,
                    locations = rr.company_location,
                    color = rr.remote_ratio,
                    labels={'company_location':'Country','remote_ratio':'Remote Jobs Ratio'})

fig.update_layout(title = "Remote Jobs Locations")
fig.show()

# Data Pre-Processing

In [None]:
def title_simplifier(title):
    if 'data scientist' in title.lower():
        return 'data scientist'
    elif 'data engineer' in title.lower():
        return 'data engineer'
    elif 'analyst' in title.lower():
        return 'analyst'
    elif 'machine learning' in title.lower():
        return 'mle'
    elif 'manager' in title.lower():
        return 'manager'
    elif 'director' in title.lower():
        return 'director'
    else:
        return np.nan




In [None]:
df2['job_title'] = df2['job_title'].apply(title_simplifier)

In [None]:
df2.dropna(subset=['job_title'], inplace=True)

In [None]:
def remove_categorical_outliers(data, columns, threshold):
    for col in columns:
        series = data[col].value_counts()
        outliers = series[series <= threshold].index
        print(outliers)
        data = data[~data[col].isin(outliers)]
    return data.reset_index(drop=True)

In [None]:
df2=remove_categorical_outliers(df2, ['employee_residence','company_location'], 11)

In [None]:
df2.drop(["salary","salary_currency"],axis=1,inplace=True)

In [None]:
df2

- When scaling features, scale only the input features, not the target variable, and fit the scaler on the training set only. If the target is scaled, predictions and error metrics are no longer in USD unless they are transformed back.
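
A minimal sketch of that rule with scikit-learn's `StandardScaler` follows. The arrays are made-up numbers and the variable names are illustrative, not the notebook's actual features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features (rows = samples); the target stays in its original units.
X_train_num = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test_num = np.array([[2.0, 500.0]])
y_train_usd = np.array([90_000, 120_000, 150_000])  # left unscaled

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_num)  # mean and std learned from the training set only
X_test_scaled = scaler.transform(X_test_num)        # reuse the training statistics on the test set

print(X_train_scaled.mean(axis=0))  # approximately zero for every feature
print(X_test_scaled)
```

Fitting on the training split only keeps information about the test set out of the preprocessing step.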

In [None]:
df2.drop_duplicates(inplace=True)

In [None]:
X=df2.drop("salary_in_usd",axis=1)
y=df2["salary_in_usd"]


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[['Entry-level/Junior', "Mid-level/Intermediate",'Senior-level/Expert',"Executive-level/Director"],["S","M","L"]])
X_train[["experience_level","company_size"]] = ordinal_encoder.fit_transform(X_train[["experience_level","company_size"]])
X_test[["experience_level","company_size"]] = ordinal_encoder.transform(X_test[["experience_level","company_size"]])

In [None]:
pip install category-encoders


In [None]:
from category_encoders import BinaryEncoder
encoder = BinaryEncoder(cols=['employment_type', 'job_title', 'employee_residence','company_location'])
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

In [None]:
X_train_encoded