# AI, ML, Data Salaries Dataset (Hierarchical Clustering and RFECV) Dated July 2023: Notebook-2

## Table of Contents

1. [Introduction](#1.-Introduction)

    1.1 [Continuing the Problem Statement](#1.1-Continuing-the-Problem-Statement)
    
    1.2 [Relevant Success Criteria](#1.2-Relevant-Success-Criteria)
    
    1.3 [Current status of analysis](#1.3-Current-status-of-analysis)

2. [Data Modelling](#2.-Data-Modelling)
    
    2.1 [Imports](#2.1-Imports)

## 1. Introduction

This notebook continues the implementation of the CRISP-DM methodology on the AI, ML, Data salaries dataset from Kaggle.com.

The relevant links are as follows:

* For Exploratory Data Analysis (Notebook-1) on the data set: [click here](https://www.kaggle.com/code/harismirds/eda-on-ai-ml-data-salaries-by-haris)
* For Medium.com article of EDA based on Notebook-1: [click here](https://medium.com/@harrismir90/eda-on-ai-ml-data-salaries-dataset-from-kaggle-2020-2023-539d475d35ed)
* For README file of the whole project: [click here](https://github.com/Haris-Mir/personal_python_projects/blob/main/01_ds_salaries_project/README.md)

## 1.1 Continuing the Problem Statement

We can extract many valuable insights from this dataset. As this dataset provides three years of salary data (the dependent variable) for three major fields (AI, ML, and Data), a thorough analysis of it lets us address some of the following questions:

#### Hierarchical Clustering and RFECV

* Can we group similar types of jobs based on their salaries and other features using clustering techniques?
[ANSWER TO PROBLEM STATEMENT 1](#ANSWER-TO-PROBLEM-STATEMENT-1)
* What are the most important features in predicting salaries in the AI, ML, and Data fields?

#### Machine Learning (ML)

* Can we predict salaries in the AI, ML, and Data fields based on certain features such as job title, experience level, or company size?
* What is the best model or technique for predicting salaries in the AI, ML, and Data fields based on our evaluation criteria?

## 1.2 Relevant Success Criteria

Based on the problem statement and the use of the CRISP-DM philosophy, the success criteria for the project can be defined as follows:

* __Cluster Analysis:__ Successfully group similar types of jobs based on their salaries and other features using clustering techniques. The success criterion could be creating meaningful clusters that accurately represent different job groups based on salary levels and other relevant characteristics.

* __Identification of Key Factors:__ Identify the most important features that contribute to salary prediction in the AI, ML, and Data fields. This can be determined by analyzing the feature importance or coefficients of the predictive model. The success criterion could be identifying the top three factors that have the highest impact on salaries.

* __Accuracy of Salary Prediction:__ Develop a predictive model that accurately predicts salaries in the AI, ML, and Data fields based on relevant features such as job title, experience level, and company size. The success criterion could be achieving a certain level of accuracy, such as an overall prediction accuracy of 80% or higher.

Meeting these success criteria would indicate a comprehensive analysis of the dataset, providing valuable insights into salary trends, predictive modeling, and the factors influencing salaries in the AI, ML, and Data fields.

## 1.3 Current status of analysis

In the previous notebook, we completed the third phase of the CRISP-DM methodology (Data Preparation) and saw the influence of individual features on the dependent variable `salary_in_usd`. I will now commence the fourth phase of CRISP-DM, the modelling phase, in which various modelling techniques are applied to the prepared data to create the best possible model. This phase covers hierarchical clustering, variable selection, and model selection using Python or R.

For further analysis, we will now use the dataset that we curated and saved at the end of the last notebook, i.e. `salaries_usd_filtered`.

## 2. Data Modelling

Using the CRISP-DM methodology, I will now delve into the fourth phase of data mining, called data modelling.

## 2.1 Imports

In [1]:
# Import all the relevant libraries for the analysis

# For data Manipulation
import numpy as np
import pandas as pd

# For Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

# To ignore warning text etc. 
import warnings
warnings.filterwarnings('ignore')

# Misc. libraries
import os
import re
from collections import Counter
from sklearn.preprocessing import OrdinalEncoder

In [3]:
main_df = pd.read_csv('C:/Users/HRM/Documents/github_repos/python_projects/personal_python_projects/01_ds_salaries_project/data/processed/salaries_usd_filtered.csv')
main_df.head(3)

| Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | general_job_title |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | MI | FT | AWS Data Architect | 258000 | US | 100 | US | L | Data Architect |
| 1 | 2023 | SE | FT | Data Scientist | 225000 | US | 0 | US | M | Data Scientist |
| 2 | 2023 | SE | FT | Data Scientist | 156400 | US | 0 | US | M | Data Scientist |


## 2.2 Hierarchical Clustering

## ANSWER TO PROBLEM STATEMENT 1

To answer one of our problem statements:

__Can we group similar types of jobs based on their salaries and other features using clustering techniques?__

We will now try to group the 12 different job titles in the `general_job_title` feature into similar categories using hierarchical clustering. Hierarchical clustering is an unsupervised machine learning technique that groups data points into clusters based on their similarity or dissimilarity.

In the context of our dataset, we can apply hierarchical clustering to group similar job titles based on the similarity or dissimilarity of the other features, such as `work_year`, `company_location`, `experience_level`, `remote_ratio`, and `employee_residence`. By applying hierarchical clustering, we obtain a dendrogram, a tree-like diagram that displays the hierarchical relationship between the clusters. The dendrogram helps us visualize the similarities and dissimilarities between the job titles and identify potential groups or clusters within the data.
However, in order to do this, we need to transform the categorical features into numerical features, because the distance calculations operate on numbers and this keeps the model simpler.
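The transformation of categorical features into numerical ones can be sketched with the `OrdinalEncoder` imported earlier. This is only a minimal illustration, not the encoding actually used later in this notebook: the toy rows are made up, and the category orderings (`EN < MI < SE < EX` and `S < M < L`) are assumptions about how the codes rank.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows shaped like main_df (illustrative values, not taken from the dataset)
toy_df = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EX", "SE"],
    "company_size": ["S", "M", "L", "M", "L"],
    "remote_ratio": [0, 50, 100, 0, 100],
})

categorical_cols = ["experience_level", "company_size"]

# Assumed orderings: entry < mid < senior < executive, small < medium < large
encoder = OrdinalEncoder(categories=[["EN", "MI", "SE", "EX"], ["S", "M", "L"]])

# Replace each category with its position in the assumed ordering
encoded_df = toy_df.copy()
encoded_df[categorical_cols] = encoder.fit_transform(toy_df[categorical_cols])

print(encoded_df)
```

Passing the orderings explicitly matters: without `categories`, `OrdinalEncoder` sorts the labels alphabetically, which would not reflect seniority or company size.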

## The key takeaway from this is that we do not always have to include all the variables in our analysis. Sometimes, a simpler analysis is better.

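#### The two main ingredients of hierarchical clustering

The two main ingredients of hierarchical clustering are the distance metric, which measures how far apart two data points are, and the linkage method, which measures how far apart two clusters are.

There are several distance metrics available for calculating the distances between data points in hierarchical clustering. Some of the commonly used distance metrics are: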

* Euclidean distance: This is the most commonly used distance metric in clustering algorithms. It measures the straight-line distance between two points in a multidimensional space.
* Manhattan distance: This metric is also known as the city block distance or taxicab distance. It measures the distance between two points by summing the absolute differences of their coordinates.
* Gower distance: This is a dissimilarity measure for samples with mixed data types, such as numerical, categorical, or ordinal data. For numerical features it is the Manhattan distance scaled by the range of each feature; for categorical features it counts a simple mismatch (0 if equal, 1 otherwise). The per-feature values are then averaged.
* Chebyshev distance: This metric takes the largest absolute difference between the coordinates of two points across all dimensions. It is also known as the chessboard distance.
* Cosine distance: This metric is one minus the cosine of the angle between two vectors, so it compares their direction rather than their magnitude. It is commonly used in text data analysis and recommendation systems.
* Jaccard distance: This metric measures the dissimilarity between two sets (or binary vectors) as one minus the ratio of the size of their intersection to the size of their union. It is commonly used in clustering categorical data.
* Hamming distance: This metric counts the number of positions at which two equal-length strings or vectors differ. It applies to binary strings as well as to any other sequence of symbols.
* Mahalanobis distance: This metric takes into account the correlation between variables by scaling the differences with the inverse covariance matrix of the dataset. It is useful when features are correlated or measured on different scales.

These are some of the commonly used distance metrics in clustering algorithms. The choice of distance metric depends on the nature of the data and the clustering problem at hand.
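To see how the choice of metric changes the result, here is a small sketch using `scipy.spatial.distance.pdist` on three made-up, already-encoded observations. The array `X`, its column meaning, and the `gower_numeric` name are illustrative assumptions; the last entry only covers the numerical part of Gower distance (range-scaled Manhattan distance averaged over the features), not the full mixed-type measure.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up encoded rows: [experience_level, company_size, remote_ratio]
X = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 1.0, 100.0],
    [3.0, 2.0, 50.0],
])

# Pairwise distance matrices for a few of the metrics listed above
distances = {
    metric: squareform(pdist(X, metric=metric))
    for metric in ("euclidean", "cityblock", "chebyshev", "cosine")
}

# Numerical part of Gower distance: Manhattan distance on range-scaled
# features, averaged over the number of features
ranges = X.max(axis=0) - X.min(axis=0)
distances["gower_numeric"] = (
    squareform(pdist(X / ranges, metric="cityblock")) / X.shape[1]
)

for name, matrix in distances.items():
    print(name)
    print(matrix.round(3))
```

In this toy data, `remote_ratio` ranges from 0 to 100 while the other columns range over one or two units, so the unscaled Euclidean, Manhattan (`cityblock`), and Chebyshev distances are dominated by that single column. The range-scaled variant gives every feature a comparable weight.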

The second ingredient is the linkage method, which defines the distance between two clusters rather than between two points. Some of the most commonly used linkage methods are:

* Single Linkage: In this method, the distance between two clusters is defined as the minimum distance between any pair of points, one taken from each cluster. This method tends to produce long, thin clusters (chaining) and is sensitive to noise and outliers.
* Complete Linkage: In this method, the distance between two clusters is defined as the maximum distance between any pair of points, one taken from each cluster. This method tends to produce compact, spherical clusters and avoids chaining, although a single distant point can still determine the distance between two clusters.
* Average Linkage: In this method, the distance between two clusters is defined as the average distance between all pairs of points, one taken from each cluster. This method strikes a balance between single linkage and complete linkage and is less sensitive to outliers than single linkage.
* Ward's Method: In this method, the distance between two clusters is defined as the increase in the within-cluster sum of squared distances to the cluster centroid when the two clusters are merged. This method tends to produce clusters of similar size, is less sensitive to outliers than single linkage, and assumes Euclidean distances.

Other less commonly used linkage methods include centroid linkage, median linkage, and weighted linkage.
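The linkage methods above can be compared with `scipy.cluster.hierarchy`. The sketch below is only an illustration on six made-up 2-D points that form two well-separated groups; the `points` array and the `leaf_names` labels are assumptions, not part of the salaries dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Two well-separated, made-up groups of 2-D points
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
leaf_names = ["a", "b", "c", "d", "e", "f"]

linkage_matrices = {}
cluster_labels = {}
final_heights = {}

for method in ("single", "complete", "average", "ward"):
    # Each row of Z records one merge: the two clusters joined, the distance
    # at which they were joined, and the size of the new cluster
    Z = linkage(points, method=method, metric="euclidean")
    linkage_matrices[method] = Z
    # Cut the tree so that at most two flat clusters remain
    cluster_labels[method] = fcluster(Z, t=2, criterion="maxclust")
    # Height of the last merge, i.e. the distance between the two groups
    final_heights[method] = Z[-1, 2]
    print(method, cluster_labels[method], round(final_heights[method], 3))

# no_plot=True returns the tree layout without drawing it; dropping the
# argument draws the dendrogram with matplotlib instead
tree = dendrogram(linkage_matrices["ward"], labels=leaf_names, no_plot=True)
print(tree["ivl"])
```

On such cleanly separated data all four methods recover the same two groups, but the height of the final merge differs: single linkage reports the closest pair across the groups, complete linkage the farthest pair, and average linkage a value in between. On noisier data the methods can also disagree about the clusters themselves.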