## Problem Statement


- Scaler is an online tech-versity offering intensive computer science and data science courses through live classes delivered by tech leaders and subject matter experts.
- The meticulously structured program enhances the skills of software professionals by offering a modern curriculum with exposure to the latest technologies. It is a product of InterviewBit.


- You are working as a data scientist in the analytics vertical of Scaler, focused on profiling the best companies and job positions to work for from the Scaler database.
- You are provided with information for a segment of learners and tasked with clustering them on the basis of their job profile, company, and other features. Ideally, learners within the same cluster should have similar characteristics.




## Data Dictionary:

- Unnamed: 0: Index of the dataset
- email_hash: Anonymised Personally Identifiable Information (PII)
- company_hash: Current employer of the learner
- orgyear: Employment start year
- ctc: Current CTC
- job_position: Job profile in the company
- ctc_updated_year: Year in which the CTC was updated (yearly increments, promotions)


#### Concepts Used:

- Manual Clustering
- Unsupervised Clustering: K-means, Hierarchical Clustering



---

- Import the dataset and perform the usual exploratory data analysis steps, such as checking the structure and characteristics of the dataset.
- Check the number of unique emails and how often the same email hash occurs in the data.
- Record observations and inferences wherever necessary.
- Check for missing values and prepare the data for KNN/mean imputation.
- You may have to remove special characters from the dataset using regex.
  - Don’t worry if you haven’t used it before. The syntax is quite simple and intuitive.
  - Code:
    - `import re`
    - `mystring = '\tAirtel X Labs'`
    - `re.sub('[^A-Za-z0-9 ]+', '', mystring)`

- Check for duplicates in the dataset and drop them.
- Create new features, such as a ‘Years of Experience’ column obtained by subtracting orgyear from the current year.
- Manual clustering on the basis of the learner’s company, job position, and years of experience:
  - Get summary statistics of CTC (mean, median, max, min, count, etc.) grouped by company, job position, and years of experience.
  - Merge these carefully with the original dataset and create a flag showing whether a learner’s CTC is greater than the average for their company’s department at the same years of experience. Call that flag Designation, with values [1, 2, 3].
  - Do the same analysis at the company and job position level. Name that flag Class, with values [1, 2, 3].
  - Repeat the same analysis at the company level. Name that flag Tier, with values [1, 2, 3].
- Based on the manual clustering done so far, answer a few questions, such as:
  - Top 10 employees (earning more than most of the employees in the company): Tier 1
  - Top 10 employees of data science in Amazon / TCS etc. earning more than their peers: Class 1
  - Bottom 10 employees of data science in Amazon / TCS etc. earning less than their peers: Class 3
  - Bottom 10 employees (earning less than most of the employees in the company): Tier 3
  - Top 10 employees in Amazon, X department, having 5/6/7 years of experience and earning more than their peers: Tier X
  - Top 10 companies (based on their CTC)
  - Top 2 positions in every company (based on their CTC)
- Data processing for unsupervised clustering: label encoding/one-hot encoding, standardization of data
- Unsupervised learning: clustering
  - Checking clustering tendency
  - Elbow method
  - K-means clustering
  - Hierarchical clustering (you can do this on a sample of the dataset if your process is taking time)
- Insights from unsupervised clustering
- Provide actionable insights and recommendations for the business.
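
The manual clustering steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the toy rows are made up, `CURRENT_YEAR`, the `years_of_experience` column name, and the helper `add_flag` are hypothetical, and the rule used to map a CTC to 1, 2, or 3 (above, equal to, or below the group mean) is an assumption, since the brief only states that each flag takes values [1, 2, 3].

```python
import numpy as np
import pandas as pd

# Toy data; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "company_hash": ["a", "a", "a", "a", "b", "b"],
    "job_position": ["Backend Engineer"] * 3 + ["Frontend Engineer"] + ["Backend Engineer"] * 2,
    "orgyear": [2015, 2015, 2015, 2018, 2016, 2016],
    "ctc": [1000000, 2000000, 3000000, 1500000, 800000, 1200000],
})

CURRENT_YEAR = 2023  # assumption: substitute the year of the analysis
toy["years_of_experience"] = CURRENT_YEAR - toy["orgyear"]


def add_flag(df, group_cols, flag_name):
    """Return a copy of df with a 1/2/3 flag comparing each CTC to its group's mean CTC."""
    # transform("mean") broadcasts the group mean back onto every row,
    # which is equivalent to aggregating and merging on group_cols.
    group_mean = df.groupby(group_cols)["ctc"].transform("mean")
    out = df.copy()
    # Assumed rule: 1 = above the group mean, 2 = equal to it, 3 = below it.
    out[flag_name] = np.select(
        [out["ctc"] > group_mean, out["ctc"] < group_mean], [1, 3], default=2
    )
    return out


flagged = add_flag(toy, ["company_hash", "job_position", "years_of_experience"], "designation")
flagged = add_flag(flagged, ["company_hash", "job_position"], "class")
flagged = add_flag(flagged, ["company_hash"], "tier")
```

Using `transform` keeps the row count unchanged, which sidesteps the accidental row duplication that a careless merge can introduce. A question such as "top 10 employees in Tier 1" then reduces to filtering on the flag and sorting by `ctc`.
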
## Evaluation Criteria (100 Points):

- Define Problem Statement and perform Exploratory Data Analysis (10 Points)
  - Definition of the problem (as per the given problem statement, with additional views)
  - Observations on the shape of the data, data types of all the attributes, conversion of categorical attributes to 'category' (if required), missing value detection, statistical summary
  - Univariate analysis (distribution plots of all the continuous variables, barplots/countplots of all the categorical variables)
  - Bivariate analysis (relationships between important variables, such as CTC and years of experience, or CTC and job position)
  - Illustrate the insights based on EDA
    - Comments on the range of attributes and outliers of various attributes
    - Comments on the distribution of the variables and the relationships between them
    - Comments for each univariate and bivariate plot
- Data Pre-processing (30 Points)
  - Mean/KNN imputation
  - Regex for cleaning company names
  - Standardization and encoding
- Manual Clustering (30 Points)
  - Creating Designation flag and insights
  - Creating Class flag and insights
  - Creating Tier flag and insights
- Unsupervised Learning (20 Points)
  - Checking clustering tendency, elbow method, and K-means clustering
  - Hierarchical clustering
- Actionable Insights and Recommendations (10 Points)
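
For the unsupervised part, a minimal sketch of standardization, the elbow method, and K-means is shown below. It assumes scikit-learn is available and uses synthetic data with two well-separated groups; the feature choice (`ctc`, `years_of_experience`), the range of k, and the final k=2 are assumptions made for this toy example, not values prescribed by the brief.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature table (values are made up).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "ctc": np.concatenate([rng.normal(5e5, 5e4, 50), rng.normal(3e6, 2e5, 50)]),
    "years_of_experience": np.concatenate([rng.normal(2, 0.5, 50), rng.normal(10, 1, 50)]),
})

# Standardize so that CTC, which is on a much larger scale, does not dominate distances.
X = StandardScaler().fit_transform(features)

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# Fit the final model with the k suggested by the elbow (k=2 for this toy data).
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the elbow. On the real dataset the bend is usually less sharp than in this synthetic case, so the choice of k remains a judgment call.
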

In [1]:
import warnings
warnings.filterwarnings("ignore")


In [2]:
import pandas as pd

df = pd.read_csv("scaler_clustering.csv")

In [44]:
df.sample(50)

| Unnamed: 0.1 | Unnamed: 0 | company_hash | email_hash | orgyear | ctc | job_position | ctc_updated_year |
|---|---|---|---|---|---|---|---|
| 104200 | 104439 | onqxatvx xzw | b7a8e353e7c5091c79629acf2e8be487167c46b354ff53... | 2015.0 | 1220000 | Frontend Engineer | 2019.0 |
| 138833 | 139316 | bwbvprtq | f369a279ffe35c79d1750233df50bf1fbf21f2acbef895... | 2014.0 | 5600000 | Backend Engineer | 2020.0 |
| 160874 | 161518 | xzexzxnt wgbuhntq ogrhnxgzo | c898db4296b5041798eb6bfa2248c3bb59aea3bc718447... | 2012.0 | 1850000 | Backend Engineer | 2019.0 |
| 144172 | 144711 | atryx ntwyzgrgsxwvr hzxctqoxnj | ecf94bdc5f516eba9d481b2043f4520d80dd719a770b37... | 2020.0 | 2800000 | Engineering Intern | 2020.0 |
| 186059 | 186966 | vbvkgz | 1f599abb10a302679d03cc6e041ba764be2c47b78e88c4... | 2019.0 | 476000 |  | 2021.0 |
| 154825 | 155437 | troutwnqv | 7d7187f926a6b4feaa11c825b2df961a442c02a02048f6... | 2020.0 | 240000 | Frontend Engineer | 2020.0 |
| 43641 | 43692 | utqoxontzn ojontbo | 1fc96d15f643454515f5ac6274f6a9cc8f2aaf5e98faea... | 2009.0 | 1600000 |  | 2021.0 |
| 46977 | 47032 | zetqtzwt | 698e367bbea42f49cf72d0244f67c7f4768e3705f6e5fc... | 2017.0 | 3500000 | Backend Engineer | 2021.0 |
| 24942 | 24965 | wggrfxzpo | c29f17ef66fdfcd07e310b259faeef0a167ae7f1f8b14d... | 2014.0 | 2400000 | Engineering Leadership | 2019.0 |
| 41028 | 41078 | uyvqbtvoj | 90c9f20ee4c72c716d3c2813bcd04f870dbe47d1897ebe... | 2018.0 | 1650000 | Frontend Engineer | 2021.0 |
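
Some rows in the sample above have no `job_position`, so a missing-value check is a natural next step. The sketch below counts missing values and shows the two imputation options named in the brief on a small made-up numeric table. It assumes scikit-learn's `KNNImputer` is available; the toy values, the choice of columns, and `n_neighbors=2` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric columns with one missing orgyear (values are made up).
toy = pd.DataFrame({
    "orgyear": [2015.0, 2014.0, np.nan, 2020.0],
    "ctc": [1220000, 5600000, 1850000, 2800000],
    "ctc_updated_year": [2019.0, 2020.0, 2019.0, 2020.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Option 1: mean imputation, filling each gap with its column mean.
mean_filled = toy.fillna(toy.mean())

# Option 2: KNN imputation, filling each gap with the average of the
# nearest rows (measured on the columns that are present).
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)
```

Because KNN imputation is distance based, unscaled columns such as `ctc` dominate the neighbour search, so it is usually worth standardizing the features before imputing. Categorical gaps such as `job_position` need a different treatment, for example a separate "missing" category.
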


In [32]:
import re

In [43]:
mystring='atrgxnnt xzaxv'
re.sub('[^A-Za-z0-9 ]+', '', mystring)

'atrgxnnt xzaxv'
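
The same pattern can be applied to a whole column rather than one string at a time. The sketch below uses pandas' `Series.str.replace` with `regex=True`; the sample values are made up (the real company names are hashed), and the variable names are illustrative.

```python
import re

import pandas as pd

# Made-up company names with stray characters, plus one missing value.
companies = pd.Series(["\tAirtel X Labs", "atrgxnnt xzaxv", "abc-def!!", None])

# Anything that is not a letter, a digit, or a space.
PATTERN = r"[^A-Za-z0-9 ]+"

# Vectorized equivalent of re.sub applied to each non-missing value.
cleaned = companies.str.replace(PATTERN, "", regex=True)

# The same substitution for a single string, using re.sub directly.
single = re.sub(PATTERN, "", "\tAirtel X Labs")
```

Missing values pass through unchanged, so the cleaning step can run before imputation. On the real data the result would typically be assigned back, for example to `df["company_hash"]`.
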

In [46]:
df["email_hash"][0]

'6de0a4417d18ab14334c3f43397fc13b30c35149d70c050c0618caea697c87af'
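
To check how many unique emails there are, how often the same hash repeats, and whether exact duplicate rows exist, something along these lines may help. The toy frame uses shortened, made-up hashes (the hash shown above is 64 hexadecimal characters), and the variable names are illustrative.

```python
import pandas as pd

# Toy rows with shortened, made-up hashes.
toy = pd.DataFrame({
    "email_hash": ["aa11", "aa11", "bb22", "cc33", "cc33", "cc33"],
    "company_hash": ["x", "x", "y", "z", "z", "w"],
    "ctc": [100, 100, 200, 300, 300, 350],
})

# Number of distinct email hashes, i.e. distinct learners.
n_unique = toy["email_hash"].nunique()

# How many rows each email hash appears in, most frequent first.
frequency = toy["email_hash"].value_counts()

# Drop rows that are identical in every column, keeping the first occurrence.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

A hash that still appears more than once after dropping exact duplicates (here `cc33`) points to a learner with several records, for example a job change, which is worth noting as an observation before clustering.
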