# DATA SCIENTIST JOB RECOMMENDER SYSTEM
- Names: 
    - Ingavi Kilavuka
    - Calvin Omwega
    - Alvin Kimathi
    - Ronny Kabiru
- Instructor: Maryann Mwikali
- Modeling Focus: Hybrid NLP & Recommender system 

## Table of contents 
1. [Business Understanding](#1-business-understanding)
    1. [Objective](#objective)
    2. [Problem Definition](#problem-definition)
2. [Imports](#imports)
3. [Data Cleaning](#data-cleaning)
    1. [Data Engineering (NLP)](#data-engineeringnlp)

## 1. Business Understanding

### Objective

To support data scientists in the pursuit of jobs in their desired field, this project aims to:
1. *Develop a Personalized Recommendation Algorithm*:
   - Create a machine learning-based recommender system that analyzes a data scientist's profile (e.g., skills, experience, career goals) and matches them with the most relevant job opportunities.

2. *Enhance Job Matching Precision*:
   - Improve the accuracy of job recommendations by incorporating advanced NLP techniques to analyze job descriptions and user profiles, ensuring that recommendations are based on a deep understanding of both the candidate's qualifications and the job requirements.

3. *Improve User Experience*:
   - Design an intuitive user interface that allows data scientists to easily input their preferences, receive job recommendations, and provide feedback to further refine the recommendation algorithm.

4. *Integrate Continuous Learning*:
   - Implement a feedback loop where the recommender system continuously learns from user interactions (e.g., job applications, rejections, and preferences) to improve the relevance of future recommendations.

5. *Address Diversity and Inclusion*:
   - Ensure the recommender system promotes diversity by identifying and mitigating biases in job recommendations, allowing for fair and equitable matching of candidates from various backgrounds.

6. *Performance Evaluation*:
   - Develop metrics to evaluate the success of the recommender system, such as user satisfaction, recommendation accuracy, and the rate of successful job placements.
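
One way the "recommendation accuracy" mentioned in objective 6 could be measured is precision at k: the share of the top k recommended jobs that the user actually found relevant. The sketch below is only an illustration; the function name `precision_at_k` and the job IDs are hypothetical, and the project may settle on different metrics.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended job IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = recommended[:k]
    # Count recommendations in the top k that the user marked as relevant
    hits = sum(1 for job_id in top_k if job_id in relevant)
    return hits / k


# Hypothetical ranked recommendations and user feedback
recommended = ["job_12", "job_7", "job_3", "job_44", "job_9"]
relevant = {"job_7", "job_9", "job_100"}

p_at_3 = precision_at_k(recommended, relevant, 3)
p_at_5 = precision_at_k(recommended, relevant, 5)
print(p_at_3, p_at_5)
```

Dividing by `k` rather than by the number of returned items penalizes a system that returns fewer than k recommendations, which is a design choice rather than a requirement.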

### Problem Definition
In today's rapidly evolving job market, data scientists face a plethora of job opportunities across various industries and roles. However, the diversity in job descriptions, required skills, and company expectations can make it challenging for data scientists to identify roles that align with their specific skills, experience, and career aspirations. Traditional job search platforms often provide overwhelming results, making it difficult for data scientists to filter through the noise and find the most suitable opportunities. This lack of tailored recommendations can lead to inefficient job searches, missed opportunities, and potentially unsatisfactory job placements. 

## Imports

In [308]:
import pandas as pd
import nltk 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [309]:
# df = pd.read_csv('data.csv', on_bad_lines='skip', sep=",", encoding='cp1252')
df = pd.read_csv('jd_structured_data.csv')

In [310]:
df.columns

Index(['Job Title', 'Rating', 'Company Name', 'Location', 'Headquarters',
       'Size', 'Founded', 'Type of ownership', 'Industry', 'Sector',
       'Competitors', 'Average Salary', 'Average Revenue', 'Processed_JD'],
      dtype='object')

In [311]:
df.Location.unique()

array(['Albuquerque, NM', 'Linthicum, MD', 'Clearwater, FL',
       'Richland, WA', 'New York, NY', 'Dallas, TX', 'Baltimore, MD',
       'San Jose, CA', 'Rochester, NY', 'Chantilly, VA', 'Plano, TX',
       'Seattle, WA', 'Cambridge, MA', 'Newark, NJ', 'Mountain View, CA',
       'San Francisco, CA', 'Denver, CO', 'Chicago, IL', 'Louisville, KY',
       'Oregon', 'Herndon, VA', 'Hillsboro, OR', 'Worcester, MA',
       'Groton, CT', 'Detroit, MI', 'Sunnyvale, CA', 'Ipswich, MA',
       'Redlands, CA', 'Woburn, MA', 'Fremont, CA', 'Long Beach, NY',
       'Marlborough, MA', 'Allendale, NJ', 'Chandler, AZ',
       'Washington, DC', 'Bellevue, WA', 'Longmont, CO',
       'Beavercreek, OH', 'Peoria, IL', 'Kingdom, IL',
       'Fort Lauderdale, FL', 'Boston, MA', 'Huntsville, AL',
       'Armonk, NY', 'San Diego, CA', 'Saint Louis, MO', 'Lincoln, RI',
       'Cincinnati, OH', 'Palo Alto, CA', 'Coraopolis, PA',
       'Framingham, MA', 'Atlanta, GA', 'New Jersey', 'Philadelphia, PA',
       

In [312]:
df.shape

(956, 14)

In [313]:
df.describe()

Unnamed: 0,Rating,Size,Founded,Average Salary,Average Revenue
count,956.0,956.0,956.0,956.0,956.0
mean,3.601255,3027.393199,1774.605649,103.1539,24319.000761
std,1.067619,3677.688565,598.942517,31.971932,60571.30857
min,-1.0,-1.0,-1.0,15.5,1.0
25%,3.3,350.5,1937.0,84.5,17.5
50%,3.8,750.5,1992.0,103.1539,1500.0
75%,4.2,3027.393199,2008.0,114.0,24319.000761
max,5.0,10000.0,2019.0,254.0,250500.0


In [314]:
df.dtypes

Job Title             object
Rating               float64
Company Name          object
Location              object
Headquarters          object
Size                 float64
Founded                int64
Type of ownership     object
Industry              object
Sector                object
Competitors           object
Average Salary       float64
Average Revenue      float64
Processed_JD          object
dtype: object

In [315]:
df

Unnamed: 0,Job Title,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Competitors,Average Salary,Average Revenue,Processed_JD
0,Data Scientist,3.8,Tecolote Research,"Albuquerque, NM","Goleta, CA",750.5,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,-1,72.0000,75.000000,"Data Scientist Location: Albuquerque, Educatio..."
1,Healthcare Data Scientist,3.4,University of Maryland Medical System,"Linthicum, MD","Baltimore, MD",10000.0,1984,Other Organization,Health Care Services & Hospitals,Health Care,-1,87.5000,3500.000000,What You Will Do: General Summary The Healthca...
2,Data Scientist,4.8,KnowBe4,"Clearwater, FL","Clearwater, FL",750.5,2010,Company - Private,Security Services,Business Services,-1,85.0000,300.000000,"KnowBe4, Inc. high growth information security..."
3,Data Scientist,3.8,PNNL,"Richland, WA","Richland, WA",3000.5,1965,Government,Energy,"Oil, Gas, Energy & Utilities","Oak Ridge National Laboratory, National Renewa...",76.5000,250500.000000,*Organization Job ID** Job ID: 310709 Director...
4,Data Scientist,2.9,Affinity Solutions,"New York, NY","New York, NY",125.5,1998,Company - Private,Advertising & Marketing,Business Services,"Commerce Signals, Cardlytics, Yodlee",114.5000,24319.000761,Data Scientist Affinity Solutions Marketing Cl...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
951,Senior Data Engineer,4.4,Eventbrite,"Nashville, TN","San Francisco, CA",3000.5,2006,Company - Public,Internet,Information Technology,"See Tickets, TicketWeb, Vendini",102.5000,300.000000,THE CHALLENGE Eventbrite world-class data repo...
952,"Project Scientist - Auton Lab, Robotics Institute",2.6,Software Engineering Institute,"Pittsburgh, PA","Pittsburgh, PA",750.5,1984,College / University,Colleges & Universities,Education,-1,73.5000,24319.000761,The Auton Lab Carnegie Mellon University large...
953,Data Science Manager,3.2,"Numeric, LLC","Allentown, PA","Chadds Ford, PA",25.5,-1,Company - Private,Staffing & Outsourcing,Business Services,-1,127.5000,7.500000,Data Science ManagerResponsibilities: Oversee ...
954,Data Engineer,4.8,IGNW,"Austin, TX","Portland, OR",350.5,2015,Company - Private,IT Services,Information Technology,Slalom,103.1539,37.500000,Loading... Title: Data Engineer Location: Aust...


In [316]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Competitors          0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


## Data Cleaning

In [317]:
# 1. Handle Missing Values
# Replace '-1' with NaN (common placeholder for missing data)
#df.replace('-1', pd.NA, inplace=True)
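
The `describe()` output above shows a minimum of -1 for `Rating`, `Size` and `Founded`, and `Competitors` holds the string `-1`, so the placeholder appears in both numeric and string form. `isnull()` does not count either form as missing, which is why every column reports 0.0. A minimal sketch on a toy frame (the values are made up to mimic the real columns) of how both forms could be turned into `NaN` if that is desired:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sentinel values seen in the real data (illustrative values)
toy = pd.DataFrame({
    "Rating": [3.8, -1.0, 4.2],
    "Founded": [1973, -1, 2010],
    "Competitors": ["-1", "Slalom", "-1"],
})

# -1 is an ordinary value, so nothing is reported as missing yet
print(toy.isnull().mean() * 100)

# Replace the numeric and the string form of the placeholder with NaN
cleaned = toy.replace([-1, "-1"], np.nan)
print(cleaned.isnull().mean() * 100)
```

Note that replacing -1 in an integer column such as `Founded` converts it to a float column, because `NaN` is a floating-point value.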

In [318]:
df = df.drop('Competitors', axis=1)


In [319]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


In [320]:
# 2. Data Type Corrections
# Convert 'Size', 'Average Salary', and 'Average Revenue' to numeric (if applicable)
df['Size'] = pd.to_numeric(df['Size'], errors='coerce')
df['Average Salary'] = pd.to_numeric(df['Average Salary'], errors='coerce')
df['Average Revenue'] = pd.to_numeric(df['Average Revenue'], errors='coerce')
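
According to the `df.dtypes` output above, these three columns are already `float64`, so the conversion leaves them unchanged here. The sketch below, on a made-up series, shows what `errors='coerce'` would do if a column did contain non-numeric strings: they become `NaN` silently, so it can be worth counting how many values were coerced.

```python
import pandas as pd

# Made-up raw values; "unknown" cannot be parsed as a number
raw = pd.Series(["750.5", "10000", "unknown", "-1"])

numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards were coerced
n_coerced = int(numeric.isna().sum() - raw.isna().sum())
print(numeric)
print("values coerced to NaN:", n_coerced)
```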

In [321]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


In [322]:
# 3. Outlier Detection
# You might want to inspect specific columns for outliers manually or use statistical methods
# Example: Cap or remove outliers in 'Average Salary' or 'Average Revenue'
q1 = df['Average Salary'].quantile(0.25)
q3 = df['Average Salary'].quantile(0.75)
iqr = q3 - q1

In [323]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


In [324]:
# Define lower and upper bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

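# Filter out outliers; .copy() makes the result an independent frame so that
# later column assignments do not trigger SettingWithCopyWarning
df = df[(df['Average Salary'] >= lower_bound) & (df['Average Salary'] <= upper_bound)].copy()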

# Repeat similar steps for 'Average Revenue' if necessary
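
The IQR rule used above can be wrapped in a small helper so that it is easy to repeat for `Average Revenue`. The sketch below is one possible shape, not the notebook's own code: the helper name `iqr_bounds` and the salary values are illustrative. It also shows capping with `clip` as the alternative to removal that the comment in the outlier cell mentions.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond the first and third quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative salary values, not taken from the dataset
salaries = pd.Series([15.5, 72.0, 84.5, 103.0, 114.0, 127.5, 254.0], name="Average Salary")

lower, upper = iqr_bounds(salaries)

# Option 1: remove rows outside the bounds (what the notebook does)
filtered = salaries[salaries.between(lower, upper)]

# Option 2: keep every row but cap extreme values at the bounds
capped = salaries.clip(lower=lower, upper=upper)

print(lower, upper)
print(len(filtered), len(capped))
```

Removal shrinks the set of jobs that can be recommended, while capping keeps every posting; which is preferable depends on whether high-salary postings should remain available to the recommender.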

In [325]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


In [326]:
# 4. Text Data Cleaning
# Remove leading/trailing spaces and convert text columns to a uniform case
text_columns = ['Job Title', 'Company Name', 'Location', 'Headquarters', 'Type of ownership', 'Industry', 'Sector', 'Processed_JD']
for col in text_columns:
    df[col] = df[col].str.strip().str.lower()


In [327]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


In [328]:
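# 5. Duplicate Detection
# List rows that exactly repeat an earlier row (they are displayed here, not dropped)
duplicates = df[df.duplicated()]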
print(duplicates)

                                             Job Title  Rating  \
30                                      data scientist     4.8   
31                                      data scientist     3.8   
62                                      data scientist     4.1   
63                                      data scientist     3.4   
94                   staff data scientist - technology     3.2   
..                                                 ...     ...   
951                               senior data engineer     4.4   
952  project scientist - auton lab, robotics institute     2.6   
953                               data science manager     3.2   
954                                      data engineer     4.8   
955          research scientist – security and privacy     3.6   

                       Company Name         Location       Headquarters  \
30                          knowbe4   clearwater, fl     clearwater, fl   
31                             pnnl     richland, wa     
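
The cell above only lists the duplicated rows; the frame shown afterwards still ends at index 954, so they do not appear to be removed. If repeated postings should not be recommended more than once, they could be dropped with `drop_duplicates`. A minimal sketch on a toy frame with illustrative rows:

```python
import pandas as pd

# Toy frame with one exact repeat (row 1 repeats row 0)
toy = pd.DataFrame({
    "Job Title": ["data scientist", "data scientist", "data engineer", "data scientist"],
    "Company Name": ["knowbe4", "knowbe4", "ignw", "pnnl"],
    "Location": ["clearwater, fl", "clearwater, fl", "austin, tx", "richland, wa"],
})

# duplicated() marks every row that repeats an earlier one
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```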

In [329]:
df

Unnamed: 0,Job Title,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Average Salary,Average Revenue,Processed_JD
0,data scientist,3.8,tecolote research,"albuquerque, nm","goleta, ca",750.5,1973,company - private,aerospace & defense,aerospace & defense,72.0000,75.000000,"data scientist location: albuquerque, educatio..."
1,healthcare data scientist,3.4,university of maryland medical system,"linthicum, md","baltimore, md",10000.0,1984,other organization,health care services & hospitals,health care,87.5000,3500.000000,what you will do: general summary the healthca...
2,data scientist,4.8,knowbe4,"clearwater, fl","clearwater, fl",750.5,2010,company - private,security services,business services,85.0000,300.000000,"knowbe4, inc. high growth information security..."
3,data scientist,3.8,pnnl,"richland, wa","richland, wa",3000.5,1965,government,energy,"oil, gas, energy & utilities",76.5000,250500.000000,*organization job id** job id: 310709 director...
4,data scientist,2.9,affinity solutions,"new york, ny","new york, ny",125.5,1998,company - private,advertising & marketing,business services,114.5000,24319.000761,data scientist affinity solutions marketing cl...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
951,senior data engineer,4.4,eventbrite,"nashville, tn","san francisco, ca",3000.5,2006,company - public,internet,information technology,102.5000,300.000000,the challenge eventbrite world-class data repo...
952,"project scientist - auton lab, robotics institute",2.6,software engineering institute,"pittsburgh, pa","pittsburgh, pa",750.5,1984,college / university,colleges & universities,education,73.5000,24319.000761,the auton lab carnegie mellon university large...
953,data science manager,3.2,"numeric, llc","allentown, pa","chadds ford, pa",25.5,-1,company - private,staffing & outsourcing,business services,127.5000,7.500000,data science managerresponsibilities: oversee ...
954,data engineer,4.8,ignw,"austin, tx","portland, or",350.5,2015,company - private,it services,information technology,103.1539,37.500000,loading... title: data engineer location: aust...


In [330]:

# 6. Data Standardization
# Standardize location names (e.g., convert 'NY' to 'New York' if such inconsistencies exist)
# This step would require knowing the specific inconsistencies to correct

# Example: Standardizing common location variations (simplified example)
df['Location'] = df['Location'].str.replace('new york, ny', 'new york, new york', regex=False)

df


Unnamed: 0,Job Title,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Average Salary,Average Revenue,Processed_JD
0,data scientist,3.8,tecolote research,"albuquerque, nm","goleta, ca",750.5,1973,company - private,aerospace & defense,aerospace & defense,72.0000,75.000000,"data scientist location: albuquerque, educatio..."
1,healthcare data scientist,3.4,university of maryland medical system,"linthicum, md","baltimore, md",10000.0,1984,other organization,health care services & hospitals,health care,87.5000,3500.000000,what you will do: general summary the healthca...
2,data scientist,4.8,knowbe4,"clearwater, fl","clearwater, fl",750.5,2010,company - private,security services,business services,85.0000,300.000000,"knowbe4, inc. high growth information security..."
3,data scientist,3.8,pnnl,"richland, wa","richland, wa",3000.5,1965,government,energy,"oil, gas, energy & utilities",76.5000,250500.000000,*organization job id** job id: 310709 director...
4,data scientist,2.9,affinity solutions,"new york, new york","new york, ny",125.5,1998,company - private,advertising & marketing,business services,114.5000,24319.000761,data scientist affinity solutions marketing cl...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
951,senior data engineer,4.4,eventbrite,"nashville, tn","san francisco, ca",3000.5,2006,company - public,internet,information technology,102.5000,300.000000,the challenge eventbrite world-class data repo...
952,"project scientist - auton lab, robotics institute",2.6,software engineering institute,"pittsburgh, pa","pittsburgh, pa",750.5,1984,college / university,colleges & universities,education,73.5000,24319.000761,the auton lab carnegie mellon university large...
953,data science manager,3.2,"numeric, llc","allentown, pa","chadds ford, pa",25.5,-1,company - private,staffing & outsourcing,business services,127.5000,7.500000,data science managerresponsibilities: oversee ...
954,data engineer,4.8,ignw,"austin, tx","portland, or",350.5,2015,company - private,it services,information technology,103.1539,37.500000,loading... title: data engineer location: aust...
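
Replacing one literal string at a time does not scale to every state. One possible generalization is a lookup table applied to the part after the last comma. In the sketch below the mapping `STATE_NAMES` and the helper `expand_state` are hypothetical and cover only a few states; entries without a comma, such as `oregon`, are left unchanged.

```python
import pandas as pd

# Hypothetical, deliberately incomplete abbreviation map (lowercase, as in the cleaned data)
STATE_NAMES = {"ny": "new york", "ca": "california", "tx": "texas"}


def expand_state(location):
    """Expand a trailing state abbreviation, e.g. 'austin, tx' -> 'austin, texas'."""
    if not isinstance(location, str) or "," not in location:
        return location  # state-only entries and non-strings pass through
    city, state = location.rsplit(",", 1)
    state = state.strip()
    # Unknown abbreviations are kept as they are
    return f"{city.strip()}, {STATE_NAMES.get(state, state)}"


locations = pd.Series(["new york, ny", "austin, tx", "oregon", "clearwater, fl"])
standardized = locations.map(expand_state)
print(standardized.tolist())
```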


In [331]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Job Title            0.0
Rating               0.0
Company Name         0.0
Location             0.0
Headquarters         0.0
Size                 0.0
Founded              0.0
Type of ownership    0.0
Industry             0.0
Sector               0.0
Average Salary       0.0
Average Revenue      0.0
Processed_JD         0.0
dtype: float64


### Data Engineering(NLP)

In [332]:
# Removing Stopwords
# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    if isinstance(text, str):  # Check if the input is a string
        return ' '.join([word for word in text.split() if word.lower() not in stop_words])
    else:
        return text  # If it's not a string, return it as is (could be None or some other type)

# Apply the function to each text column
df['Processed_JD'] = df['Processed_JD'].apply(remove_stopwords)
df['Location'] = df['Location'].apply(remove_stopwords)
df['Sector'] = df['Sector'].apply(remove_stopwords)
df['Industry'] = df['Industry'].apply(remove_stopwords)

# Display the first few rows to verify the changes
df[['Processed_JD','Location','Industry', 'Sector']].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HomePC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Processed_JD,Location,Industry,Sector
0,"data scientist location: albuquerque, educatio...","albuquerque, nm",aerospace & defense,aerospace & defense
1,do: general summary healthcare data scientist ...,"linthicum, md",health care services & hospitals,health care
2,"knowbe4, inc. high growth information security...","clearwater, fl",security services,business services
3,*organization job id** job id: 310709 director...,"richland, wa",energy,"oil, gas, energy & utilities"
4,data scientist affinity solutions marketing cl...,"new york, new york",advertising & marketing,business services
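
Row 1 of the output above still starts with `do:` because `text.split()` leaves punctuation attached to the word, and `do:` is not in the stop-word set even though `do` is. A possible variant, sketched below, strips surrounding punctuation before the lookup. The small `STOP_WORDS` set is hand-picked for illustration so that the example runs without NLTK; the notebook itself uses NLTK's English list.

```python
import string

# Hand-picked stop words for illustration only; the notebook uses NLTK's English list
STOP_WORDS = {"what", "you", "will", "do", "the", "a", "of"}


def remove_stopwords_punct_aware(text, stop_words=STOP_WORDS):
    """Drop stop words even when punctuation is attached, e.g. 'do:'."""
    if not isinstance(text, str):
        return text
    kept = []
    for word in text.split():
        # Compare the word without leading/trailing punctuation, in lowercase
        core = word.strip(string.punctuation).lower()
        if core and core in stop_words:
            continue
        kept.append(word)
    return " ".join(kept)


sample = "what you will do: general summary the healthcare data scientist"
result = remove_stopwords_punct_aware(sample)
print(result)
```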


In [333]:
nltk.download('punkt')

def tokenize_text(text):
    if isinstance(text, str):  # Check if the input is a string
        return word_tokenize(text)
    else:
        return []  # Return an empty list if the text is None or not a string

# Tokenize each text column into its own token column
df['JD_tokens'] = df['Processed_JD'].apply(tokenize_text)
df['Location_tokens'] = df['Location'].apply(tokenize_text)
df['Industry_tokens'] = df['Industry'].apply(tokenize_text)
df['Sector_tokens'] = df['Sector'].apply(tokenize_text)

# Display the first few rows to verify the changes
df[['JD_tokens','Industry_tokens','Sector_tokens','Location_tokens']].head()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HomePC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,JD_tokens,Industry_tokens,Sector_tokens,Location_tokens
0,"[data, scientist, location, :, albuquerque, ,,...","[aerospace, &, defense]","[aerospace, &, defense]","[albuquerque, ,, nm]"
1,"[do, :, general, summary, healthcare, data, sc...","[health, care, services, &, hospitals]","[health, care]","[linthicum, ,, md]"
2,"[knowbe4, ,, inc., high, growth, information, ...","[security, services]","[business, services]","[clearwater, ,, fl]"
3,"[*, organization, job, id, *, *, job, id, :, 3...",[energy],"[oil, ,, gas, ,, energy, &, utilities]","[richland, ,, wa]"
4,"[data, scientist, affinity, solutions, marketi...","[advertising, &, marketing]","[business, services]","[new, york, ,, new, york]"
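
The token lists above still contain punctuation-only tokens such as `,`, `&`, `:` and `*`, which carry little signal for matching jobs to profiles. If they are not wanted, one simple filter keeps only tokens with at least one letter or digit. The helper name `drop_punctuation_tokens` is an assumption for this sketch, and the input is copied from the `Sector_tokens` output rather than produced by NLTK here.

```python
def drop_punctuation_tokens(tokens):
    """Keep tokens that contain at least one alphanumeric character."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# Token list as shown in the Sector_tokens output for row 3
tokens = ["oil", ",", "gas", ",", "energy", "&", "utilities"]
clean = drop_punctuation_tokens(tokens)
print(clean)
```

Tokens such as `inc.` pass through unchanged because they contain letters; in the notebook the filter could be applied to each token column with `.apply(drop_punctuation_tokens)`.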
