# DATA SCIENTIST JOB RECOMMENDER SYSTEM
- Names: 
    - Ingavi Kilavuka
    - Calvin Omwega
    - Alvin Kimathi
    - Ronny Kabiru
- Instructor: Maryann Mwikali
- Modeling Focus: Hybrid NLP & Recommender system 

## Table of contents 
1. [Business Understanding](#1-business-understanding)
    1. [Objectives](#objectives)
    1. [Problem Definition](#Problem-definition)
3. [Imports](#imports)
2. [Data Cleaning](#Data-Cleaning)
    1. [Data engineering](#data-engineeringnlp)

## 1. Business Understanding

## Objective

In the pursuit for jobs in our desired field, This project aims to:
1. *Develop a Personalized Recommendation Algorithm*:
   - Create a machine learning-based recommender system that analyzes a data scientist's profile (e.g., skills, experience, career goals) and matches them with the most relevant job opportunities.

2. *Enhance Job Matching Precision*:
   - Improve the accuracy of job recommendations by incorporating advanced NLP techniques to analyze job descriptions and user profiles, ensuring that recommendations are based on a deep understanding of both the candidate's qualifications and the job requirements.

3. *Improve User Experience*:
   - Design an intuitive user interface that allows data scientists to easily input their preferences, receive job recommendations, and provide feedback to further refine the recommendation algorithm.

4. *Integrate Continuous Learning*:
   - Implement a feedback loop where the recommender system continuously learns from user interactions (e.g., job applications, rejections, and preferences) to improve the relevance of future recommendations.

5. *Address Diversity and Inclusion*:
   - Ensure the recommender system promotes diversity by identifying and mitigating biases in job recommendations, allowing for fair and equitable matching of candidates from various backgrounds.

6. *Performance Evaluation*:
   - Develop metrics to evaluate the success of the recommender system, such as user satisfaction, recommendation accuracy, and the rate of successful job placements.

## Problem definition 
In today's rapidly evolving job market, data scientists face a plethora of job opportunities across various industries and roles. However, the diversity in job descriptions, required skills, and company expectations can make it challenging for data scientists to identify roles that align with their specific skills, experience, and career aspirations. Traditional job search platforms often provide overwhelming results, making it difficult for data scientists to filter through the noise and find the most suitable opportunities. This lack of tailored recommendations can lead to inefficient job searches, missed opportunities, and potentially unsatisfactory job placements. 

### Imports

In [42]:
import pandas as pd
import nltk 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [43]:
df = pd.read_csv('data.csv', on_bad_lines='skip', sep=",", encoding='cp1252')
df

Unnamed: 0.1,Unnamed: 0,URL,Description,Job_Title,Company_Rating,Company_Name,Company_Location,Estimated_Salary,Location_Code
0,1,https://www.glassdoor.com/job-listing/data-sci...,Indeed Prime is a free service that connects q...,,,,,,1152990
1,2,https://www.glassdoor.com/job-listing/machine-...,Position Summary: Machine Learning researchers...,Machine Learning Research Scientist (Entry-Level,3.3,Software Engineering Institute,"– Pittsburgh, PA","$92,000/ year",1152990
2,3,https://www.glassdoor.com/job-listing/data-sci...,,Data Scientist I,2.7,United States Steel,"– Pittsburgh, PA","$126,000/ year",1152990
3,4,https://www.glassdoor.com/job-listing/data-sci...,Support data-driven product decisions that im...,Data Science Intern,4.9,Duolingo,"– Pittsburgh, PA",,1152990
4,5,https://www.glassdoor.com/job-listing/data-ana...,"Bechtel Plant Machinery, Inc. (BPMI) is seekin...",Data Analyst,3.5,BPMI,"– Monroeville, PA","$48,000/ year",1152990
...,...,...,...,...,...,...,...,...,...
1080,1081,https://www.glassdoor.com/job-listing/research...,Research Scientist:Aeroacoustics The National ...,RESEARCH SCIENTIST: Aeroacoustics,3.1,National Institute of Aerospace,"– Hampton, VA","$73,000/ year",1130324
1081,1082,https://www.glassdoor.com/job-listing/call-cen...,Title Call Center Data Analyst 16-Nov-2018 De...,Call Center Data Analyst,2.9,Dollar Tree,"– Chesapeake, VA","$59,000/ year",1130324
1082,1083,https://www.glassdoor.com/job-listing/usmtf-da...,OverviewOasis Systems has an exciting opportu...,USMTF Data Analyst,3.2,"MAR, Incorporated","– Hampton, VA","$63,000/ year",1130324
1083,1084,https://www.glassdoor.com/job-listing/engineer...,Engineer/Scientist 2 (Mechanical) - Navigation...,Engineer/Scientist 2 (Mechanical) - Navigation,3.2,L3 Technologies,"– Norfolk, VA",,1130324


In [44]:
df.columns

Index(['Unnamed: 0', 'URL', 'Description', 'Job_Title', 'Company_Rating',
       'Company_Name', 'Company_Location', 'Estimated_Salary',
       'Location_Code'],
      dtype='object')

In [45]:
df.Company_Location.unique()

array([nan, '\xa0–\xa0Pittsburgh, PA', '\xa0–\xa0Monroeville, PA',
       '\xa0–\xa0Coraopolis, PA', '\xa0–\xa0Blue Bell, PA',
       '\xa0–\xa0Dresher, PA', '\xa0–\xa0Philadelphia, PA',
       '\xa0–\xa0King of Prussia, PA', '\xa0–\xa0Newtown Square, PA',
       '\xa0–\xa0Exton, PA', '\xa0–\xa0Malvern, PA',
       '\xa0–\xa0Kulpsville, PA', '\xa0–\xa0Radnor, PA',
       '\xa0–\xa0Ambler, PA', '\xa0–\xa0Berwyn, PA',
       '\xa0–\xa0Brooklyn, NY', '\xa0–\xa0New York, NY',
       '\xa0–\xa0Boston, MA', '\xa0–\xa0Cambridge, MA',
       '\xa0–\xa0Andover, MA', '\xa0–\xa0Bedford, MA',
       '\xa0–\xa0Framingham, MA', '\xa0–\xa0Baltimore, MD',
       '\xa0–\xa0Laurel, MD', '\xa0–\xa0Timonium, MD',
       '\xa0–\xa0Annapolis Junction, MD', '\xa0–\xa0Towson, MD',
       '\xa0–\xa0Columbia, MD', '\xa0–\xa0Hunt Valley, MD',
       '\xa0–\xa0Sparks, MD', '\xa0–\xa0Hanover, MD',
       '\xa0–\xa0Linthicum Heights, MD', '\xa0–\xa0Fort Meade, MD',
       '\xa0–\xa0Elkridge, MD', '\xa0–\xa0Washingt

In [46]:
df.shape

(1085, 9)

In [47]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Company_Rating,Location_Code
count,1085.0,1020.0,1085.0
mean,543.0,3.662255,1143141.0
std,313.356825,0.567207,8073.385
min,1.0,1.0,1128808.0
25%,272.0,3.3,1136950.0
50%,543.0,3.6,1145013.0
75%,814.0,4.0,1149603.0
max,1085.0,5.0,1155583.0


In [48]:
df.dtypes

Unnamed: 0            int64
URL                  object
Description          object
Job_Title            object
Company_Rating      float64
Company_Name         object
Company_Location     object
Estimated_Salary     object
Location_Code         int64
dtype: object

## Data Cleaning

In [49]:
missing_percentage = df.isnull().mean() * 100

print(missing_percentage)

Unnamed: 0           0.000000
URL                  0.000000
Description          0.552995
Job_Title            3.594470
Company_Rating       5.990783
Company_Name         5.345622
Company_Location     6.175115
Estimated_Salary    28.387097
Location_Code        0.000000
dtype: float64


In [50]:
df_cleaned = df.drop(columns=['Unnamed: 0'])
df_cleaned = df.drop(columns=['URL'])

df_cleaned = df_cleaned.dropna(subset=['Job_Title'])

df_cleaned['Company_Rating'] = df_cleaned['Company_Rating'].fillna('Not Rated')

df_cleaned['Company_Name'] = df_cleaned['Company_Name'].str.strip().str.title()
df_cleaned['Company_Location'] = df_cleaned['Company_Location'].str.strip().str.title()


df_cleaned['Estimated_Salary'] = (
    df_cleaned['Estimated_Salary']
    .str.replace(r'[^\d]', '', regex=True)  
    .astype(float)
)

df_cleaned_info = df_cleaned.info()
df_cleaned_head = df_cleaned.head()

df_cleaned_info, df_cleaned_head

<class 'pandas.core.frame.DataFrame'>
Index: 1046 entries, 1 to 1083
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1046 non-null   int64  
 1   Description       1040 non-null   object 
 2   Job_Title         1046 non-null   object 
 3   Company_Rating    1046 non-null   object 
 4   Company_Name      1026 non-null   object 
 5   Company_Location  1018 non-null   object 
 6   Estimated_Salary  777 non-null    float64
 7   Location_Code     1046 non-null   int64  
dtypes: float64(1), int64(2), object(5)
memory usage: 73.5+ KB


(None,
    Unnamed: 0                                        Description  \
 1           2  Position Summary: Machine Learning researchers...   
 2           3                                                NaN   
 3           4   Support data-driven product decisions that im...   
 4           5  Bechtel Plant Machinery, Inc. (BPMI) is seekin...   
 5           6   The Fort is the new hub for innovation and in...   
 
                                           Job_Title Company_Rating  \
 1  Machine Learning Research Scientist (Entry-Level            3.3   
 2                                  Data Scientist I            2.7   
 3                               Data Science Intern            4.9   
 4                                      Data Analyst            3.5   
 5                                    Data Scientist            3.2   
 
                      Company_Name   Company_Location  Estimated_Salary  \
 1  Software Engineering Institute   – Pittsburgh, Pa           92000.0   

### Data Engineering(NLP)

In [51]:
# Removing Stopwords
# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    if isinstance(text, str):  # Check if the input is a string
        return ' '.join([word for word in text.split() if word.lower() not in stop_words])
    else:
        return text  # If it's not a string, return it as is (could be None or some other type)

# Apply the function to the cleaned_text column
df_cleaned['Description'] = df_cleaned['Description'].apply(remove_stopwords)
df_cleaned['Job_Title'] = df_cleaned['Job_Title'].apply(remove_stopwords)
df_cleaned['Company_Name'] = df_cleaned['Company_Name'].apply(remove_stopwords)
df_cleaned['Company_Location'] = df_cleaned['Company_Location'].apply(remove_stopwords)

# Display the first few rows to verify the changes
df_cleaned[['Description','Job_Title','Company_Name', 'Company_Location']].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HomePC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Description,Job_Title,Company_Name,Company_Location
1,Position Summary: Machine Learning researchers...,Machine Learning Research Scientist (Entry-Level,Software Engineering Institute,"– Pittsburgh, Pa"
2,,Data Scientist,United States Steel,"– Pittsburgh, Pa"
3,Support data-driven product decisions impact l...,Data Science Intern,Duolingo,"– Pittsburgh, Pa"
4,"Bechtel Plant Machinery, Inc. (BPMI) seeking g...",Data Analyst,Bpmi,"– Monroeville, Pa"
5,Fort new hub innovation incubation Fortive. ba...,Data Scientist,Fortive,"– Pittsburgh, Pa"


In [52]:
nltk.download('punkt')

def tokenize_text(text):
    if isinstance(text, str):  # Check if the input is a string
        return word_tokenize(text)
    else:
        return []  # Return an empty list if the text is None or not a string

# Apply the function to the cleaned_text column
df['Title_tokens'] = df['Job_Title'].apply(tokenize_text)
df['Name_tokens'] = df['Company_Name'].apply(tokenize_text)
df['Location_tokens'] = df['Company_Location'].apply(tokenize_text)
df['Description_tokens'] = df['Description'].apply(tokenize_text)

# Display the first few rows to verify the changes
df[['Description_tokens','Name_tokens','Title_tokens','Location_tokens']].head()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HomePC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,Description_tokens,Name_tokens,Title_tokens,Location_tokens
0,"[Indeed, Prime, is, a, free, service, that, co...",[],[],[]
1,"[Position, Summary, :, Machine, Learning, rese...","[Software, Engineering, Institute]","[Machine, Learning, Research, Scientist, (, En...","[–, Pittsburgh, ,, PA]"
2,[],"[United, States, Steel]","[Data, Scientist, I]","[–, Pittsburgh, ,, PA]"
3,"[Support, data-driven, product, decisions, tha...",[Duolingo],"[Data, Science, Intern]","[–, Pittsburgh, ,, PA]"
4,"[Bechtel, Plant, Machinery, ,, Inc., (, BPMI, ...",[BPMI],"[Data, Analyst]","[–, Monroeville, ,, PA]"
