
<div class="alert alert-info" style="background-color:#008492; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> Problem Statement: Job Type Prediction </h2>
</div>

<div class="alert alert-info" style="background-color:#0000FF; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'>Lifecycle of an NLP Project </h2>
</div>

1. Data analysis (EDA) / data cleaning / feature engineering
    - Tokenization, lowercase conversion, digit removal, Unicode normalization, lemmatization, stop word removal, single-character word removal, rare word removal, etc.
2. Convert text to numerical features using TF-IDF
3. Build the model using Random Forest
4. Create the pipeline for TF-IDF and RF
5. Model evaluation
6. Generate/save the joblib file to deploy on Heroku
7. Create the API code with FastAPI
8. Upload files to GitHub
9. Deploy all code and the API to the Heroku environment

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 1. Import the libraries </h2>
</div>

In [1]:
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import re
import unicodedata

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


from sklearn.metrics import (
confusion_matrix, 
classification_report, 
accuracy_score,  
precision_score, 
recall_score, 
f1_score
)


from joblib import dump


In [2]:
# Read the data. The original data was an .xlsx file that raised a UTF-8 encoding issue;
# manually saving it as a .csv file resolved the issue.

train_df = pd.read_csv(r'x0pa_ds_interview_round_2_train.csv', encoding = 'utf-8') 
test_df = pd.read_csv(r'x0pa_ds_interview_round_2_test.csv', encoding = 'utf-8')


In [3]:
train_df.head()

Unnamed: 0,id,Job Title,Type
0,439491,E-Project Manager,Project Management
1,53426,Oracle PL/SQL Developer,Database Administration
2,532645,Senior Software Design Engineer (Smart & Conne...,Design
3,542591,Customer Service Representative of Medical Dev...,Customer Support
4,514151,Clicksoftware Project Manager,Project Management


In [4]:
test_df.head()

Unnamed: 0,id,Job Title
0,123636,Interim IT Project Manager - Virtualization (6...
1,13474,Product Operations Software Engineer (DevOps /...
2,305454,IT User Experience Designer
3,360875,Digitador/a Facturas Masivas- SAP - Huechuraba...
4,274401,PhD Intern - Northeastern University Co-op Stu...


In [5]:
test_x = test_df[["Job Title"]]
test_x.head()

Unnamed: 0,Job Title
0,Interim IT Project Manager - Virtualization (6...
1,Product Operations Software Engineer (DevOps /...
2,IT User Experience Designer
3,Digitador/a Facturas Masivas- SAP - Huechuraba...
4,PhD Intern - Northeastern University Co-op Stu...


In [6]:
train_df['Type'].value_counts()

Non-IT                          11130
Backend Engineer                 5564
Project Management               5209
Product Management               4418
Customer Support                 3945
Data Science                     3928
Design                           3903
Full Stack Engineer              3491
Technical Support                2302
Front End Engineer               1471
Data Analyst                     1300
Mobile Application Developer     1234
Database Administration           621
Cloud architect                   597
Information Security              527
Network Administration            360
Name: Type, dtype: int64

> There are 16 different job types. We settled on a Random Forest model, so we do not label-encode the target.

> Normally **tree-based models** do not require label encoding; scikit-learn classifiers accept string class labels directly.
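The counts above are far from uniform, which matters when reading the accuracy figures later. As a small sketch using only the numbers printed by `value_counts()`, the share of the largest class gives the accuracy a trivial "always predict the majority" baseline would reach (the variable names here are illustrative, not from the notebook):

```python
# Class counts copied from the value_counts() output above.
type_counts = {
    "Non-IT": 11130, "Backend Engineer": 5564, "Project Management": 5209,
    "Product Management": 4418, "Customer Support": 3945, "Data Science": 3928,
    "Design": 3903, "Full Stack Engineer": 3491, "Technical Support": 2302,
    "Front End Engineer": 1471, "Data Analyst": 1300,
    "Mobile Application Developer": 1234, "Database Administration": 621,
    "Cloud architect": 597, "Information Security": 527,
    "Network Administration": 360,
}

total = sum(type_counts.values())
majority_label, majority_count = max(type_counts.items(), key=lambda kv: kv[1])

# Accuracy of a baseline that always predicts the most frequent class.
majority_baseline = majority_count / total
# Ratio between the largest and the smallest class.
imbalance_ratio = majority_count / min(type_counts.values())

print(total, majority_label, round(majority_baseline, 4), round(imbalance_ratio, 1))
# 50000 Non-IT 0.2226 30.9
```

Any model worth deploying should clear the 22% baseline comfortably, and the roughly 31:1 imbalance is one reason to look at per-class metrics as well as overall accuracy.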

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 2. Split the data into X and Y </h2>
</div>

In [7]:
X = train_df[['Job Title']]
y = train_df[['Type']]

X.head()

Unnamed: 0,Job Title
0,E-Project Manager
1,Oracle PL/SQL Developer
2,Senior Software Design Engineer (Smart & Conne...
3,Customer Service Representative of Medical Dev...
4,Clicksoftware Project Manager


In [8]:
y.head()

Unnamed: 0,Type
0,Project Management
1,Database Administration
2,Design
3,Customer Support
4,Project Management


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 3. NLP Data Preprocessing Technique </h2>
</div>

 - Tokenization
 - Lowercase conversion
 - Digit removal
 - Unicode normalization (non-ASCII character removal)
 - Lemmatization
 - Stop word removal
 - Single-character word removal
 - Rare word removal, etc.
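As a rough, spaCy-free illustration of several of these steps on a single title, the sketch below uses only the standard library. The function name `clean_title` and the tiny stop word set are hypothetical stand-ins (the notebook uses spaCy's `STOP_WORDS`), lemmatization and rare word removal are left out, and Unicode normalization is applied before the character filter here so that accented letters are transliterated rather than dropped.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS_SUBSET = {"of", "the", "and", "for", "in"}

def clean_title(title):
    """Lowercase, transliterate to ASCII, strip digits/punctuation, drop stop words."""
    text = str(title).lower()
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Keep only letters, spaces, '#' and '.'; digits and other symbols are removed.
    text = re.sub(r"[^a-z #.]+", "", text)
    # Drop stop words and single-character tokens.
    tokens = [t for t in text.split() if t not in STOP_WORDS_SUBSET and len(t) > 1]
    return " ".join(tokens)

print(clean_title("Oracle PL/SQL Developer"))  # oracle plsql developer
print(clean_title("E-Project Manager"))        # eproject manager
```

Both results match the corresponding rows in the notebook's preprocessed output.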

In [9]:
nlp = spacy.load('en_core_web_md')

In [10]:
# This step is done in app.py itself, so it is not included in the pipeline

# Lemmatization

def make_to_base(x):
    """Return x with each token replaced by its spaCy lemma."""
    x_list = []
    # Tokenization
    doc = nlp(x)

    for token in doc:
        lemma = str(token.lemma_)
        # Keep the original text for pronouns ('-PRON-' is the spaCy v2 pronoun lemma) and for 'be'
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return " ".join(x_list)


In [11]:
# This step is done in app.py itself, so it is not included in the pipeline

def pre_process(X):
    """Clean the 'Job Title' column of X in place and return X."""
    # Lowercase conversion
    X['Job Title'] = X['Job Title'].apply(lambda x: str(x).lower())

    # Digit and special character removal: keep only letters, spaces, '#' and '.'
    X['Job Title'] = X['Job Title'].apply(lambda x: re.sub('[^A-Z a-z # . ]+', '', x))

    # Stop word removal
    X['Job Title'] = X['Job Title'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))

    # Unicode normalization (the regex above has already dropped non-ASCII characters)
    X['Job Title'] = X['Job Title'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore'))

    # Lemmatization
    X['Job Title'] = X['Job Title'].apply(lambda x: make_to_base(x))

    # Single-character word removal
    X['Job Title'] = X['Job Title'].apply(lambda x: " ".join([t for t in x.split() if len(t) != 1]))

    # Rare word removal
    text = ' '.join(X['Job Title'])
    text = text.split()
    freq_comm = pd.Series(text).value_counts()
    # rare_remov_list holds the words that occur only once in the training set
    # (stored as a set of words so the membership test below is explicit and fast)
    rare_remov_list = set(freq_comm[freq_comm == 1].index)
    X['Job Title'] = X['Job Title'].apply(lambda x: " ".join([t for t in x.split() if t not in rare_remov_list]))
    return X

X_pre_proc = pre_process(X)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
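The message above is pandas' `SettingWithCopyWarning`. It appears because `X` was created by selecting a column from `train_df` and is then assigned to inside `pre_process`. One common way to avoid it, sketched here on a two-row toy frame rather than the real data, is to take an explicit copy before modifying:

```python
import pandas as pd

# Toy stand-in for train_df; the real frame is read from the CSV file.
train_df = pd.DataFrame({
    "id": [439491, 53426],
    "Job Title": ["E-Project Manager", "Oracle PL/SQL Developer"],
})

# .copy() makes X an independent DataFrame, so later assignments are unambiguous.
X = train_df[["Job Title"]].copy()
X["Job Title"] = X["Job Title"].str.lower()

print(X["Job Title"].tolist())         # modified copy
print(train_df["Job Title"].tolist())  # original left untouched
```

With the copy in place, `train_df` keeps the raw titles for later comparison and pandas no longer has to guess whether the assignment should affect the original frame.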

In [12]:
# Data after preprocessing
X.head()

Unnamed: 0,Job Title
0,eproject manager
1,oracle plsql developer
2,senior software design engineer smart connected
3,customer service representative medical device
4,project manager
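Row 4 shrinks from "Clicksoftware Project Manager" to "project manager", which is what rare word removal would do if "clicksoftware" occurred only once in the training titles. A minimal standard-library sketch of that step on a made-up four-title corpus (the titles are illustrative, not taken from the dataset):

```python
from collections import Counter

# Toy corpus of already-lowercased titles.
titles = [
    "clicksoftware project manager",
    "senior project manager",
    "senior data analyst",
    "data analyst",
]

# Count every word across the whole corpus.
word_counts = Counter(word for title in titles for word in title.split())

# Words seen exactly once are treated as rare.
rare_words = {word for word, count in word_counts.items() if count == 1}

# Rebuild each title without its rare words.
filtered = [" ".join(w for w in title.split() if w not in rare_words) for title in titles]

print(rare_words)   # {'clicksoftware'}
print(filtered[0])  # project manager
```

Because the rare word list is derived from the corpus itself, the same list has to be reused at prediction time; recomputing it on a single incoming title would mark every word as rare.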


In [13]:
# Data before preprocessing
train_df[['Job Title']].head()

Unnamed: 0,Job Title
0,E-Project Manager
1,Oracle PL/SQL Developer
2,Senior Software Design Engineer (Smart & Conne...
3,Customer Service Representative of Medical Dev...
4,Clicksoftware Project Manager


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 4. Apply TF-IDF </h2>
</div>

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(X_pre_proc['Job Title'])  

In [15]:
#x_tfidf.head()
x_tfidf.shape

(50000, 7419)
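The shape `(50000, 7419)` means one row per job title and one column per vocabulary term. To make the weighting concrete, here is a hand-rolled sketch on three toy titles. It follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` applies with its default settings; it is an illustration of the formula, not a replacement for the library.

```python
import math
from collections import Counter

docs = ["project manager", "oracle plsql developer", "senior project manager"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many titles does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

def tfidf_vector(tokens):
    """Return {term: weight} for one title, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]

print(len(idf))  # vocabulary size, i.e. the number of columns
print(vectors[2]["senior"] > vectors[2]["project"])  # rarer term gets more weight
```

In the third title, "senior" appears in only one document and therefore outweighs "project" and "manager", which appear in two.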

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 5. Build Model </h2>
</div>

In [16]:
# train test split
x_train_tfidf , x_test_tfidf, y_train_tfidf , y_test_tfidf = train_test_split(x_tfidf, y, test_size = 0.2, random_state=21)


# Apply Random Forest 
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)

rfc.fit(x_train_tfidf, y_train_tfidf)


  


RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)

In [17]:
y_pred = rfc.predict(x_test_tfidf)

y_pred

array(['Customer Support', 'Data Science', 'Data Analyst', ...,
       'Data Science', 'Product Management', 'Data Science'], dtype=object)

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 6. Model Evaluation </h2>
</div>

### This is a classification problem, so we use several classification evaluation metrics

- accuracy
- miss_class_rate
- precision
- recall
- f1

In [18]:
# Different evaluation metrics

accuracy =  round(accuracy_score(y_test_tfidf, y_pred),4)
miss_class_rate =  round(1 - accuracy_score(y_test_tfidf, y_pred),4)
precision = round(precision_score(y_test_tfidf, y_pred, average='weighted'),4)
recall = round(recall_score(y_test_tfidf, y_pred, average='weighted'),4)
f1 = round(f1_score(y_test_tfidf, y_pred, average='weighted'),4)

print("accuracy: ", accuracy )
print("miss_class_rate: ", miss_class_rate )
print("precision: ", precision )
print("recall: ", recall )
print("f1: ", f1 )

accuracy:  0.9425
miss_class_rate:  0.0575
precision:  0.9427
recall:  0.9425
f1:  0.9424
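Recall and accuracy are identical in the output above (0.9425). That is expected: with `average='weighted'`, each class's recall is weighted by its share of the true labels, and that sum always reduces to overall accuracy. A small pure-Python sketch on six made-up predictions shows the computation:

```python
from collections import Counter

# Toy labels for illustration only.
y_true = ["Design", "Design", "Non-IT", "Non-IT", "Non-IT", "Data Science"]
y_pred = ["Design", "Non-IT", "Non-IT", "Non-IT", "Design", "Data Science"]

support = Counter(y_true)  # number of true samples per class
n_samples = len(y_true)

def precision_recall(label):
    """Per-class precision and recall from the toy predictions."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / support[label]
    return precision, recall

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(support[c] / n_samples * precision_recall(c)[0] for c in support)
weighted_recall = sum(support[c] / n_samples * precision_recall(c)[1] for c in support)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

print(round(weighted_recall, 4), round(accuracy, 4))
```

Weighted averages are dominated by the large classes, so a macro average or a per-class `classification_report` is a useful complement when the small classes matter.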


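> The model scores about 0.94 on accuracy and on weighted precision, recall, and F1 for the 20% hold-out split. Note that the TF-IDF vocabulary and the rare word list were fitted on the full dataset before the split, so these figures may be slightly optimistic.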

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 7. Pipeline </h2>
</div>

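- Here we create a pipeline for the preprocessed data that applies TF-IDF and Random Forest modelling sequentially. Note that this pipeline uses `RandomForestClassifier()` with default settings rather than the `n_estimators=200` configuration from step 5.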

In [19]:
# Create new pipeline
pipeline = Pipeline(steps= [('tfidf', TfidfVectorizer()),
                            ('rfc', RandomForestClassifier())])

In [20]:
# Fit the pipeline with the preprocessed data.
# y['Type'] passes the target as a 1-D Series; a single-column DataFrame
# makes scikit-learn warn that a column-vector y was passed.
pipeline.fit(X_pre_proc['Job Title'], y['Type'])



Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('rfc', RandomForestClassifier())])

In [21]:
# Predictions on the training data itself (a sanity check, not an evaluation)
pred = pipeline.predict(X_pre_proc['Job Title'])
pred

array(['Project Management', 'Database Administration', 'Design', ...,
       'Technical Support', 'Non-IT', 'Backend Engineer'], dtype=object)

In [22]:
pred[0]

'Project Management'

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 8. Deployment - Joblib file creation </h2>
</div>

In [31]:
# Dump the pipeline model here to create the .joblib file

# Uncompressed file
#dump(pipeline, filename="text_classification.joblib")

# Compressed file (gzip or bz2)
filename = "text_classification.joblib"
#dump(pipeline, filename + '.gz', compress='gzip') # gzip

dump(pipeline, filename + '.bz2', compress=('bz2', 3)) # bz2

['text_classification.joblib.bz2']

> The compressed file "text_classification.joblib.bz2" is saved in the same folder (the current working directory).
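On the serving side the file has to be loaded back. The sketch below round-trips a small stand-in object through `joblib.dump` and `joblib.load` with the same bz2 compression setting; the dictionary replaces the fitted pipeline only to keep the example self-contained, and the temporary directory is an assumption of the sketch.

```python
import os
import tempfile

from joblib import dump, load

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
model_stub = {"labels": ["Design", "Non-IT"], "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_classification.joblib.bz2")
    # Same compression settings as in the notebook cell above.
    dump(model_stub, path, compress=("bz2", 3))
    # load() detects the compression from the file itself.
    restored = load(path)

print(restored == model_stub)
```

When loading a real scikit-learn pipeline, the serving environment should use the same scikit-learn version that was used for training, since pickled estimators are not guaranteed to load across versions.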

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 9. Extra Validation </h2>
</div>

In [24]:
data = {'job_title': 'Senior Software Design Engineer (Smart & Connected)'}
data1 = dict(data)
data1

{'job_title': 'Senior Software Design Engineer (Smart & Connected)'}

In [25]:
# Build a one-row DataFrame directly from the request dictionary
dt = pd.DataFrame([data1])
dt

Unnamed: 0,job_title
0,Senior Software Design Engineer (Smart & Conne...


In [26]:
pipeline.predict(dt['job_title'])[0]

'Design'
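The check above feeds the raw title straight into the pipeline, whereas the pipeline was fitted on preprocessed titles, and the notebook comments say that preprocessing lives in app.py. A hedged sketch of how the two pieces could be combined at prediction time follows; `predict_job_type`, the `clean` argument, and `StubModel` are all hypothetical names used only to make the example runnable without the trained model.

```python
def predict_job_type(title, model, clean):
    """Clean one raw title, then return the model's label for it."""
    cleaned = clean(title)
    # scikit-learn style models expect an iterable of documents, hence the list.
    return model.predict([cleaned])[0]

class StubModel:
    """Minimal stand-in exposing a scikit-learn style predict() method."""

    def predict(self, texts):
        return ["Design" if "design" in text else "Non-IT" for text in texts]

label = predict_job_type(
    "Senior Software Design Engineer (Smart & Connected)",
    StubModel(),
    clean=lambda text: text.lower(),  # placeholder for the real preprocessing
)
print(label)  # Design
```

In the deployed service, `model` would be the pipeline loaded from the joblib file and `clean` the same preprocessing that was applied to the training titles.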

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> END </h2>
</div>