
<div class="alert alert-info" style="background-color:#008492; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> Problem Statement: Job Type Prediction </h2>
</div>

- X0PA products are often built around APIs that wrap machine learning and NLP algorithms.
- Real-world data, especially in human resources, arrives in many different schemas, so it is important to standardize it into one.
- For this problem, we look at classifying job titles into job functions. Initially, we tried to train a model that classifies job titles into all job functions. However, we found that the information technology job functions were too general, so it was important to break them down further into subclasses.
- With this finer breakdown, we can match candidates to jobs with higher accuracy.
- This problem will test your ability to build a basic NLP model based on a given dataset.
- Your end task will be to:
    - develop a model to predict one of the 16 classes (see Variables Schema);
    - provide justification for model evaluation and report your results; and
    - deploy your model in the form of an API endpoint (any API framework will do, but FastAPI is preferred).

- You will be assessed on:
    -  your ability to build a model pipeline and deploy it as an API (50%);
    -  model accuracy (10%); and
    -  writing clean, readable code (40%)



###### Variables Schema: 
###### Column Name       -      Description
- **id**
        - A unique identifier for every job title. This is purely for our reference. Do not use it at all.
- **Job Title**
        - Job title scraped from the job description. Note that this data is unclean and may contain unnecessary information. Several job titles also include the duration.
        - This column is your input feature (X).
- **Type**
        - Job function of a job. This column is your target label (y). It consists of the following 16 classes:
        - Non-IT, Backend Engineer, Project Management, Product Management, Customer Support, Design, Data Science, Full Stack Engineer, Technical Support, Front End Engineer, Data Analyst, Mobile Application Developer, Database Administration, Cloud architect, Information Security, Network Administration


#### Results
- Submit an updated test file with a new third column called "Type". Each value in this column must be one of the 16 classes.

## Lifecycle of an NLP Project
1. Data Analysis (EDA) / Data Cleaning / Feature Engineering
    - Tokenization, lowercase conversion, digit removal, non-ASCII character removal, lemmatization, stop-word removal, single-character word removal, rare-word removal, etc.
2. Convert text to numerical features
    - BOW, TF-IDF, etc.
3. Model Building
4. Model Evaluation
5. Finalize the best model
6. Create the pipeline (done in another notebook)
7. Apply the same model to the test data and create the final submission file

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 1. Import the libraries </h2>
</div>

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import re
import unicodedata

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer



from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB


from sklearn.metrics import (
confusion_matrix, 
classification_report, 
accuracy_score,  
precision_score, 
recall_score, 
f1_score
)


In [2]:
# Read the data. The original files were in .xlsx format and raised a UTF-8 encoding issue,
# so they were manually re-saved as .csv files, which resolved it.

train_df = pd.read_csv(r'x0pa_ds_interview_round_2_train.csv', encoding = 'utf-8') 
test_df = pd.read_csv(r'x0pa_ds_interview_round_2_test.csv', encoding = 'utf-8')


In [3]:
train_df.head()

Unnamed: 0,id,Job Title,Type
0,439491,E-Project Manager,Project Management
1,53426,Oracle PL/SQL Developer,Database Administration
2,532645,Senior Software Design Engineer (Smart & Conne...,Design
3,542591,Customer Service Representative of Medical Dev...,Customer Support
4,514151,Clicksoftware Project Manager,Project Management


In [4]:
test_df.head()

Unnamed: 0,id,Job Title
0,123636,Interim IT Project Manager - Virtualization (6...
1,13474,Product Operations Software Engineer (DevOps /...
2,305454,IT User Experience Designer
3,360875,Digitador/a Facturas Masivas- SAP - Huechuraba...
4,274401,PhD Intern - Northeastern University Co-op Stu...


In [5]:
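# Take an explicit copy so later in-place preprocessing does not raise SettingWithCopyWarning
test_x = test_df[["Job Title"]].copy()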
test_x.head()

Unnamed: 0,Job Title
0,Interim IT Project Manager - Virtualization (6...
1,Product Operations Software Engineer (DevOps /...
2,IT User Experience Designer
3,Digitador/a Facturas Masivas- SAP - Huechuraba...
4,PhD Intern - Northeastern University Co-op Stu...


In [6]:
train_df['Type'].value_counts()

Non-IT                          11130
Backend Engineer                 5564
Project Management               5209
Product Management               4418
Customer Support                 3945
Data Science                     3928
Design                           3903
Full Stack Engineer              3491
Technical Support                2302
Front End Engineer               1471
Data Analyst                     1300
Mobile Application Developer     1234
Database Administration           621
Cloud architect                   597
Information Security              527
Network Administration            360
Name: Type, dtype: int64

> The target has 16 different job types, and the classes are imbalanced (Non-IT has 11,130 rows, Network Administration only 360). A label encoder is used later to convert these classes to numerical values.
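To put the later accuracy figures in context, the class counts above can be turned into a majority-class baseline. This is a minimal sketch in plain Python; the counts are copied from the `value_counts()` output, and the variable names are illustrative rather than part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "Non-IT": 11130,
    "Backend Engineer": 5564,
    "Project Management": 5209,
    "Product Management": 4418,
    "Customer Support": 3945,
    "Data Science": 3928,
    "Design": 3903,
    "Full Stack Engineer": 3491,
    "Technical Support": 2302,
    "Front End Engineer": 1471,
    "Data Analyst": 1300,
    "Mobile Application Developer": 1234,
    "Database Administration": 621,
    "Cloud architect": 597,
    "Information Security": 527,
    "Network Administration": 360,
}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
minority_class = min(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

# How many majority-class rows exist for each minority-class row.
imbalance_ratio = class_counts[majority_class] / class_counts[minority_class]

print(f"rows: {total}, classes: {len(class_counts)}")
print(f"majority baseline accuracy: {baseline_accuracy:.4f}")
print(f"imbalance ratio ({majority_class} / {minority_class}): {imbalance_ratio:.1f}")
```

Any useful classifier should score well above this baseline, and the imbalance is one reason to look at precision, recall and F1 in addition to accuracy.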

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 2. Split the data into X and Y </h2>
</div>

In [7]:
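# Take an explicit copy so later in-place preprocessing does not raise SettingWithCopyWarning
X = train_df[['Job Title']].copy()
y = train_df[['Type']]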

X.head()

Unnamed: 0,Job Title
0,E-Project Manager
1,Oracle PL/SQL Developer
2,Senior Software Design Engineer (Smart & Conne...
3,Customer Service Representative of Medical Dev...
4,Clicksoftware Project Manager


In [8]:
y.head()

Unnamed: 0,Type
0,Project Management
1,Database Administration
2,Design
3,Customer Support
4,Project Management


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 3. Label Encoding - Target Col </h2>
</div>

In [9]:
# Initialize the label encoder
le = preprocessing.LabelEncoder()

# Fit on the 1-D 'Type' column (passing the whole DataFrame triggers a column-vector warning)
le.fit(y['Type'])
le.classes_



array(['Backend Engineer', 'Cloud architect', 'Customer Support',
       'Data Analyst', 'Data Science', 'Database Administration',
       'Design', 'Front End Engineer', 'Full Stack Engineer',
       'Information Security', 'Mobile Application Developer',
       'Network Administration', 'Non-IT', 'Product Management',
       'Project Management', 'Technical Support'], dtype=object)

In [10]:
y_labeled = le.transform(y['Type'])
y_labeled[:5]

array([14,  5,  6,  2, 14])

In [11]:
y.head()

Unnamed: 0,Type
0,Project Management
1,Database Administration
2,Design
3,Customer Support
4,Project Management


In [12]:
le.inverse_transform(y_labeled[:5])

array(['Project Management', 'Database Administration', 'Design',
       'Customer Support', 'Project Management'], dtype=object)
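The mapping produced by the label encoder can be reproduced in plain Python, which makes the codes above easier to read. The sketch below is an illustration of the idea (sorted unique classes, position used as the code), not scikit-learn's implementation. The helper names are hypothetical.

```python
def fit_label_encoder(labels):
    """Return the sorted unique classes, analogous to LabelEncoder.classes_."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map each label to its position in the sorted class list."""
    index = {name: code for code, name in enumerate(classes)}
    return [index[name] for name in labels]


def inverse_transform(classes, codes):
    """Map integer codes back to their class names."""
    return [classes[code] for code in codes]


# The 16 classes reported by le.classes_ above (order does not matter here).
all_labels = [
    "Non-IT", "Backend Engineer", "Project Management", "Product Management",
    "Customer Support", "Design", "Data Science", "Full Stack Engineer",
    "Technical Support", "Front End Engineer", "Data Analyst",
    "Mobile Application Developer", "Database Administration",
    "Cloud architect", "Information Security", "Network Administration",
]

classes = fit_label_encoder(all_labels)

# The first five training labels shown in y.head().
sample = ["Project Management", "Database Administration", "Design",
          "Customer Support", "Project Management"]

codes = transform(classes, sample)
print(codes)  # [14, 5, 6, 2, 14], matching y_labeled[:5]
print(inverse_transform(classes, codes) == sample)
```

Because the codes depend only on the sorted class names, the same encoder must be reused to decode predictions later.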

<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 4. NLP Data Preprocessing Technique </h2>
</div>

 - Tokenization
 - Lowercase conversion
 - Digit removal
 - Non-ASCII character removal
 - Lemmatization
 - Stop-word removal
 - Single-character word removal
 - Rare-word removal, etc.
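Several of the steps listed above need only the standard library, so they can be tried on a single title before running the full spaCy-based function. The sketch below is a simplified illustration under stated assumptions: `DEMO_STOP_WORDS` is a tiny hypothetical stop-word list (the notebook uses spaCy's `STOP_WORDS`), lemmatization and rare-word removal are left out, and accents are normalized before the character filter so that accented letters become their ASCII base letter instead of being deleted.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word list for illustration only.
DEMO_STOP_WORDS = {"of", "the", "and", "for", "in"}


def clean_title(title):
    """Apply lowercasing, accent stripping, character filtering,
    stop-word removal and single-character word removal to one title."""
    text = str(title).lower()
    # Map accented characters to their ASCII base letter where possible.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Keep only letters, spaces, '#' and '.'; this also removes digits.
    text = re.sub(r"[^a-z #.]+", "", text)
    tokens = [t for t in text.split() if t not in DEMO_STOP_WORDS]
    tokens = [t for t in tokens if len(t) != 1]
    return " ".join(tokens)


for raw in ["E-Project Manager",
            "Oracle PL/SQL Developer",
            "Customer Service Representative of Medical Devices",
            "Project Manager (6 months)"]:
    print(f"{raw!r} -> {clean_title(raw)!r}")
```

Keeping `#` and `.` in the allowed characters preserves tokens such as `c#`, which carry signal for developer roles.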

In [13]:
nlp = spacy.load('en_core_web_md')

In [14]:
# Lemmatization

def make_to_base(x):
    """Return `x` with every token replaced by its lemma."""
    x_list = []
    # Tokenization
    doc = nlp(x)

    for token in doc:
        lemma = str(token.lemma_)
        # Keep the original text for pronouns ('-PRON-' is the spaCy v2 pronoun lemma) and for 'be'
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return " ".join(x_list)


In [15]:

def pre_process(X):
    """Clean the 'Job Title' column of X in place and return X."""
    # Lowercase conversion
    X['Job Title'] = X['Job Title'].apply(lambda x: str(x).lower())

    # Digit and punctuation removal: keep only letters, spaces, '#' and '.'
    X['Job Title'] = X['Job Title'].apply(lambda x: re.sub('[^A-Z a-z # . ]+', '', x))

    # Stop-word removal
    X['Job Title'] = X['Job Title'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))

    # Non-ASCII character removal (the regex above has already dropped
    # accented letters, so this step acts as a safeguard)
    X['Job Title'] = X['Job Title'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore'))

    # Lemmatization
    X['Job Title'] = X['Job Title'].apply(make_to_base)

    # Single-character word removal
    X['Job Title'] = X['Job Title'].apply(lambda x: " ".join([t for t in x.split() if len(t) != 1]))

    # Rare-word removal: drop words that occur only once in the data passed in
    words = ' '.join(X['Job Title']).split()
    freq_comm = pd.Series(words).value_counts()
    rare_remov_set = set(freq_comm[freq_comm == 1].index)
    X['Job Title'] = X['Job Title'].apply(lambda x: " ".join([t for t in x.split() if t not in rare_remov_set]))
    return X

X = pre_process(X)



In [16]:
#Data after  Preprocessing
X.head()

Unnamed: 0,Job Title
0,eproject manager
1,oracle plsql developer
2,senior software design engineer smart connected
3,customer service representative medical device
4,project manager


In [17]:
#Data Before  Preprocessing
train_df[['Job Title']].head()

Unnamed: 0,Job Title
0,E-Project Manager
1,Oracle PL/SQL Developer
2,Senior Software Design Engineer (Smart & Conne...
3,Customer Service Representative of Medical Dev...
4,Clicksoftware Project Manager
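The rare-word step explains why `clicksoftware` disappears from row 4 above. A minimal sketch of that step with `collections.Counter` follows. The titles are hypothetical toy data, and the helper names are not from the notebook. It also shows one possible design choice, which is an assumption rather than what `pre_process` does: learn the rare-word set once from the training titles and reuse it for new titles, instead of recomputing it on each dataset.

```python
from collections import Counter


def find_rare_words(titles, max_count=1):
    """Return the set of words that occur at most `max_count` times overall."""
    counts = Counter(word for title in titles for word in title.split())
    return {word for word, n in counts.items() if n <= max_count}


def drop_words(title, words_to_drop):
    """Remove every word of `title` that is in `words_to_drop`."""
    return " ".join(t for t in title.split() if t not in words_to_drop)


# Hypothetical, already-cleaned training titles.
train_titles = [
    "eproject manager",
    "clicksoftware project manager",
    "project manager",
    "oracle plsql developer",
    "oracle developer",
]

rare_words = find_rare_words(train_titles)
print(sorted(rare_words))

cleaned_train = [drop_words(t, rare_words) for t in train_titles]
print(cleaned_train)

# Reusing the training-derived set on an unseen title.
print(drop_words("clicksoftware developer", rare_words))
```

Words that occur once cannot help a model generalize, so dropping them shrinks the vocabulary at little cost.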


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 5. Apply BOW </h2>
</div>

In [18]:
# Initialize CountVectorizer

cv = CountVectorizer()

In [19]:
text_counts = cv.fit_transform(X['Job Title'].values.astype('U'))  

In [20]:
text_counts.toarray().shape

(50000, 7419)

In [21]:
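# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
x_bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names_out())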

In [22]:
x_bow.head(5)

Unnamed: 0,aa,aaa,aakash,aalst,aan,aas,ab,aba,abap,abapfiori,...,zona,zone,zoom,zoomdata,zrjob,zu,zuidoost,zunik,zupee,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
x_bow[['eproject','oracle','software','developer','engineer']].head()

Unnamed: 0,eproject,oracle,software,developer,engineer
0,1,0,0,0,0
1,0,1,0,1,0
2,0,0,1,0,1
3,0,0,0,0,0
4,0,0,0,0,0
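The table above is easier to interpret with a hand-built bag-of-words on a few titles. The sketch below is a simplified stand-in for `CountVectorizer`: it splits on whitespace only, whereas the default scikit-learn tokenizer also drops single-character tokens and punctuation. The function names are illustrative.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Collect the sorted set of distinct tokens across all documents."""
    return sorted({token for doc in docs for token in doc.split()})


def count_vector(doc, vocabulary):
    """Return one count per vocabulary term for a single document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


# Cleaned titles taken from the X.head() output above.
docs = [
    "eproject manager",
    "oracle plsql developer",
    "senior software design engineer smart connected",
]

vocabulary = fit_vocabulary(docs)
matrix = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary term, which is why the real matrix has 7,419 columns and is almost entirely zeros.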


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 6. Apply TF-IDF </h2>
</div>

In [24]:
# Initialize tfidf

tfidf = TfidfVectorizer()
text_counts_tfidf = tfidf.fit_transform(X['Job Title'].values.astype('U'))  

In [25]:
#x_tfidf.head()
text_counts_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [26]:
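# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
x_tfidf = pd.DataFrame(text_counts_tfidf.toarray(), columns=tfidf.get_feature_names_out())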

In [27]:
x_tfidf.head(5)

Unnamed: 0,aa,aaa,aakash,aalst,aan,aas,ab,aba,abap,abapfiori,...,zona,zone,zoom,zoomdata,zrjob,zu,zuidoost,zunik,zupee,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
x_tfidf[['eproject','oracle','software','developer','engineer']].head()

Unnamed: 0,eproject,oracle,software,developer,engineer
0,0.957274,0.0,0.0,0.0,0.0
1,0.0,0.617832,0.0,0.260992,0.0
2,0.0,0.0,0.2535,0.0,0.202471
3,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0
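The weights above follow from the default `TfidfVectorizer` settings: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The plain-Python sketch below reproduces that formula on three toy titles. It is an illustration with whitespace tokenization, not the library code, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Compute L2-normalized TF-IDF rows using a smoothed idf."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows


# Toy corpus: 'manager' is common, 'eproject' appears only once.
docs = ["eproject manager", "project manager", "oracle plsql developer"]
vocabulary, rows = tfidf_matrix(docs)

first = dict(zip(vocabulary, rows[0]))
print({term: round(weight, 4) for term, weight in first.items() if weight > 0})
```

The rarer term gets the larger weight within a title, which is why `eproject` dominates row 0 of the real matrix while common words such as `developer` and `engineer` receive smaller values.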



<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 7. Build Model </h2>
</div>

##### The pre-processed BOW and TF-IDF data are used to train the models below

- SGDClassifier
- LogisticRegression
- LinearSVC
- RandomForestClassifier
- GaussianNB

In [29]:
sgd = SGDClassifier(n_jobs=-1, random_state=42, max_iter=200)

lgr = LogisticRegression(random_state=42, max_iter=200)

svm = LinearSVC(random_state=42, max_iter=200)

rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)

gnb = GaussianNB()

In [30]:
# Classifier dictionary

clf = {'SGD': sgd, 'LGR': lgr,  'SVM': svm, 'RFC': rfc, 'GNB': gnb}
clf.keys()

dict_keys(['SGD', 'LGR', 'SVM', 'RFC', 'GNB'])

In [31]:
# Evaluation metrics table
results_df = pd.DataFrame(columns=['Model', 'Accuracy', 'Miss_class_Rate', 'Precision_Score',
                                   'Recall_Score', 'f1_Score'])
results_df

Unnamed: 0,Model,Accuracy,Miss_class_Rate,Precision_Score,Recall_Score,f1_Score


### Evaluation Function with different models

In [32]:

def classification_metrics_udf(key, y_test, y_predict):
    """Append the metrics of one model to results_df and print its reports."""
    global results_df

    accuracy = round(accuracy_score(y_test, y_predict), 4)

    # Misclassification rate = 1 - accuracy
    miss_class_rate = round(1 - accuracy_score(y_test, y_predict), 4)

    # average='weighted' averages the per-class scores by class support
    precision = round(precision_score(y_test, y_predict, average='weighted'), 4)

    recall = round(recall_score(y_test, y_predict, average='weighted'), 4)

    f1 = round(f1_score(y_test, y_predict, average='weighted'), 4)

    results_df_2 = pd.DataFrame(data=[[key, accuracy, miss_class_rate, precision, recall, f1]],
                                columns=['Model', 'Accuracy', 'Miss_class_Rate', 'Precision_Score',
                                         'Recall_Score', 'f1_Score'])
    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    results_df = pd.concat([results_df, results_df_2], ignore_index=True)

    print("Confusion Matrix :\n\n", confusion_matrix(y_test, y_predict))

    print("\n\n Classification Report: \n\n", classification_report(y_test, y_predict))
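To show what `average='weighted'` computes, the sketch below implements the weighted precision, recall and F1 in plain Python on a hypothetical three-class example. It is an illustration of the definitions, not scikit-learn's code. One consequence worth knowing: support-weighted recall is always equal to accuracy, so the `Recall_Score` and `Accuracy` columns of the results table carry the same information.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) with support-weighted averaging."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0

    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]

        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0

        # Each class contributes in proportion to its number of true samples.
        weight = actual / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return accuracy, precision, recall, f1


# Hypothetical labels for illustration.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because weighted scores are dominated by the large classes, a macro average or the per-class report is a useful complement when the small classes matter.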

    

### Function to train the data with different models

- Apply MinMax scaling (normalization) to the input data
- Split the data into train and test sets
- Fit each model
- Call the evaluation function to record and print the evaluation metrics

In [33]:
# Train every classifier in `clf` and evaluate it on a held-out split.
y_pred_final = {}  # Holds the decoded predictions of each model

def classify(X, y, typ1):

    # MinMax scaling (normalization)
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)

    # Split the data into train and test sets, stratified on the label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    for key in clf.keys():
        clf[key].fit(X_train, y_train)
        y_pred = clf[key].predict(X_test)
        ac = accuracy_score(y_test, y_pred)

        # Revert the label encoding
        y_pred_rev = le.inverse_transform(y_pred)
        y_pred_final[key] = y_pred_rev

        # Record the evaluation metrics, tagging the model name with the feature type
        model_name = key + '_' + typ1
        print(model_name, " ---> ", ac)
        classification_metrics_udf(model_name, y_test, y_pred)
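The function above fits the scaler on all rows before splitting. A common alternative, sketched here in plain Python as an assumption rather than as the notebook's method, is to learn the minimum and maximum from the training rows only and reuse them for held-out rows. The helper names and the toy numbers are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned statistics; constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy count features: three training rows and one held-out row.
train_rows = [[0, 2, 1], [1, 0, 1], [2, 1, 1]]
test_rows = [[3, 1, 1]]

mins, maxs = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)

print(train_scaled)
print(test_scaled)  # held-out values may fall outside [0, 1]
```

Whichever ordering is used, the same fitted scaler has to be applied to any data the model later predicts on, so it is worth keeping the scaler object alongside the model.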

        


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 7.1  Run All Models on BOW Data</h2>
</div>


In [34]:
# Run all models on the BOW data

classify(x_bow, y_labeled , 'bow')


SGD_bow  --->  0.9297
Confusion Matrix :

 [[ 989    6    1    0    8    2    2   14   47    0    1    0   30    5
     4    4]
 [   0  105    0    0    0    1    0    0    1    1    0    0    4    1
     3    3]
 [   0    0  769    4    1    0    0    0    0    0    0    0    5    1
     5    4]
 [   0    0    2  235   12    0    0    0    0    0    0    0    5    3
     2    1]
 [   3    0    3   38  713    1    1    0    2    1    0    0   12    3
     4    5]
 [   4    0    0    0    4  109    0    0    2    0    0    0    3    0
     0    2]
 [   4    0    0    0    0    1  751    2    0    1    0    1   15    3
     2    1]
 [  11    0    0    0    2    1    2  263   11    0    0    0    2    1
     0    1]
 [  17    1    2    0    3    4    0    5  650    0    0    0    4    7
     0    5]
 [   0    1    0    0    0    0    0    0    0  102    0    0    1    0
     0    1]
 [   3    0    0    0    0    0    2    1    1    1  229    0    8    1
     1    0]
 [   1    1    0    0 

GNB_bow  --->  0.2652
Confusion Matrix :

 [[150  10   4  46  38  12  14 142 159  26 333  84  23  10   9  53]
 [  4  32   0   2   2   3   0   0   4  43   7   7   0   6   1   8]
 [  3   0 241 131  14   1   4   0   9   8  88   6  36  15  22 211]
 [  1  17  10  54  17  19   3   5   7  48   9  34   7  19   6   4]
 [ 15  84   7 152 144  56  10  22  25  53  28  91  29  30  16  24]
 [  5   7   1   4   6  58   2   6  12   9   0   1   2   2   0   9]
 [ 14  15   3  52  13   2 296  77  44   2 127  61  49   7  17   2]
 [ 13   0   0   0   5   2   8 171  34   1  48   2   4   4   2   0]
 [ 32  14   4  27  25  99   9 130  92  17 161  32  11  17   6  22]
 [  0   1   2   5   4   2   1   0   2  71   1   4   1   4   0   7]
 [ 12   0   2   1   2   2   6   5  20   1 178   7   3   3   3   2]
 [  2   5   1   4   1   0   3   0   2  10   0  33   2   1   1   7]
 [ 21   4 109 196 208  64 252   2  32  57 136  60 733  80  96 176]
 [ 14  16  19 179  26  86  21   8  46 110  34  41  25 185  18  56]
 [  5  11  29 145  

In [35]:
results_df

Unnamed: 0,Model,Accuracy,Miss_class_Rate,Precision_Score,Recall_Score,f1_Score
0,SGD_bow,0.9297,0.0703,0.9303,0.9297,0.9295
1,LGR_bow,0.9338,0.0662,0.9343,0.9338,0.9337
2,SVM_bow,0.937,0.063,0.9372,0.937,0.9369
3,RFC_bow,0.9381,0.0619,0.9386,0.9381,0.9382
4,GNB_bow,0.2652,0.7348,0.425,0.2652,0.2892


In [38]:
y_pred_df_bow = pd.DataFrame(y_pred_final)
y_pred_df_bow

Unnamed: 0,SGD,LGR,SVM,RFC,GNB
0,Customer Support,Customer Support,Customer Support,Customer Support,Non-IT
1,Backend Engineer,Backend Engineer,Backend Engineer,Backend Engineer,Backend Engineer
2,Project Management,Project Management,Project Management,Project Management,Project Management
3,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Database Administration
4,Non-IT,Non-IT,Non-IT,Non-IT,Data Analyst
...,...,...,...,...,...
9995,Data Science,Data Science,Data Science,Data Science,Customer Support
9996,Product Management,Product Management,Product Management,Product Management,Data Analyst
9997,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Front End Engineer
9998,Non-IT,Non-IT,Non-IT,Non-IT,Non-IT



<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 7.2  Run All Models on TF-IDF Data</h2>
</div>


In [39]:
# Run all models on the TF-IDF data

classify(x_tfidf, y_labeled ,'tfidf')

SGD_tfidf  --->  0.9279
Confusion Matrix :

 [[ 971    5    1    0    9    3    3   11   59    1    2    0   33    7
     4    4]
 [   1  104    0    0    0    0    0    0    1    1    0    0    1    2
     4    5]
 [   0    0  773    4    1    0    0    0    0    0    0    0    2    1
     4    4]
 [   0    0    1  216   28    0    0    0    0    0    0    0    9    3
     2    1]
 [   3    0    1   33  717    1    1    0    2    1    0    0   15    2
     5    5]
 [   5    0    0    0    4  110    0    0    0    0    0    0    4    0
     0    1]
 [   4    0    0    0    0    1  749    2    0    0    0    1   18    3
     2    1]
 [  12    0    0    0    2    1    1  261   13    0    0    0    2    1
     0    1]
 [  14    1    2    0    3    0    0    5  657    0    0    0    7    2
     0    7]
 [   0    0    0    0    0    0    0    0    0  104    0    0    0    0
     0    1]
 [   3    0    0    0    1    0    2    1    1    0  229    0    8    1
     0    1]
 [   1    1    0    

GNB_tfidf  --->  0.2637
Confusion Matrix :

 [[154  11   4  45  35  15  19 152 159  39 319  67  28  12  12  42]
 [  6  37   0   2   2   2   0   1   4  34   7   7   1   6   1   9]
 [  4   0 237 156  15   1   7   0  10   8  62   6  37  14  23 209]
 [  3  17   9  53  17  19   4   5   7  57   8  25   7  17   7   5]
 [ 21  88   7 150 131  54  14  22  26  73  30  68  29  32  17  24]
 [  6   8   1   4   5  58   3   6  12   6   1   1   3   2   0   8]
 [ 15  15   4  72  14   2 289  86  42   2  95  61  54   8  19   3]
 [ 16   0   0   0   6   2   6 175  34   2  40   1   6   3   3   0]
 [ 30  25   5  25  23 103   9 212  96  26  76   8  12  20   9  19]
 [  0   1   2   5   4   2   1   0   2  69   1   5   2   5   0   6]
 [ 13   0   3   1   1   1   7   6  21   2 175   6   3   3   3   2]
 [  2   5   1   4   2   0   2   0   2  12   0  30   2   2   1   7]
 [ 21   4 103 221 188  65 250   2  25  58 110  58 743  87 118 173]
 [ 17  19  20 180  27  84  19  10  46 110  31  39  30 178  19  55]
 [  9  13  29 147

In [41]:
results_df

Unnamed: 0,Model,Accuracy,Miss_class_Rate,Precision_Score,Recall_Score,f1_Score
0,SGD_bow,0.9297,0.0703,0.9303,0.9297,0.9295
1,LGR_bow,0.9338,0.0662,0.9343,0.9338,0.9337
2,SVM_bow,0.937,0.063,0.9372,0.937,0.9369
3,RFC_bow,0.9381,0.0619,0.9386,0.9381,0.9382
4,GNB_bow,0.2652,0.7348,0.425,0.2652,0.2892
5,SGD_tfidf,0.9279,0.0721,0.9284,0.9279,0.9276
6,LGR_tfidf,0.9322,0.0678,0.9325,0.9322,0.9319
7,SVM_tfidf,0.9356,0.0644,0.9356,0.9356,0.9354
8,RFC_tfidf,0.9394,0.0606,0.9394,0.9394,0.9392
9,GNB_tfidf,0.2637,0.7363,0.4111,0.2637,0.2863


In [40]:
y_pred_df_tfidf = pd.DataFrame(y_pred_final)
y_pred_df_tfidf

Unnamed: 0,SGD,LGR,SVM,RFC,GNB
0,Customer Support,Customer Support,Customer Support,Customer Support,Non-IT
1,Backend Engineer,Backend Engineer,Backend Engineer,Backend Engineer,Backend Engineer
2,Project Management,Project Management,Project Management,Project Management,Project Management
3,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Database Administration
4,Non-IT,Non-IT,Non-IT,Non-IT,Data Analyst
...,...,...,...,...,...
9995,Data Science,Data Science,Data Science,Data Science,Customer Support
9996,Product Management,Product Management,Product Management,Product Management,Data Analyst
9997,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Full Stack Engineer,Front End Engineer
9998,Non-IT,Non-IT,Non-IT,Non-IT,Non-IT



<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 8. Evaluation </h2>
</div>

### This is a classification problem, so the following classification metrics are used

- accuracy
- misclassification rate (`Miss_class_Rate`, equal to 1 - accuracy)
- precision
- recall
- f1

In [42]:
results_df

Unnamed: 0,Model,Accuracy,Miss_class_Rate,Precision_Score,Recall_Score,f1_Score
0,SGD_bow,0.9297,0.0703,0.9303,0.9297,0.9295
1,LGR_bow,0.9338,0.0662,0.9343,0.9338,0.9337
2,SVM_bow,0.937,0.063,0.9372,0.937,0.9369
3,RFC_bow,0.9381,0.0619,0.9386,0.9381,0.9382
4,GNB_bow,0.2652,0.7348,0.425,0.2652,0.2892
5,SGD_tfidf,0.9279,0.0721,0.9284,0.9279,0.9276
6,LGR_tfidf,0.9322,0.0678,0.9325,0.9322,0.9319
7,SVM_tfidf,0.9356,0.0644,0.9356,0.9356,0.9354
8,RFC_tfidf,0.9394,0.0606,0.9394,0.9394,0.9392
9,GNB_tfidf,0.2637,0.7363,0.4111,0.2637,0.2863


> Random Forest on TF-IDF features has the best scores: Accuracy (0.9394), Precision (0.9394), Recall (0.9394), f1_score (0.9392).

> So we will apply this model as the final model.
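The choice of final model can be checked programmatically from the accuracy column. The sketch below copies the accuracies from the results table into a plain dictionary (the variable names are illustrative) and ranks the models.

```python
# Accuracy values copied from the results table above.
accuracy_by_model = {
    "SGD_bow": 0.9297,
    "LGR_bow": 0.9338,
    "SVM_bow": 0.937,
    "RFC_bow": 0.9381,
    "GNB_bow": 0.2652,
    "SGD_tfidf": 0.9279,
    "LGR_tfidf": 0.9322,
    "SVM_tfidf": 0.9356,
    "RFC_tfidf": 0.9394,
    "GNB_tfidf": 0.2637,
}

# Rank models from best to worst accuracy.
ranking = sorted(accuracy_by_model, key=accuracy_by_model.get, reverse=True)
best_model = ranking[0]
runner_up = ranking[1]
gap = accuracy_by_model[best_model] - accuracy_by_model[runner_up]

print(f"best: {best_model} ({accuracy_by_model[best_model]})")
print(f"runner-up: {runner_up} ({accuracy_by_model[runner_up]})")
print(f"gap: {gap:.4f}")
```

The gap between the top two models is about 0.0013, roughly 13 of the 10,000 held-out titles, so the linear models remain reasonable candidates if training time or model size matters for deployment.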


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 9. Apply RF for test data </h2>
</div>

In [43]:
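# Preprocess the data
test_x_preprocess = pre_process(test_x)

# Apply TF-IDF with the vocabulary fitted on the training data (transform, not fit_transform)
test_text_counts_tfidf = tfidf.transform(test_x_preprocess['Job Title'].values.astype('U'))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
xtest_tfidf = pd.DataFrame(test_text_counts_tfidf.toarray(), columns=tfidf.get_feature_names_out())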


In [44]:
xtest_tfidf.head(5)

Unnamed: 0,aa,aaa,aakash,aalst,aan,aas,ab,aba,abap,abapfiori,...,zona,zone,zoom,zoomdata,zrjob,zu,zuidoost,zunik,zupee,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
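The test matrix has exactly the same columns as the training matrix because the vectorizer is only applied to the test data, not refitted. The plain-Python sketch below illustrates that behaviour with hypothetical toy titles: words that were never seen during fitting have no column and are silently ignored.

```python
from collections import Counter

# Hypothetical cleaned training titles used to fit the vocabulary.
train_docs = ["project manager", "user experience designer", "software engineer"]
vocabulary = sorted({token for doc in train_docs for token in doc.split()})


def transform(doc, vocabulary):
    """Count only the tokens that already exist in the fitted vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


# A test title containing two words that never appeared in training.
test_doc = "sap santiago project manager"
vector = transform(test_doc, vocabulary)
unseen = [token for token in test_doc.split() if token not in vocabulary]

print(vocabulary)
print(vector)
print("ignored at transform time:", unseen)
```

A test title made up entirely of unseen words becomes an all-zero row, so the model has to classify it without any evidence from the text.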


In [46]:
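# Predict with the Random Forest fitted in classify() on the TF-IDF features.
# Note: classify() trained on MinMax-scaled features, but its scaler is local to that
# function, so the features passed here are unscaled.
y_pred_test = rfc.predict(xtest_tfidf)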

# Revert the label encoding
y_pred_test_rev = le.inverse_transform(y_pred_test) 
y_pred_test_rev

array(['Project Management', 'Cloud architect', 'Design', ..., 'Non-IT',
       'Customer Support', 'Product Management'], dtype=object)


<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> 10. Create Test Data with Predicted Type </h2>
</div>

In [102]:
# Build a one-column DataFrame of predicted labels
y_pred_test_pd = pd.DataFrame({'Type': y_pred_test_rev})

# Both frames must share the same 0..n-1 index before concatenating.
# reset_index returns a new object, so the result has to be assigned.
test_x = test_x.reset_index(drop=True)

test_x_new = pd.concat([test_x, y_pred_test_pd], axis=1, sort=False)
#test_x_new = test_x_new.loc[:,['Job Title','Type']]
test_x_new

Unnamed: 0,Job Title,Type
0,interim project manager virtualization month,Project Management
1,product operation software engineer devop sre,Cloud architect
2,user experience designer,Design
3,sap santiago,Product Management
4,phd intern university coop student,Non-IT
...,...,...
19995,medical surveillance datum system specialist s...,Data Science
19996,senior lead engineer automation engineering pr...,Product Management
19997,finance mis specialist,Non-IT
19998,italian customer care,Customer Support


In [103]:
#Save test dataset with type in "X0PA_DS_TEST_RESULT.csv"
test_x_new.to_csv('X0PA_DS_TEST_RESULT.csv')

In [104]:
# Read the csv
test_final = pd.read_csv('X0PA_DS_TEST_RESULT.csv')
test_final.head()

Unnamed: 0.1,Unnamed: 0,Job Title,Type
0,0,interim project manager virtualization month,Project Management
1,1,product operation software engineer devop sre,Cloud architect
2,2,user experience designer,Design
3,3,sap santiago,Product Management
4,4,phd intern university coop student,Non-IT
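The problem statement asks for the test file with "Type" as a new third column, while the file read back above also contains index columns and the cleaned titles. A minimal sketch of writing `id`, the original `Job Title` and `Type` without an index column follows. It uses the standard `csv` module and an in-memory buffer instead of a real file. The first row pairs an id and title from `test_df.head()` with its prediction shown above; the second row is entirely hypothetical.

```python
import csv
import io

# (id, original job title, predicted type); the second row is hypothetical.
rows = [
    (305454, "IT User Experience Designer", "Design"),
    (999999, "Example Job Title", "Non-IT"),
]

# Write the submission with exactly three columns and no index column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "Job Title", "Type"])
writer.writerows(rows)

# Read it back to confirm the layout.
buffer.seek(0)
reader = csv.DictReader(buffer)
loaded = list(reader)

print(reader.fieldnames)
print(loaded[0])
```

With pandas, the equivalent is to add the predicted `Type` column to the original test frame and pass `index=False` to `to_csv`.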



<div class="alert alert-info" style="background-color:#7FFF00; color:white; padding:0px 10px; border-radius:2px;"><h2 style='margin:10px 5px'> END </h2>
</div>