<a href="https://colab.research.google.com/github/AVJdataminer/HireOne/blob/master/Flask_App_on_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Create Flask app for NLP project

Steps in this NLP project notebook:

1.   Create a model with Gensim Doc2Vec to get job description vectors
2.   Save the serialized model as a pickle file
3.   Build a Flask app in Google Colab
4.   Connect to Heroku



# Build the model

In [None]:
# Imports
import re
import string
from collections import Counter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import pickle
import json

In [None]:
# Install the PDF reader module
!pip install pdfminer.six

Collecting pdfminer.six
Collecting cryptography
Installing collected packages: cryptography, pdfminer.six
Successfully installed cryptography-3.1.1 pdfminer.six-20200726


## Load job description data

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/AVJdataminer/HireOne/master/data/job_descriptions.csv', encoding = 'unicode_escape')
df.head()

Unnamed: 0,jobOrResumeDescription,role
0,: Artificial Intelligence / Machine Learning D...,Developer
1,: Data Scientist/Architect\n: 6+ months + Hig...,Data Scientist
2,": Data Analyst\n: Davidson, NC\n: 04+ Months\...",Data Analyst
3,: Big Data Architect or Data Scientist\n: New...,Data Scientist
4,": Data Engineer\n: Woonsocket, RI\n: 6+ Months...",Data Engineer


Clean up job description column.

In [None]:
def clean_text(text):
    text = text.replace('\n', ' ')                # replace newlines with spaces
    text = text.replace(':', ' ')                 # replace the colons that prefix each field
    return text

df['description'] = df['jobOrResumeDescription'].apply(clean_text)
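
Replacing newlines and colons with spaces leaves runs of consecutive spaces in the text. A minimal sketch of a stricter cleaner, assuming that collapsing whitespace is acceptable for this data; `clean_text_compact` is a hypothetical name and is not used elsewhere in this notebook:

```python
import re

def clean_text_compact(text):
    """Replace newlines and colons with spaces, then collapse repeated whitespace."""
    text = text.replace('\n', ' ').replace(':', ' ')
    # \s+ matches any run of whitespace; strip() drops leading and trailing spaces
    return re.sub(r'\s+', ' ', text).strip()

sample = ': Data Analyst\n: Davidson, NC\n: 04+ Months'
print(clean_text_compact(sample))  # Data Analyst Davidson, NC 04+ Months
```

Whether the extra spaces matter depends on the tokenizer used later; splitting on whitespace ignores them either way.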

Print the first job description.

In [None]:
df['description'].iloc[0]

"  Artificial Intelligence / Machine Learning Developer     Irving TX  Terms  Contract   Details             Bachelor's degree or 7-10 or more years of relevant  experience.     7+ years of server app development (design/develop/deploy).     3+ years of Python 3.x, experience in ML algorithms/data analytics.     5+ years of advanced SQL development (ER modeling, SQL scripts, stored procedures, functions, s) with RDBMS such as PostgreSQL/MS SQL Server.     3+ years on AWS S3, EC2, Serverless computing (Lambda).     3+ years of experience/familiarity with DevOps using Stash/Jenkins/Chef and Puppet.     Excellent communication  in interfacing with different cross-functional teams.         5+ years of experience in designing, building applications using .NET platform using C#, .NET Core, ORM, SQL, MS SQL Server, Visual Studio.     1+ years' experience in developing containerized Docker .net core apps."

Create a list from the cleaned job description column

In [None]:
jd = df['description'].tolist()

Build the model input by tagging each job description as a separate document.


In [None]:
import gensim

# Create the tagged documents needed for Doc2Vec
def create_tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

# Doc2Vec expects each document as a list of word tokens, not a raw string
tokenized_jd = [description.split() for description in jd]
train_data = list(create_tagged_document(tokenized_jd))

# Show the first ten tokens and the tag of the first document
print(train_data[0].words[:10], train_data[0].tags)

['Artificial', 'Intelligence', '/', 'Machine', 'Learning', 'Developer', 'Irving', 'TX', 'Terms', 'Contract'] [0]
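
Doc2Vec reads the `words` field of a tagged document as a sequence of tokens, so a raw string is consumed one character at a time. A small sketch in plain Python (no gensim needed, and the sample text is made up) of the difference between the two inputs:

```python
description = 'Data Scientist Python SQL'

# Iterating over a string yields single characters
as_characters = list(description)[:4]
# Splitting on whitespace yields word tokens
as_words = description.split()

print(as_characters)  # ['D', 'a', 't', 'a']
print(as_words)       # ['Data', 'Scientist', 'Python', 'SQL']
```

With word tokens, a query such as `['data', 'science', 'python']` can match entries in the model vocabulary; with characters it cannot.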


Train the model on the job descriptions for matching later.

In [None]:
# Init the Doc2Vec model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

# Build the vocabulary
model.build_vocab(train_data)

# Train the Doc2Vec model
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

## Save model as pickle

Save the model, reload it, and look at an example of how it converts a list of words to a vector.

In [None]:
# Save the model to a pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Load the model from the pickle file
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
# Test inferring a vector
print(model.infer_vector(['data', 'science', 'python']))

[-0.00429757 -0.00297611  0.00904408 -0.00976137  0.00639761  0.00729887
 -0.0060175   0.00303777 -0.00242801 -0.00995632 -0.00129785  0.00982981
  0.00315597 -0.00721821 -0.00867892  0.00788841 -0.00339966 -0.00162887
  0.00441544  0.00746401 -0.00349555 -0.00038155 -0.0077288   0.00483619
  0.00707149  0.00414337  0.00113955  0.00541657  0.00119371  0.00305656
 -0.00128384  0.00254856 -0.00201426 -0.00587954  0.00709848 -0.00166082
 -0.00705521 -0.00044411  0.00060066 -0.00624194 -0.00823938 -0.00922911
 -0.00754664 -0.00991162  0.00071209  0.00360184  0.00311903 -0.005285
 -0.00755596 -0.00868551]


# Create Flask app

In [None]:
# Install flask-ngrok
!pip install flask-ngrok

Collecting flask-ngrok
  Downloading https://files.pythonhosted.org/packages/af/6c/f54cb686ad1129e27d125d182f90f52b32f284e6c8df58c1bae54fa1adbc/flask_ngrok-0.0.25-py3-none-any.whl
Installing collected packages: flask-ngrok
Successfully installed flask-ngrok-0.0.25


Run a test app through ngrok. Click the link in the output that looks like 'Running on http://05fdd29d9a82.ngrok.io' to load the sample app in a browser.

In [None]:
from flask import Flask
from flask import request
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)  # Start ngrok when app is run

# For the root URL (/), return Hello World and the HTTP method used
@app.route("/")
def root():
    method = request.method
    return f"Hello World! {method}"

app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Running on http://05fdd29d9a82.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040
127.0.0.1 - - [01/Oct/2020 02:02:28] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [01/Oct/2020 02:02:28] "GET /favicon.ico HTTP/1.1" 404 -


The actual NLP app code is below.
1. Run the code below and open the ngrok link in a browser window.
2. Paste the text from your resume in the box and click submit.
3. The response should appear as a JSON table of the cosine distances, job descriptions and roles of the top five matches to the resume text.
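
The ranking in step 3 is based on cosine distance, which is 1 minus the cosine similarity of two vectors, so smaller values mean closer matches. A minimal sketch of that ranking in plain Python; the helper names `cosine_distance` and `top_matches` and the 2-dimensional toy vectors are made up for illustration, while the app itself uses scikit-learn's `cosine_distances` on 50-dimensional Doc2Vec vectors:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def top_matches(resume_vec, job_vecs, roles, k=5):
    """Rank jobs by cosine distance to the resume, smallest distance first."""
    scored = [(cosine_distance(resume_vec, vec), role)
              for vec, role in zip(job_vecs, roles)]
    return sorted(scored)[:k]

# Toy vectors (assumed values, not taken from the project data)
resume = [1.0, 0.0]
jobs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
roles = ['Data Scientist', 'Data Analyst', 'Developer']

for distance, role in top_matches(resume, jobs, roles, k=2):
    print(f'{distance:.1f}  {role}')
```

A distance of 0 means the vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.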



In [None]:
import pickle
import urllib.request

import pandas as pd
from flask import Flask, request, render_template_string
from flask_ngrok import run_with_ngrok
from sklearn.metrics.pairwise import cosine_distances

BASE_URL = 'https://raw.githubusercontent.com/AVJdataminer/HireOne/master'

def fetch(path):
    """Download a file from the project repository and return its raw bytes."""
    with urllib.request.urlopen(f'{BASE_URL}/{path}') as response:
        return response.read()

# Load everything once at startup instead of on every request
form_html = fetch('templates/my-form.html').decode('utf-8')
loaded_model = pickle.loads(fetch('model.pkl'))  # open() cannot read a URL
job_vectors = pd.read_csv(f'{BASE_URL}/data/vectors_data.csv').to_numpy()
jobs = pd.read_csv(f'{BASE_URL}/data/updated_job_description.csv', encoding='unicode_escape')

app = Flask(__name__)
run_with_ngrok(app)  # Start ngrok when app is run

@app.route('/')
def my_form():
    # render_template() only reads local template files, so render the downloaded HTML
    return render_template_string(form_html)

@app.route('/', methods=['POST'])
def my_form_post():
    # Split the resume text into tokens and infer its vector
    processed_text = request.form['text'].split()
    resume_vect = loaded_model.infer_vector(processed_text)
    # Cosine distance between the resume and every job description vector
    cos_dist = cosine_distances(resume_vect.reshape(1, -1), job_vectors)[0]
    summary = pd.DataFrame({
        'Role Title': jobs['role'],
        'Cosine Distances': cos_dist,
        'Job Description': jobs['description']
    })
    # Keep the five closest matches; Flask serializes the returned dict as JSON
    top_matches = summary.sort_values(by='Cosine Distances', ascending=True).head()
    return top_matches.to_dict()

if __name__ == '__main__':
    app.run()

# Next steps:
1. Deploy the app to Heroku
2. Improve the HTML template so the response is formatted
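
For the second item, one possible starting point is to turn the column-oriented dict that `DataFrame.to_dict()` produces into an HTML table. This is only a sketch under that assumption; `dict_to_html_table` is a hypothetical helper and the sample values are made up:

```python
from html import escape

def dict_to_html_table(columns):
    """Render a column-oriented dict ({column: {row_id: value}}) as an HTML table."""
    headers = list(columns)
    # All columns share the same row ids, so take them from the first column
    row_ids = list(next(iter(columns.values())))
    head = ''.join(f'<th>{escape(h)}</th>' for h in headers)
    body = ''
    for row_id in row_ids:
        cells = ''.join(f'<td>{escape(str(columns[h][row_id]))}</td>' for h in headers)
        body += f'<tr>{cells}</tr>'
    return f'<table><tr>{head}</tr>{body}</table>'

# Made-up sample in the same shape as DataFrame.to_dict()
sample = {'Role Title': {3: 'Data Scientist'}, 'Cosine Distances': {3: 0.42}}
print(dict_to_html_table(sample))
```

Escaping each value keeps characters such as `<` or `&` in a job description from breaking the page.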