# Module #3 -- Assignment

## Instructions

Using the knowledge acquired so far, mainly in this module, create a CLI retrieval tool for the job description data used up to the current module.

- The tool reads the query from the command line, and the query can be of any length.
- Use n-grams with n = 1 and n = 2.
- Display only the top 5 ranked results to the user.
- All other design decisions are up to you.
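
As a starting point, the requirements above could be sketched as follows. This is only one possible layout: the function names (`parse_args`, `ngrams`, `query_terms`) and the `--top` option are assumptions made for illustration, not part of the assignment.

```python
import argparse


def parse_args(argv=None):
    """Parse the query and the number of results to show (hypothetical CLI)."""
    parser = argparse.ArgumentParser(description="Search job descriptions.")
    # nargs="+" lets the query be any number of words without quoting.
    parser.add_argument("query", nargs="+", help="search terms")
    parser.add_argument("--top", type=int, default=5, help="results to display")
    args = parser.parse_args(argv)  # argv=None means "read sys.argv"
    args.query = " ".join(args.query)
    return args


def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def query_terms(query):
    """Unigrams followed by bigrams of a whitespace-tokenized query."""
    tokens = query.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


# Simulate the command line `search.py senior python developer`.
args = parse_args(["senior", "python", "developer"])
print(args.query, args.top)
print(query_terms(args.query))
```

In the real tool, `parse_args()` would be called without arguments so that it reads the actual command line.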

## Your task

Explore three variants of ranking methods:

- Rank using only tf
- Rank using both tf and idf
- Rank using BM25

`Please also discuss the differences in the returned rankings.`

---

# Module Preparation

## Essential Python libraries

In [2]:
import pandas as pd # data loading and manipulation
import string
import timeit # timing code execution
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer # built-in tf-idf vectorizer
from scipy import sparse

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Code and classes used in this assignment

### Get data from the .csv file

In [3]:
def get_and_clean_data():
    data = pd.read_csv('./data/software_developer_united_states_1971_20191023_1.csv')
    description = data['job_description']
    # Strip punctuation and non-breaking spaces.
    cleaned_description = description.apply(lambda s: s.translate(str.maketrans('', '', string.punctuation + u'\xa0')))
    # Lowercase everything.
    cleaned_description = cleaned_description.apply(lambda s: s.lower())
    # Map every whitespace character (tabs, newlines, ...) to a plain space.
    cleaned_description = cleaned_description.apply(lambda s: s.translate(str.maketrans(string.whitespace, ' '*len(string.whitespace), '')))
    # Drop duplicate descriptions.
    cleaned_description = cleaned_description.drop_duplicates()
    return cleaned_description

### Set up preprocessing for faster analysis

In [4]:
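def preProcess(s):
    # Tokenize, drop stopwords, stem each token, and rejoin into one string.
    ps = PorterStemmer()
    tokens = word_tokenize(s)
    # stopwords.words() with no argument returns the stopwords of every language NLTK ships.
    stopwords_set = set(stopwords.words())
    tokens = [w for w in tokens if w not in stopwords_set]
    tokens = [ps.stem(w) for w in tokens]
    return ' '.join(tokens)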

### BM25

In [5]:
class BM25(object):
    def __init__(self, b=0.75, k1=1.6):
        self.vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
        self.b = b
        self.k1 = k1

    def fit(self, X):
        self.vectorizer.fit(X)
        y = super(TfidfVectorizer, self.vectorizer).transform(X)
        self.avdl = y.sum(1).mean()

    def transform(self, q, X):
        b, k1, avdl = self.b, self.k1, self.avdl

        X = super(TfidfVectorizer, self.vectorizer).transform(X)
        len_X = X.sum(1).A1
        q, = super(TfidfVectorizer, self.vectorizer).transform([q])
        assert sparse.isspmatrix_csr(q)

        X = X.tocsc()[:, q.indices]
        denom = X + (k1 * (1 - b + b * len_X / avdl))[:, None]
        idf = self.vectorizer._tfidf.idf_[None, q.indices] - 1.
        numer = X.multiply(np.broadcast_to(idf, X.shape)) * (k1 + 1)
        return (numer / denom).sum(1).A1
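
For reference, the score that `transform` computes can be read off the code as the formula below. The symbols are notation chosen here for illustration: $f(t, d)$ is the count of term $t$ in document $d$, $|d|$ is the document length (sum of its in-vocabulary term counts), $\text{avdl}$ is the average document length over the fitted corpus, $N$ is the number of fitted documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

$$
\text{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\text{avdl}}\right)},
\qquad \mathrm{idf}(t) = \ln \frac{N}{\mathrm{df}(t)}
$$

The idf term follows from `TfidfVectorizer(smooth_idf=False)`, whose `idf_` is $\ln(N / \mathrm{df}(t)) + 1$; the code subtracts the 1.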

---

# Assignment Part

Explore the three variants of ranking methods:

## Rank using only tf

In [6]:
# some cool python code goes here
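
One possible way to rank by raw term frequency is sketched below. It is a minimal illustration, not a reference solution: the three-document toy corpus and the query are made up, and scoring by the dot product of count vectors is just one reasonable reading of "tf only".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned job descriptions (made up for illustration).
docs = [
    "python developer python django",
    "java developer spring",
    "senior python engineer",
]
query = "python developer"

# Unigrams and bigrams, as the assignment requires.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)   # documents x terms, raw counts
q = vectorizer.transform([query])    # 1 x terms, raw counts

# Score = sum over query terms of (query count * document count).
scores = (X @ q.T).toarray().ravel()

# Indices of the best documents, highest score first (at most 5).
top = np.argsort(-scores, kind="stable")[:5]
for rank, i in enumerate(top, start=1):
    print(rank, scores[i], docs[i])
```

Because raw counts are not normalized, long documents that repeat a query term tend to score higher under this scheme.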

## Rank using both tf and idf

In [7]:
# Another cool python code goes here
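
A tf-idf ranking could look like the sketch below, again on a made-up toy corpus. It assumes `TfidfVectorizer` with its default settings (smoothed idf, L2-normalized rows), so the dot product between a document row and the query row equals their cosine similarity. Other weighting choices are equally valid for the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned job descriptions (made up for illustration).
docs = [
    "python developer python django",
    "java developer spring",
    "senior python engineer",
]
query = "python developer"

# Unigrams and bigrams; rows are L2-normalized by default.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)   # documents x terms, tf-idf weights
q = vectorizer.transform([query])    # 1 x terms, tf-idf weights

# With unit-length rows, the dot product is the cosine similarity.
scores = (X @ q.T).toarray().ravel()

# Indices of the best documents, highest score first (at most 5).
top = np.argsort(-scores, kind="stable")[:5]
for rank, i in enumerate(top, start=1):
    print(rank, round(float(scores[i]), 4), docs[i])
```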

## Rank using BM25

In [8]:
# very super duper coolest python code goes here
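
To keep the example self-contained, the sketch below recomputes the BM25 score directly with NumPy on a made-up toy corpus instead of calling the `BM25` class defined earlier. It assumes the same defaults (`k1=1.6`, `b=0.75`) and the same unsmoothed idf, `ln(N / df)`. In the notebook itself, fitting the `BM25` class on the preprocessed descriptions would be the natural route.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned job descriptions (made up for illustration).
docs = [
    "python developer python django",
    "java developer spring",
    "senior python engineer",
]
query = "python developer"
k1, b = 1.6, 0.75  # assumed to match the defaults of the BM25 class

vectorizer = CountVectorizer(ngram_range=(1, 2))
tf = vectorizer.fit_transform(docs).toarray().astype(float)  # dense is fine for a toy corpus
q_idx = vectorizer.transform([query]).indices                # columns of the query terms

doc_len = tf.sum(axis=1)              # document length in (uni+bi)gram counts
avdl = doc_len.mean()                 # average document length
df = (tf > 0).sum(axis=0)             # number of documents containing each term
idf = np.log(len(docs) / df)          # ln(N / df), no smoothing

tf_q = tf[:, q_idx]                                   # counts of query terms only
norm = (k1 * (1 - b + b * doc_len / avdl))[:, None]   # length normalization per document
scores = (idf[q_idx] * tf_q * (k1 + 1) / (tf_q + norm)).sum(axis=1)

# Indices of the best documents, highest score first (at most 5).
top = np.argsort(-scores, kind="stable")[:5]
for rank, i in enumerate(top, start=1):
    print(rank, round(float(scores[i]), 4), docs[i])
```

Unlike raw tf, the contribution of a repeated term saturates as its count grows, and longer-than-average documents are penalized through `b`.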

## Conclusion

- Bra Bra Bra
- Bra Bra Bra
- Bra Bra?
- Bra Bra
- Bruh..