# Module #3 -- Assignment

## Instructions

Use the knowledge acquired, mainly in this module, to create a CLI entry-retrieval tool for the job description data used up to the current module.

- The tool receives the query via the command line, and the query can be of any length.
- Use n-grams where n = 1 and 2.
- Only the top 5 ranks need to be displayed to the user.
- Everything else is up to your own design.
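
As a minimal sketch of what "n-grams where n = 1 and 2" means, the helper below (a hypothetical `ngrams` function, not part of the assignment code) builds unigrams and bigrams from a token list. This is roughly the set of features `CountVectorizer` produces with `ngram_range=(1, 2)`, ignoring its own tokenization rules.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a single space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example text
tokens = 'senior java developer'.split()

# Unigrams followed by bigrams
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
# ['senior', 'java', 'developer', 'senior java', 'java developer']
```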

## Your task

Explore the three variants of ranking methods:

- Rank using only tf
- Rank using both tf and idf
- Rank using BM25

`Please also discuss the differences in the returned ranks.`

---

# Module Preparation

## Essential Python libraries

In [1]:
import pandas as pd  # data loading and manipulation
import string
import timeit  # timing utilities
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer  # built-in tf-idf vectorizer
from scipy import sparse

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Code and classes used in this assignment

### Get data from the .csv file

In [2]:
def get_and_clean_data():
    data = pd.read_csv('./data/software_developer_united_states_1971_20191023_1.csv')
    description = data['job_description']
    # Remove punctuation and non-breaking spaces
    cleaned_description = description.apply(lambda s: s.translate(str.maketrans('', '', string.punctuation + u'\xa0')))
    # Lower-case everything
    cleaned_description = cleaned_description.apply(lambda s: s.lower())
    # Map every whitespace character (tabs, newlines, ...) to a plain space
    cleaned_description = cleaned_description.apply(lambda s: s.translate(str.maketrans(string.whitespace, ' '*len(string.whitespace), '')))
    # Drop duplicate descriptions
    cleaned_description = cleaned_description.drop_duplicates()
    return cleaned_description

### Set up preprocessing for faster analysis

In [3]:
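# Build the stemmer and the stop-word set once instead of on every call
ps = PorterStemmer()
# No language is given, so this contains the stop words of every language NLTK ships
stop_words = set(stopwords.words())

def preProcess(s):
    tokens = word_tokenize(s)
    # Drop stop words, then reduce each remaining word to its stem
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [ps.stem(w) for w in tokens]
    return ' '.join(tokens)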

### BM25

In [4]:
class BM25(object):
    def __init__(self, b=0.75, k1=1.6):
        # smooth_idf=False gives idf_ = ln(n / df) + 1
        self.vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
        self.b = b
        self.k1 = k1

    def fit(self, X):
        self.vectorizer.fit(X)
        # Call CountVectorizer.transform to get raw term counts
        y = super(TfidfVectorizer, self.vectorizer).transform(X)
        # Average document length, in counted tokens
        self.avdl = y.sum(1).mean()

    def transform(self, q, X):
        b, k1, avdl = self.b, self.k1, self.avdl

        X = super(TfidfVectorizer, self.vectorizer).transform(X)
        len_X = X.sum(1).A1  # length of each document
        q, = super(TfidfVectorizer, self.vectorizer).transform([q])
        assert sparse.isspmatrix_csr(q)

        # Keep only the columns of the terms that occur in the query
        X = X.tocsc()[:, q.indices]
        denom = X + (k1 * (1 - b + b * len_X / avdl))[:, None]
        # Subtract 1 to recover ln(n / df) from idf_
        idf = self.vectorizer._tfidf.idf_[None, q.indices] - 1.
        numer = X.multiply(np.broadcast_to(idf, X.shape)) * (k1 + 1)

        # One BM25 score per document
        return (numer / denom).sum(1).A1
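
Read off the class above, the score that `transform` computes for a query $q$ and a document $d$ can be written as follows. The symbols are my own notation, not the notebook's: $tf_{t,d}$ is the raw count of term $t$ in $d$, $|d|$ is the document length in counted tokens, $N$ is the number of fitted documents, and $df_t$ is the number of documents containing $t$. The sum runs over the distinct query terms found in the vocabulary, and the defaults are $b = 0.75$ and $k_1 = 1.6$.

$$
\text{score}(q, d) = \sum_{t \in q} \ln\frac{N}{df_t} \cdot \frac{tf_{t,d}\,(k_1 + 1)}{tf_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{avdl}\right)}
$$

This is only a reading of this particular class. Other BM25 variants use a different idf term.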

---

# CLI Entry Part

## Client CLI

In [5]:
job_entry = input("What are you looking for : ").split(' ')

What are you looking for : java aws azure docker python
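
The cell above reads the query interactively with `input`. If the query should instead come from command-line arguments, one possible sketch uses the standard `argparse` module. The function name `parse_query` and the `--top` option are assumptions made for illustration, not part of the assignment.

```python
import argparse

def parse_query(argv=None):
    """Parse query terms from the command line (hypothetical interface)."""
    parser = argparse.ArgumentParser(description='Search job descriptions.')
    parser.add_argument('terms', nargs='+', help='one or more query terms')
    parser.add_argument('--top', type=int, default=5, help='number of results to show')
    args = parser.parse_args(argv)
    # Lower-case the terms so they match the lower-cased descriptions
    return [t.lower() for t in args.terms], args.top

# Passing a list stands in for sys.argv[1:] when experimenting in a notebook
job_entry, top_k = parse_query(['java', 'aws', 'azure', 'docker', 'python'])
print(job_entry, top_k)
```

Calling `parse_query()` with no argument would read the real command line, which is what a standalone script would do.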


## Simple Dictionary Ranking

In [6]:
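def show_top_five(key_word):
    # Total the weights of each term (column) over all rows
    result = dict()
    key_word_dict = key_word.to_dict()

    for item in key_word_dict:
        # Note: the total starts at the row-0 value and the loop below adds
        # row 0 again, so the first row is counted twice
        result[item] = key_word_dict[item][0]
        for index in key_word_dict[item]:
            result[item] += key_word_dict[item][index]

    # Sort by total, largest first, and keep the top five
    return sorted(result.items(), key=lambda x: x[1], reverse=True)[:5]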

---

# Assignment Part

Explore the three variants of ranking methods:

## Information Retrieval

In [7]:
cleaned_description = get_and_clean_data()
cleaned_description = cleaned_description.iloc[:100]

vectorizer = CountVectorizer(preprocessor=preProcess, ngram_range=(1, 2))
vectorizer.fit_transform(cleaned_description)

<100x19135 sparse matrix of type '<class 'numpy.int64'>'
	with 38929 stored elements in Compressed Sparse Row format>

## Rank using only tf

In [8]:
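def tf_ranking():
    # Each query word is vectorized as its own one-word "document"
    X = vectorizer.transform(job_entry)

    # tf formula: log10(count + 1)
    X.data = np.log10(X.data + 1)
    # Newer scikit-learn versions replace get_feature_names() with get_feature_names_out()
    tf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())

    print(show_top_five(tf))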

In [9]:
tf_ranking()

[('java', 0.6020599913279624), ('aw', 0.3010299956639812), ('azur', 0.3010299956639812), ('docker', 0.3010299956639812), ('python', 0.3010299956639812)]
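
The scores in this output can be checked by hand. Each query word occurs once in its own row, so its tf weight is log10(1 + 1). `show_top_five` adds the row-0 value twice, and `java` is the first query word, which would explain why its score is exactly double the others. A small check, using only the standard `math` module:

```python
import math

single = math.log10(1 + 1)   # tf weight of a term that occurs once
doubled = single + single    # row 0 is added twice by show_top_five

print(single)    # score of aw, azur, docker, python
print(doubled)   # score of java, the first query word
```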


## Rank using both tf and idf

In [10]:
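def tf_idf_ranking(n):
    X = vectorizer.transform(job_entry)

    # idf = log10(n / df); terms that never occur have df = 0,
    # so numpy emits a divide-by-zero warning here
    idf = n / (X.tocoo() > 0).sum(0)
    X.data = np.log10(X.data + 1)
    # Weight each tf value by the idf of its term
    weighted = X.multiply(np.log10(idf))

    # Newer scikit-learn versions replace get_feature_names() with get_feature_names_out()
    tfidf = pd.DataFrame(weighted.toarray(), columns=vectorizer.get_feature_names())

    print(show_top_five(tfidf))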

In [11]:
tf_idf_ranking(5)


[('java', 0.42082187474904936), ('aw', 0.21041093737452468), ('azur', 0.21041093737452468), ('docker', 0.21041093737452468), ('python', 0.21041093737452468)]
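
These values follow from the weighting in `tf_idf_ranking`: log10(count + 1) multiplied by log10(n / df). The call passes n = 5, and each query word occurs in exactly one of the five one-word rows, so df = 1. The sketch below reproduces the per-word score with a hypothetical helper `tf_idf`. The second call is an assumed illustration of how the idf factor behaves when a term appears in every document.

```python
import math

def tf_idf(count, n, df):
    """log10(count + 1) * log10(n / df), the weighting used in tf_idf_ranking."""
    return math.log10(count + 1) * math.log10(n / df)

# One occurrence, n = 5 query words, df = 1
print(tf_idf(1, 5, 1))

# Assumed illustration: with n = 100 documents, a term found in every
# document gets idf = log10(100 / 100) = 0 and so contributes nothing
print(tf_idf(1, 100, 100))
```

As in the tf output, the score of `java` is twice the per-word value because `show_top_five` counts the first row twice.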


## Rank using BM25

In [12]:
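from pprint import pprint

def BM25_ranking():
    bm25 = BM25()
    bm25.fit(cleaned_description)  # fit() returns None, so there is nothing to assign
    # Score every job description against the whole query
    bm_ranking = bm25.transform(' '.join(job_entry), cleaned_description)

    # Print the five highest document scores
    pprint(sorted(bm_ranking, reverse=True)[:5])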

In [13]:
BM25_ranking()

[8.53906968182736,
 8.01630235443384,
 6.114675079756328,
 5.395575450169901,
 5.342448261012892]
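
The output above lists scores only, so it does not show which job descriptions they belong to. One way to keep the document position next to each score is sketched below in plain Python. The `scores` list is hypothetical data standing in for the array returned by `BM25.transform`, and `top_k_with_index` is a name introduced here for illustration.

```python
def top_k_with_index(scores, k=5):
    """Return (document_index, score) pairs for the k highest scores."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores, one per document
scores = [0.0, 8.5, 1.2, 6.1, 0.3, 5.4, 8.0]

for doc_id, score in top_k_with_index(scores):
    print(doc_id, score)
```

Since `transform` scores the documents in the order they were passed in, `cleaned_description.iloc[doc_id]` should give the matching text, assuming the same Series is used for scoring and lookup.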


## My Thoughts on This Assignment

### The result from TF

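Term frequency (tf) is a technique that retrieves information based on the number of times a word occurs in a text.

`java` receives the highest score (0.602), which is why it takes the top place in the result. Note that `tf_ranking` vectorizes the query words rather than the job descriptions, and `show_top_five` counts the first row twice, so the score of `java`, the first query word, is exactly double the 0.301 of the other words.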

### The result from TF-IDF

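Term frequency-inverse document frequency (tf-idf) weights each term frequency by how rare the word is: the weight is calculated from the word counts and from the number of entries that contain the word.

`java` seems to appear as an individual word, without any other term as a prefix or suffix, which is why it also has the highest place when using the tf-idf technique. Every query word has the same idf here, so the order is the same as with tf alone.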
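
The two rankings above score the query words themselves. As a rough sketch of how the same tf and tf-idf weights could rank documents instead, the example below uses a hypothetical three-document toy corpus and plain Python. `rank_documents` is a name introduced here, and the weighting simply mirrors the log10 formulas used earlier.

```python
import math
from collections import Counter

# Hypothetical toy corpus; in the notebook this would be the cleaned descriptions
docs = ['java developer java aws', 'python developer', 'docker aws azure']

def rank_documents(docs, query, use_idf=False):
    """Return (doc_index, score) pairs, best first, for a list of query terms."""
    counts = [Counter(doc.split()) for doc in docs]
    n = len(docs)
    scored = []
    for i, tf in enumerate(counts):
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # the term does not occur in this document
            weight = math.log10(tf[term] + 1)
            if use_idf:
                # Number of documents that contain the term
                df = sum(1 for c in counts if term in c)
                weight *= math.log10(n / df)
            score += weight
        scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_documents(docs, ['java', 'aws']))
print(rank_documents(docs, ['java', 'aws'], use_idf=True))
```

In this toy corpus both variants put document 0 first, because it contains `java` twice as well as `aws`. With idf enabled, `aws` counts for less since it appears in two of the three documents.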

### The result from BM25

BM25 (Best Match 25) is a standard ranking function based on the probabilistic IR model. It uses a bag-of-words representation and combines term frequency, inverse document frequency, and document-length normalization into one score per document.

Because I could not convert the returned np.ndarray into a pandas DataFrame, the output shows only the five highest scores. I assume that the top score (8.xxx) belongs to a description that mentions java, because this word seems to be spread all over the .csv data set.

End of assignment.