# Salary prediction by vacancy description

## Dataset description

The dataset contains job vacancies published on the internet for different countries. Each vacancy record includes a full description, title, location, company, job category, salary, etc.
In this assignment you have to predict, from the vacancy description, the probability that the salary exceeds the threshold. The data is provided as a dataframe. The columns of interest are:
* FullDescription - the full text description of the vacancy
* SalaryNormalized - the target to predict (salary threshold).

The following steps are required to complete the assignment:
1. Read the dataset.
2. Transform the text by removing punctuation and stop words.
3. Generate n-grams.
4. Compute TF-IDF features.
5. Fit a model on the generated features.


## Reading dataset

Initialize the PySpark session.

In [None]:
from __future__ import division, print_function, unicode_literals # For compatibility with Python 2

In [None]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder\
                            .enableHiveSupport()\
                            .appName("spark sql")\
                            .master("local[4]")\
                            .getOrCreate()

Load the training dataset located at `/data/vacancie` with at least 10 partitions (use the `repartition` function for this).

## Transforming dataset

Remove redundant punctuation using `RegexTokenizer` with the pattern <code>"\\\\s+|,|\\\\*|/|\\\\."</code>. The tokenizer splits the text on this pattern, so whitespace, commas, asterisks, slashes and dots are removed from the tokens.

In [None]:
from pyspark.ml.feature import RegexTokenizer, Tokenizer
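
To see what the pattern does before running it on the cluster, it can be tried in plain Python. The sketch below is only an illustration, not the Spark API: it assumes the tokenizer's default behaviour (lowercasing, splitting on the pattern, dropping empty tokens), and the helper name `split_like_regex_tokenizer` and the sample sentence are made up for this example.

```python
import re

# Same alternatives as the assignment's pattern, written as a Python raw string.
SPLIT_PATTERN = r"\s+|,|\*|/|\."

def split_like_regex_tokenizer(text):
    """Lowercase the text, split it on the pattern, and drop empty tokens."""
    return [token for token in re.split(SPLIT_PATTERN, text.lower()) if token]

# Hypothetical sample text, not taken from the dataset.
sample = "Senior C++/Java developer, London. Salary: 50k * bonus"
tokens = split_like_regex_tokenizer(sample)
print(tokens)
```

Note that characters missing from the pattern, such as the colon in `salary:`, stay attached to their tokens.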

Remove English stop words using `StopWordsRemover`.

In [None]:
from pyspark.ml.feature import StopWordsRemover
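
The idea behind stop-word removal can be sketched in plain Python. This is an illustration only: the stop list below is a tiny made-up sample (Spark ships a much longer English list), and the case-insensitive comparison is an assumption that mirrors the remover's default setting.

```python
# Tiny illustrative stop list; the real English list is much longer.
STOP_WORDS = {"a", "an", "the", "and", "for", "in", "of", "to", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]

# Hypothetical token list, not taken from the dataset.
filtered = remove_stop_words(
    ["we", "are", "looking", "for", "an", "engineer", "in", "London"]
)
print(filtered)
```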

Generate n-grams with $n = 2$ and $n = 3$ (see the `NGram` transformer in the `pyspark.ml.feature` module). After that you can experiment with concatenating the resulting columns (e.g. words and 3-grams, or words with 2-grams and 3-grams). You can use the function in the cell below to concatenate lists.

In [None]:
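from itertools import chain
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

def concat(element_type):
    # Build a UDF that merges several array columns into one array of element_type
    def concat_(*args):
        return list(chain(*args))
    return udf(concat_, ArrayType(element_type))

# UDF for concatenating columns of string arrays (e.g. words and n-grams)
concat_string_arrays = concat(StringType())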
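
As a plain-Python illustration of this step, the sketch below builds 2-grams and 3-grams from a token list and concatenates them with the words, using the same `chain` idea as the UDF above. The helper `ngrams` and the sample tokens are hypothetical. Each n-gram is written as a space-joined string, which is assumed to match the output format of the Spark transformer.

```python
from itertools import chain

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokens, not taken from the dataset.
words = ["senior", "java", "developer", "london"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

# Concatenate the three lists into a single feature list.
combined = list(chain(words, bigrams, trigrams))
print(combined)
```

A token list shorter than `n` simply produces no n-grams.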

## Counting TF-IDF features

Use the hashing trick and IDF to compute features for the training dataset. An appropriate number of features for this dataset is about 2000. You can experiment with different values.


<b>Note.</b> Remember to keep the fitted IDF model so that you can apply it to the test dataset.
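
The two stages can be sketched in plain Python to show why the IDF weights must be fitted once on the training data and then reused. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the hash function (Spark uses a different one), the smoothed formula $\log\frac{m + 1}{df + 1}$ is used for IDF, and all function names and documents are made up.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 2000  # the feature count suggested in the text

def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash for illustration; it only needs to be deterministic.
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Hashing trick: count how many tokens fall into each bucket."""
    return Counter(bucket(token, num_features) for token in tokens)

def fit_idf(tf_vectors, num_features=NUM_FEATURES):
    """Compute one IDF weight per bucket from the training documents only."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for tf in tf_vectors:
        doc_freq.update(tf.keys())
    return [math.log((m + 1) / (doc_freq[j] + 1)) for j in range(num_features)]

def apply_idf(tf, idf):
    """Scale term frequencies by the previously fitted IDF weights."""
    return {j: count * idf[j] for j, count in tf.items()}

# Hypothetical documents, not taken from the dataset.
train_docs = [["java", "developer"], ["python", "developer"]]
train_tf = [hashing_tf(doc) for doc in train_docs]
idf = fit_idf(train_tf)  # keep this: it plays the role of the fitted IDF model

# The test document is transformed with the same idf, never refitted.
test_features = apply_idf(hashing_tf(["java", "analyst"]), idf)
print(test_features)
```

A term that occurs in every training document, such as `developer` here, gets weight zero, while buckets never seen in training get the largest weight.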

In [None]:
from pyspark.ml.feature import HashingTF, IDF, CountVectorizer

In [None]:
# Apply Hashing TF

In [None]:
# Train IDF model

In [None]:
# Transform data with the IDF model

## Fitting model

Split the dataset into training and validation parts (a 90% / 10% split is recommended).
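
A random split of this kind can be sketched in plain Python. The sketch assigns each row independently with probability 0.9 to the training part, so the resulting sizes are only approximately 90% and 10%. Spark's `randomSplit` likewise treats its weights as probabilities rather than exact sizes. The helper name, the seed, and the row values are assumptions made for illustration.

```python
import random

def random_split(rows, train_fraction=0.9, seed=42):
    """Assign each row independently to the training or validation part."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, validation = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            validation.append(row)
    return train, validation

# Hypothetical rows: the integers stand in for dataframe records.
train, validation = random_split(list(range(1000)))
print(len(train), len(validation))
```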

Fit a logistic regression model on the training part. Use about 15 iterations for training.

<b>Hint.</b> Use the regularization parameter (`regParam`) to prevent overfitting.

In [None]:
from pyspark.ml.classification import LogisticRegression

Print the value of the loss function for each iteration. What do you notice about its behaviour?

<b>Hint.</b> Use `summary.objectiveHistory` for this.
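
To get a feeling for what a loss history looks like, here is a minimal plain-Python sketch of L2-regularized logistic regression on a toy one-dimensional dataset. It is not how Spark trains the model: plain gradient descent is swapped in for Spark's optimizer, and the data, learning rate, and regularization strength are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def objective(w, b, xs, ys, reg):
    """Mean log-loss plus an L2 penalty on the weight."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs) + 0.5 * reg * w * w

def fit(xs, ys, reg=0.1, learning_rate=0.5, iterations=15):
    """Gradient descent that records the objective before every update."""
    w, b, history = 0.0, 0.0, []
    n = len(xs)
    for _ in range(iterations):
        history.append(objective(w, b, xs, ys, reg))
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + reg * w
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Toy data (made up): negative x means class 0, positive x means class 1.
w, b, history = fit([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
for iteration, loss in enumerate(history):
    print(iteration, round(loss, 4))
```

With these settings the recorded loss starts at $\log 2$ (all-zero weights) and does not increase from one iteration to the next. The penalty keeps the weight bounded even though the toy data is perfectly separable.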

Apply the model to the validation set.

In [None]:
# Show results in the table

Calculate AUC-ROC for the predictions. For this purpose, you can use `BinaryClassificationEvaluator` from the `pyspark.ml.evaluation` module.
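
As a sanity check on small samples, AUC-ROC can be computed by hand: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below implements this pairwise definition in plain Python. It is an illustration, not the evaluator's algorithm, and the labels and scores are made up.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = auc_roc(labels, scores)
print(auc)
```

Here five of the six positive/negative pairs are ordered correctly, so the result is $5/6$. The pairwise loop is quadratic, so it is only suitable for small checks and not for the full dataset.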

<b>Self-check question:</b>

1. Try to fit the model and make predictions using only the words (without n-grams). Has the result changed?



## Performing test submission

Apply the learned models to the test dataset.

<b>Note!</b> The test dataset will be changed during the test phase. The output of your last cell must be the AUC-ROC score.

In [None]:
# Load dataset

In [None]:
# Transform dataset and calculate AUC-ROC

In [None]:
# Output for the AUC-ROC