# Dataframe assembler

This notebook will:

* Load the original dataset
* Conduct feature engineering for the following columns:
    * Lat/long: possibly add location clusters, plus neighborhood and other derived variables
    * Features: possibly explode the list, then apply k-means clustering into 20 clusters -- kapow!
    * Manager: add a manager score
    * Description: replace with text-analysis features, and add columns counting exclamation marks and other punctuation
* This will generate a dataframe with several 'features' columns (e.g. 'features_description', 'features_manager', etc.)
* We will then combine these columns into a single column of feature vectors; https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html looks very helpful for doing this

* We then split the data: 20% is held out for testing and the remaining 80% is used for cross-validation, in 5 folds of 16% of the full dataset each, to tune the model parameters

    * First model= logistic regression using no engineered features
    * Second model= random forest with no engineered features

    * Third model= logistic regression with engineered features
    * Fourth model= random forest with engineered features
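
The 20% / 5 x 16% split described above can be sketched in plain Python. This is only an illustration under assumptions: the helper name `split_indices`, the fixed seed, and the round-robin fold assignment are hypothetical choices, not something the notebook prescribes.

```python
import random

def split_indices(n_rows, test_frac=0.2, n_folds=5, seed=0):
    """Shuffle row indices, hold out a test set, and cut the rest into CV folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_test = int(round(n_rows * test_frac))
    test_idx = indices[:n_test]
    cv_idx = indices[n_test:]

    # Deal the remaining rows round-robin so fold sizes differ by at most one
    folds = [cv_idx[i::n_folds] for i in range(n_folds)]
    return test_idx, folds

test_idx, folds = split_indices(1000)
print(len(test_idx), [len(f) for f in folds])
```

With 1000 rows this gives 200 test rows and five folds of 160 rows, i.e. each fold is 16% of the full dataset.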


Cross-validation and model comparison are based on log-loss.
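
For reference, multi-class log-loss is the mean negative log of the probability each model assigns to the true class. The sketch below is a plain-Python illustration; the function name, the clipping constant `eps`, and the 0/1/2 class encoding are assumptions made for the example.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Three classes: 0 = low, 1 = medium, 2 = high (an assumed encoding)
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
print(round(log_loss(y_true, probs), 4))
```

Lower is better; a model that always predicts equal probability for all three classes scores log(3), which is a useful baseline when comparing the four models.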
    

In [1]:
# Initialize Spark

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import pyspark.sql.functions as F
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Assemble_Data") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'8g') \
    .getOrCreate()

sc = spark.sparkContext
sqlCtx = SQLContext(sc)

In [2]:
# Import data
train_data_pd = pd.read_json("data/train.json")
train_data_df = sqlCtx.createDataFrame(train_data_pd)

# Feature Engineering

## Lat/long work:

## 'Features' work:

## Manager work:

## Description work:

In [3]:
### I am assuming that I will take a df that has already coded 'interest_level' as 0, 1, or 2. I also assume that no records have been removed.
### Note: we should go through dataframe naming conventions!

# select only the description
train_data_df2 = train_data_df.select("interest_level","description")
# check it works
train_data_df2.groupBy("interest_level").count().show()

+--------------+-----+
|interest_level|count|
+--------------+-----+
|           low|34284|
|          high| 3839|
|        medium|11229|
+--------------+-----+



In [4]:
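# deal with missing values
from pyspark.sql.functions import when, lit, col

def replace(column, value):
    # Keep the original value unless it equals `value`; matches (and nulls) become "none"
    return when(column != value, column).otherwise(lit("none"))

# Blank descriptions show up as empty strings or as runs of spaces of several lengths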

train4 = train_data_df2.withColumn("description", replace(col("description"), '        '))
train4 = train4.withColumn("description", replace(col("description"), ""))
train4 = train4.withColumn("description", replace(col("description"), " "))
train4 = train4.withColumn("description", replace(col("description"), "           "))


train4.show(2)

+--------------+--------------------+
|interest_level|         description|
+--------------+--------------------+
|        medium|A Brand New 3 Bed...|
|           low|                none|
+--------------+--------------------+
only showing top 2 rows
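
The four `replace` calls above each target one specific run of spaces. A whitespace-insensitive check would cover any length in one pass. The plain-Python sketch below shows the idea; `normalize_description` and the sample strings are hypothetical and made up for the illustration.

```python
def normalize_description(text, placeholder="none"):
    """Return the placeholder for missing or whitespace-only descriptions."""
    if text is None or text.strip() == "":
        return placeholder
    return text

# Made-up sample values, not rows from the dataset
samples = ["Sunny two bedroom", "", " ", "        ", None]
print([normalize_description(s) for s in samples])
```

In Spark the same idea could probably be expressed by wrapping the column in `trim` from `pyspark.sql.functions` inside the existing `when(...).otherwise(...)` expression, replacing the four separate calls with one.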



In [5]:
# I will use a count vectorizer for this part. For the presentation, I'll simply show that we compared several methods
# (count vectorizer, hashing, word2vec) and used only the training set to decide which one would be better.

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression

# regular expression tokenizer: split on \W, i.e. any non-word character
# (anything other than a letter, digit, or underscore)
regexTokenizer = RegexTokenizer(inputCol="description", outputCol="words", pattern="\\W")

# stop words (RegexTokenizer lowercases tokens by default, so only lowercase forms are needed)
add_stopwords = ["a", "the", "it", "of", "is", "and", "this", "in", "for"]
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)

# bag of words count
countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10000, minDF=5)
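
To make the three stages concrete, here is a plain-Python sketch of what they do conceptually: split on non-word characters, drop stop words, then count terms against a vocabulary limited by size and minimum document frequency. It is not the Spark implementation. The helper names and the sample sentences are hypothetical, and the alphabetical tie-break in the vocabulary ordering is an assumption.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "it", "of", "is", "and", "this", "in", "for"}

def tokenize(text):
    """Lowercase, split on non-word characters, and drop empty tokens."""
    return [tok for tok in re.split(r"\W", text.lower(), flags=re.ASCII) if tok]

def build_vocab(docs, vocab_size=10000, min_df=1):
    """Keep terms appearing in at least min_df documents, most frequent first."""
    term_freq, doc_freq = Counter(), Counter()
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-term_freq[t], t))  # ties broken alphabetically (assumption)
    return {term: i for i, term in enumerate(kept[:vocab_size])}

def vectorize(tokens, vocab):
    """Sparse bag-of-words: {vocabulary index: count}."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))

# Made-up sample descriptions
raw = ["Top top floor, in the West Village!", "Top floor and a new kitchen"]
docs = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in raw]
vocab = build_vocab(docs, min_df=2)
print(docs)
print(vocab)
print([vectorize(d, vocab) for d in docs])
```

With `min_df=2` only "top" and "floor" survive, because every other term appears in a single document. This is the same filtering role that `minDF=5` plays in the `CountVectorizer` above.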


In [6]:
from pyspark.ml import Pipeline
from pyspark.ml.linalg import DenseVector
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType


pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(train4)
dataset = pipelineFit.transform(train4)
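# NOTE: interest_level still holds the strings 'low', 'medium' and 'high' here, so this cast
# yields null for every label (see the output below); the strings must be mapped to integers first.
dataset = dataset.withColumn("label", dataset["interest_level"].cast(IntegerType()))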
dataset.show(4)

+--------------+--------------------+--------------------+--------------------+--------------------+-----+
|interest_level|         description|               words|            filtered|            features|label|
+--------------+--------------------+--------------------+--------------------+--------------------+-----+
|        medium|A Brand New 3 Bed...|[a, brand, new, 3...|[brand, new, 3, b...|(10000,[0,1,3,5,6...| null|
|           low|                none|              [none]|              [none]| (10000,[246],[1.0])| null|
|          high|Top Top West Vill...|[top, top, west, ...|[top, top, west, ...|(10000,[0,1,2,3,4...| null|
|           low|Building Amenitie...|[building, amenit...|[building, amenit...|(10000,[0,1,2,3,4...| null|
+--------------+--------------------+--------------------+--------------------+--------------------+-----+
only showing top 4 rows
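
The `label` column is null because `interest_level` contains strings, which cannot be cast to integers. One way to get the 0/1/2 coding assumed earlier is an explicit mapping. The plain-Python sketch below shows the idea; the particular encoding (low = 0, medium = 1, high = 2) and the name `encode_label` are assumptions, since the notebook only says "0, 1, or 2".

```python
# Assumed encoding; the notebook does not state which level maps to which integer
LABEL_MAP = {"low": 0, "medium": 1, "high": 2}

def encode_label(interest_level):
    """Map an interest_level string to its integer class; None if unrecognized."""
    return LABEL_MAP.get(interest_level)

print([encode_label(v) for v in ["medium", "low", "high", "low"]])
```

In Spark, a comparable result could be obtained with a chained `when` expression over `interest_level`, or with `StringIndexer` from `pyspark.ml.feature`. With its default frequency-based ordering and the class counts shown earlier, `StringIndexer` would be expected to assign low, medium, high to 0, 1, 2, though as doubles rather than integers.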



In [None]:
# then we select the features column to combine with the ones created above

# Feature Union
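
Conceptually, the union step concatenates the per-source feature vectors into one vector per row, shifting each block's indices by the total size of the blocks before it. The sketch below shows that bookkeeping in plain Python for sparse vectors. The function `concat_sparse`, the block sizes, and the values are all hypothetical.

```python
def concat_sparse(blocks):
    """Concatenate (size, {index: value}) sparse vectors into one sparse vector."""
    total_size = 0
    merged = {}
    for size, values in blocks:
        for idx, val in values.items():
            if not 0 <= idx < size:
                raise ValueError(f"index {idx} out of range for block of size {size}")
            merged[total_size + idx] = val  # shift by the size of preceding blocks
        total_size += size
    return total_size, merged

# Hypothetical feature blocks for a single row
features_description = (5, {0: 2.0, 3: 1.0})
features_manager = (1, {0: 0.8})
features_latlong = (3, {2: 1.0})

print(concat_sparse([features_description, features_manager, features_latlong]))
```

In Spark ML this job is typically done by `VectorAssembler` from `pyspark.ml.feature`, which takes a list of input columns and writes a single assembled vector column.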