<h1>Supervised Learning in a Nutshell</h1>

![](https://github.com/ahmetbulut/CS340withDSX/blob/master/static/Week9/SupervisedLearning.png?raw=true)

<h1>Supervised Classification</h1> 
<h2>
<ul>
<li>
During training, a feature extraction scheme converts each input value into a feature vector. Labeled points, which are pairs of labels and feature vectors, are fed into the machine learning algorithm to generate a model.
</li>
<p>
<li>
During prediction, the same feature extraction scheme converts unseen inputs into feature vectors. These feature vectors are then fed into the model, which generates predicted labels.
</li>
</ul>
</h2>
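
The two phases above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the bag-of-words features, the overlap-based scoring rule, and the names `extract_features`, `train`, and `predict` are hypothetical and are not part of Spark. The point it shows is that training and prediction share the same feature extraction step.

```python
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extraction: map raw text to a bag-of-words feature vector."""
    return Counter(text.lower().split())


def train(labeled_texts):
    """Training: accumulate per-label word counts from (label, text) pairs."""
    model = defaultdict(Counter)
    for label, text in labeled_texts:
        model[label].update(extract_features(text))
    return model


def predict(model, text):
    """Prediction: reuse the same feature extraction, then pick the label
    whose training word counts overlap most with the input's features."""
    features = extract_features(text)
    scores = {
        label: sum(count * word_counts[word] for word, count in features.items())
        for label, word_counts in model.items()
    }
    return max(scores, key=scores.get)


training_data = [
    ("spam", "claim your free prize now"),
    ("spam", "free prize call now"),
    ("ham", "see you at lunch"),
    ("ham", "call me after lunch"),
]
model = train(training_data)

print(predict(model, "free prize inside"))  # spam
print(predict(model, "lunch at noon"))      # ham
```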

![](https://github.com/ahmetbulut/CS340withDSX/blob/master/static/Week9/Supervised-classification.png?raw=true)

<h1>Organization of the dataset for training supervised classifiers</h1> 
<h2>
<ul>
<li>The dataset is divided into two sets: (1) the development set, and (2) the "test" set.</li>
<p>
<li>The development set can be further subdivided into a "training" set and a "dev-test" set.</li>
</ul>
</h2>
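
One possible way to carve up a corpus along these lines is sketched below. The 20% fractions, the fixed seed, and the function name `split_corpus` are assumptions chosen for illustration; the notes do not prescribe specific proportions.

```python
import random


def split_corpus(examples, test_fraction=0.2, devtest_fraction=0.2, seed=0):
    """Split examples into (training, dev-test, test) sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    # First cut: hold out the test set; the rest is the development set.
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    development_set = shuffled[n_test:]

    # Second cut: split the development set into dev-test and training.
    n_devtest = int(len(development_set) * devtest_fraction)
    devtest_set = development_set[:n_devtest]
    training_set = development_set[n_devtest:]
    return training_set, devtest_set, test_set


examples = [("label", "document %d" % i) for i in range(100)]
training_set, devtest_set, test_set = split_corpus(examples)
print(len(training_set), len(devtest_set), len(test_set))  # 64 16 20
```

The dev-test set is used for error analysis while the model is being developed, so the test set stays untouched until the final evaluation.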

![](https://github.com/ahmetbulut/CS340withDSX/blob/master/static/Week9/Corpus-org.png?raw=true)

<h1>ML Pipeline</h1>
<h2>In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a simple text document processing workflow might include several stages:</h2>
<h2><ol>
<li>Split each document’s text into words.</li>
<li>Convert each document’s words into a numerical feature vector.</li>
<li>Learn a prediction model using the feature vectors and labels.</li>
</ol></h2>
<h2>Spark ML represents such a workflow as a <u>Pipeline</u>, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.</h2> 
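
To make the first two stages concrete, here is a plain-Python sketch of tokenization followed by the hashing trick. It is not Spark's implementation: the use of `zlib.crc32` as the hash function and the tiny vector size of 16 are assumptions for illustration, so the bucket indices will not match those produced by Spark's `HashingTF`.

```python
import zlib


def tokenize(text):
    """Stage 1: lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Stage 2: map words to a fixed-length term-frequency vector.

    Each word is hashed to a bucket index; the bucket counts how many
    words landed in it. Distinct words may collide in one bucket.
    """
    vector = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = tokenize("To claim call now to claim")
features = hashing_tf(words)
print(words)
print(features)
```

Because the vector length is fixed in advance, no vocabulary has to be built before the feature vectors can be computed.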

<h1>Pipeline components</h1>

<h2>Transformers</h2>
<h3>
<ol>
<li>A Transformer is an abstraction that includes feature transformers and learned models.
</li>
<p>
<li>
Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns.
</li>
<p>For example: 
<ul>
<li>A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and <u>output a new DataFrame with the mapped column appended</u>.
</li>
<p>
<li>A learned model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and <u>output a new DataFrame with predicted labels appended as a column.</u>
</li>
</ul>
</ol>
</h3>
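
The Transformer contract can be imitated in plain Python, with a list of dicts standing in for a DataFrame. The class `LowercaseTokenizer` below is a hypothetical toy, not a Spark class; it only illustrates the "read one column, append another" behavior of `transform()`.

```python
class LowercaseTokenizer:
    """Toy transformer: reads one column and appends a new one."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Build new rows so that the input rows are left unchanged.
        return [
            dict(row, **{self.output_col: row[self.input_col].lower().split()})
            for row in rows
        ]


rows = [{"text": "Valid 12 hours only"}]
tokenizer = LowercaseTokenizer(input_col="text", output_col="words")
transformed = tokenizer.transform(rows)

print(transformed)
# [{'text': 'Valid 12 hours only', 'words': ['valid', '12', 'hours', 'only']}]
```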

<h2>Estimators</h2>
<h3>
<ol>
<li>An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. 
</li>
<p>
<li>
Technically, <u>an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer</u>.
<p>For example:
<p>
A learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
</li>
</ol>
</h3>
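
The Estimator contract can be sketched the same way. The classes `ThresholdEstimator` and `ThresholdModel` are hypothetical toys, and the learning rule (the midpoint between the two class means of a single numeric feature) is an assumption chosen to keep the example short. What matters is the shape: `fit()` consumes labeled rows and returns a model, and that model exposes `transform()`.

```python
class ThresholdModel:
    """The fitted model: a transformer that appends a prediction column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            dict(row, prediction=1.0 if row["feature"] > self.threshold else 0.0)
            for row in rows
        ]


class ThresholdEstimator:
    """Toy estimator: learns a decision threshold from labeled rows."""

    def fit(self, rows):
        positives = [r["feature"] for r in rows if r["label"] == 1.0]
        negatives = [r["feature"] for r in rows if r["label"] == 0.0]
        positive_mean = sum(positives) / len(positives)
        negative_mean = sum(negatives) / len(negatives)
        return ThresholdModel((positive_mean + negative_mean) / 2.0)


training_rows = [
    {"feature": 1.0, "label": 0.0},
    {"feature": 2.0, "label": 0.0},
    {"feature": 8.0, "label": 1.0},
    {"feature": 9.0, "label": 1.0},
]
model = ThresholdEstimator().fit(training_rows)
print(model.threshold)  # 5.0

predictions = model.transform([{"feature": 3.0}, {"feature": 7.5}])
print([row["prediction"] for row in predictions])  # [0.0, 1.0]
```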

![](https://github.com/ahmetbulut/CS340withDSX/blob/master/static/Week9/ml-Pipeline.png?raw=true)
<hr size="5">
![](https://github.com/ahmetbulut/CS340withDSX/blob/master/static/Week9/ml-PipelineModel.png?raw=true)
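
The relationship between a Pipeline and the model it produces can be imitated with a few toy classes. Everything here (`ToyPipeline`, `ToyPipelineModel`, `Tokenizer`, `SpamWordEstimator`, `SpamWordModel`) is hypothetical plain Python rather than Spark code, and the "spam word" learning rule is an assumption for illustration. The sketch shows that fitting replaces every estimator stage with its fitted model, so the result contains only transformers.

```python
class Tokenizer:
    """Transformer stage: appends a 'words' column."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class SpamWordModel:
    """Fitted model: predicts 1.0 if a row contains any learned spam word."""

    def __init__(self, spam_words):
        self.spam_words = spam_words

    def transform(self, rows):
        return [
            dict(r, prediction=1.0 if self.spam_words & set(r["words"]) else 0.0)
            for r in rows
        ]


class SpamWordEstimator:
    """Estimator stage: learns the words seen only in rows labeled 1.0."""

    def fit(self, rows):
        spam = set(w for r in rows if r["label"] == 1.0 for w in r["words"])
        ham = set(w for r in rows if r["label"] == 0.0 for w in r["words"])
        return SpamWordModel(spam - ham)


class ToyPipelineModel:
    """Result of fitting: a chain of transformers applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> fitted model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


training_rows = [
    {"text": "Win a free prize", "label": 1.0},
    {"text": "See you at lunch", "label": 0.0},
]
pipeline_model = ToyPipeline([Tokenizer(), SpamWordEstimator()]).fit(training_rows)

test_rows = [{"text": "Free prize today"}, {"text": "Lunch today"}]
results = pipeline_model.transform(test_rows)
print([r["prediction"] for r in results])  # [1.0, 0.0]
```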

In [4]:
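# The code was removed by DSX for sharing.
# It is assumed to have defined `path_1`, the location of the tab-separated input file read in the next cell.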

In [6]:
rdd = sc.textFile(path_1)

In [7]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row

In [8]:
# Each input line has the form "<label>\t<text>".
# Split each line once, then map "spam" to label 1.0 and anything else to 0.0.
fields = rdd.map(lambda line: line.split("\t"))
interim = fields.map(lambda f: Row(label=1.0 if f[0] == "spam" else 0.0, text=f[1]))
training = sqlContext.createDataFrame(interim)

In [9]:
# Configure an ML pipeline, which consists of 3 stages: 
# (1) Tokenizer, 
# (2) HashingTF, and 
# (3) LogisticRegression.

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled rows containing only a text column.
test = sqlContext.createDataFrame([
    Row(text="You will get a prize. To claim call 09061701461. Claim code KL341. Valid 12 hours only."),
    Row(text="Even my brother is not like to speak with me. They treat me like aids patent.")])

# Make predictions on test documents; the columns of interest are printed in the cells below.
prediction = model.transform(test)



In [11]:
prediction.limit(2).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|                text|               words|            features|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|You will get a pr...|[you, will, get, ...|(262144,[97,1569,...|[-0.4008797488979...|[0.40110098912445...|       1.0|
|Even my brother i...|[even, my, brothe...|(262144,[26,3370,...|[6.28599076167609...|[0.99814125033600...|       0.0|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
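
The numbers in the table are consistent with a binary logistic regression in which the first entry of `probability` is the logistic (sigmoid) function of the first entry of `rawPrediction`, and the predicted label is the class with the higher probability. The check below is a plain-Python sketch under that assumption; the raw scores are copied from the truncated values displayed above, so the results agree only to the digits shown.

```python
import math


def sigmoid(x):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# First entry of rawPrediction for each of the two test rows (truncated).
raw_scores = [-0.4008797488979, 6.28599076167609]

for raw in raw_scores:
    p_class0 = sigmoid(raw)  # probability of label 0.0 (not spam)
    prediction = 0.0 if p_class0 >= 0.5 else 1.0
    print(round(p_class0, 4), prediction)

# 0.4011 1.0
# 0.9981 0.0
```

The first message is classified as spam by a fairly narrow margin (about 0.40 versus 0.60), while the second is classified as not spam with high confidence.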



In [12]:
selected = prediction.select("text", "prediction")
for row in selected.collect():
    print(row)

Row(text=u'You will get a prize. To claim call 09061701461. Claim code KL341. Valid 12 hours only.', prediction=1.0)
Row(text=u'Even my brother is not like to speak with me. They treat me like aids patent.', prediction=0.0)
