# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>
- TAs: Tong Zeng <tozeng@syr.edu>, Priya Matnani <psmatnan@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answer. Think about cases where your code should still run correctly even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [1]:
# Load the packages needed for this part
# create spark and sparkcontext objects
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

import pyspark
from pyspark.ml import feature, regression, Pipeline, classification, pipeline, evaluation
from pyspark.sql import functions as fn, Row
from pyspark import sql

import matplotlib.pyplot as plt
import pandas as pd

# Part 2

Imagine that you are an avid fan of the Chicago Cubs. Somehow, you ended up on two email lists: one about baseball (which you are interested in) and one about hockey (which you are not that interested in). You will use the power of data science to create a classifier that takes an email as input and predicts whether the email is about baseball or hockey.

The dataset will be in `email_df`

In [2]:
email_df = spark.read.json('emails.json')

In [3]:
# explore the data a bit
print("Baseball email\n============================================")
print(email_df.where('target == "rec.sport.baseball"').first().email)
print("Hockey email\n============================================")
print(email_df.where('target == "rec.sport.hockey"').first().email)

Baseball email
From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb

# Question 2.1

Add a `topic` column to `email_df` to be 1 if `target` is `rec.sport.baseball` and 0 if it is `rec.sport.hockey` and store the result in `email2_df`

In [4]:
email2_df = email_df.withColumn('topic',(fn.col('target')=='rec.sport.baseball').cast('int'))

In [5]:
# check your results
email2_df.show()

+--------------------+--------+------------------+-----+
|               email|email_id|            target|topic|
+--------------------+--------+------------------+-----+
|From: dougb@comm....|       1|rec.sport.baseball|    1|
|From: gld@cunixb....|       2|  rec.sport.hockey|    0|
|From: rudy@netcom...|       3|rec.sport.baseball|    1|
|From: monack@heli...|       4|  rec.sport.hockey|    0|
|Subject: Let it b...|       5|rec.sport.baseball|    1|
|From: mmb@lamar.C...|       6|  rec.sport.hockey|    0|
|From: fierkelab@b...|       7|rec.sport.baseball|    1|
|From: rvpst2+@pit...|       8|  rec.sport.hockey|    0|
|From: smorris@ven...|       9|  rec.sport.hockey|    0|
|From: richard@amc...|      10|  rec.sport.hockey|    0|
|From: brifre1@ac....|      11|  rec.sport.hockey|    0|
|From: dwk@cci632....|      12|  rec.sport.hockey|    0|
|From: cub@csi.jpl...|      13|rec.sport.baseball|    1|
|From: golchowy@al...|      14|  rec.sport.hockey|    0|
|From: krattige@hp...|      15|

In [6]:
email2_df.count()

1197

In [7]:
# (5 pts)
np.testing.assert_array_equal(
    email2_df.groupBy('topic').count().orderBy('topic').rdd.map(lambda x: x['count']).collect(),
    [600, 597]
)

# Question 2.2: tfidf feature engineering
Create a pipeline that combines a `Tokenizer`, a `CountVectorizer`, and an `IDF` estimator to compute the tfidf vectors of each email. Fit this pipeline and assign the fitted pipeline (a transformer) to a variable `tfidf_pipeline`. The `Tokenizer` step should create a column `words`, the `CountVectorizer` step should create a column `tf`, and the `IDF` step should create a column `tfidf`. **Use the default parameters of all the estimators.**
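
To make the three stages concrete, here is a minimal pure-Python sketch of what they compute on a toy corpus. It assumes the behaviour described in the Spark ML documentation: `Tokenizer` lowercases the text and splits it on whitespace, `CountVectorizer` counts term occurrences per document, and `IDF` uses the smoothed formula $\log\frac{m+1}{df+1}$, where $m$ is the number of documents and $df$ is the number of documents containing the term. The toy documents and the helper names `tokenize` and `idf_weights` are made up for illustration and are not part of the assignment.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from emails.json)
docs = [
    "From: fan Subject: Cubs pitching",
    "From: fan Subject: NHL playoff",
    "From: coach Subject: pitching drills",
]

def tokenize(text):
    # Mirrors Tokenizer: lowercase, then split on whitespace
    return text.lower().split()

def idf_weights(tokenized_docs):
    # Smoothed IDF: log((m + 1) / (df + 1))
    m = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

words = [tokenize(d) for d in docs]   # analogous to the `words` column
tf = [Counter(doc) for doc in words]  # analogous to the `tf` column
idf = idf_weights(words)
# analogous to the `tfidf` column: term count times IDF weight
tfidf = [{t: c * idf[t] for t, c in doc.items()} for doc in tf]

print(tfidf[0])
```

Note that a token appearing in every document, such as `from:` here, gets an IDF of zero and therefore contributes nothing to the tfidf vector.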

In [8]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

# Use lowercase instance names so the imported classes are not shadowed
# (rebinding `Tokenizer` or `IDF` to an instance breaks the cell when it is re-run)
tokenizer = Tokenizer().setInputCol('email').setOutputCol('words')
count_vectorizer = CountVectorizer().setInputCol('words').setOutputCol('tf')
idf_estimator = IDF().setInputCol('tf').setOutputCol('tfidf')

# Fit the three stages; the result is a PipelineModel (a transformer)
tfidf_pipeline = Pipeline(stages=[tokenizer, count_vectorizer, idf_estimator]).fit(email2_df)
#tfidf_pipeline.transform(email2_df).show()

In [9]:
# (5 pts)
np.testing.assert_array_equal([type(s) for s in tfidf_pipeline.stages],
                              [feature.Tokenizer, feature.CountVectorizerModel, feature.IDFModel])
np.testing.assert_array_equal(len(tfidf_pipeline.transform(email2_df).collect()), 1197)

**(5 pts)** Investigate the fitted pipeline above and create a variable `lowest_idf` that contains the set of the 5 words with the lowest IDF. **Hint: you must extract the vocabulary from the fitted `CountVectorizer` and the IDF values from the fitted `IDF`, both in the stages of `tfidf_pipeline`. You can put both lists into the columns of a Pandas dataframe and sort by idf, picking the first 5 after sorting.**
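
The idea behind the hint can be sketched without Spark or Pandas: the vocabulary and the IDF values are aligned by index, so pairing them and sorting by IDF gives the words that occur in the most documents first. The vocabulary and IDF values below are hypothetical placeholders, not the values from the fitted pipeline, and `lowest_idf_demo` is an illustrative name.

```python
# Hypothetical vocabulary and IDF values, aligned by index in the same way
# as the vocabulary of the fitted CountVectorizer and the IDF vector
vocabulary = ["from:", "subject:", "the", "lines:", "to", "pitching", "nhl"]
idf_values = [0.0, 0.0, 0.05, 0.01, 0.12, 2.3, 2.9]

# Pair each IDF value with its word and sort ascending by IDF
pairs = sorted(zip(idf_values, vocabulary))

# Keep the words belonging to the 5 smallest IDF values, as a set
lowest_idf_demo = {word for _, word in pairs[:5]}
print(lowest_idf_demo)
```

A low IDF means the word appears in almost every email, so header tokens and very common words are the likely candidates.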

In [10]:
# The vocabulary (from the fitted CountVectorizer) and the IDF vector
# (from the fitted IDF) are aligned by index
vocabulary = tfidf_pipeline.stages[1].vocabulary
idf_values = tfidf_pipeline.stages[2].idf.toArray()
idf_df = pd.DataFrame({'word': vocabulary, 'idf': idf_values})
# Sort ascending by IDF and keep the 5 words with the lowest values
lowest_idf = set(idf_df.sort_values(by='idf').head(5)['word'])
#lowest_idf

In [11]:
# (5 pts)
# it is a set
np.testing.assert_equal(type(lowest_idf), set)
# it has 5 elements
np.testing.assert_equal(len(lowest_idf), 5)
# each element is a string
np.testing.assert_equal({type(w) for w in lowest_idf}, {str})

# Question 2.3: Compare models

Using the following splits:

In [12]:
training_df = email2_df.where('email_id < 1197*0.6')
validation_df = email2_df.where('email_id >= 1197*0.6 and email_id < 1197*0.9')
testing_df = email2_df.where('email_id >= 1197*0.9')

In [13]:
[training_df.count(), validation_df.count(), testing_df.count()]

[718, 359, 120]
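
The split sizes follow directly from the thresholds. A small sketch of the arithmetic, assuming `email_id` takes every integer value from 1 to 1197 exactly once (consistent with the table and the count shown earlier, but still an assumption):

```python
n = 1197
# Assumption: email_id takes every integer value from 1 to n exactly once
email_ids = range(1, n + 1)

# Same conditions as the `where` clauses: thresholds at 718.2 and 1077.3
training = [i for i in email_ids if i < n * 0.6]
validation = [i for i in email_ids if n * 0.6 <= i < n * 0.9]
testing = [i for i in email_ids if i >= n * 0.9]

print([len(training), len(validation), len(testing)])  # [718, 359, 120]
```

Because the split is by `email_id` rather than random, it is reproducible, and the three sets do not overlap.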

**(5 pts)** Create pipelines where the first stage is the `tfidf_pipeline` created above and the second stage is a `LogisticRegression` model to predict `topic` using different regularization parameters ($\lambda$) and elastic net mixtures ($\alpha$). *Only change the regularization parameters and leave all other parameters of `LogisticRegression` at their defaults*. Fit those pipelines to the appropriate data split.

1. Logistic regression with $\lambda=0$ and $\alpha=0$ (assign the fitted pipeline to `lr_pipeline1`)
2. Logistic regression with $\lambda=0.02$ and $\alpha=0.2$ (assign the fitted pipeline to `lr_pipeline2`)
3. Logistic regression with $\lambda=0.1$ and $\alpha=0.4$ (assign the fitted pipeline to `lr_pipeline3`)
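
For intuition about the two parameters: in Spark ML, $\lambda$ corresponds to `regParam` and $\alpha$ to `elasticNetParam`, and the penalty added to the loss is documented as $\lambda\left(\alpha\lVert w\rVert_1 + \frac{1-\alpha}{2}\lVert w\rVert_2^2\right)$. The sketch below evaluates that penalty for the three settings on a hypothetical weight vector; `elastic_net_penalty` and the weights are illustrative only.

```python
def elastic_net_penalty(weights, lam, alpha):
    # lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

# Hypothetical weight vector, only for illustration
w = [0.5, -1.0, 0.0, 2.0]

# (lambda, alpha) pairs from the three models listed above
settings = {
    "lr_pipeline1": (0.0, 0.0),
    "lr_pipeline2": (0.02, 0.2),
    "lr_pipeline3": (0.1, 0.4),
}

penalties = {name: elastic_net_penalty(w, lam, alpha)
             for name, (lam, alpha) in settings.items()}
for name, value in penalties.items():
    print(name, round(value, 4))
```

With $\lambda=0$ the model is unregularized; larger $\lambda$ shrinks the weights more, and larger $\alpha$ shifts the penalty toward the L1 term, which tends to push some weights to exactly zero.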

In [14]:
# create lr_pipeline1, lr_pipeline2, and lr_pipeline3
from pyspark.ml.classification import LogisticRegression
lambda_par = 0
alpha_par = 0
en_lr1= LogisticRegression().setLabelCol('topic').setFeaturesCol('tfidf').\
        setRegParam(lambda_par).setMaxIter(100).setElasticNetParam(alpha_par)
en_lr2= LogisticRegression().setLabelCol('topic').setFeaturesCol('tfidf').\
         setRegParam(0.02).setMaxIter(100).setElasticNetParam(0.2)
en_lr3= LogisticRegression().setLabelCol('topic').setFeaturesCol('tfidf').\
        setRegParam(0.1).setMaxIter(100).setElasticNetParam(0.4)

lr_pipeline1 = Pipeline(stages=[tfidf_pipeline,en_lr1]).fit(training_df)
lr_pipeline2 = Pipeline(stages=[tfidf_pipeline,en_lr2]).fit(training_df)
lr_pipeline3 = Pipeline(stages=[tfidf_pipeline,en_lr3]).fit(training_df)

In [15]:
# (10 pts)
np.testing.assert_equal(type(lr_pipeline1), pipeline.PipelineModel)
np.testing.assert_equal(type(lr_pipeline2), pipeline.PipelineModel)
np.testing.assert_equal(type(lr_pipeline3), pipeline.PipelineModel)
np.testing.assert_array_equal([type(s) for s in lr_pipeline1.stages],
                              [pipeline.PipelineModel, classification.LogisticRegressionModel])
np.testing.assert_array_equal([type(s) for s in lr_pipeline2.stages],
                              [pipeline.PipelineModel, classification.LogisticRegressionModel])
np.testing.assert_array_equal([type(s) for s in lr_pipeline3.stages],
                              [pipeline.PipelineModel, classification.LogisticRegressionModel])

**(5 pts)** Use the evaluator object defined below to compute the area under the curve of your predictors. For example, to compute the area under the curve of pipeline 1 for a dataframe `df`, you would run

```python
evaluator.evaluate(lr_pipeline1.transform(df))
```

**You must choose the right data split (dataframe `df`) with the goal of comparing models.**

Assign the AUC of the three models to the variables `AUC1`, `AUC2`, and `AUC3`, and assign the pipeline with the best model to a variable `best_model`.

In [16]:
evaluator = evaluation.BinaryClassificationEvaluator(labelCol='topic')

For example, the AUC on training of the first model should be perfect:

```python
evaluator.evaluate(lr_pipeline1.transform(training_df))
```

```console
1.0
```
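
As a reminder of what the number means: the default metric of `BinaryClassificationEvaluator` is the area under the ROC curve, which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The following is a minimal pure-Python sketch of that pairwise definition, not the evaluator's actual implementation; the labels and scores are made up.

```python
def auc(labels, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores
perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
imperfect = auc([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.6, 0.2])
print(perfect, imperfect)
```

A training AUC of 1.0 therefore only says that the unregularized model ranks every training baseball email above every training hockey email; it says nothing yet about unseen emails.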

In [17]:
# print the AUC for the three models as follows
# print("Model 1 AUC: ", evaluator.evaluate(....))
# etc
# finally, based on these, assign the best validated 
# model to a variable best_model
# YOUR CODE HERE
AUC1 = evaluator.evaluate(lr_pipeline1.transform(validation_df))
print("Model 1 AUC: ",AUC1)
AUC2 = evaluator.evaluate(lr_pipeline2.transform(validation_df))
print("Model 2 AUC: ",AUC2)
AUC3 = evaluator.evaluate(lr_pipeline3.transform(validation_df))
print("Model 3 AUC: ",AUC3)
best_model = lr_pipeline1

Model 1 AUC:  0.9937903626428214
Model 2 AUC:  0.988170640834575
Model 3 AUC:  0.972180824639841
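
Rather than reading the best model off the printout, the selection can be made programmatically. A minimal sketch, using the validation AUC values printed above; in the notebook the dictionary could hold the fitted pipelines themselves, and the names `validation_auc` and `best_name` are illustrative.

```python
# Validation AUCs copied from the output above
validation_auc = {
    "lr_pipeline1": 0.9937903626428214,
    "lr_pipeline2": 0.988170640834575,
    "lr_pipeline3": 0.972180824639841,
}

# Pick the key whose validation AUC is largest
best_name = max(validation_auc, key=validation_auc.get)
print(best_name)  # lr_pipeline1
```

The test split is deliberately left out of this comparison so that it can still give an unbiased estimate of generalization performance.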


In [18]:
# (5 pts)
np.testing.assert_array_equal([type(AUC1), type(AUC2), type(AUC3)],
                             [float, float, float])
# AUC less than 1
np.testing.assert_array_less([AUC1, AUC2, AUC3], [1, 1, 1])
# AUC more than 0.5
np.testing.assert_array_less([.5, .5, .5],
                            [AUC1, AUC2, AUC3])

# Question 2.4: Choose best model

Using the right split and the best model selected before, compute the generalization performance and assign it to a variable `AUC_best`

In [19]:
# assign to AUC_best the AUC of the best model selected before
AUC_best = evaluator.evaluate(best_model.transform(testing_df))
AUC_best

0.9943181818181818

In [20]:
# (5 pts)
np.testing.assert_approx_equal(AUC_best, 
                               0.9943181818181818, significant=1)

# Question 2.5: Inference

Use pipeline 3 fitted above (`lr_pipeline3`) to create Pandas dataframes that contain the most negative words and the most positive words. In particular, create a dataframe `positive_words` with the columns `word` and `weight` with the top 20 positive words, sorted by descending coefficient. Similarly, create a `negative_words` Pandas dataframe with the top 20 negative words, where the coefficients are sorted in ascending order. **Hint: follow the `sentiment_analysis.ipynb` notebook in the repo**

In [21]:
# create positive_words and negative_words pandas dataframe below
lr_weights = lr_pipeline3.stages[-1].coefficients.toArray()
lr_coeffs_df = pd.DataFrame({'word': lr_pipeline3.stages[0].stages[1].vocabulary, 'weight': lr_weights})
negative_words = lr_coeffs_df.sort_values('weight').head(20)
positive_words = lr_coeffs_df.sort_values('weight', ascending=False).head(20)
#len(lr_coeffs_df)
positive_words.head()
negative_words.head()

                                  word    weight
816  golchowy@alchemy.chem.utoronto.ca -0.160207
882                               nhl. -0.148710
570                            contact -0.141691
236                            playoff -0.131673
810                           olchowy) -0.129368


The results should be as follows:

`positive_words.head()`
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>word</th>
      <th>weight</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>860</th>
      <td>(edward</td>
      <td>0.123131</td>
    </tr>
    <tr>
      <th>129</th>
      <td>baseball</td>
      <td>0.107927</td>
    </tr>
    <tr>
      <th>991</th>
      <td>players?</td>
      <td>0.092217</td>
    </tr>
    <tr>
      <th>285</th>
      <td>pitching</td>
      <td>0.088141</td>
    </tr>
    <tr>
      <th>969</th>
      <td>fischer)</td>
      <td>0.061178</td>
    </tr>
  </tbody>
</table>

`negative_words.head()`

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>word</th>
      <th>weight</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>816</th>
      <td>golchowy@alchemy.chem.utoronto.ca</td>
      <td>-0.160207</td>
    </tr>
    <tr>
      <th>882</th>
      <td>nhl.</td>
      <td>-0.148710</td>
    </tr>
    <tr>
      <th>570</th>
      <td>contact</td>
      <td>-0.141691</td>
    </tr>
    <tr>
      <th>236</th>
      <td>playoff</td>
      <td>-0.131673</td>
    </tr>
    <tr>
      <th>810</th>
      <td>olchowy)</td>
      <td>-0.129368</td>
    </tr>
  </tbody>
</table>

In [22]:
# (5 pts)
np.testing.assert_equal(set(positive_words.columns), {'weight', 'word'})
np.testing.assert_equal(set(negative_words.columns), {'weight', 'word'})
np.testing.assert_approx_equal(positive_words.weight.sum(), 1.1287686331251567, significant=1)
np.testing.assert_approx_equal(negative_words.weight.sum(), -1.9525975400776723, significant=1)
np.testing.assert_array_less(positive_words.weight.iloc[-1], positive_words.weight.iloc[0])
np.testing.assert_array_less(negative_words.weight.iloc[0], negative_words.weight.iloc[-1])

**(5 pts)** Explain in simple terms what the top three positive words and the top three negative words might indicate about the prediction:

The top three positive words contain symbols such as parentheses and question marks. Similarly, the top three negative words contain parentheses and punctuation. On closer inspection, some of these tokens carry no meaning on their own and cannot be interpreted as positive or negative words. Therefore, there is a high possibility of overfitting. This is plausible because the number of emails is small compared to the number of words.
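
To see what the sign of a weight means for the prediction, recall that logistic regression passes a weighted sum of the features through the sigmoid function. The sketch below uses two weights copied from the expected tables above; the intercept of 0 and the tfidf value of 3.0 are hypothetical, chosen only to show the direction of the effect.

```python
import math

def sigmoid(z):
    # Maps a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Weights copied from the expected tables above
weights = {"baseball": 0.107927, "nhl.": -0.148710}

# Hypothetical values: the real intercept and tfidf values are not shown here
intercept = 0.0
tfidf_value = 3.0

# Probability of topic = 1 (baseball) for an email containing only that word
probs = {word: sigmoid(intercept + w * tfidf_value)
         for word, w in weights.items()}
for word, p in probs.items():
    print(f"{word!r}: P(topic = 1) = {p:.3f}")
```

A positive weight pushes the predicted probability toward baseball (`topic` = 1) and a negative weight toward hockey (`topic` = 0). Tokens such as author names or email addresses can receive large weights simply because one prolific poster writes on one list, which is a form of overfitting to the training emails.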