# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>
- TAs: Tong Zeng <tozeng@syr.edu>, Priya Matnani <psmatnan@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates, but __you are not allowed to copy any part of your answers from your classmates or from the internet__.
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answers. Think about cases your code should still handle correctly, even if it already passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [None]:
# load these packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml import Pipeline, clustering, evaluation, feature, regression
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as fn
# create (or reuse) the Spark session and its context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Part 1. Unsupervised learning

I recommend working through the notebook `unsupervised_learning.ipynb` first; it is shared through the IST 718 repository.

The following dataset contains information about dozens of "data science" programs across the US.

In [None]:
ds_programs_df = spark.read.csv('/datasets/colleges_data_science_programs.csv',
                               inferSchema=True, header=True).\
                 fillna('').orderBy('id')

## Question 1: (10 pts)

This dataset contains many columns that we can use to understand how these data science programs differ from one another. In this question, you will create a dataframe `ds_programs_text_df` which simply adds a column `text` to the dataframe `ds_programs_df`. This column will contain the concatenation of the following columns, separated by a space: `program`, `degree`, and `department` (find the appropriate function in the `fn` package).

In [None]:
# (10 pts) Create ds_programs_text_df here
# YOUR CODE HERE
raise NotImplementedError()

For example, the first row of `ds_programs_text_df` should give you:

```python
ds_programs_text_df.orderBy('id').first().text
```

```console
'Data Science Masters Mathematics and Statistics'
```

In [None]:
# (10 pts)
np.testing.assert_equal(ds_programs_text_df.count(), 222)
np.testing.assert_equal(set(ds_programs_text_df.columns), {'admit_reqs',
 'business',
 'capstone',
 'cost',
 'country',
 'courses',
 'created_at',
 'databases',
 'degree',
 'department',
 'ethics',
 'id',
 'machine learning',
 'mapreduce',
 'name',
 'notes',
 'oncampus',
 'online',
 'part-time',
 'program',
 'program_size',
 'programminglanguages',
 'state',
 'text',
 'university_count',
 'updated_at',
 'url',
 'visualization', 
 'year_founded'})
np.testing.assert_array_equal(ds_programs_text_df.orderBy('id').rdd.map(lambda x: x.text).take(5),
                              ['Data Science Masters Mathematics and Statistics',
 'Analytics Masters Business and Information Systems',
 'Data Science Masters Computer Science',
 'Business Intelligence & Analytics Masters Business',
 'Advanced Computer Science(Data Analytics) Masters Computer Science'])

## Question 2: (10 pts)

The following code creates a pipeline model `pipe_features` which adds a column `features` to `ds_programs_text_df` that contains the mean-centered `tfidf` of the column `text`:

In [None]:
# read-only
pipe_features = \
    Pipeline(stages=[
        feature.Tokenizer(inputCol='text', outputCol='words'),
        feature.CountVectorizer(inputCol='words', outputCol='tf'),
        feature.IDF(inputCol='tf', outputCol='tfidf'),
        feature.StandardScaler(withStd=False, withMean=True, inputCol='tfidf', outputCol='features')]).\
    fit(ds_programs_text_df)
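
To make the contents of the `features` column more concrete, here is a minimal NumPy sketch of the same steps on a hypothetical three-document corpus (the documents, the variable names, and the sorted vocabulary order are assumptions for illustration, not part of the dataset or of Spark's output). It uses the smoothed IDF formula documented for Spark, $\log\frac{m+1}{df+1}$, followed by mean-centering without scaling.

```python
import numpy as np

# Hypothetical toy corpus (not taken from the dataset): 3 tokenized documents.
docs = [["data", "science", "masters"],
        ["analytics", "masters"],
        ["data", "analytics", "masters"]]

# Vocabulary in sorted order (a simplification of how a count vectorizer orders terms).
vocab = sorted({w for d in docs for w in d})

# Term-frequency matrix: one row per document, one column per vocabulary word.
tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)

# Smoothed inverse document frequency: log((m + 1) / (df + 1)).
m = tf.shape[0]
df = (tf > 0).sum(axis=0)
idf = np.log((m + 1) / (df + 1))

tfidf = tf * idf

# Centering only (withMean=True, withStd=False): subtract each column's mean.
features = tfidf - tfidf.mean(axis=0)
```

Note that a word appearing in every document, such as `masters` here, gets an IDF of zero and therefore contributes nothing to `features`.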

Create a pipeline model `pipe_pca` that computes the first two principal components of `features` as computed by `pipe_features` and outputs a column `pc`. Use that pipeline to create a dataframe `ds_features_df` with the columns `id`, `name`, `url`, and `pc`.

In [None]:
# create the pipe_pca PipelineModel below (10 pts)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Tests for (10 pts)
np.testing.assert_equal(pipe_pca.stages[0],  pipe_features)
np.testing.assert_equal(type(pipe_pca.stages[1]),  feature.PCAModel)
np.testing.assert_equal(set(ds_features_df.columns), {'id', 'name', 'pc', 'url'})
np.testing.assert_equal(ds_features_df.first().pc.shape, (2, ))
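
As a conceptual aid for Question 2, the projection onto the first two principal components can be sketched in plain NumPy. This is not the Spark pipeline the question asks for; the random toy matrix and all variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy matrix: 20 rows (observations), 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

# Center each column, as a mean-only standard scaler would.
Xc = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores on the first two components: one 2-vector per observation.
k = 2
pc = Xc @ Vt[:k].T

# Fraction of the total variance captured by each component.
explained = S**2 / np.sum(S**2)
```

Each row of `pc` plays the role of the two-element `pc` vector in `ds_features_df`: the first entry is the coordinate along the first component and the second entry is the coordinate along the second.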

## Question 3: (10 pts)

Create a scatter plot of `ds_features_df` with the first principal component on the x-axis and the second principal component on the y-axis.

In [None]:
# create the scatter plot below (10 pts)
# YOUR CODE HERE
raise NotImplementedError()

## Question 4: (10 pts)

Create two Pandas dataframes `pc1_pd` and `pc2_pd` with the columns `word` and `abs_loading` that contain the top 5 words by absolute loading for principal components 1 and 2, respectively. You can extract the vocabulary from the stage that contains the count vectorizer in `pipe_features`:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
pc1_pd

In [None]:
pc2_pd

In [None]:
# (10 pts)
assert type(pc1_pd) == pd.core.frame.DataFrame
assert type(pc2_pd) == pd.core.frame.DataFrame
np.testing.assert_array_equal(pc1_pd.shape, (5, 2))
np.testing.assert_array_equal(pc2_pd.shape, (5, 2))
np.testing.assert_equal(set(pc1_pd.columns), {'abs_loading', 'word'})
np.testing.assert_equal(set(pc2_pd.columns), {'abs_loading', 'word'})
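
The idea of ranking words by absolute loading can be illustrated on made-up numbers. In the sketch below, the vocabulary, the loading matrix, and the helper `top_words` are all hypothetical; in the assignment, the vocabulary and the loadings would come from the fitted pipeline stages instead.

```python
import numpy as np
import pandas as pd

# Hypothetical vocabulary and loading matrix (rows = words, columns = components).
vocab = ["data", "science", "masters", "analytics", "business", "computer", "statistics"]
loadings = np.array([[ 0.10, -0.60],
                     [-0.70,  0.05],
                     [ 0.02,  0.01],
                     [ 0.55,  0.30],
                     [-0.30,  0.65],
                     [ 0.25, -0.20],
                     [-0.15, -0.28]])

def top_words(component, n=5):
    """Return the n words with the largest absolute loading on one component."""
    table = pd.DataFrame({"word": vocab,
                          "abs_loading": np.abs(loadings[:, component])})
    return (table.sort_values("abs_loading", ascending=False)
                 .head(n)
                 .reset_index(drop=True))

top_pc1 = top_words(0)  # first component (column index 0)
top_pc2 = top_words(1)  # second component (column index 1)
```

Taking the absolute value matters because a large negative loading is as influential as a large positive one; only the sign of the word's contribution differs.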

## Question 5: (10 pts)

Create a new PCA pipeline called `pipe_pca2` in which you fit 50 principal components. Extract the `PCAModel` from the stages of this pipeline, and assign to a variable `explainedVariance` the variance explained by the components of that model. Finally, assign to a variable `best_k` the value $k$ such that the ($k+1$)-th component explains no more than 0.01 of the variance. You can use a for-loop to find this `best_k`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Tests for (10 pts)
np.testing.assert_equal(pipe_pca2.stages[0],  pipe_features)
np.testing.assert_equal(type(pipe_pca2.stages[1]),  feature.PCAModel)
np.testing.assert_equal(len(explainedVariance), 50)
np.testing.assert_array_less(5, best_k)
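
The thresholding rule in Question 5 can be illustrated on a small, made-up array. The values in `explained` and the names `threshold` and `k` are assumptions for illustration only; they are not results from the dataset.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest.
explained = np.array([0.30, 0.20, 0.12, 0.05, 0.02, 0.008, 0.004])
threshold = 0.01

# Default: every component explains more than the threshold.
k = len(explained)
for i, ratio in enumerate(explained):
    if ratio <= threshold:
        # Exactly i components come before this one, so keep the first i.
        k = i
        break
```

With these numbers the loop stops at the sixth component (0.008), so `k` ends up as 5: the first five components each explain more than 0.01 and the sixth does not.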