# Assignment 08: 
# Text Mining and Regression using Dimensionality Reduction Methods [_/100 Marks]

### Follow These Instructions

Once you are finished, be sure to complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.



#### In this assignment, we will apply dimensionality reduction methods to improve our understanding of text data and to predict the sentiment of a set of texts. The dataset for this assignment comes from the Amazon website and consists of 1,000 reviews that were labeled (by humans) as positive or negative. This application of data science is called [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) and it is widely used across many fields to get automated feedback when opinions are expressed in text. You will also work with dimensionality reduction for classification and regression.
---

In [24]:
import numpy as np
import pandas as pd
import umap
from sklearn.decomposition import PCA, TruncatedSVD
import sklearn.feature_extraction.text as sktext
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from itertools import product

import seaborn as sns
import matplotlib.pyplot as plt
seed = 221113

---
## Task 1: Decomposition of the texts [ / 50 marks]

### Question 1.1 [ / 9 marks]

The dataset comes with the text and a binary variable that represents the sentiment, either positive or negative. Import the data and use sklearn's `TfidfVectorizer` to eliminate accents, special characters, and stopwords. In addition, make sure to eliminate words that appear in fewer than 5% of documents and those that appear in more than 95%. You can also set `sublinear_tf` to `True`. After that, split the data into train and test sets with `test_size = 0.2` and `random_state = seed`. Calculate the [Tf-Idf transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for both train and test. Note that you need to fit and transform the inputs for the train set, but only transform the inputs for the test set. Don't forget to convert the sparse matrices to dense ones after you apply the Tf-Idf transform.
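
To make the document-frequency thresholds concrete, here is a minimal pure-Python sketch on a made-up four-review corpus. It is not the graded solution: the corpus, the helper names `document_frequency`, `filter_vocabulary` and `sublinear_tf`, and the toy threshold of 0.3 (used only because the corpus is so small) are all assumptions for illustration. The comparisons follow the documented behaviour of `min_df` and `max_df`, which ignore terms strictly below or strictly above the given fraction.

```python
import math

# Toy corpus (hypothetical); each review is already lower-cased and tokenised.
docs = [
    ["great", "phone", "battery"],
    ["bad", "phone", "battery"],
    ["great", "phone", "case"],
    ["phone", "broke", "fast"],
]


def document_frequency(corpus):
    """Fraction of documents in which each word appears at least once."""
    counts = {}
    for doc in corpus:
        for word in set(doc):
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(corpus) for word, n in counts.items()}


def filter_vocabulary(corpus, min_df, max_df):
    """Keep words whose document frequency lies in [min_df, max_df]."""
    df = document_frequency(corpus)
    return sorted(w for w, share in df.items() if min_df <= share <= max_df)


def sublinear_tf(count):
    """Sublinear scaling replaces a raw term count tf with 1 + log(tf)."""
    return 1.0 + math.log(count) if count > 0 else 0.0


# "phone" appears in every review, so the upper threshold removes it;
# words seen in a single review fall below the toy lower threshold.
vocab = filter_vocabulary(docs, min_df=0.3, max_df=0.95)
print(vocab)
```

With the assignment's 5% and 95% thresholds and 1,000 reviews, the same logic keeps only words that occur in at least 50 and at most 950 documents.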

In [25]:
# Load the data [ /1 marks]


# Display the first 5 rows [ /1 marks]


In [26]:
# Defining the TfIDFTransformer [ /2 marks]


# Train/test split [ /2 marks]

# Calculate the Tf-Idf transform [ /2 marks]


From here on, you will use the variables `TfIDF_train` and `TfIDF_test` as the input for the different tasks, and the `y_train` and `y_test` labels for each dataset (if required). Print the number of indices in the output using the vectorizer's [`get_feature_names_out()` method](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) (called `get_feature_names()` in scikit-learn versions before 1.0).

In [27]:
# Print the number of indices [ /1 marks]


### Question 1.2 [ / 8 marks]
Now that we have the TfIDF matrix, we can start working on the data. We hope to explore which concepts commonly occur in the text reviews. We can do this using PCA. A PCA transform of the TF-IDF matrix will give us a basis of the text data, each component representing a *concept* or set of words that are correlated. Correlation in text can be interpreted as a relation to a similar topic. Calculate a PCA transform of the training data using the **maximum** number of concepts possible. Make a plot that shows the cumulative explained variance per number of concepts.

In [28]:
# Apply PCA on training data and get the explained variance [ / 3 marks]


# Plotting explained variance with number of concepts [ / 3 marks]


**Written Question:** Exactly how many concepts do we need to explain at least 80% of the variance in the data? [ /2 marks]
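
As a hedged illustration of how such a threshold can be located, the sketch below uses made-up explained-variance ratios (not values from the review data) and finds the smallest number of components whose cumulative share reaches 80%. The names `ratios` and `n_needed` are assumptions for this example only.

```python
import numpy as np

# Hypothetical explained-variance ratios, in decreasing order as PCA reports them.
ratios = np.array([0.35, 0.25, 0.15, 0.10, 0.09, 0.06])

# Running total of explained variance after 1, 2, 3, ... components.
cumulative = np.cumsum(ratios)

# argmax on a boolean array returns the first index where the condition holds;
# adding 1 converts that zero-based index into a component count.
n_needed = int(np.argmax(cumulative >= 0.80)) + 1
print(n_needed, cumulative[n_needed - 1])
```

Note the off-by-one: the index returned by `argmax` is zero-based, while the question asks for a number of concepts.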


In [29]:
# To get the exact index where the variance is above 80%


**Your Answer:**

Here

### Question 1.3 [ / 12 marks]

Let's examine the first three concepts by looking at how much variance they explain and showing the 10 words that are the most important in each of these three concepts (as revealed by the absolute value of the PCA weight in each concept).
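
Ranking words by the absolute value of their weight can be sketched as follows. The vocabulary and the loading vector are invented for illustration, and `top_words` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical vocabulary and one PCA loading vector (one weight per word).
words = np.array(["good", "bad", "phone", "works", "waste", "great"])
weights = np.array([0.50, -0.60, 0.05, 0.20, -0.45, 0.35])


def top_words(weights, words, k):
    """Return the k (word, weight) pairs with the largest absolute weight."""
    # argsort sorts ascending, so reverse it before taking the first k indices.
    order = np.argsort(np.abs(weights))[::-1][:k]
    return list(zip(words[order].tolist(), weights[order].tolist()))


top3 = top_words(weights, words, k=3)
print(top3)
```

Keeping the signed weight next to each word is useful when naming a concept, because words with opposite signs pull the component in opposite directions.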


In [30]:
# Explained variance [ / 2 marks]


In [31]:
# Get 10 most important words for each component [ / 4 marks]


In [32]:
# Words for concept 1 [ / 2 marks]


In [33]:
# Words for concept 2 [ / 1 marks]


In [34]:
# Words for concept 3 [ / 1 marks]



**Written Question:** What is the cumulative variance explained by these three concepts? What would you name each of these concepts? [ / 2 marks]

*Hint: If in a concept you would get the words 'dog', 'cat', 'fish' as the most important ones, you could name the concept 'animals' or 'pets'.*

**Your answer:**

Here

### Question 1.4 [ / 8 marks]

Apply the PCA transformation to the test dataset. Use only the first two components and make a scatter plot of the cases. Distinguish positive and negative cases by plotting each sentiment in a different colour.
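
The key point when applying PCA to test data is that both the centring and the components come from the training set. The NumPy sketch below shows this on random placeholder data. It is an illustration of the idea under that assumption, not the graded solution, and all array names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 5))  # placeholder training matrix
X_test = rng.normal(size=(10, 5))   # placeholder test matrix

# "Fit": learn the mean and the principal directions from the training data only.
train_mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - train_mean, full_matrices=False)
components = Vt[:2]  # first two principal directions, one per row

# "Transform": centre the test data with the TRAINING mean, then project.
Z_test = (X_test - train_mean) @ components.T
print(Z_test.shape)
```

Refitting on the test set would produce a different basis, so the test coordinates would no longer be comparable with the training ones.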


In [35]:
# Apply PCA to the test dataset [ / 2 marks]


# Plot the two different sets of points with different markers and labels [ /4 marks]


**Written Question:** What can we say about where the positive and negative cases lie in our plot? Could we use these concepts to discriminate positive and negative cases? If yes, why? If no, why not? Discuss your findings. [ /2 marks]

**Your answer:**

Here

### Question  1.5 [ / 13 marks]

Repeat the process above, only now using a UMAP projection with two components. Test all combinations of ```n_neighbors=[2, 10, 25]``` and ```min_dist=[0.1, 0.25, 0.5]``` over the train data and choose the projection that you think is best, and apply it over the test data. Use 1000 epochs, a cosine metric and random initialization. If you have more than 8GB of RAM (as in Colab), you may want to set ```low_memory=False``` to speed up computations.

*Hint: [This link](https://stackoverflow.com/questions/16384109/iterate-over-all-combinations-of-values-in-multiple-lists-in-python) may be helpful.*
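
A minimal sketch of enumerating the parameter grid with `itertools.product` (already imported at the top of the notebook). The variable names are assumptions for this example.

```python
from itertools import product

n_neighbors_grid = [2, 10, 25]
min_dist_grid = [0.1, 0.25, 0.5]

# product yields every (n_neighbors, min_dist) pair, varying the last list fastest.
combinations = list(product(n_neighbors_grid, min_dist_grid))

for n_neighbors, min_dist in combinations:
    print(f"n_neighbors={n_neighbors}, min_dist={min_dist}")
```

Each pair would then be passed as `n_neighbors` and `min_dist` when constructing the UMAP projection, and nine pairs fit naturally into a 3 x 3 grid of subplots.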



In [36]:
# Set parameters [ / 4 marks]


# Create plot [ /2 marks]


**Written Question:** Which parameter combination would you choose? [ / 2 marks]

**Your Answer:**

Here

In [37]:
# Choose the parameters that you think are best and apply to test set [ / 2 marks]


# Create plot [ /1 marks]


**Written Question:** How does the plot compare to the PCA one? [ /2 marks]

**Your answer:**

Here

---
## Task 2: Benchmarking predictive capabilities of the compressed data [ /24 marks]

For this task, we will benchmark the predictive capabilities of the compressed data against the original one. 



### Question 2.1 [ /6 marks]
Train a regularized logistic regression over the original TfIDF train set (with no compression) using l2 regularization. Calculate the AUC score and plot the ROC curve for the original test set. Use the training/test split created in Q1.1.
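
To clarify what the AUC score measures, here is a small pure-Python sketch: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The helper `pairwise_auc` and the four toy cases are assumptions for illustration. In the notebook itself, `roc_auc_score` (imported above) is the function to use.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a common scale for comparing the three models in this task.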

In [38]:
# Train and test using model LogisticRegressionCV [ /4 marks]

# Define the model


# Fit on the training dataset


# Apply to the test dataset




# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points


# Save the AUC in a variable to display it. Round it first


# Create and show the plot


### Question 2.2 [ /8 marks]
Train a regularized logistic regression over an SVD-reduced dataset (with 10 components) using l2 regularization. Calculate the AUC score and plot the ROC curve for the SVD-transformed test set.

In [39]:
# Apply SVD first [ / 3 marks]


#Train and test using model LogisticRegressionCV [ /3 marks]



# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points


# Save the AUC in a variable to display it. Round it first


# Create and show the plot


### Question 2.3 [ /8 marks]
Train a regularized logistic regression over the UMAP-reduced dataset (with 10 components using the same parameters as Task 1.5) using l2 regularization. Calculate the AUC score and plot the ROC curve for the UMAP-transformed test set.

In [40]:
# Apply UMAP first [ / 3 marks]


#Train and test using model LogisticRegressionCV [ /3 marks]


# Plot ROC curve and compute AUC score [ /2 marks]
# Calculate the ROC curve points


# Save the AUC in a variable to display it. Round it first


# Create and show the plot


### Question 2.4 [ /2 marks]
**Written Question:** Compare the performance of the three models. Which one is the best? [ / 2 marks]

**Your Answer:**

Here

---
## Task 3: PCA + Hockey [ / 26 marks]
Connor Andrew McDavid is a Canadian professional ice hockey player and captain of the Edmonton Oilers of the National Hockey League (NHL). The data file `Hockey_sample.csv` provides a reduced version of Connor's game-by-game career data. Each row represents the stats of one game. The dataset has the following attributes:

|#| Attribute | Description |
| --- | --- | --- |
|0|`opposingTeam`|The team the player played against.|
|1|`home_or_away`|Whether a game was played home or away.|
|2|`icetime`|Log10 of total time the player played in seconds.|
|3|`gameScore`|Game score rating.|
|4|`I_F_primaryAssists`|Primary Assists the player has received on teammates' goals.|
|5|`I_F_secondaryAssists`|Secondary Assists the player has received on teammates' goals.|
|6|`log10_I_F_shotAttempts`|Log10 of shot attempts. Includes player's shots on goal, missed shots, and blocked shot attempts.|
|7|`I_F_goals`|Number of goals the player scored.|
|8|`I_F_rebounds`|Rebound shot attempts. These must occur within 3 seconds of a previous shot.|
|9|`I_F_reboundGoals`|Goals from rebound shot attempts.|
|10|`I_F_freeze`|Number of puck freezes by goalies after the player's unblocked shot attempts.|
|11|`I_F_playContinuedInZone`|Number of times the play continues in the offensive zone after the player's shot besides an immediate rebound shot.|
|12|`I_F_playContinuedOutsideZone`|Number of times the play goes outside the offensive zone after the player's shot.|
|13|`I_F_savedShotsOnGoal`|Number of the player's unblocked shots that were saved by the goalie.|
|14|`I_F_savedUnblockedShotAttempts`|Number of the player's unblocked shots that were saved by the goalie or missed the net.|
|15|`I_F_penalityMinutes`|Number of penalty minutes the player has received.|
|16|`log10_I_F_faceOffsWon`|Log10 of number of faceoffs the player has won.|
|17|`I_F_hits`|Number of hits the player has given.|
|18|`I_F_takeaways`|Number of takeaways the player has taken from opponents.|
|19|`I_F_giveaways`|Number of giveaways the player has given to other team.|
|20|`I_F_lowDangerGoals`|Goals from low danger shots.|
|21|`I_F_mediumDangerGoals`|Goals from medium danger shots.|
|22|`I_F_highDangerGoals`|Goals from high danger shots.|
|23|`I_F_unblockedShotAttempts`|All shot attempts that weren't blocked.|
|24|`I_F_dZoneGiveaways`|Giveaways in the team's defensive zone.|
|25|`penalityMinutesDrawn`|Number of penalty minutes the player has drawn.|
|26|`penaltiesDrawn`|Number of penalties the player has drawn.|

### Question 3.1 [ / 6 marks]

Drop categorical attributes, standardize numerical ones, and finally, with "icetime" as your target create the matrix of predictors and target vector, calling them `X1` and `y`, respectively. What is the `shape` of `X1` and `y`?

Hint: they should be as follows:

Shape of X1: (2725, 24)

Shape of y: (2725,)
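
Standardizing a numerical column means subtracting its mean and dividing by its standard deviation. The NumPy sketch below shows the arithmetic on a made-up 4 x 2 matrix. It is an illustration only, and the array names are assumptions.

```python
import numpy as np

# Placeholder numeric matrix: 4 games, 2 attributes on very different scales.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_std = (X - col_mean) / col_std
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After this step every column has mean 0 and standard deviation 1, so no attribute dominates the principal components merely because of its units.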

In [41]:
#

### Question 3.2 [ / 10 marks]

Use a 15-component regular PCA to transform `X1` and create the scree plot. Let $p$ be the **minimum** number of PCs required in order to capture at least 80% of total variance. What would be the value of $p$? Reduce the dimension of `X1` to $p$ and call this new array `X2` (retain `X1` intact though, we need it for later).

In [42]:
#

### Question 3.3 [ / 10 marks]

Now that you have 2 different design matrices (*i.e.*, `X1` and `X2`), let's try two different scenarios: train a simple linear regression (with default arguments) once using `X1`, and another time using the combination of `X1` and `X2` (*i.e.*, concatenate them). Use cross-validation with RMSE as the error measure to identify the best model among the two. Report the cross-validation RMSE along with their CIs for both models.

(For the cross-validation, do five-fold shuffled. For train/test split, use sklearn's default value for test set size.)
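
One common way to report a confidence interval from five fold scores is a Student-t interval around their mean. The sketch below assumes that approach. The fold RMSE values are placeholders rather than results from the hockey data, `mean_with_ci` is a hypothetical helper, and the interval treats the folds as independent, which is only an approximation.

```python
from math import sqrt
from statistics import mean, stdev

# Placeholder per-fold RMSE values (NOT computed from the assignment data).
fold_rmse = [0.052, 0.048, 0.055, 0.050, 0.045]

# Two-sided 95% Student-t critical value for 5 folds (4 degrees of freedom).
T_CRIT_DF4 = 2.776


def mean_with_ci(values, t_crit):
    """Return (mean, lower, upper) using the standard error of the mean."""
    centre = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return centre, centre - half_width, centre + half_width


rmse_mean, ci_low, ci_high = mean_with_ci(fold_rmse, T_CRIT_DF4)
print(f"RMSE = {rmse_mean:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
```

If the intervals of the two models overlap heavily, the cross-validation evidence for preferring one over the other is weak.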

In [43]:
#

---
$$The End$$