# Homework 2: Text mining

This assignment will teach you how to use text data to build predictive models. Before starting this assignment, make sure you read the Trans-American Airlines case study. The data for the case study is in the file "tweets.csv".

## Grade Breakdown

The grade breakdown for this assignment is as follows:
1. **Questions & Code (80\%):** Most of this assignment consists of completing short snippets of code and answering questions within the Jupyter notebook. Questions carry no partial credit: you either earn all the points for a question or none, so be careful in your responses.
2. **Peer evaluation (20\%):** You and your group members must evaluate each other by completing the peer evaluation (the link is in the assignment description in Canvas). **IMPORTANT**: You will receive no credit if you do not complete your peer evaluation as part of your submission, so please be careful. One of the questions in the evaluation is this one: `Did this group member help you submit a better assignment or in less time than what you could have done on your own?` Your grade will depend on the answers of the other group members, and their grades will depend on your answers. These are the possible answers:
   * Great: "Definitely. My assignment is much better or it took me much less time than if I had done it without them." (+10% to grade, or +20% if you are in a group of 2)
   * Acceptable: "To some extent. My assignment is slightly better or it took me slightly less time than if I had done it without them." (+5% to grade, or +10% if you are in a group of 2)
   * Worrisome: "Not really. They did not save me time or help me submit a better assignment, but they gave it an honest try." (+2% to grade, or +4% if you are in a group of 2, and the person who answered this should reach out to the corresponding group member)
   * Unacceptable: "No. And they offered me very little help or no help at all." (+0%, the person who answered this should reach out to the corresponding group member, and the professor will look into it)

## Loading the Data

Before you answer the questions below, let's first load the data that was labeled by your assistant. 

In [None]:
import pandas as pd

df = pd.read_csv("tweets.csv")
df.head()

## Question 1: Target Variable (1 point)

Code a function that takes the whole data set as an input and returns a binary target variable that we should predict to address the problem discussed in the case study.

In [None]:
def get_target_variable(data):
    # YOUR CODE HERE
    raise NotImplementedError()

The output of your function should be a pandas `Series` where values are either `True` or `False`. A `True` value should represent the outcome of interest (e.g., the customer is angry). Check that your function is giving the correct type of output:   

In [None]:
y = get_target_variable(df)
assert isinstance(y, pd.Series)
assert len(y) == len(df)
assert y.dtype == bool
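
As a minimal sketch of the expected output type (not a solution), comparing a pandas column to a value already yields a boolean `Series` of the same length as the data. The toy DataFrame and the column name `label` below are hypothetical and are not taken from `tweets.csv`.

```python
import pandas as pd

# Hypothetical toy data; the column name "label" is an assumption for illustration.
toy = pd.DataFrame({"label": ["angry", "calm", "angry"]})

# Comparing a column to a value returns a boolean Series, one entry per row.
toy_target = toy["label"] == "angry"

print(type(toy_target).__name__, toy_target.dtype)
print(toy_target.tolist())
```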

## Question 2: Building a Predictive Model (1 point)

The following code splits the data into a training set (13,640 tweets) and a holdout set (1,000 tweets). It then transforms the text of each tweet using the bag-of-words technique discussed in Chapter 10. Each possible word that could appear in a tweet is represented as a binary feature that takes a value of 1 if the word is present in the tweet and a value of 0 otherwise. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

text_data = df['text']
text_train, text_holdout, y_train, y_holdout = train_test_split(text_data, y, test_size=1000, random_state=0)

binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit(text_train)
X_binary_train = binary_vectorizer.transform(text_train)
X_binary_holdout = binary_vectorizer.transform(text_holdout)
X_binary_train

As you can see, the matrix that results from transforming the text in the training data consists of 13,640 rows (tweets) and 14,436 features (words)! The output above also shows that the data is being stored in a sparse matrix (as opposed to the typical dense matrix). Given the shape of the matrix, this means there are \~197 million cells that should have values. However, from the above, we can see that only \~218k cells (\~0.1% of the cells) have values! Why is this?

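Each tweet contains only a handful of the 14,436 words in the vocabulary, so almost every cell in its row is zero. To save space, scikit-learn stores the result as a sparse matrix, which keeps only the non-zero values. This saves a great deal of memory and makes model computation much more efficient.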
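
The share of cells that actually hold a value can be checked directly on a sparse matrix. The sketch below uses a hypothetical three-tweet corpus (not the assignment data) and compares the number of stored non-zero cells (`nnz`) with the number of cells a dense matrix of the same shape would need.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate sparsity.
toy_tweets = [
    "flight delayed again",
    "great crew on my flight",
    "lost my bag",
]

toy_matrix = CountVectorizer(binary=True).fit_transform(toy_tweets)

n_rows, n_cols = toy_matrix.shape
total_cells = n_rows * n_cols   # cells a dense matrix would store
stored_cells = toy_matrix.nnz   # non-zero cells the sparse matrix stores
density = stored_cells / total_cells

print(f"shape={toy_matrix.shape}, stored={stored_cells}, density={density:.1%}")
```

The same calculation on `X_binary_train` should reproduce the rough figures quoted above. The toy corpus is far denser than real tweet data because its vocabulary is tiny.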

**Code a function that returns a logistic regression model that is trained and tuned using the training data**. Use `GridSearchCV` with 10 folds to tune the model according to AUC. Use the parameters `solver="liblinear"` and `random_state=42` for the logistic regression. Also, try the following values for the `C` parameter with the cross-validation: 0.01, 0.1, 1, 10. For more information on these parameters, [check out the documentation on logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

*Side note (and practical tip):* When you just want to do a "quick" tune of a regularization parameter (like `C`), it's a common practice to try powers of 10 (e.g., 0.01, 0.1, 1, 10).
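
A grid of powers of 10 does not have to be typed by hand. As a small sketch, `np.logspace` generates evenly spaced exponents, and the result can be placed in a dictionary of the form that `GridSearchCV` accepts for its `param_grid` argument. The variable names here are illustrative.

```python
import numpy as np

# Four powers of 10, from 10**-2 up to 10**1.
C_values = np.logspace(-2, 1, num=4)

# Dictionary mapping the parameter name to the candidate values.
param_grid = {"C": C_values.tolist()}
print(param_grid)
```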

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def get_model(X, y):
    # YOUR CODE HERE
    raise NotImplementedError()

Check that your function is giving the correct type of output:

In [None]:
my_model = get_model(X_binary_train, y_train)
assert isinstance(my_model, LogisticRegression)
assert len(my_model.predict(X_binary_train)) == len(y_train)


## Question 3: Most Predictive Words (0.5 points)

The code below plots the features with the largest positive and negative coefficients in your model. Use it to show the most predictive words. Pick a few words that catch your attention (at least 2 or 3). Why do you think these words are predictive?

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def plot_coefficients(classifier, feature_names, top_features=15):
    coef = classifier.coef_.ravel()
    top_positive_coefficients = np.argsort(coef)[-top_features:]
    top_negative_coefficients = np.argsort(coef)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in coef[top_coefficients]]
    plt.bar(np.arange(2 * top_features), coef[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    # place each tick at the position of its bar (bars are drawn at 0 .. 2 * top_features - 1)
    plt.xticks(np.arange(2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.ylabel("Coefficient size")
    plt.show()
    
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = binary_vectorizer.get_feature_names_out()
plot_coefficients(my_model, feature_names)

YOUR ANSWER HERE

## Question 4: Text Mining Limitations (0.5 points)

For the purposes of this question, suppose we use a threshold of 0.5 to predict the target variable. The following code prints the text of five false positives and five false negatives in the holdout set. Why do you think the model made these mistakes?

In [None]:
predictions = my_model.predict(X_binary_holdout)
false_positives = text_holdout[(predictions == True) & (y_holdout == False)]
false_negatives = text_holdout[(predictions == False) & (y_holdout == True)]
print("===== FALSE POSITIVES")
print(false_positives.head(5).values)
print("===== FALSE NEGATIVES")
print(false_negatives.head(5).values)
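
The code above relies on `predict`, which for a binary logistic regression corresponds to thresholding the predicted probability of the positive class at 0.5. The sketch below checks this on hypothetical toy data (one feature, six rows); it is an illustration rather than part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two classes.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([False, False, False, True, True, True])

toy_model = LogisticRegression(solver="liblinear", random_state=42).fit(X_toy, y_toy)

toy_probs = toy_model.predict_proba(X_toy)[:, 1]   # probability of the True class
thresholded = toy_probs > 0.5                      # explicit 0.5 threshold

# Compare the explicit thresholding with the output of predict().
same = np.array_equal(thresholded, toy_model.predict(X_toy))
print(same)
```

Choosing a different threshold therefore means working with `predict_proba` directly instead of `predict`.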

YOUR ANSWER HERE

## Question 5: Model Evaluation (2 points)

Based on the case study information, choose one of the following measures to evaluate your model:
* Accuracy
* Expected Value
* AUC
* Precision
* Recall

Then, code a function called `get_evaluation` that evaluates the performance of your model according to this measure. This function should receive two parameters: the labeled data in the holdout set and the probability predictions made by your model in the holdout set. To obtain full marks, you must choose the most appropriate evaluation measure and code it correctly. You are only allowed to use standard Python, scikit-learn, pandas, or numpy for this task.

Print the model's performance. Do you think this performance is acceptable from a business perspective? Justify your answer. 
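
As a sketch of how some of these measures differ in their inputs, the example below uses hypothetical toy labels and probabilities. AUC is computed from the probabilities directly, while accuracy, precision, and recall need class predictions, obtained here with an assumed threshold of 0.5. Expected value is not shown because it also requires the costs and benefits described in the case study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical toy labels and predicted probabilities (not assignment data).
toy_y = np.array([1, 0, 1, 0])
toy_scores = np.array([0.9, 0.6, 0.4, 0.2])

# AUC ranks the probabilities; no threshold is needed.
toy_auc = roc_auc_score(toy_y, toy_scores)

# The other measures need hard class predictions (assumed threshold: 0.5).
toy_labels = (toy_scores >= 0.5).astype(int)
toy_accuracy = accuracy_score(toy_y, toy_labels)
toy_precision = precision_score(toy_y, toy_labels)
toy_recall = recall_score(toy_y, toy_labels)

print(toy_auc, toy_accuracy, toy_precision, toy_recall)
```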

In [None]:
def get_evaluation(y, probs):
    # YOUR CODE HERE
    raise NotImplementedError()

Check that your function is giving the right output:

In [None]:
probabilities = my_model.predict_proba(X_binary_holdout)[:, 1]
result = get_evaluation(y_holdout, probabilities)
print(result)

YOUR ANSWER HERE

## Question 6: Using the Model (0.5 points)

Print the top 20 tweets in the holdout set with the highest probability of having a positive value for the target variable. What seems to be the main problem that Trans-American Airlines is facing?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Question 7: Conduct Benchmark (2 points)

There are several options for transforming text into features besides using 1s and 0s to represent the presence or absence of a word. For example, you can use integers to indicate how many times each word appears; the term frequency-inverse document frequency (tf-idf) measure is another popular alternative (see Chapter 10 of the book). The code below shows how to transform text into features using these two other approaches.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# transform text to word counts
count_vectorizer = CountVectorizer()
count_vectorizer.fit(text_train)
X_count_train = count_vectorizer.transform(text_train)

# transform text to tf-idf
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(text_train)
X_tfidf_train = tfidf_vectorizer.transform(text_train)
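
To see how the three representations differ, the sketch below applies each vectorizer to a hypothetical two-document corpus in which the word "late" is repeated. It prints the feature values of the first document under each approach; the corpus and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; "late" appears twice in the first document.
toy_docs = ["late late flight", "flight on time"]

toy_features = {}
for name, vectorizer in [
    ("binary", CountVectorizer(binary=True)),
    ("count", CountVectorizer()),
    ("tfidf", TfidfVectorizer()),
]:
    toy_features[name] = vectorizer.fit_transform(toy_docs).toarray()
    # Columns follow the alphabetical order of the learned vocabulary.
    print(name, sorted(vectorizer.vocabulary_), toy_features[name][0].round(2))
```

The binary version records only that "late" is present, the count version records that it appears twice, and tf-idf additionally gives it more weight than "flight" because "flight" appears in both documents.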

Code a function that benchmarks the following approaches to transform text data into features:
* `CountVectorizer(binary=True)`
* `CountVectorizer()`
* `TfidfVectorizer()`

The function must:
* Learn and tune models using the function `get_model` you coded for Question 2. As before, use the training data to train and tune models.
* Evaluate the models using the function `get_evaluation` you coded for Question 5. As before, use the holdout data to evaluate the models.
* Return a list with the evaluation performance of the three methods. The first element should correspond to `CountVectorizer(binary=True)`, the second element to `CountVectorizer()`, and the third element to `TfidfVectorizer()`.

In [None]:
from sklearn import metrics

def conduct_benchmark(text_train, text_val, y_train, y_val):
    # YOUR CODE HERE
    raise NotImplementedError()

Check that your function is returning a list with 3 numbers. 

In [None]:
benchmark = conduct_benchmark(text_train, text_holdout, y_train, y_holdout)
assert len(benchmark) == 3
print(benchmark)


## Question 8: Interpret Benchmark (0.5 points)

Take a look at the benchmark results. Do you think the difference in performance is large or small, and why? Justify your answer.

In [None]:
pd.DataFrame({"Approach":['binary', 'count', 'tfidf'], "Performance":benchmark})

YOUR ANSWER HERE