# Python Text Analysis: Part 2 Solutions

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from string import punctuation

%matplotlib inline

In [2]:
# Use pandas to import tweets
tweets_path = '../data/airline_tweets.csv'
tweets = pd.read_csv(tweets_path, sep=',')

## 🥊 Challenge 1: Apply a Text Cleaning Pipeline

Write a function called `preprocess()` that performs the following steps on a text input:
* Step 1: Lowercase text.
* Step 2: Replace the following patterns with placeholders:
    * URLs &rarr; ` URL `
    * Digits &rarr; ` DIGIT `
    * Hashtags &rarr; ` HASHTAG `
    * Tweet handles &rarr; ` USER `
* Step 3: Remove extra whitespace.

Here are some hints to guide you through this challenge! 

* For Step 1, recall from Part 1 that a string method called [`.lower()`](https://docs.python.org/3.11/library/stdtypes.html#str.lower) will do the job of lowercasing text input.
* We have integrated Step 2 into a function called `placeholder`. Run the next cell to import it into your notebook, and then use it like any other function.
* For Step 3, we have provided the regex pattern that matches whitespace characters, as well as the replacement string for extraneous whitespace.

Run your `preprocess()` function on `example_tweet` (three cells below), and when you think you have it working, apply it to the entire `text` column in the tweets DataFrame.

In [5]:
from utils import placeholder


In [None]:
blankspace_pattern = r'\s+'
blankspace_repl = ' '

def preprocess(text):
    '''Create a preprocess pipeline that cleans the tweet data.'''

    # Lowercase
    text = text.lower()

    # Replace patterns with placeholders
    text = placeholder(text)

    # Remove extra whitespaces
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    
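    return text

# Apply the pipeline to the entire `text` column; later cells read `text_processed`
tweets['text_processed'] = tweets['text'].apply(preprocess)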
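
The `placeholder` helper comes from the course's `utils` module, whose source is not shown here. As a rough sketch of what Step 2 could look like, the block below uses simplified regex patterns. These patterns are assumptions, not the course's actual implementation, and the names `placeholder_sketch` and `preprocess_sketch` are hypothetical.

```python
import re

# Assumed, simplified patterns; the real `utils.placeholder` may differ.
URL_PATTERN = r'https?://\S+'
HASHTAG_PATTERN = r'#\w+'
USER_PATTERN = r'@\w+'
DIGIT_PATTERN = r'\d+'


def placeholder_sketch(text):
    '''Hypothetical stand-in for `placeholder`: swap patterns for tags.'''
    # URLs go first so the digits inside them are not tagged separately
    text = re.sub(URL_PATTERN, ' URL ', text)
    text = re.sub(HASHTAG_PATTERN, ' HASHTAG ', text)
    text = re.sub(USER_PATTERN, ' USER ', text)
    text = re.sub(DIGIT_PATTERN, ' DIGIT ', text)
    return text


def preprocess_sketch(text):
    '''Lowercase, insert placeholders, then collapse whitespace.'''
    text = text.lower()
    text = placeholder_sketch(text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = '@VirginAmerica  Flight 236 was great! #happy http://t.co/abc123'
cleaned = preprocess_sketch(sample)
print(cleaned)
```

Lowercasing runs before the placeholders are inserted, so the uppercase tags stay distinguishable from ordinary words in the cleaned text.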

## 🥊 Challenge 2: Lemmatize the Text Input

Recall from Part 1 that we introduced `spaCy` for lemmatization, i.e., reducing each word to its base dictionary form (its lemma) by stripping inflectional affixes. After lemmatization, only the lemmas remain in the text, which presumably capture its core meaning.

Now let's implement lemmatization on our tweet data, and pass the lemmatized text to create a third DTM. 

Complete the function `lemmatize_text`. It takes a text input and returns the same text with every token replaced by its lemma. There are several steps to consider when completing this function:
- Initialize a list to hold the lemmas
- Apply the `nlp()` pipeline to the input text
- Iterate over the tokens in the processed text, and retrieve the lemma of each token
    - HINT: the lemma is one of the linguistic annotations that the `nlp` pipeline returns. We can use `token.lemma_` to access it.

In [None]:
# Import spaCy
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Create a function to lemmatize text
def lemmatize_text(text):
    '''Lemmatize the text input with spaCy annotations.'''
    
    # Apply the nlp pipeline to input text
    doc = nlp(text)

    # Collect the lemma of each token and join them into a single string
    text_lemma = ' '.join([token.lemma_ for token in doc])
    
    return text_lemma

In [None]:
# Apply the function to an example tweet
print(tweets.iloc[101]["text_processed"])
print(f"{'='*50}")
print(lemmatize_text(tweets.iloc[101]['text_processed']))

In [None]:
# This may take a while!
tweets['text_lemmatized'] = tweets['text_processed'].apply(lemmatize_text)

In [None]:
# Print the preprocessed tweet
print(tweets['text_processed'].iloc[101])
print(f"{'='*50}")
# Print the lemmatized tweet
print(tweets['text_lemmatized'].iloc[101])

## 🥊 Challenge 3: Words with Highest Mean TF-IDF Scores

We now have a tf-idf value for each term in each document. What does that tell us about our data? Instead of focusing on the tf-idf value of any particular word, let's take a step back: are any words particularly informative for positive or negative tweets? Let's gather the indices of all positive and negative tweets, and calculate the mean tf-idf scores of the words that appear in each group.

We've provided the following starter code as scaffolding:
- Use boolean masks to select tweets with positive/negative sentiment, retrieve their indices, and assign them to `positive_index`/`negative_index`
- Select the positive/negative tweets in the tf-idf DataFrame, take the mean tf-idf value of each term across those documents, sort the means in descending order, and keep the top 10 terms.

After you've completed the following two cells, plot the words having the highest mean tf-idf scores for each subset. 

In [None]:
# Create a tfidf vectorizer
vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

# Fit and transform 
tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])

# Create a tf-idf dataframe
tfidf = pd.DataFrame(tf_dtm.toarray(),
                     columns=vectorizer.get_feature_names_out(),
                     index=tweets.index)

In [None]:
# Complete the boolean masks 
positive_index = tweets[tweets['airline_sentiment'] == 'positive'].index
negative_index = tweets[tweets['airline_sentiment'] == 'negative'].index

In [None]:
# Complete the following two lines
pos = tfidf.loc[positive_index].mean().sort_values(ascending=False).head(10)
neg = tfidf.loc[negative_index].mean().sort_values(ascending=False).head(10)

In [None]:
pos.plot(kind='barh', 
         xlim=(0, 0.18),
         color='cornflowerblue',
         title='Top 10 terms with the highest mean tf-idf values for positive tweets');

In [None]:
neg.plot(kind='barh', 
         xlim=(0, 0.18),
         color='darksalmon',
         title='Top 10 terms with the highest mean tf-idf values for negative tweets');