# lab5d: PCA with text 2/21/19






#### Imagining Communities using texts

1. Using words as features: Words as Vectors in unique dimensions
2. Imagining political communities / political discourse
3. Try this at home: PCA on your own texts



In [1]:
# Some libraries we'll need...

import pandas as pd
from sklearn.decomposition import PCA as sklearnPCA 
import matplotlib.pyplot as plt
import matplotlib 
##import text_analysis as ta 
import numpy as np
import re

## Imagining communities via their textual traces

#### 1. Using words as features
We can use PCA to examine collections of texts. To do this, we treat each word as a feature.

Thus for a corpus with just two texts of two words each, namely, $T_1 = (the, the)$ and $T_2 = (the, moose)$, we would draw:

![Words](fig/Words.jpeg)

For each additional text $T_i$, we'd add a new data point in the plot above. For each additional *word* in the corpus, we'd add a new dimension to the plot above. If our corpus has a million words, our plot above would have a million dimensions. As above, we'll use PCA to reduce a high-dimension data set into a 2-dimension data set that we can easily visualize. 
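
As a minimal sketch of the words-as-features idea, the two toy texts above can be turned into count vectors using only the standard library. The variable names (`texts`, `vocabulary`, `vectors`) are illustrative assumptions and are not used elsewhere in this lab.

```python
from collections import Counter

# The two toy texts from the example above.
texts = {"T1": ["the", "the"], "T2": ["the", "moose"]}

# Every distinct word in the corpus becomes one dimension (feature).
vocabulary = sorted({word for tokens in texts.values() for word in tokens})

# Each text becomes a vector of word counts, one entry per dimension.
vectors = {
    name: [Counter(tokens)[word] for word in vocabulary]
    for name, tokens in texts.items()
}

print(vocabulary)  # ['moose', 'the']
print(vectors)     # {'T1': [0, 2], 'T2': [1, 1]}
```

Adding a third text adds one more entry to `vectors`; adding a new word lengthens every vector by one dimension.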

#### Let's take a small sample of tweets: US politicians (and their offices) for 1 day. 

In [3]:
# Here's a bunch of tweets from senators and their aides on 29 Jan 2017

tweets = pd.read_csv('https://raw.githubusercontent.com/data-ppf/data-ppf.github.io/master/dat/Jan_2017_tweets.csv')
tweets

Unnamed: 0,id,user_id,created_at,tweet_text,hashtag_entities
0,825817400465571841,76649729,2017-01-29 21:26:25,"RT @GOPHELP: Betsy DeVos says in letter - ""I a...",{}
1,825725985458880514,76649729,2017-01-29 15:23:10,"RT @GOPHELP: Op-ed in @memphisnews:""DeVos has ...",{}
2,825839671208972290,817076257770835968,2017-01-29 22:54:55,RT @JCColtin: @RepEspaillat inauguration going...,{}
3,825820208736718850,817076257770835968,2017-01-29 21:37:35,RT @NYCMayor: We're the safest big city in Ame...,{}
4,825820131196620800,817076257770835968,2017-01-29 21:37:16,RT @NYCMayor: I'll be joining New Yorkers at B...,{}
5,825820092684496903,817076257770835968,2017-01-29 21:37:07,RT @NYCMayor: There is something more importan...,{BatteryPark}
6,825819737217183745,817076257770835968,2017-01-29 21:35:42,RT @LizRittr: .@BilldeBlasio to @jeffsessions:...,{}
7,825819634402213888,817076257770835968,2017-01-29 21:35:18,Thank you @SenSchumer for your remarks today @...,{NY13}
8,825808721288822784,817076257770835968,2017-01-29 20:51:56,RT @jonorcutt: Great rallying cries from @RepE...,{}
9,825808666012090368,817076257770835968,2017-01-29 20:51:43,"RT @altochulo: No justice, no peace! cries @Re...",{NoBanNoWall}


In [None]:
# How many unique twitter accounts exist in this data set?
len(tweets.user_id.unique())

In [None]:
# How many tweets are contained in the data set?
len(tweets)

In [None]:
# What time span does this data set cover? When does it start and when does it end?

earliest_tweet_date = tweets['created_at'].min()
print("earliest tweet date", earliest_tweet_date)

latest_tweet_date = tweets['created_at'].max()
print("latest tweet date", latest_tweet_date)
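
The `min()` and `max()` calls above compare `created_at` as strings, which works here because the `YYYY-MM-DD HH:MM:SS` format sorts chronologically. To get an actual duration, the strings can be parsed first. A minimal sketch, assuming that timestamp format and using three values copied from the table above:

```python
from datetime import datetime

# Three 'created_at' values in the format shown in the table above.
created_at = [
    "2017-01-29 21:26:25",
    "2017-01-29 15:23:10",
    "2017-01-29 22:54:55",
]

# Parse the strings so that subtraction yields a real time span.
timestamps = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in created_at]
span = max(timestamps) - min(timestamps)

print(min(timestamps), max(timestamps))
print(span)  # 7:31:45
```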

### Here we propose to do something very bad with data. We're going to take these tweets and, using PCA, imagine communities of affective affinity or political belief in two ways: 
1. by looking at individual hashtag use to make inferences about existing communities; 
2. by looking at individual language usage to postulate shared affect. 

It's important to stress that we don't think this is a particularly creditable or justifiable approach, but it does serve to illustrate a fairly predominant trend in which some data is used to posit the existence of different communities and to make claims about individuals.    

#### We're going to need some handy functions for stuff we're going to do again and again below. 
You don't need to pay too much attention to the details below, but you should note that functions are very handy for repetitive tasks. You must run the box below in order to use these functions. 

In [None]:
def count_words(list_to_search): #uses single_type_count() to count all tokens
    unique_words = set(list_to_search)
    word_counts = {}
    for word in unique_words:
        word_counts[word] = single_type_count(word, list_to_search)
    return word_counts # dict w/ word counts

def single_type_count(token_to_count, list_to_search): #counts up all tokens of a single type
    number_of_tokens = 0                            
    for token in list_to_search:                   
        if token == token_to_count:                 
            number_of_tokens += 1                   
    return number_of_tokens #returns int

def total_number_of_words(dict_of_word_counts): #for use with token_counts
    number_of_words = 0
    for word in dict_of_word_counts:
        number_of_words = number_of_words + dict_of_word_counts[word]
    return number_of_words #returns int

def total_number_of_words_in_corpus(list_of_total_word_counts):
    total_number_of_words = 0
    for total in range(0,len(list_of_total_word_counts)):
        total_number_of_words = total_number_of_words + list_of_total_word_counts[total]
    return total_number_of_words #returns int

def get_word_frequencies(dict_of_words_with_counts, total_number_of_words_in_text):
    word_freq = {}
    for word in dict_of_words_with_counts:
        word_freq[word] = dict_of_words_with_counts[word]/total_number_of_words_in_text
    return word_freq # dict with word w/ normalized frequencies

def word_freq(word_counts, corpus_word_count):
## calculate word frequencies for any set of distinct items (e.g., texts, chunks, topic models, etc.)
    word_frequencies = []
    for text in range(0, len(word_counts)):
        word_frequencies.append(get_word_frequencies(word_counts[text],corpus_word_count)) 
    return word_frequencies

def obtain_MFW(word_frequencies, compared_to_which_text, textnames):
#USEFUL FOR EXAMINING MOST FREQUENT WORDS IN WORD LIST
#Uses Pandas to sort word frequencies and fill empty cells
#Returns dataframe "MFW"
    readable_word_frequencies = pd.DataFrame(word_frequencies).T
    #compared_to_which_text = 0 # column identifier for a particular text; full list in "textnames"
    MFW = readable_word_frequencies.sort_values([compared_to_which_text], ascending = False)
    MFW = MFW.fillna(0) # fill all empty cells with zeros   
    print("Columns are users/texts")
    print("Rows are words/hashtags")
    return MFW

def PCAnalysis(word_frequencies, number_of_MFWs_used, corpus_word_count, number_of_components, text_index_to_compare_MFWs):
## APPLY PCA TO WORD FREQUENCIES LIST
## IF number_of_MFWs_used == 0, USE ALL WORDS IN LIST.
    ## prepare word_frequencies for PCA processing
    print("Corpus Word Count:" + str(corpus_word_count))
    if number_of_MFWs_used == 0:
        print("""Using corpus word count (""" + str(corpus_word_count) + """ words) for PCA in """ + str(number_of_components) + """-dimensions...""")
        number_of_MFWs_used = corpus_word_count
        # Note: There's probably a faster way than performing two T operations in the next 3 lines...
        dataframe_word_frequencies = pd.DataFrame(word_frequencies).T
        MFW = dataframe_word_frequencies.sort_values([text_index_to_compare_MFWs], ascending = False)
        MFW_for_PCA = MFW.fillna(0).to_numpy().T #using all MFWs; .as_matrix() was removed in pandas 1.0
    else:
        print("Using " + str(number_of_MFWs_used) + " words for PCA in "+ str(number_of_components) + "-dimensions...")  #np.nan_to_num(word_frequencies[number_of_MFWs_used:])
        # Note: There's probably a faster way than performing two T operations in the next 3 lines
        dataframe_word_frequencies = pd.DataFrame(word_frequencies).T
        MFW = dataframe_word_frequencies.sort_values([text_index_to_compare_MFWs], ascending = False)
        MFW_for_PCA = MFW.head(number_of_MFWs_used).fillna(0).to_numpy().T #using X of MFWs
    ## generate data points of PCA from MFW_for_PCA (this is where PCA is performed)
    pca_coordinates, pca_results = perform_PCA(MFW_for_PCA, number_of_components)   
    return pca_coordinates, pca_results

def perform_PCA(MFWlist_array_for_PCA, num_of_PCs):
#perform PCA using the MFW lists for a set of texts
    pca_results = sklearnPCA(n_components = num_of_PCs)
    pca_coordinates = pca_results.fit_transform(MFWlist_array_for_PCA) #array of x- & y-coordinates 
    return pca_coordinates, pca_results

def plot_PCA(pca_coordinates, pca_results, colors_for_texts, textnames, plot_size, output_flag, plot_name):
## plot PCA graph without any description/labeling of texts; 
## set output_flag = 1 to produce PDF, 2 to produce PNG, else output = 0   
    ## (1) prepare color scheme, data point style & labels ---
    for text in range(0, len(pca_coordinates)):        
        plt.plot(pca_coordinates[text,0],pca_coordinates[text,1], 'o', markersize=7, color=colors_for_texts[text], alpha=0.5, label=textnames[text])
    ## (2) graph display parameters & labels ---
    plt.xlabel('PC 1 ('+str(pca_results.explained_variance_ratio_[0]*100)+'%)') #x-axis title
    plt.ylabel('PC 2('+str(pca_results.explained_variance_ratio_[1]*100)+'%)') #y-axis title
    matplotlib.rcParams['figure.figsize'] = (plot_size, plot_size) #size of graph generated in notebook
    ##plt.axis('tight') #OR just fit plot around data automatically; but this usually fits *so* closely that it misses data
    plt.title('PCA for ' + str(len(textnames)) + ' users') #title of plot
    plt.grid(True, which='major', color='gray', linestyle='dotted') # Add gridlines; the positional flag works across Matplotlib versions
    ## (3) "print-ready" plots ---
    ## need to produce these files before 'plt.show' since that command erases graph parameters
    if output_flag == 1:
        plt.savefig(plot_name + '.pdf', dpi=600) #to produce a PDF of plot_PCA
    if output_flag == 2:
        plt.savefig(plot_name + '.png', dpi=600) #to produce a png of plot
    if output_flag not in (0, 1, 2): # '&' binds tighter than '!=', so the original chained test was unreliable
        print("Warning: output_flag for PCA plot not set to 0, 1, or 2")
        print("see plot_PCA function")
    plt.show() ## (4) plot PCA graph to screen ---
    return

def assign_text_colors_via_word_counts(total_word_counts_nparray, total_word_counts):
## assigns colors for texts based on their relative word counts
## uses the word_count_binning function
    n, bins = word_count_binning(total_word_counts_nparray) #produce binning for all texts or text chunks in corpus
    text_binning = [] #identifies the bin in which a text belongs, where the text index is the same used in total_word_counts
    for text in range(0, len(total_word_counts_nparray)):
        for bin_index in range(0, len(bins) - 1): # bins holds len(n)+1 edges, so stop before the last edge
            if (total_word_counts[text] >= bins[bin_index]) and (total_word_counts[text] <= bins[bin_index+1]):
                text_binning.append(bin_index)
                break
    colors = generate_spectrum_of_colors(len(n))
    colors_for_texts = [] #identifies the color of a text, where the text index is the same used in total_word_counts
    for text in range(0, len(text_binning)):
        colors_for_texts.append(colors[text_binning[text]])
    return colors_for_texts

def word_count_binning(total_word_counts_numpy_array):
## generates histogram data for all texts or text chunks, based on word count,
## but does not output histogram plot; number of bins set to 37
    ## (1) generate histogram of data ---------------------------------
    n, bin_location_list, patches = plt.hist(total_word_counts_numpy_array, 37, facecolor='red', alpha=0.75) #plt.hist(total_word_counts_array, 2, normed=1, facecolor='red', alpha=0.75)
    #TEST: print("contents of bins: " + str(n))
    #TEST: print("bin location intervals: " + str(bin_location_list))
    plt.close()
    return n, bin_location_list 

def generate_spectrum_of_colors(num_colors): 
# generates a spectrum of colors--from red to black--for use in matplotlib;
    import colorsys
    spectrum_of_colors = []
    hue = 355/360
    for i in np.arange(0., 100., 100. / num_colors):
        lightness = i/100
        saturation = i/100
        spectrum_of_colors.append(colorsys.hls_to_rgb(hue, lightness, saturation))
    return spectrum_of_colors

def plot_PCA_with_labels(pca_coordinates, pca_results, textnames, colors_for_texts, plot_size, output_flag, plot_name):
## plot PCA graph using "arrow" labels; set output_flag = 1 to produce PDF, 2 to produce PNG, else output = 0
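## WARNING: Using some styles for plt.style.use() will *break* arrow labels. For instance,
## plt.style.use('ggplot') breaks arrows. Accordingly, this function sets a seaborn whitegrid style.
    ## (0) invoke plt.style that works with arrow labels
    ## Matplotlib 3.6 renamed the seaborn styles to 'seaborn-v0_8-*', so pick whichever name is available
    whitegrid_style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
    plt.style.use(whitegrid_style) #also works w/ 'seaborn-ticks', 'seaborn-white', 'classic'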
     ## (1) prepare color scheme, data point style & labels ---
    for text in range(0, len(pca_coordinates)):
        plt.plot(pca_coordinates[text,0],pca_coordinates[text,1], 'o', markersize=7, color=colors_for_texts[text], alpha=0.8, label=textnames[text])
    ## (2) graph display parameters & labels ---
    plt.xlabel('PC 1 ('+str(pca_results.explained_variance_ratio_[0]*100)+'%)') #x-axis title
    plt.ylabel('PC 2 ('+str(pca_results.explained_variance_ratio_[1]*100)+'%)') #y-axis title
    matplotlib.rcParams['figure.figsize'] = (plot_size, plot_size) #size of graph generated in notebook
    plt.title('PCA for ' + str(len(textnames)) + ' users') #title of plot
    ax = plt.subplot(111) #used in making legend 
    plt.grid(True, which='major', color='gray', linestyle='dotted') #grid lines; the positional flag works across Matplotlib versions
    ## (3)generate arrow labels ---
    for text in range(0, len(textnames)):
        ax.annotate(textnames[text], xy=(pca_coordinates[text,0], pca_coordinates[text,1]), xycoords='data', xytext=(-30, -30), textcoords='offset points', arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
    ## (4) "print-ready" plots ---
    ## need to produce these files before 'plt.show' since that command erases graph parameters
    if output_flag == 1:
        plt.savefig(plot_name + '.pdf', dpi=600) #to produce a PDF of plot_PCA
    if output_flag == 2:
        plt.savefig(plot_name + '.png', dpi=600) #to produce a png of plot
    if output_flag not in (0, 1, 2): # '&' binds tighter than '!=', so the original chained test was unreliable
        print("Warning: output_flag for PCA plot not set to 0, 1, or 2")
        print("see plot_PCA_with_labels function")
    ## (5) plot PCA graph to screen ---
    plt.show() 
    return
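
The counting helpers above rescan the whole token list once for every unique word. A minimal sketch of the same counting and normalization using `collections.Counter`, assuming the same list-of-tokens input; `count_words_fast` and `relative_frequencies` are hypothetical names, not part of the lab's code:

```python
from collections import Counter

def count_words_fast(tokens):
    """Hypothetical stand-in for count_words(): one pass over the tokens."""
    return dict(Counter(tokens))

def relative_frequencies(word_counts, corpus_word_count):
    """Divide each count by the corpus-wide token total, as word_freq() does."""
    return {word: count / corpus_word_count for word, count in word_counts.items()}

# Two made-up users with a handful of tokens each.
user_tokens = [["no", "ban", "no", "wall"], ["ban"]]

counts = [count_words_fast(tokens) for tokens in user_tokens]
corpus_total = sum(sum(c.values()) for c in counts)
freqs = [relative_frequencies(c, corpus_total) for c in counts]

print(counts[0])       # {'no': 2, 'ban': 1, 'wall': 1}
print(corpus_total)    # 5
print(freqs[0]["no"])  # 0.4
```

Note that, as in `word_freq()`, each user's counts are divided by the total for the whole corpus rather than by that user's own total, so prolific users end up with larger values overall.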

### Part 2, Section 1: Using hashtags to invent communities

In [None]:
# get a list of all unique twitter users in corpus
users = list(set(tweets.user_id.tolist()))

In [None]:
# get all hashtags for each user; counts of all hashtags for all users; 
# total number of hashtags for all users; get all total number of all hashtags;
# and hashtag frequencies in corpus

user_hashtags = []                 # list of lists of user tag tokens
user_hashtag_counts = []           # list of dicts of user tag types
user_total_number_of_hashtags = [] # list of number of tag tokens used by each user

for ID in range(0, len(users)):  
    temp_hashtag = []
    hashtag = tweets[tweets["user_id"]==users[ID]]["hashtag_entities"].tolist() #list of all hashtags for one user ID
    for tag in range(0,len(hashtag)):  # for all hashtags for one user ID
        temp_hashtag.extend(re.compile(r'\w+').findall(hashtag[tag].lower())) # pull hashtags from one tweet and add to list
    user_hashtags.append(temp_hashtag)
    user_hashtag_counts.append(count_words(user_hashtags[ID])) 
    user_total_number_of_hashtags.append(total_number_of_words(user_hashtag_counts[ID])) 

# number of all tag tokens in tweet corpus
corpus_hashtag_count = total_number_of_words_in_corpus(user_total_number_of_hashtags)

# calculate frequencies for each hashtag in corpus
hashtag_frequencies = word_freq(user_hashtag_counts, corpus_hashtag_count)

# TESTS
#print(user_hashtags)
#print(user_hashtag_counts)    
#print(user_total_number_of_hashtags)
#print(corpus_hashtag_count)
#print(hashtag_frequencies)
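
To see what the regular expression in the loop above extracts, here is a minimal sketch run on a few `hashtag_entities` values in the brace-wrapped shape shown in the table earlier. The list `raw_entities` is an illustrative assumption, not a variable from the lab.

```python
import re

# Sample 'hashtag_entities' values in the shape shown in the table above.
raw_entities = ["{}", "{BatteryPark}", "{NoBanNoWall}", "{NY13}"]

# A raw string avoids invalid-escape warnings for '\w' in newer Python versions.
tag_pattern = re.compile(r"\w+")

tags = []
for entity in raw_entities:
    # Lowercase first so that 'NoBanNoWall' and 'nobannowall' count as one tag.
    tags.extend(tag_pattern.findall(entity.lower()))

print(tags)  # ['batterypark', 'nobannowall', 'ny13']
```

Empty braces contribute nothing, so users who never used a hashtag end up with an empty token list.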

#### While the following code isn't necessary to use PCA, it shows you the top 10 most used hashtags in the twitter corpus.

In [None]:
## Inspect most frequent words (MFWs) ------------------
text_index_to_compare_MFWs = 0 # column identifier for a particular text; full list in "textnames"
most_frequent_hashtags = obtain_MFW(hashtag_frequencies, text_index_to_compare_MFWs, str(users))


# Columns are individual users, rows are hashtags
most_frequent_hashtags.head(10) #list first 10 hashtags for all users 

#### Now let's actually do PCA on the hashtags for each user:

In [None]:
## Perform PCA for users using hashtags 

text_index_to_compare_MFTs = 0 # column identifier for a particular text; full list in "textnames"
number_of_components = 2 # how many dimensions for PCA
number_of_MFTs_used = 0 #number of MFWs used to plot PCA graphs # if set to 0, will use all words
pca_coordinates, pca_results = PCAnalysis(hashtag_frequencies, number_of_MFTs_used, corpus_hashtag_count, number_of_components, text_index_to_compare_MFTs)
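
For intuition about what the PCA step computes, here is a minimal sketch in plain NumPy rather than scikit-learn: center each feature, take a singular value decomposition, and project onto the first two principal axes. The matrix `X` is made-up illustrative data (4 users by 3 hashtags), and the signs of the resulting coordinates may differ from what scikit-learn reports.

```python
import numpy as np

# Hypothetical frequency matrix: 4 users (rows) x 3 hashtags (columns).
X = np.array([
    [0.10, 0.00, 0.02],
    [0.08, 0.01, 0.00],
    [0.00, 0.12, 0.03],
    [0.01, 0.09, 0.04],
])

# PCA operates on mean-centered features.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 2
coordinates = X_centered @ Vt[:n_components].T  # one (PC 1, PC 2) pair per user
explained = (S ** 2) / np.sum(S ** 2)           # share of variance per component

print(coordinates.shape)  # (4, 2)
print(explained[:n_components])
```

The values in `explained` play the role of `explained_variance_ratio_`, which the plotting functions print as percentages in the axis labels.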

#### And now let's plot our PCA:

In [None]:
## generate color spectrum based on tag count of texts ---------
total_tag_counts_nparray = np.array(user_total_number_of_hashtags) # convert word count list into a numpy array
colors_for_tags = assign_text_colors_via_word_counts(total_tag_counts_nparray, user_total_number_of_hashtags)

## plot PCA data with labels and arrows
size_of_plot = 10
name_of_file = "pca-2017-01-29_hashtags_with_labels"
output_file = 1 # for PDF, output_flag = 1; for PNG, output_flag = 2; else output_flag = 0
plot_PCA_with_labels(pca_coordinates, pca_results, users, colors_for_tags, size_of_plot, output_file, name_of_file) # last two inputs are plot_size and output flag 


If you want to generate PCA plot without labels, uncomment the single hashes below and run code.

In [None]:
## May need to run this code block twice to get it to display properly!

## generate color spectrum based on tag count of texts ----------
#total_tag_counts_nparray = np.array(user_total_number_of_hashtags) # convert word count list into a numpy array
#colors_for_tags = assign_text_colors_via_word_counts(total_tag_counts_nparray, user_total_number_of_hashtags)

##plot PCA without legend or tables -------
#size_of_plot = 12
#output_file = 0 # for PDF, output_flag = 1; for PNG, output_flag = 2; else output_flag = 0
#name_of_file = "pca-2017-01-29_hashtags"
#plot_PCA(pca_coordinates, pca_results, colors_for_tags, users, size_of_plot, output_file, name_of_file)  # last three inputs are plot_size, output flag, graph name

### Part 2, Section 2: Using words to invent communities

In [None]:
# get all words for each user; counts of all words for all users; 
# total number of words for all users; get all total number of all words;
# and words frequencies in corpus

user_words = []                 # list of lists of user word tokens
user_word_counts = []           # list of dicts of user word types
user_total_number_of_words = [] # list of number of words used by each user

for ID in range(0, len(users)):  
    temp_words = []
    words = tweets[tweets["user_id"]==users[ID]]["tweet_text"].tolist() #list of all tweet texts for one user ID
    for tweet_wordlist in range(0,len(words)):  # for all tweets by one user ID
        temp_words.extend(re.compile(r'\w+').findall(words[tweet_wordlist].lower())) # pull words from one tweet and add to list
    user_words.append(temp_words)
    user_word_counts.append(count_words(user_words[ID])) 
    user_total_number_of_words.append(total_number_of_words(user_word_counts[ID])) 

# number of all word tokens in tweet corpus
corpus_word_count = total_number_of_words_in_corpus(user_total_number_of_words)

# calculate frequencies for each word in corpus
word_frequencies = word_freq(user_word_counts, corpus_word_count)

# TESTS
#print(user_words)
#print(user_word_counts)    
#print(user_total_number_of_words)
#print(corpus_word_count)
#print(word_frequencies)

In [None]:
## Inspect most frequent words (MFWs) ------------------
text_index_to_compare_MFWs = 0 # column identifier for a particular text; full list in "textnames"
MFWs= obtain_MFW(word_frequencies, text_index_to_compare_MFWs, str(users))


# Columns are individual users, rows are words
MFWs.head(10) #list first 10 words for all users 

In [None]:
## Perform PCA for users using words 

text_index_to_compare_MFWs = 0 # column identifier for a particular text; full list in "textnames"
number_of_components = 2 # how many dimensions for PCA
number_of_MFWs_used = 0 #number of MFWs used to plot PCA graphs # if set to 0, will use all words
pca_coordinates, pca_results = PCAnalysis(word_frequencies, number_of_MFWs_used, corpus_word_count, number_of_components, text_index_to_compare_MFWs)

In [None]:
## generate color spectrum based on word count of texts ---------
total_word_counts_nparray = np.array(user_total_number_of_words) # convert word count list into a numpy array
colors_for_words = assign_text_colors_via_word_counts(total_word_counts_nparray, user_total_number_of_words)

## plot PCA data with labels and arrows
size_of_plot = 10
name_of_file = "pca-2017-01-29_words_with_labels"
output_file = 1 # for PDF, output_flag = 1; for PNG, output_flag = 2; else output_flag = 0
plot_PCA_with_labels(pca_coordinates, pca_results, users, colors_for_words, size_of_plot, output_file, name_of_file) # last three inputs are plot_size, output flag, graph name 


In [None]:
## May need to run this code block twice to get it to display properly!

## generate color spectrum based on word count of texts ----------
total_word_counts_nparray = np.array(user_total_number_of_words) # convert word count list into a numpy array
colors_for_words = assign_text_colors_via_word_counts(total_word_counts_nparray, user_total_number_of_words)

##plot PCA without legend or tables -------
size_of_plot = 12
output_file = 0 # for PDF, output_flag = 1; for PNG, output_flag = 2; else output_flag = 0
name_of_file = "pca-2017-01-29_words"
plot_PCA(pca_coordinates, pca_results, colors_for_words, users, size_of_plot, output_file, name_of_file)  # last three inputs are plot_size, output flag, graph name