# Downloading and Partitioning the Google Books Corpus Unigram Time-series

## Alex John Quijano

*Department of Applied Mathematics, University of California, Merced*

**Outline**
1. Import Python Modules
2. Load Google Books Corpus Unigram Frequency Dataset (English)
3. Load Sentiment Lexicon from National Research Council Canada (NRC)
4. Subset the Time-series into Categories Positive, Negative, Both, and Neither
5. Save the Partitioned Time-series into one DataFrame

In [1]:
import  time
print( 'Last updated: %s' %time.strftime('%d/%m/%Y') )

Last updated: 22/04/2019


### 1. Import Python Modules.

In [2]:
import numpy as np
import pandas as pd

### 2. Load Google Books Corpus Unigram Frequency Dataset (English).

#### 2.1. Download Google Ngram Dataset from https://github.com/stressosaurus/raw-data-google-ngram.

```bash
cd ..
mkdir raw-data
cd raw-data
git clone https://github.com/stressosaurus/raw-data-google-ngram google-ngram
cd ..
cp raw-data/google-ngram/googleNgram.py dynamic-mode-decomposition/googleNgram.py
```

#### 2.2. Load Google Ngram Unigram Frequency Dataset.

In [96]:
import googleNgram as gn

# English Unigram
n = '1'
l = 'eng'
R, V, POS = gn.read(n,'rscore',l) # unigram raw frequency time-series of English
P, V, POS = gn.read(n,'pscore',l) # unigram probability time-series of English
Z, V, POS = gn.read(n,'zscore',l) # unigram zscore time-series of English
R.shape # 18737 words (rows), 109 years (columns)

(18737, 109)

#### 2.3. Subset the Time-series into the full time regime (1900-1999) and two sub-regimes (1900-1949) and (1950-1999).

In [29]:
time_O = range(1900,1999+1) # O means 1900-1999
time_A = range(1900,1949+1) # A means 1900-1949
time_B = range(1950,1999+1) # B means 1950-1999

# time regime O
R_O = R[:,range(0,len(time_O))]
P_O = P[:,range(0,len(time_O))]
Z_O = Z[:,range(0,len(time_O))]

# time regimes A and B
R_A = R[:,range(0,len(time_A))]
R_B = R[:,range(len(time_A),len(time_A)+len(time_B))]

print('R_O: size='+str(R_O.shape))
print('R_A: size='+str(R_A.shape))
print('R_B: size='+str(R_B.shape))

R_O: size=(18737, 100)
R_A: size=(18737, 50)
R_B: size=(18737, 50)
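
The slices above rely on column 0 holding the year 1900. A minimal sketch of the same subsetting with the year-to-column mapping made explicit is given below. The helper `slice_years`, its `first_year` parameter, and the synthetic matrix `R_demo` are hypothetical names introduced for illustration, and the 1900 start year is an assumption inferred from the slicing in this section.

```python
import numpy as np

def slice_years(M, start, stop, first_year=1900):
    """Return the columns of M covering the years start..stop (inclusive).

    Assumes column 0 of M holds `first_year` and that the columns are
    consecutive years.
    """
    lo = start - first_year
    hi = stop - first_year + 1
    if lo < 0 or hi > M.shape[1]:
        raise ValueError('requested years fall outside the matrix')
    return M[:, lo:hi]

# synthetic stand-in for R: 3 words (rows), 109 yearly columns
R_demo = np.arange(3 * 109, dtype=float).reshape(3, 109)
R_A_demo = slice_years(R_demo, 1900, 1949)  # first 50 columns
R_B_demo = slice_years(R_demo, 1950, 1999)  # next 50 columns
print(R_A_demo.shape, R_B_demo.shape)
```

Passing years instead of positional ranges makes an out-of-range request fail loudly rather than silently returning a shorter slice.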


#### 2.4. Convert Raw Frequency Time-series into Probabilities and Normalized Frequency (zscores).

In [5]:
# time-regime A
P_A = np.zeros(R_A.shape,dtype=float)
for i in range(R_A.shape[1]):
    P_A[:,i] = np.divide(R_A[:,i],np.sum(R_A[:,i])) # probabilities
Z_A = np.zeros(P_A.shape,dtype=float)
for i in range(P_A.shape[0]):
    Z_A[i,:] = np.divide(P_A[i,:] - np.mean(P_A[i,:]),np.std(P_A[i,:])) # zscores
    
# time-regime B
P_B = np.zeros(R_B.shape,dtype=float)
for i in range(R_B.shape[1]):
    P_B[:,i] = np.divide(R_B[:,i],np.sum(R_B[:,i])) # probabilities
Z_B = np.zeros(P_B.shape,dtype=float)
for i in range(P_B.shape[0]):
    Z_B[i,:] = np.divide(P_B[i,:] - np.mean(P_B[i,:]),np.std(P_B[i,:])) # zscores
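
The loops above divide each yearly column by its total and then standardize each word's row over time. As a sketch, the same two steps can be written with NumPy broadcasting. The function names `to_probabilities` and `to_zscores` and the random matrix `R_demo` are hypothetical and used only for illustration.

```python
import numpy as np

def to_probabilities(R):
    """Divide each column (year) by its total count across all words."""
    return R / R.sum(axis=0, keepdims=True)

def to_zscores(P):
    """Standardize each row (word) over time.

    Uses the population standard deviation (ddof=0), which is the
    default of np.std and therefore matches the loop version.
    """
    mu = P.mean(axis=1, keepdims=True)
    sigma = P.std(axis=1, keepdims=True)
    return (P - mu) / sigma

# synthetic stand-in for a raw frequency matrix: 5 words, 50 years
rng = np.random.default_rng(0)
R_demo = rng.integers(1, 1000, size=(5, 50)).astype(float)
P_demo = to_probabilities(R_demo)
Z_demo = to_zscores(P_demo)
```

A word whose probability is constant over the regime has zero standard deviation, so both the loop and the broadcast version would produce NaN for that row.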

### 3. Load Sentiment Lexicon from National Research Council Canada (NRC).

#### 3.1. Download from http://sentiment.nrc.ca/lexicons-for-research/NRC-Emotion-Lexicon.zip.

In [8]:
%%bash
curl -O http://sentiment.nrc.ca/lexicons-for-research/NRC-Emotion-Lexicon.zip
unzip NRC-Emotion-Lexicon.zip
rm NRC-Emotion-Lexicon.zip

Archive:  NRC-Emotion-Lexicon.zip
  inflating: NRC - Sentiment Lexicon - Research EULA Sept 2017 .pdf  
   creating: NRC-Emotion-Lexicon-v0.92/
  inflating: NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Senselevel-v0.92.txt  
  inflating: NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-v0.92-In105Languages-Nov2017Translations.xlsx  
  inflating: NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt  
   creating: NRC-Emotion-Lexicon-v0.92/Older Versions/
  inflating: NRC-Emotion-Lexicon-v0.92/Older Versions/NRC-Emotion-Lexicon-v0.92-InManyLanguages.xlsx  
  inflating: NRC-Emotion-Lexicon-v0.92/Older Versions/readme.txt  
  inflating: NRC-Emotion-Lexicon-v0.92/Paper1_NRC_Emotion_Lexicon.pdf  
  inflating: NRC-Emotion-Lexicon-v0.92/Paper2_NRC_Emotion_Lexicon.pdf  
  inflating: NRC-Emotion-Lexicon-v0.92/readme.txt  




#### 3.2. Load Sentiment Lexicon from NRC.

In [23]:
file_path = 'NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt'
sentiment_file = open(file_path,'r')
N_words = [] # list of negative words
P_words = [] # list of positive words
for i in sentiment_file:
    i_vect = i.replace('\n','').split('\t')
    if len(i_vect) != 1:
        if i_vect[1] == 'negative' and i_vect[2] == '1':
            N_words.append(i_vect[0])
        if i_vect[1] == 'positive' and i_vect[2] == '1':
            P_words.append(i_vect[0])
sentiment_file.close()
print('NRC negative words: '+str(len(N_words)))
print('NRC positive words: '+str(len(P_words)))
print('NRC both: '+str(len(set(N_words) & set(P_words))))

NRC negative words: 3324
NRC positive words: 2312
NRC both: 81
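
The parser above reads tab-separated rows of the form word, category, flag and keeps the words flagged `1` for `negative` or `positive`. A minimal sketch of the same logic that collects the words into sets is shown below. The function `read_polarity_sets` is a hypothetical helper, and the in-memory sample rows are made up for illustration rather than taken from the lexicon. Unlike the cell above, this sketch skips any line that does not have exactly three fields.

```python
import io

def read_polarity_sets(handle):
    """Collect words flagged '1' for the 'negative' and 'positive' categories.

    `handle` is any iterable of tab-separated lines: word, category, flag.
    """
    negative, positive = set(), set()
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word, category, flag = fields
        if flag != '1':
            continue
        if category == 'negative':
            negative.add(word)
        elif category == 'positive':
            positive.add(word)
    return negative, positive

# made-up rows in the same three-column layout (not real lexicon entries)
sample = io.StringIO(
    '\n'
    'alpha\tnegative\t1\n'
    'alpha\tpositive\t0\n'
    'beta\tnegative\t1\n'
    'beta\tpositive\t1\n'
    'gamma\tjoy\t1\n'
)
N_demo, P_demo = read_polarity_sets(sample)
print(sorted(N_demo), sorted(P_demo), sorted(N_demo & P_demo))
```

Sets make the later membership tests (`v in N_words`) constant-time on average, which matters when the check is repeated for every vocabulary word.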


### 4. Subset the Time-series into Categories Positive, Negative, Both, and Neither.

In [25]:
S = {}
for v in V['forward'].keys():
    if v in N_words and v in P_words:
        S[v] = 'both'
    elif v in N_words:
        S[v] = 'negative'
    elif v in P_words:
        S[v] = 'positive'
    else:
        S[v] = 'neither'
S_list = np.array(list(S.values()))
print('V negative: '+str(len(np.where(S_list == 'negative')[0])))
print('V positive: '+str(len(np.where(S_list == 'positive')[0])))
print('V both: '+str(len(np.where(S_list == 'both')[0])))
print('V neither: '+str(len(np.where(S_list == 'neither')[0])))

V negative: 2093
V positive: 1789
V both: 60
V neither: 14795
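
The partition above assigns each vocabulary word to exactly one of four categories. A sketch of the same rule using set membership and `collections.Counter` for the tallies follows. The function `label_sentiment` and the small example vocabulary are hypothetical and not part of the dataset.

```python
from collections import Counter

def label_sentiment(vocabulary, negative, positive):
    """Map each word to 'both', 'negative', 'positive', or 'neither'."""
    negative, positive = set(negative), set(positive)
    labels = {}
    for word in vocabulary:
        is_negative = word in negative
        is_positive = word in positive
        if is_negative and is_positive:
            labels[word] = 'both'
        elif is_negative:
            labels[word] = 'negative'
        elif is_positive:
            labels[word] = 'positive'
        else:
            labels[word] = 'neither'
    return labels

# made-up vocabulary and word lists for illustration
vocab_demo = ['alpha', 'beta', 'gamma', 'delta']
S_demo = label_sentiment(vocab_demo, negative=['alpha', 'beta'], positive=['beta', 'gamma'])
counts_demo = Counter(S_demo.values())
print(counts_demo)
```

Because every word receives exactly one label, the four counts always sum to the vocabulary size, which gives a quick consistency check on the partition.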


### 5. Save the Partitioned Time-series into one DataFrame.

In [97]:
# organize data into Dictionary/DataFrame structures
X = {}
time_regimes_labels = ['1900-1999','1900-1949','1950-1999']
time_regimes = [time_O,time_A,time_B]
data_types = ['R','P','Z','S']
for i in data_types:
    X[i] = {}
    if i != 'S':
        for j in time_regimes_labels:
            X[i][j] = {}
for v in list(V['forward'].keys()):
    for i in data_types:
        if i == 'S':
            X[i][v] = S[v]
        for j in time_regimes_labels:
            if i == 'R':
                if j == '1900-1999':
                    X[i][j][v] = list(R_O[V['forward'][v],:])
                elif j == '1900-1949':
                    X[i][j][v] = list(R_A[V['forward'][v],:])
                elif j == '1950-1999':
                    X[i][j][v] = list(R_B[V['forward'][v],:])
            elif i == 'P':
                if j == '1900-1999':
                    X[i][j][v] = list(P_O[V['forward'][v],:])
                elif j == '1900-1949':
                    X[i][j][v] = list(P_A[V['forward'][v],:])
                elif j == '1950-1999':
                    X[i][j][v] = list(P_B[V['forward'][v],:])
            elif i == 'Z':
                if j == '1900-1999':
                    X[i][j][v] = list(Z_O[V['forward'][v],:])
                elif j == '1900-1949':
                    X[i][j][v] = list(Z_A[V['forward'][v],:])
                elif j == '1950-1999':
                    X[i][j][v] = list(Z_B[V['forward'][v],:])
for i in data_types:
    if i != 'S':
        for j in time_regimes_labels:
            X[i][j] = pd.DataFrame(X[i][j])
            if j == '1900-1999':
                X[i][j].index = list(time_O)
            elif j == '1900-1949':
                X[i][j].index = list(time_A)
            elif j == '1950-1999':
                X[i][j].index = list(time_B)
                
# save data X
np.save(n+'gram-'+l+'-partitioned.npy',X)
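
Because `X` is a Python dictionary, `np.save` stores it as a pickled zero-dimensional object array. A minimal sketch of the round trip is given below, assuming a recent NumPy in which `np.load` requires `allow_pickle=True` for such files. The dictionary `X_demo` and the temporary file name are hypothetical stand-ins for the real data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# small stand-in with the same nesting as X: labels plus per-regime DataFrames
X_demo = {
    'S': {'w1': 'positive'},
    'R': {'1950-1999': pd.DataFrame({'w1': [1.0, 2.0]}, index=[1950, 1951])},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo-partitioned.npy')
    np.save(path, X_demo)  # the dict is pickled inside the .npy file
    # .item() unwraps the zero-dimensional object array back into a dict
    X_loaded = np.load(path, allow_pickle=True).item()

print(type(X_loaded), list(X_loaded.keys()))
```

Since loading relies on pickle, files saved this way should only be loaded from trusted sources.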

In [105]:
# raw frequency time regime 1950-1999 example
X['R']['1950-1999'].head(5)

year,actor,anthony,acknowledgment,agile,alive,anterior,ap,art,atoms,attractive,...,zero,zones,zeal,z,zurich,zealous,zu,zone,zur,zealously
1950,47102.0,32238.0,16642.0,2862.0,95102.0,70720.0,9929.0,696993.0,135765.0,75754.0,...,154007.0,46883.0,32757.0,95764.0,15478.0,11101.0,35242.0,127895.0,28878.0,2951.0
1951,44382.0,27920.0,15564.0,3004.0,95957.0,67412.0,11458.0,564154.0,82642.0,75208.0,...,136316.0,41820.0,31560.0,80813.0,9184.0,10625.0,39340.0,133166.0,30005.0,2932.0
1952,45862.0,28400.0,14605.0,3288.0,95483.0,83769.0,11451.0,563922.0,82748.0,72734.0,...,137425.0,41134.0,31962.0,89490.0,12480.0,11039.0,41459.0,119313.0,31230.0,2967.0
1953,51736.0,28776.0,13596.0,3392.0,96328.0,87710.0,11383.0,562040.0,86950.0,73460.0,...,136612.0,46553.0,29224.0,89698.0,13947.0,9738.0,45305.0,130500.0,32979.0,2718.0
1954,43894.0,34036.0,13777.0,3752.0,99307.0,79178.0,12561.0,579195.0,105668.0,81964.0,...,141396.0,42884.0,32269.0,86773.0,13326.0,10492.0,44104.0,127946.0,32619.0,2838.0


In [106]:
# probabilities time regime 1950-1999 example
X['P']['1950-1999'].head(5)

year,actor,anthony,acknowledgment,agile,alive,anterior,ap,art,atoms,attractive,...,zero,zones,zeal,z,zurich,zealous,zu,zone,zur,zealously
1950,1.9e-05,1.3e-05,7e-06,1e-06,3.8e-05,2.8e-05,4e-06,0.000277,5.4e-05,3e-05,...,6.1e-05,1.9e-05,1.3e-05,3.8e-05,6e-06,4e-06,1.4e-05,5.1e-05,1.1e-05,1e-06
1951,1.8e-05,1.1e-05,6e-06,1e-06,3.9e-05,2.8e-05,5e-06,0.000231,3.4e-05,3.1e-05,...,5.6e-05,1.7e-05,1.3e-05,3.3e-05,4e-06,4e-06,1.6e-05,5.5e-05,1.2e-05,1e-06
1952,1.9e-05,1.2e-05,6e-06,1e-06,3.9e-05,3.4e-05,5e-06,0.000229,3.4e-05,2.9e-05,...,5.6e-05,1.7e-05,1.3e-05,3.6e-05,5e-06,4e-06,1.7e-05,4.8e-05,1.3e-05,1e-06
1953,2.1e-05,1.2e-05,6e-06,1e-06,4e-05,3.6e-05,5e-06,0.000233,3.6e-05,3e-05,...,5.7e-05,1.9e-05,1.2e-05,3.7e-05,6e-06,4e-06,1.9e-05,5.4e-05,1.4e-05,1e-06
1954,1.7e-05,1.4e-05,5e-06,1e-06,3.9e-05,3.1e-05,5e-06,0.00023,4.2e-05,3.3e-05,...,5.6e-05,1.7e-05,1.3e-05,3.4e-05,5e-06,4e-06,1.8e-05,5.1e-05,1.3e-05,1e-06


In [107]:
# zscores time regime 1950-1999 example
X['Z']['1950-1999'].head(5)

year,actor,anthony,acknowledgment,agile,alive,anterior,ap,art,atoms,attractive,...,zero,zones,zeal,z,zurich,zealous,zu,zone,zur,zealously
1950,-0.919548,-1.129065,1.760802,-0.984957,-0.617386,-1.249375,-1.921933,2.621943,2.329822,-1.102706,...,-0.533276,-1.444498,1.12837,-1.162915,0.25031,1.129679,-1.708887,-2.061953,-2.13399,0.911124
1951,-1.132697,-1.573831,1.386691,-0.071976,0.168894,-1.40708,-1.657662,-0.902377,-0.222935,-0.88306,...,-1.639875,-1.895615,1.090207,-2.075501,-3.104734,1.055517,-1.091696,-1.472717,-1.664108,1.019055
1952,-0.972545,-1.545377,0.721172,0.973062,-0.134585,0.429039,-1.674101,-1.070721,-0.256822,-1.324606,...,-1.65266,-2.023392,1.104684,-1.492041,-1.284309,1.194382,-0.880546,-2.4701,-1.435925,1.032233
1953,0.080661,-1.414537,0.294861,1.705002,0.5294,1.116536,-1.647276,-0.744137,0.057624,-0.996299,...,-1.470799,-1.255715,0.827447,-1.330004,-0.276816,0.716828,-0.296299,-1.538938,-0.837579,0.712983
1954,-1.407997,-0.912741,0.044603,2.539868,0.244877,-0.303879,-1.551969,-0.980766,0.802547,-0.283104,...,-1.574466,-1.924003,1.054094,-1.828223,-0.965763,0.85502,-0.675335,-2.079804,-1.268753,0.711421
