This is an initial experiment to see whether we can predict the sector of the tweeting company from the topic probability vector of a tweet. The main aim is to find out whether any particular topics are especially revealing of the sector of the account that published a tweet.

In [1]:
import numpy as np
import pandas as pd
import pickle

from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split 

#### Setup

**Goal**: Given a topic vector for a specific tweet, we attempt to predict the sector of the company that shared the tweet.


In [2]:
MODEL_FOLDER = "50_topics_model"

In [3]:
# In this matrix, there is a row for each topic, and a column for each document.
with open(f"{MODEL_FOLDER}/matrix_topics_docs.npy", 'rb') as f:
    matrix_topics_docs = np.load(f)

In [4]:
matrix_topics_docs.shape

(50, 1004184)

In [5]:
# doc_num_to_comp_tweet_id_tuple_map dictionary maps document number to (company CSV filename, tweet ID) tuple.
with open(f"{MODEL_FOLDER}/doc_num_to_comp_tweet_id_tuple_map.pkl", 'rb') as f:
    doc_num_to_comp_tweet_id_tuple_map = pickle.load(f)

In [6]:
# load the company to sector map
with open("../company_sector_map.pkl", "rb") as f:
    company_sector_map = pickle.load(f)

In [7]:
SECTOR_NUMBER_TO_NAME_MAP = {}

sector_num = 0
for sector_name in set(company_sector_map.values()):
    SECTOR_NUMBER_TO_NAME_MAP[sector_num] = sector_name
    sector_num += 1
    
SECTOR_NUMBER_TO_NAME_MAP

{0: 'Communication Services',
 1: 'Utilities',
 2: 'Real Estate',
 3: 'Financials',
 4: 'Consumer Discretionary',
 5: 'Materials',
 6: 'Consumer Staples',
 7: 'Information Technology',
 8: 'Industrials',
 9: 'Health Care',
 10: 'Energy'}

In [8]:
SECTOR_NAME_TO_NUMBER_MAP = {}

for sector_number in SECTOR_NUMBER_TO_NAME_MAP:
    sector_name = SECTOR_NUMBER_TO_NAME_MAP[sector_number]
    SECTOR_NAME_TO_NUMBER_MAP[sector_name] = sector_number
    
SECTOR_NAME_TO_NUMBER_MAP

{'Communication Services': 0,
 'Utilities': 1,
 'Real Estate': 2,
 'Financials': 3,
 'Consumer Discretionary': 4,
 'Materials': 5,
 'Consumer Staples': 6,
 'Information Technology': 7,
 'Industrials': 8,
 'Health Care': 9,
 'Energy': 10}
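One caveat about the two maps above: they are built by iterating over a `set` of strings, and Python randomizes string hashes per process by default, so the sector numbering can change between kernel restarts. A minimal sketch of a reproducible alternative, assuming alphabetical numbering is acceptable (`example_company_sector_map` is a made-up stand-in for `company_sector_map`):

```python
# Hypothetical stand-in for company_sector_map (company name -> GICS sector name).
example_company_sector_map = {
    "acme_energy": "Energy",
    "globex_bank": "Financials",
    "initech": "Information Technology",
    "umbrella_power": "Energy",
}

# Sorting the unique sector names fixes the numbering regardless of hash order.
sector_names = sorted(set(example_company_sector_map.values()))
sector_number_to_name = dict(enumerate(sector_names))
sector_name_to_number = {name: num for num, name in sector_number_to_name.items()}

print(sector_number_to_name)
print(sector_name_to_number)
```

Note that adopting this would renumber the sectors relative to the output shown above.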

In [9]:
# Create an array of the sector for each document

# For each doc num, use the doc_num_to_comp_tweet_id_tuple_map to get the company
# then use the company_sector_map to get the sector for that company

# We need to use the transposed version of matrix_topics_docs so that we have a row for every doc and a column for each topic
# We could also load the model here and use matrix_docs_topics_, so maybe it's worth checking that they are equal to each other transposed
matrix_topics_docs_T = matrix_topics_docs.T
NUM_DOCS = len(matrix_topics_docs_T)

# Since we currently don't have a GICS sector for all the subsidiaries,
# we will keep track of the doc numbers that we cannot consider in the model because they do not have a sector associated with them.
# When we're creating the final training data, we will only keep a row if the index of the row (corresponding to the doc number)
# is not in this set of docs_with_no_sector 
docs_with_no_sector = set()

# This will be our array of values to predict. 
# So matrix_topics_docs_T[i] will be the topic vector for document i, and doc_num_sectors[i] will be the sector of the company that tweeted document i
doc_num_sectors = np.empty(NUM_DOCS)
for doc_num in range(NUM_DOCS):
    company = doc_num_to_comp_tweet_id_tuple_map[doc_num][0] # the 0th element of the tuple is the company CSV filename
    company_name = "_".join(company.split("_")[:-1]) # drop the _tweets.csv off the end of the company name
    try:
        sector_name = company_sector_map[company_name]
    except KeyError: # no GICS sector recorded for this company
        docs_with_no_sector.add(doc_num)
        continue
    doc_num_sectors[doc_num] = SECTOR_NAME_TO_NUMBER_MAP[sector_name]

In [10]:
print(len(doc_num_sectors))
len(doc_num_sectors) == len(matrix_topics_docs_T)

1004184


True

In [11]:
final_training_vectors = []
final_training_sectors = []
for doc_num in range(NUM_DOCS):
    if doc_num not in docs_with_no_sector:
        final_training_vectors.append(matrix_topics_docs_T[doc_num])
        final_training_sectors.append(doc_num_sectors[doc_num])
        
final_training_vectors = np.array(final_training_vectors) # plain ndarray; np.matrix is no longer recommended by NumPy
final_training_sectors = np.array(final_training_sectors)    

In [12]:
print(len(final_training_sectors))
len(final_training_sectors) == len(final_training_vectors)


952265


True
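The filtering loop in cell 11 can also be written with a boolean mask, which avoids building Python lists of roughly a million rows. A small sketch on toy data, where `toy_vectors`, `toy_sectors`, and `toy_docs_with_no_sector` are hypothetical stand-ins for `matrix_topics_docs_T`, `doc_num_sectors`, and `docs_with_no_sector`:

```python
import numpy as np

toy_vectors = np.arange(12, dtype=float).reshape(6, 2)   # 6 docs, 2 topics
toy_sectors = np.array([3.0, 0.0, 7.0, 1.0, 0.0, 9.0])
toy_docs_with_no_sector = {1, 4}

# Start with every document kept, then switch off the ones without a sector.
keep = np.ones(len(toy_vectors), dtype=bool)
keep[np.fromiter(toy_docs_with_no_sector, dtype=int)] = False

kept_vectors = toy_vectors[keep]
kept_sectors = toy_sectors[keep]
print(kept_vectors.shape, kept_sectors)
```

The row order is preserved, so `kept_vectors[i]` and `kept_sectors[i]` still describe the same document.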

Now we have the training data $X$ as `final_training_vectors` and the target values $Y$ as the `final_training_sectors` (it's a bad name, I know).

Now we can train different models on this data. Remember that each sector number is mapped to a sector name using `SECTOR_NUMBER_TO_NAME_MAP`.

In [13]:
np.random.seed(2727)

In [14]:
# randomly sample 100k tweet vectors and their sector

num_samples = 100_000
indices_100k = np.random.choice(final_training_vectors.shape[0], num_samples, replace=False)

training_100k_tweet_vectors = final_training_vectors[indices_100k]
training_100k_sectors = final_training_sectors[indices_100k]

# Generate a train-test split for this data
X_100k_train, X_100k_test, y_100k_train, y_100k_test = train_test_split(training_100k_tweet_vectors, training_100k_sectors, test_size=0.2)

X_100k_train = np.asarray(X_100k_train)
X_100k_test = np.asarray(X_100k_test)
y_100k_train = np.asarray(y_100k_train)
y_100k_test = np.asarray(y_100k_test)
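The split above depends on the global NumPy seed set in cell 13. A hedged alternative is to pass `random_state` and `stratify` to `train_test_split` directly, so the split is reproducible on its own and each sector keeps roughly the same share in train and test. The toy arrays below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_X = rng.random((200, 5))            # 200 fake tweets, 5 fake topics
toy_y = np.repeat([0, 1, 2, 3], 50)     # 4 fake sectors, 50 tweets each

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=2727, stratify=toy_y
)

# With stratification, each class contributes 20% of its rows to the test set.
test_counts = np.bincount(y_te)
print(test_counts)
```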

#### Model 1: Linear Regression

This is just a test to make sure everything is set up right. The result has no semantic meaning because this is a classification problem: the sector numbers are arbitrary labels, not quantities to regress on.

In [15]:
regression_model = LinearRegression().fit(np.asarray(X_100k_train), y_100k_train)
regression_model.score(X_100k_train, y_100k_train) # score(X, y) returns R^2 on the given data

0.2118493896666218

#### Model 2: MLP Classification Network

This is one of the standard scikit-learn neural nets for classification.

In [21]:
def classifier_test_table(X_test, y_test, classifier):
    """ print a table of predicted vs. desired sector for each test row, then the number of correct predictions """
    predictions = classifier.predict(X_test)            # all predictions
    prediction_probs = classifier.predict_proba(X_test) # all prediction probabilities
    
    # count correct predictions
    num_correct = 0
    print(f"{'input ':>28s} -> {'pred':^6s} {'des.':^6s}") 
    for i in range(len(y_test)):
        pred = predictions[i]
        pred_probs = prediction_probs[i,:]
        desired = y_test[i]
        if pred == desired:
            result = "  correct"
            num_correct += 1
        else:
            result = "  incorrect"
        pred_sector_name = SECTOR_NUMBER_TO_NAME_MAP[pred]
        desired_sector_name = SECTOR_NUMBER_TO_NAME_MAP[desired]
        print(f"{X_test[i,:2]!s:>28s} -> {pred:^6.0f} {desired:^6.0f} {result:^10s}") 
        # print(f"{X_test[i,:2]!s:>28s} -> {pred_sector_name:^10s} {desired_sector_name:^10s} {result:^10s}") 
    print(f"\ncorrect predictions: {num_correct} out of {len(y_test)}")

In [18]:
nn_classifier = MLPClassifier(hidden_layer_sizes=(6,7),  # 50 inputs -> 6 -> 7 -> one output per sector
                    max_iter=1000,
                    activation="tanh",
                    solver='sgd',   
                    verbose=True,
                    shuffle=True,
                    random_state=2727,
                    learning_rate_init=.1, # can go significantly smaller on this
                    learning_rate = 'adaptive')  # soften feedback as it converges

In [19]:

print("\n\n++++++++++  Begin training  +++++++++++++++\n\n")
nn_classifier.fit(X_100k_train, y_100k_train)
# TODO(nile): do we want to scale the data?? Probably not y's but maybe X's
# nn_classifier.fit(X_train_scaled, y_train_scaled)
print("\n++++++++++  End training  +++++++++++++++")
print(f"The analog prediction error (the loss) is {nn_classifier.loss_}")



++++++++++  Begin training  +++++++++++++++






Iteration 1, loss = 2.01305731
Iteration 2, loss = 1.86457970
Iteration 3, loss = 1.85049062
Iteration 4, loss = 1.84108585
Iteration 5, loss = 1.83213482
Iteration 6, loss = 1.82496619
Iteration 7, loss = 1.82224306
Iteration 8, loss = 1.81974613
Iteration 9, loss = 1.81791520
Iteration 10, loss = 1.81672672
Iteration 11, loss = 1.81512394
Iteration 12, loss = 1.81327831
Iteration 13, loss = 1.81339111
Iteration 14, loss = 1.81247383
Iteration 15, loss = 1.81258702
Iteration 16, loss = 1.81103919
Iteration 17, loss = 1.81158502
Iteration 18, loss = 1.81021786
Iteration 19, loss = 1.81003189
Iteration 20, loss = 1.80954447
Iteration 21, loss = 1.80995184
Iteration 22, loss = 1.80870302
Iteration 23, loss = 1.80901535
Iteration 24, loss = 1.80877508
Iteration 25, loss = 1.80839244
Iteration 26, loss = 1.80879347
Iteration 27, loss = 1.80807701
Iteration 28, loss = 1.80882473
Iteration 29, loss = 1.80832228
Iteration 30, loss = 1.80743631
Iteration 31, loss = 1.80810070
Iteration 32, los
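Regarding the TODO about scaling in the training cell: one way to try it is to wrap a `StandardScaler` and the `MLPClassifier` in a pipeline, so the scaler is fit only on the training split and the labels are left untouched. This is a sketch on synthetic data, not a result from this experiment; whether scaling helps on the topic vectors is untested here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_X = rng.random((300, 5))                 # fake topic vectors
toy_y = (toy_X[:, 0] > 0.5).astype(int)      # fake two-class labels

# The pipeline scales X inside fit/predict; y is never scaled for classification.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(6, 7), max_iter=500, random_state=2727),
)
scaled_mlp.fit(toy_X[:240], toy_y[:240])
toy_score = scaled_mlp.score(toy_X[240:], toy_y[240:])
print(toy_score)
```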

In [23]:
classifier_test_table(np.asarray(X_100k_test), np.asarray(y_100k_test), nn_classifier)

                      input  ->  pred   des. 
     [0.05667216 0.00583627] ->   0      1      incorrect
     [0.01560134 0.00914912] ->   8      5      incorrect
     [0.01189345 0.00021998] ->   0      0      correct 
     [0.01548515 0.00281572] ->   9      1      incorrect
     [0.00664839 0.01163831] ->   9      6      incorrect
     [0.01290803 0.00447556] ->   0      0      correct 
     [0.03775981 0.01690134] ->   0      0      correct 
     [0.12465658 0.01316496] ->   0      5      incorrect
[5.14903745e-05 1.20800431e-02] ->   10     1      incorrect
     [0.00224834 0.0726816 ] ->   1      9      incorrect
                     [0. 0.] ->   6      9      incorrect
     [0.11684106 0.01051999] ->   0      0      correct 
     [0.00193275 0.0992078 ] ->   9      6      incorrect
                     [0. 0.] ->   6      0      incorrect
[1.23405230e-06 4.92269698e-03] ->   7      8      incorrect
     [0.0001704  0.00038346] ->   1      1      correct 
     [0.0169488  0.003885

     [0.0453063  0.02265332] ->   6      5      incorrect
     [0.0345064  0.01508247] ->   6      6      correct 
     [0.64510565 0.00138213] ->   5      4      incorrect
     [0.00238327 0.03318108] ->   8      0      incorrect
     [0.00524513 0.00464113] ->   9      9      correct 
     [0.09430389 0.0055334 ] ->   7      2      incorrect
     [0.00154596 0.00459336] ->   9      6      incorrect
     [0.12436958 0.01469583] ->   6      3      incorrect
     [0.0979774  0.00449802] ->   5      5      correct 
     [0.00600713 0.02312124] ->   1      4      incorrect
     [0.20414346 0.00772068] ->   3      1      incorrect
[1.02493777e-04 1.21535662e-05] ->   7      8      incorrect
     [0.01123668 0.01633524] ->   1      10     incorrect
     [0.08211667 0.00614799] ->   1      0      incorrect
     [0.02392292 0.01690351] ->   6      1      incorrect
                     [0. 0.] ->   6      7      incorrect
     [0.01336097 0.01762036] ->   1      2      incorrect
[5.08676531e-0
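The row-by-row table gets long for a 20,000-tweet test split. A more compact summary could use `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. The labels and predictions below are made up to keep the sketch self-contained; in the notebook they would be `y_100k_test` and `nn_classifier.predict(X_100k_test)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

toy_desired = np.array([0, 0, 1, 1, 2, 2, 2, 0])
toy_pred = np.array([0, 1, 1, 1, 2, 0, 2, 0])

accuracy = accuracy_score(toy_desired, toy_pred)
# Rows are desired classes, columns are predicted classes.
matrix = confusion_matrix(toy_desired, toy_pred, labels=[0, 1, 2])
print(accuracy)
print(matrix)
```

Off-diagonal cells show which sectors get confused with each other, which the per-row table makes hard to see.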

#### Model 3: Gaussian Naive Bayes Classifier

Focusing on interpretability here

In [17]:
naive_bayes = GaussianNB()
naive_bayes.fit(X_100k_train, y_100k_train)

One observation -- this fits almost instantly

In [18]:
naive_bayes.score(X_100k_test, y_100k_test)

0.3296

In [19]:
naive_bayes.get_params()

{'priors': None, 'var_smoothing': 1e-09}

With a test accuracy of 0.3296, it doesn't perform as well as the random forest models below (0.41685 and 0.4243), though.
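Since interpretability is the point of this model, one option is to look at the fitted per-class feature means in `theta_` (shape `(n_classes, n_features)`): a topic whose mean differs strongly between sectors is one the model can use to tell them apart. A sketch on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
toy_y = np.repeat([0, 1], 200)
toy_X = rng.random((400, 3))
toy_X[:, 0] += toy_y * 2.0          # shift feature 0 for class 1 only

toy_nb = GaussianNB().fit(toy_X, toy_y)

# Gap between the largest and smallest class mean, per feature.
spread = toy_nb.theta_.max(axis=0) - toy_nb.theta_.min(axis=0)
most_separating = int(np.argmax(spread))
print(toy_nb.theta_.shape, most_separating)
```

The spread of means ignores the per-class variances, so it is only a rough first look.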

#### Model 4: Random Forest

##### 100 estimators

In [18]:
random_forest_classifier = RandomForestClassifier(n_estimators=100) # the default number of trees in the forest

random_forest_classifier.fit(X_100k_train, y_100k_train)

In [19]:
avg_score_100k_rf = random_forest_classifier.score(X_100k_test, y_100k_test)
avg_score_100k_rf

0.41685

This isn't *that* good...maybe we should increase the number of estimators?
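To judge whether 0.41685 is good, it helps to compare against a trivial baseline. With 11 sectors, uniform guessing would score about 1/11 ≈ 0.09, but if the sectors are imbalanced, always predicting the most common one can do better. `DummyClassifier` gives that number. The sketch uses made-up labels; the real class balance is not shown in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

toy_y_train = np.array([0] * 60 + [1] * 30 + [2] * 10)
toy_y_test = np.array([0] * 12 + [1] * 6 + [2] * 2)
# The dummy model ignores features, so placeholder columns are enough.
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(toy_X_train, toy_y_train)
baseline_accuracy = baseline.score(toy_X_test, toy_y_test)
print(baseline_accuracy)
```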

For now, figure out which features (i.e. which topics) are most important

In [43]:
topic_importance_map = {}
feature_importances = random_forest_classifier.feature_importances_

for topic in range(len(feature_importances)):
    topic_importance_map[topic] = feature_importances[topic]
    
sorted_topic_importances = sorted(topic_importance_map.items(), key = lambda pair: pair[1], reverse=True)

print(f"{'topic':>7s} -> {'importance':^10s}") 
for item in sorted_topic_importances:
    print(f"{str(item[0]):^7s} -> {item[1]:>10f}")

  topic -> importance
  47    ->   0.046593
  46    ->   0.028845
  38    ->   0.028255
  49    ->   0.024058
   5    ->   0.023719
  17    ->   0.023392
  34    ->   0.023349
  15    ->   0.023206
  11    ->   0.023054
  39    ->   0.022316
  43    ->   0.021412
  13    ->   0.020662
   9    ->   0.020574
  25    ->   0.020482
  26    ->   0.020230
  19    ->   0.020125
  29    ->   0.019980
  14    ->   0.019936
  40    ->   0.019841
  31    ->   0.019277
  44    ->   0.019210
  23    ->   0.019167
  35    ->   0.019015
  20    ->   0.018768
  30    ->   0.018720
  24    ->   0.018650
  12    ->   0.018622
   0    ->   0.018477
  33    ->   0.018326
  28    ->   0.018198
  16    ->   0.018079
  32    ->   0.017977
   2    ->   0.017867
  45    ->   0.017763
   7    ->   0.017570
   4    ->   0.017566
  22    ->   0.017293
  41    ->   0.017277
  48    ->   0.017268
   6    ->   0.017165
  27    ->   0.017160
  10    ->   0.017159
  37    ->   0.016998
  18    ->   0.016923
  21    ->
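The `feature_importances_` values above are impurity-based and computed from the training data. scikit-learn also provides `permutation_importance`, which measures how much a held-out score drops when one feature is shuffled. A sketch on synthetic data (not run on the topic vectors here), where only feature 2 determines the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
toy_X = rng.random((400, 4))
toy_y = (toy_X[:, 2] > 0.5).astype(int)   # only feature 2 matters

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X[:300], toy_y[:300])

# Shuffle each feature on the held-out rows and record the drop in accuracy.
result = permutation_importance(
    forest, toy_X[300:], toy_y[300:], n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
```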

Trying the same model with double the number of trees -- 200.

##### 200 estimators

In [44]:
random_forest_classifier_200 = RandomForestClassifier(n_estimators=200)

random_forest_classifier_200.fit(X_100k_train, y_100k_train)

In [48]:
avg_score_100k_rf2 = random_forest_classifier_200.score(X_100k_test, y_100k_test)
avg_score_100k_rf2

0.4243

This isn't really much better. `.4243` here vs `.41685` before.
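To put the gap in context: the test split holds 20,000 tweets (20% of the 100k sample), so each accuracy has a binomial standard error of roughly 0.0035. The sketch below treats the two scores as independent estimates, which is a simplifying assumption since both forests are scored on the same test set, and neither forest was given a `random_state`.

```python
import math

n_test = 20_000
acc_100, acc_200 = 0.41685, 0.4243   # scores reported above

def binomial_se(p, n):
    """Standard error of an accuracy p measured on n independent test rows."""
    return math.sqrt(p * (1 - p) / n)

se_100 = binomial_se(acc_100, n_test)
se_200 = binomial_se(acc_200, n_test)
se_diff = math.sqrt(se_100**2 + se_200**2)
z = (acc_200 - acc_100) / se_diff
print(f"difference = {acc_200 - acc_100:.5f}, SE of difference = {se_diff:.5f}, z = {z:.2f}")
```

Under these assumptions the improvement is about one and a half standard errors, which is consistent with reading it as not much better.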

In [49]:
topic_importance_map_200 = {}
feature_importances_200 = random_forest_classifier_200.feature_importances_

for topic in range(len(feature_importances_200)):
    topic_importance_map_200[topic] = feature_importances_200[topic]
    
sorted_topic_importances_200 = sorted(topic_importance_map_200.items(), key = lambda pair: pair[1], reverse=True)

print(f"{'topic':>7s} -> {'importance':^10s}") 
for item in sorted_topic_importances_200:
    print(f"{str(item[0]):^7s} -> {item[1]:>10f}")

  topic -> importance
  47    ->   0.045705
  46    ->   0.028494
  38    ->   0.027904
  11    ->   0.024156
  17    ->   0.023903
   5    ->   0.023524
  34    ->   0.023444
  15    ->   0.023205
  49    ->   0.023121
  39    ->   0.022562
  43    ->   0.021470
  13    ->   0.020741
  26    ->   0.020536
  25    ->   0.020455
   9    ->   0.020293
  19    ->   0.020045
  29    ->   0.019889
  14    ->   0.019749
  40    ->   0.019734
  23    ->   0.019473
  31    ->   0.019405
  35    ->   0.019348
  44    ->   0.019249
  30    ->   0.018641
  24    ->   0.018629
  20    ->   0.018625
  33    ->   0.018439
   0    ->   0.018401
  12    ->   0.018377
  28    ->   0.018286
  16    ->   0.018011
  32    ->   0.017969
   2    ->   0.017884
  45    ->   0.017794
   4    ->   0.017544
  22    ->   0.017489
   6    ->   0.017461
  48    ->   0.017452
   7    ->   0.017423
  10    ->   0.017282
  41    ->   0.017181
  27    ->   0.017131
  37    ->   0.017128
   1    ->   0.016884
  18    ->

Feature importance rankings have largely stayed the same.
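That claim can be checked directly from the two tables. The sketch below copies the ten highest-ranked topics from each printed output and compares them. It also relates the top importance to the value of 1/50 = 0.02 that every topic would get if all 50 were equally useful (the values in `feature_importances_` sum to 1).

```python
# Top ten topics, copied from the printed tables above.
top10_100_trees = [47, 46, 38, 49, 5, 17, 34, 15, 11, 39]
top10_200_trees = [47, 46, 38, 11, 17, 5, 34, 15, 49, 39]

# Topics that appear in both top-ten lists, regardless of position.
shared = set(top10_100_trees) & set(top10_200_trees)
# Topics that hold exactly the same rank in both runs.
same_position = [a for a, b in zip(top10_100_trees, top10_200_trees) if a == b]

uniform_importance = 1 / 50
ratio_topic_47 = 0.046593 / uniform_importance   # 100-tree importance of topic 47

print(len(shared), same_position, round(ratio_topic_47, 2))
```

Both runs agree on which ten topics matter most and differ only in the order of a few near-tied entries; topic 47 stands out at a bit over twice the uniform value.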