## Engineering a Multilayer Perceptron Using NumPy

In [1]:
# I found the dataset online and saved it as two text files: one with the reviews, the other with the labels
with open('reviews.txt', 'r') as g:
    # drop the trailing newline from each review
    reviews = [line.rstrip('\n') for line in g]

with open('labels.txt', 'r') as g:
    # drop the trailing newline and upper-case each label
    labels = [line.rstrip('\n').upper() for line in g]

In [2]:
# the reviews and labels lists should be the same length; here is the review count
len(reviews)

25000

In [3]:
# let's read one
reviews[4]
# I'm going to hypothesize that this is a positive review

'brilliant over  acting by lesley ann warren . best dramatic hobo lady i have ever seen  and love scenes in clothes warehouse are second to none . the corn on face is a classic  as good as anything in blazing saddles . the take on lawyers is also superb . after being accused of being a turncoat  selling out his boss  and being dishonest the lawyer of pepto bolt shrugs indifferently  i  m a lawyer  he says . three funny words . jeffrey tambor  a favorite from the later larry sanders show  is fantastic here too as a mad millionaire who wants to crush the ghetto . his character is more malevolent than usual . the hospital scene  and the scene where the homeless invade a demolition site  are all  time classics . look for the legs scene and the two big diggers fighting  one bleeds  . this movie gets better each time i see it  which is quite often  .  '

In [4]:
# let's check
labels[4]

'POSITIVE'

Reading even a single review helps you connect and engage with the data. From one review you can begin to develop a sense of the landscape of your dataset.

In [5]:
from collections import Counter
import numpy as np

In [6]:
# counter objects to store positive, negative and total counts
# instantiate empty
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [7]:
# populate counters with respective words
for review, label in zip(reviews, labels):
    for word in review.split(' '):
        if label == 'POSITIVE':
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1
        total_counts[word] += 1

In [8]:
# check counts of the most common words in positive reviews
positive_counts.most_common()[:10]

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235)]

In [9]:
# check counts of the most common words in negative reviews
negative_counts.most_common()[:10]

[('', 561462),
 ('.', 167538),
 ('the', 163389),
 ('a', 79321),
 ('and', 74385),
 ('of', 69009),
 ('to', 68974),
 ('br', 52637),
 ('is', 50083),
 ('it', 48327)]

Right away we can see that there is a lot of noise in this data. If we aren't careful, the high counts of empty strings (left over from splitting on runs of spaces), periods, common words and articles could dominate the weights, obscure the patterns we want the model to detect and cause it to underperform.
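The empty-string token at the top of both lists is a side effect of splitting on a single space: every extra space in a run produces an empty string. Here is a minimal sketch of the effect on a made-up sentence (not a row from the dataset). Calling `split()` with no argument is shown only as one possible alternative; it is not what the cells above do.

```python
from collections import Counter

# hypothetical sample text with double spaces, similar in style to the reviews
sample = "brilliant over  acting . best dramatic hobo lady  ."

# splitting on a single space keeps an empty string for each extra space
tokens_single_space = sample.split(' ')
# splitting on any run of whitespace drops the empty strings
tokens_whitespace = sample.split()

print(Counter(tokens_single_space)[''])
print(Counter(tokens_whitespace)[''])
```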

In [10]:
# object to store positive/negative ratios
pos_neg_ratios = Counter()

# only consider words seen more than 100 times; the +1 avoids division by zero
for term, cnt in total_counts.most_common():
    if cnt > 100:
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

In [11]:
print("Pos-to-neg ratio for 'and' = {}".format(pos_neg_ratios["and"]))
print("Pos-to-neg ratio for 'fantastic' = {}".format(pos_neg_ratios["fantastic"]))
print("Pos-to-neg ratio for 'disgusting' = {}".format(pos_neg_ratios["disgusting"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'and' = 1.2061678272793268
Pos-to-neg ratio for 'fantastic' = 4.503448275862069
Pos-to-neg ratio for 'disgusting' = 0.32142857142857145
Pos-to-neg ratio for 'terrible' = 0.17744252873563218


In [12]:
# convert ratios to logs
for word,ratio in pos_neg_ratios.most_common():
    pos_neg_ratios[word] = np.log(ratio)

In [13]:
# now, rather than high and low numbers, the two categories are polarized around 0
# 0 is neutral | positive words are > 0 | negative words are < 0
print("Pos-to-neg ratio for 'and' = {}".format(pos_neg_ratios["and"]))
print("Pos-to-neg ratio for 'fantastic' = {}".format(pos_neg_ratios["fantastic"]))
print("Pos-to-neg ratio for 'disgusting' = {}".format(pos_neg_ratios["disgusting"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'and' = 0.18744824888788403
Pos-to-neg ratio for 'fantastic' = 1.5048433868558566
Pos-to-neg ratio for 'disgusting' = -1.1349799328389845
Pos-to-neg ratio for 'terrible' = -1.7291085042663878
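
Taking the log is what makes the scale symmetric: a word that is four times more common in positive reviews and a word that is four times more common in negative reviews land at the same distance from 0, on opposite sides. A small sketch with made-up ratios (not values from the dataset):

```python
import numpy as np

# hypothetical raw ratios: 4x more positive, perfectly neutral, 4x more negative
raw_ratios = {"four_times_positive": 4.0, "neutral": 1.0, "four_times_negative": 0.25}

# the same transformation applied to pos_neg_ratios above
log_ratios = {name: float(np.log(ratio)) for name, ratio in raw_ratios.items()}

for name, value in log_ratios.items():
    print("{}: {:.4f}".format(name, value))
```

One thing to watch for: a word with a raw ratio of exactly 0 (never seen in a positive review) would map to negative infinity under `np.log`, so a small offset or an explicit check may be worth adding if that case can occur.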


In [14]:
# words with the highest log ratio, i.e. most strongly associated with a "POSITIVE" label
pos_neg_ratios.most_common()[:30]

[('edie', 4.6913478822291435),
 ('paulie', 4.07753744390572),
 ('felix', 3.152736022363656),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.80672172860924),
 ('victoria', 2.681021528714291),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.538973871058276),
 ('flawless', 2.451005098112319),
 ('superbly', 2.26002547857525),
 ('perfection', 2.159484249353372),
 ('astaire', 2.1400661634962708),
 ('captures', 2.038619547159581),
 ('voight', 2.030170492673053),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.978345424808467),
 ('brosnan', 1.9547990964725592),
 ('lily', 1.9203768470501485),
 ('bakshi', 1.9029851043382795),
 ('lincoln', 1.9014583864844796),
 ('refreshing', 1.8551812956655511),
 ('breathtaking', 1.8481124057791867),
 ('bourne', 1.8478489358790986),
 ('lemmon', 1.8458266904983307),
 ('delightful', 1.8002701588959635),
 ('flynn', 1.7996646487351682),
 ('andrews', 1.7764919970972666),
 ('homer', 1.7692866133759964),
 ('beautifully', 1.7626953362841438),
 ('soccer', 1.757857

Some of these words are names. Take 'polanski' for example. Mentioning Roman Polanski's name does not inherently express a positive sentiment; it depends on the context in which he is being talked about. I can't assume that every time someone mentions 'polanski' in text, it is to express positivity. However, there are some clear gems cutting through the noise, like 'beautifully', 'breathtaking', 'delightful' and 'perfection'. When building the model, it will be important to address this and reduce the noise.

In [15]:
# words with the lowest log ratio, i.e. most strongly associated with a "NEGATIVE" label
# (this slice returns the bottom 29 entries, most negative first)
pos_neg_ratios.most_common()[:-30:-1]

[('boll', -4.969813299576001),
 ('uwe', -4.624972813284271),
 ('seagal', -3.644143560272545),
 ('unwatchable', -3.258096538021482),
 ('stinker', -3.2088254890146994),
 ('mst', -2.9502698994772336),
 ('incoherent', -2.9368917735310576),
 ('unfunny', -2.6922395950755678),
 ('waste', -2.6193845640165536),
 ('blah', -2.5704288232261625),
 ('horrid', -2.4849066497880004),
 ('pointless', -2.4553061800117097),
 ('atrocious', -2.4259083090260445),
 ('redeeming', -2.3682390632154826),
 ('prom', -2.3608540011180215),
 ('drivel', -2.3470368555648795),
 ('lousy', -2.307572634505085),
 ('worst', -2.286987896180378),
 ('laughable', -2.264363880173848),
 ('awful', -2.227194247027435),
 ('poorly', -2.2207550747464135),
 ('wasting', -2.204604684633842),
 ('remotely', -2.1972245773362196),
 ('existent', -2.0794415416798357),
 ('boredom', -1.995100393246085),
 ('miserably', -1.9924301646902063),
 ('sucks', -1.987068221548821),
 ('uninspired', -1.9832976811269336),
 ('lame', -1.981767458946166)]

These words seem more directly related to negative sentiment, with the exception of names: 'seagal' ranks as the third strongest negative term, and 'boll' and 'uwe' at the very top look like a name as well. *shrug*

In [16]:
# object containing all words from all of the reviews
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)

74074


In [17]:
# this is the shape of the input layer: each review will pass through the model as one row with a slot per vocabulary word
layer_0 = np.zeros((1,vocab_size))
layer_0.shape

(1, 74074)
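
`layer_0` has one slot per vocabulary word. One way to fill it for a given review is to map each word to a column index and count occurrences. This is a minimal sketch on a toy vocabulary; `word2index` and `update_input_layer` are hypothetical names, and Senti_Net may build its input differently.

```python
import numpy as np

# toy vocabulary standing in for the full vocab built above
toy_vocab = sorted({"this", "movie", "is", "great", "awful"})
# hypothetical lookup table: word -> column index in the input layer
word2index = {word: i for i, word in enumerate(toy_vocab)}

layer_0 = np.zeros((1, len(toy_vocab)))

def update_input_layer(review):
    """Reset layer_0, then count each known word of the review into its slot."""
    layer_0[:] = 0
    for word in review.split(' '):
        if word in word2index:
            layer_0[0][word2index[word]] += 1

update_input_layer("this movie is great great")
print(layer_0)
```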

In [26]:
# local library housing the custom model with two hidden layers
import mfinchmods as mf

In [27]:
# meet Senti_Net
# min_count and polarity_cutoff are how I addressed the noise: they filter out words or
# characters that appear too often or too rarely to actually be part of a pattern
nn_clf = mf.Senti_Net(reviews[:-1000], # splitting the data manually, leaving the last 1000 for testing
                      labels[:-1000],
                      min_count=20,
                      polarity_cutoff=0.05,
                      learning_rate=0.01)
nn_clf.train(reviews[:-1000],labels[:-1000])

How much I've read:0.0% How fast I can read:(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
How much I've read:10.4% How fast I can read:(reviews/sec):1159. #Correct:1994 #Trained:2501 Training Accuracy:79.7%
How much I've read:20.8% How fast I can read:(reviews/sec):1135. #Correct:4063 #Trained:5001 Training Accuracy:81.2%
How much I've read:31.2% How fast I can read:(reviews/sec):1143. #Correct:6176 #Trained:7501 Training Accuracy:82.3%
How much I've read:41.6% How fast I can read:(reviews/sec):1147. #Correct:8336 #Trained:10001 Training Accuracy:83.3%
How much I've read:52.0% How fast I can read:(reviews/sec):1141. #Correct:10501 #Trained:12501 Training Accuracy:84.0%
How much I've read:62.5% How fast I can read:(reviews/sec):1146. #Correct:12641 #Trained:15001 Training Accuracy:84.2%
How much I've read:72.9% How fast I can read:(reviews/sec):1147. #Correct:14782 #Trained:17501 Training Accuracy:84.4%
How much I've read:83.3% How fast I can read:(reviews/sec):1142. #

The training went pretty well. I'll review the model's architecture in another part of the article.
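
The `min_count` and `polarity_cutoff` arguments passed to Senti_Net suggest a vocabulary filter built on the word counts and log ratios computed earlier. The sketch below is an assumption about how such a filter could work, not Senti_Net's actual code; `filter_vocab` and the toy numbers are hypothetical.

```python
from collections import Counter

def filter_vocab(total_counts, log_ratios, min_count, polarity_cutoff):
    """Keep words that are frequent enough and polarized enough (hypothetical helper)."""
    kept = set()
    for word, count in total_counts.items():
        if count <= min_count:
            continue  # too rare to be part of a pattern
        # words with a log ratio near 0 are treated as neutral noise;
        # words without a recorded ratio are kept
        if word in log_ratios and abs(log_ratios[word]) < polarity_cutoff:
            continue
        kept.add(word)
    return kept

# made-up counts and log ratios, for illustration only
toy_counts = Counter({"the": 5000, "flawless": 120, "awful": 300, "zyzzyva": 2})
toy_log_ratios = {"the": 0.01, "flawless": 2.45, "awful": -2.23}

print(sorted(filter_vocab(toy_counts, toy_log_ratios, min_count=20, polarity_cutoff=0.05)))
```

In this toy run the very common but neutral word and the very rare word are both dropped, leaving only the two strongly polarized words.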

In [28]:
nn_clf.predict(reviews[-1000:], labels[-1000:])

How much I've read:99.9% How fast I can read:(reviews/sec):1568. #Correct:859 #Tested:1000 Testing Accuracy:85.9%

Senti_Net did better than the Keras models: roughly 2% better than the overly complex model and roughly 3% better than the simpler one. That's not a lot, but it says more about knowing your data and building a model that directly addresses the patterns in the data it is working with. To do this, you need to be engaged with your data and have intentional control over the mechanisms of a model.

### Example Classification

In [80]:
nn_clf.classify(reviews[1289])

'NEGATIVE'

In [81]:
print(reviews[1289])

i saw this last week after picking up the dvd cheap . i had wanted to see it for ages  finding the plot outline very intriguing . so my disappointment was great  to say the least . i thought the lead actor was very flat . this kind of part required a performance like johny depp  s in the ninth gate  of which this is almost a complete rip  off   but i guess tv budgets don  t always stretch to this kind of acting ability .  br    br   i also the thought the direction was confused and dull  serving only to remind me that carpenter hasn  t done a decent movie since in the mouth of madness . as for the story  well  i was disappointed there as well  there was no way it could meet my expectation i guess  but i thought the payoff and explanation was poor  and the way he finally got the film anti  climactic to say the least .  br    br   this was written by one of the main contributors to aicn  and you can tell he does love his cinema  but i would have liked a better result from such a good ini

### Building out Further Functionality Using Senti_Net

In [30]:
# train a network that doesn't use polarity_cutoff or min_count
nn_clf_f = mf.Senti_Net(reviews[:-1000],
                        labels[:-1000],
                        min_count=0,
                        polarity_cutoff=0,
                        learning_rate=0.01)

In [31]:
nn_clf_f.train(reviews[:-1000],
              labels[:-1000])

How much I've read:0.0% How fast I can read:(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
How much I've read:10.4% How fast I can read:(reviews/sec):914.5 #Correct:1962 #Trained:2501 Training Accuracy:78.4%
How much I've read:20.8% How fast I can read:(reviews/sec):897.0 #Correct:4002 #Trained:5001 Training Accuracy:80.0%
How much I've read:31.2% How fast I can read:(reviews/sec):907.9 #Correct:6120 #Trained:7501 Training Accuracy:81.5%
How much I've read:41.6% How fast I can read:(reviews/sec):913.4 #Correct:8271 #Trained:10001 Training Accuracy:82.7%
How much I've read:52.0% How fast I can read:(reviews/sec):912.6 #Correct:10431 #Trained:12501 Training Accuracy:83.4%
How much I've read:62.5% How fast I can read:(reviews/sec):909.2 #Correct:12565 #Trained:15001 Training Accuracy:83.7%
How much I've read:72.9% How fast I can read:(reviews/sec):908.2 #Correct:14670 #Trained:17501 Training Accuracy:83.8%
How much I've read:83.3% How fast I can read:(reviews/sec):908.1 #

In [32]:
# use weights from the trained model to show how the model has clustered similar words during its learning
def fetch_similar_words_to(focus: str):
    most_similar = Counter()

    # score every word by the dot product of its weight row with the focus word's row
    for word in nn_clf_f.word2dict.keys():
        most_similar[word] = np.dot(nn_clf_f.weights_0_1[nn_clf_f.word2dict[word]],
                                    nn_clf_f.weights_0_1[nn_clf_f.word2dict[focus]])

    return most_similar.most_common()

In [33]:
# example of using similarity function
fetch_similar_words_to('great')[:10]

[('excellent', 0.08714175888229203),
 ('perfect', 0.07997393832571632),
 ('amazing', 0.058524466856635544),
 ('today', 0.05750220855411224),
 ('wonderful', 0.05694920677561696),
 ('fun', 0.05576918451159403),
 ('great', 0.055538020108908126),
 ('best', 0.05468981521759843),
 ('liked', 0.049518996908511824),
 ('definitely', 0.048837788647917935)]

We can see that embedded in the weights, the connections between neurons, are distilled relations that can be used in customizable ways to further explore either the dataset or the sentiment of clustered data.
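
To make the mechanism concrete, here is a toy version of the same dot-product ranking. The two-dimensional vectors and the names `toy_word2index`, `toy_weights` and `similar_words` are made up for illustration; real rows would come from the trained weight matrix.

```python
import numpy as np
from collections import Counter

# made-up word vectors, one row per word
toy_word2index = {"great": 0, "excellent": 1, "terrible": 2}
toy_weights = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [-0.7, 0.3]])

def similar_words(focus):
    """Rank every word by the dot product of its weight row with the focus word's row."""
    focus_row = toy_weights[toy_word2index[focus]]
    scores = Counter()
    for word, idx in toy_word2index.items():
        scores[word] = float(np.dot(toy_weights[idx], focus_row))
    return scores.most_common()

print(similar_words("great"))
```

Note that a raw dot product is not normalized, so a word whose weights have a larger magnitude can outscore the focus word itself, which is consistent with 'great' not ranking first in its own list above. Dividing by the vector norms (cosine similarity) would be one way to remove that effect.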

### New Data
#### Bringing in a Completely Foreign Dataset to See How the Model Perceives It

In [83]:
# loading in Twitter data that I've been mining for a separate project
import pandas as pd
tw = pd.read_csv('twitter_data.csv')
tweets = list(tw['text'])

In [86]:
# classify a random tweet from the set
nn_clf.classify(tweets[321])

'NEGATIVE'

In [87]:
# check to see what the tweet is and review how the model performed
print(tweets[321])

You can sue tobacco companies. 

You can sue pharmaceutical companies.

The only companies you can’t sue are gun manufacturers because of a law Bernie Sanders voted for. #DemDebate


In [90]:
# repeat for validation
nn_clf.classify(tweets[187])

'POSITIVE'

In [91]:
print(tweets[187])

Young people have led every movement for justice in our nation's history and they are leading the movement for climate justice now.

Politicians must listen. https://t.co/FF46Bhguls


I would say the model classified both of these tweets accurately.
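
One caveat when feeding in foreign text: the reviews the model was trained on are lowercased, with most punctuation replaced by spaces and periods kept as separate tokens, while the tweets are raw. Whether `classify` normalizes its input is not shown here, so the sketch below is only an assumption about a preprocessing step that could bring tweets closer to the review format; `normalize_like_reviews` is a hypothetical helper.

```python
import re

def normalize_like_reviews(text):
    """Lowercase, drop URLs, and replace punctuation with spaces, keeping periods as tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = text.replace(".", " . ")             # keep periods as their own tokens
    text = re.sub(r"[^a-z0-9. ]", " ", text)    # everything else becomes a space
    return text

print(normalize_like_reviews("Politicians must listen. https://t.co/FF46Bhguls"))
```

The cleaned string could then be passed to `nn_clf.classify` in place of the raw tweet.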