# NLP for Stock Market Prediction
Authors: Antonio Stark, Sasha Pukhova, Frank Looi, Precious Enharo

Kaggle challenge: https://www.kaggle.com/aaron7sun/stocknews
Kaggle article: https://www.kaggle.com/rahulvarma9595/nlp-for-stock-market-predictions

## Import packages and libraries

In [1]:
# import packages
import pandas as pd
import numpy as np
import string
import time
import gensim

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec, KeyedVectors
from sklearn.metrics import accuracy_score, confusion_matrix

## Import data

In [2]:
# import data
data = pd.read_csv('Combined_News_DJIA.csv')

print('data is %d data points with %d features'%(data.shape[0],data.shape[1]))
data.head()

data is 1989 data points with 27 features


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


## Data preprocessing
### Data cleaning & tokenization & normalization

In [3]:
# data cleaning & tokenization & normalization
## create copy of original dataframe
dataClean = data.copy()

## create a set of stopwords, since set membership tests are faster than list lookups
stops = set(stopwords.words('english'))

# news header you want to test
## (2,5) is interesting to see how "Al-Qa'eda" gets transferred
## (4,3) is interesting to see how numbers are encoded
## (2,15) is interesting to see how both numbers ('55') and hyphens ('mega-city') are encoded
## (1988,24) gives a bug for replacing numbers and removing stop words
tester = (2,15)

for i in range(2,data.shape[1]):
    if i==tester[1]:
        print('Original text:')
        print(dataClean.iloc[tester[0],i])
    
    # data cleaning
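    ## remove the b'...' wrapper left over from byte-string literals
    ## (str.strip removes any leading/trailing b or ' characters, not only the wrapper)
    dataClean.iloc[:,i]=dataClean.iloc[:,i].str.strip("b'")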
    if i==tester[1]:
        print('\nCleaned text:')
        print(dataClean.iloc[tester[0],i])
    
    ## remove punctuation
    dataClean.iloc[:,i]=dataClean.iloc[:,i].str.translate(str.maketrans('', '', string.punctuation))
    if i==tester[1]:
        print('\nPunctuations removed:')
        print(dataClean.iloc[tester[0],i])
        
    # normalization
    ## make lowercase
    dataClean.iloc[:,i]=dataClean.iloc[:,i].str.lower()
    if i==tester[1]:
        print('\nLowercase:')
        print(dataClean.iloc[tester[0],i])
    
    # tokenization
    ## word_tokenize version is below
    dataClean.iloc[:,i]=dataClean.iloc[:,i].str.split()
    if i==tester[1]:
        print('\nTokenized:')
        print(dataClean.iloc[tester[0],i])
        
    # replace numbers
    dataClean.iloc[:,i]=dataClean.iloc[:,i].apply(lambda sent: 'num' if isinstance(sent,float) else ['num' if token.isdigit() else token for token in sent])
    if i==tester[1]:
        print('\nReplaced numbers:')
        print(dataClean.iloc[tester[0],i])
        print()
        
    # remove stop words
    dataClean.iloc[:,i]=dataClean.iloc[:,i].apply(lambda sent: [token for token in sent if token not in stops])
    if i==tester[1]:
        print('\nRemoved stop words:')
        print(dataClean.iloc[tester[0],i])
        

Original text:
b'55 pyramids as large as the Luxor stacked into a mega-city pyramid in Tokyo Bay'

Cleaned text:
55 pyramids as large as the Luxor stacked into a mega-city pyramid in Tokyo Bay

Punctuations removed:
55 pyramids as large as the Luxor stacked into a megacity pyramid in Tokyo Bay

Lowercase:
55 pyramids as large as the luxor stacked into a megacity pyramid in tokyo bay

Tokenized:
['55', 'pyramids', 'as', 'large', 'as', 'the', 'luxor', 'stacked', 'into', 'a', 'megacity', 'pyramid', 'in', 'tokyo', 'bay']

Replaced numbers:
['num', 'pyramids', 'as', 'large', 'as', 'the', 'luxor', 'stacked', 'into', 'a', 'megacity', 'pyramid', 'in', 'tokyo', 'bay']


Removed stop words:
['num', 'pyramids', 'large', 'luxor', 'stacked', 'megacity', 'pyramid', 'tokyo', 'bay']
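
The cleaning steps above can also be written as a single per-headline function, which is easier to test in isolation. The following is a minimal sketch, not the notebook's own code: `clean_headline`, `DEMO_STOPS`, and the regular expression are hypothetical names introduced here, the stop-word set is a tiny stand-in for NLTK's English list, and the `b'...'` wrapper is removed with a regular expression instead of `str.strip`, so a word that starts or ends with `b` is left intact.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list.
DEMO_STOPS = {'a', 'as', 'the', 'into', 'in', 'of', 'to', 'and', 'is'}
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Matches a headline wrapped as b'...' or b"..." and captures the inner text.
_BYTES_WRAPPER = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)

def clean_headline(text, stops=DEMO_STOPS):
    """Return the cleaned token list for one headline (hypothetical helper)."""
    if not isinstance(text, str):      # missing headlines arrive as NaN (a float)
        return []
    match = _BYTES_WRAPPER.match(text)
    if match:                          # drop only the wrapper, nothing else
        text = match.group(2)
    text = text.translate(_PUNCT_TABLE).lower()
    tokens = ['num' if tok.isdigit() else tok for tok in text.split()]
    return [tok for tok in tokens if tok not in stops]

print(clean_headline("b'55 pyramids as large as the Luxor stacked into a mega-city pyramid in Tokyo Bay'"))
```

To apply it to a whole headline column, something like `dataClean.iloc[:, i].apply(clean_headline, stops=stops)` should work under the same assumptions.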


In [4]:
dataClean.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"[georgia, downs, two, russian, warplanes, coun...","[breaking, musharraf, impeached]","[russia, today, columns, troops, roll, south, ...","[russian, tanks, moving, towards, capital, sou...","[afghan, children, raped, impunity, un, offici...","[num, russian, tanks, entered, south, ossetia,...","[breaking, georgia, invades, south, ossetia, r...","[enemy, combatent, trials, nothing, sham, sali...",...,"[georgia, invades, south, ossetia, russia, get...","[alqaeda, faces, islamist, backlash]","[condoleezza, rice, us, would, act, prevent, i...","[busy, day, european, union, approved, new, sa...","[georgia, withdraw, num, soldiers, iraq, help,...","[pentagon, thinks, attacking, iran, bad, idea,...","[caucasus, crisis, georgia, invades, south, os...","[indian, shoe, manufactory, series, like, work]","[visitors, suffering, mental, illnesses, banne...","[help, mexicos, kidnapping, surge]"
1,2008-08-11,1,"[wont, america, nato, help, us, wont, help, us...","[bush, puts, foot, georgian, conflict]","[jewish, georgian, minister, thanks, israeli, ...","[georgian, army, flees, disarray, russians, ad...","[olympic, opening, ceremony, fireworks, faked]","[mossad, fraudulent, new, zealand, passports, ...","[russia, angered, israeli, military, sale, geo...","[american, citizen, living, sossetia, blames, ...",...,"[israel, us, behind, georgian, aggression]","[believe, tv, neither, russian, georgian, much...","[riots, still, going, montreal, canada, police...","[china, overtake, us, largest, manufacturer]","[war, south, ossetia, pics]","[israeli, physicians, group, condemns, state, ...","[russia, beaten, united, states, head, peak, oil]","[perhaps, question, georgia, russia, conflict]","[russia, much, better, war]","[come, trading, sex, food]"
2,2008-08-12,0,"[remember, adorable, 9yearold, sang, opening, ...","[russia, ends, georgia, operation]","[sexual, harassment, would, children]","[alqaeda, losing, support, iraq, brutal, crack...","[ceasefire, georgia, putin, outmaneuvers, west]","[microsoft, intel, tried, kill, xo, num, laptop]","[stratfor, russogeorgian, war, balance, power]","[im, trying, get, sense, whole, georgiarussia,...",...,"[us, troops, still, georgia, know, georgia, fi...","[russias, response, georgia, right]","[gorbachev, accuses, us, making, serious, blun...","[russia, georgia, nato, cold, war, two]","[remember, adorable, 62yearold, led, country, ...","[war, georgia, israeli, connection]","[signs, point, us, encouraging, georgia, invad...","[christopher, king, argues, us, nato, behind, ...","[america, new, mexico]","[bbc, news, asiapacific, extinction, man, clim..."
3,2008-08-13,0,"[us, refuses, israel, weapons, attack, iran, r...","[president, ordered, attack, tskhinvali, capit...","[israel, clears, troops, killed, reuters, came...","[britains, policy, tough, drugs, pointless, sa...","[body, num, year, old, found, trunk, latest, r...","[china, moved, num, million, quake, survivors,...","[bush, announces, operation, get, russias, gri...","[russian, forces, sink, georgian, ships]",...,"[elephants, extinct, num]","[us, humanitarian, missions, soon, georgia, ru...","[georgias, ddos, came, us, sources]","[russian, convoy, heads, georgia, violating, t...","[israeli, defence, minister, us, strike, iran]","[gorbachev, choice]","[witness, russian, forces, head, towards, tbil...","[quarter, russians, blame, us, conflict, poll]","[georgian, president, says, us, military, take...","[num, nobel, laureate, aleksander, solzhenitsy..."
4,2008-08-14,1,"[experts, admit, legalise, drugs]","[war, south, osetia, num, pictures, made, russ...","[swedish, wrestler, ara, abrahamian, throws, a...","[russia, exaggerated, death, toll, south, osse...","[missile, killed, num, inside, pakistan, may, ...","[rushdie, condemns, random, houses, refusal, p...","[poland, us, agree, missle, defense, deal, int...","[russians, conquer, tblisi, bet, seriously, bet]",...,"[bank, analyst, forecast, georgian, crisis, nu...","[georgia, confict, could, set, back, russias, ...","[war, caucasus, much, product, american, imper...","[nonmedia, photos, south, ossetiageorgia, conf...","[georgian, tv, reporter, shot, russian, sniper...","[saudi, arabia, mother, moves, block, child, m...","[taliban, wages, war, humanitarian, aid, workers]","[russia, world, forget, georgias, territorial,...","[darfur, rebels, accuse, sudan, mounting, majo...","[philippines, peace, advocate, say, muslims, n..."


### Stemming, lemmatization, canonicalization

In [5]:
# do either stemming or lemmatization
stemming = False # converted philippines to philippin in (4,25)
lemmatization = True # converts to proper words but philippine? (4,25)

dataNormal = dataClean.copy()
tester = (0,5) 
print('Original text:')
print(data.iloc[tester[0],tester[1]])
print('\nCleaned text:')
print(dataClean.iloc[tester[0],tester[1]])

if stemming:
    porter = PorterStemmer()   
    for i in range(2,dataNormal.shape[1]):
        dataNormal.iloc[:,i]=dataNormal.iloc[:,i].apply(lambda x: [porter.stem(y) for y in x])
    print('\nAfter stemming:')
    print(dataNormal.iloc[tester[0],tester[1]])
elif lemmatization:
    lemmatizer = WordNetLemmatizer()
    for i in range(2,dataNormal.shape[1]):
        dataNormal.iloc[:,i]=dataNormal.iloc[:,i].apply(lambda x: [lemmatizer.lemmatize(y) for y in x])
    print('\nAfter lemmatization:')
    print(dataNormal.iloc[tester[0],tester[1]])
        
dataNormal.head()

Original text:
b'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'

Cleaned text:
['russian', 'tanks', 'moving', 'towards', 'capital', 'south', 'ossetia', 'reportedly', 'completely', 'destroyed', 'georgian', 'artillery', 'fire']

After lemmatization:
['russian', 'tank', 'moving', 'towards', 'capital', 'south', 'ossetia', 'reportedly', 'completely', 'destroyed', 'georgian', 'artillery', 'fire']


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"[georgia, down, two, russian, warplane, countr...","[breaking, musharraf, impeached]","[russia, today, column, troop, roll, south, os...","[russian, tank, moving, towards, capital, sout...","[afghan, child, raped, impunity, un, official,...","[num, russian, tank, entered, south, ossetia, ...","[breaking, georgia, invades, south, ossetia, r...","[enemy, combatent, trial, nothing, sham, salim...",...,"[georgia, invades, south, ossetia, russia, get...","[alqaeda, face, islamist, backlash]","[condoleezza, rice, u, would, act, prevent, is...","[busy, day, european, union, approved, new, sa...","[georgia, withdraw, num, soldier, iraq, help, ...","[pentagon, think, attacking, iran, bad, idea, ...","[caucasus, crisis, georgia, invades, south, os...","[indian, shoe, manufactory, series, like, work]","[visitor, suffering, mental, illness, banned, ...","[help, mexico, kidnapping, surge]"
1,2008-08-11,1,"[wont, america, nato, help, u, wont, help, u, ...","[bush, put, foot, georgian, conflict]","[jewish, georgian, minister, thanks, israeli, ...","[georgian, army, flees, disarray, russian, adv...","[olympic, opening, ceremony, firework, faked]","[mossad, fraudulent, new, zealand, passport, i...","[russia, angered, israeli, military, sale, geo...","[american, citizen, living, sossetia, blame, u...",...,"[israel, u, behind, georgian, aggression]","[believe, tv, neither, russian, georgian, much...","[riot, still, going, montreal, canada, police,...","[china, overtake, u, largest, manufacturer]","[war, south, ossetia, pic]","[israeli, physician, group, condemns, state, t...","[russia, beaten, united, state, head, peak, oil]","[perhaps, question, georgia, russia, conflict]","[russia, much, better, war]","[come, trading, sex, food]"
2,2008-08-12,0,"[remember, adorable, 9yearold, sang, opening, ...","[russia, end, georgia, operation]","[sexual, harassment, would, child]","[alqaeda, losing, support, iraq, brutal, crack...","[ceasefire, georgia, putin, outmaneuvers, west]","[microsoft, intel, tried, kill, xo, num, laptop]","[stratfor, russogeorgian, war, balance, power]","[im, trying, get, sense, whole, georgiarussia,...",...,"[u, troop, still, georgia, know, georgia, firs...","[russia, response, georgia, right]","[gorbachev, accuses, u, making, serious, blund...","[russia, georgia, nato, cold, war, two]","[remember, adorable, 62yearold, led, country, ...","[war, georgia, israeli, connection]","[sign, point, u, encouraging, georgia, invade,...","[christopher, king, argues, u, nato, behind, g...","[america, new, mexico]","[bbc, news, asiapacific, extinction, man, clim..."
3,2008-08-13,0,"[u, refuse, israel, weapon, attack, iran, report]","[president, ordered, attack, tskhinvali, capit...","[israel, clear, troop, killed, reuters, camera...","[britain, policy, tough, drug, pointless, say,...","[body, num, year, old, found, trunk, latest, r...","[china, moved, num, million, quake, survivor, ...","[bush, announces, operation, get, russia, gril...","[russian, force, sink, georgian, ship]",...,"[elephant, extinct, num]","[u, humanitarian, mission, soon, georgia, russ...","[georgia, ddos, came, u, source]","[russian, convoy, head, georgia, violating, tr...","[israeli, defence, minister, u, strike, iran]","[gorbachev, choice]","[witness, russian, force, head, towards, tbili...","[quarter, russian, blame, u, conflict, poll]","[georgian, president, say, u, military, take, ...","[num, nobel, laureate, aleksander, solzhenitsy..."
4,2008-08-14,1,"[expert, admit, legalise, drug]","[war, south, osetia, num, picture, made, russi...","[swedish, wrestler, ara, abrahamian, throw, aw...","[russia, exaggerated, death, toll, south, osse...","[missile, killed, num, inside, pakistan, may, ...","[rushdie, condemns, random, house, refusal, pu...","[poland, u, agree, missle, defense, deal, inte...","[russian, conquer, tblisi, bet, seriously, bet]",...,"[bank, analyst, forecast, georgian, crisis, nu...","[georgia, confict, could, set, back, russia, u...","[war, caucasus, much, product, american, imper...","[nonmedia, photo, south, ossetiageorgia, confl...","[georgian, tv, reporter, shot, russian, sniper...","[saudi, arabia, mother, move, block, child, ma...","[taliban, wage, war, humanitarian, aid, worker]","[russia, world, forget, georgia, territorial, ...","[darfur, rebel, accuse, sudan, mounting, major...","[philippine, peace, advocate, say, muslim, nee..."


In [6]:
dataNormal.tail()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
1984,2016-06-27,0,"[barclays, rb, share, suspended, trading, tank...","[pope, say, church, ask, forgiveness, gay, pas...","[poland, shocked, xenophobic, abuse, pole, uk]","[second, referendum, cabinet, agrees]","[scotland, welcome, join, eu, merkel, ally, say]","[sterling, dip, friday, 31year, low, amid, bre...","[negative, news, south, african, president, al...","[surge, hate, crime, uk, following, uk, brexit...",...,"[german, lawyer, probe, erdogan, alleged, war,...","[boris, johnson, say, uk, continue, intensify,...","[richard, branson, calling, uk, government, ho...","[turkey, sorry, downing, russian, jet]","[edward, snowden, lawyer, vow, new, push, pard...","[brexit, opinion, poll, reveals, majority, don...","[conservative, mp, leave, campaigner, leave, c...","[economist, predict, uk, recession, weakening,...","[new, eu, superstate, plan, france, germany, c...","[pakistani, cleric, declare, transgender, marr..."
1985,2016-06-28,1,"[num, scientist, australia, want, save, great,...","[personal, detail, num, french, police, office...","[sampp, cut, united, kingdom, sovereign, credi...","[huge, helium, deposit, found, africa]","[ceo, south, african, state, broadcaster, quit...","[brexit, cost, investor, num, trillion, worst,...","[hong, kong, democracy, activist, call, return...","[brexit, iceland, president, say, uk, join, tr...",...,"[u, canada, mexico, pledge, num, power, clean,...","[increasing, evidence, australia, torturing, r...","[richard, branson, founder, virgin, group, sai...","[37000yrold, skull, borneo, reveals, surprise,...","[palestinian, stone, western, wall, worshiper,...","[jeanclaude, juncker, asks, farage]","[romanian, remainians, offering, new, home, nu...","[brexit, gibraltar, talk, scotland, stay, eu]","[num, suicide, bomber, strike, lebanon]","[mexico, security, force, routinely, use, sexu..."
1986,2016-06-29,1,"[explosion, airport, istanbul]","[yemeni, former, president, terrorism, offspri...","[uk, must, accept, freedom, movement, access, ...","[devastated, scientist, late, captive, breed, ...","[british, labor, party, leader, jeremy, corbyn...","[muslim, shop, uk, firebombed, people, inside]","[mexican, authority, sexually, torture, woman,...","[uk, share, pound, continue, recover]",...,"[escape, tunnel, dug, hand, found, holocaust, ...","[land, beijing, sinking, much, four, inch, per...","[car, bomb, antiislamic, attack, mosque, perth...","[emaciated, lion, taiz, zoo, trapped, bloodsoa...","[rupert, murdoch, describes, brexit, wonderful...","[num, killed, yemen, suicide, attack]","[google, found, disastrous, symantec, norton, ...","[extremist, violence, rise, germany, domestic,...","[bbc, news, labour, mp, pas, corbyn, noconfide...","[tiny, new, zealand, town, many, job, launch, ..."
1987,2016-06-30,1,"[jamaica, proposes, marijuana, dispenser, tour...","[stephen, hawking, say, pollution, stupidity, ...","[boris, johnson, say, run, tory, party, leader...","[six, gay, men, ivory, coast, abused, forced, ...","[switzerland, denies, citizenship, muslim, imm...","[palestinian, terrorist, stab, israeli, teen, ...","[puerto, rico, default, num, billion, debt, fr...","[republic, ireland, fan, awarded, medal, sport...",...,"[google, free, wifi, indian, railway, station,...","[mounting, evidence, suggests, hobbit, wiped, ...","[men, carried, tuesday, terror, attack, istanb...","[call, suspend, saudi, arabia, un, human, righ...","[num, nobel, laureate, call, greenpeace, antig...","[british, pedophile, sentenced, num, year, u, ...","[u, permitted, num, offshore, fracks, gulf, me...","[swimming, ridicule, french, beach, police, ca...","[uefa, say, minute, silence, istanbul, victim,...","[law, enforcement, source, gun, used, paris, t..."
1988,2016-07-01,1,"[117yearold, woman, mexico, city, finally, rec...","[imf, chief, back, athens, permanent, olympic,...","[president, france, say, brexit, donald, trump]","[british, man, must, give, police, num, hour, ...","[num, nobel, laureate, urge, greenpeace, stop,...","[brazil, huge, spike, number, police, killing,...","[austria, highest, court, annuls, presidential...","[facebook, win, privacy, case, track, belgian,...",...,"[united, state, placed, myanmar, uzbekistan, s...","[sampp, revise, european, union, credit, ratin...","[india, get, num, billion, loan, world, bank, ...","[u, sailor, detained, iran, spoke, much, inter...","[mass, fish, kill, vietnam, solved, taiwan, st...","[philippine, president, rodrigo, duterte, urge...","[spain, arrest, three, pakistani, accused, pro...","[venezuela, anger, food, shortage, still, moun...","[hindu, temple, worker, killed, three, men, mo...","[ozone, layer, hole, seems, healing, u, amp, u..."


## Convert words to vectors

### Combine all header words in each date into a single datapoint (flattening)

In [7]:
dataVector = dataNormal.iloc[:,0:2].copy()

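# combine the words in each date into a single datapoint
# note: `combined` is seeded with the Top1 lists (column 2) and the loop below
# also starts at column 2, so the Top1 tokens are appended a second time;
# the lists are shared with dataNormal, so its Top1 column is extended in place too
combined = dataNormal.iloc[:,2].values.tolist()

for i in range(0,dataNormal.shape[0]):
    for j in range (2,dataNormal.shape[1]):
        combined[i] += dataNormal.iloc[i,j]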


dataVector['Combined'] = combined
print('dataVector is %d data points with %d features'%(dataVector.shape[0],dataVector.shape[1]))
dataVector.head()

dataVector is 1989 data points with 3 features


Unnamed: 0,Date,Label,Combined
0,2008-08-08,0,"[georgia, down, two, russian, warplane, countr..."
1,2008-08-11,1,"[wont, america, nato, help, u, wont, help, u, ..."
2,2008-08-12,0,"[remember, adorable, 9yearold, sang, opening, ..."
3,2008-08-13,0,"[u, refuse, israel, weapon, attack, iran, repo..."
4,2008-08-14,1,"[expert, admit, legalise, drug, expert, admit,..."
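
In the output above, row 4 begins `expert, admit, legalise, drug, expert, admit, ...`: the first headline's tokens appear twice. A possible alternative, sketched here on toy data, builds a fresh list per row so that each headline contributes exactly once and the source frame is not modified. The frame `toy` and the helper `flatten_headlines` are hypothetical names for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for dataNormal: two days, three headline columns of token lists.
toy = pd.DataFrame({
    'Date': ['2008-08-08', '2008-08-11'],
    'Label': [0, 1],
    'Top1': [['georgia', 'down'], ['wont', 'america']],
    'Top2': [['breaking'], ['bush', 'put']],
    'Top3': [['russia', 'today'], ['jewish']],
})

def flatten_headlines(frame, first_col=2):
    """Concatenate the token lists of every headline column, row by row."""
    headline_cols = frame.columns[first_col:]
    return [
        [tok for col in headline_cols for tok in row[col]]  # new list per row
        for _, row in frame.iterrows()
    ]

toy_vector = toy[['Date', 'Label']].copy()
toy_vector['Combined'] = flatten_headlines(toy)
print(toy_vector['Combined'].tolist())
```

Because a new list is created for every row, the headline columns of the input frame keep their original contents.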


In [8]:
# Swap the order
df = pd.DataFrame.from_dict({'Words':dataVector.Combined,'Date':dataVector.Date, 'Label':dataVector.Label})
df.head()

Unnamed: 0,Words,Date,Label
0,"[georgia, down, two, russian, warplane, countr...",2008-08-08,0
1,"[wont, america, nato, help, u, wont, help, u, ...",2008-08-11,1
2,"[remember, adorable, 9yearold, sang, opening, ...",2008-08-12,0
3,"[u, refuse, israel, weapon, attack, iran, repo...",2008-08-13,0
4,"[expert, admit, legalise, drug, expert, admit,...",2008-08-14,1


In [9]:
# Breaking down all headlines by word
# Reference: (Alexander, 2017)

df.reset_index(inplace=True)
rows = []

_ = df.apply(lambda row: [rows.append([nn, row['Date'], row['Label']]) 
                         for nn in row.Words], axis=1)

df_new = pd.DataFrame(rows, columns=['Words','Date', 'Label'])
df_new.head()

Unnamed: 0,Words,Date,Label
0,georgia,2008-08-08,0
1,down,2008-08-08,0
2,two,2008-08-08,0
3,russian,2008-08-08,0
4,warplane,2008-08-08,0


In [10]:
print('There are %d (nonunique) words in the set'%(df_new.shape[0]))

There are 616245 (nonunique) words in the set


### Word2vec vs GloVe model

In [11]:
# GloVe

model = gensim.models.KeyedVectors.load_word2vec_format("~/Downloads/glove.6B/glove.6B.50d.txt", binary=False)
# KeyedVectors exposes the vectors directly; going through .wv is deprecated
print('GloVe model (vocabulary) is %d words with %d dimensions'%(model.vectors.shape[0],model.vectors.shape[1]))

GloVe model (vocabulary) is 400000 words with 50 dimensions




In [12]:
# Word2Vec

num_size = 100        # Word vector dimensionality                    
num_min_count = 5     # Minimum frequency of a word to be included in dictionary                       
num_workers = 4       # Number of threads to run in parallel

start_time = time.time()
w2v_model = Word2Vec(df['Words'], size=num_size, min_count=num_min_count, workers=num_workers)
end_time = time.time()
# w2v_model.save('w2v-combined-vector-size100-count5-workers4')

# print resulting matrix size
print('W2V model (vocabulary) is %d words with %d dimensions'%(w2v_model.wv.vectors.shape[0],w2v_model.wv.vectors.shape[1]))
print('training time: %.3f seconds'%(end_time-start_time))

W2V model (vocabulary) is 10404 words with 100 dimensions
training time: 5.153 seconds


GloVe produces much better word-similarity results than Word2Vec, as the comparison in the following cells shows. Its vectors are also spread more widely along each dimension, which should make it easier for a model to differentiate between words; with Word2Vec, most words lay within 0.05 of each other along each dimension.

In [13]:
topN = 5

w1 = 'war'
w2 = 'peace'
w3 = 'terror'
w4 = 'school'

In [14]:
# Word-2-Vec

print('Top %d words most similar to \'%s\':'%(topN,w1))
topSimWords_1 = w2v_model.wv.most_similar(positive=[w1],topn=topN)
for i in range(0,len(topSimWords_1)):
    print('   \'%s\': %.4f'%(topSimWords_1[i][0],topSimWords_1[i][1]))
    
print('\nSimilarity between \'%s\' and \'%s\': %.3f'%(w1,w2,w2v_model.wv.similarity(w1=w1,w2=w2)))
print('Similarity between \'%s\' and \'%s\': %.3f'%(w1,w3,w2v_model.wv.similarity(w1=w1,w2=w3)))
print('Similarity between \'%s\' and \'%s\': %.3f'%(w2,w4,w2v_model.wv.similarity(w1=w2,w2=w4)))

Top 5 words most similar to 'war':
   'humanity': 0.7551
   'cartel': 0.7549
   'cold': 0.7165
   'ii': 0.7153
   'end': 0.7051

Similarity between 'war' and 'peace': 0.520
Similarity between 'war' and 'terror': 0.397
Similarity between 'peace' and 'school': 0.132


In [15]:
# GloVe

print('Top %d words most similar to \'%s\':'%(topN,w1))
topSimWords = model.most_similar(positive=[w1],topn=topN)
for i in range(0,len(topSimWords)):
    print('   \'%s\': %.4f'%(topSimWords[i][0],topSimWords[i][1]))

# test similarities between words
print('\nSimilarity between \'%s\' and \'%s\': %.3f'%(w1,w2,model.similarity(w1=w1,w2=w2)))
print('Similarity between \'%s\' and \'%s\': %.3f'%(w1,w3,model.similarity(w1=w1,w2=w3)))
print('Similarity between \'%s\' and \'%s\': %.3f'%(w2,w4,model.similarity(w1=w2,w2=w4)))

Top 5 words most similar to 'war':
   'occupation': 0.8530
   'invasion': 0.8488
   'wars': 0.8247
   'conflict': 0.8188
   'fighting': 0.8163

Similarity between 'war' and 'peace': 0.644
Similarity between 'war' and 'terror': 0.660
Similarity between 'peace' and 'school': 0.328
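
The similarity scores printed above are cosine similarities between word vectors, which is what gensim's `similarity` method computes. A minimal sketch of that calculation follows; the helper `cosine_similarity` and the three-dimensional toy vectors are made up for illustration and are not real GloVe or Word2Vec values.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings", invented for this example.
toy_vectors = {
    'war': [1.0, 2.0, 0.0],
    'conflict': [2.0, 4.0, 0.0],   # same direction as 'war'
    'school': [0.0, 0.0, 3.0],     # orthogonal to 'war'
}

print(cosine_similarity(toy_vectors['war'], toy_vectors['conflict']))
print(cosine_similarity(toy_vectors['war'], toy_vectors['school']))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, so the measure reflects direction in the embedding space and not magnitude.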




### Create average vector per each date

In [16]:
import copy
df = copy.deepcopy(df_new)
# a set keeps the per-word membership test in the next cell fast
words = set(model.vocab.keys())



In [17]:
# Getting vectors (None for words that are not in the GloVe vocabulary)
df['Vector'] = df.apply(lambda row: np.array(model[row.Words]) if row.Words in words else None, axis=1) 

In [18]:
df.head()

Unnamed: 0,Words,Date,Label,Vector
0,georgia,2008-08-08,0,"[-1.3427, 0.4592, 0.19281, 0.71305, -0.5934, 0..."
1,down,2008-08-08,0,"[-0.1981, -0.70847, 0.85857, -0.48108, 0.51562..."
2,two,2008-08-08,0,"[0.58289, 0.36258, 0.34065, 0.36416, 0.34337, ..."
3,russian,2008-08-08,0,"[0.19318, 0.88272, 0.40764, -0.15212, 0.030107..."
4,warplane,2008-08-08,0,"[0.31237, -1.7611, 1.0351, -0.14, 0.34241, 0.1..."


In [19]:
total = len(df)
nulls = sum(df.Vector.isnull())
print('Total num of words in text: ', total)
print('Num of words found in the dictionary: ', total-nulls)
print('Num of NOT found in the dictionary: ', nulls)
print('Percentage (%) of NOT found in the dictionary: ', round(nulls/total, 4)*100)

Total num of words in text:  616245
Num of words found in the dictionary:  603423
Num of NOT found in the dictionary:  12822
Percentage (%) of NOT found in the dictionary:  2.08


In [20]:
data = copy.deepcopy(df)
data = data.dropna() # dropping rows with words that were not found

In [21]:
# Breaking down the vector and making each dimension a separate feature
data = pd.merge(data,data.Vector.apply(pd.Series),right_index=True,left_index=True)
data = data.drop(columns=['Vector'])

In [22]:
data.head()

Unnamed: 0,Words,Date,Label,0,1,2,3,4,5,6,...,40,41,42,43,44,45,46,47,48,49
0,georgia,2008-08-08,0,-1.3427,0.4592,0.19281,0.71305,-0.5934,0.063595,-0.87187,...,-0.33128,0.24235,0.42535,-1.1329,-0.37384,1.0152,-0.24836,0.47535,-0.95568,0.11488
1,down,2008-08-08,0,-0.1981,-0.70847,0.85857,-0.48108,0.51562,-0.28924,-0.64311,...,-0.47409,0.25916,0.41522,0.15245,0.093191,-0.091906,0.40082,-0.90268,0.30191,-0.89862
2,two,2008-08-08,0,0.58289,0.36258,0.34065,0.36416,0.34337,0.79387,-0.9362,...,-0.54738,0.15244,0.41,0.15702,0.007794,-0.015106,-0.28653,-0.16158,-0.35169,-0.82555
3,russian,2008-08-08,0,0.19318,0.88272,0.40764,-0.15212,0.030107,0.061858,-0.022592,...,0.57779,0.044094,1.8773,-1.4023,0.52847,-0.33256,-0.74839,1.2133,-0.6449,-0.71372
4,warplane,2008-08-08,0,0.31237,-1.7611,1.0351,-0.14,0.34241,0.13254,-1.2271,...,0.26021,0.14546,1.3043,-0.44311,0.40676,0.42714,-0.25484,1.2122,-0.66438,-0.005737


In [23]:
data.columns = data.columns.astype(str)
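
The heading of this section refers to an average vector per date, while the cells above keep one row per word. If one vector per trading day is wanted instead, the word vectors can be averaged with a `groupby`. The following is a minimal sketch on toy data with two dimensions; `toy`, `dim_cols`, and `daily` are illustrative names, and it assumes the dimension columns are named `'0'`, `'1'`, ... as in the frame above.

```python
import pandas as pd

# Toy stand-in for `data`: one row per word, two embedding dimensions.
toy = pd.DataFrame({
    'Words': ['georgia', 'down', 'wont'],
    'Date': ['2008-08-08', '2008-08-08', '2008-08-11'],
    'Label': [0, 0, 1],
    '0': [1.0, 3.0, 5.0],
    '1': [2.0, 4.0, 6.0],
})

dim_cols = ['0', '1']

# Mean of each dimension per date; Label is constant within a date, so take the first.
aggregations = {col: 'mean' for col in dim_cols}
aggregations['Label'] = 'first'
daily = toy.groupby('Date', as_index=False).agg(aggregations)
print(daily)
```

With the real data, `dim_cols` would be the fifty columns `'0'` to `'49'`, giving one row per trading day.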

## Modeling

### Train/test dataset splitting

In [24]:
# Train and test sets, split chronologically

import datetime

data.Date = pd.to_datetime(data["Date"])

# Defining train set: everything before 2015-01-02

train = data.loc[data.Date < datetime.datetime(2015, 1, 2)]
X_train = train.loc[:, '0':'49'] # leaving out Label, Words, and Date
y_train = train.Label

# Defining test set: 2015-01-02 onwards

test = data.loc[data.Date >= datetime.datetime(2015, 1, 2)]
X_test = test.loc[:, '0':'49'] # leaving out Label, Words, and Date
y_test = test.Label

print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

(484277, 50) (484277,)
(119146, 50) (119146,)
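
The split above is chronological rather than random, so no test-period headlines leak into training. A minimal sketch of the same idea as a reusable helper; the function name `split_by_date`, its parameters, and the toy frame are hypothetical and not part of the original notebook:

```python
import pandas as pd

def split_by_date(frame, cutoff, feature_cols, label_col='Label', date_col='Date'):
    """Rows dated strictly before `cutoff` go to train, the rest to test."""
    dates = pd.to_datetime(frame[date_col])
    is_train = dates < pd.Timestamp(cutoff)
    train_part, test_part = frame[is_train], frame[~is_train]
    return (train_part[feature_cols], train_part[label_col],
            test_part[feature_cols], test_part[label_col])

# Toy data with two feature columns (illustrative values only)
toy = pd.DataFrame({
    'Date': ['2014-12-31', '2015-01-02', '2015-01-05'],
    'Label': [1, 0, 1],
    '0': [0.1, 0.2, 0.3],
    '1': [0.4, 0.5, 0.6],
})
X_tr, y_tr, X_te, y_te = split_by_date(toy, '2015-01-02', ['0', '1'])
print(X_tr.shape, X_te.shape)
```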


### Train and test models

In [25]:
import xgboost
from xgboost.sklearn import XGBClassifier
from sklearn.neural_network import MLPClassifier

In [50]:
from sklearn.ensemble import RandomForestClassifier

models = ['RF', 'XGB', 'MLP']

def predModel(X_train, y_train, X_test, y_test, modelType='RF', proba=False):
    """Fit the chosen model on the train set and predict on the test set.

    Returns class labels, or class probabilities when proba=True.
    y_test is unused; it is kept so that existing calls keep working.
    """
    if modelType == 'RF':
        clf = RandomForestClassifier(n_jobs=2, random_state=0)
    elif modelType == 'XGB':
        clf = XGBClassifier(objective='binary:logistic')
    elif modelType == 'MLP':
        clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
    else:
        raise ValueError('modelType must be one of %s' % models)

    clf.fit(X_train, y_train)
    # Predict once: probabilities if requested, hard labels otherwise
    return clf.predict_proba(X_test) if proba else clf.predict(X_test)

In [27]:
# Getting XGBoost label predictions
xgb_y_pred = predModel(X_train, y_train, X_test, y_test, modelType='XGB', proba=False)

In [28]:
# Getting MLP label predictions
mlp_y_pred = predModel(X_train, y_train, X_test, y_test, modelType='MLP', proba=False)

### XGBoost

In [29]:
# One-column frame holding the word-level predicted labels
xgb_y_pred = pd.DataFrame({'Preds_for_1': xgb_y_pred})
xgb_y_pred.head()

Unnamed: 0,Preds_for_1
0,1
1,1
2,1
3,0
4,1


In [30]:
xgb_results = test[['Date', 'Words', 'Label']]
xgb_results.head()

Unnamed: 0,Date,Words,Label
494541,2015-01-02,case,1
494542,2015-01-02,cancer,1
494543,2015-01-02,result,1
494544,2015-01-02,sheer,1
494545,2015-01-02,bad,1


In [31]:
# assign returns a new frame, which avoids the SettingWithCopyWarning
# raised when writing into a slice of `test`
xgb_results = xgb_results.assign(Preds=xgb_y_pred.values.ravel())
xgb_results.head()


Unnamed: 0,Date,Words,Label,Preds
494541,2015-01-02,case,1,1
494542,2015-01-02,cancer,1,1
494543,2015-01-02,result,1,1
494544,2015-01-02,sheer,1,0
494545,2015-01-02,bad,1,1


In [32]:
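# Grouping word-level predictions back by date
# The mean prediction per date is the share of that day's words predicted as 1
# Only the numeric columns are averaged; Words is a string column
xgb_check = xgb_results.groupby('Date')[['Label', 'Preds']].mean()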
xgb_check.head()

Unnamed: 0_level_0,Label,Preds
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,1.0,0.994505
2015-01-05,0.0,1.0
2015-01-06,0.0,0.996466
2015-01-07,1.0,0.990937
2015-01-08,1.0,0.96587


In [33]:
# Preds now stands for mean predictions
# Determine the final label based on the mean predictions
xgb_check['Pred_label'] = xgb_check.apply(lambda row: 1 if row.Preds > 0.53 else 0, axis=1)
xgb_check.head()

Unnamed: 0_level_0,Label,Preds,Pred_label
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-02,1.0,0.994505,1
2015-01-05,0.0,1.0,1
2015-01-06,0.0,0.996466,1
2015-01-07,1.0,0.990937,1
2015-01-08,1.0,0.96587,1


In [34]:
xgb_check['Accuracy_per_pred'] = xgb_check.apply(lambda row: 1 if row.Label == row.Pred_label else 0, axis=1)
xgb_check.head()

Unnamed: 0_level_0,Label,Preds,Pred_label,Accuracy_per_pred
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-01-02,1.0,0.994505,1,1
2015-01-05,0.0,1.0,1,0
2015-01-06,0.0,0.996466,1,0
2015-01-07,1.0,0.990937,1,1
2015-01-08,1.0,0.96587,1,1


In [35]:
accuracy_results = xgb_check['Accuracy_per_pred'].value_counts()
print('Correctly predicted (%): ', round(accuracy_results[1]/(accuracy_results[0]+accuracy_results[1]), 3)*100)

Correctly predicted (%):  50.8
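
The aggregation steps above (mean prediction per date, threshold, comparison with the day's label) can be sketched as one function. This is a minimal illustration, assuming a frame with `Date`, `Label`, and `Preds` columns as in `xgb_results`; the name `daily_accuracy` and the toy data are hypothetical, and the default threshold of 0.53 is the value used in this notebook:

```python
import pandas as pd

def daily_accuracy(results, threshold=0.53):
    """Share of dates whose thresholded mean prediction matches the label."""
    daily = results.groupby('Date')[['Label', 'Preds']].mean()
    # A day is labeled 1 when its mean word-level prediction exceeds the threshold
    daily['Pred_label'] = (daily['Preds'] > threshold).astype(int)
    return (daily['Pred_label'] == daily['Label']).mean()

# Toy word-level results for two dates (illustrative values only)
toy = pd.DataFrame({
    'Date': ['2015-01-02'] * 3 + ['2015-01-05'] * 2,
    'Label': [1, 1, 1, 0, 0],
    'Preds': [1, 0, 1, 1, 1],
})
print(daily_accuracy(toy))
```

Because every word of a given date carries the same label, the per-date mean of `Label` is exactly 0 or 1 and can be compared directly with `Pred_label`.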


### MLP Classifier

In [36]:
# One-column frame holding the word-level predicted labels
mlp_y_pred = pd.DataFrame({'Preds_for_1': mlp_y_pred})
mlp_y_pred.head()

Unnamed: 0,Preds_for_1
0,1
1,1
2,1
3,1
4,1


In [37]:
# Joining all columns
mlp_results = test[['Date', 'Words', 'Label']]
mlp_results.head()

Unnamed: 0,Date,Words,Label
494541,2015-01-02,case,1
494542,2015-01-02,cancer,1
494543,2015-01-02,result,1
494544,2015-01-02,sheer,1
494545,2015-01-02,bad,1


In [39]:
# assign returns a new frame, which avoids the SettingWithCopyWarning
# raised when writing into a slice of `test`
mlp_results = mlp_results.assign(Preds=mlp_y_pred.values.ravel())
mlp_results.head() # Preds = predicted label per word


Unnamed: 0,Date,Words,Label,Preds
494541,2015-01-02,case,1,1
494542,2015-01-02,cancer,1,1
494543,2015-01-02,result,1,1
494544,2015-01-02,sheer,1,1
494545,2015-01-02,bad,1,1


In [40]:
# Mean word-level prediction per date; only the numeric columns are averaged
mlp_check = mlp_results.groupby('Date')[['Label', 'Preds']].mean()
mlp_check.head()

Unnamed: 0_level_0,Label,Preds
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,1,1
2015-01-05,0,1
2015-01-06,0,1
2015-01-07,1,1
2015-01-08,1,1


In [41]:
mlp_check['Pred_label'] = mlp_check.apply(lambda row: 1 if row.Preds > 0.53 else 0, axis=1)
mlp_check.head()

Unnamed: 0_level_0,Label,Preds,Pred_label
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-02,1,1,1
2015-01-05,0,1,1
2015-01-06,0,1,1
2015-01-07,1,1,1
2015-01-08,1,1,1


In [42]:
mlp_check['Accuracy_per_pred'] = mlp_check.apply(lambda row: 1 if row.Label == row.Pred_label else 0, axis=1)
mlp_check.head()

Unnamed: 0_level_0,Label,Preds,Pred_label,Accuracy_per_pred
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-01-02,1,1,1,1
2015-01-05,0,1,1,0
2015-01-06,0,1,1,0
2015-01-07,1,1,1,1
2015-01-08,1,1,1,1


In [43]:
accuracy_results_mlp = mlp_check['Accuracy_per_pred'].value_counts()
print('Correctly predicted (%): ',
      round(accuracy_results_mlp[1]/(accuracy_results_mlp[0]+accuracy_results_mlp[1]), 3)*100)

Correctly predicted (%):  50.8


In [54]:
# Refit the MLP and get class probabilities instead of hard labels
mlp_y_pred_probs = predModel(X_train, y_train, X_test, y_test, modelType='MLP', proba=True)
mlp_preds_for1 = mlp_y_pred_probs[:, 1] # second column = probability of class 1
mlp_preds_for1_series = pd.Series(mlp_preds_for1)
mlp_preds_for1_series.head()

0    0.542140
1    0.541941
2    0.542140
3    0.541982
4    0.542140
dtype: float64

In [55]:
# One-column frame holding the predicted probability of class 1
mlp_preds_for1_series = mlp_preds_for1_series.to_frame('Preds_for_1')
mlp_preds_for1_series.head()

Unnamed: 0,Preds_for_1
0,0.54214
1,0.541941
2,0.54214
3,0.541982
4,0.54214


In [56]:
# assign returns a new frame, which avoids the SettingWithCopyWarning
mlp_results = mlp_results.assign(Precise_pred_probs=mlp_preds_for1_series.values.ravel())
mlp_results.head()


Unnamed: 0,Date,Words,Label,Preds,Precise_pred_probs
494541,2015-01-02,case,1,1,0.54214
494542,2015-01-02,cancer,1,1,0.541941
494543,2015-01-02,result,1,1,0.54214
494544,2015-01-02,sheer,1,1,0.541982
494545,2015-01-02,bad,1,1,0.54214


In [57]:
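# Mean label, predicted label, and predicted probability per date (numeric columns only)
mlp_check_probs = mlp_results.groupby('Date')[['Label', 'Preds', 'Precise_pred_probs']].mean()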
mlp_check_probs.head()

Unnamed: 0_level_0,Label,Preds,Precise_pred_probs
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-02,1,1,0.5421
2015-01-05,0,1,0.542102
2015-01-06,0,1,0.542083
2015-01-07,1,1,0.542091
2015-01-08,1,1,0.542092


In [49]:
# checking the share of 1s in the training labels,
# for comparison with the mean predicted probabilities above
y_train.value_counts()[1]/(y_train.value_counts()[1]+y_train.value_counts()[0])

0.5420823206553275
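
The share of 1s in the training labels (about 0.5421) is almost identical to the MLP's mean predicted probability per date, which suggests the network outputs roughly the class prior whatever the input word. One way to make that comparison explicit is a prior-only baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; it is an illustration under that assumption, not a step from the original notebook:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with 60% ones (illustrative values only)
y_toy = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the baseline

# strategy='prior' always predicts the majority class and
# returns the class frequencies as probabilities
baseline = DummyClassifier(strategy='prior').fit(X_toy, y_toy)
proba_up = baseline.predict_proba(X_toy)[:, 1]

print(proba_up[0])                   # probability of class 1 = share of 1s
print(baseline.score(X_toy, y_toy))  # accuracy of always predicting the majority class
```

A model whose probabilities barely move away from this baseline has learned little beyond the label distribution.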

### References

Alexander, 2017. In Stack Overflow. Retrieved from: https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows