### Problem Statement

- Dataset is a combination of the world news and stock price shifts available in kaggle
- Data contains 25 top news headlines for each day.
- Data ranges from 2008-2016 and the data from 2000-1008 was scrapped from Yahoo! Finance.
- Labels are based on **Dow Jones Industrial Average** stock index.
    - Class 1: Stock price stayed the same or increased
    - Class 0: Stock price decreased
    
    
***Notebook Referred***
1. https://www.kaggle.com/code/ndrewgele/omg-nlp-with-the-djia-and-reddit

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


In [3]:
## Understanding the date division
df.Date.min() , df.Date.max()

('2008-08-08', '2016-07-01')

In [4]:
### Divide the dataset into train and test
train = df[df.Date < '2015-01-01']
test = df[df.Date >= '2015-01-01']

In [5]:
len(train), len(test)

(1611, 378)

#### Build a Bag of Words model

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [8]:
## Combine all lower-cased News headlines as one paragraph
trainheadlines = []
for row in range(len(train)):
    news_head = ' '.join(str(headline).lower() for headline in train.iloc[row,2::])
    trainheadlines.append(' '.join(str(word) for word in CountVectorizer().build_tokenizer()(news_head)))
    
trainheadlines[2]

'remember that adorable year old who sang at the opening ceremonies that was fake too russia ends georgia operation if we had no sexual harassment we would have no children al qa eda is losing support in iraq because of brutal crackdown on activities it regards as un islamic including women buying cucumbers ceasefire in georgia putin outmaneuvers the west why microsoft and intel tried to kill the xo 100 laptop stratfor the russo georgian war and the balance of power trying to get sense of this whole georgia russia war vote up if you think georgia started it or down if you think russia did the us military was surprised by the timing and swiftness of the russian military move into south ossetia and is still trying to sort out what happened us defense official said monday beats war drum as iran dumps the dollar gorbachev georgian military attacked the south ossetian capital of tskhinvali with multiple rocket launchers designed to devastate large areas cnn use footage of tskhinvali ruins t

In [9]:
vectorizer = CountVectorizer(ngram_range=(2, 2))

train_vectors = vectorizer.fit_transform(trainheadlines)
print(train_vectors.shape)

(1611, 366721)


In [11]:
### Building a Naive-Bayes Model
model = MultinomialNB()

model.fit(train_vectors, train['Label'])

MultinomialNB()

In [12]:
testheadlines = []
for row in range(len(test)):
    news_head = ' '.join(str(headline).lower() for headline in test.iloc[row,2::])
    testheadlines.append(' '.join(str(word) for word in CountVectorizer().build_tokenizer()(news_head)))
    
test_vectors = vectorizer.transform(testheadlines)
y_pred = model.predict(test_vectors)

In [13]:
### Use Cross-tab to view the results
pd.crosstab(test["Label"], y_pred, rownames=["Actual"], colnames=["Predicted"])

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,17,169
1,21,171


In [14]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(accuracy_score(test["Label"], y_pred))
print(classification_report(test["Label"], y_pred))

0.4973544973544973
              precision    recall  f1-score   support

           0       0.45      0.09      0.15       186
           1       0.50      0.89      0.64       192

    accuracy                           0.50       378
   macro avg       0.48      0.49      0.40       378
weighted avg       0.48      0.50      0.40       378

