## PROJECT: STOCK SENTIMENT ANALYSIS

OBJECTIVE: 

> Develop a machine learning model that predicts the direction of stock price movements for a given company by applying sentiment analysis to news headlines.

__________________________________________

In [None]:
# importing the libraries 
import pandas as pd

import re

In [None]:
## Downloading the Dataset :

!wget https://raw.githubusercontent.com/krishnaik06/Stock-Sentiment-Analysis/master/Data.csv

'wget' is not recognized as an internal or external command,
operable program or batch file.
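
The `wget` error above is what Windows reports when the `wget` program is not installed. A minimal sketch of a portable alternative using only the Python standard library follows; `download_if_missing` and `DATA_URL` are hypothetical names introduced here, and the call itself needs network access.

```python
import os
import urllib.request

# Same URL as in the !wget cell above.
DATA_URL = "https://raw.githubusercontent.com/krishnaik06/Stock-Sentiment-Analysis/master/Data.csv"

def download_if_missing(url, path):
    """Download url to path unless something already exists at path; return path."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Usage (downloads only on the first run):
# download_if_missing(DATA_URL, "Data.csv")
```

Skipping the download when the file already exists keeps repeated notebook runs from fetching the CSV again.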


In [None]:
df  = pd.read_csv('Data.csv',encoding='latin1')
df.head(3)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite
2,2000-01-05,0,Coventry caught on counter by Flo,United's rivals on the road to Rio,Thatcher issues defence before trial by video,Police help Smith lay down the law at Everton,Tale of Trautmann bears two more retellings,England on the rack,Pakistan retaliate with call for video of Walsh,Cullinan continues his Cape monopoly,...,South Melbourne (Australia),Necaxa (Mexico),Real Madrid (Spain),Raja Casablanca (Morocco),Corinthians (Brazil),Tony's pet project,Al Nassr (Saudi Arabia),Ideal Holmes show,Pinochet leaves hospital after tests,Useful links


## <b>METHOD 1</b>


In [None]:
## Declaring the predictors and labels :

X = df.drop(['Date','Label'],axis = 1)
y = df['Label']

print (f'''
Shape of the dataframe:
* X_Shape: {X.shape}
* y_Shape: {y.shape}

''')


Shape of the dataframe:
* X_Shape: (4101, 25)
* y_Shape: (4101,)




## PREPROCESS THE ENTIRE DATASET :

In [None]:
# Here, we join all 25 headlines of each row (one date) into a single string and clean the text (letters only, lowercase, single spaces)

X_updated = []
for i in range(len(X)):

  X_updated.append (' '.join(re.sub("[^a-zA-Z]",' ',(' '.join(str(x) for x in X.iloc[i])).lower()).split()))



In [None]:
X_updated[0]

'a hindrance to operations extracts from the leaked reports scorecard hughes instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes leeds pay the penalty hammers hand robson a youthful lesson saints party like it s wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title'
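
The one-line cleaning expression in the loop above can be hard to read. The sketch below applies the same three steps (join, keep letters only, collapse whitespace) to two headlines taken from the first row; `clean_headlines` is a hypothetical helper name, not part of the original notebook.

```python
import re

def clean_headlines(headlines):
    """Join headlines, lowercase them, keep letters only, collapse whitespace."""
    joined = ' '.join(str(h) for h in headlines)            # one string per row
    letters_only = re.sub('[^a-zA-Z]', ' ', joined.lower())  # punctuation/digits -> space
    return ' '.join(letters_only.split())                    # collapse repeated spaces

sample = ["Hughes' instant hit buoys Blues",
          "Jack gets his skates on at ice-cold Alex"]
cleaned = clean_headlines(sample)
print(cleaned)
# hughes instant hit buoys blues jack gets his skates on at ice cold alex
```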

## <b>PERFORMING TF-IDF</b>

In [None]:
# Performing TFIDF on X_updated :

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf =  TfidfVectorizer()
x = tfidf.fit_transform(X_updated)

X_tfidf = pd.DataFrame(x.toarray())  # This will be our final training and testing dataset 

In [None]:
X_tfidf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,46668,46669,46670,46671,46672,46673,46674,46675,46676,46677
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
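
Calling `toarray()` stores every cell of the matrix. For 4101 rows and 46678 float64 columns that is roughly 1.5 GB, almost all of it zeros. scikit-learn estimators such as `MultinomialNB` accept the sparse matrix directly, so the dense DataFrame is optional. A minimal sketch on a toy corpus (`toy_docs` and `toy_labels` are made-up data for illustration only):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from Data.csv.
toy_docs = ["markets rally on strong earnings",
            "shares slump after weak earnings",
            "strong rally lifts shares",
            "weak outlook drags markets"]
toy_labels = [1, 0, 1, 0]

# fit_transform returns a SciPy sparse matrix; no .toarray() needed.
x_sparse = TfidfVectorizer().fit_transform(toy_docs)
print(sparse.issparse(x_sparse), x_sparse.shape)

# The classifier trains and predicts on the sparse matrix as-is.
toy_model = MultinomialNB().fit(x_sparse, toy_labels)
toy_pred = toy_model.predict(x_sparse)

# Memory needed for the dense version of the real matrix (float64 = 8 bytes).
dense_bytes = 4101 * 46678 * 8
print(f"{dense_bytes / 1e9:.2f} GB")
```

`train_test_split` also accepts sparse matrices, so the rest of the pipeline would not need to change.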


In [None]:
# Splitting the training and testing data:

from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X_tfidf, y, train_size = 0.75, random_state = 41)

## <b><u>MODELING ON TF-IDF DATA:</u></b>

### <b>NAIVE BAYES MODEL:</b>

In [None]:
# Creating a training model using the Naive Bayes classification technique

from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB()

# Fitting on the training data :
model_nb.fit(x_train, y_train)

In [None]:
# Getting the predictions on the training and test dataset :

y_train_pred= model_nb.predict(x_train)
y_test_pred = model_nb.predict(x_test)

In [None]:
# Getting the accuracy of the model:

from sklearn.metrics import classification_report, accuracy_score, f1_score

print (f'''
MODEL PERFORMANCE :

1) Train Set Accuracy : {accuracy_score(y_train, y_train_pred)}
2) Test Set Accuracy : {accuracy_score(y_test, y_test_pred)}
3) Classification REPORT ON TEST SET: 
{classification_report(y_test, y_test_pred)}
''')


MODEL PERFORMANCE :

1) Train Set Accuracy : 0.6832520325203252
2) Test Set Accuracy : 0.5165692007797271
3) Classification REPORT ON TEST SET: 
              precision    recall  f1-score   support

           0       0.40      0.02      0.04       491
           1       0.52      0.97      0.68       535

    accuracy                           0.52      1026
   macro avg       0.46      0.50      0.36      1026
weighted avg       0.46      0.52      0.37      1026
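
To put the test accuracy in context, it helps to compare it with the simplest possible baseline: always predicting the larger class. The sketch below uses only the support counts and the accuracy printed in the report above; the variable names are introduced here for illustration.

```python
# Support per class, copied from the classification report above.
support = {0: 491, 1: 535}
reported_test_accuracy = 0.5165692007797271

total = sum(support.values())
# Accuracy of a "model" that always predicts the majority class (label 1).
majority_baseline = max(support.values()) / total

print(f"majority-class baseline: {majority_baseline:.4f}")
print(f"reported test accuracy : {reported_test_accuracy:.4f}")
```

With a recall of 0.97 for class 1 and 0.02 for class 0, the classifier is predicting label 1 almost every time, and its accuracy ends up slightly below this baseline.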




### <b>RANDOM FOREST MODEL</b>

In [None]:
# Creating a Training model Using Random forest Model

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators = 200, criterion= 'entropy')

# Fitting on the training data :
model_rf.fit(x_train, y_train)

In [None]:
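# Getting the predictions on the training and test dataset :
# Predict with model_rf, not model_nb. The output recorded below was produced
# with model_nb, which is why it repeats the Naive Bayes numbers exactly.

y_train_pred= model_rf.predict(x_train)
y_test_pred = model_rf.predict(x_test)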

In [None]:
# Getting the accuracy of the Random Forest Model

print (f'''
MODEL PERFORMANCE :

1) Train Set Accuracy : {accuracy_score(y_train, y_train_pred)}
2) Test Set Accuracy : {accuracy_score(y_test, y_test_pred)}
3) Classification REPORT ON TEST SET: 
{classification_report(y_test, y_test_pred)}
''')


MODEL PERFORMANCE :

1) Train Set Accuracy : 0.6832520325203252
2) Test Set Accuracy : 0.5165692007797271
3) Classification REPORT ON TEST SET: 
              precision    recall  f1-score   support

           0       0.40      0.02      0.04       491
           1       0.52      0.97      0.68       535

    accuracy                           0.52      1026
   macro avg       0.46      0.50      0.36      1026
weighted avg       0.46      0.52      0.37      1026




In [None]:
%reset

## <b>METHOD 2</b>

In [None]:
# importing the libraries 
import pandas as pd
import re

In [None]:
df  = pd.read_csv('Data.csv',encoding='latin1')
df.head(3)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite
2,2000-01-05,0,Coventry caught on counter by Flo,United's rivals on the road to Rio,Thatcher issues defence before trial by video,Police help Smith lay down the law at Everton,Tale of Trautmann bears two more retellings,England on the rack,Pakistan retaliate with call for video of Walsh,Cullinan continues his Cape monopoly,...,South Melbourne (Australia),Necaxa (Mexico),Real Madrid (Spain),Raja Casablanca (Morocco),Corinthians (Brazil),Tony's pet project,Al Nassr (Saudi Arabia),Ideal Holmes show,Pinochet leaves hospital after tests,Useful links


In [None]:
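## Splitting the train and test data based on year:
# 'Date' holds 'YYYY-MM-DD' strings, so the cut-offs must use the same format.
# A cut-off without hyphens ('20150101') places every 2015 row in BOTH sets.

train_df = df[df['Date'] < '2015-01-01']
test_df = df[df['Date'] > '2014-12-31']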
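
The `Date` column holds `'YYYY-MM-DD'` strings, so these filters compare text character by character. A cut-off written without hyphens, such as `'20150101'`, behaves differently from a hyphenated one because `'-'` sorts before every digit. The sketch below illustrates this on four made-up dates; `split_by_cutoffs` is a hypothetical helper.

```python
# Hypothetical sample dates in the same format as the Date column.
sample_dates = ["2014-12-31", "2015-01-02", "2015-06-30", "2016-01-04"]

def split_by_cutoffs(dates, train_before, test_after):
    """Plain string comparison, as the DataFrame filters do on a string column."""
    train = [d for d in dates if d < train_before]
    test = [d for d in dates if d > test_after]
    return train, test

# Cut-offs without hyphens: '2015-...' < '20150101' because '-' < '0'.
train_a, test_a = split_by_cutoffs(sample_dates, "20150101", "20141231")
overlap_a = sorted(set(train_a) & set(test_a))
print(overlap_a)   # ['2015-01-02', '2015-06-30']

# Cut-offs in the same format as the data: no row lands in both sets.
train_b, test_b = split_by_cutoffs(sample_dates, "2015-01-01", "2014-12-31")
overlap_b = sorted(set(train_b) & set(test_b))
print(overlap_b)   # []
```

Converting the column with `pd.to_datetime` and comparing against timestamps would avoid the format dependence altogether.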


In [None]:
# Removing the punctuation :

# FROM TRAIN DATA (columns 2..26 are Top1..Top25)
# replace() without inplace returns a new frame, which avoids modifying a slice of train_df
train_data = train_df.iloc[:, 2:27].replace('[^a-zA-Z]', ' ', regex = True)

# FROM TEST DATA
test_data = test_df.iloc[:, 2:27].replace('[^a-zA-Z]', ' ', regex = True)

# Converting all headlines to lowercase:

for column in train_data.columns.to_list():
  train_data[column]   = train_data[column].str.lower()
  test_data[column]    = test_data[column].str.lower()

train_data.head()

Unnamed: 0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,a hindrance to operations extracts from the...,scorecard,hughes instant hit buoys blues,jack gets his skates on at ice cold alex,chaos as maracana builds up for united,depleted leicester prevail as elliott spoils e...,hungry spurs sense rich pickings,gunners so wide of an easy target,derby raise a glass to strupar s debut double,southgate strikes leeds pay the penalty,...,flintoff injury piles on woe for england,hunters threaten jospin with new battle of the...,kohl s successor drawn into scandal,the difference between men and women,sara denver nurse turned solicitor,diana s landmine crusade put tories in a panic,yeltsin s resignation caught opposition flat f...,russian roulette,sold out,recovering a title
1,scorecard,the best lake scene,leader german sleaze inquiry,cheerio boyo,the main recommendations,has cubie killed fees,has cubie killed fees,has cubie killed fees,hopkins furious at foster s lack of hannibal...,has cubie killed fees,...,on the critical list,the timing of their lives,dear doctor,irish court halts ira man s extradition to nor...,burundi peace initiative fades after rebels re...,pe points the way forward to the ecb,campaigners keep up pressure on nazi war crime...,jane ratcliffe,yet more things you wouldn t know without the ...,millennium bug fails to bite
2,coventry caught on counter by flo,united s rivals on the road to rio,thatcher issues defence before trial by video,police help smith lay down the law at everton,tale of trautmann bears two more retellings,england on the rack,pakistan retaliate with call for video of walsh,cullinan continues his cape monopoly,mcgrath puts india out of their misery,blair witch bandwagon rolls on,...,south melbourne australia,necaxa mexico,real madrid spain,raja casablanca morocco,corinthians brazil,tony s pet project,al nassr saudi arabia,ideal holmes show,pinochet leaves hospital after tests,useful links
3,pilgrim knows how to progress,thatcher facing ban,mcilroy calls for irish fighting spirit,leicester bin stadium blueprint,united braced for mexican wave,auntie back in fashion even if the dress look...,shoaib appeal goes to the top,hussain hurt by shambles but lays blame on e...,england s decade of disasters,revenge is sweet for jubilant cronje,...,putin admits yeltsin quit to give him a head s...,bbc worst hit as digital tv begins to bite,how much can you pay for,christmas glitches,upending a table chopping a line and scoring ...,scientific evidence unreliable defence claims,fusco wins judicial review in extradition case,rebels thwart russian advance,blair orders shake up of failing nhs,lessons of law s hard heart
4,hitches and horlocks,beckham off but united survive,breast cancer screening,alan parker,guardian readers are you all whingers,hollywood beyond,ashes and diamonds,whingers a formidable minority,alan parker part two,thuggery toxins and ties,...,most everywhere udis,most wanted chloe lunettes,return of the cane completely off the agenda,from sleepy hollow to greeneland,blunkett outlines vision for over s,embattled dobson attacks play now pay later ...,doom and the dome,what is the north south divide,aitken released from jail,gone aloft


In [None]:
# Merging all headlines in one:

# FOR TRAIN DATA:
headlines_train  = []
for row in range(0, len(train_data)):
  headlines_train.append(' '.join(str(i) for i in train_data.iloc[row, :]))

# FOR TEST DATA:
headlines_test  = []
for row in range(0, len(test_data)):
  headlines_test.append(' '.join(str(i) for i in test_data.iloc[row, :]))
  

In [None]:
headlines_train[0]

'a  hindrance to operations   extracts from the leaked reports scorecard hughes  instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes  leeds pay the penalty hammers hand robson a youthful lesson saints party like it s      wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit     flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver  nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title'

In [None]:
headlines_test[0]

'most cases of cancer are the result of sheer bad luck rather than unhealthy lifestyles  diet or even inherited genes  new research suggests  random mutations that occur in dna when cells divide are responsible for two thirds of adult cancers across a wide range of tissues  iran dismissed united states efforts to fight islamic state as a ploy to advance u s  policies in the region   the reality is that the united states is not acting to eliminate daesh  they are not even interested in weakening daesh  they are only interested in managing it  poll  one in   germans would join anti muslim marches uk royal family s prince andrew named in us lawsuit over underage sex allegations some    asylum seekers refused to leave the bus when they arrived at their destination in rural northern sweden  demanding that they be taken back to malm or  some big city   pakistani boat blows self up after india navy chase  all four people on board the vessel from near the pakistani port city of karachi are bel

### <b>PERFORMING TF-IDF ON TRAIN AND TEST DATA</b>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


# fit tfidf vectorizer on the combined train and test data
vectorizer = TfidfVectorizer()
vectorizer.fit(headlines_train + headlines_test)

# transform the train and test data using the updated vectorizer
train_matrix = vectorizer.transform(headlines_train)
test_matrix = vectorizer.transform(headlines_test)

<3975x46678 sparse matrix of type '<class 'numpy.float64'>'
	with 817766 stored elements in Compressed Sparse Row format>
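
Fitting the vectorizer on train and test text together lets the vocabulary and IDF weights see the test set. A common alternative is to fit on the training text only and reuse the fitted vectorizer for the test text. The sketch below shows this on made-up sentences; `toy_train` and `toy_test` are hypothetical, and the feature count on the real data would differ from the one recorded above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy text standing in for headlines_train / headlines_test.
toy_train = ["markets rally on strong earnings", "shares slump after weak earnings"]
toy_test = ["oil prices rally", "weak earnings again"]

train_only_vectorizer = TfidfVectorizer()
# Vocabulary and IDF weights are learned from the training text only.
toy_train_matrix = train_only_vectorizer.fit_transform(toy_train)
# Test text is mapped onto that vocabulary; unseen words are ignored.
toy_test_matrix = train_only_vectorizer.transform(toy_test)

print(toy_train_matrix.shape, toy_test_matrix.shape)
print("oil" in train_only_vectorizer.vocabulary_)
```

Both matrices share the same columns, so a model trained on the first can score the second without further alignment.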

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# fit a bigram CountVectorizer (raw counts, not TF-IDF) on the combined train and test data
cv = CountVectorizer(ngram_range= (2,2))
cv.fit(headlines_train + headlines_test)

# transform the train and test data using the fitted vectorizer
# (this overwrites the TF-IDF train_matrix / test_matrix, so the model below is trained on bigram counts)
train_matrix = cv.transform(headlines_train)
test_matrix = cv.transform(headlines_test)



### <b>RANDOM FOREST MODEL</b>

In [None]:
# Creating a Training model Using Random forest Model

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators = 200, criterion= 'entropy')

# Fitting on the training data :
model_rf.fit(train_matrix, train_df['Label'])

In [None]:
# Getting the predictions on the training and test dataset :

from sklearn.metrics import classification_report, accuracy_score, f1_score


y_train_pred= model_rf.predict(train_matrix)
y_test_pred = model_rf.predict(test_matrix)

# Getting the accuracy of the Random Forest Model

print (f'''
MODEL PERFORMANCE :

1) Train Set Accuracy : {accuracy_score(train_df['Label'], y_train_pred)}
2) Test Set Accuracy : {accuracy_score(test_df['Label'], y_test_pred)}

3) Classification REPORT ON TEST SET: 

{classification_report(test_df['Label'], y_test_pred)}

''')


MODEL PERFORMANCE :

1) Train Set Accuracy : 1.0
2) Test Set Accuracy : 0.828042328042328

3) Classification REPORT ON TEST SET: 

              precision    recall  f1-score   support

           0       0.90      0.73      0.81       186
           1       0.78      0.92      0.84       192

    accuracy                           0.83       378
   macro avg       0.84      0.83      0.83       378
weighted avg       0.84      0.83      0.83       378
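
The recorded row counts can be cross-checked with simple arithmetic. The sketch assumes that the 3975-row sparse matrix shown earlier is `train_matrix`, and that every row of `Data.csv` falls into at least one of the two date filters. The chance-level figure of 0.5 for unseen rows is an assumption made here for illustration, not a measured value.

```python
total_rows = 4101   # rows in Data.csv (shape printed in Method 1)
train_rows = 3975   # rows of the sparse matrix recorded in Method 2
test_rows = 378     # total support in the report above

# Rows that must appear in both sets for the counts to add up.
overlap = train_rows + test_rows - total_rows
unseen = test_rows - overlap
print(overlap, unseen)   # 252 126

# Assumption: the forest reproduces every training label (train accuracy 1.0)
# and is at chance level (0.5) on rows it has not seen.
expected = (overlap * 1.0 + unseen * 0.5) / test_rows
print(round(expected, 3))   # 0.833
```

Under these assumptions the expected value is close to the reported 0.828, which suggests that the test accuracy mostly reflects rows shared between the training and test sets.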



