# Forecasting the Stock Market using News Sentiment Analysis

### This project is a reproduction of the findings from the paper "Forecasting the Stock Market using News Sentiment Analysis":

See here: http://arno.uvt.nl/show.cgi?fid=157031

In [1]:
import pandas as pd
import numpy as np
from sklearn import svm
from textblob import TextBlob
from scipy.stats import zscore
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import classification_report, accuracy_score, make_scorer, confusion_matrix

### 1.A. News articles preprocessing (all-the-news-2-1.csv):
* Keep the date, title, section, and publication columns.
* Filter out all rows without a title, so that every entry has a filled title column.
* Some publishers do not fill the section column; these entries are replaced with “unknown”. However, this study focuses mainly on financial and business news articles, and with an “unknown” section it is impossible to decide whether an article belongs to that category.
* Therefore, this study uses two news article datasets: (1) all news articles with a financial or business-related section, and (2) dataset 1 combined with all news articles whose section is unknown.
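
The preprocessing steps listed above can be sketched on a toy table. This is only an illustration under assumptions: the rows in `toy_news` are invented, and filling every missing section with "unknown" is one possible reading of the paper, not a confirmed one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the news data; the rows are invented for illustration.
toy_news = pd.DataFrame({
    "title": ["Fed holds rates", None, "Celebrity gossip", "Bank earnings beat"],
    "section": ["Business News", "Financials", np.nan, "Financials"],
})

# Drop rows without a title, then label missing sections as "unknown".
toy_news = toy_news[toy_news["title"].notna()].copy()
toy_news["section"] = toy_news["section"].fillna("unknown")

business_sections = ["Financials", "Business News"]

# Dataset 1: business or financial sections only.
dataset_1 = toy_news[toy_news["section"].isin(business_sections)].reset_index(drop=True)

# Dataset 2: dataset 1 plus every article whose section is unknown.
dataset_2 = toy_news[
    toy_news["section"].isin(business_sections + ["unknown"])
].reset_index(drop=True)

print(len(dataset_1), len(dataset_2))
```

Calling `.copy()` after the title filter avoids assigning into a view of the original frame, which is likely why the direct assignment attempt in this notebook was left commented out.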

**keep 'date', 'title', 'section', and 'publication' columns**

In [2]:
news_fields = ['date', 'title', 'section', 'publication']
news_df_tmp = pd.read_csv('../data/all-the-news-2-1.csv', usecols=news_fields)
news_df_tmp['date'] = news_df_tmp['date'].astype('datetime64[ns]')

In [3]:
news_df_tmp

Unnamed: 0,date,title,section,publication
0,2016-12-09 18:31:00,We should take concerns about the health of li...,,Vox
1,2016-10-07 21:26:46,Colts GM Ryan Grigson says Andrew Luck's contr...,,Business Insider
2,2018-01-26 00:00:00,Trump denies report he ordered Mueller fired,Davos,Reuters
3,2019-06-27 00:00:00,France's Sarkozy reveals his 'Passions' but in...,World News,Reuters
4,2016-01-27 00:00:00,Paris Hilton: Woman In Black For Uncle Monty's...,,TMZ
...,...,...,...,...
2688873,2020-04-01 00:00:00,Florida Ammo Selling Out On Heels of Stay-At-H...,,TMZ
2688874,2020-04-01 00:00:00,Disney Forcing Annual Pass Holders to Continue...,,TMZ
2688875,2020-04-01 00:00:00,Nick Cannon Pimps Out His Impala with Custom N...,,TMZ
2688876,2020-04-01 00:00:00,Pete Buttigieg Says Governors Showing More Lea...,,TMZ


**Filter out all rows without a title, so that every entry has a filled title column**

In [4]:
news_df = news_df_tmp[news_df_tmp['title'].notna()]

**Some publishers do not have a filled section column; these entries are replaced with “unknown”.**

* It is unclear here whether NaN should be renamed to 'unknown' or whether only the 7 existing 'unknown' entries should be taken into account.

In [5]:
#news_df['section'] = news_df.iloc[news_df['section'].replace(np.nan, 'unknown')]
#news_df['section'] = news_df['section'].replace(np.nan, 'unknown')

In [6]:
news_df

Unnamed: 0,date,title,section,publication
0,2016-12-09 18:31:00,We should take concerns about the health of li...,,Vox
1,2016-10-07 21:26:46,Colts GM Ryan Grigson says Andrew Luck's contr...,,Business Insider
2,2018-01-26 00:00:00,Trump denies report he ordered Mueller fired,Davos,Reuters
3,2019-06-27 00:00:00,France's Sarkozy reveals his 'Passions' but in...,World News,Reuters
4,2016-01-27 00:00:00,Paris Hilton: Woman In Black For Uncle Monty's...,,TMZ
...,...,...,...,...
2688873,2020-04-01 00:00:00,Florida Ammo Selling Out On Heels of Stay-At-H...,,TMZ
2688874,2020-04-01 00:00:00,Disney Forcing Annual Pass Holders to Continue...,,TMZ
2688875,2020-04-01 00:00:00,Nick Cannon Pimps Out His Impala with Custom N...,,TMZ
2688876,2020-04-01 00:00:00,Pete Buttigieg Says Governors Showing More Lea...,,TMZ


**this study uses two news article datasets, (1) with all the news articles with a financial or business-related section...**


**...and (2) all the news articles with unknown sections combined with dataset 1**


**BUT!**

When observing Table 3, the scores shown in bold are the highest in their row. With these results **it can be concluded that dataset 1**, containing only the news articles that have a business or financial related section, **performed better than dataset 2**. Moreover, dataset 1 outperformed dataset 2 on four of the seven different feature sets, and on three occasions, both datasets achieved the same score. Therefore, the following sections will only focus on the performance of the sentiment analysis method and models on dataset 1.

In [7]:
valid_section_values_1 = ['Financials', 'Business News']
news_df_1 = news_df[news_df['section'].isin(valid_section_values_1)]
news_df_1 = news_df_1.reset_index(drop=True)

In [8]:
news_df_1

Unnamed: 0,date,title,section,publication
0,2019-06-17,"ECB's Coeure: If we decide to cut rates, we'd ...",Financials,Reuters
1,2019-06-23,Hudson's Bay's chairman's buyout bid pits reta...,Business News,Reuters
2,2018-12-28,Wells Fargo to pay $575 million in settlement ...,Business News,Reuters
3,2019-05-21,Factbox: Investments by automakers in the U.S....,Business News,Reuters
4,2019-02-05,Exclusive: Britain's financial heartland unbow...,Business News,Reuters
...,...,...,...,...
154235,2020-03-26,BlackRock says 'rebalancing into risky assets'...,Business News,Reuters
154236,2020-03-23,"Croatia cenbank rolls over liquidity boost, in...",Financials,Reuters
154237,2020-03-26,"Putin, at G20 summit, proposes lifting sanctio...",Business News,Reuters
154238,2020-03-23,"Most North Sea oil output ""in the money"", but ...",Financials,Reuters


In [9]:
#valid_section_values_2 = ['unknown', 'Financials', 'Business News']
#news_df_2 = news_df[news_df['section'].isin(valid_section_values_2)]
#news_df_2 = news_df_2.reset_index(drop=True)

### 1.B. News Articles Sentiment Scores
* After preprocessing the news article dataset, the sentiment score features are computed. As mentioned in Section 3.2 of the paper, the following sentiment scores are calculated and added to the two news article datasets: VADER, **TextBlob**, and LM. Hence, each row (news article) has several sentiment score columns. Since the news article and price datasets are merged on the date column, the rows are grouped by date, and the sentiment scores are averaged within each group.
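
The group-by-date averaging can be sketched as follows. The scores in `toy_scores` are invented. Normalizing timestamps to midnight is an assumption added here so that articles published at different times of the same day fall into one group.

```python
import pandas as pd

# Invented per-article polarity scores; real values would come from a sentiment library.
toy_scores = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-01-04 09:00:00",
        "2016-01-04 18:31:00",
        "2016-01-05 00:00:00",
    ]),
    "TextBlob_Sentiment": [0.2, -0.1, 0.5],
})

# Strip the time of day so that all articles of one calendar day share a key.
toy_scores["date"] = toy_scores["date"].dt.normalize()

# One row per calendar day, holding the mean polarity of that day's headlines.
daily_sentiment = toy_scores.groupby("date", as_index=False)["TextBlob_Sentiment"].mean()
print(daily_sentiment)
```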

In [10]:
def text_blob_sentiment(some_df):
    scores = []
    sentences = list(some_df['title'])
    
    for sent in sentences:
        testimonial = TextBlob(sent)
        scores.append(testimonial.sentiment.polarity)
        
    some_df['TextBlob_Sentiment'] = scores
    return some_df

In [11]:
news_df_1 = text_blob_sentiment(news_df_1)

In [12]:
news_df_1

Unnamed: 0,date,title,section,publication,TextBlob_Sentiment
0,2019-06-17,"ECB's Coeure: If we decide to cut rates, we'd ...",Financials,Reuters,0.000000
1,2019-06-23,Hudson's Bay's chairman's buyout bid pits reta...,Business News,Reuters,0.200000
2,2018-12-28,Wells Fargo to pay $575 million in settlement ...,Business News,Reuters,0.000000
3,2019-05-21,Factbox: Investments by automakers in the U.S....,Business News,Reuters,0.000000
4,2019-02-05,Exclusive: Britain's financial heartland unbow...,Business News,Reuters,0.000000
...,...,...,...,...,...
154235,2020-03-26,BlackRock says 'rebalancing into risky assets'...,Business News,Reuters,0.000000
154236,2020-03-23,"Croatia cenbank rolls over liquidity boost, in...",Financials,Reuters,0.000000
154237,2020-03-26,"Putin, at G20 summit, proposes lifting sanctio...",Business News,Reuters,0.000000
154238,2020-03-23,"Most North Sea oil output ""in the money"", but ...",Financials,Reuters,0.500000


In [13]:
news_df_1.sort_values(by='date')

Unnamed: 0,date,title,section,publication,TextBlob_Sentiment
46778,2016-01-01,Honda confirms ninth death linked to Takata ai...,Business News,Reuters,0.000000
42247,2016-01-01,Investors look to January effect at start of 2...,Business News,Reuters,0.000000
83639,2016-01-01,New year brings minimum wage hikes for America...,Business News,Reuters,0.136364
47636,2016-01-01,"China December factory activity shrinks, more ...",Business News,Reuters,0.500000
35699,2016-01-01,India's NSE index ends near 2-month high,Financials,Reuters,0.130000
...,...,...,...,...,...
153209,2020-03-28,Ackman says Pershing Square no longer has hedg...,Business News,Reuters,0.000000
152872,2020-03-28,Italy PM adopts new measures to help coronavir...,Business News,Reuters,0.136364
153370,2020-03-28,"S&P cuts DAMAC's rating, puts Emaar Properties...",Financials,Reuters,0.000000
153034,2020-03-28,Canada's top lenders cut prime rates after cen...,Business News,Reuters,0.250000


**drop columns 'title', 'section', 'publication'**

In [14]:
news_df_1.drop(['title', 'section', 'publication'], axis=1, inplace=True)

In [15]:
len(news_df_1)

154240

**a group by date function has been executed. While grouping the rows by the date column, the sentiment scores will be averaged.**

In [16]:
news_df_1 = news_df_1.groupby('date', as_index=False, sort=True)['TextBlob_Sentiment'].mean()

In [17]:
news_df_1 = news_df_1[(news_df_1['date'] >= '2016-01-01') & (news_df_1['date'] <= '2020-01-31')]

In [18]:
news_df_1

Unnamed: 0,date,TextBlob_Sentiment
0,2016-01-01,0.114318
1,2016-01-02,0.041667
2,2016-01-03,-0.066667
3,2016-01-04,-0.004649
4,2016-01-05,-0.004303
...,...,...
1487,2020-01-27,0.034606
1488,2020-01-28,0.009579
1489,2020-01-29,0.017997
1490,2020-01-30,0.008438


### 2. S&P 500 preprocessing (YahooFinancials package.xls) (1785 entries before preprocessing, including the header row)
* Keep the columns close price, volume, and date.
* A movement column is added; this is the binary dependent variable in this study. A 1 is assigned if the stock market value went up or remained the same compared to the day before, and a 0 if it went down.
* Since both fundamental and technical analysis are used, a column with a technical indicator is added. Although many technical indicators can be calculated from the features in the S&P 500 dataset, the most common one has been chosen: the simple moving average (SMA). To calculate the SMA, the closing prices over a given period are summed and divided by the number of periods:
  SMA = (A1 + A2 + ... + An) / n
* Since predicting a stock for the following day is considered a short-term prediction, the period is set to 10 days.
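
As a compact alternative to explicit loops, the movement label and the SMA can be sketched with pandas built-ins. The prices are invented, and the data is assumed to be sorted oldest first. The window is shortened to 3 only to keep the toy small. Unlike the loop-based function in this notebook, which averages the closes preceding the current day, `rolling` includes the current row; adding `.shift(1)` would exclude it.

```python
import pandas as pd

# Invented closing prices, sorted oldest first.
prices = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6, freq="D"),
    "close": [100.0, 101.0, 101.0, 99.0, 102.0, 103.0],
})

# 1 if the close rose or stayed equal versus the previous row, else 0.
# The first row has no predecessor; it is set to 1 here by treating its change as 0.
change = prices["close"].diff()
prices["movement"] = (change.fillna(0) >= 0).astype(int)

# Trailing simple moving average over n rows (the paper uses n = 10).
# Rows without a full window stay NaN instead of using a shorter average.
n = 3
prices["SMA"] = prices["close"].rolling(window=n).mean()
print(prices)
```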

**keep columns: close price, volume, and date**

In [19]:
fields = ['Date', 'Close*', 'Volume']

sandp_df = pd.read_excel('../data/YahooFinancials package.xls', usecols=fields)
sandp_df.rename(columns = {'Date':'date'}, inplace = True)
sandp_df['date'] = sandp_df['date'].astype('datetime64[ns]')

In [20]:
sandp_df

Unnamed: 0,date,Close*,Volume
0,2020-01-31,3225.52,4527830000
1,2020-01-30,3283.66,3787250000
2,2020-01-29,3273.40,3584500000
3,2020-01-28,3276.24,3526720000
4,2020-01-27,3243.63,3823100000
...,...,...,...
1779,2013-01-07,1461.89,3304970000
1780,2013-01-04,1466.47,3424290000
1781,2013-01-03,1459.37,3829730000
1782,2013-01-02,1462.42,4202600000


In [21]:
sandp_df.dtypes

date      datetime64[ns]
Close*           float64
Volume             int64
dtype: object

**Another column has been added with the movement, this is the binary dependent variable in this study. A 1 is assigned if the stock market value went up or remained the same compared to the day before and 0 if the stock market values went down.**

In [22]:
def calculate_movement(some_df):
    
    close_list = list(some_df['Close*'])
    
    movement = []  
    for i in range(len(close_list)-1):
        if close_list[i] < close_list[i+1]:
            movement.append(0)
        else:
            movement.append(1)
    movement.append(1)   # there is nothing to compare for the first day, set to 1 (remained the same)
    some_df['movement'] = movement
    return some_df

In [23]:
sandp_df = calculate_movement(sandp_df)

In [24]:
sandp_df.tail(3)

Unnamed: 0,date,Close*,Volume,movement
1781,2013-01-03,1459.37,3829730000,0
1782,2013-01-02,1462.42,4202600000,1
1783,2012-12-31,1426.19,3204330000,1


In [25]:
sandp_df.head(15)

Unnamed: 0,date,Close*,Volume,movement
0,2020-01-31,3225.52,4527830000,0
1,2020-01-30,3283.66,3787250000,1
2,2020-01-29,3273.4,3584500000,0
3,2020-01-28,3276.24,3526720000,1
4,2020-01-27,3243.63,3823100000,0
5,2020-01-24,3295.47,3707130000,0
6,2020-01-23,3325.54,3764860000,1
7,2020-01-22,3321.75,3619850000,1
8,2020-01-21,3320.79,4105340000,0
9,2020-01-17,3329.62,3698170000,1


**To calculate the SMA, all the closing prices are summed up over a given period and divided by the number of periods SMA = (A1 + A2 + ... + An) / n.**

**Since predicting a stock for the following day is considered a short-term prediction, the period is set to 10 days**

In [26]:
def calculate_SMA(some_df):
    
    close_list = list(some_df['Close*'])
   
    sma = []
    for i in range(len(close_list)-1):
        previous = close_list[i+1:i+11]
        sma.append( (sum(previous))/len(previous) )

    sma.append(close_list[-1])  # there is nothing to compare for the first day, set to current day value  
    some_df['SMA'] = sma
    return some_df

In [27]:
sandp_df = calculate_SMA(sandp_df)

In [28]:
sandp_df.head(3)

Unnamed: 0,date,Close*,Volume,movement,SMA
0,2020-01-31,3225.52,4527830000,0,3298.691
1,2020-01-30,3283.66,3787250000,1,3299.254
2,2020-01-29,3273.4,3584500000,0,3300.229


In [29]:
sandp_df.tail(3)

Unnamed: 0,date,Close*,Volume,movement,SMA
1781,2013-01-03,1459.37,3829730000,0,1444.305
1782,2013-01-02,1462.42,4202600000,1,1426.19
1783,2012-12-31,1426.19,3204330000,1,1426.19


In [30]:
print(len(sandp_df))

1784


In [31]:
sandp_df.sort_values(by='date')

Unnamed: 0,date,Close*,Volume,movement,SMA
1783,2012-12-31,1426.19,3204330000,1,1426.190000
1782,2013-01-02,1462.42,4202600000,1,1426.190000
1781,2013-01-03,1459.37,3829730000,0,1444.305000
1780,2013-01-04,1466.47,3424290000,1,1449.326667
1779,2013-01-07,1461.89,3304970000,0,1453.612500
...,...,...,...,...,...
4,2020-01-27,3243.63,3823100000,0,3303.590000
3,2020-01-28,3276.24,3526720000,1,3301.418000
2,2020-01-29,3273.40,3584500000,0,3300.229000
1,2020-01-30,3283.66,3787250000,1,3299.254000


In [32]:
sandp_df

Unnamed: 0,date,Close*,Volume,movement,SMA
0,2020-01-31,3225.52,4527830000,0,3298.691000
1,2020-01-30,3283.66,3787250000,1,3299.254000
2,2020-01-29,3273.40,3584500000,0,3300.229000
3,2020-01-28,3276.24,3526720000,1,3301.418000
4,2020-01-27,3243.63,3823100000,0,3303.590000
...,...,...,...,...,...
1779,2013-01-07,1461.89,3304970000,0,1453.612500
1780,2013-01-04,1466.47,3424290000,1,1449.326667
1781,2013-01-03,1459.37,3829730000,0,1444.305000
1782,2013-01-02,1462.42,4202600000,1,1426.190000


In [33]:
sandp_df = sandp_df[(sandp_df['date'] >= '2016-01-01') & (sandp_df['date'] <= '2020-01-31')]

In [34]:
sandp_df

Unnamed: 0,date,Close*,Volume,movement,SMA
0,2020-01-31,3225.52,4527830000,0,3298.691
1,2020-01-30,3283.66,3787250000,1,3299.254
2,2020-01-29,3273.40,3584500000,0,3300.229
3,2020-01-28,3276.24,3526720000,1,3301.418
4,2020-01-27,3243.63,3823100000,0,3303.590
...,...,...,...,...,...
1022,2016-01-08,1922.03,4664940000,0,2033.016
1023,2016-01-07,1943.09,5076590000,0,2042.604
1024,2016-01-06,1990.26,4336660000,0,2045.693
1025,2016-01-05,2016.71,3706620000,1,2044.577


### 3. Merge Datasets:
* After the datasets are preprocessed, the price dataset is left-joined to the news article datasets to create the two datasets that serve as input and output for the machine learning and deep learning models.
* The datasets are merged on the date column. The two datasets used for this research are thus fully preprocessed, and both contain 1,096 rows. Because both datasets consist of the same features, only one table was created to give an overview of all features.
* The dependent variable has 589 entries where the stock market went down and 480 entries where it went up or remained the same.
* One small adjustment has been made to the final datasets: standardization (also called Z-score normalization). It rescales each feature to mean 0 and standard deviation 1, the parameters of a standard normal distribution (formula on page 20 of the paper). This may help to minimize dissimilarities between the datasets.
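
The standardization step amounts to z = (x - mean) / std for each column. A minimal sketch with the standard library, assuming the population standard deviation (which is also the default of `scipy.stats.zscore`); the sample values are the first five closing prices of the merged data in this notebook:

```python
from statistics import fmean, pstdev

def zscore_column(values):
    """Standardize a list of numbers: subtract the mean, divide by the population std."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mu) / sigma for v in values]

# First five closing prices of the merged data.
closes = [2012.66, 2016.71, 1990.26, 1943.09, 1922.03]
standardized = zscore_column(closes)
print([round(z, 3) for z in standardized])
```

After the transformation the column has mean 0 and standard deviation 1, whatever its original scale, which keeps large-valued features such as volume from dominating an RBF kernel.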

In [35]:
merge_1 = news_df_1.merge(sandp_df, how='left', on='date')
merge_1 = merge_1[merge_1['Close*'].notna()]
merge_1['movement'] = merge_1['movement'].astype('int')

In [36]:
merge_1

Unnamed: 0,date,TextBlob_Sentiment,Close*,Volume,movement,SMA
3,2016-01-04,-0.004649,2012.66,4.304880e+09,0,2047.500
4,2016-01-05,-0.004303,2016.71,3.706620e+09,1,2044.577
5,2016-01-06,0.037939,1990.26,4.336660e+09,0,2045.693
6,2016-01-07,-0.005610,1943.09,5.076590e+09,0,2042.604
7,2016-01-08,-0.092329,1922.03,4.664940e+09,0,2033.016
...,...,...,...,...,...,...
1487,2020-01-27,0.034606,3243.63,3.823100e+09,0,3303.590
1488,2020-01-28,0.009579,3276.24,3.526720e+09,1,3301.418
1489,2020-01-29,0.017997,3273.40,3.584500e+09,0,3300.229
1490,2020-01-30,0.008438,3283.66,3.787250e+09,1,3299.254


In [37]:
merge_1['movement'].value_counts()

1    569
0    458
Name: movement, dtype: int64
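
A quick sanity check that is not part of the paper: the class counts above give the accuracy of a trivial classifier that always predicts the majority class. Any model score can be read against this reference value.

```python
# Class counts reported by value_counts() above.
counts = {1: 569, 0: 458}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, total, round(baseline_accuracy, 4))
```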

**check datatypes and compare to Table 1 on page 20**

In [38]:
merge_1.dtypes

date                  datetime64[ns]
TextBlob_Sentiment           float64
Close*                       float64
Volume                       float64
movement                       int32
SMA                          float64
dtype: object

## My understanding of the task:
* movement is the target value for the binary classification
* Technical Analysis (TA) represents the volume, close price and the SMA (from paper).
* SVM + TextBlob will use columns: 'date', 'TextBlob_Sentiment', 'movement'
* SVM + TextBlob + TA will use columns: 'date', 'TextBlob_Sentiment', 'Close*', 'Volume', 'SMA', 'movement'

In [39]:
svm_textblob = merge_1[['TextBlob_Sentiment']]
svm_textblob_ta = merge_1[['TextBlob_Sentiment', 'Close*', 'Volume', 'SMA']]
y = list(merge_1['movement'])

**One small adjustment has been made to the final datasets, which is known as standardization. This method may help to minimize dataset dissimilarities. Rescaling the features to give them the characteristics of a regular normal distribution is known as standardization (or Z-score normalization)**

In [40]:
svm_textblob = svm_textblob.apply(zscore)
svm_textblob_ta = svm_textblob_ta.apply(zscore)

In [41]:
svm_textblob.head(3)

Unnamed: 0,TextBlob_Sentiment
3,-1.481421
4,-1.463648
5,0.706539


In [42]:
svm_textblob_ta.head(3)

Unnamed: 0,TextBlob_Sentiment,Close*,Volume,SMA
3,-1.481421,-1.601418,1.020934,-1.494152
4,-1.463648,-1.589688,0.130415,-1.502689
5,0.706539,-1.666292,1.068239,-1.499429


### 4/5. Training, Validation, and Test Set
* The dataset for this study is divided into two parts: 80% for training and 20% for testing the models.
* Hence, the training data contains 855 observations, and the test data consists of 214 observations.
* Real-world datasets should use a stratified 10-fold cross-validation (Kohavi, 1995). Stratification ensures that the training and test sets include the same proportions of both target groups, which in this study are the upward and downward movements of the S&P 500 index price. Cross-validation supports the model's validity and reliability; see Section 4.3 of the paper for a more comprehensive explanation. The validation set was used as part of a grid search, covered in more detail in Subsection 4.2.3.
* Each of these algorithms was used to create models, and a grid search was applied to find the best model settings, as described in the following subsection. Subsequently, the models were evaluated, and the best-performing model for predicting daily S&P 500 movements was selected.
* Automated hyperparameter optimization (HPO) is used in this study. Grid search, also known as full factorial design, is a commonly used HPO tool that automatically finds the best hyperparameters for an algorithm from a predefined grid.
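
A minimal sketch of a grid search with stratified 10-fold cross-validation, under clearly stated assumptions: the data is synthetic, and the grid is hypothetical (it merely contains the settings reported in the paper). Placing the scaler inside a pipeline is a choice made for this sketch, not something the paper prescribes; it lets every fold be standardized with its own training statistics.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 200 rows, 4 features, binary target loosely tied to feature 0.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# The scaler is refit on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid for illustration only.
param_grid = {
    "svc__C": [1, 10, 100],
    "svc__gamma": [0.001, 0.01],
    "svc__kernel": ["rbf"],
}

# Stratified folds keep the up/down class proportions similar in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```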

In [43]:
#svm_textblob_train = svm_textblob[:int(len(svm_textblob)*0.8)]
#svm_textblob_test = svm_textblob[int(len(svm_textblob)*0.8):]

#svm_textblob_ta_train = svm_textblob_ta[:int(len(svm_textblob_ta)*0.8)]
#svm_textblob_ta_test = svm_textblob_ta[int(len(svm_textblob_ta)*0.8):]

#y_train = y[:int(len(svm_textblob)*0.8)]
#y_test  = y[int(len(svm_textblob_ta)*0.8):]

In [44]:
#print(len(svm_textblob_train))
#print(len(svm_textblob_test))
#print(len(svm_textblob_ta_train))
#print(len(svm_textblob_ta_test))
#print(len(y_train))
#print(len(y_test))
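
The 80/20 split in the commented-out cells can be sketched as a small helper. The name `chronological_split` is hypothetical. With the 1027 merged rows of this notebook the split sizes differ from the paper's 855 and 214, which add up to 1,069 rows.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split an ordered sequence into a leading training part and a trailing test part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# 1027 is the row count of the merged data in this notebook.
indices = list(range(1027))
train_idx, test_idx = chronological_split(indices)
print(len(train_idx), len(test_idx))
```

Because the rows are ordered by date, slicing without shuffling keeps every test observation later in time than every training observation.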

**The best performing model that made use of the SVM algorithm was optimized by a grid search to discover the best model settings. These best settings/hyperparameters that are revealed through a grid search are presented in Table 5 on page 25**

* C = 100
* Gamma = 0.001
* Kernel = rbf

In [45]:
# check dimensions
print(len(svm_textblob))
print(len(svm_textblob_ta))
print(len(y))

1027
1027
1027


In [48]:
def classification_report_with_accuracy_score(y_true, y_pred):

    print(classification_report(y_true, y_pred)) # print classification report
    print()
    print('Accuracy:')
    print(accuracy_score(y_true, y_pred))
    print()
    print('Confusion Matrix:')
    print(confusion_matrix(y_true, y_pred))
    print()
    return accuracy_score(y_true, y_pred) # return accuracy score
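
An alternative to printing one report per fold from inside a scorer is to collect the out-of-fold predictions with `cross_val_predict` and build a single pooled report. This is a sketch on synthetic data, not the evaluation procedure confirmed by the paper. The `zero_division=0` argument reports undefined precision as 0 without emitting a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook would pass its standardized features and y instead.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

clf = SVC(kernel="rbf", C=100, gamma=0.001)
cv = StratifiedKFold(n_splits=10)

# Each sample is predicted exactly once, by a model that did not see it during training.
y_pred = cross_val_predict(clf, X_toy, y_toy, cv=cv)

print(classification_report(y_toy, y_pred, zero_division=0))
print(confusion_matrix(y_toy, y_pred))
pooled_accuracy = accuracy_score(y_toy, y_pred)
print(pooled_accuracy)
```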

**SVM + TextBlob**

In [49]:
X = svm_textblob

# With cross-validation there is no need to split the data into train/test sets or to call fit():
# cross_val_score does both internally. For a classifier, cv=10 uses stratified 10-fold splits.

clf = svm.SVC(kernel='rbf', C=100, gamma=0.001, random_state=42)
# The custom scorer prints a report for each fold and returns that fold's accuracy.
nested_score = cross_val_score(clf, X, y, cv=10, scoring=make_scorer(classification_report_with_accuracy_score))



              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55       103
   macro avg       0.28      0.50      0.36       103
weighted avg       0.31      0.55      0.39       103


Accuracy:
0.5533980582524272

Confusion Matrix:
[[ 0 46]
 [ 0 57]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55       103
   macro avg       0.28      0.50      0.36       103
weighted avg       0.31      0.55      0.39       103


Accuracy:
0.5533980582524272

Confusion Matrix:
[[ 0 46]
 [ 0 57]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55 



              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55       103
   macro avg       0.28      0.50      0.36       103
weighted avg       0.31      0.55      0.39       103


Accuracy:
0.5533980582524272

Confusion Matrix:
[[ 0 46]
 [ 0 57]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55       103
   macro avg       0.28      0.50      0.36       103
weighted avg       0.31      0.55      0.39       103


Accuracy:
0.5533980582524272

Confusion Matrix:
[[ 0 46]
 [ 0 57]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55 



**SVM + TextBlob + TA**

In [50]:
X = svm_textblob_ta

# With cross-validation there is no need to split the data into train/test sets or to call fit():
# cross_val_score does both internally. For a classifier, cv=10 uses stratified 10-fold splits.

clf = svm.SVC(kernel='rbf', C=100, gamma=0.001, random_state=42)
# The custom scorer prints a report for each fold and returns that fold's accuracy.
nested_score = cross_val_score(clf, X, y, cv=10, scoring=make_scorer(classification_report_with_accuracy_score))

              precision    recall  f1-score   support

           0       0.75      0.20      0.31        46
           1       0.59      0.95      0.73        57

    accuracy                           0.61       103
   macro avg       0.67      0.57      0.52       103
weighted avg       0.66      0.61      0.54       103


Accuracy:
0.6116504854368932

Confusion Matrix:
[[ 9 37]
 [ 3 54]]

              precision    recall  f1-score   support

           0       1.00      0.04      0.08        46
           1       0.56      1.00      0.72        57

    accuracy                           0.57       103
   macro avg       0.78      0.52      0.40       103
weighted avg       0.76      0.57      0.44       103


Accuracy:
0.5728155339805825

Confusion Matrix:
[[ 2 44]
 [ 0 57]]





              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55       103
   macro avg       0.28      0.50      0.36       103
weighted avg       0.31      0.55      0.39       103


Accuracy:
0.5533980582524272

Confusion Matrix:
[[ 0 46]
 [ 0 57]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55       103
   macro avg       0.28      0.50      0.36       103
weighted avg       0.31      0.55      0.39       103


Accuracy:
0.5533980582524272

Confusion Matrix:
[[ 0 46]
 [ 0 57]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.55      1.00      0.71        57

    accuracy                           0.55 