# Ablation Study  

On the following preprocessing techniques:
        - 
Special Characters Removal    - 
Stopword Remova    - l
Lemmatizati    on


Model used will be VADER, as it gives the best results currently of about .




## Special Chars Removal

In [29]:
import re
def remove_special_char(df):
    # Remove URLS
    df.data = df.data.apply(lambda x:re.sub(r"http\S+", "", x))
    
    # Remove all the special characters
    df.data   = df.data.apply(lambda x:' '.join(re.findall(r'\w+', x)))
    
    print('Finished Special Char Removal')
    return df

## Stopword Removal

In [30]:
# load stopwords.csv to list
import csv
stopwords = []

with open('../data/stopwords.csv', newline='') as f:
    for row in f:
        stopwords.append(row.split(',')[0])
stopwords = stopwords[1:]

In [31]:
stopwords[:5]

['the', 'to', 'and', 'of', '']

In [36]:
def stopword_removal(df):
    for word in stopwords:
        # removing stopwords from dataset
        # print('Replacing ' + word)
        df['data'] = df['data'].apply(lambda x: x.replace(' '+word+' ', ' '))
        
    print('Finished Stopword Removal')
    return df

## Lemmatization

In [33]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [106]:
def lemmatization(df):
    lemma_df = df['data'].apply(lambda x: x.replace(x,lemmatizer.lemmatize(x)))

    lemma_df = pd.concat([lemma_df,df['sentiment']], axis = 1)
    
    print('Finished Lemmatisation')
    return lemma_df

## Load Data

In [20]:
eval_set = pd.read_csv('../data/eval_set_labelled.csv')
eval = eval_set[['data','sentiment']].copy()

In [21]:
eval[:10]

Unnamed: 0,data,sentiment
0,"BlackRock, ProShares Bitcoin ETFs Surpass GBTC...",neutral
1,Stock Up 400%! Pampa Metals $PMMCF Options New...,positive
2,Ethereum (ETH) Approaches $4000 In Multi-Week ...,neutral
3,"After a 141% Surge, Golden Inu's Appears To Be...",neutral
4,(10/18) Tuesday's Pre-Market Stock Movers & Ne...,positive
5,Miner hosting. We accept any miner above 28th ...,neutral
6,Plug Power (NASDAQ: PLUG) Shares Soar on $1.6 ...,positive
7,Sony cuts PlayStation 5 sales forecast to 21 m...,neutral
8,01/19/24 [Join XRPLounge Discord] - discord.co...,positive
9,Ethereum Film Featuring Vitalik Buterin Raises...,neutral


## VADER Setup

In [22]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [69]:
import numpy as np
def vader_eval(df, df_name):
    # adds 2 columns to dataframe with category and probability
    # calculates accuracy
    
    vader_eval_prob = []
    vader_eval_cat = []
    
    for sentence in df.data:
        vs = analyzer.polarity_scores(sentence)
        del vs['compound'] # delete composite score before doing argmax()
        pred_cat = max(vs, key=vs.get) # argmax over keys
        prob = vs[pred_cat]
        
        vader_eval_cat.append(pred_cat)
        vader_eval_prob.append(prob)
    
    df["vader_category"] = vader_eval_cat
    df["vader_prob"] = vader_eval_prob

    # calculate accuracy
    df['result'] = np.where(df["vader_category"] == df['sentiment'].str[:3], 1, 0)

    print('----------' + 'RESULTS FOR ' + df_name + '--------------------')
    print(df[:5])
    print((df.result.values == 1).mean())
    
    eval.to_csv('ablation_results/' + df_name + '.csv', encoding='utf-8')
    
    return df

## Inference

In [107]:
# naming convention of ablation: XXX, chars are Y/N representing:
# Special Characters Removal 
# Stopword Removal
# Lemmatization

YNN = remove_special_char(eval)
NYN = stopword_removal(eval)
NNY = lemmatization(eval)

YYN = stopword_removal(YNN)
NYY = lemmatization(NYN)
YNY = lemmatization(YNN)

YYY = lemmatization(YYN)

Finished Special Char Removal
Finished Stopword Removal
Finished Lemmatisation
Finished Stopword Removal
Finished Lemmatisation
Finished Lemmatisation
Finished Lemmatisation


In [57]:
dataframes = ['YNN','NYN','NNY','YYN','NYY','YNY','YYY']

In [51]:
YNN['sentiment'].str[:3]

0       neu
1       pos
2       neu
3       neu
4       pos
       ... 
998     neu
999     neu
1000    pos
1001    neu
1002    pos
Name: sentiment, Length: 1003, dtype: object

In [54]:
YNN["vader_category"]

0       neu
1       neu
2       neu
3       neu
4       neu
       ... 
998     neu
999     neu
1000    neu
1001    neu
1002    neu
Name: vader_category, Length: 1003, dtype: object

In [70]:
NYN_results = vader_eval(NYN, 'NYN')

----------RESULTS FOR NYN--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [71]:
def is_unique(s):
    a = s.to_numpy() # s.values (pandas<0.24)
    return (a[0] == a).all()

is_unique(NYN_results['vader_category'])

False

In [72]:
YNN_results = vader_eval(YNN, 'YNN')

----------RESULTS FOR YNN--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [73]:
NYN_results = vader_eval(NYN, 'NYN')

----------RESULTS FOR NYN--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [109]:
NNY_results = vader_eval(NNY, 'NNY')

----------RESULTS FOR NNY--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [110]:
YYN_results = vader_eval(YYN, 'YYN')

----------RESULTS FOR YYN--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [111]:
NYY_results = vader_eval(NYY, 'NYY')

----------RESULTS FOR NYY--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [112]:
YNY_results = vader_eval(YNY, 'YNY')

----------RESULTS FOR YNY--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474


In [113]:
YYY_results = vader_eval(YYY, 'YYY')

----------RESULTS FOR YYY--------------------
                                                data sentiment vader_category  \
0  BlackRock ProShares Bitcoin ETFs Surpass GBTC ...   neutral            neu   
1  Stock Up 400 Pampa Metals PMMCF Options New Go...  positive            neu   
2  Ethereum ETH Approaches 4000 In Multi Week Hig...   neutral            neu   
3  After 141 Surge Golden Inu s Appears To Be Hea...   neutral            neu   
4  10 18 Tuesday s Pre Market Stock Movers News G...  positive            neu   

   vader_prob  result  
0       1.000       1  
1       1.000       0  
2       0.784       1  
3       1.000       1  
4       0.867       0  
0.5623130608175474
