# Preparing the Data for Our Own Model

This notebook further processes the datasets that were created through API requests.

The main step is assigning sentiment labels with the VADER sentiment analyzer. This is done for all datasets, not only for the Microsoft dataset used in the analysis. Readers of the article are welcome to use the other datasets for further analyses and comparisons.


In [1]:
import pandas as pd
import re
import json
import numpy as np


df_microsoft = pd.read_csv("Reddit Data/microsoft_comments.csv")
df_amazon = pd.read_csv("Reddit Data/amazon_comments.csv")
df_apple = pd.read_csv("Reddit Data/apple_comments.csv")
df_bitcoin = pd.read_csv("Reddit Data/bitcoin_comments.csv")

In [2]:
df_dict = {"microsoft":df_microsoft,
          "amazon": df_amazon,
          "apple": df_apple,
          "bitcoin": df_bitcoin}

for key, df in df_dict.items():
    print(key)
    display(df.head(3))
    print(df.info(), "\n")
    print(df.isna().sum(), "\n")

microsoft


Unnamed: 0,body,created_utc
0,I'm waiting for the hacks in the marketing dep...,1420070000.0
1,&gt;And the great thing about the web is that ...,1420073000.0
2,Many years ago I gave up giving Microsoft hint...,1420077000.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69352 entries, 0 to 69351
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   body         69329 non-null  object 
 1   created_utc  69329 non-null  float64
dtypes: float64(1), object(1)
memory usage: 1.1+ MB
None 

body           23
created_utc    23
dtype: int64 

amazon


Unnamed: 0,body,created_utc
0,be ready for them to fight you on activating y...,1420077000.0
1,"Guys, do yourself a favor and buy your own mod...",1420090000.0
2,"If you have a brain, you can make a server han...",1420095000.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97013 entries, 0 to 97012
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   body         97012 non-null  object 
 1   created_utc  97012 non-null  float64
dtypes: float64(1), object(1)
memory usage: 1.5+ MB
None 

body           1
created_utc    1
dtype: int64 

apple


Unnamed: 0,body,created_utc
0,If you study leadership you'll know that a vis...,1420068000.0
1,Apple had it coming. They are a bad company.,1420093000.0
2,Radar nowadays can tell the difference between...,1420102000.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203453 entries, 0 to 203452
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   body         203448 non-null  object 
 1   created_utc  203448 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.1+ MB
None 

body           5
created_utc    5
dtype: int64 

bitcoin


Unnamed: 0,body,created_utc
0,I wonder how much bitcoin the $2375 worth of e...,1420079129
1,"If you want super secure payments, look into [...",1420087732
2,Bitcoin. Problem solved.,1420101402


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21402 entries, 0 to 21401
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   body         21402 non-null  object
 1   created_utc  21402 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 334.5+ KB
None 

body           0
created_utc    0
dtype: int64 



In [3]:
for key, df in df_dict.items():
    #drop any NAs
    df.dropna(inplace=True)
    #change created_utc to datetime
    df["created_utc"] = pd.to_datetime(df["created_utc"],
                                       unit='s',
                                       errors='coerce')
    #check changes
    display(df.head(2))
    print(df.isna().sum())

Unnamed: 0,body,created_utc
0,I'm waiting for the hacks in the marketing dep...,2014-12-31 23:56:33
1,&gt;And the great thing about the web is that ...,2015-01-01 00:35:56


body           0
created_utc    0
dtype: int64


Unnamed: 0,body,created_utc
0,be ready for them to fight you on activating y...,2015-01-01 01:47:47
1,"Guys, do yourself a favor and buy your own mod...",2015-01-01 05:23:39


body           0
created_utc    0
dtype: int64


Unnamed: 0,body,created_utc
0,If you study leadership you'll know that a vis...,2014-12-31 23:27:18
1,Apple had it coming. They are a bad company.,2015-01-01 06:18:45


body           0
created_utc    0
dtype: int64


Unnamed: 0,body,created_utc
0,I wonder how much bitcoin the $2375 worth of e...,2015-01-01 02:25:29
1,"If you want super secure payments, look into [...",2015-01-01 04:48:52


body           0
created_utc    0
dtype: int64
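
The conversion above treats `created_utc` as seconds since the Unix epoch. As a minimal sketch of the same conversion for a single value, the following uses only the standard library; the helper name `epoch_seconds_to_utc` is made up for illustration and is not part of the notebook. Unlike the pandas call above, which yields timezone-naive timestamps, this sketch returns a timezone-aware UTC value.

```python
from datetime import datetime, timezone


def epoch_seconds_to_utc(seconds: float) -> datetime:
    """Interpret `seconds` as Unix epoch seconds and return an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# First raw bitcoin timestamp shown in the output of cell In [2]
ts = epoch_seconds_to_utc(1420079129)
print(ts.isoformat())  # 2015-01-01T02:25:29+00:00
```

The result agrees with the first bitcoin row displayed after the conversion (2015-01-01 02:25:29).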


## Sentiment Analyse mit VADER

"VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains." (https://github.com/cjhutto/vaderSentiment#python-demo-and-code-examples)
VADER is a rule-based model developed by researchers at Georgia Tech for sentiment analysis of social media text. More information can be found in their paper (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

Here, the VADER sentiment analyzer is used to label the individual comments in each DataFrame.


### About the Scoring

The compound score is computed by summing the valence scores of the individual words found in the lexicon, adjusting the sum according to the rules, and then normalizing it to lie between -1 (most extreme negative) and +1 (most extreme positive). It is the most useful metric when a single one-dimensional measure of sentiment is wanted for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
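
To illustrate the normalization step, here is a minimal sketch. It assumes the mapping x / sqrt(x² + alpha) with alpha = 15, which is, to our knowledge, what the vaderSentiment reference implementation uses; the formula and the constant are not stated in this notebook, and the function name `normalize_compound` is chosen for illustration only.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded sum of adjusted valence scores into [-1, 1].

    Assumption: x / sqrt(x^2 + alpha) with alpha = 15, as in the
    vaderSentiment reference implementation.
    """
    norm = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, norm))


# Larger absolute sums move towards -1 or +1 without ever leaving the interval
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{raw:+.1f} -> {normalize_compound(raw):+.4f}")
```

Under this assumption, a raw sum of 1.0 maps to 0.25, and ever larger sums approach but do not exceed 1.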



For the planned classification, discrete categories are required instead of continuous scores. We therefore use the standard thresholds for classifying sentences as positive, neutral, or negative. Typical threshold values (used in the literature mentioned above) are:

positive sentiment: compound score >= 0.05
neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
negative sentiment: compound score <= -0.05

The compound score and the cut-offs shown are what most researchers, including the authors of VADER, use for sentiment analysis.
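
The thresholds can be sketched for a single score as follows. The function name `label_from_compound` is hypothetical; the integer codes 1, 0 and -1 follow the encoding this notebook uses for its `sentiment` column.

```python
def label_from_compound(compound: float) -> int:
    """Return 1 (positive), 0 (neutral) or -1 (negative) for a compound score."""
    if compound >= 0.05:
        return 1
    if compound <= -0.05:
        return -1
    return 0


# Compound scores taken from the notebook outputs, plus the boundary value -0.05
examples = [0.9246, 0.0534, 0.0, -0.1531, -0.05]
print([label_from_compound(c) for c in examples])  # [1, 1, 0, -1, -1]
```

Both boundaries are inclusive, so a score of exactly 0.05 counts as positive and exactly -0.05 as negative.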


In [4]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #VADER sentiment model

In [5]:
sent_analyzer = SentimentIntensityAnalyzer() #instantiate

In [6]:
# Apply the analyzer to every comment and keep only the compound score (float column)
df_bitcoin["scores"] = df_bitcoin["body"].apply(lambda text: sent_analyzer.polarity_scores(text)["compound"])
df_bitcoin.head()

Unnamed: 0,body,created_utc,scores
0,I wonder how much bitcoin the $2375 worth of e...,2015-01-01 02:25:29,0.296
1,"If you want super secure payments, look into [...",2015-01-01 04:48:52,0.9246
2,Bitcoin. Problem solved.,2015-01-01 08:36:42,-0.1531
3,and the Sacramento Kings! http://www.nba.com/k...,2015-01-01 14:10:44,0.0
4,Noice this'll be good for Bitcoin,2015-01-01 21:04:28,0.4404


In [7]:
def analyse_sentiment(df, sentiment_analyzer=sent_analyzer):
    """
    Small utility function that applies sentiment_analyzer to every row of df
    and stores the compound sentiment score in a "scores" column (in place).
    """
    # Use the analyzer passed as argument rather than the global sent_analyzer
    df["scores"] = df["body"].apply(
        lambda text: sentiment_analyzer.polarity_scores(text)["compound"]
    )

#analyse sentiment for every comment in every df
for key,df in df_dict.items():
    print(key)
    analyse_sentiment(df)
    print("Done")
    display(df.head(3))

microsoft
Done


Unnamed: 0,body,created_utc,scores
0,I'm waiting for the hacks in the marketing dep...,2014-12-31 23:56:33,0.0
1,&gt;And the great thing about the web is that ...,2015-01-01 00:35:56,0.6811
2,Many years ago I gave up giving Microsoft hint...,2015-01-01 01:55:38,0.8658


amazon
Done


Unnamed: 0,body,created_utc,scores
0,be ready for them to fight you on activating y...,2015-01-01 01:47:47,0.8201
1,"Guys, do yourself a favor and buy your own mod...",2015-01-01 05:23:39,0.5393
2,"If you have a brain, you can make a server han...",2015-01-01 06:43:57,0.9162


apple
Done


Unnamed: 0,body,created_utc,scores
0,If you study leadership you'll know that a vis...,2014-12-31 23:27:18,0.7189
1,Apple had it coming. They are a bad company.,2015-01-01 06:18:45,-0.5423
2,Radar nowadays can tell the difference between...,2015-01-01 08:43:26,0.0534


bitcoin
Done


Unnamed: 0,body,created_utc,scores
0,I wonder how much bitcoin the $2375 worth of e...,2015-01-01 02:25:29,0.296
1,"If you want super secure payments, look into [...",2015-01-01 04:48:52,0.9246
2,Bitcoin. Problem solved.,2015-01-01 08:36:42,-0.1531


## Creating Categories Instead of Scores

Here, sentiment categories are created using the cut-offs from the official VADER documentation (see above).

In [8]:
def categorize_sentiment_score(df):
    # create a list of conditions for sentiment
    conditions = [
        (df.loc[:,'scores'] >= 0.05),
        (df.loc[:,'scores'] > -0.05) & (df.loc[:,'scores'] < 0.05),
        (df.loc[:,'scores'] <= -0.05)
        ]

    # create a list of the values we want to assign for each condition
    values = [1, 0, -1]

    # create a new column and use np.select to assign values to it using our lists as arguments
    df['sentiment'] = np.select(conditions, values)

for key,df in df_dict.items():
    print(key)
    categorize_sentiment_score(df)
    print("Done")
    display(df.head(3))

microsoft
Done


Unnamed: 0,body,created_utc,scores,sentiment
0,I'm waiting for the hacks in the marketing dep...,2014-12-31 23:56:33,0.0,0
1,&gt;And the great thing about the web is that ...,2015-01-01 00:35:56,0.6811,1
2,Many years ago I gave up giving Microsoft hint...,2015-01-01 01:55:38,0.8658,1


amazon
Done


Unnamed: 0,body,created_utc,scores,sentiment
0,be ready for them to fight you on activating y...,2015-01-01 01:47:47,0.8201,1
1,"Guys, do yourself a favor and buy your own mod...",2015-01-01 05:23:39,0.5393,1
2,"If you have a brain, you can make a server han...",2015-01-01 06:43:57,0.9162,1


apple
Done


Unnamed: 0,body,created_utc,scores,sentiment
0,If you study leadership you'll know that a vis...,2014-12-31 23:27:18,0.7189,1
1,Apple had it coming. They are a bad company.,2015-01-01 06:18:45,-0.5423,-1
2,Radar nowadays can tell the difference between...,2015-01-01 08:43:26,0.0534,1


bitcoin
Done


Unnamed: 0,body,created_utc,scores,sentiment
0,I wonder how much bitcoin the $2375 worth of e...,2015-01-01 02:25:29,0.296,1
1,"If you want super secure payments, look into [...",2015-01-01 04:48:52,0.9246,1
2,Bitcoin. Problem solved.,2015-01-01 08:36:42,-0.1531,-1


## Saving the Training DataFrames as .csv Files

In [10]:
for key,df in df_dict.items():
    df.drop("scores",axis=1,inplace=True)
    print(key)
    display(df)
    df.to_csv(f"Reddit Data/Train/{key}_comments_train.csv", index=False)

microsoft


Unnamed: 0,body,created_utc,sentiment
0,I'm waiting for the hacks in the marketing dep...,2014-12-31 23:56:33,0
1,&gt;And the great thing about the web is that ...,2015-01-01 00:35:56,1
2,Many years ago I gave up giving Microsoft hint...,2015-01-01 01:55:38,1
3,Microsoft releases patches on Tuesday.,2015-01-01 02:14:01,0
4,"&gt;Consider Microsoft Delve, a brilliant new ...",2015-01-01 02:17:48,1
...,...,...,...
69347,"lol microsoft fucking sucks. win 95, win fista...",2020-12-31 21:27:00,1
69348,Microsoft has 20 different billion dollar busi...,2020-12-31 22:05:18,0
69349,I hope they disclose what source code specific...,2020-12-31 22:15:46,1
69350,Can anyone stop to observe the fact they hacke...,2020-12-31 22:37:47,-1


amazon


Unnamed: 0,body,created_utc,sentiment
0,be ready for them to fight you on activating y...,2015-01-01 01:47:47,1
1,"Guys, do yourself a favor and buy your own mod...",2015-01-01 05:23:39,1
2,"If you have a brain, you can make a server han...",2015-01-01 06:43:57,1
3,They would get bought out and you will continu...,2015-01-01 12:36:38,1
4,"If you're worried about ebooks from Amazon, yo...",2015-01-01 13:29:44,1
...,...,...,...
97008,"&gt;So I went to Amazon.com, searched for ""chi...",2020-12-31 04:19:05,-1
97009,Actually the US government has a hard on for B...,2020-12-31 12:04:21,1
97010,I have found for the past year or so that many...,2020-12-31 12:25:32,1
97011,"Many solutions, most requiring a complete syst...",2020-12-31 14:57:47,1


apple


Unnamed: 0,body,created_utc,sentiment
0,If you study leadership you'll know that a vis...,2014-12-31 23:27:18,1
1,Apple had it coming. They are a bad company.,2015-01-01 06:18:45,-1
2,Radar nowadays can tell the difference between...,2015-01-01 08:43:26,1
3,Or they could simply stop selling 16GB models ...,2015-01-01 08:49:54,1
4,This is the problem in the thread the technolo...,2015-01-01 13:18:48,-1
...,...,...,...
203448,"...and during all that time, did a single one ...",2020-12-31 19:41:42,-1
203449,Remember when everyone freaked out when Apple ...,2020-12-31 20:46:35,-1
203450,Expect Apple to be a low performer for this in...,2020-12-31 20:47:15,-1
203451,Apple killed Flash by banning it on all their ...,2020-12-31 22:06:16,-1


bitcoin


Unnamed: 0,body,created_utc,sentiment
0,I wonder how much bitcoin the $2375 worth of e...,2015-01-01 02:25:29,1
1,"If you want super secure payments, look into [...",2015-01-01 04:48:52,1
2,Bitcoin. Problem solved.,2015-01-01 08:36:42,-1
3,and the Sacramento Kings! http://www.nba.com/k...,2015-01-01 14:10:44,0
4,Noice this'll be good for Bitcoin,2015-01-01 21:04:28,1
...,...,...,...
21397,"Don't think so, go to bitcoin 3 years or 7 yea...",2020-12-31 10:14:53,0
21398,"I mean, while I don't agree with this law chan...",2020-12-31 13:02:35,-1
21399,"Vegas Auto Gallery, which sells brands such a...",2020-12-31 15:18:20,1
21400,Don't even need to go so far as a bitcoin wall...,2020-12-31 16:33:12,1


## Labeling the Test Data Following the Process Shown for the Training Data

In [11]:
df_test_microsoft = pd.read_csv("Reddit Data/Test/microsoft_comments_test.csv")
df_test_amazon = pd.read_csv("Reddit Data/Test/amazon_comments_test.csv")
df_test_apple = pd.read_csv("Reddit Data/Test/apple_comments_test.csv")
df_test_bitcoin = pd.read_csv("Reddit Data/Test/bitcoin_comments_test.csv")

df_test_dict = {"microsoft":df_test_microsoft,
          "amazon": df_test_amazon,
          "apple": df_test_apple,
          "bitcoin": df_test_bitcoin}


In [12]:
for key,df in df_test_dict.items():
    print(key)
    df.dropna(inplace=True)
    analyse_sentiment(df)
    print("Done")
    display(df.head(3))

microsoft
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,https://en.m.wikipedia.org/wiki/Sun_Microsyste...,1609456623,0,0.0
1,They’re not gimping any files. You can easily ...,1609456695,1,0.8445
2,I think they're going as far as killing the st...,1609456902,-1,-0.2023


amazon
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,* Apple\n* Microsoft\n* Amazon\n* Alphabet\n* ...,1609459000.0,1,0.8885
1,There were two books released on Kindle that h...,1609461000.0,1,0.25
2,"Amazon: “oh, oops” (laughs in billions)",1609465000.0,1,0.5994


apple
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,Fuck apple forever for killing flash,1609457000.0,-1,-0.836
1,I think they're going as far as killing the st...,1609457000.0,-1,-0.2023
2,Also because flash was gonna be a competitor i...,1609458000.0,1,0.8241


bitcoin
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,I know that Bitcoin isn't great for the enviro...,1609512000.0,-1,-0.7349
1,So bitcoin is gonna complete mining in about a...,1609593000.0,0,0.0
2,It is going there with Bitcoin.,1609635000.0,0,0.0


In [13]:
for key,df in df_test_dict.items():
    print(key)
    categorize_sentiment_score(df)
    print("Done")
    display(df.head(3))

microsoft
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,https://en.m.wikipedia.org/wiki/Sun_Microsyste...,1609456623,0,0.0
1,They’re not gimping any files. You can easily ...,1609456695,1,0.8445
2,I think they're going as far as killing the st...,1609456902,-1,-0.2023


amazon
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,* Apple\n* Microsoft\n* Amazon\n* Alphabet\n* ...,1609459000.0,1,0.8885
1,There were two books released on Kindle that h...,1609461000.0,1,0.25
2,"Amazon: “oh, oops” (laughs in billions)",1609465000.0,1,0.5994


apple
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,Fuck apple forever for killing flash,1609457000.0,-1,-0.836
1,I think they're going as far as killing the st...,1609457000.0,-1,-0.2023
2,Also because flash was gonna be a competitor i...,1609458000.0,1,0.8241


bitcoin
Done


Unnamed: 0,body,created_utc,sentiment,scores
0,I know that Bitcoin isn't great for the enviro...,1609512000.0,-1,-0.7349
1,So bitcoin is gonna complete mining in about a...,1609593000.0,0,0.0
2,It is going there with Bitcoin.,1609635000.0,0,0.0


In [14]:
for key,df in df_test_dict.items():
    df.drop("scores",axis=1,inplace=True)
    print(key)
    display(df)
    df.to_csv(f"Reddit Data/Test/{key}_comments_test.csv", index=False)

microsoft


Unnamed: 0,body,created_utc,sentiment
0,https://en.m.wikipedia.org/wiki/Sun_Microsyste...,1609456623,0
1,They’re not gimping any files. You can easily ...,1609456695,1
2,I think they're going as far as killing the st...,1609456902,-1
3,That’s why I don’t want Google or Microsoft to...,1609457786,1
4,* Apple\n* Microsoft\n* Amazon\n* Alphabet\n* ...,1609458930,1
...,...,...,...
2276,"You have everything so, so incorrect. Spotify ...",1619810176,-1
2277,For them I would just keep saving. I debate wh...,1619811045,1
2278,Those that have enough experience to be worth ...,1619812587,1
2279,Get fired? More like call up Microsoft and say...,1619813597,1


amazon


Unnamed: 0,body,created_utc,sentiment
0,* Apple\n* Microsoft\n* Amazon\n* Alphabet\n* ...,1.609459e+09,1
1,There were two books released on Kindle that h...,1.609461e+09,1
2,"Amazon: “oh, oops” (laughs in billions)",1.609465e+09,1
3,As a consumer it makes me really upset knowing...,1.609493e+09,1
4,You paid more than Amazon did in taxes. And h...,1.609509e+09,1
...,...,...,...
8780,It's almost like being a slave to Big tech no ...,1.619814e+09,-1
8781,We've been remote for 14 months. Metro Atlanta...,1.619815e+09,1
8782,&gt; And even if I was able to pay people to r...,1.619816e+09,1
8783,Amazon is doing the same thing.,1.619819e+09,1


apple


Unnamed: 0,body,created_utc,sentiment
0,Fuck apple forever for killing flash,1.609457e+09,-1
1,I think they're going as far as killing the st...,1.609457e+09,-1
2,Also because flash was gonna be a competitor i...,1.609458e+09,1
3,* Apple\n* Microsoft\n* Amazon\n* Alphabet\n* ...,1.609459e+09,1
4,It was one of the things that turned me from a...,1.609463e+09,1
...,...,...,...
8195,They can't let the prices of these massive bui...,1.619806e+09,-1
8196,Yknow that internet was primarily paid for by ...,1.619806e+09,0
8197,&gt; Those employees will get poached soon.\n\...,1.619806e+09,-1
8198,Google might replace some of them but they als...,1.619806e+09,1


bitcoin


Unnamed: 0,body,created_utc,sentiment
0,I know that Bitcoin isn't great for the enviro...,1.609512e+09,-1
1,So bitcoin is gonna complete mining in about a...,1.609593e+09,0
2,It is going there with Bitcoin.,1.609635e+09,0
3,&gt; Did corporations determine who won the U....,1.609637e+09,-1
4,Owning Tesla stock is like owning bitcoin. It'...,1.609638e+09,1
...,...,...,...
4333,No body pays for their coffee with gold. So wh...,1.619793e+09,1
4334,Are you serious?? I built my first PC in late ...,1.619802e+09,-1
4335,Just kill the bitcoin to bank transferability ...,1.619804e+09,-1
4336,Cryptopumpndump would be more accurate. All th...,1.619814e+09,1
