1. Intro: What do we want to do?
    - Predict whether comments get upvoted
    - Treat significant words/keywords as variables
2. Alternative tokenizers
    - efficiency vs. accuracy
3. Comparison of tokenizers
4. Word counts with scikit-learn
5. Alternative weightings in counts (TF-IDF)
6. From text data to variables/features
    - From score to dummy
    - From text to dummies

- Supplementary: From raw data to table

# Intro

# Alternative tokenizers

- stanza with tokenizer only
- spacy with tokenizer only
- sklearn CountVectorizer().build_tokenizer()

- Timing comparison
    - stanza with the full pipeline
    - stanza tokenize
    - spacy tokenize
    - sklearn CountVectorizer().build_tokenizer()

## stanza - full pipeline

In [1]:
import stanza
import pandas as pd
from nltk.corpus import stopwords

# Download resources (run once)
#import nltk
#nltk.download('stopwords')
#stanza.download('da')

In [2]:
redditdata_url = "https://raw.githubusercontent.com/CALDISS-AAU/course_ndms-I/master/datasets/reddit_rdenmark-comments_01032021-08032021_long.csv"
reddit_df = pd.read_csv(redditdata_url)

In [205]:
# Define tokenizer

nlp = stanza.Pipeline('da')  # full pipeline: tokenize, pos, lemma, depparse

def tokenizer_stanza(text): # Defines a function based on the earlier code
    
    stop_words = list(stopwords.words('danish'))
    pos_tags = ['PROPN', 'ADJ', 'NOUN']  # keep proper nouns, adjectives and nouns

    doc = nlp(text)

    tokens = []

    for sentence in doc.sentences:
        for word in sentence.words:
            if (len(word.lemma) < 2):  # skip lemmas shorter than two characters
                continue
            if (word.pos in pos_tags) and (word.lemma not in stop_words):
                tokens.append(word.lemma)
                
    return(tokens)

2021-03-17 12:24:38 INFO: Loading these models for language: da (Danish):
| Processor | Package |
-----------------------
| tokenize  | ddt     |
| pos       | ddt     |
| lemma     | ddt     |
| depparse  | ddt     |

2021-03-17 12:24:38 INFO: Use device: cpu
2021-03-17 12:24:38 INFO: Loading: tokenize
2021-03-17 12:24:38 INFO: Loading: pos
2021-03-17 12:24:39 INFO: Loading: lemma
2021-03-17 12:24:39 INFO: Loading: depparse
2021-03-17 12:24:40 INFO: Done loading processors!


In [208]:
reddit_sample = reddit_df.sample(100, random_state = 142)
reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_stanza)

In [None]:
reddit_sample['tokens'].head()

## stanza - tokenizer only

In [255]:
# Define tokenizer

nlp = stanza.Pipeline('da', processors = 'tokenize')

def tokenizer_stanza_simple(text): # Defines a function based on the earlier code
    
    stop_words = list(stopwords.words('danish'))

    doc = nlp(text)

    tokens = []

    for sentence in doc.sentences:
        for word in sentence.words:
            if (len(word.text) < 2):
                continue
            if word.text.lower() not in stop_words:
                tokens.append(word.text.lower())
                
    return(tokens)

2021-03-17 13:29:38 INFO: Loading these models for language: da (Danish):
| Processor | Package |
-----------------------
| tokenize  | ddt     |

2021-03-17 13:29:38 INFO: Use device: cpu
2021-03-17 13:29:38 INFO: Loading: tokenize
2021-03-17 13:29:38 INFO: Done loading processors!


In [256]:
reddit_sample = reddit_df.sample(100, random_state = 142)
reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_stanza_simple)

In [257]:
reddit_sample['tokens'].head()

480         [kl., morgenen, morgen, kl., 6., 00-06, nat.]
2925    [må, kun, købe, ammunition, våben, tilladelse,...
1100    [særligt, dør, corona, kan, få, god, behandlin...
2777                           [husk, må, offentligheden]
636     [okay, lad, rette, udmelding, damage, bestemt,...
Name: tokens, dtype: object

## spacy

In [213]:
import spacy

!python -m spacy download da_core_news_sm # download language model

Collecting da-core-news-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/da_core_news_sm-3.0.0/da_core_news_sm-3.0.0-py3-none-any.whl (18.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('da_core_news_sm')


In [258]:
nlp = spacy.load("da_core_news_sm")

def tokenizer_spacy_simple(text): # Defines a function based on the earlier code
    
    stop_words = list(stopwords.words('danish'))

    doc = nlp.tokenizer(text)

    tokens = []

    for word in doc:
        if (len(word.text) < 2):
            continue
        if word.text.lower() not in stop_words:
            tokens.append(word.text.lower())

    return(tokens)

In [259]:
reddit_sample = reddit_df.sample(100, random_state = 142)
reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_spacy_simple)

In [260]:
reddit_sample['tokens'].head()

480     [kl., morgenen,  \n, morgen, kl., 6.,  \n, 00-...
2925    [må, kun, købe, ammunition, våben, tilladelse,...
1100    [særligt, dør, corona, kan, få, god, behandlin...
2777                           [husk, må, offentligheden]
636     [okay, lad, rette, udmelding, damage, bestemt,...
Name: tokens, dtype: object
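
The spaCy output above still contains whitespace tokens such as ` \n`, and the same length and stop-word filter is repeated in every tokenizer function. A minimal sketch of a shared post-filter follows. The helper name `clean_tokens` and the small hand-written stop-word set are assumptions for illustration; in the notebook the set would come from `stopwords.words('danish')`.

```python
def clean_tokens(raw_tokens, stop_words, min_len=2):
    """Lowercase tokens and drop whitespace, short tokens and stop words."""
    cleaned = []
    for tok in raw_tokens:
        tok = tok.strip().lower()   # " \n" becomes "" and is dropped below
        if len(tok) < min_len:
            continue
        if tok in stop_words:
            continue
        cleaned.append(tok)
    return cleaned


# Hypothetical stand-in for set(stopwords.words('danish'))
demo_stop_words = {"om", "og", "i"}

# Made-up raw tokens, including a whitespace token like the one spaCy returned
demo_raw = ["Kl.", "6", "om", "morgenen", " \n", "Nat"]
demo_clean = clean_tokens(demo_raw, demo_stop_words)
print(demo_clean)
```

Passing the stop words as a set rather than a list also makes each membership check cheaper, which may matter when the filter runs over every token of every comment.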

## sklearn

In [261]:
from sklearn.feature_extraction.text import CountVectorizer

tokenizer = CountVectorizer().build_tokenizer()

def tokenizer_sklearn(text):
    stop_words = list(stopwords.words('danish'))
    
    words = tokenizer(text)
    
    tokens = []
    
    for word in words:
        if (len(word) < 2):
            continue
        if word.lower() not in stop_words:
            tokens.append(word.lower())
    
    return(tokens)

In [262]:
reddit_sample = reddit_df.sample(100, random_state = 142)
reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_sklearn)

In [263]:
reddit_sample['tokens'].head()

480               [kl, morgenen, morgen, kl, 00, 06, nat]
2925    [må, kun, købe, ammunition, våben, tilladelse,...
1100    [særligt, dør, corona, kan, få, god, behandlin...
2777                           [husk, må, offentligheden]
636     [okay, lad, rette, udmelding, damage, bestemt,...
Name: tokens, dtype: object
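
The sklearn tokens differ from the stanza and spaCy ones (`kl` instead of `kl.`, `00` and `06` instead of `00-06`). This follows from the default `token_pattern` of `CountVectorizer`, the regular expression `(?u)\b\w\w+\b`, which keeps runs of two or more word characters and treats punctuation as a separator. A small sketch using only the `re` module; the example sentence is made up:

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

demo_text = "Kl. 6 om morgenen, 00-06 om natten."
demo_tokens = token_pattern.findall(demo_text)
print(demo_tokens)
```

Single-character tokens such as `6` never reach the length check in `tokenizer_sklearn`, because the pattern already excludes them.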

## Timing

In [280]:
import stanza
import spacy
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import time

nlp_stanza = stanza.Pipeline('da')
nlp_stanza_simple = stanza.Pipeline('da', processors = 'tokenize')
nlp_spacy_simple = spacy.load("da_core_news_sm")
sklearn_tokenizer = CountVectorizer().build_tokenizer()

def tokenizer_stanza(text, nlp = nlp_stanza): # Defines a function based on the earlier code

    stop_words = list(stopwords.words('danish'))
    pos_tags = ['PROPN', 'ADJ', 'NOUN']

    doc = nlp(text)

    tokens = []

    for sentence in doc.sentences:
        for word in sentence.words:
            if (len(word.lemma) < 2):
                continue
            if (word.pos in pos_tags) and (word.lemma not in stop_words):
                tokens.append(word.lemma)

    return(tokens)


def tokenizer_stanza_simple(text, nlp = nlp_stanza_simple): # Defines a function based on the earlier code
    
    stop_words = list(stopwords.words('danish'))

    doc = nlp(text)

    tokens = []

    for sentence in doc.sentences:
        for word in sentence.words:
            if (len(word.text) < 2):
                continue
            if word.text.lower() not in stop_words:
                tokens.append(word.text.lower())
                
    return(tokens)

def tokenizer_spacy_simple(text, nlp = nlp_spacy_simple): # Defines a function based on the earlier code
    
    stop_words = list(stopwords.words('danish'))

    doc = nlp.tokenizer(text)

    tokens = []

    for word in doc:
        if (len(word.text) < 2):
            continue
        if word.text.lower() not in stop_words:
            tokens.append(word.text.lower())

    return(tokens)

def tokenizer_sklearn(text, tokenizer = sklearn_tokenizer):
    stop_words = list(stopwords.words('danish'))
    
    words = tokenizer(text)
    
    tokens = []
    
    for word in words:
        if (len(word) < 2):
            continue
        if word.lower() not in stop_words:
            tokens.append(word.lower())
    
    return(tokens)

def stanza_full_tester():
    start_time = time.time()
    
    reddit_sample = reddit_df.sample(100, random_state = 142)
    reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_stanza)
    
    print("stanza full: {0:.2f} seconds".format(time.time()-start_time))
    
def stanza_simple_tester():
    start_time = time.time()
    
    reddit_sample = reddit_df.sample(100, random_state = 142)
    reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_stanza_simple)
    
    print("stanza simple: {0:.2f} seconds".format(time.time()-start_time))
    
def spacy_simple_tester():
    start_time = time.time()
    
    reddit_sample = reddit_df.sample(100, random_state = 142)
    reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_spacy_simple)
    
    print("spacy simple: {0:.2f} seconds".format(time.time()-start_time))
    
def sklearn_tester():
    start_time = time.time()
          
    reddit_sample = reddit_df.sample(100, random_state = 142)
    reddit_sample['tokens'] = reddit_sample['comment_body'].apply(tokenizer_sklearn)
    
    print("sklearn: {0:.2f} seconds".format(time.time()-start_time))

2021-03-17 13:50:31 INFO: Loading these models for language: da (Danish):
| Processor | Package |
-----------------------
| tokenize  | ddt     |
| pos       | ddt     |
| lemma     | ddt     |
| depparse  | ddt     |

2021-03-17 13:50:31 INFO: Use device: cpu
2021-03-17 13:50:31 INFO: Loading: tokenize
2021-03-17 13:50:31 INFO: Loading: pos
2021-03-17 13:50:32 INFO: Loading: lemma
2021-03-17 13:50:32 INFO: Loading: depparse
2021-03-17 13:50:33 INFO: Done loading processors!
2021-03-17 13:50:33 INFO: Loading these models for language: da (Danish):
| Processor | Package |
-----------------------
| tokenize  | ddt     |

2021-03-17 13:50:33 INFO: Use device: cpu
2021-03-17 13:50:33 INFO: Loading: tokenize
2021-03-17 13:50:33 INFO: Done loading processors!


In [281]:
stanza_full_tester()
stanza_simple_tester()
spacy_simple_tester()
sklearn_tester()

stanza full: 18.86 seconds
stanza simple: 6.63 seconds
spacy simple: 0.13 seconds
sklearn: 0.03 seconds
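
The four tester functions above differ only in which tokenizer they call. A minimal sketch of a single timing helper, assuming a hypothetical name `time_tokenizer` and using `time.perf_counter`, which is intended for measuring short durations; the toy tokenizer and texts stand in for the real functions and comments:

```python
import time

def time_tokenizer(tokenizer_func, texts, label):
    """Apply tokenizer_func to every text and report the elapsed wall time."""
    start = time.perf_counter()
    results = [tokenizer_func(text) for text in texts]
    elapsed = time.perf_counter() - start
    print("{0}: {1:.2f} seconds".format(label, elapsed))
    return results, elapsed


# Toy tokenizer and texts, stand-ins for the real functions and comments
demo_texts = ["en kort kommentar", "endnu en kommentar her"]
demo_results, demo_elapsed = time_tokenizer(str.split, demo_texts, "whitespace split")
```

With the notebook's objects this could be called as, for example, `time_tokenizer(tokenizer_sklearn, reddit_sample['comment_body'], "sklearn")`. On the timings reported above, the full stanza pipeline is several hundred times slower than the sklearn tokenizer on the same 100 comments (18.86 s vs. 0.03 s).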


# Word counts (vectorizers)

In [25]:
comments = list(reddit_df['comment_body'])

In [26]:
len(comments)

3428

## CountVectorizer

In [105]:
# Countvectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
transformed_documents = vectorizer.fit_transform(comments)

transformed_documents_as_array = transformed_documents.toarray()

len(transformed_documents_as_array)

3428

In [106]:
df = pd.DataFrame(transformed_documents_as_array, columns = vectorizer.get_feature_names_out())

In [107]:
df.head()

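|   | 00 | 000 | 000kr | 01 | 019d907422_01 | 02 | 03 | 04 | 05 | 06 | ... | øver | øverligt | øverste | øvet | øvetimer | øvre | øvrige | øvrigt | überdurchschnittlich | ﾟヮﾟ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |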
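
The table above is a document-term matrix: one row per comment, one column per distinct token, and each cell the number of times that token occurs in that comment. A plain-Python sketch of the same idea on made-up, already tokenized documents (this illustrates the counting only and is not how `CountVectorizer` is implemented internally):

```python
from collections import Counter

demo_docs = [["god", "dag", "god"], ["dag", "tak"]]

# Vocabulary: every distinct token, sorted so the column order is stable
vocabulary = sorted({tok for doc in demo_docs for tok in doc})

# One row per document, one column per vocabulary term
count_matrix = []
for doc in demo_docs:
    counts = Counter(doc)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
print(count_matrix)
```

Most cells are zero, as in the real output. That is why `fit_transform` returns a sparse matrix, and why calling `.toarray()` on a large corpus can use a lot of memory.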


In [282]:
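# Note: this cell was run (In [282]) after `df` had been overwritten by the
# tf-idf cell on existing tokens further down, so the output shows tf-idf sums, not raw counts.
word_count = df.sum()
word_count.sort_values(ascending = False)[0:20]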

[deleted]    67.000000
god          54.742846
al           49.351776
folk         40.191250
år           36.689772
stor         35.835614
&gt          33.150415
tak          31.653772
Danmark      26.753062
kommentar    26.515093
samme        26.307562
gang         25.675026
lille        25.382263
dansk        24.381739
dag          23.914208
spørgsmål    23.835020
problem      22.978650
tid          22.112148
ting         22.009872
penge        21.293497
dtype: float64

### With stop words and document-frequency limits

In [138]:
from nltk.corpus import stopwords

stops = stopwords.words('danish')

vectorizer = CountVectorizer(stop_words = stops, max_df = 0.9)
transformed_documents = vectorizer.fit_transform(comments)

transformed_documents_as_array = transformed_documents.toarray()

df = pd.DataFrame(transformed_documents_as_array, columns = vectorizer.get_feature_names_out())

word_count = df.sum()
word_count.sort_values(ascending = False)[0:20]

så       1320
kan       933
the       462
to        415
https     386
bare      366
ved       357
nok       279
mere      263
godt      260
gt        257
lige      250
lidt      239
and       233
se        223
com       220
www       215
folk      207
år        206
tror      197
dtype: int64
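
`max_df = 0.9` tells the vectorizer to ignore terms that occur in more than 90% of the documents. A plain-Python sketch of that filter on made-up token lists; the function name `document_frequency_filter` is illustrative:

```python
def document_frequency_filter(docs, max_df=0.9):
    """Return the terms that occur in at most max_df (a proportion) of docs."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, freq in doc_freq.items() if freq <= max_df * n_docs)


demo_docs = [["så", "folk"], ["så", "år"], ["så", "folk", "tak"]]
kept_terms = document_frequency_filter(demo_docs, max_df=0.9)
print(kept_terms)
```

In the toy data `så` occurs in every document and is dropped. In the real output above, words such as `så` and `kan` survive, which suggests that no single word occurs in more than 90% of the comments, so a lower threshold or a longer stop-word list would be needed to remove them.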

## Alternative weighting - TfidfVectorizer

In [142]:
# Tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.corpus import stopwords

stops = stopwords.words('danish')

vectorizer = TfidfVectorizer(stop_words = stops, max_df = 0.9, norm=None)
transformed_documents = vectorizer.fit_transform(comments)

transformed_documents_as_array = transformed_documents.toarray()

df = pd.DataFrame(transformed_documents_as_array, columns = vectorizer.get_feature_names_out())

word_tfidfsum = df.sum()
word_tfidfsum.sort_values(ascending = False)[0:50]

så         3115.334699
kan        2435.694572
the        1808.396103
to         1510.087783
https      1410.917629
bare       1222.780855
ved        1219.668670
and        1030.215344
nok        1009.579087
gt          996.474777
mere        977.064445
godt        956.913994
lige        928.768484
com         918.979451
www         904.193254
lidt        900.883236
0a          889.633384
se          855.588347
of          830.092618
år          828.538209
folk        824.060511
tror        784.250825
fordi       772.838287
andre       769.439345
helt        761.063850
it          750.282866
ret         723.188197
få          711.348774
danmark     702.441635
må          689.234588
you         687.544244
that        679.230748
kommer      675.278304
ja          673.972109
in          629.751896
dk          624.937200
kun         622.438705
får         615.064459
gør         614.010303
nogen       610.965999
samme       604.525528
is          599.507662
mener       576.901141
uden       
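
With normalization switched off, each cell is the term frequency multiplied by the inverse document frequency. A plain-Python sketch of that weighting, assuming scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. Function and variable names are illustrative, and the token lists are made up:

```python
import math

def tfidf_unnormalized(docs):
    """Return (idf per term, list of {term: tf * idf} per document)."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):                      # one count per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Smoothed idf: never below 1, higher for rarer terms
    idf = {t: math.log((1 + n_docs) / (1 + f)) + 1 for t, f in doc_freq.items()}
    rows = []
    for doc in docs:
        row = {}
        for term in doc:                           # adding idf once per occurrence = tf * idf
            row[term] = row.get(term, 0.0) + idf[term]
        rows.append(row)
    return idf, rows


demo_docs = [["så", "folk"], ["så"], ["så", "tak", "tak"]]
demo_idf, demo_rows = tfidf_unnormalized(demo_docs)
print(demo_idf)
print(demo_rows)
```

Because this idf never drops below 1, a term's summed weight is at least its raw count. That is one reason very frequent words such as `så` and `kan` still top the unnormalized list above.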

## Tf-idf on existing tokens

In [144]:
import ast
reddit_df_tokenized = pd.read_csv("https://raw.githubusercontent.com/CALDISS-AAU/course_ndms-I/master/datasets/reddit_rdenmark-comments_01032021-08032021_long_tokenized.csv")
reddit_df_tokenized['tokens'] = reddit_df_tokenized['tokens'].apply(ast.literal_eval)

In [145]:
comments_tokens = list(reddit_df_tokenized['tokens'])

In [166]:
from sklearn.feature_extraction.text import TfidfVectorizer

def return_tokens(tokens):
    return tokens

vectorizer = TfidfVectorizer(
    tokenizer=return_tokens,
    preprocessor=return_tokens,
    token_pattern=None)

transformed_documents = vectorizer.fit_transform(comments_tokens)

transformed_documents_as_array = transformed_documents.toarray()

df = pd.DataFrame(transformed_documents_as_array, columns = vectorizer.get_feature_names_out())

word_tfidfsum = df.sum().sort_values(ascending = False)
word_tfidfsum[0:50]

[deleted]    67.000000
god          54.742846
al           49.351776
folk         40.191250
år           36.689772
stor         35.835614
&gt          33.150415
tak          31.653772
Danmark      26.753062
kommentar    26.515093
samme        26.307562
gang         25.675026
lille        25.382263
dansk        24.381739
dag          23.914208
spørgsmål    23.835020
problem      22.978650
tid          22.112148
ting         22.009872
penge        21.293497
land         21.116093
menneske     20.414228
hel          20.390628
sted         20.021477
barn         19.930291
enig         19.403917
ny           18.676445
vaccine      17.294519
måde         16.827851
album        16.783989
That         16.760154
del          16.592103
person       15.291479
svær         14.609716
ulovlig      14.495890
besked       14.481778
rigtig       14.347606
side         14.206807
krone        13.772239
arbejde      13.616934
sikker       13.588716
indhold      13.400862
egen         13.340181
parti      

# Preparing text for a random forest model

## From score to dummy

In [284]:
reddit_df_rf = reddit_df_tokenized.copy()

In [285]:
reddit_df_rf['comment_score'].head()

0    1
1    1
2    1
3    1
4    1
Name: comment_score, dtype: int64

In [286]:
reddit_df_rf['upvoted'] = reddit_df_rf['comment_score'] > 1

In [290]:
reddit_df_rf['upvoted'].head()

0    False
1    False
2    False
3    False
4    False
Name: upvoted, dtype: bool

In [293]:
reddit_df_rf['upvoted'].value_counts()

True     1872
False    1556
Name: upvoted, dtype: int64
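
The two classes are fairly balanced. As one possible reference point for a later model, the share of the larger class can be computed directly from the counts above; a classifier that always predicted that class would reach this accuracy. The variable names are illustrative:

```python
# Counts copied from the value_counts() output above
upvoted_counts = {True: 1872, False: 1556}

n_comments = sum(upvoted_counts.values())
majority_share = max(upvoted_counts.values()) / n_comments
print(n_comments, round(majority_share, 3))
```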

## From text to dummies

In [170]:
top_words = list(word_tfidfsum.index[0:50])

In [181]:
for word in top_words:
    colname = "token_{}".format(word)
    reddit_df_rf[colname] = reddit_df_rf['tokens'].apply(lambda tokens: int(word in tokens))
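
The loop above adds one 0/1 column per top word, marking whether the word is present in a comment rather than how often it occurs. The same logic on made-up data without pandas, as a sketch; all `demo_` names are illustrative:

```python
demo_top_words = ["god", "folk", "tak"]
demo_token_lists = [["god", "dag"], ["folk", "god", "folk"], []]

demo_rows = []
for tokens in demo_token_lists:
    token_set = set(tokens)   # set lookup avoids scanning the list once per word
    row = {"token_{}".format(w): int(w in token_set) for w in demo_top_words}
    demo_rows.append(row)

for row in demo_rows:
    print(row)
```

The second document contains `folk` twice but still gets a 1, and a comment with no tokens gets 0 in every column.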

In [189]:
from itertools import compress

list(compress(reddit_df_rf.columns, reddit_df_rf.columns.str.startswith('token_')))

['token_[deleted]',
 'token_god',
 'token_al',
 'token_folk',
 'token_år',
 'token_stor',
 'token_&gt',
 'token_tak',
 'token_Danmark',
 'token_kommentar',
 'token_samme',
 'token_gang',
 'token_lille',
 'token_dansk',
 'token_dag',
 'token_spørgsmål',
 'token_problem',
 'token_tid',
 'token_ting',
 'token_penge',
 'token_land',
 'token_menneske',
 'token_hel',
 'token_sted',
 'token_barn',
 'token_enig',
 'token_ny',
 'token_vaccine',
 'token_måde',
 'token_album',
 'token_That',
 'token_del',
 'token_person',
 'token_svær',
 'token_ulovlig',
 'token_besked',
 'token_rigtig',
 'token_side',
 'token_krone',
 'token_arbejde',
 'token_sikker',
 'token_indhold',
 'token_egen',
 'token_parti',
 'token_forælder',
 'token_hvid',
 'token_Thank',
 'token_gammel',
 'token_forhold',
 'token_reklame']

In [185]:
[column for column in reddit_df_rf.columns if column.startswith('token_')]

['token_[deleted]',
 'token_god',
 'token_al',
 'token_folk',
 'token_år',
 'token_stor',
 'token_&gt',
 'token_tak',
 'token_Danmark',
 'token_kommentar',
 'token_samme',
 'token_gang',
 'token_lille',
 'token_dansk',
 'token_dag',
 'token_spørgsmål',
 'token_problem',
 'token_tid',
 'token_ting',
 'token_penge',
 'token_land',
 'token_menneske',
 'token_hel',
 'token_sted',
 'token_barn',
 'token_enig',
 'token_ny',
 'token_vaccine',
 'token_måde',
 'token_album',
 'token_That',
 'token_del',
 'token_person',
 'token_svær',
 'token_ulovlig',
 'token_besked',
 'token_rigtig',
 'token_side',
 'token_krone',
 'token_arbejde',
 'token_sikker',
 'token_indhold',
 'token_egen',
 'token_parti',
 'token_forælder',
 'token_hvid',
 'token_Thank',
 'token_gammel',
 'token_forhold',
 'token_reklame']

# From raw API data to tabular data (supplementary)

## Leftovers

In [None]:
tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), transformed_documents_as_array[0]))
one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).reset_index(drop=True)

one_doc_as_df.loc[one_doc_as_df['score']>0]