# Three investigators - part I.I

A project for scraping and analysing data from a fan site about the audio book series '[The Three Investigators](https://en.wikipedia.org/wiki/Three_Investigators#Germany)'.

Part I.I: Topic modelling

Using the content and title for each episode to detect the overall topic.


**Resources:** 

- Text mining webinar code on [github](https://github.com/DiarmuidM/text-mining/blob/master/code/tm-extraction-2020-06-16.ipynb)

In [1]:
#python version used for this project
from platform import python_version
print(python_version())

3.7.5


> TODO: add the additional modules to requirements.txt.

In [2]:
# import modules [as specified in requirements.txt]
import pandas as pd
import numpy as np
import spacy
import de_core_news_sm #imports German model from spaCy
import nltk
import re

# download German stop words
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('german'))

# To count words in list
from collections import Counter

# for file system operations
import os

%matplotlib inline

In [3]:
#change directory to root folder
os.chdir("..")

## Load data

In [4]:
#load scraped datafiles (os.path.join keeps the paths portable across operating systems)
meta = pd.read_csv(os.path.join("data", "scraped", "meta.csv"))
content = pd.read_csv(os.path.join("data", "scraped", "content_all.csv"))

# make all column names lower case
df_list = [meta, content]

for df in df_list:
    df.columns = df.columns.str.lower()

## Title

### Standardising

In [126]:
meta["titel"]

0            Der Super-Papagei (Hörspiel)
1               Der Phantomsee (Hörspiel)
2             Der Karpatenhund (Hörspiel)
3           Die schwarze Katze (Hörspiel)
4         Der Fluch des Rubins (Hörspiel)
                      ...                
200             Das weiße Grab (Hörspiel)
201    Tauchgang ins Ungewisse (Hörspiel)
202         Der dunkle Wächter (Hörspiel)
203       Das rätselhafte Erbe (Hörspiel)
204      ...und der Mottenmann (Hörspiel)
Name: titel, Length: 205, dtype: object

In [128]:
## make titles lower case
title = meta["titel"].str.lower()

# replace values within titles

# function to loop through the column and replace substrings
def replace_values(text, dic):
    for x, y in dic.items():
        text = text.str.replace(x, y, regex=True)
    return text

# dictionary of patterns to be replaced, including punctuation
replace_dict = {"hörspiel": "", 
                r"[!\"#$%&()*+,./:;<=>?@\[\]^_`{|}~“”-]": "",
                
                #list of noun forms that the German lemmatizer doesn't pick up
                "tigers": "tiger",
                "fouls":"foul",
                "teufels":"teufel",
                "schreckens":"schrecken",
                "meisterdiebs":"meisterdieb",
                "grauens":"grauen",
                "todes":"tod",
                "spielers":"spieler",
                "goldgräbers": "goldgräber",
                "zauberers":"zauberer",
                "henkers":"henker",
                "vergessens" :"vergessen",
                "bauchredners" :"bauchredner",
                "sturms" : "sturm",
                "rubins":"rubin",
               
                #change plural to singular and remove gendered forms
                "musikpiraten" : "musikpirat",
                "seglerin" : "segler",
                "stimmen" : "stimme",
                "karten":"karte",
                "dämonen": "dämon",
                "drachen":"drache",
                "piraten":"pirat",
                "botschaften": "botschaft",
                "raben":"rabe",
                "schlangen":"schlange",
                "untoten":"untote",
                "puppen":"puppe",
                "augen" : "auge",
                "vampire":"vampir"               }
                
# apply function
title = replace_values(title, replace_dict)

# strip leading and trailing white space
title = title.str.strip()

title

0             der superpapagei
1               der phantomsee
2             der karpatenhund
3           die schwarze katze
4          der fluch des rubin
                ...           
200             das weiße grab
201    tauchgang ins ungewisse
202         der dunkle wächter
203       das rätselhafte erbe
204         und der mottenmann
Name: titel, Length: 205, dtype: object
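
Because `replace_values` substitutes substrings, a key such as `augen` would also rewrite longer words that merely contain it. The following is a minimal sketch of a whole-word variant that works on plain Python strings rather than a pandas Series. The helper name `replace_whole_words`, the reduced dictionary and the second sample string are illustrative assumptions, not part of the project.

```python
import re

def replace_whole_words(text, mapping):
    """Replace only complete words found in mapping; leave other text untouched."""
    # Build one alternation, longest keys first so longer forms win over prefixes
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

# Hypothetical subset of the replacement dictionary
noun_forms = {"augen": "auge", "piraten": "pirat", "rubins": "rubin"}

print(replace_whole_words("der fluch des rubins", noun_forms))
print(replace_whole_words("augenblick der piraten", noun_forms))
```

With word boundaries, `augenblick` is left alone while the standalone `piraten` is still reduced to `pirat`. The trade-off is that compounds such as `musikpiraten` would then need their own dictionary entry.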

### Removing stop words

In [129]:
# split titles into substrings using space as delimiter
title_split = title.str.split(" ")

# create empty list to store titles without stop words
title_no_stop_words = []

# iterate through each word in each title and append those that are not stop words
for words in title_split:
    x = []
    for word in words:
        if word not in stop_words:
            x.append(word)
    title_no_stop_words.append(x)

# join titles back together
title_no_stop_words = [" ".join(items) for items in title_no_stop_words]
title_no_stop_words[:10]

['superpapagei',
 'phantomsee',
 'karpatenhund',
 'schwarze katze',
 'fluch rubin',
 'sprechende totenkopf',
 'unheimliche drache',
 'grüne geist',
 'rätselhaften bilder',
 'flüsternde mumie']
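
The nested loop above can also be written as a single comprehension. This is only a sketch on plain strings: the function name `remove_stop_words` and the small stop-word set are assumptions for illustration, whereas the notebook itself uses NLTK's German list.

```python
def remove_stop_words(titles, stop_words):
    """Return each title with its stop words dropped, keeping word order."""
    return [
        " ".join(word for word in title.split() if word not in stop_words)
        for title in titles
    ]

# Hypothetical mini stop-word set; the notebook uses NLTK's German list instead
sample_stop_words = {"der", "die", "das", "des", "und"}
sample_titles = ["der superpapagei", "die schwarze katze", "der fluch des rubin"]

print(remove_stop_words(sample_titles, sample_stop_words))
# ['superpapagei', 'schwarze katze', 'fluch rubin']
```

Calling `split()` without an argument also copes with repeated spaces, which `split(" ")` would turn into empty strings.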

### Lemmatising

In [131]:
## merge titles into one string
titles_one_string =  ' '.join(title_no_stop_words)

# load German model from SpaCy
nlp = de_core_news_sm.load()

# create new list to store lemmatised titles
titles_lemmatised = []

# iterate through titles as one string and store the lemmatised words in new list
# (the loop variable is called token so that it does not overwrite the title Series)
for token in nlp(titles_one_string):
    titles_lemmatised.append(token.lemma_)

titles_lemmatised[:10]

['superpapagei',
 'phantomsee',
 'karpatenhund',
 'schwarze',
 'katze',
 'fluch',
 'rubin',
 'sprechend',
 'totenkopf',
 'unheimlich']

In [134]:
## Check how words have been lemmatised to explore potential issues
words = str(titles_one_string)

for word in nlp.tokenizer(words):
    print("Tokenized: %s | Lemma: %s" %(word, word.lemma_))

Tokenized: superpapagei | Lemma: superpapagei
Tokenized: phantomsee | Lemma: phantomsee
Tokenized: karpatenhund | Lemma: karpatenhund
Tokenized: schwarze | Lemma: schwarze
Tokenized: katze | Lemma: katze
Tokenized: fluch | Lemma: fluch
Tokenized: rubin | Lemma: rubin
Tokenized: sprechende | Lemma: sprechend
Tokenized: totenkopf | Lemma: totenkopf
Tokenized: unheimliche | Lemma: unheimlich
Tokenized: drache | Lemma: drache
Tokenized: grüne | Lemma: grüne
Tokenized: geist | Lemma: geist
Tokenized: rätselhaften | Lemma: rätselhaft
Tokenized: bilder | Lemma: bilder
Tokenized: flüsternde | Lemma: flüsternd
Tokenized: mumie | Lemma: mumie
Tokenized: gespensterschloß | Lemma: gespensterschloß
Tokenized: seltsame | Lemma: seltsam
Tokenized: wecker | Lemma: wecker
Tokenized: lachende | Lemma: lachend
Tokenized: schatten | Lemma: schatten
Tokenized: bergmonster | Lemma: bergmonster
Tokenized: rasende | Lemma: rasend
Tokenized: löwe | Lemma: löwe
Tokenized: zauberspiegel | Lemma: zauberspiegel
...
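
Merging all titles into one string loses the link between a lemma and its episode. The sketch below keeps one list of lemmas per title, which may be useful once topics have to be mapped back to episodes. The function `lemmatise_titles` and the toy lookup table are assumptions for illustration; the table merely stands in for the spaCy model so that the example runs without it.

```python
def lemmatise_titles(titles, lemmatise):
    """Apply a word-level lemmatise function to every title, one list per title."""
    return [[lemmatise(word) for word in title.split()] for title in titles]

# Toy lookup table standing in for the spaCy model (assumption for illustration)
toy_lemmas = {"sprechende": "sprechend", "unheimliche": "unheimlich"}

def toy_lemmatise(word):
    # Fall back to the word itself when the table has no entry
    return toy_lemmas.get(word, word)

sample_titles = ["sprechende totenkopf", "unheimliche drache", "phantomsee"]
print(lemmatise_titles(sample_titles, toy_lemmatise))
# [['sprechend', 'totenkopf'], ['unheimlich', 'drache'], ['phantomsee']]
```

Any callable that maps a word to its lemma can be passed in place of `toy_lemmatise`.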

### Counting words

In [139]:
# count frequencies of words
counts = Counter(titles_lemmatised)
counts = pd.DataFrame.from_dict(counts, orient='index').reset_index() #make into dataframe
counts = counts.sort_values(by=counts.columns[1], ascending=False) #sort by frequency
counts.to_csv("test.csv")
counts

            index  0
31           spur  5
94          rache  5
3        schwarze  5
10         drache  5
149     schrecken  5
...           ... ..
115        stimme  1
116  pistenteufel  1
117          leer  1
120        voodoo  1
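
For a quick look at the most frequent words, `Counter` can return the sorted pairs directly, without the detour through a DataFrame. A minimal sketch with made-up sample data; in the notebook the input would be `titles_lemmatised`.

```python
from collections import Counter

# Sample of lemmatised words; the notebook would pass titles_lemmatised
sample_lemmas = ["spur", "rache", "spur", "drache", "rache", "spur"]

counts_sample = Counter(sample_lemmas)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = counts_sample.most_common(2)
print(top_two)
# [('spur', 3), ('rache', 2)]
```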


In [174]:
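# groups of words
colours = [r'grüne*[nsr]*$' , r'^schw[aä]rze*[nsr]*$']


# a pattern is never literally "in" the list, so test every word against it
for i in colours:
    if any(re.search(i, word) for word in titles_lemmatised):
        print("yes")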
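
To see which words belong to which group, rather than only whether a group occurs, the patterns can be given names and matched against every word. This is a sketch under assumptions: the helper `group_words` and the group names `green` and `black` are hypothetical, and the patterns mirror the colour patterns above without their anchors because `fullmatch` already requires the whole word to match.

```python
import re

def group_words(words, patterns):
    """Map each group name to the words fully matching that group's pattern."""
    compiled = {name: re.compile(pattern) for name, pattern in patterns.items()}
    return {
        name: [word for word in words if regex.fullmatch(word)]
        for name, regex in compiled.items()
    }

# Hypothetical group names for the two colour patterns
colour_patterns = {"green": r"grüne*[nsr]*", "black": r"schw[aä]rze*[nsr]*"}
sample_words = ["schwarze", "katze", "grüne", "geist", "schwarzer"]

print(group_words(sample_words, colour_patterns))
# {'green': ['grüne'], 'black': ['schwarze', 'schwarzer']}
```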



In [178]:
test = title_no_stop_words

['superpapagei',
 'phantomsee',
 'karpatenhund',
 'schwarze katze',
 'fluch rubin',
 'sprechende totenkopf',
 'unheimliche drache',
 'grüne geist',
 'rätselhaften bilder',
 'flüsternde mumie',
 'gespensterschloß',
 'seltsame wecker',
 'lachende schatten',
 'bergmonster',
 'rasende löwe',
 'zauberspiegel',
 'gefährliche erbschaft',
 'geisterinsel',
 'teufelberg',
 'flammende spur',
 'tanzende teufel',
 'verschwundene schatz',
 'aztekenschwert',
 'silberne spinne',
 'singende schlange',
 'silbermine',
 'magische kreis',
 'doppelgänger',
 'riff haie',
 'narbengesicht',
 'ameisenmensch',
 'bedrohte ranch',
 'rote pirat',
 'höhlenmensch',
 'superwal',
 'heimliche hehler',
 'unsichtbare gegner',
 'perlenvögel',
 'automarder',
 'volk winde',
 'weinende sarg',
 'höllische werwolf',
 'gestohlene preis',
 'gold wikinger',
 'schrullige millionär',
 'giftige gockel',
 'gefährlichen fässer',
 'comicdiebe',
 'verschwundene filmstar',
 'riskante ritt',
 'musikpirat',
 'automafia',
 'gefahr verzug',
 

In [175]:
titles_lemmatised

['superpapagei',
 'phantomsee',
 'karpatenhund',
 'schwarze',
 'katze',
 'fluch',
 'rubin',
 'sprechend',
 'totenkopf',
 'unheimlich',
 'drache',
 'grüne',
 'geist',
 'rätselhaft',
 'bilder',
 'flüsternd',
 'mumie',
 'gespensterschloß',
 'seltsam',
 'wecker',
 'lachend',
 'schatten',
 'bergmonster',
 'rasend',
 'löwe',
 'zauberspiegel',
 'gefährlich',
 'erbschaft',
 'geisterinsel',
 'teufelberg',
 'flammend',
 'spur',
 'tanzend',
 'teufel',
 'verschwunden',
 'schatz',
 'aztekenschwert',
 'silberne',
 'spinne',
 'singend',
 'schlange',
 'silbermine',
 'magische',
 'kreis',
 'doppelgänger',
 'riff',
 'haie',
 'narbengesicht',
 'ameisenmensch',
 'bedrohen',
 'ranch',
 'rote',
 'pirat',
 'höhlenmensch',
 'superwal',
 'heimlich',
 'hehler',
 'unsichtbar',
 'gegner',
 'perlenvögel',
 'automarder',
 'volk',
 'winde',
 'weinend',
 'sarg',
 'höllische',
 'werwolf',
 'gestohlen',
 'preis',
 'gold',
 'wikinger',
 'schrullig',
 'millionär',
 'giftig',
 'gockel',
 'gefährlich',
 'fässer',
 ...

### Word similarity

In [146]:
titles_one_string[:36]

'superpapagei phantomsee karpatenhund'

In [170]:
word_similarity = nlp(titles_one_string)

similarities = []

for word1 in word_similarity:
    for word2 in word_similarity:
        if word1.text == word2.text:
            continue  # skip comparing a word with itself
        score = word1.similarity(word2)  # compute the similarity only once
        if score > 0.85:
            similarities.append((word1.text, word2.text, score))
        
similarities


[('gefährliche', 'gestohlene', 0.85386014),
 ('teufel', 'pistenteufel', 0.8574693),
 ('verschwundene', 'tödliche', 0.8520479),
 ('rote', 'tote', 0.8504193),
 ('gestohlene', 'gefährliche', 0.85386014),
 ('gestohlene', 'gefährliches', 0.8725113),
 ('giftige', 'gestohlene', 0.858411),
 ('pistenteufel', 'teufel', 0.8574693),
 ('tödliche', 'verschwundene', 0.8520479),
 ('tödliche', 'rache', 0.8602149),
 ('tödliche', 'rache', 0.85529023),
 ('gefährliches', 'gestohlene', 0.8725113),
 ('gefährliches', 'gestohlene', 0.862969),
 ('tote', 'rote', 0.8504193),
 ('rache', 'tödliche', 0.8602149),
 ('rache', 'tödliche', 0.85529023),
 ('gestohlene', 'giftige', 0.858411),
 ('gestohlene', 'gefährliches', 0.862969)]
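
The list above contains every pair twice, once in each direction, and repeats pairs whose words occur in several titles. A possible way to report each unordered pair only once is sketched below. The function `similar_pairs` is a hypothetical helper, and `string_similarity` is a stand-in scorer built on `difflib` so that the example runs without a language model; it measures shared characters and is not a substitute for spaCy's `similarity`.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(words, similarity, threshold):
    """Return each unordered pair of distinct words once, with its score."""
    unique_words = sorted(set(words))
    pairs = []
    for word1, word2 in combinations(unique_words, 2):
        score = similarity(word1, word2)  # computed once per pair
        if score > threshold:
            pairs.append((word1, word2, score))
    return pairs

def string_similarity(word1, word2):
    # Stand-in scorer based on shared characters, NOT spaCy's vector similarity
    return SequenceMatcher(None, word1, word2).ratio()

sample_words = ["rote", "tote", "rache", "rote", "teufel", "pistenteufel"]
for word1, word2, score in similar_pairs(sample_words, string_similarity, 0.7):
    print(word1, word2, round(score, 2))
```

Deduplicating the words first also roughly halves the number of comparisons compared with the full nested loop.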

## Content

### Standardising

In [88]:
## make content lower case
content_clean = content["content"].str.lower()

# replace values within the content

# function to loop through the column and replace substrings
def replace_values(text, dic):
    for x, y in dic.items():
        text = text.str.replace(x, y, regex=True)
    return text

# dictionary of patterns to be replaced: punctuation for now,
# acronyms can be added as further entries
replace_dict = {r"[!\"#$%&()*+,./:;<=>?@\[\]^_`{|}~“”-]": ""}


# apply function
content_clean = replace_values(content_clean, replace_dict)

# strip leading and trailing white space
content_clean = content_clean.str.strip()

content_clean

0      Der neueste Auftrag an die drei Detektive hört...
1      Welches Geheimnis verbirgt sich in einem vergi...
2      "Bei mir spukt es!" Mit diesem verzweifelten A...
3      In einem kleinen Wanderzirkus wittern die drei...
4      Alfred Hitchcock und die drei Detektive (Firme...
                             ...                        
201    So hatten sich die drei Detektive ihre Auszeit...
202    Ein Kindermädchen, das nachts in Gestalt eines...
203    Eigentlich sollte Bob in dem einsam gelegenen ...
204    Ein merkwürdiger Anruf erreicht Die Drei Frage...
205    In Rocky Beach taucht eine schaurige Gestalt a...
Name: content, Length: 206, dtype: object
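
The same standardising steps can be tried out on a single string before applying them to the whole column. A minimal sketch, assuming a slightly broader punctuation set than the regular expression in the cell above: `string.punctuation` also covers the apostrophe and the backslash, and the low quotation mark `„` is an added assumption. The function name `standardise` is hypothetical.

```python
import re
import string

# ASCII punctuation plus typographic quotes (the low quote is an assumption)
PUNCTUATION = string.punctuation + "“”„"

def standardise(text):
    """Lower-case text, drop punctuation and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    return re.sub(r"\s+", " ", text).strip()

print(standardise('"Bei mir spukt es!" Mit diesem'))
# bei mir spukt es mit diesem
```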