## EDA on topics
We're going to go through the reviews of TOTK and determine detailed aspects for each of the more abstract aspect categories. For example, under combat we would have the weapon system, the fuse mechanic, the quality and variety of enemies, the difficulty, etc.

These are the aspect categories:

1. Visual Presentation
2. Audio
3. Combat
4. Bosses and Enemies
5. Main Story/Quests and Characters
6. Abilities
7. Ultra-Hand and Building
8. Puzzles and Dungeons
9. Side Quests and Side Adventures
10. World
11. Exploration and Traversal
12. Misc/Other

In [2]:
import pandas as pd
import re
from nltk.util import bigrams, ngrams
import nltk
import string
from collections import Counter
from nltk.tokenize import sent_tokenize
from ast import literal_eval

To make things quick I'll import the reviews already formatted to lowercase.

In [3]:
lowercase_reviews = pd.read_csv('totk_lowercase_reviews.csv',index_col=0)

To best extract any aspects I'll need to remove punctuation from the reviews:

In [4]:
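def remove_punc(text):
    # Replace ASCII punctuation with spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # Replace anything left that is neither a word character nor whitespace (e.g. curly quotes)
    return re.sub(r'[^\w\s]', ' ', text)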

In [5]:
#Define new column being the reviews sans any punctuation

lowercase_reviews['no_punc'] = lowercase_reviews.lowercase_reviews.apply(lambda x: remove_punc(x))
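
As a quick sanity check, here is a minimal sketch of what the punctuation removal does to a made-up review (the sample string is hypothetical, not taken from the data). Note that hyphenated terms are split into separate words, so a variant such as 'art-style' can no longer appear in the `no_punc` column, while 'art style' can.

```python
import re
import string

def remove_punc(text):
    # Same two-step substitution as the notebook cell above
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    return re.sub(r'[^\w\s]', ' ', text)

# Hypothetical sample text for illustration
sample = "great art-style, but the frame-rate dips (a lot)!"
cleaned = remove_punc(sample)

print(cleaned.split())
# The hyphenated form is gone, the spaced form is present
print('art style' in cleaned, 'art-style' in cleaned)
```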

## Attention to Detail:

Aspects can be expressed in many ways, as individual words or as phrases. E.g. "Frame-rate" could appear in a review as "frame rate" or "framerate", and likewise for "Art-style". Here we will focus on individual terms and short phrases of no more than 3 words.
- By including the synonymous terms and phrases an aspect can have, we can obtain a more representative understanding of player sentiments.
- Some "aspects" can only be defined as a phrase, and such phrases can overlap. "Weapons" can be referenced in a review regarding their variety, visual design, or some other mechanic associated with them. Where possible we should investigate bigram or trigram groups for certain aspects to get finer, more detailed player sentiments.
- By accommodating bigrams and trigrams we could accidentally "double up" on the aspects being classified, so we need a way to filter out an individual word if it is part of any bigram or trigram that we define.

- This time around I will also be analysing sentiments on a per-sentence basis. This will speed up processing whilst removing the "cross-contamination" between 1-word aspects and aspect phrases. Look at the following sentence:
"The world exploration is great, there is plenty to discover in every area." This sentence contains 2 aspects: "world" and "world exploration". If the aspect "world" also appeared somewhere else in the same review, then when the model checks the sentiment for "world", the occurrence inside "world exploration" would "cross-contaminate" the result.
- A caveat of this approach is that an aspect could be brought up in one sentence and then discussed in another. As a result, if we only analyse sentences with aspects in them, I would probably expect the count of neutral sentiments to increase.
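
A minimal sketch of the per-sentence idea, using a hypothetical two-sentence review built around the example above. The regex sentence splitter is a stand-in for illustration only (the notebook itself uses nltk's `sent_tokenize`). Matching longest-first within each sentence keeps "world exploration" and "world" apart.

```python
import re

# Hypothetical review; the first sentence is the example from the text
review = ("the world exploration is great, there is plenty to discover in every area. "
          "the world itself feels a bit empty though.")

aspects = ['world', 'world exploration']
# Longest phrase first so 'world exploration' wins over 'world' at the same position
pattern = r'\b(' + '|'.join(sorted(aspects, key=len, reverse=True)) + r')\b'

# Naive splitter, an assumption for this sketch; not the tokenizer used in the notebook
sentences = re.split(r'(?<=[.!?])\s+', review)

# One list of detected aspects per sentence
per_sentence = [re.findall(pattern, s) for s in sentences]
print(per_sentence)
```

Each sentence now carries only the aspect it actually discusses, so the sentiment for "world" can be scored on the second sentence alone.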

------

Let's create a function that takes a string and finds the n-grams containing it across all the reviews. We can use these to identify certain "aspect phrases".


In [6]:
def find_aspect_ngram(x, aspect, extra_word='', n=3):
    # Return every n-gram of x that contains `aspect` (or `aspect + extra_word`) as a token, else None
    tokens = x.split()
    tuple_list = [tup for tup in ngrams(tokens, n) if aspect in tup or (aspect + extra_word) in tup]
    return tuple_list or None

def flatten(xss):
    # Flatten a list of lists into a single list
    return [x for xs in xss for x in xs]

In [None]:
lowercase_reviews.no_punc.apply(lambda x: 'texture' in x.split())

- We also need a function that prioritises the detection of bigrams in a text over the individual words that comprise them.

I'll test the function on the following sentence:

In [436]:
sentence = 'The weapon variety is excellent, especially when you consider the new fuse mechanic and how it enables crafting of new types of weapons. \
However I am not a fan of the weapon durability system, returning from the previous games. \
This is especially true if I have combined it with a precious material. \
I find myself simply holding onto a particular weapon instead of using it for fear of it breaking.'

sample_asps =['weapons','weapon','weapon variety', 'fuse', 'fuse mechanic', 'weapon durability']

- In a regex alternation, re.findall() tries the alternatives from left to right and keeps the first one that matches, not the longest. So we need a function that sorts the list in descending order of length, such that bigrams take priority over individual words.
- We also need a function that creates a pattern for each list of aspects that we can pass into re.findall()

In [7]:
def list_sort(aspect_list):
    x = sorted(aspect_list, key = len)
    x.reverse()
    return x

def create_pattern(aspect_list):
    return r'\b('+'|'.join(word for word in list_sort(aspect_list))+r')\b'


In [432]:
create_pattern(sample_asps) 

'\\b(weapon durability|weapon variety|fuse mechanic|weapons|weapon|fuse)\\b'

Hooray it works
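
To see why the ordering matters, here is a small sketch comparing the pattern built from the list as written with the one built from the length-sorted list. The `build` helper and the lowercase test sentence are assumptions made for this illustration; they are not part of the notebook.

```python
import re

# Hypothetical lowercase sentence for illustration
sentence = ('the weapon variety is excellent and the fuse mechanic is new, '
            'but weapon durability is back')
sample_asps = ['weapons', 'weapon', 'weapon variety', 'fuse', 'fuse mechanic', 'weapon durability']

def build(aspects):
    # Join the aspects into one alternation, in the order given
    return r'\b(' + '|'.join(aspects) + r')\b'

# Original order: the single word 'weapon' or 'fuse' matches before the longer phrase is tried
unsorted_hits = re.findall(build(sample_asps), sentence)

# Longest first: the phrases are tried before their component words
sorted_hits = re.findall(build(sorted(sample_asps, key=len, reverse=True)), sentence)

print(unsorted_hits)
print(sorted_hits)
```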

----------
Below:
- we'll create a column with the tokenized sentences for each review.
- create a function that detects aspects from a pre-defined list inside each sentence.

In [6]:
lowercase_reviews['sent_tokenized_reviews'] = lowercase_reviews.lowercase_reviews.apply(lambda x: sent_tokenize(x))

In [7]:

def detect_aspects(sentences, aspect_list):
    # Build the pattern once; create_pattern already sorts the aspects longest-first
    pattern = create_pattern(aspect_list)
    return [re.findall(pattern, sent) for sent in sentences]

In [443]:
detect_aspects(lowercase_reviews.sent_tokenized_reviews.iloc[0], sample_asps)

[[], ['weapons'], [], [], []]

To identify terms and phrases I'll use the following line of code.

In [17]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'star', n=2)).dropna().values.tolist()))

Counter({('same', 'star'): 2,
         ('5', 'star'): 1,
         ('star', 'all'): 1,
         ('honkai', 'star'): 1,
         ('star', 'rail'): 1,
         ('of', 'star'): 1,
         ('star', 'crossed'): 1,
         ('a', 'star'): 1,
         ('star', 'off'): 1,
         ('4', 'star'): 1,
         ('star', 'michelin'): 1,
         ('you', 'star'): 1,
         ('star', 'to'): 1,
         ('star', 'on'): 1,
         ('the', 'star'): 1,
         ('star', 'of'): 1,
         ('1', 'star'): 1,
         ('star', 'challenge'): 1,
         ('10', 'star'): 1,
         ('star', 'reviews'): 1,
         ('star', 'profile'): 1})

By passing any string to the function and setting a value for n, I get a selection of n-length tuples that contain the string and correspond to sequences appearing across ALL reviews. This allowed me to quickly identify aspect terms and phrases (along with an online thesaurus).
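
The same exploration can be sketched without nltk on a handful of made-up reviews. Here `simple_ngrams` is a hypothetical stand-in for `nltk.util.ngrams`, and the review strings are invented for illustration.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    # Stand-in for nltk.util.ngrams: consecutive n-token tuples
    return list(zip(*(tokens[i:] for i in range(n))))

def ngrams_containing(text, term, n=2):
    # Keep only the n-grams in which `term` appears as a whole token
    return [tup for tup in simple_ngrams(text.split(), n) if term in tup]

# Hypothetical reviews, already lowercased and stripped of punctuation
reviews = [
    'the frame rate drops in busy areas',
    'frame rate is mostly stable',
    'i love the art style',
]

counts = Counter(tup for r in reviews for tup in ngrams_containing(r, 'rate', n=2))
print(counts.most_common())
```

The most frequent tuples (here the pair of 'frame' and 'rate') are the candidates worth adding to an aspect list.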

---------


Now I'll go through each major area of the game, retrieving a list of aspects and synonymous terms and phrases.
Afterwards I'll create dataframes for each aspect group, leaving only the review text, the tokenized sentences, and a list of the aspects found within the review and within each sentence.
I will run the classifier in a separate .py file rather than here in the Jupyter notebook.

## Visual Presentation

Here is a selection of aspects we can use when discussing a game's visual presentation:
- 'Visuals'
- 'Graphics' / 'graphical fidelity'
- 'Frame-rate' / 'framerate' / 'frame rate' / 'technical performance' / 'performance' 
- 'Art-style' / 'art style' / 'artstyle' / 'art design'
- 'character design'
- 'textures' / 'texture' / 'texture quality' / 'texture resolution'
- 'level of detail' / 'lod' / 'draw distance' / 'draw-distance' 
- 'cutscenes' / 'cut scenes'

Instead of showing the list of tuples each time, I'm using the following cell, changing the string and the n value to determine what to use as aspects for Visual Presentation.

In [None]:
# Visuals 
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'performance', n=2)).dropna().values.tolist()))

In [9]:
presentation_aspects = ['visuals','visual','graphics','graphic','graphic fidelity', 'graphical fidelity', 'framerate', 'fps', 'performance', 'frame rate',  'frame drops',
                        'art style', 'art-style', 'art direction','artistic direction', 'artistic style',
                        'texture', 'textures','resolution','picture quality', 'draw distance', 'cut scenes','cinematics', 'cutscenes','cut-scenes',
                        'animation','animations']
presentation_agg_terms = [('Graphics',['graphics','graphic','graphic fidelity', 'graphical fidelity']),
                          ('Framerate',['framerate', 'fps', 'performance', 'frame rate',  'frame drops']),
                          ('Art Direction',['art style', 'art-style', 'art direction','artistic direction', 'artistic style']),
                          ('Textures',['texture', 'textures']),
                          ('Cut-Scenes',['cut scenes', 'cutscenes','cut-scenes','cinematics']),
                          ('Animation',['animation','animations']),
                          ('Visuals',['visuals','visual']),
                          ('Resolution',['resolution','picture quality']),
                          ]
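
The notebook does not show how the `(label, terms)` pairs are consumed downstream, so the following is only a sketch of one plausible use: inverting them into a term-to-label lookup so that every detected synonym maps to its aggregate aspect. The shortened `agg_terms` list and the `detected` list are hypothetical.

```python
# Shortened, hypothetical version of the aggregate-term structure
agg_terms = [
    ('Framerate', ['framerate', 'fps', 'frame rate']),
    ('Art Direction', ['art style', 'art direction']),
]

# Invert (label, [terms]) pairs into a term -> label dictionary
term_to_label = {term: label for label, terms in agg_terms for term in terms}

# Hypothetical aspects detected in a review
detected = ['fps', 'art style', 'frame rate']
labels = [term_to_label[t] for t in detected]
print(labels)
```

With a lookup like this, a term listed under two labels would silently keep only the last one, which is a reason to keep each term in exactly one group.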

In [None]:
lowercase_reviews['presentation_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(presentation_aspects),regex=True)

In [15]:
presentation_aspects_df = lowercase_reviews.query('presentation_aspects == True')
#presentation_aspects_df['aspects_within']= presentation_aspects_df.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(presentation_aspects), x))))

In [16]:
presentation_aspects_df = (
    presentation_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(presentation_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,presentation_aspects)))
                    .drop(columns=['no_punc','presentation_aspects','lowercase_reviews'])
)


In [652]:
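# The columns below were already dropped in the previous cell; errors='ignore' skips any that are missing
presentation_aspects_df.drop(columns=['no_punc','presentation_aspects','dup_mentions','lowercase_reviews'],inplace=True,errors='ignore')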

In [17]:
presentation_aspects_df.to_csv('totk_presentation_aspects.csv')
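
The flag, filter and extract steps above are repeated for every category that follows. As a rough sketch of the per-review logic only, here it is as one plain-Python function with no pandas. `extract_aspects` is a hypothetical name, the review is invented, and unlike the notebook (which flags on `no_punc` and extracts from the lowercase text) it uses the same text for both steps.

```python
import re

def create_pattern(aspect_list):
    # Longest terms first so phrases win over their component words
    ordered = sorted(aspect_list, key=len, reverse=True)
    return r'\b(' + '|'.join(ordered) + r')\b'

def extract_aspects(review, sentences, aspect_list):
    """Return (aspects_within, sentence_aspects) for one review, or None if nothing matches."""
    pattern = re.compile(create_pattern(aspect_list))
    if not pattern.search(review):
        return None  # equivalent to the review being filtered out by the boolean flag
    # Sorted for a deterministic order; the notebook keeps list(set(...)) as is
    aspects_within = sorted(set(pattern.findall(review)))
    sentence_aspects = [pattern.findall(s) for s in sentences]
    return aspects_within, sentence_aspects

# Hypothetical review, already split into sentences
sentences = ['the graphics are lovely.', 'shame about the frame rate.', 'still a great game.']
review = ' '.join(sentences)
result = extract_aspects(review, sentences, ['graphics', 'frame rate', 'fps'])
print(result)
```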

## Audio
- 'audio design' / 'sound design' / 'sound effects' / 'sfx'
- 'soundtrack' / 'music' / 'ost' / 'score' / 'themes'

In [80]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, '', n=2)).dropna().values.tolist()))

Counter()

In [9]:
audio_aspects = ['audio design','audio','sound design','soundtrack','sound effects', 'music score','musical score', 'ost','original soundtrack',
                 'ambient music', 'ambient soundtrack','music']

audio_aggterms = [('Sound Design', ['audio design','sound design','audio','sound effects'] ),
                   ('Soundtrack',['soundtrack', 'music score','musical score', 'ost','original soundtrack',
                 'ambient music', 'ambient soundtrack','music'])
                  ]

In [None]:
lowercase_reviews['audio_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(audio_aspects),regex=True)

In [14]:

audio_aspects_df = lowercase_reviews.query('audio_aspects == True')

audio_aspects_df = (
    audio_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(audio_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,audio_aspects)))
                    .drop(columns=['no_punc','audio_aspects','lowercase_reviews'])
)

In [15]:
audio_aspects_df.to_csv('totk_audio_aspects.csv')

## Combat

In [None]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'variety', n=2)).dropna().values.tolist()))

In [138]:
combat_aspects = ['combat system', 'combat', 'weapon durability', 'weapon breaking' ,'weapon break' ,'weapon breaks',
                  'weapon fusion', 'weapon breakage', 'weapons that break', 'breakable weapons', 'weapons still break', 
                  'weapon fusing', 'weapon crafting', 'weapon combinations','combine weapons','combine weapon' , 'weapon degradation', 
                  'weapon system', 'armour','clothing','clothes','outfit','outfits', 'battle system', 
                  'healing','heal','health', 'flurry rush', 'dodge', 'dodging','new weapons','new weapon','weapon variety','weapons variety' ]

combat_aggterms = [('Combat',['combat system', 'combat','battle system']),
                   ('Weapon Durability',['weapon durability', 'weapon breaking' ,'weapon break' ,'weapon breaks',
                                         'weapon breakage', 'weapons that break', 'breakable weapons', 'weapons still break',
                                         'weapon degradation', 'weapon system']),
                   ('Weapon Fusion',['weapon fusion','weapon fusing','weapon crafting','weapon combinations','combine weapons',
                                     'combine weapon']),
                   ('Health & Healing',['healing','heal','health']),
                   ('Armour and Clothing',['armour','clothing','clothes','outfit','outfits']),
                   ('Dodging and Flurry Rush',['flurry rush', 'dodge', 'dodging']),
                   ('Weapon Variety',['new weapons','new weapon','weapon variety','weapons variety'])]

In [140]:
count=0
for _,y in combat_aggterms:
    count += len(y)

count == len(combat_aspects)

True
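
The length comparison above only confirms that the two lists have the same number of entries. A set-based check, sketched below with hypothetical lists, also reports which terms differ. `check_groups` is an illustrative helper, not part of the notebook. The example data has equal lengths but hides a missing comma and an extra term.

```python
from collections import Counter

def check_groups(aspects, aggterms):
    # Flatten the grouped terms, then compare them against the flat aspect list
    grouped = [t for _, terms in aggterms for t in terms]
    return {
        'missing_from_groups': sorted(set(aspects) - set(grouped)),
        'missing_from_aspects': sorted(set(grouped) - set(aspects)),
        'duplicates': sorted(t for t, c in Counter(grouped).items() if c > 1),
    }

# Hypothetical lists; the missing comma silently joins two adjacent strings
aspects = ['characters', 'side characters', 'main character', 'voice acting']
aggterms = [('Characters', ['characters', 'side characters' 'main character']),
            ('Voice Acting', ['voice acting', 'voiceover'])]

grouped_count = sum(len(terms) for _, terms in aggterms)
print(grouped_count == len(aspects))  # the length check alone passes
report = check_groups(aspects, aggterms)
print(report)
```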

In [None]:
lowercase_reviews['combat_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(combat_aspects),regex=True)

In [141]:

combat_aspects_df = lowercase_reviews.query('combat_aspects == True')

combat_aspects_df = (
    combat_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(combat_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,combat_aspects)))
                    .drop(columns=['no_punc','combat_aspects','audio_aspects','lowercase_reviews'])
)

In [142]:
combat_aspects_df.to_csv('totk_combat_aspects.csv')

In [None]:
pd.read_csv('totk_combat_aspects.csv',index_col=0)

## Bosses and Enemies


In [None]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'like', n=2)).dropna().values.tolist()))

In [70]:
boss_enemy_aspects = ['bosses', 'final boss', 'final battle','dungeon bosses','boss fights','boss battles',
                      'temple boss', 'temple bosses', 'boss battle','last boss', 'boss encounter', 'colgera','gohma', 'seized construct','queen gibdo',
                      'phantom ganon','gloom hands', 'lynel','hinox','frox','gleeok', 'mini boss','mini bosses','mini-boss',
                     'bokoblins','moblins','lizalfos', 'enemy variety', 'monster variety','monster types','enemy types', 'variety of enemies',
                      'new enemies', 'enemy diversity', 'enemies']

boss_enemy_aggterms = [('Main Bosses',['bosses','dungeon bosses','boss fights','boss battles',
                                        'temple boss', 'temple bosses', 'boss battle',
                                       'boss encounter', 'colgera','gohma', 'seized construct','queen gibdo']),
                       ('Mini-Bosses',['phantom ganon','gloom hands', 'lynel','hinox','frox','gleeok', 'mini boss','mini bosses','mini-boss']),
                       ('Enemy Variety',['enemy variety', 'monster variety','monster types','enemy types', 'variety of enemies',
                      'new enemies', 'enemy diversity']),
                       ('Final Boss',['final boss', 'final battle','last boss']),
                       ('Enemies',['bokoblins','moblins','lizalfos','enemies'])]

In [None]:
lowercase_reviews['boss_enemy_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(boss_enemy_aspects),regex=True)

In [151]:

boss_enemy_aspects_df = lowercase_reviews.query('boss_enemy_aspects == True')

boss_enemy_aspects_df = (
    boss_enemy_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(boss_enemy_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,boss_enemy_aspects)))
                    .drop(columns=['no_punc','boss_enemy_aspects','audio_aspects','combat_aspects','lowercase_reviews'])
)

In [156]:
boss_enemy_aspects_df.to_csv('totk_bossenemy_aspects.csv')

In [144]:
count=0
for _,y in boss_enemy_aggterms:
    count += len(y)

count == len(boss_enemy_aspects)

True

In [155]:
lowercase_reviews.drop(columns=['audio_aspects','combat_aspects','boss_enemy_aspects'],inplace=True)

## Main Story and Characters

In [122]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, '', n=2)).dropna().values.tolist()))

Counter()

In [159]:
story_char_aspects = ['main story', 'main quest','main quests','main storyline', 'main mission', 'main plot', 'main objective',
                           'plot','story','storyline','characters','side characters','storytelling', 'main character', 'main characters',
                           'voice acting','voiceover','character design','character designs','narrative','narratives']

story_char_aggterms=[('Main Story',['main story','main storyline','main plot','plot','story','storyline',
                                    'storytelling','narrative','narratives']),
                     ('Main Quests',['main quest','main quests','main mission','main objective']),
                     ('Characters',['characters','side characters','main character', 'main characters']),
                     ('Character Design',['character design','character designs']),
                     ('Voice Acting',['voice acting','voiceover'])]

In [None]:
lowercase_reviews['story_char_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(story_char_aspects),regex=True)

In [161]:

story_char_aspects_df = lowercase_reviews.query('story_char_aspects == True')

story_char_aspects_df = (
    story_char_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(story_char_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,story_char_aspects)))
                    .drop(columns=['no_punc','story_char_aspects','lowercase_reviews'])
)

In [166]:
story_char_aspects_df.to_csv('totk_storychar_aspects.csv')

In [165]:
lowercase_reviews.drop(columns = ['story_char_aspects'],inplace=True)

## Abilities

In [49]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'upwards', n=2)).dropna().values.tolist()))

Counter({('move', 'upwards'): 1,
         ('upwards', 'across'): 1,
         ('and', 'upwards'): 1,
         ('upwards', 'to'): 1,
         ('snaps', 'upwards'): 1,
         ('upwards', 'all'): 1,
         ('travel', 'upwards'): 1,
         ('upwards', 'the'): 1})

In [50]:
abilities_aspects = ['abilities','powers', 
                     'fuse', 'fuse ability', 'fuse mechanic', 'fusion','fusion ability','fusing ability', 'fusing',
                     'ascend' ,'ascend ability', 'ascension',
                     'recall','recall ability','rewind','reverse mechanic', 'time reverse', 'time reversal', 'reversal mechanic'
                     ]
abilities_aggterms = [('Abilities',['abilities','powers']),
                      ('Fuse',['fuse', 'fuse ability', 'fuse mechanic', 'fusion','fusion ability','fusing ability', 'fusing']),
                      ('Ascend',['ascend' ,'ascend ability', 'ascension']),
                      ('Recall',['recall','recall ability','rewind','reverse mechanic', 'time reverse', 'time reversal', 'reversal mechanic'])]

In [51]:
count=0
for _,y in abilities_aggterms:
    count += len(y)

count == len(abilities_aspects)

True

In [None]:
lowercase_reviews['abilities_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(abilities_aspects),regex=True)

In [54]:

abilities_aspects_df = lowercase_reviews.query('abilities_aspects == True')

abilities_aspects_df = (
    abilities_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(abilities_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,abilities_aspects)))
                    .drop(columns=['no_punc','abilities_aspects','lowercase_reviews'])
)

In [56]:
abilities_aspects_df.to_csv('totk_abilities_aspects.csv')

In [57]:
lowercase_reviews.drop(columns = ['abilities_aspects'],inplace=True)

## Ultrahand & Building

In [67]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, '', n=2)).dropna().values.tolist()))

Counter()

In [69]:
ultrahand_aspects = ['ultra hand','ultrahand','ultrahand ability', 'ultra-hand',
                     'autobuild', 'auto build', 'auto building',
                     'build','construction', 'building system','building mechanic','building', 'vehicle building','building vehicles',
                     'zonai devices','zonau devices','zonau device', 'zonai gadget','zonnan gadgets',
                     ]

ultrahand_aggterms = [('Ultrahand',['ultra hand','ultrahand','ultrahand ability', 'ultra-hand']),
                      ('AutoBuild',['autobuild', 'auto build', 'auto building']),
                      ('Building/Construction',['build','construction', 'building system','building mechanic',
                                                'building', 'vehicle building','building vehicles']),
                      ('Zonai Devices',['zonai devices','zonau devices','zonau device', 'zonai gadget','zonnan gadgets'])]

In [70]:
count=0
for _,y in ultrahand_aggterms:
    count += len(y)

count == len(ultrahand_aspects)

True

In [None]:
lowercase_reviews['ultrahand_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(ultrahand_aspects),regex=True)


ultrahand_aspects_df = lowercase_reviews.query('ultrahand_aspects == True')

ultrahand_aspects_df = (
    ultrahand_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(ultrahand_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,ultrahand_aspects)))
                    .drop(columns=['no_punc','ultrahand_aspects','lowercase_reviews'])
)


In [73]:
ultrahand_aspects_df.to_csv('totk_ultrahand_aspects.csv')

lowercase_reviews.drop(columns = ['ultrahand_aspects'],inplace=True)

## Puzzles, Shrines, Dungeons

In [31]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, '', n=3)).dropna().values.tolist()))

Counter({('a', 'more', 'abstract'): 1,
         ('more', 'abstract', 'story'): 1,
         ('abstract', 'story', 'i'): 1})

In [32]:
puzzles_aspects = ['dungeons', 'dungeon','temple','temples',
                   'shrines','shrine','shrine quests','shrine quest',
                   'puzzle','puzzles','solving puzzles','problem solving'
                  ]

puzzles_aggterms = [('Dungeons', ['dungeons', 'dungeon','temple','temples']),
                    ('Shrines',['shrines','shrine','shrine quests','shrine quest']),
                    ('Puzzles',['puzzle','puzzles','solving puzzles','problem solving'])]


In [None]:
lowercase_reviews['puzzles_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(puzzles_aspects),regex=True)


puzzles_aspects_df = lowercase_reviews.query('puzzles_aspects == True')

puzzles_aspects_df = (
    puzzles_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(puzzles_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,puzzles_aspects)))
                    .drop(columns=['no_punc','puzzles_aspects','lowercase_reviews'])
)


In [37]:
puzzles_aspects_df.to_csv('totk_puzzles_aspects.csv')

lowercase_reviews.drop(columns = ['puzzles_aspects'],inplace=True)

## Side Quests and Side Adventures

In [60]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, '', n=3)).dropna().values.tolist()))

Counter()

In [59]:
sidecontent_aspects = ['side quests', 'side quest','sidequest','side-quest',
                       'side adventures', 'side adventure','adventures side', 'side stories',
                       'side content', 'side missions', 'find treasure','mini-games','minigames','mini games']

sidecontent_aggterms = [('Side Quests',['side quests', 'side quest','sidequest','side-quest']),
                        ('Side Adventures',['side adventures', 'side adventure','adventures side', 'side stories']),
                        ('Misc Terms',[ 'side content', 'side missions', 'find treasure','mini-games','minigames','mini games'])]



In [None]:
lowercase_reviews['sidecontent_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(sidecontent_aspects),regex=True)


sidecontent_aspects_df = lowercase_reviews.query('sidecontent_aspects == True')

sidecontent_aspects_df = (
    sidecontent_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(sidecontent_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,sidecontent_aspects)))
                    .drop(columns=['no_punc','sidecontent_aspects','lowercase_reviews'])
)


In [65]:
sidecontent_aspects_df.to_csv('totk_sidecontent_aspects.csv')

lowercase_reviews.drop(columns = ['sidecontent_aspects'],inplace=True)

In [66]:
del sidecontent_aspects_df

## World 

In [102]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'zora', n=2)).dropna().values.tolist()))

Counter({('the', 'zora'): 6,
         ('zora', 's'): 4,
         ('zora', 'and'): 3,
         ('and', 'zora'): 2,
         ('in', 'zora'): 1,
         ('de', 'zora'): 1,
         ('zora', 'cai'): 1,
         ('zora', 'area'): 1,
         ('goron', 'zora'): 1,
         ('zora', 'one'): 1,
         ('zora', 'dungeon'): 1,
         ('zora', 'missions'): 1,
         ('boots', 'zora'): 1,
         ('add', 'zora'): 1,
         ('zora', 'goron'): 1,
         ('fly', 'zora'): 1,
         ('zora', 'more'): 1,
         ('lanayaru', 'zora'): 1,
         ('zora', 'is'): 1})

In [140]:
world_aspects = ['open world','world building','the map','world map','new areas',
                 'the depths', 'underground areas','underground area','under ground','the undergound','underground','underworld','under world',
                 'sky','sky island','sky islands','sky areas','sky archipelago','the skies',
                 'caves','cave',
                 'wells',
                 'hyrule','world of hyrule','hyrule map']

world_aggterms = [('World',['open world','world building','the map','world map','new areas']),
                  ('Depths',['the depths', 'underground areas','underground area','under ground','the undergound','underground','underworld','under world']),
                  ('Sky',['sky','sky island','sky islands','sky areas','sky archipelago','the skies']),
                  ('Caves',['caves','cave']),
                  ('Wells',['wells']),
                  ('Hyrule',['hyrule','world of hyrule','hyrule map'])]

In [None]:
lowercase_reviews['world_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(world_aspects),regex=True)


world_aspects_df = lowercase_reviews.query('world_aspects == True')

world_aspects_df = (
    world_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(world_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,world_aspects)))
                    .drop(columns=['no_punc','world_aspects','lowercase_reviews'])
)


In [145]:
world_aspects_df.to_csv('totk_world_aspects.csv')

lowercase_reviews.drop(columns = ['world_aspects'],inplace=True)

In [146]:
del world_aspects_df

## Exploration and Traversal

In [None]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'mech', n=2)).dropna().values.tolist()))

In [157]:
exploration_aspects = ['world exploration', 'exploration','exploring','explore','exploring areas','explore areas',
                       'climbing','climb',
                       'horse','horses', 'horse riding','horseback riding',
                       'vehicles','vehicle','flying machine','bike','air bike','hover bike','hoverbike','car',
                       'battery', 'batteries','zonai charges', 'zonai charge',
                       'glide','gliding','paraglider','paragliding',
                       'diving','sky dive','sky diving','skydive','skydiving']

exploration_aggterms = [('Exploration',['world exploration', 'exploration','exploring','explore','exploring areas','explore areas']),
                        ('Climbing',['climbing','climb']),
                        ('Horses',['horse','horses', 'horse riding','horseback riding']),
                        ('Vehicles',['vehicles','vehicle','flying machine','bike','air bike','hover bike','hoverbike','car']),
                        ('Gliding',['glide','gliding','paraglider','paragliding']),
                        ('Skydiving',['diving','sky dive','sky diving','skydive','skydiving']),
                        ('Battery',['battery', 'batteries','zonai charges', 'zonai charge'])]

In [None]:
lowercase_reviews['exploration_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(exploration_aspects),regex=True)


exploration_aspects_df = lowercase_reviews.query('exploration_aspects == True')

exploration_aspects_df = (
    exploration_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(exploration_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,exploration_aspects)))
                    .drop(columns=['no_punc','exploration_aspects','lowercase_reviews'])
)


In [159]:
exploration_aspects_df.to_csv('totk_exploration_aspects.csv')

lowercase_reviews.drop(columns = ['exploration_aspects'],inplace=True)

In [160]:
del exploration_aspects_df

## Other

In [None]:
Counter(flatten(lowercase_reviews.no_punc.apply(lambda x: find_aspect_ngram(x, 'inventory', n=2)).dropna().values.tolist()))

In [169]:
other_aspects = ['user interface','ui','menu','menus','menu interface','interface','ability wheel',
         'gameplay', 'game play',
         'physics','physics engine','physic',
         'mechanics',
         'price','cost',
         'collectibles','collectible','koroks','korok seeds',
         'inventory']


other_aggterms = [('UI',['user interface','ui','menu','menus','menu interface','interface','ability wheel']),
                  ('Gameplay',['gameplay', 'game play']),
                  ('Physics',['physics','physics engine','physic']),
                  ('Mechanics',['mechanics']),
                  ('Price',['price','cost']),
                  ('Collectibles',['collectibles','collectible','koroks','korok seeds']),
                  ('Inventory',['inventory'])]

In [170]:
count=0
for _,y in other_aggterms:
    count += len(y)

count == len(other_aspects)

True

In [None]:
lowercase_reviews['other_aspects'] = lowercase_reviews.no_punc.str.contains(create_pattern(other_aspects),regex=True)


other_aspects_df = lowercase_reviews.query('other_aspects == True')

other_aspects_df = (
    other_aspects_df.assign(aspects_within= lambda df_: df_.lowercase_reviews.apply(lambda x: list(set(re.findall(create_pattern(other_aspects), x)))),
                            sentence_aspects = lambda df_: df_.sent_tokenized_reviews.apply(lambda x: detect_aspects(x,other_aspects)))
                    .drop(columns=['no_punc','other_aspects','lowercase_reviews'])
)


In [173]:
other_aspects_df.to_csv('totk_other_aspects.csv')

lowercase_reviews.drop(columns = ['other_aspects'],inplace=True)