In [1]:
import json
import pickle
from pathlib import Path

import yaml

import src.process as p
from src.models import Anime, RatingTag, Review

DATA_PATH = Path() / "data"

Set how many entries to analyse and import them.  
The dataset is the top 1000 ranked anime on MyAnimeList (MAL) as of 27/08/2023.

In [2]:
threshold = 1000
assert 1 <= threshold <= 1000
with open(DATA_PATH / "data.pickle", "rb") as f:
    DATA: list[Anime] = pickle.load(f)[:threshold]
len(DATA)

1000

For each anime, list the ratio of "Recommended" reviews to the total,  
together with its rank, id and title.
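
As a rough illustration of what such a ratio looks like, here is a minimal sketch on made-up data. It is not the implementation of `p.get_pct_rating`: the helper `recommended_ratio`, the sample entries, and the choice to express the ratio as a percentage (and to return 0.0 when there are no reviews) are all assumptions.

```python
from collections import Counter


def recommended_ratio(tags: list[str], target: str = "Recommended") -> float:
    """Percentage of reviews tagged `target` (0.0 when there are no reviews)."""
    if not tags:
        return 0.0
    return Counter(tags)[target] / len(tags) * 100


# Hypothetical entries: (rank, id, title, rating tags of the reviews)
sample = [
    (1, 101, "Title A", ["Recommended", "Recommended", "Mixed Feelings", "Recommended"]),
    (2, 102, "Title B", ["Not Recommended", "Recommended"]),
    (3, 103, "Title C", []),
]

# One row per anime, with the ratio in position 3
rows = [
    (rank, mal_id, title, recommended_ratio(tags))
    for rank, mal_id, title, tags in sample
]
for row in rows:
    print(row)
```

With this layout the ratio sits at index 3 of each row, which matches how the ratios are extracted when saving them below.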

In [3]:
pct = p.get_pct_rating(DATA, RatingTag.RECOMMENDED)

Store the full list as a YAML file, or the ratios alone as a JSON file.

In [4]:
with open(DATA_PATH / f"rating_pct_{threshold}.json", "w", encoding="utf-8") as f:
    # Index 3 of each entry holds the ratio; avoid shadowing `pct` in the comprehension
    json.dump([entry[3] for entry in pct], f)

In [5]:
with open(DATA_PATH / f"rating_pct_{threshold}.yaml", "w", encoding="utf-8") as f:
    yaml.dump(pct, f)

Get the top n reviews for each anime in the list, and count the distribution of rating tags.

Keep in mind we only collected the top 3 reviews for each rating tag: Recommended, Mixed Feelings, Not Recommended.

In this context, reviews are ranked based on the number of reactions they received.

Process this data to show:
* Number of anime whose top 3 reviews include at least one that is not "Recommended".  
* Number of anime whose top review is not "Recommended".  
* Number of anime whose top 3 reviews do not contain a "Recommended" review at all.
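
The three figures above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the code behind `p.top_n_ratings` and the related helpers: it assumes each anime is reduced to a tuple of rating tags ordered by number of reactions, that the tuples are tallied in a `collections.Counter`, and that anime without reviews are skipped.

```python
from collections import Counter

REC = "Recommended"

# Hypothetical input: rating tags of each anime's top 3 reviews,
# ordered from most to least reactions.
top_tags = [
    ("Recommended", "Recommended", "Recommended"),
    ("Recommended", "Mixed Feelings", "Recommended"),
    ("Not Recommended", "Recommended", "Recommended"),
    ("Mixed Feelings", "Not Recommended", "Not Recommended"),
    ("Recommended", "Recommended", "Recommended"),
]

# Frequency of each tag combination
count = Counter(top_tags)

# At least one of the top reviews is not "Recommended"
any_non_rec = sum(n for tags, n in count.items() if any(t != REC for t in tags))
# The single most reacted review is not "Recommended"
top_non_rec = sum(n for tags, n in count.items() if tags and tags[0] != REC)
# None of the top reviews is "Recommended" (empty tuples are skipped)
no_rec = sum(n for tags, n in count.items() if tags and REC not in tags)

print(any_non_rec, top_non_rec, no_rec)
```

Counting combinations first and deriving the totals afterwards keeps the full distribution available, so it can also be written to disk with `most_common()`.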

In [6]:
num_top = 3
count = p.top_n_ratings(DATA, num_top)

print(p.num_anime_with_non_positive(count))
print(p.num_top_not_positive(count))
print(p.num_no_positive(count))

450
173
26


Save the rating tag distribution of the top reviews to a file.

In [7]:
with open(DATA_PATH / f"top_{num_top}_reviews_frequency_{threshold}.txt", "w", encoding="utf-8") as f:
    f.write('\n'.join(f"{x[1]: <5} - {x[0]}" for x in count.most_common()))

Collect information about negative reviews that are among an anime's top 3 reviews:
* Anime rank.  
* Anime id.  
* Ranking of the review.  
* Number of reactions.  
* Percentage of combined "funny" and "confusing" reactions to the review.

("Funny" and "confusing" reactions are most commonly used to show disagreement with or disapproval of a review, as they are the only way to do so besides not reacting to the review at all.)
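
A minimal sketch of how the combined percentage could be derived from a review's reaction counts. The function name `mock_pct`, the dictionary keys, and the sample numbers are hypothetical; the actual computation lives in `p.negative_review_data` and may differ.

```python
def mock_pct(reactions: dict[str, int]) -> float:
    """Percentage of "funny" + "confusing" reactions over all reactions."""
    total = sum(reactions.values())
    if total == 0:
        # No reactions at all: report 0 rather than dividing by zero
        return 0.0
    mocking = reactions.get("funny", 0) + reactions.get("confusing", 0)
    return mocking / total * 100


# Hypothetical reaction counts for a single review
example = {"nice": 40, "love_it": 10, "funny": 30, "confusing": 20}
print(mock_pct(example))
```

A review with a high value here received mostly "funny" or "confusing" reactions, which under the interpretation above suggests readers disagreed with it.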

In [8]:
negative = p.negative_review_data(DATA)
print(len(negative))

263


Compute how many negative reviews have a percentage of combined "funny" and "confusing" reactions at or above a given threshold, and store the results on disk.

In [9]:
mock_ratios = range(5, 101, 5) # threshold % of funny/confusing reactions
with open(DATA_PATH / f"ratio_negative_mock_{threshold}.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(negative)}\n")
    for mock_ratio in mock_ratios:
        mocked = sum(1 for r in negative if r[4] >= mock_ratio)
        f.write(f"[{mock_ratio}] - {mocked} - {mocked / len(negative) * 100:.2f}%\n")


Bonus:  
* Anime with no reviews at all (9 in the top 1000).  
* Anime whose top reviews are all "Not Recommended" (2 in the top 1000).  

In [10]:
no_reviews = p.no_review(DATA)
for entry in no_reviews:
    print(entry)

(136, 54595, 'Kage no Jitsuryokusha ni Naritakute! 2nd Season')
(749, 52684, 'Shen Yin Wangzuo 2nd Season')
(790, 41462, 'BanG Dream! Film Live 2nd Stage')
(835, 42166, 'Violet Evergarden CM')
(844, 37029, 'Hoozuki no Reitetsu 2nd Season: Sono Ni')
(886, 29830, 'Tamayura: Sotsugyou Shashin Part 3 - Akogare')
(889, 6582, 'Tentai Senshi Sunred 2nd Season')
(903, 40372, 'Haikyuu!! (OVA)')
(929, 36796, 'Owarimonogatari 2nd Season Recaps')


In [11]:
all_negative = p.all_top_negative(DATA)
for entry in all_negative:
    print(entry)

(331, 48736, 'Sono Bisque Doll wa Koi wo Suru')
(729, 38408, 'Boku no Hero Academia 4th Season')
