## Artist objects with the same name, or with first and last names swapped

In [54]:
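from collections import Counter

# Count artists per normalized name: lowercasing and sorting the words makes
# "Hayao Miyazaki" and "Miyazaki Hayao" produce the same key.
c = Counter()
for name in Artist.objects.values_list('name', flat=True):
    c[' '.join(sorted(name.lower().split()))] += 1
# Histogram: how many normalized names occur exactly k times.
meta = Counter(c.values())
for k, v in meta.most_common():
    print(k, 'occurrence :', v, 'artists')
print()
# List every normalized name that occurs more than once.
for k, v in c.most_common():
    if v > 1:
        print(k, ':', v, 'occurrences')

1 occurrence : 2405 artists
2 occurrence : 42 artists
3 occurrence : 7 artists
4 occurrence : 1 artists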

harold sakuishi : 4 occurrences
araki hirohiko : 3 occurrences
kawahara reki : 3 occurrences
otonaka sawaki : 3 occurrences
akira toriyama : 3 occurrences
miki yoshikawa : 3 occurrences
sukeno yoshiaki : 3 occurrences
ai yazawa : 3 occurrences
miyazaki shinji : 2 occurrences
kunihiko yuyama : 2 occurrences
miyamoto yukihiro : 2 occurrences
togashi yoshihiro : 2 occurrences
hirano yoshihisa : 2 occurrences
isao takahata : 2 occurrences
masaaki yuasa : 2 occurrences
kei wakakusa : 2 occurrences
hayao miyazaki : 2 occurrences
morimi tomihiko : 2 occurrences
michiru ooshima : 2 occurrences
usui yoshito : 2 occurrences
hamaguchi shirou : 2 occurrences
kousaki satoru : 2 occurrences
kou ootani : 2 occurrences
hisaishi joe : 2 occurrences
naoko yamada : 2 occurrences
kensuke ushio : 2 occurrences
aki aoi : 2 occurrences
komi shinya : 2 occurrences
mariko nekono : 2 occurrences
murata s
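
The normalization used in the cell above can be tried without a database. The following is a minimal sketch, assuming a hypothetical helper name `name_key` and a hand-written list of sample names (neither comes from the notebook); it only illustrates why swapped first and last names end up under the same key.

```python
from collections import Counter


def name_key(name):
    """Return a case-insensitive, word-order-insensitive key for a name."""
    return ' '.join(sorted(name.lower().split()))


# Hypothetical sample names, not taken from the database.
names = ['Hayao Miyazaki', 'Miyazaki Hayao', 'hayao  miyazaki', 'Isao Takahata']

counts = Counter(name_key(n) for n in names)
# Keep only the keys that are shared by more than one spelling.
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {'hayao miyazaki': 3}
```

Note that this key also merges two genuinely different people whose names use the same words, so the counts are candidates for review rather than confirmed duplicates.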

## How many works share a reference?

In [74]:
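from django.db.models import Count

def describe(queryset):
    """Summarize a queryset annotated with `nb` (number of rows per value).

    Prints "<rows> oeuvres partagent <values> champs", i.e.
    "<rows> works share <values> field values".
    """
    shared = queryset.filter(nb__gte=2)  # values used by at least two rows
    nb_distinct_fields = shared.count()
    nb_duplicates = sum(shared.values_list('nb', flat=True))
    print('{:d} oeuvres partagent {:d} champs'.format(nb_duplicates, nb_distinct_fields))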

In [75]:
queryset = Reference.objects.values('url').annotate(nb=Count('url')).order_by('-nb')
queryset[:5]

<QuerySet [{'url': 'http://myanimelist.net/anime/28907', 'nb': 5}, {'url': 'http://myanimelist.net/anime/29786', 'nb': 4}, {'url': 'http://myanimelist.net/anime/30296', 'nb': 4}, {'url': 'http://myanimelist.net/anime/19489', 'nb': 4}, {'url': 'http://myanimelist.net/anime/30415', 'nb': 4}]>

In [76]:
Reference.objects.filter(url='http://myanimelist.net/anime/28907').values('work__title')

<QuerySet [{'work__title': 'Gate: Jieitai Kanochi nite, Kaku Tatakaeri'}, {'work__title': 'GATE'}, {'work__title': 'GATE'}, {'work__title': 'Gate: Jieitai Kanochi nite, Kaku Tatakaeri'}, {'work__title': 'Gate: Jieitai Kanochi nite, Kaku Tatakaeri'}]>

In [77]:
describe(queryset)

1279 oeuvres partagent 606 champs


So 1279 works share 606 references, which would mean we have at least 1279 - 606 = 673 duplicates caused by this problem. 😅
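
The subtraction above can be checked on a small example. This is a sketch under the assumption that each shared reference should belong to exactly one work, so a group of n rows with the same URL contributes n - 1 surplus rows. The function name `describe_counts` and the sample URLs are hypothetical and only mirror what `describe` computes, without Django.

```python
from collections import Counter


def describe_counts(values):
    """Return (rows in shared groups, distinct shared values, surplus rows)."""
    counts = Counter(values)
    # Keep only values that appear at least twice, like filter(nb__gte=2).
    shared = {value: n for value, n in counts.items() if n >= 2}
    nb_rows = sum(shared.values())
    nb_values = len(shared)
    return nb_rows, nb_values, nb_rows - nb_values


# Hypothetical reference URLs: one used three times, one twice, one once.
urls = ['u/1', 'u/1', 'u/1', 'u/2', 'u/2', 'u/3']
print(describe_counts(urls))  # (5, 2, 3)

# Same arithmetic with the figures reported by the notebook.
print(1279 - 606)  # 673
```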

## How many works share the same poster?

In [83]:
queryset = Work.objects.filter(ext_poster__endswith='.jpg').values('ext_poster').annotate(nb=Count('ext_poster')).order_by('-nb')
queryset[:5]

<WorkQuerySet [{'ext_poster': 'https://myanimelist.cdn-dena.com/images/anime/12/82325.jpg', 'nb': 3}, {'ext_poster': 'http://myanimelist.cdn-dena.com/images/anime/2/75559.jpg', 'nb': 3}, {'ext_poster': 'https://myanimelist.cdn-dena.com/images/anime/8/83284.jpg', 'nb': 3}, {'ext_poster': 'https://myanimelist.cdn-dena.com/images/anime/5/85224.jpg', 'nb': 3}, {'ext_poster': 'https://myanimelist.cdn-dena.com/images/anime/5/72868.jpg', 'nb': 3}]>

In [84]:
describe(queryset)

433 oeuvres partagent 200 champs
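
The top results above mix `http://` and `https://` poster URLs, and the query groups on the exact string, so the same image served under both schemes would count as two different posters. The following is a rough sketch of grouping on a scheme-independent key instead; the helper name `poster_key` and the sample list (including the http/https pair) are assumptions for illustration, not data from the notebook.

```python
from collections import Counter
from urllib.parse import urlsplit


def poster_key(url):
    """Return host and path only, ignoring the URL scheme."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path


# Hypothetical poster URLs: the first two differ only by scheme.
posters = [
    'https://myanimelist.cdn-dena.com/images/anime/12/82325.jpg',
    'http://myanimelist.cdn-dena.com/images/anime/12/82325.jpg',
    'https://myanimelist.cdn-dena.com/images/anime/8/83284.jpg',
]

exact = Counter(posters)
normalized = Counter(poster_key(url) for url in posters)
print(max(exact.values()))       # 1
print(max(normalized.values()))  # 2
```

Whether this would change the 433 / 200 figures depends on the actual data, which this sketch does not inspect.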


## How many works share the same AniDB ID?

In [89]:
queryset = Work.objects.exclude(anidb_aid=0).values('anidb_aid').annotate(nb=Count('anidb_aid')).order_by('-nb')
queryset[:5]

<WorkQuerySet [{'anidb_aid': 4932, 'nb': 6}, {'anidb_aid': 8778, 'nb': 3}, {'anidb_aid': 7525, 'nb': 2}, {'anidb_aid': 5841, 'nb': 2}, {'anidb_aid': 6671, 'nb': 2}]>

In [90]:
describe(queryset)

35 oeuvres partagent 15 champs


## What fraction of works have an AniDB ID?

In [93]:
Work.objects.exclude(anidb_aid=0).count() / Work.objects.count()

0.019168918049659076

## What fraction of works have at least one Reference?

In [100]:
Work.objects.annotate(nb=Count('reference')).values('id').filter(nb__gte=1).count() / Work.objects.count()

0.6370127363952142