# COGS 108 - Final Project

### Overview

Several of our group members are interested in Japanese anime and manga. When we look for a new series to binge, it is hard to find ones that suit our tastes. This motivates the question we set out to answer.

### Names

- Aaron Lee
- Cassie Zhu
- Tina Chen
- Dillon Handal
- Emery Lin

### Research Question

Can we predict whether or not a user will like a certain anime or manga based on previous ratings?

### Background & Prior Work

This project is similar to movie recommendation, except that we examine a different medium and dataset, so many of the same concepts apply. For example, this whitepaper, https://beta.vu.nl/nl/Images/werkstuk-fernandez_tcm235-874624.pdf, discusses several approaches to a Netflix movie recommender system. A basic model can simply recommend the most popular titles; however, this approach may be inaccurate because individual users may not like popular titles. For a more precise prediction, one can measure how similar users are to each other, or measure the similarity between titles in terms of synopsis or content.

In search of an accurate predictor, the whitepaper compares different models. The results of that comparison give us a rough roadmap of approaches to explore. We can try several types of models using the data available in our dataset, then plot and analyze the results to draw conclusions about our question.
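
As a rough illustration of the two ideas above (a popularity baseline and a user-similarity model), here is a minimal sketch on a toy ratings table. The table, the function names, and the choice of cosine similarity over co-rated titles are all assumptions made for illustration; they are not taken from the whitepaper or from our dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (rows = users, columns = titles); NaN = not rated.
ratings = pd.DataFrame(
    {'Title A': [9, 8, np.nan],
     'Title B': [7, np.nan, 3],
     'Title C': [np.nan, 9, 2]},
    index=['user1', 'user2', 'user3'],
)

def popularity_baseline(ratings):
    """Rank titles by their mean rating across all users."""
    return ratings.mean(axis=0).sort_values(ascending=False)

def cosine_similarity(u, v):
    """Cosine similarity computed only over titles both users have rated."""
    both = u.notna() & v.notna()
    if not both.any():
        return 0.0
    a = u[both].to_numpy(dtype=float)
    b = v[both].to_numpy(dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_rating(ratings, user, title):
    """Similarity-weighted average of other users' ratings for one title."""
    target = ratings.loc[user]
    num, den = 0.0, 0.0
    for _, row in ratings.drop(index=user).iterrows():
        if pd.isna(row[title]):
            continue  # this neighbour has not rated the title
        sim = cosine_similarity(target, row)
        num += sim * row[title]
        den += abs(sim)
    return num / den if den else float('nan')

print(popularity_baseline(ratings))
print(predict_rating(ratings, 'user1', 'Title C'))
```

Cosine similarity on raw, all-positive scores tends to make every pair of users look alike, so subtracting each user's mean score before comparing would likely be a sensible refinement.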


### Hypothesis

Our hypothesis is that, given a user's ratings list, we can predict whether that user would like a new anime or manga based on the attributes of the titles already on the list. We expect to see a strong correlation with attributes such as genre, popularity, and content.

We believe this hypothesis because we expect these features to be correlated with each other and with how users rate titles, which would make them relevant for predicting high ratings.

# Dataset

Our datasets consist of public data only. We chose public data because it is the most readily available and easy to search. Our main source is the MyAnimeList (MAL) website because it has the largest consolidated collection of anime and manga titles as well as user rating lists.

https://myanimelist.net/ - According to statistics on the MyAnimeList forum, the site lists over 6.5 million users. While a small fraction of these may be dummy or throwaway accounts, the majority of users have actual ratings lists that we can use in our analysis. If we need to collect this data ourselves, we can use an API, but we will also need to clean the dataset so that it is usable and free of odd inputs.
	
https://www.kaggle.com/CooperUnion/anime-recommendations-database - This dataset from Kaggle has already been parsed and had sensitive information filtered out. It contains useful information that we can examine, and we do not have to clean it before use.

https://jikan.moe/ - This API interfaces with MAL's website, so we can get useful information without scraping and parsing the site manually. It will be important for obtaining more features to examine in answering our question.

# Data Analysis

In [130]:
%matplotlib inline

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
from bs4 import BeautifulSoup
from time import sleep

In [184]:
# Configure libraries
# The seaborn library makes plots look nicer
sns.set()
sns.set_context('talk')

# Don't display too many rows/cols of DataFrames
pd.options.display.max_rows = 10
pd.options.display.max_columns = 10

# Round decimals when displaying DataFrames
pd.set_option('display.precision', 2)

We scrape MAL for its recent users and collect the information they choose to list publicly on their profiles and ratings lists.

In [118]:
# Useful functions for getting user data

def get_recent_users():
    """ Returns a list of recent usernames (empty if the request fails) """
    r = requests.get('https://myanimelist.net/users.php', timeout=10)

    # Return an empty list on failure so callers can still iterate
    if r.status_code != 200:
        return []

    soup = BeautifulSoup(r.text, 'lxml')

    # Finds the table cells that hold recent users
    td = soup.find_all('td', attrs={'align':'center', 'class':'borderClass'})

    # Collects the username text from each cell
    return [e.find('div').text for e in td]
    
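def getUserInformation(user):
    """ Returns the raw profile JSON for a user, or None on failure """
    if user:
        r = requests.get('https://api.jikan.moe/v3/user/{}/profile'.format(user), timeout=10)
        if r.status_code == 200:
            return r.text
    return None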

def insertInfoIntoDataFrame(df, info):
    """ Adds one user's profile stats to df and returns the new DataFrame """
    # Nothing to add if the profile request failed
    if info is None:
        return df

    info = json.loads(info)
    local = pd.DataFrame({'username': [info['username']],
                          'location': [info['location']],
                          'Anime-Watching': [info['anime_stats']['watching']],
                          'Anime-Completed': [info['anime_stats']['completed']],
                          'Anime-Total': [info['anime_stats']['total_entries']],
                          'Anime-Mean-Score': [info['anime_stats']['mean_score']],
                          'Manga-Reading': [info['manga_stats']['reading']],
                          'Manga-Completed': [info['manga_stats']['completed']],
                          'Manga-Total': [info['manga_stats']['total_entries']],
                          'Manga-Mean-Score': [info['manga_stats']['mean_score']]})
    local = local.set_index('username')
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    return pd.concat([df, local])
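
Growing a DataFrame one row at a time copies the whole table on every call. As an alternative, the sketch below collects one dict per user and builds the DataFrame once at the end. The helper name `profile_to_row` and the sample payloads are hypothetical; the payloads only mimic the fields that the functions above read and are not real API responses.

```python
import json
import pandas as pd

def profile_to_row(raw):
    """Flatten one profile JSON string into a dict of the fields we keep."""
    info = json.loads(raw)
    anime, manga = info['anime_stats'], info['manga_stats']
    return {
        'username': info['username'],
        'location': info['location'],
        'Anime-Watching': anime['watching'],
        'Anime-Completed': anime['completed'],
        'Anime-Total': anime['total_entries'],
        'Anime-Mean-Score': anime['mean_score'],
        'Manga-Reading': manga['reading'],
        'Manga-Completed': manga['completed'],
        'Manga-Total': manga['total_entries'],
        'Manga-Mean-Score': manga['mean_score'],
    }

def make_sample(name, total):
    """Build a made-up payload with the same shape as the fields read above."""
    stats = {'watching': 1, 'reading': 1, 'completed': total - 1,
             'total_entries': total, 'mean_score': 7.5}
    return json.dumps({'username': name, 'location': None,
                       'anime_stats': stats, 'manga_stats': stats})

# Hypothetical payloads standing in for downloaded profiles
samples = [make_sample('example_user_1', 12), make_sample('example_user_2', 30)]

# Collect plain dicts first, then construct the DataFrame in one step
rows = [profile_to_row(raw) for raw in samples]
user_info = pd.DataFrame(rows).set_index('username')
print(user_info)
```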

In [132]:
userInfo = pd.DataFrame()

# Insert column headers to userInfo DataFrame
userInfo['username'] = ""
userInfo['location'] = None
userInfo['Anime-Watching'] = 0
userInfo['Anime-Completed'] = 0
userInfo['Anime-Total'] = 0
userInfo['Anime-Mean-Score'] = 0
userInfo['Manga-Reading'] = 0
userInfo['Manga-Completed'] = 0
userInfo['Manga-Total'] = 0
userInfo['Manga-Mean-Score'] = 0
userInfo = userInfo.set_index('username')

In [239]:
skip = True

# Grabs a bunch of recent users and inserts them into the userInfo DataFrame
for i in range(20):
    
    # variable to skip downloading as this takes a long time
    if skip:
        continue
    
    # waits 2 seconds between profile requests and 4 seconds between batches of recent users
    for user in get_recent_users():
        userInfo = insertInfoIntoDataFrame(userInfo, getUserInformation(user))
        sleep(2)
    sleep(4)

In [240]:
if not skip:
    userInfo.to_csv('userInfo.csv')
else:
    # restore the username index so that row.name is the username again
    userInfo = pd.read_csv('userInfo.csv', index_col='username')

In [241]:
# functions for parsing anime list
def getAnimeList(user):
    """ Returns the raw anime list JSON for a user, or None on failure """
    if user:
        r = requests.get('https://api.jikan.moe/v3/user/{}/animelist'.format(user), timeout=10)
        if r.status_code == 200:
            return r.text
    return None

def insertAnimeUser(df, info, name):
    if info is None:
        return df
    
    info = json.loads(info)
    titles = []
    ratings = []
    
    # insert all anime titles to titles list
    # insert all ratings to ratings list
    for anime in info['anime']:
        titles.append(str(anime['title']))
        ratings.append(anime['score'])
    
    local = pd.DataFrame({'anime': titles,
                          name: ratings})
    local = local.set_index('anime')
    df = df.combine_first(local)
    return df
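
To make the merging step concrete, the sketch below shows how `combine_first` builds up a title-by-user matrix one user at a time. The helper `add_user_column` and the toy titles and scores are hypothetical, and the sketch assumes each title appears at most once in a user's list, since duplicate index labels would make the alignment fail.

```python
import pandas as pd

def add_user_column(matrix, titles, scores, username):
    """Merge one user's scores into a title-by-user matrix."""
    column = pd.DataFrame({username: scores},
                          index=pd.Index(titles, name='anime'))
    # combine_first takes the union of rows and columns and fills
    # cells missing in `matrix` with values from `column`
    return matrix.combine_first(column)

matrix = pd.DataFrame()
matrix = add_user_column(matrix, ['Title A', 'Title B'], [8, 6], 'user1')
matrix = add_user_column(matrix, ['Title B', 'Title C'], [7, 9], 'user2')
print(matrix)
```

Titles a user has not listed end up as NaN in that user's column, so the matrix grows sparser as more users are added.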

In [242]:
animeUser = pd.DataFrame()
animeInfo = pd.DataFrame()

# pull user animelists
for index, row in userInfo.iterrows():
    username = row.name
    
    if skip:
        continue
    
    # grab anime list
    animeUser = insertAnimeUser(animeUser, getAnimeList(username), username)
    sleep(1)

In [243]:
if not skip:
    animeUser.to_csv('animeUser.csv')
else:
    # the first CSV column holds the anime titles that were used as the index
    animeUser = pd.read_csv('animeUser.csv', index_col=0)
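
Since the research question asks whether a user will like a title, the score matrix eventually has to become like / not-like labels. The sketch below is one possible way to do that, not a settled method. It assumes that a score of 0 means the user listed the title without rating it, and the cutoff of 7 for "like" is an arbitrary choice; both should be checked against the real data.

```python
import numpy as np
import pandas as pd

LIKE_THRESHOLD = 7  # assumption: scores of 7 or more count as "liked"

def to_like_labels(scores, threshold=LIKE_THRESHOLD):
    """Map a score matrix to 1.0 (liked), 0.0 (not liked), or NaN (unrated)."""
    # Assumption: a score of 0 means "listed but not rated"
    rated = scores.where(scores > 0)
    liked = (rated >= threshold).astype(float)
    # Restore NaN wherever there was no real rating
    return liked.where(rated.notna())

# Hypothetical title-by-user scores
sample = pd.DataFrame(
    {'user1': [8, 0, np.nan], 'user2': [5, 7, 10]},
    index=['Title A', 'Title B', 'Title C'],
)
labels = to_like_labels(sample)
print(labels)
```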

# Privacy / Ethics Considerations

For this project, we use publicly available data. However, the data needs to be cleaned because it contains location data and usernames. We will need to check that no other revealing pieces of data remain. Once this is done, the data we use and display will not reveal any confidential information. Aside from this, we are in compliance with the Terms of Service provided by MyAnimeList, so our data usage will be ethically safe. Furthermore, users on MyAnimeList can restrict who may view their ratings lists, so users who do not want their ratings used by third-party applications can make their lists private.
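
One way to carry out this cleaning step is sketched below: drop the location column and replace each username with a salted hash. The function name `anonymize_users`, the `user_id` label, and the salt value are hypothetical. Hashing is closer to pseudonymization than to full anonymization, so the salt should be kept private and never published alongside the data.

```python
import hashlib
import pandas as pd

def anonymize_users(df, salt):
    """Drop location data and replace the username index with hashed IDs."""
    out = df.drop(columns=['location'], errors='ignore')
    # A salted SHA-256 digest gives a stable ID that does not expose the name
    out.index = [
        hashlib.sha256((salt + str(name)).encode('utf-8')).hexdigest()[:12]
        for name in out.index
    ]
    out.index.name = 'user_id'
    return out

# Hypothetical profile table indexed by username
sample = pd.DataFrame(
    {'location': ['Somewhere', None], 'Anime-Total': [12, 30]},
    index=pd.Index(['example_user_1', 'example_user_2'], name='username'),
)
anon = anonymize_users(sample, salt='example-salt')
print(anon)
```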

The results of this project do not contain sensitive user information because everything is anonymized. The purpose of the project is only to see whether we can draw meaningful correlations between anime or manga titles in order to make predictions. However, the results may be biased because we sample only a subset of the total users on MAL. The sampled users may prefer certain kinds of anime over others, which would skew the analysis and predictions.

# Conclusion & Discussion

