In [8]:
import os 
import sys
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, trim
from pyspark.sql.types import StringType, IntegerType
from numpy.testing import assert_equal, assert_allclose
from IPython.display import HTML
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
HTML('''
<script
    src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js ">
</script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
''')


<h1 style="text-align:center;">Anime Movie Acquisition Exploratory Data Analysis</h1>
<hr>

<a name="top"></a>
#### Table of Contents:

[ref0]: #abs
- [Executive Summary][ref0]

[ref1]: #prob_stat
- [Problem Statement][ref1]

[ref2]: #obj
- [Objectives][ref2]

[ref3]: #dat_set
- [Dataset][ref3]

[ref4]: #dat_prep
- [Data Preprocessing][ref4]

[ref5]: #dat_expl
- [Data Exploration][ref5]

[ref6]: #conc
- [Conclusion and Recommendations][ref6]


***

<a name="abs"></a>
## Executive Summary
This section summarizes the notebook, briefly covering the problem statement and the most significant insights.
***
A leading movie streaming service aims to enhance its content library by adding high-quality anime movies. The marketing team wants to prioritize top-rated anime films for licensing agreements in order to attract the anime fanbase and increase user engagement and subscription rates. To support this, an Exploratory Data Analysis (EDA) was conducted using PySpark SQL on the Anime_rank dataset sourced from MyAnimeList. The primary objective was to identify up to 10 top-ranked anime movies with a rating score of 7.5 or higher and at least one million members.

After preprocessing (cleaning and type standardization), the analysis identified the movies that meet both criteria, including "Koe no Katachi (A Silent Voice)," "Kimi no Na wa. (Your Name)," "Tenki no Ko (Weathering With You)," "Tonari no Totoro (My Neighbor Totoro)," "Sen to Chihiro no Kamikakushi (Spirited Away)," "Howl no Ugoku Shiro (Howl's Moving Castle)," and "Mononoke Hime (Princess Mononoke)," among others. These findings will guide the marketing team in selecting popular and critically acclaimed anime films for licensing, supporting targeted marketing efforts aimed at a dedicated anime audience.


[ref]: #top
[Back to Table of Contents][ref]

<a name="prob_stat"></a>
## Problem Statement
This section describes the issue that needs to be addressed.
***
A leading movie streaming service is looking to expand its content library by incorporating high-quality anime movies. To achieve this, the marketing team aims to identify the top-rated anime movies to prioritize for licensing agreements with the respective studios. This strategic move is expected to attract a significant anime fanbase, enhancing the platform's user engagement and subscription rates. By leveraging the Anime_rank dataset, we will perform an Exploratory Data Analysis (EDA) with PySpark SQL.

[ref]: #top
[Back to Table of Contents][ref]

<a name="obj"></a>
## Objectives
This section lists the intended outcomes of the study.
***
1. Data Collection
    - Gather a comprehensive dataset of anime titles, including rank, stream type, episode count, start and end dates, member counts, and scores.

2. Top 10 Identification
    - Determine which anime movies have one million or more members and a rating score of 7.5 or higher, and list up to 10 of them by rank.

[ref]: #top
[Back to Table of Contents][ref]

<a name="dat_set"></a>
## Dataset
This section describes the origin of the dataset and gives a brief overview of it.
***
The data originates from MyAnimeList, a popular anime database and community site. The dataset used in this study is publicly available on Kaggle:
[Top Anime Ranked](https://www.kaggle.com/datasets/manishpingale/top-anime-ranked)


[ref]: #top
[Back to Table of Contents][ref]

<a name="dat_prep"></a>
## Data Preprocessing
This section shows how the dataset is cleaned and how its columns are prepared before analysis.
***

#### Spark Instantiation

In [9]:
spark = SparkSession.builder \
    .appName("EDA with PySpark SQL") \
    .getOrCreate()

In [11]:
anime_spark_df = spark.read.csv(
    'Anime_rank.csv',
    header=True, inferSchema=True
)

#### Data Viewing

In [12]:
anime_spark_df.printSchema()
anime_spark_df.show(15)

root
 |-- UID: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Rank: string (nullable = true)
 |-- Stream type: string (nullable = true)
 |-- Episodes: string (nullable = true)
 |-- Start date: string (nullable = true)
 |-- End date: string (nullable = true)
 |-- Members: string (nullable = true)
 |-- Score: string (nullable = true)

+---+--------------------+----+-----------+--------+----------+---------+----------+-----+
|UID|               Title|Rank|Stream type|Episodes|Start date| End date|   Members|Score|
+---+--------------------+----+-----------+--------+----------+---------+----------+-----+
|  1|   Sousou no Frieren|   1|        TV |    28.0| Sep 2023 | Mar 2024|  800,615 | 9.35|
|  2|Fullmetal Alchemi...|   2|        TV |    64.0| Apr 2009 | Jul 2010|3,373,923 | 9.09|
|  3|         Steins;Gate|   3|        TV |    24.0| Apr 2011 | Sep 2011|2,584,616 | 9.07|
|  4|            Gintama°|   4|        TV |    51.0| Apr 2015 | Mar 2016|  636,631 | 9.06|
|  5|Sh

In [13]:
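# Remove thousands separators (e.g. "3,373,923") and cast Members to integer
anime_spark_df = anime_spark_df.withColumn(
    'Members', regexp_replace('Members', ',', '').cast('int')
)
# Cast Score from string to float
anime_spark_df = anime_spark_df.withColumn('Score', col('Score').cast('float'))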
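
The two casts above can be read as the following plain-Python logic. This is only an illustrative sketch: `parse_members` and `parse_score` are hypothetical helper names that do not appear in the notebook, and returning `None` for unparseable values assumes Spark's non-ANSI cast behaviour, where a failed cast yields null.

```python
from typing import Optional


def parse_members(raw: Optional[str]) -> Optional[int]:
    """Strip thousands separators and whitespace, then convert to int."""
    if raw is None:
        return None
    cleaned = raw.replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        # Mirrors a failed cast producing null (assumption: non-ANSI mode)
        return None


def parse_score(raw: Optional[str]) -> Optional[float]:
    """Convert a score string such as '9.09' to float."""
    if raw is None:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        return None


# Values taken from the table shown earlier, plus one non-numeric value
print(parse_members("3,373,923 "))
print(parse_score("9.09"))
print(parse_members("Movie"))
```

Values that cannot be parsed end up as nulls, which matters later because the null-handling step drops any row containing one.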

In [14]:
anime_spark_df.printSchema()

root
 |-- UID: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Rank: string (nullable = true)
 |-- Stream type: string (nullable = true)
 |-- Episodes: string (nullable = true)
 |-- Start date: string (nullable = true)
 |-- End date: string (nullable = true)
 |-- Members: integer (nullable = true)
 |-- Score: float (nullable = true)



In [15]:
anime_spark_df = anime_spark_df.withColumn('Stream type', trim(col('Stream type')))

In [16]:
distinct_stream_types = anime_spark_df.select('Stream type').distinct()
distinct_stream_types.show()

+-----------+
|Stream type|
+-----------+
|         TV|
|    Special|
|       6683|
|        OVA|
|      Movie|
|        ONA|
|       3457|
| TV Special|
+-----------+
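
The numeric values `6683` and `3457` under `Stream type` suggest that some rows were parsed with their columns shifted. One plausible cause, which is an assumption and not confirmed by the source, is a title containing an unquoted comma, so that the `Rank` value lands in the `Stream type` column. A minimal sketch with Python's `csv` module, where the second row and its title `Example, The Movie` are made up for illustration:

```python
import csv
import io

HEADER = ["UID", "Title", "Rank", "Stream type", "Episodes",
          "Start date", "End date", "Members", "Score"]

# Expected categories, taken from the distinct values shown above
EXPECTED_TYPES = {"TV", "Movie", "OVA", "ONA", "Special", "TV Special"}

# First row follows the table shown earlier; the second row is hypothetical
# and has an unquoted comma inside its title.
raw = (
    '1,Sousou no Frieren,1,TV ,28.0,Sep 2023 , Mar 2024,"800,615 ",9.35\n'
    '6683,Example, The Movie,6683,Movie ,1.0,Jan 2000 , Jan 2000,"1,234 ",7.00\n'
)


def parse_rows(text):
    """Map each CSV line onto HEADER; fields beyond the header are dropped."""
    return [dict(zip(HEADER, fields)) for fields in csv.reader(io.StringIO(text))]


def unexpected_stream_types(rows):
    """Return rows whose trimmed Stream type is not a known category."""
    return [r for r in rows if r["Stream type"].strip() not in EXPECTED_TYPES]


rows = parse_rows(raw)
bad = unexpected_stream_types(rows)
for r in bad:
    print(r["UID"], repr(r["Title"]), repr(r["Stream type"]))
```

If this assumption holds, such rows never match the `Movie` filter applied next, so any movie whose title was split this way would be silently left out of the results.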



#### Filtering

In [17]:
anime_spark_df.filter(col('Stream type') == 'Movie').show(truncate=False)

+---+-------------------------------------------------------+----+-----------+--------+----------+---------+-------+-----+
|UID|Title                                                  |Rank|Stream type|Episodes|Start date|End date |Members|Score|
+---+-------------------------------------------------------+----+-----------+--------+----------+---------+-------+-----+
|6  |Gintama: The Final                                     |6   |Movie      |1.0     |Jan 2021  | Jan 2021|153317 |9.04 |
|16 |Koe no Katachi                                         |16  |Movie      |1.0     |Sep 2016  | Sep 2016|2355457|8.93 |
|21 |Gintama Movie 2: Kanketsu-hen - Yorozuya yo Eien Nare  |21  |Movie      |1.0     |Jul 2013  | Jul 2013|242729 |8.9  |
|27 |Violet Evergarden Movie                                |27  |Movie      |1.0     |Sep 2020  | Sep 2020|621318 |8.86 |
|28 |Kimi no Na wa.                                         |28  |Movie      |1.0     |Aug 2016  | Aug 2016|2768297|8.84 |
|35 |Kizumonogat

In [18]:
anime_spark_df_movies = anime_spark_df.filter(col('Stream type') == 'Movie')

In [19]:
anime_spark_df_movies.show(5)

+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
|UID|               Title|Rank|Stream type|Episodes|Start date| End date|Members|Score|
+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
|  6|  Gintama: The Final|   6|      Movie|     1.0| Jan 2021 | Jan 2021| 153317| 9.04|
| 16|      Koe no Katachi|  16|      Movie|     1.0| Sep 2016 | Sep 2016|2355457| 8.93|
| 21|Gintama Movie 2: ...|  21|      Movie|     1.0| Jul 2013 | Jul 2013| 242729|  8.9|
| 27|Violet Evergarden...|  27|      Movie|     1.0| Sep 2020 | Sep 2020| 621318| 8.86|
| 28|      Kimi no Na wa.|  28|      Movie|     1.0| Aug 2016 | Aug 2016|2768297| 8.84|
+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
only showing top 5 rows



#### Null Handling

In [20]:
anime_spark_df_movies = anime_spark_df_movies.dropna()

In [21]:
anime_spark_df_movies.show(5)

+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
|UID|               Title|Rank|Stream type|Episodes|Start date| End date|Members|Score|
+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
|  6|  Gintama: The Final|   6|      Movie|     1.0| Jan 2021 | Jan 2021| 153317| 9.04|
| 16|      Koe no Katachi|  16|      Movie|     1.0| Sep 2016 | Sep 2016|2355457| 8.93|
| 21|Gintama Movie 2: ...|  21|      Movie|     1.0| Jul 2013 | Jul 2013| 242729|  8.9|
| 27|Violet Evergarden...|  27|      Movie|     1.0| Sep 2020 | Sep 2020| 621318| 8.86|
| 28|      Kimi no Na wa.|  28|      Movie|     1.0| Aug 2016 | Aug 2016|2768297| 8.84|
+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
only showing top 5 rows



<a name="dat_expl"></a>
## Data Exploration
This section queries the preprocessed DataFrame with Spark SQL to extract the movies that meet the membership and score thresholds.
***

In [22]:
anime_spark_df_movies.createOrReplaceTempView('anime_movies')

In [24]:
# Rank is stored as a string, so cast it to an integer to sort numerically
query = """
    SELECT *
    FROM anime_movies
    WHERE Members >= 1000000 AND Score >= 7.5
    ORDER BY CAST(Rank AS INT) ASC
    LIMIT 10
"""
top_anime_movies = spark.sql(query)
top_anime_movies.show(10)

+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
|UID|               Title|Rank|Stream type|Episodes|Start date| End date|Members|Score|
+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
| 16|      Koe no Katachi|  16|      Movie|     1.0| Sep 2016 | Sep 2016|2355457| 8.93|
| 28|      Kimi no Na wa.|  28|      Movie|     1.0| Aug 2016 | Aug 2016|2768297| 8.84|
| 40|Sen to Chihiro no...|  40|      Movie|     1.0| Jul 2001 | Jul 2001|1871461| 8.77|
| 69| Howl no Ugoku Shiro|  69|      Movie|     1.0| Nov 2004 | Nov 2004|1353704| 8.66|
| 70|       Mononoke Hime|  70|      Movie|     1.0| Jul 1997 | Jul 1997|1268189| 8.66|
| 99|Kimetsu no Yaiba ...|  99|      Movie|     1.0| Oct 2020 | Oct 2020|1564434| 8.58|
|294|         Tenki no Ko| 294|      Movie|     1.0| Jul 2019 | Jul 2019|1012751| 8.27|
|322|    Tonari no Totoro| 322|      Movie|     1.0| Apr 1988 | Apr 1988|1075024| 8.25|
+---+--------------------+----+-----------+--------+----------+---------+-------+-----+
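
Because `Rank` is read as a string (see the schema printed earlier), sorting on the raw column compares text character by character, so `294` sorts before `40`. A small sketch of the difference between text and numeric ordering, using the rank values of the movies in the table above:

```python
# Rank values of the qualifying movies, held as strings like the Rank column
ranks_as_text = ["16", "28", "294", "322", "40", "69", "70", "99"]

# Text ordering: compares character by character
text_order = sorted(ranks_as_text)

# Numeric ordering: convert each value to int before comparing
numeric_order = sorted(ranks_as_text, key=int)

print(text_order)     # ['16', '28', '294', '322', '40', '69', '70', '99']
print(numeric_order)  # ['16', '28', '40', '69', '70', '99', '294', '322']
```

In Spark SQL, the numeric ordering corresponds to sorting on `CAST(Rank AS INT)` rather than on `Rank` directly.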

<a name="conc"></a>
## Conclusion and Recommendations
***

**<u>Conclusion</u>**
    
The filtered table shows that the movies with at least one million members and a score of 7.5 or higher are both widely followed and highly rated. Iconic titles such as "Kimi no Na wa." (Your Name), "Sen to Chihiro no Kamikakushi" (Spirited Away), and "Mononoke Hime" (Princess Mononoke) feature prominently, highlighting their widespread appeal and enduring popularity. Release years range from 1988 ("Tonari no Totoro", My Neighbor Totoro) to 2020 ("Kimetsu no Yaiba", Demon Slayer), indicating that well-received anime films have been produced across several decades.
    
**<u>Recommendations:</u>**

1. **Leverage Popular Titles**: Use the popularity of iconic titles to promote newer releases through collaborations, re-releases, or sequels.

2. **Analyze Trends**: Further analysis of attributes such as genres, directors, and studios, which this dataset does not include, can provide deeper insights into the factors driving success and guide future licensing and marketing strategies.

3. **Uncover Hidden Gems**: Titles with a moderate following but a relatively high score, for example "Gintama: The Final" (score 9.04, 153,317 members), may be easier to license and could still add watch time.

[ref]: #top
[Back to Table of Contents][ref]