In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_parquet("anime_data_feature_engineered_v2.parquet")

In [4]:
print(df.columns)

Index(['anime_id', 'anime_url', 'title', 'synopsis', 'main_pic', 'type',
       'source_type', 'num_episodes', 'status', 'start_date', 'end_date',
       'season', 'studios', 'genres', 'score', 'score_count', 'score_rank',
       'popularity_rank', 'members_count', 'favorites_count', 'watching_count',
       'completed_count', 'on_hold_count', 'dropped_count',
       'plan_to_watch_count', 'total_count', 'score_10_count',
       'score_09_count', 'score_08_count', 'score_07_count', 'score_06_count',
       'score_05_count', 'score_04_count', 'score_03_count', 'score_02_count',
       'score_01_count', 'clubs', 'pics', 'duration', 'release_year'],
      dtype='object')


In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

Most of the features are not numerical, yet they need to be included because they plausibly play a large part in predicting the rating an anime will receive: genres, studios, title, and so on. Some of these features are also multi-valued. For example, an anime rarely belongs to a single genre; it is usually a combination of several.
1. First, convert the categorical data into numbers by one-hot encoding it (each category becomes its own 0/1 column).
2. Then split each multi-valued feature into one column per value: 1 if that value is present for the anime, otherwise 0.
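
The two steps above can be sketched on a tiny made-up frame. This is only an illustration under assumptions: the `toy` frame and its values are hypothetical, and the notebook itself may end up using a different encoder.

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset has many more columns.
toy = pd.DataFrame({
    "type": ["TV", "Movie", "TV"],
    "genres": [["Comedy", "Romance"], ["Drama"], ["Comedy", "Drama"]],
})

# Step 1: single-valued categorical feature -> one 0/1 column per category.
type_dummies = pd.get_dummies(toy["type"], prefix="type", dtype=int)

# Step 2: multi-valued feature -> explode to one row per (anime, genre),
# one-hot encode, then collapse back to one row per anime with max().
genre_dummies = (
    pd.get_dummies(toy["genres"].explode(), prefix="genre", dtype=int)
    .groupby(level=0)
    .max()
)

encoded = pd.concat([type_dummies, genre_dummies], axis=1)
print(encoded)
```

The same explode-then-group pattern would apply to `studios` once it is stored as a list.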


### 🧠 Feature Selection – What to Keep & Why

One of the key steps in building a predictive model is carefully selecting relevant features. Below is a rationale for each column:

1. **anime_id** – Not required. This is just a unique identifier used for referencing, not a meaningful feature. ❌
2. **anime_url** – Also not useful for prediction. It's metadata and holds no modeling value. ❌
3. **title** – Potentially valuable, as certain keywords might influence user interest. However, converting titles to numerical features requires NLP techniques. Needs further exploration.
4. **synopsis** – Highly relevant. Synopses play a significant role in user decisions, and with proper text vectorization, this can be a powerful feature. ✅
5. **main_pic** – While visuals matter to users, image-based modeling is outside the scope of this project for now. ❌
6. **type** – Very important. The format (TV, Movie, OVA, etc.) often affects user expectations and engagement. ✅
7. **source_type** – Likely not impactful for our prediction task. ❌
8. **num_episodes** – Clearly relevant. Longer or shorter series may have different levels of appeal. ✅
9. **status** – Can be useful. Whether the anime is ongoing, completed, or yet to air might affect viewership and ratings. Worth keeping. ✅
10. **start_date** – May provide insight into seasonal trends or temporal patterns in anime popularity. ✅
11. **end_date** – Same rationale as `start_date`; could help derive duration or detect seasonal effects. ✅
12. **season** – Initially dismissed due to missing values, but could be engineered using start/end dates. Might capture seasonal viewing trends. ✅
13. **studios** – Important. Studios have varying reputations and fan bases which can impact popularity and perception. ✅
14. **genres** – Highly important. User preferences often align strongly with genre. ✅
15. **score** – Our **target** variable. 🎯

---

### ✨ Engineered Features

In addition to existing features, a few new ones can be derived:

- **duration** – Number of days between start and end date; may correlate with format or production choices.
- **release_year** – Extracted from `start_date`, this can help analyze trends across different time periods.
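
The notebook does not show the cell that built these two columns, so the following is only a sketch of one plausible derivation. It assumes `start_date` and `end_date` are already datetime columns, and the two sample rows reuse dates that appear in the notebook's own output.

```python
import pandas as pd

# Two sample rows; dates copied from the first two rows of the dataset output.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2021-01-04", "2015-10-02"]),
    "end_date": pd.to_datetime(["2021-03-22", "2015-12-25"]),
})

# duration: whole days between first and last air date (NaN if end_date is missing).
toy["duration"] = (toy["end_date"] - toy["start_date"]).dt.days

# release_year: calendar year of the first air date.
toy["release_year"] = toy["start_date"].dt.year

print(toy)
```

The resulting durations (77 and 84 days) agree with the `duration` values shown for those rows elsewhere in the notebook.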


### 🎯 Choosing the Right Target Variable

We need to clearly define what exactly we want to predict. Here are the main options:

---

#### 🔹 **Option 1: `score`** – *The user rating (e.g., 8.79)*
**What it means**: The average rating given by MyAnimeList users.

**✅ Pros:**
- Sounds natural: "predict how good people think this anime is."
- Continuous value → great for regression models.

**⚠️ Cons:**
- Can be biased by niche anime with few votes.
- May not reflect popularity or mass appeal (some anime are highly rated but barely watched).

**🧠 Best for**: Predicting perceived *quality*.

---

#### 🔹 **Option 2: `popularity_rank` or `members_count`** – *The mass appeal*
**What it means**: How many users added it to their list (watched, plan to watch, etc.).

**✅ Pros:**
- Reflects how "mainstream" or widely known the anime is.
- May correlate well with features like genre, studio, or release season.

**⚠️ Cons:**
- Heavily affected by recency and hype (new shows haven't had time to build views).
- Highly skewed: a few shows dominate, most remain niche.

**🧠 Best for**: Understanding *reach* or popularity.

---

#### 🔹 **Option 3: `score_count`** – *The engagement metric*
**What it means**: Number of users who actually rated the anime.

**✅ Pros:**
- Good hybrid of quality and popularity: reflects both viewership and user effort.

**⚠️ Cons:**
- Still affected by anime recency and whether rating was available to users.

**🧠 Best for**: Capturing *viewer engagement*.

---
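
One quick way to compare candidate targets is to measure how skewed each one is. The sketch below uses made-up numbers purely for illustration (they are not taken from the dataset). It shows how a log transform such as `np.log1p` can reduce the heavy right tail of a count-based target like `members_count`.

```python
import numpy as np
import pandas as pd

# Illustrative values only: one hugely popular title among mostly niche ones.
toy = pd.DataFrame({
    "score": [6.5, 6.2, 7.8, 8.5, 7.2, 6.8],
    "members_count": [1_200, 800, 45_000, 2_500_000, 3_000, 15_000],
})

score_skew = toy["score"].skew()
raw_skew = toy["members_count"].skew()
log_skew = np.log1p(toy["members_count"]).skew()

print(f"score skew:               {score_skew:.2f}")
print(f"members_count skew:       {raw_skew:.2f}")
print(f"log1p(members_count) skew: {log_skew:.2f}")
```

If a count-based target were chosen, modeling the log-transformed value would be a common option; `score` usually needs no such transform.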

In [6]:
df['score'].head(20)

0     6.50
1     6.20
2     6.39
3     7.82
4     8.46
5     8.62
6     7.24
7     7.25
8     6.78
9     6.76
10    6.84
11    8.13
12    6.64
13    6.33
14    8.12
15    7.34
16    8.53
17    7.91
18    6.78
19    8.25
Name: score, dtype: float64

In [7]:
df['popularity_rank'].head(20)

0      1719
1      3349
2      1890
3       609
4       228
5        22
6     13494
7     11412
8     10840
9     11062
10    10740
11      672
12     2891
13     2634
14     1498
15     1331
16      469
17      306
18      295
19       58
Name: popularity_rank, dtype: int64

In [8]:
df['duration'].head(20)

0      77.0
1      84.0
2      77.0
3     175.0
4      77.0
5     161.0
6     127.0
7     175.0
8      63.0
9     189.0
10     70.0
11    161.0
12    315.0
13    357.0
14     70.0
15     70.0
16    168.0
17     77.0
18     77.0
19     84.0
Name: duration, dtype: float64

In [6]:
# Count how often each studio appears
studio_counts = df['studios'].value_counts()
print(studio_counts)

studios
Unknown                                         1493
Toei Animation                                   462
Sunrise                                          338
J.C.Staff                                        309
Madhouse                                         282
                                                ... 
Yamato Works                                       1
Digital Network Animation                          1
Production I.G|Signal.MD|Production GoodBook       1
Studio Rikka|Purple Cow Studio Japan               1
Signal.MD|Sublimation                              1
Name: count, Length: 1086, dtype: int64


In [7]:

# See top 10 studios
print(studio_counts.head(10))




studios
Unknown              1493
Toei Animation        462
Sunrise               338
J.C.Staff             309
Madhouse              282
Studio Deen           231
Studio Pierrot        217
Production I.G        205
A-1 Pictures          187
TMS Entertainment     186
Name: count, dtype: int64


In [8]:
# If you want all unique studio names
unique_studios = df['studios'].unique()
print(unique_studios)

['LIDENFILMS|Felix Film' 'Tomovies' 'AIC' ... 'Production I.G|Zexcs'
 'Sunrise|Bandai Visual' 'Fifth Avenue']


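There is a big problem with `studios`: a single value can hold several studio names separated by `|`, so a co-production such as `Production I.G|Signal.MD|Production GoodBook` is counted as if it were its own studio. Splitting on `|` turns each value into a list of individual studio names.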

In [3]:
df['studios'] = df['studios'].str.split('|').apply(lambda x: [s.strip() for s in x])

In [4]:
df.to_parquet("anime_data_feature_engineered_v2.parquet", index=False)


In [5]:
df2 = pd.read_parquet("anime_data_feature_engineered_v2.parquet")

In [7]:
studio_counts = df2['studios'].value_counts()
print(studio_counts)

studios
[Unknown]                         1493
[Toei Animation]                   462
[Sunrise]                          338
[J.C.Staff]                        309
[Madhouse]                         282
                                  ... 
[J.C.Staff, Egg Firm]                1
[Studio Pierrot, Pierrot Plus]       1
[Passione, Creators in Pack]         1
[Gallop, Studio Deen]                1
[Nippon Animation, Xebec]            1
Name: count, Length: 1359, dtype: int64


Now `studios` is a list, just like `genres`. Let's engineer another feature called `studio_rank_score`, based on how many times each studio name occurs in the dataset.

In [8]:
df_exploded = df.explode('studios').copy()


In [9]:
df_exploded['studios'] = df_exploded['studios'].str.strip()


In [10]:
studio_counts = df_exploded['studios'].value_counts()
studio_freq_map = studio_counts.to_dict()


In [11]:
df['studio_rank_score'] = df['studios'].apply(
    lambda studio_list: sum(studio_freq_map.get(studio, 0) for studio in studio_list)
)


In [12]:
df.to_parquet("anime_data_feature_engineered_v2.parquet", index=False)

In [13]:
df3 = pd.read_parquet("anime_data_feature_engineered_v2.parquet")

In [14]:
df3['studio_rank_score'].head()

0     72
1      4
2    144
3     35
4     43
Name: studio_rank_score, dtype: int64

🔍 What is studio_rank_score?
It is a numerical score assigned to each anime based on how often its studios appear in the dataset, calculated like this:

✅ For each row in the dataset (i.e., for each anime):
1. studios is a list of studios (e.g., ['Bones', 'Aniplex'])
2. Look up the frequency of each studio in the entire dataset (i.e., how many anime it has worked on)
3. Sum those frequencies

🧠 Example:
Let's say studio_counts looks like this:
Bones: 100
Aniplex: 80
Studio DEEN: 25

Now imagine an anime has:
df.loc[42, 'studios'] = ['Bones', 'Aniplex']

Then its studio_rank_score will be:
100 (Bones) + 80 (Aniplex) = 180

If another anime has:
['Studio DEEN']
Then its studio_rank_score = 25

✅ So to summarize:
- studio_rank_score is per anime
- it is based on how many times the studio(s) appear in the entire dataset
- if there are multiple studios, it adds up all their frequencies
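
The worked example above can be reproduced in a few lines. The frequencies are the illustrative ones from the example (not real counts), and the titles and the third studio are hypothetical, added only to show what happens for a studio missing from the frequency map.

```python
import pandas as pd

# Illustrative frequencies from the worked example, not real dataset counts.
studio_freq_map = {"Bones": 100, "Aniplex": 80, "Studio DEEN": 25}

toy = pd.DataFrame({
    "title": ["Anime A", "Anime B", "Anime C"],  # hypothetical titles
    "studios": [["Bones", "Aniplex"], ["Studio DEEN"], ["Some New Studio"]],
})

# Sum the dataset-wide frequency of every studio credited on the anime;
# studios missing from the map contribute 0.
toy["studio_rank_score"] = toy["studios"].apply(
    lambda studio_list: sum(studio_freq_map.get(s, 0) for s in studio_list)
)

print(toy[["title", "studio_rank_score"]])
```

Because the frequencies are summed, the score also grows with the number of studios credited. Taking the mean or the maximum instead would be an alternative if that effect is unwanted.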



**Total features engineered: 3**

In [15]:
print(df.dtypes)

anime_id                        int64
anime_url                      object
title                          object
synopsis                       object
main_pic                       object
type                           object
source_type                    object
num_episodes                    int64
status                         object
start_date             datetime64[ns]
end_date               datetime64[ns]
season                         object
studios                        object
genres                         object
score                         float64
score_count                   float64
score_rank                    float64
popularity_rank                 int64
members_count                   int64
favorites_count                 int64
watching_count                  int64
completed_count                 int64
on_hold_count                   int64
dropped_count                   int64
plan_to_watch_count             int64
total_count                     int64
score_10_cou

In [16]:
# Your selected features
selected_features = [
    'type', 'num_episodes', 'status', 'start_date', 'end_date', 
    'season', 'studios', 'genres', 'score', 'release_year',
    'studio_rank_score', 'duration'
]

# Slice the DataFrame
df_selected = df[selected_features]

# Save to CSV
df_selected.to_csv('anime_data_selected_features.csv', index=False)

# Save to Parquet
df_selected.to_parquet('anime_data_selected_features.parquet', index=False)


In [23]:
df['season'].head(100)

0     Winter 2021
1       Fall 2015
2     Summer 2011
3       Fall 2004
4     Winter 2021
         ...     
95      Fall 1996
96    Winter 1977
97    Spring 1969
98      Fall 1980
99      Fall 2007
Name: season, Length: 100, dtype: object

In [18]:
df['status'].head()

0    Finished Airing
1    Finished Airing
2    Finished Airing
3    Finished Airing
4    Finished Airing
Name: status, dtype: object

In [19]:
print("Nulls in 'status':", df['status'].isnull().sum())
print("Nulls in 'season':", df['season'].isnull().sum())


Nulls in 'status': 0
Nulls in 'season': 6964


There are too many nulls in `season` (6,964 rows). This feature is the season in which the anime started airing; for example, anime that started in January 2020 have the season `Winter 2020`. So let's derive the missing values from `start_date`.

In [24]:
df[df['season'].isnull()][['anime_id', 'start_date', 'season']].head(5)


      anime_id  start_date  season
3683     42603  2020-08-16
3684     44752  2020-12-10
3685     51013  2022-02-12
3686     50462  2020-09-30
3687     45643  2021-01-06


In [25]:
# Step 1: Create a function to map month to season
def get_season_from_month(month):
    if month in [1, 2, 3]:
        return "Winter"
    elif month in [4, 5, 6]:
        return "Spring"
    elif month in [7, 8, 9]:
        return "Summer"
    elif month in [10, 11, 12]:
        return "Fall"
    else:
        return None

# Step 2: Fill null seasons based on start_date
def fill_season(row):
    if pd.isnull(row['season']) and pd.notnull(row['start_date']):
        month = row['start_date'].month
        year = row['start_date'].year
        season_name = get_season_from_month(month)
        if season_name:
            return f"{season_name} {year}"
    return row['season']  # keep original if not null

# Apply it
df['season'] = df.apply(fill_season, axis=1)
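
The row-wise `apply` above works, but it calls a Python function once per row. The sketch below is a possible vectorized alternative, not the notebook's method: it uses the datetime quarter (Q1 = Winter, Q2 = Spring, Q3 = Summer, Q4 = Fall), which matches the month groups in `get_season_from_month`. The helper name `fill_missing_season` and the toy frame are assumptions; the dates are the five rows inspected above plus one row that already has a season.

```python
import pandas as pd

QUARTER_TO_SEASON = {1: "Winter", 2: "Spring", 3: "Summer", 4: "Fall"}

def fill_missing_season(frame: pd.DataFrame) -> pd.Series:
    """Return `season` with nulls derived from `start_date`; existing values are kept."""
    season = frame["season"].copy()
    # Only touch rows where season is missing and a start date is available.
    mask = season.isna() & frame["start_date"].notna()
    dates = frame.loc[mask, "start_date"]
    season.loc[mask] = (
        dates.dt.quarter.map(QUARTER_TO_SEASON) + " " + dates.dt.year.astype(str)
    )
    return season

toy = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2020-08-16", "2020-12-10", "2022-02-12",
        "2020-09-30", "2021-01-06", "2004-10-05",
    ]),
    "season": [None, None, None, None, None, "Fall 2004"],
})
toy["season"] = fill_missing_season(toy)
print(toy)
```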


In [28]:
anime_ids_to_check = [42603, 44752, 51013, 50462, 45643]
df[df['anime_id'].isin(anime_ids_to_check)][['anime_id', 'start_date', 'season']]


      anime_id  start_date       season
3683     42603  2020-08-16  Summer 2020
3684     44752  2020-12-10    Fall 2020
3685     51013  2022-02-12  Winter 2022
3686     50462  2020-09-30  Summer 2020
3687     45643  2021-01-06  Winter 2021


In [27]:
print("Nulls in 'season':", df['season'].isnull().sum())

Nulls in 'season': 0


In [29]:
# Save to CSV
df.to_csv('anime_data_selected_features.csv', index=False)

# Save to Parquet
df.to_parquet('anime_data_selected_features.parquet', index=False)

In [30]:
df5 = pd.read_parquet('anime_data_selected_features.parquet')

In [31]:
anime_ids_to_check = [42603, 44752, 51013, 50462, 45643]
df5[df5['anime_id'].isin(anime_ids_to_check)][['anime_id', 'start_date', 'season']]

      anime_id  start_date       season
3683     42603  2020-08-16  Summer 2020
3684     44752  2020-12-10    Fall 2020
3685     51013  2022-02-12  Winter 2022
3686     50462  2020-09-30  Summer 2020
3687     45643  2021-01-06  Winter 2021


In [33]:
# Your selected features
selected_features = [
    'type', 'num_episodes', 'status', 'start_date', 'end_date', 
    'season', 'studios', 'genres', 'score', 'release_year',
    'studio_rank_score', 'duration'
]

# Slice the DataFrame
df_selected = df[selected_features]

# Save to CSV
df_selected.to_csv('anime_data_selected_featuresv2.csv', index=False)

# Save to Parquet
df_selected.to_parquet('anime_data_selected_featuresv2.parquet', index=False)

In [34]:
df6 = pd.read_parquet('anime_data_selected_featuresv2.parquet')

In [35]:
df6.head()

Unnamed: 0,type,num_episodes,status,start_date,end_date,season,studios,genres,score,release_year,studio_rank_score,duration
0,TV,12,Finished Airing,2021-01-04,2021-03-22,Winter 2021,"[LIDENFILMS, Felix Film]","[Adventure, Fantasy, Girls Love, Mystery, Sci-Fi]",6.5,2021,72,77.0
1,TV,13,Finished Airing,2015-10-02,2015-12-25,Fall 2015,[Tomovies],"[Horror, Mystery, Supernatural, Suspense]",6.2,2015,4,84.0
2,TV,12,Finished Airing,2011-07-10,2011-09-25,Summer 2011,[AIC],"[Comedy, Romance, Ecchi, Harem, School]",6.39,2011,144,77.0
3,TV,26,Finished Airing,2004-10-05,2005-03-29,Fall 2004,[Studio Comet],"[Comedy, Romance, School, Shounen]",7.82,2004,35,175.0
4,TV,12,Finished Airing,2021-01-06,2021-03-24,Winter 2021,[White Fox],"[Drama, Fantasy, Suspense, Psychological]",8.46,2021,43,77.0


In [37]:
# Check for nulls in each column
print("🔍 Null Values in Each Column:\n")
print(df6.isnull().sum())
print("\n" + "-"*50 + "\n")

🔍 Null Values in Each Column:

type                  0
num_episodes          0
status                0
start_date            0
end_date             28
season                0
studios               0
genres                0
score                 0
release_year          0
studio_rank_score     0
duration             28
dtype: int64

--------------------------------------------------



In [38]:
# Filter rows where end_date and duration are null
missing_end = df6[df6['end_date'].isnull() & df6['duration'].isnull()]

# If anime_id is available, include it too for clarity
columns_to_check = ['anime_id', 'status', 'start_date', 'end_date', 'duration'] if 'anime_id' in df6.columns else ['status', 'start_date', 'end_date', 'duration']

# Save to CSV
missing_end[columns_to_check].to_csv("missing_enddate_duration_check.csv", index=False)

print("✅ CSV saved as 'missing_enddate_duration_check.csv' with the relevant columns.")


✅ CSV saved as 'missing_enddate_duration_check.csv' with the relevant columns.


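All rows with a missing `end_date` and `duration` have the status `Currently Airing`, so the nulls are expected: these shows simply have not finished yet. Perfect.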
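
The same check can be done in code rather than by opening the exported CSV. This is a minimal sketch on a hypothetical three-row frame; the column names match the notebook, but the rows are made up.

```python
import pandas as pd

# Hypothetical rows: one finished show and two that are still airing.
toy = pd.DataFrame({
    "status": ["Finished Airing", "Currently Airing", "Currently Airing"],
    "start_date": pd.to_datetime(["2021-01-04", "2022-10-01", "2023-01-07"]),
    "end_date": pd.to_datetime(["2021-03-22", None, None]),
})
toy["duration"] = (toy["end_date"] - toy["start_date"]).dt.days

# Rows where both end_date and duration are missing.
missing_end = toy[toy["end_date"].isnull() & toy["duration"].isnull()]

# Count statuses among those rows and confirm they are all still airing.
status_counts = missing_end["status"].value_counts()
all_airing = bool(missing_end["status"].eq("Currently Airing").all())

print(status_counts)
print("All missing end dates belong to airing shows:", all_airing)
```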

In [39]:
print("🧪 Data Types of Each Column:\n")
print(df6.dtypes)

🧪 Data Types of Each Column:

type                         object
num_episodes                  int64
status                       object
start_date           datetime64[ns]
end_date             datetime64[ns]
season                       object
studios                      object
genres                       object
score                       float64
release_year                  int32
studio_rank_score             int64
duration                    float64
dtype: object
