## Load Dataset

In [1]:
import pandas as pd

anime_df = pd.read_csv('/content/anime.csv')
print("Dataset loaded successfully. Displaying the first 5 rows:")
print(anime_df.head())

Dataset loaded successfully. Displaying the first 5 rows:
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                           Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  


## Initial Data Inspection and Missing Value Handling

To understand the structure, data types, and non-null values of the `anime_df` DataFrame, I will use the `.info()` method. This will help identify columns that might require type conversion or have a significant number of missing values.



In [2]:
print("DataFrame Info:")
anime_df.info()

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


To get a statistical summary of the numerical columns in `anime_df`, I will use the `.describe()` method. This will provide insights into the distribution and central tendency of these columns.



In [3]:
print("\nDescriptive Statistics:")
print(anime_df.describe())


Descriptive Statistics:
           anime_id        rating       members
count  12294.000000  12064.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.026746  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.880000  2.250000e+02
50%    10260.500000      6.570000  1.550000e+03
75%    24794.500000      7.180000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06


To identify which columns have missing values and the extent of those missing values, I will use the `.isnull().sum()` method on the `anime_df` DataFrame.



In [4]:
print("\nMissing values per column:")
print(anime_df.isnull().sum())


Missing values per column:
anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


Based on the missing value counts, I will fill missing 'genre' and 'type' values with 'Unknown' and missing 'rating' values with the mean of the 'rating' column to handle them appropriately.



In [5]:
anime_df['genre'].fillna('Unknown', inplace=True)
anime_df['type'].fillna('Unknown', inplace=True)
anime_df['rating'].fillna(anime_df['rating'].mean(), inplace=True)

print("Missing values after handling:")
print(anime_df.isnull().sum())

Missing values after handling:
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  anime_df['genre'].fillna('Unknown', inplace=True)

The previous code block produced a `FutureWarning` due to the use of `inplace=True` with chained assignment. To address this, I will modify the `fillna` operations to directly assign the result back to the DataFrame columns, as suggested by the warning.



In [6]:
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['type'] = anime_df['type'].fillna('Unknown')
anime_df['rating'] = anime_df['rating'].fillna(anime_df['rating'].mean())

print("Missing values after handling:")
print(anime_df.isnull().sum())

Missing values after handling:
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


## Feature Preprocessing

To understand the various formats and non-numeric entries present in the 'episodes' column, I need to inspect its unique values.



In [7]:
print("Unique values in 'episodes' column:")
print(anime_df['episodes'].unique())

Unique values in 'episodes' column:
['1' '64' '51' '24' '10' '148' '110' '13' '201' '25' '22' '75' '4' '26'
 '12' '27' '43' '74' '37' '2' '11' '99' 'Unknown' '39' '101' '47' '50'
 '62' '33' '112' '23' '3' '94' '6' '8' '14' '7' '40' '15' '203' '77' '291'
 '120' '102' '96' '38' '79' '175' '103' '70' '153' '45' '5' '21' '63' '52'
 '28' '145' '36' '69' '60' '178' '114' '35' '61' '34' '109' '20' '9' '49'
 '366' '97' '48' '78' '358' '155' '104' '113' '54' '167' '161' '42' '142'
 '31' '373' '220' '46' '195' '17' '1787' '73' '147' '127' '16' '19' '98'
 '150' '76' '53' '124' '29' '115' '224' '44' '58' '93' '154' '92' '67'
 '172' '86' '30' '276' '59' '72' '330' '41' '105' '128' '137' '56' '55'
 '65' '243' '193' '18' '191' '180' '91' '192' '66' '182' '32' '164' '100'
 '296' '694' '95' '68' '117' '151' '130' '87' '170' '119' '84' '108' '156'
 '140' '331' '305' '300' '510' '200' '88' '1471' '526' '143' '726' '136'
 '1818' '237' '1428' '365' '163' '283' '71' '260' '199' '225' '312' '240'
 '1306' '15

Based on the inspection of unique values, I will replace 'Unknown' in the 'episodes' column with 0 and then convert the column to an integer type to make it numerical for further processing.



In [8]:
anime_df['episodes'] = anime_df['episodes'].replace('Unknown', 0).astype(int)
print("'episodes' column after cleaning and type conversion:")
print(anime_df['episodes'].head())
print("Data type of 'episodes' column:", anime_df['episodes'].dtype)

'episodes' column after cleaning and type conversion:
0     1
1    64
2    51
3    24
4    51
Name: episodes, dtype: int64
Data type of 'episodes' column: int64
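
Replacing 'Unknown' with 0 makes an unknown episode count indistinguishable from a true zero. As a minimal alternative sketch, assuming a small made-up Series rather than the real column, `pd.to_numeric` with `errors='coerce'` keeps unknown counts as missing values:

```python
import pandas as pd

# Toy stand-in for the 'episodes' column (values are illustrative only)
episodes = pd.Series(['1', '64', 'Unknown', '24'])

# Non-numeric strings become NaN; the nullable Int64 dtype keeps integers plus <NA>
episodes_clean = pd.to_numeric(episodes, errors='coerce').astype('Int64')

print(episodes_clean)
print("Missing episode counts:", episodes_clean.isna().sum())
```

Whether 0 or a missing value is the better placeholder depends on how the column is used later; here 'episodes' is not part of the similarity features, so either choice leaves the recommendations unaffected.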


To prepare the 'genre' column for one-hot encoding, I will first split the comma-separated genre strings into individual genres and then collect all unique genres present in the dataset.



In [9]:
all_genres = set()
for genres_str in anime_df['genre']:
    for genre in genres_str.split(', '):
        all_genres.add(genre.strip())

print("Total unique genres found:", len(all_genres))
print("First 10 unique genres:", list(all_genres)[:10])

Total unique genres found: 44
First 10 unique genres: ['Slice of Life', 'Super Power', 'Mystery', 'Josei', 'Vampire', 'Unknown', 'Historical', 'Parody', 'School', 'Sports']


Now that all unique genres have been identified, I will apply one-hot encoding to the 'genre' column by creating new binary columns for each genre. Each new column will indicate the presence (1) or absence (0) of that genre for each anime.



In [10]:
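for genre in all_genres:
    # Match whole genre tokens; a substring test would also flag e.g. 'Shounen' inside 'Shounen Ai'
    anime_df[genre] = anime_df['genre'].apply(lambda x: int(genre in {g.strip() for g in x.split(',')}))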

print("DataFrame with one-hot encoded genres (first 5 rows and genre columns):")
print(anime_df[list(all_genres)[:5] + ['genre']].head())

DataFrame with one-hot encoded genres (first 5 rows and genre columns):
   Slice of Life  Super Power  Mystery  Josei  Vampire  \
0              0            0        0      0        0   
1              0            0        0      0        0   
2              0            0        0      0        0   
3              0            0        0      0        0   
4              0            0        0      0        0   

                                               genre  
0               Drama, Romance, School, Supernatural  
1  Action, Adventure, Drama, Fantasy, Magic, Mili...  
2  Action, Comedy, Historical, Parody, Samurai, S...  
3                                   Sci-Fi, Thriller  
4  Action, Comedy, Historical, Parody, Samurai, S...  
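
Multi-label genre strings can also be encoded with `Series.str.get_dummies`, which splits on a separator and matches whole tokens, so a genre such as 'Shounen' is not confused with 'Shounen Ai'. The sketch below uses a made-up three-row Series, not the dataset itself:

```python
import pandas as pd

# Toy genre strings (illustrative, not taken from the dataset)
genre = pd.Series(['Shounen Ai, Drama', 'Shounen, Action', 'Drama'])

# Split on ', ' and build one 0/1 column per exact genre token
genre_dummies = genre.str.get_dummies(sep=', ')

print(genre_dummies)
```

The resulting frame could be concatenated to the original DataFrame in the same way as the `type` dummies.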


To one-hot encode the 'type' column, I will use `pd.get_dummies()` and then concatenate the resulting DataFrame with the original `anime_df`.



In [11]:
type_dummies = pd.get_dummies(anime_df['type'], prefix='type')
anime_df = pd.concat([anime_df, type_dummies], axis=1)

print("DataFrame with one-hot encoded 'type' column (first 5 rows and type columns):")
print(anime_df[[col for col in anime_df.columns if 'type_' in col or col == 'type']].head())

DataFrame with one-hot encoded 'type' column (first 5 rows and type columns):
    type  type_Movie  type_Music  type_ONA  type_OVA  type_Special  type_TV  \
0  Movie        True       False     False     False         False    False   
1     TV       False       False     False     False         False     True   
2     TV       False       False     False     False         False     True   
3     TV       False       False     False     False         False     True   
4     TV       False       False     False     False         False     True   

   type_Unknown  
0         False  
1         False  
2         False  
3         False  
4         False  


To normalize the numerical features 'rating' and 'members', I first need to import the `StandardScaler` from `sklearn.preprocessing`.



In [12]:
from sklearn.preprocessing import StandardScaler

print("StandardScaler imported successfully.")

StandardScaler imported successfully.


Now that `StandardScaler` is imported, I will initialize a `StandardScaler` object and then use it to fit and transform the 'rating' and 'members' columns, storing the results in new 'rating_scaled' and 'members_scaled' columns.



In [13]:
scaler = StandardScaler()
# Fit once on both columns so the scaler retains the statistics for each feature
scaled = scaler.fit_transform(anime_df[['rating', 'members']])
anime_df['rating_scaled'] = scaled[:, 0]
anime_df['members_scaled'] = scaled[:, 1]

print("Numerical features 'rating' and 'members' scaled successfully.")
print(anime_df[['rating', 'rating_scaled', 'members', 'members_scaled']].head())

Numerical features 'rating' and 'members' scaled successfully.
   rating  rating_scaled  members  members_scaled
0    9.37       2.847535   200630        3.330241
1    9.26       2.739380   793665       14.148406
2    9.25       2.729547   114262        1.754713
3    9.17       2.650889   673572       11.957666
4    9.16       2.641057   151266        2.429742
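
For reference, standardization maps each value x to (x - mean) / std, and `StandardScaler` uses the population standard deviation (`ddof=0`). The NumPy sketch below reproduces that calculation by hand on made-up numbers:

```python
import numpy as np

# Made-up member counts, purely for illustration
members = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# z-score with the population standard deviation (ddof=0)
members_scaled = (members - members.mean()) / members.std(ddof=0)

print(members_scaled.round(3))
```

In the output above, `members_scaled` reaches 14.1 for 'Fullmetal Alchemist: Brotherhood', which suggests a heavily skewed column. A log transform before scaling is one option worth testing if popularity dominates the similarity scores.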


## Combine Features and Compute Cosine Similarity

To combine all preprocessed features, I first need to identify all columns that represent genres (one-hot encoded), types (one-hot encoded), and the scaled numerical features ('rating_scaled' and 'members_scaled'). I will create a list of these column names.



In [14]:
genre_cols = [col for col in anime_df.columns if col in all_genres]
type_cols = [col for col in anime_df.columns if col.startswith('type_')]

feature_cols = genre_cols + type_cols + ['rating_scaled', 'members_scaled']

print("Identified feature columns:")
print(feature_cols[:10]) # Print first 10 for brevity
print(f"Total feature columns: {len(feature_cols)}")

Identified feature columns:
['Slice of Life', 'Super Power', 'Mystery', 'Josei', 'Vampire', 'Unknown', 'Historical', 'Parody', 'School', 'Sports']
Total feature columns: 53


Now that the feature columns are identified, I will create the `feature_matrix` DataFrame, import `cosine_similarity`, compute the pairwise cosine similarity matrix, and print its shape.



In [15]:
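from sklearn.metrics.pairwise import cosine_similarity

# Rows are anime; columns are the combined genre, type, and scaled numeric features
feature_matrix = anime_df[feature_cols]

# Pairwise cosine similarity between every pair of anime
cosine_sim = cosine_similarity(feature_matrix)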

print("Shape of the cosine similarity matrix:", cosine_sim.shape)

Shape of the cosine similarity matrix: (12294, 12294)
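
Cosine similarity between two feature vectors is their dot product divided by the product of their Euclidean norms. The NumPy sketch below computes the full pairwise matrix for a made-up 3 x 3 feature matrix. It is meant to illustrate the formula, not to replace the `sklearn` implementation, and it assumes no row is all zeros:

```python
import numpy as np

# Toy feature matrix: 3 items x 3 features (illustrative values)
X = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Normalize each row to unit length, then take all pairwise dot products
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
sim = X_unit @ X_unit.T

print(sim.round(2))
```

Items 0 and 1 have identical features and score 1, while item 2 shares no features with them and scores 0. For the real data, a 12294 x 12294 matrix holds about 151 million entries, or roughly 1.2 GB if stored as float64, which is worth keeping in mind on memory-constrained machines.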


## Implement Recommendation Function


I need to define the `get_recommendations` function, which will take an anime title, find its index in the DataFrame, retrieve its similarity scores from the precomputed `cosine_sim` matrix, sort these scores, and then return the names of the top N most similar anime, excluding the input anime itself.



In [16]:
def get_recommendations(anime_title, n_recommendations=10):
    # Find the index of the target anime
    if anime_title not in anime_df['name'].values:
        print(f"Anime '{anime_title}' not found in the dataset.")
        return []

    idx = anime_df[anime_df['name'] == anime_title].index[0]

    # Get the similarity scores for this anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the anime based on the similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Drop the input anime by index (an exact duplicate could otherwise sort ahead of it),
    # then keep the top N most similar anime
    sim_scores = [s for s in sim_scores if s[0] != idx][:n_recommendations]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top N recommended anime names
    return anime_df['name'].iloc[anime_indices].tolist()

print("Recommendation function 'get_recommendations' defined.")

Recommendation function 'get_recommendations' defined.



With `get_recommendations` defined, I will test it by calling it with example anime titles, including one that does not exist in the dataset, and displaying the recommendations.



In [17]:
print("Recommendations for 'Kimi no Na wa.':")
recommendations = get_recommendations('Kimi no Na wa.')
print(recommendations)

print("\nRecommendations for 'Fullmetal Alchemist: Brotherhood':")
recommendations = get_recommendations('Fullmetal Alchemist: Brotherhood')
print(recommendations)

print("\nRecommendations for a non-existent anime:")
recommendations = get_recommendations('NonExistentAnimeTitle')
print(recommendations)

Recommendations for 'Kimi no Na wa.':
['Hotarubi no Mori e', 'Suzumiya Haruhi no Shoushitsu', 'Hotaru no Haka', 'Clannad: After Story - Mou Hitotsu no Sekai, Kyou-hen', 'Kotonoha no Niwa', 'Yahari Ore no Seishun Love Comedy wa Machigatteiru. Zoku', 'Howl no Ugoku Shiro', 'Toki wo Kakeru Shoujo', 'Kanon (2006)', 'Clannad: Mou Hitotsu no Sekai, Tomoyo-hen']

Recommendations for 'Fullmetal Alchemist: Brotherhood':
['Fullmetal Alchemist', 'Shingeki no Kyojin', 'Fairy Tail', 'Akame ga Kill!', 'Noragami', 'Soul Eater', 'Sword Art Online', 'One Piece', 'Ao no Exorcist', 'Angel Beats!']

Recommendations for a non-existent anime:
Anime 'NonExistentAnimeTitle' not found in the dataset.
[]


## Analyze Recommendations and Thresholds

To allow filtering recommendations based on similarity, I need to modify the `get_recommendations` function to accept a `similarity_threshold` parameter and filter the results accordingly.



In [18]:
def get_recommendations(anime_title, n_recommendations=10, similarity_threshold=0):
    # Find the index of the target anime
    if anime_title not in anime_df['name'].values:
        print(f"Anime '{anime_title}' not found in the dataset.")
        return [], []  # keep the (names, scores) return shape so callers can always unpack

    idx = anime_df[anime_df['name'] == anime_title].index[0]

    # Get the similarity scores for this anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the anime based on the similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Filter out recommendations below the threshold and the anime itself
    filtered_scores = [s for s in sim_scores if s[1] >= similarity_threshold and s[0] != idx]

    # Get the scores of the top N most similar anime
    sim_scores = filtered_scores[:n_recommendations]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top N recommended anime names together with their similarity scores
    return anime_df['name'].iloc[anime_indices].tolist(), [i[1] for i in sim_scores]

print("Recommendation function 'get_recommendations' modified to include similarity_threshold.")

Recommendation function 'get_recommendations' modified to include similarity_threshold.


Now that the `get_recommendations` function is updated, I will test it with different `similarity_threshold` values, printing the threshold, the recommendations with their scores, and the number of recommendations returned.



In [19]:
anime_to_recommend = 'Kimi no Na wa.'
similarity_thresholds = [0.5, 0.7, 0.9]

for threshold in similarity_thresholds:
    print(f"\n--- Recommendations for '{anime_to_recommend}' with similarity_threshold = {threshold} ---")
    recommendations, scores = get_recommendations(anime_to_recommend, n_recommendations=10, similarity_threshold=threshold)
    if recommendations:
        for i, (rec, score) in enumerate(zip(recommendations, scores)):
            print(f"{i+1}. {rec} (Similarity: {score:.2f})")
    else:
        print("No recommendations found above this threshold.")
    print(f"Total recommendations: {len(recommendations)}")


--- Recommendations for 'Kimi no Na wa.' with similarity_threshold = 0.5 ---
1. Hotarubi no Mori e (Similarity: 0.95)
2. Suzumiya Haruhi no Shoushitsu (Similarity: 0.91)
3. Hotaru no Haka (Similarity: 0.90)
4. Clannad: After Story - Mou Hitotsu no Sekai, Kyou-hen (Similarity: 0.89)
5. Kotonoha no Niwa (Similarity: 0.89)
6. Yahari Ore no Seishun Love Comedy wa Machigatteiru. Zoku (Similarity: 0.89)
7. Howl no Ugoku Shiro (Similarity: 0.88)
8. Toki wo Kakeru Shoujo (Similarity: 0.88)
9. Kanon (2006) (Similarity: 0.88)
10. Clannad: Mou Hitotsu no Sekai, Tomoyo-hen (Similarity: 0.87)
Total recommendations: 10

--- Recommendations for 'Kimi no Na wa.' with similarity_threshold = 0.7 ---
1. Hotarubi no Mori e (Similarity: 0.95)
2. Suzumiya Haruhi no Shoushitsu (Similarity: 0.91)
3. Hotaru no Haka (Similarity: 0.90)
4. Clannad: After Story - Mou Hitotsu no Sekai, Kyou-hen (Similarity: 0.89)
5. Kotonoha no Niwa (Similarity: 0.89)
6. Yahari Ore no Seishun Love Comedy wa Machigatteiru. Zoku (Si

### Interpretation of Recommendations and Potential Improvements

**Interpretation of Recommendations:**

Experimenting with different `similarity_threshold` values shows how the threshold interacts with the `n_recommendations` cap:

*   **Lower Thresholds (e.g., 0.5, 0.7):** A low threshold admits a broad pool of candidates that share general characteristics (e.g., genre, type, high rating) without necessarily matching in finer detail. For 'Kimi no Na wa.', the tenth-best score is 0.87, so thresholds of 0.5 and 0.7 both return the same top-10 list. The threshold only changes the output once it exceeds the score of the Nth candidate.
*   **Higher Thresholds (e.g., 0.9):** At a threshold of 0.9, only three anime were returned for 'Kimi no Na wa.', each very closely aligned with its features. The list is shorter, but every entry is a strong match on the defined features (genre, type, rating, and member count). This measures feature similarity, not whether a viewer would actually enjoy the title.

The choice of threshold depends on the desired user experience. A lower threshold might be good for exploration, while a higher threshold is better for finding very specific, similar content.
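
Because `n_recommendations` caps the list, it can help to count how many candidates clear each threshold before truncation. The sketch below does this for a made-up score array; the values loosely echo the rounded scores printed above and are assumptions, not the exact similarities:

```python
import numpy as np

# Made-up similarity scores for one query, with the query itself already excluded
scores = np.array([0.95, 0.91, 0.90, 0.89, 0.89, 0.89,
                   0.88, 0.88, 0.88, 0.87, 0.62, 0.41])
n_recommendations = 10
thresholds = [0.5, 0.7, 0.9]

# Candidates that clear each threshold, before the top-N cap is applied
counts = {t: int((scores >= t).sum()) for t in thresholds}

for t, n_above in counts.items():
    n_returned = min(n_above, n_recommendations)
    print(f"threshold={t}: {n_above} candidates, {n_returned} returned")
```

Inspecting these counts on the real similarity row would show directly which thresholds are tight enough to shorten the list.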

**Potential Improvements:**

1.  **Enriching Features:**
    *   **Synopsis/Description Text Analysis:** Incorporate natural language processing (NLP) on anime synopses or descriptions to extract thematic similarities, keywords, or even sentiment. This could capture more nuanced content-based similarities that aren't apparent from genre or type.
    *   **Character/Staff Data:** Including data about voice actors, directors, studios, and main characters could add another layer of similarity, as users often follow works by specific creators or voice artists.
    *   **User Ratings/Reviews:** The 'rating' column is already an aggregate score per anime. Per-user ratings, if available, would allow matrix factorization on a user-item rating matrix and add true collaborative filtering elements, capturing individual preferences beyond content features.

2.  **Hybrid Approaches:** Combine content-based filtering (what we've implemented) with collaborative filtering (if user-item interaction data were available). This can leverage the strengths of both, providing recommendations that are both similar in content and popular among users with similar tastes.

3.  **Advanced Feature Engineering:**
    *   **Genre Weighting:** Some genres might be more important to a user than others. Instead of simple one-hot encoding, a weighted approach could be considered.
    *   **Temporal Dynamics:** If release dates were considered, recommendations could prioritize newer or older anime based on user preferences.

4.  **Handling Cold Start:** For new anime or anime with limited data, the current system might struggle. Hybrid models or leveraging external metadata could help address this.

5.  **User Interface and Feedback:** Implement a way for users to provide feedback on recommendations (e.g.,

## Answer Interview Questions



### Q&A

**1. What is collaborative filtering, and how does it work?**
Collaborative filtering is a recommendation technique that predicts user interest in items by leveraging the preferences of many users. Its core principle is that if users share similar tastes for some items, they will likely have similar preferences for others. It works by:
1.  **Collecting User Behavior Data**: Gathering explicit (e.g., ratings) or implicit (e.g., viewing history) feedback on user-item interactions.
2.  **Finding Patterns**: Analyzing this data to identify similarities among users or items.
3.  **Making Recommendations**: Suggesting items to a target user based on these identified patterns, assuming that "people who agreed in the past tend to agree in the future" or "items liked by similar people are likely to be liked by you."
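
The three steps can be sketched on a tiny made-up rating matrix. This is an illustrative user-based variant using cosine similarity and a similarity-weighted average; the matrix, the names, and the choice of treating 0 as "not rated" are all assumptions for the example:

```python
import numpy as np

# Step 1: user behavior data. Rows = users, columns = items; 0 means "not rated"
ratings = np.array([
    [5.0, 4.0, 0.0],   # target user: has not rated item 2
    [5.0, 5.0, 4.0],   # similar tastes to the target user
    [1.0, 2.0, 5.0],   # different tastes
])

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target, item = 0, 2

# Step 2: similarity between the target user and each other user who rated the item
neighbors = [u for u in range(len(ratings)) if u != target and ratings[u, item] > 0]
weights = np.array([cosine(ratings[target], ratings[u]) for u in neighbors])

# Step 3: predict the missing rating as a similarity-weighted average
predicted = float(weights @ ratings[neighbors, item] / weights.sum())
print(round(predicted, 2))
```

The prediction is pulled toward the rating given by the more similar user. Production systems usually also subtract each user's mean rating and ignore unrated entries when computing similarity, which this sketch omits for brevity.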

**2. Can you explain the difference between user-based and item-based collaborative filtering?**

| Feature | User-Based Collaborative Filtering | Item-Based Collaborative Filtering |
| :--- | :--- | :--- |
| **Core Idea** | Recommends items to a user based on the preferences of similar users ("Show me what people like me liked."). | Recommends items to a user based on the similarity between items ("Since you liked this item, you might also like these other similar items."). |
| **Similarity Calculation** | Calculates similarity between **users** based on their shared item preferences. | Calculates similarity between **items** based on how users have interacted with them. |
| **Recommendation Process** | 1. Find users similar to the target user. <br> 2. Recommend items liked by those similar users but not yet seen by the target user. | 1. Identify items the target user has liked. <br> 2. Find items similar to those liked items. <br> 3. Recommend these similar items not yet seen by the target user. |
| **Computational Cost** | Can be computationally expensive for a large number of users, because user-user similarities must be recomputed as preferences change. | Often more scalable for large user bases, as item similarity is generally more stable and can be precomputed. |
| **Sparsity Issue** | More prone to data sparsity if users have rated very few items. | Generally less affected by user sparsity, as item-item relationships are derived from collective user behavior. |
| **Cold Start Problem** | **User Cold Start**: Difficult to recommend for new users with no interaction history. | **Item Cold Start**: Difficult to recommend new items with no interaction history. |
