# Anime Recommender System Project 2024

<div align="center" style="font-size: 40%; text-align: center; margin: 0 auto">
    <img src="https://mcdn.wallpapersafari.com/medium/67/98/JKSuGa.jpg" style="display: block; margin-left: auto; margin-right: auto; width: 800px; height: 200px;" />
</div>


### **Project Overview**

### **Loading Packages**

In [73]:
import numpy as np
import pandas as pd
import csv
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


In [74]:
# Download stopwords
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Thabo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Thabo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Loading Data 
>To begin our data analysis, we import and load our respective datasets into Pandas dataframes. These datasets include the 'anime_data', 'train_data', and 'test_data' CSV files.

In [75]:
# loading dataset
anime_df = pd.read_csv("anime-recommender-system-project-2024\\anime.csv")

test_df = pd.read_csv("anime-recommender-system-project-2024\\test.csv")

train_df = pd.read_csv("anime-recommender-system-project-2024\\train.csv")


>To maintain the integrity of the original data and prevent any unintended modifications, we made a copy of the dataframe using the df.copy() method. This copy was then referred to as df_copy for both dataframes.

In [76]:
# The copy of the dataframe
anime_df_copy = anime_df.copy()

test_df_copy = test_df.copy()

train_df_copy = train_df.copy()

### **Initial Data Inspection**

###**Data Cleaning**

>Clean data plays a critical role in  models. When the data is free from errors, inconsistencies, and inaccuracies, it ensures that the models are built on a foundation of accurate and relevant information. As a result, the predictive performance and reliability of the models are significantly improved. This allows the models to make more precise predictions and produce more dependable results, ultimately leading to better decision-making and outcomes.


In [77]:
anime_df_copy.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [78]:

anime_df_copy.shape

(12294, 7)

In [79]:
# Check the columns of the DataFrame
print(anime_df_copy.columns)

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')


>Before diving into any data-related tasks, it is essential to begin by examining the column names, checking the number of non-null values in each column, understanding the data types of the columns, and assessing the memory usage of the DataFrame. This preliminary information is crucial for conducting thorough data cleaning, preprocessing, and analysis tasks, thereby ensuring the accuracy and integrity of the subsequent data operations

In [80]:
anime_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


### Removing stop words
>Stop words are commonly occurring words that do not contribute to the meaning of a phrase or sentence, and are often filtered out from search queries to improve search results. These words include articles, prepositions, and other frequently used words that do not hold much significance when searching for specific information. Removing stop words helps to focus the search on the most important keywords, resulting in more relevant and accurate search results.

In [81]:
#We print out the stopwords for English
stopwords_list = stopwords.words('english')
print(stopwords_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [82]:
#We define a function to remove stopwords from a text
def remove_stopwords(text):
    if isinstance(text, str):
        words = text.split() 
        filtered_words = [word for word in words if word.lower() not in stopwords_list]
        return ' '.join(filtered_words)
    else:
        return text
    
#Apply function to the dataframe
anime_df_copy = anime_df_copy.applymap(remove_stopwords)

#Showing the dataframe without stopwords
print("\nData without stopwords:")
anime_df_copy.head()


Data without stopwords:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


### Converting-lowercase letters 

>We convert all the text into lowercase letters in order to eliminate any distractions caused by capitalization.

In [83]:
# Function to convert to lowercase
anime_df_copy = anime_df_copy.applymap(lambda x: x.lower() if isinstance(x, str) else x)
#We show the dataframe without punctuation
print("\nlowercase:")
anime_df_copy.head()


lowercase:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,kimi na wa.,"drama, romance, school, supernatural",movie,1,9.37,200630
1,5114,fullmetal alchemist: brotherhood,"action, adventure, drama, fantasy, magic, mili...",tv,64,9.26,793665
2,28977,gintama°,"action, comedy, historical, parody, samurai, s...",tv,51,9.25,114262
3,9253,steins;gate,"sci-fi, thriller",tv,24,9.17,673572
4,9969,gintama&#039;,"action, comedy, historical, parody, samurai, s...",tv,51,9.16,151266


###  Remove punctuation
>We eliminate punctuation marks from the dataset to reduce or eliminate potential interference and irrelevant information, aiming to produce a cleaner and more accurate dataset for analysis and processing.

In [84]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [85]:
##Defining a cleaning function
def remove_punctuation(text):
    if isinstance(text, str):  # Check if the cell contains a string
        return text.translate(str.maketrans('', '', string.punctuation))
    return text

#Apply the function to the dataframe
anime_df_copy = anime_df_copy.applymap(remove_punctuation)

In [86]:
print("\nCleaned Data without punctuation:")
anime_df_copy.head()


Cleaned Data without punctuation:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,kimi na wa,drama romance school supernatural,movie,1,9.37,200630
1,5114,fullmetal alchemist brotherhood,action adventure drama fantasy magic military ...,tv,64,9.26,793665
2,28977,gintama°,action comedy historical parody samurai scifi ...,tv,51,9.25,114262
3,9253,steinsgate,scifi thriller,tv,24,9.17,673572
4,9969,gintama039,action comedy historical parody samurai scifi ...,tv,51,9.16,151266


###  Tokenization
>A tokenizer breaks down text into smaller units called tokens, which are essentially "words". Tokenizers are used to prepare data for analysis by making it cleaner. This process simplifies the text, making it easier to manage and analyze. Tokenization standardizes text into a consistent format, making it easier to compare and perform operations across different texts.

In [87]:
def tokenize_text(text):
    if isinstance(text, str):
        return ' '.join(word_tokenize(text))
    return text

anime_df_copy = anime_df_copy.apply(tokenize_text)

In [88]:
print("\n Showing tokenized data:")
anime_df_copy.head()


 Showing tokenized data:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,kimi na wa,drama romance school supernatural,movie,1,9.37,200630
1,5114,fullmetal alchemist brotherhood,action adventure drama fantasy magic military ...,tv,64,9.26,793665
2,28977,gintama°,action comedy historical parody samurai scifi ...,tv,51,9.25,114262
3,9253,steinsgate,scifi thriller,tv,24,9.17,673572
4,9969,gintama039,action comedy historical parody samurai scifi ...,tv,51,9.16,151266


###  Lemmatization 
>Lemmatization reduces words to their base form, known as the lemma, considering context and part of speech to ensure validity. It maintains the meaning and grammatical integrity of words.

In [89]:

# Lemmatization function
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    if isinstance(text, str):
        return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

#Apply the function to the dataframe
anime_df_copy = anime_df_copy.applymap(lemmatize_text)

In [90]:

print("\nLemmatized Data:")
anime_df_copy.head()


Lemmatized Data:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,kimi na wa,drama romance school supernatural,movie,1,9.37,200630
1,5114,fullmetal alchemist brotherhood,action adventure drama fantasy magic military ...,tv,64,9.26,793665
2,28977,gintama°,action comedy historical parody samurai scifi ...,tv,51,9.25,114262
3,9253,steinsgate,scifi thriller,tv,24,9.17,673572
4,9969,gintama039,action comedy historical parody samurai scifi ...,tv,51,9.16,151266


###  Removing emojis
>Emojis are frequently used in text to express emotions or sentiments but can be irrelevant or distracting for certain NLP tasks. Removing emojis is a crucial step in data preprocessing to ensure that the text is clean and standardized.

In [91]:
#We define a function to remove emojis from the strings
def remove_emojis(text):
    if isinstance(text, str):  # Check if the cell contains a string
        # Define a regex pattern for emojis
        emoji_pattern = re.compile(
            "[" 
            "\U0001F600-\U0001F64F"  # emoticons
            "\U0001F300-\U0001F5FF"  # symbols & pictographs
            "\U0001F680-\U0001F6FF"  # transport & map symbols
            "\U0001F1E0-\U0001F1FF"  # flags (iOS)
            "\U00002702-\U000027B0"
            "\U000024C2-\U0001F251"
            "]+", flags=re.UNICODE)
        return emoji_pattern.sub(r'', text)
    return text


In [92]:

#We apply the function to the entire dataframe
anime_df_copy.applymap(remove_emojis)
print("\nCleaned Data without emojis:")
anime_df_copy.head()


Cleaned Data without emojis:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,kimi na wa,drama romance school supernatural,movie,1,9.37,200630
1,5114,fullmetal alchemist brotherhood,action adventure drama fantasy magic military ...,tv,64,9.26,793665
2,28977,gintama°,action comedy historical parody samurai scifi ...,tv,51,9.25,114262
3,9253,steinsgate,scifi thriller,tv,24,9.17,673572
4,9969,gintama039,action comedy historical parody samurai scifi ...,tv,51,9.16,151266



#### Stemming
> stemming is often used to normalize text, reducing the vocabulary size and helping algorithms focus on the underlying meaning of words rather than their specific forms.

In [93]:
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer

#Stemming function
def stem_text(text):
    stemmer = PorterStemmer()
    if isinstance(text, str):
        return ' '.join([stemmer.stem(word) for word in text.split()])
    return text

# Apply stemming to the entire DataFrame
anime_df_copy = anime_df_copy.apply(stem_text)


In [94]:
print("\nStemmed Data:")
anime_df_copy.head()


Stemmed Data:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,kimi na wa,drama romance school supernatural,movie,1,9.37,200630
1,5114,fullmetal alchemist brotherhood,action adventure drama fantasy magic military ...,tv,64,9.26,793665
2,28977,gintama°,action comedy historical parody samurai scifi ...,tv,51,9.25,114262
3,9253,steinsgate,scifi thriller,tv,24,9.17,673572
4,9969,gintama039,action comedy historical parody samurai scifi ...,tv,51,9.16,151266


### **Exploratory Data Analysis**

### **Unsupervised Learning Models**

### **Conclusions and Insights**