<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Code_challenge.png" style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

### Unsupervised Learning Project: 2401FTDS Team JB1
© ExploreAI Academy

## 1. Project Overview
<a class="anchor" id="1-project-overview"></a>

## 2. Importing Packages
<a class = "anchor" id="2-importing-packages"></a>

In [155]:
# data processing
import numpy as np
import pandas as pd
import re
from scipy import stats
import logging
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Displays output inline
%matplotlib inline

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

---
## 3. Loading Data
The data for this project is stored in three files: anime.csv, train.csv, and test.csv. Each file was loaded into a pandas DataFrame with `pd.read_csv()`, giving `anime_df`, `train_df`, and `test_df`. Passing `index_col=False` tells pandas not to use the first column of the file as the index, so each DataFrame keeps the default integer index.

In [135]:
# Load the datasets
train_df = pd.read_csv('train.csv', index_col=False)
test_df = pd.read_csv('test.csv', index_col=False)
anime_df = pd.read_csv('anime.csv', index_col=False)

# Display the first few rows of the anime dataset
anime_df.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | `Gintama&#039;` | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


<div class="alert alert-block alert-danger">
<b>To avoid unintended changes to the original data</b>, each DataFrame was copied with the `.copy()` method. The copies are named `train_df_copy`, `test_df_copy`, and `anime_df_copy`.
</div>

In [136]:
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()
anime_df_copy = anime_df.copy()
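
As a minimal sketch of why the copies matter, the toy example below shows that edits made to a `.copy()` leave the source DataFrame untouched. The names `original` and `working` and the values are illustrative only and are not taken from the project data.

```python
import pandas as pd

# Toy stand-in for anime_df; the values are made up for illustration.
original = pd.DataFrame({"name": ["A", "B"], "rating": [7.5, None]})

# .copy() makes a deep copy by default, so the data itself is duplicated.
working = original.copy()
working["rating"] = working["rating"].fillna(0.0)

# The working copy is changed; the original still holds the missing value.
print(working["rating"].tolist())       # [7.5, 0.0]
print(original["rating"].isna().sum())  # 1
```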

#### Inspect the Dataset

Examine the structure of the dataset to understand the types of data and identify any potential issues.

In [137]:
# Check the data types and missing values
print(train_df_copy.info())
print(test_df_copy.info())
print(anime_df_copy.info())

# Get a summary of the dataset
print(train_df_copy.describe(include='all'))
print(test_df_copy.describe(include='all'))
print(anime_df_copy.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5703554 entries, 0 to 5703553
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 130.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633686 entries, 0 to 633685
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   633686 non-null  int64
 1   anime_id  633686 non-null  int64
dtypes: int64(2)
memory usage: 9.7 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members 

In [138]:
anime_df_copy.shape

(12294, 7)

Result: The anime dataset consists of 12,294 rows (observations) and 7 columns (features).

The `.info()` method summarises the DataFrame's columns, non-null counts, and data types.

In [139]:
anime_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
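
`episodes` is reported as `object` even though it holds episode counts, which usually means that some entries are not numeric. The sketch below shows one way to list such entries. It runs on a made-up Series rather than the real column, and the placeholder string `'Unknown'` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking an object-typed 'episodes' column.
episodes = pd.Series(["1", "64", "Unknown", "24"], dtype="object")

# errors='coerce' turns anything that cannot be parsed into NaN.
as_numbers = pd.to_numeric(episodes, errors="coerce")

# Entries that were present but failed to parse are the non-numeric ones.
non_numeric = episodes[as_numbers.isna() & episodes.notna()]
print(non_numeric.value_counts())
```

Running the same check on `anime_df_copy['episodes']` would show which strings are converted to NaN when the column is later made numeric.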


In [140]:
anime_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| anime_id | 12294.0 | 14058.221653 | 11455.294701 | 1.0 | 3484.25 | 10260.5 | 24794.5 | 34527.0 |
| rating | 12064.0 | 6.473902 | 1.026746 | 1.67 | 5.88 | 6.57 | 7.18 | 10.0 |
| members | 12294.0 | 18071.338864 | 54820.676925 | 5.0 | 225.0 | 1550.0 | 9437.0 | 1013917.0 |


---
## 4. Data Cleaning

<div class="alert alert-block alert-info">
We cleaned the data by preprocessing the text columns, handling missing values, and checking for duplicate rows (none were found). Finally, we validated the cleaned data and saved it to disk so that the cleaning does not need to be repeated.
</div>


In [141]:
def check_and_clean_data(anime_df_copy):
    """
    Check and clean the dataset by:
    1. Printing the count of null values for each column.
    2. Printing the missing values.
    3. Filling or dropping null values.
    4. Dropping duplicate rows.
    5. Performing basic preprocessing such as stripping whitespace and converting text to lowercase.

    Parameters:
    anime_df_copy (pandas.DataFrame): The DataFrame to check and clean.

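    Returns:
    pandas.DataFrame: The cleaned DataFrame. Null values are filled in place on the
    DataFrame passed in; duplicate removal and text normalisation apply only to the
    returned DataFrame.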
    """
    # Check for null values
    null_counts = anime_df_copy.isnull().sum()
    if null_counts.sum() > 0:
        print("Columns with missing values:")
        print(null_counts[null_counts > 0])
        print("\n")
        
        for column in anime_df_copy.columns:
            missing_rows = anime_df_copy[column].isnull()
            if missing_rows.any():
                print(f"Missing values in '{column}':")
                print(anime_df_copy[missing_rows][[column]])
                print("\n")
            
            # Fill or drop null values as needed
            if anime_df_copy[column].dtype == 'object':
                anime_df_copy[column] = anime_df_copy[column].fillna('unknown')  # Filling null object columns with 'unknown'
            else:
                anime_df_copy[column] = anime_df_copy[column].fillna(anime_df_copy[column].median())  # Filling null numeric columns with median

    # Count and drop duplicate rows
    duplicate_count = anime_df_copy.duplicated().sum()
    print(f'The dataset has {duplicate_count} duplicate rows.')
    anime_df_copy = anime_df_copy.drop_duplicates()

    # Basic preprocessing: strip whitespace and convert to lowercase
    for column in anime_df_copy.select_dtypes(include='object').columns:
        anime_df_copy[column] = anime_df_copy[column].str.strip().str.lower()

    return anime_df_copy

In [142]:
cleaned_anime_df = check_and_clean_data(anime_df_copy)

Columns with missing values:
genre      62
type       25
rating    230
dtype: int64


Missing values in 'genre':
      genre
2844    NaN
3541    NaN
6040    NaN
6646    NaN
7018    NaN
...     ...
11070   NaN
11086   NaN
11097   NaN
11112   NaN
11113   NaN

[62 rows x 1 columns]


Missing values in 'type':
      type
10898  NaN
10900  NaN
10906  NaN
10907  NaN
10918  NaN
10949  NaN
10963  NaN
10983  NaN
10988  NaN
10990  NaN
10991  NaN
10994  NaN
10995  NaN
10998  NaN
11010  NaN
11013  NaN
11041  NaN
11053  NaN
11055  NaN
11058  NaN
11062  NaN
11070  NaN
11101  NaN
12252  NaN
12259  NaN


Missing values in 'rating':
       rating
8968      NaN
9657      NaN
10896     NaN
10897     NaN
10898     NaN
...       ...
12274     NaN
12279     NaN
12280     NaN
12282     NaN
12285     NaN

[230 rows x 1 columns]




The dataset has 0 duplicate rows.


In [143]:
print(cleaned_anime_df.isnull().sum())

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64
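
The fill rule used above (a placeholder label for text columns, the median for numeric columns) can be sketched on a toy frame. The data is made up. The sketch also tests `pd.api.types.is_numeric_dtype` instead of the `dtype == 'object'` comparison in the function, an assumption made so that it works where text columns have a dedicated string dtype.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame with one missing text value and one missing number.
toy = pd.DataFrame({
    "type": ["TV", None, "Movie"],
    "rating": [6.0, 8.0, None],
})

filled = toy.copy()
for column in filled.columns:
    if is_numeric_dtype(filled[column]):
        # Numeric columns get the column median (7.0 here).
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Text columns get a placeholder label.
        filled[column] = filled[column].fillna("unknown")

print(filled)
```

The median is less sensitive to extreme values than the mean, which is relevant for a skewed column such as `members`.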


In [147]:

def clean_anime_data(anime_df_copy):
    # Initialize logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    # Convert 'episodes' column to numeric and handle non-numeric values
    try:
        anime_df_copy['episodes'] = pd.to_numeric(anime_df_copy['episodes'], errors='coerce')
        anime_df_copy['episodes'] = anime_df_copy['episodes'].fillna(anime_df_copy['episodes'].median())
        logger.info("'episodes' column converted to numeric and missing values handled.")
    except Exception as e:
        logger.error(f"Error processing 'episodes' column: {e}")

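    # One-hot encode 'genre' column.
    # Splitting on ',' alone keeps the space after each comma, so every genre after
    # the first is encoded with a leading space in its column name (e.g. ' Adventure').
    try:
        genres = anime_df_copy['genre'].str.get_dummies(sep=',')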
        anime_df_copy = pd.concat([anime_df_copy, genres], axis=1)
        anime_df_copy.drop('genre', axis=1, inplace=True)
        logger.info("'genre' column one-hot encoded.")
    except Exception as e:
        logger.error(f"Error processing 'genre' column: {e}")

    # Normalize 'rating' column
    try:
        anime_df_copy['rating'] = (anime_df_copy['rating'] - anime_df_copy['rating'].min()) / (anime_df_copy['rating'].max() - anime_df_copy['rating'].min())
        logger.info("'rating' column normalized.")
    except Exception as e:
        logger.error(f"Error normalizing 'rating' column: {e}")

    # Handle outliers in 'members' and 'rating' columns
    try:
        z_scores = stats.zscore(anime_df_copy[['members', 'rating']])
        abs_z_scores = abs(z_scores)
        filtered_entries = (abs_z_scores < 3).all(axis=1)
        anime_df_copy = anime_df_copy[filtered_entries]
        logger.info("Outliers handled in 'members' and 'rating' columns.")
    except Exception as e:
        logger.error(f"Error handling outliers: {e}")

    # Keep only ASCII letters, digits, and spaces in the 'name' column
    try:
        anime_df_copy['name'] = anime_df_copy['name'].apply(lambda x: re.sub(r'[^A-Za-z0-9 ]+', '', x))
        logger.info("Text preprocessing done for 'name' column.")
    except Exception as e:
        logger.error(f"Error preprocessing text in 'name' column: {e}")

    # Validate data types
    try:
        logger.info(f"Data types after cleaning:\n{anime_df_copy.dtypes}")
    except Exception as e:
        logger.error(f"Error validating data types: {e}")

    return anime_df_copy
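
The genre strings separate entries with a comma followed by a space, so the separator passed to `str.get_dummies` affects the resulting column names. The sketch below compares `sep=','` with `sep=', '` on made-up genre strings. It is an illustration of the difference, not a change to the function above.

```python
import pandas as pd

# Made-up genre strings in the same "A, B, C" layout as anime.csv.
genre = pd.Series(["Action, Comedy", "Comedy", "Comedy, Drama"])

# Splitting on ',' keeps the space after each comma in the label,
# so 'Comedy' and ' Comedy' become two separate columns.
with_space = genre.str.get_dummies(sep=",")
print(sorted(with_space.columns))  # [' Comedy', ' Drama', 'Action', 'Comedy']

# Splitting on ', ' gives exactly one column per genre.
clean = genre.str.get_dummies(sep=", ")
print(sorted(clean.columns))       # ['Action', 'Comedy', 'Drama']
```

With `sep=', '` a genre is encoded in the same column regardless of its position in the list.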

In [148]:
cleaned_anime_df = clean_anime_data(anime_df_copy)
if cleaned_anime_df is not None:
    cleaned_anime_df.to_csv('cleaned_anime.csv', index=False)
    print("Cleaned dataset saved to 'cleaned_anime.csv'.")

INFO:__main__:'episodes' column converted to numeric and missing values handled.
ERROR:__main__:Error processing 'genre' column: 'genre'
INFO:__main__:'rating' column normalized.
INFO:__main__:Outliers handled in 'members' and 'rating' columns.
INFO:__main__:Text preprocessing done for 'name' column.
INFO:__main__:Data types after cleaning:
anime_id          int64
name             object
type             object
episodes        float64
rating          float64
                 ...   
Supernatural      int64
Thriller          int64
Vampire           int64
Yaoi              int64
unknown           int64
Length: 89, dtype: object


Cleaned dataset saved to 'cleaned_anime.csv'.
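
The normalisation and outlier steps logged above can be sketched on toy data. This example uses plain pandas arithmetic. The z-score is computed with the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`. The values are invented so that exactly one row has an extreme `members` count.

```python
import pandas as pd

# Invented data: 19 ordinary rows and one extreme 'members' value.
toy = pd.DataFrame({
    "members": [10] * 19 + [1000],
    "rating": [5.0, 6.0, 7.0, 8.0] * 5,
})

# Min-max scaling maps the smallest rating to 0 and the largest to 1.
r = toy["rating"]
toy["rating"] = (r - r.min()) / (r.max() - r.min())

# Z-scores with ddof=0, then keep rows where every |z| is below 3.
cols = toy[["members", "rating"]]
z = (cols - cols.mean()) / cols.std(ddof=0)
kept = toy[(z.abs() < 3).all(axis=1)]

print(len(toy), len(kept))  # 20 19
```

Because the filter requires both columns to pass, a row is dropped as soon as either its `members` or its `rating` z-score reaches 3 in absolute value.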


In [162]:
# Check the data types and missing values
print(cleaned_anime_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 11964 entries, 2 to 12293
Data columns (total 89 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   anime_id        11964 non-null  int64  
 1   name            11964 non-null  object 
 2   type            11964 non-null  object 
 3   episodes        11964 non-null  float64
 4   rating          11964 non-null  float64
 5   members         11964 non-null  int64  
 6    Adventure      11964 non-null  int64  
 7    Cars           11964 non-null  int64  
 8    Comedy         11964 non-null  int64  
 9    Dementia       11964 non-null  int64  
 10   Demons         11964 non-null  int64  
 11   Drama          11964 non-null  int64  
 12   Ecchi          11964 non-null  int64  
 13   Fantasy        11964 non-null  int64  
 14   Game           11964 non-null  int64  
 15   Harem          11964 non-null  int64  
 16   Hentai         11964 non-null  int64  
 17   Historical     11964 non-null  int6

In [160]:
cleaned_anime_df.head()

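| | anime_id | name | type | episodes | rating | members | Adventure | Cars | Comedy | Dementia | ... | Shounen | Slice of Life | Space | Sports | Super Power | Supernatural | Thriller | Vampire | Yaoi | unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 28977 | Gintama | TV | 51.0 | 0.909964 | 114262 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 9969 | Gintama039 | TV | 51.0 | 0.89916 | 151266 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 32935 | Haikyuu Karasuno Koukou VS Shiratorizawa Gakue... | TV | 10.0 | 0.897959 | 93351 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 820 | Ginga Eiyuu Densetsu | OVA | 110.0 | 0.893157 | 80679 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 15335 | Gintama Movie Kanketsuhen Yorozuya yo Eien Nare | Movie | 1.0 | 0.891957 | 72534 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |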
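
The name `Gintama039` in the preview comes from the raw title `Gintama&#039;`: the regular expression removes `&`, `#`, and `;` but keeps the digits of the HTML entity. One possible adjustment, shown here as a sketch rather than as part of the pipeline above, is to decode entities with the standard-library `html.unescape` before stripping punctuation. The helper name `clean_title` is hypothetical.

```python
import html
import re

def clean_title(raw: str) -> str:
    """Decode HTML entities, then keep only letters, digits, and spaces."""
    decoded = html.unescape(raw)  # '&#039;' becomes an apostrophe
    return re.sub(r"[^A-Za-z0-9 ]+", "", decoded)

print(clean_title("Gintama&#039;"))  # Gintama
print(clean_title("Steins;Gate"))    # SteinsGate
```

Note that stripping all punctuation can make distinct titles identical, so `anime_id` remains the safer key for joins.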
