<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Code_challenge.png" style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

### Unsupervised Learning Project: 2401FTDS Team JB1
© ExploreAI Academy

## 1. Project Overview
<a class="anchor" id="1-project-overview"></a>

## 2. Importing Packages
<a class = "anchor" id="2-importing-packages"></a>

In [1]:
# data processing
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Displays output inline
%matplotlib inline

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

---
## 3. Loading Data
The data for this project is stored in three files: anime.csv, train.csv, and test.csv. To make the data easier to manipulate and analyse, each file was loaded into a Pandas DataFrame with `pd.read_csv()` and named `anime_df`, `train_df`, and `test_df` respectively. `index_col=False` was passed so that Pandas does not use the first column of each file as the index.

In [2]:
# Load the datasets
train_df = pd.read_csv('train.csv', index_col=False)
test_df = pd.read_csv('test.csv', index_col=False)
anime_df = pd.read_csv('anime.csv', index_col=False)

# Display the first few rows of the anime dataset
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


<div class="alert alert-block alert-danger">
<b>To prevent unintended changes to the original data</b>, a copy of each DataFrame was made with the <code>.copy()</code> method and named <code>train_df_copy</code>, <code>test_df_copy</code>, and <code>anime_df_copy</code>.
</div>
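
As a minimal sketch of why the copy matters, the example below uses a small toy DataFrame (the values are illustrative and not taken from the project data) to show that changes made to a copy do not propagate back to the original.

```python
import pandas as pd

# Toy frame standing in for anime_df (values are illustrative only)
original = pd.DataFrame({"name": ["A", "B"], "rating": [9.1, None]})

# .copy() defaults to deep=True, so the copy owns its own data
working = original.copy()
working["rating"] = working["rating"].fillna(0.0)

# The original still holds the missing value; only the copy changed
original_missing = int(original["rating"].isnull().sum())
working_missing = int(working["rating"].isnull().sum())
print(original_missing, working_missing)
```

Working on the copies means a cleaning step can be re-run or reverted without reloading the CSV files.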

In [3]:
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()
anime_df_copy = anime_df.copy()

#### Inspect the Dataset

Examine the structure of the dataset to understand the types of data and identify any potential issues.

In [4]:
# Check the data types and missing values
print(train_df_copy.info())
print(test_df_copy.info())
print(anime_df_copy.info())

# Get a summary of the dataset
print(train_df_copy.describe(include='all'))
print(test_df_copy.describe(include='all'))
print(anime_df_copy.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5703554 entries, 0 to 5703553
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 130.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633686 entries, 0 to 633685
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   633686 non-null  int64
 1   anime_id  633686 non-null  int64
dtypes: int64(2)
memory usage: 9.7 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members 

In [5]:
anime_df_copy.shape

(12294, 7)

Result: The anime dataset consists of 12,294 rows (observations) and 7 columns (features).

Next, a look at the summary information of the anime DataFrame using `.info()`.

In [6]:
anime_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
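
One potential issue visible in this summary is that `episodes` is stored as `object` even though the values shown earlier look numeric. A possible way to investigate, sketched here on a toy Series, is `pd.to_numeric` with `errors="coerce"`. The `"Unknown"` placeholder below is an assumption made for illustration, not something confirmed by the output above.

```python
import pandas as pd

# Toy stand-in for the episodes column; the "Unknown" placeholder is an assumption
episodes = pd.Series(["1", "64", "Unknown", "24"], name="episodes", dtype="object")

# errors="coerce" turns anything non-numeric into NaN instead of raising
episodes_numeric = pd.to_numeric(episodes, errors="coerce")

# Count how many entries could not be parsed as numbers
unparsed = int(episodes_numeric.isnull().sum())
print(episodes_numeric.dtype, unparsed)
```

If the real column were handled this way, the coerced entries would show up as missing values and would need the same treatment as the other nulls.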


In [7]:
anime_df_copy.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
anime_id,12294.0,14058.221653,11455.294701,1.0,3484.25,10260.5,24794.5,34527.0
rating,12064.0,6.473902,1.026746,1.67,5.88,6.57,7.18,10.0
members,12294.0,18071.338864,54820.676925,5.0,225.0,1550.0,9437.0,1013917.0


---
## 4. Data Cleaning

<div class="alert alert-block alert-info">
We cleaned the data by handling missing values, checking for duplicate rows (none were found), and normalising the text columns by stripping whitespace and converting to lowercase. Finally, we validated and saved the cleaned data so that it does not need to be re-cleaned and can be relied on in later steps.
</div>


In [10]:
import pandas as pd

def check_and_clean_data(anime_df_copy):
    """
    Check and clean the dataset by:
    1. Printing the count of null values for each column.
    2. Printing the number of duplicate rows.
    3. Filling or dropping null values.
    4. Dropping duplicate rows.
    5. Performing basic preprocessing such as stripping whitespace and converting text to lowercase.

    Parameters:
    anime_df_copy (pandas.DataFrame): The DataFrame to check and clean.

    Returns:
    pandas.DataFrame: The cleaned DataFrame.
    """
    # Check for null values
    for column in anime_df_copy:
        null_count = anime_df_copy[column].isnull().sum()
        if null_count > 0:
            print(f'{column} has {null_count} null values')
            # Fill or drop null values as needed
            if anime_df_copy[column].dtype == 'object':
                anime_df_copy[column] = anime_df_copy[column].fillna('unknown')  # Filling null object columns with 'unknown'
            else:
                anime_df_copy[column] = anime_df_copy[column].fillna(anime_df_copy[column].median())  # Filling null numeric columns with median

    # Count and drop duplicate rows (use the DataFrame passed in, not the global anime_df)
    duplicate_count = anime_df_copy.duplicated().sum()
    print(f'The dataset has {duplicate_count} duplicate rows.')
    anime_df_copy = anime_df_copy.drop_duplicates().copy()

    # Basic preprocessing: strip whitespace and convert to lowercase
    for column in anime_df_copy.select_dtypes(include='object').columns:
        anime_df_copy[column] = anime_df_copy[column].str.strip().str.lower()

    return anime_df_copy

In [11]:
cleaned_anime_df = check_and_clean_data(anime_df_copy)

genre has 62 null values
type has 25 null values
rating has 230 null values
The dataset has 0 duplicate rows.


In [15]:
print(cleaned_anime_df.isnull().sum())

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64
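
The validate-and-save step mentioned at the start of this section could look roughly like the sketch below. It runs on a toy DataFrame. The helper `validate_clean` and the file name `anime_cleaned.csv` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def validate_clean(df):
    """Raise AssertionError if the frame still has nulls or duplicate rows."""
    assert df.isnull().sum().sum() == 0, "null values remain"
    assert not df.duplicated().any(), "duplicate rows remain"
    return True


# Toy cleaned frame (illustrative values, not the real dataset)
toy_clean = pd.DataFrame({
    "anime_id": [1, 2],
    "name": ["a", "b"],
    "rating": [7.5, 6.5],
})

is_valid = validate_clean(toy_clean)

# Hypothetical output path; a temporary directory keeps the sketch free of side effects
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "anime_cleaned.csv")
    toy_clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Confirm the saved file has the same shape and columns as the cleaned frame
same_shape = reloaded.shape == toy_clean.shape
same_columns = list(reloaded.columns) == list(toy_clean.columns)
print(is_valid, same_shape, same_columns)
```

In the notebook itself, the same checks would be applied to `cleaned_anime_df`, with the output written to a path of the team's choosing.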
