### The fields we need in our dataset to train and test our model:
| **Category**                    | **Purpose**                                                         | **Example Value**                        |
| ------------------------------- | ------------------------------------------------------------------- | ---------------------------------------- |
| `title`                         | Used for both input and generation target                           | *"The Catcher in the Rye"*               |
| `book_summary` or `description` | Used as input context to help the model understand book content     | *"A story about teenage rebellion..."*   |
| `genre` (can be multi-label)    | Target for genre classification task                                | *\["Fiction", "Young Adult", "Classic"]* |
| `rating` (numerical score)      | Target for regression (rating prediction)                           | *4.2* (average rating)                   |
| `author`                        | Identifies the book; also usable as metadata for deeper analysis or embeddings | *"J.D. Salinger"*                        |
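As a quick sanity check, the required fields in the table can be verified against any loaded file. The sketch below is only an illustration: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and it assumes the `description` spelling (rather than `book_summary`) is the one used after standardization.

```python
import pandas as pd

# Standard column names from the table above (assumes 'description', not 'book_summary').
REQUIRED_COLUMNS = ["title", "description", "genre", "rating", "author"]


def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Tiny illustrative frame built from the example values in the table; 'genre' is left out on purpose.
sample = pd.DataFrame({
    "title": ["The Catcher in the Rye"],
    "description": ["A story about teenage rebellion..."],
    "rating": [4.2],
    "author": ["J.D. Salinger"],
})

print(missing_columns(sample))
```

An empty list means the file already has every field needed for training.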


# 1.1 Here we test file integrity and basic data quality:
    Data folders:
    Raw data location : archive/raw/
    Cleaned data location : archive/cleaned/

## Data sources collected so far (new sources will be added here as they arrive):
    1. 1_book_details.csv


In [1]:
!python --version

Python 3.10.18


### <span style="color:blue"> 1. Now we work with 1_book_details.csv:</span>

Since this dataset is fairly small, we start by printing its first 5 rows to see what the data looks like.

In [2]:
import pandas as pd

# Read the CSV file
#df = pd.read_csv('archive/raw/1_book_details.csv')
#df = pd.read_csv('archive/raw/1.csv')
#df = pd.read_csv('archive/raw/2.csv')
#df = pd.read_csv('archive/raw/4.csv')
df = pd.read_csv('archive/cleaned/for_training.csv')
#df = pd.read_csv('archive/raw/6.csv')
#df = pd.read_csv('archive/raw/8.csv')
# df = pd.read_csv('archive/raw/9.csv')


# Display the first 5 items
print(df.head(5))

                                       title  \
0                           The Hunger Games   
1  Harry Potter and the Order of the Phoenix   
2                      To Kill a Mockingbird   
3                        Pride and Prejudice   
4                                   Twilight   

                                         description  \
0  WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...   
1  There is a door at the end of a silent corrido...   
2  The unforgettable novel of a childhood in a sl...   
3  Alternate cover edition of ISBN 9780679783268S...   
4  About three things I was absolutely positive.\...   

                                               genre  rating  \
0  ['Young Adult', 'Fiction', 'Dystopia', 'Fantas...    4.33   
1  ['Fantasy', 'Young Adult', 'Fiction', 'Magic',...    4.50   
2  ['Classics', 'Fiction', 'Historical Fiction', ...    4.28   
3  ['Classics', 'Fiction', 'Romance', 'Historical...    4.26   
4  ['Young Adult', 'Fantasy', 'Romance', 'Vampire...  
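The printed `genre` values look like Python list literals stored as text, which is what a list column usually becomes after a round trip through CSV. If that is the case here (an assumption; the notebook does not confirm it), a minimal sketch for turning them back into real lists could use `ast.literal_eval`. The `df_demo` frame and the `genre_list` column are illustrative names only.

```python
import ast

import pandas as pd

# Illustrative rows; the genre strings mimic the format of the printed output above.
df_demo = pd.DataFrame({
    "title": ["The Hunger Games", "Pride and Prejudice"],
    "genre": [
        "['Young Adult', 'Fiction', 'Dystopia']",
        "['Classics', 'Fiction', 'Romance']",
    ],
})

# Assumption: every genre cell is a valid Python list literal.
# literal_eval only parses literals, so it is safer than eval for file contents.
df_demo["genre_list"] = df_demo["genre"].apply(ast.literal_eval)

print(df_demo["genre_list"].iloc[0])
print(type(df_demo["genre_list"].iloc[0]))
```

Parsed lists are easier to feed into a multi-label encoder later than the raw strings.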

### Now we check how many values in each field are empty or null

In [8]:
import pandas as pd

# Read the CSV file
#df = pd.read_csv('archive/raw/1_book_details.csv')
#df = pd.read_csv('archive/raw/1.csv')
#df = pd.read_csv('archive/raw/2.csv')
#df = pd.read_csv('archive/raw/4.csv')
df = pd.read_csv('archive/cleaned/for_training.csv')
#df = pd.read_csv('archive/raw/6.csv')
# df = pd.read_csv('archive/raw/8.csv')

# Check for null data
null_counts = df.isnull().sum()

# Print the null counts
print(null_counts)

title             0
description    1338
genre             0
rating            0
author            0
dtype: int64
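Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both; it runs on a small made-up frame (`demo`) rather than on `for_training.csv`, so the numbers it prints are not the notebook's.

```python
import pandas as pd

# Made-up frame for illustration; two of the four descriptions are missing.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": ["text", None, "more text", None],
})

null_counts = demo.isnull().sum()
# The mean of a boolean mask is the fraction of True values, i.e. the missing share.
null_percent = demo.isnull().mean() * 100

report = pd.DataFrame({"null_count": null_counts, "null_percent": null_percent})
print(report)
```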


### Now we remove the fields that are not needed for model training
    We only need title, description, genre, rating, and author, so the other columns can be dropped from the training dataset.

    

In [72]:
import pandas as pd

# Read the CSV file
#df = pd.read_csv('archive/cleaned/1_book_details.csv')
#df = pd.read_csv('archive/raw/1.csv')
#df = pd.read_csv('archive/raw/2.csv')
#df = pd.read_csv('archive/raw/4.csv')
#df = pd.read_csv('archive/raw/5.csv')
#df = pd.read_csv('archive/raw/6.csv')
df = pd.read_csv('archive/raw/8.csv')

# Select the desired columns
desired_cols = ['bookTitle', 'bookDesc', 'bookGenres', 'bookRating', 'bookAuthors']

# Create a new DataFrame that includes only the desired columns
new_df = df[desired_cols]

# Save the new DataFrame to a new CSV file
#new_df.to_csv('archive/cleaned/1_book_details.csv', index=False)
#new_df.to_csv('archive/cleaned/1.csv', index=False)
#new_df.to_csv('archive/cleaned/2.csv', index=False)
#new_df.to_csv('archive/cleaned/4.csv', index=False)
#new_df.to_csv('archive/cleaned/5.csv', index=False)
#new_df.to_csv('archive/cleaned/6.csv', index=False)
new_df.to_csv('archive/cleaned/8.csv', index=False)

### We need to make sure the column headings use the same standard names in all files
    For example, some files use different heading names for the same field.

In [73]:
# Map this file's headings to the standard column names
new_df = new_df.rename(columns={
    'bookTitle': 'title',
    'bookDesc': 'description',
    'bookGenres': 'genre',
    'bookRating': 'rating',
    'bookAuthors': 'author',
})
# Keep the columns in the standard order
new_df = new_df[['title', 'description', 'genre', 'rating', 'author']]

# Save the new DataFrame to a new CSV file
#new_df.to_csv('archive/cleaned/1_book_details.csv', index=False)
#new_df.to_csv('archive/cleaned/1.csv', index=False)
#new_df.to_csv('archive/cleaned/2.csv', index=False)
#new_df.to_csv('archive/cleaned/4.csv', index=False)
#new_df.to_csv('archive/cleaned/5.csv', index=False)
#new_df.to_csv('archive/cleaned/6.csv', index=False)
new_df.to_csv('archive/cleaned/8.csv', index=False)

print('Sorted according to your preference.')

Sorted according to your preference.
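Since each raw file may use its own headings, the select, rename and reorder steps above could be wrapped in one reusable function. This is a sketch under assumptions: `COLUMN_ALIASES`, `STANDARD_ORDER` and `standardize_columns` are hypothetical names, only the `book*` headings come from the notebook, and aliases for other files would have to be added by hand. The `bookISBN` column and all values in `raw` are made up for the demo.

```python
import pandas as pd

# Hypothetical alias map; extend it with the headings found in other raw files.
COLUMN_ALIASES = {
    "bookTitle": "title",
    "bookDesc": "description",
    "bookGenres": "genre",
    "bookRating": "rating",
    "bookAuthors": "author",
}
STANDARD_ORDER = ["title", "description", "genre", "rating", "author"]


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known aliases, then keep only the standard columns in a fixed order."""
    renamed = df.rename(columns=COLUMN_ALIASES)
    missing = [col for col in STANDARD_ORDER if col not in renamed.columns]
    if missing:
        raise KeyError(f"missing columns after renaming: {missing}")
    return renamed[STANDARD_ORDER]


# Made-up raw frame with one extra column that should be discarded.
raw = pd.DataFrame({
    "bookAuthors": ["Example Author"],
    "bookTitle": ["Example Title"],
    "bookISBN": ["0000000000"],
    "bookDesc": ["Example description."],
    "bookGenres": ["['Fiction']"],
    "bookRating": [4.0],
})

standardized = standardize_columns(raw)
print(list(standardized.columns))
```

Raising an error when a column is missing makes an unmapped heading visible immediately instead of producing a silently incomplete file.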


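### <span style="color:Blue"> Now we eliminate rows that contain null values: </span>
We use the for_training.csv file. Only the description column contained nulls in the check above, so rows are dropped based on that column.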

In [6]:
# Load the dataset
df = pd.read_csv('archive/cleaned/for_training.csv')

# Drop rows where 'description' is null
df_cleaned = df.dropna(subset=['description'])

# Optional: Check if nulls are gone
print(df_cleaned.isnull().sum())

# Optional: Save cleaned dataset
df_cleaned.to_csv('archive/cleaned/for_training_no_nulls.csv', index=False)

title          0
description    0
genre          0
rating         0
author         0
dtype: int64
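`dropna` only removes true nulls, so a description that is an empty or whitespace-only string would survive the step above. Whether `for_training.csv` contains such rows is not known; the sketch below only shows, on a made-up frame, how they could be treated as missing too.

```python
import pandas as pd

# Made-up frame: one real description, one whitespace-only string, one null.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "description": ["A real summary.", "   ", None],
})

# Strip surrounding whitespace, then blank out strings that end up empty.
stripped = demo["description"].str.strip()
normalized = stripped.mask(stripped == "")

# Drop rows whose description is null or was blank.
demo_cleaned = demo.assign(description=normalized).dropna(subset=["description"])
print(demo_cleaned)
```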
