### The fields we need in our dataset to train and test our model:
| **Field**                       | **Purpose**                                                         | **Example Value**                        |
| ------------------------------- | ------------------------------------------------------------------- | ---------------------------------------- |
| `title`                         | Used for both input and generation target                           | *"The Catcher in the Rye"*               |
| `book_summary` or `description` | Used as input context to help the model understand book content     | *"A story about teenage rebellion..."*   |
| `genre` (can be multi-label)    | Target for genre classification task                                | *\["Fiction", "Young Adult", "Classic"]* |
| `rating` (numerical score)      | Target for regression (rating prediction)                           | *4.2* (average rating)                   |
| `author`                        | Needed to identify the book; also usable as metadata for deeper analysis or embeddings | *"J.D. Salinger"*                        |
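
The field list above can be turned into a quick check that a loaded file has every column we need. This is only a minimal sketch: the helper name `missing_columns` and the one-row sample frame are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Required columns, taken from the table above.
REQUIRED_COLUMNS = ["title", "description", "genre", "rating", "author"]

def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Illustrative one-row frame (hypothetical, not read from archive/raw/).
sample = pd.DataFrame({
    "title": ["The Catcher in the Rye"],
    "author": ["J.D. Salinger"],
    "rating": [4.2],
})

print(missing_columns(sample))  # ['description', 'genre']
```

An empty list means the file already has all the fields we need.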


# 1.1 Here we test file integrity and basic data quality:
    Data folders:
    Raw data location: archive/raw/
    Cleaned data location: archive/cleaned/

## Data sources collected so far (new sources will be added here as they arrive):
    1. 1_book_details.csv


In [1]:
!python --version

Python 3.10.18


### <span style="color:blue"> 1. Now we work with 1_book_details.csv:</span>

Since this dataset is fairly small, we start by printing the first 5 rows to see what it looks like.

In [1]:
import pandas as pd

# Read the CSV file
#df = pd.read_csv('archive/raw/1_book_details.csv')
df = pd.read_csv('archive/raw/1.csv')

# Display the first 5 items
print(df.head(5))

                    Title           Author  Rating  \
0               Divergent    Veronica Roth    4.15   
1           Catching Fire  Suzanne Collins    4.31   
2  The Fault in Our Stars       John Green    4.15   
3   To Kill a Mockingbird       Harper Lee    4.27   
4     The Lightning Thief     Rick Riordan    4.30   

                                         description  \
0  In Beatrice Prior's dystopian Chicago world, s...   
1  Sparks are igniting.Flames are spreading.And t...   
2  Despite the tumor-shrinking medical miracle th...   
3  The unforgettable novel of a childhood in a sl...   
4  Alternate cover for this ISBN can be found her...   

                                              genres  
0  Young Adult, Dystopia, Fantasy, Fiction, Scien...  
1  Young Adult, Dystopia, Fiction, Fantasy, Scien...  
2  Young Adult, Romance, Fiction, Contemporary, R...  
3  Classics, Fiction, Historical Fiction, School,...  
4  Fantasy, Young Adult, Mythology, Fiction, Midd...  
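
The `genres` column above holds one comma-separated string per book, while the genre classification target is multi-label. One possible way to bridge the two is to split each string into a list. This is a sketch under assumptions: the helper `split_genres` and the `genre_list` column name are hypothetical, and the demo row only reuses the genres visible in the output above.

```python
import pandas as pd

def split_genres(value) -> list:
    """Split a comma-separated genre string into a list of clean labels."""
    if pd.isna(value):
        return []  # a missing genre string becomes an empty label list
    return [g.strip() for g in str(value).split(",") if g.strip()]

# Illustrative frame; the second row stands in for a book with no genres.
demo = pd.DataFrame({"genres": ["Young Adult, Dystopia, Fantasy, Fiction", None]})
demo["genre_list"] = demo["genres"].apply(split_genres)

print(demo["genre_list"].tolist())
```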


### Now we check whether any field has inconsistent or null data

In [23]:
# Read the CSV file
df = pd.read_csv('archive/raw/1_book_details.csv')

# Check the data types of each column
dtypes = df.dtypes

# Print the data types
print(dtypes, "\n")

# Count the flagged cells in each column.
# Note: for string columns this counts every cell containing any character
# other than an ASCII letter or digit, so spaces and punctuation are flagged too.
inconsistent_counts = {}
for col, dtype in dtypes.items():
    if dtype == 'object':  # String columns: cells with a non-alphanumeric character
        inconsistent_counts[col] = df[col].str.contains('[^a-zA-Z0-9]').sum()
    elif dtype == 'int64':  # Integer columns: values that are not integers
        inconsistent_counts[col] = df[col].apply(lambda x: not pd.api.types.is_integer(x)).sum()
    elif dtype == 'float64':  # Float columns: values that are not floats
        inconsistent_counts[col] = df[col].apply(lambda x: not pd.api.types.is_float(x)).sum()

# Print the inconsistent counts
print(inconsistent_counts)

title             object
author            object
rating           float64
no_of_ratings      int64
no_of_reviews     object
description       object
genres            object
dtype: object 

{'title': np.int64(11740), 'author': np.int64(13221), 'rating': np.int64(0), 'no_of_ratings': np.int64(0), 'no_of_reviews': np.int64(13324), 'description': 13270, 'genres': 12001}
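
The dtype listing shows `no_of_reviews` stored as `object` even though it is a count. A narrower check than the regex above is to try parsing the column as numbers and look at what fails. This is a sketch only: the helper `non_numeric_mask` is hypothetical, and the demo values (including the thousands separator) are assumed, since the notebook does not show which entries in the real file are non-numeric.

```python
import pandas as pd

def non_numeric_mask(series: pd.Series) -> pd.Series:
    """True where a non-null value cannot be parsed as a number."""
    parsed = pd.to_numeric(series, errors="coerce")  # unparseable -> NaN
    return parsed.isna() & series.notna()

# Assumed example values, not taken from 1_book_details.csv.
demo = pd.Series(["120", "1,234", "45", None], name="no_of_reviews")
mask = non_numeric_mask(demo)

print(int(mask.sum()))      # number of unparseable entries
print(demo[mask].tolist())  # the offending raw values
```

Printing the offending values makes it easier to decide how to clean them than a bare count does.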


### Now we check how many cells are empty or contain null values

In [24]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('archive/raw/1_book_details.csv')

# Check for null data
null_counts = df.isnull().sum()

# Print the null counts
print(null_counts)

title              0
author             0
rating             0
no_of_ratings      0
no_of_reviews      0
description       51
genres           997
dtype: int64
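
The null counts are easier to judge as a share of all rows, and one option for the affected rows is to drop them. The notebook does not commit to that choice, so treat this as an assumption-laden sketch; the small demo frame is made up for illustration.

```python
import pandas as pd

# Made-up frame with the same kind of gaps as the output above.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": ["text", None, "text", "text"],
    "genres": ["Fiction", "Fantasy", None, "Classics"],
})

# Percentage of missing values per column.
null_share = demo.isnull().mean() * 100
print(null_share)

# One possible policy: keep only rows where both text fields are present.
cleaned = demo.dropna(subset=["description", "genres"])
print(len(demo), "->", len(cleaned), "rows")
```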


### Now we remove the fields that aren't necessary for training our model
    We only need title, description, genre, rating, and author, so we can discard the other columns from our training dataset.

    

In [5]:
import pandas as pd

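# Read the raw CSV file (the cleaned copy is written at the end of this cell)
df = pd.read_csv('archive/raw/1_book_details.csv')

# Select the desired columns; the raw file names the genre column 'genres'
desired_cols = ['title', 'description', 'genres', 'rating', 'author']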

# Create a new DataFrame that includes only the desired columns
new_df = df[desired_cols]

# Save the new DataFrame to a new CSV file
new_df.to_csv('archive/cleaned/1_book_details.csv', index=False)

### We need to make sure the column headings use the same standard names in all files
    For example, some files might use different heading names for the same field.

In [6]:
# Only 'genres' differs from our standard names, so it is the only column renamed
new_df = new_df.rename(columns={'genres': 'genre'})
new_df = new_df[['title', 'description', 'genre', 'rating', 'author']]
# Save the new DataFrame to a new CSV file
new_df.to_csv('archive/cleaned/1_book_details.csv', index=False)
print('Sorted according to your preference.')

Sorted according to your preference.
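
When more source files arrive, the same renaming can be driven by a shared alias table instead of a per-file `rename` call. This is a minimal sketch: `COLUMN_ALIASES` and `standardize_columns` are hypothetical names, and the alias list only covers variants that appear in this notebook (`genres`, `book_summary`, and capitalized headings).

```python
import pandas as pd

# Known variants mapped to our standard names (assumed list; extend per file).
COLUMN_ALIASES = {
    "genres": "genre",
    "book_summary": "description",
}

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case and trim the headings, then apply the alias table."""
    renamed = df.rename(columns=lambda c: c.strip().lower())
    return renamed.rename(columns=COLUMN_ALIASES)

# Headings as printed by the first preview cell above.
demo = pd.DataFrame(columns=["Title", "Author", "Rating", "description", "genres"])
print(list(standardize_columns(demo).columns))
# ['title', 'author', 'rating', 'description', 'genre']
```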


### <span style="color:Blue"> Done cleaning the file 1_book_details.csv</span>