
# Part 1: Loading and Cleaning with Pandas

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Set display options for Pandas
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

# Define column names
column_names = ["rating", "review_count", "isbn", "booktype", "author_url",
                "year", "genre_urls", "dir", "rating_count", "name"]

# Load the dataset with proper column names
df = pd.read_csv("/content/goodreads.csv", names=column_names, header=None)

# Inspect the data
print("First few rows of the dataset:")
print(df.head())

First few rows of the dataset:
   rating  review_count        isbn         booktype                                         author_url    year                                         genre_urls                                                dir  rating_count                                               name
0    4.40      136455.0  0439023483  good_reads:book  https://www.goodreads.com/author/show/153394.S...  2008.0  /genres/young-adult|/genres/science-fiction|/g...                dir01/2767052-the-hunger-games.html     2958974.0            The Hunger Games (The Hunger Games, #1)
1    4.41       16648.0  0439358078  good_reads:book  https://www.goodreads.com/author/show/1077326....  2003.0  /genres/fantasy|/genres/young-adult|/genres/fi...  dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...     1284478.0  Harry Potter and the Order of the Phoenix (Har...
2    3.56       85746.0  0316015849  good_reads:book  https://www.goodreads.com/author/show/941441.S...  2005.0  /genres/young-adult

In [3]:
print("\nDataset info:")
df.info()


Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rating        5998 non-null   float64
 1   review_count  5998 non-null   float64
 2   isbn          5523 non-null   object 
 3   booktype      5998 non-null   object 
 4   author_url    5998 non-null   object 
 5   year          5993 non-null   float64
 6   genre_urls    5938 non-null   object 
 7   dir           6000 non-null   object 
 8   rating_count  5998 non-null   float64
 9   name          5998 non-null   object 
dtypes: float64(4), object(6)
memory usage: 468.9+ KB


In [4]:
# Check for missing values
print("\nMissing values by column:")
print(df.isnull().sum())


Missing values by column:
rating            2
review_count      2
isbn            477
booktype          2
author_url        2
year              7
genre_urls       62
dir               0
rating_count      2
name              2
dtype: int64
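
The raw counts above are easier to compare as a share of the rows. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the small `sample` frame is made-up data standing in for `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up rows for illustration only (not taken from goodreads.csv)
sample = pd.DataFrame({
    "rating": [4.4, None, 3.5, 4.0],
    "isbn": ["0439023483", None, None, "0316015849"],
    "dir": ["a", "b", "c", "d"],
})

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(df)` on the real data would put `isbn` first: 477 missing out of 6000 rows is about 7.95%.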


In [5]:
# Handle missing values
df = df.dropna(subset=["year"]).copy()  # Drop rows with missing 'year'; work on an explicit copy

# Convert columns to numeric; invalid entries become NaN
int_columns = ["year", "rating_count", "review_count"]
for col in int_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop rows that failed conversion, then cast to int
df = df.dropna(subset=int_columns).copy()
df = df.astype({col: int for col in int_columns})

# Fill missing strings with empty values (assign back rather than chained inplace)
df["genre_urls"] = df["genre_urls"].fillna("")
df["isbn"] = df["isbn"].fillna("")

print("\nUpdated dataset info:")
df.info()


Updated dataset info:
<class 'pandas.core.frame.DataFrame'>
Index: 5993 entries, 0 to 5999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rating        5993 non-null   float64
 1   review_count  5993 non-null   int64  
 2   isbn          5993 non-null   object 
 3   booktype      5993 non-null   object 
 4   author_url    5993 non-null   object 
 5   year          5993 non-null   int64  
 6   genre_urls    5993 non-null   object 
 7   dir           5993 non-null   object 
 8   rating_count  5993 non-null   int64  
 9   name          5993 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 515.0+ KB



# Part 2: Parsing and Completing the Data Frame

In [6]:
def get_author(url):
    """
    Returns the last path segment of the author URL (for example '903.Homer'),
    with hyphens replaced by spaces. The numeric ID prefix is kept.
    """
    try:
        parts = url.split('/')
        author_name = parts[-1].replace('-', ' ') if parts[-1] else parts[-2].replace('-', ' ')
        return author_name
    except Exception as e:
        print(f"Error processing URL {url}: {e}")
        return None

# Apply the function to extract author names
df['author'] = df['author_url'].map(get_author)
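
The `author` values produced this way still carry the numeric Goodreads ID, as later outputs such as `903.Homer` show. If a plain name is wanted, one possible cleanup is sketched below. `clean_author_name` is a hypothetical helper, not part of the original notebook, and it assumes multi-word author slugs separate words with underscores, which the truncated URLs in the output do not confirm.

```python
import re

def clean_author_name(url):
    """Hypothetical helper: turn '.../author/show/903.Homer' into 'Homer'."""
    if not isinstance(url, str) or not url:
        return None  # covers NaN and empty strings
    slug = url.rstrip('/').split('/')[-1]   # last path segment
    slug = re.sub(r'^\d+\.', '', slug)      # drop the leading numeric ID and dot
    # Assumption: underscores separate the words of a name
    return slug.replace('_', ' ').strip()

print(clean_author_name("https://www.goodreads.com/author/show/903.Homer"))
```

It could be applied with `df['author_url'].map(clean_author_name)`. Keeping the ID in a separate column may still be useful, since two authors can share a name.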

In [7]:
def split_and_join_genres(url):
    """
    Extracts and cleans genre names from a genre_url string.
    """
    try:
        genres = url.strip().split('|')
        genre_names = [e.split('/')[-1].replace('-', ' ') for e in genres]
        return "|".join(genre_names)
    except Exception as e:
        print(f"Error processing genre URL {url}: {e}")
        return None

# Apply the function to clean genre information
df['genres'] = df['genre_urls'].map(split_and_join_genres)

# Drop the 'genre_urls' column
df.drop(columns=['genre_urls'], inplace=True)

# Save the cleaned dataframe
import os
os.makedirs('data', exist_ok=True)
df.to_csv("data/cleaned-goodreads.csv", index=False)
print("\nCleaned dataframe has been saved to 'data/cleaned-goodreads.csv'.")


Cleaned dataframe has been saved to 'data/cleaned-goodreads.csv'.
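
Because `genres` holds a pipe-delimited string per book, counting how often each genre occurs needs one row per genre. The sketch below shows one way to do that with `str.split` and `explode`. The `sample` frame is made-up data, and the empty string stands in for books whose `genre_urls` was missing and filled with `""`.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C"],
    "genres": ["fantasy|young adult", "fantasy|classics", ""],
})

genre_counts = (
    sample["genres"]
    .str.split("|")              # one list of genre names per book
    .explode()                   # one row per (book, genre) pair
    .loc[lambda s: s != ""]      # drop books that had no genre information
    .value_counts()
)
print(genre_counts)
```

The same chain should work on `df["genres"]` from the cleaned data frame.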


# Part 3: Grouping and Analysis

In [8]:
# Select books with negative publication years (works dated BCE)
negative_years = df[df['year'] < 0]
print("\nBooks with negative years:")
print(negative_years)
print("\nNumber of books with negative years:", len(negative_years))


Books with negative years:
      rating  review_count        isbn         booktype                                         author_url  year                                                dir  rating_count                                  name             author                                             genres
47      3.68          5785  0143039954  good_reads:book    https://www.goodreads.com/author/show/903.Homer  -800                        dir01/1381.The_Odyssey.html        560248                           The Odyssey          903.Homer  classics|fiction|poetry|fantasy|mythology|acad...
246     4.01           365  0147712556  good_reads:book    https://www.goodreads.com/author/show/903.Homer  -800              dir03/1375.The_Iliad_The_Odyssey.html         35123                 The Iliad/The Odyssey          903.Homer  classics|fantasy|mythology|fantasy|academic|sc...
455     3.85          1499  0140449140  good_reads:book    https://www.goodreads.com/author/show/879.Plato  -380  

In [9]:
# Group by 'author' and analyze statistics
dfgb_author = df.groupby('author')
author_stats = dfgb_author[['rating', 'rating_count', 'review_count', 'year']].describe()
print("\nAuthor statistics summary:")
print(author_stats)


Author statistics summary:
                            rating                                                       rating_count                                                                               review_count                                                                  year                                                                  
                             count      mean       std   min    25%    50%     75%   max        count          mean            std      min       25%      50%        75%       max        count    mean          std     min      25%     50%      75%      max count         mean        std     min      25%     50%      75%     max
author                                                                                                                                                                                                                                                                                                                    
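
`describe()` yields eight statistics for each of the four columns, which makes the printed table very wide. When only a few per-author figures are needed, named aggregation gives a narrower result. The sketch below uses made-up data, and the output column names (`n_books`, `mean_rating`, `total_ratings`) are illustrative choices rather than names from the original notebook.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "author": ["A", "A", "B"],
    "rating": [4.0, 3.0, 5.0],
    "rating_count": [100, 300, 50],
})

author_summary = (
    sample.groupby("author")
    .agg(
        n_books=("rating", "size"),            # number of books per author
        mean_rating=("rating", "mean"),        # average rating across those books
        total_ratings=("rating_count", "sum"), # total number of ratings received
    )
    .sort_values("n_books", ascending=False)
)
print(author_summary)
```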

In [10]:
# Group by 'year' to find the best-rated book each year
best_books = df.loc[df.groupby('year')['rating'].idxmax()]
print("\nBest-rated books for each year:")
print(best_books)


Best-rated books for each year:
      rating  review_count        isbn         booktype                                         author_url  year                                                dir  rating_count                              name                              author                                             genres
1398    3.60          1644  0141026286  good_reads:book  https://www.goodreads.com/author/show/5158478.... -1500             dir14/19351.The_Epic_of_Gilgamesh.html         42026             The Epic of Gilgamesh                   5158478.Anonymous  religion|literature|ancient|academic|read for ...
246     4.01           365  0147712556  good_reads:book    https://www.goodreads.com/author/show/903.Homer  -800              dir03/1375.The_Iliad_The_Odyssey.html         35123             The Iliad/The Odyssey                           903.Homer  classics|fantasy|mythology|fantasy|academic|sc...
1397    4.03           890  0192840509  good_reads:book  https://www.g
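
Note that `idxmax` returns only the first row that reaches the maximum, so a year in which several books share the top rating contributes a single book. If ties should be kept, one option is to compare each rating with its year's maximum, as in this sketch on made-up data.

```python
import pandas as pd

# Made-up rows for illustration only; 2001 has a tie for the top rating
sample = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002],
    "name": ["A", "B", "C", "D"],
    "rating": [4.5, 4.5, 3.9, 4.1],
})

# Broadcast each year's maximum rating back onto every row of that year
max_per_year = sample.groupby("year")["rating"].transform("max")

# Keep every book whose rating equals its year's maximum
best_with_ties = sample[sample["rating"] == max_per_year]
print(best_with_ties)
```

A further refinement, not shown here, would be to require a minimum `rating_count` so that books with very few ratings do not dominate a year.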

In [11]:
# Save best-rated books by year to a CSV file
best_books.to_csv("data/best_books_by_year.csv", index=False)
print("\nBest-rated books by year have been saved to 'data/best_books_by_year.csv'.")


Best-rated books by year have been saved to 'data/best_books_by_year.csv'.
