# An Exploration of Data Scraped from novelupdates.com

## Columns needed to find which country has the most popular novels on average
- Title
- Rank
- StarRating
- NumberOfRatings
- NumberOfReaders
- CountryOfOrigin

## Columns needed to find out why
- NumberOfChapters
- ReleaseFrequency(Days)
- LastChapterDateOfPublication
- Genres
- CountryOfOrigin
- Genres in comparison to ratings
- Genres in comparison to Country of Origin

In [65]:
import duckdb

In [66]:
con = duckdb.connect()
con.execute("DROP TABLE IF EXISTS novel_table")
con.execute("DROP TABLE IF EXISTS country_novel_table")
table = """
    CREATE TABLE novel_table AS
    SELECT * FROM 'webnovel_2025_analysis.csv'
"""
con.execute(table)

<duckdb.duckdb.DuckDBPyConnection at 0x1182250f0>

Test the load by grabbing the first ten rows

In [67]:
print(con.execute("SELECT * FROM novel_table LIMIT 10").fetchdf())

  Rank                                              Title CountryOfOrigin  \
0   #1                        T*ash of the Count’s Family              KR   
1   #2      The Death Mage Who Doesn’t Want a Fourth Time              JP   
2   #3                        Everyone Else is a Returnee              KR   
3   #4                  Tsuki ga Michibiku Isekai Douchuu              JP   
4   #5                             Kumo Desu ga, Nani ka?              JP   
5   #6                Tensei Shitara Slime Datta Ken (WN)              JP   
6   #7              My Death Flags Show No Sign of Ending              JP   
7   #8                           The Founder of Diabolism              CN   
8   #9              The Scum Villain’s Self-Saving System              CN   
9  #10  Death March kara Hajimaru Isekai Kyusoukyoku (WN)              JP   

   StarRating NumberOfChapters  ReleaseFrequency(Days)  NumberOfReaders  \
0         4.5    1101 Chapters                     3.7            40397   
1 

Check if any titles don't have a country of origin

In [68]:
print(con.execute("SELECT Rank, Title, CountryOfOrigin FROM novel_table WHERE CountryOfOrigin IS NULL").fetchdf())

Empty DataFrame
Columns: [Rank, Title, CountryOfOrigin]
Index: []


Check if any Country of Origin is outside of China, Korea, or Japan

In [69]:
print(con.execute("SELECT Rank, Title, CountryOfOrigin FROM novel_table WHERE CountryOfOrigin NOT IN ('KR', 'CN', 'JP')").fetchdf())

       Rank                                              Title CountryOfOrigin
0     #1016                                     Fields of Gold              MY
1     #1091  Villain Heal: The Villainess’s Plan to Heal a ...              TH
2     #1770     My New Life, Won’t You Please Become Peaceful!              TH
3     #2200            Baby Princess Through the Status Window              ID
4     #2201            Baby Princess Through the Status Window              ID
..      ...                                                ...             ...
116  #26838                                    Umadevi Diamond              TH
117  #26934                                         Moonflower             FIL
118  #27267                        Tấm Cám – An Ill-fated Tale              VN
119  #27443             Vengeance is Mine, All Others Pay Cash              ID
120  #28692                              Dear Future Boyfriend             FIL

[121 rows x 3 columns]


The query returns 121 rows (indexes 0 through 120), so there are 121 entries from outside JP, KR, and CN that we should account for.
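The output above also lists one title under two adjacent ranks (#2200 and #2201), so a row count may overstate the number of distinct novels. Below is a minimal sketch of a duplicate check in plain Python. The rows are copied from the output above, and the helper name `duplicate_titles` is hypothetical, not part of the original notebook.

```python
from collections import Counter

# (rank, title, country) rows copied from the query output above.
rows = [
    ("#1016", "Fields of Gold", "MY"),
    ("#2200", "Baby Princess Through the Status Window", "ID"),
    ("#2201", "Baby Princess Through the Status Window", "ID"),
    ("#26934", "Moonflower", "FIL"),
]


def duplicate_titles(rows):
    """Return {title: count} for titles that appear on more than one row."""
    counts = Counter(title for _rank, title, _country in rows)
    return {title: n for title, n in counts.items() if n > 1}


# Rows per country, to see which origins fall outside JP, KR, and CN.
rows_per_country = Counter(country for _rank, _title, country in rows)

print(duplicate_titles(rows))
print(rows_per_country.most_common())
```

In SQL, the same idea would likely be a `GROUP BY Title` with `HAVING COUNT(*) > 1`. Whether two rows with the same title are true duplicates or separate listings would still need a manual look.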

In [70]:
# Keep only novels from Korea, China, and Japan
con.execute("CREATE TABLE country_novel_table AS SELECT * FROM 'webnovel_2025_analysis.csv' WHERE CountryOfOrigin IN ('KR', 'CN', 'JP')")

Next, check for novels with fewer than 5 reviews. Their ratings are likely to be unreliable, since so few reviews offer little protection against extreme values.

In [71]:
print(con.execute("SELECT Title, NumberOfReviews FROM country_novel_table WHERE NumberOfReviews < 5").fetchdf())

                                                   Title  NumberOfReviews
0           The Obsessive Second Male Lead Has Gone Wild                4
1              I’m Worried that My Brother is Too Gentle                3
2                                           Inma no Hado                4
3      Elf no Kuni no Kyuutei Madoushi ni Naretanode,...                3
4                                         Gang of Yuusha                2
...                                                  ...              ...
19970  Tracing the Origins of the System: From Gu Mas...                0
19971                            The Perfect Show Window                0
19972  Immortality: After Living for Ten Thousand Yea...                0
19973  Gundam: Changing the World Even with a Mass-Pr...                1
19974  Classmate wa Isekai de Yuusha ni Natta kedo, O...                0

[19975 rows x 2 columns]


Since 19,975 of the roughly 28,700 novels have fewer than 5 reviews, we cannot simply drop them from the analysis. Instead, we create a column holding star rating times number of reviews, which gives each novel's total stars. Summing this column and dividing by the summed number of reviews gives an overall star rating per review, so each novel is weighted by its review count rather than counted equally.
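To make the pooling concrete, here is a minimal sketch in plain Python. The rows are made-up numbers, not values from the dataset, and the variable names are only illustrative. It computes total stars divided by total reviews per country, which is the quantity described above.

```python
from collections import defaultdict

# Hypothetical (country, star_rating, number_of_reviews) rows, not from the dataset.
rows = [
    ("KR", 4.5, 600),
    ("KR", 5.0, 1),
    ("JP", 4.0, 300),
    ("JP", 1.0, 2),
]

total_stars = defaultdict(float)
total_reviews = defaultdict(int)
for country, rating, reviews in rows:
    # A novel with 0 reviews adds nothing to either sum.
    total_stars[country] += rating * reviews
    total_reviews[country] += reviews

# Pooled rating per review; skip countries with no reviews at all.
pooled = {
    country: total_stars[country] / total_reviews[country]
    for country in total_stars
    if total_reviews[country] > 0
}

for country, rating in sorted(pooled.items()):
    print(country, round(rating, 2))
```

In this toy data, the single 5.0 rating and the two 1.0 ratings barely move the pooled values, whereas a plain average of per-novel ratings would give them the same weight as novels with hundreds of reviews.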

In [72]:
query = "ALTER TABLE country_novel_table ADD COLUMN TotalStars DECIMAL(10, 2)"
con.execute(query)

<duckdb.duckdb.DuckDBPyConnection at 0x1182250f0>

In [73]:
query = "UPDATE country_novel_table SET TotalStars = NumberOfReviews * StarRating"
con.execute(query)

<duckdb.duckdb.DuckDBPyConnection at 0x1182250f0>

Testing the new column

In [74]:
print(con.execute("SELECT Title, TotalStars FROM country_novel_table ORDER BY TotalStars DESC LIMIT 100").fetchdf())

                                                Title  TotalStars
0                  Quickly Wear the Face of the Devil      3343.5
1                         T*ash of the Count’s Family      2718.0
2                            The Founder of Diabolism      2613.2
3   The Rebirth of the Malicious Empress of Milita...      2603.6
4                          Heaven Official’s Blessing      2359.4
..                                                ...         ...
95                   The Legendary Moonlight Sculptor       834.2
96                               Please Confess to Me       832.6
97                                     Coiling Dragon       832.5
98  The Reborn Otaku’s Code of Practice for the Ap...       827.2
99                                         Ze Tian Ji       824.0

[100 rows x 2 columns]


Confirm Number of Readers is never 0

In [75]:
print(con.execute("SELECT Title, NumberOfReaders FROM country_novel_table WHERE NumberOfReaders == 0").fetchdf())

Empty DataFrame
Columns: [Title, NumberOfReaders]
Index: []


A known issue with NovelUpdates is that its recorded chapter counts are often inaccurate. We should test how widespread the problem is.

In [76]:
print(con.execute("SELECT Title, NumberOfChapters FROM country_novel_table WHERE NumberOfChapters == 0").fetchdf())

ConversionException: Conversion Error: Could not convert string '1101 Chapters' to INT32

LINE 1: ... Title, NumberOfChapters FROM country_novel_table WHERE NumberOfChapters == 0
                                                                   ^

Here I found that the chapter data includes the string " Chapters" after the number, so the column is read as text and cannot be compared to a number directly.
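One way to handle such values, sketched here in plain Python for spot checks outside the database, is to pull the leading integer out of the string and return `None` for anything that does not match. The helper name `parse_chapter_count` and the accepted formats are assumptions based on the single sample value in the error message (`'1101 Chapters'`).

```python
import re
from typing import Optional

# Matches values shaped like "1101 Chapters"; the suffix is treated as optional.
_CHAPTER_PATTERN = re.compile(r"^\s*(\d+)(?:\s+Chapters?)?\s*$")


def parse_chapter_count(raw: Optional[str]) -> Optional[int]:
    """Return the chapter count as an int, or None if the value does not parse."""
    if raw is None:
        return None
    match = _CHAPTER_PATTERN.match(raw)
    return int(match.group(1)) if match else None


print(parse_chapter_count("1101 Chapters"))
print(parse_chapter_count("N/A"))
```

Returning `None` instead of raising keeps unexpected formats visible as missing values rather than stopping the run.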

In [None]:
query = "UPDATE country_novel_table SET NumberOfChapters = REPLACE(NumberOfChapters, ' Chapters', '')"
con.execute(query)

<duckdb.duckdb.DuckDBPyConnection at 0x1045df0f0>

In [None]:
print(con.execute("SELECT Title, NumberOfChapters FROM country_novel_table WHERE NumberOfChapters == 0").fetchdf())

                                                  Title NumberOfChapters
0                           Everyone Else is a Returnee                0
1                         Omniscient Reader’s Viewpoint                0
2                              The Book Eating Magician                0
3                  Infinite Competitive Dungeon Society                0
4                            I Reincarnated For Nothing                0
...                                                 ...              ...
2257  Love Live! Nijigasaki Gakuen School Idol Club ...                0
2258                                One Hundred Stories                0
2259  Love Live! Nijigasaki Gakuen School Idol Club ...                0
2260  Tracing the Origins of the System: From Gu Mas...                0
2261                            The Perfect Show Window                0

[2262 rows x 2 columns]


Now that the chapter counts are cleaned, we can see 2262 rows with a recorded chapter count of 0, so at least that many entries have inaccurate chapter data on this site.

Next we will check the format of the date column

In [77]:
print(con.execute("SELECT Title, LastChapterDateOfPublication FROM country_novel_table").fetchdf())

                                                   Title  \
0                            T*ash of the Count’s Family   
1                            Everyone Else is a Returnee   
2                                        Dungeon Defense   
3                          The Second Coming of Gluttony   
4                          Omniscient Reader’s Viewpoint   
...                                                  ...   
28698  Tracing the Origins of the System: From Gu Mas...   
28699                            The Perfect Show Window   
28700  Immortality: After Living for Ten Thousand Yea...   
28701  Gundam: Changing the World Even with a Mass-Pr...   
28702  Classmate wa Isekai de Yuusha ni Natta kedo, O...   

      LastChapterDateOfPublication  
0                       02-25-2025  
1                       02-01-2021  
2                       11-26-2017  
3                       07-09-2019  
4                       07-03-2018  
...                            ...  
28698               

This showed that there are also dates stored as N/A. We should replace these with NULL.

In [79]:
query = "UPDATE country_novel_table SET LastChapterDateOfPublication = NULL WHERE LastChapterDateOfPublication == 'N/A'"
con.execute(query)

<duckdb.duckdb.DuckDBPyConnection at 0x1182250f0>

In [81]:
print(con.execute("SELECT Title, LastChapterDateOfPublication FROM country_novel_table WHERE LastChapterDateOfPublication IS NULL").fetchdf())

                                                  Title  \
0                  Infinite Competitive Dungeon Society   
1                                     Remarried Empress   
2                                              Breakers   
3                                         Taming Master   
4                                    FFF-Class Trashero   
...                                                 ...   
1882                                One Hundred Stories   
1883  Love Live! Nijigasaki Gakuen School Idol Club ...   
1884                                One Hundred Stories   
1885  Love Live! Nijigasaki Gakuen School Idol Club ...   
1886  Tracing the Origins of the System: From Gu Mas...   

     LastChapterDateOfPublication  
0                            None  
1                            None  
2                            None  
3                            None  
4                            None  
...                           ...  
1882                         None  
188

This means there are 1887 rows (indexes 0 through 1886) where the publication date of the last chapter is unknown. For future reference, this data was re-collected with a web scraper on February 28th, 2025.
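The dates printed above look like `MM-DD-YYYY` strings (for example `02-25-2025`), with `N/A` marking missing values. As a rough sketch, assuming that format holds for every row, the hypothetical helpers below parse one value and measure how long a novel had gone without a new chapter on the scrape date.

```python
from datetime import date, datetime
from typing import Optional

SCRAPE_DATE = date(2025, 2, 28)  # scrape date stated in this notebook


def parse_last_chapter_date(raw: Optional[str]) -> Optional[date]:
    """Parse 'MM-DD-YYYY' into a date; treat None, 'N/A', or bad input as missing."""
    if raw is None or raw.strip() == "N/A":
        return None
    try:
        return datetime.strptime(raw.strip(), "%m-%d-%Y").date()
    except ValueError:
        return None


def days_since_last_chapter(raw: Optional[str]) -> Optional[int]:
    """Days between the last chapter and the scrape date, or None if unknown."""
    parsed = parse_last_chapter_date(raw)
    return None if parsed is None else (SCRAPE_DATE - parsed).days


print(days_since_last_chapter("02-25-2025"))
print(days_since_last_chapter("N/A"))
```

A value like this could later help with the release-schedule question, for instance by separating novels that are still updating from those that have stalled.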

From here we should be able to investigate the initial question and all but the genre-related questions. Relating genres to ratings is difficult because a book can have many genres, so I am leaving that for later work if it proves necessary.