# Exploratory data analysis

## Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy import stats

warnings.filterwarnings('ignore')
%matplotlib inline

## Loading the datasets

In [3]:
books_df = pd.read_csv('data/Books.csv', encoding='ISO-8859-1')
ratings_df = pd.read_csv('data/Ratings.csv', encoding='ISO-8859-1')
users_df = pd.read_csv('data/Users.csv', encoding='ISO-8859-1')

In [4]:
books_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Book Dataset Columns:

- **ISBN**: A unique identifier for each book, typically used to track and reference books in databases and libraries.
- **Book-Title**: The title of the book.
- **Book-Author**: The author(s) of the book, indicating the writer(s) responsible for the content.
- **Year-Of-Publication**: The year when the book was officially published.
- **Publisher**: The publishing company responsible for the book's distribution.
- **Image-URL-S**: URL for the small-sized image of the book's cover.
- **Image-URL-M**: URL for the medium-sized image of the book's cover.
- **Image-URL-L**: URL for the large-sized image of the book's cover.
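Since **ISBN** is meant to identify each book uniquely, it may be worth checking that the column really behaves like a key. The sketch below is one possible check, not part of the original notebook: it tests uniqueness and validates the ISBN-10 check digit. The helper `is_valid_isbn10` and the sample series are hypothetical; the first three ISBNs come from the outputs in this notebook and the last one is deliberately corrupted for illustration. On the real data the same calls would be applied to `books_df['ISBN']`.

```python
import pandas as pd


def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` is a 10-character ISBN-10 with a valid check digit."""
    isbn = isbn.strip().upper()
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in "0123456789":
            value = int(char)
        elif char == "X" and position == 9:
            value = 10  # 'X' stands for 10 and is only allowed as the check digit
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


# Stand-in for books_df['ISBN']; the last entry is an invented invalid ISBN.
sample_isbns = pd.Series(["0195153448", "0002005018", "034545104X", "0195153449"])

is_unique = sample_isbns.is_unique
valid_mask = sample_isbns.apply(is_valid_isbn10)

print("All ISBNs unique:", is_unique)
print("Invalid ISBNs:", sample_isbns[~valid_mask].tolist())
```

Rows that fail the check are not necessarily wrong books; they may be typos or non-ISBN-10 identifiers that deserve a closer look.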

In [5]:
ratings_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings Dataset Columns:

- **User-ID**: A unique identifier for each user who has rated a book.
- **ISBN**: The unique identifier for a book, used to reference the specific book rated by the user.
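- **Book-Rating**: The rating given by the user for a specific book. The sample above contains the values 0, 3, 5, and 6, so the scale includes 0 and extends beyond 5; the full range should be confirmed from the data.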
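
One way to confirm the rating scale is to tabulate the distinct values of **Book-Rating**. The sketch below uses a small stand-in frame, `sample_ratings`, built from the five rows printed by `ratings_df.head()`; on the real data the same calls would run on `ratings_df`.

```python
import pandas as pd

# Stand-in for ratings_df: the five rows shown by ratings_df.head().
sample_ratings = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X", "0521795028"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

# Count how often each rating value occurs, ordered by rating value.
rating_counts = sample_ratings["Book-Rating"].value_counts().sort_index()
print(rating_counts)

# Observed lower and upper bounds of the scale.
rating_min = sample_ratings["Book-Rating"].min()
rating_max = sample_ratings["Book-Rating"].max()
print(f"Observed rating range: {rating_min} to {rating_max}")
```

A large share of zeros would be worth investigating separately, since a 0 may not mean the same thing as a low rating.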

In [6]:
users_df.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Users Dataset Columns:

- **User-ID**: A unique identifier for each user in the dataset.
- **Location**: The geographical location (e.g., city, state, or country) of the user.
- **Age**: The age of the user, representing the user's age at the time of data collection.

In [19]:
# Collect the DataFrames in a dict so each check can loop over them by name
datasets = {'Books': books_df, 'Ratings': ratings_df, 'Users': users_df}

## Shapes of the datasets

In [20]:
for name, df in datasets.items():
    print(f"{name} Dataset Shape: {df.shape}")

Books Dataset Shape: (271360, 8)
Ratings Dataset Shape: (1149780, 3)
Users Dataset Shape: (278858, 3)


### Findings:

- **Ratings Dataset**: With 1,149,780 rows, it is roughly four times larger than either the `Users` (278,858 rows) or `Books` (271,360 rows) dataset, which suggests that many users rated more than one book and many books received more than one rating.
- **Books Dataset**: Has 271,360 rows, or about 4.2 rating rows per book row on average.
- **Users Dataset**: Has 278,858 rows, or about 4.1 rating rows per user row on average. Row counts alone do not show how ratings are spread across users and books; that requires grouping the ratings by `User-ID` and `ISBN`.
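
The shape comparison can be backed up by counting ratings per user and per book directly. The following is a minimal sketch on a stand-in frame, `sample_ratings`, built from the five rows of `ratings_df.head()`; the variable names are illustrative, and on the real data `ratings_df` would take its place.

```python
import pandas as pd

# Stand-in for ratings_df: the five rows shown by ratings_df.head().
sample_ratings = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X", "0521795028"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

# Number of rating rows contributed by each user and received by each book.
ratings_per_user = sample_ratings.groupby("User-ID").size()
ratings_per_book = sample_ratings.groupby("ISBN").size()

n_rating_users = sample_ratings["User-ID"].nunique()
print("Distinct users with at least one rating:", n_rating_users)
print("Mean ratings per rating user:", ratings_per_user.mean())
print("Max ratings by a single user:", ratings_per_user.max())
print("Max ratings for a single book:", ratings_per_book.max())
```

Comparing `n_rating_users` with the number of rows in `users_df` would also show how many listed users never rated anything.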

## Dataset Info

In [18]:
for name, df in datasets.items():
    print(f"\n===== {name} Dataset Info =====")
    df.info()
    print('-'*60)


===== Books Dataset Info =====
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB
------------------------------------------------------------

===== Ratings Dataset Info =====
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      114978

### Findings:

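- **Books Dataset**: 271,360 entries, 8 columns, all stored as `object`. This includes **Year-Of-Publication**, which will need converting to a numeric type before any year-based analysis. A few values are missing in **Book-Author** (2), **Publisher** (2), and **Image-URL-L** (3).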
- **Ratings Dataset**: 1,149,780 entries, 3 columns. No missing values.
- **Users Dataset**: 278,858 entries, 3 columns. **Age** has missing values for 110,762 users.
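
Because **Year-Of-Publication** is stored as `object`, a numeric conversion is a natural next step. The sketch below shows one way to do it with `pd.to_numeric`, turning anything unparseable into a missing value. The sample frame is hypothetical: the first three years come from `books_df.head()`, and the `"not a year"` entry is invented purely to show the coercion.

```python
import pandas as pd

# Stand-in for books_df; the last value is an invented non-numeric entry.
sample_books = pd.DataFrame(
    {"Year-Of-Publication": ["2002", "2001", "1991", "not a year"]}
)

# Convert to numbers; values that cannot be parsed become NaN instead of raising.
years = pd.to_numeric(sample_books["Year-Of-Publication"], errors="coerce")

# Count how many entries failed to convert so they can be inspected.
n_invalid = int(years.isna().sum())
print("Unparseable years:", n_invalid)

# Use the nullable integer dtype so missing years stay missing rather than becoming floats.
sample_books["Year-Of-Publication"] = years.astype("Int64")
print(sample_books.dtypes)
```

Inspecting the rows where the conversion fails is usually more informative than dropping them outright, since they may point to shifted columns in the raw CSV.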

### Null values

In [24]:
for name, df in datasets.items():
    print(f"{name} Dataset's null value counts:")
    print(df.isnull().sum())
    print('-'*60)

Books Dataset's null value counts:
ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64
------------------------------------------------------------
Ratings Dataset's null value counts:
User-ID        0
ISBN           0
Book-Rating    0
dtype: int64
------------------------------------------------------------
Users Dataset's null value counts:
User-ID          0
Location         0
Age         110762
dtype: int64
------------------------------------------------------------


In [None]:
for name, df in datasets.items():
    print(f"\n{name} Dataset - Percentage of Null Values:")
    null_percentage = (df.isnull().mean() * 100).round(2)   # Calculates and rounds to 2 decimal places
    null_percentage = null_percentage.astype(str) + ' %'
    print(null_percentage)
    print('-' * 60)


Books Dataset - Percentage of Null Values:
ISBN                   0.0 %
Book-Title             0.0 %
Book-Author            0.0 %
Year-Of-Publication    0.0 %
Publisher              0.0 %
Image-URL-S            0.0 %
Image-URL-M            0.0 %
Image-URL-L            0.0 %
dtype: object
------------------------------------------------------------

Ratings Dataset - Percentage of Null Values:
User-ID        0.0 %
ISBN           0.0 %
Book-Rating    0.0 %
dtype: object
------------------------------------------------------------

Users Dataset - Percentage of Null Values:
User-ID       0.0 %
Location      0.0 %
Age         39.72 %
dtype: object
------------------------------------------------------------


- The **Age** feature in the Users dataset is missing for 110,762 of 278,858 users (39.72%), too many to fill in reliably with a single imputed value such as the mean or median.
- Instead of predicting exact ages, we will **categorize users into age groups (bins)**, which is less sensitive to errors in individual ages.
- This keeps the dataset usable while preserving broad patterns in user demographics.
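
The binning idea above can be sketched with `pd.cut`. The bin edges, the labels, and the extra `"Unknown"` group for missing ages are assumptions made for illustration, not choices taken from the notebook. The sample frame mirrors the five rows of `users_df.head()`.

```python
import numpy as np
import pandas as pd

# Stand-in for users_df: the ages shown by users_df.head().
sample_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Age": [np.nan, 18.0, np.nan, 17.0, np.nan],
})

# Assumed bin edges and labels; with right=False each bin is [lower, upper).
age_bins = [0, 18, 25, 35, 50, 120]
age_labels = ["<18", "18-24", "25-34", "35-49", "50+"]

age_group = pd.cut(sample_users["Age"], bins=age_bins, labels=age_labels, right=False)

# Missing ages (and any age outside 0-119) fall into no bin; give them an explicit group.
sample_users["Age-Group"] = age_group.cat.add_categories("Unknown").fillna("Unknown")

print(sample_users)
print(sample_users["Age-Group"].value_counts())
```

Keeping missing ages as their own category avoids inventing values for about 40% of users, and it lets later analysis check whether users without an age behave differently.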

In [29]:
users_df.tail()

Unnamed: 0,User-ID,Location,Age
278853,278854,"portland, oregon, usa",
278854,278855,"tacoma, washington, united kingdom",50.0
278855,278856,"brampton, ontario, canada",
278856,278857,"knoxville, tennessee, usa",
278857,278858,"dublin, n/a, ireland",


In [37]:
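# Count locations that split into exactly three comma-separated parts
# (city, state, country) versus those that do not.
three_part_count, other_count = 0, 0
for parts in users_df['Location'].str.split(','):
    if len(parts) == 3:
        three_part_count += 1
    else:
        other_count += 1

print((three_part_count, other_count))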

(277348, 1510)
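
Since almost all locations split into exactly three parts, one possible next step is to separate them into city, state, and country columns. The sketch below does the same count without an explicit loop and then splits only the well-formed rows. The column names `City`, `State`, and `Country` are assumptions, and the four-part location in the sample is invented to stand in for the irregular rows.

```python
import pandas as pd

# Stand-in for users_df; the last location is an invented four-part example.
sample_users = pd.DataFrame({
    "Location": [
        "nyc, new york, usa",
        "stockton, california, usa",
        "dublin, n/a, ireland",
        "st. louis, missouri, somewhere, usa",
    ]
})

# Number of comma-separated parts in each location, computed without a Python loop.
n_parts = sample_users["Location"].str.split(",").str.len()
is_three_part = n_parts == 3
counts = (int(is_three_part.sum()), int((~is_three_part).sum()))
print(counts)

# Split only the well-formed rows and strip the spaces left after each comma.
split_cols = sample_users.loc[is_three_part, "Location"].str.split(",", expand=True)
split_cols.columns = ["City", "State", "Country"]
split_cols = split_cols.apply(lambda col: col.str.strip())
print(split_cols)
```

Irregular rows are left out here so they can be inspected by hand. Placeholder values such as `n/a` also remain as plain strings and may need to be mapped to missing values later.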
