**CodeAlpha Data Analytics Internship, Task 2: Exploratory Data Analysis (EDA)**

**Load Dataset**

In [3]:
import pandas as pd

# 1. Load the data you saved from Task 1
# IMPORTANT: Make sure 'books_data.csv' is uploaded to your Colab environment
# if you are starting a new session. (Use the folder icon on the left to upload it).
try:
    df = pd.read_csv('books_data.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Error: 'books_data.csv' not found. Please upload it to Colab.")
    # Stop here: without the file, df is undefined and every later cell would fail.
    raise

# Display the first few rows to confirm loading
print("\n--- Initial Data Check (First 5 Rows) ---")
print(df.head())

Data loaded successfully!

--- Initial Data Check (First 5 Rows) ---
                                   Title  Price Availability  Rating_Stars
0                   A Light in the Attic  51.77     In stock             3
1                     Tipping the Velvet  53.74     In stock             1
2                             Soumission  50.10     In stock             1
3                          Sharp Objects  47.82     In stock             4
4  Sapiens: A Brief History of Humankind  54.23     In stock             5
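
Every later cell assumes the four columns shown above are present. A minimal sketch of a check that could run right after loading is given below; the helper name `missing_columns` and the stand-in frame are hypothetical and only serve to make the example runnable without the CSV.

```python
import pandas as pd

# Column names taken from the loading output above.
EXPECTED_COLUMNS = ["Title", "Price", "Availability", "Rating_Stars"]

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]

# Tiny stand-in frame (hypothetical values), not the real books_data.csv.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B"],
    "Price": [12.5, 30.0],
    "Availability": ["In stock", "In stock"],
    "Rating_Stars": [3, 5],
})

print(missing_columns(sample))                          # nothing missing
print(missing_columns(sample.drop(columns=["Price"])))  # 'Price' is reported
```

In the notebook itself the same helper could be called as `missing_columns(df)`, raising an error if the returned list is not empty.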


**Data Inspection**

In [4]:
# Check for data structure and data types
print("\n--- Step 1: Data Structure (df.info()) ---")
df.info()

# Get a quick summary of numerical columns
print("\n--- Step 1: Descriptive Statistics (df.describe()) ---")
print(df.describe())


--- Step 1: Data Structure (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         1000 non-null   object 
 1   Price         1000 non-null   float64
 2   Availability  1000 non-null   object 
 3   Rating_Stars  1000 non-null   int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 31.4+ KB

--- Step 1: Descriptive Statistics (df.describe()) ---
            Price  Rating_Stars
count  1000.00000   1000.000000
mean     35.07035      2.923000
std      14.44669      1.434967
min      10.00000      1.000000
25%      22.10750      2.000000
50%      35.98000      3.000000
75%      47.45750      4.000000
max      59.99000      5.000000


**Missing Data and Duplicates**

In [5]:
# Check for missing values (NaN)
print("\n--- Step 2: Missing Value Count ---")
print(df.isnull().sum())

# Check for duplicate rows
print("\n--- Step 2: Duplicate Row Count ---")
duplicate_count = df.duplicated().sum()
print(f"Total duplicate rows found: {duplicate_count}")

# If duplicates exist, remove them (keep the first occurrence)
if duplicate_count > 0:
    df = df.drop_duplicates()
    print("Duplicates removed. New row count:", len(df))


--- Step 2: Missing Value Count ---
Title           0
Price           0
Availability    0
Rating_Stars    0
dtype: int64

--- Step 2: Duplicate Row Count ---
Total duplicate rows found: 0
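
`df.duplicated()` only flags rows that match in every column, so the same title scraped twice with a different price would not be counted. The sketch below shows how a title-only check differs; the stand-in frame is hypothetical, and whether repeated titles in the real data are true duplicates or separate editions is not something this check can decide.

```python
import pandas as pd

# Hypothetical stand-in data: "Book A" appears twice with different prices.
sample = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book B"],
    "Price": [10.0, 12.0, 20.0],
})

# Rows identical across all columns (what the cell above counts).
full_row_dupes = int(sample.duplicated().sum())

# Rows whose Title repeats an earlier row, regardless of the other columns.
title_dupes = int(sample.duplicated(subset=["Title"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Repeated titles:     {title_dupes}")

# keep=False marks every member of a repeated group, which helps manual review.
print(sample[sample.duplicated(subset=["Title"], keep=False)])
```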


**Univariate Analysis (Analyzing Single Columns)**

**A. Analyze 'Rating_Stars' (Categorical/Ordinal)**

In [6]:
print("\n--- Step 3A: Analysis of 'Rating_Stars' ---")
rating_counts = df['Rating_Stars'].value_counts().sort_index(ascending=False)
print("Distribution of Book Ratings:")
print(rating_counts)

# Calculate the percentage of books with a 5-star rating
five_star_percentage = (rating_counts.get(5, 0) / len(df)) * 100
print(f"\nPercentage of 5-star rated books: {five_star_percentage:.2f}%")


--- Step 3A: Analysis of 'Rating_Stars' ---
Distribution of Book Ratings:
Rating_Stars
5    196
4    179
3    203
2    196
1    226
Name: count, dtype: int64

Percentage of 5-star rated books: 19.60%
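
The counts above work out to 22.6% one-star books (226 of 1000), the largest group, and 17.9% four-star books, the smallest. Rather than computing one share by hand, `value_counts(normalize=True)` returns every share at once. The sketch below uses a short hypothetical ratings series so it runs on its own.

```python
import pandas as pd

# Hypothetical ratings, not the real dataset.
ratings = pd.Series([5, 4, 4, 1, 1, 1, 3, 2, 5, 1], name="Rating_Stars")

# normalize=True returns fractions instead of counts; scale to percentages.
rating_share = (
    ratings.value_counts(normalize=True)
    .sort_index(ascending=False)
    .mul(100)
    .round(1)
)
print(rating_share)
```

In the notebook, the same chain applied to `df['Rating_Stars']` would give the share for each star level.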


**B. Analyze 'Price' (Numerical)**

In [7]:
print("\n--- Step 3B: Analysis of 'Price' ---")
print("Price Statistics:")
print(df['Price'].describe().apply(lambda x: f"{x:.2f}")) # Format to 2 decimal places

# Find the most expensive and cheapest books
most_expensive = df.loc[df['Price'].idxmax()]
cheapest = df.loc[df['Price'].idxmin()]

print(f"\nMost Expensive Book: '{most_expensive['Title']}' at £{most_expensive['Price']:.2f}")
print(f"Cheapest Book: '{cheapest['Title']}' at £{cheapest['Price']:.2f}")


--- Step 3B: Analysis of 'Price' ---
Price Statistics:
count    1000.00
mean       35.07
std        14.45
min        10.00
25%        22.11
50%        35.98
75%        47.46
max        59.99
Name: Price, dtype: object

Most Expensive Book: 'The Perfect Play (Play by Play #1)' at £59.99
Cheapest Book: 'An Abundance of Katherines' at £10.00
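
One common way to screen a numeric column for outliers is the 1.5 × IQR rule. Using the quartiles reported earlier (22.1075 and 47.4575), the IQR is 25.35, which puts the fences at roughly -15.92 and 85.48; the observed range of £10.00 to £59.99 lies inside them, so this rule flags no prices. The sketch below shows the calculation on hypothetical data; `iqr_bounds` is an illustrative helper, not part of pandas.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices with one obviously extreme value.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0], name="Price")

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print(f"Fences: {lower:.2f} to {upper:.2f}")
print(outliers)
```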


**C. Analyze 'Availability' (Categorical)**

In [8]:
print("\n--- Step 3C: Analysis of 'Availability' ---")
availability_counts = df['Availability'].value_counts()
print("Distribution of Book Availability:")
print(availability_counts)

# Compute the percentage of in-stock items
in_stock_ratio = (availability_counts.get('In stock', 0) / availability_counts.sum()) * 100
print(f"\nPercentage of books currently 'In stock': {in_stock_ratio:.2f}%")


--- Step 3C: Analysis of 'Availability' ---
Distribution of Book Availability:
Availability
In stock    1000
Name: count, dtype: int64

Percentage of books currently 'In stock': 100.00%
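
Since every row is 'In stock', the `Availability` column has a single value and cannot explain any variation in price or rating. A minimal sketch for spotting such columns automatically follows; `constant_columns` is a hypothetical helper and the frame is stand-in data.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return names of columns that hold at most one distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]

# Hypothetical stand-in frame mirroring the situation above.
sample = pd.DataFrame({
    "Price": [12.5, 30.0, 45.0],
    "Availability": ["In stock", "In stock", "In stock"],
    "Rating_Stars": [3, 5, 1],
})

print(constant_columns(sample))
```

Columns returned by this check are candidates for dropping before any modelling or bivariate analysis.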


**Bivariate Analysis (Analyzing Relationships)**

**Relationship: Price vs. Rating**

In [9]:
print("\n--- Step 4: Bivariate Analysis (Price vs. Rating) ---")

# Calculate the average price for each star rating
average_price_by_rating = df.groupby('Rating_Stars')['Price'].mean().sort_index(ascending=False)

print("Average Price by Star Rating:")
print(average_price_by_rating.apply(lambda x: f"£{x:.2f}"))


--- Step 4: Bivariate Analysis (Price vs. Rating) ---
Average Price by Star Rating:
Rating_Stars
5    £35.37
4    £36.09
3    £34.69
2    £34.81
1    £34.56
Name: Price, dtype: object
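
The group means above differ by at most £1.53 (£36.09 for 4 stars versus £34.56 for 1 star), which suggests little relationship between price and rating. A rank correlation would put a single number on that. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, using only pandas; `spearman` is an illustrative helper and the data is hypothetical, so the printed values say nothing about the real dataset.

```python
import pandas as pd

def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank series."""
    return a.rank().corr(b.rank())

# Hypothetical data: price rises with rating in one case, falls in the other.
price = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
rating_up = pd.Series([1, 2, 3, 4, 5])
rating_down = pd.Series([5, 4, 3, 2, 1])

print(f"Increasing: {spearman(price, rating_up):.2f}")
print(f"Decreasing: {spearman(price, rating_down):.2f}")
```

In the notebook this could be called as `spearman(df['Price'], df['Rating_Stars'])`; a value near zero would be consistent with the nearly flat averages above.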
