# Product Dataset: Exploratory Data Analysis (EDA)

In this notebook, you will perform an exploratory data analysis on a new product dataset. The goal is to practice the pandas skills you learned in the previous workshop.

Follow the instructions in the markdown cells and write your code in the code cells provided.

## 1. Load the Dataset

First, we need to import pandas and load the `product_data_large.csv` file into a DataFrame called `df`.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("product_data_large.csv", encoding="latin1")

## 2. Initial Data Inspection

Let's get a first look at our data to understand its structure.

### a) View the first few rows
Display the first few rows of the DataFrame to get an initial look at the data structure.

In [None]:
# Your code here


### b) Check the dimensions
Find out how many rows and columns are in the DataFrame.

In [None]:
# Your code here


### c) Get a summary
Get a concise summary of the DataFrame, including the data types and non-null counts for each column.

In [None]:
# Your code here


## 3. Summary Statistics

Now, let's generate some summary statistics for the numerical columns.

### a) Generate summary statistics
Generate summary statistics for all numerical columns in the dataset.

In [None]:
# Your code here


### b) Interpret the statistics

**Your task:** Look at the output above. What potential data quality issues can you spot? Write your observations in this markdown cell.


## 4. Data Cleaning and Detailed Exploration

Let's fix the issues we identified and explore our columns in more detail.

### a) Check data types
Use `.dtypes` to see the data type of each column. Notice anything unusual about the `price` column?

In [None]:
# Your code here


### b) Fix data types
The `price` column should be numeric, but some values are stored as strings. Use `pd.to_numeric()` with `errors='coerce'` to convert them safely. Does the column contain any null values after conversion?
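
As a rough sketch of how `errors='coerce'` behaves, the example below converts a small made-up Series rather than the real `price` column. The values and the names `prices`, `numeric_prices` and `n_missing` are assumptions for illustration only.

```python
import pandas as pd

# Made-up values for illustration; not taken from product_data_large.csv
prices = pd.Series(["19.99", "5", "unknown", None], dtype="object")

# errors="coerce" turns anything that cannot be parsed into NaN
numeric_prices = pd.to_numeric(prices, errors="coerce")

# Count the null values present after the conversion
n_missing = numeric_prices.isna().sum()
print(numeric_prices)
print("Missing after conversion:", n_missing)
```

Any value that was already missing, or that could not be parsed, shows up as `NaN`, which is why it is worth counting nulls straight after converting.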

In [None]:
# Your code here


### c) Explore the 'category' column
Examine the distribution of values in the `category` column. Look for any data quality issues such as typos or inconsistent naming.

In [None]:
# Your code here


### d) Fix the typos in 'category'
Use the `.replace()` method to correct the typos you found.
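
One possible pattern, shown here on invented values, is to pass `.replace()` a dictionary that maps each misspelling to its corrected form. The typos below are hypothetical and may not match the ones in your dataset.

```python
import pandas as pd

# Hypothetical category values with invented typos, for illustration only
categories = pd.Series(["Electronics", "Electroncs", "Toys", "toys"])

# Map each misspelling to its corrected form
corrections = {"Electroncs": "Electronics", "toys": "Toys"}
cleaned = categories.replace(corrections)

# value_counts() helps confirm that only the corrected labels remain
counts = cleaned.value_counts()
print(counts)
```

Remember that `.replace()` returns a new object, so the result needs to be assigned back to the column for the fix to persist.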

In [None]:
# Your code here


### e) Find outliers in 'stock_quantity'
Filter the DataFrame to find rows with extremely high stock quantities (greater than 1000).

In [None]:
# Your code here


### f) Check for missing values
Count the number of missing values in each column of the DataFrame.
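
A minimal sketch of one common approach, assuming a toy DataFrame with invented values: `isna()` marks each missing cell as `True`, and `sum()` then counts those per column.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries; values are invented
toy = pd.DataFrame({
    "price": [9.99, np.nan, 4.50],
    "rating": [4.2, 3.8, np.nan],
    "category": ["Toys", None, "Books"],
})

# isna() gives a True/False table; summing counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```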

In [None]:
# Your code here


## 5. Data Manipulation & Subsetting

### a) Select specific data with `.loc`
Select the `product_name` and `price` columns for rows 20 to 25 (inclusive).
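
The sketch below illustrates how `.loc` slicing works on a small made-up DataFrame called `toy` (a hypothetical name). Unlike ordinary Python slicing, `.loc` selects by index label and includes both endpoints.

```python
import pandas as pd

# Toy data with a default integer index (0 to 9); values are invented
toy = pd.DataFrame({
    "product_name": [f"item_{i}" for i in range(10)],
    "price": [float(i) for i in range(10)],
    "rating": [4.0] * 10,
})

# Rows are chosen by label and columns by name.
# The slice 2:5 includes labels 2, 3, 4 and 5.
subset = toy.loc[2:5, ["product_name", "price"]]
print(subset)
```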

In [None]:
# Your code here


### b) Add a new column
Create a new column called `revenue`, which is the product of `price` and `stock_quantity`, then inspect the first few rows to check that this has worked correctly.
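
As a loose illustration of deriving one column from two others, the example below uses hypothetical columns `unit_weight` and `units` on invented data. Arithmetic between two columns is applied row by row.

```python
import pandas as pd

# Invented data; these column names are hypothetical and not in the dataset
toy = pd.DataFrame({
    "unit_weight": [0.5, 2.0, 1.25],
    "units": [10, 3, 4],
})

# Assigning to a new column name creates the column
toy["total_weight"] = toy["unit_weight"] * toy["units"]
print(toy.head())
```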

In [None]:
# Your code here


### c) Drop a column
Remove the `is_discontinued` column using the `.drop()` method, then inspect the first few rows to check that this has worked correctly.

In [None]:
# Your code here


## 6. Filtering to Answer Questions

### a) Filter by category
Find all products that belong to the 'Office Furniture' category.

In [None]:
# Your code here


### b) Filter by rating
Find all products that have a rating greater than 4.5.

In [None]:
# Your code here


### c) Find all expensive electronics
Combine two conditions: find all products where the `category` is 'Electronics' AND the `price` is over $100.

Hint: Use `&` to combine multiple conditions. You will need brackets around each condition.
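
To show why the brackets matter, here is a small sketch on invented data with a different category and price threshold from the task. The names `toy`, `mask` and `expensive_toys` are assumptions for illustration.

```python
import pandas as pd

# Invented data for illustration
toy = pd.DataFrame({
    "category": ["Books", "Books", "Toys", "Toys"],
    "price": [8.0, 30.0, 12.0, 45.0],
})

# Each condition needs its own brackets because & binds more
# tightly than == or >, so leaving them out changes the meaning
mask = (toy["category"] == "Toys") & (toy["price"] > 20)
expensive_toys = toy[mask]
print(expensive_toys)
```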

In [None]:
# Your code here


## 7. Extension Activities

The following sections introduce some additional pandas functionality that builds on what you've learned. These activities include links to the official pandas documentation to help you learn how to use documentation effectively.

**Note:** These are extension activities, so don't worry if you don't complete them all!

### Extension A: Working with Dates

The `release_date` column contains dates, but they're currently stored as strings. Let's convert them to proper datetime objects so we can work with them more effectively.

**Documentation:** [pandas.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)

#### A1) Convert to datetime
Convert the `release_date` column from strings to datetime objects. This is similar to what you did when converting a column to numeric values. You will need to specify the argument `format="%d/%m/%Y"`.

Check the data types after conversion.
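
The following sketch, which uses two invented date strings rather than the real column, suggests why the `format` argument is worth giving explicitly when dates are written day first.

```python
import pandas as pd

# Invented day/month/year strings for illustration
dates = pd.Series(["05/03/2021", "17/11/2022"])

# The explicit format makes sure 05/03/2021 is read as 5 March, not 3 May
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed)
print(parsed.dtype)
```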

In [None]:
# Convert release_date to datetime


#### A2) Extract date components
Create new columns for `release_year` and `release_month` by extracting these components from the datetime column.

**Documentation:** [pandas datetime accessor](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html)

Hint: `df['release_date'].dt.year` will return the year from the release date.
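
As a rough sketch of the `.dt` accessor on a made-up datetime Series (the name `release` and its values are assumptions), each component comes back as a Series of integers that can be stored or compared like any other column.

```python
import pandas as pd

# Invented dates, already converted to datetime
release = pd.Series(pd.to_datetime(["2021-03-05", "2022-11-17"]))

# .dt exposes the components of each datetime value
years = release.dt.year
months = release.dt.month

# The extracted component can be used directly in a filter
released_2022 = release[release.dt.year == 2022]
print(years.tolist(), months.tolist())
```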

In [None]:
# Your code here


#### A3) Filter by date
Find all products that were released in 2022.

In [None]:
# Your code here


### Extension B: Sorting and Ranking

Sorting helps us identify the best and worst performers in our dataset.

**Documentation:** [pandas.DataFrame.sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

#### B1) Find the cheapest products
Sort the DataFrame by price to find the 5 cheapest products. Display their name, price, and category.

In [None]:
# Your code here
# Sort by price (ascending - cheapest first)


#### B2) Find the most expensive products
Now find the 5 most expensive products.

In [None]:
# Your code here
# Sort by price (descending - most expensive first)


#### B3) Top rated products
Find the top 5 highest-rated products.

**Documentation:** [pandas.DataFrame.nlargest](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html)
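
A minimal sketch on invented data, comparing `nlargest` with the sort-then-`head` approach from the earlier exercises. The DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd

# Invented ratings for illustration
toy = pd.DataFrame({
    "product_name": ["A", "B", "C", "D"],
    "rating": [3.9, 4.8, 4.1, 4.6],
})

# nlargest(n, column) returns the n rows with the largest values
top2 = toy.nlargest(2, "rating")

# The same rows can be obtained by sorting and taking the head
same_top2 = toy.sort_values("rating", ascending=False).head(2)
print(top2)
```

`nsmallest` works the same way for the lowest values.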

In [None]:
# Your code here
# Top 5 highest rated products


#### B4) Products that might need restocking
Find the 10 products with the lowest stock quantities. These might need to be restocked soon.

**Documentation:** [pandas.DataFrame.nsmallest](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nsmallest.html)

In [None]:
# Your code here
# Products with lowest stock (might need restocking)


#### B5) Multi-column sorting challenge
**Challenge:** Sort the products first by category (alphabetically), then by price (highest to lowest within each category).

**Documentation:** [pandas.DataFrame.sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

Hint: Check the examples in the documentation. You will need to use lists for both the `by` argument and the `ascending` argument.
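
To illustrate how the two lists line up, here is a sketch on a tiny invented DataFrame. Each entry in `ascending` applies to the column at the same position in `by`.

```python
import pandas as pd

# Invented data for illustration
toy = pd.DataFrame({
    "category": ["Toys", "Books", "Toys", "Books"],
    "price": [12.0, 30.0, 45.0, 8.0],
})

# category is sorted A to Z, then price from high to low within each category
ordered = toy.sort_values(by=["category", "price"], ascending=[True, False])
print(ordered)
```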

In [None]:
# Your code here - this requires sorting by multiple columns
# Sort by category first, then by price within each category


## Summary

Congratulations! You've completed a comprehensive exploratory data analysis. You've learned how to:

- Load and inspect datasets
- Identify and fix data quality issues
- Explore data using summary statistics and value counts
- Filter and subset data
- Create new columns and manipulate existing ones
- Work with dates and perform sorting operations

These skills form the foundation of data analysis with pandas. In future workshops, you'll learn about data visualisation and more advanced analysis techniques.