# **Online Retail Sales Analysis**

## Objectives

* Extract the Online Retail dataset, clean and transform it, and export the results as CSV files for analysis.

* Produce 3 CSVs:
  - 10% of the dataset, focusing on UK sales
  - UK customer revenue
  - Global sales revenue

## Inputs

* `Online Retail.csv`, read directly from the raw GitHub URL of the project repository.

## Outputs

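* `UK_online_retail_final.csv`, `revenue_per_customer.csv`, and the aggregated revenue files `revenue_by_year.csv`, `revenue_by_month.csv`, `revenue_by_day.csv` and `revenue_by_hour.csv`.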

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [42]:
import os
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/Hackathon2/Online_Retail_Sales_Analysis'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [43]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [44]:
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/Hackathon2'
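The parent-directory lookup used above can be sketched as a small function that only computes the target path, which makes it easy to check the result before calling `os.chdir()`. The helper name `parent_dir` and the example path are illustrative assumptions, not part of the original notebook.

```python
import os


def parent_dir(path: str) -> str:
    """Return the parent of `path`, ignoring any trailing separator."""
    return os.path.dirname(os.path.normpath(path))


# Hypothetical example path; in the notebook this would be os.getcwd().
example = "/home/user/project/notebooks"
target = parent_dir(example)
print(target)
# os.chdir(target) would then make the parent the working directory.
```

Normalising the path first means a trailing slash does not change the result.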

# Section 1: ETL Process

### 1.1 Extract: Importing Libraries, Extracting & Describing the Dataset

- This section involves importing libraries necessary for subsequent data analysis and visualisation tasks.

These libraries are imported to handle data manipulation (pandas, numpy), create visualisations (seaborn, matplotlib, plotly), and perform statistical analysis (scipy.stats).

In [45]:
# Importing necessary libraries for data manipulation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as st

We opted to access the CSV file, Online Retail.csv, directly from a GitHub repository using the raw URL of the file. 

By using this method, the CSV file is fetched directly from the GitHub repository without needing to download it manually, making the process efficient.

In [46]:
# Store the URL of the dataset and load it into a DataFrame
url = "https://raw.githubusercontent.com/bvhadra/Online_Retail_Sales_Analysis/refs/heads/main/Online%20Retail.csv"
df = pd.read_csv(url) 

In [47]:
# Display the first few rows of the DataFrame to confirm successful import
print(df.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55       17850  United Kingdom  
1  2010-12-01 08:26:00       3.39       17850  United Kingdom  
2  2010-12-01 08:26:00       2.75       17850  United Kingdom  
3  2010-12-01 08:26:00       3.39       17850  United Kingdom  
4  2010-12-01 08:26:00       3.39       17850  United Kingdom  


In [48]:
# Display the last few rows of the DataFrame to confirm successful import
print(df.tail())  

       InvoiceNo StockCode                      Description  Quantity  \
541904    581587     22613      PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899     CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254    CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255  CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138    BAKING SET 9 PIECE RETROSPOT          3   

                InvoiceDate  UnitPrice  CustomerID Country  
541904  2011-12-09 12:50:00       0.85       12680  France  
541905  2011-12-09 12:50:00       2.10       12680  France  
541906  2011-12-09 12:50:00       4.15       12680  France  
541907  2011-12-09 12:50:00       4.15       12680  France  
541908  2011-12-09 12:50:00       4.95       12680  France  


We then used the `df.describe()` function to generate a summary of key statistics for the numerical columns, helping us understand the data's distribution and identify potential outliers.

In [49]:
# Generates a summary of statistics (count, mean, std, min, max, etc.) for numerical columns in the DataFrame.
df[['Quantity', 'UnitPrice']].describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 541909.0 | 541909.0 |
| mean | 9.55225 | 4.611114 |
| std | 218.081158 | 96.759853 |
| min | -80995.0 | -11062.06 |
| 25% | 1.0 | 1.25 |
| 50% | 3.0 | 2.08 |
| 75% | 10.0 | 4.13 |
| max | 80995.0 | 38970.0 |


#### Findings:

- **Count**: Both **Quantity** and **UnitPrice** have 541,909 non-null entries, indicating a large dataset. Given our technical limitations, we will need to truncate this data.

- **Mean**: The average **Quantity** is 9.55, and the average **UnitPrice** is 4.61, providing insight into typical values for these columns.

- **Standard Deviation**: The standard deviations (218.08 for **Quantity** and 96.76 for **UnitPrice**) are far larger than the means, showing that both columns contain a wide spread of values.

- **Min/Max**: The extreme values are notable:
  1. The **Quantity** column includes a minimum of **-80,995** and a maximum of **80,995**, with negative quantities being particularly concerning.
  2. Similarly, the **UnitPrice** column has a negative minimum of **-11,062.06** and a maximum of **38,970**, indicating the presence of abnormal negative prices.

We started by inspecting the counts of non-null and null values in each column using `df.count()` and `df.isna().sum()` respectively.

In [50]:
# Display the count of non null values in each column
print("\nCount of non-NA values in each column:")
print(df.count())


Count of non-NA values in each column:
InvoiceNo      541909
StockCode      541909
Description    540455
Quantity       541909
InvoiceDate    541909
UnitPrice      541909
CustomerID     541909
Country        541909
dtype: int64


In [51]:
# Check for missing values in each column
print("\nMissing values in each column:")
print(df.isna().sum())


Missing values in each column:
InvoiceNo         0
StockCode         0
Description    1454
Quantity          0
InvoiceDate       0
UnitPrice         0
CustomerID        0
Country           0
dtype: int64


The only column with missing values is **Description**, which contains 1,454 missing entries.


#### Next Steps:

Following the extraction stage, we will be addressing the following key issues:

1. **Negative Values**: We need to look into the negative values found in both **Quantity** and **UnitPrice**, especially the extreme ones, as they could be errors or invalid entries. These might be caused by mistakes in data entry, returns, or system glitches.

2. **Clean the Data**: We will remove or correct the negative values in both columns, making sure only valid positive numbers are used in the analysis.

3. **Outlier Detection**: We’ll also need to deal with the extreme outliers in **Quantity** and **UnitPrice**. 

4. **Feature Engineering**: We will also perform feature engineering to ensure that the **Description** column is properly standardised, especially since some items have the same name but come in different colours (e.g., "WHITE HANGING HEART T-LIGHT HOLDER" in different colours). This will help avoid confusion and ensure consistency.
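The notebook does not say which rule will be used for outlier detection. As one possible approach, the sketch below applies the common interquartile-range (IQR) rule to a toy series. The function name `iqr_outliers`, the multiplier `k = 1.5` and the sample values are assumptions for illustration only.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Toy quantities, not taken from the dataset.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1000])
outliers = iqr_outliers(sample)
print(outliers.tolist())
```

Because the rule is based on quartiles, a single extreme value such as the one in the toy series does not distort the thresholds the way it would distort a mean and standard deviation.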


### 1.2 Transform: Data pre-processing & Feature Engineering

- This section involves cleaning the data, removing missing and invalid values. It also includes feature engineering and truncation.

#### 1.2.1 Filtering the data for the UK

Following the request from our stakeholders, we will focus exclusively on the United Kingdom (UK) for the analysis.

First, we want to confirm that the United Kingdom does not have any alternate names in the data set.

In [52]:
# Check unique values in the Country column
unique_countries = df['Country'].unique()
print("Unique values in 'Country' column:", unique_countries)

Unique values in 'Country' column: ['United Kingdom' 'France' 'Australia' 'Netherlands' 'Germany' 'Norway'
 'EIRE' 'Switzerland' 'Spain' 'Poland' 'Portugal' 'Italy' 'Belgium'
 'Lithuania' 'Japan' 'Iceland' 'Channel Islands' 'Denmark' 'Cyprus'
 'Sweden' 'Austria' 'Israel' 'Finland' 'Bahrain' 'Greece' 'Hong Kong'
 'Singapore' 'Lebanon' 'United Arab Emirates' 'Saudi Arabia'
 'Czech Republic' 'Canada' 'Unspecified' 'Brazil' 'USA'
 'European Community' 'Malta' 'RSA']


Having confirmed this, we will now filter the data accordingly.

In [53]:

# Filter the dataset to include only the United Kingdom
df_uk = df[df['Country'] == 'United Kingdom']

# Verify the filter by checking the unique countries again
unique_countries_after_filter = df_uk['Country'].unique()
print("Unique values in 'Country' column after filtering:", unique_countries_after_filter)


Unique values in 'Country' column after filtering: ['United Kingdom']
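The visual check for alternate UK names could also be done programmatically. The sketch below searches a subset of the printed country names for a few possible spellings. The keyword list is a hypothetical choice and would need extending for other datasets.

```python
# A subset of the country names printed above.
countries = ["United Kingdom", "France", "EIRE", "Channel Islands", "Unspecified", "USA"]

# Hypothetical spellings that might refer to the UK.
uk_keywords = ("united kingdom", "great britain", "england")

# Keep every name that contains one of the keywords, ignoring case and padding.
candidates = [
    name for name in countries
    if any(keyword in name.strip().lower() for keyword in uk_keywords)
]
print(candidates)
```

If more than one name were returned, the filter on `'United Kingdom'` would miss rows and the names would need to be unified first.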


We will now inspect the filtered data.

In [54]:
# Display the first few rows of the DataFrame to get an overview.
print("First few rows of the data:")
print(df_uk.head())

First few rows of the data:
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55       17850  United Kingdom  
1  2010-12-01 08:26:00       3.39       17850  United Kingdom  
2  2010-12-01 08:26:00       2.75       17850  United Kingdom  
3  2010-12-01 08:26:00       3.39       17850  United Kingdom  
4  2010-12-01 08:26:00       3.39       17850  United Kingdom  


In [55]:
# Summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
df_uk[['Quantity', 'UnitPrice']].describe()


Summary statistics for numerical columns:


| | Quantity | UnitPrice |
|---|---|---|
| count | 495478.0 | 495478.0 |
| mean | 8.605486 | 4.532422 |
| std | 227.588756 | 99.315438 |
| min | -80995.0 | -11062.06 |
| 25% | 1.0 | 1.25 |
| 50% | 3.0 | 2.1 |
| 75% | 10.0 | 4.13 |
| max | 80995.0 | 38970.0 |


#### 1.2.2 Removing Null Rows

In [56]:
# Display the count of non null values in each column
print("\nCount of non-NA values in each column:")
print(df_uk.count())


Count of non-NA values in each column:
InvoiceNo      495478
StockCode      495478
Description    494024
Quantity       495478
InvoiceDate    495478
UnitPrice      495478
CustomerID     495478
Country        495478
dtype: int64


In [57]:
print("\nMissing values in each column:")
print(df_uk.isna().sum())


Missing values in each column:
InvoiceNo         0
StockCode         0
Description    1454
Quantity          0
InvoiceDate       0
UnitPrice         0
CustomerID        0
Country           0
dtype: int64


#### 1.2.3 Truncating the dataset

Truncating the dataset was necessary for technical feasibility: the UK subset contains 495,478 records. We therefore keep the first 50,000 rows, roughly 10% of the UK data.

This reduction keeps the analysis efficient whilst aligning with the stakeholder requirements. Note that `head()` takes rows in file order, so the subset covers the earliest invoices rather than a random sample of the whole period.

In [58]:
df_truncated = df_uk.head(50000)

# Verifies the truncation
print(df_truncated.shape)
print(df_truncated.head())

(50000, 8)
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55       17850  United Kingdom  
1  2010-12-01 08:26:00       3.39       17850  United Kingdom  
2  2010-12-01 08:26:00       2.75       17850  United Kingdom  
3  2010-12-01 08:26:00       3.39       17850  United Kingdom  
4  2010-12-01 08:26:00       3.39       17850  United Kingdom  
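To illustrate the difference between taking the first rows and drawing a random subset, the sketch below compares `head()` with `DataFrame.sample()` on a toy frame. This is an alternative shown for comparison, not the method used in this notebook. The column values and `random_state=42` are arbitrary assumptions.

```python
import pandas as pd

# Toy frame ordered by month, standing in for the date-ordered dataset.
toy = pd.DataFrame({"Month": [1, 1, 2, 2, 3, 3, 4, 4], "Quantity": range(8)})

first_rows = toy.head(4)                        # earliest rows only
random_rows = toy.sample(n=4, random_state=42)  # reproducible random subset

print(sorted(first_rows["Month"].unique()))
print(sorted(random_rows["Month"].unique()))
```

A fixed `random_state` keeps the random subset identical between runs, which matters when the notebook has to be re-run top-down.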


Having previously ascertained that the only missing values in the dataset are in the **Description** column, we can delete those rows.

The rationale is that our stakeholder wishes to gather insights on product trends, and rows without a product name are unhelpful for that analysis.

In [59]:
missing_before = df_truncated['Description'].isna().sum()
print("Missing values in 'Description' column before:", missing_before)

# Delete rows with missing values in 'Description'
df_truncated_cleaned = df_truncated.dropna(subset=['Description'])
print(f"Shape before deletion: {df_truncated.shape}")

Missing values in 'Description' column before: 158
Shape before deletion: (50000, 8)


In [60]:
# Check the count of missing values in 'Description' after deletion
missing_after = df_truncated_cleaned['Description'].isna().sum()
print(f"Missing values in 'Description' after deletion: {missing_after}")

print(f"Shape after deletion: {df_truncated_cleaned.shape}")

Missing values in 'Description' after deletion: 0
Shape after deletion: (49842, 8)


In [61]:
# Check for any remaining NA values in the dataset
na_check = df_truncated_cleaned.isna().sum()
print("Remaining null values in the dataset:")
print(na_check)

Remaining null values in the dataset:
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64


#### 1.2.4 Transforming invalid inputs

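- Some rows contain negative values for **Quantity** or **UnitPrice**. A customer cannot pay a negative price, so these likely indicate data entry errors or anomalies in the transactions.

- We cannot tell from the data whether these rows represent returns or refunds.

For the purpose of this analysis, to prevent skewed results and to ensure data integrity, we remove these rows. Zero values were also considered for removal, since a customer cannot buy 0 items, but the filter below keeps them.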

In [62]:
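# Remove rows with negative values in Quantity or UnitPrice (zero values are kept).
# .copy() makes the result an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning.
df_truncated_cleaned = df_truncated_cleaned[(df_truncated_cleaned['Quantity'] >= 0) & (df_truncated_cleaned['UnitPrice'] >= 0)].copy()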

# Verify the shape after removal
print(f"Shape after removing rows with negative values: {df_truncated_cleaned.shape}")


Shape after removing rows with negative values: (49019, 8)
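The effect of the comparison operator in this filter can be seen on a toy frame. The sketch below contrasts `>=` (negatives removed, zeros kept) with `>` (zeros removed as well). The values are made up for illustration.

```python
import pandas as pd

# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Quantity": [6, -1, 0, 3],
    "UnitPrice": [2.55, 1.00, 1.00, 0.00],
})

# Non-negative filter: drops negatives but keeps zeros.
non_negative = toy[(toy["Quantity"] >= 0) & (toy["UnitPrice"] >= 0)]

# Stricter alternative: also drops zero quantities and zero prices.
strictly_positive = toy[(toy["Quantity"] > 0) & (toy["UnitPrice"] > 0)]

print(len(non_negative), len(strictly_positive))
```

Which variant is appropriate depends on whether zero-priced or zero-quantity rows are meaningful for the analysis.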


#### 1.2.5 Feature engineering

- Break the invoice date down into year, month, day and hour.
- Create a new column for the total revenue of each item purchase.
- Aggregate revenue per year, month and day.
- Possibly break items down into categories and colours, depending on the number of unique values.

The invoice date must be converted to datetime format so that it can be split into components and used to analyse transaction patterns. This is useful later for identifying peak times at which to entice customers with discount codes and similar offers to increase revenue.

We break the time down to the hour because 24 hourly values are more manageable than smaller units, and they can be used to analyse popular purchase times. Splitting the date into components lets us analyse time-based trends effectively.

In [63]:
# Breaking Invoice Date into multiple columns

# We need to ensure the column is treated as a datetime format as opposed to a string
df_truncated_cleaned['InvoiceDate'] = pd.to_datetime(df_truncated_cleaned['InvoiceDate'])

# Extract the year, month, day, hour from the 'InvoiceDate' column
df_truncated_cleaned['Year'] = df_truncated_cleaned['InvoiceDate'].dt.year
df_truncated_cleaned['Month'] = df_truncated_cleaned['InvoiceDate'].dt.month
df_truncated_cleaned['Day'] = df_truncated_cleaned['InvoiceDate'].dt.day
df_truncated_cleaned['Hour'] = df_truncated_cleaned['InvoiceDate'].dt.hour

# Display the first few rows of the DataFrame to confirm the changes
print(df_truncated_cleaned[['InvoiceDate', 'Year', 'Month', 'Day', 'Hour']].head())

          InvoiceDate  Year  Month  Day  Hour
0 2010-12-01 08:26:00  2010     12    1     8
1 2010-12-01 08:26:00  2010     12    1     8
2 2010-12-01 08:26:00  2010     12    1     8
3 2010-12-01 08:26:00  2010     12    1     8
4 2010-12-01 08:26:00  2010     12    1     8


We utilise the `to_datetime()` function to ensure the type is set as a datetime variable, rather than a string variable.
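As a minimal sketch of a stricter conversion, the example below passes an explicit `format` and `errors="coerce"` to `pd.to_datetime()`, so that unparseable values become `NaT` instead of raising an error. The format string matches the timestamps printed above. The invalid entry is invented for illustration.

```python
import pandas as pd

# One valid timestamp in the dataset's format and one invented bad value.
raw = pd.Series(["2010-12-01 08:26:00", "not a date"])

# An explicit format avoids guessing; errors="coerce" turns bad values into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.isna().sum())     # number of values that failed to parse
print(parsed.dt.hour.iloc[0])  # hour component of the first timestamp
```

Counting the `NaT` values after conversion is a quick way to confirm that every invoice date was parsed.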

#### 1.2.5 Calculating Aggregated Revenues

In [64]:
# Calculate the total amount spent per transaction and store in new col
df_truncated_cleaned['TotalAmount'] = df_truncated_cleaned['Quantity'] * df_truncated_cleaned['UnitPrice']

Now that we have calculated the revenue per transaction, we can aggregate it by year, month, day and hour, and store each aggregation in a CSV.

In [65]:
unique_year = df_truncated_cleaned['Year'].nunique()
print(f"Number of unique years: {unique_year}")

unique_months = df_truncated_cleaned['Month'].nunique()
print(f"Number of unique months: {unique_months}")

unique_days = df_truncated_cleaned['Day'].nunique()
print(f"Number of unique days: {unique_days}")

unique_times = df_truncated_cleaned['Hour'].nunique()
print(f"Number of unique hours: {unique_times}")

Number of unique years: 2
Number of unique months: 2
Number of unique days: 22
Number of unique hours: 14


In [66]:
# Aggregate revenue by year, month, day, and hour
revenue_by_year = df_truncated_cleaned.groupby('Year')['TotalAmount'].sum().reset_index()
revenue_by_month = df_truncated_cleaned.groupby(['Year', 'Month'])['TotalAmount'].sum().reset_index()
revenue_by_day = df_truncated_cleaned.groupby(['Year', 'Month', 'Day'])['TotalAmount'].sum().reset_index()
revenue_by_hour = df_truncated_cleaned.groupby(['Year', 'Month', 'Day', 'Hour'])['TotalAmount'].sum().reset_index()

# Rename the 'TotalAmount' column to the desired naming convention
revenue_by_year = revenue_by_year.rename(columns={'TotalAmount': 'revenue_per_year'})
revenue_by_month = revenue_by_month.rename(columns={'TotalAmount': 'revenue_per_month'})
revenue_by_day = revenue_by_day.rename(columns={'TotalAmount': 'revenue_per_day'})
revenue_by_hour = revenue_by_hour.rename(columns={'TotalAmount': 'revenue_per_hour'})

# Round the revenue values to 2 decimal places
revenue_by_year['revenue_per_year'] = revenue_by_year['revenue_per_year'].round(2)
revenue_by_month['revenue_per_month'] = revenue_by_month['revenue_per_month'].round(2)
revenue_by_day['revenue_per_day'] = revenue_by_day['revenue_per_day'].round(2)
revenue_by_hour['revenue_per_hour'] = revenue_by_hour['revenue_per_hour'].round(2)

# Export each aggregation to a separate CSV
revenue_by_year.to_csv('revenue_by_year.csv', index=False)
revenue_by_month.to_csv('revenue_by_month.csv', index=False)
revenue_by_day.to_csv('revenue_by_day.csv', index=False)
revenue_by_hour.to_csv('revenue_by_hour.csv', index=False)

In [67]:
revenue_by_year.head()

| | Year | revenue_per_year |
|---|---|---|
| 0 | 2010 | 748268.98 |
| 1 | 2011 | 190982.63 |


In [68]:
revenue_by_month.head()

| | Year | Month | revenue_per_month |
|---|---|---|---|
| 0 | 2010 | 12 | 748268.98 |
| 1 | 2011 | 1 | 190982.63 |


In [69]:
revenue_by_day.head()

| | Year | Month | Day | revenue_per_day |
|---|---|---|---|---|
| 0 | 2010 | 12 | 1 | 54818.08 |
| 1 | 2010 | 12 | 2 | 47570.53 |
| 2 | 2010 | 12 | 3 | 41308.69 |
| 3 | 2010 | 12 | 5 | 25853.2 |
| 4 | 2010 | 12 | 6 | 53322.12 |


In [70]:
revenue_by_hour.head()

| | Year | Month | Day | Hour | revenue_per_hour |
|---|---|---|---|---|---|
| 0 | 2010 | 12 | 1 | 8 | 527.95 |
| 1 | 2010 | 12 | 1 | 9 | 7356.39 |
| 2 | 2010 | 12 | 1 | 10 | 4877.56 |
| 3 | 2010 | 12 | 1 | 11 | 4041.56 |
| 4 | 2010 | 12 | 1 | 12 | 7447.92 |
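The four aggregations above follow the same group, rename and round pattern, so they could be expressed through one helper. The sketch below shows this on toy data. The function name `aggregate_revenue` is an assumption and is not defined in the notebook.

```python
import pandas as pd


def aggregate_revenue(df: pd.DataFrame, keys: list, name: str) -> pd.DataFrame:
    """Sum TotalAmount per group, rename the result and round to 2 d.p."""
    out = df.groupby(keys)["TotalAmount"].sum().reset_index()
    out = out.rename(columns={"TotalAmount": name})
    out[name] = out[name].round(2)
    return out


# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011],
    "Month": [12, 12, 1],
    "TotalAmount": [15.30, 20.34, 10.00],
})

by_year = aggregate_revenue(toy, ["Year"], "revenue_per_year")
by_month = aggregate_revenue(toy, ["Year", "Month"], "revenue_per_month")
print(by_year)
```

Grouping by `['Year', 'Month']` rather than `'Month'` alone keeps December 2010 separate from any later December in the data.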


#### 1.2.6 Top 10 customers

We now calculate the revenue per customer so that we can identify the top 10 customers.

We will be saving this revenue per customer data in a new CSV.

In [71]:
unique_customers = df_truncated_cleaned['CustomerID'].nunique()
print(f"Number of unique customers: {unique_customers}")


Number of unique customers: 929


In [72]:
revenue_per_customer = df_truncated_cleaned.groupby('CustomerID')['TotalAmount'].sum().reset_index()
revenue_per_customer.rename(columns={'TotalAmount': 'Revenue'}, inplace=True)

# Display the first few rows of the DataFrame 
print("First few rows of revenue per customer:")
print(revenue_per_customer.head())
print("\n")

# top 10 customers by revenue
top_10_customers = revenue_per_customer.sort_values(by='Revenue', ascending=False).head(10)
print("Top 10 customers by revenue:")
print(top_10_customers)

First few rows of revenue per customer:
   CustomerID  Revenue
0       12747   706.27
1       12748  4479.43
2       12826   155.00
3       12829   293.00
4       12838   390.79


Top 10 customers by revenue:
     CustomerID    Revenue
384       15287  280583.57
904       18102   27834.61
469       15749   22998.40
350       15061   21324.65
766       17450   20649.04
521       16029   13202.52
774       17511   10573.22
46        13089    7738.67
550       16210    7000.64
149       13777    6961.78


In [73]:
revenue_per_customer.to_csv('revenue_per_customer.csv', index=False)

Having examined the CSV file, we can see that some revenue values extend beyond 2 decimal places, so we round them to 2 decimal places.

In [74]:
revenue_per_customer = pd.read_csv('revenue_per_customer.csv')
revenue_per_customer['Revenue']=revenue_per_customer['Revenue'].round(2)
revenue_per_customer.to_csv('revenue_per_customer.csv', index=False)
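As a possible simplification, the rounding can be applied before the first export, and `DataFrame.nlargest()` can replace the sort-then-head step. The sketch below uses made-up customer IDs and revenues to illustrate both ideas.

```python
import pandas as pd

# Toy revenue table, not taken from the dataset.
revenue = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104],
    "Revenue": [155.004, 4479.431, 293.0, 706.266],
})

# Round before saving, so the CSV does not need to be re-read and rewritten.
revenue["Revenue"] = revenue["Revenue"].round(2)

# nlargest gives the same rows as sort_values(ascending=False).head(n) here.
top_2 = revenue.nlargest(2, "Revenue")
print(top_2)
```

Rounding in memory first means the file is written only once.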


#### 1.2.7 Creating category columns

In [75]:
# Round off total amount values to 2 decimal places
df_truncated_cleaned['TotalAmount'] = df_truncated_cleaned['TotalAmount'].round(2)
# Save the current DataFrame to a CSV file
df_truncated_cleaned.to_csv('online_retail_cleaned.csv', index=False)

We used ChatGPT to apply a keyword-based categorisation to the cleaned dataset. The aim is to categorise the large volume of records quickly and consistently, and to reduce the human error that comes with assigning categories manually. The categorised file, `online_retail_categorized.csv`, is produced outside this notebook and must be present in the working directory before the next cell is run.
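The actual keyword rules used for the categorisation are not shown in this notebook. The sketch below only illustrates what a keyword-based categoriser can look like. The rules in `KEYWORD_RULES` and the function name `categorise` are hypothetical.

```python
# Hypothetical keyword-to-category rules; the real mapping is not shown in the notebook.
KEYWORD_RULES = [
    ("HOT WATER BOTTLE", "Heating & Warmth"),
    ("MUG", "Drinkware"),
    ("LANTERN", "Home Decor"),
]


def categorise(description: str) -> str:
    """Return the category of the first matching keyword, else 'Other'."""
    text = description.upper()
    for keyword, category in KEYWORD_RULES:
        if keyword in text:
            return category
    return "Other"


print(categorise("KNITTED UNION FLAG HOT WATER BOTTLE"))
print(categorise("BAKING SET 9 PIECE RETROSPOT"))
```

Any description that matches no keyword falls into 'Other', which is why the size of that category depends directly on how complete the keyword list is.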

In [76]:
categorised_df = pd.read_csv('online_retail_categorized.csv')

# View generated category list
categorised_df["Category"].unique()


FileNotFoundError: [Errno 2] No such file or directory: 'online_retail_categorized.csv'

In [None]:
# Category Counts
# Calculate the count of each category in the 'Category' column of the 'categorised_df' DataFrame
category_counts = categorised_df['Category'].value_counts()
print("Category Counts:")
print(category_counts)

Category Counts:
Category
Other                     26581
Drinkware                  3461
Accessories                3375
Home Decor                 2860
Seasonal                   2634
Storage & Organization     2592
Heating & Warmth           1944
Stationery                 1418
Tableware                  1275
Gardening                  1160
Photo & Frames              725
Timepieces                  679
Toys & Games                315
Name: count, dtype: int64


#### 1.2.8 Categories & Revenue

In [None]:
# In terms of items purchased
print("\nTop 10 categories by items purchased:")
top_10_categories = category_counts.sort_values(ascending=False).head(10)
print(top_10_categories)

print("\n")

# In terms of revenue generated
print("Top 10 categories by revenue:")
revenue_by_category = categorised_df.groupby('Category')['TotalAmount'].sum().reset_index()
top_10_revenue_categories = revenue_by_category.sort_values(by='TotalAmount', ascending=False).head(10)
print(top_10_revenue_categories)



Top 10 categories by items purchased:
Category
Other                     26581
Drinkware                  3461
Accessories                3375
Home Decor                 2860
Seasonal                   2634
Storage & Organization     2592
Heating & Warmth           1944
Stationery                 1418
Tableware                  1275
Gardening                  1160
Name: count, dtype: int64


Top 10 categories by revenue:
                  Category  TotalAmount
5                    Other    507229.88
3         Heating & Warmth     74786.31
4               Home Decor     68422.84
0              Accessories     62248.68
1                Drinkware     52585.97
9   Storage & Organization     51482.13
7                 Seasonal     39110.05
6           Photo & Frames     22442.31
10               Tableware     20860.83
2                Gardening     15125.57


#### 1.2.9 Subcategorising 'Other' Category

To address the 26k+ records currently categorised as 'Other', we used generative AI once more to try to break the category down into subcategories.

This analysis found no clear subcategories. The records largely consist of miscellaneous items, including products such as dust bins, garlands and cupcake cases.

In future analysis, further refinement of product categories could be explored.
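To put the size of the 'Other' category in context, the short calculation below derives its share of all rows from the category counts printed earlier. Only the arithmetic is new; the counts are copied from that output.

```python
# Category counts as printed in the "Category Counts" output.
category_counts = {
    "Other": 26581, "Drinkware": 3461, "Accessories": 3375,
    "Home Decor": 2860, "Seasonal": 2634, "Storage & Organization": 2592,
    "Heating & Warmth": 1944, "Stationery": 1418, "Tableware": 1275,
    "Gardening": 1160, "Photo & Frames": 725, "Timepieces": 679,
    "Toys & Games": 315,
}

total = sum(category_counts.values())
other_share = category_counts["Other"] / total
print(f"{total} rows, {other_share:.1%} in 'Other'")
```

The total matches the 49,019 rows left after cleaning, and a little over half of them are uncategorised.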

### 1.3 Category Price Bins

To better understand the distribution of product prices in the dataset, we applied a price binning strategy to categorise the products into defined price ranges.

In [None]:
# Define bins and labels for unit price
bins = [0, 10, 50, 100, 500, 1000, float('inf')]  # 7 edges define 6 price ranges
labels = ['0-10', '11-50', '51-100', '101-500', '501-1000', '1000+']  # 6 labels to match the bins

# Bin prices and store in a new column.
# right=False makes each bin left-closed, e.g. [0, 10) and [10, 50),
# so a price of exactly 10 is labelled '11-50'.
categorised_df['PriceRange'] = pd.cut(categorised_df['UnitPrice'], bins=bins, labels=labels, right=False)

# Count of each price range
price_range_counts = categorised_df['PriceRange'].value_counts()

print("Price Range Counts:")
print(price_range_counts)

Price Range Counts:
PriceRange
0-10        46364
11-50        2559
101-500        51
501-1000       29
51-100         14
1000+           2
Name: count, dtype: int64
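How `pd.cut()` treats prices that fall exactly on a bin edge depends on the `right` argument. The sketch below compares both settings on a shortened version of the bins. The sample prices are chosen to sit on or next to the edges.

```python
import pandas as pd

bins = [0, 10, 50, 100]
labels = ["0-10", "11-50", "51-100"]
prices = pd.Series([9.99, 10.00, 49.99, 50.00])

# right=False makes each bin left-closed: [0, 10), [10, 50), [50, 100).
left_closed = pd.cut(prices, bins=bins, labels=labels, right=False)

# right=True (the pandas default) makes each bin right-closed: (0, 10], (10, 50], (50, 100].
right_closed = pd.cut(prices, bins=bins, labels=labels, right=True)

print(left_closed.tolist())
print(right_closed.tolist())
```

With `right=False`, edge prices of 10 and 50 move into the higher range; with `right=True` they stay in the lower one, but a price of exactly 0 would be excluded unless `include_lowest=True` is also set.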


In [None]:
# Save the cleaned, categorised and price-binned DataFrame to a CSV file
categorised_df.to_csv('UK_online_retail_final.csv', index=False)

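# Delete intermediate CSV files that are no longer needed.
# Note: 'online_retail_categorized.csv' is read earlier in this notebook,
# so it must be regenerated before the notebook is run again.
for path in ['online_retail_cleaned.csv', 'online_retail_categorized.csv']:
    if os.path.exists(path):
        os.remove(path)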

# Section 2 : VISUALISATIONS

Section 2 content



---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
    print(e)