# **Milestone** | Cleaning & Analyzing Revenue Data for ASOS

<div style="text-align: center;">
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a8/Asos.svg" alt="ASOS Logo" width="200"/>
</div>

## Introduction
In this Milestone, you'll take on the role of a Junior Data Analyst at ASOS, an online fast-fashion and cosmetics retailer. Your task is to help make sense of some weekly revenue data that... isn't exactly clean.

In the list named `revenue_by_week`, you'll find a snippet of ASOS's estimated weekly revenue, recorded in millions of pounds.

If you try to sum this list, Python throws a TypeError. Why?

In [4]:
revenue_by_week = [65, 77, '66', '74',
                   64, 82, '86', 72, '80',
                   96, 101, '35', '72', '68',
                  ]

sum(revenue_by_week)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> I have the following list of revenue data:
revenue_by_week = [65, 77, '66', '74', 64, 82, '86', 72, '80', 96, 101, '35', '72', '68']
When I try to sum this list, I get a TypeError. Why is that happening, and how can I fix it?
  </span>
</div>


## Task 1: Cleaning The Data

A `TypeError` here means that Python doesn't know how to add (`+`) a number (`int`) and a string (`str`) together.

Take a closer look at the `revenue_by_week` list. You'll notice that some numbers are stored as strings (with quotes around them), while others are integers.
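Rather than scanning the list by eye, you can ask Python which values are stored as strings. The sketch below is one possible way to check; the name `string_positions` is an illustrative choice, not part of the original exercise.

```python
revenue_by_week = [65, 77, '66', '74',
                   64, 82, '86', 72, '80',
                   96, 101, '35', '72', '68']

# Collect the index of every value that is stored as a string
string_positions = [
    index for index, value in enumerate(revenue_by_week)
    if isinstance(value, str)
]

print(f"{len(string_positions)} of {len(revenue_by_week)} values are strings")
print("Positions of the string values:", string_positions)
```

Remember that positions start at 0, so position 2 is the third week in the list.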

While you could go into the list and manually remove all of the `'` marks, that would be tedious, and it wouldn't scale if you had many more numbers than these three months of ASOS data. Instead, you can use your programming skills so that Python does the manual work for you! There's a built-in function, `int()`, that converts the argument given to it into an integer.

**Run the cell below** to see a demonstration on a few different data types.

<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
<strong>Note: </strong>The last line in the cell will give an error since a `list` is not a number (the individual elements can be, though!)
</span>
</div>

In [5]:
print(int(105))   # integer
print(int(52.80)) # decimal (float)
print(int('97'))  # integer string
print(int([105, '97'])) # list

105
52
97


TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

Notice that the `int()` function can accept values that are already numbers (chopping off any decimal part if present) or strings that represent integers (it will fail on decimal strings, however). You should also notice that Python threw an error when we tried to give it a list.
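The decimal-string caveat is easy to miss, so here is a small sketch of it. The value `'52.80'` is just the earlier float example written as a string; converting through `float()` first is one common workaround, not the only one.

```python
# A decimal string cannot be passed to int() directly
try:
    int('52.80')
except ValueError as error:
    print("ValueError:", error)

# Converting to float first, then to int, truncates the decimal part
converted = int(float('52.80'))
print(converted)
```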

To clean the data, we need to convert the values one item at a time.


To complete this task, create a new list `cleaned_revenue` that has all of the data values in a numeric data type:
- Set up `cleaned_revenue` as an empty list.
- Use a `for` loop to loop over the elements of `revenue_by_week`.
  - For each element, convert it to an integer data type with the `int()` function.
  - Append the converted value to the `cleaned_revenue` list.

Outside of your loop, `print` the completed `cleaned_revenue` and `print` the total sum of values.

<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> I’ve got a mix of strings and integers in my list, and I’m converting everything to int(). Are there any risks or edge cases where this approach could go wrong? When might it fail in a real dataset?
  </span>
</div>
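As a rough illustration of the edge cases that prompt is asking about, the sketch below runs `int()` over some hypothetical messy values (they are not taken from the ASOS data) and sets aside anything that cannot be converted. The names `messy_values`, `converted`, and `skipped` are assumptions made for this example.

```python
# Hypothetical messy values, NOT part of the ASOS data
messy_values = [65, '66', '74.5', ' 80 ', 'N/A', None]

converted = []  # values that int() could handle
skipped = []    # values that raised an error

for value in messy_values:
    try:
        converted.append(int(value))
    except (ValueError, TypeError):
        # ValueError: a string that is not an integer ('74.5', 'N/A')
        # TypeError: a type int() does not accept (None)
        skipped.append(value)

print("Converted:", converted)
print("Skipped:", skipped)
```

Note that `int()` tolerates surrounding whitespace, so `' 80 '` converts without any extra handling.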

In [12]:
# set up storage for cleaned data
cleaned_revenue = []

# loop through data and convert each element to an integer
for revenue in revenue_by_week:
    # int() handles both integers and integer strings
    cleaned_revenue.append(int(revenue))

# assess the cleaned data by printing the list and its total
print(cleaned_revenue)
print(f"The sum of the cleaned \"revenue_by_week\" list is {sum(cleaned_revenue)}.")

[65, 77, 66, 74, 64, 82, 86, 72, 80, 96, 101, 35, 72, 68]
The sum of the cleaned "revenue_by_week" list is 1038.


<div style="border: 3px solid #f8c43e; background-color: #fff3c1; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
      If done correctly, the value for the <span style="font-family: monospace; color: #222;">sum</span> of <span style="font-family: monospace; color: #222;">cleaned_revenue</span> should be <strong>1038</strong>.
  </span>
</div>
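The same cleaning step can also be written more compactly with a list comprehension. This is an alternative sketch rather than the approach the task asks for, and it should produce the same list and total as the `for` loop.

```python
revenue_by_week = [65, 77, '66', '74',
                   64, 82, '86', 72, '80',
                   96, 101, '35', '72', '68']

# Build the cleaned list in one expression instead of an explicit loop
cleaned_revenue = [int(revenue) for revenue in revenue_by_week]

print(cleaned_revenue)
print(sum(cleaned_revenue))
```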

## Task 2: Monthly Analysis

Great! With the cleaned list `cleaned_revenue`, you're ready to calculate:
- **The total amount made in each month.**
- **The month with the highest average weekly revenue.**

You're told the months break down like this:
- June: first 4 weeks
- July: next 5 weeks
- August: final 5 weeks


Use slicing to get the relevant parts of the cleaned revenue list, then use the `sum()` and `len()` functions to help you calculate the total and average for each month. Remember: the average will be the total revenue divided by the number of weeks.

<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
<strong>Hint: </strong>Be careful about how indexing works in Python! You might want to try printing the slices you pull out first to check that they're capturing the correct values, before trying to summarize them.
</span>
</div>
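To make the indexing rules concrete, here is a minimal sketch on a short hypothetical list of week labels (not the revenue data). A slice `[start:stop]` includes `start` but excludes `stop`, and negative indices count from the end.

```python
# Hypothetical week labels used only to illustrate slicing
weeks = ['w1', 'w2', 'w3', 'w4', 'w5', 'w6']

first_two = weeks[0:2]   # positions 0 and 1; position 2 is excluded
middle_two = weeks[2:4]  # positions 2 and 3
last_two = weeks[-2:]    # the final two items

print(first_two)
print(middle_two)
print(last_two)

# The three slices together should cover every item exactly once
print(len(first_two) + len(middle_two) + len(last_two) == len(weeks))
```

Checking that the slice lengths add up to the length of the full list is a quick way to catch a week that was skipped or counted twice.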

In [27]:
# revenue by month
june_revenue = cleaned_revenue[0:4]
july_revenue = cleaned_revenue[4:9]
august_revenue = cleaned_revenue[-5:]

# calculate sum
june_total = sum(june_revenue)
july_total = sum(july_revenue)
august_total = sum(august_revenue)

# calculate avg
june_avg = june_total / len(june_revenue)
july_avg = july_total / len(july_revenue)
august_avg = august_total / len(august_revenue)

# print the total amount and average revenue for each month
print(f"At a revenue total of {june_total} for June, the average for the first 4 weeks is {june_avg}.")
print(f"At a revenue total of {july_total} for July, the average for the next 5 weeks is {july_avg}.")
print(f"At a revenue total of {august_total} for August, the average for the last 5 weeks is {august_avg}.")

At a revenue total of 282 for June, the average for the first 4 weeks is 70.5.
At a revenue total of 384 for July, the average for the next 5 weeks is 76.8.
At a revenue total of 372 for August, the average for the last 5 weeks is 74.4.


<div style="border: 3px solid #f8c43e; background-color: #fff3c1; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
      If done correctly, the average values you should get are:
  <ul>
    <li>June: 70.5</li>
    <li>July: 76.8</li>
    <li>August: 74.4</li>
  </ul>  </span>

</div>

Which month had the highest average revenue?

July had the highest average revenue.
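You can also let Python find the top month instead of comparing the averages by eye. The sketch below is one possible approach; the dictionary names `months` and `averages` and the variable `best_month` are illustrative choices, not part of the original exercise.

```python
cleaned_revenue = [65, 77, 66, 74, 64, 82, 86, 72, 80, 96, 101, 35, 72, 68]

# Map each month to its slice of weekly revenue
months = {
    "June": cleaned_revenue[0:4],
    "July": cleaned_revenue[4:9],
    "August": cleaned_revenue[9:14],
}

# Average weekly revenue per month: total divided by number of weeks
averages = {month: sum(weeks) / len(weeks) for month, weeks in months.items()}

# max() with key= returns the month whose average is largest
best_month = max(averages, key=averages.get)
print(f"{best_month} had the highest average weekly revenue.")
```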

What do you think could explain the difference in revenue between months? Are there any patterns or external factors that could be influencing these fluctuations?

<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> [MONTH] had the highest average revenue, but only by a small margin. What kinds of external factors might explain fluctuations in weekly sales for a fashion retailer like ASOS?
  </span>
</div>

Because the data covers only three summer months, one plausible explanation for July's higher average is holiday sales. While June and August also contain holidays, July 4th is a major retail sale period that usually extends across the weekends before and after it, which may be the largest contributor to July's higher average revenue. Averages (means) are also more sensitive to outliers, such as unusually large weeks, than the median would be. For example, under the given breakdown, June covers fewer weeks than the other two months, so it has fewer chances to include an unusually large week, i.e. a large bump in revenue.
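The point about averages and outliers can be illustrated with a small sketch. The numbers below are hypothetical and chosen only to show the effect; they are not ASOS figures. A single unusually large week pulls the mean up sharply, while the median stays where it was.

```python
from statistics import mean, median

# Hypothetical weekly revenue, NOT taken from the ASOS data
typical_weeks = [70, 72, 74, 76, 78]
with_spike = [70, 72, 74, 76, 178]  # same weeks, but one large outlier

print("Typical weeks: mean =", mean(typical_weeks), "median =", median(typical_weeks))
print("With spike:    mean =", mean(with_spike), "median =", median(with_spike))
```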

## LevelUp

If you were preparing a short internal summary for ASOS leadership, what is one clear takeaway about monthly revenue performance and what would you recommend they investigate further?

The clear takeaway is that July had the highest average weekly revenue (76.8, versus 74.4 in August and 70.5 in June). I would recommend investigating how much of that difference is driven by holiday sales, and running more sales tied to specific times of the year. While many retailers across the nation compete on the big holiday deals (such as July 4th, Labor Day, Memorial Day, and tax-free weekends), ASOS could run its own independent holiday deals that are accessible both in-store and online. Also, given that US-based companies are maintaining or increasing the number of paid holidays, according to the Management Research Association (MRA) 2023 Holiday Practices Survey, in-store-specific sales may be a big driver, with even larger discounts than those seen online. Again, I would recommend increasing the number of store holiday deals, which could even prompt more retailers to tie sales to more holidays.