# Background Information:
- The company recently launched two new products, BobaWonder and WaterCure, which were developed as substitute goods for different target audiences. BobaWonder is aimed at bubble tea customers, while WaterCure targets individuals who prefer high-quality water. Sales data analysis nevertheless showed a positive correlation between the sales of the two products, which puzzled the company, since sales of substitutes would normally be expected to move in opposite directions.

- To investigate this further, a deeper analysis was conducted on the relationship between sales and external factors such as temperature. The hypothesis is that temperature acts as a common driver: if hot days raise demand for refreshing products like WaterCure, a correlation between WaterCure sales and high temperatures could explain why its sales move together with those of BobaWonder.

- For this analysis, I was provided with a dataset containing the number of WaterCure units sold per day and the daily temperature in Australia from January to August 2024. The goals of this analysis are to:

1. Summarize the daily sales and temperature data.
2. Examine the correlation between daily units sold and daily temperature.
3. Sort the data according to the number of units sold per day.
4. Save the sorted data to a new CSV file for further analysis.

- This analysis aims to understand the impact of climatic conditions on the sales of WaterCure and to identify whether external factors, such as temperature, might explain the unexpected correlation between the two products.

In [27]:
# Importing library
import pandas as pd

# Path to the CSV file (raw string, so the backslashes are not treated as escapes)
file_path = r'C:\Users\balbi\OneDrive\daily_sales.csv'

# Reading the CSV file, skipping the first two rows so that the third row is used as the header
df_cleaned = pd.read_csv(file_path, skiprows=2)

# Renaming columns for clarity
df_cleaned.columns = ['Date', 'Daily Units Sold', 'Daily Unit Price', 'Daily Temperature (C)']

# Displaying the first five rows of the cleaned DataFrame
df_cleaned.head()

       Date  Daily Units Sold  Daily Unit Price  Daily Temperature (C)
0  1/1/2024                91              24.0                     25
1  1/2/2024                90              24.0                     24
2  1/3/2024                70              24.0                     19
3  1/4/2024                89              24.0                     23
4  1/5/2024               100              24.0                     36
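
The cell above skips two rows and renames the header, while the later cells skip three rows and pass `names=` directly. The sketch below suggests that the two approaches should yield the same DataFrame. It uses an in-memory stand-in for the file: the two preamble lines and the original header names (`Units`, `Price`, `Temp`) are assumptions, since the real file's first rows are not shown in this notebook. The data rows are copied from the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for daily_sales.csv: two preamble lines, one header
# line, then data rows. The preamble and header text are assumptions.
raw = """Sales report,,,
Generated for analysis,,,
Date,Units,Price,Temp
1/1/2024,91,24.0,25
1/2/2024,90,24.0,24
1/3/2024,70,24.0,19
1/4/2024,89,24.0,23
1/5/2024,100,24.0,36
"""

columns = ["Date", "Daily Units Sold", "Daily Unit Price", "Daily Temperature (C)"]

# Approach 1: skip the preamble, let pandas read the header row, then rename.
df_a = pd.read_csv(io.StringIO(raw), skiprows=2)
df_a.columns = columns

# Approach 2: skip the preamble and the header row, and supply the names directly.
df_b = pd.read_csv(io.StringIO(raw), skiprows=3, names=columns)

# Compare the two results column by column.
print(df_a.to_dict("list") == df_b.to_dict("list"))
```

If the real file follows this layout, a single read with `skiprows=3` and `names=` is enough, and the DataFrame can be reused across cells instead of being reloaded.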


In [29]:
# Reading the CSV file and skipping the first three rows to clean the data
df = pd.read_csv(file_path, skiprows=3, names=["Date", "Daily Units Sold", "Daily Unit Price", "Daily Temperature (C)"]) 

# Getting summary statistics for 'Daily Units Sold' and 'Daily Temperature (C)'
summary_stats = df[['Daily Units Sold', 'Daily Temperature (C)']].describe() 
# Displaying the summary statistics
print(summary_stats)


       Daily Units Sold  Daily Temperature (C)
count        244.000000             244.000000
mean          66.479508              19.512295
std           19.003021               5.374442
min           11.000000               7.000000
25%           57.000000              16.000000
50%           63.500000              19.000000
75%           81.250000              23.000000
max          100.000000              36.000000


# Summary of the Daily Sales and Daily Temperature
# Daily Units Sold
- Count: 244 days (total number of records for daily sales).
- Mean: 66.48 units (average number of units sold per day).
- Standard Deviation (std): 19.00 units (variability in daily sales; a higher value indicates more fluctuation).
- Minimum (min): 11 units (the lowest number of units sold on a single day).
- Median (50%): 63.5 units (the middle value; half of the days had sales less than this amount).
- Maximum (max): 100 units (the highest number of units sold on any single day).

# Daily Temperature (C)
- Count: 244 days (total number of records for temperature).
- Mean: 19.51°C (average daily temperature).
- Standard Deviation (std): 5.37°C (variability in daily temperature; a higher value indicates more fluctuation).
- Minimum (min): 7°C (the coldest temperature recorded).
- Median (50%): 19°C (the middle temperature value; half of the days had temperatures below this level).
- Maximum (max): 36°C (the hottest temperature recorded).

# Interpretation
# Daily Units Sold:

- The mean sales of approximately 66 units per day suggest moderate daily demand for the WaterCure units. The standard deviation of 19 units, roughly 29% of the mean, indicates that daily sales fluctuate noticeably around that average.
- The maximum value of 100 units sold in a single day points to a peak in demand, while the minimum of 11 units sold indicates the lowest level of daily sales.

# Daily Temperature:

- The mean temperature of 19.51°C indicates a moderate climate overall during the recorded period.
- The maximum temperature of 36°C represents the hottest day, which might correspond to increased sales if warmer weather drives higher demand for WaterCure units. Similarly, the minimum temperature of 7°C marks the coldest day during this period.

This summary describes the typical level and spread of both sales and temperature over the specified period. It does not show how the two variables move together, which is examined with the correlation in the next step.
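
As a small illustration of how the spread figures above can be compared, the sketch below computes the coefficient of variation for daily sales and the interquartile range for both columns. The numbers are copied from the `describe()` output above. The variable names `cv_sales` and `iqr` are illustrative and not part of the original analysis.

```python
import pandas as pd

# Values copied from the describe() output shown earlier.
summary = pd.DataFrame(
    {
        "Daily Units Sold": {"mean": 66.479508, "std": 19.003021, "25%": 57.0, "75%": 81.25},
        "Daily Temperature (C)": {"mean": 19.512295, "std": 5.374442, "25%": 16.0, "75%": 23.0},
    }
)

# Coefficient of variation for sales: standard deviation as a fraction of the mean.
cv_sales = summary.loc["std", "Daily Units Sold"] / summary.loc["mean", "Daily Units Sold"]

# Interquartile range: the width of the middle 50% of the days.
iqr = summary.loc["75%"] - summary.loc["25%"]

print(round(cv_sales, 3))
print(iqr)
```

The coefficient of variation is computed only for sales because it is not meaningful for temperatures in degrees Celsius, where zero is an arbitrary point on the scale.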

In [30]:
# Reading CSV file and skipping first three rows to clean the data
df = pd.read_csv(file_path, skiprows=3, names=["Date", "Daily Units Sold", "Daily Unit Price", "Daily Temperature (C)"])

# Calculating the Pearson correlation between 'Daily Units Sold' and 'Daily Temperature (C)'
pearson_correlation = df['Daily Units Sold'].corr(df['Daily Temperature (C)'])

# Rounding off the Pearson correlation to 3 decimal places
pearson_correlation_rounded = round(pearson_correlation, 3)

# Displaying the rounded Pearson correlation
print("Pearson Correlation between Daily Units Sold and Daily Temperature (C):", pearson_correlation_rounded)

Pearson Correlation between Daily Units Sold and Daily Temperature (C): 0.804


# Interpretation
- The Pearson correlation coefficient of 0.804 is close to 1, indicating a strong positive linear correlation between daily units sold and daily temperature.
- This suggests that as the temperature increases, the number of WaterCure units sold also tends to increase. This positive relationship might imply that warmer weather drives a higher demand for these units, possibly because they are more relevant or necessary in hotter conditions.
- The strong positive correlation can be useful for forecasting sales and understanding how external factors like temperature impact the product's demand.
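
To make the coefficient less of a black box, the sketch below computes Pearson's r from its definition and compares it with `Series.corr`. It uses only the five rows displayed by `head()` earlier in this notebook, so it will not reproduce the full-data value of 0.804; it is meant only to show the calculation. The last lines square the reported coefficient, which for a simple linear fit gives the share of variance in sales associated with temperature.

```python
import numpy as np
import pandas as pd

# The five rows shown by head() earlier; not the full 244-day dataset.
sample = pd.DataFrame(
    {
        "Daily Units Sold": [91, 90, 70, 89, 100],
        "Daily Temperature (C)": [25, 24, 19, 23, 36],
    }
)

x = sample["Daily Temperature (C)"].to_numpy(dtype=float)
y = sample["Daily Units Sold"].to_numpy(dtype=float)

# Pearson's r: sum of products of deviations, divided by the square root
# of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same quantity via pandas (Pearson is the default method).
r_pandas = sample["Daily Units Sold"].corr(sample["Daily Temperature (C)"])
print(round(r_manual, 3), round(r_pandas, 3))

# Squaring the coefficient reported for the full dataset.
r_squared_full = round(0.804**2, 3)
print(r_squared_full)
```

On the full dataset, the squared coefficient of about 0.646 suggests that roughly two thirds of the variance in daily sales is associated with temperature in a linear model. That leaves room for other factors.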

In [35]:
# Sort the DataFrame by 'Daily Units Sold' in descending order
sorted_df = df.sort_values(by='Daily Units Sold', ascending=False)

# Display the sorted DataFrame
print(sorted_df.head())

         Date  Daily Units Sold  Daily Unit Price  Daily Temperature (C)
4    1/5/2024               100              24.0                     36
16  1/17/2024               100              24.0                     36
27  1/28/2024                97              24.0                     33
30  1/31/2024                96              24.0                     32
28  1/29/2024                95              24.0                     30


# Interpretation
- After sorting in descending order, the days with the highest sales appear at the top of the DataFrame. The five best-selling days all fall in January, with temperatures between 30°C and 36°C.
- This helps the company identify patterns, such as how sales peak on particularly hot days, reinforcing the potential influence of climate on sales.
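
The fourth goal listed in the background, saving the sorted data to a new CSV file, is not shown in the cells above. A minimal sketch of that step follows. It uses the five rows displayed by `head()` as a stand-in for the full DataFrame, and the output file name `daily_sales_sorted.csv` and its temporary-directory location are hypothetical choices.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the full DataFrame: the five rows shown by head() earlier.
sample = pd.DataFrame(
    {
        "Date": ["1/1/2024", "1/2/2024", "1/3/2024", "1/4/2024", "1/5/2024"],
        "Daily Units Sold": [91, 90, 70, 89, 100],
        "Daily Unit Price": [24.0, 24.0, 24.0, 24.0, 24.0],
        "Daily Temperature (C)": [25, 24, 19, 23, 36],
    }
)

# Sort by units sold, highest first, as in the cell above.
sorted_sample = sample.sort_values(by="Daily Units Sold", ascending=False)

# Hypothetical output location; index=False keeps the old row labels out of the file.
out_path = os.path.join(tempfile.gettempdir(), "daily_sales_sorted.csv")
sorted_sample.to_csv(out_path, index=False)

# Read the file back to confirm that the sorted order was preserved.
check = pd.read_csv(out_path)
print(check.head())
```

In the notebook itself, the same `to_csv` call could be applied to `sorted_df` with whatever output path suits the project.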