# Kiosk Sales Report

## 1. Introduction

### Project Context

This project aims to analyze sales data from a kiosk that offers various products, such as candies, beverages, cigarettes, and other items. By applying the CRISP-DM methodology, we systematically approach the data analysis process to extract valuable insights that can help increase sales and improve the kiosk’s overall performance.

### Process Summary

The analysis followed the six phases of CRISP-DM: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Each phase was crucial to ensure a thorough analysis and results applicable to the business.

## 2. Key Results

The following visualizations summarize the most significant results of the analysis. These include the distribution of sales by category, sales trends over time, and the performance of predictive models.

### 2.1. Distribution of Total Sales

The analysis of the distribution of total sales reveals the following observations:

- Right-Skewed Distribution: The total sales distribution is asymmetric, with a long right tail. Most values are low, indicating that most transactions involve small amounts.
- High Frequency in Low Sales: Most transactions fall in the range of 0 to 25 units, with the frequency peaking at around 100 transactions in the lowest interval. This suggests that small sales dominate the distribution.
- Gradual Decrease in Higher Sales: As total sales values increase, their frequency gradually decreases. Sales exceeding 100 units are much less frequent.
- Long Tail Towards Higher Sales: A few high-value transactions occur, though they are much less common. This long tail may be influenced by outliers or unique high-value sales.
 
In summary, the right-skewed distribution and high frequency of small sales suggest that the business primarily consists of lower-value transactions. Identifying strategies to increase the average sale amount could be an opportunity to boost total sales.

![Distribution of Total Sales](../graphs/distribution_of_total_sales.png)
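
As a rough illustration of how right skew can be checked numerically, the sketch below computes the moment-based skewness of a list of transaction totals and compares the mean with the median. The `total_sales` values and the `sample_skewness` helper are hypothetical and are not taken from the kiosk dataset.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5; a positive value means a longer right tail."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical transaction totals (assumption, not real kiosk data).
total_sales = [4.0, 6.5, 8.0, 9.5, 12.0, 15.0, 18.5, 22.0, 40.0, 130.0]

skew = sample_skewness(total_sales)
print(f"mean={mean(total_sales):.2f} median={median(total_sales):.2f} skewness={skew:.2f}")
```

In a right-skewed sample like this one, the mean sits above the median and the skewness is positive, which matches the shape described for the histogram.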

### 2.2. Total Sales by Category

The analysis of total sales by category reveals the following observations:

- Dominant Category: The 'Cigarettes' category leads in sales with a total close to 17,500 units. This suggests that cigarettes account for the largest portion of sales volume compared to other categories.
- Significant Sales in 'Drinks' and 'Candies': The 'Drinks' and 'Candies' categories show similar performance, with total sales ranging between 15,000 and 16,000 units. This indicates that these categories are also significantly popular among customers.
- Lower Performance in 'Others': The 'Others' category has the lowest sales, with a total around 13,000 units. This category seems to have a lesser impact on overall sales compared to the other evaluated categories.

In summary, the 'Cigarettes', 'Drinks', and 'Candies' categories stand out as the top sellers, with 'Cigarettes' leading. These results suggest that sales and marketing strategies could be more focused on these categories to maximize revenue.

![Total Sales by Category](../graphs/total_sales_by_category.png)
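
A per-category total like the one charted above can be produced with a simple group-and-sum. The following is a minimal sketch, assuming each transaction is a `(category, total_sale)` pair; the row layout, the values, and the `total_by_category` helper are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical (category, total_sale) rows; not real kiosk data.
transactions = [
    ("Cigarettes", 12.0),
    ("Drinks", 3.5),
    ("Candies", 1.5),
    ("Cigarettes", 6.0),
    ("Others", 2.0),
    ("Drinks", 4.0),
]

def total_by_category(rows):
    """Sum sale amounts per category and rank categories from highest to lowest."""
    totals = defaultdict(float)
    for category, amount in rows:
        totals[category] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = total_by_category(transactions)
for category, total in ranking:
    print(f"{category}: {total:.2f}")
```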

### 2.3. Sales Trends Over Time

The analysis of total sales over time reveals the following key points:

- Variability in Daily Sales: Total sales show considerable variability across days, with frequent fluctuations between consecutive days. This indicates an inconsistent sales pattern, which could be influenced by external factors such as promotions, special events, or changes in market demand.
- Notable Sales Peaks: Significant sales peaks are observed on 06-28, 07-06, 07-15, and 07-18, with the highest reaching close to 3,000 units on 07-18. These peaks might correspond to periods of high demand or successful promotional activities.
- Days with Low Sales: In contrast, the lowest sales are recorded on 07-01, 07-10, and 07-21, with a notable drop on 07-21 where sales fall below 1,000 units. These low sales days could represent opportunities to investigate underlying causes and adjust strategies.
- Overall Trend: Despite high variability, there is a slight upward trend in sales from late June to mid-July, suggesting a gradual increase in demand during this period.

In summary, the data show high volatility in daily sales with several peaks and troughs. Identifying the factors driving high peaks and sharp declines could be key to optimizing sales strategy and improving performance consistency.

![Sales Trends Over Time](../graphs/sales_trends_over_time.png)
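
One simple way to quantify an overall trend in a volatile daily series is to aggregate sales per day and fit a least-squares line against the day index. The sketch below assumes rows of `(MM-DD, amount)` within a single year; the data and the helper names `daily_totals` and `trend_slope` are hypothetical.

```python
from collections import defaultdict

def daily_totals(rows):
    """Sum amounts per day and return the totals in chronological order."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[day] += amount
    # MM-DD strings sort chronologically as long as all rows share one year.
    return [totals[day] for day in sorted(totals)]

def trend_slope(series):
    """Least-squares slope of the series against its index 0, 1, 2, ..."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

# Hypothetical rows (assumption, not real kiosk data).
rows = [
    ("06-28", 1200.0), ("06-28", 600.0),
    ("06-29", 1400.0),
    ("06-30", 1900.0),
    ("07-01", 1100.0),
    ("07-02", 2300.0),
]

daily = daily_totals(rows)
slope = trend_slope(daily)
print(f"daily totals: {daily}")
print(f"average change per day: {slope:.1f}")
```

A positive slope indicates an upward trend on average, even when individual days swing widely in both directions.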

### 2.4. Correlation Analysis

A correlation analysis was performed to evaluate the relationship between different variables in the dataset. The following key results were observed in the correlation matrix:

#### 1. Relationship Between Quantity and Other Variables
- Quantity shows a strong positive correlation with Total Sale (0.74) and Taxes (0.68). This indicates that as the quantity sold increases, both total sales and associated taxes tend to rise significantly.
- A moderate positive correlation with Discounts (0.49) is also observed, suggesting that higher quantities sold are related to applied discounts, which could indicate volume discount strategies.

#### 2. Relationship Between Unit Price and Other Variables
- Unit Price has a near-zero correlation with Quantity (0.0056), indicating no linear relationship between unit price and quantity sold in this dataset.
- However, Unit Price correlates positively with Total Sale and Taxes. This is expected, as higher unit prices should lead to higher total sales and consequently higher taxes.


#### 3. Negative Relationships
- A negative relationship is observed between Remaining Stock and Day (-0.40), which could reflect a trend of decreasing remaining stock as the day progresses or over time.

These results provide an overview of how variables are interrelated and can guide decisions on pricing policies, inventory management, and sales strategies.

![Correlation Matrix](../graphs/correlation_matrix.png)
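
The coefficients in a correlation matrix like this are Pearson correlations computed for every pair of numeric columns. The sketch below shows the calculation on a few hypothetical columns. The column names, the values, and the assumption that `total_sale` equals `quantity * unit_price` are illustrative and not confirmed by the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical columns (assumption, not real kiosk data).
columns = {
    "quantity": [1, 2, 3, 4, 5],
    "unit_price": [2.0, 1.5, 3.0, 1.0, 2.5],
}
# Assumed relationship for illustration only.
columns["total_sale"] = [
    q * p for q, p in zip(columns["quantity"], columns["unit_price"])
]

# Full pairwise matrix: matrix[a][b] is the correlation between columns a and b.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Note that a coefficient near zero only rules out a linear relationship; it does not show that one variable has no effect on the other.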

## 3. Model Performance

### 3.1. Model Evaluation
Several regression models were trained to predict future sales, including Linear Regression, Decision Tree, and Random Forest. Below are the evaluation metrics for each model:

![Model Evaluation Metrics Table](../graphs/model_evaluation_metrics_table.png)

### 3.2. Best Model Selection
Based on the metrics presented in the table above, the Random Forest model was selected as the best model for sales prediction. This model showed the lowest Mean Absolute Error (MAE) and Mean Squared Error (MSE), as well as the highest R-squared (R2), indicating that it best fits the data and makes the most accurate predictions.

Selected Model: Random Forest
- MAE: 0.98
- MSE: 3.59
- R2: 0.998
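
For reference, the three metrics used in the selection can be computed as follows. This is a minimal sketch with hypothetical `y_true` and `y_pred` values; it illustrates the standard definitions and does not reproduce the figures reported above.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average squared error, which penalizes large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted sales (assumption, not real kiosk data).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

print(f"MAE={mae(y_true, y_pred):.2f} MSE={mse(y_true, y_pred):.2f} R2={r2(y_true, y_pred):.3f}")
```

Lower MAE and MSE and a higher R2 all point in the same direction, which is why a model that leads on all three is a straightforward choice.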

### 3.3. Feature Importance
The feature importance analysis reveals that Quantity and Unit Price are the most influential variables in predicting sales, with far greater importance than the categorical variables. This suggests that sales strategies should focus on these two factors to increase sales.

- Quantity: 0.56
- Unit Price: 0.44
- The other categories (Candies, Cigarettes, Others, Drinks) did not show significant influence on the model.

![Feature Importances](../graphs/feature_importances.png)

### 3.4. Model Comparison
Comparing the performance of the different models trained, it is confirmed that the Random Forest model not only outperforms Linear Regression and Decision Tree in terms of evaluation metrics but also proves to be the most robust, minimizing error and maximizing prediction accuracy.

- MSE - Linear Regression: 204.30
- MSE - Decision Tree: 5.08
- MSE - Random Forest: 3.59
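
To put these differences in perspective, the relative MSE reduction achieved by the best model can be worked out directly from the three values listed above. The sketch below uses only those reported figures; the variable names are illustrative.

```python
# MSE values as reported in the list above.
mse_by_model = {
    "Linear Regression": 204.30,
    "Decision Tree": 5.08,
    "Random Forest": 3.59,
}

# The best model is the one with the lowest MSE.
best = min(mse_by_model, key=mse_by_model.get)
best_mse = mse_by_model[best]

# Percentage reduction in MSE when switching from each other model to the best one.
reductions = {
    model: (value - best_mse) / value * 100
    for model, value in mse_by_model.items()
    if model != best
}

for model, pct in reductions.items():
    print(f"{best} reduces MSE by {pct:.1f}% relative to {model}")
```

The gap to Linear Regression is far larger than the gap to Decision Tree, so most of the gain comes from moving to a tree-based model at all.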

### 3.5. Future Sales Prediction
With the Random Forest model selected, future sales predictions were made, yielding the following results:

- Future Sales Predictions: 16.05, 24.90, 35.79
 
### 3.6. Model Validation
To ensure the robustness of the model, additional evaluation was conducted using a test set and cross-validation:

#### Test Set Performance:

- MAE: 0.98
- MSE: 3.59
- R2: 0.998

#### Cross-Validation:

- Average MSE: 2.51
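
The averaging procedure behind a cross-validated MSE can be sketched as below. To keep the example self-contained, a one-feature least-squares line stands in for the Random Forest; this is an assumption made purely for illustration, as are the data, the fold count, and the helper names `fit_line` and `k_fold_mse`.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def k_fold_mse(xs, ys, k):
    """Average test MSE over k folds; each sample is held out exactly once."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample forms this fold
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        errors = [(y - (slope * x + intercept)) ** 2 for x, y in test]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

# Hypothetical data roughly following y = 2x + 1 (assumption, not real kiosk data).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.2, 20.9]

avg_mse = k_fold_mse(xs, ys, k=5)
print(f"average cross-validated MSE: {avg_mse:.4f}")
```

Because every fold is scored on data the model did not see during fitting, the averaged MSE gives a more stable estimate of out-of-sample error than a single train/test split.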
    
### 3.7. Residual Analysis
The residual analysis showed that the residuals are randomly distributed around zero, indicating that the model does not have systematic errors. However, some outliers were observed, which may require further review to improve the model's performance. This analysis supports the reliability of the Random Forest model for making accurate predictions.

![Residual Analysis](../graphs/residual_analysis.png)
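
A residual check of this kind can be sketched as follows: compute actual minus predicted values, then flag residuals that lie unusually far from the mean residual. The data, the `flag_outliers` helper, and the threshold of 2 standard deviations are assumptions chosen for this small example, not the criteria used in the analysis.

```python
from statistics import mean, pstdev

def flag_outliers(y_true, y_pred, z_threshold=2.0):
    """Return residuals and the indices whose residual deviates from the mean
    residual by more than z_threshold standard deviations."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mu, sigma = mean(residuals), pstdev(residuals)
    outliers = [
        i for i, r in enumerate(residuals) if abs(r - mu) > z_threshold * sigma
    ]
    return residuals, outliers

# Hypothetical actual and predicted sales (assumption, not real kiosk data).
y_true = [10.0, 12.0, 15.0, 20.0, 25.0, 30.0, 18.0, 22.0]
y_pred = [10.5, 11.5, 15.2, 19.8, 25.3, 22.0, 18.1, 21.9]

residuals, outliers = flag_outliers(y_true, y_pred)
print("residuals:", [round(r, 2) for r in residuals])
print("outlier indices:", outliers)
```

With very small samples a single extreme residual inflates the standard deviation, so a looser threshold is used here; on a larger test set a stricter cutoff may be more appropriate.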

## 4. Interpretation of Results and Conclusions
This section interprets the results obtained from previous analyses and predictive models, extracting key conclusions that will inform strategic decisions.

### 4.1. Total Sales and Category Distribution
- Interpretation: The analysis of total sales distribution shows a clear concentration of sales in low values, suggesting that most transactions are of lower value. The 'Cigarettes' category leads in sales, followed by 'Drinks' and 'Candies,' indicating that these products, though with small individual transactions, generate significant revenue.

- Conclusions:
    - Focus on Key Categories: The high concentration of sales in the 'Cigarettes,' 'Drinks,' and 'Candies' categories suggests that focusing marketing strategies on these products could maximize revenue.
    - Opportunity to Increase Average Ticket: Since most sales are of lower value, identifying strategies to increase the average ticket, such as volume promotions or bundles, could significantly boost total sales.

### 4.2. Sales Trends Over Time
- Interpretation: The temporal analysis of sales revealed considerable daily variability, with some significant peaks and drops, suggesting that external factors like promotions or special events could be strongly influencing sales.

- Conclusions:
    - Leveraging Demand Peaks: Future strategies should focus on capitalizing on high-demand days, possibly associated with special events or promotions, while investigating the reasons behind low sales days to optimize consistency in sales.
    - Inventory and Promotion Planning: Adjusting inventory levels and promotion planning based on observed sales patterns can help minimize the impact of daily variability and improve operational efficiency.
      
### 4.3. Predictive Model Performance
- Interpretation: The Random Forest model was selected as the best model for sales prediction, based on its superior performance across all evaluated metrics (MAE, MSE, and R2).

- Conclusions:
    - Confidence in Future Predictions: The Random Forest model is a reliable tool for predicting future sales, allowing the company to make more informed data-driven decisions.
    - Ongoing Monitoring and Validation: Although the model shows robust performance, it is crucial to continue monitoring its performance and making adjustments as new data is collected to maintain its accuracy.

### 4.4. Feature Importance
- Interpretation: The feature importance analysis highlights Quantity and Unit Price as the most influential factors in sales prediction.

- Conclusions:
    - Focus on Quantity and Price Optimization: Sales strategies should focus on optimizing the quantity sold and the unit price, as these variables have a significant impact on sales predictions.
    - Review of Secondary Strategies: The categorical variables (product categories) showed little influence on the model, so strategies built around them should be reviewed to ensure they contribute value to sales.

### 4.5. Residual Analysis
- Interpretation: The residual analysis showed that the model does not have systematic errors, although some identified outliers could indicate special cases or data errors.

- Conclusions:

    - Outlier Investigation: A detailed review of detected outliers is recommended to determine if they represent opportunities for model improvement or if they are unique cases requiring a particular approach.
    - Continuous Model Refinement: Investigating these outliers will help refine the model, further improving its accuracy and applicability.


## 5. Final Conclusions
### 5.1. Business Implications
- Product Strategies: Since the 'Cigarettes,' 'Drinks,' and 'Candies' categories lead sales, focusing marketing strategies in these areas could maximize revenue.
- Price and Quantity Optimization: Improving price management and increasing the quantity sold per transaction are key to enhancing overall business performance.
- Demand Management: Identifying and understanding the factors influencing daily sales variations will optimize planning and promotions, reducing volatility and maximizing revenue opportunities.

### 5.2. Suggested Actions
- Marketing Focus: Develop campaigns specifically targeting the product categories that contribute most to sales, such as 'Cigarettes,' 'Drinks,' and 'Candies.'
- Price Review: Review prices and consider adjustments, taking into account the near-zero correlation (0.0056) between Unit Price and Quantity sold found in the correlation analysis.
- Promotions and Offers: Run targeted promotions on the identified high-demand days to capitalize on sales peaks, and use offers to mitigate low-sales days.
- Inventory Management: Adjust inventory to align with observed sales trends, ensuring the availability of the most demanded products.

### 5.3. Future Considerations
- Exploration of New Categories: Investigate the possibility of introducing or expanding product categories that could increase sales in less-exploited segments.
- Data Collection Improvements: Continue enhancing the quality and quantity of collected data for future analysis, allowing for further refinement of predictive models and business strategies.
- Ongoing Model Evaluation: Continuously monitor the performance of the Random Forest model and other alternative models to ensure predictions remain accurate over time.

### 5.4. Next Steps

#### Model Implementation

- Model Integration: Implement the Random Forest model in the production environment to generate real-time sales predictions and support strategic decision-making.
- Staff Training: Train the sales and marketing team on the use of predictive tools so they can interpret the results and adjust strategies accordingly.

#### Monitoring and Adjustments

- Performance Tracking: Establish a continuous monitoring process for the model to evaluate its accuracy and adaptability to possible changes in sales patterns.
- Periodic Adjustments: Make periodic adjustments to the model and business strategies based on the latest data, ensuring alignment with current market conditions and customer needs.

## 6. Closing
### Final Reflection
In summary, this comprehensive sales analysis has provided valuable insights that not only help understand the current business behavior but also offer a solid foundation for future planning. The distribution of total and category sales highlights the importance of key products such as 'Cigarettes,' 'Drinks,' and 'Candies,' and suggests that there is significant potential in developing strategies focused on increasing the average ticket. Additionally, sales trends over time reveal high variability, highlighting the need to adjust promotions and inventory management according to demand patterns. The superior performance of the Random Forest model reinforces confidence in the ability to predict future sales accurately, while the feature importance analysis underscores the importance of focusing on quantity and unit price for optimized sales performance.

### Acknowledgements
- Project Team: Thank you to everyone who contributed to this project for their dedication and professionalism.
- Stakeholders: Special thanks to the key stakeholders for their support and valuable feedback throughout the process.