# Requirement specification

- Author: Iina Pirinen
- Version: 0.9
- Date: 23.10.2023 

## Introduction

This project is part of the AI/DA (Artificial Intelligence/Data Analytics) course, TTC8070-3005, offered by JAMK University of Applied Sciences. The assignment involves analyzing a dataset related to used cars.

The project group consists of students from JAMK, collaborating to apply data mining and analytics techniques to gain insights into the used car market.

The project follows the CRISP-DM (Cross Industry Standard Process for Data Mining) model, which provides a structured framework for data analysis.

Although there is an "imaginary customer" defined for this project, it's important to note that the customer's presence is conceptual. The customer serves as the driving force behind the project's goals and objectives.

## Client

The customer is a car dealership chain that has commercial interests in utilizing data analytics for future trading. The chain's core business is currently the sale of brand-independent used vehicles. The customer especially wants to know what features affect the price and sale time of used vehicles. The customer is also interested in what kinds of cars (make, type, model year, etc.) sell more easily, so that they do not remain in the warehouse for too long.

### Business Area Questions

- Feature Impact on Price: Understand which features (e.g., make, model, mileage, year) significantly affect the pricing of used cars.
- Factors Influencing Sale Time: Determine the factors influencing the time a used car stays in inventory before being sold.
- Market Demand for Vehicle Types: Identify the types of vehicles and manufacturers that are in higher demand in the market.

## Business goals

1. Optimize Pricing Strategy: Determine the features that significantly influence the pricing of used vehicles. The goal is to develop a pricing strategy that maximizes profit while remaining competitive in the market.
2. Reduce Inventory Holding Time: Identify factors that impact the sale time of used vehicles. Develop insights to minimize the time cars spend in the dealership's inventory, improving cash flow and operational efficiency.
3. Market Demand Understanding: Gain insights into the types of vehicles and manufacturers that have higher demand. This understanding helps in optimizing inventory to meet market preferences, reducing the risk of having unsold vehicles in stock.
4. Profitability of Selling Taxis: The customer is also particularly interested in whether a car's price is affected if the car has previously been used as a taxi. We will seek an answer to this question as well.

## Beneficiaries

- Primary Beneficiary: The car dealership chain will directly benefit from improved pricing strategies, reduced holding costs, and enhanced market responsiveness.
- Secondary Beneficiaries: Customers may benefit from more competitive pricing, and the dealership's partners may benefit from more streamlined inventory management.

## Project costs vs. project benefits

In this context, while the project may have minimal direct costs due to the voluntary nature of the team, the potential benefits in terms of enhanced profitability, risk reduction, and competitive advantage for the car dealership can be significant. These benefits, when realized, can provide long-term value to the business and outweigh the limited costs involved. Since there is no real customer, we will not learn the final value of the project's results from the customer's point of view.

The benefits of the project go mainly to the project team itself, which gains experience in teamwork, in handling and analyzing large amounts of data, and in building machine learning models as a team.


## Stakeholder

The stakeholder is a car dealership company that wants to attract many customers by offering the most relevant cars and to earn better returns than competing companies on the cars it buys.

![](./stakeholder_map.png)

## Solution

1. Feature Importance Analysis: We will provide a list of features and their impact on pricing, helping the dealership prioritize influential factors. The ten (10) most dominant characteristics of the best-selling cars will be presented. The pricing of former taxis will be detailed, along with other characteristics that affect their sale. The worst-selling cars will also be analyzed.

2. Prediction Model for Sale Time: We will develop a model that predicts how long a particular used vehicle is likely to remain in the inventory before being sold. This will help the client decide whether to acquire the car.

3. Prediction Model for Car Price: We will develop a model that predicts the price for the car as accurately as possible based on the car's features.

4. Market Segment Insights: Offer insights into market preferences, helping the dealership make informed decisions about the types of vehicles to stock.

Our solution offers substantial benefits to the car dealership, empowering them with actionable insights to enhance their operations and profitability. By providing a clear list of the ten most influential characteristics of best-selling cars and detailing pricing trends for former taxi vehicles, our solution equips the dealership with the knowledge to set competitive and profitable prices. This data-driven approach reduces guesswork and is expected to improve profit margins and the chances of successful sales.

The prediction model for sale time allows the customer to make informed decisions about whether to acquire a specific used vehicle. This insight is invaluable in reducing the risk of cars sitting in inventory for extended periods. Shorter inventory holding times translate into improved cash flow and operational efficiency.

Our solution provides market segment insights, helping the dealership align their inventory with market preferences. By stocking the right types of vehicles preferred by customers, the dealership can boost sales and minimize the risk of having unsold cars in stock. This knowledge also aids in inventory turnover, reducing the financial burden of holding inventory.

As a result of improved inventory management and more competitive pricing, customers may benefit from a broader selection of vehicles, competitive prices, and a smoother buying experience. Satisfied customers are more likely to return and recommend the dealership to others, fostering long-term customer relationships.

Our solution provides the car dealership with a significant competitive advantage in the used car market. By staying ahead of market trends and customer preferences, they can outperform competitors and maintain a strong position in the industry.

## Resources

1. Personnel

- Data Scientists/Analysts: Team members with expertise in data analysis, machine learning, and statistical modeling.
- Domain Expert: Someone knowledgeable about the automotive industry, market trends, and dealership operations.
- Data Engineer: Team member for preprocessing and database management.

Our team has good skills in analyzing and preprocessing data and in building machine learning models. The team may lack some knowledge of the automotive industry, market trends, and dealership operations, but we expect to learn about them during the project. The team also has some knowledge of business management and budgeting.

2. Data

- Dataset: The "US Used cars dataset" from Kaggle, which is the primary data source for our project.
- Additional Data Sources: Depending on the project's needs, we may require supplementary data sources for market trends, competitors' pricing, or regional factors.

3. Hardware and Software

- Computers: High-performance computers with the necessary hardware (e.g., CPU, RAM, and GPU for machine learning tasks).
- Software Tools:
    - Python for coding, data analysis, and model development.
    - scikit-learn (sklearn), pandas, NumPy, and other Python libraries for data manipulation and machine learning.
    - Jupyter Notebooks or integrated development environments (IDEs) for code development.
    - Database management system (e.g., MySQL, PostgreSQL) if we decide to use a database for data storage.
    - Data visualization tools (e.g., D3.js, Plotly) for creating interactive visualizations.
    - Project management and collaboration tools: Microsoft Teams for team coordination, GitLab for project management, and Git for version control.
- Internet Connectivity: Reliable and fast internet connectivity for data download, research, and access to online resources.

## Requirements

### Data Access

- Access to the "US Used cars dataset" from Kaggle for data analysis.
- Access to supplementary data sources, if necessary for market insights.

### Hardware and Software

- High-performance computers for data analysis and modeling.
- Appropriate software tools, including Python, scikit-learn, data visualization libraries, and database management systems.

### Documentation

Thorough project documentation is required, covering data cleaning, preprocessing, modeling details, and findings.

### Collaboration and Communication

Effective use of project management and collaboration tools to facilitate communication and team coordination.

### Data Quality Assurance

Ensure data quality through data cleaning, handling of missing values, and addressing outliers.
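
As a minimal sketch of what such checks could look like with pandas: the column names `price` and `mileage`, the median imputation, and the 1.5 × IQR outlier rule are assumptions made for illustration, not rules the team has agreed on.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a price, impute mileage, and filter price outliers."""
    out = df.dropna(subset=["price"]).copy()
    # Fill missing mileage with the median of the remaining rows.
    out["mileage"] = out["mileage"].fillna(out["mileage"].median())
    # Keep prices within 1.5 * IQR of the quartiles (a common rule of thumb).
    q1, q3 = out["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[in_range].reset_index(drop=True)


# Tiny made-up sample, only for illustration.
sample = pd.DataFrame(
    {
        "price": [9000, 10000, 11000, 12000, None, 500000],
        "mileage": [80000, None, 60000, 50000, 40000, 1000],
    }
)
cleaned = clean_listings(sample)
```

In this sample the row without a price and the 500000 listing are removed, and the missing mileage is replaced by the median of the remaining rows.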

### Model Evaluation Metrics

Define the specific metrics and criteria for evaluating the machine learning models' performance, such as MAE, MSE, and R-squared for regression models, or accuracy, precision, and recall if a classification model is used.
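
Because price and sale time are numeric targets, regression metrics such as MAE, MSE, and R-squared are likely candidates. The sketch below spells out their formulas in plain Python with made-up numbers; in practice scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` compute the same quantities.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: share of the target's variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices, only for illustration.
actual_prices = [10000, 12000, 14000, 16000]
predicted_prices = [11000, 12000, 13000, 16000]

price_mae = mae(actual_prices, predicted_prices)
price_mse = mse(actual_prices, predicted_prices)
price_r2 = r2(actual_prices, predicted_prices)
```

MAE is expressed in the same unit as the target (for example dollars or days), which makes it the easiest of the three to explain to the dealership.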


## Assumptions

- Data Quality: We may assume that the "US Used cars dataset" is of reasonable quality, but we'll verify this during data quality assessment.

- Relevance of Features: Assumptions about the relevance of certain features to car pricing and sale time may guide our initial analysis, but these assumptions should be validated during feature importance analysis.

- Access to Domain Knowledge: We assume that we have access to a domain expert or resources to understand the automotive market and customer preferences.


## Restrictions

### Data Sensitivity

Ensure that data sensitivity and privacy regulations are adhered to when handling and storing data, especially if the dataset contains any personally identifiable information (PII).

### Information Security

Implement security measures to protect project-related data, especially if any data is stored on a database server or cloud services.

### Data Usage Compliance

Adhere to the terms and conditions of data usage outlined by Kaggle or any other data sources.

### Budget Constraints

Since the project belongs to the JAMK IT Institute course, only freely available sources and applications should be used in the project.

### Resource Limitations

The dataset is intended for academic, research, and individual experimentation only, not for commercial purposes.

### Model Complexity

There may be restrictions on the computational resources available for training complex machine learning models.

### Data Availability

Data files © Original Authors

Access to supplementary data sources may be restricted by licensing or limited availability.

### Team Skills and Experience

Project outcomes may be influenced by the skills and experience of team members.

## Technologies Used

- Python: The primary programming language for implementing data analysis and machine learning solutions.
- scikit-learn (sklearn) Library: Used for machine learning model development, feature analysis, and data preprocessing.
- Machine Learning Models: The specific machine learning algorithms for the project have not been decided yet, but they will play a crucial role in the software architecture.

### Placement View

- Machine Learning Models: The models will be placed within the software architecture to process and analyze the data. The choice of machine learning models will be made in the modeling phase of the project based on the dataset and specific goals.

### Database Description

- Dataset from Kaggle: The dataset, named "US Used cars dataset," is the primary data source for the project.
    - Database Consideration: It has not yet been decided whether a separate database is needed. The dataset may be loaded directly into memory for analysis, but because it is very large, we may have to store it in a database instead.
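
If a database turns out to be necessary, one lightweight option is to stream the CSV into SQLite in batches so that the whole file never has to fit in memory. The sketch below uses only the Python standard library. SQLite, the table name `listings`, the three columns, and the batch size are assumptions for illustration, not decisions made by the team.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_csv_in_batches(csv_file, conn, batch_size=1000):
    """Stream rows from a CSV file object into a SQLite table in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (make TEXT, year INTEGER, price REAL)"
    )
    reader = csv.DictReader(csv_file)
    total = 0
    while True:
        # Read at most batch_size rows so memory use stays bounded.
        batch = [
            (row["make"], int(row["year"]), float(row["price"]))
            for row in islice(reader, batch_size)
        ]
        if not batch:
            break
        conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total


# In-memory stand-ins for the real file and database, with made-up rows.
demo_csv = io.StringIO(
    "make,year,price\nFord,2015,9500\nToyota,2018,15500\nFord,2012,6000\n"
)
demo_conn = sqlite3.connect(":memory:")
rows_loaded = load_csv_in_batches(demo_csv, demo_conn, batch_size=2)
avg_price_by_make = dict(
    demo_conn.execute("SELECT make, AVG(price) FROM listings GROUP BY make")
)
```

Once the data is in a table, aggregations such as the average price per make can be pushed to SQL instead of being computed in memory.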

### Integrations

- HTML for Data Visualization: HTML pages may be produced for data visualization, likely using libraries like D3.js or Plotly for interactive data visualizations. These can be integrated into the project's reports or user interfaces.


## Technical Goals and Success of the Project

The success of the project is measured by achieving these technical goals and fulfilling the associated success measurement criteria. Ultimately, the technical success contributes to the achievement of the business goals, such as increased profitability, optimized inventory management, and enhanced decision-making for the car dealership.

**1. Data Cleaning and Preprocessing:**

Goal: Ensure that the dataset is clean, complete, and ready for analysis by addressing missing values, outliers, and formatting issues.

Success Measurement: Data quality assessment and data completeness metrics. A clean dataset with minimal missing values and outliers is an indicator of success.

**2. Feature Selection and Engineering:**

Goal: Identify and select relevant features or variables that have the most significant impact on car pricing and sale time. Create new features if they enhance model performance.

Success Measurement: Feature importance analysis and selection metrics. Success is determined by identifying influential features and optimizing the feature set for modeling.
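
One possible way to carry out the feature importance analysis is permutation importance with scikit-learn. The sketch below uses synthetic data in which the price depends on `year` and `mileage` but not on `noise`. The feature names, the random forest model, and the data are all assumptions for illustration, and the importance is computed on the training data only for brevity; a held-out set would be preferable in the real analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 300
year = rng.integers(2005, 2023, size=n)
mileage = rng.uniform(5_000, 200_000, size=n)
noise = rng.normal(size=n)  # deliberately unrelated to price

# Synthetic price: newer cars cost more, higher mileage costs less.
price = 2_000 + 800 * (year - 2005) - 0.03 * mileage + rng.normal(0, 300, size=n)

feature_names = ["year", "mileage", "noise"]
X = np.column_stack([year, mileage, noise])

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)
# Shuffle each feature in turn and measure how much the model's score drops.
result = permutation_importance(model, X, price, n_repeats=5, random_state=0)

# Rank features from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
```

A feature whose shuffling barely changes the score, like `noise` here, is a candidate for removal from the feature set.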

**3. Model Development:**

Goal: Develop machine learning models that accurately predict car pricing and sale time based on the selected features.

Success Measurement: Model evaluation metrics, such as mean absolute error (MAE), mean squared error (MSE), or R-squared (R2) for pricing prediction, and time-to-sale prediction accuracy for sale time.

**4. Model Evaluation and Selection:**

Goal: Evaluate the performance of multiple machine learning models and select the best-performing model.

Success Measurement: Comparative model performance metrics. The model with the lowest error rates or highest prediction accuracy is considered successful.
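
A comparison of candidate models could be sketched with cross-validation as follows. The two candidates (linear regression and a shallow decision tree), the synthetic data, and the choice of MAE as the selection criterion are assumptions; the actual algorithms have not been decided.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
# Synthetic, nearly linear target so the expected winner is easy to reason about.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# scikit-learn reports negated MAE so that larger is always better;
# flip the sign back to get an ordinary error value.
mae_by_model = {
    name: -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    ).mean()
    for name, model in candidates.items()
}
best_model = min(mae_by_model, key=mae_by_model.get)
```

Using the same folds and the same metric for every candidate keeps the comparison fair; on this synthetic linear data the linear model should come out ahead.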

**5. Data Visualization:**

Goal: Create clear and informative data visualizations that convey insights about market trends and vehicle preferences.

Success Measurement: The effectiveness of data visualization in conveying meaningful patterns and trends to stakeholders.

**6. Business Goal Alignment:**

Goal: Ensure that the technical goals align with the broader business objectives, such as optimizing pricing, reducing inventory holding times, and enhancing customer satisfaction.

Success Measurement: The extent to which the project contributes to the achievement of business goals. For example, if the project leads to improved pricing strategies and shorter sale times, it aligns successfully with business objectives.

**7. Documentation and Reporting:**

Goal: Create comprehensive documentation and reports that communicate the findings, insights, and recommendations to stakeholders.

Success Measurement: The quality and comprehensiveness of project documentation, including clear and actionable recommendations for the car dealership.

**8. Deployment Plan:**

Goal: Develop a plan for deploying the model and making it accessible for use by the car dealership.

Success Measurement: A well-defined deployment plan, including the choice of a production server, user interface design, and data update strategies.

**9. Data Security and Compliance:**

Goal: Ensure that data security and compliance with privacy regulations are maintained throughout the project.

Success Measurement: Adherence to data security best practices and compliance with relevant data protection regulations.
