# US015 - Average park cost

## Introduction

User story 15 aims to predict the average monthly cost of water consumption for a new 55-hectare park. To do this, we fit a linear regression model in which the park area is the independent variable and the average monthly cost of water consumption is the dependent variable.

### Linear regression 

Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. 
In park management, linear regression can be used to predict costs, such as the cost of water consumption, from characteristics of the parks such as their area, which is the predictor used in this user story.

The regression model defines a linear function between the X and Y variables that best describes the relationship between the two. It is represented by the sloped line in the figure above, where the objective is to find the ‘regression line’ that best fits the individual data points.

Mathematically, this line follows the equation:

$y = mx + b$

Where:

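* $x$ = independent variable (predictor; here, the park area)

* $y$ = dependent variable (target; here, the average monthly cost)

* $m$ = slope of the line (slope is defined as the ‘rise’ over the ‘run’)

* $b$ = intercept (the value of $y$ when $x = 0$)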
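
As a minimal sketch of how the slope and intercept can be obtained, the ordinary least squares estimates can be computed directly from the data points. The function name `fit_line` and the sample values are illustrative assumptions, not the project's code or data.

```python
def fit_line(xs, ys):
    """Return (m, b) that minimise the squared error of y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # intercept
    return m, b


# Made-up points lying exactly on y = 2x + 1 (not project data).
areas = [1.0, 2.0, 3.0, 4.0]
costs = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(areas, costs)
print(f"slope={m:.3f}, intercept={b:.3f}")
```

Because the sample points lie exactly on a line, the fit recovers that line; with real data the estimates are the values that minimise the sum of squared vertical distances to the points.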

## Analysis

We read the provided data files, "water consumption updated.csv" and "Area.csv", which contain the daily water consumption and the area of each park. From these we calculate the average monthly cost of water consumption for each park, following the rules defined in US09, which covers the analysis of water consumption on a daily and monthly basis, including the associated costs. The resulting scatter plot and fitted line show the following:

1. Data Distribution:
     * The data points (blue dots) are spread across different park areas, ranging from small to large.
     * There is a general upward trend, indicating that larger parks tend to have higher average monthly costs for water consumption.

2. Regression Line:
     * The red regression line shows the linear relationship between park area and average monthly cost. The line fits the data reasonably well, capturing the overall trend of increasing cost with increasing area.
     
3. Prediction for 55-Hectare Park:
     * At an x-axis value of 55 hectares, the red line reaches a y-axis value of approximately 1934.17€, which is the predicted average monthly cost for a 55-hectare park.

4. Outliers and Variability:
     * Some data points deviate significantly from the regression line, suggesting variability in water consumption costs that the model does not fully capture.
     * The point at around 70 hectares has a notably high cost, indicating it might be an outlier or influenced by other factors not accounted for in the model.
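
The prediction and outlier checks described above can be sketched as follows. This is an illustration under stated assumptions: the data values are made up (they are not the contents of the project's CSV files), the helper names are hypothetical, and the two-standard-deviation threshold is just one common rule of thumb for flagging unusual residuals.

```python
from statistics import pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def flag_outliers(xs, ys, m, b, k=2.0):
    """Return the points whose residual exceeds k standard deviations."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals)
            if spread > 0 and abs(r) > k * spread]


# Made-up areas (hectares) and average monthly costs (euros).
areas = [10, 20, 30, 40, 50, 60, 70, 80]
costs = [400, 700, 1000, 1300, 1600, 1900, 3000, 2500]

m, b = fit_line(areas, costs)
predicted_55 = m * 55 + b   # value of the fitted line at 55 hectares
outliers = flag_outliers(areas, costs, m, b)
print(f"predicted cost at 55 ha: {predicted_55:.2f}")
print("possible outliers:", outliers)
```

In this made-up sample only the 70-hectare point is flagged, which mirrors the kind of deviation noted in the analysis. A flagged point is a candidate for closer inspection, not proof that the measurement is wrong.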

## Conclusions

1. Feasibility of Linear Adjustment:
     * The linear regression model appears to be a feasible approach for predicting the average monthly water consumption cost based on park area.
     * Despite some variability and potential outliers, the linear trend captures the general relationship between area and cost.

2. Predictive Capability:
     * The model predicts that the average monthly cost for a new 55-hectare park will be approximately 1934.17€. This prediction provides a useful estimate for budgeting purposes, allowing the park management company to anticipate future costs.

3. Model Limitations:
     * While the linear model is helpful, it does not account for all the variability in the data. There may be other factors influencing water consumption costs that are not included in the model.
     * The presence of outliers suggests that a more complex model or additional variables might improve prediction accuracy.

4. Practical Implications:
     * The park management company can use this linear model as a starting point for estimating water costs for parks of various sizes.
     * For more accurate predictions, especially for larger parks or those with unique characteristics, additional data and possibly more sophisticated modeling techniques should be considered.
    
    
# US16 - Polynomial regression

## Introduction

User story 16 uses polynomial regression to find the best-fitting curve for the given data, building on the data and results from US14, which runs tests for inputs of variable size.
The goal is to obtain a more accurate model than linear regression provides.

### Polynomial regression

Polynomial regression is a type of regression analysis used to model the relationship between a dependent variable and one or more independent variables. 
Unlike linear regression, which models this relationship as a straight line, polynomial regression models the relationship as a polynomial equation, allowing for curves and more complex relationships.

Polynomial regression fits a polynomial equation of degree $n$ to the data points. The general form of a polynomial regression equation in one variable $x$ is:

$y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \ldots + b_n x^n + \epsilon$

where:

   \begin{align*}
y & : \text{ is the dependent variable (here, the execution time).} \\
x & : \text{ is the independent variable (here, the input size).} \\
b_0, b_1, \ldots, b_n & : \text{ are the coefficients of the polynomial, which are determined during the regression process.} \\
\epsilon & : \text{ is the error term, representing the difference between the observed and predicted values.}
\end{align*} 
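
As a rough sketch of how the coefficients $b_0, \ldots, b_n$ can be determined, the least squares problem can be solved through the normal equations. The function names and sample data are illustrative assumptions, and this is not necessarily how the project computed its fit; in practice a library routine such as `numpy.polyfit` is usually preferable for numerical stability.

```python
def fit_polynomial(xs, ys, degree):
    """Return [b_0, ..., b_n] minimising the squared error of the fit."""
    n = degree + 1
    if len(xs) != len(ys) or len(xs) < n:
        raise ValueError("need at least degree + 1 matching (x, y) pairs")
    # Normal equations (X^T X) b = X^T y, where X[i][j] = xs[i] ** j.
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a lower degree")
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution on the upper-triangular system.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (rhs[r] - tail) / a[r][r]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate b_0 + b_1*x + ... + b_n*x^n using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Made-up points lying exactly on y = 1 + 2x + 3x^2 (not project data).
sizes = [0.0, 1.0, 2.0, 3.0, 4.0]
times = [1.0, 6.0, 17.0, 34.0, 57.0]
coeffs = fit_polynomial(sizes, times, degree=2)
print([round(c, 6) for c in coeffs])
```

The error term $\epsilon$ does not appear in the code: it corresponds to whatever difference remains between each observed value and `evaluate(coeffs, x)` after the fit.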

## Analysis

1. Polynomial Regression:

    * The red curve fits the blue data points very closely, capturing the underlying pattern and trend of the data, which indicates that the polynomial model is effective in describing the relationship between input size and execution time.
    * The close alignment of the curve with the data points suggests that the polynomial regression has successfully modeled the data, providing accurate predictions across the range of input sizes.

2. Model Performance:

    * The Mean Squared Error of 0.09% indicates that the average squared difference between the observed values and the predicted values is very small. 
    * The R-squared value of 99.15% means that the model accounts for nearly all the variability in the execution time data. 
    
The high R-squared and low Mean Squared Error values confirm the model's effectiveness and reliability.

3. Prediction and Analysis:

    * Given the high accuracy and excellent fit of the polynomial regression model, it can be confidently used to predict execution times for different input sizes.
    * The model's performance metrics and the visual fit suggest that polynomial regression is an appropriate method for this dataset, capturing the non-linear relationship between input size and execution time more effectively than a linear model would.
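
For reference, the two metrics discussed above can be computed as in the following sketch. The function names and sample values are illustrative assumptions. The report quotes both metrics as percentages; an R-squared of 99.15% corresponds to 0.9915 on the 0 to 1 scale used here, while how the Mean Squared Error was expressed as a percentage is not stated, so the code returns the plain, unscaled value.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Share of the variance in the observed values explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R-squared is undefined")
    return 1 - ss_res / ss_tot


# Made-up observed and predicted values (not project data).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
mse = mean_squared_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"MSE={mse:.4f}, R-squared={r2:.4f}")
```

Both metrics here are computed on the same points used for fitting, so they describe how well the curve matches the observed data rather than how it would perform on unseen input sizes.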
    
## Conclusions

   * The polynomial regression analysis demonstrates a highly accurate and reliable model for predicting execution time based on input size. 
   * The low Mean Squared Error and high R-squared values reflect the model's strong performance, while the fitted curve visually confirms its effectiveness. 
   * This model can thus be utilized for precise predictions, aiding in planning and decision-making processes related to execution time based on input size. 
   * The analysis shows that polynomial regression, with its capacity to fit complex patterns, is well-suited for this type of data.

# Self Assessment:
   - 1230481: 26%
   - 1221018: 17%
   - 1230929: 19%
   - 1231151: 19%
   - 1231170: 19%