# COMP2006: Group Project

## Requirements

To complete this project, you need to collect and process data, train and evaluate at least two machine learning models, and finally deploy them to a website. Details follow:

> Python (or a Python package) should be used wherever possible!

**Data Collection**

Data can be collected (legally) from anywhere: data that you already have; sites that allow you to download data, for example, [UCI Machine Learning Repository](https://archive.ics.uci.edu/) or [Kaggle Datasets](https://www.kaggle.com/datasets); web scraping; or an API. We restrict ourselves to tabular data, that is, data that would fit nicely into a spreadsheet. The content and amount of data are not the main consideration, as long as the data has:
- at least 10 variables
- three or more data types
- two or more problems: missing data, inconsistencies, errors, categorical data that needs to be converted to numeric, entries like text that need to be converted into proper features, etc. 
- if the data does not have enough problems, you can substitute feature engineering (creating new features from the original features) for one of the problems
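
As a rough aid, the checklist above can be turned into a quick automated check. The sketch below assumes the dataset is loaded into a pandas DataFrame; `check_dataset` and the three-column `demo` frame are hypothetical names and made-up values for illustration, and counting pandas dtypes is only a proxy for "data types".

```python
import numpy as np
import pandas as pd


def check_dataset(df: pd.DataFrame) -> dict:
    """Summarise a DataFrame against the dataset checklist (hypothetical helper)."""
    # Distinct pandas dtypes serve as a rough proxy for "data types".
    dtype_names = sorted({str(dtype) for dtype in df.dtypes})
    return {
        "n_variables": df.shape[1],
        "enough_variables": df.shape[1] >= 10,
        "dtypes": dtype_names,
        "enough_dtypes": len(dtype_names) >= 3,
        # Columns with at least one missing value count as one "problem".
        "columns_with_missing": [c for c in df.columns if df[c].isna().any()],
        # Non-numeric columns will need converting before modelling.
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Made-up rows, far too small to pass the 10-variable requirement.
demo = pd.DataFrame({
    "make": ["Acura", "Audi", "BMW"],
    "cylinders": [4, 6, 4],
    "engine_size": [1.5, np.nan, 2.0],
})
report = check_dataset(demo)
print(report)
```

A check like this only flags missing values and non-numeric columns; inconsistencies and errors still need to be found by inspecting the data.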

We are not concerned with acquiring *huge* datasets or creating super accurate models, but rather with the process of building a proper machine learning pipeline and, for any model deployed, having a reliable estimate of its performance. That said, some effort should go into improving on an initial model.

How many datasets do you need?
 - Groups of 2 need **two** datasets
 - Group of 3 needs **three** datasets

**Database**

After the data has been collected and processed, it should be stored in a SQLite database. At a minimum, each dataset should have its own table. Creating the database and tables and inserting the data can be done with either the `sqlite3` package or `pandas`. Both SQLite and `sqlite3` come with Python.
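
A minimal sketch of the `sqlite3` route is shown below. The table name `fuel_ratings`, its columns, and the three rows are made up for illustration, and the in-memory database is used only to keep the example self-contained; a project would connect to a file in its database folder instead.

```python
import sqlite3

# Hypothetical rows standing in for one processed dataset.
rows = [
    ("Acura", 1.5, 4, 8.1),
    ("Audi", 2.0, 4, 9.4),
    ("BMW", 3.0, 6, 11.2),
]

# ":memory:" keeps the sketch self-contained; a project would pass a file path.
conn = sqlite3.connect(":memory:")
with conn:  # commits on success, rolls back on error
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS fuel_ratings (
            make TEXT NOT NULL,
            engine_size_l REAL,
            cylinders INTEGER,
            city_l_per_100km REAL
        )
        """
    )
    # Parameterised inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO fuel_ratings VALUES (?, ?, ?, ?)", rows)

# Pull a small sample back out, e.g. for display on an About page.
sample = conn.execute(
    "SELECT * FROM fuel_ratings ORDER BY rowid LIMIT 2"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM fuel_ratings").fetchone()[0]
print(row_count, sample)
conn.close()
```

With pandas, `DataFrame.to_sql` and `pandas.read_sql` can play the same roles once a connection is open.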

**Machine Learning Models**

For each dataset you should train, evaluate, and save a machine learning model:
- one model should be for a *classification* problem
- the other model should be for a *regression* problem
- Group of 3: two of your three models should be of one type and the third of the other

A *validation* dataset must be used either to select between models or to choose between hyperparameter values for a single model.

A *test* dataset must be used to evaluate the performance of the final chosen model. 
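
One way to meet both requirements is a three-way split, sketched below with scikit-learn. The synthetic data, the 60/20/20 proportions, the candidate values of `k`, and the choice to refit on train plus validation before testing are all assumptions of this example rather than project requirements.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data stands in for a processed dataset (assumption for this sketch).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out 20% as the test set, then split the rest 75/25 into train/validation,
# giving 60% train, 20% validation, 20% test overall.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

# Use the validation set only to choose the hyperparameter.
val_mae = {}
for k in (1, 3, 5, 10, 20):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_mae[k] = mean_absolute_error(y_val, model.predict(X_val))
best_k = min(val_mae, key=val_mae.get)

# Refit on train + validation, then evaluate on the test set exactly once.
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_temp, y_temp)
test_mae = mean_absolute_error(y_test, final_model.predict(X_test))
print(f"best k = {best_k}, test MAE = {test_mae:.2f}")

# Serialise the final model; a project would write these bytes to a file
# in its models folder so the website can load the model later.
model_bytes = pickle.dumps(final_model)
```

Because the test set plays no part in choosing `k`, the test MAE is a fair estimate of how the deployed model should perform on new data.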

**Website**

The final models should be presented to an end user through a website (deployment need only be to *localhost*). The website must be built with a Python web framework, e.g., *Flask*, *Django*, or *Streamlit*.

The website should have:
 - a *Welcome* page that describes your project
 - an *About* page for each dataset that provides:
     - the source of the dataset
     - definition of each variable in the dataset
     - a view of a sample of the dataset used for training (pulled from the database)
 - a page for each machine learning model that:
    - identifies the model being used, with a brief description
    - allows the end-user to enter their own data to get a prediction
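
As an illustration of the model page, here is a minimal Flask sketch. The route names, the form fields, and the `predict_city_consumption` function (a made-up formula standing in for a saved model) are all hypothetical; a real page would load the trained model and render proper templates.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_city_consumption(engine_size: float, cylinders: int) -> float:
    """Placeholder for a saved model's predict call (made-up formula)."""
    return round(4.0 + 1.5 * engine_size + 0.4 * cylinders, 1)


@app.route("/")
def welcome():
    # Welcome page: a short description of the project goes here.
    return "<h1>Welcome</h1><p>Describe the project here.</p>"


@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        try:
            engine_size = float(request.form["engine_size"])
            cylinders = int(request.form["cylinders"])
        except (KeyError, ValueError):
            # Missing or non-numeric input: report it instead of crashing.
            return jsonify(error="engine_size and cylinders must be numbers"), 400
        return jsonify(prediction=predict_city_consumption(engine_size, cylinders))
    # GET: show a bare-bones input form.
    return (
        '<form method="post">'
        'Engine size (L): <input name="engine_size"> '
        'Cylinders: <input name="cylinders"> '
        '<button type="submit">Predict</button></form>'
    )
```

Assuming the file is saved as `app.py`, the site can be served on localhost with Flask's development server, for example `flask --app app run`.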

**Readme.md**

This file should present the reader with a basic description of your project and how they can use it. 

**Requirements.txt**

This file should list every package needed to run your code, so that a user can install them all with the command: `pip install -r requirements.txt`


## Structure

- All project-related code is in a single GitHub repository
- The repository contains only FINAL code
- The repository structure is
    - main folder
        - data collection 
        - data processing 
        - database
        - models
            - model 1
            - model 2
            - model 3 (if required)
        - website
        - Readme.md
        - requirements.txt
- Each subfolder should be logically organized

## Submission

Submission consists of uploading a link to the GitHub repository containing all the code for your project. There should be one submission per group.

Example: `https://github.com/markcassar/COMP2006_project_Group_8`



## Security
Using an API is still allowed, just not required. If you choose to use an API, then 

> please DO NOT include your API Key in your GitHub repository

You will be creating a public GitHub repository for your project, which means anyone can access and use any code or data that is in it. Many API providers require you to register and create an API Key. When accessing data through the API, you must authenticate with your API Key before an API call will return any data.

To keep your API Key(s) safe, please do the following:
 - create a `credentials.py` file that stores the value of your key(s) in Python variables
 - add `credentials.py` to the `.gitignore` file of your repository so that Git does not track this file and it is never pushed to GitHub
 - in your code, you can access your keys via import:
 
 ```python
 import credentials
 
 weather_api_key = credentials.weather_api_key
 ```
In this way, anyone accessing your GitHub repository will not be able to access your personal API account.

```python
# Imports needed to run this code
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Read the CSV file
fuel_read_file = pd.read_csv("./dataset/my2024-fuel-consumption-ratings.csv")

# Features and target
X = fuel_read_file[['Engine size (L)', 'Cylinders', 'Highway (L/100 km)']]
y = fuel_read_file['City (L/100 km)']

# Fit a 5-nearest-neighbours regressor on the full dataset
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X, y)

# Predict city consumption for one new vehicle
new_data = pd.DataFrame({
    'Engine size (L)': [2.0], 'Cylinders': [4], 'Highway (L/100 km)': [6.5]
})
predicted_city_consumption = regressor.predict(new_data)
print(f"Predicted City Fuel Consumption: {predicted_city_consumption[0]:.1f} L/100 km")
```

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Read the dataset
read_file = pd.read_csv("../dataset/my2024-fuel-consumption-ratings.csv")

# Snapshot of the numeric columns, with any existing gaps filled with 0.
# The model below is trained on this snapshot, so it never sees the
# missing values or category codes introduced further down.
dataframe_numeric_features = read_file.select_dtypes(include=['int64', 'float64']).copy()
dataframe_numeric_features = dataframe_numeric_features.fillna(0)

# Create missing values in all numeric columns (5% of rows per column)
missing_percentage = 0.05
num_missing = int(len(read_file) * missing_percentage)
for column in read_file.select_dtypes(include=['int64', 'float64']).columns:
    # Cast to float first so that integer columns can hold NaN
    read_file[column] = read_file[column].astype('float64')
    missing_indices = read_file[column].sample(n=num_missing).index
    read_file.loc[missing_indices, column] = np.nan

# Convert string columns to ordered categoricals
for col in read_file.columns:
    if is_string_dtype(read_file[col]):
        read_file[col] = read_file[col].astype('category').cat.as_ordered()

# Replace each categorical column with its integer codes, starting at 1
# (isinstance check replaces the deprecated is_categorical_dtype)
for col in read_file.columns:
    if isinstance(read_file[col].dtype, pd.CategoricalDtype):
        read_file[col] = read_file[col].cat.codes + 1

# Define target and features
target = 'City (L/100 km)'
X = dataframe_numeric_features.drop(columns=[target])
Y = dataframe_numeric_features[target]

# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train the forest several times and average the test MAE over the runs
num_runs = 10
mae_scores = []
for _ in range(num_runs):
    random_forest = RandomForestRegressor(n_estimators=150)
    random_forest.fit(X_train, Y_train)
    Y_pred = random_forest.predict(X_test)
    mae_scores.append(mean_absolute_error(Y_test, Y_pred))

mean_mae = np.mean(mae_scores)

# Print evaluation metrics
print(f"Average MAE of {num_runs} runs:", mean_mae)
print(read_file.head(60))
```



Average MAE of 10 runs: 0.16649397254397671
    Model year  Make  Model  Vehicle class  Engine size (L)  Cylinders  \
0       2024.0     1    308              2              1.5        4.0   
1       2024.0     1    308              2              1.5        4.0   
2       2024.0     1    309              2              2.0        NaN   
3       2024.0     1    353              8              3.5        6.0   
4       2024.0     1    354              9              3.0        6.0   
5       2024.0     1    430              8              2.0        4.0   
6       2024.0     1    431              8              2.0        4.0   
7       2024.0     2    272              3              2.0        4.0   
8       2024.0     2    273              3              2.0        NaN   
9       2024.0     2    274              3              2.9        6.0   
10      2024.0     2    505              8              2.0        4.0   
11      2024.0     2    506              8              2.0        4