# COMP2006: Group Project

## Requirements

To complete this project, you need to collect and process data, train and evaluate at least two machine learning models, and then deploy them to a website. Details follow:

> Python (or a Python package) is to be used wherever possible!

**Data Collection**

Data can be collected (legally) from anywhere: data that you already have; sites that allow you to download data, for example, [UCI Machine Learning Repository](https://archive.ics.uci.edu/) or [Kaggle Datasets](https://www.kaggle.com/datasets); web scraping; or an API. Restrict yourself to data that would fit nicely into a spreadsheet. The content and amount of data are not the main consideration, as long as the data has:
- at least 10 variables
- three or more data types
- two or more problems: missing data, inconsistencies, errors, categorical data that needs to be converted to numeric, text entries that need to be converted into proper features, etc.
- if the data does not have enough problems, you can substitute feature engineering (creating new features from the original features) for one of the problems
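
A quick way to check a candidate dataset against the checklist above is to summarise it with pandas. The sketch below is only illustrative: the helper name `summarise_dataset` and the toy columns are hypothetical, and counting distinct pandas dtypes is an assumed stand-in for "three or more data types", which may be meant more loosely (for example numeric, categorical, text, date).

```python
import numpy as np
import pandas as pd


def summarise_dataset(df: pd.DataFrame) -> dict:
    """Summarise a DataFrame against the data-collection checklist (hypothetical helper)."""
    # Columns that are neither numeric nor boolean will need encoding before modelling.
    non_numeric = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    return {
        "n_variables": df.shape[1],
        "n_dtypes": len({str(dtype) for dtype in df.dtypes}),
        "n_missing": int(df.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
    }


# Toy data for illustration only; a real dataset needs at least 10 variables.
toy = pd.DataFrame(
    {
        "year": [1960, 1964, 1968],
        "home_score": [0.0, np.nan, 2.0],
        "team": ["Italy", "Spain", None],
    }
)
report = summarise_dataset(toy)
print(report)
```

The counts only flag candidates for cleaning; inconsistencies and errors still need to be found by inspecting the values themselves.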

We are not concerned with acquiring *huge* datasets or creating super accurate models, but more with the process of creating a proper pipeline for machine learning and, for any model deployed, having a reliable estimate of its performance. That said, some effort should go into improving on an initial model.

How many datasets do you need?
 - Groups of 2 need **two** datasets
 - Groups of 3 need **three** datasets

**Database**

After the data has been collected and processed, it should be stored in a SQLite database. At a minimum, each dataset should have its own table. Database creation, table creation, and data insertion can be done either with the `sqlite3` package or with `pandas`. Both SQLite and `sqlite3` come with Python.
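
As a minimal sketch of the pandas route, the example below writes a small DataFrame to a table and reads a sample back. The table name `matches` and the columns are hypothetical, and an in-memory database is used so the example leaves no file behind; in the project you would pass a file path inside your database folder instead.

```python
import sqlite3

import pandas as pd

# Hypothetical processed dataset; the column names are placeholders.
matches = pd.DataFrame(
    {
        "year": [1960, 1964, 1968],
        "home_team": ["France", "Spain", "Italy"],
        "home_score": [4, 2, 2],
    }
)

# ":memory:" keeps the example self-contained; use a file path in the project.
conn = sqlite3.connect(":memory:")

# pandas creates the table (one table per dataset) and inserts the rows.
matches.to_sql("matches", conn, if_exists="replace", index=False)

# Read a sample back with pandas, and count rows with plain sqlite3.
sample = pd.read_sql_query("SELECT * FROM matches LIMIT 2", conn)
row_count = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]

conn.close()
print(sample)
print(row_count)
```

The same `SELECT ... LIMIT` query is one way to pull the sample of training data that the website's *About* page needs.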

**Machine Learning Models**

For each dataset you should train, evaluate, and save a machine learning model:
- one model should be for a *classification* problem
- the other model should be for a *regression* problem
- Groups of 3: the third model may be of either type, so you will have two of one type

A *validation* dataset must be used to either select between models, or to choose between hyperparameter values of a single model. 

A *test* dataset must be used to evaluate the performance of the final chosen model. 
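
One possible way to organise the validation and test steps is sketched below with scikit-learn. Everything specific here is an assumption for illustration: the synthetic data from `make_classification`, the 60/20/20 split, the candidate `max_depth` values, and the use of `pickle` to save the model (`joblib` is a common alternative).

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a processed dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out the test set first, then carve a validation set out of the remainder.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0
)

# Use the validation set to choose between hyperparameter values.
best_depth, best_score = None, -1.0
for depth in (2, 5, 10):
    candidate = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    score = accuracy_score(y_valid, candidate.predict(X_valid))
    if score > best_score:
        best_depth, best_score = depth, score

# Refit the chosen configuration on train + validation, then touch the test set once.
final_model = RandomForestClassifier(n_estimators=50, max_depth=best_depth, random_state=0)
final_model.fit(X_dev, y_dev)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print("Chosen max_depth:", best_depth)
print("Test accuracy:", test_accuracy)

# Save the final model; here it is serialised to bytes and restored as a check.
saved = pickle.dumps(final_model)
restored = pickle.loads(saved)
```

The test set is used only after the model has been chosen, so the reported accuracy is not biased by the selection step.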

**Website**

The final models should be presented to an end-user through a website (deployment need only be to *localhost*). The website must be built using a Python "web framework", e.g., *Flask*, *Django*, or *Streamlit*.

The website should have:
 - a *Welcome* page that describes your project
 - an *About* page for each dataset that provides:
     - the source of the dataset
     - definition of each variable in the dataset
     - a view of a sample of the dataset used for training (pulled from the database)
 - a page for each machine learning model that:
    - identifies the model being used, with a brief description
    - allows the end-user to enter their own data to get a prediction
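
As a rough sketch of a model page, the Flask example below renders a form, reads the submitted values, and shows a prediction. It assumes Flask is installed; the route `/model`, the form fields, and the rule-based `predict_outcome` function are hypothetical placeholders for a call to your own saved model.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline template to keep the sketch in one file; a real project would use template files.
PAGE = """
<h1>Match outcome model</h1>
<p>Placeholder description of the model goes here.</p>
<form method="post">
  Home score <input name="home_score" type="number" value="0">
  Away score <input name="away_score" type="number" value="0">
  <button type="submit">Predict</button>
</form>
{% if prediction is not none %}<p>Prediction: {{ prediction }}</p>{% endif %}
"""


def predict_outcome(home_score: float, away_score: float) -> str:
    """Hypothetical stand-in for loading a saved model and calling model.predict."""
    if home_score > away_score:
        return "home win"
    if home_score < away_score:
        return "away win"
    return "draw"


@app.route("/model", methods=["GET", "POST"])
def model_page():
    prediction = None
    if request.method == "POST":
        # Convert the submitted form values before passing them to the model.
        home = float(request.form["home_score"])
        away = float(request.form["away_score"])
        prediction = predict_outcome(home, away)
    return render_template_string(PAGE, prediction=prediction)


# Exercise the page without starting a server.
client = app.test_client()
response = client.post("/model", data={"home_score": "2", "away_score": "1"})
print(response.status_code)
```

To serve the page on localhost, save the code as a module and start it with Flask's development server, for example `flask --app <module name> run`.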

**Readme.md**

This file should present the reader with a basic description of your project and how they can use it. 

**Requirements.txt**

This file contains all packages necessary to run your code. This file should allow the user to install all necessary packages via the command: `pip install -r requirements.txt`


## Structure

- All project-related code is in a single GitHub repository
- The repository contains only FINAL code
- The repository structure is
    - main folder
        - data collection 
        - data processing 
        - database
        - models
            - model 1
            - model 2
            - model 3 (if required)
        - website
        - Readme.md
        - requirements.txt
- Each subfolder should be logically organized

## Submission

Submission consists of uploading a link to the GitHub repository containing all the code for your project. There should be one submission per group.

Example: `https://github.com/markcassar/COMP2006_project_Group_8`



## Security
Using an API is still allowed, just not required. If you choose to use an API, then 

> please DO NOT include your API Key in your GitHub repository

You will be creating a public GitHub repository for your project, which means anyone can access and use any code or data that is in it. Many API providers require you to register and create an API Key. When accessing data through the API, you need to authenticate using your API Key before any data will be returned from an API call.

To keep your API Key(s) safe, please do the following:
 - create a `credentials.py` file that stores the value of your key(s) in Python variables
 - add `credentials.py` to the `.gitignore` file of your repository so that Git does not track the file and it is never pushed to GitHub
 - in your code, you can access your keys via import:
 
 ```python
 import credentials
 
 weather_api_key = credentials.weather_api_key
 ```
In this way, anyone accessing your GitHub repository will not be able to access your personal API account.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import sqlite3
# Load the dataset and open the database connection
euros_read_file = pd.read_csv("../dataset/euros.csv")
conn = sqlite3.connect('../database/pd_data.db')

# Define the target variable
target = 'Winning Team'
 
# Drop the 'Date' column
euros_read_file.drop(columns=['Date'], inplace=True)
 
# Create missing values in all numeric columns
missing_percentage = 0.05  # 5% of the rows are set to NaN in each numeric column
for column in euros_read_file.select_dtypes(include=['int64', 'float64']).columns:
    num_missing = int(len(euros_read_file) * missing_percentage)
    missing_indices = euros_read_file[column].sample(n=num_missing).index
    euros_read_file.loc[missing_indices, column] = np.nan
 
 
dataframe_numeric_features = euros_read_file.select_dtypes(include=['int64', 'float64'])
 
# Splitting data into features and target.
X = dataframe_numeric_features
y = euros_read_file[target]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Use a RandomForestClassifier with 150 decision trees
model = RandomForestClassifier(n_estimators=150, random_state=42)
model.fit(X_train, y_train)
 
# Make predictions on the testing set
y_pred = model.predict(X_test)
 
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
 
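# Display the first 40 rows of the numeric features (including the injected NaNs)
print(dataframe_numeric_features.head(40))

# Store the dataset in the SQLite database, then release the connection
euros_read_file.to_sql('pd_euros', conn, if_exists='replace', index=False)
conn.close()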



Accuracy: 0.16176470588235295
      Year  Home Score  Away Score
0   1960.0         0.0         3.0
1   1960.0         4.0         5.0
2   1960.0         0.0         2.0
3   1960.0         2.0         1.0
4   1964.0         0.0         3.0
5   1964.0         2.0         NaN
6   1964.0         3.0         1.0
7   1964.0         2.0         1.0
8   1968.0         0.0         NaN
9   1968.0         0.0         0.0
10     NaN         NaN         0.0
11  1968.0         1.0         1.0
12  1968.0         2.0         0.0
13  1972.0         1.0         2.0
14  1972.0         NaN         1.0
15  1972.0         2.0         1.0
16  1972.0         3.0         0.0
17  1976.0         3.0         1.0
18  1976.0         2.0         4.0
19  1976.0         2.0         3.0
20  1976.0         2.0         2.0
21  1980.0         0.0         1.0
22  1980.0         1.0         NaN
23  1980.0         1.0         1.0
24  1980.0         0.0         0.0
25  1980.0         3.0         2.0
26  1980.0         1.0   

In [10]:
import sqlite3

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from pandas.api.types import is_string_dtype

# is_categorical_dtype is deprecated, so categorical columns are detected
# with isinstance(..., pd.CategoricalDtype) further down.

# Open the database connection and load the dataset
conn = sqlite3.connect('../database/pd_data.db')
read_file1 = pd.read_csv("../dataset/euros.csv")
 
# Define the target variable
target = 'Winning Team'
# dataframe_numeric_features = read_file1.select_dtypes(include=['int64', 'float64'])
 
# Normalize missing values
read_file1.fillna(np.nan, inplace=True)

# print(read_file1)
"""
The category codes start with 0 and the code for “not a number” (nan) is -1. To bring everything into the range 0 or above, we add one to the category code. (Sklearn has equivalent functionality in its OrdinalEncoder transformer but it can't handle object columns with both integers and strings, plus it's get an error for missing values represented as np.nan.)
"""
# Number of unique values in first_shooter (NaN counts as one of them)
print(len(read_file1['first_shooter'].unique()))

# How many times each country took the first shot
first_shooter_counts = read_file1['first_shooter'].value_counts()
print(first_shooter_counts)

# To label encode a categorical variable, convert the column to an ordered
# categorical and then replace the strings with the associated category codes
# (which pandas computes automatically).
read_file1['first_shooter'] = read_file1['first_shooter'].astype('category').cat.as_ordered()
read_file1['first_shooter'] = read_file1['first_shooter'].cat.codes + 1
# first_shooter looks like a nominal variable, so label encoding its 12 unique
# values is a first attempt; a different encoding may suit it better.

# Convert every string column to an ordered categorical
for col in read_file1.columns:
    if is_string_dtype(read_file1[col]):
        read_file1[col] = read_file1[col].astype('category').cat.as_ordered()

# Replace every categorical column with its integer codes (+1 so that NaN maps to 0)
for col in read_file1.columns:
    if isinstance(read_file1[col].dtype, pd.CategoricalDtype):
        read_file1[col] = read_file1[col].cat.codes + 1

# List all columns to check what has been encoded alongside first_shooter
allfeatures = read_file1.columns.tolist()


# print(allfeatures)
# Check value counts of each class
# print(read_file1.head(30))
 
# Separate features and target variable
X = read_file1.drop(target, axis=1)  # Features
y = read_file1[target]  # Target variable
 

# Checking data types of all columns
print(read_file1.dtypes)

# Shuffle, then hold out a test set (15%) and a validation set (15% of the remainder).
# Note: these splits are not used by the cross-validation below, which runs on the full X and y.
read_file1 = read_file1.sample(frac=1)
read_file1_dev, read_file1_test = train_test_split(read_file1, test_size=0.15)
read_file1_train, read_file1_valid = train_test_split(read_file1_dev, test_size=0.15)
"""
A slight variation on this procedure is called k-fold cross validation and splits the dataset into k chunks of equal size. We train the model on k-1 chunks and test it on the other, repeating the procedure k times so that every chunk gets used as a validation set.
"""
# Source: https://mlbook.explained.ai/bulldozer-testing.html
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of a random forest; print the mean accuracy across folds
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
scores = cross_val_score(rf, X, y, cv=5)
print(scores.mean())



print(read_file1.head(30))

12
first_shooter
Italy             6
England           3
Spain             3
Netherlands       2
Switzerland       2
Czechoslovakia    1
Denmark           1
France            1
Sweden            1
Croatia           1
Portugal          1
Name: count, dtype: int64
Year               int64
Date               int16
Home Team           int8
Away Team           int8
Home Score       float64
Away Score       float64
Shootout            bool
Tournament          int8
City                int8
Country             int8
Neutral Venue       bool
Winning Team        int8
first_shooter       int8
Losing Team         int8
dtype: object




0.40079016681299384
     Year  Date  Home Team  Away Team  Home Score  Away Score  Shootout  \
51   1988    36          8         31         2.0         3.0     False   
10   1968     8          9         26         2.0         0.0     False   
281  2016   161         12         16         1.0         1.0      True   
29   1980    21         13         12         0.0         0.0     False   
268  2016   155         15          2         2.0         1.0     False   
92   1996    61          9         27         2.0         0.0     False   
303  2021   172         18          2         2.0         0.0     False   
127  2000    78          3         34         0.0         2.0     False   
112  2000    70         11          8         3.0         0.0     False   
87   1996    58         33          5         0.0         1.0     False   
306  2021   173         31         29         1.0         0.0     False   
197  2008   117         22         12         2.0         3.0     False   
242  

In [4]:
euros_read_file = pd.read_csv("../dataset/euros.csv")

In [5]:


# Replace the injected missing values with 0
dataframe_numeric_features = dataframe_numeric_features.fillna(0)

print(dataframe_numeric_features)

       Year  Home Score  Away Score
0    1960.0         0.0         3.0
1    1960.0         4.0         5.0
2    1960.0         0.0         2.0
3    1960.0         2.0         1.0
4    1964.0         0.0         3.0
..      ...         ...         ...
332  2021.0         1.0         2.0
333  2021.0         0.0         4.0
334  2021.0         1.0         1.0
335  2021.0         2.0         0.0
336  2021.0         1.0         1.0

[337 rows x 3 columns]


In [6]:
import pandas as pd
from pandas.api.types import is_string_dtype

# is_categorical_dtype is deprecated in recent pandas releases and emits a
# warning on every call, so the dtype is checked with isinstance instead.

# euros_read_file is the DataFrame loaded from euros.csv in an earlier cell

# Convert every string column to an ordered categorical
for col in euros_read_file.columns:
    if is_string_dtype(euros_read_file[col]):
        euros_read_file[col] = euros_read_file[col].astype('category').cat.as_ordered()

# Replace every categorical column with its integer codes (+1 so that NaN maps to 0)
for col in euros_read_file.columns:
    if isinstance(euros_read_file[col].dtype, pd.CategoricalDtype):
        euros_read_file[col] = euros_read_file[col].cat.codes + 1

# Check unique values before and after conversion




In [7]:
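# Print the encoded DataFrame (the original accessed .info without calling it,
# which only echoed the bound method together with this same table)
print(euros_read_file)

     Year  Date  Home Team  Away Team  Home Score  Away Score  Shootout  \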
0    1960     1          7         26         0.0         3.0     False   
1    1960     1         11         37         4.0         5.0     False   
2    1960     2         11          7         0.0         2.0     False   
3    1960     3         25         37         2.0         1.0     False   
4    1964     4          8         26         0.0         3.0     False   
..    ...   ...        ...        ...         ...         ...       ...   
332  2021   184          6          8         1.0         2.0     False   
333  2021   184         34          9         0.0         4.0     False   
334  2021   185         16         31         1.0         1.0      True   
335  2021   186          9          8         2.0         1.0     False   
336  2021   187          9         16         1.0         1.0      True   

     Tournament  City  Country  Neutral Venue  Winning Team first_s

In [8]:
# Re-train on the zero-filled numeric features with the label-encoded target
target = 'Winning Team'
X = dataframe_numeric_features
y = euros_read_file[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use a RandomForestClassifier with 150 decision trees
model = RandomForestClassifier(n_estimators=150, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Re-evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)



Accuracy: 0.14705882352941177


In [9]:
# import pandas as pd

# fuel_read_file = pd.read_csv("my2024-fuel-consumption-ratings 1.csv")

# print(fuel_read_file)