# COMP 352 - Final Project: Footballer Market Evaluation

**Authors:** Zevin Attisha, Santi Guerrero, Santiago Pedetti, Christian Vaikona

#### Full dataset: https://www.kaggle.com/datasets/davidcariboo/player-scores/data?select=appearances.csv

## Table of Contents:
* [Environment Setup](#env-setup)
* [Data Importing and Pre-processing](#data-importing)
* [Data Analysis and Visualization](#data-vis)
* [Data Analytics](#data-analytics)

## Environment Setup <a class="anchor" id="env-setup"></a>


First, set up the environment so that all required modules are installed. There are two ways to do this. The first is to install all modules from a ```.yml``` file via ```conda```.

To do this, run:
```bash
conda env create -f env_setup/data_environment.yml
```
Then activate the environment:
```bash
conda activate data_env
```


Alternatively, use the ```requirements.txt``` file to install the modules via ```pip```.

To do this, first create and activate your environment:
```bash
# Include python so the new environment gets its own interpreter and pip
conda create -n my_data_env python
conda activate my_data_env
```

You may need to upgrade ```pip```, ```setuptools```, and ```wheel``` first. To do this, run the following (you may need to change ```pip3``` to ```pip```):
```bash
pip3 install --upgrade pip setuptools wheel
```

Then install the requirements:
```bash
pip3 install -r env_setup/requirements.txt
```
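
After either install route, a quick check can confirm that the active interpreter actually sees the modules the notebook imports. The snippet below is a minimal sketch, not part of the project: `find_missing_modules` is a hypothetical helper name, and the module list is an assumption based on the top-level packages in the notebook's import cell plus `kagglehub`.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names the active interpreter cannot locate."""
    return [
        name for name in module_names if importlib.util.find_spec(name) is None
    ]


# Assumed list: top-level packages imported by the notebook's first cells.
REQUIRED_MODULES = [
    "pandas", "numpy", "matplotlib", "seaborn", "sklearn", "scipy",
    "xgboost", "lightgbm", "folium", "branca", "geopandas", "kagglehub",
]

if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running it inside the activated environment lists anything that still needs installing; an empty result suggests the setup worked, although it does not verify package versions.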

## Data Importing and Pre-processing <a class="anchor" id="data-importing"></a>

Section Overview:

- Import dataset and describe characteristics such as dimensions, data types, file types, and import methods used
- Clean, wrangle, and handle missing data, duplicate data, etc.
- Encode any categorical variables
- Perform feature engineering on the dataset
- Transform data appropriately using techniques such as aggregation, normalization, and feature construction
- Reduce redundant data and perform need-based discretization

```python
# Import the libraries needed
import warnings

import branca
import folium
import geopandas as gpd
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from branca.element import Figure
from folium import Marker
from folium.plugins import HeatMap
from scipy.special import boxcox1p
from scipy.stats import norm, probplot, skew
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

# NOTE: both placeholder imports below bind the name `hello`,
# so the second one shadows the first.
from utils.model_scripts import hello
from utils.metric_scripts import hello
from utils.data_validation import check_first_field

# Show every column when displaying a DataFrame
pd.set_option("display.max_columns", None)

# Silence warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=FutureWarning, module="pandas.*")

# IPython magic: render plots inline in the notebook
%matplotlib inline
```

```python
# Load the player valuations table from the local raw/ directory
valuations_df = pd.read_csv("raw/player_valuations.csv")
```

```python
# Import the dataset via the Kaggle API

# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "game_lineups.csv"

# Load the latest version
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "davidcariboo/player-scores",
    file_path,
    # Provide any additional arguments like
    # sql_query or pandas_kwargs. See the
    # documentation for more information:
    # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print(df.head())
```

```text
Downloading from https://www.kaggle.com/api/v1/datasets/download/davidcariboo/player-scores?dataset_version_number=602&file_name=game_lineups.csv...

                    game_lineups_id        date  game_id  player_id  club_id  \
0  b2dbe01c3656b06c8e23e9de714e26bb  2013-07-27  2317258       1443      610   
1  b50a3ec6d52fd1490aab42042ac4f738  2013-07-27  2317258       5017      610   
2  7d890e6d0ff8af84b065839966a0ec81  2013-07-27  2317258       9602     1090   
3  8c355268678b9bbc7084221b1f0fde36  2013-07-27  2317258      12282      610   
4  76193074d549e5fdce4cdcbba0d66247  2013-07-27  2317258      25427     1090   

         player_name             type            position number  team_captain  
0  Christian Poulsen      substitutes  Defensive Midfield      5             0  
1   Niklas Moisander  starting_lineup         Centre-Back      4             0  
2    Maarten Martens      substitutes         Left Winger     11             0  
3        Daley Blind  starting_lineup           Left-Back     17             0  
4        Roy Beerens  starting_lineup        Right Winger     23             0  
```
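
The pre-processing steps listed in the section overview could be sketched on these five rows as follows. This is an illustrative sketch, not the project's final pipeline: `preprocess_lineups`, `is_starter`, and `position_code` are hypothetical names, the sample frame copies the `df.head()` rows so the block runs without a download, and treating `game_lineups_id` as a unique key is an assumption. Categorical encoding here uses pandas category codes rather than the `LabelEncoder` imported earlier.

```python
import pandas as pd

# Sample rows copied from the df.head() output; the full file is much larger.
sample = pd.DataFrame(
    {
        "game_lineups_id": [
            "b2dbe01c3656b06c8e23e9de714e26bb",
            "b50a3ec6d52fd1490aab42042ac4f738",
            "7d890e6d0ff8af84b065839966a0ec81",
            "8c355268678b9bbc7084221b1f0fde36",
            "76193074d549e5fdce4cdcbba0d66247",
        ],
        "date": ["2013-07-27"] * 5,
        "game_id": [2317258] * 5,
        "player_id": [1443, 5017, 9602, 12282, 25427],
        "club_id": [610, 610, 1090, 610, 1090],
        "player_name": [
            "Christian Poulsen", "Niklas Moisander", "Maarten Martens",
            "Daley Blind", "Roy Beerens",
        ],
        "type": [
            "substitutes", "starting_lineup", "substitutes",
            "starting_lineup", "starting_lineup",
        ],
        "position": [
            "Defensive Midfield", "Centre-Back", "Left Winger",
            "Left-Back", "Right Winger",
        ],
        "number": [5, 4, 11, 17, 23],
        "team_captain": [0, 0, 0, 0, 0],
    }
)


def preprocess_lineups(df):
    """Hypothetical cleaning pass: parse dates, de-duplicate, encode, add a feature."""
    out = df.copy()
    # Parse the date strings into datetime values.
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    # Assumption: game_lineups_id uniquely identifies a lineup entry.
    out = out.drop_duplicates(subset="game_lineups_id")
    # Drop rows missing the identifiers needed for later joins.
    out = out.dropna(subset=["player_id", "game_id"])
    # Feature construction: 1 if the player started the game, else 0.
    out["is_starter"] = (out["type"] == "starting_lineup").astype(int)
    # Integer-encode the categorical position column.
    out["position_code"] = out["position"].astype("category").cat.codes
    return out.reset_index(drop=True)


clean = preprocess_lineups(sample)
print(clean.shape)
print(clean.dtypes)

# Aggregation: number of starters per club in the sample.
starters_per_club = clean.groupby("club_id")["is_starter"].sum()
print(starters_per_club)
```

On the full dataset the same function would be called as `preprocess_lineups(df)`. The integer codes impose an arbitrary order on positions, so one-hot encoding may be a better fit for linear models.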


## Data Analysis and Visualization <a class="anchor" id="data-vis"></a>

Section Overview:

- Identify categorical, ordinal, and numerical variables within data
- Provide measures of centrality and distribution with visualizations
- Check for correlations between variables and determine independent and dependent variables
- Perform exploratory analysis in combination with visualization techniques to discover patterns and features of interest
- Create visualizations that allow for the discovery of insights in the data
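
As a starting point for the steps above, the sketch below splits columns by type, summarizes one numerical and one categorical variable, and computes a correlation matrix. It is a hedged illustration only: the sample frame reuses five rows from the `game_lineups.csv` preview so the block is self-contained, `is_starter` is a hypothetical derived column, and statistics on five rows show the mechanics rather than any real pattern.

```python
import pandas as pd

# Five rows taken from the game_lineups.csv preview (subset of columns).
sample = pd.DataFrame(
    {
        "club_id": [610, 610, 1090, 610, 1090],
        "type": [
            "substitutes", "starting_lineup", "substitutes",
            "starting_lineup", "starting_lineup",
        ],
        "position": [
            "Defensive Midfield", "Centre-Back", "Left Winger",
            "Left-Back", "Right Winger",
        ],
        "number": [5, 4, 11, 17, 23],
        "team_captain": [0, 0, 0, 0, 0],
    }
)

# First pass at variable types, based on dtype alone.
# Note: ID columns such as club_id are stored as numbers but are nominal.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# Measures of centrality and spread for a numerical variable.
number_summary = sample["number"].agg(["mean", "median", "std"])
print(number_summary)

# Frequency table for a categorical variable.
type_counts = sample["type"].value_counts()
print(type_counts)

# Pearson correlation between a derived binary feature and a numerical column.
sample["is_starter"] = (sample["type"] == "starting_lineup").astype(int)
corr = sample[["is_starter", "number"]].corr()
print(corr)
```

A frequency table like `type_counts` can be drawn directly with `type_counts.plot(kind="bar")`, which is one simple way to pair each summary with a visualization.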

## Data Analytics <a class="anchor" id="data-analytics"></a>