# COMP 352 - Final Project: Footballer Market Evaluation

**Authors:** Zevin Attisha, Santi Guerrero, Santiago Pedetti, Christian Vaikona

#### Full dataset: https://www.kaggle.com/datasets/davidcariboo/player-scores/data?select=appearances.csv

## Table of Contents:
* [Environment Setup](#env-setup)
* [Data Importing and Pre-processing](#data-importing)
* [Data Analysis and Visualization](#data-vis)
* [Data Analytics](#data-analytics)

## Environment Setup <a class="anchor" id="env-setup"></a>

First, we must set up our environment so that all required modules are installed. We provide two methods. The first is to install all modules from a ```.yml``` file via ```conda```.

To do this, run:
```bash
conda env create -f env_setup/data_environment.yml
```
Then activate the environment:
```bash
conda activate data_env
```

The second method is to install the modules from the ```requirements.txt``` file via ```pip```.

To do this, first create and activate your environment:
```bash
conda create -n my_data_env python pip
conda activate my_data_env
```

You may need to upgrade ```pip```, ```setuptools```, and ```wheel``` first (depending on your system, you may need to replace ```pip3``` with ```pip```):
```bash
pip3 install --upgrade pip setuptools wheel
```

Then install the requirements:
```bash
pip3 install -r env_setup/requirements.txt
```
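
Whichever method you use, it can help to confirm that the modules resolve before running the notebook. The following is a minimal sketch, not part of the project scripts: the module list is copied from the notebook's import cell, and the helper name `find_missing_modules` is our own.

```python
import importlib.util

# Top-level import names used by this notebook (taken from its import cell).
REQUIRED_MODULES = [
    "pandas", "numpy", "matplotlib", "seaborn", "scipy", "sklearn",
    "xgboost", "lightgbm", "folium", "branca", "geopandas",
]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Run this inside the activated environment; any name it reports can then be installed with ```conda``` or ```pip```.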

## Data Importing and Pre-processing <a class="anchor" id="data-importing"></a>

Section Overview:

- Import dataset and describe characteristics such as dimensions, data types, file types, and import methods used
- Clean, wrangle, and handle missing data, duplicate data, etc.
- Encode any categorical variables
- Perform feature engineering on the dataset
- Transform data appropriately using techniques such as aggregation, normalization, and feature construction
- Reduce redundant data and perform need-based discretization
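
The steps listed above can be sketched on a small toy table. This is only an illustration under stated assumptions: the column names and values are made up rather than taken from the real CSV files, and the specific choices (median fill, integer category codes, min-max scaling, three equal-width bins) are one reasonable option each, not necessarily what the final pipeline will use.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are illustrative only.
toy_df = pd.DataFrame(
    {
        "player_id": [1, 2, 2, 3, 4],
        "position": ["Attack", "Midfield", "Midfield", None, "Defender"],
        "height_cm": [180.0, 175.0, 175.0, 190.0, None],
        "value_eur": [1_000_000, 250_000, 250_000, 4_000_000, 500_000],
    }
)

# 1. Remove exact duplicate rows.
clean_df = toy_df.drop_duplicates().reset_index(drop=True)

# 2. Handle missing data: label missing categories, fill numeric gaps with the median.
clean_df["position"] = clean_df["position"].fillna("Unknown")
clean_df["height_cm"] = clean_df["height_cm"].fillna(clean_df["height_cm"].median())

# 3. Encode the categorical column as integer codes.
clean_df["position_code"] = clean_df["position"].astype("category").cat.codes

# 4. Min-max normalize the value column to the range [0, 1].
value = clean_df["value_eur"]
clean_df["value_scaled"] = (value - value.min()) / (value.max() - value.min())

# 5. Discretize the value column into three equal-width bins.
clean_df["value_tier"] = pd.cut(value, bins=3, labels=["low", "mid", "high"])

print(clean_df)
```

Integer codes imply an ordering that a nominal variable such as position does not have, so one-hot encoding may be a better fit for linear models.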

In [None]:
# import libraries needed
import pandas as pd

pd.set_option("display.max_columns", None)
import warnings

import branca
import folium
import geopandas as gpd
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import xgboost as xgb
from branca.element import Figure
from folium import Marker
from folium.plugins import HeatMap
from scipy.special import boxcox1p
from scipy.stats import norm, probplot, skew
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

from utils.model_scripts import (
    hello
)
from utils.metric_scripts import (
    hello
)

warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=FutureWarning, module="pandas.*")
%matplotlib inline

In [15]:
# read in files
player_valuations_df = pd.read_csv("data/raw/player_valuations.csv")
players_df = pd.read_csv("data/raw/players.csv")

In [18]:
# dimensions of each df
print("player_valuations.csv dimensions: " + str(player_valuations_df.shape))
print("players.csv dimensions: " + str(players_df.shape))

player_valuations.csv dimensions: (496606, 5)
players.csv dimensions: (32601, 23)


In [None]:
# count the categorical (object-dtype) and numeric columns in each DataFrame


def count_variable_types(df):
    """Return (categorical, numeric) column counts for df."""
    cat_count = int((df.dtypes == "object").sum())
    # subtract an extra column as 1 column is an ID column
    numeric_count = df.shape[1] - cat_count - 1
    return cat_count, numeric_count


pv_cat_count, pv_numeric_vars = count_variable_types(player_valuations_df)

print("For player_valuations.csv Data:")
print("# of categorical values: " + str(pv_cat_count))
print("# of continuous variables:", str(pv_numeric_vars)+"\n")

p_cat_count, p_numeric_vars = count_variable_types(players_df)

print("For players.csv Data:")
print("# of categorical values: " + str(p_cat_count))
print("# of continuous variables:", str(p_numeric_vars))

For player_valuations.csv Data:
# of categorical values: 2
# of continuous variables: 2
For players.csv Data:
# of categorical values: 17
# of continuous variables: 5
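
The counts above say how many columns fall into each group but not which ones. A minimal sketch of listing them by name with `select_dtypes` follows; the helper `split_columns_by_dtype` and the toy column names are our own assumptions, not part of the project code or the real schema.

```python
import pandas as pd


def split_columns_by_dtype(df, id_columns=()):
    """Return (categorical, numeric) column-name lists, ignoring the given ID columns."""
    non_id = df.drop(columns=list(id_columns))
    categorical = non_id.select_dtypes(exclude="number").columns.tolist()
    numeric = non_id.select_dtypes(include="number").columns.tolist()
    return categorical, numeric


# Toy frame with illustrative column names only.
toy_df = pd.DataFrame(
    {
        "player_id": [1, 2, 3],
        "date": ["2020-01-01", "2021-01-01", "2022-01-01"],
        "market_value": [100_000, 250_000, 400_000],
        "competition": ["A1", "B1", "A1"],
    }
)

categorical_cols, numeric_cols = split_columns_by_dtype(toy_df, id_columns=["player_id"])
print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

In this sketch, a date stored as text lands in the categorical group until it is parsed, for example with `pd.to_datetime`.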


## Data Analysis and Visualization <a class="anchor" id="data-vis"></a>

Section Overview:

- Identify categorical, ordinal, and numerical variables within data
- Provide measures of centrality and distribution with visualizations
- Check for correlations between variables and determine independent and dependent variables
- Perform exploratory analysis in combination with visualization techniques to discover patterns and features of interest
- Create visualizations that allow for the discovery of insights in the data
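
As a starting point for the items above, measures of centrality, spread, and pairwise correlation can be computed directly in pandas. The sketch below uses a made-up toy table; the column names and numbers are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Toy numeric data; values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "age": [19, 23, 27, 31, 35],
        "value_eur": [2_000_000, 8_000_000, 12_000_000, 6_000_000, 1_000_000],
        "height_cm": [178, 182, 185, 180, 188],
    }
)

# Centrality and distribution: one row per statistic, one column per variable.
summary = toy_df.agg(["mean", "median", "std", "skew"])

# Pairwise Pearson correlation between the numeric variables.
corr = toy_df.corr(method="pearson")

print(summary.round(2))
print(corr.round(2))
```

The correlation matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual check, and a strongly skewed variable is a candidate for a log or Box-Cox transform before modeling.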

## Data Analytics <a class="anchor" id="data-analytics"></a>