# Process Data
### Description:

*Because I am prototyping my method, I am holding off on breaking the following logic into a python module.*

*UPDATE: I now have a module for these operations. See the example usage in `02a_process_data.ipynb`.*

The goal of this notebook is to do the following ***for both the home and away tables:***
- (1) remove the unneeded data from the raw csv files
- (2) calculate **goals per match scored** and save to a new column
- (3) calculate **goals per match conceded** and save to a new column

### Instructions:
'Run all' to clean and process the updated data and prepare it for analysis.

Visually confirm that all data is present in the data frames printed by the final cell.

Confirm that the data is saved in `../data/processed/home_table.csv` and `../data/processed/away_table.csv`.



### (1) Remove Unneeded Data

For our analysis we are interested in the average goals scored and conceded. To keep our data frames focused, we will drop the following columns:
- Wins (W)
- Draws (D)
- Losses (L)
- Points (PTS)
- Expected Goals (xG)
- Expected Goals Against (xGA)
- Expected Points (xPTS)

In [None]:
import pandas as pd

# load data into dataframes
home_df = pd.read_csv('../data/raw/home_table_raw.csv')
away_df = pd.read_csv('../data/raw/away_table_raw.csv')

# drop unneeded columns
columns_to_drop = [
    'W',
    'D',
    'L',
    'PTS',
    'xG',
    'xGA',
    'xPTS',
]

home_df_clean = home_df.drop(columns=columns_to_drop)
away_df_clean = away_df.drop(columns=columns_to_drop)

# Clean data
# We need to perform some arithmetic on columns that are storing numbers as strings.
# This next step converts Matches (M), Goals Scored (G), and Goals Conceded (GA) to numbers.
columns_to_clean = ['M', 'G', 'GA']

home_df_clean[columns_to_clean] = home_df_clean[columns_to_clean].apply(
    pd.to_numeric,
    errors='coerce'
)

away_df_clean[columns_to_clean] = away_df_clean[columns_to_clean].apply(
    pd.to_numeric,
    errors='coerce'
)
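
Note that `errors='coerce'` silently replaces any value it cannot parse with `NaN`. As a minimal sketch of that behavior, assuming a made-up toy table (the team names and values below are illustrative only, not taken from the raw files), the following shows what a coerced table looks like and how to list the rows that lost a value:

```python
import pandas as pd

# Toy data for illustration only; not taken from the raw csv files
toy_df = pd.DataFrame({
    'Team': ['Team A', 'Team B', 'Team C'],
    'M': ['19', '19', 'n/a'],   # 'n/a' cannot be parsed as a number
    'G': ['40', '31', '22'],
    'GA': ['12', '-', '30'],    # '-' cannot be parsed as a number
})

columns_to_clean = ['M', 'G', 'GA']
toy_df[columns_to_clean] = toy_df[columns_to_clean].apply(
    pd.to_numeric,
    errors='coerce'
)

# Rows where at least one value was coerced to NaN
rows_with_missing = toy_df[toy_df[columns_to_clean].isna().any(axis=1)]
print(rows_with_missing)
```

If the same check on `home_df_clean` or `away_df_clean` returns any rows, those rows are worth inspecting in the raw files before moving on.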


### (2) Calculate **goals per match scored** and save to a new column

Now we're ready to calculate the goals per match scored for each team when they are playing at home and away.


$gpm_{scored} = G / M$

We save this as `gpm_scored` in each dataframe.



In [None]:
home_df_clean["gpm_scored"] = home_df_clean["G"] / home_df_clean["M"]
away_df_clean["gpm_scored"] = away_df_clean["G"] / away_df_clean["M"]
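
The division above assumes every team has played at least one match. As a hedged sketch on made-up numbers (not the real tables), this shows what pandas returns when `M` is 0 and one possible way to treat those rows as missing. Whether that is the right choice for this analysis is an assumption:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the real tables
toy_df = pd.DataFrame({
    'Team': ['Team A', 'Team B', 'Team C'],
    'M': [19, 0, 0],
    'G': [38, 0, 2],
})

toy_df['gpm_scored'] = toy_df['G'] / toy_df['M']
# Team A: 38 / 19 = 2.0; Team B: 0 / 0 gives NaN; Team C: 2 / 0 gives inf

# One option: replace infinities with NaN so they count as missing
toy_df['gpm_scored'] = toy_df['gpm_scored'].replace([np.inf, -np.inf], np.nan)
print(toy_df)
```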

### (3) Calculate **goals per match conceded** and save to a new column

We do the same operation for goals per match conceded and save the result as `gpm_conceded` in each dataframe.

$gpm_{conceded} = GA / M$

At this point our data is ready for analysis. We save each table to a csv file in `../data/processed/`, print it for a visual check, and then proceed to the analysis.

In [None]:
home_df_clean["gpm_conceded"] = home_df_clean["GA"] / home_df_clean["M"]
away_df_clean["gpm_conceded"] = away_df_clean["GA"] / away_df_clean["M"]

home_df_clean.to_csv('../data/processed/home_table.csv', index=False)
print("\nhome_df_clean saved to '../data/processed/home_table.csv'")
print("\n======================   ~ home_df cleaned ~   =====================\n\n", home_df_clean)

away_df_clean.to_csv('../data/processed/away_table.csv', index=False)
print("\n\naway_df_clean saved to '../data/processed/away_table.csv'")
print("\n\n======================   ~ away_df cleaned ~   =====================\n\n", away_df_clean)
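
To go one step beyond the visual check, the saved files can be reloaded and tested for the expected columns. The sketch below is one possible approach: `check_processed_table` is a hypothetical helper (it is not part of the module mentioned above), and the `Team` column and values in the toy table are assumptions used only to make the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_processed_table(path):
    """Reload a processed table and report which expected columns are missing."""
    df = pd.read_csv(path)
    expected = ['M', 'G', 'GA', 'gpm_scored', 'gpm_conceded']
    missing = [col for col in expected if col not in df.columns]
    return df, missing


# Demonstration on a toy table written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / 'toy_table.csv'
    toy_df = pd.DataFrame({
        'Team': ['Team A'], 'M': [19], 'G': [38], 'GA': [19],
        'gpm_scored': [2.0], 'gpm_conceded': [1.0],
    })
    toy_df.to_csv(toy_path, index=False)
    reloaded, missing = check_processed_table(toy_path)

print(missing)
print(reloaded.shape)
```

In this notebook, the same helper could be called as `check_processed_table('../data/processed/home_table.csv')` once the cell above has run. An empty `missing` list means all expected columns were written.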
