## Coding Exercise

Over the coming weeks, our knowledge of pandas and basic Python syntax will be tested as we implement data pipelines and various analyses.

To refresh our pandas skills, let's implement the following non-trivial lines of pseudocode to reveal information about our dataset of top soccer/football scorers from leagues around the world.

To find out more information regarding this dataset, check out: https://www.kaggle.com/datasets/mohamedhanyyy/top-football-leagues-scorers

Implement your code in `individual_coding_exercise.ipynb`.

## Missing Values

We cannot progress with exploration & analysis until our dataframe is correctly formatted.
“Garbage in, garbage out.”
“Bad data in, bad predictions out.”

Let’s consider where our dataframe fails:

How many values are we missing in the “Goals” column?
	57
Do you notice anything odd when grouping players by country?
	Two Netherlands?
What is the average number of goals per player for a soccer season?
	11.77
Which country and year had the highest count of soccer players who scored an above-average number of goals?
	England - 2018
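
One way to investigate the "Two Netherlands?" observation is to look at the raw labels themselves. The sketch below uses a hypothetical toy dataframe and assumes the duplicate comes from stray whitespace; the actual cause in `scorers.csv` may differ, so inspect the labels before normalising them.

```python
import pandas as pd

# Hypothetical toy data: the second "Netherlands" carries a trailing space.
toy = pd.DataFrame({
    "Country": ["Netherlands", "Netherlands ", "Spain", "Spain"],
    "Goals": [10.0, 12.0, 8.0, 14.0],
})

# Printing the labels as a list shows the quotes, which exposes whitespace
# that a rendered table hides.
print(toy["Country"].unique().tolist())

# Normalise the labels (assumed fix), then group again.
toy["Country"] = toy["Country"].str.strip()
goals_by_country = toy.groupby("Country")["Goals"].sum()
print(goals_by_country)
```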

In [1]:
import pandas as pd
# load in your dataset located in `data/scorers.csv`
df = pd.read_csv("../data/scorers.csv")

df.head()

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,Spain,La Liga,(BET),Juanmi Callejon,19.0,16.0,1849.0,11.0,48.0,20.0,2.47,1.03,2016
1,Spain,La Liga,(BAR),Antoine Griezmann,36.0,0.0,3129.0,16.0,88.0,41.0,2.67,1.24,2016
2,Spain,La Liga,(ATL),Luis Suarez,34.0,1.0,2940.0,28.0,120.0,57.0,3.88,1.84,2016
3,Spain,La Liga,(CAR),Ruben Castro,32.0,3.0,2842.0,,117.0,42.0,3.91,1.4,2016
4,Spain,La Liga,(VAL),Kevin Gameiro,21.0,10.0,1745.0,13.0,50.0,23.0,2.72,1.25,2016


In [41]:
missing = df.loc[3, "Goals"]
print(missing)

nan


In [43]:
missing_goals = df[df["Goals"].isna()]

len(missing_goals)

57

What do missing values look like?

Sometimes we will see some placeholder value that indicates a value is missing. 

If a dataset is properly documented, we should see some information such as “0 = missing data” or “N/A = missing data.”

Sometimes we can also intuitively understand that some data is missing, although we always want documentation.
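
As a minimal sketch of handling documented placeholders, the example below builds a hypothetical CSV in memory in which `N/A`, `unknown`, and `0` are all assumed to mean "missing". None of these placeholders are confirmed for `scorers.csv`; they only illustrate the idea.

```python
import io
import pandas as pd

# Hypothetical CSV: "N/A", "unknown", and 0 are assumed to mean "missing".
raw = io.StringIO(
    "Player,Goals\n"
    "A,11\n"
    "B,N/A\n"
    "C,0\n"
    "D,28\n"
    "E,unknown\n"
)

# read_csv already treats "N/A" as missing; na_values adds extra tokens.
placeholder_df = pd.read_csv(raw, na_values=["unknown"])
missing_after_load = int(placeholder_df["Goals"].isna().sum())

# A numeric placeholder such as 0 has to be replaced explicitly, and only
# when the documentation says that 0 really means "missing".
placeholder_df["Goals"] = placeholder_df["Goals"].replace(0, float("nan"))
missing_after_replace = int(placeholder_df["Goals"].isna().sum())

print(missing_after_load, missing_after_replace)
```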


In [2]:
mean = df["Goals"].mean()
mean

11.772802653399669

In [None]:
above_avg_goals = df[df["Goals"] > mean].groupby(["Country", "Year"])["Goals"].count().sort_values()
above_avg_goals

Why is missing data important?

Well, let’s consider the mean of the “Goals” column in our sample dataset.

    11.7728

Let us then view the mean of the “Goals” column in our full (population) dataset, `scorers_full.csv`:

	11.8300


In [6]:
# load in your dataset located in `data/scorers_full.csv`
full = pd.read_csv("../data/scorers_full.csv")

full["Goals"].mean()

11.830015313935682

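The difference is marginal (11.8300 - 11.7728 ≈ 0.057). With only 57 values missing, the mean of the values we do have is still a good estimate of the full-dataset mean, which is what the law of large numbers leads us to expect as long as the values are not missing in a systematic way.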

However, we cannot leave this data as is.

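Depending on the tool and its settings, missing data can prevent us from calculating statistics such as the mean and max: pandas skips `NaN` by default, but the result becomes `NaN` as soon as missing values are not skipped.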
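
The sketch below illustrates how a single missing value affects the mean, using a small made-up series rather than the real dataset. It contrasts the pandas default with `skipna=False` and with plain NumPy.

```python
import numpy as np
import pandas as pd

# Toy values with one missing entry (not taken from scorers.csv).
goals = pd.Series([11.0, 16.0, 28.0, np.nan, 13.0])

pandas_default = goals.mean()                   # NaN is skipped: 68 / 4
pandas_strict = goals.mean(skipna=False)        # NaN propagates
numpy_plain = np.mean(goals.to_numpy())         # NaN propagates
numpy_nan_aware = np.nanmean(goals.to_numpy())  # NaN is skipped

print(pandas_default, pandas_strict, numpy_plain, numpy_nan_aware)
```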

## Cleaning a DataFrame

The strategy we use to handle this data will lead us to different calculations. 
We can either:

- Drop missing data. We could potentially drop an entire column. The cutoff varies from data scientist to data scientist and from project to project (10%, 20%, 50%, 70%).
- Replace missing data, usually with some statistic of the dataset (mean, median, mode, or some simulated data).

Whatever we do, we must document it. (Ex: Dropped columns “xG” and “avgxG” since our analysis does not depend on them.)
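
To make the cutoff idea concrete, here is a minimal sketch that measures the share of missing values per column and drops any column above a threshold. The toy dataframe and the 50% `threshold` are illustrative assumptions, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe; "xG" is mostly empty.
toy = pd.DataFrame({
    "Goals": [11.0, np.nan, 28.0, 13.0],
    "xG": [np.nan, np.nan, np.nan, 9.5],
    "Year": [2016, 2016, 2017, 2017],
})

# isna() gives booleans, so the column-wise mean is the share missing.
missing_share = toy.isna().mean()

threshold = 0.5  # assumed cutoff; choose one that suits the project
cols_to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=cols_to_drop)

# Document the decision alongside the code.
print(f"Dropped columns {cols_to_drop}: more than {threshold:.0%} missing")
```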

## Dropping

This, of course, skews our data in some direction. 
If we have a good chunk (relative term) of data missing, it’ll be better to simply remove the column and document it as unusable.

Ex: 50% of rows missing goal data. Do I really want to drop 50% of my data?

As a rule of thumb associated with the central limit theorem, we should also keep at least 30 samples per data-type after dropping.
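
Before dropping rows, it can help to quantify what would be lost. The sketch below uses a made-up 40-row column in which every fourth value is missing, and treats the 30-sample rule of thumb as a simple check on the rows that remain. `min_samples` is an assumed name for that rule.

```python
import numpy as np
import pandas as pd

# Toy column: 40 rows, every fourth "Goals" value missing.
toy = pd.DataFrame({
    "Goals": [np.nan if i % 4 == 0 else float(i % 20) for i in range(40)]
})

rows_before = len(toy)
remaining = toy.dropna(subset=["Goals"])  # returns a copy; toy is unchanged
share_lost = 1 - len(remaining) / rows_before

min_samples = 30  # rule-of-thumb minimum
enough_left = len(remaining) >= min_samples

print(f"Dropping would remove {share_lost:.0%} of rows; enough left: {enough_left}")
```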

In [4]:
# drop every row whose "Goals" value is missing (modifies df in place)
df.dropna(subset=["Goals"], inplace=True)
df["Goals"].mean()

11.772802653399669

In [30]:
missing_goals = df[df["Goals"].isna()]

len(missing_goals)

0

## Dropping Columns

As mentioned before, sometimes we would like to drop columns:

- Either to simply remove irrelevant information.
- Or to drop an unusable column.
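
A column drop can be written so that the cell is safe to run more than once. The sketch below uses a hypothetical toy dataframe whose header is assumed to carry a trailing space. It strips the header names first and passes `errors="ignore"` so that a second run does not raise a `KeyError`.

```python
import pandas as pd

# Hypothetical toy data; the "Substitution " header has a trailing space.
toy = pd.DataFrame({
    "Player Names": ["Luis Suarez", "Kevin Gameiro"],
    "Substitution ": [1.0, 10.0],
    "Goals": [28.0, 13.0],
})

# Strip stray whitespace so column names match what we type.
toy.columns = toy.columns.str.strip()

# errors="ignore" skips labels that are no longer present.
toy = toy.drop(columns=["Substitution"], errors="ignore")
toy = toy.drop(columns=["Substitution"], errors="ignore")  # rerun: no error

print(toy.columns.tolist())
```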

In [53]:
#df.drop(columns=["Substitution "], inplace=True)
df.head()

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Mins,Goals,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,Spain,La Liga,(BET),Juanmi Callejon,19.0,1849.0,11.0,48.0,20.0,2.47,1.03,2016
1,Spain,La Liga,(BAR),Antoine Griezmann,36.0,3129.0,16.0,88.0,41.0,2.67,1.24,2016
2,Spain,La Liga,(ATL),Luis Suarez,34.0,2940.0,28.0,120.0,57.0,3.88,1.84,2016
4,Spain,La Liga,(VAL),Kevin Gameiro,21.0,1745.0,13.0,50.0,23.0,2.72,1.25,2016
5,Spain,La Liga,(JUV),Cristiano Ronaldo,29.0,2634.0,25.0,162.0,60.0,5.84,2.16,2016


Notice how `Substitution` is not found in our axis. This is because the column

In [39]:
mean = df["Goals"].mean()

In [5]:
df = pd.read_csv("../data/scorers.csv")
# skipna=False propagates NaN instead of silently ignoring missing values
mean = df["Goals"].mean(skipna=False)
mean

nan

In [38]:
df = pd.read_csv("../data/scorers.csv")
# treat every missing "Goals" value as 0, which pulls the mean down
zero_filled = df["Goals"].fillna(0)
zero_filled.mean()

10.756060606060606

In [40]:
# fill with the mean of the observed values (NaN skipped), so the mean is unchanged
mean_filled = df["Goals"].fillna(df["Goals"].mean())
mean_filled.mean()

11.772802653399669
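
The same comparison can be extended to the median, one of the other replacement statistics mentioned earlier. This is a sketch on a made-up series with one large outlier, not a result from `scorers.csv`. It shows that each fill strategy leads to a different mean.

```python
import numpy as np
import pandas as pd

# Toy values with one outlier (40) and one missing entry.
goals = pd.Series([10.0, 12.0, 14.0, 40.0, np.nan])

observed_mean = goals.mean()      # NaN skipped: 76 / 4
observed_median = goals.median()  # NaN skipped: middle of 10, 12, 14, 40

fill_results = {
    "zero": goals.fillna(0).mean(),
    "mean": goals.fillna(observed_mean).mean(),
    "median": goals.fillna(observed_median).mean(),
}

for strategy, value in fill_results.items():
    print(f"{strategy:>6}: {value:.2f}")
```

Filling with the mean leaves the mean unchanged, while filling with zero or the median shifts it, which is why the chosen strategy should be documented.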