<a href="https://colab.research.google.com/github/OptimalDecisions/sports-analytics-foundations/blob/main/sa-getting-started/SA_3_4_Data_Cleanup_Case_Study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study: How to Prepare a Dataframe for Analysis

<img src = "../img/sa_logo.png" width="100" align="left">

  Ram Narasimhan

  <br><br><br>


We have a data file in `csv` format that we downloaded from the internet. It is laid out for human readers, so it is not suitable for analysis as it stands.

So, we perform a number of "clean-up" activities on the data. This document is based on an actual analysis. Let's go through the tasks at hand, step by step.

## Data clean-up activities to be done

1. There is a column called `Loaned From`, which we want to drop since it is not relevant to the analysis.
1. If a player has no `Club` affiliation, we want to drop that row.
1. There are 3 financial columns: `Wage`, `Value` and `Release Clause`. They hold strings such as `€110.5M` or `€565K`. Remove (strip) the '€' symbol from each.
1. Convert the financial columns from strings to proper numbers.
1. If a player's `Value` is missing, we want to drop that player.
1. The heights are given as strings such as `5'11`, to be converted to inches (`71`).
1. Drop the rows where the player's height is missing.
1. Each player's weight has `lbs` attached to it. We want to strip that out and make it an integer.


```
Input file: `fifa_eda_stats.csv`

Output: A modified data frame
```

This example is based on the Kaggle notebook originally done by [Vick Prevert](https://www.kaggle.com/code/vprivert/fifa-statistics-eda)

For the sake of illustration, I have added many more explanatory steps, so that the reader can see the intermediate steps.

In [6]:
import pandas as pd
df = pd.read_csv('fifa_eda_stats.csv')

In [7]:
df.shape

(18207, 57)

In [8]:
df

Unnamed: 0,ID,Name,Age,Nationality,Overall,Potential,Club,Value,Wage,Preferred Foot,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,158023,L. Messi,31,Argentina,94,94,FC Barcelona,€110.5M,€565K,Left,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,20801,Cristiano Ronaldo,33,Portugal,94,94,Juventus,€77M,€405K,Right,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,190871,Neymar Jr,26,Brazil,92,93,Paris Saint-Germain,€118.5M,€290K,Right,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,193080,De Gea,27,Spain,91,93,Manchester United,€72M,€260K,Right,...,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,192985,K. De Bruyne,27,Belgium,91,92,Manchester City,€102M,€355K,Right,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18202,238813,J. Lundstram,19,England,47,65,Crewe Alexandra,€60K,€1K,Right,...,45.0,40.0,48.0,47.0,10.0,13.0,7.0,8.0,9.0,€143K
18203,243165,N. Christoffersson,19,Sweden,47,63,Trelleborgs FF,€60K,€1K,Right,...,42.0,22.0,15.0,19.0,10.0,9.0,9.0,5.0,12.0,€113K
18204,241638,B. Worman,16,England,47,67,Cambridge United,€60K,€1K,Right,...,41.0,32.0,13.0,11.0,6.0,5.0,10.0,6.0,13.0,€165K
18205,246268,D. Walker-Rice,17,England,47,66,Tranmere Rovers,€60K,€1K,Right,...,46.0,20.0,25.0,27.0,14.0,6.0,14.0,8.0,9.0,€143K


## Step 1: Remove `Loaned From` column since it only applies to a few records



In [9]:
# Remove Loaned From column since it only applies to a few records
df.drop('Loaned From', axis=1, inplace=True)

In pandas, you should always think `[row, column]`: `axis=0` refers to rows and `axis=1` refers to columns. Passing `axis=1` here tells `drop` that `'Loaned From'` is a column label.
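
To see the two axes side by side, here is a minimal sketch on a made-up three-row frame. The `toy` name and its values are illustrative and not part of the FIFA data.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Club": ["X", "Y", "Z"],
        "Loaned From": [None, "Q", None],
    }
)

# axis=1 treats the label as a column name
no_loan = toy.drop("Loaned From", axis=1)

# axis=0 (the default) treats the label as a row index label
no_first_row = toy.drop(0, axis=0)

print(list(no_loan.columns))     # column removed, all rows kept
print(list(no_first_row.index))  # row removed, all columns kept
```

If remembering the axis number is a nuisance, `toy.drop(columns="Loaned From")` does the same thing as the `axis=1` call.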

## Step 2: If the `Club` for a player is NA (null) then drop that player from the dataset.

In [None]:
# Remove players with no club
df.dropna(subset=['Club'], inplace=True)
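
As a quick illustration of what `subset` does, the sketch below uses a made-up frame in which one player lacks a club and another lacks an unrelated field. Only the first of these is dropped.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Club": ["X", None, "Z"],
        "Position": ["ST", "GK", None],
    }
)

# Only missing values in 'Club' cause a row to be dropped;
# the missing 'Position' for player C is ignored
with_club = toy.dropna(subset=["Club"])

print(with_club["Name"].tolist())
```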

## Step 3: Remove the '€' Symbol from money columns

In [10]:
# Clean € cells by removing special characters
df['Value'] = df['Value'].str.replace('€', '')
df['Wage'] = df['Wage'].str.replace('€', '')
df['Release Clause'] = df['Release Clause'].str.replace('€', '')



## Step 4: Make finance values into numbers (not strings)

Note that we replace `K` with `e3` and `M` with `e6`, which is scientific notation for 10^3 (thousands) and 10^6 (millions). A string such as `565e3` can then be cast directly to a float.

Also, note the use of "chaining" where a number of operations are performed in a chain with the dot operator.
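
The trick works because Python's `float` accepts scientific notation in strings. Here is a small pure-Python sketch, using the strings from the first row of the table above as they would look after Step 3:

```python
# Strings as they look after the '€' has been stripped in Step 3
raw_value = "110.5M"
raw_wage = "565K"

# 'K' -> 'e3' and 'M' -> 'e6' turn the suffix into an exponent
value = float(raw_value.replace("K", "e3").replace("M", "e6"))
wage = float(raw_wage.replace("K", "e3").replace("M", "e6"))

print(value)  # 110500000.0
print(wage)   # 565000.0
```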

In [11]:
# Convert values to numeric
df['Value'] = df['Value'].str.replace('K', 'e3').str.replace('M', 'e6').astype(float)
df['Wage'] = df['Wage'].str.replace('K', 'e3').str.replace('M', 'e6').astype(float)
df['Release Clause'] = df['Release Clause'].str.replace('K', 'e3').str.replace('M', 'e6').astype(float)



In [13]:
df[['Value', 'Wage', 'Release Clause']].dtypes # Let us confirm that it worked

Value             float64
Wage              float64
Release Clause    float64
dtype: object
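
Steps 3 and 4 repeat the same expression for three columns. One possible way to avoid the repetition, sketched here on a made-up frame, is a small helper applied in a loop. The `parse_money` name and the `MONEY_COLUMNS` list are hypothetical and not part of the original notebook.

```python
import pandas as pd

MONEY_COLUMNS = ["Value", "Wage", "Release Clause"]


def parse_money(series: pd.Series) -> pd.Series:
    """Turn strings such as '€110.5M' or '€565K' into floats (hypothetical helper)."""
    return (
        series.str.replace("€", "", regex=False)
        .str.replace("K", "e3", regex=False)
        .str.replace("M", "e6", regex=False)
        .astype(float)
    )


# Made-up rows in the same format as the FIFA file
toy = pd.DataFrame(
    {
        "Value": ["€110.5M", "€60K", "€0"],
        "Wage": ["€565K", "€1K", "€0"],
        "Release Clause": ["€226.5M", "€143K", None],
    }
)

for col in MONEY_COLUMNS:
    toy[col] = parse_money(toy[col])

print(toy.dtypes)
```

A missing entry passes through as `NaN`, so the helper does not fail on incomplete rows.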

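## Step 5: If a player's value is missing, then drop that row entirely.
The filter below treats a `Value` of 0 as missing and keeps only the rows with a nonzero value. Note that we are overwriting `df` by assigning the filtered result back to it.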

In [14]:
# Drop players with no value
df = df[df['Value'] != 0]
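
A comparison with `NaN` is never equal, so `!= 0` keeps rows whose value is truly missing (`NaN`). If both cases should go, one possible variant combines the two conditions. This is a sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical values: one real, one zero, one missing
toy = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Value": [110.5e6, 0.0, float("nan")]}
)

# The filter from Step 5 removes the zero but keeps the NaN row
nonzero = toy[toy["Value"] != 0]

# Adding notna() removes the NaN row as well
valued = toy[toy["Value"].notna() & (toy["Value"] != 0)]

print(nonzero["Name"].tolist())
print(valued["Name"].tolist())
```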


In [16]:
df = df.copy() # make an explicit copy to avoid pandas' SettingWithCopyWarning on later assignments

## Step 6: The `Height` column has feet and inches. Convert it to inches

In [None]:
df['Height'].sample(5)

7964      6'0
12188     6'1
12253     6'3
15830    5'10
10429     6'2
Name: Height, dtype: object

Note several things:
1. We are splitting the height at the feet symbol `'`.
2. When we use the `expand=True` flag, we get a NEW column for each split piece. (In this case the original Height column is split in two: column `0` holds the feet and column `1` holds the inches.)
3. We take each piece and cast it as float.


In [19]:
# Height to inches
height_split = df['Height'].str.split("'", expand=True)
feet = height_split[0].astype(float)
inches = height_split[1].astype(float)



In [None]:
# Trick: display the first 5 rows of the 'Height' column and height_split, side by side
pd.concat([df['Height'], height_split ], axis=1).head()


Unnamed: 0,Height,0,1
0,5'7,5,7
1,6'2,6,2
2,5'9,5,9
3,6'4,6,4
4,5'11,5,11


In [None]:
[feet.tail(3), inches.tail(3)]

[18204    5.0
 18205    5.0
 18206    5.0
 Name: 0, dtype: float64,
 18204     8.0
 18205    10.0
 18206    10.0
 Name: 1, dtype: float64]

Now, we can convert from Feet to inches.

In [20]:
df['Height'] = (feet * 12) + inches


In [21]:
df['Height'].sample(5)

9839     69.0
17225    72.0
12997    70.0
12140    71.0
13217    76.0
Name: Height, dtype: float64

That looks good!
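
The same arithmetic for a single string, as a pure-Python sketch. The `height_to_inches` helper is hypothetical and assumes the `feet'inches` format shown above, with an optional trailing `"`.

```python
def height_to_inches(height: str) -> int:
    """Convert a string such as "5'11" to total inches (hypothetical helper)."""
    feet, inches = height.split("'")
    # Tolerate a trailing inch mark, as in 5'11"
    return int(feet) * 12 + int(inches.rstrip('"'))


print(height_to_inches("5'11"))  # 71
print(height_to_inches("6'2"))   # 74
```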



## Step 7: Next, let's drop the rows which have the player's height missing.

In [22]:
# Count the rows where Height is missing (null)
df['Height'].isnull().sum()

48

The command above tells us that 48 rows have a missing height. We need to get rid of these rows, because player height is an important data point for our current analysis.

In [None]:
df.dropna(subset=['Height'], inplace=True)  # Drop rows with missing height values


In [None]:
df['Height'].isnull().sum()

0
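
Where do the missing heights come from? A row with no `Height` string stays missing through the split and the float conversion, so it surfaces as `NaN` in the final column. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical heights; the second player has none recorded
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Height": ["5'7", None, "6'2"]})

# Same split-and-convert logic as Step 6
parts = toy["Height"].str.split("'", expand=True)
toy["Height"] = parts[0].astype(float) * 12 + parts[1].astype(float)

missing_before = int(toy["Height"].isnull().sum())

# Same drop as Step 7, written with reassignment instead of inplace=True
toy = toy.dropna(subset=["Height"])
missing_after = int(toy["Height"].isnull().sum())

print(missing_before, missing_after)
print(toy["Height"].tolist())
```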

## Step 8: Clean up the Weight Column

We want to strip out the "lbs" in each row and make it into a number.

In [23]:
df['Weight']

0        159lbs
1        183lbs
2        150lbs
3        168lbs
4        154lbs
          ...  
18202    134lbs
18203    170lbs
18204    148lbs
18205    154lbs
18206    176lbs
Name: Weight, Length: 17955, dtype: object

In [26]:
# Fix weight to remove lbs
df['Weight'] = df['Weight'].str.replace('lbs', '').astype('Int64')

In [27]:
df['Weight']

0        159
1        183
2        150
3        168
4        154
        ... 
18202    134
18203    170
18204    148
18205    154
18206    176
Name: Weight, Length: 17955, dtype: Int64
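
The cast above assumes every entry is a clean `<number>lbs` string. If some entries might be malformed, one defensive variant routes them through `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values. This is an assumption for illustration, not what the original notebook does.

```python
import pandas as pd

# Hypothetical weights, including one missing and one malformed entry
raw = pd.Series(["159lbs", None, "n/a", "176lbs"])

weights = pd.to_numeric(
    raw.str.replace("lbs", "", regex=False),  # strip the unit
    errors="coerce",                          # unparseable -> NaN
).astype("Int64")                             # nullable integer dtype

print(weights.tolist())
print(weights.dtype)
```

The nullable `Int64` dtype (capital `I`) is what allows integers and missing values to live in the same column.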


The initial dataframe preparation step is complete.

Now, the dataframe is ready for us to start our analysis.


