# Data analysis project: Forbes Highest-Paid Athletes 1990-2020

## Introduction

In this project, I explore the Forbes list of the highest-paid athletes from 1990 to 2020. The dataset comes from Kaggle, and the project is purely educational: its goal is to strengthen my data analysis skills.

In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt

In [2]:
# Read in the data
dataframe = pd.read_csv('Dataset/Forbes Richest Atheletes (Forbes Richest Athletes 1990-2020).csv')

In [3]:
# Display the first five rows of the dataset
dataframe.head()

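|   | S.NO | Name | Nationality | Current Rank | Previous Year Rank | Sport | Year | earnings ($ million) |
|---|------|------|-------------|--------------|--------------------|-------|------|----------------------|
| 0 | 1 | Mike Tyson | USA | 1 | NaN | boxing | 1990 | 28.6 |
| 1 | 2 | Buster Douglas | USA | 2 | NaN | boxing | 1990 | 26.0 |
| 2 | 3 | Sugar Ray Leonard | USA | 3 | NaN | boxing | 1990 | 13.0 |
| 3 | 4 | Ayrton Senna | Brazil | 4 | NaN | auto racing | 1990 | 10.0 |
| 4 | 5 | Alain Prost | France | 5 | NaN | auto racing | 1990 | 9.0 |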


In [4]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   S.NO                  301 non-null    int64  
 1   Name                  301 non-null    object 
 2   Nationality           301 non-null    object 
 3   Current Rank          301 non-null    int64  
 4   Previous Year Rank    277 non-null    object 
 5   Sport                 301 non-null    object 
 6   Year                  301 non-null    int64  
 7   earnings ($ million)  301 non-null    float64
dtypes: float64(1), int64(3), object(4)
memory usage: 18.9+ KB


In [5]:
dataframe.shape

(301, 8)

In [6]:
dataframe.describe()

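|       | S.NO | Current Rank | Year | earnings ($ million) |
|-------|------|--------------|------|----------------------|
| count | 301.0 | 301.0 | 301.0 | 301.0 |
| mean | 151.0 | 5.448505 | 2005.122924 | 45.516279 |
| std | 87.035433 | 2.850995 | 9.063563 | 33.525337 |
| min | 1.0 | 1.0 | 1990.0 | 8.1 |
| 25% | 76.0 | 3.0 | 1997.0 | 24.0 |
| 50% | 151.0 | 5.0 | 2005.0 | 39.0 |
| 75% | 226.0 | 8.0 | 2013.0 | 59.4 |
| max | 301.0 | 10.0 | 2020.0 | 300.0 |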


## 1. Data cleaning 

### Handling Missing Values

In [7]:
# Count the missing values in each column
dataframe.isna().sum()

S.NO                     0
Name                     0
Nationality              0
Current Rank             0
Previous Year Rank      24
Sport                    0
Year                     0
earnings ($ million)     0
dtype: int64

In [8]:
# Display every row that contains at least one NaN value
dataframe[dataframe.isna().any(axis=1)]

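|   | S.NO | Name | Nationality | Current Rank | Previous Year Rank | Sport | Year | earnings ($ million) |
|---|------|------|-------------|--------------|--------------------|-------|------|----------------------|
| 0 | 1 | Mike Tyson | USA | 1 | NaN | boxing | 1990 | 28.6 |
| 1 | 2 | Buster Douglas | USA | 2 | NaN | boxing | 1990 | 26.0 |
| 2 | 3 | Sugar Ray Leonard | USA | 3 | NaN | boxing | 1990 | 13.0 |
| 3 | 4 | Ayrton Senna | Brazil | 4 | NaN | auto racing | 1990 | 10.0 |
| 4 | 5 | Alain Prost | France | 5 | NaN | auto racing | 1990 | 9.0 |
| 5 | 6 | Jack Nicklaus | USA | 6 | NaN | golf | 1990 | 8.6 |
| 6 | 7 | Greg Norman | Australia | 7 | NaN | golf | 1990 | 8.5 |
| 7 | 8 | Michael Jordan | USA | 8 | NaN | basketball | 1990 | 8.1 |
| 8 | 9 | Arnold Palmer | USA | 8 | NaN | golf | 1990 | 8.1 |
| 9 | 10 | Evander Holyfield | USA | 8 | NaN | boxing | 1990 | 8.1 |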
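
To see where the missing values concentrate, the NaN flags can be grouped by year. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than the real dataset, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only
toy = pd.DataFrame({
    "Year": [1990, 1990, 1991, 1991],
    "Previous Year Rank": [None, None, "1", None],
})

# Flag missing ranks, then count the flags within each year
missing_by_year = toy["Previous Year Rank"].isna().groupby(toy["Year"]).sum()
print(missing_by_year)
```

Running the same expression on `dataframe` would show which years account for the missing 'Previous Year Rank' values.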


The 'Previous Year Rank' column is the only one with missing values (24 of 301 rows), and it is not needed to analyze the highest-paid athletes, so it can be dropped to streamline the dataset.

In [9]:
# Drop the 'Previous Year Rank' column in place
dataframe.drop("Previous Year Rank", axis=1, inplace=True)

In [15]:
dataframe.columns

Index(['S.NO', 'Name', 'Nationality', 'Current Rank', 'Sport', 'Year',
       'earnings ($ million)'],
      dtype='object')
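
As an alternative to `inplace=True`, the same removal can be written so that it returns a new DataFrame and leaves the original untouched, which makes it easier to re-run a cell safely. This is a minimal sketch on a hypothetical frame (`toy`), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Previous Year Rank": [None, "1", "2"],
    "Year": [1990, 1991, 1991],
})

# drop(columns=...) returns a new DataFrame; `toy` itself is unchanged
trimmed = toy.drop(columns="Previous Year Rank")

print(list(trimmed.columns))
print("Previous Year Rank" in toy.columns)
```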

### Data Type Correction

In [19]:
# Display the data type of each column
dataframe.dtypes

S.NO                      int64
Name                     object
Nationality              object
Current Rank              int64
Sport                    object
Year                      int64
earnings ($ million)    float64
dtype: object

In [24]:
# Display the total memory usage in bytes (shallow count: the contents of object columns are not included)
dataframe.memory_usage().sum()

16988
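
The figure above is a shallow count. Passing `deep=True` also measures the strings stored in text columns, and converting a repetitive text column to the `category` dtype is one common way to shrink it. The sketch below assumes a hypothetical frame (`toy`) with a heavily repeated 'Sport' column; whether the conversion pays off on the real dataset depends on how many distinct values each column holds.

```python
import pandas as pd

# Hypothetical frame with a heavily repeated text column
toy = pd.DataFrame({
    "Sport": ["boxing", "golf", "boxing", "golf"] * 250,
    "Year": [1990, 1991, 1992, 1993] * 250,
})

# deep=True includes the memory held by the strings themselves
before = toy.memory_usage(deep=True).sum()

# category stores each distinct label once plus a small integer code per row
toy["Sport"] = toy["Sport"].astype("category")
after = toy.memory_usage(deep=True).sum()

print(before, after)
```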

In [29]:
# Convert the Year column from int64 to datetime (each value becomes January 1 of that year)
dataframe["Year"] = pd.to_datetime(dataframe["Year"], format="%Y")
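
A quick way to check this conversion is to run it on a few sample years and read the year back through the `.dt` accessor. The sketch below uses a hypothetical series (`years`) and casts to `str` first so that the `%Y` parse is explicit; that cast is a defensive choice of mine, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample of integer years
years = pd.Series([1990, 2005, 2020], name="Year")

# Cast to str so each value is parsed as a four-digit year
as_dates = pd.to_datetime(years.astype(str), format="%Y")

print(as_dates.dtype)
print(as_dates.dt.year.tolist())
```

After the conversion, `dataframe["Year"].dt.year` recovers the integer year whenever it is needed for grouping or plotting.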