<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Exploring Data</div>

<h2>Import</h2>

In [1]:
import numpy as np
import pandas as pd
import datetime

<h2>Exploring your data</h2>

<h3>Read raw data from file</h3>

In [2]:
# Load the raw player data from the CSV file
raw_df = pd.read_csv('data_footballer.csv')

In [3]:
# TEST
raw_df.head()

Unnamed: 0,Name,Height,Weight,Preferred Foot,Birth Date,Age,Nation,Preferred Positions,OVR,POT,...,Long Shots,Curve,FK Acc.,Penalties,Volleys,GK Positioning,GK Diving,GK Handling,GK Kicking,GK Reflexes
0,Erling Haaland,195 cm,94 kg,Left,"July 21, 2000",23,Norway,ST,91,94,...,86,77,62,84,90,11,7,14,13,7
1,Kylian Mbappé,182 cm,75 kg,Right,"Dec. 20, 1998",24,France,"ST, LW",91,94,...,83,80,69,84,85,11,13,5,7,6
2,Kevin De Bruyne,181 cm,75 kg,Right,"June 28, 1991",32,Belgium,"CM, CAM",91,91,...,92,92,83,83,83,10,15,13,5,13
3,Harry Kane,188 cm,85 kg,Right,"July 28, 1993",30,England,ST,90,90,...,87,82,65,92,89,14,8,10,11,11
4,Thibaut Courtois,199 cm,96 kg,Left,"May 11, 1992",31,Belgium,GK,90,90,...,17,19,20,27,12,90,85,89,76,93


<h3>How many rows and how many columns does the raw data have?</h3>

In [4]:
# DataFrame.shape returns (number of rows, number of columns)
shape = raw_df.shape
print(f"Current shape: {shape}")

Current shape: (10020, 46)


<h3>What data type does each column currently have? Are there any columns whose data types are not suitable for further processing?</h3>

<h4>Convert all columns into their correct datatype before doing anything else</h4>

In [5]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10020 entries, 0 to 10019
Data columns (total 46 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Name                 10020 non-null  object
 1   Height               10020 non-null  object
 2   Weight               10020 non-null  object
 3   Preferred Foot       10020 non-null  object
 4   Birth Date           10020 non-null  object
 5   Age                  10020 non-null  int64 
 6   Nation               10020 non-null  object
 7   Preferred Positions  10020 non-null  object
 8   OVR                  10020 non-null  int64 
 9   POT                  10020 non-null  int64 
 10  Value                10020 non-null  object
 11  Wage                 10020 non-null  object
 12  Ball Control         10020 non-null  int64 
 13  Dribbling            10020 non-null  int64 
 14  Marking              10020 non-null  object
 15  Slide Tackle         10020 non-null  int64 
 16  Stan

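Several columns are stored as strings (`object`) and need conversion: `Birth Date` should be converted to datetime, and `Height`, `Weight`, `Value`, `Wage`, and `Marking` should be converted to numeric types.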
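
The datetime conversion can be done in a vectorized way. The following is a minimal sketch on hypothetical sample strings, assuming an English locale for month names; the names `birth`, `cleaned`, and `parsed` are illustrative only. It tries the abbreviated-month format first and falls back to the full-month format for the rows that did not parse.

```python
import pandas as pd

# Hypothetical sample strings in the same style as the `Birth Date` column.
birth = pd.Series(["July 21, 2000", "Dec. 20, 1998", "Sept. 5, 1995", "May 11, 1992"])

# Remove the abbreviation dots and normalize "Sept" to the three-letter "Sep".
cleaned = (
    birth.str.replace(".", "", regex=False)
         .str.replace("Sept ", "Sep ", regex=False)
)

# Try the abbreviated month format first; rows that do not match become NaT.
parsed = pd.to_datetime(cleaned, format="%b %d, %Y", errors="coerce")

# Fill the NaT rows by parsing again with the full month name format.
parsed = parsed.fillna(pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce"))

print(parsed)
```

Any row that matches neither format stays `NaT`, so `parsed.isna().sum()` gives a quick check for unexpected date styles.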

In [6]:
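# Convert the string columns to numeric; unparseable entries become NaN
cols_change = ['Value', 'Wage', 'Marking']
# regex=False treats '.' as a literal dot, not as the regex wildcard
raw_df['Value'] = raw_df['Value'].str.replace('.', '', regex=False)
raw_df[cols_change] = raw_df[cols_change].apply(pd.to_numeric, errors='coerce')

# Normalize month abbreviations such as 'Dec.' and 'Sept.' before parsing
raw_df['Birth Date'] = (raw_df['Birth Date'].str.replace('.', '', regex=False)
                                            .str.replace('Sept ', 'Sep ', regex=False))
# Parse abbreviated month names first, then fall back to full month names
converted_dates = []
for date_str in raw_df['Birth Date']:
    try:
        converted_dates.append(pd.to_datetime(date_str, format='%b %d, %Y'))
    except ValueError:
        converted_dates.append(pd.to_datetime(date_str, format='%B %d, %Y'))

raw_df['Birth Date'] = converted_dates
# Strip the unit suffixes and store height (cm) and weight (kg) as integers
raw_df['Height'] = raw_df['Height'].str.rstrip(' cm').astype(int)
raw_df['Weight'] = raw_df['Weight'].str.rstrip(' kg').astype(int)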



In [7]:
# TEST
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10020 entries, 0 to 10019
Data columns (total 46 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Name                 10020 non-null  object        
 1   Height               10020 non-null  int64         
 2   Weight               10020 non-null  int64         
 3   Preferred Foot       10020 non-null  object        
 4   Birth Date           10020 non-null  datetime64[ns]
 5   Age                  10020 non-null  int64         
 6   Nation               10020 non-null  object        
 7   Preferred Positions  10020 non-null  object        
 8   OVR                  10020 non-null  int64         
 9   POT                  10020 non-null  int64         
 10  Value                9961 non-null   float64       
 11  Wage                 9961 non-null   float64       
 12  Ball Control         10020 non-null  int64         
 13  Dribbling            10020 non-
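
The numeric conversions above depend on two string-cleaning steps: removing a unit suffix and removing separator dots before calling `pd.to_numeric`. The sketch below illustrates both on hypothetical values. It assumes the dots in `Value` act as thousands separators, which the cleaning code implies but the notebook does not state, and it uses `Series.str.removesuffix`, which requires pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical raw strings; the formats in the real file may differ.
height = pd.Series(["195 cm", "182 cm", "170 cm"])
value = pd.Series(["157.000.000", "625.000", "unknown"])

# Remove the literal " cm" suffix. By contrast, str.rstrip(' cm') strips any
# trailing run of the characters ' ', 'c', and 'm', not the exact suffix.
height_cm = height.str.removesuffix(" cm").astype(int)

# Drop the separator dots, then convert; unparseable entries become NaN.
value_num = pd.to_numeric(value.str.replace(".", "", regex=False), errors="coerce")

print(height_cm.tolist())
print(value_num.tolist())
```

Entries that `errors='coerce'` cannot parse turn into `NaN`, which is one way a column can gain missing values during type conversion.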

<h3>Check for missing values and handle them</h3>

In [8]:
missing_ratio=(raw_df.isnull().sum()/len(raw_df)*100).round(2)
missing_ratio

Name                     0.00
Height                   0.00
Weight                   0.00
Preferred Foot           0.00
Birth Date               0.00
Age                      0.00
Nation                   0.00
Preferred Positions      0.00
OVR                      0.00
POT                      0.00
Value                    0.59
Wage                     0.59
Ball Control             0.00
Dribbling                0.00
Marking                100.00
Slide Tackle             0.00
Stand Tackle             0.00
Aggression               0.00
Reactions                0.00
Att. Position            0.00
Interceptions            0.00
Vision                   0.00
Composure                0.00
Crossing                 0.00
Short Pass               0.00
Long Pass                0.00
Acceleration             0.00
Stamina                  0.00
Strength                 0.00
Balance                  0.00
Sprint Speed             0.00
Agility                  0.00
Jumping                  0.00
Heading   

If the percentage of missing values is greater than 75%, the column is dropped from the dataframe.
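
The threshold rule can be sketched on a small hypothetical frame. The names `toy`, `missing_pct`, and `kept` are illustrative; only the 75% cutoff comes from the rule above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is entirely missing, "c" is half missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Keep only the columns at or below the 75% threshold.
kept = toy.loc[:, missing_pct <= 75.0]

print(kept.columns.tolist())
```

Column `b` (100% missing) is removed, while `c` (50% missing) stays and still needs its gaps handled separately.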

In [9]:
# Drop every column whose missing-value percentage exceeds 75%
cols_to_drop = missing_ratio[missing_ratio > 75.0].index
raw_df = raw_df.drop(columns=cols_to_drop)

In [10]:
# TEST
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10020 entries, 0 to 10019
Data columns (total 45 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Name                 10020 non-null  object        
 1   Height               10020 non-null  int64         
 2   Weight               10020 non-null  int64         
 3   Preferred Foot       10020 non-null  object        
 4   Birth Date           10020 non-null  datetime64[ns]
 5   Age                  10020 non-null  int64         
 6   Nation               10020 non-null  object        
 7   Preferred Positions  10020 non-null  object        
 8   OVR                  10020 non-null  int64         
 9   POT                  10020 non-null  int64         
 10  Value                9961 non-null   float64       
 11  Wage                 9961 non-null   float64       
 12  Ball Control         10020 non-null  int64         
 13  Dribbling            10020 non-

After removing the columns with a large share of missing values, the dataframe still contains missing values. We need to fill them so that the affected columns can be used in the analysis.

In [11]:
# For the Value and Wage columns, fill each NaN with the mean of its two neighboring values
# (a NaN stays unfilled when either neighbor is also missing)
raw_df['Value'] = raw_df['Value'].fillna((raw_df['Value'].shift() + raw_df['Value'].shift(-1)) / 2)
raw_df['Wage'] = raw_df['Wage'].fillna((raw_df['Wage'].shift() + raw_df['Wage'].shift(-1)) / 2)
# Display the DataFrame after filling the NaN values
display(raw_df)

Unnamed: 0,Name,Height,Weight,Preferred Foot,Birth Date,Age,Nation,Preferred Positions,OVR,POT,...,Long Shots,Curve,FK Acc.,Penalties,Volleys,GK Positioning,GK Diving,GK Handling,GK Kicking,GK Reflexes
0,Erling Haaland,195,94,Left,2000-07-21,23,Norway,ST,91,94,...,86,77,62,84,90,11,7,14,13,7
1,Kylian Mbappé,182,75,Right,1998-12-20,24,France,"ST, LW",91,94,...,83,80,69,84,85,11,13,5,7,6
2,Kevin De Bruyne,181,75,Right,1991-06-28,32,Belgium,"CM, CAM",91,91,...,92,92,83,83,83,10,15,13,5,13
3,Harry Kane,188,85,Right,1993-07-28,30,England,ST,90,90,...,87,82,65,92,89,14,8,10,11,11
4,Thibaut Courtois,199,96,Left,1992-05-11,31,Belgium,GK,90,90,...,17,19,20,27,12,90,85,89,76,93
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10015,Tim Schreiber,191,87,Right,2002-04-24,21,Germany,GK,65,75,...,6,11,11,18,9,59,69,61,63,70
10016,Calvin Ramsay,177,68,Right,2003-07-31,20,Scotland,"RWB, RB",65,81,...,43,61,54,42,27,9,13,8,9,8
10017,Tomás Salazar,170,60,Right,2000-06-06,23,Colombia,CM,65,74,...,64,49,60,55,38,5,6,11,8,10
10018,Arda Kızıldağ,187,77,Right,1998-10-15,25,Turkey,CB,65,72,...,31,28,33,53,29,13,10,13,14,9


In [12]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10020 entries, 0 to 10019
Data columns (total 45 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Name                 10020 non-null  object        
 1   Height               10020 non-null  int64         
 2   Weight               10020 non-null  int64         
 3   Preferred Foot       10020 non-null  object        
 4   Birth Date           10020 non-null  datetime64[ns]
 5   Age                  10020 non-null  int64         
 6   Nation               10020 non-null  object        
 7   Preferred Positions  10020 non-null  object        
 8   OVR                  10020 non-null  int64         
 9   POT                  10020 non-null  int64         
 10  Value                10013 non-null  float64       
 11  Wage                 10013 non-null  float64       
 12  Ball Control         10020 non-null  int64         
 13  Dribbling            10020 non-
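
The `info()` output still reports 10013 non-null entries for `Value` and `Wage`, so a few gaps survive the neighbor-average fill. The sketch below, on a hypothetical series, shows why: when a missing value has a missing neighbor, the average is itself `NaN`. It also shows `Series.interpolate` as one possible alternative. Whether linear interpolation is appropriate depends on the row order being meaningful, which is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one isolated gap and one run of two gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Neighbor-average fill: mean of the previous and next values.
neighbor_mean = (s.shift() + s.shift(-1)) / 2
filled = s.fillna(neighbor_mean)

# Linear interpolation also fills runs of consecutive gaps.
interpolated = s.interpolate(limit_direction="both")

print(filled.tolist())
print(interpolated.tolist())
```

The isolated gap is filled by both approaches, but only interpolation fills the two consecutive gaps.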

<h3>Group feature</h3>

`raw_df` has so many columns that it is hard to analyze them individually.  
To streamline the dataset and focus the analysis, we merge related columns into groups: **'Ball Skills,' 'Defence,' 'Mental,' 'Passing,' 'Physical,' 'Shooting,' and 'Goalkeeper.'**
Each group is the mean of its member columns, which keeps the data organized and makes player performance easier to compare.
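
The grouping idea can be sketched with a dictionary that maps each group name to its member columns. This is a minimal illustration on a hypothetical two-player frame, not the notebook's exact procedure; the `groups` mapping and the `toy` frame are assumptions made for the example.

```python
import pandas as pd

# Hypothetical two-player frame with a few of the attribute columns.
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Ball Control": [80, 60],
    "Dribbling": [90, 70],
    "Slide Tackle": [40, 66],
    "Stand Tackle": [50, 70],
})

# Map each group name to the columns it averages.
groups = {
    "Ball Skills": ["Ball Control", "Dribbling"],
    "Defence": ["Slide Tackle", "Stand Tackle"],
}

# Add one averaged column per group.
for group_name, members in groups.items():
    toy[group_name] = toy[members].mean(axis=1)

# Drop the source columns by label rather than by position.
source_cols = [col for members in groups.values() for col in members]
toy = toy.drop(columns=source_cols)

print(toy)
```

Dropping by label keeps the result independent of column order, so adding or removing an unrelated column earlier in the pipeline does not change which columns are removed.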

In [13]:
# Two-column groups: average the pair directly
raw_df['Ball Skills'] = (raw_df['Ball Control'] + raw_df['Dribbling']) / 2
raw_df['Defence'] = (raw_df['Slide Tackle'] + raw_df['Stand Tackle']) / 2
# Larger groups: average every column in the label range (both ends inclusive)
raw_df['Mental'] = raw_df.loc[:, 'Aggression':'Composure'].mean(axis=1)
raw_df['Passing'] = raw_df.loc[:, 'Crossing':'Long Pass'].mean(axis=1)
raw_df['Physical'] = raw_df.loc[:, 'Acceleration':'Jumping'].mean(axis=1)
raw_df['Shooting'] = raw_df.loc[:, 'Heading':'Volleys'].mean(axis=1)
raw_df['Goalkeeper'] = raw_df.loc[:, 'GK Positioning':'GK Reflexes'].mean(axis=1)
# Drop the columns at positions 11 through 42 (`Wage` through `GK Handling`)
raw_df.drop(columns=raw_df.columns[11:43], inplace=True)


In [14]:
raw_df = round(raw_df, 1)
raw_df

Unnamed: 0,Name,Height,Weight,Preferred Foot,Birth Date,Age,Nation,Preferred Positions,OVR,POT,Value,GK Kicking,GK Reflexes,Ball Skills,Defence,Mental,Passing,Physical,Shooting,Goalkeeper
0,Erling Haaland,195,94,Left,2000-07-21,23,Norway,ST,91,94,157000000.0,13,7,80.5,38.0,80.2,59.0,83.7,84.0,10.4
1,Kylian Mbappé,182,75,Right,1998-12-20,24,France,"ST, LW",91,94,153500000.0,7,6,92.5,33.0,76.7,78.3,89.0,82.2,8.4
2,Kevin De Bruyne,181,75,Right,1991-06-28,32,Belgium,"CM, CAM",91,91,103000000.0,5,13,89.0,61.5,84.0,94.3,75.7,83.1,11.2
3,Harry Kane,188,85,Right,1993-07-28,30,England,ST,90,90,119500000.0,11,11,84.5,42.0,81.3,85.0,75.9,86.5,10.8
4,Thibaut Courtois,199,96,Left,1992-05-11,31,Belgium,GK,90,90,63000000.0,76,93,18.0,17.0,41.5,27.3,54.0,22.4,86.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10015,Tim Schreiber,191,87,Right,2002-04-24,21,Germany,GK,65,75,625000.0,63,70,16.5,12.5,30.0,31.3,40.4,14.9,64.4
10016,Calvin Ramsay,177,68,Right,2003-07-31,20,Scotland,"RWB, RB",65,81,700000.0,9,8,66.0,58.0,58.7,60.7,70.7,47.0,9.4
10017,Tomás Salazar,170,60,Right,2000-06-06,23,Colombia,CM,65,74,825000.0,8,10,65.0,58.5,60.3,59.0,58.4,53.9,8.0
10018,Arda Kızıldağ,187,77,Right,1998-10-15,25,Turkey,CB,65,72,725000.0,14,9,48.0,67.5,55.2,52.0,63.1,40.8,11.8


<h3>What does each line mean? Does it matter if the lines have different meanings?</h3>

Each line in the data holds the information for one football (soccer) player.  

Together, the columns of a line give an overview of that player's characteristics and abilities. Every line has the same meaning (one player); the differences among the lines lie in their values and reflect the diversity of player positions, attributes, and roles within a football team. The dataset captures a range of player types, from strikers (ST) and midfielders (CM, CAM) to goalkeepers (GK), each with its own set of skills and attributes. 

Analyzing this data allows for comparisons between players, understanding their strengths and weaknesses, and making informed decisions, such as scouting or selecting players for specific positions on a team.

<h3>Does the raw data have duplicate rows?</h3>

In [15]:
# Count the rows that exactly repeat an earlier row
num_duplicated_rows = int(raw_df.duplicated().sum())

In [16]:
# TEST
if num_duplicated_rows == 0:
    print("Your raw data has no duplicated lines!")
else:
    if num_duplicated_rows > 1:
        ext = "lines"
    else:
        ext = "line"
    print(f"Your raw data has {num_duplicated_rows} duplicated {ext}. Please deduplicate your raw data!")

Your raw data has no duplicated lines!


In [17]:
# Deduplicate raw data
raw_df=raw_df.drop_duplicates()
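
`duplicated()` and `drop_duplicates()` compare entire rows by default, so two records for the same player that differ in a single column are not flagged. The sketch below shows the difference on a hypothetical frame. Treating `Name` plus `Birth Date` as the key that identifies a player is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical frame: the first two rows describe the same player
# but differ in OVR, so they are not exact duplicates.
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Birth Date": ["2000-01-01", "2000-01-01", "1999-05-05"],
    "OVR": [80, 81, 75],
})

# Exact duplicates across all columns.
full_dupes = int(toy.duplicated().sum())

# Duplicates with respect to the assumed identifying key.
key_dupes = int(toy.duplicated(subset=["Name", "Birth Date"]).sum())

# Keep the first record for each key.
deduped = toy.drop_duplicates(subset=["Name", "Birth Date"], keep="first")

print(full_dupes, key_dupes, len(deduped))
```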

<h3>What does each column mean?</h3>

<div>
    
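**Name**: The player's name.

**Height**: The player's height in centimeters.
    
**Weight**: The player's weight in kilograms.

**Preferred Foot**: The foot that the player prefers to use (Left or Right).

**Birth Date**: The player's date of birth.

**Age**: The player's current age.

**Nation**: The player's nation (for example, Norway or France).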

**Preferred Positions**: The positions on the field that the player prefers to play.

**OVR (Overall)**: The player's overall skill level or rating.

**POT (Potential)**: The player's potential skill level or rating.

**Value**: The estimated market value of the player.

**Wage**: The player's weekly wage.

**Ball Skills**: Composite score representing ball control and dribbling skills.

**Defence**: Composite score representing defensive skills, combining slide tackle and stand tackle.

**Mental**: The average of attributes related to mental skills (Aggression to Composure).

**Passing**: The average of attributes related to passing skills (Crossing to Long Pass).

**Physical**: The average of attributes related to physical skills (Acceleration to Jumping).

**Shooting**: The average of attributes related to shooting skills (Heading to Volleys).

**Goalkeeper**: The average of attributes related to goalkeeping skills.
</div>

<h3>Save your processed data</h3>

In [18]:
raw_df.to_csv("data_footballer_processed.csv", index=False)
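
CSV files store dates as plain text, so the datetime type of `Birth Date` is not preserved when the processed file is read back. The sketch below demonstrates the round trip on a hypothetical frame, using an in-memory buffer in place of a file on disk; passing `parse_dates` to `pd.read_csv` restores the datetime type.

```python
import io
import pandas as pd

# Hypothetical processed frame with a datetime column.
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Birth Date": pd.to_datetime(["2000-07-21", "1998-12-20"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without parse_dates leaves the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates converts the column back to datetime.
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["Birth Date"])

print(plain["Birth Date"].dtype)
print(typed["Birth Date"].dtype)
```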