# **FIFA 23 Players Dataset: Data Cleansing and Transformation**

<p align="center">
  <img src="https://i.ytimg.com/vi/Sa0E9Ze2Q7c/maxresdefault.jpg" alt="FIFA 23 Players" width="650" height="400">
</p>

This notebook focuses on cleaning and transforming the FIFA 23 Players dataset obtained from Kaggle. It includes steps for data cleaning, handling missing values, and basic transformations.

---

## Setup

First, let's set up the environment by installing necessary libraries, uploading Kaggle API keys, and downloading the dataset.

```python
# Install necessary libraries
!pip install kaggle

# Import required modules
from google.colab import files
import os

# Upload Kaggle API JSON file
uploaded = files.upload()

# Move and configure Kaggle API JSON file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download and extract dataset
!kaggle datasets download -d stefanoleone992/fifa-23-complete-player-dataset
!unzip -o fifa-23-complete-player-dataset.zip -d fifa_dataset
```

## Data Loading and Exploration

Next, load the players and teams tables and check their shapes. Rather than reading every row, the `skiprows` callable keeps only every 100th line of each CSV file; line 0, the header, is always kept.

```python
import pandas as pd

# Keep every `fraction`-th line of each file (line 0, the header, always survives)
fraction = 100
skip = lambda x: x % fraction != 0
players = pd.read_csv('/content/fifa_dataset/male_players.csv', skiprows=skip)
teams = pd.read_csv('/content/fifa_dataset/male_teams.csv', skiprows=skip)

# Display the shape (rows, columns) of each dataset
print('Players Shape:', players.shape)
print('Teams Shape:', teams.shape)
```

```text
Players Shape: (200071, 110)
Teams Shape: (3850, 54)
```
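
To see which lines a `skiprows` callable keeps, here is a minimal sketch on a tiny in-memory CSV. The sample data, the `step` value, and the variable names are made up for illustration; only `pd.read_csv` and its `skiprows` parameter come from the cell above.

```python
import io

import pandas as pd

# Hypothetical 10-row CSV standing in for male_players.csv
csv_text = "player_id,overall\n" + "\n".join(f"{i},{60 + i}" for i in range(1, 11))

step = 3  # illustrative value: keep every 3rd line of the file

# read_csv calls the function with each line index (0 is the header line);
# lines for which it returns True are skipped.
sample = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % step != 0)

print(sample)
```

Because line 0 always satisfies `i % step == 0`, the header survives. The result is a systematic sample (every Nth row), not a random one, so any ordering in the file carries over into the sample.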


## Determining Sample Size and Attributes

The shape of each DataFrame gives the number of samples (rows) and attributes (columns).

```python
# Number of samples (rows) and attributes (columns) in each dataset
print('Players:', players.shape)
print('Teams:', teams.shape)

# Number of player rows
players.shape[0]
```

```text
Players: (200071, 110)
Teams: (3850, 54)

200071
```

## Missing Values Analysis

Now count the missing values in each column of both datasets to gauge their impact on data quality.

```python
missing_data1 = players.isnull().sum()
missing_data2 = teams.isnull().sum()

# Display the number of missing values per column in each dataset
print('Players Missing Data:', missing_data1)
print('Teams Missing Data:', missing_data2)
```

```text
Players Missing Data: player_id           0
player_url          0
fifa_version        0
fifa_update         0
fifa_update_date    0
                   ..
cb                  0
rcb                 0
rb                  0
gk                  0
player_face_url     0
Length: 110, dtype: int64
Teams Missing Data: team_id                           0
team_url                          0
fifa_version                      0
fifa_update                       0
fifa_update_date                  0
team_name                         0
league_id                         0
league_name                       0
league_level                    244
nationality_id                    0
nationality_name                  0
overall                           0
attack                            0
midfield                          0
defence                           0
coach_id                         30
home_stadium                      2
rival_team                        0
international_prestige            0
domest
```
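
Raw counts are hard to compare across tables of different sizes, so a percentage view can help. The sketch below is one way to compute it, using a small made-up DataFrame in place of `teams`; the values and the `missing_pct` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps, standing in for `teams`
df = pd.DataFrame({
    "team_id": [1, 2, 3, 4],
    "league_level": [1.0, np.nan, np.nan, 2.0],
    "coach_id": [10.0, 11.0, np.nan, 13.0],
})

# isnull().mean() is the fraction of missing entries per column
missing_pct = df.isnull().mean() * 100

# Keep only columns that actually have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Filtering to non-zero columns also avoids the truncated listing that appears when a long Series is printed in full.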

## Handling Missing Values

Based on this analysis, drop every column of `players` that has more than 1,000 missing values. The threshold is an absolute count of missing entries, not a percentage.

```python
threshold = 1000  # Maximum number of missing values allowed per column
columns_to_drop = players.columns[players.isnull().sum() > threshold]

players_cleaned = players.drop(columns=columns_to_drop)
```
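
An absolute threshold depends on how many rows were sampled. As an alternative, the cutoff can be expressed as a share of rows. The following is a sketch under assumptions: the toy data, the column names, and the `max_missing_share` value are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `players`
df = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5],
    "overall": [80, 75, np.nan, 70, 68],                       # 20% missing
    "release_clause": [np.nan, np.nan, np.nan, 1e6, np.nan],   # 80% missing
})

max_missing_share = 0.5  # hypothetical cutoff: drop columns more than half empty

# Fraction of missing entries per column
missing_share = df.isnull().mean()
columns_to_drop = df.columns[missing_share > max_missing_share]
cleaned = df.drop(columns=columns_to_drop)

print("Dropped:", list(columns_to_drop))
print("Kept:", list(cleaned.columns))
```

Printing the dropped column names before continuing makes it easy to confirm that no attribute needed later has been removed.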


## Data Type Inspection and Conversion

Inspect the data types of `height_cm` and `weight_kg`, then convert `fifa_update_date` to a datetime type for further analysis.

```python
print(players_cleaned[['height_cm','weight_kg']].dtypes)

# Convert 'fifa_update_date' column to datetime format
players_cleaned['fifa_update_date'] = pd.to_datetime(players_cleaned['fifa_update_date'])
print(players_cleaned['fifa_update_date'].dtype)
```

```text
height_cm    int64
weight_kg    int64
dtype: object
datetime64[ns]
```
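
If some date strings might be malformed, `pd.to_datetime` can turn them into `NaT` instead of raising an error. The sketch below assumes ISO-style `YYYY-MM-DD` strings; the sample values and the explicit `format` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately invalid
raw = pd.Series(["2023-01-13", "2014-08-29", "not a date"])

# errors="coerce" converts unparseable strings to NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Counting the `NaT` values after conversion shows how many rows would need attention before any date-based step.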


## Splitting Date Components

Split `fifa_update_date` into year and month columns for temporal analysis.

```python
# Split 'fifa_update_date' into year and month columns
players_cleaned['fifa_update_year'] = players_cleaned['fifa_update_date'].dt.year
players_cleaned['fifa_update_month'] = players_cleaned['fifa_update_date'].dt.month

# Display the updated DataFrame with new columns for year and month
print(players_cleaned[['fifa_update_year', 'fifa_update_month']])
```

```text
        fifa_update_year  fifa_update_month
0                   2023                  1
1                   2023                  1
2                   2023                  1
3                   2023                  1
4                   2023                  1
...                  ...                ...
200066              2014                  8
200067              2014                  8
200068              2014                  8
200069              2014                  8
200070              2014                  8

[200071 rows x 2 columns]
```
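
One possible use of the year and month columns is counting rows per update period. This is a minimal sketch on made-up dates; the `rows_per_period` and `n_rows` names are hypothetical, and the same `groupby` call would apply to `players_cleaned`.

```python
import pandas as pd

# Hypothetical update dates standing in for players_cleaned['fifa_update_date']
df = pd.DataFrame({
    "fifa_update_date": pd.to_datetime(
        ["2023-01-13", "2023-01-13", "2014-08-29", "2014-09-05"]
    )
})
df["fifa_update_year"] = df["fifa_update_date"].dt.year
df["fifa_update_month"] = df["fifa_update_date"].dt.month

# Count how many rows fall into each (year, month) pair
rows_per_period = (
    df.groupby(["fifa_update_year", "fifa_update_month"])
      .size()
      .rename("n_rows")
)
print(rows_per_period)
```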


## Conclusion

This notebook covered initial data exploration, missing-value handling, data type conversion, and the extraction of year and month components from the update date. The cleaned dataset can serve as the basis for further analysis and modeling.