# FIFA 23 Official Dataset 
## Step 1: Load and Explore the Dataset
In this step, we will:
- Download the dataset from Kaggle `https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database/data?select=FIFA23_official_data.csv`
- Set up our environment
- Load the dataset into a Pandas DataFrame
- Explore the dataset to understand its structure


### Step 1.1: Install Required Libraries
Run the following command to make sure the necessary Python libraries are installed:

In [3]:
!pip install pandas numpy matplotlib seaborn



### Results:
The necessary libraries (`pandas`, `numpy`, `matplotlib`, and `seaborn`) are now installed.

### Step 1.2: Load the Dataset
Use Pandas to load the CSV file into a DataFrame.
If you run this locally, update `file_path` to the location of your downloaded dataset.

In [4]:
import pandas as pd

# Update this path with the location of your downloaded dataset
file_path = "FIFA23_official_data.csv"

# Load the dataset
fifa_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
fifa_data.head()


Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,...,Real Face,Position,Joined,Loaned From,Contract Valid Until,Height,Weight,Release Clause,Kit Number,Best Overall Rating
0,209658,L. Goretzka,27,https://cdn.sofifa.net/players/209/658/23_60.png,Germany,https://cdn.sofifa.net/flags/de.png,87,88,FC Bayern München,https://cdn.sofifa.net/teams/21/30.png,...,Yes,"<span class=""pos pos28"">SUB","Jul 1, 2018",,2026,189cm,82kg,€157M,8.0,
1,212198,Bruno Fernandes,27,https://cdn.sofifa.net/players/212/198/23_60.png,Portugal,https://cdn.sofifa.net/flags/pt.png,86,87,Manchester United,https://cdn.sofifa.net/teams/11/30.png,...,Yes,"<span class=""pos pos15"">LCM","Jan 30, 2020",,2026,179cm,69kg,€155M,8.0,
2,224334,M. Acuña,30,https://cdn.sofifa.net/players/224/334/23_60.png,Argentina,https://cdn.sofifa.net/flags/ar.png,85,85,Sevilla FC,https://cdn.sofifa.net/teams/481/30.png,...,No,"<span class=""pos pos7"">LB","Sep 14, 2020",,2024,172cm,69kg,€97.7M,19.0,
3,192985,K. De Bruyne,31,https://cdn.sofifa.net/players/192/985/23_60.png,Belgium,https://cdn.sofifa.net/flags/be.png,91,91,Manchester City,https://cdn.sofifa.net/teams/10/30.png,...,Yes,"<span class=""pos pos13"">RCM","Aug 30, 2015",,2025,181cm,70kg,€198.9M,17.0,
4,224232,N. Barella,25,https://cdn.sofifa.net/players/224/232/23_60.png,Italy,https://cdn.sofifa.net/flags/it.png,86,89,Inter,https://cdn.sofifa.net/teams/44/30.png,...,Yes,"<span class=""pos pos13"">RCM","Sep 1, 2020",,2026,172cm,68kg,€154.4M,23.0,


### Results:
The first five rows of the dataset are displayed, giving an initial view of the structure and the type of data available, including columns like player names, age, and more.


### Step 1.3: Explore the Dataset
Print the shape of the dataset to see the number of rows and columns, then list the column names to get a sense of the dataset's structure.


In [5]:
# Check the shape of the dataset
print(f"Dataset Shape: {fifa_data.shape}")

# View the columns in the dataset
print("Columns:")
print(fifa_data.columns)


Dataset Shape: (17660, 29)
Columns:
Index(['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall',
       'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Joined', 'Loaned From', 'Contract Valid Until', 'Height', 'Weight',
       'Release Clause', 'Kit Number', 'Best Overall Rating'],
      dtype='object')


### Results:
- `Dataset Shape`: Shows the total number of rows and columns, `(17660, 29)`.
- `Columns`: Displays the names of all columns, which represent player attributes and performance metrics.
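
Printing `columns` lists the names but not the type of each column. A minimal sketch of how the types could be inspected is shown here on a small toy frame; `sample` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy frame standing in for fifa_data; the values are illustrative only
sample = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Age": [27, 31],
    "Height": ["189cm", "181cm"],
})

# One dtype per column; text columns typically show up as object
# (newer pandas versions may report a dedicated string dtype instead)
print(sample.dtypes)

# Split the columns into numeric ones and everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Other:", other_cols)
```

The same two calls should work on `fifa_data` directly. Columns such as `Height` land in the non-numeric group because their values carry a unit suffix.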


### Step 1.4: Summary of the Dataset
Generate a quick summary of the dataset, including the count of non-null values, data types, and more.


In [6]:
# Get a summary of the dataset
fifa_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17660 entries, 0 to 17659
Data columns (total 29 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        17660 non-null  int64  
 1   Name                      17660 non-null  object 
 2   Age                       17660 non-null  int64  
 3   Photo                     17660 non-null  object 
 4   Nationality               17660 non-null  object 
 5   Flag                      17660 non-null  object 
 6   Overall                   17660 non-null  int64  
 7   Potential                 17660 non-null  int64  
 8   Club                      17449 non-null  object 
 9   Club Logo                 17660 non-null  object 
 10  Value                     17660 non-null  object 
 11  Wage                      17660 non-null  object 
 12  Special                   17660 non-null  int64  
 13  Preferred Foot            17660 non-null  object 
 14  Intern

### Results:
The `.info()` method provides:
- The count of non-null values for each column.
- The data type of each column (e.g., `int64`, `float64`, or `object` for text or mixed-type data).
- A quick overview of potential missing values in the dataset.


### Step 1.5: Preview Statistical Summary
Display basic statistical details like mean, median, and standard deviation for numerical columns.


In [8]:
# Display statistical summary for numerical columns
fifa_data.describe()


Unnamed: 0,ID,Age,Overall,Potential,Special,International Reputation,Weak Foot,Skill Moves,Kit Number
count,17660.0,17660.0,17660.0,17660.0,17660.0,17660.0,17660.0,17660.0,17625.0
mean,246319.424462,23.127746,63.369592,70.9812,1537.915855,1.106285,2.90034,2.297169,25.037957
std,31487.892861,4.639821,8.036268,6.529836,285.893809,0.407021,0.663523,0.754264,19.154116
min,16.0,15.0,43.0,42.0,749.0,1.0,1.0,1.0,1.0
25%,240732.5,20.0,58.0,67.0,1387.0,1.0,3.0,2.0,11.0
50%,257041.0,22.0,63.0,71.0,1548.0,1.0,3.0,2.0,22.0
75%,263027.5,26.0,69.0,75.0,1727.0,1.0,3.0,3.0,32.0
max,271340.0,54.0,91.0,95.0,2312.0,5.0,5.0,5.0,99.0


### Results:
The `.describe()` method outputs statistical details for numerical columns:
- Count: Total non-missing values.
- Mean: Average of the column.
- Std: Standard deviation (variation in data).
- Min, 25%, 50%, 75%, Max: Percentile values for data distribution.
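
By default `.describe()` skips text columns. As a hedged sketch, the non-numeric columns can be summarized by excluding the numeric ones; the toy frame `sample` and its club names are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for fifa_data; the values are illustrative only
sample = pd.DataFrame({
    "Club": ["Club A", "Club A", "Club B"],
    "Preferred Foot": ["Right", "Right", "Left"],
    "Age": [27, 30, 25],
})

# Excluding numeric columns makes describe() report count, unique,
# top (most frequent value) and freq (how often it occurs) instead
text_summary = sample.describe(exclude="number")
print(text_summary)
```

On the full dataset this gives a quick view of how many distinct clubs, nationalities, or positions are present.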


## Step 2: Data Cleaning and Preparation

Now that we have loaded and explored the dataset, the next step is to clean and prepare it for analysis. Here's what we'll do:

- Handle Missing Values: Identify and address missing or null values.
- Check for Duplicates: Remove duplicate rows if any exist.
- Standardize Column Names: Ensure column names are clean, readable, and consistent.
- Inspect Data Types: Convert data types where necessary for accurate analysis.
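
The duplicate check and the column-name cleanup from the list above could look roughly like the sketch below. It runs on a toy frame (`sample` is an assumption), and lowercase snake_case is just one possible naming convention. Renaming the real columns would mean updating every later cell that refers to names such as `"Body Type"`.

```python
import pandas as pd

# Toy frame with one fully duplicated row; the values are illustrative only
sample = pd.DataFrame({
    "ID": [1, 2, 2],
    "Preferred Foot": ["Right", "Left", "Left"],
    "Contract Valid Until": ["2026", "2024", "2024"],
})

# Count rows that repeat an earlier row exactly, then drop them
duplicate_count = int(sample.duplicated().sum())
deduped = sample.drop_duplicates().reset_index(drop=True)
print(f"Duplicate rows removed: {duplicate_count}")

# Standardize column names: trim, lowercase, spaces to underscores
deduped.columns = deduped.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(deduped.columns))
```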


### Step 2.1: Identify Missing Values
Check for missing values in the dataset to identify columns that need cleaning or imputation.


In [19]:
# Check for missing values
missing_values = fifa_data.isnull().sum()

# Display columns with missing values
print("Number of Columns with missing values:",len(missing_values[missing_values > 0]))
print("Columns with missing values:")
print(missing_values[missing_values > 0])

Number of Columns with missing values: 10
Columns with missing values:
Club                      211
Body Type                  38
Real Face                  38
Position                   35
Joined                   1098
Loaned From             16966
Contract Valid Until      361
Release Clause           1151
Kit Number                 35
Best Overall Rating     17639
dtype: int64


### Results:
Ten columns have missing values, including `Club` and `Body Type`. Two stand out: `Best Overall Rating` (17639 missing) and `Loaned From` (16966 missing) are empty for almost every row.
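
Raw counts are easier to compare as a share of all rows. From the output above, 17639 of 17660 rows is about 99.9% and 16966 of 17660 is about 96.1%. A minimal sketch of that calculation on a toy frame (`sample` and its values are assumptions):

```python
import pandas as pd

# Toy frame standing in for fifa_data; the values are illustrative only
sample = pd.DataFrame({
    "Club": ["Inter", None, "Sevilla FC", "Manchester City"],
    "Loaned From": [None, None, None, "Some Club"],
    "Age": [25, 30, 27, 31],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = sample.isnull().mean().mul(100).round(1)

# Keep only columns that have gaps, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```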


### Step 2.2: Handle Missing Values
Based on the results from Step 2.1, each column needs its own strategy for missing values, because the columns hold different kinds of data.


### Handle missing values for each column
#### `Club (211 missing)`
Missing for free agents or players without a club. Replace with "Free Agent".

In [23]:
fifa_data["Club"] = fifa_data["Club"].fillna("Free Agent")

#### `Body Type (38 missing)`
This column represents the physical build of a player. Replace missing values with the most frequent value (the mode).

In [25]:
fifa_data["Body Type"] = fifa_data["Body Type"].fillna(fifa_data["Body Type"].mode()[0])

#### `Real Face (38 missing)`
Indicates whether the player's face is scanned into the game. Replace missing values with "No".

In [26]:
fifa_data["Real Face"] = fifa_data["Real Face"].fillna("No")

#### `Position (35 missing)`
Represents a player's field position. Replace missing values with "Unknown": the mode is a poor fit here because position varies widely from one player to another and is an important factor in the analysis.

In [27]:
fifa_data["Position"] = fifa_data["Position"].fillna("Unknown")
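
The preview in Step 1.2 shows `Position` values that still carry an HTML fragment, for example `<span class="pos pos28">SUB`. A possible cleanup, sketched on a standalone Series (`positions` is an assumption built from those preview values), strips anything that looks like a tag before filling the gaps:

```python
import pandas as pd

# Values copied from the preview rows, plus one missing entry
positions = pd.Series([
    '<span class="pos pos28">SUB',
    '<span class="pos pos15">LCM',
    None,
])

# Remove anything between < and >, trim whitespace, then fill the gaps
clean_positions = (
    positions.str.replace(r"<[^>]*>", "", regex=True)
    .str.strip()
    .fillna("Unknown")
)
print(clean_positions.tolist())
```

Whether every row in the real file follows this pattern is not confirmed here, so it is worth checking `fifa_data["Position"].unique()` after applying it.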

#### `Joined (1098 missing)`
The date when the player joined the current club. Replace missing values with "Not Available".

In [28]:
fifa_data["Joined"] = fifa_data["Joined"].fillna("Not Available")
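
Filling with a text placeholder keeps `Joined` as a string column. If the dates are needed for calculations later, one alternative (not what this notebook does) is to parse them and let missing entries become `NaT`. The sketch assumes the `Jul 1, 2018` style seen in the preview and English month abbreviations; `joined` is a toy Series.

```python
import pandas as pd

# Two dates in the format shown in the preview, plus one missing entry
joined = pd.Series(["Jul 1, 2018", "Jan 30, 2020", None])

# errors="coerce" turns anything unparseable or missing into NaT
joined_dates = pd.to_datetime(joined, format="%b %d, %Y", errors="coerce")

# A boolean flag keeps the "no join date" information explicit
has_join_date = joined_dates.notna()
print(joined_dates)
print(has_join_date.tolist())
```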

#### `Loaned From (16966 missing)`
The parent club for loaned players. Replace missing values with "Not Loaned".

In [29]:
fifa_data["Loaned From"] = fifa_data["Loaned From"].fillna("Not Loaned")

#### `Contract Valid Until (361 missing)`
Indicates the contract's end date. Missing values might mean players without a club or undefined contracts.
Replace with a default value such as "Not Available".

In [30]:
fifa_data["Contract Valid Until"] = fifa_data["Contract Valid Until"].fillna("Not Available")

#### `Release Clause (1151 missing)`
Indicates the buyout clause value. Missing for free agents or players without a release clause.
Replace with 0 (indicating no release clause).

In [32]:
fifa_data["Release Clause"] = fifa_data["Release Clause"].fillna(0)
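
The existing values in this column are strings such as `€157M`, so after the fill it holds a mix of text and the number 0. A hedged sketch of a converter to plain euro amounts follows. `parse_euro_amount` is a hypothetical helper, and the `K` suffix is an assumption, since only `M` values appear in the preview.

```python
import math

def parse_euro_amount(value):
    """Convert values such as '€157M' or '€500K' to a float number of euros."""
    # Missing values count as "no release clause"
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0.0
    # Numeric placeholders such as 0 pass through unchanged
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = str(value).replace("€", "").strip()
    if not cleaned:
        return 0.0
    multipliers = {"K": 1_000, "M": 1_000_000}  # "K" is assumed, see lead-in
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)

print(parse_euro_amount("€157M"))
print(parse_euro_amount(0))
```

It could be applied with `fifa_data["Release Clause"].map(parse_euro_amount)`, which would make the column numeric.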

#### `Kit Number (35 missing)`
Missing for players without a club or with an undefined position.
The kit number matters little for this analysis, so replace it with a placeholder value (e.g., 0 or -1). The code below uses 0.

In [34]:
fifa_data["Kit Number"] = fifa_data["Kit Number"].fillna(0)
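
The preview and the statistical summary show `Kit Number` stored as a float (`8.0`, `19.0`), which is how pandas represents an integer column that contains gaps. Once the gaps are filled, the column can be cast to an integer type. A small sketch on a toy Series (`kit` is an assumption):

```python
import pandas as pd

# Float kit numbers as in the preview, plus one missing entry
kit = pd.Series([8.0, 19.0, None])

# Fill the gap with the placeholder first, then cast to integers
kit_clean = kit.fillna(0).astype(int)
print(kit_clean.tolist())
print(kit_clean.dtype)
```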

#### `Best Overall Rating (17639 missing)`
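This could be a critical column for analysis. Missing values suggest players without a defined rating.
Because the rating is essential for our analysis, drop the rows with missing values in this column. Note that only 21 of the 17660 rows (17660 - 17639) have a value here, so this step keeps just those 21 players.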

In [35]:
fifa_data = fifa_data.dropna(subset=["Best Overall Rating"])
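
Because this column is almost entirely empty, the choice between dropping rows and dropping the column changes the size of the result a great deal. The sketch below contrasts the two on a toy frame (`sample` and its values are assumptions). Which option suits the analysis depends on whether the rating or the full player list matters more.

```python
import pandas as pd

# Toy frame where only one player has a Best Overall Rating
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Overall": [87, 86, 85, 91],
    "Best Overall Rating": [None, None, None, 90.0],
})

# Option 1 (used above): keep only rows that have a rating
rows_kept = sample.dropna(subset=["Best Overall Rating"])

# Option 2: keep every player and drop the sparse column instead
column_dropped = sample.drop(columns=["Best Overall Rating"])

print("Drop rows:", rows_kept.shape)
print("Drop column:", column_dropped.shape)
```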

### Step 2.3: Confirm Null Values are Handled
Verify that all null values have been addressed and ensure the dataset is ready for further analysis.


In [37]:
# Check for any remaining missing values
remaining_nulls = fifa_data.isnull().sum().sum()

if remaining_nulls == 0:
    print("All missing values have been successfully handled. No null values remain in the dataset!")
else:
    print(f"There are still {remaining_nulls} missing values remaining. Please review the data cleaning process.")


All missing values have been successfully handled. No null values remain in the dataset!
