## FIFA '21 Player Ratings

Here, you have a messy, raw dataset of player ratings from FIFA 21, EA Sports' installment in its hit FIFA series, scraped from sofifa.com.

Your task in this project is to clean up this dataset using the `Hints` provided in the ReadMe file.

In [12]:
import pandas as pd
import numpy as np
import os
%matplotlib inline
import re


**Question 1:** Load the fifa dataset into a variable called `fifa_dataset`. Next, write a function called `check_data` to check if the `fifa_dataset` is empty or not by returning True or False.

In [13]:
fifa_dataset = pd.read_csv('fifa_21_raw_data.csv', engine='python')

def check_data(fifa_dataset):
    return fifa_dataset.empty
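
As a quick sanity check, the sketch below exercises `DataFrame.empty` on two toy frames. The frames are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def check_data(dataset):
    # True when the frame has no rows or no columns, False otherwise
    return dataset.empty

# Hypothetical toy frames, for illustration only
empty_frame = pd.DataFrame(columns=["Name", "Club"])
small_frame = pd.DataFrame({"Name": ["L. Messi"], "Club": ["FC Barcelona"]})

empty_result = check_data(empty_frame)
small_result = check_data(small_frame)
print(empty_result, small_result)
```

Note that `True` here means "the dataset is empty", so a successfully loaded file should give `False`.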



**Question 2:** Remove all duplicate player names (the `Name` column) from the dataset. Remember, you need to keep the first occurrence and remove the others.

In [14]:
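def remove_duplicate(dataset):
    # Modifies the frame in place. With inplace=True, drop_duplicates returns None,
    # so there is nothing useful to return from this function.
    dataset.drop_duplicates(subset="Name", keep="first", inplace=True)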

# Call the function to remove duplicates
remove_duplicate(fifa_dataset)
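
One detail of `drop_duplicates` is worth checking on a small example: with `inplace=True` the call modifies the frame and returns `None`, while without it a new frame is returned. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one repeated name
players = pd.DataFrame({
    "Name": ["A. Player", "B. Player", "A. Player"],
    "Age": [21, 30, 25],
})

# With inplace=True the (copied) frame is modified and the call itself returns None
in_place_result = players.copy().drop_duplicates(subset="Name", keep="first", inplace=True)

# Without inplace, a new de-duplicated frame is returned and the first occurrence is kept
deduplicated = players.drop_duplicates(subset="Name", keep="first")

print(in_place_result)
print(deduplicated)
```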



**Question 3:** Remove all leading whitespace and newline characters (`"\n"`) from the `Club` column. For example, `"\n\n\n\nFC Barcelona"` should be transformed into `"FC Barcelona"`.



In [15]:
def remove_newline(dataset):
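    # strip() with no argument removes spaces, tabs and newlines at both ends;
    # strip('\n') would leave other whitespace in place
    dataset['Club'] = dataset['Club'].str.strip()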

# Call the function to remove newline characters from the "Club" column
remove_newline(fifa_dataset)
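
The difference between `str.strip('\n')` and `str.strip()` matters when values carry spaces as well as newlines. A minimal sketch, using made-up sample values rather than rows from the dataset:

```python
import pandas as pd

# Hypothetical sample values mimicking a raw "Club" column
clubs = pd.Series(["\n\n\n\nFC Barcelona", " \n Juventus", "Manchester City\n"])

newline_only = clubs.str.strip("\n")  # removes only newline characters at the ends
all_whitespace = clubs.str.strip()    # removes spaces, tabs and newlines at the ends

print(newline_only.tolist())
print(all_whitespace.tolist())
```

The second value keeps its leading space and newline under `strip("\n")` because the string starts with a space, not a newline.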


**Question 4:** List the top 5 countries with the most players in the data. Create a dictionary called `top_5_country` using the country names as the keys and the number of players as the values. For example,

`top_5_country` = {'Kenya': 15,
                  'Nigeria': 45,
                  'Ghana': 14,
                  'Canada': 20,
                  'USA': 88
                 }


In [16]:
# Count the number of players from each country
country_counts = fifa_dataset['Nationality'].value_counts()

# Get the top 5 countries and convert to dict
top_5_country = country_counts.head(5).to_dict()

# Print the dictionary
top_5_country


{'England': 1574,
 'Germany': 1178,
 'Spain': 1019,
 'France': 971,
 'Argentina': 861}
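
To put such counts in proportion, `value_counts(normalize=True)` gives each country's share of all players. The sketch below uses hypothetical nationalities, not the real dataset.

```python
import pandas as pd

# Hypothetical nationalities, for illustration only
nationalities = pd.Series(
    ["England", "Spain", "England", "Brazil", "Spain", "England", "Ghana"],
    name="Nationality",
)

counts = nationalities.value_counts()                # sorted, most frequent first
shares = nationalities.value_counts(normalize=True)  # same order, as fractions of the total

top_2_country = counts.head(2).to_dict()
top_2_share = shares.head(2).round(3).to_dict()

print(top_2_country)
print(top_2_share)
```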

<!-- BEGIN QUESTION -->

**Question 5:** What insight can you derive from this data?

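Some of the insights that I can draw from this data are:
1. Players from England, Germany, Spain, France and Argentina are the most represented. `Nationality` records a player's country, not the league he plays in. Together these five countries account for 5,603 of the 17,920 players left after removing duplicates (about 31%), so they form the largest group but not a majority.
2. A notable share of players (2,361 rows) are missing their `Hits` data.


The attempts below follow the additional hints given in the README.

Additional hint: convert the height to numerical values
<!-- END QUESTION -->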



In [17]:

# Function to convert height to inches
def height_to_inches(height):
    # Extract the numeric part using a regular expression
    match = re.search(r'(\d+)\'(\d+)', height)
    if match:
        feet, inches = map(int, match.groups())
        return feet * 12 + inches
    else:
        # Handle the case where the height is in centimeters (e.g., "170cm")
        match_cm = re.search(r'(\d+)cm', height)
        if match_cm:
            height_in_cm = int(match_cm.group(1))
            # Convert height from centimeters to inches
            return height_in_cm * 0.393701  # 1 centimeter is approximately 0.393701 inches
        else:
            return None

# Apply the height_to_inches function to the "Height" column
fifa_dataset['Height'] = fifa_dataset['Height'].apply(height_to_inches)

# Check the first few rows to verify the conversion
fifa_dataset[['Name', 'Height', 'Weight']].head()


| | Name | Height | Weight |
|---|---|---|---|
| 0 | L. Messi | 66.92917 | 72kg |
| 1 | Cristiano Ronaldo | 73.622087 | 83kg |
| 2 | J. Oblak | 74.015788 | 87kg |
| 3 | K. De Bruyne | 71.259881 | 70kg |
| 4 | Neymar Jr | 68.897675 | 68kg |
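
A spot check of the conversion on individual strings can make the two branches easier to verify. The function below repeats the logic above in condensed form, and the sample strings are made up for illustration.

```python
import re

def height_to_inches(height):
    # Feet-and-inches form such as 6'2"
    match = re.search(r"(\d+)'(\d+)", height)
    if match:
        feet, inches = map(int, match.groups())
        return feet * 12 + inches
    # Centimetre form such as 170cm
    match_cm = re.search(r"(\d+)cm", height)
    if match_cm:
        return int(match_cm.group(1)) * 0.393701
    # Anything else is treated as unknown
    return None

samples = {value: height_to_inches(value) for value in ["6'2\"", "170cm", "unknown"]}
print(samples)
```

The feet-and-inches branch returns an integer and the centimetre branch a float, so the resulting column is float once both forms occur.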


Additional Hint - Split the Long Name into First Name and Last Name

In [18]:
# Split the "LongName" into two new columns - "First Name" and "Last Name"
fifa_dataset[['First Name', 'Last Name']] = fifa_dataset['LongName'].str.split(' ', n=1, expand=True)
fifa_dataset.head()

| | ID | Name | LongName | photoUrl | playerUrl | Nationality | Age | ↓OVA | POT | Club | ... | IR | PAC | SHO | PAS | DRI | DEF | PHY | Hits | First Name | Last Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | L. Messi | Lionel Messi | https://cdn.sofifa.com/players/158/023/21_60.png | http://sofifa.com/player/158023/lionel-messi/2... | Argentina | 33 | 93 | 93 | FC Barcelona | ... | 5 ★ | 85 | 92 | 91 | 95 | 38 | 65 | 771 | Lionel | Messi |
| 1 | 20801 | Cristiano Ronaldo | C. Ronaldo dos Santos Aveiro | https://cdn.sofifa.com/players/020/801/21_60.png | http://sofifa.com/player/20801/c-ronaldo-dos-s... | Portugal | 35 | 92 | 92 | Juventus | ... | 5 ★ | 89 | 93 | 81 | 89 | 35 | 77 | 562 | C. | Ronaldo dos Santos Aveiro |
| 2 | 200389 | J. Oblak | Jan Oblak | https://cdn.sofifa.com/players/200/389/21_60.png | http://sofifa.com/player/200389/jan-oblak/210006/ | Slovenia | 27 | 91 | 93 | Atlético Madrid | ... | 3 ★ | 87 | 92 | 78 | 90 | 52 | 90 | 150 | Jan | Oblak |
| 3 | 192985 | K. De Bruyne | Kevin De Bruyne | https://cdn.sofifa.com/players/192/985/21_60.png | http://sofifa.com/player/192985/kevin-de-bruyn... | Belgium | 29 | 91 | 91 | Manchester City | ... | 4 ★ | 76 | 86 | 93 | 88 | 64 | 78 | 207 | Kevin | De Bruyne |
| 4 | 190871 | Neymar Jr | Neymar da Silva Santos Jr. | https://cdn.sofifa.com/players/190/871/21_60.png | http://sofifa.com/player/190871/neymar-da-silv... | Brazil | 28 | 91 | 91 | Paris Saint-Germain | ... | 5 ★ | 91 | 85 | 86 | 94 | 36 | 59 | 595 | Neymar | da Silva Santos Jr. |
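
Splitting with `n=1` cuts at the first space only, which is why "C. Ronaldo dos Santos Aveiro" ends up with "C." as the first name. The sketch below contrasts this with `str.rsplit`, which cuts at the last space. The `rsplit` variant is shown only for comparison and is not what the notebook uses.

```python
import pandas as pd

# Sample long names as they appear in the output above
long_names = pd.Series(["Lionel Messi", "C. Ronaldo dos Santos Aveiro", "Kevin De Bruyne"])

# n=1 splits at the first space: first token vs. everything after it
first_split = long_names.str.split(" ", n=1, expand=True)

# rsplit with n=1 splits at the last space instead
last_split = long_names.str.rsplit(" ", n=1, expand=True)

print(first_split)
print(last_split)
```

Neither rule is right for every name, so the choice depends on which part should absorb middle names.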


Handle Missing Values with statistics

In [19]:
# Check for missing values and count them for each column
missing_values = fifa_dataset.isna().sum()

# Filter columns with missing values
columns_with_missing_values = missing_values[missing_values > 0]

# Display the columns with missing values and the count of missing values to understand columns to fill in
print(columns_with_missing_values)

# Fill missing values in "Loan Date End" with the mode (most frequent value)
mode_loan_date_end = fifa_dataset['Loan Date End'].mode().iloc[0]
fifa_dataset['Loan Date End'] = fifa_dataset['Loan Date End'].fillna(mode_loan_date_end)

# Convert the "Hits" column to numeric; unparseable entries become NaN, so the result is float
fifa_dataset['Hits'] = pd.to_numeric(fifa_dataset['Hits'], errors='coerce')

# Calculate the mean for the "Hits" column
mean_hits = fifa_dataset['Hits'].mean()

# Fill missing values with the calculated mean
fifa_dataset['Hits'] = fifa_dataset['Hits'].fillna(mean_hits)


Loan Date End    16961
Hits              2361
dtype: int64
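
The same fill strategy can be tried on a small frame to see its effect before and after. This is a minimal sketch on hypothetical values (including the date string), not on the real dataset.

```python
import pandas as pd

# Hypothetical frame with gaps, for illustration only
frame = pd.DataFrame({
    "Hits": ["10", "30", None, "20"],
    "Loan Date End": [None, "Jun 30, 2021", "Jun 30, 2021", None],
})

missing_before = frame.isna().sum().to_dict()

# Numeric column: coerce to numbers, then fill gaps with the mean
frame["Hits"] = pd.to_numeric(frame["Hits"], errors="coerce")
frame["Hits"] = frame["Hits"].fillna(frame["Hits"].mean())

# Categorical column: fill gaps with the most frequent value
mode_value = frame["Loan Date End"].mode().iloc[0]
frame["Loan Date End"] = frame["Loan Date End"].fillna(mode_value)

missing_after = frame.isna().sum().to_dict()
print(missing_before, missing_after)
```

Filling with the mean or mode keeps every row but also invents values, so it is worth recording which cells were imputed if later analysis depends on them.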


Hint - Convert Value, Wage and Release Clause to numbers

In [20]:
# Convert string columns to numerical values
def convert_to_numeric(value_str):
    if 'M' in value_str:
        # Convert million to numeric
        value_str = value_str.replace('M', '').replace('€', '').replace(',', '')
        return float(value_str) * 1000000
    elif 'K' in value_str:
        # Convert thousand to numeric
        value_str = value_str.replace('K', '').replace('€', '').replace(',', '')
        return float(value_str) * 1000
    else:
        # Handle cases with no suffix (e.g., direct numeric values)
        return float(value_str.replace('€', '').replace(',', ''))

# Apply the conversion function to "Value," "Wage," and "Release Clause" columns
fifa_dataset['Value'] = fifa_dataset['Value'].apply(convert_to_numeric)
fifa_dataset['Wage'] = fifa_dataset['Wage'].apply(convert_to_numeric)
fifa_dataset['Release Clause'] = fifa_dataset['Release Clause'].apply(convert_to_numeric)
fifa_dataset.head()



| | ID | Name | LongName | photoUrl | playerUrl | Nationality | Age | ↓OVA | POT | Club | ... | IR | PAC | SHO | PAS | DRI | DEF | PHY | Hits | First Name | Last Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | L. Messi | Lionel Messi | https://cdn.sofifa.com/players/158/023/21_60.png | http://sofifa.com/player/158023/lionel-messi/2... | Argentina | 33 | 93 | 93 | FC Barcelona | ... | 5 ★ | 85 | 92 | 91 | 95 | 38 | 65 | 771.0 | Lionel | Messi |
| 1 | 20801 | Cristiano Ronaldo | C. Ronaldo dos Santos Aveiro | https://cdn.sofifa.com/players/020/801/21_60.png | http://sofifa.com/player/20801/c-ronaldo-dos-s... | Portugal | 35 | 92 | 92 | Juventus | ... | 5 ★ | 89 | 93 | 81 | 89 | 35 | 77 | 562.0 | C. | Ronaldo dos Santos Aveiro |
| 2 | 200389 | J. Oblak | Jan Oblak | https://cdn.sofifa.com/players/200/389/21_60.png | http://sofifa.com/player/200389/jan-oblak/210006/ | Slovenia | 27 | 91 | 93 | Atlético Madrid | ... | 3 ★ | 87 | 92 | 78 | 90 | 52 | 90 | 150.0 | Jan | Oblak |
| 3 | 192985 | K. De Bruyne | Kevin De Bruyne | https://cdn.sofifa.com/players/192/985/21_60.png | http://sofifa.com/player/192985/kevin-de-bruyn... | Belgium | 29 | 91 | 91 | Manchester City | ... | 4 ★ | 76 | 86 | 93 | 88 | 64 | 78 | 207.0 | Kevin | De Bruyne |
| 4 | 190871 | Neymar Jr | Neymar da Silva Santos Jr. | https://cdn.sofifa.com/players/190/871/21_60.png | http://sofifa.com/player/190871/neymar-da-silv... | Brazil | 28 | 91 | 91 | Paris Saint-Germain | ... | 5 ★ | 91 | 85 | 86 | 94 | 36 | 59 | 595.0 | Neymar | da Silva Santos Jr. |
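
The suffix handling can also be written as a small lookup table, which keeps the scaling factors in one place. The helper name `parse_euro_amount` and the sample strings below are hypothetical, chosen only to illustrate the idea.

```python
def parse_euro_amount(text):
    # Hypothetical helper: strip the euro sign and separators, then scale by the suffix
    multipliers = {"M": 1_000_000, "K": 1_000}
    cleaned = text.replace("€", "").replace(",", "").strip()
    suffix = cleaned[-1:]
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    # No suffix: the string is already a plain number
    return float(cleaned)

# Made-up sample strings covering each branch
parsed = [parse_euro_amount(s) for s in ["€103.5M", "€560K", "€500", "€0"]]
print(parsed)
```

Unlike a check with `'M' in value_str`, this looks only at the last character, so a stray "M" or "K" elsewhere in the string would not trigger scaling.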


Hint - Convert currency characters to '$' in the specified columns ['Value', 'Wage', 'Release Clause']

In [21]:
# Prefix the values in the specified columns with '$'.
# Note: this turns the numeric columns back into strings.
columns_to_convert = ['Value', 'Wage', 'Release Clause']

for column in columns_to_convert:
    fifa_dataset[column] = '$' + fifa_dataset[column].astype(str)

print(fifa_dataset[columns_to_convert])


              Value       Wage Release Clause
0      $103500000.0  $560000.0   $138400000.0
1       $63000000.0  $220000.0    $75900000.0
2      $120000000.0  $125000.0   $159400000.0
3      $129000000.0  $370000.0   $161000000.0
4      $132000000.0  $270000.0   $166500000.0
...             ...        ...            ...
18974     $100000.0    $1000.0       $70000.0
18975     $130000.0     $500.0      $165000.0
18976     $120000.0     $500.0      $131000.0
18977     $100000.0    $2000.0       $88000.0
18978     $100000.0    $1000.0       $79000.0

[17920 rows x 3 columns]
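
Because prefixing '$' turns the columns into strings, sums and comparisons on them no longer work. One alternative, sketched here on hypothetical amounts, is to keep the numeric columns and build the formatted strings in a separate frame used only for display.

```python
import pandas as pd

# Hypothetical numeric amounts, for illustration only
amounts = pd.DataFrame({"Value": [103500000.0, 100000.0], "Wage": [560000.0, 1000.0]})

# Build display strings in a separate frame so the numeric columns stay usable
formatted = amounts.apply(lambda column: column.map("${:,.0f}".format))

total_value = amounts["Value"].sum()  # still numeric, so arithmetic works
print(formatted)
print(total_value)
```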
