# Exploring and Cleaning Player Team Data

In this notebook, we explore and prepare a dataset of player statistics from various teams. The goal is to clean and organize the data so that team-specific information can be analyzed efficiently.

### Step 1: Checking for Missing Values in Team Names

Our first step is to verify the integrity of the dataset by checking the `Basic Info_Team Name` column for missing values (NaN) and for the placeholder string `undefined`. Missing team names can indicate incomplete or faulty records, which would distort any later per-team analysis. If any such values are found, we will either correct or remove them so that the data is clean before proceeding.


In [2]:
import pandas as pd

# Read the data from the CSV file
df = pd.read_csv('../data/deep_player_data.csv')

In [3]:
# Keep only the rows whose team name is NaN or the literal string 'undefined'
filtered_df = df[df['Basic Info_Team Name'].isna() | (df['Basic Info_Team Name'] == 'undefined')]

# Select only the player name and team name columns
filtered_df_selected = filtered_df[['Basic Info_Player Name', 'Basic Info_Team Name']]

# Display the selected data
display(filtered_df_selected.head())
display(filtered_df_selected.tail())

Unnamed: 0,Basic Info_Player Name,Basic Info_Team Name
139,CLASIA,undefined
324,motm,undefined
562,chop,undefined
580,stamina,undefined


Unnamed: 0,Basic Info_Player Name,Basic Info_Team Name
139,CLASIA,undefined
324,motm,undefined
562,chop,undefined
580,stamina,undefined


In [4]:
# Get the count of rows in the filtered DataFrame
row_count = filtered_df_selected.shape[0]

# Display the row count
print(f'The number of players/teams in the filtered data: {row_count}')

The number of players/teams in the filtered data: 4


In [5]:
# Check the original DataFrame
df_selected = df[['Basic Info_Player Name', 'Basic Info_Team Name']]
display(df_selected.head())

Unnamed: 0,Basic Info_Player Name,Basic Info_Team Name
0,ZywOo,Vitality
1,s1mple,Natus Vincere
2,sh1ro,Spirit
3,donk,Spirit
4,deko,Aurora


In [6]:
# Check the data type of the team name column
team_name_dtype = df['Basic Info_Team Name'].dtype

# Check for NaN values in the 'Basic Info_Team Name' column and count occurrences of each type (NaN or not NaN)
team_name_nan_count = df['Basic Info_Team Name'].isna().sum()
team_name_non_nan_count = df['Basic Info_Team Name'].notna().sum()

# Display the results
print(f"Data type of 'Basic Info_Team Name': {team_name_dtype}")
print(f"Number of NaN values: {team_name_nan_count}")
print(f"Number of non-NaN values: {team_name_non_nan_count}")


Data type of 'Basic Info_Team Name': object
Number of NaN values: 0
Number of non-NaN values: 968


In [7]:
# Get unique values in the 'Basic Info_Team Name' column
unique_teams = df['Basic Info_Team Name'].unique()

# Count occurrences of each unique value
team_counts = df['Basic Info_Team Name'].value_counts(dropna=False)

# Display unique values and their counts
print("Unique team names and their counts:")
print(team_counts)

# Check for empty strings or whitespace-only entries
empty_entries = df[df['Basic Info_Team Name'].str.strip() == '']
print(f"\nNumber of empty or whitespace-only entries: {len(empty_entries)}")

# Check for entries that might indicate no team
no_team_entries = df[df['Basic Info_Team Name'].str.lower().isin(['no team', 'none', 'n/a', 'unknown'])]
print(f"\nNumber of entries potentially indicating no team: {len(no_team_entries)}")

# Check for unusually short team names (less than 2 characters)
short_names = df[df['Basic Info_Team Name'].str.len() < 2]
print(f"\nNumber of unusually short team names: {len(short_names)}")

# Display team names that occur fewer than 5 times for manual inspection
print("\nUnusual entries (if any):")
unusual_entries = team_counts[team_counts < 5]
print(unusual_entries)

Unique team names and their counts:
Basic Info_Team Name
No team              383
Ninjas in Pyjamas      8
OG                     7
G2                     7
Into the Breach        7
                    ... 
Arcade                 1
MASONIC                1
ENCE Academy           1
Punishers              1
MIBR Academy           1
Name: count, Length: 185, dtype: int64

Number of empty or whitespace-only entries: 0

Number of entries potentially indicating no team: 383

Number of unusually short team names: 0

Unusual entries (if any):
Basic Info_Team Name
ESC             4
TSM             4
KRÜ             4
undefined       4
Liquid          4
               ..
Arcade          1
MASONIC         1
ENCE Academy    1
Punishers       1
MIBR Academy    1
Name: count, Length: 133, dtype: int64
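
The four `undefined` rows found above are not caught by `isna()`, because `undefined` is an ordinary string. The following is a minimal sketch of one way to handle them, assuming that `undefined` is the only placeholder for a missing team. The toy frame and the `PLACEHOLDERS` list are illustrative and not part of the original notebook; the rows are copied from the outputs above.

```python
import pandas as pd

TEAM_COL = "Basic Info_Team Name"

# Toy frame standing in for df; rows are copied from the outputs above.
toy = pd.DataFrame({
    "Basic Info_Player Name": ["ZywOo", "CLASIA", "motm", "donk"],
    TEAM_COL: ["Vitality", "undefined", "undefined", "Spirit"],
})

# Assumption: the literal string 'undefined' is the only placeholder in use.
PLACEHOLDERS = ["undefined"]

# Turn placeholder strings into real missing values so isna() can see them.
cleaned = toy.copy()
cleaned[TEAM_COL] = cleaned[TEAM_COL].mask(cleaned[TEAM_COL].isin(PLACEHOLDERS))

n_missing = int(cleaned[TEAM_COL].isna().sum())
print(f"Rows with a missing team after conversion: {n_missing}")

# Option 1: remove the affected rows.
dropped = cleaned.dropna(subset=[TEAM_COL])

# Option 2: keep the rows and relabel them with the existing 'No team' category.
relabelled = cleaned.assign(**{TEAM_COL: cleaned[TEAM_COL].fillna("No team")})
```

Which option fits better depends on whether these players should count as solo players; that is a judgment call this sketch does not settle.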


### Step 2: Cleaning the Data

The checks above show that the `Basic Info_Team Name` column contains no missing values (NaN). It does, however, contain 383 players listed under `No team` and 4 rows with the placeholder `undefined`. With that established, we can move on to cleaning and preparing the dataset: handling outliers, normalizing numerical values, and ensuring that all columns are formatted consistently.
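
Consistent formatting matters here because several `object` columns inspected further down in this notebook hold numbers stored as text, such as `98/100` scores and `74.5%` percentages. The sketch below shows one possible conversion. The helper names `score_to_float` and `percent_to_float` are hypothetical, and the code assumes every score has the form `<number>/100`; anything that does not match becomes NaN rather than raising an error.

```python
import pandas as pd


def score_to_float(series: pd.Series) -> pd.Series:
    """Convert strings like '98/100' to the numerator as a float (98.0)."""
    # Capture the number in front of '/100'; non-matching values become NaN.
    numerator = series.str.extract(
        r"^\s*(\d+(?:\.\d+)?)\s*/\s*100\s*$", expand=False
    )
    return pd.to_numeric(numerator, errors="coerce").astype("float64")


def percent_to_float(series: pd.Series) -> pd.Series:
    """Convert strings like '74.5%' to a float in percent units (74.5)."""
    number = series.str.strip().str.rstrip("%")
    return pd.to_numeric(number, errors="coerce").astype("float64")


# Toy frame; column names and values are copied from the notebook outputs.
toy = pd.DataFrame({
    "Role Stats_Firepower_Score": ["98/100", "97/100", "85/100"],
    "Detailed Stats_Headshot %": ["41.3%", "41.1%", "28.9%"],
})

converted = toy.copy()
converted["Role Stats_Firepower_Score"] = score_to_float(
    toy["Role Stats_Firepower_Score"]
)
converted["Detailed Stats_Headshot %"] = percent_to_float(
    toy["Detailed Stats_Headshot %"]
)
print(converted.dtypes)
```

Coercing unparseable values to NaN keeps the conversion from failing on a single bad row, at the cost of having to count the resulting NaN values afterwards.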

We aim to normalize the player statistics across all teams to ensure consistent comparisons and analysis. Once the data is cleaned, we will split it into separate dataframes for each team, allowing us to analyze team-specific information more efficiently.
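
The normalize-then-split plan can be sketched as follows. The text does not name a normalization method, so z-score scaling is an assumption here, and the `Summary Stats_KPR` values in the toy frame are made up purely for illustration. The `team_frames` dictionary is likewise a hypothetical name.

```python
import pandas as pd

TEAM_COL = "Basic Info_Team Name"

# Toy frame; player and team names come from the notebook outputs,
# but the KPR values are invented for this example.
toy = pd.DataFrame({
    "Basic Info_Player Name": ["ZywOo", "s1mple", "sh1ro", "donk"],
    TEAM_COL: ["Vitality", "Natus Vincere", "Spirit", "Spirit"],
    "Summary Stats_KPR": [0.9, 0.8, 0.7, 1.0],  # made-up values
})

numeric_cols = toy.select_dtypes(include="number").columns

# Z-score normalization computed across all players (assumed method).
normalized = toy.copy()
stats = normalized[numeric_cols]
# Replace a zero standard deviation with 1 so constant columns do not divide by zero.
std = stats.std(ddof=0).replace(0, 1.0)
normalized[numeric_cols] = (stats - stats.mean()) / std

# One DataFrame per team, keyed by team name.
team_frames = {
    team: frame.reset_index(drop=True)
    for team, frame in normalized.groupby(TEAM_COL)
}

print(sorted(team_frames))
print(team_frames["Spirit"])
```

Normalizing before splitting keeps every team on the same scale, which is what makes comparisons across teams meaningful; normalizing within each team instead would answer a different question.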


In [16]:
# Check the data types of the original DataFrame
display(df.dtypes)

# Count the number of columns with each data type
type_counts = df.dtypes.value_counts()
print(type_counts)

Summary Stats_DPR                               float64
Summary Stats_KPR                               float64
Detailed Stats_Deaths / round                   float64
Role Stats_Firepower_Score                       object
Role Stats_Opening_Win% after opening kill       object
                                                 ...   
Detailed Stats_Maps played                        int64
Role Stats_Trading_Saved teammate per round     float64
Summary Stats_Rating 2.0                        float64
Role Stats_Opening_Score                         object
Role Stats_Entrying_Traded deaths percentage     object
Length: 74, dtype: object

float64    37
object     32
int64       5
Name: count, dtype: int64


In [17]:
type_dict = {str(t): df.select_dtypes(include=[t]).columns.tolist() for t in df.dtypes.unique()}
for t, cols in type_dict.items():
    print(f"\n{t}:")
    print(", ".join(cols))


float64:
Summary Stats_DPR, Summary Stats_KPR, Detailed Stats_Deaths / round, Detailed Stats_Damage / Round, Role Stats_Opening_Opening deaths per round, Detailed Stats_Assists / round, Role Stats_Sniping_Sniper kills per round, Role Stats_Opening_Opening kills per round, Role Stats_Sniping_Sniper multi-kill rounds, Detailed Stats_Kills / round, Detailed Stats_Rating 1.0, Detailed Stats_Rating 2.0, Role Stats_Clutching_Clutch points per round, Role Stats_Entrying_Saved by teammate per round, Role Stats_Utility_Utility kills per 100 rounds, Summary Stats_Impact, Role Stats_Utility_Flashes thrown per round, Role Stats_Firepower_Kills per round, Detailed Stats_Saved by teammate / round, Role Stats_Entrying_Traded deaths per round, Role Stats_Entrying_Assists per round, Role Stats_Trading_Trade kills per round, Role Stats_Utility_Flash assists per round, Role Stats_Utility_Time opponent flashed per round, Summary Stats_Rating 1.0, Summary Stats_ADR, Role Stats_Utility_Utility damage per r

In [18]:
# Check the number of unique values in each object column
for col in df.select_dtypes(include=['object']):
    print(f"\nColumn: {col}")
    print(df[col].head()) 
    print(df[col].nunique(), "unique values")


Column: Role Stats_Firepower_Score
0    98/100
1    97/100
2    85/100
3    98/100
4    89/100
Name: Role Stats_Firepower_Score, dtype: object
87 unique values

Column: Role Stats_Opening_Win% after opening kill
0    74.5%
1    75.6%
2    77.3%
3    73.8%
4    73.2%
Name: Role Stats_Opening_Win% after opening kill, dtype: object
111 unique values

Column: Role Stats_Clutching_Score
0    75/100
1    52/100
2    83/100
3    37/100
4    73/100
Name: Role Stats_Clutching_Score, dtype: object
59 unique values

Column: Detailed Stats_Headshot %
0    41.3%
1    41.1%
2    28.9%
3    60.6%
4    38.3%
Name: Detailed Stats_Headshot %, dtype: object
327 unique values

Column: Role Stats_Entrying_Support rounds
0    15.3%
1    10.5%
2    19.4%
3    12.4%
4    16.3%
Name: Role Stats_Entrying_Support rounds, dtype: object
134 unique values

Column: Basic Info_Team Name
0         Vitality
1    Natus Vincere
2           Spirit
3           Spirit
4           Aurora
Name: Basic Info_Team Name, dtype: o