# Data Cleaning

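Data cleaning is the process of finding and fixing problems in a dataset, such as fields that are combined into one column, stray characters, and missing values. We are going to start with player data from the Indiana Pacers' 2024-2025 NBA season.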

In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Data-Dunkers/data/refs/heads/main/NBA/team/2024-2025/IND_2024-2025_players_original.csv")
df

## 1. Flagging Traded Players

You'll notice that the `Name` column contains the player name and their position. In some cases there is also a `*`, indicating that the player was traded during the season.

Our first step in cleaning up that column will be to make a new `Traded` column that is `True` for any player whose name contains a `*` and `False` otherwise.

In [None]:
df['Traded'] = df['Name'].str.contains('*', regex=False)
df
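
To check the flag, it can help to try the same call on a few made-up values first. The sketch below uses hypothetical names (not rows from the Pacers file), and where the `*` sits in each name is an assumption. It also counts the flagged players with `.sum()`, which works because `True` counts as 1.

```python
import pandas as pd

# Hypothetical sample values; the real file's exact name format may differ
names = pd.Series(["Player A G", "Player B F *", "Player C C"])

# Literal match on the asterisk character
traded = names.str.contains("*", regex=False)
print(traded.tolist())  # [False, True, False]

# True counts as 1, so the sum is the number of flagged players
traded_count = int(traded.sum())
print(traded_count)  # 1
```

On the real data, `df['Traded'].sum()` would give the same kind of count.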

## 2. Cleaning Names

Now we can remove the `*` characters from the `Name` column, and strip out any extra spaces.

In [None]:
df['Name'] = df['Name'].str.replace('*', '', regex=False).str.strip()
df
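
A quick way to confirm that this step worked is to check that no asterisks remain. The following is a minimal sketch on hypothetical names, assuming the `*` and any extra spaces sit at the start or end of the value.

```python
import pandas as pd

# Hypothetical sample values with a trailing asterisk and a leading space
names = pd.Series(["Player A G", "Player B F *", " Player C C"])

# Remove the asterisk, then trim whitespace from both ends
cleaned = names.str.replace("*", "", regex=False).str.strip()
print(cleaned.tolist())  # ['Player A G', 'Player B F', 'Player C C']

# Sanity check: no asterisks should be left
no_stars_left = not cleaned.str.contains("*", regex=False).any()
print(no_stars_left)  # True
```

Note that `.str.strip()` only trims the ends of each value. If an asterisk sat in the middle of a name, removing it would leave two spaces side by side.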

## 3. Extracting Position

Next we extract the position into a new `Position` column by splitting `Name` on the space character and keeping the part after the last space.

In [None]:
df['Position'] = df['Name'].str.split(' ').str[-1]
df
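
To see what the split produces, here is a small sketch on hypothetical values. The last entry has no space at all, which shows what happens to a label such as a totals row.

```python
import pandas as pd

# Hypothetical sample values; "Total" has no position after it
names = pd.Series(["Player A G", "Jean Paul Example F", "Total"])

# Each value becomes a list of words
parts = names.str.split(" ")
print(parts.tolist())
# [['Player', 'A', 'G'], ['Jean', 'Paul', 'Example', 'F'], ['Total']]

# .str[-1] picks the last word from each list
positions = parts.str[-1]
print(positions.tolist())  # ['G', 'F', 'Total']
```

A value with no space comes back unchanged, so it is worth looking at the new column for entries that are not real positions.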

## 4. Finalizing Names

Removing the position from the `Name` column works in a similar way: we keep every word up to the last space and then join those words back together with spaces.

In [None]:
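# Keep every word except the last one (the position)
df['Name'] = df['Name'].str.split(' ').str[:-1]
# Join the remaining words back into a single string
df['Name'] = df['Name'].str.join(' ')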
df
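
As an alternative to the split, slice, and join approach used in this notebook, pandas can split once from the right. The sketch below is one possible variation on hypothetical names, not the notebook's method. It assumes every value ends with a position.

```python
import pandas as pd

# Hypothetical sample values, each ending with a position
names = pd.Series(["Player A G", "Jean Paul Example F"])

# n=1 splits only at the last space; expand=True returns a DataFrame
pieces = names.str.rsplit(" ", n=1, expand=True)
pieces.columns = ["Name", "Position"]

print(pieces["Name"].tolist())      # ['Player A', 'Jean Paul Example']
print(pieces["Position"].tolist())  # ['G', 'F']
```

This gives both columns in one step, while the version above keeps each step separate, which can be easier to follow.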

## 5. Handling Totals and Missing Values

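The steps above assume that every name ends with a position. A value with no space in it, such as a totals label, ends up with an empty `Name`, so the total row (row 22) might need a manual name fix. We should also replace any `NaN` (not a number) values with `0`.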

In [None]:
# Row 22 is the total row; set its name by hand
df.loc[22, 'Name'] = 'Total'
# Replace every missing value in the table with 0
df = df.fillna(0)
df
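
Hard-coding row 22 works for this file, but the row number could change for another team or season. The sketch below shows one possible way to find the total row by its contents instead. The toy table and the idea that the total row has `'Total'` in its `Position` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table: the last row stands in for a total row after steps 3 and 4
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", ""],
    "Position": ["G", "F", "Total"],
    "FG%": [0.45, np.nan, 0.47],
})

# Find the total row by its contents rather than by its row number
is_total = toy["Position"] == "Total"
toy.loc[is_total, "Name"] = "Total"

# Replace missing values with 0
toy = toy.fillna(0)

print(toy["Name"].tolist())  # ['Player A', 'Player B', 'Total']
print(toy["FG%"].tolist())   # [0.45, 0.0, 0.47]
```

Keep in mind that `fillna(0)` fills every column, so a missing percentage becomes 0 even when the player simply took no shots.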

## 6. Reordering Columns

As a final step, we can reorder the columns to make the dataset easier to read.

In [None]:
cols = ['Name', 'Position', 'Traded', 'GP', 'GS', 'MIN', 'PTS', 'OR', 'DR', 'REB', 
        'AST', 'STL', 'BLK', 'TO', 'PF', 'AST/TO', 'FGM', 'FGA', 'FG%', '3PM', 
        '3PA', '3P%', 'FTM', 'FTA', 'FT%', '2PM', '2PA', '2P%', 'SC-EFF', 'SH-EFF']
df = df[cols]
df.head()
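
Selecting columns with a list raises a `KeyError` if any name in the list is not in the DataFrame. A minimal sketch of checking the list first is shown below, using a hypothetical toy table where one requested column is deliberately absent.

```python
import pandas as pd

# Hypothetical table with columns in an unhelpful order
toy = pd.DataFrame({
    "PTS": [10, 20],
    "Name": ["Player A", "Player B"],
    "GP": [5, 6],
})

# 'REB' is deliberately not in the table
wanted = ["Name", "GP", "PTS", "REB"]

# Separate the requested columns into those that exist and those that do not
missing = [col for col in wanted if col not in toy.columns]
present = [col for col in wanted if col in toy.columns]

# Reorder using only the columns that exist
toy = toy[present]

print(missing)               # ['REB']
print(toy.columns.tolist())  # ['Name', 'GP', 'PTS']
```

Printing `missing` before reordering makes typos in column names easy to spot.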

## Reflection Questions

1. Why do you think we set `regex=False` when checking for the `*` character?
2. Explain the difference between `.str.split(' ').str[-1]` and `.str.split(' ').str[:-1]`.
3. How would you modify the code to create a `FirstName` and `LastName` column instead of keeping the full name together?

---

### Online Access
You can run this notebook online using the following links:

*   [**Google Colab**](https://colab.research.google.com/github/Data-Dunkers/student/blob/main/activities/data-cleaning.ipynb)
*   [**Callysto Hub**](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FData-Dunkers%2Fstudent&branch=main&subPath=activities/data-cleaning.ipynb&depth=1)