# Data Cleaning

Data cleaning is the process of finding and fixing problems in a data set, such as several values packed into one column, stray characters, and missing values. We are going to start with data from the 2024-2025 Indiana Pacers NBA season.

In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Data-Dunkers/data/refs/heads/main/NBA/team/2024-2025/IND_2024-2025_players_original.csv")
df

You'll notice that the `Name` column contains the player name and their position. In some cases there is also a `*`, indicating that the player was traded during the season.

Our first step in cleaning up that column will be to make a new `Traded` column and set it to `True` for any player whose name contains a `*`, and `False` otherwise.

In [None]:
df['Traded'] = df['Name'].str.contains('*', regex=False)
df
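To see what `str.contains` returns, here is a minimal sketch on a tiny made-up Series. The names and the placement of the `*` are hypothetical and are not taken from the Pacers file.

```python
import pandas as pd

# Hypothetical names, not taken from the real roster file
names = pd.Series(["Alex Example G", "Sam Sample F *", "Total"])

# regex=False tells pandas to look for a literal asterisk character
traded = names.str.contains("*", regex=False)

print(traded.tolist())  # [False, True, False]
```

The result is a Series of booleans with one value per row, which is why it can be assigned directly to a new column.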

Now we can remove the `*` characters from the `Name` column, and strip out any extra spaces.

In [None]:
df['Name'] = df['Name'].str.replace('*', '', regex=False).str.strip()
df
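The sketch below shows why the `.str.strip()` call matters. It assumes, for illustration only, that the `*` sits at the end of the value after a space; the names are made up.

```python
import pandas as pd

# Hypothetical names; the trailing " *" placement is an assumption
names = pd.Series(["Alex Example G", "Sam Sample F *"])

# Removing the asterisk leaves a trailing space behind
no_star = names.str.replace("*", "", regex=False)
print(no_star.tolist())  # ['Alex Example G', 'Sam Sample F ']

# strip() removes whitespace from both ends of each value
cleaned = no_star.str.strip()
print(cleaned.tolist())  # ['Alex Example G', 'Sam Sample F']
```

Note that `strip()` only touches the ends of a string, so it would not collapse extra spaces in the middle of a name.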

Next we extract the `Position` from the `Name` column by splitting each value on the space character and taking the part after the last space.

In [None]:
df['Position'] = df['Name'].str.split(' ').str[-1]
df
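As a small illustration of the split-then-index step, the following sketch uses made-up names in "First Last Position" form. One of them has an extra word to show that `[-1]` still picks the final piece.

```python
import pandas as pd

# Hypothetical names, not taken from the real roster file
names = pd.Series(["Alex Example G", "Sam Van Sample F"])

parts = names.str.split(" ")   # each value becomes a list of words
positions = parts.str[-1]      # the last word in each list

print(parts.tolist())      # [['Alex', 'Example', 'G'], ['Sam', 'Van', 'Sample', 'F']]
print(positions.tolist())  # ['G', 'F']
```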

Removing the position information from the `Name` column works in a similar way, except that this time we keep everything up to the last space.

In [None]:
df['Name'] = df['Name'].str.split(' ').str[:-1]
df
df

But there are a couple of issues now with the `Name` column. The first issue is that each value is a list of words rather than a single string, so we'll want to join the words back together with a space.

In [None]:
df['Name'] = df['Name'].str.join(' ')
df
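The slice-and-join round trip can be sketched on made-up data as follows. The names are hypothetical; the point is only to show the intermediate list values.

```python
import pandas as pd

# Hypothetical names, not taken from the real roster file
names = pd.Series(["Alex Example G", "Sam Van Sample F"])

# Slicing with [:-1] keeps every word except the last, still as a list
words = names.str.split(" ").str[:-1]
print(words.tolist())   # [['Alex', 'Example'], ['Sam', 'Van', 'Sample']]

# str.join glues each list back into one string, using a space as the separator
joined = words.str.join(" ")
print(joined.tolist())  # ['Alex Example', 'Sam Van Sample']
```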

The other issue is that in the total row (row 22) the `Name` value is now empty. That value had no position after it, so dropping the last word removed the whole name. Let's manually enter it back in.

In [None]:
df.loc[22, 'Name'] = 'Total'
df
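The sketch below reproduces the empty-name effect on a made-up table, assuming the summary row holds a single word with no position. It also shows one possible alternative to hard-coding the row number: selecting the affected row with a boolean mask. `df_demo` is a hypothetical name used only here.

```python
import pandas as pd

# Hypothetical table; the last row is a one-word summary row
df_demo = pd.DataFrame({"Name": ["Alex Example G", "Sam Sample F", "Total"]})

# Dropping the last word leaves nothing for a one-word value
df_demo["Name"] = df_demo["Name"].str.split(" ").str[:-1].str.join(" ")
print(df_demo["Name"].tolist())  # ['Alex Example', 'Sam Sample', '']

# Fill in any row whose name ended up empty, whatever its index is
df_demo.loc[df_demo["Name"] == "", "Name"] = "Total"
print(df_demo["Name"].tolist())  # ['Alex Example', 'Sam Sample', 'Total']
```

The mask version keeps working if the number of rows changes, but it would also relabel any other row that ended up with an empty name, so it is worth checking the result.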

We may also want to replace any `NaN` (not a number) values with the value `0`.

In [None]:
df = df.fillna(0)
df
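Here is a minimal sketch of `fillna` on a made-up table with one missing value. The column names mirror the style of the data set, but the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stats with one missing percentage
stats = pd.DataFrame({
    "Name": ["Alex Example", "Sam Sample"],
    "3P%": [0.35, np.nan],
})

# fillna returns a new DataFrame; the original is left unchanged
filled = stats.fillna(0)

print(stats["3P%"].isna().sum())  # 1
print(filled["3P%"].tolist())     # [0.35, 0.0]
```

Because `fillna` returns a new DataFrame, the result has to be assigned back to a variable for the change to stick.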

As a final step, we can reorder the columns if we'd like. First we will list all of the columns.

In [None]:
df.columns

Then we can copy that list, edit it, and use it to reorder the columns.

In [None]:
df = df[['Name', 'Position', 'Traded', 'GP', 'GS', 'MIN', 'PTS', 'OR', 'DR', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', 'AST/TO', 'FGM', 'FGA', 'FG%', '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', '2PM', '2PA', '2P%', 'SC-EFF', 'SH-EFF']]
df
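Typing out every column name works, but any column left out of the list is dropped from the result. As an alternative sketch, the code below moves a few chosen columns to the front and keeps the rest in their existing order. The table and the `front` and `rest` variable names are hypothetical.

```python
import pandas as pd

# Hypothetical table with the descriptive columns at the end
stats = pd.DataFrame({
    "GP": [10, 12],
    "PTS": [120, 95],
    "Name": ["Alex Example", "Sam Sample"],
    "Position": ["G", "F"],
})

front = ["Name", "Position"]                               # columns to move first
rest = [col for col in stats.columns if col not in front]  # everything else, in order

reordered = stats[front + rest]
print(list(reordered.columns))  # ['Name', 'Position', 'GP', 'PTS']
```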

Your challenge is to re-import the data and clean it up in a single cell, using the fewest lines of code possible.

In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/Data-Dunkers/data/refs/heads/main/NBA/team/2024-2025/IND_2024-2025_players_original.csv")




## Questions

1. Why do you think we set `regex=False` when checking for the `*` character? You may want to look at [metacharacters in regular expressions](https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended).
2. Explain the difference between `.str.split(' ').str[-1]` and `.str.split(' ').str[:-1]`. Why did we use each one?
3. How would you modify the code to create a `FirstName` and `LastName` column instead of keeping the full name together?