# Import Data

In [1]:
import pandas as pd

In [2]:
positions = ["G", "G-F", "F-G", "F", "F-C", "C-F", "C"]

To load data from a file into a DataFrame, we can use the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) function.

In [3]:
player_info = pd.read_csv("./data/bballPlayers.csv")
player_info.head()

Unnamed: 0,playerID,name,pos,height,weight
0,abdulka01,Kareem Abdul-Jabbar,C,85.0,225.0
1,abdulma02,Mahmoud Abdul-Rauf,G,73.0,162.0
2,abdulza01,Zaid Abdul-Aziz,C-F,81.0,230.0
3,adamsal01,Alvan Adams,C-F,81.0,210.0
4,adamsmi01,Michael Adams,G,70.0,162.0
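
As a minimal sketch of how read_csv() behaves, the example below parses a small in-memory sample instead of a file on disk. The two rows are copied from the output above; the variable names csv_text and sample are illustrative only, and the real bballPlayers.csv may contain additional columns.

```python
import io

import pandas as pd

# Two rows copied from the output above; the layout of the real file is assumed.
csv_text = """playerID,name,pos,height,weight
abdulka01,Kareem Abdul-Jabbar,C,85.0,225.0
abdulma02,Mahmoud Abdul-Rauf,G,73.0,162.0
"""

# read_csv() accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # number of rows and columns
print(sample.dtypes)  # column types inferred from the text
```

The first line of the text is used as the header, and numeric columns such as height and weight are parsed as floating-point numbers.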


Each column of the DataFrame is a variable and can be accessed using dot notation, as in player_info.pos. When text labels represent a finite set of possibilities, such as a player's position, it is more suitable to store the data as a categorical array.
We can use the [Categorical()](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html#pandas-categorical) function to convert an array of text labels to a categorical array.

In [4]:
player_info.pos = pd.Categorical(player_info.pos)
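
To see what the conversion does, here is a small sketch on a hypothetical list of position labels (labels and cat are illustrative names, not part of the dataset). When no categories are given, pandas infers them from the unique values and sorts them.

```python
import pandas as pd

# Hypothetical labels standing in for the pos column.
labels = ["C", "G", "C-F", "C-F", "G"]

cat = pd.Categorical(labels)

# Categories are inferred from the unique values and sorted.
print(list(cat.categories))  # ['C', 'C-F', 'G']

# Each value is stored as an integer code that indexes into the categories.
print(list(cat.codes))       # [0, 2, 1, 1, 2]
```

Storing integer codes instead of repeated strings is what makes a categorical array compact when the same few labels occur many times.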

A categorical array lists all of its possible categories in its categories attribute; by default, these are inferred from the unique values in the data. Sometimes we may want to define the categories ourselves, for example when we are pulling data from multiple sources and not every category is represented in a particular set.
We can pass a list of category names as the second argument (categories) to the Categorical() function to specify the categories.

In [5]:
player_info.pos = pd.Categorical(player_info.pos, positions)
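
The sketch below illustrates the effect of passing explicit categories. The list raw is hypothetical, and "PG" is a made-up label that is deliberately absent from positions.

```python
import pandas as pd

positions = ["G", "G-F", "F-G", "F", "F-C", "C-F", "C"]

# Hypothetical raw labels; "PG" is not one of the allowed positions.
raw = ["C", "G", "PG", "C-F"]

cat = pd.Categorical(raw, categories=positions)

# The categories keep the order given in positions, even if some are unused.
print(list(cat.categories))

# Values outside the category list are treated as missing.
print(list(cat.isna()))  # [False, False, True, False]
```

Note that all seven categories are retained even though only three of them appear in the data, which is useful when combining sources that each cover a subset of the positions.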

Any value that does not match a category in positions becomes NaN. To remove rows that contain an undefined or missing value from a DataFrame, we can use the [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#pandas-dataframe-dropna) function.

In [6]:
player_info.dropna(inplace=True)
player_info.head()

Unnamed: 0,playerID,name,pos,height,weight
0,abdulka01,Kareem Abdul-Jabbar,C,85.0,225.0
1,abdulma02,Mahmoud Abdul-Rauf,G,73.0,162.0
2,abdulza01,Zaid Abdul-Aziz,C-F,81.0,230.0
3,adamsal01,Alvan Adams,C-F,81.0,210.0
4,adamsmi01,Michael Adams,G,70.0,162.0
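
Because the first five rows shown above contain no missing values, the effect of dropna() is easier to see on a small hypothetical DataFrame. The player IDs and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a missing position, row 2 a missing height.
df = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "pos": pd.Categorical(["C", None, "G"], categories=["G", "C"]),
    "height": [85.0, 73.0, np.nan],
})

# Without inplace=True, dropna() returns a new DataFrame and leaves df unchanged.
cleaned = df.dropna()
print(list(cleaned.index))   # [0]

# The subset parameter restricts the check to specific columns.
pos_only = df.dropna(subset=["pos"])
print(list(pos_only.index))  # [0, 2]
```

By default a row is dropped if any of its values is missing, so limiting the check with subset can preserve rows whose only gaps are in columns that do not matter for the analysis.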


In [7]:
all_stats = pd.read_csv("./data/bballStats.csv")
all_stats.head()

Unnamed: 0,playerID,GP,minutes,points,oRebounds,dRebounds,rebounds,assists,steals,blocks,...,PostTurnovers,PostPF,PostfgAttempted,PostfgMade,PostftAttempted,PostftMade,PostthreeAttempted,PostthreeMade,note,year
0,abdulma02,67,1505,942,34,87,121,206,55,4,...,0,0,0,0,0,0,0,0,,1990
1,adamsmi01,66,2346,1752,58,198,256,693,147,6,...,0,0,0,0,0,0,0,0,,1990
2,aguirma01,78,2006,1104,134,240,374,139,47,20,...,20,41,178,90,51,42,33,12,,1990
3,aingeda01,80,1710,890,45,160,205,285,63,13,...,24,33,105,47,28,23,36,11,,1990
4,andergr01,26,247,70,26,49,75,3,8,9,...,0,0,0,0,0,0,0,0,,1990


Separate statistics are stored in all_stats for the regular season and for the post-season tournament. Since not all players participate in the post-season, we want to remove the post-season statistics from the DataFrame.
We can remove rows or variables from a DataFrame using the [drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#pandas-dataframe-drop) function.

In [8]:
all_stats.drop(all_stats.columns[18:], axis=1, inplace=True)
all_stats.head()

Unnamed: 0,playerID,GP,minutes,points,oRebounds,dRebounds,rebounds,assists,steals,blocks,turnovers,PF,fgAttempted,fgMade,ftAttempted,ftMade,threeAttempted,threeMade
0,abdulma02,67,1505,942,34,87,121,206,55,4,110,149,1009,417,98,84,100,24
1,adamsmi01,66,2346,1752,58,198,256,693,147,6,240,162,1421,560,529,465,564,167
2,aguirma01,78,2006,1104,134,240,374,139,47,20,128,209,909,420,317,240,78,24
3,aingeda01,80,1710,890,45,160,205,285,63,13,100,195,714,337,138,114,251,102
4,andergr01,26,247,70,26,49,75,3,8,9,22,29,73,27,28,16,1,0
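
The cell above selects the columns to drop by position. As a sketch of an alternative, columns can also be selected by name, which does not depend on column order. The tiny DataFrame below is hypothetical and reuses only a few column names from all_stats; it assumes the post-season columns share the Post prefix seen in the output above.

```python
import pandas as pd

# Hypothetical miniature of all_stats with two post-season columns.
stats = pd.DataFrame({
    "playerID": ["abdulma02", "adamsmi01"],
    "points": [942, 1752],
    "PostTurnovers": [0, 0],
    "PostPF": [0, 0],
})

# Drop by position, mirroring the approach used in the cell above.
by_position = stats.drop(columns=stats.columns[2:])

# Drop by name: collect every column whose name starts with "Post".
post_cols = [col for col in stats.columns if col.startswith("Post")]
by_name = stats.drop(columns=post_cols)

print(list(by_name.columns))  # ['playerID', 'points']
```

Passing columns=... is equivalent to passing the labels with axis=1. Selecting by name leaves any unrelated trailing columns in place, whereas a positional slice removes everything from that index onward.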
