<a href="https://colab.research.google.com/github/MonkeyWrenchGang/MGTPython/blob/main/module_5/5_1a_wrangling_pt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Manipulating Data Frames

---

Manipulating pandas data frames is a crucial part of our analysis process. It involves transforming and cleaning the data so that it can be effectively analyzed and visualized. Pandas provides a variety of functions and methods for manipulating data frames, including dealing with missing or null values, updating variables, deleting columns and rows, grouping and aggregating data, and more.

In this notebook we'll dive into some basics including:

1. dealing with nulls
  - replacing nulls with a constant
  - dropping rows containing nulls
2. creating new columns
3. dropping columns
4. dropping rows
5. creating a "ranking"


In [41]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [42]:
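# display and HTML are imported from IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
from IPython.display import clear_output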
display(HTML("<style>.container { width:90% }</style>"))
import warnings
warnings.filterwarnings('ignore')
# ------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# ------------------------------------------------------------------
pd.set_option('display.float_format', lambda x: '%.2f' % x)

%matplotlib inline

## Import NBA data

This dataset is used in many pandas tutorials. I think it is a useful one for looking at how we can clean things up.

In [43]:
nba = pd.read_csv("/content/drive/MyDrive/2022-MGT/nba.csv")
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


# 1. Deal with nulls

---
First, check which columns contain nulls!





In [44]:
nba.isna().sum(axis=0)

Name         1
Team         1
Number       1
Position     1
Age          1
Height       1
Weight       1
College     85
Salary      12
dtype: int64

## Fillna()


---

The `fillna()` method replaces NULL values with a specified value. It returns a new DataFrame object unless the `inplace` parameter is set to True, in which case the replacement is done in the original DataFrame instead.

```python
# whole dataframe
df.fillna(
  value,
  inplace=False)

# specific column
df["column"].fillna(
  value,
  inplace=False)
```

where: 

- value: the value used to replace the NULL values. This can be a single scalar, or a dict, Series, or DataFrame that gives a separate fill value for each column.
- inplace: if True, the replacement is done on the current DataFrame. If False (the default), a copy with the replacements is returned.
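As a small illustration of the `value` parameter, the sketch below fills two columns at once by passing a dict. The `toy` frame and its values are made up for the example and only stand in for the NBA data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the NBA data
toy = pd.DataFrame({
    "College": ["Texas", np.nan, "Marquette"],
    "Salary": [7730337.0, np.nan, 6796117.0],
})

# A dict maps each column name to its own fill value
filled = toy.fillna({
    "College": "Unknown",
    "Salary": toy["Salary"].median(),
})

print(filled)

# inplace was left at its default (False), so toy still has its nulls
print(toy.isna().sum())
```

Because `inplace` is not set, `toy` is unchanged and the filled copy has to be assigned to a name, which is the same pattern the cells in this section use.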


In [45]:
# replace null colleges with "Unknown"
nba['College'] = nba['College'].fillna('Unknown', inplace=False)
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unknown,5000000.0


In [46]:
# replace null Salary with median

nba['Salary'] = nba['Salary'].fillna(
    nba['Salary'].median().round(2),
    inplace=False)
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,2839073.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unknown,5000000.0


In [47]:
# -- check -- 
nba.isna().sum(axis=0)

Name        1
Team        1
Number      1
Position    1
Age         1
Height      1
Weight      1
College     0
Salary      0
dtype: int64

## Dropna()


---


The dropna method in Pandas is used to remove missing or null values from a data frame. By default, it removes any row that contains at least one missing value, but it can also be configured to remove columns with missing values or to remove rows only if all the values in the row are missing. The basic syntax for using the dropna method is as follows:

```python
df.dropna(
  axis=0, 
  how='any', 
  subset=None, 
  inplace=False
  )
```

where:

- axis: The axis along which the missing values are to be removed. 0 refers to rows, and 1 refers to columns.
- how: Specifies when to remove missing values. The default is 'any', which removes any row that contains at least one missing value. The other possible value is 'all', which removes only rows where all the values are missing.
- subset: Specifies a subset of columns to consider when removing missing values.
- inplace: If True, the data frame is modified in place, and nothing is returned. If False (default), a new data frame with the missing values removed is returned.
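To make the `how` and `subset` options concrete, here is a minimal sketch on a made-up frame. The column names mirror the NBA data, but the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 2 is entirely null
toy = pd.DataFrame({
    "Name": ["A", "B", np.nan, "D"],
    "College": ["Texas", np.nan, np.nan, "Duke"],
    "Salary": [1.0, 2.0, np.nan, np.nan],
})

# how='any': drop a row if any field is null
any_null = toy.dropna(axis=0, how="any")

# how='all': drop a row only if every field is null
all_null = toy.dropna(axis=0, how="all")

# subset: only look at the Salary column when deciding what to drop
salary_only = toy.dropna(subset=["Salary"])

print(any_null.index.tolist())     # [0]
print(all_null.index.tolist())     # [0, 1, 3]
print(salary_only.index.tolist())  # [0, 1]
```

Using `subset` is often the gentler choice, since a null in a column you do not need for the analysis no longer costs you the whole row.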

In [48]:
# how = any, any record with a null 
# how = all, only drop if all fields in record are null 
nba_filtered = nba.dropna(axis=0, how='any')

In [49]:
# -- check -- 
nba_filtered.isna().sum(axis=0)

Name        0
Team        0
Number      0
Position    0
Age         0
Height      0
Weight      0
College     0
Salary      0
dtype: int64

# 2. Adding new columns


---

We've already seen how to create new columns a few times. Let's formalize this now. 

1. create a new column that is a constant 
2. create a new column positionally 
3. create a new column using a formula 
4. create a new column with conditional logic using np.where()


By default, new columns are added to the end of the data frame.

### create a new column with a constant


---



In [50]:
nba['sport'] = "Basketball"
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,sport
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,2839073.0,Basketball
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,Basketball
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unknown,5000000.0,Basketball


### create a new column positionally


---


We can insert a new column anywhere in the data frame using `insert()` and passing the positional index as `loc`. Here we add a new column at the first position of the data frame.

In [51]:
# -- insert a new column at the beginning
nba.insert(loc=0, 
           column='sport_name',
           value='Basketball')
nba.head()


Unnamed: 0,sport_name,Name,Team,Number,Position,Age,Height,Weight,College,Salary,sport
0,Basketball,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Basketball,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,Basketball,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,2839073.0,Basketball
3,Basketball,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,Basketball
4,Basketball,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unknown,5000000.0,Basketball
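Two behaviors of `insert()` are easy to miss: it modifies the data frame in place and returns `None`, and it raises a `ValueError` if the column name already exists. The sketch below shows both on a made-up two-row frame (the `toy` data and the `Team` values are hypothetical).

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Name": ["A", "B"], "Salary": [1.0, 2.0]})

# insert() changes toy directly; there is nothing useful to assign
result = toy.insert(loc=1, column="Team", value=["X", "Y"])
print(result)                # None
print(toy.columns.tolist())  # ['Name', 'Team', 'Salary']

# Inserting a column name that already exists raises ValueError
duplicate_error = None
try:
    toy.insert(loc=0, column="Team", value="Z")
except ValueError as err:
    duplicate_error = str(err)

print(duplicate_error)
```

This is why re-running the `insert` cell above a second time fails unless the column is dropped first.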


### create a new column using a formula


---


Creating new columns with simple formulas is easy enough.

In [52]:
# -- create new column weight_kg by dividing Weight (in pounds) by 2.205
nba['weight_kg'] = nba['Weight'] / 2.205
# -- create new column salary_by_age: salary per year of age
nba['salary_by_age'] = nba['Salary'] / nba['Age']
nba.head()

Unnamed: 0,sport_name,Name,Team,Number,Position,Age,Height,Weight,College,Salary,sport,weight_kg,salary_by_age
0,Basketball,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball,81.63,309213.48
1,Basketball,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball,106.58,271844.68
2,Basketball,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,2839073.0,Basketball,92.97,105150.85
3,Basketball,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,Basketball,83.9,52210.91
4,Basketball,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unknown,5000000.0,Basketball,104.76,172413.79


### Conditionally Create a Column 


---


We can also apply conditional logic using `np.where(condition, value_if_true, value_if_false)`, like this:


In [53]:
nba["over_5M_salary"] = np.where(nba["Salary"] > 5000000,"Over","Under")
nba["over_5M_salary"].value_counts()

Under    314
Over     144
Name: over_5M_salary, dtype: int64
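When more than two outcomes are needed, one option is to nest a second `np.where` in the "false" slot. The sketch below is only an illustration: the salaries are a few values from the table above, and the 10M and 5M cutoffs and the band labels are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# A few salaries taken from the table above
salary = pd.Series([25000000.0, 7730337.0, 1148640.0])

# Outer condition is checked first; rows that fail it fall through
# to the inner np.where
band = np.where(
    salary > 10_000_000, "High",
    np.where(salary > 5_000_000, "Mid", "Low"),
)

print(band.tolist())  # ['High', 'Mid', 'Low']
```

Nesting reads fine for three bands, but it gets hard to follow beyond that.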

# 3. Drop a column 


---

`.drop()` removes rows or columns, either by specifying label names together with the corresponding axis, or by naming the index or column labels directly.

```python
df.drop(labels,
  axis=0,        # 0 for rows, 1 for columns
  inplace=True
)

```
- labels: Index or column labels to drop. 
- axis: 0 for rows, 1 for columns 
- inplace: update the existing data frame (True) or return a copy (False)
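"Specifying index or column names directly" refers to the `index=` and `columns=` keywords, which can replace the `labels` plus `axis` pair. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "sport": ["Basketball"] * 3,
    "Age": [25, 27, 22],
})

# These two calls are equivalent ways to drop a column
by_axis = toy.drop(["sport"], axis=1)
by_keyword = toy.drop(columns=["sport"])

# index= drops rows by their index label
without_first_row = toy.drop(index=[0])

print(by_keyword.columns.tolist())       # ['Name', 'Age']
print(without_first_row.index.tolist())  # [1, 2]
```

None of these calls set `inplace=True`, so `toy` itself still has all three columns and all three rows.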


In [54]:
nba.drop(['sport','sport_name','weight_kg'],  # list of columns to drop
         axis=1,       # dealing with columns not rows 
         inplace=True) # do it inplace 

nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,salary_by_age,over_5M_salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,309213.48,Over
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,271844.68,Over
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,2839073.0,105150.85,Under
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,52210.91,Under
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unknown,5000000.0,172413.79,Under


# 4. Drop Rows

---

Here we have two methods. The first and easiest is to simply filter out the rows you want removed, using `query()` or other conditional filtering. The second is to use the `.drop()` method, passing in the row index label(s) with `axis=0`.



### Method 1 simple filter


---
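Let's remove anyone whose player number is 0. This is quite easy: simply return a dataset where Number != 0 or Number > 0. As you'll see, the two give different row counts. The data contains one row with a null Number; a comparison such as `> 0` is False for a null, so that row is dropped, while `!= 0` is True for a null, so that row is kept.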


In [55]:
print("shape returns rows and column count \nthe NBA dataset contains the following:")
print(nba.shape)

# create a filtered data frame 
print("Number > 0")
nba_filtered = nba[nba['Number'] > 0]
print(nba_filtered.shape)

# create a filtered data frame with query() instead
print("Number != 0")
nba_filtered = nba.query('Number != 0')
print(nba_filtered.shape)



shape returns rows and column count 
the NBA dataset contains the following:
(458, 11)
Number > 0
(437, 11)
Number != 0
(438, 11)
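The one-row difference between the two filters comes from how nulls behave in comparisons. The sketch below reproduces it on a made-up three-value Series; it also shows one way to exclude both zeros and nulls explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical jersey numbers: a zero, a normal value, and a null
number = pd.Series([0.0, 99.0, np.nan])

greater_than_zero = number > 0   # NaN > 0 is False, so the null is dropped
not_equal_zero = number != 0     # NaN != 0 is True, so the null is kept

print(greater_than_zero.tolist())  # [False, True, False]
print(not_equal_zero.tolist())     # [False, True, True]

# To be explicit, combine the comparison with a null check
non_zero_known = number.notna() & (number != 0)
print(non_zero_known.tolist())     # [False, True, False]
```

Which filter is "right" depends on whether a missing number should count as "not zero" for your analysis.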


### Method 2 drop()


---



In [56]:
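# drop rows in place by passing their index labels
# (no player has a negative number, so nothing is dropped and the shape is unchanged)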
nba.drop(nba[nba['Number'] < 0].index, inplace = True)
nba.shape

(458, 11)

In [57]:
# -- this returns the index labels of the rows where Number < 2
nba[nba['Number'] < 2].index

Int64Index([  0,  22,  24,  41,  47,  57,  73, 116, 120, 122, 141, 152, 164,
            172, 174, 188, 191, 205, 210, 227, 242, 248, 266, 272, 285, 291,
            295, 323, 329, 334, 339, 347, 356, 370, 383, 393, 394, 401, 426,
            436],
           dtype='int64')

# 5. Ranking


---

The `rank()` method computes numerical ranks (1 through n) along an axis. By default, equal values are assigned a rank that is the average of the ranks of those values.

```python

df["column_to_rank"].rank(ascending=False)
```
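The averaging of tied ranks is controlled by the `method` parameter of `rank()`. The sketch below compares the default with two alternatives, `'min'` and `'dense'`, on four of the lowest salaries from the data, including the tied pair.

```python
import pandas as pd

# Four low salaries from the data; the middle two are tied
salary = pd.Series([30888.0, 55722.0, 55722.0, 83397.0])

average_rank = salary.rank()                # ties share the average rank
min_rank = salary.rank(method="min")        # ties share the lowest rank
dense_rank = salary.rank(method="dense")    # like min, but with no gaps after ties
descending = salary.rank(ascending=False)   # largest value gets rank 1

print(average_rank.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())      # [1.0, 2.0, 2.0, 4.0]
print(dense_rank.tolist())    # [1.0, 2.0, 2.0, 3.0]
print(descending.tolist())    # [4.0, 2.5, 2.5, 1.0]
```

The default explains the 2.5 values you will see for tied salaries in the ranking output.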


### Smallest to Largest 

---



In [58]:
# rank defaults to smallest to largest 
nba["salary_rank_smallest_to_largest"] = nba['Salary'].rank()
nba[["Name","Team","Age","Salary","salary_rank_smallest_to_largest"]].sort_values("salary_rank_smallest_to_largest").head(5)

Unnamed: 0,Name,Team,Age,Salary,salary_rank_smallest_to_largest
32,Thanasis Antetokounmpo,New York Knicks,23.0,30888.0,1.0
291,Orlando Johnson,New Orleans Pelicans,27.0,55722.0,2.5
130,Phil Pressey,Phoenix Suns,25.0,55722.0,2.5
135,Alan Williams,Phoenix Suns,23.0,83397.0,4.0
175,Jordan McRae,Cleveland Cavaliers,25.0,111196.0,5.0


###  Largest to Smallest


---



In [59]:
# ascending=False ranks from largest to smallest
nba["salary_rank_largest_to_smallest"] = nba['Salary'].rank(ascending=False)
nba[["Name","Team","Age","Salary","salary_rank_largest_to_smallest","salary_rank_smallest_to_largest"]].sort_values("salary_rank_largest_to_smallest").head(5)


Unnamed: 0,Name,Team,Age,Salary,salary_rank_largest_to_smallest,salary_rank_smallest_to_largest
109,Kobe Bryant,Los Angeles Lakers,37.0,25000000.0,1.0,458.0
169,LeBron James,Cleveland Cavaliers,31.0,22970500.0,2.0,457.0
33,Carmelo Anthony,New York Knicks,32.0,22875000.0,3.0,456.0
251,Dwight Howard,Houston Rockets,30.0,22359364.0,4.0,455.0
339,Chris Bosh,Miami Heat,32.0,22192730.0,5.0,454.0
