# DataFrames I

In [1]:
import pandas as pd

---

## Methods and Attributes shared between Series and DataFrames
- A **DataFrame** is a 2-dimensional table consisting of rows and columns.
- Pandas uses a `NaN` designation for cells that have a missing value. It is short for "not a number". Most operations on `NaN` values will produce `NaN` values.
- Like with a **Series**, Pandas assigns an index position/label to each **DataFrame** row.
- The **DataFrame** and **Series** have common and exclusive methods/attributes.
- The `hasnans` attribute exists only on a **Series**. The `columns` attribute exists only on a **DataFrame**.
- Some methods/attributes will return different types of data.
- The `info` method returns a summary of the pandas object.

In [2]:
nba = pd.read_csv("data_files/nba.csv")

"""
NaN marks a value that is unavailable or does not exist.

A final row made up entirely of NaN values is a common issue, especially
with old or imperfect data.
"""

nba.tail()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0
591,,,,,,,


In [3]:
nba.values  # 2D arrays

nba.index

RangeIndex(start=0, stop=592, step=1)

In [4]:
nba.shape
# (rows, columns)

(592, 7)

In [5]:
# NOTE: If even ONE value in an integer column is NaN, pandas stores the
#       entire column as floats, because NaN is itself a float.
#       To get integers back, clean the data first. For example NaN -> -9999

nba.dtypes

Name         object
Team         object
Position     object
Height       object
Weight      float64
College      object
Salary      float64
dtype: object
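
The upcasting described in the note above can be reproduced on a tiny Series. This is a minimal sketch with made-up weights, not values from `nba.csv`. It also shows `"Int64"` (capital I), pandas' nullable integer dtype, as one possible alternative to a sentinel such as -9999.

```python
import pandas as pd

# Whole numbers only: pandas infers an integer dtype.
complete = pd.Series([180, 190, 220])

# One missing value: NaN is a float, so the whole Series becomes float64.
with_gap = pd.Series([180, 190, None])

# The nullable "Int64" dtype keeps integers and marks the gap as <NA>.
nullable = with_gap.astype("Int64")

print(complete.dtype)
print(with_gap.dtype)
print(nullable.dtype)
```

Whether a sentinel value or a nullable dtype is the better fit depends on what later code expects from the column.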

In [6]:
nba.columns

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')

In [7]:
# axes is a list holding two objects: the row index (.index)
# followed by the column index (.columns)

nba.axes

[RangeIndex(start=0, stop=592, step=1),
 Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')]

In [8]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592 entries, 0 to 591
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    488 non-null    float64
dtypes: float64(2), object(5)
memory usage: 32.5+ KB


---

## Differences between Shared Methods
- The `sum` method adds a **Series's** values.
- On a **DataFrame**, the `sum` method defaults to adding the values by traversing the index (row values).
- The `axis` parameter customizes the direction that we add across. Pass `"columns"` or `1` to add "across" the columns.

In [9]:
rev = pd.read_csv("data_files/revenue.csv", index_col="Date")

rev.head()

Date,New York,Los Angeles,Miami
1/1/26,985,122,499
1/2/26,738,788,534
1/3/26,14,20,933
1/4/26,730,904,885
1/5/26,114,71,253


In [10]:
# By default, sum() returns a Series with one total per column

rev.sum()["New York"]
rev.sum()

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [11]:
rev.sum(axis="index")  # Default

rev.sum(axis="columns")

Date
1/1/26     1606
1/2/26     2060
1/3/26      967
1/4/26     2519
1/5/26      438
1/6/26     1935
1/7/26     1234
1/8/26     2313
1/9/26     2623
1/10/26     555
dtype: int64
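
To see both directions side by side, here is a small sketch on a hypothetical two-day frame (the `sales` name and its figures are invented for illustration, not taken from `revenue.csv`).

```python
import pandas as pd

# Hypothetical revenue figures, invented for illustration.
sales = pd.DataFrame(
    {"New York": [10, 20], "Miami": [1, 2]},
    index=["day 1", "day 2"],
)

# Default (axis="index"): walk down the rows, one total per column.
per_city = sales.sum()

# axis="columns": walk across the columns, one total per row.
per_day = sales.sum(axis="columns")

print(per_city)
print(per_day)
```

The result of `sum()` is labeled by column name, while the result of `sum(axis="columns")` is labeled by the row index.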

---

## Select One Column from a DataFrame
- We can use attribute syntax (`df.column_name`) to select a column from a DataFrame. **The syntax will not work if the column name has spaces.**
- We can also use square bracket syntax (`df["column name"]`) which will work for any column name.
- Pandas extracts a column from a **DataFrame** as a **Series**.
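- In pandas versions without Copy-on-Write, the **Series** can be a view, so changes to the **Series** can affect the **DataFrame**. With Copy-on-Write (opt-in in pandas 2.x and the default from pandas 3.0), the **Series** behaves as a copy instead.
- Pandas displays a warning if you mutate the extracted **Series** through chained assignment. Use the `copy` method to create an independent duplicate.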

In [12]:
nba.columns

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')

In [13]:
nba.Team
nba.Height
nba.Salary
nba.Name

0             Saddiq Bey
1      Bogdan Bogdanovic
2            Kobe Bufkin
3           Clint Capela
4         Bruno Fernando
             ...        
587         Ryan Rollins
588        Landry Shamet
589     Tristan Vukcevic
590         Delon Wright
591                  NaN
Name: Name, Length: 592, dtype: object

In [14]:
nba["Team"]
nba["Salary"]
nba["Height"]
nba["Name"]

0             Saddiq Bey
1      Bogdan Bogdanovic
2            Kobe Bufkin
3           Clint Capela
4         Bruno Fernando
             ...        
587         Ryan Rollins
588        Landry Shamet
589     Tristan Vukcevic
590         Delon Wright
591                  NaN
Name: Name, Length: 592, dtype: object

### Editing a view

In [15]:
# NOTE: In pandas versions without Copy-on-Write, nba["Name"] can be a view,
# so this chained assignment may change the df. Pandas warns because it
# cannot guarantee whether a view or a copy is being edited.
# To avoid the warning, work on an explicit copy.

nba["Name"].iloc[0] = "New Name"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nba["Name"].iloc[0] = "New Name"


In [16]:
nba["Name"].head(3)

0             New Name
1    Bogdan Bogdanovic
2          Kobe Bufkin
Name: Name, dtype: object

### Creating a copy

In [17]:
names = nba["Name"].copy()

In [18]:
names.iloc[0] = "We have changed the copy, not the view"

In [19]:
names.head(3)

0    We have changed the copy, not the view
1                         Bogdan Bogdanovic
2                               Kobe Bufkin
Name: Name, dtype: object

In [20]:
nba["Name"].head(3)

0             New Name
1    Bogdan Bogdanovic
2          Kobe Bufkin
Name: Name, dtype: object
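
If the goal is to change the **DataFrame** itself, one commonly recommended alternative to the chained assignment above is a single `.loc` call on the frame. The sketch below uses a hypothetical three-row `players` frame rather than the real `nba` data, and contrasts that write with an edit made on an explicit copy.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the nba data.
players = pd.DataFrame({"Name": ["A", "B", "C"], "Weight": [200, 210, 220]})

# A single .loc call on the DataFrame writes to the frame directly,
# so no intermediate Series is involved.
players.loc[0, "Name"] = "New Name"

# An explicit copy is independent of the frame.
names = players["Name"].copy()
names.iloc[0] = "Changed only in the copy"

print(players["Name"].tolist())
print(names.tolist())
```

Here the row label `0` and the column label `"Name"` are passed together, so pandas knows exactly which cell of the original frame to update.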

---

## Select Multiple Columns from a DataFrame
- Use square brackets with a list of names to extract multiple **DataFrame** columns.
- Pandas stores the result in a new **DataFrame** (a copy).

In [21]:
nba[ ["Name", "Team", "College"] ].head()

Unnamed: 0,Name,Team,College
0,New Name,Atlanta Hawks,Villanova
1,Bogdan Bogdanovic,Atlanta Hawks,Fenerbahce
2,Kobe Bufkin,Atlanta Hawks,Michigan
3,Clint Capela,Atlanta Hawks,Elan Chalon
4,Bruno Fernando,Atlanta Hawks,Maryland
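
The difference between one pair of brackets and two is easy to miss, so here is a short sketch on a hypothetical `players` frame (invented data). A string selects a **Series**, while a list selects a **DataFrame**, even when the list holds a single name.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
players = pd.DataFrame(
    {"Name": ["A", "B"], "Team": ["X", "Y"], "College": ["M", "N"]}
)

one = players["Name"]               # a string -> Series
many = players[["College", "Name"]]  # a list -> DataFrame, in the list's order
still_frame = players[["Name"]]     # a one-item list is still a DataFrame

print(type(one).__name__, type(many).__name__, type(still_frame).__name__)
print(many.columns.tolist())
```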


---

## Add New Column to DataFrame
- Use square bracket extraction syntax with an equal sign to add a new **Series** to a **DataFrame**.
- The `insert` method allows us to insert a new column at a specific column position.
- On the right-hand side, we can reference an existing **DataFrame** column and perform a broadcasting operation on it to create the new **Series**.

In [22]:
nba = pd.read_csv("data_files/nba.csv")

# nba["Sport"] = "Basketball"

# loc (location) is the position at which the new column is inserted.
# In this case "Name" is at position 0, "Team" at position 1, etc.
# The new column will NOT replace the existing column at that position;
# it pushes that column and everything after it to the right.

# column is the name of the new column.

# value is the data. A single value is repeated for every row.

nba.insert(loc=3, column="Sport", value="Basketball")

In [23]:
nba.tail()

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary
587,Ryan Rollins,Washington Wizards,G,Basketball,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,Basketball,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,Basketball,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,Basketball,6-5,185.0,Utah,8195122.0
591,,,,Basketball,,,,


In [24]:
def pay_per_pound(salary, weight):
    """ Calc how much money they make per pound of body weight """
    return round(salary / weight)

In [25]:
nba["Pay per pound"] = pay_per_pound(nba["Salary"], nba["Weight"])

In [26]:
nba.head()

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary,Pay per pound
0,Saddiq Bey,Atlanta Hawks,F,Basketball,6-7,215.0,Villanova,4556983.0,21195.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,Basketball,6-5,225.0,Fenerbahce,18700000.0,83111.0
2,Kobe Bufkin,Atlanta Hawks,G,Basketball,6-5,195.0,Michigan,4094244.0,20996.0
3,Clint Capela,Atlanta Hawks,C,Basketball,6-10,256.0,Elan Chalon,20616000.0,80531.0
4,Bruno Fernando,Atlanta Hawks,F-C,Basketball,6-10,240.0,Maryland,2581522.0,10756.0
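
The two ways of adding a column can be compared on a small hypothetical frame. In this sketch the `players` data and the `"Weight kg"` column are invented for illustration, and 0.4536 is an approximate pounds-to-kilograms factor.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
players = pd.DataFrame({"Name": ["A", "B"], "Weight": [200.0, 250.0]})

# Bracket assignment appends the new column at the end.
# The scalar 0.4536 is broadcast to every row of the Weight column.
players["Weight kg"] = players["Weight"] * 0.4536

# insert() places the column at a chosen position and changes the frame in place.
players.insert(loc=1, column="Sport", value="Basketball")

print(players.columns.tolist())
```

Bracket assignment is the shorter form when the column position does not matter; `insert` is useful when it does.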


---

## A Review of the value_counts Method
- The `value_counts` method counts the number of times that each unique value occurs in a **Series**.

In [27]:
nba["Team"].value_counts().head()

Team
Dallas Mavericks     23
Denver Nuggets       22
Miami Heat           22
Milwaukee Bucks      22
Memphis Grizzlies    22
Name: count, dtype: int64

In [28]:
# Setting normalize=True returns each value's share as a fraction of 1,
# then multiply by 100 to get the percentage
nba_position = nba["Position"].value_counts(normalize=True).mul(100)

round(nba_position, 2)

Position
G      39.21
F      32.02
C       8.05
G-F     7.88
F-C     6.34
C-F     3.94
F-G     2.57
Name: proportion, dtype: float64

In [29]:
nba["Salary"].value_counts()

Salary
559782.0      59
1119563.0     27
3196448.0     13
2019706.0      9
1719864.0      8
              ..
12119440.0     1
7977420.0      1
5266713.0      1
2624028.0      1
3918480.0      1
Name: count, Length: 298, dtype: int64
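
By default `value_counts` leaves missing values out of the tally. The following sketch, on a made-up `positions` Series, shows the default counts next to the `dropna=False` and `normalize=True` variants.

```python
import pandas as pd

# Made-up positions with one missing entry.
positions = pd.Series(["G", "F", "G", None, "C", "G"])

counts = positions.value_counts()                    # missing value is skipped
with_missing = positions.value_counts(dropna=False)  # missing value gets its own row
shares = positions.value_counts(normalize=True)      # fractions of the non-missing rows

print(counts)
print(with_missing)
print(shares)
```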

---

## Drop Rows with Missing Values
- Pandas uses a `NaN` designation for cells that have a missing value.
- The `dropna` method deletes rows with missing values. Its default behavior is to remove a row if it has *any* missing values.
- Pass the `how` parameter an argument of "all" to delete rows where all the values are `NaN`.
- The `subset` parameter limits the columns that pandas checks when dropping rows with missing values.

In [30]:
nba.tail(2)

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary,Pay per pound
590,Delon Wright,Washington Wizards,G,Basketball,6-5,185.0,Utah,8195122.0,44298.0
591,,,,Basketball,,,,,


In [31]:
# This will drop the whole row if ANY value in any column is missing

nba.dropna().tail(2)

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary,Pay per pound
588,Landry Shamet,Washington Wizards,G,Basketball,6-4,190.0,Wichita State,10250000.0,53947.0
590,Delon Wright,Washington Wizards,G,Basketball,6-5,185.0,Utah,8195122.0,44298.0


In [32]:
# Housekeeping: the Sport column inserted earlier gave the empty last row
# a value, so reset it to NaN to make that row entirely missing again
import numpy
nba.loc[591, "Sport"] = numpy.nan



# This will only drop the row if ALL the data is missing
nba.dropna(how="all").tail(3)

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary,Pay per pound
588,Landry Shamet,Washington Wizards,G,Basketball,6-4,190.0,Wichita State,10250000.0,53947.0
589,Tristan Vukcevic,Washington Wizards,F,Basketball,6-10,220.0,Real Madrid,,
590,Delon Wright,Washington Wizards,G,Basketball,6-5,185.0,Utah,8195122.0,44298.0


In [33]:
# Subset is a list of column names that pandas will search within.

# In other words, the code below will drop any rows that have a NaN value
# inside the College OR Salary column.
# NOTE: You MUST assign to a new variable OR use inplace=True to change the original df
nba_copy = nba.dropna(subset=["College", "Salary"])

len(nba)  # 592
len(nba_copy)  # 476

476
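
The three `dropna` variants are easier to compare on a frame small enough to check by eye. This is a sketch on a hypothetical four-row `players` frame (invented data), where each call keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 each miss one value,
# and row 3 is entirely missing.
players = pd.DataFrame(
    {
        "Name": ["A", "B", "C", np.nan],
        "College": ["M", np.nan, "N", np.nan],
        "Salary": [100.0, 200.0, np.nan, np.nan],
    }
)

any_missing = players.dropna()                    # drop rows with ANY NaN
all_missing = players.dropna(how="all")           # drop rows where ALL values are NaN
by_college = players.dropna(subset=["College"])   # only look at the College column

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(by_college.index.tolist())
```

None of these calls changes `players`; each returns a new **DataFrame**.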

---

## Fill in Missing Values with the fillna Method
- The `fillna` method replaces missing `NaN` values with its argument.
- The `fillna` method is available on both **DataFrames** and **Series**.
- An extracted **Series** is a view on the original **DataFrame**, but the `fillna` method returns a copy.

In [34]:
nba.dropna(how="all", inplace=True)

In [37]:
# Every single NaN will be filled with 0
nba.fillna(0).tail(3)

# To target a specific column, use the simple code below.
# NOTE: This use of fillna will return a COPY
# So you will need to overwrite the data with the new Series
no_na_salary = nba["Salary"].fillna(0)
nba["Salary"] = no_na_salary

no_college = nba["College"].fillna("unknown")
nba["College"] = no_college
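
As an alternative to filling one extracted **Series** at a time, `fillna` on a **DataFrame** also accepts a dictionary that maps column names to replacement values. A minimal sketch on a hypothetical two-row frame (invented data):

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
players = pd.DataFrame(
    {"College": ["M", np.nan], "Salary": [100.0, np.nan]}
)

# Each key is a column name; each value is that column's replacement.
filled = players.fillna({"College": "unknown", "Salary": 0})

print(filled)
```

As with the **Series** version, the result is a new object, so `players` still contains its `NaN` values afterwards.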

---

## The astype Method I
- The `astype` method converts a **Series's** values to a specified type.
- Pass in the specified type as either a string or the core Python data type.
- Pandas cannot convert `NaN` values to an integer type, so we need to eliminate/replace them before we perform the conversion.
- The `dtypes` attribute returns a **Series** with the **DataFrame's** columns and their types.

In [56]:
"""
You'll notice that the values in the Salary column are floats even
though they are ints in the csv file. This is because a few of them
were NaN on import, so pandas converted the whole column to float.
Now we must change it back to int.
"""

nba = pd.read_csv("data_files/nba.csv").dropna(how="all")
nba["Salary"] = nba["Salary"].fillna(0)

#nba.dtypes["Salary"]  # float64

# NOTE: This returns a copy so you'll need to set it manually
nba["Salary"] = nba["Salary"].astype(int)

#nba.dtypes["Salary"]  # int64

In [69]:
nba["Weight"] = nba["Weight"].fillna(0)
nba["Weight"] = nba["Weight"].astype(int)

nba.dtypes["Weight"]

dtype('int64')
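
To illustrate why the `fillna` step has to come first, the sketch below attempts the conversion on a made-up `weights` Series that still contains a `NaN`. It catches `ValueError`, which recent pandas versions raise in the form of the more specific `IntCastingNaNError` subclass.

```python
import numpy as np
import pandas as pd

# Made-up weights with one missing value.
weights = pd.Series([180.0, np.nan, 220.0])

conversion_failed = False
try:
    weights.astype(int)
except ValueError as error:
    # NaN has no integer representation, so pandas refuses the cast.
    conversion_failed = True
    print("conversion failed:", error)

# Replacing the NaN first makes the conversion possible.
converted = weights.fillna(0).astype(int)
print(converted.tolist())
```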

---

## The astype Method II
- The `category` type is ideal for columns with a limited number of unique values.
- On a **DataFrame**, the `nunique` method returns a **Series** with the number of unique values in each column.
- With categories, pandas does not create a separate value in memory for each "cell". Rather, the cells point to a single copy for each unique value.

In [86]:
"""
Columns such as Team or College (or, in other datasets, gender or blood type)
contain many duplicate values. Converting these columns to the category type
can save memory and speed up some operations, because pandas no longer
stores the same value over and over.

The way it works: pandas stores each unique value, such as the college name
Michigan, once. Every row holding that value then points to that single
stored copy instead of holding a new value for each instance.
"""

#nba.info()  # memory usage: 36.9+ KB

nba.nunique()

Name        591
Team         30
Position      7
Height       20
Weight       94
College     182
Salary      299
dtype: int64

In [94]:
# To create a category column, simply pass "category" to the astype method
nba["Team"] = nba["Team"].astype("category")
nba["Position"] = nba["Position"].astype("category")
nba["College"] = nba["College"].astype("category")


nba.info()  # memory usage: 32.5+ KB (about 12% less than 36.9+ KB)

<class 'pandas.core.frame.DataFrame'>
Index: 591 entries, 0 to 590
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Name      591 non-null    object  
 1   Team      591 non-null    category
 2   Position  584 non-null    category
 3   Height    585 non-null    object  
 4   Weight    591 non-null    int64   
 5   College   578 non-null    category
 6   Salary    591 non-null    int64   
dtypes: category(3), int64(2), object(2)
memory usage: 32.5+ KB
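
The saving can be measured per column with `memory_usage(deep=True)`. The sketch below uses a made-up `teams` Series with three team names repeated 1,000 times; exact byte counts vary by pandas version and platform, so it prints them rather than assuming specific numbers.

```python
import pandas as pd

# Made-up column: 3 unique labels repeated across 3,000 rows.
teams = pd.Series(["Atlanta Hawks", "Miami Heat", "Denver Nuggets"] * 1000)
as_category = teams.astype("category")

plain_bytes = teams.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# The unique labels are stored once...
print(as_category.cat.categories.tolist())
# ...and each row holds only a small integer code pointing at its label.
print(as_category.cat.codes.dtype)
print(plain_bytes, category_bytes)
```

The fewer unique values a column has relative to its length, the larger the saving tends to be. A column like `Name`, where nearly every value is unique, gains little.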


---

## Sort a DataFrame with the sort_values Method I
- The `sort_values` method sorts a **DataFrame** by the values in one or more columns. The default sort is an ascending one (alphabetical for strings).
- The first parameter (`by`) expects the column(s) to sort by.
- If sorting by a single column, pass a string with its name.
- The `ascending` parameter customizes the sort order.
- The `na_position` parameter customizes where pandas places `NaN` values.
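
The parameters above can be sketched on a hypothetical three-row `players` frame (invented data, not the `nba` file):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary.
players = pd.DataFrame(
    {"Name": ["C", "A", "B"], "Salary": [300.0, np.nan, 100.0]}
)

# Ascending (alphabetical) sort on a single column.
by_name = players.sort_values("Name")

# Descending sort; NaN rows are placed last unless told otherwise.
highest_paid = players.sort_values("Salary", ascending=False)

# Same column, ascending, but with NaN rows moved to the top.
nan_first = players.sort_values("Salary", na_position="first")

print(by_name["Name"].tolist())
print(highest_paid["Name"].tolist())
print(nan_first["Salary"].tolist())
```

`sort_values` returns a new **DataFrame**, so `players` keeps its original row order.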

---

## Sort a DataFrame with the sort_values Method II
- To sort by multiple columns, pass the `by` parameter a list of column names. Pandas will sort in the specified column order (first to last).
- Pass the `ascending` parameter a Boolean to sort all columns in a consistent order (all ascending or all descending).
- Pass `ascending` a list to customize the sort order *per* column. The `ascending` list length must match the `by` list.
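
A multi-column sort can be sketched on a hypothetical four-row frame (team names, player names, and salaries invented for illustration). The second call sorts teams alphabetically and, within each team, salaries from highest to lowest.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
players = pd.DataFrame(
    {
        "Team": ["Hawks", "Heat", "Hawks", "Heat"],
        "Name": ["B", "D", "A", "C"],
        "Salary": [10, 40, 30, 20],
    }
)

# Team first, then Name as the tie-breaker; both ascending.
both_ascending = players.sort_values(by=["Team", "Name"])

# One Boolean per column: Team ascending, Salary descending.
mixed = players.sort_values(by=["Team", "Salary"], ascending=[True, False])

print(both_ascending["Name"].tolist())
print(mixed[["Team", "Salary"]].values.tolist())
```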

---

## Sort a DataFrame by its Index
- The `sort_index` method sorts the **DataFrame** by its index positions/labels.
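
One common use of `sort_index` is restoring the original row order after a sort by values. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data with the default index 0, 1, 2.
players = pd.DataFrame({"Name": ["B", "C", "A"]})

# Sorting by value carries each row's index label along with it.
by_name = players.sort_values("Name")

# Sorting by index puts the rows back in label order.
restored = by_name.sort_index()

# ascending=False reverses the label order.
reversed_rows = by_name.sort_index(ascending=False)

print(by_name.index.tolist())
print(restored.index.tolist())
print(reversed_rows.index.tolist())
```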

---

## Rank Values with the rank Method
- The `rank` method assigns a numeric ranking to each **Series** value.
- By default, pandas assigns tied values the same rank (the average of the positions they occupy), which leaves a "gap" in the sequence of ranks.
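
How ties are ranked depends on the `method` parameter. The sketch below ranks a made-up `salaries` Series from highest to lowest with three different methods; the two values of 300 are tied for first place.

```python
import pandas as pd

# Made-up salaries; the two 300s are tied.
salaries = pd.Series([300, 100, 300, 200])

# Default method="average": the tied values share the mean of ranks 1 and 2.
average_rank = salaries.rank(ascending=False)

# method="min": the tied values both get rank 1, and rank 2 is skipped.
min_rank = salaries.rank(ascending=False, method="min")

# method="dense": like "min", but the next rank follows without a gap.
dense_rank = salaries.rank(ascending=False, method="dense")

print(average_rank.tolist())  # [1.5, 4.0, 1.5, 3.0]
print(min_rank.tolist())      # [1.0, 4.0, 1.0, 3.0]
print(dense_rank.tolist())    # [1.0, 3.0, 1.0, 2.0]
```

The ranks come back as floats because the default averaging can produce fractional values such as 1.5.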