# DataFrames I

In [1]:
import pandas as pd

## Methods and Attributes between Series and DataFrames
____
- A **DataFrame** is a 2-dimensional table consisting of rows and columns.
- Pandas uses a `NaN` designation for cells that have a missing value. It is short for "not a number". Most operations on `NaN` values will produce `NaN` values.
- Like with a **Series**, Pandas assigns an index position/label to each **DataFrame** row.
- The **DataFrame** and **Series** have common and exclusive methods/attributes.
- The `hasnans` attribute exists only on a **Series**. The `columns` attribute exists only on a **DataFrame**.
- Some methods/attributes will return different types of data.
- The `info` method prints a summary of the pandas object (index, columns, non-null counts, types, and memory usage).
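
The points above can be sketched on a tiny, made-up table. The `Name` and `Salary` columns and their values are assumptions for illustration only; they do not come from `nba.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the nba.csv dataset)
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Salary": [100.0, np.nan, 300.0]})
s = df["Salary"]

# NaN propagates through arithmetic: the missing salary stays missing
doubled = s * 2

# A shared attribute can return differently shaped results
s_shape = s.shape      # one dimension
df_shape = df.shape    # two dimensions (rows, columns)

# Exclusive attributes
has_missing = s.hasnans                       # Series only
column_names = list(df.columns)               # DataFrame only
series_has_columns = hasattr(s, "columns")
frame_has_hasnans = hasattr(df, "hasnans")
```

Checking `hasattr` is a safe way to confirm which attributes exist on which object without raising an `AttributeError`.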

In [None]:
nba = pd.read_csv("nba.csv")
nba

In [None]:
s = pd.Series([1, 2, 3, 4, 5])
s

In [None]:
nba.head()
nba.head(n=5)
nba.head(8)

nba.tail()
nba.tail(n=7)
nba.tail(1)

In [None]:
s.index
nba.index

In [None]:
s.values
nba.values

In [None]:
s.shape
nba.shape

In [None]:
s.dtypes
nba.dtypes

In [None]:
s.hasnans
# nba.hasnans

In [None]:
nba.columns
# s.columns

In [None]:
s.axes
nba.axes

In [None]:
s.info()

In [None]:
nba.info()

In [None]:
nba.tail()

## Differences between Shared Methods
___
- The `sum` method adds a **Series's** values.
- On a **DataFrame**, the `sum` method defaults to adding down the index (the rows), which returns one total per column.
- The `axis` parameter customizes the direction that we add across. Pass `"columns"` or `1` to add "across" the columns.

In [None]:
revenue = pd.read_csv("revenue.csv", index_col="Date")
revenue

In [None]:
s = pd.Series([1, 2, 3])
s.sum(axis="index")

The sum method takes a parameter called `axis` which defaults to `0`. This parameter is used to add the values in a specific direction. If you pass `axis=1`, the method will add the values across the columns. The syntax is `df.sum(axis=1)`.
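
As a minimal sketch of the two directions, here is a small made-up table. The city columns and date labels are assumptions; the real `revenue.csv` may look different.

```python
import pandas as pd

# Hypothetical revenue table (column names and values are invented)
df = pd.DataFrame(
    {"New York": [10, 20], "Miami": [1, 2]},
    index=["1/1/26", "1/2/26"],
)

per_column = df.sum()               # axis=0 / "index": one total per column
per_row = df.sum(axis="columns")    # axis=1: one total per row
grand_total = df.sum().sum()        # collapse both axes into a single number
```

The result of `df.sum()` is a **Series** indexed by column name, while `df.sum(axis="columns")` is a **Series** indexed by the original row labels.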

In [None]:
# With no argument, sum uses axis="index" (axis=0) and returns one total per column
revenue.sum()
revenue.sum(axis="index")
revenue.sum(axis="columns")

In [None]:
# Sum down the rows (one total per column). axis="index" is the same as axis=0
revenue.sum(axis="index")

In [None]:
revenue.sum(axis=0)

In [None]:
# Sum across the columns (one total per row). axis="columns" is the same as axis=1
revenue.sum(axis="columns")

In [None]:
revenue.sum(axis=1)

In [None]:
# Chain two sum calls to add every value in the DataFrame into a single total
revenue.sum().sum()

## Select One Column from a DataFrame
____
- We can use attribute syntax (`df.column_name`) to select a column from a **DataFrame**. The syntax will not work if the column name has spaces.
- We can also use square bracket syntax (`df["column name"]`) which will work for any column name.
- Pandas extracts a column from a **DataFrame** as a **Series**.
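- In older pandas versions the **Series** can be a view, so changes to the **Series** *can* affect the **DataFrame**. With Copy-on-Write (the default from pandas 3.0), they do not.
- Older pandas versions may display a warning if you mutate the **Series**. Use the `copy` method to create an independent duplicate.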
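
A small sketch of both selection syntaxes and of the `copy` method, using invented names and teams rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Team": ["Hawks", "Nets"]})

# Attribute syntax and bracket syntax select the same column
same_column = df.Team.equals(df["Team"])
column_type = type(df["Team"])

# An explicit copy is independent of the DataFrame
names = df["Name"].copy()
names.iloc[0] = "Whatever"

copy_first = names.iloc[0]
original_first = df["Name"].iloc[0]
```

Because `names` is an explicit copy, the original column is unchanged regardless of the pandas version in use.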

In [None]:
nba = pd.read_csv("nba.csv")
nba.head()

In [None]:
nba.Team
nba.Salary
nba.Name
# nba.name
type(nba.Name)

In [None]:
nba["Team"]
nba["Salary"]

In [None]:
names = nba["Name"].copy()
names

In [33]:
names.iloc[0] = "Whatever"

In [None]:
names.head()

In [None]:
nba.head()

## Select Multiple Columns from a DataFrame
____
- Use square brackets with a list of names to extract multiple **DataFrame** columns.
- Pandas stores the result in a new **DataFrame** (a copy).

In [None]:
nba = pd.read_csv("nba.csv")
nba.head()

In [None]:
nba[["Name", "Team"]]
nba[["Team", "Name"]]

nba[["Salary", "Team", "Name"]]

columns_to_select = ["Salary", "Team", "Name"]
nba[columns_to_select]

In [None]:
nba[columns_to_select].info()

## Add New Column to DataFrame
____
- Use square bracket extraction syntax with an equal sign to add a new **Series** to a **DataFrame**.
- The `insert` method allows us to insert a new column at a specific column position.
- On the right-hand side, we can reference an existing **DataFrame** column and perform a broadcasting operation on it to create the new **Series**.

In [None]:
nba = pd.read_csv("nba.csv")
nba.head()

In [52]:
nba["Sport"] = "Basketball"

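The `insert` method takes three required parameters: the position of the new column (`loc`), its name (`column`), and the values to insert (`value`). The optional `allow_duplicates` parameter controls whether a column name that already exists is permitted. The syntax is: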

`df.insert(loc=column_index, column='column_name', value=values, allow_duplicates=False)`
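
A minimal sketch contrasting bracket assignment with `insert`, on invented data. Note that `insert` modifies the **DataFrame** in place and returns `None`.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Salary": [100, 200]})

# Bracket assignment appends the new column at the end;
# the single string is broadcast to every row
df["Sport"] = "Basketball"

# insert places the new column at position 1 (the second column)
df.insert(loc=1, column="Salary Doubled", value=df["Salary"].mul(2))

columns = list(df.columns)
```

Running the `insert` line a second time would raise an error because the column already exists and `allow_duplicates` defaults to `False`.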

In [None]:
# nba.insert(loc=3, column="Sport", value="Basketball")
nba

In [53]:
nba["Salary"] * 2
nba["Salary"].mul(2)

nba["Salary Doubled"] = nba["Salary"].mul(2)

In [None]:
nba

In [55]:
nba["Salary"] - 5000000
nba["Salary"].sub(5000000)

nba["New Salary"] = nba["Salary"].sub(5000000)

In [None]:
nba

## A Review of the value_counts Method
- The `value_counts` method counts the number of times that each unique value occurs in a **Series**.

- The syntax is `s.value_counts()`.
- The `normalize` parameter returns the relative frequencies of the unique values instead of raw counts. The `dropna` parameter (default `True`) controls whether missing values are excluded from the count.

- The method returns a **Series** with the unique values as the index and the counts as the values.
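
The effect of `normalize` and `dropna` can be sketched on a short, invented **Series** of positions with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical positions; one value is missing
positions = pd.Series(["G", "F", "G", np.nan, "C", "G"])

counts = positions.value_counts()                     # missing value excluded
shares = positions.value_counts(normalize=True)       # fractions of the 5 non-missing values
with_missing = positions.value_counts(dropna=False)   # missing value counted too
```

With `normalize=True` the fractions are computed over the non-missing values only, so they still sum to 1.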

In [None]:
nba = pd.read_csv("nba.csv")
nba.head()

In [None]:
nba["Team"].value_counts()

nba["Position"].value_counts()
nba["Position"].value_counts(normalize=True)
nba["Position"].value_counts(normalize=True) * 100

nba["Salary"].value_counts()

## Drop Rows with Missing Values
- Pandas uses a `NaN` designation for cells that have a missing value.
- The `dropna` method deletes rows with missing values. Its default behavior is to remove a row if it has *any* missing values.
- Pass the `how` parameter an argument of "all" to delete rows where all the values are `NaN`.
- The `subset` parameter customizes/limits the columns that pandas will use to drop rows with missing values.

Two commonly used parameters of the `dropna` method are `how` and `subset`. The `how` parameter defaults to `"any"` and accepts `"any"` or `"all"`. The `subset` parameter takes a list of column names to consider when dropping rows. The syntax is:

`df.dropna(how='any', subset=['column_name'])`
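
A small sketch of how `how` and `subset` change which rows survive. The three-row table is invented so that each row has a different number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 lacks a college, row 2 is entirely missing
df = pd.DataFrame({
    "Name": ["Ann", "Bob", np.nan],
    "College": ["Duke", np.nan, np.nan],
    "Salary": [100.0, 200.0, np.nan],
})

any_missing = df.dropna()                     # drops rows 1 and 2
all_missing = df.dropna(how="all")            # drops only row 2
by_college = df.dropna(subset=["College"])    # looks only at College
by_salary = df.dropna(subset=["Salary"])      # looks only at Salary
```

Each call returns a new **DataFrame**; `df` itself still has all three rows.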



In [None]:
nba = pd.read_csv("nba.csv")
nba

In [None]:
nba.dropna()
nba.dropna(how="any")

nba.dropna(how="all")

nba.dropna(subset=["College"])
nba.dropna(subset=["College", "Salary"])

## Fill in Missing Values with the fillna Method
- The `fillna` method replaces missing `NaN` values with its argument.
- The `fillna` method is available on both **DataFrames** and **Series**.
- An extracted **Series** may be a view on the original **DataFrame**, but the `fillna` method returns a new object, so assign the result back to the column to keep it.


The syntax is `df.fillna(value=fill_value)`. The `value` parameter is the value that will replace the missing values. The method returns a copy of the **DataFrame** with the missing values replaced.
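
A minimal sketch, on invented data, showing that `fillna` leaves the original untouched until the result is assigned back:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "College": ["Duke", np.nan],
    "Salary": [100.0, np.nan],
})

filled = df["Salary"].fillna(0)               # a new Series
still_missing = df["Salary"].isna().sum()     # the DataFrame column is unchanged so far

# Assign the results back to keep them
df["Salary"] = df["Salary"].fillna(0)
df["College"] = df["College"].fillna(value="Unknown")
```

Filling column by column lets each column get a replacement that suits its type, such as `0` for numbers and `"Unknown"` for text.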

In [None]:
nba = pd.read_csv("nba.csv").dropna(how="all")
nba

In [None]:
nba.fillna(0)

nba["Salary"] = nba["Salary"].fillna(0)

In [None]:
nba

In [None]:
nba["College"] = nba["College"].fillna(value="Unknown")

In [None]:
nba

## The astype Method I
- The `astype` method converts a **Series's** values to a specified type.
- Pass in the specified type as either a string or the core Python data type.
- Pandas cannot convert `NaN` values to integers (`NaN` is itself a floating-point value), so we need to eliminate/replace them before we perform the conversion.
- The `dtypes` attribute returns a **Series** with the **DataFrame's** columns and their types.
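
The need to handle `NaN` first can be sketched as follows, using an invented salary **Series**. The `try`/`except` is only there to show the failure without stopping the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value (stored as floats)
salary = pd.Series([100.0, np.nan, 300.0])

# Converting directly fails because NaN has no integer representation
try:
    salary.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Replace the missing value first, then convert
clean = salary.fillna(0).astype(int)
```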

In [None]:
nba = pd.read_csv("nba.csv").dropna(how="all")
nba["Salary"] = nba["Salary"].fillna(0)
nba["Weight"] = nba["Weight"].fillna(0)
nba

In [None]:
nba.dtypes

In [None]:
nba["Salary"].astype("int")
nba["Salary"].astype(int)

nba["Salary"] = nba["Salary"].astype(int)

In [None]:
nba["Weight"] = nba["Weight"].astype(int)

In [None]:
nba

## The astype Method II
- The `category` type is ideal for columns with a limited number of unique values.
- On a **DataFrame**, the `nunique` method returns a **Series** with the number of unique values in each column. On a **Series**, it returns a single number.
- With categories, pandas does not create a separate value in memory for each "cell". Rather, the cells point to a single copy for each unique value.
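
A rough sketch of the memory effect, assuming a made-up column of 3,000 team names with only three distinct values. The exact byte counts depend on the pandas version, so only the comparison matters here.

```python
import pandas as pd

# Hypothetical column: three team names repeated 1,000 times each
teams = pd.Series(["Hawks", "Nets", "Bulls"] * 1000)
as_category = teams.astype("category")

unique_count = teams.nunique()

# deep=True includes the memory taken by the strings themselves
string_bytes = teams.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# The unique values are stored once, in the categories
categories = list(as_category.cat.categories)
```

The fewer unique values a column has relative to its length, the larger the saving tends to be.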

In [None]:
nba = pd.read_csv("nba.csv").dropna(how="all")
nba.tail()

In [None]:
nba["Team"].nunique()
nba.nunique()

In [None]:
nba.info()

In [None]:
nba["Position"] = nba["Position"].astype("category")

In [None]:
nba["Team"] = nba["Team"].astype("category")

In [None]:
30/36

## Sort a DataFrame with the sort_values Method I
- The `sort_values` method sorts a **DataFrame** by the values in one or more columns. The default sort is an ascending one (alphabetical for strings).
- The first parameter (`by`) expects the column(s) to sort by.
- If sorting by a single column, pass a string with its name.
- The `ascending` parameter customizes the sort order.
- The `na_position` parameter customizes where pandas places `NaN` values.
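
A small sketch of `ascending` and `na_position` on invented data with one missing salary:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Name": ["Cy", "Ann", "Bob"],
    "Salary": [300.0, np.nan, 100.0],
})

by_name = df.sort_values("Name")                             # alphabetical
highest_first = df.sort_values("Salary", ascending=False)    # NaN still goes last by default
nan_first = df.sort_values("Salary", na_position="first")    # NaN moved to the top
```

Note that `NaN` rows go last by default in both ascending and descending sorts; only `na_position` moves them.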

In [None]:
nba = pd.read_csv("nba.csv")
nba.tail()

In [None]:
nba.sort_values("Name")
nba.sort_values(by="Name")
nba.sort_values(by="Name", ascending=True)
nba.sort_values(by="Name", ascending=False)

nba.sort_values("Salary")
nba.sort_values("Salary", ascending=False)
nba.sort_values("Salary", na_position="last")
nba.sort_values("Salary", na_position="first")
nba.sort_values("Salary", na_position="first", ascending=False)

## Sort a DataFrame with the sort_values Method II
- To sort by multiple columns, pass the `by` parameter a list of column names. Pandas will sort in the specified column order (first to last).
- Pass the `ascending` parameter a Boolean to sort all columns in a consistent order (all ascending or all descending).
- Pass `ascending` a list to customize the sort order *per* column. The `ascending` list length must match the `by` list.
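
A minimal sketch of a multi-column sort with a per-column order, using invented teams and names:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Team": ["Nets", "Hawks", "Nets", "Hawks"],
    "Name": ["Ann", "Bob", "Cy", "Dee"],
})

# Team A-Z, then Name A-Z within each team
both_ascending = df.sort_values(["Team", "Name"])

# Team A-Z, then Name Z-A within each team
mixed = df.sort_values(by=["Team", "Name"], ascending=[True, False])
```

The second column only breaks ties within groups of equal values in the first column.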

In [None]:
nba = pd.read_csv("nba.csv")
nba.tail()

In [None]:
nba.sort_values(["Team", "Name"])
nba.sort_values(by=["Team", "Name"])
nba.sort_values(by=["Team", "Name"], ascending=True)
nba.sort_values(by=["Team", "Name"], ascending=False)

nba.sort_values(by=["Team", "Name"], ascending=[True, False])

nba.sort_values(["Position", "Salary"])
nba.sort_values(["Position", "Salary"], ascending=True)
nba.sort_values(["Position", "Salary"], ascending=False)
nba.sort_values(["Position", "Salary"], ascending=[True, False])
nba.sort_values(["Position", "Salary"], ascending=[False, True])

nba = nba.sort_values(["Position", "Salary"], ascending=[False, True])
nba

## Sort a DataFrame by its Index
- The `sort_index` method sorts the **DataFrame** by its index positions/labels.
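
A short sketch, on invented data, of using `sort_index` to restore the original row order after a sort by values:

```python
import pandas as pd

# Hypothetical toy data with the default 0, 1, 2 index
df = pd.DataFrame({"Name": ["Cy", "Ann", "Bob"]})

shuffled = df.sort_values("Name")                    # rows reordered, labels travel with them
restored = shuffled.sort_index()                     # back to label order
descending = shuffled.sort_index(ascending=False)    # largest label first
```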

In [None]:
nba = pd.read_csv("nba.csv")
nba = nba.sort_values(["Team", "Name"])
nba

In [None]:
nba.sort_index()
nba.sort_index(ascending=True)
nba.sort_index(ascending=False)

nba = nba.sort_index(ascending=False)

In [None]:
nba

## Rank Values with the rank Method
- The `rank` method assigns a numeric ranking to each **Series** value.
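- Pandas will assign the same rank to equal values and create a "gap" in the ranks that follow. By default (`method="average"`), tied values share the average of the positions they occupy, so a rank can be fractional; `astype(int)` then truncates it.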
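
The tie handling can be sketched on four invented salaries, two of which are equal. The `method` parameter of `rank` selects how ties are resolved; `"min"` and `"dense"` are shown here only for comparison with the default.

```python
import pandas as pd

# Hypothetical salaries with a two-way tie for second place
salary = pd.Series([500, 400, 400, 300])

average_rank = salary.rank(ascending=False)                  # default method="average"
truncated = average_rank.astype(int)                         # drops the fractional part
min_rank = salary.rank(ascending=False, method="min")        # ties take the lowest position
dense_rank = salary.rank(ascending=False, method="dense")    # no gap after ties
```

The tied values occupy positions 2 and 3, so the default gives both a rank of 2.5, and the next value is ranked 4.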

In [5]:
nba = pd.read_csv("nba.csv").dropna(how="all")
nba["Salary"] = nba["Salary"].fillna(0).astype(int)
nba

| | Name | Team | Position | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|
| 0 | Saddiq Bey | Atlanta Hawks | F | 6-7 | 215.0 | Villanova | 4556983 |
| 1 | Bogdan Bogdanovic | Atlanta Hawks | G | 6-5 | 225.0 | Fenerbahce | 18700000 |
| 2 | Kobe Bufkin | Atlanta Hawks | G | 6-5 | 195.0 | Michigan | 4094244 |
| 3 | Clint Capela | Atlanta Hawks | C | 6-10 | 256.0 | Elan Chalon | 20616000 |
| 4 | Bruno Fernando | Atlanta Hawks | F-C | 6-10 | 240.0 | Maryland | 2581522 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 586 | Jordan Poole | Washington Wizards | G | 6-4 | 194.0 | Michigan | 27955357 |
| 587 | Ryan Rollins | Washington Wizards | G | 6-3 | 180.0 | Toledo | 1719864 |
| 588 | Landry Shamet | Washington Wizards | G | 6-4 | 190.0 | Wichita State | 10250000 |
| 589 | Tristan Vukcevic | Washington Wizards | F | 6-10 | 220.0 | Real Madrid | 0 |


In [6]:
nba["Salary"].rank()
nba["Salary"].rank(ascending=True)
nba["Salary"].rank(ascending=False).astype(int)

nba["Salary Rank"] = nba["Salary"].rank(ascending=False).astype(int)

In [7]:
nba.sort_values("Salary", ascending=False).head(10)

| | Name | Team | Position | Height | Weight | College | Salary | Salary Rank |
|---|---|---|---|---|---|---|---|---|
| 175 | Stephen Curry | Golden State Warriors | G | 6-2 | 185.0 | Davidson | 51915615 | 1 |
| 461 | Kevin Durant | Phoenix Suns | F | 6-10 | 240.0 | Texas | 47649433 | 2 |
| 261 | LeBron James | Los Angeles Lakers | F | 6-9 | 250.0 | St. Vincent-St. Mary HS (OH) | 47607350 | 4 |
| 145 | Nikola Jokic | Denver Nuggets | C | 6-11 | 284.0 | Mega Basket | 47607350 | 4 |
| 436 | Joel Embiid | Philadelphia 76ers | C-F | 7-0 | 280.0 | Kansas | 47607350 | 4 |
| 456 | Bradley Beal | Phoenix Suns | G | 6-4 | 207.0 | Florida | 46741590 | 6 |
| 480 | Damian Lillard | Portland Trail Blazers | G | 6-2 | 195.0 | Weber State | 45640084 | 8 |
| 316 | Giannis Antetokounmpo | Milwaukee Bucks | F | 7-0 | 243.0 | Filathlitikos | 45640084 | 8 |
| 241 | Kawhi Leonard | Los Angeles Clippers | F | 6-7 | 225.0 | San Diego State | 45640084 | 8 |
| 239 | Paul George | Los Angeles Clippers | F | 6-8 | 220.0 | Fresno State | 45640084 | 8 |


In [None]:
# fillna needs a fill value; calling it with no arguments raises a ValueError
nba.fillna(0)