# Pandas DataFrames

## Series and DataFrames
A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional table, even when it has only a single column. Each column of a DataFrame is a Series, and the DataFrame is the whole table. As a rule of thumb, a Series looks like an array and a DataFrame looks like a spreadsheet.
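
A minimal sketch of the difference, using hypothetical toy data rather than the roster built below: single brackets return a Series, while a list of column names returns a DataFrame, even for one column.

```python
import pandas as pd

# Hypothetical toy data, not the roster used later in this notebook.
table = pd.DataFrame({"Rating": [74, 75, 97]}, index=["a", "b", "c"])

as_series = table["Rating"]    # single brackets -> Series (1D)
as_frame = table[["Rating"]]   # list of names -> DataFrame (2D), even with one column

print(type(as_series).__name__, as_series.ndim, as_series.shape)
print(type(as_frame).__name__, as_frame.ndim, as_frame.shape)
```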

In [4]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame({
    "Position": [
        "SG",
        "SF",
        "PG",
        "PF",
        "C"
    ],
    "Jersey No.": [
        17,
        15,
        1,
        23,
        3
    ],
    "Rating": [
        74,
        75,
        75,
        97,
        89
    ]
    }, 
    columns = ["Position", "Jersey No.", "Rating"]
)

In [7]:
names = [
        "Schroder",
        "Reaves",
        "Russels",
        "James",
        "Davis"
]
df.index = names

In [8]:
df

|  | Position | Jersey No. | Rating |
| --- | --- | --- | --- |
| Schroder | SG | 17 | 74 |
| Reaves | SF | 15 | 75 |
| Russels | PG | 1 | 75 |
| James | PF | 23 | 97 |
| Davis | C | 3 | 89 |


In [9]:
df.columns

Index(['Position', 'Jersey No.', 'Rating'], dtype='object')

In [10]:
df.index

Index(['Schroder', 'Reaves', 'Russels', 'James', 'Davis'], dtype='object')

In [11]:
# returns a quick summary of the dataframe structure
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Schroder to Davis
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Position    5 non-null      object
 1   Jersey No.  5 non-null      int64 
 2   Rating      5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 120.0+ bytes


In [12]:
df.size

15

In [13]:
df.shape

(5, 3)

In [14]:
# returns summary statistics for the numeric columns
df.describe()

|  | Jersey No. | Rating |
| --- | --- | --- |
| count | 5.0 | 5.0 |
| mean | 11.8 | 82.0 |
| std | 9.444575 | 10.440307 |
| min | 1.0 | 74.0 |
| 25% | 3.0 | 75.0 |
| 50% | 15.0 | 75.0 |
| 75% | 17.0 | 89.0 |
| max | 23.0 | 97.0 |


In [15]:
df.dtypes

Position      object
Jersey No.     int64
Rating         int64
dtype: object

In [16]:
df.dtypes.value_counts()

int64     2
object    1
dtype: int64

<br>

## Indexing, Selection and Slicing

A column can easily be accessed by its name. To access a row, use `loc` with the index labels, or `iloc` with integer positions. Whether you select a single column (vertical) or a single row (horizontal), the result is always a _Series_.

In [17]:
df.loc["James"]

Position      PF
Jersey No.    23
Rating        97
Name: James, dtype: object

In [18]:
df.iloc[-1]

Position       C
Jersey No.     3
Rating        89
Name: Davis, dtype: object
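
The cells above select rows. As a small sketch of the column case, assuming a cut-down two-player version of the roster (the variable name `roster` is only for illustration), selecting a column by name also returns a Series, and combining a row label with a column label returns a single value.

```python
import pandas as pd

# Cut-down, illustrative version of the roster.
roster = pd.DataFrame(
    {"Position": ["SG", "PF"], "Rating": [74, 97]},
    index=["Schroder", "James"],
)

rating = roster["Rating"]              # one column -> Series indexed by the row labels
row = roster.loc["James"]              # one row -> Series indexed by the column names
cell = roster.loc["James", "Rating"]   # row label + column label -> a single value

print(rating)
print(row)
print(cell)
```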

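As with NumPy arrays, you can slice along both dimensions at once: rows first, then columns. Note that `loc` slices by label and includes both endpoints, while `iloc` slices by integer position and excludes the stop.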

In [19]:
df.loc['Schroder':'Russels']

|  | Position | Jersey No. | Rating |
| --- | --- | --- | --- |
| Schroder | SG | 17 | 74 |
| Reaves | SF | 15 | 75 |
| Russels | PG | 1 | 75 |


In [20]:
df.loc["Reaves":"James", ["Position", "Rating"]]

|  | Position | Rating |
| --- | --- | --- |
| Reaves | SF | 75 |
| Russels | PG | 75 |
| James | PF | 97 |


In [21]:
df.iloc[0:, [0, -1]]

|  | Position | Rating |
| --- | --- | --- |
| Schroder | SG | 74 |
| Reaves | SF | 75 |
| Russels | PG | 75 |
| James | PF | 97 |
| Davis | C | 89 |
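
One detail of the slices above is easy to miss: a label slice with `loc` includes its end label, while a positional slice with `iloc` stops before its end position. A short sketch, using only the `Rating` column of the roster (the name `roster` is illustrative):

```python
import pandas as pd

roster = pd.DataFrame(
    {"Rating": [74, 75, 75, 97, 89]},
    index=["Schroder", "Reaves", "Russels", "James", "Davis"],
)

by_label = roster.loc["Schroder":"Russels"]   # end label included -> 3 rows
by_position = roster.iloc[0:2]                # stop position excluded -> 2 rows

print(list(by_label.index))
print(list(by_position.index))
```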


<br>

## Conditional Selection

In [22]:
df.loc[df["Rating"] > 75, ["Jersey No.", "Rating"]]

|  | Jersey No. | Rating |
| --- | --- | --- |
| James | 23 | 97 |
| Davis | 3 | 89 |
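
The condition above is a boolean Series, so several conditions can be combined. The sketch below rebuilds the same roster under the illustrative name `roster` and combines masks with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

roster = pd.DataFrame(
    {
        "Position": ["SG", "SF", "PG", "PF", "C"],
        "Jersey No.": [17, 15, 1, 23, 3],
        "Rating": [74, 75, 75, 97, 89],
    },
    index=["Schroder", "Reaves", "Russels", "James", "Davis"],
)

# Both conditions must hold: rating above 74 and not a center.
both = roster.loc[(roster["Rating"] > 74) & (roster["Position"] != "C"),
                  ["Position", "Rating"]]

# Either condition may hold: rating above 90 or a jersey number below 5.
either = roster.loc[(roster["Rating"] > 90) | (roster["Jersey No."] < 5)]

print(both)
print(either)
```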


**Dropping** is the opposite of selecting: `drop` returns a copy of the DataFrame without the given rows or columns, and leaves the original unchanged.

In [23]:
df.drop("Reaves")

|  | Position | Jersey No. | Rating |
| --- | --- | --- | --- |
| Schroder | SG | 17 | 74 |
| Russels | PG | 1 | 75 |
| James | PF | 23 | 97 |
| Davis | C | 3 | 89 |


In [24]:
df.drop(["Russels", "Davis"])

|  | Position | Jersey No. | Rating |
| --- | --- | --- | --- |
| Schroder | SG | 17 | 74 |
| Reaves | SF | 15 | 75 |
| James | PF | 23 | 97 |


In [25]:
# drop columns
df.drop(columns=["Rating"])

|  | Position | Jersey No. |
| --- | --- | --- |
| Schroder | SG | 17 |
| Reaves | SF | 15 |
| Russels | PG | 1 |
| James | PF | 23 |
| Davis | C | 3 |


In [26]:
# axis="columns" (or 1) drops by column label
df.drop("Rating", axis="columns")

|  | Position | Jersey No. |
| --- | --- | --- |
| Schroder | SG | 17 |
| Reaves | SF | 15 |
| Russels | PG | 1 |
| James | PF | 23 |
| Davis | C | 3 |


In [27]:
# axis="rows" (or 0, the default) drops by row label
df.drop("Davis", axis=0)

|  | Position | Jersey No. | Rating |
| --- | --- | --- | --- |
| Schroder | SG | 17 | 74 |
| Reaves | SF | 15 | 75 |
| Russels | PG | 1 | 75 |
| James | PF | 23 | 97 |


In [28]:
df

|  | Position | Jersey No. | Rating |
| --- | --- | --- | --- |
| Schroder | SG | 17 | 74 |
| Reaves | SF | 15 | 75 |
| Russels | PG | 1 | 75 |
| James | PF | 23 | 97 |
| Davis | C | 3 | 89 |


> Note: The `axis` argument can be confusing.
> - `axis=0` or `axis="rows"` refers to the row labels (the index). `drop` removes rows; a reduction such as `sum` moves down the rows and returns one value per column.
> - `axis=1` or `axis="columns"` refers to the column labels. `drop` removes columns; a reduction such as `sum` moves across the columns and returns one value per row.
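
To make the direction of `axis` concrete, here is a small sketch with a reduction instead of `drop`. The two-row frame `stats` is illustrative data taken from the roster; summing ratings and jersey numbers has no real meaning and only shows the direction of travel.

```python
import pandas as pd

stats = pd.DataFrame(
    {"Jersey No.": [17, 23], "Rating": [74, 97]},
    index=["Schroder", "James"],
)

per_column = stats.sum(axis=0)   # moves down the rows -> one value per column
per_row = stats.sum(axis=1)      # moves across the columns -> one value per row

print(per_column)   # indexed by the column names
print(per_row)      # indexed by the row labels
```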

## Modifying and Using DataFrame and Series Together

In [29]:
# adding another column using series
earning_per_month = pd.Series([76000,77000,77000,99000,89000], index=names)

In [30]:
df["Earnings per Month"] = earning_per_month

In [31]:
df

|  | Position | Jersey No. | Rating | Earnings per Month |
| --- | --- | --- | --- | --- |
| Schroder | SG | 17 | 74 | 76000 |
| Reaves | SF | 15 | 75 | 77000 |
| Russels | PG | 1 | 75 | 77000 |
| James | PF | 23 | 97 | 99000 |
| Davis | C | 3 | 89 | 89000 |


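You can apply arithmetic to a DataFrame with a Series. By default the Series index is matched against the DataFrame's column labels, and the operation is applied to every row.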

In [34]:
decrease_rate = pd.Series([-2, -2000],index=["Rating", "Earnings per Month"])

In [35]:
df[["Rating", "Earnings per Month"]]

|  | Rating | Earnings per Month |
| --- | --- | --- |
| Schroder | 74 | 76000 |
| Reaves | 75 | 77000 |
| Russels | 75 | 77000 |
| James | 97 | 99000 |
| Davis | 89 | 89000 |


In [36]:
df[["Rating", "Earnings per Month"]] + decrease_rate

|  | Rating | Earnings per Month |
| --- | --- | --- |
| Schroder | 72 | 74000 |
| Reaves | 73 | 75000 |
| Russels | 73 | 75000 |
| James | 95 | 97000 |
| Davis | 87 | 87000 |
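
The `+` above aligns the Series with the column labels. If the Series is instead indexed by the row labels, the method form of the operator with `axis="index"` aligns it with the rows. The sketch below uses a two-player frame named `pay` and a made-up multiplier `per_row_factor`; both are assumptions for illustration only.

```python
import pandas as pd

pay = pd.DataFrame(
    {"Rating": [74, 97], "Earnings per Month": [76000, 99000]},
    index=["Schroder", "James"],
)

# Indexed by column names: applied to every row (default alignment).
per_column_change = pd.Series({"Rating": -2, "Earnings per Month": -2000})
by_column = pay + per_column_change   # same as pay.add(per_column_change, axis="columns")

# Indexed by row labels: needs axis="index" to align with the rows.
per_row_factor = pd.Series({"Schroder": 2, "James": 1})   # hypothetical multipliers
by_row = pay.mul(per_row_factor, axis="index")

print(by_column)
print(by_row)
```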


In [37]:
# renaming columns and rows
df.rename(
    columns = {
        "Rating": "Ave. Ratings"
    },
    index = {
        "Schroder": "D. Schroder",
        "Reaves": "M. Reaves",
        "Russels": "D. Russels",
        "James": "L. James",
        "Davis": "A. Davis"
    }
)

|  | Position | Jersey No. | Ave. Ratings | Earnings per Month |
| --- | --- | --- | --- | --- |
| D. Schroder | SG | 17 | 74 | 76000 |
| M. Reaves | SF | 15 | 75 | 77000 |
| D. Russels | PG | 1 | 75 | 77000 |
| L. James | PF | 23 | 97 | 99000 |
| A. Davis | C | 3 | 89 | 89000 |


None of the modifications above, from decreasing the ratings and earnings to renaming the rows and columns, changed `df` itself, as calling it again shows. Each operation returns a new object, so the change persists only if you assign the result back to a variable.

In [38]:
df

|  | Position | Jersey No. | Rating | Earnings per Month |
| --- | --- | --- | --- | --- |
| Schroder | SG | 17 | 74 | 76000 |
| Reaves | SF | 15 | 75 | 77000 |
| Russels | PG | 1 | 75 | 77000 |
| James | PF | 23 | 97 | 99000 |
| Davis | C | 3 | 89 | 89000 |
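
A minimal sketch of what keeping a change looks like, using a small stand-in frame named `roster` (an illustrative name, so the notebook's `df` stays untouched): the result is assigned back to the variable, or to a column of it.

```python
import pandas as pd

roster = pd.DataFrame({"Rating": [74, 97]}, index=["Schroder", "James"])

# The result is discarded, so roster keeps its original column name.
roster.rename(columns={"Rating": "Ave. Ratings"})
unchanged = list(roster.columns)

# Assigning the result back keeps the change.
roster = roster.rename(columns={"Rating": "Ave. Ratings"})

# Assigning to a column keeps an arithmetic change.
roster["Ave. Ratings"] = roster["Ave. Ratings"] - 2

print(unchanged)
print(roster)
```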


<br>

## Creating Columns

Above, we created a new column by building a Series and assigning it to the DataFrame. Here is another column added the same way.

In [39]:
ave_points = pd.Series([12.6, 11.2, 13, 28.9, 25.9], index=names)

In [40]:
df["PTS"] = ave_points

In [41]:
df

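|  | Position | Jersey No. | Rating | Earnings per Month | PTS |
| --- | --- | --- | --- | --- | --- |
| Schroder | SG | 17 | 74 | 76000 | 12.6 |
| Reaves | SF | 15 | 75 | 77000 | 11.2 |
| Russels | PG | 1 | 75 | 77000 | 13.0 |
| James | PF | 23 | 97 | 99000 | 28.9 |
| Davis | C | 3 | 89 | 89000 | 25.9 |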


We can also create a new column from the values of existing columns.

In [46]:
# note: despite the column name, this divides Rating by PTS (rating per average point)
df["Ave. PTS% per Rating"] = df["Rating"] / df["PTS"]

In [47]:
df

|  | Position | Jersey No. | Rating | Earnings per Month | PTS | Ave. PTS% per Rating |
| --- | --- | --- | --- | --- | --- | --- |
| Schroder | SG | 17 | 74 | 76000 | 12.6 | 5.873016 |
| Reaves | SF | 15 | 75 | 77000 | 11.2 | 6.696429 |
| Russels | PG | 1 | 75 | 77000 | 13.0 | 5.769231 |
| James | PF | 23 | 97 | 99000 | 28.9 | 3.356401 |
| Davis | C | 3 | 89 | 89000 | 25.9 | 3.436293 |
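
As an alternative sketch, `DataFrame.assign` builds a derived column without modifying the original frame. The frame `roster` and the column name `pts_per_rating` are hypothetical and chosen only for this example; the ratio here is PTS divided by Rating, the inverse of the column computed above.

```python
import pandas as pd

roster = pd.DataFrame(
    {"Rating": [74, 97], "PTS": [12.6, 28.9]},
    index=["Schroder", "James"],
)

# assign returns a new DataFrame; roster itself keeps its two columns.
with_ratio = roster.assign(pts_per_rating=lambda d: d["PTS"] / d["Rating"])

print(with_ratio)
print(list(roster.columns))
```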


### Statistical Info

Some other commands that might help you understand and process the data:
![other dataframe commands](../img/statistical-info.jpg)
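
The image is not reproduced as text here, so the selection below is an assumption about which commands are useful rather than a transcription of it. It is a short sketch of common statistical methods applied to the `Rating` values from the roster.

```python
import pandas as pd

ratings = pd.Series(
    [74, 75, 75, 97, 89],
    index=["Schroder", "Reaves", "Russels", "James", "Davis"],
    name="Rating",
)

mean_rating = ratings.mean()        # arithmetic mean
median_rating = ratings.median()    # middle value
spread = ratings.std()              # sample standard deviation
best_player = ratings.idxmax()      # index label of the highest rating
counts = ratings.value_counts()     # how often each rating occurs

print(mean_rating, median_rating, spread, best_player)
print(counts)
```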