# DataFrame object

* A DataFrame is a two-dimensional table of data with rows and columns
* Each row and each column has both an index position and an index label
* It is two-dimensional because isolating a single value requires two points of reference: a row and a column
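
As a minimal sketch of the "two points of reference" idea, the snippet below builds a small made-up table and isolates one value, first by label with `.loc` and then by position with `.iloc`. The `scores` name and its values are illustrative assumptions and are not part of the data used later.

```python
import pandas as pd

# Hypothetical data, used only to illustrate row/column addressing
scores = pd.DataFrame(
    data={"Math": [90, 75], "Art": [60, 88]},
    index=["Ana", "Luis"],
)

# Label-based lookup: row label first, then column label
by_label = scores.loc["Luis", "Art"]

# Position-based lookup: row position first, then column position
by_position = scores.iloc[1, 1]

print(by_label, by_position)
```

Both lookups point at the same cell, once through its labels and once through its positions.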

In [28]:
import pandas as pd
import numpy as np

Creating DataFrames with the constructor

When a dictionary supplies the data, its keys become the column names and its values become the column values

In [3]:
city_data = {
    'City': ["New York City", "Paris", "Barcelona", "Rome"],
    'Country': ["United States", "France", "Spain", "Italy"],
    'Population': [8600000, 2141000, 5515000, 2873000],
}

In [5]:
cities = pd.DataFrame(data=city_data)
cities

Unnamed: 0,City,Country,Population
0,New York City,United States,8600000
1,Paris,France,2141000
2,Barcelona,Spain,5515000
3,Rome,Italy,2873000


What if we want to swap the column headers with the index labels? This operation is the **transpose**, available through the `T` attribute

In [6]:
cities.T

Unnamed: 0,0,1,2,3
City,New York City,Paris,Barcelona,Rome
Country,United States,France,Spain,Italy
Population,8600000,2141000,5515000,2873000


Using an ndarray to create a DataFrame

In [10]:
# randint returns an ndarray. It takes a low limit (inclusive), a high limit
# (exclusive), and a size, which can be an int or a list describing the shape.
# Here the shape is 3 rows by 5 columns, so 15 random numbers are returned.
random_data = np.random.randint(1, 101, [3, 5])
random_data

array([[ 3, 36, 40, 44, 58],
       [68, 30, 57, 36, 93],
       [65,  9, 96, 61, 91]])

In [13]:
# Now we have a DataFrame Object
pd.DataFrame(data=random_data)

Unnamed: 0,0,1,2,3,4
0,3,36,40,44,58
1,68,30,57,36,93
2,65,9,96,61,91
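
The values above change on every run. If reproducible output is wanted, one option (an addition here, not something the original notebook does) is NumPy's `Generator` API with a fixed seed. The seed value `42` and the variable names are arbitrary assumptions. The upper bound is exclusive in both `randint` and `integers`, so `101` yields values from 1 to 100.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed produces the same numbers on every run
rng = np.random.default_rng(seed=42)

# 3 rows x 5 columns of integers in the range 1..100 (101 is excluded)
reproducible_data = rng.integers(low=1, high=101, size=(3, 5))

reproducible_df = pd.DataFrame(data=reproducible_data)
print(reproducible_df.shape)
```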


### We can set custom index labels and column names
With the `index` parameter of the `DataFrame` constructor, we pass an iterable of index labels; its length has to match the number of rows

We can set the column names with the `columns` parameter of the `DataFrame` constructor; we pass an iterable of names, and its length has to match the number of columns

In [17]:
row_labels = ['Morning', 'Afternoon', 'Evening']
column_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

temperatures = pd.DataFrame(
    data=random_data,
    index=row_labels,
    columns=column_names
)
temperatures

Unnamed: 0,Monday,Tuesday,Wednesday,Thursday,Friday
Morning,3,36,40,44,58
Afternoon,68,30,57,36,93
Evening,65,9,96,61,91
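
To illustrate the length requirement, the sketch below passes two row labels for a three-row array and catches the resulting error. The `values` array is a stand-in built with `np.arange`, not the random data above.

```python
import numpy as np
import pandas as pd

values = np.arange(15).reshape(3, 5)  # 3 rows, 5 columns

mismatch_message = None
try:
    # Only two labels for three rows: pandas rejects the mismatch
    pd.DataFrame(data=values, index=["Morning", "Afternoon"])
except ValueError as error:
    mismatch_message = str(error)
    print("ValueError:", mismatch_message)
```

The same kind of error is raised when `columns` has the wrong number of names.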


# Using the `read_csv()` function

In [33]:
nba = pd.read_csv("/home/diego/Documents/Data/nba.csv", 
            parse_dates=['Birthday'])
nba


Unnamed: 0,Name,Team,Position,Birthday,Salary
0,Shake Milton,Philadelphia 76ers,SG,1996-09-26,1445697
1,Christian Wood,Detroit Pistons,PF,1995-09-27,1645357
2,PJ Washington,Charlotte Hornets,PF,1998-08-23,3831840
3,Derrick Rose,Detroit Pistons,PG,1988-10-04,7317074
4,Marial Shayok,Philadelphia 76ers,G,1995-07-26,79568
...,...,...,...,...,...
445,Austin Rivers,Houston Rockets,PG,1992-08-01,2174310
446,Harry Giles,Sacramento Kings,PF,1998-04-22,2578800
447,Robin Lopez,Milwaukee Bucks,C,1988-04-01,4767000
448,Collin Sexton,Cleveland Cavaliers,PG,1999-01-04,4764960
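
The effect of `parse_dates` can be checked without the file on disk. The sketch below reads two rows (copied from the output above, with fewer columns) from an in-memory string through `io.StringIO`, once without and once with `parse_dates`. The `csv_text` variable is an assumption made for the example.

```python
import io
import pandas as pd

# Two rows copied from the output above, standing in for the CSV file
csv_text = """Name,Birthday,Salary
Shake Milton,1996-09-26,1445697
Christian Wood,1995-09-27,1645357
"""

without_dates = pd.read_csv(io.StringIO(csv_text))
with_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Birthday"])

# Without parse_dates the column holds plain strings, not datetimes
print(pd.api.types.is_datetime64_any_dtype(without_dates["Birthday"]))
# With parse_dates the column is converted to a datetime type
print(pd.api.types.is_datetime64_any_dtype(with_dates["Birthday"]))
```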


A DataFrame exposes the type of each column through the plural `dtypes` attribute (a Series has the singular `dtype`)

In [34]:
nba.dtypes

Name                object
Team                object
Position            object
Birthday    datetime64[ns]
Salary               int64
dtype: object

Chaining `value_counts()` onto `dtypes` counts how many columns store each data type

In [35]:
nba.dtypes.value_counts()
# Here three columns have the object data type, one is datetime64[ns], and one is int64

object            3
datetime64[ns]    1
int64             1
Name: count, dtype: int64
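
A small sketch of the singular/plural difference, using a made-up frame (`frame` and its columns are assumptions for illustration): a single column has one `dtype`, while the whole DataFrame has `dtypes`, which is itself a Series and therefore supports `value_counts()`.

```python
import pandas as pd

# Hypothetical frame with two integer columns and one float column
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [1.5, 2.5]})

# A Series (single column) has one type, exposed as `dtype`
single_type = frame["a"].dtype

# A DataFrame has one type per column, exposed as `dtypes` (a Series)
type_counts = frame.dtypes.value_counts()

print(single_type)
print(type_counts)
```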

In [37]:
nba.index

RangeIndex(start=0, stop=450, step=1)

In [38]:
nba.columns

Index(['Name', 'Team', 'Position', 'Birthday', 'Salary'], dtype='object')

In [39]:
nba.ndim

2

In [40]:
nba.shape

(450, 5)

The `size` attribute returns the total number of values, **including** missing values

In [41]:
nba.size

2250

The `count()` method counts the values in each column and **excludes** missing values

In [47]:
nba.count()

Name        450
Team        450
Position    450
Birthday    450
Salary      450
dtype: int64

Working with a DataFrame that contains missing values

In [49]:
data = {
    "A": [1, np.nan],
    "B": [2,3],
}
df = pd.DataFrame(data=data)
df

Unnamed: 0,A,B
0,1.0,2
1,,3


In [50]:
df.size

4

In [53]:
df.count()

A    1
B    2
dtype: int64
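
Because `size` includes missing values and `count()` excludes them, their difference gives the number of missing cells. The sketch below rebuilds the same two-column frame and cross-checks the result with `isna()`, which is one possible way to verify it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan], "B": [2, 3]})

total_cells = df.size              # every cell, missing or not
present_cells = df.count().sum()   # non-missing cells only
missing_cells = total_cells - present_cells

# isna() flags each missing cell; summing twice totals the flags
missing_via_isna = df.isna().sum().sum()

print(missing_cells, missing_via_isna)
```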

In [70]:
nba.sample(4)

Unnamed: 0,Name,Team,Position,Birthday,Salary
132,Terence Davis,Toronto Raptors,SG,1997-05-16,898310
191,Cheick Diallo,Phoenix Suns,C,1996-09-13,1678854
307,Kawhi Leonard,Los Angeles Clippers,SF,1991-06-29,32742000
338,Nerlens Noel,Oklahoma City Thunder,C,1994-04-10,1882867


In [71]:
nba.nunique()

Name        450
Team         30
Position      9
Birthday    430
Salary      269
dtype: int64

In [72]:
nba.max()

Name             Zylan Cheatham
Team         Washington Wizards
Position                     SG
Birthday    2000-12-23 00:00:00
Salary                 40231758
dtype: object

In [73]:
nba.min()

Name               Aaron Gordon
Team              Atlanta Hawks
Position                      C
Birthday    1977-01-26 00:00:00
Salary                    79568
dtype: object

In [83]:
# nlargest works with numeric and datetime columns
nba.nlargest(n=3, columns='Salary')

Unnamed: 0,Name,Team,Position,Birthday,Salary
205,Stephen Curry,Golden State Warriors,PG,1988-03-14,40231758
38,Chris Paul,Oklahoma City Thunder,PG,1985-05-06,38506482
219,Russell Westbrook,Houston Rockets,PG,1988-11-12,38506482


In [84]:
# The five oldest players: the smallest birthdays are the earliest dates
nba.nsmallest(n=5, columns='Birthday')

Unnamed: 0,Name,Team,Position,Birthday,Salary
98,Vince Carter,Atlanta Hawks,PF,1977-01-26,2564753
196,Udonis Haslem,Miami Heat,C,1980-06-09,2564753
262,Kyle Korver,Milwaukee Bucks,PF,1981-03-17,6004753
149,Tyson Chandler,Houston Rockets,C,1982-10-02,2564753
415,Andre Iguodala,Memphis Grizzlies,SF,1984-01-28,17185185
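
The `nlargest` output above contains a tie (two salaries of 38506482). As a sketch of how ties behave, the example below uses four name/salary pairs copied from the outputs above. It compares `nlargest` with sorting in descending order and taking the first rows, then uses `keep="all"` to keep every tied row. The `top_paid` name is an assumption for the example.

```python
import pandas as pd

# Four rows copied from the outputs above
top_paid = pd.DataFrame({
    "Name": ["Kawhi Leonard", "Chris Paul", "Stephen Curry", "Russell Westbrook"],
    "Salary": [32742000, 38506482, 40231758, 38506482],
})

# nlargest selects the same salaries as a descending sort followed by head(n)
largest = top_paid.nlargest(n=2, columns="Salary")
sorted_head = top_paid.sort_values(by="Salary", ascending=False).head(2)

# keep="all" returns every row tied for the last place, so the result can exceed n
with_ties = top_paid.nlargest(n=2, columns="Salary", keep="all")

print(largest["Name"].tolist())
print(with_ties["Name"].tolist())
```

With the default `keep="first"`, only the first of the tied rows is returned, which is why the row count matches `n` exactly.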


In [86]:
nba.sum(numeric_only=True)

Salary    3444112694
dtype: int64

In [88]:
nba.mean(numeric_only=True)

Salary    7.653584e+06
dtype: float64

In [90]:
nba.median(numeric_only=True)

Salary    3303074.5
dtype: float64

In [92]:
nba.mode(numeric_only=True)

Unnamed: 0,Salary
0,79568


In [94]:
nba.std(numeric_only=True)

Salary    9.288810e+06
dtype: float64

In [96]:
nba.var(numeric_only=True)

Salary    8.628200e+13
dtype: float64
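
All of the aggregations above pass `numeric_only=True`, which restricts the calculation to numeric columns. As a rough sketch of what that flag does, the example below compares it with selecting the numeric columns first through `select_dtypes`. The three rows are copied from the outputs above, and `players` is a name chosen for the example.

```python
import pandas as pd

# Three rows copied from the outputs above
players = pd.DataFrame({
    "Name": ["Shake Milton", "Christian Wood", "PJ Washington"],
    "Salary": [1445697, 1645357, 3831840],
})

# numeric_only=True skips the text column instead of trying to average it
mean_flag = players.mean(numeric_only=True)

# Selecting the numeric columns first gives the same result
mean_selected = players.select_dtypes(include="number").mean()

print(mean_flag.equals(mean_selected))
```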

# Sorting a DataFrame

We use the `sort_values` method and pass a column name, or a list of column names, to the `by` parameter. With a list, ties in the first column are broken by the second

In [100]:
nba.sort_values(by=['Team', 'Name'])

Unnamed: 0,Name,Team,Position,Birthday,Salary
359,Alex Len,Atlanta Hawks,C,1993-06-16,4160000
167,Allen Crabbe,Atlanta Hawks,SG,1992-04-09,18500000
276,Brandon Goodwin,Atlanta Hawks,PG,1995-10-02,79568
438,Bruno Fernando,Atlanta Hawks,C,1998-08-15,1400000
194,Cam Reddish,Atlanta Hawks,SF,1999-09-01,4245720
...,...,...,...,...,...
418,Jordan McRae,Washington Wizards,PG,1991-03-28,1645357
273,Justin Robinson,Washington Wizards,PG,1997-10-12,898310
428,Moritz Wagner,Washington Wizards,C,1997-04-26,2063520
21,Rui Hachimura,Washington Wizards,PF,1998-02-08,4469160


In [101]:
nba.sort_values(by=['Name', 'Team'])

Unnamed: 0,Name,Team,Position,Birthday,Salary
52,Aaron Gordon,Orlando Magic,PF,1995-09-16,19863636
101,Aaron Holiday,Indiana Pacers,PG,1996-09-30,2239200
437,Abdel Nader,Oklahoma City Thunder,SF,1993-09-25,1618520
81,Adam Mokoka,Chicago Bulls,G,1998-07-18,79568
399,Admiral Schofield,Washington Wizards,SF,1997-03-30,1000000
...,...,...,...,...,...
159,Zach LaVine,Chicago Bulls,PG,1995-03-10,19500000
302,Zach Norvell,Los Angeles Lakers,SG,1997-12-09,79568
312,Zhaire Smith,Philadelphia 76ers,SG,1999-06-04,3058800
137,Zion Williamson,New Orleans Pelicans,F,2000-07-06,9757440


What if we want a different sort order for each column? We pass a list of booleans to `ascending`, one per column in `by`

In [104]:
nba.sort_values(by=['Team', 'Salary'], ascending=[True, False])

Unnamed: 0,Name,Team,Position,Birthday,Salary
111,Chandler Parsons,Atlanta Hawks,SF,1988-10-25,25102512
28,Evan Turner,Atlanta Hawks,PG,1988-10-27,18606556
167,Allen Crabbe,Atlanta Hawks,SG,1992-04-09,18500000
213,De'Andre Hunter,Atlanta Hawks,SF,1997-12-02,7068360
339,Jabari Parker,Atlanta Hawks,PF,1995-03-15,6500000
...,...,...,...,...,...
80,Isaac Bonga,Washington Wizards,PG,1999-11-08,1416852
399,Admiral Schofield,Washington Wizards,SF,1997-03-30,1000000
273,Justin Robinson,Washington Wizards,PG,1997-10-12,898310
283,Garrison Mathews,Washington Wizards,SG,1996-10-24,79568


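# Sorting by index

`sort_index()` sorts by the row labels by default; passing `axis='columns'` sorts by the column names instead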

In [107]:
nba.sort_index()

Unnamed: 0,Name,Team,Position,Birthday,Salary
0,Shake Milton,Philadelphia 76ers,SG,1996-09-26,1445697
1,Christian Wood,Detroit Pistons,PF,1995-09-27,1645357
2,PJ Washington,Charlotte Hornets,PF,1998-08-23,3831840
3,Derrick Rose,Detroit Pistons,PG,1988-10-04,7317074
4,Marial Shayok,Philadelphia 76ers,G,1995-07-26,79568
...,...,...,...,...,...
445,Austin Rivers,Houston Rockets,PG,1992-08-01,2174310
446,Harry Giles,Sacramento Kings,PF,1998-04-22,2578800
447,Robin Lopez,Milwaukee Bucks,C,1988-04-01,4767000
448,Collin Sexton,Cleveland Cavaliers,PG,1999-01-04,4764960


In [106]:
nba.sort_index(axis='columns')

Unnamed: 0,Birthday,Name,Position,Salary,Team
0,1996-09-26,Shake Milton,SG,1445697,Philadelphia 76ers
1,1995-09-27,Christian Wood,PF,1645357,Detroit Pistons
2,1998-08-23,PJ Washington,PF,3831840,Charlotte Hornets
3,1988-10-04,Derrick Rose,PG,7317074,Detroit Pistons
4,1995-07-26,Marial Shayok,G,79568,Philadelphia 76ers
...,...,...,...,...,...
445,1992-08-01,Austin Rivers,PG,2174310,Houston Rockets
446,1998-04-22,Harry Giles,PF,2578800,Sacramento Kings
447,1988-04-01,Robin Lopez,C,4767000,Milwaukee Bucks
448,1999-01-04,Collin Sexton,PG,4764960,Cleveland Cavaliers
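
Since `nba` already has its rows in index order, `sort_index()` appears to do nothing there. The sketch below uses a made-up frame whose row labels and column names are both out of order, so the effect on each axis is visible. The `unordered` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame whose row labels and column names are out of order
unordered = pd.DataFrame(
    data={"b": [1, 2, 3], "a": [4, 5, 6]},
    index=[2, 0, 1],
)

by_rows = unordered.sort_index()                    # default axis: row labels
by_columns = unordered.sort_index(axis="columns")   # column names instead
descending = unordered.sort_index(ascending=False)  # row labels, high to low

print(by_rows.index.tolist())
print(by_columns.columns.tolist())
print(descending.index.tolist())
```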


Setting a custom index

In [115]:
nba = nba.set_index('Name')

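KeyError: "None of ['Name'] are in the columns"

This error appears when the cell is run more than once: after the first run, `Name` is already the index and no longer a column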

In [117]:
nba

Name,Team,Position,Birthday,Salary
Shake Milton,Philadelphia 76ers,SG,1996-09-26,1445697
Christian Wood,Detroit Pistons,PF,1995-09-27,1645357
PJ Washington,Charlotte Hornets,PF,1998-08-23,3831840
Derrick Rose,Detroit Pistons,PG,1988-10-04,7317074
Marial Shayok,Philadelphia 76ers,G,1995-07-26,79568
...,...,...,...,...
Austin Rivers,Houston Rockets,PG,1992-08-01,2174310
Harry Giles,Sacramento Kings,PF,1998-04-22,2578800
Robin Lopez,Milwaukee Bucks,C,1988-04-01,4767000
Collin Sexton,Cleveland Cavaliers,PG,1999-01-04,4764960
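
A short sketch of what `set_index` does to the columns, using two rows copied from the output above (`players` is a name chosen for the example). It also shows why calling `set_index('Name')` a second time raises a `KeyError`, and how `reset_index` moves the index back into a regular column.

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["Shake Milton", "Christian Wood"],
    "Salary": [1445697, 1645357],
})

# set_index returns a new DataFrame; "Name" leaves the columns
indexed = players.set_index("Name")
print(indexed.columns.tolist())

# Running the same call on the result fails, because "Name" is no longer a column
second_call_failed = False
try:
    indexed.set_index("Name")
except KeyError as error:
    second_call_failed = True
    print("KeyError:", error)

# reset_index moves the index back into a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())
```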


## Extracting columns

In [120]:
nba['Salary']

Name
Shake Milton       1445697
Christian Wood     1645357
PJ Washington      3831840
Derrick Rose       7317074
Marial Shayok        79568
                    ...   
Austin Rivers      2174310
Harry Giles        2578800
Robin Lopez        4767000
Collin Sexton      4764960
Ricky Rubio       16200000
Name: Salary, Length: 450, dtype: int64

In [118]:
nba.Salary

Name
Shake Milton       1445697
Christian Wood     1645357
PJ Washington      3831840
Derrick Rose       7317074
Marial Shayok        79568
                    ...   
Austin Rivers      2174310
Harry Giles        2578800
Robin Lopez        4767000
Collin Sexton      4764960
Ricky Rubio       16200000
Name: Salary, Length: 450, dtype: int64
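
Both syntaxes above return the same Series. The sketch below checks that on a two-row frame and adds two related cases: a list of names returns a DataFrame, and attribute access stops working as a column lookup when the column name collides with a DataFrame method. The `count` column is a hypothetical addition made only to show the collision.

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["Shake Milton", "Christian Wood"],
    "Salary": [1445697, 1645357],
    "count": [1, 2],  # hypothetical column whose name matches a DataFrame method
})

# Both syntaxes return the same Series for a simple column name
same_series = players["Salary"].equals(players.Salary)

# A list of names returns a DataFrame instead of a Series
subset = players[["Name", "Salary"]]

# Attribute access resolves to the method, not the column, when names collide
attribute_is_method = callable(players.count)

print(same_series, type(subset).__name__, attribute_is_method)
```

Square brackets work for any column name, so they are the safer choice when names contain spaces or match existing attributes.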