# DataFrames

In [22]:
import pandas as pd

# Methods and Attributes Shared between Series and DataFrames
* A **DataFrame** is a 2-dimensional table consisting of rows and columns.
* Pandas uses a **NaN** designation for cells that have a missing value. It is short for "not a number". Most operations on **NaN** values will produce **NaN** values.
* Like with a Series, Pandas assigns an index position/label to each DataFrame row.
* The **DataFrame** and **Series** have common and exclusive methods/attributes.
* The *hasnans* attribute exists only on a **Series**. The *columns* attribute exists only on a **DataFrame**.
* Some methods/attributes will return different types of data.
* The *info* method prints a summary of the pandas object.
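The points above can be checked on a small example. This is a minimal sketch that uses hypothetical toy data (`toy_series`, `toy_frame`) rather than the `nba.csv` file loaded in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the nba.csv dataset.
toy_series = pd.Series([1.0, np.nan, 3.0])
toy_frame = pd.DataFrame({"Name": ["A", "B"], "Salary": [100.0, np.nan]})

print(toy_series.hasnans)                  # attribute found on a Series
print(list(toy_frame.columns))             # attribute found on a DataFrame
print(toy_series.shape, toy_frame.shape)   # shared attribute, 1-tuple vs 2-tuple

# Arithmetic on a missing value yields another missing value.
doubled = toy_frame["Salary"] * 2
print(doubled.isna().tolist())
```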

In [23]:
nba=pd.read_csv('nba.csv')
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [24]:
s=pd.Series([1,2,3,4,5])

In [25]:
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [26]:
s.index
nba.index

RangeIndex(start=0, stop=592, step=1)

In [27]:
nba.values

array([['Saddiq Bey', 'Atlanta Hawks', 'F', ..., 215.0, 'Villanova',
        4556983.0],
       ['Bogdan Bogdanovic', 'Atlanta Hawks', 'G', ..., 225.0,
        'Fenerbahce', 18700000.0],
       ['Kobe Bufkin', 'Atlanta Hawks', 'G', ..., 195.0, 'Michigan',
        4094244.0],
       ...,
       ['Tristan Vukcevic', 'Washington Wizards', 'F', ..., 220.0,
        'Real Madrid', nan],
       ['Delon Wright', 'Washington Wizards', 'G', ..., 185.0, 'Utah',
        8195122.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

In [28]:
nba.columns

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')

In [29]:
nba.dtypes
s.dtypes

dtype('int64')

In [30]:
nba.shape
s.shape

(5,)

In [31]:
s.hasnans

False

In [32]:
nba.tail()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0
591,,,,,,,


In [33]:
nba.columns

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')

In [34]:
s.axes

[RangeIndex(start=0, stop=5, step=1)]

In [35]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [36]:
nba.axes

[RangeIndex(start=0, stop=592, step=1),
 Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')]

In [37]:
nba.index

RangeIndex(start=0, stop=592, step=1)

In [38]:
s.info()

<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: None
Non-Null Count  Dtype
--------------  -----
5 non-null      int64
dtypes: int64(1)
memory usage: 172.0 bytes


In [39]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592 entries, 0 to 591
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    488 non-null    float64
dtypes: float64(2), object(5)
memory usage: 32.5+ KB


# Differences between Shared Methods
* The **sum** method adds a Series's values.
* On a **DataFrame**, the **sum** method defaults to adding down the index (the row labels), which produces one total per column.
* The **axis** parameter customizes the direction that we add across. Pass "columns" or 1 to add "across" the columns, which produces one total per row.
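As a sketch of the two directions, the example below rebuilds the first three rows of the revenue table by hand. The variable names (`sample`, `per_city`, `per_day`) are assumptions made for illustration.

```python
import pandas as pd

# First three rows of the revenue data, typed in by hand for illustration.
sample = pd.DataFrame(
    {
        "New York": [985, 738, 14],
        "Los Angeles": [122, 788, 20],
        "Miami": [499, 534, 933],
    },
    index=pd.Index(["1/1/26", "1/2/26", "1/3/26"], name="Date"),
)

per_city = sample.sum()                # default axis="index": one total per column
per_day = sample.sum(axis="columns")   # one total per row

print(per_city)
print(per_day)
```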

In [40]:
revenue=pd.read_csv("revenue.csv", index_col="Date")
revenue

Date,New York,Los Angeles,Miami
1/1/26,985,122,499
1/2/26,738,788,534
1/3/26,14,20,933
1/4/26,730,904,885
1/5/26,114,71,253
1/6/26,936,502,497
1/7/26,123,996,115
1/8/26,935,492,886
1/9/26,846,954,823
1/10/26,54,285,216


In [41]:
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [42]:
s.sum()

15

In [43]:
revenue.sum(axis="index")
revenue.sum(axis="columns")

Date
1/1/26     1606
1/2/26     2060
1/3/26      967
1/4/26     2519
1/5/26      438
1/6/26     1935
1/7/26     1234
1/8/26     2313
1/9/26     2623
1/10/26     555
dtype: int64

# Select One Column from a DataFrame
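* In older pandas versions, the extracted **Series** can be a view, so changes to the **Series** may affect the **DataFrame**. With Copy-on-Write enabled (the default from pandas 3.0), the **Series** behaves like a copy and changes to it do not reach the **DataFrame**.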

In [44]:
s=nba.Name
s

0             Saddiq Bey
1      Bogdan Bogdanovic
2            Kobe Bufkin
3           Clint Capela
4         Bruno Fernando
             ...        
587         Ryan Rollins
588        Landry Shamet
589     Tristan Vukcevic
590         Delon Wright
591                  NaN
Name: Name, Length: 592, dtype: object

In [45]:
nba["Name"]

0             Saddiq Bey
1      Bogdan Bogdanovic
2            Kobe Bufkin
3           Clint Capela
4         Bruno Fernando
             ...        
587         Ryan Rollins
588        Landry Shamet
589     Tristan Vukcevic
590         Delon Wright
591                  NaN
Name: Name, Length: 592, dtype: object

# Select Multiple Columns from a DataFrame
* Use square brackets with a list of names to extract multiple **DataFrame** columns.
* Pandas stores the result in a new **DataFrame**.
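A minimal sketch of multi-column selection on hypothetical toy data follows. It takes an explicit `.copy()` and writes through `loc`, because chained assignment such as `df.iloc[0]["Position"] = ...` is not reliable across pandas versions.

```python
import pandas as pd

# Hypothetical toy roster, not the nba.csv dataset.
players = pd.DataFrame(
    {"Name": ["A", "B"], "Team": ["X", "Y"], "Position": ["F", "G"]}
)

# A list inside the square brackets returns a new DataFrame.
subset = players[["Name", "Position"]].copy()

# Write through a single loc call instead of chained indexing.
subset.loc[0, "Position"] = "C"

print(type(subset).__name__)
print(players.loc[0, "Position"], subset.loc[0, "Position"])
```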

In [46]:
nba.columns

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')

In [47]:
df=nba[["Name", "Position"]]

In [48]:
df

Unnamed: 0,Name,Position
0,Saddiq Bey,F
1,Bogdan Bogdanovic,G
2,Kobe Bufkin,G
3,Clint Capela,C
4,Bruno Fernando,F-C
...,...,...
587,Ryan Rollins,G
588,Landry Shamet,G
589,Tristan Vukcevic,F
590,Delon Wright,G


In [49]:
df.loc[0, "Position"]="Qw"  # one loc call; chained df.iloc[0]["Position"]=... may write to a temporary copy
df

Unnamed: 0,Name,Position
0,Saddiq Bey,Qw
1,Bogdan Bogdanovic,G
2,Kobe Bufkin,G
3,Clint Capela,C
4,Bruno Fernando,F-C
...,...,...
587,Ryan Rollins,G
588,Landry Shamet,G
589,Tristan Vukcevic,F
590,Delon Wright,G


In [50]:
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [82]:
df.loc[0, "Name"]="qwerty"  # one loc call; chained df.iloc[0]["Name"]=... may write to a temporary copy
df

Unnamed: 0,Name,Position
0,qwerty,Qw
1,Bogdan Bogdanovic,G
2,Kobe Bufkin,G
3,Clint Capela,C
4,Bruno Fernando,F-C
...,...,...
587,Ryan Rollins,G
588,Landry Shamet,G
589,Tristan Vukcevic,F
590,Delon Wright,G


In [80]:
nba

Unnamed: 0,Name,Sport,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,basket ball,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,basket ball,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,basket ball,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,basket ball,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,basket ball,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...,...
587,Ryan Rollins,basket ball,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,basket ball,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,basket ball,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,basket ball,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [79]:
nba.insert(loc=1, column="Sport", value="basket ball")

In [54]:
nba["after increment"]=nba["Salary"]*2

In [55]:
type(nba["Salary"])

pandas.core.series.Series

In [56]:
nba

Unnamed: 0,Name,Sport,Team,Position,Height,Weight,College,Salary,after increment
0,Saddiq Bey,basket ball,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0,9113966.0
1,Bogdan Bogdanovic,basket ball,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0,37400000.0
2,Kobe Bufkin,basket ball,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0,8188488.0
3,Clint Capela,basket ball,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0,41232000.0
4,Bruno Fernando,basket ball,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0,5163044.0
...,...,...,...,...,...,...,...,...,...
587,Ryan Rollins,basket ball,Washington Wizards,G,6-3,180.0,Toledo,1719864.0,3439728.0
588,Landry Shamet,basket ball,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0,20500000.0
589,Tristan Vukcevic,basket ball,Washington Wizards,F,6-10,220.0,Real Madrid,,
590,Delon Wright,basket ball,Washington Wizards,G,6-5,185.0,Utah,8195122.0,16390244.0


## A Review of the value_counts method
* The value_counts method counts the number of times that each unique value occurs in a **Series**.
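For a quick check on something small, here is a sketch with a hypothetical six-element Series of team names.

```python
import pandas as pd

# Hypothetical team names for illustration.
teams = pd.Series(["Hawks", "Nets", "Hawks", "Heat", "Hawks", "Nets"])

# Counts come back sorted from most to least frequent.
counts = teams.value_counts()
print(counts)
```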

In [57]:
nba["Team"].value_counts()

Team
Dallas Mavericks          23
Miami Heat                22
Denver Nuggets            22
Milwaukee Bucks           22
Memphis Grizzlies         22
Indiana Pacers            21
Utah Jazz                 21
Toronto Raptors           21
Philadelphia 76ers        21
Oklahoma City Thunder     21
New York Knicks           21
Washington Wizards        21
Phoenix Suns              20
Houston Rockets           20
Charlotte Hornets         20
San Antonio Spurs         20
Los Angeles Clippers      19
Minnesota Timberwolves    19
Detroit Pistons           19
Cleveland Cavaliers       19
Los Angeles Lakers        19
Chicago Bulls             19
Sacramento Kings          18
Orlando Magic             18
Boston Celtics            18
Atlanta Hawks             18
Portland Trail Blazers    17
Golden State Warriors     17
Brooklyn Nets             17
New Orleans Pelicans      16
Name: count, dtype: int64

# Drop Rows with Missing Values
* Pandas uses a **NaN** designation for cells that have a missing value.
* The **dropna** method deletes rows with missing values. Its default behavior is to remove a row if it has any missing value.
* Pass the **how** parameter an argument of "all" to delete rows where all the values are **NaN**.
* The **subset** parameter customizes/limits the columns that pandas will use to drop rows with missing values.
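The three variants can be compared side by side. This sketch uses a hypothetical four-row `roster` whose last row is entirely missing, similar to the empty trailing row in `nba.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical roster; the last row is entirely missing.
roster = pd.DataFrame(
    {
        "Name": ["A", "B", "C", np.nan],
        "College": ["Duke", np.nan, "Utah", np.nan],
        "Salary": [1.0, 2.0, np.nan, np.nan],
    }
)

any_missing = roster.dropna()                     # drop rows with at least one NaN
all_missing = roster.dropna(how="all")            # drop rows where every value is NaN
college_only = roster.dropna(subset=["College"])  # only look at the College column

print(len(any_missing), len(all_missing), len(college_only))
```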

In [58]:
nba=pd.read_csv("nba.csv")

In [59]:
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [60]:
nba.dropna()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
585,Eugene Omoruyi,Washington Wizards,F,6-6,235.0,Oregon,559782.0
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0


In [61]:
nba.dropna(how="any")

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
585,Eugene Omoruyi,Washington Wizards,F,6-6,235.0,Oregon,559782.0
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0


In [62]:
nba.dropna(how="all")

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,


In [63]:
nba.dropna(subset=["College"])
nba.dropna(subset=["College", "Salary"])

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
585,Eugene Omoruyi,Washington Wizards,F,6-6,235.0,Oregon,559782.0
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0


# Fill in Missing Values with fillna Method
* The **fillna** method replaces missing **NaN** values with its argument.
* The **fillna** method is available on both **DataFrame** and **Series**.
* In older pandas versions an extracted **Series** can be a view on the original **DataFrame**, but the **fillna** method returns a copy, so assign the result back to the column to keep it.
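The sketch below, on hypothetical toy data, shows that calling `fillna` alone leaves the original untouched and that the result has to be assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
roster = pd.DataFrame({"College": ["Duke", np.nan], "Salary": [1.0, np.nan]})

filled = roster["Salary"].fillna(0)                    # returns a new Series
missing_before = int(roster["Salary"].isna().sum())    # the original still has its NaN

roster["Salary"] = filled                              # assign back to keep the change
roster["College"] = roster["College"].fillna("Unknown")
missing_after = int(roster.isna().sum().sum())

print(missing_before, missing_after)
```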

In [64]:
nba=pd.read_csv('nba.csv').dropna(how="all")
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,


In [65]:
nba.fillna(0)
nba['Salary']=nba["Salary"].fillna(0)

In [66]:
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0


In [67]:
nba["College"]=nba["College"].fillna("Unknown")  # the replacement is passed to fillna's value parameter
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0


# The astype Method I
* The **astype** method converts a **Series**'s values to a specified type.
* Pass in the specified type as either a string or the core Python data type.
* Pandas can't convert **NaN** values to integer types, so we need to eliminate/replace them before we perform the conversion.
* The **dtypes** attribute returns a Series with the DataFrame's columns and their types.
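A small sketch of why the missing values have to go first, using a hypothetical three-value `salary` Series. It asks for `"int64"` explicitly, since plain `int` can map to `int32` on some platforms.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value.
salary = pd.Series([4556983.0, np.nan, 18700000.0])

# Converting directly fails because NaN has no integer representation.
try:
    salary.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Replace the missing value first, then convert.
as_int = salary.fillna(0).astype("int64")
print(conversion_failed, as_int.dtype)
```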

In [83]:
nba=pd.read_csv('nba.csv').dropna(how="all")
nba["Salary"]=nba['Salary'].fillna(0)
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0


In [69]:
nba.dtypes

Name         object
Team         object
Position     object
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [84]:
nba['Salary'].astype('int')
nba['Salary'].astype(int)
nba['Salary']=nba['Salary'].astype('int')
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0


In [71]:
nba['Weight']=nba['Weight'].fillna(0)
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0


In [72]:
nba['Weight']=nba['Weight'].astype(int)
nba.dtypes

Name        object
Team        object
Position    object
Height      object
Weight       int32
College     object
Salary       int32
dtype: object

# The astype Method II
* The **category** type is ideal for columns with a limited number of unique values.
* The **nunique** method will return a **Series** with the number of unique values in each column.
* With categories, pandas does not create a separate value in memory for each 'cell'. Rather, the cells point to a single copy for each unique value.
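To see the memory effect, this sketch compares a hypothetical Series of 3,000 repeated team names before and after conversion. The exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column with only three distinct values repeated many times.
teams = pd.Series(["Atlanta Hawks", "Boston Celtics", "Brooklyn Nets"] * 1000)
as_category = teams.astype("category")

# deep=True also counts the memory held by the string objects.
plain_bytes = teams.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(teams.nunique(), plain_bytes, category_bytes)
```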

In [73]:
nba=pd.read_csv('nba.csv')
nba["Team"]=nba["Team"].astype("category")

In [85]:
nba.info()
nba.nunique()

<class 'pandas.core.frame.DataFrame'>
Index: 591 entries, 0 to 590
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    591 non-null    int32  
dtypes: float64(1), int32(1), object(5)
memory usage: 34.6+ KB


Name        591
Team         30
Position      7
Height       20
Weight       93
College     182
Salary      299
dtype: int64

In [75]:
nba["Position"]=nba['Position'].astype('category')

In [76]:
nba['Position']
nba['Team']

0           Atlanta Hawks
1           Atlanta Hawks
2           Atlanta Hawks
3           Atlanta Hawks
4           Atlanta Hawks
              ...        
587    Washington Wizards
588    Washington Wizards
589    Washington Wizards
590    Washington Wizards
591                   NaN
Name: Team, Length: 592, dtype: category
Categories (30, object): ['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets', 'Charlotte Hornets', ..., 'San Antonio Spurs', 'Toronto Raptors', 'Utah Jazz', 'Washington Wizards']

# Sort a Df with the sort_values Method I
* The *sort_values* method sorts a DF by the values in one or more columns. The default sort is an ascending one.
* The first parameter (*by*) expects the columns to sort by.
* If sorting by a single column, pass a string with its name.
* The *ascending* parameter customizes the sort order.
* The *na_position* parameter customizes where pandas places **NaN** values ('first' or 'last').
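Here is a minimal sketch of both parameters on hypothetical three-row data.

```python
import numpy as np
import pandas as pd

# Hypothetical players; one salary is missing.
players = pd.DataFrame(
    {"Name": ["Cody", "Abe", "Bea"], "Salary": [300.0, np.nan, 100.0]}
)

by_name_desc = players.sort_values(by="Name", ascending=False)
nan_first = players.sort_values("Salary", na_position="first")

print(by_name_desc["Name"].tolist())
print(nan_first["Name"].tolist())
```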

In [77]:
nba.sort_values(by="Name", ascending=False)
nba.sort_values('Salary', na_position='first')

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
23,Blake Griffin,Boston Celtics,F,6-9,250.0,Oklahoma,
26,Mfiondu Kabengele,Boston Celtics,C,6-10,250.0,Florida State,
28,Svi Mykhailiuk,Boston Celtics,G-F,6-7,205.0,Kansas,
35,Robert Williams III,Boston Celtics,C-F,6-9,237.0,Texas A&M,
39,Nic Claxton,Brooklyn Nets,C,6-11,215.0,Georgia,
...,...,...,...,...,...,...,...
261,LeBron James,Los Angeles Lakers,F,6-9,250.0,St. Vincent-St. Mary HS (OH),47607350.0
145,Nikola Jokic,Denver Nuggets,C,6-11,284.0,Mega Basket,47607350.0
436,Joel Embiid,Philadelphia 76ers,C-F,7-0,280.0,Kansas,47607350.0
461,Kevin Durant,Phoenix Suns,F,6-10,240.0,Texas,47649433.0


In [86]:
nba['Salary'].astype('int')

0       4556983
1      18700000
2       4094244
3      20616000
4       2581522
         ...   
586    27955357
587     1719864
588    10250000
589           0
590     8195122
Name: Salary, Length: 591, dtype: int32

In [None]:
nba.columns

In [None]:
nba.rename(columns={"Name":'name', "Team":'team'})
nba.drop(nba.index[1:5])

# Sort a DF with sort_values Method II
* To sort by multiple columns, pass the *by* parameter a list of column names.
* Pandas will sort in the specified column order.
* Pass the *ascending* parameter a Boolean to sort all columns in a consistent order.
* Pass *ascending* a list to customize the sort order per column.
* The *ascending* list length must match the *by* list.
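A sketch with hypothetical data: teams in ascending order, and names in descending order within each team.

```python
import pandas as pd

# Hypothetical players for illustration.
players = pd.DataFrame(
    {
        "Team": ["Nets", "Hawks", "Nets", "Hawks"],
        "Name": ["Ann", "Bob", "Cal", "Dee"],
    }
)

# One Boolean per column in the by list.
mixed = players.sort_values(["Team", "Name"], ascending=[True, False])
print(mixed)
```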


In [None]:
nba=pd.read_csv('nba.csv')
nba

In [None]:
nba.sort_values(["Team", "Name"])

In [None]:
nba[nba["Team"]=="Atlanta Hawks"].sort_values(["Team","Name"], ascending=[True, False])

# Sort a DF by its Index
* The *sort_index* method sorts the **DataFrame** by its index position/labels.

In [None]:
nba

In [None]:
nba=nba.set_index("Name")

In [None]:
nba.sort_index()

# Rank Values with the rank method
* The *rank* method assigns a numeric ranking to each **Series** value.
* Pandas assigns the same rank to equal values (by default, the average of the positions they occupy), which leaves a 'gap' in the ranking sequence.
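Tie handling is easier to see on four hypothetical values. The sketch compares the default `method="average"` with `method="min"`. Note that casting average ranks with `astype(int)` truncates a fractional rank such as 1.5 to 1.

```python
import pandas as pd

# Hypothetical salaries with a tie for first place.
salary = pd.Series([300, 100, 300, 200])

average_rank = salary.rank(ascending=False)            # ties share the mean of positions 1 and 2
min_rank = salary.rank(ascending=False, method="min")  # ties share the lowest position

print(average_rank.tolist())
print(min_rank.tolist())
```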

In [None]:
nba=pd.read_csv('nba.csv').dropna(how="all")
nba

In [None]:
nba['Salary']=nba["Salary"].fillna(0).astype(int)
nba

In [None]:
nba.Salary

In [89]:
nba["Salary Rank"]=nba.Salary.rank(ascending=False).astype(int)
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary,Salary Rank
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983,231
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000,80
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244,243
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000,69
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522,308
...,...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357,48
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864,394
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000,140
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0,540


In [88]:
nba.Salary.max()

51915615

In [None]:
nba[nba["Salary"]==51915615]

In [None]:
nba.sort_values("Salary", ascending=False).head(10)

# DataFrames II: Filtering Data

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` function converts a **Series** to hold datetime values.
- The `format` parameter informs pandas of the format that the times are stored in.
- We pass symbols designating the segments of the string. For example, %m means "month" and %d means day.
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.
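The conversions can be sketched on two hypothetical rows. The sample strings assume a month/day/year date and a 12-hour clock with AM/PM, which is why the time format uses `%I`. With `%H`, the `%p` marker does not shift afternoon hours.

```python
import pandas as pd

# Hypothetical rows; the string formats are assumptions for illustration.
logins = pd.DataFrame(
    {
        "Start Date": ["8/6/1993", "3/31/1996"],
        "Last Login Time": ["4:47 PM", "6:53 AM"],
    }
)

logins["Start Date"] = pd.to_datetime(logins["Start Date"], format="%m/%d/%Y")

# %I is the 12-hour clock directive, so %p (AM/PM) is honored.
logins["Last Login Time"] = pd.to_datetime(
    logins["Last Login Time"], format="%I:%M %p"
).dt.time

print(logins)
```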

In [None]:
employees=pd.read_csv('employees.csv')

In [None]:
employees.info()

In [None]:
employees["Start Date"]=pd.to_datetime(employees["Start Date"],  format="%m/%d/%Y")

In [None]:
employees['Last Login Time']=pd.to_datetime(employees['Last Login Time'],  format="%I:%M %p").dt.time  # %I (12-hour clock) is required for %p (AM/PM) to take effect

In [None]:
employees['Senior Management']=employees['Senior Management'].astype(bool)

In [None]:
employees['Gender']=employees['Gender'].astype("category")

In [None]:
employees.info()

In [None]:
employees

In [None]:
employees=pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")

In [None]:
employees["Last Login Time"]=pd.to_datetime(employees["Last Login Time"], format="%I:%M %p").dt.time  # %I (12-hour clock) is required for %p (AM/PM) to take effect

## Filter a DataFrame Based on a Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc.).

In [None]:
employees.head()

In [None]:
employees[employees["Gender"]=="Male"]

In [None]:
employees["Senior Management"]=employees["Senior Management"].astype(bool)

In [None]:
employees[employees["Senior Management"]]
employees[employees["Salary"]>100000]
employees["Start Date"]<"1985-01-01"

In [None]:
import datetime as dt

In [None]:
# dt.time(1,0,0)
employees[employees["Last Login Time"]<dt.time(12, 0,0)]

### Using datetime module

In [None]:
DT=dt.datetime(2003,12,12,10,30,0)

In [None]:
print(DT)

### Using pandas to_datetime function

In [None]:
dt_pandas=pd.to_datetime("2023-12-12 10:30:00")
print(dt_pandas)

In [None]:
dt_pandas.year
dt_pandas.month
dt_pandas.day

## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.

In [None]:
employees.columns

In [None]:
employees[(employees["Gender"]=="Female") & (employees["Team"]=="Marketing")]

## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.
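A minimal sketch contrasting `&` and `|` on hypothetical staff data follows. Storing each condition in a variable keeps the filter readable. When conditions are written inline, each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical staff table, not the employees.csv dataset.
staff = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Male"],
        "Team": ["Marketing", "Marketing", "Sales", "Legal"],
    }
)

is_female = staff["Gender"] == "Female"
in_marketing = staff["Team"] == "Marketing"

both = staff[is_female & in_marketing]     # rows that meet both conditions
either = staff[is_female | in_marketing]   # rows that meet at least one condition

print(both.index.tolist(), either.index.tolist())
```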

## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.
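As a sketch on hypothetical data, `isin` replaces a chain of `|` comparisons against the same column.

```python
import pandas as pd

# Hypothetical staff table for illustration.
staff = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Team": ["Marketing", "Sales", "Legal", "Sales"]}
)

# True where the Team value appears in the list.
mask = staff["Team"].isin(["Sales", "Legal"])
selected = staff[mask]

print(mask.tolist())
print(selected["Name"].tolist())
```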

## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.

In [None]:
employees["Team"].notnull()
employees["Team"].isnull()

In [None]:
employees.info()

## The between Method
- The `between` method returns True if a **Series** value is found within its range.

In [None]:
employees["Salary"].between(30000,60000)

In [None]:
dt

In [None]:
employees["Last Login Time"].between(dt.time(8,30), dt.time(12,0))

## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass False to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a Boolean **Series**'s values. Trues will become Falses, and Falses will become Trues.
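The effect of each `keep` option is easiest to see on a short hypothetical Series of first names.

```python
import pandas as pd

# Hypothetical first names; "Henry" and "Ann" each appear twice.
names = pd.Series(["Henry", "Ann", "Henry", "Bob", "Ann"])

first_kept = names.duplicated()             # keep="first" is the default
last_kept = names.duplicated(keep="last")
all_marked = names.duplicated(keep=False)
only_unique = ~all_marked                   # invert: True only for names that occur once

print(first_kept.tolist())
print(last_kept.tolist())
print(names[only_unique].tolist())
```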

In [None]:
employees["First Name"].duplicated()

In [None]:
employees[employees["First Name"]=="Henry"]
top=employees.head(150)
top[top["First Name"].duplicated(keep="last")]
top[top["First Name"].duplicated(keep="first")]


In [None]:
top[top["First Name"].duplicated(keep="last")]


In [None]:
employees

In [None]:
top[top["First Name"].duplicated(keep=False)]
employees[~employees["First Name"].duplicated(keep=False)]


## The drop_duplicates Method
- The `drop_duplicates` method deletes rows with duplicate values.
- By default, it will remove a row if *all* of its values are shared with another row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to `subset` parameter to look for duplicates across multiple columns.
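The sketch below applies the same options to a hypothetical four-row table in which row 3 repeats row 0.

```python
import pandas as pd

# Hypothetical staff table; row 3 is identical to row 0.
staff = pd.DataFrame(
    {
        "First Name": ["Henry", "Ann", "Henry", "Henry"],
        "Team": ["Sales", "Sales", "Legal", "Sales"],
    }
)

whole_row = staff.drop_duplicates()                              # compare all columns
last_per_team = staff.drop_duplicates("Team", keep="last")       # last row of each team
no_repeats = staff.drop_duplicates(["First Name", "Team"], keep=False)

print(whole_row.index.tolist())
print(last_per_team.index.tolist())
print(no_repeats.index.tolist())
```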

In [None]:
employees.drop_duplicates()
employees.drop_duplicates("Team", keep="last")
employees.drop_duplicates("Team", keep="first")
employees.drop_duplicates("First Name", keep=False)


In [None]:
employees.drop_duplicates(["Senior Management", "Team"], keep="last").sort_values("Team")

## The unique and nunique Methods
- The `unique` method on a **Series** returns a collection of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter configures whether to include or exclude missing (`NaN`) values.

In [None]:
employees["Gender"].unique()
employees["Team"].unique()
employees["Team"].nunique(dropna=False)

In [None]:
employees["Team"].unique()
employees.nunique()

In [None]:
employees["Salary"].nunique()

# DataFrames III: Data Extraction

## This Module's Dataset
- This module's dataset is a collection of all James Bond movies.

In [None]:
james=pd.read_csv("jamesbond.csv")
james

## The set_index and reset_index Methods
- The index serves as the collection of primary identifiers/labels/entrypoints for the rows.
- The fastest way to extract a row is from a sorted index by position/label.
- Pandas uses index labels/values when merging different objects together.
- The `set_index` method sets an existing column as the index of the **DataFrame**.
- The `reset_index` method restores the standard ascending numeric index (0, 1, 2, ...) and, by default, moves the old index back into a regular column.

In [None]:
james=james.set_index("Film")

In [None]:
james.reset_index().set_index("Year")
james=james.reset_index()
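The round trip between a column and the index can be sketched on a three-row stand-in for the Bond dataset. The `films` DataFrame below is built by hand, and `drop=True` is shown as an optional extra, not something the cells above use.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the James Bond dataset.
films = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger", "Skyfall"],
    "Year": [1962, 1964, 2012],
})

by_film = films.set_index("Film")         # "Film" moves from the columns to the index
restored = by_film.reset_index()          # "Film" returns as a column; index is 0..n-1
dropped = by_film.reset_index(drop=True)  # discard the old index instead of keeping it

print(by_film, restored, dropped, sep="\n\n")
```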

## Retrieve Rows by Index Position with iloc Accessor
- The `iloc` accessor retrieves one or more rows by index position.
- Provide a pair of square brackets after the accessor.
- `iloc` accepts single values, lists, and slices.

In [None]:
james.iloc[[2,5]]
james.iloc[1:8]


## Retrieve Rows by Index Label with loc Accessor
- The `loc` accessor retrieves one or more rows by index label.
- Provide a pair of square brackets after the accessor.

In [None]:
james=james.set_index("Film")


In [None]:
james
james.loc["Dr. No"]
james.loc["Casino Royale"]
james.loc["Dr. No":"Thunderball"]
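One difference worth checking when moving between the two accessors: an `iloc` slice excludes its stop position, while a `loc` slice includes its end label. A minimal sketch on a hand-built `films` DataFrame:

```python
import pandas as pd

# Hand-built stand-in with film titles as the index labels.
films = pd.DataFrame(
    {"Year": [1962, 1963, 1964, 1965]},
    index=["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
)

by_position = films.iloc[0:2]                # positions 0 and 1; stop is excluded
by_label = films.loc["Dr. No":"Goldfinger"]  # end label is included
single_row = films.loc["Goldfinger"]         # a single label returns a Series

print(by_position, by_label, single_row, sep="\n\n")
```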

## Second Arguments to loc and iloc Accessors
- The second value inside the square brackets targets the columns.
- The `iloc` accessor requires numeric positions for rows and columns.
- The `loc` accessor requires labels for rows and columns.

In [None]:
james.iloc[0:5,0:4]
james.iloc[3,1]
james.loc["Dr. No":"Goldfinger","Year":"Director"]
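The row/column pairing can be sketched on a small hand-built table, where the same cell is reached once by position and once by label. The `films` DataFrame is an illustrative stand-in, not the full dataset.

```python
import pandas as pd

# Hand-built stand-in with film titles as the index labels.
films = pd.DataFrame(
    {
        "Year": [1962, 1964, 2012],
        "Actor": ["Sean Connery", "Sean Connery", "Daniel Craig"],
        "Budget": [7.0, 18.6, 170.2],
    },
    index=["Dr. No", "Goldfinger", "Skyfall"],
)

cell_by_position = films.iloc[2, 1]            # third row, second column
cell_by_label = films.loc["Skyfall", "Actor"]  # the same cell, addressed by label

# A label slice for the rows combined with a list of column labels.
block = films.loc["Dr. No":"Goldfinger", ["Year", "Budget"]]

print(cell_by_position, cell_by_label, block, sep="\n")
```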

## Overwrite Value in a DataFrame
- Use the `iloc` or `loc` accessor on the **DataFrame** to target a value, then provide the equal sign and a new value.

In [None]:
james.loc["Diamonds Are Forever","Actor"]="Maki Reddy"
james

## Overwrite Multiple Values in a DataFrame
- The `replace` method replaces all occurrences of a **Series** value with another value (think of it like "Find and Replace").
- To overwrite multiple values in a **DataFrame**, remember to use an accessor on the **DataFrame** itself.
- Accessors like `loc` and `iloc` can accept Boolean Series. Use them to target the values to overwrite.

In [None]:
james["Actor"]=james["Actor"].replace("Sean Connery","Sir Sean Connery")
james.loc[james["Actor"]=="Sean Connery", "Actor"]="Sir Sean Connery"

In [None]:
james
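A minimal sketch of the Boolean-mask approach on a hand-built table. The mask selects the rows, the second argument to `loc` selects the column, and the assignment writes to the **DataFrame** itself.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the James Bond dataset.
films = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger", "Skyfall"],
    "Actor": ["Sean Connery", "Sean Connery", "Daniel Craig"],
})

# Boolean Series: True for every row to overwrite.
is_connery = films["Actor"] == "Sean Connery"

# Rows come from the mask, the column from the label; films is modified in place.
films.loc[is_connery, "Actor"] = "Sir Sean Connery"

print(films)
```

Chaining two sets of brackets instead, as in `films[is_connery]["Actor"] = ...`, assigns to a temporary copy and leaves `films` unchanged, which is why the accessor form is used here.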

## Rename Index Labels or Columns in a DataFrame
- The `rename` method accepts a dictionary for either its `columns` or `index` parameters.
- The dictionary keys represent the existing names and the values represent the new names.
- We can replace all columns by overwriting the **DataFrame's** `columns` attribute.

In [None]:
james.rename(columns={"Year":"year", "Box Office":"revenue"}, inplace=True)

In [None]:
swaps={
    "Dr. No":"Dr No",
    "GoldenEye":"Golden Eye"
}
# "Film" is already the index (set in an earlier cell), so set_index is not called again here.
james

In [None]:
james.rename(index=swaps)


In [None]:
james.columns
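The two renaming routes can be compared on a small hand-built table. The replacement names `release_year` and `gross` are placeholders chosen for this sketch.

```python
import pandas as pd

# Hand-built stand-in with film titles as the index labels.
films = pd.DataFrame(
    {"Year": [1962, 1964], "Box Office": [448.8, 820.4]},
    index=["Dr. No", "Goldfinger"],
)

# rename maps old names to new ones and returns a new DataFrame by default.
renamed = films.rename(
    columns={"Year": "year", "Box Office": "revenue"},
    index={"Dr. No": "Dr No"},
)

# Overwriting the columns attribute replaces every name at once;
# the list must have exactly one entry per existing column.
films.columns = ["release_year", "gross"]

print(renamed, films, sep="\n\n")
```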

## Delete Rows or Columns from a DataFrame
- The `drop` method deletes one or more rows/columns from a **DataFrame**.
- Pass the `index` parameter a list of the row labels to remove, and the `columns` parameter a list of the column names to remove.
- The `pop` method removes and returns a single **Series** (it mutates the **DataFrame** in the process).
- Python's `del` keyword also removes a single **Series**.

In [None]:
james.drop(columns="revenue", inplace=True)  # "Box Office" was renamed to "revenue" above

In [None]:
james.drop(index=["Casino Royale"])
james.drop(index=["Casino Royale"], columns=["Budget"])

In [None]:
james.pop("Actor")


In [None]:
james


In [None]:
del james["year"]  # "Year" was renamed to "year" above

In [None]:
james
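The three removal tools differ in what they return and whether they mutate. A minimal sketch on a hand-built `films` DataFrame:

```python
import pandas as pd

# Hand-built stand-in with film titles as the index labels.
films = pd.DataFrame(
    {
        "Year": [1962, 1964, 2012],
        "Actor": ["Sean Connery", "Sean Connery", "Daniel Craig"],
        "Budget": [7.0, 18.6, 170.2],
    },
    index=["Dr. No", "Goldfinger", "Skyfall"],
)

# drop returns a new DataFrame; films keeps all rows and columns.
trimmed = films.drop(index=["Skyfall"], columns=["Budget"])

# pop removes the column from films and hands it back as a Series.
actors = films.pop("Actor")

# del removes the column from films and returns nothing.
del films["Year"]

print(trimmed, actors, films, sep="\n\n")
```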

## Create Random Sample with the sample Method
- The `sample` method returns one or more random rows from the **DataFrame**. The `n` parameter sets how many (the default is one).
- Customize the `axis` parameter to extract random columns.

In [None]:
james.sample()
james.sample(axis=1)
james.sample(n=10)
james.sample(n=2, axis=0)
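Because `sample` is random, each run of the cell above gives a different result. As a sketch of two further parameters not used above, `random_state` makes a draw repeatable and `frac` asks for a fraction of the rows instead of a count. The `numbers` DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up table with 10 rows and 2 columns.
numbers = pd.DataFrame({"a": range(10), "b": range(10, 20)})

three_rows = numbers.sample(n=3, random_state=0)
again = numbers.sample(n=3, random_state=0)          # same seed, same rows
half = numbers.sample(frac=0.5, random_state=0)      # 50% of the rows
one_column = numbers.sample(axis=1, random_state=0)  # one random column

print(three_rows, half, one_column, sep="\n\n")
```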

## The nsmallest and nlargest Methods
- The `nlargest` method returns a specified number of rows with the largest values from a given column.
- The `nsmallest` method returns rows with the smallest values from a given column.
- The `nlargest` and `nsmallest` methods are more efficient than sorting the entire **DataFrame**.
- Both methods work only on columns with numeric values.

In [91]:
james=pd.read_csv("jamesbond.csv")

In [94]:
james.sort_values("Box Office", ascending=False).head(4)
james.nlargest(n=4, columns="Box Office")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
26,No Time to Die,2021,Daniel Craig,Cary Joji Fukunaga,774.2,301.0,25.0


In [95]:
james["Box Office"].nlargest(4)

24    943.5
3     848.1
2     820.4
26    774.2
Name: Box Office, dtype: float64

In [96]:
james.nsmallest(4, columns="Bond Actor Salary")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
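The equivalence between `nlargest` and a full sort followed by `head` can be checked directly. The sketch below rebuilds five rows of the dataset by hand, using box office figures from the outputs above.

```python
import pandas as pd

# Five rows rebuilt by hand from the outputs shown above.
films = pd.DataFrame({
    "Film": ["Dr. No", "On Her Majesty's Secret Service", "Goldfinger",
             "Thunderball", "Skyfall"],
    "Box Office": [448.8, 291.5, 820.4, 848.1, 943.5],
})

top3 = films.nlargest(n=3, columns="Box Office")
sorted3 = films.sort_values("Box Office", ascending=False).head(3)
bottom2 = films.nsmallest(2, "Box Office")

print(top3, bottom2, sep="\n\n")
print(top3.equals(sorted3))
```

With no tied values in the column, both routes return the same rows in the same order.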


## Filtering with the where Method
- Similar to square brackets or `loc`, the `where` method filters the original `DataFrame` with a Boolean Series.
- Pandas will populate rows that do **not** match the criteria with `NaN` values.
- Leaving in the `NaN` values can be advantageous for certain merge and visualization operations.

In [None]:
actor_is_sean_connery=james["Actor"]=="Sean Connery"
james[actor_is_sean_connery]
james.loc[actor_is_sean_connery]
james.where(actor_is_sean_connery)
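The difference between bracket filtering and `where` is the shape of the result. A minimal sketch on a hand-built table:

```python
import pandas as pd

# Hand-built stand-in for a few rows of the James Bond dataset.
films = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger", "Skyfall"],
    "Actor": ["Sean Connery", "Sean Connery", "Daniel Craig"],
    "Year": [1962, 1964, 2012],
})

is_connery = films["Actor"] == "Sean Connery"

filtered = films[is_connery]      # non-matching rows are removed
masked = films.where(is_connery)  # non-matching rows stay, filled with NaN

print(filtered, masked, sep="\n\n")
```

Calling `masked.dropna()` afterwards recovers the same two rows that the bracket filter keeps.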

## The apply Method with DataFrames
- The `apply` method invokes a function on every column or every row in the **DataFrame**.
- Pass the uninvoked function as the first argument to the `apply` method.
- Pass the `axis` parameter an argument of `"columns"` (or `1`) to invoke the function on every row.
- Pandas will pass in the row's values as a **Series** object. We can use accessors like `loc` and `iloc` to extract the column's values for that row.

In [None]:
james["Actor"].apply(len)
james.set_index("Film", inplace=True)
james

In [None]:
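def rank_movie(row):
    # Each row arrives as a Series, so loc looks up a value by column label.
    year = row.loc["Year"]
    actor = row.loc["Actor"]
    budget = row.loc["Budget"]
    # Rules are checked in order; the first match wins.
    if 1980 <= year < 1990:
        return "Great 80's flick"
    if actor == "Pierce Brosnan":
        return "The best Bond ever"
    if budget > 100:
        return "Expensive movie, fun"
    return "No comment"


# axis=1 is equivalent to axis="columns": the function runs once per row.
print(james.apply(rank_movie, axis=1).sort_index())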
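To contrast the two directions of `apply`, the sketch below runs one function per row and another per column on a two-row stand-in table. The helper names `profit` and `value_range` are made up for this example.

```python
import pandas as pd

# Two rows rebuilt by hand from the outputs shown earlier.
films = pd.DataFrame(
    {"Box Office": [448.8, 943.5], "Budget": [7.0, 170.2]},
    index=["Dr. No", "Skyfall"],
)

def profit(row):
    # row is a Series indexed by column label.
    return row.loc["Box Office"] - row.loc["Budget"]

def value_range(column):
    # column is a Series indexed by row label.
    return column.max() - column.min()

per_row = films.apply(profit, axis="columns")  # one result per film
per_column = films.apply(value_range)          # default axis=0: one result per column

print(per_row, per_column, sep="\n\n")
```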