# Pandas

pandas is a Python library for data analysis and manipulation, including numerical and time-series analysis.

It makes it easy to load, clean, analyze, and visualize datasets.

### The main objects are:

Series → one column of data.

DataFrame → a table with rows and columns (like Excel).

### Common tasks with pandas:

Read data (pd.read_csv, pd.read_excel)

Select/filter data (df['col'], df[df['col'] > 10])

Summarize (df.describe(), df.mean(numeric_only=True))

Clean data (drop missing values, rename columns)

Group and aggregate (df.groupby('col').mean(numeric_only=True))

### In short: pandas = Excel + SQL + Python, all in one tool.
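The common tasks listed above can be sketched on a tiny dataset. This is only an illustration: the CSV text, the column names and the variable names are made up for the example and are read from memory rather than from a file.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here instead of a file on disk
csv_text = """name,city,score
Ana,Lima,12
Ben,Oslo,7
Cara,Lima,15
Dev,Oslo,
"""

df = pd.read_csv(io.StringIO(csv_text))                    # read data
high = df[df['score'] > 10]                                # select/filter rows
summary = df['score'].describe()                           # summarize one column
clean = df.dropna().rename(columns={'score': 'points'})    # clean: drop missing, rename
by_city = clean.groupby('city')['points'].mean()           # group and aggregate

print(high)
print(summary)
print(by_city)
```

Selecting the numeric column before calling mean() avoids errors that recent pandas versions raise when asked to average text columns.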


The cell below mounts Google Drive in a Google Colab environment, which lets you access files stored in your Google Drive directly from the Colab notebook.

In [1]:
# Mount Google Drive (if not already mounted)
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# !pwd

/content


In [2]:
# cd /content/drive/MyDrive/Python_Code/

/content/drive/MyDrive/Python_Code


In [7]:
# !pip install bokeh_sampledata

Collecting bokeh_sampledata
  Downloading bokeh_sampledata-2025.0-py3-none-any.whl.metadata (2.6 kB)
Collecting icalendar (from bokeh_sampledata)
  Downloading icalendar-6.3.1-py3-none-any.whl.metadata (9.0 kB)
Downloading bokeh_sampledata-2025.0-py3-none-any.whl (17.7 MB)
Downloading icalendar-6.3.1-py3-none-any.whl (242 kB)
Installing collected packages: icalendar, bokeh_sampledata
Successfully installed bokeh_sampledata-2025.0 icalendar-6.3.1


In [3]:
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress

#example dataframe from the bokeh module
from bokeh.sampledata.autompg import autompg as example_df


In [4]:
# `autompg` is already a pandas DataFrame
print(example_df.head())


    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5  70       1          buick skylark 320
2  18.0    8  318.0  150    3436   11.0  70       1         plymouth satellite
3  16.0    8  304.0  150    3433   12.0  70       1              amc rebel sst
4  17.0    8  302.0  140    3449   10.5  70       1                ford torino


mpg → fuel efficiency (miles per gallon)

cyl → number of cylinders in the engine

displ → engine size (displacement)

hp → horsepower (engine power)

weight → car weight (in pounds)

accel → acceleration (seconds to go 0 → 60 mph)

yr → model year (70 = 1970, etc.)

origin → where the car was made (1 = USA, 2 = Europe, 3 = Japan)

name → car name (brand and model)

In [5]:
# show all data
example_df

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


## Basic Data Structures
### Series

A Series is a one-dimensional labeled array that holds data of any type (integers, strings, floats, etc.). You can think of it as a single column of data with an index (row labels).

It is similar to a NumPy array, but with labels (the index) that make it easier to reference and manipulate data.

### DataFrame

A DataFrame is a two-dimensional labeled table. It is essentially a collection of Series objects that share an index, where each Series forms a column.

### In short:

Series = one column of labeled data.

DataFrame = multiple Series put together in a table.
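A minimal sketch of that relationship, assuming two small Series with made-up labels and values: combining them column-wise gives a DataFrame, and selecting a column from the DataFrame gives a Series back.

```python
import pandas as pd

# Two Series that share the same index labels (labels and values are made up)
mpg = pd.Series([18.0, 15.0, 24.0], index=['car_a', 'car_b', 'car_c'], name='mpg')
cyl = pd.Series([8, 8, 4], index=['car_a', 'car_b', 'car_c'], name='cyl')

# Combining Series column-wise gives a DataFrame; values are aligned on the index
cars = pd.concat([mpg, cyl], axis=1)

print(cars)
print(type(cars['mpg']))          # selecting one column returns a Series
print(cars.loc['car_c', 'mpg'])   # look up a value by row label and column name
```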

In [6]:
series_1 = pd.Series([10,20,30,40,50])
series_1

0    10
1    20
2    30
3    40
4    50
dtype: int64

You can also print just the underlying values:

In [7]:
print(series_1.values)

print(series_1)

[10 20 30 40 50]
0    10
1    20
2    30
3    40
4    50
dtype: int64


### DataFrame

Creating an empty dataframe

In [8]:
df_1 = pd.DataFrame()

You can also create an empty DataFrame with predefined columns

In [9]:
df_1 = pd.DataFrame(columns=('Col 1','Col 2','Col 3'))

In [10]:
df_1

Unnamed: 0,Col 1,Col 2,Col 3


In [24]:
table_1 = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]],columns=list('ABCDE'))

In [12]:
table_1

Unnamed: 0,A,B,C,D,E
0,1,2,3,4,5
1,6,7,8,9,10


## Writing To and Reading From Files

CSV

Let's write the auto-mpg DataFrame to a CSV file.  Note: if you leave the index parameter at its default of True, the index is written as an extra unnamed column, which shows up as 'Unnamed: 0' when the CSV is read back in

In [13]:
example_df.to_csv('./autompg.csv',index=False)

In [14]:
example_df

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


Here is how we read a CSV

In [15]:
example_df_2 = pd.read_csv('./autompg.csv')
example_df_2

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger
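The effect of the index parameter can be seen with a small round trip. This is a sketch on a made-up two-row DataFrame that writes to an in-memory string instead of a file; index_col=0 is one way to recover the index if a file was already written with it.

```python
import io
import pandas as pd

small = pd.DataFrame({'mpg': [18.0, 15.0], 'cyl': [8, 8]})

# With no path given, to_csv returns the CSV text; index=True is the default
csv_text = small.to_csv()

# Reading it back turns the unnamed index column into 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0 tells read_csv to use that first column as the index instead
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.columns.tolist())
```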


You can also specify which columns to read if you only want a subset of the data

In [16]:
example_df_noname = pd.read_csv('./autompg.csv',usecols=['mpg','cyl','displ','hp','weight','accel','yr','origin'])
example_df_noname.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin
0,18.0,8,307.0,130,3504,12.0,70,1
1,15.0,8,350.0,165,3693,11.5,70,1
2,18.0,8,318.0,150,3436,11.0,70,1
3,16.0,8,304.0,150,3433,12.0,70,1
4,17.0,8,302.0,140,3449,10.5,70,1


In [18]:
example_df_2.to_csv('./autompg.tsv',index=False,sep='\t')
example_df_3 = pd.read_csv('./autompg.tsv',sep='\t')
print(example_df_3.head())

    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5  70       1          buick skylark 320
2  18.0    8  318.0  150    3436   11.0  70       1         plymouth satellite
3  16.0    8  304.0  150    3433   12.0  70       1              amc rebel sst
4  17.0    8  302.0  140    3449   10.5  70       1                ford torino


In [20]:
example_df_3.shape

(392, 9)






1.   The cell below reads a large CSV file in chunks of 1000 rows at a time and prints the first row (the head) of each chunk.
2.   Instead of loading the entire file at once (which can exhaust memory for very large files), pandas loads one chunk at a time, so only about 1000 rows are held in memory at any moment.
3.   For each chunk, head(1) prints only the first row, which gives a quick preview of where each chunk starts.




In [30]:
# build a large DataFrame by repeating example_df_3 100 times
large_df = pd.concat([example_df_3] * 100, ignore_index=True)

# save large_df to a CSV file
large_df.to_csv('./large_autompg.csv', index=False)

# read the large CSV file in chunks of 1000 rows and print the first row of each chunk
for chunk in pd.read_csv('./large_autompg.csv', chunksize=1000):
    print(chunk.head(1))



    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
       mpg  cyl  displ  hp  weight  accel  yr  origin           name
1000  36.0    4   79.0  58    1825   18.6  77       2  renault 5 gtl
       mpg  cyl  displ   hp  weight  accel  yr  origin               name
2000  14.0    8  318.0  150    4096   13.0  71       1  plymouth fury iii
       mpg  cyl  displ   hp  weight  accel  yr  origin                   name
3000  20.6    6  231.0  105    3380   15.8  78       1  buick century special
       mpg  cyl  displ  hp  weight  accel  yr  origin             name
4000  28.0    4   97.0  92    2288   17.0  72       3  datsun 510 (sw)
       mpg  cyl  displ   hp  weight  accel  yr  origin               name
5000  23.0    8  350.0  125    3900   17.4  79       1  cadillac eldorado
       mpg  cyl  displ   hp  weight  accel  yr  origin               name
6000  15.0    8  318.0  150    3399 
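Chunked reading pays off when each chunk is reduced to a small result before the next one is loaded. The sketch below is an assumption-laden illustration: it builds a made-up 2,500-row CSV in memory to stand in for a large file and accumulates a running sum and row count to compute an overall mean.

```python
import io
import pandas as pd

# Build a small in-memory CSV of 2,500 rows (made-up values) to stand in for a large file
source = pd.DataFrame({'mpg': [10.0, 20.0, 30.0, 40.0, 50.0] * 500})
csv_buffer = io.StringIO(source.to_csv(index=False))

total = 0.0
rows = 0
chunk_sizes = []

# Each chunk is an ordinary DataFrame holding at most `chunksize` rows
for chunk in pd.read_csv(csv_buffer, chunksize=1000):
    total += chunk['mpg'].sum()
    rows += len(chunk)
    chunk_sizes.append(len(chunk))

overall_mean = total / rows
print(chunk_sizes)     # the last chunk is smaller when rows do not divide evenly
print(overall_mean)
```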

## Working with Excel

Use ExcelWriter to write a DataFrame or multiple DataFrames to an Excel Workbook

In [31]:
import pandas as pd

# create a sample DataFrame
example_df_4 = pd.DataFrame([{'Name':'Steve Jobs','Company':'Apple'},{'Name':'Bill Gates','Company':'Microsoft'}])

# create an ExcelWriter; pandas picks an installed engine for .xlsx files (for example XlsxWriter or openpyxl)
with pd.ExcelWriter('./test_workbook.xlsx') as writer:
    # write each DataFrame to its own sheet
    example_df_3.to_excel(writer,sheet_name='autompg',index=False)
    example_df_4.to_excel(writer,sheet_name='name_company',index=False)


When reading from an Excel workbook, Pandas assumes you want just the first sheet of the workbook by default


In [36]:
# with no sheet_name given, read_excel returns the first sheet ('autompg')
df1 = pd.read_excel('./test_workbook.xlsx')
print(df1.head(10))

    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5  70       1          buick skylark 320
2  18.0    8  318.0  150    3436   11.0  70       1         plymouth satellite
3  16.0    8  304.0  150    3433   12.0  70       1              amc rebel sst
4  17.0    8  302.0  140    3449   10.5  70       1                ford torino
5  15.0    8  429.0  198    4341   10.0  70       1           ford galaxie 500
6  14.0    8  454.0  220    4354    9.0  70       1           chevrolet impala
7  14.0    8  440.0  215    4312    8.5  70       1          plymouth fury iii
8  14.0    8  455.0  225    4425   10.0  70       1           pontiac catalina
9  15.0    8  390.0  190    3850    8.5  70       1         amc ambassador dpl


To read a specific sheet, pass its name through the sheet_name parameter

In [37]:
df2 = pd.read_excel('./test_workbook.xlsx',sheet_name='name_company')
df2

Unnamed: 0,Name,Company
0,Steve Jobs,Apple
1,Bill Gates,Microsoft


## Working with JSON/APIs

This is a very simple example to illustrate that it's easy to work with JSON payloads, as long as the payload has a regular structure that pandas can map onto rows and columns.

Pandas can write a DataFrame to a JSON file, and also read in from a JSON file...

In [38]:
example_df.to_json('./autompg.json')

df_from_json = pd.read_json('./autompg.json')

print(df_from_json.head(10))


    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5  70       1          buick skylark 320
2  18.0    8  318.0  150    3436   11.0  70       1         plymouth satellite
3  16.0    8  304.0  150    3433   12.0  70       1              amc rebel sst
4  17.0    8  302.0  140    3449   10.5  70       1                ford torino
5  15.0    8  429.0  198    4341   10.0  70       1           ford galaxie 500
6  14.0    8  454.0  220    4354    9.0  70       1           chevrolet impala
7  14.0    8  440.0  215    4312    8.5  70       1          plymouth fury iii
8  14.0    8  455.0  225    4425   10.0  70       1           pontiac catalina
9  15.0    8  390.0  190    3850    8.5  70       1         amc ambassador dpl


This also works for JSON held in memory as a string, without writing a file

In [40]:
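from io import StringIO

# with no path given, to_json returns the JSON as a string
json_object = example_df.to_json()

# wrap the string in StringIO; passing a literal JSON string to read_json is deprecated in recent pandas
df_from_json = pd.read_json(StringIO(json_object))

print(df_from_json.head(10))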


    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5  70       1          buick skylark 320
2  18.0    8  318.0  150    3436   11.0  70       1         plymouth satellite
3  16.0    8  304.0  150    3433   12.0  70       1              amc rebel sst
4  17.0    8  302.0  140    3449   10.5  70       1                ford torino
5  15.0    8  429.0  198    4341   10.0  70       1           ford galaxie 500
6  14.0    8  454.0  220    4354    9.0  70       1           chevrolet impala
7  14.0    8  440.0  215    4312    8.5  70       1          plymouth fury iii
8  14.0    8  455.0  225    4425   10.0  70       1           pontiac catalina
9  15.0    8  390.0  190    3850    8.5  70       1         amc ambassador dpl
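API responses often arrive as a list of row objects rather than the column-oriented layout to_json produces by default. A minimal sketch of that case, assuming a made-up two-row DataFrame and using orient='records' on both the write and the read side:

```python
import io
import pandas as pd

small = pd.DataFrame({'name': ['ford torino', 'vw pickup'], 'mpg': [17.5, 44.0]})

# orient='records' produces a list of row objects, a shape many web APIs return
records_json = small.to_json(orient='records')
print(records_json)

# Wrap the string in StringIO so read_json treats it as a file-like object
round_trip = pd.read_json(io.StringIO(records_json), orient='records')
print(round_trip)
```

Note that read_json infers data types again on the way in, so it is worth checking dtypes after a round trip.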




## Summarizing and Inspecting a DataFrame

Below is a sample of the most popular and useful pandas methods and attributes for exploring your data at a cursory level

In [44]:
example_df.shape

(392, 9)

In [45]:
example_df.index

RangeIndex(start=0, stop=392, step=1)

In [46]:
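# Note: info is a method. Written without parentheses, the cell shows the bound method's repr (below)
# instead of the summary; call example_df.info() to get column names, non-null counts and dtypes.
example_df.info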

<bound method DataFrame.info of       mpg  cyl  displ   hp  weight  accel  yr  origin  \
0    18.0    8  307.0  130    3504   12.0  70       1   
1    15.0    8  350.0  165    3693   11.5  70       1   
2    18.0    8  318.0  150    3436   11.0  70       1   
3    16.0    8  304.0  150    3433   12.0  70       1   
4    17.0    8  302.0  140    3449   10.5  70       1   
..    ...  ...    ...  ...     ...    ...  ..     ...   
387  27.0    4  140.0   86    2790   15.6  82       1   
388  44.0    4   97.0   52    2130   24.6  82       2   
389  32.0    4  135.0   84    2295   11.6  82       1   
390  28.0    4  120.0   79    2625   18.6  82       1   
391  31.0    4  119.0   82    2720   19.4  82       1   

                          name  
0    chevrolet chevelle malibu  
1            buick skylark 320  
2           plymouth satellite  
3                amc rebel sst  
4                  ford torino  
..                         ...  
387            ford mustang gl  
388                

In [47]:
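# Note: count is also a method. Without parentheses the cell shows the bound method's repr (below);
# call example_df.count() to get the number of non-null values in each column.
example_df.count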

<bound method DataFrame.count of       mpg  cyl  displ   hp  weight  accel  yr  origin  \
0    18.0    8  307.0  130    3504   12.0  70       1   
1    15.0    8  350.0  165    3693   11.5  70       1   
2    18.0    8  318.0  150    3436   11.0  70       1   
3    16.0    8  304.0  150    3433   12.0  70       1   
4    17.0    8  302.0  140    3449   10.5  70       1   
..    ...  ...    ...  ...     ...    ...  ..     ...   
387  27.0    4  140.0   86    2790   15.6  82       1   
388  44.0    4   97.0   52    2130   24.6  82       2   
389  32.0    4  135.0   84    2295   11.6  82       1   
390  28.0    4  120.0   79    2625   18.6  82       1   
391  31.0    4  119.0   82    2720   19.4  82       1   

                          name  
0    chevrolet chevelle malibu  
1            buick skylark 320  
2           plymouth satellite  
3                amc rebel sst  
4                  ford torino  
..                         ...  
387            ford mustang gl  
388               
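For comparison, here is a short sketch of what the two methods return when they are actually called, with parentheses. The three-row DataFrame is made up for the example and contains one missing value so the counts differ between columns.

```python
import io
import pandas as pd

# Small made-up frame with one missing value
small = pd.DataFrame({'mpg': [18.0, None, 24.0], 'name': ['a', 'b', 'c']})

# info() prints a summary (columns, non-null counts, dtypes); buf= captures it as text
buffer = io.StringIO()
small.info(buf=buffer)
print(buffer.getvalue())

# count() returns the number of non-null values per column
non_null = small.count()
print(non_null)
```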

In [48]:
# Will process numeric columns for count, mean, standard deviation (std), min, 25%, 50%, 75%, max
example_df.describe()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,23.445918,5.471939,194.41199,104.469388,2977.584184,15.541327,75.979592,1.576531
std,7.805007,1.705783,104.644004,38.49116,849.40256,2.758864,3.683737,0.805518
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.0,4.0,105.0,75.0,2225.25,13.775,73.0,1.0
50%,22.75,4.0,151.0,93.5,2803.5,15.5,76.0,1.0
75%,29.0,8.0,275.75,126.0,3614.75,17.025,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


In [49]:
example_df.head() # first 5 rows of the DataFrame

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [50]:
example_df.head(10) # first 10 rows

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190,3850,8.5,70,1,amc ambassador dpl


In [51]:
example_df.tail() # last 5 rows

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger
391,31.0,4,119.0,82,2720,19.4,82,1,chevy s-10


In [52]:
example_df.tail(10) # last 10 rows

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
382,26.0,4,156.0,92,2585,14.5,82,1,chrysler lebaron medallion
383,22.0,6,232.0,112,2835,14.7,82,1,ford granada l
384,32.0,4,144.0,96,2665,13.9,82,3,toyota celica gt
385,36.0,4,135.0,84,2370,13.0,82,1,dodge charger 2.2
386,27.0,4,151.0,90,2950,17.3,82,1,chevrolet camaro
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger
391,31.0,4,119.0,82,2720,19.4,82,1,chevy s-10


### Verifying Data Types

It's important to know how pandas will treat the data stored within a DataFrame and how it will read specific columns.  For example, pandas will try to automatically parse numbers as int or float, and it can parse dates as datetime objects when asked to (for example through the parse_dates parameter of read_csv).

Note: by default pandas stores column data using NumPy data types (dtypes)
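The difference between inferred and explicitly requested types can be sketched with read_csv. The CSV text and column names below are hypothetical; the dtype and parse_dates parameters are used to override what pandas would otherwise infer.

```python
import io
import pandas as pd

# Hypothetical CSV with an ID column, a date column and a numeric column
csv_text = """id,sold_on,price
007,2024-01-15,19.5
042,2024-02-01,20
"""

# Default inference: 'id' becomes an integer (leading zeros are lost), 'sold_on' stays text
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Explicit instructions: keep 'id' as text and parse 'sold_on' as datetime
explicit = pd.read_csv(io.StringIO(csv_text), dtype={'id': str}, parse_dates=['sold_on'])
print(explicit.dtypes)
```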

In [54]:
# View column names and their associated data types
example_df.dtypes

mpg       float64
cyl         int64
displ     float64
hp          int64
weight      int64
accel     float64
yr          int64
origin      int64
name       object
dtype: object

In [57]:
# Select columns where the data type is int64 using numpy
example_df.select_dtypes([np.int64])

Unnamed: 0,cyl,hp,weight,yr,origin
0,8,130,3504,70,1
1,8,165,3693,70,1
2,8,150,3436,70,1
3,8,150,3433,70,1
4,8,140,3449,70,1
...,...,...,...,...,...
387,4,86,2790,82,1
388,4,52,2130,82,2
389,4,84,2295,82,1
390,4,79,2625,82,1


In [58]:
# Select columns whose data type is either object (e.g. strings) or int64
example_df.select_dtypes(include=[object,np.int64])

Unnamed: 0,cyl,hp,weight,yr,origin,name
0,8,130,3504,70,1,chevrolet chevelle malibu
1,8,165,3693,70,1,buick skylark 320
2,8,150,3436,70,1,plymouth satellite
3,8,150,3433,70,1,amc rebel sst
4,8,140,3449,70,1,ford torino
...,...,...,...,...,...,...
387,4,86,2790,82,1,ford mustang gl
388,4,52,2130,82,2,vw pickup
389,4,84,2295,82,1,dodge rampage
390,4,79,2625,82,1,ford ranger


In [62]:
# You can change the data type of a column; work on a copy so example_df stays unchanged
example_df_copy = example_df.copy()

# convert 'mpg' from float to string
example_df_copy['mpg'] = example_df_copy['mpg'].astype(str)

example_df_copy['mpg'].unique()


array(['18.0', '15.0', '16.0', '17.0', '14.0', '24.0', '22.0', '21.0',
       '27.0', '26.0', '25.0', '10.0', '11.0', '9.0', '28.0', '19.0',
       '12.0', '13.0', '23.0', '30.0', '31.0', '35.0', '20.0', '29.0',
       '32.0', '33.0', '17.5', '15.5', '14.5', '22.5', '24.5', '18.5',
       '29.5', '26.5', '16.5', '31.5', '36.0', '25.5', '33.5', '20.5',
       '30.5', '21.5', '43.1', '36.1', '32.8', '39.4', '19.9', '19.4',
       '20.2', '19.2', '25.1', '20.6', '20.8', '18.6', '18.1', '17.7',
       '27.5', '27.2', '30.9', '21.1', '23.2', '23.8', '23.9', '20.3',
       '21.6', '16.2', '19.8', '22.3', '17.6', '18.2', '16.9', '31.9',
       '34.1', '35.7', '27.4', '25.4', '34.2', '34.5', '31.8', '37.3',
       '28.4', '28.8', '26.8', '41.5', '38.1', '32.1', '37.2', '26.4',
       '24.3', '19.1', '34.3', '29.8', '31.3', '37.0', '32.2', '46.6',
       '27.9', '40.8', '44.3', '43.4', '36.4', '44.6', '33.8', '32.7',
       '23.7', '32.4', '26.6', '25.8', '23.5', '39.1', '39.0', '35.1',
       

### Modifying DataFrames

Methods such as drop() return a new DataFrame by default, so a change only persists if you assign the result back to a variable or pass inplace=True, which tells pandas to modify the existing DataFrame without reassignment

In [63]:
# Change by assignment
print(example_df_copy.head(10))
example_df_copy = example_df_copy.drop('cyl', axis=1)
print(example_df_copy.head(10))

    mpg  cyl  displ   hp  weight  accel  yr  origin                       name
0  18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5  70       1          buick skylark 320
2  18.0    8  318.0  150    3436   11.0  70       1         plymouth satellite
3  16.0    8  304.0  150    3433   12.0  70       1              amc rebel sst
4  17.0    8  302.0  140    3449   10.5  70       1                ford torino
5  15.0    8  429.0  198    4341   10.0  70       1           ford galaxie 500
6  14.0    8  454.0  220    4354    9.0  70       1           chevrolet impala
7  14.0    8  440.0  215    4312    8.5  70       1          plymouth fury iii
8  14.0    8  455.0  225    4425   10.0  70       1           pontiac catalina
9  15.0    8  390.0  190    3850    8.5  70       1         amc ambassador dpl
    mpg  displ   hp  weight  accel  yr  origin                       name
0  18.0  307.0  130    3504   12.0  70       1  chevrolet

In [64]:
# Change in place
example_df_copy.drop('hp',axis=1,inplace=True)
print(example_df_copy.head(10))

    mpg  displ  weight  accel  yr  origin                       name
0  18.0  307.0    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0  350.0    3693   11.5  70       1          buick skylark 320
2  18.0  318.0    3436   11.0  70       1         plymouth satellite
3  16.0  304.0    3433   12.0  70       1              amc rebel sst
4  17.0  302.0    3449   10.5  70       1                ford torino
5  15.0  429.0    4341   10.0  70       1           ford galaxie 500
6  14.0  454.0    4354    9.0  70       1           chevrolet impala
7  14.0  440.0    4312    8.5  70       1          plymouth fury iii
8  14.0  455.0    4425   10.0  70       1           pontiac catalina
9  15.0  390.0    3850    8.5  70       1         amc ambassador dpl


What happens if we remove inplace=True from drop()?

In [66]:
print('Before: \n')
print(example_df_copy.head(10))

example_df_copy.drop('yr',axis=1)

print('After: \n')
print(example_df_copy.head(10))

Before: 

    mpg  displ  weight  accel  yr  origin                       name
0  18.0  307.0    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0  350.0    3693   11.5  70       1          buick skylark 320
2  18.0  318.0    3436   11.0  70       1         plymouth satellite
3  16.0  304.0    3433   12.0  70       1              amc rebel sst
4  17.0  302.0    3449   10.5  70       1                ford torino
5  15.0  429.0    4341   10.0  70       1           ford galaxie 500
6  14.0  454.0    4354    9.0  70       1           chevrolet impala
7  14.0  440.0    4312    8.5  70       1          plymouth fury iii
8  14.0  455.0    4425   10.0  70       1           pontiac catalina
9  15.0  390.0    3850    8.5  70       1         amc ambassador dpl
After: 

    mpg  displ  weight  accel  yr  origin                       name
0  18.0  307.0    3504   12.0  70       1  chevrolet chevelle malibu
1  15.0  350.0    3693   11.5  70       1          buick skylark 320
2  18.0  318.0 

In [67]:
example_df_copy.drop('yr',axis=1,inplace=True)
print(example_df_copy.head(10))

    mpg  displ  weight  accel  origin                       name
0  18.0  307.0    3504   12.0       1  chevrolet chevelle malibu
1  15.0  350.0    3693   11.5       1          buick skylark 320
2  18.0  318.0    3436   11.0       1         plymouth satellite
3  16.0  304.0    3433   12.0       1              amc rebel sst
4  17.0  302.0    3449   10.5       1                ford torino
5  15.0  429.0    4341   10.0       1           ford galaxie 500
6  14.0  454.0    4354    9.0       1           chevrolet impala
7  14.0  440.0    4312    8.5       1          plymouth fury iii
8  14.0  455.0    4425   10.0       1           pontiac catalina
9  15.0  390.0    3850    8.5       1         amc ambassador dpl
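The behaviour shown in the cells above can be condensed into a small sketch. The DataFrame here is made up for the example; the point is what drop() returns in each case.

```python
import pandas as pd

small = pd.DataFrame({'mpg': [18.0, 15.0], 'yr': [70, 70], 'origin': [1, 1]})

# Without inplace=True, drop() returns a new DataFrame and leaves `small` untouched
returned = small.drop('yr', axis=1)
print(list(small.columns))      # 'yr' is still here
print(list(returned.columns))   # 'yr' is gone in the returned DataFrame

# With inplace=True the method modifies `small` and returns None
result = small.drop('origin', axis=1, inplace=True)
print(result)
print(list(small.columns))
```

Because the in-place form returns None, writing `small = small.drop(..., inplace=True)` would replace the DataFrame with None, which is a common slip.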


## Working with Columns

In [68]:
# List Column names
example_df.columns

Index(['mpg', 'cyl', 'displ', 'hp', 'weight', 'accel', 'yr', 'origin', 'name'], dtype='object')

In [69]:
# Store Column names as a list - generally easier to work with
columns_list = list(example_df.columns)
print(columns_list)

['mpg', 'cyl', 'displ', 'hp', 'weight', 'accel', 'yr', 'origin', 'name']


You can rename several columns at once by passing rename() a dictionary that maps the old names to the new ones

In [70]:
# df2 = df.rename(columns={'mpg':'miles_per_gallon', 'cyl':'cylinders'})
# df2.head()
example_df_change_columns = example_df.rename(columns={'mpg':'miles/gallons','cyl':'cylinders'})
print(example_df_change_columns.head(10))

   miles/gallons  cylinders  displ   hp  weight  accel  yr  origin  \
0           18.0          8  307.0  130    3504   12.0  70       1   
1           15.0          8  350.0  165    3693   11.5  70       1   
2           18.0          8  318.0  150    3436   11.0  70       1   
3           16.0          8  304.0  150    3433   12.0  70       1   
4           17.0          8  302.0  140    3449   10.5  70       1   
5           15.0          8  429.0  198    4341   10.0  70       1   
6           14.0          8  454.0  220    4354    9.0  70       1   
7           14.0          8  440.0  215    4312    8.5  70       1   
8           14.0          8  455.0  225    4425   10.0  70       1   
9           15.0          8  390.0  190    3850    8.5  70       1   

                        name  
0  chevrolet chevelle malibu  
1          buick skylark 320  
2         plymouth satellite  
3              amc rebel sst  
4                ford torino  
5           ford galaxie 500  
6           

You can create new columns quite easily.  As with a dictionary, assigning to a column name that doesn't exist creates it automatically, while assigning to an existing name (such as 'yr' below) overwrites that column

In [71]:
example_df_change_allrows = example_df.copy()
# assigning a single value sets every row of the column to that value;
# 'yr' already exists, so it is overwritten rather than created
example_df_change_allrows['yr'] = 2020
print(example_df_change_allrows.head(10))

    mpg  cyl  displ   hp  weight  accel    yr  origin  \
0  18.0    8  307.0  130    3504   12.0  2020       1   
1  15.0    8  350.0  165    3693   11.5  2020       1   
2  18.0    8  318.0  150    3436   11.0  2020       1   
3  16.0    8  304.0  150    3433   12.0  2020       1   
4  17.0    8  302.0  140    3449   10.5  2020       1   
5  15.0    8  429.0  198    4341   10.0  2020       1   
6  14.0    8  454.0  220    4354    9.0  2020       1   
7  14.0    8  440.0  215    4312    8.5  2020       1   
8  14.0    8  455.0  225    4425   10.0  2020       1   
9  15.0    8  390.0  190    3850    8.5  2020       1   

                        name  
0  chevrolet chevelle malibu  
1          buick skylark 320  
2         plymouth satellite  
3              amc rebel sst  
4                ford torino  
5           ford galaxie 500  
6           chevrolet impala  
7          plymouth fury iii  
8           pontiac catalina  
9         amc ambassador dpl  
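New columns are more often computed from existing ones than set to a constant. A minimal sketch, assuming a made-up two-row DataFrame and hypothetical column names hp_per_1000lb and is_heavy:

```python
import pandas as pd

small = pd.DataFrame({'name': ['car_a', 'car_b'], 'hp': [130, 52], 'weight': [3504, 2130]})

# A new column name on the left-hand side creates the column, computed row by row
small['hp_per_1000lb'] = small['hp'] / small['weight'] * 1000

# assign() does the same kind of thing but returns a new DataFrame instead of modifying `small`
with_flag = small.assign(is_heavy=small['weight'] > 3000)

print(small)
print(with_flag)
```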


### Accessing Index and Columns

In [72]:
# By Column
example_df['name'].head()

0    chevrolet chevelle malibu
1            buick skylark 320
2           plymouth satellite
3                amc rebel sst
4                  ford torino
Name: name, dtype: object

In [73]:
# Alternatively - Note: This doesn't work if there are spaces in the column name!
example_df.name.head()

0    chevrolet chevelle malibu
1            buick skylark 320
2           plymouth satellite
3                amc rebel sst
4                  ford torino
Name: name, dtype: object

In [74]:
# By position - returns rows 2 and 3 (the end of the slice, 4, is excluded)
example_df.iloc[2:4]

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst


In [76]:
# By index and column
example_df.loc[[2],['name','mpg']]

Unnamed: 0,name,mpg
2,plymouth satellite,18.0


In [78]:
# By row label and column name - the name of the car in row 2
example_df.loc[2,'name']

'plymouth satellite'
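With the default 0, 1, 2, ... index, iloc (position) and loc (label) look interchangeable. The sketch below uses a made-up DataFrame whose index labels are 10, 20, 30 to show where they differ.

```python
import pandas as pd

# Made-up frame whose index labels are not 0, 1, 2, ...
small = pd.DataFrame(
    {'name': ['car_a', 'car_b', 'car_c'], 'mpg': [18.0, 15.0, 24.0]},
    index=[10, 20, 30],
)

by_position = small.iloc[0, 0]       # first row, first column, whatever the labels are
by_label = small.loc[10, 'name']     # row labelled 10, column 'name'

# Slices differ too: iloc excludes the end position, loc includes the end label
iloc_slice = small.iloc[0:2]
loc_slice = small.loc[10:30]

print(by_position, by_label)
print(len(iloc_slice), len(loc_slice))
```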

### Removing Duplicates

This is used frequently to reduce the size of a DataFrame.  Let's use the existing df and add some duplicates to it...

In [79]:
len(example_df)

392

In [82]:
# stack two copies of example_df so that every row appears twice
example_df_duplicated = pd.concat([example_df] * 2, ignore_index=True)
print(f"There are {len(example_df_duplicated)} rows in the DataFrame")

There are 784 rows in the DataFrame


In [84]:
#remove any duplicate rows
example_df_duplicated.drop_duplicates(inplace=True)
print(f"After removing duplicates, there are {len(example_df_duplicated)} rows in the DataFrame")

After removing duplicates, there are 392 rows in the DataFrame


In [86]:
# use subset to compare only selected columns: rows that share an 'mpg' value count as duplicates
example_df_duplicated = example_df_duplicated.drop_duplicates(subset=['mpg'])
print(f"After removing duplicates based on 'mpg', there are {len(example_df_duplicated)} rows in the DataFrame")

After removing duplicates based on 'mpg', there are 127 rows in the DataFrame
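Before dropping anything, it can help to see which rows pandas considers duplicates and to choose which occurrence survives. A small sketch on a made-up DataFrame, using duplicated() and the keep parameter of drop_duplicates():

```python
import pandas as pd

small = pd.DataFrame({
    'name': ['car_a', 'car_b', 'car_c', 'car_a'],
    'mpg': [18.0, 18.0, 24.0, 18.0],
})

# duplicated() marks rows that repeat an earlier row (all columns must match)
flags = small.duplicated().tolist()
print(flags)

# Compare only 'mpg' and keep the last occurrence of each value instead of the first
last_per_mpg = small.drop_duplicates(subset=['mpg'], keep='last')
print(last_per_mpg)
```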


### Filtering on Column Data

Pandas allows you to filter rows based on the values in specific columns

In [92]:
# Create a new DataFrame containing only the rows where 'cyl' == 6
example_df_6cyl = example_df.loc[example_df['cyl'] == 6]
print(example_df_6cyl.head(10))

     mpg  cyl  displ   hp  weight  accel  yr  origin  \
15  22.0    6  198.0   95    2833   15.5  70       1   
16  18.0    6  199.0   97    2774   15.5  70       1   
17  21.0    6  200.0   85    2587   16.0  70       1   
24  21.0    6  199.0   90    2648   15.0  70       1   
32  19.0    6  232.0  100    2634   13.0  71       1   
33  16.0    6  225.0  105    3439   15.5  71       1   
34  17.0    6  250.0  100    3329   15.5  71       1   
35  19.0    6  250.0   88    3302   15.5  71       1   
36  18.0    6  232.0  100    3288   15.5  71       1   
44  18.0    6  258.0  110    2962   13.5  71       1   

                          name  
15             plymouth duster  
16                  amc hornet  
17               ford maverick  
24                 amc gremlin  
32                 amc gremlin  
33   plymouth satellite custom  
34   chevrolet chevelle malibu  
35             ford torino 500  
36                 amc matador  
44  amc hornet sportabout (sw)  


In [94]:
# Reset the index so that it runs 0, 1, 2, ... again after filtering
# drop=True discards the old index instead of keeping it as a new column
example_df_6cyl = example_df.loc[example_df['cyl'] == 6].reset_index(drop=True)
example_df_6cyl.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,22.0,6,198.0,95,2833,15.5,70,1,plymouth duster
1,18.0,6,199.0,97,2774,15.5,70,1,amc hornet
2,21.0,6,200.0,85,2587,16.0,70,1,ford maverick
3,21.0,6,199.0,90,2648,15.0,70,1,amc gremlin
4,19.0,6,232.0,100,2634,13.0,71,1,amc gremlin


In [96]:
# This can also be done without .loc
example_df_mpggt20 = example_df[example_df['mpg'] > 20].reset_index(drop=True)
example_df_mpggt20.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,24.0,4,113.0,95,2372,15.0,70,3,toyota corona mark ii
1,22.0,6,198.0,95,2833,15.5,70,1,plymouth duster
2,21.0,6,200.0,85,2587,16.0,70,1,ford maverick
3,27.0,4,97.0,88,2130,14.5,70,3,datsun pl510
4,26.0,4,97.0,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan
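
Two other common ways to express a filter are `isin()` and `query()`. The sketch below uses a small hypothetical `cars` DataFrame rather than the autompg data:

```python
import pandas as pd

# Hypothetical toy data
cars = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d"],
    "cyl": [4, 6, 8, 4],
    "mpg": [30.0, 21.0, 14.0, 19.0],
})

# Plain boolean mask, no .loc needed
over_20 = cars[cars["mpg"] > 20]

# isin() keeps rows whose value appears in the given list
four_or_six = cars[cars["cyl"].isin([4, 6])]

# query() expresses the same kind of filter as a string
efficient_fours = cars.query("cyl == 4 and mpg > 20")

print(over_20["name"].tolist())
print(four_or_six["name"].tolist())
print(efficient_fours["name"].tolist())
```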


### Fill or Drop NaN or null values

Real-world datasets often contain poor or missing data. pandas has built-in functionality to handle this common scenario

In [98]:
# Check the size of the original DataFrame before adding rows
example_df.shape

(392, 9)

In [99]:
# First we'll build two incomplete rows to add to the DataFrame
df_append = pd.DataFrame([{'name':'Ford Taurus'},{'mpg':18.0}])
# The old approach below raises AttributeError in pandas 2.0 and later
# example_df_append = example_df.append(df_append, ignore_index=True)


AttributeError: 'DataFrame' object has no attribute 'append'

DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, which is why calling it raises AttributeError.

The recommended replacement is pd.concat:

In [100]:
# Pass every DataFrame to combine in a single list
example_df_append = pd.concat([example_df, df_append], ignore_index=True)
example_df_append.tail()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
389,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
390,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger
391,31.0,4.0,119.0,82.0,2720.0,19.4,82.0,1.0,chevy s-10
392,,,,,,,,,Ford Taurus
393,18.0,,,,,,,,


In [101]:
#Check for NaN values
example_df_append.loc[example_df_append['name'].isnull()]

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
393,18.0,,,,,,,,


In [104]:
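# Per column: True if the column contains at least one null value
example_df.isnull().any()
example_df_append.isnull().any()
# Per column: True only if every value in the column is null
# (only this last expression is displayed as the cell output)
example_df_append.isnull().all()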

mpg       False
cyl       False
displ     False
hp        False
weight    False
accel     False
yr        False
origin    False
name      False
dtype: bool

In [107]:
# Sum of all missing values by column
example_df_append.isnull().sum()

mpg       1
cyl       2
displ     2
hp        2
weight    2
accel     2
yr        2
origin    2
name      1
dtype: int64

In [108]:
# Sum of all missing values across all columns
example_df_append.isnull().sum().sum()

np.int64(16)

In [109]:
# Locate all rows that contain at least one missing value
example_df_append.loc[example_df_append.isnull().any(axis=1)]

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
392,,,,,,,,,Ford Taurus
393,18.0,,,,,,,,


In [112]:
# Fill NaN values
example_df_append_filled = example_df_append.fillna(0)
example_df_append_filled.tail()
# fillna returns a new DataFrame, so it can also be chained without storing the result
example_df_append.fillna(0).tail()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
389,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
390,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger
391,31.0,4.0,119.0,82.0,2720.0,19.4,82.0,1.0,chevy s-10
392,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Ford Taurus
393,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [113]:
# Drop NaN values
example_df_append.dropna().tail()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
387,27.0,4.0,140.0,86.0,2790.0,15.6,82.0,1.0,ford mustang gl
388,44.0,4.0,97.0,52.0,2130.0,24.6,82.0,2.0,vw pickup
389,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
390,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger
391,31.0,4.0,119.0,82.0,2720.0,19.4,82.0,1.0,chevy s-10


In [114]:
# You can also target a column
example_df_append['mpg'].fillna(0).tail()

389    32.0
390    28.0
391    31.0
392     0.0
393    18.0
Name: mpg, dtype: float64

In [117]:
# how='all' drops a row only if every column is NaN
example_df_append.dropna(how='all').tail()
# how='any' (the default) drops a row if at least one column is NaN
example_df_append.dropna(how='any').tail()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
387,27.0,4.0,140.0,86.0,2790.0,15.6,82.0,1.0,ford mustang gl
388,44.0,4.0,97.0,52.0,2130.0,24.6,82.0,2.0,vw pickup
389,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
390,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger
391,31.0,4.0,119.0,82.0,2720.0,19.4,82.0,1.0,chevy s-10


In [120]:
# Drop if specific columns are NaN
example_df_append.dropna(subset=['mpg','cyl']).tail()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
387,27.0,4.0,140.0,86.0,2790.0,15.6,82.0,1.0,ford mustang gl
388,44.0,4.0,97.0,52.0,2130.0,24.6,82.0,2.0,vw pickup
389,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
390,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger
391,31.0,4.0,119.0,82.0,2720.0,19.4,82.0,1.0,chevy s-10
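
Filling every column with 0 also writes 0 into text columns such as 'name', as the fillna(0) output above shows. One possible alternative, sketched here on a small hypothetical DataFrame, is to pass a dictionary so that each column gets its own replacement value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing name and one missing mpg
cars = pd.DataFrame({
    "name": ["car_a", None, "car_c"],
    "mpg": [20.0, np.nan, 30.0],
})

# Fill 'name' with a placeholder and 'mpg' with the column mean
# (mean() skips NaN by default, so the mean here is 25.0)
filled = cars.fillna({"name": "unknown", "mpg": cars["mpg"].mean()})

print(filled)
```

Whether a mean, a constant, or dropping the row is appropriate depends on the analysis; this only illustrates the mechanics.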


### Simple Operations on DataFrames

In [121]:
# All Unique values in column

example_df['mpg'].unique()

array([18. , 15. , 16. , 17. , 14. , 24. , 22. , 21. , 27. , 26. , 25. ,
       10. , 11. ,  9. , 28. , 19. , 12. , 13. , 23. , 30. , 31. , 35. ,
       20. , 29. , 32. , 33. , 17.5, 15.5, 14.5, 22.5, 24.5, 18.5, 29.5,
       26.5, 16.5, 31.5, 36. , 25.5, 33.5, 20.5, 30.5, 21.5, 43.1, 36.1,
       32.8, 39.4, 19.9, 19.4, 20.2, 19.2, 25.1, 20.6, 20.8, 18.6, 18.1,
       17.7, 27.5, 27.2, 30.9, 21.1, 23.2, 23.8, 23.9, 20.3, 21.6, 16.2,
       19.8, 22.3, 17.6, 18.2, 16.9, 31.9, 34.1, 35.7, 27.4, 25.4, 34.2,
       34.5, 31.8, 37.3, 28.4, 28.8, 26.8, 41.5, 38.1, 32.1, 37.2, 26.4,
       24.3, 19.1, 34.3, 29.8, 31.3, 37. , 32.2, 46.6, 27.9, 40.8, 44.3,
       43.4, 36.4, 44.6, 33.8, 32.7, 23.7, 32.4, 26.6, 25.8, 23.5, 39.1,
       39. , 35.1, 32.3, 37.7, 34.7, 34.4, 29.9, 33.7, 32.9, 31.6, 28.1,
       30.7, 24.2, 22.4, 34. , 38. , 44. ])

In [122]:
# Count of occurrences of each unique value in a column
example_df['cyl'].value_counts()

cyl
4    199
8    103
6     83
3      4
5      3
Name: count, dtype: int64

In [123]:
# Count all the non-null entries in a column
example_df['cyl'].count()

np.int64(392)

In [124]:
# Sum all the entries in a column
example_df['hp'].sum()

np.int64(40952)

In [125]:
# Mean of all the values in a column
example_df['mpg'].mean()

np.float64(23.445918367346938)

In [126]:
# Median of all the values in a column
example_df['mpg'].median()

22.75

In [127]:
# Minimum of all column values
example_df['mpg'].min()

9.0

In [128]:
# Maximum of all column values
example_df['mpg'].max()

46.6

In [130]:
# Standard Deviation
example_df['mpg'].std()

7.805007486571799
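
Several of these statistics can be computed in one call with `agg`. A minimal sketch on a short hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy values
mpg = pd.Series([18.0, 15.0, 24.0, 22.0], name="mpg")

# agg() with a list of function names returns one result per name
summary = mpg.agg(["count", "mean", "median", "min", "max", "std"])

print(summary)
```

Note that `std()` in pandas computes the sample standard deviation (it divides by n - 1) by default.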

# Pandas (Advanced)

This is a continuation of the LO3 - Pandas material.  It is likely out of the scope of the Programming for Data Science course and can be seen as optional.  It contains advanced usage of pandas.

## Sorting Columns

In [131]:
example_df.sort_values('mpg', ascending=False).head(10)

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
320,46.6,4,86.0,65,2110,17.9,80,3,mazda glc
327,44.6,4,91.0,67,1850,13.8,80,3,honda civic 1500 gl
323,44.3,4,90.0,48,2085,21.7,80,2,vw rabbit c (diesel)
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
324,43.4,4,90.0,48,2335,23.7,80,2,vw dasher (diesel)
242,43.1,4,90.0,48,1985,21.5,78,2,volkswagen rabbit custom diesel
307,41.5,4,98.0,76,2144,14.7,80,2,vw rabbit
322,40.8,4,85.0,65,2110,19.2,80,3,datsun 210
245,39.4,4,85.0,70,2070,18.6,78,3,datsun b210 gx
339,39.1,4,79.0,58,1755,16.9,81,3,toyota starlet


In [132]:
# Multi-column sort
example_df.sort_values(['mpg','cyl'],ascending=False).head(10)

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
320,46.6,4,86.0,65,2110,17.9,80,3,mazda glc
327,44.6,4,91.0,67,1850,13.8,80,3,honda civic 1500 gl
323,44.3,4,90.0,48,2085,21.7,80,2,vw rabbit c (diesel)
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
324,43.4,4,90.0,48,2335,23.7,80,2,vw dasher (diesel)
242,43.1,4,90.0,48,1985,21.5,78,2,volkswagen rabbit custom diesel
307,41.5,4,98.0,76,2144,14.7,80,2,vw rabbit
322,40.8,4,85.0,65,2110,19.2,80,3,datsun 210
245,39.4,4,85.0,70,2070,18.6,78,3,datsun b210 gx
339,39.1,4,79.0,58,1755,16.9,81,3,toyota starlet
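
The sort direction can also be set per column by passing a list to `ascending`. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
cars = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d"],
    "cyl": [4, 8, 4, 6],
    "mpg": [30.0, 14.0, 35.0, 21.0],
})

# Sort by 'cyl' ascending, then by 'mpg' descending within each 'cyl' group
ordered = cars.sort_values(["cyl", "mpg"], ascending=[True, False])

print(ordered)
```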


## Merging DataFrames

While many of these are similar, there are numerous arguments that can be used in conjunction to customize the type of DataFrame merging/joining/appending/concatenating you are trying to achieve.

We will define some sample data frames to use as examples of the various operations...

In [134]:
# Sample DataFrames
df_a = pd.DataFrame([[1,2,3],[4,5,6]],columns=list('ABC'))
df_b = pd.DataFrame([[7,8,9],[10,11,12]],columns=list('DEF'))
df_c = pd.DataFrame([[13,14,15],[16,17,18]],columns=list('GHI'))

In [136]:
df_a

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6


In [137]:
df_b

Unnamed: 0,D,E,F
0,7,8,9
1,10,11,12


In [138]:
df_c

Unnamed: 0,G,H,I
0,13,14,15
1,16,17,18


### Concatenating DataFrames

In [141]:
import pandas as pd

# Rows are stacked; columns that exist in only one DataFrame are filled with NaN
df_ab = pd.concat([df_a, df_b], ignore_index=True)
df_ab

Unnamed: 0,A,B,C,D,E,F
0,1.0,2.0,3.0,,,
1,4.0,5.0,6.0,,,
2,,,,7.0,8.0,9.0
3,,,,10.0,11.0,12.0


In [143]:
df_abc = pd.concat([df_a,df_b,df_c], sort=False)
df_abc

Unnamed: 0,A,B,C,D,E,F,G,H,I
0,1.0,2.0,3.0,,,,,,
1,4.0,5.0,6.0,,,,,,
0,,,,7.0,8.0,9.0,,,
1,,,,10.0,11.0,12.0,,,
0,,,,,,,13.0,14.0,15.0
1,,,,,,,16.0,17.0,18.0
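
The examples above stack rows, which is why the unmatched columns fill with NaN. Passing `axis=1` places the DataFrames side by side instead, aligned on the index. A short sketch that recreates two of the sample frames:

```python
import pandas as pd

# Same shapes as the sample DataFrames defined earlier
df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list("ABC"))
df_b = pd.DataFrame([[7, 8, 9], [10, 11, 12]], columns=list("DEF"))

# axis=1 concatenates column-wise; rows are matched by index label
side_by_side = pd.concat([df_a, df_b], axis=1)

print(side_by_side)
```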


### Joining DataFrames

Joins are similar to the concept in SQL. By default, join matches rows on the index

In [147]:
df_a = pd.DataFrame([[1,2,3],[4,5,6]],columns=list('ABC'))
df_a

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6


In [148]:
df_b = pd.DataFrame([[7,8,9],[10,11,12]],columns=list('BCD'))
df_b

Unnamed: 0,B,C,D
0,7,8,9
1,10,11,12


In [149]:
# how: 'left', 'right', 'outer', 'inner' (default is 'left')
# lsuffix and rsuffix are used to append to overlapping column names
df_joined = df_a.join(df_b,how='left',lsuffix='_a',rsuffix='_b')
df_joined


Unnamed: 0,A,B_a,C_a,B_b,C_b,D
0,1,2,3,7,8,9
1,4,5,6,10,11,12


In [151]:
import pandas as pd

# Example DataFrames
a = pd.DataFrame({'Name': ['Alice', 'Bob'],
                  'Age': [25, 30]})

b = pd.DataFrame({'Name': ['Bob', 'Charlie'],
                  'Salary': [50000, 60000]})

# Join DataFrames a and b based on the 'Name' column with a left join
joined_df = a.join(b.set_index('Name'), on='Name', how='left', lsuffix='_a', rsuffix='_b')

# Print the result
print(joined_df)


    Name  Age   Salary
0  Alice   25      NaN
1    Bob   30  50000.0


For more detail, see "Merge, join, concatenate and compare" in the pandas user guide, which covers the various methods for combining and comparing Series or DataFrame objects.

https://pandas.pydata.org/docs/user_guide/merging.html

### Merging DataFrames

This allows you to merge two DataFrames on key columns - similar to join. In the example below:

1. This merges DataFrame a with DataFrame b.
2. The parameter left_on='B' means that column 'B' from DataFrame a is used as the key for the merge.
3. The parameter right_on='D' means that column 'D' from DataFrame b is used as the key for the merge.
4. Rows from a and b are matched when the values in column 'B' from a match the values in column 'D' from b.
5. The result is a new DataFrame (merged_df) containing the matched rows, and columns from both DataFrames are included. Because both DataFrames have a column named 'B', pandas adds the default suffixes _x and _y to tell them apart.

In [108]:
a = pd.DataFrame([[1,2,3],[3,4,5]], columns=list('ABC'))
b = pd.DataFrame([[5,2,3],[7,4,5]], columns=list('BDE'))
c = pd.DataFrame([[11,12,13], [17,14,15]], columns=list('XYZ'))

merged_df = a.merge(b, left_on='B', right_on='D')
print(merged_df)

   A  B_x  C  B_y  D  E
0  1    2  3    5  2  3
1  3    4  5    7  4  5
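
The `how` argument controls which unmatched rows are kept. A minimal sketch, reusing the Name/Age/Salary data from the join example under the hypothetical names `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
right = pd.DataFrame({"Name": ["Bob", "Charlie"], "Salary": [50000, 60000]})

# Inner merge (the default): only keys present in both DataFrames
inner = left.merge(right, on="Name", how="inner")

# Outer merge: all keys from both sides, with NaN where data is missing
# indicator=True adds a '_merge' column recording where each row came from
outer = left.merge(right, on="Name", how="outer", indicator=True)

print(inner)
print(outer)
```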


### Iterating DataFrames

1. Iterating is typically only viable on a small DataFrame.  For larger DataFrames you will generally need to use apply/map and functions for efficiency

2. iterrows() yields one (index, Series) tuple per row, so values are accessed by position in that tuple

    row[0] = Index

    row[1] = Values as pandas Series
    
    row[1].iloc[0] = First column value of row; you can also specify the column by name with row[1]['Column']
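
The tuple structure described above can be inspected directly. A small sketch, assuming a hypothetical two-row DataFrame with string index labels so the index is easy to spot:

```python
import pandas as pd

# Hypothetical toy data with a non-default index
cars = pd.DataFrame(
    {"name": ["Toyota", "Honda"], "mpg": [30, 28]},
    index=["x", "y"],
)

# Take only the first (index, Series) tuple from iterrows()
first = next(cars.iterrows())

index_label = first[0]             # the row's index label
row_values = first[1]              # the row's values as a Series
by_position = row_values.iloc[0]   # first column value, by position
by_label = row_values["mpg"]       # a column value, by name

print(index_label, by_position, by_label)
```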

In [110]:
import pandas as pd

# Sample DataFrame
df5 = pd.DataFrame({
    "name": ["Toyota", "Honda", "Ford"],
    "mpg": [30, 28, 22]
})

# Iterate over rows with iterrows()
for index, row in df5.iterrows():
    print(f"Row {index}: {row['name']} has {row['mpg']} mpg")


Row 0: Toyota has 30 mpg
Row 1: Honda has 28 mpg
Row 2: Ford has 22 mpg


### itertuples
A faster and more efficient way to iterate a DataFrame


In [111]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    "name": ["Toyota", "Honda", "Ford"],
    "mpg": [30, 28, 22]
})

# Iterate with itertuples()
for row in df.itertuples(index=True, name="Car"):
    print(f"Row {row.Index}: {row.name} has {row.mpg} mpg")



Row 0: Toyota has 30 mpg
Row 1: Honda has 28 mpg
Row 2: Ford has 22 mpg


### Why itertuples() is better

Much faster than iterrows() (uses Python tuples instead of pandas Series).

More memory-efficient.

Keeps column access by attribute (row.mpg, row.name).
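
Another difference is that iterrows() does not preserve dtypes across a row, because each row is packed into a single Series. The sketch below illustrates this with a hypothetical DataFrame that mixes an integer and a float column:

```python
import pandas as pd

# Hypothetical toy data: 'cyl' is integer, 'mpg' is float
df_mixed = pd.DataFrame({"cyl": [4, 8], "mpg": [30.5, 14.0]})

# iterrows(): the row becomes one float64 Series, so 'cyl' is upcast to float
_, series_row = next(df_mixed.iterrows())

# itertuples(): each field keeps the type of its own column
tuple_row = next(df_mixed.itertuples(index=False))

print(series_row.dtype, type(series_row["cyl"]))
print(type(tuple_row.cyl), type(tuple_row.mpg))
```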

### Pivoting on a DataFrame

You can create Excel style pivot tables based on specified criteria

  pivot_table = df.pivot_table(values='column_to_aggregate',
                              index='column_to_group_by',
                              columns='column_to_use_as_columns',
                              aggfunc='function_to_apply')

In [112]:
from bokeh.sampledata.autompg import autompg as df
# Basic Pivot
#print(df)
df.pivot_table(index=['mpg', 'name']).head(20)

mpg,name,accel,cyl,displ,hp,origin,weight,yr
9.0,hi 1200d,18.5,8.0,304.0,193.0,1.0,4732.0,70.0
10.0,chevy c20,15.0,8.0,307.0,200.0,1.0,4376.0,70.0
10.0,ford f250,14.0,8.0,360.0,215.0,1.0,4615.0,70.0
11.0,chevrolet impala,14.0,8.0,400.0,150.0,1.0,4997.0,73.0
11.0,dodge d200,13.5,8.0,318.0,210.0,1.0,4382.0,70.0
11.0,mercury marquis,11.0,8.0,429.0,208.0,1.0,4633.0,72.0
11.0,oldsmobile omega,11.0,8.0,350.0,180.0,1.0,3664.0,73.0
12.0,buick electra 225 custom,11.0,8.0,455.0,225.0,1.0,4951.0,73.0
12.0,dodge monaco (sw),11.5,8.0,383.0,180.0,1.0,4955.0,71.0
12.0,ford country,12.5,8.0,400.0,167.0,1.0,4906.0,73.0


In [113]:
# Create a pivot table to calculate average weight by cylinders and car names
pivot_table_result = df.pivot_table(values=['weight'], index=['cyl', 'name'], aggfunc='mean').head(20)

# Display the resulting pivot table
print(pivot_table_result)


                               weight
cyl name                             
3   maxda rx3                  2124.0
    mazda rx-4                 2720.0
    mazda rx-7 gs              2420.0
    mazda rx2 coupe            2330.0
4   amc concord                3003.0
    amc spirit dl              2670.0
    audi 100 ls                2430.0
    audi 100ls                 2638.0
    audi 4000                  2188.0
    audi fox                   2219.0
    bmw 2002                   2234.0
    bmw 320i                   2600.0
    buick opel isuzu deluxe    2155.0
    buick skylark              2635.0
    buick skylark limited      2670.0
    capri ii                   2572.0
    chevrolet camaro           2950.0
    chevrolet cavalier         2605.0
    chevrolet cavalier 2-door  2395.0
    chevrolet cavalier wagon   2640.0
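
Neither pivot above uses the `columns` argument from the template. A minimal sketch of a pivot that spreads one variable across the columns, using a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical toy data
cars = pd.DataFrame({
    "cyl": [4, 4, 8, 8],
    "yr": [70, 71, 70, 71],
    "mpg": [24.0, 27.0, 15.0, 14.0],
})

# One row per 'cyl', one column per 'yr', each cell holding the mean 'mpg'
mpg_by_cyl_and_year = cars.pivot_table(
    values="mpg", index="cyl", columns="yr", aggfunc="mean"
)

print(mpg_by_cyl_and_year)
```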


### Boolean Indexing

Filter DataFrame on Multiple Columns and Values using Boolean index

In [114]:
df.loc[(df['cyl'] < 6) &
       (df['mpg'] > 35)].head(20)

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
216,36.0,4,79.0,58,1825,18.6,77,2,renault 5 gtl
242,43.1,4,90.0,48,1985,21.5,78,2,volkswagen rabbit custom diesel
243,36.1,4,98.0,66,1800,14.4,78,1,ford fiesta
245,39.4,4,85.0,70,2070,18.6,78,3,datsun b210 gx
246,36.1,4,91.0,60,1800,16.4,78,3,honda civic cvcc
293,35.7,4,98.0,80,1915,14.4,79,1,dodge colt hatchback custom
302,37.3,4,91.0,69,2130,14.7,79,2,fiat strada custom
307,41.5,4,98.0,76,2144,14.7,80,2,vw rabbit
308,38.1,4,89.0,60,1968,18.8,80,3,toyota corolla tercel
310,37.2,4,86.0,65,2019,16.4,80,3,datsun 310
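
Conditions can also be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
cars = pd.DataFrame({
    "name": ["car_a", "car_b", "car_c", "car_d"],
    "cyl": [4, 6, 8, 4],
    "mpg": [36.0, 21.0, 14.0, 19.0],
})

# Rows where either condition holds
either = cars[(cars["cyl"] == 8) | (cars["mpg"] > 35)]

# Rows where the condition does NOT hold
not_four = cars[~(cars["cyl"] == 4)]

print(either["name"].tolist())
print(not_four["name"].tolist())
```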


### Crosstab Viewing

A contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables

In [115]:
pd.crosstab(df['cyl'], df['yr'], margins=True)

cyl \ yr,70,71,72,73,74,75,76,77,78,79,80,81,82,All
3,0,0,1,1,0,0,0,1,0,0,1,0,0,4
4,7,12,14,11,15,12,15,14,17,12,23,20,27,199
5,0,0,0,0,0,0,0,0,1,1,1,0,0,3
6,4,8,0,8,6,12,10,5,12,6,2,7,3,83
8,18,7,13,20,5,6,9,8,6,10,0,1,0,103
All,29,27,28,40,26,30,34,28,36,29,27,28,30,392
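
Counts can be turned into proportions with the `normalize` argument. A small sketch, assuming a hypothetical four-row DataFrame:

```python
import pandas as pd

# Hypothetical toy data
cars = pd.DataFrame({"cyl": [4, 4, 4, 8], "yr": [70, 70, 71, 71]})

# Raw counts for each (cyl, yr) combination
counts = pd.crosstab(cars["cyl"], cars["yr"])

# normalize='index' divides each row by its row total
row_share = pd.crosstab(cars["cyl"], cars["yr"], normalize="index")

print(counts)
print(row_share)
```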


### An example of how complex things can get...

In [116]:
# Take the N most common 'cyl' values with .nlargest, where N is the mean number of occurrences per unique 'mpg' value, rounded up

df.cyl.value_counts().nlargest(math.ceil(df.mpg.value_counts().mean())).head()

cyl,count
4,199
8,103
6,83
3,4


### Creating a new column using logic

In [118]:
df2 = df.copy()
df2['mpg_str'] = df2['name'] + ' has MPG ' + df2['mpg'].astype(str)
df2.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name,mpg_str
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,chevrolet chevelle malibu has MPG 18.0
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,buick skylark 320 has MPG 15.0
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,plymouth satellite has MPG 18.0
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,amc rebel sst has MPG 16.0
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,ford torino has MPG 17.0


### Functions on DataFrames

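Applying a function is usually more efficient and more readable than an explicit Python loop over a DataFrame, so it should be preferred most of the time when 'iterating' or doing analytics. Built-in vectorized operations (for example df['a'] + df['b']) are faster still where they exist.

axis = 0 means the function will be applied to each column
axis = 1 means the function will be applied to each row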
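
The effect of the axis argument is easiest to see on a tiny example. A sketch with a hypothetical `scores` DataFrame:

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column as a Series
per_column = scores.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
per_row = scores.apply(lambda row: row["a"] + row["b"], axis=1)

print(per_column)
print(per_row)
```

The axis=0 result has one value per column, while the axis=1 result has one value per row.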

#### Map

Map applies a function to each element in a Series - sounds like iterating, yes?

In [119]:
def concon(x):
    return 'Adding this string to all values: ' +  str(x)

df['name'].map(concon).head()

Unnamed: 0,name
0,Adding this string to all values: chevrolet ch...
1,Adding this string to all values: buick skylar...
2,Adding this string to all values: plymouth sat...
3,Adding this string to all values: amc rebel sst
4,Adding this string to all values: ford torino
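
`map` also accepts a dictionary, which makes it handy for recoding values. The sketch below uses made-up labels (`region_1` and so on are hypothetical, not the real meaning of the 'origin' codes):

```python
import pandas as pd

# Hypothetical codes and a hypothetical lookup table
origin = pd.Series([1, 3, 2, 1])
labels = {1: "region_1", 2: "region_2", 3: "region_3"}

# Each value is replaced by its dictionary entry
origin_labels = origin.map(labels)

# Values with no dictionary entry become NaN
unmatched = pd.Series([1, 9]).map(labels)

print(origin_labels.tolist())
print(unmatched.isnull().tolist())
```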


#### Apply

Apply runs a function against the axis specified

In [120]:
df2['hp_and_mpg'] = df2[['hp', 'mpg']].apply(sum, axis=1)
df2.loc[:, ['hp', 'mpg', 'hp_and_mpg', 'name']].head()

Unnamed: 0,hp,mpg,hp_and_mpg,name
0,130,18.0,148.0,chevrolet chevelle malibu
1,165,15.0,180.0,buick skylark 320
2,150,18.0,168.0,plymouth satellite
3,150,16.0,166.0,amc rebel sst
4,140,17.0,157.0,ford torino


In [None]:
df

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


### In Class Exercise

Use apply to find the number of missing values in each column and each row. Save screenshot of code and output as pandasapply.jpeg

In [121]:
# Remove this before distributing to students...

def missing(x):
    # isnull() marks each missing value as True; sum() counts the True values
    return x.isnull().sum()

#columns
df.apply(missing, axis=0)


Unnamed: 0,0
mpg,0
cyl,0
displ,0
hp,0
weight,0
accel,0
yr,0
origin,0
name,0


In [122]:
#rows
df.apply(missing, axis=1)

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
387,0
388,0
389,0
390,0


### Conditionally Updating Values

You can use .loc to update values where a condition has been met.  Think "UPDATE ... SET ... WHERE" in SQL

In [123]:
df2 = df.copy()
df2['efficiency'] = ""

df2.loc[(df2.mpg < 10), 'efficiency'] = 'poor'
df2.loc[(df2.mpg >= 10) & (df2.mpg < 30), 'efficiency'] = 'intermediate'
df2.loc[(df2.mpg >= 30), 'efficiency'] = 'high'

df2.tail()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name,efficiency
387,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl,intermediate
388,44.0,4,97.0,52,2130,24.6,82,2,vw pickup,high
389,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage,high
390,28.0,4,120.0,79,2625,18.6,82,1,ford ranger,intermediate
391,31.0,4,119.0,82,2720,19.4,82,1,chevy s-10,high
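
When the conditions are contiguous numeric ranges, `pd.cut` is one possible alternative to several .loc assignments. A sketch that assumes the same thresholds as above (below 10, 10 up to 30, 30 and above) on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical toy data chosen to hit the bin edges
cars = pd.DataFrame({"mpg": [9.0, 10.0, 29.9, 30.0, 44.0]})

# right=False makes each bin include its left edge and exclude its right edge,
# giving the intervals [-inf, 10), [10, 30) and [30, inf)
cars["efficiency"] = pd.cut(
    cars["mpg"],
    bins=[float("-inf"), 10, 30, float("inf")],
    labels=["poor", "intermediate", "high"],
    right=False,
)

print(cars)
```

The resulting column is categorical rather than plain strings, which may or may not suit later processing.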


### GroupBy and Aggregate

Pandas makes it pretty simple to group values and aggregate other results

In [124]:
# Group by 'cyl' and set as_index to False to keep grouped values as columns
grouped_df = df.groupby(by=['cyl'], as_index=False)

# Use .agg to aggregate the values with specified functions as strings
aggregated = grouped_df.agg({
    'mpg': 'mean',
    'displ': 'mean',
    'hp': 'mean',
    'yr': 'max',
    'accel': 'mean'
})

# Display the aggregated DataFrame
print(aggregated.head())

   cyl        mpg       displ          hp  yr      accel
0    3  20.550000   72.500000   99.250000  80  13.250000
1    4  29.283920  109.670854   78.281407  82  16.581910
2    5  27.366667  145.000000   82.333333  80  18.633333
3    6  19.973494  218.361446  101.506024  82  16.254217
4    8  14.963107  345.009709  158.300971  81  12.955340
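
With the dictionary form the output columns keep their original names, so 'mpg' now silently means "mean mpg". Named aggregation lets you choose the output names. A minimal sketch on hypothetical data (the output column names are illustrative):

```python
import pandas as pd

# Hypothetical toy data
cars = pd.DataFrame({
    "cyl": [4, 4, 8, 8],
    "mpg": [30.0, 20.0, 15.0, 13.0],
    "yr": [70, 82, 71, 73],
})

# Each keyword is an output column: name=(input column, function)
summary = cars.groupby("cyl", as_index=False).agg(
    mean_mpg=("mpg", "mean"),
    latest_yr=("yr", "max"),
    n_cars=("mpg", "count"),
)

print(summary)
```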


In Class Exercise: Combine ser1 and ser2 to form a dataframe.  Save your code and output as a screenshot named pandasseries1.jpeg

In [125]:
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

# Solution 1
df = pd.concat([ser1, ser2], axis=1)

# Solution 2
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())

  col1  col2
0    a     0
1    b     1
2    c     2
3    e     3
4    d     4


In Class Exercise: Create a pandas Series from each of the items below: a list, a NumPy array, and a dictionary.  Save your code and output as a screenshot called pandasseries2.jpeg

In [126]:
# Inputs
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

# Solution
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())

a    0
b    1
c    2
e    3
d    4
dtype: int64
