# Basic data tasks using Python and pandas
This tutorial will show you how to do the following:

* [Loading libraries](#Loading-libraries)
* [Loading data](#Loading-data)
* [Preview data](#Preview-data)
* [Clean up column names](#Clean-up-column-names)
* [Dropping columns](#Dropping-columns)
* [Selecting columns](#Selecting-columns)
* [Limit the data by one factor](#Limit-the-data-by-one-factor)
* [Limit the data by two factors](#Limit-the-data-by-two-factors)
* [Sorting](#Sorting)
* [Find unique values](#Find-unique-values)
* [Summarize the data](#Summarize-the-data)
* [Summarize groups of the data](#Summarize-groups-of-the-data)
* [Transforming and adding columns](#Transforming-and-adding-columns)
* [Putting it all together](#Putting-it-all-together)
* [Pivoting the data (and more)](#Pivoting-the-data-(and-more))
* [Saving your data](#Saving-your-data)
* [Joining data](#Joining-data)

Run this code in the console to generate the index if it changes...

```copy([...document.querySelectorAll("h2")].map(d => `* [${d.textContent.replace("¶","")}](#${d.querySelector("a").href.split("#")[1]})`).join("\n"))```

## Loading libraries
These are software packages designed for data analysis in Python. They make working with data much easier.

In [1]:
import pandas as pd
import numpy as np

## Loading data

We're using data from the USDA's Agricultural Marketing Service on [advertised retail food prices](https://marketnews.usda.gov/mnp/fv-report-retail?category=retail&portal=fv&startIndex=1&class=ALL&region=NATIONAL&organic=ALL&commodity=ALL&reportConfig=true&dr=1&repType=wiz&step2=true&run=Run&type=retail&commodityClass=allcommodity)

In [2]:
# load a csv and store it as the variable df
df = pd.read_csv("usda-ams-retail-fruit-and-vegetables_201809-201909.csv")

## Preview data
This is how you can look at a couple of rows of data.

In [3]:
# show the first 5 rows
df.head()

Unnamed: 0,Date,Region,Class,Commodity,Variety,Organic,Environment,Unit,Number of Stores,Weighted Avg Price,Low Price,High Price,% Marked Local
0,09/07/2018,NATIONAL,FRUITS,APPLE PEARS,,,,each,264,1.49,,,
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,,per pound,53,1.11,,,
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,2 lb bag,55,1.99,,,
3,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,3 lb bag,593,3.49,,,
4,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,per pound,1349,1.11,,,


In [4]:
# Show the first 3 rows
df.head(3)

Unnamed: 0,Date,Region,Class,Commodity,Variety,Organic,Environment,Unit,Number of Stores,Weighted Avg Price,Low Price,High Price,% Marked Local
0,09/07/2018,NATIONAL,FRUITS,APPLE PEARS,,,,each,264,1.49,,,
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,,per pound,53,1.11,,,
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,2 lb bag,55,1.99,,,


In [5]:
# show rows 10, 11, and 12
df[9:12]

Unnamed: 0,Date,Region,Class,Commodity,Variety,Organic,Environment,Unit,Number of Stores,Weighted Avg Price,Low Price,High Price,% Marked Local
9,09/07/2018,NATIONAL,FRUITS,APPLES,GALA,,,per pound,7343,1.3,,,
10,09/07/2018,NATIONAL,FRUITS,APPLES,GALA,Y,,2 lb bag,1330,3.9,,,
11,09/07/2018,NATIONAL,FRUITS,APPLES,GALA,Y,,3 lb bag,311,4.35,,,
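
As a rough sketch of how the slice above relates to other ways of previewing rows, the example below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the USDA data). Plain slicing and `.iloc` both count rows by position, starting at zero.

```python
import pandas as pd

# a tiny stand-in frame (hypothetical values, not the USDA data)
toy = pd.DataFrame({"commodity": list("abcdefghijklmno"), "price": range(15)})

# plain slicing and .iloc both select rows by position
by_slice = toy[9:12]
by_iloc = toy.iloc[9:12]
print(by_slice.equals(by_iloc))

# show the last 3 rows
print(toy.tail(3))

# .iloc can also limit the columns by position at the same time
print(toy.iloc[9:12, [0]])
```

`.iloc` is the more explicit spelling, which can help when the index labels are not simple row numbers.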


## Clean up column names

Notice the outermost parens: they let you format your piped/chained methods across multiple lines in Python.

In [6]:
df_clean_headers = (df
     .rename(columns=lambda x: x.replace("% ", ""))
     .rename(columns=str.lower)
     .rename(columns=lambda x: x.replace(" ", "_"))
)

df_clean_headers.head()

Unnamed: 0,date,region,class,commodity,variety,organic,environment,unit,number_of_stores,weighted_avg_price,low_price,high_price,marked_local
0,09/07/2018,NATIONAL,FRUITS,APPLE PEARS,,,,each,264,1.49,,,
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,,per pound,53,1.11,,,
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,2 lb bag,55,1.99,,,
3,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,3 lb bag,593,3.49,,,
4,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,per pound,1349,1.11,,,
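
One possible alternative to chaining `rename` is to apply the same three steps to the column index with its vectorized string methods. This is only a sketch on an empty frame (`raw` and `cleaned` are hypothetical names) that carries a few of the column names from above.

```python
import pandas as pd

# an empty frame with a few of the original column names
raw = pd.DataFrame(columns=["Date", "Number of Stores", "Weighted Avg Price", "% Marked Local"])

# the same three steps, applied to the column index in one go
cleaned = raw.copy()
cleaned.columns = (raw.columns
    .str.replace("% ", "", regex=False)
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(list(cleaned.columns))

# the chained rename version from above, for comparison
renamed = (raw
    .rename(columns=lambda x: x.replace("% ", ""))
    .rename(columns=str.lower)
    .rename(columns=lambda x: x.replace(" ", "_"))
)
print(list(renamed.columns) == list(cleaned.columns))
```

The `rename` version fits more naturally in a method chain, while the `.columns.str` version is shorter when you are assigning in place anyway.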


## Dropping columns

In [7]:
cols_to_drop = [
    "environment",
    "low_price", 
    "high_price", 
    "marked_local"
]

df_clean_headers.drop(cols_to_drop, axis=1).head()

Unnamed: 0,date,region,class,commodity,variety,organic,unit,number_of_stores,weighted_avg_price
0,09/07/2018,NATIONAL,FRUITS,APPLE PEARS,,,each,264,1.49
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,per pound,53,1.11
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,2 lb bag,55,1.99
3,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,3 lb bag,593,3.49
4,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,per pound,1349,1.11
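
A short sketch of two details of `drop` that are easy to miss, using a one-row made-up frame (`toy` is a hypothetical name): `axis=1` and `columns=` are interchangeable, and `drop` returns a new frame rather than changing the original.

```python
import pandas as pd

# a one-row stand-in frame (hypothetical values)
toy = pd.DataFrame({
    "date": ["09/07/2018"],
    "low_price": [None],
    "high_price": [None],
    "unit": ["each"],
})

# axis=1 and columns= are two spellings of the same request
a = toy.drop(["low_price", "high_price"], axis=1)
b = toy.drop(columns=["low_price", "high_price"])
print(a.equals(b))

# drop returns a new frame; the original keeps all of its columns
print(list(toy.columns))

# errors="ignore" skips names that are not present instead of raising a KeyError
c = toy.drop(columns=["low_price", "not_a_column"], errors="ignore")
print(list(c.columns))
```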


## Selecting columns

In [8]:
cols_to_select = [
    "date",
    "variety",
    "organic",
    "unit",
    "number_of_stores",
    "weighted_avg_price"
]
df_clean_headers[cols_to_select].head()

Unnamed: 0,date,variety,organic,unit,number_of_stores,weighted_avg_price
0,09/07/2018,,,each,264,1.49
1,09/07/2018,BRAEBURN,,per pound,53,1.11
2,09/07/2018,FUJI,,2 lb bag,55,1.99
3,09/07/2018,FUJI,,3 lb bag,593,3.49
4,09/07/2018,FUJI,,per pound,1349,1.11


## Limit the data by one factor

In [9]:
# filter the data to only be apples
(df_clean_headers[
    df_clean_headers.commodity == "APPLES"
    ]
).tail()

Unnamed: 0,date,region,class,commodity,variety,organic,environment,unit,number_of_stores,weighted_avg_price,low_price,high_price,marked_local
86195,08/30/2019,HAWAII,FRUITS,APPLES,FUJI,,,per pound,13,1.63,1.48,1.79,
86196,08/30/2019,ALASKA,FRUITS,APPLES,FUJI,Y,,per pound,1,2.59,2.59,2.59,
86197,08/30/2019,ALASKA,FRUITS,APPLES,GALA,,,5 lb bag,7,6.99,6.99,6.99,
86198,08/30/2019,ALASKA,FRUITS,APPLES,GALA,,,per pound,10,1.78,1.78,1.78,
86199,08/30/2019,HAWAII,FRUITS,APPLES,GALA,,,per pound,10,1.78,1.78,1.78,


**A quick note about selecting columns:** these two methods are equivalent. You can use dot notation or bracket notation to select a single column.

In [10]:
(
    df_clean_headers.commodity == df_clean_headers["commodity"]
).all() # this evaluates as true if all element pairs evaluate as equal 

True
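
Dot notation has some limits worth knowing. The sketch below uses a one-row made-up frame (`toy` is hypothetical, and so is its `count` column) to show cases where only bracket notation works. The `class` column from this dataset is one of them: `toy.class` is a syntax error because `class` is a reserved word in Python.

```python
import pandas as pd

# a one-row stand-in frame; the "count" column is hypothetical
toy = pd.DataFrame({
    "commodity": ["APPLES"],
    "class": ["FRUITS"],
    "count": [3],
    "Number of Stores": [264],
})

# dot notation works for simple names
print(toy.commodity.tolist())

# bracket notation works for every name, including reserved words,
# names with spaces, and names that clash with DataFrame methods
print(toy["class"].tolist())
print(toy["Number of Stores"].tolist())
print(toy["count"].tolist())

# toy.count is the DataFrame.count method, not the "count" column
print(callable(toy.count))
```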

## Limit the data by two factors
There are lots of ways to do this!

**Here's a straightforward, but long-winded and persnickety way** (look at those parens)

In [11]:
# filter the data to only be apples at the National aggregation
(df_clean_headers[
    (df_clean_headers.commodity == "APPLES") &
    (df_clean_headers.region == "NATIONAL")
    ]
).head(2)

Unnamed: 0,date,region,class,commodity,variety,organic,environment,unit,number_of_stores,weighted_avg_price,low_price,high_price,marked_local
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,,per pound,53,1.11,,,
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,2 lb bag,55,1.99,,,


**This is a way using built-in pandas methods** (notice how it doesn't need parens)

In [12]:
(df_clean_headers[
    df_clean_headers.commodity.str.match("APPLES") &
    df_clean_headers.region.str.match("NATIONAL")
    ]
).head(2)

Unnamed: 0,date,region,class,commodity,variety,organic,environment,unit,number_of_stores,weighted_avg_price,low_price,high_price,marked_local
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,,per pound,53,1.11,,,
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,2 lb bag,55,1.99,,,
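
One caveat, sketched on a made-up series (`"APPLES, DRIED"` is a hypothetical value, not one taken from the data): `str.match` treats its argument as a regular expression anchored at the start of the string, so it is not quite the same test as `==`.

```python
import pandas as pd

# "APPLES, DRIED" is a made-up value used to show the difference
s = pd.Series(["APPLES", "APPLE PEARS", "APPLES, DRIED"])

# exact equality
exact = s == "APPLES"

# regex match anchored at the start: also true for "APPLES, DRIED"
prefix = s.str.match("APPLES")

# regex match that must cover the whole string
full = s.str.fullmatch("APPLES")

# membership in a list of allowed values
either = s.isin(["APPLES", "APPLE PEARS"])

print(pd.DataFrame({"value": s, "exact": exact, "prefix": prefix, "full": full, "either": either}))
```

For the two filters above the results happen to agree, but `==`, `str.fullmatch`, or `isin` are safer choices when you want exact values.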


**This is a more SQL-like way to do the same filter.** It is also the easiest to read (especially when chaining/piping) and is often fast on large frames.

In [13]:
nat_apples = (df_clean_headers
    .query("region == 'NATIONAL' & commodity == 'APPLES'")
)

nat_apples.head()

Unnamed: 0,date,region,class,commodity,variety,organic,environment,unit,number_of_stores,weighted_avg_price,low_price,high_price,marked_local
1,09/07/2018,NATIONAL,FRUITS,APPLES,BRAEBURN,,,per pound,53,1.11,,,
2,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,2 lb bag,55,1.99,,,
3,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,3 lb bag,593,3.49,,,
4,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,,,per pound,1349,1.11,,,
5,09/07/2018,NATIONAL,FRUITS,APPLES,FUJI,Y,,per pound,59,2.74,,,
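
`query` can also refer to Python variables by prefixing them with `@`. A minimal sketch on a made-up frame (the names `toy`, `wanted_region`, and `max_price` are assumptions for illustration):

```python
import pandas as pd

# a small stand-in frame (hypothetical rows)
toy = pd.DataFrame({
    "region": ["NATIONAL", "NATIONAL", "HAWAII"],
    "commodity": ["APPLES", "PEARS", "APPLES"],
    "weighted_avg_price": [1.11, 1.49, 1.63],
})

wanted_region = "NATIONAL"
max_price = 1.25

# bare names refer to columns, @names refer to Python variables
via_query = toy.query("region == @wanted_region and weighted_avg_price < @max_price")

# the same filter written with boolean masks
via_mask = toy[(toy.region == wanted_region) & (toy.weighted_avg_price < max_price)]

print(via_query)
print(via_query.equals(via_mask))
```

This keeps the filter values out of the query string, which helps when they come from user input or a loop.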


## Sorting

In [14]:
(
    # sort by wap from large to small
    nat_apples.sort_values("weighted_avg_price", ascending=False)
    
    # filter to just the relevant columns
    [["date", "variety", "unit", "weighted_avg_price"]]
).head(10)

Unnamed: 0,date,variety,unit,weighted_avg_price
16105,11/09/2018,FUJI,3 lb bag,8.13
7154,10/05/2018,HONEYCRISP,3 lb bag,6.99
51534,04/05/2019,GRANNY SMITH,3 lb bag,6.99
33143,01/18/2019,PINK LADY/CRIPPS PINK,3 lb bag,6.99
34855,01/25/2019,GRANNY SMITH,3 lb bag,6.99
12557,10/26/2018,HONEYCRISP,3 lb bag,6.99
84726,08/30/2019,GALA,5 lb bag,6.99
33127,01/18/2019,GRANNY SMITH,3 lb bag,6.99
10749,10/19/2018,HONEYCRISP,3 lb bag,6.99
8954,10/12/2018,HONEYCRISP,3 lb bag,6.99
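
Sorting also accepts several columns at once, each with its own direction. A small sketch on made-up rows (`toy` is a hypothetical frame):

```python
import pandas as pd

# a small stand-in frame (hypothetical rows)
toy = pd.DataFrame({
    "variety": ["FUJI", "GALA", "FUJI", "GALA"],
    "unit": ["3 lb bag", "per pound", "per pound", "3 lb bag"],
    "weighted_avg_price": [3.49, 1.30, 1.11, 4.35],
})

# sort by variety A to Z, then by price from large to small within each variety
ordered = toy.sort_values(
    ["variety", "weighted_avg_price"],
    ascending=[True, False],
)
print(ordered)
```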


## Find unique values

In [15]:
print("Unique units:")
print(nat_apples.unit.unique())

print("\nUnique varieties:")
print(nat_apples.variety.unique())

Unique units:
['per pound' '2 lb bag' '3 lb bag' '5 lb bag']

Unique varieties:
['BRAEBURN' 'FUJI' 'GALA' 'GINGER GOLD' 'GOLDEN DELICIOUS' 'GRANNY SMITH'
 'HONEYCRISP' 'JONAGOLD' 'JONATHAN' 'MCINTOSH' 'PAULA RED'
 'PINK LADY/CRIPPS PINK' 'RED DELICIOUS' 'ROME']
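
If you also want to know how many distinct values there are, or how often each one appears, `nunique` and `value_counts` are close relatives of `unique`. A sketch on a made-up column:

```python
import pandas as pd

# a small stand-in column (hypothetical rows)
toy = pd.DataFrame({"unit": ["per pound", "2 lb bag", "per pound", "3 lb bag", "per pound"]})

# number of distinct values
n_units = toy.unit.nunique()
print(n_units)

# how many rows carry each value, most common first
counts = toy.unit.value_counts()
print(counts)
```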


## Summarize the data

In [16]:
nat_apples.describe()

Unnamed: 0,number_of_stores,weighted_avg_price,low_price,high_price,marked_local
count,2062.0,2062.0,0.0,0.0,0.0
mean,692.643065,2.713598,,,
std,1273.126278,1.247909,,,
min,1.0,0.25,,,
25%,46.0,1.7,,,
50%,219.0,2.5,,,
75%,789.0,3.5,,,
max,12056.0,8.13,,,


## Summarize groups of the data
Let's rank the apple varieties by their lowest national weighted average advertised price.

In [56]:
(nat_apples
 
 # group the data by apple type
 .groupby("variety")
 
 # summarize the data
 .describe()
 
 # summarizing a group creates a multi-index frame
 # so we have to specify a tuple to sort on
 # a specific column
 .sort_values([("weighted_avg_price", "min")])
 
 # limit the frame to only the WAP summary
 [["weighted_avg_price"]]
)

,weighted_avg_price,weighted_avg_price,weighted_avg_price,weighted_avg_price,weighted_avg_price,weighted_avg_price,weighted_avg_price,weighted_avg_price
variety,count,mean,std,min,25%,50%,75%,max
ROME,62.0,1.983226,0.909991,0.25,1.1625,2.0,2.7075,3.98
GINGER GOLD,29.0,2.077931,1.016167,0.58,1.14,2.5,2.7,4.99
JONATHAN,52.0,2.312885,0.858447,0.74,1.51,2.55,2.99,3.99
PINK LADY/CRIPPS PINK,169.0,2.69284,1.231585,0.74,1.72,2.11,3.84,6.99
MCINTOSH,119.0,2.27395,0.993446,0.78,1.415,2.27,2.8,6.99
BRAEBURN,120.0,2.15375,1.148314,0.79,1.32,1.645,2.99,5.0
GRANNY SMITH,233.0,3.05412,1.343118,0.79,1.89,2.99,3.99,6.99
JONAGOLD,88.0,2.030227,1.018624,0.79,1.2,1.745,2.5,5.99
GOLDEN DELICIOUS,149.0,2.317987,0.947722,0.88,1.52,2.17,2.99,4.98
RED DELICIOUS,271.0,2.784723,1.22626,0.88,1.85,2.86,3.5,6.99
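
If you only need a few statistics, one way to avoid the multi-index columns is named aggregation with `agg`. This is a sketch on made-up prices (`toy` and the output column names `n`, `min_price`, and `mean_price` are assumptions):

```python
import pandas as pd

# a small stand-in frame (hypothetical prices)
toy = pd.DataFrame({
    "variety": ["FUJI", "FUJI", "GALA", "GALA", "ROME"],
    "weighted_avg_price": [1.11, 3.49, 1.30, 4.35, 0.25],
})

summary = (toy
    # group the data by apple type
    .groupby("variety")

    # each keyword becomes a flat output column: name=(source column, statistic)
    .agg(
        n=("weighted_avg_price", "count"),
        min_price=("weighted_avg_price", "min"),
        mean_price=("weighted_avg_price", "mean"),
    )

    # the columns are flat, so a plain name is enough to sort on
    .sort_values("min_price")
)
print(summary)
```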


## Transforming and adding columns
The date column in this data is a US-formatted string (month/day/year). There are two ways to deal with that: parse it into a datetime, or rewrite it as a year-first string.

In [17]:
(nat_apples
 
 # create two new columns, datetime and date_str, holding the date
 # as a parsed datetime and as a YYYY-MM-DD string
 .assign(
     datetime=lambda x: pd.to_datetime(x.date,format="%m/%d/%Y"),
     date_str=lambda x: pd.to_datetime(x.date,format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
 )
 
 # new columns are put at the end of a dataframe.
 # I'm limiting the columns so that you can see them
 [["date", "datetime", "date_str", "number_of_stores"]]
 
).head()


Unnamed: 0,date,datetime,date_str,number_of_stores
1,09/07/2018,2018-09-07,2018-09-07,53
2,09/07/2018,2018-09-07,2018-09-07,55
3,09/07/2018,2018-09-07,2018-09-07,593
4,09/07/2018,2018-09-07,2018-09-07,1349
5,09/07/2018,2018-09-07,2018-09-07,59
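
Why bother converting? A short sketch with three dates of the kind found in this data shows that US-formatted strings do not sort chronologically, while parsed datetimes (and year-first strings) do. The variable names are hypothetical.

```python
import pandas as pd

dates = pd.Series(["09/07/2018", "01/18/2019", "11/09/2018"])

# sorted as text, the 2019 date comes first because "01" < "09"
as_text_sorted = dates.sort_values().tolist()
print(as_text_sorted)

# parsed as datetimes, the order is chronological
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
chrono = parsed.sort_values().dt.strftime("%Y-%m-%d").tolist()
print(chrono)

# datetimes also expose their parts through the .dt accessor
months = parsed.dt.month.tolist()
print(months)
```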


**It's also helpful to extract data or reformat it.**

Let's add a column to indicate the volume of the unit (its weight in pounds).

In [18]:
# define a function to do the conversion
def unit_to_vol(n):
    """convert a unit such as 'per pound' or 'N lb bag' to an int of lbs"""
    if 'bag' not in n:
        # 'per pound' (and any other unit without a bag) counts as 1
        return 1
    else:
        # parse 'N lb bag' by splitting on spaces and taking the number
        return int(float(n.split(" ")[0]))
                                                             
(nat_apples
 
 # add a column that is the unit volume
 .assign(
     unit_vol=lambda x: x.unit.map(unit_to_vol)
 )
 
 # new columns are put at the end of a dataframe.
 # I'm limiting the columns so that you can see them
 [["date", "variety", "unit", "unit_vol"]]
 
).head()

Unnamed: 0,date,variety,unit,unit_vol
1,09/07/2018,BRAEBURN,per pound,1
2,09/07/2018,FUJI,2 lb bag,2
3,09/07/2018,FUJI,3 lb bag,3
4,09/07/2018,FUJI,per pound,1
5,09/07/2018,FUJI,per pound,1
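
For larger frames, a vectorized version of the same idea is possible with `str.extract`. This sketch assumes every bagged unit looks exactly like `N lb bag` with a whole number N, which is stricter than the function above; anything else is counted as 1.

```python
import pandas as pd

units = pd.Series(["per pound", "2 lb bag", "3 lb bag", "each"])

unit_vol = (units
    # capture the leading number of 'N lb bag'; other units become missing
    .str.extract(r"^(\d+) lb bag$", expand=False)

    # turn the captured text into numbers, keeping the missing values
    .astype(float)

    # everything that was not a bag counts as 1
    .fillna(1)
    .astype(int)
)
print(unit_vol.tolist())
```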


## Putting it all together
Let's put this all together: start from the raw data, rename the columns, and select Fuji apples.

In [19]:
fuji_df = (df
     # clean up the column names
    .rename(columns=lambda x: x.replace("% ", ""))
    .rename(columns=str.lower)
    .rename(columns=lambda x: x.replace(" ", "_"))
    
    # convert the date column into datetimes 
    .assign(
        date=lambda x: pd.to_datetime(x.date, format="%m/%d/%Y"),
        unit_vol=lambda x: x.unit.map(unit_to_vol)
    )
    
    # add a per_pound price, we do this in a separate assign
    # because it relies on something calculated in the first
    .assign(
        weighted_avg_price_pp=lambda x: x.weighted_avg_price / x.unit_vol,
        low_price_pp=lambda x: x.low_price / x.unit_vol,
        high_price_pp=lambda x: x.high_price / x.unit_vol
    )
 
     # limit to Fuji Apples
    .query("commodity == 'APPLES' & variety == 'FUJI'")
           
     # limit to only non-organic
    .query("organic != 'Y'")
           
     # sort the dates from newest to oldest
    .sort_values("date", ascending=False)
     
     # only keep the columns we want
    [["date", "region", "number_of_stores", "weighted_avg_price_pp", "low_price_pp", "high_price_pp", "unit"]]
)

fuji_df.head()

Unnamed: 0,date,region,number_of_stores,weighted_avg_price_pp,low_price_pp,high_price_pp,unit
86195,2019-08-30,HAWAII,13,1.63,1.48,1.79,per pound
86045,2019-08-30,NORTHWEST U.S.,5,1.29,1.29,1.29,per pound
85844,2019-08-30,SOUTHWEST U.S.,88,0.68,0.49,0.77,per pound
85674,2019-08-30,SOUTH CENTRAL U.S.,25,0.82,0.59,1.29,per pound
85460,2019-08-30,MIDWEST U.S.,146,1.39,0.99,1.49,per pound


### Where and when were the most expensive Fuji apples marketed? Cheapest? How were they sold?

In [20]:
fuji_expen = (fuji_df
    # sort the data high to low
    .sort_values("high_price_pp", ascending=False)
    
    # limit to the top 5
    .head(5)
    
    # limit the columns
    [["date", "region", "high_price_pp", "unit"]]
)

fuji_cheap = (fuji_df
    # sort the data low to high
    .sort_values("low_price_pp")
    
    # limit to the top 5
    .head(5)
    
    # limit the columns
    [["date", "region", "low_price_pp", "unit"]]
)

display(fuji_expen)
display(fuji_cheap)

Unnamed: 0,date,region,high_price_pp,unit
43125,2019-02-22,HAWAII,3.99,per pound
17714,2018-11-09,HAWAII,3.49,per pound
15970,2018-11-02,HAWAII,3.49,per pound
46480,2019-03-08,HAWAII,2.99,per pound
41065,2019-02-15,SOUTHWEST U.S.,2.99,per pound


Unnamed: 0,date,region,low_price_pp,unit
24104,2018-12-07,SOUTHWEST U.S.,0.33,per pound
15162,2018-11-02,MIDWEST U.S.,0.33,3 lb bag
44452,2019-03-01,SOUTHWEST U.S.,0.33,3 lb bag
54385,2019-04-12,SOUTHWEST U.S.,0.33,per pound
13828,2018-10-26,SOUTHWEST U.S.,0.33,per pound
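
The sort-then-head pattern has a shortcut in `nlargest` and `nsmallest`. A sketch on made-up rows (`toy` is a hypothetical frame; with tied values the two approaches may pick different rows):

```python
import pandas as pd

# a small stand-in frame (hypothetical rows, no tied prices)
toy = pd.DataFrame({
    "region": ["HAWAII", "ALASKA", "MIDWEST U.S.", "SOUTHWEST U.S."],
    "high_price_pp": [3.99, 2.59, 1.49, 0.77],
})

# the two most expensive rows
top2 = toy.nlargest(2, "high_price_pp")

# the same result with sort_values and head
same = toy.sort_values("high_price_pp", ascending=False).head(2)
print(top2.equals(same))

# the two cheapest rows
bottom2 = toy.nsmallest(2, "high_price_pp")
print(bottom2)
```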


## Pivoting the data (and more)

#### How did the average price of Fuji apples sold per pound change over the course of the year? How did the national average compare to the Northwest US?

In [21]:
us_vs_nw_fuji = (fuji_df
 
 # limit to per pound sales
 .query("unit == 'per pound'")
 
 # limit to National and Northwest
 .query("region == 'NATIONAL' | region == 'NORTHWEST U.S.'")
 
 # pivot the data on date and region, using the per-pound weighted average price as the values
 .pivot(
        index="date", 
        columns="region", 
        values="weighted_avg_price_pp"
  )
 
 # create a new column that is the Northwest price minus the National price
 .assign(
     nat_nw_diff=lambda x: x["NORTHWEST U.S."] - x["NATIONAL"]
 )
 
 
)

(us_vs_nw_fuji
 
     # Do something fancy!
     # color the cells where Fujis in the Northwest were more expensive than the national average
     # (pandas 2.1 and later deprecate Styler.applymap in favor of Styler.map)
     .style.applymap(
         lambda x: "background-color: lightyellow; font-weight: bold;" if x > 0 else "", 
         subset=["nat_nw_diff"]
     )
)

date,NATIONAL,NORTHWEST U.S.,nat_nw_diff
2018-09-07 00:00:00,1.11,0.99,-0.12
2018-09-14 00:00:00,1.17,1.4,0.23
2018-09-21 00:00:00,1.15,1.18,0.03
2018-09-28 00:00:00,1.32,1.5,0.18
2018-10-05 00:00:00,1.2,1.27,0.07
2018-10-12 00:00:00,1.14,1.38,0.24
2018-10-19 00:00:00,1.37,0.78,-0.59
2018-10-26 00:00:00,1.35,1.07,-0.28
2018-11-02 00:00:00,1.39,1.37,-0.02
2018-11-09 00:00:00,1.26,0.9,-0.36
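
`pivot` needs each index and column pair to appear only once. The sketch below, on made-up rows (`long_df` and the duplicate price of 1.31 are assumptions), shows what happens when that is not the case and how `pivot_table` can aggregate the duplicates instead.

```python
import pandas as pd

# a small long-format frame (hypothetical rows)
long_df = pd.DataFrame({
    "date": ["2018-09-07", "2018-09-07", "2018-09-14", "2018-09-14"],
    "region": ["NATIONAL", "NORTHWEST U.S.", "NATIONAL", "NORTHWEST U.S."],
    "price": [1.11, 0.99, 1.17, 1.40],
})

# one value per date/region pair: pivot works
wide = long_df.pivot(index="date", columns="region", values="price")
print(wide)

# add a second NATIONAL row for the first date
dupes = pd.concat([long_df, long_df.iloc[[0]].assign(price=1.31)], ignore_index=True)

# pivot refuses to guess which of the two values to keep
try:
    dupes.pivot(index="date", columns="region", values="price")
except ValueError as err:
    print("pivot failed:", err)

# pivot_table aggregates the duplicates, here with the mean
averaged = dupes.pivot_table(index="date", columns="region", values="price", aggfunc="mean")
print(averaged)
```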


## Saving your data

In [22]:
us_vs_nw_fuji.to_csv("us-vs-nw-fuji-apples-201809-201909.csv")
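
When you read a saved file back in, the date index comes back as an ordinary column of strings unless you ask otherwise. A minimal round-trip sketch, writing to an in-memory buffer instead of a file (`wide`, `buffer`, and `restored` are hypothetical names, and the prices are a two-row excerpt):

```python
import io

import pandas as pd

# a two-row stand-in for the pivoted frame
wide = pd.DataFrame(
    {"NATIONAL": [1.11, 1.17], "NORTHWEST U.S.": [0.99, 1.40]},
    index=pd.DatetimeIndex(["2018-09-07", "2018-09-14"], name="date"),
)

# write to an in-memory buffer; a file path works the same way
buffer = io.StringIO()
wide.to_csv(buffer)
buffer.seek(0)

# restore the index and parse it back into datetimes
restored = pd.read_csv(buffer, index_col="date", parse_dates=["date"])
print(restored)
print(restored.index.dtype)
```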

## Joining data
Let's load some new data about avocados from the USDA and the US Census so we can join import figures with retail prices.

In [96]:
# load the retail data
avo_retail = (df
     # clean up the column names
    .rename(columns=lambda x: x.replace("% ", ""))
    .rename(columns=str.lower)
    .rename(columns=lambda x: x.replace(" ", "_"))
    
    # convert the date column into datetimes 
    .assign(
        date=lambda x: pd.to_datetime(x.date, format="%m/%d/%Y"),
        unit_vol=lambda x: x.unit.map(unit_to_vol)
    )
              
    .assign(
       # create a column that's a year and month string
       year_month=lambda x: x.date.dt.strftime("%Y-%m")
   )
              
    # limit to Hass Avocados
   .query("commodity == 'AVOCADOS' & variety == 'HASS'")
             
    # limit to the National aggregation
   .query("region == 'NATIONAL'")
             
   # limit to item sales
   .query("unit == 'each'")
          
    # limit to only non-organic
   .query("organic != 'Y'")
          
    # sort the dates from newest to oldest
   .sort_values("date", ascending=False)
    
    # only keep the columns we want
   [["date", "year_month", "number_of_stores", "weighted_avg_price"]]
)

# load the imports data
avo_imports = (pd.read_csv("census-us-imports-avocados-201805-201907.csv")
                     
   .assign(
       # convert the time column to a parsed date  
       date=lambda x: pd.to_datetime(x.time),
   )
               
   .assign(
       # create a column that's a year and month string
       year_month=lambda x: x.date.dt.strftime("%Y-%m")
   )
               
   # drop the blank column and the string date column
   .drop(["Unnamed: 4", "time"], axis=1)

  )

display(avo_retail.head())
display(avo_imports.head())

Unnamed: 0,date,year_month,number_of_stores,weighted_avg_price
84756,2019-08-30,2019-08,6163,1.28
83164,2019-08-23,2019-08,8155,1.11
81647,2019-08-16,2019-08,4688,1.18
80125,2019-08-09,2019-08,4657,1.36
78539,2019-08-02,2019-08,7790,1.06


Unnamed: 0,gen_val,gen_qty,gen_unit_val,date,year_month
0,277140118,83002869,3.34,2019-07-01,2019-07
1,212474553,56398177,3.77,2019-06-01,2019-06
2,253722506,75525694,3.36,2019-05-01,2019-05
3,303105601,87379294,3.47,2019-04-01,2019-04
4,199917909,99982135,2.0,2019-03-01,2019-03


Now let's join them on the `year_month` column.

Generally, you'll use `merge` to join data frames on arbitrary columns.

In [100]:
(avo_retail
 .merge(avo_imports,
    # join using the year_month column
    on="year_month",
    
    # do a left outer join
    how="left",
    
    # where the columns overlap, add a suffix to 
    # make clear which frame it came from
    suffixes=["_retail", "_imports"]
 )
).tail()

Unnamed: 0,date_retail,year_month,number_of_stores,weighted_avg_price,gen_val,gen_qty,gen_unit_val,date_imports
47,2018-10-05,2018-10,6780,1.09,168443666.0,83302581.0,2.02,2018-10-01
48,2018-09-28,2018-09,8280,1.18,177933537.0,75976150.0,2.34,2018-09-01
49,2018-09-21,2018-09,6640,1.15,177933537.0,75976150.0,2.34,2018-09-01
50,2018-09-14,2018-09,6523,1.16,177933537.0,75976150.0,2.34,2018-09-01
51,2018-09-07,2018-09,2978,1.25,177933537.0,75976150.0,2.34,2018-09-01
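
Two optional arguments to `merge` can help check a join like this one. The sketch below uses small made-up frames (`retail` and `imports` are hypothetical excerpts): `validate` raises an error if the right side has repeated keys, and `indicator` adds a column saying where each row found a match. A likely reason the import figures above show up as floats is that a left join fills unmatched rows with missing values.

```python
import pandas as pd

# small stand-ins for the two frames (hypothetical excerpts)
retail = pd.DataFrame({
    "year_month": ["2018-09", "2018-09", "2018-10", "2019-08"],
    "weighted_avg_price": [1.25, 1.16, 1.09, 1.28],
})
imports = pd.DataFrame({
    "year_month": ["2018-09", "2018-10"],
    "gen_qty": [75976150, 83302581],
})

joined = retail.merge(
    imports,
    on="year_month",
    how="left",

    # fail loudly if a month appears more than once in imports
    validate="many_to_one",

    # add a _merge column: "both" or "left_only"
    indicator=True,
)
print(joined)
```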


However, your data might share an index rather than a column.

In [107]:
# set `year_month` as an index for the example
avo_retail_ind = avo_retail.set_index("year_month")
avo_imports_ind = avo_imports.set_index("year_month")

display(avo_retail_ind.head(2))
display(avo_imports_ind.head(2))

year_month,date,number_of_stores,weighted_avg_price
2019-08,2019-08-30,6163,1.28
2019-08,2019-08-23,8155,1.11


year_month,gen_val,gen_qty,gen_unit_val,date
2019-07,277140118,83002869,3.34,2019-07-01
2019-06,212474553,56398177,3.77,2019-06-01


Now joining these frames is even simpler.

In [113]:
(avo_retail_ind
   .join(avo_imports_ind, 
         lsuffix="_retail", 
         rsuffix="_imports"
    )
).head()

year_month,date_retail,number_of_stores,weighted_avg_price,gen_val,gen_qty,gen_unit_val,date_imports
2018-09,2018-09-28,8280,1.18,177933537.0,75976150.0,2.34,2018-09-01
2018-09,2018-09-21,6640,1.15,177933537.0,75976150.0,2.34,2018-09-01
2018-09,2018-09-14,6523,1.16,177933537.0,75976150.0,2.34,2018-09-01
2018-09,2018-09-07,2978,1.25,177933537.0,75976150.0,2.34,2018-09-01
2018-10,2018-10-26,13631,0.97,168443666.0,83302581.0,2.02,2018-10-01
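
For reference, `join` on a shared index does the same work as `merge` with `left_index=True` and `right_index=True`. A sketch on small made-up frames (`left` and `right` are hypothetical excerpts):

```python
import pandas as pd

# small stand-ins indexed by year_month (hypothetical excerpts)
left = pd.DataFrame({
    "year_month": ["2018-09", "2018-09", "2018-10"],
    "weighted_avg_price": [1.25, 1.16, 1.09],
}).set_index("year_month")

right = pd.DataFrame({
    "year_month": ["2018-09", "2018-10"],
    "gen_qty": [75976150, 83302581],
}).set_index("year_month")

# join matches on the index and keeps every left row by default
via_join = left.join(right)

# the same join spelled out with merge
via_merge = left.merge(right, left_index=True, right_index=True, how="left")

print(via_join)
print(via_join.equals(via_merge))
```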
