# Lecture 07 – Data Science


A demo of data cleaning and exploratory data analysis using the **CDC** **Tuberculosis** data and the **Mauna Loa CO2** data.

In [30]:
#from google.colab import drive
#drive.mount('/content/drive')

In [31]:
import numpy as np
import pandas as pd

In [32]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Structure: Multiple Files
**CSV File**

Let's continue from where we left off last time. We loaded the CDC Tuberculosis dataset, inspected it, did some wrangling, and ended up with something like the table below.

In [33]:
# Perform the same data wrangling as in previous labs
tb = pd.read_csv(r'C:\Users\dell\Downloads\cdc_tuberculosis.csv', header = 1, thousands = ',')
tb = tb.rename(columns = {'2019':'Cases in 2019','2020':'Cases in 2020','2021':'Cases in 2021', '2019.1':'Incidences in 2019','2020.1':'Incidences in 2020','2021.1':'Incidences in 2021'})
# Note: drop() returns a new DataFrame. The result is not assigned here,
# so the 'Total' row (index 0) stays in tb, as the output below shows.
tb.drop(0)
tb.head(10)

Unnamed: 0,U.S. jurisdiction,Cases in 2019,Cases in 2020,Cases in 2021,Incidences in 2019,Incidences in 2020,Incidences in 2021
0,Total,8900,7173,7860,2.71,2.16,2.37
1,Alabama,87,72,92,1.77,1.43,1.83
2,Alaska,58,58,58,7.91,7.92,7.92
3,Arizona,183,136,129,2.51,1.89,1.77
4,Arkansas,64,59,69,2.12,1.96,2.28
5,California,2111,1706,1750,5.35,4.32,4.46
6,Colorado,66,52,58,1.15,0.9,1.0
7,Connecticut,67,54,54,1.88,1.5,1.5
8,Delaware,18,17,43,1.84,1.71,4.29
9,District of Columbia,24,19,19,3.39,2.75,2.84


In [34]:
tb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   U.S. jurisdiction   52 non-null     object 
 1   Cases in 2019       52 non-null     int64  
 2   Cases in 2020       52 non-null     int64  
 3   Cases in 2021       52 non-null     int64  
 4   Incidences in 2019  52 non-null     float64
 5   Incidences in 2020  52 non-null     float64
 6   Incidences in 2021  52 non-null     float64
dtypes: float64(3), int64(3), object(1)
memory usage: 3.0+ KB


In [35]:
print(tb.head())
print(tb.tail())

  U.S. jurisdiction  Cases in 2019  Cases in 2020  Cases in 2021  \
0             Total           8900           7173           7860   
1           Alabama             87             72             92   
2            Alaska             58             58             58   
3           Arizona            183            136            129   
4          Arkansas             64             59             69   

   Incidences in 2019  Incidences in 2020  Incidences in 2021  
0                2.71                2.16                2.37  
1                1.77                1.43                1.83  
2                7.91                7.92                7.92  
3                2.51                1.89                1.77  
4                2.12                1.96                2.28  
   U.S. jurisdiction  Cases in 2019  Cases in 2020  Cases in 2021  \
47          Virginia            191            169            161   
48        Washington            221            163            199   


### Gather Census Data

U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).

Running the cells below cleans the data. We encourage you to closely explore the CSV and study these lines after lecture...

There are a few new methods here:
* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)).
* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be used to drop null values.
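As a small illustration of these two methods, here is a sketch on a toy frame. The frame `df` and its third, empty row are made up for the example; the two population values are the 2019 estimates for Alabama and Alaska shown later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: two populated rows and one empty row (hypothetical layout).
df = pd.DataFrame({
    "Geographic Area": [".Alabama", ".Alaska", None],
    "2019": [4903185.0, 731545.0, np.nan],
})

cleaned = df.dropna()             # drop rows that contain any null value
typed = cleaned.convert_dtypes()  # infer nullable dtypes where possible

print(len(df), "rows before,", len(cleaned), "rows after dropna()")
print(typed.dtypes)
```

Because the remaining floats are whole numbers, `convert_dtypes()` can store the `2019` column as a nullable integer type.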

In [36]:
# 2010s census data
census_2010s_df = pd.read_csv(r"C:\Users\dell\Downloads/nst-est2019-01.csv", header=3, thousands=",")

# Perform data wrangling on census_2010s_df
census_2010s_df = census_2010s_df.rename(columns = {'Unnamed: 0': 'Geographic Area'})
census_2010s_df = census_2010s_df.dropna()
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
census_2010s_df.head(10)

Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538.0,308758105.0,309321666.0,311556874.0,313830990.0,315993715.0,318301008.0,320635163.0,322941311.0,324985539.0,326687501.0,328239523.0
1,Northeast,55317240.0,55318443.0,55380134.0,55604223.0,55775216.0,55901806.0,56006011.0,56034684.0,56042330.0,56059240.0,56046620.0,55982803.0
2,Midwest,66927001.0,66929725.0,66974416.0,67157800.0,67336743.0,67560379.0,67745167.0,67860583.0,67987540.0,68126781.0,68236628.0,68329004.0
3,South,114555744.0,114563030.0,114866680.0,116006522.0,117241208.0,118364400.0,119624037.0,120997341.0,122351760.0,123542189.0,124569433.0,125580448.0
4,West,71945553.0,71946907.0,72100436.0,72788329.0,73477823.0,74167130.0,74925793.0,75742555.0,76559681.0,77257329.0,77834820.0,78347268.0
5,Alabama,4779736.0,4780125.0,4785437.0,4799069.0,4815588.0,4830081.0,4841799.0,4852347.0,4863525.0,4874486.0,4887681.0,4903185.0
6,Alaska,710231.0,710249.0,713910.0,722128.0,730443.0,737068.0,736283.0,737498.0,741456.0,739700.0,735139.0,731545.0
7,Arizona,6392017.0,6392288.0,6407172.0,6472643.0,6554978.0,6632764.0,6730413.0,6829676.0,6941072.0,7044008.0,7158024.0,7278717.0
8,Arkansas,2915918.0,2916031.0,2921964.0,2940667.0,2952164.0,2959400.0,2967392.0,2978048.0,2989918.0,3001345.0,3009733.0,3017804.0
9,California,37253956.0,37254519.0,37319502.0,37638369.0,37948800.0,38260787.0,38596972.0,38918045.0,39167117.0,39358497.0,39461588.0,39512223.0


In [37]:
# census 2020s data
census_2020s_df = pd.read_csv(r"C:\Users\dell\Downloads/NST-EST2022-POP.csv", header=3, thousands=",")

# Perform data wrangling on census_2020s_df
census_2020s_df = (
    census_2020s_df
    .reset_index()
    .drop(columns=["index", "Unnamed: 1"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .dropna()  )
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
census_2020s_df

Unnamed: 0,Geographic Area,2020,2021,2022
0,United States,331511512.00,332031554.00,333287557.00
1,Northeast,57448898.00,57259257.00,57040406.00
2,Midwest,68961043.00,68836505.00,68787595.00
3,South,126450613.00,127346029.00,128716192.00
4,West,78650958.00,78589763.00,78743364.00
...,...,...,...,...
52,Washington,7724031.00,7740745.00,7785786.00
53,West Virginia,1791420.00,1785526.00,1775156.00
54,Wisconsin,5896271.00,5880101.00,5892539.00
55,Wyoming,577605.00,579483.00,581381.00


### Join Data (Merge DataFrames)

Time to `merge`!

In [38]:
# merge the two US census dataframes (2010-2019 and 2020-2022) on 'Geographic Area'
tb_census_df = pd.merge(left = census_2010s_df, right = census_2020s_df,
                        left_on= 'Geographic Area', right_on= 'Geographic Area')
tb_census_df


Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,United States,308745538.00,308758105.00,309321666.00,311556874.00,313830990.00,315993715.00,318301008.00,320635163.00,322941311.00,324985539.00,326687501.00,328239523.00,331511512.00,332031554.00,333287557.00
1,Northeast,55317240.00,55318443.00,55380134.00,55604223.00,55775216.00,55901806.00,56006011.00,56034684.00,56042330.00,56059240.00,56046620.00,55982803.00,57448898.00,57259257.00,57040406.00
2,Midwest,66927001.00,66929725.00,66974416.00,67157800.00,67336743.00,67560379.00,67745167.00,67860583.00,67987540.00,68126781.00,68236628.00,68329004.00,68961043.00,68836505.00,68787595.00
3,South,114555744.00,114563030.00,114866680.00,116006522.00,117241208.00,118364400.00,119624037.00,120997341.00,122351760.00,123542189.00,124569433.00,125580448.00,126450613.00,127346029.00,128716192.00
4,West,71945553.00,71946907.00,72100436.00,72788329.00,73477823.00,74167130.00,74925793.00,75742555.00,76559681.00,77257329.00,77834820.00,78347268.00,78650958.00,78589763.00,78743364.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,Washington,6724540.00,6724540.00,6742830.00,6826627.00,6897058.00,6963985.00,7054655.00,7163657.00,7294771.00,7423362.00,7523869.00,7614893.00,7724031.00,7740745.00,7785786.00
53,West Virginia,1852994.00,1853018.00,1854239.00,1856301.00,1856872.00,1853914.00,1849489.00,1842050.00,1831023.00,1817004.00,1804291.00,1792147.00,1791420.00,1785526.00,1775156.00
54,Wisconsin,5686986.00,5687285.00,5690475.00,5705288.00,5719960.00,5736754.00,5751525.00,5760940.00,5772628.00,5790186.00,5807406.00,5822434.00,5896271.00,5880101.00,5892539.00
55,Wyoming,563626.00,563775.00,564487.00,567299.00,576305.00,582122.00,582531.00,585613.00,584215.00,578931.00,577601.00,578759.00,577605.00,579483.00,581381.00


This is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census DataFrames. Here we do the former and drop the 2010-2018 estimates.

In [39]:
# drop the yearly estimates we don't need (2010-2018)
tb_census_df = tb_census_df.drop(columns= ['2010','2011','2012','2013','2014','2015','2016','2017','2018'])
tb_census_df

Unnamed: 0,Geographic Area,Census,Estimates Base,2019,2020,2021,2022
0,United States,308745538.00,308758105.00,328239523.00,331511512.00,332031554.00,333287557.00
1,Northeast,55317240.00,55318443.00,55982803.00,57448898.00,57259257.00,57040406.00
2,Midwest,66927001.00,66929725.00,68329004.00,68961043.00,68836505.00,68787595.00
3,South,114555744.00,114563030.00,125580448.00,126450613.00,127346029.00,128716192.00
4,West,71945553.00,71946907.00,78347268.00,78650958.00,78589763.00,78743364.00
...,...,...,...,...,...,...,...
52,Washington,6724540.00,6724540.00,7614893.00,7724031.00,7740745.00,7785786.00
53,West Virginia,1852994.00,1853018.00,1792147.00,1791420.00,1785526.00,1775156.00
54,Wisconsin,5686986.00,5687285.00,5822434.00,5896271.00,5880101.00,5892539.00
55,Wyoming,563626.00,563775.00,578759.00,577605.00,579483.00,581381.00
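The tuberculosis table has not been joined to the census data at this point. Before doing so, it can help to check which jurisdiction names fail to match. The following is a sketch under assumptions: `tb_sample` and `census_sample` are hand-built from a few rows shown above, and `indicator=True` is a standard option of `DataFrame.merge` that adds a `_merge` column.

```python
import pandas as pd

# A few rows copied from the tables above.
tb_sample = pd.DataFrame({
    "U.S. jurisdiction": ["Total", "Alabama", "Alaska"],
    "Cases in 2019": [8900, 87, 58],
})
census_sample = pd.DataFrame({
    "Geographic Area": ["United States", "Alabama", "Alaska"],
    "2019": [328239523, 4903185, 731545],
})

# An outer join with indicator=True labels where each row came from,
# which makes unmatched keys easy to spot.
check = tb_sample.merge(
    census_sample,
    left_on="U.S. jurisdiction",
    right_on="Geographic Area",
    how="outer",
    indicator=True,
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["U.S. jurisdiction", "Geographic Area", "_merge"]])

# The default inner join keeps only jurisdictions present in both frames.
matched = tb_sample.merge(
    census_sample, left_on="U.S. jurisdiction", right_on="Geographic Area"
)
print(matched)
```

In this sample, `Total` and `United States` describe the same population under different names, so an inner join silently drops both.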


### Reproduce incidence

Let's recompute incidence to make sure we know where the original CDC numbers came from.

From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”

If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as

$$\text{TB incidence} = \frac{\text{\# TB cases in population}}{\text{\# groups in population}} = \frac{\text{\# TB cases in population}}{\text{population}/100000} $$

$$= \frac{\text{\# TB cases in population}}{\text{population}} \times 100000$$

Let's try this for 2019:

In [50]:
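# Attempt to recompute incidence for 2019.
# Caution: this divides the reported incidence (not the case count) by population,
# and the two DataFrames are aligned on row labels rather than on jurisdiction,
# so the result below does not reproduce the CDC figures.
(tb['Incidences in 2019']/census_2010s_df['2019'])*100000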


0    0.00
1    0.00
2    0.01
3    0.00
4    0.00
     ... 
52    NaN
53    NaN
54    NaN
55    NaN
57    NaN
Length: 57, dtype: float64

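These values do not match the CDC incidences. The cell divides the `Incidences in 2019` column, rather than `Cases in 2019`, by population, and `tb` and `census_2010s_df` are aligned on row labels instead of jurisdiction names (row 1 is Alabama in `tb` but Northeast in the census table). The rows labelled 52 and above are `NaN` because `tb` has no such labels. Reproducing the incidence requires joining the two tables on jurisdiction and dividing case counts by population.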
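A minimal sketch of the formula above, assuming the two tables are first joined on jurisdiction name. The frames `tb_sample` and `census_sample` and the column name `recomputed 2019` are illustrative; the numbers are copied from the outputs earlier in this notebook.

```python
import pandas as pd

tb_sample = pd.DataFrame({
    "U.S. jurisdiction": ["Alabama", "Alaska", "Arizona"],
    "Cases in 2019": [87, 58, 183],
    "Incidences in 2019": [1.77, 7.91, 2.51],
})
census_sample = pd.DataFrame({
    "Geographic Area": ["Alabama", "Alaska", "Arizona"],
    "2019": [4903185, 731545, 7278717],
})

# Join on the jurisdiction name so each case count meets its own population.
merged = tb_sample.merge(
    census_sample, left_on="U.S. jurisdiction", right_on="Geographic Area"
)

# incidence = cases / population * 100,000
merged["recomputed 2019"] = merged["Cases in 2019"] / merged["2019"] * 100_000

print(merged[["U.S. jurisdiction", "Incidences in 2019", "recomputed 2019"]].round(2))
```

For these three rows the recomputed values round to 1.77, 7.93 and 2.51, so Alabama and Arizona agree with the CDC column while Alaska differs in the hundredths place.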

Let's use a for-loop and Python format strings to compute TB incidence for all years. Python f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([Python documentation](https://docs.python.org/3/tutorial/inputoutput.html)).

In [58]:
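# Attempt to recompute incidence for all years (2019, 2020, 2021).
# Caution: tb_census_df holds only census columns, and each year's population
# is divided by itself here, so every value comes out as exactly 100000.
years = (2019, 2020, 2021)
inc_dict = {}
for y in years:
  inc_dict[y] = (tb_census_df[f'{y}']/tb_census_df[f'{y}'])*100000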
inc_dict

{2019: 0    100000.00
 1    100000.00
 2    100000.00
 3    100000.00
 4    100000.00
         ...   
 52   100000.00
 53   100000.00
 54   100000.00
 55   100000.00
 56   100000.00
 Name: 2019, Length: 57, dtype: float64,
 2020: 0    100000.00
 1    100000.00
 2    100000.00
 3    100000.00
 4    100000.00
         ...   
 52   100000.00
 53   100000.00
 54   100000.00
 55   100000.00
 56   100000.00
 Name: 2020, Length: 57, dtype: float64,
 2021: 0    100000.00
 1    100000.00
 2    100000.00
 3    100000.00
 4    100000.00
         ...   
 52   100000.00
 53   100000.00
 54   100000.00
 55   100000.00
 56   100000.00
 Name: 2021, Length: 57, dtype: float64}

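Every value is exactly 100000 because each population column is divided by itself, so this output does not yet reproduce the CDC incidences. With case counts in the numerator (which requires merging `tb` into `tb_census_df` on jurisdiction), the recomputed numbers come out close to the CDC's, with a few discrepancies in the hundredths place.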
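The loop-and-f-string pattern can be sketched on a small hand-built frame. This is an illustration under assumptions: `merged` stands in for a table that already holds both case counts and populations, and the `recomputed {year}` column names are made up. The numbers for the three states are copied from the outputs in this notebook.

```python
import pandas as pd

merged = pd.DataFrame({
    "U.S. jurisdiction": ["Washington", "West Virginia", "Wisconsin"],
    "Cases in 2019": [221, 9, 51],
    "Cases in 2020": [163, 13, 35],
    "Cases in 2021": [199, 7, 66],
    "2019": [7614893, 1792147, 5822434],
    "2020": [7724031, 1791420, 5896271],
    "2021": [7740745, 1785526, 5880101],
})

for year in (2019, 2020, 2021):
    # f-strings build the column names for each year
    merged[f"recomputed {year}"] = (
        merged[f"Cases in {year}"] / merged[f"{year}"] * 100_000
    )

print(merged.filter(like="recomputed").round(2))
```

For these three states the rounded results agree with the incidences listed in the CDC table (for example 2.90, 2.11 and 2.57 for Washington).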

I'll also leave the part with reproducing the "9.4%" increase to you! Or, you can also check the bonus section in Lecture 4's demo notebook to see how we did it.


<br/><br/><br/>

---

## Structure: Different File Formats

There are many file types for storing structured data: CSV, TSV, JSON, XML, ASCII, SAS...
* Documentation will be your best friend to understand how to process many of these file types.
* In lecture, we covered TSV and JSON since pandas supports them out-of-box.

### TSV

**TSV** (Tab-Separated Values) files are very similar to CSVs, but now items are delimited by tabs.

Let's check out `cdc_tuberculosis.tsv`, which is the same data but now in a TSV.

Quick Python reminders for inspecting the contents of a file:
* Python's `print()` prints each string (including the newline), and an additional newline on top of that.
* We use the `repr()` function to return the raw string, with special characters such as `\t` and `\n` shown as escape sequences.
* The `enumerate(x)` function returns a counter along with the elements of `x`.
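A short sketch of these three reminders on a single made-up line (`line` is a shortened version of the Alabama row, not read from the file):

```python
line = "Alabama\t87\t72\n"

print(line)        # the tab and newline are rendered; print adds its own newline too
print(repr(line))  # escape sequences are shown literally: 'Alabama\t87\t72\n'

# enumerate pairs each element with a running counter starting at 0
for i, field in enumerate(line.strip().split("\t")):
    print(i, field)
```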

In [60]:
with open(r"C:\Users\dell\Downloads\cdc_tuberculosis.tsv") as f:

  for i, line in enumerate(f):
      print(repr(line)) # print raw strings
      # stop after the first 5 lines (i = 0, ..., 4)
      if i > 3:
          break

'\tNo. of TB cases\t\t\tTB incidence\t\t\n'
'U.S. jurisdiction\t2019\t2020\t2021\t2019\t2020\t2021\n'
'Total\t"8,900"\t"7,173"\t"7,860"\t2.71\t2.16\t2.37\n'
'Alabama\t87\t72\t92\t1.77\t1.43\t1.83\n'
'Alaska\t58\t58\t58\t7.91\t7.92\t7.92\n'


A quick note: the above is a very explicit way to loop over the first few lines of the file by controlling a line counter (as written, the `i > 3` test lets five lines through before breaking). We can do the same with more concise code by asking Python for one line at a time with the file's **readline()** method:

In [61]:
with open(r"C:\Users\dell\Downloads\cdc_tuberculosis.tsv") as f:
  # this time use readline() instead of enumerate()
  for i in range(0,4):
    print(f.readline())


	No. of TB cases			TB incidence		

U.S. jurisdiction	2019	2020	2021	2019	2020	2021

Total	"8,900"	"7,173"	"7,860"	2.71	2.16	2.37

Alabama	87	72	92	1.77	1.43	1.83



Note the difference from `f.readlines()` (plural), which reads the entire file even when we only want the first few lines; that can be wasteful. The Python `zip` built-in function ([docs here](https://docs.python.org/3/library/functions.html#zip)) is a useful thing to know about. The code below may look a little odd at first, but it does the same as the first example above much more concisely, and once you get used to thinking about `zip`, it becomes a very natural tool for expressing various iteration strategies:

In [62]:
with open(r"C:\Users\dell\Downloads\cdc_tuberculosis.tsv") as f:
  # read the first four lines with zip
  for i, line in zip(range(0,4), f):
    print(line)

	No. of TB cases			TB incidence		

U.S. jurisdiction	2019	2020	2021	2019	2020	2021

Total	"8,900"	"7,173"	"7,860"	2.71	2.16	2.37

Alabama	87	72	92	1.77	1.43	1.83
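To see the `zip` idiom without needing the TSV file, here is a self-contained sketch on an in-memory text stream. The six-line `text` is made up, and `itertools.islice` is shown as one alternative from the standard library, not something the lecture uses.

```python
import io
from itertools import islice

text = "line 0\nline 1\nline 2\nline 3\nline 4\nline 5\n"

# zip stops as soon as its shortest input (here range(4)) is exhausted,
# so only the first four lines are ever read.
with io.StringIO(text) as f:
    first_four = [line for _, line in zip(range(4), f)]

# islice expresses "take the first n items" directly.
with io.StringIO(text) as f:
    also_first_four = list(islice(f, 4))

print(first_four)
```

Putting `range(4)` first means `zip` does not pull a fifth line from the file before noticing that the counter has run out.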



The `pd.read_csv` function also reads in TSVs if we specify the **delimiter** with parameter `sep='\t'`, or equivalently its alias `delimiter='\t'` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).

In [63]:
# now use pandas to read the TSV file
pd.read_csv(r"C:\Users\dell\Downloads\cdc_tuberculosis.tsv", delimiter='\t')

Unnamed: 0.1,Unnamed: 0,No. of TB cases,Unnamed: 2,Unnamed: 3,TB incidence,Unnamed: 5,Unnamed: 6
0,U.S. jurisdiction,2019,2020,2021,2019.00,2020.00,2021.00
1,Total,8900,7173,7860,2.71,2.16,2.37
2,Alabama,87,72,92,1.77,1.43,1.83
3,Alaska,58,58,58,7.91,7.92,7.92
4,Arizona,183,136,129,2.51,1.89,1.77
...,...,...,...,...,...,...,...
48,Virginia,191,169,161,2.23,1.96,1.86
49,Washington,221,163,199,2.90,2.11,2.57
50,West Virginia,9,13,7,0.50,0.73,0.39
51,Wisconsin,51,35,66,0.88,0.59,1.12


*Side note*: there was a question last time on how pandas differentiates a comma delimiter vs. a comma within the field itself, e.g., `8,900`. Check out the documentation for the `quotechar` parameter.
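As a sketch of how these parameters interact, the first lines of the TSV (copied from the `repr` output above) can be parsed from an in-memory string. The `header=1` and `thousands=','` settings mirror the earlier CSV cell; `quotechar='"'` is the pandas default and is spelled out only for clarity.

```python
import io
import pandas as pd

tsv_text = (
    "\tNo. of TB cases\t\t\tTB incidence\t\t\n"
    "U.S. jurisdiction\t2019\t2020\t2021\t2019\t2020\t2021\n"
    'Total\t"8,900"\t"7,173"\t"7,860"\t2.71\t2.16\t2.37\n'
    "Alabama\t87\t72\t92\t1.77\t1.43\t1.83\n"
)

df = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",        # fields are separated by tabs
    header=1,        # the second line holds the column names
    thousands=",",   # "8,900" should be read as the number 8900
    quotechar='"',   # quoted fields may contain the thousands separator
)
print(df)
print(df.dtypes)
```

With the header row set this way, the repeated year labels are disambiguated as `2019.1`, `2020.1` and `2021.1`, which are the names renamed in the first cell of this notebook.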

### JSON
The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date.

Let's first check out this website.

Next, let's download this file, saving it as a JSON (note the source URL file type).

In the interest of **reproducible data science** we will download the data programmatically. The course keeps helper functions for this in a **utils.py** file so that they can be reused in many different notebooks; here the helper is defined inline instead.

In [91]:
# just run this cell
#from Lec07ds100_utils import fetch_and_cache
import requests
from pathlib import Path
import time


def fetch_and_cache(data_url, file, data_dir=r"C:\Users\dell\Downloads", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default=r"C:\Users\dell\Downloads") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path object representing the file.
    """

    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok = True)
    file_path = data_dir / Path(file)
    # If the file already exists and we want to force a download then
    # delete the file first so that the creation date is correct.
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
        last_modified_time = time.ctime(file_path.stat().st_mtime)
    else:
        last_modified_time = time.ctime(file_path.stat().st_mtime)
        print("Using cached version that was downloaded (UTC):", last_modified_time)
    return file_path


In [92]:
covid_file = fetch_and_cache(
    "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
     "confirmed-cases.json",
    force=False)
covid_file          # a file path wrapper object

Using cached version that was downloaded (UTC): Mon Oct  2 10:53:36 2023


WindowsPath('C:/Users/dell/Downloads/confirmed-cases.json')

#### File size

Often, I like to start my analysis by getting a rough estimate of the size of the data.  This will help inform the tools I use and how I view the data.  If it is relatively small I might use a text editor or a spreadsheet to look at the data.  If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.

However here we will use Python tools to probe the file.

Since these seem to be text files I might also want to investigate the number of lines, which often corresponds to the number of records.

You can use `os.path.getsize` to find the size of the file at a given path, and read the file's lines with `readlines()` to count them.

In [139]:
import os

# File size
byt_size = os.path.getsize(r'C:\Users\dell\Downloads\confirmed-cases.json')
print(f"File Size: {byt_size / 1000}KB")

# Number of lines
with open(r'C:\Users\dell\Downloads\confirmed-cases.json') as f:
    lines = f.readlines()
    print(len(lines))



File Size: 219.138KB
1798


As part of your workflow, you should also learn some basic Unix commands, as these are often very handy (in fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com) that explores this idea in depth!).

In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax `{expr}`.

Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form". We also use the `wc` command for "word count", but with the `-l` flag, which asks for line counts instead of words.

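On a Unix-like system, these two give us the same information as the code above, albeit in a slightly different form. The run recorded below was made on Windows, whose default shell does not provide `ls` or `wc`, so both commands fail: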

In [147]:
!ls -lh {covid_file}
!wc -l {covid_file}

'ls' is not recognized as an internal or external command,
operable program or batch file.
'wc' is not recognized as an internal or external command,
operable program or batch file.
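Where the Unix tools are unavailable, roughly the same information can be gathered with the standard library alone. The helper `file_summary` below is a hypothetical name for this sketch, and the demo runs on a small throwaway file rather than on the downloaded dataset.

```python
import tempfile
from itertools import islice
from pathlib import Path


def file_summary(path, n_head=5):
    """Return (size in bytes, line count, first n_head lines) of a text file."""
    path = Path(path)
    size_bytes = path.stat().st_size          # comparable to `ls -l`
    with path.open(encoding="utf-8") as f:
        head = [line.rstrip("\n") for line in islice(f, n_head)]  # like `head`
    with path.open(encoding="utf-8") as f:
        n_lines = sum(1 for _ in f)           # like `wc -l`, one line at a time
    return size_bytes, n_lines, head


# Demo on a throwaway file (contents made up for the example).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.json"
    demo.write_text('{\n  "meta" : {},\n  "data" : []\n}\n', encoding="utf-8")
    size_bytes, n_lines, head = file_summary(demo, n_head=2)

print(size_bytes, n_lines, head)
```

Counting with a generator avoids holding every line in memory, which matters more as files grow.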


#### File contents

Because we have a text file in a visual IDE like Jupyter/DataHub, I'm going to visually explore the data via the built-in file explorer.

1. To the Jupyter view!

2. To the Python view...?

In [141]:
covid_file

WindowsPath('C:/Users/dell/Downloads/confirmed-cases.json')

In [143]:
with open(covid_file, "r") as f:
  # visually explore the data by printing the file line by line
  for i, ln in enumerate(f):
    print(ln)

{

  "meta" : {

    "view" : {

      "id" : "xn6j-b766",

      "name" : "COVID-19 Confirmed Cases",

      "assetType" : "dataset",

      "attribution" : "City of Berkeley",

      "averageRating" : 0,

      "category" : "Health",

      "createdAt" : 1587074071,

      "description" : "Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.",

      "displayType" : "table",

      "downloadCount" : 3839,

      "hideFromCatalog" : false,

      "hideFromDataJson" : false,

      "newBackend" : true,

      "numberOfComments" : 0,

      "oid" : 37306599,

      "provenance" : "official",

      "publicationAppendEnabled" : false,

      "publicationDate" : 1623695944,

      "publicationGroup" : 17032857,

      "publicationStage" : "published",

      "rowsUpdatedAt" : 1695220732,

      "rowsUpdatedBy" : "g3qt-vv5v",

      "tableId" : 18345932

In the same vein, we can use the `head` Unix command (which is where pandas' `head` method comes from!) to see the first few lines of the file. Like `ls` and `wc`, it is not available in the default Windows shell, hence the error below:

In [144]:
!head -5 {covid_file}

'head' is not recognized as an internal or external command,
operable program or batch file.


1. Back to the Python view.

    In order to load the JSON file into pandas, let's first do some **EDA** with the Python `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into pandas.

### EDA: Digging into JSON

Python has relatively good support for JSON data since it closely matches the internal python object model.  In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.

In [149]:
import json

# load the file using json; the with-block closes the file for us
with open(covid_file) as f:
    covid_data = json.load(f)

In [150]:
covid_data.keys()

dict_keys(['meta', 'data'])

The `covid_data` variable is now a dictionary encoding the data in the file:

In [151]:
# keys of the nested meta -> view dictionary
covid_data['meta']['view'].keys()

dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'clientContext', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

In [152]:
covid_data['meta']['view']['description']

'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.'

#### Examine what keys are in the top level json object

We can list the keys to determine what data is stored in the object.

In [153]:
# list the keys of the top-level JSON object
covid_data.keys()

dict_keys(['meta', 'data'])

**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data).  Metadata is often maintained alongside the data and can be a good source of additional information.

<br/>

We can investigate the metadata further by examining what is stored under it, for example the `columns` entry of `meta` -> `view`.

In [154]:
# column metadata stored under meta -> view
covid_data['meta']['view']['columns']

[{'id': -1,
  'name': 'sid',
  'dataTypeName': 'meta_data',
  'fieldName': ':sid',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'id',
  'dataTypeName': 'meta_data',
  'fieldName': ':id',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'position',
  'dataTypeName': 'meta_data',
  'fieldName': ':position',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_at',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_at',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_meta',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_meta',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'updated_at',
  'dataTypeName': 'meta_data',
  'fieldName': ':updated_at'

In [155]:
type(covid_data['data'])

list

The `meta` key contains another dictionary called `view`.  This likely refers to meta-data about a particular "view" of some underlying database.  We will learn more about views when we study SQL later in the class.    

In [156]:
# number of fields in each of the first five records
for i, ln in zip(range(0,5),covid_data['data']):
  print(len(ln))

11
11
11
11
11


Notice that this is a nested/recursive data structure.  As we dig deeper we reveal more and more keys and the corresponding data:

```
meta
|-> view
    | -> id
    | -> name
    | -> attribution
    ...
    | -> description
    ...
    | -> columns
    ...
data
| ... (haven't explored yet)
```
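An outline like the one above can also be produced programmatically. The function `outline` below is a hypothetical helper written for this sketch, and `mini` is a hand-made miniature of the file's structure (the `id` and `name` values are copied from the output above; everything else is left empty).

```python
def outline(obj, max_depth=3, depth=0):
    """Return indented 'key (type)' lines for nested dictionaries."""
    lines = []
    if not isinstance(obj, dict) or depth >= max_depth:
        return lines
    for key, value in obj.items():
        lines.append("    " * depth + f"{key} ({type(value).__name__})")
        # recurse into nested dictionaries, one indentation level deeper
        lines.extend(outline(value, max_depth, depth + 1))
    return lines


mini = {
    "meta": {
        "view": {
            "id": "xn6j-b766",
            "name": "COVID-19 Confirmed Cases",
            "columns": [],
        }
    },
    "data": [],
}

print("\n".join(outline(mini)))
```

Limiting the depth keeps the printout short for files whose nesting goes much deeper than this one.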

There is a key called description in the view sub dictionary.  This likely contains a description of the data:

In [157]:
# description stored in the view sub-dictionary
covid_data['meta']['view']["description"]

'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.'


#### Examining the Data Field for Records

We can look at a few entries in the `data` field. This is what we'll load into Pandas.


In [159]:
for i in range(3):
    print(f"{i:03} | {covid_data['data'][i]}")

000 | ['row-hnud-24uw.xb8m', '00000000-0000-0000-D120-90E10D5E4D52', 0, 1695220732, None, 1695220732, None, '{ }', '2019-12-01T00:00:00', '0', '0']
001 | ['row-p6cs_iytt_geg6', '00000000-0000-0000-54C6-A436CE07F9A0', 0, 1695220732, None, 1695220732, None, '{ }', '2019-12-02T00:00:00', '0', '0']
002 | ['row-ket5~zzzf-58km', '00000000-0000-0000-1856-296591B17540', 0, 1695220732, None, 1695220732, None, '{ }', '2019-12-03T00:00:00', '0', '0']


Observations:
* These look like equal-length records, so maybe `data` is a table!
* But what do each of values in the record mean? Where can we find column headers?

Back to the metadata.

#### Columns Metadata

Another potentially useful key in the metadata dictionary is `columns`.  This returns a list:

In [160]:
# list of column metadata dictionaries
covid_data['meta']['view']['columns']

[{'id': -1,
  'name': 'sid',
  'dataTypeName': 'meta_data',
  'fieldName': ':sid',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'id',
  'dataTypeName': 'meta_data',
  'fieldName': ':id',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'position',
  'dataTypeName': 'meta_data',
  'fieldName': ':position',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_at',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_at',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_meta',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_meta',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'updated_at',
  'dataTypeName': 'meta_data',
  'fieldName': ':updated_at'

In [161]:
# Column names
col_names = []
for val in covid_data['meta']['view']['columns']:
  col_names.append(val["name"])
col_names

['sid',
 'id',
 'position',
 'created_at',
 'created_meta',
 'updated_at',
 'updated_meta',
 'meta',
 'Date',
 'New Cases',
 'Cumulative Cases']

Let's go back to the file explorer.

Based on the contents of this key, what are reasonable names for each column in the `data` table?

You can also get the view that Jupyter provides in the file explorer by using Python. This displays our JSON object as an interactive graphical object with a built-in search box:

In [163]:
from IPython.display import JSON
JSON(covid_data)

<IPython.core.display.JSON object>

#### Summary of exploring the JSON file

1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
1. Self documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.

### JSON with pandas

After our above EDA, let's finally go about loading the data (not the metadata) into a pandas dataframe.

In the following blocks of code we:
1. Translate the JSON records into a dataframe:

    * fields: `covid_data['meta']['view']['columns']`
    * records: `covid_data['data']`
    
1. Examine the `tail` of the table.
1. Strip the `row-` prefix from the `sid` column. The hidden bookkeeping columns (`sid`, `id`, `position`, ...) are kept here. Removing columns would be a bad idea in general, but the above analysis suggests that these are unlikely to contain useful information, so they could be dropped.

In [164]:
# Load the data from JSON and assign column titles
covid_df = pd.DataFrame(covid_data["data"], columns=col_names)
covid_df.tail()


Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,Date,New Cases,Cumulative Cases
1382,row-6imp-imjs-6h2k,00000000-0000-0000-601A-112EFE08B774,0,1695220732,,1695220732,,{ },2023-09-13T00:00:00,7,24005
1383,row-xm5s~rwa2_nbdx,00000000-0000-0000-C844-79D3361B4C15,0,1695220732,,1695220732,,{ },2023-09-14T00:00:00,7,24012
1384,row-7yz4~pbe5-nfim,00000000-0000-0000-E0F5-9365AFC0FC21,0,1695220732,,1695220732,,{ },2023-09-15T00:00:00,6,24018
1385,row-trdy_pu8j.g89w,00000000-0000-0000-38AF-38C924E84128,0,1695220732,,1695220732,,{ },2023-09-16T00:00:00,1,24019
1386,row-tbbq_hk62.izic,00000000-0000-0000-BE70-547100C87241,0,1695220732,,1695220732,,{ },2023-09-17T00:00:00,0,24019


In [167]:
covid_df['sid'] = covid_df.sid.str.replace("row-", '')
covid_df.head()

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,Date,New Cases,Cumulative Cases
0,hnud-24uw.xb8m,00000000-0000-0000-D120-90E10D5E4D52,0,1695220732,,1695220732,,{ },2019-12-01T00:00:00,0,0
1,p6cs_iytt_geg6,00000000-0000-0000-54C6-A436CE07F9A0,0,1695220732,,1695220732,,{ },2019-12-02T00:00:00,0,0
2,ket5~zzzf-58km,00000000-0000-0000-1856-296591B17540,0,1695220732,,1695220732,,{ },2019-12-03T00:00:00,0,0
3,znsg.x2gj.zsq5,00000000-0000-0000-CBD3-B8DB138CC82F,0,1695220732,,1695220732,,{ },2019-12-04T00:00:00,0,0
4,45y3.e3sv.3cn6,00000000-0000-0000-5E4F-018801583977,0,1695220732,,1695220732,,{ },2019-12-05T00:00:00,0,0


<br/>

---


## Temporality

Let's briefly look at how we can use pandas `dt` accessors to work with dates/times in a dataset.

We will use the dataset from Lab 3: the Berkeley PD Calls for Service dataset.

In [169]:
calls = pd.read_csv(r"C:\Users\dell\Downloads\Berkeley_PD_-_Calls_for_Service.csv")
calls.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA


Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.

Most likely, `EVENTDT` is the date when the event took place, `EVENTTM` is the time of day when the event took place (in 24-hour format), and `InDbDate` is the date when the call was recorded in the database.

If we check the data types of these columns, we will see that they are stored as strings. We can convert them to `datetime` objects using the pandas `to_datetime` function.

In [174]:
# Convert the EVENTDT strings to datetime objects
calls['EVENTDT'] = pd.to_datetime(calls['EVENTDT'])
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,21014296,THEFT MISD. (UNDER $950),2021-04-01,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
1,21014391,THEFT MISD. (UNDER $950),2021-04-01,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
2,21090494,THEFT MISD. (UNDER $950),2021-04-19,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA
3,21090204,THEFT FELONY (OVER $950),2021-02-13,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA
4,21090179,BURGLARY AUTO,2021-02-08,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA
...,...,...,...,...,...,...,...,...,...,...,...
2627,20058742,BURGLARY RESIDENTIAL,2020-12-21,12:45,BURGLARY - RESIDENTIAL,1,06/15/2021 12:00:00 AM,"1300 BLOCK UNIVERSITY AVE\nBerkeley, CA\n(37.8...",1300 BLOCK UNIVERSITY AVE,Berkeley,CA
2628,21008017,BRANDISHING,2021-02-24,15:06,WEAPONS OFFENSE,3,06/15/2021 12:00:00 AM,"100 BLOCK SEAWALL DR\nBerkeley, CA\n(37.863611...",100 BLOCK SEAWALL DR,Berkeley,CA
2629,21013239,THEFT FELONY (OVER $950),2021-03-24,0:00,LARCENY,3,06/15/2021 12:00:00 AM,"2800 BLOCK HILLEGASS AVE\nBerkeley, CA\n(37.85...",2800 BLOCK HILLEGASS AVE,Berkeley,CA
2630,21018143,THEFT MISD. (UNDER $950),2021-04-24,18:35,LARCENY,6,06/15/2021 12:00:00 AM,"2500 BLOCK TELEGRAPH AVE\nBerkeley, CA\n(37.86...",2500 BLOCK TELEGRAPH AVE,Berkeley,CA
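
When the layout of the date strings is known, it can be passed to `to_datetime` explicitly. The following is a minimal sketch on a hypothetical series `raw`; the format string is our reading of values such as `04/01/2021 12:00:00 AM` shown in the table, and `errors="coerce"` is an optional choice that turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample in the same layout as the EVENTDT strings,
# plus one deliberately malformed entry.
raw = pd.Series(["04/01/2021 12:00:00 AM", "02/13/2021 12:00:00 AM", "not a date"])

# %I is the 12-hour clock hour and %p is the AM/PM marker.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

print(parsed.dtype)         # a datetime64 dtype
print(parsed.isna().sum())  # number of entries that failed to parse
```

Stating the format makes the parse unambiguous (month first, then day), and counting the `NaT` values afterwards is a quick check on how many rows did not match it.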


In [180]:
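# Convert the EVENTTM strings to datetime objects.
# EVENTTM holds only a time of day, so the date part shown in the
# output below (2023-10-08) is a filler and not part of the data.
calls['EVENTTM'] = pd.to_datetime(calls['EVENTTM'])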
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,21014296,THEFT MISD. (UNDER $950),2021-04-01,2023-10-08 10:58:00,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
1,21014391,THEFT MISD. (UNDER $950),2021-04-01,2023-10-08 10:38:00,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
2,21090494,THEFT MISD. (UNDER $950),2021-04-19,2023-10-08 12:15:00,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA
3,21090204,THEFT FELONY (OVER $950),2021-02-13,2023-10-08 17:00:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA
4,21090179,BURGLARY AUTO,2021-02-08,2023-10-08 06:20:00,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA
...,...,...,...,...,...,...,...,...,...,...,...
2627,20058742,BURGLARY RESIDENTIAL,2020-12-21,2023-10-08 12:45:00,BURGLARY - RESIDENTIAL,1,06/15/2021 12:00:00 AM,"1300 BLOCK UNIVERSITY AVE\nBerkeley, CA\n(37.8...",1300 BLOCK UNIVERSITY AVE,Berkeley,CA
2628,21008017,BRANDISHING,2021-02-24,2023-10-08 15:06:00,WEAPONS OFFENSE,3,06/15/2021 12:00:00 AM,"100 BLOCK SEAWALL DR\nBerkeley, CA\n(37.863611...",100 BLOCK SEAWALL DR,Berkeley,CA
2629,21013239,THEFT FELONY (OVER $950),2021-03-24,2023-10-08 00:00:00,LARCENY,3,06/15/2021 12:00:00 AM,"2800 BLOCK HILLEGASS AVE\nBerkeley, CA\n(37.85...",2800 BLOCK HILLEGASS AVE,Berkeley,CA
2630,21018143,THEFT MISD. (UNDER $950),2021-04-24,2023-10-08 18:35:00,LARCENY,6,06/15/2021 12:00:00 AM,"2500 BLOCK TELEGRAPH AVE\nBerkeley, CA\n(37.86...",2500 BLOCK TELEGRAPH AVE,Berkeley,CA
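
Because `EVENTTM` carries no date of its own, one option is to combine it with `EVENTDT` into a single timestamp. This is a minimal sketch on a hypothetical two-row frame `toy`; the `hour` and `event_ts` column names are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical two-row sample with the same column layout as `calls`.
toy = pd.DataFrame({
    "EVENTDT": pd.to_datetime(["2021-04-01", "2021-02-08"]),
    "EVENTTM": ["10:58", "6:20"],
})

# Parse the time of day with an explicit 24-hour format.
# Only the hour and minute of the result are used below.
clock = pd.to_datetime(toy["EVENTTM"], format="%H:%M")
toy["hour"] = clock.dt.hour

# Add the time of day to the event date to get one full timestamp.
toy["event_ts"] = (
    toy["EVENTDT"]
    + pd.to_timedelta(clock.dt.hour, unit="h")
    + pd.to_timedelta(clock.dt.minute, unit="m")
)
print(toy)
```

A single combined column avoids carrying a placeholder date around and makes it possible to sort or difference events at minute resolution.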


Now that these columns hold `datetime` values, we can use the `dt` accessor on them.

We can get the month:

In [182]:
# Extract the month number (1-12) from each event date
calls['month'] = calls['EVENTDT'].dt.month
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,month
0,21014296,THEFT MISD. (UNDER $950),2021-04-01,2023-10-08 10:58:00,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,4
1,21014391,THEFT MISD. (UNDER $950),2021-04-01,2023-10-08 10:38:00,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,4
2,21090494,THEFT MISD. (UNDER $950),2021-04-19,2023-10-08 12:15:00,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,4
3,21090204,THEFT FELONY (OVER $950),2021-02-13,2023-10-08 17:00:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,2
4,21090179,BURGLARY AUTO,2021-02-08,2023-10-08 06:20:00,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,2
...,...,...,...,...,...,...,...,...,...,...,...,...
2627,20058742,BURGLARY RESIDENTIAL,2020-12-21,2023-10-08 12:45:00,BURGLARY - RESIDENTIAL,1,06/15/2021 12:00:00 AM,"1300 BLOCK UNIVERSITY AVE\nBerkeley, CA\n(37.8...",1300 BLOCK UNIVERSITY AVE,Berkeley,CA,12
2628,21008017,BRANDISHING,2021-02-24,2023-10-08 15:06:00,WEAPONS OFFENSE,3,06/15/2021 12:00:00 AM,"100 BLOCK SEAWALL DR\nBerkeley, CA\n(37.863611...",100 BLOCK SEAWALL DR,Berkeley,CA,2
2629,21013239,THEFT FELONY (OVER $950),2021-03-24,2023-10-08 00:00:00,LARCENY,3,06/15/2021 12:00:00 AM,"2800 BLOCK HILLEGASS AVE\nBerkeley, CA\n(37.85...",2800 BLOCK HILLEGASS AVE,Berkeley,CA,3
2630,21018143,THEFT MISD. (UNDER $950),2021-04-24,2023-10-08 18:35:00,LARCENY,6,06/15/2021 12:00:00 AM,"2500 BLOCK TELEGRAPH AVE\nBerkeley, CA\n(37.86...",2500 BLOCK TELEGRAPH AVE,Berkeley,CA,4


We can also get the day of the week that each date falls on:

In [183]:
# Extract the weekday name from each event date
calls['day_of_week'] = calls['EVENTDT'].dt.day_name()
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,month,day_of_week
0,21014296,THEFT MISD. (UNDER $950),2021-04-01,2023-10-08 10:58:00,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,4,Thursday
1,21014391,THEFT MISD. (UNDER $950),2021-04-01,2023-10-08 10:38:00,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,4,Thursday
2,21090494,THEFT MISD. (UNDER $950),2021-04-19,2023-10-08 12:15:00,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,4,Monday
3,21090204,THEFT FELONY (OVER $950),2021-02-13,2023-10-08 17:00:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,2,Saturday
4,21090179,BURGLARY AUTO,2021-02-08,2023-10-08 06:20:00,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,2,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2627,20058742,BURGLARY RESIDENTIAL,2020-12-21,2023-10-08 12:45:00,BURGLARY - RESIDENTIAL,1,06/15/2021 12:00:00 AM,"1300 BLOCK UNIVERSITY AVE\nBerkeley, CA\n(37.8...",1300 BLOCK UNIVERSITY AVE,Berkeley,CA,12,Monday
2628,21008017,BRANDISHING,2021-02-24,2023-10-08 15:06:00,WEAPONS OFFENSE,3,06/15/2021 12:00:00 AM,"100 BLOCK SEAWALL DR\nBerkeley, CA\n(37.863611...",100 BLOCK SEAWALL DR,Berkeley,CA,2,Wednesday
2629,21013239,THEFT FELONY (OVER $950),2021-03-24,2023-10-08 00:00:00,LARCENY,3,06/15/2021 12:00:00 AM,"2800 BLOCK HILLEGASS AVE\nBerkeley, CA\n(37.85...",2800 BLOCK HILLEGASS AVE,Berkeley,CA,3,Wednesday
2630,21018143,THEFT MISD. (UNDER $950),2021-04-24,2023-10-08 18:35:00,LARCENY,6,06/15/2021 12:00:00 AM,"2500 BLOCK TELEGRAPH AVE\nBerkeley, CA\n(37.86...",2500 BLOCK TELEGRAPH AVE,Berkeley,CA,4,Saturday
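
The derived weekday can also be used to sanity-check the existing `CVDOW` column. In the rows shown, Monday appears as 1, Wednesday as 3, Thursday as 4, and Saturday as 6, which suggests a 0-6 coding that starts on Sunday; that coding is an inference from the visible rows, not something the dataset documents here. A minimal sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample using dates and CVDOW values from the rows shown above.
toy = pd.DataFrame({
    "EVENTDT": pd.to_datetime(["2021-04-01", "2021-04-19", "2021-02-13", "2021-02-24"]),
    "CVDOW": [4, 1, 6, 3],
})

# pandas codes Monday as 0 and Sunday as 6; shift by one so that
# Sunday becomes 0 and Saturday becomes 6 (the assumed CVDOW coding).
derived = (toy["EVENTDT"].dt.dayofweek + 1) % 7

# Rows where the recorded and derived weekday disagree.
mismatches = toy[derived != toy["CVDOW"]]
print(len(mismatches))
```

Any non-empty `mismatches` frame on the full data would point either to a different coding than assumed or to inconsistent records worth inspecting.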


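Check the minimum value to see whether there are any suspicious-looking dates from the 1970s. Such dates often appear when a missing timestamp stored as 0 is decoded as UNIX time, which starts on January 1, 1970: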

In [184]:
# Find the earliest event date and test whether it falls in the 1970s
min_date = calls['EVENTDT'].min()
is_suspicious = pd.Timestamp('1970-01-01') <= min_date <= pd.Timestamp('1979-12-31')

if is_suspicious:
    print("The minimum date is in the 1970s.")
else:
    print("The minimum date is not in the 1970s.")



The minimum date is not in the 1970s.


Doesn't look like it! We are good!


The `dt` accessor supports much more, such as switching time zones and converting times back to UNIX/POSIX time. Check out the documentation on the [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
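
As a brief illustration of time zones and UNIX time, here is a minimal sketch on a hypothetical series `naive`. It assumes the timestamps are Pacific local time, which is plausible for Berkeley but not stated by the dataset, and it computes epoch seconds by subtracting the epoch and dividing by a one-second `Timedelta`.

```python
import pandas as pd

# Hypothetical timezone-naive timestamps, assumed to be Pacific local time.
naive = pd.Series(pd.to_datetime(["2021-04-01 10:58", "2021-02-08 06:20"]))

# Attach the assumed local zone, then convert to UTC.
local = naive.dt.tz_localize("America/Los_Angeles")
utc = local.dt.tz_convert("UTC")

# Seconds elapsed since the UNIX epoch (1970-01-01 00:00:00 UTC).
epoch_seconds = (utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")

print(utc)
print(epoch_seconds)
```

`tz_localize` only labels the existing clock times with a zone, while `tz_convert` changes the clock times to another zone; mixing the two up is a common source of off-by-hours errors.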