# 06.01.01 - CSVImportExport

## Purpose

This notebook covers reading and writing CSV files.  Please note that we won't be doing any real analysis on these files.  We just pull them in, perform a few simple operations, and export the results.

This notebook is part of a set of notebooks about file input/output.

## Requirements

This notebook reads from the data directory within the GitHub repo.  Pulling this notebook by itself will likely NOT work; you will need to pull down the entire repository.

This notebook also reads a CSV from an online source, so internet access is required when you run it.

This notebook was developed in PyCharm.  If you are using Jupyter or a different environment, you may need to modify some paths to get it working.  Review the _dataDir_ and _exportDir_ variables below and modify them as needed.  Both are directory names, and both directories need to exist.  Subsequent runs will overwrite files in exportDir, which is not version controlled.

In [1]:
# IMPORTANT: CHANGE THIS PATH IF IT DOESN'T WORK FOR YOU!!!
dataDir = "../../data/"
exportDir = "../../export/"

import pandas as pd
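
Since the export directory has to exist before anything is written to it, one option is to create it from code. The following is a minimal sketch, not part of the original notebook: it uses a temporary folder as a hypothetical stand-in for exportDir so that it runs anywhere, and the names `base`, `export_path`, and `target` are made up for illustration.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in for exportDir: a temporary folder keeps this sketch self-contained.
base = Path(tempfile.mkdtemp())
export_path = base / "export"

# Create the directory (and any missing parents); do nothing if it already exists.
export_path.mkdir(parents=True, exist_ok=True)

# Joining with "/" on Path objects avoids doubled or missing separators.
target = export_path / "davidFrequency.csv"
print(export_path.is_dir(), target.name)
```

In the notebook itself, the same `mkdir` call could be pointed at `Path(exportDir)` instead of the temporary folder.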

# Reading from on-disk locations

In [2]:
boyNames = pd.read_csv(f"{dataDir}/BoyNames.csv")
boyNames

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273
3,4,1980,Matthew,2112
4,5,1980,David,2088
...,...,...,...,...
845,21,2013,NATHAN,470
846,22,2013,ANDREW,468
847,23,2013,HENRY,463
848,24,2013,DAVID,461


Once loaded, the file is an ordinary DataFrame, so working with it is pretty much the same as we've seen.  We can filter rows and pull out the columns we need.

In [3]:
# Keep the rows whose Name is exactly "David", select three columns, and renumber the index.
davidFrequency = boyNames.loc[boyNames['Name'] == "David", ["Name", "Year", "Frequency"]]
davidFrequency = davidFrequency.reset_index(drop=True)
davidFrequency

Unnamed: 0,Name,Year,Frequency
0,David,1980,2088
1,David,1981,2043
2,David,1982,1983
3,David,1983,1940
4,David,1984,1847
5,David,1985,1932
6,David,1986,1749
7,David,1987,1707
8,David,1988,1647
9,David,1989,1616
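
The filter above is case-sensitive, which is likely why only the 1980s rows come back: the full table shows a 2013 entry spelled "DAVID". A minimal sketch of a case-insensitive match follows, assuming a small made-up frame (`sample`, hypothetical) that reuses two rows from the output shown earlier.

```python
import pandas as pd

# Hypothetical two-row sample; the values are copied from the BoyNames output above.
sample = pd.DataFrame({
    "Rank": [5, 24],
    "Year": [1980, 2013],
    "Name": ["David", "DAVID"],
    "Frequency": [2088, 461],
})

# Exact comparison only matches the mixed-case spelling.
exact = sample[sample["Name"] == "David"]

# Lower-casing the column first matches both spellings.
any_case = sample[sample["Name"].str.lower() == "david"]

print(len(exact), len(any_case))
```

Whether the upper-case rows should be included depends on what the data is meant to show, so treat this as an option rather than a correction.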


# Writing to an on-disk location

In [4]:
davidFrequency.to_csv(f"{exportDir}/davidFrequency.csv")

# Note: to_csv prints nothing when it succeeds, so check the export directory (exportDir, set above) to confirm the file was written.
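
One detail of the export is worth knowing: by default, `to_csv` also writes the DataFrame index as an unnamed first column, and reading the file back then produces a column called `Unnamed: 0`. The sketch below illustrates this with an in-memory buffer instead of a file on disk; the frame and variable names are made up for the example.

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same columns as davidFrequency.
frame = pd.DataFrame({
    "Name": ["David", "David"],
    "Year": [1980, 1981],
    "Frequency": [2088, 2043],
})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

If the index carries no meaning, as with a freshly reset one, passing `index=False` keeps the exported file cleaner.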

# Reading from an online resource
Note that to read a file from a GitHub repo, you need to use its "raw" URL.

In [5]:
webUsers = pd.read_csv("https://raw.githubusercontent.com/TheDarkTrumpet/BAIS-6040-0EXP-spr2021/master/data/web-users.csv")
webUsers

Unnamed: 0,DATE,USERS
0,01/01/2016,73404
1,01/02/2016,66795
2,01/03/2016,57185
3,01/04/2016,106916
4,01/05/2016,99982
...,...,...
1340,10/02/2019,164249
1341,10/03/2019,159935
1342,10/04/2019,159628
1343,10/05/2019,131067


Working with this data is exactly the same as with any other Pandas object.  From the point it's loaded onward, it's standard operations.  One important note is that in some cases we need to manually convert what comes back to us.  The DATE column from web-users was read in as text rather than as datetimes, so we need to convert it before we can filter by date.

In [6]:
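# The dates are month/day/year; stating the format avoids any ambiguity in parsing.
webUsers['DATE'] = pd.to_datetime(webUsers['DATE'], format="%m/%d/%Y")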
webUsers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1345 entries, 0 to 1344
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   DATE    1345 non-null   datetime64[ns]
 1   USERS   1345 non-null   int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 21.1 KB
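
As an alternative to converting after the fact, `read_csv` can be asked to parse the column while loading. This is a minimal sketch that uses an inline string (`csv_text`, hypothetical) in place of the online file so that it runs without network access; the two rows are copied from the output above.

```python
import io
import pandas as pd

# Hypothetical inline stand-in for web-users.csv.
csv_text = "DATE,USERS\n01/01/2016,73404\n01/02/2016,66795\n"

# Without parse_dates, DATE comes back as text.
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, DATE is converted to datetimes during the read.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

print(as_text["DATE"].dtype)
print(as_dates["DATE"].dtype)
```

Both approaches end with a datetime column; parsing at read time just saves the separate conversion step.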


In [7]:
usersIn2016 = webUsers[(webUsers['DATE'] >= '2016-01-01') & (webUsers['DATE'] <= '2016-12-31')]
usersIn2016

Unnamed: 0,DATE,USERS
0,2016-01-01,73404
1,2016-01-02,66795
2,2016-01-03,57185
3,2016-01-04,106916
4,2016-01-05,99982
...,...,...
331,2016-12-27,96198
332,2016-12-28,102165
333,2016-12-29,96002
334,2016-12-30,95243
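
Once the column holds datetimes, the same year filter can be written in other ways. The sketch below shows two alternatives on a small made-up frame (`sample`, hypothetical) whose rows are taken from the outputs above; it is offered for comparison, not as a replacement for the filter used in this notebook.

```python
import pandas as pd

# Hypothetical three-row sample; the values are copied from the web-users output above.
sample = pd.DataFrame({
    "DATE": pd.to_datetime(["2016-01-01", "2016-12-30", "2019-10-05"]),
    "USERS": [73404, 95243, 131067],
})

# Option 1: compare the year component directly.
in_2016 = sample[sample["DATE"].dt.year == 2016]

# Option 2: an inclusive range check, equivalent to the >= / <= pair.
in_2016_between = sample[sample["DATE"].between("2016-01-01", "2016-12-31")]

print(len(in_2016), len(in_2016_between))
```

The `.dt.year` form avoids spelling out the boundary dates, while `between` is handy when the range does not line up with a calendar year.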


In [8]:
usersIn2016.to_csv(f"{exportDir}/2016Users.csv")
