## Rationale
Data that can be fruitfully analyzed with Pandas exist in many forms and locations. Moreover, after cleaning or analyzing your data, you need to be able to save the results. Pandas therefore includes several methods designed to make file I/O (input/output) work as smoothly as possible.

## Reading/writing flat files
The workhorse method for reading data into a Pandas DataFrame is pd.read_csv(). CSV stands for "Comma Separated Values," a very common file format for tabular data, since it is human-readable and needs only a few extra characters to specify what goes where. In its most basic usage, this method takes just a path to a file and returns a DataFrame:

In [4]:
import pandas as pd

In [8]:
df = pd.read_csv('csv/merged_data.csv')

In [10]:
df

Unnamed: 0.1,Unnamed: 0,county_code,COUNTY,STATEABBREVIATION,YEAR,AMAT_fac,HIVdiagnoses,HIVincidence,HIVprevalence,MH_fac,...,pctunmetneed,nonmedpain,ADULTMEN,MSM12MTH,MSM5YEAR,%msm12month,%msm5yr,unemployment_rate,poverty_rate,household_income
0,0,1001,Autauga County,AL,2015,0.0,5.0,10.9,225.5,1.0,...,95.70,5.12,19410,333,514,1.715611,2.648120,8.5,12.8,20304
1,1,1003,Baldwin County,AL,2015,0.0,15.0,8.7,163.9,4.0,...,91.34,5.27,69724,925,1429,1.326659,2.049509,8.6,13.8,73058
2,2,1005,Barbour County,AL,2015,0.0,0.0,0.0,436.0,1.0,...,91.34,5.27,11567,82,127,0.708913,1.097951,14.2,24.1,9145
3,3,1007,Bibb County,AL,2015,0.0,0.0,0.0,191.9,0.0,...,91.86,5.62,9508,119,184,1.251578,1.935212,10.9,17.0,7078
4,4,1009,Blount County,AL,2015,0.0,5.0,10.4,95.4,1.0,...,91.86,5.62,21368,601,928,2.812617,4.342943,9.3,17.3,20934
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3135,3135,56037,Sweetwater County,WY,2015,0.0,0.0,0.0,86.5,3.0,...,87.02,3.38,16941,177,274,1.044803,1.617378,5.6,12.2,16687
3136,3136,56039,Teton County,WY,2015,0.0,0.0,0.0,50.5,2.0,...,89.16,3.42,9172,50,78,0.545137,0.850414,3.6,8.5,7873
3137,3137,56041,Uinta County,WY,2015,0.0,0.0,0.0,0.0,4.0,...,87.02,3.38,7401,75,116,1.013377,1.567356,5.8,14.2,7557
3138,3138,56043,Washakie County,WY,2015,0.0,0.0,0.0,0.0,1.0,...,86.12,3.26,3141,17,27,0.541229,0.859599,7.9,14.2,3461


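Because pd.read_csv() returns a DataFrame, you will typically assign the result to a variable, as with df above. Despite its name, pd.read_csv() can be used with files whose fields are separated by something other than commas. For example, you would load a file that uses the tab character as a separator by passing sep = '\t'. Note that the call below is applied to the same comma-separated file as before, so no tabs are found and each row ends up in a single column: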

In [27]:
df = pd.read_csv('csv/merged_data.csv', sep = '\t')

In [28]:
df

Unnamed: 0,",county_code,COUNTY,STATEABBREVIATION,YEAR,AMAT_fac,HIVdiagnoses,HIVincidence,HIVprevalence,MH_fac,Med_AMAT_fac,Med_MH_fac,Med_SA_fac,Med_SMAT_fac,Med_TMAT_fac,PLHIV,Population,SA_fac,SMAT_fac,TMAT_fac,drugdeathrate,drugdeathrate_est,drugdeaths,mme_percap,partD30dayrxrate,pctunins,num_SSPs,bup_phys,drugdep,pctunmetneed,nonmedpain,ADULTMEN,MSM12MTH,MSM5YEAR,%msm12month,%msm5yr,unemployment_rate,poverty_rate,household_income"
0,"0,1001,Autauga County,AL,2015,0.0,5.0,10.9,225..."
1,"1,1003,Baldwin County,AL,2015,0.0,15.0,8.7,163..."
2,"2,1005,Barbour County,AL,2015,0.0,0.0,0.0,436...."
3,"3,1007,Bibb County,AL,2015,0.0,0.0,0.0,191.9,0..."
4,"4,1009,Blount County,AL,2015,0.0,5.0,10.4,95.4..."
...,...
3135,"3135,56037,Sweetwater County,WY,2015,0.0,0.0,0..."
3136,"3136,56039,Teton County,WY,2015,0.0,0.0,0.0,50..."
3137,"3137,56041,Uinta County,WY,2015,0.0,0.0,0.0,0...."
3138,"3138,56043,Washakie County,WY,2015,0.0,0.0,0.0..."
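In the output above, each row landed in a single column, which is what happens when the separator does not match the file's contents. As a minimal sketch of the case where it does match, the following reads a small, made-up tab-separated table from an in-memory string (the data and the names tsv_text and tsv_df are illustrative assumptions, not part of the original dataset):

```python
import io

import pandas as pd

# Hypothetical tab-separated data, held in memory instead of a file on disk.
tsv_text = (
    "county\tstate\tyear\n"
    "Autauga County\tAL\t2015\n"
    "Baldwin County\tAL\t2015\n"
)

# sep="\t" tells read_csv to split fields on tab characters.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_df.shape)          # (2, 3)
print(list(tsv_df.columns))  # ['county', 'state', 'year']
```

pd.read_csv() accepts a file-like object as well as a path, which is why io.StringIO can stand in for a real file here.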


There are a huge number of ways to construct files from tables of data that follow this general pattern, with columns separated by some special character or characters and rows separated by new-line characters. To handle them, pd.read_csv() includes a large number of optional keywords. For example: Does the file include a header with the names of the columns? Does it include an index of the rows? Are there other portions of the file that need to be treated differently from the main body of the table? Do you want Pandas to try to infer the datatype of each column, or do you want to specify what is expected? These are just some of the more common considerations that must be dealt with on a case-by-case basis when reading a new dataset. As usual, the Pandas docs are a good reference: see the I/O tutorial and the standard docstring for pd.read_csv().
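As a rough sketch of how a few of these keywords answer the questions above, the example below reads a small, invented table that has two extra lines before its header. The file contents and the names raw_text and options_df are assumptions made for illustration; skiprows, header, index_col, and dtype are existing pd.read_csv() keywords.

```python
import io

import pandas as pd

# Hypothetical file contents: two note lines before the real header,
# and an "id" column that should serve as the row index.
raw_text = (
    "# exported 2015\n"
    "# source: example\n"
    "id,county,rate\n"
    "1001,Autauga County,8.5\n"
    "1003,Baldwin County,8.6\n"
)

options_df = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=2,                 # ignore the two leading note lines
    header=0,                   # the first remaining line holds the column names
    index_col="id",             # use the "id" column as the row index
    dtype={"rate": "float64"},  # state the expected dtype instead of inferring it
)

print(list(options_df.columns))  # ['county', 'rate']
```

Which keywords are needed depends entirely on the file at hand, so it is usually worth inspecting the first few lines of a new dataset before choosing them.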

Once you have manipulated your data in some way and want to save a copy, you may want to use a CSV file for this purpose.

In [15]:
df.to_csv('csv/copy_merged_data.csv')

As you can see, .to_csv() is a method on a DataFrame (and Series) for saving data in Pandas to a flat file. The syntax and options are similar to what you will find in pd.read_csv(). By default, .to_csv() writes the column names as a header, writes the index as the left-most column, and uses commas as separators. If your index is arbitrary, such as just a sequence of integers, you might want to set index = False, since there is no need to save it to disk.
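To make the effect of the index keyword concrete, here is a small sketch using a made-up two-row DataFrame (small_df and its values are illustrative assumptions). When no path is given, .to_csv() returns the CSV text as a string, which makes the difference easy to inspect:

```python
import io

import pandas as pd

# Hypothetical data with an arbitrary integer index (0, 1).
small_df = pd.DataFrame(
    {"county": ["Autauga County", "Baldwin County"], "rate": [8.5, 8.6]}
)

# With no path argument, to_csv() returns the CSV text instead of writing a file.
with_index = small_df.to_csv()
without_index = small_df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,county,rate
print(without_index.splitlines()[0])  # county,rate

# Reading the index-free text back gives the same two columns.
round_trip = pd.read_csv(io.StringIO(without_index))
```

Reading back a file that was saved with the default index produces an extra column with a name like "Unnamed: 0", which is the likely origin of the "Unnamed" columns in the merged_data.csv output shown earlier.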

## Other file types
Pandas includes methods for importing several other commonly used file types, all of which are listed in the docs. These include JSON files, Microsoft Excel spreadsheets, and HTML tables. Your data does not even have to be saved to disk: pd.read_clipboard() works with data you have copied to your computer's clipboard.
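The other readers and writers follow the same pattern as the CSV ones. As one hedged illustration, the sketch below round-trips an invented DataFrame through JSON text in memory (records_df and its contents are assumptions; .to_json() and pd.read_json() are the Pandas counterparts of .to_csv() and pd.read_csv()):

```python
import io

import pandas as pd

# Hypothetical data for illustration.
records_df = pd.DataFrame({"county_code": [1001, 1003], "state": ["AL", "AL"]})

# orient="records" produces a JSON list with one object per row.
json_text = records_df.to_json(orient="records")

# read_json accepts a path or a file-like object; StringIO stands in for a file.
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_json.shape)  # (2, 2)
```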

Several other methods work with formats that are used for larger datasets, such as Parquet and HDF5, or with output from statistical software packages, like SAS and Stata.

## Reading/writing to a SQL database
The major remaining way to read data into Pandas is by connecting to a SQL database. You can submit a query to the database with pd.read_sql_query() or simply request an entire table with pd.read_sql_table(). Both of these methods (along with the one for writing to the database, .to_sql()) depend on another Python library to manage your connection to the database.

**SQLAlchemy** is a widely used package that supports several flavors of **SQL**, and it is used in the examples in the **Pandas docs**. SQLAlchemy is quite extensive and has a lot of functionality beyond the relatively simple need for a database connection here. A lighter-weight package designed specifically to work with **PostgreSQL** is **psycopg2**. You would use one of these (or a similar) library to connect to your database, which might require specifying things like a port, username, and password. The details of this setup are beyond the scope of this unit (see the docs for these packages), but at the end of the process you will have a connection object that you can pass to Pandas. For example, you could do something like:

In [16]:
df = pd.read_sql_query("SELECT * FROM transactions;", con = connection)

NameError: name 'connection' is not defined

This assumes you called your connection object connection and that the database it is connected to contains a table named "transactions". (The NameError above arises because no such object was defined in this session.) Of course, you can write more complex queries as necessary. In this simple example you would actually be better off using pd.read_sql_table() rather than doing a "SELECT *".
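For a version of this call that runs without any database server, the sketch below swaps in Python's built-in sqlite3 module and an in-memory database. This is a stand-in for illustration, not the SQLAlchemy or psycopg2 setup described above, and the table contents are invented.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database; nothing is written to disk.
connection = sqlite3.connect(":memory:")

# Hypothetical "transactions" table, created from a DataFrame with .to_sql().
tx = pd.DataFrame({"id": [1, 2, 3], "amount": [9.99, 24.50, 3.25]})
tx.to_sql("transactions", con=connection, index=False)

# The query from the text, now with a real connection object.
df = pd.read_sql_query("SELECT * FROM transactions;", con=connection)

# A slightly more complex query, using a parameter placeholder.
large = pd.read_sql_query(
    "SELECT id, amount FROM transactions WHERE amount > ?;",
    con=connection,
    params=(5,),
)

connection.close()

print(len(df))               # 3
print(large["id"].tolist())  # [1, 2]
```

Note that pd.read_sql_table() requires a SQLAlchemy connection and does not accept a plain sqlite3 connection, so only pd.read_sql_query() is used here.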

Although the additional work required to make your Python session or script connect to a database may seem daunting, it is a useful skill for a data scientist that comes up in other contexts as well. For this particular use case, it is easy to imagine a workflow that iteratively develops queries to a database, with future iterations guided by the results of some analysis done in Pandas. Working solely within a Python session would be much more efficient than continually switching to query the database directly and then dumping the results of each query to a file to be read into Pandas.



An additional consideration when reading data into Pandas, especially from databases, is that Pandas holds data in memory. This means that the RAM limits of the computer you're developing on (your laptop, or perhaps a remote server) will dictate how much data you can work with at once. So, for example, if you know you don't need all the data represented in a table because you only care about columns C and D, it is probably better to run something like "SELECT C, D FROM transactions" rather than "SELECT * FROM transactions" so that you don't load more data into memory than you actually intend to work with. This same thought process applies to loading data from other sources; consider whether you can load only the data you care about, rather than holding in memory things you don't intend to use.
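The same idea can be sketched for both a flat file and a database. The example below uses invented data with columns A through D. For the flat file it uses the usecols keyword of pd.read_csv(), and for the database it uses an in-memory SQLite table as a stand-in for a real one.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical wide table; only columns C and D are of interest.
wide_csv = "A,B,C,D\n1,2,3,4\n5,6,7,8\n"

# Flat file: usecols keeps only the named columns while reading.
subset_df = pd.read_csv(io.StringIO(wide_csv), usecols=["C", "D"])

# Database: select only the needed columns in the query itself.
connection = sqlite3.connect(":memory:")
pd.read_csv(io.StringIO(wide_csv)).to_sql(
    "transactions", con=connection, index=False
)
sql_subset = pd.read_sql_query("SELECT C, D FROM transactions;", con=connection)
connection.close()

print(list(subset_df.columns))   # ['C', 'D']
print(list(sql_subset.columns))  # ['C', 'D']
```

For files too large to load even after dropping columns, pd.read_csv() also offers the nrows and chunksize keywords for reading only part of a file at a time.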

1. You have a file called challenge_data.txt that you want to input into a DataFrame called challenge_df. The first few lines of the file look like this:

In [18]:
'''
Col A|Col B|Col C
940|833|1
239|92|575
680|467|480
'''

'\nCol A|Col B|Col C\n940|833|1\n239|92|575\n680|467|480\n'

Enter the Python code to instantiate this DataFrame.

Note: You can assume that your code would be run from the same directory as challenge_data.txt, and you can assume that Pandas has been imported using the standard convention. Also, it's fine to depend on Pandas' default behavior for dealing with headers and indexes in this case.

challenge_df = pd.read_csv('challenge_data.txt', sep = "|")
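As a quick check that does not require creating challenge_data.txt, the same call can be tried on an in-memory copy of the sample lines shown above (io.StringIO is only a stand-in for the file here):

```python
import io

import pandas as pd

# The sample lines from the exercise, held in memory instead of on disk.
sample_text = "Col A|Col B|Col C\n940|833|1\n239|92|575\n680|467|480\n"

# Same call as the answer, with the file path replaced by a file-like object.
challenge_df = pd.read_csv(io.StringIO(sample_text), sep="|")

print(challenge_df.shape)          # (3, 3)
print(list(challenge_df.columns))  # ['Col A', 'Col B', 'Col C']
```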