# Reading Files into a Dataframe

* Once your data is in a Pandas `DataFrame` you can easily use a ton of analytical tools
* You just have to get your data to fit into a dataframe
* Getting data to fit is a big part of "data janitor" work; it is the craft of data carpentry
* However, as we will see, there is still a lot of carpentry work to do once your data fits into a `DataFrame`

In [None]:
# load pandas
import pandas as pd

### Open the file and load it into memory

* Pandas provides some very handy functions for reading in CSV files.

In [None]:
# look at a CSV file using the unix command head
!head community-center-attendance.csv

* This is how we open a CSV file with pure Python

In [None]:
# Load up the CSV module
import csv

# open a file handler
file_handler = open('community-center-attendance.csv')
# Load the file into the CSV module
reader = csv.reader(file_handler)

# read the headers first
headers = next(reader)

# Read all the data into a variable as a list. look ma! one line!
center_attendance_python = [row for row in reader]

# close the file handler for good hygiene 
file_handler.close()

# print headers
print(headers)

# display the first five rows 
center_attendance_python[0:5]
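
* Closing the file by hand works, but a `with` block does it for you. The sketch below is not part of the original lesson: it writes a tiny made-up CSV to a temporary file (the column names and values are hypothetical) so it can run on its own.

```python
import csv
import os
import tempfile

# Write a tiny hypothetical CSV to a temporary file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("_id,center_name\n1,North\n2,South\n")
    path = tmp.name

# The with block closes the file automatically, even if an error occurs
with open(path, newline="") as file_handler:
    reader = csv.reader(file_handler)
    headers = next(reader)
    rows = list(reader)

# Clean up the temporary file
os.remove(path)

print(headers)
print(rows)
```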

* We could load these into a dataframe using the dataframe syntax we just learned

In [None]:
center_attendance_pandas = pd.DataFrame(center_attendance_python, columns=headers)
center_attendance_pandas
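
* One caveat with the pure Python route: `csv.reader` hands back every value as a string, so numeric columns are not numeric until you convert them. A minimal sketch, assuming a small made-up CSV (the column names and values are hypothetical, not taken from the attendance file):

```python
import csv
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file
sample_csv = "_id,center_name,attendance_count\n1,North,12\n2,South,7\n"

# Pure Python: every cell comes back as a string
reader = csv.reader(io.StringIO(sample_csv))
headers = next(reader)
rows = list(reader)
from_python = pd.DataFrame(rows, columns=headers)

# read_csv infers numeric types for the same data
from_pandas = pd.read_csv(io.StringIO(sample_csv))

# Compare the column types of the two dataframes
print(from_python.dtypes)
print(from_pandas.dtypes)

# Convert a string column to numbers explicitly
from_python["attendance_count"] = pd.to_numeric(from_python["attendance_count"])
```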

* Pandas can do this much more easily using the `read_csv()` function

In [None]:
# Open up the csv file directly with pandas 
center_attendance_pandas = pd.read_csv("community-center-attendance.csv", 
                                       index_col="_id") # use the column named _id as the row index
# Display the first five rows
center_attendance_pandas.iloc[0:5]

* Notice that Pandas figured out there is a header row, and it created a row index from the column we named in `index_col`
* Pandas also has a special method, `head(n)`, for looking at the first *n* rows in a dataframe

In [None]:
# Use the head function to look at the "head" 
# of the dataframe. Default is 5 rows.
center_attendance_pandas.head()

In [None]:
# Pass a number to the head function to look at 
# that many rows, here the first 10.
center_attendance_pandas.head(10)

* Notice the index starts at 1 instead of zero. That is because we told Pandas to use the "_id" column as the row index.
* This is when it is important to understand the difference between `loc` and `iloc`

In [None]:
# Select row by index label
center_attendance_pandas.loc[1]

In [None]:
# Select row by integer position (counting from zero)
center_attendance_pandas.iloc[1]
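
* The difference is easiest to see on a tiny dataframe. A minimal sketch, assuming made-up data whose row index starts at 1 like the "_id" column (the names here are hypothetical):

```python
import pandas as pd

# Hypothetical data with a row index that starts at 1, like the "_id" column
df = pd.DataFrame(
    {"center_name": ["North", "South", "East"]},
    index=pd.Index([1, 2, 3], name="_id"),
)

# loc looks up the row whose index label is 1 (the first row here)
by_label = df.loc[1]

# iloc looks up the row at position 1, counting from zero (the second row here)
by_position = df.iloc[1]

print(by_label["center_name"])     # North
print(by_position["center_name"])  # South
```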

---

## Writing CSV Files

* If you have your data loaded into a `DataFrame` you can easily write it to a file with the `to_csv()` method
* There are also methods for writing to a bunch of other formats (Excel, JSON, SQL, etc.)
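
* As one example of another format, here is a sketch using `to_json()`. The data is made up for illustration, and `orient="records"` is just one of several layouts you could pick:

```python
import json

import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({"first_name": ["Bob", "Jane"], "age": [200, 199]})

# With no file path, to_json returns the JSON text as a string;
# orient="records" produces one JSON object per row
json_text = people.to_json(orient="records")
print(json_text)

# Parse it back with the standard library to check the structure
records = json.loads(json_text)
```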


In [None]:
# create a list of lists, each sub-list is an observation/row
dead_people_list = [
    [1,"Bob","Jones",200],
    [2,"Jane","Jones",199],
    [3,"Ethel","Jones",180],
    [4,"Hortense","Jones",178],
    [5,"Vern","Jones",178]
]

# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]

# make a DataFrame with column names specified separately
dead_people = pd.DataFrame(dead_people_list, columns=column_names)
dead_people

In [None]:
# use the to_csv method to write it to a file
# index=False leaves the row index out of the file
dead_people.to_csv("dead_people.csv", index=False)

In [None]:
!head dead_people.csv
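
* To see what `index=False` changes, compare the output with and without it. A minimal sketch on made-up data that keeps everything in memory instead of writing a file:

```python
import io

import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({"first_name": ["Bob", "Jane"], "age": [200, 199]})

# With no file path, to_csv returns the CSV text as a string
with_index = people.to_csv()
without_index = people.to_csv(index=False)

# The header row gains an empty first column name when the index is written
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Read the index-free version back into a dataframe
round_trip = pd.read_csv(io.StringIO(without_index))
```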

---