# Writing Data

Pandas can write data to many formats, including CSV (usually the most popular), Excel, and JSON. The most basic usage is to call to_csv on a dataframe with the file path as the argument. If you specify only a file name, the file is written to the current working directory.

In [1]:
import pandas as pd

#Create test data
test_data = pd.DataFrame([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9]], index=['A','B','C'], columns=["Col 1", "Col 2", "Col 3"])

#Write the test data to csv
test_data.to_csv("TestData.csv")
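As a rough sketch of the other points above, the cell below writes the same data to a full path rather than a bare file name, and also writes it as JSON with to_json. The temporary folder and the file name TestData.json are assumptions made only so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Same test data as above
test_data = pd.DataFrame([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]], index=['A', 'B', 'C'], columns=["Col 1", "Col 2", "Col 3"])

# Write into a temporary folder (an assumption for this sketch) so nothing is left behind
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "TestData.csv")
    json_path = os.path.join(folder, "TestData.json")

    test_data.to_csv(csv_path)    # a full path instead of just a file name
    test_data.to_json(json_path)  # same idea, different format

    # Record what was written before the folder is deleted
    written_files = sorted(os.listdir(folder))

print(written_files)
```

Passing a path that includes a folder is how you control where the file ends up, instead of relying on the current directory.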

# Reading Data

Pandas provides read_csv and similar functions for reading data. We can call read_csv with a path to a file and it will return a dataframe. For example, let's read in the data we just wrote.

In [2]:
df = pd.read_csv("TestData.csv")
print(df)

  Unnamed: 0  Col 1  Col 2  Col 3
0          A      1      2      3
1          B      4      5      6
2          C      7      8      9


Above you see that if we do not specify an index column, pandas assumes there is none and treats every column in the file as data, which is why the old index shows up as "Unnamed: 0". You can tell pandas which column holds the index with the index_col argument. Let's read the file in again with the index.

In [3]:
df = pd.read_csv("TestData.csv", index_col=0)
print(df)

   Col 1  Col 2  Col 3
A      1      2      3
B      4      5      6
C      7      8      9
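One possible way to avoid the "Unnamed: 0" column altogether is to give the index a name when writing, using the index_label argument of to_csv, and then refer to that name in index_col. The sketch below assumes a made-up label "Row" and writes to an in-memory buffer instead of a file so it is self-contained.

```python
import io

import pandas as pd

test_data = pd.DataFrame([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]], index=['A', 'B', 'C'], columns=["Col 1", "Col 2", "Col 3"])

# Write to an in-memory buffer (an assumption, to avoid touching the disk)
buffer = io.StringIO()
test_data.to_csv(buffer, index_label="Row")  # "Row" is a hypothetical label

# Rewind the buffer and read it back, selecting the index column by name
buffer.seek(0)
df = pd.read_csv(buffer, index_col="Row")
print(df)
```

Selecting the index by name rather than by position keeps working even if the column order in the file changes.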


### Toggling Index and Header

Functions that write data, such as to_csv and to_excel, also accept the optional arguments index and header, which toggle whether the index and the column names are written. Let's make a second file without index or header.

In [4]:
test_data.to_csv("TestData2.csv", index=False, header=False)

Now read in the data that we just wrote and see what happens.

In [5]:
df = pd.read_csv("TestData2.csv")
print(df)

   1  2  3
0  4  5  6
1  7  8  9


By default, pandas assumes that the first row is the header of your data, so the first row of values became the column names above. To read in the data with no header, pass the argument header=None.

In [6]:
df = pd.read_csv("TestData2.csv", header=None)
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
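With header=None the columns are simply numbered. If you know what the columns should be called, the names argument of read_csv lets you supply them while reading. This is only a sketch: the CSV text below is typed in by hand to mirror what TestData2.csv contains, and it is read from memory rather than from disk.

```python
import io

import pandas as pd

# The same contents as the headerless file, held in memory for this sketch
csv_text = "1,2,3\n4,5,6\n7,8,9\n"

# No header row in the data, so provide the column names ourselves
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["Col 1", "Col 2", "Col 3"])
print(df)
```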


# Using os to Modify Files

The os library lets you make changes to your file system, such as creating, renaming, and removing files and folders, which can make your life much easier when you start doing things like writing out hundreds of files. Let's begin with a simple example. If you want to make a folder in your current directory, you can call os.mkdir with the folder name.

In [7]:
import os
#Create a folder called Test

os.mkdir("Test")

Keep in mind that if you try to create the directory a second time, it will throw an error because it already exists.

In [8]:
os.mkdir("Test")

FileExistsError: [Errno 17] File exists: 'Test'
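If you would rather not get this error when re-running a notebook, one option is os.makedirs with exist_ok=True, which quietly does nothing when the folder is already there. The sketch below works inside a temporary folder (an assumption, so it leaves nothing behind) and also shows catching the error from os.mkdir.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "Test")

    os.makedirs(folder, exist_ok=True)  # creates the folder
    os.makedirs(folder, exist_ok=True)  # already exists, no error raised

    # os.mkdir still raises if the folder exists, so catch the error if needed
    try:
        os.mkdir(folder)
        raised = False
    except FileExistsError:
        raised = True

    folder_exists = os.path.isdir(folder)

print(folder_exists, raised)
```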

### Listing Directory Contents

You are also able to list a directory's contents to better understand where you are. The os.listdir function takes a path and returns the contents in that path.

In [9]:
#List the contents of the current directory
print(os.listdir("."))

['.ipynb_checkpoints', '1 Pandas Basics.ipynb', '2 Data Transformations.ipynb', '3 Statistics.ipynb', '4 Reading and Writing Data.ipynb', '5 Joins.ipynb', '6 Grouping.ipynb', '7 Introduction to numpy.ipynb', '8 Randomness.ipynb', 'Test', 'TestData.csv', 'TestData2.csv']


In [10]:
#List the contents of the Test directory
print(os.listdir("./Test"))

[]
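Since os.listdir returns plain strings, a common follow-up is to filter them, for example to keep only CSV files. The sketch below is an illustration under assumptions: it creates three empty placeholder files in a temporary folder, and the name Notes.txt is made up. The result is sorted because os.listdir does not guarantee any particular order.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty placeholder files (hypothetical names)
    for name in ["TestData.csv", "TestData2.csv", "Notes.txt"]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("")

    # Keep only the names that end in .csv, in a stable order
    csv_files = sorted(name for name in os.listdir(folder) if name.endswith(".csv"))

print(csv_files)
```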


### Removing and Renaming Files

Files can be removed with os.remove and directories with os.rmdir, which only works on empty directories. The function os.rename takes the path of an existing file followed by the new name and renames the file.

In [11]:
#Remove the first test data
os.remove("TestData.csv")

In [12]:
#Rename the second test data set
os.rename("TestData2.csv", "TestData.csv")

In [13]:
#List the directory contents
print(os.listdir("."))

['.ipynb_checkpoints', '1 Pandas Basics.ipynb', '2 Data Transformations.ipynb', '3 Statistics.ipynb', '4 Reading and Writing Data.ipynb', '5 Joins.ipynb', '6 Grouping.ipynb', '7 Introduction to numpy.ipynb', '8 Randomness.ipynb', 'Test', 'TestData.csv']


In [14]:
#Remove both the new file and the new directory
os.remove("TestData.csv")
os.rmdir("Test")

In [15]:
#List the directory
print(os.listdir("."))

['.ipynb_checkpoints', '1 Pandas Basics.ipynb', '2 Data Transformations.ipynb', '3 Statistics.ipynb', '4 Reading and Writing Data.ipynb', '5 Joins.ipynb', '6 Grouping.ipynb', '7 Introduction to numpy.ipynb', '8 Randomness.ipynb']
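The order of the two calls in cell 14 matters when the file lives inside the folder, because os.rmdir refuses to remove a directory that still has something in it. A small sketch of that behavior, run inside a temporary folder (an assumption, to keep the example from touching your real files):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "Test")
    file_path = os.path.join(folder, "TestData.csv")

    # Create a folder with one file inside it
    os.mkdir(folder)
    with open(file_path, "w") as f:
        f.write("1,2,3\n")

    # Removing a non-empty directory fails with an OSError
    try:
        os.rmdir(folder)
        rmdir_failed = False
    except OSError:
        rmdir_failed = True

    # Remove the file first, then the now-empty directory
    os.remove(file_path)
    os.rmdir(folder)
    still_there = os.path.exists(folder)

print(rmdir_failed, still_there)
```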


## Reading in Chunks

A final feature that can be a lifesaver is the ability to read in chunks. This allows you to take a large dataset and read in only a piece at a time to make it more manageable. Let's begin by creating an example file with 100 rows (in reality we would only do this with a large dataset, but this is just to show the example).

In [16]:
#Create some dummy data
test_data = pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)])
test_data.to_csv("TestData.csv")

The simple objective is to get the sum of all values in the dataframe. In our example, you can imagine that instead of 100 rows we have 100 million+ rows, so we may not be able to read the whole dataset into memory, depending on how much memory the computer has. An important thing to ask yourself is whether what you are trying to achieve can be done in chunks. There are times when you can only run operations on the full dataset for one reason or another, and if that is the case your best bet is moving to a database.

If you pass the chunksize argument with the number of rows to read each time, read_csv returns an iterator that yields the file one piece at a time instead of a single dataframe. The code below reads the file in sets of 10 rows, and we loop through each chunk to get the sum.

In [17]:
chunks = pd.read_csv("TestData.csv", index_col=0, chunksize=10)
total = 0
for chunk in chunks:
    s = chunk.sum().sum()
    print("Sum of current chunk: {}".format(s))
    print()
    total += s
print("Total sum: {}".format(total))

Sum of current chunk: 715

Sum of current chunk: 3415

Sum of current chunk: 8115

Sum of current chunk: 14815

Sum of current chunk: 23515

Sum of current chunk: 34215

Sum of current chunk: 46915

Sum of current chunk: 61615

Sum of current chunk: 78315

Sum of current chunk: 97015

Total sum: 368650
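A sum is easy to do in chunks because the partial sums just add up. A mean takes one more step: you cannot average the chunk means in general, but you can keep a running sum and a running row count and divide at the end. The sketch below illustrates this under assumptions: the column names x and x_squared are made up, and the data is written to an in-memory buffer rather than a file.

```python
import io

import pandas as pd

# Dummy data with hypothetical column names
data = pd.DataFrame({"x": range(1, 101)})
data["x_squared"] = data["x"] ** 2

# Write to an in-memory buffer (an assumption, to keep the sketch self-contained)
buffer = io.StringIO()
data.to_csv(buffer, index=False)
buffer.seek(0)

# Track the pieces needed for a mean: total and number of rows seen
running_sum = 0
running_count = 0
for chunk in pd.read_csv(buffer, chunksize=10):
    running_sum += chunk["x_squared"].sum()
    running_count += len(chunk)

chunked_mean = running_sum / running_count
print(chunked_mean)
```

The same pattern works for any result that can be rebuilt from small per-chunk summaries; operations such as a median or a full sort do not break down this way.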
