# Writing files

Let's open a file for writing. `open` takes an optional parameter that specifies the "mode". In this example, `'w'` indicates that we're opening the file for writing.

We'll use **relative paths** for our files.

In [4]:
!ls *.csv

ls: cannot access '*.csv': No such file or directory


In [5]:
writer = open('class_codes.csv', 'w')

for line in open('sample_data/classes.csv'):
    class_list = line.strip().split(',')
    class_code, class_name = class_list[0:2]
    out_line = class_code + ',' + class_name + '\n'
    writer.write(out_line)
    writer.flush() # Forces writing now

writer.close() # Release the file once we're done

In [6]:
!ls *.csv

class_codes.csv


In [7]:
!cat class_codes.csv

class_code,class_name
cs-130,Introduction to Programming in Python
cs-257,User Experience Design
cs-401,Senior Comprehensive Project
cs-491,Computer Science Internship


The file above will persist after this program stops running because it is stored on the server's hard drive.
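
A common alternative to calling `flush()` by hand is a `with` block, which closes the file (and so flushes it) when the block ends. Here is a minimal sketch of that idea. The file name `demo_codes.csv` and the rows are made up for illustration and stand in for `sample_data/classes.csv`.

```python
import os
import tempfile

# Hypothetical rows standing in for the contents of sample_data/classes.csv
rows = [
    ['class_code', 'class_name'],
    ['cs-130', 'Introduction to Programming in Python'],
]

# Write into a temporary directory so the sketch does not clutter the working directory
out_path = os.path.join(tempfile.mkdtemp(), 'demo_codes.csv')

# The with block closes the file automatically, even if an error occurs
with open(out_path, 'w') as writer:
    for class_code, class_name in rows:
        writer.write(class_code + ',' + class_name + '\n')

# Read the file back to confirm that the lines were written
with open(out_path) as reader:
    contents = reader.read()

print(contents)
```

After the first `with` block ends, `writer.closed` is `True`, so there is no need to call `close()` yourself.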

# f-strings

So far we've constructed strings in a less-than-ideal way:

In [8]:
month = 'November'
day = '17'
year = '2025'
full_date = month + ' ' + day + ', ' + year
print(full_date)

November 17, 2025


We can use f-strings to write this more clearly. The letter `f` should go before the opening quotation mark, and expressions go in curly braces.

In [9]:
full_date = f'{month} {day}, {year}'
print(full_date)

November 17, 2025


We can simplify our `out_line` code as follows:

In [None]:
writer = open('class_codes.csv', 'w')

for line in open('sample_data/classes.csv'):
    class_list = line.strip().split(',')
    class_code, class_name = class_list[0:2]
    #out_line = class_code + ',' + class_name + '\n'
    out_line = f'{class_code},{class_name}\n'
    writer.write(out_line)
    writer.flush() # Forces writing now

writer.close() # Release the file once we're done

In an f-string, an expression in curly braces is converted to a string, so you can include lists, dictionaries, and other types.

In [10]:
t = [1, 2, 3]
d = {'one': 1}
f'Here is a list {t} and a dictionary {d}'

"Here is a list [1, 2, 3] and a dictionary {'one': 1}"

You're also not limited to variables:

In [11]:
f'1 + 1 is {1 + 1}'

'1 + 1 is 2'
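
An expression in curly braces can also be followed by a colon and a format specifier that controls how the value is displayed. The short sketch below uses two made-up variables, `price` and `count`, purely for illustration.

```python
price = 3.14159
count = 7

# :.2f rounds to two decimal places; :03d pads an integer with zeros to width 3
summary = f'price is {price:.2f} and count is {count:03d}'
print(summary)
```

This prints `price is 3.14 and count is 007`.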

# What is Pandas?
* Pandas is one of the most widely used Python libraries for data analysis.
* There are two core objects in pandas: the DataFrame and the Series.

# DataFrame
* A DataFrame is a table.
* It contains an array of individual entries, each of which has a certain value.
* Each entry corresponds to a row (or record) and a column.

In [12]:
import pandas as pd
pd.DataFrame({'Classes Attended': [30, 27], 'Lab Grade': [92, 88], 'Final Grade': [96, 84]})

   Classes Attended  Lab Grade  Final Grade
0                30         92           96
1                27         88           84


A DataFrame isn't limited to integers. Here's an example using strings.

In [13]:
pd.DataFrame({'Lecture Feedback': ['None', 'Great!'], 'Lab Feedback': ['Too hard', 'Too easy']})

  Lecture Feedback Lab Feedback
0             None     Too hard
1           Great!     Too easy


Above, we're using the `pd.DataFrame()` constructor to generate these DataFrame objects.

To declare a new one, we pass the constructor a dictionary:
* whose keys are the column names (Lecture Feedback and Lab Feedback in this example), and
* whose values are lists of entries, one list per column.

In [None]:
l1 = ['None', 'Great!'] #A list
l2 = ['Too hard', 'Too easy'] #A list
d = {'Lecture Feedback': l1, 'Lab Feedback': l2} #A dictionary
pd.DataFrame(d)

The `pd.DataFrame()` constructor just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels (or indexes). Sometimes this is OK, but we will often want to assign these labels ourselves.

We can assign our own values to each index by sending a list to the index parameter in our constructor:

In [14]:
pd.DataFrame({'Lecture Feedback': ['None', 'Great!'],
              'Lab Feedback': ['Too hard', 'Too easy']},
              index=['John Doe','Jane Doe'])

         Lecture Feedback Lab Feedback
John Doe             None     Too hard
Jane Doe           Great!     Too easy
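
Once a DataFrame has custom row labels, those labels can be used to look rows up. The sketch below rebuilds the same table under the hypothetical name `feedback`, inspects its `index` and `columns` attributes, and uses `.loc` to fetch one row by label.

```python
import pandas as pd

feedback = pd.DataFrame({'Lecture Feedback': ['None', 'Great!'],
                         'Lab Feedback': ['Too hard', 'Too easy']},
                        index=['John Doe', 'Jane Doe'])

# The row labels and column names are stored as attributes
row_labels = list(feedback.index)
column_names = list(feedback.columns)
print(row_labels)
print(column_names)

# .loc looks up a row by its label instead of its position
jane = feedback.loc['Jane Doe']
print(jane['Lab Feedback'])
```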


# Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.

In fact, you can create one with nothing more than a list:

In [None]:
pd.Series([1, 2, 3, 4, 5])

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter.

However, a Series does not have a column name; it only has one overall name:

In [15]:
pd.Series([30, 35, 40],
          index=['2015 Sales', '2016 Sales', '2017 Sales'],
          name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
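
To see the link between the two objects, we can pull a single column out of a DataFrame and check what comes back. This is a small sketch using a made-up DataFrame named `grades`. The column we select is a Series whose name is the column name.

```python
import pandas as pd

grades = pd.DataFrame({'Lab Grade': [92, 88], 'Final Grade': [96, 84]})

# Selecting one column with square brackets returns a Series
lab = grades['Lab Grade']
print(type(lab).__name__)
print(lab.name)
print(lab.tolist())
```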

# Creating from file
Most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the CSV file. We can use the `pd.read_csv()` function to read the data into a DataFrame.

In [16]:
import pandas as pd
housing = pd.read_csv("sample_data/california_housing_test.csv")
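
If you don't have a CSV file handy, you can still try `pd.read_csv()` on text held in memory, because it accepts file-like objects as well as paths. The sketch below wraps some made-up CSV text in `io.StringIO`. The variable names `csv_text` and `classes` are only for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = (
    'class_code,class_name\n'
    'cs-130,Introduction to Programming in Python\n'
    'cs-257,User Experience Design\n'
)

# The first line of the text becomes the column names
classes = pd.read_csv(io.StringIO(csv_text))
print(classes)
```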

The first step with an unfamiliar dataset is to get a big picture sense of the data. You can check how big the resulting DataFrame is.

`shape` is an attribute of the DataFrame, not a method, so we do not put `()` after it.

In [17]:
housing.shape

(3000, 9)

So our new DataFrame has 3,000 records split across 9 different columns.
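
Because `shape` is a tuple of (rows, columns), it can be unpacked into two variables. Here is a quick sketch on a small made-up DataFrame named `grades`:

```python
import pandas as pd

grades = pd.DataFrame({'Classes Attended': [30, 27],
                       'Lab Grade': [92, 88],
                       'Final Grade': [96, 84]})

# shape is a (rows, columns) tuple, so it unpacks into two names
n_rows, n_cols = grades.shape
print(f'{n_rows} rows and {n_cols} columns')
```

This prints `2 rows and 3 columns`.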

We can examine the contents of the resulting DataFrame using the `head()` method, which returns the first five rows:

In [18]:
housing.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -122.05     37.37                27.0       3885.0           661.0      1537.0       606.0         6.6085            344700.0
1     -118.3     34.26                43.0       1510.0           310.0       809.0       277.0          3.599            176500.0
2    -117.81     33.78                27.0       3589.0           507.0      1484.0       495.0         5.7934            270500.0
3    -118.36     33.82                28.0         67.0            15.0        49.0        11.0         6.1359            330000.0
4    -119.67     36.33                19.0       1241.0           244.0       850.0       237.0         2.9375             81700.0
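
`head()` also takes an optional number of rows, and `tail()` does the same from the end of the DataFrame. The sketch below shows both on a made-up ten-row DataFrame named `numbers`.

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(10)})

# head(3) returns the first three rows; tail(2) returns the last two
first_three = numbers.head(3)
last_two = numbers.tail(2)

print(first_three)
print(last_two)
```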


# Your data

* Import pandas
* Load your CSV into a data frame
* Run some code to see if it worked

In [None]:
import pandas as pd
your_data = pd.read_csv("your_data.csv")
your_data.head()
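
Beyond `head()`, a few more checks can help confirm that a file loaded the way you expected. The sketch below is one possible set of checks, not a required recipe. It uses made-up CSV text in place of `your_data.csv`, with one grade left blank on purpose.

```python
import io
import pandas as pd

# Hypothetical data standing in for your_data.csv; note the missing grade
csv_text = 'name,grade\nJohn Doe,92\nJane Doe,\n'
your_data = pd.read_csv(io.StringIO(csv_text))

print(your_data.shape)            # (rows, columns)
print(list(your_data.columns))    # column names taken from the header line
print(your_data.dtypes)           # the type pandas inferred for each column

# Count the missing values in each column
missing = your_data.isna().sum()
print(missing)
```

A column with an unexpected type or a large count of missing values is often the first sign that the file was not read the way you intended.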