# Week 2: Reading a file and handling basic data types

## EEB125 | W2023
## Tomo Parins-Fukuchi



### Learning Objectives
By the end of this lecture, you will be able to
- work with basic Python data types to answer data science questions
- open and read a data file using Python and Jupyter Notebook
  * specifically, a dataset capturing occurrences of fossils across Canada
- parse through the data to extract useful information

## Why Python?
- Python is a nice, general-purpose programming language
- It has excellent support for data science tools
- It is easy to learn and sets you up for success with other tools


## Techniques and concepts
- Ints vs floats
- Operators
- Manipulating strings
- Variables
- Lists
- Dictionaries
- Looping
- Tabular data
- The most fossiliferous(?) Canadian province

### Data types

Programming languages specify different types of data
  - E.g., numbers and letters/words are often represented as different types
  - In Python, some basic types are:
    + integers ('ints'): whole numbers (0,1,2)
    + floats: decimal numbers (e.g., 1.347)
    + strings: letters and words (e.g., 'horse', 'cat')


## Integers

- Whole numbers (can be positive, negative, or zero)

In [None]:
type(23)

int

## Floats

- Decimal values

In [None]:
type(23.54321)

float

### Operators

- Python has built-in operators for performing operations on ints and floats
- In addition to `+`, there are others:
  + `-`,`*`,`**`,`/`,`%`
- Let's explore what they do:

In [None]:
2 ** 3

8
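
The other operators can be explored the same way. This is a minimal sketch using small made-up numbers; the variable names are only for illustration.

```python
# Try each arithmetic operator on two small example numbers
difference = 7 - 2      # subtraction
product = 7 * 2         # multiplication
power = 7 ** 2          # exponentiation (7 squared)
quotient = 7 / 2        # division gives a float, even for two ints
remainder = 7 % 2       # modulo: what is left over after dividing

print(difference, product, power, quotient, remainder)
print(type(quotient))
```

Note that `/` returns a float even when both inputs are ints, which ties back to the ints vs floats distinction above.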

## Strings

- Words, letters, etc

In [None]:
type("michael jordan")

### Variables

Programming languages store data using _variables_
  -  A programming variable represents a piece of data
    + Can generally be any type
  -  Variables reference value(s) that we assign to them
  -  We can then manipulate those variables to perform tasks
  

### Example

Create the variable 'x' and assign an integer value to it

In [None]:
x = 1

### Example

`x` is now stored in memory. We will make another and call it 'y'

In [None]:
x = 1
y = 1

### Example

We can now use both of these variables to do some arithmetic:

In [None]:
x = 1
y = 1
x+y

2

### Example

We can assign pretty much any data type to a variable:

In [None]:
a = "spongebob "
b = "squarepants"
print(a)
print(b)

spongebob 
squarepants


### Example

We also can use some of the same operators on other data types:

In [None]:
a = "spongebob "
b = "squarepants"
a+b

'spongebob squarepants'

### Reassigning to variables

We can also change the value assigned to a variable:

In [None]:
b = "loserpants"
print(a+b)
b = "squarepants"

spongebob loserpants


In [None]:
# can even reassign a variable to itself

b = b
print(a+b)
a = a + b
print(a)

spongebob squarepants
spongebob squarepants


### The `print` function

+ Often, we may want to see the results of some operation in our notebook or computer terminal
+ We can use the `print` function in Python to do this

In [None]:
print("whoever thought that i would be the greatest growing up?")

whoever thought that i would be the greatest growing up?


### Example

Adding two strings basically slams both together. Can we add a string and a number?

In [None]:
# a+x

### Nope! 

Interpreting errors is one of the most important parts of programming
  - Running `a+x` raises a `TypeError`: Python is telling us that we cannot mix these data types when adding
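
One way to look at this error without stopping the notebook is to catch it. This is a sketch, not the only approach: `try`/`except` is a tool we have not covered yet, and the names `result` and `combined` are made up for illustration.

```python
a = "spongebob "
x = 1

# Adding a string and an int raises a TypeError; catch it so the cell keeps running
try:
    result = a + x
except TypeError as err:
    result = None
    print("TypeError:", err)

# Converting the int to a string first makes the addition work
combined = a + str(x)
print(combined)
```

The fix is to make both sides the same type before adding them.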

### Other errors 

- We have to be very careful with what we tell Python. It is very particular. 
- E.g., misspelling a variable name raises a `NameError`:

In [None]:
real = 33
# print(rea)

### Other rules for variables 

- variable names can contain only letters, numbers, and underscores
- they cannot start with a number
- no spaces
- don't use Python keywords (e.g., `for`, `if`), and avoid reusing the names of built-in functions such as `print`
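
These rules can be checked from within Python itself. The sketch below uses the string method `isidentifier()` and the standard-library `keyword` module; the candidate names are made up for illustration.

```python
import keyword

candidates = ["fossil_count", "2nd_fossil", "fossil count", "for"]
results = []

for name in candidates:
    # isidentifier() checks the character rules (no spaces, no leading digit)
    # keyword.iskeyword() checks whether Python reserves the word
    usable = name.isidentifier() and not keyword.iskeyword(name)
    results.append(usable)
    print(name, usable)
```

Only `fossil_count` passes: the others start with a number, contain a space, or are a reserved keyword.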

### What else can we do with strings?

- Python has many built-in tools for manipulating strings. Let's explore some:

In [None]:
# make uppercase 
test = "    what's the deal    "

print(test.upper())

    WHAT'S THE DEAL    


In [None]:
# remove whitespace
test = "    what's the deal    "
print(test.strip())

what's the deal


In [None]:
# do both
test = "    what's the deal    "
print(test.upper().strip())

WHAT'S THE DEAL


In [None]:
# do both and reassign to original variable
test = "    what's the deal    "
print(test)
test = test.upper().strip()
print(test)

    what's the deal    
WHAT'S THE DEAL


In [None]:
# replace part of a string

print(test)
test = test.replace("WHAT'S","THIS IS")
print(test)

WHAT'S THE DEAL
THIS IS THE DEAL


### Converting between types

- In some cases, Python will also allow us to convert between data types

In [None]:
# we can convert a float to a string:

a = 3.21
a = str(a)
print(a)

3.21


In [None]:
# now, if we try to treat it like it is a float, we will get into trouble:

# print(a+3.3)

In [None]:
# so we can also convert it back:

a = float(a)
print(a)
print(a+3.3)

3.21
6.51


### Be careful with types

- This flexibility allows us to be very sloppy with types
- Not always obvious what type a variable will be
- Be careful
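
When in doubt, we can ask Python what type a variable holds before using it. This is a small sketch with a made-up value; `isinstance` is a built-in function that has not been introduced above.

```python
value = "2.5"             # looks like a number, but it is a string

print(type(value))
print(isinstance(value, float))

# Convert explicitly before doing arithmetic
number = float(value)
total = number + 1
print(type(number), total)
```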

### Containers

- Besides the three we have examined, Python has many other data types
- Some of these we can refer to as 'containers'
  + These allow us to store many values within a single variable

### Lists

- Lists are a very common python container
- We can specify a list using square brackets
  
```
my_list = []
```

Generates an empty list. We can create a list with things in it: 

In [None]:
emcees = ["tupac","eminem","biggy","weezy"]
print(emcees)

['tupac', 'eminem', 'biggy', 'weezy']


### Lists

We can also add items to an existing list:

In [None]:
emcees.append("kendrick")
print(emcees)

['tupac', 'eminem', 'biggy', 'weezy', 'kendrick']


### Indexing lists

- We can select individual items from a list by 'indexing' it

In [None]:
# Python counts from zero: the first item in a list is item 0
print(emcees[0])

tupac


In [None]:
print(emcees[1])

eminem


In [None]:
# can also index from the other end
print(emcees[-1])

kendrick


### Slicing lists

- We can select ranges of items from a list by 'slicing' it

In [None]:
print(emcees)
print(emcees[1:3])

['tupac', 'eminem', 'biggy', 'weezy', 'kendrick']
['eminem', 'biggy']


In [None]:
print(emcees)
print(emcees[1:])

['tupac', 'eminem', 'biggy', 'weezy', 'kendrick']
['eminem', 'biggy', 'weezy', 'kendrick']


In [None]:
print(emcees)
print(emcees[:3])

['tupac', 'eminem', 'biggy', 'weezy', 'kendrick']
['tupac', 'eminem', 'biggy']


In [None]:
# lists also have tools associated with them
# how many times does "biggy" appear in the list emcees?
print(emcees.count("biggy"))

1


## Creating lists from strings

- Python also has built-in tools for creating a list of smaller strings from a longer string

In [None]:
emcees_str = "tupac, eminem, biggy, weezy, kendrick"
emcees_ls = emcees_str.split(",")
print(emcees_ls)

['tupac', ' eminem', ' biggy', ' weezy', ' kendrick']


## Creating strings from lists

- We can also join the elements of a list composed of all strings into a larger string

In [None]:
emcees_str2 = ",".join(emcees_ls)
print(emcees_str2)

tupac, eminem, biggy, weezy, kendrick
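
Notice that splitting on `","` alone left a leading space on most names. A minimal sketch of one way around this is to split on the comma together with the space after it; `clean_ls` and `rebuilt` are names made up for this example.

```python
emcees_str = "tupac, eminem, biggy, weezy, kendrick"

# Split on the comma *and* the space that follows it
clean_ls = emcees_str.split(", ")
print(clean_ls)

# Joining with the same separator rebuilds the original string
rebuilt = ", ".join(clean_ls)
print(rebuilt)
print(rebuilt == emcees_str)
```

This only works when every comma is followed by exactly one space. For messier text, calling `.strip()` on each piece is safer.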


## Other tools

- Python has many built-in tools for dealing with a variety of data types

In [None]:
# how long is a list?
print(len(emcees_ls))
# how long is a string?
print(len(emcees_str))

5
37


### Reviewing

- Python represents data using different types
  - ints, floats, strings
- We can assign data to variables
  + This allows us to commit it to memory and perform operations later
- We can store data in 'containers'
  + We can access data stored in containers by indexing or slicing
  + Containers can also be assigned to variables
  

### Looping 

- Much of computing fundamentally involves performing an operation on one piece of data at a time
- Given a collection of data, we can examine one item at a time by constructing a **loop**
  


### For loops 

- One way of looping over data in Python is by using a 'for loop'
  


In [None]:
print(emcees_ls)

for mc in emcees_ls:
    print(mc)

['tupac', ' eminem', ' biggy', ' weezy', ' kendrick']
tupac
 eminem
 biggy
 weezy
 kendrick


In [None]:
print(emcees_ls)

for mc in emcees_ls:
    print(mc.strip().upper())

['tupac', ' eminem', ' biggy', ' weezy', ' kendrick']
TUPAC
EMINEM
BIGGY
WEEZY
KENDRICK
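
Printing inside a loop shows the cleaned values but does not keep them. A common pattern, sketched here with the made-up name `cleaned`, is to start with an empty list and append to it on each pass through the loop.

```python
emcees_ls = ['tupac', ' eminem', ' biggy', ' weezy', ' kendrick']

# Start with an empty list, then add one cleaned item per loop iteration
cleaned = []
for mc in emcees_ls:
    cleaned.append(mc.strip().upper())

print(cleaned)
```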


### Tabular data

Much of the data that we will work with is in 'tabular' format
  - Data contained within rows and columns
  - Sort of like an Excel spreadsheet
  


### Tabular data

We are often used to seeing data in table form:

| Team        | # of Cups | Last cup | Country |
|-------------|-----------|----------|---------|
| Canadiens   | 23        | 1993     | Canada  |
| Maple Leafs | 13        | 1967     | Canada  |
| Red Wings   | 11        | 2008     | US      |
| Bruins      | 6         | 2011     | US      |
| Blackhawks  | 6         | 2015     | US      |
| Oilers      | 5         | 1990     | Canada  |
| Penguins    | 5         | 2017     | US      |



### Tabular data

- A common way of storing such data for programming is by separating each cell by a pre-specified character. Commas are common:

```
team,nCups,lastCup,country
Canadiens,23,1993,Canada
MapleLeafs,13,1967,Canada
RedWings,11,2008,US
Bruins,6,2011,US
Blackhawks,6,2015,US
Oilers,5,1990,Canada
Penguins,5,2017,US
```

- We often refer to this as a "comma-separated values" (csv) file 
- Columns are separated by commas
- The first line usually contains a guide to the data (the 'header')
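
The string and list tools from earlier are enough to pull this kind of text apart. The sketch below pastes the first few rows of the table into a string (rather than reading a file) and adds up the `nCups` column; `csv_text`, `rows`, and `total_cups` are names made up for the example.

```python
# A few rows of the table above, stored as one multi-line string
csv_text = """team,nCups,lastCup,country
Canadiens,23,1993,Canada
MapleLeafs,13,1967,Canada
RedWings,11,2008,US"""

rows = csv_text.split("\n")      # one string per line
header = rows[0].split(",")      # column names
print(header)

total_cups = 0
for row in rows[1:]:             # skip the header
    cells = row.split(",")
    total_cups = total_cups + int(cells[1])   # cells are strings, so convert to int

print(total_cups)
```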


### Reading and writing data

We often want to work with data stored in external files
  - This requires us to read files into our active memory so we can perform analyses
  - We may also want to write new files to save modified data or analyses
  - We can do this using the `open` function in Python

In [None]:
open

<function io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)>

### 'Calling' (using) a function

To actually use a function, we add parentheses after the function name:

```open()```


In [None]:
#open()

### Calling `open()`

Functions require 'arguments': information we provide that is necessary to perform the function
  - Calling `open()` with no arguments raises an error telling us that our call needs to specify a `'file'`
  - We need to tell Python where to find the file we want to open
  
![Alt Text](https://media2.giphy.com/media/gsrw6aoAncbKw/giphy.gif?cid=ecf05e47keqvjg4grf3bez5o5trof37x1oh22v51vpx3ghnz&rid=giphy.gif&ct=g)

## Today's Data

- We will be using data from the Paleobiology Database (https://paleobiodb.org/#/)
- Public database of fossil occurrences all over the world
    - What species did a fossil come from?
    - Where did it occur?
    - When did it occur?

### Reading a file

We can use `open` to read a file that is stored in our current working directory and save it to a variable:

```file = open("TESTFILE.csv")```


In [None]:
file = open("pbdb_data.csv","r")
file

<_io.TextIOWrapper name='pbdb_data.csv' mode='r' encoding='cp1252'>

### Reading a file

The variable `file` now refers to an open file object, not yet the text inside the file
  - How can we make this information human-readable?
  
```lines = file.readlines()```

Will read all of the lines of text contained within the file and assign them to the variable `lines`


In [None]:
lines = file.readlines()
lines

['identified_name,identified_rank,identified_no,accepted_name,accepted_rank,accepted_no,early_interval,max_ma,min_ma,reference_no,cc,state\n',
 'Bonniopsis sp.,genus,19508,Bonniopsis,genus,19508,Cambrian,541,485.4,61234,CA,Nunavut\n',
 'Fremontia sp.,genus,155360,Mesonacis,genus,19142,Cambrian,541,485.4,61234,CA,Nunavut\n',
 'Trilobita indet.,class,19100,Trilobita,class,19100,Cambrian,541,485.4,61234,CA,Nunavut\n',
 'Paterina sp.,genus,26583,Paterina,genus,26583,Cambrian,541,485.4,61234,CA,Nunavut\n',
 'Hyolithes sp.,genus,7746,Hyolithes,genus,7746,Cambrian,541,485.4,61234,CA,Nunavut\n',
 'Bonniopsis sp.,genus,19508,Bonniopsis,genus,19508,Caerfai,530,513,61234,CA,Nunavut\n',
 'Paedumias sp.,genus,19150,Olenellus (Paedeumias),subgenus,155701,Caerfai,530,513,61234,CA,Nunavut\n',
 'Trilobita indet.,class,19100,Trilobita,class,19100,Caerfai,530,513,61234,CA,Nunavut\n',
 'Hyolithes sp.,genus,7746,Hyolithes,genus,7746,Caerfai,530,513,61234,CA,Nunavut\n',
 'Circotheca sp.,genus,7634,Circothec

### Reading a file

We now have our information in a way we can interpret
  - The lines of the file are stored as strings contained within a list
  - The first line is a "header": it describes the data
  - How can we examine this header?


In [None]:
header = lines[0]
print(header)

identified_name,identified_rank,identified_no,accepted_name,accepted_rank,accepted_no,early_interval,max_ma,min_ma,reference_no,cc,state



In [None]:
# assign the rest of the data to a variable called data
data = lines[1:]
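
The cells above leave the file open. A common alternative is a `with` block, which closes the file for us when the block ends. The sketch below is self-contained: it first writes a tiny made-up file (`sample_fossils.csv`, a hypothetical stand-in for `pbdb_data.csv`) into a temporary folder, then reads it back and separates the header from the data.

```python
import os
import tempfile

# Hypothetical three-line file standing in for pbdb_data.csv
sample_text = "accepted_name,state\nBonniopsis,Nunavut\nHyolithes,Nunavut\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_fossils.csv")

    # "w" mode writes a new file; the with-block closes it afterwards
    with open(path, "w") as out_file:
        out_file.write(sample_text)

    # "r" mode reads; the file is closed as soon as the block ends
    with open(path, "r") as in_file:
        lines = in_file.readlines()

header = lines[0].strip()   # first line, without the trailing newline
data = lines[1:]            # everything after the header
print(header)
print(len(data))
print(in_file.closed)
```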

### Which Canadian province is the most fossil-rich?

- We have the lines of our data stored in a list
- That means we can loop over the lines and extract data one line at a time


In [None]:
# create an empty list that we will populate with data
province_recs = []

for line in lines[1:]:
    line_dat = line.strip().split(",")
    province = line_dat[-1].strip()
    province_recs.append(province)

## Quiz

- What type is `province_recs`?

- What information does it contain from our data file?

In [None]:
print(province_recs)

['Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'Nunavut', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'Northwest Territories', 'Northwest Territories', 'Northwest Territories', 'Northwest Territories', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia', 'British Columbia'

## One at a time

- We could look up the number of fossils one province/territory at a time
- Count the number of times that province/territory occurs in the list using `.count()`

In [None]:
province = "Ontario"
province_recs.count(province)

4786

## Brainstorm

- How else might we do this? Is there a more efficient way to get results for every province?

In [None]:
province_ls = ["Ontario","British Columbia","Alberta","Saskatchewan","Manitoba","Newfoundland and Labrador","Northwest Territories","Yukon","Prince Edward Island","Nunavut","Nova Scotia","Quebec","New Brunswick"]
print(province_ls)

['Ontario', 'British Columbia', 'Alberta', 'Saskatchewan', 'Manitoba', 'Newfoundland and Labrador', 'Northwest Territories', 'Yukon', 'Prince Edward Island', 'Nunavut', 'Nova Scotia', 'Quebec', 'New Brunswick']


In [None]:
for i in province_ls:
    print(i,province_recs.count(i))

Ontario 4786
British Columbia 20473
Alberta 10578
Saskatchewan 1756
Manitoba 2432
Newfoundland and Labrador 2679
Northwest Territories 8736
Yukon 3692
Prince Edward Island 44
Nunavut 6827
Nova Scotia 1655
Quebec 7072
New Brunswick 841
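
The loop above calls `.count()` once per province, so the full list is scanned thirteen times. One alternative, sketched here with a dictionary (a container listed under today's techniques), tallies every province in a single pass and does not need the province names typed out in advance. The short `province_recs` list below is made up for illustration, not the real data.

```python
# Made-up stand-in for the full list of province records
province_recs = ["Nunavut", "Nunavut", "British Columbia",
                 "Ontario", "British Columbia", "Nunavut"]

province_counts = {}
for province in province_recs:
    # get() returns the current count, or 0 the first time we see a province
    province_counts[province] = province_counts.get(province, 0) + 1

print(province_counts)

# max() with key=... picks the province with the largest count
richest = max(province_counts, key=province_counts.get)
print(richest, province_counts[richest])
```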


## END :)