# Python Fundamentals for Tuesday, September 19


Follow along as we work our way through this document. Use the code cells to test and experiment as we go. Try to complete all of the practice exercises and ask questions if you need assistance.

This is a list of the topics we will cover today, starting where we left off on Thursday:

1. Authoring User-defined functions
2. Practice with string methods, lists, tuples, and dictionaries
3. Importing files
4. Managing Data with Pandas

In Python, a user-defined function is a custom function created by the programmer to perform a specific task or set of tasks. User-defined functions are essential for organizing code, making it more modular, and promoting code reuse. These functions can be defined and used throughout your program to encapsulate logic and improve code readability.

To define a user-defined function in Python, you typically use the `def` keyword followed by the function name, a pair of parentheses containing any parameters the function takes, and a colon to indicate the start of the function's code block. Here's a basic syntax example:

```
def function_name(parameter1, parameter2, ...):
    # Function code goes here
    # This code block is indented
    # You can include one or more statements

# Example usage of the function
result = function_name(argument1, argument2, ...)
```
The components of a user-defined function are:

* **Function Name**: This is the name you give to your function. It should be descriptive of the function's purpose and follow Python naming conventions (e.g., using lowercase letters with underscores for multi-word function names).

* **Parameters**: Parameters are placeholders for values that the function expects to receive when it is called. They are defined within the parentheses and act as input to the function. Functions can have zero or more parameters.

* **Function Code**: The function code block contains the statements that define what the function does. It is indented to indicate that it is part of the function definition. You can include any valid Python code within the function.

* **Function Call**: To use the function, you call it by its name and provide actual values (arguments) for the parameters. The function is executed with these arguments, and it can return a result or perform some action.

An example appears below:

In [1]:
def multiply_numbers(x, y):
    result = x*y
    return result

# Calling the function
total = multiply_numbers(15, 30)
print(total)  # Output: 450

450


The function `multiply_numbers` above takes two parameters, `x` and `y`, and returns their product. When you call `multiply_numbers(15, 30)`, it returns 450, which is stored in the `total` variable.

User-defined functions are a fundamental concept in Python programming, and they allow you to encapsulate logic, promote code reusability, and make your code more maintainable.

## Practice: Functions
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.

Write a change counting function. Pass the function the number of pennies, nickels, dimes, and quarters, and return the value of the coins. Test it with 5 pennies, 4 dimes, 2 quarters. 

In [6]:
a = input("how many pennies do you have")
b = input("how many nickels do you have")
c = input("how many dimes do you have")
d = input("how many quarters do you have")

def change_counter(a,b,c,d):
    total = (1*int(a)) + (5*int(b)) + (10*int(c)) + (25*int(d))
    dollars = total//100
    cents = total%100
    print("You Have " + str(dollars) + " dollars and " + str(cents) + " cents")

change_counter(a,b,c,d)

You Have 0 dollars and 95 cents
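
The exercise asks the function to return the value of the coins, whereas the answer above prints it. A minimal sketch of a returning version follows; the name `coin_value` and its keyword parameters are hypothetical choices for illustration, not part of the original answer.

```python
def coin_value(pennies=0, nickels=0, dimes=0, quarters=0):
    """Return the total value of the coins, in cents."""
    return 1 * pennies + 5 * nickels + 10 * dimes + 25 * quarters

# Test case from the exercise: 5 pennies, 4 dimes, 2 quarters
total_cents = coin_value(pennies=5, dimes=4, quarters=2)

# divmod splits the total into whole dollars and leftover cents
dollars, cents = divmod(total_cents, 100)
print("You have", dollars, "dollars and", cents, "cents")
```

Returning the number instead of printing it lets the caller decide what to do with the result, such as formatting it or adding it to another total.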


## String Methods

Calling a method works almost exactly like calling a function, but the syntax is different: the method name follows the object and a dot, as in `text.upper()`. String methods are built-in functions that operate on strings (sequences of characters). They allow you to manipulate strings to fit your needs.

Here is an example that fixes capitalization problems when users are asked to enter their name, and creates a single variable with the complete name. It makes use of Python's string methods.

In [7]:
def name_fixer(first, middle, last):
    """
    Fix any capitalization problems and create a single variable with the complete name.
    """
    return first.title() + ' ' + middle.title() + ' ' + last.title()           # the string method title() capitalizes the first letter of each word

Some commonly used string methods in Python:

**str.capitalize()**: Converts the first character of the string to uppercase and the rest to lowercase.


In [8]:
text = "hello, world"
capitalized = text.capitalize()
print(capitalized)  # Output: "Hello, world"

Hello, world


**str.upper() and str.lower()**: Convert the entire string to uppercase or lowercase, respectively.

In [9]:
text = "Hello, World"
upper_case = text.upper()
lower_case = text.lower()
print(upper_case)  # Output: "HELLO, WORLD"
print(lower_case)  # Output: "hello, world"

HELLO, WORLD
hello, world


**str.strip(), str.lstrip(), and str.rstrip()**: Remove leading and trailing whitespace characters from a string. strip() removes from both ends, lstrip() from the left end, and rstrip() from the right end.

In [10]:
text = "   Python   "
stripped = text.strip()
print(stripped)  # Output: "Python"

Python


**str.split()**: Splits a string into a list of substrings based on a specified delimiter (default is whitespace).

In [12]:
text = "apple,banana,cherry"
fruits = text.split(",")
print(fruits)  # Output: ['apple', 'banana', 'cherry']

['apple', 'banana', 'cherry']


**str.join(iterable)**: Joins the elements of an iterable (e.g., a list) into a single string, using the string as a separator.

In [13]:
fruits = ['apple', 'banana', 'cherry']
text = ",".join(fruits)
print(text)  # Output: "apple,banana,cherry"

apple,banana,cherry


**str.replace(old, new)**: Replaces all occurrences of a substring old with another substring new in the string.

In [14]:
text = "Hello, World"
replaced = text.replace("World", "Python")
print(replaced)  # Output: "Hello, Python"

Hello, Python


**str.startswith(prefix) and str.endswith(suffix)**: Checks if the string starts with a specified prefix or ends with a specified suffix and returns a Boolean value.

In [15]:
text = "Hello, World"
starts_with_hello = text.startswith("Hello")
ends_with_world = text.endswith("World")
print(starts_with_hello)  # Output: True
print(ends_with_world)    # Output: True

True
True


**str.find(substring) and str.rfind(substring)**: Find the first (or last, in the case of rfind) occurrence of a substring within the string and return its index. If not found, it returns -1.

In [17]:
text = "Python is easy and Python is fun"
first_occurrence = text.find("Python")
last_occurrence = text.rfind("Python")
print(first_occurrence)  # Output: 0
print(last_occurrence)   # Output: 19

0
19


Returning to our user-defined function `name_fixer()`, we see how these string methods can be employed to fix capitalization issues. 

In [18]:
mascot_first = 'bucKingham'
mascot_middle = 'u'
mascot_last = 'badger'

full_name = name_fixer(mascot_first, mascot_middle, mascot_last)
print(full_name)

Buckingham U Badger


## Practice Exercises

Can you write a function in the codeblock below that will return a name in all capital letters?

In [24]:
name = "carson"
def capitalize(name):
    print(str(name).upper())

capitalize(name)

CARSON


Now try writing some code to create a new string made of an input string's first two letters, in upper case.

In [28]:
state = "Wisconsin"
def shorten(state):
    print(state[0:2].upper())

shorten(state)

WI


Now let's play the name game with python!
You can sing 'The Name Game' with (almost) every name.<br>


### The regular verse:<br>

The verse for the name 'Gary' would be like this:<br>
Gary, Gary, bo-bary<br>
Banana-fana fo-fary<br>
Fee-fi-mo-mary<br>
Gary!<br>

At the end of every line, the name gets repeated without the first letter: Gary becomes ary<br>
If we take (X) as the full name (Gary) and (Y) as the name without the first letter (ary) the verse would look like this:<br>

**(X), (X), bo-b(Y)<br>
Banana-fana fo-f(Y)<br>
Fee-fi-mo-m(Y)<br>
(X)!**<br>

Got it?<br>

Now write a module titled `namegame.py` that will take a user's name as input and return the four lines above. Put the code for the module in the cell below, then import the module in another cell just below. Test your code to be sure it runs correctly. For advanced coders, try to include the special rules for the name game in your module. They are:

### Vowel as first letter of the name
If you have a vowel as the first letter of your name (e.g. Earl) you do not truncate the name.<br>
The verse looks like this:<br>

Earl, Earl, bo-bearl<br>
Banana-fana fo-fearl<br>
Fee-fi-mo-mearl<br>
Earl!<br>

### 'B', 'F' or 'M' as first letter of the name
In case of a 'B', an 'F' or an 'M' (e.g. Billy, Felix, Mary) there is a special rule.<br>
The line which would 'rebuild' the name (e.g. bo-billy) is sung without the first letter of the name.<br>
The verse for the name Billy looks like this:<br>

Billy, Billy, bo-illy<br>
Banana-fana fo-filly<br>
Fee-fi-mo-milly<br>
Billy!<br>
For the name 'Felix', this would be right:<br>

Felix, Felix, bo-belix<br>
Banana-fana fo-elix<br>
Fee-fi-mo-melix<br>
Felix!<br>


In [42]:
import namegame

namegame.song("Carson")

Carson, Carson, bo-barson
Banana-fana fo-farson
Fee-fi-mo-marson
Carson!


The `namegame.py` module uses f-strings to build each line of the verse. To create an f-string, prefix the string literal with the letter `f`. The string itself can be formatted in much the same way as with `str.format()`. F-strings provide a concise and convenient way to embed Python expressions inside string literals for formatting.

In the `namegame.py` module, the `if __name__ == "__main__":` block is a common Python idiom used to control the execution of code when the script is run directly (as the main program) and not when it is imported as a module in another script.

Here's how it works:

1. When you run a Python script, the Python interpreter sets a special built-in variable called `__name__` for the script. If the script is the main program being executed, Python sets `__name__` to `"__main__"`. If the script is imported as a module in another script, `__name__` is set to the name of the module (e.g., the filename without the `.py` extension).

2. The `if __name__ == "__main__":` block allows you to specify code that should only be executed when the script is run directly as the main program, not when it is imported as a module. This is useful for separating reusable code from code that should only run when the script is used as the entry point.

In `namegame.py`, the code under `if __name__ == "__main__":` is the part that executes when you run `namegame.py` as a standalone script. It prompts the user for a name, generates the name game verse, and prints it. However, if you import `namegame.py` as a module into another script, this code block won't run automatically; you can use the functions and variables defined in the module without executing this part of the script. This separation allows you to reuse the module's functionality in different contexts.
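
The source of the module is not reproduced in these notes. The following is a minimal sketch of what a `namegame.py` could look like, assuming only what the call `namegame.song("Carson")` tells us: that a function named `song` prints the verse. The helper `verse`, the constant `VOWELS`, and the inner function `rebuild` are hypothetical names, and the sketch uses a fixed name in the `__main__` block (instead of prompting with `input()`) so that it runs unattended.

```python
# namegame.py: a hypothetical sketch, not the course's actual module

VOWELS = "aeiou"

def verse(name):
    """Return the four-line Name Game verse for `name` as one string."""
    name = name.strip().title()
    first = name[0].lower()
    # Names starting with a vowel keep every letter; all others drop the first
    rest = name.lower() if first in VOWELS else name[1:]

    def rebuild(letter):
        # Special rule: if the name starts with this letter, sing the line without it
        return rest if first == letter else letter + rest

    return (
        f"{name}, {name}, bo-{rebuild('b')}\n"
        f"Banana-fana fo-{rebuild('f')}\n"
        f"Fee-fi-mo-{rebuild('m')}\n"
        f"{name}!"
    )

def song(name):
    """Print the verse for `name`."""
    print(verse(name))

if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    # A version that prompts the user would call input() here instead.
    song("Gary")
```

Keeping the string-building in `verse` and the printing in `song` means the verse can be checked or reused without producing output.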

# Lists

A list is an ordered and modifiable collection of objects. Like a string, a list is a sequence of values, but unlike a string, the values can be any type (not just characters). Values are called elements of the list.

Lists are enclosed in brackets (`[` and `]`) and can take many forms:




In [43]:
# Creating a blank List
first_list = []
print(first_list)
  
# Creating a list of numbers
second_list = [11, 12, 13, 11]
print("\nList of numbers: ", second_list)

  
# Creating a List of strings and accessing using index
third_list = ["ECON", "Is", "Awesome"]
print("\nList Items: ")
print(third_list[0], third_list[1], third_list[2])
  
# Creating a nested list
fourth_list = [['ECON', 'is'] , ['Awesome']]
print("\nNested List: ", fourth_list)
print(fourth_list[0])


[]

List of numbers:  [11, 12, 13, 11]

List Items: 
ECON Is Awesome

Nested List:  [['ECON', 'is'], ['Awesome']]
['ECON', 'is']


Because a list is a sequence and not a set, a list may contain duplicate values in distinct positions.

Once a list is created, it is possible to add elements to the list via the `append()` method.

In [44]:

new_list = []
print(new_list)
  
# Add elements to the list
new_list.append(1)
new_list.append(2)
new_list.append('three')
new_list.append(4)
print(new_list)

[]
[1, 2, 'three', 4]


The `append()` method adds a single element to the end of a list. To add multiple elements at the end of a list, use the `extend()` method. If you wish to add an element at a specific position, use the `insert()` method. While `append()` only accepts one argument, the `insert()` method requires two arguments: `insert(position, value)`. Remember that lists are ordered and that positions are counted starting from 0.

In [45]:
new_list = []
print(new_list)
  
# Add elements to the list
new_list.append(1)
new_list.extend([2, 'three', 4])
print(new_list)


# Addition of element at specific position
new_list = []
new_list.append(1)
new_list.append(2)
new_list.append(4)
new_list.insert(2, 3)
print(new_list)

[]
[1, 2, 'three', 4]
[1, 2, 3, 4]


Working with lists is very intuitive. For instance, you can concatenate lists using the `+` operator. The `+` operator 'knows' what kinds of objects it is working with (lists, ints, strings) and acts appropriately. The `*` operator repeats a list as many times as dictated. Let's take a few minutes to practice working with lists. Raise your hand when you have completed the following tasks:

1. Create a list containing all the letters of the alphabet, in alphabetical order. Name the list 'letters'
2. Create a list containing all of the integers between 8 and 14. Name the list 'number_list'
3. Repeat the 'number_list' four times and save it as a new list named 'repeat_list'.
4. Create a list containing all of the integers between 8 and 14, but where each integer is stored as a string. Name the list 'string_list'
5. Merge your `number_list` and `string_list` into a single list named `merged_list` and print it out to confirm accuracy.
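
Before starting the tasks, it may help to see the `+` and `*` operators described above on a couple of small lists. The lists `evens` and `odds` are made up for this sketch.

```python
evens = [2, 4, 6]
odds = [1, 3, 5]

combined = evens + odds      # + joins two lists into a new list
repeated = odds * 2          # * repeats the list's elements
print(combined)              # [2, 4, 6, 1, 3, 5]
print(repeated)              # [1, 3, 5, 1, 3, 5]

# The same operator acts differently on strings and ints
print("ab" + "cd", 2 + 3)    # abcd 5
```

Neither operator changes the original lists; each builds a new one.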

You can create a list containing all the letters of the alphabet in alphabetical order using a simple Python list comprehension. Here's the code to do that:

In [54]:
letters = [chr(ord('a')+ i) for i in range(26)]
letters

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

We use a list comprehension to generate the list of letters. `ord('a')` returns the ASCII code of the letter 'a', which is 97. We then use `chr()` to convert the ASCII code back to a character. By adding i to 97, we get the ASCII codes for all the lowercase letters from 'a' to 'z'.



The comprehension runs for `i` values from 0 to 25, so the resulting list `letters` contains all the lowercase letters in alphabetical order. Alternatively, this could be written with an explicit `for` loop:

In [53]:
letters2 = []
for i in range(0,26):
    newletter = [chr(ord("a")+i)]
    letters2 += newletter

letters2

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

We could create a list containing all the integers between 8 and 14 in a number of different ways. Below we use a list comprehension and then the `range()` function wrapped in `list()`.

In [57]:
number_list = [i for i in range(8,15)]

In [62]:
numbers2 = list(range(8,15))
numbers2

[8, 9, 10, 11, 12, 13, 14]

To repeat the number_list four times and save it as a new list named repeat_list, you can use the * operator to replicate the contents of number_list multiple times. It is easiest to write code once you break the code down into easy to accomplish steps:

1. Create the number_list containing integers from 8 to 14

2. Use the * operator. Multiplying a list by an integer n creates a new list with the original list's elements repeated n times.


In [63]:
# Create the number_list
number_list = [i for i in range(8,15)]

# Repeat the number_list four times
repeat_list = number_list * 4

# Print the repeat_list
print(repeat_list)

[8, 9, 10, 11, 12, 13, 14, 8, 9, 10, 11, 12, 13, 14, 8, 9, 10, 11, 12, 13, 14, 8, 9, 10, 11, 12, 13, 14]


In [64]:
# Print the number_list
print(number_list)
# Print the letters list (used here in place of string_list)
print(letters)
# Merge the two lists into merge_list
merge_list = number_list+letters
# Print the merged list
print(merge_list)

[8, 9, 10, 11, 12, 13, 14]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
[8, 9, 10, 11, 12, 13, 14, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
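
The cell above merges `number_list` with `letters`, so the `string_list` task is not actually worked. One possible way to complete it, sketched here with `str()` inside a list comprehension:

```python
number_list = list(range(8, 15))

# Convert each integer to its string form
string_list = [str(n) for n in number_list]

# Concatenate the two lists into one
merged_list = number_list + string_list
print(merged_list)
# [8, 9, 10, 11, 12, 13, 14, '8', '9', '10', '11', '12', '13', '14']
```

The quotes in the printed output are how you can tell the strings apart from the integers.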


# Navigating Lists

To take action on multiple elements of a list, consider using a `for` loop. To write or update list elements you need to specify each element's index, which can be accomplished in a number of ways. Take a look at the code below and observe how it works. The `for` loop runs through the list and updates each element. It uses `len()` to return the length of the list. The `range()` function returns a sequence of indices from 0 to n − 1, where n is the length of the list. Each time through the loop, `i` gets the index of the next element.

In [65]:
num_list = [4,5,3,7,8]
x = len(num_list)
print(x)
for i in range(len(num_list)):
    num_list[i] = num_list[i] * 2
print(num_list)

5
[8, 10, 6, 14, 16]
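
The index-based loop above is one of several ways to do this. As a sketch of two alternatives, `enumerate()` supplies the index and the element together, and a list comprehension builds a new list without touching the original. The name `halved` is made up for this example.

```python
num_list = [4, 5, 3, 7, 8]

# enumerate() yields (index, element) pairs, so no separate lookup is needed
for i, value in enumerate(num_list):
    num_list[i] = value * 2
print(num_list)   # [8, 10, 6, 14, 16]

# A list comprehension builds a new list and leaves num_list unchanged
halved = [value // 2 for value in num_list]
print(halved)     # [4, 5, 3, 7, 8]
```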


# Tuples

Tuples are collections of objects, like lists, but they are immutable - meaning they cannot be changed once created. The values stored in a tuple can be any type. They are indexed by integers. Tuples are also comparable and hashable so we can sort lists of tuples and use them as key values in Python dictionaries.
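
As a small sketch of the two properties just described, the code below shows that assigning to a tuple element raises a `TypeError`, and that a tuple can be used as a dictionary key. The names `point` and `distances` are made up for illustration.

```python
point = (3, 4)

try:
    point[0] = 10          # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# Because tuples are hashable, they can serve as dictionary keys
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[point])    # 5.0
```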

In [66]:
tuple_one = ('a', 'b', 'c', 'd', 'e')
tuple_two = 'a', 'b', 'c', 'd', 'e'
tuple_three = tuple('abcde')

print(type(tuple_one), type(tuple_two), type(tuple_three))


<class 'tuple'> <class 'tuple'> <class 'tuple'>


The comparison operators work with tuples as well. Python starts by comparing the first element from each sequence. If they are equal, it goes on to the next element, and so on, until it finds elements that differ. Subsequent elements are not considered (even if they are really big).
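
A few comparisons make the rule concrete. The values below are arbitrary examples chosen for this sketch.

```python
print((0, 1, 2) < (0, 3, 4))          # True: decided at the second element
print((0, 1, 2000000) < (0, 3, 4))    # True: the third element is never compared
print((1, 'b') < (1, 'a'))            # False: first elements tie, so 'b' vs 'a' decides

pairs = [(2, 'to'), (1, 'a'), (2, 'be')]
print(sorted(pairs))                  # [(1, 'a'), (2, 'be'), (2, 'to')]
```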

The `sort` method works the same way: it sorts primarily by the first element of each tuple and, in the case of a tie, by the second element, and so on. This feature lends itself to a pattern called DSU:

* **Decorate** a sequence by building a list of tuples with one or more sort keys preceding the elements from the sequence,

* **Sort** the list of tuples using the Python built-in `sort`, and

* **Undecorate** by extracting the sorted elements of the sequence.

See how this works in Python by executing the following:

In [68]:
txt = 'if you want to be a badger just come along with me'
words = txt.split()
t = []
for word in words:
    t.append((len(word), word))
    print(t)

[(2, 'if')]
[(2, 'if'), (3, 'you')]
[(2, 'if'), (3, 'you'), (4, 'want')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just'), (4, 'come')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just'), (4, 'come'), (5, 'along')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just'), (4, 'come'), (5, 'along'), (4, 'with')]
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just'), (4, 'come'), (5, 'along'), (4, 'with'), (2, 'me')]


In [69]:
txt = 'if you want to be a badger just come along with me'
words = txt.split()


t = []
for x in words:
    t.append((len(x), x))

print(type(t[2]))
print(type(t))
print(t)
t[2] = (7, 'several')
print(t)
t.sort(reverse=True)

res = []
for length, word in t:
    res.append(word)

print(res)

<class 'tuple'>
<class 'list'>
[(2, 'if'), (3, 'you'), (4, 'want'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just'), (4, 'come'), (5, 'along'), (4, 'with'), (2, 'me')]
[(2, 'if'), (3, 'you'), (7, 'several'), (2, 'to'), (2, 'be'), (1, 'a'), (6, 'badger'), (4, 'just'), (4, 'come'), (5, 'along'), (4, 'with'), (2, 'me')]
['several', 'badger', 'along', 'with', 'just', 'come', 'you', 'to', 'me', 'if', 'be', 'a']
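
The decorate, sort, and undecorate steps can also be expressed with the `key` argument of Python's built-in `sorted()`, which computes the sort key for each element without building the tuples by hand. This is a sketch of that alternative on the original sentence, not a replacement for the pattern above. Note that ties come out differently: `sorted()` is stable, so words of equal length keep their original order, whereas reverse-sorting the tuples also puts tied words in reverse alphabetical order.

```python
txt = 'if you want to be a badger just come along with me'
words = txt.split()

# key=len sorts by word length; reverse=True puts the longest words first
by_length = sorted(words, key=len, reverse=True)
print(by_length)
# ['badger', 'along', 'want', 'just', 'come', 'with', 'you', 'if', 'to', 'be', 'me', 'a']
```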


In [70]:
txt = 'if you want to be a badger just come along with me'
words = txt.split()
type(words)


list

# Dictionaries

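Dictionaries are collections of key-value pairs. A 'dict' is made of keys (each of which must be unique) and associated values (which need not be unique). Values are looked up by key rather than by position, although since Python 3.7 dictionaries do preserve the order in which items were inserted. Dicts are made using curly brackets.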

In [71]:
ways_to_score = {'touchdown':6, 'field_goal':3, 'safety':2, 'two_point_conversion':2, 'extra_point':1}   
print(ways_to_score)

print(ways_to_score['extra_point'])

{'touchdown': 6, 'field_goal': 3, 'safety': 2, 'two_point_conversion': 2, 'extra_point': 1}
1


Dictionaries do not work in reverse. The key will unlock the value, but the value will not recover the key.

In [73]:
print(ways_to_score[2])

KeyError: 2
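
If you do need to go from a value back to its keys, you have to search the dictionary yourself. A minimal sketch using `.items()`, along with `.get()` as a way to avoid the `KeyError`; the name `two_pointers` is made up for this example.

```python
ways_to_score = {'touchdown': 6, 'field_goal': 3, 'safety': 2,
                 'two_point_conversion': 2, 'extra_point': 1}

# Collect every key whose value is 2 (values need not be unique)
two_pointers = [play for play, points in ways_to_score.items() if points == 2]
print(two_pointers)   # ['safety', 'two_point_conversion']

# .get() returns a default instead of raising KeyError for a missing key
print(ways_to_score.get(2, 'no such key'))   # no such key
```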

To add an element to a dictionary or modify an existing one, simply assign to the key (whether it exists or not).

In [74]:
ways_to_score['touchdown'] = 6.0
print(ways_to_score)

ways_to_score['blocked_extra_point_return'] = 2
print(ways_to_score)

{'touchdown': 6.0, 'field_goal': 3, 'safety': 2, 'two_point_conversion': 2, 'extra_point': 1}
{'touchdown': 6.0, 'field_goal': 3, 'safety': 2, 'two_point_conversion': 2, 'extra_point': 1, 'blocked_extra_point_return': 2}


Ranges, lists, and strings are all 'iterable'. Ranges iterate over whole numbers, while lists iterate over the elements of the list. Strings iterate through the characters in the string, and dicts iterate over their keys.
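
The same `for` syntax works for each of these types. A short sketch, with a small `scores` dict invented for the example:

```python
for n in range(3):
    print(n)                          # 0, then 1, then 2

for item in ['ECON', 'is', 'awesome']:
    print(item)                       # one list element per pass

for ch in 'UW':
    print(ch)                         # U, then W

scores = {'touchdown': 6, 'field_goal': 3}
for key in scores:                    # iterating a dict yields its keys
    print(key, scores[key])

for key, value in scores.items():     # .items() yields (key, value) pairs
    print(key, value)
```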

Iteration is the foundation of big data analysis, so it is important that we get comfortable looping with these objects.

# Practice with Dicts & Lists


1. Create a dict with keys for each college in the Big Ten conference. Give each key a value corresponding to how many other Big Ten schools are located in the same state. Print the dictionary you just created. 

2. Now assume that in the next round of realignment Notre Dame, Iowa State, & Pitt join the Big Ten conference. Revisit your previous work with this update and print your dict when you are done. 

3. Loop through the following list: teams = ['University of Wisconsin-Madison','University of Iowa','University of Wisconsin-Madison','University of Michigan','University of Wisconsin-Madison','University of Minnesota']. If the team is 'University of Wisconsin-Madison' print out the phrase 'Go Big Red'. If any other team arises print out the word 'BOOOOOOOOOOO'.

In [76]:
#1

big_ten = {"Nebraska":0,
           "Iowa":0,
           "Minnesota":0,
           "Illinois":1,
           "Northwestern":1,
           "Wisconsin":0,
           "Indiana":1,
           "Purdue":1,
           "Michigan":1,
           "Michigan State":1,
           "Ohio State":0,
           "Penn State":0,
           "Rutgers":0,
           "Maryland":0}
print(big_ten)


{'Nebraska': 0, 'Iowa': 0, 'Minnesota': 0, 'Illinois': 1, 'Northwestern': 1, 'Wisconsin': 0, 'Indiana': 1, 'Purdue': 1, 'Michigan': 1, 'Michigan State': 1, 'Ohio State': 0, 'Penn State': 0, 'Rutgers': 0, 'Maryland': 0}


In [78]:
#2
big_ten["Notre Dame"] = 2
big_ten["Pitt"] = 1
big_ten["Iowa State"] = 1
big_ten["Iowa"] = 1
big_ten["Penn State"] = 1
big_ten["Indiana"] = 2
big_ten["Purdue"] = 2
print(big_ten)

{'Nebraska': 0, 'Iowa': 1, 'Minnesota': 0, 'Illinois': 1, 'Northwestern': 1, 'Wisconsin': 0, 'Indiana': 2, 'Purdue': 2, 'Michigan': 1, 'Michigan State': 1, 'Ohio State': 0, 'Penn State': 1, 'Rutgers': 0, 'Maryland': 0, 'Notre Dame': 2, 'Pitt': 1, 'Iowa State': 1}


In [80]:
#3 
teams = ['University of Wisconsin-Madison','University of Iowa','University of Wisconsin-Madison','University of Michigan','University of Wisconsin-Madison','University of Minnesota']

for t in teams:
    if t == "University of Wisconsin-Madison":
        print("Go Big Red!")
    else:
        print("BOOOOOOOOOOOOOOOOOO")

Go Big Red!
BOOOOOOOOOOOOOOOOOO
Go Big Red!
BOOOOOOOOOOOOOOOOOO
Go Big Red!
BOOOOOOOOOOOOOOOOOO


# Files

When we want to read or write a file (say on your hard drive), we first must open the file. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file, you are asking the operating system to find the file by name and make sure the file exists. In this example, we open the file mbox.txt, which should be stored in the same folder that you are in when you start Python. You can download this file from www.py4e.com/code3/mbox.txt

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

In Python, we represent the newline character as a backslash-n in string constants. Even though this looks like two characters, it is actually a single character. When we look at the variable by entering “stuff” in the interpreter, it shows us the \n in the string, but when we use print to show the string, we see the string broken into two lines by the newline character.
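
The paragraph above refers to a variable named `stuff` that is not defined in these notes. A minimal sketch, assuming `stuff` holds a two-line string:

```python
stuff = 'Hello\nWorld!'

print(repr(stuff))   # 'Hello\nWorld!' (what the interpreter echoes)
print(stuff)         # prints Hello and World! on separate lines
print(len(stuff))    # 12: the newline counts as a single character
```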

While the file handle does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:

In [None]:

fhand = open('mbox.txt') # The open() function opens a file, and returns it as a file object.
count = 0
for line in fhand:
    count = count + 1
print('Line Count:', count)

We can use the file handle as the sequence in our for loop. Our for loop simply counts the number of lines in the file and prints the total. The rough translation of the for loop into English is, “for each line in the file represented by the file handle, add one to the count variable.”

The reason that the open function does not read the entire file is that the file might be quite large with many gigabytes of data. The open statement takes the same amount of time regardless of the size of the file. The for loop actually causes the data to be read from the file.

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read method on the file handle.

In [None]:
fhand = open('mbox.txt', 'r')
inp = fhand.read()
print(len(inp))
print(inp[:20])

The `open` function is usually called with two arguments, the filename and the mode. There are four basic modes, with read being the default when no mode is specified. 

* "r" - Read - Default value. Opens a file for reading, error if the file does not exist

* "a" - Append - Opens a file for appending, creates the file if it does not exist

* "w" - Write - Opens a file for writing, creates the file if it does not exist and erases its contents if it does

* "x" - Create - Creates the specified file, returns an error if the file exists

When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For example, if we wanted to read a file and only print out lines which started with the prefix “From:”, we could use the string method `startswith()` to select only those lines with the desired prefix:

In [None]:
fhand = open('mbox-short.txt')
count = 0
for line in fhand:
    if line.startswith('From:'):
        print(line)

Each of the lines ends with a newline, so the print statement prints the string in the variable line which includes a newline and then print adds another newline, resulting in the double spacing effect we see.

We could use line slicing to print all but the last character, but a simpler approach is to use the rstrip method which strips whitespace from the right side of a string as follows:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

To write a file, you have to open it with mode “w” as a second parameter. If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn’t exist, a new one is created.

In [None]:
fout = open('output.txt', 'w')

The write method of the file handle object puts data into the file, returning the number of characters written. The default write mode is text for writing (and reading) strings.

In [None]:
line1 = "This here's the wattle,\n"
fout.write(line1)
line2 = 'the emblem of our land.\n'
fout.write(line2)
fout.close()

When you are done writing, you have to close the file to make sure that the last bit of data is physically written to the disk so it will not be lost if the power goes off.
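
Python's `with` statement closes the file for you when the indented block ends, even if an error occurs inside it. The sketch below repeats the write example that way and then reads the file back. Writing into a temporary directory is an assumption made here so the example does not touch your working folder.

```python
import os
import tempfile

# Write to a throwaway directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w') as fout:          # the file is closed when the block ends
    fout.write("This here's the wattle,\n")
    fout.write('the emblem of our land.\n')

with open(path) as fhand:              # default mode is 'r'
    lines = [line.rstrip() for line in fhand]

print(lines)
print(fout.closed, fhand.closed)       # True True
```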

# Pandas Fundamentals 

`pandas` is an open-source Python library that provides data structures and data analysis tools for working with structured data. It is widely used in data science, data analysis, and data manipulation tasks. Pandas is built on top of the NumPy library and provides easy-to-use data structures such as Series and DataFrame, which are designed to handle and manipulate data efficiently. `pandas` is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an Excel spreadsheet

* Ordered and unordered (not necessarily fixed-frequency) time series data.

* Matrix data (homogeneously typed or heterogeneous) with row and column labels

* Any observational / statistical data sets.


The two primary data structures of pandas are Series (1-dimensional) and DataFrame (2-dimensional). A DataFrame does essentially everything that R’s `data.frame` does and then some. `pandas` is built on top of NumPy and integrates well with many other 3rd party libraries.


## Pandas Data Frames

You can think of a data frame as the Python analog to an Excel spreadsheet; it is a simple table with an effectively **unlimited** number of rows and columns (bounded only by your computer's memory). Typically the rows of a data frame will reference the observational unit while columns will reference variables describing each observational unit. In a data frame the information is **related** across both rows and columns. 

### Creating Data Frames

Data frames are created in Anaconda Python using the `DataFrame()` constructor. To call `DataFrame()` from the pandas package you should type something similar to `pd.DataFrame()`, where the prefix can vary depending on how you have imported pandas. The keys of the dictionary passed as the `data` argument are used as the column names in the data frame, and the corresponding values are the variables of the data frame (columns).

Use the code below to create your own data frame. The code initially creates a data frame with 4 observations of 3 variables, but you should edit this to create data frames of varying dimensionality. The name that has been assigned to this particular data frame is df, which you will find is a common 'pythonic' naming convention. The variables have been named X, Y, and Z.

In [None]:
import pandas as pd  #load the pandas package and call it pd

df = (
    pd.DataFrame(data={ 
        'X': [1, 2, 3, 4],
        'Y': [5, 3, 2, 1],
        'Z': [4, 2, 1, 7]}))

We see that both a `DataFrame` (`df`) and a `module` (`pd`) have been created. In a notebook you can use `?` to check them out. Other packages that you will frequently use in this course are numpy and scipy. As a **quick aside** we can call for those to load as well and inspect them to see how they might be used.

In [None]:
pd?

A full list of the packages included in the Anaconda distribution is [here](https://docs.anaconda.com/anaconda/packages/pkg-docs/). Should you need to access a package that is not pre-loaded into Anaconda you can almost certainly install it via the Conda prompt. We will work through some examples together later in the course, but for now we will be satisfied with the modules already accessible.

**Returning to our data frame**, note that the name and the assignment appear on one line of code, ending in an open parenthesis. This allows the statement to continue over multiple lines. The name of the function, here `pd.DataFrame()`, is on a separate line and indented, as is the name of each column. This nesting is not necessary, but it makes the data structure clear to the reader and spotlights the functions and methods being used.

The *data parameter* is given inside `{` and `}`. Remember that these curly braces indicate a Python dictionary object. Here the dict maps the names X, Y, and Z to lists. Lists are delimited by `[` and `]`, and the values in a list are separated by commas. Each list in this assignment statement becomes a column of the data frame. You can see what the data looks like by printing it just as you would any other object:

In [None]:
print(df)                     # Print the dataframe. 
print('\n', type(df))

Dataframes are easy to read when printed. Let's break it down: 

This created a DataFrame object. The dotted prefix in front of `DataFrame` in the printout describes where the class lives: `DataFrame` is defined in the `frame` module, which is part of the `core` subpackage of the `pandas` package. All you need to know is that you have just called the `DataFrame()` function from the pandas (`pd.`) package. The `DataFrame()` function creates data frames from other objects. If we don't pass an argument, it creates an empty DataFrame. For example:

In [None]:
df_empty = pd.DataFrame()
print(df_empty)

Circling back to our first data frame (`df`), let's reference the `shape` attribute of a DataFrame to see how big our data is. Run the following to see if your mental mapping matches how pandas has stored the data:

In [None]:
print('Data frame shape:', df.shape)   

print('Data frame size:', df.size)  

print('Data frame types:', df.dtypes)
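
Because `shape` is a tuple of (rows, columns), it can be unpacked into separate names. The sketch below rebuilds the same toy data frame so that it runs on its own; the names `n_rows` and `n_cols` are chosen here only for illustration.

```python
import pandas as pd

# Same toy data frame as above, rebuilt so this cell is self-contained
df = pd.DataFrame(data={'X': [1, 2, 3, 4],
                        'Y': [5, 3, 2, 1],
                        'Z': [4, 2, 1, 7]})

n_rows, n_cols = df.shape        # shape is a (rows, columns) tuple
print('Observations:', n_rows)
print('Variables:', n_cols)
print('Total cells:', df.size)   # size equals rows * columns
```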

A more visual inspection of the data frame can be achieved without the `print()` function: the Jupyter notebook adds some nice formatting that makes it easy to identify variable values for each observation.

### The Data Frame Index

Notice the column of numbers on the left-hand side of the data frame. It does not have a header. This is the **index** that pandas uses to tell our observations (rows) apart. The index does not need to run from 0 to n. For example, when working with a time series it would be advantageous for the index to be a time variable. The index can be altered just like any of the other columns. To reference a column in the DataFrame, simply use the name in the header. To print the last column on the right you would type:

In [None]:
print(df['Z'])

Printing a column returns both the index and the column, as well as the type of data contained in the column. 

In [None]:
z_var = df['Z']
print(z_var)
print(type(z_var))

When we extract a single column from a DataFrame, we are given a **Series**.

In [None]:
print(type(df['Z']))
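
Returning to the index for a moment: the remark that it need not run from 0 to n can be sketched with a date index. This is only an illustration; the dates and the name `df_dated` are hypothetical and not part of the course data.

```python
import pandas as pd

# Same toy data frame as above, rebuilt so this cell is self-contained
df = pd.DataFrame(data={'X': [1, 2, 3, 4],
                        'Y': [5, 3, 2, 1],
                        'Z': [4, 2, 1, 7]})

# Replace the default 0..n-1 index with four consecutive (made-up) dates
df_dated = df.copy()
df_dated.index = pd.date_range(start='2023-09-18', periods=4, freq='D')

print(df_dated)
# Look up one observation by its date label instead of by a row number
print(df_dated.loc[pd.Timestamp('2023-09-19')])
```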

When working with big data it is likely you will only want to view small portions of the whole. The beginning of a data frame can be displayed in table format with the `head()` data frame method. If the command below shows you the 'head' of the dataset, how would you find the 'tail'?

In [None]:
print(df.head(1))

## Practice Exercise

1. Create a data frame with seven observations of four variables. Name the variables john, paul, george, and ringo. Make the first column the first seven positive odd numbers (1, 3, 5, and so on). Make each subsequent column twice the value of the column to its left.

2. Print out the number of observations and variables in the data frame from (1).

## Operations on Rows and Columns

Many of the data files we will be working with are organized as a table (rows and columns). Each row is recorded on a separate line of the file and columns separated by a delimiting character. These files can be viewed using text editors such as Notepad. 

A CSV file is a very common type of delimited file that uses commas as the delimiters. Other common separators in delimited files are tabs and pipes (`|`). Such files are the same as CSV files except for the separator, and they often have a `.txt` or `.dat` extension.

Importing data is straightforward as long as you give pandas good directions, that is, the correct path to the file.

In [None]:
wine_path = 'WineData.csv'
wines = pd.read_csv(wine_path)
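
For delimited files that do not use commas, `read_csv` accepts a `sep` argument. The sketch below reads a small pipe-delimited table from an in-memory string so that it runs without any file on disk; the contents and the names `raw_text` and `pipe_df` are made up for illustration.

```python
import io
import pandas as pd

# A tiny pipe-delimited "file" held in memory (hypothetical contents)
raw_text = "name|year|score\nA|2019|88\nB|2020|91\n"

# sep tells read_csv which character separates the columns
pipe_df = pd.read_csv(io.StringIO(raw_text), sep='|')
print(pipe_df)
```

For a real file you would pass its path in place of the `io.StringIO` object, and `sep='\t'` for tab-delimited data.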

The following displays the dimensions of the data frame (rows, columns) and the type of each column.

In [None]:
print(wines.shape)
print(wines.dtypes)

We can also calculate summary statistics for each variable in the dataframe:

In [None]:
# Calculate summary statistics
print(df.describe())


Pandas provides an indexer called `loc` that can retrieve rows from the data frame. Rows can also be selected using the `iloc` indexer. The two are similar, but there is a subtle difference: `loc` selects rows and columns by label, whereas `iloc` selects rows and columns by integer position. With the default integer index, row labels and row positions coincide, so the distinction matters little for rows at this point, but it will be important when working with columns.

In [None]:
df

Now let's change the row index from a series of ascending numbers to a series of labels with no discernable order:

In [None]:
df1 = df.copy() #create a copy so the original data can still be accessed
df1.index = ['Row_1', 'Row_2', 'Row_3', 'Row_4'] #reset index values

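**Note**: By default, `copy()` makes a deep copy (`deep=True`), so changes to `df1` do not affect `df`. With the parameter `deep=False`, only references to the data and index are copied. This is a shallow copy: changes made to the data of the original are reflected in the copy, and changes made in the copy are reflected in the original (unless pandas' Copy-on-Write mode is enabled, in which case the two no longer affect each other).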

In [None]:
df1
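
As a quick check of the copy behaviour, the sketch below changes a value in a default (deep) copy and confirms that the original is untouched. The name `df_deep` is chosen here only for illustration.

```python
import pandas as pd

# Same toy data frame as above, rebuilt so this cell is self-contained
df = pd.DataFrame(data={'X': [1, 2, 3, 4],
                        'Y': [5, 3, 2, 1],
                        'Z': [4, 2, 1, 7]})

df_deep = df.copy()            # deep=True is the default
df_deep.loc[0, 'X'] = 100      # change a value in the copy only

print(df.loc[0, 'X'])          # value in the original data frame
print(df_deep.loc[0, 'X'])     # value in the modified copy
```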

At this point, the difference between `loc` and `iloc` becomes more important. See the examples below where we reference the same data using different methods:

In [None]:
row = df1.iloc[3] #use the row position to access the data in the X,Y,Z columns
row

In [None]:
row = df1.loc['Row_4'] #use the row label to access the data in the X,Y,Z columns
row
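
The two lookups above return the same row. A further difference shows up when slicing: `iloc` excludes the end position, as ordinary Python slices do, while `loc` includes the end label. The sketch below rebuilds the labelled data frame so that it runs on its own; the names `by_position` and `by_label` are illustrative.

```python
import pandas as pd

# Labelled version of the toy data frame, rebuilt so this cell is self-contained
df1 = pd.DataFrame(data={'X': [1, 2, 3, 4],
                         'Y': [5, 3, 2, 1],
                         'Z': [4, 2, 1, 7]},
                   index=['Row_1', 'Row_2', 'Row_3', 'Row_4'])

by_position = df1.iloc[3]        # fourth row, counting from 0
by_label = df1.loc['Row_4']      # row whose label is 'Row_4'
print(by_position.equals(by_label))

# Position slice stops before position 2, so it returns two rows
print(df1.iloc[0:2])
# Label slice includes 'Row_3', so it returns three rows
print(df1.loc['Row_1':'Row_3'])
```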

To select a particular column, put the name of the column in square brackets after the data frame. It is also common to use the `loc` indexer, which takes index labels (and, optionally, column labels). Its counterpart `iloc` accepts only integer positions.

In [None]:
column = df1['X']
column

In [None]:
#double square brackets return a one-column DataFrame, which displays as a table
column = df1[['X']]
column

In [None]:
#Find a particular cell value by row label and column label
cell = df1.loc['Row_4', 'Z']
cell
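
A single cell can be addressed by giving both a row and a column to `loc` or `iloc`. The sketch below shows three equivalent lookups on the labelled data frame, rebuilt here so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Labelled version of the toy data frame, rebuilt so this cell is self-contained
df1 = pd.DataFrame(data={'X': [1, 2, 3, 4],
                         'Y': [5, 3, 2, 1],
                         'Z': [4, 2, 1, 7]},
                   index=['Row_1', 'Row_2', 'Row_3', 'Row_4'])

cell_by_label = df1.loc['Row_4', 'Z']    # row label, column label
cell_by_position = df1.iloc[3, 2]        # row position, column position
cell_with_at = df1.at['Row_4', 'Z']      # .at looks up a single value by label

print(cell_by_label, cell_by_position, cell_with_at)
```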

## Renaming Columns or Indices of a DataFrame

We previously re-indexed the data by assigning new labels to the `.index` attribute. To give the columns different names, it's best to use the `.rename()` method.

In [None]:
df1

In [None]:
newnames = {'X': 'A', 'Y':'B', 'Z':'C'}
df1.rename(columns=newnames, inplace=True)
df1

In [None]:
#rename just one column at a time
df1.rename(columns={'A':'a'}, inplace=True)
df1
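
The `rename()` method can also relabel rows through its `index` argument, and without `inplace=True` it returns a new data frame and leaves the original alone. The sketch below assumes the labelled data frame with its original X, Y, Z column names; the new names `'A'` and `'first'` and the variable `renamed` are chosen only for illustration.

```python
import pandas as pd

# Labelled toy data frame with its original column names
df1 = pd.DataFrame(data={'X': [1, 2, 3, 4],
                         'Y': [5, 3, 2, 1],
                         'Z': [4, 2, 1, 7]},
                   index=['Row_1', 'Row_2', 'Row_3', 'Row_4'])

# Rename one column and one row label; the result is a new data frame
renamed = df1.rename(columns={'X': 'A'}, index={'Row_1': 'first'})

print(renamed)
print(df1.columns.tolist())   # the original column names are unchanged
```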

## Practice Exercises

Take 10 minutes to complete the following tasks. If necessary, refer to the tasks we worked through previously in the notebook - but try your best to work from memory. 

1. Import the "Car.csv" data set.

2. Print the type of each variable of the Car data set.

3. Return the number of observations in the Car data set.

4. Print all of the information associated with the Pontiac Firebird.

5. Rename the disp column "Engine Displacement".

6. Return the Engine Displacement of the Datsun 710 model.

In [None]:
#1


In [None]:
#2


In [None]:
#3



In [None]:
#4



In [None]:
#5




In [None]:
#6


