# Python Practice Lecture 4 MATH 342W Queens College
# - More Data Structures, Pandas, and Functions
## Author: Amir ElTabakh
## Date: February 8, 2022

In this demo we will look at other ways of storing elements, such as tuples and sets. We will then explore dictionaries a little more, introduce the Pandas DataFrame object, and lastly take a look at functions.

Before jumping into tuples or sets, let's jog our memories with lists.

In [None]:
# List of some classes in the Data Science major
DS_classes = ['Math 341', 'CS 111', 'Math 231']

# Output element at index 0
DS_classes[0]

# Output second to last element
DS_classes[-2]

# Change the last element to value 'Math 241'
DS_classes[-1] = 'Math 241'

# Add the value 'CS 211' to end of the list
DS_classes += ['CS 211']

# Remove the element at index 1
DS_classes.pop(1)

print(DS_classes)

## Tuples

Another data structure that's similar to the list is the tuple. A tuple is an *immutable* object, whereas a list is a *mutable* object. To create a tuple, use parentheses `()`; as we've seen, lists use square brackets `[]`.

In [None]:
# Tuple of apple cultivars
apples = ("Golden", "Red", "Fuji", "Granny")

# Output the data type of the apples object
type(apples)

In [None]:
# Call the first element in apples
apples[0]

# Call the first three elements in the apples
apples[0:3]

# Change the first element of apples to 'Honeycrisp'
apples[0] = "Honeycrisp"

# What went wrong?
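
The assignment above fails because tuples are immutable. As a minimal sketch, one way to observe the error without stopping the notebook, and to build a modified copy instead, is shown below. The names `error_message` and `new_apples` are illustrative and not part of the original demo.

```python
apples = ("Golden", "Red", "Fuji", "Granny")

# Item assignment on a tuple raises a TypeError
try:
    apples[0] = "Honeycrisp"
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)

# To "change" a tuple, build a new tuple instead
new_apples = ("Honeycrisp",) + apples[1:]
print(new_apples)
```

The original `apples` tuple is left untouched; `new_apples` is a separate object.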

We have seen casting before: it is the process of converting an object from one data type to another. You can also cast a tuple as a list, and a list as a tuple.

In [None]:
# Creating a list
programming_langs = ["Python", "C++", "Java", "JavaScript"]

# Print the type of the programming_langs object
print(type(programming_langs))

# Cast the programming_langs object as a tuple
programming_langs_tuple = tuple(programming_langs)

# Print the type of the programming_langs_tuple object
print(type(programming_langs_tuple))

In [None]:
# Print the type of the object to confirm it is a tuple
print(type(programming_langs_tuple))

# Cast a tuple to a list
programming_langs_list = list(programming_langs_tuple)

# Print the type of the programming_langs_list object
print(type(programming_langs_list))

# Let's try to change the first element of programming_langs_list to 'C#'
programming_langs_list[0] = 'C#'

print(programming_langs_list)

## Sets

Sets are used to store multiple elements in a single variable, similarly to lists, tuples, and dictionaries. The set is the last of the 4 built-in data types in Python used to store collections of data. Each has different qualities and uses.

* Lists: Mutable, ordered
* Tuples: Immutable, ordered
* Dictionaries: Mutable, key-value paired, does not allow duplicate keys
* Sets: Mutable as a whole (elements can be added or removed, but not changed in place), unordered, does not allow duplicate elements

Elements in sets cannot be changed, but you may remove elements and add new ones.

In [None]:
# Set of cat breeds
cat_set = {"Siamese", "Bengal", "Calico", "Chartreux"}

# Print the length of the set
print(len(cat_set))
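
To make the point about adding and removing set elements concrete, here is a short sketch using the built-in set methods `add()`, `remove()`, and `discard()`. The extra breed names and the `unique_numbers` variable are illustrative only.

```python
cat_set = {"Siamese", "Bengal", "Calico", "Chartreux"}

# Adding a new element grows the set
cat_set.add("Sphynx")

# Adding an element that is already present has no effect
cat_set.add("Bengal")

# remove() deletes an element and raises a KeyError if it is missing;
# discard() deletes it only if present and never raises
cat_set.remove("Calico")
cat_set.discard("Persian")

print(len(cat_set))
print("Bengal" in cat_set)

# Duplicates are dropped when a list is cast to a set
unique_numbers = set([1, 1, 2, 3, 3])
print(unique_numbers)
```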

## Back to dictionaries

If we want to talk about Pandas DataFrames, we should explore dictionaries a little bit more. Below is an example of a dictionary; let's practice some new operations on it.

Note that below we use an f-string in the print statement. An f-string allows the programmer to directly add the value of a variable into the string using curly braces. Notice how we did not have to cast the non-string values into strings for this to work.
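
As a small standalone sketch of the f-string behavior described above (the variable names here are illustrative), values of any type and even whole expressions can go inside the curly braces:

```python
driver = 'Max Verstappen'
age = 24

# Variables of any type can be placed inside the curly braces
sentence = f"{driver} is {age} years old."

# Expressions are evaluated before being inserted into the string
next_year = f"Next year he will be {age + 1}."

print(sentence)
print(next_year)
```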

In [None]:
athlete_1 = {'Name' : 'Max Verstappen',
             'Sport' : 'Formula 1',
             'Team' : 'Red Bull Racing',
             'WDC' : 1,
             'Age' : 24
             }

# Use a for loop to iterate over the key-value pairs in the dictionary
for k, v in athlete_1.items():
    print(f'Key: {k} - Value: {v}')

We can use the `in` and `not in` operators to check whether an element exists in a list, tuple, or set. Applied directly to a dictionary, they check its keys; use `.values()` to check its values.

In [None]:
'Name' in athlete_1.keys()

In [None]:
'Name' not in athlete_1.values()

In [None]:
'Sport' in athlete_1
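
The three cells above can be summarized in one sketch. It uses a shortened, illustrative copy of the dictionary (named `athlete` here) to show that `in` on a dictionary looks at keys unless `.values()` is requested explicitly.

```python
athlete = {'Name': 'Max Verstappen', 'Sport': 'Formula 1', 'Age': 24}

# `in` applied directly to a dictionary checks the keys
key_found = 'Name' in athlete

# A value is not found when searching the keys
value_as_key = 'Max Verstappen' in athlete

# Search the values explicitly with .values()
value_found = 'Max Verstappen' in athlete.values()

print(key_found, value_as_key, value_found)
```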

### The `get()` method

It can become tedious to check whether a key exists in a dictionary before accessing that key's value. Fortunately, dictionaries have a `get()` method that takes two arguments:

- The key of the value to retrieve
- A fallback value to return if that key does not exist

In [None]:
athlete_1 = {'Name' : 'Max Verstappen',
             'Sport' : 'Formula 1',
             'Team' : 'Red Bull Racing',
             'WDC' : 1,
             'Age' : 24
             }

# Print the athlete's name
print(f"The name of the athlete is {athlete_1.get('Name')}.")

In [None]:
# Print the athlete's team
team = athlete_1.get('Team')
print(f"The athlete plays for {team}.")

# Print the athlete's nationality
fallback = "-oh, I am not sure."
print(f"The athlete is from {athlete_1.get('Nationality', fallback)}.")

In [None]:
# What happens if you use the get method to find the value of a key that does not exist in the dictionary?
print(athlete_1.get('Nationality'))
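
For contrast, here is a brief sketch of what happens with square-bracket access versus `get()` when a key is missing. The dictionary is a shortened, illustrative copy named `athlete`.

```python
athlete = {'Name': 'Max Verstappen', 'Team': 'Red Bull Racing'}

# Square brackets raise a KeyError when the key is missing
try:
    nationality = athlete['Nationality']
except KeyError:
    nationality = 'unknown (KeyError was raised)'

# get() returns None when no fallback is given ...
no_fallback = athlete.get('Nationality')

# ... or the fallback value when one is supplied
with_fallback = athlete.get('Nationality', 'not recorded')

print(nationality)
print(no_fallback)
print(with_fallback)
```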

## Pandas

Pandas is a Python library used for data manipulation and analysis. Its name is a play on "Python Data Analysis", and it was published as an open-source library in 2009 by Wes McKinney.

Pandas does not come with standard Python. Python is open source and developers are creating new libraries all the time. These developers can upload their packages as open source for others to install and use! To install Pandas on our machine we will pip install it. pip is the standard package manager for Python; it allows you to install and manage additional packages. The Python installer installs pip, so it should be ready for us to use. Verify that pip is installed by running the following command:

In [None]:
!pip --version

The cell above should return the version of your pip as well as where it is stored on your machine. Note that when using a notebook, such as this one in Jupyter, we can run shell commands by starting a line with an exclamation mark `!`.

In [None]:
# Update pip
!python -m pip install --upgrade pip

In [None]:
# Install Pandas on your machine
!pip install pandas

Now that we've installed Pandas, let's import the library. Note that we only have to install a library once per machine, but we have to import it in every program in which we wish to use it.

---

Pandas is the most common Python library for data analytics and data wrangling. Thankfully there's a lot of documentation for us to use in case we get stuck.

https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide

What is a Pandas DataFrame? Well, let's refer to the documentation.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame

A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns. 

Features of DataFrame
- Columns can be of different types
- Size-mutable (rows and columns can be added or removed)
- Labeled axes (rows and columns)
- Can perform arithmetic operations on rows and columns
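
The features listed above can be sketched with a tiny made-up table. The `snacks` DataFrame and its column names are illustrative and separate from the `menu` example in this demo.

```python
import pandas as pd

# Columns can hold different types (strings, floats, booleans)
snacks = pd.DataFrame({
    'item': ['Snickers', 'Twix'],
    'price': [0.25, 0.49],
    'in_stock': [True, False],
})

# Size-mutable: a new column can be added after creation
snacks['quantity'] = [10, 4]

# Arithmetic on columns is applied element by element
snacks['total_value'] = snacks['price'] * snacks['quantity']

# Labeled axes: columns have names, rows have an index
print(list(snacks.columns))
print(list(snacks.index))
print(snacks.shape)
```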

In [None]:
# import the pandas library
import pandas as pd

# Create a dictionary of the chocolate bars
menu = {'Item Name': ['Snickers', 'Twix', 'KitKat', 'M&Ms'],
       'Price':[0.25, 0.49, 2.50, 1.00],
       'Mini':[1, 1, 0, 0],
       'Family Size':[False, False, True, False]
       }

# Output dictionary
menu

In [None]:
# Convert the dictionary above into a Pandas DataFrame
menu_df = pd.DataFrame(menu)

menu_df

This is our first DataFrame; let's practice some useful operations on it.

In [None]:
# Get the Mini column
menu_df[['Mini']]

In [None]:
# Get the Price and Family Size columns
menu_df[['Price', 'Family Size']]

In [None]:
menu_df['Item Name']

The difference between this cell and the one two cells above it is the single square brackets. Let's check the data type of each.

We find that a single column selected with single brackets is of the Series data type. What is a Series?

### Series
The Pandas Series object is a one-dimensional 'ndarray' with axis labels. It is analogous to an indexed one dimensional column vector. We will explore them more in future demos. Back to dataframes.

In [None]:
print(type(menu_df['Item Name']))
print(type(menu_df[['Item Name']]))
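
To make the description of a Series as a labeled one-dimensional array more concrete, here is a minimal sketch. The `prices` Series and its labels are illustrative.

```python
import pandas as pd

# A Series pairs each value with an axis label (the index)
prices = pd.Series([0.25, 0.49, 2.50], index=['Snickers', 'Twix', 'KitKat'])

# Access by label
twix_price = prices['Twix']

# Access by integer position
first_price = prices.iloc[0]

# The underlying values are available as a NumPy ndarray
values = prices.to_numpy()

print(twix_price, first_price)
print(type(values))
```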

In [None]:
# Return the data type of each column in menu_df
menu_df.dtypes

In [None]:
# Print a concise summary of menu_df
menu_df.info()

In [None]:
# Print out the size of menu_df (the total number of elements: rows times columns)
print(menu_df.size)
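
Since `.size` is easy to confuse with the number of rows, here is a brief sketch comparing `.size`, `.shape`, and `len()`. It uses a reduced, illustrative copy of the menu, named `small_menu_df`.

```python
import pandas as pd

small_menu_df = pd.DataFrame({
    'Item Name': ['Snickers', 'Twix', 'KitKat', 'M&Ms'],
    'Price': [0.25, 0.49, 2.50, 1.00],
    'Mini': [1, 1, 0, 0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = small_menu_df.shape

# size is the total number of cells, i.e. rows * columns
n_cells = small_menu_df.size

# len() of a DataFrame is the number of rows
n_len = len(small_menu_df)

print(n_rows, n_cols, n_cells, n_len)
```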

Nothing too exciting here. We only just explored how to gather different characteristics of our DataFrame; let's take a deeper dive.

Let's create a more complex DataFrame.

In [None]:
list_of_names = ["Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
                      "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
                      "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
                      "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
                      "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
                      "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
                      "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
                      "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
                      "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
                      "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
                      "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
                      "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
                      "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
                      "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
                      "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
                      "Lincoln"]

# Generating list of n many salaries
from random import gauss

n = len(list_of_names)
mu = 50000
sigma = 20000
list_of_salaries = []

for i in range(n):
    list_of_salaries += [int(gauss(mu, sigma))]
    

# Generating list of n many past_crime_severity values
# We will use the numpy library
import numpy as np
items = ["no crime", "infraction", "misdemeanor", "felony"]
probs = [.50, .40, .08, .02]
list_of_past_crime_severity = np.random.choice(items, n, p = probs) # run `help(np.random.choice)` to read documentation


# Generating list of n many has_past_unpaid_loan values

list_of_has_past_unpaid_loan = np.random.binomial(n = 1, size = n, p = 0.2)




# Initializing Pandas DataFrame
df = pd.DataFrame({'Salary' : pd.Series(list_of_salaries, index = list_of_names),
                  'past_crime_severity' : pd.Series(list_of_past_crime_severity, index = list_of_names),
                  'has_past_unpaid_loan' : pd.Series(list_of_has_past_unpaid_loan, index = list_of_names)})

df

In [None]:
# Lets read documentation to see what univariate distributions are available in the numpy.random module
import numpy
help(numpy.random)

In [None]:
# Capture a snapshot of df
df.head(10) # default is 5

50% of people have no crime, 40% have an infraction, 8% a misdemeanor and 2% a felony. There is a 20% chance that any individual has a past unpaid loan. Is this a reasonable fabrication of this dataset? No... since salary and not paying back a loan are dependent random variables. But... we will ignore this for now.

It would be nice to see a summary of values. Would median and mean be appropriate here? Not for categorical variables!
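
For a categorical variable, counts and proportions are a more natural summary than a mean or median. The sketch below uses the pandas `value_counts()` and `mode()` methods on a tiny made-up Series; the five values are illustrative and are not drawn from `df`.

```python
import pandas as pd

# Tiny illustrative categorical column (not the simulated data above)
severity = pd.Series(["no crime", "infraction", "no crime", "felony", "no crime"])

# Frequency of each category
counts = severity.value_counts()

# Relative frequency of each category
proportions = severity.value_counts(normalize=True)

# The most common category
most_common = severity.mode()[0]

print(counts)
print(proportions)
print(most_common)
```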

In [None]:
# You can view summary statistics of each feature with the .describe() method
df.describe()

In [None]:
# Get column labels
df.columns

In [None]:
# get data types of columns
df.dtypes

`has_past_unpaid_loan` should not be an integer value! The difference between someone paying vs. not paying a loan is not the value 1, it is boolean. Let's cast the feature as a boolean instead of an integer value.

In [None]:
# Cast the has_past_unpaid_loan feature as bool
df['has_past_unpaid_loan'] = df['has_past_unpaid_loan'].astype('bool')

df.head()

In [None]:
df['past_crime_severity'].describe()

In [None]:
# Here are some base functions to be aware of

# min
print(f"Min: {df['Salary'].min()}")

# max
print(f"Max: {df['Salary'].max()}")

# median
print(f"Median: {df['Salary'].median()}")

# mean
print(f"Mean: {df['Salary'].mean()}")

# standard deviation
print(f"Std: {int(df['Salary'].std())}")

# variance
print(f"Variance: {int(df['Salary'].var())}")

# Quantile function
print(f"Quantile: {df['Salary'].quantile(0.2)}")

# Number of distinct elements
print(f"Distinct Values: {df['Salary'].nunique()}")

# Calculate interquartile range
q3, q1 = np.percentile(df['Salary'], [75, 25])
iqr = q3 - q1
print(f"IQR: {iqr}")
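
The interquartile range can also be computed with the pandas `quantile()` method instead of `np.percentile`. The sketch below uses a small made-up Series of salaries (illustrative values, not the simulated ones) and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Small illustrative salary sample
salaries = pd.Series([30000, 42000, 50000, 58000, 75000])

# Quartiles with the pandas quantile method
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Same calculation with NumPy percentiles
q3_np, q1_np = np.percentile(salaries, [75, 25])
iqr_np = q3_np - q1_np

print(f"IQR (pandas): {iqr}")
print(f"IQR (numpy): {iqr_np}")
```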

Great work! We've created a Pandas DataFrame object, and explored and manipulated the data. This DataFrame is our training set `D`. We are missing one final variable, the response! Let's add it and say that 90% of people are creditworthy, i.e. they paid back their loan.

In [None]:
# Creating response variable (this is your y)
df['paid_back_loan'] = np.random.binomial(n = 1, size = n, p = 0.9).astype('bool')

df.head()

Conceptually - why does this make no sense at all? `y` is independent of `X` --- what happens then? No function `f` can ever have any predictive / explanatory power! This is just a silly example to show you the data types. We will work with real data soon. Don't worry.

---

## Functions

A function is a block of code which only runs when it is called. The functions we have worked with so far come from imported or standard libraries; however, we can create our own functions! You can define the parameters the function will accept; the values you pass into the function call are called arguments (nomenclature counts).

A function entails:
- The `def` statement
- The name of the function followed by ()
- You may define arguments inside the parenthesis
- a colon
- Starting on a new line and indented, the body of the function to execute (also called the function clause)

In [None]:
# Our first function
def hello():
    print("Hello!")

Cool, our first function! How come nothing was output? We did not call the function; let's do that now.

In [None]:
hello()

A function can take on multiple arguments. An argument is information that can be passed into a function. There is a very subtle difference between an argument and a parameter.

* A parameter is the variable listed inside the parentheses in the function definition.
* An argument is the value that is sent to the function when it is called.

Let's edit the `hello()` function to include a parameter called `name`, that way we can include the person's name in the print statement. When we call the function, we'll pass our own name as an argument.

You can add a default value to a parameter in case a corresponding argument is not passed in the function call.

The `return` statement is exactly what you think it is. Once it is executed, the interpreter exits the function and returns whatever value comes after the `return` keyword. You don't need to include a return statement; a function without one returns `None`.

In [None]:
def hello(name, age = 40):
    hello_statement = f"Hello {name}! I am {age} years old."
    print(hello_statement)
    
hello("Amir", 21)

hello("Amir")

hello()
# What is wrong with the line above?
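
The last call fails because `name` has no default value, so an argument for it is required. The sketch below illustrates this together with default values and `return`. The functions `hello_message` and `shout` are hypothetical variants written for this illustration, not part of the original demo.

```python
def hello_message(name, age = 40):
    # Build the greeting and hand it back to the caller
    return f"Hello {name}! I am {age} years old."

def shout(name):
    # No return statement: the function prints and implicitly returns None
    print(f"HELLO {name.upper()}!")

greeting = hello_message("Amir")           # age falls back to the default, 40
custom = hello_message("Amir", age = 21)   # the default is overridden
nothing = shout("Amir")                    # nothing is None

# Calling without the required `name` argument raises a TypeError
try:
    hello_message()
    missing_argument_error = None
except TypeError as err:
    missing_argument_error = err

print(greeting)
print(custom)
print(nothing)
print(missing_argument_error)
```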

Let's take a look at a fun function. For those of you in CS 220, you might be learning about encryption and decryption. One cipher you learn about is the shift cipher, where you simply replace each letter with the letter `k` positions away in the alphabet. The function for the shift cipher is generally `f(p) = (p + k) % 26`, where `p` is the position of a letter in the alphabet (A = 0, ..., Z = 25) and `k` is a given key. Decryption reverses the shift: `(p - k) % 26`. Here is a Python function that decrypts a shift cipher.
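
Before reading the full function, it may help to work the formula through for a single letter. This is a minimal sketch assuming an illustrative key of `k = 3`; the variable names are not from the original demo.

```python
import string

alphabet = string.ascii_uppercase   # 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
k = 3                               # illustrative key

p = alphabet.index('Y')             # position of 'Y' is 24
shifted = (p + k) % 26              # (24 + 3) % 26 = 1, wrapping past 'Z'
encrypted_letter = alphabet[shifted]

# Subtracting the key undoes the shift
recovered_letter = alphabet[(shifted - k) % 26]

print(p, shifted, encrypted_letter, recovered_letter)
```

The modulo operation is what makes the alphabet wrap around, so a letter near the end maps back to the start.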

In [None]:
# Shift Cipher Decryptor Function
def shift_cipher_decrypt(coded_message, key):
    decrypted_message = ''
    # The 26 uppercase letters, in alphabetical order
    letters = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z".split()

    for letter in coded_message:
        if letter in letters:
            # Shift the letter's position back by the key, wrapping around the alphabet
            x = letters.index(letter)
            char = (x - key) % 26
            decrypted_message += letters[char]
        else:
            # Anything that is not an uppercase letter becomes a space
            decrypted_message += ' '

    # Return the result so the caller can display or reuse it
    return decrypted_message

Consider the following message encrypted with a simple shift cipher: "CKKZ SKNG".
Use the key `100` to decode the message.

In [None]:
shift_cipher_decrypt("CKKZ SKNG", 100)

Below is a Python function to encrypt messages with the shift cipher. To encrypt and decrypt the same message, make sure your key is the same! These two functions might look intimidating, but it's nothing we haven't already gone over. Go through each function line by line, and make sense of each. Now you're a Python programmer and a spy!

In [None]:
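# Shift Cipher Encryptor Function
def shift_cipher_encrypt(message, key):
    encrypted_message = ''
    # The 26 uppercase letters, in alphabetical order
    letters = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z".split()

    for letter in message:
        if letter in letters:
            # Shift the letter's position forward by the key, wrapping around the alphabet
            x = letters.index(letter)
            char = (x + key) % 26
            encrypted_message += letters[char]
        else:
            # Anything that is not an uppercase letter becomes a space
            encrypted_message += ' '

    # Return the result so the caller can display or reuse it
    return encrypted_message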

In [None]:
shift_cipher_encrypt("GOOD WORK", 100)
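
As a self-contained check that encryption and decryption with the same key undo each other, here is a compact sketch. The helper `shift` is a hypothetical simplification written for this illustration: it handles both directions through the sign of the key and, unlike the functions above, leaves non-letter characters unchanged.

```python
import string

LETTERS = string.ascii_uppercase

def shift(message, key):
    """Shift each uppercase letter by `key` positions; keep other characters."""
    result = ''
    for char in message:
        if char in LETTERS:
            result += LETTERS[(LETTERS.index(char) + key) % 26]
        else:
            result += char
    return result

secret = shift("GOOD WORK", 100)     # encrypt with key 100
recovered = shift(secret, -100)      # decrypt by shifting back

# A key of 100 behaves like a key of 22, because 100 % 26 == 22
same_as_22 = shift("GOOD WORK", 22)

print(secret)
print(recovered)
print(same_as_22 == secret)
```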