# Introduction
# Installation and Preparation

Hello! If you're viewing this notebook then you've successfully downloaded Anaconda. Good job!

You can edit the contents of any cell by double-clicking it or pressing "Return" or "Enter". If you click on this cell, you will notice that I've used the "Markdown" mode, as opposed to "Code", to write this introduction. When you're done editing a cell, you can press "Shift-Return" (Mac) or "Shift-Enter" (Windows) to run the cell. A number to the side of a cell means that the cell has been executed (only in Code mode). The number tells you the order in which the cells were executed (i.e., cell [2] was executed after cell [1]).

There are other useful shortcuts that you should be aware of. I've included a list of them in the appendix of the HW1 description pdf file.

Anaconda installs not only Python, but also a number of additional packages that we will use in this course.



### Coding best practices

You should always document your code with comments to help make your work readable and informative. For example, it is a good idea to write (in words) function inputs and outputs, or what each cell is accomplishing. You can put comments in your code by starting any line with '#'; anything written on that line will be colored green.

You can also easily comment/uncomment multiple lines. Just select multiple lines with your cursor and then press `Cmd and /` (`Ctrl and /` for Windows).

After week 3, you will be writing complex web-scraping code that will contain blocks of code within other blocks, and you will need to change the indentation of multiple lines of code. You can indent multiple lines by selecting them and then pressing `Tab`. To de-indent, press `Shift+Tab` instead.

Finally, it is a good idea to split up your code into parts that can be run independently. For example, you can use one cell to install the necessary packages, another one to read in the data, and a third one to transform the data or do analysis. Writing your code out in multiple cells also helps with debugging: you will know exactly which part of your code is throwing the error. To create more cells, click the "+", second icon from the left.


In [None]:
# This is a comment inside a Code cell




## Basic data types

Python has many built-in data types; this tutorial covers the eight you will use most. Four of those are quite simple, in the sense that they can **store a single value**:

* Integers
* Floats
* Booleans
* Strings


The other four are called collections because they can store arbitrary numbers of values. Python's four collection data types are:

* Lists
* Tuples
* Sets
* Dictionaries
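
As a quick preview, here is a minimal sketch of how each of the four collection types can be created. The variable names and values are made up for illustration and are not part of the course data.

```python
# Hypothetical example values, chosen only for illustration
courses = ["bem106", "bem103", "bem106"]      # list: ordered, mutable, allows duplicates
point = (3, 4)                                 # tuple: ordered, immutable
unique_courses = set(courses)                  # set: unordered, duplicates removed
pay = {"ceo": 20_000_000, "worker": 50_000}    # dictionary: maps keys to values

print(courses[0])            # first element of the list
print(point[1])              # second element of the tuple
print(len(unique_courses))   # number of distinct courses
print(pay["worker"])         # value stored under the key "worker"
```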

Now, let's start with one of the most basic data types, the integer.

## Integers

A Python integer is a **whole number**: positive, negative, or zero, with no fractional part.

Python allows you to do basic arithmetic with integers whether you define variables or not. Those operations are represented using the same notation you would see on a calculator.

In [None]:
123 + 256

We can also store the result of an operation into a variable. The variable will store the evaluated answer, not the arithmetic expression.

In [None]:
first_result = 10 / 3
first_result

The division operator `/` always returns a float, here the answer that we are used to, which is approximately $3.3\bar{3}$. If you want an integer result after division, you can use floor division with `//`, which rounds the quotient down, like this:

In [None]:
second_result = 10 // 3
second_result
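
One detail worth checking for yourself: `//` rounds down (toward negative infinity), so it gives the same answer as simply dropping the decimals only when the result is positive. A small sketch with illustrative values:

```python
import math

floor_positive = 10 // 3               # 3
floor_negative = -10 // 3              # -4: rounds down, not toward zero
trunc_negative = math.trunc(-10 / 3)   # -3: truncation toward zero, for comparison

print(floor_positive)
print(floor_negative)
print(trunc_negative)
```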

Finally, to get the remainder of a division, use the `%` operator. It works on floats as well as integers:

In [None]:
4.2 % 2
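
With integers, `//` and `%` fit together: multiplying the quotient by the divisor and adding the remainder gives back the original number. A minimal sketch (the values are arbitrary):

```python
a, b = 10, 3

quotient = a // b     # 3
remainder = a % b     # 1

print(quotient, remainder)
print(quotient * b + remainder == a)   # True
print(divmod(a, b))                    # (3, 1): both results in one call
```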

## Floats

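A Python float is a number with a decimal point. It is stored as a floating-point approximation of a **real number**, so its precision is finite.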

In [None]:
new_float = 109.234
print(new_float)
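
Floats are stored in binary with finite precision, so some decimal values are only approximated. The sketch below shows the classic consequence and one common way to compare floats safely, using `math.isclose` from the standard library:

```python
import math

total = 0.1 + 0.2

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False: the two values differ in the last digits
print(math.isclose(total, 0.3))  # True: equal within a small tolerance
```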

To change an integer to a float or vice versa, we can cast the number using the data type name that we want to transform it to:

In [None]:
int(109.234)

However, casting does not change the original variable. The converted value is kept only if we assign it to a variable!

In [None]:
basic_int = 109
print( float(basic_int) )
print( type(basic_int) )

In [None]:
float_basic_int = float(109)
print( type(float_basic_int) )

## Comparing numbers

To compare the results of computations, we use the standard set of comparison operators.

The `==` operator allows us to check if one side of the operator is equal to the other side.

Python evaluates the expression and tells us that it is `True` if it is correct or `False` if it is incorrect.

The `!=` operator allows us to check if one side does not equal the other side.

The greater than (`>`), less than (`<`), greater than or equal to (`>=`), and less than or equal to (`<=`) operators all work as we would expect.
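
A minimal sketch of these comparisons in action (the values of `x` and `y` are arbitrary):

```python
x = 7
y = 10

print(x == y)        # False
print(x != y)        # True
print(x < y)         # True
print(x >= y)        # False
print(10 / 2 == 5)   # True: the float 5.0 compares equal to the integer 5
```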

## Booleans

A Python Boolean is what in math is called a **logical variable**. The Boolean data type is primarily associated with conditional statements (if/then), which allow different actions and change control flow depending on whether a programmer-specified Boolean condition evaluates to `True` or `False`.

In [None]:
first = 1
second = 2
third = first + second

print(third==3)
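
To connect Booleans to conditional statements, here is a small sketch of an `if`/`else` block driven by a Boolean. The pay figures are hypothetical and chosen only for illustration.

```python
# Hypothetical values, not taken from the homework data
ceo_pay = 20_000_000
worker_pay = 50_000

is_large_gap = ceo_pay / worker_pay > 100   # a comparison produces a Boolean
print(is_large_gap)                          # True (the ratio is 400)

if is_large_gap:
    message = "The pay gap is large"
else:
    message = "The pay gap is modest"

print(message)
```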

## Strings

A Python string is an ordered sequence of characters.

In [None]:
bem = "bem106"
print(bem)

You can use basic math operators to build longer strings: `+` joins two strings together and `*` repeats a string.

In [None]:
print(bem + bem)
print(bem*10)

There are string methods that perform actions that you may need when working with strings:
* `capitalize()`, makes the first character uppercase
* `lower()`, makes the entire string lowercase
* `upper()`, makes the entire string uppercase
* `title()`, capitalizes every word in a string

In [None]:
print(bem.capitalize())
print(bem.lower())
print(bem.upper())
bem2 = bem+" and " +bem
print(bem2.title())

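What if we want to get rid of specific characters? The argument is treated as a set of characters to remove, not as a word or substring.

* `strip`, removes the given characters from both ends of the string
* `lstrip`, removes the given characters from the left end only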

In [None]:
test = "When is the "+bem +" lecture today?"
print(test)
print(test.lstrip("today?"))
print(test.strip("today?"))
print(test.lstrip("When "))
print(test.strip("When "))
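
Because `strip` and its relatives only touch the ends of a string, removing a character everywhere needs a different method. A short sketch, using an illustrative sentence and the standard `rstrip` and `replace` string methods:

```python
sentence = "When is the bem106 lecture today?"

ends_only = sentence.strip("e")            # unchanged: there is no "e" at either end
right_side = sentence.rstrip("today?")     # strips any of t, o, d, a, y, ? from the right end
everywhere = sentence.replace("e", "")     # removes every "e" in the string

print(ends_only)
print(right_side)
print(everywhere)
```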

We can also easily check whether the contents of a string are all alphabetical or all numeric.

In [None]:
test.isalpha() #note that test includes both numbers (106) and characters

In [None]:
test.isnumeric() #note that test includes both numbers (106) and characters

In [None]:
"234".isnumeric() # string only includes numbers

In [None]:
"bem".isalpha() # string only includes characters

### Slicing a string

We can get one or more elements from a string with slicing.

The syntax for slicing is deceptively simple. The full syntax is:

`variable[start_index : stop_index : step]`

You'll see that all of the inputs go within the `[]` and the `:` separates each input. 

The `start_index` tells Python which index we want to start getting elements from. The first element is denoted by 0.

The `stop_index` tells Python which index we want elements **up to but not including**.

The `step` tells python how many steps to take between elements within the range. This means that you don't need to take every element. You could take **every other** element. To do that you just specify a `step` of `2`.

Python allows you to access elements by counting from the end of the string too.

When counting from the end, you use negative indices. The index **`-1`** denotes the last character in a string.

In [None]:
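print(bem[1])       # the character at index 1
print(bem[1:4])     # indices 1 through 3 (the stop index is excluded)
print(bem[1:10:2])  # every other character from index 1; a stop past the end is allowed
print(bem[-4:])     # the last four characters (bem[-4:0] would return an empty string)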
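
Any of the three slice inputs can be left out, in which case Python falls back to a default (the start of the string, the end of the string, and a step of 1). A brief sketch, using a separate variable so it runs on its own:

```python
course = "bem106"

first_three = course[:3]     # "bem": start omitted, so begin at index 0
last_three = course[-3:]     # "106": stop omitted, so run to the end
every_other = course[::2]    # "bm0": indices 0, 2, 4
reversed_str = course[::-1]  # "601meb": a negative step walks backwards

print(first_three, last_three, every_other, reversed_str)
```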

## This is the end of the Python/Jupyter basics tutorial.
Next up: brief homework exercises.

# Homework exercises: Introduction to Pandas

The Pandas package allows you to manipulate spreadsheet-type data, like a company's stock price over time. 

## Benefits of Pandas

1. Pandas easily handles reading in and outputting CSV/Excel files.
2. Pandas takes care of many type conversions after reading a file.
3. Accessing data from a Pandas dataframe is intuitive.
4. Using Pandas is like working with SQL, which is something you'll encounter if you continue to program.

Additional reading to get you started:

The Pandas tutorial pages http://pandas.pydata.org/pandas-docs/stable/tutorials.html

10 minutes to Pandas http://pandas.pydata.org/pandas-docs/stable/10min.html

In this and future homeworks, I will provide you with prompts of what your code needs to accomplish. I expect you to research and investigate the answers on your own. Each block of code will be a separate "Exercise" to keep things sequential and organized. Before jumping into Pandas, I suggest that you research how to create and access the other data types in Python: lists, tuples, sets, and dictionaries.

### Exercise 1: Import Pandas

In [1]:
## Import the package and any other packages that we need
import pandas as pd

### Exercise 2: Read in the Excel file of CEO/median worker pay

In [2]:
##  Read in the excel file containing the data
import os

# Show the current working directory and its contents to help locate the data file
print(os.getcwd())
print(os.listdir())
filename = "Jayden-Nyamiaka/Popular-Methods-in-Data-Science/Introduction/ceo-worker-pay.xlsx"
#pd.read_excel(filename)

C:\Users\jmtot\AppData\Local\Programs\Microsoft VS Code
['bin', 'chrome_100_percent.pak', 'chrome_200_percent.pak', 'Code.exe', 'Code.VisualElementsManifest.xml', 'd3dcompiler_47.dll', 'ffmpeg.dll', 'icudtl.dat', 'libEGL.dll', 'libGLESv2.dll', 'LICENSES.chromium.html', 'locales', 'policies', 'resources', 'resources.pak', 'snapshot_blob.bin', 'tools', 'unins000.dat', 'unins000.exe', 'unins000.msg', 'v8_context_snapshot.bin', 'vk_swiftshader.dll', 'vk_swiftshader_icd.json', 'vulkan-1.dll']


Pandas reads in CSV or Excel files and turns them into its own data structure called a `DataFrame`. `DataFrame` is actually a Python class; you can think of it as just a type of *object*. Our data is inside this *object*, and the object controls how we can interact with it.

### Exercise 3: The basics of a dataframe

A `DataFrame` has two basic ways to access values inside of it.

The **columns** run across the **top**.

The **indices** run down the **left** (for now, you can think of these as rows).

We can see the variables by calling them by name from the `DataFrame`.
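
To make the columns/index picture concrete, here is a minimal sketch on a toy dataframe. The data and column names are hypothetical; they are not the columns of the homework file.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame(
    {
        "company": ["A Corp", "B Inc", "C Ltd"],
        "ceo_pay": [12.5, 30.0, 21.0],
    }
)

print(toy.columns)   # the column labels that run across the top
print(toy.index)     # the row labels that run down the left (0, 1, 2 by default)
print(toy.shape)     # (number of rows, number of columns)
```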

In [None]:
# Access/Print the column names


# Access/Print the row/index names


### Exercise 4: Accessing Data in Pandas

You can access a column in two ways: by typing the name of the column after the dataframe name, separated by a dot, or by using square brackets, which is similar to accessing a key in a dictionary.
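
A small sketch of both styles on a toy dataframe (the data and column names are hypothetical, not those of the homework file). Note that the dot style only works when the column name is a valid Python identifier, so a name containing a space needs the bracket style.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame(
    {
        "company": ["A Corp", "B Inc", "C Ltd"],
        "ceo pay": [12.5, 30.0, 21.0],   # this column name contains a space
    }
)

by_attribute = toy.company    # dot style: fine for simple names
by_key = toy["ceo pay"]       # bracket style: works for any column name

print(by_attribute)
print(by_key)
```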

In [None]:
# Access/Print the column that lists median worker pay through its column name


# Access/Print the column that lists the CEO name through key-in-dictionary way

### Exercise 5: Accessing indices/rows

In [None]:
# Access/Print the first three rows


# Access/Print the last column


## Access/Print all rows where Total CEO Compensation is above $20mil

### Exercise 6: Slicing in Pandas

There are two other ways of accessing data in pandas: the `.loc` and `.iloc` indexers. Both are used with square brackets rather than called like functions.

* `.iloc` is integer based and selects rows and columns by position
* `.loc` is strictly label based
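
The difference is easiest to see on a toy dataframe whose index holds labels instead of the default integers. The sketch below uses hypothetical data and names; it illustrates the two indexers and is not a solution on the homework file.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame(
    {
        "company": ["A Corp", "B Inc", "C Ltd"],
        "ceo_pay": [12.5, 30.0, 21.0],
    }
).set_index("company")   # use the company name as the row label

first_two = toy.iloc[0:2]                 # by position: rows 0 and 1, stop excluded
by_label = toy.loc[["A Corp", "C Ltd"]]   # by label: exactly the rows named
label_slice = toy.loc["A Corp":"B Inc"]   # label slices include both endpoints

print(first_two)
print(by_label)
print(label_slice)
```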

In [None]:
## Access the 100-104 rows and the column names 

### Access the index label

## Reindex the dataframe to have the Company Name as the new index

## Access the following companies through the .loc method (after re-indexing): Aramark, Blackrock Inc, Cigna Corp

### Exercise 7: Basic data plots

In [None]:
## Generate a scatterplot with x-axis being median worker pay and y-axis CEO Total compensation

## Print the correlation between median worker pay and CEO Total compensation

## Run a regression (with a constant) of median worker pay on CEO Total Compensation

## Plot your regression line on the scatterplot

## Generate a histogram of % CEO compensation that is cash


### Exercise 8: Advanced data plots

In [None]:
## Create a pie plot of total ceo compensation of all CEOs whose first name is "Jeffrey"

## Create a histogram of the first letter in CEO last names

## Create a stacked bar plot where the x-axis is the decile (0-10%, 10-20%, etc) of %CEO Comp that is cash and the y-axis is median worker pay and total ceo compensation.
