# Data Science Crash Course

## Session 1: Introduction to Python and Jupyter Notebooks

Created by: Aaron Scherf (aaron_scherf@berkeley.edu)

Instructor Edition

## Contents
- Welcome to Python!
- Commands and Objects
- The Current Environment (Namespace)
- Creating a Dataframe and Indexing
- Conclusion and Review

ProTip: To adjust the size of slides in the Reveal.JS presentation, just zoom in or out on your browser.

## Today's Packages
- `RISE`
- `os`
- `pandas`

## Today's Commands:
- `conda install`
- `pip install`
- `print()`
- `dir()`
- `%reset`
- `del`
- `import`
- `pd.Series()`
- `pd.DataFrame()`
- `iloc[]`
- `loc[]`

# Welcome to Python!

## Python and Jupyter Notebooks

[Python](https://www.python.org/) is a powerful open-source programming language that is popular among data scientists and other programmers for its versatility and its ability to interface across different platforms. You can code in Python directly from the command line, but for reproducibility and ease of use many people prefer a graphical user interface or notebook system. The most popular is [Jupyter Notebook](https://jupyter.org/), as it combines robust functionality with a clean, appealing interface.

## Using Jupyter Notebooks

Python operates like most programming languages (and is quite similar to R or Stata, if you've ever used those), in that it has its own grammar and vocabulary based on commands and functions. 

Code is usually written in scripts or notebooks (like this one) and executed directly within the script environment. 

Jupyter Notebook is a bit different from other graphical user interfaces in that it doesn't have a separate console where you can execute code. Everything is done using the cells within the Notebook text. Otherwise it is very similar, offering traditional navigation tabs like "File" and "Edit" at the top.

## Intro to Jupyter Guides

If you have never used a Jupyter Notebook before, check out [this intro guide from Real Python](https://realpython.com/jupyter-notebook-introduction/) for more info on how it works! Jupyter isn't just for Python, either: the name is a nod to Julia, Python, and R, and it will run R code as well.

You can also work with notebooks in [Jupyter Lab](https://jupyterlab.readthedocs.io/en/stable/index.html), a browser-based integrated environment! It is very similar to the Notebook interface but gathers your notebooks, files, and terminals in a single window.

## Markdown

Within Jupyter there are two main types of cells: code and Markdown. Code is, as you may have guessed, Python code and can be executed using **ctrl+enter** for Windows or **cmd+enter** for Macs. **shift+enter** will do the same thing but automatically move onto the next cell. 

Markdown cells are used for typical text, like this cell, that is not formatted as Python code. You can format the text using simple Markdown syntax, which is a common text-language shared by many programs (like GitHub and RStudio). Some common Markdown formatting options are titles (made by starting a line with a #, ##, or up to six #'s, depending on what "level" you want the title) as well as hyperlinks, such as [this link to a Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).

## Running Python Code

The other main type of cell is a code chunk, which you can execute directly inside of the Jupyter Notebook. Code is written in the special syntax of the Python language, though most structures and commands are intuitive.

To execute a command, you can either press the play button to the left of the cell or click on the cell and press **ctrl+enter**.

In [1]:
2+2

4

## HTML Output Formats

One of the amazing things about Jupyter Notebooks is that you can download your scripts directly as an output document or presentation. Jupyter comes pre-loaded with several format options, which you can view by clicking the "File" dropdown menu and scrolling over "Download as".

You'll see a dizzying array of options, though the most important (besides the Notebook .ipynb format) are likely HTML, PDF, and Reveal.js slides. You can try out different formats once you are making your own Python notebooks. This course is available for Python and R in HTML and Reveal.js formats.

## RISE and Reveal.JS Presentations

This notebook is formatted to generate a reveal.js presentation, with executable code inside of the presentation thanks to the `RISE` package. 

To create similar presentations in your Jupyter Notebooks, you can follow [this handy tutorial by Mike Driscoll at DZone](https://dzone.com/articles/creating-presentations-with-jupyter-notebook). You don't need to do this if you aren't interested in making presentations from your code, but it may help in understanding the materials for this course. 

If you do want to use reveal.js, the first step is installing the RISE package with our Anaconda installer. The code is a bit complex but it should run as long as you have [Anaconda](https://www.anaconda.com/download/) on your machine.

*Note: The code cell is left un-executed to preserve space in the presentation output.*

In [None]:
conda install -c conda-forge rise

## `pip install` Packages

In case you aren't running Python through [Anaconda](https://www.anaconda.com/download/) (which is recommended), you can also use the normal `pip install` method with the command below. Installing packages like `RISE` in Python is easy: all you need to do is write the package name after the `pip install` command.

This is a common method to install packages directly within your Python session. Pip, funnily enough, stands for "Pip Installs Packages", making it a recursive acronym. Aren't programmers hilarious?

In [None]:
pip install RISE

## Why We RISE

The benefit of the RISE package in combination with Jupyter's in-built presentation functionality is that you can edit and execute code directly in the presentation mode. Think of it like a Powerpoint where you can edit the slides without leaving the presentation view. To try it for yourself, reload your notebook and click the little bar graph button at the top of the notebook, next to the keyboard button (command palette) below the "Help" dropdown menu.

# Commands and Objects

## Materialist Programs

Python is an object-oriented programming language. This just means that much of our work is done by creating temporary objects out of data like numbers, character strings, files, etc. This lets us reference the same data multiple times in a script quickly without having to specify the source. Later, when we import data from a comma-separated value (CSV) file, we only have to import it once and save it as an object, then we can call on the data many times by using the object name.
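
As a small illustration of the idea that numbers, character strings, and logical values are all objects, the sketch below assigns a few values to names and asks Python what kind of object each one is, using the built-in `type()` function. The names `count`, `greeting`, and `check` are made up for this example.

```python
# Hypothetical names chosen for illustration only
count = 2 + 2               # an integer object
greeting = "hello_world"    # a character string object
check = (count == 4)        # a logical (Boolean) object

# type() reports what kind of object each name refers to
for name, obj in [("count", count), ("greeting", greeting), ("check", check)]:
    print(name, "->", type(obj).__name__)
```

Each name can now be reused anywhere later in the notebook without retyping the value behind it.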

## Running Commands and Reproducibility

You can make objects (or run any code) by making a code cell and executing it. You can run code directly from the command line, but most programmers prefer to write code in a script (like this iPython Notebook) so that they can run it again or share it with others (the command line by itself doesn't "save" your work in a file, though you can see a history of your commands).

Let's get used to running commands inside our script, since this is generally best practice for [reproducibility](https://datacarpentry.org/rr-intro/02-toolkit/). Try running each of the following cells.

In [2]:
2+2

4

In [3]:
2+2==4

True

In [4]:
print("hello_world")

hello_world


## Types of Commands

As you can see, commands in Python can be:

- arithmetic (mathematical)
- logical (testing if things are `True` or `False`)
- functions like `print()`
- a combination of these. 
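
To make the list above concrete, here is a minimal sketch with one line per kind of command. The name `result` is a placeholder used only so the last value can be reused; assignment with `=` is covered in the next section.

```python
print(2 + 2)                  # arithmetic, displayed with the print() function
print(2 + 2 == 4)             # arithmetic followed by a logical test

# A combination: arithmetic, then a comparison, stored under a hypothetical name
result = (10 - 4) / 2 > 2     # (10 - 4) / 2 is 3.0, and 3.0 > 2 is True
print(result)
```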

Much of learning Python is knowing different functions and how they fit together. Few people have memorized every possible function, so most of us rely extensively on Google and Stack Overflow to look up commands. About 80% of your programming time, at least at the start, will be spent looking up functions or example code and adapting it to your needs. This isn't stealing or cheating (unless you take code and pretend it's yours) so please borrow liberally from other sources (like these guides)!

## Storing and Printing Results

When you execute a command without saving to an object, Python will print out the result, as you saw above. You can also use the `print()` function to achieve the same output. This output can also be saved to an object and re-used later using the assignment operator `=`. You can also call the output by calling the object itself (without the `print()` function).

Let's try an example to show how we can save a command to an object and re-use it.

In [5]:
four = 2+2

In [6]:
four

4

## How Python Treats Objects

Notice how the output wasn't printed until you called the object `four` by running it by itself. Python won't automatically print the contents of an object when you make it. Also note that you can then use the object in subsequent code, treating it just like you would the contents `2+2` or, as it evaluates the math, `4`.

Note how the `=` was used in the first line. Until it is assigned, `four` is just a name that means nothing to Python. By writing out `four` and assigning it with the `=`, we are telling Python to evaluate the expression `2+2` and save the result, `4`, under a shorthand (actually it's longer but you get the idea) that we named `four`.


In [7]:
four / 2

2.0

## Not all `=` are Equal

Note the difference between the single `=` that is used to assign objects and the `==` that runs a logical test of equivalency. `==` checks whether the values on either side are equal and, if so, returns the logical value `True`; otherwise it returns `False`.

In [8]:
four == 4

True

In [9]:
four == 3

False

## Computers are Dumb

By itself, unless you previously assigned it as an object, `four` means nothing arithmetically to Python. We can see this if you try the same thing with another name that hasn't been assigned.


In [10]:
six + 4

NameError: name 'six' is not defined

## Unless the Operator is Smart

See the difference when using a new object we haven't defined? `four / 2` is a recognizable command, since we defined the object `four` as `2+2`. 

`six`, however, isn't an object yet; it's just a string of letters, so Python gives an error when you try to add a number to it.

`six` by itself means nothing to Python, unless you assign it as an object.

In [11]:
six = 12/2
six+four

10.0
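
It may help to see the difference between an unassigned name and actual quoted text. The sketch below uses a hypothetical name, `seven`, that has never been assigned, and catches the resulting error so the cell keeps running. It then shows that quoted text is a real string object, for which `+` means "join together".

```python
# `seven` is a hypothetical name that has never been assigned a value
try:
    seven + 4
except NameError as err:
    message = str(err)

print(message)              # name 'seven' is not defined

# Quoted text is a string object, so + joins the two pieces end to end
joined = "six" + "4"
print(joined)               # six4
```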

## Silly Objects

Now Python can add the two objects `four` and `six` because you assigned them values. If you assign other values to the same name, it will overwrite the object you made before, so be careful with re-using names.

In [12]:
four = 3
six + four

9.0

## No Judgment from Python

Python won't judge you for making silly objects, like `four` being 3, but anyone you share your code with will, so be sure to use sensible names that would be comprehensible to others. This isn't required, necessarily, but it's good coding practice and will help make things easier for you and other programmers. We're all in this together, after all.

# The Current Environment (Namespace)

## Programmers Do Care About the Environment

Whenever you create an object by assigning values to a name, it will be saved in your **Current Environment** or **Namespace**. Your global environment keeps track of all the objects you've created within the Python session (the **Kernel**). These objects are stored in your computer's temporary memory, so if you quit the Python session (even if you save the script file) the objects won't stay.

## Displaying Objects with `dir()`

Jupyter Notebooks save your objects and their associated names but, unlike other interfaces like RStudio, there isn't a window to view the current objects. Instead, you can print out the names of all the current objects in the namespace using the `dir()` command. This will also print the built-in objects (cleverly named `__builtins__`). To read a bit more about Python namespaces, environments, and scopes (the hierarchy of built-in and user-defined objects) check out this [intro guide from wellsr.com](https://wellsr.com/python/basics/python-namespaces-variable-locations-and-scopes/).

In [13]:
dir()

['In',
 'Out',
 '_',
 '_1',
 '_11',
 '_12',
 '_2',
 '_3',
 '_6',
 '_7',
 '_8',
 '_9',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i10',
 '_i11',
 '_i12',
 '_i13',
 '_i2',
 '_i3',
 '_i4',
 '_i5',
 '_i6',
 '_i7',
 '_i8',
 '_i9',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'four',
 'get_ipython',
 'quit',
 'six']
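
Most of the names in the listing above are bookkeeping entries that start with an underscore. One way to see only the objects you created yourself is to filter those out. This is a minimal sketch, assuming your own names don't begin with an underscore; `user_names` is a name invented for this example.

```python
four = 2 + 2
six = 12 / 2

# Keep only the names that do not start with an underscore
user_names = [name for name in dir() if not name.startswith("_")]
print(user_names)
```

In a Jupyter session the result will still include a few helpers such as `In`, `Out`, `exit`, `get_ipython`, and `quit`, alongside `four` and `six`.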

## Clearing the Namespace

In case you ever need to clear out the namespace, there are two main options: restarting your kernel from the drop-down menu at the top or using the `%reset` command. 

It is less common to clear the environment in Python, as Jupyter starts a new kernel for each notebook so there is less chance of accidentally carrying objects over from another session, but it's very useful in Stata and R so we include it here for consistency with the other languages.

In [14]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


After following the dialogue box and deleting the variables, you can check the namespace using the `dir()` command again.

In [15]:
dir()

['In',
 'Out',
 '__builtin__',
 '__builtins__',
 '__name__',
 '_dh',
 '_i',
 '_i15',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'get_ipython',
 'quit']

## Removing Individual Objects

If we want to remove specific objects instead of clearing the entire namespace, we can use the `del` statement. Let's try an example: creating a new object, finding it in the namespace with `dir()`, then deleting it.

In [16]:
new_object = "example"

In [None]:
# We left this un-executed in the presentations to preserve space. Feel free to run it in your own Notebook.

dir()

In [17]:
del new_object
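
To confirm that `del` really removed the object, you can check the namespace before and after. This sketch repeats the steps above in one cell; `exists_before` and `exists_after` are names made up for the check.

```python
new_object = "example"
exists_before = "new_object" in dir()   # the name is in the namespace

del new_object
exists_after = "new_object" in dir()    # the name is gone again

print(exists_before, exists_after)      # True False

# Using the name after deleting it raises the same NameError as an unassigned name
try:
    new_object
except NameError as err:
    print("NameError:", err)
```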

# Creating a Dataframe and Indexing

## Intro to Dataframes

Most of your analysis will be on pre-existing data. It's rare to actually build a full data set by hand in Python, unless you automate it with a function. However, it's helpful to learn about making dataframes to better understand their structure and how the indexing system works. We'll take a brief detour from our `housing` data to create a new dataframe using the package `pandas`.

The primary data structures in `pandas` are implemented as two classes:

  * **DataFrames**, which you can imagine as a data table, with rows of observations and columns of variables similar to a spreadsheet.
  * **Series**, which holds a single column of data. A Series can contain any type of data (such as numeric values and strings), but most often its values all share the same type.
  
A dataframe contains one or more Series and a name for each Series. These named Series are also referred to as variables.

## Installing Pandas with `pip install`

Before we can use `pandas` we need to install the package. `pandas`, shorthand for "Python Data Analysis Library", apparently got its name from the term "panel data", an important type of survey data. For a quick background on the package [here is a useful guide from Towards Data Science](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673).

To install a new third-party package we can use the `pip install` command. [Here is a guide from the Python documentation on installing packages in case you have trouble](https://docs.python.org/3.7/installing/index.html?highlight=pip%20install).

In [None]:
pip install pandas
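
If you aren't sure whether a package is already installed, you can ask Python before running the install. The sketch below uses the standard library's `importlib.util.find_spec()`; the helper name `is_installed` is our own invention, not part of Python or `pandas`.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# True if pandas is available in this environment, False otherwise
print(is_installed("pandas"))
```

If this prints `False`, run the install cell above and then try again.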

## Importing `pandas`

Before we can use the `pandas` approach to building and using dataframes, however, we have to import it into our kernel. It is common to import `pandas` by the shorthand `pd` since it is used so often.

In [18]:
import pandas as pd

## Creating Series with Indices

Pandas allows you to create a Series object directly from a Python list, which just means you separate values with commas inside a set of square brackets. We use the `pd.Series()` function to create a Series out of a set of values as follows. Notice that the Series is printed as output, not saved as an object.

In [19]:
pd.Series(['San Francisco', 'New York City', 'Austin'])

0    San Francisco
1    New York City
2           Austin
dtype: object

## Using the Dictionary

DataFrame objects can be created by passing a dictionary that maps column names (as strings) to their respective Series.

If the series don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values.

If you don't know what a dictionary or string are (in Python terms), Real Python has nice, quick tutorial articles on [Data Types](https://realpython.com/python-data-types/) and [Dictionaries](https://realpython.com/python-dicts/).
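
To see the missing-value behaviour described above, here is a small sketch in which one Series is deliberately a value short. The name `short_population` and the dataframe `mismatched` are made up for this example.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
short_population = pd.Series([884363, 8623000])   # deliberately missing Austin

# The dictionary keys become column names; the Series become the columns
mismatched = pd.DataFrame({'city_names': city_names,
                           'population': short_population})
print(mismatched)

# The gap in the third row is filled with a missing value (NaN)
print(pd.isna(mismatched.loc[2, 'population']))   # True
```

Note that the `population` column is displayed with decimals here, since pandas stores a column containing `NaN` as floating-point numbers by default.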

## Making a Dataframe with `pd.DataFrame()`

Once you have a series of values saved as an object, you can construct a dataframe using the `pd.DataFrame()` function. We create names for the series within the function using the dictionary notation `:`, contained within curly brackets `{}`.

In [20]:
city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
population = pd.Series([884363, 8623000, 950715])

City_Data = pd.DataFrame({ 'city_names': city_names, 'population': population })

## Calling an Entire Dataframe

We can also call the entire dataframe using its name, though browsing values like this is rarely useful for larger datasets. Since we only have 3 rows, however, we can see the entire dataframe.


In [21]:
City_Data

|   | city_names | population |
|---|---|---|
| 0 | San Francisco | 884363 |
| 1 | New York City | 8623000 |
| 2 | Austin | 950715 |
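
For larger datasets, where printing everything isn't practical, a few attributes and methods give a quick overview instead. This is a minimal sketch that rebuilds the same `City_Data` so it runs on its own; `.shape`, `.columns`, and `.head()` are standard `pandas` tools, though they are not covered elsewhere in this session.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
population = pd.Series([884363, 8623000, 950715])
City_Data = pd.DataFrame({'city_names': city_names, 'population': population})

print(City_Data.shape)           # (rows, columns) -> (3, 2)
print(list(City_Data.columns))   # ['city_names', 'population']

first_two = City_Data.head(2)    # only the first two rows
print(first_two)
```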


## Indexing Dataframes

Can you see how the dataframe is structured from the output above? The numbers to the left of the rows indicate the row position of each value. You can call specific rows, columns, or values from within the dataframe using its index position.

## Pop, `loc[]` and `iloc[]` It

There are two indexers in `pandas` for calling the contents of a dataframe by position: `iloc[]` and `loc[]`. `iloc[]` calls values by their integer position, while `loc[]` uses the labels of rows and columns. For our dataframe the row labels are the default 0, 1, 2, so both give the same rows, and it's up to your preference whether you like names or numbers.

## Index Positions

Index positions are defined by [row,column] within square brackets. `:` refers to all values, so if you want to call all rows within the 0th column, for example, you would use [:,0] with `iloc` or [:,'column_name'] with `loc`.

In [22]:
# Call first column with numeric position
City_Data.iloc[:,0]

0    San Francisco
1    New York City
2           Austin
Name: city_names, dtype: object

In [23]:
# Call first column with series name
City_Data.loc[:,'city_names']

0    San Francisco
1    New York City
2           Austin
Name: city_names, dtype: object

You can also call rows by their position value or with the full index.

In [24]:
City_Data.iloc[0]

city_names    San Francisco
population           884363
Name: 0, dtype: object

In [25]:
City_Data.iloc[0,:]

city_names    San Francisco
population           884363
Name: 0, dtype: object

Or call a specific value using both its row and column.

In [26]:
City_Data.iloc[0,0]

'San Francisco'

Finally, and most conveniently, you can refer to columns by their name inside the square brackets without the `loc` or `iloc` function. This can help identify particular variables (series) in a dataframe for use in analysis. You can also specify the row of the series using another bracket.

In [27]:
City_Data['city_names']

0    San Francisco
1    New York City
2           Austin
Name: city_names, dtype: object

In [28]:
City_Data['city_names'][0]

'San Francisco'

As you can see, indices in Python start at 0, which is different from R and some other languages, which start at 1. Why the difference? You can read the technical explanation in [this Medium article](https://medium.com/@albertkoz/why-does-array-start-with-index-0-65ffc07cbce8). I just think it's because some folks like to be different.

For more details on indexing in pandas, [here is another quick and easy online guide](https://brohrer.github.io/dataframe_indexing.html).
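
The difference between `iloc[]` and `loc[]` becomes visible once row labels stop matching row positions. The sketch below is a hypothetical variant of our dataframe, named `labeled`, that uses the city names as row labels and is then sorted by population.

```python
import pandas as pd

# Hypothetical variant of City_Data: city names become the row labels
labeled = pd.DataFrame(
    {'population': [884363, 8623000, 950715]},
    index=['San Francisco', 'New York City', 'Austin'],
)

# Sort so the largest city comes first; positions change, labels do not
by_size = labeled.sort_values('population', ascending=False)

first_by_position = by_size.iloc[0, 0]                  # whatever row is now first
austin_by_label = by_size.loc['Austin', 'population']   # always the Austin row

print(first_by_position)   # 8623000
print(austin_by_label)     # 950715
```

After sorting, position 0 holds New York City, while the label `'Austin'` still finds Austin wherever it ends up. This is why `loc[]` is often the safer choice once a dataframe has been sorted or filtered.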

## Conclusion and Review



This process of running commands, storing objects, managing the namespace, and building and indexing a dataframe underlies nearly every data analysis project you will do. Congratulations, you're officially doing data science!

To review, here are the functions we learned today:

- `conda install`
- `pip install`
- `print()`
- `dir()`
- `%reset`
- `del`
- `import`
- `pd.Series()`
- `pd.DataFrame()`
- `iloc[]`
- `loc[]`

## Help is Your Best Friend

If you don't recognize any of these or what they do please feel free to go back up and review. These are all "bread and butter" commands that you will be using quite a lot, so make sure to know what they are. 

If you want to explore them in even more detail you can also look them up in the **Help** documentation in the drop down menu above. Note that you have different documentation for base Python, pandas, etc. The help-file for each function will give you a description, list of possible arguments and their default values, and some example code. For example, [here is the help file again](https://docs.python.org/3.7/installing/index.html?highlight=pip%20install) for installing packages via `pip install`. It's always good to check the help-file whenever you are using a new function!

## Next Time on the **Crash Course**

Finally, welcome to the Python community! Our next lesson will focus on importing data and generating summary statistics!