# Data Science Crash Course 4 Practitioners

## Session 1: Introduction to Python and Jupyter Notebooks

Created by: Aaron Scherf (aaron_scherf@berkeley.edu)

Instructor Edition

### Today's Packages
* `os`
* `pandas`

### Today's Commands:
* `print()`
* `dir()`
* `%reset`
* `del`
* `import`
* `os.getcwd()`
* `os.chdir()`
* `os.listdir()`
* `pip install`
* `pd.read_csv()`
* `head()`
* `pd.Series()`
* `pd.DataFrame()`
* `iloc[]`
* `loc[]`
* `describe()`
* `mean()`


## 0. Welcome to Python!

Python is a powerful open-source programming language that is popular among data scientists and other computer programmers for its versatility and capacity to interface across different platforms. You can code in Python directly from the command line, but for reproducibility and ease of use many people prefer using a graphical user interface or notebook system. The most popular is Jupyter Notebook, as it combines robust functionality with an appealing, clean interface.

Python operates like most programming languages (and is quite similar to R or Stata, if you've ever used those), in that it has its own grammar and vocabulary based on commands and functions. Code is usually written in scripts or notebooks (like this one) and executed directly within the script environment. Jupyter Notebook is a bit different from other graphical user interfaces in that it doesn't have a separate console where you can execute code. Everything is done using the cells within the Notebook text. Otherwise it is very similar, offering traditional navigation tabs like "File" and "Edit" at the top.

If you have never used a Jupyter Notebook before, check out [this intro guide from Real Python](https://realpython.com/jupyter-notebook-introduction/) for more info on how it works! Jupyter actually isn't just for Python: the name is a nod to Julia, Python, and R, and a notebook can run code in any of these languages (and many others) if the matching kernel is installed.

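You can also work with Notebooks in [Jupyter Lab](https://jupyterlab.readthedocs.io/en/stable/index.html), a newer interface from the same project! It is very similar to the Notebook interface but gathers your notebooks, files, and terminals into a single browser window. Like the classic Notebook, it opens in your browser but by default runs on your own computer rather than on a cloud server.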

### Markdown

Within Jupyter there are two main types of cells: code and Markdown. Code is, as you may have guessed, Python code and can be executed using **ctrl+enter** for Windows or **cmd+enter** for Macs. **shift+enter** will do the same thing but automatically move onto the next cell. 

Markdown cells are used for typical text, like this cell, that is not formatted as Python code. You can format the text using simple Markdown syntax, which is a common text-language shared by many programs (like GitHub and RStudio). Some common Markdown formatting options are titles and subtitles (made by starting a line with a #, ##, or up to six #'s) as well as hyperlinks, such as [this link to a Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).

## 1. Objects and the Current Environment

Python is an object-oriented programming language. This just means that much of our work is done by creating temporary objects out of data like numbers, character strings, files, etc. This lets us reference the same data multiple times in a script quickly without having to specify the source. Later, when we import data from a comma-separated value (CSV) file, we only have to import it once and save it as an object, then we can call on the data many times by using the object name.

You can make objects (or run any code) by making a code cell and executing it. You can run code directly from the command line, but most programmers prefer to write code in a script or notebook (like this Jupyter Notebook, sometimes still called an IPython Notebook) so that they can run it again or share it with others (the command line by itself doesn't "save" your work in a file, though you can see a history of your commands).

Let's get used to running commands inside our script, since this is generally best practice for [reproducibility](https://datacarpentry.org/rr-intro/02-toolkit/). Try running each of the following cells.

In [None]:
2+2

In [None]:
2+2==4

In [None]:
print("hello_world")

As you can see, commands in Python can be arithmetic (mathematical), logical (testing if things are `True` or `False`), functions like `print()`, or a combination of these. Much of learning Python is knowing different functions and how they fit together. Few people have memorized every possible function, so most of us rely extensively on Google and Stack Overflow to look up commands. About 80% of your programming time, at least at the start, will be spent looking up functions or example code and adapting it to your needs. This isn't stealing or cheating (unless you take code and pretend it's yours) so please borrow liberally from other sources (like these guides)!

When you execute a command without saving to an object, Python will print out the result, as you saw above. You can also use the `print()` function to achieve the same output. This output can also be saved to an object and re-used later using the assignment operator `=`. You can also call the output by calling the object itself (without the `print()` function).

In [None]:
four = 2+2

In [None]:
four

Notice how the output wasn't printed until you called the object `four` by running it by itself. Python won't automatically print the contents of an object when you make it. Also note that you can then use the object in subsequent code, treating it just like you would the contents `2+2` or, as it evaluates the math, `4`.

Note how the `=` was used in the first line. Until it is assigned a value, `four` is just a string of letters that means nothing to Python. By writing out `four` and assigning it with the `=` we are telling Python to evaluate `2+2` and save the result, `4`, under a shorthand (actually it's longer but you get the idea) that we named `four`.


In [None]:
four / 2

Note the difference between the single `=` that is used to assign objects and the `==` that runs a logical test of equivalency. `==` checks to see if the surrounding values are equal, and if so returns the logical value `True` (otherwise `False`).

In [None]:
four == 4
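
As a small sketch of the difference between `=` and `==`, the cell below assigns a value and then tests it twice. The names `total`, `is_four`, and `is_five` are made up for this illustration and are not used elsewhere in the course.

```python
# `=` assigns: evaluate 2 + 2 and store the result under the name `total`
total = 2 + 2

# `==` compares: each test returns a Boolean value, True or False
is_four = (total == 4)
is_five = (total == 5)

print(total)
print(is_four)
print(is_five)
print(type(is_four))  # Booleans have their own type, `bool`
```

Note that the result of a comparison can itself be saved as an object, just like a number.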

By itself, unless you previously assign it as an object, `four` means nothing arithmetically to Python. We can see this if you try it with another character string.


In [None]:
six + 4

See the difference between `four / 2` and `six + 4`? `four / 2` is a recognizable command, since we defined the object `four` as `2+2`. `six`, however, isn't an object yet; it's just a name with no value assigned to it, so Python gives an error (a `NameError`) when you try to add a number to it.

`six` by itself means nothing to Python, unless you assign it as an object.
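
If you would like to see the error without the red traceback, here is a minimal sketch that catches it with `try` and `except`. The name `undefined_name` is a placeholder chosen for this example; it is deliberately never assigned.

```python
try:
    # `undefined_name` has never been assigned, so Python cannot look it up
    result = undefined_name + 4
except NameError as err:
    # Save the error text so we can inspect it instead of stopping the cell
    error_message = str(err)
    print("NameError:", error_message)
```

You won't need `try` and `except` for the rest of this session, but it is a handy way to confirm which kind of error a line of code raises.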

In [None]:
six = 12/2
six+four

Now Python can add the two objects `four` and `six` because you assigned them values. If you assign other values to the same name, it will overwrite the object you made before, so be careful with re-using names.

In [None]:
four = 3
six + four

Python won't judge you for making silly objects, like `four` being 3, but anyone you share your code with will, so be sure to use sensible names that would be comprehensible to others. This isn't required, necessarily, but it's good coding practice and will help make things easier for you and other programmers. We're all in this together, after all.

### Introducing: The Current Environment (Namespace)

Whenever you create an object by assigning values to a name, it will be saved in your **Current Environment** or **Namespace**. Your global environment keeps track of all the objects you've created within the Python session (the **Kernel**). These objects are stored in your computer's temporary memory, so if you quit the Python session (even if you save the script file) the objects won't stay.

Jupyter Notebooks save your objects and their associated names but, unlike other interfaces like RStudio, there isn't a window to view the current objects. Instead, you can print out the names of all the current objects in the namespace using the `dir()` command. This will also print the built-in objects (cleverly named `__builtins__`). To read a bit more about Python namespaces, environments, and scopes (the hierarchy of built-in and user-defined objects) check out this [intro guide from wellsr.com](https://wellsr.com/python/basics/python-namespaces-variable-locations-and-scopes/).


In [None]:
dir()
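
The full `dir()` output can be long, since it includes built-in names that start with an underscore. As a rough sketch, you can filter those out with a list comprehension. The names `example_value` and `user_names` are made up for this illustration, and inside Jupyter the filtered list will still show a few helper names that the notebook itself adds (such as `In` and `Out`).

```python
# Create an object so there is something of ours to find
example_value = 42

# Keep only the names that do not start with an underscore
user_names = [name for name in dir() if not name.startswith('_')]

print(user_names)
```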

## 2. Clear the Namespace

In case you ever need to clear out the namespace, there are two main options: restarting your kernel from the drop-down menu at the top or using the `%reset` command. It is less common to clear the environment in Python, as it initiates a new Kernel for each notebook so there is less chance of accidentally carrying objects from another session, but it's very useful in Stata and R so we include it here for consistency with the other languages.

In [None]:
%reset

After following the dialogue box and deleting the variables, you can check the namespace using the `dir()` command again.

In [None]:
dir()

If we want to remove specific objects instead of just clearing the entire namespace, we can use the `del` statement. Let's try an example, creating a new object, finding it in the namespace, then deleting it.

In [None]:
new_object = "example"

In [None]:
dir()

In [None]:
del new_object
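
To confirm that `del` really removed the object, you can test whether its name appears in `dir()` before and after. This is only a sketch; `temp_object`, `exists_before`, and `exists_after` are names invented for the example.

```python
temp_object = "example"

# `in` checks whether the name appears in the list returned by dir()
exists_before = 'temp_object' in dir()

del temp_object

exists_after = 'temp_object' in dir()

print(exists_before, exists_after)
```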

## 3. Working Directory and File Paths

You can make and define data files in Python, but more often than not you will be importing data that was generated elsewhere. Importing files requires telling Python where on your computer (or in the cloud) the file is, which is usually defined by some kind of **filepath**. Folders on your computer are logical representations of these filepaths, often consisting of a nested list of paths or **directories**, culminating in the folder that contains the file you want. You can always reference the entire filepath to call a file, but since programmers are lazy we often shorten the path into an object and tell Python to default towards a certain folder. This is called the **working directory**.

If you just call a file with an import command without any filepath, Python will begin by searching the working directory. When you open a new session, the working directory is usually the folder the session was started from; for a Jupyter Notebook this is typically the folder containing the notebook file. You can always check where your current working directory is with the `getcwd()` function. This function isn't loaded by default, however, so you first have to load the `os` package using the `import` statement. Then preface the function with the package name, so the actual function call is `os.getcwd()`.


In [None]:
import os

os.getcwd()

You can see from the output that the filepath is expressed as a character string. The format will depend on whether you are using Windows, macOS, Linux, etc., so if you aren't familiar with filepaths you may need to look up the format for your machine.

Checking the current directory is helpful but it's more useful to change the working directory. You often start off a new script by setting the working directory to the folder you want, both to ensure you are working from the right directory and to help anyone else that uses your script know to set the directory to their computer.

#### Change the Working Directory

The function to change the directory in Python is `os.chdir()`. The following code cell is just an example; you will have to identify the folder where you have saved the scripts and data for this course and change the text to that filepath.

In [None]:
os.chdir('Your_Root_Working_Directory_Path')

In [None]:
# Example (make sure to change the 'User_Name'):
os.chdir('C:\\Users\\User_Name\\Desktop\\Crash-Course-4-Practitioners')
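
The doubled backslashes in the Windows path above are needed because a single backslash starts an escape sequence inside an ordinary Python string (a bare `\U`, for instance, is read as the start of a Unicode escape and raises a syntax error). One alternative, sketched below, is a raw string, written with an `r` before the opening quote. The folder names are the same placeholders as in the example above.

```python
# Ordinary string: every backslash has to be doubled
escaped_path = 'C:\\Users\\User_Name\\Desktop'

# Raw string: the r prefix tells Python to keep backslashes as typed
raw_path = r'C:\Users\User_Name\Desktop'

# Both spellings produce exactly the same text
print(escaped_path == raw_path)
print(raw_path)
```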

It's often good practice to organize your projects in a familiar way. Everyone does their folder structures differently but it's common to have a `Data` folder and various output folders, like `Plots` or `Logs` (meaning log-files). It's typical to set your working directory to the overall project folder (sometimes the `Master` folder, though personally I think it sounds weird). Then you can specify the sub-folder (like `Data`) when you call the file to import data, or create character strings for all of your various folders.

You can always open the folder you are working in using the regular old file explorer on your computer, but as programmers we don't like to leave the **Matrix** that is typing into a screen of code. To view the names of files inside of your directory, you can use the `listdir()` function, also from the `os` package.


#### List Files in Directory

In [None]:
os.listdir()

If you leave the part inside the parentheses blank (not passing any arguments into the function, therefore calling it using the default settings), `listdir()` will list all of the files and subfolders in your working directory as text strings. This is useful to get the exact names of files (or even select them as data values) as well as navigate to various folders.
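
Because `listdir()` returns an ordinary list of strings, you can filter it like any other list. The sketch below builds a throwaway folder with the standard-library `tempfile` module so that it runs on any machine; the file names are invented for the example, and in your own project you would pass your real folder to `os.listdir()` instead.

```python
import os
import tempfile

# Create a temporary folder containing three empty example files
with tempfile.TemporaryDirectory() as folder:
    for name in ['housing.csv', 'cities.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # Keep only the names ending in .csv, sorted alphabetically
    csv_files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))

print(csv_files)
```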

We can make an object out of our working directory filepath, taking a shortcut by saving our working directory as a character string object.

In [None]:
# Replace this with the path to your own project folder
path = 'C:\\Users\\theaa\\Desktop\\Crash-Course-4-Practitioners'

In [None]:
os.chdir(path)

Let's make a new object using the `path` object containing the path for our `Data` folder, then inspect what's inside it. You can easily concatenate (combine) strings of text in Python using the `+` sign.

In [None]:
path_data = path + '\\Data\\'
os.listdir(path_data)
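
Joining strings with `+` works, but the separator (`\\` here) is specific to Windows. As a sketch of a more portable alternative, `os.path.join()` inserts the right separator for whatever operating system the code runs on. The names `project_path`, `data_path`, and `file_path` are hypothetical, and the current working directory stands in for your project folder.

```python
import os

# Stand-in for your project folder: wherever the session is currently working
project_path = os.getcwd()

# os.path.join adds the correct separator between each piece
data_path = os.path.join(project_path, 'Data')
file_path = os.path.join(data_path, 'housing.csv')

print(file_path)

# os.path.exists reports whether the file is actually there
print(os.path.exists(file_path))
```

If the last line prints `False`, the file is not where the path says it is, which is worth checking before trying to import it.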

For more on working with files, check out [this tutorial by Real Python](https://realpython.com/working-with-files-in-python/). It covers making temporary file directories, copying and renaming files, and working with zip files directly in Python.

## 4. Importing and Viewing CSV Files

Our first data is on housing prices in California, from [Cam Nugent's Kaggle Dataset](https://www.kaggle.com/camnugent/california-housing-prices/downloads/california-housing-prices.zip/1).

You can either download the data directly from Kaggle and extract it to your working directory or download it as part of the [course GitHub Repo](https://github.com/Data-Scholars-Discovery/Crash-Course-4-Practitioners). If you haven't used GitHub before, just click the green **Clone or Download** button at the top right, then select **Download Zip**. GitHub is the most widely used platform for sharing code and managing project files, built on a version control system called Git. You don't need to know Git to use GitHub but eventually it will become necessary, so best to get in the practice of using GitHub now. For a basic intro to GitHub check out the [GitHub Guides on their website](https://guides.github.com/activities/hello-world/).

Either way, make sure you have the `housing.csv` file in a folder called `Data` inside your main working directory.

#### Installing Pandas

Before we can import our CSV file, we need another package, called `pandas`. Pandas, shorthand for "Python Data Analysis Library", apparently got its name from "panel data", an important type of survey data. For a quick background on the package [here is a useful guide from Towards Data Science](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673).

To install a new third-party package (one that doesn't ship with Python itself) we use the `pip install` command. This is a common method to install packages, and in Jupyter you can run it directly from a notebook cell. Pip, funnily enough, stands for "Pip Installs Packages", making it a recursive acronym. Aren't programmers hilarious?

In [None]:
# The leading % runs pip as a notebook "magic", installing into the environment this kernel uses
%pip install pandas

Once you've installed `pandas`, we can import the functions from it using the `import` command like we did for the `os` package. Since we need to call the name of the package quite often it's common to shorten it to `pd` as follows.

In [None]:
import pandas as pd

#### Importing a CSV

Then we can create a new dataframe called `housing` by importing a CSV using the `pandas` function `read_csv()`. 

Dataframes are a general data structure that can contain many forms of data, similar to an Excel spreadsheet. They are structured in the familiar row and column format.

We'll call our CSV file using our `path_data` object and concatenating (or adding) the file name of `housing.csv`.

Then we can look at the first 5 observations in the file using the `head()` function.

In [None]:
housing = pd.read_csv(path_data + 'housing.csv') 
# Preview the first 5 lines of the loaded data 
housing.head()
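
If you don't have the file handy yet, here is a self-contained sketch of what `read_csv()` does. It reads CSV text from memory using the standard-library `io.StringIO` instead of a file on disk. The column names and values are invented for the illustration and have nothing to do with the housing data.

```python
import io

import pandas as pd

# A tiny made-up CSV: the first line holds column names, each later line is one row
csv_text = """city,population
Alpha,100
Beta,250
Gamma,75
"""

# StringIO makes the text behave like an open file, so read_csv can parse it
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # head() takes the number of rows to show
```

As the last line shows, `head()` accepts a number if you want something other than the default of 5 rows.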

### Intermission: Finding New Data

There is a lot of data out there but we recommend searching through the following sources:

* Kaggle Datasets - https://www.kaggle.com/datasets
* re3data Resources by Subject - https://www.re3data.org/browse/by-subject/
* World Health Organization Global Health Observatory - https://www.who.int/gho/database/en/
* World Bank Open Data - https://data.worldbank.org/
* Google Public Data - https://www.google.com/publicdata/directory
* Harvard Dataverse - https://dataverse.harvard.edu/

## 5. Creating a Dataframe

Most of your analysis will be on pre-existing data. It's rare to actually build a full data set using Python, unless you automate it with a function. However, it's helpful to learn about making dataframes to better understand the structure and how the indexing system works. We'll take a brief detour from our `housing` data to create a new dataframe in `pandas`.

The primary data structures in `pandas` are implemented as two classes:

  * **DataFrame**, which you can imagine as a data table, with rows of observations and columns of variables similar to a spreadsheet.
  * **Series**, which holds a set of data. Series can contain any type of data (such as numeric values and strings) but most often they contain a single type. A dataframe contains one or more Series and a name for each Series; these are also referred to as variables in many social sciences.

Pandas allows you to create a series object directly from a Python list, which just means you separate values with commas inside of a set of square brackets. We use the `pd.Series()` function to create a series out of a set of values as follows. Notice that the series is printed as output, not saved as an object.

In [None]:
pd.Series(['San Francisco', 'New York City', 'Austin'])

DataFrame objects can be created by mapping a dictionary of strings of column names to their respective Series. 

If the series don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values.

If you don't know what a dictionary or string are (in Python terms), Real Python has nice, quick tutorial articles on [Data Types](https://realpython.com/python-data-types/) and [Dictionaries](https://realpython.com/python-dicts/).

Once you have a series of values saved as an object, you can construct a dataframe using the `pd.DataFrame()` function. We create names for the series within the function using the dictionary notation `:`, contained within curly brackets `{}`. After we save the dataframe as an object we can call it directly to view its contents.

In [None]:
city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
population = pd.Series([884363, 8623000, 950715])

City_Data = pd.DataFrame({ 'city_names': city_names, 'population': population })

City_Data
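
To see the note about mismatched lengths in action, here is a small sketch that combines a three-item series with a two-item series. The names `letters`, `numbers`, and `mismatched` and their values are made up for this example.

```python
import pandas as pd

letters = pd.Series(['A', 'B', 'C'])  # three values
numbers = pd.Series([1, 2])           # only two values

# Rows are matched by index label (0, 1, 2); the missing third number becomes NaN
mismatched = pd.DataFrame({'letter': letters, 'number': numbers})

print(mismatched)

# isna() flags missing values; sum() counts how many were flagged
print(mismatched['number'].isna().sum())
```

You may notice that the `number` column is shown with decimal points. That is because pandas stores `NaN` as a floating-point value, so the whole column is converted.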

We can then call on particular values within our dataframe using indexing. Can you see how the dataframe is structured from the output above? The numbers to the left of the rows indicate the row position of each value. You can call specific rows, columns, or values from within the dataframe using its index position.

There are two indexers in pandas to call object contents: `iloc[]` and `loc[]`. They are used much like functions, but with square brackets. `iloc[]` calls values by their integer position, while `loc[]` uses the labels of rows and columns. For a dataframe like ours, where the row labels are simply the default 0, 1, 2, the results are the same, so it's up to your preference whether you like names or numbers.

Index positions are defined by [row,column] within square brackets. `:` refers to all values, so if you want to call all rows within the 0th column, for example, you would use [:,0] with the `iloc` function or [:,'column_name'] with the `loc` function.

In [None]:
# Call first column with numeric position
City_Data.iloc[:,0]

In [None]:
# Call first column with series name
City_Data.loc[:,'city_names']

You can also call rows by their position value or with the full index.

In [None]:
City_Data.iloc[0]

In [None]:
City_Data.iloc[0,:]

Or call a specific value using both its row and column.

In [None]:
City_Data.iloc[0,0]
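
The difference between `iloc[]` and `loc[]` is easier to see when the row labels are not just 0, 1, 2. The sketch below rebuilds the city data with the city names as row labels; using them as the index is a choice made for this illustration, and `cities`, `by_position`, and `by_label` are names invented here.

```python
import pandas as pd

# Same population figures as above, but with city names as the row labels
cities = pd.DataFrame(
    {'population': [884363, 8623000, 950715]},
    index=['San Francisco', 'New York City', 'Austin'],
)

# iloc uses integer positions: third row (position 2), first column (position 0)
by_position = cities.iloc[2, 0]

# loc uses labels: the row labeled 'Austin', the column labeled 'population'
by_label = cities.loc['Austin', 'population']

print(by_position, by_label)
```

Both lines reach the same cell, one by counting and one by name.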

Finally, and most conveniently, you can refer to columns by their name inside the square brackets without the `loc` or `iloc` function. This can help identify particular variables (series) in a dataframe for use in analysis. You can also specify the row of the series using another bracket.

In [None]:
City_Data['city_names']

In [None]:
City_Data['city_names'][0]
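
Using two sets of brackets in a row works for reading a value. As a sketch of an equivalent single-step form, `loc[]` accepts the row label and the column name together. The dataframe is rebuilt here (as `city_data`, a name chosen for this example) so the cell can run on its own.

```python
import pandas as pd

city_data = pd.DataFrame({
    'city_names': ['San Francisco', 'New York City', 'Austin'],
    'population': [884363, 8623000, 950715],
})

# Two steps: select the column, then the row labeled 0 within it
two_step = city_data['city_names'][0]

# One step: row label and column name together inside loc
one_step = city_data.loc[0, 'city_names']

print(two_step == one_step)

# A list of names inside the brackets selects several columns at once
print(city_data[['city_names', 'population']].shape)
```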

As you can see, indices in Python start at 0, which is different from R and some other languages, which start at 1. Why the difference? You can read the technical explanation in [this Medium article](https://medium.com/@albertkoz/why-does-array-start-with-index-0-65ffc07cbce8). I just think it's because some folks like to be different.

For more details on indexing in pandas, [here is another quick and easy online guide](https://brohrer.github.io/dataframe_indexing.html).

## 6. Basic Descriptive Statistics

Pandas can automatically call summary statistics for all numeric variables using the `describe()` function.

In [None]:
housing.describe()

To call more specific sets of statistics you can specify the statistic you want:

In [None]:
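# numeric_only=True skips any text (non-numeric) columns, which recent pandas versions require
housing.mean(numeric_only=True)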

Then you can call statistics for particular series using indexing, with the square bracket `[]` to refer to the position of the series within the columns of the dataframe. Remember that Python, unlike some languages, starts indexing at 0!

In [None]:
# .iloc[0] selects the first statistic by its position
housing.mean(numeric_only=True).iloc[0]

You can also refer to series by name, indexing them in the dataframe and then calling a summary statistic.

In [None]:
housing["longitude"].mean()

In [None]:
housing["households"].describe()
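
The same pattern works for other summary statistics. Here is a short sketch using the small city dataframe from Section 5, rebuilt so the cell can run on its own; the object names (`city_data`, `pop`, and so on) are chosen just for this example.

```python
import pandas as pd

city_data = pd.DataFrame({
    'city_names': ['San Francisco', 'New York City', 'Austin'],
    'population': [884363, 8623000, 950715],
})

pop = city_data['population']

pop_mean = pop.mean()               # arithmetic average
pop_median = pop.median()           # middle value when sorted
pop_range = pop.max() - pop.min()   # largest minus smallest

print(pop_mean, pop_median, pop_range)

# On the whole dataframe, numeric_only=True leaves out the text column
numeric_means = city_data.mean(numeric_only=True)
print(numeric_means)
```

With only three cities, one very large value pulls the mean far above the median, which is a good reminder to look at more than one statistic.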

## Conclusion and Review



This process of clearing the environment, setting the working directory, importing a dataset, and summarizing its variables is often the first part of every data analysis project you will do. Congratulations, you're officially doing data science!

To review, here are the functions we learned today:

* `print()`
* `dir()`
* `%reset`
* `del`
* `import`
* `os.getcwd()`
* `os.chdir()`
* `os.listdir()`
* `pip install`
* `pd.read_csv()`
* `head()`
* `pd.Series()`
* `pd.DataFrame()`
* `iloc[]`
* `loc[]`
* `describe()`
* `mean()`


If you don't recognize any of these or what they do please feel free to go back up and review. These are all "bread and butter" commands that you will be using quite a lot, so make sure to know what they are. If you want to explore them in even more detail you can also look them up in the **Help** documentation in the drop down menu above. Note that you have different documentation for base Python, pandas, etc. The help-file for each function will give you a description, list of possible arguments and their default values, and some example code. For example, [here is the help file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe) for the `describe()` function. It's always good to check the help-file whenever you are using a new function!


Finally, welcome to the Python community! Our next lesson will focus on basic data management within dataframes.
