# Python for Policy Analysts

## Session 1: Introduction to Python and Jupyter Notebooks

Created by: Aaron Scherf (aaron_scherf@berkeley.edu)

Instructor Edition

### Today's Packages
* **os**
* **pandas**

### Today's Commands:
* `%reset`
* `import`
* `os.getcwd()`
* `os.chdir()`
* `os.listdir()`
* `%pip install`
* `pd.read_csv()`
* `head()`
* `describe()`
* `pd.Series()`


## Entering and Viewing Data:

### 1. Clear Current Environment

Occasionally you may need to clear all the objects you've created from the **namespace**, which is what Python calls the current session environment.

This is less important in a Jupyter notebook, since each notebook starts its own kernel, but it's often useful in Stata and R so we include it here for consistency.

In [None]:
%reset

### 2. Check Working Directory Path

In [None]:
import os

os.getcwd()

### 3. Change Working Directory

In [None]:
os.chdir('Your_Root_Working_Directory_Path')

In [None]:
# Example:
os.chdir('C:\\Users\\User_Name\\Desktop\\Crash-Course-4-Practitioners')

We can also save our directory path as a string.

In [None]:
path = 'C:\\Users\\User_Name\\Desktop\\Crash-Course-4-Practitioners'

In [None]:
os.chdir(path)

### 4. List Files in Directory

In [None]:
os.listdir()

Let's make a new object containing the path for our `Data` folder, then inspect what's inside it.

In [None]:
path_data = path + '\\Data\\'
os.listdir(path_data)
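
The hard-coded `\\` separators above only work on Windows. As a minimal sketch of a more portable approach, `os.path.join()` inserts the correct separator for whichever operating system you are on. The folder name `Data` and file name `housing.csv` come from this course; the names `portable_data_dir` and `portable_file` are illustrative, and the example starts from the current working directory so it can run anywhere.

```python
import os

# Start from wherever the notebook is currently running
base_dir = os.getcwd()

# os.path.join adds the correct separator for the operating system
portable_data_dir = os.path.join(base_dir, 'Data')
portable_file = os.path.join(portable_data_dir, 'housing.csv')

# Only list the folder if it actually exists, to avoid a FileNotFoundError
if os.path.isdir(portable_data_dir):
    print(os.listdir(portable_data_dir))
else:
    print('No Data folder found in', base_dir)
```

These names are kept separate from `path` and `path_data` so the cells that follow still behave as written.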

### 5. Importing a CSV File

Our first data is on housing prices in California, from [Cam Nugent's Kaggle Dataset](https://www.kaggle.com/camnugent/california-housing-prices/downloads/california-housing-prices.zip/1).

You can either download the data directly from Kaggle and extract it to your working directory or download it as part of the course GitHub Repo.

Either way, make sure you have the `housing.csv` file in a folder called `Data` inside your main working directory.

#### 5a) Installing Pandas

First make sure you have the **pandas** module installed and imported.

In [None]:
%pip install pandas

In [None]:
import pandas as pd

#### 5b) Importing a CSV

Then we can create a new object called `housing` by importing a CSV using the *pandas* function `read_csv()`.

We'll build the full file path by concatenating (or adding) the file name `housing.csv` onto our `path_data` string.

Then we can look at the first 5 observations in the file using the `head()` method.

In [None]:
housing = pd.read_csv(path_data + 'housing.csv')
# Preview the first 5 rows of the loaded data
housing.head()
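
`head()` also accepts the number of rows to show, and the `shape` attribute reports how many rows and columns were loaded. The sketch below uses a tiny made-up CSV held in memory (not the real `housing.csv`) so that it runs without any files; the column names and values are invented purely for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; these values are invented for illustration
csv_text = """city,median_price,households
A,250000,120
B,310000,95
C,180000,240
D,420000,60
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.head(2))  # only the first 2 rows
```

The same two calls should work on `housing` once the real file has loaded.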

### Intermission: Finding New Data

There is a lot of data out there, but we recommend starting with the following sources:

* Kaggle Datasets - https://www.kaggle.com/datasets
* re3data Resources by Subject - https://www.re3data.org/browse/by-subject/
* World Health Organization Global Health Observatory - https://www.who.int/gho/database/en/
* World Bank Open Data - https://data.worldbank.org/
* Google Public Data - https://www.google.com/publicdata/directory
* Harvard Dataverse - https://dataverse.harvard.edu/

### 6. Creating a DataFrame

The primary data structures in *pandas* are implemented as two classes:

  * **DataFrame**, which you can imagine as a data table, with rows of observations and columns of variables similar to a spreadsheet.
  * **Series**, which is a single column. A **DataFrame** contains one or more **Series** and a name for each **Series**.

Pandas allows you to create a **Series** directly by passing a list of values to `pd.Series()`.

In [None]:
pd.Series(['San Francisco', 'New York City', 'Austin'])

**DataFrame** objects can be created by passing a dictionary that maps column names (as **strings**) to their respective **Series**.

If the **Series** don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values.

In [None]:
city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
population = pd.Series([884363, 8623000, 950715])

city_data = pd.DataFrame({ 'City name': city_names, 'Population': population })

city_data
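
To see the missing-value behavior described above, here is a small sketch in which one **Series** is shorter than the other. The variable names `names`, `pops`, and `partial` are illustrative and kept separate from `city_data`.

```python
import pandas as pd

names = pd.Series(['San Francisco', 'New York City', 'Austin'])
# Only two populations for three cities
pops = pd.Series([884363, 8623000])

partial = pd.DataFrame({'City name': names, 'Population': pops})

# The third row has no population, so pandas stores NaN there.
# Because NaN is a floating-point value, the column becomes float.
print(partial)
print(partial['Population'].isna().sum())  # number of missing values
```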

### 7. Basic Descriptive Statistics

Pandas can automatically compute summary statistics for all numeric variables using the `describe()` method.

In [None]:
housing.describe()
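
By default, `describe()` skips text columns. As a rough sketch, passing `include='all'` adds them, and selecting a single column first narrows the summary. The small table below reuses the city figures from step 6 rather than the housing file so that the cell runs on its own; the name `cities` is illustrative.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'New York City', 'Austin'],
    'Population': [884363, 8623000, 950715],
})

# Numeric columns only: count, mean, std, min, quartiles, max
print(cities.describe())

# Include text columns too (adds rows such as unique, top, freq)
print(cities.describe(include='all'))

# Summary for a single column
print(cities['Population'].describe())
```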

## That's it! Congrats on making it through your first Python crash course session!