# Python for Policy Analysts

## Session 1: Introduction to Python and Jupyter Notebooks

Created by: Aaron Scherf (aaron_scherf@berkeley.edu)

Instructor Edition

### Today's Packages
* **os**
* **pandas**

### Today's Commands:
* `%reset`
* `import`
* `os.getcwd()`
* `os.chdir()`
* `os.listdir()`
* `pip install`
* `pd.read_csv()`
* `head()`
* `describe()`
* `pd.Series()`


## Entering and Viewing Data:

### 1. Clear Current Environment

Occasionally you may need to clear all the objects you've created from the **namespace**, which is what Python calls the current session environment.

This is less important in Python, since Jupyter starts a new kernel for each notebook, but it's often useful in Stata and R, so we include it here for consistency.

In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y
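`%reset` is an IPython magic command, so it only works inside a notebook or IPython session; adding the `-f` flag (`%reset -f`) skips the confirmation prompt. As a minimal sketch of what "removing an object from the namespace" means in plain Python, the example below deletes a single variable with `del`. The name `example_value` is hypothetical and exists only for this illustration.

```python
# Hypothetical variable, created only for this demonstration
example_value = 42

# del removes one name from the namespace (unlike %reset, which removes all of them)
del example_value

# Referring to a deleted name raises NameError
try:
    example_value
    still_defined = True
except NameError:
    still_defined = False

print(still_defined)
```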


### 2. Check Working Directory Path

In [15]:
import os

os.getcwd()

'C:\\Users\\theaa\\Desktop\\Data Science Pedagogy Resources\\Joint Course'

### 3. Change Working Directory

In [23]:
os.chdir('Your_Root_Working_Directory_Path')

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Your_Root_Working_Directory_Path'

In [3]:
# Example:
os.chdir('C:\\Users\\theaa\\Desktop\\Data Science Pedagogy Resources\\Joint Course')

We can also save our directory path as a string.

In [4]:
path = 'C:\\Users\\theaa\\Desktop\\Data Science Pedagogy Resources\\Joint Course'

In [5]:
os.chdir(path)
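The doubled backslashes in the paths above are needed because a single backslash starts an escape sequence inside an ordinary Python string. A raw string (prefixed with `r`) is one way to avoid the doubling. The sketch below uses a hypothetical path, not the instructor's actual directory.

```python
# Hypothetical Windows path written two equivalent ways
escaped_path = 'C:\\Users\\analyst\\Project'   # each \\ is one literal backslash
raw_path = r'C:\Users\analyst\Project'         # raw string: backslashes taken literally

print(escaped_path == raw_path)
print(escaped_path)
```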

### 4. List Files in Directory

In [6]:
os.listdir()

['Data',
 'do-files',
 'Python',
 'Stata Concurrent Course Lesson 1.pptx',
 'Stata Lab (shared)',
 '~$Stata Concurrent Course Lesson 1.pptx']

Let's make a new object containing the path for our `Data` folder, then inspect what's inside it.

In [7]:
path_data = path + '\\Data\\'
os.listdir(path_data)

['housing.csv']
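Concatenating `'\\Data\\'` by hand works on Windows only. As a hedged alternative, `os.path.join()` inserts the correct separator for whatever operating system runs the code. The sketch below builds a throwaway folder with `tempfile` (so it does not touch your real files), then lists only the CSV files in it; the file names are placeholders.

```python
import os
import tempfile

# Build a temporary stand-in for the project folder (removed automatically afterwards)
with tempfile.TemporaryDirectory() as project_dir:
    # os.path.join picks the right separator for the current operating system
    data_dir = os.path.join(project_dir, 'Data')
    os.mkdir(data_dir)

    # Create two placeholder files so there is something to list
    for name in ['housing.csv', 'notes.txt']:
        with open(os.path.join(data_dir, name), 'w') as f:
            f.write('placeholder\n')

    # Keep only the file names that end in .csv
    csv_files = sorted(name for name in os.listdir(data_dir) if name.endswith('.csv'))

print(csv_files)
```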

### 5. Importing a CSV File

Our first data is on housing prices in California, from [Cam Nugent's Kaggle Dataset](https://www.kaggle.com/camnugent/california-housing-prices/downloads/california-housing-prices.zip/1).

You can either download the data directly from Kaggle and extract it to your working directory or download it as part of the course GitHub Repo.

Either way, make sure you have the `housing.csv` file in a folder called `Data` inside your main working directory.

#### 5a) Installing Pandas

First make sure you have the **pandas** module installed and imported.

In [58]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas as pd
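If you are unsure whether a package is already installed, one option is to ask Python whether it can locate the module before importing it. This is a minimal sketch using the standard library; the helper name `is_installed` is our own invention, not part of any package.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed('os'))       # part of the standard library
print(is_installed('pandas'))   # depends on your environment
```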

#### 5b) Importing a CSV

Then we can create a new object called `housing` by importing a CSV using the *pandas* function `read_csv()`.

We'll build the file path by concatenating (or adding) our `path_data` object and the file name `housing.csv`.

Then we can look at the first 5 observations in the file using the `head()` function.

In [9]:
housing = pd.read_csv(path_data + 'housing.csv')
# Preview the first 5 rows of the loaded data
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
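If you don't have `housing.csv` on hand yet, you can still try `read_csv()` and `head()` on a small in-memory example. The sketch below assumes nothing about your folders: it feeds `read_csv()` a block of CSV text through `io.StringIO`, which behaves like an open file. The column names `city` and `population` are placeholders chosen for this illustration.

```python
import io
import pandas as pd

# CSV text standing in for a file on disk (illustrative only)
csv_text = """city,population
San Francisco,884363
New York City,8623000
Austin,950715
"""

# read_csv accepts a file path or any file-like object
cities = pd.read_csv(io.StringIO(csv_text))

print(cities.head(2))   # passing a number to head() changes how many rows are shown
print(cities.shape)     # (number of rows, number of columns)
```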


### Intermission: Finding New Data

There is a lot of data out there, but we recommend starting your search with the following source:

* Kaggle Datasets - https://www.kaggle.com/datasets

### 6. Creating a Dataframe

The primary data structures in *pandas* are implemented as two classes:

  * **DataFrame**, which you can imagine as a data table, with rows of observations and columns of variables similar to a spreadsheet.
  * **Series**, which is a single column. A **DataFrame** contains one or more **Series** and a name for each **Series**.

Pandas allows you to create a **Series** object directly by passing a list of values to `pd.Series()`.

In [12]:
pd.Series(['San Francisco', 'New York City', 'Austin'])

0    San Francisco
1    New York City
2           Austin
dtype: object

**DataFrame** objects can be created by passing a dictionary that maps column names (as **strings**) to their respective **Series**.

If the **Series** don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values.

In [14]:
city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
population = pd.Series([884363, 8623000, 950715])

city_data = pd.DataFrame({ 'City name': city_names, 'Population': population })

city_data

| | City name | Population |
|---|---|---|
| 0 | San Francisco | 884363 |
| 1 | New York City | 8623000 |
| 2 | Austin | 950715 |
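The note above about mismatched lengths can be checked with a small sketch. Here the population **Series** is deliberately one value short, so pandas fills the gap with `NaN`; as a side effect, the column is stored as floating-point numbers rather than integers. The names `short_population` and `mismatched` are chosen only for this example.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
short_population = pd.Series([884363, 8623000])  # deliberately one value short

# Rows are matched by index label (0, 1, 2); index 2 has no population value
mismatched = pd.DataFrame({'City name': city_names, 'Population': short_population})
print(mismatched)

# Count the missing values in the Population column
missing_count = mismatched['Population'].isna().sum()
print(missing_count)
```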


### 7. Basic Descriptive Statistics

Pandas can automatically compute summary statistics for all numeric variables using the `describe()` method. Non-numeric columns, such as `ocean_proximity`, are left out by default.

In [10]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
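Two details of the summary are worth a closer look. The `count` row reports non-missing values only, which is why `total_bedrooms` shows 20433 while the other columns show 20640. And text columns are skipped, so a frequency count is one way to summarize them instead. The sketch below illustrates both points on a hypothetical toy table (the `region` and `value` columns are made up, not taken from `housing.csv`).

```python
import pandas as pd

# Hypothetical toy data with one missing numeric value
toy = pd.DataFrame({
    'region': ['coast', 'coast', 'inland'],
    'value': [10.0, None, 30.0],
})

# describe() summarizes numeric columns; count ignores the missing value
summary = toy.describe()
print(summary)

# value_counts() tallies how often each category appears in a text column
region_counts = toy['region'].value_counts()
print(region_counts)
```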
