# Data Science Crash Course

## Session 2: Importing Data and Summary Stats

Created by: Aaron Scherf (aaron_scherf@berkeley.edu)

Instructor Edition

## Contents
- Review of Session 1
- File Paths and Directories
- Importing and Viewing CSV Files
- Data Resources
- Basic Descriptive Statistics
- Conclusion and Review

## Today's Packages

- `pandas`
- `os`

## Today's Commands:
- `os.getcwd()`
- `os.chdir()`
- `os.listdir()`
- `pd.read_csv()`
- `head()`
- `mean()`
- `median()`
- `var()`
- `std()`
- `describe()`

# Review of Session 1

## Step 1: 

Clear your working environment to ensure there's no other data.

In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


## Step 1.5: 

`import` the `pandas` package as `pd`.

In [2]:
import pandas as pd

## Step 2:

Create the `City_Data` object using the two series `city_names` and `population`.

In [3]:
city_names = pd.Series(['San Francisco', 'New York City', 'Austin'])
population = pd.Series([884363, 8623000, 950715])

City_Data = pd.DataFrame({ 'city_names': city_names, 'population': population })

## Step 3:

Retrieve the population of San Francisco using indexing, either by position (matrix notation) or by variable name.

In [4]:
City_Data.iloc[0,1]

884363

In [5]:
City_Data['population'][0]

884363

## Step 4:

Remove the `City_Data` object.

In [6]:
del City_Data

## Step 5: Onwards!

Great, we're ready to move on to the first part of Session 2, where we will learn how to set the working directory, import data from a CSV, and describe it with summary statistics!

# File Paths and Directories

## How does Python Know Where to Look?

You can create data directly in Python, but more often than not you will be importing data that was generated elsewhere. Importing a file requires telling Python where on your computer (or in the cloud) it is stored, and that location is described by a **filepath**.

A filepath is a nested sequence of folders, or **directories**, ending in the folder that contains the file you want. The folders you see on your computer are a visual representation of these paths.

You could write out the entire filepath every time you call a file, but since programmers are lazy we usually save the path as an object and tell Python to look in a particular folder by default. That default folder is called the **working directory**.

## Retrieve the Current Directory with `os`

If you name a file in an import command without giving any filepath, Python looks for it in the working directory. When you open a new session, the working directory is normally the folder the session was launched from (for a Jupyter notebook, usually the folder that contains the notebook).

You can always check your current working directory with the `getcwd()` function. This function isn't loaded by default, however, so you first have to load the `os` module with an `import` statement. `os` ships with Python as part of the standard library, so there is nothing extra to install. Then preface the function with the module name, so the actual function call is `os.getcwd()`.

In [7]:
import os

os.getcwd()

'C:\\Users\\theaa\\Desktop\\Crash-Course-4-Practitioners\\Python\\Session 2'

## Changing the Working Directory

You can see from the output that the filepath is expressed as a character string. The format depends on your operating system: Windows paths separate folders with backslashes, while Mac and Linux paths use forward slashes. If you aren't familiar with filepaths you may need to look up the format for your machine.
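The doubled backslashes in the Windows output above are how Python displays a single backslash inside a string. As a minimal sketch (the folder names here are hypothetical), the same path can be typed either with escaped backslashes or as a raw string with an `r` prefix:

```python
# Two ways to write the same (hypothetical) Windows path as a Python string.
escaped = 'C:\\Users\\your_name\\Desktop\\Project'   # each backslash is escaped
raw = r'C:\Users\your_name\Desktop\Project'          # raw string: backslashes are literal

print(escaped == raw)   # compares the two spellings
print(escaped)          # print() shows the path with single backslashes
```

Either spelling can be passed to functions that expect a path; the raw string is simply easier to copy and paste from a file explorer.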

Checking the current directory is helpful, but it's more useful to change it. You often start a new script by setting the working directory to the folder you want, both to ensure you are working from the right directory and to remind anyone else who uses your script to set the directory for their own computer.

## Using `os.chdir()`

The function to change the directory in Python is `os.chdir()`. The following code cell is just an example; you will have to identify the folder where you saved the scripts and data for this course and change the text to that filepath.

In [8]:
# Example (make sure to change the User_Name, currently `theaa`):
os.chdir('C:\\Users\\theaa\\Desktop\\Crash-Course-4-Practitioners')

## Organizing Your Filepaths

It's often good practice to organize your projects in a familiar way. Everyone organizes their folder structures differently but it's common to have a `Data` folder and various output folders, like `Plots` or `Logs` (meaning log-files). 

It's typical to set your working directory to the overall project folder (sometimes called the `Master` folder, though personally I think it sounds weird).

After you set the project folder as your working directory, you can just specify the sub-folder (like `Data`) when you call the file to import data. You can also change the working directory during the script, which can help reproducibility by giving users more flexibility in naming their sub-folders, but you have to be careful to change the directory back if you want to import or export other data.
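One way to make sure you always change the directory back is to remember the starting folder and restore it in a `finally` clause. The sketch below is only an illustration: it uses a temporary folder from the standard-library `tempfile` module as a stand-in for a real sub-folder, and the variable names are made up for this example.

```python
import os
import tempfile

# Remember where we started so we can return afterwards.
original_dir = os.getcwd()

# A temporary folder stands in for a hypothetical `Data` sub-folder.
with tempfile.TemporaryDirectory() as temp_dir:
    try:
        os.chdir(temp_dir)        # move into the sub-folder
        inside_dir = os.getcwd()  # the working directory while we are there
    finally:
        os.chdir(original_dir)    # always change back, even if an error occurred

restored = os.getcwd() == original_dir
print(restored)
```

The `try`/`finally` pair means the last `os.chdir()` runs even if the code in between fails, so later imports and exports still start from the project folder.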

## Seeing Your Files with `listdir()`

You can always open the folder you are working in using the regular old file explorer on your computer, but as programmers we don't like to leave the **Matrix** that is typing into a screen of code. To view the names of files inside of your directory, you can use the `listdir()` function, also from the `os` package.

In [9]:
os.listdir()

['.Rhistory', 'Data', 'Python', 'R', 'Stata']

## `listdir()` Default Arguments

If you leave the part inside the parentheses blank (not passing any arguments into the function, therefore calling it using the default settings), `listdir()` will return a list of the names of all files and subfolders in your working directory, each as a text string.

This is useful for getting the exact names of files (or even using them as data values) as well as for navigating to other folders.
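Because `listdir()` returns an ordinary Python list of strings, you can filter it like any other list. Here is a small sketch that keeps only the CSV files; the folder and the empty files in it are created on the fly purely for illustration.

```python
import os
import tempfile

# Build a throwaway folder with a few empty, hypothetical files to list.
with tempfile.TemporaryDirectory() as folder:
    for name in ['housing.csv', 'housing.dta', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # listdir() returns names in arbitrary order, so sort them for readability.
    all_files = sorted(os.listdir(folder))

    # Keep only the names that end in .csv
    csv_files = [f for f in all_files if f.endswith('.csv')]

print(all_files)
print(csv_files)
```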

## Creating Sub-Folders as Objects

As a shortcut, we can save the filepath of our working directory as a character string object.

In [10]:
path = 'C:\\Users\\theaa\\Desktop\\Crash-Course-4-Practitioners'

In [11]:
os.chdir(path)

## The Path Less Traveled

Let's make a new object using the `path` object containing the path for our `Data` folder, then inspect what's inside it. You can easily concatenate (combine) strings of text in Python using the `+` sign.

In [12]:
path_data = path + '\\Data\\'
os.listdir(path_data)

['housing.csv', 'housing.dta', 'NYC']
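Joining strings with `+` works, but the separator (`\\` here) is specific to Windows. As an alternative sketch, `os.path.join()` builds the path with whichever separator matches the computer the script runs on. The folder names below are just examples, written as a relative path so the code runs anywhere.

```python
import os

# Hypothetical project folder; a relative name keeps the example portable.
project = 'Crash-Course-4-Practitioners'

# os.path.join inserts the separator used by the current operating system.
data_folder = os.path.join(project, 'Data')
csv_path = os.path.join(data_folder, 'housing.csv')

print(csv_path)
print(os.path.basename(csv_path))   # the file name at the end of the path
```

This makes a script easier to share between Windows, Mac, and Linux users, since nobody has to edit the slashes by hand.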

For more on working with files, check out [this tutorial by Real Python](https://realpython.com/working-with-files-in-python/). It covers making temporary directories, copying and renaming files, and working with zip files directly in Python.

# Importing and Viewing CSV Files

## Importing CSV Data with `read_csv()`

Now that we can set and explore our directories, it's a simple process to import data from a CSV file. The function is easy to remember: `read_csv()`

Let's import the `housing.csv` file that is in our data directory. If it isn't there, download the data from its [source on Kaggle](https://www.kaggle.com/camnugent/california-housing-prices/downloads/california-housing-prices.zip/1) or from the [Crash Course GitHub repo](https://github.com/Data-Scholars-Discovery/Crash-Course-4-Practitioners).

## Using GitHub through the Website

If you haven't used GitHub before, just click the green **Clone or Download** button at the top right (labelled **Code** on newer versions of the site), then select **Download Zip**. GitHub is a widely used platform for sharing and managing code and data files, built around the version control system Git.

You don't need to know Git to use GitHub, but eventually it will become necessary, so it's best to get into the practice of using GitHub now. For a basic intro to GitHub check out the [GitHub Guides on their website](https://guides.github.com/activities/hello-world/).

Either way, make sure you have the `housing.csv` file in a folder called `Data` inside your main working directory.

## Importing More `panda`s

Don't forget to `import` `pandas` again. It is common to run all of the necessary `import` commands at the start of your script, so other users know which packages your file depends on.

In [13]:
import pandas as pd

## Importing a CSV with `read_csv()`

Then we can create a new dataframe called `housing` by importing a CSV using the `pandas` function `read_csv()`. 

Remember that dataframes are a general-purpose data structure that can hold many kinds of data, much like an Excel spreadsheet. They are organized in the familiar row and column format.

We'll point to our CSV file by concatenating (adding) the file name `housing.csv` to our `path_data` object.

Then we can look at the first 5 observations in the file using the `head()` function.

In [14]:
housing = pd.read_csv(path_data + 'housing.csv') 

In [15]:
# Preview the first 5 lines of the loaded data 
housing.head()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.880001 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.860001 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.849998 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.849998 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.849998 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
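If you don't have the housing file handy, you can still try `read_csv()` and `head()` on a tiny CSV held in memory. This is only a sketch: the text below reuses the city figures from the Session 1 review, and `io.StringIO` (from the standard library) is used here to make the text behave like a file.

```python
import io
import pandas as pd

# A tiny CSV held in memory, so no file on disk is needed.
csv_text = """city,population
San Francisco,884363
New York City,8623000
Austin,950715
"""

cities = pd.read_csv(io.StringIO(csv_text))

print(cities.shape)     # (number of rows, number of columns)
print(cities.head(2))   # head(n) returns the first n rows instead of the default 5
```

Checking `shape` right after an import is a quick way to confirm that the number of rows and columns is what you expected.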


# Data Resources

## Finding New Data

There is a lot of data out there but we recommend searching through the following sources:

* Kaggle Datasets - https://www.kaggle.com/datasets
* re3data Resources by Subject - https://www.re3data.org/browse/by-subject/
* World Health Organization Global Health Observatory - https://www.who.int/gho/database/en/
* World Bank Open Data - https://data.worldbank.org/
* Google Public Data - https://www.google.com/publicdata/directory
* Harvard Dataverse - https://dataverse.harvard.edu/


## Data File Types

There are countless ways to store data files. Some are specific to particular programs (like `.dta` files in Stata) while others are shared by many programs (like comma separated value or `.csv` files). 

For ease of sharing and reproducibility, we prefer to use `.csv` files for tabular data. Python can handle plenty of other data types, like images or spatial data, but we will stick with "spreadsheet" data for now.

# Basic Descriptive Statistics

## Single Summary Stats

`pandas` has built-in methods that calculate the most common summary statistics individually. These methods are useful for checking a single statistic or for using a statistic inside another calculation (like finding the difference between two means).

You can calculate a statistic for every numeric series in your dataframe at once or for a single series (or variable), using any of our indexing options.

In [16]:
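# numeric_only=True skips text columns such as ocean_proximity,
# which recent versions of pandas otherwise refuse to average
housing.mean(numeric_only=True)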

longitude               -119.569704
latitude                  35.631861
housing_median_age        28.639486
total_rooms             2635.763081
total_bedrooms           537.870553
population              1425.476744
households               499.539680
median_income              3.870671
median_house_value    206855.816909
dtype: float64

In [17]:
housing['population'].mean()

1425.4767441860465
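Since the mean of a single series is just a number, it can be used directly in further arithmetic, such as the difference between two means mentioned above. A minimal sketch with made-up values (the `toy` dataframe is not part of the housing data):

```python
import pandas as pd

# Made-up values, loosely modelled on two columns of the housing data.
toy = pd.DataFrame({
    'households': [100, 200, 300],
    'population': [250, 500, 900],
})

mean_households = toy['households'].mean()
mean_population = toy['population'].mean()

# Each mean is a single number, so ordinary arithmetic works on them.
difference = mean_population - mean_households
print(difference)
```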

## Index then Stat, or Stat then Index?

Notice the difference between indexing a specific series and then calling its mean versus calculating the means for the entire dataframe and then indexing the statistic you want. The result is the same, but it helps illustrate the logic of Python code: a chain of calls is evaluated from left to right, one period `.` at a time.

In [18]:
housing.iloc[:,5].mean()

1425.4767441860465

In [19]:
# .iloc[5] picks the sixth mean by position (population)
housing.mean(numeric_only=True).iloc[5]

1425.4767441860465
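The same comparison can be sketched on a small made-up dataframe. Here the second version picks the statistic by its label rather than its position, which is one way to avoid counting columns; the `numeric_only=True` argument tells `mean()` to skip the text column.

```python
import pandas as pd

toy = pd.DataFrame({
    'rooms': [2, 4, 6],
    'population': [10, 20, 60],
    'region': ['bay', 'inland', 'bay'],   # non-numeric column
})

# Index then stat: select one series, then take its mean.
index_then_stat = toy['population'].mean()

# Stat then index: compute every numeric mean, then pick one by label.
stat_then_index = toy.mean(numeric_only=True)['population']

print(index_then_stat == stat_then_index)
```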

## Single Summary Stats Pt.2

Other typical statistics include the median, variance, and standard deviation.

In [20]:
housing['median_income'].median()

3.53479995

In [21]:
housing['median_income'].var()

3.6093225553131996

In [22]:
housing['population'].std()

1132.462121765341
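It can help to see how these statistics relate to each other. By default `pandas` computes the sample variance (dividing by n - 1), and the standard deviation is its square root. The sketch below checks this on four made-up numbers; the `ddof=0` argument switches to the population version that divides by n.

```python
import math
import pandas as pd

values = pd.Series([1, 2, 3, 4])   # made-up numbers

middle = values.median()             # midpoint of the sorted values
sample_var = values.var()            # divides by n - 1 (ddof=1) by default
sample_std = values.std()            # square root of the sample variance
population_var = values.var(ddof=0)  # divides by n instead

print(middle, sample_var, sample_std, population_var)
print(math.isclose(sample_std, math.sqrt(sample_var)))
```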

## Multiple Summary Stats with `describe()`

`pandas` can calculate a full set of summary statistics for all numeric variables at once using the `describe()` function.

In [23]:
housing.describe()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.540001 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.259998 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.709999 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.950001 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
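In the housing output, the `count` for `total_bedrooms` (20433) is lower than for the other columns (20640) because `count` only includes non-missing values. The sketch below reproduces that effect on a made-up dataframe with one missing value, and shows that the result of `describe()` is itself a dataframe you can index by label.

```python
import pandas as pd

toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0, 1274.0],
    'total_bedrooms': [129.0, float('nan'), 190.0, 235.0],   # one made-up missing value
})

summary = toy.describe()

# describe() returns a DataFrame, so single statistics can be looked up by label.
print(summary.loc['count'])                 # non-missing observations per column
print(summary.loc['mean', 'total_rooms'])   # one statistic for one column
print(toy.isna().sum())                     # number of missing values per column
```

Comparing the `count` row across columns is a quick first check for missing data before you rely on the other statistics.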


# Conclusion and Review



Look at that, only two short sessions in and you already know how to perform much of the analysis required for a first round of data exploration! Not only can you import and view data tables, you can produce sharp tables of summary statistics in a presentation-ready format!

## Review

- `os.getcwd()`
- `os.chdir()`
- `os.listdir()`
- `pd.read_csv()`
- `head()`
- `mean()`
- `median()`
- `var()`
- `std()`
- `describe()`

## More Help!
If you don't recognize any of these or what they do, please feel free to go back up and review. These are all "bread and butter" commands that you will be using quite a lot, so make sure you know what they are. If you want to explore them in even more detail you can also look them up in the **Help** documentation in the drop-down menu above. Note that there is separate documentation for base Python, pandas, etc. The help file for each function will give you a description, a list of possible arguments and their default values, and some example code.

For example, [here is the help file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe) for the `describe()` function. It's always good to check the help-file whenever you are using a new function!

## Next Time on the **Crash Course**
Great job in keeping up with the second session! Our next lesson will focus on data management and data processing, two absolutely key tasks for any data scientist.