## Course Announcements

Due Sunday (11:59 PM):
- Pre-course Survey
- D1
- Q2
- A1
    
Notes:
- waitlist closes today ([Piazza post](https://piazza.com/class/lqzo7aetmf06m8/post/65) with details)
- For those who want/need additional intro-level Python practice: the [COGS 18 "textbook"](https://shanellis.github.io/pythonbook/content/intro.html) (still under development); link also in COGS 108 [Resources](https://github.com/COGS108/Resources)

**<center> [git demo] </center>**

**SSH Authentication**

(in the terminal)
1. `git config --global user.email "sellis@ucsd.edu"`  
   `git config --global user.name "ShanEllis"`
2. `ssh-keygen` -> After hitting enter/return to execute the above, you’ll **press return/enter three times to bypass specifying a location and passphrase**.
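3. `cat ~/.ssh/id_rsa.pub` -> highlight and copy the full result of this command. It will start with `ssh-rsa`. (Newer versions of `ssh-keygen` create an Ed25519 key by default; in that case the file is `~/.ssh/id_ed25519.pub` and the result starts with `ssh-ed25519`.)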

(on GitHub)
1. In your browser, navigate to https://github.com/settings/keys
2. Click “New SSH Key”
3. Set title to `COGS 108` (or whatever you want it to be)
4. Paste what you copied in step 3 into the “Key” box
5. Click “Add SSH Key”

(return to the terminal)
1. `ssh git@github.com`
2. You *may* see a message like `The authenticity of host … can’t be established`. If you do, type `yes` and hit return/enter.
3. You’ll then see a message like "You've successfully authenticated, but GitHub does not provide shell access." At this point, you’re all set!




For more specific/detailed instructions: https://cogs137.github.io/website/content/labs/01-lab-intro-r.html#step-1-email-and-username

# Data Wrangling

- `pandas`
- Where to find data?
   - Web Scraping & APIs

<center>
<img src="img/pandas.png" alt="pandas" width="600px">
</center>

Pandas is a Python library for managing heterogeneous data.

At its core, Pandas is built around the **DataFrame** object, which provides:
- a data structure for labeled rows and columns of data
- associated methods and utilities for working with data
- columns that are each a `pandas` **Series**
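
To make the DataFrame/Series relationship concrete, here is a minimal sketch. The `toy` DataFrame and its values are made up for illustration and are not the course dataset.

```python
import pandas as pd

# A tiny, made-up DataFrame: each dict key becomes a column name
toy = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'age': [31, 45, 27],
    'value': [10.5, 3.2, 8.0],
})

# Selecting a single column returns a Series that shares the DataFrame's row index
ages = toy['age']

print(type(toy).__name__)
print(type(ages).__name__)
print(toy.dtypes)  # one dtype per column, so columns can hold different types
```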

## Setup

In [1]:
# Import standard libraries
import pandas as pd
import numpy as np

## Loading Data

In [2]:
# Load a csv file of data
df = pd.read_csv('data/my_data.csv')

Note: there are other `pd.read_*` functions for other file types (e.g., `pd.read_excel()` and `pd.read_json()`).
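
As a rough illustration of how the `pd.read_*` functions relate, the sketch below reads the same made-up records from CSV text and from JSON text. The in-memory strings stand in for files so the example runs without any data on disk; the records themselves are placeholders.

```python
from io import StringIO

import pandas as pd

# The same two made-up records, once as CSV text and once as JSON text
csv_text = "id,age,score\n1,46,492\n2,48,489\n"
json_text = '[{"id": 1, "age": 46, "score": 492}, {"id": 2, "age": 48, "score": 489}]'

# StringIO wraps a string so it can be read like an open file
from_csv = pd.read_csv(StringIO(csv_text))
from_json = pd.read_json(StringIO(json_text), orient='records')

print(from_csv)
print(from_json)
```

Either way, the result is an ordinary DataFrame, so everything that follows applies regardless of the original file type.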

In [3]:
# Check out a few rows of the dataframe
df.head()

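| | id | first_name | last_name | age | score | value |
|---|---|---|---|---|---|---|
| 0 | 295 | Andrea | Clark | 46 | -1 | 24547.87 |
| 1 | 620 | Bill | Woods | 46 | 492 | 46713.9 |
| 2 | 891 | Alexander | Jacobson | 48 | 489 | 32071.74 |
| 3 | 914 | Derrick | Bradley | 52 | -1 | 30650.48 |
| 4 | 1736 | Allison | Thomas | 44 | -1 | 9553.12 |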


Pandas DataFrame:
- Index for each row
- Column name for each column (Series)
- Stores heterogeneous types

## Slicing

In [4]:
# Slicing (Indexing): select a Series (column) using its name
df['last_name']

0         Clark
1         Woods
2      Jacobson
3       Bradley
4        Thomas
         ...   
195       Ortiz
196    Chambers
197       Pitts
198     Jenkins
199       Brown
Name: last_name, Length: 200, dtype: object

In [5]:
type(df['last_name'])

pandas.core.series.Series

In [6]:
df.loc[5:10]

| | id | first_name | last_name | age | score | value |
|---|---|---|---|---|---|---|
| 5 | 2049 | Stephen | Williams | 57 | 333 | 138936.92 |
| 6 | 2241 | Malik | Wood | 46 | -1 | 10804.47 |
| 7 | 2607 | Amber | Garcia | 50 | 536 | 9367.27 |
| 8 | 2635 | David | Coleman | 68 | 351 | 66035.28 |
| 9 | 3585 | Eric | Atkins | 56 | 582 | 103977.32 |
| 10 | 4199 | Justin | Johnson | 59 | 500 | 34938.08 |


In [7]:
# Slicing: select a row & column with 'loc'
df.loc[10, 'score']

500
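
Note that `df.loc[5:10]` above returned six rows: `loc` slices by *label* and includes the end label. The sketch below contrasts this with position-based slicing via `iloc`, which is not used elsewhere in these notes and is shown only for comparison. The `toy` DataFrame is made up.

```python
import pandas as pd

# Made-up data with the default integer index (labels 0..4)
toy = pd.DataFrame({'score': [10, 20, 30, 40, 50]})

by_label = toy.loc[1:3]      # labels 1, 2, 3: the end label is included
by_position = toy.iloc[1:3]  # positions 1, 2: the end position is excluded

print(len(by_label), len(by_position))
print(toy.loc[3, 'score'])   # a single row label and column name gives one value
```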

#### Clicker Question #1

What would be the output of `df['age'] > 10`?

- A) subset of `df` including only rows of individuals older than 10
- B) a Boolean with `True` for rows where age is greater than 10 and `False` otherwise
- C) `id`s of rows where observations are greater than 10 
- D) an error
- E) I'm super lost

In [20]:
## YOUR CODE HERE
df_new = df[df['age'] > 10]
df_new

| | id | first_name | last_name | age | score | value |
|---|---|---|---|---|---|---|
| 0 | 295 | Andrea | Clark | 46 | -1 | 24547.87 |
| 1 | 620 | Bill | Woods | 46 | 492 | 46713.90 |
| 2 | 891 | Alexander | Jacobson | 48 | 489 | 32071.74 |
| 3 | 914 | Derrick | Bradley | 52 | -1 | 30650.48 |
| 4 | 1736 | Allison | Thomas | 44 | -1 | 9553.12 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 97441 | Krista | Ortiz | 34 | -1 | 24074.79 |
| 196 | 97728 | Anna | Chambers | 37 | 598 | 0.00 |
| 197 | 98115 | Jennifer | Pitts | 29 | 606 | 6876.75 |
| 198 | 98284 | Brittany | Jenkins | 34 | 665 | 43525.88 |


## DataFrame Information

In [21]:
# Check how large our dataframe is
df.shape

(200, 6)

In [22]:
# Check what columns we have in our DataFrame
df.columns

Index(['id', 'first_name', 'last_name', 'age', 'score', 'value'], dtype='object')

In [23]:
# Check the datatypes of our variables
df.dtypes

id              int64
first_name     object
last_name      object
age             int64
score           int64
value         float64
dtype: object

## Exploring the data

- quantitative (numbers)
- qualitative (categorical)
- basic descriptive statistics

In [24]:
# Checking categorical data
df['first_name'].value_counts()

David       6
Michael     5
Eric        4
Charles     4
James       4
           ..
Alison      1
Andrew      1
Vanessa     1
Samantha    1
Katelyn     1
Name: first_name, Length: 134, dtype: int64

In [25]:
# Check a particular descriptive statistic
df['value'].mean()

28730.336296296293

In [26]:
# Describe a particular column
df['score'].describe()

count    200.000000
mean     416.595000
std      237.176674
min       -1.000000
25%      288.750000
50%      463.500000
75%      596.500000
max      942.000000
Name: score, dtype: float64

In [27]:
# Get descriptive statistics of all numerical columns
df.describe()

| | id | age | score | value |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 189.0 |
| mean | 52929.15 | 46.02 | 416.595 | 28730.336296 |
| std | 29414.298899 | 10.028582 | 237.176674 | 32493.945741 |
| min | 295.0 | 14.0 | -1.0 | 0.0 |
| 25% | 26709.5 | 39.0 | 288.75 | 9593.03 |
| 50% | 54643.5 | 46.0 | 463.5 | 17976.51 |
| 75% | 80840.75 | 53.0 | 596.5 | 33163.31 |
| max | 98366.0 | 69.0 | 942.0 | 204999.96 |


#### Clicker Question #2

What's the average (mean) age of the individuals in this dataset?

- A) 14
- B) 46
- C) 28730
- D) NA
- E) I'm super lost/unsure

In [None]:
## YOUR CODE HERE
df['age'].mean()

## `pandas`: Common Manipulations

You'll want to be *very* familiar with a few common data manipulations when wrangling data, each of which is described below:

Manipulation | Description
-------|------------
**select** | select which columns to include in dataset
**filter** | filter dataset to only include specified rows
**mutate** | add a new column based on values in other columns
**groupby** | group values to apply a function within the specified groups
**summarize** | calculate specified summary metric of a specified variable
**arrange** | sort rows ascending or descending order of a specified column
**merge** | join separate datasets into a single dataset based on a common column
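
The manipulations in the table can be sketched end to end on a toy example. Everything here (the `people` and `labels` DataFrames, their column names, and the values) is made up for illustration; each step is covered with the course data in the sections that follow.

```python
import pandas as pd

# Made-up data for illustration only
people = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'group': ['a', 'b', 'a', 'b'],
    'age': [17, 30, 42, 25],
    'score': [300, 500, 700, 600],
})
labels = pd.DataFrame({'group': ['a', 'b'], 'label': ['control', 'treatment']})

selected = people[['group', 'age', 'score']]                        # select
adults = selected[selected['age'] >= 18]                            # filter
adults = adults.assign(score_scaled=adults['score'] / 10)           # mutate
summary = adults.groupby('group', as_index=False)['score'].mean()   # groupby + summarize
summary = summary.sort_values(by='score', ascending=False)          # arrange
result = summary.merge(labels, on='group')                          # merge

print(result)
```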



## Selecting & Dropping Columns

- include subset of columns of larger data frame

In [None]:
df.head()

In [None]:
# specify which columns to include
select_df = df[['id', 'age', 'score', 'value']]
select_df.head()

In [None]:
# Drop columns we don't want
df = df.drop(labels=['first_name', 'last_name'], axis='columns')

In [None]:
# Check out the DataFrame after dropping some columns
df.head()

In [None]:
# reminder about documentation:
df.drop?

## Filtering Data (slicing)

- include a subset (slice) of rows from larger data frame

In [None]:
# Check if we have any data from people below the age of 18
sum(df['age'] < 18)

In [None]:
# before filtering
df.shape

In [None]:
# Select only participants who are 18 or older
df = df[df['age'] >= 18]
df.shape
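
Filters can also be combined. The sketch below uses a made-up `toy` DataFrame to show how Boolean Series are joined with `&` (and), `|` (or), and `~` (not). The parentheses-free form works here because each condition is stored in its own variable first.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'age': [16, 25, 40, 70], 'score': [-1, 500, 650, -1]})

is_adult = toy['age'] >= 18      # Boolean Series, one entry per row
has_score = toy['score'] != -1

both = toy[is_adult & has_score]     # rows where both conditions hold
either = toy[is_adult | has_score]   # rows where at least one condition holds
minors = toy[~is_adult]              # rows where the condition does not hold

print(len(both), len(either), len(minors))
```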

## Missing Data (NaNs)

Examples of relevant methods: `isna()`, `dropna()`, `fillna()`.
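
As a quick sketch of how these three methods differ, consider a made-up `toy` DataFrame with one missing entry. Filling with the column median is just one possible choice, used here as an assumption for illustration; whether dropping or filling is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing entry in 'value'
toy = pd.DataFrame({'age': [46, 52, 44], 'value': [24547.87, np.nan, 9553.12]})

n_missing = toy['value'].isna().sum()   # count missing entries in the column

dropped = toy.dropna()                  # remove rows containing any NaN

# Alternatively, fill NaN with a stand-in value (here, the column median)
filled = toy.fillna({'value': toy['value'].median()})

print(n_missing, dropped.shape, filled.shape)
```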

In [None]:
# Check for missing values in a column (series)
df['value'].hasnans

In [None]:
# can operate on entire dataframe
df.isna()

In [None]:
# Count null values in each column
df.isna().sum(axis='rows')

In [None]:
# Have a look at the missing values
df[df['value'].isna()]

## Dealing with Missing Data - NaNs

In [None]:
# Dealing with null values: Drop rows with missing data
df2 = df.dropna()
df2.shape

## Finding & Dealing with Bad Values

In [None]:
# Check for the properties of specific columns
df['score'].describe()

In [None]:
# Check the plot of the data for score to see the distribution
df['score'].plot(kind='hist', bins=25);

In [None]:
# Look for how many values have a -1 value in 'score'
sum(df['score'] == -1)

In [None]:
# Drop any row with -1 value in 'score'
df = df[df['score'] != -1]
df.shape
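
Dropping rows is one option. Another, sketched here on made-up data, is to recode the placeholder `-1` as `NaN` so the rows are kept but the bad values no longer distort summary statistics. Which approach fits depends on the analysis; this is an alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Made-up scores where -1 marks "no score recorded"
toy = pd.DataFrame({'score': [-1, 492, 489, -1]})

mean_with_placeholder = toy['score'].mean()   # pulled down by the -1 values

# Recode the placeholder as missing; the column becomes float to hold NaN
toy['score'] = toy['score'].replace(-1, np.nan)
mean_without_placeholder = toy['score'].mean()  # NaN is skipped by default

print(mean_with_placeholder, mean_without_placeholder)
```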

## Creating new columns (mutating)

- `assign` can be very helpful in adding a new column
- lambda functions can be used to carry out calculations
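
A minimal sketch of `assign` with lambda functions, using a made-up `toy` DataFrame. The lambda receives the DataFrame as it exists at that point, so a later column (`age_weeks`, a hypothetical name) can build on one created earlier in the same call.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'age': [44, 46, 52]})

toy = toy.assign(
    age_days=lambda d: d['age'] * 365,        # approximate age in days
    age_weeks=lambda d: d['age_days'] // 7,   # uses the column created just above
)

print(toy)
```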

In [None]:
# convert age in years to age in (approximate) days
df = df.assign(age_days = df['age'] * 365)
df.head()

In [None]:
df['age_months'] = df['age'] * 12
df.head()

## Grouping & summarizing

- group by a particular variable
- calculate summary statistics/metrics within group

In [None]:
# calculate average within each age
df.groupby('age').agg('mean')
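
To summarize a single column with several metrics at once, one option is to select the column after grouping and pass a list to `agg`. The sketch below uses a made-up `toy` DataFrame.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'age': [44, 44, 46, 46, 46],
    'score': [400, 600, 300, 500, 700],
})

# One row per age; one column per requested summary metric
by_age = toy.groupby('age')['score'].agg(['count', 'mean', 'max'])

print(by_age)
```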

## Sorting Rows (arrange)

- specify order in which to display rows

In [None]:
df.head()

In [None]:
# sort by values in age
df = df.sort_values(by = ['age'])
df.head()
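
Sorting is ascending by default. The sketch below, on made-up data, shows descending order and sorting by more than one column, where the second column breaks ties in the first.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'age': [46, 44, 46, 52], 'score': [492, 333, 536, 351]})

# Oldest first
oldest_first = toy.sort_values(by='age', ascending=False)

# Youngest first; within the same age, highest score first
ranked = toy.sort_values(by=['age', 'score'], ascending=[True, False])
ranked = ranked.reset_index(drop=True)   # renumber rows after sorting

print(ranked)
```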

## TBC: we'll return to everything below this point *after* the data notes

## Combining datasets
![](img/join.png)

In [None]:
## Create two DataFrames
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

In [None]:
left

In [None]:
right

In [None]:
left.merge?

In [None]:
# inner merge by default
pd.merge(left, right, on='key')

In [None]:
# same as above
pd.merge(left, right, on='key', how='inner')

In [None]:
# right merge
pd.merge(left, right, on='key', how='right')

In [None]:
# left merge
pd.merge(left, right, on='key', how='left')

In [None]:
# outer merge
pd.merge(left, right, on='key', how='outer')
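
To compare the merge types side by side, the sketch below repeats the merges with fixed (made-up) values instead of random ones and counts the resulting rows. The `suffixes` and `indicator` arguments are optional extras: the first names the two overlapping `value` columns, and the second adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Same keys as above, with fixed made-up values so the output is reproducible
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [20, 40, 50, 60]})

# Number of rows produced by each merge type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='key', how=how, suffixes=('_left', '_right'))
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True labels each row as 'left_only', 'right_only', or 'both'
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']])
```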

#### Clicker Question #3

If table A had 5 rows and table B had 5 rows and 3 of those rows in each table were from the same observations present in the *other* table, how many rows would be present if an **inner merge** were carried out?

- A) 3
- B) 5
- C) 10
- D) 13
- E) Totally unsure

#### Clicker Question #4

If table A had 5 rows and table B had 5 rows and 3 of those rows in each table were from the same observations present in the *other* table, how many rows would be present if a **left merge** were carried out?

- A) 3
- B) 5
- C) 10
- D) 13
- E) Totally unsure

## Application Programming Interfaces (APIs)

- APIs are basically a way for software to talk to software 
    - They are an interface into an application / website / database designed for computers / software.

Notes on APIs:
- Follow API guidelines! 
- These guidelines typically specify the number / rate / size of requests
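
One simple way to respect a request-rate guideline is to pause between calls. The helper below is a hypothetical sketch: `fetch_politely`, its parameters, and the one-second default are assumptions for illustration, not taken from any particular API's rules. It accepts any function that fetches a URL, so it can be tried without network access.

```python
import time


def fetch_politely(urls, fetch, min_interval=1.0):
    """Call fetch(url) for each URL, leaving at least min_interval seconds between calls."""
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            # Sleep only for whatever part of the interval has not yet elapsed
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.append(fetch(url))
    return results


# Stand-in fetch function so the sketch runs offline
demo = fetch_politely(['page-1', 'page-2'], fetch=str.upper, min_interval=0.1)
print(demo)
```

With the `requests` module, the same helper could be called as `fetch_politely(urls, requests.get)`; always check the specific API's documentation for its actual limits.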

## Github API

You can access the GitHub API at the following URL. Just add specifiers for what you are looking for.

https://api.github.com/

For example, the following URL returns information about the user 'ShanEllis':

https://api.github.com/users/shanellis

<center>
<img src="img/github.png" alt="sql" height="100" width="100">
</center>

## Requesting Web Pages from Python

In [None]:
# The requests module allows you to send URL requests from python
import requests  
from bs4 import BeautifulSoup

In [None]:
# Request data from the Github API on a particular user
page = requests.get('https://api.github.com/users/shanellis')  

In [None]:
# The content we get back is a messily organized json file
page.content

#### Clicker Question #5

What type/format of output is this?

- A) CSV
- B) XML
- C) JSON
- D) API
- E) I'm super lost

In [None]:
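# We can read in the json data with pandas
# (wrapped in StringIO, since newer pandas versions deprecate passing raw JSON text directly)
from io import StringIO
git_data = pd.read_json(StringIO(page.text), typ='series')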

In [None]:
# Check out the pandas series object full of data
git_data  
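
The same idea can be sketched without a network request, using Python's built-in `json` module. The payload below is made up: the field names resemble a small slice of a user record, but the values are placeholders, not real API output.

```python
import json

import pandas as pd

# Made-up payload for illustration; values are placeholders, not real API output
sample = b'{"login": "example-user", "public_repos": 12, "followers": 34}'

record = json.loads(sample)      # JSON bytes/text -> Python dict
as_series = pd.Series(record)    # dict keys become the Series index

print(as_series)
```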

### Authorized Access - OAuth

Open Authorization is a protocol to authorize access (of a user / application) to an API.

OAuth provides a secure way to 'log-in' without using account names and passwords. 

It is effectively a set of keys and tokens you can use to access APIs.
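
In practice, token-based access usually means sending the token in a request header rather than a username and password. The sketch below only builds the headers and sends nothing. The helper name `build_auth_headers`, the `GITHUB_TOKEN` environment variable, and the placeholder token are assumptions for illustration.

```python
import os


def build_auth_headers(token):
    """Return HTTP headers that present a token instead of a username/password."""
    return {'Authorization': f'Bearer {token}'}


# Read the token from an environment variable so it never appears in the notebook;
# the variable name and the fallback value are placeholders
token = os.environ.get('GITHUB_TOKEN', 'placeholder-token')
headers = build_auth_headers(token)

# With a real token, an authenticated request could then look like:
# requests.get('https://api.github.com/user', headers=headers)
print(sorted(headers))
```

Keeping tokens out of notebooks and repositories matters: anyone who has the token can act with its permissions.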

## Web Scraping vs. APIs

Web scraping and APIs are different approaches:

- APIs are an interface to interact with an application, designed for programmatic use
    - They allow systematic, controlled access to (for example) an application's database
    - They typically return structured (friendly) data 

- Web scraping (typically) involves navigating through the internet, programmatically following an architecture built for humans
    - This can be hard to systematize, being dependent on the idiosyncrasies of a web page at the time you request it
    - This typically returns relatively unstructured data
    - This entails much more wrangling of the data
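
The difference in wrangling effort can be sketched with the same made-up records delivered two ways: as JSON (what an API might return) and as an HTML table (what a scraper would see). Both strings are invented for illustration, and the scraping code depends entirely on this page's particular layout.

```python
import json

from bs4 import BeautifulSoup

# The same made-up records, as an API-style response and as a web page
api_response = '[{"name": "Ana", "age": 31}, {"name": "Ben", "age": 45}]'
html_page = """
<html><body>
  <h1>Members</h1>
  <table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Ana</td><td>31</td></tr>
    <tr><td>Ben</td><td>45</td></tr>
  </table>
</body></html>
"""

# API: one call yields structured, typed data
from_api = json.loads(api_response)

# Scraping: locate the rows, pull out the text, and restore the types by hand
soup = BeautifulSoup(html_page, 'html.parser')
from_html = []
for row in soup.find_all('tr')[1:]:   # skip the header row
    name, age = [cell.get_text() for cell in row.find_all('td')]
    from_html.append({'name': name, 'age': int(age)})

print(from_api)
print(from_html)
```

If the page's layout changed (say, a new column were added), the scraping loop would need to be rewritten, while the JSON parsing would keep working.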

## Where to Find Data?

* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets/blob/master/README.rst)
* [Data.gov](https://catalog.data.gov/dataset)
* [Data Is Plural](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)
* [UCSD Datasets](https://ucsd.libguides.com/data-statistics/home)
* [Datasets | Deep Learning](http://deeplearning.net/datasets/)
* [Stanford | Social Science Data Collection](https://data.stanford.edu/)
* [Eviction Lab (email required)](https://evictionlab.org/get-the-data/)
* [San Diego Data](https://data.sandiego.gov/)
* [US Census](https://www.census.gov/)
* [Open Climate Data](http://openclimatedata.net/)
* [Data and Story Library](https://dasl.datadescription.com/datafiles/)
* [UCSD behavioral mobile data](http://extrasensory.ucsd.edu/)
* [Kaggle](https://www.kaggle.com/)
* [FiveThirtyEight](https://data.fivethirtyeight.com/)
* [data.world](https://data.world/)
* [Free Datasets - R and Data Mining ](http://www.rdatamining.com/resources/data)
* [Data Sources for Cool Data Science Projects](https://blog.thedataincubator.com/2014/10/data-sources-for-cool-data-science-projects-part-1/)

## Notes on Working with Data

### Data Science is Ad-Hoc

- It is part of the job description to put things together that were not designed to go together.
- We do not have universal solutions, but haphazard, idiosyncratic systems, for data collection, storage and analysis.
- Data is everywhere. But relatively little of it was collected *as data*.

### Data Collection, Curation, and Storage are Difficult

- It can be difficult to choose broadly useful standards
- Take time to think about your data, and how you will load, store, organize and save it

### Data is Inherently Noisy

- We live in a messy, noisy, world, with messy, noisy, people, using messy, noisy instruments.
- There is no perfect data. 
    - There is better or worse data, given the context.

### Different Objectives

- Humans and computers are different.
- We interact with '*data*' in different ways.
- This underlies many aspects of data wrangling
    - The 'friendliness' of data types / files
    - The difference between web scraping and APIs
    - A disconnect between data in the real world, and data we want to use

## So... What to do?

- Think about how your data are stored and structured
- Look at your data before you analyze it
    - are there missing values? 
    - outlier values? 
- Are your data trustworthy? 
    - source?
    - how was it generated?

## Specific Recommendations

- Prioritize using well structured, common, open file types
    - Take advantage of existing tools to deal with these files (numpy, pandas, etc.)

- Look into, and then follow, common conventions
    - Minimize custom objects, workflows and data files 
- Look for APIs. Ask if they are available.
    - Acknowledge that web scraping and/or wrangling unstructured data are complex / long tasks

- Think about data flow from the beginning. Organize your data pipeline, consider the 'wrangling' aspects throughout
    - Set yourself up with a well-organized, labelled approach to your data
    - Think about when and how you might want/need to save out intermediate results.
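
Saving out an intermediate result can be as simple as writing a cleaned DataFrame to a CSV file and reading it back later. In this sketch the data, the temporary directory, and the file name `cleaned_step1.csv` are all placeholders.

```python
import os
import tempfile

import pandas as pd

# Made-up "cleaned" data standing in for an intermediate result
cleaned = pd.DataFrame({'id': [295, 620], 'age': [46, 46], 'score': [492, 489]})

# A temporary directory keeps the sketch self-contained; a project would use its own data folder
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'cleaned_step1.csv')

cleaned.to_csv(out_path, index=False)   # index=False avoids writing the row index as an extra column
reloaded = pd.read_csv(out_path)

print(reloaded.equals(cleaned))
```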