# Python for Data Science
### A Very Brief Introduction
##### By [Jiin Jung](http://jiinjung.com) & [Minjae Yun](https://sites.google.com/view/minjaeyun/home)
##### Claremont Graduate University
##### December 6 (Friday), 2019

# Welcome!
This workshop is a brief introduction to using Python and Jupyter Notebooks.
The material is adapted from the [UC Berkeley Data 8 course](https://www.inferentialthinking.com/chapters/intro.html) for a one-hour workshop.

#### Workshop Objectives:
At the conclusion of this workshop, you will be familiar with the basic syntax of Python and with Jupyter notebooks, to the extent that you can continue to self-study [Data 8](http://data8.org).


# 1. Python

For most data science tasks there are two widely used open-source languages: Python and R. R is favoured more by those with a mathematical background, while Python is preferred by those with a programming background. Python is currently the most popular language on Stack Overflow; see this [Most Popular Programming language on Stack Overflow Bar Chart Race](https://www.youtube.com/watch?v=cKzP61Gjf00). Choosing a language that is used by more people allows you to communicate and collaborate with more people.

Run the following cell. You can run a cell with Ctrl-Enter (which runs the cell and keeps it selected), Shift-Enter (which runs the cell and then selects the next cell), the Run button on the toolbar, or Run Cells in the Cell menu.


In [None]:
from IPython.display import Image, HTML
Image(url="http://businessoverbroadway.com/wp-content/uploads/2019/01/programming_languages_used.png")

Genevieve Hayes examined 100 data science job advertisements found on LinkedIn between 22 April 2019 and 5 May 2019, across four English-speaking countries (Australia, Canada, UK and USA). Run the following cell to see the [top 10 data science programming languages](https://towardsdatascience.com/which-programming-language-should-data-scientists-learn-first-aac4d3fd3038).

In [None]:
Image(url= "https://miro.medium.com/max/2289/1*KWhvKrCjKG1JbbWPoSLO4g.png")

# 2. Jupyter notebooks

Jupyter notebooks are an incredible way to work and cowork. They allow you to present documentation and working code in the same file. People can read through the documentation and see the running code. They also make it easy for coworkers to share the file and edit the code collectively.

### Cells
You may want to edit this documentation and make some notes while taking this class.
There are two types of **_cells_**: markdown and code. This is a markdown cell. Code cells run actual Python code!
You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.



### Markdown cells for documentation
To edit the documentation: (1) select this cell, (2) press Enter, (3) edit the text, and (4) click Run.

You can change the heading level by changing the number of `#` characters.
# Heading
## Heading
### Heading
#### Heading

### Tips

When using cells, try to separate distinct pieces of code.

At the end of each cell, there is room for some output. The output could be blank, printed text, images, or HTML. Pretty much anything you can think of.

### Code cells for Python code

The following cell is a "code cell". You'll see `In [ ]:` next to each code cell; once a cell has run, the brackets show a counter for the cells you have run.


In [None]:
# This is a code cell

You may run the code above, but it won't produce any output. That is because `#` turns the rest of the line into a comment, which Python ignores.

Try running the following cell and see what it prints out:

In [None]:
print("Hello world!")

### Practice

Print this: The world is round.


Did you get the output? Did you encounter an error? Check whether you enclosed the text in quotation marks (" ") and typed `print` in lowercase letters.


You run cells by pressing Ctrl-Enter. Pressing Shift-Enter runs the cell and advances to the next one.

You can find many more handy keyboard shortcuts by viewing "Help->Keyboard shortcuts"

# 3. Programming in Python

# Expressions

Run the following cells and see outputs.

In [None]:
3*4

In [None]:
3**4

In [None]:
9/2

In [None]:
9%2

In [None]:
5+2

In [None]:
5-2

Python expressions obey the same familiar rules of **_precedence_** as in algebra: 

- Multiplication and division occur before addition and subtraction. 

- Exponentiation occurs before multiplication and division.

- Parentheses can be used to group together smaller expressions within a larger expression.
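
As a small illustration of the three rules above, here is a sketch using numbers that differ from the cells that follow. The variable names are made up for this example.

```python
# Multiplication happens before addition: 2 + (3 * 4).
no_parentheses = 2 + 3 * 4

# Parentheses override the default order: (2 + 3) * 4.
with_parentheses = (2 + 3) * 4

# Exponentiation happens before multiplication: 2 * (5 ** 2).
power_first = 2 * 5 ** 2

# Division happens before subtraction: 10 - (6 / 3).
division_first = 10 - 6 / 3

print(no_parentheses, with_parentheses, power_first, division_first)
```

Note that the last value is printed as `8.0` rather than `8`, because `/` always produces a float.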

Before you run the following cells, first calculate your answers.

In [None]:
3**4*2

In [None]:
2*3**4

In [None]:
(2*3)**4

In [None]:
3**(4+2)

### Practice

Write code for the expression: $3(2+5)^2$

# Names
Names are given to values in Python using an **_assignment_** statement. In an assignment, a name is followed by =, which is followed by any expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.

In [None]:
a = 3
b = 4
a*b

In [None]:
fahrenheit = 55
celsius = (fahrenheit-32) *5/9
celsius


In [None]:
int(celsius)

In [None]:
round(celsius, 2)

In [None]:
kelvin = celsius + 273.15
kelvin

### Practice

Write code to calculate how many seconds it takes a 400 g tennis ball to fall from a 10-meter-high building.

Here is the equation for a falling body:

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/10f3d0383ea94ebc8fa369018069467481453818" align="left">

Note: Near the surface of the Earth, the acceleration due to gravity ($g$) is 9.807 $m/s^2$ (meters per second squared).
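
One possible way to set up this calculation is sketched below. It assumes the equation pictured above is the standard free-fall time formula $t = \sqrt{2d/g}$, which ignores air resistance; if the image shows a different form, adjust the code to match. The names `height`, `mass`, and `fall_time` are chosen only for this example.

```python
import math

g = 9.807      # acceleration due to gravity in m/s^2 (from the note above)
height = 10    # height of the building in meters
mass = 0.4     # mass of the ball in kg; unused, since the assumed formula has no mass term

# Assumed formula: t = sqrt(2 * d / g), ignoring air resistance.
fall_time = math.sqrt(2 * height / g)
print(round(fall_time, 2))
```

Under this assumption the mass never enters the calculation, so the 400 g figure does not affect the result.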

# Call Expressions

Call expressions invoke **_functions_**, which are named operations. The name of the function appears first, followed by expressions in parentheses.

In [None]:
abs(-45)

In [None]:
round(kelvin)

In [None]:
max(fahrenheit, celsius, kelvin)

A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.

In [None]:
import math
import operator
math.sqrt(operator.add(4, 5))

In [None]:
math.log?

In [None]:
math.log(16, 2)

The list of [Python's built-in functions](https://docs.python.org/3/library/functions.html) is quite long and includes many functions that are never needed in data science applications. The list of [mathematical functions in the math module](https://docs.python.org/3/library/math.html) is similarly long. This text will introduce the most important functions in context, rather than expecting the reader to memorize or understand these lists.

# 4. Data Types

The built-in **type()** function returns the type of the result of any expression.

## Numbers
Python distinguishes between two different types of numbers:
- Integers are called **int** values in the Python language. They can only represent whole numbers (negative, zero, or positive) that don't have a fractional component.
- Real numbers are called **float** values (or floating point values) in the Python language. They can represent whole or fractional numbers but have some limitations, such as limited precision.
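
The main limitation of float values is precision. The following sketch shows one well-known case; the variable name `tenth_sum` is made up for this example, and `math.isclose` is one common way (not the only one) to compare floats approximately.

```python
import math

# 0.1 and 0.2 cannot be stored exactly as floats,
# so their sum is slightly off from 0.3.
tenth_sum = 0.1 + 0.2
print(tenth_sum)

# An exact comparison fails because of the tiny rounding error.
print(tenth_sum == 0.3)

# Comparing within a small tolerance succeeds.
print(math.isclose(tenth_sum, 0.3))

# Integers, by contrast, stay exact even when they are very large.
print(type(2 ** 100), type(tenth_sum))
```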



In [None]:
2

In [None]:
type(2)

In [None]:
1.2

In [None]:
type(1.2)

## String

Much of the world's data is text. A piece of text represented in a computer is called a **string**.

A string can represent a word, a sentence, or even the contents of every book in a library. Since text can include numbers (5) or truth values (True), a string can also describe those things.


In [None]:
"data" + "science"

In [None]:
"data" + " " + "science"

In [None]:
'data'+'science'

Single and double quotes can both be used to create strings: 'science' and "science" are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings.

In [None]:
"data's science"

The **str** function returns a string representation of any value.

In [None]:
"That's " + str(1+1) + ' ' + str(True)

In [None]:
3.14

In [None]:
str(3.14)

In [None]:
type(3.14)

In [None]:
type(str(3.14))

In [None]:
"That's " + "2" + ' ' + str(True)

In [None]:
type("2")

## Comparisons

Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values.

Note: In computer science, a boolean is a data type that has two possible values: it is either true, or false. It is named after the English mathematician and logician George Boole, whose algebraic and logical systems are used in all modern digital computers.

In [None]:
3 > 1+1

The value **True** indicates that the comparison is valid.

- Less than <
- Greater than >
- Less than or equal <=
- Greater than or equal >=
- Equal ==
- Not equal !=

Write code to check that "the average of x and y is between the smaller number and the larger number."

In [None]:
x = 12
y = 5
min(x, y) <= (x + y) / 2 <= max(x, y)

Strings can be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.
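
A short sketch of this ordering, using the built-in `sorted` function (the names `words` and `ordered` are made up for this example). One caveat worth knowing: the comparison is case-sensitive, and every uppercase letter sorts before every lowercase letter.

```python
words = ["Dog", "Catastrophe", "Cat"]

# sorted() arranges strings using the same comparison as < and >.
ordered = sorted(words)
print(ordered)

# "Cat" is a prefix of "Catastrophe", so the shorter string is smaller.
print("Cat" < "Catastrophe")

# Uppercase letters come before lowercase ones,
# so "Zebra" is smaller than "apple".
print("Zebra" < "apple")
```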


In [None]:
"Dog" > "Catastrophe" > "Cat"

# 5. Sequences

## Arrays

Values can be grouped together into collections. By grouping values together, we can write code that performs a computation on many pieces of data at once. Calling the function **make_array** on several values places them into an **_array_**.
An array is a kind of sequential collection.


### Installation of datascience module

Up to this point, we have only used built-in functions. From here on, we need to install the [datascience module](https://github.com/data-8/datascience) to use more advanced functions such as **_make_array_**.



In [None]:
%pip install datascience

Now import everything from the datascience module.

In [None]:
from datascience import *

In [None]:
array = make_array(1,2,3,4)
array

In [None]:
len(array)

In [None]:
array.size

In [None]:
sum(array)

In [None]:
array.sum()

In [None]:
sum(array)/len(array)

In [None]:
array.mean()

In [None]:
5*array + 2

### Functions on Arrays

The **numpy** package, abbreviated **np** in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays.

First, import the numpy package.


In [None]:
import numpy as np

Numpy has many useful functions. Here are some examples:

In [None]:
np.prod(array) # Multiply all elements together

In [None]:
np.sum(array) # Add all elements together

In [None]:
np.cumprod(array) # a cumulative product: for each element, multiply all elements so far

In [None]:
np.exp(array) # Exponentiate each element

In [None]:
np.log(array) # Take the natural logarithm of each element

Here is [the full list of NumPy functions](https://docs.scipy.org/doc/numpy/reference/).


## Ranges

A range is an array of numbers in increasing or decreasing order, each separated by a regular interval.

Ranges are defined using the **np.arange** function, which takes one, two, or three arguments: a start, an end, and a 'step'.

- np.arange(end): If you pass one argument to **np.arange**, this becomes the end value, with start=0 and step=1 assumed. The result is an array of increasing consecutive integers starting with 0 and stopping before end.
- np.arange(start, end): Two arguments give the start and end, with step=1 assumed. The result is an array of consecutive increasing integers from start, stopping before end.
- np.arange(start, end, step): Three arguments give the start, end, and step explicitly. The result is a range with a difference of step between each pair of consecutive values, starting from start and stopping before end.

In [None]:
np.arange(5)

In [None]:
np.arange(3,9)

In [None]:
np.arange(1.5, -2, -0.5)

#### Example: Leibniz's formula for $\pi$

Gottfried Wilhelm Leibniz (1646 - 1716) discovered a wonderful formula for $\pi$ as an infinite sum of simple fractions. 

In [None]:
from IPython.display import Image, HTML
Image(url="https://wikimedia.org/api/rest_v1/media/math/render/svg/fab3e3e4febf987b57159d81fd47995fb0af1240")

In [None]:
by_four_to_20 = np.arange(1, 20, 4)
by_four_to_20

In [None]:
positive_term_denominators = np.arange(1, 10000, 4)
positive_term_denominators 


In [None]:
positive_terms = 1/positive_term_denominators 
positive_terms

In [None]:
negative_terms = 1/(positive_term_denominators + 2)
negative_terms

In [None]:
4*(sum(positive_terms) - sum(negative_terms))
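
To see how close this approximation gets, the sketch below repeats the same sum with Python's built-in `range` instead of NumPy arrays and compares the result with `math.pi`. The names `positive_sum`, `negative_sum`, `approximation`, and `error` are made up for this example.

```python
import math

# Denominators 1, 5, 9, ..., 9997 belong to the positive terms.
positive_sum = sum(1 / d for d in range(1, 10000, 4))

# Denominators 3, 7, 11, ..., 9999 belong to the negative terms.
negative_sum = sum(1 / d for d in range(3, 10000, 4))

approximation = 4 * (positive_sum - negative_sum)
error = abs(math.pi - approximation)

print(approximation)
print(error)
```

With 5,000 terms in total the error is still on the order of 0.0002, which shows that this series converges to $\pi$ quite slowly.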

# 6. Tables



Tables are a fundamental object type for representing data sets.

To use tables, import everything from the datascience module, if you have not done so already.

In [None]:
from datascience import *

### Create Tables

- **Table()** creates an empty table.
- The **with_columns** method on a table constructs a new table with additional labeled columns. Each column of a table is an array. To add one new column to a table, call with_columns with a label and an array.
- We can give this table a name, and then extend the table with another column.




In [None]:
Table()

In [None]:
Table().with_columns('Number of petals', make_array(8, 34,5))

In [None]:
Table().with_columns('Number of petals', make_array(8,34,5),
                    'Name', make_array('lotus','sunflower','rose'))

In [None]:
flowers = Table().with_columns('Number of petals', make_array(8,34,5),
                    'Name', make_array('lotus','sunflower','rose'))

flowers.with_columns('Color', make_array('pink','yellow','red'))


In [None]:
flowers

### Read Tables

- The **read_table** method reads a CSV file that contains data and returns it as a table.


In [None]:
sat = Table.read_table('https://www.inferentialthinking.com/data/sat2014.csv')
sat

In [None]:
sat.num_columns

In [None]:
sat.num_rows

In [None]:
sat.labels

In [None]:
sat.relabeled('Critical Reading', 'Reading')

In [None]:
sat

In [None]:
sat = sat.relabeled('Critical Reading', 'Reading')
sat

In [None]:
sat.column('State')

Items in the array (column or row) are indexed 0, 1, 2, and so on.

In [None]:
sat.column(4)

In [None]:
sat.column(4).item(0)

In [None]:
sat.column(4).item(5)

### Practice
Display 'North Dakota' in the output by using **sat.column().item()**.

### Select Column

In [None]:
sat.select('State','Combined')

In [None]:
sat.select(0,5)

In [None]:
sat.select('State')

In [None]:
sat.column('State')

### Sorting Rows



In [None]:
sat.num_rows

In [None]:
sat.show(5)

In [None]:
sat.sort('State').show(10)

In [None]:
sat.sort('Combined').show(10)

In [None]:
sat.sort('Combined', descending=True).show(10)

In [None]:
help(sat.sort)

### Selecting Rows

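- The **take()** method takes a specified set of rows.
- Its argument is a row index or an array of row indices.
- It creates a new table consisting of only those rows.
- The **where()** method creates a new table with only the rows whose value in the given column satisfies a condition, such as are.above(1500).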

In [None]:
sat

In [None]:
sat.take(0)

In [None]:
sat.take(np.arange(3,6))

In [None]:
sat.sort('Combined', descending=True).take(np.arange(5))

In [None]:
sat.where('Combined', are.above(1500))

In [None]:
sat.where('State', are.equal_to('Kansas'))

In [None]:
sat.where('Combined', are.between(1200,1400))

### The end!

This workshop was a brief introduction to using Python and Jupyter Notebooks.
We covered up to Chapter 6 of the [UC Berkeley Data 8 course](https://www.inferentialthinking.com/chapters/intro.html).
We hope you are now familiar with the basic syntax of Python and with Jupyter notebooks.
Only 12 more chapters remain for you to self-study!
For more information, please visit the [Data 8](http://data8.org) website.