# Welcome to Introduction to Python for Data Science

A quick overview of some basic Python concepts, applied to the field of data science. This course was designed by the Galvanize community. Feel free to download, fork, and work with it as you please!


## In this course you will:

* Set up your computer
* Set up IPython/Jupyter
* Learn basic Python commands
* Explore NumPy, Pandas, and Matplotlib
* Enjoy some sandbox time!

## Gut check, Galvanize-style!

* This course is for beginners
* Feel free to move ahead
* Help others when you can
* Be patient and nice
* We’ll all get through it!

## What IS Python?

Python was created by Guido van Rossum and first released in 1991, and yes, it is totally named after Monty Python.
Today, the language is developed as an enormous open-source project with a huge worldwide community.

## Why do we learn Python for data science?

Python is an easy-to-learn, general-purpose, scalable language with popular libraries such as:

* SciPy (Math, Science, Engineering)
* StatsModels (Statistics)
* Pandas (Data Frames)
* scikit-learn (Machine Learning)
* ggplot, Matplotlib, Plot.ly (Graphics)

## Let's set up your computer!

### Step 1: Install Anaconda.

Anaconda from Continuum Analytics provides virtually everything you need to get started in data science. Go to [continuum.io/downloads](https://continuum.io/downloads) and follow the instructions on the website; they vary by platform.

### Step 2: Install Git (optional)

You'll need a command line prompt to launch Anaconda's Jupyter Notebooks for this lesson. We recommend Git in case you're interested in version control or cloud deployment in the future. Go to [git-scm.com/downloads](https://git-scm.com/downloads) and follow the directions there.

### Step 3: Activate your Jupyter Notebook

We're going to use Jupyter Notebook, formerly known as IPython Notebook (IPython is short for “Interactive Python”). It lets you write and run code in a browser in a faster, more interactive way.

1. Open up your Git terminal
2. Navigate to your working directory
3. Type `jupyter notebook` into the prompt
4. Some computation should happen...
5. Go to your browser and type in this URL: http://localhost:8888/ (this may launch anyway)

If you see a new browser tab pop up, create a "New" notebook in the top right corner. Let's get started!

## Basic Python Commands

We're going to use this notebook to engage more actively with Python! For ease, I've created all the commands for you, but I recommend that you practice typing your own as well to get the "feel" for how a data scientist operates. *Learn by doing!*

### Data Types

Like all coding languages, Python allows us to manipulate types of data in different ways. The first step to understanding it is to know what kinds of data it handles.

There are generally **five** types of data:

* int - integer value
* float - decimal value
* bool - True/False
* complex - complex number with real and imaginary parts
* NoneType - null value

## LET'S CODE!

Run the commands in the notebook below by clicking on them and pressing Shift+Enter. *What are the outputs?*

In [None]:
type(3.1415)

In [None]:
type(10)

In [None]:
type(4+8j)

In [None]:
type(False)

In [None]:
type('')
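
The list above mentions `NoneType`, which the cells so far do not exercise. The following is a minimal sketch of how you might check it yourself; the variable names (`nothing`, `whole`, `decimal`) are illustrative only and are not part of the course notebook.

```python
# Illustrative names only; they do not come from the course notebook.
nothing = None
whole = 10
decimal = 3.1415

print(type(nothing))           # the null value has its own type, NoneType
print(type(whole + decimal))   # mixing an int and a float gives a float
print(isinstance(True, int))   # bool is a subclass of int in Python
```

`isinstance` is often handier than `type` when you only want a yes/no answer about a value's type.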

### 'Arrays' in Python

Storing and organizing data in Python is immensely helpful, especially when you're attempting data science. Below are the five types of 'iterable data' available in Python (I use that term loosely for good reason).

* str - string/varchar immutable value, defined with quotes = 'abc'
* list - collection of elements, defined with brackets = ['a', 'b']
* tuple - immutable list, defined with parentheses = ('a', 'b')
* dict - key-value pairs (kept in insertion order since Python 3.7), keys are unique and immutable, defined with braces = {'a':1, 'b':2}
* set - unordered collection of unique elements, defined with braces = {'a', 'b'}
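
As a rough sketch of how the five types above behave, the snippet below builds one of each and tries a few changes. All names and values are illustrative assumptions, not part of the course notebook.

```python
# Illustrative values only; none of these names come from the notebook.
word = 'abc'                  # str
letters = ['a', 'b']          # list
pair = ('a', 'b')             # tuple
counts = {'a': 1, 'b': 2}     # dict
unique = {'a', 'b', 'a'}      # set: the duplicate 'a' is dropped

letters.append('c')           # lists can grow in place
counts['c'] = 3               # dicts accept new keys

try:
    pair[0] = 'z'             # tuples are immutable, so this raises TypeError
except TypeError as err:
    print('tuple error:', err)

print(letters, counts, len(unique))
```

The `try`/`except` is only there so the cell keeps running after the failed tuple assignment.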

## LET'S CODE

Using the cell below, create a list called `doc` that contains four elements of varying data types.

In [None]:
doc = ['Gigawatts', 88, 'miles per hour', 1.21] # What is the result of 'doc[2]'?

What is the value of `doc[2]`? Run the cell below.

In [None]:
doc[2]
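
Beyond a single index, lists also support negative indexes and slices. Here is a small sketch using the same `doc` list; the names `first`, `last`, and `middle` are made up for illustration.

```python
doc = ['Gigawatts', 88, 'miles per hour', 1.21]  # same list as above

first = doc[0]      # indexing starts at 0
last = doc[-1]      # negative indexes count from the end
middle = doc[1:3]   # a slice includes the start and excludes the stop

print(first, last, middle)
```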

### Control Flows in Python

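Conditional statements are also very useful in coding, and Python's syntax is slightly different from that of many other languages: each block is introduced by a colon and marked by indentation rather than braces.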

##### If, else statements

In [None]:
x, y = False, False
if x:
    print('Apple')
elif y:
    print('Orange')
else:
    print('sandwich')

*What do you think the output will be?*

##### While loops

Note the indents - a critical part of the syntax of Python. As always, make sure your `while` loops are written so that they eventually stop, or they will run forever and could freeze your notebook.

Try the code below.

In [None]:
x = 0
while True:
    print('Hello!')
    x += 1
    if x >= 3:
        break   

What do you think will be printed out from this loop?     
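
One alternative way to write the same kind of loop, sketched here with a hypothetical counter named `count`, is to put the stopping condition in the `while` line itself instead of using `break`.

```python
count = 0            # hypothetical counter name
while count < 3:     # the loop stops as soon as this condition is False
    print('Hello!')
    count += 1
```

Keeping the condition in the `while` line makes it easier to see at a glance when the loop ends.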

##### For Loops

In [None]:
for k in range(4):
    print(k ** 3)
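
A `for` loop is not limited to `range`; it can walk over any of the iterable types described earlier. The sketch below collects the cubes into a list and then loops over the `doc` list with the built-in `enumerate`; the names `cubes`, `position`, and `item` are illustrative.

```python
doc = ['Gigawatts', 88, 'miles per hour', 1.21]

# Collect the results of the loop above instead of printing them
cubes = []
for k in range(4):
    cubes.append(k ** 3)
print(cubes)

# enumerate yields (index, element) pairs for any iterable
for position, item in enumerate(doc):
    print(position, item)
```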

### Functions - creating ways to use and interact with objects

Starting a statement with `def` defines a function, optionally with parameters. You can call that function afterwards by passing arguments inside its parentheses `()`. Check out the basic math functions we have below.

In [None]:
def x_plus_4(x):
    return x + 4

x_plus_4(5)

In [None]:
def subtract(x,y):
    return x - y

subtract(7,3)
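
Functions can also give parameters default values and be called with keyword arguments. The variant below, named `subtract_default` purely for illustration, is a sketch and not part of the course notebook.

```python
def subtract_default(x, y=1):
    """Return x minus y; y falls back to 1 when it is omitted."""
    return x - y

print(subtract_default(7, 3))      # both arguments passed by position
print(subtract_default(7))         # y uses its default value
print(subtract_default(y=7, x=3))  # keyword arguments can come in any order
```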

### Import - bringing in libraries and frameworks to assist!

`import` makes our lives much easier by bringing in functionality from Python's standard library, from the packages Anaconda installs, and elsewhere. For example, we can import pi instead of calculating it ourselves.

In [None]:
import math # Typically, we like to do imports at the top of a Python file
math.pi

In [None]:
from math import sin
sin(math.pi/2)
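
The cells above use two import styles. A short sketch comparing them, plus the `as` alias form used later for NumPy and Pandas, might look like this; the alias `m` is an arbitrary choice for illustration.

```python
import math               # whole module, accessed as math.<name>
import math as m          # same module under a shorter alias
from math import sin, pi  # individual names, used without a prefix

print(math.pi == m.pi == pi)  # all three refer to the same value
print(sin(pi / 2))
```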

## LET'S CODE WITH NUMPY

##### What is NumPy?

NumPy is a Python library of mathematical functions that allows us to operate on huge arrays and matrices of data. It's the first step to giving us the ability to do some interesting things with larger data sets.

For now, let's start small and build a basic array.

In [None]:
import numpy as np
a = np.array([0,1,5,7,6])
a[3]

Cool. Let's start building in more than one dimension of data, transpose it, and even multiply it with another array.

In [None]:
a = np.array([[1,2,3],[4,5,6]])
a.shape

In [None]:
a.T # transposing the array
a.T.shape

In [None]:
b = np.array([6,7])
np.dot(a.T,b) # matrix multiplication
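
To see why the transpose matters, it can help to track the shapes by hand. The sketch below repeats the multiplication and also uses the `@` operator, which gives the same result for these shapes; the names `product` and `same` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
b = np.array([6, 7])                  # shape (2,)

# a.T has shape (3, 2); its last axis (2) matches b's length (2)
product = np.dot(a.T, b)              # result has shape (3,)
same = a.T @ b                        # @ is the matrix-multiplication operator

print(product, product.shape)
```

Without the transpose, `a` has a last axis of length 3, which does not match `b`, so the multiplication would fail.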

Let's kick it up a notch. Potentially, you can work in the nth dimension!

In [None]:
aa = np.array(
    [[1,2,3],[4,5,6],[1,2,3],[4,5,6]]
    )
bb = np.array(
    [[[3],[4],[6]],[[6],[5],[7]]]
    )

In [None]:
aa.shape, bb.shape # why do we check?

In [None]:
np.dot(aa,bb)
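
The shape check matters because `np.dot` on higher-dimensional arrays sums over the last axis of the first array and the second-to-last axis of the second, so those two lengths must agree. A sketch of that rule with the same arrays follows; the name `result` is illustrative.

```python
import numpy as np

aa = np.array([[1, 2, 3], [4, 5, 6], [1, 2, 3], [4, 5, 6]])  # shape (4, 3)
bb = np.array([[[3], [4], [6]], [[6], [5], [7]]])            # shape (2, 3, 1)

# Last axis of aa (3) matches the second-to-last axis of bb (3)
result = np.dot(aa, bb)
print(result.shape)   # remaining axes of aa, then remaining axes of bb

# Transposing aa makes its last axis 4, which no longer matches
try:
    np.dot(aa.T, bb)
except ValueError as err:
    print('shape mismatch:', err)
```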

## LET'S CODE WITH PANDAS

Pandas is an open-source Python library providing powerful data structures and analysis tools. It's commonly used by data scientists and will be helpful for some cool things we'll do here.

First, we'll `import` pandas and give it a short name for ease of use. 

Let's create a new dataframe! Here, we're importing a dictionary into Pandas.

In [None]:
import pandas as pd
import numpy as np # Probably don't need this, but just in case
dd = {
    '0': pd.Series([1,2], index=['a','b']), 
    '1': pd.Series([15,25,35], index=['a','b','c'])
    }
pd.DataFrame(dd)

We can also write dataframes directly by passing a dictionary to `pd.DataFrame({})`. See below!

In [None]:
df = pd.DataFrame({
    'int_col': [1,2,6,8,-1],
    'float_col':[0.1,0.2,0.2,10.1,None],
    'str_col': ['a','b',None,'c','a']})
df

Now that you've created a quantitative dataframe, we can do a few statistical analyses. Run the following.

In [None]:
df.describe() # basic stats

In [None]:
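df[['int_col', 'float_col']].corr() # correlation (numeric columns only; recent pandas raises an error on the string column)

In [None]:
df[['int_col', 'float_col']].cov() # covariance (numeric columns only)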
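
The statistics above skip missing values, so it is worth knowing how many there are. Below is a minimal sketch that counts them per column and computes the column means by hand; the names `numeric`, `missing_per_column`, and `column_means` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'int_col': [1, 2, 6, 8, -1],
    'float_col': [0.1, 0.2, 0.2, 10.1, None],
    'str_col': ['a', 'b', None, 'c', 'a']})

numeric = df[['int_col', 'float_col']]   # keep only the numeric columns

missing_per_column = df.isna().sum()     # number of missing values per column
column_means = numeric.mean()            # missing values are skipped by default

print(missing_per_column)
print(column_means)
```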

## LET'S MAKE VISUALIZATIONS

The best way to convey your data is by showing it in meaningful ways. To do so, we're going to use **Matplotlib**, another popular Python library for creating visualizations.

First, let's import the library and the command for showing these visualizations inline.

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt

What might this information look like? Let's use our previous libraries to create some random data.

In [None]:
plot_df = pd.DataFrame(
    np.random.randn(100,2),columns=['x','y']
    )
plot_df['y'] += plot_df['x']
plot_df.head()
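
Because the data is random, every run produces different numbers. If you want repeatable results, one option is a seeded generator, sketched below with NumPy's `default_rng`; the seed value 42 and the name `xy_corr` are arbitrary choices for illustration. Adding `x` to `y` is what makes the two columns move together.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for repeatability
plot_df = pd.DataFrame(
    rng.standard_normal((100, 2)), columns=['x', 'y']
)
plot_df['y'] += plot_df['x']      # y now depends partly on x

xy_corr = plot_df['x'].corr(plot_df['y'])  # correlation between the two columns
print(xy_corr)
```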

We can now create a simple plot of the data. Run the code below. Did it work?

In [None]:
plot_df.plot()

Let's create a scatterplot! Run the code below. Do you see any major differences?

In [None]:
plot_df.plot('x','y',kind='scatter')

Let's create a histogram! Run the code below. How does this visualization compare with the others?

In [None]:
plot_df.plot(kind='hist',alpha=0.3)

## SANDBOX TIME!

You're almost done! Let's see how you do on your own.

**Try one of the following:**

* Merging and joining your dataframes
* Removing and replacing some missing values
* Renaming your data columns
* Downloading a dataset and conducting some analysis
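
If you are not sure where to start, here is one possible sketch of the first three ideas using the dataframe from earlier. The `labels` lookup table, the fill values, and the new column name `count` are all made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'int_col': [1, 2, 6, 8, -1],
    'float_col': [0.1, 0.2, 0.2, 10.1, None],
    'str_col': ['a', 'b', None, 'c', 'a']})

# Replace missing values, using a different fill value per column
filled = df.fillna({'float_col': 0.0, 'str_col': 'missing'})

# Rename a column
renamed = filled.rename(columns={'int_col': 'count'})

# Merge with a hypothetical lookup table on the shared 'str_col' column
labels = pd.DataFrame({
    'str_col': ['a', 'b', 'c'],
    'label': ['alpha', 'beta', 'gamma']})
merged = renamed.merge(labels, on='str_col', how='left')

print(merged)
```

A left merge keeps every row of `renamed`, so the row filled with `'missing'` has no matching `label`.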


# YOU ARE NOW A DATA SCIENTIST! (KINDA)

Don't stop learning! Visit the [Galvanize Open Source](https://github.com/galvanizeopensource) project to learn more through our other coding courses. We update this often with the latest distributed work, so check back for more details.

## Interested in learning with Galvanize? 

This course was created by the geniuses who work at Galvanize. Here are some options:

**Data Science Fundamentals: Intro to Python**
* 6-week part-time workshop

**Data Science Immersive Program**
* 12-week full-time program

**GalvanizeU**
* 12-month program in San Francisco
* Fully accredited by the University of New Haven

Learn more at our website here: [http://www.galvanize.com/courses/data-science](http://www.galvanize.com/courses/data-science)

Got feedback for us? Feel free to email us at [info@galvanize.com](mailto:info@galvanize.com).

## About this Course's Author

Lee Ngo is an evangelist for Galvanize based in Seattle. Previously he worked for UP Global (now Techstars) and founded his own ed-tech company in Pittsburgh, PA. Lee believes in learning by doing, engaging and sharing, and he teaches code through a combination of visual communication, teamwork, and project-oriented learning.

You can email him at lee.ngo@galvanize.com for any further questions.