# Data 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## 1. Python Crash Course

### 1.1. Assignment

In [None]:
# Strings
data = 'hello world' 
print(data[0]) 
print(len(data)) 
print(data)

In [None]:
# Numbers
value = 123.1
print(value)
value = 10
print(value)

In [None]:
# Boolean
a = True
b = False
print(a, b)

In [None]:
# Multiple Assignment
a, b, c = 1, 2, 3
print(a, b, c)

In [None]:
# No value
a = None
print(a)

### 1.2. Flow Control

In [None]:
#If-Then-Else Conditional
value = 99
if value == 99:
    print('That is fast') 
elif value > 200:
    print('That is too fast') 
else:
    print('That is safe')

In [None]:
# For-Loop
for i in range(10):
  print(i)

In [None]:
# While-Loop
i=0
while i < 10:
  print(i)
  i += 1

### 1.3. Data Structures

In [None]:
#Tuples are read-only collections of items.
a = (1, 2, 3)
print(a)

In [None]:
#Lists use the square bracket notation and can be indexed using array notation.
mylist = [1, 2, 3]
print("Zeroth Value: %d" % mylist[0])
mylist.append(4)
print("List Length: %d" % len(mylist))
for value in mylist:
  print(value)

In [None]:
#Dictionaries are mappings of names to values, like key-value pairs. 
#Note the use of the curly bracket and colon notations when defining the dictionary.
mydict = {'a': 1, 'b': 2, 'c': 3} 
print("A value: %d" % mydict['a']) 
mydict['a'] = 11
print("A value: %d" % mydict['a']) 
print("Keys: %s" % mydict.keys()) 
print("Values: %s" % mydict.values()) 
for key in mydict.keys():
  print(mydict[key])

### 1.4. Functions

In [None]:
# Sum function
def mysum(x, y):
  return x + y
# Test sum function
result = mysum(1, 3)
print(result)

## 2. NumPy Crash Course
NumPy is a powerful linear algebra library for Python. What makes it so important is that almost all of the libraries in the <a href='https://pydata.org/'>PyData</a> ecosystem (pandas, scipy, scikit-learn, etc.) rely on NumPy as one of their main building blocks. Plus we will use it to generate data for our analysis examples later on!

NumPy is also incredibly fast, as it has bindings to C libraries. For more info on why you would want to use arrays instead of lists, check out this great [StackOverflow post](http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists).
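
As a rough illustration of the difference in style, the sketch below doubles every value in a list with a comprehension and then does the same with one array expression. The names `py_list`, `np_arr`, `doubled_list` and `doubled_arr` are made up for this example, and timings are left out because they depend on the machine.

```python
import numpy as np

# Hypothetical sample data for illustration only
py_list = list(range(5))
np_arr = np.arange(5)

# A plain list needs an explicit loop (here, a comprehension)
doubled_list = [x * 2 for x in py_list]

# A NumPy array applies the operation to every element at once
doubled_arr = np_arr * 2

print(doubled_list)
print(doubled_arr)

# Careful: multiplying a list repeats it instead of doubling its values
repeated = py_list * 2
print(repeated)
```

The last line is a common surprise: `*` on a list means repetition, while on an array it means element-wise multiplication.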

### 2.1. Creating NumPy Arrays

#### From a Python List

We can create an array by directly converting a list or list of lists:

In [None]:
my_list = [1,2,3]
my_list

In [None]:
np.array(my_list)

In [None]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix

In [None]:
np.array(my_matrix)

#### Built-in Methods

There are lots of built-in ways to generate arrays.

### arange

Return evenly spaced values within a given interval. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html)]

In [None]:
np.arange(0,10)

In [None]:
np.arange(0,11,2)

### zeros and ones

Generate arrays of zeros or ones. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.zeros.html)]

In [None]:
np.zeros(3)

In [None]:
np.zeros((5,5))

In [None]:
np.ones(3)

In [None]:
np.ones((3,3))

### linspace 
Return evenly spaced numbers over a specified interval. [[reference](https://www.numpy.org/devdocs/reference/generated/numpy.linspace.html)]

In [None]:
np.linspace(0,10,3)

In [None]:
np.linspace(0,5,20)

<font color=green>Note that `.linspace()` *includes* the stop value. To obtain an array of common fractions, increase the number of items:</font>

In [None]:
np.linspace(0,5,21)
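
To make the contrast with `arange` concrete, here is a small sketch that builds the same quarter-step grid both ways. The variable names `a` and `b` are only for this example.

```python
import numpy as np

# arange takes a step size and excludes the stop value
a = np.arange(0, 5, 0.25)

# linspace takes a number of points and includes the stop value
b = np.linspace(0, 5, 21)

print(len(a), a[-1])
print(len(b), b[-1])
```

`linspace` returns one more point than `arange` here because it adds the endpoint 5.0.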

### eye

Creates an identity matrix [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.eye.html)]

In [None]:
np.eye(4)

## Random 
NumPy also has many ways to create arrays of random numbers:

### rand
Creates an array of the given shape and populates it with random samples from a uniform distribution over ``[0, 1)``. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.rand.html)]

In [None]:
np.random.rand(2)

In [None]:
np.random.rand(5,5)

### randn

Returns a sample (or samples) from the "standard normal" distribution (mean 0, σ = 1). Unlike **rand**, which is uniform, values closer to zero are more likely to appear. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randn.html)]

In [None]:
np.random.randn(2)

In [None]:
np.random.randn(5,5)

### randint
Returns random integers from `low` (inclusive) to `high` (exclusive).  [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html)]

In [None]:
np.random.randint(1,100)

In [None]:
np.random.randint(1,100,10)

### seed
Can be used to set the random state, so that the same "random" results can be reproduced. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html)]

In [None]:
np.random.seed(42)
np.random.rand(4)

In [None]:
np.random.seed(42)
np.random.rand(4)
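
NumPy 1.17 and later also provide a `Generator` object through `np.random.default_rng`, which keeps the random state in a local object rather than a global one. The sketch below is an optional alternative to `np.random.seed`, not what the rest of this notebook uses. The seed 42 and the names `rng_a`, `rng_b`, `first` and `second` are arbitrary.

```python
import numpy as np

# Two generators created with the same seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Each generator produces the same sequence of uniform samples in [0, 1)
first = rng_a.random(4)
second = rng_b.random(4)

print(first)
print(np.array_equal(first, second))
```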

## Array Attributes and Methods

Let's discuss some useful attributes and methods for an array:

In [None]:
arr = np.arange(25)
ranarr = np.random.randint(0,50,10)

In [None]:
arr

In [None]:
ranarr

## Reshape
Returns an array containing the same data with a new shape. [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.reshape.html)]

In [None]:
arr.reshape(5,5)

### max, min, argmax, argmin

These methods find the maximum or minimum value of an array. Use argmax or argmin to find the index location of that value instead.

In [None]:
ranarr

In [None]:
ranarr.max()

In [None]:
ranarr.argmax()

In [None]:
ranarr.min()

In [None]:
ranarr.argmin()

## Shape

Shape is an attribute that arrays have (not a method):  [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.shape.html)]

In [None]:
# Vector
arr.shape

In [None]:
# Notice the two sets of brackets
arr.reshape(1,25)

In [None]:
arr.reshape(1,25).shape

In [None]:
arr.reshape(25,1)

In [None]:
arr.reshape(25,1).shape
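
Two details about `reshape` are worth a quick sketch: passing `-1` lets NumPy infer one dimension, and the new shape has to hold exactly the same number of elements. The name `col` is used only in this example.

```python
import numpy as np

arr = np.arange(25)

# -1 asks NumPy to work out that dimension from the array size
col = arr.reshape(-1, 1)
print(col.shape)

# 4 * 6 = 24 does not match the 25 elements, so reshape raises ValueError
try:
    arr.reshape(4, 6)
except ValueError as err:
    print("reshape failed:", err)
```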

### dtype

You can also grab the data type of the object in the array: [[reference](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.dtype.html)]

In [None]:
arr.dtype

In [None]:
arr2 = np.array([1.2, 3.4, 5.6])
arr2.dtype

### 2.2. Indexing and Selection
In this lecture we will discuss how to select elements or groups of elements from an array.

In [None]:
#Creating sample array
arr = np.arange(0,11)
arr

## Bracket Indexing and Selection
The simplest way to pick one or more elements of an array looks very similar to indexing Python lists:

In [None]:
#Get a value at an index
arr[8]

In [None]:
#Get values in a range
arr[1:5]

In [None]:
#Get values in a range
arr[0:5]

## Broadcasting

NumPy arrays differ from normal Python lists because of their ability to broadcast. With a list, slice assignment only accepts another sequence: to replace the first 5 elements with a single value, you would have to pass in a new 5-element list filled with that value. With NumPy arrays, you can broadcast a single value across a larger set of values:

In [None]:
#Setting a value with index range (Broadcasting)
arr[0:5]=100

#Show
arr
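
For comparison, here is a sketch of the same assignment on a plain list next to the array version. The names `py_list` and `np_arr` are made up for this example.

```python
import numpy as np

py_list = list(range(11))

# A bare number is not iterable, so the list rejects it
try:
    py_list[0:5] = 100
except TypeError as err:
    print("list assignment failed:", err)

# The list needs an explicit 5-element replacement
py_list[0:5] = [100] * 5
print(py_list)

# The array broadcasts the scalar across the slice
np_arr = np.arange(0, 11)
np_arr[0:5] = 100
print(np_arr)
```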

In [None]:
# Reset array, we'll see why I had to reset in  a moment
arr = np.arange(0,11)

#Show
arr

In [None]:
#Important notes on Slices
slice_of_arr = arr[0:6]

#Show slice
slice_of_arr

In [None]:
#Change Slice
slice_of_arr[:]=99

#Show Slice again
slice_of_arr

Note that the changes also occur in our original array! A slice is a view of the original data, not a copy:

In [None]:
arr

In [None]:
#To get a copy, need to be explicit
arr_copy = arr.copy()

arr_copy
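
One way to check whether two arrays refer to the same underlying data is `np.shares_memory`. The sketch below uses the hypothetical names `base`, `view` and `independent`.

```python
import numpy as np

base = np.arange(0, 11)
view = base[0:6]                 # slicing returns a view onto the same data
independent = base[0:6].copy()   # .copy() allocates separate data

print(np.shares_memory(base, view))
print(np.shares_memory(base, independent))

view[:] = 99          # writes through to base
independent[:] = -1   # leaves base untouched
print(base)
```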

## Indexing a 2D array (matrices)

The general format is **arr_2d[row][col]** or **arr_2d[row,col]**. I recommend using the comma notation for clarity.

In [None]:
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))

#Show
arr_2d

In [None]:
#Indexing row
arr_2d[1]

In [None]:
# Format is arr_2d[row][col] or arr_2d[row,col]

# Getting individual element value
arr_2d[1][0]

In [None]:
# Getting individual element value
arr_2d[1,0]

In [None]:
# 2D array slicing

#Shape (2,2) from top right corner
arr_2d[:2,1:]

In [None]:
#Shape bottom row
arr_2d[2]

In [None]:
#Shape bottom row
arr_2d[2,:]
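
Slices can also take a step size, written `start:stop:step`, on each axis. The sketch below uses a hypothetical 4x4 array named `grid` so that the effect of the step is easy to see.

```python
import numpy as np

grid = np.arange(1, 17).reshape(4, 4)   # values 1..16 in a 4x4 layout

every_other_row = grid[::2]      # rows 0 and 2
corners = grid[::3, ::3]         # rows 0 and 3, columns 0 and 3
reversed_cols = grid[:, ::-1]    # every row, columns right to left

print(every_other_row)
print(corners)
print(reversed_cols)
```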

## More Indexing Help
Indexing a 2D matrix can be a bit confusing at first, especially when you start to add in step size. Try a Google image search for *NumPy indexing* to find useful images, like this one:

<img src= 'numpy_indexing.png' width=500/> Image source: http://www.scipy-lectures.org/intro/numpy/numpy.html

## Conditional Selection

This is a very fundamental concept that will directly translate to pandas later on. Make sure you understand this part!

Let's briefly go over how to use brackets for selection based off of comparison operators.

In [None]:
arr = np.arange(1,11)
arr

In [None]:
arr > 4

In [None]:
bool_arr = arr>4

In [None]:
bool_arr

In [None]:
arr[bool_arr]

In [None]:
arr[arr>2]

In [None]:
x = 2
arr[arr>x]
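
Conditions can be combined with `&` (and), `|` (or) and `~` (not). A minimal sketch, with result names chosen just for this example:

```python
import numpy as np

arr = np.arange(1, 11)

# Parentheses are required: & and | bind more tightly than comparisons
middle = arr[(arr > 3) & (arr < 8)]
ends = arr[(arr < 3) | (arr > 8)]
odd = arr[~(arr % 2 == 0)]

print(middle)
print(ends)
print(odd)
```

Python's `and` / `or` keywords do not work here, because they try to reduce a whole boolean array to a single True or False.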

### 2.3. Operations
## Arithmetic

You can easily perform *array with array* arithmetic, or *scalar with array* arithmetic. Let's see some examples:

In [None]:
arr = np.arange(0,10)
arr

In [None]:
arr + arr

In [None]:
arr * arr

In [None]:
arr - arr

In [None]:
# This will raise a Warning on division by zero, but not an error!
# It just fills the spot with nan
arr/arr

In [None]:
# Also a warning (but not an error) relating to infinity
1/arr

In [None]:
arr**3

## Universal Array Functions

NumPy comes with many [universal array functions](http://docs.scipy.org/doc/numpy/reference/ufuncs.html), or <em>ufuncs</em>, which are essentially just mathematical operations that can be applied across the array.<br>Let's show some common ones:

In [None]:
# Taking Square Roots
np.sqrt(arr)

In [None]:
# Calculating exponential (e^)
np.exp(arr)

In [None]:
# Trigonometric Functions like sine
np.sin(arr)

In [None]:
# Taking the natural logarithm (log(0) gives -inf with a warning, not an error)
np.log(arr)

## Summary Statistics on Arrays

NumPy also offers common summary statistics like <em>sum</em>, <em>mean</em> and <em>max</em>. You would call these as methods on an array.

In [None]:
arr = np.arange(0,10)
arr

In [None]:
arr.sum()

In [None]:
arr.mean()

In [None]:
arr.max()

## Axis Logic
When working with 2-dimensional arrays (matrices) we have to consider rows and columns. This becomes very important when we get to the section on pandas. In array terms, axis 0 (zero) is the vertical axis (rows), and axis 1 is the horizontal axis (columns). These values (0,1) correspond to the order in which <tt>arr.shape</tt> values are returned.

Let's see how this affects our summary statistic calculations from above.

In [None]:
arr_2d = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
arr_2d

In [None]:
arr_2d.sum(axis=0)

By passing in <tt>axis=0</tt>, we're returning an array of sums along the vertical axis, essentially <tt>[(1+5+9), (2+6+10), (3+7+11), (4+8+12)]</tt>

<img src='axis_logic.png' width=400/>

In [None]:
arr_2d.shape

This tells us that <tt>arr_2d</tt> has 3 rows and 4 columns.

In <tt>arr_2d.sum(axis=0)</tt> above, the first element in each row was summed, then the second element, and so forth.

So what should <tt>arr_2d.sum(axis=1)</tt> return?

In [None]:
# THINK ABOUT WHAT THIS WILL RETURN BEFORE RUNNING THE CELL!
arr_2d.sum(axis=1)
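
A helpful rule of thumb is that the axis you pass is the one that disappears from the shape. The sketch below checks this on a separate, hypothetical 2x3 array `m`, using `mean` rather than `sum`.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

col_means = m.mean(axis=0)  # collapses the 2 rows: one value per column
row_means = m.mean(axis=1)  # collapses the 3 columns: one value per row

print(col_means.shape, col_means)
print(row_means.shape, row_means)
```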

## 3. Pandas Crash Course
In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features. You should go through the notebooks in this order:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Operations
* Data Input and Output

### 3.1. Series
The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

## Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [None]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

### Using Lists

In [None]:
pd.Series(data=my_list)

In [None]:
pd.Series(data=my_list,index=labels)

In [None]:
pd.Series(my_list,labels)

### Using NumPy Arrays

In [None]:
pd.Series(arr)

In [None]:
pd.Series(arr,labels)

### Using Dictionaries

In [None]:
pd.Series(d)

### Data in a Series

A pandas Series can hold a variety of object types:

In [None]:
pd.Series(data=labels)

In [None]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

## Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information (works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let us create two Series, sales_Q1 and sales_Q2:

In [None]:
sales_Q1 = pd.Series(data=[250,450,200,150],index = ['USA', 'China','India', 'Brazil'])                                   

In [None]:
sales_Q1

In [None]:
sales_Q2 = pd.Series([260,500,210,100],index = ['USA', 'China','India', 'Japan'])                                   

In [None]:
sales_Q2

In [None]:
sales_Q1['USA']

In [None]:
# KEY ERROR!
# sales_Q1['Russia'] # wrong name!
# sales_Q1['USA '] # wrong string spacing!

In [None]:
# We'll explore how to deal with this later on!
sales_Q1 + sales_Q2
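
As a preview, one possible way to handle labels that appear in only one Series is the `add` method with `fill_value`. This is a sketch using the shorter, hypothetical names `q1` and `q2` for the same sales figures. Whether treating a missing quarter as 0 is appropriate depends on the data.

```python
import pandas as pd

q1 = pd.Series([250, 450, 200, 150], index=['USA', 'China', 'India', 'Brazil'])
q2 = pd.Series([260, 500, 210, 100], index=['USA', 'China', 'India', 'Japan'])

# With +, labels found in only one Series produce NaN
plain = q1 + q2

# With add(fill_value=0), the missing side is treated as 0
filled = q1.add(q2, fill_value=0)

print(plain)
print(filled)
```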

### 3.2. DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

In [None]:
columns= ['W', 'X', 'Y', 'Z'] # four columns
index= ['A', 'B', 'C', 'D', 'E'] # five rows

In [None]:
from numpy.random import randint
np.random.seed(42)
data = randint(-100,100,(5,4))

In [None]:
data

In [None]:
df = pd.DataFrame(data,index,columns)

In [None]:
df

# Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

# COLUMNS

## Grab a single column

In [None]:
df['W']

In [None]:
# Pass a list of column names
df[['W','Z']]

### Creating a new column:

In [None]:
df['new'] = df['W'] + df['Y']

In [None]:
df

## Removing Columns

In [None]:
# axis=1 because it's a column
df.drop('new',axis=1)

In [None]:
# Not inplace unless reassigned!
df

In [None]:
df = df.drop('new',axis=1)

In [None]:
df

## Working with Rows
### Selecting one row by name

In [None]:
df.loc['A']

### Selecting multiple rows by name

In [None]:
df.loc[['A','C']]

### Select single row by integer index location

In [None]:
df.iloc[0]

### Select multiple rows by integer index location

In [None]:
df.iloc[0:2]

### Remove row by name

In [None]:
df.drop('C',axis=0)

In [None]:
# NOT IN PLACE!
df 

### Selecting subset of rows and columns at same time

In [None]:
df.loc[['A','C'],['W','Y']]

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [None]:
df

In [None]:
df>0

In [None]:
df['X']>0

In [None]:
df[df['X']>0]

In [None]:
df[df['X']>0]['Y']

In [None]:
df[df['X']>0][['Y','Z']]

In [None]:
#For two conditions you can use | and & with parentheses:
df[(df['W']>0) & (df['Y'] > 1)]
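
The sketch below shows the `|` variant and a way to filter rows and pick a column in one step with `.loc`. It uses a small hypothetical frame called `demo`, so the result does not depend on the random data above.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({'W': [5, -3, 8], 'X': [1, 2, -7], 'Y': [-4, -6, 9]},
                    index=['A', 'B', 'C'])

# Rows where at least one condition holds
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]

# Filter the rows and select column 'Y' in a single .loc call
both_y = demo.loc[(demo['W'] > 0) & (demo['Y'] > 1), 'Y']

print(either)
print(both_y)
```

A single `.loc[rows, columns]` call avoids chaining two sets of brackets, which matters once you start assigning values to a selection.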

### More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [None]:
df

In [None]:
# Reset to default 0,1...n index
df.reset_index()

In [None]:
df

In [None]:
newind = 'CA NY WY OR CO'.split()

In [None]:
newind

In [None]:
df['States'] = newind

In [None]:
df

In [None]:
df.set_index('States')

In [None]:
df

In [None]:
df = df.set_index('States')

In [None]:
df

## DataFrame Summaries
There are a couple of ways to obtain summary data on DataFrames.<br>
<tt><strong>df.describe()</strong></tt> provides summary statistics on all numerical columns.<br>
<tt><strong>df.info()</strong></tt> and <tt><strong>df.dtypes</strong></tt> display the data type of each column.

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
df.info()

### 3.3. Missing Data

Let's show a few convenient methods to deal with Missing Data in pandas:

In [None]:
df = pd.DataFrame({'A':[1,2,np.nan,4],
                  'B':[5,np.nan,np.nan,8],
                  'C':[10,20,30,40]})

In [None]:
df

### Removing missing data

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

### Threshold (Require that many non-NA values.)

In [None]:
df.dropna(thresh=2)

### Filling in missing data

In [None]:
df.fillna(value='FILL VALUE')

In [None]:
df

In [None]:
df['A'].fillna(value=0)

In [None]:
df['A'].fillna(df['A'].mean())

In [None]:
df.fillna(df.mean())
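
`fillna` also accepts a dictionary, which lets each column get its own fill value. A minimal sketch on a copy of the same data, with the hypothetical names `df_na` and `filled`:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1, 2, np.nan, 4],
                      'B': [5, np.nan, np.nan, 8],
                      'C': [10, 20, 30, 40]})

# Keys are column names, values are the fill value for that column
filled = df_na.fillna({'A': 0, 'B': df_na['B'].mean()})

print(filled)
```

As with the other calls in this section, the original frame is left unchanged unless the result is assigned back.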

### 3.4. Groupby

The groupby method allows you to group rows of data together and call aggregate functions on each group.

In [None]:
# We will cover reading in data in a lot more detail in a later lecture!
df = pd.read_csv('Universities.csv')

In [None]:
# Show first N rows (N=5 by default)
df.head()

Now you can use the .groupby() method to group rows together based off of a **categorical** column. This column will then be reassigned to be the index.

Notice we have 2 steps:

1. Choose a categorical column to group by
2. Choose your aggregation function. Recall an aggregation function should take multiple values and return a single value (e.g. max, min, mean, std).

In [None]:
# Step 1 simply returns a special groupby object waiting to have an aggregate method called on it!
df.groupby('Year')

In [None]:
# numeric_only=True skips text columns, which raise a TypeError in pandas 2.0+
df.groupby('Year').mean(numeric_only=True)

In [None]:
df.groupby('Year').mean(numeric_only=True).sort_index(ascending=False)

-----
### Other Aggregate Functions
<table>
<tbody>
<tr><td><tt>count</tt></td><td>Number of non-null observations</td></tr>
<tr><td><tt>sum</tt></td><td>Sum of values</td></tr>
<tr><td><tt>mean</tt></td><td>Mean of values</td></tr>
<tr><td><tt>mad</tt></td><td>Mean absolute deviation</td></tr>
<tr><td><tt>median</tt></td><td>Arithmetic median of values</td></tr>
<tr><td><tt>min</tt></td><td>Minimum</td></tr>
<tr><td><tt>max</tt></td><td>Maximum</td></tr>
<tr><td><tt>mode</tt></td><td>Mode</td></tr>
<tr><td><tt>abs</tt></td><td>Absolute Value</td></tr>
<tr><td><tt>prod</tt></td><td>Product of values</td></tr>
<tr><td><tt>std</tt></td><td>Unbiased standard deviation</td></tr>
<tr><td><tt>var</tt></td><td>Unbiased variance</td></tr>
<tr><td><tt>sem</tt></td><td>Unbiased standard error of the mean</td></tr>
<tr><td><tt>skew</tt></td><td>Unbiased skewness (3rd moment)</td></tr>
<tr><td><tt>kurt</tt></td><td>Unbiased kurtosis (4th moment)</td></tr>
<tr><td><tt>quantile</tt></td><td>Sample quantile (value at %)</td></tr>
<tr><td><tt>cumsum</tt></td><td>Cumulative sum</td></tr>
<tr><td><tt>cumprod</tt></td><td>Cumulative product</td></tr>
<tr><td><tt>cummax</tt></td><td>Cumulative maximum</td></tr>
<tr><td><tt>cummin</tt></td><td>Cumulative minimum</td></tr>
</tbody>
</table>

### Grouping By multiple columns

In [None]:
df.head()

In [None]:
df.groupby(['Year','Sector']).mean(numeric_only=True)

In [None]:
df.groupby('Year').describe()

In [None]:
df.groupby('Year').describe().transpose()
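
Several aggregations can be requested at once with `agg`. The sketch below uses a tiny hypothetical frame `unis` as a stand-in for the course CSV. Its column names and values are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical stand-in data; not the contents of Universities.csv
unis = pd.DataFrame({'Year': [2012, 2012, 2013, 2013],
                     'Sector': ['Public', 'Private', 'Public', 'Public'],
                     'Completions': [100, 50, 120, 80]})

# Select one column, then compute two aggregates per group
by_year = unis.groupby('Year')['Completions'].agg(['sum', 'mean'])

print(by_year)
```

Selecting the numeric column before aggregating also avoids any trouble with text columns such as `Sector`.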

### 3.5. Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

In [None]:
df_one = pd.DataFrame({'k1':['A','A','B','B','C','C'],
                      'col1':[100,200,300,300,400,500],
                      'col2':['NY','CA','WA','WA','AK','NV']})

df_one

### Information on Unique Values

In [None]:
df_one['col2'].unique()

In [None]:
df_one['col2'].nunique()

In [None]:
df_one['col2'].value_counts()

In [None]:
df_one

In [None]:
df_one.drop_duplicates()

### Creating New Columns with Operations and Functions

We already know we can easily create new columns through basic arithmetic operations:

In [None]:
df_one

In [None]:
df_one['New Col'] = df_one['col1'] * 10
df_one

But we can also create new columns by applying any custom function we want. As you can imagine, the function can be as complex as we need, which gives us great flexibility.

Step 1: Define the function that will operate on every row entry in a column

In [None]:
def grab_first_letter(state):
    # Given a state, return the first letter
    return state[0]

grab_first_letter('NY')

In [None]:
# Notice we only pass the function, we don't call it with ()
df_one['col2'].apply(grab_first_letter)

In [None]:
df_one['first letter'] = df_one['col2'].apply(grab_first_letter)
df_one

These functions can be as complex as you want, as long as they can accept the items in each row. Watch out for data type issues!

In [None]:
def complex_letter(state):
    
    if state[0] == "W":
        return "Washington"
    else:
        return 'Error'

In [None]:
df_one['State Check'] = df_one['col2'].apply(complex_letter)

In [None]:
df_one
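
For simple string operations such as taking the first letter, pandas also offers vectorized string methods through the `.str` accessor, so a custom function is not always needed. A small sketch with a hypothetical Series `states`:

```python
import pandas as pd

states = pd.Series(['NY', 'CA', 'WA'])

# Same result two ways: a function applied per item, or the .str accessor
via_apply = states.apply(lambda s: s[0])
via_str = states.str[0]

print(via_apply.tolist())
print(via_str.tolist())
```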

### Mapping

In [None]:
df_one['k1']

In [None]:
df_one['k1'].map({'A':1,'B':2,'C':3})
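
One thing to watch with `map` and a dictionary: values that have no matching key become NaN. The sketch below uses a hypothetical Series `keys`, and the fallback value -1 is an arbitrary choice.

```python
import pandas as pd

keys = pd.Series(['A', 'B', 'C'])

# 'C' has no entry in the dictionary, so it maps to NaN
partial = keys.map({'A': 1, 'B': 2})

# Fill the gaps with a fallback value and restore the integer dtype
safe = partial.fillna(-1).astype(int)

print(partial)
print(safe)
```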

### Locating Index positions of max and min values

In [None]:
df_one

In [None]:
df_one['col1'].max()

In [None]:
df_one['col1'].min()

In [None]:
df_one['col1'].idxmin()

In [None]:
df_one['col1'].idxmax()

### Get column and index names:

In [None]:
df_one.columns

In [None]:
df_one.index

In [None]:
df_one.columns = ['C1','C2','C3','C4','C5','C6']

In [None]:
df_one

### Sorting and Ordering a DataFrame:

In [None]:
df_one

In [None]:
df_one.sort_values('C3')

### Concatenating DataFrames

In [None]:
features = pd.DataFrame({'A':[100,200,300,400,500],
                        'B':[12,13,14,15,16]})
predictions = pd.DataFrame({'pred':[0,1,1,0,1]})

In [None]:
features

In [None]:
predictions

In [None]:
# Pay careful attention to the axis parameter!
pd.concat([features,predictions])

In [None]:
pd.concat([features,predictions],axis=1)
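
The difference between the two calls is easiest to see from the resulting shapes. This sketch repeats the idea on smaller two-row frames, and the names `stacked` and `side_by_side` are made up for the example.

```python
import pandas as pd

features = pd.DataFrame({'A': [100, 200], 'B': [12, 13]})
predictions = pd.DataFrame({'pred': [0, 1]})

# Default axis=0 stacks rows; columns that do not match are filled with NaN
stacked = pd.concat([features, predictions])

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([features, predictions], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```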

## Creating Dummy Variables

In [None]:
df_one

In [None]:
df_one['C1']

In [None]:
pd.get_dummies(df_one['C1'])
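
Because the dummy columns always add up to one per row, one of them is redundant. `get_dummies` can drop the first category with `drop_first=True`, as sketched below on a hypothetical Series `k`.

```python
import pandas as pd

k = pd.Series(['A', 'A', 'B', 'C'])

full = pd.get_dummies(k)                      # one column per category
reduced = pd.get_dummies(k, drop_first=True)  # first category omitted

print(full.columns.tolist())
print(reduced.columns.tolist())
```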

### 3.6. Data Input and Output

This notebook is the reference code for data input and output. pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common file types:

## Check out the references here! 

**This is the best online resource for how to read/write to a variety of data sources!**

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

----
----
<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="12%" />
<col width="40%" />
<col width="24%" />
<col width="24%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Format Type</th>
<th class="head">Data Description</th>
<th class="head">Reader</th>
<th class="head">Writer</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></td>
<td><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></td>
<td><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td><a class="reference external" href="https://www.json.org/">JSON</a></td>
<td><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></td>
<td><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></td>
</tr>
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></td>
<td><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></td>
<td><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td>Local clipboard</td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></td>
<td><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></td>
<td><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="http://www.opendocumentformat.org">OpenDocument</a></td>
<td><a class="reference internal" href="#io-ods"><span class="std std-ref">read_excel</span></a></td>
<td>&#160;</td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://msgpack.org/index.html">Msgpack</a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">read_msgpack</span></a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">to_msgpack</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></td>
<td><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></td>
<td><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></td>
<td><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></td>
<td>&#160;</td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></td>
</tr>
<tr class="row-odd"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></td>
</tr>
<tr class="row-even"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google Big Query</a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">read_gbq</span></a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">to_gbq</span></a></td>
</tr>
</tbody>
</table>

### Understanding File Paths

If your .py file or .ipynb notebook is located in the **exact** same folder location as the .csv file you want to read, simply pass in the file name as a string, for example:
    
        df = pd.read_csv('some_file.csv')

#### Print your current directory file path with pwd

In [None]:
pwd

In [None]:
ls
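
If the file is not in the same folder, a path has to be passed instead of a bare file name. The sketch below builds one with the standard library's `pathlib`. The `data` subfolder is a hypothetical location, and `some_file.csv` is the placeholder name from above.

```python
from pathlib import Path

# Same information as pwd, but available in plain Python
cwd = Path.cwd()

# 'data' is a hypothetical subfolder used only for illustration
csv_path = cwd / 'data' / 'some_file.csv'

print(cwd)
print(csv_path.exists())   # check before handing the path to pd.read_csv
```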

### CSV Input

In [None]:
df = pd.read_csv('example.csv')
df

### CSV Output

In [None]:
df.to_csv('example.csv',index=False)
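
The `index=False` argument matters on the round trip. The sketch below writes a small hypothetical frame `small` to in-memory buffers instead of real files and reads it back both ways.

```python
import io
import pandas as pd

small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default: the row labels are written as an extra, unnamed first column
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
no_index = io.StringIO()
small.to_csv(no_index, index=False)
no_index.seek(0)
cols_without = pd.read_csv(no_index).columns.tolist()

print(cols_with)
print(cols_without)
```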

## 4. Matplotlib Basics

Here we cover the bare basics of matplotlib functionality, just enough to understand how Pandas plotting and Seaborn are built on top of Matplotlib. We will mainly use Pandas or Seaborn plotting throughout the course; here we show just the basic interactions possible with matplotlib. Do not consider this a comprehensive guide! For more information on matplotlib, visit: https://matplotlib.org/tutorials/index.html

----
---
### Visualizing Plots

In [None]:
import matplotlib.pyplot as plt

# JUPYTER NOTEBOOK ONLY
# %matplotlib inline
x = [0,1,2]
y = [100,200,300]
plt.plot(x,y)

In [None]:
# When running a .py file (outside Jupyter), add plt.show() after your plotting commands
plt.plot(x,y)
plt.show()

## Basic Tools

We will only use pure matplotlib for really quick, basic plots.

In [None]:
housing = pd.DataFrame({'rooms':[1,1,2,2,2,3,3,3],
                       'price':[100,120,190,200,230,310,330,305]})

housing

In [None]:
# Probably not a great plot, since this implies a continuous relationship!
# plt.plot(housing['rooms'],housing['price'])
plt.scatter(housing['rooms'],housing['price'])

## Style Calls

One of the main reasons to learn the absolute basics is to see how styling calls work in the API.

In [None]:
plt.plot(x,y,color='red',marker='o',markersize=20,linestyle='--')

# Axis and ticks
plt.xlim(0,2)
plt.ylim(100,300)


# Labeling
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label');
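
Matplotlib also has an object-oriented interface, in which the same styling is applied through methods on an `Axes` object. The sketch below repeats the plot above in that style. It is an alternative way of writing it, not something the rest of this notebook relies on.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2]
y = [100, 200, 300]

# plt.subplots() returns the figure and a single Axes to draw on
fig, ax = plt.subplots()
ax.plot(x, y, color='red', marker='o', linestyle='--')

# Axis limits and labels are set with set_* methods on the Axes
ax.set_xlim(0, 2)
ax.set_ylim(100, 300)
ax.set_title('Title')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
```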

## 5. Seaborn Basics

Here we will focus on some very basic Seaborn plots: count plots, box plots, scatter plots, and pairplots.

## The Data

Context

This database contains 14 attributes. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

Columns
* age: age in years
* sex: (1 = male; 0 = female)
* cp: chest pain type
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg: resting electrocardiographic results
* thalach: maximum heart rate achieved
* exang: exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
* target: 1 or 0

In [None]:
import seaborn as sns

In [None]:
df = pd.read_csv('heart.csv')

In [None]:
df.head()

#### Count Plot

In [None]:
sns.countplot(x='sex',data=df)

In [None]:
sns.countplot(x='target',data=df)

In [None]:
sns.countplot(x='cp',data=df)

In [None]:
sns.countplot(x='cp',data=df,hue='sex')

In [None]:
sns.countplot(x='cp',data=df,palette='terrain')

#### Box Plot

Box plots show the distributions across different categories.

<img src='boxplot.png' style="max-width:50%;"></img>



In [None]:
sns.boxplot(x='sex',y='age',data=df)

In [None]:
sns.boxplot(x='target',y='thalach',data=df)

In [None]:
sns.boxplot(x='target',y='thalach',data=df,hue='sex')

#### Scatter Plots

Scatter plots display the relationship between two continuous features.

https://seaborn.pydata.org/generated/seaborn.scatterplot.html

In [None]:
sns.scatterplot(x='chol',y='trestbps',data=df)

In [None]:
sns.scatterplot(x='chol',y='trestbps',data=df,hue='sex')

In [None]:
sns.scatterplot(x='chol',y='trestbps',data=df,hue='sex',palette='Dark2')

In [None]:
sns.scatterplot(x='chol',y='trestbps',data=df,hue='sex',size='age')

#### Pairplots

Pairplots draw a scatterplot for every pair of numeric columns in your data set, with a histogram of each column along the diagonal. Use with caution: for large datasets the plot can take a long time to draw, and the individual figures can become too small to read!

INFO: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
iris = pd.read_csv('iris.csv')
iris.head()

In [None]:
sns.pairplot(iris)

In [None]:
# Shows KDEs instead of histograms along the diagonal
sns.pairplot(iris, hue="species")