# Python Data Science Prep Class - Intro to Pandas 
#### (JPW Lecture)

In [14]:
# import Numpy with name as 'np' by convention.
# import Pandas with name as 'pd' by convention.  
# We can now use 'np' and 'pd' to access all Numpy and Pandas methods and attributes, respectively.
import numpy as np
import pandas as pd

## Creating a New DataFrame
***
There are many ways we can create a DataFrame.

### 1. We can just pass a list. 
Note that the columns and rows (index) are not labeled; each simply gets a default integer position starting at 0. This is just a demonstration of how DataFrames parse incoming data, and you would rarely build one this way in practice.  

In [15]:
# Passing in a single list creates a single column.
one_col = [10, 20, 30, 40, 50]
pd.DataFrame(data=one_col)

Unnamed: 0,0
0,10
1,20
2,30
3,40
4,50


In [16]:
# Passing in a nested list (note double brackets) creates a single row
one_row = [[10, 20, 30, 40, 50]]
pd.DataFrame(data=one_row)

Unnamed: 0,0,1,2,3,4
0,10,20,30,40,50


In [17]:
# Passing in a list of nested lists creates as many rows as there are nested lists
three_rows = [[10, 20, 30, 40, 50], [11, 21, 31, 41, 51], [12, 22, 32, 42, 52]]
pd.DataFrame(data=three_rows)

Unnamed: 0,0,1,2,3,4
0,10,20,30,40,50
1,11,21,31,41,51
2,12,22,32,42,52
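
As a quick check on the three cases above, the `.shape` attribute reports (rows, columns) for each DataFrame. This is a minimal sketch that reuses the same lists; the variable names ending in `_df` are only for illustration.

```python
import pandas as pd

one_col = [10, 20, 30, 40, 50]
one_row = [[10, 20, 30, 40, 50]]
three_rows = [[10, 20, 30, 40, 50], [11, 21, 31, 41, 51], [12, 22, 32, 42, 52]]

one_col_df = pd.DataFrame(data=one_col)
one_row_df = pd.DataFrame(data=one_row)
three_rows_df = pd.DataFrame(data=three_rows)

# .shape is a (rows, columns) tuple
print(one_col_df.shape)     # (5, 1)
print(one_row_df.shape)     # (1, 5)
print(three_rows_df.shape)  # (3, 5)
```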


### 2. We can create a _skeleton_ DataFrame by specifying the column and/or row names and dimensions at creation.  
Note we don't have to actually pass in any data to create a DataFrame.  We can just specify the structure of its rows (index) x columns and it will be created blank.

In [18]:
# Create skeleton DF
pd.DataFrame(index=['row_1', 'row_2', 'row_3'], columns=['col_1', 'col_2', 'col_3'])

Unnamed: 0,col_1,col_2,col_3
row_1,NaN,NaN,NaN
row_2,NaN,NaN,NaN
row_3,NaN,NaN,NaN


When we run this, we get a 3x3 DataFrame as expected, but since we passed no actual data in we get `NaN` values.  
NaN stands for "__Not a Number__", and is equivalent to a null value.  

__Dense v. Sparse data__: To keep things simple here, a matrix in which many of the values in a given row or column are NaNs is called "sparse" (it can be more technical than this, but that's the gist you need to take home). If the DataFrame is mostly full, it is called "dense." In general, we want to work with "dense" data, as most machine learning algorithms either perform better with it or actually require it. Creating a skeleton DF can help with memory efficiency in certain use cases, but we will not encounter those today; I mention it here only so you are familiar with the terms and their meanings. We will see how to deal with such "missing" data further below.
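
One way to see how sparse a DataFrame is, sketched here under the assumption that "missing" means NaN: `isna()` marks each missing cell, and the mean of those marks is the share of the table that is empty. The names `skeleton`, `missing_flags`, and `missing_share` are made up for this example.

```python
import pandas as pd

skeleton = pd.DataFrame(index=['row_1', 'row_2', 'row_3'],
                        columns=['col_1', 'col_2', 'col_3'])

# isna() returns a same-shaped DataFrame of True/False flags
missing_flags = skeleton.isna()

# Averaging the flags gives the fraction of cells that are missing
missing_share = missing_flags.to_numpy().mean()
print(missing_share)  # 1.0 for the all-NaN skeleton
```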

Normally we won't want an empty DataFrame, however.  So one way to fill it with data is to use the `data=` argument when we create it, like so:
### 3. We can create filled DF with same skeleton as \#2

In [19]:
# Use same skeleton as above, but also give data to fill the DF with
data = np.arange(1, 10).reshape(3,3)
pd.DataFrame(data=data, index=['row_1', 'row_2', 'row_3'], columns=['col_1', 'col_2', 'col_3'])

Unnamed: 0,col_1,col_2,col_3
row_1,1,2,3
row_2,4,5,6
row_3,7,8,9


Here we use Numpy to create the data with `np.arange`, which stands for "array range" and creates an _array_ (Numpy's advanced version of a list) from the range you specify. For these arguments it holds the same values as `list(range(1, 10))`.

In [20]:
np.arange(1, 10)    # returns a 1-dimensional array (list)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [21]:
list(range(1, 10))   # also returns a 1-dimensional list
# Note if you are on Python 2.x instead of 3.x, you don't have to use the enclosing 'list()' func.

[1, 2, 3, 4, 5, 6, 7, 8, 9]

There are two primary differences between Numpy's advanced `arange()` function (and its cousin the `linspace()` function) and the standard `range()`.

1. `np.arange()` accepts any number (not just an `int`) for its start, stop, and increment, while `range()` accepts only integers.  
    + `np.arange(.5, 1, .1)` will return `array([0.5, 0.6, 0.7, 0.8, 0.9])`, for example.  
    
2. The array that `np.arange()` returns can be reshaped using Numpy's `.reshape()` method for the array class.
    + `.reshape(r, c)`, where `r` = rows, `c` = cols.  
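
Both points can be tried directly. The sketch below is only an illustration; `np.linspace` is included because it takes a count of points instead of a step, which is often the safer choice with decimal steps. The variable names are made up for this example.

```python
import numpy as np

# Point 1: non-integer start, stop, and step
halves = np.arange(.5, 1, .1)
print(halves)            # five values from 0.5 up to (not including) 1

# range() rejects the same arguments
try:
    range(.5, 1, .1)
    range_failed = False
except TypeError as err:
    range_failed = True
    print("range() failed:", err)

# linspace takes the number of points and includes the endpoint by default
points = np.linspace(0, 1, 5)
print(points)

# Point 2: reshape a flat array of 9 values into 3 rows x 3 cols
grid = np.arange(1, 10).reshape(3, 3)
print(grid.shape)        # (3, 3)
```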
    
We need \#2 here to make our data fit the DF structure we have just created.  Let's see what happens when we try to pass in the exact same data, a list from 1 to 9, without changing its shape first.

In [22]:
# No reshaping of the data this time....
data = np.arange(1, 10)   
pd.DataFrame(data=data, index=['row_1', 'row_2', 'row_3'], columns=['col_1', 'col_2', 'col_3'])

ValueError: Shape of passed values is (1, 9), indices imply (3, 3)

Uh oh! We got "`ValueError: Shape of passed values is (1, 9), indices imply (3, 3)`"  Pandas is telling us, "hey, the DataFrame you created implies a shape of 3x3, but you gave me a 1x9 set of data.  Not cool. Never program again!"

Okay.  I made up the last part about never programming again, but Pandas absolutely said the other part.
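
A minimal sketch of the same failure and its fix, assuming we want to catch the error rather than let the cell stop. The exact wording of the message can differ between Pandas versions, so the code only relies on the exception type. The names `rows`, `cols`, `flat`, and `fixed` are just for illustration.

```python
import numpy as np
import pandas as pd

rows = ['row_1', 'row_2', 'row_3']
cols = ['col_1', 'col_2', 'col_3']
flat = np.arange(1, 10)          # 9 values in one dimension

try:
    pd.DataFrame(data=flat, index=rows, columns=cols)
    failed = False
except ValueError as err:
    failed = True
    print("Pandas refused:", err)

# Reshaping to 3 rows x 3 cols makes the data match the labels
fixed = pd.DataFrame(data=flat.reshape(3, 3), index=rows, columns=cols)
print(fixed.shape)               # (3, 3)
```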

Just a reminder, you can simply pass in the 3x3 data argument without naming the rows or columns and Pandas will create a 3x3 DF without named rows or columns.
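
For instance, a sketch of that unlabeled case: with no `index` or `columns` given, Pandas falls back to integer labels starting at 0. The name `unlabeled` is only for this example.

```python
import numpy as np
import pandas as pd

unlabeled = pd.DataFrame(data=np.arange(1, 10).reshape(3, 3))

print(list(unlabeled.index))    # [0, 1, 2]
print(list(unlabeled.columns))  # [0, 1, 2]
```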

There are a couple more prominent ways to create a DataFrame.
### 4. Create a DataFrame by using a dictionary as the data

In [None]:
data_dict = {'Col_A': 11, 'Col_B': 22, 'Col_C': 33}
pd.DataFrame(data=data_dict, index=range(3))

Note how we set the values for one row of data and then repeated that row three times by passing in the `index=range(3)` argument. Pandas requires an `index` when every value in the dictionary is a single (scalar) value. (Delete the `index=range(3)` argument and run the cell above, and you'll be scolded.)
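
Here is a sketch of that scolding, caught with `try`/`except` so the rest of the cell keeps running. The message text may vary by Pandas version, so only the exception type is relied on; `scolded` and `repeated` are names invented for this example.

```python
import pandas as pd

data_dict = {'Col_A': 11, 'Col_B': 22, 'Col_C': 33}

try:
    pd.DataFrame(data=data_dict)      # all scalar values, no index
    scolded = False
except ValueError as err:
    scolded = True
    print("Pandas says:", err)

# With an index, each scalar is repeated once per row
repeated = pd.DataFrame(data=data_dict, index=range(3))
print(repeated.shape)                 # (3, 3)
```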

One way around this, and an approach which provides greater flexibility going forward, is to make the __values__ of the dictionary into a __list__.  Then, when Pandas reads this as data, it knows there are only as many rows as there are members in the list.

In [None]:
data_dict_list = {'Col_A': [11], 'Col_B': [22], 'Col_C': [33]}
pd.DataFrame(data=data_dict_list)

Boom.  Same dictionary as before but the values are now in a list and Pandas knows exactly how many rows to make.  One catch is that all lists must be the same length in the dictionary.  For one last point let's extend the lists in this dictionary to illustrate how easy it is to make a DataFrame from a dict.

In [None]:
extended_dict_list = {'Col_A': [11, 101, 1001, 10001], 'Col_B': [22, 202, 2002, 20002], 'Col_C': [33, 303, 3003, 30003]}
pd.DataFrame(data=extended_dict_list)
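
The same-length catch mentioned above can be sketched as follows. `uneven_dict` is a made-up example with one short list; Pandas raises a `ValueError` rather than guessing how to pad it.

```python
import pandas as pd

# Hypothetical data: Col_B has one fewer value than Col_A
uneven_dict = {'Col_A': [11, 101, 1001], 'Col_B': [22, 202]}

try:
    pd.DataFrame(data=uneven_dict)
    rejected = False
except ValueError as err:
    rejected = True
    print("Pandas says:", err)
```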

In general, using dictionaries or even lists of dictionaries tends to be more flexible than lists or lists of lists.

In [None]:
list_of_dicts = [{'Col_A': 'Ohhhhh'}, {'Col_B': 'Mahhhhh'}, {'Col_C': 'Gawddddd'}]
pd.DataFrame(data=list_of_dicts)
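
What that cell produces is worth spelling out. In this sketch, each dictionary becomes one row, the union of all keys becomes the columns, and any key a row lacks is filled with `NaN`. The name `df_from_dicts` is only for illustration.

```python
import pandas as pd

list_of_dicts = [{'Col_A': 'Ohhhhh'}, {'Col_B': 'Mahhhhh'}, {'Col_C': 'Gawddddd'}]
df_from_dicts = pd.DataFrame(data=list_of_dicts)

print(df_from_dicts.shape)                     # (3, 3)
print(list(df_from_dicts.columns))             # ['Col_A', 'Col_B', 'Col_C']
print(int(df_from_dicts.isna().sum().sum()))   # 6 missing cells
```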

## Inspecting the Data
***
Now that we know how to make a DataFrame from scratch, let's load one from an existing data file so we can get this party started. By convention, a loaded DataFrame is usually called __df__, with variations on that name as needed. Pandas can load many data formats, but the most prevalent are likely spreadsheet-style files, such as Excel's `.xlsx` format or `.csv` files.
If possible, choose the `.csv` file to import into a DataFrame: `.xlsx` files carry a lot of overhead that Pandas has to strip away, so large `.csv` files parse and load much faster.

We will now import a `.csv` file into a new DataFrame using the `read_csv()` function. There are many options for how you import a `.csv` file (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)), most of which depend on the data itself, but we will stick to a vanilla import here.

In [23]:
df = pd.read_csv('pandas_dating_demo_df.csv')

In [24]:
df.head()

Unnamed: 0,Sex,Name,Age,Height(in.),Attraction,Hair,Intellectual_Connection,Humor,Attitude,Wine,Politics,Income,Divorced,Kids,Second_Date,Like_This_Person?
0,F,Cara,34,61,3.0,Brunette,10.0,10.0,Positive,Red,Independent,High,Yes,No,Yes,Yes
1,F,Sara,26,63,7.0,Blonde,4.0,3.0,Neutral,White,Left,Low,No,Yes,No,No
2,F,Heather 1,36,62,8.5,Blonde,2.0,3.5,Negative,Red,Right,Low,Yes,Yes,Yes,No
3,F,Jennifer,22,66,6.5,Blonde,7.5,6.5,Complainer,Red,Right,Low,No,No,Yes,Yes
4,F,Katie 1,21,65,7.5,Brunette,8.0,7.5,Complainer,Red,Left,Medium,No,No,Yes,Yes
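
To try the import and inspection steps without the course file, the sketch below reads a small in-memory CSV built from the first three rows and a few of the columns shown above. `io.StringIO` stands in for the file on disk, and `usecols` is one of the many `read_csv` options; the names `sample_csv`, `sample_df`, and `names_ages` are assumptions for this example.

```python
import io
import pandas as pd

# A few columns and rows copied from the output above
sample_csv = """Sex,Name,Age,Height(in.),Attraction,Hair
F,Cara,34,61,3.0,Brunette
F,Sara,26,63,7.0,Blonde
F,Heather 1,36,62,8.5,Blonde
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(sample_csv))
print(sample_df.head(2))         # first two rows
print(sample_df.shape)           # (3, 6)

# usecols loads only the named columns
names_ages = pd.read_csv(io.StringIO(sample_csv), usecols=['Name', 'Age'])
print(list(names_ages.columns))  # ['Name', 'Age']
```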
