In [None]:
# import pandas under the alias pd: less typing, and pd is the standard convention
import pandas as pd

# pandas DataFrames
A [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) is a specialized 2-D data type with labeled rows and columns, much like a spreadsheet. DataFrames are the core data type of the pandas package.  
A dataframe has row labels, column labels, and values in each "cell".

In [None]:
row_l = ['me', 'you', 'him', 'her']
col_l = ['height', 'weight', 'eyecolor']
x = [[68, 155, 'brown'], [56, 200, 'blue'], [70, 170, 'black'], [50, 130, 'brown']]

simple_df = pd.DataFrame(data=x, index=row_l, columns=col_l)
print(simple_df)
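
To make the three parts of a DataFrame concrete, here is a minimal sketch that pulls the row labels, column labels, and a single cell back out. The `demo_df` name and its values are hypothetical toy data, not part of the notebook above.

```python
import pandas as pd

# hypothetical toy data with the same layout as simple_df
demo_df = pd.DataFrame(
    data=[[68, 155, 'brown'], [56, 200, 'blue']],
    index=['me', 'you'],
    columns=['height', 'weight', 'eyecolor'],
)

row_labels = list(demo_df.index)         # the row labels
col_labels = list(demo_df.columns)       # the column labels
n_rows, n_cols = demo_df.shape           # (number of rows, number of columns)
one_cell = demo_df.loc['you', 'weight']  # one "cell", looked up by row label and column label

print(row_labels)
print(col_labels)
print(n_rows, n_cols)
print(one_cell)
```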

Another way to make a DataFrame is with a dict as the argument. 

In [None]:
pd.DataFrame({'name': ['Sandra', 'Rajitha', 'John'], 'height': [54, 46, 70]})

*Notice how we don't need to `print` something if it's the last expression in a cell? The notebook displays its value automatically.*

Anyway, back to our `simple_df`. Normally, you leave `index` empty and pandas assigns row labels automatically, from 0 to n-1.

In [None]:
col_l = ['person'] + col_l  # add 'person' label to the beginning of the columns

for i, row in enumerate(x):  # add person values previously in row_l to x
    x[i] = [row_l[i]] + row

print('\nCOL_L')
print(col_l)
print('\nX')
print(x)

simple_df = pd.DataFrame(data=x, columns=col_l) # notice how we did not use index arg this time

print('\nSIMPLE_DF')
print(simple_df)
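
If you later want a column to act as the row labels again, you don't have to rebuild the DataFrame by hand. A minimal sketch, assuming a hypothetical two-row `demo_df`, using `set_index` and `reset_index`:

```python
import pandas as pd

# hypothetical toy data; no index argument, so pandas numbers the rows itself
demo_df = pd.DataFrame({'person': ['me', 'you'], 'height': [68, 56]})

default_labels = list(demo_df.index)     # automatic row labels: 0 and 1
by_person = demo_df.set_index('person')  # 'person' becomes the row labels
back_again = by_person.reset_index()     # 'person' is a regular column again

print(default_labels)
print(by_person)
print(back_again)
```

Both methods return a new DataFrame and leave the original untouched.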

pandas automatically infers a data type for each column. Each column can only have one data type.

In [None]:
# get the data type of each column
print(simple_df.dtypes)
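
Here is a small sketch of what "one data type per column" means in practice. The `mixed_df` frame is hypothetical: when a column mixes numbers and text, pandas falls back to the generic `object` type for the whole column.

```python
import pandas as pd

# hypothetical toy data: one numeric column and one column that mixes numbers and text
mixed_df = pd.DataFrame({'height': [68, 56, 70], 'note': [1, 'tall', 3.5]})

# 'height' is an integer column; 'note' falls back to the generic object dtype
print(mixed_df.dtypes)

# astype converts a column to another dtype and returns a new Series
height_as_float = mixed_df['height'].astype(float)
print(height_as_float.dtype)
```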

# Manipulating DataFrames
You can select a single column of a DataFrame by its label. pandas returns another type of data structure called a Series. Each column in a DataFrame is a Series.

In [None]:
heights = simple_df['height']
print(heights)
print()
print(type(heights))
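
Selecting more than one column works a little differently. As a sketch on hypothetical toy data: a single label gives back a Series, while a list of labels gives back a smaller DataFrame.

```python
import pandas as pd

# hypothetical toy data
demo_df = pd.DataFrame({'height': [68, 56], 'weight': [155, 200], 'eyecolor': ['brown', 'blue']})

one_col = demo_df['height']               # a single label gives a Series
two_cols = demo_df[['height', 'weight']]  # a list of labels gives a DataFrame

print(type(one_col))
print(type(two_cols))
```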

You can easily transpose your DataFrame.

In [None]:
print(simple_df.T)

pandas supports a DataFrame manipulation pattern called 'slicing' for selecting rows.

In [None]:
print(simple_df[:])  # selects all rows (rows are the first dimension)
print()
print(simple_df[:3]) # selects the rows from the start up to but not including row 3 (same as a [0:3] slice)
print()
print(simple_df[1:2]) # selects rows from 1 up to but not including row 2 (ie, just row 1)
print()
print(simple_df[::2]) # selects every 2nd row (0 and 2)
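
Plain `[]` slicing goes by row position here. pandas also has two explicit indexers, `iloc` (by position) and `loc` (by label). A minimal sketch on hypothetical toy data, mainly to show that label slices include their end point while position slices do not:

```python
import pandas as pd

# hypothetical toy data with named rows
demo_df = pd.DataFrame(
    {'height': [68, 56, 70, 50], 'weight': [155, 200, 170, 130]},
    index=['me', 'you', 'him', 'her'],
)

by_position = demo_df.iloc[1:3]      # rows at positions 1 and 2; the end position is excluded
by_label = demo_df.loc['you':'him']  # rows labelled 'you' through 'him'; the end label is included
single_row = demo_df.iloc[0]         # a single row comes back as a Series

print(by_position)
print(by_label)
print(single_row)
```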

In [None]:
# you can join values to a pandas dataframe along the row axis (0) or column axis (1)

print("\nJOINED ON AXIS=0")
print(pd.concat([simple_df, simple_df])) # axis=0 by default when no argument is passed
print("\nJOINED ON AXIS=1")
print(pd.concat([simple_df, simple_df], axis=1))
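
One thing to watch when joining along the rows: the original row labels are kept, so they can repeat. A small sketch with two hypothetical frames, `top` and `bottom`, showing how `ignore_index=True` renumbers the rows instead:

```python
import pandas as pd

# hypothetical toy data
top = pd.DataFrame({'height': [68, 56]})
bottom = pd.DataFrame({'height': [70, 50]})

stacked = pd.concat([top, bottom])                        # row labels repeat: 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True)  # row labels are rebuilt: 0, 1, 2, 3

print(stacked)
print(renumbered)
```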

You can use boolean logic to get parts of a DataFrame. What if we wanted to get a DataFrame back with only the brown-eyed people?

In [None]:
# when you select a column by name and compare it to a value, you get back a boolean for each element in that column
is_brown_eyed = simple_df['eyecolor'] == 'brown'
print(is_brown_eyed)

In [None]:
simple_df[is_brown_eyed]
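
Boolean filters can also be combined. A minimal sketch on hypothetical toy data: pandas uses `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`, and each comparison needs its own parentheses.

```python
import pandas as pd

# hypothetical toy data
demo_df = pd.DataFrame({
    'person': ['me', 'you', 'him', 'her'],
    'height': [68, 56, 70, 50],
    'eyecolor': ['brown', 'blue', 'black', 'brown'],
})

# & means "and", | means "or", ~ means "not"
tall_and_brown = demo_df[(demo_df['eyecolor'] == 'brown') & (demo_df['height'] > 60)]
blue_or_black = demo_df[(demo_df['eyecolor'] == 'blue') | (demo_df['eyecolor'] == 'black')]
not_brown = demo_df[~(demo_df['eyecolor'] == 'brown')]

print(tall_and_brown)
print(blue_or_black)
print(not_brown)
```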

# Importing csv files as DataFrames
Remember how I mentioned that DataFrames are very much like Excel sheets? We can in fact import CSV files, and even Excel files, into Python as DataFrames.

Now, let's import some real data from a CSV file using [`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). We've got a CSV file in our `assets/datasets` folder called `titanic.csv`.

In [None]:
df = pd.read_csv('../assets/datasets/titanic.csv')
print(len(df))

We should look at our data set's columns and maybe some of the data, but we probably don't want to print all 891 rows of our DataFrame. How do we preview it? We can use the `pandas.DataFrame.head` and `pandas.DataFrame.tail` methods.

In [None]:
# shows the first rows of the DataFrame. can take an int as an argument for the number of rows, 5 by default
df.head()

In [None]:
df.head(10)

In [None]:
# df.tail prints from the end in the same way
df.tail(3)

Looks like there are a lot of `NaN` (not a number) values in our cabin column. Let's get rid of any rows
that have NaN in them using [`pandas.DataFrame.dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)

In [None]:
clean_df = df.dropna()

print(len(clean_df), len(df)) # let's see how many rows were dropped
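
Before dropping anything, it can help to see where the missing values actually are. A minimal sketch on a hypothetical two-column frame (the column names only echo the Titanic data): `isna().sum()` counts the gaps per column, and the `subset` argument of `dropna` limits which columns are checked.

```python
import pandas as pd

# hypothetical toy data; None is stored as a missing value
demo_df = pd.DataFrame({
    'age': [22.0, None, 35.0],
    'cabin': [None, None, 'C85'],
})

nan_counts = demo_df.isna().sum()               # number of missing values in each column
drop_any = demo_df.dropna()                     # drops rows with a missing value in any column
drop_age_only = demo_df.dropna(subset=['age'])  # drops rows only when 'age' is missing

print(nan_counts)
print(len(drop_any), len(drop_age_only), len(demo_df))
```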

Yikes, we lost over 700 rows! Is there another way to get rid of the `NaN` values? Yes, there is! We can fill them in with [`pandas.DataFrame.fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).

In [None]:
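# backfill the gaps. df.bfill() is the current spelling of df.fillna(method='backfill'),
# which recent pandas versions deprecate
filled_df = df.bfill()

print(len(filled_df), len(df)) # let's confirm that no rows were dropped

OK, we kept all of our rows! But what did backfilling do? According to the docs, it fills each `NaN` with the next valid value further down the same column.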
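
Backfilling is easier to see on a tiny example. This is a sketch on a hypothetical Series, `gappy`, comparing `bfill` with its mirror image `ffill`:

```python
import pandas as pd

# hypothetical toy data with gaps at the start, in the middle, and at the end
gappy = pd.Series([None, 2.0, None, None, 5.0, None])

backfilled = gappy.bfill()     # each NaN takes the next valid value below it
forwardfilled = gappy.ffill()  # each NaN takes the last valid value above it

print(backfilled.tolist())
print(forwardfilled.tolist())
```

Note that backfilling cannot fill a gap at the very end of a column, because there is no later value to copy from.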

Actually, let's forget about cabin. In fact, let's get rid of some other uninteresting columns too. Let's drop `name`, `sibsp`, `ticket`, `fare`, `cabin`, and `embarked` using [`pandas.DataFrame.drop`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.drop.html).

In [None]:
df = filled_df # let's get our filled in values into our previously named df DataFrame
df = df.drop(labels=['name','sibsp', 'ticket', 'fare', 'cabin', 'embarked'], axis=1)
print(df.head())

Sometimes, for a categorical variable with more than 2 levels, you want a simple boolean/binary value for whether a data point belongs to each level or not. To do this, we can use [`pd.get_dummies`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.get_dummies.html) to create what are called dummy variables.

In [None]:
df = pd.get_dummies(data=df, columns=['pclass'])
print(df.head())
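
To see what the dummy columns look like without the rest of the Titanic data, here is a sketch on a hypothetical three-row frame. Each level of `pclass` becomes its own column, and every row has exactly one of them switched on.

```python
import pandas as pd

# hypothetical toy data
demo_df = pd.DataFrame({'sex': ['male', 'female', 'female'], 'pclass': [3, 1, 2]})

# 'pclass' is replaced by one column per level: pclass_1, pclass_2, pclass_3
dummies_df = pd.get_dummies(data=demo_df, columns=['pclass'])

print(list(dummies_df.columns))
print(dummies_df)
```

Depending on the pandas version, the new columns hold either `True`/`False` or `1`/`0`.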

Next, let's get some summary data on our Titanic data. Let's work out how many people of each sex survived, starting with [`pandas.DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html).


In [None]:
# just running groupby returns a pandas.core.groupby.DataFrameGroupBy object
# in order to get data out of it, you have to do something with the groups, such as 'count', 'mean', etc
# count gives the number of non-null values in each column, per group
df.groupby(['sex']).count()

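Now let's get just the number of survivors for each sex. `count` only tells us how many rows are in each group, so we sum the `survived` column instead (this assumes `survived` is coded 1 for survived and 0 otherwise).

In [None]:
df.groupby(['sex'])['survived'].sum()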
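
The difference between aggregations is easy to miss, so here is a small sketch on hypothetical toy data. It assumes, as is common for this data set, that `survived` is a 0/1 flag.

```python
import pandas as pd

# hypothetical toy data; survived is assumed to be 1 for survived and 0 otherwise
demo_df = pd.DataFrame({
    'sex': ['female', 'male', 'female', 'male', 'male'],
    'survived': [1, 0, 1, 1, 0],
})

grouped = demo_df.groupby('sex')['survived']
passengers = grouped.count()    # non-null rows per group
survivors = grouped.sum()       # total of the 0/1 flags per group
survival_rate = grouped.mean()  # share of each group that survived

print(passengers)
print(survivors)
print(survival_rate)
```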

Lastly, we can easily get descriptive statistics about our DataFrame:

In [None]:
df.describe()

# [Pandas Cookbook](https://github.com/jvns/pandas-cookbook#how-to-use-this-cookbook) - an excellent resource for getting further into pandas