<img src="support_files/cropped-SummerWorkshop_Header.png">  

<h1 align="center">Python Bootcamp</h1> 
<h3 align="center">August 24-25, 2019, Seattle, WA</h3> 

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<center><h1>Introduction to Pandas</h1></center>

<p>
    <code>pandas</code> is a library with high-level data structures and manipulation tools:
<p><ul> 
<li>Data loading/saving
<li>Data exploration
<li>Filtering, selecting
<li>Plotting/visualization
<li>Computing summary statistics
<li>Groupby operations
</ul>

<p>
    <b>DataFrame Object</b>
<ul>
<li>Represents a tabular, spreadsheet-like data structure
<li>Ordered collection of columns
<li>Each column can be a different value type (numeric, string, boolean, etc.)
</ul>
<p>This introduction will only just scratch the surface of Pandas functionality. For more information, check out the <a href="http://pandas.pydata.org/pandas-docs/stable/index.html">full documentation</a>.
<p>Or check out the <a href="http://pandas.pydata.org/pandas-docs/stable/10min.html">'10 minutes to Pandas'</a> tutorial here (note: title may mischaracterize time investment).
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Let's roll</h2>
<p>
</div>

In [None]:
from IPython.display import HTML
HTML("""<iframe src="https://giphy.com/embed/QoCoLo2opwUW4" width="480" height="278" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/panda-playing-QoCoLo2opwUW4">via GIPHY</a></p>""")

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Imports</h2>
<p>
</div>

In [None]:
# Convention for import naming
import pandas as pd

In [None]:
import numpy as np
import os
import matplotlib.pyplot as plt

%matplotlib inline

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Loading data</h2>
<p>Pandas has a lot of convenient built-in methods for reading data of various formats.
<p>Make and print a list of all of the Pandas methods with the word 'read' in them:
</div>

In [None]:
read_methods = [x for x in dir(pd) if 'read' in x]
for method in read_methods:
    print(method)


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Load tabular data from CSV file</h2>

<p>A simple CSV file is saved in the <code>support_files</code> directory, one level above the working directory. We'll take a minute to open the file and view it.
<p>Pandas can quickly load and display it. Note that it automatically parses the column names from the first row of the file.
</div>

In [None]:
df = pd.read_csv(os.path.join('..', 'support_files', 'pokemon_alopez247.csv'))
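
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>To see the header parsing in isolation, here is a minimal sketch that reads a tiny CSV from a string instead of from disk. The three rows and the name <code>toy_df</code> are made up for illustration and are not taken from the pokemon file.
</div>

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the first line is the header row.
csv_text = """Name,Type_1,Weight_kg
A,Grass,6.9
B,Fire,8.5
C,Water,9.0
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(list(toy_df.columns))  # column names come from the header row
print(toy_df.dtypes)         # each column gets its own inferred dtype
```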

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Use the <code>head()</code> and <code>tail()</code> methods to take a quick look at the data structure</h2>
<p>The <code>head()</code> method displays the first N rows, with N=5 by default

<p>The <code>tail()</code> method displays the last N rows, with N=5 by default
</div>

In [None]:
df.head()

In [None]:
df.tail(2)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Many familiar functions/methods work with DataFrames</h2>
<p>
</div>

In [None]:
# numpy function
np.shape(df)

In [None]:
# python built-in function
len(df)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>This is because a DataFrame behaves much like a numpy array with labelled columns and rows</h2>
<p>
</div>

In [None]:
# we can grab the underlying numpy array alone
df.values

In [None]:
# we can also get the columns
df.columns

In [None]:
# and the rows
df.index
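
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>One caveat to the numpy analogy: each column keeps its own type, so the array returned by <code>.values</code> depends on which columns you ask for. The sketch below uses a made-up two-row frame (<code>toy_df</code> is a hypothetical name) to show the difference.
</div>

```python
import pandas as pd

# Hypothetical frame mixing text, integer, and float columns.
toy_df = pd.DataFrame({
    'Name': ['A', 'B'],
    'Attack': [49, 52],
    'Weight_kg': [6.9, 8.5],
})

# Mixed column types fall back to a generic object array.
mixed = toy_df.values
print(mixed.dtype)    # object

# Numeric-only columns give a numeric array (ints are upcast to float here).
numeric = toy_df[['Attack', 'Weight_kg']].values
print(numeric.dtype)  # float64
```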

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
    <p><b>Exercise 6.1:</b>
<p>Identify another familiar function that works with the DataFrame
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Selecting columns</h2>

<p>Retrieve column based on column name.
<p>There are two notations that allow you to access data from a column:
<ul>
<li>bracket notation
<li>dot notation
</ul>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<p>Bracket notation:

</div>

In [None]:
attack = df['Attack']
attack.head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<p>Dot notation:
<p>note that this is sensitive to special characters in the variable name such as spaces, dashes, etc.

</div>

In [None]:
body_style = df.Body_Style
print(body_style.head())

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
The returned column is a Series object
</div>

In [None]:
print(type(body_style))
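
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>A small sketch of where dot notation falls short. The column names <code>Egg Group</code> (with a space) and <code>count</code> are invented for this example; they illustrate a name that is not a valid Python attribute and a name that collides with an existing DataFrame method.
</div>

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Body_Style': ['quadruped', 'bipedal_tailed'],
    'Egg Group': ['Monster', 'Dragon'],  # hypothetical name containing a space
    'count': [1, 2],                     # hypothetical name that shadows a method
})

# For a simple name, both notations return the same Series.
same = toy_df.Body_Style.equals(toy_df['Body_Style'])
print(same)

# A name with a space can only be reached with brackets.
egg = toy_df['Egg Group']

# Dot notation finds the DataFrame.count method, not the column.
count_attr = toy_df.count
count_col = toy_df['count']
print(callable(count_attr), type(count_col))
```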

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
    <p><b>Exercise 6.3:</b>
<ol>
<li>What data type are entries in the column "Catch_Rate"?
<li>What data type are entries in the column "Height_m"?
</ol>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>The Series object has a lot of useful built-in functions</h2>
<p>Start with <code>unique</code>
</div>

In [None]:
df['Color'].unique()

In [None]:
print("Pokemon types in this dataset:")
for line in df['Type_1'].unique():
    print("  ",line)

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.2:**
<ol>
<li> How many different Egg Groups exist in this dataset?
<li> How many different Body Styles exist in this dataset?
</ol>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Get values as numpy ndarray</h2>
<p>
</div>


In [None]:
weight = df['Weight_kg'].values
weight.shape

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<p>Print the type of <code>weight</code>:
</div>

In [None]:
print(type(weight))

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Plot the weights using Matplotlib</h2>
<p>We can use Matplotlib to plot the array that we just extracted from the dataframe:
</div>

In [None]:
# Plot array to inspect array
fig,ax = plt.subplots(1,1)
ax.plot(weight,'.')

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Plot the weights using the Pandas built-in plotting method</h2>
<p>Pandas also has a built-in plotting function that will allow us to make the plot directly from the dataframe
<p>It does some nice formatting for you, but you still have access to matplotlib methods
</div>

In [None]:
ax = df.plot(y='Weight_kg',marker='.',linestyle='none')

ax.set_title('Weights for all Pokemon')

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Plot the weights using Seaborn</h2>
<p>Seaborn is a data visualization package that operates on Pandas dataframes, similar to the built-in plotting methods. 
    It makes certain types of common visualizations very easy, including capturing multiple dimensions of data in the same plot.
    
(Bonus exercise:  can you make the marker color match the Color name in the example below?  Try looking at the <a href="https://seaborn.pydata.org/generated/seaborn.scatterplot.html">scatterplot documentation</a>)
</div>

In [None]:
import seaborn as sns

ax = sns.scatterplot(data=df, x=df.index, y='Weight_kg', hue='Color')

ax.set_title('Weights for all Pokemon')

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.4:**
<p>Retrieve the heights and make a plot of the relationship between weight and height.
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Select multiple columns</h2>
<p>We can make a new dataframe that contains only a subset of the column data from the first dataframe
</div>

In [None]:
# Use copy to get new DataFrame object instead of a 'view' on existing object
df2 = df[['Type_1','Type_2','Weight_kg']].copy()

In [None]:
df2.head(10)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Adding, deleting columns</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Let's add a column denoting whether the pokemon has a subtype.
    <p>Note that otherwise they have a <code>NaN</code> in the <code>Type_2</code> column
</div>

In [None]:
df2['Type_2'].head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Step 1:
<p>We can use the <code>isnull</code> method to find all of the entries with <code>NaN</code> or <code>None</code>
</div>

In [None]:
has_subtype = ~df2['Type_2'].isnull() #isnull() returns True if value is NaN or None. 
print(has_subtype.head())

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Step 2:
<p>We can create a new column and assign the <code>has_subtype</code> series that we just created to that column
</div>

In [None]:
df2['has_subtype'] = has_subtype

In [None]:
df2.head(5)
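
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>The two steps above can be tried on a tiny, made-up Series. This sketch also shows <code>notnull()</code>, which gives the same result as negating <code>isnull()</code> with <code>~</code>. The values and variable names are hypothetical.
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical column with two kinds of missing entries.
type_2 = pd.Series(['Poison', np.nan, None, 'Flying'])

# isnull() flags both NaN and None as missing.
missing = type_2.isnull()

# notnull() is the direct opposite, equivalent to ~type_2.isnull().
has_subtype_toy = type_2.notnull()

print(missing.tolist())
print(has_subtype_toy.tolist())
```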

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Delete column (note: inplace argument)</h2>
<p>
</div>

In [None]:
df2.drop('Type_2',axis=1,inplace=True)
# note: this would be the same as df2 = df2.drop('Type_2',axis=1)

In [None]:
df2.head(6)
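
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>To make the effect of <code>inplace</code> concrete, here is a minimal sketch on a made-up frame (<code>toy_df</code> is a hypothetical name). Without <code>inplace</code>, <code>drop</code> returns a new DataFrame and leaves the original alone; with <code>inplace=True</code>, it modifies the original and returns <code>None</code>.
</div>

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Type_1': ['Grass', 'Fire'],
    'Type_2': ['Poison', None],
    'Weight_kg': [6.9, 8.5],
})

# Default behaviour: a new DataFrame comes back, the original is untouched.
dropped = toy_df.drop(columns='Type_2')
columns_before = list(toy_df.columns)
print(columns_before)         # still includes 'Type_2'
print(list(dropped.columns))  # 'Type_2' is gone

# inplace=True changes toy_df itself and returns None.
result = toy_df.drop('Type_2', axis=1, inplace=True)
print(result, list(toy_df.columns))
```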

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Selecting rows and filtering</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    <b>Slice rows</b>
<p>We can use Numpy-like slicing to access particular rows
</div>

In [None]:
# this works...
df[150:190:10] # [start:end:step]

In [None]:
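# but it can be confusing, since brackets are also how we get columns. Better:
# (note: unlike the slice above, .loc includes the end label, so row 190 is returned too)
df.loc[150:190:10]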

In [None]:
# indices are maintained when you create a series object from a column
df['Name'][150:190:10]
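
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Slicing by position and slicing by label treat the end point differently. The sketch below uses a made-up ten-row frame with the default integer index to compare <code>.iloc</code> (positions, end excluded) with <code>.loc</code> (labels, end included). All names here are hypothetical.
</div>

```python
import pandas as pd

# Ten rows with the default integer index 0..9.
toy_df = pd.DataFrame({'Number': range(100, 110)})

# .iloc slices by position, like numpy: the end point is excluded.
by_position = toy_df.iloc[2:8:2]
print(list(by_position.index))

# .loc slices by label: the end label is included.
by_label = toy_df.loc[2:8:2]
print(list(by_label.index))
```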

In [None]:
# the index can be any unique set of values. Let's make the index the name of the pokemon

df.set_index('Name',inplace=True)

In [None]:
df.head()

In [None]:
# Now this breaks because our indices are no longer integers
try:
    df.loc[150:190:10]
except Exception as e:
    print(e)

In [None]:
# instead, we can use .loc to grab rows just like we grab columns
df.loc[['Bulbasaur','Ivysaur','Charmander']]
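
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>With a non-integer index, <code>.loc</code> selects by label while <code>.iloc</code> still selects by position. The following is a rough sketch on a made-up frame; the single-letter names stand in for real pokemon names. It also shows <code>reset_index()</code>, which turns the index back into an ordinary column.
</div>

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Attack': [1, 2, 3],
}).set_index('Name')

# A list of labels picks exactly those rows.
subset = toy_df.loc[['a', 'c']]

# A label slice includes both end points.
label_slice = toy_df.loc['a':'b']

# Positional access still works through .iloc.
first_two = toy_df.iloc[0:2]

# reset_index() restores 'Name' as a regular column.
restored = toy_df.reset_index()
print(list(restored.columns))
```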

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    <b>Select rows based on boolean array (very commonly used)</b>
<p>This is very powerful as it lets you slice the dataframe using logical conditions
<p>Let's go back to working with our full <code>df</code> for now
</div>

In [None]:
df.head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
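<p>We can create a boolean array based on our <code>hasGender</code> column. Because <code>True</code> counts as 1 and <code>False</code> as 0, its mean is the fraction of pokemon that have a gender.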
</div>

In [None]:
has_gender = df['hasGender']
print(has_gender.mean())

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>And if we apply that boolean array to the entire dataframe, we'll be left with only rows where the boolean array was <code>True</code>
</div>

In [None]:
df[has_gender].head(15)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    <b>Expression in brackets that yields boolean array</b>
<p>This can be done in one line by putting an expression into the brackets that will yield a boolean array
</div>

In [None]:
df[df['hasGender']==False].head(5)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>We can combine multiple logical statements using the <code>&amp;</code> (and) or <code>|</code> (or) operators. Each condition needs its own parentheses.
<p>For instance, let's find all of the non-gendered quadruped pokemon in our full dataframe:
</div>

In [None]:
df['Body_Style'].value_counts()

In [None]:
boolean_mask = (
    (df['hasGender']==False)
    & (df['Body_Style']=='quadruped')
)
df[boolean_mask]
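
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Here is a small sketch of the three element-wise operators (<code>&amp;</code>, <code>|</code>, and <code>~</code> for "not") on a made-up four-row frame. The rows and single-letter names are invented so that the result is easy to check by eye.
</div>

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'hasGender': [True, False, False, True],
    'Body_Style': ['quadruped', 'quadruped', 'head_only', 'bipedal_tailed'],
})

# ~ negates a boolean Series, an alternative to comparing with False.
no_gender = ~toy_df['hasGender']
quadruped = toy_df['Body_Style'] == 'quadruped'

# & keeps rows where both conditions hold.
both = toy_df[no_gender & quadruped]

# | keeps rows where at least one condition holds.
either = toy_df[no_gender | quadruped]

print(list(both['Name']))
print(list(either['Name']))
```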

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.5:**
<ol>
<li>Generate a new dataframe with only pokemon who can fly.
<li>How fast are flying pokemon?
</ol>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>More useful methods</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<code>isin()</code>
<p> Use <code>isin()</code> to find all pokemon that are either Pink or Purple
</div>

In [None]:
color_list = ['Purple','Pink']
purple_and_pink = df[df['Color'].isin(color_list)] #This is an alternative to using OR

print(
    'There are {} {} pokemon'.format(
        len(purple_and_pink),
        ' & '.join(color_list)
    )
)

In [None]:
purple_and_pink.head(6)
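
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>As a quick check that <code>isin()</code> really is a shorthand for chained OR conditions, this sketch compares the two masks on a made-up Series of colors. It also shows that the mask can be negated to exclude values.
</div>

```python
import pandas as pd

# Hypothetical color column.
colors = pd.Series(['Green', 'Pink', 'Purple', 'Red', 'Pink'], name='Color')

mask_isin = colors.isin(['Purple', 'Pink'])
mask_or = (colors == 'Purple') | (colors == 'Pink')

# Both approaches produce the same boolean mask.
print(mask_isin.equals(mask_or))

# Negate the mask to keep everything that is NOT purple or pink.
not_purple_or_pink = colors[~mask_isin]
print(not_purple_or_pink.tolist())
```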

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<code>value_counts()</code>
<p>This method returns an object containing counts of unique values, in descending order.
</div>

In [None]:
# Number of pokemon of each color, most common first
df['Color'].value_counts()
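
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>If you want proportions rather than raw counts, <code>value_counts()</code> accepts <code>normalize=True</code>. A minimal sketch on a made-up Series:
</div>

```python
import pandas as pd

# Hypothetical color column with six entries.
colors = pd.Series(['Blue', 'Blue', 'Blue', 'Red', 'Red', 'Green'])

# Raw counts, sorted from most to least common.
counts = colors.value_counts()
print(counts)

# normalize=True divides each count by the total number of entries.
fractions = colors.value_counts(normalize=True)
print(fractions)
```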

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Summary statistics</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Get summary statistics of a particular column
</div>

In [None]:
df['Weight_kg'].describe()
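
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>The result of <code>describe()</code> is itself a Series, so individual statistics can be pulled out by label. This sketch uses four made-up weights, one of them much larger than the rest, to show how the mean and the median (labelled <code>50%</code>) can disagree.
</div>

```python
import pandas as pd

# Hypothetical weights with one heavy outlier.
weights = pd.Series([6.9, 13.0, 100.0, 8.5])

summary = weights.describe()
print(summary)

# Individual entries are accessed by their label.
mean_weight = summary['mean']
median_weight = summary['50%']
print(mean_weight, median_weight)
```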

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Bar plot</h2>
<p>Use the built-in bar plot method
</div>

In [None]:
fig,ax=plt.subplots(figsize=(12,6))
df['Color'].value_counts().plot(kind='bar', ax=ax) # draw on the axes created above
ax.set_title("Popular Pokemon Colors");
ax.set_ylabel("# Pokemon");
fig.tight_layout() #this keeps the x-labels from getting cut off

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.6:**
<p>Make a bar plot of the top pokemon morphologies.
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Swarm plot</h2>
<p>For one more taste of the Seaborn capabilities, let's show a <code>swarmplot</code>, to visualize the distribution of a quantitative variable, weight, across different categories specified by a second variable (in this case color).
</div>

In [None]:
fig,ax=plt.subplots(figsize=(12,6))
sns.swarmplot(data=df, x='Color', y='Weight_kg')

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Groupby operations</h2>
<p>We're going to group by two characteristics, the body style and whether the pokemon has a gender, then find the mean of each numeric column in each group
</div>

In [None]:
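# numeric_only=True averages only the numeric columns; text columns such as Color cannot be averaged
grouped = df.groupby(['Body_Style','hasGender']).mean(numeric_only=True)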

columns_to_display = ['Total','Attack','Defense']

grouped[columns_to_display].head(20)
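
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>To see what a two-key groupby does, it helps to run it on a frame small enough to check by hand. The four rows below are made up. Selecting a single column before calling <code>mean()</code> avoids averaging columns you don't need, and <code>reset_index()</code> turns the grouped result back into a flat table.
</div>

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Body_Style': ['quadruped', 'quadruped', 'bipedal', 'bipedal'],
    'hasGender': [True, True, True, False],
    'Attack': [50, 70, 80, 100],
})

# One mean per (Body_Style, hasGender) combination that actually occurs.
grouped_toy = toy_df.groupby(['Body_Style', 'hasGender'])['Attack'].mean()
print(grouped_toy)

# Flatten the two-level index back into ordinary columns.
flat = grouped_toy.reset_index()
print(flat)
```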

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.7:**
<p>Use groupby to compute mean attack points for each color of pokemon.
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Making a DataFrame from scratch</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    <b>From an array</b>
</div>

In [None]:
data = np.random.rand(100,3)
columns = ['feature_1','feature_2','feature_3']
df_arr = pd.DataFrame(data,columns=columns)
df_arr.head()

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    <b>From a dictionary of lists</b>
</div>

In [None]:
name = ['Larry','Moe','Curly']
score = [1.,3.2,39.]

dict_of_lists = {
    'name': name,
    'score': score,
}

df_from_dict = pd.DataFrame(dict_of_lists)
df_from_dict

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    <b>From a list of dictionaries</b>
</div>

In [None]:
list_of_dicts = [
    {'name': 'Larry', 'score': 1.0},
    {'name': 'Moe', 'score': 3.2},
    {'name': 'Curly'},
]

df_from_list = pd.DataFrame(list_of_dicts)
df_from_list
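
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>In the list-of-dictionaries example, the last record has no <code>score</code> key, so pandas fills that cell with <code>NaN</code>. The sketch below counts the missing entries and shows one possible way to fill them; the fill value of 0.0 is an arbitrary choice for illustration.
</div>

```python
import pandas as pd

records = [
    {'name': 'Larry', 'score': 1.0},
    {'name': 'Moe', 'score': 3.2},
    {'name': 'Curly'},  # no 'score' key
]

toy_df = pd.DataFrame(records)

# The missing key becomes NaN in the 'score' column.
n_missing = int(toy_df['score'].isnull().sum())
print(n_missing)

# fillna returns a new frame; 0.0 is an arbitrary placeholder here.
filled = toy_df.fillna({'score': 0.0})
print(filled)
```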

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Saving (to_csv(), to_excel())</h2>
<p>
</div>

In [None]:
save_methods = [x for x in dir(df) if 'to_' in x]
print("save_methods:")
for method in save_methods:
    print(method)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Save to Excel
</div>

In [None]:
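# recent pandas versions can no longer write the legacy .xls format; .xlsx needs an Excel writer package such as openpyxl
df_arr.to_excel('random_df.xlsx')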

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Save to a csv
</div>

In [None]:
df_arr.to_csv('random_df.csv')
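
<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>By default <code>to_csv()</code> also writes the index as an unnamed first column, which can be surprising when the file is read back. The sketch below writes to a string instead of a file (calling <code>to_csv()</code> without a path returns the text) so the effect is easy to inspect. The frame and its values are made up.
</div>

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'feature_1': [0.1, 0.2], 'feature_2': [0.3, 0.4]})

# With no path, to_csv returns the CSV text.
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)
print(with_index.splitlines()[0])     # header starts with an empty index name
print(without_index.splitlines()[0])

# Reading the first version back naively creates an extra column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))

# index_col=0 tells read_csv to use that column as the index again.
round_trip = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(round_trip.columns))
```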

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.8:** 
<p>Some pokemon show a skew in their gender distribution, designated in the column <code>Pr_Male</code>. Is there a relationship between the color and gender tendencies?
</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

<b>Exercise 6.9:</b>
<p>The <code>Catch_Rate</code> column indicates the prevalence of different kinds of pokemon in the population, with pokemon with larger values getting caught more frequently. Use the pandas documentation or Stack Overflow to figure out how to sort a dataframe by the values in this column to see the most common pokemon.
</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.10:** 
<p>Explore the relationship between rarity and a skill of your choice.

<p> Choose one skill ('Attack','Defense' or 'Speed') and do the following.
<ol>
<li>Use the scipy package to assess whether Catch_Rate predicts the skill.
<li>Create a scatterplot to visualize how the skill depends upon the rarity of the pokemon.
<li>Overlay a best fit line onto the scatterplot.
</ol>
</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.11**:
<p>Explore the pokemon data some more and find something interesting!!</p>

<ul>
<li>Are pink pokemon faster/stronger/bigger than others?</li>
<li>Are there tradeoffs in skills?</li>
<li>What kind of gender stereotypes exist in Pokemon?</li>

</ul>

</div>