# Introduction to Python and Pandas
## DS3 2022 Workshop
**Keagan Benson and Radu Manea**

# Setting up Jupyter Notebook

# What you need to do: 
1). Download Anaconda<br>
> This will install Python, numpy, pandas, and allow you to use Jupyter notebooks.

2). Get a Kaggle Account <br>

3). Download the Pokemon Dataset.
> https://www.kaggle.com/abcsds/pokemon

# 1). Download the most recent version of Python for your OS
> https://www.python.org/downloads/

# 2). Download Anaconda (3.7)
> https://www.anaconda.com/distribution/

# 3). Open Jupyter Notebook

## There are two ways to do this

> Open the Anaconda Navigator app
>> Launch Jupyter Notebook
    >> - this will be opened in your browser

**OR**
> Open Terminal
>> type in "jupyter notebook"

# 4) Make a new Notebook

Click the "New" drop-down in the top right, next to Upload, then select Python 3

# Why Python is Used for Data Science
<ol>
<li>Easy to learn and use</li>
<li>Libraries and Frameworks</li>
<li>Well Supported</li>
<li>Monty Python</li>
</ol>

# What is Python?

# Python Programming Language
<ol>
<li>Interpreted</li>
<li>Object-oriented</li>
<li>High-level programming language</li>
<li>with dynamic semantics</li>
</ol>

# Interpreted

    Interpreted languages are not directly executed by the target machine.
    
    Different from compiled programming languages.
    
    Read and executed by another program, known as the *interpreter*.


# Object-Oriented

Python, like any other Object-Oriented Programming (OOP) language, is organized around objects.

Everything in Python is an object:

> <ol>
> <li>Lists</li>
> <li>Dictionaries</li>
> <li>Classes</li>
> <li>... and so forth</li>
> </ol>
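
As a small illustration of "everything is an object", the sketch below checks a few arbitrary values, including a built-in function and a class. The sample values are made up for this example.

```python
# Every value below, including a function (len) and a class (int),
# is an instance of `object`.
samples = [[1, 2], {"a": 1}, 3.14, "text", len, int]

all_are_objects = all(isinstance(item, object) for item in samples)
type_names = [type(item).__name__ for item in samples]

print(all_are_objects)
print(type_names)
```
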


# High-level programming language
The syntax of our code is easy for humans to interpret.
> Python is easily readable, and has very few syntax requirements compared to other programming languages.

If you have to display something on the screen, the built-in function for this is called print:
> *Typing print() to actually "print" something on the screen is pretty intuitive!*

In [1]:
print("Hello World!")

Hello World!


![Screen%20Shot%202022-11-02%20at%205.25.23%20PM.png](attachment:Screen%20Shot%202022-11-02%20at%205.25.23%20PM.png)

# Dynamic Semantics ~ typing
In a dynamically typed language, types belong to values rather than to variable names, and they are checked at run time.

The same name can therefore be bound to values of different types over its lifetime, unlike in a statically typed language.

> If we set a = 2 and then a = 'hello', the name a refers to the string as soon as the second line is executed

# Can't do this in Java
    
**int a = 2;** <br>
**a = "hello"** ERROR

We can in Python!

In [2]:
a = 2
print(a)
a = "hello"
print(a)

2
hello


# How do we write & run code in Python?

With a simple text editor
    <ol>
    <li> Write code in Python and save as [FILENAME].py </li>
    <li> Run $ python3 [FILENAME].py in your terminal </li>
    </ol>
In a work environment

# Some great Text Editors/IDEs
1). VSCode by Microsoft <br>
2). Sublime Text <br>
3). Atom from GitHub <br>
4). PyCharm <br>
5). VIM <br>

# Cells in Jupyter Notebook

## There are two types of Cells!

In [3]:
def freedomFunction():
    print("Python is really cool")

In [4]:
#Code Cell
freedomFunction()

Python is really cool


# Markdown Cell
We can type all kinds of wacky text
> And do all kinds of
**Crazy Formatting** <br>

*in these boxes*
<ol>
    <li>Item1</li>
    <li>Item2</li>
</ol>

# Importing Libraries
> import [MODULE] as [ALIAS] (the alias is an optional nickname for your own benefit) <br>
> from [MODULE] import [NAME]

In [5]:
#Basic Python Libraries; Feel free to add more!

import pandas as pd
import numpy as np
import matplotlib
from sklearn import preprocessing

# Python Syntax
1). NO MORE SEMICOLONS :) <br>
2). Whitespace in Python
> Since we're no longer using semicolons to terminate statements, a newline signifies the end of a statement. <br> 
> Whitespace and indentation matter!

3). Strings can be 'string' or "string"

In [6]:
def thisFunctionIsWrong():
return "it's gonna throw an error" #deliberately not indented

IndentationError: expected an indented block (3868164552.py, line 2)
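
For contrast, here is a sketch of the same idea with the function body indented. The function name and return string are made up for illustration.

```python
def this_function_is_right():
    # The body is indented one level (four spaces by convention) under the def line.
    return "no error this time"

result = this_function_is_right()
print(result)
```
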

# Python identifiers
<ol>
    <li>An identifier is a name used to identify a variable, function, class, module, or other object</li>
    <li>Starts with a letter (A-Z, a-z) or an underscore (_), followed by zero or more letters, underscores, and digits (0-9)</li>
    <li>Python does not allow punctuation characters such as @, $, and % within identifiers</li>
    <li>Case sensitive</li>
</ol>
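
One way to check these rules yourself is the string method `isidentifier()`, combined with the standard `keyword` module to rule out reserved words. The candidate names below are invented for this sketch.

```python
import keyword

candidates = ["total", "_hidden", "pokemon2", "2pokemon", "hit$points", "class"]

for name in candidates:
    # A usable name must follow the identifier rules and must not be a reserved word.
    is_valid = name.isidentifier() and not keyword.iskeyword(name)
    print(f"{name!r}: {is_valid}")

valid_names = [n for n in candidates if n.isidentifier() and not keyword.iskeyword(n)]
```

Note that `class` follows the character rules but is still rejected, because it is a reserved keyword.
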

# Python Variable Types
> <ol>
> <li>Numbers (integers & floats)</li>
> <li>Strings (this includes individual characters)</li>
> <li>Lists</li>
> <li>Tuples</li>
> <li>Dictionaries</li>
> </ol>

In [7]:
a = 2
print(type(a))
b = 2.0
print(type(b))
c = "never gonna give you up"
print(type(c))

<class 'int'>
<class 'float'>
<class 'str'>


# Creating List and List Operations in Python
Lists are great for storing stuff!
We can store anything in a list, including lists!

In [8]:
listOne = [1, 2, 3 ,4]
listOne

[1, 2, 3, 4]

In [9]:
listTwo = ["Bob", 1, 2.0, [4, 5, (1, 2)]]
listTwo

['Bob', 1, 2.0, [4, 5, (1, 2)]]

# List Operations

In [10]:
listThree = []
#to add stuff to a list, use the .append() function
listThree.append(1)
listThree.append("bad data")
listThree

[1, 'bad data']

In [11]:
#remove stuff from a list, use the .remove() function
listThree.remove("bad data")
listThree

[1]

# Lists are Zero Indexed!

In [12]:
#slicing through a list
listFour = [1, 2, 3, 4, 5]
print(listFour[0]) # get the first element
print(listFour[-1]) #get the last element
print(listFour[:3]) #get the first three items (indices 0-2); the stop index 3 is not included
print(listFour[1:]) #get all the items except the first
print(listFour[1:3]) #get item at index 1 up to index 3 not including 3

1
5
[1, 2, 3]
[2, 3, 4, 5]
[2, 3]


# Tuples, like lists but immutable

In [13]:
tupleOne = (1, 2, 3, 4)
#use parentheses
tupleOne

(1, 2, 3, 4)
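
To see what "immutable" means in practice, the sketch below tries to change a tuple element and catches the resulting error. The variable names are illustrative only.

```python
point = (1, 2, 3, 4)

try:
    point[0] = 10  # tuples do not support item assignment
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# To "change" a tuple, build a new one instead.
updated = (10,) + point[1:]
print(updated)
```
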

# Dictionaries

In [14]:
#like lists, but with key-value pairs
dictOne = {"Bob": 12, "Joe": 17, "Bill": 18}
dictOne

{'Bob': 12, 'Joe': 17, 'Bill': 18}

# The first item in each pair is the key, the second is the value

In [15]:
print(dictOne.keys()) #returns a view of the keys
print(dictOne.values()) #returns a view of the values
print(dictOne.items()) #returns a view of the key-value pairs as tuples

#NOTE: THESE ARE VIEW OBJECTS, SO CAST THEM INTO LISTS TO USE LIST OPERATIONS ON THEM
print(list(dictOne.keys()))

dict_keys(['Bob', 'Joe', 'Bill'])
dict_values([12, 17, 18])
dict_items([('Bob', 12), ('Joe', 17), ('Bill', 18)])
['Bob', 'Joe', 'Bill']


# Modifying Dictionary and List Values

In [16]:
listFive = [1, 2, 3, 4, 5]
dictTwo = {"Bob": 12, "Joe": 17, "Bill": 18}
listFive[0] = 10 #use the index to change the value
print(listFive[0])
dictTwo["Bob"] = 13 #use the key to change the value
print(dictTwo["Bob"])

#NOTE: THIS DOES NOT WORK FOR TUPLES; THEY ARE IMMUTABLE, SO ITEM ASSIGNMENT RAISES A TypeError

10
13


# Math in Python

In [17]:
a=4
b=2

In [18]:
a + b

6

In [19]:
a-b

2

In [20]:
a*b

8

In [21]:
a**b

16

In [22]:
a/b

2.0

In [23]:
a=7
a//b #floor/integer division

3

# Functions

In [24]:
def func1(var):
    var1 = 2
    var2 = 3
    return var + var1 + var2

var = 1
func1(var)

6

# Conditionals

In [25]:
#if (condition)
x = 5
if (x == 1):
    print('x is equal to 1')
elif (x == 5):
    print('x is equal to 5')
else:
    print('x is not equal to 1 or 5')

x is equal to 5


# Conditional Operators
!= means "does not equal" <br>
or, and, not
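
A short sketch of how these operators combine conditions. The values of `hp` and `pokemon_type` are arbitrary examples.

```python
hp = 45
pokemon_type = "Grass"

is_weak_grass = (hp < 50) and (pokemon_type == "Grass")                   # both must be True
is_fire_or_water = (pokemon_type == "Fire") or (pokemon_type == "Water")  # at least one must be True
is_not_fire = not (pokemon_type == "Fire")                                # flips True/False
same_as_above = pokemon_type != "Fire"                                    # != is the usual spelling

print(is_weak_grass, is_fire_or_water, is_not_fire, same_as_above)
```
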

# Loops

While we tend to avoid loops with tabular data, it is still good to know them!

In [26]:
#for loop
for i in range(10):
    print(i)
#i takes each value in range(10), i.e. 0 through 9
#range stops before the number passed in

0
1
2
3
4
5
6
7
8
9


# Range()

In [27]:
print(list(range(10)))
print(list(range(0, 10, 2))) #start, stop, step
print(list(range(10, -1, -1))) #decrement

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 2, 4, 6, 8]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


# Iterating through a list or string

We can use a loop if we just want to iterate through values in a string or list

In [28]:
testString = "Python"
for letter in testString:
    print(letter)

P
y
t
h
o
n


In [29]:
testList = ['P', 'y', 't', 'h', 'o','n']
for letter in testList:
    print(letter)

P
y
t
h
o
n


# Iterating through indices

In [30]:
testList = ['the', 'big', 'dog', 'wants', 'some', 'food']
for i in range(len(testList)):
    print(testList[i])

the
big
dog
wants
some
food


# List Comprehensions
These are very useful tools that let us build a list from a loop in a single line.
The syntax is as follows (a trailing "if [CONDITION]" is optional):

[ [EXPRESSION] for [ITEM] in [ITERABLE] ]

In [31]:
#example
kantoStarters = ["Charmander", "Bulbasaur", "Squirtle", "Pikachu"]

starters = [pokemon for pokemon in kantoStarters] #This returns a list of the items
print(starters)
#list comprehensions are useful for applying a function to a list

upperStarters = [pokemon.upper() for pokemon in kantoStarters]
print(upperStarters)

#or if we want to filter quickly
bestStarter = [pokemon for pokemon in kantoStarters if pokemon == "Charmander"] #conditional comes at the end
print(bestStarter)

['Charmander', 'Bulbasaur', 'Squirtle', 'Pikachu']
['CHARMANDER', 'BULBASAUR', 'SQUIRTLE', 'PIKACHU']
['Charmander']
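
It can help to see a comprehension next to the loop it replaces. The sketch below builds the same list both ways; the length cutoff of 7 is an arbitrary choice for illustration.

```python
kanto_starters = ["Charmander", "Bulbasaur", "Squirtle", "Pikachu"]

# Loop version: build the list step by step.
long_names_loop = []
for pokemon in kanto_starters:
    if len(pokemon) > 7:
        long_names_loop.append(pokemon.upper())

# Comprehension version: expression first, then the loop, then the filter.
long_names = [pokemon.upper() for pokemon in kanto_starters if len(pokemon) > 7]

print(long_names == long_names_loop)
print(long_names)
```
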


# Intro to Pandas
![pandaasheader.png](attachment:pandaasheader.png)

# What is pandas?
Not related to real Pandas, sadly. :( <br>

The name is derived from "Panel Data". <br>

Python Library used for Data Science! <br>

Good tool to use when you are working with a large dataset. <br>

Powerful in application, yet easy to use... with practice, of course!

# Why pandas?

Pandas uses fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive.

Very well documented API with detailed examples.

# Starting, at last!
Once you have your notebook and csv in the same directory, you're ready to start!

The first thing we want to make sure we have done is importing pandas and numpy, as they are crucial to any data task.

Let's take a look at what Pandas has to offer!

In [32]:
import pandas as pd
import numpy as np

# Reading
Read in your csv file using the following method
> For documentation please visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [33]:
df = pd.read_csv("Pokemon.csv")
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


# What is DF?
DF is short for DataFrame, the pandas object that holds tabular data (rows and labelled columns). <br>
We can change the name df to whatever we want. <br>

In [34]:
pokemon = pd.read_csv("Pokemon.csv")
pokemon.head() # pass a number to show more or fewer rows

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


## Heads or Tails!
You can see the top few rows of your DataFrame using DataFrame.head(). This avoids displaying more of a dataset than is currently needed, and is a nice way of confirming everything has loaded correctly.

You can also use .tail() to get the last five rows.

By adding an integer as an argument (e.g. .head(100)) you can see the first 100 rows!
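
If you do not have the CSV at hand, the same calls can be tried on a small made-up DataFrame. The `toy` table below is hypothetical and is not the Pokemon dataset.

```python
import pandas as pd

# A hypothetical 20-row table, just large enough to see head() and tail() differ.
toy = pd.DataFrame({
    "Name": [f"pokemon_{i}" for i in range(20)],
    "HP": list(range(20)),
})

first_rows = toy.head()    # default: first 5 rows
last_three = toy.tail(3)   # last 3 rows

print(first_rows.shape, last_three.shape)
print(last_three["HP"].tolist())
```
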

# Describing Data
You can use describe() to quickly gather some general insights on your dataset

In [35]:
pokemon.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,362.81375,435.1025,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,208.343798,119.96304,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,184.75,330.0,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,364.5,450.0,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,539.25,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,780.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0


**Keep in mind some of these values may not make sense or be too useful. For example, we don't really care about average pokemon generation or Pokedex entry (ID number or #)**

# Data Cleaning

Most important part of the ML process <br>

Select columns to work with <br>

Make sure no values are missing <br>

Add new columns with combinations of features

# Checking Data Types


Using .dtypes, we can check the type of each column in our DataFrame. This is important for cleaning the data!

In [36]:
pokemon.dtypes

#              int64
Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

# Dropping Columns
We can drop columns using **df.drop()!**

# Solution

In [37]:
#let's get rid of Generation, we don't need that right now
pokemon = df.drop('Generation', axis = 1)
pokemon.head(3)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,False
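
To drop several columns at once, one option is the `columns=` keyword of `drop`, which avoids having to remember `axis = 1`. The `toy` table below is a made-up stand-in for the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "HP": [45, 39],
    "Generation": [1, 1],
    "Legendary": [False, False],
})

# drop() returns a new DataFrame; the original is left unchanged.
trimmed = toy.drop(columns=["Generation", "Legendary"])

print(list(trimmed.columns))
print(list(toy.columns))
```
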


# Adding Columns
You may want a new column. Just pick a name and throw some data in.

In [38]:
import random

pokemon['New Pokedex ID'] = [random.randint(1, 10000) for i in range(len(df.Name))]
pokemon[['Name', "Type 1", "Type 2", "New Pokedex ID"]]

Unnamed: 0,Name,Type 1,Type 2,New Pokedex ID
0,Bulbasaur,Grass,Poison,1919
1,Ivysaur,Grass,Poison,3570
2,Venusaur,Grass,Poison,8741
3,VenusaurMega Venusaur,Grass,Poison,9893
4,Charmander,Fire,,4521
...,...,...,...,...
795,Diancie,Rock,Fairy,9678
796,DiancieMega Diancie,Rock,Fairy,3873
797,HoopaHoopa Confined,Psychic,Ghost,3703
798,HoopaHoopa Unbound,Psychic,Dark,5550


# Renaming
It's nice to be able to quickly read what column names are. The column "Sp. Atk" could be changed to "Sp. Attack".

**Task 2**: Rename "Sp. Atk" to "Sp. Attack"

# Solution

In [39]:
pokemon = pokemon.rename(columns={"Sp. Atk": "Sp. Attack"})
pokemon.head(3)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Attack,Sp. Def,Speed,Legendary,New Pokedex ID
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,False,1919
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,False,3570
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,False,8741


# Replacing Data

In [40]:
pokemon = pd.read_csv("Pokemon.csv")
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [41]:
pokemon.replace(1, "Kanto")
#intended to replace generation 1 with its region, but this replaces every 1 in the DataFrame (see the # column of row 0)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Kanto,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,Kanto,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,Kanto,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,Kanto,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,Kanto,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,Kanto,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
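
Calling `replace` on the whole DataFrame touches every column. A sketch of one way to limit the replacement to a single column is shown below, on a made-up `toy` table; the `regions` mapping and the new `Region` column are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "#": [1, 2, 152],
    "Name": ["Bulbasaur", "Ivysaur", "Chikorita"],
    "Generation": [1, 1, 2],
})

# Hypothetical mapping from generation number to region name.
regions = {1: "Kanto", 2: "Johto"}

# Calling replace on one column (a Series) leaves the other columns alone.
toy["Region"] = toy["Generation"].replace(regions)

print(toy)
```

Here the `#` column keeps its value of 1 for Bulbasaur, because only `Generation` was passed through `replace`.
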


# Null Values
How you deal with these is up to you, but make sure the choice fits the context of the data:

Drop them <br>
Fill them with the mean <br>
Fill them backwards or forwards <br>
Fill them with whatever you want <br>

## Null Values
Here's a bunch of null values <br>
How can we fill these?

In [42]:
col = pd.Series([np.nan]*10)
col

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN
dtype: float64

## Null values
Fill using anything!

In [43]:
col.fillna(1)

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
dtype: float64

## Null values
Fill using the mean.

In [44]:
col = pd.Series([2] + [np.nan] * 10 + [1])
col.fillna(col.mean())

0     2.0
1     1.5
2     1.5
3     1.5
4     1.5
5     1.5
6     1.5
7     1.5
8     1.5
9     1.5
10    1.5
11    1.0
dtype: float64

## Null values

Or get rid of them completely!

In [45]:
col = pd.Series([2] + [np.nan] * 10 + [1])
col

0     2.0
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    1.0
dtype: float64

In [46]:
col.dropna()

0     2.0
11    1.0
dtype: float64
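
Filling forwards or backwards, listed among the options earlier, can be sketched with `ffill()` and `bfill()`. The short Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

col = pd.Series([2, np.nan, np.nan, 1])

forward = col.ffill()    # each NaN takes the last valid value before it
backward = col.bfill()   # each NaN takes the next valid value after it

print(forward.tolist())
print(backward.tolist())
```
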

# Selecting Data
How can we select data from a DataFrame? There are a couple of different ways. <br>
We can pull data out by column name, by dot notation, or by the integer position of a column


# Examples

All three of these lines get the exact same thing: the 'Name' column of the DataFrame as a pandas Series.

In [47]:
pokemon['Name'].head()
pokemon.iloc[:, 1].head()
pokemon.Name.head()

0                Bulbasaur
1                  Ivysaur
2                 Venusaur
3    VenusaurMega Venusaur
4               Charmander
Name: Name, dtype: object

# Selection by Column
You can access a column by its name, using the same bracket syntax you would use to access an element of a list by its index

In [48]:
pokemon['Name'].head()

0                Bulbasaur
1                  Ivysaur
2                 Venusaur
3    VenusaurMega Venusaur
4               Charmander
Name: Name, dtype: object

# Selection by Column v2
You can select a column the same way as before, but with dot notation instead of
brackets ‘[ ]’: write the name of the column directly after the DataFrame. The two
are equivalent, so which one you use is personal preference. The main drawback of
dot notation is that it cannot access columns with spaces in the name.

In [49]:
df.Name.head()

0                Bulbasaur
1                  Ivysaur
2                 Venusaur
3    VenusaurMega Venusaur
4               Charmander
Name: Name, dtype: object
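
A minimal sketch of the difference, using a made-up two-row table: dot notation works for `Name`, while a column such as `Type 1` needs brackets.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type 1": ["Grass", "Fire"],
})

# Dot access and bracket access return the same Series.
same_column = toy.Name.equals(toy["Name"])

# `toy.Type 1` would be a SyntaxError because of the space, so brackets are required.
first_types = toy["Type 1"].tolist()

print(same_column)
print(first_types)
```
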

# Selection by index
Selection using iloc works by specifying the row position
as the first parameter, and the column position as the
second parameter. In this case, ‘:’ indicates that
we want every row, and 1 means that we want
the column at position 1 (the second column, Name).

More on iloc and loc in a second.

In [50]:
df.iloc[:, 1].head()

0                Bulbasaur
1                  Ivysaur
2                 Venusaur
3    VenusaurMega Venusaur
4               Charmander
Name: Name, dtype: object

# Selection by Index Example
How would you get the 3rd row (index 2) and get the Pokémon stored in that cell?

# Solution

In [51]:
pokemon.iloc[2, 1]

'Venusaur'

# .loc vs .iloc
Both .loc and .iloc can pull out specific rows

In [52]:
pokemon.loc[2]

#                    3
Name          Venusaur
Type 1           Grass
Type 2          Poison
Total              525
HP                  80
Attack              82
Defense             83
Sp. Atk            100
Sp. Def            100
Speed               80
Generation           1
Legendary        False
Name: 2, dtype: object

In [53]:
pokemon.iloc[2]

#                    3
Name          Venusaur
Type 1           Grass
Type 2          Poison
Total              525
HP                  80
Attack              82
Defense             83
Sp. Atk            100
Sp. Def            100
Speed               80
Generation           1
Legendary        False
Name: 2, dtype: object

# .loc vs .iloc
.loc indexes by label (such as a column name), .iloc indexes by integer position

In [54]:
pokemon.loc[:, "Name"].head()

0                Bulbasaur
1                  Ivysaur
2                 Venusaur
3    VenusaurMega Venusaur
4               Charmander
Name: Name, dtype: object

In [55]:
pokemon.iloc[:, 'Name'].head() #fails: .iloc does not accept string labels

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
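
With the default index (0, 1, 2, ...), row labels and row positions coincide, which is why `.loc[2]` and `.iloc[2]` returned the same row. The sketch below uses a made-up table with a hypothetical index of 10, 20, 30 to show where they differ.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Bulbasaur", "Ivysaur", "Venusaur"], "HP": [45, 60, 80]},
    index=[10, 20, 30],  # hypothetical labels, chosen so labels and positions differ
)

by_label = toy.loc[20, "Name"]   # row labelled 20, column labelled "Name"
by_position = toy.iloc[2, 0]     # third row, first column

# There is no row labelled 2, so .loc[2] raises a KeyError.
try:
    toy.loc[2]
    label_2_exists = True
except KeyError:
    label_2_exists = False

print(by_label, by_position, label_2_exists)
```
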

# DataFrame vs Series
You may have noticed when working with iloc and loc, we get a different
looking output than our DataFrame. These are pandas Series, which are
just the data type that a column (or row) is stored in.

A Series is a one-dimensional array containing information from a row or
column.
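
You can confirm which kind of object a selection returns with `type()`. The two-row `toy` table below is made up for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "Total": [318, 309]})

column = toy["Name"]     # single brackets: a Series
row = toy.iloc[0]        # one row: also a Series
table = toy[["Name"]]    # double brackets: a one-column DataFrame

print(type(column).__name__, type(row).__name__, type(table).__name__)
print(column.shape, table.shape)
```
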

# Selecting Multiple Columns
You can select multiple
columns by putting each
column name in a list.

In [56]:
pokemon[["Name", "Total"]]

Unnamed: 0,Name,Total
0,Bulbasaur,318
1,Ivysaur,405
2,Venusaur,525
3,VenusaurMega Venusaur,625
4,Charmander,309
...,...,...
795,Diancie,600
796,DiancieMega Diancie,700
797,HoopaHoopa Confined,600
798,HoopaHoopa Unbound,680


In [57]:
pokemon[["Name"]] #double brackets also work for a single column, and return a DataFrame instead of a Series

Unnamed: 0,Name
0,Bulbasaur
1,Ivysaur
2,Venusaur
3,VenusaurMega Venusaur
4,Charmander
...,...
795,Diancie
796,DiancieMega Diancie
797,HoopaHoopa Confined
798,HoopaHoopa Unbound


# Useful Pandas Functions

# Selecting Data - Conditionals
Use conditionals to select only specific data based on criteria.

In [58]:
pokemon[pokemon['Type 1'] == 'Water'].head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
9,7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
10,8,Wartortle,Water,,405,59,63,80,65,80,58,1,False
11,9,Blastoise,Water,,530,79,83,100,85,105,78,1,False
12,9,BlastoiseMega Blastoise,Water,,630,79,103,120,135,115,78,1,False
59,54,Psyduck,Water,,320,50,52,48,65,50,55,1,False


How does this work? <br>
Boolean Series!

In [59]:
pokemon['Type 1'] == 'Steel'

0      False
1      False
2      False
3      False
4      False
       ...  
795    False
796    False
797    False
798    False
799    False
Name: Type 1, Length: 800, dtype: bool

# Selecting Data - Multiple Conditionals
What if you want more than one condition to select data?

In [60]:
pokemon[(pokemon['Type 1'] == "Water") & (pokemon['Type 2'] == 'Ground')]

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
209,194,Wooper,Water,Ground,210,55,45,45,25,25,15,2,False
210,195,Quagsire,Water,Ground,430,95,85,85,65,65,35,2,False
281,259,Marshtomp,Water,Ground,405,70,85,70,60,70,50,3,False
282,260,Swampert,Water,Ground,535,100,110,90,85,90,60,3,False
283,260,SwampertMega Swampert,Water,Ground,635,100,150,110,95,110,70,3,False
371,339,Barboach,Water,Ground,288,50,48,43,46,41,60,3,False
372,340,Whiscash,Water,Ground,468,110,78,73,76,71,60,3,False
470,423,Gastrodon,Water,Ground,475,111,83,68,92,82,39,4,False
596,536,Palpitoad,Water,Ground,384,75,65,55,65,55,69,5,False
597,537,Seismitoad,Water,Ground,509,105,95,75,85,75,74,5,False
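
Besides `&` (and), boolean Series can be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses. The sketch below uses a small made-up table rather than the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Squirtle", "Wooper", "Charmander", "Bulbasaur"],
    "Type 1": ["Water", "Water", "Fire", "Grass"],
    "Type 2": [None, "Ground", None, "Poison"],
})

# Rows where Type 1 is Water OR Fire.
water_or_fire = toy[(toy["Type 1"] == "Water") | (toy["Type 1"] == "Fire")]

# Rows where Type 1 is NOT Water.
not_water = toy[~(toy["Type 1"] == "Water")]

print(water_or_fire["Name"].tolist())
print(not_water["Name"].tolist())
```
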


# Exercise
Pull out the names of all Pokemon whose first type is Fire and whose second type is Fighting.

# Solution

In [61]:
exercise = pokemon[(pokemon["Type 1"] == "Fire") & (pokemon["Type 2"] == "Fighting")]
exercise[["Name"]]

Unnamed: 0,Name
277,Combusken
278,Blaziken
279,BlazikenMega Blaziken
436,Monferno
437,Infernape
558,Pignite
559,Emboar


# Example
Pull out all the Fire-type pokemon whose health is below 100 HP

In [62]:
exercise = pokemon[(pokemon['HP'] < 100) & (pokemon['Type 1'] == 'Fire')]
exercise

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
8,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
42,37,Vulpix,Fire,,299,38,41,40,50,65,65,1,False
43,38,Ninetales,Fire,,505,73,76,75,81,100,100,1,False
63,58,Growlithe,Fire,,350,55,70,45,70,50,60,1,False
64,59,Arcanine,Fire,,555,90,110,80,100,80,95,1,False
83,77,Ponyta,Fire,,410,50,85,55,65,65,90,1,False


# Proportions
Do you want to know what proportion of all Pokemon have Poison as their first type? Great, because it's super easy to find out

# Counting and Dividing
You can count the rows of a DataFrame by using .shape[0]. We can
select the rows whose 'Type 1' is 'Poison' and count them,
then count the rows of the entire dataset.

From there we can calculate our
proportion.

In [63]:
poison = pokemon[pokemon['Type 1'] == 'Poison'].shape[0]
pokemon_total = pokemon.shape[0]
poison/pokemon_total

0.035
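
A possible shortcut: because `True` counts as 1 and `False` as 0, the mean of a boolean Series is the proportion of `True` values. The five-row `toy` table is made up for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({"Type 1": ["Poison", "Grass", "Fire", "Poison", "Water"]})

is_poison = toy["Type 1"] == "Poison"

by_counting = is_poison.sum() / len(toy)   # count of True values over total rows
by_mean = is_poison.mean()                 # same proportion in one step

print(by_counting, by_mean)
```
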

# Exercise 
What proportion of the Pokemon whose Type 1 is Fire have Fighting as their Type 2?

# Solution

In [64]:
fire = pokemon[pokemon["Type 1"] == 'Fire'].shape[0]
fighting_fire = pokemon[(pokemon["Type 1"] == 'Fire') & (pokemon["Type 2"] == 'Fighting')].shape[0]
fighting_fire/fire

0.1346153846153846

# Applying
What if you wanted to subtract a number from an entire column? What if
you wanted to use some specific function for every entry in a column?

That’s where apply comes in. You can pass a function into an apply call,
and it will perform that function on every element in the column.

# Subtract the Mean
We are now going to subtract the mean Attack from every value in the Attack column.
First we need to get the mean, then create a simple lambda function to apply!

# Code

In [65]:
mean = pokemon['Attack'].mean()
subtract_mean = lambda x: x - mean

pokemon['Attack'] = pokemon['Attack'].apply(subtract_mean)
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,-30.00125,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,-17.00125,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,2.99875,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,20.99875,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,-27.00125,43,60,50,65,1,False
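
For simple arithmetic like this, pandas can also operate on the whole column at once, without `apply`. The sketch below compares both approaches on a short made-up Series; `apply` remains useful when the per-element logic is more involved.

```python
import pandas as pd

attack = pd.Series([49, 62, 82, 100, 52], name="Attack")
mean = attack.mean()

with_apply = attack.apply(lambda x: x - mean)  # calls the function once per element
vectorized = attack - mean                     # subtracts from every element at once

print(with_apply.equals(vectorized))
print(vectorized.tolist())
```
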


# Lambda Functions
A lambda function is a small anonymous function.

A lambda function can take any number of arguments, but can only have one expression.

Think of it as a mini-function

In [66]:
lambda x: x + 10

#the x is the parameter
#this is equivalent to 

def doSomething(x):
    return x + 10
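
Lambdas are most often passed directly to another function rather than given a name. A small sketch using the built-in `sorted`; the (name, total) pairs are taken from the first rows of the dataset shown earlier.

```python
stats = [("Bulbasaur", 318), ("Charmander", 309), ("Venusaur", 525)]

# key= expects a function; the lambda picks the number out of each pair.
by_total = sorted(stats, key=lambda pair: pair[1])

add_ten = lambda x: x + 10   # naming a lambda works too, though def is usually clearer

print(by_total)
print(add_ten(5))
```
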

# Grouping
- You can take a DataFrame and group the entire thing based on the
unique values of a column, then aggregate each group however you like.

- Picking how you want to aggregate your information and setting a
function up can be the tricky part.

# Grouping 
Simple example: group based on the Pokemon type, sum together each
column of the groups.

In [67]:
pokemon.groupby('Type 1').sum(numeric_only=True) #sum only the numeric and boolean columns

Type 1,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
Bug,23080,26146,3925,-554.08625,4880,3717,4471,4256,222,0
Dark,14302,13818,2071,290.96125,2177,2314,2155,2361,125,2
Dragon,15180,17617,2666,1059.96,2764,3099,2843,2657,124,12
Electric,15994,19510,2631,-436.055,2917,3961,3243,3718,144,4
Fairy,7642,7024,1260,-297.02125,1117,1335,1440,826,70,1
Fighting,9824,11244,1886,479.96625,1780,1434,1747,1784,91,0
Fire,17025,23820,3635,299.935,3524,4627,3755,3871,167,5
Flying,2711,1940,283,-1.005,265,377,290,410,22,2
Ghost,15568,14066,2062,-167.04,2598,2539,2447,2059,134,2
Grass,24141,29480,4709,-405.0875,4956,5425,4930,4335,235,3


# Grouping
You can also group on multiple columns!

In [68]:
pokemon.groupby(['Type 1', 'Type 2']).sum(numeric_only=True)

Type 1,Type 2,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
Bug,Electric,1191,791,120,-34.00250,110,154,110,173,10,0
Bug,Fighting,428,1100,160,151.99750,190,80,200,160,4,0
Bug,Fire,1273,910,140,-13.00250,120,185,160,160,10,0
Bug,Flying,4008,5873,882,-124.01750,862,1020,967,1160,40,0
Bug,Ghost,292,236,1,10.99875,45,30,30,40,3,0
...,...,...,...,...,...,...,...,...,...,...,...
Water,Ice,309,1535,270,12.99625,340,240,235,200,3,0
Water,Poison,356,1280,185,-32.00375,175,185,275,255,4,0
Water,Psychic,559,2405,435,-30.00625,520,470,395,220,6,0
Water,Rock,1720,1715,283,14.99500,451,246,260,144,15,0


# Grouping - aggregate
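We can pass a function (or a function name) to .aggregate to summarize each group

In [69]:
#average the numeric and boolean columns within each HP group
pokemon.select_dtypes(include=["number", "bool"]).groupby('HP').aggregate("mean")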

HP,#,Total,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,292.000000,236.000000,10.998750,45.000000,30.000000,30.000000,40.000000,3.0,0.0
10,50.000000,265.000000,-24.001250,25.000000,35.000000,45.000000,95.000000,1.0,0.0
20,276.166667,285.833333,-55.667917,75.833333,28.333333,86.666667,51.666667,2.5,0.0
25,72.000000,317.500000,-51.501250,42.500000,100.000000,55.000000,67.500000,1.0,0.0
28,280.000000,198.000000,-54.001250,25.000000,45.000000,35.000000,40.000000,3.0,0.0
...,...,...,...,...,...,...,...,...,...
165,594.000000,470.000000,-4.001250,80.000000,40.000000,45.000000,65.000000,5.0,0.0
170,321.000000,500.000000,10.998750,45.000000,90.000000,45.000000,60.000000,3.0,0.0
190,202.000000,405.000000,-46.001250,58.000000,33.000000,58.000000,33.000000,2.0,0.0
250,113.000000,450.000000,-74.001250,5.000000,35.000000,105.000000,50.000000,1.0,0.0
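
As a sketch of aggregating with a function of your own, the example below computes the spread (max minus min) of HP within each type. The `toy` table and the helper `value_range` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Fire"],
    "HP": [45, 80, 39, 58, 78],
})

def value_range(values):
    # `values` is the Series of HP values belonging to one group.
    return values.max() - values.min()

hp_range = toy.groupby("Type 1")["HP"].aggregate(value_range)

print(hp_range)
```
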


# Merging, Joining, and Concatenating
If you have two tables that have matching information, you may want to merge them.

In [70]:
types = pokemon[['Type 1', 'Type 2']]
types.head()

Unnamed: 0,Type 1,Type 2
0,Grass,Poison
1,Grass,Poison
2,Grass,Poison
3,Grass,Poison
4,Fire,


In [71]:
name = pokemon[['Name']]
name.head()

Unnamed: 0,Name
0,Bulbasaur
1,Ivysaur
2,Venusaur
3,VenusaurMega Venusaur
4,Charmander


# Merging
Like this:

In [72]:
name.merge(types, left_index = True, right_index = True).head()

Unnamed: 0,Name,Type 1,Type 2
0,Bulbasaur,Grass,Poison
1,Ivysaur,Grass,Poison
2,Venusaur,Grass,Poison
3,VenusaurMega Venusaur,Grass,Poison
4,Charmander,Fire,
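
Tables are more commonly merged on a shared key column than on the index. The sketch below merges two made-up tables on `#`; the `how` argument controls which rows are kept.

```python
import pandas as pd

names = pd.DataFrame({"#": [1, 4, 7], "Name": ["Bulbasaur", "Charmander", "Squirtle"]})
types = pd.DataFrame({"#": [1, 4, 25], "Type 1": ["Grass", "Fire", "Electric"]})

inner = names.merge(types, on="#")              # default: keep only IDs found in both tables
left = names.merge(types, on="#", how="left")   # keep every row of `names`

print(inner)
print(left)
```

In the left merge, Squirtle has no match in `types`, so its `Type 1` is filled with NaN.
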


# Thank you for attending our workshop
## Good luck using Python and Pandas!
Radu Manea: <br/>
<b> https://www.linkedin.com/in/radumanea/ </b> <br />
Keagan Benson:<br />
<b>https://www.linkedin.com/in/keagan-benson-694395188/</b> <br />
DS3 Instagram <br/>
https://www.instagram.com/ds3.ucsd/
