# Introduction to Python and Pandas
## DS3 2022 Workshop
**Keagan Benson and Radu Manea**

# Setting up Jupyter Notebook

# What you need to do: 
1). Download Anaconda<br>
> This will install Python, numpy, pandas, and allow you to use Jupyter notebooks.

2). Get a Kaggle Account <br>

3). Download the Pokemon Dataset.
> https://www.kaggle.com/abcsds/pokemon

# 1). Download the most recent version of Python for your OS
> https://www.python.org/downloads/

# 2). Download Anaconda (3.7)
> https://www.anaconda.com/distribution/

# 3). Open Jupyter Notebook

## There are two ways to do this

> Open the Anaconda Navigator app
>> Launch Jupyter Notebook
    >> - this will be opened in your browser

**OR**
> Open Terminal
>> type in "jupyter notebook"

# 4) Make a new Notebook

Click the "New" drop-down in the top right, next to "Upload", and choose Python 3.

# Why Python is Used for Data Science
<ol>
<li>Easy to learn and use</li>
<li>Libraries and Frameworks</li>
<li>Well Supported</li>
<li>Monty Python</li>
</ol>

# What is Python?

# Python Programming Language
<ol>
<li>Interpreted</li>
<li>Object-oriented</li>
<li>High-level programming language</li>
<li>with dynamic semantics</li>
</ol>

# Interpreted

    Interpreted languages are not directly executed by the target machine.
    
    Different from compiled programming languages.
    
    Read and executed by another computer program called the *interpreter*.


# Object-Oriented

Python, like any other object-oriented programming (OOP) language, is organized around objects.

Everything in Python is an object:

> <ol>
> <li>Lists</li>
> <li>Dictionaries</li>
> <li>Classes</li>
> <li>... and so forth</li>
> </ol>


# High-level programming language
The syntax of our code is easy for humans to interpret.
> Python is easily readable and has very few syntax requirements compared to other programming languages.

If you have to display something on the screen, the built-in function for this is called print:
> *Typing print() to actually "print" something on the screen is pretty intuitive!*

In [1]:
print("Hello World!")

Hello World!


![Screen%20Shot%202022-11-02%20at%205.25.23%20PM.png](attachment:Screen%20Shot%202022-11-02%20at%205.25.23%20PM.png)

# Dynamic Semantics ~ typing
In Python, a type belongs to the value (the object), not to the variable name, and it is checked at run time.

We can assign values of different types to the same name, one after another; the name simply points at the newest value. A statically typed language does not allow this.

> If we set a = 2 and then a = 'hello', the string replaces the integer as soon as the line is executed

# Can't do this in Java
    
**int a = 2;** <br>
**a = "hello"** ERROR

We can in Python!

In [2]:
a = 2
print(a)
a = "hello"
print(a)

2
hello


# How do we write & run code in Python?

With a simple text editor
    <ol>
    <li> Write code in Python and save as [FILENAME].py </li>
    <li> Run $ python3 [FILENAME].py in your terminal </li>
    </ol>
In a work environment

# Some great Text Editors/IDEs
1). VSCode by Microsoft <br>
2). Sublime Text <br>
3). Atom from GitHub <br>
4). PyCharm <br>
5). VIM <br>

# Cells in Jupyter Notebook

## There are two types of Cells!

In [3]:
def freedomFunction():
    print("Python is really cool")

In [4]:
#Code Cell
freedomFunction()

Python is really cool


# Markdown Cell
We can type all kinds of wacky text
> And do all kinds of
**Crazy Formatting** <br>

*in these boxes*
<ol>
    <li>Item1</li>
    <li>Item2</li>
</ol>

# Importing Libraries
> import [MODULE] as [ALIAS] (the alias is optional; it is a nickname for your own benefit) <br>
> from [MODULE] import [A]

In [5]:
#Basic Python Libraries; Feel free to add more!

import pandas as pd
import numpy as np
import matplotlib
from sklearn import preprocessing

# Python Syntax
1). NO MORE SEMICOLONS :) <br>
2). Whitespace in Python
> Since we're no longer using semicolons to terminate lines, a newline signifies the end of a line. <br> 
> Whitespace and indentation matter!

3). Strings can be 'string' or "string"

In [6]:
def thisFunctionIsWrong():
return "it's gonna throw an error"

IndentationError: expected an indented block (3868164552.py, line 2)

# Python identifiers
<ol>
    <li>An identifier is a name used to identify a variable, function, class, module, or other object</li>
    <li>Starts with a letter (A-Z, a-z) or an underscore (_), followed by zero or more letters, underscores, and digits (0-9)</li>
    <li>Python does not allow punctuation characters such as @, $, and % within identifiers</li>
    <li>Case sensitive</li>
</ol>
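
The rules above can be checked from Python itself. This is a minimal sketch using the built-in string method `isidentifier()`; the candidate names are made up for illustration.

```python
# Candidate names to check; these strings are invented for this example.
candidates = ["pokemon", "_hp", "type1", "1type", "sp-atk", "Sp_Atk", "price$"]

# str.isidentifier() reports whether a string is a syntactically valid identifier.
validity = {name: name.isidentifier() for name in candidates}
for name, ok in validity.items():
    print(f"{name!r}: {ok}")

# Identifiers are case sensitive: these are two different variables.
total = 1
Total = 2
print(total, Total)
```

Names that start with a digit or contain punctuation such as `-` or `$` come back as invalid.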

# Python Variable Types
> <ol>
> <li>Numbers (integers & floats)</li>
> <li>Strings (this includes individual characters)</li>
> <li>Lists</li>
> <li>Tuples</li>
> <li>Dictionaries</li>
> </ol>

In [7]:
a = 2
print(type(a))
b = 2.0
print(type(b))
c = "never gonna give you up"
print(type(c))

<class 'int'>
<class 'float'>
<class 'str'>


# Creating List and List Operations in Python
Lists are great for storing stuff!
We can store anything in a list, including lists!

In [8]:
listOne = [1, 2, 3 ,4]
listOne

[1, 2, 3, 4]

In [9]:
listTwo = ["Bob", 1, 2.0, [4, 5, (1, 2)]]
listTwo

['Bob', 1, 2.0, [4, 5, (1, 2)]]

# List Operations

In [10]:
listThree = []
#to add stuff to a list, use the .append() function
listThree.append(1)
listThree.append("bad data")
listThree

[1, 'bad data']

In [11]:
#remove stuff from a list, use the .remove() function
listThree.remove("bad data")
listThree

[1]

# Lists are Zero Indexed!

In [12]:
#slicing through a list
listFour = [1, 2, 3, 4, 5]
print(listFour[0]) # get the first element
print(listFour[-1]) #get the last element
print(listFour[:3]) #get the first three items (indices 0-2); the stop index 3 is not included
print(listFour[1:]) #get all the items except the first
print(listFour[1:3]) #get item at index 1 up to index 3 not including 3

1
5
[1, 2, 3]
[2, 3, 4, 5]
[2, 3]


# Tuples, like lists but immutable

In [13]:
tupleOne = (1, 2, 3, 4)
#use parenthesis
tupleOne

(1, 2, 3, 4)
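
To see what "immutable" means in practice, here is a small sketch (the variable names are only for illustration): reading from a tuple works like a list, but assigning to an index raises a `TypeError`.

```python
tuple_example = (1, 2, 3, 4)

# Reading by index and slicing work exactly as with lists.
first = tuple_example[0]
tail = tuple_example[1:]

# Assigning to an index raises TypeError because tuples cannot be changed in place.
try:
    tuple_example[0] = 10
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print("TypeError:", err)

# To "change" a tuple, build a new one instead.
updated = (10,) + tuple_example[1:]
print(updated)
```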

# Dictionaries

In [14]:
#like lists, but with key-value pairs
dictOne = {"Bob": 12, "Joe": 17, "Bill": 18}
dictOne

{'Bob': 12, 'Joe': 17, 'Bill': 18}

# The first item in each pair is called the key, the second is the value

In [15]:
print(dictOne.keys()) #returns a view of the keys
print(dictOne.values()) #returns a view of the values
print(dictOne.items()) #returns a view of the key-value pairs as tuples

#NOTE: THESE ARE VIEW OBJECTS, SO YOU MUST CAST THEM INTO LISTS TO USE LIST OPERATIONS ON THEM
print(list(dictOne.keys()))

dict_keys(['Bob', 'Joe', 'Bill'])
dict_values([12, 17, 18])
dict_items([('Bob', 12), ('Joe', 17), ('Bill', 18)])
['Bob', 'Joe', 'Bill']


# Modifying Dictionary and List Values

In [16]:
listFive = [1, 2, 3, 4, 5]
dictTwo = {"Bob": 12, "Joe": 17, "Bill": 18}
listFive[0] = 10 #use the index to change the value
print(listFive[0])
dictTwo["Bob"] = 13 #use the key to change the value
print(dictTwo["Bob"])

#THIS DOES NOT WORK FOR TUPLES: THEY ARE IMMUTABLE, SO ASSIGNING TO AN INDEX RAISES A TypeError

10
13


# Math in Python

In [17]:
a=4
b=2

In [18]:
a + b

6

In [19]:
a-b

2

In [20]:
a*b

8

In [21]:
a**b

16

In [22]:
a/b

2.0

In [23]:
a=7
a//b #floor/integer division

3

# Functions

In [24]:
def func1(var):
    var1 = 2
    var2 = 3
    return var + var1 + var2

var = 1
func1(var)

6

# Conditionals

In [25]:
#if (condition)
x = 5
if (x == 1):
    print('x is equal to 1')
elif (x == 5):
    print('x is equal to 5')
else:
    print('x is not equal to 1 or 5')

x is equal to 5


# Conditional Operators
!= means "not equal to" <br>
or, and, not
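
A short sketch of these operators in use; the values of `hp` and `pokemon_type` are made up for the example.

```python
hp = 45
pokemon_type = "Grass"

# != is True when the two sides differ.
is_not_fire = pokemon_type != "Fire"

# "and" requires both conditions; "or" requires at least one.
weak_grass = (pokemon_type == "Grass") and (hp < 50)
fire_or_water = (pokemon_type == "Fire") or (pokemon_type == "Water")

# "not" flips a boolean.
not_weak = not (hp < 50)

print(is_not_fire, weak_grass, fire_or_water, not_weak)
```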

# Loops

While we tend to avoid loops with tabular data, it is still good to know them!

In [26]:
#for loop
for i in range(10):
    print(i)
#i takes each value from 0 to 9 in turn
#range excludes the stop value passed in

0
1
2
3
4
5
6
7
8
9


# Range()

In [27]:
print(list(range(10)))
print(list(range(0, 10, 2))) #start, stop, step
print(list(range(10, -1, -1))) #decrement

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 2, 4, 6, 8]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


# Iterating through a list or string

We can use a loop if we just want to iterate through values in a string or list

In [28]:
testString = "Python"
for letter in testString:
    print(letter)

P
y
t
h
o
n


In [29]:
testList = ['P', 'y', 't', 'h', 'o','n']
for letter in testList:
    print(letter)

P
y
t
h
o
n


# Iterating through indices

In [30]:
testList = ['the', 'big', 'dog', 'wants', 'some', 'food']
for i in range(len(testList)):
    print(testList[i])

the
big
dog
wants
some
food


# List Comprehensions
These are very useful tools that let us build a list from a loop in a single line.
The syntax goes as follows:

[ [EXPRESSION] for [ITEM] in [ITERABLE] ]

In [31]:
#example
kantoStarters = ["Charmander", "Bulbasaur", "Squirtle", "Pikachu"]

starters = [pokemon for pokemon in kantoStarters] #This returns a list of the items
print(starters)
#list comprehensions are useful for applying a function to a list

upperStarters = [pokemon.upper() for pokemon in kantoStarters]
print(upperStarters)

#or if we want to filter quickly
bestStarter = [pokemon for pokemon in kantoStarters if pokemon == "Charmander"] #conditional comes at the end
print(bestStarter)

['Charmander', 'Bulbasaur', 'Squirtle', 'Pikachu']
['CHARMANDER', 'BULBASAUR', 'SQUIRTLE', 'PIKACHU']
['Charmander']


# Intro to Pandas

# What is pandas?
Not related to real Pandas, sadly. <br>

The name is derived from "Panel Data". <br>

Python Library <br>

Good tool to use when you are working with a large dataset.

# Starting, at last!
Once you have your notebook and csv in the same directory, you're ready to start!

The first thing we want to make sure we have done is importing pandas and numpy, as they are crucial to any data task.

In [None]:
import pandas as pd
import numpy as np

# Reading
Read in your csv file using the following method
> For documentation please visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
df = pd.read_csv("Pokemon.csv")
df.head()

# What is DF?
DF is short for dataframe. A dataframe is essentially any sort of tabular data. <br>
We can change the name df to whatever we want. <br>

In [None]:
pokemon = pd.read_csv("Pokemon.csv")
pokemon.head() # we can input a number to list more or less

## What's a head?
You can see the top few rows of your dataframe using DataFrame.head(). This avoids displaying more of a dataset than is currently needed, and is a nice way of confirming everything has loaded correctly.
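
As a rough illustration, the sketch below builds a small hand-typed table (it does not read Pokemon.csv, and the name `sample` is an assumption) and shows how the argument to `head()` changes the number of rows returned. `tail()` is the mirror image.

```python
import pandas as pd

# A tiny stand-in table; the rows are typed in by hand, not loaded from Pokemon.csv.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander",
             "Charmeleon", "Charizard", "Squirtle"],
    "HP": [45, 60, 80, 39, 58, 78, 44],
})

top_default = sample.head()   # first 5 rows by default
top_two = sample.head(2)      # first 2 rows
bottom_two = sample.tail(2)   # last 2 rows

print(top_two)
print(sample.shape)           # (rows, columns)
```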

# Describing Data
You can use describe() to quickly gather some general insights on your dataset

In [None]:
pokemon.describe()

**Keep in mind some of these values may not make sense or be too useful. For example, we don't really care about average pokemon generation or Pokedex entry (ID number or #)**

# Data Cleaning

Most important part of the ML process <br>

Select columns to work with <br>

Make sure no values are missing <br>

Add new columns with combinations of features
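
The three cleaning steps listed above can be sketched on a tiny hand-made table. The column names mirror the Pokemon dataset, but the rows, the `"None"` fill value, and the `Attack + Defense` column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 2": ["Poison", np.nan, np.nan],
    "Attack": [49, 52, 48],
    "Defense": [49, 43, 65],
    "Generation": [1, 1, 1],
})

# 1. Select the columns to work with.
clean = raw[["Name", "Type 2", "Attack", "Defense"]].copy()

# 2. Check for missing values, then decide how to handle them.
missing_per_column = clean.isna().sum()
clean["Type 2"] = clean["Type 2"].fillna("None")

# 3. Add a new column that combines existing features.
clean["Attack + Defense"] = clean["Attack"] + clean["Defense"]

print(missing_per_column)
print(clean)
```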

# Dropping Columns
We can drop columns using **df.drop()!**

# Solution

In [None]:
#lets get rid of Generation, we don't need that right now
pokemon = df.drop('Generation', axis = 1)
pokemon.head(3)

# Adding Columns
You may want a new column. Just pick a name and throw some data in.

In [None]:
import random #needed for random.randint below

pokemon['New Pokedex ID'] = [random.randint(1, 10000) for i in range(len(pokemon))]
pokemon[['Name', "Type 1", "Type 2", "New Pokedex ID"]]

# Renaming
It's nice to be able to quickly read what column names are. The column "Sp. Atk" could be changed to "Sp. Attack".

**Task 2**: Rename "Sp. Atk" to "Sp. Attack"

# Solution

In [None]:
pokemon = pokemon.rename(columns={"Sp. Atk": "Sp. Attack"})
pokemon.head(3)

# Replacing Data

In [None]:
pokemon = pd.read_csv("Pokemon.csv")
pokemon.head()

In [None]:
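pokemon['Generation'].replace(1, "Kanto")
#replacing generation 1 with the region those Pokemon are from
#restricting replace to the Generation column avoids changing every other 1 in the table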

# Null Values
How you deal with these is your choice, but make sure it fits the context of the data:

Drop them <br>
Fill them with the mean <br>
Fill them backwards or forwards <br>
Fill them with whatever you want <br>
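
Filling forwards or backwards is the one option in this list without its own cell, so here is a minimal sketch using `Series.ffill()` and `Series.bfill()`. The Series `gaps` is made up for the example.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

filled_forward = gaps.ffill()    # carry the last seen value forward
filled_backward = gaps.bfill()   # pull the next seen value backward

print(filled_forward.tolist())
print(filled_backward.tolist())
```

Note that a backward fill leaves the final gap empty here, because there is no later value to pull from.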

## Null Values
Here's a bunch of null values <br>
How can we fill these?

In [None]:
col = pd.Series([np.nan]*10)
col

## Null values
Fill using anything!

In [None]:
col.fillna(1)

## Null values
Fill using the mean.

In [None]:
col = pd.Series([2] + [np.nan] * 10 + [1])
col.fillna(col.mean())

## Null values

Or get rid of them completely!

In [None]:
col = pd.Series([2] + [np.nan] * 10 + [1])
col.dropna()

# Selecting Data
How can we select data from a DataFrame? There are a couple different ways. <br>
We can pull data out using indexing by column, using dot notation, or by using the actual index of a column


# Examples

All three of these lines get the exact same thing: the 'Name' column of the DataFrame as a pandas Series.

In [None]:
pokemon['Name'].head()
pokemon.iloc[:, 1].head()
pokemon.Name.head()

# Selection by Column
You can access a column by its name, in the same kind of format as you would access an element of a list using its index.

In [None]:
pokemon['Name'].head()

# Selection by Column v2
You can select a column the same way as
before, but instead of using
brackets ‘[ ]’, you can use the name of
the column directly as an attribute (dot
notation). The result is the same, so it is
personal preference. The main drawback is
that it cannot access columns with spaces
in the name.

In [None]:
df.Name.head()

# Selection by index
Selection using iloc works by specifying the row
as the first parameter, and the column as the
second parameter. In this case, ‘:’ indicates that
we want every row, and 1 means that we want
the second column (index 1).

More on iloc and loc in a second.

In [None]:
df.iloc[:, 1].head()

# Selection by Index Example
How would you get the 3rd row (index 2) and get the Pokémon stored in that cell?

# Solution

In [None]:
pokemon.iloc[2, 1]

# .loc vs .iloc
Both .loc and .iloc can pull out specific rows

In [None]:
pokemon.loc[2]

In [None]:
pokemon.iloc[2]

# .loc vs .iloc
.loc can index by string name, .iloc indexes by integer

In [None]:
pokemon.loc[:, "Name"].head()

In [None]:
pokemon.iloc[:, 1].head() #.iloc needs the integer position of the column, not its name
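
The difference between the two is easier to see when the row labels are not plain integers. A minimal sketch, assuming a small hand-made table whose index holds Pokemon names (this is not how Pokemon.csv is indexed by default):

```python
import pandas as pd

indexed = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Water"], "HP": [45, 39, 44]},
    index=["Bulbasaur", "Charmander", "Squirtle"],
)

by_label = indexed.loc["Charmander", "HP"]   # row label, column label
by_position = indexed.iloc[1, 1]             # row position, column position

# .loc slices include the end label; .iloc slices exclude the end position.
label_slice = indexed.loc["Bulbasaur":"Charmander"]
position_slice = indexed.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```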

# DataFrame vs Series
You may have noticed when working with iloc and loc, we get a different
looking output than our DataFrame. These are pandas Series, which are
just the data type that a column (or row) is stored in.

A Series is a one-dimensional array containing information from a row or
column.
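
One way to check which of the two you are holding is to look at its type and shape. A small sketch on a hand-made two-row table (the name `table` is an assumption):

```python
import pandas as pd

table = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "HP": [45, 39]})

column = table["HP"]        # single brackets give a Series
row = table.iloc[0]         # one row is a Series indexed by the column names
sub_table = table[["HP"]]   # double brackets give a one-column DataFrame

print(type(column).__name__, type(row).__name__, type(sub_table).__name__)
print(column.shape, sub_table.shape)
```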

# Selecting Multiple Columns
You can select multiple
columns by putting each
column name in a list.

In [None]:
pokemon[["Name", "Total"]]

In [None]:
pokemon[["Name"]] #this also works for single columns

# Useful Pandas Functions

# Selecting Data - Conditionals
Use conditionals to select only specific data based on criteria.

In [None]:
pokemon[pokemon['Type 1'] == 'Water'].head()

How does this work? <br>
Boolean Series!

In [None]:
pokemon['Type 1'] == 'Steel'

# Selecting Data - Multiple Conditionals
What if you want more than one condition to select data?

In [None]:
pokemon[(pokemon['Type 1'] == "Water") & (pokemon['Type 2'] == 'Ground')]
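
Besides `&` (and), boolean Series can be combined with `|` (or) and flipped with `~` (not). Each condition needs its own parentheses when written inline. A sketch on a made-up four-row table:

```python
import pandas as pd

team = pd.DataFrame({
    "Name": ["Squirtle", "Charmander", "Swampert", "Bulbasaur"],
    "Type 1": ["Water", "Fire", "Water", "Grass"],
    "HP": [44, 39, 100, 45],
})

is_water = team["Type 1"] == "Water"
is_fire = team["Type 1"] == "Fire"

water_or_fire = team[is_water | is_fire]              # either condition
not_water = team[~is_water]                           # negation
strong_water = team[is_water & (team["HP"] >= 100)]   # both conditions

print(water_or_fire["Name"].tolist())
print(not_water["Name"].tolist())
print(strong_water["Name"].tolist())
```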

# Exercise
Pull out the names of all Pokemon whose first type is Fire and whose second type is Fighting.

# Solution

In [None]:
exercise = pokemon[(pokemon["Type 1"] == "Fire") & (pokemon["Type 2"] == "Fighting")]
exercise[["Name"]]

# Example
Pull out all the Fire-type Pokemon whose HP is below 100

In [None]:
exercise = pokemon[(pokemon['HP'] < 100) & (pokemon['Type 1'] == 'Fire')]
exercise

# Proportions
Do you want to know what proportion of the total Pokemon are Poison types? Great, because it's super easy to find out

# Counting and Dividing
You can count the rows of a
DataFrame with .shape[0]. We can
select the rows whose ‘Type 1’ is
‘Poison’ and count them, then
count the rows of the entire
dataset.

From there we can calculate our
proportion.

In [None]:
poison = pokemon[pokemon['Type 1'] == 'Poison'].shape[0]
pokemon_total = pokemon.shape[0]
poison/pokemon_total
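
The same proportion can be reached in a couple of other ways. This sketch uses a made-up four-row table and compares row counting with taking the mean of a boolean Series and with `value_counts(normalize=True)`.

```python
import pandas as pd

types = pd.DataFrame({"Type 1": ["Poison", "Fire", "Poison", "Water"]})

# Counting rows and dividing.
by_count = types[types["Type 1"] == "Poison"].shape[0] / types.shape[0]

# True counts as 1 and False as 0, so the mean of a boolean Series is a proportion.
by_mean = (types["Type 1"] == "Poison").mean()

# value_counts(normalize=True) gives the proportion for every type at once.
all_proportions = types["Type 1"].value_counts(normalize=True)

print(by_count, by_mean)
print(all_proportions)
```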

# Exercise 
Of the Pokemon whose type 1 is Fire, what proportion have a type 2 of Fighting?

# Solution

In [None]:
fire = pokemon[pokemon["Type 1"] == 'Fire'].shape[0]
fighting_fire = pokemon[(pokemon["Type 1"] == 'Fire') & (pokemon["Type 2"] == 'Fighting')].shape[0]
fighting_fire/fire

# Applying
What if you wanted to subtract a number from an entire column? What if
you wanted to use some specific function for every entry in a column?

That’s where apply comes in. You can pass a function into an apply call,
and it will perform that function on every element in the column.

# Subtract the Mean
We are now going to subtract the mean Attack from every value in the Attack column.
First we need to get the mean, then create a simple lambda function to apply!

# Code

In [None]:
mean = pokemon['Attack'].mean()
subtract_mean = lambda x: x - mean

pokemon['Attack'] = pokemon['Attack'].apply(subtract_mean)
pokemon.head()
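
For simple arithmetic like this, pandas can also operate on the whole column at once without `apply`. A small sketch on made-up Attack values, comparing both approaches:

```python
import pandas as pd

stats = pd.DataFrame({"Attack": [49, 52, 48, 55]})
mean_attack = stats["Attack"].mean()

with_apply = stats["Attack"].apply(lambda x: x - mean_attack)
vectorized = stats["Attack"] - mean_attack   # same values, no explicit function

print(with_apply.tolist())
print(vectorized.tolist())
```

`apply` remains the tool to reach for when the per-element logic cannot be written as plain column arithmetic.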

# Lambda Functions
A lambda function is a small anonymous function.

A lambda function can take any number of arguments, but can only have one expression.

Think of it as a mini-function

In [None]:
lambda x: x + 10

#the x is the parameter
#this is equivalent to 

def doSomething(x):
    return x + 10
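
A few more uses, as a sketch with made-up data: a lambda is usually passed straight to another function (here `sorted`) rather than given a name.

```python
add_ten = lambda x: x + 10          # assigned to a name only for demonstration
print(add_ten(5))

# Lambdas are most often passed directly to another function.
pairs = [("Bob", 12), ("Joe", 17), ("Bill", 9)]
by_age = sorted(pairs, key=lambda pair: pair[1])
print(by_age)

# A lambda can take several arguments but still only one expression.
total = (lambda a, b, c: a + b + c)(1, 2, 3)
print(total)
```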

# Grouping
- You can take a DataFrame and group the entire thing based on the
unique values of a column, then aggregate each group however you like.

- Picking how you want to aggregate your information and setting a
function up can be the tricky part.

# Grouping 
Simple example: group based on the Pokemon type, sum together each
column of the groups.

In [None]:
pokemon.groupby('Type 1').sum(numeric_only=True) #sum only the numeric columns

# Grouping
You can also group on multiple columns!

In [None]:
pokemon.groupby(['Type 1', 'Type 2']).sum(numeric_only=True) #sum only the numeric columns

# Grouping - aggregate
We can aggregate each group with a function of our choice using .aggregate

In [None]:
pokemon.groupby('HP').aggregate(np.mean)
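
Here is a sketch of passing a hand-written function to `.aggregate`, on a small made-up table. The helper `value_range` is hypothetical and only illustrates the idea; `.aggregate` also accepts a list of built-in aggregation names.

```python
import pandas as pd

stats = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Water", "Water"],
    "HP": [39, 78, 44, 79],
    "Attack": [52, 84, 48, 83],
})

def value_range(column):
    """Spread between the largest and smallest value in one group's column."""
    return column.max() - column.min()

grouped = stats.groupby("Type 1")[["HP", "Attack"]]
ranges = grouped.aggregate(value_range)        # our own function
summary = grouped.aggregate(["mean", "max"])   # several built-in aggregations

print(ranges)
print(summary)
```

Selecting the numeric columns before aggregating keeps text columns such as names out of the calculation.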

# Merging
If you have two tables that have matching information, you may want to merge them.

In [None]:
types = pokemon[['Type 1', 'Type 2']]
types.head()

In [None]:
name = pokemon[['Name']]
name.head()

# Merging
Like this:

In [None]:
name.merge(types, left_index = True, right_index = True).head()
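
Tables are often matched on a shared column rather than on the index. A minimal sketch, assuming two hand-made tables that share a `#` (Pokedex number) column; the `how` argument controls which rows survive.

```python
import pandas as pd

names = pd.DataFrame({"#": [1, 4, 7], "Name": ["Bulbasaur", "Charmander", "Squirtle"]})
types = pd.DataFrame({"#": [1, 4, 25], "Type 1": ["Grass", "Fire", "Electric"]})

inner = names.merge(types, on="#")               # only keys present in both tables
left = names.merge(types, on="#", how="left")    # keep every row of names

print(inner)
print(left)
```

In the left merge, Squirtle has no match in `types`, so its `Type 1` is filled with a missing value.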

# Thank you for attending our workshop
## Good luck with Hacking!