# Python Introduction Tutorial

This tutorial introduces python programming for beginners & explains common analytic concepts. Detailed math explanations are beyond the scope of this tutorial; however, links to sites with additional background are provided throughout. 

#### Topics Covered
1. Installing Python & Jupyter Notebook
2. Base Python Operations
3. Loading Packages & Data with Python
4. Data Types
5. Data Manipulation with Python
6. Writing If Statements, Loops, and Functions in Python




## Install Python
Python is an open source programming language that has become one of the most popular in the world due to its versatility & ease of use. To download python, navigate to https://www.python.org/downloads/ & select the version appropriate for your machine. As with most programs, there will be prompts that ask if any custom settings are needed - selecting the defaults is fine.

Congratulations! You have downloaded & installed python. Now the fun part - how to use python. Python has a number of IDE (Integrated Development Environment) options. If this is a new term, think "user-interface". A few options are listed below - some are easier to set up than others. In my opinion, whichever option you can install & use comfortably is the best choice. 

1. JupyterLab / Jupyter Notebook (technically different things)
2. Visual Studio Code
3. Atom
4. Command Prompt
5. Most text editors.

Installing JupyterLab / Jupyter Notebook is very straightforward, & notebooks are great for mixing code with Markdown notes, so I'll run through that example. To install on Windows: 

1. Type Command Prompt into your 'Start Menu' & open it. 
2. Type "py -m pip install jupyterlab" as a single command & press Enter. The "py" launcher runs python, & "-m pip install" tells it to run the package installer. 
3. A few questions may pop up, type 'yes' as needed.
4. After installation, type "jupyter lab" (or "py -m jupyterlab") in the Command Prompt to launch JupyterLab. 
5. The user-interface will open in your web browser & you'll be able to create python scripts from scratch. 

## Base Python Operations
Sure, python can do a lot of fancy things. However, at its core, python is still a programming language. To do those fancy things, there needs to be building blocks. This section will cover some base functionality that will seem trivial but will ultimately be useful as you begin to develop more complex programs.

##### Basic Operators
1. '+' Addition
2. '-' Subtraction
3. '*' Multiplication
4. '/' Division
5. '**' Exponent
6. '%' Modulus
7. '//' Floor Division

In [None]:
### Python Addition
4 + 4

In [None]:
### Python Subtraction
10 - 5

In [None]:
### Python Multiplication
5 * 5

In [None]:
### Python Division
40 / 8

In [None]:
### Python Exponent
3**2

In [None]:
### Python Modulus
# Modulus = remainder
23 % 5

In [None]:
### Python Floor Division
# Division Rounded Down
23 // 5

All of these operators can be combined in a single expression. The operations are evaluated following PEMDAS rules, with modulus & floor division sharing the same priority as multiplication & division. What will 8*((9-4)%2)+1 = ? Explain why.

In [None]:
### Multiple Operations Example
8*((9-4)%2)+1
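
To check the reasoning, the expression can be broken into one step per line. This is just a sketch of the order of operations; the variable names are my own & only there to label each step.

```python
# Evaluate 8*((9-4)%2)+1 one step at a time
inner = 9 - 4             # innermost parentheses first -> 5
remainder = inner % 2     # modulus inside the outer parentheses -> 1
product = 8 * remainder   # multiplication before addition -> 8
result = product + 1      # addition last -> 9

print(result)                        # 9
print(result == 8*((9-4)%2)+1)       # True
```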

## Loading Packages & Data with Python

### Loading Packages
A package is a collection of functions that a person or organization made to simplify frequently used processes. For example, if I frequently have to calculate the square root of a number and add 5 & am tired of manually typing the steps every time, I can write a function to simplify the process (sqrt comes from python's built-in math package): 

import math <br>
def sq_root(num): <br>
&nbsp;&nbsp;&nbsp;&nbsp;return math.sqrt(num) + 5
    
Now consider I have to do the same process 1000 times with 1000 different numbers. Rather than type the steps out 1000 times, I can call the same function each time. Once I have built up a collection of useful functions, I can package them so I don't have to recreate them in every script. This is a simple example, but the idea stands - a package is a grouping of functions that can be imported to your working session. 
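
Here is a minimal runnable sketch of that idea. The input numbers are made up for illustration; the point is that one function can be reused for as many values as needed.

```python
import math

def sq_root(num):
    """Return the square root of num plus 5."""
    return math.sqrt(num) + 5

# Reuse the same function for several numbers instead of retyping the steps
numbers = [4, 9, 16]
results = [sq_root(n) for n in numbers]

print(results)  # [7.0, 8.0, 9.0]
```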

To import a package use the syntax import (package) as (alias). Replace (package) with the actual name of the package and (alias) with an abbreviation. Aliases are optional but convenient, as they let you reference packages with fewer keystrokes. More on that later. 

Many common packages are already installed when you download python; however, some packages do need to be installed. This can be done directly in a python script or from the command prompt. Packages only need to be installed once, so it is best practice to remove package installations from your scripts.
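
If you're not sure whether a package is already installed, one option is to check before installing. The sketch below uses importlib.util.find_spec from python's standard library; the helper name is_installed & the fake package name are my own, used only for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if python can find the package, without importing it."""
    return importlib.util.find_spec(package_name) is not None

# json ships with python, so it should always be found
print(is_installed("json"))                     # True

# A made-up name that should not exist on any machine
print(is_installed("not_a_real_package_123"))   # False
```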

In [5]:
### Import pandas & numpy packages

# Import pandas
import pandas as pd

# Import Numpy
import numpy as np

In [6]:
### Install a pip package in the current Jupyter kernel

# Import sys package
import sys

# Run sys.executable to install the 
# desired package. numpy in this case.
!{sys.executable} -m pip install numpy



### Loading Data

Several packages can be used to import data - this tutorial will focus on pandas. Pandas is one of the most popular packages used in python & therefore has a lot of documentation on how to use its functions. The following examples are based on loading data from csv & excel files. <br>

**Load csv files**<br>
read_csv(*filepath*)<br><br>
**Load excel files**<br>
read_excel(*filepath*)



In [7]:
### Load a csv file
# Loading an excel file follows the same process & will not be covered here

# Define the file path
filepath_csv = 'C:/Users/JoeRatterman/Documents/GitHub/MarchMadness2021/boxscores/2021_boxscores.csv'

# Load file
df = pd.read_csv(filepath_csv)

# Print the first few rows of data
df.head(3)

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
0,41.4,12,2.4,1,96.0,64.5,20,0.594,53,0.547,...,22,0.0,0,"Germain Arena, Estero, Florida",AUSTIN-PEAY,Austin Peay,75.2,Away,ABILENE-CHRISTIAN,Abilene Christian
1,57.7,15,0.0,0,78.4,81.5,22,0.5,58,0.448,...,21,0.0,0,"Germain Arena, Estero, Florida",NEBRASKA-OMAHA,Omaha,73.8,Away,ABILENE-CHRISTIAN,Abilene Christian
2,52.9,9,8.1,3,98.8,56.4,22,0.4,50,0.34,...,21,0.0,0,"Moody Coliseum , Abilene, Texas",Howard Payne\n\t\t\t,Howard Payne\n\t\t\t,81.6,Home,ABILENE-CHRISTIAN,Abilene Christian


Data can also be read from databases. The process is similar; however, it can require different packages. The appropriate package depends on the goal of the data load. There are packages that can connect to a database and read a table or execute a stored procedure, & there are even packages that allow you to write SQL in your python environment. Using SQL in python will not be covered in depth in this tutorial, but if you're interested, search Google for sqlite3 or pyodbc.

Note that the below example (and most instances of connecting to databases) requires an ODBC connection. I won't cover that in this tutorial, but this link provides a base explanation: https://www.progress.com/faqs/datadirect-odbc-faqs/what-is-an-odbc-driver. 


In [8]:
# Import pyodbc to connect to the sql database
import pyodbc

# Create the connection string - this has all of the
# parameters needed to connect to the database. 
# No username or password is required in this example,
# but those are common arguments. I have commented them
# out in the code as an example.

conn_str = (
    r"Driver={ODBC Driver 17 for SQL Server};"
    r"Server=(local);"
    r"Database=NCAABasketball;"
    r"Trusted_Connection=yes;"
    #r"UID=***;"
    #r"PWD=***;"
    )
# Read the connection string
conn = pyodbc.connect(conn_str)

# Write Query
sql = "SELECT * FROM NCAABasketball.dbo.boxscores"

# Run query & import results into pandas df. 
# Note pandas was imported in an earlier block.
raw_data = pd.read_sql(sql, conn)
raw_data.head(3)

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name,season
0,,8,,4,,,22,,53,0.34,...,1.0,1,"Clune Arena , Colorado Springs, Colorado",Colorado-Colorado Springs\n\t\t\t,Colorado-Colorado Springs\n\t\t\t,61.8,Home,AIR-FORCE,Air Force,2011
1,,12,,1,,,26,,58,0.328,...,0.5,1,"Clune Arena , Colorado Springs, Colorado",AIR-FORCE,Air Force,62.2,Away,Colorado College\n\t\t\t,Colorado College\n\t\t\t,2011
2,57.1,12,5.7,2,122.5,68.2,15,0.447,57,0.368,...,0.667,2,"Clune Arena , Colorado Springs, Colorado",TENNESSEE-STATE,Tennessee State,71.4,Home,AIR-FORCE,Air Force,2011


Now that the data is loaded into the python environment, it can be treated & manipulated as a normal dataframe. The sql string that we ran in this example does not use any additional 'WHERE', 'GROUP BY', etc. clauses, but those would all be perfectly fine to include. <br><br>
One final note: it is best practice to read the data in & then make a copy before performing any manipulations - this prevents overwriting the original raw data pull. This is especially useful in jupyter notebook, where you can restore from the raw copy instead of re-running the query. <br><br>
**Example** <br>
sql = "SELECT * FROM NCAABasketball.dbo.boxscores"<br>
raw_data = pd.read_sql(sql, conn)<br>
df = raw_data.copy()
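
To see why the copy matters, here is a small sketch with a made-up dataframe standing in for the raw data pull (the team & points values are hypothetical). Changes made to the copy leave the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the raw data pull
raw_data = pd.DataFrame({"team": ["A", "B"], "points": [70, 65]})

# Work on an independent copy
df = raw_data.copy()
df["points"] = df["points"] + 10

print(raw_data["points"].tolist())  # [70, 65] - original is unchanged
print(df["points"].tolist())        # [80, 75]
```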

## Data Types
The concept of a data type should be somewhat familiar if you've taken any programming courses in the past (R, Python, SQL, or any other). The simplest example of a data type is Text vs. Numeric - think of Text data types as words or strings of letters like 'hello' or 'hh', while numeric would be a number like 17. In all there are seven main data types in python, most of which have sub-types as well. For more information please visit: https://www.w3schools.com/python/python_datatypes.asp.  <br><br>

- **Text Type** (str)
- **Numeric Types** (int, float, complex)
- **Sequence Types** (list, tuple, range)
- **Mapping Type** (dict)
- **Set Types** (set, frozenset)
- **Boolean Type** (bool)
- **Binary Types** (bytes, bytearray, memoryview)

The website referenced has examples of all of these data types & more information can be found on Google. If they aren't referenced below, I personally haven't needed to use them to this point in my limited python career. <br>

**Text Type** <br>
A *str* is a combination of characters. It can be stored as a variable or as a column type in a dataframe. Examples would be a first and last name in a dataframe or a file path stored as a variable. Below I define the str 'C:/Users/JoeRatterman/Documents/GitHub/MarchMadness2021/boxscores/2021_boxscores.csv' as the variable **filepath_csv**.<br>

*filepath_csv* = 'C:/Users/JoeRatterman/Documents/GitHub/MarchMadness2021/boxscores/2021_boxscores.csv'<br><br>

**Numeric Types** <br>
An *int* is a whole number such as 1, 2, or 3. A *float* is a number that can contain decimals, such as 1.0 or 100.9234. Similar to Text Types, these types can also be stored as variables or as columns in a dataframe. Depending on the precision chosen, int types can take up less storage space than floats (for example, a 32-bit int versus a 64-bit float), & they avoid floating-point rounding, so when the data is whole numbers it is generally better to store it as an int. <br><br>
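
A quick way to see the difference is to ask python for the type of a value. This is a small sketch using the built-in type(), int(), & float() functions; the variable names are my own.

```python
whole = 17
decimal = 100.9234

print(type(whole).__name__)     # int
print(type(decimal).__name__)   # float

# Converting between the two
as_int = int(decimal)      # drops the decimals -> 100
as_float = float(whole)    # adds a decimal point -> 17.0
print(as_int, as_float)

# Regular division always returns a float, even for whole results
print(type(40 / 8).__name__)    # float
```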

**Sequence Types** <br>
A *list* is an ordered collection of objects that can be edited. Lists can be used to store data or create an iterable object for a loop. For example, I can define the list *shopping_list* = ['bread', 'chicken', 'apples']. 

In [9]:
# Define shopping list
shopping_list = ['bread', 'chicken', 'apples']
print(shopping_list)

['bread', 'chicken', 'apples']


This list is now an object that I can call below in my working session. A key feature of a list is that it can be edited. Below, I decide I don't want to buy apples anymore & instead I want to buy bananas. I can simply update my list to reflect my decision. 

In [10]:
# Edit shopping list - remove 'apples' & add 'bananas'
shopping_list.remove('apples')
shopping_list.append('bananas')
print(shopping_list)

['bread', 'chicken', 'bananas']


My list is now updated & I can head to the store to shop. <br>

A *tuple* is an ordered collection of objects that cannot be edited. Tuples are more commonly used for data that you want to ensure doesn't get manipulated by accident. The setup is similar - see below:

In [11]:
# Create shopping_tuple
shopping_tuple = ('bread', 'chicken', 'apples')
print(shopping_tuple)

('bread', 'chicken', 'apples')


In [12]:
# Attempt to edit tuple
# This will return an error. 
shopping_tuple.remove('apples')

AttributeError: 'tuple' object has no attribute 'remove'

I won't add too much detail on *ranges*, other than to say they are commonly used for loops. There is an example below - though this is a simple example, *ranges* can be extremely useful for iterating through data. Loops will be covered in more depth in Section 6. 

In [None]:
# Define a range that is 3 numbers
range_ex1 = range(3)

# Loop through each item in the range
for i in range_ex1:
    print(i)

# A more practical example of using a range.
# Here, I want to print each item in my
# shopping list. To do so, I make a range
# that is equal to the length of my list
# then iterate through it.

# Define range
range_ex2 = range(len(shopping_list))

# Iterate through shopping list using the defined range
for i in range_ex2:
    print(shopping_list[i])
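
As an alternative to building a range from the length of the list, python's built-in enumerate() returns each position together with its item. This is a sketch of the same loop written that way; the numbered list is only there to collect the output.

```python
shopping_list = ['bread', 'chicken', 'bananas']

# enumerate() yields (position, item) pairs, starting at position 0
numbered = []
for position, item in enumerate(shopping_list):
    numbered.append(str(position) + ': ' + item)

print(numbered)  # ['0: bread', '1: chicken', '2: bananas']
```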

**Mapping Type**<br>
A *dictionary* is a very popular data type in python consisting of key-value pairs. A key is similar to a label, & each key has a value. For example, I have a list of AMEND employees' favorite fruits. I can store these data points together in a dictionary for quick access. This would be equivalent to a dataframe with one column for employee name & one column for fruit name. The dictionary is useful in this scenario because looking up a value by its key stays fast even as the dataset grows. 

In [None]:
# Define dictionary
employee_fruit = {
    'Ratterman': 'Apple',
    'Emsley': 'Pear',
    'Accorti': 'Pineapple',
    'Welp': 'Dragonfruit'
}

print(employee_fruit)

In [None]:
# Look up a value in the dictionary
# using the selected employee (the key).

# Return Emsley's favorite fruit
print(employee_fruit['Emsley'])
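
Dictionaries can also be edited after they are created. The sketch below shows a few common operations; the added employee 'Smith', the changed fruit, & the missing employee 'Jones' are hypothetical values used only for illustration.

```python
employee_fruit = {
    'Ratterman': 'Apple',
    'Emsley': 'Pear',
    'Accorti': 'Pineapple',
    'Welp': 'Dragonfruit'
}

# Add a new key-value pair
employee_fruit['Smith'] = 'Mango'

# Update the value of an existing key
employee_fruit['Emsley'] = 'Peach'

# .get() returns a default instead of an error when the key is missing
favorite = employee_fruit.get('Jones', 'Unknown')
print(favorite)  # Unknown

# Loop through every key-value pair
for name, fruit in employee_fruit.items():
    print(name + ' likes ' + fruit)
```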

**Boolean Type**<br>
*Boolean* is a fancy term for True/False, named after the mathematician George Boole (no joke). In practice, I typically use booleans in dataframes & have rarely (if ever) defined a variable as a boolean. An example could be a dataframe of customer data that has a column for whether the customer spent over $100. It is also common to use booleans in *if* statements: *if* this is True then do this... else do that.

In [None]:
# Define my name
name = "Joe"

# If statement to check what my name is
if name == "Henry":
    print('My name is Henry.')
else:
    print('My name is not Henry. My name is ' + name + '.')
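
Comparisons such as == or > produce a boolean on their own, which is what the *if* statement is checking. Here is a small sketch using the customer spending idea from above; the amount is a made-up value.

```python
# Hypothetical amount a customer spent
amount_spent = 125

# The comparison itself evaluates to True or False
spent_over_100 = amount_spent > 100

print(spent_over_100)                  # True
print(type(spent_over_100).__name__)   # bool

# The boolean can be used directly in an if statement
if spent_over_100:
    message = 'Customer spent over $100.'
else:
    message = 'Customer spent $100 or less.'

print(message)
```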

## Data Manipulation in Python
Whether you want to be a data scientist, data engineer, or just automate some boring tasks with python, you'll spend most of your time (roughly 85%, as a rule of thumb) manipulating data.<br>

To help understand what I mean by data manipulation, think about getting a monthly report of sales numbers from your boss. Your company is global, but you're tasked with finding the sales rep with the most sales in North America. What steps do you need to do to find this information? Maybe something like this:

1. Filter the file for the North America Region
2. Add a look-up of distinct sales reps
3. Add a sumif to determine the total sales by sales rep
4. Sort the total sales from highest to lowest.
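
The steps above can be sketched in pandas as follows. The sales data, column names, & rep names here are all made up for illustration - a real report would be loaded from a file instead.

```python
import pandas as pd

# Hypothetical monthly sales report
sales = pd.DataFrame({
    "region": ["North America", "Europe", "North America", "North America", "Asia"],
    "sales_rep": ["Ana", "Ben", "Cal", "Ana", "Dee"],
    "amount": [100, 250, 300, 150, 50],
})

# 1. Filter for the North America region
north_america = sales[sales["region"] == "North America"]

# 2 & 3. Total sales for each distinct sales rep (the "sumif")
totals = north_america.groupby("sales_rep")["amount"].sum()

# 4. Sort from highest to lowest
ranked = totals.sort_values(ascending=False)

# The top sales rep is the first entry
top_rep = ranked.index[0]
print(ranked)
print(top_rep)  # Cal
```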

All of these steps manipulate the raw data to get the final answer, & python can do each of them. In this section, I'm going to start with examples of common transformations before showing some analysis on a dataset that I haven't looked at before. The intention is to provide a real-world example of how to approach a dataset & find insights. Don't take this as an exhaustive list of functions / processes. I like to think that anything you can imagine can be done with programming - most of your time will be spent problem-solving & figuring out how to get from Point A to Point B.

Some key functions are below. If you're familiar with SQL, most of these will look familiar. 

- Select / Drop
- Filter
- Distinct
- Order By
- Mutate
- Group By & Summarize
- Join/Merge
- Append / Union

The dataset that I'll be looking through is 2021 Men's NCAA Basketball Boxscores. 

In [None]:
# Define the csv file path
filepath_csv = 'C:/Users/JoeRatterman/Documents/GitHub/MarchMadness2021/boxscores/2021_boxscores.csv'

# Load file
df = pd.read_csv(filepath_csv)

# Print the column names
print(df.columns)

# Print first 3 rows of data
df.head(3)

A few of the first takeaways from looking at the column names & the first few rows are below:
- away_/home_ are used as prefixes for most of the columns
- the winner column looks like it has values of 'Home' & 'Away'
- I'll need to use the winner, winning_name, & losing_name columns to determine which team is home / away

First, I want to get a distinct list of team names. Then I can use that list as the basis to collect season stats whether a team was home or away. My goal will be to clean the dataset so that there is one row for each team for each game. However, before diving into that analysis, we'll cover the aforementioned functions starting with select / drop.

**Select / Drop**<br>
More examples at: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

In [19]:
# Option 1: Indexing with column name
# brackets & type column names in ""
df[["winning_name", "losing_name"]]

Unnamed: 0,winning_name,losing_name
0,Abilene Christian,Austin Peay
1,Abilene Christian,Omaha
2,Abilene Christian,Howard Payne\n\t\t\t
3,Abilene Christian,Tarleton State
4,Texas Tech,Abilene Christian
...,...,...
8035,Youngstown State,Purdue-Fort Wayne
8036,IUPUI,Youngstown State
8037,Youngstown State,IUPUI
8038,Youngstown State,UIC


In [None]:
# Option 2: Indexing with column position
# Below is an example selecting all rows ":"
# and the columns at positions 11 & 12
# (positions start at 0, so these are the 12th & 13th columns).
df.iloc[:,[11,12]]

In [None]:
# Below is an example selecting rows 5 - 9
# (positions start at 0 & the end of the slice, 10, is excluded)
# and all columns. 
df[5:10]

In [26]:
# Option 3: Drop columns that are not
# needed. Efficient if you need to keep
# most columns

# Define unneeded columns in list
drop_col = ["winning_name", "losing_name"]

# Drop columns from df
# axis = 1 means drop columns rather than rows
df.drop(drop_col, axis = 1)

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goal_attempts,home_two_point_field_goal_percentage,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,pace,winner,winning_abbr
0,41.4,12,2.4,1,96.0,64.5,20,0.594,53,0.547,...,42,0.524,22,0.0,0,"Germain Arena, Estero, Florida",AUSTIN-PEAY,75.2,Away,ABILENE-CHRISTIAN
1,57.7,15,0.0,0,78.4,81.5,22,0.500,58,0.448,...,38,0.553,21,0.0,0,"Germain Arena, Estero, Florida",NEBRASKA-OMAHA,73.8,Away,ABILENE-CHRISTIAN
2,52.9,9,8.1,3,98.8,56.4,22,0.400,50,0.340,...,37,0.568,21,0.0,0,"Moody Coliseum , Abilene, Texas",Howard Payne\n\t\t\t,81.6,Home,ABILENE-CHRISTIAN
3,33.3,5,13.8,4,97.2,71.4,25,0.443,35,0.429,...,29,0.483,14,0.0,0,"Moody Coliseum , Abilene, Texas",TARLETON-STATE,71.1,Home,ABILENE-CHRISTIAN
4,37.5,6,10.7,3,77.3,64.5,20,0.422,45,0.356,...,28,0.357,10,0.0,0,"United Supermarkets Arena, Lubbock, Texas",ABILENE-CHRISTIAN,65.9,Home,TEXAS-TECH
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8035,56.0,14,13.9,5,104.3,69.4,25,0.617,47,0.532,...,36,0.500,18,0.0,0,"Beeghly Center, Youngstown, Ohio",IPFW,68.5,Home,YOUNGSTOWN-STATE
8036,52.0,13,5.4,2,92.1,82.1,23,0.537,54,0.463,...,37,0.514,19,0.0,0,"Beeghly Center, Youngstown, Ohio",YOUNGSTOWN-STATE,75.8,Away,IUPUI
8037,61.9,13,10.0,4,95.1,91.3,21,0.379,62,0.339,...,40,0.575,23,0.0,0,"Beeghly Center, Youngstown, Ohio",IUPUI,80.8,Home,YOUNGSTOWN-STATE
8038,37.5,9,5.1,2,108.8,82.1,23,0.519,53,0.453,...,39,0.564,22,0.0,0,"Beeghly Center, Youngstown, Ohio",ILLINOIS-CHICAGO,68.1,Home,YOUNGSTOWN-STATE


**Filter**

In [28]:
# Filter winning_name = 'Cincinnati'
df[df["winning_name"] == 'Cincinnati'].head()

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
1158,66.7,16,6.7,2,111.3,58.6,17,0.492,60,0.4,...,19,0.0,0,"Fifth Third Arena, Cincinnati, Ohio",CENTRAL-FLORIDA,UCF,62.1,Home,CINCINNATI,Cincinnati
1269,77.8,14,0.0,0,98.5,68.6,24,0.476,41,0.439,...,24,0.0,0,"Fifth Third Arena, Cincinnati, Ohio",LIPSCOMB,Lipscomb,68.1,Home,CINCINNATI,Cincinnati
1271,65.2,15,8.6,3,108.3,78.9,15,0.457,58,0.397,...,23,0.0,0,"Fifth Third Arena, Cincinnati, Ohio",FURMAN,Furman,71.8,Home,CINCINNATI,Cincinnati
1277,78.6,22,17.1,7,89.6,91.4,32,0.541,61,0.459,...,23,0.0,0,"Moody Coliseum, Dallas, Texas",SOUTHERN-METHODIST,SMU,76.7,Away,CINCINNATI,Cincinnati
1279,54.5,12,14.7,5,83.3,83.3,25,0.465,57,0.386,...,14,0.0,0,"Liacouras Center, Philadelphia, Pennsylvania",TEMPLE,Temple,72.1,Away,CINCINNATI,Cincinnati


In [30]:
# Filter away assist percentage > 80
df[df["away_assist_percentage"] > 80].head()

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
12,91.3,21,8.3,3,95.5,54.5,18,0.402,61,0.377,...,15,0.0,0,"Johnson Coliseum, Huntsville, Texas",ABILENE-CHRISTIAN,Abilene Christian,67.2,Home,SAM-HOUSTON-STATE,Sam Houston State
21,82.1,23,2.8,1,110.5,73.5,25,0.548,62,0.452,...,21,0.0,0,"Jeff Farris Center, Conway, Arkansas",ABILENE-CHRISTIAN,Abilene Christian,76.1,Home,CENTRAL-ARKANSAS,Central Arkansas
329,81.5,22,5.7,2,87.5,62.5,20,0.585,53,0.509,...,15,0.0,0,"H.O. Clemmons Arena, Pine Bluff, Arkansas",ARKANSAS-PINE-BLUFF,Arkansas-Pine Bluff,64.1,Away,PRAIRIE-VIEW,Prairie View
554,81.6,31,0.0,0,92.5,70.8,17,0.733,58,0.655,...,25,0.0,0,"Sam Vadalabene Center , Edwardsville, Illinois",SOUTHERN-ILLINOIS-EDWARDSVILLE,SIU-Edwardsville,67.0,Away,BELMONT,Belmont
799,82.6,19,8.3,2,111.4,57.1,12,0.466,59,0.39,...,13,0.0,0,"McDonough Gymnasium, Washington, District of C...",BUTLER,Butler,69.9,Home,GEORGETOWN,Georgetown
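
Filters can also be combined. In pandas, & means "and" while | means "or", & each condition needs its own parentheses. The sketch below uses a small made-up dataframe with a few of the same column names so it can run on its own.

```python
import pandas as pd

# Small hypothetical sample using column names from the boxscore data
games = pd.DataFrame({
    "winning_name": ["Cincinnati", "Cincinnati", "Temple", "Furman"],
    "winner": ["Home", "Away", "Home", "Away"],
    "pace": [62.1, 76.7, 72.1, 71.8],
})

# AND: Cincinnati wins that happened on the road
cincy_away = games[(games["winning_name"] == "Cincinnati") & (games["winner"] == "Away")]

# OR: fast-paced games or any game Temple won
fast_or_temple = games[(games["pace"] > 75) | (games["winning_name"] == "Temple")]

print(cincy_away)
print(fast_or_temple)
```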


**Distinct / Unique Values**

In [45]:
# Unique values in a list
shopping_list = ['apples', 'apples', 'bananas', 'apples', 'bread', 'apples']
list(np.unique(shopping_list))


['apples', 'bananas', 'bread']

In [47]:
# Unique values in a column
df[["winning_name"]].drop_duplicates().head()

Unnamed: 0,winning_name
0,Abilene Christian
4,Texas Tech
7,Arkansas
12,Sam Houston State
21,Central Arkansas


In [48]:
# Unique values in a dataframe
df.drop_duplicates().head()

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
0,41.4,12,2.4,1,96.0,64.5,20,0.594,53,0.547,...,22,0.0,0,"Germain Arena, Estero, Florida",AUSTIN-PEAY,Austin Peay,75.2,Away,ABILENE-CHRISTIAN,Abilene Christian
1,57.7,15,0.0,0,78.4,81.5,22,0.5,58,0.448,...,21,0.0,0,"Germain Arena, Estero, Florida",NEBRASKA-OMAHA,Omaha,73.8,Away,ABILENE-CHRISTIAN,Abilene Christian
2,52.9,9,8.1,3,98.8,56.4,22,0.4,50,0.34,...,21,0.0,0,"Moody Coliseum , Abilene, Texas",Howard Payne\n\t\t\t,Howard Payne\n\t\t\t,81.6,Home,ABILENE-CHRISTIAN,Abilene Christian
3,33.3,5,13.8,4,97.2,71.4,25,0.443,35,0.429,...,14,0.0,0,"Moody Coliseum , Abilene, Texas",TARLETON-STATE,Tarleton State,71.1,Home,ABILENE-CHRISTIAN,Abilene Christian
4,37.5,6,10.7,3,77.3,64.5,20,0.422,45,0.356,...,10,0.0,0,"United Supermarkets Arena, Lubbock, Texas",ABILENE-CHRISTIAN,Abilene Christian,65.9,Home,TEXAS-TECH,Texas Tech


**Arrange / Order By**

In [52]:
# Arrange a list in alphabetical order
shopping_list.sort()
shopping_list

['apples', 'apples', 'apples', 'apples', 'bananas', 'bread']

In [54]:
# Arrange a list in reverse alphabetical order
shopping_list.sort(reverse = True)
shopping_list

['bread', 'bananas', 'apples', 'apples', 'apples', 'apples']

In [56]:
# Arrange a dataframe by a column
df.sort_values("winning_name").head(8)

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
0,41.4,12,2.4,1,96.0,64.5,20,0.594,53,0.547,...,22,0.0,0,"Germain Arena, Estero, Florida",AUSTIN-PEAY,Austin Peay,75.2,Away,ABILENE-CHRISTIAN,Abilene Christian
22,59.1,13,3.8,1,85.9,85.7,18,0.533,45,0.489,...,16,0.0,0,"William Johnson Coliseum, Nacogdoches, Texas",STEPHEN-F-AUSTIN,Stephen F. Austin,71.0,Away,ABILENE-CHRISTIAN,Abilene Christian
23,73.7,14,5.7,2,114.9,62.9,22,0.453,53,0.358,...,18,0.0,0,"Moody Coliseum , Abilene, Texas",INCARNATE-WORD,Incarnate Word,74.1,Home,ABILENE-CHRISTIAN,Abilene Christian
24,31.0,9,18.6,8,129.2,62.1,18,0.58,56,0.518,...,25,0.0,0,"Leonard E. Merrell Center, Katy, Texas",LAMAR,Lamar,71.9,Home,ABILENE-CHRISTIAN,Abilene Christian
25,65.5,19,8.1,3,62.5,76.0,38,0.539,64,0.453,...,15,0.0,0,"Leonard E. Merrell Center, Katy, Texas",NICHOLLS-STATE,Nicholls State,72.2,Away,ABILENE-CHRISTIAN,Abilene Christian
2657,63.0,17,9.4,3,87.5,90.0,27,0.438,65,0.415,...,14,0.0,0,"Frank and Lucille Sharp Gymnasium, Houston, Texas",HOUSTON-BAPTIST,Houston Baptist,71.8,Away,ABILENE-CHRISTIAN,Abilene Christian
6413,54.5,6,16.7,5,104.1,71.9,23,0.361,36,0.306,...,15,0.0,0,"Moody Coliseum , Abilene, Texas",SOUTHEASTERN-LOUISIANA,Southeastern Louisiana,72.6,Home,ABILENE-CHRISTIAN,Abilene Christian
6016,48.3,14,10.5,4,119.4,70.4,19,0.525,60,0.483,...,19,0.0,0,"Teague Special Events Center, Abilene, Texas",SAM-HOUSTON-STATE,Sam Houston State,72.3,Home,ABILENE-CHRISTIAN,Abilene Christian


In [60]:
# Arrange a dataframe by a column - Reverse
df.sort_values("winning_name", ascending = False).head(8)

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
5812,59.3,16,4.3,2,98.7,71.8,28,0.573,55,0.491,...,24,0.0,0,"UPMC Events Center, Moon, Pennsylvania",ROBERT-MORRIS,Robert Morris,70.1,Away,YOUNGSTOWN-STATE,Youngstown State
2771,37.5,9,5.1,2,108.8,82.1,23,0.519,53,0.453,...,22,0.0,0,"Beeghly Center, Youngstown, Ohio",ILLINOIS-CHICAGO,UIC,68.1,Home,YOUNGSTOWN-STATE,Youngstown State
2987,56.0,14,13.9,5,104.3,69.4,25,0.617,47,0.532,...,18,0.0,0,"Beeghly Center, Youngstown, Ohio",IPFW,Purdue-Fort Wayne,68.5,Home,YOUNGSTOWN-STATE,Youngstown State
2762,51.7,15,8.0,4,116.4,64.9,24,0.639,54,0.537,...,25,0.0,0,"Beeghly Center, Youngstown, Ohio",ILLINOIS-CHICAGO,UIC,72.8,Home,YOUNGSTOWN-STATE,Youngstown State
3008,61.9,13,10.0,4,95.1,91.3,21,0.379,62,0.339,...,23,0.0,0,"Beeghly Center, Youngstown, Ohio",IUPUI,IUPUI,80.8,Home,YOUNGSTOWN-STATE,Youngstown State
5089,34.8,8,0.0,0,95.2,70.0,21,0.482,55,0.418,...,17,0.0,0,"Bank of Kentucky Center, Highland Heights, Ken...",NORTHERN-KENTUCKY,Northern Kentucky,63.4,Away,YOUNGSTOWN-STATE,Youngstown State
8013,40.0,8,18.6,8,100.0,61.7,29,0.435,54,0.37,...,16,0.0,0,"Beeghly Center, Youngstown, Ohio",Point Park\n\t\t\t,Point Park\n\t\t\t,71.6,Home,YOUNGSTOWN-STATE,Youngstown State
8014,35.5,11,8.3,2,98.5,63.3,19,0.565,62,0.5,...,15,0.0,0,"Binghamton University Events Center, Vestal, N...",BINGHAMTON,Binghamton,66.2,Away,YOUNGSTOWN-STATE,Youngstown State


**Mutate / Add Column**<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html

This example uses a *lambda* function, although assign also accepts a plain column calculation. For more information on lambdas please visit this link:<br>
https://realpython.com/python-lambda/

In [69]:
# Assign function
# Select a few columns to manipulate
new_df = df[["home_assists", "away_assists", "winning_name", "losing_name"]]

# Create total_assists column
new_df = new_df.assign(
    total_assists = lambda x: x.home_assists + x.away_assists
)

new_df.head()

Unnamed: 0,home_assists,away_assists,winning_name,losing_name,total_assists
0,12,12,Abilene Christian,Austin Peay,24
1,11,15,Abilene Christian,Omaha,26
2,19,9,Abilene Christian,Howard Payne\n\t\t\t,28
3,15,5,Abilene Christian,Tarleton State,20
4,7,6,Texas Tech,Abilene Christian,13
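
A column can also be added without assign by setting it directly with bracket notation. This sketch rebuilds the first three rows of the assist columns shown above so that it runs on its own.

```python
import pandas as pd

# First three rows of the assist columns from the output above
new_df = pd.DataFrame({
    "home_assists": [12, 11, 19],
    "away_assists": [12, 15, 9],
})

# Create total_assists by adding the two columns together
new_df["total_assists"] = new_df["home_assists"] + new_df["away_assists"]

print(new_df["total_assists"].tolist())  # [24, 26, 28]
```

One difference to keep in mind: bracket assignment edits the dataframe in place, while assign returns a new dataframe.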


**Group By / Summarize**<br>
Group By and Summarize are a combination of functions that can be used to get group totals, averages, minimums, maximums, etc.

In [88]:
# Calculate total home & away assists
# for each winning team. 

# Define list of columns to sum
sum_cols = ["home_assists", "away_assists"]

# Group by winning team & add sum
assist_df = new_df.groupby('winning_name')[sum_cols].sum()
assist_df.head()

winning_name,home_assists,away_assists
Abilene Christian,669,495
Air Force,146,98
Akron,376,306
Alabama,676,526
Alabama A&M,176,128
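
To get several summaries at once, such as totals, averages, minimums, & maximums, the agg function accepts a list of summary names. The data below is made up so the example can run on its own.

```python
import pandas as pd

# Hypothetical sample of games
games = pd.DataFrame({
    "winning_name": ["Akron", "Akron", "Alabama"],
    "home_assists": [12, 16, 9],
})

# One row per winning team, one column per summary
summary = games.groupby("winning_name")["home_assists"].agg(["sum", "mean", "min", "max"])

print(summary)
```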


**Merge / Join Data**<br>
Joining dataframes together is useful when pulling data from multiple sources. The link below will cover more examples.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html <br> <br>

The join logic is the same as you would find in sql - the link below reviews the types of joins. <br>
https://www.w3schools.com/sql/sql_join.asp


In [92]:
# Merge the assist_df to the new_df. 
# Duplicate column names get '_x' & '_y'
# suffixes to identify which dataframe
# duplicate columns come from.
new_df.merge(assist_df, how = 'left', left_on = 'winning_name', right_on = 'winning_name').head()

Unnamed: 0,home_assists_x,away_assists_x,winning_name,losing_name,total_assists,home_assists_y,away_assists_y
0,12,12,Abilene Christian,Austin Peay,24,669,495
1,11,15,Abilene Christian,Omaha,26,669,495
2,19,9,Abilene Christian,Howard Payne\n\t\t\t,28,669,495
3,15,5,Abilene Christian,Tarleton State,20,669,495
4,7,6,Texas Tech,Abilene Christian,13,464,298
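
The *how* argument controls the join type. This sketch compares an inner join with a left join on two small made-up dataframes; the team & conference values are hypothetical.

```python
import pandas as pd

# Hypothetical lookup tables
wins = pd.DataFrame({"team": ["A", "B", "C"], "wins": [10, 8, 5]})
conferences = pd.DataFrame({"team": ["A", "B", "D"], "conference": ["East", "West", "South"]})

# Inner join: keep only teams found in both dataframes
inner = wins.merge(conferences, how="inner", on="team")

# Left join: keep every team from wins, filling gaps with NaN
left_join = wins.merge(conferences, how="left", on="team")

print(inner)
print(left_join)
```

Team C has no match in the conferences dataframe, so it drops out of the inner join but stays in the left join with a missing conference.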


**Append / Union Data**

In [107]:
# Keep only unique rows of data. 
df = df.drop_duplicates()

# Select winning_name & losing_name columns
teams_df = df[["winning_name", "losing_name"]]

# Keep unique values for winning & losing names
winning_list = teams_df[["winning_name"]].drop_duplicates()
losing_list = teams_df[["losing_name"]].drop_duplicates()

# Rename columns to allow for appending
winning_list.columns = ['team_name']
losing_list.columns = ['team_name']

# Append team names & take final distinct lookup
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
teams_df = pd.concat([winning_list, losing_list]).drop_duplicates().reset_index(drop = True)
teams_df.head()

Unnamed: 0,team_name
0,Abilene Christian
1,Texas Tech
2,Arkansas
3,Sam Houston State
4,Central Arkansas



## Writing If Statements, Loops, & Functions in Python
Now that we've covered the basics of python, we'll start getting into how to control the flow of a program. If statements help programs follow logic & make "decisions". Loops & functions help with repetitive tasks - the examples covered in this section will help to explain the use cases for each tool. 

**If Statements**

If this happens do this, but if that happens do that. Similar to an IF statement in Excel, if statements control which code runs & can help make decisions in an automated fashion. <br>

The base setup is: <br>

**if** *criteria*: <br>
&emsp;*action* <br>
**else**: <br>
&emsp;*alternate action* <br><br>

There can also be tests for multiple criteria using **elif**: <br>


**if** *criteria*: <br>
&emsp;*action* <br>
**elif** *second criteria*: <br>
&emsp;*action 2* <br>
**else**: <br>
&emsp;*action 3* <br><br>

If statements can also be nested - the logic & structure stay the same.<br>
**if** *criteria*: <br>
&emsp;**if** *nested criteria*: <br>
&emsp;&emsp;*nested action 1* <br>
&emsp;**else**: <br>
&emsp;&emsp;*nested action 2* <br>
**else**: <br>
&emsp;*alternate action* <br><br>

In [1]:
# First example - single if statement
shopping_list = ['apple', 'orange', 'banana']

# Check if my shopping list is complete
if len(shopping_list) == 3:
    print('List is complete.')
else:
    print('Add more items to my list.')

List is complete.


In [2]:
# Second example - if / elif / else statement
shopping_list = ['apple', 'orange']

# Check if my shopping list is complete
if len(shopping_list) == 3:
    print('List is complete.')
elif len(shopping_list) == 2:
    print('Add one more item to my list.')
else: 
    print('Add multiple items to my list.')

Add one more item to my list.


In [3]:
# Third example - nested if statement
shopping_list = ['apple', 'orange', 'banana']

if len(shopping_list) == 3:
    if 'apple' in shopping_list:
        print('My list is complete. It includes apples.')
    else: 
        print('My list is complete. It does not include apples.')
elif len(shopping_list) == 2:
    print('Add one more item to my list.')
else: 
    print('Add multiple items to my list.')

My list is complete. It includes apples.


**Loops**<br>
There are a few types of loops in python - here we will focus on **for** and **while**. A **for** loop is most appropriate when there is a finite number of iterations to run through. For example - I want to print every item in my list. A **while** loop is more common when the number of iterations may not be known in advance, so you repeat an action until the loop condition is no longer true. Loops can be a little confusing for beginners (I was confused for several months), but they are extremely valuable to learn. They can replace many lines of repeated code with just a few. <br><br>

The base setup of a **for** loop is:<br>

**for** *iterator* **in** *iterable object*: <br>
&emsp;*action*

In [4]:
# For loop example
shopping_list = ['apple', 'orange', 'banana']

for item in shopping_list:
    print(item)

apple
orange
banana
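
A **while** loop follows a similar setup but keeps running until its condition is no longer true. Here is a minimal sketch, assuming I buy one item at a time until nothing is left on my list; the purchased list is my own addition for illustration.

```python
# While loop example
shopping_list = ['apple', 'orange', 'banana']
purchased = []

# Keep going as long as there are items left on the list
while len(shopping_list) > 0:
    # Remove the first item from the list & record it as purchased
    item = shopping_list.pop(0)
    purchased.append(item)
    print('Bought ' + item)

print(purchased)      # ['apple', 'orange', 'banana']
print(shopping_list)  # []
```

Be careful that the condition eventually becomes false - if nothing inside the loop changes the list, the loop would never stop.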
