# Data Manipulation in Python
Whether you want to be a data scientist, data engineer, or just automate some boring tasks with Python, you'll spend a large portion of your time manipulating data.<br>

Data manipulation can take many forms, but in essence it means converting data from one format to another. Imagine your boss sends you a monthly report of sales numbers. Your company is global, but you're tasked with finding the sales rep with the most sales in North America. What steps would you take to find this information? Maybe something like this:

1. Filter the file for the North America region
2. Add a look-up of distinct sales reps
3. Add a SUMIF to determine the total sales by sales rep
4. Sort the total sales from highest to lowest

All of these steps are manipulating the raw data to get the final answer. <br>
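
As a rough sketch, the four steps above could look like this in pandas. The `sales` table, its column names, and its numbers are invented for illustration; they are not part of the course data.

```python
import pandas as pd

# Hypothetical sales report (names and numbers are made up for this sketch)
sales = pd.DataFrame({
    'region': ['North America', 'North America', 'Europe', 'North America'],
    'sales_rep': ['Joe', 'John', 'Ana', 'Joe'],
    'units': [100, 150, 300, 200],
})

# 1. Filter for the North America region
north_america = sales[sales['region'] == 'North America']

# 2 & 3. Group by sales rep (the distinct look-up) and sum each rep's units (the SUMIF)
totals = north_america.groupby('sales_rep', as_index=False)['units'].sum()

# 4. Sort the totals from highest to lowest
ranked = totals.sort_values('units', ascending=False).reset_index(drop=True)

# The top rep is now in the first row
top_rep = ranked.loc[0, 'sales_rep']
print(ranked)
print('Top rep:', top_rep)
```

Each of these operations (filter, group by, sort) is one of the actions covered in this section.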

Section 03 showed how to load a CSV file into Python. When data is loaded with pandas, it is stored in a special data structure: a DataFrame. A DataFrame is a table made up of rows and columns. Much like an Excel table, a column typically represents an attribute or variable, while a row represents a record: <br>

| Rep ID | Rep Name | Month | Sales (units)|
| --- | --- | --- | --- | 
| 1aa | Joe | January | 100 |
| 1aa | Joe | February | 200 |
| 1bb | John | January | 150 |
| 1bb | John | February | 150 |

In this table, we have 4 records & 4 columns describing each record. <br>
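
To make this concrete, here is a minimal sketch that builds the same table by hand and checks its size. The variable name `reps` is just for this example.

```python
import pandas as pd

# Recreate the example table above as a DataFrame
reps = pd.DataFrame({
    'Rep ID': ['1aa', '1aa', '1bb', '1bb'],
    'Rep Name': ['Joe', 'Joe', 'John', 'John'],
    'Month': ['January', 'February', 'January', 'February'],
    'Sales (units)': [100, 200, 150, 150],
})

# shape is (number of records, number of columns)
n_rows, n_cols = reps.shape
print(n_rows, n_cols)
print(list(reps.columns))
```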

Pandas is a very popular package, so there is extensive documentation online and plenty of answered questions on Stack Overflow. Other structures can hold similar data, such as dictionaries or matrices, and file formats such as Parquet store tables on disk. Which one to choose depends on your program's needs. In general, a pandas DataFrame is a good place to start. <br>
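
For comparison, here is a small sketch of the same records held in plain Python dictionaries and in a DataFrame. The field names and values are illustrative only.

```python
import pandas as pd

# Two records stored as plain Python dictionaries
records = [
    {'rep_name': 'Joe', 'month': 'January', 'units': 100},
    {'rep_name': 'John', 'month': 'January', 'units': 150},
]

# Plain Python: loop over the dictionaries to total the units
total_from_dicts = sum(r['units'] for r in records)

# pandas: build a DataFrame from the same records and sum the column
records_df = pd.DataFrame(records)
total_from_df = int(records_df['units'].sum())

# A DataFrame can be converted back to a list of dictionaries
round_trip = records_df.to_dict(orient='records')

print(total_from_dicts, total_from_df)
```

Both approaches give the same total; the DataFrame version tends to pay off as the number of columns and operations grows.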

We'll cover these common data manipulation actions:

- Select / Drop
- Filter
- Distinct
- Order By
- Mutate
- Group By & Summarize
- Merge
- Append / Union

Let's load the data and get started. 

In [4]:
# Import pandas
import pandas as pd

# Define the file path
file_path = '../../2022-fall-python-tutorial/data/2022_boxscores.csv' # '../..' moves up two folders from the current one so python can find the file

# Load file
df = pd.read_csv(file_path)

# Print first 3 rows of data
df.head(3)

Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
0,50.0,11,15.4,6,98.6,51.3,20,0.421,57,0.386,...,17,0.0,0,"Jon M. Huntsman Center, Salt Lake City, Utah",ABILENE-CHRISTIAN,Abilene Christian,71.0,Home,UTAH,Utah
1,68.8,22,0.0,0,101.3,67.7,21,0.507,73,0.438,...,15,0.0,0,"Reed Arena, College Station, Texas",ABILENE-CHRISTIAN,Abilene Christian,64.3,Home,TEXAS-AM,Texas A&M
2,59.1,13,0.0,0,86.6,76.7,23,0.381,67,0.328,...,18,0.0,0,"College Park Center, Arlington, Texas",TEXAS-ARLINGTON,UT Arlington,72.6,Away,ABILENE-CHRISTIAN,Abilene Christian
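
After loading a file, it usually helps to check what you got. The sketch below uses a small in-memory CSV as a stand-in for the box score file (three columns copied from the rows shown above), so it runs without the data folder; `sample` and `csv_text` are names made up for this example.

```python
import io
import pandas as pd

# Stand-in for the box score file: three columns from the rows shown above
csv_text = """losing_name,pace,winning_name
Abilene Christian,71.0,Utah
Abilene Christian,64.3,Texas A&M
UT Arlington,72.6,Abilene Christian
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column names
print(sample.dtypes)            # data type of each column
```

The same three calls work on the full `df` loaded above.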


### Selecting / Dropping Data
After loading the box score DataFrame, notice the number on the far left of each row: 0, 1, 2, ... This is called the **index**. It labels each row, which helps pandas perform data manipulation efficiently and also lets you select certain rows of data. <br>
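
You can look at the index directly. This sketch uses a tiny stand-in DataFrame (`games`, a name made up here) with two columns from the box score data, so it runs on its own.

```python
import pandas as pd

# Small stand-in DataFrame using two columns from the box score data
games = pd.DataFrame({
    'winning_name': ['Utah', 'Texas A&M', 'Abilene Christian'],
    'losing_name': ['Abilene Christian', 'Abilene Christian', 'UT Arlington'],
})

# With no index specified, pandas numbers the rows 0, 1, 2, ...
print(games.index)
print(list(games.index))   # [0, 1, 2]
```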

**Select** a single column of a dataframe using: 
- df['column_name']

**Select** multiple columns of a dataframe using:
- df[['column_name_1', 'column_name_2']]

Notice that selecting a single column by name uses single brackets '[', while selecting multiple columns uses double brackets '[['. The inner pair of brackets is a Python list of column names.
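
The bracket style also changes what you get back. A minimal sketch, again using a small stand-in DataFrame (`games` is a made-up name) rather than the full file:

```python
import pandas as pd

games = pd.DataFrame({
    'winning_name': ['Utah', 'Texas A&M', 'Abilene Christian'],
    'losing_name': ['Abilene Christian', 'Abilene Christian', 'UT Arlington'],
})

single = games['winning_name']                    # one column name -> Series
double = games[['winning_name', 'losing_name']]   # list of names -> DataFrame
one_col_frame = games[['winning_name']]           # a one-item list still gives a DataFrame

print(type(single).__name__)         # Series
print(type(double).__name__)         # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```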

In [14]:
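# Select the winning_name column in df (first 3 rows).
# A notebook cell only displays its last expression, so this result is not
# shown in the output below; wrap the line in print() to see it.
print('Winning Names: ')
df['winning_name'].head(3)

# Select the winning_name AND losing_name columns in df (first 3 rows)
print('Winning Names AND Losing Names: ')
df[['winning_name', 'losing_name']].head(3)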

Winning Names: 
Winning Names AND Losing Names: 


|  | winning_name | losing_name |
| --- | --- | --- |
| 0 | Utah | Abilene Christian |
| 1 | Texas A&M | Abilene Christian |
| 2 | Abilene Christian | UT Arlington |


**Select** a single row of a dataframe using: 
- df.iloc[row_num]

**Select** multiple rows of a dataframe using: 
- df.iloc[row_start:row_end]

The *row_end* is not inclusive: the selection runs up to, but does **not** include, the row at that position.

**Important** - the row index starts at 0 by default! Always remember that when selecting rows. 
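
Here is a small sketch of both rules, using a stand-in DataFrame (`games` is a made-up name) so the results are easy to check by eye:

```python
import pandas as pd

games = pd.DataFrame({
    'winning_name': ['Utah', 'Texas A&M', 'Abilene Christian'],
    'losing_name': ['Abilene Christian', 'Abilene Christian', 'UT Arlington'],
})

first_row = games.iloc[0]   # position 0 is the 1st row; a single row comes back as a Series
middle = games.iloc[1:3]    # positions 1 and 2; position 3 is excluded

print(first_row['winning_name'])   # Utah
print(len(middle))                 # 2
print(list(middle.index))          # [1, 2]
```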

In [19]:
# Select 1st row of data.
# Only the last expression in a cell is displayed, so wrap this line in print() to see it.
print('1st row of data: ')
df.iloc[0]

# Select 2nd & 3rd rows of data
print('2nd & 3rd rows of data: ')
df.iloc[1:3] # remember that 1 is the second row - index starts at 0. 

1st row of data: 
2nd & 3rd rows of data: 


Unnamed: 0,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,...,home_two_point_field_goals,home_win_percentage,home_wins,location,losing_abbr,losing_name,pace,winner,winning_abbr,winning_name
1,68.8,22,0.0,0,101.3,67.7,21,0.507,73,0.438,...,15,0.0,0,"Reed Arena, College Station, Texas",ABILENE-CHRISTIAN,Abilene Christian,64.3,Home,TEXAS-AM,Texas A&M
2,59.1,13,0.0,0,86.6,76.7,23,0.381,67,0.328,...,18,0.0,0,"College Park Center, Arlington, Texas",TEXAS-ARLINGTON,UT Arlington,72.6,Away,ABILENE-CHRISTIAN,Abilene Christian
