# How to access the data


## Import needed libraries

In [1]:
import pandas as pd
import numpy as np

## Reading data from file
Based on the type of file you're reading from, use the appropriate Pandas method to read in the file. For a CSV file, that method is `pd.read_csv()`.
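As a rough illustration of matching the reader to the file type, the sketch below reads the same kind of data from CSV text and from JSON text. The in-memory strings stand in for files on disk and are assumptions made for the example; the two CSV rows are copied from the first rows of `Salaries.csv` shown later in this notebook.

```python
import io

import pandas as pd

# In-memory stand-ins for files on disk (assumed for illustration only).
csv_text = (
    "yearID,teamID,lgID,playerID,salary\n"
    "1985,ATL,NL,barkele01,870000\n"
    "1985,ATL,NL,bedrost01,550000\n"
)
json_text = '[{"yearID": 1985, "teamID": "ATL", "salary": 870000}]'

csv_df = pd.read_csv(io.StringIO(csv_text))     # comma-separated text
json_df = pd.read_json(io.StringIO(json_text))  # a JSON list of records

print(csv_df.shape)
print(json_df.shape)
```

With a real file you would pass the path instead of a `StringIO` object, for example `pd.read_csv('Salaries.csv')`.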



In [8]:
# load the data from Salaries.csv

salaries_df = pd.read_csv('Salaries.csv')

salaries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26428 entries, 0 to 26427
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   yearID    26428 non-null  int64 
 1   teamID    26428 non-null  object
 2   lgID      26428 non-null  object
 3   playerID  26428 non-null  object
 4   salary    26428 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 1.0+ MB


# Basic dataframe information

## Dataframe Info

###  [`df.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) and  [`df.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)


---


`df.shape`

Returns a tuple representing the dimensionality of a DataFrame or Series. This tuple provides the number of rows and columns for a DataFrame, or the number of elements for a Series.


---


`.info()`

Provides a concise summary of a DataFrame. This summary is printed to the console or a specified buffer and includes:
* **Index data type**: Information about the DataFrame's index.
* **Column information**: For each column:
    * Column name.
    * Number of **non-null** values, which helps identify missing data.
    * Data type (Dtype) of the column (e.g., int64, float64, object for strings or mixed types).
* **Memory usage**: The amount of memory consumed by the DataFrame.
* **Overall data types summary**: A summary of the different data types present in the DataFrame.
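A minimal sketch of both calls on a small stand-in DataFrame. The name `toy_df` is an assumption for this example, and its three rows are taken from the first rows of the salaries data.

```python
import pandas as pd

# Small stand-in DataFrame (assumed for illustration only).
toy_df = pd.DataFrame({
    "yearID": [1985, 1985, 1985],
    "teamID": ["ATL", "ATL", "ATL"],
    "salary": [870000, 550000, 545000],
})

print(toy_df.shape)            # (rows, columns) for a DataFrame
print(toy_df["salary"].shape)  # a one-element tuple for a Series
toy_df.info()                  # prints the summary and returns None
```

Note that `shape` is an attribute, so it takes no parentheses, while `info()` is a method call.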





In [None]:
# use each method to display information about the salaries DataFrame



## Getting a glimpse of the Dataframe


### [`df.head( )`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html),  [`df.tail( )`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html), and [`df.sample( )`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)

`df.head()`

A Pandas method used to retrieve the first **n** rows of a DataFrame or Series. It provides a quick way to inspect the beginning of your data.

* **Default behavior**: If no argument is provided, `head()` returns the first 5 rows by default.
* **Customizable**: You can specify the number of rows to return by passing an integer argument n to the method, e.g., `df.head(10)` to get the first 10 rows.
* **Use cases**: It is commonly used for:
    * Quickly checking the structure and content of a DataFrame or Series after loading or manipulation.
    * Verifying that data has been loaded correctly.
    * Inspecting column names and data types in the initial records.


---


`df.tail()`

A Pandas method used to return the last **n** rows of a DataFrame or Series. *By default, if no argument is provided, it returns the last 5 rows.*


---

`df.sample()`

A method in Pandas used to generate a random sample of rows or columns from a DataFrame or Series. This is particularly useful when working with large datasets where analyzing the entire dataset is computationally expensive or unnecessary. *By default, if no argument is provided, it returns 1 randomly chosen row*.
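The three methods can be compared side by side on a small stand-in DataFrame. `toy_df` and its single column `n` are assumptions for this sketch; `random_state` is passed to `sample()` only so that the random draw is repeatable.

```python
import pandas as pd

# Stand-in DataFrame with 20 rows (assumed for illustration only).
toy_df = pd.DataFrame({"n": range(20)})

first_five = toy_df.head()                 # default: first 5 rows
last_three = toy_df.tail(3)                # last 3 rows
picked = toy_df.sample(4, random_state=0)  # 4 random rows, repeatable draw

print(first_five)
print(last_three)
print(picked)
```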


In [13]:
# use each method to inspect the first n rows and last n rows of the dataframe

# salaries_df.head(10)
# salaries_df.tail(10)
salaries_df.sample(10)

Unnamed: 0,yearID,teamID,lgID,playerID,salary
20173,2009,LAA,AL,mathije01,450000
8560,1996,CHA,AL,snopech01,125000
24536,2014,PIT,NL,sanchga01,2300000
2158,1988,HOU,NL,medvisc01,62500
16798,2005,KCA,AL,berroan01,500000
16873,2005,LAN,NL,repkoja01,316000
26095,2016,NYA,AL,cessalu01,507500
13036,2000,TBA,AL,rollsda01,200000
7750,1995,DET,AL,steveto01,109000
9957,1997,OAK,AL,johnsda05,172500


# Indexing, Selecting, and Assigning


Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

## Accessing Columns
In Python, we can access the property of an object by accessing it as an attribute. A `book` object, for example, might have a `title` property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way.

1. As an attribute -> `df.col_name`

If we have a Python dictionary, we can access its values using the indexing (`[]`) operator. We can do the same with columns in a DataFrame:

2. As a dictionary key -> `df["col_name"]`
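A short sketch of the two access styles. The column `team name` is a hypothetical name chosen to show a case where only bracket access works.

```python
import pandas as pd

# Stand-in DataFrame; "team name" is a hypothetical column containing a space.
toy_df = pd.DataFrame({
    "salary": [870000, 550000],
    "team name": ["ATL", "ATL"],
})

by_attribute = toy_df.salary     # attribute access
by_brackets = toy_df["salary"]   # bracket access, same result
spaced = toy_df["team name"]     # `toy_df.team name` would be a syntax error

print(by_attribute.equals(by_brackets))
print(spaced)
```

Bracket access is also the safer choice when a column name matches an existing DataFrame attribute or method, such as `count`.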



In [17]:
# attribute access works only when the column name has no spaces

salaries_df.salary

0          870000
1          550000
2          545000
3          633333
4          625000
           ...   
26423    10400000
26424      524000
26425      524900
26426    21733615
26427    14000000
Name: salary, Length: 26428, dtype: int64

In [18]:
# bracket access always works, and is required if there is a space in the name

salaries_df['salary']

0          870000
1          550000
2          545000
3          633333
4          625000
           ...   
26423    10400000
26424      524000
26425      524900
26426    21733615
26427    14000000
Name: salary, Length: 26428, dtype: int64

## Indexing in Pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem, which makes them easy for a novice to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using.



### Index-Based Selection

Single Row/Col
```
df.iloc[row, col]
```
Multiple Rows/Cols
```
df.iloc[row_start:[row_stop], col_start:[col_stop]]
```
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

Both `loc` and `iloc` are row-first, column-second. This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns.

To select the first row of data in a DataFrame, we may use the following:

In [24]:
# get the first row of the dataframe

salaries_df.iloc[0]

yearID           1985
teamID            ATL
lgID               NL
playerID    barkele01
salary         870000
Name: 0, dtype: object

To get a column with `iloc`, we can do the following:

In [27]:
# get all rows of the first column

salaries_df.iloc[:, 0]

0        1985
1        1985
2        1985
3        1985
4        1985
         ... 
26423    2016
26424    2016
26425    2016
26426    2016
26427    2016
Name: yearID, Length: 26428, dtype: int64

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.

In [29]:
# select the first 3 rows of the 'salary' column

salaries_df.iloc[:3, 4]

0    870000
1    550000
2    545000
Name: salary, dtype: int64

It's also possible to pass a list:

In [31]:
# select rows 0, 100, and 2000 of the 'salary' column, using a list of row positions

salaries_df.iloc[[0, 100, 2000], 4]

0       870000
100     727500
2000     62500
Name: salary, dtype: int64

Finally, it's worth knowing that negative numbers can be used in selection. This will start counting backwards from the end of the values.
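A minimal sketch of negative positions with `iloc`, using a five-row stand-in DataFrame (`toy_df` is an assumption for this example; the values are the first five salaries in the dataset).

```python
import pandas as pd

# Stand-in DataFrame (assumed for illustration only).
toy_df = pd.DataFrame({"salary": [870000, 550000, 545000, 633333, 625000]})

last_row = toy_df.iloc[-1]   # the last row, returned as a Series
last_two = toy_df.iloc[-2:]  # the last two rows, returned as a DataFrame

print(last_row)
print(last_two)
```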

In [None]:
# get the last 5 rows of the dataframe



### Label-based Selection

The second paradigm for selection is the one followed by the `loc` operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters. The start and stop values will be the row (index) labels and column names rather than their integer positions.

In [None]:
# get the first entry in 'playerID'

player = salaries_df.loc[0, 'playerID']
player

'barkele01'

`iloc` is conceptually simpler than `loc` because it ignores the dataset's indices. When we use `iloc` we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. `loc`, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead.
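To make the contrast concrete, here is a sketch with a stand-in DataFrame whose index holds player IDs instead of the default 0, 1, 2, ... (`toy_df` is an assumption for this example; the rows come from the start of the salaries data).

```python
import pandas as pd

# Stand-in DataFrame indexed by player ID (assumed for illustration only).
toy_df = pd.DataFrame(
    {"teamID": ["ATL", "ATL", "ATL"], "salary": [870000, 550000, 545000]},
    index=["barkele01", "bedrost01", "benedbr01"],
)

by_label = toy_df.loc["bedrost01", "salary"]  # row label, column name
by_position = toy_df.iloc[1, 1]               # row position, column position

print(by_label, by_position)
```

Both lines retrieve the same cell; the `loc` version states what is being asked for, while the `iloc` version depends on where the row and column happen to sit.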

In [38]:
# get all the rows from the last 3 columns

salaries_df.loc[:, "lgID": "salary"].sample(10)

Unnamed: 0,lgID,playerID,salary
4912,AL,darwida01,3350000
568,NL,olwined01,60000
1389,AL,whitede03,70000
23401,AL,peraljh01,6000000
20644,NL,ryalru01,401000
19933,AL,jenksbo01,5600000
23514,NL,uribeju01,7295911
16342,NL,perezne01,2750000
5306,AL,ramosjo01,109000
12787,AL,ledeeri01,240000


### Choosing between `loc` and `iloc`

When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

`iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. `loc`, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than it is to index something like `df.loc['Apples':'Potatoet']` (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.
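The difference in endpoints can be checked on small stand-in data. Both DataFrames below are assumptions made for this sketch, including the produce names in `produce_df`.

```python
import pandas as pd

# Stand-in DataFrame with the default integer index 0..19 (assumed).
toy_df = pd.DataFrame({"value": range(20)})

by_position = toy_df.iloc[0:10]  # positions 0 through 9
by_label = toy_df.loc[0:10]      # labels 0 through 10, endpoint included

print(len(by_position), len(by_label))

# Stand-in DataFrame with a sorted string index (hypothetical labels).
produce_df = pd.DataFrame(
    {"count": [3, 5, 2, 7]},
    index=["Apples", "Bananas", "Potatoes", "Radishes"],
)
apples_to_potatoes = produce_df.loc["Apples":"Potatoes"]

print(apples_to_potatoes)
```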

Otherwise, the semantics of using `loc` are the same as those for `iloc`.

## Changing the Index

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The `set_index()` method can be used to do this:

        df.set_index('col_name')

In [40]:
# Change the index of the salaries df to be the 'playerID'
salaries_df.set_index('playerID')
salaries_df

Unnamed: 0,yearID,teamID,lgID,playerID,salary
0,1985,ATL,NL,barkele01,870000
1,1985,ATL,NL,bedrost01,550000
2,1985,ATL,NL,benedbr01,545000
3,1985,ATL,NL,campri01,633333
4,1985,ATL,NL,ceronri01,625000
...,...,...,...,...,...
26423,2016,WAS,NL,strasst01,10400000
26424,2016,WAS,NL,taylomi02,524000
26425,2016,WAS,NL,treinbl01,524900
26426,2016,WAS,NL,werthja01,21733615


This is useful if you can come up with an index for the dataset which is better than the current one.





# Temporary vs Permanent Change

Notice that when you display the dataframe, the change didn't last. Many Pandas methods, `set_index()` included, return a new dataframe with the change applied and leave the original untouched. To make the change permanent you have two choices:

## Make a copy of the original Dataframe

If you decide that you need to keep the original dataframe, make a **deep copy** of it first, then make changes to the deep copy. A deep copy creates a completely independent new DataFrame in memory, including copies of both the data and the index. Modifications made to the copied DataFrame will not affect the original, and vice versa. This is the default behavior when `deep=True` or when the `deep` parameter is omitted.

```
    df.copy(deep=True)    # deep=True is the default so not necessary
```
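A small sketch of that independence: a change made to the copy does not show up in the original. The names `original` and `deep` are assumptions for this example, and the value `1` is an arbitrary placeholder.

```python
import pandas as pd

# Stand-in DataFrame (assumed for illustration only).
original = pd.DataFrame({"salary": [870000, 550000]})

deep = original.copy()     # deep=True by default
deep.loc[0, "salary"] = 1  # arbitrary placeholder value, changes only the copy

print(original.loc[0, "salary"])
print(deep.loc[0, "salary"])
```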



In [None]:
# make a copy of the current dataframe and set the indices to the playerID

salaries_df_copy1 = salaries_df.copy()

salaries_df_copy1.set_index('playerID', inplace=True)

salaries_df_copy1

playerID,yearID,teamID,lgID,salary
barkele01,1985,ATL,NL,870000
bedrost01,1985,ATL,NL,550000
benedbr01,1985,ATL,NL,545000
campri01,1985,ATL,NL,633333
ceronri01,1985,ATL,NL,625000
...,...,...,...,...
strasst01,2016,WAS,NL,10400000
taylomi02,2016,WAS,NL,524000
treinbl01,2016,WAS,NL,524900
werthja01,2016,WAS,NL,21733615


## Method Keyword Parameter

If a method makes a change that is not persistent, it often has a parameter that can be used to make it so. To make the change to our index values persistent we can add the `inplace=True` parameter.

**Note:** When choosing this option you will lose access to the original dataframe.
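The sketch below contrasts the two behaviors of `set_index()` on a stand-in DataFrame (`toy_df`, `returned`, and `result` are names assumed for this example).

```python
import pandas as pd

# Stand-in DataFrame (assumed for illustration only).
toy_df = pd.DataFrame({
    "playerID": ["barkele01", "bedrost01"],
    "salary": [870000, 550000],
})

# Without inplace: a new DataFrame comes back and toy_df is unchanged.
returned = toy_df.set_index("playerID")
still_a_column = "playerID" in toy_df.columns

# With inplace=True: toy_df itself is modified and the call returns None.
result = toy_df.set_index("playerID", inplace=True)

print(still_a_column, result)
print(toy_df)
```

Assigning the returned DataFrame back to the same name, as in `toy_df = toy_df.set_index("playerID")`, is another common way to keep the change.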

In [None]:
# make a copy of the current dataframe and set the indices to the 'teamID'



# Conditional Selection

So far we've been indexing the data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

In [None]:
# get the salary information for the Chicago Cubs teamID -> 'CHN'

salaries_df.teamID == 'CHN'



0        False
1        False
2        False
3        False
4        False
         ...  
26423    False
26424    False
26425    False
26426    False
26427    False
Name: teamID, Length: 26428, dtype: bool

## Simple Conditionals

This operation produced a Series of True/False booleans based on the `teamID` of each record. This result can then be used inside of `loc` to select the relevant data.

In [45]:
# get the salary information for the Boston Red Sox (teamID 'BOS') using loc

salaries_df.loc[salaries_df.teamID == 'BOS']

Unnamed: 0,yearID,teamID,lgID,playerID,salary
44,1985,BOS,AL,armasto01,915000
45,1985,BOS,AL,barrema02,272500
46,1985,BOS,AL,boggswa01,1000000
47,1985,BOS,AL,bordiri01,115000
48,1985,BOS,AL,boydoi01,177500
...,...,...,...,...,...
25683,2016,BOS,AL,ueharko01,9000000
25684,2016,BOS,AL,vazquch01,513000
25685,2016,BOS,AL,workmbr01,539500
25686,2016,BOS,AL,wrighst01,514500


This DataFrame has ~900 rows. The original had ~26,000. That means that around 3% of salary data is from the Boston Red Sox.



## Compound Conditionals with `and`

Suppose we further wanted to know whose salaries were between 100,000 **and** 500,000 dollars. We can use the **ampersand** (`&`) to bring the two questions together. Each condition must be wrapped in its own parentheses.
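As a sketch of why `&` is used rather than Python's `and` keyword: `&` combines two boolean Series element by element, while `and` asks each Series for a single True/False value, which pandas refuses with a `ValueError`. The stand-in DataFrame and the variable names below are assumptions for this example; the rows are taken from the salaries data.

```python
import pandas as pd

# Stand-in DataFrame (assumed for illustration only).
toy_df = pd.DataFrame({
    "teamID": ["ATL", "BOS", "BOS", "BOS"],
    "playerID": ["barkele01", "barrema02", "boggswa01", "bordiri01"],
    "salary": [870000, 272500, 1000000, 115000],
})

is_bos = toy_df.teamID == "BOS"
in_range = (toy_df.salary > 100000) & (toy_df.salary < 500000)
selected = toy_df.loc[is_bos & in_range]  # element-wise "and"

# The `and` keyword needs one True/False per operand, so this raises.
try:
    toy_df.loc[is_bos and in_range]
    and_worked = True
except ValueError:
    and_worked = False

print(selected)
print(and_worked)
```

Naming the individual masks first, as done here, can also make long conditions easier to read.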

In [49]:
# Find the Red Sox players whose salaries were between $100000 and $500000

salaries_df.loc[(salaries_df.teamID == 'BOS') & ((salaries_df.salary > 100000) & (salaries_df.salary < 500000))]

Unnamed: 0,yearID,teamID,lgID,playerID,salary
45,1985,BOS,AL,barrema02,272500
47,1985,BOS,AL,bordiri01,115000
48,1985,BOS,AL,boydoi01,177500
51,1985,BOS,AL,clemero02,140000
52,1985,BOS,AL,crawfst01,160000
...,...,...,...,...,...
22384,2012,BOS,AL,bowdemi01,484000
22386,2012,BOS,AL,carpech02,482000
22388,2012,BOS,AL,doubrfe01,484000
22393,2012,BOS,AL,kalisry01,483000


## Compound Conditionals with `or`

Suppose we'll buy any player from the American League with a salary less than 100,000 **or** greater than 10,000,000. For an "or" condition we use a **pipe** (`|`):

In [50]:
# find all players in the American League, regardless of team, that made less
# than $100,000 or more than $10,000,000

salaries_df.loc[(salaries_df.lgID == 'AL') & ((salaries_df.salary < 100000) | (salaries_df.salary > 10000000))]

Unnamed: 0,yearID,teamID,lgID,playerID,salary
40,1985,BAL,AL,sheetla01,60000
74,1985,CAL,AL,clibust02,60000
85,1985,CAL,AL,mccaski01,60000
105,1985,CHA,AL,guilloz01,60000
185,1985,DET,AL,castima02,76000
...,...,...,...,...,...
26380,2016,TOR,AL,dickera01,12000000
26381,2016,TOR,AL,donaljo02,11650000
26383,2016,TOR,AL,estrama01,11500000
26389,2016,TOR,AL,martiru01,15000000


## Built-in selector `.isin()`

Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is `isin`. `isin` lets you select data whose value "is in" a list of values.

```
      df.loc[df.col_name.isin([sequence of selectors])]
```



In [51]:
# Get the salary data for the teams from the National League Central division
# CHN, CIN, SLN, MIL, PIT

salaries_df.loc[salaries_df.teamID.isin(['CHN', 'CIN', 'SLN', 'MIL', 'PIT'])]

Unnamed: 0,yearID,teamID,lgID,playerID,salary
118,1985,CHN,NL,bosleth01,215000
119,1985,CHN,NL,brusswa01,375000
120,1985,CHN,NL,ceyro01,1450000
121,1985,CHN,NL,davisjo02,550000
122,1985,CHN,NL,dernibo01,406250
...,...,...,...,...,...
26311,2016,SLN,NL,tejadru01,1500000
26312,2016,SLN,NL,wachami01,539000
26313,2016,SLN,NL,wainwad01,19500000
26314,2016,SLN,NL,waldejo01,3675000


## Built-in selector `.isnull()`

The second is `isnull` (and its companion `notnull`). These methods let you highlight values which are (or are not) empty (`NaN`).
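A minimal sketch of the pair on stand-in data. `toy_df` is an assumption for this example, and the missing salary in the second row was inserted on purpose so that there is something to find.

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame; the NaN is deliberately planted for illustration.
toy_df = pd.DataFrame({
    "playerID": ["barkele01", "bedrost01", "benedbr01"],
    "salary": [870000, np.nan, 545000],
})

missing = toy_df.loc[toy_df.salary.isnull()]   # rows where salary is NaN
present = toy_df.loc[toy_df.salary.notnull()]  # rows where salary has a value

print(missing)
print(present)
```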

In [58]:
# Set the first five salaries to NaN for demonstration, then find all rows
# in the salary column that have null values
salaries_df.loc[0:4, 'salary'] = np.nan
salaries_df.loc[salaries_df.salary.isnull()]

Unnamed: 0,yearID,teamID,lgID,playerID,salary
0,1985,ATL,NL,barkele01,
1,1985,ATL,NL,bedrost01,
2,1985,ATL,NL,benedbr01,
3,1985,ATL,NL,campri01,
4,1985,ATL,NL,ceronri01,


# Assigning Data

Going the other way, assigning data to a DataFrame is easy. Using `loc` you can assign either a constant value:

In [57]:
# Assign a constant (NaN) to the salary of the first five players, labels 0 through 4

salaries_df.loc[0:4, 'salary'] = np.nan
salaries_df

Unnamed: 0,yearID,teamID,lgID,playerID,salary
0,1985,ATL,NL,barkele01,
1,1985,ATL,NL,bedrost01,
2,1985,ATL,NL,benedbr01,
3,1985,ATL,NL,campri01,
4,1985,ATL,NL,ceronri01,
...,...,...,...,...,...
26423,2016,WAS,NL,strasst01,10400000.0
26424,2016,WAS,NL,taylomi02,524000.0
26425,2016,WAS,NL,treinbl01,524900.0
26426,2016,WAS,NL,werthja01,21733615.0


Or with an iterable of values:
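A sketch of assigning from an iterable, assuming a three-row stand-in DataFrame. The column name `row_number` and the two replacement salaries are hypothetical values chosen for the example. The iterable must supply one value per selected row.

```python
import pandas as pd

# Stand-in DataFrame (assumed for illustration only).
toy_df = pd.DataFrame({
    "playerID": ["barkele01", "bedrost01", "benedbr01"],
    "salary": [870000, 550000, 545000],
})

# New column from an iterable: one value per row.
toy_df["row_number"] = range(len(toy_df))

# Existing cells from a list: labels 0 through 1 are two rows, so two values.
toy_df.loc[0:1, "salary"] = [900000, 600000]  # hypothetical new salaries

print(toy_df)
```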

In [None]:
# add a new column with values from 0 to length of df - 1

