<a href="https://colab.research.google.com/github/MonkeyWrenchGang/MGTPython/blob/main/module_2/2_pandas_follow_along.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2. Pandas Follow Along


---



Join me on a thrilling journey as we explore the versatile world of Pandas. In this follow-along we'll dive into the following features:

1. Effortlessly import data from a CSV file
2. Visually inspect and analyze the data
3. Utilize the power of Pandas to gain valuable insights through descriptive statistics
4. Master the art of working with dates in Pandas
5. Filter rows using the classic df[df["column"] == "value"] method and the more advanced df.query('EXPRESSION')
  - **spend time understanding the query method!**
6. Select specific columns with ease using the [[]] operator
7. Experience the simplicity of sorting data with df.sort_values()
8. Transpose your data with ease using df.T
9. Learn to leverage basic column functions such as .mean(), .min(), .max(), and .median()
10. Create visually stunning histograms and line plots using Pandas.

### Let's begin!



## Import Libraries 


---
Before we dive into the exciting world of Pandas, let's set the stage by importing the necessary libraries and configuring our environment. With all the tools at our disposal and the optimal settings in place, we'll be ready to tackle any challenge that comes our way.



In [2]:
# -- notebook options -- 
from IPython.display import display, HTML
from IPython.display import clear_output
display(HTML("<style>.container { width:90% }</style>"))
import warnings
warnings.filterwarnings('ignore')
# ------------------------------------------------------------------

# -- key libraries --
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# -- need this to render charts in notebook -- 
%matplotlib inline

In [4]:
## Read SPY data 
spy = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_2/data/SPY.csv")
spy.head()

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
0,SPY,2019-01-02,245.979996,251.210007,245.949997,250.179993,126925200,234.061646
1,SPY,2019-01-03,248.229996,248.570007,243.669998,244.210007,144140700,228.476257
2,SPY,2019-01-04,247.589996,253.110001,247.169998,252.389999,142628800,236.129257
3,SPY,2019-01-07,252.690002,255.949997,251.690002,254.380005,103139100,237.991043
4,SPY,2019-01-08,256.820007,257.309998,254.0,256.769989,102512600,240.227051


## Clean up Column Names 

One simple yet powerful tip that will make your Pandas experience smoother is to clean up your column names. This includes making them all lowercase (a-z), removing special characters, and replacing spaces with underscores. This will ensure that dealing with columns is consistent and effortless. Trust me, this one small step will save you a lot of headaches in the long run.

Let's see how to do this. Copy and paste the following code:

```python
## Clean up Column Names
spy.columns = ( spy.columns
    .str.strip()           # -- remove leading / trailing spaces 
    .str.lower()           # -- lower case column names 
    .str.replace(' ', '_', regex=False) # -- replace spaces with underscore 
    .str.replace('-', '_', regex=False) # -- replace dash with underscore 
    .str.replace('(', '', regex=False)  # -- remove open paren (regex=False treats it as a literal character)
    .str.replace(')', '', regex=False)  # -- remove close paren
    .str.replace('?', '', regex=False)  # -- remove question mark 
    .str.replace('\'', '', regex=False) # -- remove single quote; the backslash \ is an escape character
)

print(spy.columns)
spy.head()

```
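To see what the clean-up chain does without touching the SPY data, here is a minimal sketch on a made-up set of messy column names. The names in `demo` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical messy column names, invented for this illustration only
demo = pd.DataFrame(columns=["  Trade Date ", "Adj-Close", "Volume (shares)", "Is Open?"])

demo.columns = (
    demo.columns
    .str.strip()                          # remove leading / trailing spaces
    .str.lower()                          # lower case
    .str.replace(" ", "_", regex=False)   # spaces -> underscore
    .str.replace("-", "_", regex=False)   # dashes -> underscore
    .str.replace("(", "", regex=False)    # drop open paren
    .str.replace(")", "", regex=False)    # drop close paren
    .str.replace("?", "", regex=False)    # drop question mark
)

print(list(demo.columns))
```

After this runs, every column name is lowercase, has no spaces, and can be used directly inside `df.query()`.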


## Create a few new columns 


---



Now that our data is loaded and organized, we can start creating some new columns. One useful technique is the .shift() function, which moves the values of a column, for example the adjusted close, up or down by a number of rows. Dividing each day's value by the previous day's (shifted) value and subtracting 1 gives us a daily return. Let's explore the power of the .shift() function and see how it can enhance our analysis.

### Copy and paste this code and run it:

```python
# -- lag the adjusted close by one day
spy["adjusted_lag"] = spy["adjusted"].shift(1)
# -- calculate a daily return 
spy["adjusted_daily_return"] = (spy["adjusted"] / spy["adjusted"].shift(1)) -1
# -- eyeball the first 5 records (the default for head())
spy.head()

```

#### Note about Shift()

The pandas.DataFrame.shift() function is used to shift the values of a particular column or DataFrame along a particular axis by a specified number of periods. It does not change the values themselves: each value simply moves to a later (or earlier) row, and the vacated positions are filled with missing values. To get a change over time, subtract or divide the original column by the shifted one.

The shift() function can be useful in a variety of situations, such as:

- Creating lagged variables: you can shift a variable up or down a certain number of rows to create a lagged version of the variable, which can be useful for calculating changes or differences over time.
- Shifting dates: with a datetime index and the freq parameter, shift() moves the index labels forward or backward by that offset (for example a number of days) while leaving the data unchanged.
- Shifting index: because freq changes the index rather than the data, it can help you align DataFrames that have been re-sampled at different rates.

#### Shift() takes several parameters:

- periods: Number of periods to shift. Can be positive or negative.
- freq: Optional offset alias or date offset, for example 'D' (calendar day) or 'B' (business day). When given, the index is shifted instead of the data.
- axis: 0 or 'index' for shifting rows, 1 or 'columns' for shifting columns.
- fill_value: The scalar value to use for newly introduced missing values. By default, numeric data gets NaN and datetime data gets NaT.

An example: 
```python 
# Let's say we have a DataFrame `df`
df = pd.DataFrame({'A': [1, 2, 3]})
df['A_shifted_up'] = df['A'].shift(-1)
df['A_shifted_down'] = df['A'].shift(1)
print(df)
```
This will create two new columns, A_shifted_up and A_shifted_down, that hold the values of A shifted up and down by one row. The positions left empty by the shift are filled with NaN.
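As a small sketch of how shift() relates to returns and differences, the example below uses a made-up three-value price series (not the SPY data). It compares the manual shift-based calculations with the built-in `pct_change()` and `diff()` methods.

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([100.0, 110.0, 99.0])

# Daily return: today's price divided by yesterday's price, minus 1
manual_return = prices / prices.shift(1) - 1
builtin_return = prices.pct_change()

# Daily change: today's price minus yesterday's price
manual_diff = prices - prices.shift(1)
builtin_diff = prices.diff()

print(pd.DataFrame({
    "price": prices,
    "manual_return": manual_return,
    "builtin_return": builtin_return,
    "manual_diff": manual_diff,
}))
```

The first row is NaN in every derived column because there is no previous day to compare against.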


In [5]:
# -- lag the adjusted close by one day
spy["adjusted_lag"] = spy["adjusted"].shift(1)
# -- calculate a daily return 
spy["adjusted_daily_return"] = (spy["adjusted"] / spy["adjusted"].shift(1)) -1
# -- eyeball the first 5 records (the default for head())
spy.head()

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted,adjusted_lag,adjusted_daily_return
0,SPY,2019-01-02,245.979996,251.210007,245.949997,250.179993,126925200,234.061646,,
1,SPY,2019-01-03,248.229996,248.570007,243.669998,244.210007,144140700,228.476257,234.061646,-0.023863
2,SPY,2019-01-04,247.589996,253.110001,247.169998,252.389999,142628800,236.129257,228.476257,0.033496
3,SPY,2019-01-07,252.690002,255.949997,251.690002,254.380005,103139100,237.991043,236.129257,0.007885
4,SPY,2019-01-08,256.820007,257.309998,254.0,256.769989,102512600,240.227051,237.991043,0.009395


## pandas.DataFrame.info()

The pandas.DataFrame.info() function is used to get a summary of the DataFrame, including the index dtype and column dtypes, non-null values and memory usage. It is a quick way to get a high-level understanding of the structure and content of your DataFrame.

When called, info() prints the following information:

- the type of the index and its range
- the number of columns and their names
- the number of non-null values for each column
- the data type of each column
- the amount of memory used by the DataFrame

Here is an example of how you can use the info() function:
```python
import pandas as pd

df = pd.read_csv("data.csv")
df.info()

```

In our case, use .info() on the SPY DataFrame. Copy and paste the following into a code cell:

```python
spy.info()

```

In [6]:
# -- note the parentheses: a bare spy.info only displays the bound method instead of calling it --
spy.info()


## pandas.DataFrame.describe() 

The pandas.DataFrame.describe() function is used to generate descriptive statistics of the DataFrame. It is a useful tool for getting a high-level understanding of the distribution of the data in the DataFrame.

When called, describe() returns a summary of the following statistics for each numeric column in the DataFrame:

- count: the number of non-null observations
- mean: the mean of the non-null observations
- std: the standard deviation of the non-null observations
- min: the minimum of the non-null observations
- 25%: the 25th percentile of the non-null observations
- 50%: the median (50th percentile) of the non-null observations
- 75%: the 75th percentile of the non-null observations
- max: the maximum of the non-null observations

Here is an example of how you can use the describe() function:

```python
import pandas as pd

df = pd.read_csv("data.csv")
df.describe()
```

This will display a summary of the statistics for each numerical column in the DataFrame. By default, only numeric columns are included. To include other columns, use the include parameter, which accepts 'all' or a list of data types.

```python
df.describe(include='all')
```
This will display summary statistics for all columns including non-numeric columns.

```python 
# -- copy this below -- 
spy.describe(include='all')
```


In [7]:
spy.describe(include='all')

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted,adjusted_lag,adjusted_daily_return
count,1008,1008,1008.0,1008.0,1008.0,1008.0,1008.0,1008.0,1007.0,1007.0
unique,1,1008,,,,,,,,
top,SPY,2019-01-02,,,,,,,,
freq,1008,1,,,,,,,,
mean,,,361.715099,364.068502,359.15499,361.769276,84744670.0,350.994165,350.962948,0.000589
std,,,63.236054,63.396761,62.922583,63.164565,44470990.0,66.470242,66.495877,0.014198
min,,,228.190002,229.679993,218.259995,222.949997,20270000.0,213.785492,213.785492,-0.109424
25%,,,300.279991,301.137513,298.580002,300.219993,56980100.0,284.261291,284.22351,-0.005291
50%,,,367.869995,370.029999,364.449997,366.835006,74269700.0,358.353516,358.332703,0.0009
75%,,,417.272499,419.297493,414.777512,417.262505,97188200.0,408.937058,409.001434,0.007506


## .drop()

In pandas, you can use the drop() function to remove one or more columns from a DataFrame. The drop() function takes two main arguments: the labels that you want to remove, and the axis (defaults to 0), which tells pandas whether those labels are rows (axis=0) or columns (axis=1). To drop a column you therefore need to pass axis=1.

Here is an example of how you can use the drop() function to remove a column from a DataFrame:

```python
df = pd.read_csv("data.csv")
df = df.drop("column_name", axis=1)

```
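Here is a minimal sketch of dropping several columns at once, using a made-up DataFrame. It uses the `columns=` keyword, which is an alternative to passing `axis=1`.

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({
    "symbol": ["SPY", "SPY"],
    "open": [1.0, 2.0],
    "close": [1.5, 2.5],
})

# columns=[...] is equivalent to passing the labels with axis=1
trimmed = demo.drop(columns=["symbol", "open"])

print(list(trimmed.columns))  # columns of the new DataFrame
print(list(demo.columns))     # columns of the original DataFrame
```

Comparing the two printed lists shows that drop() hands back a new DataFrame rather than modifying `demo`.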

Drop the symbol column from the SPY dataset. 

```python
spy.drop("symbol", axis=1)
```

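What happens if you don't overwrite the dataset? Does the column actually get dropped? It does not: drop() returns a new DataFrame, so assign the result back to keep the change.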

```python
spy = spy.drop("symbol", axis=1)
spy
```

Unnamed: 0,date,open,high,low,close,volume,adjusted,adjusted_lag,adjusted_daily_return
0,2019-01-02,245.979996,251.210007,245.949997,250.179993,126925200,234.061646,,
1,2019-01-03,248.229996,248.570007,243.669998,244.210007,144140700,228.476257,234.061646,-0.023863
2,2019-01-04,247.589996,253.110001,247.169998,252.389999,142628800,236.129257,228.476257,0.033496
3,2019-01-07,252.690002,255.949997,251.690002,254.380005,103139100,237.991043,236.129257,0.007885
4,2019-01-08,256.820007,257.309998,254.000000,256.769989,102512600,240.227051,237.991043,0.009395
...,...,...,...,...,...,...,...,...,...
1003,2022-12-23,379.649994,383.059998,378.029999,382.910004,59857300,382.910004,380.720001,0.005752
1004,2022-12-27,382.790009,383.149994,379.649994,381.399994,51638200,381.399994,382.910004,-0.003944
1005,2022-12-28,381.329987,383.390015,376.420013,376.660004,70911500,376.660004,381.399994,-0.012428
1006,2022-12-29,379.630005,384.350006,379.079987,383.440002,66970900,383.440002,376.660004,0.018000


## pandas.to_datetime() 

The pandas.to_datetime() function is used to convert a string or a column of strings that represent dates and/or times to a datetime data type. This can be useful when working with data that has dates and times stored as strings and you want to perform operations on them.

Here is an example of how you can use the to_datetime() function:

```python
import pandas as pd

df = pd.read_csv("data.csv")
df['date_col'] = pd.to_datetime(df['date_col'])

```

This code will convert the values in the 'date_col' column of the DataFrame, which are assumed to be strings, to datetime type and update the column in place.

You can also specify the format of the date string, for example:

```python
df['date_col'] = pd.to_datetime(df['date_col'], format='%Y-%m-%d')

```
This will convert 'date_col' from string to datetime assuming the date is in 'yyyy-mm-dd' format.

If the date string does not match the specified format, the function will raise a ValueError.

Once the dates are converted to datetime, you can compare them, do arithmetic on them, and extract the year, month, day, hour, and other properties, all of which is useful when working with time series data.
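The sketch below shows the ValueError behavior on a made-up series that contains one bad value, along with one possible way to handle it: the `errors="coerce"` option of `pd.to_datetime()`, which turns unparseable values into NaT instead of raising.

```python
import pandas as pd

# Made-up strings, for illustration only; the last one is not a valid date
raw = pd.Series(["2019-01-02", "2019-01-03", "not a date"])

# Default behavior: a value that does not match the format raises ValueError
try:
    pd.to_datetime(raw, format="%Y-%m-%d")
    raised = False
except ValueError:
    raised = True

# errors="coerce" converts bad values to NaT (the datetime missing value)
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Datetime values support arithmetic: the gap between the first two dates
gap = parsed.iloc[1] - parsed.iloc[0]

print(raised)
print(parsed)
print(gap)
```

Coercing is convenient for messy data, but it can hide problems, so it is worth counting the NaT values afterwards with `parsed.isna().sum()`.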


### Let's convert the "date" column to a datetime 

```python 
# -- convert date to datetime -- 
spy["date"] = pd.to_datetime(spy["date"])
# -- look at the data types -- 
print(spy.info())
# -- eyeball the first 5 records (the default for head())
spy.head()
```

In [14]:
# -- convert date to datetime -- 
spy["date"] = pd.to_datetime(spy["date"])
# -- look at the data types -- 
print(spy.info())
# -- eyeball the first 5 records (the default for head())
spy.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1008 entries, 0 to 1007
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   symbol                 1008 non-null   object        
 1   date                   1008 non-null   datetime64[ns]
 2   open                   1008 non-null   float64       
 3   high                   1008 non-null   float64       
 4   low                    1008 non-null   float64       
 5   close                  1008 non-null   float64       
 6   volume                 1008 non-null   int64         
 7   adjusted               1008 non-null   float64       
 8   adjusted_lag           1007 non-null   float64       
 9   adjusted_daily_return  1007 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(1)
memory usage: 78.9+ KB
None


Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted,adjusted_lag,adjusted_daily_return
0,SPY,2019-01-02,245.979996,251.210007,245.949997,250.179993,126925200,234.061646,,
1,SPY,2019-01-03,248.229996,248.570007,243.669998,244.210007,144140700,228.476257,234.061646,-0.023863
2,SPY,2019-01-04,247.589996,253.110001,247.169998,252.389999,142628800,236.129257,228.476257,0.033496
3,SPY,2019-01-07,252.690002,255.949997,251.690002,254.380005,103139100,237.991043,236.129257,0.007885
4,SPY,2019-01-08,256.820007,257.309998,254.0,256.769989,102512600,240.227051,237.991043,0.009395


## Extract date/time parts with the .dt accessor

After converting a column to a datetime type using pd.to_datetime(), you can use the .dt accessor to access the properties of the datetime objects in that column.

The .dt accessor provides a convenient way to perform operations on datetime columns and extract information from datetime objects, such as the day, month, year, hour, etc. Here are some examples of the common .dt functions:

```python
df["year"] = df["date_col"].dt.year
df["month"] = df["date_col"].dt.month
df["day"] = df["date_col"].dt.day
df["hour"] = df["date_col"].dt.hour
df["dayofweek"] = df["date_col"].dt.dayofweek

```

This will extract year, month, day, hour and day of the week from date_col and create new columns with these values.

Some of the other commonly used .dt attributes and methods are:

- .isocalendar().week - Returns the ISO week number of the year (the older .weekofyear was removed in pandas 2.0)
- .dayofyear - Returns the day of the year
- .quarter - Returns the quarter of the date
- .daysinmonth - Returns the number of days in the month of the date
- .is_leap_year - Returns boolean value indicating if the year of the datetime is a leap year
- .is_month_start, .is_month_end - Returns boolean indicating if first/last day of month
- .is_quarter_start, .is_quarter_end - Returns boolean indicating if first/last day of quarter

You can find more information on the .dt accessor and the different functions it provides in the pandas documentation.


https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.dayofweek.html 

It is important to note that the .dt accessor only works on datetime-like columns. If you try to use it on a column that is not a datetime, pandas raises an AttributeError.

    
Run the following:
    
```python
# extract via .dt.<accessors> 
# -- day -- 
spy["date"].dt.dayofweek
spy["date"].dt.day_name()
# -- month -- 
spy["date"].dt.month
spy["date"].dt.month_name()
# -- year -- 
spy["date"].dt.year
    
```
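To make the .dt accessor concrete, here is a minimal sketch on three hand-picked dates (chosen for illustration, not taken from the SPY file) that collects several extracted parts into one DataFrame.

```python
import pandas as pd

# Three hand-picked dates, for illustration only
dates = pd.Series(pd.to_datetime(["2019-01-02", "2020-02-29", "2022-12-30"]))

parts = pd.DataFrame({
    "date": dates,
    "year": dates.dt.year,
    "month_name": dates.dt.month_name(),
    "dayofweek": dates.dt.dayofweek,        # Monday=0 ... Sunday=6
    "day_name": dates.dt.day_name(),
    "quarter": dates.dt.quarter,
    "iso_week": dates.dt.isocalendar().week,
    "is_leap_year": dates.dt.is_leap_year,
})

print(parts)
```

Note that `dayofweek` is an integer starting at 0 for Monday, while `day_name()` is a method (with parentheses) that returns the weekday as a string.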
    

## Format datetime to string with .dt.strftime(string-format-time)

The .dt.strftime() function is used to convert datetime objects in a column of a DataFrame to strings using a specified format. This can be useful when you want to output datetime data in a specific format or when you want to create new columns with the datetime data in a specific format.

Here is an example of how you can use the .dt.strftime() function:
```python
df['date_col_formatted'] = df['date_col'].dt.strftime('%Y-%m-%d')

``` 
This will convert the values in the 'date_col' column to strings and format them as 'yyyy-mm-dd' and create a new column 'date_col_formatted' with these formatted values.

The format codes used in the strftime() function are the same as the codes used in the python datetime.strftime() method and you can find the list of codes in the python documentation. Some of the common codes are:

- %Y - 4-digit year
- %m - 2-digit month
- %d - 2-digit day
- %H - Hour (24-hour clock)
- %M - Minute
- %S - Second
- %A - Weekday as full name
- %b - Abbreviated month name (Jan)
- %B - Full month name (January)

The .dt.strftime() function is a powerful way to change the format of datetime data. It is useful when you need to export data in a specific format or build string labels such as month-year. Keep in mind that the result is a column of strings, so it no longer sorts or compares like a date.
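A short sketch of a few of the format codes in action, using two made-up dates. It also shows a round trip back to datetime with `pd.to_datetime()` and the same format string. The weekday and month names assume the default English locale.

```python
import pandas as pd

# Two made-up dates, for illustration only
dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-12-25"]))

long_form = dates.dt.strftime("%A, %B %d, %Y")   # weekday, full month, day, year
compact = dates.dt.strftime("%Y%m%d")            # compact yyyymmdd string

# The formatted values are plain strings; parse them back with the same format
round_trip = pd.to_datetime(compact, format="%Y%m%d")

print(long_form.tolist())
print(compact.tolist())
print(round_trip.equals(dates))
```

Keeping the original datetime column and adding formatted strings as separate columns avoids having to parse them back later.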

Here we want to generate the following:
1. month and year, e.g. Jan-2020
2. day month year, e.g. 01Jan2020
3. month-day-year, e.g. 01-01-2020

```python
spy["date"].dt.strftime("%b-%Y")
spy["date"].dt.strftime('%d%b%Y')
spy["date"].dt.strftime("%m-%d-%Y")                   

```
**Formatting**
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

**strftime function**
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html



0       01-02-2019
1       01-03-2019
2       01-04-2019
3       01-07-2019
4       01-08-2019
           ...    
1003    12-23-2022
1004    12-27-2022
1005    12-28-2022
1006    12-29-2022
1007    12-30-2022
Name: date, Length: 1008, dtype: object

## Filter Dates 

To filter dates, first the column needs to be a ***datetime data type*** then simply query or use location based filtering. You can use pd.to_datetime() to convert a column of strings that represent dates to datetime data type. Once the column is in datetime format, you can then use different methods to filter the DataFrame based on dates.

### *NOTE: By default pandas will convert comparison strings in "yyyy-mm-dd" format to datetimes*

You can use comparison operators like <, >, <=, >=, ==, != to filter rows based on a specific date or range of dates, for example:

```python
df[df['date_col'] >= '2022-01-01']
```

This will return a DataFrame containing only the rows where the value in the 'date_col' column is greater than or equal to '2022-01-01'.

Another method is to use the .loc[] or .query() method to filter the DataFrame. For example:

```python
# -- df.query() method -- 
df.query("date_col >= '2022-01-01'")

# OR 

# -- df.loc[] method -- 
df.loc[df['date_col'] >= '2022-01-01']


```
Let's practice filtering dates using square brackets with a boolean condition!

1. filter after a date 
2. filter between two dates 
3. filter by day of week 
4. filter by year 

```python
# 1. filter after a date 
spy[spy["date"] > '2021-01-01']

# 2. filter between two dates 
spy[(spy["date"] >= '2021-01-01') & (spy["date"] < '2021-02-01') ]

# 3. filter by day of week 
spy[spy["date"].dt.day_name() == "Monday"]

# 4. filter by year 
spy[spy["date"].dt.year == "2021"] # tricked ya! returns no rows because the year is an integer, not a string

spy[spy["date"].dt.year == 2021]
```
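The same four filters can be tried on a tiny made-up DataFrame where the expected rows are easy to check by eye. The sketch also uses `Series.between()`, which is not used elsewhere in this notebook, as one alternative way to write a date range (it includes both endpoints by default).

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-31", "2021-01-04", "2021-01-29", "2021-02-01"]),
    "close": [1.0, 2.0, 3.0, 4.0],
})

# Range filter with two conditions; each condition needs its own parentheses
mask = (demo["date"] >= "2021-01-01") & (demo["date"] < "2021-02-01")
january = demo[mask]

# Alternative: between() includes both endpoints by default
also_january = demo[demo["date"].between("2021-01-01", "2021-01-31")]

# dt.year is an integer, so comparing it to a string matches nothing
wrong = demo[demo["date"].dt.year == "2021"]
right = demo[demo["date"].dt.year == 2021]

print(january)
print(len(wrong), len(right))
```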

Let's practice filtering dates using the query method!

1. filter after a date 
2. filter between two dates 
3. filter by day of week 
4. filter by year 

```python
# 1. query after a date 
spy.query('date > "2021-01-01"')

# 2. query between two dates 
spy.query('date > "2021-01-01" and date < "2021-02-01"')

# 3. query by day of week 
spy.query('date.dt.day_name() == "Monday"')

# 4. filter by year 
spy.query('date.dt.year == "2021"') # tricked ya! returns no rows because the year is an integer, not a string

spy.query('date.dt.year == 2021')
```

The .query() method can be a more readable and convenient way to filter rows in a DataFrame compared to using comparison operators or the .loc[] method. The .query() method allows you to pass a string containing a Boolean expression that specifies the conditions to filter the DataFrame by. This can make the filtering code more readable and understandable, especially when dealing with complex filtering conditions.

For example, instead of using comparison operators or the .loc[] method to filter a DataFrame, you can use the .query() method to pass a string containing the filtering conditions:

```python
df.query("column1 > 5 and column2 == 'value'")
```

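It's important to note that the **.query() method can be faster** than boolean indexing or the .loc[] method on large DataFrames (roughly hundreds of thousands of rows or more) when the optional numexpr package is installed. On small DataFrames, the overhead of parsing the expression usually makes it a little slower, so choose it mainly for readability.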

Keep in mind that the column name you pass to the query method must be the exact name of the column in the DataFrame. **Column names that contain spaces** or other special characters have to be wrapped in backticks inside the query string, which is one more reason to clean up your column names first.
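Here is a small sketch of two query() features that come up often: backticks around a column name that contains a space, and the `@` prefix for referring to a Python variable. The DataFrame and the `cutoff` variable are made up for illustration.

```python
import pandas as pd

# Made-up data with a space in one column name, for illustration only
demo = pd.DataFrame({
    "trade date": pd.to_datetime(["2020-12-31", "2021-01-04", "2021-01-05"]),
    "daily_return": [0.01, -0.02, 0.03],
})

cutoff = pd.Timestamp("2021-01-01")  # hypothetical cutoff date

# Backticks quote the column name; @cutoff refers to the Python variable
recent_gains = demo.query("`trade date` >= @cutoff and daily_return > 0")

# The same filter written with a boolean mask, for comparison
same_rows = demo[(demo["trade date"] >= cutoff) & (demo["daily_return"] > 0)]

print(recent_gains)
```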

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted,adjusted_lag,adjusted_daily_return
757,SPY,2022-01-03,476.299988,477.850006,473.850006,477.709991,72668200,470.083679,467.377563,0.005790
758,SPY,2022-01-04,479.220001,479.980011,475.579987,477.549988,71178700,469.926239,470.083679,-0.000335
759,SPY,2022-01-05,477.160004,477.980011,468.279999,468.380005,104538900,460.902649,469.926239,-0.019202
760,SPY,2022-01-06,467.890015,470.820007,465.429993,467.940002,86858900,460.469666,460.902649,-0.000939
761,SPY,2022-01-07,467.950012,469.200012,464.649994,466.089996,85111600,458.649170,460.469666,-0.003954
...,...,...,...,...,...,...,...,...,...,...
1003,SPY,2022-12-23,379.649994,383.059998,378.029999,382.910004,59857300,382.910004,380.720001,0.005752
1004,SPY,2022-12-27,382.790009,383.149994,379.649994,381.399994,51638200,381.399994,382.910004,-0.003944
1005,SPY,2022-12-28,381.329987,383.390015,376.420013,376.660004,70911500,376.660004,381.399994,-0.012428
1006,SPY,2022-12-29,379.630005,384.350006,379.079987,383.440002,66970900,383.440002,376.660004,0.018000


## Filter Columns with Double Square Brackets [[ ]]

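In pandas, you can use double square brackets `[[ ]]` to select columns in a DataFrame. The outer brackets are the indexing operator and the inner brackets are a Python list of column names. This is different from single square brackets `[]`, which return one column as a Series when given a column name and filter rows when given a boolean condition.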

When you use double square brackets aka "bracket bracket", you can select **one or multiple columns** from a DataFrame by specifying their names. Here's an example of how you can use double square brackets to select columns from a DataFrame:

```python
df = pd.read_csv("data.csv")
columns_to_keep = ['col1', 'col2', 'col3']
df = df[columns_to_keep]
df 
```
This will create a new DataFrame that contains only the columns 'col1', 'col2', 'col3' from the original DataFrame.

You can also use a list comprehension to filter columns
```python
cols_to_keep = [col for col in df.columns if 'string' in col]
df = df[cols_to_keep]
```
You can also use `.loc[]` or `.filter()` to select columns by name or by a boolean mask. 

```python
df.filter(like='string')
df.loc[:, df.columns.str.contains('string')]
```
The `[[]]` notation does the same job as selecting columns with the `.loc[]` operator. Few people write column selection the long way anymore, so just use `[[]]` and make your life easier.
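A quick sketch, on made-up data, of what single and double brackets return. The main thing to notice is the type of each result.

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05"],
    "volume": [100, 200],
    "adjusted": [1.0, 2.0],
})

as_series = demo["volume"]             # single brackets + one name -> Series
as_frame = demo[["volume"]]            # double brackets + one name -> DataFrame
two_cols = demo[["date", "adjusted"]]  # double brackets + several names -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(list(two_cols.columns))
```

The distinction matters when a later step expects a DataFrame (for example `.T` or joining), so `[[ ]]` is the safer habit even for a single column.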



---

Create some new data frames using filtering by date and for specific columns:

1. **Query 1:** select "date", "volume", "adjusted", "adjusted_daily_return" where year equals 2021
```python
q1 = spy.query('date.dt.year == 2021')[["date", "volume", "adjusted", "adjusted_daily_return"]]
q1
```
2. **Query 2:** do the same thing just use a list like this
```python
list_o_columns = ["date", "volume", "adjusted", "adjusted_daily_return" ]
q2 = spy.query('date.dt.year == 2021')[list_o_columns]
q2
```

3. **Query 3:** filter for year >= 2020 and "date", "adjusted", "adjusted_lag", "adjusted_daily_return" 
```python
list_o_columns = ["date", "adjusted", "adjusted_lag", "adjusted_daily_return"]
q3 = spy.query('date.dt.year >= 2020')[list_o_columns]
q3
```
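4. **Query 4:** create a new column "min_max_diff" from the difference between the daily max (high) and min (low), then filter for year >= 2020 and select "date", "adjusted", "adjusted_lag", "adjusted_daily_return" along with the new column
```python
# -- assumption: "max" and "min" refer to the high and low columns --
spy["min_max_diff"] = spy["high"] - spy["low"]
list_o_columns = ["date", "adjusted", "adjusted_lag", "adjusted_daily_return", "min_max_diff"]
q4 = spy.query('date.dt.year >= 2020')[list_o_columns]
q4
```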


In [None]:
spy.query('date.dt.year == 2021')[["date", "volume", "adjusted", "adjusted_daily_return" ]]

In [None]:
list_o_columns = ["date", "volume", "adjusted", "adjusted_daily_return" ]
spy.query('date.dt.year == 2021')[list_o_columns]

## Sorting with .sort_values()

Sorting in pandas is easy: simply call the sort_values method and pass a list of columns, like this:

`df.sort_values(by=['col1', 'col2'])`

- ascending=False will sort descending (largest to smallest)
- na_position='first' or na_position='last' controls whether null values are placed first or last (the default is 'last')

```python

spy.sort_values(by=["adjusted_daily_return"])

spy.sort_values(by=["adjusted_daily_return"], na_position = 'first')

spy.query('date.dt.year == 2021').sort_values(by=["adjusted_daily_return"])

spy.query('date.dt.year == 2021').sort_values(by=["adjusted_daily_return"], ascending = False)

spy.query('date.dt.year == 2020').sort_values(by=["adjusted_daily_return"], ascending = False, na_position = 'first')

```

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
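As a sketch of how these options combine, the example below sorts a made-up DataFrame by two columns with a different direction for each (passing a list to `ascending`), and places the missing value first.

```python
import pandas as pd

# Made-up data with one missing return, for illustration only
demo = pd.DataFrame({
    "year": [2021, 2020, 2021, 2020],
    "ret": [0.01, None, -0.02, 0.03],
})

# year ascending, then ret descending within each year; NaN placed first
ordered = demo.sort_values(
    by=["year", "ret"],
    ascending=[True, False],
    na_position="first",
)

print(ordered)
```

The original index labels travel with the rows, which makes it easy to see where each row came from. Add `.reset_index(drop=True)` if you want a fresh 0, 1, 2, ... index after sorting.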

In [None]:
spy

In [None]:
spy.sort_values(by=["adjusted_daily_return"], ascending = False, na_position = 'first')\
[["date","adjusted","adjusted_daily_return"]]

In [None]:
spy.sort_values(by=["adjusted_daily_return"], na_position = 'first')




## Transposing a Dataset

The .T attribute (no parentheses) simply turns rows into columns and columns into rows. Really handy in some situations. 

```python
spy.T
spy[["date","adjusted_daily_return"]].T

```
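A minimal sketch of what transposing does to the shape and labels, using a made-up three-row DataFrame.

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "ret": [0.01, -0.02, 0.03],
})

flipped = demo.T  # .T is an attribute, so no parentheses

print(demo.shape, flipped.shape)
print(list(flipped.index))    # the old column names become the row labels
print(list(flipped.columns))  # the old row labels become the column names
```

One common use is `df.describe().T`, which puts one row per column and one column per statistic and is often easier to read for wide DataFrames.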

In [None]:
spy[["date","adjusted_daily_return"]].T

## Basic Column (Series) Math Functions 


- Series.mean - Return the mean.
- Series.median - Return the median.
- Series.sum - Return the sum.
- Series.min - Return the minimum.
- Series.max - Return the maximum.
- Series.idxmin - Return the **index of the minimum.**
- Series.idxmax - Return the **index of the maximum.**


```python
spy["adjusted_daily_return"].mean()

spy.query('date.dt.year == 2021')["adjusted_daily_return"].sum()

list_o_columns = ["date", "volume", "adjusted", "adjusted_daily_return" ]
spy.query('date.dt.year == 2021')[list_o_columns].mean()

spy["adjusted_daily_return"].min()

spy["adjusted_daily_return"].count()

spy["adjusted_daily_return"].max()

spy["adjusted_daily_return"].sum()

spy["adjusted_daily_return"].std()

# -- idxmax / idxmin return index labels, so look the rows up with .loc --
idxmax = spy["adjusted_daily_return"].idxmax()
spy.loc[idxmax]

idxmin = spy["adjusted_daily_return"].idxmin()
spy.loc[idxmin]


```
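Because idxmax() and idxmin() return an index label rather than a row position, the lookup is safest with `.loc`. The sketch below uses made-up data to show how `.iloc` can pick the wrong row once a DataFrame has been filtered.

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({"ret": [0.01, -0.03, 0.05, 0.02]})

# After filtering, the index labels are 1, 2, 3 but the positions are 0, 1, 2
subset = demo.iloc[1:]

best_label = subset["ret"].idxmax()      # the index label of the largest return
best_row = subset.loc[best_label]        # label-based lookup: the correct row
by_position = subset.iloc[best_label]    # position-based lookup: a different row

print(best_label)
print(best_row["ret"], by_position["ret"])
```

On the full SPY DataFrame the labels and positions happen to match because the index is the default 0, 1, 2, ..., but that stops being true after a `query()` or a sort.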

In [None]:
spy.query('date.dt.year == 2021')["adjusted_daily_return"].mean()

In [None]:
mean_return = spy.query('date.dt.year == 2021')["adjusted_daily_return"].mean()


In [None]:
spy["adjusted_daily_return"].max()

In [None]:
list_o_columns = ["date", "volume", "adjusted", "adjusted_daily_return" ]

spy.query('date.dt.year == 2021')[list_o_columns].mean()

In [None]:
spy.query('date.dt.year == 2021')["adjusted_daily_return"].sum()

In [None]:
idxmax = spy["adjusted_daily_return"].idxmax()
spy.loc[idxmax]

In [None]:
idxmin = spy["adjusted_daily_return"].idxmin()
spy.loc[idxmin]

### Annualized Standard Deviation 

The Annualized Standard Deviation is the standard deviation multiplied by the square root of the number of periods in one year. For daily returns, that is the number of trading days in a year, commonly taken to be about 252.

\begin{equation*}
\text{Annualized Standard Deviation} = \sigma \times \sqrt{\text{number of trading days per year}}
\end{equation*}

```python
# -- 252 is an assumed, conventional count of trading days per year --
spy["adjusted_daily_return"].std() * np.sqrt(252)
```
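As a worked sketch of the calculation, the example below annualizes the standard deviation of a handful of made-up daily returns. The constant `TRADING_DAYS_PER_YEAR = 252` is an assumption (a common convention for US equity markets), not a value taken from the data.

```python
import numpy as np
import pandas as pd

# Assumption: roughly 252 trading days in a year (a common convention)
TRADING_DAYS_PER_YEAR = 252

# Made-up daily returns, for illustration only
daily_returns = pd.Series([0.01, -0.02, 0.015, 0.0, -0.005])

daily_vol = daily_returns.std()  # sample standard deviation of daily returns
annualized_vol = daily_vol * np.sqrt(TRADING_DAYS_PER_YEAR)

print(f"daily: {daily_vol:.4f}  annualized: {annualized_vol:.4f}")
```

Scaling by the square root of the number of periods per year, rather than by the number of rows in the dataset, keeps the result comparable across samples of different lengths.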



In [None]:
# -- 252 is an assumed, conventional count of trading days per year --
spy["adjusted_daily_return"].std() * np.sqrt(252)

In [None]:
# -- same calculation, using the power operator instead of np.sqrt --
spy["adjusted_daily_return"].std() * 252**(1/2)

## Basic Plots 

Histogram 

Line Plot 



In [None]:
spy["adjusted_daily_return"].plot.hist(bins=20, alpha=0.5)

In [None]:
spy.plot.line(x="date", y="adjusted_daily_return")

In [None]:
spy.plot.line(x="date", y="adjusted")
spy.plot.line(x="date", y="close")