# Introduction to pandas


### Objective of this notebook
In this notebook we will be introduced to pandas. Pandas is a Python library designed to work with tabular data.  

This notebook is adapted from Wendy Lee 2024.

We import the library as `pd`.
You can check the version you are working with using the `__version__` attribute. This is useful if there are version conflicts between packages.

In [3]:
# Import libraries
import pandas as pd
# print(pd.__version__)

## Working with DataFrame ##
By convention, we use the generic name `df` for DataFrame objects.

#### Constructing a dataframe
We can use the `pd.DataFrame()` constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose **keys** are the **column names** (Names and Scores in this example), and whose **values** are **lists of entries**. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns the values to the column labels and uses an ascending count from 0 (0, 1, 2, 3, ...) for the **row labels**.

In [6]:
# list1 = ['a','b','c','d']
# list2 = [1,3,4,6]
# df = pd.DataFrame({"Names": list1, "Scores": list2})
# df

| | Names | Scores |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 3 |
| 2 | c | 4 |
| 3 | d | 6 |
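
As a minimal sketch of the point above, the following block rebuilds the same DataFrame from literal lists (so it runs on its own) and inspects the labels pandas assigned. The variable name `df` follows the convention mentioned earlier.

```python
import pandas as pd

# Same data as the cell above, written as literals so the block is self-contained.
df = pd.DataFrame({"Names": ["a", "b", "c", "d"], "Scores": [1, 3, 4, 6]})

print(df.columns.tolist())  # column labels come from the dictionary keys
print(df.index.tolist())    # row labels default to an ascending count from 0
print(df.shape)             # (number of rows, number of columns)
```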


#### Assigning an Index in the constructor
The list of row labels used in a DataFrame is known as an **Index**. We can assign values to it by using the `index` parameter in our constructor:

In [7]:
# productList = ['Product A', 'Product B', 'Product C', 'Product D']
# pd.DataFrame({"Names": list1, "Scores": list2}, index=productList)

| | Names | Scores |
|---|---|---|
| Product A | a | 1 |
| Product B | b | 3 |
| Product C | c | 4 |
| Product D | d | 6 |
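
Once a DataFrame has custom row labels, those labels can be used to look rows up. The sketch below repeats the construction with literal values and uses `.loc`, the label-based accessor, to fetch one row. The variable names are illustrative only.

```python
import pandas as pd

product_list = ["Product A", "Product B", "Product C", "Product D"]
df = pd.DataFrame(
    {"Names": ["a", "b", "c", "d"], "Scores": [1, 3, 4, 6]},
    index=product_list,
)

# .loc looks a row up by its label rather than by its position.
row = df.loc["Product B"]
print(row["Names"], row["Scores"])
```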


## Series ##
- A Series is a sequence of data values. You can create a Series with nothing more than a list.

- It differs from a list in two ways: 1) it has an explicit index, and 2) it has a `dtype`, short for data type.
- The dtype can be, for example, integer (`int64`), float (`float64`), or `object`. `int64` stands for 64-bit integer: the 64 refers to the number of bits of memory allocated to store each value, which determines how large a number each “cell” can hold.

In [9]:
# myList = [1,2,3,'-',4,5,8]
# pd.Series(myList)
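
To see how the contents of a list affect the dtype, the sketch below compares a purely numeric list with the mixed list from the cell above. The last step uses `pd.to_numeric` with `errors="coerce"`, which is one possible way to clean such a column and is shown here only as an illustration.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5, 8])
mixed = pd.Series([1, 2, 3, "-", 4, 5, 8])  # the list from the cell above

print(numbers.dtype)  # an integer dtype
print(mixed.dtype)    # object, because the "-" string cannot be stored as an integer

# errors="coerce" converts what it can and replaces the rest with NaN.
cleaned = pd.to_numeric(mixed, errors="coerce")
print(cleaned.dtype)
print(cleaned.isna().sum())  # number of values that could not be converted
```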

We can also create a pandas Series from a list of strings.

In [13]:
# sList = ['ABC','DEF','GHI']
# pd.Series(sList)

| | 0 |
|---|---|
| 0 | ABC |
| 1 | DEF |
| 2 | GHI |


- Notice that the dtype is `object`, not `str` as you might expect. This is because pandas (as of the 2.x releases) uses `object` by default for strings. You can explicitly set the dtype to string by using the **dtype** argument.

In [15]:
# pd.Series(sList, dtype="string")
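
A short sketch of the explicit string dtype follows. It repeats the list from above as a literal and then uses the `.str` accessor, which exposes string methods element by element.

```python
import pandas as pd

s_list = ["ABC", "DEF", "GHI"]
as_string = pd.Series(s_list, dtype="string")

print(as_string.dtype)

# String methods are applied to every element through the .str accessor.
lowered = as_string.str.lower().tolist()
print(lowered)
```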

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using the `index` parameter. However, a Series does not have a column name; it only has one overall name:

In [16]:
# revenue = [30, 38, 58]
# years = ['2018 Sales', '2019 Sales', '2020 Sales']
# pd.Series(revenue, index=years, name="Product A")
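
The relationship between a Series name and a DataFrame column name can be checked directly. In this sketch, `to_frame()` converts the Series into a one-column DataFrame, and the Series name becomes the column header.

```python
import pandas as pd

sales = pd.Series(
    [30, 38, 58],
    index=["2018 Sales", "2019 Sales", "2020 Sales"],
    name="Product A",
)

print(sales["2019 Sales"])     # look a value up by its row label

frame = sales.to_frame()       # the Series name becomes the column name
print(frame.columns.tolist())
```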

## Reading data files ##
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

We'll use the `pd.read_csv()` function to read the data into a DataFrame.

In [33]:
# csvFile = 'https://raw.githubusercontent.com/csbfx/advpy122-data/master/top_movies_2020.csv'
# movies = pd.read_csv(csvFile)
# movies
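
`pd.read_csv()` accepts a URL, a file path, or any file-like object. The sketch below avoids a network call by reading from an in-memory string instead. The CSV text, the column names, and the values are made up for illustration and are not the contents of `top_movies_2020.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file; NOT the real dataset.
csv_text = """Title,Gross,Year
Film A,1000000,1999
Film B,2500000,2005
Film C,750000,2012
"""

# io.StringIO wraps the string so read_csv can treat it like an open file.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```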

We can use the `shape` attribute to check the size (rows, columns) of the DataFrame.

In [19]:
# movies.shape

We can examine the contents of the resultant DataFrame using the `head()` method.

By default, `head()` grabs the first five rows. You can specify the number of rows you want to print.

In [22]:
# movies.head(8)

We can use `tail()` to retrieve the **last** 5 rows of the DataFrame.

In [23]:
# movies.tail()
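
The behavior of `shape`, `head()`, and `tail()` is easy to verify on a small table. The DataFrame below is a hypothetical 10-row example, not the movies data.

```python
import pandas as pd

# Hypothetical 10-row table used only for illustration.
toy = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

print(toy.shape)                  # (10, 2)
print(len(toy.head()))            # 5, the default number of rows
print(toy.head(3)["n"].tolist())  # [0, 1, 2]
print(toy.tail(2)["n"].tolist())  # [8, 9]
```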

We can use the `dtypes` attribute to find out the data type of each column of the table.

In [24]:
# movies.dtypes

We can get even more information about the columns by calling the `info` method:

In [25]:
# movies.info()

This gives us a good deal more information. From top to bottom we can see:
- what class the object is (a DataFrame)
- what the index looks like (a range from Gone with the Wind to Liar Liar)
- how many data columns we have
- for each column, how many values and their dtype
- a summary of how many columns have each dtype
- how much memory the object is taking up (more on that in a future chapter)
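
The difference between `dtypes` and `info()` can be sketched on a small, made-up table. Note that `info()` prints its report rather than returning it; passing a buffer through the `buf` parameter is one way to capture the text.

```python
import io
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame(
    {"Title": ["Film A", "Film B"], "Gross": [1.5, 2.5], "Year": [1999, 2005]}
)

print(toy.dtypes)  # one dtype per column

# info() writes to standard output by default; buf= redirects the report.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)
```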

We can customize the column header names by assigning a list of new names to the `columns` attribute.

In [35]:
# movies.columns = ['Movie','Gross', 'Gross adj', 'Year']
# movies
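
Assigning to `columns` replaces every header at once, so the list must have one name per column. When only some headers need to change, `rename()` is an alternative. The sketch below shows both on a made-up table.

```python
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame({"title": ["Film A", "Film B"], "gross": [1.5, 2.5]})

# Replace all headers; the list length must match the number of columns.
toy.columns = ["Movie", "Gross"]

# rename() changes only the listed headers and returns a new DataFrame.
renamed = toy.rename(columns={"Gross": "Gross (millions)"})

print(toy.columns.tolist())
print(renamed.columns.tolist())
```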


We can select which columns we want to view. This does not change the DataFrame. However, we can assign these columns to a new DataFrame or overwrite the original DataFrame.

In [37]:
# movies[['Gross adj','Movie']]

In [38]:
# movies['Gross adj']
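
The two selection forms return different types: a list of names gives a DataFrame, while a single name gives a Series. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame(
    {"Movie": ["Film A", "Film B"], "Gross": [1.5, 2.5], "Year": [1999, 2005]}
)

subset = toy[["Gross", "Movie"]]  # list of names -> DataFrame, in the order given
column = toy["Gross"]             # single name -> Series

print(type(subset).__name__)
print(type(column).__name__)
print(toy.columns.tolist())       # the original DataFrame is unchanged
```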

To get even more information about the contents of the different columns, we can try the `describe` method:

In [44]:
## We assign the result to a new DataFrame. The original DataFrame stays intact.
# new_movies = movies.drop(columns=['Year'])

# new_movies.describe()
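
As a small check of the two ideas in this cell, the sketch below uses a made-up table to show that `drop()` returns a new DataFrame and that `describe()` returns a table of summary statistics indexed by statistic name.

```python
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame(
    {"Gross": [1.0, 2.0, 3.0, 4.0], "Year": [1999, 2005, 2012, 2020]}
)

trimmed = toy.drop(columns=["Year"])  # returns a new DataFrame
summary = trimmed.describe()

print(toy.columns.tolist())          # the original still has Year
print(summary.index.tolist())        # the statistics that describe() reports
print(summary.loc["mean", "Gross"])  # 2.5
```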

It's hard to read the numbers in scientific notation. We can change the number format.  
  
Change the display format for floats.  
We can use commas to separate thousands and show 2 decimal places. To learn more about Python string formatting: https://www.guru99.com/python-string-format.html

In [42]:
# pd.options.display.float_format = "{:,.2f}".format

# movies.drop(columns=['Year']).describe()

| | Gross | Gross adj |
|---|---|---|
| count | 200.00 | 200.00 |
| mean | 256,492,048.62 | 560,869,366.30 |
| std | 170,567,531.47 | 227,797,683.45 |
| min | 9,183,673.00 | 370,330,510.00 |
| 25% | 116,926,360.25 | 414,518,727.25 |
| 50% | 234,196,310.00 | 500,451,231.50 |
| 75% | 363,303,312.50 | 616,672,963.50 |
| max | 936,662,225.00 | 1,895,421,694.00 |
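
Setting `pd.options.display.float_format` changes the display for the whole session. If that is not wanted, one option is `pd.option_context`, which applies a setting only inside a `with` block. The sketch below uses a made-up value to show the format string on its own and then inside such a block.

```python
import pandas as pd

# Hypothetical data used only for illustration.
toy = pd.DataFrame({"Gross": [1234567.891, 2500000.0]})

# The same format string applied to a single number.
single = "{:,.2f}".format(1234567.891)
print(single)  # 1,234,567.89

# option_context applies the display setting only inside the with block.
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = toy.to_string()
print(formatted)

# A global change made through pd.options can be undone like this:
pd.reset_option("display.float_format")
```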


You can also run `describe()` on a specific column (which is a Series rather than a DataFrame).

In [None]:
# movies['Gross adj'].describe()
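
When called on a Series, `describe()` returns another Series whose index holds the statistic names, so individual values can be read by label. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values used only for illustration.
gross = pd.Series([10.0, 20.0, 30.0, 40.0], name="Gross adj")
stats = gross.describe()

print(type(stats).__name__)         # Series
print(stats["mean"])                # 25.0
print(stats["max"] - stats["min"])  # 30.0
```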

The `pd.read_csv()` function has over [30 optional parameters](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an `index_col`.

In [48]:
# movies = pd.read_csv(csvFile)

## Setting column 0 as index
# movies = pd.read_csv(csvFile, index_col = 0)
# movies.head()
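
The effect of `index_col` can be sketched without downloading anything. The CSV text below is made up and assumes a file whose first, unnamed column is a saved index. Without `index_col`, pandas keeps that column as data and labels it `Unnamed: 0`.

```python
import io
import pandas as pd

# Made-up CSV whose first, unnamed column is a saved index (an assumption).
csv_text = """,Title,Gross
0,Film A,1000000
1,Film B,2500000
"""

default = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())     # ['Unnamed: 0', 'Title', 'Gross']
print(with_index.columns.tolist())  # ['Title', 'Gross']
print(with_index.index.tolist())    # [0, 1]
```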