# Pandas

While NumPy can be used to import data, it is optimized for numerical data. Many data sets also include categorical variables. For these data sets, it is best to use a library called `pandas`, which focuses on creating and manipulating data frames.

### Read data
With `pandas` imported, we can read in .csv files with the `pandas` function `read_csv()`.

In that function, we can specify the file we want to use with a URL or with the path to a local file as a string.

This saves the data in a structure called a DataFrame.

We are going to be using data on [long term average precipitation and temperature values in Boston from ~1980s-2010 from NOAA](https://www.ncei.noaa.gov/data/normals-monthly/doc/NORMAL_MLY_documentation.pdf).

In [87]:
import pandas as pd
filename = "https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/boston_precip_temp.csv"  # a local file path also works
df = pd.read_csv(filename)



Our data is now saved as a data frame in Python as the variable `df`. With the data now in the environment, we can take a look at the first few rows with `df.head()`.

We can see that this data frame has several different columns, with information about stations, precipitation and temperature.
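
As a minimal sketch of the same steps, the example below reads a small in-memory table instead of the NOAA file. The column names and values are made up for illustration, and `example_df` is a hypothetical variable name chosen so that `df` is not overwritten.

```python
import io
import pandas as pd

# Made-up stand-in for a .csv file. read_csv accepts a local path, a URL,
# or a file-like object such as this one.
csv_text = """station,month,temp,precip
A,1,29.0,3.4
A,2,31.5,3.2
B,1,25.0,2.9
"""

example_df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first five rows by default; pass a number to see fewer.
print(example_df.head(2))
```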

If you have an Excel file, you can use `pd.read_excel()` instead. You can also specify which sheet to read with `sheet_name`. The default is the first sheet. You can pass either a single sheet name or a list of sheet names; a list gives you back a dictionary of pandas DataFrames, one per sheet.

If you pass `sheet_name=None`, you will get all of the sheets back as a dictionary.

In [5]:
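data_dir = ""  # folder that holds local copies of the data files, if you have them

# URL of the Excel version of the Boston data
xlsx = "https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/boston_precip_temp.xlsx"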




In [4]:
pd.read_excel(xlsx)  # default: reads only the first sheet

In [3]:
pd.read_excel(xlsx, sheet_name=['Sheet1','Sheet2'])

In [None]:
pd.read_excel(xlsx, sheet_name=None)  # all sheets, returned as a dictionary of DataFrames

## Making sure data is in correct form

When a file does not follow the standard layout, `read_csv()` can misread it. This tends to happen when the first line of the .csv file does not contain the column names.

For an example, we'll take a look at [a data set of two files on arctic vegetation plots](http://dx.doi.org/10.3334/ORNLDAAC/1358).

In [2]:
# environmental_data = data_dir + "Arrigetch_Peaks_Environmental_Data_raw.csv"  # use this line for a local copy
environmental_data = "https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/Arrigetch_Peaks_Environmental_Data_raw.csv"
pd.read_csv(environmental_data)

In [6]:
species_file = data_dir + "Arrigetch_Peaks_Species_Data_raw.csv"  # set data_dir to the folder that holds your copy

pd.read_csv(species_file)
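
One common fix is to tell `read_csv()` to skip the lines that come before the column names. The sketch below uses a made-up file with two lines of notes at the top. The layout of the Arrigetch Peaks files may differ, so treat the value of `skiprows` here as an assumption to adjust after looking at the raw file.

```python
import io
import pandas as pd

# Made-up file: the first two lines are notes, the third holds the column names.
raw_text = """Exported from a field database
Units: cm
plot,elevation,cover
1,900,40
2,950,55
"""

# skiprows=2 drops the two note lines so the third line becomes the header.
fixed = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(fixed.columns.tolist())
```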

### Question

For in-class questions, we'll be working with a data set called Gapminder. It is in the `data` subdirectory in this repo as `gapminder.csv`. You can also find it at this stable URL: `https://raw.githubusercontent.com/ENVS110a-SP23/python/main/data/gapminder.csv`.

Load this data set and display the first few rows with `.head()`. **Save it under a variable name other than `df` so that you don't overwrite the precipitation and temperature data frame.**

In [None]:
# your code here

## Summarize data frame

It is important to understand the data we are working with before we begin analysis. First, let's look at the dimensions of the data frame using `df.shape`. It gives the number of rows followed by the number of columns.

This shows that our data frame has 300 rows by 7 columns.

We can get out those numbers individually through indexing.

`len(df)` also gets back how many rows you have.

We can also use `df.columns` to display the column names.
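
The calls above can be sketched on a small made-up data frame. The name `toy` and its values are assumptions for illustration, not the Boston data.

```python
import pandas as pd

# Made-up data frame with 4 rows and 3 columns
toy = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "temp": [29.0, 31.5, 38.3, 48.1],
    "precip": [3.4, 3.2, 4.3, 3.7],
})

print(toy.shape)           # (4, 3)
n_rows = toy.shape[0]      # index 0 of the shape tuple: number of rows
n_cols = toy.shape[1]      # index 1 of the shape tuple: number of columns
print(len(toy))            # same value as toy.shape[0]
print(list(toy.columns))   # the column names as a plain list
```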

## Renaming columns and rows

We can rename as many columns as we want with `df.rename(columns = {old_name:new_name,...})`.

Note that you need to re-assign to `df` or make a new variable if you want to save the renamed columns.

We can also rename rows by passing `index` instead of `columns`. This is less common, however.
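
A short sketch of both kinds of renaming, using a made-up data frame. All names here are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"temp": [29.0, 31.5], "precip": [3.4, 3.2]})

# rename() returns a new data frame; toy itself keeps its old column names
# unless we assign the result back to toy.
renamed = toy.rename(columns={"temp": "temperature", "precip": "precipitation"})

# The index argument renames row labels in the same way.
relabeled = renamed.rename(index={0: "jan", 1: "feb"})

print(list(toy.columns))
print(list(renamed.columns))
print(list(relabeled.index))
```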

### Question 

Using the gapminder data frame, print out the column names. Rename the `age5_surviving` and `babies_per_woman` columns to be shorter.

In [None]:
# your code here: 

### Categorical variables
Next, let's summarize the categorical, non-numerical variables. For instance, we can identify how many unique regions we have in the data set.

First, to select a column, we use the notation `df['COLUMN_NAME']`.

If the column name is a valid Python name (no spaces or hyphens) and does not clash with an existing data frame attribute, you can also refer to the column with `df.column_name`.

To identify unique entries in this column, we can use the `pd.unique()` function. 

We can also just use the `len()` function to see how many unique values we have.
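
Here is a minimal sketch of selecting a column and counting its unique values. The data frame and its `region` column are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["Asia", "Europe", "Asia", "Africa"],
    "population": [10, 20, 30, 40],
})

regions = toy["region"]    # bracket notation works for any column name
same_column = toy.region   # attribute notation works for simple names

# pd.unique keeps the values in order of first appearance
unique_regions = pd.unique(toy["region"])
print(unique_regions)
print(len(unique_regions))
```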

### Numerical variables

Numerical columns can be summarized in several ways. Let's find the mean first.

To make things simpler, we'll just do calculations on the numerical columns: `date`, `temp`, `diurnal_temp_range`, `precip-total`, and `snow-totals`. We can put those names in a `list` and then use that list to select the columns.

In [7]:
num_cols = [ 'date', 'temp', 'diurnal_temp_range', 'precip-total','snow-totals' ] # numerical columns



With this set of columns, we can run `.mean()` to find the mean of each column.

If we want a larger variety of summary statistics, we can use the `.describe()` method.

We can also break down subgroupings of our data with the method `.groupby()`.
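
The three summaries above can be sketched as follows. The data frame, its `season` grouping column, and the values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "season": ["winter", "winter", "summer", "summer"],
    "temp": [30.0, 32.0, 70.0, 74.0],
    "precip": [3.0, 4.0, 3.0, 5.0],
})
cols = ["temp", "precip"]  # the numerical columns to summarize

means = toy[cols].mean()        # one mean per column
summary = toy[cols].describe()  # count, mean, std, min, quartiles, max

# Split the rows by season, then take the mean of each column within each group
by_season = toy.groupby("season")[cols].mean()

print(means)
print(summary)
print(by_season)
```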

### Question

Using the gapminder data, use `.groupby()` to get summary statistics by region.

In [None]:
## your code here: 

### Accessing rows and specific entries

You can also access a specific row using `df.loc[ROW, :]`, where `ROW` is the row's index label. With the default index, the label is the row number. The colon selects all columns for that row.

We can use `.loc` to find the value of specific entries, as well.
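
A small sketch of both uses of `.loc`, on a made-up data frame with the default index (row labels 0, 1, 2):

```python
import pandas as pd

toy = pd.DataFrame({"month": [1, 2, 3], "temp": [29.0, 31.5, 38.3]})

row = toy.loc[1, :]          # the whole row labeled 1, returned as a Series
value = toy.loc[2, "temp"]   # a single entry: row label 2, column "temp"

print(row)
print(value)
```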

## for loops

### Math

If we multiply a column of a data frame by a single number, each value in the column is multiplied by that number.

We can turn this into a new column by assigning to `df['new_col_name']`.

NumPy functions work very well with numerical columns.

This new column is now reflected in the data frame. 

We can also do math between columns, since they have the same length. Elements of the same row are added, subtracted, multiplied, or divided.
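
The operations in this section can be sketched on a made-up data frame. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "temp_f": [32.0, 50.0, 212.0],
    "precip_in": [1.0, 4.0, 9.0],
})

# Math with a single number applies to every value in the column;
# assigning to a new column name stores the result in the data frame.
toy["precip_mm"] = toy["precip_in"] * 25.4
toy["temp_c"] = (toy["temp_f"] - 32) * 5 / 9

# NumPy functions work element by element on a numerical column
toy["precip_sqrt"] = np.sqrt(toy["precip_in"])

# Math between two columns pairs up the values in each row
toy["ratio"] = toy["temp_f"] / toy["precip_in"]

print(toy)
```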


### Create your own data frame

To make your own data frame without a .csv, we use the function `pd.DataFrame()`. There are many ways to use this function to construct a data frame. 

Here, we show how to convert a dictionary of lists into a data frame. Each list will be its own column, and you need to make sure the lists are all the same length. The keys of the dictionary should be the column names.

In [8]:
data_dict = {
    'a': [1, 3, 5],
    'b': ['apple', 'banana', 'apple'],
    'c': [-2., -3., -5.]
}
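
Passing that dictionary to `pd.DataFrame()` builds the data frame. The sketch below repeats the dictionary so that it runs on its own; `dict_df` is a hypothetical variable name.

```python
import pandas as pd

data_dict = {
    'a': [1, 3, 5],
    'b': ['apple', 'banana', 'apple'],
    'c': [-2., -3., -5.]
}

# Each key becomes a column name and each list becomes that column's values
dict_df = pd.DataFrame(data_dict)
print(dict_df)
```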

You can also use lists of lists or 2D NumPy arrays to create data frames. Each inner list will be a row, instead of a column, and you will need to specify the column names with another argument in `pd.DataFrame()` called `columns`.

In [9]:
data_list = [
    [1, 'apple', -2.],
    [3, 'banana', -3.],
    [5, 'apple', -5.]
]


Note: we need to save this as a variable to use it in the future.

### Export data frame as .csv

If you have made modifications to a data set in Python and want to export that to a new .csv, you can easily do that with the `.to_csv()` method that all pandas data frames have.

In [55]:
my_df = pd.DataFrame(data_list, columns=['a', 'b', 'c'])
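
A minimal sketch of the export step follows. The file name `my_data.csv` is hypothetical, and the file is written to a temporary folder here only so the example cleans up after itself; in practice you would pass your own path. Setting `index=False` is a choice made for this sketch so that the row labels are not written as an extra column.

```python
import os
import tempfile
import pandas as pd

data_list = [
    [1, 'apple', -2.],
    [3, 'banana', -3.],
    [5, 'apple', -5.]
]
my_df = pd.DataFrame(data_list, columns=['a', 'b', 'c'])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "my_data.csv")  # hypothetical file name
    my_df.to_csv(out_path, index=False)              # write the data frame to disk
    round_trip = pd.read_csv(out_path)               # read it back to check the result

print(round_trip)
```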


#### Question: Putting it together

In assignment 2, we moved information gathered from some researchers into a nested data structure. Instead, transfer these data into a pandas data frame. Display the data frame, and export it as a .csv file.

As a reminder, each list is in the same order as the researchers' names, so all of Haley McCann's data is at index `0`.

In [None]:
researchers = ['Haley McCann', 'Siena Welch', 'Jaylin Mercado', 'Ismael Hayden', 'Nina Bright']

temperatures = [29.75, 12.63, 31.58, 7.16, 32.51]

populations = [442, 336, 505, 913, 933]

dates = ['5/25/2022','3/18/2022','6/28/2022','11/11/2022','7/6/2023']

### Your code here:


## Resources

- [Pandas docs](https://pandas.pydata.org/docs/)
- [Pandas getting started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)
- [Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [PySpark for big data](https://spark.apache.org/docs/latest/api/python/)

This lesson is adapted from 
[Software Carpentry](http://swcarpentry.github.io/python-novice-gapminder/design/).