# pandas - Python Data Analysis Library
`pandas` is a software library written for data manipulation and analysis. It contains the `DataFrame` object for manipulating numerical tables and time series data. The dataframe in pandas combines aspects of `MATLAB` indexing with functionality similar to the statistical programming language `R`.
First, since `pandas` is an auxiliary library, you must always load it.

In [None]:
import pandas as pd

`pandas` provides an easy-to-use function, `read_table`, for reading delimited text files into a dataframe.

In [None]:
# read table
dftemps = pd.read_table('data/GlobalTempbyMonth.txt', header=None, index_col=0, sep=r'\s+')
print(type(dftemps))

In [None]:
# show data
dftemps

Notice that the dataframe has row and column labels that can be either strings or numbers. The row index of `dftemps` comes from the first column of the file (the dates), because we passed `index_col=0`. The index is also shown in bold when the dataframe is displayed. Every dataframe has an index that labels its rows; unique labels make lookups unambiguous, but pandas does not require them to be unique.
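
To see how `index_col=0` turns the first column into the row index, here is a small sketch that reads a few whitespace-separated lines from an in-memory string instead of a file. The date labels and values are invented for illustration and are not taken from `GlobalTempbyMonth.txt`. It also checks whether the resulting index labels are unique.

```python
import io
import pandas as pd

# Invented sample in a whitespace-separated layout (not the real data file)
sample = io.StringIO(
    "1903/10 0.1 0.2\n"
    "1903/11 0.3 0.4\n"
    "1903/11 0.5 0.6\n"  # repeated label on purpose
)
demo = pd.read_table(sample, header=None, index_col=0, sep=r"\s+")

print(demo.index.tolist())    # row labels taken from the first column
print(demo.columns.tolist())  # remaining columns keep their integer labels
print(demo.index.is_unique)   # False here: repeated index labels are allowed
```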

In [None]:
# show index
dftemps.index

`pandas` also has a function for reading Excel files, `read_excel`.

In [None]:
# read the 'Global Carbon Budget' sheet, skipping the first 20 rows of the sheet
dfcarbon = pd.read_excel('data/GlobalCarbonBudget2022.xlsx', sheet_name='Global Carbon Budget', skiprows=20)

In [None]:
dfcarbon

In [None]:
dfcarbon.columns

In [None]:
# map each column name to a shorter name made of its first two words
name_dict = {}
for name in dfcarbon.columns:
    namelist = name.split(' ')
    name_dict[name] = ' '.join(namelist[0:2])
print(name_dict)

In [None]:
newdfcarbon = dfcarbon.rename(columns = name_dict)

In [None]:
newdfcarbon

## Indexing into a dataframe (aka slicing a dataframe)
The two indexers that are essential for accessing information in a dataframe are `loc` and `iloc`. `loc` selects by row and column labels (the index and column names, which may be strings or numbers). `iloc` selects by integer position only, counting from 0, regardless of the labels.
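
As a minimal sketch of the difference between selecting by label and selecting by position, the example below builds a tiny dataframe with invented values and hypothetical labels (`x`, `y`, `z`, `a`, `b`). It also shows one detail that is easy to miss: label slices include the end label, while position slices stop before the end position.

```python
import pandas as pd

# Invented values with labelled rows and columns
demo = pd.DataFrame(
    {"a": [10, 20, 30], "b": [40, 50, 60]},
    index=["x", "y", "z"],
)

print(demo.loc["y", "b"])  # by label: row 'y', column 'b' -> 50
print(demo.iloc[1, 1])     # by position: second row, second column -> 50

# Label slices include the end label; position slices exclude the end position
print(demo.loc["x":"y", "a"].tolist())  # [10, 20]
print(demo.iloc[0:1, 0].tolist())       # [10]
```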

In [None]:
dftemps.iloc[3,5]

In [None]:
dftemps.loc['1903/10']

In [None]:
dfcarbon['ocean sink'][0:6]

In [None]:
# loc slices by label and includes the end label, so this returns the rows labelled 0 through 6
dfcarbon.loc[0:6,'ocean sink']

In [None]:
dfcarbon.iloc[:,2]

In [None]:
dfcarbon['Year']

In [None]:
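x = 'ocean sink'
# dfcarbon.x would look for a column literally named 'x' and raise an AttributeError;
# use square brackets when the column name is stored in a variable
dfcarbon[x]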

## Adding to Dataframes
You can add columns to a dataframe by assigning to a new column name, for example `dftemps['average'] = ...`. Notice that `pandas` has a number of methods associated with dataframes, such as `mean`, `min`, and `max`. These methods act directly on the dataframe. Other statistical tools are also available in pandas; see [Dataframe statistical methods](https://studyopedia.com/pandas/statistical-functions/). The basic statistical methods are:

| Method      | Description                                      |
|-------------|--------------------------------------------------|
| sum()       | Return the sum of the values.                    |
| count()     | Return the count of non-empty values.            |
| max()       | Return the maximum of the values.                |
| min()       | Return the minimum of the values.                |
| mean()      | Return the mean of the values.                   |
| median()    | Return the median of the values.                 |
| std()       | Return the standard deviation of the values.     |
| describe()  | Return the summary statistics for each column.   |
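
The `axis` argument decides whether these methods work down the columns or across the rows. Here is a small sketch with invented numbers and hypothetical column names (`a`, `b`, `row_max`) that shows both directions and stores one result as a new column.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [3.0, 6.0, 9.0]})

col_means = demo.mean()        # one value per column (axis=0 is the default)
row_means = demo.mean(axis=1)  # one value per row

print(col_means.tolist())  # [2.0, 6.0]
print(row_means.tolist())  # [2.0, 4.0, 6.0]

# A method result can be stored as a new column
demo["row_max"] = demo[["a", "b"]].max(axis=1)
print(demo["row_max"].tolist())  # [3.0, 6.0, 9.0]
```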


In [None]:
# add new columns
dftemps['average'] = dftemps.mean(axis=1)
dftemps['min'] = dftemps.min(axis=1)
dftemps['max'] = dftemps.max(axis=1)

In [None]:
dftemps

In [None]:
dftemps.describe()

In [None]:
# sort by average
dftemps2 = dftemps.sort_values(by='average', ascending=False)

In [None]:
dftemps2

In [None]:
# get a certain column
dftemps2.average

In [None]:
newdfcarbon['total carbon'] = newdfcarbon['fossil emissions'] + newdfcarbon['land-use change'] 

In [None]:
newdfcarbon

## Reforming a Dataframe
Part of data wrangling is manipulating data into the most useful form. Suppose we wanted the average temperature for every year rather than for every month. You can use `groupby` to group the rows by a shared value, such as the year, and then summarize each group. *Can you describe what happens in each of the following lines?*
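
If `groupby` is new to you, the following sketch may help before you read the real cell. It uses a tiny dataframe with invented values and assumes `YYYY/MM` style labels, so the first four characters of each label are the year.

```python
import pandas as pd

# Invented monthly values with 'YYYY/MM' labels
demo = pd.DataFrame(
    {"average": [1.0, 3.0, 2.0, 6.0]},
    index=["1903/01", "1903/02", "1904/01", "1904/02"],
)

# The first four characters of each label are the year
demo["year"] = [label[:4] for label in demo.index]

# Collect the rows that share a year, then average each group
yearly = demo.groupby("year")["average"].mean()
print(yearly.to_dict())  # {'1903': 2.0, '1904': 4.0}
print(type(yearly))      # the result is a Series indexed by year
```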

In [None]:
# group by year and calculate the average temperature
dftemps['year'] = list(map(lambda x:x[:4], dftemps.index))
year_average = dftemps.groupby(dftemps.year).average.mean()
# set_index returns a new dataframe; without assignment, dftemps itself is unchanged
dftemps.set_index(dftemps.year)
year_average

In [None]:
# groupby followed by mean on a single column returns a Series
type(year_average)

Alternatively, you can use `pivot_table`, which also groups rows by common entries (specified by the `index` parameter) and, by default, calculates the `mean` across them (you can specify other operations with the `aggfunc` parameter).
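
Here is a rough sketch of `pivot_table` on a tiny dataframe with invented values, once with the default mean and once with a different operation passed through `aggfunc`.

```python
import pandas as pd

# Invented values; 'year' plays the role of the grouping column
demo = pd.DataFrame(
    {
        "year": ["1903", "1903", "1904", "1904"],
        "average": [1.0, 3.0, 2.0, 6.0],
    }
)

# Default aggregation is the mean of each group
by_mean = pd.pivot_table(demo, values="average", index="year")
# A different operation can be requested with aggfunc
by_max = pd.pivot_table(demo, values="average", index="year", aggfunc="max")

print(by_mean["average"].tolist())  # [2.0, 4.0]
print(by_max["average"].tolist())   # [3.0, 6.0]
```

For a single column, the default result holds the same numbers as `demo.groupby("year")["average"].mean()`, but it comes back as a dataframe rather than a Series.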

In [None]:
pivottemps = pd.pivot_table(dftemps, values=['average', 'min', 'max'], index='year')

In [None]:
pivottemps

## Reordering a Dataframe
You can reorder the columns by using the `iloc` indexer and listing the column positions in the new order.
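
Listing positions works, but it can be hard to read later. As an alternative sketch (not the approach used in this notebook), you can list the column names instead. The dataframe and its column names below are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"Year": [1959, 1960], "a": [1, 2], "b": [3, 4], "total": [4, 6]})

# By position: move the last column (position 3) to second place
by_position = demo.iloc[:, [0, 3, 1, 2]]

# By name: the same order, spelled out with column labels
by_name = demo[["Year", "total", "a", "b"]]

print(by_position.columns.tolist())  # ['Year', 'total', 'a', 'b']
print(by_position.equals(by_name))   # True
```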

In [None]:
df2 = newdfcarbon.iloc[:,[0,1,2,8,3,4,5,6,7]]

In [None]:
df2

## Writing to a File
After manipulating data in a dataframe, you can easily write the modified table to a CSV, Excel, or JSON file using the `to_csv`, `to_excel`, or `to_json` methods.
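
One detail worth knowing: by default, `to_csv` also writes the row index as the first column. The sketch below uses an invented dataframe and an in-memory buffer in place of a real file to show a round trip with `index=False`, which leaves the index out.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Year": [1959, 1960], "total": [4.0, 6.0]})

buffer = io.StringIO()            # stands in for a file on disk
demo.to_csv(buffer, index=False)  # index=False leaves out the row index column
print(buffer.getvalue())

buffer.seek(0)                    # rewind before reading the text back
restored = pd.read_csv(buffer)
print(restored.equals(demo))      # True
```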

In [None]:
newdfcarbon.to_csv('pandas_carbon.csv')

In [None]:
newdfcarbon.to_excel('pandas_carbon.xlsx')

In [None]:
newdfcarbon.to_json('pandas_carbon.json')

## One more thing....
`matplotlib` can take dataframe columns as arguments. This allows you to plot information from dataframes without too much effort.

In [None]:
import matplotlib.pyplot as mp

In [None]:
mp.plot(newdfcarbon.Year, newdfcarbon['total carbon'],newdfcarbon.Year,newdfcarbon['atmospheric growth'])
mp.xlabel('Year')
mp.ylabel('Carbon emitted (Gt)')
mp.legend(['total carbon','atmospheric growth'])

## Oops...One more thing
This gives me an idea. We have carbon emission data. Is there a correlation between those emissions and the temperature anomaly in `globaltemps.txt`?
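
One possible way to approach this question, sketched with invented numbers rather than the real files, is to line the two tables up by year with `merge` and then compute a Pearson correlation coefficient with `corr`. The column name `anomaly` is hypothetical; the real temperature file has no header row, so its columns are numbered instead.

```python
import pandas as pd

# Invented numbers for illustration only (not real measurements)
emissions = pd.DataFrame(
    {"Year": [2000, 2001, 2002, 2003], "total carbon": [8.0, 8.5, 9.0, 9.5]}
)
anomaly = pd.DataFrame(
    {"Year": [2001, 2002, 2003, 2004], "anomaly": [0.50, 0.55, 0.60, 0.70]}
)

# Keep only the years present in both tables
both = emissions.merge(anomaly, on="Year")

# Pearson correlation coefficient between the two columns
r = both["total carbon"].corr(both["anomaly"])
print(both)
print(r)
```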

In [None]:
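# the file has no header row, so the columns are numbered 0, 1, ...
gtemps = pd.read_table('data/globaltemps.txt', header=None)
# keep the rows whose first column (the year) is strictly between 1958 and 2022
pgtemps = gtemps[(gtemps[0]>1958) & (gtemps[0]<2022)]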

In [None]:
pgtemps

# Exercise: Explorin'
After reading about *Exercise 3: Explorin'*, use this space to explore the file `Exoplanet_Archive10.2025.csv`.

In [None]:
# create exoplanet dataframe
dfplanets = pd.read_csv('data/Exoplanet_Archive10.2025.csv')
dfplanets