# Processing Data With Pandas
During the first part of this lesson you learned the basics of pandas data structures (Series and DataFrame) and got familiar with basic methods for loading and exploring data. Here, we will continue with basic data manipulation and analysis methods, such as calculations and selections.

Our focus here will be the following:
* Basic calculations
* Filtering and updating data
* Dealing with missing data
* Data type conversions
* Unique values
* Sorting data
* Grouping and transforming data
* Joining data
* Writing data to a file

---
We are now working in a new notebook file and we need to import pandas again.

In [1]:
import pandas as pd

Let's work with the HUC_12 land cover data we used in the last exercise. Again, these data are stored in an online repository, which we can read directly with pandas. However, we must specify that the `HUC_12` column be read in as a string, not as a number.

In [8]:
# Re-read EnviroAtlas data into the land_df dataframe, ensuring HUC_12 is a string
data_url = 'https://github.com/ENV859/EnviroAtlasData/blob/main/LandCover.csv?raw=true'
land_df = pd.read_csv(data_url,dtype={'HUC_12':'str'})
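
To see why the `dtype` argument matters, here is a minimal sketch using a tiny in-memory CSV rather than the real file. The two rows and their values are made up for illustration; only the `HUC_12` column name and the `dtype={'HUC_12':'str'}` argument come from the cell above. It assumes the codes can begin with a zero, which is the usual reason to keep an identifier as text.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the codes and values are made up for illustration
csv_text = "HUC_12,PFOR\n010100020101,81.6\n030300030205,57.3\n"

# Default parsing infers an integer column, so any leading zero is dropped
as_number = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps every character of the code
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'HUC_12': 'str'})

print(as_number['HUC_12'].tolist())  # integers, leading zero lost
print(as_text['HUC_12'].tolist())    # strings, leading zero kept
```

Once a leading zero has been dropped by numeric parsing it cannot be recovered reliably, so it is safer to set the type when reading the file.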

It's always a good idea to view the data to ensure it was read correctly.

In [9]:
land_df.head()

| | HUC_12 | N_INDEX | PFOR | PWETL | PDEV | PAGT | PAGP | PAGC | PFOR90 | PWETL95 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10100020101 | 100.0 | 81.643204 | 11.3925 | 0.0 | 0.0 | 0.0 | 0.0 | 92.245598 | 0.789988 |
| 1 | 10100020102 | 100.0 | 74.1082 | 12.3005 | 0.0 | 0.0 | 0.0 | 0.0 | 86.143501 | 0.265201 |
| 2 | 10100020103 | 100.0 | 78.816101 | 13.6751 | 0.0 | 0.0 | 0.0 | 0.0 | 92.095398 | 0.395778 |
| 3 | 10100020104 | 100.0 | 72.776901 | 7.57982 | 0.0 | 0.0 | 0.0 | 0.0 | 80.097702 | 0.259059 |
| 4 | 10100020105 | 100.0 | 74.281403 | 13.1165 | 0.0 | 0.0 | 0.0 | 0.0 | 81.182098 | 6.21581 |


## 1. Basic calculations
One of the most common things to do in pandas is to create new columns based on calculations between different variables (columns).

We can create a new column `FOR_WET` in our DataFrame by specifying the name of the column and giving it some default value (in this case the decimal number 0.0).

In [10]:
#Define a new column "FOR_WET"
land_df['FOR_WET'] = 0.0

#Check how the dataframe looks now
land_df.head()

| | HUC_12 | N_INDEX | PFOR | PWETL | PDEV | PAGT | PAGP | PAGC | PFOR90 | PWETL95 | FOR_WET |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10100020101 | 100.0 | 81.643204 | 11.3925 | 0.0 | 0.0 | 0.0 | 0.0 | 92.245598 | 0.789988 | 0.0 |
| 1 | 10100020102 | 100.0 | 74.1082 | 12.3005 | 0.0 | 0.0 | 0.0 | 0.0 | 86.143501 | 0.265201 | 0.0 |
| 2 | 10100020103 | 100.0 | 78.816101 | 13.6751 | 0.0 | 0.0 | 0.0 | 0.0 | 92.095398 | 0.395778 | 0.0 |
| 3 | 10100020104 | 100.0 | 72.776901 | 7.57982 | 0.0 | 0.0 | 0.0 | 0.0 | 80.097702 | 0.259059 | 0.0 |
| 4 | 10100020105 | 100.0 | 74.281403 | 13.1165 | 0.0 | 0.0 | 0.0 | 0.0 | 81.182098 | 6.21581 | 0.0 |


In [11]:
#Check the datatype of the newly created column
land_df['FOR_WET'].dtypes

dtype('float64')

We see that pandas created a new column and automatically recognized its data type as float, because we passed it the value 0.0.
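
As a small sketch of how the default value drives the column type, the cell below assigns two different scalars to a toy dataframe. `demo_df`, `AS_FLOAT`, and `AS_INT` are hypothetical names used only for this illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for land_df
demo_df = pd.DataFrame({'PFOR': [81.6, 74.1, 78.8]})

# A scalar is broadcast to every row; its Python type determines the dtype
demo_df['AS_FLOAT'] = 0.0   # float literal gives a float column
demo_df['AS_INT'] = 0       # int literal gives an integer column

# Compare the resulting column types
print(demo_df.dtypes)
```

Starting with `0.0` rather than `0` is a reasonable habit when the column will later hold decimal values, although pandas will also change the type if a later assignment requires it.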

Let’s update the column `FOR_WET` by calculating the sum of the `PFOR` and `PWETL` columns to get an idea of how much of each HUC12 is either forest or wetland:

In [13]:
#Calculate the sum of forest and wetland
land_df['FOR_WET'] = land_df['PFOR'] + land_df['PWETL']

#Check the result
land_df.sample(5)

| | HUC_12 | N_INDEX | PFOR | PWETL | PDEV | PAGT | PAGP | PAGC | PFOR90 | PWETL95 | FOR_WET |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5964 | 30300030205 | 67.609398 | 57.266399 | 0.334891 | 4.44375 | 27.946899 | 23.6227 | 4.32415 | 57.601299 | 0.0 | 57.60129 |
| 38721 | 101303040108 | 97.071899 | 0.005208 | 0.290625 | 0.619792 | 2.30833 | 1.88958 | 0.41875 | 0.267708 | 0.028125 | 0.295833 |
| 12901 | 40201010205 | 81.829498 | 51.464699 | 27.174 | 7.15986 | 11.0107 | 3.36709 | 7.6436 | 75.214203 | 3.42446 | 78.638699 |
| 55345 | 120901050201 | 57.076199 | 2.18326 | 0.085683 | 22.918501 | 20.005301 | 0.227389 | 19.777901 | 2.26894 | 0.0 | 2.268943 |
| 40126 | 101600050402 | 2.46209 | 0.405685 | 1.30659 | 3.88199 | 93.655899 | 3.8596 | 89.796303 | 0.405685 | 1.30659 | 1.712275 |


The calculations were stored into the `FOR_WET` field as expected.
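
Column arithmetic like this works row by row. The sketch below repeats the calculation on a hypothetical three-row dataframe, with made-up values and one deliberately missing entry, to show what happens when one of the inputs is absent. `FOR_WET_FILLED` is an illustrative name, and `Series.add` with `fill_value` is shown only as one possible alternative, not as the approach used in this lesson.

```python
import pandas as pd

# Hypothetical values; the third row has no PFOR measurement
demo_df = pd.DataFrame({
    'PFOR':  [80.0, 50.0, None],
    'PWETL': [10.0, 25.0, 5.0],
})

# The + operator adds element-wise; a missing input gives a missing result
demo_df['FOR_WET'] = demo_df['PFOR'] + demo_df['PWETL']

# Series.add with fill_value=0 treats a single missing operand as zero
demo_df['FOR_WET_FILLED'] = demo_df['PFOR'].add(demo_df['PWETL'], fill_value=0)

print(demo_df)
```

Whether a missing value should count as zero depends on what the data mean, so the plain `+` behavior is often the safer default.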

## 2. Filtering and updating data
In the previous exercise, we reviewed how we can select specific columns from our data. Let's improve on that with some examples on how to select both sets of rows and columns.

### Indices and Selecting rows
Rows in a dataframe can be selected by their indices, but we have a choice of which index to use: a *label* index or an *intrinsic* index. So what's the difference, and why are there two?

* The **label index** is a value assigned to each row that we can control. By default, when we import a CSV file into a dataframe, this index is simply a numeric value assigned sequentially to each row. However, we can assign this index to be any value. In our case, we could use the `HUC_12` codes as the row labels.

* The **intrinsic index** (also known as the *integer index*) is the zero-based position of each row in the dataframe.

In [14]:
land_df.index

RangeIndex(start=0, stop=82915, step=1)
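
The difference between selecting by label and selecting by integer position can be sketched on a toy dataframe. The three rows below are hypothetical, and the use of `set_index`, `.loc`, and `.iloc` is an assumption about how one might explore the two kinds of index with standard pandas calls, not necessarily the steps this lesson goes on to take.

```python
import pandas as pd

# Hypothetical rows; the HUC_12 codes are kept as strings
demo_df = pd.DataFrame({
    'HUC_12': ['010100020101', '010100020102', '010100020103'],
    'PFOR': [81.6, 74.1, 78.8],
})

# Replace the default sequential labels with the HUC_12 codes
labeled_df = demo_df.set_index('HUC_12')

# .loc selects by label: the row labeled '010100020102', column 'PFOR'
by_label = labeled_df.loc['010100020102', 'PFOR']

# .iloc selects by position: second row, first column
by_position = labeled_df.iloc[1, 0]

# Both expressions reach the same cell
print(by_label, by_position)
```

With the default `RangeIndex`, labels and positions happen to coincide, which is why the two are easy to confuse until the labels are changed.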

## 3. Dealing with missing data

## 4. Data type conversions

## 5. Unique values

## 6. Sorting data

## 7. Grouping and transforming data

## 8. Joining data

## 9. Writing data to a file