# Introduction to pandas

Pandas is an open-source Python library for data analysis and manipulation. It is especially handy for loading, cleaning, and filtering tabular data.

We'll be learning about pandas using a snake dataset. 

The data is available on Kaggle [here](https://www.kaggle.com/datasets/kkhandekar/conservation-status-of-the-worlds-snakes).



In [39]:
import pandas as pd 

## Opening flat files and creating dataframes

read_csv: load data from flat files

Pandas allows you to create dataframes from flat files such as CSV files. Excel spreadsheets are handled by a separate function, read_excel.

A dataframe is essentially a spreadsheet: a table with labeled rows and columns.

In [40]:
# Here we create a new dataframe by calling the read_csv function with the file name as its argument.
# We don't need to specify the sep argument here because a comma is the default separator.
# For a file with a different separator (for example, a tab-separated .tsv file), we would pass sep='\t'.

snakes = pd.read_csv('ConservationStatusofWorldSnakes.csv')

snakes


Unnamed: 0,Family,Scientific name,Common name(s),Red List status,Population trend
0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
...,...,...,...,...,...
244,TYPHLOPIDAE,Typhlops syntherus,Barahona Peninsula Blindsnake,NT,unknown
245,VIPERIDAE,Vipera barani,,NT,decreasing
246,VIPERIDAE,Vipera lotievi,Lotiev's Viper,NT,decreasing
247,VIPERIDAE,Vipera monticola,Atlas Dwarf Viper,NT,decreasing
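
As a minimal sketch of the sep argument mentioned in the comments above, the example below reads tab-separated text. The data is a small made-up sample held in memory with io.StringIO, standing in for a real .tsv file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for a .tsv file on disk.
tsv_text = (
    "Family\tScientific name\tRed List status\n"
    "ELAPIDAE\tAipysurus foliosquama\tCR\n"
    "VIPERIDAE\tVipera monticola\tNT\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas.
tsv_snakes = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_snakes.shape)          # (2, 3)
print(list(tsv_snakes.columns))  # ['Family', 'Scientific name', 'Red List status']
```

With a real file, you would pass the file name instead of the StringIO object.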


### Load only a certain number of rows

If you only want to load a limited number of rows, you can use the nrows parameter of read_csv.

In [41]:
just_six_snakes = pd.read_csv('ConservationStatusofWorldSnakes.csv', nrows=6)

just_six_snakes

Unnamed: 0,Family,Scientific name,Common name(s),Red List status,Population trend
0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
5,DIPSADIDAE,Alsophis antiguae,Antiguan Racer,CR,increasing
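
The "Unnamed: 0" column in the output above appears because the first column of the file has no header. One possible way to handle it, sketched here on a small made-up sample rather than the real file, is to combine nrows with index_col=0 so that the unnamed column becomes the index.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has an empty header,
# similar in shape to the snake file.
csv_text = (
    ",Family,Red List status\n"
    "0,BOLYERIDAE,EX\n"
    "1,DIPSADIDAE,EX\n"
    "2,TYPHLOPIDAE,EX\n"
    "3,ELAPIDAE,CR\n"
)

# nrows=2 stops after two data rows; index_col=0 uses the first column as the index.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2, index_col=0)

print(len(first_two))           # 2
print(list(first_two.columns))  # ['Family', 'Red List status']
```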


## Data cleanup

Pandas provides many tools for data cleanup, such as renaming columns, changing the index, filtering rows, and handling missing values.

### Renaming columns

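In the dataset, the column names contain spaces and capital letters, which makes them awkward to use in expressions such as query strings.

To make analysis easier, we'll use the rename method to give the columns lowercase names with underscores.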


In [42]:
# Call the rename method.
# Pass the columns parameter a dictionary that maps each old name (key) to its new name (value).

snakes = snakes.rename(columns={'Scientific name': 'scientific_name',
                                'Common name(s)': 'common_name',
                                'Red List status': 'red_list_status',
                                'Population trend': 'population_trend'})
snakes

Unnamed: 0,Family,scientific_name,common_name,red_list_status,population_trend
0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
...,...,...,...,...,...
244,TYPHLOPIDAE,Typhlops syntherus,Barahona Peninsula Blindsnake,NT,unknown
245,VIPERIDAE,Vipera barani,,NT,decreasing
246,VIPERIDAE,Vipera lotievi,Lotiev's Viper,NT,decreasing
247,VIPERIDAE,Vipera monticola,Atlas Dwarf Viper,NT,decreasing
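
When there are many columns, writing the dictionary by hand gets tedious. A possible alternative, shown here as a sketch on an empty example dataframe, is to clean every column name at once with the string methods available on the columns index.

```python
import pandas as pd

# An empty dataframe with a few of the original column names.
example = pd.DataFrame(columns=["Scientific name", "Red List status", "Population trend"])

# Strip stray whitespace, lowercase, and swap spaces for underscores.
example.columns = example.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(example.columns))
# ['scientific_name', 'red_list_status', 'population_trend']
```

This approach would leave the parentheses in a name like "Common name(s)" untouched, so that column would still need an explicit rename.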


### set_index

Let's say that instead of numbering the rows, we want to use the common name as the index.



In [43]:
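# The next line is left commented out, so the index below is still the default row numbers.
# To use the common name as the index, uncomment it.
# Note the underscore in 'common_name', which matches the renamed column.
# snakes = snakes.set_index('common_name')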

snakes.head()

Unnamed: 0,Family,scientific_name,common_name,red_list_status,population_trend
0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
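
To illustrate what set_index does when it is actually run, here is a minimal sketch on a two-row dataframe built by hand from values in the dataset. The names small and by_name are only for illustration.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "common_name": ["Antiguan Racer", "Leaf-scaled Sea Snake"],
        "red_list_status": ["CR", "CR"],
        "population_trend": ["increasing", "decreasing"],
    }
)

# set_index returns a new dataframe whose row labels come from the chosen column.
by_name = small.set_index("common_name")

# .loc looks rows up by index label, so we can now select a snake by name.
print(by_name.loc["Antiguan Racer", "population_trend"])  # increasing
```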


### Reset index

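Let's say we change our minds and want to go back to the default numeric index. Note that reset_index moves the current index into a regular column, which is why a new index column appears in the output below.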

In [44]:
snakes = snakes.reset_index()

snakes.head()


Unnamed: 0,index,Family,scientific_name,common_name,red_list_status,population_trend
0,0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing


### Filter data

Let's say we only want to show snakes whose population is decreasing (those with a population trend value of "decreasing"). The query method takes the filter condition as a string.

In [45]:
decreasing_snakes = snakes.query("population_trend == 'decreasing'")

decreasing_snakes

Unnamed: 0,index,Family,scientific_name,common_name,red_list_status,population_trend
3,3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
11,11,PSEUDOXYRHOPHIIDAE,Compsophis vinckei,,CR,decreasing
12,12,VIPERIDAE,Crotalus catalinensis,Santa Catalina Island Rattlesnake,CR,decreasing
16,16,COLUBRIDAE,Lampropeltis herrerae,,CR,decreasing
...,...,...,...,...,...,...
239,239,COLUBRIDAE,Stilosoma extenuatum,Short-tailed Snake,NT,decreasing
245,245,VIPERIDAE,Vipera barani,,NT,decreasing
246,246,VIPERIDAE,Vipera lotievi,Lotiev's Viper,NT,decreasing
247,247,VIPERIDAE,Vipera monticola,Atlas Dwarf Viper,NT,decreasing
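
The same filter can also be written with a boolean mask, which some readers may find easier to combine with other conditions. This is a sketch on a small hand-built dataframe, not the full dataset.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "Family": ["ELAPIDAE", "DIPSADIDAE", "VIPERIDAE"],
        "red_list_status": ["CR", "CR", "NT"],
        "population_trend": ["decreasing", "increasing", "decreasing"],
    }
)

# Two equivalent ways to keep only the decreasing populations.
via_query = small.query("population_trend == 'decreasing'")
via_mask = small[small["population_trend"] == "decreasing"]

# Conditions are combined with & (and) or | (or); each one needs its own parentheses.
critical_and_decreasing = small[
    (small["population_trend"] == "decreasing") & (small["red_list_status"] == "CR")
]

print(len(via_mask))                 # 2
print(len(critical_and_decreasing))  # 1
```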


In [46]:
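# Label missing population trends as 'No info'.
# Missing values are stored as NaN rather than the string 'NaN', so we use fillna instead of replace.
# The column name uses an underscore because we renamed it earlier.

snakes['population_trend'] = snakes['population_trend'].fillna('No info')

snakes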

### Drop empty values

Pandas also allows you to drop rows that contain missing values. By default, dropna removes every row that has at least one missing value.


In [None]:
# Before dropna()

just_six_snakes

Unnamed: 0,Family,Scientific name,Common name(s),Red List status,Population trend
0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
5,DIPSADIDAE,Alsophis antiguae,Antiguan Racer,CR,increasing


In [None]:
# After dropna()

just_six_snakes.dropna()

Unnamed: 0,Family,Scientific name,Common name(s),Red List status,Population trend
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
5,DIPSADIDAE,Alsophis antiguae,Antiguan Racer,CR,increasing


#### Drop rows only if all values are missing

Passing how='all' drops a row only if every value in that row is missing. None of the six rows here is completely empty, so all of them are kept.

In [None]:
just_six_snakes.dropna(how='all')

Unnamed: 0,Family,Scientific name,Common name(s),Red List status,Population trend
0,BOLYERIDAE,Bolyeria multocarinata,Round Island Burrowing Boa,EX,
1,DIPSADIDAE,Borikenophis sanctaecrucis,Saint Croix Racer,EX,
2,TYPHLOPIDAE,Typhlops cariei,,EX,
3,ELAPIDAE,Aipysurus apraefrontalis,"Sahul Reef Snake, Short-nosed Sea Snake",CR,decreasing
4,ELAPIDAE,Aipysurus foliosquama,Leaf-scaled Sea Snake,CR,decreasing
5,DIPSADIDAE,Alsophis antiguae,Antiguan Racer,CR,increasing
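
Between dropping on any missing value and dropping only fully empty rows, dropna also has a subset parameter that checks just the columns you name. The sketch below uses a small hand-built dataframe; the completely empty second row is made up so that how='all' has something to remove.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "common_name": ["Saint Croix Racer", np.nan, "Antiguan Racer", np.nan],
        "population_trend": [np.nan, np.nan, "increasing", "decreasing"],
    }
)

any_missing = small.dropna()                             # keeps rows with no missing values
all_missing = small.dropna(how="all")                    # drops only the fully empty row
trend_known = small.dropna(subset=["population_trend"])  # checks one column only

print(len(any_missing))  # 1
print(len(all_missing))  # 3
print(len(trend_known))  # 2
```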


### usecols: import a subset of columns

Let's say we wanted to import only a subset of columns. In that case, we can use the "usecols" argument of read_csv. The example below uses a different dataset, Levels_Fyi_Salary_Data.csv.

In [None]:
# Create a list of columns to use

cols = ["timestamp","company","Race_White","Race_Black","totalyearlycompensation"]

# Create a dataframe that uses only the selected columns.

data = pd.read_csv('Levels_Fyi_Salary_Data.csv', usecols=cols)
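
Since the salary file is not included here, the following sketch shows the same usecols idea on made-up in-memory data. The company names, the level column, and all values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample with more columns than we need.
csv_text = (
    "timestamp,company,level,totalyearlycompensation\n"
    "2017-06-07 11:33:27,ExampleCorp,L3,100000\n"
    "2017-06-10 17:11:29,SampleInc,L5,150000\n"
)

# Only the listed columns are loaded; the rest are skipped while parsing.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["company", "totalyearlycompensation"])

print(list(subset.columns))  # ['company', 'totalyearlycompensation']
```

Skipping unneeded columns at load time can save memory on wide files.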



## dtypes: learn about data types

Pandas tries to guess the data type of each column in a flat file, but sometimes it guesses wrong.

This is often true of numbers that represent groups or qualities instead of quantities, such as ID codes or 0/1 flags.

In [None]:
data.dtypes

# Notice that timestamp has dtype object (a string), but it should be a datetime.

## dtype argument: change data types

To override pandas' guess, pass the dtype argument of read_csv a dictionary that maps column names to data types.
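
As a minimal sketch of the dtype argument, the example below loads a 0/1 flag column as a category instead of an integer. The data is a made-up in-memory sample, and treating Race_White as a 0/1 flag is an assumption for illustration.

```python
import io

import pandas as pd

# Hypothetical sample; the values are made up.
csv_text = (
    "company,Race_White,totalyearlycompensation\n"
    "ExampleCorp,1,100000\n"
    "SampleInc,0,150000\n"
)

# Without dtype, pandas guesses that the flag column holds integers.
guessed = pd.read_csv(io.StringIO(csv_text))

# With dtype, we tell pandas to treat the flag as a categorical label.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"Race_White": "category"})

print(guessed["Race_White"].dtype)
print(typed["Race_White"].dtype)  # category
```

Columns that are not mentioned in the dictionary keep the type pandas guessed for them.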

## parse_dates: parse simple dates

By default, pandas reads date and time columns as plain strings (object dtype). To load them as datetimes, we use the parse_dates argument of read_csv.

In [None]:
# Pass a list of the datetime column names to the parse_dates argument.

data = pd.read_csv('Levels_Fyi_Salary_Data.csv', parse_dates=["timestamp"])

data.timestamp.head()
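
To see the effect of parse_dates without the salary file, here is a sketch that loads the same made-up text twice, once with and once without the argument. The timestamps and company names are hypothetical.

```python
import io

import pandas as pd

csv_text = (
    "timestamp,company\n"
    "2017-06-07 11:33:27,ExampleCorp\n"
    "2017-06-10 17:11:29,SampleInc\n"
)

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# The parsed column is a datetime, so the .dt accessor becomes available.
print(pd.api.types.is_datetime64_any_dtype(plain["timestamp"]))   # False
print(pd.api.types.is_datetime64_any_dtype(parsed["timestamp"]))  # True
print(parsed["timestamp"].dt.year.tolist())                       # [2017, 2017]
```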

In [None]:
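# Create a dict that maps the new datetime column name to the columns to combine.
# This assumes the file has separate start_date and start_time columns.
# Note: combining columns through parse_dates is deprecated as of pandas 2.2.

datetime_parsed = {"timestamp": ["start_date", "start_time"]}

# Load the file, supplying the dict to parse_dates
survey_data = pd.read_csv('Levels_Fyi_Salary_Data.csv', parse_dates=datetime_parsed)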
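
In recent pandas releases (2.2 and later), combining columns through parse_dates is deprecated. One possible alternative, sketched here on made-up data with assumed start_date and start_time columns, is to load the columns as strings and combine them afterwards with pd.to_datetime.

```python
import io

import pandas as pd

# Hypothetical sample with the date and time stored in separate columns.
csv_text = (
    "start_date,start_time,company\n"
    "2017-06-07,11:33:27,ExampleCorp\n"
    "2017-06-10,17:11:29,SampleInc\n"
)

survey = pd.read_csv(io.StringIO(csv_text))

# Join the two string columns with a space, then parse the result as datetimes.
survey["timestamp"] = pd.to_datetime(survey["start_date"] + " " + survey["start_time"])

print(survey["timestamp"].iloc[0])  # 2017-06-07 11:33:27
```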


## parse_dates: parse simple dates