## Intro to Pandas

We'll review the basics of lists, indexing, and sorting to build intuition for how pandas works.

In [5]:
# You can sort lists using the built-in list.sort() method
# This method sorts the list in place and returns None, so don't assign its result to anything
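# Note: the name `list` shadows Python's built-in list type; prefer a name like `numbers` in your own code
list = [2, 5, 1, 6]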
list.sort()
print(list)

[1, 2, 5, 6]
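
If the original order still matters, the built-in `sorted()` is an alternative to `list.sort()`. The sketch below uses an illustrative variable called `numbers` (not a name from the notebook) to show the difference:

```python
numbers = [2, 5, 1, 6]

# sorted() returns a new sorted list and leaves the original untouched
ordered = sorted(numbers)

# reverse=True sorts from largest to smallest
descending = sorted(numbers, reverse=True)

print(numbers)     # original order is preserved
print(ordered)
print(descending)
```

Use `list.sort()` when the unsorted order is no longer needed, and `sorted()` when both versions are useful.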


In [6]:
# Our favorite type of loop: the for-each loop
# This one prints every number in the list in order
for num in list:
    print(num)

1
2
5
6


In [60]:
# Sometimes it's more helpful to move through the list using an index
# With a step of 2 in range(), we can print every other number in the list
for i in range(0, len(list), 2):
    print(list[i])

1
5
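
The loop above moves through the list in steps of 2, which is why it prints 1 and 5. To stop after the first two items instead, limit the range or take a slice. This is a small sketch with an illustrative `numbers` list:

```python
numbers = [1, 2, 5, 6]

# Option 1: loop over the indices 0 and 1 only
first_two = []
for i in range(2):
    first_two.append(numbers[i])

# Option 2: a slice returns the same items without a loop
also_first_two = numbers[:2]

# enumerate() hands back the index and the value together
for i, num in enumerate(numbers):
    print(i, num)

print(first_two)
print(also_first_two)
```

Index-based access like this is the same idea pandas uses when it looks up rows by position.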


In [61]:
# Helper function for cleaning: ignore this
def clean_column_names(df):
    # Strip stray double quotes and surrounding whitespace from each column name
    new_columns = []
    for name in df.columns:
        new_columns.append(name.replace('"', "").strip())
    df.columns = new_columns
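
For reference, the same cleanup can be sketched without a loop by using the string methods pandas provides on column indexes. The toy dataframe and its messy column names below are made up for illustration and are not the notebook's data:

```python
import pandas as pd

# Hypothetical dataframe whose column names carry stray quotes and spaces
df = pd.DataFrame({' "LatD"': [41], ' "City" ': ["Youngstown"]})

# Remove double quotes, then trim whitespace, for every column name at once
df.columns = df.columns.str.replace('"', "", regex=False).str.strip()

cleaned = df.columns.tolist()
print(cleaned)
```

Both versions do the same job; the loop is easier to read when you are new to pandas.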

In [63]:
# Now let's load in a new CSV file
from pathlib import Path
# _dh is IPython's directory history; _dh[0] is the folder the notebook was started in
path = Path(globals()['_dh'][0]).parent/"Data/cities.csv"

import pandas as pd
cities = pd.read_csv(path)
clean_column_names(cities)
display(cities)

Unnamed: 0,LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
0,41,5,59,"""N""",80,39,0,"""W""","""Youngstown""",OH
1,42,52,48,"""N""",97,23,23,"""W""","""Yankton""",SD
2,46,35,59,"""N""",120,30,36,"""W""","""Yakima""",WA
3,42,16,12,"""N""",71,48,0,"""W""","""Worcester""",MA
4,43,37,48,"""N""",89,46,11,"""W""","""Wisconsin Dells""",WI
...,...,...,...,...,...,...,...,...,...,...
123,39,31,12,"""N""",119,48,35,"""W""","""Reno""",NV
124,50,25,11,"""N""",104,39,0,"""W""","""Regina""",SA
125,40,10,48,"""N""",122,14,23,"""W""","""Red Bluff""",CA
126,40,19,48,"""N""",75,55,48,"""W""","""Reading""",PA


In [66]:
# Each row of a dataframe is a pandas Series, which acts like a list with labels, e.g.
print(cities.loc[0])

LatD                41
LatM                 5
LatS                59
NS                 "N"
LonD                80
LonM                39
LonS                 0
EW                 "W"
City      "Youngstown"
State               OH
Name: 0, dtype: object
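
Note that `.loc` looks rows up by index label, not by position. The two only agree while the index is still 0, 1, 2, and so on. A minimal sketch, using a made-up three-row dataframe called `toy` with non-consecutive labels:

```python
import pandas as pd

# Hypothetical mini-dataframe; the index labels are deliberately not 0, 1, 2
toy = pd.DataFrame(
    {"City": ["Yakima", "Reno", "Regina"], "State": ["WA", "NV", "SA"]},
    index=[2, 123, 124],
)

row_by_label = toy.loc[123]     # the row whose index label is 123
row_by_position = toy.iloc[0]   # the first row, whatever its label is

print(row_by_label["City"])
print(row_by_position["City"])
```

Each result is a Series, so individual values can be read by column name as shown.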


In [70]:
# Each column is also a Series, so it behaves like a list too
print(cities['City'])

0            "Youngstown"
1               "Yankton"
2                "Yakima"
3             "Worcester"
4       "Wisconsin Dells"
              ...        
123                "Reno"
124              "Regina"
125           "Red Bluff"
126             "Reading"
127             "Ravenna"
Name: City, Length: 128, dtype: object
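
The city names printed above still carry literal double quotes. One way to strip them and get a plain Python list is sketched below; the `toy` dataframe is a made-up stand-in for `cities`:

```python
import pandas as pd

# Hypothetical stand-in for the cities data, with quotes inside the values
toy = pd.DataFrame({"City": ['"Yakima"', '"Reno"'], "State": ["WA", "NV"]})

# .str applies a string method to every value in the column
city_names = toy["City"].str.strip('"')

# tolist() converts the column into an ordinary Python list
as_list = city_names.tolist()
print(as_list)
```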


In [55]:
# If we want, we can sort the dataframe by the values in one of these columns
cities = cities.sort_values("City")
display(cities)

Unnamed: 0,LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
127,41,9,35,"""N""",81,14,23,"""W""","""Ravenna""",OH
126,40,19,48,"""N""",75,55,48,"""W""","""Reading""",PA
125,40,10,48,"""N""",122,14,23,"""W""","""Red Bluff""",CA
124,50,25,11,"""N""",104,39,0,"""W""","""Regina""",SA
123,39,31,12,"""N""",119,48,35,"""W""","""Reno""",NV
...,...,...,...,...,...,...,...,...,...,...
4,43,37,48,"""N""",89,46,11,"""W""","""Wisconsin Dells""",WI
3,42,16,12,"""N""",71,48,0,"""W""","""Worcester""",MA
2,46,35,59,"""N""",120,30,36,"""W""","""Yakima""",WA
1,42,52,48,"""N""",97,23,23,"""W""","""Yankton""",SD
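
`sort_values` also accepts a sort direction and more than one column. A short sketch on a made-up `toy` dataframe (the rows are illustrative, not a full copy of the cities data):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "City": ["Reno", "Yakima", "Reading", "Red Bluff"],
        "State": ["NV", "WA", "PA", "CA"],
    }
)

# ascending=False sorts from Z to A
by_city_desc = toy.sort_values("City", ascending=False)

# A list of columns sorts by the first, then breaks ties with the second
by_state_then_city = toy.sort_values(["State", "City"])

print(by_city_desc)
print(by_state_then_city)
```

Like the cell above, `sort_values` returns a new dataframe, so the result must be assigned to a variable to keep it.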


In [57]:
# Notice that each row kept its original index label after sorting!
# If we want to renumber the indices to match our new ordering, we can use
cities.reset_index(inplace=True, drop=True)
display(cities)

Unnamed: 0,LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
0,41,9,35,"""N""",81,14,23,"""W""","""Ravenna""",OH
1,40,19,48,"""N""",75,55,48,"""W""","""Reading""",PA
2,40,10,48,"""N""",122,14,23,"""W""","""Red Bluff""",CA
3,50,25,11,"""N""",104,39,0,"""W""","""Regina""",SA
4,39,31,12,"""N""",119,48,35,"""W""","""Reno""",NV
...,...,...,...,...,...,...,...,...,...,...
123,43,37,48,"""N""",89,46,11,"""W""","""Wisconsin Dells""",WI
124,42,16,12,"""N""",71,48,0,"""W""","""Worcester""",MA
125,46,35,59,"""N""",120,30,36,"""W""","""Yakima""",WA
126,42,52,48,"""N""",97,23,23,"""W""","""Yankton""",SD
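
As an alternative sketch, sorting and renumbering can be combined in one step with the `ignore_index` parameter of `sort_values` (available in pandas 1.0 and later). The `toy` dataframe is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"City": ["Yakima", "Reno", "Regina"]})

# Default behaviour: rows keep their original index labels
kept = toy.sort_values("City")

# ignore_index=True renumbers the rows 0, 1, 2, ... after sorting
fresh = toy.sort_values("City", ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
```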


### A quick aside: why is this all important?

- Data is everywhere, and knowing how to handle it will help you become a better data analyst and, eventually, a data scientist.
- Our goal is to do high-level machine learning. This means you'll need to manage the data used to train and test your model, and gather data to evaluate its performance.

#### What skills do you need to do this?

- In the future, you'll be given CSV files or find your own data, and you'll need to sort through it.
- You could do this manually, like we're doing here, but you might need to sort and clean data programmatically.
- You might also generate your own CSV data (descriptive statistics, experimental results, etc.) to sort through and graph.
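
As a rough sketch of the last point, descriptive statistics can be computed with `describe()` and written out as CSV text. The `results` dataframe and its numbers are invented purely for illustration:

```python
import io

import pandas as pd

# Hypothetical experimental results (made-up values)
results = pd.DataFrame({"trial": [1, 2, 3], "accuracy": [0.80, 0.85, 0.90]})

# describe() computes count, mean, std, min, quartiles, and max per column
summary = results[["accuracy"]].describe()

# With no path given, to_csv() returns the CSV as a string;
# pass a filename instead to save it to disk
csv_text = summary.to_csv()

# Read it back in, as you would with any other CSV file
reloaded = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(reloaded)
```

The reloaded dataframe can then be sorted or graphed with the same tools used on `cities`.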