
# Overview of today's topics: 



*   Quick Python refresher
*   Pandas overview
*   Load data files
*   Select, filter, and slice data from a dataset
*   Merging and concatenating datasets
*   Grouping and summarizing data
*   Vectorization, map, and apply



In [1]:
import numpy as np   # packages can be given short aliases on import, for example 'np' for numpy and 'pd' for pandas
import pandas as pd

# We have already discussed the standard Python data types


*   Integer
*   Float
*   String
*   List
*   Set
*   Tuple
*   Dictionary



In [2]:
# integers (int)
x = 100
type(x)

int

In [3]:
# floating-point numbers (float)
x = 100.5
type(x)

float

In [4]:
# sequence of characters (str)
x = 'Irvine, CA 92697'
len(x)

16

In [5]:
# list of items
x = [1, 'Irvine', 3, 'UCI', 92697]
len(x)

5

In [6]:
# sets keep only unique elements: duplicates are dropped
x = {2, 2, 3, 3, 1}
x

{1, 2, 3}
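Because a set drops duplicates, it is a convenient way to find the distinct values in a list. The sketch below assumes a made-up list of ZIP codes (`zip_codes` is illustrative, not data from this notebook).

```python
# illustrative values only, not taken from a real dataset
zip_codes = [92697, 92612, 92697, 92617, 92612]

# converting a list to a set removes the duplicates
unique_zips = set(zip_codes)
print(len(zip_codes), len(unique_zips))

# membership tests use the 'in' operator
print(92697 in unique_zips)

# sets are unordered, so sort them when a stable order is needed
print(sorted(unique_zips))
```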

In [7]:
# tuples are immutable sequences
latlng = (34.019425, -118.283413)
type(latlng)

tuple

In [9]:
# unpacking a tuple to get back values in separate variables
lat, lng = latlng
type(lat)

float
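To see what "immutable" means in practice, here is a small sketch that tries to change a tuple's element in place. The variable names `error_message` and `shifted` are introduced only for this illustration.

```python
latlng = (34.019425, -118.283413)

# assigning to an element of a tuple raises a TypeError
try:
    latlng[0] = 0.0
except TypeError as err:
    error_message = str(err)
    print('Cannot modify a tuple:', error_message)

# to "change" a tuple, build a new one from the old values
shifted = (latlng[0] + 1, latlng[1])
print(shifted)
```

The original `latlng` is unchanged after both steps.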

In [10]:
# dictionary of key:value pairs
person = {'first_name': 'Avipsa', 'last_name': 'Roy', 'profession': 'Faculty'}
type(person)

dict

In [14]:
person['first_name']

'Avipsa'

In [16]:
person['profession']

'Faculty'
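Square-bracket lookup raises a `KeyError` when the key is missing. A minimal sketch of safer lookups follows; the keys `'middle_name'` and `'nickname'` are hypothetical and deliberately absent from the dictionary.

```python
person = {'first_name': 'Avipsa', 'last_name': 'Roy', 'profession': 'Faculty'}

# .get() returns None instead of raising KeyError when the key is missing
middle_name = person.get('middle_name')
print(middle_name)

# .get() also accepts a default value to return for missing keys
nickname = person.get('nickname', 'unknown')
print(nickname)

# the 'in' operator checks whether a key exists
print('last_name' in person)
```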

# Converting data types

In [17]:
# you can convert types
x = '100'
print(type(x))
y = int(x)
print(type(y))

<class 'str'>
<class 'int'>
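Conversion only succeeds when the string actually represents a value of the target type. The sketch below uses a hypothetical helper, `to_int_or_none` (not part of this notebook), to show how a failed conversion raises a `ValueError` and one way to handle it.

```python
# hypothetical helper for illustration: convert text to int, or return None on failure
def to_int_or_none(text):
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError when the string is not a valid integer literal
        return None

print(to_int_or_none('100'))    # a valid integer string
print(to_int_or_none('100.5'))  # not an integer literal, so the conversion fails
print(float('100.5'))           # float() accepts decimal strings
print(str(92697))               # str() converts a number back to a string
```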


# Looping through items in a dictionary and a tuple

In [18]:
# you can loop through an iterable, such as a list or tuple
for coord in latlng:
    print('Current coordinate is:', coord)

Current coordinate is: 34.019425
Current coordinate is: -118.283413


In [19]:
# loop through a dictionary's keys and values as (key, value) tuples
for key, value in person.items():
    print(key, value)

first_name Avipsa
last_name Roy
profession Faculty
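Two related looping patterns are worth knowing. This is a short sketch reusing the `latlng` and `person` objects defined earlier; the names `keys` and `values` are introduced only for this example.

```python
latlng = (34.019425, -118.283413)
person = {'first_name': 'Avipsa', 'last_name': 'Roy', 'profession': 'Faculty'}

# enumerate() yields (position, item) pairs while looping over a sequence
for position, coord in enumerate(latlng):
    print(position, coord)

# looping over a dictionary directly yields its keys
keys = [key for key in person]
print(keys)

# .values() yields only the values
values = list(person.values())
print(values)
```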


## Working with Pandas

[pandas](https://pandas.pydata.org/) is a data analysis library for Python. It has two primary data structures that we will work with: Series and DataFrame.



*   **Pandas Series**





In [20]:
# a pandas series is built on a numpy array: it's faster and more compact than a list, and offers more functionality
# perhaps most notably, it has an index which allows you to work naturally with tabular data
my_list = [28, 145, 77, 0.32]
my_series = pd.Series(my_list)
my_series

0     28.00
1    145.00
2     77.00
3      0.32
dtype: float64

In [21]:
# look at a list-representation of the index
my_series.index.tolist()

[0, 1, 2, 3]

In [22]:
# look at the series' values themselves
my_series.values

array([ 28.  , 145.  ,  77.  ,   0.32])

In [23]:
# what's the data type of the series' values?
type(my_series.values)

numpy.ndarray

In [24]:
# what's the data type of the individual values themselves?
my_series.dtype

dtype('float64')
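The index does not have to be the default 0, 1, 2, ... integers. The sketch below assumes made-up labels `'a'` through `'d'` to show the difference between label-based and position-based lookup.

```python
import pandas as pd

# assumption: these index labels are invented for illustration
labeled = pd.Series([28, 145, 77, 0.32], index=['a', 'b', 'c', 'd'])

# .loc looks up a value by its index label
value_b = labeled.loc['b']
print(value_b)

# .iloc looks up a value by its integer position
first_value = labeled.iloc[0]
print(first_value)

# the index now holds the custom labels
print(labeled.index.tolist())
```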



*   **Pandas DataFrame**


In [30]:
# a dict can contain multiple lists and label them
my_dict = {'med_hh_income'  : [75125, 22075, 31950, 115400],
           'home_price' : [525000, 275000, 395000, 985000]}
my_dict

{'home_price': [525000, 275000, 395000, 985000],
 'med_hh_income': [75125, 22075, 31950, 115400]}

In [31]:
# a pandas dataframe can contain one or more columns
# each column is a pandas series
# selecting a single row also gives you a pandas series
# you can create a dataframe by passing in a list, array, series, or dict
df = pd.DataFrame(my_dict)
df

   med_hh_income  home_price
0          75125      525000
1          22075      275000
2          31950      395000
3         115400      985000


In [32]:
# the row labels in the index are accessed by the .index attribute of the DataFrame object
df.index.tolist()

[0, 1, 2, 3]

In [33]:
# the column labels are accessed by the .columns attribute of the DataFrame object
df.columns

Index(['med_hh_income', 'home_price'], dtype='object')

In [34]:
# the data values are accessed by the .values attribute of the DataFrame object
# this is a two-dimensional numpy array
df.values

array([[ 75125, 525000],
       [ 22075, 275000],
       [ 31950, 395000],
       [115400, 985000]])
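The comments above note that both columns and rows are Series. A minimal sketch to check this follows, rebuilding the same DataFrame so the block runs on its own; `income` and `first_row` are names introduced just for this example.

```python
import pandas as pd

df = pd.DataFrame({'med_hh_income': [75125, 22075, 31950, 115400],
                   'home_price': [525000, 275000, 395000, 985000]})

# selecting one column by its label returns a Series
income = df['med_hh_income']
print(type(income))

# selecting one row by its index label with .loc also returns a Series,
# whose index is the DataFrame's column labels
first_row = df.loc[0]
print(type(first_row))
print(first_row.index.tolist())

# .shape gives (number of rows, number of columns)
print(df.shape)
```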