## Pandas Series and DataFrame

#### What and why?
- In Python, we have built-in data types such as list and dictionary.
- When performance matters, lists and dictionaries are too slow for calculations on a large dataset.
- numpy provides an array object that is faster than a list because all elements of an array share the same data type.
- numpy arrays are fast, but they only have an implicit (positional) index. pandas adds an explicit index, which allows more complex data manipulation and easier handling of missing data.
- pandas is therefore both fast and convenient: the explicit index makes it easy to reference data by label.
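
The points above can be sketched in a few lines. This is a minimal illustration rather than a benchmark, and the names `values`, `mixed`, `s1`, `s2` and `total` are made up for the example.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

# a list needs an explicit loop (or comprehension) for element-wise math
doubled_list = [v * 2 for v in values]

# a numpy array applies the operation to every element at once
doubled_array = np.array(values) * 2

# numpy stores a single data type: mixing int and float upcasts everything to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# pandas aligns on the explicit index; a label missing on one side becomes NaN
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["a", "b"])
total = s1 + s2
print(total)
```

Label `"c"` exists only in `s1`, so its sum is `NaN` instead of raising an error, which is one way the explicit index helps with missing data.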


In [1]:
#a list can store elements of different data types; multiplying a list by 2 repeats it instead of doubling each element
b = [1, 2, 3]
print (b)
c = b*2
print (c)



[1, 2, 3]
[1, 2, 3, 1, 2, 3]


In [2]:
import numpy as np

a = np.array([1, 2, 3])

print(a)

d= a*2
print(d)

[1 2 3]
[2 4 6]


In [4]:
#numpy has an implicit index
a[0]


np.int64(1)

In [5]:
import pandas as pd


In [6]:
data = pd.Series([0.24, 0.5, 0.75, 1.0], index = ["a", "b", "c", "d"])

In [9]:
print(data)

a    0.24
b    0.50
c    0.75
d    1.00
dtype: float64


In [10]:
print(a)

[1 2 3]


### Data objects in pandas - pandas series
- 1D array of indexed data

#### how to create
- it can be created from a list, array, dictionary

In [8]:
#from a list
series_list = pd.Series([0.24, 0.5, 0.75, 1.0])
print(series_list)

0    0.24
1    0.50
2    0.75
3    1.00
dtype: float64


In [10]:
#from a list with explicit index
series_list_2 = pd.Series([0.24, 0.5, 0.75, 1.0], index = ["a", "b", "c", "d"])
print(series_list_2)

a    0.24
b    0.50
c    0.75
d    1.00
dtype: float64


In [11]:
#array
a = np.array([1,2,3])
series_array = pd.Series(a, index=["a","b","c"])
print(series_array)

a    1
b    2
c    3
dtype: int64


In [12]:
#from a dictionary
series_dictionary = pd.Series({"a":1,"b":2,"c":3})
print(series_dictionary)

a    1
b    2
c    3
dtype: int64


#### how to select data from it

#### selection with 'key'

In [15]:
#use the key of the original dictionary
print(series_dictionary["a"])

1


In [19]:
#a series can also be iterated over like a dictionary, using keys() and items()
print(series_dictionary.keys())
print(list(series_dictionary.items()))

Index(['a', 'b', 'c'], dtype='object')
[('a', 1), ('b', 2), ('c', 3)]
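
A few more dictionary-like operations also appear to carry over to a series. The sketch below recreates `series_dictionary` so that it runs on its own.

```python
import pandas as pd

series_dictionary = pd.Series({"a": 1, "b": 2, "c": 3})

# membership tests look at the index labels, like dictionary keys
print("a" in series_dictionary)
print(1 in series_dictionary)  # 1 is a value here, not a label

# iterate over (label, value) pairs
for label, value in series_dictionary.items():
    print(label, value)

# assigning to a new label adds an entry, as with a dictionary
series_dictionary["d"] = 4
print(series_dictionary)
```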


#### selection with index, similar to list

In [27]:
series_dictionary = pd.Series({"a":1, "b":2, "c":3})


#slicing with explicit index
print(series_dictionary["a":"b"]) #beware! unlike list slicing, the right endpoint is included

#slicing with implicit index
print(series_dictionary[0:2]) #same as slicing a list: the right endpoint is excluded


a    1
b    2
dtype: int64
a    1
b    2
dtype: int64


In [29]:
data = pd.Series(['a', 'b', 'c'],
                index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [31]:
#the plain [] syntax can be very confusing on a pandas series with an integer index

#pandas uses the explicit index when indexing
print(data[1])

#pandas uses the implicit index when slicing
print(data[1:3]) #implicit index is 0, 1, 2

a
3    b
5    c
dtype: object


In [None]:
#alternatively, we can use loc or iloc for indexing and slicing a pandas series
#to make the code more readable

#.loc always uses the explicit index for indexing and slicing (right endpoint included)
print(data.loc[1])
print(data.loc[1:3])


#.iloc always uses the implicit index for indexing and slicing (right endpoint excluded)
print(data.iloc[1])
print(data.iloc[1:3])

a
1    a
3    b
dtype: object
b
3    b
5    c
dtype: object


## Data structures in pandas - pandas dataframe
- similar to a tabular dataset
- 2D array of indexed data, with a row index and column names
- a pandas series can be created from a list, array, or dictionary
- a dataframe can be created from multiple pandas series


#### how to create

In [41]:
#create a dict
area_dict = {'Californina':423,
             'Texas': 450,
             'New York': 141,
             'Florida': 141,
             'Illinois': 149
             }

#create a pandas series
area = pd.Series(area_dict)
area

Californina    423
Texas          450
New York       141
Florida        141
Illinois       149
dtype: int64

In [45]:
#create a dict
population_dict = {'Californina':388,
             'Texas': 264,
             'New York': 196,
             'Florida': 195,
             'Illinois': 127
             }

population = pd.Series(population_dict)

population

Californina    388
Texas          264
New York       196
Florida        195
Illinois       127
dtype: int64

In [53]:
#create a dataframe with two pandas series
states = pd.DataFrame({'pop': population, 
                       'area': area})
states

             pop  area
Californina  388   423
Texas        264   450
New York     196   141
Florida      195   141
Illinois     127   149
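
When the series do not share exactly the same index, the dataframe is built on the union of the labels and the gaps are filled with `NaN`. The sketch below uses a deliberately incomplete subset of the values above; the name `states_partial` is made up for the example.

```python
import pandas as pd

# each series is missing one state on purpose
area = pd.Series({"Texas": 450, "Florida": 141})
population = pd.Series({"Texas": 264, "New York": 196})

# rows are aligned by label; a label missing from one series gives NaN
states_partial = pd.DataFrame({"pop": population, "area": area})
print(states_partial)
```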


In [54]:
#column attribute
states.columns # column names can be reassigned through this attribute, which is handy for raw data

Index(['pop', 'area'], dtype='object')
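
As a minimal sketch of renaming columns, the example below builds a small two-row table so that it runs on its own. The new column names are arbitrary choices for illustration.

```python
import pandas as pd

states = pd.DataFrame({"pop": [388, 264], "area": [423, 450]},
                      index=["California", "Texas"])

# reassign all column names at once; the list length must match the number of columns
states.columns = ["population", "area"]
print(states.columns)

# rename only selected columns; this returns a new dataframe and leaves states unchanged
renamed = states.rename(columns={"population": "pop"})
print(renamed.columns)
```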

In [72]:
#index attribute
states.index

Index(['Californina', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

#### how to select data from it

In [65]:
#select a column

#by using dictionary-style indexing --> preferable for clean code
print(states['pop']) # as with a dictionary, use [] and a key (the column name) to pick out the data

#by using attribute access          --> can be confusing
print(states.area) # states.pop would not return the column, because pop is also a method of the DataFrame class

Californina    388
Texas          264
New York       196
Florida        195
Illinois       127
Name: pop, dtype: int64
Californina    423
Texas          450
New York       141
Florida        141
Illinois       149
Name: area, dtype: int64
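
The name clash with `pop` can be checked directly. The sketch below rebuilds a two-row table so that it runs on its own.

```python
import pandas as pd

states = pd.DataFrame({"pop": [388, 264], "area": [423, 450]},
                      index=["California", "Texas"])

# states.pop refers to the DataFrame.pop method, not to the 'pop' column
print(callable(states.pop))

# 'area' does not clash with any method, so attribute access returns the column
print(states.area.equals(states["area"]))

# bracket indexing always returns the column, whatever its name
print(type(states["pop"]).__name__)
```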


In [None]:
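#slicing with [] selects rows; a label slice must use labels that exist in the index
#the index contains 'Californina' (as spelled in the dictionaries above), so 'California' raises a KeyError
print(states["California":"New York"])
print(states[1:4])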

KeyError: 'California'
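
Row selection on a dataframe is arguably clearer with `.loc` and `.iloc`, for the same reasons as with a series. The sketch below rebuilds the table with the same numbers but with the label spelled `California`, so that the label slice finds it. The names `by_label` and `by_position` are made up for the example.

```python
import pandas as pd

states = pd.DataFrame(
    {"pop": [388, 264, 196, 195, 127], "area": [423, 450, 141, 141, 149]},
    index=["California", "Texas", "New York", "Florida", "Illinois"],
)

# .loc slices rows by label and includes the right endpoint
by_label = states.loc["California":"New York"]
print(by_label)

# .iloc slices rows by position and excludes the right endpoint
by_position = states.iloc[1:4]
print(by_position)

# .loc also accepts a row label and a column name together
print(states.loc["Texas", "pop"])
```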