# Intro to Pandas
___

**pandas** provides data structures and data manipulation tools designed for fast & easy data cleaning and analysis.

pandas adopts array-based computing from NumPy, but is designed for **tabular** or **heterogeneous** data.

pandas has two primary data structures:

- **Series** - one-dimensional array-like object with a sequence of values having the same datatype
- **DataFrame** - rectangular table of data with ordered collection of columns, each of which can be a different value type

In [25]:
# As always, we import pandas and numpy
import pandas as pd
import numpy as np

### Series

A Series has a sequence of values (all the same datatype) and an associated array of data labels called its **index**. If not specified otherwise, the index values are sequential integers starting at 0.

pandas infers the datatype of the values when a Series is created, but the datatype can also be specified explicitly.
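As a small illustration of inferred versus explicit datatypes, the sketch below builds the same values twice. The variable names `inferred` and `explicit` are made up for this example.

```python
import pandas as pd

# Datatype inferred from the values (an integer dtype for whole numbers)
inferred = pd.Series([1, 2, 3])

# Datatype requested explicitly with the dtype argument
explicit = pd.Series([1, 2, 3], dtype="float64")

print(inferred.dtype)
print(explicit.dtype)
```

Passing `dtype` is mostly useful when the inferred type is not the one you want, for example when whole numbers should be stored as floats.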

A Series is like a fixed-length, ordered dict with a mapping of index values to data values.
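The dict comparison can be tried out directly. This is a minimal sketch with a made-up Series called `letters`; note that membership tests look at the index labels, not at the values.

```python
import pandas as pd

letters = pd.Series([10, 20, 30], index=["a", "b", "c"])

has_label = "b" in letters         # True: 'b' is an index label
has_value = 20 in letters          # False: values are not searched
fallback = letters.get("z", -1)    # like dict.get, returns the default
as_dict = letters.to_dict()        # back to a plain Python dict

print(has_label, has_value, fallback)
print(as_dict)
```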

The array representation and index object of a Series can be accessed via its **values** and **index** attributes.

Problem 1: Create a series of 4 numerical values. Print the series, the values and the index of the series.

In [26]:
s = pd.Series([10, 20, 30, 40])

print("Series:")
print(s)
print("\nValues:")
print(s.values)
print("\nIndex:")
print(s.index)

Series:
0    10
1    20
2    30
3    40
dtype: int64

Values:
[10 20 30 40]

Index:
RangeIndex(start=0, stop=4, step=1)


Problem 2: Create a new series with the same values but with string values for the index. Select and print out one of the values in the series using its index label.

In [27]:
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Series with string index:")
print(s2)
print("\nAccess value at index 'b':")
print(s2['b'])

Series with string index:
a    10
b    20
c    30
d    40
dtype: int64

Access value at index 'b':
20


Problem 3: You can also create a series from a Python dict. Create a dict called `states_dict` with the state name as the key and the number as the value:

- 'Ohio': 35000
- 'Texas': 71000
- 'Oregon': 16000
- 'Utah': 5000

Use the dict `states_dict` to create a series called `states_series`.

In [28]:
states_dict = {
    'Ohio': 35000,
    'Texas': 71000,
    'Oregon': 16000,
    'Utah': 5000
}

states_series = pd.Series(states_dict)
print(states_series)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


Problem 4: Update the index of `states_series` to be the abbreviations of the state names. Do this in place.

In [29]:
states_series.index = ['OH', 'TX', 'OR', 'UT']
print(states_series)

OH    35000
TX    71000
OR    16000
UT     5000
dtype: int64
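Assigning to `.index` changes the Series in place and relies on the new labels being listed in the right order. As a possible alternative, `rename` takes an old-to-new mapping and returns a new Series. The sketch below uses a shortened, made-up Series called `counts`.

```python
import pandas as pd

counts = pd.Series({"Ohio": 35000, "Texas": 71000})

# rename maps old labels to new ones and returns a new Series
renamed = counts.rename(index={"Ohio": "OH", "Texas": "TX"})

print(renamed)
print(counts.index.tolist())  # the original labels are untouched
```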


### DataFrame

A pandas DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type.

It can be thought of as a dict of Series that all share the same index.
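To make the "dict of Series" picture concrete, the sketch below builds a DataFrame from two made-up Series whose labels are listed in a different order. pandas lines the values up by label, not by position.

```python
import pandas as pd

year = pd.Series([2000, 2001], index=["x", "y"])
pop = pd.Series([1.5, 1.7], index=["y", "x"])  # same labels, other order

# Each dict key becomes a column; rows are matched on the index labels
frame = pd.DataFrame({"year": year, "pop": pop})

print(frame)
```

Here the row labelled `x` pairs the year 2000 with the pop value 1.7, because both were stored under the label `x`.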

- DataFrames have both row and column indices
- DataFrames are physically 2D, but can represent higher-dimensional data using hierarchical indexing
- DataFrame rows are sometimes referred to as axis=0
- DataFrame columns are sometimes referred to as axis=1
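The `axis` argument shows up in many DataFrame methods. A quick sketch with a made-up `scores` table shows what each axis means for `sum`:

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, 80], "art": [70, 60]},
    index=["ann", "bob"],
)

per_column = scores.sum(axis=0)  # moves down the rows: one total per column
per_row = scores.sum(axis=1)     # moves across the columns: one total per row

print(per_column)
print(per_row)
```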


You can create DataFrames from dicts as well. 

In [30]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

df = pd.DataFrame(data)
df

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


Problem 5: Update the index of the DataFrame df to be the word for each number, e.g. 0 -> zero, 1 -> one.

In [31]:
df.index = ['zero', 'one', 'two', 'three', 'four', 'five']
df

Unnamed: 0,state,year,pop
zero,Ohio,2000,1.5
one,Ohio,2001,1.7
two,Ohio,2002,3.6
three,Nevada,2001,2.4
four,Nevada,2002,2.9
five,Nevada,2003,3.2


Problem 6: Figure out two ways to access a column from df. Output the results. 

In [32]:
print("Method 1 - dot notation:")
print(df.state)
print()

print("Method 2 - bracket notation:")
print(df['state'])

Method 1 - dot notation:
zero       Ohio
one        Ohio
two        Ohio
three    Nevada
four     Nevada
five     Nevada
Name: state, dtype: object

Method 2 - bracket notation:
zero       Ohio
one        Ohio
two        Ohio
three    Nevada
four     Nevada
five     Nevada
Name: state, dtype: object
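One caveat with dot notation: it only works when the column name does not clash with an existing DataFrame attribute or method. The `pop` column is such a case, since `DataFrame.pop` is a method. The sketch below uses a small stand-in frame called `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

# Dot notation finds the DataFrame.pop method, not the column
dot_is_method = callable(df_demo.pop)

# Bracket notation always returns the column
bracket_values = df_demo["pop"].tolist()

print(dot_is_method)
print(bracket_values)
```

For that reason bracket notation is the safer default, and it is also the only form that works for column names containing spaces.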


Problem 7: Figure out two ways to access a row from df. Output the results. 

In [33]:
print("Method 1 - loc (label-based):")
print(df.loc['one'])
print()

print("Method 2 - iloc (integer-based):")
print(df.iloc[1])

Method 1 - loc (label-based):
state    Ohio
year     2001
pop       1.7
Name: one, dtype: object

Method 2 - iloc (integer-based):
state    Ohio
year     2001
pop       1.7
Name: one, dtype: object
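Both `loc` and `iloc` also accept slices and a second argument for columns. One difference worth knowing is that label slices include the end label, while position slices exclude the stop position. The frame `people` below is made up for this sketch.

```python
import pandas as pd

people = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=["zero", "one", "two"],
)

by_label = people.loc["zero":"one", "year"]  # includes the row labelled 'one'
by_position = people.iloc[0:1, 1]            # stops before position 1

print(by_label.tolist())
print(by_position.tolist())
```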


Problem 8: Add a new column to your dataframe called 'rating' with the values `[5,4,3,2,1,0]`

In [34]:
df['rating'] = [5, 4, 3, 2, 1, 0]
df

Unnamed: 0,state,year,pop,rating
zero,Ohio,2000,1.5,5
one,Ohio,2001,1.7,4
two,Ohio,2002,3.6,3
three,Nevada,2001,2.4,2
four,Nevada,2002,2.9,1
five,Nevada,2003,3.2,0


Problem 9: Create another column called `nonsense` that is the rating multiplied by the pop.  

In [35]:
df['nonsense'] = df['rating'] * df['pop']
df

Unnamed: 0,state,year,pop,rating,nonsense
zero,Ohio,2000,1.5,5,7.5
one,Ohio,2001,1.7,4,6.8
two,Ohio,2002,3.6,3,10.8
three,Nevada,2001,2.4,2,4.8
four,Nevada,2002,2.9,1,2.9
five,Nevada,2003,3.2,0,0.0


Problem 10: Create three series using numpy.

* series_numerical: The values 0-4 with index a-e
* series_zeros: All zeros with index a-e
* series_random: 5 random numbers with index a-e

Create a DataFrame called `numeric_df` from these three series. Each series will be a row and the columns will be a-e.

In [36]:
index = ['a', 'b', 'c', 'd', 'e']

series_numerical = pd.Series(np.arange(5), index=index)
series_zeros = pd.Series(np.zeros(5), index=index)
series_random = pd.Series(np.random.random(5), index=index)

numeric_df = pd.DataFrame([series_numerical, series_zeros, series_random])
numeric_df

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,0.0,0.0,0.0,0.0,0.0
2,0.820649,0.857073,0.068809,0.55725,0.67327
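The random row will differ on every run. If repeatable numbers are wanted, one option is a seeded NumPy generator. This is a sketch, and the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d", "e"]

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = pd.Series(rng_a.random(5), index=labels)
second = pd.Series(rng_b.random(5), index=labels)

same = first.equals(second)
print(same)
```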


Problem 11: Transpose the DataFrame `numeric_df` so that the series become columns instead of rows. Save this to a variable called `transposed_numeric_df`. Rename the columns to `numerical`, `zeros`, and `random`.

In [37]:
transposed_numeric_df = numeric_df.T
transposed_numeric_df.columns = ['numerical', 'zeros', 'random']
transposed_numeric_df

Unnamed: 0,numerical,zeros,random
a,0.0,0.0,0.820649
b,1.0,0.0,0.857073
c,2.0,0.0,0.068809
d,3.0,0.0,0.55725
e,4.0,0.0,0.67327


Problem 12: Output `transposed_numeric_df` ordered by `random` descending.

In [38]:
transposed_numeric_df.sort_values(by='random', ascending=False)

Unnamed: 0,numerical,zeros,random
b,1.0,0.0,0.857073
a,0.0,0.0,0.820649
e,4.0,0.0,0.67327
d,3.0,0.0,0.55725
c,2.0,0.0,0.068809
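Note that `sort_values` returns a new, sorted DataFrame and leaves the original row order alone, so the result has to be assigned to a variable to be kept. A sketch with a made-up `table`:

```python
import pandas as pd

table = pd.DataFrame({"random": [0.2, 0.9, 0.5]}, index=["a", "b", "c"])

ordered = table.sort_values(by="random", ascending=False)

order_after = ordered.index.tolist()  # sorted copy
order_before = table.index.tolist()   # original is unchanged

print(order_after)
print(order_before)
```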


Problem 13: Select all rows from `transposed_numeric_df` where the column `numerical` is greater than 2.  

In [39]:
transposed_numeric_df[transposed_numeric_df['numerical'] > 2]

Unnamed: 0,numerical,zeros,random
d,3.0,0.0,0.55725
e,4.0,0.0,0.67327
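The expression inside the brackets is itself a Series of True/False values, often called a boolean mask. Masks can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The frame `nums` below is a stand-in for this sketch.

```python
import pandas as pd

nums = pd.DataFrame(
    {"numerical": [0.0, 1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d", "e"],
)

# The comparison alone gives one True/False per row
mask = nums["numerical"] > 2

# Two conditions combined: strictly between 0 and 4
both = nums[(nums["numerical"] > 0) & (nums["numerical"] < 4)]

print(mask.tolist())
print(both.index.tolist())
```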


Problem 14: Add a column called `random_5` that is the column random multiplied by 5. Select all rows from `transposed_numeric_df` where `numerical` is greater than `random_5`.

In [40]:
transposed_numeric_df['random_5'] = transposed_numeric_df['random'] * 5
transposed_numeric_df[transposed_numeric_df['numerical'] > transposed_numeric_df['random_5']]

Unnamed: 0,numerical,zeros,random,random_5
c,2.0,0.0,0.068809,0.344047
d,3.0,0.0,0.55725,2.786252
e,4.0,0.0,0.67327,3.36635


Problem 15: Add a column to `transposed_numeric_df` called `even` that is `True` when `numerical` is even and `False` when `numerical` is odd.

In [41]:
transposed_numeric_df['even'] = transposed_numeric_df['numerical'] % 2 == 0
transposed_numeric_df

Unnamed: 0,numerical,zeros,random,random_5,even
a,0.0,0.0,0.820649,4.103246,True
b,1.0,0.0,0.857073,4.285365,False
c,2.0,0.0,0.068809,0.344047,True
d,3.0,0.0,0.55725,2.786252,False
e,4.0,0.0,0.67327,3.36635,True


Problem 16: Add a column called `even_odd` that has the value `odd` when `numerical` is odd and `even` when `numerical` is even.

In [42]:
transposed_numeric_df['even_odd'] = np.where(transposed_numeric_df['numerical'] % 2 == 0, 'even', 'odd')
transposed_numeric_df

Unnamed: 0,numerical,zeros,random,random_5,even,even_odd
a,0.0,0.0,0.820649,4.103246,True,even
b,1.0,0.0,0.857073,4.285365,False,odd
c,2.0,0.0,0.068809,0.344047,True,even
d,3.0,0.0,0.55725,2.786252,False,odd
e,4.0,0.0,0.67327,3.36635,True,even
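`np.where` is one way to turn a condition into labels. Another possible approach, sketched here with a made-up Series, is to map the True/False values through a dict:

```python
import pandas as pd

numerical = pd.Series([0.0, 1.0, 2.0], index=["a", "b", "c"])

is_even = numerical % 2 == 0

# map looks up each True/False value in the dict
labels = is_even.map({True: "even", False: "odd"})

print(labels.tolist())
```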


Problem 17: Print out the sum of all columns.

In [43]:
print(transposed_numeric_df.sum())

numerical                  10.0
zeros                       0.0
random                 2.977052
random_5               14.88526
even                          3
even_odd     evenoddevenoddeven
dtype: object
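In the output above the boolean column is summed as a count of `True` values and the string column is concatenated, which is rarely what is wanted. A sketch of restricting the sum to numeric (and boolean) columns with `numeric_only=True`, using a small stand-in frame called `mixed`:

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        "numerical": [0.0, 1.0, 2.0],
        "even": [True, False, True],
        "even_odd": ["even", "odd", "even"],
    }
)

# String columns are left out of the result
totals = mixed.sum(numeric_only=True)

print(totals)
```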


Problem 18: Print out the index of the row with the max value for each column, e.g. for `numerical` it will be the last row `e` because it has a value of 4.

In [44]:
print(transposed_numeric_df.idxmax())

numerical    e
zeros        a
random       b
random_5     b
even         a
even_odd     b
dtype: object
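Two details of this output are worth a note. When several rows tie for the maximum, `idxmax` returns the first matching label, which is why `zeros` reports `a`. Also, whether `idxmax` accepts a string column such as `even_odd` may depend on the installed pandas version, so this is something to check locally. A cautious sketch that keeps only numeric columns first, using a made-up frame called `scores`:

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "numerical": [0.0, 4.0, 2.0],
        "zeros": [0.0, 0.0, 0.0],
        "label": ["x", "y", "z"],
    },
    index=["a", "b", "c"],
)

# Keep only the numeric columns before asking for the row of the max
numeric_part = scores.select_dtypes(include="number")
winners = numeric_part.idxmax()

print(winners)
```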
