# Intro to Pandas
___

**pandas** provides data structures and data manipulation tools designed for fast & easy data cleaning and analysis.

pandas adopts array-based computing from NumPy, but is designed for **tabular** or **heterogeneous** data.

pandas has two primary data structures:

- **Series** - one-dimensional array-like object with a sequence of values having the same datatype
- **DataFrame** - rectangular table of data with ordered collection of columns, each of which can be a different value type

In [5]:
# As always, we import pandas and numpy
import pandas as pd
import numpy as np

### Series

A Series has a sequence of values (all the same datatype) and an associated array of data labels called its **index**. If not specified otherwise, the index values are sequential integers starting at 0.

pandas can automatically determine datatype of values when a Series is created, but datatype can also be specified.
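
A minimal sketch of the point above, assuming nothing beyond the `dtype` argument of `pd.Series`. The names `inferred` and `explicit` are illustrative and not part of the original notebook.

```python
import pandas as pd

# pandas infers an integer dtype from a list of Python ints
inferred = pd.Series([1, 2, 3])

# The dtype argument requests a specific datatype instead
explicit = pd.Series([1, 2, 3], dtype="float64")

print(inferred.dtype)
print(explicit.dtype)  # float64
```

The values are the same in both cases; only the storage type differs.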

A Series is like a fixed-length, ordered dict with a mapping of index values to data values.
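
The dict analogy can be sketched as follows. The Series `prices` and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: index labels act like dict keys
prices = pd.Series([1.5, 0.25, 3.0], index=["apple", "banana", "cherry"])

# Membership tests check the index labels, like keys in a dict
print("banana" in prices)  # True
print("durian" in prices)  # False

# Label lookup works like dict[key]
print(prices["cherry"])    # 3.0

# Unlike a plain mapping, entries are ordered, so positional access also works
print(prices.iloc[0])      # 1.5

# to_dict() converts the Series back into a regular dict
print(prices.to_dict())
```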

The array representation and index object of a Series can be accessed via its **values** and **index** attributes.

Problem 1: Create a series of 4 numerical values. Print the series, the values and the index of the series.

In [9]:
s = pd.Series([15,23,55,5])
print("Series:")
print(s)

print("\nValues:")
print(s.values)

print("\nIndex:")
print(s.index)

Series:
0    15
1    23
2    55
3     5
dtype: int64

Values:
[15 23 55  5]

Index:
RangeIndex(start=0, stop=4, step=1)


Problem 2: Create a new series with the same values but with string values for the index. Select and print out one of the values in the series by its index label.

In [17]:
s = pd.Series([15,23,55,5], index=["a","b","c","d"])
print(s)

print("\nValue at index 'c':")
print(s["c"])

a    15
b    23
c    55
d     5
dtype: int64

Value at index 'c':
55


Problem 3: You can also create a series from a Python dict. Create a dict called `states_dict` with the state name as the key and the number as the value:

- 'Ohio': 35000
- 'Texas': 71000
- 'Oregon': 16000
- 'Utah': 5000

Use the dict `states_dict` to create a series called `states_series`.

In [18]:
states_dict = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

states_series = pd.Series(states_dict)
print(states_series)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


Problem 4: Update the index of `states_series` to be the abbreviation of the state names. Do this in place.

In [19]:
states_series.index = ["OH", "TX", "OR", "UT"]

print(states_series)

OH    35000
TX    71000
OR    16000
UT     5000
dtype: int64


### DataFrame

A pandas DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type.

A DataFrame can be thought of as a dict of Series that all share the same index.
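
One way to see the dict-of-Series picture, as a rough sketch. The names `idx`, `columns` and `frame` and the data are invented for illustration.

```python
import pandas as pd

# Two Series built on the same index labels
idx = ["x", "y", "z"]
columns = {
    "count": pd.Series([1, 2, 3], index=idx),
    "label": pd.Series(["low", "mid", "high"], index=idx),
}

# Each dict key becomes a column name; the shared labels become the row index
frame = pd.DataFrame(columns)

# Selecting a column gives back a Series carrying that same index
count_col = frame["count"]
print(type(count_col).__name__)  # Series
print(list(count_col.index))     # ['x', 'y', 'z']

# Each column keeps its own datatype
print(frame.dtypes)
```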

- DataFrames have both row and column indices
- DataFrames are physically 2D, but can represent higher-dimensional data using hierarchical indexing
- DataFrame rows are sometimes referred to as axis=0
- DataFrame columns are sometimes referred to as axis=1
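
The axis convention and hierarchical indexing from the list above can be sketched like this. The `scores` table is made up; the `pop` table reuses three population values from this notebook's data.

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, 80], "art": [70, 60]},
    index=["ann", "bob"],
)

# axis=0 works down the rows, giving one result per column
col_totals = scores.sum(axis=0)

# axis=1 works across the columns, giving one result per row
row_totals = scores.sum(axis=1)

print(col_totals)
print(row_totals)

# Hierarchical (multi-level) row index: state and year together identify a row
pop = pd.DataFrame(
    {"pop": [1.5, 1.7, 2.4]},
    index=pd.MultiIndex.from_tuples(
        [("Ohio", 2000), ("Ohio", 2001), ("Nevada", 2001)],
        names=["state", "year"],
    ),
)

# Selecting on the outer level returns the rows for that state, indexed by year
ohio = pop.loc["Ohio"]
print(ohio)
```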


You can create DataFrames from dicts as well. 

In [None]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
      }

df = pd.DataFrame(data)
df

Problem 5: Update the index of the DataFrame `df` to be the word for each row number, i.e. row 1 -> one.

In [21]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
      }

df = pd.DataFrame(data)
df.index=["one", "two", "three", "four", "five", "six"]
df

| | state | year | pop |
|---|---|---|---|
| one | Ohio | 2000 | 1.5 |
| two | Ohio | 2001 | 1.7 |
| three | Ohio | 2002 | 3.6 |
| four | Nevada | 2001 | 2.4 |
| five | Nevada | 2002 | 2.9 |
| six | Nevada | 2003 | 3.2 |


Problem 6: Figure out two ways to access a column from df. Output the results. 

In [24]:
print(df["state"])

print(df.year)

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


Problem 7: Figure out two ways to access a row from df. Output the results. 

In [25]:
# When you know the index label
print(df.loc["four"])

# When you know the row position
print(df.iloc[0])

state    Nevada
year       2001
pop         2.4
Name: four, dtype: object
state    Ohio
year     2000
pop       1.5
Name: one, dtype: object
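
`.loc` and `.iloc` also accept slices, with one difference worth knowing. The small sketch below uses a made-up frame `df_demo`, not the notebook's `df`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"year": [2000, 2001, 2002]},
    index=["one", "two", "three"],
)

# Label slices with .loc include the end label
by_label = df_demo.loc["one":"two"]

# Position slices with .iloc exclude the end position, like Python lists
by_position = df_demo.iloc[0:2]

print(by_label)
print(by_position)
```

Both selections return the rows `one` and `two`.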


Problem 8: Add a new column to your dataframe called 'rating' with the values `[5,4,3,2,1,0]`

In [30]:
df["rating"]=[5,4,3,2,1,0]

df

| | state | year | pop | rating |
|---|---|---|---|---|
| one | Ohio | 2000 | 1.5 | 5 |
| two | Ohio | 2001 | 1.7 | 4 |
| three | Ohio | 2002 | 3.6 | 3 |
| four | Nevada | 2001 | 2.4 | 2 |
| five | Nevada | 2002 | 2.9 | 1 |
| six | Nevada | 2003 | 3.2 | 0 |


Problem 9: Create another column called `nonsense` that is the rating multiplied by the pop.  

In [None]:
# Element-wise multiplication of the two columns
df["nonsense"]= df["pop"] * df["rating"]

df

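| | state | year | pop | rating | nonsense |
|---|---|---|---|---|---|
| one | Ohio | 2000 | 1.5 | 5 | 7.5 |
| two | Ohio | 2001 | 1.7 | 4 | 6.8 |
| three | Ohio | 2002 | 3.6 | 3 | 10.8 |
| four | Nevada | 2001 | 2.4 | 2 | 4.8 |
| five | Nevada | 2002 | 2.9 | 1 | 2.9 |
| six | Nevada | 2003 | 3.2 | 0 | 0.0 |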


Problem 10: Create three series using numpy.

* series_numerical: The values 0-4 with index a-e
* series_zeros: All zeros with index a-e
* series_random: 5 random numbers with index a-e

Create a DataFrame called `numeric_df` from these three series. Each series will be a row and the columns will be a-e.

In [None]:
# Create index labels
idx = ["a", "b", "c", "d", "e"]

# Create the three series
series_numerical = pd.Series(np.arange(5), index=idx)
series_zeros = pd.Series(np.zeros(5), index=idx)
series_random = pd.Series(np.random.rand(5), index=idx)


numeric_df = pd.DataFrame(
    [series_numerical, series_zeros, series_random],
    index=["series_numerical", "series_zeros", "series_random"]
)

numeric_df

| | a | b | c | d | e |
|---|---|---|---|---|---|
| series_numerical | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| series_zeros | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| series_random | 0.165371 | 0.788669 | 0.653938 | 0.277186 | 0.949238 |


Problem 11: Transpose the DataFrame `numeric_df` so that the series become columns instead of rows. Save this to a variable called `transposed_numeric_df`. Rename the columns to `numerical`, `zeros`, and `random`.

In [33]:

transposed_numeric_df = numeric_df.T

# Rename the columns
transposed_numeric_df.columns = ["numerical", "zeros", "random"]

transposed_numeric_df


| | numerical | zeros | random |
|---|---|---|---|
| a | 0.0 | 0.0 | 0.165371 |
| b | 1.0 | 0.0 | 0.788669 |
| c | 2.0 | 0.0 | 0.653938 |
| d | 3.0 | 0.0 | 0.277186 |
| e | 4.0 | 0.0 | 0.949238 |


Problem 12: Output `transposed_numeric_df` ordered by `random` descending.

In [43]:

transposed_numeric_df.sort_values(by="random", ascending=False)


| | numerical | zeros | random |
|---|---|---|---|
| e | 4.0 | 0.0 | 0.949238 |
| b | 1.0 | 0.0 | 0.788669 |
| c | 2.0 | 0.0 | 0.653938 |
| d | 3.0 | 0.0 | 0.277186 |
| a | 0.0 | 0.0 | 0.165371 |
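
Note that `sort_values` returns a new, sorted DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch with made-up data (`demo` and `ordered` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({"random": [0.2, 0.9, 0.5]}, index=["a", "b", "c"])

# The sorted result is a new object
ordered = demo.sort_values(by="random", ascending=False)

print(list(ordered.index))  # ['b', 'c', 'a']
print(list(demo.index))     # ['a', 'b', 'c'], original order is unchanged
```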


Problem 13: Select all rows from `transposed_numeric_df` where the column `numerical` is greater than 2.  

In [47]:

transposed_numeric_df[transposed_numeric_df["numerical"] > 2]


| | numerical | zeros | random |
|---|---|---|---|
| d | 3.0 | 0.0 | 0.277186 |
| e | 4.0 | 0.0 | 0.949238 |
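
The selection above works because the comparison produces a boolean Series, which is then used as a row mask. A sketch with a small stand-in frame (`demo`, `mask` and `middle` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame(
    {"numerical": [0.0, 1.0, 2.0, 3.0, 4.0], "zeros": [0.0] * 5},
    index=["a", "b", "c", "d", "e"],
)

# The comparison yields one True/False value per row, aligned to the index
mask = demo["numerical"] > 2
print(mask)

# Conditions combine with & (and) and | (or); each needs its own parentheses
middle = demo[(demo["numerical"] > 0) & (demo["numerical"] < 4)]
print(middle)
```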


Problem 14: Add a column called `random_5` that is the column random multiplied by 5. Select all rows from `transposed_numeric_df` where `numerical` is greater than `random_5`.

In [None]:

transposed_numeric_df["random_5"] = transposed_numeric_df["random"] * 5

transposed_numeric_df[
    transposed_numeric_df["numerical"] > transposed_numeric_df["random_5"]
]


| | numerical | zeros | random | random_5 |
|---|---|---|---|---|
| d | 3.0 | 0.0 | 0.277186 | 1.385931 |


Problem 15: Add a column to `transposed_numeric_df` called `even` that is `True` when `numerical` is even and `False` when `numerical` is odd.

In [38]:

transposed_numeric_df["even"] = transposed_numeric_df["numerical"] % 2 == 0

transposed_numeric_df


| | numerical | zeros | random | random_5 | even |
|---|---|---|---|---|---|
| a | 0.0 | 0.0 | 0.165371 | 0.826853 | True |
| b | 1.0 | 0.0 | 0.788669 | 3.943347 | False |
| c | 2.0 | 0.0 | 0.653938 | 3.26969 | True |
| d | 3.0 | 0.0 | 0.277186 | 1.385931 | False |
| e | 4.0 | 0.0 | 0.949238 | 4.746192 | True |


Problem 16: Add a column called `even_odd` that has the value `odd` when `numerical` is odd and `even` when `numerical` is even.

In [49]:
transposed_numeric_df["even_odd"] = np.where(
    transposed_numeric_df["numerical"] % 2 == 0, "even", "odd"
)
transposed_numeric_df

| | numerical | zeros | random | random_5 | even | even_odd |
|---|---|---|---|---|---|---|
| a | 0.0 | 0.0 | 0.165371 | 0.826853 | True | even |
| b | 1.0 | 0.0 | 0.788669 | 3.943347 | False | odd |
| c | 2.0 | 0.0 | 0.653938 | 3.26969 | True | even |
| d | 3.0 | 0.0 | 0.277186 | 1.385931 | False | odd |
| e | 4.0 | 0.0 | 0.949238 | 4.746192 | True | even |


Problem 17: Print out the sum of all columns.

In [51]:
print(transposed_numeric_df.sum())

numerical                  10.0
zeros                       0.0
random                 2.834402
random_5              14.172012
even                          3
even_odd     evenoddevenoddeven
dtype: object
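
In the output above the string column `even_odd` is "summed" by concatenation. If only numeric totals are wanted, one option is the `numeric_only` argument of `sum`. The frame `demo` below is a made-up stand-in, not the notebook's data.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "numerical": [0.0, 1.0, 2.0],
        "even_odd": ["even", "odd", "even"],
    },
    index=["a", "b", "c"],
)

# numeric_only=True leaves string columns out of the result
totals = demo.sum(numeric_only=True)
print(totals)
```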


Problem 18: Print out the index of the row with the max value for each column, i.e. for `numerical` it will be the last row `e` because it has a value of 4.

In [53]:
print(transposed_numeric_df.idxmax())

numerical    e
zeros        a
random       e
random_5     e
even         a
even_odd     b
dtype: object
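
For string columns such as `even_odd`, the "max" is an alphabetical comparison, which is rarely what is wanted. A possible refinement, sketched here with a made-up frame `demo`, is to restrict the call to numeric columns with `select_dtypes` first.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "numerical": [0.0, 1.0, 2.0],
        "zeros": [0.0, 0.0, 0.0],
        "random": [0.9, 0.1, 0.5],
        "even_odd": ["even", "odd", "even"],
    },
    index=["a", "b", "c"],
)

# Keep only the numeric columns before asking for the row label of the max
numeric_cols = demo.select_dtypes(include="number")
max_labels = numeric_cols.idxmax()
print(max_labels)
```

When several rows tie for the maximum, as in `zeros`, `idxmax` returns the first matching label.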
