### PANDAS

__Data manipulation and cleaning__

* We often want to use data that is heterogeneous, a fancy word for data that is of different types. The most familiar type of heterogeneous data for most people is an Excel spreadsheet. In Python, we have already seen tuples, which allow for heterogeneous data.

* pandas is used most often for "data munging", which really means creating clean, easy-to-use data sets

* two workhorse structures are: Series and DataFrame

* we can run basic plots using pandas methods

* Subsetting and grouping are probably the most powerful aspects of PANDAS.
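
As a quick taste of subsetting and grouping, here is a minimal sketch. The small `villages` table, its column names, and its numbers are made up for illustration and do not come from any course data set.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
villages = pd.DataFrame({
    "village": ["A", "B", "C", "D"],
    "maize_yield": ["High", "High", "Low", "Low"],
    "cases": [120, 80, 10, 30],
})

# Subsetting: keep only the rows where the Boolean condition is True
many_cases = villages[villages["cases"] > 50]
print(many_cases)

# Grouping: split the rows by maize_yield, then average the cases in each group
mean_cases = villages.groupby("maize_yield")["cases"].mean()
print(mean_cases)
```

The same two patterns (a Boolean condition inside square brackets, and `groupby` followed by a summary method) should carry over to larger tables.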

__Series__
* one-dimensional array-like object that contains a sequence of values and an associated array of data labels (index) - the index means that when you print out a series object, it will display the index on the left and the actual associated values in the series on the right.
* you can also think about a Series as a fixed-length, ordered dictionary. In fact, you can convert dictionaries into Series.
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

In [1]:
# series example
import pandas as pd
# we didn't provide an index, so the default one was automatically given
obj1 = pd.Series([4,7,-5,3])
print(obj1)
# we could explicitly give an index
obj2 = pd.Series([1,4,9,16,25,36],index = ['a','b','c','d','e','f'])
print(obj2)
# note: .index is an attribute rather than a method, so it is written without parentheses
print("you can print out just the index with a method called, unsurprisingly, .index")
print(obj2.index)
print("````````````````````````````")
print("We can apply Boolean conditions!")
print(obj2[obj2>24])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("We can use broadcasting to apply math to each element")
print(obj2*2)
print(" We can convert dictionaries to Series: ")
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
dic_example = {'zebras':2, 'lion':4,'kookaburra':5}
obj3 = pd.Series(dic_example)
print(obj3)
print("Series objects automatically align via index. So if you added obj3 to obj3, the animals would align and the summation would take place over the associated numbers.")
print(obj3+obj3)

0    4
1    7
2   -5
3    3
dtype: int64
a     1
b     4
c     9
d    16
e    25
f    36
dtype: int64
you can print out just the index with a method called, unsurprisingly, .index
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
````````````````````````````
We can apply Boolean conditions!
e    25
f    36
dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We can use broadcasting to apply math to each element
a     2
b     8
c    18
d    32
e    50
f    72
dtype: int64
 We can convert dictionaries to Series: 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
zebras        2
lion          4
kookaburra    5
dtype: int64
Series objects automatically align via index. So if you added obj3 to obj3, the animals would align and the summation would take place over the associated numbers.
zebras         4
lion           8
kookaburra    10
dtype: int64
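
Adding `obj3` to itself hides what alignment really does, because every label matches. The sketch below uses two Series that share only some labels to show the effect more clearly. The names `zoo_a` and `zoo_b` and their counts are invented for illustration.

```python
import pandas as pd

# Two Series that share only some labels (made-up counts for illustration)
zoo_a = pd.Series({"zebras": 2, "lion": 4, "kookaburra": 5})
zoo_b = pd.Series({"lion": 1, "zebras": 3, "emu": 7})

# Addition matches values by label, not by position
total = zoo_a + zoo_b
print(total)

# Labels present in only one Series give NaN above;
# the add method with fill_value=0 treats the missing side as 0 instead
total_filled = zoo_a.add(zoo_b, fill_value=0)
print(total_filled)
```

Note that the result contains the union of both indexes, and that the presence of NaN turns the values into floats.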


__DataFrames__
* DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
* there are some rules about how to set up data frames that follow Hadley Wickham's "tidy data" conventions (he is an important fellow in the R world: he created ggplot2 and is one of the main innovators behind RStudio).
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
* There are several ways to create a DataFrame. One way (like the above example with Series) is to use a dictionary.
For example: 

In [2]:
import pandas as pd
# Create a dictionary using plain Python. This is probably the easiest way to build a DataFrame.
# The name brics_dict avoids shadowing Python's built-in dict type.
# (area appears to be in millions of square km and population in millions of people)
brics_dict = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
              "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
              "area": [8.516, 17.10, 3.286, 9.597, 1.221],
              "population": [200.4, 143.5, 1252, 1357, 52.98] }
# -----------------------------
# turn this dictionary into a pandas DataFrame called
# BRICS - Brazil.Russia.India.China.SouthAfrica
# -----------------------------
brics = pd.DataFrame(brics_dict)
print(brics)

        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98


In [3]:
# Set the index for brics
brics.index = ["BR", "RU", "IN", "CH", "SA"]
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("Change the indexing from the default 0-->4 to BR, RU, IN, CH, SA")
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
# Print out brics with new index values
print(brics)

# Note: you can also convert a list into a pandas DataFrame!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Change the indexing from the default 0-->4 to BR, RU, IN, CH, SA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98
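
The note about converting a list can be sketched as follows. This is one possible approach, not the only one: the names `rows`, `records`, `brics_small`, and `capitals` are chosen here for illustration, and the values are copied from the brics table above.

```python
import pandas as pd

# Each inner list is one row; values copied from the first two rows of brics
rows = [
    ["Brazil", "Brasilia", 8.516, 200.4],
    ["Russia", "Moscow", 17.10, 143.5],
]

# With a list of lists, the column names have to be supplied separately
brics_small = pd.DataFrame(
    rows,
    columns=["country", "capital", "area", "population"],
    index=["BR", "RU"],
)
print(brics_small)

# A list of dictionaries also works; the keys become the column names
records = [
    {"country": "China", "capital": "Beijing"},
    {"country": "South Africa", "capital": "Pretoria"},
]
capitals = pd.DataFrame(records)
print(capitals)
```

A dictionary of columns tends to be convenient when the data is already organized by variable, while a list of rows fits data that arrives one observation at a time.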


In [5]:
# THIS IS VERY EXCITING - you can import files, .csv are popular, in a manner 
# similar to the one provided by R!
# import malaria maize data set
#---------------------
# YOUR filepath will possibly look different depending on where you placed the 
# malariaMaize.csv file. 
#---------------------
mal_maize = pd.read_csv("/Users/daniellepresgraves/mypython/malariaMaize.csv")

# Print out dataset
print(mal_maize.head())
print(mal_maize.tail())

          Villages Maize_yield  mean_elevation_(m)  person-years  \
0            Gulim        High                2050         23458   
1           Denbun        High                2050         20630   
2            Wadra        High                2050         12805   
3  Zalimashebekuma        High                2050         16857   
4       Adel Agata        High                2050         12805   

   Number_of_malaria_cases  IncidenceRate/10000  
0                      683                  291  
1                       88                   43  
2                      170                  133  
3                      370                  219  
4                      226                  176  
         Villages Maize_yield  mean_elevation_(m)  person-years  \
16        Shakawa         Low                2200         17441   
17     Agnifereda         Low                2450         17309   
18       Jibgedel         Low                2450         15570   
19       Bokotabo      
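
Because the file path above is specific to one machine, here is a self-contained sketch of the same idea. It assumes nothing about where malariaMaize.csv lives: instead it reads three rows, copied from the `head()` output above, out of an in-memory string. The names `csv_text` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Stand-in for the CSV file: three rows copied from the head() output above
csv_text = """Villages,Maize_yield,mean_elevation_(m),person-years,Number_of_malaria_cases,IncidenceRate/10000
Gulim,High,2050,23458,683,291
Denbun,High,2050,20630,88,43
Wadra,High,2050,12805,170,133
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the data type pandas inferred for each column
print(sample.head(2))  # head() takes an optional row count (default is 5)
```

Checking `shape` and `dtypes` right after importing is a cheap way to confirm that the file was parsed the way you expected.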

In [6]:
mal_maize = pd.read_csv("/Users/daniellepresgraves/mypython/malariaMaize.csv")

# Print out number of cases column as Pandas Series: SINGLE BRACKET
print("Here it is as a series:")
print(mal_maize['Number_of_malaria_cases'])

print("------------------------------")
print("Here it is as a DataFrame")
# Print out number of cases column as Pandas DataFrame: DOUBLE BRACKET
print(mal_maize[['Number_of_malaria_cases']])
print("------------------------------")
# see the difference between Series and Dataframe? 
# One includes column names, for one thing. 

# Print out DataFrame with number of cases  and Maize_yield columns
print(mal_maize[['Number_of_malaria_cases', 'Maize_yield']])
print("~~~~~~~~~~~~~")
# Print out first 4 observations/rows: SLICING
print(mal_maize[0:4])

# Print out fifth, sixth, and seventh observation/row
print(mal_maize[4:7])

print("------------------------------")
print("Here is an example of iloc - although just pulling out columns is easier")
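# Print out the first two columns (Villages and Maize_yield) for every row, selected by
# integer position. To get the 3rd observation/row instead (indexing starts at 0, so
# this would be Wadra), use mal_maize.iloc[2]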
print(mal_maize.iloc[:,0:2])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
# iloc[...] selects by integer position and loc[...] selects by index label. With the default
# index the labels equal the positions, but a loc slice includes its end label:
# loc[0:3] returns four rows (0 through 3), whereas iloc[0:3] would return three.
print(mal_maize.loc[0:3])

Here it is as a series:
0      683
1       88
2      170
3      370
4      226
5      552
6      857
7      395
8       71
9      218
10     180
11     177
12      94
13       5
14       8
15      79
16      66
17       5
18      23
19    2359
20     398
Name: Number_of_malaria_cases, dtype: int64
------------------------------
Here it is as a DataFrame
    Number_of_malaria_cases
0                       683
1                        88
2                       170
3                       370
4                       226
5                       552
6                       857
7                       395
8                        71
9                       218
10                      180
11                      177
12                       94
13                        5
14                        8
15                       79
16                       66
17                        5
18                       23
19                     2359
20                      398
----------------------------
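
The difference between `iloc` and `loc` is easier to see when the index is not the default 0, 1, 2, ... The sketch below builds a small table from the first four rows shown earlier and uses the village names as the index. That choice of index, and the variable names, are assumptions made only for this illustration.

```python
import pandas as pd

# First four villages from the output above, with village names as the index
df = pd.DataFrame(
    {
        "Maize_yield": ["High", "High", "High", "High"],
        "Number_of_malaria_cases": [683, 88, 170, 370],
    },
    index=["Gulim", "Denbun", "Wadra", "Zalimashebekuma"],
)

# Single brackets return a Series; double brackets return a DataFrame
print(type(df["Number_of_malaria_cases"]))
print(type(df[["Number_of_malaria_cases"]]))

# iloc selects by integer position; the end of the slice is excluded
by_position = df.iloc[0:2]            # rows at positions 0 and 1

# loc selects by label; the end of the slice is included
by_label = df.loc["Gulim":"Wadra"]    # Gulim, Denbun, and Wadra

# One row by position, and one cell by row label and column name
third_row = df.iloc[2]
wadra_cases = df.loc["Wadra", "Number_of_malaria_cases"]
print(by_position, by_label, third_row, wadra_cases, sep="\n\n")
```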

## In-lecture example:
* Are body mass and brain mass related in mammals?
Use the mammals.csv file to investigate this question with plots: graph body mass versus brain mass. Label the axes.
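
One possible starting point is sketched below. The contents of mammals.csv are not shown here, so the column names `body_mass_kg` and `brain_mass_g`, the units, and the numbers in the stand-in table are all hypothetical; swap in the real column names after loading the file with `pd.read_csv`. The helper `plot_body_vs_brain` is likewise just an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for mammals.csv: the column names and values are
# invented for this sketch and may differ from the real file
mammals = pd.DataFrame({
    "body_mass_kg": [0.02, 3.0, 60.0, 500.0, 4000.0],
    "brain_mass_g": [0.4, 25.0, 1300.0, 650.0, 5000.0],
})


def plot_body_vs_brain(df, x="body_mass_kg", y="brain_mass_g"):
    """Scatter plot of brain mass against body mass with labelled axes."""
    ax = df.plot.scatter(x=x, y=y)   # pandas draws the plot with matplotlib
    ax.set_xscale("log")             # masses span several orders of magnitude
    ax.set_yscale("log")
    ax.set_xlabel("Body mass (kg)")
    ax.set_ylabel("Brain mass (g)")
    return ax
```

Calling `plot_body_vs_brain(mammals)` in a notebook should draw the figure, provided matplotlib is installed. The log scales are a design choice rather than a requirement: with values this spread out, a linear scale tends to squash most points into one corner.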