### Pandas

Pandas is a Python package for importing and working with tabular data, such as spreadsheet tables.

It provides a dataframe-style interface, very similar to data frames in R.

Pandas uses NumPy arrays "underneath" the Pandas structures, and it is easy to "extract" NumPy arrays from a Pandas 
dataframe.   Many functions in sklearn and other packages will accept either Pandas data frames or NumPy arrays as inputs.
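As a minimal sketch of that extraction (the frame and its column names are made up for illustration), `to_numpy()` hands back the values as a plain NumPy array, and the array can be wrapped up as a data frame again:

```python
import numpy as np
import pandas as pd

# hypothetical example frame, all columns numeric
demo_df = pd.DataFrame({"height": [1.5, 1.7, 1.6], "weight": [60.0, 72.5, 65.0]})

# pull the values out as a plain NumPy array
demo_arr = demo_df.to_numpy()
print(type(demo_arr))
print(demo_arr.shape)

# a NumPy array can be turned back into a data frame by supplying column names
round_trip = pd.DataFrame(demo_arr, columns=demo_df.columns)
print(round_trip)
```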

Pandas allows for different types of data per column, as opposed to the requirement in NumPy that arrays be homogeneous.

Pandas has a massive number of built-in functions that allow for vectorized operations on the DataFrame, plus a bunch of visualization tools built into it.

Pandas does have some quirks in how you access and slice the data. It doesn't use the standard NumPy square bracket approach: df[1,2] does not work, so watch for that.
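A small sketch of the quirk, using a made-up frame: plain square brackets look for a column label, so NumPy-style `df[1,2]` fails, while `iloc` (integer positions) and `loc` (labels) do the job.

```python
import numpy as np
import pandas as pd

# hypothetical 3 x 4 frame with default integer row labels
quirk_df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=["a", "b", "c", "d"])

# plain brackets treat (1, 2) as a column label, which does not exist here
try:
    quirk_df[1, 2]
    bracket_ok = True
except KeyError:
    bracket_ok = False
print("df[1, 2] worked:", bracket_ok)

# integer positions go through iloc: row 1, column 2
by_position = quirk_df.iloc[1, 2]

# labels go through loc: row labeled 1, column labeled "c"
by_label = quirk_df.loc[1, "c"]

print(by_position, by_label)
```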


# Sources of this material

Chapter 5 of "Python for Data Analysis"- Wes McKinney

    https://learning.oreilly.com/library/view/python-for-data/9781491957653/ch05.html
    
The Pandas website is at

    https://pandas.pydata.org/

    
    

# Pandas Series

1-D array like objects in Pandas

In [1]:
import pandas as pd
import numpy as np

In [2]:
# creating a series from a list

#lists have indices and objects

my_obj=pd.Series([4,7,-9,2])

In [3]:
my_obj

0    4
1    7
2   -9
3    2
dtype: int64

In [4]:
my_obj.values

array([ 4,  7, -9,  2], dtype=int64)

In [5]:
# this is the index to this obj

my_obj.index

RangeIndex(start=0, stop=4, step=1)

Question- what is a RangeIndex??   

In [6]:
# we can specify the index when creating the series

my_series2=pd.Series([-1,2,3,11],index=['a','b','c','d'])

# we can then select values with the index- note this as a difference from a numpy array

my_series2['b']

2

A series is sort of like a fixed length dictionary,  it is easy to create a series from a dictionary

In [7]:
s_dict={"Ohio":2,"Pennsylvania":11,"Texas":311}

my_series3=pd.Series(s_dict)

my_series3

Ohio              2
Pennsylvania     11
Texas           311
dtype: int64

Series automatically align the indexes when you do operations with them,  kind of like a join

In [8]:

s_dict2={"Ohio":8,"Pennsylvania":9,"New York":89,"Texas":10}

my_series4=pd.Series(s_dict2)

my_series3+my_series4

#note the insertion of NaN when there is no matching entry,   output is also sorted by index

New York          NaN
Ohio             10.0
Pennsylvania     20.0
Texas           321.0
dtype: float64

In [9]:
# detecting the null entry

out_series=my_series3+my_series4

out_series.isnull()

New York         True
Ohio            False
Pennsylvania    False
Texas           False
dtype: bool

An interesting feature of Pandas Series is that they can be date indexed, so that the series represents a time series.
In this form, there are a whole bunch of member functions for working with the data as a time series.
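Here is a rough sketch of a date-indexed series, with an arbitrary start date and made-up values, showing selection by date and one of the time series member functions (`resample`):

```python
import numpy as np
import pandas as pd

# six consecutive days, starting on a date picked only for this example
dates = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# select a single value by its date label
print(ts.loc["2021-01-03"])

# slice by a range of dates, both end dates are included
first_three = ts["2021-01-01":"2021-01-03"]
print(first_three)

# group the days into two-day bins and sum each bin
two_day_totals = ts.resample("2D").sum()
print(two_day_totals)
```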

See

    https://pandas.pydata.org/docs/reference/series.html

We'll see about coming back to time series analysis a bit later if I can locate a good discussion of how Pandas series are used
in this approach.  They seem to be more commonly found as a component of pandas dataframes.


    https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/

# DataFrames in Pandas

Pandas has dataframes,   each column of a data frame is a Pandas series,  these are the two basic storage forms in Pandas

We can move data into Pandas data frames in a number of ways

-manual entry

-importing CSV files

-pulling data from databases or from website APIs

The R data type "Factor" has no exact twin in Pandas. The closest thing is the categorical dtype ("category"), but in practice 
you often wind up using either strings or integers as factor-like variables, through what is called either integer coding of categories or "one-hot" encoding,  more on that later
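For what it's worth, here is a short sketch of the factor-like options on a made-up column of labels: the `category` dtype with its integer codes, and one-hot encoding through `pd.get_dummies`.

```python
import pandas as pd

# hypothetical factor-like column
grades = pd.Series(["low", "high", "medium", "low"])

# convert to the categorical dtype; the levels are stored once
grades_cat = grades.astype("category")
print(grades_cat.cat.categories)

# integer coding: each entry is replaced by the position of its level
codes = grades_cat.cat.codes
print(codes)

# one-hot encoding: one indicator column per level
one_hot = pd.get_dummies(grades)
print(one_hot)
```

By default the levels are ordered alphabetically, not by meaning, so "high" comes before "low" here.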

# Manually entering data into a dataframe

In [10]:
# inserting variables into a data frame
a=np.array([1,2,3,4])
b=["Hey","Hey","My","My"]
c=(1.2, 2.1, 1.3, -2.0)

In [11]:
# use one or more cells to figure out what type of objects a, b and c are,  also what type of data is in each-  show this 
# work in this cell

# when looking at examples,  always, always be aware of what data types are in use







In [12]:
# Manually entering data into a dataframe
# insert into a data frame
# note that the data here is forced into the form of a dictionary, with the variable names in quotes

stuff_df=pd.DataFrame({'x':a,'y':b,'z':c})

stuff_df

   x    y    z
0  1  Hey  1.2
1  2  Hey  2.1
2  3   My  1.3
3  4   My -2.0


In [13]:
# reading a csv file
#we will look at the standard red-wine data set

infile="C:\\Users\\hdavi\\Dropbox\\Data_Analytics\\DAT511_Data_Cleaning\\Lectures_Fall_2018\\Example_data\\UCI_wine_data\\winequality-red.csv"

# notice that a regex string specifier is used in the pd.read_csv() function here, "\;|\,", to indicate that the separator 
# (or delimiter) being used is 
# ";" or ",",   the escape code \ is used so that ; and , are interpreted as literal characters

# The red wine set used here is labeled as a csv but is not really comma delimited (sigh),   it is ; delimited

# sep="\;" would have worked fine here,   I used the regex expression to show how to allow for two different delimiters.

#  there is also pd.read_table available,  which reads a range of formats

redwine_df=pd.read_csv(infile, sep="\;|\,")


# we have an available member function head to let us look at the data

redwine_df.head(4)


  redwine_df=pd.read_csv(infile, sep="\;|\,")


FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\hdavi\\Dropbox\\Data_Analytics\\DAT511_Data_Cleaning\\Lectures_Fall_2018\\Example_data\\UCI_wine_data\\winequality-red.csv'
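Since the file path above only exists on one machine, here is a self-contained sketch of the same idea using an in-memory sample. The three-column sample text is made up, only imitating the style of the wine file (";" delimited, quoted header). A regular-expression separator needs the python parsing engine, and with it the quotes in the header may be kept as part of the column names, which is a plausible source of quoted column names.

```python
import io
import pandas as pd

# made-up sample in the style of the wine file: ";" delimited, quoted header
sample_text = '"fixed acidity";"citric acid";"alcohol"\n7.4;0.0;9.4\n7.8;0.04;9.8\n'

# single-character separator: the default parser removes the quotes from the header
plain_df = pd.read_csv(io.StringIO(sample_text), sep=";")
print(list(plain_df.columns))

# regular-expression separator allowing ";" or ",": handled by the python engine
regex_df = pd.read_csv(io.StringIO(sample_text), sep=r";|,", engine="python")
print(list(regex_df.columns))
print(regex_df.shape)
```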

In [None]:
#API fetch

#this is a data frame download from the Open Data Buffalo repository at
#https://data.buffalony.gov/

# in this case, this is a copy of the City of Buffalo Public Art Inventory
# note this just makes use of pd.read_csv again with the url from the data set,  which I found on Open Data Buffalo

url="https://data.buffalony.gov/resource/6xz2-syui.csv"
    
df_art=pd.read_csv(url)

#make sure the data frame has no white space in the column names- this causes odd problems later

df_art.columns = df_art.columns.str.strip()

In [None]:
df_art.head()

# Other input functions

There are a large number of pandas read functions,  see

    https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv

Pandas can read Excel files, csv, pickled data, general "tables" from text files or URLs, the clipboard, json files, html files or sites, hdf, feather, parquet, sas, spss, sql tables, stata files, etc.

It can write to most of these file formats as well.

Writing to a csv is very simple,   df.to_csv() for example writes a dataframe to a csv,  with labels etc.
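A minimal sketch of writing and re-reading a csv, using a made-up frame and an in-memory buffer in place of a file on disk (passing a file name to `to_csv` works the same way):

```python
import io
import pandas as pd

# hypothetical small frame
out_df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# write to an in-memory buffer; index=False leaves the row labels out of the file
buffer = io.StringIO()
out_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# reading the text back gives the same table
back_df = pd.read_csv(io.StringIO(csv_text))
print(back_df)
```

Without `index=False`, the row labels are written as an extra first column, which shows up as an "Unnamed: 0" column when the file is read back.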

# basic information and plots

Pandas has a lot of basic functions built in,  we will look at them using the redwine_df

In [None]:
# the head function

redwine_df.head(3)

In [None]:
# basic info
# size, shape and ndim

#size- number of elements in the array
print(redwine_df.size)
#shape,   rows by columns
print(redwine_df.shape)
#ndim  number of dimensions
print(redwine_df.ndim)

In [None]:
redwine_df.memory_usage()

In [None]:
# strip any white space out of the column names
# the pandas load function is prone to having trouble with leading or trailing whitespace


redwine_df.columns = redwine_df.columns.str.strip()

In [None]:
# In R, we had the summary function to give us a overview of the data frame values
# in python the equivalent is the member function describe

redwine_df.describe()

In [None]:
# get a list of the columns
redwine_df.columns

We have some weird double quote thing going on here,  we should probably figure out how to clean that up,  use some sort of 
regex approach

It turns out that pandas has a member function to change column names
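Two possible ways to do that cleanup, sketched on a small stand-in frame (not the real redwine_df, which is left untouched so the cells below still work with the quoted names):

```python
import pandas as pd

# stand-in frame with the same kind of quoted column names
quoted_df = pd.DataFrame({'"citric acid"': [0.0, 0.04], '"alcohol"': [9.4, 9.8]})

# option 1: rename() with a function that is applied to every column name,
# this returns a new frame and leaves quoted_df alone
clean_df = quoted_df.rename(columns=lambda name: name.strip('"'))
print(list(clean_df.columns))

# option 2: string methods on the column index, assigned back in place
quoted_df.columns = quoted_df.columns.str.strip('"')
print(list(quoted_df.columns))
```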

In [None]:
# we can apply describe to only specific columns if desired
redwine_df[['"citric acid"','"chlorides"']].describe()

In [None]:
# We have member functions for count, mean, median etc

print(redwine_df[ '"alcohol"'].mean())
print(redwine_df[ '"alcohol"'].median())
print(redwine_df[ '"alcohol"'].std())

In [None]:
# we can find the location or index of the things like the min and max

print(redwine_df['"alcohol"'].idxmax())
print(redwine_df['"alcohol"'].idxmin())

In [None]:
# we can apply these to all columns

redwine_df.mean()

Pandas has some built-in plot functions,

note the need here for 

In [None]:
redwine_df['"citric acid"'].hist()

# Question or goal to complete

Go to open data buffalo,   look under "data" and filter down to show only datasets.

Find an interesting looking data set,  find the URL for the dataset and load it into a pandas data frame,  using the URL if 
you can.

Some of the URL links load the data in ways that are not compatible with the pd.read_csv function.  If that happens,  download
the data as a CSV and load it that way

Find out the following,  putting each of these in separate cell and adding comments to the notebooks

-the size of the data 
-show the head
-use describe() to get some basic statistics
-show the column names
- try to create a histogram of one column,  if you have continuous data in the dataframe


If you have dates in the dataset,  convert that column to the pandas datetime type

                  df['Date'] = df['Date'].astype('datetime64[ns]')

Pandas has member functions for working with dates as a distinct type of variable
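A short sketch of the date conversion on a made-up frame. It uses `pd.to_datetime`, which is another common way to do the conversion, and then the `.dt` accessor to reach a couple of the date member functions:

```python
import pandas as pd

# hypothetical frame with dates stored as strings
date_df = pd.DataFrame({"Date": ["2020-01-15", "2020-02-20", "2021-03-05"],
                        "count": [3, 5, 2]})

# convert the string column to a datetime column
date_df["Date"] = pd.to_datetime(date_df["Date"])
print(date_df.dtypes)

# the .dt accessor exposes the date-specific member functions
years = date_df["Date"].dt.year
month_names = date_df["Date"].dt.month_name()
print(years)
print(month_names)
```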

# Grouping

This allows you to apply the built in operations using a categorical grouping variable

In [None]:
# there is a group by operation available
# note that in the groupby operation, I had to give the full name of the column df_art['type']
# which is not what the pandas manual indicates

df_art['title'].groupby(df_art['type']).count()              
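For comparison, a sketch on a small made-up frame (the titles and types are invented) showing the form used above next to the form that groups the whole frame by a column name. Both should give the same counts.

```python
import pandas as pd

# hypothetical stand-in for the art inventory
art_like = pd.DataFrame({
    "title": ["Mural A", "Statue B", "Mural C", "Statue D", "Mural E"],
    "type": ["mural", "sculpture", "mural", "sculpture", "mural"],
})

# form used above: group one column by the values of another Series
counts_a = art_like["title"].groupby(art_like["type"]).count()

# alternative form: group the frame by column name, then pick the column to count
counts_b = art_like.groupby("type")["title"].count()

print(counts_a)
print(counts_b)
```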

Creating a Pivot Table

In [None]:
table=pd.pivot_table(df_art,values='category',index=['site','type'],columns='city',aggfunc="count")

print(table)
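Since the call above depends on downloading the art inventory, here is the same kind of pivot table on a tiny made-up frame with the same column names, so the layout of the result is easy to check by hand:

```python
import pandas as pd

# hypothetical data, reusing the column names from the art inventory example
pivot_demo = pd.DataFrame({
    "site": ["park", "park", "library", "library", "park"],
    "type": ["mural", "mural", "statue", "mural", "statue"],
    "city": ["Buffalo", "Buffalo", "Buffalo", "Amherst", "Buffalo"],
    "category": ["public", "public", "public", "public", "public"],
})

# rows are (site, type) pairs, columns are cities, cells count the entries
demo_table = pd.pivot_table(pivot_demo, values="category",
                            index=["site", "type"], columns="city", aggfunc="count")
print(demo_table)
```

Combinations that never occur in the data show up as NaN in the table.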

In [None]:
df_art.columns

# Question or goal

Figure out how to create a Pivot table using the data you downloaded from Open Data Buffalo

Look for categories, and use the count option

# Melting a data frame

Converts a "wide format" data frame into a "long format" data frame

In a long format data frame,  we have 1 or more identifier columns,  here I used title as the identifier- think of this
like a data base index.   Note, you can have a composite index of more than one column

The other two columns are the name of the variable followed by the value

So the "melted form" is always  index, variable, value

Some types of operations are easily carried out on the melted or long form of a dataframe

I've figured out some cool plotting tactics, but other than that I don't use this form much.  I should learn to do that 
more effectively.
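To make the index, variable, value layout concrete, here is a sketch on a two-row made-up frame. It also pivots the long form back to wide, to show that no information is lost in the melt.

```python
import pandas as pd

# hypothetical wide-format frame: one row per title
wide_df = pd.DataFrame({
    "title": ["Mural A", "Statue B"],
    "type": ["mural", "sculpture"],
    "year": [1999, 2005],
})

# long format: one row per (title, variable) pair
long_df = pd.melt(wide_df, id_vars=["title"], value_vars=["type", "year"])
print(long_df)

# pivot the long form back into one row per title
wide_again = long_df.pivot(index="title", columns="variable", values="value")
print(wide_again)
```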

In [None]:
temp=pd.melt(df_art,id_vars=['title'],value_vars=['category','type','photo_url_link','latitude','longitude'])
temp.head(10)

In [None]:
temp.tail(10)

#  Question- Create a melt of your data set, with a couple of variables included

What variable or variables will you need in your index.  Do you need a composite index?

Bonus- how could you make use of this melted version of the data?   I've used melted data to produce some interesting plots, but it's tough to think of other uses

# Slicing and accessing sections of a data frame

In [None]:
# create an example frame from a dictionary

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
t_frame = pd.DataFrame(data)

In [None]:
t_frame

In [None]:
# accessing columns, two forms

print(t_frame['state'])
print("--------------------------")

print(t_frame.state)

In [None]:
# there is also a loc attribute that can pull rows

t_frame.loc[3]

In [None]:
t_frame.loc[2:3]

In [None]:
# we can add columns by assignment
t_frame['debt'] =-1

Look at t_frame after the debt term was added,  what has happened?   Can you figure out how to set debt to a set of different
random values (1 per state)


In [None]:
# Add a boolean column to t-frame identifying eastern states

t_frame['eastern']= (t_frame.state=="Ohio")

t_frame

In [None]:
# we can use the del statement to remove a column

del t_frame['eastern']

t_frame.columns

In [None]:
# Reindexing- changing the order of columns or rows

a=pd.Series([4.5,7.2,-5.3,3.6], index=['d','b','a','c'])
a

In [None]:
a2=a.reindex(['a','b','c','d'])
a2

In [None]:
# reordering a data frame

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame

In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

# note the insertion of the blank row for b,  NaN means not a number

frame2

In [None]:
# reordering columns,  using the columns keyword

states = ['Texas', 'Utah', 'California']

frame.reindex(columns=states)



It's not obvious to me what the point of reindexing is,  I guess you could clean up the appearance of dataframes
for output?  Any ideas?
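One possible answer, offered as a suggestion and not the only use: reindexing lines a data set up against a known full set of labels, so that missing entries show up explicitly and can be filled. A sketch with made-up sales numbers:

```python
import pandas as pd

# hypothetical sales, recorded only on some days of the week
sales = pd.Series([5, 3, 8], index=["Mon", "Wed", "Fri"])
week = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# days with no record become NaN
with_gaps = sales.reindex(week)
print(with_gaps)

# or supply a value to use for the missing days
filled = sales.reindex(week, fill_value=0)
print(filled)
```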

In [None]:
# Dropping elements

data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['Ohio', 'Colorado', 'Utah', 'New York'],columns=['one', 'two', 'three', 'four'])
data

In [None]:
data.drop(['Ohio','Colorado'])

In [None]:
data.drop(['two','four'], axis='columns')

In [None]:
# Indexing with integers- works like standard numpy or python indexing- note the need to use iloc, with square brackets

data.iloc[0,0]

In [None]:
data.iloc[0:2,]

In [None]:
#loc uses the column and row names

data.loc[["Colorado","Utah"],"three"]
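One difference worth watching when slicing, sketched on a copy of the same frame under a different name: `iloc` slices stop before the end position, like Python lists, while `loc` slices include the end label.

```python
import numpy as np
import pandas as pd

# same layout as the data frame above, under a separate name
slice_df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                        index=["Ohio", "Colorado", "Utah", "New York"],
                        columns=["one", "two", "three", "four"])

# positions 0 and 1 for rows, 1 and 2 for columns: the end position is excluded
by_pos = slice_df.iloc[0:2, 1:3]
print(by_pos)

# labels Ohio through Utah, two through three: the end label is included
by_lab = slice_df.loc["Ohio":"Utah", "two":"three"]
print(by_lab)
```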

# Question/activity

Show several slices of your data frame,   using column names and integer locations

Explain what is happening in the slice

# Function Application and Mapping

Mapping refers to applying functions across a data structure

The numpy methods also work on Pandas data frames

In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

In [None]:
# applying a lambda function to the columns of a data frame
# a lambda function is a small anonymous function, defined in a single expression

f=lambda x: x.max()-x.min()

# the default apply() is by column,   as per the R apply() function

frame.apply(f)

In [None]:
# we can also use apply() across rows instead of columns

frame.apply(f,axis="columns")

Note that there are a number of member functions, which don't require the use of apply(),   frame.mean(), max, min, sum
etc

Use dir(frame) to see what is available.

There is a version of apply called applymap() that acts on each element in the data frame. In pandas 2.1 and later it is also available under the name map(), and applymap() is deprecated.

In [None]:
mysquare= lambda x: x**2

frame.applymap(mysquare)
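A sketch that works on either side of the rename, using a small made-up frame: it uses `map()` when the installed pandas has it and falls back to `applymap()` otherwise. For simple arithmetic like squaring, the vectorized form gives the same values without calling a Python function on every element.

```python
import numpy as np
import pandas as pd

# hypothetical small integer frame
num_df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=list("bde"))
square = lambda x: x ** 2

# element-wise application, under whichever name this pandas version provides
if hasattr(num_df, "map"):
    squared = num_df.map(square)
else:
    squared = num_df.applymap(square)
print(squared)

# vectorized equivalent
squared_vec = num_df ** 2
print(squared_vec)
```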

In [None]:
# Sorting and Ranking

In [None]:
# sorting by index

tseries = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

tseries.sort_index()


In [None]:
tframe = pd.DataFrame(np.arange(8).reshape((2, 4)),index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
tframe

In [None]:
# sort by row
tframe.sort_index()

In [None]:
# sort by column

tframe.sort_index(axis='columns')

In [None]:
# sort by values,  rather than by indices
tframe.sort_values(by="b")
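The cells above cover sorting. For the ranking half, here is a minimal sketch with made-up values: `rank()` assigns each entry its position in sorted order, and by default tied values share the average of their positions.

```python
import pandas as pd

# hypothetical scores, with a tie between "a" and "c"
scores = pd.Series([7, -5, 7, 4], index=["a", "b", "c", "d"])

# smallest value gets rank 1; the two 7s tie for positions 3 and 4, so each gets 3.5
ranks = scores.rank()
print(ranks)
```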

# Question

produce a version of your data set from open data buffalo,  sorted by one of the columns

Figure out how to create a reversed or backwards sorted version of the data frame as well