# Introduction to Pandas
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. 
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures.
We will introduce three fundamental Pandas data structures: the Series, DataFrame, and Index. 

In [2]:
#start our code sessions with the standard NumPy and Pandas imports
import numpy as np
import pandas as pd

#A Pandas Series is a one-dimensional array of indexed data. 
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
#The index is an array-like object of type pd.Index
data.index

# Values can be accessed by the associated index via the familiar Python square-bracket notation
print(data[1]) 
data[1:3]

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
0.5


1    0.50
2    0.75
dtype: float64

The index need not be an integer, but can consist of values of any desired type. For example, if we wish, we can use strings as an index:

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
# item access works with string index
data['b'] 

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


0.5

A Pandas Series is a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

In [13]:
#constructing a Series object directly from a Python dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [14]:
#typical dictionary-style item access can be performed
print(population['California'])
# Unlike a dictionary, a Series also supports slicing by label; the end label is included
print(population['California':'Illinois'])

38332521
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
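
The dictionary analogy can be pushed a little further. The sketch below rebuilds the same `population` Series and tries a few dictionary-style operations on it, then contrasts label-based slicing with positional slicing through `iloc`. The variable names (`has_texas`, `by_label`, and so on) are chosen here for illustration only.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Membership tests look at the index labels, much like dictionary keys
has_texas = 'Texas' in population
has_ohio = 'Ohio' in population

# keys() returns the index of the Series
labels = list(population.keys())

# Slicing by label includes the end label ...
by_label = population['Texas':'Florida']
# ... whereas slicing by position excludes the end position
by_position = population.iloc[1:3]

print(has_texas, has_ohio)
print(labels)
print(by_label)
print(by_position)
```

The difference between the two slices is easy to overlook: the label slice keeps `Florida`, while the positional slice stops before position 3.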


In [15]:
# data can be a list or NumPy array, in which case the index defaults to an integer sequence
print(pd.Series([2, 4, 6]))
print('\n')

#data can be a scalar, which is repeated to fill the specified index:
print(pd.Series(5, index=[100, 200, 300]))
print('\n')

# data can be a dictionary, in which case the index defaults to the dictionary keys
# (kept in insertion order here; older Pandas releases sorted the keys instead)
print(pd.Series({2:'a', 1:'b', 3:'c'}))

0    2
1    4
2    6
dtype: int64


100    5
200    5
300    5
dtype: int64


2    a
1    b
3    c
dtype: object
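
In each of the cases above, the index can also be set explicitly. As a small sketch using the same dictionary, passing `index` together with a dictionary keeps only the requested labels, in the requested order; a label that is absent from the dictionary (the `4` below is an illustrative choice) ends up as a missing value.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only the labels listed in `index` are kept, in that order
selected = pd.Series(letters, index=[3, 2])
print(selected)

# A requested label that is not a dictionary key yields a missing value
with_missing = pd.Series(letters, index=[3, 4])
print(with_missing)
```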


# DataFrame
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. 
You can think of a DataFrame as a sequence of aligned Series objects. 

In [16]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [17]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
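
To make the idea of "aligned Series" concrete, here is a minimal sketch that uses three of the states and deliberately lists them in a different order in each Series. The DataFrame matches values by label rather than by position, and selecting a column returns a Series again. The shortened data and the variable names are assumptions made for this example.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127})
# Same labels, deliberately listed in a different order
area = pd.Series({'New York': 141297,
                  'Texas': 695662,
                  'California': 423967})

states = pd.DataFrame({'population': population, 'area': area})
print(states)

# Each row pairs the values that share a label, regardless of input order
texas_row = states.loc['Texas']
print(texas_row)

# Selecting a single column gives back a Series with the shared index
area_column = states['area']
print(type(area_column).__name__)
```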


Like the Series object, the DataFrame has an index attribute that gives access to the index labels.
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels.
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

In [20]:
print(states.index)
print(states.columns)

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Index(['population', 'area'], dtype='object')
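
The comparison with a two-dimensional NumPy array can be sketched as follows, again with a shortened version of the `states` data. The underlying values form a plain 2D array, and a single cell can be reached either by its labels (`loc`) or by its integer positions (`iloc`).

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193, 19651127],
                       'area': [423967, 695662, 141297]},
                      index=['California', 'Texas', 'New York'])

# The data itself is a two-dimensional array: (rows, columns)
print(states.shape)
print(states.values)

# The same cell, addressed by labels and by integer positions
texas_area = states.loc['Texas', 'area']
same_cell = states.iloc[1, 1]
print(texas_area, same_cell)
```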


A Pandas DataFrame can be constructed in a variety of ways. 

In [26]:
#From a single Series object
print(pd.DataFrame(population, columns=['population']))
print('\n')
#From a list of dicts
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
print(pd.DataFrame(data))
print('\n')
#if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") 
print(pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]))
print('\n')

#From a dictionary of Series objects
print(pd.DataFrame({'population': population,
              'area': area}))
print('\n')

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


   a  b
0  0  0
1  1  2
2  2  4


     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
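
Two further construction routes connect back to the NumPy analogy from the introduction: a plain two-dimensional array with explicit labels, and a NumPy structured array whose field names become the column names. This is a sketch only; the labels `foo`, `bar`, `a`, `b`, `c` and the fields `A`, `B` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# From a two-dimensional NumPy array, with explicit row and column labels
grid = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(grid,
                          columns=['foo', 'bar'],
                          index=['a', 'b', 'c'])
print(from_array)

# From a NumPy structured array: the field names become the columns
records = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
from_records = pd.DataFrame(records)
print(from_records)
```

If `columns` and `index` are omitted in the first case, Pandas falls back to integer labels for both.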




# Load Data From CSV 
CSV (comma-separated value) files are a common file format for transferring and storing data. 
The ability to read, manipulate, and write data to and from CSV files using Python is a key skill.
We look at how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files.
More advanced options for reading CSV files are described at https://www.shanelynn.ie/python-pandas-read_csv-load-data-from-csv-files/

In [31]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
# Read data from file 'Berkeley.csv' 
# (located in the working directory of your Python process)
# Control delimiters, rows, column names with read_csv (see later) 
data = pd.read_csv("Berkeley.csv") 
# Preview the loaded data 
print(data)

       Admit  Gender Dept  Freq
0   Admitted    Male    A   512
1   Rejected    Male    A   313
2   Admitted  Female    A    89
3   Rejected  Female    A    19
4   Admitted    Male    B   353
5   Rejected    Male    B   207
6   Admitted  Female    B    17
7   Rejected  Female    B     8
8   Admitted    Male    C   120
9   Rejected    Male    C   205
10  Admitted  Female    C   202
11  Rejected  Female    C   391
12  Admitted    Male    D   138
13  Rejected    Male    D   279
14  Admitted  Female    D   131
15  Rejected  Female    D   244
16  Admitted    Male    E    53
17  Rejected    Male    E   138
18  Admitted  Female    E    94
19  Rejected  Female    E   299
20  Admitted    Male    F    22
21  Rejected    Male    F   351
22  Admitted  Female    F    24
23  Rejected  Female    F   317
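
As a hedged illustration of how `read_csv` can control delimiters, rows, and column names, the sketch below reads from an in-memory string instead of a file so that it runs on its own. The semicolon-separated text is a made-up variant of the first rows of the table above, not the contents of `Berkeley.csv`.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk
semicolon_text = (
    "Admit;Gender;Dept;Freq\n"
    "Admitted;Male;A;512\n"
    "Rejected;Male;A;313\n"
    "Admitted;Female;A;89\n"
)

# sep sets the delimiter, nrows limits the rows read, usecols picks columns
subset = pd.read_csv(io.StringIO(semicolon_text),
                     sep=';', nrows=2, usecols=['Admit', 'Freq'])
print(subset)

# header=0 together with names replaces the header row with new column names
renamed = pd.read_csv(io.StringIO(semicolon_text),
                      sep=';', header=0,
                      names=['admit', 'gender', 'dept', 'freq'])
print(renamed)
```

The same keyword arguments apply when a file path is passed in place of the `StringIO` object.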


Now you can run a query, for example selecting the rows for admitted female applicants:

In [22]:
results = data.loc[(data["Admit"] == "Admitted") & 
          (data["Gender"] == "Female" )]
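
To see what such a query returns without needing the CSV file, here is a small self-contained sketch that rebuilds the first eight rows of the table (departments A and B) by hand. It names the two boolean masks before combining them, and also shows the `query()` method as one possible alternative spelling of the same filter.

```python
import pandas as pd

# First eight rows of the table above, typed in by hand for this sketch
data = pd.DataFrame({
    'Admit': ['Admitted', 'Rejected'] * 4,
    'Gender': ['Male', 'Male', 'Female', 'Female'] * 2,
    'Dept': ['A'] * 4 + ['B'] * 4,
    'Freq': [512, 313, 89, 19, 353, 207, 17, 8],
})

# Each comparison yields a boolean Series; & combines them element-wise
is_admitted = data['Admit'] == 'Admitted'
is_female = data['Gender'] == 'Female'
results = data.loc[is_admitted & is_female]
print(results)

# The same filter written as a query string
same_results = data.query("Admit == 'Admitted' and Gender == 'Female'")
print(same_results)
```

The parentheses around each comparison in the one-line form matter, because `&` binds more tightly than `==` in Python.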

Write the results to an output file, including the index, using to_csv():

In [25]:
results['Freq'].to_csv("FrequencyAdmittedFemale.csv")
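
The following sketch shows what the written file looks like and how it can be read back. It writes to an in-memory buffer rather than to disk, and the small `results` frame is a hand-made stand-in for the query result, so the values are illustrative only.

```python
import io
import pandas as pd

# Hand-made stand-in for a query result; the index holds the original row numbers
results = pd.DataFrame({'Dept': ['A', 'B'], 'Freq': [89, 17]}, index=[2, 6])

# Write one column, with its index, to an in-memory buffer instead of a file
buffer = io.StringIO()
results['Freq'].to_csv(buffer, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, telling read_csv that the first column is the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)

# index=False leaves the index out; with no path, to_csv returns a string
without_index = results.to_csv(index=False)
print(without_index)
```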

# Some Examples

In [29]:
import numpy as np
import pandas as pd

'''
The following code is to help you play with the concept of a DataFrame in Pandas.

You can think of a DataFrame as something with rows and columns. It is
similar to a spreadsheet, a database table, or R's data.frame object.

*This playground is inspired by Greg Reda's post on Intro to Pandas Data Structures:
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
'''

'''
To create a DataFrame, you can pass a dictionary of lists to the DataFrame
constructor:
1) Each key of the dictionary becomes a column name
2) The associated list holds the values within that column.
'''
# Change True to False to skip this DataFrame example
if True:
    data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
            'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                     'Lions', 'Lions'],
            'wins': [11, 8, 10, 15, 11, 6, 10, 4],
            'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
    football = pd.DataFrame(data)
    print (football)

'''
Pandas also has various attributes and methods that will help you understand some
basic information about your data frame. Some of these are:
1) dtypes: to get the datatype for each column
2) describe: useful for seeing basic statistics of the dataframe's numerical
   columns
3) head: displays the first five rows of the dataset
4) tail: displays the last five rows of the dataset
'''
# Change False to True to see these functions in action
if True:
    data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
            'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                     'Lions', 'Lions'],
            'wins': [11, 8, 10, 15, 11, 6, 10, 4],
            'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
    football = pd.DataFrame(data)
    print (football.dtypes)
    print ('\n')
    # Generate descriptive statistics 
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
    print (football.describe()) 
    print ('\n')
    print (football.head())
    print ('\n')
    print (football.tail())

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
losses     int64
team      object
wins       int64
year       int64
dtype: object


          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000


   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012


   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
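
As a short follow-up sketch on the same `football` data, `head` and `tail` accept an optional row count, and the figures reported by `describe` can be checked against a direct calculation. The variable names below are illustrative.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
             'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12],
})

# head and tail take the number of rows to show (the default is five)
first_three = football.head(3)
last_two = football.tail(2)
print(first_three)
print(last_two)

# describe() summarizes only the numeric columns here, so 'team' is left out
summary = football.describe()
print(list(summary.columns))

# A single statistic can be computed directly on one column
mean_wins = football['wins'].mean()
print(mean_wins)
```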

In [30]:
import pandas as pd

'''
The following code is to help you play with the concept of Series in Pandas.

You can think of a Series as a one-dimensional object that is similar to
an array, list, or column in a database. By default, it will assign an
index label to each item in the Series ranging from 0 to N, where N is
the number of items in the Series minus one.

Please feel free to play around with the concept of Series and see what it does

*This playground is inspired by Greg Reda's post on Intro to Pandas Data Structures:
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
'''
# Change False to True to create a Series object
if False:
    series = pd.Series(['Dave', 'Cheng-Han', 'Udacity', 42, -1789710578])
    print (series)

'''
You can also manually assign indices to the items in the Series when
creating the series
'''

# Change False to True to see custom index in action
if True:
    series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager', 'Course Number', 'Power Level'])
    print (series)

'''
You can use the index to select specific items from the Series
'''
# Change False to True to see Series indexing in action
if True:
    series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])
    print (series['Instructor'])
    print ("")
    print (series[['Instructor', 'Curriculum Manager', 'Course Number']])

'''
You can also use boolean conditions to select specific items from the Series
'''
# Change False to True to see boolean indexing in action
if True:
    cuteness = pd.Series([1, 2, 3, 4, 5], index=['Cockroach', 'Fish', 'Mini Pig',
                                                 'Puppy', 'Kitten'])
    print (cuteness > 3)
    print ("")
    print (cuteness[cuteness > 3])

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
Power Level                9001
dtype: object
Dave

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
dtype: object
Cockroach    False
Fish         False
Mini Pig     False
Puppy         True
Kitten        True
dtype: bool

Puppy     4
Kitten    5
dtype: int64
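
Boolean conditions can also be combined. The sketch below reuses the `cuteness` Series and applies the element-wise operators `&` (and), `|` (or), and `~` (not); the names `middle`, `extremes`, and `not_cute` are made up for this example.

```python
import pandas as pd

cuteness = pd.Series([1, 2, 3, 4, 5],
                     index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])

# Both conditions must hold; each comparison needs its own parentheses
middle = cuteness[(cuteness > 1) & (cuteness < 5)]

# Either condition may hold
extremes = cuteness[(cuteness == 1) | (cuteness == 5)]

# ~ inverts a boolean mask
not_cute = cuteness[~(cuteness > 3)]

print(middle)
print(extremes)
print(not_cute)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used here.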
