# Pandas
-----
 - Where to get help from? Stack Overflow!
 - Reference books: Python for Data Analysis (O'Reilly), Learning the Pandas Library by Matt Harrison
 - Planet Python: https://planetpython.org/
 - Data Skeptic podcast: https://dataskeptic.com/





# The Series Data Structure

-------
 - The series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary.

In [None]:
import pandas as pd

pd.Series? # you can see the documentation, you can pass in some data, an index, and a name.


In [None]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

In [None]:
numbers = [1, 2, 3]
pd.Series(numbers)

 - Underneath, pandas stores series values in a typed array using the NumPy library. This offers a significant speed-up when processing data versus traditional Python lists.
 - Pandas also does some type conversion. If we create a series from a list of strings in which one element is `None`, pandas keeps it as `None` and uses the `object` type for the underlying array.

In [None]:
animals = ['Tiger', 'Bear', None]
pd.Series(animals)

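 - If we create a series from a list of numbers in which one element is `None`, pandas instead converts it to the special floating-point value `NaN` (not a number).
 - `NaN` is not `None`, and an equality test between them is false. `NaN` is not even equal to itself.
 - You need to use special functions to test for the presence of not a number, such as NumPy's `isnan`.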

In [None]:
numbers = [1, 2, None]
pd.Series(numbers)

In [None]:
import numpy as np
np.nan == None

In [None]:
np.nan == np.nan

In [None]:
np.isnan(np.nan)
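
 - As a small self-contained sketch of the points above: a list of numbers containing `None` becomes a `float64` series holding `NaN`, and `NaN` never compares equal to anything. `pd.isna` is a pandas helper (not used elsewhere in these notes) that flags both `None` and `NaN` as missing.

```python
import numpy as np
import pandas as pd

strings = pd.Series(['Tiger', 'Bear', None])
numbers = pd.Series([1, 2, None])

# The numeric list is upcast to floats so that NaN can represent the gap
print(numbers.dtype)

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)

# pd.isna flags the missing entry in both series
print(pd.isna(strings).tolist())
print(pd.isna(numbers).tolist())
```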

 - A series can be created from dictionary data. If you do this, the index is automatically set to the keys of the dictionary you provided, rather than to incrementing integers.

In [None]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

 - Once the series has been created, we can get the index object using the index attribute. 

In [None]:
s.index

 - You could also separate your index creation from the data by passing in the index as a list explicitly to the series. 

In [None]:
s = pd.Series(['Tiger', 'Bear', 'Moose'], index = ['India', 'America', 'Canada'])
s

 - If you pass both a dictionary and an explicit index, pandas overrides the automatic index creation and uses only, and all of, the index values you provided. Dictionary keys that are not in your index are ignored, and any index value that is not among the dictionary keys gets a `None` or `NaN` value.

In [None]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s

# Querying a Series
-------

In [None]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

 - To query by numeric location, starting at zero, use the `iloc` attribute. 

In [None]:
s.iloc[3]

 - To query by the index label, you can use the `loc` attribute.
 - Keep in mind that `iloc` and `loc` are not methods, they are attributes.

In [None]:
s.loc['Golf']

In [None]:
s['Golf']

 - So what happens if your index is a list of integers? This is a bit complicated, because pandas can't determine automatically whether you intend to query by index position or by index label. The safer option is to be explicit and use `iloc` or `loc` directly.

In [None]:
sports = {99: 'Bhutan',
         100: 'Scotland',
         101: 'Japan',
         102: 'South Korea'}
s = pd.Series(sports)

In [None]:
s[0] # This won't call s.iloc[0] as one might expect, it generates an error instead

In [None]:
s.iloc[0]
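
 - A minimal sketch of the difference when the labels are themselves integers. The variable names `by_position` and `by_label` are only for illustration.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

by_position = s.iloc[0]   # first entry, whatever its label is
by_label = s.loc[99]      # entry whose label is 99

print(by_position, by_label)

# No entry is labelled 0, so a label lookup for it raises KeyError
try:
    s.loc[0]
except KeyError:
    print('no entry labelled 0')
```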

 - Let's talk about working with the data. A common task is to consider all of the values inside a series and perform some operation on them.

In [None]:
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s

 - We could write a little routine which iterates over all of the items in the series and adds them together to get a total. This works, but it's slow.

In [None]:
total = 0
for item in s:
    total += item
print(total)

 - Pandas and the underlying NumPy library support a method of computation called vectorization, which applies an operation to a whole array at once instead of one item at a time.
 - To use it here, we just call `np.sum` and pass in an iterable item.

In [None]:
import numpy as np

total = np.sum(s)
print(total)

 - The `head` method reduces the amount of data printed out by the series to the first five elements.

In [None]:
# This creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head()

In [None]:
len(s)

#### `timeit`
- Let's see whether the second solution is actually faster than the first. We can examine this with a magic function.
- Magic functions begin with a `%` sign. If we type this sign and then hit the Tab key, we can see a list of the available magic functions.
- We're going to use what's called a cell magic function. These start with two percent signs: `%%timeit`.
- You can give `timeit` the number of loops to run with `-n`. If you omit it, `timeit` chooses a suitable number of loops itself.

In [None]:
%%timeit -n 100
summary = 0
for item in s:
    summary += item

In [None]:
%%timeit -n 100
summary = np.sum(s)
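
 - `%%timeit` only works inside IPython or Jupyter. As a rough sketch for a plain Python script, the same comparison can be made with the standard library `timeit` module. The helper `loop_sum` and the choice of `number=20` are assumptions made for this example, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum(series):
    """Add the items one at a time, as in the explicit loop above."""
    total = 0
    for item in series:
        total += item
    return total

# Run each approach 20 times and report the total elapsed seconds
loop_time = timeit.timeit(lambda: loop_sum(s), number=20)
vector_time = timeit.timeit(lambda: np.sum(s), number=20)

print(f'loop: {loop_time:.4f}s  vectorized: {vector_time:.4f}s')
```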

 - A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series.

In [None]:
s += 2  # adds two to each item in s using broadcasting
s.head()

In [None]:
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for label, value in s.items():
    s.at[label] = value + 2
s.head()

In [None]:
%%timeit -n 10
s = pd.Series(np.random.randint(0, 1000, 10000))
for label, value in s.items():  # items() replaces the removed iteritems()
    s.loc[label] = value + 2

In [None]:
%%timeit -n 10
s = pd.Series(np.random.randint(0, 1000, 10000))
s += 2
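
 - A small sketch of broadcasting on a tiny series, so the result is easy to check by eye. A plain arithmetic expression such as `s * 2` returns a new series, while `s += 2` updates `s` itself. The name `doubled` is only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

doubled = s * 2   # new series; s still holds 1, 2, 3 at this point
s += 2            # s now holds the updated values

print(doubled.tolist())
print(s.tolist())
```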

 - The .loc attribute lets you not only modify data in place, but also add new data as well. If the value you pass in as the index doesn't exist, then a new entry is added. And keep in mind, indices can have mixed types.
 - While it's important to be aware of the typing going on underneath, Pandas will automatically change the underlying NumPy types as appropriate. 

In [None]:
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s
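
 - To make the automatic type change visible, this sketch records the dtype before and after adding a string under a new label. The variable `dtype_before` is only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
dtype_before = s.dtype

# 'Animal' is not an existing label, so a new entry is added
s.loc['Animal'] = 'Bears'

print(dtype_before, '->', s.dtype)
print(s.index.tolist())
```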

 - There are a couple of important considerations when combining two series (historically with the `append` method, which was removed in pandas 2.0 in favor of `pd.concat`):
     1. Pandas takes the series and tries to infer the best data type to use.
     2. Combining doesn't change the underlying series. It returns a new series made up of the two put together.
     (This is a significant issue for new pandas users who are used to objects being changed in place. Watch out for it, not just here but with other pandas functions as well.)
     3. When we query the combined series for the countries that have cricket as their national sport, we don't get a single value but a series. This is very common, and if you have a relational database background, it is similar to every table query returning a result set that is itself a table.
     

In [None]:
original_sports = pd.Series({'Archery': 'Bhutan',
                            'Golf': 'Scotland',
                            'Sumo': 'Japan',
                            'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                     'Barbados',
                                     'Pakistan',
                                     'England'],
                                    index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
# Series.append was removed in pandas 2.0; pd.concat returns the same combined series
all_countries = pd.concat([original_sports, cricket_loving_countries])

In [None]:
original_sports

In [None]:
cricket_loving_countries

In [None]:
all_countries
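
 - The cricket query mentioned in the notes is not shown in the cells above, so here is a shortened sketch of it. It uses `pd.concat` to combine the two series and a smaller set of countries than the notebook; the variable `cricket` is only for illustration.

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados'],
                                     index=['Cricket', 'Cricket'])

all_countries = pd.concat([original_sports, cricket_loving_countries])

# The original series is untouched by the combination
print(len(original_sports))

# A repeated label returns a series, not a single value
cricket = all_countries.loc['Cricket']
print(type(cricket).__name__)
print(cricket.tolist())
```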

# The DataFrame Data Structure
-------
 - The DataFrame data structure is the heart of the pandas library. It's the primary object that you'll be working with in data analysis and cleaning tasks.
 - The DataFrame is conceptually a two-dimensional series object, where there's an index and multiple columns of content, with each column having a label.
 - In fact, the distinction between a column and a row is really only a conceptual distinction, and you can think of the DataFrame itself as simply a two-axis labeled array.

In [None]:
import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                       'Item Purchased': 'Dog Food',
                       'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevin',
                       'Item Purchased': 'Kitty Litter',
                       'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                       'Item Purchased': 'Bird Seed',
                       'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index = ['Store 1', 'Store 1', 'Store 2'])
df.head()

 - Because the DataFrame is two-dimensional, passing a single value to the `loc` indexing operator will return a series if there's only one row to return.

In [None]:
df.loc['Store 2']

In [None]:
type(df.loc['Store 2'])

 - It's important to remember that the indices and column names along either axis, horizontal or vertical, could be non-unique.
 - For instance, in this example, we see two purchase records for Store 1 as different rows. If we use a single value with the DataFrame `loc` attribute, multiple rows of the DataFrame are returned, not as a new series, but as a new DataFrame.

In [None]:
df.loc['Store 1']
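
 - A compact sketch of the two cases side by side, on a cut-down version of the purchases table. Passing a list of labels to `loc` is shown as one way to get a DataFrame back regardless of how many rows match. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

one_row = df.loc['Store 2']    # label appears once
two_rows = df.loc['Store 1']   # label appears twice

print(type(one_row).__name__)
print(type(two_rows).__name__)

# A list of labels gives a DataFrame even for a single matching row
always_frame = df.loc[['Store 2']]
print(type(always_frame).__name__)
```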

 - One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes.

In [None]:
df.loc['Store 1', 'Cost']

 - What if we just wanted to do column selection and get a list of all of the costs?
     1. First, you can get a transpose of the DataFrame using the capital `T` attribute, which swaps all of the columns and rows. This essentially turns your column names into indices, so we can then use the `loc` attribute. This works, but it's pretty ugly:

In [None]:
df.T

In [None]:
df.T.loc['Cost']

 - Since `iloc` and `loc` are used for row selection, the pandas developers reserved the indexing operator directly on the DataFrame for column selection.
 - In a pandas DataFrame, columns always have a name, so this selection is always label based.
     2. So the second way to do it is simply to use the indexing operator:
 

In [None]:
df['Cost']

 - You can also chain operations together.
 - Chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the DataFrame.
 - For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data, though, this is an important distinction and can be a source of error.

In [None]:
df.loc['Store 1']['Cost']
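
 - When the goal is to change data, one way to sidestep the copy-versus-view question is to give `loc` both the row and the column in a single call. This is a sketch on a cut-down purchases table; the value `1.0` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# One indexing call with (row label, column label): the assignment targets df itself
df.loc['Store 1', 'Cost'] = 1.0

print(df['Cost'].tolist())
```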

 - `.loc` also supports slicing. If we want to select all rows, we can use a colon to indicate a full slice from beginning to end, and then add the column name as the second parameter as a string. If we want to include multiple columns, we can pass them in a list, and pandas will bring back only the columns we asked for.

In [None]:
df.loc[:, ['Name', 'Cost']]

 - It's easy to delete data in series and DataFrames, and we can use the `drop` function to do so.
 - This function takes a single parameter, which is the index or row label to drop.
 - The `drop` function doesn't change the DataFrame by default. Instead, it returns a copy of the DataFrame with the given rows removed.

In [None]:
df.drop('Store 1')

In [None]:
df

 - Let's make a copy with the `copy` method and do a drop on it instead. This is a very typical pattern in pandas, where in-place changes to a DataFrame are only made if need be, usually for changes involving indices.

In [None]:
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

 - `drop` has two interesting optional parameters. The first is called `inplace`; if it's set to `True`, the DataFrame is updated in place instead of a copy being returned.
 - The second parameter is `axis`, which selects the axis to drop from. By default this value is 0, indicating the row axis, but you can change it to 1 if you want to drop a column.

In [None]:
copy_df.drop?

In [None]:
del copy_df['Name']
copy_df
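
 - A short sketch of the `axis` parameter on a cut-down purchases table. The `columns=` keyword is an alternative spelling of `axis=1` that `drop` also accepts. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

without_rows = df.drop('Store 1')          # axis=0 is the default: drop rows
without_name = df.drop('Name', axis=1)     # axis=1: drop a column
same_thing = df.drop(columns=['Name'])     # keyword form of the line above

# df itself is unchanged because inplace was left at its default
print(df.shape)
```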

In [None]:
df['Location'] = None
df

# DataFrame Indexing and Loading

In [None]:
costs = df['Cost']
costs

In [None]:
costs += 2
costs

In [None]:
df
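
 - Whether `costs += 2` also changes `df` depends on whether pandas handed back a view or a copy of the column, and that behavior can differ between pandas versions. If the original must stay untouched, taking an explicit copy removes the doubt. This is a sketch on a cut-down purchases table.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

costs = df['Cost'].copy()   # explicit copy: independent of df
costs += 2

print(costs.tolist())
print(df['Cost'].tolist())  # the DataFrame column keeps its original values
```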

In [None]:
!cat olympics.csv

In [None]:
df = pd.read_csv('olympics.csv')
df.head()

In [None]:
df = pd.read_csv('olympics.csv', index_col = 0, skiprows = 1)
df.head()

In [None]:
df.columns

In [None]:
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns = {col:'Gold' + col[4:]}, inplace = True)
    if col[:2] == '02':
        df.rename(columns = {col:'Silver' + col[4:]}, inplace = True)
    if col[:2] == '03':
        df.rename(columns = {col:'Bronze' + col[4:]}, inplace = True)
    if col[:1] == '№':
        df.rename(columns = {col:'#' + col[1:]}, inplace = True)

df.head()

# Querying a DataFrame

In [None]:
df['Gold'] > 0

In [None]:
only_gold = df.where(df['Gold'] > 0)
only_gold.head()

In [None]:
only_gold['Gold'].count()

In [None]:
df['Gold'].count()

In [None]:
only_gold = only_gold.dropna()
only_gold.head()

In [None]:
only_gold = df[df['Gold'] > 0]
only_gold.head()
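
 - The cells above depend on `olympics.csv`. The same two filtering styles can be sketched on a tiny made-up table: `where` keeps the shape and fills non-matching rows with `NaN`, while indexing with the boolean mask removes those rows. The frame `medals` and its values are invented for illustration.

```python
import pandas as pd

# Made-up medal counts, not taken from olympics.csv
medals = pd.DataFrame({'Gold': [0, 3, 1], 'Silver': [2, 0, 1]},
                      index=['A', 'B', 'C'])

mask = medals['Gold'] > 0

kept_shape = medals.where(mask)   # same shape, row 'A' becomes NaN
filtered = medals[mask]           # row 'A' is removed

print(kept_shape['Gold'].count())   # count() ignores NaN
print(filtered.index.tolist())

# Each comparison needs its own parentheses when combined with | or &
either = medals[(medals['Gold'] > 0) | (medals['Silver'] > 0)]
print(len(either))
```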

In [None]:
len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])

In [None]:
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]

# Indexing DataFrames

In [None]:
df.head()

In [None]:
df['country'] = df.index
df = df.set_index('Gold')
df.head()

In [None]:
df = df.reset_index()
df.head()

In [None]:
df = pd.read_csv('census.csv')
df.head()

In [None]:
df['SUMLEV'].unique()

In [None]:
df = df[df['SUMLEV'] == 50]
df.head()

In [None]:
columns_to_keep = ['STNAME',
                  'CTYNAME',
                  'BIRTHS2010',
                  'BIRTHS2011',
                  'BIRTHS2012',
                  'BIRTHS2013',
                  'BIRTHS2014',
                  'BIRTHS2015',
                  'POPESTIMATE2010',
                  'POPESTIMATE2011',
                  'POPESTIMATE2012',
                  'POPESTIMATE2013',
                  'POPESTIMATE2014',
                  'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

In [None]:
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

In [None]:
df.loc['Michigan', 'Washtenaw County']

In [None]:
df.loc[[('Michigan', 'Washtenaw County'),
        ('Michigan', 'Wayne County')]]
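
 - The cells above depend on `census.csv`. As a sketch of the same two-level lookup on a tiny made-up table: the frame `births` and its numbers are invented for illustration, and only the column names mirror the census data.

```python
import pandas as pd

# Made-up numbers, not taken from census.csv
births = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                       'CTYNAME': ['Washtenaw County', 'Wayne County',
                                   'Franklin County'],
                       'BIRTHS2010': [10, 20, 30]})
births = births.set_index(['STNAME', 'CTYNAME'])

# Both index levels given: a single row comes back as a series
one = births.loc['Michigan', 'Washtenaw County']
print(one['BIRTHS2010'])

# A list of (state, county) tuples selects several rows as a DataFrame
two = births.loc[[('Michigan', 'Washtenaw County'),
                  ('Michigan', 'Wayne County')]]
print(two.shape)

# Only the outer level given: every county in that state
state = births.loc['Michigan']
print(state.index.tolist())
```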

# Missing values

In [None]:
df = pd.read_csv('log.csv')
df

In [None]:
df.fillna?

In [None]:
df = df.set_index('time')
df = df.sort_index()
df

In [None]:
df = df.reset_index()
df = df.set_index(['time', 'user'])
df

In [None]:
# fillna(method='ffill') is deprecated in recent pandas; ffill() does the same forward fill
df = df.ffill()
df.head()
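
 - The cells above depend on `log.csv`. This sketch shows the same sort-then-forward-fill pattern on a tiny made-up log, which makes it clear why sorting by time first matters: each gap is filled with the most recent earlier value. The frame `log` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up log entries, deliberately out of time order
log = pd.DataFrame({'time': [3, 1, 2, 4],
                    'volume': [np.nan, 10.0, np.nan, 5.0]})

log = log.set_index('time').sort_index()
filled = log.ffill()   # carry the last known value forward

print(log['volume'].tolist())
print(filled['volume'].tolist())
```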