# Merging Dataframes
-----------

- We use the DataFrame of store purchases from one of our previous lectures, where the index is a list of stores and the columns hold purchase data.

In [None]:
import pandas as pd

df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                  {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                  {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                 index=['Store 1', 'Store 1', 'Store 2'])
df

- If we want to add a new column called Date to the DataFrame, we just use the square bracket operator directly on the DataFrame, as long as the new column has as many values as the DataFrame has rows.

In [None]:
df['Date'] = ['December 1', 'January 1', 'mid-May']
df

- If we want to add some new field, maybe a delivery flag, that's easy too, since it's a scalar value that pandas repeats for every row.

In [None]:
df['Delivered'] = True
df

- The problem comes when we have only a few items to add. For this to work, we have to supply pandas with a list that is as long as the DataFrame, so that each row can be populated. This means that we have to enter the `None` values ourselves.
- If each of our rows has a unique index, then we could instead just assign a Series to the new column identifier.

In [None]:
df['Feedback'] = ['Positive', None, 'Negative']
df

- For instance, if we reset the index in this example so the DataFrame is labeled from 0 through 2, and then create a new Series with these labels, we can apply it.
- The nice aspect of this approach is that we could just ignore the items that we don't know about, and pandas will put missing values in for us. 

In [None]:
adf = df.reset_index()
adf['Date'] = pd.Series({0: 'December 1', 2: 'mid-May'})
adf

- Now more commonly, we want to join two larger DataFrames together, and this is a bit more complex. 
- A Venn diagram is traditionally used to show set membership. For example, the circle on the left is the population of students at a university. The circle on the right is the population of staff at a university. And the overlapping region in the middle is all of those students who are also staff.
- We could think of these two populations as indices in separate DataFrames, maybe with the label of Person Name.

$~$

- When we want to join the DataFrames together, we have some choices to make.
 - First, what if we want a list of all the people, regardless of whether they're staff or student, and all of the information we can get on them? In database terminology, this is called a **full outer join**. And in set theory, it's called a **union**. In the Venn diagram, it represents everyone in any circle.
 - It's quite possible though that we only want those people who we have maximum information for, those people who are both staff and students. In database terminology, this is called an **inner join**. Or in set theory, the **intersection**. And this is represented in the Venn diagram as the overlapping parts of each circle.
 
$~$

- An example:

In [None]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liaison'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())

- If we want the union of these, we would call merge passing in the DataFrame on the left and the DataFrame on the right, and telling merge that we want it to use an **outer** join. 
- We tell merge that we want to use the left and right indices as the joining columns. 
- We can see everyone is listed in this new DataFrame. Since Mike does not have a role, and Kelly does not have a school, those cells are listed as missing values.

In [None]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

- If we wanted to get the intersection, that is, just those students who are also staff, we could set the `how` parameter to **inner**. And we see that the resulting DataFrame has only James and Sally in it.

In [None]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

- Now there are two other common use cases when merging DataFrames. Both are examples of what we would call set addition. 
 - The first is when we would want to get a list of all staff regardless of whether they were students or not. But if they were students, we would want to get their student details as well. To do this we would use a **left** join. 

In [None]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

- Next, we want a list of all of the students and their roles if they were also staff. To do this we would do a **right** join.

In [None]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
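
- As a compact recap of the four join types, here is a minimal sketch that rebuilds the staff and student frames under the shorter names `staff` and `students` (names chosen only for this illustration) and records which people survive each value of `how`.

```python
import pandas as pd

# Same people as in the lecture example, indexed by name.
staff = pd.DataFrame({'Role': ['Director of HR', 'Course liaison', 'Grader']},
                     index=pd.Index(['Kelly', 'Sally', 'James'], name='Name'))
students = pd.DataFrame({'School': ['Business', 'Law', 'Engineering']},
                        index=pd.Index(['James', 'Mike', 'Sally'], name='Name'))

# Collect the set of names kept by each kind of join.
names_by_join = {}
for how in ['outer', 'inner', 'left', 'right']:
    merged = pd.merge(staff, students, how=how,
                      left_index=True, right_index=True)
    names_by_join[how] = set(merged.index)
    print(how, sorted(merged.index))
```

- The outer join keeps everyone, the inner join keeps only the overlap, and the left and right joins keep every name from the staff and student frames respectively.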

- The merge method has a couple of other interesting parameters. 
 - First, you don't need to use indices to join on, you can use columns as well. Here's an example:

In [None]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

- So what happens when we have conflicts between the DataFrames? Let's take a look by creating new staff and student DataFrames that have location information added to them. In the staff DataFrame, this is an office location where we can find the staff person. And we can see the Director of HR is on State Street, while the two students are on Washington Avenue. But for the student DataFrame, the location information is actually their home address.


In [None]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liaison', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

- The merge function preserves this information, but appends an `_x` or `_y` to the conflicting column names to show which DataFrame each column came from. The `_x` is always the left DataFrame information, and the `_y` is always the right DataFrame information. You can control the names of `_x` and `_y` with the `suffixes` parameter if you want to.
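
- Here is a minimal sketch of controlling those names with the `suffixes` parameter of `pd.merge`. The suffix strings `_office` and `_home` are labels picked for this illustration, and the frames are cut-down versions of the ones above.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Kelly', 'Sally'],
                      'Location': ['State Street', 'Washington Avenue']})
students = pd.DataFrame({'Name': ['Sally', 'Mike'],
                         'Location': ['512 Wilson Crescent', 'Fraternity House #22']})

# Both frames have a 'Location' column, so merge must rename them.
# suffixes=(left, right) replaces the default ('_x', '_y').
merged = pd.merge(staff, students, how='left', on='Name',
                  suffixes=('_office', '_home'))
print(merged.columns.tolist())
```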

- Let's talk about multi-indexing and multiple columns.
- It's quite possible that the first name for students and staff might overlap, but the last name might not. In this case, we pass a list of the columns that should be used as join keys to the `left_on` and `right_on` parameters.

In [None]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liaison'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
print(staff_df)
print()
print(student_df)
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
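
- When the key columns have the same names in both DataFrames, as they do here, the `on` parameter of `pd.merge` is a shorter way to say the same thing. This is a small sketch on cut-down frames, not part of the original lecture code.

```python
import pandas as pd

staff = pd.DataFrame({'First Name': ['Sally', 'James'],
                      'Last Name': ['Brooks', 'Wilde'],
                      'Role': ['Course liaison', 'Grader']})
students = pd.DataFrame({'First Name': ['James', 'Sally'],
                         'Last Name': ['Hammond', 'Brooks'],
                         'School': ['Business', 'Engineering']})

# A row matches only if BOTH the first name and the last name agree.
both = pd.merge(staff, students, how='inner', on=['First Name', 'Last Name'])
print(both)
```

- James Wilde and James Hammond share a first name but not a last name, so only Sally Brooks appears in the result.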

-------
# Idiomatic Pandas: Making Code Pandorable
-------
- The best Python solutions to problems are celebrated as Idiomatic Python.
- An idiomatic solution is often one which has both high performance and high readability. 
- Let's see a couple of key features of how you can make your code pandorable.

- The first of these is called method chaining. It should not be confused with chain indexing.
- Chain indexing:
 - `df.loc['Washtenaw']['Total Population']`
 - This is generally a bad practice, because pandas could return a copy or a view depending upon numpy.
 - Code smell: if you see a `][`, you should think carefully about what you are doing!
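
- A minimal sketch of the alternative to chain indexing: pass the row and column labels to a single `.loc` call. The population numbers here are placeholders made up for the illustration.

```python
import pandas as pd

# Placeholder values, not real census figures.
df = pd.DataFrame({'Total Population': [100, 200]},
                  index=['Washtenaw', 'Wayne'])

# Chained indexing: two separate lookups, one after the other.
chained = df.loc['Washtenaw']['Total Population']

# Single .loc call: one lookup with both the row and the column label.
direct = df.loc['Washtenaw', 'Total Population']

# Assigning through a single .loc call modifies df itself.
df.loc['Washtenaw', 'Total Population'] = 150
print(chained, direct, df.loc['Washtenaw', 'Total Population'])
```

- Reading works either way, but the single `.loc` form is the one that is safe to use for assignment.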

- Method chaining though, is a little bit different. The general idea behind method chaining is that every method on an object returns a reference to that object. The beauty of this is that you can condense many different operations on a DataFrame, for instance, into one line or at least one statement of code. 

In [None]:
import pandas as pd
df = pd.read_csv('census.csv')
df

- Here's an example of two pieces of code in pandas using our census data. The first is the pandorable way to write the code with method chaining. 
- You can see that we first run a `where` query, then a `dropna`, then a `set_index`, and then a `rename`.

In [None]:
(df.where(df['SUMLEV']==50)
    .dropna()
    .set_index(['STNAME','CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

- The second example is a more traditional way of writing code.

In [None]:
df = df[df['SUMLEV']==50]
df.set_index(['STNAME','CTYNAME'], inplace=True)
df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

- Here's another pandas idiom. Python has a wonderful function called map, which is sort of a basis for functional programming in the language. When you want to use map in Python, you pass it some function you want called, and some iterable, like a list, that you want the function to be applied to. The results are that the function is called against each item in the list, and there's a resulting list of all of the evaluations of that function. 
- Pandas has a similar function called applymap. In applymap, you provide some function which should operate on each cell on a DataFrame, and the return set is itself a DataFrame. 
- Now I think applymap is fine, but I actually rarely use it. Instead, I find myself often wanting to map across all of the rows in a DataFrame. And pandas has a function that I use heavily there, called apply. ( Dr. Brook said! :D )

- First, we need to write a function which takes in a particular row of data, finds the minimum and maximum values, and returns a new row of data.

In [None]:
import numpy as np
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

- Then we just need to call `apply` on the DataFrame. `apply` takes the function and the axis on which to operate as parameters. Now, we have to be a bit careful: we've talked about axis zero being the rows of the DataFrame in the past. But this parameter is really the **parameter of the index to use**. To apply the function across all rows, so that each row is passed in with the column labels as its index, we pass `axis=1`.

In [None]:
df.apply(min_max, axis=1)
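
- To make the `axis` parameter concrete, here is a small sketch on a toy DataFrame (the labels `x`, `y`, `r1` and `r2` are made up). With `axis=0` the function receives each column, and with `axis=1` it receives each row.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [30, 40]}, index=['r1', 'r2'])

# axis=0: the function is called once per column; the result is indexed by column name.
per_column = df.apply(lambda s: s.max(), axis=0)

# axis=1: the function is called once per row; the result is indexed by row label.
per_row = df.apply(lambda s: s.max(), axis=1)

print(per_column)
print(per_row)
```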

- If you're doing this as part of data cleaning, you're likely to find yourself wanting to add new data to the existing DataFrame. In that case you just take the row values and add in new columns indicating the maximum and minimum values.

In [None]:
import numpy as np
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row
df.apply(min_max, axis=1)

In [None]:
rows = ['POPESTIMATE2010',
        'POPESTIMATE2011',
        'POPESTIMATE2012',
        'POPESTIMATE2013',
        'POPESTIMATE2014',
        'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)
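
- For a simple reduction like this one, a built-in DataFrame method with `axis=1` gives the same answer without a lambda. This is an alternative that the lecture does not show, sketched on made-up numbers with only three of the estimate columns.

```python
import pandas as pd
import numpy as np

# Made-up values standing in for the census estimates.
df = pd.DataFrame({'POPESTIMATE2010': [100, 250],
                   'POPESTIMATE2011': [120, 240],
                   'POPESTIMATE2012': [110, 260]})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

# Row-wise maximum with apply and a lambda, as in the cell above.
via_apply = df.apply(lambda x: np.max(x[cols]), axis=1)

# The same row-wise maximum using the DataFrame's own max method.
via_max = df[cols].max(axis=1)

print(via_apply.tolist(), via_max.tolist())
```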

-------
# Group by
-------

- We've seen that even though pandas allows us to iterate over every row in a DataFrame, this is generally a slow way to accomplish a given task and it's not very pandorable.
- For instance, suppose we wanted to write some code to iterate over all of the states and generate a list of the average census population numbers. We could do so using a loop over the result of the `unique` function.
- Another option is to use the DataFrame `groupby` function. This function takes some column name or names and splits the DataFrame up into chunks based on those names. It returns a DataFrame groupby object, which can be iterated over. Each iteration returns a tuple where the first item is the group condition, and the second item is the DataFrame reduced by that grouping.

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
df

In [None]:
%%timeit -n 10
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

In [None]:
%%timeit -n 10
for group, col in df.groupby('STNAME'):
    avg = np.average(col['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))

- You can actually provide a function to `groupby` as well and use that to segment your data. This is a bit of a fabricated example, but let's say that you have a big batch job with lots of processing and you want to work on only a third or so of the states at a given time.
- We could create some function which returns a number between zero and two based on the first character of the state name. Then we can tell group by to use this function to split up our data frame. It's important to note that in order to do this you need to set the index of the data frame to be the column that you want to group by first. 

In [None]:
df.head()

- Here's an example. We'll create some new function called `fun`. If the first letter of the parameter comes before a capital `M` in the alphabet, we'll return a 0. If it comes before a capital `Q`, we'll return a 1, and otherwise we'll return a 2.
- Then we'll pass this function to the data frame `groupby`, and see that the data frame is segmented by the calculated group number.

$~$

- This kind of technique, which is a sort of lightweight hashing, is commonly used to distribute tasks across multiple workers, whether they are cores in a processor, nodes in a supercomputer, or disks in a database.


In [None]:
df = df.set_index('STNAME')

def fun(item):
    if item[0]<'M':
        return 0
    if item[0]<'Q':
        return 1
    return 2

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')


- A common workflow with `groupby` is that you split your data, you apply some function, then you combine the results. This is called the **split-apply-combine** pattern.
- We've seen the splitting method, but what about apply? Certainly iterative methods as we've seen can do this, but the `groupby` object also has a method called `agg`, which is short for aggregate. This method applies a function to the column or columns of data in the group, and returns the results.
- With `agg`, you simply pass in a dictionary of the column names that you're interested in, and the function that you want to apply.
- For instance, to build a summary DataFrame for the average populations per state, we could just give `agg` a dictionary with the `CENSUS2010POP` key and the numpy `average` function.

In [None]:
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]

In [None]:
df.groupby('STNAME').agg({'CENSUS2010POP': np.average})

- You see, when you pass in a dictionary to `agg`, it can be used either to identify the columns to apply a function on or to name an output column if there are multiple functions to be run. The difference depends on the keys that you pass in from the dictionary and how they're named.
- In short, while much of the documentation and examples will talk about a single groupby object, there are really two different objects: the DataFrame groupby and the Series groupby. And these objects behave a little bit differently with aggregate.

In [None]:
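# Select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas.
print(type(df.groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']]))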
print(type(df.groupby(level=0)['POPESTIMATE2010']))

In [None]:
df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg([np.average, np.sum])

- We can do the same thing with a DataFrame instead of a Series. We set the index to be the state name, we group by the index, and we project two columns: the population estimate in 2010 and the population estimate in 2011.
- When we call aggregate with a list of two functions, it builds a nice hierarchical column space and all of our functions are applied.

In [None]:
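# Select the two columns with a list (double brackets); a bare tuple of names is rejected by recent pandas.
(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']]
    .agg([np.average, np.sum]))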
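
- If the goal is to name the output columns yourself, recent versions of pandas (0.25 and later, as far as I know) support keyword arguments to `agg`, usually called named aggregation. This is a sketch on made-up numbers; `avg_pop` and `total_pop` are output names chosen for the illustration.

```python
import pandas as pd

# Made-up county populations for two placeholder states.
df = pd.DataFrame({'STNAME': ['A', 'A', 'B'],
                   'CENSUS2010POP': [10, 30, 50]})

# Each keyword is an output column name; each value is (input column, function).
summary = df.groupby('STNAME').agg(avg_pop=('CENSUS2010POP', 'mean'),
                                   total_pop=('CENSUS2010POP', 'sum'))
print(summary)
```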

------
# Scales
-----

- As a data scientist, there are four scales that it's worth knowing.
 - The first is a **ratio scale**. In the ratio scale the measurement units are equally spaced, and mathematical operations such as subtraction, division, and multiplication are all valid. Good examples of ratio scale measurements might be height and weight.
 - The next scale is the **interval scale**. In the interval scale the measurement units are equally spaced, like the ratio scale, but there's no clear absence of value. That is, there isn't a true zero, and so operations such as multiplication and division are not valid. An example of the interval scale might be temperature measured in Celsius or Fahrenheit, since there's never an absence of temperature and 0 degrees is a meaningful value of temperature.
 - The next scale is the **ordinal scale**. In the ordinal scale the order of values is important, but the differences between the values are not equally spaced. Letter grades such as A+ and A are a good example. Ordinal data is very common in machine learning and can sometimes be a challenge to work with.
 - The last scale is the **nominal scale**, which is often just called **categorical data**. Here the names of teams in a sport might be a good example. There are a limited number of teams, but changing their order or applying mathematical functions to them is meaningless. Categorical values are very common, and we generally refer to categories where there are only two possible values as **binary**.

- Pandas has a number of interesting functions to deal with converting between measurement scales. 
- Let's start first with **nominal data**, which in pandas is called **categorical data**. Pandas actually has a built-in type for categorical data, and you can set a column of your data to categorical data by using the `astype` method.
- `astype` tries to change the underlying type of your data, in this case to category data. You can further change this to ordinal data by passing in an `ordered` flag set to `True` and passing in the categories in an ordered fashion.

In [None]:
import pandas as pd
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df

- Now when we instruct pandas to render this as categorical data, we see that the `dtype` has been set as category and that there are 11 different categories.

In [None]:
df['Grades'].astype('category').head()

- If we want to indicate to pandas that this data is in a logical order, we pass the `ordered=True` flag, and we see the order reflected in the category `dtype` using the less than sign.

In [None]:
from pandas.api.types import CategoricalDtype
grades = df['Grades'].astype(CategoricalDtype(
                             categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                             ordered=True))
grades.head()

- What can you do with this? Well, ordinal data has ordering, so it can help you with Boolean masking. For instance, suppose we have our list of grades and we compare them with a C. If we did this lexicographically, we would see that a C+ and a C- are both actually greater than a C. With the ordered categorical, the comparison follows the grade order instead.

In [None]:
grades > 'C'
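
- To see the difference side by side, here is a small sketch that compares the same values once as plain strings and once as an ordered categorical. It uses a shortened grade list chosen for the illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

raw = pd.Series(['A', 'C+', 'C', 'C-', 'D'])

# Categories listed from worst to best, with ordered=True.
order = ['D', 'C-', 'C', 'C+', 'A']
grades = raw.astype(CategoricalDtype(categories=order, ordered=True))

# Plain strings compare character by character.
lexical = raw > 'C'
# Ordered categoricals compare by position in the category list.
ordinal = grades > 'C'

print(lexical.tolist())
print(ordinal.tolist())
```

- The string comparison marks C+, C- and D as greater than C, while the ordered comparison marks only A and C+.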

- Sometimes it's useful to represent categorical values as each being a column with a true or a false as to whether that category applies. This is especially common in feature extraction, which is a topic in the third course in this specialization.
- Variables with a **Boolean value** are typically called **dummy variables**. And pandas has a built-in function called `get_dummies`, which will convert the values of a single column into multiple columns of 0's and 1's, indicating the presence of a dummy variable.
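
- A minimal sketch of `pd.get_dummies` on a made-up column of team names. Depending on the pandas version, the cells are shown as 0/1 or as False/True, so the example converts them to integers before printing.

```python
import pandas as pd

# Made-up nominal data: one team name per row.
df = pd.DataFrame({'Team': ['Red', 'Blue', 'Red']})

# One new column per distinct value in 'Team'.
dummies = pd.get_dummies(df['Team'])

print(dummies.columns.tolist())
print(dummies.astype(int))
```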

- There's one more function on scales that I'd like to talk about. And that's on reducing a value which is on the interval or ratio scale, like a number grade, into one that is categorical, like a letter grade. Now, this might seem a bit counterintuitive to you, since you're losing information about the value. But it's useful in a couple of places.
 - First, if you're visualizing the frequencies of categories, this can be an extremely useful approach, and histograms are regularly used with converted interval or ratio data.
 - Second, if you're using a machine learning classification approach on data, then you need to be using categorical data. So reducing dimensionality is useful there too.

$~$

- Pandas has a function called `cut`, which takes as an argument some array-like structure, such as a column of a DataFrame or a Series. It also takes a number of bins to be used, and all bins are kept at equal spacing.

$~$

- Let's go back to our census data for an example. We saw that we could group by state, then aggregate to get a list of the average county size by state. If we further apply `cut` to this with, say, 10 bins, we can see the states listed as categoricals using the average county size.

In [None]:
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg([np.average])
pd.cut(df['average'],10)

- Another example of `cut`: suppose we have a Series that holds height data for jacket wearers. We use `pd.cut` to bin this data into 3 bins.

In [None]:
s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])
pd.cut(s, 3)
pd.cut(s, 3, labels=['Small', 'Medium', 'Large'])
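
- To check what the binning did, one option is to ask `pd.cut` for the bin edges with `retbins=True` and then count how many heights fall under each label. This is a sketch built on the same height Series.

```python
import pandas as pd

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# retbins=True returns the computed bin edges alongside the binned values.
binned, edges = pd.cut(s, 3, labels=['Small', 'Medium', 'Large'], retbins=True)
print(edges)

# Count how many heights fall under each label.
n_small = int((binned == 'Small').sum())
n_medium = int((binned == 'Medium').sum())
n_large = int((binned == 'Large').sum())
print(n_small, n_medium, n_large)
```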

-----
# Pivot Tables
-----
- A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of aggregation functions.
- A pivot table is itself a DataFrame, where the rows represent one variable that you're interested in, the columns another, and the cells some aggregate value.
- A pivot table also tends to include marginal values, which are the sums for each column and row. This allows you to see the relationship between two variables at just a glance.

- Here we'll load a new data set, `cars.csv`. This data set comes from the Open Data Initiative of the Canadian government, and has information about the efficiency of different electric cars which are available for purchase.

In [None]:
df = pd.read_csv('cars.csv')

- When we look at the head of the data frame, we'll see that there are model years, vendors, sizes of cars, and statistics, like how big the battery is in kilowatt hours. 

In [None]:
df.head()

- A pivot table allows us to pivot out one of these columns into new column headers and compare it against another column as row indices. For instance, let's say we wanted to compare the makes of electric vehicles versus the years, and that we wanted to do this comparison in terms of battery capacity.
- To do this, we tell pandas we want the values to be kilowatts, the index to be the year, and the columns to be the make. Then we specify the aggregation function, and here we'll use the NumPy mean.

In [None]:
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)

- Here are the results. We see there are NaN values for vendors who didn't have an entry in a given year, like Ford in 2012. And we see that most vendors don't have a change in battery capacity over the years, except for Tesla, as they've introduced several new models.

- Now, pivot tables aren't limited to one function that you might want to apply. You can pass `aggfunc` a list of the different functions to apply, and pandas will provide you with the result using hierarchical column names. Here, I'll also pass `margins=True`. You can see that for each of the functions there's now an `All` category, which shows the overall mean and the minimum values for a given year and a given vendor.

In [None]:
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=[np.mean,np.min], margins=True)
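
- Since `cars.csv` is not reproduced here, this is a self-contained sketch of the same idea on a tiny made-up table. The makes `A` and `B`, the column name `kW`, and all the numbers are invented for the illustration.

```python
import pandas as pd

# Invented data: two makes over two years.
df = pd.DataFrame({'YEAR': [2012, 2012, 2013, 2013],
                   'Make': ['A', 'B', 'A', 'A'],
                   'kW': [100, 80, 100, 120]})

# Rows are years, columns are makes, cells are the mean kW.
# margins=True adds an 'All' row and an 'All' column.
table = df.pivot_table(values='kW', index='YEAR', columns='Make',
                       aggfunc='mean', margins=True)
print(table)
```

- Make `B` has no entry in 2013, so that cell is NaN, and the `All` margins hold the mean over each whole row or column.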

-----
# Date Functionality in Pandas
-----
- Pandas has four main time-related classes: `Timestamp`, `DatetimeIndex`, `Period`, and `PeriodIndex`.

In [None]:
import pandas as pd
import numpy as np

### `Timestamp`
----
- Timestamp represents a single timestamp and associates values with points in time. 
- Timestamp is interchangeable with Python's `datetime` in most cases. 

In [None]:
pd.Timestamp('9/1/2016 10:05AM')
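
- A short sketch of what "interchangeable with Python's `datetime`" means in practice: a `Timestamp` compares equal to the matching `datetime`, is itself an instance of `datetime`, and can be converted back with `to_pydatetime`.

```python
from datetime import datetime
import pandas as pd

ts = pd.Timestamp('2016-09-01 10:05')
dt = datetime(2016, 9, 1, 10, 5)

# A Timestamp compares equal to the equivalent datetime.
same_moment = (ts == dt)

# Timestamp is a subclass of datetime, so it works where a datetime is expected.
is_datetime = isinstance(ts, datetime)

# Convert back to a plain datetime object when needed.
plain = ts.to_pydatetime()
print(same_moment, is_datetime, plain)
```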

### `Period`
------
- Suppose we weren't interested in a specific point in time, and instead wanted a span of time. This is where `Period` comes into play.
- `Period` represents a single time span, such as a specific day or month. 


In [None]:
pd.Period('1/2016')

In [None]:
pd.Period('3/5/2016')
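
- Because a `Period` is a span rather than a point, it has a start and an end, and adding an integer moves it forward by whole spans. This is a small sketch using ISO-style date strings.

```python
import pandas as pd

month = pd.Period('2016-01')      # a whole month
day = pd.Period('2016-03-05')     # a whole day

# The granularity is inferred from the string.
print(month.freqstr, day.freqstr)

# A period knows where it starts and ends.
print(month.start_time, month.end_time)

# Adding 1 moves forward by one period of the same length.
next_month = month + 1
next_day = day + 1
print(next_month, next_day)
```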

### `DatetimeIndex`
-----
- An index made up of `Timestamp` objects is a `DatetimeIndex`. Let's look at a quick example.

In [None]:
t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), pd.Timestamp('2016-09-03')])
t1

In [None]:
type(t1.index)

### `PeriodIndex`
-----
- Similarly, an index made up of `Period` objects is a `PeriodIndex`.

In [None]:
t2 = pd.Series(list('def'), [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2

In [None]:
type(t2.index)

### Converting to Datetime
------
- Suppose we have a list of dates as strings. 
- If we create a DataFrame using these dates as the index, along with some randomly generated data, this is the DataFrame we get.
- Looking at the index we can see that it’s pretty messy and the dates are all in different formats. 

In [None]:
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
ts3 = pd.DataFrame(np.random.randint(10, 100, (4,2)), index=d1, columns=list('ab'))
ts3

- Using pandas `to_datetime`, pandas will try to convert these to Datetime and put them in a standard format. 

In [None]:
ts3.index = pd.to_datetime(ts3.index)
ts3

- `to_datetime` also has options to change the date parse order. For example, we can pass in the argument `dayfirst = True` to parse the date in European date format. 

In [None]:
pd.to_datetime('4.7.12', dayfirst=True)
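
- When the layout of the date strings is known in advance, another option is to spell it out with the `format` parameter of `to_datetime`, which leaves nothing to guess. This sketch assumes the string means day, month, then two-digit year.

```python
import pandas as pd

# Explicit format: day.month.two-digit-year.
explicit = pd.to_datetime('4.7.12', format='%d.%m.%y')

# Letting pandas parse the string, with the day taken first.
day_first = pd.to_datetime('4.7.12', dayfirst=True)

print(explicit, day_first)
```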

### Timedeltas
-----
- Timedeltas are differences in times. We can see that when we take the difference between September 3rd and September 1st, we get a Timedelta of two days. 

In [None]:
pd.Timestamp('9/3/2016')-pd.Timestamp('9/1/2016')

- We can also do something like find what the date and time is for 12 days and three hours past September 2nd, at 8:10 AM. 

In [None]:
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')
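
- A `Timedelta` can also be built from keyword arguments instead of a string, which some readers find easier to read. This sketch repeats the calculation above in that form.

```python
import pandas as pd

start = pd.Timestamp('2016-09-02 08:10')

# 12 days and 3 hours, written with keyword arguments.
delta = pd.Timedelta(days=12, hours=3)

result = start + delta
print(result)

# Subtracting the two timestamps gives the same Timedelta back.
print(result - start)
```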

### Working with Dates in a DataFrame
------
- Let's look at a few tricks for working with dates in a DataFrame. 
- Suppose we want to look at nine measurements, taken bi-weekly, every Sunday, starting in October 2016. Using `date_range`, we can create this `DatetimeIndex`. 

In [None]:
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
dates

- Now, let's create a DataFrame using these dates, and some random data, and see what we can do with it. 

In [None]:
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                  'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df

- First, we can check what day of the week a specific date is. For example, here we can see that all the dates in our index are on a Sunday. 

In [None]:
df.index.day_name()

- We can use `diff` to find the difference between the values on consecutive dates.

In [None]:
df.diff()

- Suppose we wanted to know what the mean count is for each month in our DataFrame. 
- We can do this using `resample`. 

In [None]:
df.resample('M').mean()
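
- One way to see what a monthly mean does is to compute the same thing by hand: convert each date to its month with `to_period` and group on that. This is a sketch on four made-up values, not the random data above.

```python
import pandas as pd

# Four bi-weekly Sundays: three in October 2016, one in November 2016.
dates = pd.to_datetime(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13'])
s = pd.Series([10, 20, 30, 40], index=dates)

# Map every date to the month it falls in, then average within each month.
monthly = s.groupby(s.index.to_period('M')).mean()
print(monthly)
```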

- We can use partial string indexing to find values from a particular year, or from a particular month, or we can even slice on a range of dates.
- For example, here we select the values from 2017, then the values from December 2016, and finally only the values from December 2016 onwards.

In [None]:
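# Use .loc for row selection by date string; plain df['2017'] is treated as a column lookup in recent pandas.
df.loc['2017']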

In [None]:
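# Use .loc for row selection by date string; plain df['2016-12'] is treated as a column lookup in recent pandas.
df.loc['2016-12']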

In [None]:
df['2016-12':]

- Another cool thing we can do is change the frequency of our dates in our DataFrame using `asfreq`. 
- If we use this to change the frequency from bi-weekly to weekly, we'll end up with missing values every other week. So let's use the forward fill method on those missing values. 

In [None]:
df.asfreq('W', method='ffill')

- One last thing I wanted to briefly touch upon is plotting time series. 
- Importing `matplotlib.pyplot`, and using the IPython magic `%matplotlib inline`, will allow you to visualize the time series in the notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df.plot()