### General Pandas References

1. https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
  - What do I need to know about the pandas index?: [#17, Part 1](https://www.youtube.com/watch?v=OYZNk7Z9s6I&index=17&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y), [#18, Part 2](https://www.youtube.com/watch?v=15q-is8P_H4&index=18&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y)
  - How do I avoid a SettingWithCopyWarning in pandas?: [#27](https://www.youtube.com/watch?v=4R4WsDJ-KVc)

### How to use the index



### Avoiding SettingWithCopyWarning in Pandas

Additional References: https://tomaugspurger.github.io/modern-1-intro


In [30]:
import numpy as np
import pandas as pd

movies = pd.read_csv('https://raw.githubusercontent.com/MichaelSzczepaniak/AllAboutPandas/master/datasets/imdb_1000.csv')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [5]:
# A good habit to get into is to look for NULL values.  Calling the
# isnull() method on the content_rating column returns another series:
movies['content_rating'].isnull().head()

0    False
1    False
2    False
3    False
4    False
Name: content_rating, dtype: bool

In R, columns are **vectors**.  In pandas, columns are **series**.  Boolean vectors and series are treated similarly in both languages in that:

+ TRUE (R) = True (Python) = 1
+ FALSE (R) = False (Python) = 0

So if we sum the values, we'll get the count of NULLs in the series:

In [7]:
movies.content_rating.isnull().sum()

3
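As a quick self-contained check of the "True counts as 1" idea, here is a minimal sketch on a made-up series. The `ratings` name and its values are illustrative assumptions, not taken from the movies data:

```python
import numpy as np
import pandas as pd

# Toy series with two missing values (illustrative data only)
ratings = pd.Series(['R', np.nan, 'PG', np.nan, 'G'])

is_missing = ratings.isnull()        # boolean series: True where the value is missing
null_count = int(is_missing.sum())   # True is summed as 1, False as 0

print(is_missing.tolist())
print(null_count)
```

The same pattern works for any boolean series, so it can be used to count rows that satisfy an arbitrary condition.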

Which rows are these NULL values in?  Since the `isnull()` method returns a boolean series, we can find the NULLs using:

In [9]:
movies[movies['content_rating'].isnull()]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
649,7.7,Where Eagles Dare,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"


Here the missing **content_rating** values are **NaN**, which pandas uses to designate missing values.  In R, we could get the unique values by doing `unique(movies$content_rating)`.  In pandas, we call the [unique method on the series object](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) like this:

In [10]:
movies['content_rating'].unique()

array(['R', 'PG-13', 'NOT RATED', 'PG', 'UNRATED', 'APPROVED', 'PASSED',
       'G', 'X', nan, 'TV-MA', 'GP', 'NC-17'], dtype=object)

If we wanted the counts of each rating, in R we'd do `table(movies$content_rating)`.  In pandas, we can call the `value_counts()` **method** on the series object like this:

In [12]:
movies['content_rating'].value_counts()

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
NC-17          7
PASSED         7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64
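Note that the counts above add up to 976, not 979, because `value_counts()` leaves out missing values by default. The sketch below, on a made-up series (the `ratings` data is an assumption for illustration), shows how the `dropna=False` argument might be used to include them:

```python
import numpy as np
import pandas as pd

# Toy series: three 'R', one 'PG', two missing (illustrative data only)
ratings = pd.Series(['R', 'R', 'PG', np.nan, 'R', np.nan])

counts_default = ratings.value_counts()                # missing values are left out
counts_with_nan = ratings.value_counts(dropna=False)   # missing values get their own row

print(counts_default)
print(counts_with_nan)
```

Comparing the two totals is a cheap way to spot missing values while looking at category counts.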

We want to treat all the _NOT RATED_ entries as missing values.  Which rows are the _NOT RATED_ values in?

In [15]:
movies[movies['content_rating'] == 'NOT RATED'].head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
5,8.9,12 Angry Men,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
6,8.9,"The Good, the Bad and the Ugly",NOT RATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ..."
41,8.5,Sunset Blvd.,NOT RATED,Drama,110,"[u'William Holden', u'Gloria Swanson', u'Erich..."
63,8.4,M,NOT RATED,Crime,99,"[u'Peter Lorre', u'Ellen Widmann', u'Inge Land..."
66,8.4,Munna Bhai M.B.B.S.,NOT RATED,Comedy,156,"[u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']"


Next, we select the _NOT RATED_ entries and try to overwrite them with `numpy.nan`.  Written this way, the assignment generates a `SettingWithCopyWarning`:

In [17]:
movies[movies['content_rating'] == 'NOT RATED']['content_rating'] = np.nan  # ~4:35 into the video

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


This isn't an error.  It's a warning.  But we don't know whether the _NOT RATED_ values were actually replaced with `NaN`, so we need to check whether the operation was done:

In [18]:
movies['content_rating'].isnull().sum()

3

We would have expected this to be 68 (the 3 original missing values plus the 65 _NOT RATED_ entries) if the operation had completed as we expected.  Since we got our original result (3), it did not.  This check is important because under other circumstances, the operation **will** complete.

Notice the hint in the warning message: `Try using .loc[row_indexer,col_indexer] = value instead`
Let's try taking this advice and seeing what happens:

In [20]:
movies.loc[movies['content_rating'] == 'NOT RATED', 'content_rating'] = np.nan
movies['content_rating'].isnull().sum()

68

Looks like it worked.  For a good reference on how `.loc` works, [try this link](https://www.youtube.com/watch?v=xvpNA7bC8cs).

### What is going on with this warning?

(~6:40 into [the video](https://www.youtube.com/watch?v=4R4WsDJ-KVc)) Start by realizing that the line that generated the warning:  

`movies[movies['content_rating'] == 'NOT RATED']['content_rating'] = np.nan`  

is actually **2** operations.

The first part:  `movies[movies['content_rating'] == 'NOT RATED']`  
is a _**get** item_ operation.  

The second part:  `...['content_rating'] = np.nan` or `....content_rating = np.nan`  
is a _**set** item_ operation.

The problem is that pandas can't guarantee whether the **get** operation returned a **view** or a **copy** of the data.  If a **view** was returned, the set operation would affect the original dataframe.  If a **copy** was returned, it would not affect the original dataframe (because it would change the copy instead).  The warning is generated because pandas is not sure which of these happened, so it warns the user about the uncertainty.
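The two-step behavior can be reproduced on a tiny dataframe. This is a minimal sketch under stated assumptions: the `df` data is made up, and the warnings are captured rather than asserted on because the warning category depends on the installed pandas version.

```python
import warnings

import numpy as np
import pandas as pd

# Toy dataframe (illustrative data only)
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'content_rating': ['R', 'NOT RATED', 'NOT RATED'],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # Step 1 (get): df[mask] builds a new, temporary object.
    # Step 2 (set): the assignment lands on that temporary object.
    df[df['content_rating'] == 'NOT RATED']['content_rating'] = np.nan

# The category name differs between pandas versions, so just show what was raised
print([w.category.__name__ for w in caught])

# Count missing values in the original dataframe after the chained assignment
nulls_after = int(df['content_rating'].isnull().sum())
print(nulls_after)
```

With a boolean mask the get step returns a copy, so the original `df` is left untouched, which matches what we saw with the movies data.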

### Why did .loc fix this issue?

It did so by turning the line that generated the warning from 2 operations (get and set) into a single **set** operation.  The moral of this story is: If you are trying to select rows and columns in the **same** line of code, use **.loc** (an indexer that takes square brackets, not a method call).
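For comparison, here is a minimal sketch of the single set operation on a small made-up dataframe (the `df` data and the `mask` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe (illustrative data only)
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'content_rating': ['R', 'NOT RATED', 'NOT RATED'],
})

mask = df['content_rating'] == 'NOT RATED'   # row indexer
df.loc[mask, 'content_rating'] = np.nan      # one set operation, applied to df itself

nulls_after = int(df['content_rating'].isnull().sum())
print(nulls_after)
```

Because the row indexer and the column indexer are passed together, there is no intermediate object for the assignment to get lost in.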

This is one of the ways to deal with this warning, but it is not the only way.  As mentioned earlier, this warning can be generated from a variety of circumstances.  Let's look at another way this warning comes up.

In [24]:
top_movies = movies.loc[movies['star_rating'] >= 9, :]  # rows with star_rating >= 9, all the columns (:)

In [26]:
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


Say we want to change a value in this dataframe.  For example, change duration from 142 to 150 for *The Shawshank Redemption*.

In [27]:
top_movies.loc[0, 'duration'] = 150

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


There it is again!  But as we'll see, it is being generated for a different reason this time.  As before, let's check to see if the operation completed as expected.

In [28]:
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


Unlike the earlier example, the operation **did** complete this time.

### What is causing the warning to be generated this time?

Similar to the explanation in the previous example, pandas was not sure whether **top_movies** is a **copy** of the _movies_ dataframe (which it is) or a **view** on the _movies_ dataframe.

### So how do we fix this situation?

The line that triggered the warning this time, `top_movies.loc[0, 'duration'] = 150`, was **not** the one actually causing the problem.  The cause is the earlier line that created _top_movies_ from _movies_.  To fix this, we need to be explicit that we are operating on a **copy** of the _movies_ dataframe, and we can do that by using the **.copy()** method.

In [29]:
top_movies = movies.loc[movies['star_rating'] >= 9, :].copy()
top_movies.loc[0, 'duration'] = 150
top_movies  # no warning this time!

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
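To confirm that an explicit copy is independent of its source, here is a minimal sketch on a made-up dataframe (the `df` and `top` names and their values are illustrative assumptions):

```python
import pandas as pd

# Toy dataframe (illustrative data only)
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'star_rating': [9.3, 9.2, 8.1],
    'duration': [142, 175, 120],
})

# Explicit copy: later edits to `top` are clearly meant for `top` only
top = df.loc[df['star_rating'] >= 9, :].copy()
top.loc[0, 'duration'] = 150

print(top.loc[0, 'duration'])   # value in the copy
print(df.loc[0, 'duration'])    # value in the original dataframe
```

The change shows up only in `top`; the source dataframe keeps its original value.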


### Summary: How to deal with SettingWithCopyWarning

1. If you are trying to select rows and columns in the **same** line of code, use **.loc**.
2. If you want to operate on a copy of a dataframe, explicitly use the **.copy()** method to create the dataframe that you will be manipulating.