### Avoiding SettingWithCopyWarning in Pandas

References:

1. https://www.youtube.com/watch?v=4R4WsDJ-KVc

In [2]:
import numpy as np
import pandas as pd

movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [5]:
# A good habit to get into is to look for NULL values.  Calling the
# isnull() method on the content_rating column returns another series:
movies['content_rating'].isnull().head()

0    False
1    False
2    False
3    False
4    False
Name: content_rating, dtype: bool

In R, columns are **vectors**.  In pandas, columns are **series**.  Boolean vectors and series are treated similarly in both languages in that:

+ TRUE (R) = True (Python) = 1
+ FALSE (R) = False (Python) = 0

So if we sum the values, we get the count of NULLs in the series:

In [7]:
movies.content_rating.isnull().sum()

3
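
To make the counting trick concrete, here is a minimal sketch on a small made-up Series (the `ratings` data below is hypothetical and merely stands in for `movies['content_rating']`). It assumes nothing beyond `isnull()`, `sum()`, and `mean()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movies['content_rating']
ratings = pd.Series(['R', np.nan, 'PG', np.nan, 'G'], name='content_rating')

mask = ratings.isnull()      # boolean Series: True where the value is missing
n_missing = mask.sum()       # each True counts as 1, each False as 0
frac_missing = mask.mean()   # mean of 0/1 values = share of missing entries

print(n_missing)     # 2
print(frac_missing)  # 0.4
```

Because `True` counts as 1, taking the mean of the same boolean Series gives the proportion of missing values rather than the count.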

Which rows are these NULL values in?  Since the `isnull()` method returns a boolean series, we can find the NULLs using:

In [9]:
movies[movies['content_rating'].isnull()]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
649,7.7,Where Eagles Dare,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"


Here pandas uses **NaN** to designate missing values (the empty `content_rating` cells above).  In R, we could get the unique values by doing `unique(movies$content_rating)`.  In pandas, we call the equivalent `unique()` function, but as a **method** on the column, which is a Series object, like this:

In [10]:
movies['content_rating'].unique()

array(['R', 'PG-13', 'NOT RATED', 'PG', 'UNRATED', 'APPROVED', 'PASSED',
       'G', 'X', nan, 'TV-MA', 'GP', 'NC-17'], dtype=object)
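
Note that `nan` appears among the unique values above. As a rough sketch of how to handle that, the example below uses a hypothetical toy Series (not the IMDb data) to show dropping missing values before calling `unique()`, and counting distinct values with `nunique()`, which skips NaN unless told otherwise:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movies['content_rating']
ratings = pd.Series(['R', 'PG', np.nan, 'R', 'G'], name='content_rating')

# Drop missing values first, then take the unique values (order of appearance)
non_null_unique = ratings.dropna().unique().tolist()
print(non_null_unique)                # ['R', 'PG', 'G']

# nunique() ignores NaN by default; dropna=False counts it as a value
print(ratings.nunique())              # 3
print(ratings.nunique(dropna=False))  # 4
```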

If we wanted the counts of each rating, in R we'd do `table(movies$content_rating)`.  In pandas, we call the `value_counts()` **method** on the column (a Series object) like this:

In [12]:
movies['content_rating'].value_counts()

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
NC-17          7
PASSED         7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64
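
The counts above do not include the three missing ratings, because `value_counts()` leaves NaN out by default. A minimal sketch on hypothetical toy data (not the IMDb file) shows the `dropna=False` and `normalize=True` options:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movies['content_rating']
ratings = pd.Series(['R', 'PG', np.nan, 'R', 'G', np.nan, 'R'],
                    name='content_rating')

counts = ratings.value_counts()                       # NaN excluded by default
counts_with_nan = ratings.value_counts(dropna=False)  # NaN gets its own row
shares = ratings.value_counts(normalize=True)         # proportions, not counts

print(counts.sum())           # 5
print(counts_with_nan.sum())  # 7
print(counts['R'])            # 3
print(shares['R'])            # 0.6
```

Comparing `counts.sum()` with `len(ratings)` is a quick way to notice that some rows were dropped from the tally.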