# How do I avoid a SettingWithCopyWarning in pandas?

🐼 Pandas tutorial by Data School - Exercise performed by Dorian.H Mekni 🥇 | Wed 16 Dec 2020

In [1]:
import pandas as pd 

In [3]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [4]:
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [7]:
# Counting how many missing values there are:
movies.content_rating.isnull().sum() 

3

In [8]:
# Passing a boolean Series to the DataFrame to show the rows with a missing content rating:
movies[movies.content_rating.isnull()]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
649,7.7,Where Eagles Dare,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"



✅ The three movies are displayed as expected.


In [9]:
# Counting how often each content rating appears:
movies.content_rating.value_counts()

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64


⭐️ For the sake of this demonstration, we treat NOT RATED as a missing value.

⭐️ Let's replace NOT RATED with NaN so that pandas' missing-value functionality (isnull(), for example) covers those rows too:


In [12]:
# Displaying the movies whose content rating is NOT RATED:
movies[movies.content_rating == 'NOT RATED']

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
5,8.9,12 Angry Men,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
6,8.9,"The Good, the Bad and the Ugly",NOT RATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ..."
41,8.5,Sunset Blvd.,NOT RATED,Drama,110,"[u'William Holden', u'Gloria Swanson', u'Erich..."
63,8.4,M,NOT RATED,Crime,99,"[u'Peter Lorre', u'Ellen Widmann', u'Inge Land..."
66,8.4,Munna Bhai M.B.B.S.,NOT RATED,Comedy,156,"[u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']"
...,...,...,...,...,...,...
665,7.7,Lolita,NOT RATED,Drama,152,"[u'James Mason', u'Shelley Winters', u'Sue Lyon']"
673,7.7,Blow-Up,NOT RATED,Drama,111,"[u'David Hemmings', u'Vanessa Redgrave', u'Sar..."
763,7.6,Hunger,NOT RATED,Biography,96,"[u'Stuart Graham', u'Laine Megaw', u'Brian Mil..."
827,7.5,The Wind That Shakes the Barley,NOT RATED,Drama,127,"[u'Cillian Murphy', u'Padraic Delaney', u'Liam..."



🧐 NaN is not a string but a value from the numpy library, so we need to import numpy into our working environment:


In [13]:
import numpy as np

In [14]:
# Trying to overwrite the NOT RATED values with NaN:
movies[movies.content_rating == 'NOT RATED'].content_rating = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [16]:
movies.content_rating.isnull().sum()

3

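❗️ It didn't work: the count is still 3. The selection movies[...] returned a copy, so NaN was assigned to that copy rather than to movies itself.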
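To make the failure easier to see, here is a minimal sketch on toy data. The ratings DataFrame and its values are hypothetical stand-ins for movies, and the warning is silenced only to keep the output short. The point is that the boolean selection builds a temporary object, and the assignment lands on that object.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies DataFrame.
ratings = pd.DataFrame({
    "title": ["A", "B", "C"],
    "content_rating": ["R", "NOT RATED", "NOT RATED"],
})

mask = ratings["content_rating"] == "NOT RATED"

with warnings.catch_warnings():
    # Silence the chained-assignment warning so the demo output stays short.
    warnings.simplefilter("ignore")
    # Step 1: ratings[mask] builds a new, temporary DataFrame.
    # Step 2: the assignment changes only that temporary object.
    ratings[mask].content_rating = np.nan

# Count the missing values in the original DataFrame.
missing_after = int(ratings["content_rating"].isna().sum())
print(missing_after)  # 0: the original DataFrame is untouched
```

Because the temporary object is thrown away right after the assignment, nothing in ratings changes.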

In [17]:
# Let's follow the recommendation given by pandas:
movies.loc[movies.content_rating == 'NOT RATED', 'content_rating'] = np.nan

In [18]:
# Let's run the check : 
movies.content_rating.isnull().sum()

68


☝🏻✅ Bingo, it worked ! 



🧐 When you select rows and columns in the same operation, use .loc: it performs the selection in a single step on the original DataFrame, so the assignment is applied to it.
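The same pattern can be sketched on a small DataFrame. The films data below is hypothetical; only the .loc[row_indexer, col_indexer] form comes from the pandas warning itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies DataFrame.
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "content_rating": ["R", "NOT RATED", "PG", "NOT RATED"],
})

# Row selector: a boolean Series. Column selector: a column label.
not_rated = films["content_rating"] == "NOT RATED"

# One indexing operation on the original object, so the write lands in films.
films.loc[not_rated, "content_rating"] = np.nan

n_missing = int(films["content_rating"].isna().sum())
print(n_missing)  # 2: both NOT RATED rows are now missing values
```

Rows that did not match the selector keep their original rating.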


In [21]:
# Let's create a dataframe gathering the top movies : 
top_movies = movies.loc[movies.star_rating >= 9, :]

In [22]:
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."



⭐️ Let's pretend that the duration of The Shawshank Redemption is incorrect and that we want to fix it:


In [24]:
top_movies.loc[0, 'duration'] = 150

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s



😒 Although we used .loc, pandas still warns us: it cannot tell whether top_movies is a view or a copy of movies.

In [25]:
# Let's check if it worked first : 
top_movies 

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."



✅ 🥳... And it actually modified the duration of our targeted top movie !! 



🧐 Worth mentioning: sometimes the assignment works despite the warning and sometimes it doesn't, so always check the result!
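One way to make that check explicit is to read the value back after writing it. This is only a sketch: the sample DataFrame and the helper assignment_took_effect are hypothetical names introduced for illustration, not part of pandas.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the movies DataFrame.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": [142, 175, 200],
})


def assignment_took_effect(frame, row_label, column, expected):
    """Return True if frame.loc[row_label, column] holds the expected value."""
    return bool(frame.loc[row_label, column] == expected)


# Write the new value, then read it back from the same DataFrame.
sample.loc[0, "duration"] = 150
worked = assignment_took_effect(sample, 0, "duration", 150)
print(worked)  # True
```

The same idea applies to the missing-value case: compare isnull().sum() before and after the assignment.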



⭐️ Every time you create a DataFrame from another one, explicitly use the copy() method so that pandas knows it is a copy. That removes the ambiguity: is top_movies a copy, or a view of 4 rows from movies?


In [26]:
# Let's apply the copy method:
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

In [28]:
# Let's now modify our targeted duration : 
top_movies.loc[0, 'duration'] = 150

In [29]:
# Reading through : 
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."



✅ It worked brilliantly, and without any warning this time!


In [30]:
# Making sure it did not modify the original dataframe movies : 
movies.head() 

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."



✅ ...and it did not ! 
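The independence of an explicit copy can also be checked on toy data without downloading the dataset. The catalog DataFrame below is a hypothetical stand-in for movies, and best plays the role of top_movies.

```python
import pandas as pd

# Hypothetical toy data standing in for the movies DataFrame.
catalog = pd.DataFrame({
    "title": ["A", "B", "C"],
    "star_rating": [9.3, 9.2, 8.9],
    "duration": [142, 175, 154],
})

# .copy() makes the subset an independent DataFrame.
best = catalog.loc[catalog["star_rating"] >= 9, :].copy()

# Modify the copy only.
best.loc[0, "duration"] = 150

print(best.loc[0, "duration"])     # 150: the copy was modified
print(catalog.loc[0, "duration"])  # 142: the original is unchanged
```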



🙏🏻 Thank you !

👋🏻 See you in the next one !