# Methods for handling missing values
**pandas provides the following methods to handle missing values:**

- `isna`: Returns a Series of booleans indicating whether each value is missing.
- `notna`: The exact opposite of `isna`; `True` marks a value that is present.
- `fillna`: Fills missing values, for example with a constant.
- `dropna`: Drops the missing values from the Series.
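
The four methods listed above can be tried on a tiny Series before moving to the movie data. This is a minimal sketch; the Series `s` and its values are made up for illustration and are not taken from `movie.csv`.

```python
import numpy as np
import pandas as pd

# A small hypothetical Series with two missing values
s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing_mask = s.isna()    # True where a value is missing
present_mask = s.notna()   # True where a value is present
filled = s.fillna(0)       # replace every missing value with 0
dropped = s.dropna()       # keep only the non-missing values

print(missing_mask.tolist())  # [False, True, False, True]
print(present_mask.tolist())  # [True, False, True, False]
print(filled.tolist())        # [1.0, 0.0, 3.0, 0.0]
print(dropped.tolist())       # [1.0, 3.0]
```

Because `True` counts as 1, calling `.sum()` on the result of `isna` gives the number of missing values.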

In [23]:
import pandas as pd

In [35]:
movie = pd.read_csv("data/movie.csv")

In [25]:
movie[movie["year"].isna()]

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1
176,Miami Vice,,Color,TV-14,60.0,,,Don Johnson,982.0,Philip Michael Thomas,...,184.0,,Action|Crime|Drama|Mystery|Thriller,21.0,16769,cult tv|detective|drugs|police|undercover,English,USA,1500000.0,7.5
257,The A-Team,,Color,TV-PG,60.0,,,George Peppard,669.0,Dirk Benedict,...,432.0,,Action|Adventure|Crime,29.0,25402,1980s|cult tv|famous opening theme|good versus...,English,USA,,7.6
276,"10,000 B.C.",,,,22.0,Christopher Barnard,0.0,Mathew Buck,5.0,,...,,,Comedy,,6,,,,,7.2
398,Hannibal,,Color,TV-14,44.0,,,Caroline Dhavernas,544.0,Scott Thompson,...,148.0,,Crime|Drama|Horror|Mystery|Thriller,103.0,159910,blood|cannibalism|fbi|manipulation|psychiatrist,English,USA,,8.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4683,Heroes,,Color,TV-14,60.0,,,Sendhil Ramamurthy,1000.0,Masi Oka,...,833.0,,Drama|Fantasy|Sci-Fi|Thriller,75.0,202115,father daughter relationship|serial killer|sup...,English,USA,,7.7
4688,Home Movies,,Color,TV-PG,22.0,,,Brendon Small,59.0,Ron Lynch,...,6.0,,Animation|Comedy|Drama,11.0,7458,coach|friend|school|series|tv series,English,USA,,8.2
4704,Revolution,,Color,TV-14,43.0,,,Billy Burke,2000.0,Tracy Spiridakos,...,576.0,,Action|Adventure|Drama|Sci-Fi,23.0,72017,2020s|near future|one word series title|post a...,English,USA,,6.7
4752,Happy Valley,,Color,TV-MA,58.0,,,Shirley Henderson,887.0,James Norton,...,250.0,,Crime|Drama,11.0,12848,caravan|police|police sergeant|tied to a chair...,English,UK,,8.5


In [6]:
#number of missing values in year column
movie["year"].isna().sum()


106

In [9]:
#get all rows in which the year column is missing
filter_1 = movie["year"].isna()
missing_year = movie[filter_1]


In [10]:
missing_year["year"].count()

0

In [11]:
missing_year["year"]

4      NaN
176    NaN
257    NaN
276    NaN
398    NaN
        ..
4683   NaN
4688   NaN
4704   NaN
4752   NaN
4912   NaN
Name: year, Length: 106, dtype: float64

In [None]:
# use fillna to fill missing values.

In [13]:
complete_year = movie.copy()  # copy, so that filling values does not modify movie

In [15]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            4916 non-null   object 
 1   year             4810 non-null   float64
 2   color            4897 non-null   object 
 3   content_rating   4616 non-null   object 
 4   duration         4901 non-null   float64
 5   director_name    4814 non-null   object 
 6   director_fb      4814 non-null   float64
 7   actor1           4909 non-null   object 
 8   actor1_fb        4909 non-null   float64
 9   actor2           4903 non-null   object 
 10  actor2_fb        4903 non-null   float64
 11  actor3           4893 non-null   object 
 12  actor3_fb        4893 non-null   float64
 13  gross            4054 non-null   float64
 14  genres           4916 non-null   object 
 15  num_reviews      4867 non-null   float64
 16  num_voted_users  4916 non-null   int64  
 17  plot_keywords 

In [18]:
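# assign the filled column back; inplace=True on a selected column is unreliable in recent pandas versions
complete_year["year"] = complete_year["year"].fillna(2024)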
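
Plain assignment of a DataFrame to a new name does not copy it, so filling values through one name changes the data seen through the other. The following is a minimal sketch of the difference, using a small hypothetical frame (`frame`, `alias`, and `independent` are illustrative names) rather than the movie data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie data
frame = pd.DataFrame({"year": [2009.0, np.nan, 2015.0]})

alias = frame               # second name for the same object, nothing is copied
independent = frame.copy()  # separate object with its own data

# Fill the copy only, assigning the result back to the column
independent["year"] = independent["year"].fillna(2024)

print(alias is frame)                    # True
print(frame["year"].isna().sum())        # 1, the original still has its missing value
print(independent["year"].isna().sum())  # 0
```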

In [20]:
complete_year.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            4916 non-null   object 
 1   year             4916 non-null   float64
 2   color            4897 non-null   object 
 3   content_rating   4616 non-null   object 
 4   duration         4901 non-null   float64
 5   director_name    4814 non-null   object 
 6   director_fb      4814 non-null   float64
 7   actor1           4909 non-null   object 
 8   actor1_fb        4909 non-null   float64
 9   actor2           4903 non-null   object 
 10  actor2_fb        4903 non-null   float64
 11  actor3           4893 non-null   object 
 12  actor3_fb        4893 non-null   float64
 13  gross            4054 non-null   float64
 14  genres           4916 non-null   object 
 15  num_reviews      4867 non-null   float64
 16  num_voted_users  4916 non-null   int64  
 17  plot_keywords 

In [21]:
complete_year[complete_year["year"]==2024]

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
4,Star Wars: Episode VII - The Force Awakens,2024.0,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1
176,Miami Vice,2024.0,Color,TV-14,60.0,,,Don Johnson,982.0,Philip Michael Thomas,...,184.0,,Action|Crime|Drama|Mystery|Thriller,21.0,16769,cult tv|detective|drugs|police|undercover,English,USA,1500000.0,7.5
257,The A-Team,2024.0,Color,TV-PG,60.0,,,George Peppard,669.0,Dirk Benedict,...,432.0,,Action|Adventure|Crime,29.0,25402,1980s|cult tv|famous opening theme|good versus...,English,USA,,7.6
276,"10,000 B.C.",2024.0,,,22.0,Christopher Barnard,0.0,Mathew Buck,5.0,,...,,,Comedy,,6,,,,,7.2
398,Hannibal,2024.0,Color,TV-14,44.0,,,Caroline Dhavernas,544.0,Scott Thompson,...,148.0,,Crime|Drama|Horror|Mystery|Thriller,103.0,159910,blood|cannibalism|fbi|manipulation|psychiatrist,English,USA,,8.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4683,Heroes,2024.0,Color,TV-14,60.0,,,Sendhil Ramamurthy,1000.0,Masi Oka,...,833.0,,Drama|Fantasy|Sci-Fi|Thriller,75.0,202115,father daughter relationship|serial killer|sup...,English,USA,,7.7
4688,Home Movies,2024.0,Color,TV-PG,22.0,,,Brendon Small,59.0,Ron Lynch,...,6.0,,Animation|Comedy|Drama,11.0,7458,coach|friend|school|series|tv series,English,USA,,8.2
4704,Revolution,2024.0,Color,TV-14,43.0,,,Billy Burke,2000.0,Tracy Spiridakos,...,576.0,,Action|Adventure|Drama|Sci-Fi,23.0,72017,2020s|near future|one word series title|post a...,English,USA,,6.7
4752,Happy Valley,2024.0,Color,TV-MA,58.0,,,Shirley Henderson,887.0,James Norton,...,250.0,,Crime|Drama,11.0,12848,caravan|police|police sergeant|tied to a chair...,English,UK,,8.5


**Let us use the `movie` data set for the following examples:**

In [2]:
# use isna to count the number of missing values

In [36]:
#use dropna to drop missing values.
#drop any row in which year is missing
movie = movie.dropna(subset=["year"])
movie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4810 entries, 0 to 4915
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            4810 non-null   object 
 1   year             4810 non-null   float64
 2   color            4795 non-null   object 
 3   content_rating   4552 non-null   object 
 4   duration         4798 non-null   float64
 5   director_name    4810 non-null   object 
 6   director_fb      4810 non-null   float64
 7   actor1           4803 non-null   object 
 8   actor1_fb        4803 non-null   float64
 9   actor2           4800 non-null   object 
 10  actor2_fb        4800 non-null   float64
 11  actor3           4792 non-null   object 
 12  actor3_fb        4792 non-null   float64
 13  gross            4052 non-null   float64
 14  genres           4810 non-null   object 
 15  num_reviews      4770 non-null   float64
 16  num_voted_users  4810 non-null   int64  
 17  plot_keywords 

### Exercises:

In [None]:
# filter rows in which color is missing

In [37]:
#drop rows in which color is missing and save the result to a new variable

In [None]:
#fill rows in which color is missing with "Color"

# Sorting:

The `sort_values` method sorts a Series or DataFrame from least to greatest by default.

It places missing values at the end, whether the sort is ascending or descending.
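
A short sketch of this behaviour on a hypothetical Series of years (the values are made up). It also shows the `na_position` argument of `sort_values`, which this notebook does not otherwise use, as one way to move missing values to the front.

```python
import numpy as np
import pandas as pd

# Hypothetical years, including one missing value at index 1
years = pd.Series([2009.0, np.nan, 1999.0, 2015.0])

ascending = years.sort_values()                     # least to greatest, NaN last
descending = years.sort_values(ascending=False)     # greatest to least, NaN still last
nan_first = years.sort_values(na_position="first")  # NaN moved to the front

# The index shows where each value came from in the original Series
print(ascending.index.tolist())   # [2, 0, 3, 1]
print(descending.index.tolist())  # [3, 0, 2, 1]
print(nan_first.index.tolist())   # [1, 2, 0, 3]
```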

In [41]:
import pandas as pd
movieDf = pd.read_csv("data/movie.csv")

In [42]:
movieDf.head()

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1


In [44]:
movieDf.sort_values(by="year", ascending=False)  # ascending=False sorts in descending order

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
3884,The Veil,2016.0,Color,R,93.0,Phil Joanou,21.0,Lily Rabe,763.0,Shannon Woodward,...,359.0,,Horror,39.0,4146,documentary crew|mass suicide|survivor|tied fe...,English,USA,4000000.0,4.7
2375,My Big Fat Greek Wedding 2,2016.0,Color,PG-13,94.0,Kirk Jones,52.0,Nia Vardalos,567.0,Louis Mandylor,...,261.0,59573085.0,Comedy|Family|Romance,156.0,13562,family restaurant|greek|remarriage|suburb|wedding,English,USA,18000000.0,6.1
2794,Miracles from Heaven,2016.0,Color,PG,109.0,Patricia Riggen,36.0,Jennifer Garner,3000.0,Brighton Sharbino,...,579.0,61693523.0,Drama,63.0,6276,child cancer|christian film|christianity|falli...,English,USA,13000000.0,6.8
92,Independence Day: Resurgence,2016.0,Color,PG-13,120.0,Roland Emmerich,776.0,Vivica A. Fox,890.0,Sela Ward,...,535.0,102315545.0,Action|Adventure|Sci-Fi,286.0,58137,alien|battle|defense|independence day|mothership,English,USA,165000000.0,5.5
153,Kung Fu Panda 3,2016.0,Color,PG,95.0,Alessandro Carloni,5.0,J.K. Simmons,24000.0,Angelina Jolie Pitt,...,967.0,143523463.0,Action|Adventure|Animation|Comedy|Family,210.0,64322,china|kung fu|panda|pig|village,English,USA,145000000.0,7.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4683,Heroes,,Color,TV-14,60.0,,,Sendhil Ramamurthy,1000.0,Masi Oka,...,833.0,,Drama|Fantasy|Sci-Fi|Thriller,75.0,202115,father daughter relationship|serial killer|sup...,English,USA,,7.7
4688,Home Movies,,Color,TV-PG,22.0,,,Brendon Small,59.0,Ron Lynch,...,6.0,,Animation|Comedy|Drama,11.0,7458,coach|friend|school|series|tv series,English,USA,,8.2
4704,Revolution,,Color,TV-14,43.0,,,Billy Burke,2000.0,Tracy Spiridakos,...,576.0,,Action|Adventure|Drama|Sci-Fi,23.0,72017,2020s|near future|one word series title|post a...,English,USA,,6.7
4752,Happy Valley,,Color,TV-MA,58.0,,,Shirley Henderson,887.0,James Norton,...,250.0,,Crime|Drama,11.0,12848,caravan|police|police sergeant|tied to a chair...,English,UK,,8.5


In [45]:
movieDf.sort_values(by="title")  # ascending by default; pass ascending=False for descending order

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
4349,#Horror,2015.0,Color,Not Rated,101.0,Tara Subkoff,37.0,Timothy Hutton,501.0,Balthazar Getty,...,56.0,,Drama|Horror|Mystery|Thriller,35.0,1547,bullying|cyberbullying|girl|internet|throat sl...,English,USA,1500000.0,3.3
3629,10 Cloverfield Lane,2016.0,Color,PG-13,104.0,Dan Trachtenberg,16.0,Bradley Cooper,14000.0,John Gallagher Jr.,...,82.0,71897215.0,Drama|Horror|Mystery|Sci-Fi|Thriller,411.0,126893,alien|bunker|car crash|kidnapping|minimal cast,English,USA,15000000.0,7.3
2964,10 Days in a Madhouse,2015.0,Color,R,111.0,Timothy Hines,0.0,Christopher Lambert,1000.0,Kelly LeBrock,...,247.0,14616.0,Drama,1.0,314,,English,USA,12000000.0,7.5
2799,10 Things I Hate About You,1999.0,Color,PG-13,97.0,Gil Junger,19.0,Joseph Gordon-Levitt,23000.0,Heath Ledger,...,835.0,38176108.0,Comedy|Drama|Romance,133.0,222099,dating|protective father|school|shrew|teen movie,English,USA,16000000.0,7.2
276,"10,000 B.C.",,,,22.0,Christopher Barnard,0.0,Mathew Buck,5.0,,...,,,Comedy,,6,,,,,7.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3597,[Rec] 2,2009.0,Color,R,85.0,Jaume Balagueró,57.0,Jonathan D. Mellor,37.0,Pablo Rosso,...,6.0,27024.0,Horror,222.0,55597,apartment|apartment building|blood sample|cruc...,Spanish,Spain,5600000.0,6.6
2127,eXistenZ,1999.0,Color,R,115.0,David Cronenberg,0.0,Jennifer Jason Leigh,1000.0,Sarah Polley,...,716.0,2840417.0,Horror|Sci-Fi|Thriller,196.0,77493,assassin|game|game designer|pod|virtual reality,English,Canada,31000000.0,6.8
579,xXx,2002.0,Color,PG-13,132.0,Rob Cohen,357.0,Vin Diesel,14000.0,Eve,...,212.0,141204016.0,Action|Adventure|Thriller,191.0,142569,agent|nsa|nsa agent|prague|russian,English,USA,70000000.0,5.8
782,xXx: State of the Union,2005.0,Color,PG-13,101.0,Lee Tamahori,93.0,Sunny Mabrey,287.0,Nona Gaye,...,218.0,26082914.0,Action|Adventure|Crime|Thriller,77.0,51349,coup d'etat|mutiny|president|u.s. navy|washing...,English,USA,87000000.0,4.3


## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">What percentage of actor 1 Facebook likes are missing?</span>


### Exercise 2
<span  style="color:green; font-size:16px">Use the notna method to find the number of non-missing values in the actor 1 Facebook likes column. Verify that this
number matches the result of the count method.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Use one line of code to fill the missing values of actor1_fb with the maximum of actor2_fb. Save this result to
variable actor1_fb_full</span>

### Exercise 4
<span  style="color:green; font-size:16px">Verify the results of Exercise 3 by selecting just the values of actor1_fb_full that were filled with the maximum of actor2_fb.</span>


# Uniqueness

**There are a few methods that deal with unique values in a Series:**

- `unique`: Returns a NumPy array of all the unique values in order of their appearance
- `nunique`: Returns the number of unique values in the Series (missing values are not counted by default)
- `drop_duplicates`: Returns a pandas Series of just the unique values
- `duplicated`: Returns a Series of booleans marking each value that repeats an earlier one
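
The four methods can be compared side by side on a small hypothetical Series (the values below are made up). The sketch also shows why `unique` can return one more entry than `nunique` reports: `unique` includes `NaN`, while `nunique` skips it unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with repeated values and one missing value
s = pd.Series([2009.0, 2007.0, 2009.0, np.nan, 2007.0])

unique_values = s.unique()               # NumPy array, NaN included
n_unique = s.nunique()                   # NaN is not counted by default
n_unique_with_nan = s.nunique(dropna=False)
deduplicated = s.drop_duplicates()       # first occurrence of each value is kept
is_repeat = s.duplicated()               # True for values seen earlier

print(len(unique_values))            # 3
print(n_unique)                      # 2
print(n_unique_with_nan)             # 3
print(deduplicated.index.tolist())   # [0, 1, 3]
print(is_repeat.tolist())            # [False, False, True, False, True]
```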

In [49]:
movieDf["year"].unique()

array([2009., 2007., 2015., 2012.,   nan, 2010., 2016., 2006., 2008.,
       2013., 2011., 2014., 2005., 1997., 2004., 1999., 1995., 2003.,
       2001., 2002., 1998., 2000., 1990., 1991., 1994., 1996., 1982.,
       1993., 1979., 1992., 1989., 1984., 1988., 1978., 1962., 1980.,
       1972., 1981., 1968., 1985., 1940., 1963., 1987., 1986., 1973.,
       1983., 1976., 1977., 1970., 1971., 1969., 1960., 1965., 1964.,
       1927., 1974., 1937., 1975., 1967., 1951., 1961., 1946., 1953.,
       1954., 1959., 1932., 1947., 1956., 1945., 1952., 1930., 1966.,
       1939., 1950., 1948., 1958., 1957., 1943., 1944., 1938., 1949.,
       1936., 1941., 1955., 1942., 1929., 1935., 1933., 1916., 1934.,
       1925., 1920.])

In [48]:
movieDf["year"].nunique()

91

In [None]:
# count the number of unique values 

In [51]:
df= pd.DataFrame({"id":[1,2,3,4,4],"name":["Ahmed","Ahmed","Mohamed","sara","sara"]})
df

Unnamed: 0,id,name
0,1,Ahmed
1,2,Ahmed
2,3,Mohamed
3,4,sara
4,4,sara


In [53]:
#count the rows that are exact duplicates of an earlier row
df.duplicated().sum()

1

In [56]:
df.duplicated()

0    False
1    False
2    False
3    False
4     True
dtype: bool

In [55]:
#drop entire duplicate row
df.drop_duplicates()

Unnamed: 0,id,name
0,1,Ahmed
1,2,Ahmed
2,3,Mohamed
3,4,sara


In [58]:
df

Unnamed: 0,id,name
0,1,Ahmed
1,2,Ahmed
2,3,Mohamed
3,4,sara
4,4,sara


In [57]:
#check if name has duplicates
df.duplicated(subset=["name"])

0    False
1     True
2    False
3    False
4     True
dtype: bool

In [60]:
df.drop_duplicates(subset="name")

Unnamed: 0,id,name
0,1,Ahmed
2,3,Mohamed
3,4,sara
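
In the cells above, the first occurrence of each name is the one that is kept. Both `duplicated` and `drop_duplicates` accept a `keep` argument that changes this; it is not used elsewhere in this notebook, so the sketch below is only an illustration on the same small frame.

```python
import pandas as pd

# Same small frame as in the cells above
df = pd.DataFrame({"id": [1, 2, 3, 4, 4],
                   "name": ["Ahmed", "Ahmed", "Mohamed", "sara", "sara"]})

first_kept = df.drop_duplicates(subset="name")              # default keep="first"
last_kept = df.drop_duplicates(subset="name", keep="last")  # keep the last occurrence
all_marked = df.duplicated(subset=["name"], keep=False)     # flag every repeated name

print(first_kept.index.tolist())  # [0, 2, 3]
print(last_kept.index.tolist())   # [1, 2, 4]
print(all_marked.tolist())        # [True, True, False, True, True]
```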
