# Methods for handling missing values
**pandas provides the following methods to handle missing values:**

- `isna`: Returns a Series of booleans that is `True` wherever a value is missing.
- `notna`: The exact opposite of `isna`; `True` wherever a value is present.
- `fillna`: Fills missing values in a variety of ways, for example with a constant.
- `dropna`: Drops the missing values from the Series.
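
As a quick illustration of the four methods listed above, here is a minimal sketch on a small made-up Series. The values and variable names are assumptions for demonstration only and do not come from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset
s = pd.Series([2010.0, np.nan, 2015.0, np.nan])

missing_mask = s.isna()    # True where the value is missing
present_mask = s.notna()   # True where the value is present
filled = s.fillna(0)       # new Series with missing values replaced by 0
dropped = s.dropna()       # new Series without the missing values

print(missing_mask.tolist())
print(filled.tolist())
print(dropped.tolist())
```

None of these calls modifies `s`; each one returns a new object.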

In [None]:
import pandas as pd

In [None]:
# let us read the movie dataset
movie = pd.read_csv("data/movie.csv")

In [None]:
# show the first five rows of movie
movie.head()

In [None]:
# check every cell in the movie dataset for a missing value (True means missing)
movie.isna()

In [None]:
# sum() adds up each column of the boolean DataFrame; since True counts as 1,
# the result is the number of missing values in each column
movie.isna().sum()
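
To see why summing booleans counts missing values, here is a small sketch on a made-up DataFrame. The frame `toy` and its contents are assumptions for illustration, not part of the movie data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing cells in total
toy = pd.DataFrame({
    "title": ["A", "B", None],
    "year": [2001.0, np.nan, np.nan],
})

# isna() gives True/False per cell; sum() treats True as 1 and False as 0
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Summing again collapses the per-column counts into one total
total_missing = int(missing_per_column.sum())
print(total_missing)
```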

In [None]:
# number of missing values in the year column
movie["year"].isna().sum()


In [None]:
# get all rows in which the year column is missing
filter_1 = movie["year"].isna()
missing_year = movie[filter_1]


In [None]:
# count() counts the non-missing values in the column, so here it gives the same result as notna().sum()
print(movie["year"].count())
print(movie["year"].notna().sum())

# Use fillna to fill missing values
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
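
As a rough sketch of two common ways to call `fillna`, the example below fills with a single scalar and with a dict that maps column names to fill values. The frame `toy` and the fill value `"Unknown"` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the movie dataset
toy = pd.DataFrame({
    "year": [2001.0, np.nan, 2010.0],
    "color": ["Color", None, None],
})

# A scalar fills every missing value in one column
year_filled = toy["year"].fillna(2024)

# A dict maps each column name to its own fill value
toy_filled = toy.fillna({"year": 2024, "color": "Unknown"})

# fillna returned new objects, so the original still has its missing values
print(int(toy.isna().sum().sum()))
print(int(toy_filled.isna().sum().sum()))
```

Because `fillna` returns a new object, the result has to be assigned to a variable (or back to the column) to be kept.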

In [None]:
movie.info()

In [None]:
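# fill any missing year with 2024; fillna returns a new Series and leaves movie unchanged
movie["year"].fillna(2024)

# the year column still shows the same non-null count
movie.info()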

In [None]:
# assign the result back to the column to keep the filled values
movie["year"] = movie["year"].fillna(2024)

In [None]:
movie.info()

In [None]:
# rows whose year is 2024, which includes any rows filled above
movie[movie["year"] == 2024]

In [None]:
# use dropna to drop missing values
# drop any row in which year is missing
movie = movie.dropna(subset=["year"])
movie.info()
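
For comparison, here is a small sketch of how `subset` and `how` change which rows `dropna` removes. The frame `toy` is a made-up example, not the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 1 lacks year, row 2 lacks everything
toy = pd.DataFrame({
    "title": ["A", "B", None],
    "year": [2001.0, np.nan, np.nan],
})

any_missing_dropped = toy.dropna()            # drop rows with at least one missing value
year_required = toy.dropna(subset=["year"])   # only look at the year column
all_missing_dropped = toy.dropna(how="all")   # drop rows where every value is missing

print(len(any_missing_dropped), len(year_required), len(all_missing_dropped))
```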

### Exercises:

In [None]:
# filter rows in which color is missing

In [None]:
# drop rows in which color is missing and save the result to a new variable

In [None]:
# fill rows in which color is missing with "Color"

# Sorting:

The `sort_values` method sorts the Series **from least to greatest by default**.

It places **missing values at the end**, regardless of the sort direction.
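
The default behavior described above can be sketched on a small made-up Series. The values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing value at index 1
s = pd.Series([3.0, np.nan, 1.0, 2.0])

ascending = s.sort_values()                     # least to greatest, NaN last
descending = s.sort_values(ascending=False)     # greatest to least, NaN still last
na_first = s.sort_values(na_position="first")   # NaN moved to the front

print(ascending.index.tolist())
print(descending.index.tolist())
print(na_first.index.tolist())
```

The original index labels travel with the values, which makes it easy to see where each value ended up.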

In [None]:
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', None, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
df

In [None]:
df.sort_values(by=['col1'])

In [None]:
# sort by multiple columns: col1 first, then col2 to break ties
df.sort_values(by=['col1', 'col2'])

In [None]:
# sort in descending order
df.sort_values(by='col1', ascending=False)

In [None]:
# put missing values first
df.sort_values(by='col1', ascending=False, na_position='first')

In [None]:
movieDf = pd.read_csv("data/movie.csv")

In [None]:
movieDf.head()

# pandas sort_values with examples:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

In [None]:
movieDf.sort_values(by="year", ascending=False)  # ascending=False sorts in descending order

In [None]:
movieDf.sort_values(by="title")  # ascending by default; pass ascending=False for descending order

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">What percentage of actor 1 Facebook likes are missing?</span>


### Exercise 2
<span  style="color:green; font-size:16px">Use the notna method to find the number of non-missing values in the actor 1 Facebook likes column. Verify that this
number is the same as the result of the count method.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Use one line of code to fill the missing values of actor1_fb with the maximum of actor2_fb. Save this result to
variable actor1_fb_full</span>

### Exercise 4
<span  style="color:green; font-size:16px">Verify the results of Exercise 3 by selecting just the values of actor1_fb_full that were filled with the maximum of actor2_fb.</span>


# Uniqueness

**There are a few methods that deal with unique values in a Series:**

- `unique`: Returns a numpy array of all the unique values in order of their appearance
- `nunique`: Returns the number of unique values in the Series
- `drop_duplicates`: Returns a pandas Series of just the unique values, keeping the first occurrence of each
- `duplicated`: Returns a boolean Series that is `True` for each value that repeats an earlier one
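
A minimal sketch of the four methods on a small made-up Series follows. The values are assumptions for illustration and are not taken from the movie dataset.

```python
import pandas as pd

# Hypothetical toy Series with two repeated values
s = pd.Series([2010, 2012, 2010, 2015, 2012])

unique_values = s.unique()       # numpy array, in order of first appearance
n_unique = s.nunique()           # number of distinct values
deduplicated = s.drop_duplicates()  # Series keeping the first occurrence of each value
is_repeat = s.duplicated()       # True where a value already appeared earlier

print(unique_values.tolist())
print(n_unique)
print(deduplicated.index.tolist())
print(is_repeat.tolist())
```

Note that `drop_duplicates` keeps the original index labels, so the result shows which rows survived.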

In [None]:
movieDf["year"].unique()

In [None]:
movieDf["year"].nunique()

In [None]:
# count the number of unique values 

In [None]:
df = pd.DataFrame({"id": [1, 2, 3, 4, 4], "name": ["Ahmed", "Ahmed", "Mohamed", "sara", "sara"]})
df

In [None]:
# count the rows that are exact duplicates of an earlier row
df.duplicated().sum()

In [None]:
df.duplicated()

In [None]:
# drop rows that duplicate an earlier row in every column
df.drop_duplicates()

In [None]:
df

In [None]:
# check if name has duplicates
df.duplicated(subset=["name"])

In [None]:
df.drop_duplicates(subset="name")
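
By default `drop_duplicates` keeps the first occurrence of each duplicate. As a sketch of the alternatives, the `keep` parameter can select the last occurrence or discard every duplicated row. The frame below repeats the example data under the assumed name `people` so the block runs on its own.

```python
import pandas as pd

# Same data as the example above, under a hypothetical variable name
people = pd.DataFrame({
    "id": [1, 2, 3, 4, 4],
    "name": ["Ahmed", "Ahmed", "Mohamed", "sara", "sara"],
})

first_kept = people.drop_duplicates(subset="name")               # default keep="first"
last_kept = people.drop_duplicates(subset="name", keep="last")   # keep the last occurrence
none_kept = people.drop_duplicates(subset="name", keep=False)    # drop every duplicated name

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(none_kept.index.tolist())
```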