## Topic: Check Duplicates and Null (NaN) in DF

### OUTCOMES

- 1. Check Unique Values in a column
    - unique()
    - nunique()

- 2. Difference Between unique() and nunique()

- 3. Check Null
    - df.isnull()
    - df.notnull()
    - df['col'].hasnans 


In [1]:
import pandas as pd

### 1. Check Unique Values in a column

#### Using unique()

- Returns the unique (non-duplicate) values.

- Counts NaN as a unique value.

- Syntax:
    - df['Column_name'].unique()


In [3]:
# DataFrame

data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "City": ["New York", "London", "Paris", "New York", "Tokyo", "London"],
    "Score": [85, 90, 78, 85, 95, 90]
}

df = pd.DataFrame(data)
df


Unnamed: 0,Name,City,Score
0,Alice,New York,85
1,Bob,London,90
2,Charlie,Paris,78
3,Alice,New York,85
4,David,Tokyo,95
5,Bob,London,90


In [None]:
# Note: row 3 duplicates row 0, and row 5 duplicates row 1
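
The repeated rows in this sample can also be flagged programmatically. The following is a minimal sketch using `DataFrame.duplicated()`, a method this notebook does not otherwise cover; it rebuilds the same sample data so that it runs on its own.

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "City": ["New York", "London", "Paris", "New York", "Tokyo", "London"],
    "Score": [85, 90, 78, 85, 95, 90],
}
df = pd.DataFrame(data)

# True marks a row that repeats an earlier row; the first occurrence stays False
dup_mask = df.duplicated()
print(dup_mask.tolist())   # [False, False, False, True, False, True]

# Number of repeated rows
print(dup_mask.sum())      # 2
```

By default only the later copies are marked, which is why rows 0 and 1 stay False.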

In [None]:
# Example : unique()
df['Name'].unique()

# return only unique values

array(['Alice', 'Bob', 'Charlie', 'David'], dtype=object)

In [5]:
# Number of unique values in the column (NaN, if present, is counted)

len(df['Name'].unique())

4

#### nunique()

- Returns the number of unique values.
- Does not count NaN as a unique value (by default).

- Syntax:
    - df['column_name'].nunique()  [for Series]

    - df.nunique()  [for DataFrame]

In [None]:
# Example - nunique() on Series

df['City'].nunique()

# There are 4 unique cities (NaN is not counted)

4

In [8]:
df

Unnamed: 0,Name,City,Score
0,Alice,New York,85
1,Bob,London,90
2,Charlie,Paris,78
3,Alice,New York,85
4,David,Tokyo,95
5,Bob,London,90


In [9]:
# nunique() on DataFrame
df.nunique()

Name     4
City     4
Score    4
dtype: int64

### 2. Difference Between unique() and nunique()

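- unique():
    - Works on a Series (a single column), not on a whole DataFrame.

    - Returns the unique values themselves, including NaN if present.

- nunique():
    - Works with both Series and DataFrame.
    - Returns the number of unique values, excluding NaN by default.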
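
The sample DataFrame above contains no NaN, so the difference in NaN handling is not visible there. The sketch below uses a small hypothetical Series (the values are made up for illustration) to show it.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
s = pd.Series(["Dhaka", "Chattogram", np.nan, "Dhaka"])

print(s.unique())               # the returned values include NaN
print(len(s.unique()))          # 3 -> NaN is counted
print(s.nunique())              # 2 -> NaN is ignored by default
print(s.nunique(dropna=False))  # 3 -> NaN is counted when dropna=False
```

So `len(s.unique())` and `s.nunique()` agree only when the column has no missing values.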


### 3. Check Null

- Null values (NaN) indicate missing data.

#### df.isnull()

- Returns Boolean values:
    - True => missing (the value is NaN)

    - False => not missing (the value is not NaN)

- Syntax:
    - df['column_name'].isnull()

    - df.isnull()

In [15]:
# DataFrame
df = pd.read_csv("student_data.csv")
df.head(3)

Unnamed: 0,StudentID,FullName,Data Structure Marks,Algorithm Marks,Python Marks,CompletionStatus,EnrollmentDate,Instructor,Location
0,PH1001,Alif Rahman,85.0,85.0,88.0,Completed,2024-01-15,Mr. Karim,Dhaka
1,PH1002,Fatima Akhter,92.0,92.0,,In Progress,2024-01-20,Ms. Salma,Chattogram
2,PH1003,Imran Hossain,88.0,88.0,85.0,Completed,2024-02-10,Mr. Karim,Dhaka


In [17]:
# Example - df.isnull() [Entire data or table]
df.isnull().head(5)

# True => the value is NaN
# False => the value is not NaN

Unnamed: 0,StudentID,FullName,Data Structure Marks,Algorithm Marks,Python Marks,CompletionStatus,EnrollmentDate,Instructor,Location
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,True,True,False,False,False,False,False


In [21]:
# isnull() with single column [Series]

df['Python Marks'].isnull()


0     False
1      True
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15     True
16    False
17    False
18     True
19    False
Name: Python Marks, dtype: bool

In [None]:
# Total NaN count in the Python Marks column (specific column)

df['Python Marks'].isnull().sum()


np.int64(5)
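
The same `isnull().sum()` pattern extends to the whole table. The sketch below assumes a small hypothetical DataFrame standing in for `student_data.csv`, since that file is not included here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_data.csv
marks = pd.DataFrame({
    "StudentID": ["PH1001", "PH1002", "PH1003"],
    "Python Marks": [88.0, np.nan, 85.0],
    "Algorithm Marks": [85.0, 92.0, np.nan],
})

nan_per_column = marks.isnull().sum()              # NaN count for each column
total_nan = marks.isnull().sum().sum()             # NaN count for the whole table
rows_with_nan = marks[marks.isnull().any(axis=1)]  # rows with at least one NaN

print(nan_per_column)
print(total_nan)            # 2
print(len(rows_with_nan))   # 2
```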

#### df.notnull()

- Opposite of isnull()

- True => the value is not NaN

- False => the value is NaN

In [25]:
# df.notnull()
df.notnull().head(5)

Unnamed: 0,StudentID,FullName,Data Structure Marks,Algorithm Marks,Python Marks,CompletionStatus,EnrollmentDate,Instructor,Location
0,True,True,True,True,True,True,True,True,True
1,True,True,True,True,False,True,True,True,True
2,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True
4,True,True,False,False,True,True,True,True,True


In [None]:
# Count of non-NaN values in each column
total_not_null = df.notnull().sum()
total_not_null

StudentID               20
FullName                20
Data Structure Marks    16
Algorithm Marks         16
Python Marks            15
CompletionStatus        20
EnrollmentDate          20
Instructor              20
Location                20
dtype: int64

#### df['col_name'].hasnans

- Applies to a single column (Series) only; it is an attribute, so it takes no parentheses.

- True => the column contains at least one NaN.

- False => the column is completely filled (no missing data).

In [None]:
# example-> df['col_name'].hasnans

df['StudentID'].hasnans

# False means there are no NaN values

False

In [None]:
df['Python Marks'].hasnans

# True means this column has at least one NaN value

True
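
Because `hasnans` exists on a Series only, a whole-DataFrame check needs a different call. The sketch below, on a small hypothetical DataFrame, pairs `hasnans` with `isnull().any()`, which gives one Boolean per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
marks = pd.DataFrame({
    "StudentID": ["PH1001", "PH1002", "PH1003"],
    "Python Marks": [88.0, np.nan, 85.0],
})

# hasnans is a Series attribute (no parentheses)
print(marks["Python Marks"].hasnans)   # True
print(marks["StudentID"].hasnans)      # False

# Whole-DataFrame check: one Boolean per column
has_nan_by_column = marks.isnull().any()
print(has_nan_by_column)
```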

#### Quiz

1. What will be the shape of the DataFrame created from the given data?

In [10]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age" : [25, 30, 35],
    "City": ["New York", "london", "Paris"] 
}

df = pd.DataFrame(data)

In [12]:
df.shape

(3, 3)

2. What does df.loc[df['City'].str.contains("New")] do?
ans: Selects the rows where City contains "New"

3. What does np.where(df['Data Structure Marks']>90, 'A+' , 'A') do?

ans: Returns "A+" where marks > 90, and "A" otherwise
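
A quick check of this answer, using hypothetical marks. Note that a mark of exactly 90 and a missing mark both fail the `> 90` test, so both receive "A".

```python
import numpy as np
import pandas as pd

# Hypothetical marks, including a boundary value and a missing value
marks = pd.DataFrame({"Data Structure Marks": [85.0, 92.0, 90.0, np.nan]})

# Element-wise choice: "A+" where the condition is True, "A" elsewhere
grades = np.where(marks["Data Structure Marks"] > 90, "A+", "A")
print(grades)   # ['A' 'A+' 'A' 'A']
```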


4. What type of values will df['Passed in DS'] = df['Data Structure Marks'] > 70 create?

ans: Boolean Values True/False


5. What is the difference between df['Column'].unique() and df['Column'].nunique()?

ans: unique() returns an array of the unique values; nunique() returns their count

6. What does df.dropna() do by default?

ans: Removes rows with any NaN values
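
A minimal sketch of the default behaviour, on a small hypothetical DataFrame. `dropna()` returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain one NaN
marks = pd.DataFrame({
    "StudentID": ["PH1001", "PH1002", "PH1003"],
    "Python Marks": [88.0, np.nan, 85.0],
    "Algorithm Marks": [85.0, 92.0, np.nan],
})

# Defaults are axis=0 and how="any": drop every row with at least one NaN
cleaned = marks.dropna()

print(cleaned.shape)   # (1, 3)
print(marks.shape)     # (3, 3) -> original is untouched
```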

7. What is the purpose of the axis parameter in df.apply(function, axis=1)?

ans: It sets the direction in which the function is applied.

axis=0 -> applies the function to each column

axis=1 -> applies the function to each row
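
The two directions can be compared side by side. The sketch below uses hypothetical marks and two simple lambda functions chosen only for illustration.

```python
import pandas as pd

# Hypothetical marks for three students
marks = pd.DataFrame({
    "Data Structure Marks": [85.0, 92.0, 88.0],
    "Python Marks": [88.0, 70.0, 85.0],
})

# axis=0: the function receives each column -> one result per column
col_max = marks.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row -> one result per row
row_mean = marks.apply(lambda row: row.mean(), axis=1)

print(col_max.tolist())    # [92.0, 88.0]
print(row_mean.tolist())   # [86.5, 81.0, 86.5]
```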



8. What does df['EnrollmentDate'].dt.year extract?

ans: The year component of each datetime value
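
The `.dt` accessor works only on a datetime column, and `read_csv` loads dates as plain strings unless asked to parse them. The sketch below assumes hypothetical date strings and converts them with `pd.to_datetime` first.

```python
import pandas as pd

# Hypothetical enrollment dates stored as strings
enroll = pd.DataFrame({"EnrollmentDate": ["2024-01-15", "2024-01-20", "2024-02-10"]})

# Convert to datetime before using the .dt accessor
enroll["EnrollmentDate"] = pd.to_datetime(enroll["EnrollmentDate"])

years = enroll["EnrollmentDate"].dt.year
print(years.tolist())   # [2024, 2024, 2024]
```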



9. What is the purpose of df['Python Marks'].fillna(df['Python Marks'].mean(), inplace=True)?

ans: Fills the missing Python Marks values with the column mean
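
A sketch of the same fill on hypothetical data. Recent pandas versions may warn when `inplace=True` is used on a selected column, so this version assigns the result back instead; treat that as one common alternative, not the only correct form.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing value
marks = pd.DataFrame({"Python Marks": [88.0, np.nan, 85.0]})

# mean() skips NaN: (88 + 85) / 2 = 86.5
mean_mark = marks["Python Marks"].mean()

# Assign the filled column back rather than using inplace=True
marks["Python Marks"] = marks["Python Marks"].fillna(mean_mark)

print(marks["Python Marks"].tolist())   # [88.0, 86.5, 85.0]
```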