In [None]:
# Working with Missing Data in Pandas
# Missing data occurs when no information is provided for one or more
# items or for a whole unit. It is a common problem in real-world datasets.
# In pandas, missing values are also referred to as NA (Not Available) values.
# Many datasets arrive with missing data, either because the information
# exists but was not collected or because it never existed.
# For example, some users in a survey may choose not to share their income,
# and others may choose not to share their address, so those values are missing.

# Dataset : https://raw.githubusercontent.com/yashy1626/ds_dataset/refs/heads/main/ufo.csv
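To make the survey example above concrete, here is a minimal sketch of how skipped answers show up in pandas. The `survey` frame and its values are hypothetical and are not part of the UFO dataset; both `np.nan` and `None` are treated as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses: np.nan and None both mark a skipped answer.
survey = pd.DataFrame({
    "user": ["a", "b", "c"],
    "income": [52000.0, np.nan, 61000.0],
    "address": ["12 Elm St", None, None],
})

# isna() returns a boolean mask that is True wherever a value is missing.
missing_mask = survey.isna()
print(missing_mask)
```

`isnull()`, used later in this notebook, is an alias of `isna()`.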

In [None]:
import pandas as pd
ufo = pd.read_csv('https://raw.githubusercontent.com/yashy1626/ds_dataset/refs/heads/main/ufo.csv')
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [None]:
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18241 entries, 0 to 18240
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   City             18215 non-null  object
 1   Colors Reported  2882 non-null   object
 2   Shape Reported   15597 non-null  object
 3   State            18241 non-null  object
 4   Time             18241 non-null  object
dtypes: object(5)
memory usage: 712.7+ KB


In [None]:
# check for missing values
ufo.isnull().sum()

Unnamed: 0,0
City,26
Colors Reported,15359
Shape Reported,2644
State,0
Time,0
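Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small stand-in frame with made-up values (not the real `ufo` data) and assumes that taking the mean of the boolean mask is an acceptable way to get the fraction missing per column.

```python
import pandas as pd

# Small stand-in frame with hypothetical values, not the real ufo data.
df = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke", "Abilene"],
    "Colors Reported": [None, None, "RED", None],
    "State": ["NY", "NJ", "CO", "KS"],
})

# sum() counts the True values in the mask; mean() gives their fraction.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```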


In [None]:
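# handle missing values
# dropna() returns a new DataFrame without the rows that contain any missing value;
# ufo itself is not modified
ufo.dropna()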

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00
36,Portsmouth,RED,FORMATION,VA,7/10/1945 1:30
44,Blairsden,GREEN,SPHERE,CA,6/30/1946 19:00
82,San Jose,BLUE,CHEVRON,CA,7/15/1947 21:00
...,...,...,...,...,...
18213,Pasadena,GREEN,FIREBALL,CA,12/28/2000 19:10
18216,Garden Grove,ORANGE,LIGHT,CA,12/29/2000 16:10
18220,Shasta Lake,BLUE,DISK,CA,12/29/2000 20:30
18233,Anchorage,RED,VARIOUS,AK,12/31/2000 21:00
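Dropping every row that has any missing value can discard most of a dataset, as the reduced output above suggests. A minimal sketch of the more selective `dropna()` options (`how`, `subset`, `thresh`) follows, using a small hypothetical frame rather than the real `ufo` data.

```python
import pandas as pd

# Hypothetical stand-in rows, not the real ufo data.
df = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke", None],
    "Colors Reported": [None, None, "RED", None],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL", None],
})

# Default: drop rows that have at least one missing value.
any_missing = df.dropna()

# Drop only the rows in which every value is missing.
all_missing = df.dropna(how="all")

# Look at the City column only when deciding which rows to drop.
city_known = df.dropna(subset=["City"])

# Keep rows that have at least two non-missing values.
at_least_two = df.dropna(thresh=2)

print(len(df), len(any_missing), len(all_missing), len(city_known), len(at_least_two))
```

Each call returns a new DataFrame, so `df` keeps all four rows.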


In [None]:
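# fillna()
# fills NA/NaN values and returns a new Series;
# the result is not assigned here, so ufo is unchanged and City still has 26 missing values

ufo['City'].fillna(value='CityNotKnown')
ufo.isnull().sum()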

Unnamed: 0,0
City,26
Colors Reported,15359
Shape Reported,2644
State,0
Time,0


In [None]:
# fillna()
# assign the filled Series back to the column to update ufo.
# Calling ufo['City'].fillna(..., inplace=True) acts on an intermediate copy:
# it raises a warning in current pandas and will no longer work in pandas 3.0.

ufo['City'] = ufo['City'].fillna(value='CityNotKnown')
ufo.isnull().sum()


Unnamed: 0,0
City,0
Colors Reported,15359
Shape Reported,2644
State,0
Time,0
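`fillna()` returns a new object, which is why the missing count for City only changes once the result is written back to `ufo`. The sketch below shows two ways to keep the filled values, using a hypothetical frame; the `'ShapeNotKnown'` placeholder is an illustrative assumption.

```python
import pandas as pd

# Hypothetical stand-in rows, not the real ufo data.
df = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke"],
    "Shape Reported": ["TRIANGLE", None, None],
})

# Without assignment the filled Series is discarded and df is unchanged.
df["City"].fillna("CityNotKnown")
unchanged_missing = int(df["City"].isna().sum())

# Option 1: assign the filled column back to the DataFrame.
df["City"] = df["City"].fillna("CityNotKnown")

# Option 2: pass a {column: value} dict to DataFrame.fillna(),
# which returns a new DataFrame with only those columns filled.
filled = df.fillna({"Shape Reported": "ShapeNotKnown"})

print(filled)
```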


In [None]:
# Load Another Dataset
data = pd.read_csv('https://raw.githubusercontent.com/yashy1626/ds_dataset/refs/heads/main/sample11.csv')
data

Unnamed: 0,Id,Name,Marks,Percentage
0,1,Alex,78.0,78.0
1,2,Alex,23.0,
2,3,Alex,,67.0
3,4,Alex,12.0,
4,5,Alex,,
5,6,Alex,54.0,
6,7,Alex,65.0,66.0


In [None]:
# interpolate()
'''
The pandas interpolate() method fills NaN values in a DataFrame or Series
with values estimated by an interpolation technique, rather than with a
hard-coded constant.
Interpolation estimates unknown data points that lie between two known
data points.
'''
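As a minimal sketch of the idea described above, the example below interpolates a Series whose values mirror the Marks column shown earlier. It assumes the default linear method, which fills each gap with a value on the straight line between its known neighbours.

```python
import numpy as np
import pandas as pd

# Values mirror the Marks column of the sample dataset shown above.
marks = pd.Series([78.0, 23.0, np.nan, 12.0, np.nan, 54.0, 65.0])

# Default method is linear: each NaN is replaced by a value on the
# straight line between the nearest known points on either side.
filled = marks.interpolate()

print(filled)
```

The gap between 23 and 12 is filled with 17.5, and the gap between 12 and 54 with 33.0.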