# Data Wrangling: converting data from its initial format into one better suited to analysis

Data wrangling covers operations such as merging, grouping, and concatenating data, either to analyse it directly or to prepare it for use with another data set. In Python, the pandas library provides these operations for tabular data sets.
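
As a rough illustration of the three operations named above, here is a minimal sketch on toy data. The frames `cases`, `regions`, and `more_cases` and all their values are made up for this example and are not taken from `covid_19_india.csv`.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
cases = pd.DataFrame({
    "State": ["Kerala", "Kerala", "Delhi"],
    "Confirmed": [1, 2, 4],
})
regions = pd.DataFrame({
    "State": ["Kerala", "Delhi"],
    "Region": ["South", "North"],
})

# Grouping: total confirmed cases per state
totals = cases.groupby("State", as_index=False)["Confirmed"].sum()

# Merging: attach the region of each state to the totals
merged = totals.merge(regions, on="State", how="left")

# Concatenating: stack two frames that share the same columns
more_cases = pd.DataFrame({"State": ["Goa"], "Confirmed": [3]})
combined = pd.concat([cases, more_cases], ignore_index=True)

print(merged)
print(combined)
```

Each call returns a new DataFrame, so the inputs stay available for further steps.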

In [1]:
import pandas as pd     

In [2]:
dataset_Covid19 = pd.read_csv("covid_19_india.csv")

The <b>info()</b> method prints a summary of a DataFrame: the index dtype, the column dtypes, the non-null counts, and the memory usage.

In [3]:
dataset_Covid19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3675 entries, 0 to 3674
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Sno                       3675 non-null   int64 
 1   Date                      3675 non-null   object
 2   Time                      3675 non-null   object
 3   State/UnionTerritory      3675 non-null   object
 4   ConfirmedIndianNational   3675 non-null   object
 5   ConfirmedForeignNational  3675 non-null   object
 6   Cured                     3675 non-null   int64 
 7   Deaths                    3675 non-null   int64 
 8   Confirmed                 3675 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 258.5+ KB


Use the method <b>head()</b> to display the first five rows of the dataframe.

In [4]:
dataset_Covid19.head()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,30/01/20,6:00 PM,Kerala,1,0,0,0,1
1,2,31/01/20,6:00 PM,Kerala,1,0,0,0,1
2,3,01/02/20,6:00 PM,Kerala,2,0,0,0,2
3,4,02/02/20,6:00 PM,Kerala,3,0,0,0,3
4,5,03/02/20,6:00 PM,Kerala,3,0,0,0,3


In [5]:
missing_data = dataset_Covid19.isnull()


In the result of <B>isnull()</B>, <B>"True"</B> marks a <B>missing value</B>, while <B>"False"</B> marks a value that is <B>present</B>.

In [6]:
missing_data.head(5)

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False


### Count missing values in each column
Using a for loop, we can count the missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the loop, the method ".value_counts()" counts how many times each of these two values occurs in the column. In the output below every column reports only "False", so this dataset has no missing values.

In [7]:
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")

Sno
False    3675
Name: Sno, dtype: int64

Date
False    3675
Name: Date, dtype: int64

Time
False    3675
Name: Time, dtype: int64

State/UnionTerritory
False    3675
Name: State/UnionTerritory, dtype: int64

ConfirmedIndianNational
False    3675
Name: ConfirmedIndianNational, dtype: int64

ConfirmedForeignNational
False    3675
Name: ConfirmedForeignNational, dtype: int64

Cured
False    3675
Name: Cured, dtype: int64

Deaths
False    3675
Name: Deaths, dtype: int64

Confirmed
False    3675
Name: Confirmed, dtype: int64
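
The same per-column count can usually be obtained without a loop, because `True` sums as 1 and `False` as 0. The sketch below uses a small made-up frame named `sample` in place of `dataset_Covid19`, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for dataset_Covid19
sample = pd.DataFrame({
    "Cured": [0, 1, np.nan],
    "Deaths": [0, np.nan, np.nan],
    "Confirmed": [1, 2, 3],
})

# isnull() gives a True/False frame; sum() adds up the True values per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Applied to the real data, `dataset_Covid19.isnull().sum()` should report 0 for every column, in line with the loop output above.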



In [8]:
import numpy as np
# Dictionary of lists, one list per state
# C: Confirmed | A: Active | R: Recovered | D: Deceased
IndianStates_Status = {'Maharastra_C_A_R_D': ['1,59,133', '67,600', np.nan, '7,273'],
                       'Delhi_C_A_R_D': ['80,188', '28,329', '49,301', np.nan],
                       'TamilNadu_C_A_R_D': [np.nan, '78,335', '33,216', '44,094']}

In [9]:
# creating a dataframe using dictionary 
df = pd.DataFrame(IndianStates_Status) 
df

Unnamed: 0,Maharastra_C_A_R_D,Delhi_C_A_R_D,TamilNadu_C_A_R_D
0,159133.0,80188.0,
1,67600.0,28329.0,78335.0
2,,49301.0,33216.0
3,7273.0,,44094.0
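
Because the dictionary stores the counts as strings with digit-grouping commas (for example `'80,188'`), arithmetic such as a column mean needs them converted to numbers first. One possible way, sketched here on a single column copied from the dictionary above, is to strip the commas and then call `pd.to_numeric`. The names `raw` and `numeric` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# One column copied from IndianStates_Status, kept as comma-grouped strings
raw = pd.DataFrame({"Delhi_C_A_R_D": ["80,188", "28,329", "49,301", np.nan]})

# Remove the grouping commas, then convert; the missing entry stays NaN
without_commas = raw["Delhi_C_A_R_D"].str.replace(",", "", regex=False)
numeric = pd.to_numeric(without_commas)

print(numeric)
```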


<B>NaN</B>, standing for <B>not a number</B>, is a special floating-point value that pandas uses to mark entries that are <B>missing, undefined, or unrepresentable</B>.
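
A short sketch of how NaN behaves in practice follows. It never compares equal to anything, including itself, so a dedicated check such as `pd.isna` is needed. The variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is an ordinary Python float, yet it is not equal to itself
is_float = isinstance(value, float)
equals_itself = (value == value)

# Use pd.isna (or np.isnan) to test for NaN instead of ==
detected = pd.isna(value)

# A single NaN makes an otherwise integer list be stored as float64
series = pd.Series([1, 2, np.nan])

print(is_float, equals_itself, detected, series.dtype)
```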


In [10]:
# notnull() returns True where a value is present and False where it is missing
df.notnull() 

Unnamed: 0,Maharastra_C_A_R_D,Delhi_C_A_R_D,TamilNadu_C_A_R_D
0,True,True,False
1,True,True,True
2,False,True,True
3,True,False,True


In [11]:
# Drop whole rows by index label (here rows 0 and 1)
df.drop([0, 1])

Unnamed: 0,Maharastra_C_A_R_D,Delhi_C_A_R_D,TamilNadu_C_A_R_D
2,,49301.0,33216
3,7273.0,,44094


In [12]:
# Drop a column by its label (axis=1 selects columns)
df.drop(['TamilNadu_C_A_R_D'], axis=1)

Unnamed: 0,Maharastra_C_A_R_D,Delhi_C_A_R_D
0,159133.0,80188.0
1,67600.0,28329.0
2,,49301.0
3,7273.0,


In [13]:
df

Unnamed: 0,Maharastra_C_A_R_D,Delhi_C_A_R_D,TamilNadu_C_A_R_D
0,159133.0,80188.0,
1,67600.0,28329.0,78335.0
2,,49301.0,33216.0
3,7273.0,,44094.0
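
The output above shows `df` unchanged because `drop` returns a new DataFrame rather than modifying the original. The sketch below illustrates this on a made-up frame `df_demo`, and also shows `dropna`, which removes rows containing NaN without listing them by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, invented for illustration only
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, 5.0, np.nan],
})

# drop returns a new DataFrame; df_demo itself keeps both columns
without_b = df_demo.drop(columns=["B"])

# The same holds for rows: the result must be assigned to a name to be kept
rows_dropped = df_demo.drop(index=[0])

# dropna removes every row that contains at least one NaN
complete_rows = df_demo.dropna()

print(df_demo.shape, without_b.shape, rows_dropped.shape, complete_rows.shape)
```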


In [14]:
# Dictionary of lists: one list of counts per day
# (named status_by_day to avoid shadowing the built-in dict)
status_by_day = {'First Day': [100, 90, np.nan, 95],
                 'Second Day': [30, 45, 56, np.nan],
                 'Third Day': [np.nan, 40, 80, 98]}

# Create a DataFrame from the dictionary
Covid_Status_Daywise = pd.DataFrame(status_by_day)

In [15]:
Covid_Status_Daywise

Unnamed: 0,First Day,Second Day,Third Day
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0


In [16]:
# Compute the mean of the "First Day" column (NaN values are skipped by default)
avg_1 = Covid_Status_Daywise["First Day"].astype("float").mean(axis=0)

In [17]:
avg_1 

95.0

Replace <b>"NaN"</b> with the mean value in the <b>"First Day"</b> column

In [18]:
Covid_Status_Daywise["First Day"] = Covid_Status_Daywise["First Day"].replace(np.nan, avg_1)

In [19]:
Covid_Status_Daywise["First Day"]

0    100.0
1     90.0
2     95.0
3     95.0
Name: First Day, dtype: float64
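
The same idea can be applied to every column at once. As a sketch, `fillna` accepts the Series of column means and fills each column with its own mean. The frame is rebuilt here under the name `daywise` so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as Covid_Status_Daywise, rebuilt so the example is self-contained
daywise = pd.DataFrame({
    "First Day": [100, 90, np.nan, 95],
    "Second Day": [30, 45, 56, np.nan],
    "Third Day": [np.nan, 40, 80, 98],
})

# mean() skips NaN by default; fillna matches each mean to its column by name
filled = daywise.fillna(daywise.mean())

print(filled)
```

Whether the mean is a sensible replacement depends on the data. For counts that grow over time, another strategy may suit better.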