# Data Wrangling: converting data from its initial format to a format better suited for analysis

Data wrangling involves operations such as merging, grouping, and concatenating data, either to analyse it directly or to prepare it for use with another data set. In Python, the pandas library provides these operations for tabular data.
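As a minimal sketch of the three operations named above, the following uses small made-up tables (the region names, counts, and the `area_code` column are hypothetical and do not come from the COVID-19 dataset loaded below).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
cases = pd.DataFrame({"region": ["A", "B"], "confirmed": [3, 1]})
more_cases = pd.DataFrame({"region": ["A", "C"], "confirmed": [2, 4]})
lookup = pd.DataFrame({"region": ["A", "B", "C"], "area_code": [10, 20, 30]})

# Concatenating: stack the two tables on top of each other.
combined = pd.concat([cases, more_cases], ignore_index=True)

# Grouping: total confirmed count per region.
totals = combined.groupby("region", as_index=False)["confirmed"].sum()

# Merging: attach the lookup column by matching on "region".
merged = totals.merge(lookup, on="region", how="left")
print(merged)
```

Each call returns a new DataFrame, so the intermediate results stay available for inspection.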

In [4]:
import pandas as pd     

In [8]:
dataset_Covid19= pd.read_csv("covid_19_india.csv") 

The <b>info()</b> method prints a summary of a DataFrame, including the index dtype, column dtypes, non-null counts, and memory usage.

In [14]:
dataset_Covid19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3135 entries, 0 to 3134
Data columns (total 9 columns):
Sno                         3135 non-null int64
Date                        3135 non-null object
Time                        3135 non-null object
State/UnionTerritory        3135 non-null object
ConfirmedIndianNational     3135 non-null object
ConfirmedForeignNational    3135 non-null object
Cured                       3135 non-null int64
Deaths                      3135 non-null int64
Confirmed                   3135 non-null int64
dtypes: int64(4), object(5)
memory usage: 220.5+ KB
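The summary above lists ConfirmedIndianNational and ConfirmedForeignNational as object columns even though they hold counts. One possible cause is a non-numeric placeholder somewhere in the column. The sketch below assumes a "-" placeholder on a small made-up sample (both the placeholder and the sample are assumptions, not taken from the file) and converts the column with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical sample: "-" is an assumed placeholder for a missing count.
sample = pd.DataFrame({"ConfirmedIndianNational": ["1", "2", "-", "3"]})

# errors="coerce" turns any value that cannot be parsed into NaN.
sample["ConfirmedIndianNational"] = pd.to_numeric(
    sample["ConfirmedIndianNational"], errors="coerce"
)

print(sample.dtypes)
print(sample["ConfirmedIndianNational"].isnull().sum())
```

After the conversion the placeholder becomes a missing value, which the techniques in the rest of this notebook can then detect and handle.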


Use the <b>head()</b> method to display the first five rows of the DataFrame.

In [23]:
dataset_Covid19.head()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,30/01/20,6:00 PM,Kerala,1,0,0,0,1
1,2,31/01/20,6:00 PM,Kerala,1,0,0,0,1
2,3,1/2/2020,6:00 PM,Kerala,2,0,0,0,2
3,4,2/2/2020,6:00 PM,Kerala,3,0,0,0,3
4,5,3/2/2020,6:00 PM,Kerala,3,0,0,0,3
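The Date column above mixes two day-first layouts (`30/01/20` and `1/2/2020`) and is stored as text. A minimal sketch of one way to turn such strings into real dates follows; `parse_day_first` is a hypothetical helper written for this example, and it only covers the two layouts visible in the rows shown.

```python
from datetime import datetime

import pandas as pd


def parse_day_first(text):
    """Parse 'dd/mm/yy' or 'd/m/yyyy' strings (hypothetical helper)."""
    for fmt in ("%d/%m/%y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    # Neither layout matched: report the value as missing.
    return pd.NaT


dates = pd.Series(["30/01/20", "31/01/20", "1/2/2020", "2/2/2020"])
parsed = dates.map(parse_day_first)
print(parsed)
```

Trying the two-digit-year layout first matters: a four-digit year leaves unparsed characters behind, which raises `ValueError` and lets the loop fall through to the second layout.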


In [24]:
missing_data = dataset_Covid19.isnull()


In the result of <b>isnull()</b>, <B>True</B> marks a <B>missing value</B>, while <B>False</B> marks a value that is <B>present</B>.

In [25]:
missing_data.head(5)

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False


### Count missing values in each column
Using a for loop in Python, we can quickly find the number of missing values in each column. As mentioned above, True represents a missing value and False means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts how many times True and False each occur in the column; a value that never occurs is not listed, so a column with no missing data shows only a False count.

In [22]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("") 

Sno
False    3135
Name: Sno, dtype: int64

Date
False    3135
Name: Date, dtype: int64

Time
False    3135
Name: Time, dtype: int64

State/UnionTerritory
False    3135
Name: State/UnionTerritory, dtype: int64

ConfirmedIndianNational
False    3135
Name: ConfirmedIndianNational, dtype: int64

ConfirmedForeignNational
False    3135
Name: ConfirmedForeignNational, dtype: int64

Cured
False    3135
Name: Cured, dtype: int64

Deaths
False    3135
Name: Deaths, dtype: int64

Confirmed
False    3135
Name: Confirmed, dtype: int64
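A more compact alternative to the loop is to sum the boolean mask, since True counts as 1 and False as 0. The sketch below uses a small made-up frame (the values are hypothetical) rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in "Cured".
small = pd.DataFrame({"Cured": [0, 1, np.nan], "Deaths": [0, 0, 0]})

# isnull() gives a True/False mask; sum() adds up the True values per column.
missing_per_column = small.isnull().sum()
print(missing_per_column)
```

The result is a Series indexed by column name, so a single column's count can be read with `missing_per_column["Cured"]`.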



In [12]:
import numpy as np
# Dictionary of lists; each list holds one state's counts.
# C: Confirmed | A: Active | R: Recovered | D: Deceased
# Counts are plain integers (no thousands separators) so that pandas builds numeric columns.
IndianStates_Status = {'Maharastra_C_A_R_D': [159133, 67600, np.nan, 7273],
                       'Delhi_C_A_R_D': [80188, 28329, 49301, np.nan],
                       'TamilNadu_C_A_R_D': [np.nan, 78335, 33216, 44094]}

In [15]:
# creating a dataframe using dictionary 
df = pd.DataFrame(IndianStates_Status) 
df

Unnamed: 0,Delhi_C_A_R_D,Maharastra_C_A_R_D,TamilNadu_C_A_R_D
0,80188.0,159133.0,
1,28329.0,67600.0,78335.0
2,49301.0,,33216.0
3,,7273.0,44094.0


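<B>NaN</B>, standing for <B>not a number</B>, is a special floating-point value used to represent a value that is <B>undefined or unrepresentable</B>. pandas uses it to mark missing entries, which is why the columns above are displayed as floats.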
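A short sketch of how NaN behaves in practice: it is an ordinary Python float, it never compares equal to itself, and a numeric column that contains it is stored as float64. The variable names here are chosen only for illustration.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is a float value, not a separate data type.
print(type(value))

# NaN is not equal to anything, including itself, so use pd.isna() to test for it.
print(value == value)
print(pd.isna(value))

# An integer column that contains NaN is stored as float64.
counts = pd.Series([1, 2, np.nan])
print(counts.dtype)
```

Because equality tests fail for NaN, checks such as `column == np.nan` never match; `isnull()` and `notnull()` are the reliable way to locate missing values.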


In [18]:
# notnull() returns True where a value is present and False where it is missing
df.notnull() 

Unnamed: 0,Delhi_C_A_R_D,Maharastra_C_A_R_D,TamilNadu_C_A_R_D
0,True,True,False
1,True,True,True
2,True,False,True
3,False,True,True


In [19]:
#Dropping the whole row using index 
df.drop([0, 1])

Unnamed: 0,Delhi_C_A_R_D,Maharastra_C_A_R_D,TamilNadu_C_A_R_D
2,49301.0,,33216.0
3,,7273.0,44094.0


In [21]:
# Dropping a column by its label (axis=1 selects columns)
df.drop(['TamilNadu_C_A_R_D'], axis=1)

Unnamed: 0,Delhi_C_A_R_D,Maharastra_C_A_R_D
0,80188.0,159133.0
1,28329.0,67600.0
2,49301.0,
3,,7273.0


In [22]:
df

Unnamed: 0,Delhi_C_A_R_D,Maharastra_C_A_R_D,TamilNadu_C_A_R_D
0,80188.0,159133.0,
1,28329.0,67600.0,78335.0
2,49301.0,,33216.0
3,,7273.0,44094.0
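The unchanged output of `df` above shows that `drop()` returns a new DataFrame and leaves the original as it was. The sketch below illustrates this on a small made-up frame (hypothetical values) and also shows `dropna()`, which removes rows or columns based on missing values instead of explicit labels.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in each column.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# drop() returns a new DataFrame; "frame" itself keeps all three rows.
trimmed = frame.drop([0])
print(frame.shape, trimmed.shape)

# dropna() keeps only the rows that have no missing value.
complete_rows = frame.dropna()

# dropna(axis=1) keeps only the columns that have no missing value.
complete_columns = frame.dropna(axis=1)
print(complete_rows.shape, complete_columns.shape)
```

To keep the result of a drop, assign it back to a name, for example `frame = frame.drop([0])`.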


In [42]:
# Dictionary of lists: one list of counts per day
daily_counts = {'First Day': [100, 90, np.nan, 95],
                'Second Day': [30, 45, 56, np.nan],
                'Third Day': [np.nan, 40, 80, 98]}

# Creating a DataFrame from the dictionary
Covid_Status_Daywise = pd.DataFrame(daily_counts)

In [48]:
Covid_Status_Daywise

Unnamed: 0,First Day,Second Day,Third Day
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0


In [49]:
# Compute the mean of the "First Day" column (NaN values are skipped by default)
avg_1 = Covid_Status_Daywise["First Day"].astype("float").mean(axis = 0)

In [50]:
avg_1 

95.0

Replace <b>"NaN" </b>by mean value in <b>"First Day"</b> column

In [51]:
Covid_Status_Daywise["First Day"]=Covid_Status_Daywise["First Day"].replace(np.nan, avg_1)

In [52]:
Covid_Status_Daywise["First Day"]

0    100.0
1     90.0
2     95.0
3     95.0
Name: First Day, dtype: float64
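The same replacement can be applied to every column at once. As a minimal sketch, passing the Series of column means to `fillna()` fills each column's missing entries with that column's own mean; the frame below repeats two of the columns from above under a new, illustrative name.

```python
import numpy as np
import pandas as pd

daywise = pd.DataFrame({"First Day": [100, 90, np.nan, 95],
                        "Second Day": [30, 45, 56, np.nan]})

# mean() skips NaN and returns one mean per column;
# fillna() matches those means to the columns by name.
filled = daywise.fillna(daywise.mean())
print(filled)
```

Mean imputation keeps each column's average unchanged but reduces its spread, so it is worth checking whether that suits the analysis at hand.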