# Data Wrangling
This notebook explains data wrangling using Python.

## What is Data Wrangling?

![](01.png)

![](02.png)

![](03.png)

## Importing Libraries and Dataset

In [4]:
# Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Importing Dataset
kashti = sns.load_dataset('titanic')

In [6]:
kashti.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Mathematical Operations

In [7]:
# Simple operations (math operators)
# Apply simple arithmetic to numeric columns

(kashti['age'] * 6).head(10)

0    132.0
1    228.0
2    156.0
3    210.0
4    210.0
5      NaN
6    324.0
7     12.0
8    162.0
9     84.0
Name: age, dtype: float64
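
Row 5 in the output above is NaN because that passenger's age is missing, and arithmetic on a missing value stays missing. A minimal sketch on a hand-made Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative ages; the NaN stands in for a missing entry.
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="age")

# Element-wise multiplication: a NaN input gives a NaN output.
scaled = ages * 6

# Aggregations such as mean() skip NaN by default (skipna=True).
mean_scaled = scaled.mean()

print(scaled)
print(mean_scaled)
```

So the operation itself does not fail on missing data, but the gaps are carried along and have to be handled separately.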

## Dealing with missing values

- In a dataset, missing values appear as N/A, NaN, 0, or a blank cell.
- A value is missing whenever a row has no data for one of its parameters (columns).

> __Steps:__
1. Try to collect the data again, or check whether a mistake was made somewhere.
2. Drop the variable (column) that has the missing values if doing so does not affect the analysis, or simply remove the affected row (data entry).
3. Replace the missing values:
    1.  How?
        1.  Average value of the entire variable or of similar data points
        2.  Frequency or MODE replacement
        3.  Replace based on other functions (the data sampler knows which)
        4.  An ML algorithm can also be used
        5.  Leave it as it is
    2.  Why?
        1.  It's better because no data is lost
        2.  It is less accurate, since the replaced values are estimates
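
The dropping and replacing options above can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame (the column names mirror the Titanic data, the values are made up), not the only way to do it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": ["C", None, None, "C"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Step 2, option a: drop a column that has many gaps.
without_deck = df.drop(columns=["deck"])

# Step 2, option b: drop only the rows that contain a missing value.
complete_rows = df.dropna()

# Step 3: replace instead of dropping, so no rows are lost.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # mean for numeric data
filled["deck"] = filled["deck"].fillna(filled["deck"].mode()[0])  # mode for categories

print(without_deck.shape, complete_rows.shape, filled.shape)
```

Dropping is simple but shrinks the data; replacing keeps every row at the cost of introducing estimated values.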

In [8]:
# Where exactly are the missing values?
kashti.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [10]:
# Use the dropna method

print(kashti.shape)
# This removes every row whose 'deck' value is missing (the column itself stays)
# inplace=True modifies the DataFrame directly
kashti.dropna(subset=['deck'], axis=0, inplace=True)

(891, 15)


In [11]:
# After dropping rows with a missing 'deck', check the missing values again
kashti.isnull().sum()

survived        0
pclass          0
sex             0
age            19
sibsp           0
parch           0
fare            0
embarked        2
class           0
who             0
adult_male      0
deck            0
embark_town     2
alive           0
alone           0
dtype: int64
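
Note that `dropna(subset=['deck'], axis=0)` removes rows, not the column, which is why `deck` still appears in the output above (with 0 missing values) while the missing count for `age` falls from 177 to 19. A small sketch contrasting the two operations on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data; only the column names follow the Titanic dataset.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "deck": ["C", None, "E"],
})

# Removes every ROW whose 'deck' is missing; the column itself stays.
rows_dropped = df.dropna(subset=["deck"])

# Removes the 'deck' COLUMN; all rows stay.
column_dropped = df.drop(columns=["deck"])

print(rows_dropped.shape)
print(column_dropped.shape)
```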

In [12]:
# Update the main DataFrame
kashti = kashti.dropna()    # Removes every row that contains a NaN value
kashti.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [13]:
# After dropping all NaN values, let's see how much data is left
kashti.shape

(182, 15)
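
Only 182 of the original 891 rows remain, roughly 20% of the data. Before dropping, it can help to check how much each column contributes to that loss. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical data where one column has far more gaps than the other.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "deck": [None, "C", None, None, "E"],
})

# Share of missing values per column (the mean of a boolean mask is a fraction).
missing_share = df.isnull().mean()

# Fraction of rows that would survive a blanket dropna().
retained = len(df.dropna()) / len(df)

print(missing_share)
print(retained)
```

When one column accounts for most of the missing values, dropping or filling that column alone usually preserves far more rows than dropping every incomplete row.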

## Replacing Missing values with __mean__ of that column

In [36]:
ks1 = sns.load_dataset('titanic')
ks1.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [37]:
# finding an average (mean)
mean_age = ks1['age'].mean()
mean_age

29.69911764705882

In [38]:
# Replace NaN values in the 'age' column with the mean of that column (and update the column)
ks1['age'] = ks1['age'].replace(np.nan, mean_age)

In [39]:
ks1.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

- The missing values in the __age__ column have been replaced.
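
An equivalent and arguably more direct way to fill gaps is `fillna`, which targets missing values without naming `np.nan` explicitly. The sketch below uses a made-up Series; the median variant is an addition for comparison, not something used in this notebook:

```python
import numpy as np
import pandas as pd

# Made-up ages with one gap.
ages = pd.Series([22.0, np.nan, 26.0, 36.0], name="age")

# Fill the gap with the column mean.
mean_filled = ages.fillna(ages.mean())

# The median is less sensitive to extreme values than the mean.
median_filled = ages.fillna(ages.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```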

In [26]:
# Similarly, replacing NaN values in the 'deck' column
# (left commented out: 'deck' is categorical, so .mean() does not apply to it)
#mean_deck = ks1['deck'].mean()
#mean_deck

# Replacing NaN values in 'deck' column with mean of that column (updating as well)
#ks1['deck'] = ks1['deck'].replace(np.nan, mean_deck)

#ks1.isnull().sum()
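
A mean is not defined for a categorical column such as `deck`, which is likely why the cell above is commented out. The mode replacement listed earlier is one alternative. A minimal sketch, assuming a small made-up categorical Series in place of the real column:

```python
import pandas as pd

# Illustrative categorical column with gaps, mimicking 'deck'.
deck = pd.Series(["C", None, "E", "C", None], dtype="category", name="deck")

# mode() returns a Series (there can be ties), so take the first entry.
most_common = deck.mode()[0]

# Fill the gaps with the most frequent category.
deck_filled = deck.fillna(most_common)

print(most_common)
print(deck_filled.tolist())
```

With 688 of 891 `deck` values missing, filling with the mode would make most of the column a guess, so dropping the column may be the more honest choice here.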

## Data Formatting

- Bringing the data to a common standard
- Ensures data is consistent and understandable
  - Easy to gather
  - Easy to work with
  - Data should be in a single unit

In [28]:
# Check the current data types, then convert them where needed
kashti.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [31]:
# Use astype to convert a data type (type casting)
kashti['survived'] = kashti['survived'].astype('float64')
kashti.dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kashti['survived'] = kashti['survived'].astype('float64')


survived        float64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [32]:
# Use astype again to convert the column back to int64
kashti['survived'] = kashti['survived'].astype('int64')
kashti.dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kashti['survived'] = kashti['survived'].astype('int64')


survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object
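
The warning in the outputs above typically appears when pandas cannot tell whether a DataFrame is an independent object or a view of another one. One common way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` before assigning to columns:

```python
import pandas as pd

# Made-up data; only the column names follow the Titanic dataset.
df = pd.DataFrame({"survived": [0, 1, 1], "fare": [7.25, 71.28, 7.92]})

# An explicit copy makes 'subset' independent of 'df'.
subset = df[df["fare"] > 7.5].copy()

# Assigning to the copy changes 'subset' only.
subset["survived"] = subset["survived"].astype("float64")

print(subset.dtypes)
print(df.dtypes)
```

The original `df` keeps its `int64` column, and the assignment on `subset` is unambiguous.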

In [40]:
# Here we convert the age into days instead of years
ks1['age'] = ks1['age'] * 365
ks1.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,8030.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,13870.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,9490.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,12775.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,12775.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,10840.177941,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,19710.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,730.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,9855.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,5110.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


In [41]:
ks1['age'] = ks1['age'].astype('int64')
ks1.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,8030,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,13870,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,9490,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,12775,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,12775,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [42]:
# Always rename the column afterwards so its name reflects the new unit
ks1.rename(columns={'age': 'age in days'}, inplace=True)
ks1.head()

Unnamed: 0,survived,pclass,sex,age in days,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,8030,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,13870,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,9490,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,12775,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,12775,0,0,8.05,S,Third,man,True,,Southampton,no,True
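
The `astype('int64')` cast above succeeds because the missing ages had already been replaced with the mean. If NaN values are still present, the same cast raises an error. A sketch of that case on made-up values, using pandas' nullable `Int64` dtype as one possible workaround:

```python
import numpy as np
import pandas as pd

# Made-up ages in years with one gap, converted to days.
age_days = pd.Series([22.0, np.nan, 26.0]) * 365

try:
    age_days.astype("int64")
    cast_failed = False
except ValueError:
    # NaN has no representation in NumPy's int64.
    cast_failed = True

# The nullable integer dtype (capital "I") keeps the gap as <NA>.
age_days_nullable = age_days.round().astype("Int64")

print(cast_failed)
print(age_days_nullable)
```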


## Data Normalization

- Brings the data onto a uniform scale
- So that all variables have the same impact
- Analogy: one fish in the ocean and one fish in a jar
- Also for computational reasons

In [44]:
ks1.head()

Unnamed: 0,survived,pclass,sex,age in days,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,8030,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,13870,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,9490,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,12775,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,12775,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [45]:
ks4 = ks1[['age in days', 'fare']]
ks4.head()

Unnamed: 0,age in days,fare
0,8030,7.25
1,13870,71.2833
2,9490,7.925
3,12775,53.1
4,12775,8.05


- The two columns above span very different ranges, which makes them hard to compare, so we need to normalize.
- Normalization rescales the values to a common range, typically 0 to 1, so that both variables have a similar influence on our models.

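## Methods for Normalization

1.  Simple Feature Scaling
    1.  x(new) = x(old) / x(max)
2.  Min-Max method
    1.  x(new) = (x(old) - x(min)) / (x(max) - x(min))
3.  Z-Score (standard score), values typically fall between -3 and +3
    1.  x(new) = (x(old) - mean) / standard deviation
4.  Log Transformation
    1.  x(new) = log(x(old))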
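
The first three methods can be compared side by side on a small made-up Series. This is only a sketch; the variable names are illustrative and each result is computed from the raw values:

```python
import pandas as pd

# Made-up fares with a non-zero minimum, so the methods give different results.
fare = pd.Series([10.0, 20.0, 30.0, 50.0], name="fare")

# Simple feature scaling: the largest value becomes 1.
simple = fare / fare.max()

# Min-max: the smallest value becomes 0 and the largest becomes 1.
min_max = (fare - fare.min()) / (fare.max() - fare.min())

# Z-score: the result has mean 0 and standard deviation 1.
z_score = (fare - fare.mean()) / fare.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Simple scaling and min-max only coincide when the minimum of the column is 0.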

### Simple Feature Scaling

In [46]:
# simple feature scaling
ks4['fare'] = ks4['fare'] / ks4['fare'].max()
ks4['age in days'] = ks4['age in days'] / ks4['age in days'].max()
ks4.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ks4['fare'] = ks4['fare'] / ks4['fare'].max()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ks4['age in days'] = ks4['age in days'] / ks4['age in days'].max()


Unnamed: 0,age in days,fare
0,0.275,0.014151
1,0.475,0.139136
2,0.325,0.015469
3,0.4375,0.103644
4,0.4375,0.015713


### Min - Max Method

In [49]:
# min - max method

ks4['fare'] = (ks4['fare'] - ks4['fare'].min()) / (ks4['fare'].max() - ks4['fare'].min())
ks4.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ks4['fare'] = (ks4['fare'] - ks4['fare'].min()) / (ks4['fare'].max() - ks4['fare'].min())


Unnamed: 0,age in days,fare
0,0.275,0.014151
1,0.475,0.139136
2,0.325,0.015469
3,0.4375,0.103644
4,0.4375,0.015713
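
The output above is identical to the simple-scaling result because `fare` had already been divided by its maximum and its minimum is 0, so min-max changes nothing. To compare methods, it can help to compute each one from the raw column into a new column instead of overwriting. A sketch with hypothetical column names `fare_simple` and `fare_minmax`:

```python
import pandas as pd

# Made-up fares whose minimum is 0, as in the Titanic data.
df = pd.DataFrame({"fare": [0.0, 7.25, 53.1, 512.0]})

# Each method reads the raw 'fare' column and writes to its own column.
df["fare_simple"] = df["fare"] / df["fare"].max()
df["fare_minmax"] = (df["fare"] - df["fare"].min()) / (df["fare"].max() - df["fare"].min())

# With a minimum of 0, both methods give the same numbers.
same = df["fare_simple"].equals(df["fare_minmax"])

print(df)
print(same)
```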


### Z-Score Method (Standard Score)

In [50]:
# z-score method (standard score)

ks4['fare'] = (ks4['fare'] - ks4['fare'].mean()) / ks4['fare'].std()
ks4.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ks4['fare'] = (ks4['fare'] - ks4['fare'].mean()) / ks4['fare'].std()


Unnamed: 0,age in days,fare
0,0.275,-0.502163
1,0.475,0.786404
2,0.325,-0.48858
3,0.4375,0.420494
4,0.4375,-0.486064


### Log Transformation Method

In [52]:
# log transformation method
# Note: np.log returns -inf for a fare of 0 and emits a RuntimeWarning
ks = sns.load_dataset('titanic')

ks['fare'] = np.log(ks['fare'])
ks.head()

  result = getattr(ufunc, method)(*inputs, **kwargs)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,1.981001,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,4.266662,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,2.070022,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,3.972177,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,2.085672,S,Third,man,True,,Southampton,no,True
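
Some fares in this dataset are 0, and the logarithm of 0 is negative infinity, which is what triggers the warning above. One common workaround, sketched here on made-up values, is `np.log1p`, which computes log(1 + x). Whether shifting by 1 is acceptable depends on the analysis:

```python
import numpy as np
import pandas as pd

# Made-up fares including a zero.
fare = pd.Series([0.0, 7.25, 71.2833], name="fare")

# log1p(x) computes log(1 + x), so a fare of 0 maps to 0 instead of -inf.
fare_log = np.log1p(fare)

# expm1 is the inverse and recovers the original values.
fare_back = np.expm1(fare_log)

print(fare_log)
```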


## Binning

- Grouping values into a smaller number of groups (bins)
- Converts numeric values into categories, e.g. children (Bachay), young adults (Jawan), elderly (Boorhay)
- Gives a better understanding of groups
  - low vs mid vs high price

In [None]:
df = sns.load_dataset('titanic')
# Three labels need three bins, i.e. four equally spaced bin edges
bins = np.linspace(df['age'].min(), df['age'].max(), 4)
age_groups = ['Bachay', 'Jawan', 'Boorhay']
df['age'] = pd.cut(df['age'], bins, labels=age_groups, include_lowest=True)
df['age']

# Each age is replaced by the label of the bin it falls into; missing ages stay NaN
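
`pd.cut` needs exactly one more bin edge than there are labels. Equal-width bins are one choice; explicit cut-offs are another. A minimal sketch with hypothetical edges (0, 18, 60, 120) on made-up ages:

```python
import pandas as pd

# Made-up ages.
ages = pd.Series([2, 15, 30, 45, 70], name="age")

# Hypothetical cut-offs; four edges define three bins.
edges = [0, 18, 60, 120]
labels = ["Bachay", "Jawan", "Boorhay"]  # children, young adults, elderly

# Each age is mapped to the label of the interval it falls into.
age_group = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)

# Count how many values landed in each bin.
counts = age_group.value_counts(sort=False)

print(age_group.tolist())
print(counts)
```

Storing the result in a new column, rather than overwriting `age`, keeps the original numbers available for later steps.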

### Converting Categories into dummies
- Numeric 0/1 columns are easy to use for computation
- Example: male / female encoded as 0 / 1

In [56]:
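# get_dummies returns a new DataFrame; the result is not stored here,
# so ks shown by head() is unchanged. To append it, concatenate it with the original.
pd.get_dummies(ks1['sex'])
ks.head()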

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,1.981001,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,4.266662,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,2.070022,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,3.972177,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,2.085672,S,Third,man,True,,Southampton,no,True
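
The output above has no dummy columns because the result of `pd.get_dummies` was never stored. One way to append the dummies to a DataFrame is `pd.concat` along the column axis. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data; only the column names follow the Titanic dataset.
df = pd.DataFrame({"sex": ["male", "female", "female"], "fare": [7.25, 71.28, 7.92]})

# One column per category; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(df["sex"], dtype=int)

# Attach the new columns next to the original ones.
df = pd.concat([df, dummies], axis=1)

print(df)
```

For modelling, one of the two dummy columns is often dropped, since `male` and `female` carry the same information here.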
